Amazon Web Services experienced two separate outages last year, both reportedly caused by the company’s own artificial intelligence tools, according to reporting that has raised pointed questions about the safety of deploying autonomous AI systems inside critical cloud infrastructure. One of those incidents stretched for roughly 13 hours, disrupting services for businesses and consumers who depend on AWS as the backbone of their online operations. The episodes arrive at an awkward moment for Amazon, which is simultaneously pouring enormous sums into AI development under the direction of CEO Andy Jassy.
AWS Outages Traced to Internal AI Systems
Amazon’s cloud division was hit by two outages caused by AI tools last year, with the longer incident lasting an estimated 13 hours. The failures were not the result of external cyberattacks or routine hardware breakdowns. Instead, internal AI tooling designed to assist with operational tasks reportedly made autonomous changes that cascaded into service disruptions. That Amazon’s own technology disrupted its own infrastructure adds a layer of irony to the company’s aggressive push to sell AI products to enterprise customers worldwide.
Among the internal AI tools Amazon has been developing is one called Kiro, which the Financial Times has identified as part of the company’s broader AI strategy. While the precise technical sequence behind each outage remains unclear from publicly available information, the pattern of AI-initiated operational failures points to gaps in how automated systems interact with access controls and configuration settings inside large-scale cloud environments. The incidents suggest that even the most well-resourced cloud providers can be caught off guard when AI agents operate with insufficient guardrails.
Jassy’s Bet on AI Spending Collides with Reliability Concerns
The outages carry particular weight because of the scale of Amazon’s financial commitment to AI. Andy Jassy has staked the next phase of AWS growth on a reported spending drive to build out AI capabilities across the company’s cloud platform. That investment is meant to position AWS as the default choice for enterprises looking to run AI workloads, train large language models, and deploy intelligent automation at scale. Yet the very tools Amazon is building internally to improve efficiency appear to have introduced new categories of risk that the company had not fully accounted for.
This tension between speed and stability is not unique to Amazon, but the company’s position as the largest cloud provider makes its failures especially consequential. When AWS goes down, the ripple effects touch e-commerce platforms, streaming services, financial applications, and government systems. For Jassy, the strategic calculus is clear: AI spending is supposed to revive AWS revenue growth and fend off competition from Microsoft Azure and Google Cloud. But if that same AI investment produces outages that shake customer confidence, the spending drive could undermine the trust it was designed to reinforce.
Why Agentic AI in Infrastructure Is a Different Kind of Risk
Most prior AWS outages have been traced to human configuration errors, network failures, or capacity limits. The 2025 incidents represent something qualitatively different. When an AI tool autonomously alters settings inside a production environment, the failure mode is harder to predict and harder to contain. Traditional change management processes assume a human operator who can be trained, audited, and held accountable. An AI agent operating at machine speed can propagate a bad decision across thousands of servers before any human reviewer has a chance to intervene.
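To make that difference concrete, here is a minimal Python sketch of the kind of human-in-the-loop gate the paragraph above implies: an AI-proposed change is applied automatically only if its blast radius is small and it does not touch security-sensitive settings. The class names, threshold, and example changes are illustrative assumptions for this article, not a description of AWS’s internal tooling.

```python
# Hypothetical sketch: gating AI-proposed infrastructure changes behind human review.
# Names and thresholds are illustrative; this is not AWS's internal tooling.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProposedChange:
    description: str
    affected_hosts: int            # rough blast radius of the change
    touches_access_controls: bool  # security-sensitive changes always wait for a human

@dataclass
class ChangeGate:
    auto_apply_host_limit: int = 10                  # anything wider waits for a human
    pending_review: List[ProposedChange] = field(default_factory=list)

    def submit(self, change: ProposedChange) -> str:
        # High-blast-radius or security-sensitive changes are queued, not applied.
        if change.touches_access_controls or change.affected_hosts > self.auto_apply_host_limit:
            self.pending_review.append(change)
            return "queued for human approval"
        return apply_change(change)

def apply_change(change: ProposedChange) -> str:
    # Placeholder for the real deployment mechanism.
    return f"applied: {change.description}"

gate = ChangeGate()
print(gate.submit(ProposedChange("rotate one cache node", affected_hosts=1, touches_access_controls=False)))
print(gate.submit(ProposedChange("update fleet-wide IAM policy", affected_hosts=5000, touches_access_controls=True)))
```

The point of such a gate is not to slow every change, but to ensure that the changes most capable of cascading are the ones a human reviews first.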
The broader industry conversation around “agentic AI” (systems that take actions rather than simply generating text or recommendations) has largely focused on customer-facing applications like coding assistants and automated customer service. Amazon’s outages reveal that the risks are just as acute, and possibly more dangerous, when agentic AI is deployed inside the operational core of cloud infrastructure. A chatbot that gives a wrong answer is an inconvenience. An AI tool that misconfigures access controls on a core database can take down services for millions of users. The distinction matters because it suggests that the governance frameworks companies are building for external AI products may not be sufficient for internal operational AI.
What Customers and Competitors Should Take from This
For AWS customers, the incidents raise a practical question: how much visibility do they have into the automated systems that manage the infrastructure their businesses run on? Cloud providers have historically treated their internal tooling as proprietary, offering customers service-level agreements and uptime guarantees rather than transparency into operational processes. If AI tools are now making autonomous decisions that can trigger multi-hour outages, customers may begin demanding more detailed disclosure about how those tools are governed, tested, and constrained. Enterprises with strict compliance requirements in sectors like healthcare, finance, and government may find that existing support arrangements need significant updates to address AI-driven operational risk.
Competitors are likely watching closely. Microsoft and Google have both invested heavily in AI, but neither has publicly reported outages tied to their own internal AI tooling at this scale. That does not mean such incidents have not occurred or will not occur. It does mean that Amazon is the first major cloud provider to face public scrutiny over this specific failure mode, and the way it responds will set expectations for the industry. If Amazon treats these outages as isolated bugs to be patched, it risks repeating them. If it uses them as a reason to build more rigorous human-in-the-loop requirements for operational AI, it could establish a model that other providers eventually adopt.
The Governance Gap That AI Spending Alone Cannot Fix
The core issue exposed by these outages is not that Amazon built bad AI tools. It is that the company appears to have deployed AI agents into high-stakes operational roles without the kind of pre-deployment testing and containment protocols that match the potential blast radius of a failure. Spending large sums on AI development does not automatically produce safe AI deployment. The two are distinct disciplines, and the gap between them is where systemic risk accumulates. Governance in this context means more than internal policies; it requires technical mechanisms such as strict permission boundaries, staged rollouts, and automated rollback systems that assume AI tools will sometimes behave in unexpected ways.
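As a rough illustration of how staged rollouts and automated rollback fit together, the following Python sketch walks a change through progressively larger stages and unwinds it automatically when a health check fails. The stage names and the apply_to_stage, rollback_stage, and check_health functions are hypothetical stand-ins for whatever deployment and monitoring APIs a real provider uses.

```python
# Hypothetical sketch of a staged rollout with automated rollback.
# Stage names and helper functions are stand-ins for real deployment and monitoring APIs.
import random

STAGES = ["canary", "one-availability-zone", "one-region", "global"]

def apply_to_stage(stage: str) -> None:
    print(f"applying change to {stage}")

def rollback_stage(stage: str) -> None:
    print(f"rolling back {stage}")

def check_health(stage: str) -> bool:
    # In a real system this would query error rates, latency, and alarms;
    # here a random draw simulates an occasional failure.
    return random.random() > 0.05

def staged_rollout() -> bool:
    completed = []
    for stage in STAGES:
        apply_to_stage(stage)
        completed.append(stage)
        if not check_health(stage):
            # Any failure unwinds everything applied so far, most recent stage first.
            for done in reversed(completed):
                rollback_stage(done)
            return False
    return True

print("rollout succeeded" if staged_rollout() else "rollout aborted and rolled back")
```

The design assumption baked into a pattern like this is that the AI tool proposing the change will sometimes be wrong, so the system’s job is to make being wrong cheap and reversible.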
Those mechanisms also need to be paired with cultural and organisational changes. Engineers must be empowered to halt AI-driven processes when anomalies appear, even if doing so temporarily slows down automation goals. Boards and executives who are enthusiastic about AI-driven cost savings need to be equally insistent on investment in resilience, redundancy, and incident response. Some of that work is unglamorous compared with headline-grabbing model launches, but the recent outages show how quickly neglected governance can turn into reputational damage. Customers will judge providers not only on how advanced their AI offerings are, but on how transparently they acknowledge failures and how credibly they commit to preventing repeats.
There is also a broader labour dimension to consider. As cloud companies automate more operational work with AI, they may be tempted to shrink or de-skill human operations teams. Yet the AWS incidents suggest that experienced engineers remain essential as a backstop when automated systems go wrong. Rather than replacing human expertise, AI in infrastructure may need to be framed as augmenting it, with clear escalation paths and training that prepares staff to interpret and override machine decisions. For workers across the tech industry, including those searching for roles through platforms like specialist job boards, the most valuable skills may be those that combine an understanding of AI systems with deep knowledge of traditional reliability engineering.
For individual developers and decision-makers at customer organisations, the events offer a cautionary template. Companies that are building their own internal AI agents to manage deployments, monitor security, or optimise costs should not assume that using a major cloud provider automatically insulates them from these risks. The same principles that Amazon must now apply (limited scopes for autonomous action, rigorous testing in sandboxed environments, and independent oversight) are relevant to any enterprise experimenting with agentic AI. In some cases, that may mean slowing down rollouts or requiring human approval for certain classes of changes, even when the technology appears capable of operating fully autonomously.
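For teams experimenting with their own agents, a limited scope for autonomous action can be as simple as an explicit allowlist: the agent may perform a small set of low-risk actions on a narrow set of resources, and anything outside that scope is denied and escalated to a person. The sketch below assumes hypothetical action names and resource prefixes; it illustrates the principle rather than any particular cloud provider’s permission model.

```python
# Hypothetical sketch of a scoped-permission check for an autonomous agent.
# Action names and resource prefixes are illustrative, not a real cloud provider's API.
ALLOWED_ACTIONS = {"restart_service", "scale_out", "scale_in"}
ALLOWED_RESOURCE_PREFIXES = ("staging/",)   # the agent may only touch staging resources

def authorize(action: str, resource: str) -> bool:
    in_scope = action in ALLOWED_ACTIONS and resource.startswith(ALLOWED_RESOURCE_PREFIXES)
    if not in_scope:
        print(f"DENIED: {action} on {resource} -> escalate to a human operator")
    return in_scope

# The agent can act autonomously only inside its declared scope.
authorize("scale_out", "staging/web-fleet")       # allowed
authorize("modify_iam_policy", "prod/core-db")    # denied, escalated
```

Narrow scopes like this trade some automation speed for predictability, which is exactly the trade-off the AWS incidents suggest is worth making.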
Finally, the outages highlight a growing asymmetry between the complexity of modern cloud systems and the tools that customers have to understand them. While users can log into dashboards, manage resources, and authenticate through services reminiscent of consumer platforms such as online account portals, they rarely see the layers of automation that sit beneath. As AI takes on a larger share of that hidden work, pressure is likely to grow for new standards around disclosure, auditability, and external review. Regulators, industry groups, and even media organisations that depend on reader revenue from products like digital subscriptions all have a stake in ensuring that the infrastructure underpinning modern life is not quietly handed over to poorly supervised algorithms.
*This article was researched with the help of AI, with human editors creating the final content.