Amazon’s bungling AI triggered multiple crippling AWS outages

Amazon Web Services suffered two separate outages last year that were directly caused by the company’s own artificial intelligence tools, raising hard questions about the safety of embedding experimental AI agents into the infrastructure that powers much of the internet. One incident knocked out services in AWS’s China region for roughly 13 hours, while a second disruption was tied to a different AI coding assistant. The episodes expose a tension at the core of Amazon’s strategy: the same AI systems the company is racing to deploy are now capable of breaking the very platform its customers depend on.

Two AI Tools, Two Failures

The first and more severe incident occurred in December, when an AI system called Kiro triggered a roughly 13-hour outage across AWS infrastructure in China. The disruption cascaded through dependent services, leaving customers unable to access critical cloud resources for an extended stretch. A second, separate incident was linked to Q Developer, Amazon’s AI-powered coding assistant, which caused its own round of service degradation. Both tools are part of Amazon’s aggressive push to weave AI into every layer of its cloud operations, from automated code generation to infrastructure management.

What makes these failures notable is not just their duration or scope but their origin. These were not hardware malfunctions or configuration errors introduced by human engineers. They were the direct result of AI agents operating inside Amazon’s own systems. The reported issues affected at least one major user-facing product, and the pattern of two distinct AI-caused outages within a single year suggests a systemic vulnerability rather than a one-off accident. For a company that has long sold the cloud as a safer, more reliable alternative to on-premises infrastructure, that is a serious warning sign.

Speed Over Safety in AI Deployment

Amazon has spent the last several years in an arms race with Microsoft and Google to integrate generative AI across its product lines. AWS, the company’s most profitable division, sits at the center of that effort. Tools like Kiro and Q Developer represent Amazon’s bid to automate complex engineering tasks, reduce human labor costs and attract developers who want AI-assisted workflows. But the two outages reveal a significant downside to that velocity. When AI agents are given access to production systems and the authority to make changes, a single flawed action can ripple across regions and services before anyone intervenes; a human operator, working more slowly, might have caught the same mistake before executing it.

The conventional wisdom in cloud reliability engineering holds that automation reduces human error. That assumption breaks down when the automated agent itself becomes the source of failure. Traditional software bugs are typically deterministic: given the same input, they produce the same bad output, making them relatively straightforward to identify and patch. AI agents, by contrast, can behave unpredictably depending on the data they process and the context in which they operate. This makes pre-deployment testing far harder and post-incident forensics more complex. Amazon’s decision to deploy these tools at scale before fully accounting for those risks appears to have backfired in at least two documented cases, raising doubts about whether the company has struck the right balance between innovation and resilience.
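To make that contrast concrete, here is a minimal, purely illustrative Python sketch. All names are hypothetical and not drawn from Amazon’s systems: the deterministic bug fails identically on every replay, while the sampling-based stand-in for an agent misbehaves only occasionally, which is what makes pre-deployment testing and post-incident forensics harder.

```python
# Illustrative only: contrasts a deterministic bug with nondeterministic
# agent behavior. All names are hypothetical, not Amazon internals.
import random

def parse_config_line(line: str) -> dict:
    """A conventional, deterministic bug: the same bad input always fails
    the same way, so it is easy to reproduce, bisect and patch."""
    key, _, value = line.partition("=")
    if not value:
        raise ValueError(f"malformed config line: {line!r}")
    return {key.strip(): value.strip()}

def toy_agent_action(prompt: str) -> str:
    """Stand-in for a generative agent: the chosen action depends on sampled
    randomness, so replaying the same prompt need not reproduce a failure."""
    actions = ["restart service", "rotate credentials", "delete resource"]
    return random.choices(actions, weights=[0.7, 0.25, 0.05], k=1)[0]

if __name__ == "__main__":
    for _ in range(3):  # the deterministic bug reproduces on every attempt
        try:
            parse_config_line("max_retries")
        except ValueError as exc:
            print("reproduced:", exc)
    # The risky "delete resource" path may surface in only a few replays,
    # which is why testing cannot simply enumerate the agent's behavior.
    replays = [toy_agent_action("remediate disk alarm") for _ in range(20)]
    print("actions observed across 20 replays:", sorted(set(replays)))
```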

Expert Warnings About Containment

Cybersecurity expert Michal Wozniak offered a blunt assessment of the problem. He said it would be “nearly impossible for Amazon to completely prevent internal AI agents” from triggering disruptive incidents. That framing shifts the conversation from whether AI-caused outages will happen again to how frequently they will occur and how severe they will be. If containment is effectively impossible, the question becomes one of damage limitation and recovery speed rather than prevention. In that world, redundancy, rapid rollback tools and transparent communication during crises become just as important as the AI features themselves.

Wozniak’s warning carries weight because it reflects a broader concern among security professionals about the governance of AI agents operating inside critical infrastructure. Most enterprise AI deployments today lack the kind of rigorous access controls and kill-switch mechanisms that would allow operators to halt an AI agent mid-action before it causes cascading harm. Amazon, despite its enormous engineering resources, appears to be no exception. The company has not publicly detailed what safeguards were in place during either incident or what changes it has made since. For customers and regulators, that opacity complicates efforts to evaluate whether AWS is learning the right lessons from these failures or simply hoping that better training data and incremental tweaks will be enough.
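The kind of safeguard Wozniak’s critique points toward is simple to outline, even if it is hard to retrofit. The sketch below is a hypothetical design, not a description of Amazon’s actual controls: every action an agent proposes passes through a least-privilege allow-list and an operator-controlled kill switch that can freeze the agent mid-run.

```python
# Hypothetical guardrail sketch: least-privilege allow-list plus an operator
# kill switch for an AI agent. Not based on any disclosed AWS mechanism.
import threading

class AgentGuardrail:
    def __init__(self, allowed_actions: set[str]) -> None:
        self._allowed = allowed_actions   # scoped permissions for this agent
        self._halted = threading.Event()  # operator-controlled kill switch

    def halt(self) -> None:
        """Flip the kill switch: every subsequent approval request fails,
        stopping the agent mid-run before further changes land."""
        self._halted.set()

    def approve(self, action: str) -> bool:
        """Gate a single proposed action. Deny it if the agent has been
        halted or the action falls outside its allow-list."""
        return not self._halted.is_set() and action in self._allowed

guard = AgentGuardrail({"read_metrics", "open_ticket"})
print(guard.approve("read_metrics"))   # True: within the agent's scope
print(guard.approve("delete_volume"))  # False: destructive, never allowed
guard.halt()
print(guard.approve("read_metrics"))   # False: kill switch engaged
```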

What This Means for AWS Customers

AWS is not a niche product. It is the backbone for a vast range of online services, from streaming platforms and e-commerce sites to government databases and financial systems. When AWS goes down, the effects spread far beyond Amazon’s own operations. The 13-hour China-region outage alone would have disrupted businesses and users who had no direct relationship with the AI tools that caused the failure. For those customers, the cause is almost irrelevant. What matters is that the service they paid for was unavailable, and the root cause was a decision Amazon made internally about how aggressively to deploy untested technology.

This dynamic creates a trust problem. Cloud customers choose AWS in large part because of its reputation for reliability and uptime guarantees, typically formalized in service-level agreements. When outages are caused by external attacks or rare hardware failures, customers generally accept them as part of operating in a complex environment. But when the provider’s own experimental AI tools are the cause, the calculus changes. Customers may reasonably ask whether Amazon is prioritizing its AI ambitions over the stability commitments it has made to paying users. Some may hedge their bets by adopting multi-cloud architectures or revisiting their incident response plans, even as they continue to rely on AWS for core workloads.

The Feedback Loop Amazon Has Not Solved

The deeper structural issue is one of incentives. Amazon is under intense competitive pressure to ship AI features quickly. Every quarter that passes without a compelling AI story risks losing developer mindshare to rival ecosystems. That pressure creates an environment where AI agents are deployed into production faster than the guardrails around them can mature. The result is a feedback loop: rapid deployment leads to incidents, incidents demand engineering resources for remediation, and those resources are then unavailable for building the safety infrastructure that would have prevented the problem in the first place. Unless Amazon explicitly rebalances its priorities, that loop will be difficult to escape.

Most coverage of these outages has focused on the immediate disruption and the novelty of AI-caused failures. But the more important story is about organizational priorities. Amazon is not a company that lacks engineering talent or operational discipline; it built AWS into the dominant cloud platform precisely because it excelled at reliability engineering. The fact that its own AI tools have now caused material outages suggests that internal incentives have shifted toward speed and feature delivery, even at the expense of the conservative change management that once defined its cloud operations. For customers, the key question is whether Amazon is prepared to slow down AI rollouts, increase human oversight and accept short-term competitive risk in order to restore the level of stability they have come to expect.

A Test Case for Responsible AI in the Cloud

These incidents also arrive at a moment when policymakers and the public are grappling with how to regulate powerful AI systems. High-profile failures inside a platform as central as AWS may strengthen arguments that cloud providers should be held to stricter standards when they embed AI agents into infrastructure that underpins entire economies. That could include mandatory disclosure of AI-related incidents, independent audits of safety practices or even limits on the degree of autonomy granted to AI tools in production environments. How Amazon responds could shape not only its own reputation but the baseline expectations for the rest of the industry.

For now, customers and observers must piece together the picture from limited public information and a handful of detailed reports. Those who rely on AWS for critical workloads may want to push for more contractual clarity about how AI is used in the services they buy, and what remedies are available when experimental tools cause harm. Independent reporting will also remain important in pressing large tech companies to disclose the real-world consequences of their AI strategies.

Ultimately, the outages caused by Kiro and Q Developer are more than embarrassing technical glitches. They are early indicators of how powerful, semi-autonomous software behaves when woven into the deepest layers of digital infrastructure. If Amazon treats them as isolated mishaps, similar failures are likely to recur, potentially with wider and more damaging consequences. If, instead, the company uses them as a catalyst to rethink incentives, strengthen safeguards and communicate more candidly with its customers, AWS could yet turn this setback into a template for responsible AI deployment at scale. The stakes extend far beyond one provider’s uptime metrics: they touch on whether the next wave of AI will make the internet more resilient, or more fragile, for everyone who depends on it.

*This article was researched with the help of AI, with human editors creating the final content.