Researchers at Andon Labs built a simulated vending machine business and asked frontier AI models, including Claude, to run it autonomously over extended periods. The results expose a troubling pattern: every model they tested eventually spiraled, in at least some runs, into what the researchers call “meltdown loops,” losing the ability to manage basic tasks like pricing and inventory long before the simulation ended. The findings challenge a growing industry assumption that large language models are ready to operate as reliable, long-running autonomous agents in real commercial settings.
What Vending-Bench Actually Tests
The benchmark, formally titled “Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents,” sets up a deceptively simple challenge. An AI agent must operate a vending machine business day after day, handling ordering, inventory management, pricing decisions, and daily fees. None of these individual tasks would stump a modern language model in isolation. The difficulty comes from sustaining coherent decision-making across all of them simultaneously, over horizons that exceed 20 million tokens per run, according to the Andon Labs paper published on arXiv. That is an enormous volume of interaction, far larger than any model’s context window and far beyond what most agent benchmarks require, and it is precisely where things fall apart.
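To make the setup concrete, here is a minimal sketch of the kind of state the agent is responsible for, written for this article rather than taken from the benchmark; the class name, starting balance, fee, and day-end logic are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class VendingBusiness:
    """Illustrative state an agent must keep consistent, day after day."""
    cash: float = 500.0                       # starting balance (made up for the example)
    daily_fee: float = 2.0                    # recurring operating cost
    inventory: dict[str, int] = field(default_factory=dict)      # item -> units in machine
    prices: dict[str, float] = field(default_factory=dict)       # item -> sale price
    pending_orders: dict[str, int] = field(default_factory=dict) # item -> units in transit

    def end_of_day(self, deliveries: dict[str, int], sales: dict[str, int]) -> None:
        # Stock arrives, sales are deducted, and the fee is charged whether or not
        # the agent remembered it. None of this is hard in isolation; the hard part
        # is keeping the picture accurate across thousands of consecutive days.
        for item, units in deliveries.items():
            self.inventory[item] = self.inventory.get(item, 0) + units
            self.pending_orders[item] = max(0, self.pending_orders.get(item, 0) - units)
        for item, units in sales.items():
            sold = min(units, self.inventory.get(item, 0))
            self.inventory[item] = self.inventory.get(item, 0) - sold
            self.cash += sold * self.prices.get(item, 0.0)
        self.cash -= self.daily_fee
```

Every decision the agent makes, from reordering to repricing, only makes sense relative to a state like this one, which is exactly what the long runs erode.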
The researchers built their agent implementation on the inspect-ai framework, a structured evaluation tool that allowed them to observe model behavior at granular intervals throughout each simulation. This setup matters because it means the failures they documented are not anecdotal one-off bugs. They are reproducible patterns captured within a controlled testing environment. The benchmark was designed to mirror the kind of ongoing, low-stakes commercial operation that AI companies increasingly pitch as a natural fit for autonomous agents. If a model cannot keep a virtual vending machine solvent, the implications for more complex deployments are stark.
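For readers who want to see the shape of such a harness, inspect-ai organizes an evaluation as a task built from a dataset, a solver, and a scorer. The fragment below is a minimal, hypothetical sketch in that shape, not the Vending-Bench code; the prompt, target, scorer, and model string are placeholders, and the real benchmark additionally wires in tools for ordering, pricing, and messaging and runs for millions of tokens rather than a single turn.

```python
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate, system_message
from inspect_ai.scorer import includes

@task
def vending_smoke_test():
    # A toy single-turn task in the inspect-ai shape: one sample, a system prompt,
    # and a simple string-match scorer. Purely illustrative.
    return Task(
        dataset=[Sample(input="You have $100. Decide what to restock first.",
                        target="order")],
        solver=[system_message("You operate a small vending machine business."),
                generate()],
        scorer=includes(),
    )

if __name__ == "__main__":
    # The model identifier is a placeholder; any provider/model string the framework
    # supports can be substituted here.
    eval(vending_smoke_test(), model="anthropic/claude-3-5-sonnet-latest")
```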
Anatomy of a Meltdown Loop
The most striking finding from the paper is the documentation of what the authors call “derail/meltdown loops.” These are not sudden crashes. Instead, they describe a gradual degradation in which the agent begins making decisions that conflict with its earlier actions, losing track of its own inventory state or setting prices that make no economic sense. Think of it as a slow-motion unraveling: the agent might order stock it already has, fail to account for fees it previously acknowledged, or price items below cost without any strategic rationale. Over millions of tokens, these small inconsistencies compound until the business is functionally incoherent.
What makes this worse than a simple accuracy problem is the feedback loop. Each bad decision alters the state of the simulation, and the agent then reasons from that corrupted state to make its next move. The errors do not stay contained. They propagate forward, warping subsequent choices in ways that accelerate the decline. This is the core mechanism the researchers describe, and it was observed across models, not just Claude. The meltdown is not a quirk of one architecture. It appears to be a structural weakness in how current language models maintain coherence over very long operational horizons.
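A deliberately crude toy loop, written for this article rather than drawn from the paper, shows how the dynamic compounds: the agent reorders against the inventory it believes it has, so a few missed bookkeeping updates leave it running the business on fiction.

```python
# Toy illustration of the compounding dynamic described above: the agent acts on its
# *believed* state, the environment updates the *true* state, and a few missed
# updates are enough to make every later decision slightly more wrong.
true_stock, believed_stock, cash = 10, 10, 100.0

for day in range(1, 9):
    sold = min(3, true_stock)              # customers buy what is actually there
    true_stock -= sold
    if day % 2:                            # on even days the agent fails to log sales
        believed_stock -= sold

    if believed_stock < 5:                 # reorder policy keyed to the believed state
        true_stock += 10
        believed_stock += 10
        cash -= 10 * 1.50                  # illustrative unit cost

    print(f"day {day}: believed={believed_stock:2d}  true={true_stock:2d}  cash={cash:.2f}")
```

By the final days of this toy run, the machine is empty while the agent still believes it holds stock, so no reorder ever triggers: a miniature version of the stale-state reasoning the paper documents at scale.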
Why Current Memory Tools Fall Short
A natural objection is that better memory management should solve this. After all, retrieval-augmented generation and external memory stores are standard tools in the agent-building toolkit. But the Vending-Bench results suggest these solutions are not yet sufficient for the kind of sustained, stateful reasoning the benchmark demands. The simulation requires the agent to hold a running mental model of its business: what is in stock, what has been ordered, what prices are set, what fees are due. That model must update correctly with every action and every new piece of information across a session that stretches well beyond 20 million tokens.
Current memory architectures tend to work well for retrieval of discrete facts but struggle with maintaining a coherent, evolving world model. The vending machine scenario is relatively simple compared to, say, managing a supply chain or coordinating logistics across multiple locations. Yet even here, the agents could not keep their internal state aligned with reality. This gap between what memory tools promise and what they deliver in practice is one of the most important takeaways from the research. It suggests that scaling context windows or bolting on external databases will not, by itself, produce agents capable of reliable long-term autonomy.
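The gap is easier to see side by side. The sketch below is illustrative only and does not correspond to any particular memory product or to the paper's code; the field names and numbers are invented. Retrieval answers "what did I record earlier?", while a world model has to keep cash, pending orders, and stock consistent with every action taken.

```python
# Illustrative contrast, not any real product's memory API. A retrieval store will
# happily return both of these lines, with nothing forcing the stale one to be
# reconciled with the fresh one:
memory = []
memory.append("2024-03-01: ordered 40 units of cola, cash balance now $212.50")
memory.append("2024-03-09: cash balance is $212.50, no orders outstanding")

# A world model has to enforce the bookkeeping that connects such facts. Placing an
# order must change cash and pending deliveries together, or the state silently
# stops describing the business.
state = {"cash": 252.50, "pending": {}, "stock": {"cola": 3}}

def place_order(state: dict, item: str, units: int, unit_cost: float) -> None:
    cost = units * unit_cost
    if cost > state["cash"]:
        raise ValueError("order exceeds available cash")
    state["cash"] -= cost                                      # every field updates in the same step
    state["pending"][item] = state["pending"].get(item, 0) + units

place_order(state, "cola", 40, 1.00)
print(state)   # {'cash': 212.5, 'pending': {'cola': 40}, 'stock': {'cola': 3}}
```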
The Real-World Stakes Beyond Vending Machines
No one is losing sleep over a simulated vending machine going bankrupt. But the pattern the benchmark reveals has direct relevance to the wave of autonomous agent products now entering the market. Companies across industries are experimenting with AI agents that manage customer service queues, adjust dynamic pricing for e-commerce, handle procurement workflows, and even execute trades. All of these tasks share the same core requirement that Vending-Bench tests: the ability to make consistent, state-aware decisions over extended periods without human oversight. If frontier models reliably melt down in a simplified version of this challenge, the risk profile for real deployments is considerably higher than marketing materials suggest.
It is notable that the meltdown phenomenon was observed across multiple models, not just one. This is not a case where switching providers solves the problem. The failure mode appears to be baked into how transformer-based language models handle long-horizon reasoning, regardless of which company trained them. For businesses evaluating autonomous agent solutions, this means that vendor selection alone is unlikely to mitigate the risk. The underlying architecture needs to advance before these systems can be trusted with genuinely autonomous, long-running commercial operations. Until then, human-in-the-loop checkpoints are not just a nice-to-have safety measure. They are a practical necessity.
A Needed Correction to Agent Hype
Much of the current excitement around AI agents assumes that because models perform well on short, bounded tasks, they will naturally extend that competence to longer, open-ended ones. Vending-Bench provides concrete evidence that this assumption is wrong. The benchmark’s design is deliberately modest in complexity. It does not ask the agent to negotiate contracts or write software. It asks the agent to run a small business with a handful of variables. The fact that this is enough to trigger systematic failure should recalibrate expectations about what autonomous agents can reliably do right now.
One common critique of AI safety research is that it relies on contrived scenarios disconnected from real applications. The Andon Labs team sidesteps that objection by grounding their benchmark in a mundane, commercially relevant task. The vending machine framing is almost disarmingly ordinary, which is exactly what gives the results their force. If the agent cannot handle ordinary, the extraordinary is not on the table. For anyone building products on top of long-running AI agents, the lesson is straightforward: today’s systems still need guardrails, monitoring, and clear boundaries on what they are allowed to control, no matter how impressive their short-term performance may look in a demo.
*This article was researched with the help of AI, with human editors creating the final content.*