Morning Overview

AI agents stumble over missing real-world context, not a lack of raw intelligence

Ask a top-tier AI agent to summarize a legal brief or write a Python function, and it will usually deliver. Ask it to find the return policy on a retailer’s website, add an item to a cart, and complete a checkout, and it will fail roughly five out of six times. That gap, between acing structured tests and fumbling through the kind of messy, multi-step web tasks most people handle on autopilot, has become one of the most studied problems in AI research heading into mid-2026. The emerging consensus among benchmark designers is that raw intelligence is not the bottleneck. The missing ingredient is the ability to find, interpret, and act on real-world context that is scattered, incomplete, and constantly changing.

The benchmarks that exposed the problem

Several independent research teams have built testing environments specifically designed to measure how AI agents perform outside the comfort of clean, structured prompts. Their results, drawn from real websites, real codebases, and real information-retrieval challenges, tell a remarkably consistent story.

The PATHWAYS benchmark, published in early 2025, tests whether web agents can discover hidden contextual information and then correctly use it. Agents in the study often navigated to the right web pages but still failed to extract the decisive piece of evidence sitting in front of them. In some cases, they hallucinated having consulted sources they never actually accessed. The reasoning ability was there. The perceptual grounding was not.

WebArena, developed by a team led by Shuyan Zhou at Carnegie Mellon, built a set of fully functional replica websites spanning e-commerce, forums, content management, and mapping. In the original 2023 evaluation, the best GPT-4-based agent completed just 14.41% of tasks, compared to 78.24% for human participants. That five-to-one gap did not come from the model lacking subject-matter knowledge. It came from the agent’s inability to manage stateful, multi-step interactions where every click changes the page and the available options. These figures represent founding benchmarks rather than the current state of the art; subsequent model generations, including GPT-4o and Claude 3.5 Sonnet, have narrowed the gap on WebArena, though as of spring 2026 no agent has come close to matching human reliability on the full task set.
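The stateful failure mode WebArena measures can be pictured with a toy loop: the agent cannot execute a fixed plan, because every action changes what is observable next and which actions are even available. The sketch below is a minimal illustration of that dynamic; the environment, page names, and actions are invented for this article and are not part of WebArena itself.

```python
# Toy illustration of a stateful web task: each action changes the page,
# so the agent must re-observe and re-plan at every step instead of
# replaying a script written against the starting page.

class ToyShop:
    """Simulates a site whose available actions depend on the current page."""
    def __init__(self):
        self.page = "home"
        self.cart = []

    def observe(self):
        # What the agent can do depends entirely on where it is right now.
        actions = {
            "home": ["open_product"],
            "product": ["add_to_cart", "go_home"],
            "cart": ["checkout"],
        }[self.page]
        return {"page": self.page, "cart": list(self.cart), "actions": actions}

    def act(self, action):
        if action == "open_product":
            self.page = "product"
        elif action == "add_to_cart":
            self.cart.append("item")
            self.page = "cart"          # the click itself moves the agent
        elif action == "checkout" and self.cart:
            self.page = "done"
        elif action == "go_home":
            self.page = "home"


def run_agent(env, max_steps=10):
    """A goal-directed policy that picks its next move from the *current*
    observation, never from a plan fixed at the start."""
    preference = ["checkout", "add_to_cart", "open_product"]
    for _ in range(max_steps):
        if env.page == "done":
            return True                 # task complete
        obs = env.observe()
        choice = next((a for a in preference if a in obs["actions"]), None)
        if choice is None:
            return False                # stuck: no useful action available
        env.act(choice)
    return env.page == "done"
```

A scripted agent that assumed "checkout" was clickable from the home page would fail immediately; the re-observe-then-act loop is what lets this one succeed, and it is exactly that loop that degrades as real pages grow cluttered and unpredictable.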

SWE-bench, introduced by Carlos E. Jimenez and colleagues at Princeton, took a different angle, drawing tasks from real GitHub issues and pull requests and asking agents to produce working patches. Success required coordinating changes across multiple files, understanding how a codebase fits together, and maintaining contextual awareness across hundreds of lines of code. Early solve rates were extremely low, even for the strongest models available at the time. Newer model families have posted substantially higher scores on the original SWE-bench set, but questions about data contamination (discussed below) complicate direct comparisons.

Then there is GAIA, a benchmark for general AI assistants that combines web browsing, tool use, and multimodal reasoning. In the original 2023 paper, human respondents scored roughly 92%, while GPT-4 with plugins managed about 15%. Those numbers established a baseline rather than a ceiling; more recent model families have improved on GAIA, though the gap between human and agent performance remains large. The tasks were not intellectually demanding for people. They were demanding because the information needed to complete them was scattered across sources, partially hidden, and sometimes contradictory.

Why the numbers might overstate (or understate) the gap

Not everyone reads these benchmarks the same way. A June 2025 preprint (not yet peer-reviewed) titled “The SWE-Bench Illusion” argues that some apparent gains on static benchmarks may reflect memorization of training data rather than genuine problem-solving ability. If models have already seen the test cases during training, their scores paint a rosier picture than reality warrants. The paper’s core warning: when evaluation data leaks into training pipelines, perceived intelligence can be an artifact. Because the paper has not undergone formal peer review, its specific claims about the magnitude of contamination should be treated as preliminary.

The SWE-bench team responded with SWE-bench Live, described in a May 2025 preprint. The variant draws from over 1,300 tasks posted since early 2024 across 93 repositories, specifically designed to resist data contamination through continuous updates. Early results from that variant suggest performance does drop on genuinely novel problems, though the size of the drop varies by model family and has not yet been pinned down with precision.

A separate question the benchmarks cannot answer on their own is how well lab results predict what happens in production. WebArena and GAIA simulate realistic environments, but they are still simulations. No large-scale public dataset from enterprise deployments confirms whether the context-retrieval failures observed in controlled settings persist at the same rate when agents operate inside corporate systems with access to internal documentation, proprietary APIs, and human-in-the-loop guardrails. Research on grounded web interaction through shopping tasks shows that dynamic pages, partial observability, and long decision horizons all degrade agent performance. Real enterprise environments add further layers of complexity that no benchmark has fully captured.

Have newer models closed the gap?

Readers following the rapid release cadence of GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro may wonder whether these results are already outdated. The short answer is that newer models have posted higher scores on several of the benchmarks discussed above, but the structural problem persists. Gains tend to be largest on tasks that reward better reasoning or longer context windows and smallest on tasks that require real-time navigation of changing web pages or multi-step stateful workflows. As Shuyan Zhou’s WebArena team noted in follow-up work, improvements in raw model capability do not automatically translate into proportional improvements in end-to-end task completion, because the failure mode is not about what the model knows but about how it interacts with a live environment. Until benchmark designers and model developers publish head-to-head comparisons on continuously updated, contamination-resistant test sets like SWE-bench Live, the precise size of the remaining gap will stay uncertain.

Tool access helps, but not where it matters most

One line of research offers a partial counterweight. The Toolformer study demonstrated that language models can learn to call external tools through self-supervised training, and browser-augmented systems like WebGPT showed that giving models search access improves factual accuracy. These are meaningful advances, but they do not resolve the core problem the benchmarks have identified.

Giving an agent a browser does not guarantee it will extract the right information from the page it lands on. The PATHWAYS results make this explicit: agents reached the correct pages and still missed the answer. The failure happened after retrieval, during interpretation. The agent had the evidence in front of it and could not recognize what mattered.

This distinction matters for anyone building or buying agent-based systems. A model that can call a search API is not the same as a model that can sift through a cluttered product page, identify the one clause in a return policy that answers a customer’s question, and act on it correctly while tracking the state of a multi-step workflow.

Why adaptive context retrieval is the engineering priority for agent deployment

The consistent failure mode across four independent benchmarks is not a lack of knowledge or reasoning power. It is the inability to navigate messy, evolving information environments where the right answer requires active investigation, not just retrieval. Scaling model size alone, or fine-tuning on static datasets, has not closed that gap in any of the test regimes examined here.

The investment priority the research supports is building adaptive context-retrieval architectures: systems that can handle partial information, detect when evidence is misleading, and maintain state across long interaction sequences. As the PATHWAYS and WebArena results both demonstrate, the breakdown happens not at the reasoning layer but at the point where the agent must perceive, filter, and act on a live environment. Until that layer is engineered to match the robustness of the reasoning layer above it, the distance between demo performance and production reliability will remain wide.
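One way to picture such an architecture is a retrieval loop that keeps explicit state, scores the evidence it has gathered, and goes back for more when that evidence is weak, rather than answering after a single pass. The sketch below is a simplified illustration under that assumption; every interface in it (the fetch function, the keyword-overlap scorer, the thresholds) is hypothetical and does not come from any of the cited papers.

```python
# Sketch of an adaptive context-retrieval loop: the agent tracks what it
# has visited, scores each piece of evidence against the goal, stops early
# when the evidence is strong, and refuses to answer when it never is.
# All names and interfaces here are illustrative assumptions.

from dataclasses import dataclass, field


@dataclass
class AgentState:
    goal: str
    evidence: list = field(default_factory=list)   # (source, text, score)
    visited: set = field(default_factory=set)      # avoids re-fetching pages


def score_evidence(text, goal):
    """Crude relevance proxy: keyword overlap with the goal. A production
    system would use a learned verifier here, not word counting."""
    goal_terms = set(goal.lower().split())
    return len(goal_terms & set(text.lower().split())) / max(len(goal_terms), 1)


def retrieve_adaptively(state, fetch_page, frontier, threshold=0.5, budget=10):
    """fetch_page(url) -> page text; frontier is a list of candidate URLs.
    Returns the best (source, text, score) triple, or None if nothing
    gathered within budget clears the threshold."""
    for url in frontier:
        if budget == 0:
            break
        if url in state.visited:
            continue                    # explicit state prevents redundant work
        state.visited.add(url)
        budget -= 1
        text = fetch_page(url)
        s = score_evidence(text, state.goal)
        state.evidence.append((url, text, s))
        if s >= threshold:
            return max(state.evidence, key=lambda e: e[2])
    return None                         # evidence insufficient: escalate, don't guess
```

The design choice worth noticing is the final `return None`: when no evidence clears the threshold, the loop reports failure instead of producing an answer anyway, which is precisely the behavior the PATHWAYS hallucination findings suggest current agents lack.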


*This article was researched with the help of AI, with human editors creating the final content.