Morning Overview

Human scientists still trounce the best AI agents on complex tasks, Nature study finds

The latest generation of AI agents can draft code, summarize papers, and churn through datasets at speeds no human can match. But when a scientific problem demands creativity, multi-step reasoning, or the kind of improvisation researchers rely on when an experiment goes sideways, humans still win handily. That is the central conclusion of a news analysis by Davide Castelvecchi, published by Nature in May 2026, which draws on several recent benchmarks to assess where AI agents stand relative to trained scientists on complex, open-ended work.

“AI tools are becoming ubiquitous in research, but the benchmarks tell a humbling story,” Castelvecchi writes, noting that autonomous agents consistently failed at the kinds of tasks that define real scientific practice: designing experiments, troubleshooting unexpected results, and synthesizing findings across disciplines. The implications land squarely on labs and companies that have poured resources into AI-driven discovery, raising a pointed question about what, exactly, these tools can be trusted to do on their own.

A concrete example: reproducing a computational biology result

One of the most striking illustrations comes from CORE-Bench, a benchmark measuring whether agents can reproduce computational results from published papers across disciplines including biology, medicine, and computer science. In one representative task, an agent was given a published computational biology paper and asked to reproduce its main quantitative finding by setting up the software environment, running the analysis code, handling the data pipeline, and matching the reported result. Even frontier-class agents, including models from the GPT-4 and Claude families, failed to complete the full pipeline. They could execute individual code blocks but broke down when they had to troubleshoot dependency conflicts, adapt to undocumented data formatting quirks, or decide which intermediate output to trust when numbers diverged from the paper’s tables. A trained graduate student, by contrast, completed the same task by recognizing that a version mismatch in a statistical library was producing slightly different rounding behavior, a judgment call that required understanding the science behind the computation, not just the syntax.
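To make that failure mode concrete, here is a minimal Python sketch of the kind of reproduction check such a task demands. The file name, column names, reported p-value, tolerance, and the choice of a t-test are all hypothetical stand-ins, not details from CORE-Bench or any specific paper; the point is the final comparison, where someone still has to judge whether a small divergence is a version-related rounding artifact or a genuine failure to reproduce.

```python
# Hypothetical sketch of a reproduction check; the data file, reported
# value, tolerance, and analysis are invented, not CORE-Bench specifics.
import importlib.metadata

import pandas as pd
from scipy import stats

REPORTED_P_VALUE = 0.0031   # value printed in the paper's table (hypothetical)
TOLERANCE = 1e-3            # how far a rerun may drift before we flag it

def check_environment() -> None:
    """Record exact library versions; a version mismatch is the first
    suspect when a rerun's numbers diverge from the paper's tables."""
    for pkg in ("pandas", "scipy"):
        print(pkg, importlib.metadata.version(pkg))

def reproduce_main_result(path: str = "expression_data.csv") -> float:
    """Rerun the paper's core analysis: a two-sample t-test between groups."""
    df = pd.read_csv(path)
    treated = df.loc[df["group"] == "treated", "expression"]
    control = df.loc[df["group"] == "control", "expression"]
    _, p_value = stats.ttest_ind(treated, control)
    return p_value

if __name__ == "__main__":
    check_environment()
    p = reproduce_main_result()
    if abs(p - REPORTED_P_VALUE) <= TOLERANCE:
        print(f"Reproduced: p={p:.4f} matches the reported value within tolerance")
    else:
        # This is where the graduate student's judgment came in: deciding
        # whether a small divergence is a library rounding artifact or a
        # real discrepancy requires understanding the underlying science.
        print(f"Divergence: p={p:.4f} vs reported {REPORTED_P_VALUE}")
```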

Where AI agents fall short across benchmarks

CORE-Bench is not an outlier. Several other purpose-built benchmarks, each testing agents from a different angle, converge on the same finding.

SkillsBench, a preprint-stage benchmark designed to test how well agent capabilities transfer across diverse domains, found that AI agents struggle on many real-world tasks without heavy scaffolding. When researchers supplied curated procedural guides, called “skills,” pass rates improved, but performance still swung wildly by domain. An agent that handled one category of problem competently could fail badly on another, a pattern that points to brittleness rather than anything resembling general intelligence.

Professional knowledge work told a similar story. APEX-Agents, another preprint benchmark, was built around tasks created by investment banking analysts, management consultants, and corporate lawyers. These are long-horizon, cross-application workflows where a single deliverable depends on chaining together research, analysis, drafting, and review across multiple tools. Agents tested on these workflows did not match the quality of work that trained professionals produced.

Separate peer-reviewed work published in Nature tested what a so-called “AI scientist” system could accomplish within a machine-learning research workflow. The evaluation distinguished between focused, template-based tasks and template-free, open-ended ones. A human evaluation component used blinded peer review at an ICLR 2025 workshop called ICBINB (“I Can’t Believe It’s Not Better”). The verdict: AI systems handled structured subtasks adequately but fell short on the open-ended work that produces genuine scientific insight.

Which AI models were tested

The benchmarks reviewed in the Nature analysis evaluated several of the most capable commercially available models, including OpenAI’s GPT-4 family, Anthropic’s Claude models, and Google DeepMind’s Gemini. In most evaluations, these frontier systems were tested both with and without additional scaffolding such as tool use, retrieval augmentation, and multi-step prompting frameworks. Even with those enhancements, none of the agents matched human performance on the most demanding open-ended tasks. The results suggest that the gap is not specific to any single model architecture but reflects a broader limitation of current agent designs when confronted with problems requiring genuine scientific judgment.

AI adoption is surging anyway

None of this has slowed adoption. Stanford’s 2026 AI Index Report documents that the share of scientific publications mentioning AI has more than tripled over the past decade, with AI tools now embedded in research workflows across disciplines from genomics to materials science. The trend lines confirm that AI has penetrated deeply into how research gets done. But penetration and performance are different things, and the benchmarks suggest that the ceiling for autonomous agents remains stubbornly low on the hardest problems.

The disconnect matters because expectations have outpaced reality. Over the past two years, major AI labs have marketed their frontier models as capable research partners. The benchmarks reviewed by Nature test those claims against measurable criteria and find them wanting, at least for work that requires more than pattern-matching or retrieval. As of June 2026, none of the major AI companies, including OpenAI, Anthropic, and Google DeepMind, has publicly disputed the Nature analysis or the benchmark results it draws on.

Why measuring the gap is harder than it looks

The benchmarks agree on the broad conclusion, but important questions remain open. One is how to define and measure “complex” in a way that holds across fields. A peer-reviewed study on general scales for AI evaluation, also published in Nature, argues that some current evaluation methods understate or overstate agent capability relative to humans. The difficulty of grading open-ended, agential tasks means that two benchmarks testing similar skills can produce conflicting impressions of how capable an agent really is.

Cost and reliability add another layer of ambiguity. Princeton’s Science of Agent Evaluation research group maintains the Holistic Agent Leaderboard, a third-party, cost-aware evaluation infrastructure that organizes results across multiple agent benchmarks. Which agent qualifies as “best” depends on the benchmark chosen, the budget available, and the reliability threshold a user is willing to accept. A cheap agent that passes narrow tests may look impressive on one leaderboard and mediocre on another that weights cost or consistency differently.
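To see why rankings shift, consider a small Python sketch with invented numbers; none of these agents, scores, or costs come from the Holistic Agent Leaderboard. The same three candidates produce three different “winners” depending on the budget and reliability threshold applied.

```python
# Hypothetical illustration of why "best agent" is underdetermined: the
# accuracy, cost, and reliability figures below are invented, not HAL data.
from dataclasses import dataclass

@dataclass
class AgentResult:
    name: str
    accuracy: float      # benchmark pass rate, 0..1
    cost_usd: float      # average cost per task
    reliability: float   # fraction of runs that finish without crashing

results = [
    AgentResult("agent-frontier", accuracy=0.71, cost_usd=4.20, reliability=0.88),
    AgentResult("agent-cheap",    accuracy=0.62, cost_usd=0.35, reliability=0.97),
    AgentResult("agent-flaky",    accuracy=0.75, cost_usd=2.10, reliability=0.64),
]

def best_agent(budget_per_task: float, min_reliability: float) -> AgentResult:
    """Pick the highest-accuracy agent that fits the budget and reliability
    threshold; different constraints crown different winners."""
    eligible = [r for r in results
                if r.cost_usd <= budget_per_task and r.reliability >= min_reliability]
    return max(eligible, key=lambda r: r.accuracy)

print(best_agent(budget_per_task=5.00, min_reliability=0.80).name)  # agent-frontier
print(best_agent(budget_per_task=1.00, min_reliability=0.80).name)  # agent-cheap
print(best_agent(budget_per_task=5.00, min_reliability=0.95).name)  # agent-cheap
```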

There is also limited public data on how hybrid human-AI teams perform compared to either humans or agents working alone. The benchmarks reviewed test agents in isolation or compare them directly to human baselines, but real-world scientific work increasingly involves humans directing AI tools for routine subtasks while retaining control over experimental design and interpretation. Whether that collaborative model closes the performance gap, or simply shifts it, has not been rigorously measured across disciplines.

What the strongest evidence actually shows

Not all of the evidence behind these findings carries equal weight. The strongest data comes from peer-reviewed studies and purpose-built benchmarks like SkillsBench, APEX-Agents, and CORE-Bench, each designed to test agents against specific, measurable criteria with transparent methods. Readers should note, however, that SkillsBench and APEX-Agents are preprints that have not yet undergone formal peer review. The Nature news analysis synthesizes these results but relies on the underlying technical papers for its factual claims.

Institutional reports like the Stanford AI Index provide valuable context about adoption trends and publication volumes, but they describe the landscape rather than directly testing agent performance. They help explain why the question matters now without settling when, or whether, agents will catch up.

Evaluation methodology itself is an active area of research. The work on general evaluation scales highlights that benchmark design choices shape results. A benchmark testing narrow, well-defined tasks will produce a more flattering picture of AI capability than one demanding the kind of improvisation a scientist uses when an experiment produces unexpected data. For anyone trying to judge whether AI agents are ready for independent scientific work, results from open-ended, multi-step evaluations deserve more weight than scores on constrained tasks.

Human oversight remains the dividing line between useful AI and unreliable AI

For researchers and lab directors deciding how to deploy these tools, the practical signal from the benchmarks is consistent: current agents are most effective as assistants handling structured, repeatable subtasks under human supervision. Deploying them as autonomous stand-ins for trained scientists on complex projects risks wasted resources and unreliable results.

The technology will keep improving, and future models may narrow the gap. But as of mid-2026, the evidence points in one direction. On the work that matters most in science, the work that requires judgment, adaptation, and the ability to recognize when something genuinely new is happening, human researchers remain firmly ahead.

*This article was researched with the help of AI, with human editors creating the final content.