Give a top AI agent two hours and a well-defined coding problem, and it will match or beat a skilled human engineer. Give that same agent an eight-hour research challenge, and the human pulls ahead. Extend the clock further, into the kind of open-ended, multi-day work that defines real scientific research, and the gap widens considerably.
That pattern, documented across four major benchmarks published between late 2023 and early 2025, captures where artificial intelligence stands as a research tool in mid-2026: powerful on short sprints, still clearly behind on the complex, sustained reasoning that drives discovery. But the boundary between “AI-ready” and “human-only” tasks is shifting fast, and the rate of that shift is now measurable.
Short tasks favor agents; longer ones still favor humans
The most direct comparison comes from RE-Bench, a benchmark developed by the research organization METR that pits frontier language-model agents against human ML engineers on research-style programming tasks across multiple time budgets. At the two-hour mark, agents performed on par with or slightly above the human engineers. At eight hours, humans narrowly edged ahead. Beyond that, the advantage grew: people kept improving their solutions while agents plateaued.
“The results were humbling in both directions,” said one METR researcher involved in the study. “Agents blew past our expectations on the quick tasks, but on anything requiring real depth, the humans just kept going where the models couldn’t.”
The explanation is intuitive to anyone who has wrestled with a hard technical problem. Agents execute quickly on well-scoped subtasks but struggle to step back, reframe an approach, or debug a subtle failure that requires understanding the full context of a system. Human researchers start slower because they spend time building that understanding, but the investment pays off as problems grow more complex.
The crossover point is moving, and there is a number for how fast
A paper accepted to NeurIPS 2025 puts a concrete metric on the trend. Researchers introduced what they call the 50%-task-completion horizon: the maximum length of a software task (measured by how long it takes a human expert) that an AI agent can still finish at least half the time. Since 2019, that horizon has roughly doubled every seven months. The authors noted possible acceleration in 2024, though whether that reflects a durable shift or a one-time leap from a particular model generation remains an open question.
If the trend holds, tasks that take a human expert four hours, beyond reliable agent reach a year ago, will fall within that range within months. For research managers and lab directors, this is not an abstract projection. It directly affects how teams should divide work between human researchers and AI tools. Worth noting: the metric tracks performance on software tasks specifically, so users of general-purpose assistants such as ChatGPT or Microsoft Copilot should not assume the same horizon applies to every kind of work those products handle.
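To make the arithmetic concrete: under steady doubling, the horizon after t months is the current horizon multiplied by 2^(t/7). The sketch below is a minimal illustration of that extrapolation, not code from the paper; the two-hour starting horizon is an assumed, illustrative figure, and only the seven-month doubling time comes from the reported trend.

```python
# Back-of-the-envelope extrapolation of the 50%-task-completion horizon.
# Only the ~7-month doubling time comes from the reported trend; the
# starting horizon is an illustrative assumption, not a published figure.

DOUBLING_MONTHS = 7.0        # reported doubling time of the horizon
START_HORIZON_HOURS = 2.0    # assumed current horizon (illustrative)

def projected_horizon(months_ahead: float) -> float:
    """Horizon length in hours after `months_ahead` months of steady doubling."""
    return START_HORIZON_HOURS * 2 ** (months_ahead / DOUBLING_MONTHS)

for months in (0, 7, 12, 14):
    print(f"{months:>2} months out: ~{projected_horizon(months):.1f}-hour tasks at 50% reliability")

# Under these assumptions, a four-hour task crosses the 50% threshold after
# one doubling time: roughly seven months.
```

The point of the exercise is not the specific numbers but the shape of the curve: exponential growth means each doubling brings a qualitatively larger class of tasks into range on a fixed schedule.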
Full paper replication remains far out of reach
OpenAI’s PaperBench tests something far more ambitious: whether an AI agent can replicate a complete AI research paper from scratch. The benchmark uses ML PhDs as its human baseline and enforces strict rules, including blocking direct access to author-released code so agents cannot simply copy existing implementations. A companion protocol called JudgeEval checks the benchmark’s automated grading against human-scored judgments to validate scoring consistency. The agents tested were early-2025 frontier models, meaning the results reflect the capabilities of the most advanced systems available at the time of evaluation rather than any single commercial product.
Agents scored well below the PhD-level human baseline on these replication tasks. That gap is the widest documented in any current benchmark and highlights the kind of work where humans retain a commanding advantage: multi-step projects that require choosing experimental designs, interpreting ambiguous results, and integrating knowledge across subproblems over hours or days.
One important caveat: PaperBench tests replication of existing papers, not original discovery. Whether an agent that can faithfully reproduce a known result could also identify a promising hypothesis or design a novel experiment is a separate, unanswered question. The benchmark’s authors do not claim a direct mapping between the two skills.
General reasoning hits a wall on hard problems
The GAIA benchmark, designed to evaluate general AI assistants on questions requiring multi-step reasoning and real-world knowledge retrieval, reinforces the same difficulty gradient. Agents handle straightforward queries competently but drop off sharply on problems that demand chaining together several reasoning steps or synthesizing information from diverse, sometimes noisy sources. The benchmark’s baseline results show that even top-performing systems lose reliability as task complexity increases.
For anyone evaluating AI tools for research support, GAIA’s results carry a practical lesson: task structure and duration predict agent usefulness far better than raw model size or headline benchmark scores.
What the benchmarks do not tell us
All four benchmarks focus on machine learning and software engineering. Whether the same patterns hold in experimental biology, chemistry, or physics is not established by any primary cross-domain study in the current evidence base. A lab automating protein-folding experiments or synthesizing novel compounds faces different bottlenecks than one writing code, and the transfer of these findings to those settings is speculative.
No primary economic impact assessment accompanies these results, either. The benchmarks measure task completion and replication accuracy, not cost savings, team productivity, or labor-market effects. Commentary about hybrid human-AI research teams has been widespread, but those projections lack the controlled evidence that the benchmarks themselves provide for raw performance.
There are also robustness questions. RE-Bench and the long-software-task study rely on curated task suites with clear success criteria. Real research often involves ambiguous requirements, incomplete prior work, and goals that shift as results come in. How sensitive agent performance is to those messier conditions has not been rigorously tested.
Finally, none of these benchmarks include longitudinal tracking across multiple model generations released after mid-2024. The doubling-rate estimate draws on data going back to 2019, but the most recent data points are snapshots, not confirmed trend markers for the latest frontier models. Progress could accelerate further, or it could hit domain-specific ceilings that slow the curve.
How research teams should allocate work between humans and agents
Taken together, these four benchmarks describe AI systems that are already strong accelerators on short, well-scoped tasks but still lag behind humans on open-ended, multi-hour projects demanding flexible reasoning and conceptual reframing. The crossover point is moving steadily toward longer tasks, yet as of mid-2026, it remains well within the range of a typical workday.
The practical case for a hybrid approach is strong: delegate modular, time-bounded subtasks to agents while keeping humans in charge of problem decomposition, experimental design, and final interpretation. As newer models arrive, these same benchmarks, extended across domains and tracked over successive generations, will be the best tools for distinguishing durable capability gains from one-off jumps and for deciding when to restructure scientific workflows around machine collaborators.
*This article was researched with the help of AI, with human editors creating the final content.*