Morning Overview

GPT-5.5 tops Claude Opus 4.7 on Terminal-Bench with an 82.7% score

OpenAI’s GPT-5.5 has posted an 82.7% score on Terminal-Bench 2.0, a benchmark that throws AI agents into difficult, real-world command-line tasks and grades them with zero partial credit. According to results circulating among AI researchers this month, that score places GPT-5.5 ahead of Anthropic’s Claude Opus 4.7 on the same test suite. Neither company has publicly confirmed the figures, but the benchmark itself is fully open, giving any engineering team the ability to verify the claims on their own hardware.

What Terminal-Bench actually measures

Terminal-Bench 2.0 is defined in a technical preprint hosted on arXiv. The benchmark consists of 89 discrete tasks executed inside terminal environments, each paired with a human-written reference solution and a verification suite that checks whether the agent completed the work correctly. Tasks range from file manipulation and process management to network configuration and multi-step debugging, the kinds of jobs a senior systems administrator or DevOps engineer handles on a production server.

What sets Terminal-Bench apart from older coding benchmarks like HumanEval or SWE-bench is its scoring model. An agent must satisfy every verification check for a given task to receive credit. Miss one step in a firewall configuration or leave a dangling process, and the task counts as a failure. The preprint documents specific failure modes the suite is designed to catch: incomplete command sequences, overlooked error states, and misread instructions. These are the mistakes that, in a live environment, could mean deleted databases or exposed ports.
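The scoring rule is simple to state in code. The sketch below is illustrative only, written in Python with hypothetical task names; it is not taken from the Terminal-Bench codebase, but it shows how zero-partial-credit scoring behaves: one failed check zeroes an otherwise nearly perfect task.

```python
# Illustrative all-or-nothing scoring: every verification check must pass
# for a task to count; a single failure zeroes the whole task.
from typing import Callable, Dict, List

def score_task(checks: List[Callable[[], bool]]) -> int:
    """Return 1 only if every verification check passes, else 0."""
    return 1 if all(check() for check in checks) else 0

def score_suite(tasks: Dict[str, List[Callable[[], bool]]]) -> float:
    """Fraction of tasks in which all checks passed (no partial credit)."""
    passed = sum(score_task(checks) for checks in tasks.values())
    return passed / len(tasks)

# Hypothetical example: a firewall task with one missed step scores zero,
# no matter how many of its other checks succeeded.
tasks = {
    "configure-firewall": [lambda: True, lambda: False],  # one missed step
    "rotate-logs": [lambda: True, lambda: True],
}
print(score_suite(tasks))  # 0.5: the firewall task earns no partial credit
```

Under this rule, a model that gets 90% of the steps right on every task can still score zero across the suite, which is exactly the property that makes the benchmark unforgiving.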

arXiv, the nonprofit repository hosted at Cornell University and supported by a network of member institutions, distributes the paper but does not peer-review it. That distinction matters. The methodology is public and reproducible, but it has not yet cleared formal journal review. For practitioners, the upside is transparency: anyone can pull the 89-task suite, run a model against it, and compare results firsthand.

Where the evidence stands in April 2026

The 82.7% figure for GPT-5.5 has been widely cited across AI research channels, but it does not appear on any formal leaderboard maintained by the benchmark’s creators, and no official results table in the preprint itself lists model-specific scores. The number traces to secondary reporting, and as of late April 2026, neither OpenAI nor Anthropic has issued a statement confirming or contesting it. Claude Opus 4.7’s exact score on Terminal-Bench 2.0 has not been publicly disclosed, which means the “tops” framing rests on claims that have not been independently verified.

The task-by-task breakdown for each model is also missing from public sources. An overall score of 82.7% implies GPT-5.5 passed roughly 74 of 89 tasks and failed roughly 15, though no whole-number pass count out of 89 yields exactly 82.7%, which suggests the figure is rounded or averaged across multiple runs. Which tasks it failed is unknown. Did it stumble on database administration? Network diagnostics? Multi-step debugging chains? For an engineering team choosing between AI assistants for a specific workflow, those details matter far more than the aggregate number.
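A quick sanity check on the arithmetic is worth doing before quoting the number. The snippet below maps the reported aggregate score back to pass counts on an 89-task suite and confirms that no single whole-number result reproduces 82.7% exactly:

```python
# Sanity-check the reported aggregate score against an 89-task suite.
TOTAL_TASKS = 89
REPORTED = 0.827  # score as reported in secondary coverage

# Nearest whole-number pass count and the score it would actually produce
passes = round(REPORTED * TOTAL_TASKS)   # 74
exact = passes / TOTAL_TASKS             # ~0.831, not 0.827

# No integer pass count out of 89 rounds to exactly 82.7%,
# so the figure likely averages several runs or rounds differently.
no_exact_match = all(
    round(k / TOTAL_TASKS, 3) != REPORTED for k in range(TOTAL_TASKS + 1)
)
print(passes, TOTAL_TASKS - passes, no_exact_match)  # 74 15 True
```

That small discrepancy does not discredit the score, but it is one more reason to treat the headline figure as approximate until the creators publish a results table.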

Testing conditions add another layer of uncertainty. Benchmark scores can shift with prompt formatting, temperature settings, context-window usage, and whether the agent has access to tools beyond the terminal. The preprint defines the task suite and verification logic, but the exact configuration under which GPT-5.5 achieved 82.7% has not been independently documented. A fair head-to-head comparison requires identical conditions, and that equivalence has not been publicly confirmed for any model pairing on this benchmark.
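One way a team can guard against that uncertainty in its own runs is to record the full configuration alongside every score and only compare results whose configurations match. The sketch below is a hypothetical pattern, not part of the benchmark; every field name is illustrative:

```python
# Hypothetical pattern: store the exact run configuration with each score
# so head-to-head comparisons can be checked for identical conditions.
# All field names here are illustrative, not from Terminal-Bench.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunConfig:
    model: str
    temperature: float
    prompt_template: str
    context_window_tokens: int
    extra_tools: tuple  # tools beyond the bare terminal, if any

a = RunConfig("model-a", 0.0, "default-v1", 128_000, ())
b = RunConfig("model-b", 0.0, "default-v1", 128_000, ())

# Two scores are comparable only when every field except `model` matches.
comparable = asdict(a) | {"model": None} == asdict(b) | {"model": None}
print(comparable)  # True: identical conditions, fair comparison
```

Anything less, and a reported gap between two models may reflect the harness rather than the agents.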

Why the benchmark design matters more than the score

For enterprise buyers evaluating AI coding assistants, the architecture of the test is at least as important as any single number. A benchmark built on trivial file-rename operations would produce high scores that mean nothing in production. Terminal-Bench 2.0’s emphasis on hard, realistic tasks and its all-or-nothing verification model suggest it captures meaningful differences between agents. When a model fails a task here, the failure maps to something a human operator would notice and care about.

That said, a two- or three-point gap between models on an 89-task suite can flip depending on which task categories a team prioritizes. A model scoring 80% overall might outperform a higher-ranked rival on the exact slice of work your infrastructure demands. Until per-task results are published, aggregate scores are useful signals, not purchasing decisions.

The benchmark also arrives at a moment when the industry is wrestling with how to evaluate AI agents that operate autonomously in high-stakes environments. Traditional code-generation benchmarks test whether a model can write a function. Terminal-Bench tests whether an agent can navigate a live-feeling system, handle errors, and leave the environment in a correct state. That shift from code output to operational competence is why the benchmark has drawn attention from platform engineering and SRE teams, not just ML researchers.

What engineering teams should do with this

The most practical takeaway is not the headline number. It is the fact that Terminal-Bench 2.0 is open and reproducible. Engineering leaders can download the task suite from the preprint, adapt the verification scripts to their own infrastructure, and run GPT-5.5, Claude Opus 4.7, or any other agent under conditions that mirror their actual production workloads. That turns a reported benchmark score into internal, first-party data.

Organizations already using AI agents in terminal workflows should treat the 82.7% figure as a starting point for their own evaluation, not a final ranking. Add tasks that reflect your environment: your deployment tooling, your monitoring stack, your edge cases. Tighten the verification criteria where the default suite is too lenient for your risk tolerance. The benchmark’s open design makes this straightforward.
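Tightening the criteria can be as simple as appending site-specific checks to a task’s stock verification list, so a task that passes the defaults still fails if it misses a requirement unique to your environment. The following is a minimal sketch of that idea; the helper names and the `deploy.yaml` requirement are invented for illustration and are not drawn from the benchmark’s actual scripts:

```python
# Minimal sketch of tightening a task's verification criteria.
# Helper names and the deploy.yaml requirement are hypothetical.
import os
import tempfile
from typing import Callable, List

def passes_all(checks: List[Callable[[], bool]]) -> bool:
    """All-or-nothing: one failed criterion fails the whole task."""
    return all(check() for check in checks)

def tightened(checks: List[Callable[[], bool]],
              extra: List[Callable[[], bool]]) -> List[Callable[[], bool]]:
    """Append site-specific criteria to the stock verification list."""
    return checks + extra

# Demo: a task that passes the stock checks but fails a stricter,
# environment-specific criterion (a config file this team requires).
workdir = tempfile.mkdtemp()
stock = [lambda: True]  # stand-in for the suite's default checks
site_specific = [lambda: os.path.exists(os.path.join(workdir, "deploy.yaml"))]

print(passes_all(stock))                            # True
print(passes_all(tightened(stock, site_specific)))  # False: file is absent
```

Because the extra checks compose with the existing all-or-nothing rule, the benchmark’s scoring semantics stay intact while the bar rises to match your risk tolerance.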

Until OpenAI, Anthropic, or independent labs publish detailed, reproducible results with full task-level breakdowns and documented test configurations, the competitive narrative around GPT-5.5 and Claude Opus 4.7 on Terminal-Bench remains provisional. The benchmark gives the industry a concrete, testable framework for measuring agent capability in command-line environments. The scores will sharpen as more teams run the suite and share their findings. For now, the smartest move is to stop debating the leaderboard and start running the tests yourself.

*This article was researched with the help of AI, with human editors creating the final content.*