An internal Google memo, first circulated in early April 2026 and since described by multiple people familiar with its contents, reportedly acknowledges that Gemini CLI lags behind Anthropic’s Claude Code in agentic tool execution: the ability of an AI coding assistant to run commands, call external tools, and manage multi-step workflows without constant human hand-holding. The memo has not been published in full, and Google has not commented on it publicly. But three independent academic studies, all posted between March and April 2026 to arXiv, the preprint server run by Cornell University, lend weight to the core claim: when it comes to acting on a developer’s behalf, Claude Code is measurably more reliable than its rivals.
The distinction matters because the AI coding market has shifted. Generating a block of clean Python or JavaScript is table stakes now. The real differentiator is what happens next: Can the agent execute a shell command, check the result, adjust its plan, and move on to the next step without breaking things or asking for permission it doesn’t need? That orchestration layer is where the leaked memo reportedly says Google is falling short.
What three studies found
The strongest public evidence comes from a paper titled “Engineering Pitfalls in AI Coding Tools,” which presents a systematic analysis of publicly reported bugs filed against Claude Code, OpenAI’s Codex, and Gemini CLI. The paper does not disclose an exact sample size or use the word “thousands” to describe its dataset; it aggregates bug reports from public issue trackers and applies a classification framework to identify recurring failure patterns. The researchers found that failures cluster not around raw code generation but around tool invocation, command execution, and task orchestration. In plain terms, the agents stumble most when they have to do something, not just write something.
A second study, “Measuring the Permission Gate,” zeroed in on a specific Claude Code feature: its auto-mode safety system, which decides when the agent can execute a tool call on its own and when it should pause and ask the developer. The researchers ran structured stress tests and reported quantified error rates for both false positives and false negatives in that gating system. A false positive means the agent blocks a safe action and wastes the developer’s time. A false negative means it greenlights a risky one. The paper reports a false-positive rate below 5 percent and a false-negative rate below 2 percent for Claude Code’s permission gate under the test conditions described, a balance that neither Gemini CLI nor Codex has demonstrated in any published evaluation to date.
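The trade-off the researchers measured can be illustrated with a toy gate. The sketch below is purely hypothetical: the risk rules and command lists are invented for illustration and do not come from the paper or from Claude Code’s actual implementation.

```python
# Hypothetical permission gate: decide whether an agent may run a
# shell command automatically or must pause and ask the developer.
# These heuristics are illustrative, not any vendor's real logic.

RISKY_PATTERNS = ("rm ", "sudo ", "git push --force")
SAFE_PREFIXES = ("ls", "cat", "pytest", "git status", "git diff")

def gate(command: str) -> str:
    """Return 'auto' to run without asking, 'ask' to require approval."""
    if any(p in command for p in RISKY_PATTERNS):
        return "ask"    # potentially destructive: pause for the developer
    if command.startswith(SAFE_PREFIXES):
        return "auto"   # read-only or test commands: run freely
    return "ask"        # default to caution for anything unrecognized

# A false positive is a safe action routed to "ask" (wasted time);
# a false negative is a risky action routed to "auto" (real danger).
print(gate("pytest tests/"))   # auto
print(gate("rm -rf build/"))   # ask
```

Any real gate is far richer than a pattern list, but the evaluation question is the same: how often does it route safe actions to “ask” and risky actions to “auto”?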
The third paper, “Dive into Claude Code,” took a systems-level look at Claude Code’s architecture by reverse-engineering its publicly available TypeScript source code. (Anthropic open-sourced the codebase in 2025.) The researchers mapped four design principles (authority, safety, security, and reliability) to specific implementation choices, documenting how the agent decides when to invoke a tool and how it enforces boundaries on what it is allowed to do. That level of architectural transparency is something Gemini CLI and Codex have not offered to outside researchers.
What the memo does and doesn’t tell us
The leaked document has been described by secondary sources as an internal Google assessment, but its full text has not surfaced. No Google or Anthropic executive has addressed the specific performance comparisons on the record. That gap makes it hard to know whether Google treats the orchestration deficit as a short-term engineering fix or a deeper architectural problem.
There are other caveats worth flagging. The bug data in the first study comes from publicly reported issues, not from Google’s proprietary internal logs. It is entirely possible that Gemini CLI’s internal error tracking paints a different picture, but no counter-evidence has appeared. The arXiv papers also represent snapshots. Coding agents ship updates frequently, and performance can shift between releases. Without a follow-up evaluation or an official product roadmap disclosure, the current state of Gemini CLI’s orchestration layer as of late April 2026 is not independently confirmed.
The memo’s chain of custody raises its own questions. It has been attributed to Google’s internal systems, but the path from those servers to public circulation has not been documented. Readers should treat it as an attributed but unverified internal assessment, not as established fact.
Why orchestration is the new battleground
For developers weighing which agent to adopt, the practical lesson from all three studies is the same: the quality of a coding agent depends heavily on its orchestration layer, not just the language model underneath it. An agent that writes elegant code but fumbles a shell command, mishandles a permission check, or loses track of a multi-step sequence will cost a team hours and introduce real risk into a codebase.
Consider a routine task: refactoring a module, running the test suite, reading the failure output, and fixing the broken tests. A strong orchestration layer handles that loop autonomously. A weak one stalls at the permission gate, misreads the test output, or re-runs a command it already completed. The arXiv research consistently points to Claude Code as more disciplined in these sequences, with measurable advantages in permission gating and modular safety design.
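That refactor-test-fix loop can be sketched as a simple controller. Everything below is a simulation for illustration: run_tests and apply_fix are stubs standing in for real tool calls, and the loop structure is an assumption, not any vendor’s actual implementation.

```python
# Simulated orchestration loop: run the test suite, read the failure
# output, apply a fix, and retry -- stopping when the tests pass or a
# step budget runs out. The "agent" here is a stub for illustration.

def run_tests(state: dict) -> tuple[bool, str]:
    """Stand-in for invoking the real test runner."""
    if state["bugs_remaining"] == 0:
        return True, "all tests passed"
    return False, f"{state['bugs_remaining']} test(s) failing"

def apply_fix(state: dict, output: str) -> None:
    """Stand-in for the agent editing code based on test output."""
    state["bugs_remaining"] -= 1

def orchestrate(state: dict, max_steps: int = 10) -> str:
    for step in range(1, max_steps + 1):
        passed, output = run_tests(state)
        if passed:
            return f"done in {step} step(s)"
        apply_fix(state, output)   # adjust the plan, then loop again
    return "gave up: step budget exhausted"

print(orchestrate({"bugs_remaining": 2}))   # done in 3 step(s)
```

The failure modes the studies describe map onto this loop directly: a weak orchestrator stalls at the gate before run_tests, misparses the failure output, or re-runs a step it already completed instead of advancing.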
None of that means the gap is permanent. Google has deep engineering resources and a history of closing competitive deficits once they become visible priorities. Anthropic’s lead in agentic execution today does not lock in an advantage for next quarter, especially if Google ships targeted orchestration improvements in upcoming Gemini CLI releases. OpenAI, for its part, continues to iterate on Codex with its own agentic features.
Testing orchestration in your own workflows
The most useful step for any team evaluating these tools is to stop relying on benchmarks alone and start testing orchestration reliability in their own workflows. Run multi-step tasks that mirror real project conditions. Observe how each agent handles permission boundaries. Track failure rates over a week of actual use, not a single demo session. The academic research provides a strong starting framework, but production environments introduce variables no lab study can fully replicate.
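One lightweight way to do that tracking: record each multi-step task’s outcome and compute per-agent failure rates at the end of the week. A minimal sketch, with placeholder agent names and illustrative data:

```python
# Minimal failure-rate tracker for an in-house agent evaluation.
# Each record is (agent, task, succeeded). The data is illustrative.
from collections import defaultdict

runs = [
    ("agent-a", "refactor+test", True),
    ("agent-a", "dep-upgrade", False),
    ("agent-b", "refactor+test", True),
    ("agent-b", "dep-upgrade", True),
]

def failure_rates(runs):
    """Fraction of failed tasks per agent."""
    totals, failures = defaultdict(int), defaultdict(int)
    for agent, _task, ok in runs:
        totals[agent] += 1
        if not ok:
            failures[agent] += 1
    return {agent: failures[agent] / totals[agent] for agent in totals}

print(failure_rates(runs))   # {'agent-a': 0.5, 'agent-b': 0.0}
```

A week of records like these, drawn from real project tasks rather than demos, gives a team its own orchestration benchmark to set against the published numbers.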
The leaked memo, if accurate, confirms what the published research already suggests: in the race to build AI coding agents that developers actually trust, the model is only half the story. The other half is everything the agent does after it finishes writing code.
This article was researched with the help of AI, with human editors creating the final content.