Developers who rely on GPT-4.5 inside ChatGPT are reportedly set to lose access to the model on June 27, as OpenAI phases out one of its larger general-purpose systems. The retirement arrives at a moment when Microsoft and Google are each promoting specialized coding models, backed by newer and harder evaluation benchmarks that test whether AI agents can handle the kind of multi-file, long-horizon software tasks that real engineering teams face daily. The convergence of these events puts pressure on OpenAI to show that its next generation of models can keep pace in a race that has shifted from chatbot fluency to measurable coding performance.
What is verified so far
The strongest evidence available centers on two academic benchmarks that Microsoft and Google have used to support their respective coding-model claims. SWE-Bench Pro, described by its authors as a harder, more realistic benchmark, was designed to evaluate whether AI agents can resolve software engineering tasks that span multiple files and require changes consistent with a real pull-request workflow. The benchmark provides detailed methodological documentation on task design, evaluation criteria, and what counts as a “resolved” issue. Model vendors, including Microsoft with its MAI-Code-1-Flash, have cited SWE-Bench Pro results when making public performance claims.
On Google’s side, the company has pointed to MCP-Atlas, a benchmark that measures tool-use competency by running models against live MCP servers rather than simulated test harnesses. Google has described MCP-Atlas as a coding and agentic benchmark, and Gemini 3.5 Flash posted high scores on it. The distinction matters because tool-use accuracy across real servers tests a different skill set than simply generating correct code in isolation. It reflects the direction both companies are pushing: AI that can act inside a developer’s actual workflow, not just answer questions about code.
Both benchmarks were built to expose weaknesses that earlier, simpler evaluations missed. SWE-Bench Pro tasks require agents to edit across file boundaries and pass full test suites before a resolution is accepted. That means a model must not only understand the bug description but also navigate a large codebase, locate the relevant modules, and propose changes that integrate cleanly with existing architecture. MCP-Atlas, by contrast, focuses on whether a model can correctly invoke tools and chain actions through real server infrastructure, such as opening files, running commands, or interacting with version control via standardized protocols. Together, they represent a shift in how the industry measures coding ability, moving away from single-function correctness toward the kind of sustained, multi-step reasoning that enterprise development demands.
Within this verified frame, it is clear that Microsoft and Google are staking their reputations on benchmarks that more closely resemble day-to-day engineering work. Their public claims emphasize not just raw accuracy but the ability to operate as agents that can plan, execute, and verify complex tasks. That framing implicitly sets expectations for any successor to GPT-4.5: developers will expect OpenAI’s next models to be evaluated on the same or comparable benchmarks, under conditions that can be independently scrutinized.
What remains uncertain
Several key pieces of the story lack primary confirmation. No public OpenAI changelog, blog post, or official statement from the company has been surfaced in available reporting to confirm the exact June 27 retirement date for GPT-4.5 or to explain what migration path existing users will follow. Without that documentation, the specific date and the scope of the change should be treated as unconfirmed by primary evidence. Users who depend on GPT-4.5 for production workflows do not yet have a clear picture of whether they will be automatically shifted to a successor model, given a grace period, or simply cut off.
The competitive framing also carries gaps. Neither Microsoft nor Google has published the raw evaluation artifacts, such as resolved task lists or failure traces, that would let independent researchers verify the benchmark scores each company has promoted. SWE-Bench Pro and MCP-Atlas each define “success” differently: one measures pull-request-level resolution across multiple files, the other measures tool invocation accuracy against live servers. Comparing results across these two benchmarks without running the same model under identical agent scaffolding is not straightforward. A model that scores well on SWE-Bench Pro may struggle on MCP-Atlas, and vice versa, because the tasks test fundamentally different capabilities.
There is also no direct statement from OpenAI linking the GPT-4.5 retirement to competitive pressure from Microsoft or Google. The timing is suggestive, but inferring a causal connection without on-the-record confirmation from OpenAI would overstate what the evidence supports. It is equally possible that the retirement reflects internal resource allocation decisions, cost optimization, or preparation for a successor model that OpenAI has not yet announced. Until OpenAI publishes technical documentation, deprecation notices, or migration guides, any explanation for the retirement remains speculative.
Another unresolved question is how GPT-4.5 itself would perform on SWE-Bench Pro or MCP-Atlas under standardized conditions. OpenAI has not released benchmark results for GPT-4.5 on these specific evaluations, leaving a gap in direct comparability. Without such numbers, claims that GPT-4.5 is falling behind its rivals rest more on inference than on side-by-side data. Developers are left to extrapolate from anecdotal experience in ChatGPT and from older benchmarks that may not capture the complexity of modern coding tasks.
How to read the evidence
The two strongest pieces of evidence in this story are the benchmark papers themselves. SWE-Bench Pro and MCP-Atlas are both primary academic sources with published methodologies, which means their definitions of task difficulty, resolution criteria, and evaluation protocols can be independently reviewed. When Microsoft or Google cites a score on either benchmark, readers can check the original paper to understand what that score actually measures. That level of transparency is the baseline for credible performance claims.
What sits below that baseline is the promotional layer. Vendor blog posts and product announcements that reference benchmark scores without linking to raw results or reproducible evaluation logs are closer to marketing than to evidence. High scores on MCP-Atlas or SWE-Bench Pro are meaningful only if the evaluation conditions, including agent scaffolding, retry policies, and timeout settings, match what the benchmark authors specified. Without that detail, a reported score tells you what a company wants you to believe, not necessarily what the model can do in your environment.
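One way to make that comparison concrete is to record the conditions behind a reported score and diff them against the conditions the benchmark authors describe. The Python sketch below is illustrative only: the field names, the scaffold label, and every value are assumptions invented for the example, not settings taken from the SWE-Bench Pro or MCP-Atlas papers or from any vendor announcement.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class EvalConditions:
    """Settings that shape a benchmark score (field names are hypothetical)."""
    benchmark: str
    scaffold: str      # agent scaffolding used for the run
    max_retries: int   # retry policy
    timeout_s: int     # per-task timeout in seconds


def mismatches(reported: EvalConditions,
               reference: EvalConditions) -> Dict[str, Tuple[Any, Any]]:
    """Return each field where the two runs differ as (reported, reference)."""
    a, b = asdict(reported), asdict(reference)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


# Hypothetical values, invented purely for illustration.
reference = EvalConditions("SWE-Bench Pro", "reference-agent", 1, 1800)
reported = EvalConditions("SWE-Bench Pro", "reference-agent", 3, 1800)

diff = mismatches(reported, reference)
print(diff)  # {'max_retries': (3, 1)}
```

An empty result would suggest the two runs are at least nominally comparable; any non-empty result is a reason to ask the vendor for more detail before relying on the number.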
For developers trying to decide what this means for their own work, the practical question is narrow. If GPT-4.5 disappears from ChatGPT on or around June 27, the immediate step is to identify which tasks currently depend on that model and test alternatives before the cutoff. That may involve running small, representative workloads, such as code review prompts, refactoring requests, or bug-reproduction sessions, against other OpenAI models and, where possible, against competitors that publish at least partial benchmark details. The goal is not to replicate SWE-Bench Pro or MCP-Atlas in-house, but to approximate your own production patterns well enough to see whether a replacement model behaves acceptably.
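A small in-house comparison of that kind can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Case` structure, the acceptance checks, and the two stub functions stand in for whatever client code a team already uses to call a model. No real vendor API is invoked, and the stub replies are hard-coded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

ModelFn = Callable[[str], str]  # takes a prompt, returns the model's reply


@dataclass(frozen=True)
class Case:
    name: str
    prompt: str
    accept: Callable[[str], bool]  # True if the reply is acceptable


def pass_rates(cases: List[Case], models: Dict[str, ModelFn]) -> Dict[str, float]:
    """Run every case against every model and return the fraction accepted."""
    rates: Dict[str, float] = {}
    for label, ask in models.items():
        passed = 0
        for case in cases:
            try:
                reply = ask(case.prompt)
            except Exception:
                continue  # a failed call counts as a miss
            if case.accept(reply):
                passed += 1
        rates[label] = passed / len(cases) if cases else 0.0
    return rates


# Hypothetical stand-ins for real model calls; replace with your own client.
def current_stub(prompt: str) -> str:
    return "Possible bug here; also renamed x to total."


def candidate_stub(prompt: str) -> str:
    return "Looks fine; renamed x to total."


CASES = [
    Case("review", "Review: def add(a, b): return a - b",
         lambda reply: "bug" in reply.lower()),
    Case("refactor", "Rename the variable x to total.",
         lambda reply: "total" in reply),
]

rates = pass_rates(CASES, {"current": current_stub, "candidate": candidate_stub})
print(rates)  # {'current': 1.0, 'candidate': 0.5}
```

String checks like these are deliberately crude. In practice the acceptance function would more likely run a test suite or a reviewer's checklist, but the shape of the comparison stays the same: identical cases, identical checks, one pass rate per model.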
Interpreting benchmark claims also means keeping their scope in perspective. SWE-Bench Pro emphasizes end-to-end bug fixing in open-source repositories, which may overrepresent certain languages, frameworks, or project structures. MCP-Atlas centers on tool use against live servers, which may align more closely with organizations that have already invested in MCP-compatible infrastructure. Neither benchmark directly measures softer but still critical dimensions of developer experience, such as how well a model explains its reasoning, how predictable its failure modes are, or how easily engineers can debug misbehavior.
Finally, the absence of clear communication from OpenAI about GPT-4.5’s retirement is itself a data point. In a landscape where rivals are tying their narratives to transparent, externally reviewed benchmarks, the company’s silence leaves customers to fill in the gaps with their own risk assessments. Until OpenAI publishes concrete timelines and migration guidance, the prudent course for teams that depend on GPT-4.5 is to assume change is coming, validate backup options, and treat vendor claims, no matter who makes them, as starting points for their own testing rather than as definitive answers.
*This article was researched with the help of AI, with human editors creating the final content.