
UK and U.S. regulators jointly assessed the risks of Anthropic’s latest AI model before release

Before Anthropic released its upgraded Claude 3.5 Sonnet to the public, government safety teams in both the United Kingdom and the United States had already stress-tested it. Working under a formal agreement signed months earlier, evaluators from the UK AI Safety Institute and the U.S. AI Safety Institute probed the model for biological, cyber, and software-development risks, tested its built-in safeguards, and handed their findings to Anthropic ahead of launch.

It was the first time the two governments had jointly reviewed a frontier AI system before deployment under a structured bilateral arrangement, according to a summary published by the National Institute of Standards and Technology. The collaboration grew out of a Memorandum of Understanding that NIST’s AI Safety Institute signed with Anthropic in August 2024, establishing a recurring framework for safety research, testing, and evaluation; a parallel agreement with OpenAI was announced at the same time. While the Claude 3.5 Sonnet review involved only Anthropic’s model, the parallel OpenAI agreement signals that similar pre-deployment evaluations could extend to other frontier developers on comparable terms.

As of spring 2026, the episode stands as a reference point for how AI oversight is shifting from voluntary industry pledges toward structured, cross-border scrutiny with real technical teeth.

What the joint evaluation actually involved

The distinction that matters most is the word “pre-deployment.” Regulators did not review marketing materials or accept self-reported safety scores. They ran their own adversarial evaluations against the model before it shipped, testing across four domains: biological risks, cyber threats, software and AI development capabilities, and the effectiveness of built-in safeguards.

That approach marked a concrete departure from the voluntary commitments AI companies made at the Bletchley Park AI Safety Summit in November 2023 and the Seoul AI Summit in May 2024. Those earlier pledges were broad, with vague timelines and no enforcement mechanism. Under the Memorandum of Understanding, testing is tied to specific model releases, and results go back to the developer with enough lead time to act on them before public launch.

On the UK side, the AI Safety Institute has been building independent evaluation methods in parallel. A technical report published on arXiv by the institute’s research team describes its approach to alignment evaluation through an applied case study built around a coding-assistant scenario. The study tests whether advanced AI systems reliably pursue their intended goals when placed in realistic work environments, probing specifically for sabotage and subtle misalignment behaviors. Its evaluation design draws on earlier published research on misalignment risks.

The coding-assistant scenario is not hypothetical hand-wringing. As companies integrate AI tools into software development pipelines, the risk that a misaligned system could subtly undermine the code it is supposed to help write becomes a supply-chain concern, not just an academic one.

What the public still does not know

For all the procedural transparency, the substance of the evaluation remains largely hidden. No public record from either safety institute discloses what the Claude 3.5 Sonnet review actually found. NIST confirmed the testing domains and the fact that results were shared with Anthropic, but whether the model passed, failed, or raised red flags in any specific area has not been made public.

Anthropic’s response is equally opaque. The company received the findings before release, but there is no public accounting of what, if anything, it changed in the model as a result. Without access to the raw evaluation data or Anthropic’s remediation steps, outside observers cannot determine whether the testing led to meaningful safety improvements or functioned primarily as an informational exercise. Anthropic’s own Responsible Scaling Policy, which commits the company to pausing deployment if internal evaluations show a model crossing certain capability thresholds, has not been publicly reconciled with the government review’s findings.

The UK institute’s arXiv case study, while technically rigorous, has not been publicly linked to specific results from the Claude 3.5 Sonnet evaluation. The methods it describes apply to a general coding-assistant scenario, not to a named commercial product. Whether those exact methods were used during the joint pre-deployment testing, or whether they represent a separate research track, remains unconfirmed.

The most recent public update on the joint testing dates to late 2024. Neither safety institute has confirmed whether the same bilateral process has been applied to newer Anthropic models or whether the evaluation framework has been revised since that initial round.

How this fits the broader regulatory picture

The UK-U.S. collaboration did not emerge in a vacuum. The political groundwork was laid at the Bletchley Park summit, where 28 countries and the European Union signed a declaration acknowledging frontier AI risks, and extended at the Seoul summit, where leading AI companies made more specific safety commitments. The bilateral testing of Claude 3.5 Sonnet was, in effect, the first operational product of those diplomatic efforts.

Meanwhile, the European Union’s AI Act, which entered into force in August 2024 and began phased application in 2025, takes a different approach, classifying AI systems by risk level and imposing binding obligations on developers of high-risk and general-purpose models. The UK has so far avoided that kind of prescriptive legislation, opting instead for sector-specific regulation and the kind of technical evaluation capacity the AI Safety Institute represents. The U.S. approach under NIST similarly relies on voluntary frameworks and bilateral agreements rather than statutory mandates, though political pressure for binding rules continues to build in Congress.

For readers comparing these approaches, the key difference is enforcement. The EU framework carries legal penalties. The UK-U.S. model, at least as it operated during the Claude 3.5 Sonnet review, relies on cooperation and reputational incentives. Whether that softer approach can keep pace as AI systems grow more capable is one of the central questions facing regulators on both sides of the Atlantic.

Why the transparency gap now defines the debate

The strongest conclusion the public record supports is about process, not performance. Governments have demonstrated they can run independent, pre-deployment evaluations of frontier AI models and coordinate those efforts across borders. They are building increasingly sophisticated tools to probe for misalignment, sabotage, and misuse risks. What they have not yet demonstrated is a willingness to disclose what those evaluations find or to show that testing results translate into enforceable changes.

That transparency gap matters. As long as it persists, the public is asked to trust that the system works based on the existence of the system itself. For AI companies, the signal is clear: safety evaluations are moving from voluntary, optics-driven exercises to structured engagements with regulators that could eventually shape product timelines and design decisions. For policymakers, the Claude 3.5 Sonnet review offers a working prototype of international AI oversight, one that now needs accountability mechanisms to match its technical ambitions.


*This article was researched with the help of AI, with human editors creating the final content.