Morning Overview

The U.S. Commerce Department is now testing the newest AI models from Google, Microsoft, and xAI in classified conditions before the public ever sees them

Somewhere inside a secure federal facility, government evaluators are probing the next generation of AI systems from Google, Microsoft, and Elon Musk’s xAI for the ability to help build biological weapons, breach critical infrastructure, or synthesize chemical agents. The testing happens before any of these models reach a single consumer device. And as of June 2026, this classified screening pipeline is no longer a pilot program or a policy aspiration. It is operational.

The Commerce Department’s Center for AI Standards and Innovation, known as CAISI, has signed formal agreements with all three companies to conduct pre-release national security evaluations of their most powerful AI models. The arrangement places a federal agency in a role that did not exist three years ago: a frontline reviewer of commercial AI systems, screening them for catastrophic risks before the public ever interacts with them.

How the testing pipeline works

CAISI sits within the National Institute of Standards and Technology and traces its origins to the U.S. AI Safety Institute, which the Commerce Department established after President Biden signed Executive Order 14110 in October 2023. In June 2025, Commerce Secretary Howard Lutnick rebranded the institute as CAISI, shifting its stated mission toward pro-innovation standards while preserving its authority to test frontier models. Lutnick described CAISI as “industry’s primary point of contact for testing, collaborative research, and best practices related to commercial AI systems.”

Under the agreements, Google DeepMind, Microsoft, and xAI grant CAISI access to frontier models before public release. Evaluators then test those systems inside secure government environments across three core risk domains: cybersecurity, biosecurity, and chemical weapons. These are not informal consultations or handshake arrangements. They are structured agreements in which results are shared across federal agencies with direct national security responsibilities.

That cross-agency coordination runs through the TRAINS Taskforce, an interagency body the Commerce Department established to manage AI model research and testing across the federal government. TRAINS covers evaluation in domains including chemical, biological, radiological, and nuclear threats, along with cybersecurity, critical infrastructure, and conventional military capabilities. The Defense Department, the Energy Department, and the intelligence community all participate. When CAISI flags a concern in a model from Google or xAI, the finding does not stay in one office. It feeds into a network of agencies equipped to assess whether a capability poses a genuine national security threat.

What the DeepSeek evaluation revealed about CAISI’s methods

The technical methods CAISI applies behind closed doors are partly visible through its recent public work. In May 2026, the center published an evaluation of DeepSeek V4, the Chinese AI company’s latest frontier model. The report showed CAISI using PortBench, an internally built evaluation tool, alongside held-out datasets designed to test capabilities that standard public benchmarks miss. The evaluation included targeted probes of code generation, system access, and scientific reasoning, offering a concrete glimpse of how the center might examine whether a model could assist in offensive cyber operations or weapons development.

Those same evaluation methods, or close variants, likely form the backbone of the classified testing now applied to models from U.S. companies, although CAISI has not publicly confirmed which benchmarks it uses in that setting. The DeepSeek report is significant not just for its findings but for what it reveals about CAISI’s operational maturity: the center is not merely drafting frameworks. It is running real evaluations with proprietary tools.

A separate initiative addresses the practical tension between rigorous government testing and corporate intellectual property protection. In March 2026, CAISI signed a cooperative research and development agreement with OpenMined, a privacy-technology organization, to build secure protocols for handling sensitive model evaluations. The partnership explores techniques such as secure enclaves, differential privacy, and tightly controlled API access, all aimed at letting government evaluators probe a company’s most advanced system without exposing proprietary model weights or training data.

The OpenAI question and other gaps

One notable absence from the announced agreements is OpenAI, the maker of ChatGPT and one of the most widely deployed AI systems in the world. OpenAI is not named alongside Google DeepMind, Microsoft, and xAI in the current round of CAISI arrangements. The omission is striking in part because Microsoft, which is named, is OpenAI’s largest investor and cloud infrastructure provider. Whether Microsoft’s agreement covers OpenAI-derived models, or whether a separate arrangement exists or is pending, has not been publicly clarified. The gap leaves open the question of whether some of the most popular consumer AI products are undergoing the same pre-release national security review as their competitors.

Other critical details remain undisclosed. No public CAISI document identifies the specific model versions currently under review from any of the three companies. The exact evaluation criteria and scoring rubrics have not been published, and it is unclear whether CAISI applies the same PortBench-style benchmarks uniformly or tailors its approach to each model’s architecture and training data.

The consequences of a negative finding are equally opaque. Official records do not specify what happens if CAISI identifies a serious risk in a pre-release model. There is no public description of legal or contractual penalties a company would face for releasing a model before CAISI cleared it. The agreements appear to be voluntary rather than regulatory mandates, which raises a straightforward enforcement question: if a company disagrees with CAISI’s assessment, what recourse does either side have? Available documentation describes no dispute resolution process and no appeals mechanism.

How classified findings flow back to the companies is another open question. Federal agencies routinely handle sensitive information through compartmented channels, but the specific protocols governing how CAISI communicates test results to private-sector developers have not been made public. The balance between revealing enough detail to enable a company to fix a problem and withholding operationally sensitive information, such as concrete exploit chains or step-by-step synthesis instructions, has not been described in any available document.

Why this pipeline may outlast the models it tests

The most consequential aspect of CAISI’s work may not be any single evaluation but the institutional infrastructure it is building. By positioning itself as the default pre-release reviewer for frontier AI, the Commerce Department is creating a role that could outlast any single administration or generation of models. The benchmarks CAISI develops, the risk thresholds it applies, and the interagency relationships it cultivates through TRAINS will shape how future systems are judged long after today’s large language models are superseded by new architectures.

That structural permanence cuts in two directions. For companies, participation offers a way to demonstrate due diligence to regulators and international partners while gaining early feedback on how their systems might behave under adversarial conditions. For the government, the pipeline provides a standing capability to assess emerging risks without needing to build a new institution each time the technology shifts.

But the arrangement also carries risks of its own. Once CAISI is embedded as a routine checkpoint for frontier systems, pressure may grow to expand evaluations beyond the current narrow focus on national security into adjacent areas such as misinformation, election interference, or economic disruption. No public framework specifies how new risk domains would be added, who would decide, or how industry and civil society groups would be consulted. The United Kingdom’s AI Security Institute (formerly the AI Safety Institute) and the European Union’s AI Act both offer alternative models for government oversight of frontier systems, but neither involves the same kind of classified pre-release testing that CAISI now conducts.

For now, the most honest assessment is that CAISI’s testing infrastructure is real, operational, and growing, but its effectiveness remains unproven in any publicly verifiable way. No public record shows that the center has caught a dangerous capability in a pre-release model or that a company changed its product as a result of CAISI feedback. The classified nature of the work makes independent assessment difficult by design. Whether this pipeline becomes a meaningful safety check or primarily a trust-building exercise between Washington and Silicon Valley is a question that may take years to answer, and one the public may never be able to fully evaluate from the outside.

*This article was researched with the help of AI, with human editors creating the final content.