OpenAI, AMD, Nvidia, Intel, Microsoft, and Broadcom release an open protocol to stop GPU clusters from crashing during large-scale AI training

Training a frontier AI model means keeping thousands of GPUs synchronized for weeks on end. When a single network link fails, the entire job can crash, wiping out days of compute that cost millions of dollars. In May 2026, OpenAI and five of the biggest names in computing announced a fix: an open networking protocol called Multipath Reliable Connection, or MRC, designed to detect broken data paths and reroute traffic in microseconds so training runs keep going instead of starting over.

The six companies behind MRC are OpenAI, AMD, Nvidia, Intel, Microsoft, and Broadcom. That lineup is striking because it includes direct competitors in the GPU, chip, and cloud markets. Their willingness to collaborate reflects how severe the network reliability problem has become. As The Decoder reported, network congestion and link failures have emerged as key constraints on scaling AI training clusters, and no single vendor can solve the problem alone.

Why network failures are so expensive

Large-scale AI training is inherently fragile. Meta disclosed in its 2024 technical report on training Llama 3 that the company experienced over 400 hardware and infrastructure failures during a single 54-day training run across 16,384 GPUs. Each failure required intervention, and many forced partial or full restarts. The standard defense is aggressive checkpointing, where the system periodically saves the model’s state to storage so it can roll back to the last good snapshot after a crash. But checkpointing itself consumes time and compute, and it cannot recover the work done between the last checkpoint and the failure.
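
To make that trade-off concrete, here is a minimal sketch of what periodic checkpointing looks like in a PyTorch-style training loop. The interval, file path, and saved fields are illustrative choices, not details from Meta's setup or from any system named in this article.

```python
import torch

def train(model, optimizer, data_loader, ckpt_path="checkpoint.pt", ckpt_every=1000):
    """Toy training loop with periodic checkpointing (illustrative only).

    After a crash, a restart reloads the last snapshot and repeats every
    step taken since it was written; that repeated work is the cost
    checkpointing cannot eliminate.
    """
    step = 0
    try:  # resume from the last good snapshot if one exists
        state = torch.load(ckpt_path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        step = state["step"]
    except FileNotFoundError:
        pass

    for batch, target in data_loader:
        loss = torch.nn.functional.cross_entropy(model(batch), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step % ckpt_every == 0:  # pause training to write a snapshot
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, ckpt_path)
```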

Network outages are among the most disruptive failure types because distributed training depends on constant, high-speed communication between GPUs. A broken link does not just slow things down. It breaks the synchronization that the entire training run relies on. Traditional network recovery mechanisms can take seconds to detect and reroute around a failed path. In a tightly coupled distributed system, that delay is more than enough to crash the job.

How MRC works

MRC operates at the network transport layer. According to OpenAI’s description, the protocol monitors multiple data paths simultaneously and, when one fails, switches traffic to an alternate route in microseconds rather than seconds. That speed is the core value proposition: by rerouting fast enough to stay within the tolerance window of distributed training frameworks, MRC can prevent the cascade of desynchronization that typically forces a restart.
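
OpenAI has not published MRC's internals, so the protocol itself cannot be shown here. The Python sketch below is only a conceptual picture of what transport-level multipath failover means: traffic prefers one path and moves to an alternate the moment a failure is observed, rather than waiting for the network to reconverge. Every name in it is invented for illustration.

```python
"""Conceptual illustration only. MRC's actual design has not been
published; the classes and names below are invented to show the general
idea of failing over across redundant paths."""

class Path:
    """Stand-in for a physical link or queue pair."""
    def __init__(self, name):
        self.name = name
        self.up = True

    def transmit(self, payload):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"sent {len(payload)} bytes via {self.name}"


class MultipathSender:
    """Send on the preferred path; fall back to the next healthy one as
    soon as a failure is observed, instead of waiting for routing-level
    reconvergence."""
    def __init__(self, paths):
        self.paths = list(paths)

    def send(self, payload):
        for path in self.paths:          # preference order
            try:
                return path.transmit(payload)
            except ConnectionError:
                continue                 # immediate failover to the next path
        raise RuntimeError("no healthy path available")


links = [Path("plane-0"), Path("plane-1")]
sender = MultipathSender(links)
print(sender.send(b"gradients"))         # goes out on plane-0
links[0].up = False                      # simulate a link failure
print(sender.send(b"gradients"))         # rerouted to plane-1
```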

The protocol is designed to layer on top of existing data center interconnects rather than replace them. Operators who already manage large GPU clusters would not need to rip out their networking hardware or redesign their physical topology. Instead, MRC adds multipath awareness and rapid failover as a software-level capability that integrates with current high-speed fabrics.

Microsoft, which operates the Azure cloud infrastructure where much of OpenAI’s training takes place, has described MRC’s goals in operational terms. In a statement accompanying the announcement, Microsoft’s Azure engineering team said the protocol is intended to deliver “more resilient and predictable training at very large scale, better utilization of expensive GPU infrastructure, and automatic adaptation at machine speed.” Those priorities map directly onto the economics of AI training, where fewer crashes translate to lower cost per run.

OpenAI framed the effort as essential to the next generation of AI systems. “As we scale to larger and larger training runs, network reliability becomes a first-order engineering challenge,” the company wrote in its announcement of MRC. “We built this protocol in the open because the problem affects everyone building at this scale.”

Crucially, MRC is being released as an open standard rather than a proprietary tool tied to one vendor’s hardware. Modern AI supercomputers often combine components from multiple manufacturers, mixing Nvidia GPUs with AMD accelerators, Intel networking hardware, and Broadcom switches. A protocol locked to a single ecosystem would have limited reach. An open standard can, in principle, work across these heterogeneous environments.

How MRC relates to existing multipath networking

Multipath networking is not a new concept. Data center operators have long used techniques like Equal-Cost Multi-Path routing (ECMP) to spread traffic across parallel links, and the broader networking world has experimented with Multipath TCP (MPTCP) to allow a single connection to use multiple network paths simultaneously. The Ultra Ethernet Consortium, a separate industry group, has also been working on next-generation Ethernet specifications designed to handle the bursty, latency-sensitive traffic patterns that AI workloads produce.

MRC appears to occupy a different niche. Where ECMP distributes flows across paths at the routing level and MPTCP operates at the general-purpose transport layer, MRC is specifically engineered for the communication patterns of distributed AI training, where collective operations like all-reduce must complete across thousands of endpoints within tight timing windows. The protocol’s microsecond failover target reflects the fact that even brief interruptions in these collective operations can desynchronize an entire training run, a failure mode that general-purpose multipath solutions were not designed to prevent.
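
That timing sensitivity is visible in how training frameworks already behave. In the sketch below, launched with torchrun across several processes, every rank must participate in the all-reduce before it can complete; if one rank's traffic stalls past the process group's watchdog timeout, the framework raises an error and the job dies. The 30-second timeout and the gloo backend are arbitrary choices for a runnable illustration, not values from any production cluster.

```python
# Minimal illustration of why collectives are timing-sensitive.
# Launch with, for example:  torchrun --nproc_per_node=4 allreduce_demo.py
from datetime import timedelta

import torch
import torch.distributed as dist

# Every rank joins the same process group. The timeout acts as a watchdog:
# if a collective does not complete within it (for example because one
# rank's network path has failed), the framework errors out rather than
# hanging forever.
dist.init_process_group(backend="gloo", timeout=timedelta(seconds=30))

grad = torch.ones(1024) * dist.get_rank()

# All ranks must reach this call and exchange data for it to return.
# One stalled participant stalls everyone; a stall longer than the
# timeout kills the run, which is the restart scenario described above.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)

print(f"rank {dist.get_rank()} sees sum {grad[0].item()}")
dist.destroy_process_group()
```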

Whether MRC builds directly on any of these existing standards, or is an entirely new transport-layer design, has not been clarified in the available announcements. The relationship between MRC and the Ultra Ethernet Consortium’s ongoing work is also unspecified. It is possible that MRC could eventually be folded into or aligned with broader industry Ethernet standards, but no such roadmap has been published as of June 2026.

What is still missing

For all the weight behind the announcement, several important details remain undisclosed. Neither OpenAI nor any of its partners have published benchmarks showing how much MRC actually reduces training failure rates or improves end-to-end training throughput. The microsecond rerouting capability has been described in general terms but has not been tied to a specific test environment, workload size, or measurement methodology in any publicly available technical paper.

The four hardware partners, AMD, Nvidia, Intel, and Broadcom, have not issued public statements detailing their specific contributions to the protocol or their timelines for integrating MRC into shipping products. That leaves open the question of where MRC will appear first. Cloud environments like Azure, where OpenAI already runs large training jobs, are the most likely early deployment targets. But whether enterprises operating on-premises clusters will have access, and on what timeline, remains unclear.

The open-source licensing terms and governance structure for MRC have also not been detailed. Whether outside developers and smaller hardware vendors can freely adopt and modify the protocol, or whether participation requires membership in a formal consortium, will determine how broadly MRC spreads beyond the six founding companies.

There is also an open question about how MRC interacts with existing reliability strategies. If the protocol reduces the frequency of network-induced crashes, operators might be able to checkpoint less aggressively, freeing up compute that currently goes toward saving model states. But no one involved has said whether that is an intended outcome or just a theoretical side benefit.
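
The arithmetic behind that possibility can be sketched with the classic Young/Daly approximation, which puts the optimal checkpoint interval near the square root of twice the checkpoint cost times the mean time between failures. The numbers below are invented purely to show the direction of the effect; none of them come from the MRC announcement.

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young/Daly first-order approximation: sqrt(2 * C * MTBF)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

checkpoint_cost = 5 * 60      # 5 minutes to write a snapshot (invented)
mtbf_today      = 3 * 3600    # a fatal failure every ~3 hours (invented)
mtbf_with_mrc   = 12 * 3600   # hypothetically 4x fewer fatal failures

for label, mtbf in [("today", mtbf_today),
                    ("with fewer fatal failures", mtbf_with_mrc)]:
    interval = optimal_checkpoint_interval(checkpoint_cost, mtbf)
    overhead = checkpoint_cost / interval
    print(f"{label}: checkpoint every {interval / 3600:.1f} h, "
          f"~{overhead:.1%} of time spent writing snapshots")
```

Under those made-up numbers, quadrupling the mean time between fatal failures roughly halves the share of time spent writing snapshots, which is the kind of second-order saving the paragraph above describes.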

What MRC signals about the future of AI training networks

The emergence of MRC marks a shift in how the industry thinks about AI training bottlenecks. For years, the conversation centered on GPU supply: who could get enough chips, and how fast. That constraint has not disappeared, but as training runs have scaled to tens of thousands of GPUs running for weeks at a time, the network connecting those GPUs has become just as critical. A cluster with 100,000 GPUs is only as reliable as the weakest link in its interconnect fabric.

An open protocol backed by six major players does not solve every reliability problem in AI supercomputing. But it represents a concrete acknowledgment that network failures are no longer an acceptable cost of doing business. For organizations running large distributed training jobs, the practical impact will depend on how quickly MRC moves from announcement to deployment. If it ships as part of standard networking stacks on Azure and major GPU platforms in the coming months, it could meaningfully reduce the cost and complexity of training runs that currently depend on elaborate checkpoint-and-restart procedures.

If adoption is slower or limited to specific hardware configurations, the benefits will be narrower, accruing first to customers of the cloud providers and hardware vendors that move fastest. Either way, the fact that OpenAI, AMD, Nvidia, Intel, Microsoft, and Broadcom all agreed this problem was worth solving together says something about where AI infrastructure is headed: toward treating the network not as plumbing, but as a first-class component of the training pipeline.

*This article was researched with the help of AI, with human editors creating the final content.