📑 Table of Contents

Nvidia, AMD, Intel, Broadcom Unite to Fix GPU Waste

📅 · 📁 Industry · 👁 8 views · ⏱️ 12 min read
💡 OpenAI and 5 chip giants launch MRC protocol to slash GPU compute waste in AI supercomputers, already deployed across all OpenAI training clusters.

OpenAI and Chip Giants Launch Protocol to End GPU Compute Waste

OpenAI has joined forces with Nvidia, AMD, Intel, Broadcom, and Microsoft to release a groundbreaking open network protocol called MRC (Multi-path Reliable Connection) that promises to dramatically reduce wasted GPU compute in large-scale AI training clusters. The protocol, published through the Open Compute Project (OCP), is already deployed across every supercomputer OpenAI uses to train its frontier models — and it could reshape how the entire industry builds AI infrastructure.

The announcement, made on May 6, arrives at a critical moment. As AI companies race to build ever-larger training clusters with tens of thousands of GPUs, network reliability has become one of the biggest bottlenecks — and one of the largest hidden costs — in the industry. A single network switch failure can halt training runs that cost millions of dollars per day.

Key Takeaways at a Glance

  • MRC splits single data transfers across hundreds of network paths, rerouting around failures in microseconds
  • The protocol is built into the latest 800 Gb/s network interfaces
  • It is already live on all OpenAI training supercomputers, including the Oracle Cloud Infrastructure (OCI) site in Abilene, Texas and Microsoft's Fairwater supercomputer
  • 5 major chip companies — Nvidia, AMD, Intel, Broadcom, and Microsoft — co-developed the standard
  • MRC simplifies network control plane architecture, reducing operational complexity at every layer of the stack
  • The protocol enabled OpenAI to restart core switches without coordinating with training operations teams

Why GPU Compute Waste Is a $Billion Problem

Modern AI training clusters are staggeringly expensive. A single cluster running tens of thousands of Nvidia H100 or B200 GPUs can cost hundreds of millions of dollars to build, with daily operating costs reaching into the millions. When a network link fails — even briefly — the entire distributed training job can stall, wasting compute across every GPU in the cluster.

Traditional network protocols were never designed for this scale. They typically route data along a single path, meaning one failed link can cascade into a cluster-wide interruption. Recovery often requires manual intervention from operations teams, adding hours of downtime to already tight training schedules.

OpenAI experienced this firsthand. Over the past several years, the company built and maintained 3 generations of supercomputers with its partners before embarking on its massive Stargate infrastructure project. Each generation taught the same lesson: efficiently utilizing compute and successfully completing training runs requires dramatically reducing complexity at every layer of the stack — including a fundamental redesign of the network itself.

How MRC Works: Hundreds of Paths, Microsecond Recovery

At its core, MRC introduces 3 key innovations that distinguish it from conventional network protocols:

  • Multi-path data distribution: Instead of sending data along a single route, MRC splits each transfer across hundreds of parallel paths simultaneously. This maximizes bandwidth utilization and eliminates single points of failure.
  • Microsecond-level fault avoidance: When a link goes down, MRC detects the failure and reroutes traffic in microseconds — far faster than traditional protocols, which can take seconds or even minutes to reconverge.
  • Simplified control plane: By pushing intelligence into the 800 Gb/s network interface cards (NICs) themselves, MRC reduces the complexity of the network control plane, making clusters easier to manage and less prone to configuration errors.

The protocol is embedded directly into the latest generation of 800 Gb/s network interfaces, meaning it operates at the hardware level rather than as a software overlay. This tight integration is what enables the microsecond-scale response times that make MRC practical for latency-sensitive distributed training workloads.

Real-World Impact: Restarting Switches Without Breaking Training

Perhaps the most compelling evidence of MRC's effectiveness comes from OpenAI's own operations. In a blog post, the company described a recent scenario that would have been unthinkable under traditional network architectures.

While training a frontier large model for ChatGPT and Codex, OpenAI's infrastructure team needed to restart 4 Tier-1 core switches. Under previous setups, restarting core switches was an extremely delicate operation. It required intense coordination between network engineers and the team managing the active training run, careful scheduling during maintenance windows, and a high risk of disrupting the training job.

With MRC deployed, the infrastructure team simply restarted the switches — without even notifying the training operations team in advance. The training job continued running without interruption as MRC automatically rerouted traffic across alternative paths during the brief switch outages. This kind of operational resilience represents a paradigm shift for AI infrastructure management.

An Unprecedented Alliance of Competitors

What makes this announcement particularly remarkable is the coalition behind it. Nvidia, AMD, Intel, and Broadcom are fierce competitors in the semiconductor market, yet all 4 have co-signed this open standard alongside Microsoft and OpenAI.

This unusual alliance reflects a shared recognition across the industry:

  • Network bottlenecks hurt everyone. Whether a cluster runs on Nvidia GPUs, AMD Instinct accelerators, or Intel Gaudi chips, network failures waste compute equally.
  • Proprietary solutions fragment the ecosystem. If each vendor develops its own reliability protocol, interoperability suffers and costs rise for data center operators.
  • Open standards accelerate adoption. By publishing MRC through the Open Compute Project, the coalition ensures that cloud providers, enterprises, and startups can all benefit — and contribute improvements.
  • Scale demands collaboration. The next generation of AI training clusters will contain hundreds of thousands of accelerators. No single company can solve the networking challenges alone.

The involvement of Broadcom is especially noteworthy. As the dominant supplier of data center networking ASICs and switch silicon, Broadcom's participation signals that MRC support will likely be built into the networking hardware that underpins most major cloud and enterprise data centers worldwide.

Industry Context: The Infrastructure Arms Race Intensifies

MRC arrives amid an unprecedented infrastructure buildout across the AI industry. OpenAI's Stargate project, announced in partnership with SoftBank and Oracle, envisions up to $500 billion in data center investment across the United States. Microsoft is simultaneously expanding its global data center footprint at record pace. Meta, Google, and Amazon are all racing to deploy clusters with 100,000+ GPUs.

At this scale, even small improvements in network reliability translate into enormous savings. If MRC reduces wasted compute by just 1-2% across a $1 billion training cluster, it could save tens of millions of dollars per training run. Multiply that across dozens of frontier model training campaigns per year, and the economic impact becomes substantial.

The protocol also aligns with a broader industry trend toward disaggregated and composable infrastructure. As AI workloads grow, operators increasingly need the ability to dynamically reconfigure clusters, swap out failed components, and perform maintenance — all without halting production workloads. MRC's ability to transparently reroute traffic makes these operations far more practical.

What This Means for Developers and Cloud Customers

For AI researchers and engineers who rent GPU clusters from cloud providers, MRC's benefits will largely be invisible — and that is exactly the point. Training runs will complete more reliably, with fewer interruptions and less wasted compute time. Costs per training run should decrease as utilization rates improve.

For cloud providers and data center operators, the implications are more direct:

  • Lower operational overhead: Maintenance operations become simpler and less risky
  • Higher GPU utilization rates: Less compute wasted on network-related interruptions
  • Simpler network architecture: Reduced control plane complexity means fewer configuration errors
  • Better SLA compliance: More reliable training runs mean happier customers
  • Vendor flexibility: As an open standard, MRC avoids lock-in to any single chip vendor

For the broader AI ecosystem, MRC represents another step toward making large-scale AI training more accessible. By reducing the hidden 'tax' of network unreliability, the protocol effectively makes every GPU in a cluster more productive.

Looking Ahead: From Open Standard to Industry Default

The critical question now is how quickly MRC moves from OpenAI's clusters to the rest of the industry. Several factors suggest adoption could be rapid.

First, the protocol's publication through OCP means any hardware or software vendor can implement it without licensing fees. Second, the backing of all major chip companies ensures that MRC-compatible hardware will be widely available. Third, the clear economic benefits — reduced waste, simpler operations, higher utilization — create strong incentives for adoption.

The next milestones to watch include formal integration into commercial networking products from Broadcom and other switch vendors, adoption by major cloud providers like AWS and Google Cloud, and potential extensions of the protocol to handle inference workloads as well as training.

As AI models continue to grow and training clusters expand toward millions of accelerators, protocols like MRC will shift from 'nice to have' to 'absolutely essential.' The chip industry's unusual show of unity on this standard suggests that all major players understand what is at stake — and none of them want GPU compute waste to be the bottleneck that slows down the AI revolution.