
OpenAI and NVIDIA Launch MRC Protocol for AI Training

💡 OpenAI teams up with NVIDIA, AMD, Intel, Microsoft, and Broadcom to release the open-source MRC protocol, tackling network bottlenecks in large-scale AI training.

OpenAI has joined forces with five major technology companies (NVIDIA, AMD, Intel, Microsoft, and Broadcom) to unveil the Multipath Reliable Connection (MRC) protocol, an open-source networking standard designed to reduce the latency and failure problems that plague large-scale AI model training. The protocol, announced on May 6, is being released through the Open Compute Project (OCP), making it freely available to the entire industry.

The move signals a rare moment of collaboration among fierce competitors, all united by a shared bottleneck: current network architectures simply cannot keep up with the demands of training frontier AI models across tens of thousands of GPUs.

Key Takeaways

  • What it is: MRC is a new networking protocol built on top of the RoCE (RDMA over Converged Ethernet) standard and using SRv6 (Segment Routing over IPv6), designed for AI supercomputer clusters
  • Who is behind it: OpenAI, NVIDIA, AMD, Intel, Microsoft, and Broadcom co-developed the protocol
  • Core innovation: A multi-plane network design that splits a single 800Gb/s interface into multiple smaller links, connecting up to ~131,000 GPUs with just 2 layers of switches
  • Open source: Released through OCP for industry-wide adoption, not locked behind any single vendor
  • Key benefit: Dramatically reduces network power consumption, component count, and cost compared to traditional 3- or 4-layer architectures
  • Traffic management: Introduces adaptive packet spraying technology for intelligent load balancing across multiple paths

Why Current AI Training Networks Are Breaking Down

Training large AI models is one of the most network-intensive computing tasks ever attempted. When thousands of GPUs work in parallel on a single model, they must constantly exchange data — gradient updates, activations, and parameters — across the network fabric. A single delayed data transmission can stall the entire training process, leaving expensive GPUs sitting idle.

The primary culprits are network congestion, link failures, and device malfunctions. As clusters grow larger, the number of links, switches, and transceivers grows with them, and because synchronous training waits on the slowest participant, a fault anywhere can stall the whole job. A cluster of 10,000 GPUs therefore experiences network incidents far more frequently than a cluster of 1,000, simply because there are more potential points of failure.
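The effect of cluster size can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every network component fails independently with the same small probability per hour and that the component count grows in proportion to the GPU count. The failure rate is a made-up number, not a measurement from any real cluster.

```python
def expected_incidents(components: int, p_fail: float) -> float:
    """Expected number of component failures per hour."""
    return components * p_fail


def prob_any_incident(components: int, p_fail: float) -> float:
    """Probability that at least one component fails in a given hour,
    assuming independent failures."""
    return 1.0 - (1.0 - p_fail) ** components


# Hypothetical per-component hourly failure probability (an assumption).
P_FAIL = 1e-4

# Treat component count as proportional to GPU count (also an assumption).
small = prob_any_incident(1_000, P_FAIL)
large = prob_any_incident(10_000, P_FAIL)

print(f"1,000 components:  P(at least one failure per hour) = {small:.3f}")
print(f"10,000 components: P(at least one failure per hour) = {large:.3f}")
```

With these illustrative numbers, the chance of at least one incident in any given hour rises from about 9.5% to about 63%. The expected incident count grows linearly with size, but the fraction of hours in which the job is hit by something grows quickly toward certainty.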

Traditional network architectures were never designed for this kind of workload. They rely on deep hierarchies of switches — typically 3 or 4 layers — which add latency, increase power consumption, and create bottlenecks. Each additional layer introduces more components that can fail and more hops that data must traverse. For companies like OpenAI, which are reportedly training models on clusters exceeding 100,000 GPUs, this architectural limitation has become an existential problem.

How MRC Solves the Scaling Problem

The MRC protocol attacks the problem from multiple angles simultaneously, starting with a fundamentally different approach to network topology.

Multi-Plane Network Design

Instead of relying on a traditional fat-tree or multi-layer switch hierarchy, MRC employs a multi-plane network design. The protocol splits a single 800Gb/s network interface into multiple smaller links distributed across independent network planes. This architectural shift is significant: it means a system can connect approximately 131,000 GPUs using only 2 layers of switches.

Compared to conventional 3- or 4-layer architectures, this flatter design delivers several advantages:

  • Lower latency: Fewer switch hops mean data arrives faster
  • Reduced power consumption: Fewer switches and cables translate to lower energy costs
  • Fewer components: Simplified infrastructure means fewer potential failure points
  • Greater path diversity: Multiple independent planes provide natural redundancy
  • Lower cost: Reduced hardware requirements drive down total infrastructure spend

For context, traditional large-scale GPU clusters often require 3 tiers of switches — Top-of-Rack (ToR), spine, and super-spine — each adding cost and complexity. MRC's 2-layer approach essentially eliminates an entire tier of networking infrastructure.
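The ~131,000 figure is consistent with standard two-tier leaf-spine arithmetic. The sketch below is one way to arrive at it, not a description of the published MRC design: it assumes a 51.2 Tb/s switch chip and an 800Gb/s interface split into 8 links of 100Gb/s, neither of which is stated in the announcement. In a non-blocking two-tier fabric built from switches with k ports, each leaf uses k/2 ports for endpoints and k/2 for uplinks, and there can be at most k leaves, giving k × k/2 endpoints per plane.

```python
def switch_radix(switch_gbps: int, link_gbps: int) -> int:
    """Number of ports a switch offers at a given per-port speed."""
    return switch_gbps // link_gbps


def two_tier_endpoints(radix: int) -> int:
    """Maximum endpoints in a non-blocking two-tier leaf-spine fabric:
    up to `radix` leaves, each with radix/2 endpoint-facing ports."""
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even number >= 2")
    return radix * (radix // 2)


SWITCH_GBPS = 51_200  # assumed switch capacity (not from the announcement)
NIC_GBPS = 800        # interface speed quoted in the article
PLANES = 8            # assumed split: 8 links of 100 Gb/s each

# One 800 Gb/s link per GPU: the switch exposes only 64 ports.
single_plane = two_tier_endpoints(switch_radix(SWITCH_GBPS, NIC_GBPS))

# Eight 100 Gb/s links per GPU, one per plane: each plane's switches expose
# 512 ports, and every GPU occupies one port in every plane.
multi_plane = two_tier_endpoints(switch_radix(SWITCH_GBPS, NIC_GBPS // PLANES))

print(single_plane, multi_plane)
```

Under these assumptions, splitting the interface raises the effective switch radix from 64 to 512, which lifts the two-tier ceiling from 2,048 to 131,072 endpoints per plane. Because each GPU takes one port in every plane, that is also the GPU limit.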

Adaptive Packet Spraying

On the traffic management side, MRC introduces adaptive packet spraying technology, a significant departure from conventional single-path data transmission. In traditional networks, data flows between 2 endpoints typically follow a single predetermined path. If that path becomes congested or experiences a failure, the entire communication stream suffers.

MRC instead distributes individual packets across all available network paths simultaneously. The protocol dynamically monitors path conditions and adjusts distribution in real time, routing around congestion and failures without interrupting the data stream. This approach maximizes bandwidth utilization across the entire network fabric rather than overloading specific links while others sit underutilized.
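The general idea can be sketched in a few lines. The example below is a toy model, not the MRC algorithm: the `Sprayer` class, the path names, the smoothing factor, and the weighting rule are all hypothetical choices made for illustration. Each packet is assigned to a randomly chosen live path, with congested paths picked less often and failed paths not at all.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    up: bool = True
    congestion: float = 0.0  # smoothed congestion estimate in [0, 1]


class Sprayer:
    """Toy per-packet load balancer (illustrative, not the MRC spec)."""

    def __init__(self, names, alpha=0.5, seed=0):
        self.paths = {n: Path(n) for n in names}
        self.alpha = alpha               # weight given to the newest signal
        self.rng = random.Random(seed)   # seeded for repeatable runs

    def report(self, name, congested):
        """Fold a congestion signal into the path's moving average."""
        p = self.paths[name]
        sample = 1.0 if congested else 0.0
        p.congestion = (1 - self.alpha) * p.congestion + self.alpha * sample

    def mark_down(self, name):
        """Exclude a failed path from future selection."""
        self.paths[name].up = False

    def pick(self):
        """Choose a path for the next packet, favouring uncongested ones."""
        live = [p for p in self.paths.values() if p.up]
        if not live:
            raise RuntimeError("no usable paths")
        # Keep a small floor so a congested path is still probed occasionally.
        weights = [max(0.05, 1.0 - p.congestion) for p in live]
        return self.rng.choices(live, weights=weights, k=1)[0].name


sprayer = Sprayer(["plane-0", "plane-1", "plane-2", "plane-3"])
sprayer.mark_down("plane-1")                   # simulate a link failure
for _ in range(5):
    sprayer.report("plane-2", congested=True)  # simulate sustained congestion

counts = Counter(sprayer.pick() for _ in range(1000))
print(counts)
```

In this run the failed plane receives no packets and the congested plane receives only a small share, while the two healthy planes carry most of the traffic. A real implementation would also have to handle out-of-order arrival at the receiver, which this sketch ignores.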

The combination of multi-plane topology and adaptive packet spraying creates a network that is both faster and more resilient — exactly what large-scale AI training demands.

Industry Context: A Rare Alliance of Rivals

The list of companies backing MRC reads like a who's who of the semiconductor and cloud computing industries. NVIDIA and AMD are direct competitors in the GPU market. Intel and Broadcom compete in networking silicon. Microsoft is both a cloud provider and OpenAI's largest investor. Yet all 6 companies have agreed to collaborate on an open standard.

This level of cooperation reflects just how urgent the networking bottleneck has become. Every major AI lab — from OpenAI to Google DeepMind to Anthropic — is racing to build larger training clusters. The companies involved in MRC recognize that proprietary networking solutions will not scale fast enough to meet demand. An open standard benefits everyone by accelerating adoption and enabling a broader ecosystem of compatible hardware.

The decision to release MRC through the Open Compute Project is also strategically significant. OCP, originally founded by Facebook (now Meta) in 2011, has become the de facto standards body for open-source data center hardware. By anchoring MRC within OCP, the consortium positions the protocol to receive broad industry scrutiny, contribution, and adoption.

This is not the first time the AI industry has rallied around shared infrastructure challenges. The Ultra Ethernet Consortium (UEC), launched in 2023, similarly brings together competitors to develop next-generation Ethernet standards for AI workloads. MRC can be seen as complementary to these efforts, focusing specifically on the reliable transport layer.

What This Means for AI Developers and Businesses

For organizations building or operating large-scale AI training infrastructure, MRC promises meaningful practical benefits.

Cost reduction is perhaps the most immediate impact. By eliminating an entire layer of network switches, MRC could cut networking infrastructure costs substantially, although the announcement gives no specific figure. For hyperscale data centers spending hundreds of millions of dollars on networking alone, even a hypothetical 20-30% reduction would translate to enormous savings.

Training reliability should improve as well. Every hour of GPU downtime during training costs thousands of dollars. By providing automatic failover across multiple network paths, MRC is intended to reduce the frequency and duration of training interruptions. This means faster time-to-model for organizations racing to train the next generation of AI systems.

Energy efficiency is another critical consideration. Data centers are already facing scrutiny over their power consumption, and AI training clusters are among the most power-hungry workloads in existence. Reducing the number of switches, cables, and optical transceivers directly lowers the energy footprint of the network fabric.

For smaller AI companies and research labs, the open-source nature of MRC is particularly valuable. Rather than being locked into a single vendor's proprietary networking stack, organizations can adopt MRC across hardware from multiple suppliers, fostering competition and driving down prices.

Looking Ahead: The Road to Adoption

While the announcement is significant, several questions remain about MRC's path to widespread deployment.

Hardware compatibility is the first hurdle. Network interface cards (NICs), switches, and other components will need firmware and potentially hardware updates to support MRC. The involvement of NVIDIA, AMD, Intel, and Broadcom, which together supply much of the GPU, NIC, and switch silicon used in AI clusters, suggests hardware support will follow, but timelines remain unclear.

Interoperability testing will be critical. An open standard is only as good as its real-world implementations. The industry will need extensive testing to ensure MRC works reliably across equipment from different vendors, in different data center configurations, and at different scales.

Competition from alternatives also bears watching. InfiniBand, a technology supplied chiefly by NVIDIA, currently dominates high-performance AI networking. While NVIDIA's participation in MRC suggests the company sees value in an open Ethernet-based alternative, it remains to be seen whether MRC will match InfiniBand's performance characteristics in practice.

The timing of this release is no coincidence. As AI companies push toward clusters of 200,000+ GPUs for next-generation model training, the networking challenge is becoming the single biggest bottleneck to progress. MRC represents the industry's collective bet that open collaboration — not proprietary lock-in — is the fastest path to solving it.

If MRC delivers on its promises, it could fundamentally reshape how AI supercomputers are built, lowering barriers to entry and accelerating the pace of AI development worldwide.