📑 Table of Contents

OpenAI Launches MRC Protocol With AMD, Intel, Nvidia

📅 · 📁 Industry · 👁 8 views · ⏱️ 13 min read
💡 OpenAI and 5 tech giants release Multipath Reliable Connection, an open networking protocol to boost AI supercomputer efficiency.

OpenAI has joined forces with AMD, Broadcom, Intel, Microsoft, and Nvidia to launch Multipath Reliable Connection (MRC), a new open networking protocol designed to dramatically improve the speed, reliability, and efficiency of massive AI training clusters. The protocol, announced on May 6, is already fully deployed across all of OpenAI's large-scale supercomputers used to train its frontier AI models.

MRC targets one of the most persistent and costly problems in large-scale AI training: idle GPU time caused by network failures and inefficiencies. By enabling smarter, more resilient data routing across multiple network paths, the protocol promises to reduce wasted compute and accelerate training runs that can cost tens of millions of dollars.

Key Takeaways at a Glance

  • What: MRC is an open networking protocol for large AI training clusters
  • Who: Developed collaboratively by OpenAI, AMD, Broadcom, Intel, Microsoft, and Nvidia
  • Why: Reduces GPU idle time and improves network reliability during AI model training
  • Where deployed: Already running on OpenAI's supercomputers, including Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft's Fairwater supercomputer cluster
  • Impact: Significant compute efficiency gains for training frontier AI models
  • Open standard: Available for the broader industry to adopt and implement

Why Networking Is AI Training's Hidden Bottleneck

Most conversations about AI infrastructure focus on GPUs — their count, their speed, their availability. But networking is quietly one of the biggest bottlenecks in scaling AI training to tens of thousands of accelerators. When a single network link fails or congests in a cluster of 100,000 GPUs, the entire training job can stall.

Traditional networking protocols were not designed for the unique demands of AI workloads. These workloads require collective communication patterns where thousands of GPUs must synchronize data simultaneously. A failure in any single path can force every GPU in the cluster to wait, burning millions of dollars in idle compute time.

This is the problem MRC was built to solve. Unlike conventional single-path protocols, MRC routes data across multiple network paths simultaneously, automatically detecting failures and rerouting traffic without interrupting the training process. The result is a more fault-tolerant system that keeps GPUs productive even when individual network components fail.

How MRC Works: A Technical Breakdown

At its core, MRC operates as a transport-layer protocol that sits between the application and the physical network infrastructure. It introduces several key innovations that distinguish it from existing approaches.

Multipath Data Routing

MRC spreads data across multiple available network paths rather than relying on a single connection. This approach offers 2 major advantages: it increases aggregate bandwidth by utilizing more of the available network fabric, and it provides instant failover capability when any individual path experiences issues.

Reliable Delivery With Low Overhead

The protocol ensures reliable data delivery without the high overhead typically associated with reliability mechanisms. Traditional approaches like TCP introduce significant latency through their acknowledgment and retransmission schemes. MRC uses a more efficient reliability mechanism optimized specifically for the bursty, high-bandwidth communication patterns typical of AI training workloads.

Key technical features include:

  • Automatic path failover in microseconds rather than milliseconds
  • Load balancing across multiple network paths for optimal bandwidth utilization
  • Low-latency reliability mechanisms designed for AI collective operations
  • Scalability to clusters with tens of thousands of GPUs
  • Hardware-agnostic design that works across different chip vendors and network equipment

The fact that MRC works across hardware from AMD, Intel, and Nvidia — 3 companies that fiercely compete in the AI accelerator market — underscores the protocol's vendor-neutral ambitions.

Already Running on OpenAI's Most Powerful Supercomputers

MRC is not a theoretical proposal or a future roadmap item. OpenAI confirmed that the protocol is already fully deployed across all of its large-scale training supercomputers. This includes 2 notable installations.

The first is the Oracle Cloud Infrastructure (OCI) site in Abilene, Texas, one of the largest AI training facilities in the world. Oracle has been aggressively expanding its cloud infrastructure to serve AI workloads, and the Abilene data center represents a massive investment in GPU-dense computing.

The second is Microsoft's Fairwater supercomputer cluster, part of Microsoft's extensive AI infrastructure buildout that supports OpenAI's model development. Microsoft has invested over $13 billion in OpenAI and continues to provide the bulk of the compute infrastructure for training models like GPT-4 and its successors.

The real-world deployment across these facilities provides strong validation that MRC delivers measurable performance improvements at production scale, not just in controlled lab environments.

The Strategic Significance of an Open Protocol

Perhaps the most significant aspect of MRC is its status as an open protocol. In an industry increasingly defined by proprietary ecosystems and vendor lock-in, the decision to make MRC openly available sends a powerful signal about how the AI infrastructure stack may evolve.

Nvidia currently dominates AI networking through its InfiniBand and NVLink technologies, which provide high-bandwidth, low-latency interconnects for GPU clusters. However, these are largely proprietary solutions that tie customers to Nvidia's ecosystem. The emergence of MRC as an open alternative — one that Nvidia itself has co-developed — suggests a recognition that the industry needs common standards as AI clusters scale beyond any single vendor's capabilities.

For AMD and Intel, participation in MRC represents an opportunity to compete more effectively in the AI infrastructure market. If training clusters can seamlessly mix hardware from different vendors using a common protocol, it lowers the barrier for customers to adopt non-Nvidia accelerators for portions of their workloads.

Broadcom's involvement is equally telling. As one of the world's largest networking chip manufacturers, Broadcom's support suggests that MRC could eventually be implemented directly in network interface cards and switches, further improving performance and reducing CPU overhead.

Industry Context: The Race to Build Bigger AI Clusters

MRC arrives at a critical moment in the AI industry. Companies are racing to build ever-larger training clusters, with facilities containing 100,000 or more GPUs becoming the new standard for frontier model development.

  • xAI's Colossus cluster in Memphis, Tennessee houses 100,000 Nvidia H100 GPUs
  • Meta has disclosed plans for clusters exceeding 350,000 GPUs
  • Microsoft and OpenAI are reportedly planning a $100 billion data center project codenamed 'Stargate'
  • Google continues to scale its TPU-based training infrastructure across multiple data centers

At these scales, even small improvements in network efficiency translate to enormous cost savings. If MRC reduces GPU idle time by just 5% in a 100,000-GPU cluster running $30,000 H100 GPUs, the value preservation runs into hundreds of millions of dollars over the cluster's lifetime.

The protocol also arrives as the industry grapples with the energy costs of AI training. Idle GPUs still consume significant power. By keeping GPUs productive and reducing wasted compute cycles, MRC contributes to both economic and environmental sustainability goals.

What This Means for Developers and Businesses

For AI researchers and engineers working on large-scale training, MRC promises several practical benefits.

Faster training times mean models can be iterated more quickly, accelerating the pace of AI development. When GPUs spend less time waiting for network operations, the overall time-to-completion for training runs shrinks proportionally.

Lower costs are perhaps the most immediate benefit. GPU time is the single largest expense in AI model training. Reducing waste directly impacts the bottom line for organizations running large training jobs.

Greater reliability reduces the operational burden on infrastructure teams. Large training runs can take weeks or months. Network failures that interrupt these runs force expensive restarts from checkpoints, wasting days of compute. MRC's fault-tolerance mechanisms minimize these disruptions.

For cloud providers and data center operators, MRC offers:

  • Higher utilization rates for expensive GPU infrastructure
  • Vendor flexibility in hardware procurement decisions
  • Simplified networking through standardized protocol support
  • Reduced operational complexity from fewer network-related training failures

Looking Ahead: Open Standards as AI Infrastructure Matures

The release of MRC signals a broader trend toward standardization in AI infrastructure. As the industry moves beyond the early gold-rush phase of AI compute buildout, the need for interoperable, efficient, and open technologies becomes increasingly urgent.

The coalition behind MRC — spanning chip designers, cloud providers, and AI companies — represents a powerful consensus that proprietary networking alone cannot meet the demands of next-generation AI systems. Future developments could see MRC expanded to cover inference workloads, multi-data-center training, and edge AI deployments.

OpenAI's decision to open-source the protocol rather than keep it as a proprietary advantage also reflects a strategic calculation. By establishing MRC as an industry standard, OpenAI ensures that its infrastructure investments benefit from a broader ecosystem of compatible hardware and software, rather than being locked into any single vendor's roadmap.

The coming months will reveal how quickly other major AI players — particularly Google, Amazon, and Meta — adopt or respond to MRC. Their decisions will determine whether MRC becomes a true industry standard or remains one of several competing approaches to AI cluster networking.

What is clear today is that the networking layer of AI infrastructure is finally getting the attention it deserves. As AI models continue to grow in size and complexity, protocols like MRC will play a critical role in determining who can train the next generation of frontier models — and how efficiently they can do it.