OpenAI and Tech Giants Build MRC Protocol for AI Scale

💡 OpenAI partners with AMD, Broadcom, Intel, Microsoft, and NVIDIA to create MRC, an open source networking protocol that connects 100,000+ GPUs.

OpenAI has unveiled MRC, a new open source networking protocol built in collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA, designed to eliminate the crippling data bottlenecks that plague today's AI supercomputers. The protocol, already live on OpenAI's Stargate supercomputer, sends data across hundreds of paths simultaneously between GPUs — a fundamental shift in how the world's largest AI clusters communicate.

This is more than an incremental improvement. MRC is the kind of infrastructure-level change that could reshape the economics and scalability of AI training for years to come.

Key Facts at a Glance

  • MRC (Multi-path Routing and Congestion) is an open source network protocol developed by 6 major tech companies
  • The protocol transmits data across hundreds of simultaneous paths between GPUs, rather than relying on traditional single or few-path approaches
  • MRC reduces switch layers from the typical 3 or 4 down to just 2, enabling connections for over 100,000 GPUs
  • Fewer switch layers mean fewer switches and less cabling, which lowers both power consumption and infrastructure cost
  • The protocol is already deployed on OpenAI's Stargate supercomputer
  • The coalition includes the biggest names in chips and cloud: AMD, Broadcom, Intel, Microsoft, and NVIDIA

Why AI Networking Has Become the Biggest Bottleneck

Training frontier AI models requires thousands, and soon hundreds of thousands, of GPUs working in concert. The raw compute power of individual chips has improved sharply over the past few years, with NVIDIA's H100 and the newer B200 GPUs each delivering large generation-over-generation performance gains.

But there's a problem that faster chips alone cannot solve. The network fabric connecting those GPUs has become the single greatest constraint on scaling AI systems. When tens of thousands of GPUs need to share gradients, synchronize parameters, and exchange intermediate results during training, the networking layer must handle staggering volumes of data with minimal latency.

Traditional data center networking architectures rely on hierarchical switch topologies — typically 3 or 4 layers of switches arranged in a tree or fat-tree structure. Each additional layer adds latency, complexity, power draw, and cost. At the scale OpenAI and its peers are now operating, these inefficiencies compound into serious obstacles.

  • Latency spikes occur when traffic bottlenecks at higher switch layers
  • Power consumption from switch infrastructure adds substantially to a cluster's total energy draw
  • Cost escalation is nonlinear — each additional switch layer multiplies hardware and cabling expenses
  • Failure domains grow larger with more layers, increasing the blast radius of any single component failure
  • Cooling requirements increase substantially with every additional switch tier

This is the problem MRC was built to address.

How MRC Works: Hundreds of Paths Instead of a Few

The core innovation behind MRC lies in its multi-path routing architecture. Unlike conventional protocols that route traffic along a small number of predetermined paths, MRC spreads data across hundreds of simultaneous routes between any two GPUs in the cluster.

This approach is conceptually similar to how modern content delivery networks distribute web traffic, but applied at the hardware level within a data center. By utilizing a massive number of parallel paths, MRC avoids the congestion hotspots that plague traditional hierarchical networks.
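
To see why the number of paths matters, the toy comparison below contrasts pinning each flow to a single path with spreading every flow's packets across all paths. It is only a sketch: the function names, the round-robin spraying policy, and the traffic sizes are illustrative assumptions, not details of MRC itself.

```python
import random
from collections import Counter


def max_link_load(assignments, num_paths):
    """Heaviest per-path load, as a multiple of a perfectly even share."""
    load = Counter()
    for path, size in assignments:
        load[path] += size
    ideal = sum(load.values()) / num_paths
    return max(load.values()) / ideal


def pinned(flows, num_paths, rng):
    """Every packet of a flow follows the one path chosen for that flow."""
    return [(rng.randrange(num_paths), size) for size in flows]


def sprayed(flows, num_paths):
    """Packets of every flow are dealt round-robin across all paths."""
    assignments = []
    next_path = 0
    for size in flows:
        for _ in range(size):          # one entry per packet
            assignments.append((next_path, 1))
            next_path = (next_path + 1) % num_paths
    return assignments


if __name__ == "__main__":
    flows = [1000] * 16                # 16 large flows, sizes in packets (assumed)
    paths = 256                        # hypothetical number of parallel paths
    print("pinned :", max_link_load(pinned(flows, paths, random.Random(0)), paths))
    print("sprayed:", max_link_load(sprayed(flows, paths), paths))
```

With 16 large flows and 256 paths, the pinned case always leaves at least one path carrying 16 times its even share or more, while round-robin spraying keeps every path within about 1% of it.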

The result is striking: MRC needs only 2 switch layers to connect over 100,000 GPUs. Compared to the industry-standard 3- or 4-layer topologies, this is a dramatic simplification. Fewer switch layers mean fewer physical switches, less cabling, lower power consumption, and reduced cooling demands.
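
The relationship between switch layers and cluster size can be checked with standard fat-tree arithmetic. The sketch below assumes a non-blocking folded Clos built from identical switches, where every tier except the top splits its ports half down and half up. The radix values are illustrative assumptions, not figures from the announcement, and the article does not say how MRC's fabric is actually dimensioned.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints a non-blocking folded Clos of identical switches can attach.

    Lower tiers use half their ports downward and half upward; the top tier
    uses all ports downward, which gives 2 * (radix / 2) ** tiers.
    """
    return 2 * (radix // 2) ** tiers


def min_radix(endpoints: int, tiers: int) -> int:
    """Smallest even switch radix that reaches `endpoints` in `tiers` layers."""
    radix = 2
    while max_endpoints(radix, tiers) < endpoints:
        radix += 2
    return radix


if __name__ == "__main__":
    for radix in (64, 128, 512):       # hypothetical ports per switch
        print(radix, [max_endpoints(radix, t) for t in (2, 3, 4)])
    print("radix needed for 100,000 endpoints in 2 tiers:", min_radix(100_000, 2))
    print("radix needed for 100,000 endpoints in 3 tiers:", min_radix(100_000, 3))
```

Under these assumptions a 64-port switch reaches only 2,048 endpoints in two tiers, so a two-tier fabric for 100,000 endpoints needs an effective radix of at least 448 ports per switch, compared with 74 for three tiers. A two-layer design at this scale therefore depends on each switch fanning out to a very large number of links.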

The protocol also incorporates sophisticated congestion management algorithms. In a system where hundreds of paths are active simultaneously, detecting and responding to congestion in real time is essential. MRC handles this at the protocol level, dynamically rerouting traffic away from overloaded paths without requiring intervention from higher-level software.
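
One way to picture rerouting away from overloaded paths is a sender that keeps a weight per path, cuts the weight when that path reports congestion, and restores it gradually once the path is clear. The sketch below is a generic illustration of that idea using smooth weighted round-robin. The class name, the feedback signal, and the tuning constants are all hypothetical; this is not MRC's actual algorithm, which the article does not describe.

```python
class PathSelector:
    """Toy congestion-aware spreader: weighted round-robin over many paths."""

    def __init__(self, num_paths, decrease=0.5, recover=0.05, floor=0.01):
        self.weights = [1.0] * num_paths   # 1.0 means a fully healthy path
        self.credit = [0.0] * num_paths    # scheduler state, one entry per path
        self.decrease = decrease           # multiplicative cut on congestion
        self.recover = recover             # additive recovery when clear
        self.floor = floor                 # keep probing congested paths a little

    def on_feedback(self, path, congested):
        """Update a path's weight from a (hypothetical) congestion signal."""
        if congested:
            self.weights[path] = max(self.floor, self.weights[path] * self.decrease)
        else:
            self.weights[path] = min(1.0, self.weights[path] + self.recover)

    def next_path(self):
        """Pick the path for the next packet in proportion to current weights."""
        total = sum(self.weights)
        for i, weight in enumerate(self.weights):
            self.credit[i] += weight
        best = max(range(len(self.credit)), key=self.credit.__getitem__)
        self.credit[best] -= total
        return best


if __name__ == "__main__":
    selector = PathSelector(num_paths=4)
    for _ in range(3):                     # path 0 reports congestion three times
        selector.on_feedback(0, congested=True)
    counts = [0] * 4
    for _ in range(1000):
        counts[selector.next_path()] += 1
    print(counts)                          # path 0 receives far fewer packets
```

The multiplicative cut with additive recovery is borrowed from classic congestion control as a design choice for the sketch: it backs off a hot path quickly but keeps sending a trickle of traffic so the sender notices when the path clears.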

The Stargate Deployment: From Theory to Production

What makes MRC particularly noteworthy is that it is not a research paper or a proof of concept. OpenAI confirms the protocol is already running in production on its Stargate supercomputer, the massive AI infrastructure project backed by a reported $500 billion investment commitment announced in January 2025.

Stargate represents one of the most ambitious computing projects in history. It aims to provide OpenAI with the raw computational capacity needed to train next-generation models that go far beyond GPT-4's capabilities. The system is designed to scale to hundreds of thousands of GPUs, making efficient networking not just desirable but absolutely essential.

Deploying MRC on Stargate serves as both a validation of the protocol's readiness and a real-world stress test at unprecedented scale. If the protocol performs as designed under Stargate's workloads, it provides a proven blueprint that the entire industry can adopt.

An Unprecedented Coalition of Competitors

Perhaps the most remarkable aspect of MRC is the coalition behind it. AMD, Broadcom, Intel, Microsoft, NVIDIA, and OpenAI are not natural allies; they compete fiercely across multiple markets. AMD and NVIDIA battle for GPU dominance. Intel competes with both in accelerators and networking silicon. Broadcom supplies networking chips that compete with products from NVIDIA's own networking division (formerly Mellanox).

Yet all 6 organizations recognized that the networking bottleneck is an industry-wide problem that no single company can solve alone. An open standard benefits everyone by:

  • Preventing vendor lock-in — customers can mix and match hardware from different suppliers
  • Accelerating adoption — open source means faster iteration and broader peer review
  • Reducing fragmentation — a single protocol standard avoids the inefficiency of competing proprietary solutions
  • Lowering barriers to entry — smaller players and research institutions can leverage the same technology
  • Driving interoperability — GPUs from different manufacturers can communicate efficiently over the same fabric

This kind of cross-industry collaboration on foundational infrastructure echoes historical precedents like the development of Ethernet, USB, and PCIe — open standards that unlocked entire technology ecosystems.

What This Means for Developers and Businesses

For AI developers and enterprises building large-scale training infrastructure, MRC's implications are significant. The protocol's open source nature means organizations will not need to license proprietary networking solutions from a single vendor. This directly reduces the total cost of ownership for large GPU clusters.

The reduction from 3-4 switch layers to 2 also has concrete financial impact. Networking infrastructure, including switches, optical transceivers, and cabling, can represent 20-30% of total data center capital expenditure for AI workloads. Removing one or two switch tiers could translate to savings of hundreds of millions of dollars at hyperscale.
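
A rough sense of the hardware saved by dropping a tier comes from counting switches and inter-switch links in an idealized full-bisection folded Clos. The sketch below assumes identical switches of the same radix in both designs and a fully utilized fabric. The 512-port radix is a hypothetical value, and the output is a component count, not a price estimate.

```python
def switches_needed(endpoints: int, radix: int, tiers: int) -> int:
    """Approximate switch count for a full-bisection folded Clos.

    Each endpoint consumes one down port and one up port in every tier except
    the top, plus one down port at the top: (2 * tiers - 1) ports in total.
    """
    ports = (2 * tiers - 1) * endpoints
    return -(-ports // radix)              # ceiling division


def inter_switch_links(endpoints: int, tiers: int) -> int:
    """Switch-to-switch links: one per endpoint at each tier boundary."""
    return (tiers - 1) * endpoints


if __name__ == "__main__":
    gpus, radix = 100_000, 512             # radix is an illustrative assumption
    for tiers in (2, 3, 4):
        print(tiers, switches_needed(gpus, radix, tiers),
              inter_switch_links(gpus, tiers))
```

Under these assumptions, going from three tiers to two for 100,000 endpoints drops the switch count from roughly 977 to 586 (about 40% fewer) and halves the switch-to-switch links, each of which typically needs optics at both ends. Real fabrics are built in whole pods and may be oversubscribed, so actual counts will differ.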

For cloud providers like Microsoft Azure, Google Cloud, and Amazon Web Services, MRC could enable denser, more efficient GPU clusters. This means better performance per dollar for customers renting AI compute. It could also accelerate the availability of truly massive GPU clusters — 100,000+ GPUs — as a service.

Smaller AI labs and research institutions stand to benefit as well. If MRC becomes the industry standard, the networking components and expertise needed to build large clusters will become more commoditized and accessible.

Industry Context: The Infrastructure Race Heats Up

MRC arrives at a pivotal moment in the AI industry. The race to build ever-larger AI supercomputers has intensified dramatically in 2025. Meta is constructing its own massive GPU clusters. Google continues to expand its TPU-based infrastructure. xAI's Memphis data center reportedly houses over 100,000 NVIDIA GPUs.

Every one of these organizations faces the same networking challenge MRC addresses. The question now is whether competitors will adopt OpenAI's protocol or develop their own alternatives. The open source nature of MRC makes adoption more likely, but competitive dynamics could lead some players — particularly Google, which was notably absent from the coalition — to pursue independent solutions.

The timing also coincides with growing concerns about the energy consumption of AI infrastructure. Data centers supporting AI training are consuming electricity at rates that strain local power grids. Any technology that reduces power consumption — as MRC does by eliminating switch layers — aligns with both economic incentives and increasing regulatory pressure around sustainability.

Looking Ahead: The Future of AI Infrastructure

MRC is likely just the beginning. As AI models continue to grow in size and complexity, the demands on networking infrastructure will only intensify. Future iterations of the protocol may need to support millions of GPUs working in coordination — a scale that would dwarf even today's largest deployments.

The open source release of MRC also sets the stage for rapid community-driven improvement. Researchers and engineers across the industry can now study, test, modify, and extend the protocol. This collaborative development model could accelerate innovation far beyond what any single company could achieve.

Several key questions remain. Will MRC become a true industry standard, or will it fragment into competing forks? How will the protocol evolve to support emerging hardware architectures beyond traditional GPUs? And will the coalition hold together as its members continue to compete in other areas?

What is clear is that OpenAI and its partners have identified and addressed a critical bottleneck that was threatening to slow the entire AI industry's progress. By open-sourcing the solution, they have made a bet that a rising tide lifts all boats — and that the real competition lies not in networking protocols, but in what gets built on top of them.