Cloudflare Builds LLM Infrastructure on Its Edge Network

📁 Industry · ⏱️ 12 min read
💡 Cloudflare unveils disaggregated prefill architecture and custom Infire inference engine to run large language models efficiently across its global edge network.

Cloudflare has launched a new high-performance infrastructure designed to run large language models directly on its global edge network, introducing a novel architecture that splits AI inference into specialized phases for maximum efficiency. The company also unveiled Infire, a custom-built AI inference engine capable of orchestrating GPU resources across multiple servers at scale.

The announcements, made during Cloudflare's 2025 anniversary event, represent a significant push by the web infrastructure giant into the competitive AI inference market — a space currently dominated by cloud hyperscalers like AWS, Google Cloud, and Microsoft Azure.

Key Takeaways

  • Cloudflare now runs LLM inference workloads on its global edge network spanning 300+ cities worldwide
  • The company developed a disaggregated prefill architecture that splits inference into two distinct phases, prefill and decode, running on separate servers
  • A new custom inference engine called Infire enables more efficient multi-GPU scheduling
  • The prefill phase handles compute-intensive input processing, while the decode phase handles memory-bandwidth-intensive output generation
  • The approach optimizes both hardware utilization and response latency for end users
  • Key contributors include Michelle Chen (Chief Product Manager), Kevin Flansburg (Senior Engineering Manager), and Vlad Krasnov (Chief Systems Engineer)

Disaggregated Prefill: Splitting Inference for Maximum Efficiency

Traditional LLM inference treats the entire process, from reading input tokens to generating output tokens, as a single monolithic workload running on the same hardware. Cloudflare's engineering team regarded this as inefficient because the two phases of inference have very different computational profiles.

The prefill phase processes all input tokens in parallel and populates the KV cache (key-value cache), a data structure that stores the attention keys and values computed for each token so they do not have to be recomputed later. This phase is heavily compute-bound, demanding maximum GPU floating-point throughput. In contrast, the decode phase generates output tokens one at a time in an autoregressive fashion. Each step must read the model weights and the growing KV cache from GPU memory to produce a single token, which makes decode primarily memory-bandwidth-bound rather than compute-bound.
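The compute-bound versus memory-bandwidth-bound distinction can be made concrete with a rough roofline-style estimate. The sketch below is illustrative only: it uses the common approximation of 2 FLOPs per parameter per token, ignores KV-cache and activation traffic, and the accelerator figures are round, hypothetical numbers rather than the specifications of any hardware Cloudflare uses.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that the weights are read
    from GPU memory once per pass. KV-cache and activation traffic are
    ignored to keep the sketch small.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload against the accelerator's roofline ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte where the two limits meet
    return "compute" if intensity >= ridge else "memory-bandwidth"


# Illustrative round numbers for a hypothetical accelerator (assumptions).
PEAK_FLOPS = 1.0e15     # 1,000 TFLOP/s
MEM_BANDWIDTH = 3.0e12  # 3 TB/s

prefill = arithmetic_intensity(tokens_per_pass=2048)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=1)      # one new token per pass

print(f"prefill: {prefill:.0f} FLOPs/byte -> {bound_by(prefill, PEAK_FLOPS, MEM_BANDWIDTH)}-bound")
print(f"decode:  {decode:.0f} FLOPs/byte -> {bound_by(decode, PEAK_FLOPS, MEM_BANDWIDTH)}-bound")
```

Under these assumptions a 2,048-token prefill pass does thousands of FLOPs per byte of weights read, while an unbatched decode step does about one. Batching many decode requests together raises that figure, which is one reason schedulers try to keep decode batches full.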

By separating these two phases onto different, purpose-optimized server pools, Cloudflare can match each workload to the hardware best suited for it. Compute-dense prefill servers can be equipped with high-FLOPS GPUs, while decode servers can prioritize memory bandwidth and capacity. This disaggregated approach mirrors a trend seen in AI research from companies like Google DeepMind and startups such as Groq, but Cloudflare's implementation differs in that it operates across a distributed edge network rather than in centralized data centers.
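The split into two server pools can be sketched as a small router. This is a toy illustration, not Cloudflare's implementation: the class names (KVCache, PrefillWorker, DecodeWorker, DisaggregatedRouter) are hypothetical, no real model is run, and round-robin selection stands in for whatever scheduling policy a production system would use.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class KVCache:
    """Stand-in for the attention key/value state built during prefill."""
    request_id: str
    prompt_tokens: list[str]


@dataclass
class PrefillWorker:
    name: str

    def prefill(self, request_id: str, prompt: str) -> KVCache:
        # A real worker would run one large forward pass over the whole prompt.
        return KVCache(request_id, prompt.split())


@dataclass
class DecodeWorker:
    name: str
    served: list[str] = field(default_factory=list)

    def decode(self, cache: KVCache, max_new_tokens: int) -> list[str]:
        # A real worker would generate tokens one at a time, reading the cache
        # on every step. Here we only emit placeholder tokens.
        self.served.append(cache.request_id)
        return [f"<tok{i}>" for i in range(max_new_tokens)]


class DisaggregatedRouter:
    """Round-robin routing across separate prefill and decode pools."""

    def __init__(self, prefill_pool: list[PrefillWorker], decode_pool: list[DecodeWorker]):
        self._prefill = cycle(prefill_pool)
        self._decode = cycle(decode_pool)

    def generate(self, request_id: str, prompt: str, max_new_tokens: int = 3) -> list[str]:
        cache = next(self._prefill).prefill(request_id, prompt)  # phase 1
        # In a real system the cache would be transferred between servers here.
        return next(self._decode).decode(cache, max_new_tokens)  # phase 2


d0, d1 = DecodeWorker("d0"), DecodeWorker("d1")
router = DisaggregatedRouter([PrefillWorker("p0")], [d0, d1])
first = router.generate("req-1", "hello edge inference")
second = router.generate("req-2", "another prompt")
print(first, d0.served, d1.served)
```

The point of the sketch is the handoff: the only state that crosses the pool boundary is the KV cache, so the cost of moving it between servers is a central design consideration for any disaggregated setup.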

Why Edge-Based LLM Inference Changes the Game

Running LLMs at the edge — close to end users — offers several distinct advantages over centralized cloud inference. The most obvious benefit is reduced latency. When a user in London sends a prompt to an LLM, having that inference happen at a nearby Cloudflare point of presence rather than a data center in Virginia can shave dozens or even hundreds of milliseconds off response time.
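A back-of-the-envelope calculation shows where the latency saving comes from. The sketch below computes a physical lower bound on round-trip time from great-circle distance, assuming light in optical fiber travels roughly 200 km per millisecond. The coordinates and the 50 km distance to a nearby point of presence are approximations chosen for illustration, and real network paths are longer than the great-circle route.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # assumption: light in fiber covers ~200 km per millisecond


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time: out and back at fiber speed."""
    return 2 * distance_km / FIBER_KM_PER_MS


# Approximate coordinates (assumptions for illustration).
LONDON = (51.5074, -0.1278)
NORTHERN_VIRGINIA = (39.0438, -77.4874)

far_km = great_circle_km(*LONDON, *NORTHERN_VIRGINIA)
near_km = 50.0  # hypothetical distance to a nearby point of presence

print(f"London to Virginia: {far_km:.0f} km, RTT floor {min_rtt_ms(far_km):.1f} ms")
print(f"Nearby PoP:         {near_km:.0f} km, RTT floor {min_rtt_ms(near_km):.1f} ms")
```

Under these assumptions the transatlantic path has a floor of several tens of milliseconds per round trip before any routing, queuing, or TLS handshakes are counted, while the nearby path's floor is well under a millisecond. Interactions that need several round trips multiply the difference.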

For real-time applications like AI-powered customer service chatbots, coding assistants, and interactive agents, this latency reduction translates directly into better user experiences. It also matters for emerging use cases such as AI-driven content personalization at the CDN layer, where models need to make decisions in the critical path of web requests.

However, running LLMs at the edge presents significant challenges:

  • GPU availability is more limited at edge locations compared to hyperscale data centers
  • Model sizes for state-of-the-art LLMs can exceed 70 billion parameters, requiring multiple GPUs
  • Thermal and power constraints at edge facilities are typically tighter
  • Workload balancing across hundreds of distributed locations adds orchestration complexity
  • Cost efficiency demands higher GPU utilization rates than traditional cloud deployments
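The claim that a 70-billion-parameter model needs multiple GPUs follows from simple memory arithmetic. The sketch below is a rough estimate under stated assumptions: weights dominate memory use, a fixed fraction of each GPU is reserved for the KV cache and activations, and the 80 GB GPU size and 30% reservation are illustrative values, not figures from Cloudflare.

```python
import math


def gpus_needed(params_billion: float, bytes_per_param: float,
                gpu_memory_gb: float, reserved_fraction: float = 0.3) -> int:
    """Minimum GPUs required just to hold the model weights.

    reserved_fraction is the share of each GPU set aside for the KV cache,
    activations, and runtime overhead (an assumed value).
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes = GB
    usable_gb = gpu_memory_gb * (1.0 - reserved_fraction)
    return math.ceil(weights_gb / usable_gb)


print(gpus_needed(70, 2.0, 80))  # 16-bit weights: 140 GB of weights -> 3
print(gpus_needed(70, 1.0, 80))  # 8-bit weights:   70 GB of weights -> 2
```

Even with 8-bit quantization the weights alone do not fit on a single 80 GB device once headroom is reserved, so the model has to be sharded across GPUs and the inference engine has to coordinate them.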

Cloudflare's disaggregated architecture directly addresses several of these challenges by allowing the company to use its GPU resources more flexibly across its network.

Infire: Cloudflare's Custom Inference Engine

Perhaps the most technically significant announcement is Infire, Cloudflare's proprietary AI inference engine. While many companies rely on open-source inference frameworks like vLLM, TensorRT-LLM from NVIDIA, or llama.cpp, Cloudflare chose to build its own engine from the ground up.

Infire is designed to handle multi-GPU inference scheduling with a focus on the unique requirements of a distributed edge environment. Unlike centralized inference engines that assume all GPUs are co-located in the same data center rack, Infire must coordinate workloads across GPUs that may be spread across different servers and even different physical locations.

The decision to build a custom engine rather than adopt an existing open-source solution suggests that Cloudflare found significant performance gaps in available tools — particularly around GPU resource scheduling and the disaggregated prefill architecture. Building proprietary inference infrastructure is an expensive undertaking, but it gives Cloudflare full control over optimization and differentiation in an increasingly crowded market.

This approach parallels moves by other major tech companies. Google uses its own TPU inference stack, Amazon has invested in custom Inferentia and Trainium chips with proprietary software, and Microsoft has developed specialized inference optimizations for its Azure OpenAI Service. Cloudflare's bet is that edge-native inference requires purpose-built tooling that existing solutions simply cannot provide.

The Broader AI Infrastructure Arms Race

Cloudflare's infrastructure push comes amid an unprecedented buildout of AI compute capacity across the tech industry. According to recent estimates, the global AI inference market is projected to exceed $100 billion by 2028, growing faster than the training market as more applications move into production.

Several trends are driving this expansion:

  • Enterprise AI adoption is accelerating, with companies deploying LLMs for everything from document processing to code generation
  • Inference costs now represent the majority of AI operational expenses for most organizations
  • Latency requirements are tightening as AI moves from batch processing to real-time applications
  • Data sovereignty regulations in the EU and other regions increasingly require processing to happen closer to users
  • Multi-model architectures and AI agent frameworks are multiplying the number of inference calls per user interaction

Cloudflare's edge-based approach is particularly well-positioned to address the data sovereignty angle. With points of presence in over 100 countries, the company can potentially offer AI inference that keeps data within specific geographic boundaries, which matters to customers subject to cross-border data transfer rules such as those in the EU's GDPR.

Compared to hyperscalers that concentrate GPU capacity in a handful of mega-regions, Cloudflare's distributed model trades scale for geographic reach. This makes it potentially more attractive for latency-sensitive and compliance-driven use cases, even if it may not match the aggregate throughput of a centralized GPU cluster running thousands of NVIDIA H100s.

What This Means for Developers and Businesses

For developers already using Cloudflare Workers or other Cloudflare services, the new LLM infrastructure creates a compelling integrated stack. Instead of managing separate AI inference providers alongside their edge computing and CDN layers, they can consolidate onto a single platform.

The practical implications include simplified architecture, potentially lower costs through reduced data transfer between services, and the ability to run AI inference with the same global distribution as their web applications. For startups and mid-size companies that lack the resources to negotiate enterprise GPU contracts with major cloud providers, Cloudflare's offering could represent a more accessible entry point into production AI deployment.

Enterprise customers stand to benefit as well, particularly those with strict latency budgets or regulatory requirements around data locality. The disaggregated prefill architecture could also deliver better price-performance ratios by improving GPU utilization — a metric that remains notoriously low across the industry, with some estimates suggesting average GPU utilization rates below 30% in many deployments.
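The link between GPU utilization and price-performance is straightforward arithmetic: an idle GPU still costs money, so the effective price of useful work is the hourly price divided by utilization. The sketch below uses a hypothetical $2.00 hourly price purely for illustration; it is not a Cloudflare or cloud-provider rate.

```python
def cost_per_useful_gpu_hour(hourly_price: float, utilization: float) -> float:
    """Effective cost of one hour of actual GPU work at a given utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_price / utilization


HOURLY_PRICE = 2.00  # hypothetical price per GPU-hour, in dollars

for utilization in (0.30, 0.60, 0.90):
    cost = cost_per_useful_gpu_hour(HOURLY_PRICE, utilization)
    print(f"{utilization:.0%} utilized -> ${cost:.2f} per useful GPU-hour")
```

Under this simple model, doubling utilization from 30% to 60% halves the effective cost of each useful GPU-hour without any change in hardware pricing.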

Looking Ahead: Cloudflare's AI Ambitions

Cloudflare's investment in custom AI infrastructure signals a long-term strategic commitment to becoming a major player in the AI inference market. The company's edge network — originally built for caching web content and mitigating DDoS attacks — is being systematically repurposed for AI workloads.

Several questions remain about Cloudflare's roadmap. Will the company expand GPU capacity aggressively enough to compete with hyperscalers on model size support? How will Infire evolve to support emerging model architectures beyond traditional transformer-based LLMs? And can Cloudflare maintain competitive pricing as GPU demand continues to outstrip supply globally?

What is clear is that the AI inference landscape is diversifying beyond the traditional cloud providers. With Cloudflare bringing LLM inference to the edge, alongside similar moves by companies like Fastly and Akamai, developers will soon have a much broader set of options for deploying AI models in production. The winners in this space will be those who deliver the best combination of latency, cost efficiency, and developer experience — and Cloudflare's disaggregated architecture and custom Infire engine represent a serious bid for that position.