Stochastic KV Routing: A New Paradigm for Cache Sharing Across the Depth Dimension

💡 A recent arXiv paper proposes "Stochastic KV Routing," a technique for adaptive KV cache sharing across the depth dimension of Transformer models. It offers a new memory optimization approach for serving large models efficiently.

Introduction: The Memory Dilemma of KV Cache

Inference deployment of large language models (LLMs) faces a growing challenge: the memory overhead of the KV cache. During autoregressive generation, a Transformer caches the Key-Value (KV) pairs of every previous token at every layer so that it does not recompute them at each step. This significantly boosts inference throughput, but it consumes a large amount of GPU memory, and that memory grows linearly with both sequence length and layer count. As models grow and context windows lengthen, the KV cache has become a core bottleneck for serving cost and concurrency.

Recently, a paper published on arXiv (arXiv:2604.22782v1) proposed a method called "Stochastic KV Routing." Rather than shrinking the cache along the token sequence, it approaches KV cache optimization from the model's depth dimension and enables adaptive cache sharing across layers.

Core Idea: From Temporal Compression to Depth-Dimension Sharing

Limitations of Existing Approaches

Current mainstream KV cache optimization methods primarily operate along the temporal axis, including:

  • Cache compression: Reducing the storage footprint of each KV pair through quantization, low-rank decomposition, and other techniques
  • Cache eviction: Discarding KV pairs of unimportant historical tokens based on attention scores and other strategies
  • Sparse attention: Such as sliding window attention, limiting the historical range each token attends to

While effective, these methods all work within the same dimension: each layer still keeps its own cache, and only the contents of that cache are reduced. As a result, they tend to hit compression limits, with diminishing returns in long-context scenarios.
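
To make the limitation concrete, here is a minimal sketch of one temporal-axis technique, a sliding-window cache. The class name `SlidingWindowKVCache` and the toy sizes are illustrative assumptions, not taken from the paper. The window bounds how many tokens each layer stores, but the total still scales with the number of layers.

```python
from collections import deque
from typing import List


class SlidingWindowKVCache:
    """Keeps only the most recent `window` (key, value) pairs for ONE layer."""

    def __init__(self, window: int) -> None:
        # A deque with maxlen drops the oldest entry automatically on overflow.
        self.entries: deque = deque(maxlen=window)

    def append(self, key: List[float], value: List[float]) -> None:
        self.entries.append((key, value))

    def __len__(self) -> int:
        return len(self.entries)


num_layers, window = 3, 4  # toy sizes, chosen only for illustration
caches = [SlidingWindowKVCache(window) for _ in range(num_layers)]

for t in range(10):  # generate 10 tokens
    for cache in caches:
        cache.append([float(t)], [float(t)])

total_entries = sum(len(c) for c in caches)  # bounded in time, not in depth
oldest_kept = caches[0].entries[0][0][0]     # token index of the oldest survivor
```

Each layer holds at most `window` entries, yet the overall footprint is still `num_layers * window`; the depth factor is untouched.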

A New Perspective from the Depth Dimension

The paper's core insight is that significant redundancy exists among KV representations across different layers of Transformer models. The researchers point out that the depth dimension is a long-overlooked optimization direction. If KV representations between adjacent or even distant layers are sufficiently similar, multiple layers can share the same KV cache, cutting cache memory roughly in proportion to the number of layers that share each cache.

The Stochastic KV Routing Mechanism

The core design of "Stochastic KV Routing" encompasses the following key elements:

  1. Adaptive routing decisions: The model dynamically determines at inference time whether each layer uses its own independent KV cache or routes to a shared cache from another layer. This decision is not hardcoded but adapts based on input content.

  2. Stochastic sampling strategy: Routing decisions are made through stochastic sampling, a design that not only introduces beneficial regularization effects but also makes the training process more stable. During inference, deterministic approximations can be used to ensure output consistency.

  3. Depth sharing groups: Several adjacent layers can be dynamically grouped, with layers within a group sharing the same set of KV caches. The grouping method is not fixed but is adaptively determined based on the similarity of KV representations across layers.
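
The three elements above can be sketched as a small routing function. This is an illustration under stated assumptions, not the paper's algorithm: the function `route_layers`, the per-layer sharing probabilities `share_probs`, and the 0.5 inference threshold are all hypothetical. In an actual model the probabilities would presumably be produced by a learned, input-dependent component; here they are plain inputs.

```python
import random
from typing import List, Optional


def route_layers(share_probs: List[float], training: bool,
                 rng: Optional[random.Random] = None) -> List[int]:
    """Return, for each layer, the index of the layer whose KV cache it reads.

    share_probs[i] is a hypothetical probability that layer i reuses the cache
    of the nearest earlier layer that kept its own cache. Layer 0 always owns
    a cache, so share_probs[0] is ignored.
    """
    rng = rng or random.Random()
    sources = [0]
    for layer in range(1, len(share_probs)):
        p = share_probs[layer]
        if training:
            share = rng.random() < p  # stochastic: Bernoulli sample
        else:
            share = p >= 0.5          # deterministic approximation (assumed rule)
        # sources[-1] is the cache owner that the previous layer reads from.
        sources.append(sources[-1] if share else layer)
    return sources


probs = [0.0, 0.9, 0.8, 0.1, 0.7, 0.2]  # made-up values for six layers
inference_sources = route_layers(probs, training=False)
num_caches = len(set(inference_sources))
train_sources = route_layers(probs, training=True, rng=random.Random(0))
```

With these made-up probabilities, inference yields the groups {0, 1, 2}, {3, 4}, and {5}, so six layers need only three caches. During training the sampled grouping varies from call to call, which is where the regularization effect described above would come from.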

Technical Analysis: Why Does Depth Sharing Work?

Theoretical Basis for Cross-Layer KV Redundancy

Multiple studies in recent years have found that representations change only incrementally between adjacent layers of deep Transformer models. Residual connections make this especially pronounced: each layer's output is a small refinement of the previous layer's output rather than an entirely new representation. Because each layer's Keys and Values are projections of these slowly changing hidden states, they can be highly correlated across adjacent layers, which is the main argument for cross-layer sharing.
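
One way to probe this kind of redundancy is to compare Key vectors across layers and group layers that look alike. The sketch below is an assumption-laden toy: it collapses each layer's Keys into a single vector, uses cosine similarity, and applies a greedy rule with a hand-picked threshold. Neither the helper names nor the grouping criterion come from the paper.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_adjacent_layers(layer_keys: List[List[float]],
                          threshold: float) -> List[List[int]]:
    """Greedily extend the current group while a layer's keys stay similar
    to the keys of the group's first layer; otherwise start a new group."""
    groups = [[0]]
    for layer in range(1, len(layer_keys)):
        anchor = groups[-1][0]
        if cosine(layer_keys[anchor], layer_keys[layer]) >= threshold:
            groups[-1].append(layer)
        else:
            groups.append([layer])
    return groups


# Toy per-layer key summaries: layers 0-2 drift slowly, layer 3 changes sharply.
layer_keys = [[1.0, 0.0], [0.98, 0.05], [0.95, 0.1], [0.0, 1.0], [0.05, 0.99]]
groups = group_adjacent_layers(layer_keys, threshold=0.95)
```

On this toy input the five layers collapse into two groups, `[0, 1, 2]` and `[3, 4]`. Real Key tensors have per-token and per-head structure, so a practical measurement would need to aggregate over both.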

Complementarity with Existing Methods

Notably, "Stochastic KV Routing" does not conflict with existing temporal-axis optimization methods. It can be orthogonally combined with quantization, sparse attention, and other techniques to achieve multi-dimensional joint compression. For example, sharing caches across the depth dimension while applying quantization to the shared caches themselves promises to achieve even higher compression ratios.
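
As a rough back-of-the-envelope check on how the two axes combine, the following sketch multiplies the two compression factors. It assumes the techniques are independent and ignores quantization metadata (scales, zero points) and any accuracy cost; the function name and the example numbers are illustrative only.

```python
def combined_compression(num_layers: int, num_shared_caches: int,
                         base_bits: int, quant_bits: int) -> float:
    """Idealized compression ratio from depth sharing plus quantization."""
    depth_ratio = num_layers / num_shared_caches  # e.g. 32 layers -> 16 caches
    quant_ratio = base_bits / quant_bits          # e.g. fp16 -> 4-bit
    return depth_ratio * quant_ratio


ratio = combined_compression(num_layers=32, num_shared_caches=16,
                             base_bits=16, quant_bits=4)
```

Halving the number of caches (2x) and quantizing from 16 to 4 bits (4x) gives an idealized 8x reduction under these assumptions.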

Balancing Performance and Efficiency

From a design perspective, a major advantage of the stochastic routing mechanism lies in its soft decision characteristic. Unlike hard layer sharing (such as directly forcing several layers to use the same KV), stochastic routing allows the model to retain independent caches in scenarios requiring fine-grained differentiation while automatically merging them when representations are highly similar, achieving a delicate balance between performance and efficiency.

Industry Impact: A New Direction for Inference Cost Optimization

Significance for Large Model Deployment

In long-context serving, especially when many requests are batched, KV cache GPU memory usage can exceed that of the model parameters themselves. Taking a 70B-parameter model as an example, processing a 128K-token context may require tens of gigabytes of GPU memory per sequence for the KV cache alone. If depth sharing can effectively halve the number of cache layers, the saved memory directly translates to:

  • Higher batch concurrency: Serving more requests simultaneously on a single GPU
  • Longer context support: Processing longer input sequences within limited GPU memory
  • Lower hardware barriers: Enabling small- and medium-scale GPUs to run large models
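
The "tens of gigabytes" figure can be checked with simple arithmetic. The configuration below (80 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption chosen to resemble a 70B-class model that already uses GQA; it is not taken from the paper, and the function name is illustrative.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """KV cache size in bytes; the leading 2 counts Keys and Values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


GIB = 1024 ** 3
seq_len = 128 * 1024  # 128K tokens

# Assumed 70B-class configuration with one cache per layer.
full_gib = kv_cache_bytes(80, 8, 128, seq_len) / GIB
# Same model if depth sharing halves the number of cached layers.
halved_gib = kv_cache_bytes(40, 8, 128, seq_len) / GIB
saved_gib = full_gib - halved_gib
```

Under these assumptions a single 128K-token sequence needs 40 GiB of KV cache, and halving the cached layers frees 20 GiB per sequence for larger batches or longer contexts.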

This research direction aligns with an established industry trend. Google's Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) already share KV across the attention head dimension, while "Stochastic KV Routing" extends the sharing concept to the layer dimension. It can be seen as a natural extension of the same idea along another axis.
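
The analogy between head sharing and layer sharing can be seen in how each one maps a consumer to a shared cache slot. The sketch below uses a fixed, evenly sized grouping on both axes purely for illustration; the layer mapping is the "hard" sharing variant, not the adaptive routing the paper proposes, and both function names are hypothetical.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int,
                           num_kv_heads: int) -> int:
    """GQA-style map: consecutive query heads share one KV head.
    With num_kv_heads == 1 this reduces to MQA."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def cache_group_for_layer(layer: int, layers_per_group: int) -> int:
    """Fixed depth-sharing map: consecutive layers share one cache group."""
    return layer // layers_per_group


head_map = [kv_head_for_query_head(h, num_q_heads=8, num_kv_heads=2)
            for h in range(8)]
layer_map = [cache_group_for_layer(l, layers_per_group=2) for l in range(6)]

# Cache slots needed: without any sharing vs. with both kinds of sharing.
slots_unshared = 8 * 6
slots_shared = len(set(head_map)) * len(set(layer_map))
```

In this toy setting, eight heads by six layers would need 48 cache slots without sharing and only 6 with both maps applied, which shows how the two axes multiply.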

Outlook: Toward Full-Dimensional KV Cache Optimization

The introduction of "Stochastic KV Routing" marks a shift in KV cache optimization research from single-dimension approaches to multi-dimensional synergy. In the future, we can anticipate a unified framework that achieves joint optimization across the head dimension (e.g., GQA), temporal dimension (e.g., cache eviction), and depth dimension (e.g., KV routing).

This method also suggests a direction for model architecture design: if depth-sharing-aware training objectives are introduced during the pretraining phase, models may learn KV representations better suited to cross-layer sharing, enabling more aggressive cache compression without sacrificing performance.

As large model inference costs become a critical constraint for industrial deployment, this type of optimization research rooted in architectural fundamentals will play an increasingly important role in driving the democratization of AI technology.