Why Does Reinforcement Learning Generalize? Feature-Level Mechanistic Study Reveals Secrets of LLM Post-Training

💡 A recent arXiv paper uses feature-level mechanistic analysis to explain why reinforcement learning post-training improves out-of-domain generalization in large language models while supervised fine-tuning leads to catastrophic forgetting, offering a new perspective on how the two post-training paradigms differ.

Introduction: The Core Puzzle of the Post-Training Paradigm Debate

On the path to more capable large language models (LLMs), post-training has become an indispensable step. The two dominant post-training paradigms, reinforcement learning (RL)-based methods and supervised fine-tuning (SFT), behave very differently in practice: RL post-training often extends a model's reasoning capabilities beyond the training domain, while SFT frequently causes the model to forget its general-purpose abilities.

This phenomenon has been widely observed across the industry, but the underlying mechanisms have long lacked a clear explanation. A recent arXiv paper, "Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models" (arXiv:2604.25011), addresses this question systematically from the perspective of feature-level mechanistic analysis.

Core Methodology: A Feature-Level Mechanistic Analysis Framework

The study proposes a "Feature-Level Mechanistic Analysis" methodology that probes the RL generalization phenomenon in carefully designed, controlled experimental environments.

Unlike previous research, which relied primarily on macro-level performance metrics, this approach looks inside the model and observes how RL and SFT differ in their effects on parameters and activation patterns at the level of feature representations. The framework has three key components:

  • Controlled experimental setup: The researchers constructed precisely controllable training and test domains so that generalization can be measured accurately
  • Feature-level probing: Beyond observing output behavior, the study analyzes how intermediate-layer features change
  • Comparative analysis: The study contrasts the effects of RL and SFT on internal model representations under identical conditions
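
To make the idea of comparing internal representations concrete, here is a minimal sketch of one widely used similarity measure, linear centered kernel alignment (CKA), applied to two activation matrices. This is an illustrative assumption, not the metric used in the paper, which this article does not specify. The arrays `base`, `reused`, and `rewritten` are synthetic stand-ins for intermediate-layer activations collected on the same inputs, not real model data.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_examples, n_features).

    Values near 1 mean the two representations share the same structure
    (up to rotation and uniform scaling); values near 0 mean they do not.
    """
    # Center each feature so the comparison ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


rng = np.random.default_rng(0)

# Hypothetical activations of a base model on 200 probe inputs.
base = rng.normal(size=(200, 32))

# Stand-in for a model that keeps the same features in a rotated basis.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
reused = base @ rotation

# Stand-in for a model whose features were replaced by unrelated ones.
rewritten = rng.normal(size=(200, 32))

cka_reused = linear_cka(base, reused)
cka_rewritten = linear_cka(base, rewritten)
print(f"preserved features: {cka_reused:.3f}")
print(f"rewritten features: {cka_rewritten:.3f}")
```

In a real study the matrices would hold activations from the same layer of the base model and of each post-trained checkpoint, evaluated on identical in-domain and out-of-domain prompts, so that a drop in similarity can be attributed to the training method rather than to the inputs.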

Key Findings: The Fundamental Difference Between RL and SFT

The paper reveals a core insight: RL post-training and SFT differ fundamentally in how they alter the model's internal feature representations.

SFT's "Overfitting Trap": Supervised fine-tuning tends to perform "local reshaping" in the feature space, causing the model to over-adapt to surface-level patterns in the training data. These drastic feature-level changes destroy the general representation structures acquired during pre-training and degrade out-of-domain capabilities, the phenomenon known as "catastrophic forgetting."

RL's "Precision Tuning": In contrast, reinforcement learning post-training acts more like "precise activation and recombination" of the model's existing features. Rather than rewriting the model's feature representations, RL learns to make better use of the knowledge structures already encoded during pre-training. This gentle yet targeted adjustment lets the model maintain or even improve out-of-domain generalization while strengthening specific reasoning capabilities.

In other words, the key reason RL can generalize is that it preserves the "universal feature foundation" accumulated during the model's pre-training phase and optimizes the strategy for using those features rather than altering the features themselves.
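
Readers who want to check a claim like this on their own checkpoints could start with a coarse proxy: how far each weight tensor moved from its pre-trained value. The sketch below computes a per-tensor relative L2 drift. It is an assumed diagnostic for illustration only, not the paper's analysis, and it measures parameters rather than features. The dictionaries `base_params`, `light_touch`, and `heavy_rewrite` are synthetic stand-ins with hypothetical tensor names, not results from any real RL or SFT run.

```python
import numpy as np


def relative_drift(base: dict, tuned: dict) -> dict:
    """Per-tensor relative L2 change: ||tuned - base|| / ||base||."""
    return {
        name: float(
            np.linalg.norm(tuned[name] - base[name]) / np.linalg.norm(base[name])
        )
        for name in base
    }


rng = np.random.default_rng(0)

# Hypothetical pre-trained weights, keyed by made-up tensor names.
base_params = {
    "layer0.weight": rng.normal(size=(16, 16)),
    "layer1.weight": rng.normal(size=(16, 16)),
}

# Synthetic checkpoints: one nudged slightly, one perturbed heavily.
light_touch = {
    name: w + 0.01 * rng.normal(size=w.shape) for name, w in base_params.items()
}
heavy_rewrite = {
    name: w + 0.5 * rng.normal(size=w.shape) for name, w in base_params.items()
}

light_drift = relative_drift(base_params, light_touch)
heavy_drift = relative_drift(base_params, heavy_rewrite)

for name in base_params:
    print(f"{name}: light={light_drift[name]:.3f} heavy={heavy_drift[name]:.3f}")
```

Small parameter drift does not by itself prove that features were preserved, so a check like this is best paired with an activation-level comparison on held-out, out-of-domain inputs.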

Research Significance: Implications for LLM Training Practice

The value of this research lies not only in its theoretical explanations but also in its guidance for practical training strategies.

First, it provides a theoretical basis for choosing post-training paradigms. In the past, the industry relied more on empirical intuition when choosing between RL and SFT. This research clarifies the capability boundaries of both approaches at the mechanistic level, helping researchers make more rational choices based on specific objectives.

Second, it offers new ideas for designing hybrid training strategies. With an understanding of the feature-level mechanisms behind RL generalization, researchers can explore how to introduce similar "feature protection" mechanisms into SFT, or design more refined joint RL-SFT training schemes.

Third, it advances interpretability research. The feature-level analysis methodology proposed in this paper provides a reusable framework for studying LLM internal mechanisms, with potential for broader application in analyzing additional post-training techniques.

Industry Context and Broader Impact

This research arrives as RL post-training is being adopted at massive scale across the industry. From OpenAI's RLHF to the GRPO algorithm used for DeepSeek-R1, from Anthropic's Constitutional AI for Claude to various reasoning-training methods built on rule-based rewards, reinforcement learning has become a standard component in building high-capability LLMs. Until now, however, the academic community has lacked sufficiently deep answers to the fundamental question of "why RL works."

This study helps fill that theoretical gap and resonates with several recent related works. For example, prior research found that RL post-training does not so much enable models to "learn new knowledge" as activate existing capabilities. The feature-level evidence in this paper supports that macro-level observation from a microscopic perspective.

Outlook: From "Knowing That It Works" to "Understanding Why It Works"

This research marks an important step in the shift of LLM post-training research from "results-driven" to "mechanism-driven." Looking ahead, we can reasonably expect progress in the following directions:

  • More fine-grained feature attribution analysis to precisely locate key knowledge modules "activated" during RL post-training
  • Novel post-training algorithms based on mechanistic understanding that achieve better balance between generalization capability and training efficiency
  • Cross-model and cross-scale mechanistic comparison studies to explore whether generalization mechanisms are universal

Moving from "knowing that RL works" to "understanding why RL works" is more than academic progress: it is also likely to shape the design of training paradigms for the next generation of large language models.