📑 Table of Contents

Safe but Useless? New Benchmark Exposes the LLM Alignment Dilemma

📅 · 📁 Research · 👁 12 views · ⏱️ 7 min read
💡 A research team has introduced CarryOnBench, the first benchmark to systematically evaluate whether large language models can recover usefulness while maintaining safety in multi-turn conversations, revealing the severe "over-refusal" problem caused by current safety alignment techniques.

The Hidden Cost of Safety Alignment: Models Become 'Safe' but No Longer 'Useful'

Current safety alignment techniques for large language models (LLMs) are facing a long-overlooked paradox — models are becoming increasingly robust against malicious attacks, yet growing overly cautious when facing legitimate requests from well-intentioned users, often resorting to blanket refusals. A recent paper published on arXiv introduces an interactive benchmark called "CarryOnBench," the first to systematically measure whether LLMs can recover usefulness while maintaining safety in multi-turn dialogue scenarios, providing a quantitative evaluation framework for this critical issue.

CarryOnBench: The First Interactive Benchmark Focused on 'Usefulness Recovery'

The paper notes that virtually all existing LLM safety evaluations focus on a single dimension — whether a model can refuse harmful requests. However, real-world conversational scenarios are far more complex. Many user queries may "appear harmful" on the surface but carry entirely legitimate intentions. For example, a medical researcher asking about the chemical mechanisms of a toxin, or a novelist requesting help writing a conflict scene — these requests might superficially trigger a model's safety mechanisms but are fundamentally well-intentioned and reasonable.

CarryOnBench is designed precisely around this pain point. Starting from 398 "seemingly harmful" queries, the benchmark simulates the complete interaction process where well-intentioned users clarify their intentions through subsequent dialogue turns after being initially refused by the model. The benchmark evaluates two core capabilities:

  • Safety Maintenance: Whether the model consistently upholds safety boundaries throughout the entire multi-turn conversation without being guided into generating harmful content
  • Usefulness Recovery: Whether the model can correct its judgment of user intent after the user clearly clarifies their benign intentions, transitioning from refusal to providing valuable assistance

Exposing an Industry Pain Point: The 'Over-Refusal' Problem Is Worse Than Expected

This research touches on an increasingly prominent contradiction in current LLM development. As major model providers continue to strengthen safety alignment training, the "over-refusal" phenomenon has become a significant bottleneck for user experience. Many users report that when facing sensitive but legitimate questions involving medical, legal, or safety topics, models often adopt a "better safe than sorry" strategy, delivering formulaic refusal responses.

More critically, traditional safety evaluation systems fail to capture this problem. If "whether harmful requests are refused" is the sole metric, a model that refuses all requests would score perfectly — yet it would obviously be useless. CarryOnBench's innovation lies in placing "safety" and "usefulness" within the same evaluation framework, forcing researchers to confront the tension between the two.

Multi-Turn Dialogue Capability Emerges as a New Evaluation Dimension

Notably, CarryOnBench extends the evaluation scenario from single-turn Q&A to multi-turn dialogue — a design choice of significant importance. In real-world usage, user interactions with AI are rarely one-off exchanges. When a model refuses a request, a well-intentioned user's natural response is to provide additional context, clarify the background, or rephrase their needs. Whether a model can understand this follow-up information and dynamically adjust its judgment is an important measure of its "intelligence."

This effectively imposes higher-level requirements on models: beyond content safety judgment, they need contextual understanding, intent reasoning, and flexible decision-adjustment capabilities. A truly excellent AI assistant should be able to both identify genuine malicious attacks and firmly refuse them, while also promptly "course-correcting" and providing help once a user clarifies their benign intentions.

Far-Reaching Implications for Safety Alignment Research

The introduction of CarryOnBench brings a new direction of thinking to the LLM safety alignment field. For a long time, safety alignment research has focused on how to make models "refuse better," but this work reminds the research community that "what happens after refusal" is equally important.

From a technical pathway perspective, this could drive development in several directions:

  1. More Fine-Grained Intent Recognition: Models need to upgrade from "keyword-level" safety judgments to "context-level" intent understanding
  2. Dynamic Safety Policies: Safety mechanisms should not be static binary switches but should adjust dynamically based on conversation progression
  3. Redefinition of Alignment Objectives: Incorporating "usefulness" into the core optimization objectives of safety alignment, rather than treating it as a secondary metric

Outlook: Finding the Optimal Balance Between Safety and Usefulness

The core insight of this research is that true AI safety is not simply about "refusing all suspicious requests" but about finding a precise balance between safety and usefulness. A model that is "safe but useless" may ultimately drive users toward less secure alternatives, creating even greater risks.

As LLMs are increasingly deployed in professional domains such as healthcare, education, and law, enabling models to uphold safety boundaries on sensitive topics without sacrificing practical value will become a central challenge for the next phase of alignment research. CarryOnBench provides a much-needed evaluation tool for this direction, and its subsequent impact deserves continued attention.