DO-Bench: A New Benchmark for Precisely Diagnosing Object Hallucination in Vision-Language Models

💡 A research team has introduced DO-Bench, a benchmark that for the first time decomposes the causes of object hallucination in vision-language models into two dimensions (perceptual deficiencies and textual prior biases), offering a framework for diagnosing model reliability.

Object Hallucination: A Reliability Challenge for Vision-Language Models

Vision-language models (VLMs) have made remarkable progress in recent years on tasks such as image-text understanding and visual question answering, yet a core reliability issue persists: object-level hallucination. Put simply, when a model is asked whether a certain object exists in an image, it may confidently give the wrong answer, either claiming to see an object that is not there or overlooking one that clearly is.

A recent paper posted to arXiv (arXiv:2604.22822) introduces DO-Bench, a new diagnostic benchmark that aims to analyze the root causes of object hallucination and give the field a finer-grained evaluation tool.

DO-Bench: A Paradigm Shift from Measuring Results to Measuring Causes

Existing hallucination benchmarks mostly measure overall accuracy, that is, how many questions the model answered correctly and how many it got wrong. This coarse-grained approach has a critical blind spot: it cannot distinguish why the model made each error.

The research team points out that object hallucination typically stems from two distinct mechanisms:

  • Perceptual deficiencies: The model's visual encoder fails to correctly identify objects in the image, so the visual information is already wrong before any language processing takes place.
  • Textual prior interference: The model's language module is influenced by contextual textual cues, causing it to "fill in" objects that do not exist in the image based on statistical patterns in language. For example, when an image shows a kitchen scene, language priors may lead the model to assume a "refrigerator" must be present.

DO-Bench's core design idea is to isolate these two failure mechanisms through a controlled experimental paradigm. Researchers can then learn not only that the model "got it wrong" but also why it got it wrong: whether the "eyes" failed to see clearly or the "brain" was fabricating things.
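
The idea of separating the two failure modes can be illustrated with a minimal sketch. The details of DO-Bench's actual protocol are not given here, so everything below is an assumption made for illustration: the two probing conditions (a neutral prompt and a prompt with suggestive text context), the `ProbeResult` type, and the labeling rule in `attribute` are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """One object probed under two hypothetical conditions."""
    object_present: bool   # ground truth: is the object in the image?
    answer_neutral: bool   # model's yes/no answer to a neutral prompt
    answer_primed: bool    # model's yes/no answer with suggestive text context


def attribute(result: ProbeResult) -> str:
    """Assign a coarse label to one probe (illustrative rule, not DO-Bench's)."""
    neutral_ok = result.answer_neutral == result.object_present
    primed_ok = result.answer_primed == result.object_present
    if neutral_ok and primed_ok:
        return "correct"
    if not neutral_ok:
        # Wrong even without suggestive text: treat as a perception failure.
        return "perceptual"
    # Right with the neutral prompt, wrong once text context is added.
    return "textual_prior"


def summarize(results):
    """Count how many probes fall under each label."""
    return Counter(attribute(r) for r in results)


# Made-up probe outcomes, for illustration only.
probes = [
    ProbeResult(object_present=True, answer_neutral=True, answer_primed=True),
    ProbeResult(object_present=False, answer_neutral=False, answer_primed=True),
    ProbeResult(object_present=True, answer_neutral=False, answer_primed=False),
]
summary = summarize(probes)
print(dict(summary))
```

The point of the sketch is only that a paired, controlled comparison turns a single "wrong" into a labeled cause; a real benchmark would need far more careful controls than this two-condition rule.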

Technical Significance: Providing Precise Navigation for Model Improvement

This "attributable" diagnostic approach holds significant practical value for the iterative optimization of vision-language models.

First, for model developers: if a model's hallucinations stem primarily from perceptual deficiencies, optimization should focus on the visual encoder, for example by adopting a stronger image feature extraction network or increasing image resolution. Conversely, if the problem lies in textual prior biases, the work should target the language side, such as the decoding strategy and the debiasing of training data.

Second, DO-Bench's design offers finer-grained dimensions for comparing models fairly. Two models with similar overall accuracy may have entirely different distributions of hallucination causes, which is crucial for understanding the strengths and weaknesses of each architecture.

Third, this attribution-based evaluation also lays the groundwork for future safety alignment efforts. In high-risk scenarios such as medical image analysis and autonomous driving, distinguishing between a model's perceptual errors and reasoning biases has direct implications for developing targeted safety strategies.
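
To make the first two points concrete, the sketch below compares two hypothetical models that have identical accuracy but different error profiles. The counts are invented for illustration and are not results from DO-Bench; the label names `perceptual` and `textual_prior` and the helper functions are likewise assumptions.

```python
def accuracy(counts):
    """Fraction of probes answered correctly."""
    total = sum(counts.values())
    return counts.get("correct", 0) / total if total else 0.0


def cause_shares(counts):
    """Each failure cause as a fraction of all errors."""
    errors = {k: v for k, v in counts.items() if k != "correct"}
    total_errors = sum(errors.values())
    if total_errors == 0:
        return {k: 0.0 for k in errors}
    return {k: v / total_errors for k, v in errors.items()}


def dominant_cause(counts):
    """The most frequent failure cause, or None if there are no errors."""
    errors = {k: v for k, v in counts.items() if k != "correct" and v > 0}
    return max(errors, key=errors.get) if errors else None


# Invented counts: both models answer 80 of 100 probes correctly.
model_a = {"correct": 80, "perceptual": 16, "textual_prior": 4}
model_b = {"correct": 80, "perceptual": 5, "textual_prior": 15}

for name, counts in [("model_a", model_a), ("model_b", model_b)]:
    print(name, accuracy(counts), cause_shares(counts), dominant_cause(counts))
```

Under these made-up numbers an accuracy-only leaderboard would rank the two models as equal, while the breakdown suggests working on the visual encoder for the first and on the language side for the second.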

Industry Context: Hallucination Evaluation Continues to Heat Up

Object hallucination is not a new topic: several evaluation benchmarks, including POPE and CHAIR, have already been widely adopted. However, as multimodal large models such as GPT-4o, Gemini, and Qwen-VL rapidly advance in capability, the research community expects finer granularity from its evaluation tools.

From POPE's binary judgment accuracy, to CHAIR's description-level hallucination statistics, to DO-Bench's newly proposed "attributable diagnosis," hallucination evaluation is continuously evolving along a trajectory from coarse to fine, from surface phenomena to underlying mechanisms. This trend reflects the community's gradual shift from a "model capability race" toward a deeper concern for "model reliability building."
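
For readers unfamiliar with the two earlier metrics, here is a minimal sketch of how they are commonly computed. It assumes the object mentions have already been extracted from each caption as a set (the real CHAIR pipeline also maps words to a fixed object vocabulary with synonyms, which is omitted here), and the function names and sample data are illustrative only.

```python
def pope_accuracy(answers):
    """POPE-style accuracy over (predicted_yes, object_present) pairs."""
    if not answers:
        return 0.0
    return sum(pred == truth for pred, truth in answers) / len(answers)


def chair_scores(captions):
    """CHAIR-style scores over (mentioned_objects, ground_truth_objects) pairs.

    Returns (chair_i, chair_s):
      chair_i: hallucinated object mentions / all object mentions
      chair_s: captions with at least one hallucinated object / all captions
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, truth in captions:
        hallucinated = mentioned - truth  # mentioned but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            hallucinated_captions += 1
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_captions / len(captions) if captions else 0.0
    return chair_i, chair_s


# Made-up examples, for illustration only.
answers = [(True, True), (True, False), (False, False), (False, True)]
captions = [
    ({"dog", "frisbee"}, {"dog"}),   # "frisbee" is hallucinated
    ({"cat"}, {"cat", "sofa"}),      # nothing hallucinated
]
print(pope_accuracy(answers))
print(chair_scores(captions))
```

Both functions report how often hallucination occurs; neither says anything about its cause, which is the gap the attribution-based approach is meant to fill.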

Outlook: Toward More Trustworthy Multimodal AI

The release of DO-Bench points to a more mature phase of vision-language model evaluation. As similar attribution-based diagnostic tools continue to improve, researchers may be able to build a comprehensive "atlas of model failure mechanisms" and tackle the reliability bottlenecks of multimodal AI systematically.

Promisingly, this approach of tightly coupling evaluation with diagnosis is also expected to extend to other hallucination types, such as attribute hallucination and relational hallucination, paving the way for truly trustworthy multimodal AI systems.