Study Reveals 'Verbal Confidence Saturation' Phenomenon in Small Open-Source LLMs
Introduction: When AI Says 'I'm Sure,' Is It Really Sure?
When prompted to do so, large language models (LLMs) can attach a confidence statement to an answer, for example "I'm 90% confident." This technique, known as verbal confidence elicitation, is widely used to extract uncertainty estimates from LLMs and is often treated as an indicator of how far an answer can be trusted. However, a preregistered study newly published on arXiv (arXiv:2604.22215) reports a concerning phenomenon: small-to-medium-scale open-source instruction-tuned models show severe "saturation" when expressing confidence, and their stated confidence values almost entirely fail to reflect whether their answers are correct.
Core Findings: All Seven Mainstream Models Failed Validity Tests
The research team preregistered the study on the Open Science Framework (OSF) (osf.io/azbvx), which makes the experimental design transparent and fixes the analysis plan in advance. The researchers selected seven instruction-tuned open-source models from four model families, ranging from 3B to 9B parameters, and tested them on 524 TriviaQA questions.
The experiment elicited numerical confidence on a 0–100 scale and used greedy decoding so that outputs were deterministic. The core metric of interest was "item-level Type-2 discrimination," that is, whether a model's confidence values effectively distinguish its correct answers from its incorrect ones.
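One common way to quantify Type-2 discrimination is the area under the ROC curve (AUROC) of confidence against correctness. The sketch below is illustrative only: the article does not say which statistic the study used, and the function name and data are hypothetical.

```python
def type2_auroc(confidences, correct):
    """Probability that a randomly chosen correct answer received higher
    confidence than a randomly chosen incorrect one (ties count as 0.5)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 0-100 confidence values and answer correctness.
labels = [True, True, False, False]
informative = type2_auroc([90, 80, 40, 30], labels)  # confidence tracks correctness
saturated = type2_auroc([95, 95, 95, 95], labels)    # same value for every answer
print(informative, saturated)
```

A value near 1.0 means confidence separates correct from incorrect answers well; a value near 0.5 means it carries no item-level information, which is what a fully saturated model produces.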
The results were alarming: none of the tested models met minimum psychometric validity standards. Specifically, the models exhibited pronounced verbal confidence saturation: whether or not an answer was correct, they tended to output confidence values clustered in the high range. The resulting distributions were severely skewed, so stated confidence lost its discriminative function as an uncertainty indicator.
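Saturation of this kind can be made visible with a few descriptive statistics. The following is a minimal sketch, assuming 0–100 confidence values; the `ceiling=90` cutoff and the sample values are hypothetical choices, not figures from the study.

```python
from collections import Counter


def saturation_summary(confidences, ceiling=90):
    """Describe how tightly stated confidence clusters at the top of the scale."""
    counts = Counter(confidences)
    modal_value, modal_count = counts.most_common(1)[0]
    n = len(confidences)
    return {
        # Share of responses at or above the (assumed) high-confidence cutoff.
        "share_at_or_above_ceiling": sum(c >= ceiling for c in confidences) / n,
        # How many different confidence values the model ever used.
        "distinct_values": len(counts),
        # The single most frequent value and how often it appears.
        "modal_value": modal_value,
        "modal_share": modal_count / n,
    }


# Hypothetical confidence outputs for eight questions.
summary = saturation_summary([95, 95, 100, 95, 90, 95, 95, 100])
print(summary)
```

A high ceiling share combined with very few distinct values is the pattern the article describes: the scale is barely used, so there is little variation left to align with correctness.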
Technical Analysis: Why Do Small Models Struggle with Self-Awareness?
Side Effects of Instruction Tuning
The core objective of instruction tuning is to make models better at following user instructions and generating responses that align with human expectations. However, this process may introduce an implicit bias: models are reinforced during training to "answer questions confidently," which skews their confidence expressions toward high values. The effect may be more pronounced in smaller models, whose limited representational capacity could make it harder to encode fine-grained uncertainty information.
Greedy Decoding Amplifies the Saturation Effect
Greedy decoding selects the highest-probability token at each generation step, so when a model generates a confidence number it tends to emit the numerical pattern most common in its training data. Since high-confidence expressions (such as "95%" or "100%") are likely more prevalent in training corpora, greedy decoding may further push confidence values toward the high end.
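The mechanism can be illustrated with a toy example. The next-token distributions below are invented for illustration and are not taken from any model; the point is only that greedy decoding returns the mode and discards how peaked or flat the distribution is.

```python
def greedy_pick(token_probs):
    """Return the single most probable token, ignoring the rest of the distribution."""
    return max(token_probs, key=token_probs.get)


def expected_confidence(token_probs):
    """Probability-weighted mean of the numeric tokens, for comparison."""
    return sum(float(token) * prob for token, prob in token_probs.items())


# Hypothetical distributions over the confidence number a model might emit.
easy_question = {"95": 0.70, "80": 0.20, "50": 0.10}
hard_question = {"95": 0.36, "80": 0.34, "50": 0.30}

picks = [greedy_pick(easy_question), greedy_pick(hard_question)]
means = [expected_confidence(easy_question), expected_confidence(hard_question)]
print(picks, means)
```

Both questions yield the same greedy output even though the second distribution is much flatter, so whatever uncertainty the distribution held does not reach the stated number.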
Limitations of Model Scale
While models with 3B to 9B parameters perform admirably on many tasks, they may not yet have developed sufficient metacognitive ability, that is, the capacity to "know what they know and don't know." Whether larger models (such as 70B and above) calibrate their confidence outputs better remains to be verified through further research.
Methodological Value of the Study
Of particular note is the study's use of a preregistered experimental design. In the AI research community, preregistration has not yet become standard practice, but it helps guard against issues such as HARKing (Hypothesizing After Results are Known) and selective reporting. The research team publicly registered their hypotheses, methods, and analysis plans in advance, setting a methodological benchmark for LLM evaluation research.
Additionally, the study chose a psychometric framework to evaluate the quality of LLM confidence, rather than relying solely on traditional metrics such as calibration error. This interdisciplinary perspective offers fresh approaches to AI reliability assessment. Type-2 discrimination, a concept from signal detection theory, asks whether confidence separates correct from incorrect responses item by item. That is a stricter validity test than a calibration curve, which can look acceptable on average even when confidence carries no item-level information.
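The gap between calibration and discrimination can be shown with a small constructed case. This sketch uses expected calibration error (ECE) with equal-width bins and a pairwise AUROC; both are standard textbook definitions chosen for illustration, not necessarily the study's metrics, and the data are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted mean of |accuracy - mean confidence|; confidences in [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / len(confidences) * abs(accuracy - mean_conf)
    return ece


def auroc(confidences, correct):
    """Pairwise AUROC of confidence against correctness; ties count as 0.5."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    score = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return score / (len(pos) * len(neg))


# Hypothetical model: always says 0.75 and happens to be right 3 times out of 4.
confidences = [0.75, 0.75, 0.75, 0.75]
correct = [True, True, True, False]

ece = expected_calibration_error(confidences, correct)
discrimination = auroc(confidences, correct)
print(ece, discrimination)
```

Here the calibration error is zero because stated confidence matches overall accuracy, yet the AUROC is 0.5: the confidence value says nothing about which individual answer is the wrong one.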
Industry Impact: A Credibility Crisis for Uncertainty Estimation
This finding carries important cautionary implications for multiple application scenarios:
- Medical AI-Assisted Diagnosis: If models cannot accurately express uncertainty about their own judgments, physicians will struggle to determine when to trust AI recommendations and when to rely on their own expertise.
- Automated Decision-Making Systems: In high-stakes scenarios such as financial risk management and legal consulting, unreliable confidence estimates could lead to serious decision-making errors.
- AI Safety and Alignment: Model "overconfidence" is essentially an alignment failure that causes users to place undue trust in AI outputs.
Many current deployment solutions based on small open-source models rely on verbal confidence as a trigger mechanism for output filtering or human-AI collaboration. The effectiveness of these approaches needs to be re-examined.
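To see why such triggers deserve re-examination, consider a minimal sketch of a confidence gate. The `route` function, the threshold of 80, and the sample data are all hypothetical and stand in for whatever filtering rule a deployment might use.

```python
def route(confidence, threshold=80):
    """Accept an answer automatically when stated confidence clears the threshold,
    otherwise send it to human review."""
    return "auto" if confidence >= threshold else "review"


# Hypothetical saturated outputs: (stated confidence, answer was correct).
answers = [(95, True), (95, False), (100, False), (90, True)]

routed = [route(conf) for conf, _ in answers]
errors_passed = sum(
    1 for conf, ok in answers if route(conf) == "auto" and not ok
)
print(routed, errors_passed)
```

When every confidence value sits above the threshold, nothing is ever routed to review, and the gate passes incorrect answers at the same rate as correct ones.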
Outlook: Paths Toward Reliable Self-Assessment
The study points to several possible directions for future improvement:
First, exploring better confidence elicitation strategies. Beyond direct numerical questioning, methods such as multi-turn dialogue elicitation, contrastive elicitation, or metacognitive reasoning combined with Chain-of-Thought may help mitigate the saturation phenomenon.
Second, incorporating confidence calibration objectives during the instruction tuning phase. By including samples with calibrated confidence annotations in training data or designing specialized calibration loss functions, it may be possible to fundamentally improve models' self-assessment capabilities.
Third, systematically comparing metacognitive capability across models of different parameter scales. This would help the community identify the scale, if any, above which verbal confidence becomes a reliable uncertainty signal.
Fourth, developing uncertainty estimation methods that do not depend on verbal output, such as leveraging internal model representations (e.g., hidden layer activations) or sampling consistency as alternative indicators.
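As one example of the last direction, sampling consistency can be sketched as the share of sampled answers that agree with the most frequent one. This is a simplified illustration, assuming the answers were already sampled from a model at non-zero temperature; the function name, the lowercase-and-strip normalization, and the data are hypothetical.

```python
from collections import Counter


def consistency_confidence(sampled_answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    normalized = [answer.strip().lower() for answer in sampled_answers]
    majority, count = Counter(normalized).most_common(1)[0]
    return majority, count / len(normalized)


# Hypothetical answers sampled five times for the same question.
answer, agreement = consistency_confidence(
    ["Paris", "paris", "Lyon", "Paris ", "Marseille"]
)
print(answer, agreement)
```

The agreement score comes from the model's behavior across samples rather than from a number it states, so it is not subject to the same pull toward "95%" or "100%". Real answers usually need more careful matching than simple string normalization.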
In conclusion, this study, with its rigorous preregistered experimental design and psychometric perspective, sounds an alarm for the AI community: on small-to-medium-scale open-source models, the seemingly simple and intuitive strategy of "letting AI state how confident it is" may be far less reliable than we assumed. As we pursue ever more powerful model capabilities, teaching AI to truly "know its limits" remains a critical challenge that urgently demands solutions.
📌 Source: GogoAI News (www.gogoai.xin)
🔗 Original: https://www.gogoai.xin/article/small-open-source-llms-verbal-confidence-saturation-study