Abstract
Language models today are trained to convey confidence in their outputs, regardless of whether those outputs are correct. The alignment methods we use to make them helpful also push them toward unwarranted certainty, rewarding decisive answers over appropriate hedging. As these foundation models enter high-stakes domains such as science and medicine, this disconnect between how sure they sound and how accurate they are can become dangerous. Here, we examine why post-training degrades a model’s sense of uncertainty, and we review techniques that can bring expressed confidence back in line with actual reliability. Through this, we argue that trustworthy AI means treating calibration as a core design goal.
Introduction
In a recent cross-sectional evaluation, a group of clinicians tested several large language models on routine medical questions. The models performed strongly on the evaluation set. They could fluently and correctly list drug interactions, sketch differential diagnoses, and summarize guidelines. But when the clinicians asked the same models to rate their own confidence, a more unsettling pattern emerged. The systems were almost always sure of themselves. They were nearly as confident when they were wrong as when they were right [1].
The correspondence between confidence and correctness is what we call calibration. A calibrated model that claims to be 90% certain should be wrong about one time in ten, not half the time. Humans are far from perfect at this, but we at least have ways to signal doubt (e.g., a pause, a hedge, a referral to a colleague).
Researchers have noticed that language models, before alignment, tend to be well calibrated [2]. But these pre-trained models are not directly useful: they are trained only to predict text continuations, not to answer questions or follow instructions. Post-training alignment adapts them for conversational use. However, in modifying the model’s outputs to make them helpful, we degrade this calibration [3]. As a result, we have built a generation of aligned models that sound eloquent and well-behaved, yet are systematically overconfident.
How alignment inflates confidence
Why does alignment deform calibration so badly? The answer lies in how we set up our post-training loss functions and the implicit preferences encoded in human feedback.
During pre-training, an LLM is trained to predict the next token in text. For each context, there are many plausible continuations. The model learns to approximate a distribution over those possibilities. That objective naturally leaves room for both uncertainty and multiple answers with non-trivial probability.
Post-training changes the objective. In supervised fine-tuning, we give the model a prompt and a single reference answer, then optimize a cross-entropy loss on that sequence. The loss is lowest when the model puts almost all its probability mass on that reference and almost none on alternatives. There is no incentive inside this objective to leave probability for other plausible completions or to reflect ambiguity. Over time, the model learns that the safest strategy for minimizing loss is to be extremely sure.
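A toy illustration (the numbers are invented, not drawn from any cited model) makes the incentive concrete: against a single one-hot reference, cross-entropy rewards concentrating probability mass and penalizes leaving any on plausible alternatives.

```python
import math

def cross_entropy(p, reference):
    # Loss is the negative log-probability assigned to the single reference answer.
    return -math.log(p[reference])

# Three plausible continuations; index 0 is the reference used in fine-tuning.
spread = [0.50, 0.30, 0.20]  # honest uncertainty over the alternatives
sharp = [0.98, 0.01, 0.01]   # nearly all mass on the reference

print(cross_entropy(spread, 0))  # ≈ 0.693
print(cross_entropy(sharp, 0))   # ≈ 0.020
```

Even when the alternatives are genuinely plausible, the sharp distribution earns a loss roughly 34 times lower, so gradient descent steadily erodes hedged probabilities.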
Preference-based methods such as Reinforcement Learning from Human Feedback (RLHF) and direct preference optimization (DPO) push in the same direction. Rather than matching a fixed answer, the model is updated so that its outputs align with those that a reward model scores highly. This tends to sharpen the distribution over answers and concentrate probability on a small set of “preferred” continuations that humans liked during training.
Recent work on aligned models documents this effect clearly. Xiao and colleagues [3] show that pre-trained models are often reasonably calibrated, but that preference alignment pushes them into a regime where probabilities become unreliable, even as task performance improves. Once models cross this threshold, further fine-tuning on the same kind of reward no longer restores calibration.
Fine-tuning also interacts with what the model already knows. Wang and co-authors [4] find that when fine-tuning data repeats knowledge already present from pre-training, models tend to become overconfident on that data. When they are trained on genuinely new information, calibration is better. In other words, prior knowledge makes models versatile, but the overlap between pre-training and fine-tuning can push their confidence well beyond what performance justifies.
On top of these numerical effects, alignment injects a stylistic bias. Human raters commonly prefer answers that sound clear and decisive. Hedged or uncertain language is often scored as less helpful. Models pick up that preference and learn that a confident tone earns reward. Over time, this style bleeds into the probabilities themselves. High-confidence language and high-confidence numbers reinforce each other.
We have, unintentionally, trained our models to project certainty.
How we fix this
Temperature scaling and other post-hoc calibration
The simplest place to intervene is at the last step, after the model has produced logits or raw scores. Temperature scaling, introduced by Guo and colleagues [5], learns a single scalar temperature that rescales logits before the softmax, so that predicted probabilities better match observed frequencies on a held-out set. It is popular in vision and tabular settings because it is easy to implement and does not change accuracy, only the sharpness of the probabilities [6].
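In its simplest form, the method can be sketched as follows (a minimal pure-Python version, with a grid search standing in for the usual gradient-based fit; the held-out logits are invented for illustration):

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def held_out_nll(T, data):
    # Average negative log-likelihood of the true labels at temperature T.
    return -sum(math.log(softmax(z, T)[y]) for z, y in data) / len(data)

def fit_temperature(data):
    # A single scalar T is fit on held-out data; accuracy is untouched,
    # because dividing all logits by T never changes the argmax.
    grid = [0.1 * i for i in range(1, 51)]  # T in (0, 5]
    return min(grid, key=lambda T: held_out_nll(T, data))

# Hypothetical held-out set: confidently predicted logits, one of four wrong.
held_out = [([4.0, 0.0, 0.0], 0), ([3.5, 0.5, 0.0], 0),
            ([4.2, 0.1, 0.0], 1), ([3.8, 0.0, 0.2], 0)]
T = fit_temperature(held_out)  # T > 1 here, i.e. the probabilities are softened
```

For an overconfident model the fitted temperature comes out above 1, flattening every predicted distribution by the same amount.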
More flexible methods, such as Platt scaling [7, 8] and isotonic regression [9], learn a calibration map from raw scores to probabilities. In multiclass problems they are often applied per class or in a one-vs-rest fashion and can give excellent calibration when the label space is small and well covered by data. For language models with vocabularies of tens of thousands of tokens, however, it is hard to gather enough labelled examples to fit a separate non-parametric calibrator for each token, and the computational and storage costs grow quickly. In practice, post-hoc calibration works best when we reduce the problem to a modest number of options (e.g., yes/no, a handful of multiple-choice answers, or a small label set where a calibrator can be trained once and reused across models).
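In the binary case, Platt scaling amounts to fitting a two-parameter logistic map from raw scores to probabilities. A minimal sketch (synthetic scores and labels; plain gradient descent stands in for the Newton-style fit of the original method):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    # Learn P(correct | score) = sigmoid(A * score + B) by minimizing log loss.
    A, B = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(A * s + B) - y  # gradient of log loss w.r.t. the logit
            grad_A += err * s / n
            grad_B += err / n
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B

# Synthetic held-out scores with outcome labels (1 = the answer was correct).
scores = [2.5, 2.0, 1.5, 0.5, -0.5, -1.0, -1.5, -2.0]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
A, B = fit_platt(scores, labels)
calibrated = sigmoid(A * 2.5 + B)  # calibrated probability for a high score
```

The learned map is monotone in the score, so ranking is preserved; only the probabilities attached to each score change.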
Task-specific fine-tuning and better reward design
Post-hoc fixes are helpful, but they do not change the fact that standard alignment objectives push models toward overconfidence. Some recent work takes calibration into account during training itself. Xiao and co-authors [3] show that preference alignment can move models from a “calibratable” regime into a phase where probabilities become hard to repair, and they propose calibration-aware fine-tuning that adds an explicit regularizer for expected calibration error to the loss. This lets them preserve task performance while preventing the worst confidence inflation.
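Expected calibration error itself is straightforward to compute: predictions are binned by confidence, and the gaps between each bin’s average confidence and its accuracy are averaged, weighted by bin size. A minimal sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin predictions by confidence, then average |confidence - accuracy|
    # per bin, weighted by how many predictions fall in each bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += len(b) / total * abs(avg_conf - accuracy)
    return ece

# An overconfident model: always 95% sure, but right only half the time.
print(expected_calibration_error([0.95] * 4, [1, 0, 1, 0]))  # ≈ 0.45
```

A well-calibrated model scores near zero; the overconfident example above pays the full 0.45 gap between stated confidence and observed accuracy.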
Wang and colleagues [4] look at how prior knowledge interacts with fine-tuning. They find that overlapping “known” data drives overconfidence and introduce a cognition-aware framework (CogCalib) that treats known and unknown examples differently and improves calibration by more than 50% in some settings.
These ideas all point in the same direction. When we design objectives and reward models, we should reward appropriate uncertainty and abstention, not only correctness and fluency. That means including ambiguous and hard examples in alignment datasets and treating “I don’t know” as a success when the model truly lacks information.
Probing internal states
Another idea to restore calibration is to ignore the generative head altogether. In our recent work we developed PING [10], a simple probing framework that treats an LLM as a frozen feature extractor. A small probe, such as a shallow neural network, is trained on the model’s hidden states to predict which answer option is correct and with what probability. Because the probe only ever sees a fixed set of labels (for example, four choices on a multiple-choice exam), the calibration problem shrinks from “all possible tokens” to a small, well-defined label space. Across several benchmarks, PING matches or slightly exceeds generative accuracy while cutting expected calibration error by up to 96%, and it can recover knowledge that alignment has hidden behind refusals in clinical models.
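The probing setup can be sketched in a few lines. Here a linear softmax probe is trained on synthetic stand-ins for hidden states; real probes would consume the frozen LLM’s activations, and the data, dimensions, and training loop below are illustrative assumptions, not the PING implementation:

```python
import math
import random

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, x):
    # Probe output: a probability distribution over the answer options.
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def train_probe(feats, labels, n_options, lr=0.5, epochs=200):
    # Linear softmax probe trained with per-example gradient steps;
    # the underlying "model" supplying the features stays frozen.
    dim = len(feats[0])
    W = [[0.0] * dim for _ in range(n_options)]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            p = predict(W, x)
            for c in range(n_options):
                grad = p[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    W[c][j] -= lr * grad * x[j]
    return W

# Synthetic "hidden states": the correct option leaves a trace in the state.
random.seed(0)
feats, labels = [], []
for _ in range(40):
    y = random.randrange(4)
    x = [random.gauss(0.0, 0.2) for _ in range(4)]
    x[y] += 1.0
    feats.append(x)
    labels.append(y)

W = train_probe(feats, labels, n_options=4)
probs = predict(W, feats[0])  # distribution over the four answer options
```

Because the probe’s output space is just the handful of answer options, standard calibration tools apply directly to it, regardless of how the generative head behaves.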
Other groups report similar gains with different probes. InternalInspector [11] aggregates attention, feed-forward and activation states across layers and uses contrastive learning to estimate confidence and detect hallucinations, improving both calibration error and hallucination detection compared with methods that only look at the final layer. CCPS [12] takes a different route and perturbs the final hidden state, using the stability of that representation as a feature for a small classifier. Other work shows that internal states can signal hallucination risk even before an answer is generated, and that truthfulness information is often concentrated in specific token representations, which can be exploited by dedicated probes to detect likely errors [13].
The practical advantage is that these probes are lightweight, auditable, and task specific. They can be trained for the handful of decisions that matter in a given pipeline, while the underlying model stays frozen. In that sense, probing is not a competitor to temperature scaling or better rewards, but a complementary layer that lets us tap into what the model already knows without inheriting all of its stylistic overconfidence.
Conclusion
As we bring large language models into clinics, labs, and other high-stakes settings, “good on benchmarks but badly calibrated” is an unacceptable standard. An AI system that is as sure of its hallucinations as of its facts is a failure of design.
The methods reviewed here represent meaningful progress. Temperature scaling, calibration-aware training, and probing techniques can recover probability estimates that better reflect what a model has learned. For structured tasks with defined answer sets, these approaches allow users to extract calibrated probabilities they can trust. A clinician deciding whether to act on a model’s suggestion, or a researcher weighing its outputs against other evidence, benefits greatly from knowing when the system is uncertain.
Yet these techniques operate on internal representations. They do not change the text the model generates. A model with perfectly calibrated probabilities may still produce assertive, unhedged prose, because that is the style alignment has trained it to favor. The confidence a user encounters in a response is linguistic, and the gap between expressed certainty and internal probability remains largely unaddressed.
Closing this gap should be a central goal for the field. Promising directions include training models to express uncertainty in natural language that reflects their internal states and building interfaces that surface calibration information alongside generated text. The former may require rethinking the reward signals that currently favor assertive responses. If AI is to serve the public good, honest communication cannot be an afterthought. As researchers in the field, we need to ensure that it is a core design goal.
Author contribution
All authors contributed to the ideas, writing, and review of the manuscript.
Declarations
Competing interests
The authors have no competing interests to declare.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Graciela Gonzalez-Hernandez, Email: Graciela.GonzalezHernandez@csmc.edu.
Nicholas P. Tatonetti, Email: Nicholas.tatonetti@csmc.edu.
References
- 1.Omar M, Agbareia R, Glicksberg BS, Nadkarni GN, Klang E. Benchmarking the confidence of large language models in answering clinical questions: cross-sectional evaluation study. JMIR Med Inform. 2025;13:e66917.
- 2.OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 technical report. 2023; Available from: http://arxiv.org/abs/2303.08774.
- 3.Xiao J, Hou B, Wang Z, Jin R, Long Q, Su WJ, et al. Restoring calibration for aligned large language models: a calibration-aware fine-tuning approach. 2025.
- 4.Wang Z, Shi Z, Zhou H, Gao S, Sun Q, Li J. Towards objective fine-tuning: how LLMs’ prior knowledge causes potential poor calibration? In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics. 2025;1:14830–53.
- 5.Guo C, Pleiss G, Sun Y, Weinberger KQ. On calibration of modern neural networks. 2017.
- 6.Minderer M, Djolonga J, Romijnders R, Hubis F, Zhai X, Houlsby N et al. Revisiting the calibration of modern neural networks. 2021.
- 7.Platt J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif. 1999.
- 8.Spiess C, Gros D, Pai KS, Pradel M, Rabin MRI, Alipour A et al. Calibration and correctness of language models for code. In: 2025 IEEE/ACM 47th international conference on software engineering (ICSE). IEEE. 2025;540–52.
- 9.Zadrozny B, Elkan C. Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM. 2002;694–9.
- 10.Berkowitz J, Kivelson S, Srinivasan A, Gisladottir U, Tsang KK, Acitores Cortina JM, et al. Probing Hidden states for calibrated, alignment-resistant predictions in LLMs. 2025.
- 11.Beigi M, Shen Y, Yang R, Lin Z, Wang Q, Mohan A, et al. InternalInspector I2: robust confidence estimation in LLMs through internal states. In: Findings of the Association for Computational Linguistics: EMNLP 2024. Stroudsburg, PA, USA: Association for Computational Linguistics. 2024;12847–65.
- 12.Khanmohammadi R, Miahi E, Mardikoraem M, Kaur S, Brugere I, Smiley C, et al. Calibrating LLM confidence by probing perturbed representation stability. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. 2025;10448–514. 10.18653/v1/2025.emnlp-main.530.
- 13.Ji Z, Chen D, Ishii E, Cahyawijaya S, Bang Y, Wilie B, et al. LLM internal states reveal hallucination risk faced with a query. 2024.
