Nature. 2024 Jun 19;630(8017):625–630. doi: 10.1038/s41586-024-07421-0

Fig. 2. Detecting confabulations in sentence-length generations.


Semantic entropy outperforms leading baselines and naive entropy. AUROC (scored on the y-axes) measures how well each method predicts LLM mistakes, which correlate with confabulations. AURAC (also scored on the y-axes) measures the performance improvement of a system that refuses to answer questions judged likely to cause confabulations. Results are averaged over five datasets; individual per-dataset metrics are provided in the Supplementary Information.
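To make the two metrics concrete, the following is an illustrative sketch (not the authors' code) of how AUROC and a rejection-accuracy area could be computed from per-question uncertainty scores. Here `scores` stands for an uncertainty estimate such as semantic entropy, and `is_mistake` flags whether the model's answer was judged wrong; both names are assumptions for this example, and the AURAC variant shown (averaging accuracy over all retention thresholds) is one plausible formulation.

```python
def auroc(scores, is_mistake):
    """Probability that a randomly chosen mistake receives a higher
    uncertainty score than a randomly chosen correct answer
    (ties count as half)."""
    pos = [s for s, m in zip(scores, is_mistake) if m]
    neg = [s for s, m in zip(scores, is_mistake) if not m]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def aurac(scores, is_mistake):
    """Area under a rejection-accuracy curve: refuse the most uncertain
    questions first, and average the accuracy on the retained set over
    all retention levels."""
    # Sort questions from most confident (lowest score) to least.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    correct = [not is_mistake[i] for i in order]
    accuracies = []
    running_correct = 0
    for kept, c in enumerate(correct, start=1):
        running_correct += c
        accuracies.append(running_correct / kept)  # accuracy keeping k most confident
    return sum(accuracies) / len(accuracies)
```

On a toy set where the two mistakes also have the two highest uncertainty scores, `auroc` returns 1.0 (perfect ranking), while `aurac` reflects the accuracy gain from refusing the least confident answers first.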