2024 Jun 19;630(8017):625–630. doi: 10.1038/s41586-024-07421-0

Fig. 3. Detecting GPT-4 confabulations in paragraph-length biographies.

The discrete variant of our semantic entropy estimator outperforms both baselines on the AUROC and AURAC metrics (scored on the y-axis), by a substantial margin on each. When more than 80% of questions are answered, semantic entropy has the highest accuracy. Only when the top 20% of answers judged most likely to be confabulations are rejected does the accuracy of the P(True) baseline on the remaining answers exceed that of semantic entropy.
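
To make the two metrics concrete, the following is a minimal Python sketch, not the authors' evaluation code. It assumes each answer has an uncertainty score (for example, a semantic entropy value) and a binary correctness label. The function names, the toy data and the grid of answered fractions are hypothetical, and summarising the rejection-accuracy curve by the mean accuracy over that grid is an assumption about how an AURAC-style number might be approximated.

```python
def auroc(uncertainty, correct):
    """Probability that a randomly chosen incorrect answer has a higher
    uncertainty score than a randomly chosen correct one (ties count 0.5)."""
    wrong = [u for u, c in zip(uncertainty, correct) if not c]
    right = [u for u, c in zip(uncertainty, correct) if c]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for w in wrong:
        for r in right:
            if w > r:
                wins += 1.0
            elif w == r:
                wins += 0.5
    return wins / (len(wrong) * len(right))


def rejection_accuracy(uncertainty, correct, answered_fraction):
    """Accuracy on the answered_fraction of answers with the lowest
    uncertainty; the most uncertain answers are rejected."""
    if not 0 < answered_fraction <= 1:
        raise ValueError("answered_fraction must be in (0, 1]")
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    n_keep = max(1, round(answered_fraction * len(order)))
    kept = order[:n_keep]
    return sum(1 for i in kept if correct[i]) / n_keep


def aurac(uncertainty, correct, fractions=(0.8, 0.9, 0.95, 1.0)):
    """Assumed summary: mean rejection accuracy over a grid of answered
    fractions. The grid is illustrative, not taken from the paper."""
    scores = [rejection_accuracy(uncertainty, correct, f) for f in fractions]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data (hypothetical): one uncertainty score and one label per answer.
    scores = [0.1, 0.4, 0.35, 0.8, 0.9]
    labels = [1, 1, 0, 1, 0]  # 1 = correct, 0 = confabulation
    print("AUROC:", auroc(scores, labels))
    print("Accuracy at 80% answered:", rejection_accuracy(scores, labels, 0.8))
    print("AURAC (assumed grid):", aurac(scores, labels, fractions=(0.8, 1.0)))
```

Under this reading, "rejecting the top 20% of answers" corresponds to `answered_fraction=0.8`, and the comparison in the caption is between such accuracies for semantic entropy and for each baseline's uncertainty score.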