[Preprint]. 2024 Feb 13:2023.07.10.23292373. [Version 2] doi: 10.1101/2023.07.10.23292373

Table 4:

Aggregate Accuracy, True Negative Rate, (Micro- and Macro-) Precision and Recall for MMSE and CDR scores extracted by ChatGPT and LlaMA-2.

Column groups: "All notes" = all notes with parsed JSON (N=710); "Double-reviewed" = double-reviewed notes with parsed JSON (N=306). Ground-truth counts are the same for both models.

| Metric | All notes: ChatGPT | All notes: LlaMA-2 | Double-reviewed: ChatGPT | Double-reviewed: LlaMA-2 |
|---|---|---|---|---|
| MMSE | | | | |
| Total notes without any MMSE (in ground truth) | 115 | 115 | 48 | 48 |
| Total notes without any MMSE (in model output) | 77 | 110 | 25 | 46 |
| Total correctly predicted empty MMSEs | 76 | 66 | 24 | 23 |
| MMSE True Negative Rate (%) | 98.7 | 60.0 | 96 | 50.0 |
| MMSE False Negative Rate (%) | 1.2 | 40.0 | 4 | 50.0 |
| Remaining notes with non-empty model output, used for MMSE Precision/Recall calculation | 633 | 600 | 281 | 260 |
| Total MMSE instances predicted | 831 | 957 | 366 | 410 |
| MMSE Macro Precision (mean % (sd %)) | 82.9 (sd 36.2) | 62.2 (sd 45.5) | 82.7 (sd 36.8) | 63.4 (sd 44.9) |
| MMSE Macro Recall (mean % (sd %)) | 87.8 (sd 30.4) | 69.9 (sd 43.5) | 89.7 (sd 28.3) | 71.8 (sd 42.1) |
| MMSE Micro Precision (%) | 83.8 | 57.7 | 84.1 | 59.3 |
| MMSE Micro Recall (%) | 83.7 | 68.1 | 87.5 | 69.0 |
| Total notes with any erroneous MMSE result | 121 | 238 | 52 | 98 |
| Overall accuracy of MMSE (%) | 82.9 | 66.4 | 83.0 | 68.0 |
| CDR | | | | |
| Total notes without CDR (in ground truth) | 608 | 608 | 260 | 260 |
| Total notes without CDR (in model output) | 533 | 497 | 233 | 215 |
| Total correctly predicted empty CDR | 532 | 489 | 233 | 212 |
| CDR True Negative Rate (%) | 99.8 | 98.4 | 100 | 98.6 |
| CDR False Negative Rate (%) | 0.2 | 1.6 | 0 | 1.4 |
| Remaining notes with non-empty model output, used for CDR Precision/Recall calculation | 177 | 213 | 73 | 153 |
| Total CDR instances predicted | 256 | 344 | 92 | 153 |
| CDR Macro Precision (mean % (sd %)) | 48.3 (sd 49.9) | 16.1 (sd 35.5) | 57.5 (sd 49.4) | 18.1 (sd 36.9) |
| CDR Macro Recall (mean % (sd %)) | 84.3 (sd 36.3) | 39.7 (sd 48.7) | 91.3 (sd 28.1) | 43.5 (sd 49.6) |
| CDR Micro Precision (%) | 36.3 | 12.0 | 51.0 | 13.2 |
| CDR Micro Recall (%) | 85.3 | 37.6 | 92.1 | 39.2 |
| Total notes with any erroneous CDR result | 91 | 181 | 31 | 76 |
| Overall accuracy of CDR (%) | 87.1 | 74.5 | 89.8 | 75.4 |
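
The rate and precision/recall rows above can be illustrated with a short sketch. It rests on two assumptions that this table does not state explicitly: that the reported True Negative Rate is the number of correctly predicted empty notes divided by the number of notes the model left empty (which matches the tabulated counts, e.g. 76/77), and that "macro" averages per-note values while "micro" pools instances across notes, as is conventional. The function names and the toy per-note counts are hypothetical and are not taken from the study.

```python
from statistics import mean


def empty_agreement_pct(correct_empty, predicted_empty):
    """Percentage of notes left empty by the model that are also empty in the ground truth.

    Assumption: this is how the table's "True Negative Rate" rows are derived.
    """
    return 100.0 * correct_empty / predicted_empty


def micro_macro(notes):
    """Compute micro and macro precision/recall.

    notes: list of (true_positives, n_predicted, n_ground_truth) tuples, one per note.
    Macro = mean of per-note ratios; micro = ratio of counts pooled over all notes.
    """
    per_note_precision = [tp / pred for tp, pred, _ in notes if pred > 0]
    per_note_recall = [tp / gt for tp, _, gt in notes if gt > 0]
    total_tp = sum(tp for tp, _, _ in notes)
    total_pred = sum(pred for _, pred, _ in notes)
    total_gt = sum(gt for _, _, gt in notes)
    return {
        "macro_precision": mean(per_note_precision),
        "macro_recall": mean(per_note_recall),
        "micro_precision": total_tp / total_pred,
        "micro_recall": total_tp / total_gt,
    }


# Counts taken from the "All notes" columns of the table.
mmse_chatgpt = round(empty_agreement_pct(76, 77), 1)
mmse_llama = round(empty_agreement_pct(66, 110), 1)
cdr_chatgpt = round(empty_agreement_pct(532, 533), 1)
cdr_llama = round(empty_agreement_pct(489, 497), 1)

# Hypothetical toy data: the second note has three spurious predictions.
toy_notes = [(1, 1, 1), (1, 4, 1), (2, 2, 3)]
toy_metrics = micro_macro(toy_notes)

if __name__ == "__main__":
    print(mmse_chatgpt, mmse_llama, cdr_chatgpt, cdr_llama)
    print(toy_metrics)
```

In the toy data, one note with several spurious predictions pulls micro precision (4/7) below macro precision (0.75). This is one possible reading of why CDR micro precision in the table is lower than CDR macro precision, although the table alone does not confirm the cause.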