2024 Dec 26;28:9–15. doi: 10.1016/j.csbj.2024.12.013

Table 2.

Comparison Between Response Completeness, Detail, and Diagnostic Accuracy of LLMs. The table relates response completeness and detail to diagnostic accuracy for each model, for both the entire questionnaire and the symptom-specific questions.

Total = total questions (n = 18); Symptom = symptom-related questions (n = 7).

| Model | Diagnosis | n (a) | Total: Words (m ± SD) | Total: NA (m ± SD) | Total: Completeness (m ± SD) | Total: Detail (m ± SD) | Symptom: Words (m ± SD) | Symptom: NA (m ± SD) | Symptom: Completeness (m ± SD) | Symptom: Detail (m ± SD) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | Correct | 85 (91.4 %) | 55.5 ± 19.2 | 3.5 ± 3.4 | 80.5 ± 19.0 % | 3.1 ± 1.1 | 31.1 ± 14.3 | 0.5 ± 0.9 | 92.6 ± 13.2 % | 4.4 ± 2.0 |
| | Wrong | 8 (8.6 %) | 58.3 ± 18.0 | 1.9 ± 3.9 | 89.6 ± 21.9 % | 3.2 ± 1.0 | 32.6 ± 10.5 | 1.9 ± 1.5 | 73.2 ± 20.8 % | 4.7 ± 1.5 |
| | p | - | 0.429 | **0.014** | **0.014** | 0.429 | 0.375 | **< 0.001** | **< 0.001** | 0.375 |
| GPT-4 Turbo | Correct | 75 (80.6 %) | 55.4 ± 18.9 | 3.4 ± 3.5 | 81.1 ± 19.2 % | 3.1 ± 1.1 | 30.6 ± 13.2 | 0.5 ± 0.9 | 92.6 ± 13.4 % | 4.4 ± 1.9 |
| | Wrong | 18 (19.4 %) | 57.1 ± 20.4 | 3.2 ± 3.7 | 82.1 ± 20.4 % | 3.2 ± 1.1 | 33.7 ± 16.9 | 1.1 ± 1.3 | 84.1 ± 18.9 % | 4.8 ± 2.4 |
| | p | - | 0.931 | 0.763 | 0.763 | 0.931 | 0.641 | 0.063 | 0.063 | 0.641 |
| GPT-4 mini | Correct | 70 (75.3 %) | 56.0 ± 19.3 | 3.2 ± 3.2 | 82.3 ± 17.8 % | 3.1 ± 1.1 | 31.3 ± 14.6 | 0.5 ± 0.9 | 92.3 ± 13.5 % | 4.5 ± 2.1 |
| | Wrong | 23 (24.7 %) | 54.8 ± 18.6 | 3.9 ± 4.2 | 78.3 ± 23.5 % | 3.0 ± 1.0 | 30.9 ± 11.9 | 0.9 ± 1.3 | 87.0 ± 18.2 % | 4.4 ± 1.7 |
| | p | - | 0.910 | 0.826 | 0.826 | 0.910 | 0.732 | 0.150 | 0.150 | 0.732 |
| GPT-3.5 | Correct | 67 (72.0 %) | 54.3 ± 17.3 | 3.0 ± 3.2 | 83.4 ± 18.0 % | 3.0 ± 1.0 | 30.3 ± 13.0 | 0.6 ± 1.0 | 90.8 ± 14.7 % | 4.3 ± 1.9 |
| | Wrong | 26 (28.0 %) | 59.3 ± 22.9 | 4.3 ± 3.9 | 75.9 ± 21.9 % | 3.3 ± 1.3 | 33.7 ± 16.1 | 0.6 ± 1.1 | 91.2 ± 15.7 % | 4.8 ± 2.3 |
| | p | - | 0.358 | 0.239 | 0.239 | 0.358 | 0.294 | 0.960 | 0.960 | 0.294 |
| Llama-3.1 | Correct | 39 (41.9 %) | 55.3 ± 17.5 | 3.5 ± 3.4 | 80.6 ± 19.0 % | 3.1 ± 1.0 | 29.1 ± 12.8 | 0.3 ± 0.6 | 96.0 ± 8.6 % | 4.2 ± 1.8 |
| | Wrong | 54 (58.1 %) | 56.1 ± 20.3 | 3.3 ± 3.5 | 81.8 ± 19.7 % | 3.1 ± 1.1 | 32.8 ± 14.7 | 0.9 ± 1.2 | 87.3 ± 17.3 % | 4.7 ± 2.1 |
| | p | - | 0.976 | 0.524 | 0.524 | 0.976 | 0.239 | **0.007** | **0.007** | 0.239 |
| Gemma 2 | Correct | 72 (77.4 %) | 55.9 ± 18.9 | 3.3 ± 3.4 | 81.7 ± 18.8 % | 3.1 ± 1.1 | 31.4 ± 14.5 | 0.6 ± 1.0 | 91.9 ± 14.4 % | 4.5 ± 2.1 |
| | Wrong | 21 (22.6 %) | 55.0 ± 20.0 | 3.6 ± 3.9 | 79.9 ± 21.6 % | 3.1 ± 1.1 | 30.8 ± 12.1 | 0.9 ± 1.1 | 87.8 ± 16.5 % | 4.4 ± 1.7 |
| | p | - | 0.904 | 0.793 | 0.793 | 0.904 | 0.813 | 0.095 | 0.095 | 0.813 |
| Mistral-Nemo | Correct | 54 (58.1 %) | 55.1 ± 17.6 | 3.5 ± 3.5 | 80.6 ± 19.5 % | 3.1 ± 1.0 | 30.1 ± 13.4 | 0.5 ± 1.0 | 92.9 ± 13.8 % | 4.3 ± 1.9 |
| | Wrong | 39 (41.9 %) | 56.5 ± 21.1 | 3.2 ± 3.5 | 82.3 ± 19.3 % | 3.1 ± 1.2 | 32.8 ± 14.8 | 0.8 ± 1.1 | 88.3 ± 16.0 % | 4.7 ± 2.1 |
| | p | - | 0.924 | 0.543 | 0.543 | 0.924 | 0.355 | 0.085 | 0.085 | 0.355 |
| Gemini 1.5 | Correct | 71 (76.3 %) | 56.1 ± 19.5 | 3.4 ± 3.5 | 81.1 ± 19.5 % | 3.1 ± 1.1 | 31.4 ± 14.7 | 0.5 ± 0.9 | 92.8 ± 13.4 % | 4.5 ± 2.1 |
| | Wrong | 22 (23.7 %) | 54.6 ± 18.1 | 3.2 ± 3.5 | 82.1 ± 19.2 % | 3.0 ± 1.0 | 30.6 ± 11.5 | 1.1 ± 1.3 | 85.1 ± 17.9 % | 4.4 ± 1.6 |
| | p | - | 0.916 | 0.705 | 0.705 | 0.916 | 0.829 | **0.017** | **0.017** | 0.829 |
| Gemini 1.0 | Correct | 70 (75.3 %) | 57.0 ± 19.1 | 3.3 ± 3.5 | 81.6 ± 19.3 % | 3.2 ± 1.1 | 32.1 ± 14.7 | 0.5 ± 0.9 | 93.1 ± 13.5 % | 4.6 ± 2.1 |
| | Wrong | 23 (24.7 %) | 51.9 ± 18.9 | 3.5 ± 3.6 | 80.4 ± 19.9 % | 2.9 ± 1.1 | 28.5 ± 11.4 | 1.1 ± 1.2 | 84.5 ± 17.2 % | 4.1 ± 1.6 |
| | p | - | 0.190 | 0.752 | 0.752 | 0.190 | 0.349 | **0.012** | **0.012** | 0.349 |

a: accuracy; m: mean; SD: standard deviation; NA: not answered. Bold indicates a significant two-sided p value (p < 0.05).
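
The accuracy percentages in the table can be recomputed from the counts, and the completeness values appear to track the NA counts. The following is a minimal sketch of that arithmetic, not the authors' analysis code. It assumes 93 cases per model (the sum of the Correct and Wrong counts in every model's rows) and assumes completeness is the share of questions answered, 100 × (1 − NA / n). That relation is not stated in the table and is only inferred from the reported means. The names `CORRECT`, `TOTAL_CASES`, `accuracy_pct`, and `completeness_pct` are hypothetical.

```python
# Correct-diagnosis counts per model, copied from Table 2.
CORRECT = {
    "GPT-4o": 85,
    "GPT-4 Turbo": 75,
    "GPT-4 mini": 70,
    "GPT-3.5": 67,
    "Llama-3.1": 39,
    "Gemma 2": 72,
    "Mistral-Nemo": 54,
    "Gemini 1.5": 71,
    "Gemini 1.0": 70,
}

# Assumption: Correct + Wrong = 93 cases for every model in the table.
TOTAL_CASES = 93


def accuracy_pct(correct, total=TOTAL_CASES):
    """Diagnostic accuracy as a percentage, rounded to one decimal."""
    return round(100 * correct / total, 1)


def completeness_pct(mean_na, n_questions):
    """Assumed relation (not stated in the source):
    completeness = percentage of questions that were answered."""
    return 100 * (1 - mean_na / n_questions)


accuracies = {model: accuracy_pct(n) for model, n in CORRECT.items()}

if __name__ == "__main__":
    for model, acc in accuracies.items():
        print(f"{model}: {acc} %")
    # GPT-4o, correct diagnoses, total questions (n = 18), mean NA = 3.5
    print(f"{completeness_pct(3.5, 18):.1f} %")
```

Because the table rounds mean NA to one decimal, the recomputed completeness matches the reported values only to within a few tenths of a percentage point.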