Table 2.
Comparison Between Response Completeness, Detail, and Diagnostic Accuracy of LLMs. For each model, the table relates response completeness and detail to diagnostic accuracy, both across the entire questionnaire and for the symptom-related questions only.
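Model / diagnosis | Diagnosis (n, a) | Total questions (n = 18): Words (m ± SD) | Total questions (n = 18): NA (m ± SD) | Total questions (n = 18): Completeness (m ± SD) | Total questions (n = 18): Detail (m ± SD) | Symptom-related questions (n = 7): Words (m ± SD) | Symptom-related questions (n = 7): NA (m ± SD) | Symptom-related questions (n = 7): Completeness (m ± SD) | Symptom-related questions (n = 7): Detail (m ± SD) |
---|---|---|---|---|---|---|---|---|---|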
GPT-4o | |||||||||
Correct | 85 (91.4 %) | 55.5 ± 19.2 | 3.5 ± 3.4 | 80.5 ± 19.0 % | 3.1 ± 1.1 | 31.1 ± 14.3 | 0.5 ± 0.9 | 92.6 ± 13.2 % | 4.4 ± 2.0 |
Wrong | 8 (8.6 %) | 58.3 ± 18.0 | 1.9 ± 3.9 | 89.6 ± 21.9 % | 3.2 ± 1.0 | 32.6 ± 10.5 | 1.9 ± 1.5 | 73.2 ± 20.8 % | 4.7 ± 1.5 |
p | - | 0.429 | **0.014** | **0.014** | 0.429 | 0.375 | **< 0.001** | **< 0.001** | 0.375 |
GPT-4 Turbo | |||||||||
Correct | 75 (80.6 %) | 55.4 ± 18.9 | 3.4 ± 3.5 | 81.1 ± 19.2 % | 3.1 ± 1.1 | 30.6 ± 13.2 | 0.5 ± 0.9 | 92.6 ± 13.4 % | 4.4 ± 1.9 |
Wrong | 18 (19.4 %) | 57.1 ± 20.4 | 3.2 ± 3.7 | 82.1 ± 20.4 % | 3.2 ± 1.1 | 33.7 ± 16.9 | 1.1 ± 1.3 | 84.1 ± 18.9 % | 4.8 ± 2.4 |
p | - | 0.931 | 0.763 | 0.763 | 0.931 | 0.641 | 0.063 | 0.063 | 0.641 |
GPT-4 mini | |||||||||
Correct | 70 (75.3 %) | 56.0 ± 19.3 | 3.2 ± 3.2 | 82.3 ± 17.8 % | 3.1 ± 1.1 | 31.3 ± 14.6 | 0.5 ± 0.9 | 92.3 ± 13.5 % | 4.5 ± 2.1 |
Wrong | 23 (24.7 %) | 54.8 ± 18.6 | 3.9 ± 4.2 | 78.3 ± 23.5 % | 3.0 ± 1.0 | 30.9 ± 11.9 | 0.9 ± 1.3 | 87.0 ± 18.2 % | 4.4 ± 1.7 |
p | - | 0.910 | 0.826 | 0.826 | 0.910 | 0.732 | 0.150 | 0.150 | 0.732 |
GPT-3.5 | |||||||||
Correct | 67 (72.0 %) | 54.3 ± 17.3 | 3.0 ± 3.2 | 83.4 ± 18.0 % | 3.0 ± 1.0 | 30.3 ± 13.0 | 0.6 ± 1.0 | 90.8 ± 14.7 % | 4.3 ± 1.9 |
Wrong | 26 (28.0 %) | 59.3 ± 22.9 | 4.3 ± 3.9 | 75.9 ± 21.9 % | 3.3 ± 1.3 | 33.7 ± 16.1 | 0.6 ± 1.1 | 91.2 ± 15.7 % | 4.8 ± 2.3 |
p | - | 0.358 | 0.239 | 0.239 | 0.358 | 0.294 | 0.960 | 0.960 | 0.294 |
Llama-3.1 | |||||||||
Correct | 39 (41.9 %) | 55.3 ± 17.5 | 3.5 ± 3.4 | 80.6 ± 19.0 % | 3.1 ± 1.0 | 29.1 ± 12.8 | 0.3 ± 0.6 | 96.0 ± 8.6 % | 4.2 ± 1.8 |
Wrong | 54 (58.1 %) | 56.1 ± 20.3 | 3.3 ± 3.5 | 81.8 ± 19.7 % | 3.1 ± 1.1 | 32.8 ± 14.7 | 0.9 ± 1.2 | 87.3 ± 17.3 % | 4.7 ± 2.1 |
p | - | 0.976 | 0.524 | 0.524 | 0.976 | 0.239 | **0.007** | **0.007** | 0.239 |
Gemma 2 | |||||||||
Correct | 72 (77.4 %) | 55.9 ± 18.9 | 3.3 ± 3.4 | 81.7 ± 18.8 % | 3.1 ± 1.1 | 31.4 ± 14.5 | 0.6 ± 1.0 | 91.9 ± 14.4 % | 4.5 ± 2.1 |
Wrong | 21 (22.6 %) | 55.0 ± 20.0 | 3.6 ± 3.9 | 79.9 ± 21.6 % | 3.1 ± 1.1 | 30.8 ± 12.1 | 0.9 ± 1.1 | 87.8 ± 16.5 % | 4.4 ± 1.7 |
p | - | 0.904 | 0.793 | 0.793 | 0.904 | 0.813 | 0.095 | 0.095 | 0.813 |
Mistral-Nemo | |||||||||
Correct | 54 (58.1 %) | 55.1 ± 17.6 | 3.5 ± 3.5 | 80.6 ± 19.5 % | 3.1 ± 1.0 | 30.1 ± 13.4 | 0.5 ± 1.0 | 92.9 ± 13.8 % | 4.3 ± 1.9 |
Wrong | 39 (41.9 %) | 56.5 ± 21.1 | 3.2 ± 3.5 | 82.3 ± 19.3 % | 3.1 ± 1.2 | 32.8 ± 14.8 | 0.8 ± 1.1 | 88.3 ± 16.0 % | 4.7 ± 2.1 |
p | - | 0.924 | 0.543 | 0.543 | 0.924 | 0.355 | 0.085 | 0.085 | 0.355 |
Gemini 1.5 | |||||||||
Correct | 71 (76.3 %) | 56.1 ± 19.5 | 3.4 ± 3.5 | 81.1 ± 19.5 % | 3.1 ± 1.1 | 31.4 ± 14.7 | 0.5 ± 0.9 | 92.8 ± 13.4 % | 4.5 ± 2.1 |
Wrong | 22 (23.7 %) | 54.6 ± 18.1 | 3.2 ± 3.5 | 82.1 ± 19.2 % | 3.0 ± 1.0 | 30.6 ± 11.5 | 1.1 ± 1.3 | 85.1 ± 17.9 % | 4.4 ± 1.6 |
p | - | 0.916 | 0.705 | 0.705 | 0.916 | 0.829 | **0.017** | **0.017** | 0.829 |
Gemini 1.0 | |||||||||
Correct | 70 (75.3 %) | 57.0 ± 19.1 | 3.3 ± 3.5 | 81.6 ± 19.3 % | 3.2 ± 1.1 | 32.1 ± 14.7 | 0.5 ± 0.9 | 93.1 ± 13.5 % | 4.6 ± 2.1 |
Wrong | 23 (24.7 %) | 51.9 ± 18.9 | 3.5 ± 3.6 | 80.4 ± 19.9 % | 2.9 ± 1.1 | 28.5 ± 11.4 | 1.1 ± 1.2 | 84.5 ± 17.2 % | 4.1 ± 1.6 |
p | - | 0.190 | 0.752 | 0.752 | 0.190 | 0.349 | **0.012** | **0.012** | 0.349 |
a: accuracy; m: mean; SD: standard deviation; NA: not answered. Bold indicates a significant two-sided p value (p < 0.05).
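
The arithmetic behind the table can be sanity-checked with a short sketch. The accuracy percentages follow directly from the counts (each model block sums to 93 cases, e.g. 85 + 8). The other two relations are assumptions inferred from the reported values, not definitions stated in the table: that Completeness is the share of questions answered, and that Detail is words per question. The function names are hypothetical.

```python
# Counts of correct diagnoses per model, copied from Table 2.
CORRECT_COUNTS = {
    "GPT-4o": 85,
    "GPT-4 Turbo": 75,
    "GPT-3.5": 67,
    "Llama-3.1": 39,
    "Gemma 2": 72,
    "Mistral-Nemo": 54,
    "Gemini 1.5": 71,
}

# Correct + wrong cases in every model block of the table (e.g. 85 + 8).
TOTAL_CASES = 93


def accuracy_percent(correct: int, total: int = TOTAL_CASES) -> float:
    """Share of correctly diagnosed cases, in percent, to one decimal."""
    return round(100 * correct / total, 1)


def completeness_percent(mean_na: float, n_questions: int) -> float:
    """ASSUMPTION: completeness = share of questions that were answered."""
    return round(100 * (1 - mean_na / n_questions), 1)


def words_per_question(mean_words: float, n_questions: int) -> float:
    """ASSUMPTION: detail = mean number of words per question."""
    return round(mean_words / n_questions, 1)


if __name__ == "__main__":
    for model, correct in CORRECT_COUNTS.items():
        print(f"{model}: {accuracy_percent(correct)} % correct")

    # GPT-4o, correct diagnoses, all 18 questions (table: 80.5 % and 3.1).
    print(completeness_percent(3.5, 18), words_per_question(55.5, 18))
    # GPT-4o, correct diagnoses, 7 symptom-related questions (table: 92.6 % and 4.4).
    print(completeness_percent(0.5, 7), words_per_question(31.1, 7))
```

Under these assumptions the recomputed Completeness values agree with the table only approximately, which is expected when the table averages per-patient values and the sketch starts from already rounded means. If the assumed relations hold, each pair of columns (Words and Detail, NA and Completeness) differs only by a linear rescaling, which would be consistent with the identical p values reported for each pair.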