2024 Dec 26;28:9–15. doi: 10.1016/j.csbj.2024.12.013

Table 2.

Comparison Between Response Completeness, Detail, and Diagnostic Accuracy of LLMs. For each model, the table relates response completeness and detail to diagnostic accuracy, considering both the entire questionnaire and the symptom-specific questions.

		Total questions (n = 18)				Symptom-related questions (n = 7)
	Diagnosis (n,a)	Words (m,SD)	NA (m,SD)	Completeness (m,SD)	Detail (m,SD)	Words (m,SD)	NA (m,SD)	Completeness (m,SD)	Detail (m,SD)
GPT-4o
Correct	85 (91.4 %)	55.5 ± 19.2	3.5 ± 3.4	80.5 ± 19.0 %	3.1 ± 1.1	31.1 ± 14.3	0.5 ± 0.9	92.6 ± 13.2 %	4.4 ± 2.0
Wrong	8 (8.6 %)	58.3 ± 18.0	1.9 ± 3.9	89.6 ± 21.9 %	3.2 ± 1.0	32.6 ± 10.5	1.9 ± 1.5	73.2 ± 20.8 %	4.7 ± 1.5
p	-	0.429	0.014	0.014	0.429	0.375	< 0.001	< 0.001	0.375
GPT-4 Turbo
Correct	75 (80.6 %)	55.4 ± 18.9	3.4 ± 3.5	81.1 ± 19.2 %	3.1 ± 1.1	30.6 ± 13.2	0.5 ± 0.9	92.6 ± 13.4 %	4.4 ± 1.9
Wrong	18 (19.4 %)	57.1 ± 20.4	3.2 ± 3.7	82.1 ± 20.4 %	3.2 ± 1.1	33.7 ± 16.9	1.1 ± 1.3	84.1 ± 18.9 %	4.8 ± 2.4
p	-	0.931	0.763	0.763	0.931	0.641	0.063	0.063	0.641
GPT-4 mini
Correct	70 (75.3 %)	56.0 ± 19.3	3.2 ± 3.2	82.3 ± 17.8 %	3.1 ± 1.1	31.3 ± 14.6	0.5 ± 0.9	92.3 ± 13.5 %	4.5 ± 2.1
Wrong	23 (24.7 %)	54.8 ± 18.6	3.9 ± 4.2	78.3 ± 23.5 %	3.0 ± 1.0	30.9 ± 11.9	0.9 ± 1.3	87.0 ± 18.2 %	4.4 ± 1.7
p	-	0.910	0.826	0.826	0.910	0.732	0.150	0.150	0.732
GPT-3.5
Correct	67 (72.0 %)	54.3 ± 17.3	3.0 ± 3.2	83.4 ± 18.0 %	3.0 ± 1.0	30.3 ± 13.0	0.6 ± 1.0	90.8 ± 14.7 %	4.3 ± 1.9
Wrong	26 (28.0 %)	59.3 ± 22.9	4.3 ± 3.9	75.9 ± 21.9 %	3.3 ± 1.3	33.7 ± 16.1	0.6 ± 1.1	91.2 ± 15.7 %	4.8 ± 2.3
p	-	0.358	0.239	0.239	0.358	0.294	0.960	0.960	0.294
Llama-3.1
Correct	39 (41.9 %)	55.3 ± 17.5	3.5 ± 3.4	80.6 ± 19.0 %	3.1 ± 1.0	29.1 ± 12.8	0.3 ± 0.6	96.0 ± 8.6 %	4.2 ± 1.8
Wrong	54 (58.1 %)	56.1 ± 20.3	3.3 ± 3.5	81.8 ± 19.7 %	3.1 ± 1.1	32.8 ± 14.7	0.9 ± 1.2	87.3 ± 17.3 %	4.7 ± 2.1
p	-	0.976	0.524	0.524	0.976	0.239	0.007	0.007	0.239
Gemma 2
Correct	72 (77.4 %)	55.9 ± 18.9	3.3 ± 3.4	81.7 ± 18.8 %	3.1 ± 1.1	31.4 ± 14.5	0.6 ± 1.0	91.9 ± 14.4 %	4.5 ± 2.1
Wrong	21 (22.6 %)	55.0 ± 20.0	3.6 ± 3.9	79.9 ± 21.6 %	3.1 ± 1.1	30.8 ± 12.1	0.9 ± 1.1	87.8 ± 16.5 %	4.4 ± 1.7
p	-	0.904	0.793	0.793	0.904	0.813	0.095	0.095	0.813
Mistral-Nemo
Correct	54 (58.1 %)	55.1 ± 17.6	3.5 ± 3.5	80.6 ± 19.5 %	3.1 ± 1.0	30.1 ± 13.4	0.5 ± 1.0	92.9 ± 13.8 %	4.3 ± 1.9
Wrong	39 (41.9 %)	56.5 ± 21.1	3.2 ± 3.5	82.3 ± 19.3 %	3.1 ± 1.2	32.8 ± 14.8	0.8 ± 1.1	88.3 ± 16.0 %	4.7 ± 2.1
p	-	0.924	0.543	0.543	0.924	0.355	0.085	0.085	0.355
Gemini 1.5
Correct	71 (76.3 %)	56.1 ± 19.5	3.4 ± 3.5	81.1 ± 19.5 %	3.1 ± 1.1	31.4 ± 14.7	0.5 ± 0.9	92.8 ± 13.4 %	4.5 ± 2.1
Wrong	22 (23.7 %)	54.6 ± 18.1	3.2 ± 3.5	82.1 ± 19.2 %	3.0 ± 1.0	30.6 ± 11.5	1.1 ± 1.3	85.1 ± 17.9 %	4.4 ± 1.6
p	-	0.916	0.705	0.705	0.916	0.829	0.017	0.017	0.829
Gemini 1.0
Correct	70 (75.3 %)	57.0 ± 19.1	3.3 ± 3.5	81.6 ± 19.3 %	3.2 ± 1.1	32.1 ± 14.7	0.5 ± 0.9	93.1 ± 13.5 %	4.6 ± 2.1
Wrong	23 (24.7 %)	51.9 ± 18.9	3.5 ± 3.6	80.4 ± 19.9 %	2.9 ± 1.1	28.5 ± 11.4	1.1 ± 1.2	84.5 ± 17.2 %	4.1 ± 1.6
p	-	0.190	0.752	0.752	0.190	0.349	0.012	0.012	0.349

a: accuracy; m: mean; SD: standard deviation; NA: not answered. Bold indicates a significant two-sided p value (p < 0.05).
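
The accuracy percentages in the Diagnosis column can be recomputed from the counts, since the Correct and Wrong rows of every model sum to 93 cases (for example 85 + 8). The short Python sketch below does this. It also checks a relationship that the table only suggests and that is an assumption here, not a definition stated by the authors: completeness appears to equal the share of questions that were answered, 100 × (1 − NA / n). If that holds, it would explain why the NA and Completeness columns share the same p values. The function names are illustrative, and the table reports rounded means, so agreement is only approximate.

```python
# Correct-diagnosis counts per model, copied from Table 2.
CORRECT_COUNTS = {
    "GPT-4o": 85,
    "GPT-4 Turbo": 75,
    "GPT-4 mini": 70,
    "GPT-3.5": 67,
    "Llama-3.1": 39,
    "Gemma 2": 72,
    "Mistral-Nemo": 54,
    "Gemini 1.5": 71,
    "Gemini 1.0": 70,
}
N_CASES = 93  # Correct + Wrong in every model block, e.g. 85 + 8


def accuracy_percent(count: int, total: int = N_CASES) -> float:
    """Share of cases in a group, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)


def completeness_from_na(na_mean: float, n_questions: int) -> float:
    """ASSUMPTION: completeness = percentage of questions that were answered.

    This is inferred from the table values, not taken from the paper's methods.
    """
    return round(100 * (1 - na_mean / n_questions), 1)


for model, correct in CORRECT_COUNTS.items():
    print(f"{model}: {correct}/{N_CASES} = {accuracy_percent(correct)} %")

# GPT-4o, correct diagnoses, all 18 questions: mean NA = 3.5, reported 80.5 %.
print(completeness_from_na(3.5, 18))
# GPT-4o, wrong diagnoses, 7 symptom questions: mean NA = 1.9, reported 73.2 %.
print(completeness_from_na(1.9, 7))
```

The recomputed accuracies match the table to one decimal. The completeness values differ from the reported ones by a few tenths of a percentage point, which is consistent with the NA means having been rounded before publication.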