[Preprint]. 2024 Aug 7:2024.08.06.24311544. [Version 1] doi: 10.1101/2024.08.06.24311544

Table 1. Overall and age-stratified average reviewer ratings of GPT-4 and Gemini across five evaluation criteria

| Large Language Model | Accuracy, mean (95% CI) | Completeness, mean (95% CI) | Age-Appropriateness, mean (95% CI) | Possibility of Demographic Bias, mean (95% CI) | Overall Quality, mean (95% CI) |
| --- | --- | --- | --- | --- | --- |
| GPT-4 | 4.37 (4.27, 4.47) | 4.25 (4.16, 4.34) | 3.95 (3.81, 4.09) | 1.61 (1.49, 1.73) | 3.88 (3.75, 4.01) |
| Gemini | 4.55 (4.45, 4.65) | 4.39 (4.28, 4.50) | 3.26 (3.09, 3.43) | 1.16 (1.11, 1.21) | 3.43 (3.26, 3.60) |
| P-value | 0.08 | 0.15 | <0.001 | <0.001 | 0.004 |

CI = confidence interval