[Preprint]. 2024 Aug 7:2024.08.06.24311544. [Version 1] doi: 10.1101/2024.08.06.24311544

Table 2. Age-stratified average reviewer ratings of GPT-4 and Gemini responses across five evaluation criteria

All values are mean (95% CI).

| Large Language Model / Age | Accuracy | Completeness | Age-Appropriateness | Possibility of Demographic Bias | Overall Quality |
|---|---|---|---|---|---|
| GPT-4 | | | | | |
| 5-Year-Old | 4.20 (3.76, 4.64) | 4.07 (3.67, 4.47) | 3.47 (2.76, 4.18) | 1.53 (1.07, 1.99) | 3.47 (2.87, 4.07) |
| 7-Year-Old | 4.40 (4.08, 4.72) | 4.20 (3.99, 4.41) | 4.07 (3.62, 4.52) | 1.53 (1.15, 1.91) | 3.93 (3.63, 4.23) |
| 9-Year-Old | 4.47 (4.21, 4.73) | 4.27 (3.97, 4.57) | 4.07 (3.71, 4.43) | 1.60 (1.28, 1.92) | 3.93 (3.57, 4.29) |
| 11-Year-Old | 4.40 (3.94, 4.86) | 4.27 (3.97, 4.57) | 4.00 (3.57, 4.43) | 1.33 (1.08, 1.58) | 3.80 (3.32, 4.28) |
| 13-Year-Old | 4.27 (3.91, 4.63) | 4.13 (3.75, 4.51) | 3.87 (3.33, 4.41) | 1.73 (1.24, 2.22) | 3.93 (3.41, 4.45) |
| 15-Year-Old | 4.40 (3.98, 4.82) | 4.40 (4.08, 4.72) | 3.67 (2.91, 4.43) | 1.93 (1.34, 2.52) | 3.93 (3.31, 4.55) |
| 17-Year-Old | 4.47 (4.09, 4.85) | 4.40 (4.08, 4.72) | 4.53 (4.27, 4.79) | 1.60 (1.07, 2.13) | 4.13 (3.81, 4.45) |
| Gemini | | | | | |
| 5-Year-Old | 4.47 (4.01, 4.93) | 4.27 (3.82, 4.72) | 2.53 (1.79, 3.27) | 1.33 (1.02, 1.64) | 2.87 (2.18, 3.56) |
| 7-Year-Old | 4.53 (4.11, 4.95) | 4.40 (3.98, 4.82) | 2.53 (1.90, 3.16) | 1.07 (0.94, 1.20) | 3.07 (2.32, 3.82) |
| 9-Year-Old | 4.60 (4.14, 5.06) | 4.47 (4.09, 4.85) | 3.00 (2.37, 3.63) | 1.20 (0.99, 1.41) | 3.20 (2.51, 3.89) |
| 11-Year-Old | 4.60 (4.28, 4.92) | 4.40 (4.03, 4.77) | 3.00 (2.49, 3.51) | 1.07 (0.94, 1.20) | 3.07 (2.48, 3.66) |
| 13-Year-Old | 4.67 (4.42, 4.92) | 4.27 (3.91, 4.63) | 3.80 (3.32, 4.28) | 1.13 (0.95, 1.31) | 4.00 (3.57, 4.43) |
| 15-Year-Old | 4.60 (4.23, 4.97) | 4.47 (4.01, 4.93) | 3.80 (3.52, 4.08) | 1.20 (0.99, 1.41) | 3.87 (3.41, 4.33) |
| 17-Year-Old | 4.47 (4.01, 4.93) | 4.27 (3.82, 4.72) | 2.53 (1.79, 3.27) | 1.33 (1.02, 1.64) | 2.87 (2.18, 3.56) |

CI = confidence interval
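
This excerpt does not state how the intervals in the table were computed. As an illustration only, the following minimal sketch shows one common way to obtain a mean with a 95% CI from a set of reviewer ratings, assuming a normal-approximation interval (mean ± z × standard error). The function name `mean_ci` and the list `hypothetical_ratings` are invented for this example and are not data from the study.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(ratings, level=0.95):
    """Return (mean, lower, upper) using a normal-approximation interval.

    Assumption: this is one common method, not necessarily the one
    used to produce the table above.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to estimate a standard error")
    m = mean(ratings)
    se = stdev(ratings) / sqrt(n)              # sample SD divided by sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return m, m - z * se, m + z * se


# Hypothetical ratings invented for illustration; not data from the study.
hypothetical_ratings = [5, 4, 4, 5, 4, 3, 4, 5, 4, 4, 5, 4, 4, 4, 4]
m, lower, upper = mean_ci(hypothetical_ratings)
print(f"{m:.2f} ({lower:.2f}, {upper:.2f})")
```

An interval built this way is symmetric around the mean and is not bounded by the rating scale, so a limit can fall beyond the scale's endpoints when the mean sits close to one of them.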