Table 2. Age-stratified average reviewer ratings of GPT-4 and Gemini responses across five evaluation criteria
Large Language Model | Accuracy, mean (95% CI) | Completeness, mean (95% CI) | Age-Appropriateness, mean (95% CI) | Possibility of Demographic Bias, mean (95% CI) | Overall Quality, mean (95% CI) |
---|---|---|---|---|---|
GPT-4 | | | | | |
5-Year-Old | 4.20 (3.76, 4.64) | 4.07 (3.67, 4.47) | 3.47 (2.76, 4.18) | 1.53 (1.07, 1.99) | 3.47 (2.87, 4.07) |
7-Year-Old | 4.40 (4.08, 4.72) | 4.20 (3.99, 4.41) | 4.07 (3.62, 4.52) | 1.53 (1.15, 1.91) | 3.93 (3.63, 4.23) |
9-Year-Old | 4.47 (4.21, 4.73) | 4.27 (3.97, 4.57) | 4.07 (3.71, 4.43) | 1.60 (1.28, 1.92) | 3.93 (3.57, 4.29) |
11-Year-Old | 4.40 (3.94, 4.86) | 4.27 (3.97, 4.57) | 4.00 (3.57, 4.43) | 1.33 (1.08, 1.58) | 3.80 (3.32, 4.28) |
13-Year-Old | 4.27 (3.91, 4.63) | 4.13 (3.75, 4.51) | 3.87 (3.33, 4.41) | 1.73 (1.24, 2.22) | 3.93 (3.41, 4.45) |
15-Year-Old | 4.40 (3.98, 4.82) | 4.40 (4.08, 4.72) | 3.67 (2.91, 4.43) | 1.93 (1.34, 2.52) | 3.93 (3.31, 4.55) |
17-Year-Old | 4.47 (4.09, 4.85) | 4.40 (4.08, 4.72) | 4.53 (4.27, 4.79) | 1.60 (1.07, 2.13) | 4.13 (3.81, 4.45) |
Gemini | | | | | |
5-Year-Old | 4.47 (4.01, 4.93) | 4.27 (3.82, 4.72) | 2.53 (1.79, 3.27) | 1.33 (1.02, 1.64) | 2.87 (2.18, 3.56) |
7-Year-Old | 4.53 (4.11, 4.95) | 4.40 (3.98, 4.82) | 2.53 (1.90, 3.16) | 1.07 (0.94, 1.20) | 3.07 (2.32, 3.82) |
9-Year-Old | 4.60 (4.14, 5.06) | 4.47 (4.09, 4.85) | 3.00 (2.37, 3.63) | 1.20 (0.99, 1.41) | 3.20 (2.51, 3.89) |
11-Year-Old | 4.60 (4.28, 4.92) | 4.40 (4.03, 4.77) | 3.00 (2.49, 3.51) | 1.07 (0.94, 1.20) | 3.07 (2.48, 3.66) |
13-Year-Old | 4.67 (4.42, 4.92) | 4.27 (3.91, 4.63) | 3.80 (3.32, 4.28) | 1.13 (0.95, 1.31) | 4.00 (3.57, 4.43) |
15-Year-Old | 4.60 (4.23, 4.97) | 4.47 (4.01, 4.93) | 3.80 (3.52, 4.08) | 1.20 (0.99, 1.41) | 3.87 (3.41, 4.33) |
17-Year-Old | 4.47 (4.01, 4.93) | 4.27 (3.82, 4.72) | 2.53 (1.79, 3.27) | 1.33 (1.02, 1.64) | 2.87 (2.18, 3.56) |
CI = confidence interval
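
Each cell above has the form "mean (95% CI)". The table does not state how the intervals were computed, so the following is only a minimal sketch of one common approach: a normal-approximation interval, mean ± 1.96 × (sample SD / √n). The function name `mean_ci`, the `z = 1.96` multiplier, and the example ratings are assumptions made for illustration and are not taken from the study data.

```python
import math
import statistics


def mean_ci(ratings, z=1.96):
    """Return (mean, lower, upper) for a normal-approximation 95% CI.

    Assumption: the interval is mean +/- z * (sample SD / sqrt(n)).
    The source table does not confirm that this method was used.
    """
    n = len(ratings)
    mean = statistics.mean(ratings)
    # statistics.stdev uses the n - 1 (sample) denominator
    se = statistics.stdev(ratings) / math.sqrt(n)
    return mean, mean - z * se, mean + z * se


# Hypothetical ratings (not study data): fourteen 1s and one 2
example_ratings = [1] * 14 + [2]
m, lo, hi = mean_ci(example_ratings)
print(f"{m:.2f} ({lo:.2f}, {hi:.2f})")  # prints: 1.07 (0.94, 1.20)
```

In this hypothetical example the lower bound (0.94) falls below the smallest rating in the sample (1). A symmetric interval of this kind is not constrained to the range of the ratings themselves, which is worth keeping in mind when reading bounds that sit near the ends of the scale.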