Table 2.
Performance accuracy and answer consistency across artificial intelligence tools over 3 attempts on United States Medical Licensing Examination Step 1 questions (N=119).
| | ChatGPT, n (%; 95% CI) | Copilot, n (%; 95% CI) | DeepSeek, n (%; 95% CI) | Gemini, n (%; 95% CI) | Grok, n (%; 95% CI) | P value | Effect size^a | Post hoc^b |
|---|---|---|---|---|---|---|---|---|
| Attempt^c | | | | | | | | |
| First | 95 (79.8; 72.8-86.7) | 101 (84.9; 78.8-90.9) | 86 (72.3; 64.3-80.2) | 100 (84; 77.7-90.3) | 109 (91.6; 86.4-96.8) | .002 | 0.084 | Grok > DeepSeek |
| Second | 95 (79.8; 72.8-86.7) | 100 (84; 77.7-90.3) | 85 (71.4; 63.4-79.4) | 100 (84; 77.7-90.3) | 109 (91.6; 86.4-96.8) | .001 | 0.086 | Grok > DeepSeek |
| Third | 96 (80.7; 73.9-87.5) | 107 (89.9; 84.5-95.4) | 86 (72.3; 64.3-80.2) | 100 (84; 77.7-90.3) | 109 (91.6; 86.4-96.8) | <.001 | 0.094 | (Grok=Copilot) > DeepSeek |
| Consistency | | | | | | | | |
| No change in answer | 118 (99.2; 97.7-100) | 112 (94.1; 89.2-99) | 115 (96.6; 92.9-100) | 117 (98.3; 95.9-100) | 119 (100; 100-100) | .02 | 0.069 | No significant pairwise difference (after Bonferroni adjustment) |
^a Effect size is reported using Cramér V, which measures the strength of association between categorical variables (0.10=small, 0.30=medium, and 0.50=large).
^b P values for post hoc pairwise comparisons were Bonferroni-adjusted to control for multiple testing. Omnibus chi-square P values are unadjusted. P<.05 is statistically significant.
^c The first, second, and third attempts correspond to the initial answer, the response after the first confirmation prompt “Are you sure?,” and the response after the second confirmation prompt “Are you sure?,” respectively.
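
As an illustration of the procedures named in the footnotes, the following Python sketch recomputes an omnibus chi-square test, Cramér V, and Bonferroni-adjusted pairwise comparisons from the first-attempt counts in the table. It is not the authors' analysis code. The function names are hypothetical, and the sketch assumes an uncorrected Pearson chi-square test on correct versus incorrect counts and the conventional formula V = sqrt(chi-square / (N × min(rows−1, columns−1))). The source does not state the software or the exact formula used, so the output, particularly the effect size, may differ from the reported values.

```python
import math
from itertools import combinations

# First-attempt correct answers per tool, taken from the table (119 questions each).
N_QUESTIONS = 119
FIRST_ATTEMPT = {"ChatGPT": 95, "Copilot": 101, "DeepSeek": 86, "Gemini": 100, "Grok": 109}


def pearson_chi2(correct_counts, n_per_group):
    """Pearson chi-square for a k x 2 table (correct vs incorrect).

    Assumption: equal group sizes and no continuity correction.
    """
    k = len(correct_counts)
    expected_correct = sum(correct_counts) / k
    expected_incorrect = n_per_group - expected_correct
    chi2 = 0.0
    for c in correct_counts:
        chi2 += (c - expected_correct) ** 2 / expected_correct
        chi2 += ((n_per_group - c) - expected_incorrect) ** 2 / expected_incorrect
    return chi2


def chi2_p_value(chi2, df):
    """Upper-tail chi-square probability; closed forms for df=1 and even df."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2))
    if df % 2 == 0:
        half = chi2 / 2
        terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    raise ValueError("this sketch only handles df=1 or even df")


def cramers_v(chi2, n_total, n_rows, n_cols):
    """Conventional Cramér V (assumed formula; the source does not give one)."""
    return math.sqrt(chi2 / (n_total * min(n_rows - 1, n_cols - 1)))


def bonferroni_pairwise(counts, n_per_group, alpha=0.05):
    """Pairwise 2 x 2 chi-square tests with Bonferroni-adjusted P values."""
    pairs = list(combinations(counts, 2))
    results = {}
    for a, b in pairs:
        chi2 = pearson_chi2([counts[a], counts[b]], n_per_group)
        p_adjusted = min(1.0, chi2_p_value(chi2, df=1) * len(pairs))
        results[(a, b)] = (p_adjusted, p_adjusted < alpha)
    return results


k = len(FIRST_ATTEMPT)
omnibus_chi2 = pearson_chi2(list(FIRST_ATTEMPT.values()), N_QUESTIONS)
omnibus_p = chi2_p_value(omnibus_chi2, df=k - 1)
effect_size = cramers_v(omnibus_chi2, n_total=k * N_QUESTIONS, n_rows=k, n_cols=2)
pairwise = bonferroni_pairwise(FIRST_ATTEMPT, N_QUESTIONS)

print(f"omnibus chi-square = {omnibus_chi2:.3f}, P = {omnibus_p:.4f}, V = {effect_size:.3f}")
for (a, b), (p_adj, significant) in pairwise.items():
    flag = "significant" if significant else "not significant"
    print(f"{a} vs {b}: adjusted P = {p_adj:.4f} ({flag})")
```

Under these assumptions, the omnibus P value rounds to .002, and DeepSeek versus Grok is the only first-attempt pair that remains significant after adjustment across the 10 comparisons, in line with the post hoc column.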