2026 Mar 9;5:e76928. doi: 10.2196/76928

Table 2.

Performance accuracy and answer consistency across artificial intelligence tools over 3 attempts on United States Medical Licensing Examination Step 1 questions (N=119).


| | ChatGPT, n (%; 95% CI) | Copilot, n (%; 95% CI) | DeepSeek, n (%; 95% CI) | Gemini, n (%; 95% CI) | Grok, n (%; 95% CI) | P value | Effect size^a | Post hoc^b |
|---|---|---|---|---|---|---|---|---|
| Attempt^c | | | | | | | | |
| First | 95 (79.8; 72.8-86.7) | 101 (84.9; 78.8-90.9) | 86 (72.3; 64.3-80.2) | 100 (84; 77.7-90.3) | 109 (91.6; 86.4-96.8) | .002 | 0.084 | Grok > DeepSeek |
| Second | 95 (79.8; 72.8-86.7) | 100 (84; 77.7-90.3) | 85 (71.4; 63.4-79.4) | 100 (84; 77.7-90.3) | 109 (91.6; 86.4-96.8) | .001 | 0.086 | Grok > DeepSeek |
| Third | 96 (80.7; 73.9-87.5) | 107 (89.9; 84.5-95.4) | 86 (72.3; 64.3-80.2) | 100 (84; 77.7-90.3) | 109 (91.6; 86.4-96.8) | <.001 | 0.094 | (Grok=Copilot) > DeepSeek |
| Consistency | | | | | | | | |
| No change in answer | 118 (99.2; 97.7-100) | 112 (94.1; 89.2-99) | 115 (96.6; 92.9-100) | 117 (98.3; 95.9-100) | 119 (100; 100-100) | .02 | 0.069 | No significant pairwise difference (after Bonferroni adjustment) |

^a Effect size is reported using Cramér V, which measures the strength of association between categorical variables (0.10=small, 0.30=medium, and 0.50=large).

^b P values for post hoc pairwise comparisons were Bonferroni-adjusted to control for multiple testing. Omnibus chi-square P values are unadjusted. P<.05 is statistically significant.

^c The first, second, and third attempts correspond to the initial answer, the response after the first confirmation prompt “Are you sure?,” and the response after the second confirmation prompt “Are you sure?,” respectively.
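
The omnibus P values, effect sizes, and post hoc results in this table can be approximately re-derived from the reported counts alone. The following is a minimal sketch, not the authors' analysis code, and it rests on several assumptions: each row is treated as a 5 × 2 contingency table (tool by correct/incorrect, or unchanged/changed) with 119 questions per tool; the omnibus test is an uncorrected Pearson chi-square with 4 degrees of freedom; and the pairwise tests are uncorrected 2 × 2 chi-square tests with P values multiplied by the 10 possible pairs. The effect size divisor is also an assumption: the reported values are matched when the chi-square statistic is divided by N × (k − 1), with N=595 and k=5 tools, whereas the more common min(r − 1, c − 1) divisor equals 1 for a 5 × 2 table and would give values about twice as large. All function and variable names are hypothetical.

```python
import math
from itertools import combinations

N_QUESTIONS = 119  # questions per tool, from the table caption

# Counts copied from the table rows (correct answers, or unchanged answers).
ROWS = {
    "First": {"ChatGPT": 95, "Copilot": 101, "DeepSeek": 86, "Gemini": 100, "Grok": 109},
    "Second": {"ChatGPT": 95, "Copilot": 100, "DeepSeek": 85, "Gemini": 100, "Grok": 109},
    "Third": {"ChatGPT": 96, "Copilot": 107, "DeepSeek": 86, "Gemini": 100, "Grok": 109},
    "No change": {"ChatGPT": 118, "Copilot": 112, "DeepSeek": 115, "Gemini": 117, "Grok": 119},
}


def pearson_chi2(counts, n):
    """Uncorrected Pearson chi-square for a k x 2 table with n items per group."""
    exp_hit = sum(counts) / len(counts)
    exp_miss = n - exp_hit
    if exp_hit == 0 or exp_miss == 0:
        return 0.0  # degenerate table: every group has the same outcome
    return sum(
        (c - exp_hit) ** 2 / exp_hit + ((n - c) - exp_miss) ** 2 / exp_miss
        for c in counts
    )


def chi2_sf(x, df):
    """Upper-tail chi-square probability (closed forms for df=1 and even df)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df % 2 == 0:
        half, term, total = x / 2, 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    raise ValueError("only df=1 or even df are supported in this sketch")


def omnibus(row, n=N_QUESTIONS):
    """Return (chi-square, P value, effect size) for one table row."""
    counts = list(row.values())
    k = len(counts)
    stat = pearson_chi2(counts, n)
    # ASSUMPTION: divisor N * (k - 1) reproduces the reported effect sizes.
    effect = math.sqrt(stat / (k * n * (k - 1)))
    return stat, chi2_sf(stat, k - 1), effect


def significant_pairs(row, n=N_QUESTIONS, alpha=0.05):
    """Pairs of tools that differ after Bonferroni adjustment (assumed test)."""
    pairs = list(combinations(sorted(row), 2))
    found = []
    for a, b in pairs:
        p = chi2_sf(pearson_chi2([row[a], row[b]], n), 1)
        if min(1.0, p * len(pairs)) < alpha:
            found.append((a, b))
    return found


if __name__ == "__main__":
    for label, row in ROWS.items():
        stat, p, effect = omnibus(row)
        print(f"{label}: chi2={stat:.2f}, P={p:.4f}, effect={effect:.3f}, "
              f"pairs={significant_pairs(row)}")
```

Under these assumptions the sketch agrees with the table to the reported precision, including the absence of any significant pairwise difference in the consistency row. The agreement does not confirm which software or test variant the authors actually used.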