2026 Mar 9;5:e76928. doi: 10.2196/76928

Table 2.

Performance accuracy and answer consistency across artificial intelligence tools over 3 attempts on United States Medical Licensing Examination Step 1 questions (N=119).

		ChatGPT, n (%; 95% CI)	Copilot, n (%; 95% CI)	DeepSeek, n (%; 95% CI)	Gemini, n (%; 95% CI)	Grok, n (%; 95% CI)	P value	Effect size^a	Post hoc^b
Attempt^c
	First	95 (79.8; 72.8-86.7)	101 (84.9; 78.8-90.9)	86 (72.3; 64.3-80.2)	100 (84; 77.7-90.3)	109 (91.6; 86.4-96.8)	.002	0.084	Grok > DeepSeek
	Second	95 (79.8; 72.8-86.7)	100 (84; 77.7-90.3)	85 (71.4; 63.4-79.4)	100 (84; 77.7-90.3)	109 (91.6; 86.4-96.8)	.001	0.086	Grok > DeepSeek
	Third	96 (80.7; 73.9-87.5)	107 (89.9; 84.5-95.4)	86 (72.3; 64.3-80.2)	100 (84; 77.7-90.3)	109 (91.6; 86.4-96.8)	<.001	0.094	(Grok=Copilot) > DeepSeek
Consistency
	No change in answer	118 (99.2; 97.7-100)	112 (94.1; 89.2-99)	115 (96.6; 92.9-100)	117 (98.3; 95.9-100)	119 (100; 100-100)	.02	0.069	No significant pairwise difference (after Bonferroni adjustment)

^aEffect size is reported using Cramér V, which measures the strength of association between categorical variables (0.10=small, 0.30=medium, and 0.50=large).

^bP values for post hoc pairwise comparisons were Bonferroni-adjusted to control for multiple testing. Omnibus chi-square P values are unadjusted. P<.05 was considered statistically significant.

^cThe first, second, and third attempts correspond to the initial answer, the response after the first “Are you sure?” confirmation prompt, and the response after the second “Are you sure?” confirmation prompt, respectively.
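
The testing procedure described in footnote b can be illustrated with a minimal Python sketch using the first-attempt counts from the table. It is not the authors' analysis code. It assumes that the omnibus test is a Pearson chi-square test on the 5×2 table of correct versus incorrect answers, that each pairwise comparison is a 2×2 Pearson chi-square test without continuity correction (the table does not state which pairwise test was used), and that the Bonferroni adjustment multiplies each pairwise P value by the 10 possible tool pairs. The function names `pearson_chi2`, `chi2_sf`, and `bonferroni_pairwise` are hypothetical.

```python
import math
from itertools import combinations

# First-attempt correct counts copied from Table 2; each tool answered N=119 questions.
N_QUESTIONS = 119
first_attempt_correct = {
    "ChatGPT": 95, "Copilot": 101, "DeepSeek": 86, "Gemini": 100, "Grok": 109,
}

def pearson_chi2(correct_counts, n_per_group):
    """Pearson chi-square statistic for a k x 2 (correct/incorrect) table
    in which every group has the same size n_per_group."""
    k = len(correct_counts)
    exp_correct = sum(correct_counts) / k          # expected correct answers per group
    exp_incorrect = n_per_group - exp_correct      # expected incorrect answers per group
    stat = 0.0
    for c in correct_counts:
        stat += (c - exp_correct) ** 2 / exp_correct
        stat += ((n_per_group - c) - exp_incorrect) ** 2 / exp_incorrect
    return stat

def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution.
    Closed forms only: df == 1, or any even df."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df % 2 == 0:
        half = x / 2
        return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    raise ValueError("this sketch only handles df == 1 or even df")

def bonferroni_pairwise(correct_by_tool, n_per_group):
    """Bonferroni-adjusted P values for all pairwise 2 x 2 chi-square tests
    (assumed pairwise test; no continuity correction)."""
    pairs = list(combinations(correct_by_tool, 2))
    m = len(pairs)                                  # 10 comparisons for 5 tools
    adjusted = {}
    for a, b in pairs:
        stat = pearson_chi2([correct_by_tool[a], correct_by_tool[b]], n_per_group)
        adjusted[(a, b)] = min(1.0, chi2_sf(stat, 1) * m)
    return adjusted

# Omnibus test: unadjusted, df = (5 tools - 1) * (2 outcomes - 1) = 4.
omnibus_stat = pearson_chi2(list(first_attempt_correct.values()), N_QUESTIONS)
omnibus_p = chi2_sf(omnibus_stat, df=len(first_attempt_correct) - 1)

# Post hoc tests: keep only pairs whose adjusted P value stays below .05.
adjusted = bonferroni_pairwise(first_attempt_correct, N_QUESTIONS)
significant = sorted(pair for pair, p in adjusted.items() if p < 0.05)

print(f"omnibus chi-square = {omnibus_stat:.2f}, P = {omnibus_p:.3f}")
print("significant pairs after Bonferroni:", significant)
```

Under these assumptions, the omnibus statistic is approximately 16.7 with P≈.002, and DeepSeek versus Grok is the only pair that remains below .05 after adjustment, in line with the first-attempt row of the table.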