2026 Mar 9;5:e76928. doi: 10.2196/76928

Table 3.

Accuracy of artificial intelligence tools in answering United States Medical Licensing Examination Step 1 questions by type, format, and subject area.

		ChatGPT, n (%; 95% CI)	Copilot, n (%; 95% CI)	DeepSeek, n (%; 95% CI)	Gemini, n (%; 95% CI)	Grok, n (%; 95% CI)	P value^a	Post hoc
Question type
	Text only (n=96)	78 (81.3; 72.3-87.8)	86 (89.6; 81.9-94.2)	86 (89.6; 81.9-94.2)	88 (91.7; 84.4-95.7)	88 (91.7; 84.4-95.7)	.13	N/A^b
	With visual media (n=23)	17 (73.9; 53.5-87.5)	15 (65.2; 44.9-81.2)	0 (0; 0-14.3)	12 (52.2; 33-70.8)	21 (91.3; 73.2-97.6)	<.001	All > DeepSeek; Grok > Gemini
Question format
	Information-based (n=41)	33 (80.5; 66-89.8)	35 (85.4; 71.6-93.1)	33 (80.5; 66-89.8)	37 (90.2; 77.5-96.1)	39 (95.1; 83.9-98.7)	.23	N/A
	Case-based (n=78)	62 (79.5; 69.2-87)	66 (84.6; 75-91)	53 (67.9; 57-77.3)	63 (80.8; 70.7-88)	70 (89.7; 81-94.7)	.01	Grok > DeepSeek
Subject
	Biochemistry and molecular biology (n=7)	6 (85.7; 48.7-97.4)	7 (100; 64.6-100)	5 (71.4; 35.9-91.8)	6 (85.7; 48.7-97.4)	6 (85.7; 48.7-97.4)	.95	N/A
	Biostatistics and epidemiology (n=6)	6 (100; 61-100)	5 (83.3; 43.6-97)	6 (100; 61-100)	6 (100; 61-100)	6 (100; 61-100)	>.99	N/A
	Cardiovascular (n=8)	7 (87.5; 52.9-97.8)	7 (87.5; 52.9-97.8)	3 (37.5; 13.7-69.4)	6 (75; 40.9-92.9)	8 (100; 67.6-100)	.04	No significant pairwise difference (after Bonferroni adjustment)
	Endocrinology (n=7)	4 (57.1; 25-84.2)	6 (85.7; 48.7-97.4)	5 (71.4; 35.9-91.8)	7 (100; 64.6-100)	5 (71.4; 35.9-91.8)	.55	N/A
	Ethics and communication skills (n=9)	8 (88.9; 56.5-98)	7 (77.8; 45.3-93.7)	9 (100; 70.1-100)	8 (88.9; 56.5-98)	8 (88.9; 56.5-98)	.95	N/A
	Gastrointestinal (n=14)	12 (85.7; 60.1-96)	13 (92.9; 68.5-98.7)	9 (64.3; 38.8-83.7)	11 (78.6; 52.4-92.4)	12 (85.7; 60.1-96)	.47	N/A
	Hematology and oncology (n=9)	8 (88.9; 56.5-98)	7 (77.8; 45.3-93.7)	8 (88.9; 56.5-98)	8 (88.9; 56.5-98)	8 (88.9; 56.5-98)	>.99	N/A
	Microbiology and immunology (n=8)	7 (87.5; 52.9-97.8)	7 (87.5; 52.9-97.8)	6 (75; 40.9-92.9)	6 (75; 40.9-92.9)	8 (100; 67.6-100)	.85	N/A
	Musculoskeletal, skin, and connective tissue (n=7)	6 (85.7; 48.7-97.4)	5 (71.4; 35.9-91.8)	2 (28.6; 8.2-64.1)	3 (42.9; 15.8-75)	6 (85.7; 48.7-97.4)	.12	N/A
	Neurology, special senses, and psychiatry (n=10)	6 (60; 31.3-83.2)	8 (80; 49-94.3)	7 (70; 39.7-89.2)	10 (100; 72.2-100)	9 (90; 59.6-98.2)	.25	N/A
	Pharmacology (n=6)	3 (50; 18.8-81.2)	5 (83.3; 43.6-97)	5 (83.3; 43.6-97)	5 (83.3; 43.6-97)	5 (83.3; 43.6-97)	.76	N/A
	Reproduction (n=10)	7 (70; 39.7-89.2)	9 (90; 59.6-98.2)	7 (70; 39.7-89.2)	9 (90; 59.6-98.2)	10 (100; 72.2-100)	.30	N/A
	Respiratory (n=6)	5 (83.3; 43.6-97)	5 (83.3; 43.6-97)	5 (83.3; 43.6-97)	5 (83.3; 43.6-97)	6 (100; 61-100)	>.99	N/A
	Uro-renal (n=12)	10 (83.3; 55.2-95.3)	10 (83.3; 55.2-95.3)	9 (75; 46.8-91.1)	10 (83.3; 55.2-95.3)	12 (100; 75.8-100)	.59	N/A

^aP values are based on the chi-square or Fisher exact test, with Bonferroni adjustment applied in post hoc pairwise comparisons. Omnibus P values are unadjusted. P<.05 was considered statistically significant.

^bN/A: not applicable.
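
The table does not state how the 95% CIs were computed. The reported bounds appear consistent with the Wilson score interval for a binomial proportion, so the following is a minimal sketch under that assumption, not the authors' documented method. The function name `wilson_ci` and its parameters are illustrative; only the counts are taken from the table.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(correct: int, total: int, level: float = 0.95) -> tuple[float, float]:
    """Return the Wilson score interval for correct/total, in percent.

    Assumption: the table's CIs use this method; the source does not say so.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    lower = max(0.0, center - half)
    upper = min(1.0, center + half)
    return 100 * lower, 100 * upper


if __name__ == "__main__":
    # Counts copied from the "Question type" rows of the table.
    rows = [
        ("ChatGPT, text only", 78, 96),
        ("DeepSeek, with visual media", 0, 23),
        ("Grok, with visual media", 21, 23),
    ]
    for label, k, n in rows:
        lo, hi = wilson_ci(k, n)
        print(f"{label}: {k}/{n} = {100 * k / n:.1f}% ({lo:.1f}-{hi:.1f})")
```

Rounded to one decimal place, these three cells should match the tabulated intervals (72.3-87.8, 0-14.3, and 73.2-97.6). The Wilson interval is a common choice for small subgroups such as the subject rows (n=6 to 14) because it stays within 0% to 100% even when every answer is correct or incorrect.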