2026 Mar 9;5:e76928. doi: 10.2196/76928

Table 3.

Accuracy of artificial intelligence tools in answering United States Medical Licensing Examination Step 1 questions by type, format, and subject area.


| Category | ChatGPT, n (%; 95% CI) | Copilot, n (%; 95% CI) | DeepSeek, n (%; 95% CI) | Gemini, n (%; 95% CI) | Grok, n (%; 95% CI) | P value^a | Post hoc |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Question type | | | | | | | |
| Text only (n=96) | 78 (81.3; 72.3-87.8) | 86 (89.6; 81.9-94.2) | 86 (89.6; 81.9-94.2) | 88 (91.7; 84.4-95.7) | 88 (91.7; 84.4-95.7) | .13 | N/A^b |
| With visual media (n=23) | 17 (73.9; 53.5-87.5) | 15 (65.2; 44.9-81.2) | 0 (0; 0-14.3) | 12 (52.2; 33-70.8) | 21 (91.3; 73.2-97.6) | <.001 | All > DeepSeek; Grok > Gemini |
| Question format | | | | | | | |
| Information-based (n=41) | 33 (80.5; 66-89.8) | 35 (85.4; 71.6-93.1) | 33 (80.5; 66-89.8) | 37 (90.2; 77.5-96.1) | 39 (95.1; 83.9-98.7) | .23 | N/A |
| Case-based (n=78) | 62 (79.5; 69.2-87) | 66 (84.6; 75-91) | 53 (67.9; 57-77.3) | 63 (80.8; 70.7-88) | 70 (89.7; 81-94.7) | .01 | Grok > DeepSeek |
| Subject | | | | | | | |
| Biochemistry and molecular biology (n=7) | 6 (85.7; 48.7-97.4) | 7 (100; 64.6-100) | 5 (71.4; 35.9-91.8) | 6 (85.7; 48.7-97.4) | 6 (85.7; 48.7-97.4) | .95 | N/A |
| Biostatistics and epidemiology (n=6) | 6 (100; 61-100) | 5 (83.3; 43.6-97) | 6 (100; 61-100) | 6 (100; 61-100) | 6 (100; 61-100) | >.99 | N/A |
| Cardiovascular (n=8) | 7 (87.5; 52.9-97.8) | 7 (87.5; 52.9-97.8) | 3 (37.5; 13.7-69.4) | 6 (75; 40.9-92.9) | 8 (100; 67.6-100) | .04 | No significant pairwise difference (after Bonferroni adjustment) |
| Endocrinology (n=7) | 4 (57.1; 25-84.2) | 6 (85.7; 48.7-97.4) | 5 (71.4; 35.9-91.8) | 7 (100; 64.6-100) | 5 (71.4; 35.9-91.8) | .55 | N/A |
| Ethics and communication skills (n=9) | 8 (88.9; 56.5-98) | 7 (77.8; 45.3-93.7) | 9 (100; 70.1-100) | 8 (88.9; 56.5-98) | 8 (88.9; 56.5-98) | .95 | N/A |
| Gastrointestinal (n=14) | 12 (85.7; 60.1-96) | 13 (92.9; 68.5-98.7) | 9 (64.3; 38.8-83.7) | 11 (78.6; 52.4-92.4) | 12 (85.7; 60.1-96) | .47 | N/A |
| Hematology and oncology (n=9) | 8 (88.9; 56.5-98) | 7 (77.8; 45.3-93.7) | 8 (88.9; 56.5-98) | 8 (88.9; 56.5-98) | 8 (88.9; 56.5-98) | >.99 | N/A |
| Microbiology and immunology (n=8) | 7 (87.5; 52.9-97.8) | 7 (87.5; 52.9-97.8) | 6 (75; 40.9-92.9) | 6 (75; 40.9-92.9) | 8 (100; 67.6-100) | .85 | N/A |
| Musculoskeletal, skin, and connective tissue (n=7) | 6 (85.7; 48.7-97.4) | 5 (71.4; 35.9-91.8) | 2 (28.6; 8.2-64.1) | 3 (42.9; 15.8-75) | 6 (85.7; 48.7-97.4) | .12 | N/A |
| Neurology, special senses, and psychiatry (n=10) | 6 (60; 31.3-83.2) | 8 (80; 49-94.3) | 7 (70; 39.7-89.2) | 10 (100; 72.2-100) | 9 (90; 59.6-98.2) | .25 | N/A |
| Pharmacology (n=6) | 3 (50; 18.8-81.2) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | .76 | N/A |
| Reproduction (n=10) | 7 (70; 39.7-89.2) | 9 (90; 59.6-98.2) | 7 (70; 39.7-89.2) | 9 (90; 59.6-98.2) | 10 (100; 72.2-100) | .30 | N/A |
| Respiratory (n=6) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | 6 (100; 61-100) | >.99 | N/A |
| Uro-renal (n=12) | 10 (83.3; 55.2-95.3) | 10 (83.3; 55.2-95.3) | 9 (75; 46.8-91.1) | 10 (83.3; 55.2-95.3) | 12 (100; 75.8-100) | .59 | N/A |

^a P values are based on the chi-square or Fisher exact test, with Bonferroni adjustment applied in post hoc comparisons. Omnibus chi-square P values are unadjusted. P<.05 is statistically significant.

^b N/A: not applicable.
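
The table does not say how the 95% CIs were computed or which rows used the chi-square versus the Fisher exact test. The following is a minimal sketch for checking individual cells, under two assumptions that are not stated in the source: that the intervals are Wilson score intervals, and that the omnibus test for a row is a Pearson chi-square on a 2 x 5 correct/incorrect table without continuity correction. The function names `wilson_interval` and `omnibus_chi_square` are hypothetical, and the P value uses the closed-form chi-square survival function that holds only for 4 degrees of freedom (5 tools).

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% standard normal quantile


def wilson_interval(correct: int, total: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for a proportion, returned as percentages.

    Assumption: the source does not name its CI method.
    """
    p_hat = correct / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z * z / (4 * total * total)
    ) / denom
    # Clamp to [0, 100] to absorb floating-point noise at the boundaries.
    return 100 * max(0.0, center - half), 100 * min(1.0, center + half)


def omnibus_chi_square(correct_counts: list[int], total: int) -> tuple[float, float]:
    """Pearson chi-square statistic and P value for a 2 x 5 table.

    Each tool answers the same `total` questions. Assumption: no continuity
    correction and no switch to the Fisher exact test for small cells.
    """
    k = len(correct_counts)
    if k != 5:
        raise ValueError("closed-form P value below is valid only for 5 groups (df = 4)")
    pooled = sum(correct_counts) / (k * total)
    exp_correct = pooled * total
    exp_wrong = total - exp_correct
    stat = sum(
        (c - exp_correct) ** 2 / exp_correct
        + ((total - c) - exp_wrong) ** 2 / exp_wrong
        for c in correct_counts
    )
    # Survival function of chi-square with df = 4: exp(-x/2) * (1 + x/2).
    p_value = math.exp(-stat / 2) * (1 + stat / 2)
    return stat, p_value


if __name__ == "__main__":
    # Cardiovascular, DeepSeek: 3 of 8 correct.
    low, high = wilson_interval(3, 8)
    print(f"3/8 = {100 * 3 / 8:.1f}% ({low:.1f}-{high:.1f})")

    # Case-based row: ChatGPT, Copilot, DeepSeek, Gemini, Grok out of 78.
    stat, p = omnibus_chi_square([62, 66, 53, 63, 70], 78)
    print(f"chi-square = {stat:.2f}, P = {p:.2f}")
```

Under these assumptions, 3 of 8 correct gives 37.5% with an interval of 13.7-69.4, and the case-based and text-only rows give P values that round to .01 and .13, in line with the table. Rows with small expected counts, such as the individual subjects, may have used the Fisher exact test instead, so this sketch is not expected to reproduce every P value.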