Table 3.
Accuracy of artificial intelligence tools in answering United States Medical Licensing Examination Step 1 questions by type, format, and subject area.
| Category | ChatGPT, n (%; 95% CI) | Copilot, n (%; 95% CI) | DeepSeek, n (%; 95% CI) | Gemini, n (%; 95% CI) | Grok, n (%; 95% CI) | P value^a | Post hoc |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Question type | | | | | | | |
| Text only (n=96) | 78 (81.3; 72.3-87.8) | 86 (89.6; 81.9-94.2) | 86 (89.6; 81.9-94.2) | 88 (91.7; 84.4-95.7) | 88 (91.7; 84.4-95.7) | .13 | N/A^b |
| With visual media (n=23) | 17 (73.9; 53.5-87.5) | 15 (65.2; 44.9-81.2) | 0 (0; 0-14.3) | 12 (52.2; 33-70.8) | 21 (91.3; 73.2-97.6) | <.001 | All > DeepSeek; Grok > Gemini |
| Question format | | | | | | | |
| Information-based (n=41) | 33 (80.5; 66-89.8) | 35 (85.4; 71.6-93.1) | 33 (80.5; 66-89.8) | 37 (90.2; 77.5-96.1) | 39 (95.1; 83.9-98.7) | .23 | N/A |
| Case-based (n=78) | 62 (79.5; 69.2-87) | 66 (84.6; 75-91) | 53 (67.9; 57-77.3) | 63 (80.8; 70.7-88) | 70 (89.7; 81-94.7) | .01 | Grok > DeepSeek |
| Subject | | | | | | | |
| Biochemistry and molecular biology (n=7) | 6 (85.7; 48.7-97.4) | 7 (100; 64.6-100) | 5 (71.4; 35.9-91.8) | 6 (85.7; 48.7-97.4) | 6 (85.7; 48.7-97.4) | .95 | N/A |
| Biostatistics and epidemiology (n=6) | 6 (100; 61-100) | 5 (83.3; 43.6-97) | 6 (100; 61-100) | 6 (100; 61-100) | 6 (100; 61-100) | >.99 | N/A |
| Cardiovascular (n=8) | 7 (87.5; 52.9-97.8) | 7 (87.5; 52.9-97.8) | 3 (37.5; 13.7-69.4) | 6 (75; 40.9-92.9) | 8 (100; 67.6-100) | .04 | No significant pairwise difference (after Bonferroni adjustment) |
| Endocrinology (n=7) | 4 (57.1; 25-84.2) | 6 (85.7; 48.7-97.4) | 5 (71.4; 35.9-91.8) | 7 (100; 64.6-100) | 5 (71.4; 35.9-91.8) | .55 | N/A |
| Ethics and communication skills (n=9) | 8 (88.9; 56.5-98) | 7 (77.8; 45.3-93.7) | 9 (100; 70.1-100) | 8 (88.9; 56.5-98) | 8 (88.9; 56.5-98) | .95 | N/A |
| Gastrointestinal (n=14) | 12 (85.7; 60.1-96) | 13 (92.9; 68.5-98.7) | 9 (64.3; 38.8-83.7) | 11 (78.6; 52.4-92.4) | 12 (85.7; 60.1-96) | .47 | N/A |
| Hematology and oncology (n=9) | 8 (88.9; 56.5-98) | 7 (77.8; 45.3-93.7) | 8 (88.9; 56.5-98) | 8 (88.9; 56.5-98) | 8 (88.9; 56.5-98) | >.99 | N/A |
| Microbiology and immunology (n=8) | 7 (87.5; 52.9-97.8) | 7 (87.5; 52.9-97.8) | 6 (75; 40.9-92.9) | 6 (75; 40.9-92.9) | 8 (100; 67.6-100) | .85 | N/A |
| Musculoskeletal, skin, and connective tissue (n=7) | 6 (85.7; 48.7-97.4) | 5 (71.4; 35.9-91.8) | 2 (28.6; 8.2-64.1) | 3 (42.9; 15.8-75) | 6 (85.7; 48.7-97.4) | .12 | N/A |
| Neurology, special senses, and psychiatry (n=10) | 6 (60; 31.3-83.2) | 8 (80; 49-94.3) | 7 (70; 39.7-89.2) | 10 (100; 72.2-100) | 9 (90; 59.6-98.2) | .25 | N/A |
| Pharmacology (n=6) | 3 (50; 18.8-81.2) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | .76 | N/A |
| Reproduction (n=10) | 7 (70; 39.7-89.2) | 9 (90; 59.6-98.2) | 7 (70; 39.7-89.2) | 9 (90; 59.6-98.2) | 10 (100; 72.2-100) | .30 | N/A |
| Respiratory (n=6) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | 5 (83.3; 43.6-97) | 6 (100; 61-100) | >.99 | N/A |
| Uro-renal (n=12) | 10 (83.3; 55.2-95.3) | 10 (83.3; 55.2-95.3) | 9 (75; 46.8-91.1) | 10 (83.3; 55.2-95.3) | 12 (100; 75.8-100) | .59 | N/A |
^a P values are based on the chi-square or Fisher exact test, with Bonferroni adjustment applied in post hoc comparisons. Omnibus chi-square P values are unadjusted. P<.05 is statistically significant.
^b N/A: not applicable.
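
The table does not state how the 95% CIs were computed. As a minimal sketch, assuming a Wilson score interval with z=1.96 (an assumption that reproduces the reported bounds in the rows spot-checked here), the intervals and the omnibus chi-square P values for the question type and format rows can be recomputed from the counts alone. The function names `wilson_ci`, `omnibus_chi_square`, and `chi_square_sf_df4` are hypothetical helpers, not part of the study's analysis, and the sketch does not cover the Fisher exact test or the Bonferroni-adjusted post hoc comparisons.

```python
import math


def wilson_ci(correct, total, z=1.96):
    """Wilson score interval for a proportion, returned as percentages.

    Assumption: the table's CIs are Wilson intervals; the source does not say so.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return 100 * (center - half), 100 * (center + half)


def omnibus_chi_square(correct_counts, total):
    """Pearson chi-square for a tools x (correct, incorrect) table.

    Every tool is assumed to have answered the same `total` questions.
    Returns (statistic, degrees of freedom).
    """
    k = len(correct_counts)
    pooled = sum(correct_counts) / (k * total)
    if pooled in (0.0, 1.0):
        raise ValueError("statistic is undefined when all answers agree")
    exp_correct = pooled * total
    exp_wrong = (1 - pooled) * total
    stat = sum(
        (c - exp_correct) ** 2 / exp_correct
        + ((total - c) - exp_wrong) ** 2 / exp_wrong
        for c in correct_counts
    )
    return stat, k - 1


def chi_square_sf_df4(stat):
    """Upper-tail probability of a chi-square variable with exactly 4 df.

    Uses the closed form exp(-x/2) * (1 + x/2), valid only for df = 4
    (five tools compared on a binary outcome).
    """
    return math.exp(-stat / 2) * (1 + stat / 2)


if __name__ == "__main__":
    # Text-only row: ChatGPT answered 78 of 96 correctly.
    low, high = wilson_ci(78, 96)
    print(f"ChatGPT, text only: {low:.1f}-{high:.1f}")

    # Omnibus test across the five tools for the text-only row.
    stat, df = omnibus_chi_square([78, 86, 86, 88, 88], 96)
    print(f"chi-square={stat:.2f}, df={df}, P={chi_square_sf_df4(stat):.2f}")
```

Under these assumptions, the text-only and case-based rows give P values that round to .13 and .01, matching the table. The same check applied to the DeepSeek cardiovascular cell (3 of 8) gives bounds that round to 13.7-69.4.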