2023 Feb 8;9:e45312. doi: 10.2196/45312

Table 1. The performance of the 3 large language models (LLMs) on the 4 outlined data sets.

| LLM, response | NBME^a-Free-Step1 (n=87), n (%) | NBME-Free-Step2 (n=102), n (%) | AMBOSS-Step1 (n=100), n (%) | AMBOSS-Step2 (n=100), n (%) |
|---|---|---|---|---|
| ChatGPT^b | | | | |
| Correct | 56 (64.4) | 59 (57.8) | 44 (44) | 42 (42) |
| Incorrect | 31 (35.6) | 43 (42.2) | 56 (56) | 58 (58) |
| InstructGPT | | | | |
| Correct | 45 (51.7) | 54 (52.9) | 36 (36) | 35 (35) |
| Incorrect | 42 (48.3) | 48 (47.1) | 64 (64) | 65 (65) |
| GPT-3 | | | | |
| Correct | 22 (25.3) | 19 (18.6) | 20 (20) | 17 (17) |
| Incorrect | 65 (74.7) | 83 (81.4) | 80 (80) | 83 (83) |

^a NBME: National Board of Medical Examiners.

^b ChatGPT: Chat Generative Pre-trained Transformer.
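
As a quick consistency check on Table 1, the percentages can be recomputed from the correct-answer counts and the data set sizes given in the column headers. The following is a minimal sketch, not part of the original study; the names `DATASET_SIZES`, `CORRECT_COUNTS`, and `accuracy_table` are hypothetical and chosen only for illustration.

```python
# Data set sizes, transcribed from the column headers of Table 1.
DATASET_SIZES = {
    "NBME-Free-Step1": 87,
    "NBME-Free-Step2": 102,
    "AMBOSS-Step1": 100,
    "AMBOSS-Step2": 100,
}

# Correct-answer counts per model, in the same column order as DATASET_SIZES.
CORRECT_COUNTS = {
    "ChatGPT": [56, 59, 44, 42],
    "InstructGPT": [45, 54, 36, 35],
    "GPT-3": [22, 19, 20, 17],
}


def accuracy_table(correct_counts, dataset_sizes):
    """Return {model: {dataset: (correct, incorrect, percent_correct)}}.

    percent_correct is rounded to one decimal place, matching the table.
    """
    table = {}
    for model, counts in correct_counts.items():
        row = {}
        for (dataset, n), correct in zip(dataset_sizes.items(), counts):
            row[dataset] = (correct, n - correct, round(100 * correct / n, 1))
        table[model] = row
    return table


if __name__ == "__main__":
    for model, row in accuracy_table(CORRECT_COUNTS, DATASET_SIZES).items():
        for dataset, (correct, incorrect, pct) in row.items():
            print(f"{model:12s} {dataset:16s} "
                  f"{correct:3d} correct, {incorrect:3d} incorrect ({pct}%)")
```

Each incorrect count is derived as the data set size minus the correct count, so a mismatch with the table would point to a transcription error rather than a modeling difference.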