2023 Feb 8;9:e45312. doi: 10.2196/45312

Table 1.

The performance of the 3 large language models (LLMs) on the 4 outlined data sets.

LLM	Response	NBME^a-Free-Step1 (n=87), n (%)	NBME-Free-Step2 (n=102), n (%)	AMBOSS-Step1 (n=100), n (%)	AMBOSS-Step2 (n=100), n (%)
ChatGPT^b	Correct	56 (64.4)	59 (57.8)	44 (44)	42 (42)
ChatGPT	Incorrect	31 (35.6)	43 (42.2)	56 (56)	58 (58)
InstructGPT	Correct	45 (51.7)	54 (52.9)	36 (36)	35 (35)
InstructGPT	Incorrect	42 (48.3)	48 (47.1)	64 (64)	65 (65)
GPT-3	Correct	22 (25.3)	19 (18.6)	20 (20)	17 (17)
GPT-3	Incorrect	65 (74.7)	83 (81.4)	80 (80)	83 (83)

^aNBME: National Board of Medical Examiners.

^bChatGPT: Chat Generative Pre-trained Transformer.
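
The percentages in Table 1 follow directly from the counts: each is the number of correct (or incorrect) responses divided by the data set size n. The minimal Python sketch below recomputes the "Correct" percentages as a consistency check. The counts and data set sizes are copied from the table; the dictionary layout and the function names `accuracy_percent` and `accuracy_table` are illustrative assumptions, not part of the article.

```python
# Data set sizes (n) as given in the Table 1 column headers.
DATASET_SIZES = {
    "NBME-Free-Step1": 87,
    "NBME-Free-Step2": 102,
    "AMBOSS-Step1": 100,
    "AMBOSS-Step2": 100,
}

# Counts of correct responses per model, copied from Table 1.
CORRECT_COUNTS = {
    "ChatGPT": {"NBME-Free-Step1": 56, "NBME-Free-Step2": 59,
                "AMBOSS-Step1": 44, "AMBOSS-Step2": 42},
    "InstructGPT": {"NBME-Free-Step1": 45, "NBME-Free-Step2": 54,
                    "AMBOSS-Step1": 36, "AMBOSS-Step2": 35},
    "GPT-3": {"NBME-Free-Step1": 22, "NBME-Free-Step2": 19,
              "AMBOSS-Step1": 20, "AMBOSS-Step2": 17},
}


def accuracy_percent(correct: int, total: int) -> float:
    """Return correct/total as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


def accuracy_table() -> dict:
    """Recompute the 'Correct' percentage for every model and data set."""
    return {
        model: {
            dataset: accuracy_percent(count, DATASET_SIZES[dataset])
            for dataset, count in counts.items()
        }
        for model, counts in CORRECT_COUNTS.items()
    }


if __name__ == "__main__":
    for model, row in accuracy_table().items():
        print(model, row)
```

The "Incorrect" rows can be checked the same way, since each incorrect count equals n minus the corresponding correct count (for example, 87 - 56 = 31 for ChatGPT on NBME-Free-Step1).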