Cureus. 2023 Jun 22;15(6):e40822. doi: 10.7759/cureus.40822

Table 2. Percentage of questions answered correctly by GPT-3.5 vs. GPT-4 vs. humans by generalized anatomical category and difficulty level.

The “anterior segment” category included the cornea, cataract, and refractive surgery categories; the “posterior segment” category included the retina and vitreous category; the “other” category consisted of neuro-ophthalmology, pediatrics, and oculoplastics. Questions from the glaucoma, pathology, and uveitis categories were assigned individually to the “anterior,” “posterior,” or “other” category according to question content. Level 1 indicated “basic” difficulty and tested recall; Level 2 indicated “moderate” difficulty and tested comprehension of basic facts; Level 3 indicated “difficult” questions and tested application, that is, the use of knowledge in care; Level 4 indicated “expert,” high-complexity questions and tested analysis and evaluation skills.

Bolding indicates statistical significance.

Grouping	Subgroup (n)	GPT-3.5 Questions Answered Correctly (%)	GPT-4 Questions Answered Correctly (%)	Human Questions Answered Correctly (%)	GPT-3.5 vs GPT-4 P-Value	GPT-3.5 vs Human P-Value	GPT-4 vs Human P-Value
Question Difficulty Level	1 (n = 7)	86	86	63	1	0.176	0.176
	2 (n = 278)	60	86	61	<0.001	0.513	<0.001
	3 (n = 265)	56	68	57	<0.001	0.662	<0.001
	4 (n = 117)	49	75	60	<0.001	0.008	<0.001
Generalized Anatomical Category	Anterior segment (n = 208)	54	71	59	<0.001	0.211	<0.001
	Posterior segment (n = 75)	57	71	60	0.033	0.657	0.034
	Other (n = 184)	56	77	57	<0.001	0.725	<0.001
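
Because the bolding that marks significance does not survive in plain text, the rows above can be re-checked programmatically. The following is a minimal sketch, not the authors' analysis: it assumes a significance threshold of 0.05 (the table does not state one), and the names `ROWS`, `ALPHA`, `is_significant`, and `summarize` are illustrative. The values are transcribed from the table as shown.

```python
# Rows transcribed from Table 2:
# (group, n, GPT-3.5 %, GPT-4 %, human %,
#  p GPT-3.5 vs GPT-4, p GPT-3.5 vs human, p GPT-4 vs human)
ROWS = [
    ("Level 1", 7, 86, 86, 63, "1", "0.176", "0.176"),
    ("Level 2", 278, 60, 86, 61, "<0.001", "0.513", "<0.001"),
    ("Level 3", 265, 56, 68, 57, "<0.001", "0.662", "<0.001"),
    ("Level 4", 117, 49, 75, 60, "<0.001", "0.008", "<0.001"),
    ("Anterior segment", 208, 54, 71, 59, "<0.001", "0.211", "<0.001"),
    ("Posterior segment", 75, 57, 71, 60, "0.033", "0.657", "0.034"),
    ("Other", 184, 56, 77, 57, "<0.001", "0.725", "<0.001"),
]

ALPHA = 0.05  # ASSUMPTION: threshold is not stated in the table


def is_significant(p_text, alpha=ALPHA):
    """Decide significance from a reported p-value string.

    A value written as "<x" is only an upper bound, so it counts as
    significant when x <= alpha; an exact value must be strictly below alpha.
    """
    text = p_text.strip()
    if text.startswith("<"):
        return float(text[1:]) <= alpha
    return float(text) < alpha


def summarize(rows, alpha=ALPHA):
    """Return one dict per row with the GPT-4 gain and significance flags."""
    out = []
    for group, n, gpt35, gpt4, human, p_a, p_b, p_c in rows:
        out.append({
            "group": group,
            "n": n,
            # Difference in percentage points, GPT-4 minus GPT-3.5
            "gpt4_minus_gpt35": gpt4 - gpt35,
            # Order: GPT-3.5 vs GPT-4, GPT-3.5 vs human, GPT-4 vs human
            "significant": tuple(is_significant(p, alpha) for p in (p_a, p_b, p_c)),
        })
    return out


if __name__ == "__main__":
    for row in summarize(ROWS):
        print(row["group"], row["gpt4_minus_gpt35"], row["significant"])
```

Under the assumed threshold, the flags give a plain-text stand-in for the lost bolding; a different threshold can be passed as `alpha`.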