2023 Jun 22;15(6):e40822. doi: 10.7759/cureus.40822

Table 2. Percentage of questions answered correctly by GPT-3.5 vs. GPT-4 vs. humans by generalized anatomical category and difficulty level.

The “anterior segment” included the cornea, cataract, and refractive surgery categories; the “posterior segment” included the retina and vitreous category; the “other” category consisted of neuro-ophthalmology, pediatrics, and oculoplastics. Questions from the glaucoma, pathology, and uveitis categories were individually divided amongst the “anterior,” “posterior,” and “other” categories according to question content. Level 1 indicated the “basic” difficulty level and tested recall; Level 2 indicated “moderate” difficulty and tested the ability to comprehend basic facts; Level 3 was described as “difficult” and tested application, or knowledge use in care; Level 4 was considered an “expert” high-complexity question and tested analysis and evaluation skills.
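The grouping rules in this note can be sketched as a small lookup. This is an illustrative sketch only, not the authors' code: the key strings, the `generalized_category` function, and its `manual_label` parameter are hypothetical names, and the individual per-question assignment for glaucoma, pathology, and uveitis is modeled as a label supplied by the caller.

```python
from typing import Optional

# Fixed subspecialty-to-group assignments, as described in the table note.
# The exact key strings are assumptions made for this sketch.
FIXED_GROUPS = {
    "cornea": "anterior segment",
    "cataract": "anterior segment",
    "refractive surgery": "anterior segment",
    "retina and vitreous": "posterior segment",
    "neuro-ophthalmology": "other",
    "pediatrics": "other",
    "oculoplastics": "other",
}

# Categories whose questions were divided individually by question content.
CONTENT_DEPENDENT = {"glaucoma", "pathology", "uveitis"}

VALID_GROUPS = {"anterior segment", "posterior segment", "other"}

# Difficulty levels: (label, skill tested), paraphrasing the table note.
DIFFICULTY_LEVELS = {
    1: ("basic", "recall"),
    2: ("moderate", "comprehension of basic facts"),
    3: ("difficult", "application, or knowledge use in care"),
    4: ("expert", "analysis and evaluation"),
}


def generalized_category(subspecialty: str, manual_label: Optional[str] = None) -> str:
    """Return the generalized anatomical group for one question.

    manual_label is a hypothetical per-question label, required only for
    the content-dependent categories.
    """
    key = subspecialty.strip().lower()
    if key in FIXED_GROUPS:
        return FIXED_GROUPS[key]
    if key in CONTENT_DEPENDENT:
        if manual_label not in VALID_GROUPS:
            raise ValueError(f"{subspecialty!r} needs a manual label from {sorted(VALID_GROUPS)}")
        return manual_label
    raise KeyError(f"unknown subspecialty: {subspecialty!r}")
```

For example, `generalized_category("Glaucoma", manual_label="anterior segment")` returns the caller-supplied label, whereas `generalized_category("Cornea")` needs no label.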

Bolding indicates statistical significance.

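| Grouping | Subgroup | GPT-3.5 Questions Answered Correctly (%) | GPT-4 Questions Answered Correctly (%) | Human Questions Answered Correctly (%) | GPT-3.5 vs. GPT-4 P-Value | GPT-3.5 vs. Human P-Value | GPT-4 vs. Human P-Value |
|---|---|---|---|---|---|---|---|
| Question Difficulty Level | 1 (n = 7) | 86 | 86 | 63 | 1 | 0.176 | 0.176 |
| Question Difficulty Level | 2 (n = 278) | 60 | 86 | 61 | <0.001 | 0.513 | <0.001 |
| Question Difficulty Level | 3 (n = 265) | 56 | 68 | 57 | <0.001 | 0.662 | <0.001 |
| Question Difficulty Level | 4 (n = 117) | 49 | 75 | 60 | <0.001 | 0.008 | <0.001 |
| Generalized Anatomical Category | Anterior segment (n = 208) | 54 | 71 | 59 | <0.001 | 0.211 | <0.001 |
| Generalized Anatomical Category | Posterior segment (n = 75) | 57 | 71 | 60 | 0.033 | 0.657 | 0.034 |
| Generalized Anatomical Category | Other (n = 184) | 56 | 77 | 57 | <0.001 | 0.725 | <0.001 |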
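
As a reading aid, the per-level percentages can be combined into a question-weighted average using the n reported for each difficulty level. This is a rough sketch, not a figure reported in the table: it works from the rounded percentages above, so the results are approximate, and `DIFFICULTY_ROWS` and `weighted_percent` are hypothetical names introduced here.

```python
# Rows copied from the difficulty-level part of Table 2:
# (level, n questions, GPT-3.5 %, GPT-4 %, human %)
DIFFICULTY_ROWS = [
    (1, 7, 86, 86, 63),
    (2, 278, 60, 86, 61),
    (3, 265, 56, 68, 57),
    (4, 117, 49, 75, 60),
]


def weighted_percent(rows, column):
    """Average the percentages in `column`, weighting each row by its n."""
    total_n = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_n


if __name__ == "__main__":
    for name, column in (("GPT-3.5", 2), ("GPT-4", 3), ("Human", 4)):
        print(f"{name}: {weighted_percent(DIFFICULTY_ROWS, column):.1f}%")
```

Weighting by n keeps the seven Level 1 questions from counting as much as the 278 Level 2 questions, which a plain mean of the four rows would do.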