Table 2. Percentage of questions answered correctly by GPT-3.5 vs. GPT-4 vs. humans by generalized anatomical category and difficulty level.
The “anterior segment” category included the cornea, cataract, and refractive surgery categories; the “posterior segment” category included the retina and vitreous category; the “other” category consisted of neuro-ophthalmology, pediatrics, and oculoplastics. Questions from the glaucoma, pathology, and uveitis categories were assigned individually to the “anterior,” “posterior,” or “other” category according to question content. Level 1 indicated “basic” difficulty and tested recall; Level 2 indicated “moderate” difficulty and tested the ability to comprehend basic facts; Level 3 indicated “difficult” questions and tested application, that is, the use of knowledge in care; Level 4 indicated “expert,” high-complexity questions and tested analysis and evaluation skills.
Bolding indicates statistical significance.
Stratification | Subgroup | GPT-3.5 Questions Answered Correctly (%) | GPT-4 Questions Answered Correctly (%) | Human Questions Answered Correctly (%) | GPT-3.5 vs GPT-4 P-Value | GPT-3.5 vs Human P-Value | GPT-4 vs Human P-Value |
Question Difficulty Level | 1 (n = 7) | 86 | 86 | 63 | 1 | 0.176 | 0.176 |
Question Difficulty Level | 2 (n = 278) | 60 | 86 | 61 | <0.001 | 0.513 | <0.001 |
Question Difficulty Level | 3 (n = 265) | 56 | 68 | 57 | <0.001 | 0.662 | <0.001 |
Question Difficulty Level | 4 (n = 117) | 49 | 75 | 60 | <0.001 | 0.008 | <0.001 |
Generalized Anatomical Category | Anterior segment (n = 208) | 54 | 71 | 59 | <0.001 | 0.211 | <0.001 |
Generalized Anatomical Category | Posterior segment (n = 75) | 57 | 71 | 60 | 0.033 | 0.657 | 0.034 |
Generalized Anatomical Category | Other (n = 184) | 56 | 77 | 57 | <0.001 | 0.725 | <0.001 |
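The grouping rules in the table note can be expressed as a small lookup. The following is a minimal Python sketch, not the authors' procedure: the names `GENERALIZED_CATEGORY`, `CONTENT_DEPENDENT`, `DIFFICULTY_LEVELS`, and `generalize` are hypothetical, and the `manual_label` argument is an assumed stand-in for the per-question judgment that the note describes for glaucoma, pathology, and uveitis questions.

```python
from typing import Optional

# Fixed grouping stated in the table note.
GENERALIZED_CATEGORY = {
    "cornea": "anterior segment",
    "cataract": "anterior segment",
    "refractive surgery": "anterior segment",
    "retina and vitreous": "posterior segment",
    "neuro-ophthalmology": "other",
    "pediatrics": "other",
    "oculoplastics": "other",
}

# Categories whose questions were assigned one by one according to content.
CONTENT_DEPENDENT = {"glaucoma", "pathology", "uveitis"}

ALLOWED_LABELS = {"anterior segment", "posterior segment", "other"}

# Difficulty level -> (label, skill tested), as described in the table note.
DIFFICULTY_LEVELS = {
    1: ("basic", "recall"),
    2: ("moderate", "comprehension of basic facts"),
    3: ("difficult", "application, or knowledge use in care"),
    4: ("expert", "analysis and evaluation"),
}


def generalize(category: str, manual_label: Optional[str] = None) -> str:
    """Return the generalized anatomical category for a question.

    `manual_label` is a hypothetical input carrying the reviewer's
    content-based assignment for glaucoma, pathology, or uveitis questions.
    """
    key = category.strip().lower()
    if key in GENERALIZED_CATEGORY:
        return GENERALIZED_CATEGORY[key]
    if key in CONTENT_DEPENDENT:
        if manual_label not in ALLOWED_LABELS:
            raise ValueError(
                f"{category!r} questions need a content-based label from "
                f"{sorted(ALLOWED_LABELS)}"
            )
        return manual_label
    raise KeyError(f"unknown category: {category!r}")
```

Under these assumptions, `generalize("Cornea")` resolves directly from the fixed mapping, while `generalize("glaucoma", "posterior segment")` passes through a label that a reviewer would have to supply for that specific question.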