Figure 3. Performance of GPT-3.5, GPT-4, and humans on StatPearls questions, grouped by difficulty level.
Level 1 (“basic”) tested recall; Level 2 (“moderate”) tested comprehension of basic facts; Level 3 (“difficult”) tested application, that is, the use of knowledge in care; Level 4 (“expert”) comprised high-complexity questions that tested analysis and evaluation skills.
*, **, and † indicate statistical significance.