[Preprint]. 2024 Jul 16:2024.07.16.24310297. [Version 1] doi: 10.1101/2024.07.16.24310297

Table 2: Performance on image and no image questions

Test                Image Questions           No Image Questions
GPT-4 (4)           40.7% (37.0% – 44.4%)     59.2% (58.2% – 60.6%)
Gemini (G)          22.2% (20.4% – 25.9%)     44.7% (44.0% – 46.1%)
GPT-4 Turbo (4T)    44.4% (40.7% – 48.1%)     62.4% (62.1% – 63.8%)
GPT-4o (4o)         44.4% (42.2% – 46.7%)     66.7% (65.7% – 67.7%)

Pairwise p-values
Comparison          Image Questions           No Image Questions
4 vs. G             p < 0.001                 p < 0.001
4 vs. 4T            p = 0.885                 p < 0.001
G vs. 4T            p < 0.001                 p < 0.001
4o vs. 4T           p = 0.821                 p = 0.001
4o vs. G            p < 0.001                 p < 0.001
4o vs. 4            p = 0.956                 p < 0.001
* Values are presented as median percentages with 95% confidence intervals over 30 attempts at the exam.