2025 Aug 21;55(4):177–185. doi: 10.4274/tjo.galenos.2025.27895

Table 2. Evaluation of ChatGPT-4o and Gemini 1.5 Pro across examination years: language and model comparisons.

Correct answer count and accuracy (%), by language (TR, EN), model, and attempt (First, Second, Final):

| Year | n | TR, ChatGPT-4o, First | TR, ChatGPT-4o, Second | TR, ChatGPT-4o, Final | TR, Gemini 1.5 Pro, First | TR, Gemini 1.5 Pro, Second | TR, Gemini 1.5 Pro, Final | EN, ChatGPT-4o, First | EN, ChatGPT-4o, Second | EN, ChatGPT-4o, Final | EN, Gemini 1.5 Pro, First | EN, Gemini 1.5 Pro, Second | EN, Gemini 1.5 Pro, Final |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | 9 | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) |
| 2007 | 17 | 17 (100) | 17 (100) | 17 (100) | 15 (88.2) | 15 (88.2) | 15 (88.2) | 17 (100) | 17 (100) | 17 (100) | 15 (88.2) | 15 (88.2) | 15 (88.2) |
| 2008 | 19 | 19 (100) | 19 (100) | 19 (100) | 18 (94.7) | 18 (94.7) | 18 (94.7) | 19 (100) | 19 (100) | 19 (100) | 19 (100) | 19 (100) | 19 (100) |
| 2009 | 10 | 10 (100) | 10 (100) | 10 (100) | 9 (90) | 9 (90) | 10 (100) | 10 (100) | 10 (100) | 10 (100) | 10 (100) | 10 (100) | 10 (100) |
| 2010 | 9 | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) |
| 2011 | 12 | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) |
| 2012 | 9 | 7 (77.8) | 7 (77.8) | 7 (77.8) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) |
| 2013 | 16 | 16 (100) | 16 (100) | 16 (100) | 15 (93.8) | 15 (93.8) | 15 (93.8) | 16 (100) | 16 (100) | 16 (100) | 15 (93.8) | 15 (93.8) | 15 (93.8) |
| 2014 | 21 | 19 (90.5) | 19 (90.5) | 21 (100) | 21 (100) | 21 (100) | 21 (100) | 20 (95.2) | 20 (95.2) | 21 (100) | 19 (90.5) | 19 (90.5) | 19 (90.5) |
| 2015 | 12 | 11 (91.7) | 11 (91.7) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 10 (83.3) | 10 (83.3) | 12 (100) | 10 (83.3) | 10 (83.3) | 10 (83.3) |
| 2016 | 16 | 15 (93.8) | 15 (93.8) | 15 (93.8) | 13 (81.3) | 13 (81.3) | 13 (81.3) | 15 (93.8) | 15 (93.8) | 15 (93.8) | 14 (87.5) | 14 (87.5) | 14 (87.5) |
| 2017 | 13 | 13 (100) | 13 (100) | 13 (100) | 12 (92.3) | 12 (92.3) | 13 (100) | 12 (92.3) | 12 (92.3) | 13 (100) | 11 (84.6) | 11 (84.6) | 11 (84.6) |
| 2018 | 13 | 13 (100) | 13 (100) | 13 (100) | 11 (84.6) | 11 (84.6) | 11 (84.6) | 13 (100) | 13 (100) | 13 (100) | 12 (92.3) | 12 (92.3) | 12 (92.3) |
| 2019 | 11 | 10 (90.9) | 11 (100) | 11 (100) | 8 (72.7) | 9 (81.8) | 9 (81.8) | 10 (90.9) | 10 (90.9) | 10 (90.9) | 10 (90.9) | 10 (90.9) | 10 (90.9) |
| 2020 | 15 | 13 (86.7) | 13 (86.7) | 13 (86.7) | 14 (93.3) | 15 (100) | 15 (100) | 13 (86.7) | 13 (86.7) | 13 (86.7) | 15 (100) | 15 (100) | 15 (100) |
| 2021 | 16 | 14 (87.5) | 14 (87.5) | 15 (93.8) | 14 (87.5) | 14 (87.5) | 15 (93.8) | 15 (93.8) | 15 (93.8) | 16 (100) | 14 (87.5) | 14 (87.5) | 15 (93.8) |
| 2022-2024 (10%) | 2 | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) |

p* (values listed in order for the first, second, and final attempts):

| Comparison | p* |
|---|---|
| Interlingual, ChatGPT-4o | >0.99, >0.99, >0.99 |
| Interlingual, Gemini 1.5 Pro | >0.99, >0.99, >0.99 |
| Inter-AI, TR | >0.99, >0.99, >0.99 |
| Inter-AI, EN | >0.99, >0.99, >0.99 |

TR: Turkish, EN: English, AI: Artificial intelligence, LLM: Large language model. *Comparison between the years 2016-2024 via chi-square test. The AI-based LLMs are listed in alphabetical order. The p values of the interlingual and inter-AI comparisons are listed in order for the first, second, and final attempts. p<0.05 was considered statistically significant at the 95% confidence level.
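The accuracy percentages in the table appear to be the correct-answer count divided by n, expressed to one decimal place. A minimal sketch of that arithmetic follows. The function name `accuracy_percent` is hypothetical, and the half-up rounding mode is an assumption, chosen because it reproduces 13/16 as 81.3 as printed in the 2016 row.

```python
from decimal import Decimal, ROUND_HALF_UP


def accuracy_percent(correct: int, total: int) -> Decimal:
    """Return correct/total as a percentage rounded to one decimal place.

    Half-up rounding is an assumption, not something the source states.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct = Decimal(100 * correct) / Decimal(total)
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Counts taken from the 2007 (n = 17), 2016 (n = 16), and 2019 (n = 11) rows.
print(accuracy_percent(15, 17))  # 88.2
print(accuracy_percent(13, 16))  # 81.3
print(accuracy_percent(8, 11))   # 72.7
```

Python's built-in `round` uses round-half-to-even, which would give 81.2 for 13/16, so the sketch uses `decimal` instead.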