Table 2. Evaluation of ChatGPT-4o and Gemini 1.5 Pro across examination years: language and model comparisons.
Cells under TR and EN give the correct answer count and accuracy (%). The p* cells list the values for the first, second, and final attempts.

| Year | n | TR, ChatGPT-4o, First | TR, ChatGPT-4o, Second | TR, ChatGPT-4o, Final | TR, Gemini 1.5 Pro, First | TR, Gemini 1.5 Pro, Second | TR, Gemini 1.5 Pro, Final | EN, ChatGPT-4o, First | EN, ChatGPT-4o, Second | EN, ChatGPT-4o, Final | EN, Gemini 1.5 Pro, First | EN, Gemini 1.5 Pro, Second | EN, Gemini 1.5 Pro, Final | p*, Interlingual, ChatGPT-4o | p*, Interlingual, Gemini 1.5 Pro | p*, Inter-AI, TR | p*, Inter-AI, EN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006 | 9 | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 |
| 2007 | 17 | 17 (100) | 17 (100) | 17 (100) | 15 (88.2) | 15 (88.2) | 15 (88.2) | 17 (100) | 17 (100) | 17 (100) | 15 (88.2) | 15 (88.2) | 15 (88.2) | | | | |
| 2008 | 19 | 19 (100) | 19 (100) | 19 (100) | 18 (94.7) | 18 (94.7) | 18 (94.7) | 19 (100) | 19 (100) | 19 (100) | 19 (100) | 19 (100) | 19 (100) | | | | |
| 2009 | 10 | 10 (100) | 10 (100) | 10 (100) | 9 (90) | 9 (90) | 10 (100) | 10 (100) | 10 (100) | 10 (100) | 10 (100) | 10 (100) | 10 (100) | | | | |
| 2010 | 9 | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | 9 (100) | | | | |
| 2011 | 12 | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | | | | |
| 2012 | 9 | 7 (77.8) | 7 (77.8) | 7 (77.8) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | 8 (88.9) | | | | |
| 2013 | 16 | 16 (100) | 16 (100) | 16 (100) | 15 (93.8) | 15 (93.8) | 15 (93.8) | 16 (100) | 16 (100) | 16 (100) | 15 (93.8) | 15 (93.8) | 15 (93.8) | | | | |
| 2014 | 21 | 19 (90.5) | 19 (90.5) | 21 (100) | 21 (100) | 21 (100) | 21 (100) | 20 (95.2) | 20 (95.2) | 21 (100) | 19 (90.5) | 19 (90.5) | 19 (90.5) | | | | |
| 2015 | 12 | 11 (91.7) | 11 (91.7) | 12 (100) | 12 (100) | 12 (100) | 12 (100) | 10 (83.3) | 10 (83.3) | 12 (100) | 10 (83.3) | 10 (83.3) | 10 (83.3) | | | | |
| 2016 | 16 | 15 (93.8) | 15 (93.8) | 15 (93.8) | 13 (81.3) | 13 (81.3) | 13 (81.3) | 15 (93.8) | 15 (93.8) | 15 (93.8) | 14 (87.5) | 14 (87.5) | 14 (87.5) | | | | |
| 2017 | 13 | 13 (100) | 13 (100) | 13 (100) | 12 (92.3) | 12 (92.3) | 13 (100) | 12 (92.3) | 12 (92.3) | 13 (100) | 11 (84.6) | 11 (84.6) | 11 (84.6) | | | | |
| 2018 | 13 | 13 (100) | 13 (100) | 13 (100) | 11 (84.6) | 11 (84.6) | 11 (84.6) | 13 (100) | 13 (100) | 13 (100) | 12 (92.3) | 12 (92.3) | 12 (92.3) | | | | |
| 2019 | 11 | 10 (90.9) | 11 (100) | 11 (100) | 8 (72.7) | 9 (81.8) | 9 (81.8) | 10 (90.9) | 10 (90.9) | 10 (90.9) | 10 (90.9) | 10 (90.9) | 10 (90.9) | | | | |
| 2020 | 15 | 13 (86.7) | 13 (86.7) | 13 (86.7) | 14 (93.3) | 15 (100) | 15 (100) | 13 (86.7) | 13 (86.7) | 13 (86.7) | 15 (100) | 15 (100) | 15 (100) | | | | |
| 2021 | 16 | 14 (87.5) | 14 (87.5) | 15 (93.8) | 14 (87.5) | 14 (87.5) | 15 (93.8) | 15 (93.8) | 15 (93.8) | 16 (100) | 14 (87.5) | 14 (87.5) | 15 (93.8) | | | | |
| 2022-2024 (10%) | 2 | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | 2 (100) | | | | |

TR: Turkish, EN: English, AI: artificial intelligence, LLM: large language model. *Comparison between the years 2016-2024 via chi-square test. The AI-based LLMs are listed in alphabetical order. The p values of the interlingual and inter-AI comparisons are listed in order for the first, second, and final attempts. p<0.05 was considered statistically significant at the 95% confidence level.
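
Each accuracy cell in the table appears to be the correct answer count divided by n, expressed as a percentage with at most one decimal place. The following is a minimal sketch of that arithmetic, not the authors' code. The helper names `accuracy_pct` and `format_cell` are hypothetical, and half-up rounding is an assumption inferred from cells such as 13/16 being printed as 81.3.

```python
from decimal import Decimal, ROUND_HALF_UP


def accuracy_pct(correct: int, n: int) -> str:
    """Return correct/n as a percentage string with at most one decimal place."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("expected n > 0 and 0 <= correct <= n")
    # Assumption: the table rounds half-up (13/16 = 81.25 is printed as 81.3),
    # so Decimal is used instead of round(), which rounds ties to even.
    pct = (Decimal(correct) * 100 / Decimal(n)).quantize(
        Decimal("0.1"), rounding=ROUND_HALF_UP
    )
    whole = pct.to_integral_value()
    # Whole percentages are printed without a trailing ".0" (e.g. "100", "90").
    return str(whole) if pct == whole else str(pct)


def format_cell(correct: int, n: int) -> str:
    """Format a cell the way the table does: 'count (accuracy)'."""
    return f"{correct} ({accuracy_pct(correct, n)})"


# (year, n, correct) taken from the Gemini 1.5 Pro, TR, first-attempt column.
sample_rows = [
    ("2006", 9, 9),
    ("2007", 17, 15),
    ("2009", 10, 9),
    ("2016", 16, 13),
]

for year, n, correct in sample_rows:
    print(year, format_cell(correct, n))
```

Run against the four sample rows, the sketch reproduces the printed cells 9 (100), 15 (88.2), 9 (90), and 13 (81.3).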