2025 Aug 21;55(4):177–185. doi: 10.4274/tjo.galenos.2025.27895

Table 2. Evaluation of ChatGPT-4o and Gemini 1.5 Pro across examination years: language and model comparisons.

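Values are correct answer count and accuracy (%).
Year	n	TR, ChatGPT-4o, First	TR, ChatGPT-4o, Second	TR, ChatGPT-4o, Final	TR, Gemini 1.5 Pro, First	TR, Gemini 1.5 Pro, Second	TR, Gemini 1.5 Pro, Final	EN, ChatGPT-4o, First	EN, ChatGPT-4o, Second	EN, ChatGPT-4o, Final	EN, Gemini 1.5 Pro, First	EN, Gemini 1.5 Pro, Second	EN, Gemini 1.5 Pro, Final	p*, Interlingual, ChatGPT-4o	p*, Interlingual, Gemini 1.5 Pro	p*, Inter-AI, TR	p*, Inter-AI, EN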
2006	9	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99
2007	17	17 (100)	17 (100)	17 (100)	15 (88.2)	15 (88.2)	15 (88.2)	17 (100)	17 (100)	17 (100)	15 (88.2)	15 (88.2)	15 (88.2)
2008	19	19 (100)	19 (100)	19 (100)	18 (94.7)	18 (94.7)	18 (94.7)	19 (100)	19 (100)	19 (100)	19 (100)	19 (100)	19 (100)
2009	10	10 (100)	10 (100)	10 (100)	9 (90)	9 (90)	10 (100)	10 (100)	10 (100)	10 (100)	10 (100)	10 (100)	10 (100)
2010	9	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)	9 (100)
2011	12	12 (100)	12 (100)	12 (100)	12 (100)	12 (100)	12 (100)	12 (100)	12 (100)	12 (100)	12 (100)	12 (100)	12 (100)
2012	9	7 (77.8)	7 (77.8)	7 (77.8)	8 (88.9)	8 (88.9)	8 (88.9)	8 (88.9)	8 (88.9)	8 (88.9)	8 (88.9)	8 (88.9)	8 (88.9)
2013	16	16 (100)	16 (100)	16 (100)	15 (93.8)	15 (93.8)	15 (93.8)	16 (100)	16 (100)	16 (100)	15 (93.8)	15 (93.8)	15 (93.8)
2014	21	19 (90.5)	19 (90.5)	21 (100)	21 (100)	21 (100)	21 (100)	20 (95.2)	20 (95.2)	21 (100)	19 (90.5)	19 (90.5)	19 (90.5)
2015	12	11 (91.7)	11 (91.7)	12 (100)	12 (100)	12 (100)	12 (100)	10 (83.3)	10 (83.3)	12 (100)	10 (83.3)	10 (83.3)	10 (83.3)
2016	16	15 (93.8)	15 (93.8)	15 (93.8)	13 (81.3)	13 (81.3)	13 (81.3)	15 (93.8)	15 (93.8)	15 (93.8)	14 (87.5)	14 (87.5)	14 (87.5)
2017	13	13 (100)	13 (100)	13 (100)	12 (92.3)	12 (92.3)	13 (100)	12 (92.3)	12 (92.3)	13 (100)	11 (84.6)	11 (84.6)	11 (84.6)
2018	13	13 (100)	13 (100)	13 (100)	11 (84.6)	11 (84.6)	11 (84.6)	13 (100)	13 (100)	13 (100)	12 (92.3)	12 (92.3)	12 (92.3)
2019	11	10 (90.9)	11 (100)	11 (100)	8 (72.7)	9 (81.8)	9 (81.8)	10 (90.9)	10 (90.9)	10 (90.9)	10 (90.9)	10 (90.9)	10 (90.9)
2020	15	13 (86.7)	13 (86.7)	13 (86.7)	14 (93.3)	15 (100)	15 (100)	13 (86.7)	13 (86.7)	13 (86.7)	15 (100)	15 (100)	15 (100)
2021	16	14 (87.5)	14 (87.5)	15 (93.8)	14 (87.5)	14 (87.5)	15 (93.8)	15 (93.8)	15 (93.8)	16 (100)	14 (87.5)	14 (87.5)	15 (93.8)
2022-2024 (10%)	2	2 (100)	2 (100)	2 (100)	2 (100)	2 (100)	2 (100)	2 (100)	2 (100)	2 (100)	2 (100)	2 (100)	2 (100)

TR: Turkish, EN: English, AI: Artificial intelligence, LLM: Large language model. *Comparison between the years 2016-2024 via chi-square test. The AI-based LLMs are listed in alphabetical order. The p values of the interlingual and inter-AI comparisons are listed in order for the first, second, and final attempts. p<0.05 was considered statistically significant (95% confidence level).
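
As a rough illustration of the arithmetic behind the table, the sketch below recomputes one accuracy percentage and runs a Pearson chi-square test on a single 2x2 comparison. It is not the authors' analysis: the source does not describe how the contingency tables were built, so the 2x2 layout (model by correct/incorrect), the omission of a continuity correction, and the function names accuracy_pct and chi_square_2x2 are all assumptions made for illustration. The example row is 2007, TR, first attempt (ChatGPT-4o 17/17, Gemini 1.5 Pro 15/17).

```python
import math


def accuracy_pct(correct, n):
    """Accuracy as a percentage, rounded to one decimal as in the table."""
    return round(100 * correct / n, 1)


def chi_square_2x2(correct_a, correct_b, n):
    """Pearson chi-square (no continuity correction) for two models that
    each answered the same n questions. Hypothetical helper, not from the source.
    Returns (statistic, p_value) with 1 degree of freedom."""
    observed = [
        [correct_a, n - correct_a],  # model A: correct, incorrect
        [correct_b, n - correct_b],  # model B: correct, incorrect
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        # An empty margin makes an expected count zero; the statistic is undefined.
        raise ValueError("chi-square is undefined when a row or column total is zero")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    print(accuracy_pct(15, 17))  # 88.2
    chi2, p = chi_square_2x2(17, 15, 17)
    print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

With counts this small, several expected cell counts fall below 5, so the chi-square approximation is unreliable and an exact test may be a better fit for per-year comparisons. Rows where both models answer every question correctly, such as 2006, raise the ValueError above because the incorrect-answer column is empty.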