2025 Aug 21;55(4):177–185. doi: 10.4274/tjo.galenos.2025.27895

Table 1. Interlingual and inter-model comparisons of chatbot performance on section questions across multiple attempts.

Section	n	Correct answer count and accuracy (%)												p*
		ChatGPT-4o						Gemini 1.5 Pro
		TR			EN			TR			EN			Inter-AI		Interlingual
		First	Second	Final	First	Second	Final	First	Second	Final	First	Second	Final	TR	EN	ChatGPT-4o	Gemini 1.5 Pro
Neuro-ophthalmology	72	67 (93.1)	68 (94.4)	70 (97.2)	66 (91.7)	66 (91.7)	70 (97.2)	65 (90.3)	65 (90.3)	67 (93.1)	64 (88.9)	64 (88.9)	64 (88.9)	0.763, 0.745, 0.441	0.779, 0.779, 0.097	>0.99, 0.745, >0.99	>0.99, 0.779, 0.561
Retina and vitreous	23	22 (95.7)	22 (95.7)	23 (100)	23 (100)	23 (100)	23 (100)	22 (95.7)	22 (95.7)	22 (95.7)	22 (95.7)	22 (95.7)	22 (95.7)	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99
Glaucoma	13	12 (92.3)	12 (92.3)	13 (100)	13 (100)	13 (100)	13 (100)	12 (92.3)	12 (92.3)	13 (100)	12 (92.3)	12 (92.3)	13 (100)	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99
Cornea & anterior segment	38	38 (100)	38 (100)	38 (100)	38 (100)	38 (100)	38 (100)	36 (94.7)	36 (94.7)	36 (94.7)	38 (100)	38 (100)	38 (100)	0.493, 0.493, 0.493	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	0.493, 0.493, 0.493
Pediatric ophthalmology and strabismus	22	21 (95.5)	21 (95.5)	21 (95.5)	21 (95.5)	21 (95.5)	21 (95.5)	20 (90.9)	20 (90.9)	20 (90.9)	20 (90.9)	20 (90.9)	20 (90.9)	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99
Oculoplastics and ocular oncology	39	37 (94.9)	37 (94.9)	37 (94.9)	38 (97.4)	38 (97.4)	39 (100)	36 (92.3)	36 (92.3)	37 (94.9)	37 (94.9)	37 (94.9)	37 (94.9)	>0.99, >0.99, >0.99	>0.99, >0.99, 0.494	>0.99, >0.99, 0.494	>0.99, >0.99, >0.99
Uveitis	13	12 (92.3)	12 (92.3)	12 (92.3)	11 (84.6)	11 (84.6)	11 (84.6)	11 (84.6)	12 (92.3)	12 (92.3)	11 (84.6)	11 (84.6)	11 (84.6)	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99	>0.99, >0.99, >0.99
Total	220	209 (95)	210 (95.5)	214 (97.3)	210 (95.5)	210 (95.5)	215 (97.7)	202 (91.8)	203 (92.3)	207 (94.1)	204 (92.7)	204 (92.7)	205 (93.2)	0.249, 0.233, 0.159	0.312, 0.312, 0.039	>0.99, >0.99, >0.99	0.858, >0.99, 0.845

TR: Turkish, EN: English, AI: artificial intelligence, LLM: large language model. The AI-based LLMs are listed in alphabetical order. The p values of the interlingual and inter-AI comparisons are listed in order for the first, second, and final attempts. *Fisher’s exact test was used if at least one expected frequency in the 2×2 table was below 5; Yates’ continuity-corrected chi-square test was used if the smallest expected frequency was between 5 and 25. p<0.05 was considered statistically significant at the 95% confidence level.
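
The test-selection rule in the footnote can be illustrated with a minimal Python sketch that uses only the standard library. It is not the authors' software or analysis script. The function names (`expected_counts`, `fisher_exact_two_sided`, `chi_square_p`, `compare`) are hypothetical, and three choices are assumptions not stated in the footnote: the two-sided Fisher p value sums all tables no more probable than the observed one, the upper bound of 25 is inclusive, and an uncorrected chi-square test is used above 25.

```python
from math import comb, erfc, sqrt


def expected_counts(a, b, c, d):
    """Expected cell counts of the 2x2 table [[a, b], [c, d]] under independence."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [r * k / n for r in rows for k in cols]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p: sum of table probabilities no larger than the observed one."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def chi_square_p(a, b, c, d, yates=True):
    """Chi-square test with 1 degree of freedom, optionally Yates-corrected."""
    shift = 0.5 if yates else 0.0
    stat = sum(
        max(abs(o - e) - shift, 0.0) ** 2 / e
        for o, e in zip((a, b, c, d), expected_counts(a, b, c, d))
    )
    # Survival function of the chi-square distribution with 1 df.
    return erfc(sqrt(stat / 2))


def compare(correct_1, correct_2, n):
    """Compare two correct-answer counts out of n questions each."""
    a, b, c, d = correct_1, n - correct_1, correct_2, n - correct_2
    smallest = min(expected_counts(a, b, c, d))
    if smallest < 5:
        return "fisher", fisher_exact_two_sided(a, b, c, d)
    if smallest <= 25:  # assumption: upper bound treated as inclusive
        return "yates", chi_square_p(a, b, c, d, yates=True)
    # Assumption: the footnote does not say what is used above 25.
    return "pearson", chi_square_p(a, b, c, d, yates=False)


# Total row, EN final attempt: ChatGPT-4o 215/220 vs Gemini 1.5 Pro 205/220.
print(compare(215, 205, 220))
# Cornea & anterior segment, TR: ChatGPT-4o 38/38 vs Gemini 1.5 Pro 36/38.
print(compare(38, 36, 38))
```

Under these assumptions, the first call selects the Yates-corrected test and the second selects Fisher's exact test. The resulting p values should come out close to the tabulated 0.039 and 0.493, respectively.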