2025 Aug 21;55(4):177–185. doi: 10.4274/tjo.galenos.2025.27895

Table 1. Interlingual and inter-model comparisons of chatbot performance on section questions across multiple attempts.

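Correct answer count and accuracy (%)

| Section | n | ChatGPT-4o TR, first | ChatGPT-4o TR, second | ChatGPT-4o TR, final | ChatGPT-4o EN, first | ChatGPT-4o EN, second | ChatGPT-4o EN, final | Gemini 1.5 Pro TR, first | Gemini 1.5 Pro TR, second | Gemini 1.5 Pro TR, final | Gemini 1.5 Pro EN, first | Gemini 1.5 Pro EN, second | Gemini 1.5 Pro EN, final |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Neuro-ophthalmology | 72 | 67 (93.1) | 68 (94.4) | 70 (97.2) | 66 (91.7) | 66 (91.7) | 70 (97.2) | 65 (90.3) | 65 (90.3) | 67 (93.1) | 64 (88.9) | 64 (88.9) | 64 (88.9) |
| Retina and vitreous | 23 | 22 (95.7) | 22 (95.7) | 23 (100) | 23 (100) | 23 (100) | 23 (100) | 22 (95.7) | 22 (95.7) | 22 (95.7) | 22 (95.7) | 22 (95.7) | 22 (95.7) |
| Glaucoma | 13 | 12 (92.3) | 12 (92.3) | 13 (100) | 13 (100) | 13 (100) | 13 (100) | 12 (92.3) | 12 (92.3) | 13 (100) | 12 (92.3) | 12 (92.3) | 13 (100) |
| Cornea & anterior segment | 38 | 38 (100) | 38 (100) | 38 (100) | 38 (100) | 38 (100) | 38 (100) | 36 (94.7) | 36 (94.7) | 36 (94.7) | 38 (100) | 38 (100) | 38 (100) |
| Pediatric ophthalmology and strabismus | 22 | 21 (95.5) | 21 (95.5) | 21 (95.5) | 21 (95.5) | 21 (95.5) | 21 (95.5) | 20 (90.9) | 20 (90.9) | 20 (90.9) | 20 (90.9) | 20 (90.9) | 20 (90.9) |
| Oculoplastics and ocular oncology | 39 | 37 (94.9) | 37 (94.9) | 37 (94.9) | 38 (97.4) | 38 (97.4) | 39 (100) | 36 (92.3) | 36 (92.3) | 37 (94.9) | 37 (94.9) | 37 (94.9) | 37 (94.9) |
| Uveitis | 13 | 12 (92.3) | 12 (92.3) | 12 (92.3) | 11 (84.6) | 11 (84.6) | 11 (84.6) | 11 (84.6) | 12 (92.3) | 12 (92.3) | 11 (84.6) | 11 (84.6) | 11 (84.6) |
| Total | 220 | 209 (95.0) | 210 (95.5) | 214 (97.3) | 210 (95.5) | 210 (95.5) | 215 (97.7) | 202 (91.8) | 203 (92.3) | 207 (94.1) | 204 (92.7) | 204 (92.7) | 205 (93.2) |

p* (each cell lists the values for the first / second / final attempts)

| Section | Inter-AI, TR | Inter-AI, EN | Interlingual, ChatGPT-4o | Interlingual, Gemini 1.5 Pro |
|---|---|---|---|---|
| Neuro-ophthalmology | 0.763 / 0.745 / 0.441 | 0.779 / 0.779 / 0.097 | >0.99 / 0.745 / >0.99 | >0.99 / 0.779 / 0.561 |
| Retina and vitreous | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 |
| Glaucoma | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 |
| Cornea & anterior segment | 0.493 / 0.493 / 0.493 | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 | 0.493 / 0.493 / 0.493 |
| Pediatric ophthalmology and strabismus | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 |
| Oculoplastics and ocular oncology | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / 0.494 | >0.99 / >0.99 / 0.494 | >0.99 / >0.99 / >0.99 |
| Uveitis | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 | >0.99 / >0.99 / >0.99 |
| Total | 0.249 / 0.233 / 0.159 | 0.312 / 0.312 / 0.039 | >0.99 / >0.99 / >0.99 | 0.858 / >0.99 / 0.845 |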

TR: Turkish; EN: English; AI: artificial intelligence; LLM: large language model. The AI-based LLMs are listed in alphabetical order. The p values of the interlingual and inter-AI comparisons are listed in order for the first, second, and final attempts. *If at least one of the expected frequencies in the fourfold (2×2) table was below 5, Fisher’s exact test was used; if it was between 5 and 25, Yates’ continuity-corrected chi-square test was used. p<0.05 was considered statistically significant at the 95% confidence level.
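The test-selection rule in the footnote can be illustrated with a minimal Python sketch. It rests on assumptions that the table does not state: that each comparison is a 2×2 table of correct versus incorrect counts for two conditions treated as independent samples, that Fisher's exact test is two-sided, and that an uncorrected Pearson chi-square applies when every expected frequency exceeds 25 (the footnote does not cover that case). All function names are hypothetical, and only the standard library is used.

```python
from math import comb, erfc, sqrt


def expected_counts(a, b, c, d):
    """Expected frequencies of the 2x2 table [[a, b], [c, d]] under independence."""
    n = a + b + c + d
    return [r * k / n for r in (a + b, c + d) for k in (a + c, b + d)]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test: sum the hypergeometric probabilities
    of all tables that are no more likely than the observed one."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))


def chi_square_p(a, b, c, d, yates):
    """Chi-square test with 1 degree of freedom, optionally Yates-corrected."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 df.
    return erfc(sqrt(stat / 2))


def compare_accuracy(correct_1, correct_2, n_questions):
    """Pick the test by the smallest expected frequency and return (test, p)."""
    a, b = correct_1, n_questions - correct_1
    c, d = correct_2, n_questions - correct_2
    smallest = min(expected_counts(a, b, c, d))
    if smallest < 5:
        return "fisher", fisher_exact_two_sided(a, b, c, d)
    if smallest <= 25:
        return "yates", chi_square_p(a, b, c, d, yates=True)
    # Assumption: this branch is not described in the footnote.
    return "pearson", chi_square_p(a, b, c, d, yates=False)


if __name__ == "__main__":
    # Total, EN, final attempt: ChatGPT-4o 215/220 vs Gemini 1.5 Pro 205/220.
    print(compare_accuracy(215, 205, 220))
    # Cornea & anterior segment, TR: ChatGPT-4o 38/38 vs Gemini 1.5 Pro 36/38.
    print(compare_accuracy(38, 36, 38))
```

Under these assumptions, the two calls select the Yates-corrected test and Fisher's exact test respectively, and the p values round to the 0.039 and 0.493 shown in the table. The agreement suggests the sketch is consistent with the reported analysis, but it does not confirm which software or test options were actually used.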