Table 1. Interlingual and inter-model comparisons of chatbot performance on section questions across multiple attempts.
Values are the correct answer count, with accuracy (%) in parentheses.

| Section | n | ChatGPT-4o TR, first | ChatGPT-4o TR, second | ChatGPT-4o TR, final | ChatGPT-4o EN, first | ChatGPT-4o EN, second | ChatGPT-4o EN, final | Gemini 1.5 Pro TR, first | Gemini 1.5 Pro TR, second | Gemini 1.5 Pro TR, final | Gemini 1.5 Pro EN, first | Gemini 1.5 Pro EN, second | Gemini 1.5 Pro EN, final | p* inter-AI, TR | p* inter-AI, EN | p* interlingual, ChatGPT-4o | p* interlingual, Gemini 1.5 Pro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Neuro-ophthalmology | 72 | 67 (93.1) | 68 (94.4) | 70 (97.2) | 66 (91.7) | 66 (91.7) | 70 (97.2) | 65 (90.3) | 65 (90.3) | 67 (93.1) | 64 (88.9) | 64 (88.9) | 64 (88.9) | 0.763, 0.745, 0.441 | 0.779, 0.779, 0.097 | >0.99, 0.745, >0.99 | >0.99, 0.779, 0.561 |
| Retina and vitreous | 23 | 22 (95.7) | 22 (95.7) | 23 (100) | 23 (100) | 23 (100) | 23 (100) | 22 (95.7) | 22 (95.7) | 22 (95.7) | 22 (95.7) | 22 (95.7) | 22 (95.7) | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 |
| Glaucoma | 13 | 12 (92.3) | 12 (92.3) | 13 (100) | 13 (100) | 13 (100) | 13 (100) | 12 (92.3) | 12 (92.3) | 13 (100) | 12 (92.3) | 12 (92.3) | 13 (100) | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 |
| Cornea & anterior segment | 38 | 38 (100) | 38 (100) | 38 (100) | 38 (100) | 38 (100) | 38 (100) | 36 (94.7) | 36 (94.7) | 36 (94.7) | 38 (100) | 38 (100) | 38 (100) | 0.493, 0.493, 0.493 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | 0.493, 0.493, 0.493 |
| Pediatric ophthalmology and strabismus | 22 | 21 (95.5) | 21 (95.5) | 21 (95.5) | 21 (95.5) | 21 (95.5) | 21 (95.5) | 20 (90.9) | 20 (90.9) | 20 (90.9) | 20 (90.9) | 20 (90.9) | 20 (90.9) | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 |
| Oculoplastics and ocular oncology | 39 | 37 (94.9) | 37 (94.9) | 37 (94.9) | 38 (97.4) | 38 (97.4) | 39 (100) | 36 (92.3) | 36 (92.3) | 37 (94.9) | 37 (94.9) | 37 (94.9) | 37 (94.9) | >0.99, >0.99, >0.99 | >0.99, >0.99, 0.494 | >0.99, >0.99, 0.494 | >0.99, >0.99, >0.99 |
| Uveitis | 13 | 12 (92.3) | 12 (92.3) | 12 (92.3) | 11 (84.6) | 11 (84.6) | 11 (84.6) | 11 (84.6) | 12 (92.3) | 12 (92.3) | 11 (84.6) | 11 (84.6) | 11 (84.6) | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 | >0.99, >0.99, >0.99 |
| Total | 220 | 209 (95.0) | 210 (95.5) | 214 (97.3) | 210 (95.5) | 210 (95.5) | 215 (97.7) | 202 (91.8) | 203 (92.3) | 207 (94.1) | 204 (92.7) | 204 (92.7) | 205 (93.2) | 0.249, 0.233, 0.159 | 0.312, 0.312, 0.039 | >0.99, >0.99, >0.99 | 0.858, >0.99, 0.845 |

TR: Turkish, EN: English, AI: artificial intelligence, LLM: large language model. The AI-based LLMs are listed in alphabetical order. The p values of the interlingual and inter-AI comparisons are listed in order for the first, second, and final attempts. *Fisher's exact test was used if at least one expected frequency in the 2×2 (fourfold) table was below 5, and Yates' continuity-corrected chi-square test was used if it was between 5 and 25. p<0.05 was considered statistically significant at the 95% confidence level.
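
The test-selection rule in the footnote can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names (`fisher_exact_two_sided`, `chi_square_p`, `compare_accuracy`) are hypothetical, and the sketch makes three assumptions: each comparison is a 2×2 table of correct and incorrect answers for two equally sized question sets, the rule is applied to the smallest expected frequency, and the uncorrected Pearson chi-square test is used when that frequency exceeds 25, a case the footnote does not cover.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    tail = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


def chi_square_p(a, b, c, d, yates):
    """Chi-square p value (1 degree of freedom), optionally Yates-corrected."""
    total = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - total / 2)
    chi2 = total * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


def compare_accuracy(correct_a, correct_b, n):
    """Return (test_name, p) for two correct-answer counts out of n questions each."""
    a, b = correct_a, n - correct_a
    c, d = correct_b, n - correct_b
    total = 2 * n
    # Smallest expected frequency = smallest row total * smallest column total / N.
    min_expected = n * min(a + c, b + d) / total
    if min_expected < 5:
        return "fisher", fisher_exact_two_sided(a, b, c, d)
    if min_expected <= 25:  # assumption: the upper bound of 25 is inclusive
        return "yates", chi_square_p(a, b, c, d, yates=True)
    # Assumption: the footnote does not say which test applies above 25.
    return "pearson", chi_square_p(a, b, c, d, yates=False)


# Total row, EN, final attempt: ChatGPT-4o 215/220 vs Gemini 1.5 Pro 205/220.
name, p = compare_accuracy(215, 205, 220)
print(name, round(p, 3))
```

Under these assumptions, the Total row's inter-AI comparison for the final English attempt selects the Yates-corrected test and gives p ≈ 0.039, in line with the table. The Cornea & anterior segment comparison in Turkish (38 vs 36 of 38) has a smallest expected frequency of 1, so it falls to Fisher's exact test.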