Table 3.
Multimodal Large Language Model Performance for Multiple-Choice (8 Retina Conditions) Identification of Diabetic Retinopathy (Prompt 1)
Model | Accuracy | Sensitivity | Specificity | Sensitivity (DR) | Sensitivity (No DR) | Specificity (DR) | Specificity (No DR) |
---|---|---|---|---|---|---|---|
ChatGPT 4o | 0.566 (0.506–0.624)†,‡ | 0.641 (0.557–0.659)†,‡ | 0.779 (0.733–0.825)∗ | 0.456 (0.383–0.525) | 0.826 (0.732–0.904) | 0.881 (0.795–0.941) | 0.677 (0.609–0.745) |
Claude Sonnet 3.5 | 0.608 (0.547–0.668)§,|| | 0.618 (0.564–0.676)§,|| | 0.636 (0.568–0.689)§,|| | 0.594 (0.517–0.668) | 0.641 (0.522–0.752) | 0.663 (0.545–0.774) | 0.608 (0.527–0.682) |
Gemini 1.5 Pro | 0.339 (0.283–0.399) | 0.447 (0.422–0.478) | 0.771 (0.726–0.808) | 0.177 (0.127–0.226) | 0.717 (0.613–0.817) | 0.978 (0.940–1.000) | 0.563 (0.491–0.636) |
Perplexity Llama 3.1/Default | 0.369 (0.311–0.433) | 0.476 (0.438–0.514) | 0.758 (0.722–0.785) | 0.212 (0.158–0.273) | 0.739 (0.635–0.830) | 0.935 (0.879–0.977) | 0.581 (0.507–0.649) |
DR = diabetic retinopathy; MLLM = multimodal large language model.
Accuracy, mean multiclass sensitivity and specificity, and per-class sensitivity and specificity (DR vs. no DR) are presented for each MLLM. The MLLMs were given the following choices: age-related macular degeneration, diabetic retinopathy, hypertensive retinopathy, sickle cell retinopathy, post-traumatic retinopathy, retinal vein occlusion, normal retina, and other pathology. Means are presented with 95% confidence intervals. Significant differences (P < 0.0006, per Bonferroni correction) between models for overall accuracy, sensitivity, and specificity are denoted as follows: ∗ChatGPT vs. Claude; †ChatGPT vs. Gemini; ‡ChatGPT vs. Perplexity; §Claude vs. Gemini; ||Claude vs. Perplexity.
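The tabulated metrics follow standard confusion-matrix definitions: per-class sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP), with the multiclass values obtained by averaging across classes. The Python below is a minimal illustrative sketch (not the authors' code) of how these quantities and 95% confidence intervals could be computed; it assumes a Wilson score interval for the CIs, and the counts in the example are made up, so the paper's exact CI method and data may differ.

```python
# Illustrative sketch: per-class and macro-averaged sensitivity/specificity
# from a multiclass confusion matrix, with assumed Wilson 95% CIs.
import numpy as np


def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion (one assumed CI method)."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * np.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - margin, center + margin)


def per_class_metrics(conf: np.ndarray) -> dict:
    """Sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP) for each class,
    treating that class as 'positive' and all other classes as 'negative'."""
    total = conf.sum()
    out = {}
    for k in range(conf.shape[0]):
        tp = conf[k, k]
        fn = conf[k, :].sum() - tp
        fp = conf[:, k].sum() - tp
        tn = total - tp - fn - fp
        out[k] = {
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "sensitivity_ci": wilson_ci(tp, tp + fn),
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
            "specificity_ci": wilson_ci(tn, tn + fp),
        }
    return out


# Made-up 2x2 example (DR vs. no DR); rows are true labels, columns predictions.
conf = np.array([[40, 10],
                 [5, 45]])
metrics = per_class_metrics(conf)
accuracy = np.trace(conf) / conf.sum()
macro_sens = np.mean([m["sensitivity"] for m in metrics.values()])
macro_spec = np.mean([m["specificity"] for m in metrics.values()])
print(f"accuracy={accuracy:.3f}  mean sens={macro_sens:.3f}  mean spec={macro_spec:.3f}")
```

For the pairwise model comparisons in the footnote, a Bonferroni correction divides the nominal significance level by the number of comparisons performed, which is how a threshold such as P < 0.0006 arises.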