2025 Aug 12;6(1):100911. doi: 10.1016/j.xops.2025.100911

Table 3.

Multimodal Large Language Model Performance for Multiple-Choice (8 Retina Conditions) Identification of Diabetic Retinopathy (Prompt 1)

| Model | Accuracy | Sensitivity | Specificity | Sensitivity (DR) | Sensitivity (No DR) | Specificity (DR) | Specificity (No DR) |
|---|---|---|---|---|---|---|---|
| ChatGPT 4o | 0.566 (0.506–0.624)†,‡ | 0.641 (0.557–0.659)†,‡ | 0.779 (0.733–0.825)∗ | 0.456 (0.383–0.525) | 0.826 (0.732–0.904) | 0.881 (0.795–0.941) | 0.677 (0.609–0.745) |
| Claude Sonnet 3.5 | 0.608 (0.547–0.668)§,‖ | 0.618 (0.564–0.676)§,‖ | 0.636 (0.568–0.689)§,‖ | 0.594 (0.517–0.668) | 0.641 (0.522–0.752) | 0.663 (0.545–0.774) | 0.608 (0.527–0.682) |
| Gemini 1.5 Pro | 0.339 (0.283–0.399) | 0.447 (0.422–0.478) | 0.771 (0.726–0.808) | 0.177 (0.127–0.226) | 0.717 (0.613–0.817) | 0.978 (0.940–1.000) | 0.563 (0.491–0.636) |
| Perplexity Llama 3.1/Default | 0.369 (0.311–0.433) | 0.476 (0.438–0.514) | 0.758 (0.722–0.785) | 0.212 (0.158–0.273) | 0.739 (0.635–0.830) | 0.935 (0.879–0.977) | 0.581 (0.507–0.649) |

DR = diabetic retinopathy; MLLM = multimodal large language model.

Accuracy, mean multiclass sensitivity and specificity, and per-class sensitivity and specificity (DR vs. no DR) are presented for each MLLM. Multimodal large language models were given the following choices: age-related macular degeneration, diabetic retinopathy, hypertensive retinopathy, sickle cell retinopathy, post-traumatic retinopathy, retinal vein occlusion, normal retina, and other pathology. Means are presented with 95% confidence intervals. Significant differences (P < 0.0006, per Bonferroni correction) between models for overall accuracy, sensitivity, and specificity are denoted as follows: ∗ChatGPT vs. Claude; †ChatGPT vs. Gemini; ‡ChatGPT vs. Perplexity; §Claude vs. Gemini; ‖Claude vs. Perplexity.
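To illustrate how per-class and mean multiclass sensitivity and specificity of the kind reported in this table can be derived, the sketch below computes them from a multiclass confusion matrix. This is not the authors' code; the function name and the toy counts are hypothetical, and it assumes rows of the confusion matrix are true classes and columns are predicted classes.

```python
import numpy as np

def per_class_sensitivity_specificity(cm):
    """Per-class sensitivity (recall) and specificity from a square
    multiclass confusion matrix (rows = true class, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correctly identified cases of each class
    fn = cm.sum(axis=1) - tp         # cases of the class predicted as something else
    fp = cm.sum(axis=0) - tp         # other classes predicted as this class
    tn = cm.sum() - (tp + fn + fp)   # everything not involving this class
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Toy 3-class example (e.g., DR, normal retina, other pathology); counts are made up.
cm = np.array([[45, 30,  5],
               [ 8, 70, 12],
               [ 4, 10, 16]])
sens, spec = per_class_sensitivity_specificity(cm)
print("per-class sensitivity:", sens.round(3))
print("per-class specificity:", spec.round(3))
print("mean multiclass sensitivity:", sens.mean().round(3))   # macro average
print("mean multiclass specificity:", spec.mean().round(3))
```

Under this convention, the "Sensitivity (DR)" and "Specificity (DR)" columns correspond to the DR row of the per-class vectors, while the "Sensitivity" and "Specificity" columns correspond to the macro averages across all eight answer choices.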