Table 3.
Multimodal Large Language Model Performance for Multiple-Choice (8 Retina Conditions) Identification of Diabetic Retinopathy (Prompt 1)
Model | Accuracy | Sensitivity | Specificity | Sensitivity (DR) | Sensitivity (No DR) | Specificity (DR) | Specificity (No DR) |
---|---|---|---|---|---|---|---|
ChatGPT 4o | 0.566 (0.506–0.624)†,‡ | 0.641 (0.557–0.659)†,‡ | 0.779 (0.733–0.825)∗ | 0.456 (0.383–0.525) | 0.826 (0.732–0.904) | 0.881 (0.795–0.941) | 0.677 (0.609–0.745) |
Claude Sonnet 3.5 | 0.608 (0.547–0.668)§,|| | 0.618 (0.564–0.676)§,|| | 0.636 (0.568–0.689)§,|| | 0.594 (0.517–0.668) | 0.641 (0.522–0.752) | 0.663 (0.545–0.774) | 0.608 (0.527–0.682) |
Gemini 1.5 Pro | 0.339 (0.283–0.399) | 0.447 (0.422–0.478) | 0.771 (0.726–0.808) | 0.177 (0.127–0.226) | 0.717 (0.613–0.817) | 0.978 (0.940–1.000) | 0.563 (0.491–0.636) |
Perplexity Llama 3.1/Default | 0.369 (0.311–0.433) | 0.476 (0.438–0.514) | 0.758 (0.722–0.785) | 0.212 (0.158–0.273) | 0.739 (0.635–0.830) | 0.935 (0.879–0.977) | 0.581 (0.507–0.649) |
DR = diabetic retinopathy; MLLM = multimodal large language model.
Accuracy, mean multiclass sensitivity and specificity, and per-class sensitivity and specificity (DR vs. no DR) are presented for each MLLM. The MLLMs were given the following choices: age-related macular degeneration, diabetic retinopathy, hypertensive retinopathy, sickle cell retinopathy, post-traumatic retinopathy, retinal vein occlusion, normal retina, and other pathology. Means are presented with 95% confidence intervals. Significant differences (P < 0.0006, per Bonferroni correction) between models for overall accuracy, sensitivity, and specificity are denoted as follows: ∗ChatGPT vs. Claude; †ChatGPT vs. Gemini; ‡ChatGPT vs. Perplexity; §Claude vs. Gemini; ||Claude vs. Perplexity.
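The tabulated metrics follow standard confusion-matrix definitions: per-class sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP), with the multiclass values obtained by averaging across classes. The Python below is a minimal illustrative sketch (not the authors' code) of how these quantities and 95% confidence intervals could be computed; it assumes a Wilson score interval for the CIs, and the counts in the example are made up, so the paper's exact CI method and data may differ.

```python
# Illustrative sketch: per-class and macro-averaged sensitivity/specificity
# from a multiclass confusion matrix, with assumed Wilson 95% CIs.
import numpy as np


def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion (one assumed CI method)."""
    if total == 0:
        return (0.0, 0.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = z * np.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (center - margin, center + margin)


def per_class_metrics(conf: np.ndarray) -> dict:
    """Sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP) for each class,
    treating that class as 'positive' and all other classes as 'negative'."""
    total = conf.sum()
    out = {}
    for k in range(conf.shape[0]):
        tp = conf[k, k]
        fn = conf[k, :].sum() - tp
        fp = conf[:, k].sum() - tp
        tn = total - tp - fn - fp
        out[k] = {
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "sensitivity_ci": wilson_ci(tp, tp + fn),
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
            "specificity_ci": wilson_ci(tn, tn + fp),
        }
    return out


# Made-up 2x2 example (DR vs. no DR); rows are true labels, columns predictions.
conf = np.array([[40, 10],
                 [5, 45]])
metrics = per_class_metrics(conf)
accuracy = np.trace(conf) / conf.sum()
macro_sens = np.mean([m["sensitivity"] for m in metrics.values()])
macro_spec = np.mean([m["specificity"] for m in metrics.values()])
print(f"accuracy={accuracy:.3f}  mean sens={macro_sens:.3f}  mean spec={macro_spec:.3f}")
```

For the pairwise model comparisons in the footnote, a Bonferroni correction divides the nominal significance level by the number of comparisons performed, which is how a threshold such as P < 0.0006 arises.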