Table 5.
Multimodal Large Language Model Performance for Grading Diabetic Retinopathy Disease Severity (Prompts 3.1 and 3.2)
| MLLM | Prompt 3.1: Accuracy | Prompt 3.1: Sensitivity | Prompt 3.1: Specificity | Prompt 3.2: Accuracy | Prompt 3.2: Sensitivity | Prompt 3.2: Specificity |
|---|---|---|---|---|---|---|
| ChatGPT 4o | 0.395 (0.338–0.454) | 0.229 (0.214–0.269) | 0.866 (0.861–0.882)∗ | 0.372 (0.315–0.432)† | 0.277 (0.231–0.299) | 0.863 (0.860–0.874)∗,† |
| Claude Sonnet 3.5 | 0.314 (0.258–0.375) | 0.184 (0.173–0.189)§ | 0.840 (0.836–0.842)‡,§ | 0.272 (0.213–0.330) | 0.168 (0.163–0.176) | 0.834 (0.832–0.838) |
| Gemini 1.5 Pro | 0.392 (0.329–0.453) | 0.224 (0.201–0.232) | 0.858 (0.850–0.866) | — | — | — |
| Perplexity Llama 3.1/Default | 0.411 (0.351–0.473) | 0.247 (0.228–0.269) | 0.864 (0.855–0.871) | 0.217 (0.164–0.273) | 0.181 (0.117–0.219) | 0.831 (0.829–0.834) |
MLLM = multimodal large language model.
Accuracy, sensitivity, and specificity (95% confidence interval) are provided for each MLLM. Prompt 3.1 named the ETDRS grading criteria, whereas Prompt 3.2 provided the full ETDRS grading criteria. Prompt 3.1 yielded the best results on average. Gemini was incompatible with Prompt 3.2 because it did not accept multiple image uploads per query; therefore, these results are not shown. Significant differences (P < 0.0006, per Bonferroni correction) in accuracy, sensitivity, and specificity between models are denoted as follows: ∗ChatGPT vs. Claude; †ChatGPT vs. Perplexity; ‡Claude vs. Gemini; §Claude vs. Perplexity.