Abstract
Recent advancements in generative artificial intelligence have enabled joint analysis of text and visual data, which could have important implications in healthcare. Diagnosis in ophthalmology is often based on a combination of ocular examination and clinical context. The aim of this study was to evaluate the performance of multimodal GPT-4 (GPT-4 V) in an integrated analysis of ocular images and clinical text. This retrospective study included 40 patients seen in our institution with images of their ocular examinations. Cases were selected by a board-certified ophthalmologist to represent various pathologies. We provided the model with each patient’s image, first without and then with the clinical context. We also asked two non-ophthalmology physicians to write diagnoses for each image, first without and then with the clinical context. Answers from both GPT-4 V and the non-ophthalmologists were evaluated by two board-certified ophthalmologists, and performance accuracies were calculated and compared. GPT-4 V provided the correct diagnosis in 19/40 (47.5%) cases based on images without clinical context, and in 27/40 (67.5%) cases when clinical context was provided. The non-ophthalmologist physicians provided the correct diagnosis in 24/40 (60.0%) and 23/40 (57.5%) of cases without clinical context, and in 29/40 (72.5%) and 27/40 (67.5%) with clinical context. For all study participants, adding context improved accuracy (p = 0.033). GPT-4 V is currently able to simultaneously analyze and integrate visual and textual data and arrive at accurate clinical diagnoses in the majority of cases. Multimodal large language models like GPT-4 V have significant potential to advance both patient care and research in ophthalmology.
Keywords: GPT, LLMs, Large language models, AI, Ophthalmology, Multimodal algorithms
Subject terms: Medical research, Eye diseases
Introduction
Artificial intelligence (AI) has made remarkable advancements in healthcare, and particularly in ophthalmology, where images are central to diagnosis. Deep learning algorithms have demonstrated promising performance in analyzing ocular images1. For example, AI applications have been developed for the screening of strabismus2, diagnosis of keratoconus and ectatic corneal disease3, and cataract classification4. Most previous applications were trained to analyze only a single image type, mostly retinal or optic nerve images5,6, and did not integrate clinical context.
Large Language Models (LLMs) such as GPT-4 by OpenAI have shown impressive abilities in free-text analysis and generation across different healthcare tasks7,8, and in ophthalmology specifically9–12. Patient history and complaints have been established as a major component of patient diagnosis13,14. However, visual data, such as the physical examination, imaging tests and pathology, are often critical in patient evaluation15.
Ophthalmology in particular relies significantly on the integration of the clinical context, physical examination and imaging16. Multimodal models, such as GPT-4 with vision capabilities (GPT-4 V), represent a paradigm shift by enabling simultaneous analysis of visual and textual data.
The aim of our study was to evaluate the performance of GPT-4 V in diagnosing ocular conditions based on external eye photographs, both with and without supplemental clinical context.
Methods
Study design
Institutional review board (IRB) approval for this study was granted by Sheba Medical Center (0143-23-SMC). This was a retrospective study assessing the diagnostic performance of multimodal GPT-4 (GPT-4 V, with image analysis capability) on ocular images against two non-ophthalmologist physicians, without and then with supplemental clinical context. The GPT-4 V model was accessed on November 3rd. Two radiology residents were each presented with the same set of cases to provide diagnoses.
Data collection and diagnostic procedure
A series of 40 anonymized ocular images was curated, representing a spectrum of ocular conditions (Table 1). Retinal and optic nerve pathologies were not included. The photos were obtained with a phone camera or a slit lamp microscope camera, and only external images of the eyes were included. We did not include fundus photography or other ocular imaging technologies such as optical coherence tomography (OCT), ocular ultrasound, fluorescein angiography or radiology images in this analysis. Cases were selected by a board-certified ophthalmologist to represent various pathologies and to match the level of ophthalmology residents. Each participant (GPT-4 V and the human physicians) separately received these images in two sequences: initially without and subsequently with added clinical context.
Table 1.
Ocular pathologies of patients (n = 40).
| Category | Pathologies | No. (%) |
|---|---|---|
| Ocular tumors | Nevus of Ota, iris melanoma, iris nevus, iris cysts, amelanotic conjunctival nevus | 6 (15) |
| Cornea, anterior segment and external eye diseases | Corneal abscess, herpes zoster keratitis, viral conjunctivitis, bacterial conjunctivitis, pingueculitis, subconjunctival hemorrhage, corneal lattice dystrophy, cataract | 13 (32.5) |
| Traumatic/mechanic | Eyelid laceration, canalicular trauma, corneal foreign body causing corneal penetration, ocular perforation, lens subluxation, corneal erosion | 7 (17.5) |
| Iatrogenic/Surgical complications | Argentinean flag, Ahmed glaucoma valve | 2 (5) |
| Eyelid and lacrimal system related | Ptosis, hordeolum, eyelid pustule, lower eyelid melanoma, dacryocystitis | 7 (17.5) |
| Orbital | Globe luxation, dacryops, thyroid-eye disease, orbital cellulitis | 5 (12.5) |
Some diagnoses appeared twice.
Both GPT-4 V and the physicians were asked to render a diagnosis for each image, first without any clinical context and then with additional clinical information. Clinical context included age, symptoms and relevant medical history. All interactions with GPT-4 V were conducted through the OpenAI web interface, and each inquiry, with and without clinical context, was initiated in a distinct dialogue instance to ensure independence of responses.
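The study itself queried GPT-4 V through the web interface; for readers who want to replicate the two-pass protocol programmatically, the same independence between passes can be enforced by building each request from a fresh message list. The sketch below is illustrative only: the prompt wording, model name, and image URLs are placeholders, not the study’s actual prompts.

```python
# Sketch of the two-pass querying protocol used in the study, expressed as
# standalone chat-completion request payloads. Prompt wording and model
# name are placeholders, not the study's actual prompts.

def build_request(image_url, clinical_context=None):
    """Build a standalone chat request for one case. Each call starts a
    fresh message list, so the two passes share no conversation state."""
    prompt = "Provide the most likely ophthalmic diagnosis for this image."
    if clinical_context:
        prompt += f" Clinical context: {clinical_context}"
    return {
        "model": "gpt-4-vision-preview",  # placeholder model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }

# Pass 1: image alone. Pass 2: image plus context, in a separate request.
first = build_request("https://example.org/case01.jpg")
second = build_request("https://example.org/case01.jpg",
                       "An 82 year old man arrived for a routine examination")
```

Because each payload is constructed from scratch, no information from the context-free pass can leak into the context-enriched pass, mirroring the separate dialogue instances used in the study.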
Specific prompts used were:
Outcome measures
The primary metric was the accuracy of diagnoses, expressed as a percentage of correct identifications. A qualitative analysis of GPT-4 V answers was also performed. All diagnostic responses from GPT-4 V and the participating physicians were evaluated for accuracy in consensus by two board-certified ophthalmologists. An additional evaluation based on class agreement analysis by anatomical region was performed. For this analysis, each case was categorized into a specific anatomical region, and the agreement with the ground truth, as determined by the board-certified ophthalmologists, was calculated both with and without clinical context.
Statistical analysis
Statistical computations were conducted using IBM SPSS Statistics for Windows, version 24.0. Fisher’s exact test was used to compare the performance of GPT-4 V and the physicians, and to compare overall accuracy with and without context. P-values less than 0.05 were considered statistically significant.
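Fisher’s exact test on the resulting 2 × 2 tables can be reproduced without statistical software. The sketch below implements the usual two-sided convention (summing the hypergeometric probabilities of all tables, with the same margins, that are no more probable than the observed one). The comparison groupings are our reading of Table 2, with the two physicians pooled against GPT-4 V and all readers pooled for the context comparison; under those assumptions the results should land close to the reported p = 0.251 and p = 0.033.

```python
# Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]],
# implemented from scratch with exact integer binomial coefficients.
from math import comb

def fisher_exact_2x2(a, b, c, d):
    row1 = a + b
    col1, n = a + c, a + b + c + d
    total = comb(n, row1)  # all ways to fill row 1 given the margins
    def p_table(k):        # P(cell a = k) under the hypergeometric model
        return comb(col1, k) * comb(n - col1, row1 - k) / total
    p_obs = p_table(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    eps = 1e-12            # tolerance for floating-point ties
    return sum(p_table(k) for k in range(lo, hi + 1)
               if p_table(k) <= p_obs + eps)

# GPT-4 V (19/40 correct) vs. the two physicians pooled (47/80), images alone:
p_no_context = fisher_exact_2x2(19, 21, 47, 33)
# All readers pooled: correct without (66/120) vs. with (83/120) context:
p_context_gain = fisher_exact_2x2(66, 54, 83, 37)
```

The context comparison falls below 0.05 while the GPT-4 V vs. physicians comparison does not, matching the direction of the paper’s conclusions.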
Results
Study cohort and pathological variance
The study cohort comprised a diverse array of 40 ocular conditions presented to the AI model and non-ophthalmologist physicians for diagnosis. Mean age of patients included was 54.4 ± 23.2 years. The pathologies included are detailed in Table 1.
Diagnostic accuracy without and with clinical context
The diagnostic accuracy of GPT-4 V based on images alone was 47.5% (19/40). In comparison, Physician 1 achieved an accuracy of 60.0% (24/40) under the same conditions. Physician 2 correctly identified 57.5% (23/40) of cases (Table 2).
Table 2.
Comparisons between AI and human readers with and without clinical context.
| | Accuracy without context (%) | P-value GPT-4 vs. humans without context | Accuracy with context (%) | P-value GPT-4 vs. humans with context |
|---|---|---|---|---|
| GPT-4 | 19/40 (47.5) | 0.251 | 27/40 (67.5) | 0.688 |
| Physician 1 | 24/40 (60.0) | | 29/40 (72.5) | |
| Physician 2 | 23/40 (57.5) | | 27/40 (67.5) | |
When clinical context was included, GPT-4 V’s diagnostic accuracy improved to 67.5% (27/40), while Physician 1 reached 72.5% (29/40) and Physician 2 reached 67.5% (27/40). There was no statistically significant difference between GPT-4 V’s and the physicians’ diagnostic accuracy (Table 2). Overall, for all study readers, adding context improved accuracy (p = 0.033), as shown in Fig. 1.
Fig. 1.
Diagnostic accuracies of multimodal GPT-4 and non-ophthalmologist physicians without and with clinical context provided for images.
Qualitative analysis of GPT responses
Cases in which GPT-4 V was initially wrong but correctly revised its diagnosis once clinical context was provided included: nevus of Ota, dacryocystitis, Argentinean flag, herpes zoster pseudo-dendrite, thyroid eye disease, iris nevus, corneal foreign body, and ocular perforation. Context-enriched answers showed deeper diagnostic reasoning, blending clinical history with visual findings. Example cases are detailed in Figs. 2 and 3.
Fig. 2.
An example case where GPT-4 provided the correct diagnosis when clinical context was provided. GPT-4’s output without clinical context: “This is an image of a human eye displaying a condition known as corneal opacity. Key findings include a clouding of the cornea, which is discernible as irregular, darkened areas against the lighter, normal corneal tissue. The clarity of the cornea is compromised, which is a hallmark of this condition. The localization and pattern of the opacities can give insights into the underlying cause, which might include infection, injury, or underlying ocular disease.” GPT-4’s output with the clinical context (An 82 year old man arrived for a routine examination): “The image depicts a close-up view of a human eye with the following findings: The cornea appears clear. The iris has heterochromia with a lighter peripheral ring, likely representing arcus senilis given the patient’s age. There is a presence of multiple dark pigmented lesions on the iris, suggesting iris nevi or possibly iris melanoma. No obvious cataract is visible from this view. Given the clinical context of an 82-year-old man at a routine examination and the presence of multiple pigmented lesions on the iris, a differential diagnosis would include iris nevi, which are usually benign. However, the possibility of iris melanoma must be considered due to the patient’s age and the lesions’ characteristics. It is imperative to correlate these findings with intraocular pressure measurements, a full examination of the ocular fundus, and additional imaging if necessary to rule out glaucoma and assess for any signs of malignancy such as secondary glaucomatous changes or signs of growth in the nevi. In conclusion, the image findings are suggestive of iris nevi, with a differential diagnosis of iris melanoma, requiring further ophthalmological evaluation and monitoring.”
Fig. 3.
An example case where GPT-4 was completely wrong, without and with clinical context. GPT-4’s output without clinical context: “The image shows bilateral periorbital ecchymosis and edema, commonly referred to as “raccoon eyes.” This clinical sign is often associated with basilar skull fractures. However, it can also be seen in conditions like amyloidosis or as a result of severe facial trauma. No further detailed ocular abnormalities can be ascertained from the image. A detailed clinical correlation and further evaluation are essential for a definitive diagnosis.” GPT-4’s output with the clinical context (A 63 year old man complained of a runny nose, redness and discharge from both eyes with a foreign body sensation in both eyes for the past several days): “The image shows bilateral periorbital edema and ecchymosis (raccoon eyes), chemosis (swelling of the conjunctiva), and conjunctival injection (redness). Given the clinical context of a runny nose, redness, discharge from both eyes, and a foreign body sensation for several days, this is suggestive of orbital cellulitis with possible sinusitis due to the proximity of symptoms related to the nose. The severity and bilateral nature are concerning for potential complications such as cavernous sinus thrombosis, given the “raccoon eyes” sign. Immediate evaluation by an ophthalmologist and possibly a radiologist for imaging would be warranted to assess the extent of infection and rule out any complications.”
When analyzing the results per anatomical region, regions that are readily visible in external photographs, such as the cornea and eyelids, had high class agreement. In contrast, regions that can only be indirectly inferred from external images, such as the lacrimal system and orbit, had relatively lower class agreement. A detailed breakdown of class agreement by anatomical region is provided in Table 3.
Table 3.
Class agreement based on anatomical regions.
| Anatomical region | Number of cases | Class agreement without context | % | Class agreement with context | % |
|---|---|---|---|---|---|
| Eyelid | 9 | 5 | 55.5 | 7 | 77.7 |
| Conjunctiva | 8 | 4 | 50.0 | 4 | 50.0 |
| Cornea | 9 | 5 | 55.5 | 9 | 100 |
| Orbit | 4 | 2 | 50.0 | 2 | 50.0 |
| Lacrimal system | 3 | 0 | 0 | 1 | 33.3 |
| Anterior segment | 7 | 3 | 42.8 | 4 | 57.1 |
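The percentages in Table 3 follow directly from the agreement counts. A minimal sketch recomputing them is given below; the data layout is ours, with values copied from the table. Note that the table truncates rather than rounds to one decimal place (5/9 is shown as 55.5, not 55.6), which the helper reproduces.

```python
# Recompute Table 3's class-agreement percentages from the raw counts.
# The dict layout is illustrative; the values are copied from the table.
regions = {
    # region: (cases, agree_without_context, agree_with_context)
    "Eyelid":           (9, 5, 7),
    "Conjunctiva":      (8, 4, 4),
    "Cornea":           (9, 5, 9),
    "Orbit":            (4, 2, 2),
    "Lacrimal system":  (3, 0, 1),
    "Anterior segment": (7, 3, 4),
}

def pct(agree, cases):
    """Agreement percentage, truncated (not rounded) to one decimal
    place, matching the table's convention (5/9 -> 55.5)."""
    return int(agree / cases * 1000) / 10

for region, (cases, without_ctx, with_ctx) in regions.items():
    print(f"{region}: {pct(without_ctx, cases)}% -> {pct(with_ctx, cases)}%")
```

The region counts sum to the full cohort of 40 cases, so every case contributes to exactly one anatomical category.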
Discussion
This study evaluated multimodal GPT-4 for clinical diagnosis in ophthalmology based on patient ocular images and clinical context. There are several important findings: (1) GPT-4 V showed capability for ocular image analysis, correctly identifying 47.5% (19/40) of cases based on images alone. (2) When clinical context was added, the accuracy of the model improved to 67.5% (27/40). (3) GPT-4 V performance on ophthalmology cases was comparable to that of non-ophthalmology physicians.
A multimodal algorithm that synergizes clinical text with images signals a groundbreaking advance in medical image analysis. The integration of visual and textual data mirrors the human decision-making process. This is relevant to all medical specialties, but most prominent in those that rely on pattern recognition, such as ophthalmology, dermatology, radiology and pathology. These are also the specialties at the forefront of AI applications in healthcare, with various algorithms available and under evaluation for medical image processing1,17. With the added value of textual analysis, LLMs may ultimately surpass current algorithms that analyze images only18. Although there are a few recent publications on multimodal GPT-4 V, to the best of our knowledge, our study is the first to evaluate actual patient cases instead of relying on images available in medical question repositories19.
In this study, both GPT-4 V’s and the physicians’ performance improved with clinical context. This reinforces the long-standing medical principle that clinical history is key to accurate diagnosis13. Consequently, the potential of multimodal LLMs in ophthalmology is vast. With improvement, such an algorithm could serve as a decision support tool for physicians in diagnosis and management planning. It could also be used in research for cohort generation, enabling the creation of large datasets that pair textual data with image findings. A multimodal LLM might also significantly impact education in ophthalmology.
There is a lack of education and training for medical students and primary care physicians on the initial management of basic ophthalmic cases20. This is supported by the results of our study, in which the non-ophthalmology physicians achieved diagnostic accuracies ranging between 57.5 and 72.5%. An educational tool that provides detailed explanations of ocular examination and imaging findings could improve basic understanding and recognition of ophthalmic pathologies. Patient education in this field is also lacking: many patients already seek initial information from online, often unreliable, sources, which might lead to harm. A restricted model focused on patient education might integrate patient-taken external images with a short patient-submitted history to provide supervised, case-specific information and recommended initial management.
Despite their tremendous potential, there are challenges with multimodal application of LLMs in ophthalmology. First, images that cannot be anonymized, such as full-face photos, pose significant privacy issues. To address this, accessing the model would require strict security protocols. Furthermore, these models can be susceptible to cyber threats, including adversarial attacks21. In addition, the models can potentially perpetuate bias in healthcare based on data from images, such as skin color or gender22.
This study has several limitations. First, this was a retrospective analysis of subjectively chosen cases, with potential for selection bias: cases were intentionally curated by a board-certified ophthalmologist to represent a spectrum of common external eye conditions encountered at the resident level, rather than through random sampling. Second, this was a proof-of-concept study with a small sample size. Third, the study included only external eye photographs and excluded imaging modalities such as OCT, fundoscopy, and radiological imaging, which limited the representation of diseases requiring specialized imaging and may have influenced overall performance. Finally, due to patient privacy concerns, we inserted only tightly cropped images focused on the eyes to preserve anonymity. This may have influenced the model’s diagnostic abilities.
To conclude, GPT-4 V at its current stage is not yet suitable for clinical application in ophthalmology. Nonetheless, its ability to simultaneously analyze and integrate visual and textual data is promising. Multimodal large language models like GPT-4 V have significant potential to advance patient care, education and research in ophthalmology. Future studies involving diverse imaging modalities and randomized case selection are essential to provide a more comprehensive evaluation of these models, their strengths, and limitations.
Author contributions
V.S. and E.K. conceived the study; VS drafted the manuscript; N.K., I.H., O.Z., H.B. and Y.B. were involved in the acquisition and annotation of data; B.S.G., N.K., I.H., O.Z., G.N.N., A.L. and E.K. contributed to the interpretation of data and critically revised the manuscript; E.K. oversaw the study. All authors read and approved the final manuscript.
Data availability
Data are available from the corresponding authors, VS or AL, upon reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Consent for publication
All authors approved the submission.
Ethical approval
A Sheba Medical Center institutional board approval (IRB) was granted to this study (0143-23-SMC).
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Ting, D. S. W. et al. Artificial intelligence and deep learning in ophthalmology. Br. J. Ophthalmol. 103, 167–175 (2019).
- 2. Wu, D. et al. An artificial intelligence platform for the screening and managing of strabismus. Eye 38, 3101–3107 (2024).
- 3. Ambrósio, R. et al. Multimodal diagnostics for keratoconus and ectatic corneal diseases: A paradigm shift. Eye Vis. 10 (2023).
- 4. Zhou, Y., Li, G. & Li, H. Automatic cataract classification using deep neural network with discrete state transition. IEEE Trans. Med. Imaging 39, 436–446 (2020).
- 5. Li, B. et al. The performance of a deep learning system in assisting junior ophthalmologists in diagnosing 13 major fundus diseases: A prospective multi-center clinical trial. Npj Digit. Med. 7 (2024).
- 6. Dong, L. et al. Artificial intelligence for screening of multiple retinal and optic nerve diseases. JAMA Netw. Open 5, e229960 (2022).
- 7. Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. arXiv preprint, 10.48550/arXiv.2303.13375 (2023).
- 8. Li, J., Dada, A., Kleesiek, J. & Egger, J. ChatGPT in healthcare: A taxonomy and systematic review. medRxiv, 10.1101/2023.03.30.23287899 (2023).
- 9. Temsah, M.-H., Altamimi, I., Jamal, A., Alhasan, K. & Al-Eyadhy, A. ChatGPT surpasses 1000 publications on PubMed: Envisioning the road ahead. Cureus, 10.7759/cureus.44769 (2023).
- 10. Delsoz, M. et al. The use of ChatGPT to assist in diagnosing glaucoma based on clinical case reports. Ophthalmol. Ther. 12, 3121–3132 (2023).
- 11. Hu, X. et al. What can GPT-4 do for diagnosing rare eye diseases? A pilot study. Ophthalmol. Ther. 12, 3395–3402 (2023).
- 12. Pushpanathan, K. et al. Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries. iScience 26, 108163 (2023).
- 13. Wang, M. Y., Asanad, S., Asanad, K., Karanjia, R. & Sadun, A. A. Value of medical history in ophthalmology: A study of diagnostic accuracy. J. Curr. Ophthalmol. 30, 359–364 (2018).
- 14. Hampton, J. R., Harrison, M. J., Mitchell, J. R., Prichard, J. S. & Seymour, C. Relative contributions of history-taking, physical examination, and laboratory investigation to diagnosis and management of medical outpatients. BMJ 2, 486–489 (1975).
- 15. Larson, D. B. & Langlotz, C. P. The role of radiology in the diagnostic process: Information, communication, and teamwork. Am. J. Roentgenol. 209, 992–1000 (2017).
- 16. Golnik, K. C., Goldenhar, L. M., Gittinger, J. W. & Lustbader, J. M. The ophthalmic clinical evaluation exercise (OCEX). Ophthalmology 111, 1271–1274 (2004).
- 17. Soffer, S. et al. Convolutional neural networks for radiologic images: A radiologist’s guide. Radiology 290, 590–606 (2019).
- 18. Sorin, V. et al. Diagnostic accuracy of GPT multimodal analysis on USMLE questions including text and visuals. medRxiv (2023).
- 19. Yan, Z. et al. Multimodal ChatGPT for medical applications: An experimental study of GPT-4V. arXiv preprint, 10.48550/arXiv.2310.19061 (2023).
- 20. Quillen, D. A., Harper, R. A. & Haik, B. G. Medical student education in ophthalmology: Crisis and opportunity. Ophthalmology 112, 1867–1868 (2005).
- 21. Sorin, V. et al. Adversarial attacks in radiology—a systematic review. Eur. J. Radiol. 167, 111085 (2023).
- 22. Sorin, V. & Klang, E. Artificial intelligence and health care disparities in radiology. Radiology 301, E443 (2021).