2024 Oct 10;45(3):301–306. doi: 10.1097/WNO.0000000000002274

ChatGPT Assisting Diagnosis of Neuro-Ophthalmology Diseases Based on Case Reports

Yeganeh Madadi 1, Mohammad Delsoz 1, Priscilla A Lao 1, Joseph W Fong 1, T J Hollingsworth 1, Malik Y Kahook 1, Siamak Yousefi 1
PMCID: PMC12341750  PMID: 40830998

Abstract

Background:

To evaluate the accuracy of Chat Generative Pre-Trained Transformer (ChatGPT), a large language model (LLM), to assist in diagnosing neuro-ophthalmic diseases based on case reports.

Methods:

We selected 22 case reports of neuro-ophthalmic disorders from a publicly available online database. These cases covered a wide range of chronic and acute diseases commonly seen by neuro-ophthalmologists. We inserted each case as a new prompt into ChatGPT (GPT-3.5 and GPT-4) and asked for the most likely diagnosis. We then presented the same information to 2 neuro-ophthalmologists, recorded their diagnoses, and compared their responses with those of both versions of ChatGPT.

Results:

GPT-3.5, GPT-4, and the 2 neuro-ophthalmologists were correct in 13 (59%), 18 (82%), 19 (86%), and 19 (86%) of 22 cases, respectively. The agreements between the various diagnostic sources were as follows: GPT-3.5 and GPT-4, 13 (59%); GPT-3.5 and the first neuro-ophthalmologist, 12 (55%); GPT-3.5 and the second neuro-ophthalmologist, 12 (55%); GPT-4 and the first neuro-ophthalmologist, 17 (77%); GPT-4 and the second neuro-ophthalmologist, 16 (73%); and the first and second neuro-ophthalmologists, 17 (77%).

Conclusions:

The accuracy of GPT-3.5 and GPT-4 in diagnosing patients with neuro-ophthalmic disorders was 59% and 82%, respectively. With further development, GPT-4 may have the potential to be used in clinical care settings to assist clinicians in providing accurate diagnoses. The applicability of LLMs such as ChatGPT in clinical settings that lack access to subspecialty-trained neuro-ophthalmologists deserves further research.


Neuro-ophthalmology, a unique intersection of neurology and ophthalmology, focuses on disorders involving the visual pathways, visual processing, and ocular motor control.1 This subspecialty demands an extensive synthesis of clinical acumen and neuroimaging, often dealing with vision- and life-threatening conditions.2 Despite the critical role neuro-ophthalmologists play, the field faces a persistent shortage of specialists, contributing to diagnostic delays and patient morbidity. As of 2022, adequate neuro-ophthalmic coverage exists in only 8 US states, with median wait times exceeding 6 weeks, indicating a dire need for innovative diagnostic solutions.3

Recent advancements in artificial intelligence (AI) and machine learning (ML) have shown promise in various medical domains, including ophthalmology.4,5 Emerging large language models (LLMs) such as GPT-4 (OpenAI LLC, San Francisco, CA) represent a significant leap in natural language processing (NLP) capabilities, offering potential applications in medical diagnostics.6–8 Although previous research has demonstrated the efficacy of AI in ophthalmology,9–12 the utility of LLMs such as GPT-4 in neuro-ophthalmology has not been adequately investigated. This study aims to evaluate the diagnostic accuracy of GPT-4 compared with experienced neuro-ophthalmologists by analyzing detailed patient case descriptions.

METHODS

Case Collection

We used cases from the Department of Ophthalmology and Visual Sciences at the University of Iowa's publicly accessible database (https://webeye.ophth.uiowa.edu/eyeforum/cases.html). We selected 22 patients with various common and uncommon neuro-ophthalmic disorders including “Acute Demyelinating Encephalomyelitis (ADEM) with Associated Optic Neuritis,” “Optic Nerve Drusen,” “Optic Nerve Hypoplasia,” “Optic Neuritis,” “Cranial Nerve IV (Trochlear Nerve) Palsy,” “Dorsal Midbrain Syndrome (Parinaud's Syndrome),” “Traumatic Optic Neuropathy,” “Ethambutol Toxicity and Optic Neuropathy,” “Posterior Ischemic Optic Neuropathy,” “Recurrent Neuroretinitis,” “Horner's Syndrome due to Cluster Headache,” “Horner Syndrome due to Ipsilateral Internal Carotid Artery Dissection,” “Idiopathic Intracranial Hypertension (Pseudotumor Cerebri),” “Idiopathic Orbital Myositis,” “Leber Hereditary Optic Neuropathy,” “Miller Fisher Syndrome,” “Myasthenia Gravis,” “Neurofibromatosis Type 1—Optic Nerve Glioma,” “Thyroid Eye Disease (Graves' ophthalmopathy),” “Visual Snow Syndrome,” and “Unilateral Optic Nerve Hypoplasia.” Descriptions of each case included patient demographics, history of the presenting illness, chief complaint, relevant medical or ocular history, and examination findings.

Our local institutional review board (IRB) office advised that no IRB approval was required for this study because we used a publicly available dataset containing no patient information and the study adhered to the Declaration of Helsinki’s principles.

ChatGPT for Diagnosis

We fed each case description into ChatGPT (GPT-3.5 and GPT-4) and asked whether the model could provide a diagnosis. We specifically asked: “What is the most probable diagnosis?” (Figs. 1 and 2). We then evaluated the accuracy of each model in making a correct diagnosis. Because ChatGPT may learn from previous interactions within a single chat session, we recorded all responses based on our first inquiry and opened a new chat session whenever we asked the same question more than once.

FIG. 1. A sample case description of a patient with idiopathic intracranial hypertension (pseudotumor cerebri) input into the GPT-3.5 model and the corresponding response.

FIG. 2. A sample case description of a patient with idiopathic intracranial hypertension (pseudotumor cerebri) input into the GPT-4 model and the corresponding response.

In addition, we provided the same 22 cases, in a masked manner, to 2 neuro-ophthalmologists with at least 1 year of postfellowship experience and asked them to make a provisional diagnosis. The neuro-ophthalmologists were not allowed to use ChatGPT or similar tools for making diagnoses. We computed the frequency of correct diagnoses by the ChatGPT models and the 2 neuro-ophthalmologists and compared the accuracies. We also evaluated the agreement between the ChatGPT models and the 2 neuro-ophthalmologists. We asked the 2 neuro-ophthalmologists to assess the clinical accuracy of the responses by ChatGPT and flag any fabricated or irrelevant information. We developed a checklist of critical information expected in each case for specific scenarios within neuro-ophthalmology to prevent omission of important content. Furthermore, we applied ethical guidelines to assess the potential risk of harm and bias in the generated content; this included review of the ChatGPT responses by our expert neuro-ophthalmologists. We also analyzed the ChatGPT responses for contextual biases, examining the data for demographic biases with a focus on ensuring equitable representation and accuracy across diverse patient populations.

RESULTS

GPT-3.5 and GPT-4 made the correct diagnosis in 13 (59%) and 18 (82%) of 22 cases, respectively, while both neuro-ophthalmologists correctly diagnosed 19 (86%) of 22 cases. Table 1 presents the details of the diagnoses provided by GPT-3.5, GPT-4, and the 2 neuro-ophthalmologists.

TABLE 1.

Diagnoses provided by GPT-3.5, GPT-4, and 2 neuro-ophthalmologists

Correct Diagnosis of the Case | GPT-3.5 Diagnosis | GPT-4 Diagnosis | Neuro-Ophthalmologist 1 Diagnosis | Neuro-Ophthalmologist 2 Diagnosis
--- | --- | --- | --- | ---
Demyelinating encephalomyelitis (ADEM) with associated optic neuritis | Disseminated encephalomyelitis (ADEM) | Neuromyelitis optica spectrum disorder (NMOSD) | Optic neuritis due to meningitis | Viral meningitis
Optic nerve drusen | Papilledema, specifically due to idiopathic intracranial hypertension (IIH) | Optic disc drusen | Optic disc drusen | Optic disc drusen
Optic nerve hypoplasia | Congenital glaucoma | Mature cataract and optic atrophy in the left eye, and exotropia | Optic nerve hypoplasia | Optic nerve hypoplasia
Optic neuritis | Optic neuritis | Optic neuritis | Optic neuritis due to multiple sclerosis (MS) | Optic neuritis; MS
Optic nerve hypoplasia | Septo-optic dysplasia (SOD) | Optic nerve hypoplasia (ONH) | de Morsier Syndrome | SOD with optic nerve hypoplasia
Cranial nerve IV (trochlear nerve) palsy | Fourth nerve palsy | Fourth nerve palsy (trochlear nerve palsy) | Traumatic CN IV palsy OD | Traumatic right fourth nerve palsy
Dorsal midbrain syndrome (Parinaud syndrome) | Superior oblique myokymia | Parinaud syndrome (dorsal midbrain syndrome) | Dorsal midbrain syndrome | Dorsal midbrain syndrome
Traumatic optic neuropathy | Traumatic optic neuropathy (TON) | TON | TON secondary to orbital canal fracture | TON
Ethambutol toxicity and optic neuropathy | Ethambutol-induced optic neuropathy | Toxic optic neuropathy, most likely related to the use of ethambutol | Ethambutol toxicity/toxic optic neuropathy | Toxic (ethambutol) optic neuropathy
Posterior ischemic optic neuropathy | Posterior ischemic optic neuropathy (PION) | Perioperative ischemic optic neuropathy (PION) | PION | PION
Recurrent neuroretinitis | Multiple sclerosis (MS) | Optic neuritis (ON) | Optic neuritis | Optic neuritis
Horner syndrome due to cluster headache | Cluster headache, specifically the variant known as “cluster-tic syndrome” or “cluster migraine syndrome” | Cluster headache (CH), specifically with a Horner syndrome | Horner syndrome secondary to cluster headaches | Cluster headache
Horner syndrome due to ipsilateral internal carotid artery dissection | Carotid artery dissection | Horner syndrome related to a carotid artery dissection | Horner syndrome due to carotid dissection | Cervical carotid artery dissection causing Horner syndrome
Idiopathic intracranial hypertension (pseudotumor cerebri) | Idiopathic intracranial hypertension (IIH), also known as pseudotumor cerebri | Idiopathic intracranial hypertension (IIH), also known as pseudotumor cerebri | IIH | IIH
Idiopathic orbital myositis | Graves ophthalmopathy, also known as thyroid eye disease | Idiopathic orbital inflammatory syndrome (IOIS), also known as orbital myositis | Orbital myositis | Thyroid eye disease
Leber hereditary optic neuropathy | Idiopathic intracranial hypertension (IIH) | Papilledema | LHON | LHON
Miller Fisher syndrome | Miller Fisher syndrome | Miller Fisher syndrome (MFS) | MFS/Guillain-Barré | MFS
Myasthenia gravis | Myasthenia gravis (MG) | Myasthenia gravis (MG) | MG | Ocular MG
Neurofibromatosis Type 1—optic nerve glioma | Optic pathway glioma | Neurofibromatosis Type 1 (NF1) | Optic nerve glioma | Optic pathway glioma
Thyroid eye disease (Graves ophthalmopathy) | Thyroid eye disease (Graves ophthalmopathy) | Thyroid eye disease (TED), also known as Graves ophthalmopathy | TED | TED with compressive optic neuropathy
Visual snow syndrome | Persistent migraine aura without infarction (PMA) | Visual snow syndrome (VSS) | Visual snow | Visual snow syndrome
Unilateral optic nerve hypoplasia | Optic nerve hypoplasia (ONH) | Optic nerve hypoplasia (ONH) | Sensory exotropia secondary to optic nerve hypoplasia | Optic nerve hypoplasia

The agreements between the various diagnostic sources were as follows: GPT-3.5 and GPT-4, 13 (59%); GPT-3.5 and the first neuro-ophthalmologist, 12 (55%); GPT-3.5 and the second neuro-ophthalmologist, 12 (55%); GPT-4 and the first neuro-ophthalmologist, 17 (77%); GPT-4 and the second neuro-ophthalmologist, 16 (73%); and the first and second neuro-ophthalmologists, 17 (77%).
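The pairwise agreement rates above are the fraction of the 22 cases on which two sources gave the same diagnosis. As an illustrative sketch (not the authors' actual code), agreement can be computed from expert-adjudicated, canonicalized diagnosis labels per case; the source names and toy labels below are hypothetical:

```python
from itertools import combinations

def pairwise_agreement(diagnoses):
    """Compute the fraction of cases on which each pair of sources agrees.

    diagnoses maps each source name to its per-case diagnosis labels,
    assumed to be canonicalized by expert adjudication beforehand
    (e.g., "IIH" and "pseudotumor cerebri" mapped to one label).
    """
    results = {}
    for a, b in combinations(diagnoses, 2):
        matches = sum(x == y for x, y in zip(diagnoses[a], diagnoses[b]))
        results[(a, b)] = matches / len(diagnoses[a])
    return results

# Hypothetical 4-case example (the study used 22 cases):
toy = {
    "gpt4": ["IIH", "ON", "TON", "MG"],
    "neuro_ophth_1": ["IIH", "ON", "TON", "MFS"],
    "neuro_ophth_2": ["IIH", "ON", "TON", "MG"],
}
rates = pairwise_agreement(toy)
```

Note that this measures raw agreement only; chance-corrected statistics such as Cohen's kappa would require the same per-case labels plus the marginal label frequencies.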

DISCUSSION

We prospectively investigated the diagnostic capability of ChatGPT in neuro-ophthalmic disorders using 22 cases from the University of Iowa database. GPT-3.5, GPT-4, and the 2 neuro-ophthalmologists correctly diagnosed 13, 18, 19, and 19 cases, respectively. Despite the complexity of neuro-ophthalmic cases, GPT-4's diagnostic performance was comparable to that of the neuro-ophthalmologists.

The agreement between GPT-3.5 and neuro-ophthalmologists was 55%–59%, while GPT-4 had a higher agreement of 73%–77%, similar to the 77% concordance between the neuro-ophthalmologists. This indicates that GPT-4 has a reasonable diagnostic accuracy, although understanding its limitations is crucial for effective utilization in clinical practice.

A recent study13 showed that ChatGPT correctly answered 43% of neuro-ophthalmology practice questions, while another14 found that GPT-3.5 and GPT-4 had accuracies of 10%–32% and 40%–48%, respectively, on questions from the Basic and Clinical Science Course and the OphthoQuestions online question bank. These results differ from ours owing to different methodologies, highlighting the need for context-specific evaluations.

Advances in LLMs, including ChatGPT, may benefit ophthalmology in general and neuro-ophthalmology in particular by enabling accurate diagnoses through readily accessible models. Neuro-ophthalmic issues result from a complex interaction between the central nervous system and the visual apparatus, necessitating a comprehensive evaluation to identify subtle distinctions.15

Another significant advantage of ChatGPT and comparable LLMs is that they can actively learn through reinforcement learning, thereby correcting previous errors, enhancing performance over time, and becoming more precise. However, concerns about active learning and HIPAA compliance must be addressed, since clinical AI tools need to operate within healthcare environments. AI models such as ChatGPT that are used in clinical practice must operate behind secure healthcare organization firewalls to ensure patient data confidentiality and regulatory adherence.

Our study revealed a lower-than-expected concordance between the 2 neuro-ophthalmologists, who agreed on 77% of the cases. This variability underscores the inherent complexity of neuro-ophthalmic diagnoses. It also highlights the potential role of AI in providing consistent and objective second opinions, which could improve diagnostic accuracy and support clinicians in making more informed decisions.

While we included patients with various neuro-ophthalmic conditions, our study does have limitations. First, we used cases from an online database to investigate ChatGPT's capabilities, raising the concern that ChatGPT may have previously been exposed to this database. To address this issue, we examined the years in which these cases were added to the database and found that the majority were added before September 2021, when the most recent ChatGPT training concluded. In addition, we used the Memorization Effects Levenshtein Detector (MELD) technique16 to analyze whether ChatGPT had potentially been trained on the online dataset. MELD provides a quantitative measure of memorization by comparing the similarity between dataset examples and the content generated by the models. Through this analysis, we sought to identify and exclude any data points with high similarity scores, indicative of memorization. We thus split each case report in half, showed the first half to ChatGPT, and asked ChatGPT to generate the second half. We then compared the generated second half to the original second half using MELD. Under the MELD technique, a similarity score of ≥95% indicates that the test content was most likely part of the ChatGPT training materials. The MELD scores of all our cases were <95%, suggesting that our online dataset was most likely not part of the initial ChatGPT training. Moreover, GPT-3.5 and GPT-4 made errors in 9 and 4 cases, respectively, further alleviating concerns that the models had memorized this database. Second, we evaluated ChatGPT on only 22 cases; future studies should evaluate ChatGPT on a larger number of cases to confirm our findings. However, the selected dataset, although compact, was meticulously curated to include a wide variety of cases representing the breadth and complexity of clinical scenarios encountered in neuro-ophthalmology. This ensured that our analysis covered a broad spectrum of the field's diagnostic challenges and therapeutic interventions. Another limitation of LLMs such as ChatGPT is their lack of image interpretation, which limits their applicability in more complex cases involving information from various imaging modalities. In addition, although these were real-world cases, they were published primarily for teaching purposes. Several of these cases exhibited classic presentations of diseases with “buzzword” descriptions. Therefore, this subset of cases may not fully reflect the reality of neuro-ophthalmology practice, in which borderline findings and confounding factors are common. Conversely, some cases exhibited atypical presentations of the correct diagnosis. This is where the art of medicine comes into play, requiring a human practitioner to weigh these factors appropriately.
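The MELD-style memorization check described above (split a case report in half, ask the model to continue the first half, and compare its continuation to the true second half with a normalized Levenshtein similarity against the 95% threshold) can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; `generate_continuation` is a hypothetical stand-in for the actual model call:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            # Minimum of deletion, insertion, and substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def meld_similarity(original, generated):
    """Normalized similarity: 1 - edit distance / length of the longer string."""
    if not original and not generated:
        return 1.0
    return 1.0 - levenshtein(original, generated) / max(len(original), len(generated))

def flag_memorized(case_text, generate_continuation, threshold=0.95):
    """Split the case report in half, ask the model to continue the first half,
    and flag the case if the continuation is near-identical to the real second half."""
    mid = len(case_text) // 2
    first_half, second_half = case_text[:mid], case_text[mid:]
    generated = generate_continuation(first_half)  # hypothetical model API call
    return meld_similarity(second_half, generated) >= threshold
```

A case whose continuation reproduces the original second half almost verbatim would score ≥0.95 and be excluded as likely memorized; the paper reports that all 22 cases scored below this threshold.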

In summary, the introduction of LLMs such as ChatGPT can be a potentially revolutionary step forward in augmenting clinical diagnostics, independent of the availability of subspecialty-trained clinicians. This is especially relevant in fields such as neuro-ophthalmology, which is characterized by an abundance of complicated cases. GPT-4 appears capable of diagnosing neuro-ophthalmic cases when provided with structured data in the case report format, with a level of accuracy comparable to that of experienced neuro-ophthalmologists. Reassessing the performance of ChatGPT and other LLMs under similar testing conditions but with multimodal capabilities would be a natural next step once such algorithms become available. LLMs may help bridge the gap between AI and human competence. Future research should explore how technological innovations, such as enhanced LLMs, can augment medical decision-making and assist clinicians in enhancing patient outcomes.

STATEMENT OF AUTHORSHIP

Conception and design: Y. Madadi, M. Delsoz; Acquisition of data: Y. Madadi, M. Delsoz, P. A. Lao, J. W. Fong; Analysis and interpretation of data: Y. Madadi, M. Delsoz, P. A. Lao, J. W. Fong; Drafting the manuscript: Y. Madadi; Revising the manuscript for intellectual content: Y. Madadi, M. Delsoz, P. A. Lao, J. W. Fong, T. J. Hollingsworth, M. Y. Kahook, S. Yousefi; Final approval of the completed manuscript: Y. Madadi, M. Delsoz, P. A. Lao, J. W. Fong, T. J. Hollingsworth, M. Y. Kahook, S. Yousefi.

Footnotes

Supported by NIH Grant R01EY033005 (S. Yousefi). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The authors report no conflicts of interest.

Posted history: This manuscript was previously posted to medRxiv: doi: 10.1101/2023.09.13.23295508.

Contributor Information

Yeganeh Madadi, Email: ymadadi@uthsc.edu.

Mohammad Delsoz, Email: mdelsoz@uthsc.edu.

Priscilla A. Lao, Email: plao@uthsc.edu.

Joseph W. Fong, Email: jfong2@uthsc.edu.

T. J. Hollingsworth, Email: thollin1@uthsc.edu.

Malik Y. Kahook, Email: malik.kahook@cuanschutz.edu.

REFERENCES

1. Levin LA, Nilsson SF, Ver Hoeve J, et al. Adler's Physiology of the Eye. Edinburgh, London: Elsevier Health Sciences; 2011.
2. Stunkel L, Sharma RA, Mackay DD, et al. Patient harm due to diagnostic error of neuro-ophthalmologic conditions. Ophthalmology. 2021;128:1356–1362.
3. DeBusk A, Subramanian PS, Scannell Bryan M, Moster ML, Calvert PC, Frohman LP. Mismatch in supply and demand for neuro-ophthalmic care. J Neuroophthalmol. 2022;42:62–67.
4. Esteva A, Robicquet A, Ramsundar B, et al. A guide to deep learning in healthcare. Nat Med. 2019;25:24–29.
5. Madadi Y, Abu-Serhan H, Yousefi S. Domain adaptation-based deep learning model for forecasting and diagnosis of glaucoma disease. Biomed Signal Process Control. 2024;92:106061.
6. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–1901.
7. Huang X, Raja H, Madadi Y, et al. Reply to comment on: predicting glaucoma before onset using a large language model chatbot. Am J Ophthalmol. 2024;266:322–323.
8. Delsoz M, Madadi Y, Munir WM, et al. Performance of ChatGPT in diagnosis of corneal eye diseases. medRxiv. 2023:2023.08.25.23294635.
9. Kapoor R, Walters SP, Al-Aswad LA. The current state of artificial intelligence in ophthalmology. Surv Ophthalmol. 2019;64:233–240.
10. Li J-PO, Liu H, Ting DS, et al. Digital technology, telemedicine and artificial intelligence in ophthalmology: a global perspective. Prog Retin Eye Res. 2021;82:100900.
11. Li Z, Wang L, Wu X, et al. Artificial intelligence in ophthalmology: the path to the real-world clinic. Cell Rep Med. 2023;4:101095.
12. Madadi Y, Delsoz M, Khouri AS, Boland M, Grzybowski A, Yousefi S. Applications of artificial intelligence-enabled robots and chatbots in ophthalmology: recent advances and future trends. Curr Opin Ophthalmol. 2024;35:238–243.
13. Mihalache A, Popovic MM, Muni RH. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. 2023;141:589–597.
14. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the performance of ChatGPT in ophthalmology: an analysis of its successes and shortcomings. Ophthalmol Sci. 2023;3:100324.
15. Liu GT, Volpe NJ, Galetta SL. Neuro-Ophthalmology: Diagnosis and Management. Edinburgh, London: Elsevier Health Sciences; 2010.
16. Fowler T, Pullen S, Birkett L. Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions. Br J Ophthalmol. 2023;108:1379–1383.

Articles from Journal of Neuro-Ophthalmology are provided here courtesy of Wolters Kluwer Health
