Abstract
Amblyopia is a neurodevelopmental disorder that reduces children’s visual acuity and requires early diagnosis for effective treatment. Traditional diagnostic methods rely on subjective evaluation of eye-tracking recordings from high-fidelity eye-tracking instruments by specialized pediatric ophthalmologists, who are often unavailable in rural, low-resource clinics. There is therefore an urgent need for a scalable, low-cost, high-accuracy approach to automatically analyze eye-tracking recordings. Large language models (LLMs) show promise for accurate detection of amblyopia; our prior work showed that the Google Gemini model, guided by expert ophthalmologists, can distinguish amblyopic from control subjects using eye-tracking recordings. However, there is a clear need to address issues of transparency and trust in medical applications of LLMs. To bolster the reliability and interpretability of LLM analysis of eye-tracking recordings, we developed a Feature-Guided Interpretative Prompting (FGIP) framework focused on critical clinical features. Using the Google Gemini model, we classify high-fidelity eye-tracking data to detect amblyopia in children and apply the Quantus framework to evaluate the classification results across key metrics (faithfulness, robustness, localization, and complexity). These metrics provide a quantitative basis for understanding the model’s decision-making process. This work presents the first implementation of an Explainable Artificial Intelligence (XAI) framework to systematically characterize the results generated by the Gemini model on high-fidelity eye-tracking data for detecting amblyopia in children. The results demonstrate that the model accurately classified control and amblyopic subjects, including those with nystagmus, while maintaining transparency and clinical alignment. These findings support the development of a scalable and interpretable clinical decision support (CDS) tool based on LLMs with the potential to enhance the trustworthiness of AI applications.
1. Introduction
Amblyopia, commonly known as “lazy eye”, is a significant challenge in pediatric ophthalmology that affects millions of children worldwide [1]. The condition, characterized by reduced vision in one eye, usually develops from birth to age seven due to improper visual development, often because the brain favors the other eye [2, 3]. If untreated, amblyopia can result in permanent visual impairment [3]. Traditional diagnostic methods for amblyopia rely heavily on subjective visual assessments and the specialized expertise of pediatric ophthalmologists, which can delay diagnosis and intervention and may not always provide an accurate assessment due to their reliance on patient cooperation. Early detection and treatment are crucial to prevent long-term visual deficits, and there is an increasing demand for objective, scalable, and interpretable solutions that can overcome the limitations of traditional methods to provide timely and accurate diagnoses. The integration of advanced technologies into diagnostic practices offers the potential to significantly enhance early detection and treatment outcomes for amblyopia. Eye movement recordings, for instance, have shown significant promise in detecting amblyopia through the detailed analysis of parameters such as fixation duration, saccadic velocity, and other ocular metrics [4]. Recent advancements in generative artificial intelligence have opened new avenues for the early diagnosis and treatment of various medical conditions, including amblyopia [5, 6]. Specifically, large language models (LLMs) have emerged as powerful tools capable of transforming clinical decision support (CDS) systems by providing more precise, efficient, and scalable solutions, which has direct relevance to amblyopia detection.
In previous work [6], we introduced an approach that utilized LLMs to analyze multi-view eye movement data for the classification of amblyopia, demonstrating the feasibility of applying LLMs to biomedical data analysis and leveraging their transfer learning and few-shot learning capabilities to achieve high diagnostic accuracy. The multi-view prompting framework incorporated eye movement recordings under various viewing conditions and enhanced the model’s ability to capture nuanced differences between amblyopic patients and control subjects. While our initial study achieved promising results, it highlighted a significant limitation inherent in the use of LLMs for CDS systems, namely the “black box” nature of these models. The lack of transparency in the decision-making process poses a substantial barrier to clinical adoption of AI models, as healthcare professionals require interpretable and explainable models in order to trust and effectively integrate AI tools into CDS systems [7]. Without clear insights into how the model arrived at its conclusions, clinicians are hesitant to rely on its recommendations, limiting the practical utility of such AI systems.
In this study, we address the pressing issue of transparency by incorporating an eXplainable AI (XAI) framework into our existing LLM-enabled CDS system for amblyopia detection. Specifically, we apply the Quantus explainability framework to rigorously assess the model’s outputs across four key dimensions: Faithfulness, Robustness, Localization, and Complexity [8]. This comprehensive evaluation ensures that the model’s decisions remain accurate, reliable, and clearly interpretable, fostering trust and potentially greater acceptance of AI-enabled tools among clinicians. By prioritizing explainability, we address a critical obstacle to AI adoption in clinical practice. Moreover, by improving the interpretability of LLM-generated outputs, we demonstrate how existing XAI frameworks can be implemented in a real-world CDS system in pediatric ophthalmology.
2. Background
2.1 Fixation Eye Movement Recordings for Early Detection of Amblyopia
Amblyopia develops during the critical period of visual development when disruptions such as strabismus (misalignment of the eyes), anisometropia (unequal refractive errors), or visual deprivation from conditions like cataracts prevent normal use of one eye [2]. Untreated amblyopia can result in reduced visual acuity, impaired binocular vision, and diminished depth perception, affecting a child’s educational performance, social interactions, and overall quality of life [9]. Eye-tracking technology offers a non-invasive and objective method for assessing visual and neurological function by recording and analyzing eye movement patterns with high precision [4, 10]. Individuals with amblyopia exhibit distinct eye movement abnormalities, including decreased fixation stability, increased saccadic intrusions, and nystagmus [11, 12]. For example, amblyopic eyes often show greater variability in fixation, reflecting deficits in oculomotor control that can be quantitatively measured. These objective measures have the potential to serve as reliable biomarkers for amblyopia, enabling earlier detection and monitoring of treatment efficacy. However, the complexity of eye movement data necessitates advanced analytical methods capable of extracting meaningful patterns associated with the disorder. Machine learning (ML) algorithms have been applied to classify eye movement patterns, but they often require extensive feature engineering and large datasets, which may not be practical in clinical settings [4].
2.2 Large Language Models in Clinical Decision Support
LLMs have demonstrated state-of-the-art performance in natural language processing (NLP) tasks, and they have shown promise in interpreting complex, multimodal datasets [13-17]. Their ability to perform few-shot learning allows them to adapt to new tasks with minimal training data, making them suitable for medical applications where large, labeled datasets may be scarce [13]. Recent studies detail the foundations for using these models in various CDS systems, showing that LLMs can analyze unstructured data, identify patterns, and generate insights that support clinical decision making [18]. In our prior study, we applied the Gemini LLM to analyze multi-view eye movement recordings, introducing a multi-view prompting framework that presented the model with data from various viewing conditions [6]. This approach enhanced the model’s ability to discern subtle differences in eye movements between amblyopic patients and control subjects, achieving high accuracy. The success of this method demonstrated the potential of LLMs in processing complex biomedical data for diagnostic purposes. Despite promising results, the lack of transparency in how AI models arrive at their conclusions poses ethical and practical concerns, highlighting the increasing importance of explainability and transparency in AI applications within healthcare [7, 19]. This work aims to improve upon prior use of LLMs for the diagnosis of amblyopia by leveraging an explainable AI framework, Quantus.
2.3 The Feature-Guided Interpretative Prompting (FGIP) Framework
The FGIP framework integrates explainability into the LLM workflow by embedding clinically important diagnostic features, such as eye stability, scale variability, and misalignment, directly into the prompt design. This structure guides the model toward explanations that align with clinical reasoning and meet healthcare professionals’ needs for transparency. FGIP transforms the LLM’s output from a “black box” system into an interpretable CDS tool while maintaining or improving accuracy. By focusing on expert-defined features through structured prompts, the framework ensures that outputs remain relevant and actionable in pediatric ophthalmology. An additional benefit of FGIP is its ability to incorporate few-shot learning examples [13]. This allows the model to adapt its reasoning from a limited set of illustrative cases, further refining interpretability. Compared to other XAI approaches, FGIP provides domain-specific rigor by mirroring expert clinical pathways, fostering trust, and facilitating acceptance of AI-supported clinical decisions.
2.4 Explainability Assessment
A range of tools and frameworks address explainability challenges in machine learning, including LIME (Local Interpretable Model-Agnostic Explanations) [20], SHAP (SHapley Additive exPlanations) [21], counterfactual explanations [22], and Saliency Maps [23]. While these methods offer insights into model behavior, they typically function as modular solutions rather than comprehensive frameworks [24]. LIME builds local surrogate models to approximate individual predictions [20], SHAP leverages cooperative game theory to provide feature importance scores [21], counterfactual explanations suggest minimal input changes to yield alternative model outcomes [22], and Saliency Maps highlight the input regions most critical for a prediction [23]. However, these techniques often lack breadth in covering multiple interpretability dimensions, such as robustness, faithfulness, and complexity—vital factors in healthcare. Integrated frameworks such as AIX360 [25], InterpretML [26], ALIBI [27], and Quantus [8] aim to address these limitations by combining multiple interpretability methods [19]. AIX360 (AI Explainability 360) provides a set of post-hoc explanation algorithms and metrics [25], though it primarily focuses on general-purpose machine learning and does not emphasize healthcare-specific evaluation. InterpretML offers interpretable models like Explainable Boosting Machines (EBMs) and post-hoc tools including SHAP and LIME [26], but it does not cover multidimensional metrics essential for medical applications. ALIBI, designed for production-level systems, includes methods such as counterfactual and contrastive explanations [27] but does not include built-in evaluations of robustness or faithfulness. Quantus distinguishes itself by offering a multidimensional suite of metrics to assess faithfulness, robustness, localization, and complexity [8, 28]. These dimensions are particularly relevant for clinical AI systems, where reliability and transparency are essential [29]. Faithfulness ensures that explanations accurately reflect the model’s decision-making, robustness tests the stability of explanations under minor perturbations to input data, localization measures the relevance of input features to the model’s predictions [8], and complexity evaluates whether generated explanations are understandable [28]. Given these capabilities, Quantus was chosen for this study to meet the specific requirements of CDS applications in healthcare.
XAI seeks to make AI models’ decision-making processes transparent, enabling users to understand, trust, and effectively manage AI outputs [7]. By applying the Quantus framework to the LLM’s outputs, we aim to enhance interpretability and provide clinicians with clear insights into the factors influencing each classification. Quantus provides a comprehensive toolkit for explainability, covering multiple dimensions [8]:
Faithfulness measures how well the explanations provided by the model align with its internal decision-making processes, ensuring that the rationale behind each prediction is consistent with the model’s learned features.
Robustness evaluates the stability of the model’s explanations in response to minor perturbations in the input data.
Localization assesses the model’s ability to correctly identify and focus on relevant input features that influence its predictions. This is crucial for understanding which aspects of the eye movement data are most significant in the model’s decision-making process.
Complexity analyzes the simplicity of the model’s explanations, ensuring they are accessible and comprehensible to human users. Simpler explanations are often preferred as they provide clearer insights into the model’s reasoning.
Axiomatic properties ensure consistency and logical coherence. This metric evaluates whether explanations adhere to formalized properties (such as monotonicity or relevance) to provide a theoretical foundation for assessing the reliability of interpretability methods, particularly in critical applications.
Randomization tests the robustness of the explanations. This measure evaluates how explanations degrade when the data labels or model parameters are progressively randomized and assesses whether explanations genuinely reflect the underlying model behavior or are artifacts of randomness.
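For orientation, the sketch below shows how metrics from these Quantus categories are typically invoked. It assumes a hypothetical differentiable PyTorch classifier (`model`) over rendered eye-movement plots, with batches `x_batch`, `y_batch`, and ground-truth masks `s_batch`; in this study the same dimensions are adapted to characterize LLM-generated explanations rather than CNN saliency maps.

```python
# Minimal sketch of Quantus metric evaluation (assumes a trained PyTorch
# image classifier `model` and batches of rendered eye-movement plots;
# all variable names are illustrative).
import quantus

metrics = {
    "faithfulness": quantus.FaithfulnessCorrelation(nr_runs=100, subset_size=224),
    "robustness": quantus.MaxSensitivity(nr_samples=10),
    "localization": quantus.PointingGame(),   # needs ground-truth masks
    "complexity": quantus.Sparseness(),
}

scores = {
    name: metric(
        model=model,
        x_batch=x_batch,               # rendered plots, shape (N, C, H, W)
        y_batch=y_batch,               # labels: 0 = control, 1 = amblyopia
        a_batch=None,                  # attributions computed on the fly
        s_batch=s_batch if name == "localization" else None,
        explain_func=quantus.explain,
        explain_func_kwargs={"method": "Saliency"},
        device="cpu",
    )
    for name, metric in metrics.items()
}
print(scores)
```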
2.5 Contribution to the Body of Knowledge
This study advances both pediatric ophthalmology and artificial intelligence by integrating an XAI framework to interpret LLM-generated outputs within a CDS system. By leveraging XAI in amblyopia diagnosis, we not only enhance model transparency but also foster greater clinician trust [7]. This approach expands the applicability of AI in ophthalmology and can serve as a template for other medical domains. Specifically, the FGIP framework directs AI models toward essential clinical features such as fixation stability and eye misalignment, thereby improving accuracy and interpretability. This improvement has the potential to facilitate earlier detection and intervention for amblyopia, ultimately leading to better patient outcomes. Through the Quantus explainability framework, we systematically evaluate the AI model’s outputs across key dimensions such as faithfulness, robustness, localization and complexity [8], ensuring that explanations align with clinician needs and uphold ethical standards for responsible AI deployment in healthcare. By demonstrating how LLM outputs can be made interpretable and synchronized with clinical reasoning, this work narrows the gap between emerging AI technologies and clinical practice [7]. Greater transparency is vital for integrating AI tools into clinical workflows and building trust among healthcare professionals. In sum, this study not only refines earlier diagnostic models for amblyopia but also provides valuable insights for developing explainable, clinically relevant AI systems, thereby promoting broader adoption of AI-assisted CDS and improving patient care.
3. Methods
3.1 Overview of Workflow
A key challenge in applying LLMs to healthcare is the “black-box” nature of many of these models. While they can produce accurate predictions, they often lack interpretable explanations, making them difficult to trust and integrate into clinical workflows. In pediatric ophthalmology, timely and accurate diagnosis of amblyopia is especially critical; however, conventional AI-based approaches generally do not provide interpretability aligned with clinical expertise. This study addresses that gap by integrating an XAI framework into an LLM-based system to enhance both transparency and clinical usability. To overcome these limitations, we employed FGIP along with the Quantus explainability framework, creating a tailored approach to amblyopia diagnosis. FGIP embeds domain-specific knowledge (fixation stability, scale variability, and misalignment) into structured prompts, guiding the LLM’s focus toward features most relevant to clinical decision-making. Figure 1 illustrates the multi-stage FGIP process: (1) eye-movement data were collected; (2) the data were visualized to emphasize fixation stability, scale variability, and misalignment; (3) the visualizations were integrated into FGIP prompts for the Gemini model, steering the model’s attention toward clinically pertinent factors; and (4) the outputs were assessed via the Quantus framework across faithfulness, robustness, localization, and complexity to ensure that they remained interpretable, stable under input perturbations, and aligned with expert-defined diagnostic pathways. Eye-tracking data were collected from 135 participants (95 with amblyopia, 40 controls) under binocular and monocular conditions using an EyeLink 1000 Plus eye tracker. Few-shot and in-context learning methods were leveraged to enable effective adaptation despite limited training examples.
Figure 1:
Overview of the multi-stage FGIP process. (1) Data Collection: Eye-movement data are acquired in a clinical setting. (2) Data Analysis: Critical features, such as fixation stability, misalignment, and scale, are identified and visualized. (3) Prompting Framework: The Gemini model incorporates these features via FGIP, using few-shot and in-context learning. (4) Explainability Assessment: The Quantus framework evaluates the model’s outputs across faithfulness, robustness, localization, and complexity, ensuring interpretability and clinical relevance.
3.2 Data Collection, Preprocessing, and Study Participants
Eye movement data were collected and preprocessed using the same protocols as in our previous study [6]. Briefly, eye movement recordings were obtained from a cohort of 135 participants, including 95 children diagnosed with amblyopia and 40 control subjects, using the EyeLink 1000 Plus video-based eye tracker at a sampling rate of 500 Hz. Participants performed visual tasks designed to elicit fixation, smooth pursuit, and saccadic eye movements under binocular and monocular viewing conditions. Data preprocessing involved artifact removal, noise filtering, segmentation into fixation and movement periods, and extraction of key features such as eye stability, scale, and misalignment. All procedures, including calibration, task design, and feature extraction methods, followed the detailed descriptions provided in our prior work [6], ensuring consistency and comparability across studies. Additionally, the eye movement recordings were visualized using Cartesian plots with appropriate scaling, providing clear insights into the spatial distribution of the data under different conditions. Informed consent was obtained from all participants, and the study received approval from the Cleveland Clinic Institutional Review Board. Written consent was provided by each participant or their parent/legal guardian, in accordance with the Declaration of Helsinki.
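As an illustration of the kind of per-trial feature computation described above, the simplified sketch below derives fixation stability, scale, and misalignment from calibrated gaze samples; the metric definitions and constants are assumptions for illustration and do not reproduce the exact pipeline of [6].

```python
# Illustrative per-trial feature extraction from 500 Hz gaze samples.
# The metric definitions below are simplifying assumptions, not the exact
# pipeline of the prior study [6].
import numpy as np

def extract_features(left_xy: np.ndarray, right_xy: np.ndarray) -> dict:
    """left_xy, right_xy: gaze positions of each eye, shape (n_samples, 2), in degrees."""
    def bcea(xy: np.ndarray, k: float = 1.14) -> float:
        # 68% bivariate contour ellipse area: smaller = more stable fixation
        sx, sy = xy[:, 0].std(), xy[:, 1].std()
        rho = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
        return float(2 * k * np.pi * sx * sy * np.sqrt(1 - rho ** 2))

    return {
        "stability_left": bcea(left_xy),    # fixation (eye) stability
        "stability_right": bcea(right_xy),
        # scale: overall spatial spread of gaze positions across both eyes
        "scale": float(np.ptp(np.vstack([left_xy, right_xy]), axis=0).max()),
        # misalignment: mean inter-ocular position difference
        "misalignment": float(np.linalg.norm(left_xy - right_xy, axis=1).mean()),
    }
```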
3.3 Data Perturbation for Robustness Testing
We assessed the robustness of the model’s interpretability by injecting 25% Gaussian noise into the original dataset, then comparing feature-importance rankings and classification outcomes between the original and perturbed data. This approach provided a systematic way to gauge how well the model’s reasoning and overall accuracy held up when faced with suboptimal input conditions. By analyzing shifts in ranking and any fluctuations in performance, we gained valuable insights into how sensitive the model is to input perturbations, thereby evaluating its reliability under realistic clinical conditions.
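A minimal sketch of one way to generate such a perturbation is shown below, assuming that “25% Gaussian noise” means zero-mean noise with a standard deviation equal to 25% of each channel’s standard deviation; this interpretation and the function name are illustrative.

```python
# Sketch of the robustness perturbation: zero-mean Gaussian noise scaled to
# 25% of each channel's standard deviation (one plausible reading of
# "25% Gaussian noise"; shown for illustration only).
import numpy as np

def add_gaussian_noise(samples: np.ndarray, level: float = 0.25, seed: int = 0) -> np.ndarray:
    """samples: gaze positions, shape (n_samples, n_channels), in degrees."""
    rng = np.random.default_rng(seed)
    sigma = level * samples.std(axis=0, keepdims=True)
    return samples + rng.normal(0.0, 1.0, size=samples.shape) * sigma
```

The perturbed traces are then re-plotted, re-classified, and the resulting feature-importance rankings compared with those from the original data.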
3.4 Feature-Guided Interpretative Prompting
FGIP embeds clinically relevant knowledge directly into the LLM’s decision-making. By structuring prompts around critical diagnostic features (fixation stability, scale variability, and misalignment), FGIP ensures that the model’s outputs follow expert-defined clinical pathways. Few-shot examples anchor FGIP by illustrating how real clinical data capture these features. This approach not only directs the diagnostic logic but also reveals feature interactions, such as the relationship between large eye-movement scale and reduced stability. Consequently, FGIP improves both the accuracy and interpretability of the model’s classification results.
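A schematic of a feature-guided prompt built on these principles is given below; the wording and output fields are illustrative assumptions, not the verbatim prompt used in the study.

```python
# Schematic FGIP prompt (illustrative wording; not the verbatim study prompt).
FGIP_PROMPT = """You are assisting a pediatric ophthalmologist.
Examine the attached eye-movement plot and reason step by step over three
expert-defined features before assigning a label:
1. Eye (fixation) stability - how tightly gaze clusters around the target.
2. Scale variability - the overall spatial spread of the recorded positions.
3. Misalignment - systematic offset between the left- and right-eye traces.
For each feature, state what you observe and whether it supports or argues
against amblyopia. Then report: label (control / amblyopia), severity
(mild-treated / moderate / severe / with nystagmus), and a ranked list of
the features that drove your decision."""
```

In practice, few-shot examples pairing annotated plots with completed versions of such a template precede the query case.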
3.5 In-Context Learning for Eye Movement Data
To adapt the Gemini model for amblyopia classification, we employed in-context learning within the FGIP framework. Because the Gemini model’s context window is limited to 16 images per input, we carefully selected the most representative eye movement plots capturing key diagnostic features and limited the number of few-shot examples provided to the model. Concise textual descriptions were paired with visual inputs to maximize information content within the available context window, optimizing the in-context learning approach for our specific application. LLMs like Gemini have demonstrated remarkable capabilities in few-shot, or in-context, learning, adapting to specific tasks with minimal examples, as documented for models like GPT-4 [13, 14, 30, 31]. In this study, we leveraged the FGIP framework to enhance the Gemini model’s interpretative abilities, providing 12 examples for in-context learning. The Quantus explainability framework was employed to rigorously evaluate the outputs, focusing on how well the model interpreted and classified the eye movement data. Figure 2 illustrates the FGIP framework developed for this study, which combines structured prompts and expert-annotated text with few-shot examples that guide the model through step-by-step reasoning. We tested the model’s performance on a dataset of 123 subjects, including 36 control subjects and 87 patients with amblyopia. The effectiveness of the FGIP approach was systematically assessed by examining its impact on amblyopia classification, its ability to determine the severity of the condition (mild and treated, moderate, severe), and its performance in cases involving amblyopia with nystagmus.
Figure 2:
An overview of the feature-guided interpretative prompting framework for amblyopia classification
3.6 Explainability
The Quantus explainability framework [8] was employed to assess the model’s explanations across four dimensions: faithfulness, robustness, localization, and complexity. These dimensions were selected to address core challenges in healthcare AI, where system outputs must be reliable and easily interpretable by clinicians and patients [29, 32, 33]. Faithfulness ensures that explanations accurately reflect the model’s underlying decision-making; robustness assesses how stable those explanations remain under minor input perturbations; localization checks whether explanations concentrate on clinically significant features; and complexity gauges how accessible they are to non-technical stakeholders. In addition to these metrics, a step-by-step reasoning approach was introduced during prompting, breaking the model’s decision-making down into clearly defined phases [34]. This method provided deeper insight into how the model arrived at its conclusions, thereby enhancing the transparency of the entire CDS workflow [31]. Such granular detail not only fosters trust in LLM output but also supports clinicians in validating and integrating these insights into routine care.
3.7 Model Implementation and Experimental Setup
For this study, we utilized the Gemini 1.5 Pro Vision model—a multimodal large language model capable of processing both text and image inputs to generate textual outputs. The model was selected for its proficiency in handling annotated eye movement plots alongside structured prompts from the FGIP framework. To optimize the balance between creativity and coherence in the model’s responses, the temperature parameter was set to 0.7, and a token limit of 1,024 was established to ensure concise explanations. Notably, we preserved the integrity of the pre-trained model by refraining from altering its parameters; instead, we employed in-context learning without fine-tuning. This adaptation involved presenting the model with 12 representative examples—including both amblyopic patients and control participants—to enhance its ability to interpret and classify eye movement data effectively. The model was evaluated using binocular viewing collages that combined images from both eyes, derived from recordings of the selected subjects to ensure a balanced assessment under binocular viewing conditions. A separate test dataset comprising 123 subjects was used to assess the model’s performance, with the 12 in-context learning subjects excluded from this set. All experiments were conducted on a server equipped with an Intel® Core™ i7-13700K CPU and 32 GB of RAM, running Ubuntu 22.04.1 LTS (64-bit), providing a robust computational environment for processing and analysis.
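The sketch below outlines this in-context setup using the google-generativeai Python SDK; the model identifier, file paths, prompt variable, and example list are assumptions for illustration rather than the exact study configuration.

```python
# Minimal sketch of the in-context setup with the google-generativeai SDK.
# The model identifier, file paths, FGIP_PROMPT, and few_shot_examples are
# illustrative assumptions, not the exact study configuration.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# 12 few-shot examples (binocular collages + expert labels) plus one query
# collage stay within the 16-image context budget noted above.
content = [FGIP_PROMPT]
for image_path, label in few_shot_examples:      # hypothetical list of (path, label)
    content += [Image.open(image_path), f"Expert label: {label}"]
content += [Image.open("query_collage.png"), "Classify this subject."]

response = model.generate_content(
    content,
    generation_config=genai.GenerationConfig(
        temperature=0.7,          # balance creativity and coherence
        max_output_tokens=1024,   # keep explanations concise
    ),
)
print(response.text)
```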
3.8 Evaluation and Metrics
The output of the Gemini model was evaluated using the Quantus framework after the model was guided through a structured reasoning process with 12 few-shot examples. The model’s performance was then reported on a separate dataset of 123 unseen subjects to assess its generalization capabilities. Accuracy was the primary performance metric, defined as the proportion of correctly classified instances out of the total number of instances. This evaluation was crucial to ensure the model’s effectiveness on unseen data, providing a clear measure of its ability to generalize beyond the initial few-shot examples used during in-context learning. The separation of training and test datasets, along with the emphasis on generalization, ensured that the evaluation accurately reflected the model’s real-world applicability.
4. Results
The Gemini output was evaluated on its ability to classify amblyopic patients and control subjects using eye movement recordings. The model’s performance was assessed using a test dataset comprising 123 subjects (87 amblyopic patients and 36 control subjects), distinct from the 12 subjects used for in-context learning. The results demonstrate that the model consistently identified key features associated with amblyopia, such as scale variability, eye stability, and misalignment, and provided clear reasoning for its classifications.
4.1 Overall Model Performance
Table 1 summarizes the classification performance of our tool, which integrates FGIP with a few-shot learning approach. The table reports accuracy across multiple diagnostic scenarios: distinguishing amblyopia from control subjects, as well as classifying subgroups such as mild and treated amblyopia versus control, moderate and severe amblyopia versus control, and amblyopia with nystagmus versus control. As the number of few-shot examples (K) increases, the model’s accuracy improves, demonstrating the effectiveness of FGIP in adapting to various amblyopia conditions.
Table 1.
Performance of the Gemini model using feature-guided interpretative prompting with few-shot learning to classify amblyopia and control subjects across different conditions.
| Model (few-shot) | Amblyopia versus control | Mild and treated amblyopia versus control | Moderate and severe amblyopia versus control | Amblyopia with nystagmus versus control |
| --- | --- | --- | --- | --- |
| K=2 | 0.65 | 0.70 | 0.72 | 0.73 |
| K=4 | 0.67 | 0.73 | 0.78 | 0.77 |
| K=8 | 0.77 | 0.80 | 0.80 | 0.83 |
| K=12 | 0.79 | 0.82 | 0.84 | 0.86 |
4.2 Explainability Metrics Evaluation
We employed four metric categories from the Quantus explainability framework, namely faithfulness, robustness, localization, and complexity, to assess the interpretability of the LLM’s outputs.
4.2.1 Faithfulness
The explanations provided by the model aligned well with the known clinical features of amblyopia and control conditions. In the examples, the model’s reasoning reflects the diagnostic criteria used by clinicians, such as scale variability and eye stability, confirming that the explanations are faithful to the model’s decision-making process. The consistent identification of critical features such as scale and stability across various cases demonstrates the model’s adherence to clinically relevant indicators.
4.2.2 Robustness
We assessed the robustness of the model’s explanations by introducing 25% Gaussian noise into the original dataset and comparing feature importance rankings between the original and perturbed scenarios. Figure 3 illustrates the robustness of feature importance rankings under original (A) and perturbed (B) data conditions, assessed through the Quantus explainability framework. In the original dataset, the LLM ranked Scale as the most important feature for 52 subjects, followed by Eye Stability and Eye Misalignment. After noise was introduced, a modest shift was observed: Scale remained the top-ranked feature for 41 subjects, Eye Stability moved into first place for 43 subjects, and Eye Misalignment was ranked first for 39 subjects. Although these variations indicate some sensitivity to data perturbations, the overall distribution of feature importance remained relatively stable. This suggests that the LLM’s interpretative reasoning is largely preserved, underscoring the robustness of its explanations even in the face of minor input alterations.
Figure 3:
Bar graphs showing the robustness of feature rankings in the original data (A) and the perturbed data (B).
Importantly, the consistent prioritization of clinically relevant features across varying conditions demonstrates the model’s potential for robust performance in real-world CDS systems, where data may be subject to unavoidable noise and variability.
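The distributions in Figure 3 amount to a per-subject tally of which feature the model ranked first in the original versus the perturbed runs; a minimal sketch of that comparison is given below, with hypothetical variable names.

```python
# Sketch: tally which feature each subject's explanation ranked first in the
# original and in the noise-perturbed runs (variable names are hypothetical).
from collections import Counter

def first_ranked(rankings: dict[str, list[str]]) -> Counter:
    """rankings: subject_id -> features ordered from most to least important."""
    return Counter(features[0] for features in rankings.values())

original_counts = first_ranked(original_rankings)    # e.g. Counter({"Scale": 52, ...})
perturbed_counts = first_ranked(perturbed_rankings)
print(original_counts, perturbed_counts)
```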
4.2.3 Localization
The LLM effectively localized the critical features that influenced its decisions. In the examples, the model accurately identified the scale and stability of eye movements as the primary indicators of amblyopia, demonstrating strong localization of relevant diagnostic features. The model’s capacity to pinpoint specific areas of variability and misalignment within the eye movement data provides a clear understanding of which aspects of the input data are most influential in the classification process.
4.2.4 Complexity
The LLM’s outputs were detailed yet accessible, balancing technical accuracy with clarity. The step-by-step reasoning approach ensured that the model’s outputs were interpretable and understandable, even to those without a deep technical background. By breaking down the diagnostic process into simple, logical steps, the model’s explanations became more accessible to clinicians and other stakeholders, facilitating greater trust and usability in real-world settings.
5. Discussion
In this study, we enhanced our CDS system for amblyopia diagnosis by integrating XAI techniques with LLMs to analyze eye movement recordings. The FGIP framework guided the Gemini LLM to focus on clinically significant features, improving diagnostic accuracy and providing transparent explanations aligned with clinical reasoning. The Quantus explainability framework was used to evaluate the interpretability of the model’s outputs, ensuring that the explanations were both accurate and clinically relevant. However, the dataset, while sufficient for these initial findings, is relatively small and lacks diversity, which may affect the generalizability of the results. Additionally, manual feature extraction was employed, which limits scalability and increases the potential for human error. Future work will focus on developing more robust quantitative explainability techniques to further enhance the interpretability and reliability of AI models in CDS systems. By refining these methods and automating feature extraction, we aim to provide deeper insights into the model’s outputs, fostering greater trust among clinicians and facilitating the integration of AI-assisted diagnostics into clinical practice.
6. Conclusion
This study aimed to develop a machine learning-based tool for amblyopia detection that is both accurate and interpretable, addressing the limitations of traditional diagnostic methods. By integrating the FGIP framework with the Gemini model, we achieved high accuracy across different severities of amblyopia, including cases with nystagmus. A key aspect of our research was the use of the Quantus explainability framework to evaluate the LLM’s outputs. This framework provided a comprehensive assessment of the model’s reasoning through metrics such as Faithfulness, which confirmed the alignment of the model’s decisions with clinical diagnostic criteria; Robustness, which validated the stability of these decisions under data perturbation; Localization, which ensured the model accurately identified and focused on the most relevant features; and Complexity, which assessed the clarity and accessibility of the model’s explanations. Our findings suggest that the Gemini model, when guided by FGIP, can serve as a reliable and interpretable CDS tool for amblyopia, with the potential to be implemented in clinical practice. The model’s ability to provide interpretable and clinically relevant explanations supports its trustworthiness for healthcare professionals, thereby advancing the integration of AI in pediatric ophthalmology.
Acknowledgments
This research was partially supported by grants from the US National Institutes of Health (NIH): U24EB029005, R01DA053028, and the US Department of Defense (DoD) grant W81XWH2110859, as well as by the Clinical and Translational Science Collaborative of Cleveland, funded by the NIH National Center for Advancing Translational Sciences, through Clinical and Translational Science Award grant UL1TR002548. Additional funding was provided by NEI T32: 5 T32 EY 24236-4, the Case Western Reserve University Biomedical Research Fellowship-Hartwell Foundation, the Blind Children’s Foundation, the Research to Prevent Blindness Disney Amblyopia Award, and the Cleveland Clinic RPC Grant. Further support came from a Cleveland VA Medical Center grant, Research to Prevent Blindness, a CCLCM Unrestricted Block Grant, an NIH-NEI P30 Core Grant, and the Cleveland Eye Bank.
Figures & Table
Figure 4:
LLM outputs for amblyopia classification, showing two cases: a control subject (Case-I) and an amblyopic patient (Case-II). Key explainability metrics are highlighted: Faithfulness (Blue) for clinical alignment, Robustness (Green) for stability, Localization (Orange) for focus on key features, and Complexity (Purple) for clarity of explanations.
References
- 1. McKean-Cowdin R., Cotter S.A., Tarczy-Hornoch K., Wen G., Kim J., Borchert M., Varma R., Multi-Ethnic Pediatric Eye Disease Study Group. ‘Prevalence of amblyopia or strabismus in Asian and non-Hispanic white preschool children: multi-ethnic pediatric eye disease study’. Ophthalmology. 2013;120(10):2117–2124. doi: 10.1016/j.ophtha.2013.03.001.
- 2. Holmes J.M., Clarke M.P. ‘Amblyopia’. Lancet. 2006;367(9519):1343–1351. doi: 10.1016/S0140-6736(06)68581-4.
- 3. Holmes J.M., Lazar E.L., Melia B.M., Astle W.F., Dagi L.R., Donahue S.P., Frazier M.G., Hertle R.W., Repka M.X., Quinn G.E., Weise K.K., Pediatric Eye Disease Investigator Group. ‘Effect of age on response to amblyopia treatment in children’. Arch Ophthalmol. 2011;129(11):1451–1457. doi: 10.1001/archophthalmol.2011.179.
- 4. Komogortsev O.V., Karpov A. ‘Automated classification and scoring of smooth pursuit eye movements in the presence of fixations and saccades’. Behav Res Methods. 2013;45(1):203–215. doi: 10.3758/s13428-012-0234-9.
- 5. Esteva A., Robicquet A., Ramsundar B., Kuleshov V., DePristo M., Chou K., Cui C., Corrado G., Thrun S., Dean J. ‘A guide to deep learning in healthcare’. Nature Medicine. 2019;25(1):24–29. doi: 10.1038/s41591-018-0316-z.
- 6. Upadhyaya D.P., Shaikh A.G., Cakir G.B., Prantzalos K., Golnari P., Ghasia F.F., Sahoo S.S. ‘A 360° View for Large Language Models: Early Detection of Amblyopia in Children Using Multi-view Eye Movement Recordings’. Lect Notes Artif Int. 2024;14845:165–175.
- 7. Tjoa E., Guan C.T. ‘A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI’. IEEE Trans Neural Netw Learn Syst. 2021;32(11):4793–4813. doi: 10.1109/TNNLS.2020.3027314.
- 8. Hedström A., Weber L., Krakowczyk D., Bareeva D., Motzkus F., Samek W., Lapuschkin S., Höhne M.M.-C. ‘Quantus: An explainable AI toolkit for responsible evaluation of neural network explanations and beyond’. J Mach Learn Res. 2023;24(34):1–11.
- 9. Webber A.L., Wood J. ‘Amblyopia: prevalence, natural history, functional effects and treatment’. Clin Exp Optom. 2005;88(6):365–375. doi: 10.1111/j.1444-0938.2005.tb05102.x.
- 10. Ghasia F., Wang J. ‘Amblyopia and fixation eye movements’. Journal of the Neurological Sciences. 2022:120373. doi: 10.1016/j.jns.2022.120373.
- 11. Subramanian V., Jost R.M., Birch E.E. ‘A quantitative study of fixation stability in amblyopia’. Investigative Ophthalmology & Visual Science. 2013;54(3):1998–2003. doi: 10.1167/iovs.12-11054.
- 12. Niechwiej-Szwedo E., Colpa L., Wong A.M. ‘Visuomotor behaviour in amblyopia: deficits and compensatory adaptations’. Neural Plasticity. 2019;2019(1):6817839. doi: 10.1155/2019/6817839.
- 13. Brown T., Mann B., Ryder N., Subbiah M., Kaplan J.D., Dhariwal P., Neelakantan A., Shyam P., Sastry G., Askell A., et al. ‘Language models are few-shot learners’. Advances in Neural Information Processing Systems. 2020;33:1877–1901.
- 14. Singhal K., Azizi S., Tu T., Mahdavi S.S., Wei J., Chung H.W., Scales N., Tanwani A., Cole-Lewis H., Pfohl S., Payne P., Seneviratne M., Gamble P., Kelly C., Babiker A., Schaerli N., Chowdhery A., Mansfield P., Demner-Fushman D., Natarajan V., et al. ‘Large language models encode clinical knowledge’. Nature. 2023;620:172–180. doi: 10.1038/s41586-023-06291-2.
- 15. Saab K., Tu T., Weng W.-H., Tanno R., Stutz D., Wulczyn E., Zhang F., Strother T., Park C., Vedadi E., et al. ‘Capabilities of Gemini models in medicine’. arXiv preprint arXiv:2404.18416. 2024.
- 16. Chowdhery A., Narang S., Devlin J., Bosma M., Mishra G., Roberts A., Barham P., Chung H.W., Sutton C., Gehrmann S., et al. ‘PaLM: Scaling language modeling with pathways’. Journal of Machine Learning Research. 2023;24(240):1–113.
- 17. Sahoo S.S., Plasek J.M., Xu H., Uzuner O., Cohen T., Yetisgen M., Liu H., Wang Y. ‘Large Language Models for Biomedicine: Foundations, Opportunities, Challenges, and Best Practices’. Journal of the American Medical Informatics Association (JAMIA). 2024. doi: 10.1093/jamia/ocae074.
- 18. Sahoo S.S., Plasek J.M., Xu H., Uzuner O., Cohen T., Yetisgen M., Liu H., Meystre S., Wang Y. ‘Large language models for biomedicine: foundations, opportunities, challenges, and best practices’. J Am Med Inform Assoc. 2024. doi: 10.1093/jamia/ocae074.
- 19. Amann J., Blasimme A., Vayena E., Frey D., Madai V.I., Precise4Q Consortium. ‘Explainability for artificial intelligence in healthcare: a multidisciplinary perspective’. BMC Med Inform Decis Mak. 2020;20(1). doi: 10.1186/s12911-020-01332-6.
- 20. Ribeiro M.T., Singh S., Guestrin C. ‘"Why should I trust you?": Explaining the predictions of any classifier’. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2016.
- 21. Lundberg S.M., Lee S.-I. ‘A unified approach to interpreting model predictions’. Advances in Neural Information Processing Systems. 2017:4768–4777.
- 22. Wachter S., Mittelstadt B., Russell C. ‘Counterfactual explanations without opening the black box: Automated decisions and the GDPR’. Harvard Journal of Law & Technology. 2017;31:841.
- 23. Simonyan K., Vedaldi A., Zisserman A. ‘Deep inside convolutional networks: Visualising image classification models and saliency maps’. arXiv preprint arXiv:1312.6034. 2013.
- 24. Adadi A., Berrada M. ‘Peeking inside the black-box: a survey on explainable artificial intelligence (XAI)’. IEEE Access. 2018;6:52138–52160.
- 25. Arya V., Bellamy R.K., Chen P.-Y., Dhurandhar A., Hind M., Hoffman S.C., Houde S., Liao Q.V., Luss R., Mojsilović A., et al. ‘AI Explainability 360 toolkit’. 2021:376–379.
- 26. Nori H., Jenkins S., Koch P., Caruana R. ‘InterpretML: A unified framework for machine learning interpretability’. arXiv preprint arXiv:1909.09223. 2019.
- 27. Klaise J., Van Looveren A., Vacanti G., Coca A. ‘Alibi: Algorithms for monitoring and explaining machine learning models’. 2020. URL: https://github.com/SeldonIO/alibi.
- 28. Guidotti R., Monreale A., Ruggieri S., Turini F., Giannotti F., Pedreschi D. ‘A survey of methods for explaining black box models’. ACM Computing Surveys (CSUR). 2018;51(5):1–42.
- 29. Tonekaboni S., Joshi S., McCradden M.D., Goldenberg A. ‘What clinicians want: contextualizing explainable machine learning for clinical end use’. PMLR; 2019:359–380.
- 30. Chowdhery A., Narang S., Devlin J., Bosma M., Mishra G., Roberts A., Barham P., Chung H.W., Sutton C., Gehrmann S., Schuh P., Shi K., Tsvyashchenko S., Maynez J., Rao A., Barnes P., Tay Y., Shazeer N., Prabhakaran V., Fiedel N., et al. ‘PaLM: Scaling Language Modeling with Pathways’. J Mach Learn Res. 2023;24.
- 31. Wei J., Wang X., Schuurmans D., Bosma M., Xia F., Chi E., Le Q.V., Zhou D. ‘Chain-of-thought prompting elicits reasoning in large language models’. Advances in Neural Information Processing Systems. 2022;35:24824–24837.
- 32. Niechwiej-Szwedo E., Goltz H.C., Chandrakumar M., Hirji Z.A., Wong A.M. ‘Effects of anisometropic amblyopia on visuomotor behavior, I: saccadic eye movements’. Invest Ophthalmol Vis Sci. 2010;51(12):6348–6354. doi: 10.1167/iovs.10-5882.
- 33. Amann J., Blasimme A., Vayena E., Frey D., Madai V.I., Precise4Q Consortium. ‘Explainability for artificial intelligence in healthcare: a multidisciplinary perspective’. BMC Medical Informatics and Decision Making. 2020;20:1–9. doi: 10.1186/s12911-020-01332-6.
- 34. Jacovi A., Goldberg Y. ‘Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?’. arXiv preprint arXiv:2004.03685. 2020.




