Abstract
Cataract surgery is one of the most common and effective surgeries performed worldwide, yet patient education remains a challenge due to limitations in health literacy among the general population. Our study evaluated the reliability of different large language models (LLMs) in providing accurate, complete, and clear responses to frequently asked questions (FAQs) related to cataract surgery. A comprehensive list of 20 FAQs about cataract surgery was submitted sequentially as individual prompts to nine different LLMs. All 180 answers were recorded and scored by two expert ophthalmologists, blinded to the model type, on a 5-point scale measuring accuracy, completeness, and clarity. Interrater agreement was measured using a weighted kappa coefficient, and model performances were compared using the Friedman test and post-hoc analysis. Our results showed all models performed well in responding to FAQs (79% of responses scored “excellent”), serving as effective tools for answering patient FAQs. LLaMA 4 and Copilot scored lower on average relative to other models (p < .05); however, they remained effective at answering FAQs overall. Potential expansion of LLMs as patient education tools into clinical settings should be considered, as they exhibit effectiveness in providing clear, accurate, and complete responses to cataract surgery FAQs.
Keywords: Artificial intelligence, large language models, frequently asked questions, cataract surgery, patient education
Introduction
Cataract surgery is one of the most common and effective surgeries performed in medicine.1–3 In 2015, 3.7 million cataract surgeries were performed in the United States, and the incidence is expected to continue rising.2 As with any other surgery, patients seek answers to important questions about risks, benefits, expectations, and more, prior to consenting to this procedure.
Large language models (LLMs), a subset of artificial intelligence (AI) models, are being used for many different applications within ophthalmology, mainly within the realm of diagnostics.4–6 Given the complexities of cataract surgery and the limited health literacy surrounding the eye and vision health among the general population, patient education becomes a crucial component of perioperative counseling and care.7–9 LLMs are particularly well-suited to aid providers in this task because of their strength in generating and understanding human language.10 Many providers are already using LLMs to help develop patient education leaflets and other materials.3,11 Some are now wondering whether these LLMs can effectively assist in counseling patients on frequently asked questions (FAQs) as well.12–15
Although cataract surgery is a routine ophthalmic procedure, its complexity demands precise and clear answers to patient questions. Furthermore, given the increasing integration and widespread availability of LLMs on the internet, it is likely that some patients are already using these models to answer their medical questions. Evaluating LLM performance in patient education is therefore imperative to identify potential sources of health misinformation.
This study aims to assess the adequacy of various LLMs in providing accurate, complete, and clear answers to FAQs related to cataract surgery. Our findings are intended to inform the safe and effective use of AI tools in patient education. Evaluating these models is critical prior to wider adoption in the clinical setting.
Methods
Study design
In this single-blinded cross-sectional study, an internet search of several websites, including those of large academic medical centers and ophthalmology organizations, was performed. These websites were browsed for educational pages on cataract surgery, specifically questions resembling patient FAQs about the surgery. An extensive list of questions was saved in a Word document along with the answers the educational pages provided. A set of 20 questions was finalized after screening out repetitive questions and questions not directly related to the surgery. The final list of FAQs, reference answers, and their sources can be found in the supplementary material of this manuscript (Supplemental Document 1).
Data collection
Free accounts were created for nine different commonly used and well-known LLMs: ChatGPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash, LLaMA 4, Mixtral 8x22B, DeepSeek V3, Grok 3, Qwen 3, and Standard Copilot. Each of the 20 FAQs was entered into the models’ web-based interface as an individual, sequential prompt within the same chat. The same question order was used for all model conversations for experimental consistency. Each model was accessed between 11 and 12 June 2025, and all answers were saved in a Word document. Questions were entered separately to allow the models to provide full answers to each prompt; this was also thought to resemble patient conversational behavior of asking one question at a time. Documentation of model answers resulted in a total of 180 answers (20 questions × 9 models).
Two expert ophthalmologists were assigned as graders to score each model's answers to the FAQs. Each grader received a copy of the answers blinded to model type (labeled “Models A–I”), a reference sheet containing the answers posted on the websites from which the FAQs were selected, and a scoring sheet (an Excel file). Scoring was on a scale of 1–5 based on response accuracy, completeness, and clarity. Each response was also reviewed for potentially unsafe or misleading content (e.g. inaccurate medical advice). Only whole-number scores were permitted, and providing a rationale for scores was optional (per grader discretion). The scoring legend is provided in Table 1 of this article. The distribution of scores for each model's FAQ responses is illustrated in Supplemental Figure S1.
Table 1.
Scoring rubric.
| Score | Grading | Description |
|---|---|---|
| 5 | Excellent | Fully accurate, complete, and clearly written. Matches expert-level answer. No misleading, biased, or unsafe content. Easy for patients to understand. |
| 4 | Good | Mostly accurate and complete. May omit minor detail or use less clear phrasing, but still safe and reliable. No major factual errors. |
| 3 | Acceptable | Generally accurate but includes small factual inaccuracies, vague phrasing, or lacks clarity. Generally safe but may confuse a layperson or omit an important point. |
| 2 | Poor | Contains clear factual inaccuracies, lacks important clinical detail, or is misleading. May sound plausible but would concern an expert reviewer. Potential safety risk. |
| 1 | Unacceptable | Factually incorrect, misleading, or unsafe. Fails to answer the question appropriately. Could cause harm if trusted. |
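As an illustration of the blinding step described above, the sketch below shows one way model outputs could be relabeled before grading. The study reports only that graders saw answers labeled “Models A–I” and does not describe the exact mechanics, so this is a hypothetical illustration rather than the procedure actually used.

```python
# Hypothetical sketch of blinding model outputs before grading: shuffle the model
# names once and relabel them "Model A" through "Model I". Illustrative only; the
# study does not describe the exact blinding mechanics used.
import random
import string

models = ["ChatGPT-4o", "Claude 4 Sonnet", "Gemini 2.5 Flash", "LLaMA 4",
          "Mixtral 8x22B", "DeepSeek V3", "Grok 3", "Qwen 3", "Standard Copilot"]

random.seed(2025)            # fixed seed so the blinding key is reproducible
shuffled = models[:]
random.shuffle(shuffled)

# Map each real model name to a blinded label; the key is withheld from the graders.
blinding_key = {name: f"Model {letter}"
                for letter, name in zip(string.ascii_uppercase, shuffled)}
print(blinding_key)
```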
Statistical analysis
Descriptive statistics summarized the performance of each model. Interrater agreement was determined using a weighted Cohen's kappa coefficient. Kappa was used instead of an intraclass correlation coefficient because of the ordinal nature of the scoring data. A nonparametric test, the Friedman test, was used to compare model performance. Because the kappa indicated a high level of grader agreement, the scores from both graders were averaged for each answer and only one Friedman test was completed. Pairwise comparisons between models were then made using the Wilcoxon signed-rank test with a false discovery rate (FDR) correction to adjust for multiple comparisons. This approach is suitable given the large number of pairwise comparisons (36), allowing for increased statistical power while controlling the false-positive rate.
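To make the pipeline above concrete, a minimal sketch in Python follows, run on synthetic scores. The article does not state which statistical software was used, and the kappa weighting scheme (linear here), the column names, and the synthetic data are assumptions for illustration only.

```python
# Sketch of the analysis pipeline: weighted kappa, Friedman test, Kendall's W,
# and pairwise Wilcoxon signed-rank tests with Benjamini-Hochberg FDR correction.
# Scores are synthetic; the study's real data would replace them.
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.metrics import cohen_kappa_score
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
models = list("ABCDEFGHI")                       # nine blinded model labels
rows = []
for q in range(1, 21):                           # 20 questions
    for m in models:
        g1 = int(rng.integers(4, 6))             # synthetic grader 1 score (4 or 5)
        g2 = int(np.clip(g1 + rng.integers(-1, 2), 1, 5))  # synthetic grader 2 score
        rows.append({"question": q, "model": m, "grader1": g1, "grader2": g2})
scores = pd.DataFrame(rows)

# Interrater agreement: weighted Cohen's kappa on the ordinal scores
# (linear weighting assumed; the article does not specify the scheme).
kappa = cohen_kappa_score(scores["grader1"], scores["grader2"], weights="linear")

# Average the graders, pivot to a question x model matrix (questions as blocks),
# and run the Friedman test across the nine models.
scores["mean_score"] = scores[["grader1", "grader2"]].mean(axis=1)
wide = scores.pivot(index="question", columns="model", values="mean_score")
chi2, p = friedmanchisquare(*[wide[m] for m in wide.columns])
kendalls_w = chi2 / (len(wide) * (wide.shape[1] - 1))   # W = chi2 / (N * (k - 1))

# Post-hoc pairwise Wilcoxon signed-rank tests with FDR (Benjamini-Hochberg) correction.
pairs = list(combinations(wide.columns, 2))             # 36 comparisons
raw_p = [wilcoxon(wide[a], wide[b], zero_method="zsplit").pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, method="fdr_bh")

print(f"kappa={kappa:.3f}, Friedman chi2={chi2:.1f} (p={p:.3g}), W={kendalls_w:.2f}")
```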
Results
Scores for all individual AI model responses were clustered primarily at “4” and “5” on the scoring rubric: about 79% of responses received a score of 5, 20% a score of 4, and only 1% a score of 3. No responses were deemed unsafe or found to contain medically harmful advice. A weighted Cohen's kappa coefficient measuring interrater reliability for the ordinal score data indicated substantial agreement between graders (κ = 0.675; 95% CI 0.51–0.80), a finding that was statistically significant (p < .001).
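The article does not state how the 95% confidence interval for kappa was obtained. One common approach is a nonparametric bootstrap over the rated answers; the sketch below, using synthetic ratings, illustrates this under that assumption rather than reproducing the study's actual computation.

```python
# Hedged sketch: percentile-bootstrap 95% CI for weighted Cohen's kappa.
# Ratings are synthetic stand-ins for the two graders' 180 paired scores.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
g1 = rng.integers(3, 6, size=180)                            # grader 1 scores (3-5)
g2 = np.clip(g1 + rng.integers(-1, 2, size=180), 1, 5)       # grader 2 scores

boot = []
for _ in range(2000):
    idx = rng.integers(0, len(g1), size=len(g1))             # resample answers with replacement
    boot.append(cohen_kappa_score(g1[idx], g2[idx], weights="linear"))
lo, hi = np.percentile(boot, [2.5, 97.5])                    # percentile bootstrap CI
print(f"kappa 95% CI approx [{lo:.2f}, {hi:.2f}]")
```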
Since grader scores were in substantial agreement, the scores of both graders were averaged to compare performance by model type. Model comparisons are illustrated in Figure 1, with error bars representing the standard deviation.
Figure 1.
This figure illustrates the differences in LLM performance by model type. Error bars represent standard deviation; *indicates that Standard Copilot and LLaMA 4 have a statistically significantly lower mean score compared to the other models (p < .05).
LLM: large language model.
A Friedman test was then performed to evaluate whether performance differed by model type. The results indicated a statistically significant difference in performance across model types (χ² = 75.6, p = 3.778 × 10⁻¹³) with a moderate effect size (Kendall's W = 0.47). Post-hoc analysis using the Wilcoxon signed-rank test with FDR correction was therefore used to make pairwise comparisons between the models.
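As a transparency check, the reported effect size follows from the standard identity relating Kendall's W to the Friedman statistic, W = χ² / (N(k − 1)), with N = 20 questions and k = 9 models:

```python
# Check of the reported Kendall's W via W = chi2 / (N * (k - 1)),
# with N = 20 questions (blocks) and k = 9 models.
chi2, n_questions, n_models = 75.6, 20, 9
w = chi2 / (n_questions * (n_models - 1))
print(round(w, 2))  # 0.47  (75.6 / 160 = 0.4725)
```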
Post-hoc analysis showed that LLaMA 4 and Copilot scored lower on average than all other models (p < .05). Pairwise effect sizes mostly ranged from small to moderate (r ≈ 0.03–0.26), with the exception of large effect sizes for LLaMA 4's comparisons against Mixtral 8x22B, DeepSeek V3, Grok 3, and Qwen 3 (r ≈ 0.84–0.88). No statistically significant difference was observed between LLaMA 4 and Copilot themselves (p = 1.000), and no other pairwise comparisons reached statistical significance. Key findings of the pairwise comparisons are listed in Table 2; the complete list of results can be found in Supplemental Table S1.
Table 2.
Key model pairwise comparisons.
| Pairwise comparison | Adjusted p-value | Effect size |
|---|---|---|
| ChatGPT-4o vs LLaMA 4 | .003 | 0.259 |
| ChatGPT-4o vs Copilot | .012 | 0.042 |
| Claude 4 Sonnet vs LLaMA 4 | .005 | 0.092 |
| Claude 4 Sonnet vs Copilot | .011 | 0.058 |
| Gemini 2.5 Flash vs LLaMA 4 | .003 | 0.125 |
| Gemini 2.5 Flash vs Copilot | .007 | 0.029 |
| LLaMA 4 vs Mixtral 8x22B | .005 | -0.847 |
| LLaMA 4 vs DeepSeek V3 | .003 | -0.839 |
| LLaMA 4 vs Grok 3 | .003 | -0.877 |
| LLaMA 4 vs Qwen 3 | .003 | -0.877 |
| LLaMA 4 vs Copilot | 1.000 | 0.451 |
| Mixtral 8x22B vs Copilot | .012 | 0.033 |
| DeepSeek V3 vs Copilot | .011 | 0.042 |
| Grok 3 vs Copilot | .005 | 0.117 |
| Qwen 3 vs Copilot | .003 | 0.125 |
Discussion
Across the board, all models performed well, with average scores in the “good” to “excellent” range (Figure 1). None of the responses from any model scored below “acceptable,” and the graders did not note any unsafe content in the responses. About 79% of responses across both graders scored a 5 (“excellent”), and 20% scored a 4 (“good”). There was also substantial agreement between the response graders (κ = 0.675, p < .001), adding validity to these findings.
When the different LLMs were compared, most models performed at a similar level on cataract surgery FAQs. However, LLaMA 4 and Copilot underperformed relative to their peer models in providing fully accurate, complete, and clear responses (Table 2), although there was no statistically significant difference between LLaMA 4 and Copilot themselves. Both models still performed in the “good” to “excellent” range overall for the 20 FAQs, suggesting the difference is not clinically meaningful.
This outcome is itself a testament to the overall effectiveness of LLMs in providing high-quality responses to cataract surgery FAQs. The mean scores of the best-performing models cluster closely, raising concern for a potential “ceiling effect.” However, our statistical analysis was still able to detect a significant difference in performance across the models, demonstrating that the 5-point scale and statistical methods were sensitive enough to identify performance disparities even within a narrow range of high scores. A more detailed overview of the score distribution by AI model is available in Supplemental Figure S1.
These findings demonstrate the ability of LLMs to function effectively in providing education to cataract patients. The FAQs evaluated in this study (Supplemental Document S1) were comprehensive, addressing questions patients typically ask about the pre-, intra-, and postoperative aspects of cataract surgery. This study supports our confidence in the increasing value of AI in addressing patient educational needs within ophthalmology. Patients already using LLMs to answer medically relevant questions benefit from responses that are clear, accurate, and easily accessible. Moreover, given the efficacy LLMs demonstrate in ophthalmic patient education, there is potential to integrate them into real-time ophthalmology clinics or virtual care settings.
Integration into the clinical workflow may alleviate time- and staffing-associated burdens while enhancing patients' access to cataract surgery information. In practice, LLM-powered chat interfaces could be embedded into patient portals, electronic health records, or clinic websites to address common questions, allowing clinicians to focus on more complex concerns. However, several barriers remain before such tools can be widely adopted, including rigorous validation of AI responses, proper regulatory oversight, and management of medical–legal liabilities.
Several factors contribute to the validity of this study. Question prompts were submitted individually, mimicking patient behavior and allowing the LLMs to formulate a full response to each. The study also incorporated a diverse set of widely used LLMs, making it more applicable to real-world scenarios; Gemini, for instance, powers Google's “AI Mode” and “AI Overview” in Search, a tool many patients may already be using, perhaps without realizing it. The FAQs selected were clinically relevant and drawn from patient-focused websites, ensuring real-world relevance. The use of kappa coefficient calculations and two expert ophthalmologists blinded to model type further strengthened the validity of our grading process and reduced bias in model scoring.
While prior studies have also compared different LLMs in patient education tasks related to cataract surgery, they primarily focused on critical assessment of the readability and comprehensibility of the content.3,11 Our study encompasses a larger variety of LLMs and a greater breadth of factors assessed by the graders. Some studies, such as those by Dihan et al. and Qiu et al., evaluated only three LLMs.12,13 A study by Tailor et al. used a coarser grading system than our 5-point scale.14 Our more granular grading scale allowed for a more rigorous statistical comparison of the different models than prior studies.15–19
Limitations
While this study has many strengths, there are a few noteworthy limitations as well. AI technology is developing and improving rapidly; therefore, newer and better versions of the models assessed in this study will continue to emerge. Additionally, only unpaid versions of the models were evaluated, whereas some patients may be using enhanced-performance paid versions, which could produce different results.
Question order and sampling temperature are other factors that may affect outcomes and introduce bias. Although the same set and sequence of questions was used for all models to maintain consistency, future research may benefit from randomizing question order and performing replicate runs, entering each question multiple times to test response consistency and to evaluate how strongly question order and conversational context influence the results.
Lastly, despite the use of a quantitative scoring rubric by two ophthalmology experts, we concede that a certain degree of subjectivity remains in assigning a particular score to a generated response; a more structured, granular scoring rubric may reduce this subjectivity.
Future directions
Because only 20 FAQs were used and a single surgery was studied, ceiling effects could have influenced the outcomes. However, the questions were selected to represent the most commonly encountered patient concerns about cataract surgery, ensuring relevance and practical applicability. Future studies may consider expanding the question pool to include a broader range of topics or subspecialties to further assess the models' discriminative performance.
Future studies assessing LLM responses to FAQs should also consider an expanded scoring rubric with scores for different assessment domains, such as an ordinal subscale for unsafe or potentially harmful content, rather than a single score for each response. This would allow a more detailed comparison of LLMs. The study should also be replicated within other fields of ophthalmology to provide more generalizable insight into the clinical scope of LLMs in patient education. Tailor et al., for instance, covered a broad range of ophthalmic subspecialties in their study, allowing for a comprehensive exploration of LLMs in ophthalmic patient education.14
With the rapidly accelerating advancements in AI technology, we hope this study will contribute to the future clinical application of LLMs. Our findings suggest they show promise as effective patient education tools. An interesting direction for future studies would be to assess possible LLM integration into patient–physician communication portals.
Conclusions
LLMs demonstrate a strong ability to provide complete, accurate, and comprehensible responses to cataract surgery FAQs, with quality ratings in the “good” to “excellent” range. LLaMA 4 and Copilot performed worse relative to the other models, yet this difference does not appear clinically meaningful. In summary, LLMs deliver effective patient education for this common ophthalmic surgical intervention. Further consideration should be given to the integration of LLMs into clinical practice, as it may reduce clinic burden and improve access to patient education.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076251406304 for Evaluating the reliability of large language models in answering FAQs for cataract surgery by Omar Nusair, Hassan Asadigandomani, Ikesinachi Osuorah, Jaron Sanchez and Mohammad Soleimani in DIGITAL HEALTH
Supplemental material, sj-pdf-2-dhj-10.1177_20552076251406304 for Evaluating the reliability of large language models in answering FAQs for cataract surgery by Omar Nusair, Hassan Asadigandomani, Ikesinachi Osuorah, Jaron Sanchez and Mohammad Soleimani in DIGITAL HEALTH
Supplemental material, sj-docx-3-dhj-10.1177_20552076251406304 for Evaluating the reliability of large language models in answering FAQs for cataract surgery by Omar Nusair, Hassan Asadigandomani, Ikesinachi Osuorah, Jaron Sanchez and Mohammad Soleimani in DIGITAL HEALTH
Abbreviations
The following abbreviations are used in this article:
- LLM(s)
Large language model(s)
- FAQ(s)
Frequently asked question(s)
- FDR
False discovery rate
- AI
Artificial intelligence
Footnotes
ORCID iDs: Omar Nusair https://orcid.org/0000-0001-7831-0154
Hassan Asadigandomani https://orcid.org/0000-0003-0744-8337
Ikesinachi Osuorah https://orcid.org/0000-0002-4473-7552
Jaron Sanchez https://orcid.org/0000-0001-5457-1624
Mohammad Soleimani https://orcid.org/0000-0002-6546-3546
Author contributions: Conceptualization: Omar Nusair; methodology: Omar Nusair, Hassan Asadigandomani, and Mohammad Soleimani; software: Omar Nusair; validation: Omar Nusair, Hassan Asadigandomani, and Mohammad Soleimani; formal analysis: Omar Nusair; investigation: Omar Nusair and Hassan Asadigandomani; resources: Omar Nusair; data curation: Omar Nusair; writing—original draft preparation: Omar Nusair, Ikesinachi Osuorah, and Jaron Sanchez; writing—review and editing: Ikesinachi Osuorah and Jaron Sanchez; visualization: Ikesinachi Osuorah and Jaron Sanchez; supervision: Mohammad Soleimani; project administration: Omar Nusair and Mohammad Soleimani; funding acquisition: not applicable. All authors have read and agreed to the published version of the manuscript.
Institutional review board statement: This study does not involve human subjects research or any patient data. All data utilized were publicly available; therefore, institutional review board approval is not applicable.
Funding: The authors received no financial support for the research, authorship, and/or publication of this article.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement: All data collected for this study are publicly available. Any data beyond what is presented in the manuscript can be found in the supplementary materials provided.
Disclaimer/publisher's note: The statements, opinions, and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of the publisher and/or the editor(s). The publisher and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Supplemental material: Supplemental material for this article is available online.
References
- 1. Thompson J, Lakhani N. Cataracts. Prim Care 2015; 42: 409–423.
- 2. Moshirfar M, Milner D, Patel BC. Cataract surgery. In: StatPearls [Internet]. StatPearls Publishing; 2023.
- 3. Thompson P, Thornton R, Ramsden CM. Assessing chatbots’ ability to produce leaflets on cataract surgery: Bing AI, ChatGPT 3.5, ChatGPT 4o, ChatSonic, Google Bard, perplexity, and pi. Cataract Refract Surg 2025; 51: 371–375.
- 4. Jiao C, Rosas E, Asadigandomani H, et al. Diagnostic performance of publicly available large language models in corneal diseases: a comparison with human specialists. Diagnostics (Basel) 2025; 15: 1221.
- 5. Hussain ZS, Delsoz M, Elahi M, et al. Performance of DeepSeek, Qwen 2.5 MAX, and ChatGPT assisting in diagnosis of corneal eye diseases, glaucoma, and neuro-ophthalmology diseases based on clinical case reports. medRxiv [Preprint]. 2025 Mar 14: 2025.03.14.25323836. doi:10.1101/2025.03.14.25323836.
- 6. Delsoz M, Madadi Y, Raja H, et al. Performance of ChatGPT in diagnosis of corneal eye diseases. Cornea 2024; 43: 664–670.
- 7. Capó H, Edmond JC, Alabiad CR, et al. The importance of health literacy in addressing eye health and eye care disparities. Ophthalmology 2022; 129: e137–e145.
- 8. Gupta R, Gupta A, Omair M, et al. Health literacy on cataract and its treatment options among patients with operable cataract: a cross-sectional study from Moradabad (India). Delhi J Ophthalmol 2022; 32: 50–54.
- 9. Hu S, Wey S, Yano RA, et al. Fear of cataract surgery and vision loss: the effects of health literacy and patient comprehension at an academic hospital-based eye clinic. Clin Ophthalmol 2025; 19: 1103–1110.
- 10. Sreerakuvandana S, Pappachan P, Arya V. Understanding large language models. Advances in Computational Intelligence and Robotics Book Series 2024: 1–24.
- 11. Azzopardi M, Ng B, Logeswaran A, et al. Artificial intelligence chatbots as sources of patient education material for cataract surgery: ChatGPT-4 versus Google Bard. BMJ Open Ophthalmol 2024; 9: e001824.
- 12. Dihan Q, Chauhan MZ, Eleiwa TK, et al. Large language models: a new frontier in paediatric cataract patient education. Br J Ophthalmol 2024; 108: 1470–1476.
- 13. Qiu X, Luo C, Zhang Q, et al. Enhancing the readability of online pediatric cataract education materials: a comparative study of large language models. Transl Vis Sci Technol 2025; 14: 19.
- 14. Tailor PD, Xu TT, Fortes BH, et al. Appropriateness of ophthalmology recommendations from an online chat-based artificial intelligence model. Mayo Clin Proc Digit Health 2024; 2: 119–128.
- 15. Gupta AS, Sulewski ME, Armenti ST. Performance of ChatGPT in cataract surgery counseling. J Cataract Refract Surg 2024; 50: 424–425.
- 16. Cohen SA, Brant A, Fisher AC, et al. Dr. Google vs. Dr. ChatGPT: exploring the use of artificial intelligence in ophthalmology by comparing the accuracy, safety, and readability of responses to frequently asked patient questions regarding cataracts and cataract surgery. Semin Ophthalmol 2024; 39: 472–479.
- 17. Sundaramoorthy S, Ratra V, Shankar V, et al. Conversational guide for cataract surgery complications: a comparative study of surgeons versus large language model-based chatbot-generated instructions for patient interaction. Ophthalmic Epidemiol 2025; 32: 1–8. doi: 10.1080/09286586.2025.2484772.
- 18. Yılmaz IBE, Doğan L. Talking technology: exploring chatbots as a tool for cataract patient education. Clin Exp Optom 2025; 108: 56–64.
- 19. Su Z, Jin K, Wu H, et al. Assessment of large language models in cataract care information provision: a quantitative comparison. Ophthalmol Ther 2025; 14: 103–116.