Abstract
Background/Objectives
Experimental investigation. The integration of Bing Chat (Microsoft) with ChatGPT-4 (OpenAI) has conferred the capability of accessing online data past 2021. We investigate its performance against ChatGPT-3.5 on a multiple-choice ophthalmology exam.
Subjects/Methods
In August 2023, ChatGPT-3.5 and Bing Chat were evaluated against 913 questions derived from the American Academy of Ophthalmology’s Basic and Clinical Science Course (BCSC) collection. For each response, the sub-topic, performance, Simple Measure of Gobbledygook readability score (measuring years of required education to understand a given passage), and cited resources were collected. The primary outcomes were the comparative scores between models, and qualitatively, the resources referenced by Bing Chat. Secondary outcomes included performance stratified by response readability, question type (explicit or situational), and BCSC sub-topic.
Results
Across 913 questions, ChatGPT-3.5 scored 59.69% [95% CI 56.45, 62.94] while Bing Chat scored 73.60% [95% CI 70.69, 76.52]. Both models performed significantly better in explicit than clinical reasoning questions. Both models performed better on general medicine questions than on ophthalmology subsections. Bing Chat referenced 927 online entities and provided at least one citation for 836 of the 913 questions. The use of more reliable (peer-reviewed) sources was associated with a higher likelihood of a correct response. The most-cited resources were eyewiki.aao.org, aao.org, wikipedia.org, and ncbi.nlm.nih.gov. Bing Chat showed significantly better readability than ChatGPT-3.5, averaging a reading level of grade 11.4 [95% CI 7.14, 15.7] versus 12.4 [95% CI 8.77, 16.1], respectively (p-value < 0.0001, ρ = 0.25).
Conclusions
The online access, improved readability, and citation feature of Bing Chat confer additional utility for ophthalmology learners. We recommend critical appraisal of cited sources during response interpretation.
Subject terms: Education, Eye diseases
Introduction
Artificial intelligence (AI) holds tremendous promise for ophthalmology, a specialty reliant on image-based and data-rich diagnostics [1]. For example, AI has previously been applied to detecting various retinal pathologies such as diabetic retinopathy and age-related macular degeneration [2–6]. The literature also describes the utility of AI in non-retinal pathologies [7]. Other evidence suggests that AI may optimize and even automate clinical workflows [8]. Of relevance to this report, AI may hold potential as a wide-encompassing educational tool for ophthalmology learners.
ChatGPT (OpenAI, San Francisco, CA, USA), a large language model (LLM), has ignited a paradigm shift toward the use of AI in medicine [9–11]. The fundamental mechanics of ChatGPT have been described elsewhere at length [9, 12]. GPT-3.5 is an augmented version of its predecessor GPT-3 (2020) and was trained on a vast array of parameters, qualifying it as the largest neural network on the market [11]. It is currently limited to training on online materials dated up to September 2021 [13]. Thereafter, GPT-4, an improved version of GPT-3.5, was released on March 13, 2023 [13]. GPT-4 is capable of handling more complex and robust tasks with optimized accuracy, including the processing of images, mathematics, and longer chat prompts [13]. Recent works recommend that caution be exercised for any immediate application of AI in this capacity [14]. To an early-stage learner, it can be ambiguous whether to trust each response from an AI, especially given the risk of critical error in care if its recommendations are incorrect. Early evidence reveals that ChatGPT-3.5 answered 46% of questions correctly on a validated ophthalmic multiple-choice examination [15]. More recently, ChatGPT-4 has shown tremendous improvements in its interpretation of ophthalmology, demonstrating a latest performance of 84% on the same examination [16].
In February 2023, GPT-4 was integrated into the Bing Chat (Microsoft, Redmond, Washington, United States) search assistant [17]. While Bing Chat and ChatGPT-4 both use AI and natural language processing, Bing Chat offers additional features including the capability of retrieving and citing online, up-to-date information from the Bing database for subsequent analysis using its ChatGPT integration [17]. The “Page Context” sub-feature further evaluates and answers questions on educational ophthalmology webpages that the user is viewing, which may help learners better understand the content they study, although focused evaluation of this feature is needed [17]. Comparatively, ChatGPT-4 is limited to datasets up to September 2021, and may not reflect the latest ophthalmology evidence [13]. Further, when generating responses, Bing Chat is designed to follow prescribed AI guidelines such as avoiding controversial or harmful content, which may reduce the impact of misinformation on responses [18]. In contrast, ChatGPT-4 generates responses based on the data that it has been trained on, the accuracy of which cannot be assessed. Finally, ChatGPT-4 remains accessible only through paid subscription [13]. To date, there remains a paucity of evidence on whether this iteration of ChatGPT with Bing Chat better positions this platform for educational applications in ophthalmology. Preliminary evidence supports encouraging performance of Bing Chat, though, as noted in one reply article, it is limited by relatively lower study power [19, 20]. In this highly powered study, we sought to compare the performance of the freely available GPT-4 Bing Chat and ChatGPT-3.5 on an ophthalmology examination based on official American Academy of Ophthalmology (AAO) materials.
Methods
An internal study protocol was prepared before study commencement. Per article 2.4 of the Tri-Council Policy Statement, approval from IRB/Ethics Committees was not required as data were obtained from publicly accessible resources. This research adheres to the tenets of the Declaration of Helsinki. No new identifiable information was generated in this work.
An ophthalmology exam was created by amalgamating study questions from the complete American Academy of Ophthalmology (AAO) Basic and Clinical Science Course (BCSC) Resident Set collection [17]. This is in keeping with works in the literature using pre-existing official resources for collating examination questions [15, 16, 19, 21]. The BCSC collection is a key resource for ophthalmology learners to eventually become well-informed and competent practitioners in the field. Included topics were Update on General Medicine; Fundamentals and Principles of Ophthalmology; Clinical Optics; Ophthalmic Pathology and Intraocular Tumors; Neuro-Ophthalmology; Pediatric Ophthalmology and Strabismus; Orbit, Eyelids, and Lacrimal System; External Disease and Cornea; Intraocular Inflammation and Uveitis; Glaucoma; Lens and Cataract; Retina and Vitreous; and Refractive Surgery. All multiple-choice questions from these texts were included in the final examination. However, image-reliant questions, whether clinical, radiologic, or diagrammatic, were excluded as no version of GPT used in this study supported image-based inputs. We included all eligible questions from the 2016–2017 and 2019–2020 BCSC collection editions, which were readily accessible (accessible for free at our institution, unlike other editions), to enhance the power of this work. Duplicate questions, defined as pairs differing by fewer than 50 input characters by fuzzy matching, were removed. The <stringdist> R package (version 0.9.12) and implicit <amatch> function were used to conduct approximate string matching across all questions.
The <amatch> function, an approximate-matching counterpart to R’s native <match> and <%in%>, returns the position of a given question’s closest match. We specified that any approximate match differing by fewer than 50 characters would be eliminated, thereby removing duplicates that differed only by insignificant characters (e.g., additional spaces and punctuation, which may have varied between yearly editions).
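The deduplication rule above can be illustrated with a minimal Python sketch. This is not the R workflow used in the study: the function names `levenshtein` and `deduplicate` are hypothetical, and plain Levenshtein edit distance is assumed here as the distance measure, which may differ from the metric applied by <amatch>.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def deduplicate(questions, max_dist=50):
    """Keep a question only if it differs from every kept question by at least max_dist edits."""
    kept = []
    for question in questions:
        if all(levenshtein(question, k) >= max_dist for k in kept):
            kept.append(question)
    return kept


# Toy usage with a small threshold, because the example strings are short.
toy = ["abc def", "abc  def", "completely different text"]
unique = deduplicate(toy, max_dist=5)
```

With a threshold of 50 and a mean question length near 300 characters, a rule of this kind would mainly remove pairs that differ in spacing, punctuation, or a few words between editions.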
In August 2023, ChatGPT-3.5 and Bing Chat were queried by BT, JM, and NH using the assembled exam question bank [9, 17]. In Bing Chat, the “focused” response style was selected to enhance the focus of the model on selecting a given choice response, rather than creating a new and distinct answer entity. To prevent additional input into each query, exam questions were entered directly, without any additional prompt. If ChatGPT or Bing Chat produced an answer that did not match any possible choice, the original question was re-entered, this time preceded by the phrase “Please select the best answer:”. If the model still did not select a given option, the answer was marked as “Does Not Know”. Successive questions were queried in a fresh ChatGPT or Bing Chat session to prevent influence from previous questions. A fresh session was created by first deleting the prior chat session before creating a new blank-slate chat session, which could not draw on the previous query. As well, to ensure answer consistency, both ChatGPT and Bing Chat were tested on each question in duplicate.
Excel (Microsoft, USA) was used for organization of all study materials and data collection. For the purpose of grading, the following steps were completed in duplicate by independent reviewers (BT and JM). For scoring purposes, ChatGPT or Bing Chat was deemed to supply a “correct” answer to a given question if it selected the same choice as recommended by the answer key. Alternatively, an answer was deemed “incorrect” if the response did not match the answer key suggestion, if either platform could not decide on any option when further prompted, or if the third attempt was incorrect in the case of conflicting duplicate responses (e.g., correct first query, incorrect second query).
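The scoring rule above can be expressed as a short Python sketch. The function name `grade` and its arguments are hypothetical, and the handling of conflicting duplicates (deferring to a third attempt) is one reading of the description rather than a confirmed implementation.

```python
DOES_NOT_KNOW = "Does Not Know"


def grade(first, second, answer_key, third=None):
    """Return True if a model is scored as correct for one question.

    first, second: choices from the duplicate queries (a letter or DOES_NOT_KNOW).
    third: tie-breaking attempt, consulted only when the duplicates disagree.
    """
    if first == second:
        final = first
    else:
        final = third  # assumed tie-break; a missing third attempt scores as incorrect
    return final == answer_key
```

Because `DOES_NOT_KNOW` never equals an answer-key letter, an undecided response is scored as incorrect without a separate branch.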
Other collected variables included question type, sub-topic, question readability score, number of citations provided with a response, and a rating of the reliability of the cited sources. Question type was dichotomized into two categories by independent reviewers (BT and JM; with conflicts resolved through discussion with the senior author): explicit and situational. Explicit questions were defined as those involving simple recall, such as the main finding from a landmark ophthalmology trial or selecting the largest risk factor for a given eye disease. Situational questions included those with higher-level processing requisites, such as mathematical reasoning, simple clinical reasoning, and complex clinical vignette-based questions. Readability scores of exam questions were calculated using the Simple Measure of Gobbledygook (SMOG) measure, which is based on the number of polysyllabic words (three or more syllables) relative to the number of sentences, and which estimates a score correlating to years of education needed to understand a string input. The <sylcount> package and implicit <readability> function were used for objective and consistent calculation. The SMOG is a validated gold-standard tool for assessing readability in healthcare settings and was shown to possess a 0.985 correlation with readers who exhibited 100% comprehension of reading text materials [22]. Next, the number and content of citations were collected from responses produced by the new citation tool featured in the latest Bing Chat release. The identity of the primary citation, defined as the first and often only cited reference in each response, was recorded. Finally, the reliability of the citation was rated on an ordinal scoring scale of zero (no use of any peer-reviewed/reliable sources), one (mixed use of peer-reviewed/reliable sources and non-reliable sources), and two (sole use of reliable sources).
Reliable sources were defined to be those subject to peer-review such as (but not limited to) eyewiki.aao.org, pubmed.com, and those hailing from publisher websites (e.g., Elsevier.com).
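For readers unfamiliar with the SMOG calculation, a minimal Python sketch follows. The formula in `smog_grade` is the standard published SMOG formula; the syllable counter is a rough vowel-group heuristic assumed for illustration only, so results may differ from those of the <sylcount> package used in the study. All function names are hypothetical.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def smog_grade(polysyllables: int, sentences: int) -> float:
    """Standard SMOG formula, normalised to a 30-sentence sample."""
    if sentences <= 0:
        raise ValueError("need at least one sentence")
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291


def smog_from_text(text: str) -> float:
    """Estimate the SMOG grade of a passage using the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return smog_grade(polysyllables, len(sentences))
```

A passage with no words of three or more syllables returns the formula's floor of about grade 3.1, which is why short, plainly worded inputs can score well below the technical passages typical of exam questions.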
The primary outcome of this study was each model’s crude accuracy in correctly completing a set of multiple-choice questions (MCQ). Secondary aims included investigating the citation contents of Bing Chat’s responses, as well as how each model’s accuracy varied with MCQ type, subject topic, and question readability score. Descriptive statistics are presented as proportions (e.g., of correct answers) and means with standard deviation (SD) or 95% confidence interval (CI). A kappa statistic was calculated to assess inter-rater reliability between both models in this study. Kappa ranges from −1 to 1 and may indicate poor (<0.00), slight (0.00–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), or almost perfect agreement (0.81–1.00) [23]. Where appropriate, chi-square tests with phi correlation or Wilcoxon rank-sum tests with rho correlation were used. Multiple comparison testing proceeded via the Tukey Honestly Significant Difference test. Analysis was completed by BT in RStudio 2022.02.0 + 443.
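The kappa statistic and its interpretation bands can be sketched in Python as follows. This is an illustrative implementation of Cohen's kappa for two sets of labels, not the R code used for the analysis; the function names are hypothetical, and the bands are those listed in the paragraph above.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: product of the marginal proportions, summed over labels.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Map a kappa value to the agreement bands described in the text."""
    if kappa < 0.0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"
```

In this setting each list would hold one model's per-question result (for example, correct or incorrect), so kappa reflects how often the two models succeed and fail on the same questions beyond what chance would predict.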
Results
Following deduplication, we assembled 913 questions from the BCSC collection that satisfied the inclusion criteria. Of this total, there were 662 explicit and 251 clinical reasoning questions. Table 1 summarizes the number of questions from each BCSC topic. The mean (SD) question length was 296 (153) characters. Comparatively, the mean (SD) length of model responses was 1013 (534) and 497 (247) characters for ChatGPT-3.5 and Bing Chat, respectively. For either model, repeated query across questions did not result in any instances where a different answer option was chosen compared to the first selected option.
Table 1.
Breakdown of multiple-choice questions by sub-topic.
| Section | Total (N = 913) |
|---|---|
| External and cornea | 73 |
| General medicine | 72 |
| Glaucoma | 25 |
| Inflammation, uveitis | 65 |
| Lens, cataract | 30 |
| Neuro ophthalmology | 75 |
| Ophthalmology fundamentals | 72 |
| Ophthalmic pathology & intraocular tumors | 50 |
| Optics | 144 |
| Orbit, eyelids, lacrimal | 62 |
| Pediatric ophthalmology | 94 |
| Refractive surgery | 55 |
| Retina, vitreous | 96 |
Across all questions, ChatGPT-3.5 provided correct responses to 59.69% [95% CI 56.45, 62.94] while Bing Chat was proficient in 73.60% [95% CI 70.69, 76.52] of queries (p-value < 0.0001, φ = 0.15). ChatGPT-3.5 attempted all questions and selected “none of the above” for two queries, while Bing Chat admitted to not knowing the answer to 16 questions and selected “none of the above” for 32 queries. Between these models, the kappa value of agreement was 0.461, indicating moderate inter-system reliability. Stratified by question type, for explicit questions, ChatGPT-3.5 provided correct responses to 63.0% [95% CI 59.2, 66.7] while Bing Chat was proficient in 77.8% [95% CI 74.6, 81.0] of queries (p-value ≤ 0.0001, φ = 0.16). For clinical reasoning questions, ChatGPT-3.5 provided correct responses to 51.0% [95% CI 44.7, 57.3] while Bing Chat was proficient in 62.5% [95% CI 56.4, 68.7] of queries (p-value = 0.06, φ = 0.05). Table 2 stratifies the performance of these two LLMs across all sections covered by the BCSC. Overall, both models performed best on the general medicine section. Comparatively, ChatGPT-3.5 performed the worst on the lens and cataract section, while Bing Chat performed the worst on the optics section. Within ChatGPT-3.5, performance was significantly stronger in general medicine than lens and cataract (p-value = 0.005), general medicine than optics (p-value < 0.0001), and general medicine than orbit, eyelids, and lacrimal systems (p-value = 0.002). Comparatively, within Bing Chat, performance was significantly stronger in general medicine than optics (p-value < 0.0001), inflammation and uveitis than optics (p-value = 0.001), ophthalmic pathology than optics (p-value = 0.002), and retina and vitreous than optics (p-value = 0.016). The exhaustive list of pairwise comparisons (accounting for multiple comparisons), including non-significant differences, is included in the Supplemental Data File.
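As a worked illustration of the overall comparison, the following Python sketch computes a normal-approximation (Wald) interval for a proportion and the phi coefficient for a 2×2 table. Several points are assumptions: the counts of 545 and 672 correct answers are inferred from the reported percentages of 913 questions, the interval method used in the study is not stated (so the bounds need not match the reported values exactly), and the two models' answers are treated as independent groups. The function names are hypothetical.

```python
import math


def wald_ci(correct, total, z=1.96):
    """Normal-approximation interval for a proportion: (p, lower, upper)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


def phi_2x2(a, b, c, d):
    """Phi coefficient for the table [[a, b], [c, d]]; its magnitude equals sqrt(chi2 / n)."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator


# Counts inferred from the reported percentages (assumption, not study data files).
n_questions = 913
chatgpt_correct, bing_correct = 545, 672

chatgpt_p, chatgpt_lo, chatgpt_hi = wald_ci(chatgpt_correct, n_questions)
bing_p, bing_lo, bing_hi = wald_ci(bing_correct, n_questions)

# Rows: Bing Chat, ChatGPT-3.5; columns: correct, incorrect.
phi = phi_2x2(bing_correct, n_questions - bing_correct,
              chatgpt_correct, n_questions - chatgpt_correct)
```

Under these assumptions the phi coefficient rounds to the reported value of 0.15, which corresponds to a small effect size despite the highly significant p-value.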
Table 2.
Performance of ChatGPT-3.5 and Bing Chat, stratified by sub-topic.
| ChatGPT-3.5: Section | ChatGPT-3.5: Mean | ChatGPT-3.5: Lower 95% CI | ChatGPT-3.5: Upper 95% CI | Bing Chat: Section | Bing Chat: Mean | Bing Chat: Lower 95% CI | Bing Chat: Upper 95% CI |
|---|---|---|---|---|---|---|---|
| General medicine | 81.94% | 72.88% | 91.01% | General medicine | 87.50% | 79.70% | 95.30% |
| Ophthalmic pathology & intraocular tumors | 72.00% | 59.30% | 84.70% | Ophthalmology fundamentals | 86.11% | 77.96% | 94.26% |
| Refractive surgery | 70.91% | 58.66% | 83.16% | Ophthalmic pathology & intraocular tumors | 82.00% | 71.13% | 92.87% |
| Ophthalmology fundamentals | 66.67% | 55.56% | 77.78% | Inflammation uveitis | 80.00% | 70.08% | 89.92% |
| Glaucoma | 64.00% | 44.80% | 83.20% | Lens, cataract | 80.00% | 65.39% | 94.61% |
| Neuro ophthalmology | 64.00% | 52.91% | 75.09% | Refractive surgery | 80.00% | 69.21% | 90.79% |
| External diseases and cornea | 60.27% | 48.82% | 71.73% | Pediatric ophthalmology | 76.60% | 67.86% | 85.33% |
| Pediatric ophthalmology | 57.45% | 47.25% | 67.65% | Glaucoma | 76.00% | 58.92% | 93.08% |
| Retina, vitreous | 57.29% | 47.19% | 67.39% | External diseases and cornea | 73.97% | 63.70% | 84.24% |
| Inflammation uveitis | 55.38% | 43.05% | 67.72% | Retina, vitreous | 72.92% | 63.85% | 81.99% |
| Optics | 47.92% | 39.59% | 56.24% | Orbit, eyelids, lacrimal | 70.97% | 59.44% | 82.50% |
| Orbit, eyelids, lacrimal | 46.77% | 34.10% | 59.45% | Neuro ophthalmology | 69.33% | 58.68% | 79.98% |
| Lens, cataract | 40.00% | 22.11% | 57.89% | Optics | 52.08% | 43.76% | 60.41% |
Bing Chat referenced 927 unique online sources, which were collectively referenced 4165 times across all questions. The mean (SD) number of citations per response was 4.56 (3.36). Bing Chat provided at least one citation for 836 of the 913 questions. Table 3 lists the 50 most commonly referenced sources, regardless of whether they were the primary citation. There was no significant association between the presence of a citation and likelihood of a correct response (p-value = 0.053). Likewise, there was no significant association between number of citations and likelihood of a correct response (p-value = 0.066). Citation of a top 10 most-cited resource was also not significantly associated with likelihood of a correct response (p-value = 0.17). With respect to citation reliability, across the 913 questions, we identified 169 (18.5%) responses with unreliable sources, 602 (65.9%) responses with mixed reliability sources, and 142 (15.6%) responses that used entirely reliable sources. A higher degree of citation reliability was significantly associated with likelihood of a correct response (p = 0.005, ρ = 0.09).
Table 3.
List of online resources cited by Bing Chat across all multiple-choice questions.
| Source | Citations |
|---|---|
| eyewiki.aao.org | 400 |
| aao.org | 244 |
| en.wikipedia.org | 190 |
| ncbi.nlm.nih.gov | 181 |
| link.springer.com | 169 |
| mayoclinic.org | 122 |
| my.clevelandclinic.org | 84 |
| bing.com | 70 |
| NONE | 66 |
| verywellhealth.com | 60 |
| uptodate.com | 53 |
| healthline.com | 51 |
| reviewofophthalmology.com | 48 |
| sciencedirect.com | 46 |
| radiopaedia.org | 43 |
| nature.com | 42 |
| doi.org | 40 |
| reviewofoptometry.com | 40 |
| frontiersin.org | 39 |
| bmcophthalmol.biomedcentral.com | 34 |
| merckmanuals.com | 33 |
| aafp.org | 32 |
| bjo.bmj.com | 31 |
| webmd.com | 31 |
| allaboutvision.com | 30 |
| webeye.ophth.uiowa.edu | 29 |
| entokey.com | 27 |
| medicalnewstoday.com | 25 |
| researchgate.net | 24 |
| academic.oup.com | 23 |
| hindawi.com | 22 |
| eandv.biomedcentral.com | 21 |
| hopkinsmedicine.org | 21 |
| visioncenter.org | 21 |
| mdpi.com | 20 |
| nei.nih.gov | 20 |
| medlineplus.gov | 19 |
| jamanetwork.com | 18 |
| aapos.org | 16 |
| cdc.gov | 16 |
| iovs.arvojournals.org | 16 |
| journals.plos.org | 15 |
| statpearls.com | 15 |
| canada.ca | 13 |
| kenhub.com | 13 |
| msdmanuals.com | 13 |
| myvision.org | 13 |
| neuro-ophthalmology.stanford.edu | 13 |
| quizlet.com | 13 |
| rarediseases.info.nih.gov | 12 |
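A tally of this kind can be produced by reducing each cited URL to its host name and counting occurrences. The Python sketch below is illustrative only: the study collected citations in Excel, the example URLs and paths are hypothetical, and whether the original tally counted every citation or only one per response is not stated.

```python
from collections import Counter
from urllib.parse import urlparse


def citation_domain(url: str) -> str:
    """Reduce a full URL to its host name, dropping a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def tally_domains(responses):
    """responses: one list of cited URLs per model response; returns (domain, count) pairs."""
    counts = Counter()
    for urls in responses:
        if not urls:
            counts["NONE"] += 1  # response given without any citation
        for url in urls:
            counts[citation_domain(url)] += 1
    return counts.most_common()


# Hypothetical example responses (paths invented for illustration).
example = [
    ["https://eyewiki.aao.org/Glaucoma", "https://www.aao.org/education"],
    [],
    ["https://eyewiki.aao.org/Cataract"],
]
ranked = tally_domains(example)
```

Grouping by host name keeps sub-sites such as eyewiki.aao.org separate from aao.org, which matches how the two appear as distinct rows in Table 3.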
Across questions, the mean SMOG readability score was grade level 8.66 [95% CI 5.88, 11.4]. Comparatively, the mean readability of model responses was grade level 12.4 [95% CI 8.77, 16.1] and 11.4 [95% CI 7.14, 15.7] for ChatGPT-3.5 and Bing Chat, respectively (p-value < 0.0001, ρ = 0.25). However, unlike with ChatGPT-3.5, question character length was significantly negatively associated with likelihood of a correct answer from Bing Chat (p-value < 0.0001, ρ = −0.14).
Discussion
In recent years, the proposition for AI integration into medical education and clinical practice has been expanding. Recent evidence supports a promising problem-solving capability among the latest iterations of LLMs. For example, one work reports the relative success of ChatGPT in performing at approximately the 60% passing threshold on sample United States Medical Licensing Exam (USMLE) Step 1 and Step 2 Clinical Knowledge assessments [21]. More focused studies have trialled ChatGPT against ophthalmology-specific prompts. Most recently, ChatGPT-4 was shown to exhibit a performance of 84% across 125 MCQs from OphthoQuestions, a considerable improvement from previous ChatGPT-3.5 scores of 48% and 58% in January and February 2023, respectively [16]. In this work, we evaluated the latest GPT-4 and Bing Chat integration, which facilitates new access to and citations of real-time online data, against a high-yield question set tailored for ophthalmology board certification preparation [24]. Across 913 questions, ChatGPT-3.5 scored 59.69% [95% CI 56.45, 62.94] while Bing Chat was proficient in 73.60% [95% CI 70.69, 76.52], both of which performed significantly better in explicit than clinical reasoning questions and in general medicine than ophthalmology subsections. The 95% CI span of Bing Chat appears to fall short of the 84% score documented by Mihalache et al., though this disparity may be attributable to a study power differential; unequal inclusion of misleading online content into Bing Chat responses despite Microsoft AI guidelines; and inherent computational differences attributable to the underlying Prometheus model that integrates Bing Chat with GPT-4 [16, 25]. However, our results substantiate the results of Cai et al., who report performance scores of 58.8% and 71.6% for ChatGPT-3.5 and Bing Chat, respectively [19].
Nonetheless, Bing Chat’s integration into the Microsoft Edge Browser offers a more versatile and integrated experience for the ophthalmology learner, given its ability to analyse and answer users’ questions about ophthalmology content that they read online. For example, in addition to aiding with board examination MCQs, the “Page Context” feature has potential to aid users with condensing the latest research articles into more easily understood learning summaries. Future investigation should evaluate the utility of this particular feature. At this time, while Bing Chat’s online access shows promising potential to increase its reliability, further iterations ought to optimize its computational capability and use of verified online information. Thus far, research efforts have yet to rigorously confirm whether integration of ChatGPT/Bing Chat (or AI in general) into medical education results in increased student engagement and learning. However, one small trial indicated a significant benefit for student passing rates when AI was used as an academic performance evaluator to provide feedback to and motivate study participants [26, 27]. In clinical practice, growing evidence demonstrates instances of clinical benefit (e.g., AI-assisted interventions outperforming usual care) from AI implementation, although the available trials are limited in number and heterogeneous [28]. Further investigation of AI implementation in clinical practice is needed.
Bing Chat referenced 927 online entities and provided at least one citation for 836 of the 913 questions, though we found no significant association between citations and likelihood of a correct response. Highly cited sources included eyewiki.aao.org (n = 400), aao.org (n = 244), wikipedia.org (n = 190), ncbi.nlm.nih.gov (n = 181), and link.springer.com (n = 169). Although the ten most-cited sources included reputable resources, they also included less reliable, non-peer-reviewed sources. As well, qualitatively, less reliable sources were far more common outside of the top 10. Further, Bing Chat cited a mean of 4.56 online items per response, often a mixture of reliable and unreliable sources, which may have diluted the contribution of reliable information to the final response. These factors likely contributed to the initial analytical finding of no significant association between likelihood of a correct response and presence of a top 10 citation. However, through higher-resolution analysis that rated the reliability of sources used for each response, we determined that Bing Chat mixed peer-reviewed citations with less reliable sources more often than it used either peer-reviewed or unreliable sources alone. This higher-resolution analysis revealed a significant positive, albeit weak, association of reliable source use with likelihood of a correct response. Thus, Bing Chat may be further optimized for academic learners if its future iterations were programmed to be more explicit in using solely peer-reviewed academic sources. Regardless, Bing Chat remains a promising resource for trainees with a baseline ophthalmology knowledge who may critically appraise the citations associated with each response. However, the risk of idea transmission from unreliable sources is magnified for non-medical users who may not be readily familiar with characteristics of reliable online medical information.
Further refinement of the Bing Chat citation feature is needed with respect to identifying and leveraging reliable peer-reviewed resources, especially for public use.
Responses from Bing Chat had significantly better readability than those from ChatGPT-3.5, averaging a reading level of grade 11.4 [95% CI 7.14, 15.7] versus 12.4 [95% CI 8.77, 16.1], respectively (p-value < 0.0001, ρ = 0.25). For trainees, Bing Chat may therefore use more accessible language for learning, which may improve understanding and retention of online academic research material or complex ophthalmology concepts. However, both models had reading levels that were higher than the level recommended for patient use in the United States. Per the American Medical Association and National Institutes of Health, patient education materials are recommended to be written at no more than a sixth-grade reading level [29]. That said, the higher reading levels output by both models may partly reflect the technical language inherent in the input MCQs.
Integrating AI and LLMs into ophthalmology education ultimately requires adequate personal judgement, allowing the user to benefit from AI while mitigating its associated risks. For educators, given the potential for integration of AI into ophthalmic investigations (e.g., AI interpretation of visual fields), one recommendation is to challenge learners to critically appraise AI-assisted ophthalmic investigative reports in a safe environment, highlighting both successes and limitations of AI. Of more relevance to the learner, we recommend that AI be used as a non-standalone supplementary learning tool. For example, ChatGPT or Bing Chat may be used to provide a concise primer on a topic of interest, before delving into deeper peer-reviewed material (which may even be recommended directly by the LLM). Alternatively, we recommend that users consider running challenging excerpts from their reading through ChatGPT with a request to explain them in simpler terms, before returning to the source text for further comprehension. If working on practice questions, we recommend using LLMs as an adjunct for further explanation of reasoning (provided the model attains the correct answer) if the learner is unable to completely understand the rationale for a given response. LLMs may also be used to generate new, potentially important considerations for open-ended questions (e.g., unconsidered potential for drug interactions), which the user can then confirm through further reading. However, we recommend against using an LLM as a lone resource to guide learning or clinical management, given the inherent risk of model hallucination, which can lead to clinical error and catastrophic patient outcomes.
The strengths of this work include its high power, conferred by using over 900 MCQs, and its comprehensive analysis of additional features such as cited sources and readability scoring. As well, the BCSC collection is a comprehensive resource collated by a community of over 100 ophthalmologists worldwide [24]. Sponsored by the American Academy of Ophthalmology for OKAP exam preparation, this collection is also deemed the primary educational resource for all European ophthalmology training programs [24]. However, some study limitations should be considered. The influence of non-correctable factors such as device location may have introduced regional biases when generating responses. As well, we acknowledge that ChatGPT and Bing Chat generate unique answers, even to identical queries, though this effect is mitigated by the reporting of 95% CIs, which may account for this variability [15, 16]. As well, this analysis did not evaluate image- or video-based questions, which are not readily analysed on the freely available ChatGPT-3.5. This work also did not evaluate free-response questions. Further, this study did not include human resident test-takers, and so we are unable to directly compare either model’s performance to that of human test-takers. However, one study reports a human resident performance of 75.7% across 1073 similar BCSC multiple-choice questions; this performance appears to exceed that of ChatGPT-3.5 in this work, while being on par with Bing Chat [30]. Finally, this work evaluated ChatGPT-3.5 rather than ChatGPT-4 since the former remains free for public use and therefore continues to be the most widely available option. As a result, this study is unable to compare ChatGPT-4 with Bing Chat.
Among other research avenues, a notable project of future investigation may involve uploading the AAO/BCSC manual to ChatGPT or Bing Chat, where possible, and subsequently asking the LLM closed- and open-ended questions about the referenced text. Presumably, if the LLM’s responses are more accurate than in the present study, it may suggest that the LLM’s weakness lies in unassisted research capability (involving web-search and synthesis of findings) rather than its ability to analyse or interpret a given text and answer related questions.
Bing Chat yielded a promising improvement over the currently freely available ChatGPT-3.5 for providing correct answers to board certification preparation materials. As well, its online integration, citation features, and more user-friendly language may make it a more suitable tool for ophthalmology learners. Ongoing caution should be exercised when interpreting the results of AI. Whether future iterations of ChatGPT or Bing Chat provide better performance on MCQs, or other question types, deserves further investigation.
Summary
What was known before
In limited power studies, ChatGPT-3.5, ChatGPT-4.0, and Bing Chat showed promising performance on ophthalmology board examination questions.
What this study adds
High powered analysis showed that ChatGPT-3.5 and Bing Chat have a current performance of 59.69% [95% CI 56.45, 62.94] and 73.60% [95% CI 70.69, 76.52], respectively.
However, Bing Chat used a mix of reliable and non-peer-reviewed sources to inform its responses.
While promising tools for aiding ophthalmology learners, Bing Chat and ChatGPT do not have perfect performance and often resort to non-peer-reviewed online content.
We recommend a high degree of caution when appraising responses for learning or clinical practice.
Author contributions
Conception/Design/Acquisition/Analysis/Interpretation (BT, JAM), Acquisition (BT, NH, JM), Drafting/Revision (all authors), Final approval (all authors), Agreement of accountability (all authors).
Funding
BT is funded by the Eye Foundation of Canada. No other financial support.
Data availability
Available upon reasonable request to the corresponding author.
Competing interests
The authors declare no competing interests.
Ethics approval
Under article 2.4 of the Tri-Council Policy Statement, this study was exempt from institutional review board approval since all data was gathered from publicly accessible published works. No identifiable forms of information were generated in this work.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s41433-024-03037-w.
References
- 1. Honavar SG. Artificial intelligence in ophthalmology - Machines think! Indian J Ophthalmol. 2022;70:1075–9.
- 2. Abràmoff MD, Lou Y, Erginay A, Clarida W, Amelon R, Folk JC, et al. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Investig Ophthalmol Vis Sci. 2016;57:5200–6. doi: 10.1167/iovs.16-19964.
- 3. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology. 2017;124:962–9. doi: 10.1016/j.ophtha.2017.02.008.
- 4. Ting DSW, Cheung CY-L, Lim G, Tan GSW, Quang ND, Gan A, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318:2211–23. doi: 10.1001/jama.2017.18152.
- 5. Grassmann F, Mengelkamp J, Brandl C, Harsch S, Zimmermann ME, Linkohr B, et al. A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology. 2018;125:1410–20. doi: 10.1016/j.ophtha.2018.02.037.
- 6. Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol. 2017;135:1170–6. doi: 10.1001/jamaophthalmol.2017.3782.
- 7. Ting DSW, Pasquale LR, Peng L, Campbell JP, Lee AY, Raman R, et al. Artificial intelligence and deep learning in ophthalmology. Br J Ophthalmol. 2019;103:167. doi: 10.1136/bjophthalmol-2018-313173.
- 8. Singh S, Djalilian A, Ali MJ. ChatGPT and ophthalmology: exploring its potential with discharge summaries and operative notes. Semin Ophthalmol. 2023;38:503–7. doi: 10.1080/08820538.2023.2209166.
- 9. ChatGPT. OpenAI. https://openai.com/chatgpt. Accessed 30 Jul 2023.
- 10. Ting DSJ, Tan TF, Ting DSW. ChatGPT in ophthalmology: the dawn of a new era? Eye. 2023.
- 11. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930–40. doi: 10.1038/s41591-023-02448-8.
- 12. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595.
- 13. Models. OpenAI. https://platform.openai.com/docs/models/overview. Accessed 30 Jul 2023.
- 14. Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare. 2023;11:887.
- 15. Mihalache A, Popovic MM, Muni RH. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. 2023;141:589–97. doi: 10.1001/jamaophthalmol.2023.1144.
- 16. Mihalache A, Huang RS, Popovic MM, Muni RH. Performance of an upgraded artificial intelligence chatbot for ophthalmic knowledge assessment. JAMA Ophthalmol. 2023;141:798–800. doi: 10.1001/jamaophthalmol.2023.2754.
- 17. Bing Chat. Microsoft. https://www.microsoft.com/en-us/edge/features/bing-chat?form=MT00D8. Accessed 30 Jul 2023.
- 18. Responsible and trusted AI. Microsoft. https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/innovate/best-practices/trusted-ai. Accessed 30 Jul 2023.
- 19. Cai LZ, Shaheen A, Jin A, Fukui R, Yi JS, Yannuzzi N, et al. Performance of generative large language models on ophthalmology board style questions. Am J Ophthalmol. 2023;254:141–9. doi: 10.1016/j.ajo.2023.05.024.
- 20. Kleebayoon A, Wiwanitkit V. Comment on performance of generative large language models on ophthalmology board style questions. Am J Ophthalmol. 2023;256:200.
- 21. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digit Health. 2023;2:e0000198. doi: 10.1371/journal.pdig.0000198.
- 22. McLaughlin GH. SMOG grading: a new readability formula. J Read. 1969;12:639–46.
- 23. Ishak NM, Bakar AYA. Qualitative data management and analysis using NVivo: An approach used to examine leadership qualities among student leaders. Educ Res J. 2012;2:94–103.
- 24. Basic and clinical science course residency set. American Academy of Ophthalmology. https://store.aao.org/basic-and-clinical-science-course-residency-set.html. Accessed 30 Jul 2023.
- 25. Mehdi Y. Reinventing search with a new AI-powered Microsoft Bing and Edge, your copilot for the web. Microsoft. https://blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/. Accessed 3 Aug 2023.
- 26. Lee H. The rise of ChatGPT: exploring its potential in medical education. Anat Sci Educ. 2023;00:1–6. doi: 10.1002/ase.2270.
- 27. Hasan MR, Khan B. An AI-based intervention for improving undergraduate STEM learning. PLoS ONE. 2023;18:e0288844. doi: 10.1371/journal.pone.0288844.
- 28. Lam T, Cheung M, Munro Y, Lim K, Shung D, Sung J. Randomized controlled trials of artificial intelligence in clinical practice: systematic review. J Med Internet Res. 2022;24:e37188. doi: 10.2196/37188.
- 29. Grabeel K, Russomanno J, Oelschlegel S, Tester E, Heidel R. Computerized versus hand-scored health literacy tools: a comparison of Simple Measure of Gobbledygook (SMOG) and Flesch-Kincaid in printed patient education materials. J Med Library Assoc. 2018;106:38–45.
- 30. Taloni A, Borselli M, Scarsi V, Rossi C, Coco G, Scorcia V, et al. Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. Sci Rep. 2023;13:18562. doi: 10.1038/s41598-023-45837-2.
