Key Points
Question
How does the artificial intelligence chatbot ChatGPT-4 (OpenAI) perform in processing ophthalmic imaging data?
Findings
In this cross-sectional study including 136 ophthalmic cases provided by OCTCases, the chatbot answered 70% of all multiple-choice questions correctly, performing better on nonimage–based questions (82%) than image-based questions (65%).
Meaning
In this study, the chatbot demonstrated a fair performance on multiple-choice questions pertaining to ophthalmic cases that required multimodal input; as multimodal chatbots become increasingly widespread, it is necessary to stress their appropriate integration within medicine.
Abstract
Importance
Ophthalmology is reliant on effective interpretation of multimodal imaging to ensure diagnostic accuracy. The new ability of ChatGPT-4 (OpenAI) to interpret ophthalmic images has not yet been explored.
Objective
To evaluate the performance of the novel release of an artificial intelligence chatbot that is capable of processing imaging data.
Design, Setting, and Participants
This cross-sectional study used a publicly available dataset of ophthalmic cases from OCTCases, a medical education platform based out of the Department of Ophthalmology and Vision Sciences at the University of Toronto, with accompanying clinical multimodal imaging and multiple-choice questions. Across 137 available cases, 136 contained multiple-choice questions (99%).
Exposures
The chatbot answered questions requiring multimodal input from October 16 to October 23, 2023.
Main Outcomes and Measures
The primary outcome was the accuracy of the chatbot in answering multiple-choice questions pertaining to image recognition in ophthalmic cases, measured as the proportion of correct responses. χ² tests were conducted to compare the proportion of correct responses across different ophthalmic subspecialties.
Results
A total of 429 multiple-choice questions from 136 ophthalmic cases and 448 images were included in the analysis. The chatbot answered 299 of the 429 multiple-choice questions correctly across all cases (70%). The chatbot’s performance was better on retina questions than neuro-ophthalmology questions (77% vs 58%; difference = 18%; 95% CI, 7.5%-29.4%; χ²₁ = 11.4; P < .001). The chatbot achieved a better performance on nonimage–based questions compared with image-based questions (82% vs 65%; difference = 17%; 95% CI, 7.8%-25.1%; χ²₁ = 12.2; P < .001). The chatbot performed best on questions in the retina category (77% correct) and poorest in the neuro-ophthalmology category (58% correct), with intermediate performance on questions from the ocular oncology (72% correct), pediatric ophthalmology (68% correct), uveitis (67% correct), and glaucoma (61% correct) categories.
Conclusions and Relevance
In this study, the recent version of the chatbot accurately responded to approximately two-thirds of multiple-choice questions pertaining to ophthalmic cases based on imaging interpretation. The multimodal chatbot performed better on questions that did not rely on the interpretation of imaging modalities. As the use of multimodal chatbots becomes increasingly widespread, it is imperative to stress their appropriate integration within medical contexts.
Introduction
Artificial intelligence (AI) chatbots may have transformative potential in ophthalmology, given their capability to reshape patient engagement and health care provision. Previous literature has highlighted their ability to offer high diagnostic accuracy, contribute to patient education, enable remote monitoring of chronic eye conditions, and ease the burden on health care professionals. Nevertheless, addressing regulatory compliance, privacy concerns, and the seamless integration of AI chatbots within health care systems necessitates further exploration. There has been immense interest in cutting-edge large language models (LLMs), particularly ChatGPT-4 (OpenAI), given its capacity for real-time analysis of medical prompts. Previous versions of this chatbot were limited to analyzing text-based prompts. Our prior investigations found that the performance of this chatbot in medical and ophthalmic settings has been improving at an impressive rate. However, the new ability of the recent version of the chatbot to interpret ophthalmic images has not yet been explored.
Ophthalmology is reliant on effective interpretation of multimodal imaging to ensure diagnostic accuracy. Multimodal imaging enhances patient outcomes through earlier and more precise diagnoses, and more effective follow-up visits and treatments. The new release of the chatbot holds great potential in enhancing the efficiency of ophthalmic image interpretation, which may reduce workload on clinicians, mitigate variability in interpretations and errors, and ultimately, lead to improved patient outcomes. Our present study aims to evaluate the performance of the multimodal release of the chatbot in image interpretation of ophthalmic teaching cases.
Methods
We used a publicly available dataset of ophthalmic cases from OCTCases, a medical education platform based out of the Department of Ophthalmology and Vision Sciences at the University of Toronto in Toronto, Canada. All cases on OCTCases are comprehensively reviewed by the platform’s founders (A.P. and J.K.) and at least 1 board-certified ophthalmologist from the University of Toronto. Cases are organized into the following categories: retina, neuro-ophthalmology, uveitis, glaucoma, ocular oncology, and pediatric ophthalmology. Our investigation analyzed all multiple-choice questions across all available ophthalmic cases on OCTCases. The following data were collected from each case: the date on which the case was inputted into the chatbot, the corresponding category on OCTCases, the number and types of ophthalmic images provided in each case, the length of the chatbot’s responses in characters, the duration of the chatbot’s responses in seconds, and whether a question directly required image interpretation to be answered. To allow for objective grading of the chatbot’s responses, only multiple-choice questions were included in our statistical analysis; open-ended questions without multiple-choice options were excluded to mitigate subjective grading of the chatbot’s responses. Our study adhered to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline. Institutional review board approval was not required by the University of Toronto for this study or for the publication of cases with ophthalmic imaging on the OCTCases platform, as all cases were entirely anonymized and contained modified demographic data to ensure data were not identifiable to individual patients.
We created a new ChatGPT Plus account to ensure that no conversation history with the chatbot preceded study initiation. This account was granted multimodal capabilities by OpenAI on October 15, 2023, and all cases on OCTCases, along with their ophthalmic imaging, were inputted into the chatbot from October 16 to October 23, 2023. Before the first question in each case, we primed the chatbot with the same description of the patient’s presentation and associated imaging as noted on OCTCases. All descriptions of patient presentations, ophthalmic imaging, and questions were inputted into the chatbot exactly as they appeared on OCTCases. A sample entry of an OCTCases case into the chatbot is shown in eFigure 1 in Supplement 1. We refreshed the chatbot and cleared all conversation history between cases to mitigate the influence of concurrent conversations on its interpretation of subsequent cases. All the chatbot’s responses were manually reviewed by 2 independent reviewers (A.M. and R.S.H.) to determine which multiple-choice answer was selected. If the chatbot selected more than 2 multiple-choice answers, “all of the above,” or “none of the above” for questions that did not offer these as answer choices, we deemed its response incorrect.
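The cases in this study were entered manually through the ChatGPT web interface, with a fresh conversation per case. Purely as an illustration of what an automated version of such a per-case workflow might look like, the sketch below submits one case description, its images, and a question in a single request through the OpenAI Python SDK; the model name, file paths, and helper function are hypothetical and do not reflect how the study was conducted.

```python
# Hypothetical sketch only: submitting one OCTCases-style case (priming text,
# images, and a multiple-choice question) in a single, fresh request via the
# OpenAI Python SDK. The study itself used the ChatGPT web interface with a new
# conversation per case; the model name and file paths below are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Return a base64 data URL for a local ophthalmic image (e.g., a macular OCT)."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


def ask_case_question(priming_text: str, image_paths: list[str], question: str) -> str:
    """Send the case description, its imaging, and one question together,
    mirroring the per-case reset used in the study (no carried-over history)."""
    content = [{"type": "text", "text": priming_text + "\n\n" + question}]
    content += [
        {"type": "image_url", "image_url": {"url": encode_image(p)}}
        for p in image_paths
    ]
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder multimodal model name
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


# Example call (hypothetical file names):
# answer = ask_case_question(case_text, ["macular_oct.png", "fundus.png"], mcq_text)
```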
Our primary end point was the chatbot’s accuracy in answering multiple-choice questions pertaining to ophthalmic cases across various categories, measured as the proportion of correct responses. Our secondary end points were the differences in the chatbot’s performance on image-based and nonimage–based questions, the character length of its responses, and the association between the number of images inputted per case and the proportion of multiple-choice questions answered correctly per case. We also conducted an exploratory analysis of the chatbot’s performance based on the type of imaging modality uploaded into the chatbot directly preceding the input of a multiple-choice question. We conducted χ² tests in MedCalc to compare proportions of correct responses across 2 samples. We performed Mann-Whitney U tests to compare differences between observed median response lengths across 2 samples. Univariable and multivariable linear regression analyses were performed to investigate associations between the number of images inputted per case and the proportion of multiple-choice questions answered correctly by the chatbot per case, adjusted for ophthalmic subspecialty. We recorded 2-tailed P values, and P values were not adjusted for multiple analyses.
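All analyses were conducted in MedCalc. As a hedged illustration only, the snippet below re-creates the same classes of comparison with common Python libraries, using the retina vs neuro-ophthalmology counts reported in the Results for the χ² test; the simple Wald interval shown approximates, but does not exactly reproduce, MedCalc’s confidence interval for a difference in proportions, and the response-length and regression inputs are placeholder data.

```python
# Illustrative Python re-creation of the study's statistical comparisons.
# The actual analyses were run in MedCalc; the 2x2 counts below come from the
# published Table, the Wald CI is only an approximation of MedCalc's interval,
# and the response-length and regression inputs are placeholder data.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# Chi-square test of 2 proportions: retina vs neuro-ophthalmology (Table counts)
table = np.array([[160, 49],    # retina: correct, incorrect (160/209 = 77%)
                  [61, 44]])    # neuro-ophthalmology: correct, incorrect (61/105 = 58%)
chi2, p, dof, _ = stats.chi2_contingency(table, correction=False)
print(f"chi2({dof}) = {chi2:.1f}, P = {p:.4f}")  # ~11.4, P < .001

# Approximate (Wald) 95% CI for the difference in proportions
p1, p2 = 160 / 209, 61 / 105
se = np.sqrt(p1 * (1 - p1) / 209 + p2 * (1 - p2) / 105)
print(f"difference = {p1 - p2:.1%} "
      f"(95% CI, {p1 - p2 - 1.96 * se:.1%} to {p1 - p2 + 1.96 * se:.1%})")

# Mann-Whitney U test comparing response lengths (placeholder character counts)
correct_lengths = [312, 420, 198, 505, 260]
incorrect_lengths = [640, 720, 533, 810, 455]
u_stat, p_mw = stats.mannwhitneyu(correct_lengths, incorrect_lengths,
                                  alternative="two-sided")
print(f"Mann-Whitney U = {u_stat}, P = {p_mw:.3f}")

# Case-level linear regression: proportion correct ~ number of images,
# adjusted for subspecialty (placeholder rows; the study used 1 row per case)
df = pd.DataFrame({
    "prop_correct": [0.80, 0.60, 0.50, 1.00, 0.70, 0.67],
    "n_images": [3, 5, 2, 1, 4, 6],
    "subspecialty": ["retina", "retina", "neuro", "neuro", "uveitis", "uveitis"],
})
model = smf.ols("prop_correct ~ n_images + C(subspecialty)", data=df).fit()
print(model.params)
```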
Results
A total of 137 cases on OCTCases were considered, and 1 was excluded because it contained only open-ended questions. Thus, 136 cases with 448 images on OCTCases were included in the analysis. Across all included cases, 429 questions (82%) were formatted as multiple-choice questions and were included in the statistical analysis. Of the 429 multiple-choice questions, 303 (71%) directly required image interpretation to be answered, whereas 126 (29%) were not based on images and tended to be knowledge questions pertaining to a specified disease or its management. A total of 125 cases were accompanied by macular optical coherence tomography (OCT) scans (92%), 82 cases by fundus images (60%), 22 cases by retinal nerve fiber layer (RNFL) OCT analyses (16%), 17 cases by Humphrey visual fields (HVFs) (13%), 15 cases by fundus autofluorescence (FAF) (11%), 9 cases by intravenous fluorescein angiography (IVFA) findings (7%), 9 cases by ganglion cell complex (GCC) OCT analyses (7%), 7 cases by OCT angiography (OCTA) (5%), and 3 cases by A-scan ultrasonography (2%). The chatbot’s mean (SD) response length to multiple-choice questions was 530.2 (500.9) characters, and its mean (SD) response duration was 16.0 (12.2) seconds.
Overall, the recent version of the chatbot answered 299 of the multiple-choice questions correctly across all cases (70%) (Table). Examples of the chatbot’s responses to questions answered correctly and incorrectly can be found in the eTable in Supplement 1, and an example of the latter is shown in the Figure. The chatbot performed best on questions in the retina category, where it selected the correct multiple-choice response in 160 of 209 questions (77%). The chatbot performed the poorest on questions in the neuro-ophthalmology category, where it selected the correct multiple-choice response in 61 of 105 questions (58%). The chatbot demonstrated intermediate performance on questions from the ocular oncology (21 of 29 correct [72%]), pediatric ophthalmology (17 of 25 correct [68%]), uveitis (29 of 43 correct [67%]), and glaucoma (11 of 18 correct [61%]) categories. Statistical analysis revealed that the chatbot’s performance was better on retina questions than neuro-ophthalmology questions (difference = 18%; 95% CI, 7.5%-29.4%; χ²₁ = 11.4; P < .001), although its performance was similar for all other pairwise comparisons of question categories.
Table. Summary of the Chatbot’s Performance Across Multiple-Choice Questions on OCTCases.
| Subspecialty category | Question type | No. of questions | No. of correct responses (%) | No. of incorrect responses (%) |
|---|---|---|---|---|
| All | All | 429 | 299 (70) | 130 (30) |
| | Image-based | 303 | 196 (65) | 107 (35) |
| | Nonimage–based | 126 | 103 (82) | 23 (18) |
| Retina | All | 209 | 160 (77) | 49 (23) |
| | Image-based | 145 | 106 (73) | 39 (27) |
| | Nonimage–based | 64 | 54 (84) | 10 (16) |
| Neuro-ophthalmology | All | 105 | 61 (58) | 44 (42) |
| | Image-based | 76 | 41 (54) | 35 (46) |
| | Nonimage–based | 29 | 20 (69) | 9 (31) |
| Uveitis | All | 43 | 29 (67) | 14 (33) |
| | Image-based | 30 | 19 (63) | 11 (37) |
| | Nonimage–based | 13 | 10 (77) | 3 (23) |
| Glaucoma | All | 18 | 11 (61) | 7 (39) |
| | Image-based | 17 | 10 (59) | 7 (41) |
| | Nonimage–based | 1 | 1 (100) | 0 |
| Ocular oncology | All | 29 | 21 (72) | 8 (28) |
| | Image-based | 18 | 11 (61) | 7 (39) |
| | Nonimage–based | 11 | 10 (91) | 1 (9) |
| Pediatric ophthalmology | All | 25 | 17 (68) | 8 (32) |
| | Image-based | 17 | 9 (53) | 8 (47) |
| | Nonimage–based | 8 | 8 (100) | 0 |
Figure. Example of an Image-Based Question Answered Incorrectly by the Chatbot in the Retina Category.
Across the 303 multiple-choice questions that directly required image interpretation to be answered, the chatbot answered 196 correctly (65%). Across the 126 nonimage–based questions, the chatbot answered 103 correctly (82%). Hence, the chatbot achieved a better performance on nonimage–based questions compared with image-based questions (difference = 17%; 95% CI, 7.8%-25.1%; χ²₁ = 12.2; P < .001). This finding was particularly evident for questions in the pediatric ophthalmology category (difference = 47%; 95% CI, 8.5%-69.0%; χ²₁ = 12.2; P = .02); however, the chatbot’s performance on image-based and nonimage–based questions was similar for questions in the retina (difference = 11%; 95% CI, −1.4% to 21.7%; χ²₁ = 3.1; P = .08), neuro-ophthalmology (difference = 15%; 95% CI, −6.1% to 32.7%; χ²₁ = 1.9; P = .17), uveitis (difference = 14%; 95% CI, −17.4% to 36.8%; χ²₁ = 0.7; P = .39), glaucoma (difference = 41%; 95% CI, −40.5% to 64.0%; χ²₁ = 0.6; P = .43), and ocular oncology (difference = 30%; 95% CI, −4.3% to 53.5%; χ²₁ = 2.9; P = .09) categories. The proportions of multiple-choice questions answered correctly by the chatbot, stratified by the imaging modality uploaded directly before a question, were as follows: OCTA (7 of 7 correct [100%]), GCC OCT analyses (7 of 9 correct [78%]), FAF (8 of 11 correct [73%]), RNFL OCT analyses (14 of 20 correct [70%]), HVFs (11 of 16 correct [69%]), fundus imaging (42 of 62 correct [68%]), macular OCT (103 of 163 correct [63%]), IVFA (2 of 5 correct [40%]), and A-scan ultrasonography (1 of 3 correct [33%]).
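As an arithmetic check only, the overall image-based vs nonimage–based comparison above can be approximately reproduced from the counts in the Table; the sketch below uses a simple Wald interval, so its bounds differ slightly from the reported 95% CI, which was computed in MedCalc.

```python
# Arithmetic check of the overall image-based vs nonimage-based comparison,
# using the counts in the Table; the Wald interval below only approximates the
# reported 95% CI, which was computed in MedCalc.
import numpy as np
from scipy import stats

# nonimage-based: 103/126 correct (82%); image-based: 196/303 correct (65%)
table = np.array([[103, 126 - 103],
                  [196, 303 - 196]])
chi2, p, dof, _ = stats.chi2_contingency(table, correction=False)

p_non, p_img = 103 / 126, 196 / 303
diff = p_non - p_img
se = np.sqrt(p_non * (1 - p_non) / 126 + p_img * (1 - p_img) / 303)
print(f"difference = {diff:.0%}; 95% CI, {diff - 1.96 * se:.1%} to "
      f"{diff + 1.96 * se:.1%}; chi2(1) = {chi2:.1f}; P = {p:.4f}")  # ~17%, ~12.2
```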
Across cases, the chatbot’s median response length was shorter for its correct responses compared with its incorrect responses (z score = −3.6; P < .001). Moreover, the chatbot’s median response length was longer for multiple-choice questions that directly required image interpretation to be answered compared with those that did not (z score = 3.1; P = .002). Overall, the number of images inputted into the chatbot per case was not associated with the proportion of multiple-choice questions answered correctly per case.
Discussion
In our investigation, the recent version of the chatbot was able to accurately respond to most multiple-choice questions with multimodal input from OCTCases, achieving an accuracy of 70% in the overall sample. The chatbot performed best on cases pertaining to retina and worst on cases pertaining to neuro-ophthalmology. Moreover, it performed considerably better on questions that did not directly require image interpretation compared with those that did. This finding was particularly evident in the category of pediatric ophthalmology, indicating that the chatbot’s image-processing capabilities may be less robust in more niche subspecialties. Nonetheless, our results suggest that this chatbot has the potential to interpret findings from various modalities of ophthalmic imaging, including OCT, fundus images, RNFL analyses, HVF testing, FAF, IVFA, GCC analyses, and OCTA, with relative accuracy. As the chatbot’s accuracy increases with time, it may develop the potential to inform clinical decision-making in ophthalmology via real-time analysis of ophthalmic cases.
In a prior investigation of 125 text-based multiple-choice questions commonly used by trainees in preparing for ophthalmology board certification, we found that the previous version of the chatbot answered 46% of questions correctly in January 2023. The proportion of questions answered correctly varied based on subspecialty: uveitis (50%), glaucoma (50%), pathology and tumors (50%), pediatrics (44%), neuro-ophthalmology (43%), and retina and vitreous (0%). In a follow-up investigation using the same text-based questions, the recent version of the chatbot answered 84% of questions correctly in March 2023. Remarkably, the chatbot’s accuracy increased across subspecialties: uveitis (100%), retina and vitreous (100%), pediatrics (89%), neuro-ophthalmology (86%), glaucoma (83%), and pathology and tumors (75%). Compared with this previous study, the chatbot’s performance on image-based questions in the present study appears to be inferior. However, given that this is a novel addition to the chatbot’s platform, we anticipate its performance on image-based questions may increase considerably with time, as was previously observed in our analyses of text-based questions.
Compared with traditional computer-vision AI systems that can interpret visual data, LLMs are easily accessible and do not necessarily require prior familiarity with computer vision, AI, or coding. While our current investigation found that the chatbot achieved an accuracy of 65% when responding to multiple-choice questions requiring ophthalmic imaging interpretation, its performance remains inferior to previously published AI systems designed for screening or diagnosing retinal pathologies from ophthalmic imaging, such as OCT scans and fundus images. For instance, relative to professional graders, a previous deep learning system by Ting et al achieved a sensitivity of 90.5% and specificity of 91.6% for detecting referable diabetic retinopathy, a sensitivity of 100% and specificity of 91.1% for vision-threatening diabetic retinopathy, a sensitivity of 96.4% and specificity of 87.2% for possible glaucoma, and a sensitivity of 93.2% and specificity of 88.7% for age-related macular degeneration using a validation dataset of retinal images. Overall, AI-assisted OCT analysis can be highly accurate, sensitive, and specific for the diagnosis of retinal disease, and it is possible that the incorporation of more robust AI algorithms into LLMs may further improve their multimodal capabilities.
In our investigation, the chatbot performed better on cases pertaining to the subspecialty of retina compared with neuro-ophthalmology. This difference may have been mediated by the imaging modalities most commonly used in each subspecialty: cases in the retina category were predominantly accompanied by macular OCT and fundus images, whereas cases in the neuro-ophthalmology category had a higher proportion of RNFL and GCC OCT analyses. Macular OCT and fundus imaging are considered the most common ophthalmic imaging modalities; thus, it is plausible that the current release of the chatbot may be better equipped to interpret more widely used ophthalmic imaging modalities. Nonetheless, our exploratory analysis found that the chatbot performed slightly better on multiple-choice questions when RNFL or GCC OCT analyses were uploaded into the chatbot directly preceding a question compared with macular OCTs. However, it is important to acknowledge the considerably smaller sample size of RNFL or GCC OCT analyses relative to macular OCTs. Moreover, a recent retrospective review by Kalaw, Tavakoli, and Baxter found that of 965 articles published in the journal Ophthalmology from January 2018 to December 2022, the highest proportion of articles were in the subspecialty of retina (30.7%), whereas a lower proportion were in neuro-ophthalmology (2.7%). Therefore, it is plausible that the chatbot has a stronger knowledge corpus in retina, which was overrepresented on OCTCases compared with other subspecialties.
LLMs may eventually have a role as clinical decision-making tools. Interestingly, in some cases the chatbot provided interpretations of ophthalmic images beyond the information explicitly stated on OCTCases, as shown in eFigure 2 in Supplement 1. However, it remains unclear whether LLMs have reached a stage where they can improve clinical workflow for health care professionals. The recent version of the chatbot enables increased access to the processing of ophthalmic information, which may benefit patients who are interested in engaging in independent learning. Concurrently, there is also the downside of increased risks of misinformation and information overload. With the chatbot’s new multimodal capabilities, concerns exist regarding the potential for widespread misuse of AI chatbots within medical contexts. For instance, anyone with internet access can use an AI chatbot relatively easily, so clear guidelines are imperative to enforce the protection of patient confidentiality and privacy, especially with respect to medical images. Health care practitioners and patients must exercise tremendous caution when uploading images into the chatbot, especially if images are not deidentified. Medicolegal challenges pertaining to liability in cases of misdiagnosis or misleading recommendations also remain principal concerns associated with AI chatbots. Proactively addressing these ethical and legal concerns is essential before any potential formal adoption of this technology within medicine. It is also imperative to emphasize that AI chatbots should not be used to replace human expertise, nor should users place false confidence in their output.
Limitations
Our study has several limitations. Ophthalmic research is constantly advancing, and in its current form, the chatbot may not necessarily be trained to adapt to the rapidly changing landscape of ophthalmology and its multimodal imaging technologies, limiting its performance. All questions in our analysis were obtained from a single source, OCTCases, which may have limited the generalizability of our results; the chatbot’s performance on similar questions is expected to vary if other datasets with varying degrees of image and patient-description quality are tested. The difficulty of multiple-choice questions on OCTCases, which tend to range from an early resident level to the level of a fellow or staff ophthalmologist, may also have varied considerably across questions and subspecialties. Therefore, the chatbot’s ability to answer questions of varying difficulty cannot be ascertained from our results; instead, our results serve to demonstrate that the chatbot is currently able to interpret ophthalmic imaging with adequate capacity. The chatbot’s performance on OCTCases cannot be compared with the average performance of the website’s users, as these data are not available.

Furthermore, our results cannot be extrapolated to inform the use of this AI chatbot as a clinical decision-making tool, as imaging modalities and quality may differ considerably across centers worldwide. Although the chatbot performed adequately on multiple-choice questions in our study, AI tools are generally prompted by clinicians and patients with open-ended questions, which were not assessed in our investigation. Our findings must also be interpreted within the context of their time, as future iterations of the chatbot may yield different results, likely due to an expanded knowledge corpus.

Although most questions inputted into the chatbot directly required image interpretation to produce a valid response, text-based information pertaining to patient presentation was used to prime the chatbot’s responses and may have influenced the interpretation of images. Hence, our findings cannot be generalized to the chatbot’s ability to interpret stand-alone ophthalmic images without contextual details about a patient. Given the strong contextual dependency of the chatbot, our results may also not be generalizable to different methods of inputting prompts, as our methodology inputted prompts exactly as they appeared on OCTCases. Finally, our analysis of the chatbot’s performance based on the type of ophthalmic imaging uploaded directly before a multiple-choice question was substantially limited by confounding and must strictly be interpreted as an exploratory analysis. In many cases, multiple imaging modalities were uploaded together for a multiple-choice question, precluding our ability to analyze the chatbot’s performance at interpreting independent imaging modalities. For instance, if a macular OCT and an RNFL OCT analysis were both uploaded directly before a question, the chatbot would have interpreted both imaging modalities when answering, and the influence of either could not be isolated.
Conclusions
In this study, the recent version of the chatbot accurately responded to most multiple-choice questions pertaining to ophthalmic cases requiring multimodal input from OCTCases, albeit performing better on questions that did not rely on ophthalmic imaging interpretation. As multimodal LLMs become increasingly widespread, it remains imperative to continuously stress their appropriate use in medicine and highlight concerns surrounding confidentiality and bioethics. Future studies should continue investigating the chatbot’s ability to interpret different ophthalmic imaging modalities to gauge whether it can eventually become as accurate as specific machine learning systems in ophthalmology. Future work should also evaluate the chatbot’s ability to interpret ophthalmic images that are not publicly accessible.
eFigure 1. Priming the chatbot with patient’s presentation and associated imaging as noted on OCTCases, followed by the first question in the sample retina case
eTable. Examples of the chatbot’s responses to questions answered correctly and incorrectly
eFigure 2. Chatbot provided interpretations of ophthalmic images beyond information explicitly stated on OCTCases
Data sharing statement
References
1. Tan TF, Thirunavukarasu AJ, Jin L, et al. Artificial intelligence and digital health in global eye health: opportunities and challenges. Lancet Glob Health. 2023;11(9):e1432-e1443. doi:10.1016/S2214-109X(23)00323-6
2. Lyons RJ, Arepalli SR, Fromal O, Choi JD, Jain N. Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can J Ophthalmol. 2023;S0008-4182(23)00234-X. doi:10.1016/j.jcjo.2023.07.016
3. Keenan TDL, Loewenstein A. Artificial intelligence for home monitoring devices. Curr Opin Ophthalmol. 2023;34(5):441-448. doi:10.1097/ICU.0000000000000981
4. Bernstein IA, Zhang YV, Govil D, et al. Comparison of ophthalmologist and large language model chatbot responses to online patient eye care questions. JAMA Netw Open. 2023;6(8):e2330320. doi:10.1001/jamanetworkopen.2023.30320
5. Srivastav S, Chandrakar R, Gupta S, et al. ChatGPT in radiology: the advantages and limitations of artificial intelligence for medical imaging diagnosis. Cureus. 2023;15(7):e41435. doi:10.7759/cureus.41435
6. Mihalache A, Huang RS, Popovic MM, Muni RH. ChatGPT-4: an assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Accessed January 30, 2024. doi:10.1080/0142159X.2023.2249588
7. Mihalache A, Popovic MM, Muni RH. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. 2023;141(6):589-597. doi:10.1001/jamaophthalmol.2023.1144
8. Mihalache A, Huang RS, Popovic MM, Muni RH. Performance of an upgraded artificial intelligence chatbot for ophthalmic knowledge assessment. JAMA Ophthalmol. 2023;141(8):798-800. doi:10.1001/jamaophthalmol.2023.2754
9. Schuster AK, Wolfram C, Hudde T, et al. Impact of routinely performed optical coherence tomography examinations on quality of life in patients with retinal diseases: results from the ALBATROS data collection. J Clin Med. 2023;12(12):3881. doi:10.3390/jcm12123881
10. Huang D, Swanson EA, Lin CP, et al. Optical coherence tomography. Science. 1991;254(5035):1178-1181. doi:10.1126/science.1957169
11. OCTCases. Homepage. Accessed January 30, 2024. https://www.octcases.com/
12. Schoonjans F, Zalata A, Depuydt CE, Comhaire FH. MedCalc: a new computer program for medical statistics. Comput Methods Programs Biomed. 1995;48(3):257-262. doi:10.1016/0169-2607(95)01703-8
13. Campbell I. Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Stat Med. 2007;26(19):3661-3675. doi:10.1002/sim.2832
14. Richardson JTE. The analysis of 2 × 2 contingency tables: yet again. Stat Med. 2011;30(8):890. doi:10.1002/sim.4116
15. MedCalc. Mann-Whitney test (independent samples). Accessed January 30, 2024. https://www.medcalc.org/manual/mannwhitney.php
16. O'Mahony N, Campbell S, Carvalho A, et al. Deep learning vs. traditional computer vision. 2020:128-144. doi:10.1007/978-3-030-17795-9_10
17. Schwartz S, Yaeli A, Shlomov S. Enhancing trust in LLM-based AI automation agents: new considerations and future challenges. Accessed January 30, 2024. https://arxiv.org/abs/2308.05391
18. Liu X, Zhao C, Wang L, et al. Evaluation of an OCT-AI-based telemedicine platform for retinal disease screening and referral in a primary care setting. Transl Vis Sci Technol. 2022;11(3):4. doi:10.1167/tvst.11.3.4
19. Cao S, Zhang R, Jiang A, et al. Application effect of an artificial intelligence-based fundus screening system: evaluation in a clinical setting and population screening. Biomed Eng Online. 2023;22(1):38. doi:10.1186/s12938-023-01097-9
20. Kim KM, Heo TY, Kim A, et al. Development of a fundus image-based deep learning diagnostic tool for various retinal diseases. J Pers Med. 2021;11(5):321. doi:10.3390/jpm11050321
21. Ting DSW, Cheung CYL, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017;318(22):2211-2223. doi:10.1001/jama.2017.18152
22. Bai J, Wan Z, Li P, et al. Accuracy and feasibility with AI-assisted OCT in retinal disorder community screening. Front Cell Dev Biol. 2022;10:1053483. doi:10.3389/fcell.2022.1053483
23. Tong Y, Lu W, Yu Y, Shen Y. Application of machine learning in ophthalmic imaging modalities. Eye Vis (Lond). 2020;7(1):22. doi:10.1186/s40662-020-00183-6
24. Kalaw FGP, Tavakoli K, Baxter SL. Evaluation of publications from the American Academy of Ophthalmology: a 5-year analysis of ophthalmology literature. Ophthalmol Sci. Published online September 11, 2023. doi:10.1016/j.xops.2023.100395
25. Jassar S, Adams SJ, Zarzeczny A, Burbridge BE. The future of artificial intelligence in medicine: medical-legal considerations for health leaders. Healthc Manage Forum. 2022;35(3):185-189. doi:10.1177/08404704221082069
26. Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare (Basel). 2023;11(6):887. doi:10.3390/healthcare11060887
27. Mihalache A, Popovic MM, Muni RH. Advances in artificial intelligence chatbot technology in ophthalmology-reply. JAMA Ophthalmol. 2023;141(11):1088-1089. doi:10.1001/jamaophthalmol.2023.4623