Taiwan Journal of Ophthalmology
. 2024 Sep 13;14(3):409–413. doi: 10.4103/tjo.TJO-D-23-00166

Investigating the comparative superiority of artificial intelligence programs in assessing knowledge levels regarding ocular inflammation, uvea diseases, and treatment modalities

Eyupcan Sensoy 1,*, Mehmet Citirik 1
PMCID: PMC11488809  PMID: 39430359

Abstract

PURPOSE:

The purpose of the study was to evaluate the knowledge level of the Chat Generative Pretrained Transformer (ChatGPT), Bard, and Bing artificial intelligence (AI) chatbots regarding ocular inflammation, uveal diseases, and treatment modalities, and to investigate their relative performance compared to one another.

MATERIALS AND METHODS:

Thirty-six questions related to ocular inflammation, uveal diseases, and treatment modalities were posed to the ChatGPT, Bard, and Bing AI chatbots, and both correct and incorrect responses were recorded. The accuracy rates were compared using the Chi-squared test.

RESULTS:

ChatGPT provided correct answers to 52.8% of the questions, Bard to 38.9%, and Bing to 44.4%. All three AI programs provided identical responses to 20 (55.6%) of the questions, with 45% of these identical responses being correct and 55% incorrect. There was no significant difference in the rates of correct and incorrect responses among the three AI chatbots (P = 0.654).

CONCLUSION:

AI chatbots should be developed to provide widespread access to accurate information about ocular inflammation, uveal diseases, and treatment modalities. Future research could explore ways to enhance the performance of these chatbots.

Keywords: Bard, Bing, Chat Generative Pretrained Transformer, ocular inflammation, uveitis

Introduction

Technology continues to evolve, and this development has increased its influence across all fields of medicine, making a wide variety of contributions.[1] One of the best examples is artificial intelligence (AI) in medicine. AI, developed to imitate the human brain, is a subbranch of computer science.[2] Although the concept was first put forward at a conference at Dartmouth College in 1956, the first studies were not carried out until the early 1970s.[3] Its wide variety of capabilities includes recognizing speech, solving problems, and interpreting images.[4]

With the development of AI applications, AI has gained a strong place in ophthalmology practice and is now used in the diagnosis and follow-up of various eye diseases.[5] Another use of AI is chatbots, which provide fast access to accurate information.[6] The Chat Generative Pretrained Transformer (ChatGPT) developed by OpenAI, Bard developed by Google, and Bing developed by Microsoft are important examples of this group.

The rapid developments in AI and its expanding uses in ophthalmology necessitate a closer examination of the effectiveness and the advantages and disadvantages of these applications. The aim of our study was to evaluate the knowledge levels of the ChatGPT, Bard, and Bing chatbots on ocular immunology, inflammation, and uveitis, to discuss the advantages and disadvantages of their use, and to investigate their relative performance.

Materials and Methods

All 36 questions in the study questions section of the American Academy of Ophthalmology 2022–2023 Basic and Clinical Science Course book on intraocular inflammation and uveitis were included in the study.[7] Of these 36 questions, eight concerned the ocular immune system, 20 concerned diagnosis, and eight tested knowledge related to treatment. The questions were posed to the AI programs on June 22, 2023. The ChatGPT GPT-3.5 (OpenAI, San Francisco, CA), Bard (Google), and Bing (Microsoft, Redmond, WA) chatbots, released free of charge by three different manufacturers, were first given the command: “I will ask you multiple choice questions. Please tell me the correct answer option.” Each question was then asked separately, and a new chat session was started after each question to avoid a memory effect. The answers were noted as correct or incorrect, and the topic of each question was recorded. In addition, questions that all three chatbots answered identically, whether correctly or incorrectly, were noted and categorized by topic. The study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of Ankara Etlik City Hospital (approval number: 2023.376). The Institutional Review Board confirmed that the study does not involve any human experiment, patient information, or symptoms and does not require a comprehensive IRB review.

All three chatbots were asked how they work, and the results were recorded. ChatGPT is a large language model (LLM)-based chatbot that can understand word strings in text and produce human-like responses. It was trained on data obtained from many sources, such as books, articles, and websites. It cannot access data generated after September 2021 and has no real-time Internet connection. The Bard chatbot is an LLM-based AI program with real-time Internet access. Its performance is improved using machine learning and language processing algorithms, and it has advanced features such as text recognition, answer generation, and accurate language translation; however, its access to paid content on the Internet is limited. The Bing chatbot is an LLM-based AI program that can access up-to-date free data on the Internet and cite the sources from which it is obtained. This program may perform searches using certain patterns and, consequently, may include incomplete information in its results.

Statistical analysis

The Statistical Package for the Social Sciences version 23.0 (SPSS Inc., Chicago, IL, USA) was used for statistical analysis of the data. Pearson Chi-square and Yates Chi-square tests were used to compare percentage values and nominal independent variables in statistical analysis. Statistical significance was set at P < 0.05.
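The Chi-squared comparison of accuracy rates can be illustrated with the 3 × 2 contingency table of correct and incorrect answers taken from the Results section. This is an illustrative sketch of the test, not the authors' original SPSS workflow; Bard's three unanswered questions are counted as neither correct nor incorrect here:

```python
from scipy.stats import chi2_contingency

# Rows: ChatGPT, Bard, Bing; columns: correct, incorrect
table = [
    [19, 17],  # ChatGPT: 19 correct, 17 incorrect
    [14, 19],  # Bard: 14 correct, 19 incorrect (3 refusals excluded)
    [16, 20],  # Bing: 16 correct, 20 incorrect
]

# Pearson Chi-square; Yates correction applies only to 2x2 tables,
# so it is not used here (df = 2)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.3f}")  # p ≈ 0.654
```

The p-value reproduces the nonsignificant result reported in the article (P = 0.654): the three chatbots' accuracy rates do not differ beyond chance at this sample size.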

Results

ChatGPT gave correct answers to 19 (52.8%) of the 36 questions and incorrect answers to 17 (47.2%). There was fair agreement between ChatGPT’s answers and the answer key (Cohen’s κ = 0.346, P < 0.001). Of the correctly answered questions, 5 (26.3%) tested knowledge of the immune system, 6 (31.6%) related to diagnosis, and 8 (42.1%) related to treatment. Of the incorrectly answered questions, 3 (17.6%) related to the immune system and 14 (82.4%) related to diagnosis.

Bard answered 14 (38.9%) of the 36 questions correctly and 19 (52.8%) incorrectly. To 3 (8.3%) questions, Bard replied, “I can’t help you with this. I am merely a language model.” There was fair agreement between Bard’s answers and the answer key (Cohen’s κ = 0.21, P = 0.042). Of the correctly answered questions, 5 (35.7%) pertained to the immune system, 5 (35.7%) to diagnosis, and 4 (28.6%) to treatment. Among the incorrectly answered questions, 3 (15.8%) related to the immune system, 15 (78.9%) to diagnosis, and 1 (5.3%) to treatment. All three questions that Bard declined to answer concerned initial treatment.

Bing gave correct answers to 16 (44.4%) of the questions and incorrect answers to 20 (55.6%). There was fair agreement between Bing’s answers and the answer key (Cohen’s κ = 0.23, P = 0.016). Of the correctly answered questions, 6 (37.5%) related to the immune system, 6 (37.5%) to diagnosis, and 4 (25%) to treatment. Of the incorrectly answered questions, 2 (10%) related to the immune system, 14 (70%) to diagnosis, and 4 (20%) to treatment [Table 1]. The number of correct answers given by the chatbots to questions about the immune system, diagnosis, and treatment is summarized in Table 2.
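The Cohen's κ values above measure agreement between each chatbot's answers and the answer key beyond what chance alone would produce. A minimal pure-Python sketch of the computation, using short hypothetical answer vectors (the actual 36-item response sets are not reproduced here):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items labeled identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from the marginal label frequencies
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 6-question example: chatbot answers vs. answer key
bot = ["A", "B", "C", "D", "A", "B"]
key = ["A", "B", "C", "A", "C", "B"]
print(round(cohens_kappa(bot, key), 3))  # → 0.538
```

Values between 0.21 and 0.40 are conventionally read as "fair" agreement, which matches the interpretation given for all three chatbots.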

Table 1.

The success of artificial intelligence chatbots on questions related to ocular inflammation and uvea diseases and treatment modalities

Answers (n) ChatGPT, n (%) Bard, n (%) Bing, n (%)
Correct 19 (52.8) 14 (38.9) 16 (44.4)
 Immune system/diagnosis/treatment 5/6/8 5/5/4 6/6/4
Incorrect 17 (47.2) 19 (52.8) 20 (55.6)
 Immune system/diagnosis/treatment 3/14/0 3/15/1 2/14/4
Same answers (n) 20 (55.6)
Correct 9 (45)
 Immune system/diagnosis/treatment 5/1/3
Incorrect 11 (55)
 Immune system/diagnosis/treatment 2/9/0

ChatGPT=Chat generative pretrained transformer

Table 2.

The number of correct answers given by chatbots to questions related to the immune system, diagnosis, and treatment

(Table 2 is presented as an image in the published article.)

All three AI programs provided identical answers to 20 (55.6%) of the questions posed, and their responses showed moderate agreement (Fleiss κ = 0.41, P < 0.05). Among these, 9 (45%) were correct and 11 (55%) were incorrect. Of the identically and correctly answered questions, 5 (55.6%) related to the immune system, 1 (11.1%) to diagnosis, and 3 (33.3%) to treatment. Of the identically but incorrectly answered questions, 2 (18.2%) related to the immune system and 9 (81.8%) to diagnosis.
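The three-way agreement reported here (Fleiss κ = 0.41) generalizes Cohen's κ to more than two raters. A pure-Python sketch with hypothetical per-question answer triples (one label each from ChatGPT, Bard, and Bing; these are illustrative, not the study data):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same
    number of raters; ratings[i] is the list of labels item i received."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Per-item agreement: proportion of rater pairs that agree
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_items
    # Chance agreement from the overall category proportions
    total = n_items * n_raters
    p_e = sum((c / total) ** 2 for c in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical answers from the three chatbots to four questions
answers = [
    ["A", "A", "A"],  # all three agree
    ["B", "B", "C"],  # two agree
    ["C", "C", "C"],  # all three agree
    ["A", "B", "C"],  # no agreement
]
print(round(fleiss_kappa(answers), 3))  # → 0.362
```

A κ around 0.41, as reported, falls in the conventional "moderate" agreement band (0.41 to 0.60).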

The use of three AI chatbots showed no significant impact on correct and incorrect response rates (P = 0.654). Similarly, there was no significant difference between ChatGPT and Bard (P = 0.536), ChatGPT and Bing (P = 0.637), or Bing and Bard (P = 1.0) programs.

Discussion

AI technology is constantly developing and becoming part of our daily lives. While the main targets of earlier AI programs were deep learning-based models created to recognize patterns, today LLM-based AI programs that can relate and interpret words have come into use. When such a model is trained with sufficient information, it can develop answers that resemble human reasoning.[8] The ChatGPT, Bard, and Bing AI programs were all developed on the basis of LLMs.

ChatGPT, which was developed to create responses similar to human thought, uses 175 billion parameters and maintains a strong place among existing AI applications.[9] Given its useful features, such as creating personalized learning plans, performing language translations, and providing rapid access to research and information, its wide range of benefits for medical education is clear.[6] Taking advantage of these features, it has found uses ranging from medical students who need to access information quickly and reliably to health professionals serving in a wide variety of fields.[10] In addition, this AI program facilitates the work of medical researchers by answering research questions, summarizing a wide variety of texts, examining the literature, analyzing data, and providing access to a large amount of information on the Internet.[11] The Bard and Bing AI programs, launched in 2023, are current AI chatbots with features similar to ChatGPT and could therefore be used for similar purposes. In general, considering the advantages of AI chatbots in reaching correct information quickly, we believe that they will contribute to education about ocular inflammation and the types of uveitis. Ocular inflammation encompasses a wide variety of uveitis types, treatment algorithms, and differential diagnoses; given this complexity, the prospect of quickly accessing correct information is exciting for researchers.

At the same time, as these programs develop and become more popular, many ethical issues arise. These can be examined in several subgroups, such as patient data privacy, bias, equality, justice, and legal responsibility.[12] Data privacy and patient autonomy are important: health professionals must respect patients’ autonomy, protect patient confidentiality, and take the necessary precautions to do so. Bias is another important issue. The information used to develop these programs may bias the evaluation of the data being processed, resulting in incorrect diagnoses. To prevent this and provide fair results, data diversity should be increased and these programs should be audited regularly. A further issue is who bears responsibility for the consequences of incorrect diagnoses made by these programs. These and many other ethical problems remain to be solved and will likely be encountered more frequently as these programs become more integrated into our lives.

These programs also have technical disadvantages. The medical information used to train LLMs is obtained from publicly available sources such as WebText, Wikipedia, and books available on the Internet, and may therefore be limited. A significant amount of medical information is absent from these language models because of publications behind payment barriers, such as some journal articles.[13] In addition, the data used by ChatGPT cover 2021 and earlier years, which creates gaps in every developing area of medicine and may render its answers out of date.[9] Given these disadvantages, access to information about the ever-evolving fields of ocular immunology and inflammation is restricted, and outdated information may be misconstrued as correct.

A wide variety of studies on medical question answering have been conducted in recent years. In one study, models answering yes/no questions from PubMed had an accuracy rate of 68.1%, whereas another study in China that evaluated answers to 12,723 questions reported an accuracy rate of 36.7%.[14,15] With the development of AI, programs such as ChatGPT, Bing, and Bard have emerged that can answer more complex questions. A recent study investigating the effectiveness of ChatGPT reported that it answered more than 50% of the questions asked correctly.[8] Another study reported that ChatGPT gave more than 60% correct answers, a successful response rate, and concluded that it was equal to or more successful than earlier AI programs.[16] In a further recent study of ophthalmology questions, ChatGPT gave 58.8% correct answers and Bing 71.2%; on questions related to uveitis, ChatGPT gave 70% correct answers and Bing 75%.[13] In our study, we obtained 52.8%, 44.4%, and 38.9% correct answers with the ChatGPT, Bing, and Bard chatbots, respectively. These rates are lower than those reported in the literature. We believe the reason may be ChatGPT’s limited access to information generated after 2021, or the limited access of all three chatbots to current articles available only for a fee. Ocular immunology and inflammation are active fields of study in which new developments constantly take place, and the chatbots’ inability to access this updated information in real time may have reduced their correct answer rates. In addition, the possibility that the three chatbots misunderstood the word strings in the questions and reached wrong answers through the wrong keywords may have lowered their accuracy. Another notable finding is that the Bard chatbot declined to answer the questions on initial treatment. The reason may be to prevent self-treatment by patients, given the speed of AI development and its ease of access, and to ensure that they seek help from a health professional.

The limitations of this study are ChatGPT’s restriction to data from 2021 and earlier, the other chatbots’ incomplete access to up-to-date information, the closed structure of these models and the inability to fine-tune them, and the small number of questions.

Conclusion

To the best of our knowledge, our study is the first to test and compare the knowledge of the LLM-based ChatGPT, Bard, and Bing AI chatbots, released by three different manufacturers, regarding ocular inflammation, uveitis types, and their treatment modalities. Although there was no significant difference in accuracy rates among the three chatbots, comparisons among them may not provide meaningful insights given their modest overall accuracy. Further development appears necessary before AI chatbots can reliably provide accurate information on ocular inflammation, uveitis types, and treatment modalities.

Data availability statement

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Financial support and sponsorship

Nil.

Conflicts of interest

The authors declare that there are no conflicts of interest in this paper.

References

  • 1. Evans RS. Electronic health records: Then, now, and in the future. Yearb Med Inform. 2016;(Suppl 1):S48–61. doi: 10.15265/IYS-2016-s006.
  • 2. Rahimy E. Deep learning applications in ophthalmology. Curr Opin Ophthalmol. 2018;29:254–60. doi: 10.1097/ICU.0000000000000470.
  • 3. Patel VL, Shortliffe EH, Stefanelli M, Szolovits P, Berthold MR, Bellazzi R, et al. The coming of age of artificial intelligence in medicine. Artif Intell Med. 2009;46:5–17. doi: 10.1016/j.artmed.2008.07.017.
  • 4. Mikolov T, Deoras A, Povey D, Burget L, Černocký J. Strategies for training large scale neural network language models. 2011 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2011), Proceedings. 2011:196–201.
  • 5. Kapoor R, Walters SP, Al-Aswad LA. The current state of artificial intelligence in ophthalmology. Surv Ophthalmol. 2019;64:233–40. doi: 10.1016/j.survophthal.2018.09.002.
  • 6. Khan RA, Jawaid M, Khan AR, Sajjad M. ChatGPT – Reshaping medical education and clinical management. Pak J Med Sci. 2023;39:605–7. doi: 10.12669/pjms.39.2.7653.
  • 7. Sen HN, Albini TA, Burkholder BM, Dahr SS, Dodds EM, Leveque TK, et al., editors. Uveitis and Ocular Inflammation. San Francisco: American Academy of Ophthalmology; 2022.
  • 8. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2:e0000198. doi: 10.1371/journal.pdig.0000198.
  • 9. Wen J, Wang W. The future of ChatGPT in academic research and publishing: A commentary for clinical and translational medicine. Clin Transl Med. 2023;13:e1207. doi: 10.1002/ctm2.1207.
  • 10. Jeblick K, Schachtner B, Dexl J, Mittermeier A, Stüber AT, Topalis J, et al. ChatGPT makes medicine easy to swallow: An exploratory case study on simplified radiology reports. Eur Radiol. 2024;34:2817–25. doi: 10.1007/s00330-023-10213-1.
  • 11. Gao CA, Howard FM, Markov NS, Dyer EC, Ramesh S, Luo Y, et al. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. BioRxiv 2022.12.23.521610. Available from: https://www.biorxiv.org/content/10.1101/2022.12.23.521610v1. [Last accessed on 2023 Jun 11].
  • 12. Tan Yip Ming C, Rojas-Carabali W, Cifuentes-González C, Agrawal R, Thorne JE, Tugal-Tutkun I, et al. The potential role of large language models in uveitis care: Perspectives after ChatGPT and Bard launch. Ocul Immunol Inflamm. 2023:1–5. doi: 10.1080/09273948.2023.2242462. Epub ahead of print.
  • 13. Cai LZ, Shaheen A, Jin A, Fukui R, Yi JS, Yannuzzi N, et al. Performance of generative large language models on ophthalmology board-style questions. Am J Ophthalmol. 2023;254:141–9. doi: 10.1016/j.ajo.2023.05.024.
  • 14. Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: A dataset for biomedical research question answering. Proceedings of EMNLP-IJCNLP 2019. 2019:2567–77. Available from: https://arxiv.org/abs/1909.06146v1. [Last accessed on 2023 Jun 11].
  • 15. Jin D, Pan E, Oufattole N, Weng WH, Fang H, Szolovits P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl Sci. 2021;11:6421. Available from: https://www.mdpi.com/2076-3417/11/14/6421/htm. [Last accessed on 2023 Jun 11].
  • 16. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312. doi: 10.2196/45312.



Articles from Taiwan Journal of Ophthalmology are provided here courtesy of Wolters Kluwer -- Medknow Publications
