Abstract
This cross-sectional study assesses the accuracy of an updated version of a popular chatbot in answering board certification examination preparation questions.
Artificial intelligence (AI) chatbots have been shown to produce humanlike responses to virtually any prompt.1 A recent study found that the older version of a popular chatbot correctly answered approximately half of the multiple-choice questions in the free OphthoQuestions trial used to prepare for the American Board of Ophthalmology board certification examination.2 In this study, we updated our previous investigation2 by assessing the accuracy of an updated version of this chatbot in responding to practice questions for the ophthalmology board certification examination.
Methods
We asked ChatGPT-4 (March 2023 release; OpenAI) the same practice questions for the Ophthalmic Knowledge Assessment Program (OKAP) and Written Qualifying Examination (WQE) from the free OphthoQuestions trial, per the methods outlined in our previous study.2 In accordance with the Common Rule (45 CFR Part 46), this cross-sectional study was exempt from ethics review and informed consent because it was not considered human participant research. We followed the STROBE reporting guideline.
We recorded the proportion of trainees in ophthalmology using the OphthoQuestions trial who selected the same response as the chatbot. The primary outcome was the number of multiple-choice questions that the chatbot answered correctly. Data analysis was performed with Excel 2019 (Microsoft).
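Although the analysis was performed in Excel, the primary outcome is a simple tally that can be reproduced programmatically. The sketch below is illustrative only; it assumes a hypothetical CSV export (responses.csv) with columns named chatbot_answer, correct_answer, and question_chars, none of which come from the study.

```python
# Minimal sketch (not the authors' code): tally chatbot accuracy and
# question-length summary statistics from a hypothetical CSV export.
import csv
import statistics

with open("responses.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Primary outcome: number of multiple-choice questions answered correctly.
n_correct = sum(r["chatbot_answer"] == r["correct_answer"] for r in rows)
lengths = [int(r["question_chars"]) for r in rows]

print(f"Correct: {n_correct}/{len(rows)} ({100 * n_correct / len(rows):.0f}%)")
print(f"Question length, mean (SD): {statistics.mean(lengths):.2f} "
      f"({statistics.stdev(lengths):.2f}) characters")
```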
Results
The chatbot generated answers to questions on OphthoQuestions on March 18, 2023. The mean (SD) length of questions was 306.40 (204.75) characters, and the mean (SD) length of chatbot responses was 473.83 (205.89) characters (Table 1). Of the 125 text-based multiple-choice questions, 105 (84%) were answered correctly by the chatbot (Table 2).
Table 1. Characteristics of Questions on OphthoQuestions and Chatbot Responses on March 18, 2023.
Category | No. of available questions | No. of text-based questions (%) | Mean (SD) question length, characters | Mean (SD) response length, characters | Mean (SD) response time, s | No. of questions eligible as stand-alone questions
---|---|---|---|---|---|---
Clinical optics | 13 | 13 (100) | 246.15 (145.38) | 788.85 (419.94) | 31.87 (17.08) | 11 |
Cornea | 12 | 10 (83) | 193.60 (51.71) | 400.80 (105.37) | 12.42 (2.94) | 6
Fundamentals | 15 | 15 (100) | 181.67 (100.86) | 425.87 (133.13) | 16.07 (3.61) | 11 |
General medicine | 14 | 14 (100) | 226.57 (148.84) | 358.43 (52.49) | 10.66 (1.54) | 10 |
Glaucoma | 8 | 6 (75) | 237.33 (81.62) | 490.67 (69.88) | 16.98 (3.81) | 3 |
Lens and cataract | 14 | 12 (86) | 295.08 (166.23) | 341.42 (68.84) | 10.48 (2.20) | 8 |
Neuro-ophthalmology | 13 | 7 (54) | 336.43 (205.60) | 338.00 (146.91) | 11.17 (4.68) | 5 |
Oculoplastics | 15 | 10 (67) | 314.50 (187.70) | 470.20 (88.82) | 17.17 (3.26) | 8 |
Pathology and tumors | 12 | 4 (33) | 450.50 (140.58) | 577.50 (130.31) | 17.76 (2.66) | 1 |
Pediatrics | 13 | 9 (69) | 357.56 (134.67) | 525.67 (99.64) | 17.09 (2.93) | 2 |
Refractive surgery | 15 | 14 (93) | 441.14 (292.85) | 520.57 (135.18) | 18.50 (5.82) | 10 |
Retina and vitreous | 11 | 3 (27) | 420.00 (109.99) | 444.33 (62.42) | 15.60 (2.03) | 1 |
Uveitis | 11 | 8 (73) | 543.25 (349.78) | 473.63 (117.89) | 17.02 (2.77) | 2 |
Total | 166 | 125 (75) | 306.40 (204.75) | 473.83 (205.89) | 16.59 (8.60) | 78 |
Table 2. Accuracy of Chatbot Responses Across Question Categories on OphthoQuestions on March 18, 2023.
Category | No. of correct responses/total No. of responses (%) | Proportion of trainees agreeing with ChatGPT-4, % | No. of responses matching the least popular trainee choice/total No. of responses (%) | No. of responses matching the second least popular trainee choice/total No. of responses (%) | No. of responses matching the second most popular trainee choice/total No. of responses (%) | No. of responses with explanation provided/total No. of responses (%) | No. of correct responses/total No. of stand-alone questions (%) | No. of responses consistent between multiple-choice and stand-alone formats/total No. of stand-alone questions (%)
---|---|---|---|---|---|---|---|---
Clinical optics | 8/13 (62) | 57 | 0 | 1/13 (8) | 2/13 (15) | 11/13 (85) | 4/11 (36) | 5/11 (45) |
Cornea | 9/10 (90) | 77 | 0 | 0 | 1/10 (10) | 10/10 (100) | 3/6 (50) | 3/6 (50) |
Fundamentals | 14/15 (93) | 82 | 0 | 0 | 1/15 (7) | 15/15 (100) | 10/11 (91) | 11/11 (100) |
General medicine | 14/14 (100) | 75 | 0 | 0 | 0 | 14/14 (100) | 7/10 (70) | 7/10 (70) |
Glaucoma | 5/6 (83) | 67 | 0 | 0 | 1/6 (17) | 6/6 (100) | 1/3 (33) | 0/3 (0) |
Lens and cataract | 8/12 (67) | 62 | 0 | 1/12 (8) | 3/12 (25) | 12/12 (100) | 7/8 (88) | 6/8 (75) |
Neuro-ophthalmology | 6/7 (86) | 69 | 0 | 1/7 (14) | 0 | 7/7 (100) | 4/5 (80) | 5/5 (100) |
Oculoplastics | 8/10 (80) | 70 | 0 | 0 | 2/10 (20) | 10/10 (100) | 4/8 (50) | 4/8 (50) |
Pathology and tumors | 3/4 (75) | 60 | 0 | 1/4 (25) | 0 | 4/4 (100) | 0/1 (0) | 1/1 (100) |
Pediatrics | 8/9 (89) | 75 | 0 | 0 | 1/9 (11) | 9/9 (100) | 1/2 (50) | 2/2 (100) |
Refractive surgery | 11/14 (79) | 64 | 2/14 (14) | 0 | 1/14 (7) | 14/14 (100) | 6/10 (60) | 6/10 (60) |
Retina and vitreous | 3/3 (100) | 83 | 0 | 0 | 0 | 3/3 (100) | 1/1 (100) | 1/1 (100) |
Uveitis | 8/8 (100) | 80 | 0 | 0 | 0 | 8/8 (100) | 1/2 (50) | 2/2 (100) |
Total | 105/125 (84) | 71 | 2/125 (2) | 4/125 (3) | 12/125 (10) | 123/125 (98) | 49/78 (63) | 53/78 (68) |
The chatbot responded correctly to 100% of the questions in general medicine, retina and vitreous, and uveitis. Its performance was lowest in clinical optics, where it responded correctly to 8 of 13 questions (62%). In addition, the chatbot provided explanations and additional insight for 123 of 125 questions (98%). On average, 71% (95% CI, 66%-75%) of trainees in ophthalmology selected the same response to the multiple-choice questions as the chatbot. The chatbot provided a correct response to 49 of 78 stand-alone questions (63%), ie, questions with the multiple-choice options removed.
The median (IQR) length of multiple-choice questions answered correctly was 217 (155-383) characters vs 246 (209-471) characters for questions answered incorrectly. The median (IQR) length of correct chatbot responses was 428 (353-525) characters vs 465 (347-761) characters for incorrect responses.
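The study does not state how the 95% CI for mean trainee agreement was derived; a t-based interval on per-question agreement values is one plausible approach. The sketch below is an assumption, and because the per-question data are not published, it uses the per-category agreement percentages from Table 2 as stand-in data, so its output will not exactly match the reported 66%-75% interval.

```python
# Illustrative only: t-based 95% CI for mean trainee agreement.
# Per-category percentages from Table 2 are used as stand-in data;
# the study's actual CI was computed from per-question data not shown here.
import math
import statistics

agreement = [57, 77, 82, 75, 67, 62, 69, 70, 60, 75, 64, 83, 80]  # %, Table 2

mean = statistics.mean(agreement)
sem = statistics.stdev(agreement) / math.sqrt(len(agreement))
t_crit = 2.179  # two-sided t critical value, 12 df, alpha = .05

print(f"Mean agreement: {mean:.0f}% "
      f"(95% CI, {mean - t_crit * sem:.0f}%-{mean + t_crit * sem:.0f}%)")
```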
Discussion
Chatbots such as the one used in this study are dynamic language models designed to improve on existing conversational AI systems.3 In this study, we found that the newer version of the chatbot, ChatGPT-4, responded correctly to 84% of multiple-choice practice questions on OphthoQuestions, a question bank commonly used to prepare for the board certification examination. This study expanded on recent work, which found that the previous version of this chatbot correctly answered 46% of the same multiple-choice questions in January 2023 and 58% in February 2023.2 Performance of the updated version of this chatbot appeared to improve across all question categories on OphthoQuestions compared with the performance of the previous version.2 Results of this study also suggest that, in most cases, the updated version of the chatbot generated accurate responses when multiple-choice options were given. OpenAI states that GPT-4 outperforms its predecessor, GPT-3.5, in other disciplines.4,5
This study had several limitations. OphthoQuestions offers preparation material for the OKAP and WQE, and the chatbot may perform differently on the official examinations. The chatbot generates unique responses each time it is prompted, which may differ if this study were repeated. The previous study2 may have helped train the chatbot in this setting. Results of the present study must be interpreted in the context of the study date, as the chatbot's knowledge corpus will likely continue to expand rapidly.
Data Sharing Statement
References
1. Sanderson K. GPT-4 is here: what scientists think. Nature. 2023;615(7954):773. doi:10.1038/d41586-023-00816-5
2. Mihalache A, Popovic MM, Muni RH. Performance of an artificial intelligence chatbot in an ophthalmic knowledge assessment. JAMA Ophthalmol. 2023;e231144. doi:10.1001/jamaophthalmol.2023.1144
3. Zhai X. ChatGPT user experience: implications for education. SSRN. Preprint posted online January 4, 2023. doi:10.2139/ssrn.4312418
4. OpenAI. GPT-4 technical report. arXiv. Preprint posted online March 2023. https://arxiv.org/abs/2303.08774
5. OpenAI. GPT-4. Accessed March 20, 2023. https://openai.com/product/gpt-4