Abstract
This cross-sectional study assesses the accuracy of an updated version of a popular chatbot in answering board certification examination preparation questions.
Artificial intelligence (AI) chatbots have been shown to produce humanlike responses to virtually any prompt.1 A recent study found that the older version of a popular chatbot correctly answered approximately half of the multiple-choice questions in the free OphthoQuestions trial used to prepare for the American Board of Ophthalmology board certification examination.2 In this study, we updated our previous investigation2 by assessing the accuracy of an updated version of this chatbot in responding to practice questions for the ophthalmology board certification examination.
Methods
We asked ChatGPT-4 (March 2023 release; OpenAI) the same practice questions for the Ophthalmic Knowledge Assessment Program (OKAP) and Written Qualifying Examination (WQE) from the free OphthoQuestions trial, per the methods outlined in our previous study.2 In accordance with the Common Rule (45 CFR Part 46), this cross-sectional study was exempt from ethics review and informed consent because it was not considered human participant research. We followed the STROBE reporting guideline.
We recorded the proportion of trainees in ophthalmology using the OphthoQuestions trial who selected the same response as the chatbot. The primary outcome was the number of multiple-choice questions that the chatbot answered correctly. Data analysis was performed with Excel 2019 (Microsoft).
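Although the analysis was performed in Excel, the primary outcome is a simple tally that can be reproduced programmatically. The sketch below is illustrative only; it assumes a hypothetical CSV export (responses.csv) with columns named chatbot_answer, correct_answer, and question_chars, none of which come from the study.

```python
# Minimal sketch (not the authors' code): tally chatbot accuracy and
# question-length summary statistics from a hypothetical CSV export.
import csv
import statistics

with open("responses.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Primary outcome: number of multiple-choice questions answered correctly.
n_correct = sum(r["chatbot_answer"] == r["correct_answer"] for r in rows)
lengths = [int(r["question_chars"]) for r in rows]

print(f"Correct: {n_correct}/{len(rows)} ({100 * n_correct / len(rows):.0f}%)")
print(f"Question length, mean (SD): {statistics.mean(lengths):.2f} "
      f"({statistics.stdev(lengths):.2f}) characters")
```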
Results
The chatbot generated answers to questions on OphthoQuestions on March 18, 2023. The mean (SD) length of questions was 306.40 (204.75) characters, and the mean (SD) length of chatbot responses was 473.83 (205.89) characters (Table 1). Of the 125 text-based multiple-choice questions, 105 (84%) were answered correctly by the chatbot (Table 2).
Table 1. Characteristics of Questions on OphthoQuestions and Chatbot Responses on March 18, 2023.
Category | No. of available questions | No. of text-based questions (%) | Mean (SD) question length, characters | Mean (SD) response length, characters | Mean (SD) response time, s | No. of questions eligible as stand-alone questions
---|---|---|---|---|---|---
Clinical optics | 13 | 13 (100) | 246.15 (145.38) | 788.85 (419.94) | 31.87 (17.08) | 11 |
Cornea | 12 | 10 (83) | 193.60 (51.71) | 400.80 (105.37) | 12.42 (2.94) | 6
Fundamentals | 15 | 15 (100) | 181.67 (100.86) | 425.87 (133.13) | 16.07 (3.61) | 11 |
General medicine | 14 | 14 (100) | 226.57 (148.84) | 358.43 (52.49) | 10.66 (1.54) | 10 |
Glaucoma | 8 | 6 (75) | 237.33 (81.62) | 490.67 (69.88) | 16.98 (3.81) | 3 |
Lens and cataract | 14 | 12 (86) | 295.08 (166.23) | 341.42 (68.84) | 10.48 (2.20) | 8 |
Neuro-ophthalmology | 13 | 7 (54) | 336.43 (205.60) | 338.00 (146.91) | 11.17 (4.68) | 5 |
Oculoplastics | 15 | 10 (67) | 314.50 (187.70) | 470.20 (88.82) | 17.17 (3.26) | 8 |
Pathology and tumors | 12 | 4 (33) | 450.50 (140.58) | 577.50 (130.31) | 17.76 (2.66) | 1 |
Pediatrics | 13 | 9 (69) | 357.56 (134.67) | 525.67 (99.64) | 17.09 (2.93) | 2 |
Refractive surgery | 15 | 14 (93) | 441.14 (292.85) | 520.57 (135.18) | 18.50 (5.82) | 10 |
Retina and vitreous | 11 | 3 (27) | 420.00 (109.99) | 444.33 (62.42) | 15.60 (2.03) | 1 |
Uveitis | 11 | 8 (73) | 543.25 (349.78) | 473.63 (117.89) | 17.02 (2.77) | 2 |
Total | 166 | 125 (75) | 306.40 (204.75) | 473.83 (205.89) | 16.59 (8.60) | 78 |
Table 2. Accuracy of Chatbot Responses Across Question Categories on OphthoQuestions on March 18, 2023.
Category | No. of correct responses/total No. of responses (%) | Proportion of trainees agreeing with ChatGPT-4, % | No. of responses matching the least popular trainee choice/total No. of responses (%) | No. of responses matching the second least popular trainee choice/total No. of responses (%) | No. of responses matching the second most popular trainee choice/total No. of responses (%) | No. of responses with explanation provided/total No. of responses (%) | No. of correct responses/total No. of stand-alone questions (%) | No. of responses consistent between multiple-choice and stand-alone formats/total No. of stand-alone questions (%)
---|---|---|---|---|---|---|---|---
Clinical optics | 8/13 (62) | 57 | 0 | 1/13 (8) | 2/13 (15) | 11/13 (85) | 4/11 (36) | 5/11 (45) |
Cornea | 9/10 (90) | 77 | 0 | 0 | 1/10 (10) | 10/10 (100) | 3/6 (50) | 3/6 (50) |
Fundamentals | 14/15 (93) | 82 | 0 | 0 | 1/15 (7) | 15/15 (100) | 10/11 (91) | 11/11 (100) |
General medicine | 14/14 (100) | 75 | 0 | 0 | 0 | 14/14 (100) | 7/10 (70) | 7/10 (70) |
Glaucoma | 5/6 (83) | 67 | 0 | 0 | 1/6 (17) | 6/6 (100) | 1/3 (33) | 0/3 (0) |
Lens and cataract | 8/12 (67) | 62 | 0 | 1/12 (8) | 3/12 (25) | 12/12 (100) | 7/8 (88) | 6/8 (75) |
Neuro-ophthalmology | 6/7 (86) | 69 | 0 | 1/7 (14) | 0 | 7/7 (100) | 4/5 (80) | 5/5 (100) |
Oculoplastics | 8/10 (80) | 70 | 0 | 0 | 2/10 (20) | 10/10 (100) | 4/8 (50) | 4/8 (50) |
Pathology and tumors | 3/4 (75) | 60 | 0 | 1/4 (25) | 0 | 4/4 (100) | 0/1 (0) | 1/1 (100) |
Pediatrics | 8/9 (89) | 75 | 0 | 0 | 1/9 (11) | 9/9 (100) | 1/2 (50) | 2/2 (100) |
Refractive surgery | 11/14 (79) | 64 | 2/14 (14) | 0 | 1/14 (7) | 14/14 (100) | 6/10 (60) | 6/10 (60) |
Retina and vitreous | 3/3 (100) | 83 | 0 | 0 | 0 | 3/3 (100) | 1/1 (100) | 1/1 (100) |
Uveitis | 8/8 (100) | 80 | 0 | 0 | 0 | 8/8 (100) | 1/2 (50) | 2/2 (100) |
Total | 105/125 (84) | 71 | 2/125 (2) | 4/125 (3) | 12/125 (10) | 123/125 (98) | 49/78 (63) | 53/78 (68) |
The chatbot responded correctly to 100% of the questions in general medicine, retina and vitreous, and uveitis. Its performance was lowest in clinical optics, where it responded correctly to 8 of 13 questions (62%). In addition, the chatbot provided explanations and additional insight for 123 of 125 questions (98%). On average, 71% (95% CI, 66%-75%) of trainees in ophthalmology selected the same response to the multiple-choice questions as the chatbot. The chatbot provided a correct response to 49 of 78 stand-alone questions (63%), ie, questions with the multiple-choice options removed.
The median (IQR) length of multiple-choice questions answered correctly was 217 (155-383) characters vs 246 (209-471) characters for questions answered incorrectly. The median (IQR) length of correct chatbot responses was 428 (353-525) characters vs 465 (347-761) characters for incorrect responses.
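The study does not state how the 95% CI for mean trainee agreement was derived; a t-based interval on per-question agreement values is one plausible approach. The sketch below is an assumption, and because the per-question data are not published, it uses the per-category agreement percentages from Table 2 as stand-in data, so its output will not exactly match the reported 66%-75% interval.

```python
# Illustrative only: t-based 95% CI for mean trainee agreement.
# Per-category percentages from Table 2 are used as stand-in data;
# the study's actual CI was computed from per-question data not shown here.
import math
import statistics

agreement = [57, 77, 82, 75, 67, 62, 69, 70, 60, 75, 64, 83, 80]  # %, Table 2

mean = statistics.mean(agreement)
sem = statistics.stdev(agreement) / math.sqrt(len(agreement))
t_crit = 2.179  # two-sided t critical value, 12 df, alpha = .05

print(f"Mean agreement: {mean:.0f}% "
      f"(95% CI, {mean - t_crit * sem:.0f}%-{mean + t_crit * sem:.0f}%)")
```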
Discussion
Chatbots such as the one used in this study are dynamic language models designed to improve on existing conversational AI systems.3 In this study, we found that the newer version of the chatbot, ChatGPT-4, responded correctly to 84% of multiple-choice practice questions on OphthoQuestions, a question bank commonly used to prepare for the board certification examination. This study expanded on recent work, which found that the previous version of this chatbot correctly answered 46% of the same multiple-choice questions in January 2023 and 58% in February 2023.2 Performance of the updated version of this chatbot appeared to improve across all question categories on OphthoQuestions compared with the performance of the previous version.2 Results of this study also suggest that, in most cases, the updated version of the chatbot generated accurate responses when multiple-choice options were given. OpenAI states that GPT-4 outperforms its predecessor, GPT-3.5, in other disciplines.4,5
This study had several limitations. OphthoQuestions offers preparation material for the OKAP and WQE, and the chatbot may perform differently on the official examinations. The chatbot generates unique responses each time it is prompted, which may differ if this study were repeated. The previous study2 may have helped train the chatbot in this setting. Results of the present study must be interpreted in the context of the study date, as the chatbot's knowledge corpus will likely continue to expand rapidly.
Data Sharing Statement
References
1. Sanderson K. GPT-4 is here: what scientists think. Nature. 2023;615(7954):773. doi:10.1038/d41586-023-00816-5
2. Mihalache A, Popovic MM, Muni RH. Performance of an artificial intelligence chatbot in an ophthalmic knowledge assessment. JAMA Ophthalmol. 2023;e231144. doi:10.1001/jamaophthalmol.2023.1144
3. Zhai X. ChatGPT user experience: implications for education. SSRN. Preprint posted online January 4, 2023. doi:10.2139/ssrn.4312418
4. OpenAI. GPT-4 technical report. arXiv. Preprint posted online March 2023. https://arxiv.org/abs/2303.08774
5. OpenAI. GPT-4. Accessed March 20, 2023. https://openai.com/product/gpt-4