This cross-sectional study assesses the performance of the chatbot ChatGPT in answering text-based practice questions for the board certification examination from the OphthoQuestions free trial.
Key Points
Question
How does the artificial intelligence chatbot ChatGPT perform in answering practice questions for board certification in ophthalmology?
Findings
In this cross-sectional study that included 125 text-based multiple-choice questions provided by the OphthoQuestions free trial for board certification examination preparation, ChatGPT answered 58 questions (46%) correctly.
Meaning
Medical professionals and students should appreciate the advances of artificial intelligence in medicine, while acknowledging that ChatGPT as used in this investigation did not answer sufficient multiple-choice questions correctly for it to provide substantial assistance in preparing for board certification at this time.
Abstract
Importance
ChatGPT is an artificial intelligence (AI) chatbot that has significant societal implications. Training curricula using AI are being developed in medicine, and the performance of chatbots in ophthalmology has not been characterized.
Objective
To assess the performance of ChatGPT in answering practice questions for board certification in ophthalmology.
Design, Setting, and Participants
This cross-sectional study used a consecutive sample of text-based multiple-choice questions provided by the OphthoQuestions practice question bank for board certification examination preparation. Of 166 available multiple-choice questions, 125 (75%) were text-based.
Exposures
ChatGPT answered questions from January 9 to 16, 2023, and on February 17, 2023.
Main Outcomes and Measures
Our primary outcome was the number of board certification examination practice questions that ChatGPT answered correctly. Our secondary outcomes were the proportion of questions for which ChatGPT provided additional explanations, the mean length of questions and responses provided by ChatGPT, the performance of ChatGPT in answering questions without multiple-choice options, and changes in performance over time.
Results
In January 2023, ChatGPT correctly answered 58 of 125 questions (46%). ChatGPT’s performance was the best in the category general medicine (11/14; 79%) and poorest in retina and vitreous (0%). The proportion of questions for which ChatGPT provided additional explanations was similar between questions answered correctly and incorrectly (difference, 5.8%; 95% CI, −11.0% to 22.0%; χ²₁ = 0.45; P = .51). The mean length of questions was similar between questions answered correctly and incorrectly (difference, 21.4 characters; SE, 36.8; 95% CI, −51.4 to 94.3; t = 0.58; df = 123; P = .56). The mean length of responses was similar between questions answered correctly and incorrectly (difference, −80.0 characters; SE, 65.4; 95% CI, −209.5 to 49.5; t = −1.22; df = 123; P = .22). ChatGPT selected the same multiple-choice response as the most common answer provided by ophthalmology trainees on OphthoQuestions 44% of the time. In February 2023, ChatGPT provided a correct response to 73 of 125 multiple-choice questions (58%) and 42 of 78 stand-alone questions (54%) without multiple-choice options.
Conclusions and Relevance
ChatGPT answered approximately half of questions correctly in the OphthoQuestions free trial for ophthalmic board certification preparation. Medical professionals and trainees should appreciate the advances of AI in medicine while acknowledging that ChatGPT as used in this investigation did not answer sufficient multiple-choice questions correctly for it to provide substantial assistance in preparing for board certification at this time.
Introduction
ChatGPT is an artificial intelligence (AI) chatbot developed by OpenAI that has significant societal implications and has already been involved in various scientific and medical applications.1,2,3,4 ChatGPT is novel, interacts with users conversationally, and can serve as an educational tool.5 The responsible use of ChatGPT in medical education and clinical practice is of utmost importance.
Training curricula that use AI are being developed for medical students, who may themselves need training in the use of AI systems. Students must recognize the limitations of ChatGPT, which can sometimes output incorrect information.5 A recent preprint found that ChatGPT performs at the level of a third-year medical student when answering questions pertaining to United States Medical Licensing Examination (USMLE) Step 1 and 2 examinations.6 However, the performance of ChatGPT across various disciplines remains unclear. Our study aims to assess the performance of ChatGPT in answering practice questions for ophthalmology board certification examinations.
Methods
We used a data set of practice multiple-choice questions for the Ophthalmic Knowledge Assessment Program (OKAP) and Written Qualifying Exam (WQE) from the free trial of OphthoQuestions,7 a common resource for board certification examination preparation. We included text-based questions and excluded questions requiring input of images and videos because ChatGPT only accepts text input. Our study used American Association for Public Opinion Research (AAPOR) reporting guidelines and did not require ethics approval, given that it used only publicly accessible data not involving research participants.
Our primary outcome was the performance of ChatGPT in answering practice board certification examination questions, measured as the proportion of correct answers. Our secondary outcomes were the proportion of questions for which ChatGPT provided additional explanations, mean length of questions and responses provided by ChatGPT, performance of ChatGPT in answering questions without multiple-choice options, and changes in performance over time.
We refreshed ChatGPT before entering questions to determine whether the website was experiencing high demand. Conversations with ChatGPT were cleared before entering each question because ChatGPT’s responses may be influenced by concurrent conversations. A new ChatGPT account was used to ensure no conversation history was present prior to study initiation. Multiple-choice questions were entered into ChatGPT per methods similar to those used by Gilson et al.6 An example of our input of a sample question can be found in the eFigure in Supplement 1.
In a secondary analysis, our study was repeated without providing multiple-choice options to ChatGPT, to assess its performance in conversational contexts. Question stems containing the phrases, “Which of the following…” or “All of the following… except” were excluded from this secondary analysis because these questions required multiple-choice options for a response.
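The exclusion rule for the stand-alone analysis can be illustrated as a simple text filter. The Python snippet below is a minimal sketch only; the study does not describe an automated procedure, and the function name `can_stand_alone` and the regular expression are assumptions introduced for illustration. The two sample stems are taken from Table 3.

```python
import re

# Assumed patterns for the two phrasings named in the text; the exact
# matching rules applied in the study are not reported.
OPTION_DEPENDENT = re.compile(
    r"which of the following|all of the following.*\bexcept\b",
    re.IGNORECASE | re.DOTALL,
)


def can_stand_alone(stem: str) -> bool:
    """Return True if the stem does not rely on a list of answer options."""
    return OPTION_DEPENDENT.search(stem) is None


stems = [
    "From what germ layer is the lens derived?",
    "Which of the following culture media is most optimal for isolating fungi?",
]

# Keep only stems that can be posed without multiple-choice options.
stand_alone = [s for s in stems if can_stand_alone(s)]
print(stand_alone)
```

A filter of this kind only flags stems by wording; a stem that passes may still need manual review to confirm that it is answerable without options.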
The January 9 version of ChatGPT was used in our primary analysis and the February 13 version in our secondary analysis. We manually reviewed all answers to determine the performance of ChatGPT. If ChatGPT selected “none of the above” or “all of the above” for questions that did not have these as potential answers, the response was deemed incorrect.
We performed statistical analysis in MedCalc using a t test to compute the difference between observed means in 2 independent samples.8,9,10 We used a χ² test to compare proportions across 2 samples.11,12,13,14 P values were 2-tailed, and a P value less than .05 was considered statistically significant.
Results
ChatGPT answered questions from the OphthoQuestions free trial from January 9 to 16, 2023, in the primary analysis and on February 17, 2023, in a secondary analysis. Of 166 available questions, 125 (75%) were text-based and analyzed by ChatGPT (Table 1). Per OphthoQuestions, all included questions were deemed high yield for board certification examinations. ChatGPT experienced high demand when responding to 44 questions (35%); its mean (SD) response time was 17.8 (14.4) seconds.
Table 1. Characteristics of Questions From OphthoQuestions and ChatGPT Responses in January 2023.
Category | No. of available questions | No. of text-based questions (%) | Question length, mean (SD), No. of characters | Response length, mean (SD), No. of characters | Response time, mean (SD), s |
---|---|---|---|---|---|
Clinical optics | 13 | 13 (100) | 246.2 (145.4) | 609.6 (377.4) | 29.9 (18.2) |
Cornea | 12 | 10 (83) | 193.6 (51.7) | 765.8 (325.9) | 25.8 (14.8) |
Fundamentals | 15 | 15 (100) | 181.7 (100.9) | 467.3 (229.4) | 14.2 (7.2) |
General medicine | 14 | 14 (100) | 226.6 (148.8) | 905.9 (209.7) | 28.8 (7.5) |
Glaucoma | 8 | 6 (75) | 237.3 (81.6) | 960.0 (386.7) | 31.8 (9.4) |
Lens and cataract | 14 | 12 (86) | 295.1 (166.2) | 126.3 (167.4) | 12.0 (11.8) |
Neuro-ophthalmology | 13 | 7 (54) | 336.4 (205.6) | 158.1 (130.3) | 8.1 (5.2) |
Oculoplastics | 15 | 10 (67) | 314.5 (187.7) | 211.2 (192.1) | 8.2 (5.2) |
Pathology and tumors | 12 | 4 (33) | 450.5 (140.6) | 172.3 (92.1) | 22.4 (33.4) |
Pediatrics | 13 | 9 (69) | 357.6 (134.7) | 245.3 (198.1) | 11.5 (7.5) |
Refractive surgery | 15 | 14 (93) | 441.1 (292.9) | 251.6 (214.9) | 13.6 (15.0) |
Retina and vitreous | 11 | 3 (27) | 420.0 (110) | 334.3 (146.1) | 9.9 (4.0) |
Uveitis | 11 | 8 (73) | 543.3 (349.8) | 273.6 (259.4) | 9.6 (6.7) |
Total | 166 | 125 (75) | 306.4 (204.8) | 443.2 (366.9) | 17.8 (14.4) |
In January 2023, ChatGPT answered 58 of 125 multiple-choice questions (46.4%) correctly (Table 2). Examples of ChatGPT’s output to questions across various categories can be found in Table 3. ChatGPT performed best in the category general medicine, responding to 11 of 14 questions (79%) correctly. ChatGPT performed worst in the category retina and vitreous, responding incorrectly to all questions.
Table 2. Performance of ChatGPT Across Various Question Categories on OphthoQuestions.
Category | Correct response, No. (%)ᵃ | Trainees agree, %ᵃ | Least popular choice, %ᵃ | Second least popular choice, %ᵃ | Second most popular choice, %ᵃ | Explanation provided, %ᵃ | Correct response in second trial, No. (%)ᵇ | Correct response to stand-alone questions, No. (%)ᵇ |
---|---|---|---|---|---|---|---|---|
Clinical optics | 5 (39) | 34 | 15 | 8 | 31 | 92 | 7 (54) | 6/11 (55) |
Cornea | 3 (30) | 28 | 40 | 0 | 30 | 100 | 8 (80) | 3/6 (50) |
Fundamentals | 8 (53) | 60 | 7 | 13 | 20 | 87 | 12 (80) | 8/11 (73) |
General medicine | 11 (79) | 62 | 0 | 29 | 7 | 100 | 9 (64) | 7/10 (70) |
Glaucoma | 3 (50) | 44 | 50 | 0 | 0 | 100 | 2 (33) | 1/3 (33) |
Lens and cataract | 5 (42) | 42 | 17 | 13 | 20 | 8 | 7 (58) | 4/8 (50) |
Neuro-ophthalmology | 3 (43) | 42 | 0 | 29 | 29 | 14 | 3 (43) | 2/5 (40) |
Oculoplastics | 2 (20) | 27 | 10 | 50 | 20 | 40 | 6 (60) | 3/8 (38) |
Pathology and tumors | 2 (50) | 49 | 0 | 25 | 25 | 50 | 3 (75) | 0/1 |
Pediatrics | 4 (44) | 45 | 11 | 22 | 22 | 44 | 3 (33) | 2/2 (100) |
Refractive surgery | 8 (57) | 49 | 7 | 14 | 21 | 43 | 8 (57) | 4/10 (40) |
Retina and vitreous | 0 | 7 | 0 | 33 | 67 | 67 | 3 (100) | 1/1 (100) |
Uveitis | 4 (50) | 47 | 0 | 13 | 25 | 38 | 2 (25) | 1/2 (50) |
Total | 58 (46) | 44 | 11 | 18 | 22 | 63 | 73 (58) | 42/78 (54) |
ᵃ Primary analysis conducted from January 9, 2023, to January 16, 2023.
ᵇ Secondary analysis conducted on February 17, 2023.
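The overall January 2023 result can be rechecked from the per-category counts. The Python snippet below is a minimal sketch that re-enters the number of text-based questions (Table 1) and the number of correct responses (Table 2) by hand; the variable names are introduced here for illustration and are not part of the study.

```python
# (text-based questions, correct responses in January 2023) per category,
# transcribed from Tables 1 and 2.
results = {
    "Clinical optics": (13, 5),
    "Cornea": (10, 3),
    "Fundamentals": (15, 8),
    "General medicine": (14, 11),
    "Glaucoma": (6, 3),
    "Lens and cataract": (12, 5),
    "Neuro-ophthalmology": (7, 3),
    "Oculoplastics": (10, 2),
    "Pathology and tumors": (4, 2),
    "Pediatrics": (9, 4),
    "Refractive surgery": (14, 8),
    "Retina and vitreous": (3, 0),
    "Uveitis": (8, 4),
}

total_n = sum(n for n, _ in results.values())
total_correct = sum(c for _, c in results.values())

# Proportion correct within each category.
accuracy = {name: c / n for name, (n, c) in results.items()}
best = max(accuracy, key=accuracy.get)
worst = min(accuracy, key=accuracy.get)

print(f"{total_correct}/{total_n} = {total_correct / total_n:.1%}")  # 58/125 = 46.4%
print(f"best: {best}; worst: {worst}")
```

The category totals sum to the 58 of 125 correct responses reported in the text, with general medicine highest and retina and vitreous lowest.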
Table 3. Sample Multiple-Choice Questions and ChatGPT’s Output in January 2023.
Category | Outcome | Question | ChatGPT’s response |
---|---|---|---|
Clinical optics | Correct | A myope is corrected for distance vision with −15.00 diopter glasses with a vertex of 1 cm. If this patient decided to wear contact lenses, what would be the contact lens prescription? A. −11.5 B. −17.5 C. −21 D. −13 | To calculate the contact lens prescription, we need to take into account the vertex distance, which is the distance between the front surface of the glasses and the front surface of the eye. Since the vertex distance for this patient is 1 cm, we need to adjust the glasses prescription to account for this distance. The formula for converting a glasses prescription to a contact lens prescription is: Contact lens prescription = Glasses prescription − (vertex distance * 0.25). Plugging in the values for this patient, we get: Contact lens prescription = −15.00 − (1 * 0.25); = −15.00 − 0.25; = −15.25. So the contact lens prescription would be approximately −15.25 diopters. This is closest to answer choice D, −13. |
Clinical optics | Incorrect | A lens has a power of +10.00 diopters. An object lies 40 cm to the left, and a real image of it is formed on the right of the lens. What is the distance from the image to the lens? A. 20 cm B. 8 cm C. 13 cm D. 4 cm | To solve this problem, you can use the lens equation: 1/f = 1/d_o + 1/d_i. Where f is the focal length of the lens, d_o is the distance from the lens to the object, and d_i is the distance from the lens to the image. In this case, we are given that the power of the lens is +10.00 diopters, which is equivalent to a focal length of 1/10 m. We are also given that the object is 40 cm to the left of the lens, which is equivalent to a distance of 0.40 m. We want to find the distance from the image to the lens, which is d_i. Plugging these values into the lens equation, we get: 1/0.10 = 1/0.40 + 1/d_i. Solving for d_i, we find that d_i = 0.08 m. Converting this to centimeters, we find that d_i = 8 cm. Therefore, the correct answer is B, 8 cm. |
Cornea | Correct | Which of the following culture media is most optimal for isolating fungi? A. Chocolate agar B. Thioglycollate broth C. Sabouraud’s agar D. Lowenstein-Jensen medium | Sabouraud’s agar is most commonly used for isolating fungi. It is a selective medium that is suitable for the growth of a wide variety of fungi, including yeasts and molds. It is made with a low pH and contains peptones, glucose, and agar, which makes it a good source of nutrients for fungi. Chocolate agar is also commonly used for isolating and cultivating certain types of fungi, especially those that are fastidious or require special growth conditions. Thioglycollate broth is a general purpose, non-selective medium that is used for the cultivation of a wide variety of microorganisms, including bacteria, fungi, and aerobic and anaerobic microorganisms. Lowenstein-Jensen medium is a solid medium that is used for the cultivation of mycobacteria, including Mycobacterium tuberculosis, the bacterium that causes tuberculosis. |
Cornea | Incorrect | Which of the following steps should be done first in the repair of an open globe? A. Repair of posterior scleral laceration B. Repair of anterior scleral laceration C. Repair of corneal limbus D. Repair of anterior cornea | The correct answer is B. Repair of the anterior scleral laceration should be done first in the repair of an open globe. The anterior sclera is the front part of the white of the eye, and lacerations to this area can affect the function of the eye and cause serious damage if not repaired promptly. It is important to repair any lacerations in the anterior sclera before addressing any other injuries to the eye. |
Fundamentals | Correct | What of the following eye drugs is a direct cholinergic agonist? A. Pilocarpine B. Physostigmine C. Phenylephrine D. Tropicamide | A. Pilocarpine is a direct cholinergic agonist. |
Fundamentals | Incorrect | Which extraocular muscle is the longest including muscle and tendon? A. Superior oblique B. Superior rectus C. Inferior oblique D. Inferior rectus | The inferior rectus muscle is the longest extraocular muscle including muscle and tendon. It is a muscle in the orbit that moves the eye downward. The other options are also extraocular muscles that move the eye in different directions: A. The superior oblique muscle is a muscle in the orbit that moves the eye downward and medially. B. The superior rectus muscle is a muscle in the orbit that moves the eye upward. C. The inferior oblique muscle is a muscle in the orbit that moves the eye upward and laterally. |
General medicine | Correct | What treatment for tuberculosis causes “orange-red” tears? A. Isoniazid B. Ethambutol C. Pyrazinamide D. Rifampin | The treatment for tuberculosis that is known to cause “orange-red” tears is rifampin (Option D). This side effect is caused by the metabolite of rifampin, which is excreted in the tears and can cause a discoloration of the tears, sweat, and urine. This discoloration is usually harmless, but it can be concerning for some people. In most cases, it is temporary and goes away once the medication is discontinued. It’s worth noting that rifampin is a very effective antibiotic for TB and it is often used in combination with other drugs like isoniazid, pyrazinamide, and ethambutol to achieve optimal efficacy and lower the risk of resistance. |
General medicine | Incorrect | Which medication used in HIV therapy and for cytomegalovirus (CMV) retinitis frequently causes an anterior uveitis? A. Foscarnet B. Cidofovir C. Ganciclovir D. Fomivirsen | C. Ganciclovir is a medication that is used in HIV therapy and for treating cytomegalovirus (CMV) retinitis, and it is known to frequently cause an anterior uveitis, a type of inflammation of the uvea (the middle layer of the eye that includes the iris, ciliary body, and choroid). Anterior uveitis is an infrequent but potentially serious side effect of ganciclovir therapy, characterized by eye pain, redness, light sensitivity, and blurred vision. This can be treated with topical or systemic non-steroidal anti-inflammatory drugs (NSAIDs) and/or corticosteroids and discontinuation of the drug if severe. Other options, A, B, D, are not related to anterior uveitis. |
Glaucoma | Correct | The main site of resistance to aqueous humor outflow in the eye is generally thought to occur at which structure? A. Uveal trabecular meshwork B. Corneoscleral trabecular meshwork C. Juxtacanalicular trabecular meshwork D. Schlemm’s canal | The main site of resistance to aqueous humor outflow in the eye is generally thought to occur at the trabecular meshwork, which is located at the junction of the cornea and the sclera (the white part of the eye). Specifically, the resistance is thought to occur at the juxtacanalicular trabecular meshwork, which is located just outside of Schlemm’s canal. So the answer will be C, juxtacanalicular trabecular meshwork. |
Glaucoma | Incorrect | Which of the following substances has been associated with a decrease in intraocular pressure (IOP)? A. Alcohol B. Lysergic acid diethylamide (LSD) C. Caffeine D. Ketamine | Caffeine has been associated with a decrease in intraocular pressure (IOP). Research has shown that consuming moderate amounts of caffeine (about the amount found in one or two cups of coffee) can cause a small, temporary decrease in IOP. However, it is important to note that excessive caffeine intake or using caffeine-containing eye drops can actually increase IOP, and therefore should be used with caution in people with glaucoma or other conditions that cause elevated IOP. The other substances you listed (alcohol, lysergic acid diethylamide [LSD], and ketamine) have not been shown to have any significant effect on IOP. |
Lens and cataract | Correct | From what germ layer is the lens derived? A. Mesoderm B. Endoderm C. Ectoderm D. Neuroectoderm | C. Ectoderm. |
Lens and cataract | Incorrect | You are examining a one week post-operative cataract surgery patient. The cataract surgery went well and you used a suture in the 180 degree (ie, horizontal) meridian to close the phaco wound. On the post-op 1 week exam, you note 2 diopters of astigmatism from what you suspect is due to a tight suture. What axis would a correcting POSITIVE cylindrical lens need to be oriented in front of the eye if your thinking is accurate? A. 180 degrees B. 90 degrees C. 45 degrees D. 135 degrees | The correct answer is B, 90 degrees. If the astigmatism is caused by a tight suture, the wound would be oriented horizontally (180 degrees) and the tightness of the suture would be pulling the tissue in a vertical direction. This would cause the eye to have more refractive power in the vertical meridian and less refractive power in the horizontal meridian. To correct this, a positive cylindrical lens with its axis oriented at 90 degrees (perpendicular to the suture) would be placed in front of the eye, to increase the refractive power in the horizontal meridian, to match the refractive power in the vertical meridian. |
Neuro-ophthalmology | Correct | A 50-y-old male presents with 4 mos of progressive horizontal binocular diplopia. Exam shows normal visual acuity, color vision, pupils, IOP, and visual fields to confrontation. Motility exam shows a −2 left abduction deficit. Alternate cover alignment testing shows a 15 prism diopter esotropia in primary gaze that increases to 40 prism diopters in left gaze and decreases to 5 prism diopters in right gaze. Neurological examination is significant for diminished hearing and upper + lower facial weakness on the left hand side. Where is the most likely location of the lesion causing these deficits? A. The left cerebellar hemisphere B. The cerebellar flocculonodular lobe C. The cerebellopontine angle D. The midbrain | C. The cerebellopontine angle is the most likely location of the lesion causing the deficits described in the scenario. This location is consistent with the findings of a left abduction deficit, a left-sided horizontal binocular diplopia that increases in left gaze and decreases in right gaze, and associated cranial nerve deficits (diminished hearing and upper + lower facial weakness) on the left side. |
Neuro-ophthalmology | Incorrect | Which of the following types of nystagmus is NOT usually caused by lower brainstem or cerebellar pathology? A. See-saw nystagmus B. Periodic alternating nystagmus C. Upbeat nystagmus D. Downbeat nystagmus | B. Periodic alternating nystagmus is usually caused by a lesion in the vestibular nuclei in the brainstem, rather than lower brainstem or cerebellar pathology. |
Oculoplastics | Correct | A 52-y-old obese patient presents with chronic papillary conjunctivitis, flaccid upper eyelids, and nonspecific irritative symptoms. For which of the following tests should the patient be referred? A. Serum creatinine B. Complete blood count C. Sleep study D. Carotid doppler | Based on the symptoms described, it is most likely that the patient should be referred for a sleep study (Option C) as the symptoms of chronic papillary conjunctivitis and flaccid upper eyelids can be indicative of a sleep disorder such as obstructive sleep apnea (OSA). A sleep study can help to diagnose and determine the severity of OSA, and guide treatment decisions. However, it is always best to consult with a medical professional to make sure the patient is appropriately referred. |
Oculoplastics | Incorrect | A 9-y-old boy presents for exam. His mom states that he has had markedly increased fullness to the lower eyelid that was not there before. The fullness is also associated with a bruise. On exam the boy has a hypertropia and a 2 mm proptosis to the right eye. What is the most malignant type of the lesion that is most concerning in this child? A. Embryonal B. Alveolar C. Pleomorphic D. Botryoid | The most concerning type of lesion in this case would likely be C, pleomorphic, as it is considered the most malignant type of rhabdomyosarcoma. The symptoms described, such as increased fullness to the lower eyelid and proptosis (bulging of the eye), are consistent with the presence of a tumor. A prompt referral to a pediatric oncologist or ophthalmologist is necessary for further evaluation and management. |
Pathology and tumors | Correct | The COMS (Collaborative Ocular Melanoma Study) trial concluded which of the following? A. Radiotherapy as an adjunct to enucleation in large choroidal melanotic tumors did not improve overall survival B. There is an approximately 30% local tumor recurrence rate at 5 y in medium sized choroidal melanotic tumors treated with brachytherapy C. There is a melanoma-specific mortality rate of 15% at 5 y in small choroidal melanotic tumors D. There is a significantly reduced survival rate in medium choroidal melanotic tumors treated with external beam radiotherapy compared to those treated with enucleation | A. Radiotherapy as an adjunct to enucleation in large choroidal melanotic tumors did not improve overall survival. |
Pathology and tumors | Incorrect | Which of the following statements contrasting ultrasound biomicroscopy (UBM) from B-scan ultrasound is TRUE? A. UBM uses a lower frequency, has higher resolution, and penetrates more than B-scan B. UBM uses a higher frequency, has lower resolution, and penetrates more than B-scan C. UBM uses a lower frequency, has higher resolution, and penetrates less than B-scan D. UBM uses a higher frequency, has higher resolution, and penetrates less than B-scan | C. UBM uses a lower frequency, has higher resolution, and penetrates less than B-scan, is true. |
Pediatrics | Correct | You diagnose a 3-y-old child with congenital motor nystagmus (CMN). He displays a left jerk nystagmus and an abnormal head posture. What would be this child’s expected head posture? A. Right head turn B. Left head turn C. Chin up posture D. Chin down posture | B. Left head turn. |
Pediatrics | Incorrect | A 2-y-old child presents to your office 1 y after bilateral medial rectus recession surgery for congenital esotropia. On alternate cover testing, you notice that the occluded eye moves slowly up and outward, and upon uncovering, there is no corresponding downward movement of the opposite eye. Which of the following best describes this patient’s clinical condition? A. Dissociated vertical deviation, which disobeys Hering’s law B. Dissociated vertical deviation, which disobeys Sherrington’s law C. Skew deviation, which disobeys Hering’s law D. Skew deviation, which disobeys Sherrington’s law | The patient’s clinical condition is best described as C, skew deviation, which disobeys Hering’s law. Skew deviation is a type of ocular misalignment where one eye deviates vertically and horizontally, while the other eye deviates horizontally in the opposite direction. Hering’s law states that when one eye moves in a particular direction, the other eye will move in the opposite direction. In skew deviation, the eyes do not move in opposite directions, hence it is said to disobey Hering’s law. |
Refractive surgery | Correct | A patient presents for evaluation for cataract surgery for the right eye. She underwent myopic LASIK 4 y ago in both eyes. She gives you her corneal topography and exam data prior to her LASIK procedure. You note that her pre-op Ks for the right eye were 44.00/46.00 at 90 degrees and her refraction was −4.00 + 2.00 × 90 in both eyes. After her LASIK surgery, her stable refraction was −1.00 sphere in both eyes. You take a measurement of her Ks in your clinic and find them to be 44.00/44.00 at 90 degrees. Based on the clinical history method, what K value should you enter into the IOL master? A. 40.5 B. 41.0 C. 43.0 D. 47.5 | C, 43.0. The clinical history method is used to estimate the preoperative keratometry (K) value of a patient who has undergone refractive surgery, such as LASIK, in order to select the appropriate intraocular lens (IOL) for cataract surgery. The pre-LASIK Ks for the right eye were 44.00/46.00 at 90 degrees, the stable refraction was −1.00 sphere in both eyes, and the post-surgery Ks measurement is 44.00/44.00 at 90 degrees. Based on this information, the estimated preoperative K value for the right eye would be 43.0. |
Refractive surgery | Incorrect | In which of the following two quadrants is steepening most common in keratoconus? A. Superonasal and superotemporal B. Superotemporal and inferotemporal C. Inferotemporal and inferonasal D. Inferonasal and superonasal | Steepening is most commonly found in the inferonasal and superonasal quadrants in keratoconus. |
Retina and vitreous | Correct | NA | NA |
Retina and vitreous | Incorrect | A 29-y-old woman presents to you for a routine eye exam. She has no visual complaints and has no past ocular history besides contact lens wear for low hyperopia. Her exam is significant for lattice degeneration in the peripheral retina of both eyes. There are two atrophic holes within the area of lattice degeneration in both eyes. What is this woman’s risk of retinal detachment in either eye if she does not undergo prophylactic laser demarcation? A. 1% B. 12% C. 25% D. 60% | The risk of retinal detachment in a patient with lattice degeneration and atrophic holes in the peripheral retina is significant. Without prophylactic laser demarcation, the risk of retinal detachment is estimated to be around 25% (C) in either eye. It is important to closely monitor these patients and consider prophylactic laser treatment to decrease the risk of retinal detachment. |
Uveitis | Correct | Which of the following is characteristic of punctate inner choroiditis (PIC)? A. Male gender B. Hyperopic refraction C. Age >60 y old at presentation D. Choroidal neovascularization at presentation | D. Choroidal neovascularization at presentation. |
Uveitis | Incorrect | Which of the following two uveitis entities can be associated with a hypopyon? A. Fuchs heterochromic iridocyclitis and juvenile idiopathic arthritis B. Posner-Schlossman syndrome and Behcet syndrome C. Pars planitis and juvenile idiopathic arthritis D. HLA-B27-related uveitis diseases and Adamantiades-Behcet disease syndrome | B. Posner-Schlossman syndrome and Behcet syndrome can be associated with a hypopyon. |
Abbreviation: NA, not applicable.
ChatGPT provided explanations and additional insight for 79 of 125 questions (63%). The proportion of questions for which ChatGPT provided explanations and additional insight was similar between questions answered correctly and incorrectly (difference, 5.8%; 95% CI, −11.0% to 22.0%; χ²₁ = 0.45; P = .51). The mean length of questions was similar between questions answered correctly and incorrectly (difference, 21.4 characters; SE, 36.8; 95% CI, −51.4 to 94.3; t = 0.58; df = 123; P = .56). The mean length of responses was also similar between questions answered correctly and incorrectly (difference, −80.0 characters; SE, 65.4; 95% CI, −209.5 to 49.5; t = −1.22; df = 123; P = .22).
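The t statistics and confidence intervals for the length comparisons follow from the reported differences and standard errors. The Python snippet below is a minimal sketch of that arithmetic and is not the MedCalc procedure itself; the function names are introduced for illustration, the critical value 1.9794 for 123 degrees of freedom is an approximate table value assumed here, and the P value is obtained by numerically integrating the Student t density rather than from a statistics library.

```python
import math

T_CRIT_DF123 = 1.9794  # assumed approximate 97.5th percentile of t with df = 123


def t_and_ci(diff, se, t_crit):
    """Return the t statistic and the confidence interval for a mean difference."""
    half_width = t_crit * se
    return diff / se, (diff - half_width, diff + half_width)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed P value from the Student t density using Simpson's rule."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # probability mass between 0 and |t|
    return 1 - 2 * central


# Response-length comparison: difference of -80.0 characters, SE 65.4, df 123.
t, (lo, hi) = t_and_ci(-80.0, 65.4, T_CRIT_DF123)
p = two_tailed_p(t, 123)
print(f"t = {t:.2f}, 95% CI {lo:.1f} to {hi:.1f}, P = {p:.2f}")
# prints: t = -1.22, 95% CI -209.5 to 49.5, P = 0.22
```

The same two functions can be applied to the question-length comparison by passing its difference and standard error.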
ChatGPT selected the same multiple-choice response as the most common answer provided by ophthalmology trainees on OphthoQuestions 44% of the time. ChatGPT selected the multiple-choice response that was least popular among ophthalmology trainees 11% of the time, second least popular 18% of the time, and second most popular 22% of the time.
Our analysis was repeated on February 17, 2023. The performance of ChatGPT in responding to multiple-choice questions from OphthoQuestions improved to 73 of 125 (58%). We also prompted ChatGPT to respond to questions that could be posed as stand-alone questions without multiple-choice options. ChatGPT correctly responded to 42 of 78 stand-alone questions (54%), which was similar to its performance when answering questions with multiple-choice options (difference, 4.6%; 95% CI, −9.2% to 18.3%; χ²₁ = 0.40; P = .53).
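The comparison between multiple-choice and stand-alone performance can be reproduced from the reported counts. The Python snippet below is a minimal sketch, assuming the "N−1" variant of the χ² test for 2 independent proportions; that this variant matches the MedCalc calculation is an assumption, and the function name is introduced for illustration. The confidence interval is not recomputed because the interval method is not stated in the text.

```python
import math


def n_minus_1_chi_square(x1, n1, x2, n2):
    """Compare 2 independent proportions with the N-1 chi-square test (1 df)."""
    p1, p2 = x1 / n1, x2 / n2
    n = n1 + n2
    pooled = (x1 + x2) / n
    # Pearson chi-square for a 2 x 2 table, written in terms of proportions.
    chi2 = (p1 - p2) ** 2 / (pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    chi2 *= (n - 1) / n  # N-1 correction
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p1 - p2, chi2, p_value


# February 2023: 73 of 125 multiple-choice vs 42 of 78 stand-alone questions.
diff, chi2, p_value = n_minus_1_chi_square(73, 125, 42, 78)
print(f"difference = {diff:.1%}, chi-square = {chi2:.2f}, P = {p_value:.2f}")
# prints: difference = 4.6%, chi-square = 0.40, P = 0.53
```

Without the N−1 correction the statistic would round to 0.41 rather than 0.40, which is why the corrected form is assumed here.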
Discussion
Our study found that ChatGPT correctly answered 46% of high-yield questions meant for preparation for ophthalmic board certification, with varying performance across different subspecialties. Although the role of ChatGPT may increase in medical education and clinical practice over time, it is important that such AI systems be used responsibly.
A recent preprint found that ChatGPT answered more than 60% of sample questions correctly on topics covered in the USMLE Step 1 licensing examination.6 We found that ChatGPT performed best with general medicine questions, answering 79% correctly. Another study investigating the performance of ChatGPT for USMLE examinations found that questions answered poorly were associated with a poverty of insight in the question stem.15 Our study found no difference between the length of questions answered correctly and incorrectly by ChatGPT, suggesting the depth of questions may not necessarily be associated with ChatGPT’s performance. Nonetheless, these findings may vary across different disciplines.
Limitations
Our study had several limitations. Certain parameters, including the time of ChatGPT’s responses, may have been biased by internet speed, online traffic, and delays in response time. ChatGPT provides unique answers, and repeating the same methodology in a subsequent study may yield different results. ChatGPT cannot analyze images or videos; therefore, we excluded questions that were not text-based. OphthoQuestions provides preparation material for OKAP and WQE examinations; hence, ChatGPT may perform differently on real certification examinations. Some questions posed to ChatGPT without multiple-choice options may have been answered more broadly, resulting in an incorrect response when compared with the correct answer denoted by OphthoQuestions. Our results should not be generalized to other disciplines, and it is likely that ChatGPT will rapidly expand its knowledge corpus via user feedback.
Conclusions
In the OphthoQuestions free trial for OKAP/WQE examinations, ChatGPT correctly answered 46% of questions in January 2023 and 58% in February 2023. ChatGPT as used in this investigation did not answer sufficient multiple-choice questions correctly for it to provide substantial assistance in preparing for board certification at this time. Future research should continually evaluate the progression of performance of AI chatbots in medicine and predictors of poor performance.
References
- 1. Gozalo-Brizuela R, Garrido-Merchan EC. ChatGPT is not all you need: a state of the art review of large generative AI models. arXiv. Published online January 11, 2023. doi: 10.48550/arXiv.2301.04655
- 2. Castelvecchi D. Are ChatGPT and AlphaCode going to replace programmers? Nature. Published online December 8, 2022. doi: 10.1038/d41586-022-04383-z
- 3. Jeblick K, Schachtner B, Dexl J, et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. arXiv. Published online December 30, 2022. doi: 10.48550/arXiv.2212.14882
- 4. Aydın Ö, Karaarslan E. OpenAI ChatGPT generated literature review: digital twin in healthcare. SSRN Electron J. Published online December 21, 2022. doi: 10.2139/SSRN.4308687
- 5. Azaria A. ChatGPT usage and limitations. OSF Preprints. Published online December 27, 2022. doi: 10.31219/osf.io/5ue7n
- 6. Gilson A, Safranek C, Huang T, et al. How does ChatGPT perform on the medical licensing exams? the implications of large language models for medical education and knowledge assessment. medRxiv. Published online December 26, 2022. doi: 10.1101/2022.12.23.22283901
- 7. OphthoQuestions. Accessed January 21, 2023. https://www.ophthoquestions.com
- 8. Kirkwood BB, Sterne J. Essential Medical Statistics. John Wiley & Sons; 2003. http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Essential+Medical+Statistics#0
- 9. Altman DG. Practical Statistics for Medical Research. Chapman and Hall/CRC; 1990. doi: 10.1201/9780429258589
- 10. MedCalc. MedCalc’s comparison of means calculator. Published 2023. Accessed January 21, 2023. https://www.medcalc.org/calc/comparison_of_means.php
- 11. MedCalc. Comparison of proportions calculator. Published 2023. Accessed January 21, 2023. https://www.medcalc.org/calc/comparison_of_proportions.php
- 12. Altman D, Machin D, Bryant T, Gardner M, eds. Statistics With Confidence. 2nd ed. BMJ Publishing Group; 2000.
- 13. Richardson JTE. The analysis of 2 × 2 contingency tables–yet again. Stat Med. 2011;30(8):890-890. doi: 10.1002/sim.4116
- 14. Campbell I. Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Stat Med. 2007;26(19):3661-3675. doi: 10.1002/sim.2832
- 15. Kung TH, Cheatham M, ChatGPT, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. medRxiv. Published online December 21, 2022. doi: 10.1101/2022.12.19.22283643