Abstract
This study compares the performance of first- and second-year medical students with that of 2 models of a popular chatbot on free-response clinical reasoning examinations.
A popular chatbot is an interface for the generative pretrained transformer (GPT) large language model, an artificial intelligence (AI) system that generates humanlike text in response to user input.1 When an upgrade from the initial model to a newer model was evaluated, the newer model exceeded the passing threshold on multiple-choice questions that simulate the US Medical Licensing Examination.2 This cross-sectional study examined how well the chatbot responded to free-response, multiphase, case-based questions, which better reflect real-life synthesis and application of data and have broader implications for the instruction and assessment of clinical reasoning. Using clinical reasoning final examinations given to first- and second-year students at Stanford School of Medicine, we compared the performance of students vs 2 chatbot models.
Methods
We selected 14 clinical cases used from March 2019 to December 2022 for clinical reasoning final examinations for first- and second-year medical and physician assistant students who received identical clinical reasoning training at our academic medical center. This study was approved by the Stanford institutional review board, which waived informed consent under 45 CFR §46. We followed the STROBE reporting guideline.
Cases consisted of discrete, sequential passages, separated by 2 to 7 free-response questions that assessed multiple clinical reasoning skills (Box). Each case received preliminary runs through the chatbot (ChatGPT; OpenAI) to allow minor, iterative rewording of question stems to ensure relevant responses.3 Between January and April 2023, each case was run twice through GPT 3.5 (model 3.5) and twice through GPT 4 (model 4), and 2 faculty members independently graded the responses according to rubrics from the original examinations to produce a mean final score for each model. The passing threshold for cases was predefined as 70% or higher. Historical student examination scores for each question were used for comparison. We reported 95% CIs of differences for comparisons and used χ2 tests to compare passing rates and 2-independent-sample t tests to compare scores. To assess variation in chatbot responses, we selected 1 high-complexity case to run 20 times using both chatbot models. Two-sided P < .05 was considered significant. Analyses were performed using R, version 4.2.1.
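As a minimal sketch of these comparisons in R (the analysis environment named above), the following snippet shows the form of the 2-independent-sample t test and χ2 test used; the score vectors and pass/fail counts are hypothetical placeholders, not the study data.

```r
# Hypothetical per-case percentage scores (placeholders, not the study data)
model4_scores  <- c(76.8, 90.6, 88.7, 70.2, 87.8)   # chatbot model 4, by case
student_scores <- c(75.9, 83.1, 78.2, 82.1, 83.0)   # historical student means, by case

# 2-independent-sample t test: difference in mean scores with a 95% CI
t.test(model4_scores, student_scores)

# Chi-square test comparing passing rates at the predefined 70% threshold
# (hypothetical pass/fail counts, not the study data)
passing <- matrix(c(13, 1,    # model 4: pass, fail
                    12, 2),   # students: pass, fail
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("model 4", "students"), c("pass", "fail")))
chisq.test(passing)
```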
Box. Sample Case Scenario^a
A man aged 32 years with HIV infection presenting with headache, fever, and confusion
Prompt 4 (of 6 Prompts for This Case)
The following are his routine labs.
His white blood cell count is elevated at 14 500 per microliter. His hemoglobin is normal. His platelet count is elevated at 525 000 per microliter.
His basic metabolic panel is normal. His liver function tests are also normal.
His chest x-ray is normal.
His head CT is normal.
Compose a summary of this case in 200 words or less, including a statement as to the most likely diagnosis.
Grading Rubric (12 Points)
There are countless potential summary statements for this patient. Consider the following as a guide, but graders may use their own discretion to slightly modify this based on the student's response.
1 Point for including age and gender
1 Point for including history of HIV infection
1 Point each for including the 3 primary symptoms: headache, fever, and confusion
1 Point for including at least 1 semantic qualifier (eg, progressive, constant)
1 Point for including tachycardia and/or hypotension
1 Point for including any other relevant examination findings
1 Point for including the elevated WBC count
1 Point for including the normal head CT findings
2 Points for coming up with the correct leading diagnosis of a CNS infection (including meningitis)
Deduct 2 points for exceeding 200 words
Deduct 1 point for each completely irrelevant piece of information mentioned, up to −3 points maximum
An example of a case summary that would receive full credit:
The patient is a 32 y/o man with a history of HIV, who is presenting with acute, progressive headache, confusion, and fever. His exam reveals tachycardia, hypotension, inattentiveness, disorientation, and impaired memory. His cardiac, pulmonary, and abdominal exams are normal. Routine labs are only notable for a modest leukocytosis, and both a chest x-ray and head CT are normal. The most likely diagnosis to cause confusion, headache, and sepsis in an immunosuppressed individual is a CNS infection, particularly bacterial or viral meningitis. However, given the patient's immunosuppression, opportunistic infections should also be considered.
Chatbot Model 4 Answer
A 32-year-old HIV-positive man presented to the emergency department with a 24-hour history of progressively worsening headache, accompanied by chills, subjective fever, and confusion. His HIV infection was controlled with combination therapy, and his CD4 count was mildly low at 400 cells/mm3 six months ago. The patient had no history of recent travel or known sick contacts. On physical examination, he was acutely ill and in moderate distress, with elevated temperature, tachycardia, and hypotension. His neurologic examination revealed inattentiveness, disorientation, slow speech, and impaired short-term memory. There were no signs of nuchal rigidity or focal neurological deficits. Laboratory findings showed an elevated white blood cell count, normal hemoglobin, and elevated platelet count. Basic metabolic panel, liver function tests, and chest x-ray were normal, as was the head CT.
Based on the patient's history, symptoms, and laboratory findings, the most likely diagnosis is meningitis. The patient's immunocompromised status and symptoms, including fever, headache, and confusion, are consistent with a meningeal infection. The absence of nuchal rigidity doesn't rule out meningitis, as it may not be present in all cases, particularly in immunocompromised patients. Further diagnostic tests, such as lumbar puncture, should be pursued for confirmation, and appropriate management should be implemented under the guidance of a health care professional.
Score: 10/12 (exceeded the 200-word limit)
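For clarity on how the rubric arithmetic yields the 10/12 score above, here is a minimal sketch in R; the scoring function and item flags are illustrative constructs (assumptions), not part of the study's grading workflow.

```r
# Illustrative tally of the 12-point rubric above (item flags and word count are hypothetical)
score_case_summary <- function(item_points, word_count, irrelevant_items = 0) {
  pts <- sum(item_points)                      # points earned across rubric items
  if (word_count > 200) pts <- pts - 2         # deduct 2 points for exceeding 200 words
  pts <- pts - min(irrelevant_items, 3)        # deduct up to 3 points for irrelevant information
  max(pts, 0)
}

# If the model 4 answer earned every rubric item but exceeded the word limit,
# the tally is 12 - 2 = 10, consistent with the 10/12 shown above.
items <- c(age_gender = 1, hiv_history = 1, headache = 1, fever = 1, confusion = 1,
           semantic_qualifier = 1, tachy_or_hypotension = 1, other_exam_findings = 1,
           elevated_wbc = 1, normal_head_ct = 1, correct_diagnosis = 2)
score_case_summary(items, word_count = 201)    # any count over 200 triggers the deduction
```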
Results
Model 4 scored a mean of 4.2 points (95% CI, 0.7-7.8; P = .02) more than students, while passing rates (93% vs 85%; P = .40) did not differ significantly. Model 4 scored 18 points (95% CI, 12-23; P < .001) more than model 3.5, with a higher passing rate (93% vs 43%; P < .001). For the high-complexity case run 20 times, model 4 had higher scores and lower variation than model 3.5, with a higher passing rate (100% vs 35%; P < .001). Across individual clinical reasoning skills, model 4 outperformed students on creating a problem list by 16 points (95% CI, 11-21; P < .001); performance on all other skills was statistically indistinguishable (Table).
Table. Summary of Cases, Clinical Skills Assessed, and Scores for Chatbot and Student Responses by Case and Clinical Reasoning Skill.
Case description | Total word count | Diagnostic schema^b | Differential diagnosis | Illness scripts^c | Case summary | Problem list | Other^d | Chatbot model 3.5 score, %^a | Chatbot model 4 score, %^a | Student score, mean (SD), %
---|---|---|---|---|---|---|---|---|---|---
Chronic fatigue and anemia | 583 | Yes | Yes | No | No | Yes | Yes | 62.9 | 76.8 | 75.9 (10.7)
Acute abdominal pain and diarrhea | 798 | Yes | Yes | No | No | Yes | Yes | 72.7 | 90.6 | 83.1 (11.5)
Acute confusion and hypertension | 1013 | Yes | Yes | Yes | Yes | Yes | Yes | 65.4 | 88.7 | 78.2 (11.6)
Chronic diarrhea and amenorrhea | 1109 | No | Yes | No | Yes | No | Yes | 65. | 70.2 | 82.1 (11.0)
Subacute fever and abdominal pain | 919 | Yes | Yes | No | No | Yes | Yes | 67.9 | 87.8 | 83.0 (7.7)
Chronic dyspnea | 831 | No | Yes | No | No | Yes | No | 60.4 | 85.7 | 74.1 (14.0)
Acute chest pain | 747 | No | No | No | No | No | Yes | 82.3 | 91.7 | 96.0 (13.0)
Acute RUQ pain | 902 | No | Yes | No | Yes | No | Yes | 61.5 | 81.6 | 77.6 (16.7)
Acute lightheadedness | 885 | No | Yes | No | Yes | Yes | Yes | 41.5 | 70.3 | 80.5 (12.2)
Acute abdominal pain and fever | 1071 | No | Yes | No | Yes | No | Yes | 79.4 | 99.4 | 86.8 (10.3)
Chronic fatigue | 917 | No | Yes | No | Yes | Yes | Yes | 71.4 | 94.7 | 86.2 (8.3)
Subacute confusion | 953 | Yes | Yes | Yes | Yes | Yes | Yes | 71.6 | 93.6 | 78.2 (8.3)
Acute abdominal pain and nausea | 972 | Yes | Yes | No | No | Yes | Yes | 81.1 | 86.5 | 84.1 (10.0)
Subacute dyspnea | 1093 | No | Yes | Yes | No | Yes | Yes | 80.9 | 92.4 | 80.5 (11.2)
Model 4 score, mean (SD), %^e | NA | 89.8 (9.0) | 84.3 (13.8) | 92.0 (7.8) | 81.9 (25.3) | 87.7 (11.2)^f | 86.4 (17.0) | NA | NA | NA
Student score, mean (SD), %^e | NA | 85.4 (17.3) | 86.1 (16.8) | 87.6 (14.8) | 82.2 (21.0) | 71.8 (20.1)^f | 82.8 (20.3) | NA | NA | NA
Abbreviations: NA, not applicable; RUQ, right upper quadrant.
^a Scores listed are the mean score for each case from 2 runs, each graded by 2 independent faculty graders using the same grading rubric.
^b A diagnostic schema is defined as a thorough collection of causes for a specific symptom, organized into categories based on organ system or physiological process.
^c An illness script is defined as a summary of the features of a specific disease, organized into categories such as epidemiology, historical features, examination findings, and relevant test abnormalities.
^d Other assessed clinical skills include diagnostic test selection and interpretation, identification of cognitive biases, discussion of relevant literature search strategies, and interpretation of the significance of physical examination findings.
^e Scores listed are the mean score for questions or prompts that tested a specific clinical reasoning skill, graded by 2 independent faculty graders using the same grading rubric.
^f The difference in mean scores between model 4 and students on problem list–related questions was statistically significant (P < .001).
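As a minimal sketch of the skill-level aggregation described in footnote e, the following R snippet computes a mean (SD) score per clinical reasoning skill from question-level scores; the data frame and its values are hypothetical placeholders, not the study data.

```r
# Hypothetical question-level scores tagged by assessed skill (not the study data)
q <- data.frame(
  skill = c("problem list", "problem list", "case summary", "case summary",
            "differential diagnosis", "differential diagnosis"),
  score = c(90, 85, 82, 81, 88, 84)    # percentage score per question or prompt
)

# Mean (SD) per skill, as reported in the last 2 rows of the Table
aggregate(score ~ skill, data = q,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
```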
Discussion
In our study, model 4 of the chatbot outperformed first- and second-year students on clinical reasoning examinations and showed significant improvement over model 3.5. Our findings underscore challenges and opportunities for medical training and practice and suggest that further dramatic advances are likely.
A chatbot that scores higher than students on clinical reasoning assessments creates challenges for evaluating students’ clinical competency. Closed-book examinations are a potential temporary solution but do not mirror integration of information from different sources. Medical assessments are shifting toward open-book formats, including use of UpToDate on American Board of Internal Medicine recertification examinations.4
A study limitation is that the chatbot’s responses can be sensitive to the wording of prompts. The chatbot demonstrated a different understanding of specific clinical reasoning terms (eg, illness script, problem list), which required revising questions to include an explanation of the term; this approach improved the chatbot’s responses.
Given the abilities of general-purpose chatbot AI systems, medicine should incorporate AI-related topics into clinical training and continuing medical education. Just as the medical community had to learn to use online resources and electronic medical records, its next challenge is learning the judicious use of generative AI to improve patient care.
Footnotes
Abbreviations: CNS, central nervous system; CT, computed tomography; WBC, white blood cell.
SI conversion: To convert WBC count to ×109/L, multiply by 0.001; to convert platelet count to ×109/L, multiply by 1.0.
^a This is a representative excerpt from 1 clinical reasoning case, given to model 4 of the chatbot. It comprises the prompt text provided to the chatbot verbatim, the standardized grading rubric, the chatbot’s actual response, and the score assigned to that response by a grader.
References
1. OpenAI. ChatGPT: optimizing language models for dialogue. Published November 30, 2022. Accessed February 1, 2023. https://openai.com/blog/chatgpt/
2. Nori H, King N, McKinney SM, et al. Capabilities of GPT-4 on medical challenge problems. arXiv. Preprint posted online March 20, 2023. doi:10.48550/arXiv.2303.13375
3. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388(13):1233-1239. doi:10.1056/NEJMsr2214184
4. American Board of Internal Medicine. Doctors maintaining ABIM board certification will soon be able to access an electronic resource they use in practice during periodic knowledge assessments. September 27, 2017. Accessed February 2, 2023. https://www.abim.org/media-center/press-releases/abim-open-book-assessments-will-feature-access-to-uptodate.aspx