Abstract
This prognostic study assesses the ability of a chatbot to write a history of present illness compared with that of senior internal medicine residents.
Large language model (LLM) chatbots have received widespread attention for their ability to produce complex human-like conversational output.1 These models represent a substantial advancement in generative artificial intelligence (AI) with potential applications in many industries.2 Medical documentation is a health care use case worth examining given its notable burden on clinicians.3 In this prognostic study, we evaluated the ability of a chatbot to generate a history of present illness (HPI) compared with that of senior internal medicine residents.
Methods
The study was conducted between January and February 2023 and involved internal medicine senior residents and attending physicians. The HPIs were generated by a chatbot (ChatGPT, January 9, 2023, version; OpenAI) and written by 4 residents, both based on 3 patient interview scripts portraying different types of chest pain (eAppendixes 1 and 2 in Supplement 1).4 The chatbot-generated HPIs were produced using an iterative process known as prompt engineering (eAppendix 3 in Supplement 1). Ten HPIs per script were first generated using a basic prompt and analyzed for errors (eTable 1 in Supplement 1); HPIs were considered acceptable if they contained no errors. The prompt was then modified, and this process was repeated twice. One HPI per script was selected from the final round of prompt engineering for comparison with the resident samples (eTable 2 in Supplement 1). This study was determined to be exempt from review by the Stanford University institutional review board. Thirty internal medicine attending physicians each blindly evaluated 5 HPIs (4 written by residents and 1 generated by the chatbot), grading them on level of detail, succinctness, and organization (eAppendix 4 in Supplement 1); they also stated whether they believed each HPI was produced by a resident or the chatbot. This study followed the relevant portions of the TRIPOD reporting guideline. Statistical significance was defined as 2-sided P < .05. Data analysis was performed using R statistical software version 4.2.1 (R Foundation for Statistical Computing).
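For illustration only, the prompt-engineering loop described above can be summarized in a minimal Python sketch. This is not the study's actual tooling: HPIs were generated interactively in the ChatGPT interface and assessed manually, and the `generate_hpi` and `has_major_error` callables below are hypothetical placeholders for those manual steps.

```python
# Illustrative sketch of the iterative prompt-engineering workflow described
# in Methods. NOT the study's actual tooling: HPIs were generated in the
# ChatGPT interface and reviewed manually; the callables are placeholders.
from typing import Callable, List


def acceptance_rate(
    generate_hpi: Callable[[str, str], str],      # (prompt, script) -> HPI text
    has_major_error: Callable[[str, str], bool],  # (hpi, script) -> any major error?
    prompt: str,
    scripts: List[str],
    n_per_script: int = 10,
) -> float:
    """Generate n HPIs per script and return the fraction with no major errors."""
    accepted, total = 0, 0
    for script in scripts:
        for _ in range(n_per_script):
            hpi = generate_hpi(prompt, script)
            total += 1
            if not has_major_error(hpi, script):
                accepted += 1
    return accepted / total


# In each round, the prompt is revised to target the most common errors observed
# (eg, fabricated patient age or gender) and the acceptance rate is recomputed;
# the study repeated this process for 3 rounds.
```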
Results
During prompt engineering, chatbot-generated HPIs were assessed for major inaccuracies, atypical terminology, and structural errors. The most common error was the addition of patient age and gender, which none of the scripts specified. The acceptance rate for chatbot-generated HPIs improved from 10.0% to 43.3% by the final round of prompt engineering (Figure). Grades of resident and chatbot-generated HPIs differed by less than 1 point on the 15-point composite scale (resident mean [SD], 12.18 [2.40] vs chatbot mean [SD], 11.23 [2.84]; P = .09), although resident HPIs scored higher on the level of detail domain (resident mean [SD], 4.13 [0.86] vs chatbot mean [SD], 3.57 [1.04]; P = .006) (Table). Attending physicians correctly classified HPIs as written by residents or the chatbot with an accuracy of 61% (95% CI, 53%-68%; P = .06).
Figure. Prompt Engineering With the Chatbot.
A total of 10 HPIs per script (30 total) were generated in each round of prompt engineering and assessed for major errors, such as information reported in the HPIs that was not present in the source dialogue (hallucinations), atypical terminology, and structural issues. HPIs without any major errors were considered acceptable. Prompts were refined in each round to minimize errors and maximize acceptance. HPI indicates history of present illness.
Table. Composite and Domain-Specific HPI Grades^a

| Grade type | Mean (SD) resident HPI grade (n = 120) | Mean (SD) chatbot HPI grade (n = 30) | Wilcoxon rank sum P value |
|---|---|---|---|
| Composite | 12.18 (2.40) | 11.23 (2.84) | .09 |
| Level of detail | 4.13 (0.86) | 3.57 (1.04) | .006^b |
| Succinctness | 3.93 (1.09) | 3.70 (1.15) | .29 |
| Organization | 4.12 (0.91) | 3.97 (0.96) | .43 |

Abbreviation: HPI, history of present illness.

^a HPIs were graded across 3 domains (level of detail, succinctness, and organization) using a 5-point Likert scale, with 5 representing the best grade in a domain. The composite grade was calculated as the sum of the 3 domain grades.

^b Denotes statistical significance, defined as 2-sided P < .05.
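As a minimal sketch only, the following Python code shows how comparisons of this kind could be computed with SciPy. The study itself used R 4.2.1, the grade arrays below are simulated placeholders rather than study data, and the exact binomial test of classification accuracy is an assumption; the article does not state which procedure produced the reported accuracy confidence interval and P value.

```python
# Minimal sketch of the statistical comparisons reported above, using
# simulated placeholder grades (NOT study data); the study used R 4.2.1.
import numpy as np
from scipy.stats import ranksums, binomtest

rng = np.random.default_rng(0)
resident_grades = rng.integers(3, 16, size=120)  # placeholder composite grades (range 3-15)
chatbot_grades = rng.integers(3, 16, size=30)

# Wilcoxon rank sum test comparing resident vs chatbot composite grades (2-sided)
statistic, p_value = ranksums(resident_grades, chatbot_grades)
print(f"Wilcoxon rank sum P = {p_value:.3f}")

# Attending classification accuracy: 30 graders x 5 HPIs = 150 classifications.
# An exact binomial test against chance (50%) with a 95% CI is one plausible
# reconstruction of the reported accuracy analysis.
n_classified = 150
n_correct = round(0.61 * n_classified)  # placeholder count consistent with 61%
result = binomtest(n_correct, n_classified, p=0.5, alternative="two-sided")
ci = result.proportion_ci(confidence_level=0.95)
print(f"Accuracy = {n_correct / n_classified:.2f} "
      f"(95% CI, {ci.low:.2f}-{ci.high:.2f}); P = {result.pvalue:.3f}")
```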
Discussion
In this study, HPIs generated by a chatbot or written by senior internal medicine residents were graded similarly by internal medicine attending physicians. These findings underscore the potential of chatbots to aid clinicians with medical documentation. Further work is needed to assess the utility of LLMs in other clinical summarization tasks. We also found that the chatbot’s performance was heavily dependent on prompt quality. Without robust prompt engineering, the chatbot frequently reported information in the HPIs that was not present in the source dialogue. This type of error, called a hallucination, has been noted in a prior assessment of generative AI models.5 The generation of hallucinations in the medical record is of particular concern. Before LLMs can be safely used in the clinical environment, close collaboration between clinicians and AI developers is needed to ensure that prompts are effectively engineered to optimize output accuracy. Other methods for improving output accuracy, such as instruction prompt tuning, which can align LLMs to specific domains, should also be evaluated.6 As the field of generative AI evolves to address these challenges, clinicians will be well positioned to guide the implementation of LLMs in care delivery.
This study has several limitations. First, residents were instructed to write in full sentences without abbreviations or acronyms to help standardize the HPI samples, which may not reflect their usual documentation style. Second, chatbot responses can vary even with identical prompts, hindering the reproducibility of this study. Third, only the best chatbot outputs were compared with resident performance. Fourth, the grading instruments were not formally validated. Finally, this study was based on the version of the chatbot available at the time; future work should explore whether the recently released version has better performance.
References
1. Online ChatGPT: optimizing language models for dialogue. Published February 10, 2023. Accessed February 11, 2023. https://online-chatgpt.com/
2. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614(7947):224-226. doi:10.1038/d41586-023-00288-7
3. Gaffney A, Woolhandler S, Cai C, et al. Medical documentation burden among US office-based physicians in 2019: a national study. JAMA Intern Med. 2022;182(5):564-566. doi:10.1001/jamainternmed.2022.0372
4. Schulman KA, Berlin JA, Harless W, et al. The effect of race and sex on physicians’ recommendations for cardiac catheterization. N Engl J Med. 1999;340(8):618-626. doi:10.1056/NEJM199902253400806
5. Bang Y, Cahyawijaya S, Lee N, et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv. Preprint posted online February 8, 2023. doi:10.48550/arXiv.2302.04023
6. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. arXiv. Preprint posted online December 26, 2022. doi:10.48550/arXiv.2212.13138
Supplementary Materials
eAppendix 1. Scripts of Standardized Patient Interviews
eAppendix 2. Resident Survey with Link to Example
eAppendix 3. Link to Prompt Engineering Results
eTable 1. Examples of Common Errors in ChatGPT-generated HPIs
eTable 2. Final Resident and ChatGPT HPIs
eAppendix 4. Faculty Grader Survey with Link to Example
Data Sharing Statement