Abstract
This prognostic study assesses the ability of a chatbot to write a history of present illness compared with that of senior internal medicine residents.
Large language model (LLM) chatbots have received widespread attention for their ability to produce complex human-like conversational output.1 These models represent a substantial advancement in generative artificial intelligence (AI) with potential applications in many industries.2 Medical documentation is a health care use case worth examining given its notable burden on clinicians.3 In this prognostic study, we evaluated the ability of a chatbot to generate a history of present illness (HPI) compared with that of senior internal medicine residents.
Methods
The study was conducted between January and February 2023 and involved internal medicine senior residents and attending physicians. The HPIs were generated by a chatbot (ChatGPT, January 9, 2023, version; OpenAI) and written by 4 residents, both based on 3 patient interview scripts portraying different types of chest pain (eAppendixes 1 and 2 in Supplement 1).4 The chatbot-generated HPIs were produced using an iterative process known as prompt engineering (eAppendix 3 in Supplement 1). Ten HPIs per script were first generated using a basic prompt and analyzed for errors (eTable 1 in Supplement 1); HPIs were considered acceptable if they contained no errors. The prompt was then modified, and this process was repeated twice. One HPI per script was selected from the final round of prompt engineering for comparison with the resident samples (eTable 2 in Supplement 1). This study was determined to be exempt from review by the Stanford University institutional review board. Thirty internal medicine attending physicians each blindly evaluated 5 HPIs (4 written by residents and 1 generated by the chatbot), grading them on level of detail, succinctness, and organization (eAppendix 4 in Supplement 1); they also stated whether they believed each HPI was produced by a resident or the chatbot. This study followed the relevant portions of the TRIPOD reporting guideline. Statistical significance was defined as 2-sided P < .05. Data analysis was performed using R statistical software version 4.2.1 (R Foundation for Statistical Computing).
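For illustration only, the prompt-engineering loop described above can be summarized in a minimal Python sketch. This is not the study's actual tooling: HPIs were generated interactively in the ChatGPT interface and assessed manually, and the `generate_hpi` and `has_major_error` callables below are hypothetical placeholders for those manual steps.

```python
# Illustrative sketch of the iterative prompt-engineering workflow described
# in Methods. NOT the study's actual tooling: HPIs were generated in the
# ChatGPT interface and reviewed manually; the callables are placeholders.
from typing import Callable, List


def acceptance_rate(
    generate_hpi: Callable[[str, str], str],      # (prompt, script) -> HPI text
    has_major_error: Callable[[str, str], bool],  # (hpi, script) -> any major error?
    prompt: str,
    scripts: List[str],
    n_per_script: int = 10,
) -> float:
    """Generate n HPIs per script and return the fraction with no major errors."""
    accepted, total = 0, 0
    for script in scripts:
        for _ in range(n_per_script):
            hpi = generate_hpi(prompt, script)
            total += 1
            if not has_major_error(hpi, script):
                accepted += 1
    return accepted / total


# In each round, the prompt is revised to target the most common errors observed
# (eg, fabricated patient age or gender) and the acceptance rate is recomputed;
# the study repeated this process for 3 rounds.
```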
Results
During prompt engineering, chatbot-generated HPIs were assessed for major inaccuracies, atypical terminology, and structural errors. The most common error was the addition of patient age and gender, which none of the scripts specified. The acceptance rate for chatbot-generated HPIs improved from 10.0% to 43.3% by the final round of prompt engineering (Figure). Grades of resident and chatbot-generated HPIs differed by less than 1 point on the 15-point composite scale (resident mean [SD], 12.18 [2.40] vs chatbot mean [SD], 11.23 [2.84]; P = .09), although resident HPIs scored higher on the level of detail domain (resident mean [SD], 4.13 [0.86] vs chatbot mean [SD], 3.57 [1.04]; P = .006) (Table). Attending physicians correctly classified HPIs as written by residents or the chatbot with an accuracy of 61% (95% CI, 53%-68%; P = .06).
Figure. Prompt Engineering With the Chatbot.
A total of 10 HPIs per script (30 total) were generated in each round of prompt engineering and assessed for major errors, such as information reported in the HPIs that was not present in the source dialogue (hallucinations), atypical terminology, and structural issues. HPIs without any major errors were considered acceptable. Prompts were refined in each round to minimize errors and maximize acceptance. HPI indicates history of present illness.
Table. Composite and Domain-Specific HPI Grades^a

| Grade type | Mean (SD) resident HPI grade (n = 120) | Mean (SD) chatbot HPI grade (n = 30) | Wilcoxon rank sum P value |
|---|---|---|---|
| Composite | 12.18 (2.40) | 11.23 (2.84) | .09 |
| Level of detail | 4.13 (0.86) | 3.57 (1.04) | .006^b |
| Succinctness | 3.93 (1.09) | 3.70 (1.15) | .29 |
| Organization | 4.12 (0.91) | 3.97 (0.96) | .43 |

Abbreviation: HPI, history of present illness.

^a HPIs were graded across 3 domains (level of detail, succinctness, and organization) using a 5-point Likert scale, with 5 representing the best grade in a domain. The composite grade was calculated as the sum of the 3 domain grades.

^b Denotes statistical significance, defined as 2-sided P < .05.
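As a minimal sketch only, the following Python code shows how comparisons of this kind could be computed with SciPy. The study itself used R 4.2.1, the grade arrays below are simulated placeholders rather than study data, and the exact binomial test of classification accuracy is an assumption; the article does not state which procedure produced the reported accuracy confidence interval and P value.

```python
# Minimal sketch of the statistical comparisons reported above, using
# simulated placeholder grades (NOT study data); the study used R 4.2.1.
import numpy as np
from scipy.stats import ranksums, binomtest

rng = np.random.default_rng(0)
resident_grades = rng.integers(3, 16, size=120)  # placeholder composite grades (range 3-15)
chatbot_grades = rng.integers(3, 16, size=30)

# Wilcoxon rank sum test comparing resident vs chatbot composite grades (2-sided)
statistic, p_value = ranksums(resident_grades, chatbot_grades)
print(f"Wilcoxon rank sum P = {p_value:.3f}")

# Attending classification accuracy: 30 graders x 5 HPIs = 150 classifications.
# An exact binomial test against chance (50%) with a 95% CI is one plausible
# reconstruction of the reported accuracy analysis.
n_classified = 150
n_correct = round(0.61 * n_classified)  # placeholder count consistent with 61%
result = binomtest(n_correct, n_classified, p=0.5, alternative="two-sided")
ci = result.proportion_ci(confidence_level=0.95)
print(f"Accuracy = {n_correct / n_classified:.2f} "
      f"(95% CI, {ci.low:.2f}-{ci.high:.2f}); P = {result.pvalue:.3f}")
```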
Discussion
In this study, HPIs generated by a chatbot or written by senior internal medicine residents were graded similarly by internal medicine attending physicians. These findings underscore the potential of chatbots to aid clinicians with medical documentation. Further work is needed to assess the utility of LLMs in other clinical summarization tasks. We also found that the chatbot’s performance was heavily dependent on prompt quality. Without robust prompt engineering, the chatbot frequently reported information in the HPIs that was not present in the source dialogue. This type of error, called a hallucination, has been noted in a prior assessment of generative AI models.5 The generation of hallucinations in the medical record is of particular concern. Before LLMs can be safely used in the clinical environment, close collaboration between clinicians and AI developers is needed to ensure that prompts are effectively engineered to optimize output accuracy. Other methods for improving output accuracy, such as instruction prompt tuning, which can align LLMs to specific domains, should also be evaluated.6 As the field of generative AI evolves to address these challenges, clinicians will be well positioned to guide the implementation of LLMs in care delivery.
This study has several limitations. First, residents were instructed to write in full sentences without abbreviations or acronyms to help standardize the HPI samples, which may not reflect their usual documentation style. Second, chatbot responses can vary even with identical prompts, hindering the reproducibility of this study. Third, only the best chatbot outputs were compared with resident performance. Fourth, the grading instruments were not formally validated. Finally, this study was based on the version of the chatbot available at the time; future work should explore whether the recently released version has better performance.
References
1. Online ChatGPT: optimizing language models for dialogue. Published February 10, 2023. Accessed February 11, 2023. https://online-chatgpt.com/
2. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614(7947):224-226. doi:10.1038/d41586-023-00288-7
3. Gaffney A, Woolhandler S, Cai C, et al. Medical documentation burden among US office-based physicians in 2019: a national study. JAMA Intern Med. 2022;182(5):564-566. doi:10.1001/jamainternmed.2022.0372
4. Schulman KA, Berlin JA, Harless W, et al. The effect of race and sex on physicians’ recommendations for cardiac catheterization. N Engl J Med. 1999;340(8):618-626. doi:10.1056/NEJM199902253400806
5. Bang Y, Cahyawijaya S, Lee N, et al. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv. Preprint posted online February 8, 2023. doi:10.48550/arXiv.2302.04023
6. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. arXiv. Preprint posted online December 26, 2022. doi:10.48550/arXiv.2212.13138
Supplementary Materials
eAppendix 1. Scripts of Standardized Patient Interviews
eAppendix 2. Resident Survey with Link to Example
eAppendix 3. Link to Prompt Engineering Results
eTable 1. Examples of Common Errors in ChatGPT-generated HPIs
eTable 2. Final Resident and ChatGPT HPIs
eAppendix 4. Faculty Grader Survey with Link to Example
Data Sharing Statement