Abstract
The efficacy of large language models (LLMs) in discharge summary preparation using real clinical documentation remains largely untested. Our study aimed to assess the ability of two LLMs to generate discharge summaries, which were scored using a validated discharge summary scoring metric. The models performed nearly identically: the llama3:instruct model achieved a mean score of 19.1/31 (SD: 2.42) compared with 19.2/31 (SD: 3.48) for llama3:70b. Using LLMs to aid in the generation of discharge summaries may help to reduce the overall clinical administrative workload.
Keywords: artificial intelligence, machine learning, discharge summary, hospital efficiency
The preparation of timely and detailed discharge summaries not only aids communication with primary and secondary care but has also been shown to reduce patient harm.1 Delays in producing discharge summaries reduce overall continuity of care.1 Furthermore, the production and distribution of delayed and low-quality discharge summaries have been shown to increase the risk of medication, information and communication errors.1 Several systematic reviews have found delayed production of discharge summaries to be associated with readmission rates.2,3

Preparing discharge summaries forms a large part of the junior clinician workload: the mean proportion of time junior doctors in the United Kingdom spend writing discharge summaries has been reported to be up to 27%,4 and Australian resident medical officers are estimated to spend 2.5–14.25 h weekly on this task.5

Large language models (LLMs) have already begun to aid in improving clinical communication and patient understanding of discharge summaries.6 A study by Zaretsky et al. found LLMs to be a valuable tool for translating discharge summaries into a format more understandable to patients.6 LLMs can produce large amounts of text in seconds. Other papers have begun to explore and discuss the role of LLMs in preparing discharge summaries,7 but these studies have relied on clinician-written vignettes designed to simulate electronic medical record (EMR) documentation. Few studies have sought to evaluate the ability of current LLMs to generate discharge summaries using real EMR data; accordingly, the ability of LLMs to navigate the nuances and imperfections of clinical documentation remains largely untested.
Should LLMs be able to extract and synthesise pertinent health information to produce discharge summaries in a timely and efficient manner, these tools may alleviate many health system delays, reduce clinical administrative workload and improve patient care. Clinicians using LLMs to assist with writing discharge summaries may significantly reduce the time spent producing these vital pieces of documentation.
Clinical vignettes were generated based on a random selection of inpatients from the Lyell McEwin Hospital (LMH) General Medicine service whose estimated discharge date was either the same day or the following day. Ten cases in total were randomly collected via a Structured Query Language (SQL) query. The notes, which included all inpatient documentation from the selected patient's admission (e.g. ward round notes, medical progress notes, admission notes), were extracted from the EMR via SQL query and then attached to a prompt given separately to the llama3:70b and llama3:instruct models.8 Owing to the data limits of the LLMs, the documents did not include pathology results, medical imaging, medication charts, or nursing and allied health documentation beyond what appeared in the treating medical officers' notes. The LLMs were provided the following prompt: ‘You are a doctor. You need to write a discharge summary for a GP. Here is the note (note). Please provide the discharge summary’.
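The prompting workflow described above can be sketched against the Ollama HTTP API, which serves locally run Llama 3 models. This is a minimal sketch, not the authors' actual pipeline: the endpoint and payload shape follow Ollama's documented `/api/generate` interface, while the note text, host and helper names are illustrative assumptions.

```python
import json
import urllib.request

# The study's fixed prompt, with the EMR notes substituted for "(note)".
PROMPT_TEMPLATE = (
    "You are a doctor. You need to write a discharge summary for a GP. "
    "Here is the note ({note}). Please provide the discharge summary"
)

def build_prompt(note: str) -> str:
    """Embed the collected EMR documentation into the fixed study prompt."""
    return PROMPT_TEMPLATE.format(note=note)

def generate_summary(note: str, model: str = "llama3:70b",
                     host: str = "http://localhost:11434") -> str:
    """Send the prompt to a locally running Ollama server and return the completion."""
    payload = json.dumps({
        "model": model,          # e.g. "llama3:70b" or "llama3:instruct"
        "prompt": build_prompt(note),
        "stream": False,         # return the whole response as one JSON object
    }).encode("utf-8")
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Running the model locally in this way, as the authors note later, keeps patient data inside the hospital network rather than sending it to an external API.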
Each discharge summary generated by the two LLMs was evaluated by L. Hains using the discharge summary scoring tool of Savvopoulos et al.9 This tool assesses 17 discharge summary quality domains. The assessing author was also provided with the EMR notes used to generate the summaries in order to assess discharge summary accuracy, and was not blinded to each summary's origin. The human-written discharge summary for each patient was also collected manually and scored using the same metric (Appendix S1). Two-tailed t-tests were used to compare scores between groups. Responses were recorded using an electronic form. This study was approved by the Central Adelaide Local Health Network Human Research Ethics Committee (reference number 18665).
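The two-tailed comparison between the models can be approximately reproduced from the published summary statistics. The sketch below computes a Welch t-statistic in pure Python from the reported means and standard deviations, assuming n = 10 summaries per model; the exact P-value depends on the test variant and raw per-summary scores, which are not given.

```python
import math

def welch_t(mean1: float, sd1: float, n1: int,
            mean2: float, sd2: float, n2: int) -> float:
    """Welch's t-statistic for two independent samples with unequal variances."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # standard error of the difference
    return (mean1 - mean2) / se

# Reported scores: llama3:instruct 19.1 (SD 2.42) vs llama3:70b 19.2 (SD 3.48)
t = welch_t(19.1, 2.42, 10, 19.2, 3.48, 10)
print(round(t, 3))  # -> -0.075, near zero, consistent with the reported P = 0.974
```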
Twenty LLM-generated discharge summaries were assessed using the scoring tool. Discharge summaries produced by the llama3:instruct model had a mean score of 19.1/31 (SD: 2.42) versus 19.2/31 (SD: 3.48) for llama3:70b (P = 0.974). Inclusion of pertinent physical examination findings was the lowest-scoring domain across both models (mean 0.4/2), despite such findings being highly prevalent in the provided clinical documentation (e.g. emergency department admission notes and ward round notes). The llama3:instruct model significantly outperformed the llama3:70b model in the inclusion of relevant discharge medications (1.66/2 vs 1.00/2) and of the relevant inpatient treating clinician's contact details (1.40/2 vs 0.2/2); however, the models often inferred the contact clinician from assumptions based on the ward round notes, and this inference may not reflect the actual discharging clinician. The llama3:70b model outperformed llama3:instruct in the inclusion of pertinent investigations (1.0/2 vs 0.6/2). Compared with the human-written discharge summaries (mean score: 24.2/31, SD: 3.32), both LLMs underperformed (P = 0.0241) (Table 1).
Table 1.
Mean scores for each large language model (LLM)
| Criterion (scoring rubric) | llama3:instruct | llama3:70b | Mean LLM score | Clinician |
|---|---|---|---|---|
| Admission diagnosis (0 – no information; 1 – less than optimal, e.g. only chief complaint or presenting symptoms, or excessive; 2 – preliminary or working diagnosis given at time of admission) | 2.00 | 2.00 | 2.00 | 2.00 |
| List of discharge diagnoses (0 – no information; 1 – less than optimal (only signs, symptoms or unknown abbreviations) or excessive; 2 – principal discharge diagnosis or main reason for admission and all additional pertinent diagnoses where applicable) | 2.00 | 2.00 | 2.00 | 1.80 |
| Discharge diagnosis responsible for greatest part of LOS (0 – omitted; 1 – maximum one diagnosis accountable for largest portion of patient's stay, or excessive; 2 – optimal) | 1.00 | 1.00 | 1.00 | 2.00 |
| History of present illness (0 – no information; 1 – some information missing, or excessive description; 2 – a brief summary of initial presentation and diagnostic evaluation) | 1.00 | 1.30 | 1.15 | 2.00 |
| Pertinent physical findings (0 – no information; 1 – some information missing, or all findings/a substantial number of irrelevant findings; 2 – findings relevant to diagnoses) | 0.20 | 0.60 | 0.40 | 0.10 |
| Goals of care (0 – no information; 1 – some information missing; 2 – level of treatment, code status (e.g. curative, life-prolonging palliative and symptomatic palliative)) | 1.50 | 1.60 | 1.55 | 2.00 |
| Course in hospital (0 – no information; 1 – incomplete description with missing links, or excessive information; 2 – synoptic, problem-based description of sequential events and respective evaluations, treatments and prognoses) | 1.30 | 1.50 | 1.40 | 2.00 |
| Procedures in hospital (0 – no information; 1 – unknown abbreviations used; 2 – a list of procedures with key findings and date, OR statement ‘not applicable’) | 0.50 | 0.78 | 0.64 | 0.50 |
| Discharge medications (0 – no information; 1 – some information missing; 2 – a listing of all discharge medications with specific description of new, altered and discontinued medications and rationale for changes, OR specific statement ‘See DMR’, OR a specific statement ‘no medications’) | 1.63 | 1.00 | 1.31 | 1.89 |
| Pertinent laboratory tests and investigation results (0 – no information; 1 – some information missing, or all tests/a substantial number of irrelevant tests; 2 – relevant (key) tests and investigations) | 0.60 | 1.00 | 0.80 | 1.20 |
| Test results pending at discharge (0 – no information; 1 – some information missing; 2 – tests ordered during hospitalisation that are pending at time of discharge) | 0.70 | 1.00 | 0.85 | 1.00 |
| Outcome of care/condition at discharge – functional ability (0 – no information; 1 – some information missing; 2 – documentation that gives a sense of the patient's functional and/or cognitive health status at discharge where applicable, e.g. stable at baseline, including residual comorbid illnesses and risk factors) | 1.70 | 1.80 | 1.75 | 1.80 |
| Follow-up issues identified (0 – no information; 1 – suboptimal; 2 – description of outstanding issues that will require follow-up along with recommendations for the recipient health-care provider, OR statement that ‘no outstanding issues exist’ or ‘no recommendations exist’) | 1.90 | 2.00 | 1.95 | 1.40 |
| Appointments after discharge (0 – no information; 1 – some information missing; 2 – person responsible for scheduling, date, time/timeframe, care provider name and specialty where applicable) | 0.90 | 0.70 | 0.80 | 1.00 |
| Discharge instructions (0 – no information; 1 – some information missing; 2 – list of verbal/written information/education provided to patient/SDM clearly stated where applicable, symptoms and signs to seek care for (e.g. unresolved or recurring chest pain, signs of infection), OR statement ‘No special education/instruction required’) | 1.10 | 1.00 | 1.05 | 1.70 |
| Identified attending clinician to be called by primary care physicians if there are questions (0 – no information; 1 – some information missing; 2 – main author of the discharge summary clearly stated) | 1.40 | 0.20 | 0.80 | 2.00 |
| Totals | 19.43 | 19.48 | 19.45 | 24.39 |
In several of the generated discharge summaries, the treating clinician's name was incorrectly listed as the patient's name. This may be due to the absence of the patient's name and other identifying details in many of the inpatient documents.
Both LLMs incorrectly stated which COVID‐19 antiviral was provided to one of the included patients. The LLMs likely inferred the incorrect medication from the pharmacological management recommended in the emergency department admission and initial ward round notes, even though the provided pharmacy notes explicitly mention the change to the correct medication. There were also numerous instances of suggested inpatient management, such as monitoring of bowel charts, being incorrectly carried over as outpatient plans in the discharge summaries.
Discussion
Despite the relatively low performance of both models (61.6% and 61.9% of the maximum score), the two LLMs tested were able to produce concise and informative medical documentation. Criteria on which the models scored 0 points included failing to include the name and contact details of the inpatient treating clinician and failing to specify booked outpatient review appointments after discharge. Many of these omissions were not necessarily attributable to the LLMs themselves but rather to the source documentation not containing this information.
A large proportion of the LLM‐derived discharge summaries remained factual, with no major discrepancies between the provided clinical notes and the resulting discharge summaries. As mentioned, there was one instance of a medication error in both LLM‐derived discharge summaries produced for a single patient; notably, the erroneous medication was one that had been given as an inpatient. Errors in discharge medications likely occurred because the LLMs generated summaries prospectively, predicting discharge timelines, and as a result misidentified inpatient medications as post‐discharge prescriptions.
There were limitations of this study that may have affected the discharge summaries generated by the LLMs. Because the LLMs relied on a document called the ‘Ward Round Note’ and other documentation attributed to the patient's inpatient stay within the admitting service, pertinent medical information from inpatient stays in areas of the hospital that use differently named notes (such as intensive care) was not available to the models. This was mainly due to context‐window limits that constrained the amount of information that could be provided to the LLMs. For patients whose stay began in intensive care or at another hospital within the Northern Adelaide Local Health Network (NALHN), the LLMs were forced to infer what care was provided in these areas from the information included in the Ward Round Note. While this is currently a real‐world limitation for LLM implementation in this context, newer LLMs with larger context windows may overcome this challenge.
This study used a single family of LLMs (Llama 3), which at the time was the best open‐weight model family that could be run locally. Running the models locally was, and will remain, a necessity given the use of patient information and the need to protect patient confidentiality. Future studies should seek to use and compare other newly available locally run models. Another limitation of this study is that scoring of the discharge summaries was undertaken by a single, non‐independent, unblinded reviewer, introducing potential bias in data scoring. Future studies should involve multiple blinded assessors to increase objectivity and reduce the overall risk of bias.
Development of more specific prompts may be required for LLMs to produce accurate discharge summaries. Prompts should outline the discharge summary structure and the pertinent information required for inclusion. Furthermore, prompts should be designed so that LLMs can identify areas where information is unavailable and prompt clinicians to provide it. Optimised prompts can be constructed with libraries such as DSPy and TextGrad, which tune prompt instructions algorithmically rather than by manual trial and error. Moreover, novel retrieval‐augmented generation (RAG) techniques deploying graph structures and medical ontologies can further support factual and contextually grounded generation of discharge summaries, as evidenced by Wu et al.10 Finally, LLM workflows that permit clinician input and review (‘human in the loop’) may allow LLMs to become a valuable tool for generating fast, efficient and accurate discharge summaries. Future studies should seek to validate a combined LLM/clinician workflow for efficiently producing discharge summaries, noting that this will likely differ across specialty fields. It is also important to note that, as LLMs continue to evolve and improve, discharge summary quality may improve with them.
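As an illustration of the kind of structured prompt argued for above, the sketch below assembles section headings loosely drawn from the Savvopoulos et al. scoring domains and explicitly instructs the model to flag missing information rather than infer it. The section list and wording are illustrative assumptions, not a validated prompt.

```python
# Illustrative section headings, loosely based on the Savvopoulos et al.
# scoring domains; these are assumptions, not a validated instrument.
REQUIRED_SECTIONS = [
    "Admission diagnosis",
    "Discharge diagnoses",
    "Course in hospital",
    "Pertinent physical findings",
    "Pertinent investigations",
    "Discharge medications (new, altered, discontinued, with rationale)",
    "Follow-up issues and appointments",
    "Responsible clinician and contact details",
]

def build_structured_prompt(note: str) -> str:
    """Build a prompt that enforces structure and flags undocumented items."""
    sections = "\n".join(f"- {s}" for s in REQUIRED_SECTIONS)
    return (
        "You are a doctor writing a discharge summary for a GP.\n"
        "Structure the summary under the following headings:\n"
        f"{sections}\n"
        "If the notes do not contain the information for a heading, write "
        "'[INFORMATION NOT DOCUMENTED - clinician to complete]' rather than "
        "inferring it.\n\n"
        f"Clinical notes:\n{note}"
    )
```

The explicit "do not infer" instruction targets the errors observed in this study, such as the inferred antiviral and the assumed discharging clinician, by routing gaps back to the clinician instead of the model.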
Although the total scores for each LLM were below the expected discharge summary quality, even with a rudimentary prompt both models delivered a reasonable result that, with minor edits and additions, would make a sufficient discharge summary. Mixed solutions that combine the strengths of LLMs, such as summarisation and information synthesis, with clinician oversight and manual entry of fields that demand precision (such as medication lists and relevant pathology) may reduce the time junior clinicians spend preparing discharge summaries. More specific prompts targeting the inclusions in the scoring criteria, together with more finely tuned LLMs and architectures such as RAG, may aid in the development of an LLM‐based tool for clinical use.
Supporting information
Appendix S1. Modified discharge summary scoring tool by Savvopoulos et al. 9
Acknowledgements
Open access publishing facilitated by The University of Adelaide, as part of the Wiley ‐ The University of Adelaide agreement via the Council of Australian University Librarians.
Funding: S. Bacchi is supported by a Fulbright Scholarship. L. Hains, O. Kleinig and A. Murugappa were supported by the NALHN Young Health Innovators Scholarship.
Conflict of interest: None.
References
- 1. Schwarz CM, Hoffmann M, Schwarz P, Kamolz LP, Brunner G, Sendlhofer G. A systematic literature review and narrative synthesis on the risks of medical discharge letters for patients' safety. BMC Health Serv Res 2019; 19: 158.
- 2. Hoyer EH, Odonkor CA, Bhatia SN, Leung C, Deutschendorf A, Brotman DJ. Association between days to complete inpatient discharge summaries with all‐payer hospital readmissions in Maryland. J Hosp Med 2016; 11: 393–400.
- 3. Were MC, Li X, Kesterson J, Cadwallader J, Asirwa C, Khan B et al. Adequacy of hospital discharge summaries in documenting tests with pending results and outpatient follow‐up providers. J Gen Intern Med 2009; 24: 1002–1006.
- 4. Yemm R, Bhattacharya D, Wright D, Poland F. What constitutes a high quality discharge summary? A comparison between the views of secondary and primary care doctors. Int J Med Educ 2014; 5: 125–131.
- 5. Samuel C, Andrew PM, Clifford WP, Stephen JA, Darren LW, Helen EW. Improving the efficiency of discharge summary completion by linking to preexisting patient information databases. BMJ Qual Improv Rep 2014; 3: u200548.w2006.
- 6. Zaretsky J, Kim JM, Baskharoun S, Zhao Y, Austrian J, Aphinyanaphongs Y et al. Generative artificial intelligence to transform inpatient discharge summaries to patient‐friendly language and format. JAMA Netw Open 2024; 7: e240357.
- 7. Patel SB, Lam K. ChatGPT: the future of discharge summaries? Lancet Digit Health 2023; 5: e107–e108.
- 8. Meta. llama3. Ollama; 2024 (cited 2024 Oct 15). Available from URL: https://ollama.com/library/llama3
- 9. Savvopoulos S, Sampalli T, Harding R, Blackmore G, Janes S, Kumanan K et al. Development of a quality scoring tool to assess quality of discharge summaries. J Family Med Prim Care 2018; 7: 394–400.
- 10. Wu J, Zhu J, Qi Y, Chen J, Xu M, Menolascina F et al. Medical graph RAG: towards safe medical large language model via graph retrieval‐augmented generation; 2024 (cited 2024 Aug 1). Available from URL: https://ui.adsabs.harvard.edu/abs/2024arXiv240804187W