Abstract
Automatically summarizing patients’ main problems from daily progress notes using natural language processing methods helps combat information and cognitive overload in hospital settings and potentially assists providers with computerized diagnostic decision support. Problem list summarization requires a model to understand, abstract, and generate clinical documentation. In this work, we propose a new NLP task that aims to generate a list of problems in a patient’s daily care plan using input from the provider’s progress notes during hospitalization. We investigate the performance of T5 and BART, two state-of-the-art seq2seq transformer architectures, in solving this problem. We provide a corpus built on top of progress notes from the Medical Information Mart for Intensive Care (MIMIC)-III, a publicly available electronic health record dataset. T5 and BART are trained on general domain text, and we experiment with a data augmentation method and a domain adaptive pre-training method to increase exposure to medical vocabulary and knowledge. Evaluation methods include ROUGE, BERTScore, cosine similarity on sentence embeddings, and F-score on medical concepts. Results show that T5 with domain adaptive pre-training achieves significant performance gains compared to a rule-based system and general domain pre-trained language models, indicating a promising direction for tackling the problem summarization task.
1. Introduction
The progress note is a common note type in the electronic health record (EHR) that also contains the necessary details for medical billing; therefore, every hospital day will contain at least one progress note for a patient. Healthcare providers write them to document a patient’s daily progress and care plan (Brown et al., 2014). The progress note contains both subjective and objective information gathered by the care team; it is updated daily and serves as the most viewed clinical document by providers. The complexity of the progress note increases as the patient’s illness worsens, with progress notes collected in the intensive care unit (ICU) representing the sickest patients in the hospital. In the ICU, information and cognitive overload occur frequently, with more opportunities for missed diagnoses and medical errors (Furlow, 2020; Hultman et al., 2019). Automatically generating a set of diagnoses/problems from a progress note may assist providers in overcoming cognitive biases and heuristics and applying evidence-based medicine via information synthesis to accurately understand a patient’s condition. These processes may ultimately reduce the effort of document review and augment care during time-sensitive hospital events (Devarakonda et al., 2017).
Clinical note summarization using natural language processing (NLP) has demonstrated promise in previous work. Hirsch et al. (2015) introduced HARVEST, an EHR summarizer that is currently deployed at point-of-care in a New York hospital. The NLP components of HARVEST include a Markov chain named-entity tagger that identifies diseases explicitly mentioned in clinical notes and a TF-IDF scorer that weighs the importance of the mentions (Lipsky-Gorman and Elhadad, 2011; Hirsch et al., 2015). With the advances of neural methods, recent work has focused on radiology report summarization (Zhang et al., 2018; MacAvaney et al., 2019; Gharebagh et al., 2020) with pointer-generator networks (See et al., 2017), and doctor-patient conversation summarization (Yim and Yetisgen-Yildiz, 2021; Zhang et al., 2021) with transformer architectures (Vaswani et al., 2017; Raffel et al., 2020). Few investigations have applied transformers to progress notes for problem summarization, i.e., identifying and generating the top diagnoses during a patient’s hospitalization.
Problem summarization requires complex cognitive processes to arrive at an accurate diagnosis. When a patient is admitted to the hospital, medical evaluations and diagnostics are initially performed to understand the patient’s condition. The review is accompanied by documentation in the progress notes to include pertinent details about the patient’s symptoms, medications, physical exam findings, radiology findings, laboratory results, etc. These data are organized in the progress note and used with the physician’s medical knowledge to arrive at an assessment of the current problems followed by a treatment plan. This system of nonanalytic and analytic reasoning strategies represents clinical diagnostic reasoning, a process involving clinical evidence acquisition with integration and abstraction over medical knowledge to synthesize a conclusion in the form of a diagnosis (Barrows et al., 1980; Bowen, 2006). We hypothesize that, to summarize a patient’s problems and ultimately develop computerized diagnostic decision support systems, the capacity for clinical diagnostic reasoning is key for NLP systems, a gap in the existing NLP literature. In this work, we propose a new summarization task designed to meet this real-world need in the hospital setting as a first step toward developing NLP models for clinical diagnostic reasoning. The task is built on a new annotated subset of MIMIC-III (Johnson et al., 2016), a large and publicly available EHR. We formulate the task as problem list summarization, as we see it as the first step in a bigger vision of generating entire notes or sections of notes. Ultimately, the task is designed with our clinical informatics partners to move toward a future real-world application, where a system generating relevant diagnoses can assist healthcare providers and overcome the cognitive burden and information overload. Our contributions include:
The first knowledge-intensive summarization task towards building NLP systems for computerized diagnostic decision support (sec §3), with an annotated set of clinical notes that are publicly available (sec §4);
An evaluation of two transformer models for this task, T5 (Raffel et al., 2020) and BART (Lewis et al., 2020), to examine progress in using the state-of-the-art models over a rule-based medical concept extractor (sec §5);
Domain adaptive pre-training to establish benchmark performance for this task across multiple evaluation metrics (sec §6), with discussion of key challenges and future directions (§7).
2. Related Work
In this section, we provide a brief overview of recently published papers on clinical summarization that use neural methods.
Task setup
The stream of recent work on clinical summarization may be divided into two groups: extractive summarization and abstractive summarization. The data corpora are heterogeneous, with multiple note types represented. For extractive summarization, Liang et al. (2019) propose a task that extracts sentences from progress notes. Adams et al. (2021) introduce a clinical note summarization task that generates a discharge summary from prior notes written during hospitalization. More efforts have been made toward abstractive summarization. Several works focus on summarizing radiology reports into an impression, a short piece of text stating the findings from the source image (Zhang et al., 2018; MacAvaney et al., 2019; Gharebagh et al., 2020). Other tasks include doctor-patient conversation summarization, where the output is a summary describing the patient’s visit (Yim and Yetisgen-Yildiz, 2021; Manas et al., 2021; Zhang et al., 2021), and generating clinical notes using both extractive and abstractive summarization (Krishna et al., 2021). Our work is similar to Liang et al. (2019) in its emphasis on summarizing problems from progress notes. Yet, Liang et al. (2019) use a disease-specific dataset (hypotension and diabetes) and formulate the problem as extractive summarization. Our annotations span a broad range of diagnoses across multiple disciplines (surgery, medicine, neurology, cardiology, trauma, etc.), and we investigate both extractive and abstractive approaches to the task.
Evaluation
Prior work has relied on ROUGE (Lin, 2004) as the primary evaluation metric for summarization. Most papers also report human evaluation covering aspects of clinical relevancy, factual accuracy, and readability (MacAvaney et al., 2019; Gharebagh et al., 2020; Krishna et al., 2021; Yim and Yetisgen-Yildiz, 2021; Zhang et al., 2021). A few have evaluated with a concept F-score, measuring whether the predicted summaries contain accurate medical concepts (Liang et al., 2019; Zhang et al., 2021). Our work follows prior work and uses ROUGE, concept F-score, and human evaluation to assess the quality of generated summaries. We also evaluate content quality based on semantic representation using BERTScore (Zhang* et al., 2020) and cosine similarity of sentence embeddings.
3. Task Description
Many clinical NLP applications aim to improve physicians’ efficiency and decision-making by automatically highlighting essential information from the large body of textual data in the EHR. The goal of Problem Summarization is to identify and generate the problems and diagnoses for the patient’s ICU stay. The Problem Summarization task could be developed using a multi-document approach with all notes captured during a hospital encounter. A patient encounter may generate multiple clinical notes (e.g. admission note, transfer note, daily progress notes, etc.), involving different modalities of data such as structured EHR data and radiology images. However, we are particularly interested in facilitating NLP model development for clinical diagnostic reasoning. We define the task as single-document summarization and focus only on a cross-sectional point in time with a single progress note. Our work will show that summarizing a patient’s problems over a single progress note is a challenging task and a necessary foundation that requires clinical text understanding and reasoning over sequences of medical concepts.
The progress note is organized in the ubiquitous SOAP format with four components: Subjective, Objective, Assessment, and Plan, a documentation method developed by Larry Weed, MD, to present a patient’s problems in a highly structured way (Weed, 1964). Each component has multiple sections gathering patient information, helping healthcare providers quickly recognize medical events and active problems. Subjective sections are written in natural language and record information about health concerns expressed by patients (e.g. Chief Complaints) and past medical events and history (e.g. Allergies, Family History). Objective sections are primarily structured data, including vital signs, lab tests, and medications. The Assessment is a brief description of passive and active diagnoses: it states why the patient is admitted to the hospital and the active problem for the day, usually accompanied by the patient’s comorbidities. The Plan section includes multiple subsections, each listing a medical problem and treatment plan. The progress note is time-sensitive EHR data because it is documented daily. As a patient’s condition changes and the length of stay increases, the progress note may also increase in length. Another reason for the increasing size is copy-and-paste behaviour, also known as “note bloat”, which adds redundant information or noise, hinders efficient data synthesis, and increases the risk for medical error (Rule et al., 2021; Tsou et al., 2017; Shoolin et al., 2013). This reiterates our motivation to develop an NLP system that automatically generates problems and diagnoses to assist providers in the clinical workflow and improve diagnostic accuracy.
Our task takes the Subjective and Assessment sections of a progress note as input and omits the Objective sections. Both the Subjective sections and the Assessment section contain information about the reason for admission; therefore, they constitute the source text (see Figure 2 for an example). The reference summary is a list of problems mentioned in each Plan subsection relevant to the reasons for hospitalization. We explain the annotation process in the next section. [1]
Figure 2:
An input example of assessment and subjective sections available in the notes: Chief Complaint, Allergies, Review of systems.
4. Data
All progress notes were sourced from MIMIC-III, a publicly available dataset of de-identified EHR data from approximately 60,000 hospital ICU admissions at Beth Israel Deaconess Medical Center in Boston, Massachusetts. We randomly sampled a subset of 768 progress notes and annotated the text spans for the SOAP components. The goal of the annotation was to obtain lists of problems from the Plan subsections. For each Plan subsection, the annotators marked the text span for the Problem, separating the diagnosis/problems from the treatment or action plans. The annotators subsequently determined whether the problem was a primary diagnosis (Direct) or a past medical problem or consequence of the primary diagnosis (Indirect). Two more labels were available for annotating the Plan subsection: Neither, if the problem or diagnosis was not mentioned in the progress note; and Not Relevant, if the Plan subsection contained non-diagnostic comments such as descriptions of nutrition, prophylaxis, or disposition. Finally, we concatenated the Direct and Indirect problems using semicolons and used them as the reference summary. Two medical school students were trained as annotators under the supervision of two board-certified critical care ICU physicians. On the four labels, they achieved a Cohen’s Kappa of 0.74 on 10 randomly sampled notes, considered good quality given the complexity of the task. More details may be found in Appendix A. [2]
Figure 3 illustrates the task setup. The Direct and Indirect problems were labeled from each Plan subsection using information presented in the input Assessment (the entire progress note was also available to the annotators for more information), forming the reference summary (All Problems at the bottom). A total of 1404 and 1599 text spans were labeled as Direct and Indirect Problems, respectively. The majority of the Direct problems were found in the input Assessment, but many of the Indirect problems were not explicitly mentioned there and may be found in other parts of the progress note (e.g., an abdominal pain finding in the Subjective sections, or a pneumothorax finding in the chest imaging results of the Objective sections). We also performed medical concept mapping through UMLS (see §5) on the input Assessment, kept the overlap with the reference summaries, and categorized these overlapping problems as Explicit Mentions of Problems, serving as an automated labeling approach and baseline. Therefore, the problems cover both extractive and abstractive medical concepts. We present results across these subgroups, assuming the complexity increases as we move from Explicit to Direct to Indirect problems.
Figure 3:
Top: An example assessment input with all the concepts (highlighted in color boxes) identified through QuickUMLS, a state-of-the-art off-the-shelf medical concept extractor. Middle: Two example plan subsections with the annotated problems, with relation labels omitted. Bottom: The reference summary (All Problems) consists of problems annotated as the main reasons for hospitalization (Direct Problems) and secondary concerns (Indirect Problems); explicit mentions of the problems are detected by overlapping the concepts identified through UMLS in the input and the reference summary.
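To make the Explicit Mention subgroup concrete, the sketch below derives it by intersecting the CUIs matched in the input Assessment with the CUIs matched in the reference summary. This is a minimal illustration assuming a local QuickUMLS installation; the index path is a placeholder and the matching parameters are our assumptions.

```python
# Sketch: derive "Explicit Mention" labels by intersecting the CUIs matched in the
# input Assessment with the CUIs matched in the annotated reference summary.
from quickumls import QuickUMLS

matcher = QuickUMLS("/path/to/quickumls_index")  # placeholder path to a local UMLS index

def extract_cuis(text: str) -> set:
    """Return the set of UMLS CUIs matched anywhere in the text."""
    cuis = set()
    for candidate_group in matcher.match(text, best_match=True, ignore_syntax=False):
        for candidate in candidate_group:
            cuis.add(candidate["cui"])
    return cuis

def explicit_mentions(assessment: str, reference_summary: str) -> set:
    """CUIs that appear in both the Assessment input and the reference problem list."""
    return extract_cuis(assessment) & extract_cuis(reference_summary)
```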
5. Experiment Setting
The Unified Medical Language System (UMLS) from the National Library of Medicine is the largest resource containing biomedical concepts and their relationships (Bodenreider, 2004). We applied the concept extractor from QuickUMLS (Soldaini and Goharian), a fast and lightweight Python package, to identify all the medical concepts in the text as the baseline system. Two state-of-the-art seq2seq transformers were selected for comparison with the rule-based method: T5 (Raffel et al., 2020) and BART (Lewis et al., 2020). Transformer models are known to be data hungry and are pre-trained on general-domain text, yet our training data was limited in size and full of medical terms. To help the models learn medical vocabulary and knowledge, we used data augmentation to generate more training samples for our experiments (§5.1 and §5.2).
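As a rough sketch of the rule-based baseline described above (the emitted surface form, the de-duplication by CUI, and joining terms with semicolons are illustrative choices of ours; the baseline is only specified as a QuickUMLS concept extractor over the input):

```python
# Sketch of the rule-based baseline: run QuickUMLS over the Assessment section and
# emit the matched terms as the predicted problem list.
from quickumls import QuickUMLS

matcher = QuickUMLS("/path/to/quickumls_index")  # placeholder path to a local UMLS index

def rule_based_summary(assessment: str) -> str:
    terms, seen_cuis = [], set()
    for group in matcher.match(assessment, best_match=True, ignore_syntax=False):
        best = max(group, key=lambda c: c["similarity"])  # top candidate per matched span
        if best["cui"] not in seen_cuis:
            seen_cuis.add(best["cui"])
            terms.append(best["ngram"])
    return "; ".join(terms)
```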
5.1. Data augmentation
Figure 4 presents a workflow of the data augmentation method across the following three steps: (1) concept identification; (2) synonym mapping; and (3) augmented sample generation. Given an input text, the concept identification step used QuickUMLS to extract ngram terms matched to UMLS entities. This step was performed by a string-matching algorithm with a similarity threshold; in our use case, we used the Jaccard measure with the threshold set to 1 (i.e., exact matches). The results returned Concept Unique Identifiers (CUIs), symbolic IDs for the medical concepts in UMLS. An example output of this step is illustrated in Figure 4: a dictionary of the matched ngrams, e.g. “pancreatic cancer”, with start and end character positions and CUIs, e.g. [C0235974]. The mapping module in step 2 found synonyms through CUIs. Here, we used Owlready (Lamy, 2017), which automatically constructs a UMLS ontology graph, linking the concepts with relations and enabling a quick synonym lookup given a CUI. The synonyms were then passed to the last module for augmented sample generation, which randomly chose synonyms and replaced the concepts. An input text may contain n concepts, each with at most r synonyms; the number of synonym combinations, r^n, grows exponentially as n increases. For efficiency, we limited the number of combinations generated by concept replacement to 1000. We ran the pipeline on both the reference summary and the input assessment, and obtained approximately 132,000 pairs of samples as additional training data. We conducted quality measurements on the augmented samples and report the results in Appendix B.
Figure 4:
Workflow of the data augmentation method with an input reference summary and output augmented sample
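A condensed sketch of step 3 (augmented sample generation) follows, assuming concepts have already been identified (step 1) and that `cui_to_synonyms` is a CUI-to-synonym lookup built from the UMLS graph (via Owlready in our pipeline); the helper names and the random-sampling shortcut for enumerating combinations are illustrative.

```python
import random

def augment(text, concepts, cui_to_synonyms, max_samples=1000, seed=0):
    """Generate augmented samples by replacing matched concept spans with random synonyms.

    `concepts` is a list of dicts with 'start', 'end', and 'cui' keys (as returned by
    QuickUMLS, assumed non-overlapping); `cui_to_synonyms` maps a CUI to its UMLS synonyms.
    The cap of 1000 samples mirrors the combination limit described above.
    """
    rng = random.Random(seed)
    samples = set()
    ordered = sorted(concepts, key=lambda c: c["start"])
    for _ in range(max_samples):
        pieces, cursor = [], 0
        for c in ordered:
            synonyms = cui_to_synonyms.get(c["cui"], [])
            replacement = rng.choice(synonyms) if synonyms else text[c["start"]:c["end"]]
            pieces.append(text[cursor:c["start"]])
            pieces.append(replacement)
            cursor = c["end"]
        pieces.append(text[cursor:])
        samples.add("".join(pieces))
    return list(samples)
```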
5.2. Domain adaptive pretraining with random concept masking
The summarization task requires clinical text understanding and medical knowledge, posing challenges for models pre-trained on the general domain. Previous work proposed strategies of continuing to train a pre-trained language model on domain-specific tasks to enable domain knowledge learning (Gupta et al., 2021; Pruksachatkun et al., 2020; Gururangan et al., 2020). We followed a similar approach to investigate the effect of domain adaptive pre-training (DAPT) on our summarization task. Specifically, we continued pre-training T5 on the Assessment and Plan sections from all progress notes in MIMIC, excluding the notes in the test set. The resulting set contained 293,000 notes; the three most frequent note types were Nursing Progress Notes (181k), Physician Resident Progress Notes (61k), and Intensivist Notes (25k).
T5 is pre-trained with random span masking: given a text string, token spans are randomly replaced with sentinel tags (e.g., “<extra_id_0>”) and the model learns to generate the masked tokens. However, not all words are equally important in our task, and we wanted the model to learn clinical semantic types such as symptoms and diseases. Previous work proposed masking biomedical entities and time expressions, achieving performance gains over BERT without entity masking (Lin et al., 2021; Pergola et al., 2021). Inspired by these works, we adopted a concept masking policy where we randomly masked the concepts identified through UMLS. We set the mask token ratio to 15%. For example, the highlighted text in Figure 3 would be randomly replaced with the special tag. The statistics of the training set are shown in Table 1.
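The sketch below illustrates how such concept masking can be turned into T5-style input/target pairs with sentinel tokens; the 15% ratio follows the description above, while the span handling and target format are assumptions modeled on T5’s span-corruption objective.

```python
import random

def mask_concepts(text, concept_spans, mask_ratio=0.15, seed=0):
    """Build a (source, target) pair by masking ~15% of UMLS concept spans.

    `concept_spans` is a list of (start, end) character offsets of concepts found by
    QuickUMLS, assumed non-overlapping. Masked spans are replaced with T5 sentinel
    tokens <extra_id_0>, <extra_id_1>, ...; the target lists the masked spans.
    """
    rng = random.Random(seed)
    spans = sorted(concept_spans)
    if not spans:
        return text, ""
    n_mask = max(1, int(len(spans) * mask_ratio))
    chosen = sorted(rng.sample(range(len(spans)), n_mask))

    source, target, cursor = [], [], 0
    for sentinel_id, idx in enumerate(chosen):
        start, end = spans[idx]
        source.append(text[cursor:start])
        source.append(f"<extra_id_{sentinel_id}>")
        target.append(f"<extra_id_{sentinel_id}> {text[start:end]}")
        cursor = end
    source.append(text[cursor:])
    target.append(f"<extra_id_{n_mask}>")  # closing sentinel, as in T5 span corruption
    return "".join(source), " ".join(target)
```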
Table 1:
Size and average input length (and standard deviation σ) of training set for different experiment settings: the original annotated set for fine-tuning, the data generated from data augmentation method, and DAPT.
| Set | Fine-tuning | Data Augmt. | DAPT |
|---|---|---|---|
| #Notes | 700 | 132k | 293k |
| Input Len (σ) | 43.33 (24.75) | 212.74 (78.39) | 46.19 (70.95) |
5.3. Evaluation
We use ROUGE-L (Lin, 2004), a conventional summarization metric based on n-gram overlap, as well as BERTScore (Zhang* et al., 2020), which reports the maximum pairwise cosine similarity between word embeddings of the reference and predicted summaries. ROUGE fails to recognize synonyms and abbreviations, which are common in biomedical text: e.g., heart attack is the same clinical diagnosis as myocardial infarction, and MI is the abbreviation of myocardial infarction. BERTScore compensates for this limitation by using contextualized word embeddings from SapBERT (Liu et al., 2021), a state-of-the-art BERT encoder (Devlin et al., 2019) for biomedical entity representation that assigns high cosine similarity to synonyms and abbreviations based on UMLS. The reliability of both metrics has been validated in the literature, so we report them as the main results.
Meanwhile, to better understand the system output, we provide two additional metrics that measure the quality of higher-level information and medical concepts. We took the last-layer hidden states from SapBERT for the reference and predicted summaries, and measured the cosine similarity between the sequence embeddings (Sent.θ). To evaluate the model’s performance in predicting medical concepts, we ran QuickUMLS to obtain all CUIs from the reference and predicted summaries and computed the F-score. This metric has its own limitations: the matching algorithm is sensitive to parameter tuning, which can cause superfluous or missed extractions. Regardless, we include these metrics as approximate solutions towards knowledge-based evaluation for clinical summarization, and leave further metric development for future work.
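For concreteness, the sketch below shows one way to compute the two auxiliary metrics: a set-based CUI F-score and the SapBERT sequence-embedding similarity. The HuggingFace checkpoint name and the CLS pooling are assumptions; the paper only specifies last-layer hidden states from SapBERT.

```python
# Sketch of the CUI F-score and the SapBERT sentence similarity (Sent.θ).
import torch
from transformers import AutoModel, AutoTokenizer

def cui_f_score(pred_cuis: set, ref_cuis: set) -> float:
    """Harmonic mean of set-based precision and recall over extracted CUIs."""
    if not pred_cuis or not ref_cuis:
        return 0.0
    overlap = len(pred_cuis & ref_cuis)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_cuis), overlap / len(ref_cuis)
    return 2 * precision * recall / (precision + recall)

SAPBERT = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(SAPBERT)
encoder = AutoModel.from_pretrained(SAPBERT)

def sentence_similarity(reference: str, prediction: str) -> float:
    """Cosine similarity between SapBERT last-layer sequence embeddings (CLS pooling assumed)."""
    batch = tokenizer([reference, prediction], padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    ref_vec, pred_vec = hidden[:, 0, :]  # first token ([CLS]) of each sequence
    return torch.nn.functional.cosine_similarity(ref_vec, pred_vec, dim=0).item()
```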
In the experiments, we set the maximum input and output length to 512 and 128 tokens, respectively. The input text was truncated if the maximum length was exceeded. All experiments occurred on two NVIDIA Tesla V100 32GB GPUs. We used early stopping on the development set during training and saved the models with the highest validation ROUGE-L F-score for evaluation. More implementation details are presented in the Appendix C.
6. Results and Analysis
We evaluated all systems on a test set of 92 progress notes and summaries. Recall that the progress notes contained the Subjective, Objective, Assessment, and Plan sections. We set two types of input to the models: (1) the Assessment section only (Assmt), and (2) the Assessment and Subjective sections (length permitting) (A+Subj). Both input settings also had augmented samples from the data augmentation method introduced earlier. We started with a simple rule-based system that ran a UMLS concept extractor on the Assessment section. The evaluation metrics across the rule-based system, fine-tuned T5 and BART (§6.1), and T5 with domain adaptive pre-training (DAPT, §6.2) are shown in Tables 2 and 3. T5 with DAPT outperformed all other systems and established a benchmark performance for the task. We include a qualitative analysis to provide data-driven insights into the task (§6.3).
Table 2:
ROUGE-L F-score (RL-F), sentence embedding cosine similarity (Sent.θ), BERTScore (BS), and evaluation using CUI F-score (CUI) from fine-tuning T5 and BART on the two input settings: Assessment (Assmt) and Assessment with Subjective sections (A+Subj). “++” denotes training with data augmentation.
| Model | Setting | Explicit Mentions | | | | Direct Problems | | | | Indirect Problems | | | | All Problems | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | RL-F | Sent.θ | BS | CUI | RL-F | Sent.θ | BS | CUI | RL-F | Sent.θ | BS | CUI | RL-F | Sent.θ | BS | CUI |
| Rule-based | Assmt | 34.45 | 58.81 | 59.80 | 38.97 | 12.31 | 55.33 | 40.13 | 34.23 | 9.49 | 55.58 | 44.46 | 33.16 | 13.45 | 68.61 | 50.32 | 43.93 |
| T5 | Assmt | 32.77 | 59.57 | 57.75 | 41.73 | 13.68 | 53.44 | 39.72 | 36.10 | 10.40 | 54.76 | 44.16 | 35.08 | 14.82 | 67.49 | 49.89 | 44.51 |
| T5 | Assmt ++ | 31.76 | 58.74 | 57.12 | 42.19 | 13.78 | 53.65 | 40.30 | 35.84 | 10.55 | 54.10 | 43.48 | 35.20 | 15.00 | 67.32 | 50.36 | 44.55 |
| T5 | A+Subj | 20.24 | 50.04 | 47.55 | 33.44 | 9.52 | 51.91 | 39.72 | 30.43 | 7.10 | 54.14 | 43.87 | 30.29 | 10.89 | 64.63 | 49.75 | 39.02 |
| T5 | A+Subj ++ | 20.72 | 59.64 | 57.97 | 33.56 | 9.46 | 53.55 | 39.52 | 18.76 | 7.35 | 54.69 | 44.36 | 14.40 | 10.93 | 67.19 | 50.42 | 24.83 |
| BART | Assmt | 25.70 | 54.98 | 52.99 | 32.49 | 10.00 | 53.66 | 39.08 | 29.41 | 8.04 | 54.66 | 43.12 | 29.04 | 11.56 | 66.86 | 48.48 | 38.36 |
| BART | Assmt ++ | 28.22 | 57.04 | 55.16 | 32.28 | 10.33 | 53.40 | 39.21 | 30.75 | 8.29 | 54.48 | 44.01 | 32.08 | 11.65 | 66.67 | 49.23 | 40.69 |
| BART | A+Subj | 18.80 | 49.19 | 46.77 | 26.96 | 7.04 | 51.70 | 38.24 | 25.30 | 6.00 | 54.29 | 43.71 | 26.01 | 9.25 | 64.95 | 48.19 | 34.02 |
| BART | A+Subj ++ | 20.23 | 57.91 | 54.68 | 32.91 | 7.88 | 53.85 | 40.21 | 30.09 | 6.85 | 54.61 | 43.15 | 30.12 | 9.84 | 67.00 | 49.70 | 38.72 |
Table 3:
Performance of T5 with domain adaptive pre-training using Assessment (Assmt) as input, under two masking policies: Token Masking and Concept Masking. We report ROUGE-L F-score (RL-F) and BERTScore (BS), as well as sentence embedding cosine similarity (Sent.θ) and CUI F-score. Numbers highlighted in green mark the highest performance across all results, with the parenthesized numbers (↑) denoting improvements over the rule-based results.
| Setting | Model | Token Masking | | | | Concept Masking | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | RL-F | Sent.θ | BS | CUI | RL-F | Sent.θ | BS | CUI |
| Explicit | Assmt | 32.66 | 61.34 (↑2.53) | 56.68 | 47.10 (↑8.13) | 29.86 | 55.87 | 53.91 | 40.27 (↑2.14) |
| Explicit | ++ | 26.94 | 59.40 (↑0.59) | 55.05 | 42.73 (↑3.76) | 32.82 | 58.21 | 56.80 | 43.16 (↑4.19) |
| Direct | Assmt | 12.69 | 53.63 | 42.40 (↑2.27) | 35.39 (↑1.16) | 14.90 (↑2.59) | 55.48 (↑0.15) | 47.10 (↑6.97) | 35.29 (↑1.06) |
| Direct | ++ | 10.44 | 53.47 | 43.46 (↑3.33) | 37.45 (↑3.22) | 15.76 (↑5.22) | 56.82 (↑1.49) | 48.72 (↑8.72) | 37.74 (↑3.51) |
| Indirect | Assmt | 10.07 (↑0.58) | 52.72 | 41.47 | 38.19 (↑5.03) | 13.58 (↑4.36) | 53.44 | 44.91 (↑0.45) | 33.56 (↑0.40) |
| Indirect | ++ | 8.04 | 51.84 | 40.45 | 37.53 (↑4.37) | 13.28 (↑4.06) | 55.02 | 45.51 (↑1.05) | 35.10 (↑1.94) |
| All | Assmt | 14.49 (↑1.04) | 62.40 | 49.62 | 40.44 | 18.72 (↑5.27) | 64.69 | 54.03 (↑3.71) | 42.69 |
| All | ++ | 12.12 | 63.08 | 50.20 | 45.58 (↑1.65) | 18.80 (↑5.35) | 66.08 | 55.29 (↑4.86) | 44.56 (↑0.63) |
6.1. Overall performance of fine-tuned models
Table 2 reports the ROUGE-L F-score, cosine similarity on sentence embeddings (Sent.θ), BERTScore, and CUI F-score. Overall, scores dropped from Explicit Mentions to Direct Problems to Indirect Problems, likely due to increasing complexity as abstractive concepts outnumber extractive ones. Explicit Mention summarization was the easiest and Indirect Problem summarization was the hardest. The rule-based system outperformed all T5 and BART variants on the Explicit Mentions, given that it identifies the obvious entity mentions. For T5 and BART, fine-tuning with augmented samples slightly improved the ROUGE scores. Adding subjective sections (A+Subj) did not bring benefits, possibly because most subjective sections are empty in ICU progress notes. T5 had more variants with better scores than BART. In our manual investigation, we found that BART generated text that is not relevant to the medical domain [3]. In sum, all fine-tuned models performed roughly on par with the baseline, which is notable given that the baseline uses domain knowledge (medical concepts).
6.2. The effect of domain adaptation pre-training
Table 3 contains results from T5 trained with DAPT and fine-tuned on the annotated set, across two masking methods: random token masking (T5-DAPT-TKS) and concept masking (T5-DAPT-CUI). To highlight the differences before and after DAPT, we show the four scores as well as the gains over the baseline system on the Assmt input. Overall, both DAPT settings delivered better performance. The performance gain of T5-DAPT-TKS came mainly from the CUI F-score (+1.16 to +8.13). Superior results were seen from T5-DAPT-CUI, which achieved the best performance in all settings except Explicit Mentions, yielding large gains in ROUGE-L F-score (+2.59 to +5.35) and BERTScore (+0.45 to +8.72).
In addition, Figure 5 shows the ROUGE Recall and Precision drops and gains of all models over the baseline. ROUGE Recall measures content coverage, and Precision measures content relevancy in the predicted summary (Lin, 2004). All models had lower recall than the baseline, indicating that their coverage was limited. T5 DAPT variants showed higher gains in precision, with the largest gain from T5-DAPT-CUI (+5 to +12). These results indicate that continuing to train T5 with domain vocabulary is a promising direction for solving the task.
Figure 5:
Performance drops (lighter color) and gains (darker color) over baseline (first column) on ROUGE-L Recall (top 4 rows) and Precision (bottom 4 rows). The darker the cell color is, the higher performance gain the model obtains over baseline.
6.3. Qualitative analysis
Besides the numeric metrics reported above, we provide example predicted summaries and a qualitative analysis performed by a domain expert (a critical care ICU physician). We cherry-picked two examples from T5-DAPT-CUI that best represent the characteristics of diagnostic consistency in clinical diagnostic reasoning, and present them in Figure 6. Example 6.1 shows the model performing extractive summarization: it generated both hypertension and hypotension as relevant diagnoses, representing an Indirect label for a past medical history of hypertension and a Direct label for an active problem during the hospitalization (hypotension). In example 6.2, the model performed abstractive summarization. The last half of the Assessment highlights a type of heart attack (e.g., “NSTEMI”) requiring an emergent medical procedure (e.g., “cath with DES in LAD and LMCA”), and the model summarized this rather complex statement into a single, accurate diagnosis of Coronary Artery Disease in its abbreviated form, “CAD”.
Figure 6:
Two cherry-picked examples from T5-DAPT-CUI output, with cyan fonts highlighting the correct diseases. [4]
7. Discussion
Our work begins with a single note in a cross-sectional design to build our models; however, a patient’s hospitalization is a multi-document workflow with repeated measures of progress notes and other note types across several days and multiple providers. In addition, providers generate their diagnoses via a reasoning process that includes structured data from vital signs, laboratory results, etc. Images and radiology reports are another modality, highlighting the multi-modal nature of diagnostic reasoning. Nonetheless, our work opens the door for future research on knowledge-intensive clinical summarization. This section discusses future directions for solving this task.
Exploring structured data
The objective sections of the progress note contain embedded structured data, delivering rich information about the patient’s problems. Recall the example in Figure 3: the reference summary contains the diagnoses “Leukocytosis” (high white blood cell count) and “anemia” (low red blood cell count). These diagnoses are usually found in laboratory results. To investigate the use of objective sections and structured data, we appended both the Subjective and Objective sections in chronological order to the Assessment as input to T5 for fine-tuning and evaluation (T5-ALL), and we let the T5 tokenizer truncate the text when it exceeded the 512-token limit. On the test set, the scores are too low to report. Yet, we observed that T5-ALL, instead of generating medical concepts, often extracts lines of lab values that strongly associate with the disease in the reference summary (see Figure 7.1 and 7.2). This preliminary result points to a future direction of modeling the association between diseases and lab values in summarization.
Figure 7:
Two example reference (REF) and predicted summaries (PRED) from T5-ALL (input with objective sections).
Incorporating knowledge into models
We propose a knowledge-intensive summarization task that requires clinical text understanding, knowledge representation, and diagnostic reasoning. The experimental results showed that pre-training on medical concepts effectively improved performance, while challenges remain in understanding the associations among medications, symptoms, and diseases. Recent work on event extraction and clinical relation extraction incorporates biomedical knowledge graphs into pre-trained language models (Huang et al., 2020; Roy and Pan, 2021). Our future work will investigate the incorporation of knowledge graphs into seq2seq pre-trained models.
Evidence-based evaluation
Medical diagnosis is a critical component of effective healthcare, but misdiagnosis is a major contributor to medical errors, especially in critical care settings where quick decision-making is needed. Diagnoses predicted by a system must be non-redundant and contextually relevant to the data gathered in the progress note to reflect valid reasoning. We believe an automated evaluation method for problem summarization should assess knowledge representation, non-redundancy, and evidence relevancy; the automated metrics used in our work cover only some of these aspects. Recently, Moramarco et al. (2021) studied a fact-based evaluation for medical summarization using human evaluation, which we plan to carry out in future work.
8. Conclusion
We propose a problem summarization task that addresses diagnostic reasoning, and show that T5 with DAPT achieves benchmark performance for the task, though key challenges remain. Our work lays the groundwork for future research on knowledge-fused clinical summarizers as well as real-world clinical diagnostic decision support systems. Future work will investigate the use of structured data, evidence-based evaluation metrics, and better models for knowledge representation and summarization.
Ethical Statement
The data used in this research came from a fully de-identified dataset (containing no protected health information) that we received permission to use under a PhysioNet Credentialed Health Data Use Agreement (v1.5.0). The study was determined to be exempt from human subjects research. All experiments followed the PhysioNet Credentialed Health Data License Agreement.
Medical charting by providers in the electronic health record is at-risk for multiple types of bias. Our research focused on building a system to overcome the cognitive biases in medical decision-making by providers. However, statistical and social biases need to be addressed before integrating our work into any clinical decision support system for clinical trials or healthcare delivery. In particular, implicit bias towards vulnerable populations and stigmatizing language in certain medical conditions like substance use disorders are genuine concerns that can transfer into language model training (Thompson et al., 2021; Saitz et al., 2021; Karnik et al., 2020). Therefore, it should be assumed that our corpus of notes for this task will carry social bias features that can affect fairness and equity during model training. Before the deployment of any pre-trained language model, it is the responsibility of the scientists and health system to audit the model for fairness and equity in its performance across disparate health groups (Saleiro et al., 2018). Fairness and equity audits alongside model explanations are needed to ensure an ethical model trustworthy to all stakeholders, especially patients and providers.
Figure 1:
When a sick patient arrives at the hospital, diagnostic evaluations are performed to assess the patient’s condition and deduce the problems causing the illness.
A. Annotator Training
We recruited two medical students as annotators who had received training in SOAP note documentation as part of their medical school curriculum. A three-week orientation and training was conducted by one of the critical care physicians. During training, each annotation achieved an inter-annotator agreement with the adjudicator with a kappa score above 0.80. Another round of training was performed on 200 notes, and the inter-annotator agreement was measured between the annotators and the adjudicator. Annotations were reviewed if the kappa score fell below the 0.80 threshold.
Table 4:
Quality measurement on augmented input assessments (Assmt) and reference summaries (Summ). For every pair of original and augmented samples, we report the cosine similarity between text embeddings (θ), the Jaccard token overlap, and the mean and standard deviation (σ) of the length difference.
Input | Sent.θ | Jaccard | Length Diff. |
---|---|---|---|
Assmt | 89.00 | 37.85 | 6.13 (4.12) |
Summ | 83.14 | 14.43 | 9.42 (5.99) |
Table 5:
Hyperparameters for T5 DAPT
Hyper-parameter | Setting |
---|---|
Optimizer | AdamW |
Epoch | 10 (with early stopping) |
Learning rate | 1e-3, 1e-4 |
Batch size | 256 |
Gradient accumulation | True |
B. Quality Measure for Data Augmentation
The quality of augmented data directly affects the training process. To ensure a high-quality training corpus, we randomly selected 2,000 pairs of augmented samples. We evaluated how well meaning was preserved in the augmented samples and how much lexical variation was introduced. We report cosine similarity between the embedding pairs for preservation of meaning, and Jaccard similarity for the degree of string overlap. Specifically, given a pair of the original sample and the augmented sample, we generated a text embedding through SapBERT (Liu et al., 2021), a BERT encoder pre-trained to represent biomedical entities using UMLS. We expected a high cosine similarity if the augmented samples expressed the same meanings as the original samples. We computed Jaccard similarity by treating the samples as lists of tokens, and expected a low Jaccard score if new terms were introduced in the augmented samples, e.g. ARF and Acute Renal Failure. We also report the mean and standard deviation of the length differences between original and augmented samples (Table 4). On both the input assessment and the reference summary, the cosine similarity between original and augmented samples was higher than 0.80. The assessment input contained more words that were not biomedical concepts; thus, its augmented samples had a greater proportion of overlapping text than the reference summary. Both differed in length by more than 6 tokens on average. In conclusion, our proposed data augmentation strategy produced a high-quality training corpus.
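A minimal sketch of the surface-level checks (Jaccard token overlap and length difference) is shown below; the embedding similarity can be computed with the same SapBERT routine sketched in §5.3.

```python
def jaccard_token_overlap(original: str, augmented: str) -> float:
    """Jaccard similarity between the token sets of an original/augmented pair."""
    a, b = set(original.split()), set(augmented.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def length_difference(original: str, augmented: str) -> int:
    """Absolute difference in token length between the two samples."""
    return abs(len(original.split()) - len(augmented.split()))
```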
Figure 8:
Given an input assessment, we show the reference summary, example output from fine-tuning T5 and BART, and T5 DAPT with token masking and concept masking. The red fonts show the information that is outside the input text.
Table 6:
Hyperparameters for fine-tuning T5 and BART
Hyper-parameter | Setting |
---|---|
Optimizer | Adam |
Epoch | 10 (with early stopping) |
Learning rate | 1e-4, 1e-5, 1e-6 |
Batch size | 4 |
Task Prefix (t5) | “summarize:” |
Encoder max length | 512 |
Decoder max length | 128 |
Beam size | 10 |
Length penalty | 1 |
no repeat ngram size | 2 |
C. Hyperparameters
Here we report the hyperparameters used for the T5 DAPT experiments in Table 5, and for fine-tuning T5 and BART in Table 6. The input length for both T5 and BART is set to 512 tokens. In the training data, the average length of the input assessment is 43.33 tokens, and the average combined length of the assessment and subjective sections is 70.97 tokens. Therefore, the maximum encoder length is appropriate for our task.
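For concreteness, the decoding settings in Table 6 map onto the HuggingFace generate API roughly as follows; this is a sketch, and the "t5-base" checkpoint and placeholder input are assumptions rather than the exact fine-tuned weights used in our experiments.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")          # placeholder checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

assessment_text = "..."  # placeholder for an input Assessment (plus Subjective, length permitting)
source = "summarize: " + assessment_text  # T5 task prefix from Table 6
inputs = tokenizer(source, max_length=512, truncation=True, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_length=128,           # decoder max length
    num_beams=10,             # beam size
    length_penalty=1.0,
    no_repeat_ngram_size=2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```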
D. Model Example Output
Figure 8 shows example output from fine-tuned T5 and BART, and from T5-DAPT with the token masking and concept masking policies. T5-DAPT-CUI extracts medical concepts. FT-T5 and T5-DAPT-TKS extract sequences of text from the input assessment. FT-BART produces text with information that is not mentioned in the input (red fonts).
Footnotes
1. Training script is available at: https://git.doit.wisc.edu/smph/dom/UW-ICU-Data-Science-Lab/drbench.
2. Annotation is available through PhysioNet.
3. See Appendix D for example output from all models.
4. Semicolons are removed during fine-tuning and evaluation. We manually inserted them back for presentation purposes.
Contributor Information
Yanjun Gao, ICU Data Science Lab, School of Medicine and Public Health, University of Wisconsin-Madison.
Timothy Miller, Boston Children’s Hospital, and Harvard Medical School.
Dongfang Xu, Boston Children’s Hospital, and Harvard Medical School.
Dmitriy Dligach, Loyola University Chicago.
Matthew M. Churpek, ICU Data Science Lab, School of Medicine and Public Health, University of Wisconsin-Madison
Majid Afshar, ICU Data Science Lab, School of Medicine and Public Health, University of Wisconsin-Madison.
References
- Adams Griffin, Alsentzer Emily, Ketenci Mert, Zucker Jason, and Elhadad Noémie. 2021. What’s in a summary? laying the groundwork for advances in hospital-course summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4794–4811.
- Barrows Howard S, Tamblyn Robyn M, et al. 1980. Problem-based learning: An approach to medical education, volume 1. Springer Publishing Company.
- Bodenreider Olivier. 2004. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270.
- Bowen Judith L. 2006. Educational strategies to promote clinical diagnostic reasoning. New England Journal of Medicine, 355(21):2217–2225.
- Brown PJ, Marquard JL, Amster B, Romoser M, Friderici J, Goff S, and Fisher D. 2014. What do physicians read (and ignore) in electronic progress notes? Applied clinical informatics, 5(02):430–444.
- Devarakonda Murthy V, Mehta Neil, Tsou Ching-Huei, Liang Jennifer J, Nowacki Amy S, and Jelovsek John Eric. 2017. Automated problem list generation and physicians perspective from a pilot study. International journal of medical informatics, 105:121–129.
- Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
- Furlow Bryant. 2020. Information overload and unsustainable workloads in the era of electronic health records. The Lancet Respiratory Medicine, 8(3):243–244.
- Gharebagh Sajad Sotudeh, Goharian Nazli, and Filice Ross. 2020. Attend to medical ontologies: Content selection for clinical abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1899–1905.
- Gupta Yash, Ammanamanchi Pawan Sasanka, Bordia Shikha, Manoharan Arjun, Mittal Deepak, Pasunuru Ramakanth, Shrivastava Manish, Singh Maneesh, Bansal Mohit, and Jyothi Preethi. 2021. The effect of pretraining on extractive summarization for scientific documents. In Proceedings of the Second Workshop on Scholarly Document Processing, pages 73–82.
- Gururangan Suchin, Marasović Ana, Swayamdipta Swabha, Lo Kyle, Beltagy Iz, Downey Doug, and Smith Noah A. 2020. Don’t stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360.
- Hirsch Jamie S, Tanenbaum Jessica S, Gorman Sharon Lipsky, Liu Connie, Schmitz Eric, Hashorva Dritan, Ervits Artem, Vawdrey David, Sturm Marc, and Elhadad Noémie. 2015. Harvest, a longitudinal patient record summarizer. Journal of the American Medical Informatics Association, 22(2):263–274.
- Huang Kung-Hsiang, Yang Mu, and Peng Nanyun. 2020. Biomedical event extraction with hierarchical knowledge graphs. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1277–1285.
- Hultman Gretchen M, Marquard Jenna L, Lindemann Elizabeth, Arsoniadis Elliot, Pakhomov Serguei, and Melton Genevieve B. 2019. Challenges and opportunities to improve the clinician experience reviewing electronic progress notes. Applied clinical informatics, 10(03):446–453.
- Johnson Alistair EW, Pollard Tom J, Shen Lu, Lehman Liwei H, Feng Mengling, Ghassemi Mohammad, Moody Benjamin, Szolovits Peter, Celi Leo Anthony, and Mark Roger G. 2016. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9.
- Karnik Niranjan S, Afshar Majid, Churpek Matthew M, and Nunez-Smith Marcella. 2020. Structural disparities in data science: a prolegomenon for the future of machine learning. The American Journal of Bioethics, 20(11):35–37.
- Krishna Kundan, Khosla Sopan, Bigham Jeffrey P, and Lipton Zachary C. 2021. Generating soap notes from doctor-patient conversations using modular summarization techniques. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4958–4972.
- Lamy Jean-Baptiste. 2017. Owlready: Ontology-oriented programming in python with automatic classification and high level constructs for biomedical ontologies. Artificial intelligence in medicine, 80:11–28.
- Lewis Mike, Liu Yinhan, Goyal Naman, Ghazvininejad Marjan, Mohamed Abdelrahman, Levy Omer, Stoyanov Veselin, and Zettlemoyer Luke. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
- Liang Jennifer, Tsou Ching-Huei, and Poddar Ananya. 2019. A novel system for extractive clinical note summarization using EHR data. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 46–54, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
- Lin Chen, Miller Timothy, Dligach Dmitriy, Bethard Steven, and Savova Guergana. 2021. Entitybert: Entity-centric masking strategy for model pretraining for the clinical domain. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 191–201.
- Lin Chin-Yew. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- Lipsky-Gorman S and Elhadad N. 2011. Clinnote and healthtermfinder: a pipeline for processing clinical notes. Columbia University.
- Liu Fangyu, Shareghi Ehsan, Meng Zaiqiao, Basaldella Marco, and Collier Nigel. 2021. Self-alignment pretraining for biomedical entity representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4228–4238.
- MacAvaney Sean, Sotudeh Sajad, Cohan Arman, Goharian Nazli, Talati Ish, and Filice Ross W. 2019. Ontology-aware clinical abstractive summarization. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1013–1016.
- Manas Gaur, Aribandi Vamsi, Kursuncu Ugur, Alambo Amanuel, Shalin Valerie L, Thirunarayan Krishnaprasad, Beich Jonathan, Narasimhan Meera, Sheth Amit, et al. 2021. Knowledge-infused abstractive summarization of clinical diagnostic interviews: Framework development study. JMIR Mental Health, 8(5):e20865.
- Moramarco Francesco, Juric Damir, Savkov Aleksandar, and Reiter Ehud. 2021. Towards objectively evaluating the quality of generated medical summaries. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 56–61.
- Pergola Gabriele, Kochkina Elena, Gui Lin, Liakata Maria, and He Yulan. 2021. Boosting low-resource biomedical qa via entity-aware masking strategies. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1977–1985.
- Pruksachatkun Yada, Phang Jason, Liu Haokun, Htut Phu Mon, Zhang Xiaoyi, Pang Richard Yuanzhe, Vania Clara, Kann Katharina, and Bowman Samuel. 2020. Intermediate-task transfer learning with pretrained language models: When and why does it work? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5231–5247.
- Raffel Colin, Shazeer Noam, Roberts Adam, Lee Katherine, Narang Sharan, Matena Michael, Zhou Yanqi, Li Wei, and Liu Peter J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.
- Roy Arpita and Pan Shimei. 2021. Incorporating medical knowledge in bert for clinical relation extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5357–5366.
- Rule Adam, Bedrick Steven, Chiang Michael F, and Hribar Michelle R. 2021. Length and redundancy of outpatient progress notes across a decade at an academic medical center. JAMA Network Open, 4(7):e2115334–e2115334.
- Saitz Richard, Miller Shannon C, Fiellin David A, and Rosenthal Richard N. 2021. Recommended use of terminology in addiction medicine. Journal of Addiction Medicine, 15(1):3–7.
- Saleiro Pedro, Kuester Benedict, Hinkson Loren, London Jesse, Stevens Abby, Anisfeld Ari, Rodolfa Kit T, and Ghani Rayid. 2018. Aequitas: A bias and fairness audit toolkit. arXiv preprint arXiv:1811.05577.
- See Abigail, Liu Peter J, and Manning Christopher D. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.
- Shoolin J, Ozeran L, Hamann C, and Bria Ii W. 2013. Association of medical directors of information systems consensus on inpatient electronic health record documentation. Applied clinical informatics, 4(02):293–303.
- Soldaini Luca and Goharian Nazli. Quickumls: a fast, unsupervised approach for medical concept extraction.
- Thompson Hale M, Sharma Brihat, Bhalla Sameer, Boley Randy, McCluskey Connor, Dligach Dmitriy, Churpek Matthew M, Karnik Niranjan S, and Afshar Majid. 2021. Bias and fairness assessment of a natural language processing opioid misuse classifier: detection and mitigation of electronic health record data disadvantages across racial subgroups. Journal of the American Medical Informatics Association, 28(11):2393–2403.
- Tsou Amy Y, Lehmann Christoph U, Michel Jeremy, Solomon Ronni, Possanza Lorraine, and Gandhi Tejal. 2017. Safe practices for copy and paste in the ehr. Applied clinical informatics, 26(01):12–34.
- Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Weed Lawrence L. 1964. Medical records, patient care, and medical education. Irish Journal of Medical Science (1926-1967), 39(6):271–282.
- Yim Wen-wai and Yetisgen-Yildiz Meliha. 2021. Towards automating medical scribing: Clinic visit dialogue2note sentence alignment and snippet summarization. In Proceedings of the Second Workshop on Natural Language Processing for Medical Conversations, pages 10–20.
- Zhang Longxiang, Negrinho Renato, Ghosh Arindam, Jagannathan Vasudevan, Hassanzadeh Hamid Reza, Schaaf Thomas, and Gormley Matthew R. 2021. Leveraging pretrained models for automatic summarization of doctor-patient conversations. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3693–3712.
- Zhang Tianyi*, Kishore Varsha*, Wu Felix*, Weinberger Kilian Q., and Artzi Yoav. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
- Zhang Yuhao, Ding Daisy Yi, Qian Tianpei, Manning Christopher D, and Langlotz Curtis P. 2018. Learning to summarize radiology findings. In Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis, pages 204–213.