Abstract
Context:
Documented goals-of-care discussions are an important quality metric for patients with serious illness. Natural language processing (NLP) is a promising approach for identifying goals-of-care discussions in the electronic health record (EHR).
Objectives:
To compare three NLP modeling approaches for identifying EHR documentation of goals-of-care discussions and generate hypotheses about differences in performance.
Methods:
We conducted a mixed-methods study to evaluate performance and misclassification for three NLP featurization approaches modeled with regularized logistic regression: bag-of-words (BOW), rule-based, and a hybrid approach. From a prospective cohort of 150 patients hospitalized with serious illness over 2018–2020, we collected 4,391 inpatient EHR notes; 99 (2.3%) contained documented goals-of-care discussions. We used leave-one-patient-out cross-validation to estimate performance by comparing pooled NLP predictions to the human-abstracted gold standard with receiver-operating-characteristic (ROC) and precision-recall (PR) analyses. We qualitatively examined a purposive sample of 70 NLP-misclassified notes using content analysis to identify linguistic features and generate hypotheses about patterns underlying misclassification.
Results:
All three modeling approaches discriminated between notes with and without goals-of-care discussions (AUCROC: BOW, 0.907; rule-based, 0.948; hybrid, 0.965). Precision and recall were only moderate (precision at 70% recall: BOW, 16.2%; rule-based, 50.4%; hybrid, 49.3%; AUCPR: BOW, 0.505; rule-based, 0.579; hybrid, 0.599). Qualitative analysis revealed patterns underlying performance differences between BOW and rule-based approaches.
Conclusion:
NLP holds promise for identifying EHR-documented goals-of-care discussions. However, the rarity of goals-of-care content in EHR data limits performance. Our findings highlight opportunities to optimize NLP modeling and support further exploration of alternative NLP approaches for identifying goals-of-care discussions.
Keywords: Natural language processing, machine learning, goals of care, electronic health record, medical informatics
INTRODUCTION:
Goals of care have been defined as the central aims of medical care for a patient, which are informed by their values and goals and used to guide clinical decisions.1 High-quality, patient-centered goals-of-care discussions and their documentation are important components of care for patients with serious illness.2–5 Goals-of-care discussions are associated with improved patient- and family-centered outcomes,4,6 increased goal-concordant care,7,8 lower healthcare costs,9 and lower intensity of end-of-life care.4,10 However, goals-of-care discussions often occur late in the course of serious illness or do not occur at all.11–13 Although there is ongoing debate about the value of advance care planning,14,15 the occurrence and quality of goals-of-care discussions for patients with serious illness remains an important quality metric and research outcome for palliative care.16
The electronic health record (EHR) presents rich opportunities for examining process and outcome measures in palliative care, including the occurrence and quality of documented goals-of-care discussions.17–19 However, because such documentation is typically written as unstructured free-text, it is often challenging to extract this data from the EHR.19–21 Manual review—the current “gold standard”—is costly and profoundly time-intensive.22,23 Natural language processing (NLP), which uses computational techniques to identify constructs in free-text data,24 is a promising approach for measuring outcomes using real-world EHR data.25,26 NLP is often combined with supervised machine learning (ML), wherein predictive models are trained to recognize labeled free-text examples of a given construct.27 NLP has been used to assess quality indicators in cancer patients undergoing palliative surgery28 and to measure documentation of ICU patients’ care preferences.29
Our research group recently developed an NLP algorithm to measure documented goals-of-care discussions in a purposively enriched dataset of EHR notes.30 However, the optimal modeling approach for this task and performance of NLP in data collected from a prospective cohort have not been reported.30 To inform future development, we conducted a hypothesis-generating mixed-methods study of three different NLP modeling approaches for identifying documented goals-of-care discussions in a large dataset of EHR notes collected from a prospective cohort of patients hospitalized with serious illness.
METHODS:
Data Source
This study utilized data from a randomized trial, Project to Improve Communication About Serious Illness – Pilot Study,31 which enrolled 150 hospitalized patients aged ≥18 years with chronic life-limiting illness32 between December 2018 and February 2020 at two large teaching hospitals at UW Medicine. Both the trial and this study were approved by the University of Washington Institutional Review Board (STUDY00004821, STUDY00011002).
All EHR notes for the enrollment hospitalization authored by a physician or advanced practice provider were coded using a standardized protocol to identify documented goals-of-care discussions (Appendix I).33 Although EHR metrics for palliative care and serious illness communication have been described in various settings,19,28,34–37 there is presently no consensus on how to identify documented goals-of-care discussions in the inpatient setting. Available evidence suggests that goals-of-care discussions and their documentation exist along a spectrum, from routine code status verification to detailed explorations of patients’ values and goals as they relate to end-of-life care.38–42 Consistent with the aims of the parent trial (which examined the effects of a communication-priming intervention31), we adopted a pragmatic definition of documented goals-of-care discussions for this study that captured goals-of-care discussions as typically documented by both palliative care and non-palliative care clinicians, while excluding routine admission code status discussions. Using the framework for serious illness communication adopted by the American College of Physicians High Value Care Task Force and other groups,1,5,41–44 we defined documented goals-of-care discussions for this study as documentation of conversations with patients or surrogate decision-makers that either (1) explored a patient’s overarching values and goals for the purpose of guiding medical decision-making, or (2) explored a patient’s preferences about specific life-sustaining treatments or strategies relevant to their serious illness (such as mechanical ventilation, chemotherapy, or transfer to intensive care). As verification and documentation of code status is ubiquitous among patients admitted to UW Medicine medical services (e.g. “Full code”, “DNR/intubation OK”, “DNR/DNI, confirmed by [surrogate] on admission”), we only considered code status documentation to represent a documented goals-of-care discussion if it included additional documentation of values, goals, or preferences about treatments other than those offered for cardiopulmonary arrest. Discussions with inpatient physicians that led to the completion of new advance care planning or POLST documents while hospitalized, or that led to referral to specialty palliative care to further clarify goals of care, were classified as “implied or impending goals-of-care discussions” and coded separately; notes with these codes were also classified as “positive” to facilitate detection of discussions that might not have been documented separately. For this initial coding, four trained chart abstractors met twice weekly with investigators over a three-month period to discuss and review coded goals-of-care discussions; all instances of disagreement were resolved by consensus. Randomization was concealed from abstractors until data collection was complete.
Natural Language Processing
Following preprocessing of notes to remove non-textual templated content, we developed three different featurization approaches for use in supervised machine learning: (1) bag-of-words (BOW) models, in which each note is represented by the frequency distribution of words that comprise it, weighted by term frequency–inverse document frequency (tf–idf);45,46 (2) rule-based models, in which each note is represented by binary matches with a series of expert-defined textual search patterns indicative of goals-of-care documentation, such as the term “goals of care” (Appendix II); and, (3) hybrid models, which concatenate features from BOW- and rule-based approaches.27,47 Each featurization approach was then used to train and test supervised ML classifiers using logistic regression with L2 regularization, in a leave-one-group-out cross-validation procedure (LOGO-CV) detailed below.48 Within each LOGO-CV fold, we used five-fold cross-validation to optimize the regularization parameter over a grid search toward maximal area under the precision-recall curve.49 Each leave-one-group-out model was then refit to the entirety of its available training data with the optimized hyperparameter value and used to predict the held-out patient’s data. NLP preprocessing and term-weighting were performed using the Natural Language Toolkit v3.2.4 (NLTK Project, nltk.org).
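To make the featurization and model-fitting steps concrete, the sketch below illustrates one way the three approaches could be assembled with scikit-learn. It is an approximation under stated assumptions, not the study's exact pipeline: the rule patterns shown are hypothetical placeholders for the expert-defined patterns in Appendix II, and average precision is used as a stand-in for the area under the precision-recall curve.

```python
# Illustrative sketch of the three featurization approaches (not the study's
# exact code); RULE_PATTERNS are hypothetical stand-ins for Appendix II.
import re
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

RULE_PATTERNS = [r"goals of care", r"goals-of-care", r"family meeting"]

def rule_features(notes):
    """Binary indicator for each expert-defined pattern, per note."""
    feats = [[1 if re.search(p, t, re.IGNORECASE) else 0 for p in RULE_PATTERNS]
             for t in notes]
    return csr_matrix(np.array(feats))

def featurize(train_notes, test_notes, approach="hybrid"):
    """Return (X_train, X_test) for BOW, rule-based, or hybrid features."""
    vec = TfidfVectorizer()                      # tf-idf weighted bag of words
    bow_tr, bow_te = vec.fit_transform(train_notes), vec.transform(test_notes)
    rules_tr, rules_te = rule_features(train_notes), rule_features(test_notes)
    if approach == "bow":
        return bow_tr, bow_te
    if approach == "rules":
        return rules_tr, rules_te
    return hstack([bow_tr, rules_tr]), hstack([bow_te, rules_te])

def fit_classifier(X_train, y_train):
    """L2-regularized logistic regression; inner 5-fold grid search on AUCPR."""
    grid = GridSearchCV(
        LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1, 10, 100]},
        scoring="average_precision",             # proxy for area under PR curve
        cv=5,
    )
    grid.fit(X_train, y_train)                   # refits best model on all folds
    return grid.best_estimator_
```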
Quantitative Analysis
NLP models for classifying individual clinical notes were trained and tested using leave-one-group-out cross-validation (LOGO-CV).48 In this approach, 150 unique ML classifiers were trained on notes from 149 patients and tested on all notes from the patient left out of the training set. Machine-predicted classification scores for each patient-test set were pooled across all 150 folds to facilitate performance analyses. Pooled classification scores were compared against the human-abstracted gold standard to estimate sensitivity (recall), specificity, positive predictive value (PPV; precision), negative predictive value (NPV), area under the receiver operating characteristic curve (AUCROC), and area under the precision-recall curve (AUCPR). Although conventional diagnostic tests are often evaluated using the AUCROC (which summarizes sensitivity and specificity across the range of discrimination thresholds), tests designed to identify rare outcomes require particularly high specificity to achieve useful PPV.50 For such tests, the informativeness of the AUCROC may be suboptimal, as much of the area measured represents discrimination thresholds of no practical utility. Accordingly, we report and analyze PPV (precision), sensitivity (recall), and area under the precision-recall curve (AUCPR) for each approach.49,51,52
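The evaluation workflow can likewise be sketched as follows, reusing the featurize and fit_classifier helpers from the sketch above; this is an illustration of the described procedure rather than the study's code.

```python
# Sketch of the leave-one-group-out evaluation: held-out predictions are pooled
# across all patient folds before ROC and PR summaries are computed.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score, average_precision_score

def logo_cv_pooled_scores(notes, labels, groups, approach="hybrid"):
    """notes/labels/groups are parallel arrays; groups holds the patient ID."""
    notes, labels, groups = map(np.asarray, (notes, labels, groups))
    pooled_scores = np.empty(len(notes))
    for train_idx, test_idx in LeaveOneGroupOut().split(notes, labels, groups):
        X_tr, X_te = featurize(notes[train_idx], notes[test_idx], approach)
        clf = fit_classifier(X_tr, labels[train_idx])
        pooled_scores[test_idx] = clf.predict_proba(X_te)[:, 1]
    return pooled_scores

# Pooled predictions are then compared against the human-abstracted gold standard:
# scores = logo_cv_pooled_scores(notes, labels, patient_ids)
# print(roc_auc_score(labels, scores), average_precision_score(labels, scores))
```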
To evaluate performance at the level of the patient-hospitalization (in which each patient is classified as having, or not having, a documented goals-of-care discussion during the index hospitalization), we compared the union of each patient’s note-level ML predictions to the union of their human-abstracted note classifications in a similar fashion.
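A minimal sketch of this patient-level aggregation, assuming pooled note-level scores and patient identifiers as inputs (the pandas usage here is illustrative, not the study's implementation):

```python
# Illustrative patient-level aggregation: a hospitalization is positive if any
# of its notes is positive; continuous scores use the within-patient maximum.
import pandas as pd

def aggregate_to_patient(scores, labels, patient_ids):
    df = pd.DataFrame({"patient": patient_ids, "score": scores, "label": labels})
    return df.groupby("patient").agg(score=("score", "max"), label=("label", "max"))

# patient_df = aggregate_to_patient(pooled_scores, note_labels, patient_ids)
# Patient-level ROC/PR analyses then proceed as in the note-level evaluation.
```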
Machine learning was implemented using the scikit-learn library v0.19.0 (scikit-learn developers, scikit-learn.org).53 Receiver operating characteristic (ROC) and precision-recall (PR) analyses were conducted using Stata v17.0 (StataCorp, stata.com) and the prcurve package.49
Qualitative Analysis
For qualitative analysis of machine-misclassified notes, we reviewed notes misclassified by rule-based and BOW approaches. We did not examine notes misclassified by the hybrid approach because it combines features, and likely error patterns, of the other two approaches. We purposively sampled a corpus of 60 misclassified notes (30 false-positive and 30 false-negative), ensuring diversity in misclassifying modeling approach (BOW vs. rule-based vs. both), note type, and number of unique patients represented. We also examined 10 true-positive notes correctly classified by both approaches for comparison. To generate hypotheses about misclassification, we used content analysis to examine potentially explanatory note characteristics and linguistic features, using investigator triangulation to ensure trustworthiness.54,55 Based on the underlying theory of existing frameworks in serious illness communication1,5,33,41–44,56,57 augmented by immersive review of an initial corpus of 10 notes, investigators (AU, RYL, JRC) created an initial set of structured codes representing known aspects of documented goals-of-care discussions and the semantics used to document them; these codes were then iteratively revised through a process of consensus into a coding framework (Appendix III) that was applied to the entire corpus of notes. Coded data were then iteratively reviewed alongside note characteristics and NLP predictions to generate hypotheses underlying performance differences between modeling approaches. A subset of 18 notes (25.7%) was independently co-coded by another investigator (JT) to assess reproducibility of the codes; agreement exceeded 90%. Disagreements were resolved through consensus.
RESULTS:
Table 1 summarizes the baseline characteristics of the 150 hospitalized patients enrolled in this study. Patients had a median age of 61 years; 56% were male, and 71% were non-Hispanic and white. Using automated EHR queries, we amassed a dataset of 4,391 clinical notes corresponding to each patient’s entire enrollment hospitalization. Among these 4,391 notes, 99 notes (2.3%) for 32 patients contained documented goals-of-care discussions identified by manual review.
Table 1.
Baseline characteristics of prospective cohort from which data was collected
| Characteristic | Total sample (N=150) |
|---|---|
| Patient age, years, median (IQR) | 61 (16) |
| Patient sex, n (%) | |
| Male | 84 (56) |
| Female | 66 (44) |
| Race/ethnicity, n (%) | (n=149) |
| White, non-Hispanic | 107 (71) |
| White, Hispanic | 9 (6) |
| Black | 19 (13) |
| Asian | 2 (1) |
| Native American | 2 (1) |
| Mixed race/ethnicity | 10 (7) |
| No. of chronic life-limiting illnesses, n (%) | |
| 0 a | 10 (7) |
| 1 | 64 (43) |
| 2 | 34 (23) |
| 3 | 21 (14) |
| ≥ 4 | 21 (14) |
| Chronic life-limiting illnesses (not mutually exclusive), n (%) | |
| Cancer with poor prognosis | 25 (17) |
| Chronic lung disease | 46 (31) |
| Coronary artery disease | 43 (29) |
| Congestive heart failure | 52 (35) |
| Peripheral vascular disease | 21 (14) |
| Chronic kidney disease, moderate-to-severe | 56 (37) |
| Chronic liver disease, severe | 25 (17) |
| Diabetes with end-organ damage | 21 (14) |
| Dementia | 6 (4) |
Abbreviations: IQR, interquartile range.
a Patients greater than 80 years of age were included in the study regardless of presence or absence of a chronic life-limiting illness.
Quantitative Analysis
Table 2a presents note-level performance metrics for BOW, hybrid, and rule-based modeling approaches as estimated by LOGO-CV at a nominal sensitivity (recall) of 70%. In ROC curve analysis, all three models discriminated between notes with and without goals-of-care content (AUCROC: BOW, 0.907; rule-based, 0.948; hybrid, 0.965; Figure 1a). However, analysis of PPV (precision) and sensitivity (recall) demonstrated a marked trade-off between these two performance characteristics (Figure 1a). At thresholds favoring PPV over sensitivity, BOW demonstrated greater PPV than rule-based or hybrid approaches, whereas at thresholds favoring sensitivity over PPV, hybrid and rule-based approaches demonstrated greater PPV than the BOW approach. The area under the precision-recall curve (AUCPR), which summarizes PPV (precision) and sensitivity (recall) across discrimination thresholds,58,59 was modestly higher for the hybrid approach (0.599) than the BOW (0.505) and rule-based (0.579) approaches.
Table 2.
Performance characteristics of NLP/ML by modeling approaches
(a) Note-level classifiers, N = 4,391 (99 positive)

| Modeling approach | AUCROC a | AUCPR b | Sensitivity [Recall] c | Specificity c | PPV [Precision] c | NPV c |
|---|---|---|---|---|---|---|
| BOW | 0.907 | 0.505 | 0.697 | 0.917 | 0.162 | 0.992 |
| Hybrid | 0.965 | 0.599 | 0.697 | 0.983 | 0.493 | 0.993 |
| Rules | 0.948 | 0.579 | 0.707 | 0.984 | 0.504 | 0.993 |

(b) Patient-level classifiers, N = 150 (32 positive)

| Modeling approach | AUCROC a | AUCPR b | Sensitivity [Recall] c | Specificity c | PPV [Precision] c | NPV c |
|---|---|---|---|---|---|---|
| BOW | 0.927 | 0.817 | 0.688 | 0.949 | 0.786 | 0.918 |
| Hybrid | 0.932 | 0.796 | 0.688 | 0.915 | 0.688 | 0.915 |
| Rules | 0.903 | 0.768 | 0.688 | 0.932 | 0.733 | 0.917 |

Abbreviations: BOW, bag-of-words; AUC, area under the curve; ROC, receiver operating characteristic curve; PR, precision-recall curve; PPV, positive predictive value; NPV, negative predictive value.
a AUCROC is calculated using non-parametric analysis of pooled LOGO-CV classifier predictions.
b AUCPR is numerically calculated from pooled LOGO-CV classifier predictions using cubic splines.49
c Sensitivity [recall], specificity, PPV [precision], and NPV vary over discrimination thresholds (see Figure 1). Performance metrics are shown for the observed discrimination threshold at which observed sensitivity [recall] is closest to the nominal value of 70%; additional values at other thresholds are presented in Supplementary Table S1.
Figure 1.

Receiver operating characteristic and precision-recall curves, by modeling approach.
Table 2b presents performance metrics for patient-level classifiers defined by the union of a patient’s note-level predictions. In examining modeling approaches at the patient level, the three approaches demonstrated comparable performance (Figure 1b).
Qualitative Analysis
We assembled a purposive sample of 70 notes, representing 34 patients. Of these 70 notes, 43 were primarily authored by a resident physician or fellow, 20 by an attending physician, and 7 by a nurse practitioner or physician assistant. Among the 40 sampled notes with a documented goals-of-care discussion, 17 commented on specific treatment preferences, 8 commented on overarching values and goals for medical care, and 5 commented on a patient’s global understanding of their illness; 13 contained a new referral to subspecialty palliative care, hospice, or comfort care, and 7 indicated discussion or completion of a new advance care planning document. Among the 30 sampled notes that lacked a documented goals-of-care discussion (i.e. notes misclassified by NLP as false-positive), 13 notes referenced a prior goals-of-care discussion and 11 notes referenced a planned goals-of-care discussion that had not yet occurred.
Among notes containing goals-of-care documentation that were missed by one or both NLP modeling approaches, four patterns potentially underlying misclassification emerged from our review (Figure 2a):
Figure 2.

Qualitative features of NLP-misclassified notes.
Values and goals are a key signal for BOW- and rule-based identification of goals-of-care discussions.
Although many documented goals-of-care discussions in our dataset contained clear descriptions of patients’ values and goals, there were others in which documentation focused on exchange of prognostic information and decisions regarding preferences for life-sustaining treatments or strategies. Both BOW and rule-based modeling approaches misclassified many notes that omitted language describing the values and goals that would ideally inform such treatment decisions.
“Long discussion with patient at bedside. He had many appropriate questions. We discussed plan of care and estimated trajectory. Ultimately, he is concerned that he will not get better and that there is nothing else we can do to help him. I provided education and discussed options like [treatment] should that become necessary. He was reassured by this.” (Hematology-oncology progress note, false-negative by both BOW and rule-based models)
Discussions about advance care planning documents are difficult to capture by BOW- and rule-based NLP.
We observed in our data that goals-of-care discussions that centered on advance care planning documents, such as Physician Orders for Life Sustaining Treatment (POLST), were difficult to identify by both BOW and rule-based approaches.
“POLST filled out in conversation with patient and her friend. Confirmed DNR, limited additional interventions, antibiotics when life can be extended, and long-term feeding tube if needed.” (Medicine progress note, false-negative by both BOW and rule-based models)
Although terms indicative of advance care planning documents were represented in our rule-based and hybrid models (Appendix II), we observed that terms such as “POLST” (Physician Orders for Life-Sustaining Treatment), “advance directive”, and “DPOA” (Durable Power of Attorney) were frequently present in EHR passages that were not intended to document goals-of-care discussions. Often, these terms occurred in standardized checklists at the end of daily progress notes, such as the following example:
“VTE prophylaxis:
Risk Level: at risk
Treatment(s) ordered: enoxaparin
Contacts: [name] DPOA, sister: [phone number]
Code status: DNR. Will need POLST on discharge to SNF”
(Medicine progress note, true-negative by both BOW and rule-based models)
Widespread use of these terms in the EHR may have limited their specificity in identifying documented goals-of-care discussions.
BOW models miss goals-of-care content embedded within lengthy records.
Many goals-of-care discussions were documented in interim and discharge summaries, presumably to assist in communication with other healthcare providers. Although this type of documentation was often quite detailed and included key phrases such as “goals of care”, BOW models often missed such content when it was embedded in lengthy records dominated by non-goals-of-care content. Our rule-based models were less susceptible to this type of misclassification. In the following example, the goals-of-care discussion comprises approximately 3% of the words in a 3,600-word note.
“Goals of care: A GOC meeting was held with [patient], her partner of 30 years, and the renal fellow. [Patient] expressed a desire to maintain a life-prolonging approach to care. She would like to spend as much time as possible with her partner... They enjoy being together at home, playing games and watching TV. She also asked that we prioritize pain management.” (Medicine interim summary, false-negative by BOW models, true-positive by rule-based models)
Our rule-based models had difficulty identifying code status discussions substantiated by values or goals.
In our data, there were many instances of documentation focused closely on treatment preferences for cardiopulmonary arrest, often isolated from the acute clinical context. These discussions reflect the current standard of confirming patients’ code status upon admission, but the extent to which their documentation is substantiated by the patient’s underlying values and goals is highly variable. As these discussions were rarely labeled in the EHR as “goals of care” or “family meetings” by note authors, rule-based models often misclassified such notes as false-negative.
“She has done extensive thinking about intubation and does not want to be intubated if her respiratory status worsens, even it if means dying. She wishes to do everything possible to prevent getting to that point including escalating respiratory support and antibiotics. If she worsens despite maximal respiratory and medical support, she wishes to be made comfortable and allowed to die peacefully.” (Geriatrics consult note, false-negative by rule-based models, true-positive by BOW models)
Among notes misclassified as false-positive by one or both NLP approaches, two patterns emerged in our qualitative review that applied to both modeling approaches (Figure 2b):
References to the past or the future are hard to distinguish using BOW- or rule-based NLP.
Because the evaluated modeling approaches are largely naïve to grammar, both the BOW and rule-based models over-called many planned goals-of-care discussions and references to previous discussions.
“Planning to have meeting with the patient and her partner on [date]. Will discuss patient goals, trajectory, and ongoing concerns.” (Medicine progress note, false-positive by both BOW and rule-based models)
BOW- and rule-based NLP misidentified related constructs that do not pertain to overarching goals of care.
Both BOW and rule-based models over-called notes that documented closely related concepts such as shared decision-making or trade-offs between treatments and side effects.
“Thinks his pain is under adequate control today and continues to voice his preference for enough medication to allow him to feel comfortable enough to go for walks, but not so much that he is ‘dopey’.” (Palliative Care progress note, false-positive by both BOW and rule-based models)
DISCUSSION:
Reliable and automated identification of goals-of-care discussions documented in the EHR may provide a powerful tool for research and quality improvement. Through this mixed-methods comparison of three NLP/ML approaches, we identified salient differences in performance and generated hypotheses about the patterns underlying these differences. One of the key reasons why NLP is potentially useful for measuring documented goals-of-care discussions is that such documentation is very infrequent in the EHR, which makes manual review inefficient and automated detection especially valuable. In our real-world sample of 4,391 clinical notes collected from hospitalized patients with serious illness, only 99 (2.3%) contained goals-of-care content by manual review. The rarity of goals-of-care content in the EHR poses unique challenges for the development and evaluation of NLP algorithms designed to measure it.60 Our quantitative findings are also an illustrative example of how the widely recognized metric of AUCROC may not fully represent real-world performance of classifiers designed to identify rare outcomes.49,51,52
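As a worked illustration of this point, using the note-level BOW operating point from Table 2a (sensitivity 0.697, specificity 0.917) and the observed prevalence of 99 positive notes out of 4,391, the expected PPV follows from Bayes' theorem:

$$\mathrm{PPV} = \frac{\mathrm{sensitivity} \times p}{\mathrm{sensitivity} \times p + (1-\mathrm{specificity})(1-p)} = \frac{0.697 \times \tfrac{99}{4391}}{0.697 \times \tfrac{99}{4391} + 0.083 \times \left(1-\tfrac{99}{4391}\right)} \approx 0.162$$

That is, even a classifier with an AUCROC above 0.9 yields a precision of roughly 16% at this prevalence, consistent with the note-level values reported in Table 2a.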
Our study does not prove the superiority of any single modeling approach that we examined. However, our observation that hybrid models had the highest note-level AUCROC and AUCPR supports ongoing efforts to further develop and optimize approaches that combine machine-derived features and expert-defined rules.27 When compared to rule-based methods, NLP models that use machine-derived features have the theoretical advantage of being able to identify predictive features that extend beyond the imagination of human experts, thereby improving performance. However, training data for rare constructs such as documented goals-of-care discussions is both sparse and costly to acquire. In these situations, a hybrid approach that combines machine-derived features with expert-defined rules allows the designer to specify heuristics that can improve both performance and generalizability. For these reasons, hybrid NLP modeling approaches are becoming increasingly common in the biomedical sciences.61
We qualitatively identified several patterns among notes misclassified by the rule-based and BOW modeling approaches, which suggest potential strategies for improving performance. Both BOW and rule-based approaches missed goals-of-care discussions that lacked clear documentation of patients’ values and goals. As many such discussions centered around exchange of prognostic information or preferences about life-sustaining treatments, incorporation of additional linguistic features into rule-based and hybrid approaches (e.g. words indicative of life-sustaining treatments located within contexts suggestive of patient/family communication) may help capture documentation of these discussions. Both BOW and rule-based modeling approaches also missed discussions about advance care planning documents completed in the hospital, as many inpatient notes contained templated, context-free references to such documents (e.g. “POLST 12/14/10”) without indication of an incident goals-of-care discussion. Ideally, this would be addressed by the creation of EHR structures or registries that capture incident advance care planning documents, obviating the need for NLP to measure this outcome. BOW models also frequently missed goals-of-care discussions embedded within lengthy notes, which demonstrates the limitations of classifying EHR text using note-level word frequencies. This deficiency may be addressed by changing the unit of analysis: analyzing text at the paragraph, sentence, or phrase level may facilitate accurate detection of clustered positive predictors within lengthy records.
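As a rough sketch of that finer-grained strategy (not evaluated in this study), a note-level score could be taken as the maximum over paragraph-level scores, assuming a fitted tf-idf vectorizer and classifier like those used above; the blank-line splitting heuristic is an assumption for illustration only.

```python
# Rough sketch of paragraph-level scoring (assumed splitting heuristic): a note
# is scored by the maximum predicted probability among its paragraphs, so
# goals-of-care content buried in a long note is less easily diluted.
def paragraph_max_score(note_text, vectorizer, clf):
    paragraphs = [p for p in note_text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    X = vectorizer.transform(paragraphs)          # tf-idf features per paragraph
    return float(clf.predict_proba(X)[:, 1].max())
```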
It was reassuring that many of the notes in the false-positive sample contained documentation of constructs related to goals-of-care, even if this content did not meet our outcome definition. Additionally, while both modeling approaches often misclassified references to past and future goals-of-care discussions, this type of misclassification may be less problematic in applications where outcomes are aggregated over longer periods of time. Notably, several notes across both the false-negative and false-positive samples contained goals-of-care content that was difficult for human reviewers to reliably characterize, even through a process of consensus. This finding highlights the inherent challenges in dichotomizing the presence or absence of goals-of-care discussions in the EHR, as actual goals-of-care conversations are multifaceted and vary over dimensions of context, timing, depth, content, and execution.
Although the methods we evaluated are common and accessible to many clinical NLP developers,26,27 more sophisticated NLP approaches such as deep learning may address many of the challenges we describe. In deep learning-based NLP, words are represented mathematically as embeddings sensitive to position, syntax, and semantics (all aspects that are only partially representable in rule-based NLP, and not represented in BOW), and models are constructed using multilayer neural network architectures that rely on fewer mathematical assumptions than traditional ML methods.62,63 Deep learning-based NLP has been deployed in a small number of studies that measure palliative care outcomes, including work using a broader definition of documented serious illness conversations that encompasses standalone documentation of code status.29,35,64 We believe that the parallels between several of our key findings and the theoretical advantages of deep learning support exploration of deep learning-based NLP for this task. However, because adopting deep learning-based approaches incurs substantial costs, including larger quantities of training data and more extensive computational resources (most notably model training time), we also believe that future studies should quantitatively compare the performance, generalizability, and scalability of deep learning approaches to traditional NLP/ML approaches for identifying EHR documentation of goals-of-care discussions.63
Our study has several limitations. First, our dataset of over 4,000 notes contained only 99 goals-of-care discussions for 32 patients, which aligns with previous studies showing a low prevalence of goals-of-care discussions in hospitalized patients with serious illness.13,34,65 This small sample of positive notes may not reflect the diversity of goals-of-care documentation in the EHR, which may threaten generalizability. The small sample of positive notes also precluded testing of a single trained ML classifier on a held-out test set. While the LOGO-CV approach is well-accepted, predictions pooled from distinct LOGO-CV classifiers may lead to biased estimates of performance.66 Second, our sample of notes was derived from participants enrolled in a clinical trial at two hospitals in a single academic health system. Because the prevalence and content of goals-of-care documentation differs across patient populations and health systems, our findings may not generalize to other patient populations or health systems. This challenge is shared by many clinical NLP applications and reinforces the need both for external validation and for novel methods that address bias and facilitate generalization of NLP algorithms to settings beyond the training data.67,68 Third, our findings are specific to the NLP methods we evaluated, and may not generalize to other NLP featurization or modeling methods. Finally, while our qualitative analysis revealed potential etiologies for differences in performance between NLP modeling approaches, our findings are hypothesis-generating and should be confirmed by further study.
CONCLUSION:
Through this mixed-methods comparison of three NLP modeling approaches to identify EHR documentation of goals-of-care discussions, we observed differences in performance and generated hypotheses about misclassification patterns underlying these differences. Our quantitative findings suggest that future efforts using NLP to identify rare outcomes should continue to explore hybrid approaches, which may have advantages over BOW and rule-based approaches, particularly in smaller datasets. Our qualitative findings support incorporation of additional linguistic features that identify information exchange and treatment preferences, and suggest that refining the unit of analysis may lead to performance gains. Our findings also support exploration of deep learning-based NLP strategies to improve performance. Finally, our findings highlight the inherently multi-dimensional nature of goals-of-care discussions. We believe that NLP continues to represent a promising approach towards measuring documented goals-of-care discussions as a quality metric and research outcome.
Supplementary Material
KEY MESSAGE:
This study suggests NLP has potential for measuring EHR documentation of goals-of-care discussions for quality improvement and research. We use mixed methods to generate hypotheses for how to further optimize NLP modeling approaches toward this task.
ACKNOWLEDGEMENTS:
This work was supported by the National Palliative Care Research Center, the National Institutes of Health (HL137940, AG062441), and the Cambia Health Foundation. Additionally, infrastructure support was provided by the University of Washington Institute of Translational Health Sciences (ITHS), which is funded by the National Center for Advancing Translational Sciences (UL1 TR002319). The funding sources had no role in study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication. There are no conflicts of interest from any of the authors.
Footnotes
DISCLOSURES:
The authors declare no financial conflicts of interest.
REFERENCES:
- 1. Secunda K, Wirpsa MJ, Neely KJ, et al. Use and Meaning of “Goals of Care” in the Healthcare Literature: a Systematic Review and Qualitative Discourse Analysis. J Gen Intern Med. 2019;35(5):1559–1566.
- 2. Sinuff T, Dodek P, You JJ, et al. Improving End-of-Life Communication and Decision Making: The Development of a Conceptual Framework and Quality Indicators. J Pain Symptom Manage. 2015;49(6):1070–1080.
- 3. Ferrell BR, Twaddle ML, Melnick A, Meier DE. National Consensus Project Clinical Practice Guidelines for Quality Palliative Care Guidelines, 4th Edition. J Palliat Med. 2018;21(12):1684–1689.
- 4. Wright AA, Zhang B, Ray A, et al. Associations between end-of-life discussions, patient mental health, medical care near death, and caregiver bereavement adjustment. JAMA. 2008;300(14):1665–1673.
- 5. Bernacki RE, Block SD, American College of Physicians High Value Care Task Force. Communication about serious illness care goals: a review and synthesis of best practices. JAMA Intern Med. 2014;174(12):1994–2003.
- 6. Detering KM, Hancock AD, Reade MC, Silvester W. The impact of advance care planning on end of life care in elderly patients: randomised controlled trial. BMJ. 2010;340:c1345.
- 7. Modes ME, Engelberg RA, Downey L, Nielsen EL, Curtis JR, Kross EK. Did a Goals-of-Care Discussion Happen? Differences in the Occurrence of Goals-of-Care Discussions as Reported by Patients, Clinicians, and in the Electronic Health Record. J Pain Symptom Manage. 2019;57(2):251–259.
- 8. Sanders JJ, Curtis JR, Tulsky JA. Achieving Goal-Concordant Care: A Conceptual Model and Approach to Measuring Serious Illness Communication and Its Impact. J Palliat Med. 2018;21(S2):S17–S27.
- 9. Zhang B, Wright AA, Huskamp HA, et al. Health care costs in the last week of life: associations with end-of-life conversations. Arch Intern Med. 2009;169(5):480–488.
- 10. Curtis JR, Treece PD, Nielsen EL, et al. Randomized Trial of Communication Facilitators to Reduce Family Distress and Intensity of End-of-Life Care. Am J Respir Crit Care Med. 2016;193(2):154–162.
- 11. Heyland DK, Dodek P, You JJ, et al. Validation of quality indicators for end-of-life communication: results of a multicentre survey. CMAJ. 2017;189(30):E980–E989.
- 12. Knauft E, Nielsen EL, Engelberg RA, Patrick DL, Curtis JR. Barriers and facilitators to end-of-life care communication for patients with COPD. Chest. 2005;127(6):2188–2196.
- 13. Orford NR, Milnes SL, Lambert N, et al. Prevalence, goals of care and long-term outcomes of patients with life-limiting illness referred to a tertiary ICU. Crit Care Resusc. 2016;18(3):181–188.
- 14. Morrison RS, Meier DE, Arnold RM. What’s Wrong With Advance Care Planning? JAMA. 2021;326(16):1575–1576.
- 15. Curtis JR. Three Stories About the Value of Advance Care Planning. JAMA. 2021;326(21):2133–2134.
- 16. Tulsky JA, Beach MC, Butow PN, et al. A Research Agenda for Communication Between Health Care Professionals and Patients Living With Serious Illness. JAMA Intern Med. 2017;177(9):1361–1366.
- 17. Huber MT, Highland JD, Krishnamoorthi VR, Tang JW. Utilizing the Electronic Health Record to Improve Advance Care Planning: A Systematic Review. Am J Hosp Palliat Care. 2018;35(3):532–541.
- 18. Esfahani S, Yi C, Madani CA, Davidson JE, Edmonds KP, Wynn S. Exploiting Technology to Popularize Goals-of-Care Conversations and Advance Care Planning. Crit Care Nurse. 2020;40(4):32–41.
- 19. Curtis JR, Sathitratanacheewin S, Starks H, et al. Using Electronic Health Records for Quality Measurement and Accountability in Care of the Seriously Ill: Opportunities and Challenges. J Palliat Med. 2018;21(S2):S52–S60.
- 20. Wilson CJ, Newman J, Tapper S, et al. Multiple locations of advance care planning documentation in an electronic health record: are they easy to find? J Palliat Med. 2013;16(9):1089–1094.
- 21. Lamas D, Panariello N, Henrich N, et al. Advance Care Planning Documentation in Electronic Health Records: Current Challenges and Recommendations for Change. J Palliat Med. 2018;21(4):522–528.
- 22. Vassar M, Holzmann M. The retrospective chart review: important methodological considerations. J Educ Eval Health Prof. 2013;10:12.
- 23. Liddy C, Wiens M, Hogg W. Methods to achieve high interrater reliability in data collection from primary care medical records. Ann Fam Med. 2011;9(1):57–62.
- 24. Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc. 2011;18(5):544–551.
- 25. Jiang F, Jiang Y, Zhi H, et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol. 2017;2(4):230–243.
- 26. Wang Y, Wang L, Rastegar-Mojarad M, et al. Clinical information extraction applications: A literature review. J Biomed Inform. 2018;77:34–49.
- 27. Kreimeyer K, Foster M, Pandey A, et al. Natural language processing systems for capturing and standardizing unstructured clinical information: A systematic review. J Biomed Inform. 2017;73:14–29.
- 28. Lindvall C, Lilley EJ, Zupanc SN, et al. Natural Language Processing to Assess End-of-Life Quality Indicators in Cancer Patients Receiving Palliative Surgery. J Palliat Med. 2019;22(2):183–187.
- 29. Udelsman BV, Moseley ET, Sudore RL, Keating NL, Lindvall C. Deep Natural Language Processing Identifies Variation in Care Preference Documentation. J Pain Symptom Manage. 2020;59(6):1186–1194 e1183.
- 30. Lee RY, Brumback LC, Lober WB, et al. Identifying Goals of Care Conversations in the Electronic Health Record Using Natural Language Processing and Machine Learning. J Pain Symptom Manage. 2021;61(1):136–142 e132.
- 31. Project to Improve Communication About Serious Illness—Pilot Study (PICSI-P). ClinicalTrials.gov identifier: NCT03746392. https://clinicaltrials.gov/ct2/show/NCT03746392. Accessed November 3, 2020.
- 32. Dartmouth Institute for Health Policy and Clinical Practice. Crosswalk File of ICD9 Diagnosis Codes to Risk Group Assessment. http://archive.dartmouthatlas.org/downloads/methods/Chronic_Disease_Codes.pdf. Published 2015. Updated Apr 1, 2015. Accessed Aug 24, 2016.
- 33. Lee RY, Okimoto K, Treece PD, et al. Chart Abstractor Codebook for Project to Improve Communication in Serious Illness - Pilot Study (PICSI-P). Cambia Palliative Care Center of Excellence at UW Medicine. https://faculty.washington.edu/rlee06/picsi-p-public/PICSI-P-Abstractor-Codebook.pdf. Published 2020. Updated May 18, 2020. Accessed Jan 29, 2021.
- 34. Curtis JR, Downey L, Back AL, et al. Effect of a Patient and Clinician Communication-Priming Intervention on Patient-Reported Goals-of-Care Discussions Between Patients With Serious Illness and Clinicians: A Randomized Clinical Trial. JAMA Intern Med. 2018;178(7):930–940.
- 35. Chan A, Chien I, Moseley E, et al. Deep learning algorithms to identify documentation of serious illness conversations during intensive care unit admissions. Palliat Med. 2019;33(2):187–196.
- 36. Walling AM, Asch SM, Lorenz KA, et al. The quality of care provided to hospitalized patients at the end of life. Arch Intern Med. 2010;170(12):1057–1063.
- 37. Ahluwalia SC, Tisnado DM, Walling AM, et al. Association of Early Patient-Physician Care Planning Discussions and End-of-Life Care Intensity in Advanced Cancer. J Palliat Med. 2015;18(10):834–841.
- 38. Thurston A, Wayne DB, Feinglass J, Sharma RK. Documentation quality of inpatient code status discussions. J Pain Symptom Manage. 2014;48(4):632–638.
- 39. Gehlbach TG, Shinkunas LA, Forman-Hoffman VL, Thomas KW, Schmidt GA, Kaldjian LC. Code status orders and goals of care in the medical ICU. Chest. 2011;139(4):802–809.
- 40. van Dyck LI, Fried TR. Prognostic information, goals of care, and code status decision-making among older patients. J Am Geriatr Soc. 2021;69(7):2025–2028.
- 41. Scheunemann LP, Arnold RM, White DB. The facilitated values history: helping surrogates make authentic decisions for incapacitated patients with advanced illness. Am J Respir Crit Care Med. 2012;186(6):480–486.
- 42. Childers JW, Back AL, Tulsky JA, Arnold RM. REMAP: A Framework for Goals of Care Conversations. J Oncol Pract. 2017;13(10):e844–e850.
- 43. Charles C, Gafni A, Whelan T. Decision-making in the physician-patient encounter: revisiting the shared treatment decision-making model. Soc Sci Med. 1999;49(5):651–661.
- 44. White DB, Braddock CH 3rd, Bereknyei S, Curtis JR. Toward shared decision making at the end of life in intensive care units: opportunities for improvement. Arch Intern Med. 2007;167(5):461–467.
- 45. Zhou W, Wang H, Sun H, Sun T. A Method of Short Text Representation Based on the Feature Probability Embedded Vector. Sensors (Basel). 2019;19(17).
- 46. Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation. 1972;28(1):11–21.
- 47. Koopman B, Zuccon G, Nguyen A, Bergheim A, Grayson N. Extracting cancer mortality statistics from death certificates: A hybrid machine learning and rule-based approach for common and rare cancers. Artif Intell Med. 2018;89:1–9.
- 48. Wong T-T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition. 2015;48(9):2839–2846.
- 49. Cook J, Ramadas V. When to consult precision-recall curves. The Stata Journal: Promoting communications on statistics and Stata. 2020;20(1):131–148.
- 50. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. Philadelphia: Lippincott Williams & Wilkins; 2008.
- 51. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432.
- 52. Ozenne B, Subtil F, Maucort-Boulch D. The precision-recall curve overcame the optimism of the receiver operating characteristic curve in rare diseases. J Clin Epidemiol. 2015;68(8):855–859.
- 53. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
- 54. Hsieh HF, Shannon SE. Three approaches to qualitative content analysis. Qual Health Res. 2005;15(9):1277–1288.
- 55. Archibald MM. Investigator triangulation: a collaborative strategy with potential for mixed methods research. Journal of Mixed Methods Research. 2016;10(3):228–250.
- 56. Sudore RL, Fried TR. Redefining the “planning” in advance care planning: preparing for end-of-life decision making. Ann Intern Med. 2010;153(4):256–261.
- 57. Sudore RL, Lum HD, You JJ, et al. Defining Advance Care Planning for Adults: A Consensus Definition From a Multidisciplinary Delphi Panel. J Pain Symptom Manage. 2017;53(5):821–832 e821.
- 58. Boyd K, Eng KH, Page CD. Area under the precision-recall curve: point estimates and confidence intervals. Joint European Conference on Machine Learning and Knowledge Discovery in Databases; 2013; Prague, Czech Republic.
- 59. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning; June, 2006; Pittsburgh, PA.
- 60. Steiner JM, Morse C, Lee RY, Curtis JR, Engelberg RA. Sensitivity and Specificity of a Machine Learning Algorithm to Identify Goals-of-care Documentation for Adults With Congenital Heart Disease at the End of Life. J Pain Symptom Manage. 2020;60(3):e33–e36.
- 61. Doan S, Conway M, Phuong TM, Ohno-Machado L. Natural Language Processing in Biomedicine: A Unified System Architecture Overview. In: Trent R, ed. Clinical Bioinformatics. New York, NY: Springer New York; 2014:275–294.
- 62. Young T, Hazarika D, Poria S, Cambria E. Recent Trends in Deep Learning Based Natural Language Processing [Review Article]. IEEE Computational Intelligence Magazine. 2018;13(3):55–75.
- 63. Wu S, Roberts K, Datta S, et al. Deep learning in clinical natural language processing: a methodical review. J Am Med Inform Assoc. 2020;27(3):457–470.
- 64. Chien I, Shi A, Chan A, Lindvall C. Identification of Serious Illness Conversations in Unstructured Clinical Notes using Deep Neural Networks. Proceedings of the First Joint Workshop on AI in Health (AIH 2018), organized as part of the Federated AI Meeting (FAIM 2018); July 13–14, 2018; Stockholm, Sweden.
- 65. Hofmann JC, Wenger NS, Davis RB, et al. Patient preferences for communication with physicians about end-of-life decisions. SUPPORT Investigators. Study to Understand Prognoses and Preference for Outcomes and Risks of Treatment. Ann Intern Med. 1997;127(1):1–12.
- 66. Parker BJ, Gunter S, Bedo J. Stratification bias in low signal microarray studies. BMC Bioinformatics. 2007;8:326.
- 67. Howell KP, Barnes MR, Curtis JR, et al. Controlling for confounding variables: accounting for dataset bias in classifying patient-provider interactions. Proceedings of the W3PHIAI International Workshop on Health Intelligence; February, 2020; New York, NY.
- 68. Bender EM, Daumé H III, Ettinger A, Rao S. Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task. Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems (BLGNLP), Empirical Methods in Natural Language Processing (EMNLP 2017); September 8, 2017; Copenhagen, Denmark.