JAMA Netw Open. 2023 Mar 2;6(3):e231204. doi: 10.1001/jamanetworkopen.2023.1204

Assessment of Natural Language Processing of Electronic Health Records to Measure Goals-of-Care Discussions as a Clinical Trial Outcome

Robert Y Lee 1,2, Erin K Kross 1,2, Janaki Torrence 1,2, Kevin S Li 3, James Sibley 1,4, Trevor Cohen 1,3, William B Lober 1,3,4,5, Ruth A Engelberg 1,2, J Randall Curtis 1,2,4,6
PMCID: PMC9982698  PMID: 36862411

This diagnostic study evaluates the performance, feasibility, and power implications of using natural language processing to measure outcomes in a randomized clinical trial of a communication intervention among adults with serious illness.

Key Points

Question

Can natural language processing (NLP) be used to measure clinical trial outcomes?

Findings

In this diagnostic study evaluating the performance, feasibility, and power implications of using deep-learning NLP to measure the outcome of documented goals-of-care discussions in a 2512-patient pragmatic trial, NLP-screened human abstraction measured the outcome with 92.6% sensitivity, substantial savings in abstractor-hours, and minimal loss of power, compared with manual abstraction.

Meaning

The findings suggest that NLP may facilitate measurement of previously inaccessible outcomes in clinical trials and that incorporation of misclassification-adjusted power calculations into the design of studies using NLP may be beneficial.

Abstract

Importance

Many clinical trial outcomes are documented in free-text electronic health records (EHRs), making manual data collection costly and infeasible at scale. Natural language processing (NLP) is a promising approach for measuring such outcomes efficiently, but ignoring NLP-related misclassification may lead to underpowered studies.

Objective

To evaluate the performance, feasibility, and power implications of using NLP to measure the primary outcome of EHR-documented goals-of-care discussions in a pragmatic randomized clinical trial of a communication intervention.

Design, Setting, and Participants

This diagnostic study compared the performance, feasibility, and power implications of measuring EHR-documented goals-of-care discussions using 3 approaches: (1) deep-learning NLP, (2) NLP-screened human abstraction (manual verification of NLP-positive records), and (3) conventional manual abstraction. The study included hospitalized patients aged 55 years or older with serious illness enrolled between April 23, 2020, and March 26, 2021, in a pragmatic randomized clinical trial of a communication intervention in a multihospital US academic health system.

Main Outcomes and Measures

Main outcomes were natural language processing performance characteristics, human abstractor-hours, and misclassification-adjusted statistical power of methods of measuring clinician-documented goals-of-care discussions. We evaluated NLP performance with receiver operating characteristic (ROC) curves and precision-recall (PR) analyses and examined the effects of misclassification on power using mathematical substitution and Monte Carlo simulation.

Results

A total of 2512 trial participants (mean [SD] age, 71.7 [10.8] years; 1456 [58%] female) amassed 44 324 clinical notes during 30-day follow-up. In a validation sample of 159 participants, deep-learning NLP trained on a separate training data set identified patients with documented goals-of-care discussions with moderate accuracy (maximal F1 score, 0.82; area under the ROC curve, 0.924; area under the PR curve, 0.879). Manual abstraction of the outcome from the trial data set would require an estimated 2000 abstractor-hours and would power the trial to detect a risk difference of 5.4% (assuming 33.5% control-arm prevalence, 80% power, and 2-sided α = .05). Measuring the outcome by NLP alone would power the trial to detect a risk difference of 7.6%. Measuring the outcome by NLP-screened human abstraction would require 34.3 abstractor-hours to achieve estimated sensitivity of 92.6% and would power the trial to detect a risk difference of 5.7%. Monte Carlo simulations corroborated misclassification-adjusted power calculations.

Conclusions and Relevance

In this diagnostic study, deep-learning NLP and NLP-screened human abstraction had favorable characteristics for measuring an EHR outcome at scale. Adjusted power calculations accurately quantified power loss from NLP-related misclassification, suggesting that incorporation of this approach into the design of studies using NLP would be beneficial.

Introduction

Natural language processing (NLP) of free-text electronic health records (EHRs) presents rich opportunities for measuring outcomes that would otherwise require costly, laborious medical record abstraction.1,2,3 However, NLP can introduce inaccuracies and misclassify outcomes, particularly when measuring complex constructs.4,5,6,7 Many statistical procedures common in clinical research ignore misclassification, and applying these procedures to imperfectly measured outcomes can lead to underpowered studies and improper estimates.8,9,10 In using NLP for clinical research, it is important to implement robust approaches to address NLP-related misclassification in study design and analysis.9,10

Researchers in the fields of palliative care and serious illness communication have shown interest in using NLP7,11,12,13,14 to measure the occurrence and timing of EHR-documented goals-of-care discussions, an outcome that reflects clinicians’ assessment and documentation of patients’ values, goals, and treatment preferences.15,16 This outcome represents a guideline-recommended best practice,16,17,18,19,20 an area of ongoing deficiencies,21,22,23 and a mediator of delivery of patient-centered care.16,24,25,26 However, goals-of-care discussions are difficult to measure from structured EHR data or claims data, and their rarity within free-text records makes them costly to manually abstract at scale.13,27,28,29,30 Although NLP models have been developed to measure this and related constructs,6,7,12,13,14,31 the linguistic complexity encountered in documented goals-of-care discussions continues to challenge NLP, and there is ongoing interest in refining NLP approaches to improve performance.7,14

In this study, we used deep-learning NLP to measure the primary outcome of EHR-documented goals-of-care discussions in a large pragmatic trial of a communication-priming intervention for hospitalized patients.32,33 We evaluated the sensitivity, specificity, and predictive values of our deep-learning NLP model through receiver operating characteristic (ROC) curves and precision-recall analyses and examined the effect of NLP-related misclassification on study power. We concluded by examining the performance, feasibility, and power implications of an NLP-screened human abstraction34 approach to measuring the primary outcome of the parent trial.

Methods

This diagnostic study was conducted to inform selection of an outcome measurement strategy for a pragmatic randomized clinical trial of a communication-priming intervention for hospitalized patients (Project to Improve Communication About Serious Illness—Hospital Study: Pragmatic Trial 1 [PICSI-H Trial 1]).32,33 All procedures for this study and the parent clinical trial were approved by the University of Washington institutional review board. Patients were enrolled under an institutional review board–approved waiver of informed consent and Health Insurance Portability and Accountability Act authorization due to minimal risk of the intervention, which was designed to promote best practices. Findings of the current study are reported in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline.35

The trial enrolled patients hospitalized at any of 3 study hospitals who either had advanced age (≥80 years) or were aged 55 years or older and had a chronic life-limiting illness as defined by diagnosis codes used by the Dartmouth Atlas Project to study end-of-life care (eTable 1 in Supplement 1).36,37,38 Eligible patients were enrolled under a waiver of informed consent and randomized in a 1:1 ratio to usual care or a clinician-facing prompting intervention designed to promote goals-of-care discussions. Clinicians caring for patients in the intervention arm received an e-mailed patient-specific document (“Jumpstart Guide”; eAppendix 1 in Supplement 1) that suggested possible appropriateness of a goals-of-care discussion and provided communication prompts adapted from the VitalTalk communication training model.39,40 The primary outcome was EHR documentation of a goals-of-care discussion within 30 days of randomization. Goals-of-care discussions were defined as discussion of the overarching aims of medical care for a patient15 and operationalized using a medical record abstraction manual (eAppendix 2 in Supplement 1) adapted from a previous pilot trial.14,41 We did not consider stand-alone code status discussions or citations of past advance care planning documents to be goals-of-care discussions. The outcome was measured in all notes written by inpatient and outpatient clinicians (physicians, residents, fellows, subinterns, nurse practitioners, and physician assistants) from the date of randomization to 30 days thereafter.

During planning, the trial was specified to use NLP to measure its primary outcome, with human abstraction as a backup strategy. However, because NLP approaches were developed concurrently with enrollment, the expected degree of NLP-related misclassification was not known at the time the trial began. The trial was initially specified to target a sample size of 2000 participants, which would result in 80% power to detect a difference in proportions of at least 6.2% (assuming a control arm proportion of 0.54 and 2-sided α of 0.05). However, this sample size determination did not consider the potential effect of NLP-related misclassification. Development, training, and testing of NLP models continued throughout enrollment, using data sources gathered from outside the trial.

Between April 23, 2020, and March 26, 2021, the trial enrolled 2512 patients. The prespecified enrollment target of 2000 was exceeded to increase the number of participants with Alzheimer disease and related dementias (ADRD), a prespecified subgroup. Following conclusion of enrollment and prior to unblinding and primary analyses, we froze our NLP program, evaluated NLP performance within a validation sample collected from the trial, and reevaluated the statistical power, human abstraction burden, and pragmatic implications of 3 strategies for measuring the primary outcome: (1) conventional manual abstraction, (2) NLP alone, and (3) NLP-screened human abstraction, in which only EHR passages scored by NLP above a predefined threshold would be reviewed by human abstractors for documented goals-of-care discussions.34

NLP Development and Model Training

We collected a training data set of 4642 EHR notes from 150 participants in a previous pilot trial of a similar patient- and clinician-facing communication-priming intervention at the same study hospitals (Project to Improve Communication About Serious Illness—Pilot Study [PICSI-P]) (eTable 1 in Supplement 1).41 Using a codebook adapted from that used to measure the PICSI-P trial outcome (eAppendix 2 in Supplement 1), 5 abstractors (including R.Y.L., J.T.) manually reviewed and coded these 4642 notes for documented goals-of-care discussions using the interface of a qualitative data analysis platform (Dedoose).42 Randomization was concealed from abstractors, and instances of disagreement were resolved by consensus.

We used the training data set to train Bio+ClinicalBERT, a publicly available and freely distributed deep-learning NLP model, to predict the presence of documented goals-of-care discussions in EHR text (eMethods in Supplement 1).43,44 BERT is a deep-learning NLP model developed by Google Research that is pretrained on large quantities of unlabeled text to build a foundation of linguistic information, including how words influence one another’s meaning in context, that can then be applied to a variety of NLP tasks.45,46,47 BERT contains a total of 110 million parameters for which values are fitted during pretraining.45 Bio+ClinicalBERT is an instance of BERT that was further pretrained on unlabeled biomedical literature and deidentified medical records from the MIMIC III (Medical Information Mart for Intensive Care) database.43,44,48,49 To identify documented goals-of-care discussions, we used publicly available software interfaces50 to add a classification layer to Bio+ClinicalBERT and trained (“fine-tuned” in BERT parlance) the parameters of the composite model to the training data and its manually abstracted labels for goals-of-care content (Figure 1). As BERT models can only analyze text sequences of limited length (≤512 words or subwords), we used automated algorithms to split each note into component passages compatible with BERT (eMethods in Supplement 1). The resulting fine-tuned model predicts the likelihood of goals-of-care content within any string of candidate text.
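The passage-splitting step can be illustrated with a minimal sketch. The authors' actual algorithm is described in their eMethods and is not reproduced here; the function name `split_into_passages`, the blank-line paragraph rule, and the `max_words` cap are all assumptions made for illustration. Whitespace-separated words are only a rough proxy for BERT's subword tokens, so a real pipeline would count tokens with the model's own tokenizer.

```python
import re


def split_into_passages(note_text, max_words=256):
    """Split a note into passages short enough for a length-limited model.

    Hypothetical rule: break on blank lines first, then cut any paragraph
    that is still too long into consecutive windows of max_words words.
    """
    passages = []
    for paragraph in re.split(r"\n\s*\n", note_text):
        words = paragraph.split()
        # Empty paragraphs produce no windows and are skipped.
        for start in range(0, len(words), max_words):
            passages.append(" ".join(words[start:start + max_words]))
    return passages


if __name__ == "__main__":
    note = "Patient seen today.\n\nDiscussed goals of care with family."
    for passage in split_into_passages(note, max_words=4):
        print(passage)
```

Choosing a word cap well below 512 leaves headroom for words that expand into several subword tokens.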

Figure 1. Training, Prediction, and Validation of the Natural Language Processing (NLP) Model.

EHR indicates electronic health record; GOC, goals of care; PICSI-H Trial 1, Project to Improve Communication About Serious Illness—Hospital Study: Pragmatic Trial (Trial 1)32,33; and PICSI-P, Project to Improve Communication About Serious Illness—Pilot Study.41

Trial Data Set and Validation Sample

Following conclusion of the PICSI-H Trial 1 outcome assessment period, we used automated database queries to collect all EHR notes authored by attending and trainee physicians, subintern medical students, nurse practitioners, and physician assistants between the date of randomization and 30 days thereafter. This yielded a PICSI-H Trial 1 data set of 44 324 notes from 2512 patients (Figure 1 and eTable 2 in Supplement 1). We randomly selected 159 of these trial participants (eTable 1 in Supplement 1) for manual whole medical record abstraction, oversampling for patients with ADRD (80 of 159 [50%]), yielding a validation sample of 2480 notes (Figure 1 and eTable 2 in Supplement 1). A team of 4 abstractors (including J.T.) manually reviewed and coded all records in this validation sample using the same methods used to label the training data set. Randomization was concealed from abstractors, and instances of disagreement were resolved by consensus.

Statistical Analysis

Evaluating NLP Performance

Following model training and manual abstraction of the validation sample, we used the trained BERT NLP model to predict the probability of documented goals-of-care discussions in all 2.64 million EHR passages (44 324 notes) of the PICSI-H Trial 1 data set. To characterize the expected performance of BERT NLP in the trial data set, we resampled the validation sample to reflect the prevalence of enrolled patients with ADRD (11%) and compared NLP-predicted probabilities for each note and patient against manual abstraction using ROC curves and precision-recall analyses. For all analyses, we report note- and patient-level performance, defining note- and patient-level probability as the maximum predicted probability for all constituent passages and defining the gold-standard label for each note and patient as the union of labels for all constituent passages. Statistical analyses were conducted using Stata/MP, version 17.0 (StataCorp LLC)51 with the roctab and prcurve packages.52
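The roll-up from passages to notes and patients described above (maximum predicted probability, union of gold-standard labels) can be sketched as follows. The tuple layout and the function name `aggregate_passages` are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict


def aggregate_passages(passages):
    """Roll passage-level scores and labels up to note and patient level.

    passages: iterable of (patient_id, note_id, probability, label) tuples,
    where label is truthy if the passage was manually coded as positive.
    Returns four dicts: note probabilities, note labels,
    patient probabilities, patient labels.
    """
    note_prob = defaultdict(float)
    note_label = defaultdict(bool)
    patient_prob = defaultdict(float)
    patient_label = defaultdict(bool)
    for patient_id, note_id, prob, label in passages:
        key = (patient_id, note_id)
        # Score of a unit = maximum score over its constituent passages.
        note_prob[key] = max(note_prob[key], prob)
        patient_prob[patient_id] = max(patient_prob[patient_id], prob)
        # Label of a unit = union (logical OR) of its passage labels.
        note_label[key] = note_label[key] or bool(label)
        patient_label[patient_id] = patient_label[patient_id] or bool(label)
    return dict(note_prob), dict(note_label), dict(patient_prob), dict(patient_label)
```

Because a single high-scoring passage marks the whole patient as positive, patient-level specificity at a given threshold tends to be lower than note-level specificity, which matches the pattern in the reported performance table.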

Misclassification-Adjusted Power Calculations

To estimate the detectable risk difference in proportions of patients with documented goals-of-care discussions at a given power in the absence of misclassification, we assumed a control arm prevalence (p1) of 33.5% based on preliminary data21,53 and 2512 patients with 1:1 allocation and calculated the intervention arm proportion (p2) that would maintain 80% power using a Pearson χ2 test with 2-sided α of 0.05.54,55,56 To estimate detectable risk difference in the presence of nondifferential misclassification, we substituted sensitivity (se)– and specificity (sp)–corrected terms representing the observed proportions ($\hat{p}_1$, $\hat{p}_2$) into the same power calculation,10,57 iterating across values of the true intervention arm proportion (p2) to identify the value of actual risk difference (p2 − p1) at which power to detect a difference between $\hat{p}_1$ and $\hat{p}_2$ equaled 80%. The terms for the observed proportions are defined by the following formulae presented by Devine10:

$\hat{p}_1 = se \cdot p_1 + (1 - sp)(1 - p_1)$

$\hat{p}_2 = se \cdot p_2 + (1 - sp)(1 - p_2)$
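The substitution approach can be sketched in a few lines of Python. This is not the authors' Stata code (theirs is in eAppendix 3); it assumes the usual normal approximation for the power of a 2-sided Pearson χ2 test of 2 proportions and finds the detectable risk difference by bisection. All function names are illustrative.

```python
from statistics import NormalDist

_NORMAL = NormalDist()


def observed_proportion(p, se, sp):
    """Observed proportion when a true proportion p is measured with
    sensitivity se and specificity sp (nondifferential misclassification)."""
    return se * p + (1.0 - sp) * (1.0 - p)


def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Normal-approximation power of a 2-sided test comparing 2 proportions."""
    z = _NORMAL.inv_cdf(1.0 - alpha / 2.0)
    p_bar = (n1 * p1 + n2 * p2) / (n1 + n2)
    sd_null = (p_bar * (1.0 - p_bar) * (1.0 / n1 + 1.0 / n2)) ** 0.5
    sd_alt = (p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2) ** 0.5
    delta = p2 - p1
    upper = _NORMAL.cdf((delta - z * sd_null) / sd_alt)
    lower = _NORMAL.cdf((-delta - z * sd_null) / sd_alt)
    return upper + lower


def detectable_risk_difference(p1, n1, n2, se=1.0, sp=1.0, power=0.80, alpha=0.05):
    """Smallest true risk difference (p2 - p1) detectable at the target power
    when both arms are measured with the given sensitivity and specificity.
    Assumes se + sp > 1, so power increases with the true p2."""
    lo, hi = p1, 1.0
    for _ in range(60):  # bisection on the true intervention-arm proportion
        mid = (lo + hi) / 2.0
        achieved = power_two_proportions(
            observed_proportion(p1, se, sp),
            observed_proportion(mid, se, sp),
            n1, n2, alpha,
        )
        if achieved < power:
            lo = mid
        else:
            hi = mid
    return hi - p1


if __name__ == "__main__":
    for label, se, sp in [("no misclassification", 1.0, 1.0),
                          ("NLP alone", 0.825, 0.892),
                          ("NLP-screened abstraction", 0.926, 1.0)]:
        rd = detectable_risk_difference(0.335, 1256, 1256, se=se, sp=sp)
        print(f"{label}: detectable risk difference = {rd:.3f}")
```

Under these assumptions the three scenarios should land close to the 5.4%, 7.6%, and 5.7% detectable risk differences reported in the study, although small differences from the Stata implementation are possible.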

To empirically estimate statistical power in the presence of nondifferential misclassification, we performed Monte Carlo simulations constrained to the same sample sizes and control arm prevalence (p1) over a range of risk differences (p2 − p1) and values for patient-level sensitivity and specificity. For each of 10 000 replications, we generated true outcomes as binomial variates conditioned on the given values of n1, n2, p1, and p2 and observed outcomes as binomial variates conditioned on the true outcome, sensitivity, and specificity. We then tested for an association between the observed outcome and treatment arm using a χ2 test with 2-sided α of 0.05 and reported the proportion of replications that rejected the null hypothesis ($H_0: \hat{p}_1 = \hat{p}_2$, ie, equal observed proportions in the 2 arms) as the observed power. We used Bland-Altman analysis to compare observed power in simulations against misclassification-adjusted calculated power.58
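A compact Python sketch of this simulation follows. It is an illustration under stated assumptions, not the authors' Stata procedure: outcomes are drawn patient by patient with the standard library's `random` module, the test is an uncorrected Pearson χ2 on the 2×2 table, and the default number of replications is kept small for speed (the study used 10 000).

```python
import random

CHI2_CRITICAL_1DF = 3.841458820694124  # 95th percentile of chi-square, 1 df


def simulate_arm(n, p_true, se, sp, rng):
    """Number of patients in one arm whose outcome is observed as positive."""
    observed_positive = 0
    for _ in range(n):
        truth = rng.random() < p_true
        # A true positive is detected with probability se;
        # a true negative is falsely flagged with probability 1 - sp.
        p_flag = se if truth else (1.0 - sp)
        if rng.random() < p_flag:
            observed_positive += 1
    return observed_positive


def pearson_chi2(a, n1, b, n2):
    """Uncorrected Pearson chi-square for a 2x2 table (a/n1 vs b/n2 positive)."""
    pooled = (a + b) / (n1 + n2)
    if pooled in (0.0, 1.0):
        return 0.0  # no variation in the observed outcome
    return (a / n1 - b / n2) ** 2 / (pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))


def simulated_power(p1, p2, n1, n2, se, sp, reps=1000, seed=0):
    """Share of replications in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = simulate_arm(n1, p1, se, sp, rng)
        b = simulate_arm(n2, p2, se, sp, rng)
        if pearson_chi2(a, n1, b, n2) > CHI2_CRITICAL_1DF:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    # True risk difference of 5.7% measured with 92.6% sensitivity, 100% specificity.
    print(simulated_power(0.335, 0.392, 1256, 1256, se=0.926, sp=1.0))
```

With enough replications the simulated power for this scenario should sit near 80%, and setting p2 equal to p1 should return a rejection rate near the nominal 5%.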

Power calculations and Monte Carlo simulations were performed using Stata/MP, version 17.0, and further parallelized for performance using the parallel Stata module, version 1.20.0.59,60 The source code for misclassification-adjusted power calculations and simulation procedures is provided in eAppendix 3 in Supplement 1. Data were visualized and analyzed using Stata/MP, the blandaltman Stata module,61 and Plotly Chart Studio (Plotly). Two-sided P < .05 was considered significant.

Results

Between April 23, 2020, and March 26, 2021, PICSI-H Trial 1 enrolled 2512 patients, whose baseline characteristics are presented in eTable 1 in Supplement 1. The mean (SD) age was 71.7 (10.8) years; 1456 (58%) were female, 1056 (42%) were male, and most had 2 or more chronic life-limiting illness diagnoses (1281 [51%]).

Abstraction and Composition of Training and Validation Data Sets

In manual abstraction of 4642 EHR notes in the training data set, 340 notes (7%; belonging to 34 of 150 patients [23%]) contained documented goals-of-care discussions (eTable 2 in Supplement 1). The training data set was abstracted using 287 abstractor-hours over a 4-month period.

In manual abstraction of 2480 EHR notes in the validation sample, 268 notes (11%; belonging to 54 of 159 patients [34%]) contained documented goals-of-care discussions (eTable 2 in Supplement 1). Note- and patient-level prevalence of EHR-documented goals-of-care discussions were similar in patients with and without ADRD (note level: 113 of 1248 with ADRD [9%] vs 155 of 1232 without ADRD [13%]; χ2 P = .81; patient level: 25 of 80 with ADRD [31%] vs 29 of 79 without ADRD [37%]; χ2 P = .47). Adjusted for ADRD, 12.2% of notes and 36.1% of patients in the validation sample had an EHR-documented goals-of-care discussion. The validation sample was abstracted using 204 abstractor-hours over a 6-month period.
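The ADRD-adjusted figures can be reproduced with simple reweighting. The sketch below assumes the adjustment weights the ADRD and non-ADRD strata by the 11% ADRD prevalence among enrolled patients; the study describes resampling, which gives the same result in expectation. The function name is illustrative.

```python
def adrd_adjusted_prevalence(pos_adrd, n_adrd, pos_other, n_other, adrd_share=0.11):
    """Prevalence reweighted so the ADRD stratum counts for adrd_share of the total."""
    return adrd_share * pos_adrd / n_adrd + (1.0 - adrd_share) * pos_other / n_other


if __name__ == "__main__":
    note_level = adrd_adjusted_prevalence(113, 1248, 155, 1232)
    patient_level = adrd_adjusted_prevalence(25, 80, 29, 79)
    print(f"notes: {note_level:.1%}, patients: {patient_level:.1%}")
    # prints "notes: 12.2%, patients: 36.1%"
```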

BERT NLP Performance in the Validation Sample

In comparing BERT NLP predictions with manual abstraction in the validation sample, BERT NLP demonstrated note- and patient-level areas under the ROC curve of 0.962 and 0.924, respectively (Figure 2). The note- and patient-level areas under the precision-recall curve were 0.824 and 0.879, respectively, and the maximal observed note- and patient-level F1 scores were 0.77 and 0.82, respectively. We also report the observed note- and patient-level sensitivity, specificity, positive and negative predictive values, and F1 scores at select predefined thresholds (Table).
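For readers less familiar with these metrics, the sketch below shows how sensitivity, specificity, predictive values, and F1 follow from predicted probabilities and gold-standard labels at one threshold. Function names and the toy data are illustrative; the authors used Stata's roctab and prcurve packages.

```python
def f1_score(sensitivity, ppv):
    """Harmonic mean of sensitivity (recall) and positive predictive value (precision)."""
    if sensitivity + ppv == 0:
        return float("nan")
    return 2.0 * sensitivity * ppv / (sensitivity + ppv)


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else float("nan")


def classification_metrics(probs, labels, threshold):
    """Confusion-matrix metrics when probabilities >= threshold count as positive."""
    tp = fp = tn = fn = 0
    for prob, label in zip(probs, labels):
        predicted = prob >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    sensitivity = _ratio(tp, tp + fn)
    ppv = _ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "ppv": ppv,
        "npv": _ratio(tn, tn + fn),
        "f1": f1_score(sensitivity, ppv),
    }


if __name__ == "__main__":
    # Patient-level row with 79.4% sensitivity and 83.3% PPV from the Table.
    print(round(f1_score(0.794, 0.833), 2))  # prints 0.81
```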

Figure 2. Performance of BERT Natural Language Processing in Classifying 30-Day Documented Goals-of-Care Discussions at Note and Patient Levels in a 159-Patient, 2480-Note Internal Validation Sample.

Representative values of sensitivity, specificity, positive and negative predictive values, and F1 scores from these curves are presented in the Table. B, Dotted lines indicate nondiscriminating classifiers. AUC indicates area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; and PPV, positive predictive value.

Table. Performance Metrics for BERT NLP in Classifying 30-Day Documented Goals-of-Care Discussions at Note and Patient Levels in a 159-Patient, 2480-Note Internal Validation Sample^a

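Unit of classification | Sensitivity, % | Specificity, % | PPV, % | NPV, % | F1 score | AUC | AUPRC
--- | --- | --- | --- | --- | --- | --- | ---
Note level (n = 2480) | 70.1 | 98.1 | 83.6 | 95.9 | 0.76 | 0.962 | 0.824
Note level (n = 2480) | 79.9 | 94.5 | 66.9 | 97.1 | 0.73 | |
Note level (n = 2480) | 89.7 | 88.1 | 51.0 | 98.4 | 0.65 | |
Patient level (n = 159) | 70.0 | 92.8 | 84.5 | 84.6 | 0.77 | 0.924 | 0.879
Patient level (n = 159) | 79.4 | 91.0 | 83.3 | 88.6 | 0.81 | |
Patient level (n = 159) | 89.5 | 69.5 | 62.3 | 92.1 | 0.73 | |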

Abbreviations: AUC, area under the receiver operating characteristic curve; AUPRC, area under the precision-recall curve; NLP, natural language processing; NPV, negative predictive value; PPV, positive predictive value.

^a Performance metrics are shown at observed discrimination thresholds with sensitivities closest to prespecified values of 70%, 80%, and 90%.

Power Estimates With and Without Misclassification

In conventional power analysis (which assumes no misclassification), we calculated that the trial (N = 2512; 1:1 allocation; assumed p1 of 33.5%) would have 80% power to detect a risk difference of 5.4% with 2-sided α of 0.05. Misclassification-adjusted calculations of detectable risk difference at 80% power across ranges of outcome sensitivity and specificity demonstrated a superlinear increase in detectable risk difference (ie, loss of power) with decreasing sensitivity or specificity (Figure 3). Notably, at this sample size, the detectable risk difference remained under 10% even with substantial misclassification (eg, sensitivity 80%, specificity 80%). Monte Carlo simulations across ranges of sensitivity, specificity, and risk difference demonstrated excellent agreement between calculated and observed (simulated) power in Bland-Altman analysis (eFigure 1 in Supplement 1).

Figure 3. Detectable Risk Difference Over Classifier Performance.

Assumptions: n1 = 1256; n2 = 1256; p1 = 0.335; power = 0.8; and 2-sided α = .05. An interactive 3-D plot is available at https://chart-studio.plotly.com/~rlee06uw/14/#/.

Comparison of Outcome Measurement Strategies

We evaluated 3 strategies for measuring the primary outcome: manual human abstraction, BERT NLP alone, and BERT NLP–screened human abstraction. The first approach, manual human abstraction, is the de facto gold standard and would power the study to detect a risk difference of 5.4%. Our experience suggested that individual abstractors could perform this abstraction task for up to 3 hours per day before experiencing excessive fatigue. To estimate the number of hours required for complete manual abstraction of the trial data set, we scaled the known abstractor-hours required to collect data for the validation sample to the entire trial data set, yielding an estimate of approximately 3000 abstractor-hours (ie, 67 work weeks for a team of 3 abstractors devoting 3 hours per day to this task; costing $195 000 at a rate of $65 per hour). Based on estimates from the validation sample, constraining abstractors to records from randomization to the first EHR-documented goals-of-care discussion (or 30 days if none was present) would reduce this estimate to approximately 2000 abstractor-hours (ie, 45 work weeks for a team of 3 abstractors, or $130 000 at a rate of $65 per hour).
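The workload arithmetic in this paragraph can be checked with a short calculation. The sketch assumes a 5-day work week, which the text implies but does not state; the function name and defaults are illustrative.

```python
import math


def abstraction_estimate(total_hours, team_size=3, hours_per_day=3,
                         days_per_week=5, hourly_rate=65):
    """Return (work weeks, cost in dollars) for a given number of abstractor-hours."""
    weekly_capacity = team_size * hours_per_day * days_per_week  # 45 hours/week by default
    weeks = math.ceil(total_hours / weekly_capacity)
    return weeks, total_hours * hourly_rate


if __name__ == "__main__":
    # Scaling the 204 validation-sample hours from 159 to 2512 patients.
    print(round(204 * 2512 / 159))     # prints 3223, ie, roughly 3000 hours
    print(abstraction_estimate(3000))  # prints (67, 195000)
    print(abstraction_estimate(2000))  # prints (45, 130000)
```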

In the second approach, BERT NLP alone, the primary outcome could be measured with any combination of sensitivity and specificity represented on the ROC curve in Figure 2. Based on the calculated detectable risk differences shown in Figure 3, a BERT NLP model implemented with a discrimination threshold corresponding to a maximal patient-level F1 (82.5% sensitivity, 89.2% specificity) would power the study to detect a risk difference of 7.6% compared with 5.4% without misclassification.

In the third approach, BERT NLP–screened human abstraction, only EHR passages that were scored by NLP above a predefined threshold would be reviewed by human abstractors for documented goals-of-care discussions.34 We used the results shown in Figure 3 to determine the detectable risk difference at 80% power across a range of sensitivities and 100% specificity (Figure 4 and eFigure 2 in Supplement 1) and to estimate patient-level sensitivity and number of EHR passages requiring human verification to achieve complete outcome data from randomization to the first human-confirmed documented goals-of-care discussion (or 30 days if none was present) across a representative range of screening thresholds (Figure 4). All misclassification-adjusted power calculations under these constraints were again consistent with Monte Carlo simulations (eFigure 2 in Supplement 1). Based on these data and the time and resources available, we elected to measure the primary outcome using NLP-screened human abstraction at a screening threshold corresponding to 92.6% estimated patient-level sensitivity. At this threshold, there were 22 187 EHR passages (0.8% of all 2.64 million passages) from 11 287 notes and 1957 patients that screened positive by NLP. Assuming 33.5% control-arm prevalence, abstraction of all NLP-positive passages from randomization to the first human-confirmed documented goals-of-care discussion (or 30 days if none was present) would power the trial to detect a risk difference of 5.7% at 80% power with 2-sided α of 0.05 and was estimated to require manual abstraction of approximately 8500 NLP-screened EHR passages containing a median of 52 words each (IQR, 25-101 words) (Figure 4). Following this decision, 3 research coordinators (including J.T.) adjudicated 7494 EHR passages using 34.3 abstractor-hours over a 3-week period to complete primary outcome measurements for all trial participants from randomization to the first goals-of-care discussion (or 30 days if none was present) at the given screening threshold. The NLP-screened passages were adjudicated by abstractors in a random order, and abstractors were blinded to patient randomization.
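One way to picture NLP-screened human abstraction is the sketch below, which is an assumption-laden illustration rather than the study's procedure. Passages are plain dictionaries, `human_review` is a hypothetical callback standing in for the abstractor's judgment, and review proceeds chronologically within each patient so that it can stop at the first confirmed discussion (the study randomized adjudication order, so its logistics differed).

```python
def screened_abstraction(passages, threshold, human_review):
    """Measure a patient-level outcome by human review of NLP-positive passages.

    passages: list of dicts with keys "patient_id", "timestamp",
              "nlp_score", and "text" (hypothetical layout).
    human_review: callable taking passage text, returning True if the
                  abstractor confirms a documented discussion.
    Returns (outcome per patient, number of passages reviewed).
    """
    outcomes = {}
    flagged = {}
    for passage in passages:
        # Every patient starts as outcome-negative.
        outcomes.setdefault(passage["patient_id"], False)
        if passage["nlp_score"] >= threshold:
            flagged.setdefault(passage["patient_id"], []).append(passage)

    reviewed = 0
    for patient_id, candidates in flagged.items():
        for passage in sorted(candidates, key=lambda item: item["timestamp"]):
            reviewed += 1
            if human_review(passage["text"]):
                outcomes[patient_id] = True
                break  # later passages are not needed for this patient
    return outcomes, reviewed
```

Stopping at the first confirmed passage is why the number of adjudicated passages can be well below the number that screen positive. The sketch also shows how a patient can be missed: if the only true discussion scores below the threshold, the human never sees it, even when other passages for that patient were flagged.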

Figure 4. Detectable Risk Difference vs Sensitivity of Natural Language Processing (NLP)–Screened Human Abstraction.

In considering the use of NLP-screened human abstraction to measure the trial outcome, we evaluated the added utility of human verification of NLP-positive passages compared with NLP alone. Measuring the outcome using NLP alone (square) would have powered the trial to detect a risk difference (RD) of 7.6%. Measuring the outcome using NLP-screened human abstraction would improve power (ie, decrease detectable risk difference) at the cost of the number of passages requiring human verification; this cost increases in a superlinear manner as the screening threshold moves toward higher sensitivity. We ultimately chose to measure the outcome using NLP-screened human abstraction at a sensitivity of 92.6% (circles), which powered the trial to detect a risk difference of 5.7% at the predicted cost of human verification of 8500 NLP-screened electronic health record (EHR) passages. Assumptions: n1 = 1256; n2 = 1256; p1 = 0.335; 2-sided α = .05; power = 0.8. The relationship between power, sensitivity, and detectable risk difference is shown in eFigure 2 in Supplement 1.

^a Sensitivity and detectable risk difference of BERT NLP alone at the discrimination threshold corresponding to maximal patient-level F1 score (82.5% sensitivity, 89.2% specificity).

^b Sensitivity of NLP-screened human abstraction at the screening threshold selected for the clinical trial.

^c At 100% sensitivity (perfect classifier; diamonds), the estimated number of NLP-positive EHR passages requiring human abstraction for complete outcomes from date of randomization to first goals-of-care discussion was approximately 1.2 million. Passages contained a median of 52 words each (IQR, 25-101 words).

Discussion

In this diagnostic study, we examined the novel use of deep-learning–based NLP to measure a complex outcome from unstructured EHR text in a large pragmatic clinical trial. We also demonstrated and validated the use of statistical methods to quantitatively assess the effects of NLP-related misclassification on study power at a given sample size.

Natural language processing–screened human abstraction represents an efficient and useful approach for measuring EHR outcomes in large pragmatic studies and is increasingly used to measure palliative care outcomes similar to the one we examined.34,62,63,64,65 In the 2512-patient PICSI-H Trial 1, measuring the primary outcome using conventional manual abstraction would have required thousands of abstractor-hours, which is both costly and time consuming. In contrast, NLP-screened human abstraction allowed investigators to make up-front investments in developing NLP and collecting training and validation data and then measure the primary outcome in far fewer abstractor-hours with acceptable losses in sensitivity and power. Misclassification-adjusted interim power analyses allowed investigators to select a strategy for measuring the primary outcome that best balanced statistical power with abstractor time and resources.

Researchers considering NLP or NLP-aided approaches to measuring outcomes should consider the effects of the outcome measurement strategy on the statistical power, costs, and validity of the trial. In this study, we demonstrated 2 simple methods for researchers to perform misclassification-adjusted power calculations, and we encourage the uptake of this approach in the design of future trials.9,10 We also demonstrated the utility of NLP-screened human abstraction34 to measure trial outcomes from the EHR. It should be noted that in implementing NLP-screened human abstraction of EHR passages, both the expected sensitivity and the number of EHR passages that require human verification are functions of the selected NLP screening threshold in a given data set; an analysis such as that shown in Figure 4 may help investigators select the most appropriate screening threshold for their study. Notably, at a given threshold score, the patient-level sensitivity of NLP-screened human abstraction can be lower than that of NLP alone due to the potential presence of documented goals-of-care discussions scoring beneath the screening threshold in patients for whom all NLP-positive passages are false-positive. Because of this, studies that use NLP-screened human abstraction of EHR passages must estimate patient-level sensitivity at a given screening threshold in the sample population either by statistical methods or by testing within validation data sampled at the patient level.

Perhaps the most conspicuous limitation of using NLP to measure clinical research outcomes is the investment required to implement NLP. Although this investment may be minimal for outcomes that are easily detected using rule-based NLP, identifying more complex constructs such as the one investigated here may require substantial software development and acquisition of training and validation data. Our research group has spent more than 500 developer-hours implementing this NLP model, in addition to 491 abstractor-hours spent collecting training and validation data. We anticipate that the cost of implementing high-performing pretrained NLP models will decrease as the field matures. Additionally, many NLP development costs are fixed with respect to the number of study participants, and many aspects of NLP development are transferrable to other studies and research questions. Although our BERT NLP model is not yet portable due to the privacy risks associated with training on identifiable protected health information,66 we believe it represents a first step toward the development of a publicly available, validated, and generalizable BERT NLP model for identifying EHR-documented goals-of-care discussions.

Limitations

Our study has several important limitations. First, our model was trained and validated on data from a single health system, and its performance may not generalize to other systems. Future external validation and efforts to improve explainability of deep learning models67,68 will be important to developing models that generalize across health systems. Second, our model was validated within a relatively small sample of 159 patients, which limits our assessment of generalizability outside the validation sample. Although the performance observed in the validation sample was comparable with that observed in within-training-set leave-one-group-out cross-validation procedures, these estimates have the same limitation of small sample size. Given the high cost of abstracting this outcome per patient hospitalization, future validation of NLP models measuring this outcome should consider alternative sampling strategies or statistical methods that account for incomplete reference data.69,70 Third, the limited diversity of the validation sample constrained our ability to examine potential patient-level biases in the resulting model.71 It is possible that our model performs differently across variables such as race, ethnicity, gender, or other clinical and socioecological patient characteristics. While such differential performance is less likely to bias the findings of a randomized clinical trial (in which predictors are randomly distributed between groups), observational studies of NLP-measured outcomes must consider differential misclassification over exposures as a potential source of bias. This issue is particularly salient for palliative care researchers due to the known role of race, ethnicity, and racism in palliative care disparities.53,72 As clinical NLP matures toward developing clinical predictive models, it will be imperative to thoroughly validate model performance and generalizability over diverse socioecological groups to avoid perpetuating health disparities. Fourth, although NLP-screened human abstraction may facilitate larger sample sizes than manual data collection, the scalability of this approach is inherently limited by variable costs of human abstraction that increase with sample size. Fifth, our conclusions for researchers are not applicable to all clinical outcomes. Although the outcome we examined is linguistically complex, other outcomes of interest may be even more difficult to classify using EHR text, and the potential of NLP to screen for or measure such outcomes may be limited. Sixth, we did not examine the effects of NLP-related outcome misclassification on estimates of exposure-outcome associations. Recent advances in statistical methods that account for outcome misclassification have shown promise for enhancing the validity of such estimates,8,9,73,74 and we believe clinical researchers who use NLP to measure outcomes should consider adopting such methods in their analyses.
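As a brief illustration of why outcome misclassification matters for effect estimates, the following textbook identity shows how a risk difference is attenuated when misclassification is nondifferential, that is, when sensitivity and specificity are the same in both trial groups. The symbols are introduced here for illustration and are not taken from the study: $p_g$ is the true outcome proportion in group $g$, $p_g^{*}$ is the proportion classified as positive, and $Se$ and $Sp$ are sensitivity and specificity.

```latex
\begin{align*}
p_g^{*} &= Se \, p_g + (1 - Sp)(1 - p_g), \qquad g \in \{0, 1\} \\
RD^{*} &= p_1^{*} - p_0^{*} = (Se + Sp - 1)(p_1 - p_0) = (Se + Sp - 1)\, RD
\end{align*}
```

When specificity is 1, as would be expected if every NLP-positive passage is verified by a human abstractor, the observed risk difference is simply $Se \times RD$. Differential misclassification, in which $Se$ or $Sp$ varies by exposure, does not reduce to this form and can bias estimates in either direction.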

Conclusions

In this diagnostic study evaluating the use of deep-learning NLP to measure EHR-documented goals-of-care discussions, we measured the primary outcome of a large pragmatic clinical trial using NLP-screened human abstraction with acceptable sensitivity and substantial savings in abstractor-hours. Our experience demonstrated that NLP may facilitate clinical research studies that would otherwise be infeasible due to the costs of manual medical record abstraction. Misclassification-adjusted power calculations quantified power loss from NLP-related misclassification, suggesting that incorporation of this approach into the design of future studies that use NLP to measure outcomes would be beneficial.

Supplement 1.

eTable 1. Patient Characteristics by Note Corpus

eTable 2. Characteristics of Electronic Health Record Note Corpora

eFigure 1. Comparison of Calculated vs Observed (Simulated) Power Over Ranges of Sensitivity, Specificity, and Risk Difference

eFigure 2. Power vs Detectable Risk Difference vs Sensitivity of NLP-screened Human Abstraction

eMethods.

eAppendix 1. Examples of Clinician-facing Communication-priming Intervention Forms (Jumpstart Guide) From Clinical Trial

eAppendix 2. Chart Abstractor Codebook for the Project to Improve Communication in Serious Illness (PICSI) Trial Series

eAppendix 3. Stata Source Code for Study Power Calculation and Simulation Procedures

eReferences

Supplement 2.

Data Sharing Statement

References

  • 1. Yim WW, Yetisgen M, Harris WP, Kwan SW. Natural language processing in oncology: a review. JAMA Oncol. 2016;2(6):797-804. doi: 10.1001/jamaoncol.2016.0213
  • 2. Wu S, Roberts K, Datta S, et al. Deep learning in clinical natural language processing: a methodical review. J Am Med Inform Assoc. 2020;27(3):457-470. doi: 10.1093/jamia/ocz200
  • 3. Curtis JR, Sathitratanacheewin S, Starks H, et al. Using electronic health records for quality measurement and accountability in care of the seriously ill: opportunities and challenges. J Palliat Med. 2018;21(S2):S52-S60. doi: 10.1089/jpm.2017.0542
  • 4. Luo Y, Thompson WK, Herr TM, et al. Natural language processing for EHR-based pharmacovigilance: a structured review. Drug Saf. 2017;40(11):1075-1089. doi: 10.1007/s40264-017-0558-6
  • 5. Bejan CA, Angiolillo J, Conway D, et al. Mining 100 million notes to find homelessness and adverse childhood experiences: 2 case studies of rare and severe social determinants of health in electronic health records. J Am Med Inform Assoc. 2018;25(1):61-71. doi: 10.1093/jamia/ocx059
  • 6. Lindvall C, Lilley EJ, Zupanc SN, et al. Natural language processing to assess end-of-life quality indicators in cancer patients receiving palliative surgery. J Palliat Med. 2019;22(2):183-187. doi: 10.1089/jpm.2018.0326
  • 7. Chien I, Shi A, Chan A, Lindvall C. Identification of serious illness conversations in unstructured clinical notes using deep neural networks. In: Koch F, Koster A, Bichindaritz I, et al, eds. Artificial Intelligence in Health: First International Workshop, AIH 2018, Stockholm, Sweden, July 13-14, 2018, Revised Selected Papers. Springer Nature Switzerland; 2019:199-212.
  • 8. Brakenhoff TB, Mitroiu M, Keogh RH, Moons KGM, Groenwold RHH, van Smeden M. Measurement error is often neglected in medical literature: a systematic review. J Clin Epidemiol. 2018;98:89-97. doi: 10.1016/j.jclinepi.2018.02.023
  • 9. Keogh RH, Shaw PA, Gustafson P, et al. STRATOS guidance document on measurement error and misclassification of variables in observational epidemiology: part 1—basic theory and simple methods of adjustment. Stat Med. 2020;39(16):2197-2231. doi: 10.1002/sim.8532
  • 10. Devine O. The impact of ignoring measurement error when estimating sample size for epidemiologic studies. Eval Health Prof. 2003;26(3):315-339. doi: 10.1177/0163278703255232
  • 11. Udelsman BV, Moseley ET, Sudore RL, Keating NL, Lindvall C. Deep natural language processing identifies variation in care preference documentation. J Pain Symptom Manage. 2020;59(6):1186-1194.e3. doi: 10.1016/j.jpainsymman.2019.12.374
  • 12. Chan A, Chien I, Moseley E, et al. Deep learning algorithms to identify documentation of serious illness conversations during intensive care unit admissions. Palliat Med. 2019;33(2):187-196. doi: 10.1177/0269216318810421
  • 13. Lee RY, Brumback LC, Lober WB, et al. Identifying goals of care conversations in the electronic health record using natural language processing and machine learning. J Pain Symptom Manage. 2021;61(1):136-142.e2. doi: 10.1016/j.jpainsymman.2020.08.024
  • 14. Uyeda AM, Curtis JR, Engelberg RA, et al. Mixed-methods evaluation of three natural language processing modeling approaches for measuring documented goals-of-care discussions in the electronic health record. J Pain Symptom Manage. 2022;63(6):e713-e723. doi: 10.1016/j.jpainsymman.2022.02.006
  • 15. Secunda K, Wirpsa MJ, Neely KJ, et al. Use and meaning of “goals of care” in the healthcare literature: a systematic review and qualitative discourse analysis. J Gen Intern Med. 2020;35(5):1559-1566. doi: 10.1007/s11606-019-05446-0
  • 16. Bernacki RE, Block SD; American College of Physicians High Value Care Task Force. Communication about serious illness care goals: a review and synthesis of best practices. JAMA Intern Med. 2014;174(12):1994-2003. doi: 10.1001/jamainternmed.2014.5271
  • 17. Davidson JE, Powers K, Hedayat KM, et al; American College of Critical Care Medicine Task Force 2004-2005, Society of Critical Care Medicine. Clinical practice guidelines for support of the family in the patient-centered intensive care unit: American College of Critical Care Medicine Task Force 2004-2005. Crit Care Med. 2007;35(2):605-622. doi: 10.1097/01.CCM.0000254067.14607.EB
  • 18. Halpern SD, Becker D, Curtis JR, et al; Choosing Wisely Taskforce; American Thoracic Society; American Association of Critical-Care Nurses; Society of Critical Care Medicine. An official American Thoracic Society/American Association of Critical-Care Nurses/American College of Chest Physicians/Society of Critical Care Medicine policy statement: the Choosing Wisely Top 5 list in Critical Care Medicine. Am J Respir Crit Care Med. 2014;190(7):818-826. doi: 10.1164/rccm.201407-1317ST
  • 19. Kon AA, Davidson JE, Morrison W, Danis M, White DB. Shared decision-making in intensive care units: executive summary of the American College of Critical Care Medicine and American Thoracic Society policy statement. Am J Respir Crit Care Med. 2016;193(12):1334-1336. doi: 10.1164/rccm.201602-0269ED
  • 20. Davidson JE, Aslakson RA, Long AC, et al. Guidelines for family-centered care in the neonatal, pediatric, and adult ICU. Crit Care Med. 2017;45(1):103-128. doi: 10.1097/CCM.0000000000002169
  • 21. Heyland DK, Barwich D, Pichora D, et al; ACCEPT (Advance Care Planning Evaluation in Elderly Patients) Study Team; Canadian Researchers at the End of Life Network (CARENET). Failure to engage hospitalized elderly patients and their families in advance care planning. JAMA Intern Med. 2013;173(9):778-787. doi: 10.1001/jamainternmed.2013.180
  • 22. Shah K, Swinton M, You JJ. Barriers and facilitators for goals of care discussions between residents and hospitalised patients. Postgrad Med J. 2017;93(1097):127-132. doi: 10.1136/postgradmedj-2016-133951
  • 23. Kruser JM, Benjamin BT, Gordon EJ, et al. Patient and family engagement during treatment decisions in an ICU: a discourse analysis of the electronic health record. Crit Care Med. 2019;47(6):784-791. doi: 10.1097/CCM.0000000000003711
  • 24. Curtis JR, Patrick DL, Shannon SE, Treece PD, Engelberg RA, Rubenfeld GD. The family conference as a focus to improve communication about end-of-life care in the intensive care unit: opportunities for improvement. Crit Care Med. 2001;29(2)(suppl):N26-N33. doi: 10.1097/00003246-200102001-00006
  • 25. Seaman JB, Arnold RM, Scheunemann LP, White DB. An integrated framework for effective and efficient communication with families in the adult intensive care unit. Ann Am Thorac Soc. 2017;14(6):1015-1020. doi: 10.1513/AnnalsATS.201612-965OI
  • 26. Comer AR, Hickman SE, Slaven JE, et al. Assessment of discordance between surrogate care goals and medical treatment provided to older adults with serious illness. JAMA Netw Open. 2020;3(5):e205179. doi: 10.1001/jamanetworkopen.2020.5179
  • 27. Wilson CJ, Newman J, Tapper S, et al. Multiple locations of advance care planning documentation in an electronic health record: are they easy to find? J Palliat Med. 2013;16(9):1089-1094. doi: 10.1089/jpm.2012.0472
  • 28. Sinuff T, Dodek P, You JJ, et al. Improving end-of-life communication and decision making: the development of a conceptual framework and quality indicators. J Pain Symptom Manage. 2015;49(6):1070-1080. doi: 10.1016/j.jpainsymman.2014.12.007
  • 29. Tulsky JA, Beach MC, Butow PN, et al. A research agenda for communication between health care professionals and patients living with serious illness. JAMA Intern Med. 2017;177(9):1361-1366. doi: 10.1001/jamainternmed.2017.2005
  • 30. Turnbull AE, Bosslet GT, Kross EK. Aligning use of intensive care with patient values in the USA: past, present, and future. Lancet Respir Med. 2019;7(7):626-638. doi: 10.1016/S2213-2600(19)30087-6
  • 31. Lilley EJ, Lindvall C, Lillemoe KD, Tulsky JA, Wiener DC, Cooper Z. Measuring processes of care in palliative surgery: a novel approach using natural language processing. Ann Surg. 2018;267(5):823-825. doi: 10.1097/SLA.0000000000002579
  • 32. Project to Improve Communication About Serious Illness—Hospital Study: Pragmatic Trial (Trial 1) (PICSI-H). ClinicalTrials.gov identifier: NCT04281784. Accessed November 3, 2020. https://clinicaltrials.gov/ct2/show/NCT04281784
  • 33. Curtis JR, Lee RY, Brumback LC, et al. Improving communication about goals of care for hospitalized patients with serious illness: study protocol for two complementary randomized trials. Contemp Clin Trials. 2022;120:106879. doi: 10.1016/j.cct.2022.106879
  • 34. Lindvall C, Deng CY, Moseley E, et al. Natural language processing to identify advance care planning documentation in a multisite pragmatic clinical trial. J Pain Symptom Manage. 2022;63(1):e29-e36. doi: 10.1016/j.jpainsymman.2021.06.025
  • 35. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement. J Clin Epidemiol. 2015;68(2):134-143. doi: 10.1016/j.jclinepi.2014.11.010
  • 36. Iezzoni LI, Heeren T, Foley SM, Daley J, Hughes J, Coffman GA. Chronic conditions and risk of in-hospital death. Health Serv Res. 1994;29(4):435-460.
  • 37. Wennberg JE, Fisher ES, Goodman DC, Skinner JS. Tracking the Care of Patients With Severe Chronic Illness: The Dartmouth Atlas of Health Care 2008. The Dartmouth Institute for Health Policy and Clinical Practice; 2008.
  • 38. Goodman DC, Esty AR, Fisher ES, Chang CH. Trends and Variation in End-of-life Care for Medicare Beneficiaries With Severe Chronic Illness: A Report of the Dartmouth Atlas Project. The Dartmouth Institute for Health Policy and Clinical Practice; April 12, 2011.
  • 39. Back AL, Arnold RM, Tulsky JA, Baile WF, Fryer-Edwards KA. Teaching communication skills to medical oncology fellows. J Clin Oncol. 2003;21(12):2433-2436. doi: 10.1200/JCO.2003.09.073
  • 40. Abedini NC, Merel SE, Hicks KG, et al. Applying human-centered design to refinement of the Jumpstart Guide, a clinician- and patient-facing goals-of-care discussion priming tool. J Pain Symptom Manage. 2021;62(6):1283-1288. doi: 10.1016/j.jpainsymman.2021.06.012
  • 41. Lee RY, Kross EK, Downey L, et al. Efficacy of a communication-priming intervention on documented goals-of-care discussions in hospitalized patients with serious illness: a randomized clinical trial. JAMA Netw Open. 2022;5(4):e225088. doi: 10.1001/jamanetworkopen.2022.5088
  • 42. Dedoose. SocioCultural Research Consultants, LLC. Accessed January 26, 2023. https://www.dedoose.com/
  • 43. Alsentzer E, Murphy JR, Boag W, et al. Publicly available clinical BERT embeddings. arXiv. Preprint posted online April 6, 2019. doi: 10.48550/arXiv.1904.03323
  • 44. Alsentzer E. Bio_ClinicalBERT. 2019. Accessed March 15, 2022. https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
  • 45. Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. Preprint posted online October 11, 2018. doi: 10.48550/arXiv.1810.04805
  • 46. Google Research. BERT: TensorFlow code and pre-trained models for BERT. 2018. Accessed March 15, 2022. https://github.com/google-research/bert
  • 47. Khalid S. BERT explained: a complete guide with theory and tutorial. November 2, 2019. Accessed December 20, 2022. https://medium.com/@samia.khalid/bert-explained-a-complete-guide-with-theory-and-tutorial-3ac9ebc8fa7c
  • 48. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234-1240.
  • 49. Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035. doi: 10.1038/sdata.2016.35
  • 50. Hugging Face. Transformers. 2022. Accessed March 15, 2022. https://huggingface.co/docs/transformers/index
  • 51. roctab—nonparametric ROC analysis. Stata. Version 17. StataCorp LLC; 2021.
  • 52. Cook J, Ramadas V. When to consult precision-recall curves. Stata J. 2020;20(1):131-148. doi: 10.1177/1536867X20909693
  • 53. Uyeda AM, Lee RY, Pollack LR, et al. Predictors of documented goals-of-care discussion for hospitalized patients with chronic illness. J Pain Symptom Manage. 2022;S0885-3924(22)00973-3. doi: 10.1016/j.jpainsymman.2022.11.012
  • 54. Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. 3rd ed. John Wiley & Sons; 2003.
  • 55. Agresti A. Categorical Data Analysis. 3rd ed. John Wiley & Sons; 2013.
  • 56. power twoproportions—power analysis for a two-sample proportions test. Stata. Version 17. StataCorp LLC; 2021.
  • 57. Rahme E, Joseph L. Estimating the prevalence of a rare disease: adjusted maximum likelihood. Statistician. 1998;47(1):149-158. doi: 10.1111/1467-9884.00120
  • 58. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307-310. doi: 10.1016/S0140-6736(86)90837-8
  • 59. Vega Yon GG, Quistorff B. parallel: a command for parallel computing. Stata J. 2019;19(3):667-684. doi: 10.1177/1536867X19874242
  • 60. Vega Yon G, Quistorff B. PARALLEL: Stata module for parallel computing. Version 1.20.0. 2018. Accessed January 12, 2022. https://github.com/gvegayon/parallel
  • 61. Chatfield M. BLANDALTMAN: Stata module to create Bland-Altman plots. 2022. Accessed June 2, 2022. https://ideas.repec.org/c/boc/bocode/s459040.html
  • 62. Greer JA, Moy B, El-Jawahri A, et al. Randomized trial of a palliative care intervention to improve end-of-life care discussions in patients with metastatic breast cancer. J Natl Compr Canc Netw. 2022;20(2):136-143. doi: 10.6004/jnccn.2021.7040
  • 63. Volandes AE, Zupanc SN, Paasche-Orlow MK, et al. Association of an advance care planning video and communication intervention with documentation of advance care planning among older adults: a nonrandomized controlled trial. JAMA Netw Open. 2022;5(2):e220354. doi: 10.1001/jamanetworkopen.2022.0354
  • 64. Lakin JR, Brannen EN, Tulsky JA, et al; ACP-PEACE Investigators. Advance Care Planning: Promoting Effective and Aligned Communication in the Elderly (ACP-PEACE): the study protocol for a pragmatic stepped-wedge trial of older patients with cancer. BMJ Open. 2020;10(7):e040999. doi: 10.1136/bmjopen-2020-040999
  • 65. Eneanya ND, Lakin JR, Paasche-Orlow MK, et al. Video Images about Decisions for Ethical Outcomes in Kidney Disease (VIDEO-KD): the study protocol for a multi-centre randomised controlled trial. BMJ Open. 2022;12(4):e059313. doi: 10.1136/bmjopen-2021-059313
  • 66. Lehman E, Jain S, Pichotta K, Goldberg Y, Wallace BC. Does BERT pretrained on clinical notes reveal sensitive data? arXiv. Preprint posted online April 15, 2021. doi: 10.18653/v1/2021.naacl-main.73
  • 67. Castelvecchi D. Can we open the black box of AI? Nature. 2016;538(7623):20-23. doi: 10.1038/538020a
  • 68. Ras G, Xie N, van Gerven M, Doran D. Explainable deep learning: a field guide for the uninitiated. J Artif Intell Res. 2022;73:329-397. doi: 10.1613/jair.1.13200
  • 69. Tan WK, Heagerty PJ. Surrogate-guided sampling designs for classification of rare outcomes from electronic medical records data. Biostatistics. 2022;23(2):345-361. doi: 10.1093/biostatistics/kxaa028
  • 70. Pepe MS. Incomplete data and imperfect reference tests. In: The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2003:168-213.
  • 71. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med. 2018;178(11):1544-1547. doi: 10.1001/jamainternmed.2018.3763
  • 72. Brown CE, Curtis JR, Doll KM. A race-conscious approach toward research on racial inequities in palliative care. J Pain Symptom Manage. 2022;63(5):e465-e471. doi: 10.1016/j.jpainsymman.2021.11.012
  • 73. Edwards JK, Cole SR, Troester MA, Richardson DB. Accounting for misclassified outcomes in binary regression models using multiple imputation with internal validation data. Am J Epidemiol. 2013;177(9):904-912. doi: 10.1093/aje/kws340
  • 74. Shaw PA, Gustafson P, Carroll RJ, et al. STRATOS guidance document on measurement error and misclassification of variables in observational epidemiology: part 2—more complex methods of adjustment and advanced topics. Stat Med. 2020;39(16):2232-2263. doi: 10.1002/sim.8531

Articles from JAMA Network Open are provided here courtesy of American Medical Association
