Key Points
Question
Can prediction of patient outcomes in heart failure based on routinely collected claims data be improved with machine learning methods and incorporating linked electronic medical records?
Findings
In this prognostic study including records on 9502 patients, machine learning methods offered only limited improvement over logistic regression in predicting key outcomes in heart failure based on administrative claims. Inclusion of additional predictors from electronic medical records improved prediction for mortality, heart failure hospitalization, and loss in home days but not for high cost.
Meaning
Models based on claims-only predictors may achieve modest discrimination and accuracy in prediction of key patient outcomes in heart failure, and machine learning approaches and incorporation of additional predictors from electronic medical records may offer some improvement in risk prediction of select outcomes.
Abstract
Importance
Accurate risk stratification of patients with heart failure (HF) is critical to deploy targeted interventions aimed at improving patients’ quality of life and outcomes.
Objectives
To compare machine learning approaches with traditional logistic regression in predicting key outcomes in patients with HF and evaluate the added value of augmenting claims-based predictive models with electronic medical record (EMR)–derived information.
Design, Setting, and Participants
A prognostic study with a 1-year follow-up period was conducted including 9502 Medicare-enrolled patients with HF from 2 health care provider networks in Boston, Massachusetts (“providers” includes physicians, clinicians, other health care professionals, and their institutions that compose the networks). The study was performed from January 1, 2007, to December 31, 2014; data were analyzed from January 1 to December 31, 2018.
Main Outcomes and Measures
All-cause mortality, HF hospitalization, top cost decile, and home days loss greater than 25% were modeled using logistic regression, least absolute shrinkage and selection operator regression, classification and regression trees, random forests, and gradient-boosted modeling (GBM). All models were trained using data from network 1 and tested in network 2. After selecting the most efficient modeling approach based on discrimination, Brier score, and calibration, area under precision-recall curves (AUPRCs) and net benefit estimates from decision curves were calculated to focus on the differences when using claims-only vs claims + EMR predictors.
Results
A total of 9502 patients with HF with a mean (SD) age of 78 (8) years were included: 6113 from network 1 (training set) and 3389 from network 2 (testing set). Gradient-boosted modeling consistently provided the highest discrimination, lowest Brier scores, and good calibration across all 4 outcomes; however, logistic regression had generally similar performance (C statistics for logistic regression based on claims-only predictors: mortality, 0.724; 95% CI, 0.705-0.744; HF hospitalization, 0.707; 95% CI, 0.676-0.737; high cost, 0.734; 95% CI, 0.703-0.764; and home days loss, 0.781; 95% CI, 0.764-0.798; C statistics for GBM based on claims-only predictors: mortality, 0.727; 95% CI, 0.708-0.747; HF hospitalization, 0.745; 95% CI, 0.718-0.772; high cost, 0.733; 95% CI, 0.703-0.763; and home days loss, 0.790; 95% CI, 0.773-0.807). Higher AUPRCs were obtained for claims + EMR vs claims-only GBMs predicting mortality (0.484 vs 0.423), HF hospitalization (0.413 vs 0.403), and home time loss (0.575 vs 0.521) but not cost (0.249 vs 0.252). The net benefit for claims + EMR vs claims-only GBMs was higher at various threshold probabilities for mortality and home time loss outcomes but similar for the other 2 outcomes.
Conclusions and Relevance
Machine learning methods offered only limited improvement over traditional logistic regression in predicting key HF outcomes. Inclusion of additional predictors from EMRs to claims-based models appeared to improve prediction for some, but not all, outcomes.
This prognostic study compares several machine learning approaches with traditional logistic regression for development of predictive models for all-cause mortality, heart failure hospitalization, high cost, and loss in home time, among patients with heart failure.
Introduction
With aging of the global population, heart failure (HF) is being recognized as an increasing clinical and public health problem associated with significant mortality, morbidity, and health care expenditures, particularly among patients aged 65 years and older.1 Heart failure is estimated to contribute to 1 in every 8 deaths in the United States.2 Despite progress in reducing HF-related mortality through therapeutic development, hospitalizations for HF remain frequent.3 Total costs of care related to the treatment and management of HF in the United States were estimated to be $31 billion in 2012, with more than two-thirds attributable to direct medical costs.2 A need for optimizing treatment and improving outcomes has led to a large field of predictive modeling in HF. In a systematic review, Rahimi et al4 identified a total of 64 different models predicting either mortality or hospitalizations in patients with HF. Although these models differ substantially in terms of the target populations (eg, inpatients or outpatients, reduced or preserved ejection fraction [EF], or younger or older ages) and prediction risk window (eg, 30-day mortality risk, 1-year mortality risk), they share the ultimate objective of facilitating risk stratification of patients with HF and are noted to have variable success rates with discrimination indices in the range of 0.60 to 0.89.4
There are several shortcomings of the currently available risk prediction models for HF. First, most previous models are developed using traditional statistical approaches, such as regression modeling, and newer alternatives, such as machine learning–based prediction models, have remained underused.5 Second, most models are developed to contain only a small number of important predictors that clinicians can easily access or order to compute a risk score at the bedside to determine the appropriate treatment course for a particular patient. As a result, these models have limited utility to inform population-level interventions because policy makers, such as large insurers, may not be able to obtain information beyond the health care data routinely collected through insurance claims or electronic medical records (EMRs) for enrolled patients. Third, predictive models for outcomes that are important from the payers’ perspective (high cost)6 and from the patients’ perspective (loss in home time)7,8 have not received as much attention as mortality and hospitalization. To address these limitations of the previously proposed models, we undertook this investigation with the primary objective of comparing several machine learning approaches with traditional logistic regression for development of predictive models for all-cause mortality, HF hospitalization, high cost, and loss in home time in patients with HF. Medicare claims data linked to EMRs from 2 large academic health care provider networks (“providers” includes physicians, clinicians, other health care professionals, and their institutions that compose the networks) in Boston, Massachusetts, were used to evaluate the added value of augmenting claims-only predictive models with EMR-derived information.
Methods
Data Source
In this prognostic study, we used 2007-2014 Medicare claims data from Parts A (inpatient coverage), B (outpatient coverage), and D (prescription benefits) that were linked deterministically by beneficiary numbers, date of birth, and sex (linkage success rate, 99.2%) with EMRs for 2 large health care provider networks in the Boston metropolitan area. We identified patients between 2007 and 2013 and used data from January 1, 2007, to December 31, 2014, for outcome assessment. Data from the network with a larger sample size were used for model development (training set), and data from the second network were used for model validation (testing set). The Medicare claims data contain information on demographic characteristics (age, sex, and race/ethnicity), enrollment start and end dates, dispensed medications and performed procedures, and medical diagnosis codes.9 Data not recorded in claims were extracted from the EMR, including laboratory test results and free-text information from patient medical records. A signed data use agreement with the Centers for Medicare & Medicaid Services was available, and the Brigham and Women’s Hospital’s Institutional Review Board approved this study with waiver of individual patient consent based on secondary analysis of existing data that did not require patient recontact or intervention. This study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline for prediction model development and validation.
Study Design
We identified a cohort of patients aged 65 years or older with HF from Medicare fee-for-service claims using International Classification of Diseases, Ninth Revision, codes (listed in eTable 1 in the Supplement)10 after at least 180 days of continuous enrollment in fee-for-service Medicare Parts A, B, and D between 2007 and 2013 and at least 1 recorded EF value in EMRs within 30 days on either side of the claims-based HF diagnosis date. This claims-based HF diagnosis date was defined as the cohort entry date, and a previous 180-day period was defined as the baseline period. After the cohort entry date, patients were followed up for outcomes of interest for 365 days with censoring on Medicare disenrollment or mortality. eFigure 1 in the Supplement summarizes the study design.
Outcomes
We focused on 4 key outcomes of interest in HF. First, all-cause mortality within 365 days of the cohort entry date was identified based on information recorded in Medicare claims. Second, HF hospitalization was identified using Medicare inpatient claims (Part A) based on the primary discharge diagnosis of HF within 365 days of the cohort entry date. Third, total all-cause costs were identified from Medicare Parts A, B, and D, including hospitalization, outpatient, and medication costs, for the 365 days after the cohort entry date. To account for variable follow-up owing to early mortality in some patients, monthly costs were estimated by dividing total costs by total months of follow-up. Based on the resulting average monthly cost distribution per patient, membership in the highest cost decile was identified. Fourth, to quantify the number of days patients spent at home, we subtracted days spent in the hospital and nursing homes from the total follow-up time (ie, between cohort entry date and last date of follow-up) and created a binary variable indicating a loss in home time of 25% or more vs less than 25%. This measure has been shown to have good correlation with patients’ functional status. In a prior study, patients who lost 15 or more days at home had a 3- to 5-fold higher incidence of poor patient-centered outcomes, including poor self-rated health and mobility impairment.7 In another study, reduced home time over 1 year after HF hospitalization was closely correlated with traditional time-to-event mortality and hospitalization outcomes.8
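As a rough illustration of how the cost and home time outcomes described above could be derived, the following Python sketch computes an average monthly cost, a top-decile flag, and a home time loss flag. The function names, the average month length of 30.4375 days, the handling of ties at the decile boundary, and the use of a "25% or more" cutoff are assumptions made for this sketch; they are not taken from the study code.

```python
def monthly_cost(total_cost, followup_days, days_per_month=30.4375):
    """Average monthly cost over observed follow-up (month length is an assumption)."""
    months = followup_days / days_per_month
    return total_cost / months


def top_decile_flags(values):
    """Flag the highest 10% of values by rank; ties at the boundary are broken by order."""
    n = len(values)
    k = max(1, n // 10)  # size of the top decile, at least 1 patient
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    top = set(order[:k])
    return [i in top for i in range(n)]


def home_time_loss(followup_days, hospital_days, nursing_home_days, threshold=0.25):
    """Return (fraction of follow-up not spent at home, flag for loss >= threshold)."""
    days_lost = hospital_days + nursing_home_days
    fraction_lost = days_lost / followup_days
    return fraction_lost, fraction_lost >= threshold
```

For example, a hypothetical patient followed up for 365 days who spent 80 days in the hospital and 20 days in a nursing home would have lost about 27% of home time and would be flagged as having the outcome.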
Predictors
Based on a review of existing literature to identify factors associated with prognosis of HF,4,11 we selected a total of 54 variables from Medicare claims, including demographic characteristics (age, sex, and race/ethnicity), HF-related variables (specific International Classification of Diseases, Ninth Revision, codes indicating systolic, diastolic, left, rheumatic, hypertensive, or unspecified HF, number of HF hospitalizations, site of recorded HF diagnosis at study entry [inpatient or outpatient], history of implantable cardioverter-defibrillator, cardiac resynchronization therapy, or left-ventricular assist device), HF-related medication use, comorbid conditions, and 2 composite scores (claims-based frailty index12 and claims-based socioeconomic status index13,14). eTable 2 in the Supplement contains the full list of variables and operational definitions.
Eight additional variables were extracted from EMRs based on the most proximal recorded value to the cohort entry date during the baseline period, including serum sodium, serum potassium, serum urea nitrogen, serum creatinine, and B-type natriuretic peptide levels; left-ventricular EF value; EF classification (<40% considered reduced; 40%-49%, moderately reduced; or ≥50%, preserved); and body mass index (calculated as weight in kilograms divided by height in meters squared) class (<18 considered underweight; 18-25, healthy; 26-29, overweight; ≥30, obese; or missing). For laboratory results (serum sodium, serum potassium, serum urea nitrogen, and serum creatinine levels), missing values were observed in the range of 5% to 25%. Therefore, we performed multiple imputation with an expectation maximization algorithm for maximum likelihood parameter estimation, based on all other predictors and outcomes, separately within the training and testing data sets. Imputation is widely considered to be beneficial for handling missing data in predictive models, and inclusion of outcomes for imputation is recommended.15,16 For B-type natriuretic peptide level and body mass index, for which the proportion of missing data was substantially higher (54% and 69%, respectively), we did not consider the multiple imputation procedure to be feasible and instead opted for missing indicator categories. Because we required the recording of an EF as a cohort entry criterion, there were no missing values for this variable. Distributions of all predictor variables were reported for training and testing data separately. Standardized differences17 for all variables between training and testing data were reported, in which absolute values greater than 10 may be suggestive of an important difference in distribution of a particular variable between these populations.
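The standardized differences mentioned above can be illustrated with a short Python sketch. It assumes the commonly used definition (difference in means or proportions divided by the pooled SD, multiplied by 100), which appears consistent with the scale on which a threshold of 10 is applied; the function names are illustrative only.

```python
import math


def std_diff_means(mean_1, sd_1, mean_2, sd_2):
    """Standardized difference (x100) for a continuous variable."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return 100 * (mean_1 - mean_2) / pooled_sd


def std_diff_proportions(p_1, p_2):
    """Standardized difference (x100) for a binary variable given as proportions."""
    pooled_sd = math.sqrt((p_1 * (1 - p_1) + p_2 * (1 - p_2)) / 2)
    return 100 * (p_1 - p_2) / pooled_sd
```

With the rounded summary values reported for age (mean [SD], 78 [8] vs 77 [8] years) and white race (91.1% vs 84.2%), this definition gives values close to the 12.5 and 21.1 reported for the training vs testing comparison.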
Machine Learning Modeling Approaches
In addition to the traditional multivariable logistic regression model, we constructed predictive models for each outcome using the following machine learning approaches in the training data. All models were constructed in 2 phases: first using only claims-based predictors, and second adding EMR-based variables to the claims-based predictors.
Least Absolute Shrinkage and Selection Operator
The least absolute shrinkage and selection operator (LASSO) is a regularized regression approach that incorporates a penalty to the log-likelihood function with the goal of shrinking imprecise coefficients toward 0. We used 10-fold cross-validation to select the value of the penalty parameter in a way that minimized the model deviance. The LASSO model offers several key advantages, including consistency in identifying the true underlying model18,19 and effective handling of multicollinearity.20
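For reference, the penalized objective for a LASSO logistic model is usually written as follows. This is the standard textbook form rather than a formula taken from the study; the notation (N patients, p predictors, binary outcome y_i, predictor vector x_i, intercept \beta_0, coefficients \beta, and penalty parameter \lambda) is introduced here for illustration.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Larger values of \lambda shrink more coefficients to exactly 0; cross-validation, as described above, chooses \lambda by minimizing the held-out deviance.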
Classification and Regression Tree
Classification and regression tree (CART) analysis is a nonparametric approach that uses a decision tree framework to progressively segregate values of predictors in binary splits. Every value of the predictor variable is evaluated as a potential split, and the optimal split is determined by the gain in information (decrease in entropy). We implemented CART in a conditional inference framework, whereby stopping criteria based on multiple testing procedures were applied to control overfitting and obviate the need for subjective pruning.21
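The idea of scoring every candidate split by its decrease in entropy can be sketched in Python as follows. This is a minimal illustration for one numeric predictor and a binary outcome, with hypothetical function names; it does not implement the conditional inference stopping rules used in the study.

```python
import math


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 outcomes."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(x, y, cut):
    """Decrease in entropy from splitting on x <= cut vs x > cut."""
    left = [yi for xi, yi in zip(x, y) if xi <= cut]
    right = [yi for xi, yi in zip(x, y) if xi > cut]
    n = len(y)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(y) - child_entropy


def best_split(x, y):
    """Evaluate every observed value (except the maximum) as a cut point."""
    candidates = sorted(set(x))[:-1]  # cutting at the maximum leaves one side empty
    if not candidates:
        return None
    return max(candidates, key=lambda c: information_gain(x, y, c))
```

A full tree applies this search recursively to each resulting subgroup and across all predictors.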
Random Forests
Random forest is a supervised ensemble learning method that builds many decision trees to predict the outcome of interest. We constructed a forest consisting of 500 individual trees. A key advantage of random forests is that, as long as a reasonably large number of trees is constructed, the forest does not require extensive tuning. Random forest error rates are largely insensitive to the number of features selected to split each node.22 Therefore, we used the default of a random sample of √n predictors at each node, where n is the total number of predictors under consideration. The predicted probability was derived based on average prediction across all of the trees.
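Two ingredients of this description, the random subset of about √n predictors considered at each node and the averaging of predictions across trees, can be sketched in Python as follows. Rounding √n down to an integer is an assumption of this sketch, as are the function names; the source does not state how the value was rounded.

```python
import math
import random


def default_mtry(n_predictors):
    """Number of predictors sampled per node: floor of sqrt(n) (rounding is assumed)."""
    return max(1, math.floor(math.sqrt(n_predictors)))


def sample_predictors(predictor_names, rng):
    """Draw the random subset of predictors considered at one node."""
    return rng.sample(predictor_names, default_mtry(len(predictor_names)))


def forest_probability(tree_predictions):
    """Forest prediction for one patient: the average over individual trees."""
    return sum(tree_predictions) / len(tree_predictions)
```

Under this assumption, both the 54 claims-based predictors and the 62 claims + EMR predictors would give 7 candidate predictors per node, before any expansion of categorical variables.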
Gradient-Boosted Model
Gradient-boosted model (GBM) is another tree-based ensemble learning method in which a series of weak classifiers is sequentially constructed and combined, each time aiming to correct errors made in the prediction by the previous classifier, to form a strong learner. We selected a low learning rate (0.01) and interaction depth of 4, as these parameters are noted to have robust performance across a variety of scenarios.23 We evaluated a maximum of 10 000 iterations; to tune the optimal number of iterations, we used 10-fold cross-validation.
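The sequential error-correcting idea can be illustrated with a deliberately simplified Python sketch: boosting of single-split trees (stumps) on one predictor for a binary outcome, where each stump is fit to the current residuals and added with a small learning rate. This is not the study's implementation. It uses depth-1 trees rather than an interaction depth of 4, omits cross-validation for the number of iterations, and takes the plain gradient step rather than any package-specific leaf estimate; it also assumes the predictor has at least 2 distinct values and that both outcome classes are present. All names and default values are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(x, residuals):
    """Least-squares stump on one predictor; returns (cut, left_mean, right_mean)."""
    best = None
    for cut in sorted(set(x))[:-1]:  # skip the maximum so both sides are nonempty
        left = [r for xi, r in zip(x, residuals) if xi <= cut]
        right = [r for xi, r in zip(x, residuals) if xi > cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    return best[1], best[2], best[3]


def fit_gbm(x, y, n_iter=200, learning_rate=0.1):
    """Boost stumps on the log-odds scale, starting from the overall event rate."""
    base_rate = sum(y) / len(y)
    intercept = math.log(base_rate / (1.0 - base_rate))
    scores = [intercept] * len(x)
    stumps = []
    for _ in range(n_iter):
        # Residuals are the negative gradient of the log loss w.r.t. the score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        cut, lm, rm = fit_stump(x, residuals)
        stumps.append((cut, lm, rm))
        scores = [s + learning_rate * (lm if xi <= cut else rm)
                  for xi, s in zip(x, scores)]
    return {"intercept": intercept, "learning_rate": learning_rate, "stumps": stumps}


def predict_gbm(model, xi):
    """Predicted probability: intercept plus the shrunken sum of all stumps."""
    score = model["intercept"]
    for cut, lm, rm in model["stumps"]:
        score += model["learning_rate"] * (lm if xi <= cut else rm)
    return sigmoid(score)
```

A lower learning rate, such as the 0.01 used in the study, makes each correction smaller and therefore typically requires more iterations, which is why the number of iterations is the parameter tuned by cross-validation.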
Statistical Analysis
Performance Evaluation of Candidate Approaches
Performance of all modeling approaches was evaluated using the following parameters in the testing data: (1) Brier score,24 which is a quadratic scoring rule in which the squared differences between actual binary outcomes and predicted probabilities are calculated and lower values indicate higher overall accuracy; (2) area under the receiver operating characteristic curve; and (3) calibration plots characterized by visual inspection and reporting the intercept and slope.24 The intercept’s departure from 0 indicates the extent to which predictions are systematically underpredicting or overpredicting probability of the event of interest. Departure of the slope from 1 indicates that predicted and observed probabilities are farther from the perfect prediction line of 45°. We compared the area under the receiver operating characteristic curves for machine learning models with a logistic regression model using the 2-sided DeLong test at a significance level of .05.25
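The first 2 metrics can be written out in a few lines of Python. The sketch below computes the Brier score as defined above and the C statistic as the proportion of event/nonevent pairs in which the event received the higher predicted probability (ties counted as one-half), which is equivalent to the area under the receiver operating characteristic curve. The function names are illustrative, and the pairwise loop is chosen for clarity rather than speed.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between binary outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def c_statistic(y_true, p_pred):
    """Share of (event, nonevent) pairs ranked correctly; ties count as 0.5."""
    events = [p for y, p in zip(y_true, p_pred) if y == 1]
    nonevents = [p for y, p in zip(y_true, p_pred) if y == 0]
    concordant = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))
```

For a toy set of outcomes 0, 0, 1, 1 with predicted probabilities 0.1, 0.4, 0.35, 0.8, 3 of the 4 event/nonevent pairs are ranked correctly, so the C statistic is 0.75.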
Performance Evaluation of the Selected Approach
After selecting the most efficient statistical modeling approach based on the metrics outlined above for each outcome, we provided additional performance characteristics for the selected approach to focus on the differences between claims-only and claims + EMR versions for those models. First, we constructed precision-recall curves,26 which address the relevant question of what proportion of true cases the algorithm can identify (recall, or sensitivity) and with what level of accuracy (precision, or positive predictive value) at different probability cutoffs. Next, we constructed decision curves27 to summarize the comparative utility of claims-only and claims + EMR versions for those models in selecting a patient population for intervention in terms of net benefit, defined as net increase in the number of true-positive cases identified without an increase in the number of false-positive results at various threshold probability values. In addition, we reported observed probability of events by predicted risk deciles. We reported the 10 most influential predictors for all 4 outcomes selected by these models. Finally, we characterized performance across a variety of subgroups, including HF type (reduced EF, midrange EF, and preserved EF), sex, age (65-74 and ≥75 years), and source of HF diagnosis at study cohort entry (inpatient or outpatient) based on the area under the receiver operating characteristic curve.
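A single point on a precision-recall curve and on a decision curve can be computed as in the Python sketch below. The net benefit formula used here is the one commonly given for decision curve analysis (true-positive rate minus false-positive rate weighted by the odds of the threshold probability); it is stated as an assumption, since the source describes net benefit only in words. Function names are illustrative.

```python
def confusion_counts(y_true, p_pred, threshold):
    """Counts of true positives, false positives, and false negatives at a cutoff."""
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, p_pred) if p < threshold and y == 1)
    return tp, fp, fn


def precision_recall(y_true, p_pred, threshold):
    """Precision (positive predictive value) and recall (sensitivity) at a cutoff.

    Either value is None when its denominator is 0.
    """
    tp, fp, fn = confusion_counts(y_true, p_pred, threshold)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall


def net_benefit(y_true, p_pred, threshold):
    """Net benefit at a threshold probability strictly between 0 and 1."""
    tp, fp, _ = confusion_counts(y_true, p_pred, threshold)
    n = len(y_true)
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))
```

Repeating these calculations over a grid of cutoffs traces the full precision-recall curve and decision curve for a given model.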
Data analysis was conducted from January 1 to December 31, 2018. All models were developed in R software, version 3.4.3 (R Project for Statistical Computing). Code for implementation of these models is publicly available (http://www.drugepi.org/dope-downloads/).
Results
Study Cohort
We included a total of 9502 patients aged 65 years or older in this study with at least 1 HF diagnosis and a recorded measurement of EF within 1 month of the HF diagnosis date; 6113 of these patients were included in the training set and 3389 were used as the testing set. Table 1 summarizes baseline characteristics of these patients. The mean (SD) age was 78 (8) years, 2779 were men (45.5%), and 5571 were white (91.1%) in the training data set; the mean (SD) age was 77 (8) years, 1486 were men (43.8%), and 2853 were white (84.2%) in the testing data set (standardized differences between training and testing data sets of 12.5 for age, 3.4 for male sex, and 21.1 for white race). Distribution of left-ventricular EF–based HF class was similar between training and testing data sets, with 73.8% and 73.7% of patients having preserved EF in the training and testing data sets, respectively (standardized difference, 0.2). Mortality incidence was 20.6% (n = 1259) in the training set and 22.6% (n = 766) in the testing set. Congestive HF hospitalization was observed in 11.3% (n = 693) and 11.4% (n = 387), whereas home time loss of 25% of days or higher during follow-up was observed in 24.0% (n = 1467) and 23.9% (n = 810) in the training and testing sets, respectively.
Table 1. Baseline Characteristics of Medicare-Enrolled Patients With HF Included in the Study, 2007-2014.
Characteristic | Training (n = 6113), No. (%) | Testing (n = 3389), No. (%) | Total (N = 9502), No. (%) | Standardized Difference Between Training and Testing Data Sets |
---|---|---|---|---|
Information Extracted From Medicare Claims, Parts A, B, and D | ||||
Age, mean (SD), y | 78 (8) | 77 (8) | 78 (8) | 12.5 |
Men | 2779 (45.5) | 1486 (43.8) | 4265 (44.9) | 3.4 |
Race/ethnicity | ||||
White | 5571 (91.1) | 2853 (84.2) | 8424 (88.7) | 21.1 |
Black | 196 (3.2) | 290 (8.6) | 486 (5.1) | –23.1 |
Other | 346 (5.7) | 246 (7.3) | 592 (6.2) | –6.5 |
HF class (claims based) | ||||
Diastolic | 947 (15.5) | 442 (13.0) | 1389 (14.6) | 7.2 |
Left | 469 (7.7) | 130 (3.8) | 599 (6.3) | 16.8 |
Rheumatic or hypertensive | 101 (1.7) | 62 (1.8) | 163 (1.7) | –0.8 |
Systolic | 422 (6.9) | 175 (5.2) | 597 (6.3) | 7.1 |
Unspecified | 4174 (68.3) | 2580 (76.1) | 6754 (71.1) | –17.5 |
No. of prior HF hospitalizations | ||||
0 | 5355 (87.6) | 2980 (87.9) | 8335 (87.7) | –0.9 |
1 | 680 (11.1) | 363 (10.7) | 1043 (11.0) | 1.3 |
>1 | 78 (1.3) | 46 (1.4) | 124 (1.3) | –0.9 |
Cohort entry diagnosis in outpatient claims | 3985 (65.2) | 2069 (61.1) | 6054 (63.7) | 8.5 |
Medication | ||||
ACE inhibitors | 2219 (36.3) | 1230 (36.3) | 3449 (36.3) | 0 |
Mineralocorticoid receptor antagonists | 205 (3.4) | 102 (3.0) | 307 (3.2) | 2.3 |
ARBs | 997 (16.3) | 601 (17.7) | 1598 (16.8) | –3.7 |
β-Blockers | 3619 (59.2) | 2023 (59.7) | 5642 (59.4) | –1.0 |
Digoxin | 472 (7.7) | 192 (5.7) | 664 (7.0) | 8.0 |
Hydralazine | 134 (2.2) | 61 (1.8) | 195 (2.1) | 2.9 |
Loop diuretics | 1856 (30.4) | 998 (29.4) | 2854 (30.0) | 2.2 |
Nitrates | 861 (14.1) | 449 (13.2) | 1310 (13.8) | 2.6 |
Potassium-sparing diuretics | 247 (4.0) | 102 (3.0) | 349 (3.7) | 5.4 |
Thiazide diuretics | 1001 (16.4) | 554 (16.3) | 1555 (16.4) | 0.3 |
Total No. of HF medication classes | ||||
0 | 1049 (17.2) | 588 (17.4) | 1637 (17.2) | –0.5 |
1 | 1610 (26.3) | 929 (27.4) | 2539 (26.7) | –2.5 |
2 | 1952 (31.9) | 1076 (31.7) | 3028 (31.9) | 0.4 |
3 | 1291 (21.1) | 673 (19.9) | 1964 (20.7) | 3.0 |
4 | 184 (3.0) | 111 (3.3) | 295 (3.1) | –1.7 |
5 | 27 (0.4) | 12 (0.4) | 39 (0.4) | 0 |
Atrial fibrillation | 2709 (44.3) | 1386 (40.9) | 4095 (43.1) | 6.9 |
Anemia | 2510 (41.1) | 1491 (44.0) | 4001 (42.1) | –5.9 |
Coronary artery bypass graft | 164 (2.7) | 75 (2.2) | 239 (2.5) | 3.2 |
Cardiomyopathy | 1482 (24.2) | 538 (15.9) | 2020 (21.3) | 20.8 |
COPD | 1633 (26.7) | 891 (26.3) | 2524 (26.6) | 0.9 |
Implantable cardioverter-defibrillator | 142 (2.3) | 91 (2.7) | 233 (2.5) | –2.6 |
Depression | 1128 (18.5) | 590 (17.4) | 1718 (18.1) | 2.9 |
Diabetic nephropathy | 303 (5.0) | 157 (4.6) | 460 (4.8) | 1.9 |
Diabetes | 2199 (36.0) | 1291 (38.1) | 3490 (36.7) | –4.3 |
Endocarditis | 51 (0.8) | 38 (1.1) | 89 (0.9) | –3.1 |
Heart transplant | 14 (0.2) | 13 (0.4) | 27 (0.3) | –3.7 |
Hyperkalemia | 452 (7.4) | 220 (6.5) | 672 (7.1) | 3.5 |
Hyperlipidemia | 4167 (68.2) | 2313 (68.3) | 6480 (68.2) | –0.2 |
Hypertension | 5331 (87.2) | 2946 (86.9) | 8277 (87.1) | 0.9 |
Hypotension | 1065 (17.4) | 608 (17.9) | 1673 (17.6) | –1.3 |
Left-ventricular assist device | 0 | 1 (0) | 1 (0) | 0 |
Myocardial infarction | 1141 (18.7) | 946 (27.9) | 2087 (22.0) | –21.9 |
Obesity | 719 (11.8) | 361 (10.7) | 1080 (11.4) | 3.5 |
Other dysrhythmias | 2784 (45.5) | 1783 (52.6) | 4567 (48.1) | –14.2 |
Psychosis | 3603 (58.9) | 2001 (59.0) | 5604 (59.0) | –0.2 |
Pulmonary hypertension | 811 (13.3) | 450 (13.3) | 1261 (13.3) | 0 |
Renal dysfunction | 2141 (35.0) | 1189 (35.1) | 3330 (35.0) | –0.2 |
Cardiac resynchronization therapy | 6 (0.1) | 6 (0.2) | 12 (0.1) | –2.6 |
Rheumatic heart disease | 1682 (27.5) | 604 (17.8) | 2286 (24.1) | 23.3 |
Sleep apnea | 393 (6.4) | 217 (6.4) | 610 (6.4) | 0 |
Smoking | 1374 (22.5) | 728 (21.5) | 2102 (22.1) | 2.4 |
Stable angina | 604 (9.9) | 341 (10.1) | 945 (9.9) | –0.7 |
Stroke | 1044 (17.1) | 562 (16.6) | 1606 (16.9) | 1.3 |
Unstable angina | 809 (13.2) | 575 (17.0) | 1384 (14.6) | –10.6 |
Valve disorders | 1920 (31.4) | 712 (21.0) | 2632 (27.7) | 23.8 |
No. of medications used, mean (SD) | 8.48 (4.80) | 8.61 (4.91) | 8.53 (4.84) | –2.7 |
Office visits, mean (SD), No. | ||||
Cardiologist | 0.74 (1.33) | 0.89 (1.59) | 0.79 (1.43) | –10.2 |
Any physician | 7.97 (6.79) | 9.03 (7.58) | 8.35 (7.10) | –14.7 |
No. of prior all-cause hospitalizations | ||||
0 | 1519 (24.8) | 746 (22.0) | 2265 (23.8) | 6.6 |
1 | 2913 (47.7) | 1604 (47.3) | 4517 (47.5) | 0.8 |
>1 | 1681 (27.5) | 1039 (30.7) | 2720 (28.6) | –7.0 |
Any emergency department visits | 2083 (34.1) | 1132 (33.4) | 3215 (33.8) | 1.5 |
Socioeconomic status index, mean (SD) | 58.40 (5.55) | 57.78 (5.94) | 58.01 (5.81) | 10.8 |
Frailty score, mean (SD) | 0.21 (0.05) | 0.21 (0.05) | 0.21 (0.05) | 0 |
Information Extracted From Electronic Medical Records | ||||
BMI | ||||
Underweight, <18 | 42 (0.7) | 25 (0.7) | 67 (0.7) | 0 |
Healthy, 18-25 | 453 (7.4) | 334 (9.9) | 787 (8.3) | –8.9 |
Overweight, 26-29 | 574 (9.4) | 426 (12.6) | 1000 (10.5) | –10.2 |
Obese, ≥30 | 644 (10.5) | 471 (13.9) | 1115 (11.7) | –10.4 |
Not recorded | 4400 (72.0) | 2133 (62.9) | 6533 (68.8) | 19.5 |
B-type natriuretic peptide, quartile | ||||
Highest | 779 (12.7) | 328 (9.7) | 1107 (11.7) | 9.5 |
Second | 762 (12.5) | 337 (9.9) | 1099 (11.6) | 8.3 |
Third | 767 (12.5) | 350 (10.3) | 1117 (11.8) | 6.9 |
Lowest | 681 (11.1) | 370 (10.9) | 1051 (11.1) | 0.6 |
Unknown | 3124 (51.1) | 2004 (59.1) | 5128 (54.0) | –16.1 |
EF, mean (SD) | 0.57 (0.16) | 0.55 (0.14) | 0.56 (0.15) | 13.3 |
HF class | ||||
Reduced EF, <0.40 | 916 (15.0) | 539 (15.9) | 1455 (15.3) | –2.5 |
Midrange EF, 0.40-0.49 | 684 (11.2) | 353 (10.4) | 1037 (10.9) | 2.6 |
Preserved EF, ≥0.50 | 4513 (73.8) | 2497 (73.7) | 7010 (73.8) | 0.2 |
BUN, mean (SD), mg/dL | 25.12 (15.60) | 25.01 (15.32) | 25.08 (15.50) | 0.7 |
Serum creatinine, mean (SD), mg/dL | 1.28 (0.87) | 1.25 (0.95) | 1.27 (0.90) | 3.3 |
Serum sodium, mean (SD), mEq/L | 138.38 (3.83) | 138.12 (3.76) | 138.29 (3.81) | 6.9 |
Serum potassium, mean (SD), mEq/L | 4.16 (0.50) | 4.14 (0.46) | 4.15 (0.49) | 4.2 |
Abbreviations: ACE, angiotensin-converting enzyme; ARBs, angiotensin receptor blockers; BMI, body mass index (calculated as weight in kilograms divided by height in meters squared); BUN, blood urea nitrogen; COPD, chronic obstructive pulmonary disease; EF, ejection fraction; HF, heart failure.
SI conversion factors: To convert blood urea nitrogen to millimoles per liter, multiply by 0.357; serum creatinine to micromoles per liter, multiply by 88.4; and serum potassium and sodium to millimoles per liter, multiply by 1.
Comparison of Modeling Approaches
Of the 5 candidate modeling approaches, GBM consistently provided the highest discrimination and lowest Brier scores across all 4 outcomes, closely followed by random forests and LASSO (Table 2). Absolute differences in area under the receiver operating characteristic curves between logistic regression and other models were small when using claims-only predictors for all outcomes except HF hospitalization (Table 2; eFigure 2 in the Supplement). C statistics for logistic regression using claims-only predictors were as follows: mortality, 0.724 (95% CI, 0.705-0.744); HF hospitalization, 0.707 (95% CI, 0.676-0.737); high cost, 0.734 (95% CI, 0.703-0.764); and home days loss, 0.781 (95% CI, 0.764-0.798). The C statistics for GBM using claims-only predictors were as follows: mortality, 0.727 (95% CI, 0.708-0.747); HF hospitalization, 0.745 (95% CI, 0.718-0.772); high cost, 0.733 (95% CI, 0.703-0.763); and home days loss, 0.790 (95% CI, 0.773-0.807). The CART model was consistently outperformed by all other approaches. Improvements were noted in accuracy and discrimination for all models when EMR-based predictors were added to claims-only predictors for mortality, HF hospitalization, and home time loss outcomes but not for the cost outcome.
Table 2. Comparison of Models in Predicting Outcomes in Patients With Heart Failure in the Testing Data Set.
Model | Claims Only, Overall Accuracy: Brier Score | Claims Only, Discrimination: C Statistic (95% CI) | Claims Only, Discrimination: P Valuea | Claims + EMR, Overall Accuracy: Brier Score | Claims + EMR, Discrimination: C Statistic (95% CI) | Claims + EMR, Discrimination: P Valuea |
---|---|---|---|---|---|---|
All-Cause Mortality | ||||||
Logistic regression | 0.158 | 0.724 (0.705-0.744) | [Reference] | 0.152 | 0.749 (0.729-0.768) | [Reference] |
LASSO | 0.157 | 0.725 (0.706-0.745) | .27 | 0.152 | 0.750 (0.731-0.769) | .25 |
CART | 0.165 | 0.678 (0.658-0.699) | <.001 | 0.161 | 0.700 (0.680-0.721) | <.001 |
Random forest | 0.156b | 0.723 (0.704-0.743) | .92 | 0.150 | 0.757 (0.739-0.776) | .17 |
GBM | 0.156b | 0.727 (0.708-0.747)b | .45 | 0.148b | 0.767 (0.749-0.786)b | <.001 |
Heart Failure Hospitalization | ||||||
Logistic regression | 0.0898 | 0.707 (0.676-0.737) | [Reference] | 0.0888 | 0.738 (0.711-0.766) | [Reference] |
LASSO | 0.0890 | 0.728 (0.700-0.757) | .002 | 0.0877 | 0.764 (0.738-0.789) | <.001 |
CART | 0.0849 | 0.724 (0.696-0.752) | .13 | 0.0848 | 0.738 (0.710-0.765) | .95 |
Random forest | 0.0876 | 0.740 (0.713-0.767) | .003 | 0.0872 | 0.764 (0.738-0.790) | .007 |
GBM | 0.0847b | 0.745 (0.718-0.772)b | <.001 | 0.0838b | 0.778 (0.753-0.802)b | <.001 |
High Cost (Top Cost Decile) | ||||||
Logistic regression | 0.0774 | 0.734 (0.703-0.764)b | [Reference] | 0.0786 | 0.724 (0.693-0.755) | [Reference] |
LASSO | 0.0772 | 0.732 (0.702-0.763) | .61 | 0.0782 | 0.724 (0.694-0.755) | .92 |
CART | 0.0814 | 0.645 (0.612-0.679) | <.001 | 0.0816 | 0.648 (0.616-0.681) | <.001 |
Random forest | 0.0777 | 0.731 (0.701-0.761) | .78 | 0.0778 | 0.731 (0.703-0.763) | .36 |
GBM | 0.0769b | 0.733 (0.703-0.763) | .96 | 0.0769b | 0.732 (0.701-0.762)b | .39 |
Home Time Loss (>25%) | ||||||
Logistic regression | 0.150 | 0.781 (0.764-0.798) | [Reference] | 0.143 | 0.800 (0.783-0.816) | [Reference] |
LASSO | 0.149 | 0.783 (0.765-0.800) | .23 | 0.144 | 0.800 (0.783-0.816) | .96 |
CART | 0.160 | 0.738 (0.719-0.756) | <.001 | 0.156 | 0.757 (0.739-0.774) | <.001 |
Random forest | 0.149 | 0.784 (0.767-0.801) | .51 | 0.143 | 0.807 (0.791-0.823) | .17 |
GBM | 0.147b | 0.790 (0.773-0.807)b | .009 | 0.138b | 0.816 (0.801-0.832)b | <.001 |
Abbreviations: CART, classification and regression tree; EMR, electronic medical record; GBM, gradient-boosted model; LASSO, least absolute shrinkage and selection operator.
a P values for the DeLong test comparing area under the receiver operating characteristic curves for different models with logistic regression.
b Best performance with respect to the metric (lowest Brier score or highest C statistic).
Visual inspection of the calibration plots (eFigures 3-10 in the Supplement) indicated that GBM was generally well calibrated, with slopes closer to 1 and intercepts closer to 0 across all outcomes. Calibration with GBM was better at the highest-risk strata for the high cost outcome, in which logistic regression had poor calibration (eFigure 7 and eFigure 8 in the Supplement). Based on these observations, we selected GBM as the most consistent modeling approach of the 5 approaches evaluated.
Further Evaluation of GBM
Higher areas under the precision-recall curve were obtained for claims + EMR vs claims-only GBMs predicting mortality (0.484 vs 0.423), HF hospitalization (0.413 vs 0.403), and home time loss (0.575 vs 0.521) but not cost (0.249 vs 0.252) (eFigure 11 in the Supplement). For mortality and home time loss outcomes, the observed probability was higher in the highest-risk strata when using claims + EMR predictors (Figure 1). In line with this observation, the decision curve analysis also suggested that the net benefit of using claims + EMR predictors was higher than that of using the claims-only set at various threshold probability values for mortality and home time loss outcomes but similar for the other 2 outcomes (eFigure 12 in the Supplement).
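The net benefit used in decision curve analysis can be sketched as follows, using the standard definition: true positives per patient minus false positives per patient weighted by the odds of the threshold probability. The function names, the 2 hypothetical models, and the toy data are assumptions made for illustration and do not reproduce the study's analysis.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients with predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)  # relative weight of a false positive
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: intervene on every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical outcomes and predicted risks from 2 invented models.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [0.6, 0.5, 0.3, 0.2, 0.4, 0.3, 0.2, 0.1, 0.1, 0.1]
model_b = [0.7, 0.6, 0.4, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1]

threshold = 0.25
nb_a = net_benefit(outcomes, model_a, threshold)
nb_b = net_benefit(outcomes, model_b, threshold)
nb_all = net_benefit_treat_all(outcomes, threshold)
```

Repeating the calculation over a range of threshold probabilities and plotting the results gives a decision curve; a model adds value at a given threshold when its net benefit exceeds both the treat-all strategy and zero (the treat-none strategy).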
Most Influential Predictors
Figure 2 contains the 10 most influential predictors selected by GBM from claims-only and claims + EMR sets for all outcomes. Age and frailty score were selected in all models with relative influence (RI) in the range of 2.9 to 12.2 for age and 3.5 to 31.5 for frailty score across various models. EMR-based predictors that were selected in all models included serum urea nitrogen (RI range, 3.4-12.2), serum creatinine (RI range, 3.2-5.9), and serum potassium (RI range, 2.5-4.7) levels. For HF hospitalization and high cost outcomes, history of HF hospitalizations (RI, 31.2 in claims + EMR model; 38.5 in claims-only model) and prior cost decile (RI, 45.8 in claims + EMR model; 50.6 in claims-only model), respectively, were the most influential predictors.
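As a rough illustration of how a ranking by relative influence can be produced, the sketch below rescales raw per-predictor importance totals so that they sum to 100 and keeps the top-ranked predictors. The assumption that relative influence is normalized to sum to 100 follows common GBM software defaults and is not stated in the text above; the predictor names and raw values are hypothetical.

```python
def relative_influence(raw_importance):
    """Rescale raw importance totals so they sum to 100 (assumed convention)."""
    total = sum(raw_importance.values())
    return {name: 100.0 * value / total for name, value in raw_importance.items()}


def top_predictors(raw_importance, k=10):
    """Return the k predictors with the largest relative influence, highest first."""
    ri = relative_influence(raw_importance)
    return sorted(ri.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical raw importance totals (e.g., summed loss reduction across splits).
raw = {
    "prior_cost_decile": 9.0,
    "frailty_score": 6.0,
    "age": 3.0,
    "serum_creatinine": 2.0,
}

ranked = top_predictors(raw, k=3)
```

Under this convention, a relative influence of 45 would mean that the predictor accounts for 45% of the total importance in that model, so the values are comparable within a model but not necessarily across models.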
Model Performance in Subgroups
Model performance was generally equivalent across subgroups, including HF type (reduced EF and preserved EF) and sex. However, the discrimination was relatively lower for patients with midrange EF, inpatients, and patients aged 75 years or older for all outcomes (Table 3). For instance, the claims-only model for mortality had discrimination of 0.702 for patients aged 75 years or older and 0.761 for patients aged 65 to 74 years.
Table 3. Subgroup-Specific ROC of the Gradient-Boosted Models in the Testing Data Set.
Subgroup | All-Cause Mortality ROC (95% CI), Claims Only | All-Cause Mortality ROC (95% CI), Claims + EMR | HF Hospitalization ROC (95% CI), Claims Only | HF Hospitalization ROC (95% CI), Claims + EMR | High Cost ROC (95% CI), Claims Only | High Cost ROC (95% CI), Claims + EMR | Home Time Loss ROC (95% CI), Claims Only | Home Time Loss ROC (95% CI), Claims + EMR |
---|---|---|---|---|---|---|---|---|
Reduced EF | 0.738 (0.693-0.783) | 0.761 (0.718-0.804) | 0.729 (0.676-0.782) | 0.750 (0.698-0.801) | 0.729 (0.656-0.801) | 0.723 (0.652-0.795) | 0.807 (0.768-0.847) | 0.830 (0.794-0.866) |
Midrange EF | 0.685 (0.619-0.750) | 0.735 (0.670-0.799) | 0.647 (0.563-0.730) | 0.640 (0.552-0.728) | 0.802 (0.727-0.877) | 0.804 (0.728-0.880) | 0.773 (0.717-0.828) | 0.808 (0.756-0.861) |
Preserved EF | 0.732 (0.709-0.755) | 0.770 (0.748-0.791) | 0.748 (0.712-0.784) | 0.781 (0.749-0.813) | 0.725 (0.689-0.762) | 0.725 (0.688-0.761) | 0.789 (0.770-0.809) | 0.814 (0.795-0.832) |
Women | 0.744 (0.715-0.773) | 0.780 (0.752-0.807) | 0.738 (0.694-0.782) | 0.775 (0.736-0.815) | 0.746 (0.703-0.790) | 0.743 (0.699-0.787) | 0.811 (0.786-0.835) | 0.837 (0.813-0.860) |
Men | 0.713 (0.687-0.740) | 0.756 (0.731-0.781) | 0.748 (0.713-0.783) | 0.782 (0.750-0.813) | 0.722 (0.680-0.764) | 0.722 (0.680-0.763) | 0.774 (0.751-0.796) | 0.800 (0.778-0.821) |
Inpatients | 0.709 (0.679-0.740) | 0.755 (0.726-0.783) | 0.685 (0.642-0.728) | 0.719 (0.680-0.759) | 0.699 (0.646-0.752) | 0.698 (0.646-0.751) | 0.736 (0.707-0.765) | 0.772 (0.745-0.800) |
Outpatients | 0.737 (0.712-0.762) | 0.774 (0.750-0.798) | 0.769 (0.732-0.806) | 0.806 (0.773-0.838) | 0.753 (0.716-0.790) | 0.751 (0.714-0.788) | 0.822 (0.802-0.842) | 0.842 (0.823-0.862) |
Age, 65-74 y | 0.761 (0.731-0.790) | 0.793 (0.764-0.821) | 0.753 (0.706-0.801) | 0.801 (0.761-0.841) | 0.742 (0.701-0.783) | 0.742 (0.700-0.784) | 0.805 (0.780-0.831) | 0.833 (0.808-0.857) |
Age, ≥75 y | 0.702 (0.676-0.728) | 0.747 (0.722-0.771) | 0.730 (0.695-0.764) | 0.758 (0.725-0.790) | 0.719 (0.675-0.763) | 0.715 (0.671-0.759) | 0.778 (0.756-0.800) | 0.803 (0.782-0.824) |
Abbreviations: EF, ejection fraction; EMR, electronic medical record; HF, heart failure; ROC, area under the receiver operating characteristic curve.
Discussion
In this study, we constructed predictive models for 4 important outcomes in HF using routinely collected health care data from insurance claims and EMRs of 1 health care provider network followed by an independent validation using data from a second network. We observed that machine learning methods, including tree-based ensemble approaches and penalized regression, offered only limited improvement over the widely used logistic regression. Although augmenting claims data with detailed EMR-derived predictors resulted in notable improvement in model performance for certain outcomes, including mortality and home days loss, such improvement was not seen for prediction of high future costs.
Our study adds to a growing body of literature indicating limited performance improvement with machine learning approaches over logistic regression for clinical risk prediction problems and additionally offers several insights. In a large meta-analysis of 71 studies, Christodoulou et al28 found no evidence supporting the hypothesis that clinical prediction models based on machine learning have improved discrimination. For HF specifically, Frizzell et al5 concluded that use of a number of machine learning approaches did not improve prediction of 30-day readmissions compared with logistic regression. In our study, we observed that when using only claims-based predictors, many of which are binary variables indicating presence or absence of medical conditions or use of specific medications, the performance improvement with machine learning approaches was minimal for prediction of most outcomes. However, when the predictor set was expanded to include EMR-based information, which included numerous laboratory test results as continuous variables, we noted that machine learning approaches generally fared better than logistic regression. This observation follows the intuition that, because tree-based machine learning approaches, such as GBM or random forests, are nonparametric and do not assume linearity for a predictor-outcome association, they are usually more adept at generating predictions based on continuous variables. Furthermore, we observed that meaningful improvement in prediction of certain health care use type outcomes, such as high cost, may be more difficult to achieve even with the addition of more granular EMR-based predictors.
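The intuition about continuous predictors can be illustrated with a small, invented example in which risk is elevated at both low and high values of a laboratory measurement. A score that is linear in the value cannot rank such patients well, whereas threshold splits of the kind a tree can learn separate them. The data and cut points below are hypothetical and hand-picked; this is a sketch of the idea, not a fitted model from the study.

```python
def c_statistic(y_true, scores):
    """Share of event/non-event pairs ranked correctly (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(pos) * len(neg))


# Hypothetical laboratory values with a U-shaped risk pattern:
# events occur at both the low and the high end of the range.
lab_value = [3.0, 3.2, 3.4, 4.0, 4.2, 4.4, 4.6, 5.4, 5.6, 5.8]
event = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]

# A single linear term ranks patients monotonically in the laboratory value.
linear_score = lab_value

# Two threshold splits, as a tree could learn; the cut points are hand-picked here.
LOW_CUT, HIGH_CUT = 3.7, 5.0
tree_score = [1.0 if (x < LOW_CUT or x > HIGH_CUT) else 0.0 for x in lab_value]

auc_linear = c_statistic(event, linear_score)
auc_tree = c_statistic(event, tree_score)
```

In this toy setting the linear score discriminates no better than chance, while the split-based score separates events from non-events. A regression model can recover such patterns if nonlinear terms (for example, splines or quadratic terms) are specified in advance, which tree-based methods do not require.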
In addition to the methodologic learnings, the models constructed and validated in this study may also be important from an applications standpoint. The primary intended use of these models is to facilitate risk stratification for key patient-level outcomes using routinely collected health care data, so that a high-risk target population can be identified for population-based interventions. For instance, an insurer interested in deploying interventions, such as home nurse visits, to ensure optimal HF management and downstream cost savings could apply the models from this study to its administrative data to identify members with a high 1-year risk of HF hospitalization, which may help direct finite resources to the patients most likely to benefit.
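The risk stratification use case described above can be sketched as ranking patients by predicted risk and selecting as many as the intervention program can accommodate. The function name, patient identifiers, risk values, and capacity below are hypothetical.

```python
def select_high_risk(risk_by_patient, capacity):
    """Return the IDs of the `capacity` patients with the highest predicted risk."""
    ranked = sorted(risk_by_patient.items(), key=lambda item: item[1], reverse=True)
    return [patient_id for patient_id, _ in ranked[:capacity]]


# Hypothetical predicted 1-year risks of HF hospitalization.
predicted_risk = {
    "patient_01": 0.08,
    "patient_02": 0.41,
    "patient_03": 0.15,
    "patient_04": 0.33,
    "patient_05": 0.05,
}

# Suppose the program (e.g., home nurse visits) can enroll 2 patients.
target_population = select_high_risk(predicted_risk, capacity=2)
```

Selecting by capacity rather than by a fixed risk cutoff is one possible design choice; a cutoff chosen from a decision curve is an alternative when the harms and benefits of the intervention can be weighed explicitly.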
Strengths and Limitations
There are some key strengths of this study. First, we reported discrimination as well as calibration of the models from an independent validation sample. Furthermore, our models included 2 key predictors that were not used in previous models (a frailty score12 and a composite score as a proxy for socioeconomic status13), both of which appeared to improve prediction meaningfully independent of other variables. In addition, we studied 4 different outcomes and were able to generate insights based on model performance in predicting each of these outcomes.
There are some limitations of the study. First, our data source contained patients from only 1 geographic region of the United States, which limits generalizability and requires validation in other populations. Second, our model validation was conducted only with concurrent patients in different health care provider networks without additional prospective validation within the same provider networks. Third, we used only structured and curated predictor variables in our machine learning approaches; future research is required to test the improvement in prediction of HF-specific outcomes offered by machine learning approaches that are able to mine unstructured information, such as clinicians’ free-text notes.29 Fourth, we used only a subset of machine learning approaches and, therefore, cannot comment on performance of approaches that were not evaluated herein, such as neural networks and support vector machines. In addition, we focused on administrative claims–based prediction and augmented claims data with select EMR-based variables. We did not evaluate model performance based on EMR data alone, which is an important limitation because such a model could be useful for clinicians as they weigh various care options for patients during medical visits.
Conclusions
Machine learning methods offered limited improvement over logistic regression in predicting key outcomes in HF based on administrative claims. Inclusion of additional clinical parameters from EMRs improved prediction for some, but not all, outcomes. Models constructed in this report may be helpful in identifying a high-risk target population for deploying population-based interventions.
References
- 1. Mozaffarian D, Benjamin EJ, Go AS, et al; Writing Group Members; American Heart Association Statistics Committee; Stroke Statistics Subcommittee. Executive summary: heart disease and stroke statistics—2016 update: a report from the American Heart Association. Circulation. 2016;133(4):-. doi: 10.1161/CIR.0000000000000366
- 2. Benjamin EJ, Virani SS, Callaway CW, et al; American Heart Association Council on Epidemiology and Prevention Statistics Committee and Stroke Statistics Subcommittee. Heart disease and stroke statistics—2018 update: a report from the American Heart Association. Circulation. 2018;137(12):e67-e492. doi: 10.1161/CIR.0000000000000558
- 3. Blecker S, Paul M, Taksler G, Ogedegbe G, Katz S. Heart failure–associated hospitalizations in the United States. J Am Coll Cardiol. 2013;61(12):1259-1267. doi: 10.1016/j.jacc.2012.12.038
- 4. Rahimi K, Bennett D, Conrad N, et al. Risk prediction in patients with heart failure: a systematic review and analysis. JACC Heart Fail. 2014;2(5):440-446. doi: 10.1016/j.jchf.2014.04.008
- 5. Frizzell JD, Liang L, Schulte PJ, et al. Prediction of 30-day all-cause readmissions in patients hospitalized for heart failure: comparison of machine learning and other statistical approaches. JAMA Cardiol. 2017;2(2):204-209. doi: 10.1001/jamacardio.2016.3956
- 6. Greiner MA, Hammill BG, Fonarow GC, et al. Predicting costs among Medicare beneficiaries with heart failure. Am J Cardiol. 2012;109(5):705-711. doi: 10.1016/j.amjcard.2011.10.031
- 7. Lee H, Shi SM, Kim DH. Home time as a patient-centered outcome in administrative claims data. J Am Geriatr Soc. 2019;67(2):347-351. doi: 10.1111/jgs.15705
- 8. Greene SJ, O’Brien EC, Mentz RJ, et al. Home-time after discharge among patients hospitalized with heart failure. J Am Coll Cardiol. 2018;71(23):2643-2652. doi: 10.1016/j.jacc.2018.03.517
- 9. Hennessy S. Use of health care databases in pharmacoepidemiology. Basic Clin Pharmacol Toxicol. 2006;98(3):311-313. doi: 10.1111/j.1742-7843.2006.pto_368.x
- 10. McCormick N, Lacaille D, Bhole V, Avina-Zubieta JA. Validity of heart failure diagnoses in administrative databases: a systematic review and meta-analysis. PLoS One. 2014;9(8):e104519. doi: 10.1371/journal.pone.0104519
- 11. Ouwerkerk W, Voors AA, Zwinderman AH. Factors influencing the predictive power of models for predicting mortality and/or heart failure hospitalization in patients with heart failure. JACC Heart Fail. 2014;2(5):429-436. doi: 10.1016/j.jchf.2014.04.006
- 12. Kim DH, Schneeweiss S, Glynn RJ, Lipsitz LA, Rockwood K, Avorn J. Measuring frailty in Medicare data: development and validation of a claims-based frailty index. J Gerontol A Biol Sci Med Sci. 2018;73(7):980-987. doi: 10.1093/gerona/glx229
- 13. Bonito A, Bann C, Eicheldinger C, Carpenter L. Creation of New Race-Ethnicity Codes and Socioeconomic Status (SES) Indicators for Medicare Beneficiaries: Final Report, Sub-Task 2. Rockville, MD: Agency for Healthcare Research and Quality; January 2008. AHRQ publication 08-0029-EF.
- 14. Gopalakrishnan C, Gagne JJ, Sarpatwari A, et al. Evaluation of socioeconomic status indicators for confounding adjustment in observational studies of medication use. Clin Pharmacol Ther. 2019;105(6):1513-1521. doi: 10.1002/cpt.1348
- 15. Steyerberg EW, van Veen M. Imputation is beneficial for handling missing data in predictive models. J Clin Epidemiol. 2007;60(9):979. doi: 10.1016/j.jclinepi.2007.03.003
- 16. Sterne JA, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393. doi: 10.1136/bmj.b2393
- 17. Austin PC. Using the standardized difference to compare the prevalence of a binary variable between two groups in observational research. Commun Stat Simul Comput. 2009;38(6):1228-1234. doi: 10.1080/03610910902859574
- 18. Chand S. On tuning parameter selection of LASSO-type methods—a Monte Carlo study. Paper presented at: Applied Sciences and Technology (IBCAST) 2012. 9th International Bhurban Conference; January 9-12, 2012; Islamabad, Pakistan. https://ieeexplore.ieee.org/document/6177542. Accessed January 31, 2018.
- 19. Zhang Y, Li R, Tsai C-L. Regularization parameter selections via generalized information criterion. J Am Stat Assoc. 2010;105(489):312-323. doi: 10.1198/jasa.2009.tm08013
- 20. Oyeyemi GM, Ogunjobi EO, Folorunsho AI. On performance of shrinkage methods—a Monte Carlo study. Int J Stat Appl. 2015;5(2):72-76. doi: 10.5923/j.statistics.20150502.04
- 21. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651-674. doi: 10.1198/106186006X133933
- 22. Breiman L. Random forests. Mach Learn. 2001;45(1):5-32. doi: 10.1023/A:1010933404324
- 23. Hastie T, Tibshirani R, Friedman J. Boosting and additive trees. In: The Elements of Statistical Learning. New York, NY: Springer; 2009:337-387. doi: 10.1007/978-0-387-84858-7_10
- 24. Steyerberg EW, Vickers AJ, Cook NR, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128-138. doi: 10.1097/EDE.0b013e3181c30fb2
- 25. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837-845. doi: 10.2307/2531595
- 26. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. doi: 10.1371/journal.pone.0118432
- 27. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565-574. doi: 10.1177/0272989X06295361
- 28. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12-22. doi: 10.1016/j.jclinepi.2019.02.004
- 29. Rajkomar A, Oren E, Chen K, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1(1):18. doi: 10.1038/s41746-018-0029-1
Associated Data