Abstract
Objective
We aimed to investigate bias in applying machine learning to predict real-world individual treatment effects.
Materials and Methods
Using a virtual patient cohort, we simulated real-world healthcare data and applied random forest and gradient boosting classifiers to develop prediction models. Treatment effect was estimated as the difference between the predicted outcomes of a treatment and a control. We evaluated the impact on predicting individual outcome of predictors with known effects on treatment and outcome (ie, treatment predictors [X1], confounders [X2], treatment effects modifiers [X3], and other outcome risk factors [X4]) and of outcome imbalance. Using counterfactuals, we evaluated the percentage of patients with biased predicted individual treatment effects.
Results
X4 had relatively more impact on model performance than X2 and X3 did; no effect was observed for X1. Moderate-to-severe outcome imbalance had a significantly negative impact on model performance, particularly among subgroups in which an outcome occurred. Bias in predicting individual treatment effects was significant and persisted even when the models had 100% accuracy in predicting the health outcome.
Discussion
Inadequate inclusion of X2, X3, and X4 and moderate-to-severe outcome imbalance may affect model performance in predicting individual outcome and subsequently introduce bias in predicting individual treatment effects. Machine learning models with all features and high performance for predicting individual outcome still yielded biased individual treatment effects.
Conclusions
Direct application of machine learning might not adequately address bias in predicting individual treatment effects. Further method development is needed to advance machine learning to support individualized treatment selection.
Keywords: precision medicine, machine learning, comparative treatment effectiveness, real-world evidence, virtual patient cohort
INTRODUCTION
One of the most common and challenging issues facing clinicians daily is how to choose optimal treatment for a specific patient. The average treatment effects from clinical studies may not be applicable to all patients due to large heterogeneity in treatment response across individuals in real-world clinical practice.1–7 Individualized treatment selection requires the ability to target specific treatments to those most likely to benefit from them and least likely to be harmed by them.3–6,8,9 Machine learning, by enabling the prediction of individual patient response to specific treatments, offers an exciting methodological approach for predicting individual treatment effects.10–14 With the advent of “big data” (ie, availability of large claims and electronic health record databases), combined with machine learning capabilities, such prediction for individualized treatment selection is now technically feasible.
Machine learning prediction uses machine learning algorithms to train a prediction model based on measured input features (predictors) and a labeled outcome from a dataset to predict whether an outcome would occur for a subject.10–14 The validity of a machine learning prediction model is typically assessed by how well the predicted outcome matches the observed outcome for subjects.10–14 Machine learning models have been applied in the literature to predict individual health outcomes.15,16 However, it is more challenging to apply machine learning to predict individual treatment effects using real-world healthcare data. A treatment effect is different from a health outcome: a treatment effect is the difference between the outcome of a treatment (eg, a drug) and the outcome of a control (eg, a placebo).2,17–20 In real life, each patient can only have the outcome of either a treatment or a control at a given time. Therefore, unlike individual health outcomes, individual treatment effects cannot be directly labeled (measured) in real-world healthcare data, and thus are not available to directly train prediction models. Individual treatment effects must instead be estimated by first predicting the outcomes of a patient if they receive a treatment and if they receive a control, and then calculating their difference.
Prior methodological studies have shown that confounding or bias needs to be addressed to estimate causal (true) treatment effects using real-world observational data.2,17–20 Bias in treatment effects estimation arises when the difference between the predicted outcomes of treatment and control cannot be attributed solely to the treatment itself but also reflects other factors, such as differences in the characteristics of patients who received the treatment and of patients who received the control.2,17–20 Because only the outcome of either the treatment or the control can be observed for an individual patient, the outcomes of patients who received the control are used to estimate or predict the outcome a treated patient would have had if they had received the control, and vice versa.2,17–20 This estimation process may result in biased estimates of treatment effects for a patient.2,17–20
Many factors in real-world healthcare data may bias estimated treatment effects. It is well known in clinical epidemiology and pharmacoepidemiology that factors such as confounders, treatment effects modifiers, and other outcome risk factors have an impact on estimating treatment effects (see Figure 1 for the concept illustration). However, it is unknown whether these factors will affect the prediction of individual treatment outcome and subsequently the bias in predicting individual treatment effects. Furthermore, it is unknown whether having all these factors measured as input features in machine learning models will lead to unbiased prediction of individual treatment effects. In other words, will there be any bias in predicting individual treatment effects if the prediction model has all the needed features and high accuracy in predicting individual outcome?
Figure 1.
Analytical framework of virtual patient cohort simulation for treatment and outcomes.
In addition, real-world clinical treatment or health outcomes are usually not balanced (eg, 20% of patients had a stroke and 80% did not). Such imbalance may affect prediction performance in machine learning models. Nonetheless, a void exists in understanding how commonly encountered treatment outcome distributions may affect the prediction performance of machine learning models.
To better understand these real-world issues, we used a simulated virtual patient cohort to investigate these questions. We used a virtual patient cohort database instead of a real-patient cohort database for the following reasons. First, true individual treatment effects are typically unknown in real-world healthcare data but are needed to assess model prediction bias. Second, an existing real-patient database may not contain all the important predictors or features. Such omissions make it difficult to assess whether differences in prediction performance are due to inadequate training of machine learning models or insufficient measurement of predictors or features. Each patient in our virtual patient cohort has a unique profile of characteristics (predictors), and the cohort, by design, not only mimics real-world healthcare data, but also has complete capture of all relevant predictors and their effects (including true individual treatment effects). To develop the prediction model, we applied 2 machine learning algorithms: the random forest classifier and the gradient boosting classifier. We selected these 2 because of their ease of interpretation for clinical data with binary outcomes (labels) and their ability to address overfitting (a problem common in machine learning prediction).21–23 Specifically, we aimed to (1) evaluate the impact of different types of predictors on machine learning model performance in predicting an individual treatment outcome, (2) determine the impact of outcome imbalance on machine learning model performance in predicting individual treatment outcome, and (3) assess the bias in estimating treatment effects from the direct application of machine learning models. In so doing, our results inform efforts to use machine learning to predict treatment outcomes using real-patient healthcare data in the clinical setting.
MATERIALS AND METHODS
Simulation of a virtual patient cohort
As depicted in Figure 1, we delineate all the factors (patient characteristics) in predicting treatment and outcomes into 4 categories: factors that only affect treatment choice (X1); factors related to both treatment choice and outcomes, or confounders (X2); factors that may only affect the treatment response across individual patients, or treatment effects modifiers (X3); and individual factors only related to outcomes, or other outcome risk factors (X4).
Boxes 1 and 2 present the design and process used for simulation of the virtual patient cohort. This simulation is an extension of an analytical framework on treatment effects heterogeneity that has been published previously.24 In brief, we simulated a cohort of 50 000 patients with binary treatment (T) and outcome (Y). Patient characteristics were simulated as 100 binary variables with 10 for X1, 20 for X2, 30 for X3, and 40 for X4. Based on our observation of Medicare data from our previous studies,25–28 we simulated the prevalence of the characteristics as random variables with a negative binomial distribution within the range of 0.01-0.80 for each variable category with different random seeds (Supplementary Appendix 1). For X2, X3, and X4, the predictors of outcomes, we further applied a restricting parameter (b0) to their prevalence as a way to titrate the prevalence of outcomes in the study cohort. The value (1 or 0) of each variable for each patient was simulated using the SAS RAND function with a Bernoulli distribution. Correlation and multicollinearity of patient clinical and demographic characteristics are common in real-world healthcare data. Thus, in this study, the correlation between 2 variables in the same variable category was simulated as the phi correlation coefficient (the Pearson correlation coefficient for binary variables).29,30 The simulated correlations were selected from a normal distribution with different random seeds for different categories of predictors. The distribution of the correlations reflects that observed in the Medicare data in our previous studies.25–28 These simulated correlations between 2-variable pairs further yielded multicollinearity in the simulation (Supplementary Appendix 2). The minimum and maximum values of the correlations were capped by 4 equations derived from the phi correlation coefficient formula and the simulated prevalences of Xj,s to avoid invalid cell percentages (eg, negative or over 100%). Figure 2 presents the comparisons of distributions of variable prevalence, effects on outcome, and correlation between the virtual patient cohort and the real-patient cohort.25–28 The comparison shows that the key characteristics of the simulated virtual patient cohort are consistent with the real-patient cohort.
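For illustration, a minimal numpy sketch of this characteristic-simulation step is shown below. It is a Python stand-in for the SAS RAND step described above; the between-variable correlation structure is omitted, and all parameter values (random seed, negative binomial parameters, scaling) are illustrative assumptions rather than the study's actual settings.

```python
# Minimal sketch: simulate binary patient characteristics with variable-specific
# prevalences (correlation structure omitted; parameters are illustrative).
import numpy as np

rng = np.random.default_rng(42)
n_patients = 50_000
n_vars = {"X1": 10, "X2": 20, "X3": 30, "X4": 40}

cohort = {}
for group, m in n_vars.items():
    # Draw a prevalence per variable, clipped to the 0.01-0.80 range used in the study
    # (the study used a negative binomial distribution; a scaled draw stands in here).
    prevalence = np.clip(rng.negative_binomial(5, 0.3, size=m) / 50, 0.01, 0.80)
    # Bernoulli draw: each patient gets 1 with probability equal to the variable's prevalence.
    cohort[group] = (rng.random((n_patients, m)) < prevalence).astype(int)

print({group: values.shape for group, values in cohort.items()})
```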
Table 3.
Bias in estimating individual treatment effect from prediction models
| Models (Input Predictors/Features) | Random Forest: Bias | Random Forest: Error | Random Forest: Difference (Bias − Error) | Gradient Boosting: Bias | Gradient Boosting: Error | Gradient Boosting: Difference (Bias − Error) |
|---|---|---|---|---|---|---|
| T, X1-X4 | 0.22 | 0.12 | 0.10 | 0.11 | 0.00 | 0.11 |
| T, X2-X4 | 0.22 | 0.12 | 0.10 | 0.11 | 0.00 | 0.11 |
| T, X3, X4 | 0.25 | 0.13 | 0.12 | 0.17 | 0.03 | 0.14 |
| T, X2, X3 | 0.34 | 0.24 | 0.10 | 0.31 | 0.14 | 0.17 |
| T, X2, X4 | 0.30 | 0.18 | 0.12 | 0.25 | 0.06 | 0.19 |
Data in the bias columns indicate the proportion of patients with biased treatment effect estimation. Data in the error columns indicate the proportion of patients with the wrong outcome prediction.
T: treatment; X1: factors only affecting treatment choice; X2: factors related to both treatment choice and outcomes (confounders); X3: factors that may affect the treatment effects in individual patients (effects modifiers); X4: individual risk factors only related to outcomes.
Box 1. Simulation design and process: virtual patient cohort
| Notations and setup |
| Simulate prevalence and correlation of Xj,s |
| Two-stage treatment and outcome simulation |
| Stage 1. Simulate treatment. Treatment modeling—latent variable model for treatment selection/decision: |
| T* = a0 + a1*X1 + a2*X2 + V*(-βT) + e (Equation 1) |
| T* = latent variable for treatment |
| X1 = Vector (X1, s= 1, 2, …m1), m1 (the number of variables) = 10 |
| X2 = Vector (X2, s= 1, 2, …m2), m2 = 20 |
| a0 = The expected value on the cost from the treatment |
| V = The expected value on avoiding outcome Y |
| βT = The treatment effects on outcome Y unconditioned on X3 |
| V*(- βT) = The expected value of the treatment effectiveness in reducing the risk of outcome |
| a1 = Vector (a1, s). a1, s = the expected effects of X1 on T |
| a2 = Vector (a2, s). a2, s = the expected effects of X2 on T |
| e = Error term ∼ norm (0, 1)*d, the residuals of expected value in treatment outcome across individuals with a mean of 0 and standard deviation of d |
| If T*> v∼ then T = 1, else T = 0, v∼ = the expected value threshold for the use of the treatment. |
| Fixing parameter values: |
| v∼ = The mean of (T*i=1, 2,…N), a0 = -140, V = 800, βT = 0.5, d=150 for e ∼ norm (0, 1)*150 |
| a1, s = 5*s/(1+m1); a2, s = 10*s/(1+m2). |
| Stage 2. Simulate outcome. Outcome modeling—using the simulated T |
| Pr(Y=1|T, X:β) = Logistic(β0 + β2*X2 + T*(-βT + β3*X3) + β4*X4) (Equation 2) |
| X3 = Vector (X3, s= 1, 2, …m3), m3 = 30 |
| X4 = Vector (X4, s= 1, 2, …m4), m4 = 40 |
| β0 = The baseline outcome risk without treatment (T=0) |
| β2 = Vector of (β2, s), the effects of X2 on Y |
| β3 = Vector of (β3, s), the effects of X3 on the effects of T on Y |
| β4 = Vector of (β4, s), the effects of X4 on Y |
| βT = Treatment effects, the effects of T on Y without X3 |
| Y is simulated for each patient by feeding Pr(Y=1|T, X:β) to random variable generator SAS Rand function with Bernoulli distribution. |
| Fixing parameter values: |
| β0 = -2; βT = -2; |
| β2 random variable ∼ norm (0.4, 2) with a different random seed |
| β3 random variable ∼ norm (0.5, 3) with a different random seed |
| β4 random variable ∼ norm (0.3, 2) with a different random seed |
| b0= 0.05, 0.40, 0.80, and 1.00 for Pr(Y=1) of about 12%, 35%, 45%, and 50% respectively |
Figure 2.
Comparison of (A) variable prevalence, (B) effects on outcome, and (C) correlation distribution between the virtual patient and real-patient cohort. The outcome prevalence in the real-patient cohort is about 30% and in the virtual cohort is about 35%.
A 2-stage modeling approach was used for the simulation of treatment and outcome.24 In the first stage, we used a latent-variable method for treatment decision modeling to simulate treatment choice (Equation 1 in Box 1).24 A patient is assigned to receive the treatment (T = 1) if the net expected value (T*) from treatment cost (a0), effects of X1 and X2, and treatment benefit (V*(-βT)) is greater than a value threshold (v∼). Otherwise, T = 0. We set the value of v∼ to the mean of T* so that 50% of the cohort received the treatment (T = 1). In the second stage, the simulated treatment (T) was then applied for the simulation of the outcome (Y) (Equation 2 in Box 1). The effects of X2, X3, and X4 on Y (β2, β3, and β4) were simulated as random variables with normal distributions. We fixed the value of b0 to 0.05, 0.40, 0.80, and 1.00 to generate cohorts with the prevalence of outcome (Y = 1) of about 12%, 35%, 45%, and 50%, respectively.
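A hedged sketch of this 2-stage simulation (Equations 1 and 2 in Box 1) follows, continuing from the arrays in the previous sketch. The coefficient values mirror the fixed parameters reported in Box 1; the b0 prevalence-restricting step is omitted, and the random seeds and any other detail not stated in Box 1 are assumptions rather than the study's exact implementation.

```python
# Stage 1: latent-variable treatment model, T* = a0 + a1*X1 + a2*X2 + V*(-beta_T) + e
X1, X2, X3, X4 = cohort["X1"], cohort["X2"], cohort["X3"], cohort["X4"]
a0, V, beta_T, d = -140.0, 800.0, 0.5, 150.0          # fixed values from Box 1
a1 = 5 * np.arange(1, X1.shape[1] + 1) / (1 + X1.shape[1])
a2 = 10 * np.arange(1, X2.shape[1] + 1) / (1 + X2.shape[1])
e = rng.normal(0, 1, n_patients) * d
T_star = a0 + X1 @ a1 + X2 @ a2 + V * (-beta_T) + e
T = (T_star > T_star.mean()).astype(int)              # threshold at the mean so ~50% are treated

# Stage 2: outcome model, Pr(Y=1) = logistic(beta0 + b2*X2 + T*(-beta_T_out + b3*X3) + b4*X4)
beta0, beta_T_out = -2.0, -2.0                        # fixed values from Box 1 (stage 2)
b2 = rng.normal(0.4, 2, X2.shape[1])
b3 = rng.normal(0.5, 3, X3.shape[1])
b4 = rng.normal(0.3, 2, X4.shape[1])
logit = beta0 + X2 @ b2 + T * (-beta_T_out + X3 @ b3) + X4 @ b4
p_y = 1 / (1 + np.exp(-np.clip(logit, -30, 30)))      # clip to avoid numerical overflow
Y = (rng.random(n_patients) < p_y).astype(int)        # Bernoulli draw of the outcome
```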
To estimate causal treatment effects, the counterfactual outcome for the counterfactual treatment is needed. Counterfactual outcome refers to the health outcome that would occur if the same patient were treated with an alternative treatment (including no treatment) instead of the original treatment at the time the original treatment was received. Because this hypothetical condition can never happen in real-world settings, the alternative treatment for the same patient at the same time is referred to as the counterfactual treatment. Both randomized clinical trials and observational studies use proxies of counterfactuals to estimate causal treatment effects. Box 2 presents our simulation for the counterfactual treatment (Tc) and counterfactual outcome (Yc) as well as the calculation of the true and predicted treatment effects from the model prediction. In the simulation, we assigned Tc = 0 for patients whose T = 1 and Tc = 1 for patients whose T = 0. Then we simulated the counterfactual outcome (Yc) for each patient by replacing T with Tc in Equation 3. The individual true treatment effects (TEi) are defined as the difference between the individual outcome (Yi) from the original treatment and the individual counterfactual outcome (Yc, i) from the counterfactual treatment (Tc). The individual estimated treatment effects (E[TEi]) are defined as the difference between the predicted individual outcome (E[Yi]) from the original treatment and the predicted individual counterfactual outcome (E[Yc, i]). Both E[Yi] and E[Yc, i] are generated from the final machine learning model to predict individual treatment outcome.
Box 2. Simulation design and process: treatment and outcome counterfactuals and treatment effects
| a. Simulate counterfactuals |
| Tc = 1 – T |
| Tc = Counterfactual treatment |
| Pr(Yc=1|Tc, X:β) = Logistic(β0 + β2*X2 + Tc*(-βT + β3*X3) + β4*X4) (Equation 3) |
| Yc = Counterfactual outcomes |
| Yc is simulated for each patient by feeding Pr(Yc=1|Tc, X:β) to the random variable generator SAS Rand function with Bernoulli distribution. |
| b. Compute true and estimated treatment effects, and bias |
| TEi = Yi – Yc, i |
| TEi = True treatment effects from T for patient i. |
| E[TEi] = E[Yi] – E[Yc, i] |
| E[TEi] = Estimated treatment effects from T for patient i. |
| E[Yi] = Predicted treatment outcome Y for patient i |
| E[Yc, i] = Predicted counterfactual outcome Yc for patient i |
| Biasi = 0 if E[TEi] = TEi or 1 if E[TEi] ≠ TEi |
| Biasi = Indicator of whether the predicted treatment effects differ from the true treatment effects for patient i |
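The counterfactual quantities in Box 2 can be sketched in the same way, continuing from the stage 2 variables above. This is an illustration of the definitions in Box 2 under the assumptions already stated, not the study's SAS code.

```python
# Counterfactual treatment and outcome (Box 2a), then the true treatment effect (Box 2b).
Tc = 1 - T                                            # counterfactual treatment
logit_c = beta0 + X2 @ b2 + Tc * (-beta_T_out + X3 @ b3) + X4 @ b4
p_yc = 1 / (1 + np.exp(-np.clip(logit_c, -30, 30)))
Yc = (rng.random(n_patients) < p_yc).astype(int)      # counterfactual outcome

TE = Y - Yc                                           # true individual treatment effect, TEi = Yi - Yc,i
```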
Training prediction model using machine learning
Machine learning algorithms
We applied the random forest classifier and the gradient boosting classifier.21–23,31 The random forest classifier and the gradient boosting classifier are decision tree–based ensemble approaches. The tree ensemble estimator consists of a set of classification and regression trees.32 The random forest classifier fits a number of decision trees on various subsamples of the dataset and uses averaging to improve the predictive accuracy and control for overfitting.21 The gradient boosting classifier is also an ensemble estimator that builds an additive model through a forward stage-wise approach. In each stage, a decision tree is fitted on the negative gradient of the binomial or multinomial deviance loss function to gradually improve the predictive accuracy and to control for overfitting.22,23
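The article names the 2 classifiers but not the implementing library; a minimal sketch using the scikit-learn implementations (a plausible but assumed choice, with illustrative settings) and the full feature set (T, X1-X4) is shown below, continuing from the simulated arrays above.

```python
# Fit both classifiers on the full feature set (T, X1-X4); settings are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X = np.column_stack([T, X1, X2, X3, X4])              # design matrix: treatment plus all features
rf = RandomForestClassifier(n_estimators=500, random_state=0)
gb = GradientBoostingClassifier(n_estimators=500, random_state=0)
rf.fit(X, Y)
gb.fit(X, Y)
```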
Model tuning and hyperparameterization
Training a predictive model using machine learning algorithms requires the tuning of model hyperparameters to optimize its performance. Hyperparameters are parameters that are not directly learned within the estimators but that determine the model's performance. The hyperparameters for each of the algorithms are presented in Supplementary Appendix 3. We applied the exhaustive grid search approach to identify the values of hyperparameters that optimize the model prediction performance.
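A sketch of the exhaustive grid search follows, continuing from the estimators above. The actual hyperparameter grids are given in Supplementary Appendix 3 and are not reproduced here, so the grids below are purely illustrative assumptions.

```python
# Exhaustive grid search over illustrative (assumed) hyperparameter grids.
from sklearn.model_selection import GridSearchCV

param_grid_rf = {"n_estimators": [100, 300, 500], "max_depth": [None, 5, 10]}       # assumed grid
param_grid_gb = {"n_estimators": [100, 300, 500], "learning_rate": [0.01, 0.1, 0.3]}  # assumed grid

search_rf = GridSearchCV(rf, param_grid_rf, scoring="f1", cv=10, n_jobs=-1).fit(X, Y)
search_gb = GridSearchCV(gb, param_grid_gb, scoring="f1", cv=10, n_jobs=-1).fit(X, Y)
print(search_rf.best_params_, search_gb.best_params_)
```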
Model performance assessment and selection
To address the common overfitting problem (high accuracy in training but low accuracy in validation and application) in machine learning model development, we applied robust 10-fold cross-validation with stratification by outcome for model performance assessment and final model selection. The stratified cross-validation randomly assigned the samples into 10 subsets with the same distribution of the outcome. In each fold, 90% of the samples were used for model training and the remaining 10% were used for model validation. We used the F1 score as the measure of model performance and selected the final model based on the highest F1 score from the 10-fold cross-validation.33 The F1 score is the harmonic mean of precision and recall (sensitivity): F1 = 2 × (precision × recall) / (precision + recall). Compared with other measures of model prediction performance, F1 is more robust when the data are imbalanced.34–36
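A sketch of the stratified 10-fold cross-validation with F1 scoring follows, continuing from the grid search above; selecting the final model by highest F1 would compare these cross-validated scores across the candidate models. The shuffle and random seed are assumptions.

```python
# Stratified 10-fold cross-validation of the tuned models, scored by F1.
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
f1_rf = cross_val_score(search_rf.best_estimator_, X, Y, scoring="f1", cv=skf)
f1_gb = cross_val_score(search_gb.best_estimator_, X, Y, scoring="f1", cv=skf)
print(f1_rf.mean(), f1_gb.mean())
```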
Analysis
We used multiple measures to assess the performance of the respective final prediction models. The measures include accuracy, error (1 – accuracy), precision, sensitivity, specificity, area under the curve, and F1 score.37,38 The measures were assessed using the validation samples from the 10-fold cross-validation. The validation samples were not used for the prediction model development and training to avoid cross-contamination and exacerbation of overfitting.
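The performance measures listed above can be computed from a validation fold as in the following sketch; y_true, y_pred, and y_prob are assumed to be the observed labels, predicted labels, and predicted probabilities for that fold.

```python
# Compute the reported performance measures for one validation fold.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, f1_score, confusion_matrix)

def performance(y_true, y_pred, y_prob):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    acc = accuracy_score(y_true, y_pred)
    return {
        "accuracy": acc,
        "error": 1 - acc,
        "precision": precision_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),
        "specificity": tn / (tn + fp),
        "auc": roc_auc_score(y_true, y_prob),
        "f1": f1_score(y_true, y_pred),
    }
```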
To evaluate the impact of different types of predictors as input features on the model performance, we assessed the effects of the full model (input features: T, X1, X2, X3, and X4) and models excluding X1, X2, X3, and X4 as input features on all the performance measures in predicting individual outcomes. The outcome event rate for the original treatment was fixed to 50% (balanced). The results were also stratified by treatment (T = 1 or 0) and outcome (Y = 1 or 0).
To determine the impact of outcome imbalance on the model performance, we assessed and compared all the performance measures in predicting individual outcome when the outcome event rate was 12% (severe imbalance), 35% (moderate imbalance), 45% (mild imbalance), and 50% (balanced). The results were also stratified by outcome and treatment.
To study the bias in estimating individual treatment effects, we assessed the proportion of patients whose estimated treatment effects differed from their actual treatment effects (Biasi = 1) (Box 2) given different combinations of X1-X4 as input features. The final prediction models developed from each machine learning algorithm were applied to yield the predicted outcomes from the original treatment and the counterfactual treatment for each individual patient, which were then used to calculate the predicted individual treatment effects. The bias in estimating individual treatment effects (Biasi) was computed as the difference between the true treatment effects (TEi) and the estimated treatment effects (E[TEi]) for each individual patient. We hypothesized that the total amount of bias may stem from inadequately controlled confounding and from model error in predicting the individual health outcome. Thus, we also calculated the difference between the bias (proportion of patients with Biasi = 1) and the model error (proportion of patients with a wrong outcome prediction) in the validation samples. The outcome event rate for the original treatment was fixed to 50% (balanced) for this analysis. We further implemented stratified analyses for the different models.
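A sketch of this bias calculation follows, continuing from the earlier sketches: the treatment column of the design matrix is flipped to obtain the counterfactual prediction, and the proportion of biased treatment effect estimates is compared with the proportion of wrong outcome predictions. The choice of final model here is illustrative.

```python
# Bias vs error for one fitted final model (any tuned model from the steps above would do).
model = search_rf.best_estimator_                     # illustrative choice of final model

X_counter = X.copy()
X_counter[:, 0] = 1 - X_counter[:, 0]                 # flip treatment T to its counterfactual Tc

E_Y = model.predict(X)                                # predicted outcome under the original treatment
E_Yc = model.predict(X_counter)                       # predicted counterfactual outcome
E_TE = E_Y - E_Yc                                     # predicted individual treatment effect

bias = np.mean(E_TE != TE)                            # proportion of patients with biased treatment effects
error = np.mean(E_Y != Y)                             # proportion of patients with wrong outcome prediction
print(bias, error, bias - error)
```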
All data processing and analyses were implemented using SAS 9.4 and Python 3.6.
RESULTS
Table 1 and Figure 3 present the impact of the different types of predictors as input features on model performance for individual outcomes. The gradient boosting classifier model showed consistently higher performance than the random forest classifier model. When all input predictors (T, X1-X4) were included, the accuracy (error) for the full cohort was 0.88 (0.12) for the random forest classifier model and 1.00 (0.00) for the gradient boosting classifier model. Compared with the full model, the model excluding X1 (treatment predictors) had the same performance. For the random forest classifier, the accuracy (error) was 0.88 (0.12), 0.87 (0.13), 0.82 (0.18), and 0.76 (0.24) for the models excluding X1, X2, X3, and X4, respectively. Similar patterns were observed for the other performance measures and for the gradient boosting classifier models. Furthermore, when the cohort was stratified by treatment (T) and outcome (Y), the impact of excluding the respective categories of predictors (X2, X3, and X4) on some performance measures was exacerbated. For example, when X4 (other outcome risk factors) was excluded, the random forest classifier model performed relatively poorly: the sensitivity and accuracy (error) were at their worst, with values of 0.57 and 0.57 (0.43), respectively, for the subgroup of T = 0 and Y = 1, while their values were 0.76 and 0.76 (0.24), respectively, for the full cohort. The change in precision and F1 score was less pronounced. This exacerbation was also observed for the gradient boosting classifier models.
Table 1.
Impact of different types of predictors on model performance in predicting individual outcome
| Models (Input Predictors) / Stratum | Subgroup | Random Forest: Precision | Random Forest: Sensitivity | Random Forest: Specificity | Random Forest: AUC | Random Forest: F1 | Random Forest: Accuracy | Random Forest: Error | Gradient Boosting: Precision | Gradient Boosting: Sensitivity | Gradient Boosting: Specificity | Gradient Boosting: AUC | Gradient Boosting: F1 | Gradient Boosting: Accuracy | Gradient Boosting: Error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T, X1-X4; all | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.12 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | |
| T=1 | all | 0.90 | 0.90 | 0.89 | 0.90 | 0.90 | 0.90 | 0.10 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| Y=0 | 0.91 | 0.86 | 0.93 | — | 0.89 | 0.86 | 0.14 | 1.00 | 1.00 | 1.00 | — | 1.00 | 1.00 | 0.00 | |
| Y=1 | 0.89 | 0.93 | 0.89 | — | 0.90 | 0.93 | 0.07 | 1.00 | 1.00 | 1.00 | — | 1.00 | 1.00 | 0.00 | |
| T=0 | all | 0.87 | 0.87 | 0.86 | 0.87 | 0.87 | 0.87 | 0.13 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| Y=0 | 0.87 | 0.90 | 0.83 | — | 0.88 | 0.90 | 0.10 | 1.00 | 1.00 | 1.00 | — | 1.00 | 1.00 | 0.00 | |
| Y=1 | 0.87 | 0.83 | 0.90 | — | 0.85 | 0.83 | 0.17 | 1.00 | 1.00 | 1.00 | — | 1.00 | 1.00 | 0.00 | |
| T, X2-X4; all | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.12 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | |
| T=1 | all | 0.90 | 0.90 | 0.89 | 0.90 | 0.90 | 0.90 | 0.10 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| Y=0 | 0.91 | 0.86 | 0.93 | — | 0.91 | 0.86 | 0.14 | 1.00 | 1.00 | 1.00 | — | 1.00 | 1.00 | 0.00 | |
| Y=1 | 0.89 | 0.93 | 0.86 | — | 0.91 | 0.93 | 0.07 | 1.00 | 1.00 | 1.00 | — | 1.00 | 1.00 | 0.00 | |
| T=0 | all | 0.87 | 0.87 | 0.86 | 0.87 | 0.87 | 0.87 | 0.13 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| Y=0 | 0.87 | 0.90 | 0.83 | — | 0.88 | 0.90 | 0.10 | 1.00 | 1.00 | 1.00 | — | 1.00 | 1.00 | 0.00 | |
| Y=1 | 0.87 | 0.83 | 0.90 | — | 0.87 | 0.83 | 0.17 | 1.00 | 1.00 | 1.00 | — | 1.00 | 1.00 | 0.00 | |
| T, X3, X4; all | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.13 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.03 | |
| T=1 | all | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.12 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.02 |
| Y=0 | 0.88 | 0.85 | 0.91 | — | 0.89 | 0.85 | 0.15 | 0.98 | 0.97 | 0.99 | — | 0.98 | 0.97 | 0.03 | |
| Y=1 | 0.88 | 0.91 | 0.85 | — | 0.89 | 0.90 | 0.10 | 0.98 | 0.98 | 0.98 | — | 0.98 | 0.99 | 0.01 | |
| T=0 | all | 0.86 | 0.89 | 0.82 | 0.85 | 0.87 | 0.86 | 0.14 | 0.96 | 0.97 | 0.95 | 0.96 | 0.96 | 0.96 | 0.04 |
| Y=0 | 0.86 | 0.89 | 0.82 | — | 0.87 | 0.89 | 0.11 | 0.96 | 0.97 | 0.95 | — | 0.96 | 0.97 | 0.03 | |
| Y=1 | 0.86 | 0.82 | 0.85 | — | 0.89 | 0.82 | 0.18 | 0.96 | 0.95 | 0.97 | — | 0.96 | 0.95 | 0.05 | |
| T, X2, X3; all | 0.77 | 0.76 | 0.76 | 0.76 | 0.76 | 0.76 | 0.24 | 0.86 | 0.86 | 0.86 | 0.86 | 0.87 | 0.86 | 0.14 | |
| T=1 | all | 0.83 | 0.83 | 0.82 | 0.82 | 0.83 | 0.82 | 0.17 | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.09 |
| Y=0 | 0.84 | 0.76 | 0.88 | — | 0.80 | 0.80 | 0.24 | 0.91 | 0.89 | 0.92 | — | 0.90 | 0.89 | 0.11 | |
| Y=1 | 0.82 | 0.88 | 0.76 | — | 0.83 | 0.88 | 0.12 | 0.91 | 0.92 | 0.89 | — | 0.91 | 0.92 | 0.08 | |
| T=0 | all | 0.71 | 0.70 | 0.68 | 0.69 | 0.70 | 0.70 | 0.30 | 0.82 | 0.82 | 0.81 | 0.81 | 0.82 | 0.82 | 0.18 |
| Y=0 | 0.70 | 0.81 | 0.57 | — | 0.75 | 0.81 | 0.19 | 0.81 | 0.88 | 0.75 | — | 0.84 | 0.88 | 0.12 | |
| Y=1 | 0.72 | 0.57 | 0.81 | — | 0.63 | 0.57 | 0.43 | 0.83 | 0.82 | 0.81 | — | 0.82 | 0.75 | 0.25 | |
| T, X2, X4; all | 0.82 | 0.82 | 0.82 | 0.82 | 0.82 | 0.82 | 0.18 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.06 | |
| T=1 | all | 0.77 | 0.77 | 0.77 | 0.77 | 0.77 | 0.77 | 0.23 | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.91 | 0.09 |
| Y=0 | 0.75 | 0.74 | 0.80 | — | 0.74 | 0.74 | 0.26 | 0.91 | 0.89 | 0.92 | — | 0.90 | 0.89 | 0.11 | |
| Y=1 | 0.79 | 0.80 | 0.74 | — | 0.79 | 0.80 | 0.20 | 0.91 | 0.92 | 0.89 | — | 0.91 | 0.92 | 0.08 | |
| T=0 | all | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.87 | 0.13 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.03 |
| Y=0 | 0.87 | 0.90 | 0.84 | — | 0.88 | 0.90 | 0.10 | 0.97 | 0.97 | 0.97 | — | 0.97 | 0.97 | 0.03 | |
| Y=1 | 0.87 | 0.84 | 0.90 | — | 0.86 | 0.84 | 0.16 | 0.97 | 0.97 | 0.97 | — | 0.97 | 0.97 | 0.03 | |
AUC: area under the curve; T: treatment; X1: factors that only affect treatment choice; X2: factors related to both treatment choice and outcomes (confounders); X3: factors that may affect the treatment effects in individual patients (effects modifiers); X4: individual risk factors only related to outcomes; Y: outcomes.
Figure 3.
Impact of different types of predictors on model performance. Note: AUC: area under the curve; T: treatment; X1: factors that only affect treatment choice; X2: factors related to both treatment choice and outcomes (confounders); X3: factors that may affect the treatment effects in individual patients (effects modifiers); X4: individual risk factors only related to outcomes; Y: outcomes.
Table 2 and Figure 4 present the impact of data imbalance on model performance for individual outcomes. Again, the gradient boosting classifier model had higher performance than the random forest classifier model across all performance measures, across all cases of data imbalance, and when stratifying by treatment and outcome. For the full cohort, most of the performance measures for the gradient boosting classifier model did not show significant changes when the rate of the outcome (Y) ranged from balanced to severely imbalanced. For example, the accuracy (error) ranged from 1.00 (0.00) to 0.92 (0.08). The exceptions are specificity and area under the curve, which dropped from 0.93 to 0.49 and from 0.94 to 0.70, respectively, when the outcome rate moved from moderately imbalanced to severely imbalanced. Considerable changes in the performance measures occurred in some subgroups stratified by treatment and outcome when the outcome was moderately to severely imbalanced. For example, when the outcome was moderately imbalanced, the accuracy (error) was 0.88 (0.12) for the subgroup of T = 0 and Y = 1. When the outcome was severely imbalanced, the accuracy (error) was 0.69 (0.31) for T = 1 and Y = 1 and 0.26 (0.74) for T = 0 and Y = 1. The random forest classifier model showed patterns similar to those of the gradient boosting classifier model.
Table 2.
Impact of outcome imbalance on model performance in predicting individual outcome
| Outcome Distribution / Stratum | Subgroup | Random Forest: Precision | Random Forest: Sensitivity | Random Forest: Specificity | Random Forest: AUC | Random Forest: F1 | Random Forest: Accuracy | Random Forest: Error | Gradient Boosting: Precision | Gradient Boosting: Sensitivity | Gradient Boosting: Specificity | Gradient Boosting: AUC | Gradient Boosting: F1 | Gradient Boosting: Accuracy | Gradient Boosting: Error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Balanced (50% Y=1); all | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.12 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | |
| T=1 | all | 0.90 | 0.90 | 0.89 | 0.90 | 0.90 | 0.90 | 0.10 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| Y=0 | 0.91 | 0.86 | 0.93 | — | 0.91 | 0.86 | 0.14 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | |
| Y=1 | 0.89 | 0.93 | 0.86 | — | 0.91 | 0.93 | 0.07 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | |
| T=0 | all | 0.87 | 0.87 | 0.86 | 0.87 | 0.87 | 0.87 | 0.13 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| Y=0 | 0.87 | 0.90 | 0.83 | — | 0.88 | 0.90 | 0.10 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | |
| Y=1 | 0.87 | 0.83 | 0.90 | — | 0.87 | 0.83 | 0.17 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 | |
| Mildly imbalanced (45% Y=1); all | 0.85 | 0.85 | 0.84 | 0.85 | 0.85 | 0.85 | 0.15 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.01 | |
| T=1 | all | 0.84 | 0.84 | 0.83 | 0.84 | 0.84 | 0.84 | 0.16 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.00 |
| Y=0 | 0.84 | 0.89 | 0.79 | — | 0.86 | 0.89 | 0.11 | 1.00 | 1.00 | 0.99 | — | 1.00 | 1.00 | 0.00 | |
| Y=1 | 0.86 | 0.79 | 0.89 | — | 0.82 | 0.79 | 0.21 | 1.00 | 0.99 | 1.00 | — | 1.00 | 0.99 | 0.01 | |
| T=0 | all | 0.86 | 0.86 | 0.85 | 0.86 | 0.86 | 0.86 | 0.14 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.02 |
| Y=0 | 0.84 | 0.91 | 0.80 | — | 0.88 | 0.91 | 0.09 | 0.98 | 0.99 | 0.97 | — | 0.98 | 0.99 | 0.01 | |
| Y=1 | 0.88 | 0.80 | 0.91 | — | 0.84 | 0.80 | 0.20 | 0.99 | 0.97 | 0.98 | — | 0.98 | 0.97 | 0.03 | |
| Moderately imbalanced (35% Y=1); all | 0.78 | 0.78 | 0.72 | 0.75 | 0.78 | 0.78 | 0.22 | 0.95 | 0.95 | 0.93 | 0.94 | 0.95 | 0.95 | 0.05 | |
| T=1 | all | 0.79 | 0.78 | 0.77 | 0.78 | 0.78 | 0.78 | 0.22 | 0.97 | 0.97 | 0.96 | 0.96 | 0.97 | 0.97 | 0.04 |
| Y=0 | 0.84 | 0.80 | 0.75 | — | 0.82 | 0.80 | 0.20 | 0.97 | 0.98 | 0.94 | — | 0.97 | 0.98 | 0.02 | |
| Y=1 | 0.70 | 0.75 | 0.80 | — | 0.72 | 0.75 | 0.25 | 0.97 | 0.94 | 0.98 | — | 0.95 | 0.94 | 0.06 | |
| T=0 | all | 0.78 | 0.78 | 0.67 | 0.73 | 0.78 | 0.78 | 0.22 | 0.93 | 0.93 | 0.91 | 0.92 | 0.93 | 0.93 | 0.07 |
| Y=0 | 0.81 | 0.89 | 0.56 | — | 0.85 | 0.89 | 0.11 | 0.94 | 0.96 | 0.88 | — | 0.95 | 0.96 | 0.04 | |
| Y=1 | 0.72 | 0.56 | 0.89 | — | 0.63 | 0.56 | 0.44 | 0.92 | 0.88 | 0.96 | — | 0.90 | 0.88 | 0.12 | |
| Severely imbalanced (12% Y=1); all | 0.89 | 0.90 | 0.46 | 0.68 | 0.89 | 0.90 | 0.10 | 0.91 | 0.92 | 0.49 | 0.70 | 0.90 | 0.92 | 0.08 | |
| T=1 | all | 0.94 | 0.94 | 0.63 | 0.78 | 0.94 | 0.94 | 0.06 | 0.96 | 0.96 | 0.72 | 0.84 | 0.96 | 0.96 | 0.04 |
| Y=0 | 0.96 | 0.98 | 0.59 | — | 0.97 | 0.98 | 0.02 | 0.97 | 0.99 | 0.69 | — | 0.98 | 0.99 | 0.01 | |
| Y=1 | 0.74 | 0.59 | 0.63 | — | 0.94 | 0.59 | 0.41 | 0.88 | 0.69 | 0.99 | — | 0.96 | 0.69 | 0.31 | |
| T=0 | all | 0.84 | 0.86 | 0.37 | 0.62 | 0.84 | 0.86 | 0.14 | 0.87 | 0.87 | 0.38 | 0.62 | 0.93 | 0.87 | 0.13 |
| Y=0 | 0.87 | 0.98 | 0.25 | — | 0.92 | 0.98 | 0.02 | 0.88 | 0.99 | 0.26 | — | 0.93 | 0.99 | 0.01 | |
| Y=1 | 0.69 | 0.25 | 0.98 | — | 0.37 | 0.25 | 0.75 | 0.82 | 0.26 | 0.99 | — | 0.85 | 0.26 | 0.74 | |
AUC: area under the curve; T: treatment; Y: outcomes.
Figure 4.
Impact of outcome data imbalance on model performance. AUC: area under the curve; T: treatment; Y: outcomes.
Table 3 and Figure 5 present the amount of bias in predicting individual treatment effects, the error in predicting individual outcome, and their difference. For the gradient boosting classifier, both the full model and the model excluding X1 had an error of 0.00 in predicting individual outcome. Nonetheless, both models had a bias of 0.11; that is, 11% of the validation samples had predicted treatment effects that differed from the true treatment effects. The largest bias in predicting individual treatment effects was 0.31 when the model excluded X4, followed by the model excluding X3 (0.25) and the model excluding X2 (0.17). The model prediction error showed a similar pattern to the bias. The difference between bias and error across the different models remained relatively stable, ranging from 0.11 to 0.19. The random forest classifier models showed very similar patterns, except that both the error and the bias were higher. The error ranged from 0.12 for the full model to 0.24 for the model excluding X4, and the bias ranged from 0.22 for the full model to 0.34 for the model excluding X4. Similar to the gradient boosting classifier models, the difference between the bias and the error remained consistent across models, ranging from 0.10 to 0.12.
Figure 5.
Bias in estimating individual treatment effects. X1: factors that only affect treatment choice; X2: factors related to both treatment choice and outcomes (confounders); X3: factors that may affect the treatment effects in individual patients (effects modifiers); X4: individual risk factors only related to outcomes.
DISCUSSION
In assessing the impact of different types of predictors on predicting individual treatment outcome, we found that model performance varied substantially depending on the types of predictors included as input features. This variation was more pronounced in patient subgroups that did not receive the treatment or in which the health outcome occurred. The biggest impact came from excluding X4 (other outcome risk factors). This finding suggests that it is important to include factors that are risk factors for the health outcome even if they are not associated with treatment selection and do not modify treatment effects among individual patients. Although we found the impact of confounders (X2) and treatment effects modifiers (X3) to be relatively smaller, we caution against the interpretation that it is less important to include confounders and effect modifiers in machine learning model development. Their relatively small role may be explained by the fact that their effects were mediated through the treatment, which was included as an input feature. Our findings also suggest that including factors that only predict treatment choice may not be necessary, though their inclusion does not seem to harm model performance. Overall, inadequate inclusion of X2, X3, and X4 may reduce performance in predicting individual outcome and subsequently introduce bias in predicting individual treatment effects.
Outcome imbalance may also affect the performance of machine learning in predicting individual treatment outcome. We found that machine learning model performance was inversely associated with the extent of health outcome imbalance. Even when the outcome was moderately imbalanced (35% of patients incurred the outcome), the impact was significant. The impact became more substantial when outcome imbalance was severe (ie, only 12% of patients incurred the outcome). The impact on model performance also varied considerably in the subgroups stratified by treatment and outcome. The negative impact on prediction performance was most pronounced among patients for whom the treatment outcome occurred. The results suggest that applying methods to address outcome imbalance is critically important when the outcome is moderately to severely imbalanced. Even though several approaches have been proposed to address general outcome imbalance (eg, synthetic minority oversampling techniques),39–44 further research is needed to identify an optimal approach for moderate to severe outcome imbalance in the context of predicting individual treatment outcomes. In addition, the results from our study show that different performance measures have different sensitivity to outcome data imbalance. Thus, multiple performance measures may need to be assessed to make robust scientific inferences.
Biased prediction of individual treatment effects can lead to erroneous treatment decisions that can harm patients. In our analysis, we found that bias in estimating individual treatment effects existed and was at a significant level (10%) even when the machine learning model achieved 100% accuracy in predicting individual outcomes. This suggests that using machine learning algorithms alone may not fully address bias and confounding and thus may not support causal inference for treatment effects estimation. In addition, our results suggest that the total bias in treatment effects estimation may arise from both error in predicting the health outcome and confounding bias. Thus, having a high-performing prediction model from machine learning will reduce bias in predicting treatment effects by reducing error in predicting the health outcome. Based on prior methodological studies of causal treatment effects,2,17–20 the bias in the individual treatment effects estimated by machine learning models may stem from imbalance in the features between treatment groups. The imbalanced features may include X2 (confounders) and X3 (treatment effects modifiers). Thus, future research needs to incorporate comparative treatment effectiveness methodologies into machine learning prediction model development to address or mitigate this feature imbalance problem.
It is important to note that our simulation findings may be contingent on the assumptions and parameters that we established, and thus may not generalize to other scenarios. However, the virtual patient cohort we used was simulated based on an analytical framework of treatment effects heterogeneity, the counterfactual theory of causal inference, and the predictor (input feature) categories well known in clinical epidemiology and pharmacoepidemiology.24 It was also simulated to mimic a real-patient cohort from a large, real-world claims database. The distributions of key cohort characteristics of the virtual patient cohort are consistent with those of the real-patient cohort. Thus, the virtual patient cohort provides a good basis for both the internal and external validity needed to investigate our research questions. Our study only assessed 2 machine learning algorithms, and new algorithms (eg, hidden Markov models and deep learning) have been introduced to address common classification and prediction questions in biomedicine. Future research is needed to assess how well such new methods address bias. A strength of our study is how well our machine learning models were trained and developed. Because all the predictors and their effects are captured in the simulated data, a well-trained and well-developed model is expected to attain high performance. We undertook rigorous and robust development and training of the prediction models. The models achieved very good overall performance in predicting the health outcome (label), with accuracy ranging between approximately 0.80 and 0.90 for the random forest models and between 0.90 and 1.00 for the gradient boosting models in the 10-fold cross-validation. Therefore, the observed variations in our analysis are unlikely to be attributable to error or insufficiency in the machine learning model development.
CONCLUSION
Direct application of machine learning might not adequately address bias in predicting individual treatment effects. Bias in causal inference may significantly affect the scientific validity and clinical accuracy of applying machine learning to estimate individual treatment effects. Further method development is needed to advance machine learning to support individualized treatment selection. An analytical framework–guided approach with an advanced virtual patient cohort may be useful in supporting this development.
FUNDING
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
AUTHOR CONTRIBUTIONS
GF and IA were involved in conception or design of the work as well as data generation, analysis, or interpretation of data. GF, IA, JEL, and SC were involved in drafting the work or revising it critically for important intellectual content as well as final approval of the version to be published.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
ACKNOWLEDGMENTS
The draft of the manuscript received very helpful comments and suggestions from Dr Evan Colmenares, Dr Ryan Hickson, and 2 anonymous journal external reviewers. We appreciate their input in improving this manuscript. Dr Fang's research was supported in part by the US National Institutes of Health (NIH) National Institute on Aging grants 1R01AG046267-01A1 and 1R21AG043668-01A1.
CONFLICT OF INTEREST STATEMENT
None declared.
REFERENCES
- 1. Fang G, Brooks JM, Chrischilles EA. Apples and oranges? Interpretations of risk adjustment and instrumental variable estimates of intended treatment effects using observational data. Am J Epidemiol 2012; 175 (1): 60–5.
- 2. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 1974; 66 (5): 688–701.
- 3. Kent DM, Hayward RA. Limitations of applying summary results of clinical trials to individual patients: the need for risk stratification. JAMA 2007; 298 (10): 1209–12.
- 4. Rothwell PM. Can overall results of clinical trials be applied to all patients? Lancet 1995; 345 (8965): 1616–9.
- 5. Kravitz RL, Duan N, Braslow J. Evidence-based medicine, heterogeneity of treatment effects, and the trouble with averages. Milbank Q 2004; 82 (4): 661–87.
- 6. Fahey T. Applying the results of clinical trials to patients to general practice: perceived problems, strengths, assumptions, and challenges for the future. Br J Gen Pract 1998; 48 (429): 1173–8.
- 7. Zarbin MA. Challenges in applying the results of clinical trials to clinical practice. JAMA Ophthalmol 2016; 134 (8): 928–33.
- 8. Ashley EA. The precision medicine initiative: a new national effort. JAMA 2015; 313 (21): 2119–20.
- 9. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med 2015; 372 (9): 793–5.
- 10. Darcy AM, Louie AK, Roberts L. Machine learning and the profession of medicine. JAMA 2016; 315 (6): 551–2.
- 11. Deo RC. Machine learning in medicine. Circulation 2015; 132 (20): 1920–30.
- 12. Obermeyer Z, Emanuel EJ. Predicting the future—big data, machine learning, and clinical medicine. N Engl J Med 2016; 375 (13): 1216–9.
- 13. Crown WH. Potential application of machine learning in health outcomes research and some statistical cautions. Value Health 2015; 18 (2): 137–40.
- 14. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA 2018; 319 (13): 1317–8.
- 15. Elfiky AA, Pany MJ, Parikh RB, Obermeyer Z. Development and application of a machine learning approach to assess short-term mortality risk among patients with cancer starting chemotherapy. JAMA Netw Open 2018; 1 (3): e180926.
- 16. Wong A, Young AT, Liang AS, Gonzales R, Douglas VC, Hadley D. Development and validation of an electronic health record–based machine learning model to estimate delirium risk in newly hospitalized patients without known cognitive impairment. JAMA Netw Open 2018; 1 (4): e181018.
- 17. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70 (1): 41–55.
- 18. Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med 1997; 127 (8 Pt 2): 757–63.
- 19. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med 2004; 23 (19): 2937–60.
- 20. Heckman JJ. The scientific model of causality. Sociol Methodol 2005; 35 (1): 1–97.
- 21. Breiman L. Random forests. Mach Learn 2001; 45 (1): 5–32.
- 22. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001; 29 (5): 1189–232.
- 23. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. arXiv 2016 Jun 10.
- 24. Brooks JM, Fang G. Interpreting treatment-effect estimates with heterogeneity and choice: simulation model results. Clin Ther 2009; 31 (4): 902–19.
- 25. Fang G, Annis IE, Farley JF, et al. Incidence of and risk factors for severe adverse events in elderly patients taking angiotensin-converting enzyme inhibitors or angiotensin II receptor blockers after an acute myocardial infarction. Pharmacotherapy 2018; 38 (1): 29–41.
- 26. Hickson RP, Robinson JG, Annis IE, et al. Changes in statin adherence following an acute myocardial infarction among older adults: patient predictors and the association with follow-up with primary care providers and/or cardiologists. J Am Heart Assoc 2017; 6 (10): e007106.
- 27. Korhonen MJ, Robinson JG, Annis IE, et al. Adherence tradeoff to multiple preventive therapies and all-cause mortality after acute myocardial infarction. J Am Coll Cardiol 2017; 70 (13): 1543–54.
- 28. Lauffenburger JC, Farley JF, Gehi AK, Rhoney DH, Brookhart MA, Fang G. Effectiveness and safety of dabigatran and warfarin in real-world US patients with non-valvular atrial fibrillation: a retrospective cohort study. J Am Heart Assoc 2015; 4 (4): e001798.
- 29. Cramér H. Mathematical Methods of Statistics (PMS-9). Vol. 9. Princeton, NJ: Princeton University Press; 2016.
- 30. Guilford JP. Psychometric Methods. 2nd ed. New York: McGraw-Hill; 1954.
- 31. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn 2006; 63 (1): 3–42.
- 32. Breiman L. Classification and Regression Trees. New York: Routledge; 2017.
- 33. Powers DMW. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J Mach Learn Technol 2011; 2 (1): 37–63.
- 34. Molinaro AM, Simon R, Pfeiffer RM. Prediction error estimation: a comparison of resampling methods. Bioinformatics 2005; 21 (15): 3301–7.
- 35. Hawkins DM, Basak SC, Mills D. Assessing model fit by cross-validation. J Chem Inf Comput Sci 2003; 43 (2): 579–86.
- 36. Kuhn M, Johnson K. Applied Predictive Modeling. New York: Springer; 2013.
- 37. Swets JA. Measuring the accuracy of diagnostic systems. Science 1988; 240 (4857): 1285–93.
- 38. Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ 2016; 352: i6.
- 39. Sun YM, Wong AKC, Kamel MS. Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell 2009; 23 (4): 687–719.
- 40. Ting KM. An instance-weighting method to induce cost-sensitive trees. IEEE Trans Knowl Data Eng 2002; 14 (3): 659–65.
- 41. Arbelaitz O, Gurrutxaga I, Muguerza J, Perez JM. Applying resampling methods for imbalanced datasets to not so imbalanced datasets. Lect Notes Comput Sci 2013; 8109: 111–20.
- 42. Elkan C. The foundations of cost-sensitive learning. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence; 2001.
- 43. Sun Y, Kamel MS, Wong AK, Wang Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 2007; 40 (12): 3358–78.
- 44. Fan W, Stolfo SJ, Zhang J, Chan PK. AdaCost: misclassification cost-sensitive boosting. In: Proceedings of the Sixteenth International Conference on Machine Learning; 1999: 97–105.