Table 2. Model Performance for Predicting Survival in the Electronic Health Record Cohort.
Training cohort | Variables, No. | iAUC, mean (SD)^a |
---|---|---|
RCT data^b | 101 | 0.722 (0.118) |
EHR data with all variables available | 84 | 0.762 (0.106) |
EHR data with all independent variables (no covariates) | 60 | 0.775 (0.098) |
EHR data with top 25 variables selected by RFE^c | 25 | 0.792 (0.097) |
EHR data with top 15 variables^c | 15 | 0.785 (0.098) |
EHR data with top 10 variables^c | 10 | 0.779 (0.099) |
Abbreviations: EHR, electronic health record; iAUC, integrated area under the curve; RCT, randomized clinical trial; RFE, recursive feature elimination.
^a Mean (SD) iAUC from 10-fold cross-validation on 90% of the EHR data set.
^b Among the 101 RCT variables, 17 were not available in the EHR data set because they are not routinely collected. Values for these variables were imputed using a penalized Gaussian regression model based on nonmissing variables. Other partially missing variables were also imputed.
^c Top variables are those with the highest absolute correlation with the outcome.
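
The ranking rule in the last footnote (keep the variables with the highest absolute correlation with the outcome) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes Pearson correlation and a plain numeric outcome, neither of which is stated in the table, and the function names `pearson` and `top_k_by_abs_correlation`, the variable names, and the toy values are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def top_k_by_abs_correlation(
    features: Dict[str, Sequence[float]],
    outcome: Sequence[float],
    k: int,
) -> List[str]:
    """Return the k feature names with the largest |correlation| with the outcome."""
    scores = {name: abs(pearson(values, outcome)) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data, made up for illustration only (not from the study).
features = {
    "age": [50, 60, 70, 80],
    "lab_a": [4, 3, 2.5, 1],   # strongly negatively correlated with the outcome
    "noise": [1, 3, 1, 3],
}
outcome = [10, 20, 30, 45]

selected = top_k_by_abs_correlation(features, outcome, k=2)
print(selected)
```

Taking the absolute value means a strongly negative predictor such as the toy `lab_a` ranks alongside a strongly positive one. With a censored survival outcome, the correlation step would need a definition suited to censoring, which the table does not specify.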