Table 4.
Average AUC (area under the ROC curve) for each combination of feature selection method and classifier in the predictive modeling pipelines, for the three EHR data sets. The average and standard deviation are computed over 100 classifier runs (10 × 10-fold cross-validation).
Data Set | Feature Selection | Classification | Average AUC (std) |
---|---|---|---|
Small | Information Gain | K-Nearest Neighbor | 0.680 (0.075) |
Small | Information Gain | Naïve Bayesian | 0.687 (0.042) |
Small | Information Gain | Logistic Regression | 0.690 (0.044) |
Small | Information Gain | Random Forest | 0.717 (0.045) |
Small | Fisher Score | K-Nearest Neighbor | 0.655 (0.102) |
Small | Fisher Score | Naïve Bayesian | 0.690 (0.042) |
Small | Fisher Score | Logistic Regression | 0.689 (0.046) |
Small | Fisher Score | Random Forest | 0.713 (0.043) |
Medium | Information Gain | K-Nearest Neighbor | 0.598 (0.013) |
Medium | Information Gain | Naïve Bayesian | 0.692 (0.013) |
Medium | Information Gain | Logistic Regression | 0.746 (0.012) |
Medium | Information Gain | Random Forest | 0.752 (0.012) |
Medium | Fisher Score | K-Nearest Neighbor | 0.616 (0.013) |
Medium | Fisher Score | Naïve Bayesian | 0.688 (0.014) |
Medium | Fisher Score | Logistic Regression | 0.741 (0.012) |
Medium | Fisher Score | Random Forest | 0.749 (0.012) |
Large | Information Gain | K-Nearest Neighbor | 0.602 (0.027) |
Large | Information Gain | Naïve Bayesian | 0.634 (0.007) |
Large | Information Gain | Logistic Regression | 0.706 (0.006) |
Large | Information Gain | Random Forest | 0.705 (0.006) |
Large | Fisher Score | K-Nearest Neighbor | 0.597 (0.023) |
Large | Fisher Score | Naïve Bayesian | 0.632 (0.007) |
Large | Fisher Score | Logistic Regression | 0.705 (0.006) |
Large | Fisher Score | Random Forest | 0.704 (0.006) |
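
The caption's summary statistic (mean and standard deviation of AUC over 10 repetitions of 10-fold cross-validation, giving 100 test-fold estimates) can be illustrated with a minimal standard-library sketch. Everything in it is an assumption made for illustration: the synthetic one-feature data, the toy `fit_centroid` scorer, the stratified fold assignment, and all function names are hypothetical and do not reproduce the paper's pipelines, feature selection methods, or classifiers.

```python
import random
import statistics


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes in the test fold")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, rng):
    """Shuffle indices per class and deal them into k folds (assumed stratification)."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def fit_centroid(X_train, y_train):
    """Toy stand-in classifier: higher score means closer to the positive-class mean."""
    mu1 = statistics.mean(x for x, t in zip(X_train, y_train) if t == 1)
    mu0 = statistics.mean(x for x, t in zip(X_train, y_train) if t == 0)
    return lambda x: (x - mu0) ** 2 - (x - mu1) ** 2


def repeated_cv_auc(X, y, fit, n_repeats=10, k=10, seed=0):
    """Return (mean AUC, std of AUC, number of runs) over n_repeats x k test folds."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_repeats):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            score = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            aucs.append(auc([y[i] for i in test_idx],
                            [score(X[i]) for i in test_idx]))
    return statistics.mean(aucs), statistics.stdev(aucs), len(aucs)


# Synthetic data: 100 negatives drawn around 0.0 and 100 positives around 1.0.
data_rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [data_rng.gauss(1.0 if t else 0.0, 1.0) for t in y]

mean_auc, std_auc, n_runs = repeated_cv_auc(X, y, fit_centroid)
print(f"AUC over {n_runs} runs: {mean_auc:.3f} ({std_auc:.3f})")
```

The printed value has the same "mean (std)" form as the table entries. In this sketch the standard deviation is the sample standard deviation across the 100 test folds; whether the paper used the sample or population form is not stated.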