Author manuscript; available in PMC: 2019 Aug 1.
Published in final edited form as: J Biomed Inform. 2018 Jun 15;84:11–16. doi: 10.1016/j.jbi.2018.06.011

Table 3.

Effect of using different variable categories, datasets, and model variations on prediction accuracy. The Balanced subset consists of 14,500 cases and controls, the Unbalanced subset consists of the same 14,500 cases and 109,079 controls, and the full set consists of 152,790 cases and 1,152,517 controls.

| Experiment # | Model | AUC |
|---|---|---|
| 1 | RETAIN trained and tested on Balanced Subset – Diagnoses Data only | 0.769 |
| 1 | RETAIN trained and tested on Balanced Subset – Diagnoses Data grouped using CCS codes | 0.759 |
| 1 | RETAIN trained and tested on Balanced Subset – Diagnoses and Demographic Covariates (age, gender, race) | 0.779 |
| 1 | RETAIN trained and tested on Balanced Subset – Diagnoses, Demographic, and Medication | 0.787 |
| 1 & 2 | RETAIN trained and tested on Balanced Subset – All codes (Diagnoses, Demographic, Medication, and Surgery) | 0.789 |
| 2 | RETAIN trained and tested on Unbalanced Subset – All codes | 0.79 |
| 2 | RETAIN trained on Balanced Subset – All codes and tested on Unbalanced Subset – All codes | 0.787 |
| 3 | Forced Matching RETAIN trained and tested on Balanced Subset – Diagnoses Data only | 0.762 |
| 3 | Forced Matching RETAIN trained and tested on Balanced Subset – Diagnoses Data grouped using CCS codes | 0.751 |
| 4 | RETAIN trained and tested on the full cohort set | 0.822 |
| 4 | Logistic Regression with L2 | 0.766 |
| 4 | Logistic Regression using embedding | 0.782 |
| 4 | Logistic Regression using embedding with L1 | 0.785 |
| 4 | Logistic Regression using embedding with L2 | 0.786 |
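
As a reading aid for the AUC column, the following minimal sketch computes AUC as a rank statistic: the probability that a randomly chosen case receives a higher score than a randomly chosen control. The scores are hypothetical and are not data from the study, and the function name `auc` is an illustrative choice rather than the authors' code. The sketch also shows that, when the model's scores are unchanged, changing only the case-to-control ratio of the test set leaves AUC unchanged, which may help when comparing the Balanced and Unbalanced rows of Experiment 2.

```python
from itertools import product


def auc(case_scores, control_scores):
    """AUC as P(case score > control score); ties count as one half."""
    wins = 0.0
    for case, control in product(case_scores, control_scores):
        if case > control:
            wins += 1.0
        elif case == control:
            wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical risk scores for illustration only (not from the study).
cases = [0.9, 0.7, 0.6, 0.4]
controls = [0.8, 0.5, 0.3, 0.2]

balanced_auc = auc(cases, controls)

# Repeating the controls changes the class ratio but not the score
# distribution, so the pairwise ranking statistic is the same.
unbalanced_auc = auc(cases, controls * 8)

print(balanced_auc, unbalanced_auc)
```

In practice the score distribution of the controls may differ between subsets, so small differences in AUC between balanced and unbalanced test sets, such as those in the table, are still possible.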