2020 Mar 9;8(1):e001055. doi: 10.1136/bmjdrc-2019-001055

Table 2.

AUC and overfitting values for 30 machine learning models

| Imputing or not | Sampling methods | Binning or not | Screening methods | Model methods | Num of Variables | Num of Samples | AUC_TR | AUC_TE | AUC_Set 2 | OF1 | OF2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Not | Not | Not | Not | $D | 16 | 167 | 0.693±0.025 | 0.621±0.054 | 0.737±0.062 | 1.125±0.122 | 0.849±0.111 |
| Not | Not | Not | Forward and stepwise | $S | 3 | 167 | 0.551±0.033 | 0.577±0.078 | 0.557±0.051 | 0.974±0.162 | 1.039±0.131 |
| Not | Not | Not | Backward | $L | 5 | 167 | 0.659±0.019 | 0.658±0.039 | 0.687±0.052 | 1.005±0.072 | 0.964±0.102 |
| Not | Not | Yes | Forward and stepwise | $S | 3 | 167 | 0.551±0.033 | 0.577±0.078 | 0.577±0.051 | 0.974±0.162 | 1.039±0.131 |
| Not | Not | Yes | Backward | $S | 3 | 167 | 0.551±0.033 | 0.577±0.078 | 0.577±0.051 | 0.974±0.162 | 1.039±0.131 |
| Not | Undersampling | Not | Not | $XF | 16 | 98 | 0.778±0.025 | 0.827±0.080 | 0.744±0.085 | 0.949±0.102 | 1.130±0.198 |
| Not | Undersampling | Not | Forward and stepwise | $L | 3 | 98 | 0.679±0.032 | 0.660±0.075 | 0.664±0.067 | 1.044±0.142 | 1.008±0.192 |
| Not | Undersampling | Not | Backward | $L | 3 | 98 | 0.679±0.032 | 0.660±0.075 | 0.664±0.067 | 1.044±0.142 | 1.008±0.192 |
| Not | Undersampling | Yes | Forward and stepwise | $KNN | 4 | 98 | 0.725±0.028 | 0.674±0.074 | 0.715±0.070 | 1.086±0.122 | 0.955±0.169 |
| Not | Undersampling | Yes | Backward | $XF | 5 | 98 | 0.755±0.047 | 0.753±0.074 | 0.725±0.058 | 1.013±0.136 | 1.044±0.125 |
| Not | Oversampling | Not | Not | $R | 16 | 263 | 0.781±0.021 | 0.770±0.040 | 0.758±0.070 | 1.017±0.060 | 1.028±0.150 |
| Not | Oversampling | Not | Backward | $XF | 5 | 263 | 0.814±0.031 | 0.799±0.037 | 0.761±0.094 | 1.021±0.073 | 1.066±0.155 |
| Not | Oversampling | Not | Forward and stepwise | $XF | 4 | 263 | 0.716±0.020 | 0.726±0.026 | 0.690±0.068 | 0.987±0.048 | 1.064±0.133 |
| Not | Oversampling | Yes | Forward and stepwise | $XF | 4 | 263 | 0.834±0.030 | 0.821±0.031 | 0.782±0.080 | 1.018±0.068 | 1.060±0.112 |
| Not | Oversampling | Yes | Backward | $XF | 7 | 263 | 0.864±0.028 | 0.856±0.022 | 0.813±0.127 | 1.010±0.040 | 1.088±0.261 |
| Yes | Not | Not | Not | $D | 16 | 315 | 0.725±0.012 | 0.678±0.047 | 0.703±0.051 | 1.074±0.087 | 0.973±0.131 |
| Yes | Not | Yes | Forward and stepwise | $XF | 5 | 315 | 0.812±0.024 | 0.760±0.048 | 0.757±0.056 | 1.073±0.089 | 1.008±0.091 |
| Yes | Not | Not | Forward and stepwise | $XF | 4 | 315 | 0.752±0.020 | 0.701±0.070 | 0.711±0.055 | 1.084±0.131 | 0.994±0.144 |
| Yes | Not | Not | Backward | $XF | 6 | 315 | 0.742±0.019 | 0.734±0.063 | 0.734±0.066 | 1.017±0.094 | 1.012±0.160 |
| Yes | Not | Yes | Backward | $B | 6 | 315 | 0.729±0.019 | 0.718±0.099 | 0.714±0.100 | 1.034±0.151 | 1.034±0.266 |
| Yes | Undersampling | Not | Not | $B | 16 | 199 | 0.785±0.032 | 0.811±0.087 | 0.778±0.063 | 0.980±0.126 | 1.052±0.170 |
| Yes | Undersampling | Yes | Forward and stepwise | $XF | 4 | 199 | 0.701±0.027 | 0.665±0.074 | 0.722±0.050 | 1.067±0.135 | 0.927±0.130 |
| Yes | Undersampling | Not | Forward and stepwise | $S | 4 | 199 | 0.685±0.022 | 0.658±0.069 | 0.702±0.053 | 1.053±0.137 | 0.946±0.146 |
| Yes | Undersampling | Yes | Backward | $S | 5 | 199 | 0.699±0.015 | 0.754±0.083 | 0.733±0.052 | 0.938±0.113 | 1.034±0.143 |
| Yes | Undersampling | Not | Backward | $KNN | 5 | 199 | 0.740±0.029 | 0.738±0.082 | 0.736±0.065 | 1.017±0.143 | 1.013±0.165 |
| Yes | Oversampling | Not | Not | $XF | 16 | 513 | 0.916±0.030 | 0.869±0.041 | 0.862±0.123 | 1.056±0.052 | 1.039±0.243 |
| Yes | Oversampling | Yes | Forward and stepwise | $B | 7 | 513 | 0.857±0.023 | 0.824±0.039 | 0.849±0.072 | 1.042±0.052 | 0.978±0.118 |
| Yes | Oversampling | Not | Forward and stepwise | $XF | 8 | 513 | 0.907±0.031 | 0.861±0.039 | 0.843±0.115 | 1.054±0.049 | 1.049±0.230 |
| Yes | Oversampling | Not | Backward | $XF | 9 | 513 | 0.907±0.024 | 0.871±0.030 | **0.866±0.082** | 1.041±0.036 | 1.017±0.134 |
| Yes | Oversampling | Yes | Backward | $B | 9 | 513 | 0.865±0.032 | 0.823±0.050 | 0.839±0.107 | 1.054±0.070 | 1.003±0.191 |

OF1 was calculated as AUC_TR / AUC_TE (the AUC of the set 1 training set divided by the AUC of the set 1 testing set), and OF2 as AUC_TE / AUC_Set 2 (the AUC of the set 1 testing set divided by the AUC of set 2).
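As an illustration of the two ratios defined above, the following minimal Python sketch computes OF1 and OF2 from three AUC values. The names `OverfittingRatios` and `overfitting_ratios` are hypothetical and do not come from the paper. The example feeds in the mean AUCs of the first table row; the result only approximates the tabulated OF values, presumably because those were averaged over repeated runs rather than computed from the mean AUCs (an assumption, since the footnote does not say).

```python
from typing import NamedTuple


class OverfittingRatios(NamedTuple):
    """Hypothetical container for the two overfitting ratios."""
    of1: float  # AUC_TR / AUC_TE
    of2: float  # AUC_TE / AUC_Set 2


def overfitting_ratios(auc_tr: float, auc_te: float, auc_set2: float) -> OverfittingRatios:
    """Compute OF1 and OF2 as defined in the table footnote.

    auc_tr   -- AUC on the set 1 training set
    auc_te   -- AUC on the set 1 testing set
    auc_set2 -- AUC on set 2
    """
    if auc_te <= 0 or auc_set2 <= 0:
        raise ValueError("AUC denominators must be positive")
    return OverfittingRatios(of1=auc_tr / auc_te, of2=auc_te / auc_set2)


# Mean AUCs from the first table row (no imputing, sampling, binning, or screening; $D).
ratios = overfitting_ratios(auc_tr=0.693, auc_te=0.621, auc_set2=0.737)
print(f"OF1 = {ratios.of1:.3f}, OF2 = {ratios.of2:.3f}")
```

For this row the sketch gives OF1 ≈ 1.116 and OF2 ≈ 0.843, close to the tabulated 1.125±0.122 and 0.849±0.111. Values near 1 indicate similar performance on the two data sets being compared.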

The bold value marks the maximum AUC_Set 2 among the 30 models (0.866±0.082).

AUC, area under the receiver operating characteristic curve; AUC_Set 2, AUC of set 2; AUC_TE, AUC of set 1 testing set; AUC_TR, AUC of set 1 training set; $B, Bayesian network; $D, discriminant model; $KNN, k-nearest neighbors (KNN) algorithm; $L, logistic regression model; $R, chi-squared automatic interaction detection (CHAID); $S, support vector machine (SVM); $XF, the ensemble model.