2020 Mar 9;8(1):e001055. doi: 10.1136/bmjdrc-2019-001055

Table 4.

The impact of modeling approaches on predictive indicators

| Approaches | AUC_TR: univariate P value | AUC_TR: univariate MMD/R | AUC_TR: multivariate* P value | AUC_TR: multivariate SE | AUC_TE: univariate P value | AUC_TE: univariate MMD/R | AUC_TE: multivariate P value | AUC_TE: multivariate SE | AUC_Set 2: univariate P value | AUC_Set 2: univariate MMD/R | AUC_Set 2: multivariate P value | AUC_Set 2: multivariate SE | OF1: univariate P value | OF1: univariate MMD/R | OF1: multivariate P value | OF1: multivariate SE | OF2: univariate P value | OF2: univariate MMD/R | OF2: multivariate P value | OF2: multivariate SE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Imputing or not | <0.0001 | 0.0793 | 0.0019 | 0.3020 | <0.0001 | 0.0605 | 0.0813 | −0.2082 | <0.0001 | 0.0759 | 0.1719 | −0.1687 | 0.0394 | 0.0231 | 0.6812 | −0.0635 | 0.1084† | 0.0203 | 0.6829 | −0.0634 |
| Sampling methods | <0.0001 | 0.1695 | 0.0141 | 0.2757 | <0.0001 | 0.1620 | 0.2980 | −0.1436 | <0.0001 | 0.1349 | 0.0840 | −0.2471 | 0.7884† | 0.0145 | 0.3667 | −0.1615 | 0.5764† | 0.0440 | 0.6617 | 0.0786 |
| Binning or not | 0.6441† | 0.0053 | 0.7135 | −0.0149 | 0.7837† | 0.0009 | 0.8834 | −0.0073 | 0.8258† | 0.0028 | 0.7578 | 0.0160 | 0.9188† | 0.0066 | 0.9212 | −0.0064 | 0.6672† | 0.0036 | 0.7668 | −0.0193 |
| Screening methods | 0.0119 | 0.0489 | 0.0338 | 0.1102 | 0.0042 | 0.0541 | 0.0024 | 0.1950 | 0.0091 | 0.0513 | 0.0352 | 0.1394 | 0.2277‡ | 0.0242 | 0.1343 | −0.1242 | 0.6512† | 0.0213 | 0.4484 | 0.0631 |
| Model methods | <0.0001 | 0.2015 | <0.0001 | 0.2734 | <0.0001 | 0.1654 | <0.0001 | 0.2864 | <0.0001 | 0.1739 | <0.0001 | 0.2025 | 0.0143 | 0.1166 | 0.5617 | −0.0386 | 0.0271 | 0.1274 | 0.1616 | 0.0936 |
| Num of Variables | <0.0001§ | 0.6121 | <0.0001 | 0.2905 | <0.0001§ | 0.5197 | <0.0001 | 0.3360 | <0.0001§ | 0.5147 | <0.0001 | 0.3134 | 0.2593§ | 0.0653 | 0.4035 | −0.0739 | 0.2219§ | −0.0707 | 0.7425 | 0.0292 |
| Num of Samples¶ | <0.0001§ | 0.6716 | <0.0001 | 0.9020 | <0.0001§ | 0.5083 | 0.0022 | 0.5584 | <0.0001§ | 0.4949 | <0.0001 | 0.3655 | 0.0722§ | 0.1039 | 0.1972 | 0.3024 | 0.5946§ | −0.0308 | 0.8326 | −0.0497 |

Oversampling led to abnormal distributions in all five indexes.

Bold values mark the approach parameters that significantly affect the predictive indicators.

*Multiple linear regression was used for multivariate analysis.

†Kruskal-Wallis test.

‡General linear models for analysis of variance (ANOVA).

§Spearman correlation analysis.

¶The variance inflation factor (VIF) of the variable ‘Num of Samples’ in the multiple regression model is 16.4146 (greater than 10), which indicates possible multicollinearity that may make the model unstable; this variable may be severely collinear with imputing, binning, and sampling, so the multiple linear regression (MLR) model was re-established after these three variables were eliminated.

AUC, area under the receiver operating characteristic curve; AUC_Set 2, AUC of set 2; AUC_TE, AUC of set 1 testing set; AUC_TR, AUC of set 1 training set; MMD, maximum mean difference among levels; R, correlation coefficient; SE, standardized estimate.
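
The VIF check described in footnote ¶ can be illustrated with a minimal sketch. It assumes the usual definition VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors with an intercept. The function names (`solve_linear_system`, `r_squared`, `vif`) and the toy data are hypothetical and are not taken from the study; only the threshold of 10 comes from the footnote.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(y, predictors):
    """R^2 of an ordinary least squares fit of y on the predictors plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns, j):
    """Variance inflation factor of predictor j given all predictor columns."""
    others = [col for idx, col in enumerate(columns) if idx != j]
    return 1.0 / (1.0 - r_squared(columns[j], others))


# Toy data (made up for illustration): b is almost exactly 2 * a.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
c = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

for name, j in (("a", 0), ("b", 1), ("c", 2)):
    value = vif([a, b, c], j)
    flag = "possible multicollinearity" if value > 10 else "ok"
    print(f"VIF({name}) = {value:.2f} ({flag})")
```

In this toy example, `a` and `b` exceed the threshold of 10 because each is almost a linear function of the other. Dropping one of the collinear predictors and refitting, as the footnote describes for the MLR model, is one common remedy.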