Mar Drugs. 2019 Jan 29;17(2):81. doi: 10.3390/md17020081

Table 2.

Summary of the performance (accuracy) of our best binary classifiers after removing 39 highly correlated variables (Pearson r > 0.9, multicollinearity), applying recursive feature elimination, and/or tuning hyperparameters. All models were built on a training set (300 observations) with stratified 10-fold cross-validation and evaluated on an external test set (32 observations).
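The workflow in this caption (correlation filter, recursive feature elimination, stratified 10-fold cross-validation) maps directly onto scikit-learn, whose hyperparameter names the table already uses. The following is a minimal sketch of that pipeline, not the authors' code: the synthetic data, the `drop_correlated` helper, the variable names, and the choice of logistic regression as the RFECV estimator are all illustrative assumptions; only the 0.9 Pearson cutoff, RFECV, and the 10-fold stratified CV come from the caption.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Stand-in data: 332 observations = 300 training + 32 external test, 200 variables.
X_arr, y = make_classification(n_samples=332, n_features=200, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"v{i}" for i in range(X_arr.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=32, stratify=y, random_state=0
)

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one variable from every pair with |Pearson r| > threshold."""
    corr = df.corr().abs()
    # Upper triangle (k=1) so each pair is inspected exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return df.drop(columns=[c for c in upper.columns if (upper[c] > threshold).any()])

X_reduced = drop_correlated(X_train)

# Recursive feature elimination with stratified 10-fold cross-validation (RFECV).
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=cv, scoring="accuracy")
rfecv.fit(X_reduced, y_train)
print("variables kept after RFECV:", rfecv.n_features_)
```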

| Model | Number of Variables | Hyperparameters | Model Accuracy | External Test Accuracy |
|---|---|---|---|---|
| **After Recursive Feature Elimination with Cross-Validation (RFECV)** | | | | |
| logistic regression | 135 | default | 0.82 | 0.66 |
| decision tree | 200 | default | 0.75 | 0.69 |
| random forest | 154 | default | 0.82 | 0.78 |
| gradient boosting | 160 | default | 0.78 | 0.72 |
| **After Tuning Hyperparameters** | | | | |
| logistic regression ¹ | 200 | l1, 1 | 0.81 ± 0.06 | 0.63 ± 0.18 |
| decision tree ² | 200 | auto, 5, 1 | 0.70 ± 0.05 | 0.56 ± 0.20 |
| random forest ³ | 200 | sqrt, 400, 5, 2 | 0.80 ± 0.07 | 0.75 ± 0.20 |
| gradient boosting ⁴ | 200 | log2, 200, 10, 4 | 0.78 ± 0.07 | 0.69 ± 0.20 |
| **After Tuning Hyperparameters and RFECV** | | | | |
| logistic regression ¹ | 18 | l1, 1 | 0.82 | 0.63 |
| decision tree ² | 156 | auto, 5, 1 | 0.77 | 0.63 |
| random forest ³ | 150 | sqrt, 400, 5, 2 | 0.82 | 0.72 |
| gradient boosting ⁴ | 162 | log2, 200, 10, 4 | 0.82 | 0.75 |
| **After Multicollinearity and RFECV** | | | | |
| logistic regression | 88 | default | 0.82 | 0.66 |
| decision tree | 2 | default | 0.70 | 0.63 |
| random forest | 127 | default | 0.79 | 0.78 |
| gradient boosting | 69 | default | 0.78 | 0.69 |
| **After Multicollinearity and Tuning Hyperparameters** | | | | |
| logistic regression ¹ | 161 | l2, 10 | 0.81 ± 0.06 | 0.78 ± 0.18 |
| decision tree ² | 161 | auto, 6, 1 | 0.74 ± 0.04 | 0.60 ± 0.20 |
| random forest ³ | 161 | log2, 700, 5, 2 | 0.80 ± 0.07 | 0.80 ± 0.20 |
| gradient boosting ⁴ | 161 | log2, 300, 15, 4 | 0.80 ± 0.07 | 0.80 ± 0.20 |
| **After Multicollinearity, Tuning Hyperparameters and RFECV** | | | | |
| logistic regression ¹ | 99 | l2, 10 | 0.83 | 0.72 |
| decision tree ² | 85 | auto, 6, 1 | 0.76 | 0.63 |
| random forest ³ | 150 | log2, 700, 5, 2 | 0.82 | 0.72 |
| gradient boosting ⁴ | 161 | log2, 300, 15, 4 | 0.80 | 0.78 |

Hyperparameters tuned for the ¹ logistic regression {penalty, cost C}, ² decision tree {max_features, max_depth, min_samples_leaf}, ³ random forest {max_features, n_estimators, max_depth, min_samples_leaf}, and ⁴ gradient boosting {max_features, n_estimators, max_depth, min_samples_leaf} classifiers.
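The hyperparameter sets in this footnote match scikit-learn estimator parameters, which suggests a cross-validated grid search. Below is a hedged sketch for the random forest (footnote ³), reusing `X_reduced` and `y_train` from the sketch above; the `GridSearchCV` wrapper and the candidate grid values are assumptions, since the table reports only the winning settings (sqrt, 400, 5, 2).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Illustrative grid: parameter names follow footnote 3; the candidate values
# are guesses chosen to include the winning settings reported in the table.
param_grid = {
    "max_features": ["sqrt", "log2"],
    "n_estimators": [200, 400, 700],
    "max_depth": [5, 10, 15],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    n_jobs=-1,
)
search.fit(X_reduced, y_train)
print(search.best_params_, f"CV accuracy: {search.best_score_:.2f}")
```

One caveat when reproducing the decision-tree rows: the max_features value "auto" reported there is scikit-learn's old alias (equivalent to "sqrt" for classifiers); it was later deprecated and removed, so recent releases require "sqrt" or "log2" explicitly.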