Table 1.
Report section | Method | ML Classifier | HEAF feature space | Rank | Software | Optimized metric | Tested hyperparameter space | Selected number of features or hyperparameter settings on outer folds 1–5 | Accuracy# [min–max; %] | ME | AUC | BS | LL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Human Expert-Annotated Features (HEAF) | CART | CT | p = 28 (all) | | rpart [R] | ACC | rpart.control = default; cp = 0.01, no optimization (no pruning) | 28 | 73.3 [66.7–79.2] | 0.27 | 0.63 | 0.37 | 0.87 |
HEAF | vRF | RF | p = 28 (all) | 4 | randomForest [R] | ME | ntree = 500, mtry = 5, pvarsel = 28 | 28 | 81.5 [73.8–92.7] | 0.18 | 0.82 | 0.27 | 0.44 |
HEAF | vRF | RF | p = 28 (all) | | randomForest [R] | ME | ntree = 500, mtry = 5, pvarsel = 9 | 9 | 71.0 [59.5–82.9] | 0.29 | 0.69 | 0.37 | 0.56 |
HEAF | vRF | RF | p = 28 (all) | | randomForest [R] | ME | ntree = 500, mtry = 5, pvarsel = 5 | 5 | 75.2 [68.3–83.3] | 0.25 | 0.69 | 0.36 | 0.54 |
HEAF | tRFBS | RF | p = 28 (all) | 2 | randomForest [R] | BS | ntree = [100, 200, 300, … , 900, 1000] | 28, 14, 14, 14, 14 | 83.1 [76.2–90.2] | 0.17 | 0.81 | 0.27 | 0.44 |
HEAF | tRFME | RF | p = 28 (all) | | randomForest [R] | ME | mtry = [3, 4, 5, 6, 7] | 28, 28, 14, 5, 14 | 79.6 [68.3–90.2] | 0.20 | 0.79 | 0.29 | 0.46 |
HEAF | tRFLL | RF | p = 28 (all) | 2 | randomForest [R] | LL | pvarsel = [3, 5, 10, 14, 20, 25, 28] | 25, 14, 14, 14, 14 | 83.1 [76.2–90.2] | 0.17 | 0.81 | 0.27 | 0.44 |
HEAF | ELNET | ELNET | p = 28 (all) | 3 | glmnet [R] | ME | α = [0, 0.1, 0.2, … , 0.8, 0.9, 1]; λ = tenfold CV with default hot-start | α = [0.1, 0.8, 0, 1, 0.1]; λ = [0.195, 0.0688, 0.208, 0.0301, 0.1632] | 82.0 [78.6–85.4] | 0.18 | 0.79 | 0.27 | 0.43 |
HEAF | SVM-LK | SVM | p = 28 (all) | 1 | e1071 [R] | ME | C = [0.001, 0.01, 0.1, 1, 10, 100, 1000] | C = [1, 1, 100, 10, 10] | 87.4 [82.9–90.2] | 0.13 | 0.79 | 0.22 | 0.37 |
HEAF | XGBoost | BT | p = 28 (all) | 5 | xgboost [R] | ME | nrounds/ntree = 100; max_depth = [3, 5, 6, 8]; eta = [0.1, 0.3]; gamma = [0, 0.5, 1.0]; colsample_bytree = [0.1, 0.25, 0.5, 0.693 (ln2) ~ RF, 1] | nrounds = 100; max_depth = [5, 3, 5, 8, 3]; eta = [0.1, 0.1, 0.1, 0.3, 0.1]; gamma = [0, 0.5, 1, 0.5, 1]; colsample_bytree = [1, 1, 0.5, 1, 0.5] | 80.6 [75.0–85.7] | 0.19 | 0.70 | 0.30 | 0.48 |
Accuracy#: averaged fivefold CV accuracy; ACC: accuracy; AUC: multiclass area under the ROC curve after Hand and Till (can only be calculated if the class probabilities are scaled to sum to 1); BS: Brier score; ME: misclassification error; LL: multiclass log loss; vRF and tRF: vanilla and tuned random forests; ELNET: elastic net penalized multinomial logistic regression; SVM: support vector machine; LK: linear kernel; XGBoost: extreme gradient boosting using trees as base learners; BT: boosted trees; CART: classification and regression trees; CT: classification tree; cp: complexity parameter used for CART node splitting (no optimization, i.e. pruning, was performed for it); ln(2) ~ RF: column sampling (i.e. bootstrap) setting equivalent to running RF in the xgboost library; [R]: R statistical software environment.
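For orientation, the three probability-based columns (ME, BS, LL) can be illustrated with a minimal sketch. It assumes the standard textbook definitions: ME as the fraction of samples whose highest-probability class is wrong, the multiclass Brier score as the mean summed squared error against the one-hot truth, and log loss as the mean negative log-probability of the true class. The source does not spell out its formulas, so these definitions are assumptions. The table's analyses were run in R; the Python below, its function names, and the toy data are hypothetical and for illustration only.

```python
import math

def misclassification_error(probs, labels):
    """Fraction of samples whose highest-probability class is not the true label."""
    wrong = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=p.__getitem__)  # argmax, first index on ties
        if predicted != y:
            wrong += 1
    return wrong / len(labels)

def brier_score(probs, labels):
    """Multiclass Brier score (assumed form): mean over samples of the
    summed squared difference between probabilities and the one-hot truth."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)

def log_loss(probs, labels, eps=1e-15):
    """Multiclass log loss (assumed form): mean negative log-probability
    of the true class, clipped at eps to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += math.log(min(max(p[y], eps), 1.0))
    return -total / len(labels)

# Hypothetical toy data: 4 samples, 3 classes, each row sums to 1.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
]
labels = [0, 1, 0, 2]

me = misclassification_error(probs, labels)
bs = brier_score(probs, labels)
ll = log_loss(probs, labels)
print(f"ME={me:.2f}  ACC={1 - me:.2f}  BS={bs:.3f}  LL={ll:.3f}")
```

Under these definitions ME is the complement of accuracy, which matches the table (for example, 87.4% accuracy alongside ME = 0.13 for SVM-LK). BS and LL also penalize over-confident wrong probabilities, so two classifiers with similar ME can still differ on them.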