Table 1. QSAR Model Validations on the External 5-Fold CV Sets As Well As the Additional Independent External Set from WOMBAT.
Confusion matrix columns: TP, TN, FP, FN. Statistics columns: SE, SP, EN(1), EN(2).

| machine learning method | external set | prediction CCR | N(1)^a | N(2)^a | TP | TN | FP | FN | SE | SP | EN(1) | EN(2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| k-nearest neighbor | 1 | 0.86 | 19^b | 14 | 17 | 11 | 3 | 2 | 0.89 | 0.79 | 1.61 | 1.76 |
| | 2 | 0.61 | 20 | 13 | 15 | 6 | 7 | 5 | 0.75 | 0.46 | 1.16 | 1.30 |
| | 3 | 0.77 | 22 | 11 | 20 | 7 | 4 | 2 | 0.91 | 0.64 | 1.43 | 1.75 |
| | 4 | 0.86 | 20 | 13 | 19 | 10 | 3 | 1 | 0.95 | 0.77 | 1.61 | 1.88 |
| | 5 | 0.68 | 23 | 10 | 22 | 4 | 6 | 1 | 0.96 | 0.40 | 1.23 | 1.80 |
| | Cumulative | 0.76 | 104 | 61 | 93 | 38 | 23 | 11 | 0.89 | 0.62 | 1.41 | 1.71 |
| | WOMBAT | N/A | 66 | 0 | 62 | N/A | N/A | 4 | 0.94 | N/A | N/A | N/A |
| random forest | 1 | 0.80 | 20 | 14 | 16 | 11 | 3 | 4 | 0.80 | 0.79 | 1.58 | 1.59 |
| | 2 | 0.68 | 20 | 13 | 15 | 8 | 5 | 5 | 0.75 | 0.62 | 1.32 | 1.42 |
| | 3 | 0.84 | 22 | 11 | 21 | 8 | 3 | 1 | 0.95 | 0.73 | 1.56 | 1.88 |
| | 4 | 0.74 | 20 | 13 | 19 | 7 | 6 | 1 | 0.95 | 0.54 | 1.35 | 1.83 |
| | 5 | 0.83 | 23 | 10 | 22 | 7 | 3 | 1 | 0.96 | 0.70 | 1.52 | 1.88 |
| | Cumulative | 0.78 | 105 | 61 | 93 | 41 | 20 | 12 | 0.89 | 0.67 | 1.46 | 1.71 |
| | WOMBAT | N/A | 66 | 0 | 62 | N/A | N/A | 4 | 0.94 | N/A | N/A | N/A |
| support vector machines | 1 | 0.87 | 20 | 14 | 19 | 11 | 3 | 1 | 0.95 | 0.79 | 1.36 | 1.88 |
| | 2 | 0.68 | 20 | 13 | 18 | 6 | 7 | 2 | 0.90 | 0.46 | 1.25 | 1.64 |
| | 3 | 0.95 | 22 | 11 | 22 | 10 | 1 | 0 | 1.00 | 0.91 | 1.83 | 2.00 |
| | 4 | 0.76 | 20 | 13 | 18 | 8 | 5 | 2 | 0.90 | 0.62 | 1.40 | 1.72 |
| | 5 | 0.76 | 23 | 10 | 21 | 6 | 4 | 2 | 0.91 | 0.60 | 1.39 | 1.75 |
| | Cumulative | 0.80 | 105 | 61 | 98 | 41 | 20 | 7 | 0.93 | 0.67 | 1.48 | 1.82 |
| | WOMBAT | N/A | 66 | 0 | 62 | N/A | N/A | 4 | 0.96 | N/A | N/A | N/A |
^a N(1) = number of actives, N(2) = number of inactives, TP = true positives (actives predicted as actives), FP = false positives (inactives predicted as actives), FN = false negatives (actives predicted as inactives), TN = true negatives (inactives predicted as inactives), SE = sensitivity = TP/N(1), SP = specificity = TN/N(2), EN = normalized enrichment, EN(1) = (2TP × N(2))/(TP × N(2) + FP × N(1)), EN(2) = (2TN × N(1))/(TN × N(1) + FN × N(2)), and CCR = correct classification rate.
^b Some of the N(1) actives and N(2) inactives were outside the applicability domain (AD) of all consensus models and therefore had no prediction. Only compounds found within the AD were used for the statistical summaries.
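The definitions in the table footnote can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the names `ConfusionCounts` and `validation_stats` are hypothetical, N(1) and N(2) are assumed to equal TP + FN and TN + FP for compounds inside the AD, and CCR is assumed to be the mean of SE and SP (a common convention; the footnote gives no formula for it). Statistics with a zero denominator are returned as `None`, mirroring the N/A entries of the WOMBAT rows.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion-matrix counts for one external set (hypothetical helper)."""
    tp: int  # actives predicted as actives
    tn: int  # inactives predicted as inactives
    fp: int  # inactives predicted as actives
    fn: int  # actives predicted as inactives


def _ratio(num: float, den: float) -> Optional[float]:
    # Return None instead of dividing by zero (shown as N/A in the table).
    return num / den if den else None


def validation_stats(c: ConfusionCounts) -> Dict[str, Optional[float]]:
    # Assumption: every compound counted here lies inside the AD,
    # so N(1) = TP + FN and N(2) = TN + FP.
    n1 = c.tp + c.fn
    n2 = c.tn + c.fp
    se = _ratio(c.tp, n1)  # SE = TP / N(1)
    sp = _ratio(c.tn, n2)  # SP = TN / N(2)
    # EN(1) = 2 TP N(2) / (TP N(2) + FP N(1))
    en1 = _ratio(2 * c.tp * n2, c.tp * n2 + c.fp * n1)
    # EN(2) = 2 TN N(1) / (TN N(1) + FN N(2))
    en2 = _ratio(2 * c.tn * n1, c.tn * n1 + c.fn * n2)
    # Assumption: CCR = (SE + SP) / 2; not spelled out in the footnote.
    ccr = (se + sp) / 2 if se is not None and sp is not None else None
    return {"N(1)": n1, "N(2)": n2, "SE": se, "SP": sp,
            "EN(1)": en1, "EN(2)": en2, "CCR": ccr}


if __name__ == "__main__":
    # Counts taken from the random forest row for external set 3.
    stats = validation_stats(ConfusionCounts(tp=21, tn=8, fp=3, fn=1))
    for name, value in stats.items():
        shown = "N/A" if value is None else round(value, 2)
        print(f"{name}: {shown}")
```

Run on the random forest counts for external set 3, the sketch gives SE, SP, EN(1), EN(2), and CCR values that round to the entries in that row of the table.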