TABLE 1.
Classification quality report for random forest (RF), gradient boosting trees (GBT), and support vector machine (SVM). Run time is the computational time needed to train the respective machine-learning algorithm. CV:F1-score is the result of cross-validation performed 25 times with a 9:1 split on the training data. Train:F1-score, Test:F1-score, GC-content, and Top 3 FI are reported for the best estimator of each method: the F1-score on the training and test sets, the importance rank of the GC-content feature, and the three most important sequence positions along with their importance values. All methods were trained on the same Train(56)-Test(7) division, with cross-validation on the training set (100 times 9:1 split). Nucleotide positions included as features were filtered to have an entropy of at least 0.2 bits, which resulted in 15 positions, in addition to GC-content (input vector: 15 × 4 + 1). Only the tree-based methods (RF, GBT) provide feature importance (FI).
| | RF | GBT | SVM |
|---|---|---|---|
| Run time (s) | 144 | 927 | 111 |
| CV:F1-score | 0.42 ± 0.19 | 0.5 ± 0.22 | 0.47 ± 0.21 |
| Train:F1-score | 0.58 | 0.89 | 0.88 |
| Test:F1-score | 0.62 | 0.46 | 0.14 |
| GC-content | 2nd | 1st | N.A. |
| Top 3 FI | −35: T: 0.24 | −34: T: 0.08 | N.A. |
| | −34: T: 0.10 | −35: A: 0.05 | N.A. |
| | −35: A: 0.10 | −14: G: 0.05 | N.A. |
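
The feature construction described in the caption (entropy filter, one-hot encoding of the retained positions, plus GC-content) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy sequences, and the choice to compute GC-content over the whole sequence are assumptions made for illustration only. The entropy is the Shannon entropy in bits, matching the 0.2-bit threshold in the caption.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # one-hot order; an assumption, any fixed order works


def column_entropy(column):
    """Shannon entropy (bits) of the nucleotides at one aligned position."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def select_positions(sequences, min_entropy=0.2):
    """Indices of aligned positions with entropy of at least min_entropy bits."""
    length = len(sequences[0])
    return [
        i for i in range(length)
        if column_entropy([s[i] for s in sequences]) >= min_entropy
    ]


def gc_content(sequence):
    """Fraction of G and C over the whole sequence (assumed definition)."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def encode(sequence, positions):
    """One-hot encode the selected positions and append GC-content."""
    vector = []
    for i in positions:
        vector.extend(1.0 if sequence[i] == base else 0.0 for base in ALPHABET)
    vector.append(gc_content(sequence))
    return vector


# Toy aligned sequences (hypothetical, not data from the study).
seqs = ["TTGACA", "TTGACA", "TTTACA", "TAGACG"]
positions = select_positions(seqs)          # invariant columns are dropped
features = [encode(s, positions) for s in seqs]
print(positions, len(features[0]))          # vector length: 4 * len(positions) + 1
```

With 15 retained positions, the same encoding gives a vector of length 15 × 4 + 1 = 61, which is the input size stated in the caption.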