2021 Oct 14;1:747428. doi: 10.3389/fbinf.2021.747428

TABLE 1.

Classification quality report for random forest (RF), gradient boosting trees (GBT), and support vector machine (SVM). Run time is the computational time needed to train the respective machine-learning algorithm. CV:F1-score is the result of cross-validation performed 25 times with a 9:1 split on the training data. Train:F1-score and Test:F1-score are the F1-scores of the best respective estimator on the training and test sets; GC-content is the importance rank of the GC-content feature; and Top 3 FI lists the three most important sequence positions along with their importance values. Training was performed with the same Train(56)-Test(7) division, with cross-validation on the training set (100 times 9:1 split). Nucleotides included as features were filtered to positions with an entropy of at least 0.2 bits, which resulted in 15 positions in addition to GC-content (input vector: 15 × 4 + 1). Only the tree-based methods (RF, GBT) provide feature importance (FI).

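|                | RF           | GBT          | SVM         |
|----------------|--------------|--------------|-------------|
| Run time (s)   | 144          | 927          | 111         |
| CV:F1-score    | 0.42 ± 0.19  | 0.5 ± 0.22   | 0.47 ± 0.21 |
| Train:F1-score | 0.58         | 0.89         | 0.88        |
| Test:F1-score  | 0.62         | 0.46         | 0.14        |
| GC-content     | 2nd          | 1st          | N.A.        |
| Top 3 FI       | −35: T: 0.24 | −34: T: 0.08 | N.A.        |
|                | −34: T: 0.10 | −35: A: 0.05 | N.A.        |
|                | −35: A: 0.10 | −14: G: 0.05 | N.A.        |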
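
The feature construction described in the caption (entropy filter at 0.2 bits, four one-hot values per retained position, plus GC-content) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy sequences, the A/C/G/T column order, and the choice to compute GC-content over the whole sequence are all assumptions made for illustration.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"  # assumed column order for the one-hot encoding


def position_entropy(column):
    """Shannon entropy (in bits) of the nucleotides seen at one aligned position."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def select_positions(sequences, min_entropy=0.2):
    """Return indices of positions whose entropy is at least min_entropy bits."""
    length = len(sequences[0])
    return [
        i for i in range(length)
        if position_entropy([s[i] for s in sequences]) >= min_entropy
    ]


def gc_content(sequence):
    """Fraction of G and C in the sequence (assumed: whole sequence is used)."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def encode(sequence, positions):
    """One-hot encode the selected positions (4 values each), then append GC-content."""
    vector = []
    for i in positions:
        vector.extend(1.0 if sequence[i] == base else 0.0 for base in NUCLEOTIDES)
    vector.append(gc_content(sequence))
    return vector


# Toy aligned sequences, made up for illustration (not data from the study).
toy = ["TTGACA", "TTGACT", "TAGACA", "TTGCCA"]
kept = select_positions(toy)
features = [encode(s, kept) for s in toy]
print(kept, len(features[0]))
```

With 15 retained positions, the same encoding gives vectors of length 15 × 4 + 1 = 61, matching the input vector size stated in the caption.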