Table 2.
Seq. Identity cutoff | PPIs subject to prediction by SVM (RF)(a) | Recall by SVMloc(b) | Recall by RFloc(c) | Recall by SVMrand(b) | Recall by RFrand(c) |
---|---|---|---|---|---|
All PPIs | 3280 (2468) | 0.7466 (0.7057) | 0.8440 (0.8327) | 0.3250 (0.2565) | 0.9444 (0.9444) |
80% | 1123 (797) | 0.7266 (0.7114) | 0.7804 (0.7604) | 0.2654 (0.2597) | 0.9448 (0.9448) |
50% | 937 (660) | 0.7033 (0.6818) | 0.7758 (0.7561) | 0.2465 (0.2303) | 0.9470 (0.9470) |
30% | 825 (585) | 0.6909 (0.6667) | 0.7641 (0.7453) | 0.2364 (0.2239) | 0.9470 (0.9470) |
The PPIs were clustered at sequence identity cutoffs of 80%, 50%, and 30% to reduce redundancy from similar sequences. The sequence identity of protein pairs was computed with the Needleman-Wunsch (global sequence alignment) algorithm as implemented in the nwalign Python library. On this dataset, we evaluated recall, i.e., the fraction of PPIs in the dataset that were correctly predicted as interacting protein pairs. SVMs trained on PPIloc and PPIrand are named SVMloc and SVMrand, respectively; RFs trained on PPIloc and PPIrand are named RFloc and RFrand, respectively.

(a) RF with the eight features could be applied only to PPIs that have gene co-expression data available. The numbers in parentheses are the counts of such PPIs.

(b) Recall is the fraction of PPIs that were correctly predicted. The values in parentheses are SVM recall measured on the PPIs with co-expression data, i.e., the same dataset used for prediction by RF with the eight features.

(c) The values are the recall of RF using the eight features, including the gene expression features. Results for the four-feature combination, which uses only the functional association scores and the phylogenetic profile, are given in parentheses.
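
Because every pair in this dataset is a known PPI, recall reduces to the share of pairs that a classifier labels as interacting. The following is a minimal sketch of that calculation, including the restriction to PPIs with co-expression data that underlies the values in parentheses. The function names `recall` and `recall_on_subset` and the toy inputs are assumptions made for illustration; they are not taken from the original analysis and do not reproduce any value in the table.

```python
from typing import Iterable, Sequence


def recall(predicted_interacting: Iterable[bool]) -> float:
    """Fraction of known PPIs that a classifier labels as interacting.

    Every element corresponds to one known PPI, so each True is a true
    positive and each False is a false negative.
    """
    calls = list(predicted_interacting)
    if not calls:
        raise ValueError("recall is undefined for an empty set of PPIs")
    return sum(calls) / len(calls)


def recall_on_subset(
    predicted_interacting: Sequence[bool],
    has_coexpression: Sequence[bool],
) -> float:
    """Recall restricted to PPIs that have co-expression data available."""
    if len(predicted_interacting) != len(has_coexpression):
        raise ValueError("both sequences must describe the same PPIs")
    return recall(
        call
        for call, keep in zip(predicted_interacting, has_coexpression)
        if keep
    )


# Toy inputs (hypothetical, not data from the paper): eight known PPIs.
predicted = [True, True, False, True, True, False, True, True]
has_coexpr = [True, True, True, False, True, False, True, False]

print(recall(predicted))                       # 6 of 8 PPIs recovered -> 0.75
print(recall_on_subset(predicted, has_coexpr)) # 4 of 5 PPIs recovered -> 0.8
```

Evaluating both classifiers on the same subset, as in the second call, is what makes the SVM values in parentheses directly comparable with the RF values.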