Table 1. Number of sequences in each dataset after CD-HIT filtering at different sequence identity thresholds.
Sequence identity threshold | Positive data (training set) | Positive data (independent test set) | Negative data |
100% (original) | 283 | 99 | 19512 |
90% | 274 | 94 | 18897 |
80% | 268 | 94 | 18447 |
70% | 256 | 94 | 17727 |
60% | 242 | 88 | 16710 |
50% | 226 | 82 | 15255 |
40% | 202 | 80 | 13333 |
30% | 173 | 65 | 11113 |
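
To make the effect of each threshold easier to compare, the counts in Table 1 can be expressed as the fraction of the original (100%) data that is retained. The following is a minimal sketch of that arithmetic only; the names `counts` and `retention` are illustrative and do not come from the original work, and the script does not run CD-HIT itself.

```python
# Counts copied from Table 1, keyed by sequence identity threshold (%):
# (training positives, independent test positives, negatives)
counts = {
    100: (283, 99, 19512),
    90: (274, 94, 18897),
    80: (268, 94, 18447),
    70: (256, 94, 17727),
    60: (242, 88, 16710),
    50: (226, 82, 15255),
    40: (202, 80, 13333),
    30: (173, 65, 11113),
}


def retention(threshold):
    """Return the fraction of the original sequences kept at `threshold`.

    The result is a tuple in the same column order as Table 1.
    (Hypothetical helper, introduced here for illustration.)
    """
    original = counts[100]
    return tuple(n / base for n, base in zip(counts[threshold], original))


# Print one row per threshold, from least to most stringent.
for t in sorted(counts, reverse=True):
    train, test, neg = retention(t)
    print(f"{t:>3}% | {train:.3f} | {test:.3f} | {neg:.3f}")
```

Read this way, the table shows how much of each dataset is lost as the identity threshold is tightened, which can help when weighing redundancy reduction against the size of the remaining positive set.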