Table 5. For each search term used to identify putatively important conserved protein domains, we show the number of domain descriptions that contain the term in each of four sets: (i) the starting set, (ii) the input to the Random Forest model, (iii) the top 50 features and (iv) the top 20 features after model fitting.
Note that the entries in a column do not sum to the n shown at the top of that column, because many descriptions contain multiple search terms.
| Search term | Starting set (n = 371) | Model input (n = 206) | Top 50 | Top 20 |
|---|---|---|---|---|
| integrase | 101 | 72 | 24 | 13 |
| excisionase | 5 | 4 | 2 | 0 |
| recombinase | 72 | 52 | 28 | 17 |
| transposase | 143 | 70 | 14 | 3 |
| lysogen | 23 | 10 | 2 | 1 |
| temperate | 11 | 10 | 3 | 0 |
| parA\|ParA\|parB\|ParB | 65 | 29 | 7 | 0 |
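
The counting behind a table of this kind can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes each search term is applied as a case-sensitive regular expression to the free-text domain description, and the function name `count_term_matches` and the toy descriptions are hypothetical. It also shows why per-term counts can exceed the number of descriptions, since one description may match several terms.

```python
import re

# Search terms as listed in the table; the last one is a regex alternation.
SEARCH_TERMS = [
    "integrase",
    "excisionase",
    "recombinase",
    "transposase",
    "lysogen",
    "temperate",
    "parA|ParA|parB|ParB",
]


def count_term_matches(descriptions, terms=SEARCH_TERMS):
    """Return {term: number of descriptions in which the term occurs}.

    Assumption: matching is case-sensitive, which is not confirmed
    by the source.
    """
    counts = {}
    for term in terms:
        pattern = re.compile(term)
        counts[term] = sum(1 for d in descriptions if pattern.search(d))
    return counts


# Hypothetical toy descriptions, NOT taken from the study's data.
toy_descriptions = [
    "phage integrase family, site-specific recombinase",
    "transposase DDE domain",
    "ParA-like partitioning ATPase",
    "excisionase-like protein",
]

counts = count_term_matches(toy_descriptions)
for term, n in counts.items():
    print(f"{term}: {n}")
```

In this toy input the first description matches both "integrase" and "recombinase", so the per-term counts add up to more than the number of descriptions.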