Table 1.
Accuracy of host prediction based on distance (d) between tetranucleotide frequencies of viral and microbial genomes
Distance (d) | Predicted | Host order: correct | Host order: ratio (%) | Host family: correct | Host family: ratio (%) | Host genus: correct | Host genus: ratio (%) |
---|---|---|---|---|---|---|---|
All reference sequences | | | | | | | |
d < 4 × 10⁻⁴ | 98 | 97 | 98.98 | 97 | 98.98 | 97 | 98.98 |
4 × 10⁻⁴ ≤ d < 1 × 10⁻³ | 10,173 | 9361 | 92.02 | 8971 | 88.18 | 5261 | 51.72 |
1 × 10⁻³ ≤ d | 2508 | 1872 | 74.64 | 1757 | 70.06 | 917 | 36.56 |
Host species excluded | | | | | | | |
d < 4 × 10⁻⁴ | 21 | 20 | 95.24 | 20 | 95.24 | 20 | 95.24 |
4 × 10⁻⁴ ≤ d < 1 × 10⁻³ | 10,003 | 9067 | 90.64 | 8372 | 83.69 | 2992 | 29.91 |
1 × 10⁻³ ≤ d | 2755 | 1981 | 71.91 | 1840 | 66.79 | 818 | 29.69 |
Host genus excluded | | | | | | | |
d < 4 × 10⁻⁴ | 1 | 0 | 0.00 | 0 | 0.00 | 0 | 0.00 |
4 × 10⁻⁴ ≤ d < 1 × 10⁻³ | 9085 | 7303 | 80.39 | 6181 | 68.04 | 0 | 0.00 |
1 × 10⁻³ ≤ d | 3693 | 1768 | 47.87 | 1388 | 37.58 | 0 | 0.00 |
For each viral genome, the order, family, and genus of its host were predicted from the taxonomy of the closest microbial genome (based on the mean absolute difference between tetranucleotide frequency vectors) and compared to the order, family, and genus of the actual host (i.e., the taxonomy of the genome with which the virus was identified). "Predicted" is the number of viral genomes whose closest microbial genome falls in the given distance range; each ratio is the number of correct predictions divided by this count. These predictions were computed (i) with all microbial genomes, (ii) after excluding all genomes from the host species, and (iii) after excluding all genomes from the host genus. Cases with a prediction accuracy above 75% are highlighted in gray.
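The prediction procedure described in the legend can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`tetra_freq`, `mean_abs_diff`, `predict_host`), the toy sequences, and the taxonomy labels are all hypothetical, and the sketch assumes single-strand counting of overlapping tetranucleotides, which the legend does not specify.

```python
from itertools import product

# All 256 tetranucleotides in a fixed order, so every vector has the same layout.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetra_freq(seq):
    """Return the 256-element tetranucleotide frequency vector of a sequence.

    Assumption: overlapping windows on the given strand only; windows that
    contain a symbol other than A, C, G, T are skipped.
    """
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(KMERS)


def mean_abs_diff(u, v):
    """Mean absolute difference between two equal-length frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def predict_host(virus_seq, genomes, exclude=None):
    """Return (distance, genome name, taxonomy) of the closest microbial genome.

    genomes: dict mapping a name to (sequence, taxonomy dict).
    exclude: optional predicate on the taxonomy dict, used to mimic the
             "host species excluded" and "host genus excluded" settings.
    """
    virus_vec = tetra_freq(virus_seq)
    best = None
    for name, (seq, taxonomy) in genomes.items():
        if exclude is not None and exclude(taxonomy):
            continue
        d = mean_abs_diff(virus_vec, tetra_freq(seq))
        if best is None or d < best[0]:
            best = (d, name, taxonomy)
    return best


# Toy data (hypothetical): two "microbial genomes" with very different composition.
genomes = {
    "genome_A": ("ACGT" * 50, {"genus": "GenusA"}),
    "genome_B": ("AATT" * 50, {"genus": "GenusB"}),
}
virus = "ACGT" * 20

closest = predict_host(virus, genomes)
closest_without_genus = predict_host(
    virus, genomes, exclude=lambda tax: tax["genus"] == "GenusA"
)
print(closest[1], closest_without_genus[1])
```

In a real analysis the microbial frequency vectors would be computed once and cached rather than recomputed for every virus. The distance returned by `predict_host` is the value that would then be binned into the ranges shown in the table.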