Table 1. The 20 feature sets generated from the four representations of the viral genomes.
Genome representation | [Letter set] (Alphabet size) | K-mer lengths tested (k) | Feature set size for maximum k |
---|---|---|---|
DNA | [A,C,G,T] (4) | [1–9] | 262,144 |
AA | [Amino Acid single letter code] (20) | [1–4] | 160,000 |
PC | [t-z] (7) | [1–6] | 117,649 |
Domains | [All domains predicted within each dataset] | 1 | 2,200 (total number of unique domains across all viral genomes in VHDB) |
Genome representation: DNA: nucleotide sequence; AA: amino acid sequence of CDS regions; PC: physicochemical properties, with each amino acid residue assigned to one of seven bins according to its physicochemical properties; Domains: presence of Pfam domains in the sequence.
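
For the k-mer representations, the feature set size in the last column is the alphabet size raised to the power k, and the 20 feature sets are the 9 + 4 + 6 k-mer lengths plus the single domain set. The following is a minimal sketch of that arithmetic and of overlapping k-mer counting, not the authors' implementation; the names `REPRESENTATIONS`, `feature_set_size`, and `kmer_counts` are hypothetical, and the sliding-window counting is an assumption about how the k-mers are extracted.

```python
from collections import Counter

# Alphabet size and maximum k for each k-mer representation, taken from Table 1.
REPRESENTATIONS = {"DNA": (4, 9), "AA": (20, 4), "PC": (7, 6)}


def feature_set_size(alphabet_size: int, k: int) -> int:
    """Number of distinct k-mers over an alphabet of the given size."""
    return alphabet_size ** k


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers with a sliding window (assumed extraction scheme)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Feature set size at the maximum k for each representation.
sizes = {name: feature_set_size(a, k) for name, (a, k) in REPRESENTATIONS.items()}

# One feature set per tested k, plus one for the Domains representation.
n_feature_sets = sum(k for _, k in REPRESENTATIONS.values()) + 1

print(sizes)
print(n_feature_sets)
print(kmer_counts("ACGAC", 2))
```

Applied to a toy sequence such as `"ACGAC"` with k = 2, `kmer_counts` yields one count per observed 2-mer; in a real feature vector, k-mers that never occur would have a count of zero.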