2020 May 26;16(5):e1007894. doi: 10.1371/journal.pcbi.1007894

Table 1. The 20 feature sets generated from the four representations of the viral genomes.

| Genome representation | [Letter set] (Alphabet size) | K-mer lengths tested (k) | Feature set size for maximum k |
|---|---|---|---|
| DNA | [A,C,G,T] (4) | 1–9 | 262,144 |
| AA | [Amino acid single-letter code] (20) | 1–4 | 160,000 |
| PC | [t-z] (7) | 1–6 | 117,649 |
| Domains | [All domains predicted within each dataset] | 1 | Total number of unique domains in all the viral genomes in VHDB: 2,200 |

Genome representations. DNA: nucleotide sequence; AA: amino acid sequence of CDS regions; PC: physio-chemical properties, with each amino acid residue binned into one of seven bins based on its physio-chemical property; Domains: presence of Pfam domains in the sequence.
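The feature set sizes in Table 1 follow from the alphabet size raised to the k-mer length (for example, 4^9 = 262,144 for DNA). The following is a minimal Python sketch of that arithmetic, together with one possible way to turn a sequence into a k-mer feature vector. The function names, the use of overlapping windows, the skipping of windows containing symbols outside the alphabet, and the normalisation to relative frequencies are all assumptions made for illustration; the table itself does not specify these details.

```python
from collections import Counter
from itertools import product

# Alphabets and maximum k per representation, as listed in Table 1.
# The concrete letters for AA (standard 20 residues) and PC (t..z) are
# spelled out here for illustration.
ALPHABETS = {
    "DNA": ("ACGT", 9),
    "AA": ("ACDEFGHIKLMNPQRSTVWY", 4),
    "PC": ("tuvwxyz", 6),
}


def feature_set_size(alphabet_size, k):
    """Number of distinct k-mers over an alphabet: alphabet_size ** k."""
    return alphabet_size ** k


def kmer_counts(sequence, k, alphabet):
    """Count overlapping k-mers (assumption), skipping any window that
    contains a symbol outside the alphabet (assumption)."""
    allowed = set(alphabet)
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= allowed:
            counts[kmer] += 1
    return counts


def feature_vector(sequence, k, alphabet):
    """Relative k-mer frequencies in a fixed order over all possible k-mers.
    Normalising by the total count is an assumption, not taken from the table."""
    counts = kmer_counts(sequence, k, alphabet)
    total = sum(counts.values())
    keys = ("".join(p) for p in product(alphabet, repeat=k))
    return [counts[key] / total if total else 0.0 for key in keys]


# Feature set size at the maximum k for each k-mer based representation.
sizes = {name: feature_set_size(len(alpha), k)
         for name, (alpha, k) in ALPHABETS.items()}

# One feature set per tested k (9 + 4 + 6), plus one for Domains.
n_feature_sets = sum(k for _, k in ALPHABETS.values()) + 1
```

Here `sizes` reproduces the last column of the table for the three k-mer based representations, and `n_feature_sets` gives the 20 feature sets named in the caption. The Domains representation is not a k-mer count, so it is only included in the total.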