2022 Jun 13;7(3):e00192-22. doi: 10.1128/msystems.00192-22

TABLE 1.

Summary statistics of viral sequence recovery for the short-read assembly (SRa), long-read assembly (LRa), and raw-read (LR) data sets

| Statistic | Illumina assembly (SRa) | PacBio assembly (LRa) | PacBio CCS15 reads (LR) |
|---|---|---|---|
| Starting sequences | 149,018 | 19,982 | 1,535,891 |
| Putative phages (VIBRANT) | 10,979 | 947 | 50,296 |
| 95% identity clustering | 10,979 | 947 | 42,156 |
| Unique sequences^a | 5,886 | 36 | 30,203 |
| Nucleotides sequenced (Gbp) | 23.4 | 31.0 | 7.6 |
| Unique sequences/Gbp sequenced | 251.53 | 1.16 | 3,974 |
| Unique sequences (versus GOV2)^b | 4,196 | 35 | 26,766 |
| No. complete (high quality)^c | 9 (53) | 15 (114) | 0 (27) |
| Min–max sequence length (bp) | 1,000–188,349 | 1,353–428,169 | 1,011–17,836 |
| Avg sequence length (bp) | 4,906 | 32,260 | 5,261 |
| Min–max GC content (%) | 19.40–65.25 | 19.56–69.93 | 14.25–86.03 |
| Avg GC content (%) | 35.45 | 36.9 | 38.13 |
| Total proteins^d | 80,487 | 41,599 | 330,157 |
| Unique terminase (terL) proteins | 30 | 2 | 393 |
| Avg proteins/sequence | 7.33 | 43.92 | 7.83 |
| Avg protein length (aa) | 190.29 | 223.42 | 177.9 |
^a Sequences not present in the other data sets (BLASTN, 95% identity; coverage of at least 70% of the shorter sequence).

^b Sequences not present in the other data sets or the Global Ocean Virome 2.0 (GOV2) (BLASTN, 95% identity; coverage of at least 70% of the shorter sequence).

^c VIBRANT defines a high-quality sequence as one that likely contains the majority of a virus’s complete genome (~70% completeness).

^d Values shown here represent protein numbers after dereplication (CD-HIT, 95% identity).
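
The uniqueness criterion in footnotes a and b (BLASTN, 95%, coverage of at least 70% of the smallest sequence) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The `Hit` record, the function names, and the example hits are hypothetical. The sketch also assumes that coverage is the alignment length of a single hit divided by the length of the shorter of the two sequences, and that 95% refers to percent identity; the table does not state how multiple alignments between the same pair were combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise BLASTN hit (hypothetical record, not a BLAST API)."""
    query: str
    subject: str
    pident: float  # percent identity of the alignment
    aln_len: int   # alignment length in bp
    qlen: int      # query sequence length in bp
    slen: int      # subject sequence length in bp


def is_shared(hit, min_identity=95.0, min_coverage=0.70):
    """True if the hit meets both thresholds from the footnote.

    Assumption: coverage = alignment length / length of the shorter sequence.
    """
    shorter = min(hit.qlen, hit.slen)
    return hit.pident >= min_identity and hit.aln_len / shorter >= min_coverage


def unique_ids(query_ids, hits, **thresholds):
    """Return query IDs with no qualifying hit in any other data set."""
    shared = {h.query for h in hits if is_shared(h, **thresholds)}
    return [q for q in query_ids if q not in shared]


# Made-up hits for illustration only.
hits = [
    Hit("lr_1", "sra_7", 97.2, 4000, 5000, 12000),  # 80% of shorter: shared
    Hit("lr_2", "sra_9", 99.0, 1500, 5000, 3000),   # 50% of shorter: unique
    Hit("lr_3", "lra_2", 93.5, 4800, 5000, 30000),  # identity too low: unique
]
print(unique_ids(["lr_1", "lr_2", "lr_3", "lr_4"], hits))
# prints ['lr_2', 'lr_3', 'lr_4']
```

Under these assumptions, a sequence counts as unique when none of its hits passes both thresholds, which is why `lr_4`, with no hits at all, is also reported.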