Author manuscript; available in PMC: 2023 May 22.
Published in final edited form as: Curr Opin Virol. 2022 Jan 17;53:101200. doi: 10.1016/j.coviro.2022.101200

Table 1.

Recommendations for the questions, biases, and pitfalls posed in each section.

Sweeping contamination under the rug: balancing recovery and false discovery
All software tools that predict viruses from metagenomes can make mistakes
1. Using multiple virus prediction tools and combining their results can strengthen predictions by mitigating the biases and pitfalls of each individual tool
2. In published work, report all parameters and thresholds used for predicting viruses, including methods of manual curation
3. Selecting low thresholds when running software, or retaining low-probability predictions, will often generate “more data” at the expense of data quality (i.e., more contamination)
4. Read the tool’s publication (if available) in addition to the software documentation to best understand the tool’s utility, pitfalls, and performance benchmarks
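As one illustration of recommendation 1, the sketch below keeps only contigs that at least a chosen number of tools call viral. It is a minimal example rather than a recommended pipeline: the function name `consensus_predictions`, the `min_tools` parameter, the tool names, and the contig IDs are all assumptions made for illustration, and each tool is assumed to have already been run with its own documented threshold.

```python
from collections import Counter


def consensus_predictions(tool_calls, min_tools=2):
    """Return contig IDs called viral by at least `min_tools` tools.

    tool_calls: dict mapping a tool name to the collection of contig IDs
    that tool predicted as viral (after applying that tool's threshold).
    """
    votes = Counter()
    for contigs in tool_calls.values():
        # set() guards against a contig being listed twice by one tool.
        votes.update(set(contigs))
    return {contig for contig, n in votes.items() if n >= min_tools}


# Hypothetical tool names and contig IDs, for illustration only.
calls = {
    "tool_a": {"contig_1", "contig_2", "contig_3"},
    "tool_b": {"contig_2", "contig_3"},
    "tool_c": {"contig_3", "contig_4"},
}
consensus = consensus_predictions(calls, min_tools=2)
print(sorted(consensus))
```

Raising `min_tools` trades recovery for a lower false discovery rate; whichever value is chosen should be reported alongside the per-tool thresholds, as recommendation 2 asks.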
Of reference and reality
The reliance of most software tools on reference databases is a source of bias
1. Consider homology searches against other curated databases, in addition to NCBI databases, when reporting novel sequences or gene features
The reference-free fallacy: no such thing as a reference-free virus prediction
No current tool for predicting virus sequences is reference-free
1. Repeatedly training tools on NCBI databases has led to overlap between the training and testing datasets of different tools, making unbiased benchmarks increasingly difficult to perform. Including non-NCBI databases when training, testing, and curating databases can reduce this bias
2. Avoid assuming that database-independent machine learning models, whether trained on protein annotations or nucleotide features, overcome the need for reference-based searches
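The overlap described in recommendation 1 can be checked crudely by comparing sequence identifiers between a training set and a benchmark set. The sketch below assumes identifiers of the form `ID.version`; the helper names and the example IDs are made up for illustration. Shared identifiers catch only the most obvious overlap, since near-identical sequences stored under different identifiers would not be detected.

```python
def strip_version(accession):
    """Drop a trailing '.N' version suffix so that ID.1 and ID.2 match."""
    return accession.strip().split(".")[0]


def accession_overlap(train_ids, test_ids):
    """Return the shared IDs and the fraction of the test set they cover."""
    train = {strip_version(a) for a in train_ids}
    test = {strip_version(a) for a in test_ids}
    shared = train & test
    fraction = len(shared) / len(test) if test else 0.0
    return shared, fraction


# Made-up identifiers, for illustration only.
train_ids = ["ACC_0001.1", "ACC_0002.1", "ACC_0003.2"]
test_ids = ["ACC_0002.2", "ACC_0004.1"]
shared, fraction = accession_overlap(train_ids, test_ids)
print(shared, fraction)
```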
Linear genomes can be complete: where did all the linear genomes go?
Emphasis is placed on circular genomes as complete, excluding linear genomes
1. Although complete linear genomes may be identified as high quality or near complete, their lack of circularization signatures causes them to be underemphasized in databases and analyses
2. A metagenomics-scale approach to identify complete viral genomes without terminal repeats may reduce the bias towards circular genomes. Until such a tool is available, it is necessary to keep in mind the possibility of underrepresenting linear genomes
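To make the bias concrete, the sketch below checks a sequence for one kind of circularization signature, a direct terminal repeat (the same sequence at both ends of a scaffold). The function name, the `min_len` and `max_len` defaults, and the toy sequences are assumptions chosen for illustration and are not taken from any particular tool. A complete linear genome without such a repeat returns 0 here, which is exactly how it can be passed over when completeness is judged by circularity.

```python
def direct_terminal_repeat(seq, min_len=20, max_len=1000):
    """Return the length of the longest prefix of `seq` that is also its
    suffix, searching lengths from `max_len` down to `min_len`.

    Returns 0 if no such repeat is found. Candidate lengths are capped at
    half the sequence length so the two ends cannot overlap.
    """
    seq = seq.upper()
    upper = min(max_len, len(seq) // 2)
    for k in range(upper, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


# Toy sequences, for illustration only.
repeat = "ACGTTGCAAGGCTTAACCGG"           # 20 bp repeated at both ends
with_repeat = repeat + "T" * 50 + repeat  # carries the signature
without_repeat = "A" * 30 + "C" * 30      # no shared ends

print(direct_terminal_repeat(with_repeat))
print(direct_terminal_repeat(without_repeat))
```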
Metagenomes are puzzles: an unfinished puzzle is still just pieces
Not all metagenomic viral scaffolds represent the whole genome
1. Including binning in virus analysis pipelines and constructing viral metagenome-assembled genomes (vMAGs) will likely better represent the true composition of viruses and viral diversity