Nucleic Acids Research. 2022 May 11;50(14):e83. doi: 10.1093/nar/gkac341

vRhyme enables binning of viral genomes from metagenomes

Kristopher Kieft, Alyssa Adams, Rauf Salamzade, Lindsay Kalan, Karthik Anantharaman
PMCID: PMC9371927  PMID: 35544285

Abstract

Genome binning has been essential for characterization of bacteria, archaea, and even eukaryotes from metagenomes. Yet, few approaches exist for viruses. We developed vRhyme, a fast and precise software tool for construction of viral metagenome-assembled genomes (vMAGs). vRhyme utilizes single- or multi-sample coverage effect size comparisons between scaffolds and employs supervised machine learning to identify nucleotide feature similarities, which are compiled into iterations of weighted networks and refined bins. To refine bins, vRhyme utilizes unique features of viral genomes, namely a protein redundancy scoring mechanism based on the observation that viruses seldom encode redundant genes. Using simulated viromes, we demonstrated superior performance of vRhyme compared to available binning tools in constructing more complete and uncontaminated vMAGs. When applied to 10,601 viral scaffolds from human skin, vRhyme advanced our understanding of resident viruses, highlighted by identification of a Herelleviridae vMAG composed of 22 scaffolds, and another vMAG encoding a nitrate reductase metabolic gene, representing near-complete genomes post-binning. vRhyme will enable a convention of binning uncultivated viral genomes and has the potential to transform metagenome-based viral ecology.

INTRODUCTION

Viruses and bacteriophages (collectively termed viruses) are pervasive members of essentially all ecosystems. Viruses form a continuum of symbiotic interactions with their hosts, from lethal parasitism to essential mutualism (1–3). These interactions are known to impact biogeochemical and nutrient cycling processes, human health, infrastructure and industries and ecosystem community dynamics (4–7). As a result of the rising interest in viromics, the previously unknown members of the virosphere, the range in the encoded genetic potential of viruses, known viral diversity, and limits of viral genome sizes have been continuously expanding (8–12).

Metagenomic sequencing can be a mechanism to identify, recognize, understand, and even harness the information encoded on viral genomes. Most metagenomes will assemble into many short fragments (scaffolds or contigs) representing partial genome sequences. The process of binning is employed to group scaffolds into a putative genome, termed a metagenome-assembled genome (MAG). With the information encoded by a MAG, rather than individual scaffolds, stronger inferences of metabolic potential, phylogenies, taxonomy, and community interactions can be generated (13).

Conversely, viral scaffolds are typically not binned. Handling complex and often enigmatic viral scaffolds in metagenomes poses computational challenges distinct from those posed by microbes. One justification for not binning viruses is that their genomes are small relative to those of cellular organisms, together with the assumption that most scaffolds represent the majority, or the entirety, of an identifiable genome. For dsDNA viruses, the target of most viral metagenomes, genome sizes generally range from 20 to 200 kb, with the largest viruses reaching 500–2,000 kb. Since the majority of scaffolds in most assembled metagenomes are <20 kb in length, a single scaffold is unlikely to represent an entire viral genome. In fact, benchmarks have shown that viruses often do not assemble into a single scaffold (14,15). Further difficulties with binning viral genomes arise because viruses do not encode universal single copy or marker genes, making a standardized approach for all viruses difficult to create. Additionally, studies incorporating many samples for co-abundance comparisons have traditionally been uncommon, and viral populations are often composed of highly heterogeneous genomes that result in fragmented assemblies.

Many software tools have been developed for binning bacterial, archaeal, and eukaryotic metagenomic scaffolds into MAGs (16–25). These tools employ a wide range of methodologies, mainly focusing on tetranucleotide frequencies and read coverage abundance variance comparisons between scaffolds. A significant portion of the tools tailored to bacteria and archaea also rely on identifying microbial single copy genes to inform the construction of bins along with completeness and contamination estimates. Some tools for binning microbes are suitable for binning viruses due to their independence from microbial single copy gene analysis, namely MetaBat2, VAMB, CONCOCT, and BinSanity. MetaBat2 uses a composite scoring system based on the geometric mean of tetranucleotide frequencies and coverage abundance of individual scaffolds to generate bins according to a weighted graph clustering algorithm (17). VAMB implements unsupervised deep learning variational autoencoders based on individual scaffold tetranucleotide frequencies and coverage abundance to generate bins by iterative medoid clustering (18,26). CONCOCT uses tetranucleotide frequencies and coverage abundance, reduced by multidimensional reduction, to cluster scaffolds into bins with Gaussian mixture models (27). BinSanity uses affinity propagation clustering based on coverage abundances to bin scaffolds, followed by bin refinement using tetranucleotide frequencies and GC content (24). Despite the abundance of tools for binning bacteria and archaea, there is a conspicuous dearth of tools available for binning viruses. Only one tool, CoCoNet (28), has thus far been developed for binning viral genomes from metagenomes (viral MAGs or vMAGs). CoCoNet implements an unsupervised deep learning neural network to identify shared tetranucleotide and coverage abundance patterns between scaffold pairs, followed by graph clustering of potential pairs into bins (28).

Here, we present vRhyme, a software tool that incorporates supervised machine learning based classification of diverse sequence feature compositions as well as read coverage abundance effect size comparisons to generate weighted networks of bins. vRhyme leverages unique features of viral genomes to optimize and refine the binning of vMAGs, including overcoming the lack of single copy genes by scoring protein redundancy based on the observation that viruses seldom encode redundant genes. vRhyme is capable of binning viruses from diverse families, host and source environment affiliations, varying states of genome fragmentation, and wide ranges of genome lengths. In benchmarking vRhyme, we show that it is fast, inclusive, and accurate in binning viral scaffolds, with low computational demands, in synthetic and natural metagenomes compared to other binning software. When applied to human skin metagenomes, we show that vRhyme enabled a more comprehensive analysis of shared viruses and viral features across a cohort of individuals, and likely better recapitulated natural systems. vRhyme is implemented in Python and is freely available for download at https://github.com/AnantharamanLab/vRhyme.

MATERIALS AND METHODS

Coverage processing

The input for read coverage information is variable: paired or unpaired short reads, SAM alignment file, BAM alignment file, or a pre-calculated coverage table. For short reads input, reads will be aligned to input scaffolds using either Bowtie2 (29) or BWA (30); Bowtie2 is run with the parameters --no-unal --no-discordant, the latter being for paired reads only, and BWA is run with the mem algorithm. All reads should be quality filtered before being used as input. The resulting SAM alignment file, or an input SAM alignment file, will be converted into BAM format using Samtools (31). BAM alignment files, either generated by the vRhyme pipeline or as user input, will then be processed. As such, any input combinations of short reads, SAM or BAM alignment files are compatible. BAM alignment files, if not already provided as input, are sorted and indexed using Samtools.

The Python package Pysam (https://github.com/pysam-developers/pysam) is then used to fetch aligned records within sorted and indexed BAM alignment files for processing and coverage calculations. First, aligned reads are filtered according to the percent identity of the alignment, calculated from the sum of the number of gaps $g$ and the number of mismatches $m$ in the alignment divided by the length of the alignment $L$. The default is a 97% identity alignment.

$$\mathrm{identity} = 1 - \frac{g + m}{L}$$

Aligned reads passing the set threshold are used to calculate the total coverage of each nucleotide base per scaffold, inclusive of bases with a coverage of zero. Finally, the coverage values at the terminal ends of scaffolds are masked to increase coverage fidelity, since read alignment at partial scaffold ends can be erroneous. By default, all coverage values within the first and last 150 bp of the scaffold are ignored. The average and standard deviation of coverage per scaffold are calculated from the respective individual base coverages. All alignment filtering and coverage calculations are handled natively within vRhyme. This final step yields a coverage table composed of the average and standard deviation of coverage per scaffold per input sample. This coverage table, or a user-generated table of the same format, can be used as input for vRhyme in place of reads or SAM/BAM alignment files.
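The filtering and masking steps can be illustrated with a minimal sketch. This is not vRhyme's own code: the function names are hypothetical, identity is assumed to be one minus the gap-plus-mismatch fraction, and the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
import statistics


def percent_identity(gaps, mismatches, aln_length):
    """Alignment identity as 1 - (gaps + mismatches) / alignment length (assumed form)."""
    return 1.0 - (gaps + mismatches) / aln_length


def masked_coverage_stats(per_base_cov, mask=150):
    """Mean and standard deviation of per-base coverage, ignoring `mask` bases at each end.

    Zero-coverage bases are kept, as described in the text.
    The population standard deviation is an assumption.
    """
    core = per_base_cov[mask:len(per_base_cov) - mask]
    if not core:
        raise ValueError("scaffold is too short for the requested mask")
    return statistics.fmean(core), statistics.pstdev(core)
```

With the default threshold, a read would be kept when `percent_identity(...) >= 0.97`.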

Next, scaffold coverages across all samples are compared pairwise using the effect size of coverage differences. First, all average coverages are increased by a pseudo-count of 0.1 to avoid coverages of zero (pseudo-counts are excluded from the coverage table). Effect size is calculated by the Cohen's d effect size metric (32). Cohen's d is calculated as follows, where $\bar{x}_i$ and $\bar{x}_j$ are average read coverages and $s_i$ and $s_j$ are standard deviations of the coverages for a scaffold pair $i$ and $j$:

$$d = \frac{\left| \bar{x}_i - \bar{x}_j \right|}{\sqrt{\left( s_i^2 + s_j^2 \right) / 2}}$$
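As a hedged illustration, the textbook form of Cohen's d for two groups summarized by a mean and a standard deviation can be written as below. The pooled standard deviation with equal weighting, the absolute value, and the handling of a zero pooled deviation are assumptions of this sketch rather than confirmed details of vRhyme.

```python
import math


def cohens_d(mean_i, sd_i, mean_j, sd_j):
    """Absolute Cohen's d between two scaffolds within one sample (textbook pooled-SD form)."""
    pooled = math.sqrt((sd_i ** 2 + sd_j ** 2) / 2)
    if pooled == 0:
        # Hypothetical convention: identical flat coverage gives no effect,
        # different flat coverage gives an unbounded effect.
        return 0.0 if mean_i == mean_j else math.inf
    return abs(mean_i - mean_j) / pooled
```

For example, two scaffolds with mean coverages of 10 and 12 and a standard deviation of 2 each differ by one pooled standard deviation.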

For each pairwise comparison, an effect size value $d_k$ is generated per sample $k$ across all $n$ samples. Values exceeding the effect size threshold, set by vRhyme presets, generate an additive penalty weight $p$. The average effect size across all samples $\bar{d}$, with any added penalties, is normalized to the number of input samples, yielding a normalized effect size $d_n$, which accounts for the higher statistical power of comparisons across more samples:

[Equation images M00012.gif and M00013.gif (definitions of $\bar{d}$ and $d_n$) not recovered]

The normalized and penalized $d_n$ values are compared to a normalized preset effect size threshold and all pairwise comparisons passing the set criteria are considered as co-occurring by coverage. Any scaffold not found to co-occur with another is discarded. For computational efficiency, a pre-filter is applied where only the best (i.e. lowest $d_n$) $N$ pairs per individual scaffold are retained, where $N$ is ‘--max_edges’ multiplied by 3.
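The pre-filter can be sketched as a per-scaffold top-N selection. The function name and data layout are hypothetical, and keeping a pair when it ranks among the best for either of its two scaffolds is an assumption; the text does not say whether both scaffolds must agree.

```python
import heapq
from collections import defaultdict


def prefilter_pairs(pairs, max_edges, multiplier=3):
    """Keep, for each scaffold, its `max_edges * multiplier` lowest-scoring pairs.

    `pairs` is an iterable of (scaffold_a, scaffold_b, normalized_effect_size).
    A pair survives if it is among the best for either scaffold (assumption).
    """
    limit = max_edges * multiplier
    by_scaffold = defaultdict(list)
    for a, b, d in pairs:
        by_scaffold[a].append((d, a, b))
        by_scaffold[b].append((d, a, b))
    kept = set()
    for entries in by_scaffold.values():
        for d, a, b in heapq.nsmallest(limit, entries):
            kept.add((a, b, d))
    return kept
```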

Nucleotide processing

All co-occurring scaffolds by read coverage are compared by seven nucleotide content metrics. The pairwise distance calculations per metric are used as inputs to supervised machine learning models for classification. All nucleotide features and distances are calculated natively within vRhyme.

The first feature, codon usage (CU), is calculated from nucleotide open reading frames (i.e. genes). Predicted genes can be used as input, otherwise vRhyme will automate prediction using Prodigal (33) (-m -p meta). In-frame trinucleotide counts $n_c$ for each of the 64 codons $c$ (step of 3 bases) along a scaffold are divided by the total count of observed codons. The final codon, if representing a stop, is ignored. Counts are inclusive of zero counts but exclusive of ambiguous (e.g. N) bases. The following yields a CU frequency vector $CU_s$ containing one frequency $f_{c,s}$ for each codon $c$ in scaffold $s$.

$$f_{c,s} = \frac{n_c}{\sum_{c'=1}^{64} n_{c'}}$$
$$CU_s = \left( f_{1,s}, f_{2,s}, \ldots, f_{64,s} \right)$$
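A minimal sketch of the codon usage calculation follows. It assumes genes are supplied as in-frame nucleotide strings on the coding strand and that the stop codons are TAA, TAG and TGA; both are assumptions of this example, and the function name is hypothetical.

```python
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 codons
STOPS = {"TAA", "TAG", "TGA"}  # assumed stop codons


def codon_usage(genes):
    """Return a 64-element codon frequency vector for one scaffold."""
    counts = dict.fromkeys(CODONS, 0)
    for gene in genes:
        gene = gene.upper()
        codons = [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]
        if codons and codons[-1] in STOPS:
            codons = codons[:-1]  # the final codon is ignored if it is a stop
        for codon in codons:
            if codon in counts:  # skips codons with ambiguous bases such as N
                counts[codon] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]
```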

The next three features (GC content, CpG content, and GC-skew) are calculated per scaffold from individual scaffold bases. GC content $GC$ is calculated by the sum of all G and C bases, divided by the sum of all bases (A, T, C and G). CpG content $CpG$ is calculated by the sum of all CG di-nucleotides per scaffold (step of 1 base) divided by the sum of all bases. GC-skew $GC_{skew}$ is calculated by subtracting the total of C bases from the total of G bases, divided by the sum of G and C bases. In the following, $A$, $T$, $C$ and $G$ denote base counts and $CG$ denotes the count of CG di-nucleotides.

$$GC = \frac{G + C}{A + T + C + G}$$
$$CpG = \frac{CG}{A + T + C + G}$$
$$GC_{skew} = \frac{G - C}{G + C}$$
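The three single-value features follow directly from the prose definitions and can be sketched as below. The function names are hypothetical, and returning a skew of zero when a scaffold has no G or C bases is a convention chosen for this example.

```python
def gc_features(seq):
    """Return (GC content, CpG content, GC-skew) for one scaffold sequence."""
    seq = seq.upper()
    a, t, c, g = (seq.count(base) for base in "ATCG")
    total = a + t + c + g  # ambiguous bases such as N are excluded
    if total == 0:
        raise ValueError("sequence has no unambiguous bases")
    gc = (g + c) / total
    cpg = seq.count("CG") / total  # CG cannot overlap itself, so count() is exact
    skew = (g - c) / (g + c) if (g + c) else 0.0
    return gc, cpg, skew


def feature_distance(value_i, value_j):
    """Pairwise distance for GC, CpG or GC-skew: absolute difference."""
    return abs(value_i - value_j)
```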

The last three features, relative tetranucleotide frequency (RTF), tetranucleotide usage deviation (TUD), and tetranucleotide zero’th order Markov method (ZOM), are calculated from whole scaffold tetranucleotide frequencies (step of 1 base) of the forward and reverse strands (34). A total of 136 possible tetranucleotides are considered after combining identical, reverse complement, and palindromic sequences. Counts are inclusive of zero counts but exclusive of ambiguous (i.e. N) bases.

For RTF, all counts $n_t$ for each of the 136 tetranucleotides $t$ along a scaffold are divided by the total count of observed tetranucleotides. The following yields a tetranucleotide frequency vector $RTF_s$ containing one frequency $f_{t,s}$ for each tetranucleotide $t$ in scaffold $s$.

$$f_{t,s} = \frac{n_t}{\sum_{t'=1}^{136} n_{t'}}$$
$$RTF_s = \left( f_{1,s}, f_{2,s}, \ldots, f_{136,s} \right)$$
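The count of 136 can be checked with a short sketch: of the 256 tetranucleotides, 16 are their own reverse complement and the remaining 240 collapse into 120 pairs. The code below represents each pair by its lexicographically smaller member, which is a convention chosen for this example, and the function names are hypothetical. Counting canonical forms along the forward strand gives the same relative frequencies as counting both strands.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID = set("ACGT")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


CANONICAL_TETRAMERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})


def relative_tetranucleotide_frequency(seq):
    """Return the 136-element relative tetranucleotide frequency vector of a scaffold."""
    seq = seq.upper()
    counts = dict.fromkeys(CANONICAL_TETRAMERS, 0)
    for i in range(len(seq) - 3):  # step of 1 base
        kmer = seq[i:i + 4]
        if set(kmer) <= VALID:  # exclude windows with ambiguous bases
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in CANONICAL_TETRAMERS]
```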

For TUD, expected nucleotide frequencies $e_b$ are first calculated by dividing the count of each base $b$ by the sum of all bases in the scaffold. Next, observed counts per base $o_{b,t}$ per tetranucleotide $t$ are calculated by the sum of each base inclusive of zero counts. For each unique tetranucleotide, expected frequencies per base are raised to the power of observed frequencies multiplied by two to yield a deviation value per base. The deviation values for all four bases are multiplied by the count of total observed tetranucleotides and the count of the given tetranucleotide to yield a TUD value per tetranucleotide. The following yields a TUD frequency vector with one value for each tetranucleotide $t$ in scaffold $s$.

[Equation images M00046.gif to M00050.gif not recovered]

For ZOM, the same expected nucleotide frequencies $e_b$ per base $b$ are used. For each tetranucleotide $t$, the count $n_t$ of the given tetranucleotide is divided by the product of the expected frequencies of each of the bases present in the tetranucleotide to yield a ZOM frequency vector $ZOM_s$ containing one value $z_{t,s}$ for each tetranucleotide $t$ in scaffold $s$.

$$z_{t,s} = \frac{n_t}{\prod_{b \in t} e_b}$$
$$ZOM_s = \left( z_{1,s}, z_{2,s}, \ldots, z_{136,s} \right)$$

Pairwise distance calculations for GC, CpG and GC-skew are made by the absolute value of the difference in the respective metric between two scaffolds. For example, the following is the pairwise distance $D_{GC}$ in GC content between scaffolds $i$ and $j$.

$$D_{GC} = \left| GC_i - GC_j \right|$$

Pairwise distance calculations for CU, RTF, TUD and ZOM are made by cosine distances. For each value $x_{t,i}$ and $x_{t,j}$, corresponding to the same tetranucleotide (or codon) $t$, in frequency vectors of scaffolds $i$ and $j$, with vector averages of $\bar{x}_i$ and $\bar{x}_j$, cosine similarity $\cos_{i,j}$ is calculated. Cosine distances between two scaffolds are calculated for CU, RTF, TUD and ZOM individually.

$$\cos_{i,j} = \frac{\sum_t \left( x_{t,i} - \bar{x}_i \right) \left( x_{t,j} - \bar{x}_j \right)}{\sqrt{\sum_t \left( x_{t,i} - \bar{x}_i \right)^2} \, \sqrt{\sum_t \left( x_{t,j} - \bar{x}_j \right)^2}}$$
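A cosine distance between two frequency vectors can be sketched as follows. Because the text mentions vector averages, the sketch subtracts each vector's mean by default; that centering, the conversion to a distance as one minus the similarity, and the value returned for a zero-length vector are all assumptions of this example rather than confirmed details of vRhyme.

```python
import math


def cosine_distance(u, v, center=True):
    """Return 1 - cosine similarity between two equal-length frequency vectors.

    With center=True each vector's mean is subtracted first (assumed from the text).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    if center:
        mean_u = sum(u) / len(u)
        mean_v = sum(v) / len(v)
        u = [x - mean_u for x in u]
        v = [y - mean_v for y in v]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # hypothetical convention for a constant or empty profile
    return 1.0 - dot / (norm_u * norm_v)
```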

The result of distance calculations is a vector $D_{i,j}$ of length seven for each pairwise comparison between scaffolds $i$ and $j$.

$$D_{i,j} = \left( D_{CU}, D_{GC}, D_{CpG}, D_{GC\text{-}skew}, D_{RTF}, D_{TUD}, D_{ZOM} \right)$$

Machine learning model training and testing

NCBI databases (RefSeq (35) and Genbank (36), release July 2019) were queried for ‘prokaryotic virus’ and genomes >10 kb in length were retained. In addition, the IMG/VR database (release July 2018) (37) was downloaded, and sequences were limited to a minimum length of 10 kb. For the IMG/VR dataset, VIBRANT (38) (v1.2.1, -virome) and CheckV (39) (v0.6.0) were used to obtain circular and/or complete sequences. The resulting NCBI and IMG/VR datasets were dereplicated by 95% identity using the method described here (--derep_only --derep_id 0.95 --frac 0.70 --method longest) and combined, resulting in a total of 11,881 putatively complete genomes. The sequences representing complete genomes in the combined dataset were split into non-overlapping fragments of 15 kb with a minimum length of 10 kb. A total of 39,105 fragments were generated for training and testing machine learning models, with 38,732 represented in the training and 30,618 represented in the testing datasets (Supplementary Figure S1a).
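The fragmentation step can be sketched as a fixed-window split. The function name is hypothetical, and discarding a trailing piece shorter than the minimum length is an assumption; the text only states the fragment size and the minimum length.

```python
def fragment_genome(seq, size=15000, min_len=10000):
    """Split a genome into non-overlapping fragments of `size`, keeping those >= `min_len`."""
    fragments = [seq[i:i + size] for i in range(0, len(seq), size)]
    return [fragment for fragment in fragments if len(fragment) >= min_len]
```

Under these assumptions a 40 kb genome yields three fragments (15, 15 and 10 kb), while a 34 kb genome yields two, because its final 4 kb piece is dropped.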

The machine learning models were generated based on the length-seven pairwise distance vectors described above using the generated 39,105 genome fragments. Filtering of pairwise comparisons before training and testing was made according to vRhyme default parameters (--max_gc 0.20 --min_kmer 0.60). The pairwise comparison matrix was split 75:25 for training and testing, respectively. Fragment pairs were labeled as ‘same’ or ‘different’ for supervised machine learning according to whether the paired fragments originated from the same or different source genomes. An equal number (69,632) of ‘same’ and ‘different’ pairs were used for training by randomly dropping excess ‘different’ comparisons. For testing, a set of 38,685 ‘different’ and 7,736 ‘same’ pairs were used. There were no redundant pairs between the training and testing datasets.

Scikit-Learn (v0.24.2) (40) was used to generate machine learning models using a grid search approach to optimize parameters. Several models and algorithms were considered, including MLPClassifier, ExtraTrees, KNeighbors, SVC, Gradient Boost, Decision Tree and Random Forest classifiers. Iterative training and testing yielded MLPClassifier (alpha = 0.001, beta_1 = 0.7, beta_2 = 0.8, hidden_layer_sizes = (5,25,50,75,100,100,75,50,25,5), learning_rate_init = 0.0001, max_iter = 1250, n_iter_no_change = 15, tol = 1e-08) and ExtraTreesClassifier (max_depth = 10, max_features = 7, n_estimators = 1500) as the most robust.

Machine learning and network processing

Each scaffold pair is classified by the two machine learning models separately to yield two probability values of ‘same’, one per model. The probability values are averaged to yield $P$. Any pair with $P$ below the preset threshold is discarded. Then, the normalized effect size $d_n$ calculated previously for the pair is divided by $P$ to yield a network edge weight $w$.

$$w = \frac{d_n}{P}$$
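The edge weight calculation can be sketched in a few lines. The function name and the `min_prob` argument are hypothetical; the actual thresholds come from the vRhyme presets and are not reproduced here.

```python
def edge_weight(d_norm, prob_model_a, prob_model_b, min_prob):
    """Return the network edge weight for a scaffold pair, or None if the pair is discarded.

    d_norm: normalized coverage effect size for the pair.
    prob_model_a, prob_model_b: probability of 'same' from each of the two models.
    min_prob: hypothetical stand-in for the preset probability threshold.
    """
    p = (prob_model_a + prob_model_b) / 2
    if p < min_prob:
        return None
    return d_norm / p
```

A lower weight is better: it reflects a small coverage effect size, a high probability of shared origin, or both.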

Any pair with $w$ below the preset threshold is retained for network clustering. As before, for computational efficiency, only the best (i.e. lowest $w$) $N$ pairs per individual scaffold are retained, where $N$ is ‘--max_edges’. Weighted networks, representing unrefined bins, are created where each node is a scaffold and each edge is a weighted connection between paired scaffolds. Networks are refined using MiniBatchKMeans implemented in Scikit-Learn with the following parameters: n_clusters = $c$+1, batch_size = $b$, max_iter = 100, max_no_improvement = 5, n_init = 5. Batch size $b$ is 25% of the number of nodes with a minimum of 2 and maximum of 100. The number of clusters $c$ is defined by the number of nodes with a clustering coefficient value below the preset constant 0.36 but not 0. For each node $i$, the clustering coefficient $C_i$ is calculated as follows, where $k_i$ is the degree of the node and $e_i$ is the number of edges between the neighbors of $i$:

$$C_i = \frac{2 e_i}{k_i \left( k_i - 1 \right)}$$
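The clustering coefficient and the derived KMeans settings can be sketched as below, using the standard local clustering coefficient for an undirected graph. The adjacency-dictionary layout and function names are hypothetical, node labels are assumed to be sortable, and rounding the 25% batch size to the nearest integer is an assumption.

```python
def clustering_coefficient(adjacency, node):
    """Local clustering coefficient: 2 * e / (k * (k - 1)) for a node of degree k."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Count each edge between two neighbors once.
    links = sum(1 for a in neighbors for b in neighbors if a < b and b in adjacency[a])
    return 2 * links / (k * (k - 1))


def kmeans_settings(adjacency, cutoff=0.36):
    """Return (n_clusters, batch_size) following the rules described in the text."""
    coefficients = [clustering_coefficient(adjacency, node) for node in adjacency]
    c = sum(1 for value in coefficients if 0 < value < cutoff)
    batch_size = min(100, max(2, round(0.25 * len(adjacency))))
    return c + 1, batch_size
```

In a five-node network where a hub connects two otherwise separate pairs, only the hub has a coefficient between 0 and 0.36, so two clusters would be requested.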

Refined networks are split into distinct, separate networks according to the clustering result. Here, each connected network represents a putative bin.

Score processing

Each binning iteration is given a score $S$ according to protein redundancy, total bins, and the number of scaffolds binned. To calculate protein redundancy, all proteins within a bin are clustered using Mmseqs2 (41) (linclust --min-seq-id 0.5 -c 0.8 -e 0.01 --min-aln-len 50 --cluster-mode 0 --seq-id-mode 0 --alignment-mode 3 --cov-mode 5 --kmer-per-seq 75). Any proteins clustered within a bin, excluding those along the same scaffold, are considered redundant. The iteration with the maximum score is selected as the final representative. $S$ is calculated as follows:

[Equation images M000101.gif to M000104.gif not recovered]

Dereplication

vRhyme implements Nucmer (42) and MASH (43) for the dereplication of scaffolds. First, scaffolds are roughly grouped using MASH (sketch -k 31 -s 1000; dist) to reduce the pairwise comparison space. Next, all possible pairs of scaffolds within each resulting group are aligned using Nucmer (-c 1000 -b 1000 -g 1000). Regardless of the comparison method (‘--method’), any pair of scaffolds with 100% identity over 100% coverage is first reduced to the longest representative. For all percent coverage calculations in dereplication, coverage is of the shortest scaffold. For ‘--method longest’ the longest scaffold in pairs meeting the set percent identity (e.g. 97%) and percent coverage (e.g. 60%) thresholds is taken as the representative. For ‘--method composite’, scaffold pairs meeting the percent identity and percent coverage thresholds are joined over the region of sequence overlap to yield artificially chimeric scaffolds. Any alignments exceeding the sensitivity values for merging over complex alignments, such as low identity scaffold ends without overlap, are not joined. After scaffold pairs are joined, identical cycles of MASH, Nucmer, and composite joining are completed until no further alignments are detected. For all methods, reverse complement sequence alignments are considered and adjusted accordingly.

Performance validation datasets and metrics

Scaffolds used to benchmark performance were acquired from nine separate publicly available datasets derived from eight unique metagenomes (one metagenome was split into two separate datasets). The metagenomes were acquired from marine (44,45), freshwater (46–48), human gut (49), and soil environments (50,51). Details on the studies, scaffolds, reads, and accession numbers can be found in Supplementary Table S1. Each dataset was processed separately. First, VIBRANT (v1.2.1) was used to predict viruses. From these viruses, VIBRANT and CheckV were used to identify circular scaffolds representing complete genomes. Next, scaffolds were dereplicated by 97% identity using the method described here (--derep_only --derep_id 0.97 --frac 0.70 --method longest). The non-redundant scaffolds were randomly fragmented into sequences ranging from 2 kb to 20 kb in length. A total of 999 scaffolds (i.e. putatively complete genomes) were used to generate 4,324 fragments of at least 2 kb in length. Full benchmarking was performed on the 4,324 fragments and validation of complete genome binning was performed on the 999 scaffolds representing complete genomes (Supplementary Figure S1b). Only 255 of the performance benchmarking fragments had significant sequence similarity to fragments used to train the machine learning models (Supplementary Figure S1c).

Since the circular scaffolds (sources) were estimated to be complete genomes, fragments originating from the same source were expected to form a single bin. Bins containing fragments from multiple sources were considered contaminated, fragments from the same source placed in different bins were considered split genomes, and fragments representing an entire source (singletons) were not expected to bin. The genome- (source) and bin-based performance metrics are defined from the following quantities: the expected number of bins (i.e. sources with at least two fragments), the number of bins generated, the expected number of binned fragments (i.e. fragments representing the sources expected to bin), the total number of bins containing a single source, the total number of fragments binned, the number of unique sources binned, the number of sources contained in a single bin, the total number of singletons, and the number of binned singletons.

[Equation images M000115.gif to M000126.gif not recovered]

To validate binning further, each pairwise connection between fragments within a bin was evaluated according to each fragment's nucleotide length. These standard performance metrics were evaluated per bin using true positive ($TP$), true negative ($TN$), false positive ($FP$), and false negative ($FN$) connections. The following equations are for pairwise nucleotide-based performance metrics:

[Equation images M000131.gif to M000135.gif not recovered]
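For orientation only, the textbook definitions of common metrics built from true and false positive and negative counts are sketched below. Which five metrics the original equations define, and how connections are weighted by nucleotide length, is not recoverable here, so the selection and the function name are assumptions.

```python
def pairwise_metrics(tp, tn, fp, fn):
    """Standard classification metrics from (optionally length-weighted) connection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }
```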

Performance benchmarking

The performance of vRhyme (v1.0.0) was compared to MetaBat2 (17) (v2.12.1, -s 4000 -m 2000), CONCOCT (27) (v1.0.0, -l 2000), VAMB (18) (v3.0.2, -i 2 -m 2000 -t 40), CoCoNet (28) (v1.0.0, --min-ctg-len 1000 --min-prevalence 1), and BinSanity (24) (v0.5.4, -x 2000). Additional binning tools, namely MaxBin2 (16), MyCC (19), SolidBin (20) and DASTool (22), perform microbial single copy gene analysis and were not applicable, or did not function, for viruses. For VAMB, the starting batch size had to be adjusted to accommodate the relatively small size of the input datasets, and all but three datasets failed to run. The coverage tables for each of the tools were generated from sorted BAM files using each tool's respective method, except for VAMB, for which the same coverage table as MetaBat2 was used. The sorted BAM files were generated using Samtools (v1.13) from reads quality filtered by Sickle (v1.33) and aligned by Bowtie2 (v2.3.5.1, --no-unal --no-discordant).

Metagenomic datasets and analyses

Publicly available metagenomes from marine (52), agricultural soil (53), and human skin (54) environments were used. Details on the studies, reads used, and accession numbers can be found in Supplementary Table S1. Viruses were predicted from each metagenome using VIBRANT and only the identified virus scaffolds were binned using vRhyme. For the human skin datasets, 270 metagenomes from a cohort of 34 individuals with eight body sites per individual were used (antecubital fossa (Af), alar crease (Al), back (Ba), nare (Na), occiput (Oc), toe-web space (Tw), umbilicus (Um) and volar forearm (Vf)). Reads were filtered for quality, adapters, and host-contamination as described previously (54) using fastp (55) (v0.21.0, --detect_adapter_for_pe) and KneadData (v0.8.0). MegaHit (56) (v1.2.9) was used to generate individual metagenomic assemblies for each sample, corresponding to the microbiome of a particular body site for a specific participant at a given timepoint. After predicting viruses, all viruses per body site were combined and dereplicated (--method longest) before binning.

It is important to note that for bins, scaffolds had to be linked with Ns in order to run CheckV analysis since there is no mode to input bins. For all benchmarking using CheckV, the tool was modified to run Prodigal with the -m flag to accommodate linking vMAGs and not predicting open reading frames across the appended strings of Ns connecting scaffolds. For taxonomy of the validation dataset, a publicly available custom reference database of NCBI viruses was used as previously described (57). In brief, DIAMOND (58) (v0.9.14) BLASTp (59) (v2.6.0) was used to identify the most likely taxonomic affiliation of a sequence.

Additional datasets and benchmarking

Additional publicly available datasets were used to assess the performance of vRhyme under different scenarios and conditions. To assess binning of related types of viruses within the same sample, a total of 101 publicly available crAssphage sequences (60) were dereplicated using vRhyme (--derep_id 0.97 --frac 0.70 --method longest) to 86 non-redundant scaffolds. The non-redundant scaffolds were randomly fragmented as described previously into 791 fragments. To assess binning of megaphages and eukaryotic viruses with large genomes, the 540 kb Prevotella phage Lak C1 (61) was randomly fragmented into 51 fragments, and four different eukaryotic viruses (62,63) with genome lengths ranging from 154 kb to 201 kb were each randomly fragmented into 11 to 19 fragments. To assess binning of active and dormant prophages, VIBRANT was used to predict prophage regions for 10 active prophages from three different hosts and 24 dormant prophages from five different hosts. Activity or dormancy was determined according to respective studies described elsewhere (64–66) and validated using PropagAtE (67) (v1.1.0). Whole prophage scaffolds from the same host genome were binned together. Details on the studies, reads used, scaffolds, and accession numbers can be found in Supplementary Table S1.

To validate protein redundancy, NCBI databases (RefSeq and Genbank, release July 2019) were queried for ‘prokaryotic virus’ as before and genomes greater than 3 kb in length were retained. Likewise, NCBI databases (RefSeq and Genbank, release September 2021) were queried for ‘eukaryotic virus’ and genomes greater than 20 kb in length were retained. Proteins were predicted using Prodigal (-p meta) for 15,238 prokaryotic and 557 eukaryotic viruses. Protein redundancy was calculated per genome based on the method described for vRhyme, with the exception that proteins could be redundant if encoded along the same scaffold.

Effect of number of samples

The effect of the number of input samples on vRhyme performance was assessed by increasing, stepwise, the number of BAM files used to calculate coverage from one to the maximum number of samples for a given dataset. To do this, samples were arranged in descending order of total coverage across all scaffolds and were combined stepwise, starting with the sample with the greatest coverage and ending with the sample with the lowest coverage.
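The stepwise sample sets can be sketched as cumulative prefixes of the samples ranked by total coverage. The function name and the dictionary input are hypothetical and serve only to illustrate the ordering.

```python
def stepwise_sample_sets(total_coverage):
    """Return cumulative sample sets, from the highest-coverage sample to all samples.

    total_coverage maps a sample name to its total coverage across all scaffolds.
    """
    ordered = sorted(total_coverage, key=total_coverage.get, reverse=True)
    return [ordered[:n] for n in range(1, len(ordered) + 1)]
```

Each returned list would correspond to one binning run, so a dataset with three samples gives three runs using one, two and three BAM files.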

Visualizations

All plots and visualizations were generated using Matplotlib (68) (v3.2.0) and Seaborn (69) (v0.11.0). Genome alignment visualizations were made using EasyFig (70) (v2.2.2) and Geneious Prime 2019.0.3. Genome alignments to identify percent sequence identity were made using progressiveMauve (71) (development snapshot 2015-02-25). vConTACT2 (72) (v0.9.19, --rel-mode Diamond --db ‘None’ --pcs-mode MCL --vcs-mode ClusterONE, ClusterONE (73) v1.0) was used to construct protein clustering networks, which were visualized using Cytoscape (74) (v3.7.2).

RESULTS

vRhyme overview and workflow

The vRhyme workflow is done in five steps: read coverage processing, sequence feature extraction, supervised machine learning, iterative network clustering, and bin scoring (Figure 1). The base input to vRhyme are the assembled scaffolds or contigs to be binned (hereafter scaffolds) with a set minimum size of 2 kb. For optimal results, only virome scaffolds or predicted virus scaffolds should be used as input, though vRhyme can function with the input of an entire metagenome. An initial dereplication step to remove redundant input scaffolds is optional. Next, scaffolds are compared pairwise by read coverage composition per sample, which is a proxy for relative abundance. vRhyme performs optimally with an input of multiple samples (i.e. coverage files) for more robust coverage co-occurrence estimations, but it will function with a single sample input with a minor decrease in performance. Statistically dissimilar scaffolds by coverage composition are screened out and the remaining potential pairs are compared by nucleotide feature similarity. Seven total nucleotide and gene features are used to classify pairs as similar versus dissimilar using two supervised machine learning models (decision trees and neural network). Following this step, potential connections are made between scaffolds based on similarity in read coverage and nucleotide features. These connections are used to create weighted networks that are further refined into genome bins using KMeans clustering. The entire process of read coverage comparison, nucleotide feature machine learning, and weighted network refinement is performed over several binning iterations in parallel. vRhyme has 15 built-in presets of thresholds for Cohen's d, machine learning model probabilities, and network edge weights. The number of presets used is equivalent to the number of binning iterations completed. A list of all presets and their hierarchy can be found in Supplementary Table S2. 
Each bin within all binning iterations is scored according to protein redundancy, a proxy for contamination, and the best binning iteration by sequences binned, bins generated, and redundancy metrics is selected. The bins within this best binning iteration are reported along with relevant metadata, including number of members and total protein redundancy. Alternative binning iterations are likewise saved if manual inspection and selection of a different iteration is desired.
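
As an illustration of the coverage comparison step, the following minimal Python sketch computes Cohen's d between two scaffolds from per-sample coverage summaries. It is not the vRhyme implementation: the function names, the (mean, standard deviation, length) input layout, the averaging of d across samples and the max_d threshold are all assumptions made for this example.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Absolute Cohen's d between two coverage distributions using a pooled SD.

    Each scaffold is summarized by its mean per-base coverage, the standard
    deviation of that coverage and its length (number of bases) in one sample.
    """
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    if pooled_var == 0:
        # No variance at all: identical means are indistinguishable,
        # different means are maximally dissimilar.
        return 0.0 if mean_a == mean_b else math.inf
    return abs(mean_a - mean_b) / math.sqrt(pooled_var)


def coverage_compatible(cov_a, cov_b, max_d=1.0):
    """Return True if two scaffolds have similar coverage across samples.

    cov_a and cov_b are lists of (mean, sd, length) tuples, one per sample.
    Averaging d over samples and the max_d cutoff are hypothetical choices.
    """
    ds = [
        cohens_d(ma, sa, na, mb, sb, nb)
        for (ma, sa, na), (mb, sb, nb) in zip(cov_a, cov_b)
    ]
    return sum(ds) / len(ds) <= max_d
```

In this sketch, a pair that fails the check would be screened out before any sequence feature comparison, mirroring the order of steps described above.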

Figure 1.

Flowchart of vRhyme workflow and methodology. Scaffolds are compared pairwise by read coverage effect size differences using single or multiple samples (top-left), followed by sequence feature distance comparisons (top-right). Multiple iterations of network clustering of putative bins are generated with edge weights representing normalized coverage effect size and supervised machine learning probabilities of sequence feature similarity (center). The bins are refined by KMeans clustering, and the best set of bins from a single iteration is identified after protein redundancy identification and scoring (bottom).

Assessment of binning quality

To evaluate vRhyme, we first benchmarked it against reference datasets and compared its performance to several available binning tools, all of which are built for microbes. Many binning tools and wrapper software were not suitable for viral binning due to reliance on microbial single copy genes. We were able to successfully compare vRhyme to MetaBat2 (17), VAMB (18), CoCoNet (28), CONCOCT (27) and BinSanity (24) on nine datasets curated from metagenomic data (see Methods). The nine datasets were composed of 999 non-redundant and putatively complete viral genomes that were split into 4,324 sequence fragments of varying lengths between 2 kb and 20 kb. Of these, 1,118 fragments were less than 5 kb, 1,361 were greater than 5 kb and <10 kb, and the remaining 1,845 were greater than 10 kb. The average length was 9.4 kb. Although these fragments were derived from datasets not represented in the machine learning training dataset, we first verified that the fragments were distinct and would not result in a bias associated with an overfitted machine learning model. Based on BLASTn similarity at 70% identity and 60% overlap, only 255 (∼6%) of the 4,324 fragments were represented in the machine learning model training dataset, with all but two of the represented fragments being from the same human gut dataset.
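
The overlap check against the training dataset can be pictured as a simple filter over pairwise alignments. The Python sketch below is illustrative only and is not the script used for this analysis: the function name and input layout are hypothetical, and 'overlap' is assumed here to mean alignment length divided by fragment length.

```python
def represented_fragments(hits, min_identity=70.0, min_overlap=0.60):
    """Return the IDs of fragments with a qualifying hit in the training set.

    hits is an iterable of tuples:
        (fragment_id, percent_identity, alignment_length, fragment_length)
    A fragment counts as represented if any single hit meets both thresholds.
    """
    represented = set()
    for fragment_id, percent_identity, alignment_length, fragment_length in hits:
        overlap = alignment_length / fragment_length  # assumed definition
        if percent_identity >= min_identity and overlap >= min_overlap:
            represented.add(fragment_id)
    return represented


example_hits = [
    ("frag_1", 95.0, 4000, 5000),  # identity and overlap both pass
    ("frag_2", 99.0, 1000, 5000),  # overlap of 0.2 is too low
    ("frag_3", 65.0, 5000, 5000),  # identity is too low
]
print(sorted(represented_fragments(example_hits)))
```

With the counts reported above, 255 of 4,324 fragments corresponds to 5.9%, consistent with the stated ∼6%.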

A total of 17 different evaluation metrics were used, including five traditional metrics for recall, precision, accuracy, specificity, and F1 score (Figure 2). The five traditional metrics were calculated according to the true positive, true negative, false positive, and false negative rates of binning fragments together from the same or different source genomes (Supplementary Table S3a). Note that the machine learning models were not benchmarked individually since performance is measured based on the entire pipeline. vRhyme yielded the highest F1 score, the harmonic mean of precision and recall, with an average of 0.87 across all nine datasets. MetaBat2 and VAMB performed similarly with F1 scores of 0.81 and 0.82, respectively, but importantly VAMB only successfully binned three of the nine datasets due to input size requirements. vRhyme likewise yielded the highest, or equal to highest, average precision (0.94), accuracy (0.90) and specificity (0.96) compared to all benchmarked tools. Compared to MetaBat2, VAMB and CoCoNet, vRhyme likewise yielded the greatest average recall (0.80). CONCOCT and BinSanity yielded the greatest average recall values (0.96 and 0.91, respectively) but at the expense of precision (0.45 and 0.44, respectively). At least for viral genomes, CONCOCT and BinSanity were not found to be suitable binning options. VAMB had suitable performance on the three datasets with enough input sequences, but VAMB is likely not an option for many applications of binning viral genomes due to requiring many input sequences (e.g. tens of thousands (18)) for optimal performance. Based on these metrics, vRhyme performed exceptionally in binning viral genomes but did not considerably improve on the performance of MetaBat2.
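
For reference, the five traditional metrics follow the standard confusion-matrix definitions. The sketch below is a minimal Python illustration and not the benchmarking script used here; the function name and the example counts are hypothetical.

```python
def binning_metrics(tp, tn, fp, fn):
    """Compute recall, precision, accuracy, specificity and F1 from counts.

    tp/fn: fragment pairs from the same genome that were / were not binned together.
    fp/tn: fragment pairs from different genomes that were / were not binned together.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for a single dataset.
print(binning_metrics(tp=80, tn=90, fp=10, fn=20))
```

Because the reported values are averages over datasets, the average F1 score (0.87) need not equal the harmonic mean of the average precision and recall (2 × 0.94 × 0.80 / 1.74 ≈ 0.86).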

Figure 2.

Benchmarking performance metrics of vRhyme compared to MetaBat2, VAMB, CoCoNet, CONCOCT and BinSanity. Each boxplot represents the results of nine different datasets, except for VAMB in which three datasets are shown. In total, 999 non-redundant genomes artificially split into 4,324 sequence fragments are shown. For some plots, a dotted line is shown at 1.0 to indicate optimal performance. CONCOCT and BinSanity are partially shown on the Genome-to-Bin Ratio plot for better visualization; each yielded an average ratio >2.0.

The remaining 12 evaluation metrics were calculated according to complete genomes and individual bins. These included evaluating if genomes were placed into a single or separate bins, and if bins contained fragments from a single or multiple source genomes. These metrics were better able to show the distinct performance of vRhyme compared to the other tools (Supplementary Table S3b). Namely, vRhyme was better able to reduce the following: placement of genomes into separate bins, placement of fragments from multiple source genomes into a single bin, and binning circular scaffolds representing entire genomes. Importantly, this was not at the cost of reduced fragment recall by vRhyme. To combine these metrics, we created a genome score and bin score that considered recall and precision as a substitution for F1 score. For genome scores and bin scores, respectively, vRhyme (0.89 and 0.96) outperformed, or was equivalent to, MetaBat2 (0.77 and 0.93) and VAMB (0.90 and 0.93). Again, it is important to note that VAMB only successfully binned three of the nine datasets. For CoCoNet, CONCOCT and BinSanity, genome scores (0.66, 0.74 and 0.70, respectively) and bin scores (0.65, 0.48 and 0.18, respectively) reflected the propensity to ‘over bin’ distinct genomes together into one bin. CoCoNet did not bin any sequence in two of the datasets, and after removal of these zero-values, the average genome score and bin score for CoCoNet both increased to 0.84.

Furthermore, we evaluated how well vRhyme bins compare to the input, unfragmented genomes. First, using CheckV (39) we show a distinct change in genome completeness estimation in the binned versus unbinned sequence fragments. vRhyme was able to recapitulate the completeness of the input genomes (Figure 3A). This is supported by a similar observation in the length of the input genomes versus the bins (Figure 3B). Moreover, we estimated the taxonomy of the input genomes, fragments, and binned vMAGs. We identified a distinct decrease in the ability to identify taxonomy of the fragments, which was rescued by binning (Figure 3C). The identifiable difference in the vMAGs is a lack of Microviridae. Yet, this is to be expected since the small genome size of Microviridae (<10 kb) typically results in near-complete scaffolds that appropriately remain unbinned. Finally, we evaluated whether vRhyme could distinguish the source scaffolds. To do this, each of the nine datasets was binned, but the scaffolds were not fragmented. The expected result is that none of the circular scaffolds should bin together. Although vRhyme did bin ∼11% of the whole scaffolds, it was a marked improvement on VAMB and CoCoNet (Figure 3D).

Figure 3.

Impact of binning with vRhyme on the benchmarking datasets. For (A–C), the putatively complete unsplit input genomes, generated sequence fragments, binned sequence fragments, and vRhyme bins (vMAGs) are compared. (A) Estimation of genome completeness using CheckV. (B) Sequence or vMAG nucleotide length. For (A, B) each dot represents a single sequence or vMAG. (C) Estimation of taxonomy at the family level using a custom analysis script. ‘unassigned’ represents a taxonomic classification to a group with an unassigned family, ‘ambiguous’ represents equal assignment to multiple families (typically Caudoviricetes), and ‘unknown’ represents the inability to make a prediction. (D) Evaluation of vRhyme, MetaBat2, VAMB, and CoCoNet for the binning of complete genomes. The expectation is that complete genomes should remain unbinned as uncultivated virus genomes (UViGs).

Benchmarking vRhyme on marine viromes

We next applied vRhyme to the Global Ocean Virome 2 (GOV2) database (52) and compared the results to MetaBat2 and CoCoNet. For metagenomic datasets such as GOV2 the expected number of scaffolds to bin and the number of bins is unknown. First, all scaffolds from the GOV2 database were limited to scaffolds at least 5 kb in length and dereplicated by 98% identity. Of the 108,947 input scaffolds, vRhyme binned 56,642 scaffolds into 13,175 bins, MetaBat2 binned 57,800 scaffolds into 11,826 bins, and CoCoNet binned 91,842 scaffolds into 9,914 bins. Despite the number of bins generated being relatively similar, the number of scaffolds binned was quite different. However, vRhyme yielded 15,106 redundant proteins whereas MetaBat2 (29,334) and CoCoNet (71,364) yielded more, indicating that vRhyme was likely more precise and generated fewer contaminated bins (Figure 4A). In support of this, vRhyme generated 1,266 bins with two or more redundant proteins whereas MetaBat2 (1,648) and CoCoNet (2,743) generated more. When these likely contaminated bins were removed, vRhyme binned 48,251 scaffolds into 11,909 bins, MetaBat2 binned 33,351 scaffolds into 10,178 bins, and CoCoNet binned 35,380 scaffolds into 7,171 bins (Figure 4B). Based on protein redundancy, vRhyme was capable of binning far more viral scaffolds and generating more bins of low contamination compared to MetaBat2 and CoCoNet. Note, we identified bins with ‘low contamination’ to be 0–1 redundant proteins based on a benchmark of prokaryotic and eukaryotic viral genomes from NCBI databases (Supplementary Figure S2). Contamination was not estimated using CheckV as that metric does not consider contamination of multiple viral genomes, but rather contamination of non-viral sequences.
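
The low contamination filter can be pictured with a short Python sketch. This is an illustration under stated assumptions rather than vRhyme's scoring code: proteins are assumed to be pre-assigned to clusters of similar sequences, a cluster is counted as redundant when it occurs on more than one scaffold of the same bin, and the function names and example bins are hypothetical.

```python
from collections import defaultdict


def redundant_proteins(bin_scaffolds):
    """Count protein clusters found on more than one scaffold of a bin.

    bin_scaffolds maps a scaffold name to the protein-cluster IDs it encodes.
    """
    scaffolds_per_cluster = defaultdict(set)
    for scaffold, clusters in bin_scaffolds.items():
        for cluster in clusters:
            scaffolds_per_cluster[cluster].add(scaffold)
    return sum(1 for found_on in scaffolds_per_cluster.values() if len(found_on) > 1)


def low_contamination(bins, max_redundant=1):
    """Keep bins with at most max_redundant redundant proteins (0-1 by default)."""
    return {
        name: members
        for name, members in bins.items()
        if redundant_proteins(members) <= max_redundant
    }


example_bins = {
    # No cluster is shared between scaffolds: kept.
    "bin_1": {"s1": ["terL", "mcp"], "s2": ["portal"]},
    # Two clusters occur on both scaffolds: removed as likely contaminated.
    "bin_2": {"s3": ["terL", "mcp"], "s4": ["terL", "mcp", "tail"]},
}
print(sorted(low_contamination(example_bins)))
```

The reported counts are consistent with this kind of filter: removing the 1,266 vRhyme bins with two or more redundant proteins from 13,175 bins leaves 11,909.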

Figure 4.

Benchmark binning and genome completeness evaluation of GOV2. Comparison of vRhyme, MetaBat2, and CoCoNet (A) raw results and (B) low contamination filtering results by the number of scaffolds binned and identified redundancy. For vRhyme only, CheckV was used to identify (C) the estimated completeness values, (D) number of ‘NA’ completeness values, (E) number of ‘no viral genes’ scaffolds/vMAGs and (F) number of ‘longer than expected’ scaffolds/vMAGs for the low contamination results of individual binned scaffolds as well as vMAGs.

We also estimated the completeness of the 11,909 low contamination vRhyme bins and the individual 48,251 scaffolds that generated those bins using CheckV. The binned scaffolds individually yielded 25,969 (53.8%) completeness values with an average of 14% complete, 79 estimated to be 100% complete, 22,282 (46.2%) with ‘NA’ completeness, and 27,295 (56.6%) with ‘no viral genes detected’. The scaffolds within each bin, after being linked into vMAGs, yielded 8,393 (70.5%) completeness values with an average of 48% complete, 775 estimated to be 100% complete, 3,516 (29.5%) with ‘NA’ completeness, and 4,039 (33.9%) with ‘no viral genes detected’ (Figure 4C–E). There was an increase in the number of vMAGs (195, 1.6%) versus individual scaffolds (16, 0.03%) that were estimated to be ‘longer than expected’, potentially due to a marginal rate of multiple genomes being binned into a single vMAG (Figure 4F). Overall, vRhyme generated vMAGs with greater average completeness to aid in downstream analyses and interpretations, even with high complexity or large datasets such as GOV2.

Discovery of vMAGs in human skin metagenomes

To demonstrate the ability of vRhyme to aid metagenome analyses and discovery, we applied vRhyme to 270 human skin metagenomes (54). Viruses were predicted from a cohort of 34 individuals with eight body sites (Af, Al, Ba, Na, Oc, Tw, Um and Vf) sampled per individual (see Methods). From all individuals, 10,601 viral scaffolds were identified and binned, across eight different body sites individually, into a total of 849 vMAGs representing 2,794 viral scaffolds. Although bins with redundant proteins may in fact be a single genome or partially redundant copies of a single genome, we ignored all vMAGs with greater than one redundant protein for analysis to yield 762 vMAGs representing 2,413 viral scaffolds, leaving the remaining 8,188 as discrete viral scaffolds (Supplementary Table S4, Figure 5A). The taxonomic classification of UViGs pre-binning, UViGs and low redundancy vMAGs post-binning, and vMAGs-only showed that most bins were constructed of genomes from the class Caudoviricetes, similar to the observed taxonomy pre-binning (Supplementary Figure S3). The bins contained an average of 3.2 scaffolds each. In total we identified seven bins, representing separate body sites, that were present across at least 30 individuals (Figure 5B). In addition, two bins of unique characteristics were identified and examined in detail.

Figure 5.

Binning improves and expands the analysis of viruses from human skin. (A) Comparison of the number of original viral scaffolds identified across all individuals before and after binning. (B) Heatmap of coverage for the seven common bins per individual. (C) Genome visualization and alignment of Herelleviridae reference phiSA_BS2 (outer) and Tw bin 8 (inner). Each arrow represents a predicted open reading frame and black bars are artificial connections between vMAG scaffolds. (D) Alignment of vRhyme Vf bin 113 to the closest reference virus Siphoviridae isolate ctiXA4 (BK057074.1). Each of the four scaffolds was independently aligned by tBLASTx similarity. The narG AMG is labeled in yellow and viral hallmark annotations are labeled in light blue. (E) Representative cluster from all input viral scaffolds generated by vConTACT2, with the four Vf bin 113 scaffolds labeled in green. There are no connections between any of the four green scaffolds. Each dot represents a single scaffold. (F) Partial network from all vRhyme binned and unbinned viral scaffolds generated by vConTACT2, with vMAG bins labeled in orange and Vf bin 113 in green. For (E, F), complete network diagrams can be found in Supplementary Figures S4 and S5.

The first such bin contained 22 members (Tw bin 8), more than what would be expected for a viral bin, and aligned to a reference Herelleviridae phage (Staphylococcus phage phiSA_BS2) (Figure 5C). Herelleviridae infecting abundant Staphylococcus on the skin are likely to be highly relevant to skin ecology and disease (75). Before binning, each of the 22 members was identified by CheckV as a low-quality genome fragment, with individual completeness estimations ranging from 1.8% to 7.1%. The fragments averaged 5.2 kb in length and ranged from 2.6 kb to 10.0 kb. After binning, the final bin was 115 kb in length and identified as a high-quality genome with 100% completeness by CheckV. The reference phage genome is 143 kb, suggesting the true completeness of the bin is likely 80% to 100%. All CheckV results for the skin metagenomes can be found in Supplementary Table S5.
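
The lower bound of this range follows from a simple length ratio against the reference genome. The short Python sketch below shows the arithmetic; the function name is hypothetical, and a length ratio is only a rough proxy that is independent of how CheckV estimates completeness.

```python
def length_based_completeness(bin_length_kb, reference_length_kb):
    """Percent of a reference genome length covered by a bin, capped at 100."""
    return min(100.0, 100.0 * bin_length_kb / reference_length_kb)


# Tw bin 8 (115 kb) against the phiSA_BS2 reference genome (143 kb).
print(round(length_based_completeness(115, 143), 1))
```

For Tw bin 8 this gives approximately 80%, the lower bound quoted above, while the CheckV estimate of 100% gives the upper bound.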

The second bin of interest contained four members (Vf bin 113), with one encoding a nitrate reductase (narG) auxiliary metabolic gene (AMG) (Figure 5D). The narG was positioned as the last gene on a scaffold, and conventional approaches for AMG validation would suggest discarding the AMG as likely bacterial contamination. However, binning aided in the validation of the AMG as likely to be correct. The first line of evidence was the lack of any integrase or lysogenic viral signatures on any of the four binned scaffolds, suggesting the AMG is not from bacterial contamination resulting from host integration. Second, alignment of all four scaffolds to the nearest reference genome (Siphoviridae isolate ctiXA4) displayed that the AMG was situated at the intersection of two scaffolds within the genome rather than at a genome end. CheckV identified each member as low-quality with completeness values of 11.6% to 28.0% for the respective 7.4 kb to 16.8 kb scaffolds. The bin was estimated to be of medium-quality with a completeness of 74.9%, or 92% based on the length of the closest reference genome. Moreover, one of the four scaffolds lacked characteristic viral annotations to aid with manual inspection or analyses such as phylogeny, yet binning with the other scaffolds containing viral hallmark and nucleotide replication annotations was able to validate the scaffold as viral and place it in better genomic context for analysis. Therefore, binning was able to not only generate a more complete sequence, but also validate the presence of an understudied and ecologically important AMG. Using vConTACT2 (72), we clustered all of the individual, unprocessed viral scaffolds (Figure 5E) in addition to the bin with the complete binning results (low-contamination bins plus unbinned scaffolds) (Figure 5F). 
Clustering of the individual scaffolds placed all four scaffolds of the bin into a single cluster distinct from other groups, yet as anticipated none of the scaffolds of the bin were connected. Clustering of the binning results yielded more connections between scaffolds and vMAGs and better placed the bin within evolutionary and community relationship contexts. Complete vConTACT2 networks can be found in Supplementary Figures S4 and S5.

DISCUSSION

Binning viral scaffolds into vMAGs is uncommon, with most or all remaining as discrete virus operational taxonomic units (vOTUs) or uncultivated virus genomes (UViGs) (76). We believe adopting a more genome-centric approach for UViGs will enable innovative discoveries, such as the construction of large or highly heterogeneous viral genomes that often assemble into dissimilar fragments. Here, we have presented vRhyme and demonstrated that the ‘one scaffold, one virus’ convention can skew interpretations of a virosphere and the interactions of its viral community members. To address this, vRhyme enables the binning of viral genomes into vMAGs using a virus-centric approach, distinct from existing binning software, in an easy-to-use and reproducible command line tool.

In addition to performance benchmarks on artificial and real metagenomes, we evaluated the robustness of vRhyme by binning artificially fragmented NCLDV, megaphage, large eukaryotic viruses, crAssphage, active and inactive integrated prophages, and microbial genomes (Supplementary Information, Supplementary Table S6). vRhyme was largely capable of precisely binning these unique and complex viral datasets. However, notable exceptions were difficulties with separating multiple inactive (non-replicating) prophages from the same host genome as well as binning non-viral genomes, though the latter was an anticipated limitation. Moreover, we displayed that vRhyme is efficient and likely precise in binning large and complex datasets using GOV2 and agricultural soil viromes (53) (Supplementary Information, Supplementary Table S7). In total, we hope that with the availability of vRhyme as a reliable binning tool, vMAG construction will become a common practice and adopted into existing frameworks of studying viral ecology, host associations, community interactions, evolution, and biogeochemical cycling.

To further evaluate the computational capabilities and potential constraints of vRhyme, we assessed the effect of the coverage calculation methods, the number of input coverage samples and the effect of user-modifiable parameters on performance, as well as the runtime, memory usage and reproducibility of binning (Supplementary Information). We found that vRhyme performs optimally with multiple input samples for more robust coverage variance comparisons, though the optimal value depends on how the dataset or metagenome was constructed (Supplementary Table S8, Figure S6). For example, a metagenome assembled from a single, standalone sample may perform suitably. As for modifying parameters, vRhyme likely will yield optimal results with the default settings due to the coverage calculation method employed and built-in binning iterations (Supplementary Table S9, Figure S7). Furthermore, the runtime of vRhyme for average-sized viral datasets was on the scale of seconds. The GOV2 dataset, the largest dataset evaluated, finished in 93 min with 2.3 GB of memory using 15 CPU threads (Supplementary Table S10, Figure S8). Lastly, the methods employed by vRhyme allow it to be fully reproducible. Overall, we found the necessary requirements to be relatively low and even possible on personal laptop systems.

There are several important considerations in the binning of vMAGs that are distinct from microbial MAGs. First, any viral scaffold not contained within a bin (vMAG) should be considered as a vOTU or UViG. This aligns with the ‘one scaffold, one virus’ convention which is likely true for many viral genomes, especially circular and complete genomes. In the skin datasets presented here, ∼23% of the viral scaffolds were binned into low contamination vMAGs and the remaining ∼77% should still be utilized in analyses as discrete scaffolds. Second, an entire metagenome can be used as input to vRhyme, or viral binning in general. The caveat is that contamination of bins with non-viral sequences may be higher; the advantage is that fewer viral scaffolds may be missed. For example, many phage genomes are arranged in cassettes such that structural, nucleotide replication, lysis and auxiliary genes form distinct regions. If these regions were to assemble into separate scaffolds, virus identification may only identify a portion of the scaffolds, such as missing an auxiliary region, whereas binning may place them all together into a single vMAG. When applied to a synthetic dataset of predominantly non-viral sequences, MetaBat2 performed better than vRhyme (Supplementary Information, Supplementary Table S11). Third, accurate read coverage profiles are crucial for accurate binning. This is true for all binning software that depend on differential coverage and is especially true for distinguishing bins of integrated prophages from a single host population. vMAGs representing prophages generated by vRhyme will likely represent the greatest fraction of redundant, contaminated bins.

DATA AVAILABILITY

vRhyme and all auxiliary scripts are freely available as open-source Python code at https://github.com/AnantharamanLab/vRhyme.

ACKNOWLEDGEMENTS

We thank members of the Anantharaman laboratory at the University of Wisconsin-Madison for helpful feedback and discussions.

Author contributions: K.K. and K.A. designed the study. K.K., A.A. and R.S. developed code and conducted bioinformatic analyses. K.K. and K.A. drafted the manuscript. All authors (K.K., A.A., R.S., L.K. and K.A.) reviewed the results and approved the manuscript.

Contributor Information

Kristopher Kieft, Department of Bacteriology, University of Wisconsin–Madison, Madison, WI, USA; Microbiology Doctoral Training Program, University of Wisconsin–Madison, Madison, WI, USA.

Alyssa Adams, Department of Bacteriology, University of Wisconsin–Madison, Madison, WI, USA; Computation and Informatics in Biology and Medicine, University of Wisconsin–Madison, Madison, WI, USA.

Rauf Salamzade, Microbiology Doctoral Training Program, University of Wisconsin–Madison, Madison, WI, USA; Department of Medical Microbiology and Immunology, University of Wisconsin–Madison, Madison, WI, USA.

Lindsay Kalan, Department of Medical Microbiology and Immunology, University of Wisconsin–Madison, Madison, WI, USA; Department of Medicine, University of Wisconsin–Madison, Madison, WI, USA.

Karthik Anantharaman, Department of Bacteriology, University of Wisconsin–Madison, Madison, WI, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Institute of General Medical Sciences of the National Institutes of Health [R35GM143024 to K.A., R35GM137828, U19AI142720 to L.K.]; A.A. was funded by a University of Wisconsin-Madison CIBM postdoctoral traineeship from the National Library of Medicine [T15LM007359]; K.K. was supported by a Wisconsin Distinguished Graduate Fellowship Award from the University of Wisconsin-Madison; William H. Peterson Fellowship Award from the Department of Bacteriology, University of Wisconsin-Madison. Funding for open access charge: NIH Grant funds (to K.A.).

Conflict of interest statement. None declared.

REFERENCES

  • 1. Drew G.C., Stevens E.J., King K.C.. Microbial evolution and transitions along the parasite–mutualist continuum. Nat. Rev. Microbiol. 2021; 19:623–638. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Roossinck M.J. Move over, bacteria! Viruses make their mark as mutualistic microbial symbionts. J. Virol. 2015; 89:6532–6535. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Barr J.J. Missing a phage: unraveling tripartite symbioses within the human gut. Msystems. 2019; 4:e00105-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Hurwitz B.L., U’Ren J.M. Viral metabolic reprogramming in marine ecosystems. Curr. Opin. Microbiol. 2016; 31:161–168. [DOI] [PubMed] [Google Scholar]
  • 5. Howard-Varona C., Lindback M.M., Bastien G.E., Solonenko N., Zayed A.A., Jang H., Andreopoulos B., Brewer H.M., Rio T.G.del, Adkins J.N.et al.. Phage-specific metabolic reprogramming of virocells. ISME J. 2020; 14:881–895. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Kieft K., Breister A.M., Huss P., Linz A.M., Zanetakos E., Zhou Z., Rahlff J., Esser S.P., Probst A.J., Raman S.et al.. Virus-associated organosulfur metabolism in human and environmental systems. Cell Rep. 2021; 36:109471. [DOI] [PubMed] [Google Scholar]
  • 7. Barr J.J., Auro R., Furlan M., Whiteson K.L., Erb M.L., Pogliano J., Stotland A., Wolkowicz R., Cutting A.S., Doran K.S.et al.. Bacteriophage adhering to mucus provide a non-host-derived immunity. Proc. Natl. Acad. Sci. U.S.A. 2013; 110:10771–10776. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Al-Shayeb B., Sachdeva R., Chen L.-X., Ward F., Munk P., Devoto A., Castelle C.J., Olm M.R., Bouma-Gregson K., Amano Y.et al.. Clades of huge phages from across Earth's ecosystems. Nature. 2020; 578:425–431. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Paez-Espino D., Eloe-Fadrosh E.A., Pavlopoulos G.A., Thomas A.D., Huntemann M., Mikhailova N., Rubin E., Ivanova N.N., Kyrpides N.C.. Uncovering Earth's virome. Nature. 2016; 536:425–430. [DOI] [PubMed] [Google Scholar]
  • 10. Tisza M.J., Buck C.B.. A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases. Proc. Natl. Acad. Sci. U.S.A. 2021; 118:e2023202118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Roux S., Krupovic M., Daly R.A., Borges A.L., Nayfach S., Schulz F., Sharrar A., Carnevali P.B.M., Cheng J.-F., Ivanova N.N.et al.. Cryptic inoviruses revealed as pervasive in bacteria and archaea across Earth's biomes. Nat. Microbiol. 2019; 4:1895–1906. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Dutilh B.E., Cassman N., McNair K., Sanchez S.E., Silva G.G.Z., Boling L., Barr J.J., Speth D.R., Seguritan V., Aziz R.K.et al.. A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat. Commun. 2014; 5:4498. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Bowers R.M., Kyrpides N.C., Stepanauskas R., Harmon-Smith M., Doud D., Reddy T.B.K., Schulz F., Jarett J., Rivers A.R., Eloe-Fadrosh E.A.et al.. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 2017; 35:725–731. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Roux S., Emerson J.B., Eloe-Fadrosh E.A., Sullivan M.B.. Benchmarking viromics: an in silico evaluation of metagenome-enabled estimates of viral community composition and diversity. PeerJ. 2017; 5:e3817. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Schulz F., Andreani J., Francis R., Boudjemaa H., Khalil J.Y.B., Lee J., Scola B.L., Woyke T.. Advantages and limits of metagenomic assembly and binning of a giant virus. Msystems. 2020; 5:e00048-20. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Wu Y.-W., Tang Y.-H., Tringe S.G., Simmons B.A., Singer S.W.. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014; 2:26. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Kang D.D., Li F., Kirton E., Thomas A., Egan R., An H., Wang Z. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019; 7:e7359. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Nissen J.N., Johansen J., Allesøe R.L., Sønderby C.K., Armenteros J.J.A., Grønbech C.H., Jensen L.J., Nielsen H.B., Petersen T.N., Winther O.et al.. Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 2021; 39:555–560. [DOI] [PubMed] [Google Scholar]
  • 19. Lin H.-H., Liao Y.-C.. Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes. Sci. Rep. 2016; 6:24175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Wang Z., Wang Z., Lu Y.Y., Sun F., Zhu S.. SolidBin: improving metagenome binning with semi-supervised normalized cut. Bioinformatics. 2019; 35:4229–4238. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Mallawaarachchi V., Wickramarachchi A., Lin Y.. GraphBin: refined binning of metagenomic contigs using assembly graphs. Bioinformatics. 2020; 36:3307–3313. [DOI] [PubMed] [Google Scholar]
  • 22. Sieber C.M.K., Probst A.J., Sharrar A., Thomas B.C., Hess M., Tringe S.G., Banfield J.F.. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nat. Microbiol. 2018; 3:836–843. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Uritskiy G.V., DiRuggiero J., Taylor J.. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome. 2018; 6:158. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Graham E.D., Heidelberg J.F., Tully B.J.. BinSanity: unsupervised clustering of environmental microbial assemblies using coverage and affinity propagation. PeerJ. 2017; 5:e3035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. West P.T., Probst A.J., Grigoriev I.V., Thomas B.C., Banfield J.F.. Genome-reconstruction for eukaryotes from complex natural microbial communities. Genome Res. 2018; 28:569–580. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Johansen J., Plichta D.R., Nissen J.N., Jespersen M.L., Shah S.A., Deng L., Stokholm J., Bisgaard H., Nielsen D.S., Sørensen S.J.et al.. Genome binning of viral entities from bulk metagenomics data. Nat. Commun. 2022; 13:965. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Alneberg J., Bjarnason B.S., de Bruijn I., Schirmer M., Quick J., Ijaz U.Z., Lahti L., Loman N.J., Andersson A.F., Quince C.. Binning metagenomic contigs by coverage and composition. Nat. Methods. 2014; 11:1144–1146. [DOI] [PubMed] [Google Scholar]
  • 28. Arisdakessian C.G., Nigro O.D., Steward G.F., Poisson G., Belcaid M.. CoCoNet: an efficient deep learning tool for viral metagenome binning. Bioinformatics. 2021; 37:2803–2810. [DOI] [PubMed] [Google Scholar]
  • 29. Langmead B., Salzberg S.L.. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012; 9:357–359. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Li H., Durbin R.. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009; 25:1754–1760. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics. 2009; 25:2078–2079.
  • 32. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2013; NY: Academic Press.
  • 33. Hyatt D., Chen G.-L., LoCascio P.F., Land M.L., Larimer F.W., Hauser L.J. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinf. 2010; 11:119.
  • 34. Siranosian B., Perera S., Williams E., Ye C., de Graffenried C., Shank P. Tetranucleotide usage highlights genomic heterogeneity among mycobacteriophages. F1000Res. 2015; 4:36.
  • 35. O’Leary N.A., Wright M.W., Brister J.R., Ciufo S., Haddad D., McVeigh R., Rajput B., Robbertse B., Smith-White B., Ako-Adjei D. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44:D733–D745.
  • 36. Clark K., Karsch-Mizrachi I., Lipman D.J., Ostell J., Sayers E.W. GenBank. Nucleic Acids Res. 2016; 44:D67–D72.
  • 37. Paez-Espino D., Chen I.-M.A., Palaniappan K., Ratner A., Chu K., Szeto E., Pillay M., Huang J., Markowitz V.M., Nielsen T. et al. IMG/VR: a database of cultured and uncultured DNA viruses and retroviruses. Nucleic Acids Res. 2017; 45:D457–D465.
  • 38. Kieft K., Zhou Z., Anantharaman K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome. 2020; 8:90.
  • 39. Nayfach S., Camargo A.P., Schulz F., Eloe-Fadrosh E., Roux S., Kyrpides N.C. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 2021; 39:578–585.
  • 40. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011; 12:2825–2830.
  • 41. Steinegger M., Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 2017; 35:1026–1028.
  • 42. Marçais G., Delcher A.L., Phillippy A.M., Coston R., Salzberg S.L., Zimin A. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 2018; 14:e1005944.
  • 43. Ondov B.D., Treangen T.J., Melsted P., Mallonee A.B., Bergman N.H., Koren S., Phillippy A.M. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016; 17:132.
  • 44. Anantharaman K., Duhaime M.B., Breier J.A., Wendt K.A., Toner B.M., Dick G.J. Sulfur oxidation genes in diverse deep-sea viruses. Science. 2014; 344:757–760.
  • 45. Li M., Baker B.J., Anantharaman K., Jain S., Breier J.A., Dick G.J. Genomic and transcriptomic evidence for scavenging of diverse organic compounds by widespread deep-sea archaea. Nat. Commun. 2015; 6:8933.
  • 46. Tran P.Q., Bachand S.C., McIntyre P.B., Kraemer B.M., Vadeboncoeur Y., Kimirei I.A., Tamatamah R., McMahon K.D., Anantharaman K. Depth-discrete metagenomics reveals the roles of microbes in biogeochemical cycling in the tropical freshwater Lake Tanganyika. ISME J. 2021; 15:1971–1986.
  • 47. Okazaki Y., Nishimura Y., Yoshida T., Ogata H., Nakano S. Genome-resolved viral and cellular metagenomes revealed potential key virus-host interactions in a deep freshwater lake. Environ. Microbiol. 2019; 21:4740–4754.
  • 48. Coutinho F.H., Cabello-Yeves P.J., Gonzalez-Serrano R., Rosselli R., López-Pérez M., Zemskaya T.I., Zakharenko A.S., Ivanov V.G., Rodriguez-Valera F. New viral biogeochemical roles revealed through metagenomic analysis of Lake Baikal. Microbiome. 2020; 8:163.
  • 49. He Q., Gao Y., Jie Z., Yu X., Laursen J.M., Xiao L., Li Y., Li L., Zhang F., Feng Q. et al. Two distinct metacommunities characterize the gut microbiota in Crohn's disease patients. Gigascience. 2017; 6:1–11.
  • 50. Trubl G., Roux S., Solonenko N., Li Y.-F., Bolduc B., Rodríguez-Ramos J., Eloe-Fadrosh E.A., Rich V.I., Sullivan M.B. Towards optimized viral metagenomes for double-stranded and single-stranded DNA viruses from challenging soils. PeerJ. 2019; 7:e7265.
  • 51. Woodcroft B.J., Singleton C.M., Boyd J.A., Evans P.N., Emerson J.B., Zayed A.A.F., Hoelzle R.D., Lamberton T.O., McCalley C.K., Hodgkins S.B. et al. Genome-centric view of carbon processing in thawing permafrost. Nature. 2018; 560:49–54.
  • 52. Gregory A.C., Zayed A.A., Conceição-Neto N., Temperton B., Bolduc B., Alberti A., Ardyna M., Arkhipova K., Carmichael M., Cruaud C. et al. Marine DNA viral Macro- and Microdiversity from pole to pole. Cell. 2019; 177:1109–1123.
  • 53. Santos-Medellin C., Zinke L.A., Horst A.M., Gelardi D.L., Parikh S.J., Emerson J.B. Viromes outperform total metagenomes in revealing the spatiotemporal patterns of agricultural soil viral communities. ISME J. 2021; 15:1956–1970.
  • 54. Swaney M.H., Sandstrom S., Kalan L.R. Cobamide sharing drives skin microbiome dynamics. 2021; bioRxiv doi: 10.1101/2020.12.02.407395, 10 November 2021, preprint: not peer reviewed.
  • 55. Chen S., Zhou Y., Chen Y., Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018; 34:i884–i890.
  • 56. Li D., Liu C.-M., Luo R., Sadakane K., Lam T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015; 31:1674–1676.
  • 57. Kieft K., Zhou Z., Anderson R.E., Buchan A., Campbell B.J., Hallam S.J., Hess M., Sullivan M.B., Walsh D.A., Roux S. et al. Ecology of inorganic sulfur auxiliary metabolism in widespread bacteriophages. Nat. Commun. 2021; 12:3503.
  • 58. Buchfink B., Xie C., Huson D.H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 2015; 12:59–60.
  • 59. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990; 215:403–410.
  • 60. Norman J.M., Handley S.A., Baldridge M.T., Droit L., Liu C.Y., Keller B.C., Kambal A., Monaco C.L., Zhao G., Fleshner P. et al. Disease-Specific alterations in the enteric virome in inflammatory bowel disease. Cell. 2015; 160:447–460.
  • 61. Devoto A.E., Santini J.M., Olm M.R., Anantharaman K., Munk P., Tung J., Archie E.A., Turnbaugh P.J., Seed K.D., Blekhman R. et al. Megaphages infect Prevotella and variants are widespread in gut microbiomes. Nat. Microbiol. 2019; 4:693–700.
  • 62. Israeli O., Cohen-Gihon I., Zvi A., Shifman O., Melamed S., Paran N., Laskar-Levy O., Beth-Din A. Complete genome sequence of the first camelpox virus case diagnosed in Israel. Microbiol. Resour. Announc. 2019; 8:e00671-19.
  • 63. Caro-Vegas C., Sellers S., Host K.M., Seltzer J., Landis J., Fischer W.A., Damania B., Dittmer D.P. Runaway Kaposi Sarcoma-associated Herpesvirus Replication correlates with systemic IL-10 levels. Virology. 2020; 539:18–25.
  • 64. Hertel R., Rodríguez D.P., Hollensteiner J., Dietrich S., Leimbach A., Hoppert M., Liesegang H., Volland S. Genome-Based identification of active prophage regions by next generation sequencing in Bacillus licheniformis DSM13. PLoS One. 2015; 10:e0120759.
  • 65. Gutiérrez R., Markus B., Carstens Marques de Sousa K., Marcos-Hadad E., Mugasimangalam R.C., Nachum-Biala Y., Hawlena H., Covo S., Harrus S. Prophage-Driven genomic structural changes promote Bartonella vertical evolution. Genome Biol. Evol. 2018; 10:3089–3103.
  • 66. Ho C.-H., Stanton-Cook M., Beatson S.A., Bansal N., Turner M.S. Stability of active prophages in industrial Lactococcus lactis strains in the presence of heat, acid, osmotic, oxidative and antibiotic stressors. Int. J. Food Microbiol. 2016; 220:26–32.
  • 67. Kieft K., Anantharaman K. Deciphering active prophages from metagenomes. mSystems. 2022; 7:e00084-22.
  • 68. Hunter J.D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 2007; 9:90–95.
  • 69. Waskom M.L. seaborn: statistical data visualization. J. Open Source Software. 2021; 6:3021.
  • 70. Sullivan M.J., Petty N.K., Beatson S.A. Easyfig: a genome comparison visualizer. Bioinformatics. 2011; 27:1009–1010.
  • 71. Darling A.E., Mau B., Perna N.T. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One. 2010; 5:e11147.
  • 72. Jang H.B., Bolduc B., Zablocki O., Kuhn J.H., Roux S., Adriaenssens E.M., Brister J.R., Kropinski A.M., Krupovic M., Lavigne R. et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 2019; 37:632–639.
  • 73. Nepusz T., Yu H., Paccanaro A. Detecting overlapping protein complexes in protein-protein interaction networks. Nat. Methods. 2012; 9:471–472.
  • 74. Shannon P. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13:2498–2504.
  • 75. Byrd A.L., Belkaid Y., Segre J.A. The human skin microbiome. Nat. Rev. Microbiol. 2018; 16:143–155.
  • 76. Roux S., Adriaenssens E.M., Dutilh B.E., Koonin E.V., Kropinski A.M., Krupovic M., Kuhn J.H., Lavigne R., Brister J.R., Varsani A. et al. Minimum information about an uncultivated virus genome (MIUViG). Nat. Biotechnol. 2019; 37:29–37.

Data Availability Statement

vRhyme and all auxiliary scripts are freely available as open-source Python code at https://github.com/AnantharamanLab/vRhyme.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
