Hecatomb: an integrated software platform for viral metagenomics

Michael J Roach; Sarah J Beecroft; Kathie A Mihindukulasuriya; Leran Wang; Anne Paredes; Luis Alberto Chica Cárdenas; Kara Henry-Cocks; Lais Farias Oliveira Lima; Elizabeth A Dinsdale; Robert A Edwards; Scott A Handley

doi:10.1093/gigascience/giae020

GigaScience. 2024 Jun 4;13:giae020. doi: 10.1093/gigascience/giae020


Michael J Roach ¹,²,³, Sarah J Beecroft ⁴, Kathie A Mihindukulasuriya ⁵,⁶, Leran Wang ⁷,⁸, Anne Paredes ⁹, Luis Alberto Chica Cárdenas ¹⁰,¹¹, Kara Henry-Cocks ¹², Lais Farias Oliveira Lima ¹³, Elizabeth A Dinsdale ¹⁴, Robert A Edwards ¹⁵, Scott A Handley ¹⁶,¹⁷,✉

¹Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia

² Adelaide Centre for Epigenetics, University of Adelaide, Adelaide, SA, 5005, Australia

³ South Australian Immunogenomics Cancer Institute, University of Adelaide, Adelaide, SA, 5005, Australia

⁴ Harry Perkins Institute of Medical Research, Perth, WA, 6009, Australia

⁵ Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA

⁶ The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA

⁷ Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA

⁸ The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA

⁹ Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA

¹⁰ Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA

¹¹ The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA

¹²Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia

¹³ Biology Department, San Diego State University, San Diego, CA, 92182, USA

¹⁴Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia

¹⁵Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia

¹⁶ Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA

¹⁷ The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA

✉ Correspondence address. Scott A. Handley, Department of Pathology and Immunology, 660 South Euclid Avenue, Campus Box 8118, St. Louis, MO, 63110, USA. E-mail: shandley@wustl.edu

PMCID: PMC11148595 PMID: 38832467

Abstract

Background

Modern sequencing technologies offer extraordinary opportunities for virus discovery and virome analysis. Annotation of viral sequences from metagenomic data requires a complex series of steps to ensure accurate annotation of individual reads and assembled contigs. In addition, varying study designs will require project-specific statistical analyses.

Findings

Here we introduce Hecatomb, a bioinformatic platform coordinating commonly used tasks required for virome analysis. Hecatomb means “a great sacrifice.” In this setting, Hecatomb is “sacrificing” false-positive viral annotations using extensive quality control and tiered-database searches. Hecatomb processes metagenomic data obtained from both short- and long-read sequencing technologies, providing annotations to individual sequences and assembled contigs. Results are provided in commonly used data formats useful for downstream analysis. Here we demonstrate the functionality of Hecatomb through the reanalysis of a primate enteric and a novel coral reef virome.

Conclusion

Hecatomb provides an integrated platform to manage many commonly used steps for virome characterization, including rigorous quality control, host removal, and both read- and contig-based analysis. Each step is managed using the Snakemake workflow manager with dependency management using Conda. Hecatomb outputs several tables properly formatted for immediate use within popular data analysis and visualization tools, enabling effective data interpretation for a variety of study designs. Hecatomb is hosted on GitHub (github.com/shandley/hecatomb) and is available for installation from Bioconda and PyPI.

Keywords: virome, virus discovery, bioinformatic workflow, viral metagenomics

Background

Viruses are the most abundant biological entities on the planet, with current global estimates as high as 10³¹ viral particles [1], and they are omnipresent in all cellular life forms [2]. As such, they exert significant influence on their surroundings. Metagenomic sequencing offers a powerful tool for studying viral diversity in both host-associated and environmental systems [3–13]. However, there are currently many challenges associated with viral metagenomics. Despite their abundance and diversity, viruses represent a minority of reference genomes in GenBank, largely due to difficulties associated with studying them [14]. There is a vast amount of sequence information that remains taxonomically or functionally ill-defined. These sequences are regularly referred to as “viral dark matter” and pose a significant barrier to annotating viral sequences from metagenomic data (reviewed in [15]). In addition, researchers must contend with how to annotate sequences from viruses with either RNA- or DNA-based genomes. Our ability to successfully annotate metagenomic data as “viral” is directly impacted by the size and diversity of the reference database and the sensitivity of our search algorithms. Larger, more diverse reference databases can improve viral sequence annotation but are less conducive to the high-sensitivity search algorithms required to identify distant sequence similarity. This trade-off can force researchers to choose between optimal databases and optimal search algorithms.

Another challenge to the interpretation of reference-based sequence annotation is that viral metagenomes are often plagued with false-positive classifications [16–18]. Viruses share regions of sequence similarity with all other domains of life, including “stolen” genes incorporated from their hosts’ genomes and repetitive or low-complexity insertion elements or transposons. These sequences are present in many reference databases and can result in false classifications due to shared sequence similarity across taxonomies. The presence of false-positive classifications may influence data interpretation. For instance, the misclassification of viral sequences in clinical samples could lead to incorrect hypotheses about virus–disease associations or patient diagnosis. Similarly, an increased false-positive rate in any environment could lead to overestimates of species diversity. Highly curated databases may alleviate false positives, but they require tremendous resources and time. Likewise, they risk missing newly discovered viruses that have yet to make their way through the curation process. Thus, it is important for bioinformatic tools to provide a system to classify the quality of similarity-based annotations in light of imperfect databases.

Numerous bioinformatics tools exist for identifying viral sequences from metagenomic data [19–44]. However, many of these are lacking in features or fail to offer researchers an end-to-end solution to manage the myriad tasks required of virome analysis (e.g., quality control, host removal, assembly). For example, few tools are designed for read-based annotation of viral sequences and none are currently maintained [45, 46]. Individual read annotations are valuable because detecting even a small number of viral reads could signify the presence of a virus. This is true for viral sequences both closely related to and divergent from reference viral sequences. Detection of divergent viral sequences requires sensitive alignment-based tools such as BLAST, DIAMOND, or MMseqs2 [47–49]. Alternative approaches, such as those that use k-mer distances, have also been implemented [50–53]. The k-mer–based algorithms are fast but limited in their ability to annotate viral sequences divergent from those in reference databases. Therefore, sensitive alignment-based approaches are preferred for detecting divergent or novel viruses.

Alignment-based approaches are computationally more expensive than k-mer–based approaches but can be effectively implemented using a tiered database query approach [45,54]. In the tiered approach, initial queries are made against small virus-only sequence databases. Subsequent secondary cross-checking against reference databases representing sequences from all domains of life is required to remove false-positive viral annotations. In addition, queries against both amino acid and nucleotide databases may be of interest. Reference sequences from viral taxa may only be represented in one or the other database types, and amino acid databases will not include noncoding viral sequences. Tiered alignment-based queries across multiple databases require a number of steps and produce a series of disconnected outputs. A robust system to monitor and manage these steps and coordinate the outputs into a tractable framework would permit researchers to focus on making biological insights instead of on job and file management.

While read-based annotation can provide sensitive and specific detection of viral sequences within a metagenome, metagenome assembled contigs can provide additional layers of information such as gene content, gene order, and metabolic pathway prediction. Many of the tools used to assemble viral contigs are the same ones used for assembling bacterial contigs [55–58]. More recently, metavirome-specific assembly tools have begun to emerge with promising results [59,60]. Virome metagenomics is evolving beyond binning with algorithms specifically designed to resolve complete genomes from metavirome assemblies [61]. Similar to the requirements of read-based annotations, assembly requires multiple steps to ensure high-quality data for downstream analysis.

Once contigs are obtained, they can be further analyzed using one of many established virome analysis tools [23,28,31,43]. Each of these tools provides distinct information about the viral content of a metagenome. For example, VirSorter2, Cenote-Taker 2, and geNomad use customized classifiers and curated hidden Markov models (HMMs) to estimate the “viralness” of metagenome assembled contigs. This is useful for separating viral from nonviral contigs, but additional steps are required to assign taxonomic lineages. vContact2 can assign genus-level taxonomy using gene-sharing networks to prokaryotic but not eukaryotic viruses. VIBRANT uses deep learning neural networks to classify prokaryotic viral contigs. Numerous additional tools are also available for researchers to mine taxonomic and functional information from a metagenome [62]. This complex array of tools for virome interrogation provides a number of opportunities for virome researchers. However, each of them is dependent on the generation of high-quality input assembly data from a wide range of experimental systems, and they vary greatly in terms of useability and support. A common workflow that takes inputs from both short and long reads and emphasizes rigorous quality control to remove nonbiological contamination and host from a variety of library types and study designs would ensure researchers provide the highest quality of data to each of these tools regardless of their experimental system.

While several options are available for virome analysis using read- or contig-based approaches, integrating these results with study data is critical for making biological insights. Project-specific statistical analyses are often required. For example, statistical models to test if a pathogenic virus is associated with a disease using sparse read-based results will differ wildly from a study analyzing bacteriophage ecology. Fortunately, the vast majority of these disparate statistical approaches are available as R software packages [63]. In addition, principles guided by the popular tidyverse and ggplot2 packages are familiar to many researchers [64,65]. These principles have been successfully applied to the analysis of bacterial microbiome data in software suites such as PhyloSeq and microViz but have yet to be applied to the analysis of virome data [66,67].

Here we present Hecatomb, a bioinformatics platform designed to address the above issues. Hecatomb supports the analysis of both long- and short-read technologies and various library types. The pipeline performs rigorous quality control followed by tiered alignment-based taxonomic assignment using MMseqs2 [45]. Hecatomb also performs metagenomic assembly and annotation. Each step is managed using the Snakemake workflow manager with dependency management using Conda [68,69]. Hecatomb outputs several tables properly formatted for immediate use within popular data analysis and visualization tools, enabling effective data interpretation for a variety of study designs.

Implementation

Hecatomb (RRID: SCR_025002) serves as an end-to-end pipeline by processing raw sequencing reads (single or paired end, long or short reads from Illumina, MGI, PacBio, or Oxford Nanopore platforms) through 4 key modules (Fig. 1). In this way, it was designed to address many of the useability and functionality issues that are present in other software that we summarize in Supplementary Fig. S1.

Figure 1: — Hecatomb pipeline and implementation. The Hecatomb pipeline is divided into 4 modules. (1) Reads for each sample undergo preprocessing and clustering (*orange*); (2) clustered reads undergo annotation using viral and multikingdom protein databases, and clustered reads not annotated by the protein search are annotated using viral and multikingdom nucleotide databases (*blue*); (3) quality trimmed reads for each sample undergo assembly, and assemblies for each sample are coalesced into a single assembly (*green*); and (4) read-based annotations are combined with the assembly to provide contig annotations (*pink*). The assembly stages (*green* and *pink*) can optionally be skipped.

Module 1: Sequence quality control and host removal

Module 1 (preprocessing) removes nonbiological contaminants (i.e., primers, adapters) as well as common laboratory contaminants cataloged in NCBI’s UniVec database [70]. The user can select to use fastp for contaminant removal from standard library preps or BBTools for more complicated library preparation strategies such as the round A/B library, which converts RNA into cDNA using a tagged-randomized primer and second-strand DNA synthesis; the resulting dsDNA is subjected to 10–40 rounds of PCR amplification. This process generates DNA/cDNA libraries for single- and double-stranded RNA and DNA viruses [71–73]. Low-quality sequences are also trimmed or removed prior to host removal.

Module 1 has an option to remove host sequences (e.g., mouse, human) using Minimap2 [74]. This is optional as it may not be applicable to all samples (e.g., water, air, soil). Hecatomb comes packaged with several commonly used host-reference genomes, which have been masked of potential viral sequence. This masking is designed to minimize the inadvertent removal of viral sequences that may have similarities to the host sequence. Masked reference genomes were generated as follows. All viral genomes from the National Center for Biotechnology Information (NCBI) viral assembly database [75] were downloaded and computationally “shredded” into short fragments with an average length of 85 bases sharing a 30-base overlap using shred.sh from the BBTools suite [73]. Shredded viral sequences were then mapped to the host-reference genomes using BBMap (minimum identity of 90% and at most 2 insertions or deletions), and the matching regions were masked [73]. Precomputed masked reference genomes are available within Hecatomb for the following hosts: human, mouse, rat, camel, Caenorhabditis elegans, dog, cow, macaque, mosquito, pig, and tick. A command is provided to generate new masked genomes for additional hosts not included with Hecatomb.
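The shredding step can be illustrated with a short Python sketch. This is not the BBTools shred.sh implementation: the function name `shred` is hypothetical, and fixed-length fragments are assumed for simplicity, whereas the text describes an average length of 85 bases.

```python
def shred(sequence: str, length: int = 85, overlap: int = 30) -> list[str]:
    """Cut a sequence into fragments of `length` bases that share
    `overlap` bases with the next fragment (illustrative only)."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    step = length - overlap  # distance between consecutive fragment starts
    fragments = []
    for start in range(0, len(sequence), step):
        fragments.append(sequence[start:start + length])
        if start + length >= len(sequence):
            break  # the last fragment already reaches the end of the sequence
    return fragments
```

With the default values, consecutive fragments start 55 bases apart, so the last 30 bases of one fragment are the first 30 bases of the next. The final fragment may be shorter than 85 bases.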

Sequences free of contamination and host are clustered using the nucleotide version of Linclust packaged with MMseqs2 [49,76]. Clustering reduces each group of similar sequences to a single representative sequence, thus greatly reducing the computational requirements for read-based annotation in Module 2. Sequences are clustered requiring a minimum sequence identity of 97% and 80% alignment coverage of the target sequence to the representative sequence (--min-seq-id 0.97 -c 0.8 --cov-mode 1). The size of each cluster per sample is maintained in the final output table (seqtable.fasta). This information serves as a “count table” for each sequence, and values are provided as both raw counts and counts normalized to library size.
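As a rough illustration of how cluster sizes become a count table, the sketch below tallies the reads assigned to each representative within one sample and expresses each tally as a percentage of that sample's reads. The function name `count_table` and the input layout (a mapping from read ID to representative ID) are assumptions made for this example, not Hecatomb's internal format.

```python
from collections import Counter

def count_table(cluster_of_read: dict[str, str]) -> dict[str, tuple[int, float]]:
    """Map each representative sequence to (raw count, percent of sample reads).

    `cluster_of_read` maps every read ID in one sample to the ID of the
    representative sequence of its cluster (hypothetical input layout).
    """
    sizes = Counter(cluster_of_read.values())  # reads per representative
    total = sum(sizes.values())                # library size after QC and host removal
    return {rep: (n, 100.0 * n / total) for rep, n in sizes.items()}
```

Keeping the raw count alongside the normalized value lets downstream analyses choose whichever is appropriate for the statistical model being used.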

Module 2: Read-based annotation

Taxonomic and functional annotation is provided to reads in seqtable.fasta using a tiered approach (Fig. 2A). All queries are carried out using MMseqs2 [49]. Queries against amino acid databases are performed using MMseqs2 6-frame translation. Each read in seqtable.fasta is first queried against all viral (taxonomy id: 10239) amino acid sequences in UniProtKB clustered at 99% identity using Linclust (Viral AA DB) [76,77]. Sequences annotated as virus are subsequently queried against UniClust50 (Multi-kingdom AA DB) to remove false-positive annotations [78]. UniProt functional annotations are also applied when available.

Figure 2: — Read-based annotation. (A) Tiered annotation strategy. All alignments are completed using MMseqs2. (1) High-quality representative sequences are queried against a viral amino acid (aa, *green*) sequence database. (2) Potentially viral sequences are subjected to a secondary, confirmatory query against a multikingdom amino acid sequence database. (3) Representative sequences that do not match a known viral amino acid sequence are subjected to an untranslated query to a viral nucleic acid sequence database (nt, *purple*), (4) followed by a secondary, confirmatory query against a polymicrobial nucleotide database. (5) Sequences that have been classified as either viral (*blue*) or nonviral (*pink*) in either the translated (aa database) or untranslated (nt database) queries are combined into a final taxonomy table. (B) Read annotation data structure. (1) Read annotations are generated using the clustered sequences (seqtable.fasta). (2) The clustered sequence IDs are unpacked to yield the sample ID, the number of reads that sequence represents, and the percentage of host-removed reads that sequence represents. (3) The alignment metrics from the annotation module are joined into the read annotations using the sequence ID as the primary key. (4) Taxonomic annotations are calculated and joined into the read annotations again using the sequence ID. (5) ICTV viral classifications are joined into the read annotations by the taxonomic family annotation. (6) Sample metadata can be joined into the read annotation table using the sample ID as the primary key. (7) The read annotation table with sample metadata can be quickly and easily analyzed.

Reads not identified as viral using queries against amino acid databases are queried against nucleotide databases (Fig. 2A). Similar to the translated queries, each read is first queried against a virus-only reference sequence database (Virus NT DB). This database consists of all viral sequences in GenBank (taxonomy id: 10239) clustered at 100% identity using Linclust [49,76]. Sequences annotated as virus are subsequently queried against a customized nucleotide database (Polymicrobial NT DB) containing the Virus NT DB and representative RefSeq genomes from bacteria (1 per genus, n = 14,933), archaea (n = 511), fungi (n = 423), protozoa (n = 90), and plant (n = 145) genomes. These reference genomes represent a genomic “polymicrobial” community and cover a large amount of microbial sequence space. This allows for the removal of false-positive annotations from the first query using a relatively small reference database.
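The control flow of the tiered strategy can be sketched in Python as follows. The four search functions are hypothetical stand-ins for the MMseqs2 queries against the Viral AA DB, Multi-kingdom AA DB, Virus NT DB, and Polymicrobial NT DB; each is assumed to return the superkingdom of the best hit, or None when there is no hit. Following the Fig. 2A caption, only reads with no hit in the viral amino acid search proceed to the nucleotide tier.

```python
from typing import Callable, Optional

# A search takes a read ID and returns the best-hit superkingdom, or None.
Search = Callable[[str], Optional[str]]

def tiered_annotate(
    read_id: str,
    viral_aa: Search,
    multi_aa: Search,
    viral_nt: Search,
    multi_nt: Search,
) -> Optional[tuple[str, str]]:
    """Return (alignment_type, classification), or None when no tier matches."""
    tiers = (("aa", viral_aa, multi_aa), ("nt", viral_nt, multi_nt))
    for aln_type, primary, secondary in tiers:
        if primary(read_id) is None:
            continue  # no candidate viral hit in this tier; try the next one
        # Confirmatory search: the broader database decides the final call.
        # Assumption: a missing secondary hit is treated as nonviral here.
        best = secondary(read_id)
        return aln_type, "viral" if best == "Viruses" else "nonviral"
    return None
```

The small virus-only databases act as a cheap filter, so the expensive multikingdom searches only run on reads that already look viral.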

Taxonomic annotations are augmented using a modified version of the 2b lowest common ancestor (2b-LCA) algorithm described in [79]. The 2b-LCA algorithm provides conservative taxonomic assignments toward lower nodes of the tree when similarity is found across a heterogeneous collection of taxonomies. However, the LCA algorithm fails when crossing higher taxonomic ranks. For example, sequences with similarity to both bacterial and viral taxa have an LCA of “root” in the NCBI tree, while viruses from distinct viral domains (e.g., bacteriophage and vertebrate viruses) are assigned to “virus root.” Hecatomb will identify these instances and augment the annotations by reverting to the top-hit annotation. Each instance of this is flagged in the final output table so researchers can choose to include or exclude these from downstream analysis. This approach provides additional information about sequences with ambiguous taxonomic assignments instead of just leaving them as “root” or “virus root” and simply discarding them.
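The fallback to the top hit can be sketched as follows. This example uses a plain common-prefix LCA rather than the 2b-LCA algorithm, and the lineage representation (a list of names from the root downward, with the top hit first) and the function names are assumptions made for illustration.

```python
UNINFORMATIVE = {"root", "Viruses"}  # LCA results that carry no useful taxonomy

def lowest_common_ancestor(lineages: list[list[str]]) -> list[str]:
    """Return the shared prefix of several root-to-leaf lineages."""
    common = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        common.append(ranks[0])
    return common

def augment(lineages: list[list[str]]) -> tuple[list[str], bool]:
    """Return (lineage, reverted_to_top_hit); lineages[0] is the top hit."""
    lca = lowest_common_ancestor(lineages)
    if not lca or lca[-1] in UNINFORMATIVE:
        return lineages[0], True  # flag so the row can be filtered downstream
    return lca, False
```

Returning the flag alongside the lineage mirrors the idea that researchers can later include or exclude the augmented annotations.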

Sequence annotations from queries against both amino acid and nucleotide databases are combined into 1 table and assigned updated taxonomies from the most recent version of NCBI’s Taxonomy Database using TaxonKit [80,81] (Fig. 2A). This taxonomy table contains full Linnaean taxonomic lineages (Kingdom, Phylum, Class, Order, Family, Genus, and Species), the alignment type used for annotation (translated [aa, amino acid database] or untranslated [nt, nucleotide database]), and LCA augmentation information. Because Hecatomb keeps track of individual sequence IDs throughout this process, these read-based taxonomic assignments can then be combined with other data generated by Hecatomb or with external data resources (Fig. 2B). By default, Hecatomb will combine MMseqs2 alignment information (e.g., target/query ID, e-value, percent identity, alignment length) and count table information gathered during the clustering process. As an example of combining data with external resources, Hecatomb will provide Baltimore virus type information (both Baltimore class and group). Baltimore is a classification system that places viruses into 1 of 7 groups depending on a combination of their nucleic acid (DNA or RNA), strandedness (single-stranded or double-stranded), sense, and method of replication. This could easily be extended to a variety of external data resources. Together, these disparate data tables are collected into Hecatomb’s bigtable.tsv. As Hecatomb tracks sample identifiers, it is then possible to combine the bigtable with sample data. All of these data are formatted to make them easily importable into commonly used data analysis tools for statistical and graphical analysis.
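The key-based joins described here can be sketched with plain Python dictionaries. The column names (`family`, `sample_id`, `baltimore_group`, and so on) are hypothetical and do not necessarily match the headers in bigtable.tsv; the helper `left_join` is likewise illustrative.

```python
def left_join(rows: list[dict], lookup: dict[str, dict], key: str) -> list[dict]:
    """Attach the columns in lookup[row[key]] to each row.

    Rows without a match in `lookup` are kept unchanged (a left join).
    """
    joined = []
    for row in rows:
        extra = lookup.get(row[key], {})
        joined.append({**row, **extra})
    return joined

# Hypothetical usage: add Baltimore information by family, then sample metadata.
annotations = [
    {"seq_id": "s1", "sample_id": "A", "family": "Picornaviridae", "count": 12},
    {"seq_id": "s2", "sample_id": "B", "family": "Adenoviridae", "count": 3},
]
baltimore = {
    "Picornaviridae": {"baltimore_group": "IV", "baltimore_type": "ssRNA(+)"},
    "Adenoviridae": {"baltimore_group": "I", "baltimore_type": "dsDNA"},
}
metadata = {"A": {"status": "infected"}, "B": {"status": "uninfected"}}
table = left_join(left_join(annotations, baltimore, "family"), metadata, "sample_id")
```

The same pattern applies to any external resource that can be keyed on a column already present in the read annotations.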

Module 3: Assembly

By default, Hecatomb performs an assembly for individual samples using MEGAHIT for short reads or Canu for long reads (Fig. 3) [82,83]. Individual sample assemblies are then merged into a population assembly using Flye [84].

Figure 3: — Viral metagenome merged assembly. (1) High-quality k-mer–normalized sequences from individual samples are assembled using either MEGAHIT or Canu. (2) The sequences for each sample are mapped to their respective assemblies. (3) The unmapped reads from all samples are pooled together. (4) The pooled unmapped reads are assembled using either MEGAHIT or Canu. (5) The contigs from all sample assemblies and the unmapped reads assembly are combined together. (6) Overlapping contigs are joined together using Flye using the subassemblies algorithm.

Per-sample contig abundances are calculated by mapping individual sample reads to the population assembly using BBMap [85]. Read counts are reported normalized to library size and contig length using a variety of measures (reads per kilobase per million [RPKM], fragments per kilobase per million [FPKM], and sequences per million [SPM]). SPM is the same calculation as used for transcripts per million (TPM) except that the sequences are not assumed to be transcripts [86,87]. Additional contig properties (e.g., length, GC content, coverage percentage) are combined with taxonomic assignments and sample abundance estimates into a final table (contig_count_table.tsv).
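The two normalizations can be written out as a minimal Python sketch using the standard RPKM and TPM formulas. The function names and the dictionary inputs are assumptions for this example; they do not reproduce how BBMap or Hecatomb compute or store these values.

```python
def rpkm(reads: dict[str, int], lengths: dict[str, int], library_size: int) -> dict[str, float]:
    """Reads per kilobase of contig per million reads in the library."""
    return {
        contig: n / (lengths[contig] / 1e3) / (library_size / 1e6)
        for contig, n in reads.items()
    }

def spm(reads: dict[str, int], lengths: dict[str, int]) -> dict[str, float]:
    """Sequences per million: the TPM calculation applied to contigs."""
    # Length-normalize first, then scale so the values sum to one million.
    rates = {contig: n / lengths[contig] for contig, n in reads.items()}
    total = sum(rates.values())
    return {contig: rate / total * 1e6 for contig, rate in rates.items()}
```

Because SPM values always sum to one million within a sample, they are directly comparable as proportions across samples, whereas RPKM totals can differ between samples.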

Options to skip the assembly step and to perform a cross-assembly are available for the user. Cross-assembly assembles all reads from all samples simultaneously (skipping the individual sample assemblies). This can result in better-quality assemblies but is computationally expensive for larger datasets and may not be an option for many users.

Module 4: Contig-based annotation

Module 4 provides taxonomic annotations to contigs. Taxonomy is assigned to all contigs in the population assembly using MMseqs2 [49]. Each contig is queried against the same polymicrobial nucleotide database (Polymicrobial NT DB) used for read-based annotation (contigAnnotations.tsv). Additionally, information obtained from both the read-based annotation and assembly modules (Fig. 1) is combined. Read mapping information (start, stop, mapping quality, etc.) is maintained during the sample abundance estimation (mapping with BBMap) performed as part of Module 3 [85]. This mapping information is combined with read-based annotations (bigtable.tsv) to generate a new table combining read-based taxonomic information across each contig (contigSeqTable.tsv).

Installation and dependency management

Hecatomb is hosted on GitHub [88] and is available for installation from Bioconda [89] and the Python Package Index [90], easing installation for individual users with a single command. Hecatomb makes liberal use of Conda environments to ensure portability, ease of installation, and proper versioning of software dependencies (Supplementary Fig. S2). All required and optional software dependencies are summarized in Supplementary Table S1 [49,71,74,81–85,91–94]. Hecatomb and Conda handle the installation of all dependencies. Conda environments for jobs are created automatically by Snakemake. The use of isolated Conda environments for Hecatomb minimizes package version conflicts, minimizes overhead when rebuilding environments for updated dependencies, and allows maintenance and customization of different Hecatomb versions.

While Hecatomb is a Snakemake pipeline, it uses the Snaketool command line interface to make running the pipeline as simple as possible [95]. Snaketool populates required file paths and configuration files, allowing Hecatomb to be configured and run with a simple command, and it offers a convenient way to modify parameters and customize options.

High-performance computing deployment

Hecatomb can be deployed on a high-performance computing (HPC) cluster and can utilize Snakemake profiles for cluster job schedulers (e.g., Slurm, SGE, etc.). Snakemake uses profiles to submit pipeline jobs to the job scheduler and monitor their progress. Profiles can be created manually, but Hecatomb has been designed for compatibility with the official Cookiecutter [96] profiles for Snakemake [97] and comes with a preinstalled Slurm example profile.

Customization

Hecatomb ships with predefined settings for the options of each individual process. These settings are highly customizable through a Snakemake YAML configuration file, which provides a single source for all user customization. Settings such as the quality threshold used for read trimming in Module 1 or the minimum contig length to retain in Module 3 can easily be adjusted to suit individual user or project needs.

Application

Hecatomb accelerates profiling of viral metagenomes

We reanalyzed a previously published dataset of 95 stool samples collected from simian immunodeficiency virus (SIV)–infected rhesus macaques (Macaca mulatta) (NCBI BioProject accession: PRJEB9503) [5]. Sequences were generated using the Illumina MiSeq 2 × 250 paired-end protocol using round A/B libraries (DNA and cDNA to enable detection of both RNA and DNA viruses) from virus-like particle preparations. These data contain sequences from a variety of RNA and DNA vertebrate viruses. The original study also identified a statistically significant difference in the abundance of several enteric viruses in SIV-infected animals compared to uninfected animals. These data included samples ranging from 768,268 to 4,229,134 raw input reads with a duplication rate at 97% identity ranging from 3.98% to 56.06% (Supplementary Table S2).

We first assessed Hecatomb’s overall ability to detect diverse viral sequences. Hecatomb was executed using the round A/B preprocessing module and default parameters. Hecatomb classified sequences into phylogenetically diverse viral groups (Fig. 4A). Of the 2,394,740 reads annotated by Hecatomb as viral, 1,989,352 (83%) were annotated using queries against protein (aa) databases and 405,388 (17%) annotated using queries against nucleotide databases (nt). Bacteriophages from the family Microviridae and the order Caudovirales were highly abundant. Sequences belonging to a diverse set of viruses associated with infection of plants and protists were also detected (Supplementary Fig. S3). Similar to the original study, Hecatomb identified a large number of sequences belonging to the Picornaviridae and Adenoviridae. Sequences from these viral families were found to be more abundant in SIV-infected macaques when compared to uninfected animals in the original study (Fig. 4C).

Figure 4: — Reanalysis of rhesus macaque stool viromes. (A) Abundance of reads classified by viral phylum (color) and type (shape) from 2,394,740 input sequences (all annotated sequences from the entire study). Phyla represented by fewer than 1,000 reads were excluded. (B) Percent identity and alignment lengths of all sequences classified for the 4 animal viruses identified in the previous study and 2 viruses of protists. Horizontal (70% identity) and vertical (150-base alignment length) dashed lines indicate a user-defined quadrant space. Each point represents an individual sequence colored by classification method (aa = translated search to an amino acid database, nt = classified via an untranslated search to a nucleotide database). Panels A and B represent data obtained from all 95 samples in the study. (C) Comparison of the number of sequences in SIV-infected and uninfected samples 5 weeks postinfection with SIV and at the time of necropsy. Significance determined by the Wilcoxon signed-rank test. *P ≤ 0.05, **P ≤ 0.01, ***P ≤ 0.001, ****P ≤ 0.0001. CPM: counts per million.

Hecatomb collects and organizes alignment statistics (e.g., e-values, percent identity, alignment length) generated for each sequence annotation in Module 2. These data can be useful for assessing the prevalence and quality of viral annotations within a study. As an example, we examined the percent identity and alignment lengths of every read assigned taxonomy to the 4 families of viral enteropathogens identified in the original study (Circoviridae, Picornaviridae, Adenoviridae, and Parvoviridae) (Fig. 4B). Hecatomb annotated sequences for these 4 viral families using both translated queries to amino acid (aa) databases and untranslated queries to nucleotide (nt) databases. Quadrants were applied to visualize low- and high-identity and short- and long-alignment lengths for every annotated sequence. Sequences in the upper 2 quadrants are highly similar to sequences in the reference databases over short (upper left, quadrant 1 [Q1]) or long (upper right, Q2) alignment lengths, while sequences in the lower 2 quadrants have low similarity over short (lower left, Q3) or long (lower right, Q4) alignment lengths. For this analysis, we arbitrarily selected 70% identity to represent the cutoff between low and high identity for translated (aa database) alignments and 90% identity for untranslated (nt database) alignments. These values are adjustable and could be customized for each study and viral family of interest. Using this framework, it is clear that a majority of sequences are high identity (both short and long alignments) to sequences in both the aa and nt reference databases for the 4 families of enteropathogenic viruses.
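The quadrant assignment can be expressed as a small Python function. The identity cutoffs (70% for aa, 90% for nt) and the 150-base length cutoff come from the text and the Fig. 4 caption; the function name `quadrant` and the choice to count values exactly on a cutoff as "high" or "long" are assumptions of this sketch.

```python
IDENTITY_CUTOFF = {"aa": 70.0, "nt": 90.0}  # percent identity, per alignment type
LENGTH_CUTOFF = 150                          # alignment length cutoff from Fig. 4B

def quadrant(percent_identity: float, alignment_length: int, alignment_type: str) -> str:
    """Assign an alignment to Q1-Q4 of the identity/length quadrant space."""
    high_identity = percent_identity >= IDENTITY_CUTOFF[alignment_type]
    long_alignment = alignment_length >= LENGTH_CUTOFF
    if high_identity:
        return "Q2" if long_alignment else "Q1"  # upper right / upper left
    return "Q4" if long_alignment else "Q3"      # lower right / lower left
```

For example, an 80% identity alignment of 200 positions falls in Q2 for a translated (aa) search but in Q4 for an untranslated (nt) search, because the nt cutoff is stricter.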

In contrast, a large number of sequences were also classified due to similarity to reference sequences from viruses of protists (Fig. 4B). Viruses in the family Mimiviridae infect Acanthamoeba, and those in the family Phycodnaviridae infect algae; both are dsDNA viruses with large genomes [98]. While it is conceivable that these viruses may exist in the stool samples of rhesus macaques via water or food, the quadrant framework shows little evidence of high-identity alignments to any sequence in either the aa or nt databases (Fig. 4B, Supplementary Fig. S4). Hecatomb does not automatically remove sequences from these families because of their presence in environmental datasets. There is evidence for short and long low-identity alignments (quadrants 3 and 4) to both Phycodnaviridae and Mimiviridae reference sequences. Thus, these sequences should be analyzed using additional metrics (e.g., E-values and abundance across samples) to determine whether they represent potentially novel viral sequences. This would not have been possible using stringent E-value filtering prior to data analysis.

Reevaluation of existing environmental metagenomic datasets

We assessed Hecatomb’s ability to analyze nonhuman-associated viromes by processing a previously studied coral reef dataset (NCBI BioProject accession: PRJNA595374, SRA study number SRP237459) [99,100]. The dataset consists of whole-genome shotgun (WGS) metagenomic sequencing of both seawater and coral mucus from inner and outer sections of a Bermuda reef system. The original studies only considered bacterial metagenome-assembled genomes, which makes it an excellent candidate for generating new biological insights by characterizing the viruses of this previously published dataset. The original study identified statistically significant differences in bacterial compositions between the coral mucus and seawater microbiomes and the coral mucus microbiomes from the inner and outer reefs. All analysis was performed using output from the contig annotations provided by Hecatomb’s modules 3 and 4.

We first tested for differences in viral species alpha diversity using both Shannon diversity and richness comparing both inner and outer reef samples and coral mucus and reef water samples (Fig. 5). Both Shannon diversity and richness were significantly higher for inner reef samples compared to outer reef samples (Fig. 5A). Shannon diversity and richness were not significantly different between coral mucus and reef water. This result is in contrast with the bacterial diversity and richness metrics being similar across all samples as reported in the original study [99].
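For readers less familiar with these metrics, the two alpha diversity measures can be written out directly. The sketch below is illustrative only and is not the R code used in the study. It assumes per-sample species counts as a plain list and uses the natural logarithm for the Shannon index, which is a common default but is an assumption here. The example counts are made up.

```python
import math


def richness(counts):
    """Number of taxa observed (nonzero count) in one sample."""
    return sum(1 for c in counts if c > 0)


def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Made-up species counts for two hypothetical samples.
even_sample = [10, 10, 10, 10]
skewed_sample = [37, 1, 1, 1]
h_even = shannon_diversity(even_sample)
h_skewed = shannon_diversity(skewed_sample)
```

Both example samples have the same richness, but the evenly distributed sample has the higher Shannon index, which is why the two metrics are reported side by side.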

Figure 5: Reanalysis of coral reef metagenomes. (A) Viral species richness and Shannon diversity boxplots of inner and outer reef samples, colored by sample type. Significance (P < 0.05) is indicated. (B) Principal coordinates analysis (PCoA) of viral genera abundance. Inner and outer reef water samples are colored light blue and dark blue, respectively. Inner and outer coral mucus samples are colored light and dark green, respectively. Permutational multivariate analysis of variance (PERMANOVA) identified nonhomogenous distributions of inner versus outer reef samples (P = 0.001) and coral mucus versus reef water samples (P = 0.015). Ellipses for sample groups are drawn at 85% confidence levels for multivariate t-distribution. (C) Dendrogram of most prevalent viral taxa (>10% of samples). Linear regression (LR) models were generated for all taxa for the outer reef samples compared with the inner reef samples. Nodes are weighted by prevalence, and nodes and edges are colored by the LR model estimate coefficients. Significant LR models (P < 0.05) are indicated, and taxa with absolute coefficients greater than 3 are labeled. Nodes colored red are elevated in outer reef samples, whereas nodes colored blue are elevated in inner reef samples. (D) Same as for Fig. 5C except LR models are calculated for coral mucus samples compared with reef water samples; nodes and edges are colored by the LR model estimate coefficients for this comparison.

Next, we compared the viral compositions of these coral reef samples using beta diversity. Principal coordinate analysis (PCoA) of Bray–Curtis dissimilarity of viral genera, and permutational multivariate analysis of variance (PERMANOVA) showed nonhomogenous distributions between inner and outer reef samples (P = 0.001) as well as between coral mucus and reef water samples (P = 0.015) (Fig. 5B). There is a strong separation of inner and outer reef samples along the x-axis and a weaker separation of reef water and coral mucus samples along the y-axis; this is the same trend observed in the original study for the bacterial compositions.
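The Bray–Curtis dissimilarity that underlies this ordination is simple to state. The following sketch computes it for pairs of samples from made-up genus counts. It is an illustration under stated assumptions, not the code used in the study: the sample names and counts are hypothetical, and the ordination and PERMANOVA steps are left to dedicated packages.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two equal-length count vectors.

    Defined as sum(|a_i - b_i|) / (sum(a) + sum(b)); 0 means identical
    composition and 1 means no shared taxa.
    """
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    denominator = sum(sample_a) + sum(sample_b)
    return numerator / denominator if denominator else 0.0


def dissimilarity_matrix(samples):
    """Pairwise Bray-Curtis values for a dict of sample name -> counts."""
    names = list(samples)
    return {
        (first, second): bray_curtis(samples[first], samples[second])
        for first in names
        for second in names
    }


# Hypothetical genus-level counts for three samples.
samples = {
    "inner_water": [6, 4, 0],
    "outer_water": [2, 8, 0],
    "outer_mucus": [0, 0, 5],
}
matrix = dissimilarity_matrix(samples)
```

A matrix like this is the input to PCoA, and group labels such as inner versus outer reef are then tested against it with PERMANOVA.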

Linear regression (LR) characterizes the viral taxa that are driving the differences between inner and outer reefs, as well as between coral mucus and reef water samples. The R package microViz calculates LR models for all taxa at all taxonomic levels for the various sample groups. We calculated LR models to the genus level and generated tree plots for all taxa with prevalence greater than 10% of samples, colored by the LR model estimate coefficients and weighted by prevalence, between the inner and outer reef samples (Fig. 5C) and between the coral mucus and reef water samples (Fig. 5D). These analyses indicate that inner reef samples have significantly higher relative abundances of many viral taxa, including several Caudoviricetes taxa and especially the Kyanoviridae family, which consists of various Synechococcus phages and cyanophages (Fig. 5C). Conversely, far fewer viral taxa were more abundant in outer reef samples. Reef water samples contained elevated abundances of many viral taxa compared to coral mucus samples, with the main exception of the family of giant viruses Mimiviridae (Fig. 5D).
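The per-taxon modeling idea can be illustrated with a small, self-contained sketch. It is not microViz and does not reproduce its settings. The log2(count + 1) transform, the 0/1 coding of outer reef samples, the function names, and the example counts are all assumptions made for illustration, and significance testing is omitted.

```python
import math


def prevalence(counts):
    """Fraction of samples in which a taxon has a nonzero count."""
    return sum(1 for c in counts if c > 0) / len(counts)


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (single predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def taxon_coefficients(abundance, is_outer, min_prevalence=0.10):
    """Slope of log2(count + 1) on an outer-reef indicator for each taxon.

    abundance: dict of taxon name -> list of per-sample counts.
    is_outer: list of 0/1 flags (1 = outer reef), same sample order.
    """
    coefficients = {}
    for taxon, counts in abundance.items():
        if prevalence(counts) <= min_prevalence:
            continue  # keep only taxa present in more than 10% of samples
        transformed = [math.log2(c + 1) for c in counts]
        coefficients[taxon] = ols_slope(is_outer, transformed)
    return coefficients


# Made-up counts: two inner reef samples followed by two outer reef samples.
abundance = {"taxon_a": [7, 7, 1, 1], "taxon_b": [0, 0, 0, 0]}
is_outer = [0, 0, 1, 1]
coefficients = taxon_coefficients(abundance, is_outer)
```

With this coding, a negative coefficient means the taxon is more abundant in inner reef samples and a positive coefficient means it is more abundant in outer reef samples.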

Accelerated discovery of novel viruses

Hecatomb retains the assembly graph as well as the assembly itself, which downstream tools can utilize to resolve metagenome-assembled genomes. There was strong evidence for the presence of novel bacteriophages within the SIV-macaque dataset in the form of many high-quality but low-identity alignments to known reference viruses (Supplementary Fig. S5). We therefore processed the assembly graph with Phables, which bins fragmented assemblies into complete genomes, and identified 127 probable complete phage genomes [61]. These genomes were assessed with CheckV, which determined that 121 of them were high-quality complete phage genomes [101]. We assigned taxonomy using MMseqs2 with the Hecatomb primary nucleotide databases (Supplementary Table S3) [49]. Lastly, the genomes were annotated using Pharokka [102]. Of the 121 genomes, 98 were Microviridae (Supplementary Table S3). Of these Microviridae, 96 exhibited the hallmark replication initiation protein followed by a major capsid protein; 55 of these 96 also had the hallmark minor tail or pilot tail spike protein (Fig. 6A), while the other 41 contained hypothetical proteins where the tail spike protein would be. There were 10 Caudoviricetes genomes, 2 Cressdnaviricota genomes, and 8 genomes that had hits to known phages with no taxonomic information. There were 13 cases where 2 genomes were resolved from the same assembly graph element; these likely represent quasi-species that can occur, for instance, through recombination events [103,104]. We also processed the coral dataset using the same method. Synteny was also conserved in these larger phages; for instance, Caudoviricetes phage 1112C1 arranged its capsid and tail proteins together and exhibited the conserved layout described in [105] (Fig. 6B). In the coral dataset, we identified 3 complete Caudoviricetes phage genomes (Supplementary Table S3). The number of samples in this study was much lower and likely impacted the number of recovered viral genomes. The recovered viral genomes across both studies are novel, with only 18 aligning to a known phage with an identity higher than 90% (Supplementary Table S3). This demonstrates Hecatomb’s utility for data-mining published environmental metagenome projects to generate novel viral genomes.

Figure 6: Circos plots of complete bacteriophage genomes. Circos plots were generated for all novel bacteriophage genomes using Pharokka’s pharokka_plotter.py script (circos plots available at 10.5281/zenodo.6388251). (A) Circos plot for uncultured Microviridae phage 977C1. (B) Circos plot for uncultured Caudoviricetes phage 1112C1.

Analysis of an in silico set of viral genomes

To determine the ability of Hecatomb to accurately identify viral sequences in a mixed metagenome, we generated a mock in silico data set. Canonical viral genomes from each genomic type (single- and double-stranded DNA and RNA genomes and 1 retrovirus) were downloaded from NCBI (Table 1). For each genome, we simulated 1 × 10⁴ 250-base-pair sequences using the Illumina error model available in InSilicoSeq [106]. Next, we generated 1 × 10⁶ sequences from a collection of 1,520 cultivated human gut bacterial genomes [102] using the same Illumina error model. Simulated sequences from viruses and bacteria were combined to simulate a mixed virus–bacteria environment and processed through Hecatomb using the default settings. Hecatomb assigned the correct taxonomy to 99.64–99.65% of the simulated viral reads with a sensitivity ranging from 0.98 to 1 and a false-positive rate ranging from 8.8 × 10⁻⁴ to 1.35 × 10⁻³ (Table 1).

Table 1.

Analysis of an in silico generated mock viral–bacterial community

Virus	NCBI accession	Genome type	Genome length (bp)	True positive	False positive	Sensitivity	False-positive rate
Human gammaherpesvirus 4	NC_007605.1	dsDNA	171,823	9,910	1,345	0.99	1.35 × 10⁻³
Human parvovirus B19	NC_000883.2	ssDNA	5,596	9,964	873	1	8.8 × 10⁻⁴
Rotavirus	NC_011507:0-10	dsRNA	18,562	9,856	134	0.99	1.35 × 10⁻⁴
Rabies	NC_001542.1	ssRNA	11,932	9,932	654	0.99	6.6 × 10⁻⁴
Human immunodeficiency virus 1	NC_001802.1	retrovirus	9,181	9,876	564	0.99	5.69 × 10⁻⁴
Crassphage	NC_055760	dsDNA	94,878	9,765	4,578	0.98	4.58 × 10⁻³
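The last two columns of Table 1 can be related to the count columns with a short calculation. The sketch below assumes that sensitivity is the number of true positives divided by the 1 × 10⁴ reads simulated per viral genome, and that the false-positive rate is the number of false positives divided by the 1 × 10⁶ simulated bacterial reads. These denominators are an inference from the simulation design and agree with the tabulated values to within rounding, but they are not stated explicitly in the text. The function names are hypothetical.

```python
VIRAL_READS_PER_GENOME = 10_000  # 1 x 10^4 simulated reads per viral genome
BACTERIAL_READS = 1_000_000      # 1 x 10^6 simulated bacterial reads


def sensitivity(true_positives, simulated=VIRAL_READS_PER_GENOME):
    """Fraction of simulated viral reads that were correctly assigned."""
    return true_positives / simulated


def false_positive_rate(false_positives, negatives=BACTERIAL_READS):
    """Fraction of nonviral reads assigned to the virus (assumed denominator)."""
    return false_positives / negatives


# Counts taken from the first and last rows of Table 1.
herpesvirus_sensitivity = sensitivity(9_910)
herpesvirus_fpr = false_positive_rate(1_345)
crassphage_sensitivity = sensitivity(9_765)
crassphage_fpr = false_positive_rate(4_578)
```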


Discussion

Virome sequencing is the premier approach to evaluate the viral content of both host-derived and environmental samples. It is useful for determining what types of viruses are present in individual samples and how virome compositions compare between sample groups. This information forms the foundations for answering a wide array of interesting biological questions. For example, virome composition has recently been analyzed as an indicator of the microbial impacts of climate change [107,108]. Virome sequencing was also critical for the discovery and characterization of SARS-CoV-2 in 2019 [109]. Effective characterization of virome sequencing data requires rigorous and integrated software platforms to facilitate and accelerate virus discovery and virome compositional analysis. Given these tools, researchers will be better prepared to assess how viruses are associated with some of the most important challenges to human life today.

All virome studies are dependent on effective computational tools to identify and classify viral reads or assembled contigs within a metagenome. Viral metagenomics is often dependent on identifying sequence similarity against reference sequence databases, either directly via homology-based searches or using machine learning techniques that have been trained on reference databases to identify features unique to viral sequences (reviewed in [110]). Homology-based searches can take a “brute-force” approach, wherein all unclassified sequences are queried against a comprehensive, multikingdom reference sequence database (e.g., NCBI nt or nr). This approach relies on the search algorithm (e.g., BLAST, DIAMOND [48]) to pick the best hit or the lowest common ancestor of a group of hits to provide a final taxonomic assignment for an unknown query sequence. This approach is slow and requires significant computational resources, which is why Hecatomb takes an alternate approach. First, it captures all “potentially viral” sequences by initially querying a small viral sequence database. The “potentially viral” sequences typically represent a fraction of the full metagenomic data, making subsequent computation more tractable. To confirm viral taxonomic assignment, potentially viral sequences are then cross-checked against a small, curated transkingdom reference database containing genomic representatives from all kingdoms of life. Hecatomb completes this iterative search approach using translated searches against amino acid databases as well as untranslated searches against nucleotide databases, combining the results of each to ensure detection of viral sequences is database independent. This iterative search strategy uses databases orders of magnitude smaller than comprehensive, multikingdom databases (such as NCBI’s nt and nr), increasing computational efficiency without limiting viral detection.
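The control flow of this two-stage strategy can be sketched as follows. This is a simplified illustration and not Hecatomb's implementation. The function name, the dictionaries standing in for search results, and the "Viruses" label are hypothetical, and treating sequences without a viral best hit in the second stage as rejected is an assumption of the sketch.

```python
def iterative_search(sequence_ids, primary_hits, secondary_hits):
    """Two-stage filter over precomputed search results (hypothetical).

    primary_hits: sequence id -> taxon of its hit in the small viral database.
    secondary_hits: sequence id -> kingdom label of its best hit in the
        transkingdom database.
    """
    # Stage 1: anything with a hit in the viral database is "potentially viral".
    potentially_viral = [s for s in sequence_ids if s in primary_hits]

    # Stage 2: only the much smaller candidate set is cross-checked.
    confirmed, rejected = [], []
    for seq_id in potentially_viral:
        if secondary_hits.get(seq_id) == "Viruses":
            confirmed.append(seq_id)
        else:
            rejected.append(seq_id)  # assumption: no viral best hit -> rejected
    return potentially_viral, confirmed, rejected


# Made-up search results for three reads.
reads = ["read_1", "read_2", "read_3"]
primary = {"read_1": "Microviridae", "read_2": "Mimiviridae"}
secondary = {"read_1": "Viruses", "read_2": "Bacteria"}
candidates, confirmed, rejected = iterative_search(reads, primary, secondary)
```

The efficiency gain described in the text comes from the second, broader search only ever seeing the candidates that survived the first stage.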

Hecatomb’s design philosophy recognizes that there are no “perfect” databases or search algorithms. Both the brute-force and iterative search approaches against comprehensive or curated databases will result in different rates of true/false positives/negatives. Rather than relying on any single search to be definitive, Hecatomb provides a compiled and rich set of data for search result evaluation. We used this strategy to reassess the virome composition of SIV-infected and uninfected rhesus macaques [5]. The original study used an iterative approach but relied on comprehensive, transkingdom databases (NCBI nt and nr) and identified associations between 4 families of animal viruses (Circoviridae, Picornaviridae, Adenoviridae, and Parvoviridae) and SIV infection. The new Hecatomb transkingdom database is 6 orders of magnitude smaller than GenBank nt (5.0 × 10⁶ versus 1.3 × 10¹²), which results in a significant reduction in computational time and resources. Hecatomb identified the same 4 viral families and their relationship to SIV-mediated disease. Similar to our analysis of these samples using Hecatomb, the original study also classified a number of sequences to the Mimiviridae and Phycodnaviridae. Statistical comparison of these sequences between groups (e.g., SIV infected vs. uninfected) did not reveal any significant associations, and thus they were not discussed further. However, new evaluation of results from Hecatomb indicates that these were likely false-positive classifications in the original analysis. Lastly, we identified 121 novel complete phage genomes in this dataset. The majority of these genomes were Microviridae, which was the most abundant family by read abundance in the dataset. Recovering these genomes with a provisional taxonomic classification using Hecatomb’s annotations is simple, fast, and scalable. This method is not a replacement for culturing and characterizing viruses in a lab environment. However, these in silico methods are essential when considering the scale of the viral dark matter problem. This reanalysis highlights how coordinated data such as alignment statistics and taxonomy can be powerful tools for virome evaluation and novel virus discovery.

Hecatomb was able to effectively evaluate environmental (nonhost-associated) viromes. Leveraging the R packages phyloseq and microViz allowed us to quickly and easily complete the analysis in approximately 200 lines of code [66,67]. This analysis was primarily designed to identify compositional changes in viromes between reef types (inner or outer) and between coral mucus and the surrounding water from a previously published metagenomic data set [99,100]. The original study identified elevated levels of Pelagibacter, Synechococcus, and unclassified Rickettsiales in inner reef samples compared to outer reef samples. Indeed, we found Synechococcus phages and cyanophages were important for distinguishing the highly fluctuating inner reef system from the thermally stable outer reef. The authors showed that the bacterial microbiomes were unique for inner and outer reefs for both coral mucus and reef water samples. We applied the same method to the viral compositions and confirmed that the viromes for these 4 sample types are also all unique.

Interestingly, viral species richness and diversity were significantly elevated in inner reef samples, whereas the original study found these to be similar in terms of bacterial composition. Viral activity is an important vehicle for nutrient cycling, which was thought to be much higher in the inner reef based on the metabolic profiles of these samples [99]. There were many viral taxa that were more abundant in the inner reef samples and few that were elevated in the outer reef. Similarly, many viruses were more abundant in reef water samples than coral mucosa, with the main exception of the giant viruses from the family Mimiviridae. It would be interesting to elucidate whether the coral mucus is impacting viral infectivity directly or if the phages are switching to a lysogenic rather than lytic life cycle in this environment.

Corals occasionally shed their mucosa. Shedding occurs far more frequently in the inner reef due to stressors such as thermal fluctuation and sedimentation from surface runoff. The increased flux of nutrients and microbes from corals to the surrounding reef water may be contributing to increased microbial and viral activity in inner reef samples compared to the outer reef. Different concentrations of microbes might also be having an impact. The outer reef systems are subject to upwelling, resulting in greater exchange of water with the open ocean, which is probably flushing microbes and viruses from the environment. Unfortunately, it is not possible to infer microbial concentrations from WGS sequencing. The reanalysis of this coral dataset has generated many new hypotheses about viral host interactions within a coral reef system.

Potential implications

Virome analysis is complex and requires efficient computational tools to generate analyst-friendly results. Hecatomb provides a comprehensive and computationally efficient solution for both read- and assembly-based viral annotation, virome analysis, and novel virus discovery. The pipeline is delivered with a convenient and easy-to-use front end and is compatible with different sequencing technologies. Hecatomb’s comprehensive collection of data throughout the pipeline’s execution, in particular the collection of alignment statistics, empowers the identification and interrogation of viral taxonomic assignments. We demonstrate Hecatomb’s utility for rapid processing and analysis of viral metagenomes with a well-studied validation gut viral metagenome dataset. We also demonstrate its utility for mining regular metagenome samples for virome analysis by analyzing an existing environmental dataset. Virome analysis is an evolving field requiring novel approaches for comprehensive characterization. For example, none of the described workflows account for spliced coding sequences (CDS), hampering the automation of viral genome annotation. The modularity of Hecatomb’s approach should enable easy integration of novel methods as they emerge.

Methods

All commands used for analyzing the Hecatomb annotations are available as a gist on GitHub [111].

Reevaluation of the SIV dataset

We reanalyzed a previously published data set of 95 samples obtained from stool samples collected from SIV-infected rhesus macaques (Macaca mulatta) (NCBI BioProject accession: PRJEB9503) [5]. Sequence data were generated using the Illumina MiSeq 2 × 250 paired-end protocol on libraries of total nucleic acid (DNA and cDNA to enable detection of both RNA and DNA viruses). Hecatomb was executed using the round A/B preprocessing module and otherwise default parameters. Data were analyzed in R with Tidyverse [64]; commands are available in the above GitHub gist.

Reevaluation of coral microbiomes

We reanalyzed a coral reef dataset (NCBI BioProject accession: PRJNA595374, SRA study number SRP237459) [99,100] of WGS metagenomic sequencing (Illumina MiSeq, paired 2 × 250) of both seawater and coral mucus from inner and outer sections of a Bermuda reef system. Hecatomb was run with fast search parameters, cross-assembly, and otherwise default parameters. Data were analyzed in R with phyloseq [66] and microViz [67]; commands are available in the above GitHub gist.

Identification of phage genomes from Hecatomb assemblies

To identify complete phage genomes, the assembly graphs created by Hecatomb were processed with Phables [61]. The predicted phage genomes were assessed with CheckV [101]. High-quality and complete phage genomes were assigned provisional taxonomic annotations using MMseqs2 with the Hecatomb viral amino acid database (easy-taxonomy) and viral nucleotide database (easy-search plus TaxonKit). Lastly, the genomes were annotated with Pharokka [102].

Supplementary Material

giae020_GIGA-D-23-00206_Original_Submission

giae020_giga-d-23-00206_original_submission.pdf (8.8MB, pdf)

giae020_GIGA-D-23-00206_Revision_1

giae020_giga-d-23-00206_revision_1.pdf (7.8MB, pdf)

giae020_GIGA-D-23-00206_Revision_2

giae020_giga-d-23-00206_revision_2.pdf (7.4MB, pdf)

giae020_Response_to_Reviewer_Comments_Original_Submission

giae020_response_to_reviewer_comments_original_submission.pdf (70.4KB, pdf)

giae020_Response_to_Reviewer_Comments_Revision_1

giae020_response_to_reviewer_comments_revision_1.pdf (45.3KB, pdf)

giae020_Reviewer_1_Report_Original_Submission

Arvind Varsani -- 9/8/2023 Reviewed

giae020_reviewer_1_report_original_submission.pdf (121KB, pdf)

giae020_Reviewer_2_Report_Original_Submission

Satoshi Hiraoka -- 9/10/2023 Reviewed

giae020_reviewer_2_report_original_submission.pdf (131KB, pdf)

giae020_Reviewer_2_Report_Revision_1

Satoshi Hiraoka -- 2/2/2024 Reviewed

giae020_reviewer_2_report_revision_1.pdf (108.8KB, pdf)

giae020_Supplemental_Files

giae020_supplemental_files.zip (37.8KB, zip)

Acknowledgement

The authors thank Chandni Desai and Barry Hykes for their thoughtful commentary regarding the design philosophy of Hecatomb and Sarah Giles, Susie Grigson, Bhavya Papudeshi, Vijini Mallawaarachchi, and Laura Inglis for feedback on the manuscript. The support provided by Flinders University for HPC research resources is acknowledged.

Contributor Information

Michael J Roach, Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia; Adelaide Centre for Epigenetics, University of Adelaide, Adelaide, SA, 5005, Australia; South Australian Immunogenomics Cancer Institute, University of Adelaide, Adelaide, SA, 5005, Australia.

Sarah J Beecroft, Harry Perkins Institute of Medical Research, Perth, WA, 6009, Australia.

Kathie A Mihindukulasuriya, Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA; The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA.

Leran Wang, Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA; The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA.

Anne Paredes, Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA.

Luis Alberto Chica Cárdenas, Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA; The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA.

Kara Henry-Cocks, Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia.

Lais Farias Oliveira Lima, Biology Department, San Diego State University, San Diego, CA, 92182, USA.

Elizabeth A Dinsdale, Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia.

Robert A Edwards, Flinders Accelerator for Microbiome Exploration, Flinders University, Adelaide, SA, Australia.

Scott A Handley, Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, 63110, USA; The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA.

Additional Files

Supplementary Fig. S1. Features and usability of popular viral metagenomics software as of March 2023. “Packaged install” indicates that the software is installable from any package manager or Apptainer. Software is considered “recently updated” if it has been updated within the last 12 months. Software has been selected from a community-driven compilation of viral bioinformatics tools (an archived version is available at 10.5281/zenodo.6388251).

Supplementary Fig. S2. Implementation of Hecatomb using Snaketool, Snakemake, and Conda. Hecatomb takes in command line arguments, data, and configuration parameters and outputs both results for analysis and run information. Hecatomb interacts with the job scheduler in high-performance computing (HPC) environments. Hecatomb distributes individual tasks to the job queue. Command line arguments, gray; files, yellow; Conda environments, blue; scripts/programs, green; workload manager, pink.

Supplementary Fig. S3. Taxonomic subsets of virus types. Viral families present in the 95-sample SIV reanalysis study: (A) plant viruses and (B) protist viruses.

Supplementary Fig. S4. Sequence per quadrant evaluation. Percentage of reads per quadrant in Fig. 4B. (A) Translated (aa reference database) and (B) untranslated (nt reference database).

Supplementary Fig. S5. Alignments of bacteriophage sequences from rhesus macaque stool viromes. Alignment lengths and percent identities are shown in separate plots for the bacteriophage orders Caudovirales, Levivirales, Petitvirales, and Tubulavirales. Alignments are colored by viral family.

Availability of Source Code and Requirements

Project name: Hecatomb

Project homepage: github.com/shandley/hecatomb [88]

Project documentation: hecatomb.readthedocs.io [112]

Operating system: Linux

Programming language: Python

Other requirements: Conda or pip

License: MIT

RRID: SCR_025002

Bio.tools ID: hecatomb

Restrictions to use by nonacademics: None

Abbreviations

AIDS: acquired immunodeficiency syndrome; ANOVA: analysis of variance; CPM: counts per million; FPKM: fragments per kilobase million; HPC: high-performance computing; ICTV: International Committee on Taxonomy of Viruses; LCA: lowest common ancestor; NCBI: National Center for Biotechnology Information; PERMANOVA: permutational multivariate analysis of variance; PCoA: principal coordinate analysis; RPKM: reads per kilobase million; SIMPER: similarity percentage; SIV: simian immunodeficiency virus; SPM: sequences per million; WGS: whole-genome shotgun.

Author Contributions

Conceptualization and methodology: M.J.R., R.A.E., S.A.H.; software and validation: M.J.R., S.J.B., K.H.-C., R.A.E., L.A.C.C., S.A.H.; formal analysis: M.J.R., L.F.O.L., R.A.E., E.A.D., L.A.C.C., S.A.H.; investigation: M.J.R., S.A.H.; visualization: M.J.R., K.A.M., L.W., A.P., S.A.H.; writing – original draft: M.J.R., R.A.E., S.A.H.; supervision, project administration, and funding acquisition: R.A.E., S.A.H.

Funding

Research reported in this publication was supported by grants from the National Institutes of Health (RC2 DK116713 and U01 AI151810) awarded to R.A.E. and S.A.H. M.J.R. was supported by Flinders University under an Impact Seed Funding for Early Career Researchers grant.

Data availability

The reanalysis with Hecatomb utilized preexisting datasets, which are available under the NCBI BioProject accessions PRJEB9503 for the macaque SIV dataset [5] and PRJNA595374 (SRA study number SRP237459) for the coral reef dataset [99,100]. Accessions for novel phage genomes identified in this study are available in Supplementary Table S3. An archival copy of the code and supporting data is also available via the GigaScience repository, GigaDB [113]. Hecatomb is registered on WorkflowHub [114].

References

1. Hendrix RW, Smith MC, Burns RN, et al. Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phage. Proc Natl Acad Sci USA. 1999; ;96:2192–97. 10.1073/pnas.96.5.2192. [DOI] [PMC free article] [PubMed] [Google Scholar]
2. Koonin EV, Dolja VV, Krupovic M, et al. Global organization and proposed megataxonomy of the virus world. Microbiol Mol Biol Rev. 2020;84:1–33. 10.1128/MMBR.00061-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
3. Kim AH, Armah G, Dennis F, et al. Enteric virome negatively affects seroconversion following oral rotavirus vaccination in a longitudinally sampled cohort of Ghanaian infants. Cell Host Microbe. 2022;30:110–23.e5. 10.1016/j.chom.2021.12.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Maqsood R, Rodgers R, Rodriguez C, et al. Discordant transmission of bacteria and viruses from mothers to babies at birth. Microbiome. 2019;7:156. [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Handley SA, Desai C, Zhao G et al. SIV infection-mediated changes in gastrointestinal bacterial microbiome and virome are associated with immunodeficiency and prevented by vaccination. Cell Host Microbe. 2016;19:323–35. 10.1016/j.chom.2016.02.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
6. Norman JM, Handley SA, Baldridge MT, et al. Disease-specific alterations in the enteric virome in inflammatory bowel disease. Cell. 2015;160:447–460. 10.1016/j.cell.2015.01.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Neri U, Wolf YI, Roux S, et al. Expansion of the global RNA virome reveals diverse clades of bacteriophages. Cell. 2022;185:4023–37. 10.1016/j.cell.2022.08.023. [DOI] [PubMed] [Google Scholar]
8. Zayed AA, Wainaina JM, Dominguez-Huerta G, et al. Cryptic and abundant marine viruses at the evolutionary origins of Earth's RNA virome. Science. 2022;376:156–62. 10.1126/science.abm5847. [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Williamson SJ, Allen LZ, Lorenzi HA, et al. Metagenomic exploration of viruses throughout the Indian Ocean. PLoS One. 2012;7:e42047. [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Yang K, Wang X, Hou R, et al. Rhizosphere phage communities drive soil suppressiveness to bacterial wilt disease. Microbiome. 2012;11:16. [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Pastrana DV, Peretti A, Welch NL, et al. Metagenomic discovery of 83 new human papillomavirus types in patients with immunodeficiency. mSphere. 2018;3:e00645–18. 10.1128/mSphereDirect.00645-18.
12. Dutilh BE, Cassman N, McNair K, et al. A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat Commun. 2014;5:1–11.
13. Dai Z, Wang H, Wu H, et al. Parvovirus dark matter in the cloaca of wild birds. Gigascience. 2022;12:giad001. 10.1093/gigascience/giad001.
14. Krishnamurthy SR, Wang D. Origins and challenges of viral dark matter. Virus Res. 2017;239:136–42. 10.1016/j.virusres.2017.02.002.
15. Pargin E, Roach MJ, Skye A, et al. The human gut virome: composition, colonization, interactions, and impacts on human health. Front Microbiol. 2023;14:963173.
16. Rosseel T, Pardon B, De Clercq K, et al. False-positive results in metagenomic virus discovery: a strong case for follow-up diagnosis. Transbound Emerg Dis. 2014;61:293–99. 10.1111/tbed.12251.
17. Skewes-Cox P, Sharpton TJ, Pollard KS, et al. Profile hidden Markov models for the detection of viruses within metagenomic sequence data. PLoS One. 2014;9:e105067.
18. Ponsero AJ, Hurwitz BL. The promises and pitfalls of machine learning for detecting viruses in aquatic metagenomes. Front Microbiol. 2019;10:806.
19. Bai Z, Zhang Y-Z, Miyano S, et al. Identification of bacteriophage genome sequences with representation learning. Bioinformatics. 2022;38:4264–70. 10.1093/bioinformatics/btac509.
20. Pandolfo M, Telatin A, Lazzari G, et al. MetaPhage: an automated pipeline for analyzing, annotating, and classifying bacteriophages in metagenomics sequencing data. mSystems. 2022;7:e0074122.
21. Miao Y, Liu F, Hou T, et al. Virtifier: a deep learning-based identifier for viral sequences from metagenomes. Bioinformatics. 2022;38:1216–22. 10.1093/bioinformatics/btab845.
22. Marquet M, Hölzer M, Pletz MW, et al. What the Phage: a scalable workflow for the identification and analysis of phage sequences. Gigascience. 2022;11:giac110. 10.1093/gigascience/giac110.
23. Guo J, Bolduc B, Zayed AA, et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome. 2021;9:37.
24. Tisza MJ, Pastrana DV, Welch NL, et al. Discovery of several thousand highly diverse circular DNA viruses. eLife. 2020;9. 10.7554/eLife.51971.
25. Ren J, Song K, Deng C, et al. Identifying viruses from metagenomic data using deep learning. Quant Biol. 2020;8:64–77. 10.1007/s40484-019-0187-4.
26. Plyusnin I, Kant R, Jääskeläinen AJ, et al. Novel NGS pipeline for virus discovery from a wide spectrum of hosts and sample types. Virus Evol. 2020;6:veaa091.
27. Auslander N, Gussow AB, Benler S, et al. Seeker: alignment-free identification of bacteriophage genomes by deep learning. Nucleic Acids Res. 2020;48:e121.
28. Kieft K, Zhou Z, Anantharaman K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome. 2020;8:90.
29. Deaton J, Yu FB, Quake SR. Mini-metagenomics and nucleotide composition aid the identification and host association of novel bacteriophage sequences. Adv Biosyst. 2019;3:e1900108.
30. Fang Z, Tan J, Wu S, et al. PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning. Gigascience. 2019;8:giz066. 10.1093/gigascience/giz066.
31. Bin Jang H, Bolduc B, Zablocki O, et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat Biotechnol. 2019;37:632–39. 10.1038/s41587-019-0100-8.
32. Liu Q, Liu F, He J, et al. VFM: identification of bacteriophages from metagenomic bins and contigs based on features related to gene and genome composition. IEEE Access. 2019;7:177529–38. 10.1109/ACCESS.2019.2957833.
33. Tampuu A, Bzhalava Z, Dillner J, et al. Deep learning on raw DNA sequences for identifying viral genomes in human samples. PLoS One. 2019;14:e0222271.
34. Garretto A, Hatzopoulos T, Putonti C. virMine: automated detection of viral sequences from complex metagenomic samples. PeerJ. 2019;7:e6695.
35. Zheng T, Li J, Ni Y, et al. Mining, analyzing, and integrating viral signals from metagenomic data. Microbiome. 2019;7:42.
36. Tithi SS, Aylward FO, Jensen RV, et al. FastViromeExplorer: a pipeline for virus and phage identification and abundance profiling in metagenomics data. PeerJ. 2018;6:e4227.
37. Abdelkareem AO, Khalil MI, Elaraby M, et al. VirNet: deep attention model for viral reads identification. In: 2018 13th International Conference on Computer Engineering and Systems (ICCES). 2018;623–26. 10.1109/ICCES.2018.8639400.
38. Ren J, Ahlgren NA, Lu YY, et al. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome. 2017;5:69.
39. Laffy PW, Wood-Charlson EM, Turaev D, et al. HoloVir: a workflow for investigating the diversity and function of viruses in invertebrate holobionts. Front Microbiol. 2016;7:822.
40. Jurtz VI, Villarroel J, Lund O, et al. MetaPhinder-identifying bacteriophage sequences in metagenomic data sets. PLoS One. 2016;11:e0163111.
41. Li Y, Wang H, Nie K, et al. VIP: an integrated pipeline for metagenomics of virus identification and discovery. Sci Rep. 2016;6:23774.
42. Roux S, Enault F, Hurwitz BL, et al. VirSorter: mining viral signal from microbial genomic data. PeerJ. 2015;3:e985.
43. Tisza MJ, Belford AK, Domínguez-Huerta G, et al. Cenote-Taker 2 democratizes virus discovery and sequence annotation. Virus Evol. 2021;7:veaa100.
44. Camargo AP, Roux S, Schulz F, et al. Identification of mobile genetic elements with geNomad. Nat Biotechnol. 2023. 10.1038/s41587-023-01953-y.
45. Zhao G, Wu G, Lim ES, et al. VirusSeeker, a computational pipeline for virus discovery and virome composition analysis. Virology. 2017;503:21–30. 10.1016/j.virol.2017.01.005.
46. Kalantar KL, Carvalho T, de Bourcy CFA, et al. IDseq—an open source cloud-based pipeline and analysis service for metagenomic pathogen detection and monitoring. Gigascience. 2020;9:giaa111. 10.1093/gigascience/giaa111.
47. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol. 1990;215:403–10. 10.1016/S0022-2836(05)80360-2.
48. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60. 10.1038/nmeth.3176.
49. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026–28. 10.1038/nbt.3988.
50. Shen W, Xiang H, Huang T, et al. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics. 2023;39:btac845. 10.1093/bioinformatics/btac845.
51. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20:257.
52. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 2018;19:198.
53. Kim D, Song L, Breitwieser FP, et al. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26:1721–29. 10.1101/gr.210641.116.
54. Monaco CL, Gootenberg DB, Zhao G, et al. Altered virome and bacterial microbiome in human immunodeficiency virus-associated acquired immunodeficiency syndrome. Cell Host Microbe. 2016;19:311–22. 10.1016/j.chom.2016.02.011.
55. Li D, Luo R, Liu C-M, et al. MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods. 2016;102:3–11. 10.1016/j.ymeth.2016.02.020.
56. Roux S, Emerson JB, Eloe-Fadrosh EA, et al. Benchmarking viromics: an evaluation of metagenome-enabled estimates of viral community composition and diversity. PeerJ. 2017;5:e3817.
57. Nurk S, Meleshko D, Korobeynikov A, et al. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27:824–34. 10.1101/gr.213959.116.
58. Peng Y, Leung HCM, Yiu SM, et al. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28:1420–28. 10.1093/bioinformatics/bts174.
59. Antipov D, Raiko M, Lapidus A, et al. Metaviral SPAdes: assembly of viruses from metagenomic data. Bioinformatics. 2020;36:4126–29. 10.1093/bioinformatics/btaa490.
60. Antipov D, Rayko M, Kolmogorov M, et al. viralFlye: assembling viruses and identifying their hosts from long-read metagenomics data. Genome Biol. 2022;23:57.
61. Mallawaarachchi V, Roach MJ, Decewicz P, et al. Phables: from fragmented assemblies to high-quality bacteriophage genomes. Bioinformatics. 2023;39. 10.1093/bioinformatics/btad586.
62. Ho SFS, Wheeler NE, Millard AD, et al. Gauge your phage: benchmarking of bacteriophage identification tools in metagenomic sequencing data. Microbiome. 2023;11:84. 10.1186/s40168-023-01533-x.
63. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Project for Statistical Computing. 2021. https://www.r-project.org/.
64. Wickham H, Averick M, Bryan J, et al. Welcome to the tidyverse. J Open Source Softw. 2019;4:1686. 10.21105/joss.01686.
65. Wickham H. Ggplot2: Elegant Graphics for Data Analysis. 2nd ed. Cham, Switzerland: Springer International Publishing, 2009. 10.1007/978-0-387-98141-3.
66. McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One. 2013;8:e61217.
67. Barnett D, Arts I, Penders J. microViz: an R package for microbiome data visualization and statistics. J Open Source Softw. 2021;6:3201.
68. Mölder F, Jablonski KP, Letcher B, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33. 10.12688/f1000research.29032.2.
69. Anaconda Software Distribution. Anaconda Inc. https://www.anaconda.com/download.
70. Cochrane GR, Galperin MY. The 2010 Nucleic Acids Research Database Issue and online Database Collection: a community of data resources. Nucleic Acids Res. 2010;38:D1–D4. 10.1093/nar/gkp1077.
71. Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90. 10.1093/bioinformatics/bty560.
72. Finkbeiner SR, Holtz LR, Jiang Y, et al. Human stool contains a previously unrecognized diversity of novel astroviruses. Virol J. 2009;6:161.
73. Bushnell B. BBTools. 2023. https://sourceforge.net/projects/bbmap/.
74. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100. 10.1093/bioinformatics/bty191.
75. NCBI. Viral assembly database. 2024. https://www.ncbi.nlm.nih.gov/assembly/?%20term=viruses. Accessed 2023.
76. Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9:2542.
77. UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49:D480–89.
78. Mirdita M, von den Driesch L, Galiez C, et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45:D170–76. 10.1093/nar/gkw1081.
79. Hingamp P, Grimsley N, Acinas SG, et al. Exploring nucleo-cytoplasmic large DNA viruses in Tara Oceans microbial metagenomes. ISME J. 2013;7:1678–95. 10.1038/ismej.2013.59.
80. Schoch CL, Ciufo S, Domrachev M, et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database. 2020;2020:baaa062. 10.1093/database/baaa062.
81. Shen W, Ren H. TaxonKit: a practical and efficient NCBI taxonomy toolkit. J Genet Genomics. 2021;48:844–50. 10.1016/j.jgg.2021.03.006.
82. Li D, Liu C-M, Luo R, et al. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31:1674–76. 10.1093/bioinformatics/btv033.
83. Koren S, Walenz BP, Berlin K, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27:722–36. 10.1101/gr.215087.116.
84. Kolmogorov M, Yuan J, Lin Y, et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37:540–46. 10.1038/s41587-019-0072-8.
85. Bushnell B. BBMap: A Fast, Accurate, Splice-Aware Aligner. Berkeley, CA: Lawrence Berkeley National Lab; 2014. Report No.: LBNL-7065E.
86. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinf. 2011;12:323.
87. Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 2012;131:281–85. 10.1007/s12064-012-0162-3.
88. Roach MJ, Beecroft SJ, Mihindukulasuriya KA, et al. Hecatomb @ GitHub. 2020. https://github.com/shandley/hecatomb. Accessed 2024.
89. Roach M. Hecatomb @ Bioconda. 2021. https://anaconda.org/bioconda/hecatomb. Accessed 2024.
90. Roach M. Hecatomb @ PyPI. 2023. https://pypi.org/project/hecatomb/. Accessed 2024.
91. Köster J, Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–22. 10.1093/bioinformatics/bts480.
92. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–79. 10.1093/bioinformatics/btp352.
93. Shen W, Le S, Li Y, et al. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One. 2016;11:e0163962.
94. Roach MJ, Hart BJ, Beecroft SJ, et al. Koverage: read-coverage analysis for massive (meta)genomics datasets. J Open Source Softw. 2024;9:6235.
95. Roach MJ, Tessa Pierce-Ward N, Suchecki R, et al. Ten simple rules and a template for creating workflows-as-applications. PLoS Comput Biol. 2022;e1010705. 10.1371/journal.pcbi.1010705.
96. Greenfeld AR, Greenfeld DR, Pierzina R, et al. Cookiecutter: a command-line utility that creates projects from cookiecutter project templates. 2013. https://github.com/cookiecutter/cookiecutter. Accessed 2024.
97. Köster J. Snakemake profiles. 2017. https://github.com/Snakemake-Profiles/doc. Accessed 2024.
98. Sun T-W, Yang C-L, Kao T-T, et al. Host range and coding potential of eukaryotic giant viruses. Viruses. 2020;12:1337. 10.3390/v12111337.
99. Lima LFO, Alker A, Papudeshi B, et al. Coral and Seawater Metagenomes Reveal Key Microbial Functions to Coral Health and Ecosystem Functioning Shaped at Reef Scale. Microb Ecol. 2021. 10.1007/s00248-022-02094-6.
100. Lima LFO, Weissman M, Reed M, et al. Modeling of the coral microbiome: the influence of temperature and microbial network. mBio. 2020;11. 10.1128/mBio.02691-19.
101. Nayfach S, Camargo AP, Schulz F, et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol. 2021;39:578–85. 10.1038/s41587-020-00774-7.
102. Bouras G, Nepal R, Houtak G, et al. Pharokka: a fast scalable bacteriophage annotation tool. Bioinformatics. 2023;39:btac776. 10.1093/bioinformatics/btac776.
103. Routh A, Ordoukhanian P, Johnson JE. Nucleotide-resolution profiling of RNA recombination in the encapsidated genome of a eukaryotic RNA virus by next-generation sequencing. J Mol Biol. 2012;424:257–69. 10.1016/j.jmb.2012.10.005.
104. Silva JM, Pratas D, Caetano T, et al. The complexity landscape of viral genomes. Gigascience. 2022;11:giac079. 10.1093/gigascience/giac079.
105. Kang HS, McNair K, Cuevas DA, et al. Prophage genomics reveals patterns in phage genome organization and replication. Biorxiv. 2017. 10.1101/114819.
106. Gourlé H, Karlsson-Lindsjö O, Hayer J, et al. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics. 2019;35:521–22. 10.1093/bioinformatics/bty630.
107. Zhong Z-P, Vik D, Rapp J, et al. Lower viral evolutionary pressure under stable versus fluctuating conditions in subzero Arctic brines. Microbiome. 2023. 10.1186/s40168-023-01619-6.
108. Han L-L, Yu D-T, Bi L, et al. Distribution of soil viruses across China and their potential role in phosphorous metabolism. Environ Microbiome. 2022;17:6. 10.1186/s40793-022-00401-9.
109. Zhu N, Zhang D, Wang W, et al. A novel coronavirus from patients with pneumonia in China. N Engl J Med. 2020;382:727–33. 10.1056/NEJMoa2001017.
110. Kieft K, Anantharaman K. Virus genomics: what is being overlooked? Curr Opin Virol. 2022;53:101200.
111. Roach M, Handley S. Reanalysis of Hecatomb datasets. GitHub. 2022. https://gist.github.com/beardymcjohnface/3d3245b2bf6d9544c524f412037d5065.
112. Roach M, Handley S, Henry-Cocks K. Hecatomb @ ReadTheDocs. 2022. https://hecatomb.readthedocs.io/en/latest/.
113. Michael RJ, Sarah BJ, Kathie MA, et al. Supporting data for “Hecatomb: An Integrated Software Platform for Viral Metagenomics.” GigaScience Database. 2024. 10.5524/102506.
114. Roach M. Hecatomb. WorkflowHub. 2024. 10.48546/WORKFLOWHUB.WORKFLOW.235.1.

Associated Data

Data Citations

Michael RJ, Sarah BJ, Kathie MA, et al. Supporting data for “Hecatomb: An Integrated Software Platform for Viral Metagenomics.” GigaScience Database. 2024. 10.5524/102506.

Supplementary Materials

giae020_GIGA-D-23-00206_Original_Submission

giae020_giga-d-23-00206_original_submission.pdf (8.8MB, pdf)

giae020_GIGA-D-23-00206_Revision_1

giae020_giga-d-23-00206_revision_1.pdf (7.8MB, pdf)

giae020_GIGA-D-23-00206_Revision_2

giae020_giga-d-23-00206_revision_2.pdf (7.4MB, pdf)

giae020_Response_to_Reviewer_Comments_Original_Submission

giae020_response_to_reviewer_comments_original_submission.pdf (70.4KB, pdf)

giae020_Response_to_Reviewer_Comments_Revision_1

giae020_response_to_reviewer_comments_revision_1.pdf (45.3KB, pdf)

giae020_Reviewer_1_Report_Original_Submission

Arvind Varsani -- 9/8/2023 Reviewed

giae020_reviewer_1_report_original_submission.pdf (121KB, pdf)

giae020_Reviewer_2_Report_Original_Submission

Satoshi Hiraoka -- 9/10/2023 Reviewed

giae020_reviewer_2_report_original_submission.pdf (131KB, pdf)

giae020_Reviewer_2_Report_Revision_1

Satoshi Hiraoka -- 2/2/2024 Reviewed

giae020_reviewer_2_report_revision_1.pdf (108.8KB, pdf)

giae020_Supplemental_Files

giae020_supplemental_files.zip (37.8KB, zip)

Data Availability Statement

[bib1] 1. Hendrix RW, Smith MC, Burns RN, et al. Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phage. Proc Natl Acad Sci USA. 1999; ;96:2192–97. 10.1073/pnas.96.5.2192. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] 2. Koonin EV, Dolja VV, Krupovic M, et al. Global organization and proposed megataxonomy of the virus world. Microbiol Mol Biol Rev. 2020;84:1–33. 10.1128/MMBR.00061-19. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] 3. Kim AH, Armah G, Dennis F, et al. Enteric virome negatively affects seroconversion following oral rotavirus vaccination in a longitudinally sampled cohort of Ghanaian infants. Cell Host Microbe. 2022;30:110–23.e5. 10.1016/j.chom.2021.12.002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] 4. Maqsood R, Rodgers R, Rodriguez C, et al. Discordant transmission of bacteria and viruses from mothers to babies at birth. Microbiome. 2019;7:156. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] 5. Handley SA, Desai C, Zhao G et al. SIV infection-mediated changes in gastrointestinal bacterial microbiome and virome are associated with immunodeficiency and prevented by vaccination. Cell Host Microbe. 2016;19:323–35. 10.1016/j.chom.2016.02.010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] 6. Norman JM, Handley SA, Baldridge MT, et al. Disease-specific alterations in the enteric virome in inflammatory bowel disease. Cell. 2015;160:447–460. 10.1016/j.cell.2015.01.002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] 7. Neri U, Wolf YI, Roux S, et al. Expansion of the global RNA virome reveals diverse clades of bacteriophages. Cell. 2022;185:4023–37. 10.1016/j.cell.2022.08.023. [DOI] [PubMed] [Google Scholar]

[bib8] 8. Zayed AA, Wainaina JM, Dominguez-Huerta G, et al. Cryptic and abundant marine viruses at the evolutionary origins of Earth's RNA virome. Science. 2022;376:156–62. 10.1126/science.abm5847. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] 9. Williamson SJ, Allen LZ, Lorenzi HA, et al. Metagenomic exploration of viruses throughout the Indian Ocean. PLoS One. 2012;7:e42047. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] 10. Yang K, Wang X, Hou R, et al. Rhizosphere phage communities drive soil suppressiveness to bacterial wilt disease. Microbiome. 2012;11:16. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] 11. Pastrana DV, Peretti A, Welch NL et al. Metagenomic discovery of 83 new human papillomavirus types in patients with immunodeficiency. mSphere. 2018;3:e00645–18. 10.1128/mSphereDirect.00645-18. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] 12. Dutilh BE, Cassman N, McNair K et al. A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat Commun. 2014;5:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] 13. Dai Z, Wang H, Wu H et al. Parvovirus dark matter in the cloaca of wild birds. Gigascience. 2022;12:giad001. 10.1093/gigascience/giad001. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] 14. Krishnamurthy SR, Wang D. Origins and challenges of viral dark matter. Virus Res. 2017;239:136–42. 10.1016/j.virusres.2017.02.002. [DOI] [PubMed] [Google Scholar]

[bib15] 15. Pargin E, Roach MJ, Skye A, et al. The human gut virome: composition, colonization, interactions, and impacts on human health. Front Microbiol. 2023;14:963173. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] 16. Rosseel T, Pardon B, De Clercq K et al. False-positive results in metagenomic virus discovery: a strong case for follow-up diagnosis. Transbound Emerg Dis. 2014;61:293–99. 10.1111/tbed.12251. [DOI] [PubMed] [Google Scholar]

[bib17] 17. Skewes-Cox P, Sharpton TJ, Pollard KS et al. Profile hidden Markov models for the detection of viruses within metagenomic sequence data. PLoS One. 2014;9:e105067. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib18] 18. Ponsero AJ, Hurwitz BL. The promises and pitfalls of machine learning for detecting viruses in aquatic metagenomes. Front Microbiol. 2019;10:806. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] 19. Bai Z, Zhang Y-Z, Miyano S, et al. Identification of bacteriophage genome sequences with representation learning. Bioinformatics. 2022;38:4264–70. 10.1093/bioinformatics/btac509. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] 20. Pandolfo M, Telatin A, Lazzari G et al. MetaPhage: an automated pipeline for analyzing, annotating, and classifying bacteriophages in metagenomics sequencing data. mSystems2022;7:e0074122. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib21] 21. Miao Y, Liu F, Hou T, et al. Virtifier: a deep learning-based identifier for viral sequences from metagenomes. Bioinformatics. 2022;38:1216–22. 10.1093/bioinformatics/btab845. [DOI] [PubMed] [Google Scholar]

[bib22] 22. Marquet M, Hölzer M, Pletz MW et al. What the Phage: a scalable workflow for the identification and analysis of phage sequences. Gigascience. 2022;11:giac110. 10.1093/gigascience/giac110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] 23. Guo J, Bolduc B, Zayed AA, et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome. 2021;9:37. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] 24. Tisza MJ, Pastrana DV, Welch NL et al. Discovery of several thousand highly diverse circular DNA viruses. eLife. 2020;; 9. 10.7554/eLife.51971. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib25] 25. Ren J, Song K, Deng C, et al. Identifying viruses from metagenomic data using deep learning. Quant Biol. 2020;8:64–77. 10.1007/s40484-019-0187-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib26] 26. Plyusnin I, Kant R, Jääskeläinen AJ, et al. Novel NGS pipeline for virus discovery from a wide spectrum of hosts and sample types. Virus Evol. 2020;6:veaa091. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib27] 27. Auslander N, Gussow AB, Benler S, et al. Seeker: alignment-free identification of bacteriophage genomes by deep learning. Nucleic Acids Res. 2020; 48:e121. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib28] 28. Kieft K, Zhou Z, Anantharaman K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome. 2020;8:90. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib29] 29. Deaton J, Yu FB, Quake SR. Mini-metagenomics and nucleotide composition aid the identification and host association of novel bacteriophage sequences. Adv Biosyst. 2019;3:e1900108. [DOI] [PubMed] [Google Scholar]

[bib30] 30. Fang Z, Tan J, Wu S, et al. PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning. Gigascience. 2019;8:giz066. 10.1093/gigascience/giz066. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib31] 31. Bin Jang H, Bolduc B, Zablocki O, et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat Biotechnol. 2019;37:632–39. 10.1038/s41587-019-0100-8. [DOI] [PubMed] [Google Scholar]

[bib32] 32. Liu Q, Liu F, He J, et al. VFM: identification of bacteriophages from metagenomic bins and contigs based on features related to gene and genome composition. IEEE Access. 2012;7:177529–38. 10.1109/ACCESS.2019.2957833. [DOI] [Google Scholar]

[bib33] 33. Tampuu A, Bzhalava Z, Dillner J, et al. Deep learning on raw DNA sequences for identifying viral genomes in human samples. PLoS One. 2019;14:e0222271. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib34] 34. Garretto A, Hatzopoulos T, Putonti C. virMine: automated detection of viral sequences from complex metagenomic samples. PeerJ. 2019;7:e6695. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib35] 35. Zheng T, Li J, Ni Y et al. Mining, analyzing, and integrating viral signals from metagenomic data. Microbiome2019;7:42. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib36] 36. Tithi SS, Aylward FO, Jensen RV et al. FastViromeExplorer: a pipeline for virus and phage identification and abundance profiling in metagenomics data. PeerJ. 2018;6:e4227. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib37] 37. Abdelkareem AO, Khalil MI, Elaraby M et al. VirNet: deep attention model for viral reads identification. In: 2018 13th International Conference on Computer Engineering and Systems (ICCES). 2018;623–26. 10.1109/ICCES.2018.8639400. [DOI]

[bib38] 38. Ren J, Ahlgren NA, Lu YY et al. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome. 2017;5:69. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib39] 39. Laffy PW, Wood-Charlson EM, Turaev D et al. HoloVir: a workflow for investigating the diversity and function of viruses in invertebrate holobionts. Front Microbiol2016;7:822. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib40] 40. Jurtz VI, Villarroel J, Lund O, et al. MetaPhinder-identifying bacteriophage sequences in metagenomic data sets. PLoS One. 2016;11:e0163111. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib41] 41. Li Y, Wang H, Nie K et al. VIP: an integrated pipeline for metagenomics of virus identification and discovery. Sci Rep. 2016;6:23774. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib42] 42. Roux S, Enault F, Hurwitz BL et al. VirSorter: mining viral signal from microbial genomic data. PeerJ. 2015;3:e985. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib43] 43. Tisza MJ, Belford AK, Domínguez-Huerta G et al. Cenote-Taker 2 democratizes virus discovery and sequence annotation. Virus Evol. 2021;7:veaa100. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib44] 44. Camargo AP, Roux S, Schulz F et al. Identification of mobile genetic elements with geNomad. Nat Biotechnol. 2023.; 1546–696. 10.1038/s41587-023-01953-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib45] 45. Zhao G, Wu G, Lim ES, et al. VirusSeeker, a computational pipeline for virus discovery and virome composition analysis. Virology. 2017;503:21–30. 10.1016/j.virol.2017.01.005.

[bib46] 46. Kalantar KL, Carvalho T, de Bourcy CFA, et al. IDseq—an open source cloud-based pipeline and analysis service for metagenomic pathogen detection and monitoring. Gigascience. 2020;9:giaa111. 10.1093/gigascience/giaa111.

[bib47] 47. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol. 1990;215:403–10. 10.1016/S0022-2836(05)80360-2.

[bib48] 48. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12:59–60. 10.1038/nmeth.3176.

[bib49] 49. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026–28. 10.1038/nbt.3988.

[bib50] 50. Shen W, Xiang H, Huang T, et al. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics. 2023;39:btac845. 10.1093/bioinformatics/btac845.

[bib51] 51. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20:257.

[bib52] 52. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 2018;19:198.

[bib53] 53. Kim D, Song L, Breitwieser FP, et al. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26:1721–29. 10.1101/gr.210641.116.

[bib54] 54. Monaco CL, Gootenberg DB, Zhao G, et al. Altered virome and bacterial microbiome in human immunodeficiency virus-associated acquired immunodeficiency syndrome. Cell Host Microbe. 2016;19:311–22. 10.1016/j.chom.2016.02.011.

[bib55] 55. Li D, Luo R, Liu C-M, et al. MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods. 2016;102:3–11. 10.1016/j.ymeth.2016.02.020.

[bib56] 56. Roux S, Emerson JB, Eloe-Fadrosh EA, et al. Benchmarking viromics: an evaluation of metagenome-enabled estimates of viral community composition and diversity. PeerJ. 2017;5:e3817.

[bib57] 57. Nurk S, Meleshko D, Korobeynikov A, et al. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27:824–34. 10.1101/gr.213959.116.

[bib58] 58. Peng Y, Leung HCM, Yiu SM, et al. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28:1420–28. 10.1093/bioinformatics/bts174.

[bib59] 59. Antipov D, Raiko M, Lapidus A, et al. Metaviral SPAdes: assembly of viruses from metagenomic data. Bioinformatics. 2020;36:4126–29. 10.1093/bioinformatics/btaa490.

[bib60] 60. Antipov D, Rayko M, Kolmogorov M, et al. viralFlye: assembling viruses and identifying their hosts from long-read metagenomics data. Genome Biol. 2022;23:57.

[bib61] 61. Mallawaarachchi V, Roach MJ, Decewicz P, et al. Phables: from fragmented assemblies to high-quality bacteriophage genomes. Bioinformatics. 2023;39. 10.1093/bioinformatics/btad586.

[bib62] 62. Ho SFS, Wheeler NE, Millard AD, et al. Gauge your phage: benchmarking of bacteriophage identification tools in metagenomic sequencing data. Microbiome. 2023;11:84. 10.1186/s40168-023-01533-x.

[bib63] 63. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Project for Statistical Computing. 2021. https://www.r-project.org/.

[bib64] 64. Wickham H, Averick M, Bryan J, et al. Welcome to the tidyverse. J Open Source Softw. 2019;4:1686. 10.21105/joss.01686.

[bib65] 65. Wickham H. Ggplot2: Elegant Graphics for Data Analysis. 2nd ed. Cham, Switzerland: Springer International Publishing, 2009. 10.1007/978-0-387-98141-3.

[bib66] 66. McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One. 2013;8:e61217.

[bib67] 67. Barnett D, Arts I, Penders J. microViz: an R package for microbiome data visualization and statistics. J Open Source Softw. 2021;6:3201.

[bib68] 68. Mölder F, Jablonski KP, Letcher B, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33. 10.12688/f1000research.29032.2.

[bib69] 69. Anaconda Software Distribution. Anaconda Inc. https://www.anaconda.com/download.

[bib70] 70. Cochrane GR, Galperin MY. The 2010 Nucleic Acids Research Database Issue and online Database Collection: a community of data resources. Nucleic Acids Res. 2010;38:D1–D4. 10.1093/nar/gkp1077.

[bib71] 71. Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90. 10.1093/bioinformatics/bty560.

[bib72] 72. Finkbeiner SR, Holtz LR, Jiang Y, et al. Human stool contains a previously unrecognized diversity of novel astroviruses. Virol J. 2009;6:161.

[bib73] 73. Bushnell B. BBTools. 2023. https://sourceforge.net/projects/bbmap/.

[bib74] 74. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100. 10.1093/bioinformatics/bty191.

[bib75] 75. NCBI. Viral assembly database. 2024. https://www.ncbi.nlm.nih.gov/assembly/?%20term=viruses Accessed 2023.

[bib76] 76. Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9:2542.

[bib77] 77. UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49:D480–89.

[bib78] 78. Mirdita M, von den Driesch L, Galiez C, et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45:D170–76. 10.1093/nar/gkw1081.

[bib79] 79. Hingamp P, Grimsley N, Acinas SG, et al. Exploring nucleo-cytoplasmic large DNA viruses in Tara Oceans microbial metagenomes. ISME J. 2013;7:1678–95. 10.1038/ismej.2013.59.

[bib80] 80. Schoch CL, Ciufo S, Domrachev M, et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database. 2020;2020:baaa062. 10.1093/database/baaa062.

[bib81] 81. Shen W, Ren H. TaxonKit: a practical and efficient NCBI taxonomy toolkit. J Genet Genomics. 2021;48:844–50. 10.1016/j.jgg.2021.03.006.

[bib82] 82. Li D, Liu C-M, Luo R, et al. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31:1674–76. 10.1093/bioinformatics/btv033.

[bib83] 83. Koren S, Walenz BP, Berlin K, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27:722–36. 10.1101/gr.215087.116.

[bib84] 84. Kolmogorov M, Yuan J, Lin Y, et al. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019;37:540–46. 10.1038/s41587-019-0072-8.

[bib85] 85. Bushnell B. BBMap: A Fast, Accurate, Splice-Aware Aligner. Berkeley, CA: Lawrence Berkeley National Lab; 2014. Report No.: LBNL-7065E.

[bib86] 86. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinf. 2011;12:323.

[bib87] 87. Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 2012;131:281–85. 10.1007/s12064-012-0162-3.

[bib88] 88. Roach MJ, Beecroft SJ, Mihindukulasuriya KA, et al. Hecatomb @ GitHub. 2020. https://github.com/shandley/hecatomb Accessed 2024.

[bib89] 89. Roach M. Hecatomb @ Bioconda. 2021. https://anaconda.org/bioconda/hecatomb Accessed 2024.

[bib90] 90. Roach M. Hecatomb @ PyPI. 2023. https://pypi.org/project/hecatomb/ Accessed 2024.

[bib92] 91. Köster J, Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–22. 10.1093/bioinformatics/bts480.

[bib93] 92. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–79. 10.1093/bioinformatics/btp352.

[bib94] 93. Shen W, Le S, Li Y, et al. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One. 2016;11:e0163962.

[bib95] 94. Roach MJ, Hart BJ, Beecroft SJ, et al. Koverage: read-coverage analysis for massive (meta)genomics datasets. J Open Source Softw. 2024;9:6235.

[bib96] 95. Roach MJ, Tessa Pierce-Ward N, Suchecki R, et al. Ten simple rules and a template for creating workflows-as-applications. PLoS Comput Biol. 2022;e1010705. 10.1371/journal.pcbi.1010705.

[bib97] 96. Greenfeld AR, Greenfeld DR, Pierzina R, et al. Cookiecutter: a command-line utility that creates projects from cookiecutter project templates. 2013. https://github.com/cookiecutter/cookiecutter Accessed 2024.

[bib98] 97. Köster J. Snakemake profiles. 2017. https://github.com/Snakemake-Profiles/doc. Accessed 2024.

[bib99] 98. Sun T-W, Yang C-L, Kao T-T, et al. Host range and coding potential of eukaryotic giant viruses. Viruses. 2020;12:1337. 10.3390/v12111337.

[bib100] 99. Lima LFO, Alker A, Papudeshi B, et al. Coral and Seawater Metagenomes Reveal Key Microbial Functions to Coral Health and Ecosystem Functioning Shaped at Reef Scale. Microb Ecol. 2021. 10.1007/s00248-022-02094-6.

[bib101] 100. Lima LFO, Weissman M, Reed M, et al. Modeling of the coral microbiome: the influence of temperature and microbial network. mBio. 2020;11. 10.1128/mBio.02691-19.

[bib102] 101. Nayfach S, Camargo AP, Schulz F, et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol. 2021;39:578–85. 10.1038/s41587-020-00774-7.

[bib103] 102. Bouras G, Nepal R, Houtak G, et al. Pharokka: a fast scalable bacteriophage annotation tool. Bioinformatics. 2023;39:btac776. 10.1093/bioinformatics/btac776.

[bib104] 103. Routh A, Ordoukhanian P, Johnson JE. Nucleotide-resolution profiling of RNA recombination in the encapsidated genome of a eukaryotic RNA virus by next-generation sequencing. J Mol Biol. 2012;424:257–69. 10.1016/j.jmb.2012.10.005.

[bib105] 104. Silva JM, Pratas D, Caetano T, et al. The complexity landscape of viral genomes. Gigascience. 2022;11:giac079. 10.1093/gigascience/giac079.

[bib106] 105. Kang HS, McNair K, Cuevas DA, et al. Prophage genomics reveals patterns in phage genome organization and replication. Biorxiv. 2017. 10.1101/114819.

[bib107] 106. Gourlé H, Karlsson-Lindsjö O, Hayer J, et al. Simulating Illumina metagenomic data with InSilicoSeq. Bioinformatics. 2019;35:521–22. 10.1093/bioinformatics/bty630.

[bib108] 107. Zhong Z-P, Vik D, Rapp J, et al. Lower viral evolutionary pressure under stable versus fluctuating conditions in subzero Arctic brines. Microbiome. 2023. 10.1186/s40168-023-01619-6.

[bib109] 108. Han L-L, Yu D-T, Bi L, et al. Distribution of soil viruses across China and their potential role in phosphorous metabolism. Environ Microbiome. 2022;17:6. 10.1186/s40793-022-00401-9.

[bib110] 109. Zhu N, Zhang D, Wang W, et al. A novel coronavirus from patients with pneumonia in China. N Engl J Med. 2020;382:727–33. 10.1056/NEJMoa2001017.

[bib111] 110. Kieft K, Anantharaman K. Virus genomics: what is being overlooked? Curr Opin Virol. 2022;53:101200.

[bib112] 111. Roach M, Handley S. Reanalysis of Hecatomb datasets. GitHub. 2022. https://gist.github.com/beardymcjohnface/3d3245b2bf6d9544c524f412037d5065.

[bib113] 112. Roach M, Handley S, Henry-Cocks K. Hecatomb @ ReadTheDocs. 2022. https://hecatomb.readthedocs.io/en/latest/.

[bib114] 113. Roach MJ, Beecroft SJ, Mihindukulasuriya KA, et al. Supporting data for “Hecatomb: An Integrated Software Platform for Viral Metagenomics.” GigaScience Database. 2024. 10.5524/102506.

[bib115] 114. Roach M. Hecatomb. WorkflowHub. 2024. 10.48546/WORKFLOWHUB.WORKFLOW.235.1.
