ContigExtender: a new approach to improving de novo sequence assembly for viral metagenomics data

Zachary Deng; Eric Delwart

2021 Mar 12;22:119. doi: 10.1186/s12859-021-04038-2

PMCID: PMC7953547 PMID: 33706720

Abstract

Background

Metagenomics is the study of microbial genomes for pathogen detection and discovery in human clinical, animal, and environmental samples via Next-Generation Sequencing (NGS). Metagenome de novo sequence assembly is a crucial analytical step in which longer contigs, ideally whole chromosomes/genomes, are formed from shorter NGS reads. However, the contigs generated by de novo assembly are often highly fragmented and rarely longer than a few kilobase pairs (kb). Therefore, a time-consuming extension process is routinely performed on the de novo assembled contigs.

Results

To facilitate this process, we propose a new tool for metagenome contig extension after de novo assembly. ContigExtender employs a novel recursive extending strategy that explores multiple extending paths to achieve highly accurate longer contigs. We demonstrate that ContigExtender outperforms existing tools in synthetic, animal, and human metagenomics datasets.

Conclusions

A novel software tool ContigExtender has been developed to assist and enhance the performance of metagenome de novo assembly. ContigExtender effectively extends contigs from a variety of sources and can be incorporated in most viral metagenomics analysis pipelines for a wide variety of applications, including pathogen detection and viral discovery.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-021-04038-2.

Keywords: Metagenomics, De novo assembly, Next-Gen Sequencing, Viral discovery, Pathogen detection

Background

Metagenomic next-generation sequencing (mNGS) has emerged as an unbiased, high-throughput tool for clinical infectious agent detection and novel pathogen discovery [1–6]. Analytical metagenome pipelines are currently undergoing active development [7–11]. To identify microbial sequences, millions of NGS reads are compared to publicly available databases of reference sequences. In this analysis, the creation of longer contigs from short overlapping reads is accomplished through de novo metagenome assembly. The longer contigs generated by high-quality sequence assembly have two major advantages over short reads: improved detection sensitivity for novel pathogens without strong sequence homology to known pathogens, and a reduced need for manual genome extension through polymerase chain reaction [12].

De novo assembly has been an essential tool in recent studies in metagenomics viral discovery [13–22]. Dedicated metagenome assemblers have emerged recently, as a result of very active development in this field. Several metagenome assemblers are summarized in a recent review by Ayling et al. [23] and tested in our previous study [12]. Recently, the popular assembler SPAdes and its variant, metaSPAdes, specially designed for metagenomic applications, have emerged as the tools of choice for metagenome assembly [24, 25]. The metaSPAdes algorithm is based on De Bruijn graphs (DBG) and has addressed many challenges in metagenome assembly, such as uneven coverage and the existence of mixtures of multiple strains. Since metaSPAdes has been adopted by many labs, we use it as our benchmark tool to generate the initial seed contigs to be extended. Other DBG-based metagenome assemblers include IDBA-UD [26], MetaVelvet [27], MetaVelvet-SL [28], MEGAHIT [29], MegaGTA [30], Ray Meta [31], PRICE [32], and Xander [33]. Another type of assembler employs a strategy called Overlap-Layout-Consensus (OLC), which is based on joining overlaps to form new contigs. This group includes SAVAGE [34], Snowball [35], Genovo [36], BBAP [37], IVA [38], and VICUNA [39].

According to previous data [8, 12], de novo assemblers, when applied to real metagenomic samples, are unlikely to produce contigs longer than several kb. Uneven coverage, sequencing errors, library construction, and amplification artifacts are among the factors causing premature assembly termination. Another challenge is the presence of multiple viral, bacterial, and animal/host DNA fragments within metagenomic samples. However, in practice, it is often possible to find reads that overlap with the edges of the de novo assembled contigs, so iterative extension may significantly increase contig length. We are motivated by this observation and propose a novel algorithm and a software tool, ContigExtender, to automate the contig extension step following de novo assembly.

The basic ContigExtender algorithm is a greedy algorithm based on overlap search, with the following steps: (1) find overlapping reads with respect to both ends of the input contig; (2) calculate candidate extension paths based on these overlapping reads; (3) extend the existing contig; and (4) repeat the process until it can no longer be extended. To ensure it works properly under challenging situations, ContigExtender has several important unique features. First, the extension path is allowed to branch when multiple extension paths representing multiple strains are present. This is implemented using depth-first search (DFS) to explore multiple possible extending paths recursively. The benefit of this feature is that it keeps the extension from becoming trapped in local optima, which may cause premature termination. Second, the algorithm focuses on overlap quality rather than depth. A hard depth cutoff is avoided to allow contig extension in ultra-low coverage (1×) but otherwise high-quality overlapping areas. Finally, instead of using a consensus sequence derived from all covering reads at the overlap, we separate the region into genotypes, since the overlapping reads may contain multiple strains. Overlapping reads are categorized according to their genotypes and the genotypes are ranked based on read concentration and quality. The main benefit of using genotypes instead of a single consensus is that it allows for branching for each strain. In addition, untrimmed adapters and sequencing errors are not likely to contribute to significant genotypes since these contaminations appear at random positions.

Contigs generated by a de novo assembler serve as the seed contigs that are input into ContigExtender, and the outputs of ContigExtender are the final contigs. The final contigs were rigorously evaluated based on gained length and contig accuracy, which is measured by similarity to NCBI viral reference nucleotide sequences using nucleotide BLAST [40]. Our results show that, in comparison to other approaches to contig extension, ContigExtender is effective at extending seed contigs while maintaining high levels of contig accuracy.

Implementation

Datasets

Four groups of datasets, named “in silico synthetic”, “NIBSC virus standard”, “Animal”, and “Human”, were used to evaluate the ContigExtender algorithm. To evaluate performance (contig size and accuracy), we compared the output to the reference genome of the target pathogen.

A proof-of-concept “in silico synthetic” dataset was developed to test the efficacy of ContigExtender at varying read lengths (100 bp and 250 bp), error rates (1% and 5%), depths (10×, 20×, and 50×), and unevenness of coverage. The three target viral genomes are the Bas-Congo virus (BASV), a novel rhabdovirus associated with hemorrhagic fever cases in central Africa [41]; BK virus (BKV), a human polyomavirus; and human immunodeficiency virus type 1 (HIV-1). BKV is an unenveloped double-stranded DNA virus with a circular genome of around 5 kb. BASV is a negative-sense single-stranded RNA virus and HIV-1 is a positive-sense single-stranded RNA virus.

To emulate the unevenness of coverage, peaks of 50× coverage spanning 250 bp were spiked in every 1 kb. Each genomic position has an equal probability of being covered, emulating the ideal shotgun sequencing process. However, the stochastic nature of this process does not result in perfectly even coverage across the genome, but rather in a binomial coverage distribution, especially in low-coverage situations. For example, we observed that some positions can have as little as 1× coverage while the average coverage is 10×.
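The effect described above can be reproduced with a minimal simulation. The sketch below is not the authors' simulator: it assumes reads are placed uniformly at random, treats the genome as circular to avoid edge effects, and uses a hypothetical helper name (`simulate_coverage`). The genome length approximates BASV (11.9 kb).

```python
import random

def simulate_coverage(genome_len, read_len, mean_depth, seed=0):
    """Place reads uniformly at random and return per-position coverage.

    Assumption: the genome wraps around (circular), so every position has
    the same chance of being covered and the ends are not under-sampled.
    """
    rng = random.Random(seed)
    n_reads = round(mean_depth * genome_len / read_len)
    coverage = [0] * genome_len
    for _ in range(n_reads):
        start = rng.randrange(genome_len)
        for pos in range(start, start + read_len):
            coverage[pos % genome_len] += 1
    return coverage

cov = simulate_coverage(genome_len=11_900, read_len=100, mean_depth=10)
print("mean depth:", sum(cov) / len(cov))
print("min depth:", min(cov), "max depth:", max(cov))
```

Even though the mean depth is fixed at 10×, the minimum and maximum depths differ, which is the kind of stochastic unevenness that an extension algorithm has to tolerate.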

In addition, we simulated realistic 100 bp paired-end Illumina reads using pIRS version 1.1.1 [42] for the above three reference viruses. This software does not support longer reads, but it accounts for GC content and uses profiles derived from real Illumina base calling.

The NIBSC, Animal, and Human samples are summarized in Table 1. The NIBSC dataset [43] (NIBSC sample 26) is assembled from both clinical specimens and cultured viruses. The target viral genomes range in length from ~ 6 to ~ 234 kb. Illumina MiSeq sequencing created a dataset of approximately 9 million paired-end reads of length 250 bp. The mosquito datasets [21] (pool20 and pool27) analyzed here originated from mosquito control districts throughout California. The resulting libraries were generated as previously described [43, 44] and sequenced on the Illumina HiSeq 4000 platform using 2 × 150 cycles. The Human metagenomic datasets characterize viral nucleic acids in nasal swabs or feces from apparently healthy young children with no recorded symptoms living in multiple small and remote Amazonian villages, as described in our previous studies [13, 45]. The raw 250 bp paired-end reads were generated using MiSeq and deposited in the Sequence Read Archive (SRA). The SRA project accession numbers are PRJNA391715 and PRJNA530270 for the mosquito samples and Amazon nasal swab samples, respectively. The Amazon fecal samples are under accession numbers SRR6287056 to SRR6287135. We also included additional metagenome samples from human stool (47210-feces), human blood from a treated HIV-positive patient (12-110034-veqrpcr), dog diarrhea stool (Dog-pool), and fish tumor tissue (Fish1-pool).

Table 1.

Metagenome datasets used to evaluate ContigExtender performance

Data set	Sample	Read length	#reads	Genome type	Sequencing platform	Description
NIBSC	NIBSC-26	250	8.55 M	25 different human RNA and DNA viral pathogens	MiSeq	Multiplexed viral standards
Animal	Mosquito Pool20	150	0.81 M	Culex Iflavi-like virus; Mesoniviridae	HiSeq4000	Mosquito pool
Animal	Mosquito Pool27	150	1.54 M	Culex Iflavi-like virus; Mesoniviridae	HiSeq4000	Mosquito pool
Animal	Fish1-pool	250	2.30 M	Enterococcus virus	MiSeq	Fish tumor mass
Animal	Dog-pool	250	1.31 M	Uncultured crAssphage	MiSeq	Dog stool sample
Human	12-110034-veqrpcr	250	0.53 M	Hepacivirus C	MiSeq	Human blood sample
Human	47210-feces	250	1.90 M	Escherichia virus	MiSeq	Human stool sample
Human	Amazon-4B	250	0.81 M	Norwalk Virus	MiSeq	Human stool sample
Human	Amazon-3D	250	0.38 M	Husavirus	MiSeq	Human stool sample
Human	Amazon-17D	250	1.61 M	Husavirus	MiSeq	Human stool sample
Human	Amazon-6D	250	0.47 M	Human Cosavirus	MiSeq	Human stool sample
Human	Amazon-S10-CNI-055	250	0.95 M	Betapapillomavirus	MiSeq	Human nasal swab sample

Genomic sequences from NIBSC, Animal and Human metagenome datasets represent various pathogen types, genome sizes, sample backgrounds, and sequencing outputs that were encountered in real world metagenome and clinical applications using NGS

Preprocessing

Raw reads obtained from Illumina sequencing were preprocessed before assembly as described in [12]. Human host reads were subtracted by mapping the reads to the human reference genome hg19 using bowtie2 [46]. Additionally, PRINSEQ version 0.20.4 was used to filter low-complexity reads using default parameters [47].

De novo assembly

We used SPAdes 3.13.0 with the “--meta” option to enable metaSPAdes mode. The k-mer sizes were set to 21, 33, 55, and 77, while all other parameters were left at their defaults. metaSPAdes first constructs the de Bruijn graph of all reads using SPAdes, transforms it into the assembly graph using various graph simplification procedures, and reconstructs paths in the assembly graph that correspond to long fragments of individual genomes within a metagenome [24, 25]. Responding to the microdiversity challenge, metaSPAdes focuses on reconstructing a consensus backbone of a strain mixture and thus sometimes ignores some strain-specific features (often corresponding to rare strains) to improve the contiguity of assemblies.

Algorithm

The outline of the algorithm is illustrated in Figs. 1 and 2. The efficacy of the extension results from optimizing each contig individually, instead of the simultaneous process used by de novo assembly. The extension process occurs iteratively. During each iteration, alignments between the set of reads and the two ends of the input contig are computed using an external tool such as Bowtie 2. After these alignments are filtered for quality and length, they are aggregated, and regions with high disagreement are recorded. Then, each potential solution is scored and becomes one of multiple possible consensuses. Each consensus then becomes the input contig of the next extension iteration. These steps are repeated until a cycle is detected or insufficient alignments are available to extend the contig further. The final output of the algorithm consists of a collection of potential contigs.

Fig. 1. Schematic views of the ContigExtender assembly algorithm. (a) Reads that overlap the edges of input contigs are recruited iteratively, and a consensus sequence is generated from the overlaps to form extended contigs. (b) Multiple strains may form alternative consensus contigs; branches are created when variant reads are detected. (c) A more detailed demonstration of the overlapping-consensus-branching algorithm, showing the two branches formed by depth-first search (DFS). Two aligned reads have a three-base disagreement region, so two different paths are formed for alternative extension. (d) Reads containing untrimmed adapters or other sequencing errors will not align well with the contig and other reads. (e) Circular genome detection and extension termination

Fig. 2. Pseudocode of the ContigExtender algorithm
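Since the figure itself is not reproduced here, the recursive branching idea can be illustrated with a toy sketch. It is not the published pseudocode: it assumes error-free reads, exact suffix-prefix overlaps found by string matching rather than Bowtie 2, right-end extension only, and no scoring. The function names `extensions` and `extend_dfs` are hypothetical.

```python
def extensions(contig, reads, min_overlap):
    """Return the distinct overhangs that reads add to the right end of contig."""
    found = set()
    for read in reads:
        # Try the longest overlap first; the read must overhang the contig end.
        for k in range(min(len(read) - 1, len(contig)), min_overlap - 1, -1):
            if contig.endswith(read[:k]):
                found.add(read[k:])
                break
    # Drop overhangs that are just a shorter prefix of another overhang.
    return sorted(e for e in found
                  if not any(o != e and o.startswith(e) for o in found))

def extend_dfs(contig, reads, min_overlap=4, max_len=1000, seen=None):
    """Depth-first exploration of right-end extension paths (toy version)."""
    seen = set() if seen is None else seen
    tail = contig[-min_overlap:]
    if tail in seen or len(contig) >= max_len:   # cycle or length cap
        return [contig]
    candidates = extensions(contig, reads, min_overlap)
    if not candidates:                           # no overlapping reads left
        return [contig]
    results = []
    for ext in candidates:                       # one branch per candidate
        results += extend_dfs(contig + ext, reads, min_overlap, max_len,
                              seen | {tail})
    return results

reads = ["GTACGGA", "CGGATTC", "CGGAAAG"]
paths = extend_dfs("ACGTAC", reads)
print(paths)
```

In this toy input the last two reads disagree after the shared `CGGA`, so the search returns two final contigs, one per branch, rather than collapsing them into a single consensus.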

Alignment

The algorithm utilizes an existing alignment tool to find overlaps between the reads and the contig. To preserve only partially overlapping reads, and not reads that lie completely within the contig, only the prefix and suffix of the contig with length equal to the length of the longest read are considered. The bowtie2 parameters are chosen to not penalize ambiguous characters when running end-to-end alignment. For paired data, we provide an option to constrain the orientation and distance of each read pair. When the constraints are enabled, reads whose mates are aligned discordantly are discarded.
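The choice of alignment targets described above can be sketched in a few lines. This is an illustrative helper (`contig_ends` is a hypothetical name), not code from the tool; it only shows how the two windows would be cut before being passed to an aligner.

```python
def contig_ends(contig, reads):
    """Return the contig prefix and suffix used as alignment targets.

    The window length equals the longest read, capped at the contig length,
    so reads lying in the contig interior have nothing to align against.
    """
    window = min(max(len(r) for r in reads), len(contig))
    return contig[:window], contig[-window:]

prefix, suffix = contig_ends("AAAACCCCGGGGTTTT", ["ACGT", "ACGTAC"])
print(prefix, suffix)
```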

Computation of consensus

For each overlap k, a quality score $q_{k}$ is computed, equal to the number of bases on which the read and the contig agree. Portions of the read that lie outside the original contig are not considered in this quality score. Each read must have a quality score above a user-defined threshold to be considered in the consensus, which effectively ignores short alignments of very few bases.
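A minimal sketch of this score, assuming an ungapped placement of the read at a known contig coordinate (the `offset` argument and the function name `overlap_quality` are illustrative, and the cutoff value is arbitrary):

```python
def overlap_quality(contig, read, offset):
    """Count bases on which the read and the contig agree.

    offset is the contig coordinate of read[0]; it may be negative (read
    overhangs the left end) and the read may run past the right end.
    Read bases that fall outside the contig are ignored.
    """
    score = 0
    for i, base in enumerate(read):
        pos = offset + i
        if 0 <= pos < len(contig) and contig[pos] == base:
            score += 1
    return score

contig = "ACGTACGT"
placements = [("ACGTTT", 4), ("ACCTTT", 4), ("TTAC", -2)]
scores = [overlap_quality(contig, read, off) for read, off in placements]
min_quality = 2  # hypothetical user-defined cutoff
kept = [p for p, q in zip(placements, scores) if q > min_quality]
print(scores, kept)
```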

Let $R_{k}[i]$ be the nucleotide of read k that would be at position i in the new consensus. At each position in the extended contig, define $Q[i][j] = \sum_{k : R_{k}[i] = j} q_{k}^{2}$. To create the new consensus c, set $c_{i}$ to be the base $j$ that maximizes $Q[i][j]$ if this value is greater than a threshold. This threshold can be computed from the following expression:

$\mathrm{threshold} = 10^{-t} l^{2} c,$

where t is a dimensionless tolerance value that represents the acceptable amount of evidence required for extension, l is the length of the reads, and c is an estimate of the coverage.

If this threshold is not reached, then leave this position in the consensus undefined. This process computes the main consensus, which is supported by the most evidence.
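The scoring rule can be made concrete with a short sketch. The data layout (tuples of offset, sequence, and quality score), the use of `N` for undefined positions, and the function names are assumptions made for illustration; the threshold follows the expression given above.

```python
from collections import defaultdict

def extension_threshold(tolerance, read_len, coverage):
    """threshold = 10^(-t) * l^2 * c."""
    return 10 ** (-tolerance) * read_len ** 2 * coverage

def main_consensus(placed_reads, length, threshold):
    """Pick, per position, the base with the largest summed q^2.

    placed_reads holds (offset, sequence, q) tuples, where offset is the
    consensus coordinate of the read's first base (illustrative layout).
    Positions whose best score does not exceed the threshold become 'N'.
    """
    consensus = []
    for i in range(length):
        scores = defaultdict(float)
        for offset, seq, q in placed_reads:
            if 0 <= i - offset < len(seq):
                scores[seq[i - offset]] += q ** 2
        best_base = max(scores, key=scores.get, default="N")
        consensus.append(best_base if scores[best_base] > threshold else "N")
    return "".join(consensus)

reads = [(0, "ACGTA", 5), (1, "CGTAC", 4), (3, "GAC", 1)]
toy_threshold = extension_threshold(tolerance=1, read_len=10, coverage=2)
print(main_consensus(reads, length=6, threshold=toy_threshold))
print(extension_threshold(tolerance=2.5, read_len=250, coverage=10))
```

With the default tolerance of 2.5, 250 bp reads, and an assumed coverage estimate of 10, the threshold comes to roughly 1976. Because scores are squared, a single read agreeing with the contig over 45 bases (45² = 2025) would already exceed it, which is in line with the stated goal of extending through 1× coverage regions.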

Alternate consensuses

The second-highest-scoring base for each position i is found, and if its score is greater than some threshold and a sufficiently large number of reads agree on it, then position i is marked as an ambiguous base. For each read, a string is created by taking the subset of the read's bases that correspond to the ambiguous positions. These strings represent each individual read’s solution to the ambiguous positions. The strings are sorted by frequency, and those that meet a certain threshold of matching reads are retained as alternate consensuses.
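The genotype-string step can be sketched as follows, assuming the ambiguous positions have already been identified. The requirement that a read span every ambiguous position, the `min_reads` cutoff, and the function names are assumptions for illustration.

```python
from collections import Counter

def genotype_counts(placed_reads, ambiguous):
    """Count each read's bases at the ambiguous consensus positions."""
    counts = Counter()
    for offset, seq in placed_reads:
        # Assumption: only reads spanning every ambiguous position vote.
        if all(0 <= p - offset < len(seq) for p in ambiguous):
            counts["".join(seq[p - offset] for p in ambiguous)] += 1
    return counts

def alternates(placed_reads, ambiguous, min_reads):
    """Return (genotype, read count) pairs, most frequent first."""
    counts = genotype_counts(placed_reads, ambiguous)
    return [(g, n) for g, n in counts.most_common() if n >= min_reads]

reads = [(0, "ACGTAC")] * 3 + [(0, "ACCTTC")] * 2 + [(0, "ACGTTC")]
print(alternates(reads, ambiguous=[2, 4], min_reads=2))
```

Here the genotype `GT` is supported by a single read and is dropped, which mirrors how isolated sequencing errors are unlikely to form a significant genotype.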

To prevent the number of alternates from growing exponentially with the length of the contig, a limit may be set on the quantity of alternates that may be considered at each iteration, in addition to adjustment of the threshold required for designating a position as ambiguous.

The algorithm terminates when a sufficiently large prefix or suffix of the extended contig is identical to one that has occurred earlier in the extension process, as this would create an infinite loop of the exact alignments and consensuses that occurred in between. This also handles circular chromosomes, which would otherwise experience the same behavior. Additionally, an upper limit to the length of the extended contig can also be set, to save processing time when a consensus that diverges from the reference is chosen.
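A toy sketch of these stopping rules is given below for right-end extension only. The `step` callback stands in for one align-and-consensus iteration, and the circular genome, window size, and function names are all hypothetical.

```python
def extend_with_termination(seed, step, window, max_len):
    """Extend until a suffix repeats, no extension is found, or a cap is hit."""
    seen = {seed[-window:]}
    contig = seed
    while len(contig) < max_len:
        ext = step(contig)
        if not ext:
            return contig, "no overlapping reads"
        contig += ext
        suffix = contig[-window:]
        if suffix in seen:            # same end seen before: we are looping
            return contig, "cycle"
        seen.add(suffix)
    return contig, "length cap"

GENOME = "ACGTTGCAGG"  # toy circular genome with unique 4-mers

def circular_step(contig, window=4, n=2):
    """Stand-in for one iteration: return the next n bases around the circle."""
    doubled = GENOME * 2
    idx = doubled.find(contig[-window:])
    return doubled[idx + window: idx + window + n] if idx >= 0 else ""

result = extend_with_termination("ACGT", circular_step, window=4, max_len=100)
print(result)
```

In this example the extension stops once the seed's own suffix reappears, and removing that repeated window from the end leaves exactly one copy of the circular genome. Whether and how the tool trims the repeated segment is not specified in the text.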

Results

Software parameters

The performance of ContigExtender on simulated and real datasets is benchmarked against the existing contig extension tools PRICE [32], Kollector [48] and GenSeed-HMM [49]. PRICE V 1.2 is executed using the recommended parameters: 30 cycles, 500 bp insert size, and 90% identity to the initial contig. Kollector is executed with the recommended alignment thresholds, assembly K-mer size 32, and overlap K-mer size 25. We ran GenSeed-HMM with the ABySS assembler, 30 bp extension seeds, and a 90% alignment threshold. ContigExtender is run with the default tolerance of 2.5.

Simulated data

The performance of ContigExtender on the in silico synthetic BASV dataset is shown in Table 2. From randomly chosen 1 kb seed contigs, ContigExtender was able to reconstruct nearly perfect genomes for all three viral genomes in all cases except for two challenging situations: (1) low sequencing depth (10×) coupled with short reads (100 bp) and (2) low depth (10×) coupled with a high error rate (0.05). Results from the BKV and HIV-1 datasets show similar trends (Additional file 1: Table S1). Although sequencing depth is a major factor for ContigExtender, longer reads of 250 bp coupled with low error rates (0.01) can compensate for low sequencing depth. ContigExtender can detect the circular nature of the BKV genome, avoiding over-extension.

Table 2.

Comparison of contigs produced from in silico reads from the BASV virus

BASV (11.9 kb)			Error rate = 0.01					Error rate = 0.05
Read length (bp)	Depth (x)	Spike-in peaks	Contig Extender (kb)	PRICE (kb)	GenSeed (kb)	Kollector (kb)	Meta SPAdes (kb)	Contig Extender (kb)	PRICE (kb)	GenSeed (kb)	Kollector (kb)	Meta SPAdes (kb)
100	10		6.7	1.6	1.1	1.8	6.7	1.9	1.3	NA	NA	1.8
	20		11.9	11.9	1.1	2.7	11.9	11.9	11.8	NA	NA	7.9
	20	Yes	11.9	11.8	1.1	2.7	11.9	11.9	11.9	NA	NA	7.9
	50		11.9	11.9	1.1	4.9	11.9	11.9	11.8	1.1	NA	7.9
250	10		11.8	7.2	1.1	1.1	7.9	8.4	7.3	NA	NA	7.6
	20		11.9	11.7	1.3	2.3	7.9	11.8	11.8	NA	NA	7.8
	20	Yes	11.9	11.5	1.3	2.3	7.9	11.8	NA	NA	NA	7.9
	50		11.9	11.8	1.4	2.4	7.9	11.9	11.9	NA	NA	11.9

Longest contig length produced by ContigExtender and other tools using in silico synthetic sequences for the BASV virus, representing varying read lengths, error rates, depths, and unevenness of coverage. Randomly selected sequences of 1 kb were used as seed contigs. Spike-in sequences were added, with each peak at a depth of 50× coverage and spanning 250 bp. De novo assemblies using metaSPAdes were also performed as a benchmark for assembly difficulty for each dataset. Runs that failed to produce an extension are marked “NA”

MetaSPAdes is not directly comparable with ContigExtender since the former is a de novo assembler, whereas the latter is a seeded assembler. Nevertheless, MetaSPAdes results provide a benchmark to measure the difficulty of each dataset. Like ContigExtender, MetaSPAdes is sensitive to very low depth at 10×, but MetaSPAdes is less tolerant of sequencing errors and the existence of viral mixtures.

ContigExtender generally performed better than PRICE in low depth (10×) and high error rate datasets (Table 2 and Additional file 1: Table S1). Both reconstructed nearly the entire reference genome when given higher depth sequencing data. GenSeed-HMM and Kollector both reconstructed portions of the reference genome from low error rate reads but generally did not accomplish any extension in the high error rate datasets.

We also benchmarked ContigExtender on realistic paired-end simulated datasets generated by pIRS (Additional file 1: Table S2). The results suggest that ContigExtender outperforms PRICE at low coverage (10×). Additionally, while GenSeed-HMM and Kollector produced no output in some cases, ContigExtender produced extension in all trials. Also, we observed a clear advantage when ignoring the insert size constraint for paired-end extension. This allows the algorithm to overcome some difficult regions.

NIBSC data

Among 58 MetaSPAdes seed viral contigs that are at least 1.5 kb in length and are at least 95% aligned to one of the reference viral genomes, 26 contigs were extended by at least 200 bp (Table 3). The quality of extended contigs was measured by final length, gained length (final length of ContigExtender output minus length of metaSPAdes seed contig), and the percentage of the output contig that is aligned to target viral genomes. The depth for each contig varies from 7× to 267×, the final contig lengths range from 1.7 to 10 kb, and the largest extension is 5.8 kb.

Table 3.

ContigExtender results on NIBSC datasets using MetaSPAdes assembly outputs as seed contigs

Contig ID	Meta SPAdes (bp)	Contig Extender (bp)	Gained length (bp)	Gained (%)	Aligned (bp)	Aligned (%)	Viral genome (Accession)	Genome size (bp)	Gained genome (%)	PRICE (bp)	Kollector (bp)	Depth (x)
1	4251	10,059	5808	137	10,057	100	Human_rubulavirus_2 (NC_003443.1)	15,646	37			54
2	3114	8315	5201	167	8288	100	Human_mastadenovirus_C (NC_001405.1)	35,937	14			31
3	4705	6841	2136	45	6814	100	Human_mastadenovirus_C (NC_001405.1)	35,937	6			36
4	4118	5099	981	24	5057	99	Human_mastadenovirus_C (NC_001405.1)	35,937	3			36
5	2818	5063	2245	80	5062	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	2			14
6	2234	4671	2437	109	4675	100	Human_betaherpesvirus_5 (NC_006273.2)	235,646	1			28
7	1784	4224	2440	137	4224	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	2			11
8	3944	4171	227	6	4149	99	Human_mastadenovirus_C (NC_001405.1)	35,937	1			29
9	3051	4098	1047	34	4092	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	1			16
10	3158	4029	871	28	3575	89	Human_betaherpesvirus_5 (NC_006273.2)	235,646	0			42
11	3462	3964	502	15	3961	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	0			23
12	1789	3666	1877	105	3665	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	2			14
13	1761	3379	1618	92	3319	98	Rotavirus_A (NC_011507.2)	3302	49	3403		126
14	1759	3292	1533	87	2552	78	Bat_rotavirus (NC_040413.1)	2649	58	2274	2101	267
15	2748	3140	392	14	3146	100	Human_betaherpesvirus_5 (NC_006273.2)	235,646	0			24
16	2861	3115	254	9	3115	100	Human_respirovirus_1 (NC_003461.1)	15,600	2			29
17	2664	3016	352	13	3016	100	Human_mastadenovirus_C (NC_001405.1)	35,937	1			25
18	1525	2839	1314	86	2840	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	1			7
19	1958	2616	658	34	2612	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	1			27
20	1789	2213	424	24	2213	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	0			8
21	1889	2154	265	14	2154	100	Human_betaherpesvirus_5 (NC_006273.2)	235,646	0			29
22	1881	2093	212	11	2093	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	0			19
23	1748	2001	253	14	1996	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	0			14
24	1699	1931	232	14	1931	100	Human_alphaherpesvirus_3 (NC_001348.1)	124,884	0			21
25	1505	1846	341	23	1847	100	Human_betaherpesvirus_5 (NC_006273.2)	235,646	0			27
26	1508	1768	260	17	1770	100	Human_betaherpesvirus_5 (NC_006273.2)	235,646	0			26

Columns 2–14 are: 2) seed contig length generated by MetaSPAdes; 3) extended contig length from seed contig; 4) gained length from ContigExtender (column 3 minus column 2); 5) gained length as a percentage of the seed contig (column 4 divided by column 2); 6) length of the largest contiguous segment of the extended contig that is aligned to the reference genome; 7) aligned segment as a percentage of the extended contig; 8) reference viral genome; 9) viral genome size; 10) gained length from ContigExtender as a percentage of the viral genome (column 4 divided by column 9); 11) gained extension by PRICE; 12) gained extension by GenSeed; 13) gained extension by Kollector; 14) average sequencing depth of the extended contig. Entries in the PRICE, GenSeed, and Kollector columns are blank if they produced no extension
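As a worked example of the derived columns, the following sketch recomputes them for Contig 1 of Table 3. The function name and dictionary keys are illustrative, and rounding to the nearest integer is an assumption that appears consistent with the tabulated values.

```python
def extension_metrics(seed_len, final_len, aligned_len, genome_len):
    """Derive the percentage columns from the four measured lengths (bp)."""
    gained = final_len - seed_len
    return {
        "gained_bp": gained,
        "gained_pct": round(100 * gained / seed_len),
        "aligned_pct": round(100 * aligned_len / final_len),
        "gained_genome_pct": round(100 * gained / genome_len),
    }

# Contig 1: Human_rubulavirus_2, genome size 15,646 bp
contig1 = extension_metrics(seed_len=4251, final_len=10059,
                            aligned_len=10057, genome_len=15646)
print(contig1)
```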

The accuracy of the final contigs is measured by the quality of alignment of these contigs to their respective reference viral genomes. Out of the 26 final contigs, all but two have nearly perfect alignments to reference genomes with greater than 98% single segment alignment. The other 2 contigs, Contig 10 (89%) and Contig 14 (78%) do not have a single alignment covering the whole contig, indicating possible chimeric contig formation during the contig extension process. Note that the performance reported here is achieved using ContigExtender’s default scoring parameters which can be adjusted to be more aggressive or more conservative. Aggressive extension produces longer contigs but risks higher chances of chimeric contig formation, while conservative extension results in shorter but possibly more accurate contigs.

PRICE, GenSeed-HMM, and Kollector did not produce any extension of most of the seed contigs tested (Table 3).

Figure 3 shows wiggle plots of the six longest final contigs for the NIBSC dataset. The Contig 2 and Contig 3 wiggle plots are highly similar; they are variants of the same contig that aligned to the same region of Human mastadenovirus C. Coverage is very uneven both within each contig and across contigs. Not surprisingly, low-coverage valleys are one of the main reasons that assembly terminates prematurely: a sudden dip in coverage will likely end contig assembly or extension.

Fig. 3. ContigExtender output shown alongside metaSPAdes seed contig and sequencing depth. Reads mapped to the final contig are shown as wiggle plots (in blue), seed contigs generated by MetaSPAdes as dark brown lines, and final contig regions that are aligned to the reference viral genome as black lines. The y axis is the depth in log scale and the x axis is the contig length. This figure was generated from native Scalable Vector Graphics (SVG) images plotted using Python 3 scripts based on reads mapped to the viral reference genomes with blastn

Human and animal data

Ten contigs achieved significant extensions of greater than 0.2 kb from the viral seed contigs that are longer than 1.5 kb in the “Animal” dataset. Contig 5 is the only contig that has a significant chimeric extension of ~ 3 kb, but it still gained a ~ 4 kb correct extension (Table 4). For the “Human” dataset, 13 contigs achieved significant extension and were successfully aligned to a wide variety of reference viruses. The only contig that has a possible chimeric extension is Contig 1 (Table 5), which gained a ~ 2.4 kb extension that does not align to the Norwalk virus reference genome. All other contigs are successful extensions and some gained significant length, as much as 6 kb, such as Contig 3 from sample Amazon-3D.

Table 4.

ContigExtender results on Animal datasets using MetaSPAdes as seed contigs

Contig ID	Library	Meta SPAdes (bp)	Contig Extender (bp)	Gained length (bp)	Gained (%)	Aligned (bp)	Aligned (%)	Viral genome	Genome size (bp)	Gained genome (%)	PRICE (bp)	Kollector (bp)
1	Dog-pool	5521	9826	4305	78	9760	99	uncultured_crAssphage (NC_024711.1)	97,065	4
2	Fish1-pool	2723	7064	4341	159	6878	97	Enterococcus_virus_phiSHEF5 (NC_042023.1)	41,598	10
3	Mosquito-pool20	3074	10,130	7056	230	9699	96	Culex_Iflavi-like_virus_4 (NC_040716.1)	9698	73
4	Mosquito-pool20	3042	10,130	7088	233	9699	96	Culex_Iflavi-like_virus_4 (NC_040716.1)	9698	73
5	Mosquito-pool27	4106	10,095	5989	146	7030	70	Culex_Iflavi-like_virus_4 (NC_040716.1)	9698	62
6	Mosquito-pool27	6011	10,069	4058	68	10,068	100	Alphamesonivirus_1 (NC_015874.1)	20,192	20		6742
7	Mosquito-pool27	5638	10,016	4378	78	9673	97	Culex_Iflavi-like_virus_4 (NC_040574.1)	9698	45	7820
8	Mosquito-pool20	3689	9872	6183	168	9699	98	Culex_Iflavi-like_virus_4 (NC_040716.1)	9698	64
9	Mosquito-pool27	2430	2674	244	10	2626	98	Culex-associated_Tombus-like_virus (NC_040575.1)	2645	9
10	Mosquito-pool27	1786	2131	345	19	2052	96	Hubei_mosquito_virus_4 (NC_032231.1)	4971	7

Columns 3–14 are: 3) seed contig length generated by MetaSPAdes; 4) extended contig length from seed contig; 5) gained length from ContigExtender (column 4 minus column 3); 6) gained length as a percentage of the seed contig (column 5 divided by column 3); 7) length of the largest contiguous segment of the extended contig that is aligned to the reference genome; 8) aligned segment as a percentage of the extended contig; 9) reference viral genome; 10) viral genome size; 11) gained length from ContigExtender as a percentage of the viral genome (column 5 divided by column 10); 12) gained extension by PRICE; 13) gained extension by GenSeed; 14) gained extension by Kollector. Entries in the PRICE, GenSeed, and Kollector columns are blank if they produced no extension

Table 5.

ContigExtender results on Human metagenome datasets using MetaSPAdes as seed contigs

Contig ID	Library	Meta SPAdes (bp)	Contig Extender (bp)	Gained length (bp)	Gained (%)	Aligned (bp)	Aligned (%)	Genome	Genome size (bp)	Gained genome (%)
1	Amazon-4B	7560	10,034	2474	33	7493	75	Norwalk_virus (NC_040876.1)	7521	33
2	Amazon-17D	7912	8329	417	5	7862	94	Husavirus_sp. (NC_032480.1)	8856	5
3	Amazon-3D	1537	7676	6139	399	7678	100	Husavirus_sp. (NC_032480.1)	8856	69
4	Amazon-3D	3776	7530	3754	99	7532	100	Husavirus_sp. (NC_032480.1)	8856	42
5	Amazon-3D	2165	7530	5365	248	7532	100	Husavirus_sp. (NC_032480.1)	8856	61
6	Amazon-S10-CNI-055	1671	3258	1587	95	3242	100	Betapapillomavirus_1 (NC_001531.1)	7746	20
7	Amazon-S10-CNI-055	1710	3258	1548	91	3242	100	Betapapillomavirus_1 (NC_001531.1)	7746	20
8	Amazon-6D	2151	2772	621	29	2681	97	Human_cosavirus (NC_023984.1)	7802	8
9	12-110034-veqrpcr	2339	5237	2898	124	5233	100	Hepacivirus_C (NC_004102.1)	9646	30
10	47210-feces	2436	4637	2201	90	4444	96	Escherichia_virus_AKFV33 (NC_017969.1)	108,853	2
11	47210-feces	2436	3572	1136	47	3572	100	Escherichia_virus_T5 (NC_005859.1)	121,750	1
12	12-110034-veqrpcr	2424	3157	733	30	3121	99	Hepacivirus_C (NC_004102.1)	9646	8
13	12-110034-veqrpcr	2424	3156	732	30	3121	99	Hepacivirus_C (NC_004102.1)	9646	33

Columns 3–11 are: 3) seed contig length generated by MetaSPAdes; 4) extended contig length from seed contig; 5) gained length from ContigExtender (column 4 minus column 3); 6) gained length as a percentage of the seed contig (column 5 divided by column 3); 7) length of the largest contiguous segment of the extended contig that is aligned to the reference genome; 8) aligned segment as a percentage of the extended contig; 9) reference viral genome; 10) viral genome size; 11) gained length from ContigExtender as a percentage of the viral genome (column 5 divided by column 10). Note that PRICE, GenSeed, and Kollector did not extend any seed contigs in this set, so their columns are omitted

For the combined 49 contigs from the three human and animal datasets, 45 are of high quality without chimeric extensions. For these 45 contigs, the average seed contig length generated by MetaSPAdes is 2.8 kb. ContigExtender increased these lengths by ~ 2 kb on average, resulting in a final average length of 4.8 kb. The median gained length, however, is ~ 1.5 kb and the median seed and output lengths are 2.4 kb and 3.9 kb, respectively. These results, shown in Tables 3, 4, and 5, demonstrate a significant improvement over de novo assembly in a wide variety of datasets. In comparison, PRICE and Kollector successfully extended only one contig each, while GenSeed-HMM extended zero contigs.

Discussions

Genome sizes of bacteriophages and viruses range from a few kb to several hundred kb. State-of-the-art de novo assemblers can only achieve contigs that are a few kb long; these contigs can often be further extended by iteratively mapping reads to the contig ends, which is currently a time-consuming, manual process. The proposed method effectively turns sequence assembly into a two-step process: de novo assembly followed by contig extension. Our results demonstrate that contig extension can be an effective step in improving metagenomic sequence analysis. Compatible with any de novo assembler, ContigExtender can be built into most viral metagenomics analysis pipelines. A wide range of metagenomic applications, such as pathogen detection, microbiological surveillance, and viral discovery, can benefit from automated contig extension, which significantly reduces the time and effort otherwise required for manual contig extension.

ContigExtender addresses the microdiversity challenge with a novel branching feature. The proposed DFS branching mechanism allows multiple branching paths, defined by different genotypes representing multiple strains. Each overlapping region is evaluated for homogeneity of read alignments. Potential branch points are created at the overlapping region when sufficient heterogeneity, representing multiple genotypes (strains), is observed. Intra- and inter-genomic repeats can also be genotyped and resolved during this branching, which allows the exploration of multiple extension paths to gain maximum extension.
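The branching idea can be illustrated with a toy depth-first search. The sketch below is not the ContigExtender implementation: it assumes error-free reads, exact string matching in place of alignment, single-base extension steps, and a hypothetical `min_fraction` threshold for deciding when heterogeneity is sufficient to open a branch. All function and parameter names are invented for this example.

```python
from collections import Counter


def next_bases(contig, reads, overlap):
    """Count the bases that follow the contig's last `overlap` bases in the reads."""
    tail = contig[-overlap:]
    counts = Counter()
    for read in reads:
        start = read.find(tail)
        while start != -1:
            nxt = start + overlap
            if nxt < len(read):
                counts[read[nxt]] += 1
            start = read.find(tail, start + 1)
    return counts


def extend_dfs(contig, reads, overlap=4, min_fraction=0.3, max_len=50):
    """Explore every sufficiently supported extension path depth-first."""
    results = []
    stack = [contig]
    while stack:
        current = stack.pop()
        counts = next_bases(current, reads, overlap)
        total = sum(counts.values())
        # A base opens a branch only if enough of the overlapping reads support it.
        branches = sorted(b for b, c in counts.items() if c / total >= min_fraction)
        if not branches or len(current) >= max_len:
            results.append(current)  # dead end or length cap: report this path
            continue
        for base in branches:
            stack.append(current + base)
    return sorted(results)


# Two hypothetical strains that share a prefix and then diverge.
reads = ["ACGTTGCAAGCT", "ACGTTGCATGCC"]
print(extend_dfs("ACGTTG", reads))
```

With the two toy reads, the seed is extended along the shared prefix, a branch point is opened where the strains diverge, and both strain sequences are returned. Raising `min_fraction` above 0.5 stops the extension at the branch point instead, which shows how the threshold trades extension length against the risk of following a spurious path.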

Two major factors causing premature assembly termination are abrupt dips in coverage and excessive sequencing errors and contamination. The former is observed in Fig. 3, which shows that many contigs failed at coverage valleys. The latter is observed in Table 2, where excessive sequencing error (5%) causes poor de novo contig formation on many of the in silico synthesized datasets. ContigExtender addresses these challenges with a novel extension scoring function that prioritizes overlap over depth. It focuses on overlap quality rather than using a hard depth cutoff for extension in low coverage regions. To avoid chimeric contig extension as much as possible, our scoring function requires increased overlap lengths in low coverage regions. By using alignment rather than the k-mer search utilized in most de novo assemblers, ContigExtender trades speed for accuracy, allowing for better performance in regions with high sequencing error.
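The trade-off between depth and overlap length can be sketched as follows. This is only an illustration of the principle that lower coverage demands longer overlaps; it is not the scoring function used by ContigExtender. The linear interpolation and the values of `min_overlap`, `max_overlap`, and `saturating_depth` are assumptions made for this example.

```python
def required_overlap(depth, min_overlap=20, max_overlap=60, saturating_depth=10):
    """Overlap length (bp) demanded of the best read at a given read depth.

    Hypothetical rule: a single supporting read must overlap by `max_overlap`,
    and the requirement relaxes linearly to `min_overlap` at `saturating_depth`.
    """
    if depth <= 0:
        return None  # no supporting reads, nothing to extend with
    if depth >= saturating_depth:
        return min_overlap
    shortfall = (saturating_depth - depth) / (saturating_depth - 1)
    return round(min_overlap + shortfall * (max_overlap - min_overlap))


def accept_extension(overlap_lengths):
    """Accept if the longest overlap meets the depth-dependent requirement."""
    needed = required_overlap(len(overlap_lengths))
    return needed is not None and max(overlap_lengths) >= needed


print(accept_extension([65]))       # one read with a long overlap
print(accept_extension([30]))       # one read with a short overlap
print(accept_extension([25] * 10))  # many reads with modest overlaps
```

Under these assumed parameters, a lone read is accepted only with an overlap of at least 60 bp, whereas ten reads with 25 bp overlaps are accepted, so a coverage valley does not end the extension outright but raises the bar for the evidence required.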

These features may explain the favorable performance of ContigExtender relative to other contig extension tools. PRICE iteratively assembles proximal reads and fills gaps between contigs using paired-end relationships. Kollector recruits reads using progressive Bloom filters instead of alignment. GenSeed-HMM, in a similar process to ContigExtender, iteratively finds similar reads and extends contigs through assembly software. These tools have a common element in that they all utilize de Bruijn assemblers to generate a consensus sequence. When the input contig is the final output of a de novo assembly tool such as metaSPAdes, further assembly based on de Bruijn graphs is unlikely to succeed, as the factors which caused metaSPAdes to terminate extension remain in the data. Thus, computing the consensus sequence using the scoring function employed by ContigExtender is more likely to overcome these challenging regions.

The viral reference genome database is by no means a gold standard for evaluating contigs, because many viral species and strains are still unknown and not represented in the database. Therefore, some extended contigs cannot be aligned either to their originating genome or to a distant genome, and may thus be falsely designated as chimeric contigs in our analysis. The NIBSC data, however, can be evaluated more accurately, because the samples only include known virus standards.

As we have shown in Tables 3, 4, and 5, several extensions are the result of over-assembly or mis-assembly, as indicated by less than 100% alignment to reference genomes. We also observed multiple assemblies covering the same genomic regions, due to either sequencing errors or mixtures of multiple strains in metagenomic samples. Mis-assembly can happen quite often in the initial de novo assembly that generates seed contigs, and it is even more likely in contig extension, because contig extension pushes the limit of contig lengths by accepting a higher risk of mistakes. Our algorithm's scoring system favors the most probable extensions, and the balance between extension length and the probability of mis-assembly is adjustable. In practice, contig extension mis-assemblies or chimeric contigs can be identified by aligning extended contigs against known viral genomes, as we show in the results (Fig. 3 and Tables 3, 4, 5). For novel viruses, the identification of chimeric contigs cannot be achieved computationally, but it can be accomplished through PCR extension. All the mis-assemblies in our tests (Tables 3, 4, 5) are segments from the same genome. After careful examination of these mis-assemblies against reference genomes, we found that they are caused by extending contig ends with incorrect reads from a different region of the same genome.
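A simple way to screen for candidate mis-assemblies against a known reference is to flag extended contigs whose largest aligned segment covers too little of the contig. The sketch below applies this to a few rows of Table 5; the 95% cutoff and the function name are assumptions for illustration, not thresholds used in this study.

```python
# (contig ID, extended contig length in bp, largest aligned segment in bp), from Table 5
CONTIGS = [
    (1, 10034, 7493),
    (2, 8329, 7862),
    (3, 7676, 7678),
    (8, 2772, 2681),
    (10, 4637, 4444),
]


def flag_possible_misassemblies(contigs, min_aligned_fraction=0.95):
    """Return (contig ID, aligned %) for contigs below the assumed cutoff."""
    flagged = []
    for contig_id, length_bp, aligned_bp in contigs:
        fraction = aligned_bp / length_bp
        if fraction < min_aligned_fraction:
            flagged.append((contig_id, round(100 * fraction)))
    return flagged


print(flag_possible_misassemblies(CONTIGS))
```

With the assumed cutoff, contigs 1 and 2 (75% and 94% aligned) are flagged for manual inspection, while the others pass. Flagged contigs still need to be examined against the reference, since an unaligned portion may also reflect a strain that is missing from the database.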

The current software is optimized for and tested on viral metagenomes only, not bacterial or eukaryotic genomes. We speculate that our current version may not work well for other genomes for two reasons: 1) viral genomes contain considerably fewer repeats than other genomes; and 2) the sequencing dataset sizes for non-viral genomes are usually considerably larger, so the running time may require further optimization.

Conclusions

We have presented a new approach for enhancing the performance of de novo metagenomics assemblers. The proposed DFS branching algorithm allows multiple branching paths defined by different genotypes representing multiple strains. Our strategy automates the labor-intensive process of manually constructing viral genomes from the fragments produced by de novo assembly. With simulated and real-world animal and human metagenomics datasets, ContigExtender is demonstrated to be effective in improving upon both de novo assembly alone and de novo assembly combined with other extension tools. For contigs that are extendable, ContigExtender can accurately increase the contig length by several kb, which is significant for viral genomes. The software may also be incorporated into viral metagenomics analysis pipelines, with a variety of applications such as pathogen detection, viral discovery, clinical microbiology and environmental metagenomics. Thus, we believe that the use of our software will be of broad interest to researchers, epidemiologists, clinicians, and environmental biologists.

Availability and requirements

Project name: ContigExtender
Project home page: https://github.com/dengzac/contig-extender
Operating system(s): Linux, Windows
Programming language: Python 3.6 or higher
Other requirements: Bowtie2 2.3.5 or higher, Perl 5 or higher
License: GNU GPLv3
Any restrictions to use by non-academics: None
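The requirements listed above can be checked before running the tool with a short script such as the following. This is a hedged sketch, not part of ContigExtender: the helper names are hypothetical, and the parser assumes that `bowtie2 --version` prints a line containing the word "version" followed by a dotted version number.

```python
import re
import shutil
import subprocess
import sys


def parse_version(text):
    """Extract (major, minor, patch) from text such as '... version 2.3.5.1'."""
    match = re.search(r"version\s+(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    return tuple(int(part) if part else 0 for part in match.groups())


def check_requirements():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6 or higher is required")
    for tool in ("bowtie2", "perl"):
        if shutil.which(tool) is None:
            problems.append(tool + " was not found on PATH")
    bowtie2 = shutil.which("bowtie2")
    if bowtie2 is not None:
        result = subprocess.run(
            [bowtie2, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        version = parse_version(result.stdout)
        if version is None or version < (2, 3, 5):
            problems.append("Bowtie2 2.3.5 or higher is required")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("WARNING:", problem)
```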

Supplementary Information

12859_2021_4038_MOESM1_ESM.docx (20.5 KB, docx)

Additional file 1: Supplementary Table S1 shows a comparison of contigs produced by ContigExtender, PRICE, Kollector, GenSeed, and metaSPAdes from in silico reads of the BKV and HIV viruses. Supplementary Table S2 shows contig lengths produced by ContigExtender on simulated paired-end reads from pIRS (read length 100, error rate 0.05).

Acknowledgements

The authors would like to thank Eda Altan and Elizabeth Fahsbender for contributing sequencing data.

Abbreviations

mNGS: Metagenomic Next-Gen Sequencing
DFS: Depth-First Search

Authors' contributions

ED conceived the study and participated in the development of software and manuscript. ZD developed the software, performed analysis and participated in manuscript writing. All authors read and approved the final manuscript.

Funding

This project was supported in part by National Institutes of Health grant R01-HL-105770 from the National Heart, Lung and Blood Institute (NHLBI). The funding body was not involved in the design of the study, collection, analysis, and interpretation of data, or in writing the manuscript.

Availability of data and materials

The datasets analyzed during the current study are available in the Sequence Read Archive repository, at https://www.ncbi.nlm.nih.gov/sra. Accession numbers are PRJNA391715 and PRJNA530270 for mosquito samples and Amazon nasal swab samples respectively. The Amazon fecal samples are under accession numbers SRR6287056 to SRR6287135.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Zachary Deng, Email: dengzac@gmail.com.

Eric Delwart, Email: delwarte@medicine.ucsf.edu.

References

1. Delwart E. A roadmap to the human virome. PLoS Pathog. 2013;9:e1003146. doi: 10.1371/journal.ppat.1003146.
2. Chiu CY. Viral pathogen discovery. Curr Opin Microbiol. 2013;16:468–478. doi: 10.1016/j.mib.2013.05.001.
3. Houldcroft CJ, Beale MA, Breuer J. Clinical and biological insights from viral genome sequencing. Nat Rev Microbiol. 2017;15:183–192. doi: 10.1038/nrmicro.2016.182.
4. Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, Mikhailova N, et al. Uncovering Earth’s virome. Nature. 2016;536:425–430. doi: 10.1038/nature19094.
5. Carroll D, Daszak P, Wolfe ND, Gao GF, Morel CM, Morzaria S, et al. The global virome project. Science. 2018;359:872–874. doi: 10.1126/science.aap7463.
6. Chiu CY, Miller SA. Clinical metagenomics. Nat Rev Genet. 2019;20:341–355. doi: 10.1038/s41576-019-0113-7.
7. Huson DH, Mitra S. Introduction to the analysis of environmental sequences: metagenomics with MEGAN. Methods Mol Biol Clifton NJ. 2012;856:415–429. doi: 10.1007/978-1-61779-585-5_17.
8. Naccache SN, Federman S, Veeraraghavan N, Zaharia M, Lee D, Samayoa E, et al. A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Res. 2014;24:1180–1192. doi: 10.1101/gr.171934.113.
9. Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, et al. The metagenomics RAST server—a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinform. 2008;9:386. doi: 10.1186/1471-2105-9-386.
10. Clarke EL, Taylor LJ, Zhao C, Connell A, Lee J-J, Fett B, et al. Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments. Microbiome. 2019;7:46. doi: 10.1186/s40168-019-0658-x.
11. Tamames J, Puente-Sánchez F. SqueezeM, a fully automatic metagenomic analysis pipeline from reads to bins. bioRxiv: 347559 (2018).
12. Deng X, Naccache SN, Ng T, Federman S, Li L, Chiu CY, et al. An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data. Nucleic Acids Res. 2015;43:e46. doi: 10.1093/nar/gkv002.
13. Altan E, Dib JC, Gulloso AR, Juandigua DE, Deng X, Bruhn R, et al. Effect of geographic isolation on the nasal virome of indigenous children. J Virol. 2019;93:e00681-19.
14. Altan E, Kubiski SV, Burchell J, Bicknese E, Deng X, Delwart E. The first reptilian circovirus identified infects gut and liver tissues of black-headed pythons. Vet Res. 2019;50:35. doi: 10.1186/s13567-019-0653-z.
15. Altan E, Kubiski SV, Boros Á, Reuter G, Sadeghi M, Deng X, et al. A highly divergent picornavirus infecting the Gut Epithelia of Zebrafish (Danio rerio) in research institutions worldwide. Zebrafish. 2019;16:291–299. doi: 10.1089/zeb.2018.1710.
16. Brito F, Cordey S, Delwart E, Deng X, Tirefort D, Lemoine-Chaduc C, et al. Metagenomics analysis of the virome of 300 concentrates from a Swiss platelet bank. Vox Sang. 2018;113:601–4.
17. Kapusinszky B, Ardeshir A, Mulvaney U, Deng X, Delwart E. Case–control comparison of enteric viromes in captive rhesus macaques with acute or idiopathic chronic diarrhea. J Virol. 2017;91:e00952-17. doi: 10.1128/JVI.00952-17.
18. Kapusinszky B, Mulvaney U, Jasinska AJ, Deng X, Freimer N, Delwart E. Local virus extinctions following a host population bottleneck. J Virol. 2015;89:8152–8161. doi: 10.1128/JVI.00671-15.
19. Ng TFF, Chen L-F, Zhou Y, Shapiro B, Stiller M, Heintzman PD, et al. Preservation of viral genomes in 700-y-old caribou feces from a subarctic ice patch. Proc Natl Acad Sci USA. 2014;111:16842–16847. doi: 10.1073/pnas.1410429111.
20. Phan TG, Dreno B, da Costa AC, Li L, Orlandi P, Deng X, et al. A new protoparvovirus in human fecal samples and cutaneous T cell lymphomas (mycosis fungoides). Virology. 2016;496:299–305. doi: 10.1016/j.virol.2016.06.013.
21. Sadeghi M, Altan E, Deng X, Barker CM, Fang Y, Coffey LL, et al. Virome of > 12 thousand Culex mosquitoes from throughout California. Virology. 2018;523:74–88. doi: 10.1016/j.virol.2018.07.029.
22. Phan TG, da Costa AC, Zhang W, Pothier P, Ambert-Balay K, Deng X, et al. A new gyrovirus in human feces. Virus Genes. 2015;51:132–135. doi: 10.1007/s11262-015-1210-0.
23. Ayling M, Clark MD, Leggett RM. New approaches for metagenome assembly with short reads. Brief Bioinform. 2020. doi: 10.1093/bib/bbz020.
24. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol J Comput Mol Cell Biol. 2012;19:455–477. doi: 10.1089/cmb.2012.0021.
25. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27:824–834. doi: 10.1101/gr.213959.116.
26. Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinform Oxf Engl. 2012;28:1420–1428. doi: 10.1093/bioinformatics/bts174.
27. Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40:e155. doi: 10.1093/nar/gks678.
28. Afiahayati, Sato K, Sakakibara Y. MetaVelvet-SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning. DNA Res Int J Rapid Publ Rep Genes Genomes. 2015;22:69–77. doi: 10.1093/dnares/dsu041.
29. Li D, Luo R, Liu C-M, Leung C-M, Ting H-F, Sadakane K, et al. MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods San Diego Calif. 2016;102:3–11. doi: 10.1016/j.ymeth.2016.02.020.
30. Li D, Huang Y, Leung C-M, Luo R, Ting H-F, Lam T-W. MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs. BMC Bioinform. 2017;18(Suppl 12):408. doi: 10.1186/s12859-017-1825-3.
31. Boisvert S, Raymond F, Godzaridis E, Laviolette F, Corbeil J. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol. 2012;13:R122. doi: 10.1186/gb-2012-13-12-r122.
32. Ruby JG, Bellare P, Derisi JL. PRICE: software for the targeted assembly of components of (Meta) genomic sequence data. G3 Bethesda Md. 2013;3:865–880. doi: 10.1534/g3.113.005967.
33. Wang Q, Fish JA, Gilman M, Sun Y, Brown CT, Tiedje JM, et al. Xander: employing a novel method for efficient gene-targeted metagenomic assembly. Microbiome. 2015;3:32. doi: 10.1186/s40168-015-0093-6.
34. Baaijens JA, Aabidine AZE, Rivals E, Schönhuth A. De novo assembly of viral quasispecies using overlap graphs. Genome Res. 2017;27:835–848. doi: 10.1101/gr.215038.116.
35. Gregor I, Schönhuth A, McHardy AC. Snowball: strain aware gene assembly of metagenomes. Bioinform Oxf Engl. 2016;32:i649–i657. doi: 10.1093/bioinformatics/btw426.
36. Afiahayati, Sato K, Sakakibara Y. An extended genovo metagenomic assembler by incorporating paired-end information. PeerJ. 2013;1:e196. doi: 10.7717/peerj.196.
37. Lin Y-Y, Hsieh C-H, Chen J-H, Lu X, Kao J-H, Chen P-J, et al. De novo assembly of highly polymorphic metagenomic data using in situ generated reference sequences and a novel BLAST-based assembly pipeline. BMC Bioinform. 2017;18:223. doi: 10.1186/s12859-017-1630-z.
38. Hunt M, Gall A, Ong SH, Brener J, Ferns B, Goulder P, et al. IVA: accurate de novo assembly of RNA virus genomes. Bioinform Oxf Engl. 2015;31:2374–2376. doi: 10.1093/bioinformatics/btv120.
39. Yang X, Charlebois P, Gnerre S, Coole MG, Lennon NJ, Levin JZ, et al. De novo assembly of highly diverse viral populations. BMC Genomics. 2012;13:475. doi: 10.1186/1471-2164-13-475.
40. McGinnis S, Madden TL. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 2004;32(Web Server issue):W20–25. doi: 10.1093/nar/gkh435.
41. Grard G, Fair JN, Lee D, Slikas E, Steffen I, Muyembe J-J, et al. A novel rhabdovirus associated with acute hemorrhagic fever in Central Africa. PLoS Pathog. 2012. doi: 10.1371/journal.ppat.1002924.
42. Hu X, Yuan J, Shi Y, Lu J, Liu B, Li Z, et al. pIRS: Profile-based Illumina pair-end reads simulator. Bioinformatics. 2012;28:1533–1535. doi: 10.1093/bioinformatics/bts187.
43. Li L, Deng X, Mee ET, Collot-Teixeira S, Anderson R, Schepelmann S, et al. Comparing viral metagenomics methods using a highly multiplexed human viral pathogens reagent. J Virol Methods. 2015;213:139–146. doi: 10.1016/j.jviromet.2014.12.002.
44. Sadeghi M, Kapusinszky B, Yugo DM, Phan TG, Deng X, Kanevsky I, et al. Virome of US bovine calf serum. Biol J Int Assoc Biol Stand. 2017;46:64–67. doi: 10.1016/j.biologicals.2016.12.009.
45. Siqueira JD, Dominguez-Bello MG, Contreras M, Lander O, Caballero-Arias H, Xutao D, et al. Complex virome in feces from Amerindian children in isolated Amazonian villages. Nat Commun. 2018;9:1–11. doi: 10.1038/s41467-017-02088-w.
46. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
47. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinform Oxf Engl. 2011;27:863–864. doi: 10.1093/bioinformatics/btr026.
48. Kucuk E, Chu J, Vandervalk BP, Hammond SA, Warren RL, Birol I. Kollector: transcript-informed, targeted de novo assembly of gene loci. Bioinform Oxf Engl. 2017;33:1782–1788. doi: 10.1093/bioinformatics/btx078.
49. Alves JMP, de Oliveira AL, Sandberg TOM, Moreno-Gallego JL, de Toledo MAF, de Moura EMM, et al. GenSeed-HMM: a tool for progressive assembly using profile HMMs as Seeds and its application in alpavirinae viral discovery from metagenomic data. Front Microbiol. 2016;7:269. doi: 10.3389/fmicb.2016.00269.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

12859_2021_4038_MOESM1_ESM.docx^{(20.5KB, docx)}

Data Availability Statement

[CR1] 1.Delwart E. A roadmap to the human virome. PLoS Pathog. 2013;9:e1003146. doi: 10.1371/journal.ppat.1003146. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.Chiu CY. Viral pathogen discovery. Curr Opin Microbiol. 2013;16:468–478. doi: 10.1016/j.mib.2013.05.001. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Houldcroft CJ, Beale MA, Breuer J. Clinical and biological insights from viral genome sequencing. Nat Rev Microbiol. 2017;15:183–192. doi: 10.1038/nrmicro.2016.182. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR4] 4.Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, Mikhailova N, et al. Uncovering Earth’s virome. Nature. 2016;536:425–430. doi: 10.1038/nature19094. [DOI] [PubMed] [Google Scholar]

[CR5] 5.Carroll D, Daszak P, Wolfe ND, Gao GF, Morel CM, Morzaria S, et al. The global virome project. Science. 2018;359:872–874. doi: 10.1126/science.aap7463. [DOI] [PubMed] [Google Scholar]

[CR6] 6.Chiu CY, Miller SA. Clinical metagenomics. Nat Rev Genet. 2019;20:341–355. doi: 10.1038/s41576-019-0113-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Huson DH, Mitra S. Introduction to the analysis of environmental sequences: metagenomics with MEGAN. Methods Mol Biol Clifton NJ. 2012;856:415–429. doi: 10.1007/978-1-61779-585-5_17. [DOI] [PubMed] [Google Scholar]

[CR8] 8.Naccache SN, Federman S, Veeraraghavan N, Zaharia M, Lee D, Samayoa E, et al. A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples. Genome Res. 2014;24:1180–1192. doi: 10.1101/gr.171934.113. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, et al. The metagenomics RAST server—a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinform. 2008;9:386. doi: 10.1186/1471-2105-9-386. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Clarke EL, Taylor LJ, Zhao C, Connell A, Lee J-J, Fett B, et al. Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments. Microbiome. 2019;7:46. doi: 10.1186/s40168-019-0658-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Tamames J, Puente-Sánchez F. SqueezeM, a fully automatic metagenomic analysis pipeline from reads to bins. bioRxiv: 347559 (2018). [DOI] [PMC free article] [PubMed]

[CR12] 12.Deng X, Naccache SN, Ng T, Federman S, Li L, Chiu CY, et al. An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data. Nucleic Acids Res. 2015;43:e46. doi: 10.1093/nar/gkv002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Altan E, Dib JC, Gulloso AR, Juandigua DE, Deng X, Bruhn R, et al. Effect of geographic isolation on the nasal virome of indigenous children. J Virol 2019;93:e00681-19. [DOI] [PMC free article] [PubMed]

[CR14] 14.Altan E, K SV, ubiski, Burchell J, Bicknese E, Deng X, Delwart E. The first reptilian circovirus identified infects gut and liver tissues of black-headed pythons. Vet Res. 2019;50:35. doi: 10.1186/s13567-019-0653-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Altan E, Kubiski SV, Boros Á, Reuter G, Sadeghi M, Deng X, et al. A highly divergent picornavirus infecting the Gut Epithelia of Zebrafish (Danio rerio) in research institutions worldwide. Zebrafish. 2019;16:291–299. doi: 10.1089/zeb.2018.1710. [DOI] [PubMed] [Google Scholar]

[CR16] 16.Brito F, Cordey S, Delwart E, Deng X, Tirefort D, Lemoine-Chaduc C, et al. Metagenomics analysis of the virome of 300 concentrates from a Swiss platelet bank. Vox Sang. 2018;113:601–4. [DOI] [PMC free article] [PubMed]

[CR17] 17.Kapusinszky B, Ardeshir A, Mulvaney U, Deng X, Delwart E. Case–control comparison of enteric viromes in captive rhesus macaques with acute or idiopathic chronic diarrhea. J Virol. 2017;91:e00952-17. doi: 10.1128/JVI.00952-17. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.Kapusinszky B, Mulvaney U, Jasinska AJ, Deng X, Freimer N, Delwart E. Local virus extinctions following a host population bottleneck. J Virol. 2015;89:8152–8161. doi: 10.1128/JVI.00671-15. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR19] 19.Ng TFF, Chen L-F, Zhou Y, Shapiro B, Stiller M, Heintzman PD, et al. Preservation of viral genomes in 700-y-old caribou feces from a subarctic ice patch. Proc Natl Acad Sci USA. 2014;111:16842–16847. doi: 10.1073/pnas.1410429111. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR20] 20.Phan TG, Dreno B, da Costa AC, Li L, Orlandi P, Deng X, et al. A new protoparvovirus in human fecal samples and cutaneous T cell lymphomas (mycosis fungoides) Virology. 2016;496:299–305. doi: 10.1016/j.virol.2016.06.013. [DOI] [PubMed] [Google Scholar]

[CR21] 21.Sadeghi M, Altan E, Deng X, Barker CM, Fang Y, Coffey LL, et al. Virome of > 12 thousand Culex mosquitoes from throughout California. Virology. 2018;523:74–88. doi: 10.1016/j.virol.2018.07.029. [DOI] [PubMed] [Google Scholar]

[CR22] 22.Phan TG, da Costa AC, Zhang W, Pothier P, Ambert-Balay K, Deng X, et al. A new gyrovirus in human feces. Virus Genes. 2015;51:132–135. doi: 10.1007/s11262-015-1210-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR23] 23.Ayling M, Clark MD, Leggett RM. New approaches for metagenome assembly with short reads. Brief Bioinform. 2020 doi: 10.1093/bib/bbz020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR24] 24.Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol J Comput Mol Cell Biol. 2012;19:455–477. doi: 10.1089/cmb.2012.0021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR25] 25.Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27:824–834. doi: 10.1101/gr.213959.116. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR26] 26.Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinform Oxf Engl. 2012;28:1420–1428. doi: 10.1093/bioinformatics/bts174. [DOI] [PubMed] [Google Scholar]

[CR27] 27.Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40:e155. doi: 10.1093/nar/gks678. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR28] 28.Afiahayati, Sato K, Sakakibara Y. MetaVelvet-SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning. DNA Res Int J Rapid Publ Rep Genes Genomes. 2015;22:69–77. doi: 10.1093/dnares/dsu041. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR29] 29.Li D, Luo R, Liu C-M, Leung C-M, Ting H-F, Sadakane K, et al. MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods San Diego Calif. 2016;102:3–11. doi: 10.1016/j.ymeth.2016.02.020. [DOI] [PubMed] [Google Scholar]

[CR30] 30.Li D, Huang Y, Leung C-M, Luo R, Ting H-F, Lam T-W. MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs. BMC Bioinform. 2017;18(Suppl 12):408. doi: 10.1186/s12859-017-1825-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR31] 31.Boisvert S, Raymond F, Godzaridis E, Laviolette F, Corbeil J. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol. 2012;13:R122. doi: 10.1186/gb-2012-13-12-r122. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.Ruby JG, Bellare P, Derisi JL. PRICE: software for the targeted assembly of components of (Meta) genomic sequence data. G3 Bethesda Md. 2013;3:865–880. doi: 10.1534/g3.113.005967. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR33] 33.Wang Q, Fish JA, Gilman M, Sun Y, Brown CT, Tiedje JM, et al. Xander: employing a novel method for efficient gene-targeted metagenomic assembly. Microbiome. 2015;3:32. doi: 10.1186/s40168-015-0093-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR34] 34.Baaijens JA, Aabidine AZE, Rivals E, Schönhuth A. De novo assembly of viral quasispecies using overlap graphs. Genome Res. 2017;27:835–848. doi: 10.1101/gr.215038.116. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR35] 35.Gregor I, Schönhuth A, McHardy AC. Snowball: strain aware gene assembly of metagenomes. Bioinform Oxf Engl. 2016;32:i649–i657. doi: 10.1093/bioinformatics/btw426. [DOI] [PubMed] [Google Scholar]

[CR36] 36.Afiahayati, Sato K, Sakakibara Y. An extended genovo metagenomic assembler by incorporating paired-end information. PeerJ. 2013;1:e196. doi: 10.7717/peerj.196. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR37] 37.Lin Y-Y, Hsieh C-H, Chen J-H, Lu X, Kao J-H, Chen P-J, et al. De novo assembly of highly polymorphic metagenomic data using in situ generated reference sequences and a novel BLAST-based assembly pipeline. BMC Bioinform. 2017;18:223. doi: 10.1186/s12859-017-1630-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR38] 38.Hunt M, Gall A, Ong SH, Brener J, Ferns B, Goulder P, et al. IVA: accurate de novo assembly of RNA virus genomes. Bioinform Oxf Engl. 2015;31:2374–2376. doi: 10.1093/bioinformatics/btv120. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR39] 39.Yang X, Charlebois P, Gnerre S, Coole MG, Lennon NJ, Levin JZ, et al. De novo assembly of highly diverse viral populations. BMC Genomics. 2012;13:475. doi: 10.1186/1471-2164-13-475. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR40] 40.McGinnis S, Madden TL. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 2004;32(Web Server issue):W20–25. doi: 10.1093/nar/gkh435. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR41] 41.Grard G, Fair JN, Lee D, Slikas E, Steffen I, Muyembe J-J, et al. A novel rhabdovirus associated with acute hemorrhagic fever in Central Africa. PLoS Pathog. 2012 doi: 10.1371/journal.ppat.1002924. [DOI] [PMC free article] [PubMed] [Google Scholar]

42. Hu X, Yuan J, Shi Y, Lu J, Liu B, Li Z, et al. pIRS: Profile-based Illumina pair-end reads simulator. Bioinformatics. 2012;28:1533–1535. doi: 10.1093/bioinformatics/bts187.

43. Li L, Deng X, Mee ET, Collot-Teixeira S, Anderson R, Schepelmann S, et al. Comparing viral metagenomics methods using a highly multiplexed human viral pathogens reagent. J Virol Methods. 2015;213:139–146. doi: 10.1016/j.jviromet.2014.12.002.

44. Sadeghi M, Kapusinszky B, Yugo DM, Phan TG, Deng X, Kanevsky I, et al. Virome of US bovine calf serum. Biol J Int Assoc Biol Stand. 2017;46:64–67. doi: 10.1016/j.biologicals.2016.12.009.

45. Siqueira JD, Dominguez-Bello MG, Contreras M, Lander O, Caballero-Arias H, Xutao D, et al. Complex virome in feces from Amerindian children in isolated Amazonian villages. Nat Commun. 2018;9:1–11. doi: 10.1038/s41467-017-02088-w.

46. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.

47. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinform Oxf Engl. 2011;27:863–864. doi: 10.1093/bioinformatics/btr026.

48. Kucuk E, Chu J, Vandervalk BP, Hammond SA, Warren RL, Birol I. Kollector: transcript-informed, targeted de novo assembly of gene loci. Bioinform Oxf Engl. 2017;33:1782–1788. doi: 10.1093/bioinformatics/btx078.

49. Alves JMP, de Oliveira AL, Sandberg TOM, Moreno-Gallego JL, de Toledo MAF, de Moura EMM, et al. GenSeed-HMM: a tool for progressive assembly using profile HMMs as seeds and its application in Alpavirinae viral discovery from metagenomic data. Front Microbiol. 2016;7:269. doi: 10.3389/fmicb.2016.00269.
