Summary
Long-read sequencing techniques, such as the Oxford Nanopore Technology, can generate reads that are tens of kilobases in length and are therefore particularly relevant for microbiome studies. However, owing to the higher per-base error rates than typical short-read sequencing, the application of long-read sequencing on microbiomes remains largely unexplored. Here we deeply sequenced two human microbiota mock community samples (HM-276D and HM-277D) from the Human Microbiome Project. We showed that assembly programs consistently achieved high accuracy (∼99%) and completeness (∼99%) for bacterial strains with adequate coverage. We also found that long-read sequencing provides accurate estimates of species-level abundance (R = 0.94 for 20 bacteria with abundance ranging from 0.005% to 64%). Our results not only demonstrate the feasibility of characterizing complete microbial genomes and populations from error-prone Nanopore sequencing data but also highlight necessary bioinformatics improvements for future metagenomics tool development.
Subject Areas: Microbiology, Microbial Genomics, Microbiome
Graphical Abstract
Highlights
-
•
First set of Nanopore long-read data on two NIH designated reference samples
-
•
Long-read assemblers achieved high accuracy (∼99%) and completeness (∼99%)
-
•
Nanopore sequencing provides accurate taxonomic profiling and binning
-
•
We highlighted challenges in long-read data for metagenomics tool development
Microbiology; Microbial Genomics; Microbiome
Background
The fundamental importance of microbiota as the microbial communities that reside in human body is increasingly recognized. Over the past decade, there have been tremendous amounts of evidence suggesting that microbiota plays a crucial role in human health through modulating the metabolic functions, as well as food energy harvest and storage. Microbiota, especially the gut microbiota, is associated with many chronic diseases such as obesity, diabetes, metabolic syndrome, inflammatory bowel disease (IBD), irritable bowel syndrome (IBS), liver disease, and hepatocellular and colorectal carcinoma (Gill et al., 2006, Lewis et al., 2015, Chehoud et al., 2015, Hooper et al., 2003, Jones et al., 2015, Ley et al., 2006, Liang et al., 2015, Sartor, 2008, Schauber et al., 2003, Turnbaugh et al., 2009, Wang et al., 2016, Wen et al., 2008, Wu et al., 2011, Group et al., 2009). Therefore, accurate profiling of complete genomes and population is crucial to understanding the impact of microbiota on human health. Currently, high-throughput sequencing technologies have been widely used in microbial community characterization. In particular, 16S ribosomal RNA (rRNA) (Janda and Abbott, 2007) and shotgun metagenome sequencing on Illumina platforms (Quince et al., 2017) are two dominant approaches for describing microbiomes. Overall, the high-throughput nature of metagenomics sequencing allows us to interpret microbial community by using computational approaches such as operational taxonomic unit (OTU) identification (Hao and Chen, 2012), abundance quantification (Chen et al., 2017), read assembly (Ruan and Li, 2019, Bertrand et al., 2019, Koren et al., 2017, Kolmogorov et al., 2019, Li et al., 2015), and binning and taxonomic profiling (Gregor et al., 2016, Huson et al., 2016, Huson et al., 2018, Francis et al., 2013, Hong et al., 2014, Byrd et al., 2014). Specifically, 16S rRNA sequencing targets on very specific regions that are highly variable between species, which is much cost-efficient. This is very useful for us to examine and compare the microbiota across a high number of samples in a large-scale project. However, this technique can only identify bacteria but not viruses or fungi, and the low resolution limits its usage in microbiome study below the genus level. As opposed to only the 16S sequences, shotgun metagenome sequencing surveys the whole genomes of all organism in the community (Jovel et al., 2016, Laudadio et al., 2018, Ranjan et al., 2016). It allows us to perform deep investigation of the microbial community as its ability to capture sequences from all organisms.
Despite the theoretical advantage of shotgun metagenome sequencing, owing to the short read length (150–300 nucleotides), metagenomes cannot be fully characterized by next-generation sequencing (NGS) data. In addition, the lack of contextual information has become a barrier for short read to span both intra- and intergenomic repeats, which is crucial for complete de novo genome assembly of all dominant species in a microbial community. As a consequence, short-read assemblies remain highly fragmented. In comparison, the use of long-read sequencing has the potential to facilitate the complete and contiguous metagenome assembly. Lee et al. (2014) sequenced a reference mock community sample using PacBio long read and evaluated the metagenome assembly performance. Results showed that single-molecule real-time (SMRT) long-read data offered significantly improved assembly contiguity by spanning many of repetitive regions, whereas single bacterial chromosome was assembled to more than 50 contigs based on short-read data. In recent years, the Oxford Nanopore technologies (ONTs) have offered advantages over traditional short-read NGS technologies in genome study. This single-molecule sequencing platform is able to generate average read length of >10 kbp, spanning low complexity and repetitive genomic regions, which provides much more continuous assemblies. Subsequently, this approach has become an attractive option in metagenomics sequencing. Although the ONTs have great potential, complete and contiguous de novo metagenome assembly is still constrained by the high error rate (∼15%) of single-molecule long-read sequence data (Sczyrba et al., 2017). Therefore, a comprehensive evaluation of long-read bioinformatics tools in microbial profiling is needed (Mason et al., 2017). Nicholls et al. (2019) presented Nanopore sequencing datasets of two mock communities with 10 microbial species from ZymoBIOMICS (McIntyre et al., 2019). They showed the utility of these datasets for future bioinformatics method development for long-read metagenomics. However, publicly available datasets based other sequencing technologies of these samples are limited as the samples are only commercially available and are not well studied so far by competing approaches. A study to evaluate the advantages of Nanopore sequencing in complete microbial genomes and a comparison over other sequencing technologies is still lacking so far.
In this article, we generated two deeply sequenced Nanopore datasets from new reference samples that are more commonly studied and performed comprehensive analysis to compare microbial community profiling performance with PacBio and Illumina technologies. We first generated 525× coverage data on HM-276D mock community sample from Human Microbiome Project, which is an evenly mixed DNA sample of 20 bacterial strains (each with 5% abundance). We performed de novo assembly analysis with four long-read assemblers at different depth of coverage. Twenty bacterial genomes were assembled with high accuracy and genome completeness. This sample also has been well studied by many groups. As mentioned above, Lee et al. (2014) sequenced this mock community with PacBio to show the improvement of long-read data in metagenome assembly analysis. Jones et al.(2015) compared the influence of different NGS platforms on genomic and functional predictions using HM-276D sample. We downloaded these two datasets and compared the performance with Nanopore data. Our results show that Nanopore improved assembly contiguity compared with PacBio and Illumina across computational approaches. Next, we sequenced HM-277D Mock Community sample with 1,068× coverage. HM-277D is unevenly mixed DNA sample of 20 bacterial strains. Kuleshove et al. (2016) sequenced this sample with Illumina TruSeq synthetic long-read technique and showed the improvement in bacterial species identification, genome reconstruction compared with short sequences. Also, Leggett et al., 2020 demonstrated Nanopore metagenomics sequence can be reliably classified using this community. In addition to metagenome assembly, we evaluated taxonomy binning and profiling performance across technologies (Nanopore and PacBio) and samples (HM-276D and HM-277D). High identification and classification accuracy were achieved above the species level. Overall, we demonstrate the technical feasibility to characterize complete microbial genomes and populations from error-prone Nanopore sequencing without any DNA amplification. We also discuss the limitations of current bioinformatics tools, when dealing with error-prone long-read metagenomics sequencing data. All our data are made publicly available, to benefit computational tool development on long-read-based microbial genome assembly for metagenomics studies.
Results
Sequence Data Quality
HM-276D DNA sample includes 20 evenly mixed bacterial strains with reference genome size 70 Mb in total with 39 chromosomes. A total of 11,610,183 reads with 35,578,375,166 bases (525× coverage depth) were generated on the Nanopore GridION platform, with a median length of 1,374 bp. The N50 length is 6,828 bp, and median read quality is 9.39 in Phred scale. By using minimap2, 95% of reads were successfully aligned to reference genomes of 20 bacterial strains with 13.1% error rate (Table 1). As shown in Figure 1A, read coverage across 20 bacterial strains has good agreement with known abundances. Read depth is relatively homogeneous across bacterial strains with 521.9X (sd = 524.7X) in average. Sequencing depth of each strain is at least 150 reads and only 0.03% region is covered by less than 3 reads.
Table 1.
Mapping Statistics | HM-276D | HM-277D |
---|---|---|
# of reads | 8,086,684 | 18,254,839 |
# of mapped reads | 7,640,934 | 18,110,317 |
Reads unmapped | 445,750 | 144,522 |
Reads MQ0 | 60,972 | 103,601 |
Non-primary alignments | 287,369 | 732,671 |
Total length | 33,563,573,383 | 72,312,638,112 |
Bases mapped | 32,143,689,158 | 72,216,146,980 |
Bases mapped (cigar) | 31,156,025,998 | 70,073,211,829 |
Mismatches | 4,104,593,752 | 6,925,222,080 |
Average length | 4,150 | 3,961 |
Maximum length | 472,762 | 214,792 |
Average Phred quality per base | 13 | 17 |
Sequenced data were mapped against reference genomes of 20 known bacterial strains. Sequences indicate the number of QC passed reads. Number of mapped and unmapped reads were summarized. MQ0 represents number of mapped reads with MQ = 0. Clipping was ignored when calculating total length, bases mapped. Bases mapped (cigar) provides a more accurate number of mapped bases. Number of mismatches were obtained from NM field of BAM file.
HM-277D DNA sample includes 20 unevenly mixed bacterial strains. A total of 18,254,839 reads dataset with 72,312,638,112 bases (1,068× coverage depth) were generated, leading to 2,065 bp in median read length with 10.12 median read quality. The N50 length is 7,857 bp; 99.2% of QC-passed reads were mapped to the reference genome and the error rate was 9.8% (Table 1). As shown in Figure 1B, read distribution is more heterogeneous across strains due to unevenly mixed samples. The average coverage is 988.8 reads with standard deviation 1941.6 bp. This leads to 1.6% of region with less than three reads covered and four strains with sequencing depth less than 10 bp, which makes it more difficult for biological interpretation of this microbial community (Figures 1C and 1D).
De Novo Assembly of HM-276D Mock Community
To assess the ability of Nanopore sequencing in profiling microbial community, we first conducted a de novo assembly of dataset with 525× coverage from HM-276D mock community using four assemblers: wtdbg2 (Ruan and Li, 2019), OPERA-MS (Bertrand et al., 2019), Canu (Koren et al., 2017), and meta-flye (Kolmogorov et al., 2019). OPERA-MS and meta-flye are designed to be capable of handling metagenome data, whereas wtdbg2 and canu are broadly used for haploid or diploid genomes. Overall, the results show promise for the characterization of microbial genomes using long-read sequencing data. Canu produced the largest assembly of 69.5 Mb (99.3% of the benchmark data), including 83 contigs with contig N50 length of 3.91 Mb. meta-flye assembled 67.7 Mb genome with 89 contigs. wtdbg2 generated similar results with 64.9 Mb genome size, 61 contigs, and 2.97 Mb N50 length. Assembly metrics of OPERA-MS (67.9 Mb genome size, 4,734 contigs with contig N50 length of 2.94 Mb) are similar with Canu and wtdbg2, whereas much more contigs were generated because OPERA-MS utilizes both long and short sequencing reads for assembly. In addition, for aligned blocks, meta-flye yielded the highest NA50 with 1.71Mb in length compared with other assemblers (wtdbg2: 1.2Mb, OPERA-MS: 1.21Mb, Canu: 1.4Mb). Furthermore, by mapping all contigs to the reference genomes using MUMMer v3.23, we assessed the accuracy and genome completeness of contigs produced by four assemblers. As shown in Figure 2A, meta-flye achieved the highest genome fraction (99.99%) and one-to-one identity percentage (99.62%), followed by OPERA-MS (genome fraction: 99.98% and accuracy 99.92%), Canu (genome fraction 99.81% and accuracy 99.4%), and wtdbg2 (genome fraction 96.02% and accuracy 98.60%). Moreover, we evaluated aligned blocks for each method based on NA50 length. As shown in Table S1, meta-flye achieved the highest NA50 with 1.71 Mb in length compared with other assemblers (wtdbg2: 1.2Mb, OPERA-MS: 1.21Mb, Canu: 1.4Mb). Overall, four tools generated results with similar good quality in term of contiguity, accuracy, and completeness using long-read data with evenly mixed samples at 525× coverage depth.
Next, we subsampled 525× dataset to 365× (70%), 160× (30%), 80× (15%), 40× (7.5%), and 20× (3.75%) to examine the effect of sequencing depths on de novo assembly (Figure 2A, Table S1). The assembly results of four tools ranges 95.95%–99.96% in consensus accuracy and 91.26%–99.99% in genome fraction. In specific, OPERA-MS outperforms others with the highest and most consistent metrics for completeness and accuracy across different sequencing depths because its metagenomics design substantially improves the robustness to low sequencing depth, where genome fractions are 99.68% in average (sd = 0.61%) and consensus identities are 99.92% in average (sd = 0.05%). In spite of reduced metrics as the sequencing depth becomes lower, meta-flye and Canu still recovered at least 96.8% genomes with 98.5% accuracy. Notably, wtdbg2 improved the assembly metrics with coverage depth reduced from 365× to 80×. In addition, we examined whether genomes of 20 bacterial strains can be better constructed with Nanopore sequencing technology compared with PacBio and Illumina. As shown in Figure 2B, assemblers using Nanopore sequenced data outperforms other two technologies. With the same assembler, on average, the number of contigs of Nanopore is ∼30% lower than that of PacBio; genome fraction and genome size are 1.56% and 3.1 Mb higher, respectively. To understand the reason, we compared read length characteristics between these two datasets. For N50 length, ONT (7,350 bp) is 15% longer than PacBio (6,357 bp), and for N05 length, ONT (35.9 kbp) is even 159% longer than PacBio (13.8 kbp). This indicates that read length is the main advantage for ONT. Therefore, as shown in Figure 2B, N50 length of ONT (13.3 Mbp) is 68% longer than PacBio (8.1 Mbp). Assemblies using Illumina sequenced data have 99.9% accuracy, but more contigs generated and lower genome size in total compared with Nanopore.
De Novo Assembly of HM-277D Mock Community
To evaluate the metagenome reconstruction in a more realistic setting, we carried out another de novo assembly of 1,068× dataset from HM-277D Mock Community, with unevenly mixed DNA samples of the 20 bacterial strains (Figure S6). Assembly accuracy still remains high, ranging from 97.78% to 99.75% across tools. However, not surprisingly, genome fractions and genome sizes of all methods are substantially lower than even community. This is because 13 bacterial strains have extremely low abundances (<1%) in this unevenly mixed samples, leading to reduced genome coverage fractions (Canu: 71.68%, OPERA-MS: 71.25%, meta-flye: 91.57%, wtdbg2: 82.95%) and genome sizes (Canu: 50.21 Mb, OPERA-MS: 47.99 Mb, meta-flye: 64.12 Mb, wtdbg2: 61.75 Mb). To assess how strain abundance affects assemblies, we calculated strain-specific genome fraction for each tool. Across bacterial strains, meta-flye recovered the highest percentage of genome (median 100%), followed by OPERA-MS (median: 98.75%), Canu (median 94.78%), and wtdbg2 (median: 91.66%) (Figure 2C). For bacteria with relative abundance higher than 0.2%, least 99.99% of reference genome can be covered by assembly contigs (meta-flye), with identity consensus reaching to 99.93%. These results suggest that bacterial strain with nontrivial abundance can be accurately assembled with Nanopore sequenced data. Overall, we observed that meta-flye returned assemblies for 20 bacterial strains with the best performance in completeness and accuracy. Metric for each strain is correlated with abundance of the corresponding bacteria. Some strains were proved hard to assemble for all assemblers due to extremely low relative abundance. For example, 13.6% of region of Enterococcus faecalis (0.011% relative abundance) were covered by 0 or 1 read and 56.1% covered by less than 3 reads, leading to 4.47% genome fraction for meta-flye. Moreover, there were 2 contigs that belong to two different bacterial species, Bacteroides vulgatus (0.19% relative abundance) and Streptococcus pneumoniae (0.05% relative abundance), indicating the difficulty in differentiating one bacterium from another with low relative abundance.
Taxon Binning and Identification
Metagenome assemblers construct contigs with variable length to recover original genome of each bacteria from microbial community. Subsequently, another major challenge in studying the identity and diversity of this community member is to classify sequenced reads or contigs correctly according to their taxonomic origins. Here we investigated the taxonomic binning performance based on three scenarios of long-read sequencing data, HM-276D (Nanopore, PacBio) and HM-277D (Nanopore) at 160× depth of coverage, using a state-of-art taxonomic binner Megan-LR. First, all long reads were aligned to NCBI-nr database. Then, we used Megan-LR with interval-union LCA algorithm to assign ∼2 million aligned reads (∼4.6 Mb bases) to taxonomic nodes (Figures 3A, 3B, and Figures S7–S10). Overall, 4.22 Mb (0.087%) from Nanopore data of HM-276D sample were mis-assigned, whereas 4.37 Mb (0.075%) and 4.66 Mb (0.141%) for Nanopore data of HM-277D and PacBio data of HM-276D, respectively. Specifically, we evaluated the recovery of taxon bins at different ranks. We considered two metrics to quantify the read assignment accuracy, average precision, and sensitivity of 20 bacterial strains. For each taxonomic bin, we obtained precision by calculating the percentage of reads correctly classified out of all binned reads. Sensitivity is the percentage of correctly assigned reads out of all reads originally from the bin. As shown in Figure 3C, HM-276D (Nanopore) has the highest precision, which are all above 60% from phylum to genus. HM-277D (Nanopore) followed, with all above 50%, whereas HM-276D (PacBio) has the lowest average precision due to predicted small false-positive bins at the species level. Sensitivity has a similar pattern (Figure 3D). HM-276D (Nanopore) still appears to be the best dataset for read classification than the other two, and the difference in accuracy between these three scenarios is similar across ranks. Nanopore is ∼8% higher than PacBio and HM-276D is 10% higher than HM-277D. To evaluate the stability of read assignment accuracy, we calculated 95% confidence interval of precision and sensitivity for each scenario at each rank. Not surprisingly, confidence bands are narrower at higher rank, indicating that more taxon recovery accuracy can be reached. Owing to unevenly mixed bacterial strains, sensitivity is much more variable for HM-277D than other HM-276D. Overall, these results demonstrated the advantage of long-read data in accurate taxon recovery above the family level, whereas binning accuracy and stability were relatively at the species level.
In addition to assigning sequence fragments (reads or contigs) to taxon bins, we recognized the importance of accurate determination of taxonomic identity presence or absence from microbial community. Therefore, we continued to investigate the performance of taxonomic identity prediction between data from HM-276D (Nanopore, PacBio) and HM-277D (Nanopore). For taxon prediction, we defined that the species is significantly present in the community when at least 10 reads were assigned to it, whereas identity with less than 10 supporting reads was marked as absence. We considered two other metrics to quantify the detection accuracy, true-positive rate (TPR), and false discovery rate (FDR), where TPR is the percentage of correctly predicted taxonomic identities out of known existing taxon and FDR is the percentage of incorrectly predicted taxonomic identities out of all predicted taxon. TPR and FDR were calculated at different ranks in Figure 3E. TPRs were consistent across three datasets from phylum to order level (90%–77%). Below the order level, PacBio (HM-276D) and Nanopore (HM-277D) are 22% lower compared with Nanopore (HM-276D) (92%–87%). From phylum to family level, FDRs were controlled under 15% for all three datasets. However, at the genus level, more than 20% of detections are false for PacBio (HM-276D) and Nanopore (HM-277D), whereas it was 6% for Nanopore (HM-276). All three scenarios have inflated FDR (>20%) at the species level. Across datasets, there was drastic increase in FDR between phylum to family level and below family level, 10% ± 3% and 21% ± 5%. Similar to binning results, Nanopore data of HM-276D still consistently performed better than other two datasets across ranks. However, accurately predicting taxonomic profiles at the species level still remains challenging owing to many false predicted taxonomic identities with 10–100 reads assigned incorrectly.
Strain Profiling
Despite the challenges in assembly and binning of HM-277D microbial community even at the species level, especially for low abundance bacteria (relative abundance <1%), the golden standard profile of this mock community still allows us to evaluate other unique advantages of this deeply sequenced dataset at strain level. First, we examined the ability in identifying these 13 extremely rare strains based on annotated target genes. To explore the sensitivity of strain detection using this dataset, we mapped raw sequenced reads to reference genomes of the 20 bacterial strains with Minimap2. Then, for each strain-specific gene, the average coverage was estimated by summing up read depth across all exonic region, normalized for gene length. In addition, exon coverage fractions were calculated. We required a gene with average coverage greater than 1 and exon coverage fraction greater 50% simultaneously in order to be declared as a detected gene. The results are shown in Figures 4A, S11, and S12. Detection rates and average coverage among all genes largely keep high in abundant strains (>1%), ranging from 96.4 bp to 4,207.6 bp, as well as most of rare strains (<1%). Most of bacterial strains except for Bacteroides vulgatus (69.1%) and Streptococcus pneumoniae (81.7%) have achieved at least 97% gene detection rate.
Next, we recognized that 16S rRNA genes are most commonly used as gene marker for bacteria identification; we further selected them out for each strain based on RefSeq annotation. As shown in Figure 4A, although Bacteroides vulgatus and Streptococcus pneumoniae still have about 50% of 16S rRNA genes undetected by raw sequenced reads, 18 strains have 100% detection rates and exon coverage fraction with 434.77 bp coverage in average, which demonstrates the feasibility of identifying rare strain (<1%) in microbial community with long-read sequencing data. Additionally, read coverage of protein coding genes for 20 bacterial strains was summarized, which shows similar results. Fourteen strains have average coverage above 100 bp and gene detection rates for 18 strains have reached to 99%, indicating the presence of bacterial strains in the sample.
To understand the composition, diversity, and spatial dynamics of microbial communities, we continued to evaluate the bacterial abundance estimation accuracy based on Nanopore data. We determined two abundance metrics to measure the accuracy, Pearson correlation, and L1 norm. These two metrics assess how well Nanopore sequenced reads can reconstruct the bacterial abundances in comparison with the gold standard. Relative abundance was obtained by normalizing total read coverage with chromosome length for each taxon at different ranks. As shown in Figure 4B, abundance estimates at the species level agrees well with the known relative abundances from the mock community. However, abundance estimation at higher ranks appears to be more challenging, as correlation coefficient ranges from 0.87 to 0.85 and L1 norm is above 0.3 from class to family level, whereas two metrics improved with Pearson correlation >0.9 and L1 < 0.29 when rank is below the family level. Poor abundance estimation at class or family level may be due to the presence of extremely rare bacterial strains in the HM-277D sample, as read coverages were simply summed up between species belonging to the same family or class without accounting for abundance heterogeneity.
Discussion
Complete genome assembly and population profiling are critical for the interpretation of microbial community diversity. However, a benchmarking long-read dataset with consistent evaluation metrics is still lacking, which has hindered our understanding of long-read sequence data in metagenome assembly. In this study, we deeply sequenced HM-276D and HM-277D samples to assess the performance of error-prone Nanopore sequencing data and bioinformatics tools in characterizing microbial community. Assemblers consistently achieved high accuracy and completeness for nontrivial bacterial strains, and genome binners performed well at above the genus level. Furthermore, by targeting on marker genes, we were able to identify rare strains with extremely low abundance in microbial community. Overall, our results have demonstrated the technical feasibility to characterize complete microbial genomes and populations from Nanopore sequencing data with metagenomic software.
We note that, despite the feasibility to characterize complete microbial genomes from long-read sequencing data, there are still challenges to be resolved in our study. Even for evenly mixed samples, the best performing assembler meta-flye achieve 99.99% consensus accuracy. However, as the reference genomes contains 70 Mb, 0.04% error rate has led to 28 kbp of mismatches. These erroneous bases could be due to sequencing errors in low-quality read, a major drawback of long-read sequence data and base modification, which may complicate the genome assembly. To prevent these errors, a sequencer with unbiased and methylation-aware base caller is in need. (We also acknowledge that some of the mismatches may be due to natural differences between reference microbiome samples and the reference genomes that were used.) In addition, there is still room for further improvement in assembly completeness by using longer reads or better designed assemblers to account for long repeats in genomes. In our study, we assembled long-read sequenced data from 20 bacterial strains across species. However, the performance at strain-level still remains unknown as closely related genomes are always a major challenge for genome assembly. In the future, we anticipate that more mock microbial community will be released with bacteria at strain level for benchmarking study.
By evaluating the performance of bioinformatics tools across different technologies, we found that third-generation sequencing generally facilitates the complete characterization of complex bacterial genomes by overcoming many limitations of second-generation sequencing. The short read length has limited the ability of Illumina sequencing in genome interpretation. For example, the length of repetitive genomic region is larger than a single read. As a consequence, intra- and intergenomic diversities are unlikely to be captured by short sequencing data. This issue has been resolved by long-read sequencing technologies (ONT and PacBio), which is able to span low complexity and repetitive regions by providing sequence reads with at least 10 kb in length. While generating data with much higher error rate than PacBio, ONT has become a promising platform in many applications, especially for studies requiring large amounts of data. This is because ONT provides longer reads (up to 900 kb in length) with higher throughput compared with PacBio (10–15 kb in length). Moreover, ONT is currently more affordable with lower per-base cost of data generation, which is a key factor in long-read sequencing studies. Overall, the application of these two major long-read sequencing platforms in metagenomics analysis of complex communities is still restricted by higher error rate. This problem could be addressed with improvement of consensus sequences. Recently, newly released R10 chip from ONT has longer base-contacting constriction in the pore, which improves the homopolymer resolution as compared with R9 and improved per-base error rates. Similarly, the HiFi protocol from PacBio can provide Sanger-quality accuracy (>99%) with reduced read length, which are still much longer than short-read sequencing for assembly of complex genomes (Wenger et al., 2019).This can lead to metagenome assembly with higher accuracy and completeness, as well as more accurate OTU identification. Future metagenomics studies are expected to be changed dramatically by this approach. For example, strain UA159 and NN2025 under species Streptococcus mutans only share 8% common regions, which can be uniquely assigned. We then found that 20% of ONT reads can cover the unique region of these two strains, respectively, which is infeasible for short reads. Therefore, with better quality of long-read data, this approach may allow us to identify bacteria of interest directly at strain level instead of performing binning analysis in the future.
In addition to illustrating the advantages brought by long-read sequence data, we also assessed the performance of four de novo assembly algorithms and a long-read genome binner. The bioinformatics challenges to interpret rich information from complex microbial community include high error rates and low throughput for long-read sequencing, fragmented nature for short-read sequencing, and large CPU hours requirement. For evenly mixed (each with 5% abundance) HM-276D mock community, four tools consistently achieved high accuracy and completeness. No single assembler significantly outperforms others. By subsampling data to less coverage depths, not surprisingly, we found that the corresponding metrics for four tools decreased. In terms of speed, wtdbg2 is tens of times faster than other tools. For the unevenly mixed mock community HM-277D, assembly accuracy still remains high for all four tools (∼97%–98%). Genome fraction was reduced because 13 rare bacterial strains (<1%) were poorly assembled. Hybrid-assembler OPERA-MS, which combines the advantages from long- and short-read technologies, shows more robust performance to bacterial strains with extremely low abundance than other tools. However, it produced much more contigs with less contiguity, whereas meta-flye, Canu, and wtdbg2 returned single contig for 18, 15, and 17 strains respectively. Furthermore, taxonomic binning results show that Megan-LR performs well when genomes are not closely related. Taxon bins were reconstructed with acceptable accuracy down to the genus level, whereas performance decreased at species and strain levels.
In summary, our results not only demonstrate the feasibility of characterizing complete microbial genomes and populations from error-prone Nanopore sequencing data but also highlight necessary bioinformatics improvements for future metagenomics tool development to handle specific challenges in error-prone long-read sequencing data. We believe that future metagenomics studies will benefit from this approach to assemble complete microbial genomes, while maintaining the theoretical ability to detect DNA methylations and base modifications, infer repetitive elements and structural variants, and achieve strain-level resolution within microbial communities. All the datasets on reference microbiomes are made publicly available to facilitate benchmarking studies on metagenomics and the development of novel software tools.
Limitations of the Study
In this study, we note that there is still room for further improvement in assembly completeness using long reads. Also, the performance of binning analysis using long read at strain-level still remains unknown.
Resource Availability
Lead Contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Kai Wang (wangk@email.chop.edu).
Materials Availability
This study did not generate new unique reagents.
Data and Code Availability
The Oxford Nanopore sequencing data that support the findings of this study have been deposited in the BioProject database at http://www.ncbi.nlm.nih.gov/bioproject/630658 (reference number: PRJNA630658).
The PacBio data used in this study were generated from the PacBio RS II sequencer. The data were downloaded from the following URL: https://github.com/PacificBiosciences/DevNet/wiki/Human_Microbiome_Project_MockB_Shotgun.
The Illumina paired-end data for HM-276D were downloaded from NCBI SRA database with accession numbers SRR2726671 and SRR2726672.
The Illumina TruSeq synthetic long-read data for HM-277D were downloaded from NCBI SRA database with accession number SRR2822457.
Methods
All methods can be found in the accompanying Transparent Methods supplemental file.
Acknowledgments
This work was supported by Children’s Hospital of Philadelphia Research Institute (United States) and NIH/NIGMS (United States) grant GM132713 to K.W.. The following reagent was obtained through BEI Resources, NIAID, NIH as part of the Human Microbiome Project: Genomic DNA from Microbial Mock Community B (Staggered, High Concentration), v5.2H, for Whole Genome Shotgun Sequencing, HM-277D.The following reagent was obtained through BEI Resources, NIAID, NIH as part of the Human Microbiome Project: Genomic DNA from Microbial Mock Community B (Even, High Concentration), v5.1H, for Whole Genome Shotgun Sequencing, HM-276D.
Author Contributions
Y.H. performed data analysis and wrote the manuscript. L.F. designed the study, performed long-read sequencing, and analyzed the data. C.N. performed long-read sequencing. K.W. designed the study, supervised the study, and wrote the manuscript. All authors read, revised, and approved the manuscript.
Declaration of Interests
The authors declare no conflict of interest.
Published: June 26, 2020
Footnotes
Supplemental Information can be found online at https://doi.org/10.1016/j.isci.2020.101223.
Supplemental Information
References
- Bertrand D., Shaw J., Kalathiyappan M., Ng A.H.Q., Kumar M.S., Li C., Dvornicic M., Soldo J.P., Koh J.Y., Tong C. Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat. Biotechnol. 2019;37:937–944. doi: 10.1038/s41587-019-0191-2. [DOI] [PubMed] [Google Scholar]
- Byrd A.L., Perez-Rogers J.F., Manimaran S., Castro-Nallar E., Toma I., McCaffrey T., Siegel M., Benson G., Crandall K.A., Johnson W.E. Clinical PathoScope: rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data. BMC Bioinformatics. 2014;15:262. doi: 10.1186/1471-2105-15-262. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chehoud C., Albenberg L.G., Judge C., Hoffmann C., Grunberg S., Bittinger K., Baldassano R.N., Lewis J.D., Bushman F.D., Wu G.D. Fungal signature in the gut microbiota of pediatric patients with inflammatory bowel disease. Inflamm. Bowel Dis. 2015;21:1948–1956. doi: 10.1097/MIB.0000000000000454. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen E.Z., Bushman F.D., Li H. A model-based approach for species abundance quantification based on shotgun metagenomic data. Stat. Biosci. 2017;9:13–27. doi: 10.1007/s12561-016-9148-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Francis O.E., Bendall M., Manimaran S., Hong C., Clement N.L., Castro-Nallar E., Snell Q., Schaalje G.B., Clement M.J., Crandall K.A., Johnson W.E. Pathoscope: species identification and strain attribution with unassembled sequencing data. Genome Res. 2013;23:1721–1729. doi: 10.1101/gr.150151.112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gill S.R., Pop M., Deboy R.T., Eckburg P.B., Turnbaugh P.J., Samuel B.S., Gordon J.I., Relman D.A., Fraser-Liggett C.M., Nelson K.E. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312:1355–1359. doi: 10.1126/science.1124234. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gregor I., Droge J., Schirmer M., Quince C., McHardy A.C. PhyloPythiaS+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. PeerJ. 2016;4:e1603. doi: 10.7717/peerj.1603. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Group N.H.W., Peterson J., Garges S., Giovanni M., McInnes P., Wang L., Schloss J.A., Bonazzi V., McEwen J.E., Wetterstrand K.A. The NIH human microbiome project. Genome Res. 2009;19:2317–2323. doi: 10.1101/gr.096651.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hao X., Chen T. OTU analysis using metagenomic shotgun sequencing data. PLoS One. 2012;7:e49785. doi: 10.1371/journal.pone.0049785. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hong C., Manimaran S., Shen Y., Perez-Rogers J.F., Byrd A.L., Castro-Nallar E., Crandall K.A., Johnson W.E. PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome. 2014;2:33. doi: 10.1186/2049-2618-2-33. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hooper L.V., Stappenbeck T.S., Hong C.V., Gordon J.I. Angiogenins: a new class of microbicidal proteins involved in innate immunity. Nat. Immunol. 2003;4:269–273. doi: 10.1038/ni888. [DOI] [PubMed] [Google Scholar]
- Huson D.H., Albrecht B., Bagci C., Bessarab I., Gorska A., Jolic D., Williams R.B.H. MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biol. Direct. 2018;13:6. doi: 10.1186/s13062-018-0208-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huson D.H., Beier S., Flade I., Gorska A., El-Hadidi M., Mitra S., Ruscheweyh H.J., Tappu R. MEGAN community edition - interactive exploration and analysis of large-scale microbiome sequencing data. PLoS Comput. Biol. 2016;12:e1004957. doi: 10.1371/journal.pcbi.1004957. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Janda J.M., Abbott S.L. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. J. Clin. Microbiol. 2007;45:2761–2764. doi: 10.1128/JCM.01228-07. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jones M.B., Highlander S.K., Anderson E.L., Li W., Dayrit M., Klitgord N., Fabani M.M., Seguritan V., Green J., Pride D.T. Library preparation methodology can influence genomic and functional predictions in human microbiome research. Proc. Natl. Acad. Sci. U S A. 2015;112:14024–14029. doi: 10.1073/pnas.1519288112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jovel J., Patterson J., Wang W., Hotte N., O'Keefe S., Mitchel T., Perry T., Kao D., Mason A.L., Madsen K.L., Wong G.K. Characterization of the gut microbiome using 16S or shotgun metagenomics. Front. Microbiol. 2016;7:459. doi: 10.3389/fmicb.2016.00459. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kolmogorov M., Yuan J., Lin Y., Pevzner P.A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 2019;37:540–546. doi: 10.1038/s41587-019-0072-8. [DOI] [PubMed] [Google Scholar]
- Koren S., Walenz B.P., Berlin K., Miller J.R., Bergman N.H., Phillippy A.M. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27:722–736. doi: 10.1101/gr.215087.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kuleshov V., Jiang C., Zhou W., Jahanbani F., Batzoglou S., Snyder M. Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nat. Biotechnol. 2016;34:64–69. doi: 10.1038/nbt.3416. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Laudadio I., Fulci V., Palone F., Stronati L., Cucchiara S., Carissimi C. Quantitative assessment of shotgun metagenomics and 16S rDNA amplicon sequencing in the study of human gut microbiome. Omics. 2018;22:248–254. doi: 10.1089/omi.2018.0013. [DOI] [PubMed] [Google Scholar]
- Lee, C.H., Bowman, B., and Hall, R. Developments in PacBio® metagenome sequencing: Shotgun whole genomes and full-length 16S. International Plant and Animal Genome Conference Asia, 2014.
- Leger A., Leonardi T. pycoQC, interactive quality control for Oxford Nanopore Sequencing. J. Open Source Softw. 2019;4:1236. [Google Scholar]
- Leggett R.M., Alcon-Giner C., Heavens D., Caim S., Brook T.C., Kujawska M., Martin S., Hoyles L., Clarke P., Hall L.J. Rapid profiling of the preterm infant gut microbiota using nanopore sequencing aids pathogen diagnostics. Nat. Microbiol. 2020;5:430–442. doi: 10.1038/s41564-019-0626-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lewis J.D., Chen E.Z., Baldassano R.N., Otley A.R., Griffiths A.M., Lee D., Bittinger K., Bailey A., Friedman E.S., Hoffmann C. Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn's disease. Cell Host Microbe. 2015;18:489–500. doi: 10.1016/j.chom.2015.09.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ley R.E., Turnbaugh P.J., Klein S., Gordon J.I. Microbial ecology: human gut microbes associated with obesity. Nature. 2006;444:1022–1023. doi: 10.1038/4441022a. [DOI] [PubMed] [Google Scholar]
- Li D., Liu C.M., Luo R., Sadakane K., Lam T.W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31:1674–1676. doi: 10.1093/bioinformatics/btv033. [DOI] [PubMed] [Google Scholar]
- Liang X., Bittinger K., Li X., Abernethy D.R., Bushman F.D., Fitzgerald G.A. Bidirectional interactions between indomethacin and the murine intestinal microbiota. Elife. 2015;4:e08973. doi: 10.7554/eLife.08973. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mason C.E., Afshinnekoo E., Tighe S., Wu S., Levy S. International standards for genomes, transcriptomes, and metagenomes. J. Biomol. Tech. 2017;28:8–18. doi: 10.7171/jbt.17-2801-006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McIntyre A.B.R., Alexander N., Grigorev K., Bezdan D., Sichtig H., Chiu C.Y., Mason C.E. Single-molecule sequencing detection of N6-methyladenine in microbial reference materials. Nat. Commun. 2019;10:579. doi: 10.1038/s41467-019-08289-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nicholls S.M., Quick J.C., Tang S., Loman N.J. Ultra-deep, long-read nanopore sequencing of mock microbial community standards. Gigascience. 2019;8:giz043. doi: 10.1093/gigascience/giz043. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Quince C., Walker A.W., Simpson J.T., Loman N.J., Segata N. Shotgun metagenomics, from sampling to analysis. Nat. Biotechnol. 2017;35:833–844. doi: 10.1038/nbt.3935. [DOI] [PubMed] [Google Scholar]
- Ranjan R., Rani A., Metwally A., McGee H.S., Perkins D.L. Analysis of the microbiome: advantages of whole genome shotgun versus 16S amplicon sequencing. Biochem. Biophys. Res. Commun. 2016;469:967–977. doi: 10.1016/j.bbrc.2015.12.083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ruan J., Li H. Fast and accurate long-read assembly with wtdbg2. Nat. Methods. 2020;17:155–158. doi: 10.1038/s41592-019-0669-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sartor R.B. Microbial influences in inflammatory bowel diseases. Gastroenterology. 2008;134:577–594. doi: 10.1053/j.gastro.2007.11.059. [DOI] [PubMed] [Google Scholar]
- Schauber J., Svanholm C., Termen S., Iffland K., Menzel T., Scheppach W., Melcher R., Agerberth B., Luhrs H., Gudmundsson G.H. Expression of the cathelicidin LL-37 is modulated by short chain fatty acids in colonocytes: relevance of signalling pathways. Gut. 2003;52:735–741. doi: 10.1136/gut.52.5.735. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sczyrba A., Hofmann P., Belmann P., Koslicki D., Janssen S., Droge J., Gregor I., Majda S., Fiedler J., Dahms E. Critical assessment of metagenome interpretation-a benchmark of metagenomics software. Nat. Methods. 2017;14:1063–1071. doi: 10.1038/nmeth.4458. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Turnbaugh P.J., Hamady M., Yatsunenko T., Cantarel B.L., Duncan A., Ley R.E., Sogin M.L., Jones W.J., Roe B.A., Affourtit J.P. A core gut microbiome in obese and lean twins. Nature. 2009;457:480–484. doi: 10.1038/nature07540. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang F., Kaplan J.L., Gold B.D., Bhasin M.K., Ward N.L., Kellermayer R., Kirschner B.S., Heyman M.B., Dowd S.E., Cox S.B. Detecting microbial dysbiosis associated with pediatric Crohn disease despite the high variability of the gut microbiota. Cell Rep. 2016;14:945–955. doi: 10.1016/j.celrep.2015.12.088. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wen L., Ley R.E., Volchkov P.Y., Stranges P.B., Avanesyan L., Stonebraker A.C., Hu C., Wong F.S., Szot G.L., Bluestone J.A. Innate immunity and intestinal microbiota in the development of Type 1 diabetes. Nature. 2008;455:1109–1113. doi: 10.1038/nature07336. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wenger A.M., Peluso P., Rowell W.J., Chang P.C., Hall R.J., Concepcion G.T., Ebler J., Fungtammasan A., Kolesnikov A., Olson N.D. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 2019;37:1155–1162. doi: 10.1038/s41587-019-0217-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wu G.D., Chen J., Hoffmann C., Bittinger K., Chen Y.Y., Keilbaugh S.A., Bewtra M., Knights D., Walters W.A., Knight R. Linking long-term dietary patterns with gut microbial enterotypes. Science. 2011;334:105–108. doi: 10.1126/science.1208344. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The Oxford Nanopore sequencing data that support the findings of this study have been deposited in the BioProject database at http://www.ncbi.nlm.nih.gov/bioproject/630658 (reference number: PRJNA630658).
The PacBio data used in this study were generated from the PacBio RS II sequencer. The data were downloaded from the following URL: https://github.com/PacificBiosciences/DevNet/wiki/Human_Microbiome_Project_MockB_Shotgun.
The Illumina paired-end data for HM-276D were downloaded from NCBI SRA database with accession numbers SRR2726671 and SRR2726672.
The Illumina TruSeq synthetic long-read data for HM-277D were downloaded from NCBI SRA database with accession number SRR2822457.