Abstract
16S rRNA based analysis is the established standard for elucidating microbial community composition. While short read 16S analyses are largely confined to genus-level resolution at best since only a portion of the gene is sequenced, full-length 16S sequences have the potential to provide species-level accuracy. However, existing taxonomic identification algorithms are not optimized for the increased read length and error rate often observed in long-read data. Here we present Emu, a novel approach that employs an expectation-maximization (EM) algorithm to generate taxonomic abundance profiles from full-length 16S rRNA reads. Results produced from two simulated datasets and two mock communities show Emu capable of accurate microbial community profiling while obtaining fewer false positives and false negatives than alternative methods. Additionally, we illustrate a real-world application of our new software by comparing clinical sample composition estimates generated by an established whole-genome shotgun sequencing workflow to those returned by full-length 16S sequences processed with Emu.
Sequencing the 16S subunit of the ribosomal RNA gene has been a reliable way to characterize diversity in a community of microbes since Carl Woese used this technique to identify Archaea in 19771. Today, high-throughput sequencing machines used for this analysis are dominantly Illumina devices. Although cost-effective and accurate, Illumina sequences are limited to roughly 500 nucleotides (bps) per joined paired-end read. Since the 16S gene is approximately 1,550 bps, 16S targeted amplification sequencing is limited to just a portion of the gene and is completed by targeting a selected subset of the nine hypervariable regions. This constraint ultimately prevents distinction between highly similar species and therefore short read data can only reliably generate taxonomic profiles measured down to the genus level in most cases2. One workaround for this limitation is to assemble short reads through a synthetic long read method3. While this approach has shown promising accuracy, it requires significantly more sequences, which introduces additional financial costs.
Recent developments in third-generation sequencing, from providers like Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), permit amplification of sequences spanning the entire 16S gene. However, these long reads come with one notable drawback: high rates of sequencing error4. Errors can be corrected by deducing a consensus sequence from multiple passes on each strand of genetic material as seen in PacBio HiFi or from multiple reads tagged with matching unique molecular identifiers (UMIs)5. While these methods have produced near-perfect accuracy6, they again come with a significant increase in cost due to increased sequencing depth. To reduce these expenses and achieve species-level resolution from single pass 16S reads, the appropriate software to account for high error profiles is needed.
The canonical pipeline for 16S analysis operates in two main steps. First, the set of raw sequences is de-noised to identify a smaller set of core sequences, where each set is believed to represent a distinct taxonomic unit in the community. Various algorithms are available for this process7, yet the majority are calibrated to the low level of error associated with Illumina reads. Second, the representative sequences are compared to a database and assigned a taxonomic label. Since reads are already corrected for error at this point, a database lookup tool such as BLAST8 is effective here. These pipelines are operative because the input reads are accurate, and unfortunately produce inconsistent results when challenged with error-prone reads.
Since ONT sequencers are comparatively recent to the marketplace, 16S method development for these devices has only just begun9. In the absence of dedicated tools, some publications have chosen to use more general read-mapping software such as BWA-MEM10 or the LAST aligner11 to align reads directly to raw 16S sequences from one of the major databases12–14, while other publications have chosen to incorporate metagenomic classification methods designed for whole genome shotgun sequences. Centrifuge15 proved capable of ONT shotgun sequencing analysis in its publication, and is now included as a step in ONT’s WIMP16 (What’s in my Pot?), a long-read 16S workflow provided on its EPI2ME analysis platform. Kraken has also fared well in a long-read 16S benchmarking study and has performed favorably compared to QIIME 219 for 16S short reads when results are re-estimated with its Bayesian cousin Bracken20,21.
NanoClust22 became the first published method purpose-built for taxonomic abundance profiling using full-length 16S amplicon sequencing from ONT machines. Here, the two-stage cluster and database-lookup procedure is implemented in Nextflow23 with external tools for demultiplexing, quality-filtering, clustering, polishing and taxon assignment. While the use of clustered consensus sequences increases computational efficiency, this approach is liable to overlook species that are truly present in the error-prone dataset.
One method that has successfully overcome high errors in long-read data is MetaMaps24, a method designed for taxonomic binning of long, high-error shotgun sequencing reads. MetaMaps uses an approximate read-mapping algorithm to identify multiple candidate species and locations for each read, then applies an expectation-maximization (EM) algorithm to adjust the relative confidence in each mapping based on the mapping density of other reads in the sample. This has the effect of smoothing out some of the noise that is inherently created by the ONT error profile. While the approximation methods built into MetaMaps make it incompatible with the analysis of highly similar 16S genes, it and other EM algorithms that have successfully resolved ambiguous read mappings25,26 motivate an EM method for error-correction of long 16S reads.
Here, we present Emu, a microbial community profiling software tool tailored for full-length 16S data with high error rates. Emu benefits from the increased precision potential provided by the full gene while accounting for high error rates produced by single pass third-generation sequencing. Emu’s algorithm involves a two-stage process. First, proper alignments are generated between reads and the supplied reference database. Then, an EM-based error-correction step is performed to iteratively refine species-level relative abundances based on total read-mapping counts. This results in microbial community profile estimations from full-length 16S reads which are more accurate than existing methods at both the genus and species level.
Results
To generate an accurate microbial community composition estimate from noisy full-length 16S reads, an expectation-maximization (EM) algorithm with a composition-dependent prior is developed in Emu. A high-level illustration of the algorithm can be found in Figure 1 and a more detailed visual of the equations in the algorithm can be seen in Extended Data Figure 1.
Figure 1. Pictorial representation of Emu algorithm.

The Emu algorithm begins by generating alignments between input reads (R) and database sequences (S). The probability of each non-matching character alignment type [mismatch (X), insertion (I), deletion (D), softclip (S)] is calculated based on the number of occurrences of each character alignment type within all primary alignments from the read mapping. The probability of each alignment in the read mapping is then generated as P(r|t) from the counts of each character alignment type and their corresponding established probabilities. The expectation-maximization (EM) phase is then entered, where each read is broken down into the likelihood it is derived from each possible species in the database P(t|r) and the overall composition estimate F(t) is deduced. This cycle repeats as the composition estimate influences read-taxonomy probabilities to give more weight to taxa with higher abundances, then the composition estimate is updated accordingly. Once minimal changes are detected between cycle iterations, the EM loop is exited. The composition estimate is then trimmed based on the specified minimum abundance probability threshold to complete one final EM iteration and output a final composition estimate.
To demonstrate the performance of Emu, four studies were completed. First, a quantitative comparison was performed on two distinct sets of simulated ONT sequences (Supplementary Tables 1–2). Second, a quantitative comparison was performed using two distinct communities sequenced with both ONT and Illumina devices (Supplementary Tables 3–4), where a de facto ground truth could be used for evaluating accuracy and comparing methods. Third, a series of analyses highlighted the various facets of Emu through: a breakdown of profile estimations throughout the EM algorithm, a database comparison, a novel species simulation, a read mapper comparison and a naive application of the Emu default minimum abundance threshold. Finally, Emu’s applicability to understanding dynamics in actual microbial communities was demonstrated. In this real-world model, human vaginal microbiome clinical samples were processed with two separate pipelines: 16S long reads analyzed with Emu and whole genome shotgun sequences processed with Bracken.
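The steps in this caption can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not Emu's implementation. The function names, the per-alignment count dictionaries, the flooring of unseen operation counts, the convergence test on the change in F(t), and the default thresholds are all hypothetical choices made for the example. The toy data contain eight reads that align better to taxon A and two reads that align equally well to A and B.

```python
import math
from collections import Counter

NON_MATCH_OPS = ("X", "I", "D", "S")  # mismatch, insertion, deletion, softclip

def op_log_probs(primary_counts):
    """Log-probability of each non-matching operation type, estimated from
    the operation counts of all primary alignments."""
    totals = Counter()
    for counts in primary_counts:
        totals.update(counts)
    n = sum(totals.values())
    # Assumption: floor each count at 1 so an unseen type keeps a finite log-probability.
    return {op: math.log(max(totals[op], 1) / n) for op in NON_MATCH_OPS}

def log_p_read_given_taxon(counts, log_p):
    """log P(r|t): sum of count * log-probability over non-matching operations."""
    return sum(counts.get(op, 0) * log_p[op] for op in NON_MATCH_OPS)

def em_step(log_lik, f):
    """One EM cycle: compute P(t|r) from P(r|t) and F(t), then re-estimate F(t)."""
    new_f = dict.fromkeys(f, 0.0)
    for per_taxon in log_lik.values():
        top = max(per_taxon.values())  # shift by the maximum for numerical stability
        w = {t: f[t] * math.exp(ll - top) for t, ll in per_taxon.items()}
        z = sum(w.values())
        if z == 0.0:  # read maps only to taxa with zero abundance
            continue
        for t, v in w.items():
            new_f[t] += v / z
    total = sum(new_f.values())
    return {t: v / total for t, v in new_f.items()}

def estimate_composition(log_lik, taxa, tol=1e-6, min_abundance=1e-4, max_iter=1000):
    f = {t: 1.0 / len(taxa) for t in taxa}  # uniform starting estimate
    for _ in range(max_iter):
        new_f = em_step(log_lik, f)
        delta = sum(abs(new_f[t] - f[t]) for t in taxa)
        f = new_f
        if delta < tol:  # minimal change between cycles
            break
    # Trim taxa below the threshold, then run one final EM iteration.
    trimmed = {t: (v if v >= min_abundance else 0.0) for t, v in f.items()}
    return em_step(log_lik, trimmed)

# Toy data: operation counts per (read, taxon) alignment.
good = {"=": 1450, "X": 30, "I": 10, "D": 10}
worse = {"=": 1440, "X": 40, "I": 10, "D": 10}
alignments = {f"r{i}": {"A": good, "B": worse} for i in range(8)}
alignments.update({f"r{i}": {"A": good, "B": good} for i in range(8, 10)})

log_p = op_log_probs([good] * 10)  # the best alignment of every read has these counts
log_lik = {r: {t: log_p_read_given_taxon(c, log_p) for t, c in hits.items()}
           for r, hits in alignments.items()}
composition = estimate_composition(log_lik, ["A", "B"])
print(composition)
```

If a read mapper arbitrarily reported B as the primary hit for the two tied reads, a simple primary-alignment count would place 20% of the sample in B. In this toy the EM cycle instead reassigns those reads to A, because A is strongly supported by the remaining reads.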
Quantitative comparison
To quantify the output of Emu in relation to several existing methods, four communities were used. The first two are single data sets of simulated ONT reads that follow the distribution of a published mock community. The other two are synthetic mock communities, each of which was sequenced with both ONT and Illumina devices. Performance of each method was evaluated at both the genus and species level using three metrics: the L1-norm of the taxonomic abundance profile, the count of true positive taxa (TP), and the count of false positive taxa (FP). Computational resources required by each method were measured by recording the run time and memory usage for each software.
The set of methods used for comparison includes those discussed above: Kraken 227, Bracken, NanoClust, Centrifuge, and MetaMaps. We also include QIIME 2 and the primary alignment generated by minimap228. Although minimap2 is not a composition estimator or read-level classifier in itself, it is included because it is instrumental in the Emu algorithm; minimap2 is the read-mapping software Emu uses to compute likelihood scores and iteratively estimate relative abundance. Including minimap2 in the comparison separates the effect of the EM implementation in Emu from the read-mapping output it uses as a starting point. Identical reference databases were built for each software to ensure even comparison across methods (see Methods for details).
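As a minimal sketch of the three metrics, assume each profile is a dictionary mapping a taxon name to its relative abundance. The function names and the toy profiles below are hypothetical and only illustrate the definitions.

```python
def l1_norm(expected, observed):
    """Sum of absolute abundance differences over every taxon in either profile."""
    taxa = set(expected) | set(observed)
    return sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)

def presence_counts(expected, observed):
    """Return (TP, FP): reported taxa that are, or are not, truly present."""
    truth = {t for t, a in expected.items() if a > 0}
    called = {t for t, a in observed.items() if a > 0}
    return len(truth & called), len(called - truth)

# Hypothetical ground truth and estimate
expected = {"sp_A": 0.5, "sp_B": 0.3, "sp_C": 0.2}
observed = {"sp_A": 0.55, "sp_B": 0.25, "sp_D": 0.2}

distance = l1_norm(expected, observed)
tp, fp = presence_counts(expected, observed)
print(round(distance, 3), tp, fp)
```

When both profiles sum to 1, the L1-norm ranges from 0 (identical profiles) to 2 (no shared abundance), which gives a scale for reading the error values reported in the tables.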
Ground truth relative abundance values for synthetic communities are based on sample-specific imputed values. This was done to correct for fluctuations from the theoretical abundances which may occur during handling, storage or library preparation (including potential primer bias during 16S amplification). The two ZymoBIOMICS community profiles are reasonably similar to their abundance claims (Extended Data Figure 2, Supplementary Table 5), but the synthetic gut community is subject to greater variation by nature of the microbes included and the skewed distribution. Details on this process are described in the Establishing ground truth section in Methods below, and for these two communities the term ground truth herein refers to this imputed value.
Simulated mock community datasets
ONT reads were simulated following the composition of two published mock communities: MBARC-2629 and the mouse gut profile from the Critical Assessment of Metagenome Interpretation II (CAMI2) Challenge30. MBARC-26 represents a simple community with 23 bacterial and 3 archaeal strains, while our subsampled version of the CAMI2 profile increases microbial richness and contains 345 unique species, each of which is present in the Emu default database. Detailed information on the reference sequences and distribution of simulated reads is contained in Supplementary Tables 1–2.
ZymoBIOMICS mock community standard dataset
A previous study compared 16S sample composition accuracy across a series of hypervariable regions as well as the full-length gene using the ZymoBIOMICS community standard catalog number D360531. We retrieved the ONT full-length dataset and one of the Illumina datasets for our analysis. We selected the Illumina dataset with targeted regions V4-V6 to represent short-read data since the study showed this dataset to produce classification results among the most accurate for this community specifically.
Synthetic mock gut microbiome dataset
To challenge our software, a synthetic community mimicking the human gut microbiome was created and sequenced with both ONT and Illumina devices, as described in the Gut Microbiome Mock Community Sample Creation section in Methods. To represent a real-world scenario with unknown species, Romboutsia hominis is included in the sample, even though this new species is not present in our database. The derived relative abundances of 21 species present in the sample are described in Supplementary Table 4. One notable difference between the two datasets for this community is that Bifidobacterium dentium is not considered to be a true positive in the ONT sequences. This is a result of a recently noted issue with the standard ONT forward primer, which contains three mismatching bases to the family Bifidobacteriaceae and thus fails to amplify microbes of this taxon13. Consequently, the ONT dataset does not contain reads from this microbe. This demonstrates the importance of an imputed ground truth for the mock communities and additionally highlights the need for further research to identify reliably universal primers for this region.
Performance
Results of all methods on the two simulated datasets and two mock communities are contained in Table 1. Computational resources required by each method are listed in Supplementary Table 6. Complete abundance profile outputs from all methods on all datasets are provided in Supplementary Tables 7–12. All results generated utilize the default Emu database.
Table 1.
Performance summary of 16S relative abundance estimates on ONT and Illumina sequences for all four communities. The row headers for TP contain the estimated number of actual true positives for each data set: for [x, y]: x denotes the expected TP for ONT dataset, and y for Illumina dataset.
| Community | Rank | Metric | Emu (ONT) | minimap2 (ONT) | Kraken 2 (ONT) | Bracken (ONT) | NanoCLUST (ONT) | Centrifuge (ONT) | MetaMaps (ONT) | QIIME 2 (ONT) | Emu (Illumina) | minimap2 (Illumina) | Kraken 2 (Illumina) | Bracken (Illumina) | QIIME 2 (Illumina) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MBARC-26 | genus | L1-norm | 2E-05 | 3E-04 | 0.03 | 0.72 | 1.77 | 0.01 | 0.42 | 0.16 | | | | | |
| MBARC-26 | genus | TP [24] | 24 | 24 | 24 | 19 | 4 | 24 | 24 | 22 | | | | | |
| MBARC-26 | genus | FP | 0 | 18 | 339 | 15 | 0 | 484 | 229 | 8 | | | | | |
| MBARC-26 | species | L1-norm | 3E-05 | 0.01 | 0.14 | 0.80 | 1.77 | 0.11 | 0.51 | 0.96 | | | | | |
| MBARC-26 | species | TP [26] | 26 | 26 | 25 | 19 | 5 | 26 | 26 | 16 | | | | | |
| MBARC-26 | species | FP | 0 | 73 | 626 | 43 | 0 | 860 | 415 | 28 | | | | | |
| CAMI2 | genus | L1-norm | 0.01 | 0.02 | 0.13 | 1.12 | - | 0.10 | 0.05 | - | | | | | |
| CAMI2 | genus | TP [179] | 171 | 179 | 176 | 105 | - | 180 | 177 | - | | | | | |
| CAMI2 | genus | FP | 69 | 1482 | 2134 | 542 | - | 2296 | 1366 | - | | | | | |
| CAMI2 | species | L1-norm | 0.03 | 0.13 | 0.43 | 1.45 | - | 0.24 | 0.11 | - | | | | | |
| CAMI2 | species | TP [345] | 330 | 343 | 338 | 162 | - | 343 | 339 | - | | | | | |
| CAMI2 | species | FP | 250 | 4665 | 5879 | 1175 | - | 6780 | 3271 | - | | | | | |
| ZymoBIOMICS | genus | L1-norm | 1E-03 | 0.01 | 0.25 | 0.24 | 0.02 | 0.45 | 0.16 | 0.75 | 0.04 | 0.07 | 0.39 | 0.30 | 0.31 |
| ZymoBIOMICS | genus | TP [8,8] | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 7 | 8 | 8 | 8 | 8 | 7 |
| ZymoBIOMICS | genus | FP | 0 | 5 | 77 | 61 | 0 | 245 | 49 | 5 | 9 | 27 | 65 | 64 | 1 |
| ZymoBIOMICS | species | L1-norm | 0.03 | 0.18 | 0.65 | 0.66 | 0.24 | 1.11 | 0.61 | 1.14 | 0.34 | 0.39 | 1.18 | 0.89 | 0.93 |
| ZymoBIOMICS | species | TP [8,8] | 8 | 8 | 8 | 8 | 8 | 8 | 8 | 1 | 8 | 8 | 8 | 8 | 1 |
| ZymoBIOMICS | species | FP | 6 | 45 | 219 | 191 | 1 | 480 | 146 | 15 | 81 | 120 | 262 | 195 | 2 |
| Synth. Gut | genus | L1-norm | 0.03 | 0.05 | 0.41 | 0.79 | 0.34 | 0.30 | 0.53 | 0.67 | 0.34 | 0.35 | 0.90 | 0.77 | 0.45 |
| Synth. Gut | genus | TP [18,19] | 18 | 18 | 18 | 14 | 14 | 18 | 17 | 14 | 17 | 18 | 17 | 16 | 14 |
| Synth. Gut | genus | FP | 15 | 93 | 505 | 80 | 2 | 1007 | 381 | 31 | 14 | 557 | 291 | 59 | 3 |
| Synth. Gut | species | L1-norm | 0.43 | 0.44 | 0.76 | 0.83 | 0.50 | 0.66 | 0.74 | 1.17 | 0.51 | 0.54 | 1.26 | 1.20 | 1.13 |
| Synth. Gut | species | TP [20,21] | 18 | 19 | 19 | 12 | 14 | 19 | 17 | 9 | 16 | 18 | 15 | 12 | 6 |
| Synth. Gut | species | FP | 40 | 252 | 1156 | 166 | 4 | 2372 | 836 | 55 | 71 | 1230 | 539 | 110 | 4 |
MBARC-26 simulation:
For the MBARC-26 simulated data, Figure 2 shows Emu to outperform every method. Not only does Emu express the lowest L1 and L2 distances, but it is also the only method to correctly identify all 26 species without producing any false positives. The difference in both the false positive counts and relative abundance error measurements between Emu and minimap2 is substantial, reflecting the accuracy gains produced by the EM algorithm compared to a simple similarity-based taxon assignment approach. It is evident from the memory and run time data between these two methods (Supplementary Table 6) that the majority of computational resources used by Emu are in fact due to its use of minimap2 for alignment generation. NanoCLUST results differ from other methods shown in that it has no false positives, but fails to identify several of the present taxa; in other words, it is generally conservative in its identifications.
CAMI2 simulation:
Results on the more complex simulated CAMI2 dataset align with those reported from our first simulated dataset. Again, Emu reports the lowest relative abundance error through both the L1 and L2 distance. In addition, Emu reports the best precision and F-score due to its ability to find a balance between true positive detection and false positive reports. Bracken’s re-estimation of Kraken 2 results causes both the true and false positives to drop off significantly. However, Emu’s re-estimation of minimap2 findings reduces the species-level false positive count by an order of magnitude (4665 to 250) at the cost of only 13 true positives (343 to 330). NanoCLUST and QIIME 2 analyses were not completed here since the dataset lacked quality score information and was not compatible with either software. However, we expect neither software to be a top performer in species detection due to their inability to do so on the simpler simulated data shown above.
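The precision, recall and F-score trade-off can be reproduced from the species-level CAMI2 counts in Table 1. The sketch below assumes the standard definitions (precision = TP / (TP + FP), recall = TP / expected taxa, F-score as their harmonic mean); the function name is hypothetical.

```python
def precision_recall_f(tp, fp, expected):
    """Presence/absence statistics from true positive and false positive counts."""
    precision = tp / (tp + fp)
    recall = tp / expected
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Species-level CAMI2 counts from Table 1 (345 expected species)
emu = precision_recall_f(tp=330, fp=250, expected=345)
minimap2 = precision_recall_f(tp=343, fp=4665, expected=345)

for name, stats in (("Emu", emu), ("minimap2", minimap2)):
    print(name, [round(x, 3) for x in stats])
```

Under these definitions, minimap2 has slightly higher recall, but its precision falls below 0.1 because false positives outnumber true positives by more than ten to one, so its F-score is far lower than that of Emu.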
ZymoBIOMICS:
Emu on ONT reads expresses the lowest measured error distances across the methods tested at both the genus and species levels. While almost all methods accurately detect the 8 species in the sample, the number of false positives reported varies. Of the methods with perfect recall, NanoCLUST returns the fewest false positives and Emu returns the second fewest. It is also important to note that the abundance accuracy and sensitivity measured in the ONT dataset prove superior to those of the Illumina dataset, especially at the species level. When restricting to only the Illumina results, Emu again produces the lowest L1 distance. While Emu is not primed for Illumina 16S reads, this shows Emu to be a sensible approach regardless of the read-error profile. Figure 2 provides a graphical representation of accuracy measures.
Figure 2. Performance on simulated ONT reads.

(a) Quantitative result statistics for our MBARC-26 simulated dataset. Heatmap of species-level error between expected and inferred relative abundances, where darker blue denotes an underestimate by the software, darker red denotes an overestimate, and white represents no error. Color scheme is capped at ±10%, so errors beyond ±10% are shown in the maximum error colors. Displayed are the 20 species claiming the largest abundance in any of the included results. “Other” represents the sum of all species not shown in the figure for the respective column. Species-level L1-norm, L2-norm, precision, recall, and F-score are also plotted for the methods evaluated. (b) The same statistics shown in (a) for our CAMI2 simulated dataset.
Synthetic gut microbiome:
Emu on ONT reads once again shows the best or near-best performance on these metrics for the synthetic gut microbiome community. This is an intentionally challenging community containing several microbes that, even based on putative input abundance, are below 0.01% relative abundance. This is a particular form of stress-test for Emu because the EM algorithm specifically down-weights low-abundance taxa that are closely related to those in higher abundance (reflecting the likelihood that sequencing error accounts for the match). Nonetheless, Emu reports the best L1 distance at the species level. Centrifuge reports the best L2 distance, although this statistic is impacted by the species that is not present in the database: Romboutsia hominis. In standard classification methods, these reads are classified under an assortment of Romboutsia species; however, in methods involving statistical re-estimation, these reads are labeled as a close relative, which ultimately increases squared error. In terms of presence-absence calls, Emu is only one species shy of the highest TP count but has far fewer FPs than every method aside from NanoCLUST. Although NanoCLUST does report the lowest FP counts, it also detects fewer TPs than others. Results are visualized in Extended Data Figure 3.
Comparing EM iterations within Emu
To get a sense for the value of the error-correction step, Figure 4 visualizes relative abundance error calculated after subsequent EM iterations in Emu on the ZymoBIOMICS mock community ONT dataset. The error decreases with each iteration, and the reduction is especially clear for the Bacillus genus. The species that is present as per the manufacturer is B. subtilis, yet B. halotolerans differs from it by fewer than 15 bases over the length of the entire 16S gene. Between this similarity and the high error in ONT reads, we would expect a large fraction of reads to map to B. halotolerans. Our minimap2 primary alignment results do just that by classifying the Bacillus species reads as roughly 67% B. subtilis, 18% B. halotolerans, and 15% distributed amongst 21 other species. In Emu’s final estimate, however, over 92% of the Bacillus reads are assigned to B. subtilis, while only 5 additional Bacillus species are falsely identified to account for the remaining 8%.
Figure 4. Relative error after consecutive expectation-maximization (EM) iterations within Emu on ZymoBIOMICS ONT reads.

Relative error of the Emu algorithm after 1, 2, 3, 4, 5, 10, 15, and 20 EM iterations as well as the final Emu output (out) on our ZymoBIOMICS sample sequenced by an ONT device. The 20 most abundant species in the computational estimate are displayed. X-axis denotes the number of completed EM iterations for the results portrayed in the respective column. “Out” represents the final Emu output, which includes threshold trimming and final re-estimation after 22 EM iterations. Darker blue represents an underestimate by the method, while darker red represents an overestimate. Color scheme is capped at ±5%, so errors beyond ±5% are shown in the maximum error color. False positive count and L1-norm are reported for each iteration with the ZymoBIOMICS guaranteed minimum abundance threshold of 0.01% applied.
Database Comparison
To evaluate the default Emu database compared to a larger, well-reputed32 16S database, results were generated with the Ribosomal Database Project (RDP)33 using Emu, minimap2, Kraken 2 and Bracken for our four ONT test datasets. Performance of each method was evaluated at both the genus and species level using three metrics: the L1-norm of the taxonomic abundance profile, the count of true positive taxa (TP), and the count of false positive taxa (FP) (Supplementary Table 13). Computational resources required by each method were measured by recording the run time and memory usage for each software (Supplementary Table 14). Complete abundance profile outputs for all RDP database results can be found in Supplementary Tables 15–18. These results show Emu with the Emu default database to have the lowest L1-norm for all four test data sets at both the species and genus level compared to all other software tool and database combinations evaluated. While we would expect this from our simulated datasets (MBARC-26 and CAMI2) since these were simulated from sequences in the Emu database, these findings were mirrored in our mock communities as well.
Novel Species Simulation
To simulate the real-world scenario of communities containing novel species, or species that are not yet in our database, we used our CAMI2 dataset and removed sequences from our Emu and RDP databases. First, 35 of the 345 species in our simulated CAMI2 dataset were randomly selected; reads from these species comprised 9.5% of the total CAMI2 simulated reads. All database sequences classified under these 35 species were removed from both the Emu and RDP databases. Results were then generated for our complete CAMI2 dataset with both the incomplete Emu and RDP databases by Emu, minimap2, Kraken 2 and Bracken. Performance statistics L1-norm, TP, FP and unclassified percent are shown in Supplementary Table 19, while complete abundance profile outputs can be found in Supplementary Tables 20–21. These results show Emu with the Emu database to still produce the lowest L1-norm across taxonomic ranks evaluated when presented with novel species. A heatmap of the relative abundance error for families of the removed species with both the incomplete Emu and RDP databases is visualized in Extended Data Figure 4. This highlights the ability of Emu with the Emu database to accurately classify reads from novel species at the lowest taxonomic rank that is in the database.
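The hold-out procedure can be sketched as follows, assuming the database is a list of records that each carry a species label. The record layout, function names, random seed and toy data are hypothetical and stand in for the real Emu and RDP databases.

```python
import random

def hold_out_species(database, species_in_sample, k, seed=0):
    """Randomly pick k species and drop every database record labeled with them."""
    rng = random.Random(seed)
    held_out = set(rng.sample(sorted(species_in_sample), k))
    reduced = [rec for rec in database if rec["species"] not in held_out]
    return held_out, reduced

def read_fraction(read_species, held_out):
    """Fraction of simulated reads that originate from the held-out species."""
    return sum(s in held_out for s in read_species) / len(read_species)

# Toy database and the true source species of ten simulated reads
database = [
    {"id": "seq1", "species": "sp_A"},
    {"id": "seq2", "species": "sp_A"},
    {"id": "seq3", "species": "sp_B"},
    {"id": "seq4", "species": "sp_C"},
    {"id": "seq5", "species": "sp_D"},
]
reads = ["sp_A"] * 5 + ["sp_B"] * 3 + ["sp_C"] * 2

held_out, reduced = hold_out_species(database, set(reads), k=1)
print(held_out, len(reduced), read_fraction(reads, held_out))
```

The full read set is then profiled against the reduced database, so reads from the held-out species can only be assigned to relatives that remain in it.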
Read Mapping Software Comparison
To evaluate the impact of the read mapping software in Emu, results were generated with a version of Emu with BWA-MEM10 as the read mapper (Supplementary Table 22). In both datasets tested, the error measured (L1-norm and L2-norm) and number of false positives decreased after the EM algorithm was applied to the read mapping results.
Minimum Abundance Threshold Comparison
To evaluate the error-correction step in Emu as opposed to naively applying a minimum abundance threshold, we applied the Emu default minimum abundance threshold of 10 reads to the results shown in Table 1. These results, found in Supplementary Table 23, show Emu to still report the fewest false positives across tested datasets.
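A naive threshold of this kind amounts to a simple filter on the per-taxon read counts. The sketch below uses hypothetical function names and made-up counts to show how such a filter removes low-count false positives but leaves any false positive that exceeds the cutoff.

```python
def apply_min_reads(read_counts, min_reads=10):
    """Keep only taxa supported by at least min_reads reads."""
    return {t: n for t, n in read_counts.items() if n >= min_reads}

def false_positives(profile, truth):
    """Reported taxa that are not in the ground truth set."""
    return sorted(t for t in profile if t not in truth)

# Hypothetical classifier output: taxon -> number of assigned reads
truth = {"sp_A", "sp_B"}
read_counts = {"sp_A": 6200, "sp_B": 3700, "sp_C": 85, "sp_D": 9, "sp_E": 6}

kept = apply_min_reads(read_counts)
print(false_positives(read_counts, truth))
print(false_positives(kept, truth))
```

In this toy, the threshold removes sp_D and sp_E but retains sp_C, a false positive supported by many misassigned reads. This is the kind of error that a fixed cutoff cannot address on its own.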
Research Application: Human Vaginal Microbiome
Variation in vaginal microbiota is associated with several urogenital diseases (e.g. bacterial vaginosis)34,35, a variety of sexually transmitted infections (e.g. HIV), and uncategorized phenotypes such as reproductive success34.37. Vaginal microbial communities can be classified into six so-called “community state types” (CSTs) I, II, III, IV-A, IV-B, and V, which are defined by relative abundance of several Lactobacillus species and the presence of anaerobic bacteria38,39. We generated community composition from 12 vaginal samples, 6 with diagnosed bacterial vaginosis and 6 controls, using Emu and an established whole-genome shotgun (WGS) metagenomic approach. We compared CST characterizations between the two pipelines to test Emu’s ability to reproduce accepted community clusters.
Experimental design
12 vaginal swabs were obtained from the German Centre for Infections in Gynecology and Obstetrics at Helios Hospital Wuppertal and prepared in the Institute of Medical Microbiology, Virology, and Hospital Hygiene at the University of Duesseldorf. Samples 1–6 originate from control group patients and samples 7–12 from patients with diagnosed bacterial vaginosis. Each sample was sequenced by whole-genome and 16S ONT workflows. The whole-genome reads were processed into species- and genus-level abundance profiles using Kraken 2 and Bracken, while the 16S reads were processed with Emu.
16S and WGS data comparison
Comparison of 16S and WGS sequencing data is not trivial, even when sequencing libraries are prepared from the same nucleic acid prep; bias introduced during amplification and sequencing in marker gene sequencing may differ from the bias produced in WGS sequencing, which ultimately influences the bioinformatic analysis in each approach40. Still, this comparison is useful to present the benefits and limitations of 16S sequencing. Since swabs contained a significant portion of host DNA (98–99% of reads classified as human by Kraken 2 and Bracken), the number of bacterial reads was lower in WGS than 16S sequencing. To reduce bias due to imbalance in sensitivity between the two methods, a species detection threshold of 0.01% was set for Emu.
Table 2 displays the most abundant bacterial genera and four Lactobacillus species which are used as markers for inference of vaginal CST. Previous literature claims healthy vaginal microbial communities to be dominated by Lactobacillus species, while vaginal dysbiosis has been associated with high abundance of genera Gardnerella, Prevotella, Megasphaera, and Aerococcus41. Both pipelines, 16S and WGS, show relative abundance results agreeing with this previous research.
Table 2.
Relative abundance of dominant and marker taxa assigned by Emu from 16S rRNA ONT data and by Bracken from whole genome shotgun ONT data. Samples 1–6 belong to the control group and samples 7–12 to the vaginosis group. Dominant genera are defined as those showing over 10% abundance in at least one sample. CST-marker species of Lactobacillus are defined in previous literature38,39. Values are rounded and “-” is used to denote true zero values.
| Taxon | Pipeline | 1 (control) | 2 (control) | 3 (control) | 4 (control) | 5 (control) | 6 (control) | 7 (vaginosis) | 8 (vaginosis) | 9 (vaginosis) | 10 (vaginosis) | 11 (vaginosis) | 12 (vaginosis) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dominant genera | | | | | | | | | | | | | |
| Lactobacillus | Emu | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 1.00 | 0.40 | 0.96 | 0.16 | 0.06 | 0.00 | 0.68 |
| Lactobacillus | Bracken | 0.88 | 0.95 | 0.95 | 0.60 | 0.98 | 0.97 | 0.11 | 0.64 | 0.05 | 0.57 | 0.02 | 0.34 |
| Gardnerella | Emu | - | - | - | - | - | - | - | - | - | 0.00 | - | - |
| Gardnerella | Bracken | 0.03 | 0.02 | 0.03 | 0.37 | 0.01 | 0.02 | 0.57 | 0.03 | 0.44 | 0.31 | 0.07 | 0.48 |
| Prevotella | Emu | - | - | - | - | - | - | 0.03 | 0.00 | 0.04 | 0.01 | 0.04 | 0.02 |
| Prevotella | Bracken | 0.01 | 0.01 | 0.01 | 0.01 | 0.00 | 0.01 | 0.09 | 0.09 | 0.13 | 0.12 | 0.53 | 0.05 |
| Megasphaera | Emu | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.32 | 0.50 | 0.00 | 0.25 |
| Megasphaera | Bracken | - | - | - | - | - | - | 0.00 | 0.00 | 0.06 | - | 0.00 | 0.05 |
| Aerococcus | Emu | - | - | - | - | - | - | 0.29 | - | 0.01 | 0.15 | 0.00 | 0.02 |
| Aerococcus | Bracken | - | 0.00 | - | - | - | - | 0.11 | 0.00 | 0.01 | - | 0.00 | 0.01 |
| CST marker species | | | | | | | | | | | | | |
| L. crispatus | Emu | 0.00 | 0.50 | 0.99 | 0.00 | 1.00 | 0.99 | 0.00 | 0.96 | 0.00 | 0.00 | 0.00 | 0.00 |
| L. crispatus | Bracken | 0.06 | 0.63 | 0.92 | 0.06 | 0.89 | 0.96 | 0.03 | 0.64 | 0.02 | 0.61 | 0.01 | 0.04 |
| L. gasseri | Emu | 0.00 | 0.00 | 0.00 | 0.98 | 0.00 | 0.00 | 0.00 | - | - | 0.00 | - | 0.00 |
| L. gasseri | Bracken | - | - | - | 0.46 | - | - | 0.00 | - | - | - | - | 0.00 |
| L. iners | Emu | 0.99 | 0.47 | 0.00 | 0.00 | 0.00 | 0.00 | 0.34 | 0.00 | 0.16 | 0.06 | 0.00 | 0.68 |
| L. iners | Bracken | 0.86 | 0.28 | 0.01 | 0.01 | 0.00 | - | 0.06 | 0.01 | 0.03 | - | 0.00 | 0.29 |
| L. jensenii | Emu | - | 0.02 | 0.01 | - | - | 0.01 | 0.05 | - | - | - | - | - |
| L. jensenii | Bracken | - | 0.02 | 0.02 | 0.05 | - | 0.01 | 0.01 | - | - | - | - | - |
| Inferred CST | Emu | 3 | 1 | 1 | 2 | 1 | 1 | 4 | 1 | 4 | 4 | 4 | 4 |
| Inferred CST | Bracken | 3 | 1 | 1 | 2/4 | 1 | 1 | 4 | 1 | 4 | 1/4 | 4 | 4 |
The most notable discrepancy between WGS and 16S amongst these genera is the relative abundance of G. vaginalis, where WGS depicts this species in significant abundance while 16S misses it almost entirely. This is a result of the same primer mismatch problem noted earlier for the family Bifidobacteriaceae, the parent family of G. vaginalis. Even with this bias, the inferred community state is consistent between the Emu and Bracken workflows across samples; both pipelines express the same dominant CST-marker species in 11 of the 12 samples. Sample 10 is the only sample with differing assignment between methods, which may be explained by the low sequencing depth acquired in the WGS approach for this sample. Despite variation between the community profiles generated by these two pipelines (Extended Data Figure 5), the community state types of samples with a clear inference (all except samples 4 and 10) were identical between pipelines, indicating congruency in the clinical outcome of these two approaches.
Discussion
Emu is a homology-aware alignment likelihood approach where read classification probabilities are adaptively updated based on read alignments to multiple reference sequences and the current community profile estimate. This iterative approach goes beyond simply classifying each read independently and utilizes information gathered from the entire community to enable accurate and robust community profiles despite high error rates in input sequences. Demonstrated error reduction (Figure 4) as well as superior results reported when comparing Emu’s output to both read mappers tested alone (Supplementary Table 22) exemplify the performance and flexibility of the employed adaptive likelihood model in Emu.
Emu is impactful for two main reasons: reducing the number of false positives and distinguishing between genomically-similar species. To get a sense for the false positive count reduction accomplished by the EM portion of Emu, we can compare results between Emu and minimap2. In each of our four ONT test sets, the false positives drop significantly, namely from 73 to 0, 4665 to 250, 45 to 6, and 252 to 40.
To observe Emu’s ability to distinguish between genomically-similar organisms, we can zoom in on two pairs of similar species in our MBARC-26 dataset. The first pair includes Salmonella bongori and Salmonella enterica, which have true relative abundances 0.04% and 0.17% respectively. The reference sequences for these species have an ANI of 97%, yet Emu is able to accurately depict the appropriate relative abundance for each of these species within 0.001%. A second similar pair includes Desulfosporosinus acidiphilus and Desulfosporosinus meridiei with relative abundances 7% and 2.6% respectively and an ANI between the two reference species of 94%. Emu accurately estimated each of their relative abundances within 0.0005% of the expected value.
While both MetaMaps and Emu were developed for long-read data and incorporate a form of the EM algorithm, their difference is best understood in terms of “horizontal” alignment (MetaMaps) and “vertical” alignment (Emu). MetaMaps was developed for the analysis of shotgun metagenomic data and can make use of homology information from entire microbial genomes to correctly place individual reads (“horizontal” alignment); this enables MetaMaps to skip (computationally expensive, in particular at the scale of reference databases used for whole-genome metagenomic analysis) base-level alignment between reads and reference sequences and limit itself to more efficient approximate alignment. Emu, on the other hand, is designed for the accurate analysis of an individual locus across a very large number of reference records (“vertical” alignment); as the 16S sequences of different species may differ by as little as a few bases, Emu operates under conditions in which individual alignment matches and mismatches need to be carefully examined while taking into account the increased error rate of ONT sequencing. Base-level alignment between individual reads and the 16S reference database (“vertical” alignment) thus becomes necessary as well as an EM approach that utilizes an alignment likelihood model tailored to the requirements of 16S sequence analysis.
Due to the nature of probabilistic models, Emu creates a long tail of low-abundance species. To avoid this long inaccurate list in the results, the built-in threshold for Emu is the equivalent abundance of 1 read for samples with fewer than 1000 reads or 10 reads for anything larger. This means Emu will not be able to detect microbes with abundance lower than this threshold. This occurs in our synthetic gut mock community, which contains only 5 C. leptum reads, and thus Emu does not report this species. Therefore, in use cases where detection of ultra-low-abundance organisms is imperative, Emu would likely not be the best tool.
The optimal abundance threshold cutoff to distinguish between true assignments and noise resulting from the EM algorithm is an open question for Emu. A future model could utilize the statistical information from the sample to establish a minimum abundance threshold rather than the current somewhat arbitrary cutoff explained above. A second parameter setting open for future development is the number of secondary alignments kept from the minimap2 output. The current default of up to 50 alignments was selected with a parameter sweep evaluating the tradeoff between accuracy and additional computing cost, and can be modified at the command line. A future version of Emu could incorporate an algorithm to determine an optimal value given the input sample.
Since Emu is a full-length alignment-based approach, more computational resources (memory and time) are required than alternative methods. This may prevent certain users from incorporating Emu into their pipeline depending on access to appropriate computing resources. Future work in this area could reduce these requirements.
Emu is a closed-reference approach, which ultimately restricts output to only those bacteria and archaea that are present in the database. As seen in our novel species simulation experiment, we would expect Emu to classify sequences from novel species as a near neighbor that is present in the database and therefore accurately classify these sequences at the lowest taxonomic rank that is present in the database. Further work in this area could label reads from novel species as “unclassified” instead. However, as shown in Supplementary Table 19, this is a complex task given that the use of the larger database RDP as well as k-mer based technologies Kraken 2 and Bracken also fail to label novel species as “unclassified” when it comes to high-error full-length 16S reads.
In addition to comparing 16S microbial community profiling methods, the results in this manuscript also bring awareness to the differences between 16S sequencing technologies by ONT and Illumina. In both mock communities, ONT reads are able to deliver lower L1 and L2 distances than Illumina reads. An argument can be made that selection of a different hypervariable region could have shown better results for Illumina reads; however, this resembles the actual decision making process for Illumina users and the potential bias it may produce. It is also important to note that Emu and NanoCLUST are the only tools evaluated here that were designed for 16S, thus we expect full-length reads with these two methods to produce the most accurate results.
The potential for long, single molecule reads to deliver higher-resolution pictures of microbial communities from single pass sequencing remains enticing, but the high rate of sequencing error has presented a formidable obstacle. Specifically, while short reads are constrained in sensitivity below the genus level, long reads are not; rather, their difficulty is with specificity. In the case of long reads applied to 16S amplicon sequencing, Emu represents an important improvement in minimizing this trade-off and has the potential to show the communities of well-studied environments in a new light. Since the error profiles are dynamically learned from the input data, Emu has the flexibility to adjust as sequencing technologies develop and presents expanded use cases for novel third generation sequencing technologies in future applications.
Methods
Emu algorithm
Figure 1 contains a high-level pictorial representation of the algorithm, while Extended Data Figure 1 presents a more detailed visual of the equations in the algorithm. The pipeline begins by generating alignments between input reads and database sequences with minimap228. With these mappings, the following steps are completed: establishing initial probabilities, redistributing sample composition with an EM algorithm, then ultimately trimming noise for a final estimation.
Initial probabilities
To apply the EM framework, we need: (1) an initial sample composition estimate for vector F, and (2) alignment likelihoods P(r|s) between each sample read r and database reference sequence s ∈ S. Since we do not have any pre-existing knowledge regarding the sample composition, F starts as an evenly distributed vector, $F(t) = 1/|T|$ for every $t \in T$, where T is the set of all taxonomy identifications in S. To identify alignment likelihoods P(r|s), we start by generating pairwise sequence alignments between r ∈ R and s ∈ S with minimap2, where R represents all reads in the sample. We determine the likelihood for nucleotide alignment types: mismatch (X), insertion (I), deletion (D), and softclip (S), by counting the number of occurrences of each nucleotide alignment type in the primary alignments. We define these probabilities with a simple proportion: $P(c) = n_c / \sum_{c' \in C} n_{c'}$, where C=[X,I,D,S] and $n_c$ is the sum of occurrences of that type amongst all the primary alignments.
With the likelihood for each type of nucleotide alignment, the likelihood for each pairwise sequence alignment r ∈ R, s ∈ S is calculated as $P(r|s) = \prod_{c \in C} P(c)^{\hat{n}_c(r,s)}$, where $\hat{n}_c(r,s)$ is the normalized number of alignment type c observed in the alignment between r and s. The count for each alignment type is normalized by the length of the longest alignment for read r divided by the length of the alignment: $\hat{n}_c(r,s) = n_c(r,s) \cdot \max_{s' \in S} \ell(r,s') / \ell(r,s)$, where $n_c(r,s)$ is the raw count and $\ell(r,s)$ is the length of the alignment between r and s. This normalization is to account for variation in alignment lengths for a given read, which is caused by deletions. In the event that no alignment is generated between r and s, P(r|s) = 0. Since we are interested in the most-likely taxonomy of r rather than the most-similar sequence s, we keep only the highest P(r|s) for any s with species-level taxonomy identification t. Thus, the alignment probability between each read r and species-level taxonomy t is calculated with $P(r|t) = \max_{s \in t} P(r|s)$, where s ∈ t represents all s with taxonomy id t. With initial probabilities set, we can now improve our sample composition estimation with an EM probabilistic model.
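The initial-probability step can be sketched in Python as follows. This is a minimal illustration rather than Emu's own code: the function names, the alignment record layout (`read`, `taxon`, `length`, `counts`) and the choice to work in log space are assumptions made for the example, and the per-type counts are assumed to be multiplied by the ratio of the longest alignment length for the read to the length of the current alignment.

```python
import math

OPS = ("X", "I", "D", "S")  # mismatch, insertion, deletion, softclip


def op_log_probs(primary_counts):
    """log P(c): share of each alignment type among all primary alignments."""
    total = sum(primary_counts[c] for c in OPS)
    return {
        c: math.log(primary_counts[c] / total) if primary_counts[c] > 0 else -math.inf
        for c in OPS
    }


def read_taxon_log_likelihoods(alignments, log_p):
    """Return {(read, taxon): log P(r|t)}, keeping the best reference per taxon.

    `alignments` is a list of dicts with hypothetical keys:
    read, taxon, length (alignment length), counts ({op: occurrences}).
    """
    # Longest alignment length observed for each read.
    longest = {}
    for a in alignments:
        longest[a["read"]] = max(longest.get(a["read"], 0), a["length"])

    best = {}
    for a in alignments:
        scale = longest[a["read"]] / a["length"]  # length normalization
        ll = 0.0
        for c in OPS:
            n = a["counts"].get(c, 0)
            if n > 0:
                ll += n * scale * log_p[c]
        key = (a["read"], a["taxon"])
        # Keep only the highest likelihood among references sharing a taxon.
        best[key] = max(best.get(key, -math.inf), ll)
    return best
```

Read-taxon pairs without any alignment are simply absent from the returned dictionary, which corresponds to P(r|t) = 0.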
Redistribute sample composition
The likelihood that read r emanated from species t is constructed for each P(r|t) using Bayes’ Theorem, $P(t|r) = \frac{P(r|t) F(t)}{\sum_{t' \in T} P(r|t') F(t')}$. With these probabilities, F is redistributed as $F(t) = \frac{1}{|R|} \sum_{r \in R} P(t|r)$. The accuracy of this estimate is evaluated by total log likelihood, $L(R) = \sum_{r \in R} \log \left( \sum_{t \in T} P(r|t) F(t) \right)$, which increases with each iteration. If the improvement in L(R) over the previous iteration is substantial (> .01), then this re-estimation step is repeated with the updated F. Otherwise, redistribution is complete, and we move to the final phase of the algorithm.
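A compact sketch of this redistribution loop is given below. It is an illustration under stated assumptions, not Emu's implementation: `em_composition`, its arguments and the `max_iter` safeguard are hypothetical, likelihoods are passed as plain probabilities in a sparse dictionary, and the log likelihood compared between iterations is the one evaluated with the composition used in that iteration's Bayes step.

```python
import math


def em_composition(p_r_given_t, taxa, tol=0.01, max_iter=1000):
    """Estimate composition F from sparse likelihoods {read: {taxon: P(r|t)}}."""
    F = {t: 1.0 / len(taxa) for t in taxa}  # evenly distributed start
    prev_ll = -math.inf
    ll = -math.inf
    for _ in range(max_iter):
        new_F = dict.fromkeys(taxa, 0.0)
        ll = 0.0
        n_used = 0
        for probs in p_r_given_t.values():
            # Numerators of Bayes' theorem: P(r|t) * F(t).
            weighted = {t: p * F[t] for t, p in probs.items()}
            denom = sum(weighted.values())
            if denom <= 0.0:
                continue  # read has no support under the current F
            ll += math.log(denom)
            n_used += 1
            for t, w in weighted.items():
                new_F[t] += w / denom  # accumulate P(t|r)
        if n_used == 0:
            break
        # Redistribute: average P(t|r) over the reads that contributed.
        F = {t: v / n_used for t, v in new_F.items()}
        if ll - prev_ll <= tol:
            break  # improvement no longer substantial
        prev_ll = ll
    return F, ll
```

Taxa that no read supports end at exactly zero, and taxa supported only weakly drift toward zero over the iterations, which is what produces the long low-abundance tail handled in the next step.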
Trim noise for final estimation
Due to the nature of the probabilistic structure in an EM model, vector F is likely to contain a long tail of species claiming low abundance. To avoid this long list of false positives in the output, any abundance below the set threshold will be modified to 0 at this stage. The default threshold for Emu is the abundance equivalent to 1 read for samples with under 1000 reads and 10 reads for larger samples; however, the user can modify this parameter. Once F is trimmed, Emu enters one final round of abundance redistribution. The resulting F is exposed as the final sample composition estimation.
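The default cutoff described above can be written out as a short sketch. The function name and the `min_reads` override are hypothetical and only stand in for the user-adjustable parameter mentioned in the text.

```python
def trim_composition(F, n_reads, min_reads=None):
    """Zero out abundances below the read-equivalent threshold."""
    if min_reads is None:
        # Default: 1 read for samples under 1000 reads, 10 reads otherwise.
        min_reads = 1 if n_reads < 1000 else 10
    threshold = min_reads / n_reads
    return {t: (f if f >= threshold else 0.0) for t, f in F.items()}
```

The trimmed vector no longer sums to 1; in the workflow described here it would be passed through one more redistribution round, which restores a proper composition.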
Simulated read generation
Two simulated datasets were generated to mimic ONT full-length 16S reads. First, 958,655 ONT reads were simulated using DeepSimulator v1.542 with default settings on a synthetic metagenomic community structure following the composition of published mock community MBARC-2629. Reference 16S rRNA sequences were obtained from 16S RefSeq nucleotide sequence records43. For strains not present in the RefSeq 16S rRNA sequence database, all strains under the same species as the desired strain were used instead.
Since CAMISIM does not currently have the functionality to simulate 16S data, the simulator in its pipeline, NanoSim44, was used in isolation following the CAMI230 mouse gut profile (https://www.microbiome-cosi.org). 16S rRNA sequences were selected from 16S RefSeq43 based on taxonomy IDs in the described CAMI2 mouse gut profile. For unfound organisms, 16S rRNA sequences were selected from rrnDB45 v5.6 by name instead. The number of reads simulated for each microbe was calculated as $10^7 \times$ relative abundance to ensure each species contained at least one simulated read. Since the generated dataset contained over 400 million reads, this dataset was then subsampled down to 1% to reduce computational load and resulted in 4,310,093 reads.
Creation of gut microbiome mock community
Each gut bacterium was activated and propagated individually in brain heart infusion (BHI) medium supplemented with hemin (5 mg/L) and yeast extract (10 g/L). The plate counting method was used to determine viable cells of cultures after 4 hours of anaerobic cultivation at 37 °C; all bacterial strains were combined with equal volume of 100 μL. Cultures were then centrifuged at 12,000 g for 10 min before extra bacterial lysis with lysozyme followed by DNA extraction using MasterPure™ Complete DNA and RNA Purification Kit. DNA was quantified by Qubit kit.
Sequencing mock communities
ZymoBIOMICS
Detailed description of steps taken to sequence the ZymoBIOMICS sample can be found in the Materials and Methods section of the study which produced these sequences.
Synthetic gut mock
Library construction and sequencing of V4 region of the 16S ribosomal RNA gene were performed using the NEXTflex 16S V4 Amplicon-Seq Kit 2.0 (Bioo Scientific, Austin, TX) with 20 ng of input DNA, and sequences were generated on the Illumina MiSeq platform (Illumina, San Diego, CA).
Library construction and sequencing of the full-length 16S gene were performed with MinION nanopore sequencer (Oxford Nanopore Technologies, Oxford, UK) and 16S Barcoding Kit 1–24 (SQK-16S024, Oxford Nanopore Technologies, Oxford, UK). The PCR amplification and barcoding was completed with 15 ng of template DNA added to the LongAmp Hot Start Taq 2X Master Mix (New England Biolabs, Ipswich, MA). Initial denaturation at 95°C was followed by 35 cycles of 20 s at 95°C, 30 s at 55°C, 2 min at 65°C, and a final extension step of 5 min at 65°C. Purification of the barcoded amplicons was performed using the AMPure XP Beads (Beckman Coulter, Brea, CA) as per ONT’s instructions. Samples were then quantified using Qubit fluorometer (Life Technologies, Carlsbad, CA) and pooled in equimolar ratio to a total of 50–100 ng in 10 μl. The pooled library was then loaded into an R9.4.1 flow cell and run as per the manufacturer’s instructions. The MINKNOW software 19.12.5 was used for data acquisition.
Emu 16S database
The default database of Emu is a combination of rrnDB version 5.645 and NCBI 16S RefSeq downloaded on September 17, 202043. Duplicate species-level entries, defined as entries with identical sequences and species-level identification, were removed. The resulting database contains 49,301 sequences from 17,555 unique microbial species. Database taxonomy was also retrieved from NCBI on the same date as the RefSeq download. This database can be reproduced by utilizing the build custom database option in Emu on both the rrnDB and RefSeq sequences separately, then concatenating the results.
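The duplicate-removal rule stated above (identical sequence and identical species-level identification) can be sketched as follows. The record layout and function name are hypothetical, and treating sequences case-insensitively is an assumption of this example rather than something stated in the text.

```python
def drop_duplicate_entries(records):
    """Keep the first record for each (species tax ID, sequence) pair.

    `records` is an iterable of (seq_id, species_tax_id, sequence) tuples.
    """
    seen = set()
    kept = []
    for seq_id, tax_id, seq in records:
        key = (tax_id, seq.upper())  # assumption: case-insensitive comparison
        if key in seen:
            continue
        seen.add(key)
        kept.append((seq_id, tax_id, seq))
    return kept
```

Identical sequences assigned to different species are both retained under this rule, since only entries that agree on both fields count as duplicates.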
Emu was first tested with the NCBI 16S RefSeq database on our ONT ZymoBIOMICS sample. This yielded subpar accuracy (Supplementary Table 24), which we attribute to the large number of reference sequences containing ambiguous bases. To increase the number of complete sequences in our database, rrnDB version 5.6 was added since it contains species-level taxonomy and few ambiguous bases.
Three popular 16S databases are Greengenes46, Ribosomal Database Project (RDP)33, and SILVA47. Although each of the three contains far more sequences than our curated Emu database, species-level annotation in Greengenes is relatively low, SILVA does not map completely to the NCBI taxonomy, and RDP did not perform as well in our experiments. Given Emu’s reliance on mapping each read to several database sequences, we have found a smaller, well-curated database to yield better performance in Emu. We have therefore created the default database of Emu as explained above, but have also pre-built an RDP database for Emu that is publicly available.
Emu RDP database
An Emu-compatible Ribosomal Database Project (RDP)33 version 11.5 database was generated with Emu’s build-database function for the database comparison quantitative results shown in Supplementary Table 13 and computational resource results shown in Supplementary Table 14. To construct this database, Bacteria 16S and Archaea 16S unaligned fasta sequences were downloaded from the RDP website and the NCBI taxonomy database was downloaded in January 202248. Mappings between the RDP fasta sequence IDs and NCBI taxonomy IDs were generated using the NCBI accession2taxid database48. Sequences that mapped to a tax ID that was no longer in the NCBI taxonomy were removed. In addition, sequences that mapped to “uncultured organism” (tax ID: 155900), bacteria sequences that mapped to “uncultured bacterium” (tax ID: 77133) and archaea sequences that mapped to “uncultured archaeon” (tax ID: 115547) were removed. The resulting Emu database contains 1,089,863 sequences and can be downloaded via GitLab: https://gitlab.com/treangenlab/emu. The input files to create the Emu RDP database were then used to create a Kraken 2 database and generate both Kraken 2 and Bracken results. The Emu-compatible RDP database fasta file was used to generate minimap2 RDP results.
16S quantitative comparison
Barcodes were removed from each mock community dataset using the trim_barcodes function in Guppy Basecalling Software v4.4.249 for our ONT datasets and Trimmomatic v0.39 for Illumina. An equivalent of the default database of Emu was built for each software. “Unclassified” reads, or those that failed to match a reference sequence, were removed prior to calculating relative abundance for each method. Supplementary Note 1 contains a detailed list of all commands used.
Minimap2
Minimap2 v2.24 classifications were generated by selecting the top database hit for each input read. The preset option for ONT was used for our long-read data and the genomic short-read mapping preset was used for our Illumina data. Relative abundance of each species was calculated as the number of species classifications divided by the total number of aligned reads.
Kraken 2
Kraken 2 v2.1.1 was used to generate a custom database matching our Emu default, then ultimately produce classification results. To calculate relative abundance from Kraken 2 classification, the “clade counts” column from the Kraken 2 report (kreport) was used. For species-level results, only rows with “rank:S” were kept. Relative abundance for each species was then defined as the “clade counts” for that species divided by the total number of “clade counts” in the reduced kreport. This process is then repeated at the genus-level by restricting the kreport to only those rows with “rank:G”.
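The report post-processing described above can be sketched as below. The sketch assumes the standard six-column, tab-delimited Kraken 2 report layout (percentage, clade-level read count, directly assigned read count, rank code, taxonomy ID, indented name); the function name is hypothetical and this is not the script used for the published results.

```python
def kreport_relative_abundance(lines, rank="S"):
    """Relative abundance per taxon from clade counts at one rank code."""
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        if fields[3].strip() != rank:
            continue  # keep only rows at the requested rank (e.g. "S" or "G")
        name = fields[5].strip()
        counts[name] = counts.get(name, 0) + int(fields[1])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: c / total for name, c in counts.items()}
```

Calling the function once with `rank="S"` and once with `rank="G"` mirrors the species-level and genus-level passes; sub-ranks such as `S1` are excluded because the rank code is matched exactly.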
Bracken
Bracken v2.5.0 was used to gather microbial abundance estimates from our Kraken 2 results. For full-length ONT reads, our custom Kraken 2 database was converted to a Bracken database with read lengths 1,500. The same process was applied for Illumina data, with read lengths of 250 and 300 used for ZymoBIOMICS and synthetic gut microbiome mock, respectively. Bracken abundance estimations were then generated for each dataset at the genus and species level.
NanoCLUST
Since NanoCLUST utilizes a BLAST database, a custom BLAST database was created to match our Emu default database. NanoCLUST v1.0 was then run on each of our long-read samples with the docker profile option. Since NanoCLUST generates relative abundance estimates at each taxonomic rank by default, no further processing was necessary.
Centrifuge
Centrifuge v1.0.4 was used to build a custom database and generate taxonomic classification of our 4 ONT datasets. The kreport generation functionality within Centrifuge was then incorporated to create Kraken-style reports for each Centrifuge classification result. Genus- and species-level relative abundance results were calculated from these kreports in the same manner as Kraken 2 results described above.
MetaMaps
MetaMaps v.0.1 was used to build a custom database from the Emu default 16S database. The datasets were analyzed with the following alterations to default settings: the estimated alignment identity target parameter was set to 90 (--pi 90) for all datasets for improved performance and the minimal read length was set to 500 (-m 500) for CAMI2 data since the dataset had shorter reads. Genus- and species-level relative abundance results were calculated from the output file ending in EM.WIMP by selecting all rows with the appropriate “AnalysisLevel” (genus or species) then using the values in the “EMFrequency” column directly.
QIIME 2
QIIME 2 results were produced with the classify-sklearn Naive Bayes classifier workflow of QIIME 2 2020.11.1. First, a QIIME 2 artifact representation of the default Emu database was generated with the appropriate QIIME 2 import command. Then, reference sequences were extracted appropriately based on the primer used for each sample and fit to the reference taxonomy to produce a QIIME 2 classifier. The already demultiplexed sample reads were denoised (Illumina) or dereplicated (ONT), and then classified with the appropriate pre-fit classifier. The taxonomic classifications were then collapsed to genus and species levels, and relative abundances were calculated separately for the two taxonomic rank results.
Establishing ground truth
A “Zymo-exclusive” database containing only the provided 16S assembled reference genomes for the eight bacterial species in the sample was created. The ZymoBIOMICS samples were then mapped (BWA-MEM v0.7.17 for short-read data, minimap2 v2.17 for long-read) to this “Zymo-exclusive” database for accurate classification of each read. Reads were classified as the top hit and ground truth relative abundances were derived from these results.
A restricted database for the 21 species known to be in our synthetic gut microbiome community was created by retrieving the NCBI 16S RefSeq entries for those species. This resulted in 45 sequences from 20 of the 21 species. Since Romboutsia hominis is present in the sample, but not RefSeq, a Romboutsia hominis sequence was selected from GenBank50, and included in the restricted database. Mapping, classification, and sample composition calculation follow the workflow for the ZymoBIOMICS community described above. This community however is subject to other undocumented contamination that may introduce bias.
Accuracy evaluation metrics
L1-norm is essentially the linear error and is calculated through equation: $L1 = \sum_{s \in S} |E_s - I_s|$, where set S consists of the union between all the species in the database and ground truth, and $E_s$ and $I_s$ are the respective expected and inferred relative abundances for species s. A perfect L1 distance is 0, while an entirely inaccurate sample composition estimate would return a L1 distance of 2 since $\sum_{s \in S} E_s = 1$ and $\sum_{s \in S} I_s = 1$. L2-norm is built on the sum of squared errors, which magnifies the cost of larger differences, and is calculated through equation: $L2 = \sqrt{\sum_{s \in S} (E_s - I_s)^2}$. Precision, recall, and F-score are used to evaluate microbe presence accuracy. For this explanation, TP represents true positives, FP represents false positives, and FN represents false negatives. Precision states the proportion of claimed positives that are truly present in the sample: $\text{precision} = TP / (TP + FP)$. Recall expresses the percentage of expected positives that were detected by the software: $\text{recall} = TP / (TP + FN)$. The F-score is simply the harmonic mean between the two values: $F = 2 \cdot \text{precision} \cdot \text{recall} / (\text{precision} + \text{recall})$. Since the ZymoBIOMICS sample is guaranteed to contain <0.01% foreign microbial DNA, all ZymoBIOMICS results are trimmed to include only taxa with abundance ≥ 0.01%, prior to calculating performance metrics.
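These metrics can be illustrated with a short sketch. The function names are hypothetical, abundances are assumed to be fractions that sum to 1 per profile, a taxon counts as present when its abundance is above zero, and the L2 distance is computed as the standard Euclidean norm of the difference, which is an assumption about the exact form intended here.

```python
import math


def l1_distance(expected, inferred):
    """Sum of absolute abundance differences over the union of species."""
    species = set(expected) | set(inferred)
    return sum(abs(expected.get(s, 0.0) - inferred.get(s, 0.0)) for s in species)


def l2_distance(expected, inferred):
    """Euclidean norm of the abundance differences (assumed form)."""
    species = set(expected) | set(inferred)
    return math.sqrt(
        sum((expected.get(s, 0.0) - inferred.get(s, 0.0)) ** 2 for s in species)
    )


def presence_metrics(expected, inferred):
    """Precision, recall and F-score from presence/absence calls."""
    truth = {s for s, a in expected.items() if a > 0}
    called = {s for s, a in inferred.items() if a > 0}
    tp = len(truth & called)
    fp = len(called - truth)
    fn = len(truth - called)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score
```

For two profiles with no species in common, `l1_distance` returns 2, matching the worst case described above.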
Computational resources
All software analysis was completed on an Ubuntu 18.04.4 LTS system, with the exception of MetaMaps runs, which were completed on CentOS Linux release 7.9.2009. The /usr/bin/time command was used to gather time and memory statistics. Reported CPU time is calculated by summing the user and sys time, and RAM requirements are reported as the maximum resident set size. The only exception is NanoCLUST, where computational requirements were extracted from the Nextflow execution report and timeline instead. Here, run time was gathered from the “CPU-Hours” output in the execution report and maximum resident set size was determined by the step with the largest memory usage (RAM) in the execution timeline. Computational requirements recorded for Bracken are an accumulation of both the Bracken and Kraken 2 commands, since both are required to produce the Bracken abundance estimation. Computational requirements for the QIIME 2 workflow are left out of this analysis as QIIME 2 involves several commands.
Clinical vaginal samples
Data generation
Total DNA and RNA were extracted using ZymoBIOMICS DNA/RNA Miniprep Kit R2002. The 16S Nanopore sequencing library was prepared from 10 ng of total DNA using 16S Barcoding Kit SQK-RAB204. The whole genome Nanopore library was prepared from remaining total DNA using Native Barcoding Expansion 1–12 (PCR-free) Kit EXP-NBD104 and Ligation Sequencing Kit SQK-LSK109. Libraries were sequenced on MinION flow cells type R9 (FLO-MIN106D) in two runs (16S run and whole genome run). Data was acquired with MINKNOW core v.4.0.5. Basecalling and demultiplexing were done using Guppy v.4.0.15.
Data analysis and databases
Computational analysis of vaginal samples was performed on a machine with CentOS Linux release 7.9.2009. Whole genome sequencing data were analyzed with Kraken v.2.1.1 and Bracken v.2.5.
Kraken 2 database was built from a custom metagenomic database, which includes all latest complete and reference genomes derived from RefSeq database in divisions bacteria, fungi, protozoa and viral of RefSeq (state 26.12.2019). The host portion of the metagenomic database is represented by 1000 genomes project reference sequence and two well-characterized human assemblies (GCA_001524155.4 and GCA_002009925.1).
Retrieved Bracken abundances at both genus- and species-level were recalculated considering only bacteria in order to align with 16S results. Therefore, the total of Bracken results belonging to superkingdom “Bacteria” was taken as 100% abundance for each sample.
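The recalculation amounts to renormalizing within the bacterial fraction, as in the sketch below. The function name and the `superkingdom` lookup dictionary are hypothetical; how each taxon is mapped to its superkingdom is outside the scope of this example.

```python
def renormalize_to_bacteria(abundance, superkingdom):
    """Rescale abundances so that bacterial taxa sum to 1.

    `abundance` maps taxon -> abundance; `superkingdom` maps taxon -> name.
    """
    bacterial = {
        t: a for t, a in abundance.items() if superkingdom.get(t) == "Bacteria"
    }
    total = sum(bacterial.values())
    if total <= 0:
        return {}
    return {t: a / total for t, a in bacterial.items()}
```

Non-bacterial taxa (for example host, fungal or viral assignments) are dropped before rescaling, so the returned profile is directly comparable to a 16S-derived profile.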
Emu was run on 16S sequencing data with a species detection threshold of 0.01%. Species- and genus-level abundances were retrieved from Emu output. CSTs were inferred from abundance profile considering dominance of four marker Lactobacillus species.
Extended Data
Extended Data Fig. 1.

Follow the grey-arrowed path until expectation-maximization (EM) iterations are complete, then pink arrows are followed to the final composition estimate. The method starts by establishing probabilities for each alignment type C=[mismatch (X), insertion (I), deletion (D), softclip (S)] through occurrence counts in the primary alignments. Next, alignment probability P(r|t) is calculated for each read, taxonomy pair (r,t) by assuming the maximum alignment probability between r and t. Meanwhile, an evenly distributed composition vector F is initialized. The EM phase is entered by determining P(t|r), the probability that r emanated from t, for all P(r|t). F is updated accordingly, and the total log likelihood of the estimate is calculated. If the total log likelihood is a significant increase over the previous iteration (>.01), then EM iterations continue. Otherwise, the loop is exited, and F is trimmed to remove all entries less than the set threshold. Now following the pink arrows, one final round of estimation is completed with the trimmed F to produce the final sample composition estimate.
Extended Data Fig. 2.

The theoretical values are taken from ZymoBIOMICS standard report of relative abundance estimates based on 16S gene copy numbers (https://files.zymoresearch.com/protocols/_d6305_d6306_zymobiomics_microbial_community_dna_standard.pdf). Truth_ONT and truth_illumina represent the ground truth relative abundances calculated for our ONT and Illumina datasets respectively, as described in the Establishing Ground Truth subsection under Methods.
Extended Data Fig. 3.

Heatmap of species-level error between calculated ground truth and estimated relative abundances, where darker blue denotes an underestimate by the software, darker red denotes an overestimate, and white represents no error. All Oxford Nanopore Technologies (ONT) errors are measured in relation to the ground truth of the ONT dataset, while Illumina errors are measured in relation to the ground truth for the Illumina dataset. Color scheme is capped at ±10, resulting in error greater than ±10% observing the maximum error colors. Displayed are the 20 species claiming the largest abundance in any of the ONT or Illumina sample results. “Other” represents the sum of all species not shown in figure for the respective column. Species-level L1-norm, L2-norm, precision, recall, and F-score are also plotted for the methods evaluated.
Extended Data Fig. 4.

Heatmap of family-level error between ground truth and estimated relative abundances for both the Emu and RDP incomplete databases (missing 35 of the 345 CAMI2 simulated species) with our CAMI2 dataset. Here, darker blue denotes an underestimate by the software, darker red denotes an overestimate, and white represents no error. Color scheme is capped at ±3, resulting in error greater than ±3% observing the maximum error colors. Displayed are the families of the 35 species that were removed from each of the databases.
Extended Data Fig. 5.

Species with estimated abundance of over 1% in at least one sample with either Emu or Bracken are shown. Data is grouped by condition: healthy control or vaginosis.
Supplementary Material
Figure 3. Performance on our ZymoBIOMICS community standard dataset.

Heatmap of species-level error between calculated ground truth and estimated relative abundances, where darker blue denotes an underestimate by the software, darker red denotes an overestimate, and white represents no error. All Oxford Nanopore Technologies (ONT) errors are measured in relation to the ground truth of the ONT dataset, while Illumina errors are measured in relation to the ground truth for the Illumina dataset. Color scheme is capped at ±10, resulting in error greater than ±10% observing the maximum error colors. Displayed are the 20 species claiming the largest abundance in any of the ONT or Illumina results. “Other” represents the sum of all species not shown in figure for the respective column. Species-level L1-norm, L2-norm, precision, recall, and F-score are also plotted for the methods evaluated. True and false positive counts used to calculate precision, recall, and F-score are restricted to species with relative abundance ≥0.01% to align with guidance from ZymoBIOMICS on maximum levels of contamination.
Acknowledgements
This work has been supported by Jürgen Manchot Foundation and Deutsche Forschungsgemeinschaft (DFG) award #428994620 (A.D., A.T., W.M., P.F. and E.G.). This work has also been supported by NIH grants from NIDDK P30-DK56338, NIAID R01-AI10091401, U01-AI24290 and P01-AI152999, and NINR R01-NR013497 (T.S. and Q.Wu). Q.Wa. and S.V. were supported in part by NIH grant R21NS106640 from National Institute for Neurological Disorders and Stroke (NINDS). K.C. was supported in part by a Ken Kennedy Institute Computational Science and Engineering Graduate Recruiting Fellowship. K.C., M.N., and T.T. were supported in part by NIH grant P01-AI152999 supported by National Institute of Allergy and Infectious Diseases (NIAID). K.C. and T.T. were supported in part by NSF EF-2126387. M.N. was funded by a fellowship from the National Library of Medicine Training Program in Biomedical Informatics and Data Science (T15LM007093, PI: Kavraki). Computational support and infrastructure were provided by the “Centre for Information and Media Technology” (ZIM) at the University of Düsseldorf (Germany). We would also like to thank two additional members of the Treangen Lab: Bryce Kille for technical support and Nicolae Sapoval for algorithm development.
Footnotes
Competing Interests Statement
The authors declare no competing interests.
The study of vaginal microbiomes was IRB-approved by the ethics committee of the Medical Faculty of Heinrich Heine University.
Code availability
Emu and all associated code are available on GitLab: https://gitlab.com/treangenlab/emu. Emu can be installed via Bioconda: https://anaconda.org/bioconda/emu. A Code Ocean capsule51 has been created at the time of this publication. All scripts and data used to compile quantitative comparison results can be found on GitLab: https://gitlab.com/treangenlab/emu-benchmark.
Data availability
All sequenced samples used in this study are publicly available on the Sequence Read Archive (SRA). Both ZymoBIOMICS data sets are under BioProject ID PRJNA587452 with SRA accessions SRR10391201 for ONT and SRR10391187 for Illumina31. Our gut mock community is under BioProject ID PRJNA725207. The 12 vaginal samples used for our real-world application demonstration are uploaded under BioProject ID PRJNA723982. Our simulated sequences are publicly available on OSF under project 56UF7 (https://osf.io/56uf7/). Databases used in this manuscript include: 16S RefSeq nucleotide sequence records (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/16S_process/), Ribosomal Database Project (RDP) version 11.5 (https://rdp.cme.msu.edu/) and rrnDB version 5.7 (https://rrndb.umms.med.umich.edu/).
References
- 1. Woese CR & Fox GE Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl. Acad. Sci. 74, 5088–90, DOI: 10.1073/PNAS.74.11.5088 (1977).
- 2. Martínez-Porchas M, Villalpando-Canchola E & Vargas-Albores F Significant loss of sensitivity and specificity in the taxonomic classification occurs when short 16S rRNA gene sequences are used. Heliyon 2, e00170, DOI: 10.1016/j.heliyon.2016.e00170 (2016).
- 3. Callahan BJ, Grinevich D, Thakur S, Balamotis MA & Yehezkel TB Ultra-accurate microbial amplicon sequencing with synthetic long reads. Microbiome 9, 130, DOI: 10.1186/s40168-021-01072-3 (2021).
- 4. Workman RE et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16, 1297–1305, DOI: 10.1038/s41592-019-0617-2 (2019).
- 5. Karst SM et al. High-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing. Nat. Methods 18, 165–169, DOI: 10.1038/s41592-020-01041-y (2021).
- 6. Wenger AM et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162, DOI: 10.1038/s41587-019-0217-9 (2019).
- 7. Nearing JT, Douglas GM, Comeau AM & Langille MGI Denoising the Denoisers: An independent evaluation of microbiome sequence error-correction methods. DOI: 10.7287/peerj.preprints.26566v1 (2018).
- 8. Altschul SF, Gish W, Miller W, Myers EW & Lipman DJ Basic local alignment search tool. J. molecular biology 215, 403–10, DOI: 10.1016/S0022-2836(05)80360-2 (1990).
- 9. Santos A, van Aerle R, Barrientos L & Martinez-Urtaza J Computational methods for 16S metabarcoding studies using Nanopore sequencing data. Comput. Struct. Biotechnol. J. 18, 296–305, DOI: 10.1016/j.csbj.2020.01.005 (2020).
- 10. Li H Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio] (2013).
- 11. Kiełbasa SM, Wan R, Sato K, Horton P & Frith MC Adaptive seeds tame genomic sequence comparison. Genome Res. 21, 487–493, DOI: 10.1101/gr.113985.110 (2011).
- 12. Benítez-Páez A, Portune KJ & Sanz Y Species-level resolution of 16S rRNA gene amplicons sequenced through the MinION™ portable nanopore sequencer. GigaScience 5, 4, DOI: 10.1186/s13742-016-0111-z (2016).
- 13. Fujiyoshi S, Muto-Fujita A & Maruyama F Evaluation of PCR conditions for characterizing bacterial communities with full-length 16S rRNA genes using a portable nanopore sequencer. Sci. Reports 10, 12580, DOI: 10.1038/s41598-020-69450-9 (2020).
- 14. Shin J et al. Analysis of the mouse gut microbiome using full-length 16S rRNA amplicon sequencing. Sci. Reports 6, 1–10, DOI: 10.1038/srep29681 (2016).
- 15. Kim D, Song L, Breitwieser FP & Salzberg SL Centrifuge: Rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729, DOI: 10.1101/gr.210641.116 (2016).
- 16. Juul S et al. What’s in my pot? Real-time species identification on the MinION™. bioRxiv 030742, DOI: 10.1101/030742 (2015).
- 17. Wood DE & Salzberg SL Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46, DOI: 10.1186/gb-2014-15-3-r46 (2014).
- 18. Valenzuela-González F, Martínez-Porchas M, Villalpando-Canchola E & Vargas-Albores F Studying long 16S rDNA sequences with ultrafast-metagenomic sequence classification using exact alignments (Kraken). J. Microbiol. Methods 122, 38–42, DOI: 10.1016/j.mimet.2016.01.011 (2016).
- 19. Bolyen E et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857, DOI: 10.1038/s41587-019-0209-9 (2019).
- 20. Lu J, Breitwieser FP, Thielen P & Salzberg SL Bracken: Estimating species abundance in metagenomics data. PeerJ Comput. Sci. 2017, e104, DOI: 10.7717/peerj-cs.104 (2017).
- 21. Lu J & Salzberg SL Ultrafast and accurate 16S rRNA microbial community analysis using Kraken 2. Microbiome 8, 124, DOI: 10.1186/s40168-020-00900-2 (2020).
- 22. Rodríguez-Pérez H, Ciuffreda L & Flores C NanoCLUST: a species-level analysis of 16S rRNA nanopore sequencing data. Bioinformatics DOI: 10.1093/bioinformatics/btaa900 (2020).
- 23. Di Tommaso P et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319, DOI: 10.1038/nbt.3820 (2017).
- 24. Dilthey AT, Jain C, Koren S & Phillippy AM Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat. Commun. 10, 1–12, DOI: 10.1038/s41467-019-10934-2 (2019).
- 25. Bray NL, Pimentel H, Melsted P & Pachter L Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527, DOI: 10.1038/nbt.3519 (2016).
- 26. Roberts A & Pachter L Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10, 71–73, DOI: 10.1038/nmeth.2251 (2013).
- 27. Wood DE, Lu J & Langmead B Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257, DOI: 10.1186/s13059-019-1891-0 (2019).
- 28. Li H Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100, DOI: 10.1093/bioinformatics/bty191 (2018).
- 29. Singer E et al. Next generation sequencing data of a defined microbial mock community. Sci. Data 3, 160081, DOI: 10.1038/sdata.2016.81 (2016).
- 30. Meyer F et al. Critical Assessment of Metagenome Interpretation - the second round of challenges. bioRxiv 2021.07.12.451567, DOI: 10.1101/2021.07.12.451567 (2021).
- 31. Winand R et al. Targeting the 16S rRNA Gene for Bacterial Identification in Complex Mixed Samples: Comparative Evaluation of Second (Illumina) and Third (Oxford Nanopore Technologies) Generation Sequencing Technologies. Int. J. Mol. Sci. 21, 298, DOI: 10.3390/ijms21010298 (2020).
- 32. Edgar R Taxonomy annotation and guide tree errors in 16S rRNA databases. PeerJ 6, e5030, DOI: 10.7717/peerj.5030 (2018).
- 33. Cole JR et al. Ribosomal Database Project: Data and tools for high throughput rRNA analysis. Nucleic Acids Res. 42, D633–D642, DOI: 10.1093/nar/gkt1244 (2014).
- 34. Smith SB & Ravel J The vaginal microbiota, host defence and reproductive physiology. The J. Physiol. 595, 451–463, DOI: 10.1113/JP271694 (2017).
- 35. Pybus V & Onderdonk AB Microbial interactions in the vaginal ecosystem, with emphasis on the pathogenesis of bacterial vaginosis. Microbes Infect. 1, 285–292, DOI: 10.1016/s1286-4579(99)80024-0 (1999).
- 36. Petrova MI, van den Broek M, Balzarini J, Vanderleyden J & Lebeer S Vaginal microbiota and its role in HIV transmission and infection. FEMS microbiology reviews 37, 762–792, DOI: 10.1111/1574-6976.12029 (2013).
- 37. Mendling W Vaginal Microbiota. Adv. Exp. Medicine Biol. 902, 83–93, DOI: 10.1007/978-3-319-31248-4_6 (2016).
- 38. Gajer P et al. Temporal dynamics of the human vaginal microbiota. Sci. Transl. Medicine 4, 132ra52, DOI: 10.1126/scitranslmed.3003605 (2012).
- 39. Ravel J et al. Vaginal microbiome of reproductive-age women. Proc. Natl. Acad. Sci. United States Am. 108 Suppl 1, 4680–4687, DOI: 10.1073/pnas.1002611107 (2011).
- 40. Brooks JP et al. The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC microbiology 15, 66, DOI: 10.1186/s12866-015-0351-6 (2015).
- 41. Onderdonk AB, Delaney ML & Fichorova RN The Human Microbiome during Bacterial Vaginosis. Clin. Microbiol. Rev. 29, 223–238, DOI: 10.1128/CMR.00075-15 (2016).
- 42. Li Y et al. DeepSimulator: A deep simulator for Nanopore sequencing. Bioinformatics 34, 2899–2908, DOI: 10.1093/bioinformatics/bty223 (2018).
- 43. O’Leary NA et al. Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–745, DOI: 10.1093/nar/gkv1189 (2016).
- 44. Yang C, Chu J, Warren RL & Birol I NanoSim: Nanopore sequence read simulator based on statistical characterization. GigaScience 6, DOI: 10.1093/gigascience/gix010 (2017).
- 45. Stoddard SF, Smith BJ, Hein R, Roller BR & Schmidt TM rrnDB: Improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development. Nucleic Acids Res. 43, D593–D598, DOI: 10.1093/nar/gku1201 (2015).
- 46. DeSantis TZ et al. Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072, DOI: 10.1128/AEM.03006-05 (2006).
- 47. Quast C et al. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Res. 41, D590–D596, DOI: 10.1093/nar/gks1219 (2013).
- 48. Schoch CL et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database : journal biological databases curation 2020, DOI: 10.1093/DATABASE/BAAA062 (2020).
- 49. Wick RR, Judd LM & Holt KE Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol. 20, 129, DOI: 10.1186/s13059-019-1727-y (2019).
- 50. Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J & Sayers EW GenBank. Nucleic Acids Res. 44, D67–D72, DOI: 10.1093/nar/gkv1276 (2016).
- 51. Clyburne-Sherin A, Fei X & Green SA Computational Reproducibility via Containers in Psychology. Meta-Psychology 3, DOI: 10.15626/MP.2018.892 (2019).
