ABSTRACT
Amplicon sequence variants (ASVs) have been proposed as an alternative to operational taxonomic units (OTUs) for analyzing microbial communities. ASVs have grown in popularity, in part because of a desire to reflect a more refined level of taxonomy, since they do not cluster sequences based on a distance-based threshold. However, both ASVs and overly narrow thresholds for identifying OTUs increase the risk of splitting a single genome into separate clusters. To assess this risk, I analyzed the intragenomic variation of 16S rRNA genes from the bacterial genomes represented in an rrn copy number database, which contained 20,427 genomes from 5,972 species. As the number of copies of the 16S rRNA gene in a genome increased, the number of ASVs also increased. There was an average of 0.58 ASVs per copy of the 16S rRNA gene for full-length 16S rRNA genes. A distance threshold of 5.25% was necessary to cluster full-length ASVs from the same genome into a single OTU with 95% confidence for genomes with 7 copies of the 16S rRNA gene, such as Escherichia coli. This research highlights the risk of splitting a single bacterial genome into separate clusters when ASVs are used to analyze 16S rRNA gene sequence data. Although there is also a risk of clustering ASVs from different species into the same OTU when using broad distance thresholds, that risk is of less concern than artificially splitting a genome into separate ASVs and OTUs.
IMPORTANCE
16S rRNA gene sequencing has engendered significant interest in studying microbial communities. There has been tension between trying to classify 16S rRNA gene sequences to increasingly lower taxonomic levels and the reality that those levels were defined using more sequence and physiological information than is available from a fragment of the 16S rRNA gene. Furthermore, the naming of bacterial taxa reflects the biases of those who name them. One motivation for the recent push to adopt ASVs in place of OTUs in microbial community analyses is to allow researchers to perform their analyses at the finest possible level that reflects species-level taxonomy. The current research is significant because it quantifies the risk of artificially splitting bacterial genomes into separate clusters. Far from providing a better representation of bacterial taxonomy and biology, the ASV approach can lead to conflicting inferences about the ecology of different ASVs from the same genome.
KEYWORDS: 16S rRNA gene, ASV, OTU, bioinformatics, microbial communities, microbial ecology, microbiome
OBSERVATION
16S rRNA gene sequencing is a powerful technique for describing and comparing microbial communities (1). Efforts to link 16S rRNA gene sequences to taxonomic levels based on distance thresholds date to at least the 1990s. The distance-based threshold that was developed and is now widely used was based on DNA-DNA hybridization approaches that are not as precise as genome sequencing (2, 3). More recently, genome sequencing has suggested that the widely used 3% distance threshold for operationally defining bacterial taxa is too coarse (4–6). As an alternative to operational taxonomic units (OTUs), amplicon sequence variants (ASVs) have been proposed as a way to adapt the thresholds suggested by genome sequencing to microbial community analysis using 16S rRNA gene sequences (7–10). It is widely understood that individual bacterial genomes often have multiple 16S rRNA genes that are not identical and that different genomes can carry different versions of the gene (11, 12). As a result, ASVs, or OTUs identified with too fine a threshold, could split a single genome into multiple clusters. Proponents of ASVs minimize concerns that most bacterial genomes have more than one copy of the rrn operon and that those copies are not identical (6, 13). Conversely, using too broad a threshold to define OTUs could cluster multiple bacterial species into the same OTU. An example of both is seen in the comparison of Staphylococcus aureus (NCTC 8325) and Staphylococcus epidermidis (ATCC 12228), where each genome has 5 copies of the 16S rRNA gene. Each of the 10 copies of the 16S rRNA gene in these two genomes is distinct, and they represent 10 ASVs. Conversely, if the copies were clustered using a 3% distance threshold, then all 10 ASVs would cluster into the same OTU.
The goal of this study was to quantify the trade-off of splitting a single genome into multiple clusters and the risk of clustering different bacterial species into the same cluster when using ASVs and various OTU definitions.
To investigate the variation in the number of copies of the 16S rRNA gene per genome and the intragenomic variation among copies of the 16S rRNA gene, I obtained 16S rRNA sequences from the rrn copy number database (rrnDB) (14). Among the 5,972 species represented in the rrnDB, there were 20,427 genomes. The median rrn copy number per species ranged between 1 (e.g., Mycobacterium tuberculosis) and 19 (Metabacillus litoralis). As the rrn copy number for a genome increased, the number of variants of the 16S rRNA gene in each genome also increased. On average, there were 0.58 variants per copy of the full-length 16S rRNA gene and averages of 0.32, 0.25, and 0.27 variants when considering the V3-V4, V4, and V4-V5 regions of the gene, respectively. Although a species tended to have a consistent number of 16S rRNA gene copies per genome, the number of total variants increased with the number of genomes that were sampled (see Fig. S1 in the supplemental material). For example, the 271 genome accessions of Mycobacterium tuberculosis in the rrnDB each had 1 copy of the gene per genome. However, across these accessions, there were 17 versions of the gene. An Escherichia coli genome typically had 7 copies of the 16S rRNA gene, with a median of 5 distinct full-length ASVs per genome (interquartile range of between 3 and 6). Across the 1,390 E. coli genomes in the rrnDB, there were 1,402 versions of the gene. These observations highlight the risk of selecting a threshold for defining clusters that is too narrow because it is possible to split a single genome into multiple clusters.
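The per-genome tallies described above can be sketched in Python as follows. This is a minimal illustration, not the R workflow used for the study: the function name `variants_per_genome`, the `(genome_id, sequence)` input layout, and the toy sequences are assumptions, and the ratio shown is a simple per-genome quotient that may differ from how the reported averages were estimated.

```python
from collections import defaultdict


def variants_per_genome(records):
    """Tally 16S rRNA gene copies and distinct variants (ASVs) per genome.

    `records` is an iterable of (genome_id, sequence) pairs, one pair per
    gene copy. This input layout is hypothetical, not the rrnDB file format.
    """
    copies = defaultdict(int)
    variants = defaultdict(set)
    for genome_id, sequence in records:
        copies[genome_id] += 1
        variants[genome_id].add(sequence.upper())  # identical copies collapse
    return {
        genome_id: {
            "copies": copies[genome_id],
            "variants": len(variants[genome_id]),
            "variants_per_copy": len(variants[genome_id]) / copies[genome_id],
        }
        for genome_id in copies
    }


# Toy data: genome "g1" has 3 copies and 2 variants; "g2" has a single copy.
toy_records = [("g1", "ACGT"), ("g1", "ACGT"), ("g1", "ACGA"), ("g2", "ACGT")]
summary = variants_per_genome(toy_records)
```

Keying the tallies by species rather than by genome would give the across-genome counts, such as the number of versions of the gene seen among all accessions of one species.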
One way to avoid splitting a single genome into multiple clusters is to cluster 16S rRNA gene sequences based on the distances between them. Therefore, I assessed the impact of the distance threshold used to define clusters of 16S rRNA genes on the propensity to split a genome into separate clusters. To control for uneven representation of genomes across species, I randomly selected one genome from each species and repeated the randomization 100 times. I observed that as the rrn copy number increased, the distance threshold required to reduce the ASVs in each genome to a single OTU increased (Fig. 1). Among species with 7 copies of the rrn operon (e.g., E. coli), a distance threshold of 5.25% was required to reduce full-length ASVs into a single OTU for 95% of the species. Similarly, thresholds of 5.25, 2.50, and 3.75% were required for the V3-V4, V4, and V4-V5 regions, respectively. However, if a 3% distance threshold were used, then ASVs from genomes containing fewer than 6, 6, 8, and 6 copies of the rrn operon would reliably be clustered into a single OTU for ASVs from the V1-V9, V3-V4, V4, and V4-V5 regions, respectively. Consequently, these results demonstrate that broad thresholds must be used to avoid splitting different operons from the same genome into separate clusters.
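The 95% criterion can be illustrated with a short Python sketch. It assumes, as a simplification, that a genome's ASVs collapse into one OTU once the threshold reaches the largest pairwise distance among its copies. The study itself used OptiClust on the full set of sequences, which this stand-in does not reproduce. The function name and the toy distances are hypothetical.

```python
def smallest_merging_threshold(max_intragenomic_distances, thresholds, coverage=0.95):
    """Return the smallest threshold (in percent) at which at least
    `coverage` of the genomes have all of their ASVs in a single OTU.

    Assumption: a genome counts as merged when its largest intragenomic
    distance is <= the threshold (a furthest-neighbor style rule, not
    OptiClust). Returns None if no threshold on the grid is large enough.
    """
    n_genomes = len(max_intragenomic_distances)
    if n_genomes == 0:
        raise ValueError("at least one genome is required")
    for threshold in sorted(thresholds):
        merged = sum(d <= threshold for d in max_intragenomic_distances)
        if merged / n_genomes >= coverage:
            return threshold
    return None


# Grid of 0.25% to 10.00% in steps of 0.25 percentage points.
thresholds = [0.25 * step for step in range(1, 41)]

# Hypothetical largest intragenomic distances (percent) for 10 genomes.
toy_max_distances = [0.0, 0.0, 0.1, 0.3, 0.3, 0.6, 0.9, 1.1, 2.4, 6.0]
needed = smallest_merging_threshold(toy_max_distances, thresholds)
```

With only 10 toy genomes, 95% coverage requires every genome to be merged, so the single outlier at 6.0% sets the answer. This mirrors how a few highly divergent genomes can push the required threshold well above 3%.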
At broad thresholds, 16S rRNA gene sequences from multiple species could be clustered into the same ASV or OTU. I again randomly selected one genome from each species to control for uneven representation of genomes across species. For this analysis, I measured the percentages of ASVs and OTUs that contained 16S rRNA gene sequences from multiple species (Fig. 2). Without using distance-based thresholds, 4.1% of the ASVs contained sequences from multiple species when considering full-length sequences, and 10.9, 16.2, and 13.1% contained sequences from multiple species when considering the V3-V4, V4, and V4-V5 regions, respectively. At the commonly used 3% threshold for defining OTUs, 27.4% of the OTUs contained 16S rRNA gene sequences from multiple species when considering full-length sequences, and 31.7, 34.3, and 34.8% contained sequences from multiple species when considering the V3-V4, V4, and V4-V5 regions, respectively. Although the actual fractions of ASVs and OTUs that contain sequences from multiple species are dependent on the taxonomic composition of the sequences in the rrnDB, this analysis highlights the trade-offs of using distance-based thresholds.
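The measurement above, the share of ASVs or OTUs holding sequences from more than one species, can be sketched in Python. The function name, the `(cluster_id, species)` input layout, and the toy data are assumptions for illustration. For ASVs, the cluster identifier can simply be the sequence itself.

```python
from collections import defaultdict


def fraction_multispecies(assignments):
    """Fraction of clusters (ASVs or OTUs) containing more than one species.

    `assignments` is an iterable of (cluster_id, species) pairs, one pair
    per 16S rRNA gene sequence. This layout is hypothetical.
    """
    species_by_cluster = defaultdict(set)
    for cluster_id, species in assignments:
        species_by_cluster[cluster_id].add(species)
    if not species_by_cluster:
        raise ValueError("at least one assignment is required")
    shared = sum(len(species) > 1 for species in species_by_cluster.values())
    return shared / len(species_by_cluster)


# Toy ASV table: the sequence "ACGT" occurs in two species, the rest in one.
toy_assignments = [
    ("ACGT", "Species A"),
    ("ACGT", "Species B"),
    ("ACGA", "Species A"),
    ("TTGA", "Species C"),
    ("TTGC", "Species C"),
]
toy_fraction = fraction_multispecies(toy_assignments)
```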
The results of this analysis demonstrate that there is a significant risk of splitting a single genome into multiple clusters if using ASVs or too fine a threshold to define OTUs. An ongoing problem for amplicon-based studies is defining a meaningful taxonomic unit (13, 15, 16). Since there is no consensus for a biological definition of a bacterial species (17–19), microbiologists must accept that how bacterial species are named is biased and that taxonomic rules are not applied in a consistent manner (e.g., see references 19 and 20). This makes it impossible to fit a distance threshold that matches a set of species names (21). Furthermore, the 16S rRNA gene does not evolve at the same rate across all bacterial lineages (15), which limits the biological interpretation of a common OTU definition. A distance-based definition of a taxonomic unit based on 16S rRNA gene or full-genome sequences is operational and not necessarily grounded in biological theory (15, 22–24). One benefit of a distance-based OTU definition is the ability to mask residual sequencing error. The analysis in this study was conducted using ideal sequences from assembled genomes, whereas sequences generated in microbiome studies would harbor PCR and sequencing errors. These errors would only exacerbate the inflated number of ASVs. There is general agreement in bacterial systematics that to classify an organism to a bacterial species, phenotypic and genome sequence data are needed (17–20). A short sequence from a bacterial genome simply cannot differentiate between species. Moreover, it is difficult to defend a clustering threshold that would split a single genome into multiple taxonomic units. It is not biologically plausible to entertain the possibility that different rrn operons from the same genome would have different ecologies. Individual bacteria are defined at the cellular or chromosomal level and not at the gene level.
One could argue that, in practice, communities are compared on a relative rather than an absolute basis. However, communities harboring populations that tend to have more copies of the rrn operon would appear to have higher richness and diversity than those with fewer copies purely due to the propensity for populations with more rrn operons to generate more ASVs. Although there are multiple reasons why proponents favor ASVs, the significant risk of artificially splitting genomes into separate clusters is too high to warrant their use.
Data availability.
The 16S rRNA gene sequences used in this study were obtained from the rrnDB (https://rrndb.umms.med.umich.edu) (version 5.7, released 18 January 2021) (14). At the time of submission, this was the most current version of the database. The rrnDB obtained the curated 16S rRNA gene sequences from the KEGG database, which ultimately obtained them from the NCBI nonredundant RefSeq database. The rrnDB provided downloadable versions of the sequences with their taxonomy as determined using the naive Bayesian classifier trained on the RDP reference taxonomy. For some genomes, this resulted in multiple classifications since a genome’s 16S rRNA gene sequences were not identical. Instead, I mapped the RefSeq accession number for each genome in the database to obtain a single taxonomy for each genome. Because strain names were not consistently given to genomes across bacterial species, I disregarded the strain-level designations.
Definition of regions within the 16S rRNA gene.
The full-length 16S rRNA gene sequences were aligned to a SILVA reference alignment of the 16S rRNA gene (v. 138) using the mothur software package (v. 1.44.2) (25, 26). Regions of the 16S rRNA gene were selected because of their use in the microbial ecology literature. Full-length sequences corresponded to E. coli strain K-12 substrain MG1655 (GenBank accession number NC_000913) positions 28 through 1491, the V4 region corresponded to positions 534 through 786, V3-V4 corresponded to positions 358 through 786, and V4-V5 corresponded to positions 534 through 908. The positions between these coordinates reflect the fragments that would be amplified using commonly used PCR primers.
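Extracting a region from aligned sequences by E. coli coordinates can be sketched in Python as below. This is an illustration under stated assumptions rather than the mothur-based procedure used in the study: it assumes the aligned E. coli reference is available as a gapped string, treats the listed positions as 1-based and inclusive, and uses hypothetical function names. The toy alignment is only a few columns wide.

```python
# Region boundaries as E. coli positions, taken from the text above.
REGIONS = {
    "V1-V9": (28, 1491),
    "V3-V4": (358, 786),
    "V4": (534, 786),
    "V4-V5": (534, 908),
}

GAP_CHARACTERS = "-."


def alignment_columns(reference_aligned, start, end):
    """Map 1-based, inclusive reference positions to a half-open column range."""
    position = 0
    first = None
    for column, base in enumerate(reference_aligned):
        if base in GAP_CHARACTERS:
            continue  # gaps do not advance the reference coordinate
        position += 1
        if position == start:
            first = column
        if position == end and first is not None:
            return first, column + 1
    raise ValueError("reference does not span the requested positions")


def extract_region(aligned_sequence, reference_aligned, start, end):
    """Slice an aligned sequence to the columns spanning start..end."""
    low, high = alignment_columns(reference_aligned, start, end)
    return aligned_sequence[low:high]


# Toy alignment: the ungapped reference is "ACGTAC".
toy_reference = "A-CG.TAC"
toy_sequence = "AACGGTAC"
toy_region = extract_region(toy_sequence, toy_reference, 2, 4)
```

Because every sequence shares the same alignment columns, slicing by column keeps regions comparable across genomes even when individual genes contain insertions or deletions.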
Clustering sequences into OTUs.
Pairwise distances between sequences were calculated using the dist.seqs command from mothur. The OptiClust algorithm, as implemented in mothur, was used to assign 16S rRNA gene sequences to OTUs (27). Distance thresholds of between 0.25 and 10.00% in increments of 0.25 percentage points were used to assign sequences to OTUs.
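A pairwise distance between two aligned sequences, together with the grid of thresholds, can be sketched in Python. The distance below is a plain percentage of mismatched columns that skips columns where both sequences have a gap. It is an assumption chosen for illustration and is not claimed to match the calculation performed by mothur's dist.seqs command. The function name is hypothetical.

```python
GAPS = "-."


def percent_distance(seq_a, seq_b):
    """Percent of compared alignment columns at which two sequences differ.

    Columns where both sequences have a gap are ignored. A gap opposite a
    base counts as a mismatch (a simplifying assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAPS and b in GAPS:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return 100.0 * mismatches / compared if compared else 0.0


# Thresholds from 0.25% to 10.00% in increments of 0.25 percentage points.
distance_thresholds = [0.25 * step for step in range(1, 41)]
```

Multiples of 0.25 are exactly representable as floating-point numbers, so building the grid by multiplication avoids the drift that repeated addition of a less convenient step size could introduce.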
Controlling for uneven sampling of genomes by species.
Because of the uneven distribution of genome sequences across species, I randomly selected one genome from each species for the analysis of splitting genomes and clustering ASVs from different species (Fig. 1 and 2). The random selection was repeated 100 times. Analyses based on this randomization reported the median of the 100 randomizations. The interquartile range between randomizations was less than 0.0024. Because the range was so small, the confidence intervals were narrower than the thickness of the lines in Fig. 1 and 2 and were not included.
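The randomization described above can be sketched in Python. The function name, the dictionary layout, the seed, and the toy statistic are assumptions for illustration. The study's own implementation was written in R.

```python
import random
import statistics


def randomized_median(genomes_by_species, statistic, n_iterations=100, seed=0):
    """Median of `statistic` over repeated one-genome-per-species samples.

    `genomes_by_species` maps a species name to a list of its genomes, and
    `statistic` is any function that turns a list of sampled genomes into
    a number. Both are hypothetical stand-ins for the study's data.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    values = []
    for _ in range(n_iterations):
        sample = [rng.choice(genomes) for genomes in genomes_by_species.values()]
        values.append(statistic(sample))
    return statistics.median(values)


# Toy data: each "genome" is represented only by its number of distinct ASVs.
toy_genomes = {"species_1": [3], "species_2": [5, 5], "species_3": [1, 9]}
toy_median = randomized_median(toy_genomes, statistic=sum)
```

Sampling one genome per species keeps heavily sequenced species, such as E. coli with 1,390 genomes, from dominating a summary that is meant to describe species.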
Reproducible data analysis.
The code to perform the analysis in this article and its history are available as a git-based version control repository at GitHub (https://github.com/SchlossLab/Schloss_rrnAnalysis_mSphere_2021). The analysis can be regenerated using a GNU Make-based workflow that made use of built-in bash tools (v. 3.2.57), mothur (v. 1.44.2), and R (v. 4.1.0). Within R, I used the tidyverse (v. 1.3.1), data.table (v. 1.14.0), Rcpp (v. 1.0.6), furrr (v. 0.2.2), here (v. 1.0.1), and rmarkdown (v. 2.8) packages. The conception and development of this analysis are available as a playlist on the Riffomonas YouTube channel (https://youtube.com/playlist?list=PLmNrK_nkqBpL7m_tyWdQgdyurerttCsPY).
Note on the usage of ASV, OTU, and cluster.
I used “ASV” to denote the cluster of true 16S rRNA gene sequences that were identical to each other and “OTU” to denote the product of distance-based clustering of sequences. Although ASVs represent a type of operational definition of a taxonomic unit and can be thought of as an OTU formed using a distance of zero, proponents of the ASV approach prefer to avoid the term OTU given the long history of OTUs being formed by distance-based clustering (https://github.com/benjjneb/dada2/issues/62 [accessed 26 February 2021]). For this reason, when an ASV split a genome into different units, those units were called clusters rather than OTUs.
ACKNOWLEDGMENTS
I am grateful to Robert Hein and Thomas Schmidt, who maintain the rrnDB, for their help in understanding the curation of the database and for making the 16S rRNA gene sequences and related metadata publicly available. I am also grateful to community members who watched the serialized version of this analysis on YouTube and provided suggestions and questions over the course of the development of this project.
This work was supported, in part, through grants from the NIH (P30DK034933, U01AI124255, and R01CA215574).
Contributor Information
Patrick D. Schloss, Email: pschloss@umich.edu.
Katherine McMahon, University of Wisconsin—Madison.
REFERENCES
- 1. Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR. 1985. Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci U S A 82:6955–6959. doi: 10.1073/pnas.82.20.6955.
- 2. Stackebrandt E, Goebel BM. 1994. Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Evol Microbiol 44:846–849. doi: 10.1099/00207713-44-4-846.
- 3. Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM. 2007. DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. Int J Syst Evol Microbiol 57:81–91. doi: 10.1099/ijs.0.64483-0.
- 4. Rodriguez-R LM, Castro JC, Kyrpides NC, Cole JR, Tiedje JM, Konstantinidis KT. 2018. How much do rRNA gene surveys underestimate extant bacterial diversity? Appl Environ Microbiol 84:e00014-18. doi: 10.1128/AEM.00014-18.
- 5. Stackebrandt E, Ebers J. 2006. Taxonomic parameters revisited: tarnished gold standards. Microbiol Today 33:152–155.
- 6. Edgar RC. 2018. Updating the 97% identity threshold for 16S ribosomal RNA OTUs. Bioinformatics 34:2371–2375. doi: 10.1093/bioinformatics/bty113.
- 7. Edgar RC. 2016. UNOISE2: improved error-correction for Illumina 16S and ITS amplicon sequencing. bioRxiv doi: 10.1101/081257.
- 8. Amir A, McDonald D, Navas-Molina JA, Kopylova E, Morton JT, Zech Xu Z, Kightley EP, Thompson LR, Hyde ER, Gonzalez A, Knight R. 2017. Deblur rapidly resolves single-nucleotide community sequence patterns. mSystems 2:e00191-16. doi: 10.1128/mSystems.00191-16.
- 9. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. 2016. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 13:581–583. doi: 10.1038/nmeth.3869.
- 10. Eren AM, Morrison HG, Lescault PJ, Reveillaud J, Vineis JH, Sogin ML. 2015. Minimum entropy decomposition: unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences. ISME J 9:968–979. doi: 10.1038/ismej.2014.195.
- 11. Pei AY, Oberdorf WE, Nossa CW, Agarwal A, Chokshi P, Gerz EA, Jin Z, Lee P, Yang L, Poles M, Brown SM, Sotero S, DeSantis T, Brodie E, Nelson K, Pei Z. 2010. Diversity of 16S rRNA genes within individual prokaryotic genomes. Appl Environ Microbiol 76:3886–3897. doi: 10.1128/AEM.02953-09.
- 12. Sun D-L, Jiang X, Wu QL, Zhou N-Y. 2013. Intragenomic heterogeneity of 16S rRNA genes causes overestimation of prokaryotic diversity. Appl Environ Microbiol 79:5962–5969. doi: 10.1128/AEM.01282-13.
- 13. Callahan BJ, McMurdie PJ, Holmes SP. 2017. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J 11:2639–2643. doi: 10.1038/ismej.2017.119.
- 14. Stoddard SF, Smith BJ, Hein R, Roller BRK, Schmidt TM. 2015. rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development. Nucleic Acids Res 43:D593–D598. doi: 10.1093/nar/gku1201.
- 15. Schloss PD, Westcott SL. 2011. Assessing and improving methods used in operational taxonomic unit-based approaches for 16S rRNA gene sequence analysis. Appl Environ Microbiol 77:3219–3226. doi: 10.1128/AEM.02810-10.
- 16. Johnson JS, Spakowicz DJ, Hong B-Y, Petersen LM, Demkowicz P, Chen L, Leopold SR, Hanson BM, Agresta HO, Gerstein M, Sodergren E, Weinstock GM. 2019. Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nat Commun 10:5029. doi: 10.1038/s41467-019-13036-1.
- 17. Staley JT. 2006. The bacterial species dilemma and the genomic-phylogenetic species concept. Philos Trans R Soc Lond B Biol Sci 361:1899–1909. doi: 10.1098/rstb.2006.1914.
- 18. Oren A, Garrity GM. 2014. Then and now: a systematic review of the systematics of prokaryotes in the last 80 years. Antonie Van Leeuwenhoek 106:43–56. doi: 10.1007/s10482-013-0084-1.
- 19. Sanford RA, Lloyd KG, Konstantinidis KT, Löffler FE. 2021. Microbial taxonomy run amok. Trends Microbiol 29:394–404. doi: 10.1016/j.tim.2020.12.010.
- 20. Baltrus DA, McCann HC, Guttman DS. 2017. Evolution, genomics and epidemiology of Pseudomonas syringae. Mol Plant Pathol 18:152–168. doi: 10.1111/mpp.12506.
- 21. Konstantinidis KT, Tiedje JM. 2005. Towards a genome-based taxonomy for prokaryotes. J Bacteriol 187:6258–6264. doi: 10.1128/JB.187.18.6258-6264.2005.
- 22. Barco RA, Garrity GM, Scott JJ, Amend JP, Nealson KH, Emerson D. 2020. A genus definition for bacteria and archaea based on a standard genome relatedness index. mBio 11:e02475-19. doi: 10.1128/mBio.02475-19.
- 23. Parks DH, Chuvochina M, Chaumeil P-A, Rinke C, Mussig AJ, Hugenholtz P. 2020. A complete domain-to-species taxonomy for bacteria and archaea. Nat Biotechnol 38:1079–1086. doi: 10.1038/s41587-020-0501-8.
- 24. Yarza P, Yilmaz P, Pruesse E, Glöckner FO, Ludwig W, Schleifer K-H, Whitman WB, Euzéby J, Amann R, Rosselló-Móra R. 2014. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat Rev Microbiol 12:635–645. doi: 10.1038/nrmicro3330.
- 25. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. 2009. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75:7537–7541. doi: 10.1128/AEM.01541-09.
- 26. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO. 2013. The SILVA ribosomal RNA gene database project: improved data processing and Web-based tools. Nucleic Acids Res 41:D590–D596. doi: 10.1093/nar/gks1219.
- 27. Westcott SL, Schloss PD. 2017. OptiClust, an improved method for assigning amplicon-based sequence data to operational taxonomic units. mSphere 2:e00073-17. doi: 10.1128/mSphereDirect.00073-17.