Nonparametric richness estimators Chao1 and ACE must not be used with amplicon sequence variant data

Yongcui Deng; Alexander K Umbach; Josh D Neufeld

doi:10.1093/ismejo/wrae106

2024 Jun 13;18(1):wrae106. doi: 10.1093/ismejo/wrae106

PMCID: PMC11208923 PMID: 38869966

Alpha diversity metrics are used by microbial ecologists to quantify species richness and evenness from small subunit ribosomal RNA (SSU rRNA) gene sequence data. Although various parametric or nonparametric diversity indices may be selected for measuring sample profile diversity, the commonplace nonparametric richness estimators Chao1 and ACE must not be used for estimating total richness of amplicon sequence variant (ASV) datasets, especially when generated by algorithms that remove singletons. This perspective explains why using these richness estimators with ASV data leads to meaningless results.

The nonparametric Chao1 index [1] estimates total species richness with an equation that generates a ratio of the number of singletons squared divided by the number of doubletons multiplied by two (Box 1). This value is then added to the number of observed species to generate a species richness prediction. Within the context of microbiome datasets, each “species” is commonly represented by an ASV with a representative sequence, along with an associated read count representing abundance, and it is from these data that richness is estimated. A bias-corrected version of the Chao1 calculation is now widely preferred, and default in QIIME2 [2] and mothur [3], because it solves the issue of division by zero if no doubletons are present in the data, although this is unlikely given high depths of coverage commonly applied to samples. Similar to Chao1, the ACE richness estimator [4] depends on the total number of singleton species, but also incorporates the number of relatively abundant species (i.e. read counts >10) and rare species (i.e. read counts ≤10), along with sample coverage and the coefficient of variation for rare species (Box 1). Thus, both of these richness estimators are highly dependent on rare taxa abundances present in count data from sampled microbial communities.
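The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the QIIME2 or mothur implementation: the function name `chao1` and the example counts are hypothetical, and the bias-corrected form used here, S_obs + F₁(F₁ − 1) / (2(F₂ + 1)), is the widely used variant and is assumed to match what those tools compute.

```python
from collections import Counter


def chao1(counts, bias_corrected=True):
    """Estimate total richness from a list of per-taxon read counts."""
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)              # number of observed taxa
    freq = Counter(observed)
    f1, f2 = freq[1], freq[2]          # singletons and doubletons
    if bias_corrected:
        # Adding 1 to F2 in the denominator avoids division by zero.
        return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
    if f2 == 0:
        raise ZeroDivisionError("classic Chao1 is undefined without doubletons")
    return s_obs + f1 ** 2 / (2 * f2)


# Hypothetical sample: 9 taxa, 4 singletons, 2 doubletons.
counts = [1, 1, 1, 1, 2, 2, 5, 10, 40]
print(chao1(counts, bias_corrected=False))  # 9 + 16/4 = 13.0
print(chao1(counts))                        # 9 + 12/6 = 11.0
```

In both forms, the amount added to the observed richness is driven entirely by the singleton and doubleton counts.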

Box 1.

1. Two formulas for Chao1

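$$S_{\mathrm{Chao1}} = S_{\mathrm{obs}} + \frac{F_1^2}{2F_2} \quad \text{(classic)}$$

$$S_{\mathrm{Chao1}} = S_{\mathrm{obs}} + \frac{F_1(F_1 - 1)}{2(F_2 + 1)} \quad \text{(bias-corrected)}$$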

S_obs represents the total number of observed species;

F ₁ refers to the number of singleton species;

F ₂ refers to the number of doubleton species.

2. Formula for ACE

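$$S_{\mathrm{ACE}} = S_{\mathrm{abund}} + \frac{S_{\mathrm{rare}}}{C_{\mathrm{ACE}}} + \frac{F_1}{C_{\mathrm{ACE}}}\,\gamma^2_{\mathrm{ACE}}$$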

S_abund represents the total number of abundant species (abundance > threshold, e.g. 10);

S_rare represents the total number of rare species (1 ≤ abundance ≤ threshold of 10);

F ₁ refers to the number of singleton species;

C_ACE is the estimated value of sample coverage;

γ²_ACE is the coefficient of variation for rare species.
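As a rough sketch of how the ACE terms defined in Box 1 fit together, the Python function below computes the estimate from a list of read counts. It assumes the standard Chao and Lee definitions, which are not spelled out in the text: C_ACE = 1 − F₁/N_rare, where N_rare is the total read count of the rare taxa, and γ²_ACE truncated at zero. The function name `ace` and the example counts are hypothetical.

```python
from collections import Counter


def ace(counts, rare_threshold=10):
    """Abundance-based coverage estimator for a list of per-taxon read counts."""
    observed = [c for c in counts if c > 0]
    s_abund = sum(1 for c in observed if c > rare_threshold)
    rare = [c for c in observed if c <= rare_threshold]
    s_rare = len(rare)
    if s_rare == 0:
        return float(s_abund)
    n_rare = sum(rare)                 # total reads belonging to rare taxa
    f = Counter(rare)                  # f[i] = number of taxa seen i times
    f1 = f[1]
    c_ace = 1 - f1 / n_rare            # assumed sample coverage definition
    if c_ace == 0:
        raise ValueError("ACE is undefined when every rare taxon is a singleton")
    spread = sum(i * (i - 1) * f[i] for i in range(1, rare_threshold + 1))
    gamma_sq = max((s_rare / c_ace) * spread / (n_rare * (n_rare - 1)) - 1, 0.0)
    return s_abund + s_rare / c_ace + (f1 / c_ace) * gamma_sq


with_singletons = [1, 1, 2, 2, 3, 5, 8, 20, 40]
without_singletons = [2, 2, 3, 5, 8, 20, 40]
print(ace(with_singletons))     # exceeds the 9 observed taxa
print(ace(without_singletons))  # 7.0, identical to the observed richness
```

When no singletons are present, the coverage term equals 1 and the final term vanishes, so the estimate is simply the number of observed taxa.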

Although operational taxonomic units (OTUs) were once commonplace for clustering SSU rRNA gene sequences, most approaches now resolve sequences to ASVs instead. These “denoising” methods distinguish sequences that differ by as little as one nucleotide, producing ASVs that provide higher resolution than 97% OTUs. As a result, ASVs offer improved reusability across studies, sensitivity to population abundance changes, and direct comparability across datasets [5]. Since ~2019, when ASVs became commonplace, an increasing number of articles published each year have used ASV data to calculate Chao1 (Fig. 1) and ACE (data not shown). However, traditional richness estimators, such as Chao1 and ACE, must not be applied to ASV data that have had singletons removed because this abundance category is essential for these richness estimate calculations.
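The consequence of singleton removal can be made explicit by setting F₁ = 0 in the estimators. The short derivation below assumes the commonly used forms of Chao1 and ACE, together with the standard coverage definition C_ACE = 1 − F₁/N_rare (N_rare being the total read count of the rare species), which is not stated in the text.

```latex
% Classic Chao1 with F_1 = 0 (and F_2 > 0):
S_{\mathrm{Chao1}} = S_{\mathrm{obs}} + \frac{0^2}{2F_2} = S_{\mathrm{obs}}

% Bias-corrected Chao1 with F_1 = 0:
S_{\mathrm{Chao1}} = S_{\mathrm{obs}} + \frac{0 \cdot (0 - 1)}{2(F_2 + 1)} = S_{\mathrm{obs}}

% ACE with F_1 = 0, so that C_{\mathrm{ACE}} = 1 - 0/N_{\mathrm{rare}} = 1:
S_{\mathrm{ACE}} = S_{\mathrm{abund}} + \frac{S_{\mathrm{rare}}}{1} + \frac{0}{1}\,\gamma^2_{\mathrm{ACE}}
                 = S_{\mathrm{abund}} + S_{\mathrm{rare}} = S_{\mathrm{obs}}
```

In each case the "estimate" reduces to the observed richness, so no information about unobserved taxa is added.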

Figure 1. Google Scholar search results for the number of publications in each year range from 2015 to 2024 (* indicates results available as of June 10, 2024), retrieved using Boolean searches: [“Chao1” AND “OTU”] or [“Chao1” AND “ASV”].

Commonly used denoising algorithms, such as DADA2 [6] and Deblur [7], process amplicon sequence reads to generate ASVs. The DADA2 algorithm calculates sequence error rates using run-specific sequence quality scores, whereas Deblur relies on a set of precalculated values for determining misread probability. Both algorithms combine this information with read frequency and abundance data to determine whether a given read is legitimate or an artifact. However, these algorithms are impeded by technical limitations intrinsic to Illumina amplicon data that prevent confident resolution of authentic singleton sequences. As such, removal of all ASVs supported by a single sequence is the default setting. For example, when using the default DADA2 pipeline, singletons are deleted when the forward and reverse reads are denoised separately. Although singleton ASVs will occasionally be generated following read merging in paired-end datasets, these singletons represent pairings of forward and reverse reads that could not be resolved in isolation, are just as likely to be erroneous, and are thus recommended for removal.

Some singleton sequences can be retained by pooling samples during ASV generation by the DADA2 algorithm. This option shares read information among samples during ASV inference and will retain sequences that are singletons within individual samples if they appear in more than one sample. Therefore, the pooled-sample pipeline will result in higher richness estimates compared to non-pooled samples [8]. This option is computationally more demanding, detects only a few additional sequences, is not the default within DADA2 or QIIME2, and is thus rarely used. Regardless, pooled-sample ASV generation does not prevent singletons from being removed if they are detected in only a single sample, and thus the fundamental issue remains.

The inability of denoising algorithms to confidently resolve true singleton ASVs, in addition to default and/or mandatory removal of singletons (i.e. default removal in the R version of DADA2; default and mandatory removal in the QIIME2 version of DADA2), renders the use of singleton-dependent Chao1 and ACE metrics meaningless and thus unacceptable for alpha diversity analyses for ASV-generated datasets. In addition, ASV datasets are often standardized for sequencing depth among samples by rarefying or using an equivalent method (e.g. ranked subsampling). Although the decision to rarefy ASV datasets is itself an area of active debate [9, 10], the result of rarefying is the generation of new count data that will contain singletons arising from the process of subsampling itself, which will be influenced by the original sequencing depths of samples, and does not resolve the underlying problem associated with ASV generation. Ultimately, any data processing resulting in the removal of rare ASVs will bias alpha diversity estimates [11]. Thus, ASV generation pipelines that automatically delete rare taxa necessary for Chao1 and ACE richness estimators will result in nonsensical estimates that are not ecologically or mathematically relevant and should not be presented or interpreted.
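The point about rarefaction can be illustrated with a small simulation. The sketch below uses only the Python standard library; the sample table, the helper names `rarefy` and `singletons`, and the chosen depths are all hypothetical. It subsamples reads without replacement from a denoised sample that contains no singletons and reports how many singletons appear at each depth.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from an {asv: count} table."""
    pool = [asv for asv, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the total read count")
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


def singletons(counts):
    """Number of ASVs supported by exactly one read."""
    return sum(1 for n in counts.values() if n == 1)


# Hypothetical denoised sample (839 reads) with all singletons already removed.
reads = [500, 200, 80, 30, 12, 6, 4, 3, 2, 2]
sample = {f"ASV{i}": n for i, n in enumerate(reads)}
print(singletons(sample))  # 0

for depth in (100, 400, 800):
    sub = rarefy(sample, depth)
    # Any singletons here were created by subsampling, not observed directly.
    print(depth, len(sub), singletons(sub))
```

Any singletons counted after subsampling are a function of the chosen depth and random draw, which is why they cannot stand in for the singletons that the denoising step discarded.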

As an alternative to mandatory removal of singletons, DADA2 in the R version [6], Deblur [7], and UNOISE [12] provide options to set minimum ASV abundance thresholds, with defaults of 2, 10, and 8, respectively, that allow users to retain singletons and other relatively rare ASV sequences. However, for high-throughput sequence data, as for clone library sequences prior to “next-generation” sequencing, rare taxa will be increasingly associated with sequencing errors, PCR bias, chimeras, and other technical artifacts. Because a particularly large proportion of singleton sequences represent artifacts [13], it would be unwise to perform diversity calculations based on the inclusion of singleton ASV data. However, for many diversity metrics, this bias will be common to all samples, leading to global overestimation relative to true values, yet still permitting relative comparisons of diversity among samples and treatments. This could also be argued for Chao1 and ACE richness estimators, but any use of these metrics with ASV data would require specifying in the accompanying methods section that rare ASVs were retained and justifying the value of these estimates given their high sensitivity to the prevalence of artifacts in the resulting data.

Because Chao1 and ACE calculations must not be performed on singleton-removed ASV data, and their use with singleton-retained ASV data is highly questionable, an alternative approach would be to generate OTUs, using pipelines such as mothur [3] or UPARSE [14], for the specific purpose of analyzing alpha diversity. This would be appropriate given that sequencing artifacts will be less impactful on the resulting data (e.g. for 97% OTUs), low abundance taxa are retained, and the biological units captured by 97% OTUs may offer resolution akin to genus (or species) levels often desired by microbial ecologists. In addition, when calculating these alpha diversity metrics, rarefaction has been recommended to help address uneven sequencing effort [10, 15]. For those continuing to use more typical and commonplace ASV generation with default pipeline settings, other diversity metrics that rely less on specific counts of singletons or doubletons (e.g. observed ASVs, Faith’s phylogenetic diversity, Shannon index, and Simpson index) should be used instead of generating Chao1 and ACE estimates.
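Three of the alternative metrics named above are simple to compute directly from read counts, as the following sketch shows. The function names are hypothetical, the Shannon index is computed with the natural logarithm, and the Simpson index is given in its 1 − Σp² (Gini-Simpson) form; other conventions exist, so these choices are assumptions. Faith's phylogenetic diversity is omitted because it also requires a phylogenetic tree.

```python
import math


def observed_richness(counts):
    """Number of ASVs with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index using the natural logarithm."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson index in its 1 - sum(p_i^2) form."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


# Hypothetical, perfectly even sample with four ASVs.
counts = [25, 25, 25, 25]
print(observed_richness(counts))  # 4
print(shannon(counts))            # equals ln(4)
print(simpson(counts))            # 0.75
```

None of these calculations uses the singleton or doubleton counts as a separate term, so they remain defined in the same way whether or not a pipeline has removed rare ASVs.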

Acknowledgements

We thank Benjamin Callahan for helpful discussions.

Contributor Information

Yongcui Deng, Department of Geography, Nanjing Normal University, No. 1 Wenyuan Road, Nanjing, Jiangsu Province 210023, China; Department of Biology, University of Waterloo, Waterloo, Ontario N2T 1P5, Canada.

Alexander K Umbach, Department of Biology, University of Waterloo, Waterloo, Ontario N2T 1P5, Canada.

Josh D Neufeld, Department of Biology, University of Waterloo, Waterloo, Ontario N2T 1P5, Canada.

Conflicts of interest

The authors declare no conflicts of interest.

Funding

This work was supported by the National Natural Science Foundation of China (No. 42371135) and a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).

Data availability

Data sharing not applicable to this article as no datasets were generated for the current study.

References

1. Chao A. Nonparametric estimation of the number of classes in a population. Scand J Stat 1984;11:265–70.
2. Bolyen E, Rideout JR, Dillon MR et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol 2019;37:852–7. 10.1038/s41587-019-0209-9
3. Schloss PD, Westcott SL, Ryabin T et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009;75:7537–41. 10.1128/AEM.01541-09
4. Chao A, Lee SM. Estimating the number of classes via sample coverage. J Am Stat Assoc 1992;87:210–7. 10.1080/01621459.1992.10475194
5. Callahan BJ, McMurdie PJ, Holmes SP. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J 2017;11:2639–43. 10.1038/ismej.2017.119
6. Callahan BJ, McMurdie PJ, Rosen MJ et al. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 2016;13:581. 10.1038/nmeth.3869
7. Amir A, McDonald D, Navas-Molina JA et al. Deblur rapidly resolves single-nucleotide community sequence patterns. mSystems 2017;2:e00191–16. 10.1128/mSystems.00191-16
8. Bardenhorst SK, Vital M, Karch A et al. Richness estimation in microbiome data obtained from denoising pipelines. Comput Struct Biotechnol J 2022;20:508–20. 10.1016/j.csbj.2021.12.036
9. McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol 2014;10:e1003531. 10.1371/journal.pcbi.1003531
10. Schloss PD. Waste not, want not: revisiting the analysis that called into question the practice of rarefaction. mSphere 2024;9:e00355–23. 10.1128/msphere.00355-23
11. Schloss PD. Removal of rare amplicon sequence variants from 16S rRNA gene sequence surveys biases the interpretation of community structure data. bioRxiv 2020. 10.1101/2020.12.11.422279
12. Edgar RC. UNOISE2: Improved error-correction for Illumina 16S and ITS amplicon sequencing. bioRxiv 2016. 10.1101/081257
13. Brown SP, Veach AM, Rigdon-Huss AR et al. Scraping the bottom of the barrel: are rare high throughput sequences artifacts? Fungal Ecol 2015;13:221–5. 10.1016/j.funeco.2014.08.006
14. Edgar RC. UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat Methods 2013;10:996–8. 10.1038/nmeth.2604
15. Schloss PD. Rarefaction is currently the best approach to control for uneven sequencing effort in amplicon sequence analyses. mSphere 2024;9:e00354–23. 10.1128/msphere.00354-23
