Analysis of compositions of microbiomes with bias correction

Huang Lin; Shyamal Das Peddada

doi:10.1038/s41467-020-17041-7

2020 Jul 14;11:3514. doi: 10.1038/s41467-020-17041-7

PMCID: PMC7360769 PMID: 32665548

Abstract

Differential abundance (DA) analysis of microbiome data continues to be a challenging problem due to the complexity of the data. In this article we define the notion of “sampling fraction” and demonstrate that a major hurdle in performing DA analysis of microbiome data is the bias introduced by differences in the sampling fractions across samples. We introduce a methodology called Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC), which estimates the unknown sampling fractions and corrects the bias induced by their differences among samples. The absolute abundance data are modeled using a linear regression framework. This formulation makes a fundamental advancement in the field because, unlike the existing methods, it (a) provides a statistically valid test with appropriate p-values, (b) provides confidence intervals for differential abundance of each taxon, (c) controls the False Discovery Rate (FDR), (d) maintains adequate power, and (e) is computationally simple to implement.

Subject terms: Computational biology and bioinformatics, Microbiology, Ecology

Differential abundance analysis of microbiome data continues to be challenging due to data complexity. The authors propose a method which estimates the unknown sampling fractions and corrects the bias induced by their differences among samples.

Introduction

A number of procedures have been proposed and used in the literature for identifying differentially abundant taxa between two or more ecosystems. A detailed survey of some of the existing methods and their performance is given in Weiss et al.¹. As noted in several studies^2–6, the observed microbiome data are relative abundances which sum to a constant, hence they are compositional. Standard statistical methods are not appropriate for analyzing compositional data⁷. Methods such as ANOVA and the Kruskal–Wallis test do not appropriately take into consideration the compositional feature of microbiome data when performing differential abundance (DA) analysis. As demonstrated in the literature^1,2, these methods are subject to inflated false discovery rates (FDR). Although metagenomeSeq⁸ was specifically developed for microbiome data, it too is subject to inflated FDR under the Gaussian mixture model^1,2.

ANCOM², which is based on Aitchison’s methodology, uses relative abundances to draw inferences about absolute abundances. According to an extensive simulation study¹, among the available methods for DA analysis, only ANCOM performs well in controlling FDR while maintaining high power, as long as the sample size is not too small. The deficiencies of ANCOM are that it does not provide a p value for each individual taxon, it cannot provide standard errors or confidence intervals of DA for each taxon, and it can be computationally intensive.

The Differential Ranking (DR) methodology⁶ reformulates the DA analysis as a multinomial regression problem. By imposing the Additive Log-Ratio transformation to relative abundances, the DR methodology accounts for compositionality of microbiome data. As demonstrated in⁶, the ranks of relative differentials perfectly correlate with ranks of absolute differentials. However, similar to ANCOM, the DR procedure does not provide p values or confidence intervals to declare statistical significance.

It is important to distinguish between absolute and relative abundances of taxa in a unit volume of an ecosystem. Change in the absolute abundance of a single taxon can alter the relative abundances of all taxa (Fig. 1). The choice of parameter for statistical analysis is important and needs to be clearly stated. Often researchers are interested in identifying taxa that are different in mean absolute abundance between two or more ecosystems⁶. Testing hypotheses regarding mean relative abundance is not equivalent to testing hypotheses regarding mean absolute abundance^2,6. In addition, note that not all samples have the same sampling fraction, which is defined as the ratio of the expected absolute abundance of a taxon in a random sample (e.g., a stool sample) to its absolute abundance in a unit volume of the ecosystem (e.g., a unit volume of gut) where the sample was derived from. Consequently, the observed counts are not comparable between samples. Thus, all DA methodologies require the counts to be properly normalized to account for differences in sampling fractions across samples. Sampling fraction is affected by two components, namely, the microbial load in a unit volume of the ecosystem and the library size of the corresponding sample (e.g., total species abundances sequenced from a subject’s stool sample). Therefore, it is not sufficient to normalize the library size across samples as one needs to take into consideration the differences in the microbial loads. Consider the toy example in Fig. 2. Suppose the gut of subject A as well as B consist of only two taxa, the red and green varieties. Clearly, the true absolute abundance of each taxon is 50% more in subject B’s ecosystem as compared with subject A’s. However, they each have the same library size (six each) in their respective samples. Furthermore, sample relative abundance as well as sample absolute abundances are identical in the two samples. 
If a normalization method is based only on the library size and ignores the sampling fraction, then the two samples would be considered as normalized. Consequently, an investigator would falsely conclude that none of the taxa are differentially abundant in the two ecosystems. This erroneous conclusion would be avoided if one recognizes that we have a larger sampling fraction in the sample obtained from A’s ecosystem than from B’s ( $\frac{3}{9}$ vs $\frac{2}{9}$ ). Thus, normalizing data on the basis of sampling fractions gives a better description of the truth than normalization methods that rely purely on the library sizes.

Fig. 1 — As shown in this figure, all taxa (in different colors and shapes) may be identically abundant in a unit volume of two ecosystems (e.g., a unit volume of gut), except for one differentially abundant taxon (the green variety). Due to this one differentially abundant taxon, the two ecosystems may differ in the relative abundance of all taxa. A researcher may not only be interested in knowing if the mean relative abundance of a taxon is different between two ecosystems but may also want to know if the absolute abundance of a taxon is different in a unit volume of two ecosystems.

Fig. 2 — Sampling fraction is defined as the ratio of expected absolute abundance in a sample to the corresponding absolute abundance in the ecosystem, which could be empirically estimated by the ratio of library size to the microbial load. Differences in sampling fractions may introduce bias and increase in false positive as well as false negative rates in differential abundance analysis. In this toy example, the microbial load for subject A in a unit volume of ecosystem (e.g., a unit volume of gut) is 18 (12 red + 6 green), while for subject B is 27 (18 red + 9 green). However, the samples taken from subject A and B have the same library size 6 (4 red + 2 green), the same observed absolute abundance as well as the same relative abundance of red and green taxa. Thus, one may mistakenly conclude that the red and green taxa are not differentially abundant, which is not the case in the two ecosystems. This false negative conclusion is caused by differences in the sampling fractions in the two samples. The sampling fraction in sample A is 3/9 and for B it is 2/9. One can similarly construct examples where a false positive conclusion is arrived at. Thus, a normalization method must account for differences in sampling fractions to avoid such erroneous conclusions.
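The arithmetic of this toy example can be checked directly. The following is a minimal sketch: the function name `sampling_fraction` and the dictionary layout are illustrative assumptions, and the sampling fraction is approximated by the ratio of library size to microbial load, as suggested in the Fig. 2 caption.

```python
from fractions import Fraction


def sampling_fraction(sample_counts, ecosystem_counts):
    """Empirical sampling fraction: library size divided by microbial load."""
    library_size = sum(sample_counts.values())
    microbial_load = sum(ecosystem_counts.values())
    return Fraction(library_size, microbial_load)


# Counts taken from the Fig. 2 toy example.
ecosystem_a = {"red": 12, "green": 6}  # microbial load 18
ecosystem_b = {"red": 18, "green": 9}  # microbial load 27
sample_a = {"red": 4, "green": 2}      # library size 6
sample_b = {"red": 4, "green": 2}      # library size 6

c_a = sampling_fraction(sample_a, ecosystem_a)  # 6/18, i.e. 3/9
c_b = sampling_fraction(sample_b, ecosystem_b)  # 6/27, i.e. 2/9

# Dividing the observed counts by the sampling fraction recovers the
# ecosystem counts, which do differ between the two subjects.
recovered_a = {taxon: n / c_a for taxon, n in sample_a.items()}
recovered_b = {taxon: n / c_b for taxon, n in sample_b.items()}
```

The two samples are identical, yet the recovered ecosystem counts differ, which is the false negative that library-size normalization alone cannot detect.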

Ideally, under the null hypothesis, the test statistic for DA analysis should be (at least approximately) centered at zero (i.e., unbiased). However, for many DA methods, this is not always true for at least one of the following reasons: (1) the test statistic may not be designed for testing hypotheses regarding the actual parameter of interest; (2) the data are not properly normalized; (3) underlying structure, such as compositionality, is ignored. Motivated by the limitations of existing DA methods, in this paper we propose a methodology called Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC) that aims to address the problems mentioned above. As in ANCOM and DR, the proposed ANCOM-BC methodology assumes that the observed sample is an unknown fraction of a unit volume of the ecosystem, and that the sampling fraction varies from sample to sample. ANCOM-BC accounts for the sampling fraction by introducing a sample-specific offset term in a linear regression framework, which is estimated from the observed data. The offset term serves as the bias correction, and the linear regression framework in log scale is analogous to a log-ratio transformation to deal with the compositionality of microbiome data. The case of zero counts is also discussed in the “Methods” section. This methodology has some conceptual similarities with DR, but is fundamentally different. With ANCOM-BC, one can perform standard statistical tests and construct confidence intervals for DA. Moreover, as demonstrated in benchmark simulation studies, ANCOM-BC (a) controls the FDR very well while maintaining adequate power compared with other popular methods, and (b) is substantially faster than ANCOM. The CPU time (RStudio, x86_64-apple-darwin15.6.0, and macOS) is 0.28 min for ANCOM-BC vs. 63 min for ANCOM when the number of taxa is 500. The CPU time for ANCOM increases dramatically as the number of taxa increases to 1000. In this case, the CPU times for ANCOM-BC and ANCOM are 0.51 and 211 min, respectively.
In addition to results based on synthetic data, we also illustrate ANCOM-BC using the well-known global gut microbiota dataset⁹.

Results

Normalization

Using simulated data, we illustrate how the existing normalization methods fail to eliminate the bias introduced by differences in sampling fractions across samples, whereas the normalization method introduced in ANCOM-BC performs well. Specifically, we compare our proposed method with Cumulative-Sum Scaling (CSS) implemented in metagenomeSeq⁸, Median (MED) in DESeq2¹⁰, Upper Quartile (UQ) and Trimmed Mean of M values (TMM), and Total-Sum Scaling (TSS). In addition, we also considered modified versions of UQ and TMM implemented in edgeR¹¹. These are obtained by multiplying the normalization factors with the corresponding library size to account for “effective library size”¹², and are denoted as ELib-UQ and ELib-TMM (see Supplementary Table 7 for formulas and Supplementary Fig. 11 for workflow).

We considered a variety of simulation scenarios as follows. The details of the simulation study are presented in the Supplementary Notes.

Unbalanced microbial load in two experimental groups and balanced library size for each sample. This results in a large variability in sampling fractions (Fig. 3).
Unbalanced microbial load in two experimental groups and unbalanced library size for each sample. This results in a moderate variability in sampling fractions (Supplementary Fig. 1).
Balanced microbial load in two experimental groups and balanced library size for each sample. This results in a small variability in sampling fractions (Supplementary Fig. 2).

Thus, we simulated data where the sampling fraction in Group 1 is systematically different from the sampling fraction in Group 2. Consequently, observed absolute abundances in the samples in the two groups were systematically different even though the actual absolute abundances in the ecosystems are the same. To evaluate the performance of each normalization method, we introduced a residual measure that estimates the deviation between the estimated sampling fraction and the true sampling fraction (see Supplementary Discussion). For simplicity of exposition, we plotted the centered residuals, obtained by subtracting the group average of the residuals. If a normalization method is effective, then it should eliminate the bias due to the differences in the sampling fractions so that samples from the two groups (circles and triangles) in Fig. 3 intermix and do not cluster by the group labels.

From Fig. 3 (and Supplementary Figs. 1, 2) we notice that the samples normalized by ANCOM-BC are nicely intermixed and do not cluster by the group labels. This is not the case with most of the remaining methods where residuals cluster by group labels, thus indicating that they are unable to eliminate the underlying differences in sampling fractions between the two groups. Thus, under the null hypothesis of no difference in the absolute abundance of a taxon in two groups, their test statistics are not centered at zero. This results in inflated FDR (see Supplementary Discussion). We also note from Fig. 3 and Supplementary Figs. 1, 2 that not only does ANCOM-BC do well in estimating the bias due to differences in sampling fraction, but the variability in the estimates of the sampling fractions is also very small, as seen from the height of the box plot for ANCOM-BC. This is an important observation because it suggests that the variability in the estimator of bias due to sampling fraction is potentially negligible in the test statistic described in the “Methods” section.

Clearly, as seen in Fig. 4a, b and Supplementary Figs. 3a, b, 4a,b, the normalization of data has a major effect on the FDR and power of various methods.

DA analyses

Simulating data from Poisson-Gamma distributions (see Supplementary Notes for simulation settings and Supplementary Fig. 12 for workflow), we evaluated the performance of various methods in terms of FDR and power. Since there is no hard threshold available for DR to declare whether a taxon is differentially abundant or not, it was not included in this simulation study.

Not surprisingly, the standard Wilcoxon rank-sum test applied to relative abundance data leads to highly inflated FDR (Fig. 4a and Supplementary Figs. 3a, 4a) in all simulation scenarios. This is primarily because such standard tests ignore the compositional structure of the data and, as seen from Fig. 3, TSS does not successfully normalize the data. Simply applying nonparametric tests without any normalization can also be problematic when the sampling fractions are different across experimental groups (Fig. 4a). The two widely used count-based methods in the RNA-Seq literature, edgeR (implemented using ELib-TMM¹² by default) and DESeq2, generally exceed the 5% nominal FDR level when there are differences in sampling fractions (Fig. 4a and Supplementary Fig. 3a). For instance, edgeR has FDR as large as 40% (Fig. 4a), meaning that 40% of findings could potentially be false discoveries. The zero-inflated Gaussian mixture model used in metagenomeSeq (ZIG) consistently has the largest FDR when sampling fractions are not constant (Fig. 4a and Supplementary Fig. 3a). In some cases, the FDR could be as much as 70%, which perhaps is partly due to the Gaussian distribution assumption for log abundance data. Although metagenomeSeq using the zero-inflated Log-Gaussian mixture model successfully controls the FDR under 5% in all simulations, it suffers a severe loss of power (Fig. 4b and Supplementary Figs. 3b, 4b). The power of detecting differentially abundant taxa could be lower than 10%.

Similar to ANCOM, ANCOM-BC not only controls the FDR at the nominal level (5%) but also maintains adequate power in all simulation settings considered here. An important observation to be made regarding all methods, other than ANCOM and ANCOM-BC, is that as the sample size within each group increases, so does the FDR. This is perhaps a consequence of the fact that the test statistics are not centered at the true null parameter but are shifted due to differences in the sampling fraction. Hence asymptotically, these tests fail to control the false positive rate as well as the FDR (see Supplementary Discussion).

In addition to the above Poisson-Gamma model, we performed simulations using the real global patterns data¹³, to get a broader perspective on the performance of the various methods (see Supplementary Notes for simulation details). In this case again, ANCOM and ANCOM-BC controlled the FDR and competed well in terms of power with all other methods. The estimated FDR of DESeq2 and edgeR increased further in this simulation set-up (Supplementary Fig. 5a, b) compared with the simulation using Poisson-Gamma distribution. Note that DESeq2 and edgeR were designed for Poisson-Gamma distribution, and hence it is not surprising that these methods performed poorly in this new set-up.

Illustration using gut microbiota data

We illustrate ANCOM-BC by analyzing the US, Malawi and Venezuela gut microbiota data⁹. This dataset consists of 11,905 OTUs obtained from subjects in the USA (n = 317), Malawi (n = 114), and Venezuela (n = 99). We first assessed the performance of different normalization methods mentioned above. One heuristic approach to gain insights on the impact of normalization is to examine how well the normalized samples separate from each other according to their phenotypes in a nonmetric multidimensional scaling (NMDS) plot. We provide the results for Malawi and Venezuela populations in Fig. 5.

Fig. 5 — First two NMDS coordinates are used to evaluate the performance of various normalization methods (ANCOM-BC, ELib-UQ, ELib-TMM, CSS, MED, UQ, TMM, and TSS) applied on Malawi and Venezuela samples of the global gut microbiota data at genus level. Samples from Malawi are colored in red while samples from Venezuela are colored in green. Visually, ANCOM-BC, ELib-TMM, MED, and CSS appear to provide best separation between Malawi and Venezuela samples. Quantitatively, in terms of Between-Group Sum of Squares (BSS), standardized by Total Sum of Squares (TSS) so that all methods are comparable, ANCOM-BC has the largest BSS, followed by ELib-TMM, and MED. Rest of the methods perform poorly both visually as well as quantitatively with small BSS values. These findings appear to be consistent with results of the synthetic data shown in Fig. 3 and Supplementary Figs. 1, 2.

As seen from this figure, ANCOM-BC appears to perform very well visually in separating samples from the two populations and has the largest between-group sum of squares (BSS). BSS measures how well clusters are separated. The larger the BSS value, the better a method is at clustering objects according to group labels. ELib-TMM, CSS, and MED also performed well. Consistent with the bias correction and FDR/power simulations reported in Figs. 3 and 4, where ELib-UQ, UQ, TMM, and TSS perform poorly in correcting biases and have poor FDR control, these methods also perform poorly in distinguishing samples based on their nationalities.
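The standardized BSS described above can be sketched as the ratio of the between-group sum of squares to the total sum of squares of the ordination coordinates. This is only an illustration: the function name `bss_over_tss` and the coordinates are hypothetical, and the exact computation behind Fig. 5 is not specified in this section.

```python
def bss_over_tss(points, labels):
    """Ratio of between-group to total sum of squares for labelled points."""
    dim = len(points[0])
    n = len(points)
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    tss = sum((p[d] - grand[d]) ** 2 for p in points for d in range(dim))
    bss = 0.0
    for group in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == group]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Each group contributes its size times the squared distance
        # between its centroid and the grand centroid.
        bss += len(members) * sum((centroid[d] - grand[d]) ** 2 for d in range(dim))
    return bss / tss


# Hypothetical two-dimensional NMDS coordinates, for illustration only.
labels = ["Malawi", "Malawi", "Venezuela", "Venezuela"]
well_separated = [(-1.0, 0.1), (-0.9, -0.1), (1.0, 0.1), (0.9, -0.1)]
intermixed = [(-1.0, 0.1), (0.9, -0.1), (1.0, 0.1), (-0.9, -0.1)]

ratio_separated = bss_over_tss(well_separated, labels)
ratio_intermixed = bss_over_tss(intermixed, labels)
```

With these made-up points the ratio is close to 1 when the groups occupy distinct regions and close to 0 when they are intermixed, which matches the reading that a larger value indicates better separation by group label.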

We also report results of pairwise DA analyses at phylum level among the above three countries using ANCOM-BC. It is well-known that the infant gut microbiota evolves with age¹⁴ due to changes in feeding patterns, diet, and other exposures. Hence, for illustration purposes, we performed a stratified analysis by considering two age groups, infants below 2 years (labeled as “infants”) and adults between 18 and 40 (labeled as “adults”). Results of all pairwise comparisons are provided in Fig. 6a and Supplementary Table 1. Note that ANCOM-BC is the first method in the literature that not only identifies differentially abundant taxa while controlling the FDR for multiple testing, but also provides 95% simultaneous confidence intervals for the mean DA of each taxon in the two experimental groups. These confidence intervals are adjusted for multiplicity using the Bonferroni method. Thus, a researcher can evaluate the effect size associated with each taxon when comparing two experimental groups. This is particularly important in the present climate when researchers are increasingly skeptical about making decisions based on p values (alone)¹⁵.
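A Bonferroni adjustment of this kind can be sketched as follows. The sketch assumes normal-approximation intervals of the form estimate ± z × standard error; the function name `bonferroni_intervals` and all numbers are hypothetical and are not taken from the paper's results.

```python
from statistics import NormalDist


def bonferroni_intervals(estimates, std_errors, alpha=0.05):
    """Two-sided simultaneous intervals with a Bonferroni-adjusted level."""
    m = len(estimates)
    # Each of the m intervals uses level alpha / m, split over two tails.
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))
    return [(b - z * s, b + z * s) for b, s in zip(estimates, std_errors)]


# Hypothetical log fold changes and standard errors for three taxa.
log_fold_changes = [1.20, -0.15, 0.60]
standard_errors = [0.25, 0.20, 0.30]

intervals = bonferroni_intervals(log_fold_changes, standard_errors)
excludes_zero = [low > 0 or high < 0 for low, high in intervals]
```

In this made-up example the third taxon would have an interval excluding zero if it were analyzed alone, but not after adjusting for three simultaneous comparisons, which illustrates the price of simultaneous coverage.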

Fig. 6 — Data are represented by effect size (log fold change) and 95% confidence interval bars (two-sided; Bonferroni adjusted) derived from the ANCOM-BC model. All effect sizes with adjusted p < 0.05 are indicated, *significant at 5% level of significance; **significant at 1% level of significance; ***significant at 0.1% level of significance. Exact adjusted p values can be found in Supplementary Table 1. Diamonds on top of some bars indicate structural zeros. a Pairwise differential abundance analyses stratified by age using ANCOM-BC: Infants (age ≤ 2 years, n = 133), and adults (18 ≤ age ≤ 40, n = 83). Infant samples are colored in red and adult samples are colored in green. Phyla Acidobacteria and Chloroflexi are not represented in the plot since they are present only in Venezuela samples. b Pairwise tests using ANCOM-BC for the equality of mean log ratio of Bacteroidetes to Firmicutes stratified by age. Viewing b and Supplementary Table 2 together, lower BMI seems to be associated with higher levels of the Bacteroidetes to Firmicutes ratio, a result widely acknowledged in the literature.

Interestingly, phyla such as Cyanobacteria, Elusimicrobia, Euryarchaeota, and Spirochetes, which are known to be associated with rural environment and hygiene^16–19, are significantly more abundant among Malawi than the US infants and adults. We discover an interesting trend in the absolute abundance of phylum Verrucomicrobia, whose absolute abundance is known to increase with antibiotics usage to protect against pathogens and other opportunistic bacteria²⁰. Consistent with the high usage of antibiotics in the western world among infants as well as adults, we discover a significant increase in the absolute abundance of Verrucomicrobia in US relative to Malawi adults and infants, and relative to Venezuelan adults (Fig. 6a). Similarly, there is a significant increase in its absolute abundance among Venezuelan infants compared with Malawi (Fig. 6a).

It is well-documented in the literature that BMI is linked to the ratio of Bacteroidetes to Firmicutes²¹. In our sample, the US infants, as well as adults, had higher BMI than their counterparts in Malawi; the US infants also had higher BMI than Venezuela infants (Supplementary Table 2). Interestingly, the ratio of Bacteroidetes to Firmicutes was larger among Malawi infants than the US infants (Fig. 6b and Supplementary Table 3). Similarly, the ratio was significantly larger among Venezuela infants than the US infants (Fig. 6b and Supplementary Table 3). Although the differences in the ratio of Bacteroidetes to Firmicutes between US and non-US adults were not significant, the effect sizes showed a similar trend as in infants, indicating that US adults had a smaller ratio of Bacteroidetes to Firmicutes. We did not find any significant differences between Malawi and Venezuelan infants or adults. These results are in line with our findings that there were no differences in the mean absolute abundances of Firmicutes as well as Bacteroidetes among Malawi and Venezuelan infants as well as adults (Fig. 6a).

Discussion

The DA analysis of microbiome data is a challenging problem^5,6, in part due to the inaccessibility of data necessary for drawing inferences on DA in two or more ecosystems. An important unobservable parameter that impacts DA analysis is the sampling fraction of a sample drawn from a unit volume of ecosystem. As noted in previous studies^5,6, the bias correction due to sampling fraction is a major hurdle. While the ANCOM and DR procedures find ways to get around the problem from different perspectives, there is room for improvement. Secondly, differential relative abundance analysis of microbiome data is not equivalent to differential absolute abundance analysis of microbiome data. Often simplex or compositional data analysis-based methods transform the simplex coordinate system to Euclidean space by performing a log-ratio transformation. However, such methods require the researcher to prespecify the reference taxon and the results may be highly dependent on the choice of the reference taxon⁶. It is important to reiterate that ANCOM computes log-ratios with respect to all taxa and thus the choice of reference taxon is not an issue for ANCOM. As demonstrated mathematically in ANCOM methodology², as long as two taxa are not differentially abundant between two ecosystems, one can draw inferences about DA using differential relative abundance.

ANCOM-BC enjoys several important unique characteristics. First, it is the only method available in the literature that estimates the sampling fraction and performs DA analysis by correcting bias due to differential sampling fractions across samples. It is the only procedure that provides valid p values and confidence intervals for each taxon. Second, unlike ANCOM, it simplifies DA analysis by recasting the problem as a linear regression problem with an offset. The offset is due to the sampling fraction. By virtue of the linear regression formulation, ANCOM-BC can be applied to a broad collection of study designs, including longitudinal data, repeated measurements designs, covariate-adjusted analysis, and so on. Using a broad range of simulation studies, we demonstrate that ANCOM-BC, like ANCOM, controls the FDR very well, while almost all other methods investigated in this paper fail.

The ANCOM-BC methodology may not perform well when the sample sizes are very small, such as n = 5 per group. The FDR is not controlled by ANCOM-BC in such cases (Supplementary Fig. 6a, b). However, when the sample size increases to 10, our simulation results indicate that ANCOM-BC controls FDR with adequate power (Supplementary Fig. 6a, b). We also evaluated the performance of ANCOM-BC when the number of taxa is small, as when researchers perform DA analysis at the phylum or class levels. Even in such instances, ANCOM-BC controls the FDR very well while maintaining high power (Supplementary Table 4). ANCOM-BC performs best in terms of FDR control when the proportion of differentially abundant taxa is not too large (e.g., <75%). Otherwise, it may have slightly elevated FDR (Supplementary Fig. 7a, b). However, none of the other methods control the FDR either; in fact, they have larger FDRs than ANCOM-BC.

In many applications, researchers are interested in drawing inferences regarding DA of taxa in more than two ecosystems. We extended ANCOM-BC to deal with such multigroup situations. Extensive simulations suggest that ANCOM-BC controls FDR while maintaining high power (Supplementary Table 5).

In summary, the proposed ANCOM-BC methodology (1) explicitly tests hypotheses regarding differential absolute abundance of individual taxa and provides valid confidence intervals; (2) provides an approach to correct the bias induced by (unobservable) differential sampling fractions across samples; (3) takes into account the compositionality of the microbiome data; and (4) does not rely on strong parametric assumptions. The linear regression framework adopted in ANCOM-BC allows researchers to derive a p value associated with each taxon as well as a confidence interval for differential absolute abundance. These are unique to ANCOM-BC, to the best of our knowledge. Last but not least, because of the regression framework adopted in ANCOM-BC, it can be extended to more general settings involving multigroup comparisons, adjusting for covariates, as well as longitudinal/repeated measurements data.

Methods

Notation

The notation used in the ANCOM-BC methodology is summarized in Table 1.

Table 1.

Summary of notations.

Notation	Description
$i$	Taxon index, $i = 1, 2, \dots, m$.
$j$	Group index, $j = 1, 2, \dots, g$.
$k$	Sample index, $k = 1, 2, \dots, n_{j}$.
$\theta_{ij}$ ^a	Expected absolute abundance of the ith taxon in a unit volume of ecosystem in the jth group.
$A_{ijk}$ ^b	Unobserved absolute abundance of the ith taxon in a unit volume of ecosystem of the kth sample in the jth group.
$A_{\cdot jk}$ ^b	Microbial load in a unit volume of ecosystem of the kth sample in the jth group. $A_{\cdot jk} = \sum_{i = 1}^{m} A_{ijk}$.
$\gamma_{ijk}$ ^b	Unobserved relative abundance of the ith taxon in a unit volume of ecosystem of the kth sample in the jth group. $\gamma_{ijk} = \frac{A_{ijk}}{A_{\cdot jk}}$.
$O_{ijk}$ ^b	Observed absolute abundance of the ith taxon in a random specimen taken from a unit volume of ecosystem of the kth sample in the jth group.
$O_{\cdot jk}$ ^b	Library size of a random specimen taken from a unit volume of ecosystem of the kth sample in the jth group. $O_{\cdot jk} = \sum_{i = 1}^{m} O_{ijk}$.
$r_{ijk}$ ^b	Observed relative abundance of the ith taxon in a random specimen taken from a unit volume of ecosystem of the kth sample in the jth group. $r_{ijk} = \frac{O_{ijk}}{O_{\cdot jk}}$.
$c_{jk}$ ^a	For the kth sample from the jth group, $c_{jk}$ represents the proportion of its ecosystem (unobserved absolute abundance) in a random sample (observed absolute abundance), thus $c_{jk} = \frac{E(O_{ijk} \mid A_{ijk})}{A_{ijk}}$. We shall refer to this constant as the “sampling fraction”.
$y_{ijk}$ ^b	$\log(O_{ijk})$.
$\mu_{ij}$ ^a	$\log(\theta_{ij})$.
$d_{jk}$ ^a	$\log(c_{jk})$.

^aParameter.

^bRandom variable.

Data preprocessing

We adopted the methodology of ANCOM-II²² as the preprocessing step to deal with different types of zeros before performing DA analysis.

There are instances where some taxa are systematically absent in an ecosystem. For example, there may be taxa present in a soil sample from a desert that might be absent in a soil sample from a rain forest. In such cases, the observed zeros are called structural zeros. Let p_ij denote the proportion of non-zero samples of the ith taxon in the jth group, and let ${\hat{p}}_{i j} = \frac{1}{n_{j}} \sum_{k = 1}^{n_{j}} I (O_{i j k} \neq 0)$ denote the estimate of p_ij. In practice, we declare the ith taxon to have structural zeros in the jth group if either of the following is true:

${\hat{p}}_{i j} = 0 .$
${\hat{p}}_{i j} - 1.96 \sqrt{\frac{{\hat{p}}_{i j} (1 - {\hat{p}}_{i j})}{n_{j}}} \leq 0$ .
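The two criteria above can be sketched in a few lines. This is an illustrative sketch rather than the ANCOM-II implementation: the function name `has_structural_zero` and the counts are hypothetical.

```python
import math


def has_structural_zero(counts, z=1.96):
    """Apply the two criteria above to one taxon's counts within one group."""
    n = len(counts)
    p_hat = sum(1 for c in counts if c != 0) / n  # proportion of non-zero samples
    if p_hat == 0:
        return True  # first criterion: the taxon is never observed
    # Second criterion: lower bound of the normal-approximation interval <= 0.
    lower = p_hat - z * math.sqrt(p_hat * (1 - p_hat) / n)
    return lower <= 0


# Hypothetical counts of one taxon in three groups of ten samples each.
group_a = [0] * 10                          # never observed
group_b = [0, 0, 0, 0, 0, 0, 0, 0, 3, 0]    # observed in 1 of 10 samples
group_c = [5, 0, 8, 2, 0, 7, 1, 4, 0, 6]    # observed in 7 of 10 samples

flags = [has_structural_zero(g) for g in (group_a, group_b, group_c)]
```

Here the first two groups are flagged and the third is not, so under the rules described in this section the taxon would be declared differentially abundant in the third group relative to the other two.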

If a taxon is considered to be a structural zero in an experimental group then, for that specific ecosystem, the taxon is not used in further analysis. Thus, suppose there are three ecosystems A, B, and C and suppose taxon X is a structural zero in ecosystems A and B but not in C; then taxon X is declared to be differentially abundant in C relative to A and B and is not analyzed further. If taxon Y is a structural zero in ecosystem A but not in B and C, we declare that taxon Y is differentially abundant in B relative to A as well as differentially abundant in C relative to A. We then compare the absolute abundance of taxon Y between B and C using the methodology described in this section. Taxa identified to be structural zeros in all experimental groups are excluded from the following analyses.

In a similar fashion, we address the outlier zeros as well as sampling zeros using the methodology developed in ANCOM-II²².

Model assumptions

Assumption 0.1.

$$E(O_{ijk} \mid A_{ijk}) = c_{jk} A_{ijk}, \qquad \mathrm{Var}(O_{ijk} \mid A_{ijk}) = \sigma_{w,ijk}^{2},$$

where $σ_{w, i j k}^{2}$ = variability between specimens within the kth sample from the jth group. Therefore, $σ_{w, i j k}^{2}$ characterizes the within-sample variability. Typically, researchers do not obtain more than one specimen at a given time in most microbiome studies. Consequently, variability between specimens within sample is usually not estimated.

According to Assumption 0.1, in expectation the absolute abundance of a taxon in a random sample is in constant proportion to the absolute abundance in the ecosystem of the sample. In other words, the expected relative abundance of each taxon in a random sample is equal to the relative abundance of the taxon in the ecosystem of the sample.
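One way to see this, reading “expected relative abundance” as the ratio of the expected counts (an interpretation assumed here, with $i'$ used as a summation index), is that the sampling fraction is common to all taxa in a sample and therefore cancels:

$$\frac{E(O_{ijk} \mid A_{ijk})}{\sum_{i' = 1}^{m} E(O_{i'jk} \mid A_{i'jk})} = \frac{c_{jk} A_{ijk}}{c_{jk} \sum_{i' = 1}^{m} A_{i'jk}} = \frac{A_{ijk}}{A_{\cdot jk}} = \gamma_{ijk}.$$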

Assumption 0.2. For each taxon i, A_ijk, j = 1, …, g, k = 1, …, n_j, are independently distributed with

E(A_{ijk} ∣ θ_{ij}) = θ_{ij}, \quad \mathrm{Var}(A_{ijk} ∣ θ_{ij}) = σ_{b,ij}^{2},

where $σ_{b, i j}^{2}$ = between-sample variation within group j for the ith taxon.

Assumption 0.2 states that, for a given taxon, all subjects within and between groups are independent, where θ_ij is a fixed parameter rather than a random variable.

Regression framework

From Assumptions 0.1 and 0.2, we have:

E(O_{ijk}) = c_{jk} θ_{ij}, \quad \mathrm{Var}(O_{ijk}) = f(σ_{w,ijk}^{2}, σ_{b,ij}^{2}) := σ_{t,ijk}^{2}.

Motivated by the above set-up, we introduce the following linear model framework for log-transformed OTU counts data:

y_{i j k} = d_{j k} + μ_{i j} + ϵ_{i j k},

with

E(ϵ_{ijk}) = 0, \quad E(y_{ijk}) = d_{jk} + μ_{ij}, \quad \mathrm{Var}(y_{ijk}) = \mathrm{Var}(ϵ_{ijk}) := σ_{ijk}^{2}.

With a slight abuse of notation for simplicity of exposition, the above log-transformation of the data is inspired by the Box–Cox family of transformations²³, which are routinely used in data analysis. Note that d in the above equation is not exactly log(c) because of Jensenʼs inequality; it simply reflects the effect of c.
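The role of Jensenʼs inequality can be made explicit with one worked step. The sketch below assumes, for illustration only, that y_ijk = log O_ijk with O_ijk > 0, and it uses E(O_ijk) = c_jk θ_ij from the preceding display.

```latex
d_{jk} + \mu_{ij} = E(y_{ijk}) = E(\log O_{ijk})
  \le \log E(O_{ijk}) = \log c_{jk} + \log \theta_{ij}
```

Equality holds only if O_ijk is degenerate, so in general d_jk + μ_ij is strictly smaller than log c_jk + log θ_ij, and d_jk should be read as a term that captures the effect of c_jk rather than as log c_jk itself.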

An important distinction from standard ANOVA: Before we describe the details of the proposed methodology, we would like to draw attention to a fundamental difference between the current formulation of the problem and the standard one-way ANOVA model. For simplicity, let us suppose we have two groups, say a treatment and a control group. Let us also suppose that there is only one taxon in our microbiome study and n subjects are assigned to the treatment group and n are assigned to the control group. Suppose the researcher is interested in comparing the mean absolute abundance of the taxon in the ecosystems of the two groups. Then under the above assumptions, the model describing the study is given by:

y_{j k} = d_{j k} + μ_{j} + ϵ_{j k}, j = 1, 2, k = 1, 2, \dots, n .

Then trivially the sample mean absolute abundance of jth group is given by $ȳ_{j .} = \frac{1}{n} \sum_{k = 1}^{n} y_{j k}$ and $E (ȳ_{j .}) = \frac{1}{n} \sum_{k = 1}^{n} d_{j k} + μ_{j} = {\bar{d}}_{j .} + μ_{j}$ . The difference in the sample means between the two groups is $ȳ_{1 .} - ȳ_{2 .}$ and its expectation is $E (ȳ_{1 .} - ȳ_{2 .}) = ({\bar{d}}_{1 .} - {\bar{d}}_{2 .}) + (μ_{1} - μ_{2})$ . Under the null hypothesis μ₁ = μ₂, $E (ȳ_{1 .} - ȳ_{2 .}) = {\bar{d}}_{1 .} - {\bar{d}}_{2 .} \neq 0$ , unless ${\bar{d}}_{1 .} = {\bar{d}}_{2 .}$ . Thus because of the differential sampling fractions, which are sample specific, the numerator of the standard t-test under the null hypothesis for these microbiome data is non-zero. This introduces bias and hence inflates the Type I error. On the other hand, the standard one-way ANOVA model for two groups, which is not applicable for the microbiome data described in this paper, is of the form:

y_{j k} = d + μ_{j} + ϵ_{j k}, j = 1, 2, k = 1, 2, \dots, n .

Hence under the null hypothesis μ₁ = μ₂, $E (ȳ_{1 .} - ȳ_{2 .}) = 0$ . Thus, in this case the standard t-test is appropriate. Hence in this paper we develop methodology to eliminate the bias introduced by the differential sampling fraction of each sample. To do so, we exploit the fact that we have a large number of taxa on each subject, and we borrow information across taxa to estimate this bias; this is the essence of the following methodology.
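A small simulation can illustrate why the numerator of the standard t-test is not centered at zero when the sampling fractions differ. The sketch below is illustrative only: the sample size, the uniform ranges for d_jk, and the noise level are assumptions chosen for the example, not values from the paper.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(0)
n = 200      # samples per group (illustrative)
mu = 4.0     # same mean log abundance in both groups, so H0: mu_1 = mu_2 holds

# Hypothetical sample-specific terms d_jk, systematically lower in group 2.
d1 = [rng.uniform(-1.0, 0.0) for _ in range(n)]
d2 = [rng.uniform(-3.0, -2.0) for _ in range(n)]

# Observed log abundances y_jk = d_jk + mu_j + eps_jk.
y1 = [d + mu + rng.gauss(0.0, 0.5) for d in d1]
y2 = [d + mu + rng.gauss(0.0, 0.5) for d in d2]

naive_diff = mean(y1) - mean(y2)   # numerator of the standard t-test
bias = mean(d1) - mean(d2)         # d-bar_1 minus d-bar_2

print(f"naive difference: {naive_diff:.3f}, bias term: {bias:.3f}")
```

Although μ₁ = μ₂ in this simulation, the difference in sample means tracks the difference in average sampling-fraction terms rather than zero.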

Bias and variance of bias estimation under the null hypothesis: From the above model (equation (4)), for each j, note that $E (ȳ_{i j \cdot}) = {\bar{d}}_{j \cdot} + μ_{i j}$ and $E (ȳ_{\cdot j k}) = d_{j k} + {\bar{μ}}_{\cdot j}$ , where $\bar{w}$ represents the arithmetic mean over the suitable index. Using the least squares framework, we therefore estimate μ_ij and d_jk as follows:

\hat{d}_{jk} = \bar{y}_{\cdot jk} - \bar{y}_{\cdot j \cdot}, \quad k = 1, \dots, n_{j}, \; j = 1, 2, \dots, g,
\hat{μ}_{ij} = \bar{y}_{ij\cdot} - \bar{\hat{d}}_{j\cdot} = \bar{y}_{ij\cdot}, \quad i = 1, \dots, m.
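The least squares estimates above reduce to row and column means. A minimal sketch for a single group, using noise-free hypothetical values so that the result can be checked by hand (the function name `initial_estimates` is an assumption):

```python
def initial_estimates(y):
    """y[i][k] is the log abundance of taxon i in sample k of one group."""
    m, n = len(y), len(y[0])
    sample_means = [sum(y[i][k] for i in range(m)) / m for k in range(n)]
    grand_mean = sum(sample_means) / n
    d_hat = [s - grand_mean for s in sample_means]   # centered at zero within the group
    mu_hat = [sum(row) / n for row in y]             # taxon means, which absorb d-bar
    return d_hat, mu_hat

mu = [5.0, 3.0, 1.0]            # hypothetical true mu_ij for three taxa
d = [-1.0, -2.0, -3.0, -2.0]    # hypothetical true d_jk for four samples, mean -2
y = [[mu_i + d_k for d_k in d] for mu_i in mu]

d_hat, mu_hat = initial_estimates(y)
print(d_hat)    # d_jk minus the group mean of d
print(mu_hat)   # mu_ij plus the group mean of d
```

In this example every estimated taxon mean is shifted by the same amount, the average of d within the group, which is the bias the rest of the section sets out to remove.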

Note that $E ({\hat{μ}}_{i j}) = E ({\bar{y}}_{i j \cdot}) = μ_{i j} + {\bar{d}}_{j \cdot}$ . Thus, for each j = 1, 2, …g, ${\hat{μ}}_{i j}$ is a biased estimator and $E ({\hat{μ}}_{i 1} - {\hat{μ}}_{i 2}) = (μ_{i 1} - μ_{i 2}) + {\bar{d}}_{1 \cdot} - {\bar{d}}_{2 \cdot}$ . Denote $δ = {\bar{d}}_{1 \cdot} - {\bar{d}}_{2 \cdot} .$ To begin with, in the following we shall assume there are two experimental groups with balanced design, i.e., g = 2 and n₁ = n₂ = n. Later the methodology is easily extended to unbalanced design and multigroup settings. Suppose we have two ecosystems and for each taxon i, i = 1, 2, …m, we wish to test the hypothesis

H_{0}: μ_{i1} = μ_{i2} \quad \text{vs.} \quad H_{1}: μ_{i1} \neq μ_{i2}.

Under the null hypothesis, $E ({\hat{μ}}_{i 1} - {\hat{μ}}_{i 2}) = δ \neq 0$ , and hence biased. The goal of ANCOM-BC is to estimate this bias and accordingly modify the estimator ${\hat{μ}}_{i 1} - {\hat{μ}}_{i 2}$ so that the resulting estimator is asymptotically centered at zero under the null hypothesis and hence the test statistic is asymptotically centered at zero. First, we make the following observations. Since $E ({\bar{y}}_{i j \cdot}) = {\bar{d}}_{j \cdot} + μ_{i j}$ and ${\hat{μ}}_{i j} = {\bar{y}}_{i j \cdot}$ , therefore ${\hat{μ}}_{i j}$ is an unbiased estimator of ${\bar{d}}_{j \cdot} + μ_{i j}$ . From (5) and Lyapunov central limit theorem, we have:

\frac{\hat{μ}_{ij} - (μ_{ij} + \bar{d}_{j\cdot})}{σ_{ij}} \to_{d} N(0, 1) \text{ as } n \to \infty, \quad \text{where } σ_{ij}^{2} = \mathrm{Var}(\hat{μ}_{ij}) = \mathrm{Var}(\bar{y}_{ij\cdot}) = \frac{1}{n^{2}} \sum_{k=1}^{n} σ_{ijk}^{2}.

Let Σ_jk denote an m × m covariance matrix of $ϵ_{j k} = {(ϵ_{1 j k}, ϵ_{2 j k}, \dots, ϵ_{m j k})}^{T}$ , where $σ_{i i^{'} j k}$ is the ${(i, i^{'})}^{t h}$ element of Σ_jk and $σ_{i j k}^{2}$ is the ith diagonal element of Σ_jk. Furthermore, suppose

Assumption 0.3.

σ_{ijk}^{2} < σ_{0}^{2} < \infty, \quad \frac{\sum_{i \neq i'}^{m} σ_{ii'jk}}{m^{2}} = o(1).

Denote 1 = (1, 1, …, 1)^T, then we have

0 \leq 1^{T} Σ 1 = \sum_{i = 1}^{m} \sum_{i^{'} = 1}^{m} σ_{i i^{'} j k} = \sum_{i = 1}^{m} σ_{i j k}^{2} + \sum_{i \neq i^{'}}^{m} σ_{i i^{'} j k} \leq m σ_{0}^{2} + \sum_{i \neq i^{'}}^{m} σ_{i i^{'} j k},

Hence

0 \leq \frac{1^{T} Σ 1}{m^{2}} \leq \frac{σ_{0}^{2}}{m} + \frac{\sum_{i \neq i^{'}}^{m} σ_{i i^{'} j k}}{m^{2}} = o (1) .

Thus, for each k = 1, 2, …, n, and for each taxon i = 1, 2, …, m, according to Assumption 0.3, we have:

\frac{1}{m} \sum_{i = 1}^{m} (y_{i j k} - (d_{j k} + μ_{i j})) \to_{p} 0 as m \to \infty .

Thus,

{\hat{d}}_{j k} = {\bar{y}}_{\cdot j k} - {\bar{y}}_{\cdot j \cdot} \to_{p} (d_{j k} + {\bar{μ}}_{\cdot j}) - ({\bar{d}}_{j \cdot} + {\bar{μ}}_{\cdot j}) = d_{j k} - {\bar{d}}_{j \cdot}, as m \to \infty .

Using (8) and (13), let

{\hat{σ}}_{i j}^{2} = \frac{1}{n^{2}} \sum_{k = 1}^{n} {(y_{i j k} - {\hat{d}}_{j k} - {\hat{μ}}_{i j})}^{2}

denote the mean residual sum of squares. Then under some mild regularity conditions²⁴, we have the following consistency result

n ({\hat{σ}}_{i j}^{2} - σ_{i j}^{2}) \to_{p} 0, as m, n \to \infty .

Therefore, using ${\hat{σ}}_{i j}$ for σ_ij in (8) and appealing to Slutsky’s theorem, we have:

\frac{{\hat{μ}}_{i j} - (μ_{i j} + {\bar{d}}_{j \cdot})}{{\hat{σ}}_{i j}} \to_{d} N (0, 1), as m, n \to \infty .

Furthermore, based on Assumption 0.3, from (8) and (15) we obtain:

{\hat{σ}}_{i j} \to_{p} 0, as m, n \to \infty .

Consequently,

{\hat{μ}}_{i j} \to_{p} μ_{i j} + {\bar{d}}_{j \cdot}, as m, n \to \infty .

The above observations regarding the convergence of various statistics play a critical role in the following. Since the sampling fraction is constant for all taxa within the subject, we attempt to pool information across taxa when estimating δ. We model the taxa using the following Gaussian mixtures model. For the ith taxon, i = 1, 2, …, m, let $Δ_{i} = {\hat{μ}}_{i 1} - {\hat{μ}}_{i 2}$ . Let C₀ denote the set of taxa that are not differentially abundant between the two groups, i.e., C₀ = {i ∈ (1, 2, …, m): μ_i1 = μ_i2}, C₁ denote the set of taxa whose mean abundance in group 1 is less than that of group 2, i.e., C₁ = {i ∈ (1, 2, …, m): μ_i1 < μ_i2}, and let C₂ denote the set of taxa whose mean abundance in group 1 is greater than that of group 2, i.e., C₂ = {i ∈ (1, 2, …, m): μ_i1 > μ_i2}, Let π_r denote the probability that a taxon belongs to set C_r, r = 0, 1, 2. For simplicity of estimation of parameters, similar to GEE, we shall assume that Δ_i, i = 1, 2, …, m are independently distributed. Thus, we ignore the underlying correlation structure when estimating δ. This is similar to what is often done in other omics studies. Thus, we model the distribution of Δ_i by Gaussian mixture as follows:

f (Δ_{i}) = π_{0} ϕ (\frac{Δ_{i} - δ}{ν_{i 0}}) + π_{1} ϕ (\frac{Δ_{i} - (δ + l_{1})}{ν_{i 1}}) + π_{2} ϕ (\frac{Δ_{i} - (δ + l_{2})}{ν_{i 2}}),

where

ϕ is the normal density function,
δ + l₁ and δ + l₂ are means for Δ_i∣C₁, and Δ_i∣C₂, respectively. l₁ < 0, l₂ > 0,
$ν_{i 0}^{2}$, $ν_{i 1}^{2}$, and $ν_{i 2}^{2}$ are variances of Δ_i∣C₀, Δ_i∣C₁, and Δ_i∣C₂, respectively.

For computational simplicity, we assume that $ν_{i 1}^{2} > ν_{i 0}^{2}$ and $ν_{i 2}^{2} > ν_{i 0}^{2}$. Thus, without loss of generality, for κ₁, κ₂ > 0, let $ν_{i 1}^{2} = ν_{i 0}^{2} + κ_{1}$ and $ν_{i 2}^{2} = ν_{i 0}^{2} + κ_{2}$. While this assumption is not a requirement for our method, it is reasonable to assume that variability among differentially abundant taxa is larger than that among the null taxa. Making this assumption reduces the computation time.

Assuming samples are independent between experimental groups, we begin by first estimating $ν_{i 0}^{2} = V ar ({\hat{μ}}_{i 1} - {\hat{μ}}_{i 2}) = V ar ({\hat{μ}}_{i 1}) + V ar ({\hat{μ}}_{i 2})$ . Using the estimator stated in (14), we estimate $ν_{i 0}^{2}$ consistently as follows:

{\hat{ν}}_{i 0}^{2} = \sum_{j = 1}^{2} {\hat{σ}}_{i j}^{2} = \sum_{j = 1}^{2} \frac{1}{n^{2}} \sum_{k = 1}^{n} {(y_{i j k} - {\hat{d}}_{j k} - {\hat{μ}}_{i j})}^{2} .
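The plug-in estimator above is a sum of scaled residual sums of squares over the two groups. A minimal sketch, assuming the initial estimates for each group are already available (the function name `sigma2_hat` and all numbers are hypothetical):

```python
def sigma2_hat(y, d_hat, mu_hat):
    """Per-taxon variance estimate (1/n^2) * sum_k (y_ijk - d_hat_jk - mu_hat_ij)^2."""
    n = len(d_hat)
    return [
        sum((y[i][k] - d_hat[k] - mu_hat[i]) ** 2 for k in range(n)) / n ** 2
        for i in range(len(y))
    ]

# Two taxa, two samples per group; d_hat and mu_hat are the column-mean and
# row-mean estimates computed from each group's y.
y1, d_hat1, mu_hat1 = [[1.0, 3.0], [2.0, 2.0]], [-0.5, 0.5], [2.0, 2.0]
y2, d_hat2, mu_hat2 = [[0.0, 0.0], [4.0, 0.0]], [1.0, -1.0], [0.0, 2.0]

s2_group1 = sigma2_hat(y1, d_hat1, mu_hat1)
s2_group2 = sigma2_hat(y2, d_hat2, mu_hat2)
nu2_0_hat = [a + b for a, b in zip(s2_group1, s2_group2)]
print(nu2_0_hat)
```

The division by n² rather than n reflects that the quantity being estimated is the variance of a group mean, not of a single observation.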

In all future calculations, we plug in ${\hat{ν}}_{i 0}^{2}$ for $ν_{i 0}^{2}$ . This is similar in spirit to many statistical procedures involving nuisance parameters. The following lemma is useful in the sequel.

Lemma 0.1.

$\frac{\partial}{\partial θ} \log f (x) = E_{f (z ∣ x)} [\frac{\partial}{\partial θ} \log f (z) + \frac{\partial}{\partial θ} \log f (x ∣ z)]$ .²⁵

Let $Θ = {(δ, π_{0}, π_{1}, π_{2}, l_{1}, l_{2}, κ_{1}, κ_{2})}^{T}$ denote the set of unknown parameters, then for each taxon the log-likelihood can be reformulated using Lemma 0.1, as follows:

Θ \leftarrow \arg\max_{Θ} \sum_{i=1}^{m} \sum_{r=0}^{2} p_{r,i} [\log \Pr(i \in C_{r}) + \log f(Δ_{i} ∣ i \in C_{r})].

Then the E–M algorithm is described as follows:

E-step: Compute conditional probabilities of the latent variable. Define $p_{r, i} = P r (i \in C_{r} ∣ Δ_{i}) = \frac{π_{r} ϕ (\frac{Δ_{i} - (δ + l_{r})}{ν_{i r}})}{\sum_{r} π_{r} ϕ (\frac{Δ_{i} - (δ + l_{r})}{ν_{i r}})}, r = 0, 1, 2; i = 1, \dots, m$ , which are conditional probabilities representing the probability that an observed value follows each distribution. Note that l₀ = 0.
M-step: Maximize the likelihood function with respect to the parameters, given the conditional probabilities.

We shall denote the resulting estimator of δ by ${\hat{δ}}_{EM} .$
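The E-step and part of the M-step can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a properly normalized normal density with variance ν²_i0 + κ_r, it updates only π and δ in closed form while holding l and κ fixed, and it starts δ near the median of the Δ_i. The function names, the parameter dictionary, and the data are all hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(delta_obs, nu2_0, params):
    """Posterior probabilities p_{r,i} that taxon i belongs to C_r, r = 0, 1, 2."""
    post = []
    for x, v0 in zip(delta_obs, nu2_0):
        dens = [
            params["pi"][r]
            * normal_pdf(x, params["delta"] + params["l"][r], v0 + params["kappa"][r])
            for r in range(3)
        ]
        total = sum(dens)
        post.append([w / total for w in dens])
    return post

def update_pi_and_delta(delta_obs, nu2_0, post, params):
    """Partial M-step: closed-form updates of pi and delta, with l and kappa fixed."""
    m = len(delta_obs)
    pi = [sum(p[r] for p in post) / m for r in range(3)]
    num = den = 0.0
    for x, v0, p in zip(delta_obs, nu2_0, post):
        for r in range(3):
            w = p[r] / (v0 + params["kappa"][r])   # posterior weight over variance
            num += w * (x - params["l"][r])
            den += w
    return dict(params, pi=pi, delta=num / den)

# Hypothetical data: six null taxa centered near delta = 1, one taxon in C1, one in C2.
delta_obs = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0, -1.0, 3.0]
nu2_0 = [0.01] * len(delta_obs)
params = {
    "delta": sorted(delta_obs)[len(delta_obs) // 2],  # start near the median
    "pi": [1 / 3, 1 / 3, 1 / 3],
    "l": [0.0, -2.0, 2.0],      # l_0 = 0, l_1 < 0, l_2 > 0
    "kappa": [0.0, 1.0, 1.0],   # kappa_0 = 0
}
for _ in range(20):
    post = e_step(delta_obs, nu2_0, params)
    params = update_pi_and_delta(delta_obs, nu2_0, post, params)

print(round(params["delta"], 3), [round(p, 3) for p in params["pi"]])
```

A full M-step would also update l₁, l₂, κ₁, and κ₂ under their sign constraints, which generally requires numerical optimization. The starting value matters because the mixture likelihood can have several local maxima.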

Next we estimate $\mathrm{Var} ({\hat{δ}}_{EM})$ . Since the likelihood function is not a regular likelihood, it is not feasible to derive the Fisher information. Consequently, we take a simpler, pragmatic approach and derive an approximate estimator of $\mathrm{Var} ({\hat{δ}}_{EM})$ using $\mathrm{Var} ({\hat{δ}}_{WLS})$ , which is defined below. Extensive simulation studies suggest that ${\hat{δ}}_{EM}$ and ${\hat{δ}}_{WLS}$ are highly correlated (Supplementary Fig. 9), so it appears reasonable to approximate $\mathrm{Var} ({\hat{δ}}_{EM})$ by $\mathrm{Var} ({\hat{δ}}_{WLS})$ .

Let |C_r| = m_r, r = 0, 1, 2, then

{\hat{δ}}_{WLS} = \frac{\sum_{i \in C_{0}} \frac{Δ_{i}}{{\hat{ν}}_{i 0}^{2}} + \sum_{i \in C_{1}} \frac{Δ_{i} - {\hat{l}}_{1}}{{\hat{ν}}_{i 1}^{2}} + \sum_{i \in C_{2}} \frac{Δ_{i} - {\hat{l}}_{2}}{{\hat{ν}}_{i 2}^{2}}}{\sum_{i \in C_{0}} \frac{1}{{\hat{ν}}_{i 0}^{2}} + \sum_{i \in C_{1}} \frac{1}{{\hat{ν}}_{i 1}^{2}} + \sum_{i \in C_{2}} \frac{1}{{\hat{ν}}_{i 2}^{2}}} = \frac{\sum_{i \in C_{0}} \frac{Δ_{i}}{ν_{i 0}^{2}} + \sum_{i \in C_{1}} \frac{Δ_{i} - l_{1}}{ν_{i 1}^{2}} + \sum_{i \in C_{2}} \frac{Δ_{i} - l_{2}}{ν_{i 2}^{2}}}{\sum_{i \in C_{0}} \frac{1}{ν_{i 0}^{2}} + \sum_{i \in C_{1}} \frac{1}{ν_{i 1}^{2}} + \sum_{i \in C_{2}} \frac{1}{ν_{i 2}^{2}}} + o_{p} (1) .

The above expression is of the form

\frac{a_{1}^{T} x_{1} + a_{2}^{T} x_{2} + a_{3}^{T} x_{3}}{a_{1}^{T} 1 + a_{2}^{T} 1 + a_{3}^{T} 1} \equiv \frac{α^{T} u}{α^{T} 1},

where

1 = (1, …, 1)^T,
$a_{r} = {(a_{r 1}, a_{r 2}, \dots, a_{r m_{r - 1}})}^{T} : = {(\frac{1}{ν_{i, r - 1}^{2}})}^{T}, i \in C_{r - 1}, r = 1, 2, 3$ ,
$x_{r} = {(x_{r 1}, x_{r 2}, \dots, x_{r m_{r - 1}})}^{T} : = {(Δ_{i} - l_{r - 1})}^{T}, i \in C_{r - 1}, r = 1, 2, 3$ . Note that l₀ = 0,
$α = {(α_{1}, α_{2}, \dots, α_{m})}^{T} \equiv {(a_{1}^{T}, a_{2}^{T}, a_{3}^{T})}^{T}$ ,
$u = {(u_{1}, u_{2}, \dots, u_{m})}^{T} \equiv {(x_{1}^{T}, x_{2}^{T}, x_{3}^{T})}^{T}$ .

For simplicity of notation, we relabel a and x by α and u, respectively. Denote Cov(x) = Cov(u) by Ω, and let $ω_{i i^{'}}$ denote the $(i, i^{'})$ element of Ω. As in Assumption 0.3, we make the following assumption

Assumption 0.4.

\frac{\sum_{i \neq i^{'}}^{m} ω_{i i^{'}}}{m^{2}} = o (1) .

Using the above expressions, we compute the variance as follows:

V ar ({\hat{δ}}_{WLS}) = V ar (\frac{α^{T} u}{α^{T} 1}) = \frac{\sum_{i = 1}^{m} α_{i}^{2} ω_{i i}}{{(\sum_{i = 1}^{m} α_{i})}^{2}} + \frac{\sum_{i \neq i^{'}}^{m} α_{i} α_{i^{'}} ω_{i i^{'}}}{{(\sum_{i = 1}^{m} α_{i})}^{2}} .

Recall that (a) for i ∈ C₀, $ω_{i i} = V ar (Δ_{i}) = ν_{i 0}^{2} = O (n^{- 1})$ , (b) for i ∈ C₁, $ω_{i i} = V ar (Δ_{i}) = ν_{i 1}^{2} = ν_{i 0}^{2} + κ_{1} = O (1)$ , and (c) for i ∈ C₂, $ω_{i i} = V ar (Δ_{i}) = ν_{i 2}^{2} = ν_{i 0}^{2} + κ_{2} = O (1)$ . Note that $α_{i} = \frac{1}{V ar (Δ_{i})} = \frac{1}{ω_{i i}}$ , thus we have:

V ar (\frac{α^{T} u}{α^{T} 1}) = \frac{\sum_{i = 1}^{m} α_{i}^{2} ω_{i i}}{{(\sum_{i = 1}^{m} α_{i})}^{2}} + \frac{\sum_{i \neq i^{'}}^{m} α_{i} α_{i^{'}} ω_{i i^{'}}}{{(\sum_{i = 1}^{m} α_{i})}^{2}} = \frac{1}{\sum_{i = 1}^{m} α_{i}} + \frac{\sum_{i \neq i^{'}}^{m} α_{i} α_{i^{'}} ω_{i i^{'}}}{{(\sum_{i = 1}^{m} α_{i})}^{2}} .

Since $ν_{i 0}^{2} = O (n^{- 1})$ , $ν_{i 1}^{2} = O (1)$ , and $ν_{i 2}^{2} = O (1)$ , consequently, a_1i = O(n), a_2i = a_3i = O(1), and

\sum_{i = 1}^{m} α_{i} = 1^{T} a_{1} + 1^{T} a_{2} + 1^{T} a_{3} = \sum_{i \in C_{0}} O (n) + \sum_{i \in C_{1}} O (1) + \sum_{i \in C_{2}} O (1) = O (m_{0} n) + O (m_{1}) + O (m_{2}) = O (m_{0} n) if m_{0} n \geq \max {m_{1}, m_{2}} .

Using these facts and Assumption 0.4 in (26), we get

V ar (\frac{α^{T} u}{α^{T} 1}) = O (m_{0}^{- 1} n^{- 1}) + \frac{\sum_{i \neq i^{'}}^{m} {n^{- 1} m^{- 1} α_{i}} {n^{- 1} m^{- 1} α_{i^{'}}} ω_{i i^{'}}}{n^{- 2} m^{- 2} {(\sum_{i = 1}^{m} α_{i})}^{2}} = O (m_{0}^{- 1} n^{- 1}) + \frac{1}{m^{2}} \frac{\sum_{i \neq i^{'}}^{m} {n^{- 1} α_{i}} {n^{- 1} α_{i^{'}}} ω_{i i^{'}}}{{(\sum_{i = 1}^{m} n^{- 1} m^{- 1} α_{i})}^{2}} = O (m_{0}^{- 1} n^{- 1}) + \frac{1}{m^{2}} \frac{O (1) o (m^{2})}{O (1)} = O (m_{0}^{- 1} n^{- 1}) .

Thus, under Assumption 0.4 regarding $ω_{i i^{'}}$ , the contribution of the covariance terms in the above variance expression is negligible as long as m is very large compared with n, which is usually the case. Hence

V ar ({\hat{δ}}_{WLS}) = V ar (\frac{α^{T} u}{α^{T} 1}) = O (m_{0}^{- 1} n^{- 1}) .

Furthermore, appealing to the Cauchy–Schwarz inequality we get

Cov ({\hat{μ}}_{i 1} - {\hat{μ}}_{i 2}, {\hat{δ}}_{WLS}) \leq \sqrt{V ar ({\hat{μ}}_{i 1} - {\hat{μ}}_{i 2}) V ar ({\hat{δ}}_{WLS})} \leq O (n^{- 1 / 2}) O (m_{0}^{- 1 / 2} n^{- 1 / 2}) = O (n^{- 1} m_{0}^{- 1 / 2}) .

Hence, as long as m₀ is large, the contribution made by $V ar ({\hat{δ}}_{WLS})$ and $Cov ({\hat{μ}}_{i 1} - {\hat{μ}}_{i 2}, {\hat{δ}}_{WLS})$ relative to $V ar ({\hat{μ}}_{i 1} - {\hat{μ}}_{i 2})$ is negligible.

Neglecting the covariance term in (26) and letting ${\hat{C}}_{r}$ denote the estimator of C_r, r = 0, 1, 2, from the E–M algorithm, define

\hat{V ar} ({\hat{δ}}_{WLS}) = \frac{1}{\sum_{i \in {\hat{C}}_{0}} \frac{1}{{\hat{ν}}_{i 0}^{2}} + \sum_{i \in {\hat{C}}_{1}} \frac{1}{{\hat{ν}}_{i 1}^{2}} + \sum_{i \in {\hat{C}}_{2}} \frac{1}{{\hat{ν}}_{i 2}^{2}}},

as an estimator of $\mathrm{Var} ({\hat{δ}}_{WLS})$ under Assumption 0.4. Then

\hat{V ar} ({\hat{δ}}_{WLS}) \to_{p} \frac{1}{\sum_{i = 1}^{m} α_{i}} = \frac{1}{\sum_{i \in C_{0}} \frac{1}{ν_{i 0}^{2}} + \sum_{i \in C_{1}} \frac{1}{ν_{i 1}^{2}} + \sum_{i \in C_{2}} \frac{1}{ν_{i 2}^{2}}}, as m, n \to \infty .
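The weighted least squares estimator and its approximate variance are inverse-variance weighted quantities. A minimal sketch, assuming class labels, variance estimates, and shifts l are already available from the E–M step (the function name `delta_wls` and the numbers are hypothetical):

```python
def delta_wls(delta_obs, labels, nu2, l):
    """Inverse-variance weighted estimate of delta and its approximate variance.

    labels[i] in {0, 1, 2} is the estimated class of taxon i, nu2[i] is the
    variance estimate for its class, and l = [0, l1, l2].
    """
    weights = [1.0 / v for v in nu2]
    num = sum(w * (x - l[r]) for w, x, r in zip(weights, delta_obs, labels))
    den = sum(weights)
    return num / den, 1.0 / den   # covariance terms are neglected in the variance

estimate, variance = delta_wls(
    delta_obs=[1.0, 1.2, -1.0, 3.0],
    labels=[0, 0, 1, 2],
    nu2=[0.5, 0.5, 1.0, 1.0],
    l=[0.0, -2.0, 2.0],
)
print(estimate, variance)
```

Because null taxa have smaller variances, they receive the largest weights, which is why the variance is governed mainly by the number of null taxa m₀.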

We performed extensive simulations to evaluate the bias and variance of ${\hat{δ}}_{EM}$ and ${\hat{δ}}_{WLS}$ . The scatter plot (Supplementary Fig. 9) of ${\hat{δ}}_{EM}$ against ${\hat{δ}}_{WLS}$ is almost perfectly linear, with a correlation coefficient of nearly 1 in all cases. This suggests that ${\hat{δ}}_{WLS}$ is a very good approximation for ${\hat{δ}}_{EM}$ . The variances of ${\hat{δ}}_{EM}$ and ${\hat{δ}}_{WLS}$ are roughly of the order $n^{- 1} m_{0}^{- 1}$ , as expected. In addition, both estimators are approximately unbiased (Supplementary Table 6).

Hypothesis testing for two-group comparison

For taxon i, we test the following hypothesis

H_{0}: μ_{i1} = μ_{i2} \quad \text{vs.} \quad H_{1}: μ_{i1} \neq μ_{i2}

using the following test statistic which is approximately centered at zero under the null hypothesis:

W_{i} = \frac{{\hat{μ}}_{i 1} - {\hat{μ}}_{i 2} - {\hat{δ}}_{EM}}{\sqrt{{\hat{σ}}_{i 1}^{2} + {\hat{σ}}_{i 2}^{2}}} .

From Slutsky’s theorem, we have:

W_{i} \to_{d} N (0, 1), as m, n \to \infty .

If the sample size is not very large and/or the number of non-null taxa is very large, then we modify the above test statistic as follows:

W_{i}^{*} = \frac{{\hat{μ}}_{i 1} - {\hat{μ}}_{i 2} - {\hat{δ}}_{WLS}}{\sqrt{{\hat{σ}}_{i 1}^{2} + {\hat{σ}}_{i 2}^{2} + \hat{V ar} ({\hat{δ}}_{WLS}) + 2 \sqrt{({\hat{σ}}_{i 1}^{2} + {\hat{σ}}_{i 2}^{2}) \hat{V ar} ({\hat{δ}}_{WLS})}}} .
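Both test statistics can be computed directly from the quantities defined above. In the sketch below, the function names and the input values are hypothetical, and the two-sided p value from the standard normal is an assumption that follows from the stated asymptotic N(0, 1) distribution.

```python
import math

def w_stat(mu1, mu2, delta_hat, s2_1, s2_2):
    """Bias-corrected test statistic W_i."""
    return (mu1 - mu2 - delta_hat) / math.sqrt(s2_1 + s2_2)

def w_star(mu1, mu2, delta_wls, s2_1, s2_2, var_delta):
    """Conservative variant W_i* that also accounts for the variance of delta."""
    s2 = s2_1 + s2_2
    denom = math.sqrt(s2 + var_delta + 2 * math.sqrt(s2 * var_delta))
    return (mu1 - mu2 - delta_wls) / denom

def two_sided_p(w):
    """Two-sided p value under a standard normal reference distribution."""
    return math.erfc(abs(w) / math.sqrt(2))

w = w_stat(mu1=3.0, mu2=1.0, delta_hat=1.0, s2_1=0.125, s2_2=0.125)
w_conservative = w_star(3.0, 1.0, 1.0, 0.125, 0.125, var_delta=0.25)
print(w, two_sided_p(w), w_conservative)
```

The denominator of W_i* equals the sum of the two standard deviations, sqrt(σ̂²_i1 + σ̂²_i2) + sqrt(Var(δ̂_WLS)), which is the Cauchy–Schwarz upper bound on the standard deviation of the numerator. This is why the modified statistic is never larger in magnitude than the unmodified one computed with the same δ estimate.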

To control the FDR due to multiple comparisons, we recommend applying the Holm–Bonferroni method²⁶ or the Bonferroni^27,28 correction rather than the Benjamini–Hochberg (BH) procedure²⁹ to adjust the raw p values. Research has shown that this is more appropriate for controlling the FDR when p values are not accurate³⁰, and the BH procedure controls the FDR provided there is either independence or some special correlation structure, such as positive regression dependence among taxa^29,31. In our simulation studies, the absolute abundances for each taxon were generated independently, so we compared the ANCOM-BC results adjusted by either the Bonferroni correction (Fig. 4) or the BH procedure (Supplementary Fig. 10). The FDR control by the Bonferroni correction is clearly more conservative, while the BH procedure results in an FDR around the nominal level (5%). As expected, ANCOM-BC has larger power when using the BH procedure.
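For reference, the three adjustments discussed above can be written compactly. This is a generic sketch of the standard procedures, not code from the ANCOM-BC pipeline; the function names and the example p values are hypothetical.

```python
def bonferroni(p):
    m = len(p)
    return [min(1.0, m * x) for x in p]

def holm(p):
    """Holm step-down adjustment."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    adj = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * p[i])   # enforce monotonicity
        adj[i] = min(1.0, running)
    return adj

def benjamini_hochberg(p):
    """Benjamini-Hochberg step-up adjustment."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i], reverse=True)
    adj = [0.0] * m
    running = 1.0
    for k, i in enumerate(order):
        rank = m - k                                # 1 = smallest p value
        running = min(running, p[i] * m / rank)
        adj[i] = running
    return adj

raw = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(raw))
print(holm(raw))
print(benjamini_hochberg(raw))
```

On these example values the BH-adjusted p values are never larger than the Holm-adjusted ones, which in turn are never larger than the Bonferroni-adjusted ones, in line with the ordering of conservativeness described above.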

Hypothesis testing for multigroup comparison

In some applications, for a given taxon, researchers are interested in drawing inferences regarding DA in more than two ecosystems. For example, for a given taxon, researchers may want to test whether there exists at least one experimental group that is significantly different from others, i.e., to test

H_{0,i}: \cap_{j \neq j' \in \{1, \dots, g\}} \{μ_{ij} = μ_{ij'}\} \quad \text{vs.} \quad H_{1,i}: \cup_{j \neq j' \in \{1, \dots, g\}} \{μ_{ij} \neq μ_{ij'}\}.

Similar to the two-group comparison, after obtaining the initial estimates ${\hat{μ}}_{i j}$ and ${\hat{d}}_{j k}$ , setting the reference group r (e.g., r = 1), and obtaining the estimator of the bias term ${\hat{δ}}_{r j}$ through the E–M algorithm, the final estimator of the mean absolute abundance of the ecosystem (in log scale) is obtained by transforming ${\hat{μ}}_{i j}$ of (6) into:

\hat{μ}_{ij}^{*} := \begin{cases} \hat{μ}_{ir}, & j = r, \\ \hat{μ}_{ij} + \hat{δ}_{rj}, & j \neq r, \; j \in \{1, \dots, g\}. \end{cases}

Thus, based on (18) and the E–M estimator of δ_rj, as $m, \min (n_{j}, n_{j^{'}}) \to \infty$

\hat{μ}_{ij}^{*} - \hat{μ}_{ij'}^{*} \to_{p} \begin{cases} 0 & \text{if taxon } i \text{ is not differentially abundant between groups } j \text{ and } j', \\ μ_{ij} - μ_{ij'} & \text{otherwise}. \end{cases}

Similarly, the estimator of the sampling fraction is obtained by transforming ${\hat{d}}_{j k}$ of (6) into

\hat{d}_{jk}^{*} := \begin{cases} \hat{d}_{rk}, & j = r, \\ \hat{d}_{jk} - \hat{δ}_{rj}, & j \neq r, \; j \in \{1, \dots, g\}. \end{cases}

By (13) and the E–M estimator of δ_rj,

{\hat{d}}_{j k}^{*} \to_{p} d_{j k} - {\bar{d}}_{r \cdot} as m, \min (n_{j}, n_{j^{'}}) \to \infty,

which indicates that we are only able to estimate sampling fractions up to an additive constant ( ${\bar{d}}_{r \cdot}$ ).
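The two transformations amount to shifting each non-reference group by its estimated bias term. A minimal sketch with noise-free hypothetical values, in which `delta_hat[j]` stands for the estimate of the difference between the average d of the reference group and that of group j (the function name and data layout are assumptions):

```python
def bias_correct(mu_hat, d_hat, delta_hat, ref=0):
    """mu_hat[j][i] and d_hat[j][k] are the initial estimates for group j.

    delta_hat[j] estimates d-bar_ref minus d-bar_j; the entry for ref is ignored.
    """
    mu_star, d_star = [], []
    for j in range(len(mu_hat)):
        shift = 0.0 if j == ref else delta_hat[j]
        mu_star.append([x + shift for x in mu_hat[j]])
        d_star.append([x - shift for x in d_hat[j]])
    return mu_star, d_star

# Two taxa with identical true means (5 and 3) in all three groups, and
# group averages of d equal to -1, -2, and -3.
mu_hat = [[4.0, 2.0], [3.0, 1.0], [2.0, 0.0]]
d_hat = [[0.5, -0.5], [0.0, 0.0], [1.0, -1.0]]
delta_hat = [0.0, 1.0, 2.0]

mu_star, d_star = bias_correct(mu_hat, d_hat, delta_hat)
print(mu_star)
print(d_star)
```

After correction the taxon estimates agree across groups, since no taxon is differentially abundant in this example, while the sampling-fraction estimates remain offset by the reference group's average, the additive constant noted above.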

Define the test statistic for pairwise comparison as:

W_{i, j j^{'}} = \frac{{\hat{μ}}_{i j}^{*} - {\hat{μ}}_{i j^{'}}^{*}}{\sqrt{{\hat{σ}}_{i j}^{2} + {\hat{σ}}_{i j^{'}}^{2}}}, i = 1, \dots, m, j \neq j^{'} \in {1, \dots, g} .

For computational simplicity, inspired by Williams-type tests^32–35, we reformulate the global test with the following test statistic:

W_{i} = \max_{j \neq j^{'} \in {1, \dots, g}} ∣ W_{i, j j^{'}} ∣, i = 1, \dots, m .

Under the null hypothesis, $W_{i, j j^{'}} \to_{d} N (0, 1)$ ; thus we can construct the null distribution of W_i by simulation, i.e., for each specific taxon i,

Generate $W_{i, j j^{'}}^{(b)} \sim N (0, 1), j \neq j^{'} \in \{1, \dots, g\}, b = 1, \dots, B .$
Compute $W_{i}^{(b)} = \max_{j \neq j^{'} \in \{1, \dots, g\}} ∣ W_{i, j j^{'}}^{(b)} ∣ .$
Repeating the above steps B times (e.g., B = 1000) gives the null distribution of W_i as ${(W_{i}^{(1)}, \dots, W_{i}^{(B)})}^{T} .$

Finally, the p value is calculated as

p_{i} = \frac{1}{B} \sum_{b = 1}^{B} I (W_{i}^{(b)} > W_{i}), i = 1, \dots, m

and the Bonferroni correction is applied to control the FDR.
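The simulation-based p value for the global test can be sketched as follows. This is an illustration under stated assumptions: the simulated pairwise statistics are drawn as independent standard normals, the maximum is taken over absolute values so that it matches the definition of W_i, and the function name, seed, and inputs are hypothetical.

```python
import random

def global_test_p(w_pairwise, n_sim=1000, seed=0):
    """Monte Carlo p value for W_i = max |W_{i,jj'}| over all pairs of groups."""
    rng = random.Random(seed)
    w_obs = max(abs(w) for w in w_pairwise)
    n_pairs = len(w_pairwise)
    exceed = 0
    for _ in range(n_sim):
        # One draw from the simulated null distribution of the maximum.
        w_null = max(abs(rng.gauss(0.0, 1.0)) for _ in range(n_pairs))
        if w_null > w_obs:
            exceed += 1
    return w_obs, exceed / n_sim

# Three groups give three pairwise statistics per taxon.
w_strong, p_strong = global_test_p([0.3, -4.5, 4.1])
w_weak, p_weak = global_test_p([0.1, -0.2, 0.05])
print(w_strong, p_strong)
print(w_weak, p_weak)
```

With B simulations the smallest non-zero p value is 1/B, so B should be chosen with the number of taxa and the subsequent multiplicity correction in mind.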

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Supplementary information

Peer Review File (751.7KB, pdf)

Supplementary Information (4.5MB, pdf)

Reporting Summary (163.5KB, pdf)

Acknowledgements

This research was funded by the Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA.

Author contributions

Both authors contributed equally to the theory and methodology described in this paper. All numerical work and computations were conducted by H.L., who developed the ANCOM-BC pipeline in R, which is freely and publicly available. Please contact H.L. for software requests.

Data availability

DNA sequences from the global gut microbiota study⁹ can be found in MG-RAST https://www.mg-rast.org/index.html server under search string “mgp401” for Illumina V4-16S rRNA.

Code availability

All analyses can be found under https://github.com/FrederickHuangLin/ANCOM-BC.

Competing interests

The authors declare no competing interests.

Footnotes

Peer review information Nature Communications thanks Nathan Olson and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information is available for this paper at 10.1038/s41467-020-17041-7.

References

1. Weiss S, et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017;5:27. doi: 10.1186/s40168-017-0237-y.
2. Mandal S, et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 2015;26:27663. doi: 10.3402/mehd.v26.27663.
3. Gloor GB, Reid G. Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. Can. J. Microbiol. 2016;62:692–703. doi: 10.1139/cjm-2015-0821.
4. Gloor GB, Wu JR, Pawlowsky-Glahn V, Egozcue JJ. It’s all relative: analyzing microbiome data as compositions. Ann. Epidemiol. 2016;26:322–329. doi: 10.1016/j.annepidem.2016.03.003.
5. Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not optional. Front. Microbiol. 2017;8:2224. doi: 10.3389/fmicb.2017.02224.
6. Morton JT, et al. Establishing microbial composition measurement standards with reference frames. Nat. Commun. 2019;10:2719. doi: 10.1038/s41467-019-10656-5.
7. Aitchison J. The statistical analysis of compositional data. J. R. Stat. Soc. Ser. B. 1982;44:139–160.
8. Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial marker-gene surveys. Nat. methods. 2013;10:1200. doi: 10.1038/nmeth.2658.
9. Yatsunenko T, et al. Human gut microbiome viewed across age and geography. Nature. 2012;486:222. doi: 10.1038/nature11053.
10. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8.
11. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616.
12. Chen, Y., McCarthy, D., Robinson, M. & Smyth, G. K. EdgeR: Differential Expression Analysis of Digital Gene Expression Data User’s Guide. https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf (accessed 17 September 2008) (2014).
13. Caporaso JG, et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc. Natl Acad. Sci. 2011;108:4516–4522. doi: 10.1073/pnas.1000080107.
14. Lozupone CA, et al. Meta-analyses of studies of the human microbiota. Genome Res. 2013;23:1704–1714. doi: 10.1101/gr.151803.112.
15. Amrhein, V., Greenland, S. & McShane, B. Scientists rise up against statistical significance. https://www.nature.com/articles/d41586-019-00857-9?fbclid=IwAR3K6PysQ9FY4togs39BSciW3YsK-Pf6EE0Il9R8zxkW4GvrGBHFuz8yF5c (2019).
16. Codd G. Cyanobacterial toxins: occurrence, properties and biological significance. Water Sci. Technol. 1995;32:149–156. doi: 10.2166/wst.1995.0177.
17. Herlemann DP, Geissinger O, Brune A. The termite group I phylum is highly diverse and widespread in the environment. Appl. Environ. Microbiol. 2007;73:6682–6685. doi: 10.1128/AEM.00712-07.
18. Obregon-Tito AJ, et al. Subsistence strategies in traditional societies distinguish gut microbiomes. Nat. Commun. 2015;6:6505. doi: 10.1038/ncomms7505.
19. Halperin JJ. A tale of two spirochetes: lyme disease and syphilis. Neurologic Clin. 2010;28:277–291. doi: 10.1016/j.ncl.2009.09.009.
20. Dubourg G, et al. High-level colonisation of the human gut by Verrucomicrobia following broad-spectrum antibiotic treatment. Int. J. antimicrobial agents. 2013;41:149–155. doi: 10.1016/j.ijantimicag.2012.10.012.
21. Castaner, O. et al. The Gut Microbiome Profile in Obesity: A Systematic Review. Int. J. Endocrinol. 1–9 (2018). doi: 10.1155/2018/4095789.
22. Kaul A, Mandal S, Davidov O, Peddada SD. Analysis of microbiome data in the presence of excess zeros. Front. Microbiol. 2017;8:2114. doi: 10.3389/fmicb.2017.02114.
23. Box GE, Cox DR. An analysis of transformations. J. R. Stat. Soc. Ser. B. 1964;26:211–243.
24. Peddada SD, Smith T. Consistency of a class of variance estimators in linear models under heteroscedasticity. Sankhyā Indian J. Stat. Ser. B. 1997;59:1–10. doi: 10.1111/1467-9868.00053.
25. McLachlan, G. & Krishnan, T. The EM Algorithm and Extensions, Vol. 382 (John Wiley & Sons, 2007).
26. Holm S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979;6:65–70.
27. Dunn OJ. Estimation of the means of dependent variables. Ann. Math. Stat. 1958;29:1095–1111. doi: 10.1214/aoms/1177706443.
28. Dunn OJ. Multiple comparisons among means. J. Am. Stat. Assoc. 1961;56:52–64. doi: 10.1080/01621459.1961.10482090.
29. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B. 1995;57:289–300.
30. Lim C, Sen PK, Peddada SD. Robust analysis of high throughput screening assay data. Technometrics. 2013;55:150–160. doi: 10.1080/00401706.2012.749166.
31. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 2001;29:1165–1188. doi: 10.1214/aos/1013699998.
32. Williams D. A test for differences between treatment means when several dose levels are compared with a zero-dose control. Biometrics. 1971;27:103–117. doi: 10.2307/2528930.
33. Williams DA. Some inference procedures for monotonically ordered normal means. Biometrika. 1977;64:9–14. doi: 10.1093/biomet/64.1.9.
34. Peddada SD, Prescott KE, Conaway M. Tests for order restrictions in binary data. Biometrics. 2001;57:1219–1227. doi: 10.1111/j.0006-341X.2001.01219.x.
35. Farnan L, Ivanova A, Peddada SD. Linear mixed effects models under inequality constraints with applications. PloS ONE. 2014;9:e84778. doi: 10.1371/journal.pone.0084778.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Peer Review File^{(751.7KB, pdf)}

Supplementary Information^{(4.5MB, pdf)}

Reporting Summary^{(163.5KB, pdf)}

Data Availability Statement

DNA sequences from the global gut microbiota study⁹ can be found in MG-RAST https://www.mg-rast.org/index.html server under search string “mgp401” for Illumina V4-16S rRNA.

All analyses can be found under https://github.com/FrederickHuangLin/ANCOM-BC.
