Analysis of microbial compositions: a review of normalization and differential abundance analysis

Huang Lin; Shyamal Das Peddada

doi:10.1038/s41522-020-00160-w

. 2020 Dec 2;6:60. doi: 10.1038/s41522-020-00160-w

Analysis of microbial compositions: a review of normalization and differential abundance analysis

Huang Lin ¹, Shyamal Das Peddada ^1,^2,^✉

PMCID: PMC7710733 PMID: 33268781

Abstract

Increasingly, researchers are discovering associations between microbiome and a wide range of human diseases such as obesity, inflammatory bowel diseases, HIV, and so on. The first step towards microbiome wide association studies is the characterization of the composition of human microbiome under different conditions. Determination of differentially abundant microbes between two or more environments, known as differential abundance (DA) analysis, is a challenging and an important problem that has received considerable interest during the past decade. It is well documented in the literature that the observed microbiome data (OTU/SV table) are relative abundances with an excess of zeros. Since relative abundances sum to a constant, these data are necessarily compositional. In this article we review some recent methods for DA analysis and describe their strengths and weaknesses.

Subject terms: Bacteria, Microbiology

Introduction

Human oral and gut microbiome are estimated to have 45.6 million genes, which is ~2000-fold more genes than human genes¹, therefore the microbiome is sometimes referred to as the “second genome”, or another “organ” of human body^2–4. Hence it is not surprising that numerous diseases such as obesity⁵, inflammatory bowel diseases⁶ and HIV⁷ are associated or even caused by changes in the microbial ecosystem. For these reasons, understanding changes in the composition of microbiome under different conditions is important for studying human diseases.

For clarity, we begin by defining some important terms used in this paper and in the literature. The phrase absolute abundance of a taxon refers to the unobservable actual abundance of a taxon in a unit volume of an ecosystem, such as the gut. Accordingly, one could define absolute relative abundance of a taxon in a unit volume of an ecosystem as the ratio of the absolute abundance of the taxon to the total absolute abundance of all taxa in a unit volume of an ecosystem.

In practice, however, neither absolute abundance nor absolute relative abundance of a taxon in a unit volume of an ecosystem can be easily determined⁸. Although these parameters are typically not observable, the next-generation sequencing (NGS) technologies such as the 16S rRNA gene sequencing yield useful data for describing microbial compositions in an ecosystem. Following initial quality assessment/control steps, such as primer(s) removal, demultiplexing and quality filtering, the 16S amplicon sequences are either clustered into Operational Taxonomic Units (OTUs) representing the common working definition of bacterial species⁹ by OTU picking algorithms (e.g. UPARSE¹⁰), or grouped into Sequence Variants (SVs) using denoising algorithms (e.g. DADA2¹¹ and Deblur¹²). After the construction of OTU or SV, these observed counts are typically organized into a large matrix referred to as the feature table. Some researchers or software packages such as QIIME2¹³ represent samples by columns and features (OTUs or SVs) by rows, but this representation is not universal. The observed counts of features (OTUs or SVs) represent observed abundances of taxa in the sample. Since abundances in a feature table represent only relative information regarding each taxa^8,14–18, these are compositional data and thus reside inside a simplex¹⁹. Some researchers refer to these frequencies as relative abundances of taxa in a sample. However, in our terminology, relative abundance of a taxon in the sample is the fraction of the taxon observed in the feature table relative to the sum of all observed taxa corresponding to the sample in the feature table. Thus, by our terminology, the relative abundances sum to 1. In a recent paper by Lin and Peddada²⁰, the authors refer to abundance of taxa in a feature table as “observed absolute abundance”, which is a confusing terminology and should be avoided. Instead they should have referred to it as “observed abundance”. Various terms used in this paper are summarized in Table 1. The notations described in statistical methods are summarized in Table 2.

Table 1.

Definitions of key terminologies.

Term	Definition
Microbiota	Community of microscopic organisms.
Microbiome	Genes associated with the microbiota.
Amplicon	Product of PCR amplification.
High-throughput Sequencing	DNA sequencing approach that produces large amounts of sequence data rapidly at low cost.
OTU	Operational taxonomic unit: Group of DNA sequences with 97% similarity.
SV	Sequence variant: Individual DNA sequences recovered from a high-throughput marker gene analysis following the removal of spurious sequences generated during PCR amplification and sequencing.
Absolute abundance	Unobservable actual abundance of a taxon in a unit volume of an ecosystem.
Observed abundance	Observed counts of features (OTUs or SVs) in the feature table.
Relative abundance	The fraction of the taxon observed in the feature table relative to the sum of all taxa in the sample. It is between 0 and 1.
Feature Table	A matrix summarizing observed microbial abundances in the sample. Usually, columns represent samples and rows stand for OTUs or SVs.
Library Size	The total number of observed abundances for all taxa in a sample.
Microbial Load	The total number of (unobserved) absolute abundances for all taxa in a unit volume of an ecosystem.

Open in a new tab

Table 2.

Summary of notations.

Notation	Description
m	Total number of taxa.
n	Total number of samples.
p	Total number of covariates.
i	Taxon index, i = 1, 2, …, m.
j	Sample index, j = 1, 2, …, n.
k	Index of covariates, k = 1, 2, …, p.
x_j	Covariates of interest for the jth sample. $x_{j} = {(x_{j 1}, \dots, x_{jp})}^{T}$ .
A_ij^a	Unobserved absolute abundance of ith taxon in a unit volume of ecosystem of jth sample.
A_⋅j^a	Microbial load in a unit volume of ecosystem of jth sample. $A_{\cdot j} = \sum_{i = 1}^{m} A_{ij}$ .
γ_ij^a	Unobserved absolute relative abundance of ith taxon in a unit volume of ecosystem of jth sample.
O_ij^a	Observed abundance of ith taxon in a random specimen taken from a unit volume of ecosystem of jth sample.
O_⋅j^a	Library size of a random specimen taken from a unit volume of ecosystem of jth sample. $O_{\cdot j} = \sum_{i = 1}^{m} O_{ij}$ .
r_ij^a	Observed relative abundance of ith taxon in a random specimen taken from a unit volume of ecosystem of jth sample.
c_j^b	For the jth sample, c_j represents the proportion of its ecosystem (unobserved absolute abundance) in a random specimen (observed abundance), thus $c_{j} = \frac{E (O_{ij} ∣ A_{ij})}{A_{ij}}$ . We shall refer to this constant as “sampling fraction”.
y_ij^a	$\log (O_{ij})$ .
d_j^b	$R e p r e s e n t s t h e e f f e c t o f t h e s c a l i n g p a r a m e t e r c_{j} i n l o g - s c a l e$

Open in a new tab

^aRandom variable.

^bParameter.

We define a taxon to be differentially abundant between two ecosystems if its mean absolute abundance is different between two ecosystems. It is important to distinguish between absolute and relative abundances of taxa in a unit volume of an ecosystem. The choice of parameter for statistical analysis is important and needs to be clearly stated. Often researchers are interested in identifying taxa that are different in mean absolute abundance per unit volume between two or more ecosystems⁸. The mean absolute abundance may not be the only criterion of interest. Researchers may consider other criteria such as differential ranking⁸. Furthermore, there are instances such as niche apportionment, where researchers are interested in identifying taxa that are different in mean absolute relative abundance per unit volume between two or more ecosystems. Thus, the choice of statistical parameter depends upon the scientific question of interest.

For each taxon i within sample j, the sampling fraction is the ratio of the expected abundance of taxon i within the jth sample to its absolute abundance in a unit volume of an ecosystem (e.g. gut) where the sample was derived from. The sampling fraction is assumed to be constant for all taxa within the jth sample. Thus the sampling fraction for the jth sample is given by the following expression involving the conditional expectation of the observed abundance O_ij given the unobservable absolute abundance A_ij.

Definition 0.1 (Sampling fraction).

c_{j} = \frac{E (O_{ij} ∣ A_{ij})}{A_{ij}},

where

O_ij is the observed abundance of ith taxon in jth sample,
A_ij is the unobserved absolute abundance of ith taxon in the ecosystem of jth sample,
c_j is the sample-specific sampling fraction.

The problem underlying the differential abundance (DA) analysis of microbiome data is that while O_ij is known, c_j is unknown and can vary drastically from sample to sample. Consequently, the observed abundances are not comparable between samples. The goal of DA analysis described in this paper is to identify taxa whose mean absolute abundances, per unit volume, of an ecosystem are significantly different with changes in the covariate of interest (e.g. study groups).

Similar to the toy example in ref. ²⁰, Fig. 1 is a toy example consisting of ecosystems of three subjects A, B, and C with each having two taxa, the blue and red taxa varieties. A false negative may occur when comparing the ecosystems of A and B. Clearly, the true absolute abundance of each taxon is 50% more in subject B’s ecosystem as compared to subject A’s. However, they each have the same library size (4 each) in their respective samples (e.g. stool samples). Without considering the differential sampling fractions, one would falsely conclude that none of the taxa are differentially abundant in the two ecosystems. This erroneous conclusion would be avoided if one recognizes that we have a larger sampling fraction in the sample obtained from A’s ecosystem than from B’s ( $\frac{1}{2}$ vs. $\frac{1}{3}$ ). Similarly, we get a false positive result when comparing ecosystems of A and C. In their ecosystems, blue is more abundant in C than in A (12 vs. 4), and both have the same amounts of red taxa (4 vs. 4). However, given that samples from A and C have same the library size, one may mistakenly conclude that both blue (2 vs. 3) and red taxa (2 vs. 1) are differentially abundant between A and C.

An important characteristic of a feature table is that it is typically sparse, sometimes as many as ~90% are zero entries²¹, which creates a challenge for analyzing rare taxa. A quick and simple strategy to deal with excess zeros is to add a small positive constant (e.g. 1) called pseudo-count^14,22 to each cell of the feature table. The addition of a pseudo-count becomes necessary when using methods of analysis that require log transformation of the observed counts. Even though adding a pseudo-count is simple and widely used, the choice of the pseudo-count is ad hoc. Studies have shown that differential abundance or clustering results could be sensitive to the choice of pseudo count^23,24. Although different values of pseudo counts have been discussed in the literature^23–26, to the best of our knowledge, there is no consensus on how to choose the optimal value. Other strategies involve modeling zero counts by some probability models^21,27. However, these methods may not be valid if the underlying assumptions do not hold. Instead of modeling zeros by parametric distributions, ANCOM-II²⁸ attempts to provide a general framework to classify and identify zeros into three different types, which includes outlier zeros caused by some extraneous reasons such as the wrong data entry, structural zeros because of the nature of the study groups, i.e. some bacteria are not expected to belong to certain environments (e.g. a desert) but in others (e.g. a rain forest), and sampling zeros owing to insufficient library size. In our opinion, the zero counts problem is still an open problem and requires further investigation.

Normalization methods

As we described intuitively in the introduction, an important obstacle for performing DA analysis is the unknown sampling fraction corresponding to each sample. It is critical to normalize the data to eliminate any bias due to differences in the sampling fractions. Thus, the primary objective of normalization is to transform the observed data so that expected differences in the mean absolute abundances between two ecosystems is not confounded by the differences in the sampling fractions. Failure to normalize the data will result in a systematic bias that increases the false discovery rate (FDR) and also possible loss of power in some cases.

Rarefying

A traditional microbiome analysis workflow often involves rarefying^29–31, or subsampling to a given depth, a practice in the field of ecology long before its use in microbiome surveys³². Samples are rarefied to deal with differences in library sizes. Note that the terms rarefying and rarefaction are used interchangeably in microbiome literature³³. Rarefying was first recommended for microbiome data to deal with rare taxa³⁴, which impact some measures of alpha and beta diversities³³. Generally, the rarefying process includes the following steps:

Determine the minimum library size ( $O_{\min}$ ). Samples with library sizes smaller than $O_{\min}$ will be discarded,
Subsample taxa without replacement so that all samples have the same library size $O_{\min}$ .

One way to select the minimum library size is to create rarefaction curves³⁵. Rarefaction curves represent diversity as a function of library size (Fig. 2). If lines of the plot appear to “level out” (i.e., approach a slope of zero) at certain library size along the x-axis, it indicates the diversity of the samples has been fully observed; otherwise, increasing the minimum library size would result in additional features. Originally, rarefaction curves were based on alpha diversities^35,36. However, lately researchers have considered beta diversities^37,38 as well. Although rarefying is well established and widely used in practice, in recent years there has been some discussion on the effects of rarefying on statistical tests for differential abundance analysis^33,39,40. Some concerns discussed in the literature include:

The omission of available valid data,
The introduction of artificial uncertainty in the sub-sampling step,
The arbitrary selection of the minimum library size,
Challenges in estimating over-dispersion parameter.

Fig. 2 — The number of genera is 130, and the sample size is 222 (African American = 123, Native African = 99). x axis denotes the library size, and y axis represents the corresponding alpha diversity. Data are presented as mean values ± standard error (SE). It shows that regardless of the choice of diversity measures, as the increase of library size, the rarefaction curve starts to “level out” suggesting that the diversity of the samples has been fully observed.

Scaling

Scaling is another popular method used for normalizing microbiome data. The basic idea is to divide the observed abundance in the feature table by a “scaling factor” or “normalization factor” to eliminate biases resulting from unequal sampling fractions. More precisely, scaling is defined as follows.

Definition 0.2 (Scaling microbiome data).

{\tilde{O}}_{ij} = \frac{O_{ij}}{s_{j}},

where

${\tilde{O}}_{ij}$ is the normalized observed abundance for taxon i within sample j,
s_j is the scaling/normalization factor for sample j.

Comparing with the definition of sampling fraction (Eq. (1)), it is clear that an ideal scaling method should have scaling factor close to the unknown sampling fraction c_j, i.e. s_j ≈ c_j; or is approximately proportional to c_j, i.e. s_j ≈ c_j × c for all j, where c is a constant.

Some commonly used normalization methods include Cumulative-Sum Scaling (CSS) implemented in metagenomeSeq²¹, Median (MED) in DESeq2⁴¹, Upper Quartile (UQ)⁴² and Trimmed Mean of M-values (TMM)⁴³ in edgeR⁴⁴ and Wrench⁴⁵, and Total-Sum Scaling (TSS) which simply transforms the abundance table (feature table) into relative abundance table, i.e. scale by each sample’s library size. The authors of the user manual of edgeR⁴⁶ state that to deal with the “RNA composition” effect, one should multiply the normalization factors with the corresponding library size to account for “effective library size”. Hence, Lin and Peddada²⁰ also considered modified versions of UQ and TMM, denoted by “ELib-UQ” (Effective library size using UQ) and “ELib-TMM” (Effective library size using TMM) in their simulation studies. Since the literature is often not explicit regarding the mathematical formulas used by various methods, we provide some useful formulas in Table 3.

Table 3.

Summary of different normalization methods.

Method	Sampling fraction estimate
ANCOM-BC	$\log ({\hat{c}}_{j}^{ANCOM - BC}) = \frac{1}{m} \sum_{i = 1}^{m} (y_{ij} - x_{j}^{T} {\hat{β}}_{i})$
CSS	${\hat{c}}_{j}^{CSS} = \frac{s_{j}^{\hat{l}} + 1}{N}$
MED	${\hat{c}}_{j}^{MED} = {median}_{i : O_{i}^{R} \neq 0} \frac{O_{ij}}{O_{i}^{R}}$
UQ	${\hat{c}}_{j}^{UQ} = {UQ}_{i : O_{ij} > 0} (\frac{O_{ij}}{O_{\cdot j}})$
TMM	$\log_{2} ({\hat{c}}_{j}^{TMM}) = \frac{\sum_{i \in G } w_{ij} M_{ij}}{\sum_{i \in G } w_{ij}}$
Elib-UQ	${\hat{c}}_{j}^{Elib - UQ} = O_{\cdot j} {\hat{c}}_{j}^{UQ}$
Elib-TMM	${\hat{c}}_{j}^{Elib - TMM} = O_{\cdot j} {\hat{c}}_{j}^{TMM}$
Wrench	${\hat{c}}_{j}^{Wrench} = \frac{1}{m} \sum_{i = 1}^{m} b_{ij} \frac{r_{ij}}{{\bar{r}}_{i \cdot}}$
TSS	${\hat{c}}_{j}^{TSS} = O_{\cdot j}$

Open in a new tab

${\hat{β}}_{i}$ is obtained from ANCOM-BC algorithm.

N = an approximately chosen normalization constant.

$s_{j}^{\hat{l}} = \sum_{i : O_{ij} \leq q_{j}^{\hat{l}}} O_{ij}$ .

$q_{j}^{\hat{l}} = {\hat{l}}^{th}$ quantile of sample j.

$O_{i}^{R} = {(\prod_{j = 1}^{n} O_{ij})}^{\frac{1}{n}}$ .

UQ(X) denotes the upper quartile of X.

$M_{ij} = \log_{2} (\frac{O_{ij}}{O_{\cdot j}}) - \log_{2} (\frac{O_{i j^{'}}}{O_{\cdot j^{'}}})$ , where $j^{'}$ is the reference sample.

$w_{ij} = \frac{O_{\cdot j} - O_{ij}}{O_{\cdot j} O_{ij}} + \frac{O_{\cdot j^{'}} - O_{i j^{'}}}{O_{\cdot j^{'}} O_{i j^{'}}}$ , where $j^{'}$ is the reference sample.

G* represents a set of taxa that were not considered as extreme data for fold-change (M values) and average intensity (A values). Refer to Robinson and Oshlack⁴³ for details.

b_ij represents the taxon-specific weight. Refer to Kumar et al.⁴⁵ for details.

TSS is known to have a bias in differential abundance estimates^33,39,42,47 since a few preferentially sampled measurements (e.g. taxa, genes) will have an undue influence on the relative abundance data. Change in the abundance of a single taxon can alter the relative abundances of all taxa. Generally, the FDR generated from TSS-based analyses is unacceptably large. The CSS²¹ in metagenomeSeq modifies TSS in a sample-specific manner to reduce biases resulting from preferentially sampled taxa. CSS assumes that observed abundances of samples should be roughly independent and identically distributed up to a specific quantile l. Thus, instead of normalizing each sample by its library size (which is also known as total sum), CSS selects the scaling factor to be the cumulative sum of observed abundances for each sample up to the lth quantile. This quantile is determined adaptively in a data-driven way, which relies on the change point of the distribution of cumulative sum switching from stability to instability. The Median normalization (MED) method used in DESeq2⁴¹ assumes that the taxon of median absolute abundance is not differentially abundant. Although it may be a valid assumption in gene expression studies where a large proportion of genes are not differentially expressed, it may not be a valid assumption in microbiome studies. Depending upon the application, a very large proportion of taxa may be differentially abundant between two or more study groups, especially when the data are analyzed at higher taxonomic classification levels (e.g. phylum, order, etc.). The Upper Quartile normalization (UQ) and the TMM used in edgeR have similar issues as MED in DESeq2. UQ assumes that the upper quartile of the observed abundances for each library is able to capture the invariant segment of the count distribution. However, choosing the most effective quantile is nontrivial^{21,42,44,47–49}. Similar to MED, TMM is based on the hypothesis that most taxa are not differentially abundant. The scaling factor is calculated using a weighted trimmed mean of log abundance ratios by first trimming (by default) the taxa belong to upper and lower 30% M values (taxon-wise log-fold-change) or 5% A values (abundance level). Wrench⁴⁵ assumes that the observed abundances are from a hurdle Log-Gaussian distribution. A robust location estimate of the Gaussian distribution leads to the desired scaling factor for each sample. However, Wrench currently implements strategies for categorical variable only, and the estimated scaling factor is essentially the average of ratios of relative abundances across taxa, which implicitly requires that a large proportion of taxa do not change across study groups, or the effect sizes of differentially abundant taxa are not too large.

One must exercise caution when using scaling methods. Most importantly, a scaling method is likely to overestimate or underestimate the fraction of zero counts depending on the corresponding library size of each sample^49,50. This problem becomes more obvious for microbiome data since its feature table is typically sparse.

Recently a new method called Analysis of Compositions of Microbiome with Bias Correction (ANCOM-BC) was introduced by Lin and Peddada²⁰ to address the problem of unequal sampling fractions. ANCOM-BC assumes that the observed abundance in a feature table is, in expectation, proportional to the unobservable absolute abundance of a taxon in a unit volume of the ecosystem. This proportion is defined as the sampling fraction and is allowed to vary from sample to sample. ANCOM-BC accounts for sampling fraction by introducing a sample-specific offset term in a linear regression model that is estimated from the observed abundance data. The offset term serves as the bias correction. Statistical properties of this approach have also been discussed in²⁰.

Extensive simulation studies using Poisson-Gamma model as well as some based on real data, were performed in²⁰ to evaluate the performance of various normalization methods. Results reported in Fig. 3 of this article are similar to those provided in²⁰, but in the present simulation study we have three groups, which are denoted by G₁, G₂, and G₃ (see Supplementary Information for simulation settings). We compared all normalization methods using the centered residuals between true and estimated sampling fractions in log scale.

Fig. 3 — In the box plot, the lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). The median is represented by a solid line within the box. The upper whisker extends from the hinge to the largest value (maxima) no further than 1.5 times Interquartile Range (IQR, distance between the first and third quartiles) from the hinge, the lower whisker extends from the hinge to the smallest value (minima) at most 1.5 times IQR of the hinge. Data beyond the end of the whiskers are called “outlying” points. N = 90 samples examined over three study groups (denoted by circle, cross, and triangle, with 30 samples per group) and the data points are overlaid in each box. Each facet title indicates the normalization method and its variance is provided within parenthesis. The microbial absolute abundances in the ecosystem are generated from the log-normal distribution. By comparing residuals across different groups, an ideal box-plot should display a narrow height (i.e. smaller variability) and samples from different groups should be inter-mixed and not display any systematic separation. We note that all existing methods have larger variances compared to ANCOM-BC, and TSS has the largest variance. Except ANCOM-BC, UQ, and TMM, we see from the plot that circles, cross, and triangles are systematically separated, which indicates that ELib-UQ, ELib-TMM, CSS, MED, and TSS do not account for systematic bias due to differences in sampling fractions across groups.

Definition 0.3 (Centered Residual).

h_{j} = d_{j} - t_{j} - \frac{1}{n} \sum_{j = 1}^{n} (d_{j} - t_{j})

where

dj (see Table 2)
$t_{j} = \log s_{j}$ .

As noted at the beginning of this subsection, for each sample j, a reasonable scaling method should estimate scaling factors close to the true sampling fractions with possibly a constant shift across all samples. Not all scaling methods are expected to achieve this goal since many normalization methods were proposed solely to address the differences in library sizes (e.g. TSS). Failure to correct for differences in sampling fractions would lead to undesirable systematic bias in the test statistic, which can be identified by fitting a simple linear regression between centered residual h_j and the covariate of interest, such as x_jk (e.g. study groups):

h_{j} = α_{0} + α_{1} x_{jk} + e_{j} .

The existence of systematic bias due to differences in sampling fractions may be determined by testing the null hypothesis H₀: α₁ = 0 against the alternative H₁: α₁ ≠ 0 or simply by drawing box plots of the centered residuals, as commonly done in linear regression diagnostics (Fig. 3). For an ideal normalization method, the box plot should display no pattern with respect to the covariate of interest, and the centered residuals should be randomly distributed around 0. As can be seen in the box plots provided in Fig. 3, except for ANCOM-BC, UQ, and TMM methods, for all other methods the groups G₁, G₂, and G₃ cluster separately, indicating that in the estimation of sampling fractions, scaling factors estimated by these methods (with the exception of ANCOM-BC, UQ, and TMM) systematically differ by group labels. Furthermore, the box plot of ANCOM-BC had the shortest width, suggesting that it not only successfully estimates the true sampling fractions and eliminates bias due to its variability, but it also has the smallest variance which is not the case with other methods. This has a direct effect on the type I error and FDR as seen later in this paper and demonstrated in²⁰.

Log-ratio based methods

As an alternative to the above class of methods, several methods have been proposed in the literature that are inspired by Aitchison’s methodology for compositional data. These methods do not explicitly perform normalization such as the ones described above, since they convert the observed abundances to log-ratios within each sample. Thus, within each sample, by taking log-ratios of all taxa with respect to some common reference taxon or some suitable function of all taxa, these methods are intrinsically eliminating the effect of the sampling fraction. This class of methods include DR⁸, ANCOM¹⁴, and ALDEx2⁵¹. ALDEx2 uses a pre-specified taxon as a reference taxon and transforms the observed abundances to log ratios of the observed abundance each taxon relative to the reference taxon. Such a log-transformation of observed abundance data is called the additive log transformation (alr). Mathematically, it is defined as follows:

Definition 0.4 (additive log-ratio transformation (alr)¹⁹, $S^{m} {\to R}^{m - 1}$ ).

alr (O_{j}) = [\log (\frac{O_{1 j}}{O_{i^{'} j}}), \dots, \log (\frac{O_{m j}}{O_{i^{'} j}})] .

Thus, the alr transformation converts observed m dimensional observed abundance vector, representing the m taxa, that are in a simplex (i.e. sum to a constant), to a m − 1 dimensional data in the Euclidean space. A challenge with alr, and hence ALDEx2, is that the user needs to pre-specify the reference taxon. While this might be easy to do in some applications, it is generally a challenge when the number of taxa m is large such as when we are interested in performing DA analysis at the genus level. Although ANCOM is also based on alr transformation, it overcomes the above deficiency because it repeatedly applies the alr transformation by taking each of the m taxa to be a reference taxon one at a time. Thus, for each taxon, it performs m − 1 regressions. Hence, it overall fits m(m − 1) regression models.

To avoid the above challenges due to alr transformation, rather than using a pre-specified taxon as the reference taxon, one may consider the center of mass of all taxa as the reference. Thus, within each sample, for each taxon, the log-ratios are computed relative to the geometric mean of all taxa. This transformation is called the clr transformation. More precisely, it is defined as follows:

Definition 0.5 (centered log-ratio transformation (clr)¹⁹, $S^{m} {\to U}^{m}$ ).

clr (O_{j}) = [\log (\frac{O_{1 j}}{g (O_{j})}), \dots, \log (\frac{O_{mj}}{g (O_{j})})],

where

g(x) is the geometric mean of x,
U^m = {(u₁, …, u_m) ϵ R^m: u₁ + … + u_m = 0} is a hyperplane in $R^{m}$ .

This transformation to a real space again makes the implementation of unconstrained statistical methods possible. clr transformation is an isometry, but sum of the transformed values equals to 0, leading to a degenerate distribution.

The alr transformation is not isometric and clr is not an isomorphism. The isometric log-ratio transformation (ilr)²⁵ (also known as balance) is both an isomorphism and an isometry, and consequently orthonormal coordinates can be defined using this transformation.

Definition 0.6 (isometric log-ratio transformation (ilr), $S^{m} {\to R}^{m - 1}$ ).

ilr (O_{j}) = clr (O_{j}) Ψ^{T},

where Ψ is a (m − 1, m) orthonormal basis.

There are multiple ways to construct orthonormal bases. Typically, if a bifurcating tree is given then we can construct a basis from the internal nodes in the tree. Each element in the ilr transformed data is of the following form:

b_{l} = \sqrt{\frac{∣ l_{L} ∣ ∣ l_{R} ∣}{∣ l_{L} ∣ + ∣ l_{R} ∣}} \log [\frac{g (l_{L})}{g (l_{R})}],

where

b_l is the balance at internal node l,
l_L is the set of relative abundances contained in the left subtree at internal node l,
l_R is the set of relative abundances contained in the right subtree at internal node l,
∣l_L∣ is the number of taxa contained in l_L,
∣l_R∣ is the the number of taxa contained in l_R,
g(x) is the geometric mean of x.

Methods of differential abundance analysis

A number of procedures have been introduced and used in the literature for identifying differentially abundant taxa. One common approach is to apply a nonparametric test (e.g. the Mann–Whitney/Wilcoxon rank-sum test for two sample classes; the Kruskal–Wallis test for multiple sample classes) after normalizing the feature table. Unfortunately, these standard nonparametric tests do not take into account the compositional structure of microbiome data.

RNA-seq based methods: edgeR and DESeq2

As alternatives to standard nonparametric tests, many parametric models have been proposed in the literature based on transcriptomics data, such as the RNA-Seq data, for testing differences across study groups. Among them, DESeq2⁴¹ and edgeR⁴⁴ are two popular methods. These methods model the observed abundances using negative binomial (NB) distribution after normalizing data with corresponding scaling methods to account for differences in sampling fractions. Thus O_ij are modeled using the a negative binomial distribution as follows:

O_{ij} ~ NB (s_{j} μ_{i}, ϕ_{i}),

where

s_j is the scaling factor for sample j,
μ_i is the mean absolute abundance (in ecosystem) for taxon i,
ϕ_i is the dispersion parameter for taxon i.

Introduction of the dispersion parameter ϕ_i is inspired by mean-variance dependence in count data (e.g. RNA-Seq, microbiome data), and recognizing that the variance is typically larger than mean especially when the mean value is large. Thus, the variance of the observed abundance is modeled as follows:

Var (O_{ij}) = s_{j} μ_{i} + ϕ_{i} s_{j}^{2} μ_{i}^{2} .

The NB distribution is more appropriate for modeling these types of count data than the Poisson distribution because it provides greater flexibility in modeling the variance. We remind the readers that by conditioning independent Poisson random variables on the total count results in multinomial distribution^52,53.

The estimation of the dispersion parameter is critical for both edgeR as well as DESeq2. Based on the assumption that taxa with similar observed abundances also share similar variances, edgeR estimates the taxon-wise dispersion by conditional maximum likelihood⁵⁴, and then shrinks the dispersion estimate for each taxon towards a common estimate of taxa with similar observed abundances using an empirical Bayes procedure⁵⁵. Similarly, DESeq2 first estimates the taxon-wise dispersion by maximum likelihood estimation, and then fits the dispersion trend combining all individual estimates, and finally shrinks the taxon-wise dispersion estimates towards the values predicted by the trend curve using an empirical Bayes approach.

While both methods are generally very reasonable and appropriate for gene expression data, they seem to perform poorly for microbiome data. This is largely because, as stated earlier, the normalization methods used by these two methods intrinsically assume that a very small fraction of taxa are differentially abundant. This assumption is not necessarily valid for microbiome data. As a consequence, the test statistics used by these methods are intrinsically biased under the null hypothesis. As demonstrated analytically as well as empirically in Lin and Peddada²⁰, and reproduced here empirically using similar log-normal distribution based simulation settings (Fig. 4, see Supplementary Information for simulation settings), the bias in the test statistic results in inflated FDRs for these methods. What is worse, because of the bias, as the sample size increases, the FDR increases for these methods²⁰. Similar phenomena were reported by Weiss et al.³⁹. When dealing with population studies, it is important to recognize that there is variability within subject and there is variability between subjects in the population. In simple terms, observed abundance of a taxon from a subject may vary from stool sample to stool sample obtained from the same subject. This is within subject variation. Hence when calculating variability in measurements of random subject, one needs to take into account variation within as well as between subjects. This results in over-dispersion³³. While it is important to account for this over-dispersion, it does not correct the intrinsic bias due to differential sampling fractions noted above. RNA-seq inspired methods do not perform well for microbiome data even after correcting for the over-dispersion parameter.

Fig. 4 — The FDR and power of various differential abundance (DA) analyses (two-sided) are shown in a, b, respectively. The number of taxa is set to be 200, and the sample size equals to 60, with 30 samples per group. The microbial absolute abundances in the ecosystem are generated from the log-normal distribution. The y-axis denotes patterns of proportion of differential abundant taxa, ranging from 0.05 to 0.25. The solid vertical line is the 5% nominal level of FDR, and the dashed vertical line denotes 5% nominal level plus one standard error (SE). Legend on the bottom indicates the color for each method and the normalization method is provided within parenthesis. By default, ANCOM-BC implements Holm-Bonferroni⁷⁰ method and other DA methods implement BH procedure⁵⁷ to adjust for multiple comparisons. By detecting differentially abundant taxa between two groups, results show that only ANCOM and ANCOM-BC control the FDR under the nominal level (5%) while maintaining power comparable to other methods. Gaussian model version of metagenomeSeq has highly inflated FDR, while the log Gaussian version has substantial loss of power, sometimes well below 5%.

MetagenomeSeq

Instead of using a negative binomial model, an alternative mixture model based on zero-inflated Gaussian (ZIG) is implemented in metagenomeSeq²¹, where excess zeros due to both sampling zeros and structural zeros are accounted by a probability mass, and the Gaussian distribution modeling the non-zero observed abundances. The framework can be summarized as follows:

\begin{matrix} y_{ij} = \log_{2} (O_{ij} + 1), \\ f_{zig} (y_{ij}, O_{\cdot j}, μ_{i}, σ_{i}^{2}) = π_{j} (O_{\cdot j}) I_{{0}} (y_{ij}) + [1 - π_{j} (O_{\cdot j})] ϕ (y_{ij}, μ_{i}, σ_{i}^{2}), \\ μ_{i} = η_{i} \log_{2} (\frac{s_{j}^{\hat{l}} + 1}{N}) + {β_{i}}^{T} x_{j}, \end{matrix}

where

N is a normalization constant,
$\hat{l}$ is determined by CSS normalization,
$q_{j}^{\hat{l}}$ is the ${\hat{l}}^{th}$ quantile of observed abundances for sample j,
$s_{j}^{\hat{l}} = \sum_{i : O_{ij}} \leq {q_{j}^{\hat{l}}}^{\hat{l}} O_{ij}$ .

However, as shown in our benchmark simulations (Fig. 4) as well as in other previously published simulation studies^14,33,39, although metagenomeSeq has marginally higher powers than most of the other DA methods, it is subject to unreasonably high FDRs even though the observed abundances are normalized by their built-in scaling method (CSS). Furthermore, the problem of FDR inflation gets worse when sample size or the effect size (i.e. fold change of mean absolute abundances) increases^20,39. It is also worth pointing out that metagenomeSeq was the only method, among all parametric models, that increases FDR when applied to rarefied data^33,39. This is likely due to its zero-inflated model which requires the input of precise library sizes to capture the zero proportions.

Note that the authors of metagenomeSeq modified their procedure and recommended replacing zero-inflated Gaussian (ZIG) mixture model by zero-inflated Log-Gaussian (ZILG) mixture model for DA analysis. Although switching to zero-inflated Log-Gaussian distribution improves the FDR control, the procedure becomes extremely conservative, with FDR close to zero and a substantial loss of power in our simulations (Fig. 4) and in ref. ²⁰.

ALDEx2

It is based on the original version of ANOVA-Like Differential Expression (ALDEx) analysis⁵⁶. It was proposed as a compositional data analysis tool that is applicable to three different types of data: RNA-Seq, ChIP-Seq, and 16S rRNA gene sequencing⁵¹. By acknowledging these high-throughput sequencing data are fundamentally compositional, the methodology of ALDEx2 can be summarized as follows:

The observed abundances are converted to relative abundances by Monte Carlo (MC) sampling from the Dirichlet distribution with the addition of a uniform prior. The MC sampling is repeated for K times (K = 128 times by default), thus essentially, for each taxon i in sample j, the observed abundance O_ij is represented by a vector of MC samples of relative abundances ${(r_{ij}^{(1)}, \dots, r_{ij}^{(K)})}^{T}$ ,
Within each sample j and each MC Dirichlet realization k, k = 1, …, K, the relative abundance vector ${(r_{1 j}^{(k)}, \dots, r_{mj}^{(k)})}^{T}$ is clr transformed,
Significance test (Welch’s t-test or Wilcoxon test) is performed on each taxon in the vector of clr transformed values. Since there are a total of K MC Dirichlet samples, each taxon will result in K p-values.
Each resulting p-value is corrected using the B–H⁵⁷ procedure, and the expected adjusted p-value for each taxon is reported by taking the empirical mean of K adjusted p-values.

The ALDEx2 was designed to identify differential abundances of features (genes, taxa, or genomic segments), relative to the geometric mean abundance, between two or more groups. As reported in the simulation study described in this paper (Fig. 4) ALDEx2 not only generally exceeds the nominal level of FDR (5%), but also has substantially smaller power as compared to competing DA methods. Similar results were also reported in Morton et al.⁸.

ANCOM

Analysis of composition of microbiomes (ANCOM)¹⁴ is an alr based methodology, which accounts for the compositional structure of microbiome data. Given a total of m taxa, ANCOM relies on two assumptions as follows.

Assumption 0.1: The mean log absolute abundance (in the ecosystem) of 2 taxa are not different.

Assumption 0.2: The mean log absolute abundance (in the ecosystem) of all m taxa do not differ by the same amount between two study groups. For example, suppose the absolute abundance of m taxa for a subject in group 1 (C-section born babies) are A₁, A₂, …, A_m and suppose the absolute abundance of taxa for a subject in group 2 (vaginally born babies) are B₁, B₂, …, B_m. Then B_i ≠ CA_i, for all i = 1, 2, …, m. Thus, not all taxa are changing by the same constant C.

Note that the first assumption made by ANCOM is substantially weaker than the assumptions made by DESeq2 and edgeR, which require very “few” taxa to be differentially abundant.

Under the above assumptions, together with the fact that ANCOM performs all possible DA analyses by successively using each taxon as a reference taxon, the authors proved that one can test the null hypothesis regarding mean log absolute abundance in a unit volume of an ecosystem using relative abundances.

For the ith taxon and jth sample, ANCOM uses standard ANOVA model formulation:

\log \frac{r_{ij}^{(g)}}{r_{i^{'} j}^{(g)}} = α_{i i^{'}} + β_{i i^{'}}^{(g)} + \sum_{k} x_{jk} β_{i i^{'} k} + ϵ_{i i^{'} j}^{(g)},

where

$i^{'}$ is the reference taxon, $i^{'} \neq i = 1, 2, \dots, m$ ,
g = 1, 2, …, G is the number of study groups.

By virtue of Assumption 0.1 and Assumption 0.2, to test whether a taxon i is differentially abundant according to a factor of interest with G levels, it is equivalent to test:

\begin{matrix} H_{0 (i i')} : β_{i i'}^{(1)} = \dots = β_{i i'}^{(G)} = 0, \\ H_{1 (i i')} : Notall β_{i i'}^{(g)} equalsto 0, \end{matrix}

for every $i \neq i^{'}$ .

P-values from $\frac{m (m - 1)}{2}$ distinct null hypotheses $H_{0 (i i^{'})}$ , $i \neq i^{'}$ are adjusted using a multiple testing correction procedure such as the Benjamini-Hochberg (BH) procedure⁵⁷ or Bonferroni correction^58,59. For each taxon, the number of rejections, denoted by W_i, is counted, and ANCOM makes use of the empirical distribution of {W₁, W₂, …, W_m} to determine the cut-off value of significant taxon. The rule of thumb is, when the value of W_i is large, then it is more likely that taxon i is differentially abundant. The authors recommend using 70th percentile of the W distribution as the empirical cut-off value. However, the ANCOM outputs results from different cutoffs such as the 60th to 90th percentile and lets the user select the threshold of their interest.

As shown in the simulation studies (Fig. 4) as well as in^14,20, using the 70th percentile of W distribution as the cut-off, ANCOM successfully controls the FDR under the nominal level (5%) while maintaining adequate power. However, ANCOM can be computationally intensive since for each taxon, it performs alr transformation using all remaining taxa. The computation time scales up quadratically with the number of taxa. Additionally, the statistical decision made by ANCOM depends on the quantile of its test statistic W, rather than p-values, which some researchers find it difficult to interpret.

DR

Differential Ranking (DR)⁸ exploits the fact that the ranks of relative differentials (i.e. log ratio between absolute relative abundances) are identical to the ranks of absolute differentials (i.e. log ratio between absolute abundances). They estimate relative differentials using a linear regression where relative abundances are alr transformed. The regression coefficients corresponding to different taxa are ranked in order to determine the most important to the least important taxa.

The DR model can be summarized as follows:

\begin{matrix} β_{ik} ~ N (0, μ_{β}), \\ r_{j} = {alr}^{- 1} (β_{i}^{T} x_{j}), \\ A_{j} ~ Multinomial (r_{j}), \end{matrix}

where

x_j is the vector of covariates of interest (e.g. study groups) for the jth sample,
r_j is the vector of observed relative abundances for the jth sample,
A_j is the vector of absolute abundances in the ecosystem for the jth sample.

The model parameters are estimated using a maximum a posteriori priori (MAP) estimation by stochastic gradient descent.

To understand the implementation of the DR procedure, consider a simple example where the true absolute relative abundance is known. Suppose there are only two samples belonging to two groups (e.g. control vs treatment) and the unobserved absolute abundance is linearly related with the group effect in log scale, i.e.:

\log A_{ij} = α_{i 0} + α_{i 1} I (j \in group 1) .

Suppose sample j₁ is in group 1 and sample j₂ is in group 2, then from (Eq. 14) we have

\log A_{i j_{1}} - \log A_{i j_{2}} = α_{i 1} .

Denoting the true absolute relative abundances by γ_ij and $γ_{i^{'} j}$ one can write down the DR model (Eq. 13) as:

\log \frac{γ_{ij}}{γ_{i^{'} j}} = \log \frac{A_{ij}}{A_{i^{'} j}} = β_{i 0} + β_{i 1} .

where $i^{'}$ is the reference taxon. Thus,

\begin{matrix} \log \frac{γ_{i j_{1}}}{γ_{i^{'} j_{1}}} - \log \frac{γ_{i j_{2}}}{γ_{i^{'} j_{2}}} = \log \frac{A_{i j_{1}}}{A_{i^{'} j_{1}}} - \log \frac{A_{i j_{2}}}{A_{i^{'} j_{2}}} \\ = \log A_{i j_{1}} - \log A_{i j_{2}} - (\log A_{i^{'} j_{1}} - \log A_{i^{'} j_{2}}) \\ = β_{i 1} . \end{matrix}

Comparing (Eq. 15) with (Eq. 17), it is clear that although β_i1 ≠ α_i1, due to the bias term $\log A_{i^{'} j_{1}} - \log A_{i^{'} j_{2}}$ . However, since the bias term is constant for taxon i, the rank of β_i1 is same as the rank of α_i1.

Thus, unlike typical DA methods in which the estimated coefficient reflects the change in absolute abundances, the interpretation of DR results requires care because it is based on the ranks. Due to the presence of the microbial load bias ( $\log A_{i^{'} j_{1}} - \log A_{i^{'} j_{2}}$ in the above example), a positive valued coefficient from DR model does not necessarily mean that the absolute abundance has increased. Similarly, a zero valued coefficient does not imply the absolute abundance of the corresponding taxon has not changed. Nevertheless, based on the ranks of coefficients, one can focus on taxa with high or low ranks since they are the ones that are potentially increasing or decreasing the most in absolute abundances relative to other taxa.

Note that since different reference taxon in the alr transformation of DR model will lead to the same result regarding the ranks, DR is robust to the choice of reference taxon.

ANCOM-BC

Analysis of compositions of microbiomes with bias correction (ANCOM-BC)²⁰ models the observed abundances using an offset-based log-linear model.

y_{ij} = d_{j} + {β_{i}}^{T} x_{j} + ϵ_{ij},

where

$y_{ij} = \log O_{ij}$ is the log observed abundance,
dj (see Table 2)

In this set-up, the zero counts are handled using the methodology described in Kaul et al.²⁸. This formulation explicitly tests the hypothesis regarding differential absolute abundance of individual taxon while estimating sample-specific sampling fractions and correcting the bias appropriately. As demonstrated in our simulation studies, ANCOM-BC not only controls the FDR very well, but also competes very well with other methods in terms of power (Fig. 4). Furthermore, unlike any of the existing methods, ANCOM-BC provides valid confidence intervals for differential abundance of individual taxa between two study groups and also provides a valid p-value²⁰. Since it has a linear regression framework, it allows for repeated measurement designs as well as covariate adjustments. ANCOM-BC can also be extended to describe patterns of differential abundance in multiple study groups such as time course or dose-response studies²⁰.

As a benchmark analysis, we also compared significant genera identified by ANCOM-BC, ANCOM, and DR using the global gut microbiota data⁶⁰. This data set consists of 11,905 OTUs obtained from fecal samples of subjects in the USA (n = 317), Malawi (n = 114), and Venezuela (n = 99). We first subdivided the data into two age strata “≤2 years” and “>2 years”. This stratification was performed because it is expected that microbial composition of infants changes when they switch over from breast milk (or formula milk) to solid food⁷. The sample sizes in the two age categories (≤2 years, >2 years) for Malawi (MA), USA (US) and Venezuela (VEN) are (47, 36), (50, 260), and (27, 70), respectively. Note that samples with missing values of age were discarded in the downstream analysis. Without a hard threshold available for DR, as suggested in the original paper⁸, we investigated the highest/lowest ranks of genera by selecting the top 25 and bottom 25 genera in terms of rank order of regression parameter estimates. As seen in Fig. 5, the three methods generally have a large number of overlapping genera, with ANCOM-BC and ANCOM having more taxa in common that are differentially abundant. While implementing ANCOM, we used the 70th percentile of the distribution of W as the cut-off. Note that the DR method was applied with all hyper-parameters of the multinomial model set to their default values in the algorithm which can be further tuned.

Fig. 5 — The global gut microbiota data⁶⁰ is used to make the Venn diagram. The dataset contains 673 genera, with subjects from Malawi (MA, n₁ = 114), USA (US, n₂ = 317), and Venezuela (VEN, n₃ = 99). We compare the absolute abundance of different genera for (1) Subjects who are less than or equal to 2 years old with sample size (MA:US:VEN) = (47:50:27) and (2) Subjects who are greater than 2 years old with sample size (MA:US:VEN) = (36:260:70). ANCOM-BC and ANCOM generally have large overlap of significant genera.

Balance-based methods

A variety of methods have been proposed in the literature that are based on balances described earlier in this paper. Some examples include gneiss¹⁸, phylofactorization^61,62, PhILR⁶³, and selbal⁶⁴. Although the balance-based methods were not explicitly designed for performing formal statistical DA analyses for individual taxon, they are often used for that purpose.

To overcome the challenges posed by the compositional structure of 16S rRNA data for identifying individual differentially abundant taxa, gneiss¹⁸ was developed to identify taxa distribution across different covariates with the help of balances. The balances (Eq. (8))^65,66 are useful to infer meaningful properties of sub-communities. Gneiss aims to associate the effect of parameter of interest to the matrix of balances:

Definition 0.7 (gneiss model).

b_{jl} = β_{l}^{T} x_{j},

where

b_jl represents the balance for sample j at node l,
$β_{l} = {(β_{l 1}, \dots, β_{lp})}^{T}$ represents a vector of coefficients,
$x_{j} = {(x_{j 1}, \dots, x_{jp})}^{T}$ represents the measures for covariates.

Gneiss methodology is very flexible and can be broadly used for determining niches of microbes in various sub-communities. Thus, it is a very useful method for discovering niche differentiation in microbes.

Similar to gneiss, phylofactorization^61,62 is not designed for the DA analysis as defined in this paper, but it focuses on the comparison between clades with a clear phylogenetic interpretation. It is based on a greedy algorithm which sequentially selects edges, instead of nodes or splits in a phylogeny, whose ilr basis element maximizes a pre-specified objective function (e.g. the percentage of variation explained). Therefore, besides comparing sister clades, phylofactorization compares the relative abundances between all other clades.

We illustrate gneiss using the global gut data⁶⁰ discussed earlier in this paper using Malawi (MA, n₁ = 114) and the USA (US, n₂ = 317) data. Gneiss identified different trends among various balances (Fig. 6). For example, balance y0 is detected to increase in US as compared to MA for subjects who are ≤2 years old; It is in a reverse direction for subjects who are >2 years old. One caveat to keep in mind is that the components of balances are not necessarily the same across different data sets. The first balance y0 for the younger generation (age ≤ 2 years old) consists of 642 taxa in the numerator (the left subtree) and 31 taxa in the denominator (the right subtree); On the other hand, y0 for the older group (age >2 years old) has 655 taxa in the numerator and 18 taxa in the denominator. It is important to note gneiss is not designed to infer changes in abundance for each individual taxon, however, it can answer questions such as whether the absolute abundances of taxa in the numerator of y0 on average have increased or decreased as compared to those in the denominator.

Fig. 6 — We subset the dataset to subjects whose nationality is Malawi (MA) and USA (US), with sample size (MA:US) = (47:50) for the group of age ≤2 years old, and (MA:US) = (36:260) for the group of age >2 years old. The columns of the plot represent coefficients, and the rows of the plot represent balances. BH procedure⁵⁷ was applied to correct for multiple comparisons, and coefficients with FDR corrected p-values < 0.05 are discarded.

LEfSe

Linear Discriminant Analysis Effect Size (LEfSe)⁶⁷ is specifically designed for group comparisons of microbiome data with a particular focus on detecting change in relative abundance between two or more groups of samples with biological consistency. Important statistical and computational steps implemented in LEfSe are as follows:

For each taxon, test whether its observed abundances in different groups are differentially distributed using Kruskal–Wallis test.
(Optional, only if subgroups are defined) Discard taxa which are not statistically significant in step 1 (e.g. p-value > 0.05). The pairwise Wilcoxon test is then applied to retain taxa. A taxon is not retained for further consideration if it is not significant in every pairwise comparison (e.g. p-value > 0.05 for at least one pairwise comparison) or if the signs of test statistics are not equal among all comparisons.
After feature selection, a Linear Discriminant Analysis (LDA) model is built with the group label as the dependent variable and observed abundance of taxa selected in above step, subgroup label, and demographic features as independent variables. This model is used to calculate the effect size for each taxon. This effect size serves as the average of each taxon’s variability and discriminatory power.
Finally, the LDA score for each taxon is obtained by computing the logarithm (base 10) of the effect size after being scaled in the [1, 10⁶] interval. The rank for each taxon is assigned based on the corresponding LDA score and further feature selection could be achieved by setting a threshold (e.g. 2.0) for LDA scores.

By its construction, LEfSe method is more a discriminant analysis method rather than a DA method. Unlike the DA analysis methods discussed earlier in this paper, LEfSe is more focused on investigating the relationship among microbial profiles and an outcome or phenotype (Step 3). More precisely, LEfSe tries to quantify the magnitude of the effect size of such associations between microbial profiles (e.g. a set of taxa) and the outcome of interest.

Discussion

Microbiome studies are becoming very popular in biomedical sciences. As new scientific questions emerge, so do new statistical and computational methods of analysis. This is a very rapidly growing area of research with new statistical methods being developed on a regular basis. Hence an up-to-date comprehensive review of the statistical methods in the field is a challenging problem. This is particularly true with methods for DA analysis. A number of methods exist in the literature and each method has its own strengths and weaknesses. One of the challenges in evaluating the performance of various methods is that not all methods are designed to test statistical hypotheses regarding the same parameter. Some methods are designed for testing hypotheses regarding the relative abundance, while others are designed for testing hypothesis regarding absolute abundance. If a simulation study is designed for testing hypothesis regarding absolute abundance then methods for relative abundance parameter may show an inflated FDR and vise versa. A related problem is that often researchers use the terms “relative abundance” and “absolute abundance in a unit volume” interchangeably. This makes the simulation studies difficult to interpret. Therefore journals and researchers should make the terminology precise. In this paper, simulation studies were set-up to compare FDR and power of various methods when testing hypotheses regarding absolute abundance of taxa in a unit volume of a tissue.

We performed simulation studies using the log-normal distribution for modeling abundances. Consistent with the findings of²⁰, ANCOM and ANCOM-BC control the FDR at the desired nominal level for most configurations while competing well with all procedures in terms of the overall power. The only situations where ANCOM as well as ANCOM-BC fail to control FDR is when the sample sizes are very small, such as <10²⁰. All other methods considered in this paper tend to inflate FDR for all sample sizes and their FDR gets worse with the sample size increases²⁰. This is because, under the null hypothesis, each of these methods is biased away from zero. This bias increases with sample size. Hence the FDR increases with sample size.

While ANCOM and ANCOM-BC have very similar operating characteristics in terms of FDR and power, ANCOM-BC is computationally simpler and faster to implement because unlike ANCOM it requires only m linear regression fits rather than $\frac{m \times (m - 1)}{2}$ models fits needed by ANCOM. Secondly, unlike ANCOM, ANCOM-BC provides individual p-values and confidence intervals of pairwise difference in mean abundance for each taxon. Among the methods available today, ANCOM-BC is the only procedure that provides valid p-values and confidence intervals. Furthermore, since ANCOM-BC is based on a regression model framework, it can easily be extended to repeated measures/longitudinal data covariate adjustments.

Supplementary information

Supplementary Information^{(106.2KB, pdf)}

Acknowledgements

This research was funded by the Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA.

Author contributions

This research work was conceived by S.D.P. All numerical calculations were performed by H.L. Both authors contributed equally in writing the manuscript.

Data availability

DNA sequences from the global gut microbiota study⁶⁰ can be found in MG-RAST https://www.mg-rast.org/index.html server under search string “mgp401” for Illumina V4-16S rRNA; feature table, metadata, and taxonomy of the diet swap data⁶⁸ is available in the microbiome⁶⁹ R package http://microbiome.github.com/microbiome.

Code availability

All datasets and analysis scripts can be found under https://github.com/FrederickHuangLin/Microbiome-Review-Code-Archive.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information is available for this paper at 10.1038/s41522-020-00160-w.

References

1.Tierney BT, et al. The landscape of genetic content in the gut and oral human microbiome. Cell Host Microbe. 2019;26:283–295. doi: 10.1016/j.chom.2019.07.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.O’Hara AM, Shanahan F. The gut flora as a forgotten organ. EMBO Rep. 2006;7:688–693. doi: 10.1038/sj.embor.7400731. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Relman DA, Falkow S. The meaning and impact of the human genome sequence for microbiology. Trends Microbiol. 2001;9:206–208. doi: 10.1016/S0966-842X(01)02041-8. [DOI] [PubMed] [Google Scholar]
4.Hurst GD. Extended genomes: symbiosis and evolution. Interface Focus. 2017;7:20170001. doi: 10.1098/rsfs.2017.0001. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Turnbaugh PJ, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457:480. doi: 10.1038/nature07540. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Gevers D, et al. The treatment-naive microbiome in new-onset crohn?s disease. Cell Host Microbe. 2014;15:382–392. doi: 10.1016/j.chom.2014.02.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Lozupone CA, et al. Alterations in the gut microbiota associated with hiv-1 infection. Cell Host Microbe. 2013;14:329–339. doi: 10.1016/j.chom.2013.08.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Morton JT, et al. Establishing microbial composition measurement standards with reference frames. Nat. Commun. 2019;10:2719. doi: 10.1038/s41467-019-10656-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Schloss PD. The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16s rrna gene-based studies. PLoS Comput. Biol. 2010;6:e1000844. doi: 10.1371/journal.pcbi.1000844. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Edgar RC. Uparse: highly accurate otu sequences from microbial amplicon reads. Nat. Methods. 2013;10:996. doi: 10.1038/nmeth.2604. [DOI] [PubMed] [Google Scholar]
11.Callahan BJ, et al. Dada2: high-resolution sample inference from illumina amplicon data. Nat. Methods. 2016;13:581. doi: 10.1038/nmeth.3869. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Amir A, et al. Deblur rapidly resolves single-nucleotide community sequence patterns. MSystems. 2017;2:e00191–16. doi: 10.1128/mSystems.00191-16. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Bolyen E, et al. Reproducible, interactive, scalable and extensible microbiome data science using qiime 2. Nat. Biotechnol. 2019;37:852–857. doi: 10.1038/s41587-019-0209-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Mandal S, et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 2015;26:27663. doi: 10.3402/mehd.v26.27663. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Gloor GB, Reid G. Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. Can. J. Microbiol. 2016;62:692–703. doi: 10.1139/cjm-2015-0821. [DOI] [PubMed] [Google Scholar]
16.Gloor GB, Wu JR, Pawlowsky-Glahn V, Egozcue JJ. It’s all relative: analyzing microbiome data as compositions. Ann. Epidemiol. 2016;26:322–329. doi: 10.1016/j.annepidem.2016.03.003. [DOI] [PubMed] [Google Scholar]
17.Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not optional. Front. Microbiol. 2017;8:2224. doi: 10.3389/fmicb.2017.02224. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Morton JT, et al. Balance trees reveal microbial niche differentiation. MSystems. 2017;2:e00162–16. doi: 10.1128/mSystems.00162-16. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Aitchison, J. The statistical analysis of compositional data. J. Royal Stat. Soc. Ser. B. 139–177 (1982).
20.Lin H, Peddada SD. Analysis of compositions of microbiomes with bias correction. Nat. Commun. 2020;11:1–11. doi: 10.1038/s41467-020-17041-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial marker-gene surveys. Nat. Methods. 2013;10:1200. doi: 10.1038/nmeth.2658. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Xia F, Chen J, Fung WK, Li H. A logistic normal multinomial regression model for microbiome compositional data analysis. Biometrics. 2013;69:1053–1063. doi: 10.1111/biom.12079. [DOI] [PubMed] [Google Scholar]
23.Costea PI, Zeller G, Sunagawa S, Bork P. A fair comparison. Nat. Methods. 2014;11:359. doi: 10.1038/nmeth.2897. [DOI] [PubMed] [Google Scholar]
24.Paulson JN, Bravo HC, Pop M. Reply to:" a fair comparison". Nat. Methods. 2014;11:359. doi: 10.1038/nmeth.2898. [DOI] [PubMed] [Google Scholar]
25.Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barcelo-Vidal C. Isometric logratio transformations for compositional data analysis. Math. Geol. 2003;35:279–300. doi: 10.1023/A:1023818214614. [DOI] [Google Scholar]
26.Greenacre M. Measuring subcompositional incoherence. Math. Geosci. 2011;43:681–693. doi: 10.1007/s11004-011-9338-5. [DOI] [Google Scholar]
27.Chen EZ, Li H. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics. 2016;32:2611–2617. doi: 10.1093/bioinformatics/btw308. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Kaul A, Mandal S, Davidov O, Peddada SD. Analysis of microbiome data in the presence of excess zeros. Front. Microbiol. 2017;8:2114. doi: 10.3389/fmicb.2017.02114. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Navas-Molina, J. A. et al. Advancing our understanding of the human microbiome using qiime. In Methods in Enzymology, Vol. 531, 371–444 (Elsevier, 2013). [DOI] [PMC free article] [PubMed]
30.Hughes JB, Hellmann JJ. The application of rarefaction techniques to molecular inventories of microbial diversity. Methods Enzymol. 2005;397:292–308. doi: 10.1016/S0076-6879(05)97017-1. [DOI] [PubMed] [Google Scholar]
31.Koren O, et al. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Comput. Biol. 2013;9:e1002863. doi: 10.1371/journal.pcbi.1002863. [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Gotelli NJ, Colwell RK. Estimating species richness. Biol. Divers. Front. Meas. Assess. 2011;12:39–54. [Google Scholar]
33.McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput. Biol. 2014;10:e1003531. doi: 10.1371/journal.pcbi.1003531. [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R. Unifrac: an effective distance metric for microbial community comparison. ISME J. 2011;5:169. doi: 10.1038/ismej.2010.133. [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Gotelli NJ, Colwell RK. Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecol. Lett. 2001;4:379–391. doi: 10.1046/j.1461-0248.2001.00230.x. [DOI] [Google Scholar]
36.Brewer A, Williamson M. A new relationship for rarefaction. Biodivers. Conserv. 1994;3:373–379. doi: 10.1007/BF00056509. [DOI] [Google Scholar]
37.Horner-Devine MC, Lage M, Hughes JB, Bohannan BJ. A taxa–area relationship for bacteria. Nature. 2004;432:750. doi: 10.1038/nature03073. [DOI] [PubMed] [Google Scholar]
38.Jernvall J, Wright PC. Diversity components of impending primate extinctions. Proc. Natl Acad. Sci. USA. 1998;95:11279–11283. doi: 10.1073/pnas.95.19.11279. [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Weiss S, et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017;5:27. doi: 10.1186/s40168-017-0237-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Beule L, Karlovsky P. Improved normalization of species count data in ecology by scaling with ranked subsampling (srs): application to microbial communities. PeerJ. 2020;8:e9593. doi: 10.7717/peerj.9593. [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mrna-seq experiments. BMC Bioinform. 2010;11:94. doi: 10.1186/1471-2105-11-94. [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of rna-seq data. Genome Biol. 2010;11:R25. doi: 10.1186/gb-2010-11-3-r25. [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Robinson MD, McCarthy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616. [DOI] [PMC free article] [PubMed] [Google Scholar]
45.Kumar MS, et al. Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics. 2018;19:799. doi: 10.1186/s12864-018-5160-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Chen, Y., McCarthy, D., Robinson, M. & Smyth, G. K. edger: differential expression analysis of digital gene expression data user’s guide. http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf (2014). [DOI] [PMC free article] [PubMed]
47.Dillies M-A, et al. A comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. Brief. Bioinforma. 2013;14:671–683. doi: 10.1093/bib/bbs046. [DOI] [PubMed] [Google Scholar]
48.Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106. [DOI] [PMC free article] [PubMed] [Google Scholar]
49.Agresti A, Hitchcock DB. Bayesian inference for categorical data analysis. Stat. Methods Appl. 2005;14:297–330. doi: 10.1007/s10260-005-0121-y. [DOI] [Google Scholar]
50.Friedman J, Alm EJ. Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 2012;8:e1002687. doi: 10.1371/journal.pcbi.1002687. [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Fernandes AD, et al. Unifying the analysis of high-throughput sequencing datasets: characterizing rna-seq, 16s rrna gene sequencing and selective growth experiments by compositional data analysis. Microbiome. 2014;2:15. doi: 10.1186/2049-2618-2-15. [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Steel, G. et al. Relation between poisson and multinomial distributions. https://ecommons.cornell.edu/bitstream/handle/1813/32480/BU-39-M.pdf?sequence=1 (1953).
53.Taddy M. Multinomial inverse regression for text analysis. J. Am. Stat. Assoc. 2013;108:755–770. doi: 10.1080/01621459.2012.734168. [DOI] [Google Scholar]
54.Smyth GK, Verbyla AP. A conditional likelihood approach to residual maximum likelihood estimation in generalized linear models. J. R. Stat. Soc. Ser. B. 1996;58:565–572. [Google Scholar]
55.Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23:2881–2887. doi: 10.1093/bioinformatics/btm453. [DOI] [PubMed] [Google Scholar]
56.Fernandes AD, Macklaim JM, Linn TG, Reid G, Gloor GB. Anova-like differential expression (aldex) analysis for mixed population rna-seq. PLoS ONE. 2013;8:e67019. doi: 10.1371/journal.pone.0067019. [DOI] [PMC free article] [PubMed] [Google Scholar]
57.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc.: Ser. B. 1995;57:289–300. [Google Scholar]
58.Dunn, O. J. Estimation of the means of dependent variables. Annal. Math. Stat. 1095–1111 (1958).
59.Dunn OJ. Multiple comparisons among means. J. Am. Stat. Assoc. 1961;56:52–64. doi: 10.1080/01621459.1961.10482090. [DOI] [Google Scholar]
60.Yatsunenko T, et al. Human gut microbiome viewed across age and geography. Nature. 2012;486:222–227. doi: 10.1038/nature11053. [DOI] [PMC free article] [PubMed] [Google Scholar]
61.Washburne AD, et al. Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets. PeerJ. 2017;5:e2969. doi: 10.7717/peerj.2969. [DOI] [PMC free article] [PubMed] [Google Scholar]
62.Washburne AD, et al. Phylofactorization: a graph partitioning algorithm to identify phylogenetic scales of ecological data. Ecol. Monogr. 2019;89:e01353. doi: 10.1002/ecm.1353. [DOI] [Google Scholar]
63.Silverman JD, Washburne AD, Mukherjee S, David LA. A phylogenetic transform enhances analysis of compositional microbiota data. Elife. 2017;6:e21887. doi: 10.7554/eLife.21887. [DOI] [PMC free article] [PubMed] [Google Scholar]
64.Rivera-Pinto, J. et al. Balances: a new perspective for microbiome analysis. MSystems 3 (2018). [DOI] [PMC free article] [PubMed]
65.Egozcue JJ, Pawlowsky-Glahn V. Groups of parts and their balances in compositional data analysis. Math. Geol. 2005;37:795–828. doi: 10.1007/s11004-005-7381-9. [DOI] [Google Scholar]
66.Pawlowsky-Glahn V, Egozcue JJ. Exploring compositional data with the coda-dendrogram. Austrian J. Stat. 2011;40:103–113. [Google Scholar]
67.Segata N, et al. Metagenomic biomarker discovery and explanation. Genome Biol. 2011;12:R60. doi: 10.1186/gb-2011-12-6-r60. [DOI] [PMC free article] [PubMed] [Google Scholar]
68.O’Keefe SJ, et al. Fat, fibre and cancer risk in african americans and rural africans. Nat. Commun. 2015;6:6342. doi: 10.1038/ncomms7342. [DOI] [PMC free article] [PubMed] [Google Scholar]
69.Lahti, L., Shetty, S., Blake, T. & Salojarvi, J. Tools for microbiome analysis in r. version 2.1.28. https://microbiome.github.io/tutorials/ (2017).
70.Holm, S. A simple sequentially rejective multiple test procedure. Scand J. Stat. 65–70 (1979).

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Information^{(106.2KB, pdf)}

Data Availability Statement

All datasets and analysis scripts can be found under https://github.com/FrederickHuangLin/Microbiome-Review-Code-Archive.

[CR1] 1.Tierney BT, et al. The landscape of genetic content in the gut and oral human microbiome. Cell Host Microbe. 2019;26:283–295. doi: 10.1016/j.chom.2019.07.008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR2] 2.O’Hara AM, Shanahan F. The gut flora as a forgotten organ. EMBO Rep. 2006;7:688–693. doi: 10.1038/sj.embor.7400731. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Relman DA, Falkow S. The meaning and impact of the human genome sequence for microbiology. Trends Microbiol. 2001;9:206–208. doi: 10.1016/S0966-842X(01)02041-8. [DOI] [PubMed] [Google Scholar]

[CR4] 4.Hurst GD. Extended genomes: symbiosis and evolution. Interface Focus. 2017;7:20170001. doi: 10.1098/rsfs.2017.0001. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Turnbaugh PJ, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457:480. doi: 10.1038/nature07540. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR6] 6.Gevers D, et al. The treatment-naive microbiome in new-onset crohn?s disease. Cell Host Microbe. 2014;15:382–392. doi: 10.1016/j.chom.2014.02.005. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR7] 7.Lozupone CA, et al. Alterations in the gut microbiota associated with hiv-1 infection. Cell Host Microbe. 2013;14:329–339. doi: 10.1016/j.chom.2013.08.006. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR8] 8.Morton JT, et al. Establishing microbial composition measurement standards with reference frames. Nat. Commun. 2019;10:2719. doi: 10.1038/s41467-019-10656-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Schloss PD. The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16s rrna gene-based studies. PLoS Comput. Biol. 2010;6:e1000844. doi: 10.1371/journal.pcbi.1000844. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Edgar RC. Uparse: highly accurate otu sequences from microbial amplicon reads. Nat. Methods. 2013;10:996. doi: 10.1038/nmeth.2604. [DOI] [PubMed] [Google Scholar]

[CR11] 11.Callahan BJ, et al. Dada2: high-resolution sample inference from illumina amplicon data. Nat. Methods. 2016;13:581. doi: 10.1038/nmeth.3869. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Amir A, et al. Deblur rapidly resolves single-nucleotide community sequence patterns. MSystems. 2017;2:e00191–16. doi: 10.1128/mSystems.00191-16. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Bolyen E, et al. Reproducible, interactive, scalable and extensible microbiome data science using qiime 2. Nat. Biotechnol. 2019;37:852–857. doi: 10.1038/s41587-019-0209-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Mandal S, et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 2015;26:27663. doi: 10.3402/mehd.v26.27663. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR15] 15.Gloor GB, Reid G. Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data. Can. J. Microbiol. 2016;62:692–703. doi: 10.1139/cjm-2015-0821. [DOI] [PubMed] [Google Scholar]

[CR16] 16.Gloor GB, Wu JR, Pawlowsky-Glahn V, Egozcue JJ. It’s all relative: analyzing microbiome data as compositions. Ann. Epidemiol. 2016;26:322–329. doi: 10.1016/j.annepidem.2016.03.003. [DOI] [PubMed] [Google Scholar]

[CR17] 17.Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not optional. Front. Microbiol. 2017;8:2224. doi: 10.3389/fmicb.2017.02224. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.Morton JT, et al. Balance trees reveal microbial niche differentiation. MSystems. 2017;2:e00162–16. doi: 10.1128/mSystems.00162-16. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR19] 19.Aitchison, J. The statistical analysis of compositional data. J. Royal Stat. Soc. Ser. B. 139–177 (1982).

[CR20] 20.Lin H, Peddada SD. Analysis of compositions of microbiomes with bias correction. Nat. Commun. 2020;11:1–11. doi: 10.1038/s41467-020-17041-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR21] 21.Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial marker-gene surveys. Nat. Methods. 2013;10:1200. doi: 10.1038/nmeth.2658. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR22] 22.Xia F, Chen J, Fung WK, Li H. A logistic normal multinomial regression model for microbiome compositional data analysis. Biometrics. 2013;69:1053–1063. doi: 10.1111/biom.12079. [DOI] [PubMed] [Google Scholar]

[CR23] 23.Costea PI, Zeller G, Sunagawa S, Bork P. A fair comparison. Nat. Methods. 2014;11:359. doi: 10.1038/nmeth.2897. [DOI] [PubMed] [Google Scholar]

[CR24] 24.Paulson JN, Bravo HC, Pop M. Reply to:" a fair comparison". Nat. Methods. 2014;11:359. doi: 10.1038/nmeth.2898. [DOI] [PubMed] [Google Scholar]

[CR25] 25.Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barcelo-Vidal C. Isometric logratio transformations for compositional data analysis. Math. Geol. 2003;35:279–300. doi: 10.1023/A:1023818214614. [DOI] [Google Scholar]

[CR26] 26.Greenacre M. Measuring subcompositional incoherence. Math. Geosci. 2011;43:681–693. doi: 10.1007/s11004-011-9338-5. [DOI] [Google Scholar]

[CR27] 27.Chen EZ, Li H. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics. 2016;32:2611–2617. doi: 10.1093/bioinformatics/btw308. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR28] 28.Kaul A, Mandal S, Davidov O, Peddada SD. Analysis of microbiome data in the presence of excess zeros. Front. Microbiol. 2017;8:2114. doi: 10.3389/fmicb.2017.02114. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR29] 29.Navas-Molina, J. A. et al. Advancing our understanding of the human microbiome using qiime. In Methods in Enzymology, Vol. 531, 371–444 (Elsevier, 2013). [DOI] [PMC free article] [PubMed]

[CR30] 30.Hughes JB, Hellmann JJ. The application of rarefaction techniques to molecular inventories of microbial diversity. Methods Enzymol. 2005;397:292–308. doi: 10.1016/S0076-6879(05)97017-1. [DOI] [PubMed] [Google Scholar]

[CR31] 31.Koren O, et al. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Comput. Biol. 2013;9:e1002863. doi: 10.1371/journal.pcbi.1002863. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.Gotelli NJ, Colwell RK. Estimating species richness. Biol. Divers. Front. Meas. Assess. 2011;12:39–54. [Google Scholar]

[CR33] 33.McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput. Biol. 2014;10:e1003531. doi: 10.1371/journal.pcbi.1003531. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR34] 34.Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R. Unifrac: an effective distance metric for microbial community comparison. ISME J. 2011;5:169. doi: 10.1038/ismej.2010.133. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR35] 35.Gotelli NJ, Colwell RK. Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecol. Lett. 2001;4:379–391. doi: 10.1046/j.1461-0248.2001.00230.x. [DOI] [Google Scholar]

[CR36] 36.Brewer A, Williamson M. A new relationship for rarefaction. Biodivers. Conserv. 1994;3:373–379. doi: 10.1007/BF00056509. [DOI] [Google Scholar]

[CR37] 37.Horner-Devine MC, Lage M, Hughes JB, Bohannan BJ. A taxa–area relationship for bacteria. Nature. 2004;432:750. doi: 10.1038/nature03073. [DOI] [PubMed] [Google Scholar]

[CR38] 38.Jernvall J, Wright PC. Diversity components of impending primate extinctions. Proc. Natl Acad. Sci. USA. 1998;95:11279–11283. doi: 10.1073/pnas.95.19.11279. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR39] 39.Weiss S, et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome. 2017;5:27. doi: 10.1186/s40168-017-0237-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR40] 40.Beule L, Karlovsky P. Improved normalization of species count data in ecology by scaling with ranked subsampling (srs): application to microbial communities. PeerJ. 2020;8:e9593. doi: 10.7717/peerj.9593. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR41] 41.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR42] 42.Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mrna-seq experiments. BMC Bioinform. 2010;11:94. doi: 10.1186/1471-2105-11-94. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR43] 43.Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of rna-seq data. Genome Biol. 2010;11:R25. doi: 10.1186/gb-2010-11-3-r25. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR44] 44.Robinson MD, McCarthy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR45] 45.Kumar MS, et al. Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics. 2018;19:799. doi: 10.1186/s12864-018-5160-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR46] 46.Chen, Y., McCarthy, D., Robinson, M. & Smyth, G. K. edger: differential expression analysis of digital gene expression data user’s guide. http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf (2014). [DOI] [PMC free article] [PubMed]

[CR47] 47.Dillies M-A, et al. A comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. Brief. Bioinforma. 2013;14:671–683. doi: 10.1093/bib/bbs046. [DOI] [PubMed] [Google Scholar]

[CR48] 48.Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR49] 49.Agresti A, Hitchcock DB. Bayesian inference for categorical data analysis. Stat. Methods Appl. 2005;14:297–330. doi: 10.1007/s10260-005-0121-y. [DOI] [Google Scholar]

[CR50] 50.Friedman J, Alm EJ. Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 2012;8:e1002687. doi: 10.1371/journal.pcbi.1002687. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR51] 51.Fernandes AD, et al. Unifying the analysis of high-throughput sequencing datasets: characterizing rna-seq, 16s rrna gene sequencing and selective growth experiments by compositional data analysis. Microbiome. 2014;2:15. doi: 10.1186/2049-2618-2-15. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR52] 52.Steel, G. et al. Relation between poisson and multinomial distributions. https://ecommons.cornell.edu/bitstream/handle/1813/32480/BU-39-M.pdf?sequence=1 (1953).

[CR53] 53.Taddy M. Multinomial inverse regression for text analysis. J. Am. Stat. Assoc. 2013;108:755–770. doi: 10.1080/01621459.2012.734168. [DOI] [Google Scholar]

[CR54] 54.Smyth GK, Verbyla AP. A conditional likelihood approach to residual maximum likelihood estimation in generalized linear models. J. R. Stat. Soc. Ser. B. 1996;58:565–572. [Google Scholar]

[CR55] 55.Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23:2881–2887. doi: 10.1093/bioinformatics/btm453. [DOI] [PubMed] [Google Scholar]

[CR56] 56.Fernandes AD, Macklaim JM, Linn TG, Reid G, Gloor GB. Anova-like differential expression (aldex) analysis for mixed population rna-seq. PLoS ONE. 2013;8:e67019. doi: 10.1371/journal.pone.0067019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR57] 57.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc.: Ser. B. 1995;57:289–300. [Google Scholar]

[CR58] 58.Dunn, O. J. Estimation of the means of dependent variables. Annal. Math. Stat. 1095–1111 (1958).

[CR59] 59.Dunn OJ. Multiple comparisons among means. J. Am. Stat. Assoc. 1961;56:52–64. doi: 10.1080/01621459.1961.10482090. [DOI] [Google Scholar]

[CR60] 60.Yatsunenko T, et al. Human gut microbiome viewed across age and geography. Nature. 2012;486:222–227. doi: 10.1038/nature11053. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR61] 61.Washburne AD, et al. Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets. PeerJ. 2017;5:e2969. doi: 10.7717/peerj.2969. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR62] 62.Washburne AD, et al. Phylofactorization: a graph partitioning algorithm to identify phylogenetic scales of ecological data. Ecol. Monogr. 2019;89:e01353. doi: 10.1002/ecm.1353. [DOI] [Google Scholar]

[CR63] 63.Silverman JD, Washburne AD, Mukherjee S, David LA. A phylogenetic transform enhances analysis of compositional microbiota data. Elife. 2017;6:e21887. doi: 10.7554/eLife.21887. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR64] 64.Rivera-Pinto, J. et al. Balances: a new perspective for microbiome analysis. MSystems 3 (2018). [DOI] [PMC free article] [PubMed]

[CR65] 65.Egozcue JJ, Pawlowsky-Glahn V. Groups of parts and their balances in compositional data analysis. Math. Geol. 2005;37:795–828. doi: 10.1007/s11004-005-7381-9. [DOI] [Google Scholar]

[CR66] 66.Pawlowsky-Glahn V, Egozcue JJ. Exploring compositional data with the coda-dendrogram. Austrian J. Stat. 2011;40:103–113. [Google Scholar]

[CR67] 67.Segata N, et al. Metagenomic biomarker discovery and explanation. Genome Biol. 2011;12:R60. doi: 10.1186/gb-2011-12-6-r60. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR68] 68.O’Keefe SJ, et al. Fat, fibre and cancer risk in african americans and rural africans. Nat. Commun. 2015;6:6342. doi: 10.1038/ncomms7342. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR69] 69.Lahti, L., Shetty, S., Blake, T. & Salojarvi, J. Tools for microbiome analysis in r. version 2.1.28. https://microbiome.github.io/tutorials/ (2017).

[CR70] 70.Holm, S. A simple sequentially rejective multiple test procedure. Scand J. Stat. 65–70 (1979).

PERMALINK

Analysis of microbial compositions: a review of normalization and differential abundance analysis

Huang Lin

Shyamal Das Peddada

Abstract

Introduction

Table 1.

Table 2.

Fig. 1. Microbiome data is represented by relative abundances, thus differential abundance analysis should account for the bias introduced by across-sample variations in sampling fractions.

Normalization methods

Rarefying

Fig. 2. Rarefaction curves using the diet swap data68 at the genus level.

Scaling

Table 3.

Fig. 3. Box plot of residuals between true sampling fraction and its estimate for each sample.

Log-ratio based methods

Methods of differential abundance analysis

RNA-seq based methods: edgeR and DESeq2

Fig. 4. False Discovery Rate (FDR) and power comparisons using synthetic data.

MetagenomeSeq

ALDEx2

ANCOM

DR

ANCOM-BC

Fig. 5. Venn diagrams representing consistency of differentially abundant genera identified by ANCOM-BC, ANCOM, and DR.

Balance-based methods

Fig. 6. Waterfall plot visualizing coefficient (US: MA) for top 20 balances identified by gneiss using the global gut microbiota data60.

LEfSe

Discussion

Supplementary information

Acknowledgements

Author contributions

Data availability

Code availability

Competing interests

Footnotes

Supplementary information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Fig. 2. Rarefaction curves using the diet swap data⁶⁸ at the genus level.

Fig. 6. Waterfall plot visualizing coefficient (US: MA) for top 20 balances identified by gneiss using the global gut microbiota data⁶⁰.