Briefings in Bioinformatics. 2023 Jan 6;24(1):bbac607. doi: 10.1093/bib/bbac607

Benchmarking differential abundance analysis methods for correlated microbiome sequencing data

Lu Yang 1, Jun Chen 2
PMCID: PMC9851339  PMID: 36617187

Abstract

Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Current microbiome studies frequently generate correlated samples from different microbiome sampling schemes such as spatial and temporal sampling. In the past decade, a number of DAA tools for correlated microbiome data (DAA-c) have been proposed. Disturbingly, different DAA-c tools could sometimes produce quite discordant results. To recommend the best practice to the field, we performed the first comprehensive evaluation of existing DAA-c tools using real data-based simulations. Overall, the linear model-based methods LinDA, MaAsLin2 and LDM are more robust than methods based on generalized linear models. The LinDA method is the only method that maintains reasonable performance in the presence of strong compositional effects.

Keywords: microbiome, metagenomics, repeated sampling, matched-pair, longitudinal, differential abundance analysis

Background

After nearly two decades of research, the human microbiome, the collection of microorganisms and their genetic contents associated with the human body, has been revealed to play a significant role in human health and disease [1]. The human microbiome constantly interacts with environmental and host factors and dynamically evolves over time [2]. It is of great interest to study how the microbiome changes over time and its association with factors such as demographic and lifestyle characteristics, medical history, disease treatment and various clinical outcomes. To answer these questions, longitudinal designs have been increasingly employed in microbiome studies [3, 4]. Compared with case–control and cross-sectional microbiome studies, longitudinal studies, which repeatedly sample the microbiome over a course of time, provide unique opportunities to investigate the dynamics of the microbiome, decipher the species interaction network and establish a potential causal relationship if the microbiome change precedes the phenotypic change [2]. Statistically, longitudinal studies enjoy higher statistical power and less confounding by using the baseline measurement as the control. Exemplary longitudinal microbiome studies are those from the Integrative Human Microbiome Project (iHMP) [5], the second phase of the Human Microbiome Project (HMP). iHMP focuses on generating integrated longitudinal datasets and understanding how the microbiome impacts the disease course through a longitudinal view. Besides the longitudinal design, spatial and replicate sampling designs have also been frequently used in microbiome studies [2, 4, 6]. All these studies generate correlated microbiome data, where the microbiome composition profiles derived from the same subject are more similar to each other than those derived from different subjects. Addressing these inherent correlations in microbiome data analysis is critical in obtaining robust and reproducible results.

One central statistical task for microbiome data analysis is differential abundance analysis (DAA), where the goal is to identify the microbial features whose abundance covaries with a variable of interest. The identified microbial features could help improve our understanding of disease mechanisms and be potentially used as biomarkers for disease prevention, diagnosis, prognosis and treatment selection [7]. With the help of next-generation sequencing technologies, microbiome samples are now routinely profiled by either 16S rRNA gene-targeted sequencing or whole-genome shotgun sequencing [8]. After bioinformatics processing of the sequencing reads, microbiome data can be summarized into a count table, which records the frequencies of the detected microbial features. Depending on the specific pipeline used, these microbial features could be operational taxonomic units (OTUs), amplicon sequence variants [9] or taxa at different taxonomic ranks. DAA is then performed on the count table together with the metadata describing the sample conditions. DAA of microbiome data raises several statistical challenges, including properly modeling the zero-inflated, highly skewed abundance distribution [10–12], addressing the inherent compositional effects [13–15] and effectively utilizing the phylogenetic relatedness among microbial features [16, 17]. In addition to addressing these basic characteristics of microbiome compositional data, DAA of correlated microbiome data also faces the challenge of properly accounting for the correlation structures of non-normally distributed abundance data. Ignoring the correlations could reduce the efficiency of the analysis (analogous to applying a two-sample t-test to paired data) or, more seriously, produce overly confident results by exaggerating the true sample size.
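The efficiency loss mentioned above can be illustrated with a small numerical sketch. The data below are hypothetical and chosen only so that between-subject spread is large relative to the within-subject change; the two helper functions are illustrative and not part of any method evaluated in this study.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic computed on the within-subject differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic that ignores the pairing."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(y) - mean(x)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))

# Hypothetical abundances for five subjects measured twice.
before = [10, 20, 30, 40, 50]
after = [11, 22, 32, 43, 54]

t_paired = paired_t(before, after)        # uses the correlation between pairs
t_unpaired = two_sample_t(before, after)  # treats the ten values as independent
```

In this toy example the paired statistic is far larger than the unpaired one because the pairing removes the between-subject variation from the error term.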

Compared with the plethora of DAA methods developed for independent microbiome data, methods for correlated microbiome data (we label them as ‘DAA-c’) are relatively under-developed. Nevertheless, in the past decade, several statistical methods were proposed and applied in analyzing correlated microbiome data. These methods could be roughly divided into three categories. The first category of methods involves data transformation so that the transformed abundance data are more amenable to modeling by existing statistical methods. Commonly used transformations include the log, centered log ratio (CLR) [18, 19], square-root and arcsine square-root transformations [20]. Based on the transformed data, the standard linear mixed-effects model (LMM) is then directly applied [21–23]. The MaAsLin2 [24] package uses LMM as the default method to analyze correlated microbiome data, with additional preprocessing steps (e.g. zero replacement), and several options for normalization and transformation. However, the default total sum scaling (TSS) normalization used in MaAsLin2 could lead to severely inflated type I error under certain scenarios due to strong compositional effects [25]. For example, the increase in the abundance of one dominant microbial species will lead to apparent decreases in the relative abundance of all other species. To address the compositional effects, Zhou et al. [26] proposed the LinDA method, which corrects the compositional bias after applying LMM on CLR transformed data. Although LMM is computationally efficient and highly interpretable, its normality assumption may not be met for real data due to the severe zero inflation [10–12]. It is unknown whether such a violation of the assumption will lead to reduced power and/or increased type I error. To remedy the drawback of LMM, zero-inflated Gaussian mixed models (ZIGMM) [27] were proposed. ZIGMM assumes a zero-inflated Gaussian distribution for the transformed abundance data and models the zero and nonzero parts using logistic and LMMs, respectively. The LDM method [28], another linear model-based method based on transformed abundance data, uses permutation to assess the significance so that it is more robust to model misspecification. Different permutation schemes are used in LDM to account for the correlation structure in the data.

The second category of methods models the TSS-normalized data or proportions using probabilistic distributions with support on [0, 1]. The beta distribution is a popular choice due to its ability to model a wide range of skewed abundance distributions through its two shape parameters. One representative in this category is the two-part zero-inflated beta mixed model (ZIBR) [29], where the logistic mixed-effects model and the mixed-effect beta regression model are used to model the (structural) zero and non-zero parts, similar in spirit to ZIGMM.

The third category of methods models the count data through generalized linear mixed-effects models (GLMM). These GLMM methods naturally address the sampling variability of the read counts and use more information than the methods from the first two categories. Thus, they are expected to be more powerful for small-sample studies. However, the major challenges are the computational complexity due to the involvement of integration in the likelihood function and strong model assumptions of the count distribution. The simplest count model is the Binomial or Poisson model. However, real microbiome data exhibit more variability than what is expected by a Binomial or Poisson model. The negative binomial (NB) model, on the other hand, has an extra overdispersion parameter and is more flexible. NB models have been widely used in DAA of microbiome data [30, 31]. To account for the correlation structure, the mixed-effects NB regression model, which has been implemented in the widely used R package lme4 (‘glmer.nb’ function) [32], has been applied in practice [33–35]. As an alternative to ‘glmer.nb’, Zhang et al. [36] developed a flexible and efficient Iterative Weighted Least Squares algorithm to fit the mixed-effects NB regression model (‘negative binomial mixed model’, NBMM). Later, Zhang and Yi [37] extended the NBMM to account for zero inflation and proposed the zero-inflated negative binomial mixed model (FZINBMM). Besides these specialized methods, the zero-inflated negative binomial mixed model can also be fit using the R packages GLMMadaptive [38] and glmmTMB [39]. GLMMadaptive and glmmTMB use adaptive Gaussian quadrature and the Laplace approximation to evaluate the likelihood function, respectively. In addition to the NB model, GLMM with a quasi-Poisson family has also been used for analyzing over-dispersed longitudinal microbiome data. For example, the studies [40, 41] used the Penalized Quasi-Likelihood (GLMMPQL) approach to fit the GLMM with a quasi-Poisson family.
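The compositional effect described for TSS normalization can be made concrete with a toy example. The sketch below uses hypothetical absolute abundances for three taxa, and the pseudo-count of 0.5 in the CLR helper is an assumption for illustration; it is not the preprocessing used by MaAsLin2 or LinDA.

```python
import math

def tss(counts):
    """Total sum scaling: convert abundances to proportions."""
    total = sum(counts)
    return [c / total for c in counts]

def clr(counts, pseudo_count=0.5):
    """Centered log-ratio transform after adding a pseudo-count (assumed value)."""
    logs = [math.log(c + pseudo_count) for c in counts]
    center = sum(logs) / len(logs)
    return [v - center for v in logs]

baseline = [100, 50, 50]  # hypothetical absolute abundances of three taxa
bloom = [300, 50, 50]     # only the first taxon truly changes

p_base = tss(baseline)    # [0.5, 0.25, 0.25]
p_bloom = tss(bloom)      # [0.75, 0.125, 0.125]

clr_base = clr(baseline)
clr_bloom = clr(bloom)
```

Although the second and third taxa are unchanged in absolute terms, their proportions halve. Their CLR values also shift downward because the center moves with the dominant taxon, which is the kind of bias a compositional correction has to remove.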

The rising popularity of longitudinal microbiome studies and the availability of multiple DAA-c methods call for a comprehensive evaluation to provide recommendations and guidance to end-users and tool developers. In contrast to several benchmarking studies of DAA methods [42, 43] for independent microbiome data, no benchmarking studies, to the best of our knowledge, have been conducted for DAA-c methods. In this study, we propose a real data-based semiparametric simulation framework to perform comprehensive evaluation under diverse biologically relevant settings. We evaluate the performance of DAA-c methods under three commonly seen study designs. These designs include (i) replicate sampling, where each microbiome sample is subject to multiple measurements to reduce noise [44], (ii) matched-pair design, where the microbiome is sampled before and after treatment for each subject [28, 45] and (iii) general longitudinal sampling, where the microbiomes of two groups of subjects are sampled at multiple time points [46]. We focus the evaluation on the ability of each DAA-c method to control for false positives and its power to detect true association signals after false discovery rate (FDR) control [47]. The results of the benchmarking study will help users select the most robust method for their studies.

Materials and methods

A semiparametric simulation framework for realistic correlated microbiome data generation

Traditional microbiome data simulators are usually based on parametric models such as the Dirichlet-multinomial model [48, 49] and the logistic normal multinomial model [50]. The sample space is thus determined by a small set of parameters. Due to the complexity of the microbiome data, existing parametric models may fail to capture the full distributional characteristics of the data. To generate more realistic data, we adopt a semiparametric approach, where we draw random samples from a large reference microbiome dataset (non-parametric part) and add covariate/confounder effects parametrically (parametric part). Specifically, for each drawn reference sample, we infer the underlying composition based on an empirical Bayesian model and add covariate/confounder effects to the composition vector via a log linear model. Once the true underlying composition is obtained, sequence reads are generated using a multinomial model. By using the real microbiome data as the template, our method circumvents the difficulty in modeling the complex inter-subject variation of the microbiome composition.

The basic steps of the semiparametric simulation framework are depicted in Supplementary Figure S1. Specifically, we use the following steps to generate the longitudinal microbiome data with Inline graphic subjects belonging to two groups and Inline graphic evenly spaced time points for each subject:

1. Build a reference dataset. The reference dataset is a collection of microbiome sequencing samples from a specific body site of a study population. It should be large enough to capture the main compositional variation in the population of interest. Microbiome datasets from those large-scale population-level studies such as HMP [5] and American Gut Project [51] are all good choices. The reference datasets used in the simulation are the human stool and vaginal microbiome datasets from HMP with basic filtering to remove extremely rare taxa (prevalence <10% or max proportion < 0.2%), resulting in 295 samples and 2094 taxa, and 381 samples and 781 taxa for the stool and vaginal dataset, respectively. The human stool and vaginal microbiome are chosen to represent a high- and low-diversity microbial community, respectively.

2. Sample a posteriori the underlying composition of the reference samples based on the observed counts using an empirical Bayes approach.

a. Suppose we have M taxa and N subjects, and let k and i index taxa and subjects, respectively. Assuming an informative Dirichlet prior for the underlying composition, estimate the Dirichlet hyperparameters (Inline graphic) based on the observed counts (Inline graphic Inline graphic) using maximum likelihood estimation (R package ‘dirmult’). The posterior distribution of the underlying composition for sample Inline graphic is then a Dirichlet distribution with parameter Inline graphic.

b. Obtain a posterior sample of the underlying composition for each reference sample based on the posterior Dirichlet distribution Inline graphic. Denote by Inline graphic the proportion of the Inline graphicth taxon in the Inline graphicth subject.

3. Generate the absolute abundance (Inline graphic) by multiplying by a factor Inline graphic representing the microbial load at the sampling site, i.e. Inline graphic, where Inline graphic without loss of generality.

4. Given Inline graphic subjects and Inline graphic time points, randomly draw Inline graphic samples based on the absolute abundance data generated in the last step. Replicate the absolute abundance profile of each subject Inline graphic times. Denote Inline graphic as the absolute abundance for the Inline graphicth taxon in the Inline graphicth subject at the Inline graphicth time point.

5. Generate the time and group covariates and a confounder. A binary group covariate Inline graphic is created by dichotomizing a latent variable Inline graphic using some cutoff value to achieve the specified group sizes. The confounder is generated by Inline graphic Inline graphic where Inline graphic is the desired correlation between Inline graphic and Inline graphic. The time covariate Inline graphic is set as Inline graphic.

6. Given K taxa included in the analysis, generate their coefficients for Inline graphic, Inline graphic, respectively, which are Inline graphic Inline graphic and Inline graphic, where Inline graphic, for Inline graphic. The interpretation of the parameters Inline graphic, Inline graphic Inline graphic can be found in Supplementary Table S1. Note that the time coefficient Inline graphic could vary by subject so that each subject could have its own trajectory (random slope). Non-differential taxa are simulated by setting the corresponding coefficients to 0. Time and group interaction can also be added in this step.

7. Generate random error Inline graphic (Inline graphic), where Inline graphic controls the within-subject correlation for taxon Inline graphic.

8. Add covariate (Inline graphic), confounder (Inline graphic), time (Inline graphic) effects and the random effect (Inline graphic) using a log linear model Inline graphic.

9. Normalize into the proportion Inline graphic based on Inline graphic. Generate the sequencing depth Inline graphic based on an NB distribution. Finally, generate the read counts Inline graphic based on a multinomial distribution with parameters Inline graphic.

The matched-pair and the replicate sampling design can be regarded as special cases of the longitudinal design. The matched-pair data can be generated by including two time points in the previous steps and setting the covariate effect to be 0, while the replicate sampling data can be generated by setting the time effect to be 0.
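The flow of steps 2 to 9 can be sketched for a single subject as follows. This is a rough illustration under stated assumptions and not the authors' implementation: the exact formulas are not recoverable from this text, so the Gaussian form of the random error, the fixed sequencing depth, the omission of the confounder and the microbial load factor (taken as 1), and all names (`simulate_subject`, `prior`, `beta`, `gamma`, `sigma`) are hypothetical.

```python
import math
import random
from collections import Counter

def dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

def simulate_subject(ref_counts, prior, x, beta, gamma, n_time, sigma, depth, rng):
    """Toy version of steps 2-9 for one subject; returns one count vector per time point."""
    # Step 2: posterior draw of the underlying composition (prior + observed counts).
    comp = dirichlet([a + c for a, c in zip(prior, ref_counts)], rng)
    samples = []
    for t in range(n_time):  # Step 4: the same subject-level profile is reused at each time point
        # Steps 7-8: add group effect, time effect and an assumed Gaussian error on the log scale.
        log_abund = [
            math.log(max(p, 1e-300)) + beta[k] * x + gamma[k] * t + rng.gauss(0.0, sigma)
            for k, p in enumerate(comp)
        ]
        # Step 9: renormalise to proportions, then draw multinomial read counts.
        top = max(log_abund)
        weights = [math.exp(v - top) for v in log_abund]
        total = sum(weights)
        props = [w / total for w in weights]
        draws = Counter(rng.choices(range(len(props)), weights=props, k=depth))
        samples.append([draws.get(k, 0) for k in range(len(props))])
    return samples

rng = random.Random(1)
ref_counts = [120, 30, 0, 50]   # hypothetical reference sample with four taxa
prior = [0.5, 0.5, 0.5, 0.5]    # stand-in for the estimated Dirichlet hyperparameters
beta = [1.0, 0.0, 0.0, 0.0]     # group effect on the first taxon only
gamma = [0.0, 0.0, 0.0, 0.0]    # no time effect (replicate-sampling special case)
sim = simulate_subject(ref_counts, prior, x=1, beta=beta, gamma=gamma,
                       n_time=3, sigma=1.0, depth=1000, rng=rng)
```

Setting `gamma` to zero mimics the replicate sampling case, while setting `beta` to zero with two time points mimics the matched-pair case.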

Simulation settings for the evaluation

To comprehensively evaluate the performance of DAA-c methods for correlated microbiome data, we simulate various settings covering a wide range of signal structures (Table 1). We focus on testing the effect of Inline graphic in the replicate sampling design, Inline graphic in the matched-pair design, Inline graphic and Inline graphic in the longitudinal design. We study the performance under both the balanced and unbalanced differential settings, where the differential taxa could have random (‘balanced’) or the same direction of change (‘unbalanced’). The unbalanced setting creates strong compositional effects and is statistically more challenging than the balanced setting. We study the performance under both a high- and low-diversity microbial community as represented by the stool and vaginal microbiome, respectively.

Table 1.

Simulation settings used in evaluation of DAA-c methods

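Design | Change pattern | Setting | Effect size, group covariate (Inline graphic) | Effect size, time (Inline graphic) | Between-subject variation (Inline graphic) | Within-subject correlation (Inline graphic)
Global null | | 1 | 0 / 0 | 0 / 0 | 0 | 1
Global null | | 2 | 0 / 0 | 0 / 0 | 0 | 4
Replicate sampling | Balanced | 3 | + / ++ | 0 / 0 | 0 | 1
Replicate sampling | Unbalanced | 4 | + / ++ | 0 / 0 | 0 | 1
Replicate sampling | Balanced | 5 | + / ++ | 0 / 0 | 0 | 4
Replicate sampling | Unbalanced | 6 | + / ++ | 0 / 0 | 0 | 4
Matched-pair | Balanced | 7 | 0 / 0 | + / ++ | 0 | 1
Matched-pair | Unbalanced | 8 | 0 / 0 | + / ++ | 0 | 1
Matched-pair | Balanced | 9 | 0 / 0 | + / ++ | 0 | 4
Matched-pair | Unbalanced | 10 | 0 / 0 | + / ++ | 0 | 4
Longitudinal | Balanced | 11 | + / ++ | 0.5 / 0.5 | 0.5 | 1
Longitudinal | Balanced | 12 | 0.5 / 0.5 | + / ++ | 0.5 | 1
Longitudinal | Unbalanced | 13 | + / ++ | 0.5 / 0.5 | 0.5 | 1
Longitudinal | Unbalanced | 14 | 0.5 / 0.5 | + / ++ | 0.5 | 1
Longitudinal | Balanced | 15 | + / ++ | 0.5 / 0.5 | 0.5 | 4
Longitudinal | Balanced | 16 | 0.5 / 0.5 | + / ++ | 0.5 | 4
Longitudinal | Unbalanced | 17 | + / ++ | 0.5 / 0.5 | 0.5 | 4
Longitudinal | Unbalanced | 18 | 0.5 / 0.5 | + / ++ | 0.5 | 4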

To further dissect the performance of the DAA-c methods, we study two levels of effect sizes under each setting, denoted as ‘+’ and ‘++’, representing moderate and large effects. Since confounders are common in microbiome studies [52, 53] and adjusting for confounders is critical in obtaining robust biological findings, we simulate one continuous confounder with a correlation of ~0.6 between the covariate and the confounder. In the default setting, we include 500 taxa and a total of 200 samples. Specifically, for the replicate sampling design, we simulate 100 subjects, each with two replicates. For the matched-pair design, we simulate 100 subjects, each with a pre- and post-treatment sample. For the longitudinal design, we simulate 40 subjects, each with five time points. We also study the effect of a small sample size/taxa number by decreasing the number of subjects to 20 and the number of taxa to 50, roughly representing family- or genus-level abundance data after filtering. For all simulations, we generate sequencing depths from an NB distribution with a mean depth of 10 000 and a dispersion parameter of 5 (rnegbin(theta = 5) in R package ‘MASS’). Throughout the simulation settings, we include 10% randomly drawn differential taxa. We also let 5 and 10% of the taxa be affected by the confounder for the differential and non-differential taxa, respectively.
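As a rough illustration of the sequencing depth distribution, the sketch below draws NB variates with mean 10 000 and dispersion 5 in plain Python. It is a stand-in for the R call mentioned above, not a port of it: the function name simply mirrors the R one, and the sum-of-geometrics construction assumes an integer dispersion parameter.

```python
import math
import random

def rnegbin(mu, theta, rng):
    """Negative binomial draw with mean mu and integer size theta
    (variance mu + mu**2 / theta), built as a sum of theta geometric variates."""
    p = theta / (theta + mu)               # success probability
    log_q = math.log1p(-p)                 # log of the failure probability
    total = 0
    for _ in range(theta):
        u = 1.0 - rng.random()             # uniform on (0, 1], avoids log(0)
        total += int(math.log(u) / log_q)  # failures before the next success
    return total

rng = random.Random(2023)
depths = [rnegbin(10_000, 5, rng) for _ in range(20_000)]
mean_depth = sum(depths) / len(depths)
```

With a dispersion of 5 the standard deviation of the depth is roughly 4500 reads, so simulated libraries vary considerably around the mean of 10 000.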

DAA methods evaluated

We evaluate the widely used and recently developed DAA-c methods including ZIGMM [27], NBMM [36], FZINBMM [37], GLMMadaptive [38], glmmTMB [54], glmer.nb [55], GLMMPQL [55], LDM [28, 56], LinDA [26], ZIBR [29] and MaAsLin2 [24]. A detailed summary is shown in Table 2. For count model-based methods including NBMM, FZINBMM, GLMMadaptive, glmmTMB, glmer.nb and GLMMPQL, the log GMPR (geometric mean of pairwise ratios) [57] size factors are used as the offset to account for the library size variation. We choose ‘family = zi.negative.binomial()’ and ‘family = nbinom2’ for GLMMadaptive and glmmTMB, respectively. For ZIGMM, a log transformation is applied before running the method and a log GMPR size factor is used as the offset. LDM currently can only be directly applied to replicate sampling and matched-pair designs; thus, it is not tested for the general longitudinal design. Default settings are chosen for all the methods evaluated. For all simulated datasets, taxa with a prevalence <10% or a maximum proportion less than 0.2% are excluded from testing, as is usually done in practice. For consistency, all filtering steps in the evaluated methods are disabled, and the same preprocessed datasets are used as the input to all methods.

Table 2.

DAA-c methods evaluated in this study

Method | Handling zeros | Normalization | Model | R package version | Availability
ZIGMM | Model (zero-inflation) | GMPR | ZIGMM | NBZIMM_1.0 | https://github.com/nyiuab/NBZIMM
NBMM | Not necessary | GMPR | NB mixed models | NBZIMM_1.0 | https://github.com/nyiuab/NBZIMM
FZINBMM | Model (zero-inflation) | GMPR | Zero-inflated negative binomial mixed model | NBZIMM_1.0 | https://github.com/nyiuab/NBZIMM
GLMMadaptive | Model (zero-inflation) | GMPR | Zero-inflated Poisson/negative binomial mixed model | GLMMadaptive_0.8-0 | https://cran.r-project.org/web/packages/GLMMadaptive/index.html
glmmTMB | Model (zero-inflation) | GMPR | Zero-inflated negative binomial mixed model | glmmTMB_1.0.2.1 | https://cran.r-project.org/web/packages/glmmTMB/index.html
glmer.nb | Model (zero-inflation) | GMPR | Negative binomial/Poisson mixed models | lme4_1.1-26 | https://cran.r-project.org/web/packages/lme4/index.html
GLMMPQL | Model (overdispersion) | GMPR | GLM quasi-Poisson model | MASS_7.3-53 | https://cran.r-project.org/web/packages/MASS/index.html
LDM | Not necessary | TSS | Linear model | LDM_1.0 | https://github.com/yijuanhu/LDM
LinDA | Pseudo-count | CLR | Linear model | LinDA_0.1.0 | https://github.com/zhouhj1994/LinDA
MaAsLin2 | Pseudo-count | TSS | Log linear model | Maaslin2_1.4.0 | https://github.com/biobakery/Maaslin2
ZIBR | Model (zero-inflation) | TSS | Two-part zero-inflated beta mixed model | ZIBR_0.1 | https://github.com/chvlyl/ZIBR

Performance evaluation for the simulation study

We evaluate the performance of DAA-c methods based on their ability to control for false positives and their power to detect the true associations after applying FDR control (BH procedure [58]) at the 5% target level. False positive control is assessed based on the observed empirical FDR, which is the false discovery proportion (FDP) averaged over 100 simulation runs (1000 simulation runs for the global null). Power is assessed based on the average true positive rate (TPR). FDP and TPR are defined as:

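$$\mathrm{FDP} = \frac{\mathrm{FP}}{\max(\mathrm{FP} + \mathrm{TP},\, 1)}, \qquad \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$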

where FP, TP and FN are the number of false positives, true positives and false negatives, respectively. To facilitate assessment and visual interpretation, we use a scoring system to summarize the performance across settings (Supplementary Table S2):

False positive control scoring system

Observed FDR ∈[0,0.05], (0.05,0.1], (0.1,0.2] and (0.2,1] scores 3 green stars, 2 yellow stars, 1 red star and 0 star (all gray), respectively. The total score is the number of stars the method receives for each setting. If an observed FDR is larger than 0.05 but its 95% confidence interval covers 0.05, we also assign 3 green stars.
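The star rule above can be written as a small function. This is a minimal sketch: the function name is hypothetical, and because the text does not state how the 95% confidence interval is computed, its lower bound is passed in as an optional argument rather than derived here.

```python
def fdr_stars(observed_fdr, ci_lower=None, target=0.05):
    """Map an observed FDR to 3, 2, 1 or 0 stars following the rule in the text.

    ci_lower is the lower bound of the 95% confidence interval of the observed
    FDR; how that interval is computed is left to the caller (assumption).
    """
    if observed_fdr <= target:
        return 3
    if ci_lower is not None and ci_lower <= target:
        return 3  # above the target, but the interval still covers it
    if observed_fdr <= 0.1:
        return 2
    if observed_fdr <= 0.2:
        return 1
    return 0

# Hypothetical observed FDRs for one method across four settings.
total_score = sum(fdr_stars(f) for f in [0.04, 0.07, 0.15, 0.30])
```

Summing the stars over settings gives the total false positive control score for a method.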

Power scoring system

We rank the methods based on their average TPRs (higher rank, better power). The total score is the sum of the ranks for each setting.
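Both scoring systems are fed by the FDP and TPR of each simulation run after BH adjustment. As a minimal sketch, the block below implements the BH step-up adjustment and the two proportions in plain Python; the p-values and truth labels are hypothetical, and taking FDP as 0 when nothing is rejected is a convention assumed here.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest to the smallest p-value
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def fdp_tpr(pvalues, is_differential, level=0.05):
    """False discovery proportion and true positive rate for one simulation run."""
    rejected = [q <= level for q in bh_adjust(pvalues)]
    tp = sum(r and d for r, d in zip(rejected, is_differential))
    fp = sum(r and not d for r, d in zip(rejected, is_differential))
    fn = sum((not r) and d for r, d in zip(rejected, is_differential))
    fdp = fp / max(fp + tp, 1)  # defined as 0 when nothing is rejected (assumed)
    tpr = tp / max(tp + fn, 1)
    return fdp, tpr

# Hypothetical p-values for five taxa, three of which are truly differential.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
truth = [True, True, False, True, False]
fdp, tpr = fdp_tpr(pvals, truth)
```

Averaging `fdp` and `tpr` over simulation runs would give the observed FDR and the average TPR used for scoring.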

Overall score

To produce an overall score, we first convert the total FDR and TPR scores into ranks (‘TPR rank’ and ‘FDR rank’). These ranks are summed for each method to produce an ‘overall score’. This strategy assigns equal weight to false positive control and power. The order of the methods displayed in the figures is then based on the overall score.
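The rank-sum construction can be sketched as follows. The method names and total scores are hypothetical, and giving tied methods the average of their rank positions is an assumption, since the text does not say how ties are handled.

```python
def ranks(scores):
    """Average ranks with 1 for the lowest score; ties share the mean position."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    out = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            out[order[j]] = avg
        pos = end + 1
    return out

# Hypothetical total scores for three methods.
fdr_total = {"A": 12, "B": 9, "C": 12}
tpr_total = {"A": 20, "B": 25, "C": 15}

methods = list(fdr_total)
fdr_rank = ranks([fdr_total[m] for m in methods])
tpr_rank = ranks([tpr_total[m] for m in methods])
overall = {m: f + t for m, f, t in zip(methods, fdr_rank, tpr_rank)}
```

Because the two ranks are simply added, a method that is strong on only one criterion cannot dominate the ordering.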

Real microbiome datasets

Three real microbiome datasets representing replicate sampling, matched-pair and longitudinal designs were used to compare the performance of competing DAA-c methods. The first dataset (‘Smoker2010’) was generated to study whether smoking has an effect on the human upper respiratory tract (URT) microbiome via 16S rRNA gene-targeted sequencing [59]. Replicate sampling was used in this study (left and right nose, left and right throat). The dataset was downloaded from the Qiita database [60] with the study ID 524. We focused on comparing the URT microbiome between smokers and non-smokers based on the two throat samples. Samples with <1000 reads and OTUs with a maximum proportion <0.002 or a prevalence <10% of the samples were excluded from the analysis. Finally, 124 samples (31 smoking subjects and 31 non-smoking subjects, each with 2 replicates from the left and right side of the throat) and 197 OTUs were included in the analysis. Sex is the confounder (P = 0.01) in this dataset and was included as a covariate.

The second dataset (‘Nicholas2013’) was generated to study the impact of cleaning on the surface microbiome within a neonatal intensive care unit (NICU) [6]. The dataset was downloaded from the Qiita database [60] with the study ID 1798. A matched-pair design was used in this study. 16S rRNA gene-targeted sequencing was used to profile the NICU surface microbial communities before and after cleaning [6]. Genus-level abundance data were used in this analysis. Genera with a maximum proportion <0.002 or a prevalence <10% of the samples were excluded from the analysis. Finally, 70 samples (35 matched pairs before and after intensive cleaning) and 110 genera were included in the analysis.

The third dataset (‘IBD2017’) was generated from a longitudinal study of the gut microbiome in inflammatory bowel disease (IBD) patients [61]. The dataset was downloaded from the Qiita database [60] with the study ID 1629. The fecal samples were provided by patients every third month for a 2-year period. Again, 16S rRNA gene-targeted sequencing was used to profile the stool microbial community. In this analysis, we focused on comparing the gut microbiome of ICDr patients (ICD patients that had previously undergone ileocaecal resection) to healthy controls (group difference) and testing whether the longitudinal trend differed by the group (time and group interaction). Fecal calprotectin (f-calprotectin) concentration is the confounder (P < 0.001) and was included as a covariate. Samples with a sequencing depth <10 000 and taxa with a prevalence <10% or a maximum proportion <0.002 were excluded from the analysis. Subjects with fewer than two samples were also excluded from the analysis. As a result, a total of 147 samples and 498 OTUs were included in the analysis. The 147 samples came from 9 healthy controls (2–8 longitudinal samples per subject) and 18 ICDr subjects (2–8 longitudinal samples per subject).

Results

The semiparametric simulation approach captures the characteristics of correlated microbiome data

To provide an objective evaluation of DAA-c methods, real microbiome datasets with known truth are the best candidates. However, such datasets are difficult to obtain and even if they do exist, they may only cover limited biologically relevant settings. Therefore, we use simulations, where the ground truth is known, to evaluate the performance of DAA-c methods. To simulate realistic microbiome data, we employ a semiparametric approach, where the baseline compositions are sampled from a reference set of real microbiome data and covariate and confounder effects are then added parametrically (Methods and Supplementary Figure S1). This approach circumvents the difficulty in modeling the complex abundance distribution of real microbiome data using statistical models. Previously, we introduced a semiparametric simulation framework for independent data [62], where we demonstrated that it could capture the basic characteristics of the real microbiome data such as the sparsity level, mean and variance and taxon-taxon correlations. In this study, we extended the framework by incorporating within-subject correlations. This is achieved by replicating the subject-level abundance profile Inline graphic times, where Inline graphic is the number of replicates for each subject, followed by adding sample-specific random errors, whose variance (the parameter Inline graphic; see Supplementary Table S1) controls the correlation strength. Based on the principal coordinate plot (Bray-Curtis distance) on the simulated data, we see that Inline graphic controls the subject-level clustering pattern for the longitudinal microbiome data. As we increase Inline graphic from 1 to 4, the samples from the same subject are less clustered (Supplementary Figure S2a). In addition, the approach allows including random slopes so that each subject has its own temporal trajectory (Supplementary Figure S2b). We will use this semiparametric framework to evaluate the performance of DAA-c methods for correlated data under diverse settings (Methods, Table 1).

Performance of DAA-c methods under the global null setting

We first study the global null setting, where there are no differential taxa with respect to the covariate Inline graphic (Table 1 settings 1–2, 100 subjects with 2 replicates for each, 500 taxa). We compare the FDR control of various DAA-c methods at the 5% level (Figure 1, Supplementary Figure S3). In this case, FDR is essentially the probability of making any false claims in multiple testing. For stool data, LDM, LinDA and MaAsLin2, all linear model-based methods, could control the FDR close to the target level across settings. In contrast, NBMM, ZIGMM, FZINBMM and GLMMadaptive show moderate FDR inflation when the within-subject correlation is low (Inline graphic), whereas ZIBR has moderate FDR inflation when the within-subject correlation is high (Inline graphic). glmer.nb, GLMMPQL and glmmTMB, on the other hand, have moderate FDR inflation in both settings. For vaginal data, FDR control becomes more challenging. Only LDM and MaAsLin2 can control the FDR to the target level across settings, whereas LinDA shows moderate FDR inflation regardless of the within-subject correlation strength. ZIGMM can control FDR within 10% only when the within-subject correlation is low. All other methods fail to control FDR properly.

Figure 1.


Performance of DAA-c methods under the global null setting for stool and vaginal microbiome data (replicate sampling). Performance is assessed by the observed FDR, calculated as the percentage of the 1000 simulation runs making any false discoveries. Three green stars, 2 yellow stars, 1 red star and 0 stars (all gray) indicate the observed FDR level in [0, 0.05], (0.05, 0.1], (0.1, 0.2] and (0.2, 1], respectively.

Performance of DAA-c methods under balanced changes

Our next study focuses on the performance of DAA-c methods when the abundance of 10% randomly selected taxa covaries with the treatment covariate (Inline graphic) or the time covariate (Inline graphic) (total sample size = 200, taxa number = 500). In this set of simulations, we simulate balanced changes, i.e. the abundance of those differential taxa increases or decreases in one group randomly. We will test for the effect of Inline graphic for replicate sampling data, the effect of Inline graphic for matched-pair data and both Inline graphic and Inline graphic for general longitudinal data.

Performance on replicate sampling data (100 subjects with 2 replicates each, settings 3 and 5, Figure 2). For the stool data (Figure 2A), only LinDA, MaAsLin2 and LDM can control the FDR within 10% across settings. However, LinDA and MaAsLin2 are substantially more powerful than LDM, especially when the within-subject correlation is low. ZIGMM, glmmadaptive, ZINBMM and NBMM can control the FDR within 10% when the within-subject correlation is high, with power comparable to LinDA and MaAsLin2, but they cannot control the FDR properly when the within-subject correlation is low. ZIBR shows the opposite trend, as in the global null setting. In contrast, glmmTMB, GLMMPQL and glmernb show high FDR inflation. For the vaginal data (Figure 2B), the performance of most methods becomes worse. However, MaAsLin2, LinDA and LDM are still able to control the FDR within 10%, and MaAsLin2 and LinDA are more powerful than LDM. ZIBR performs well in both FDR control and power when the within-subject correlation is low but cannot control the FDR when the within-subject correlation is high. All other methods have severe FDR inflation.

Figure 2.

False positive control and power under the replicate sampling design (balanced change setting, A: stool and B: vaginal). ‘+’ and ‘++’ represent moderate and large effect sizes, respectively. ‘High’ and ‘Low’ within-subject correlations are simulated with Inline graphic= 1 and Inline graphic= 4, respectively. Green, yellow, red and gray colors indicate the observed FDR. The green color indicates that the method controls the FDR at the 5% target level (the 95% confidence interval covers 5%). Yellow, red and gray colors indicate the observed FDR level in (0.05–0.1], (0.1, 0.2] and (0.2, 1], respectively. The length of the bar is proportional to the average TPR and the actual value is shown in the bar. FDR and TPR ranks are based on the average FDR and TPR scores across settings. The order of the method is arranged based on the sum of the FDR and TPR ranks.
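The summary statistics named in the caption can be illustrated as follows. The sketch assumes that each simulation run yields a set of rejected taxa and a known set of truly differential taxa. Ties in the rank sum are broken by input order here, which is an assumption rather than something the caption specifies.

```python
def fdp_tpr(rejected, truly_differential):
    """False discovery proportion and true positive rate for a single run."""
    rejected, truth = set(rejected), set(truly_differential)
    fdp = len(rejected - truth) / len(rejected) if rejected else 0.0
    tpr = len(rejected & truth) / len(truth) if truth else 0.0
    return fdp, tpr


def summarize(runs):
    """Average FDP (the empirical FDR) and average TPR over (rejected, truth) pairs."""
    pairs = [fdp_tpr(rejected, truth) for rejected, truth in runs]
    n = len(pairs)
    return sum(p[0] for p in pairs) / n, sum(p[1] for p in pairs) / n


def order_by_rank_sum(scores):
    """Order methods by FDR rank plus TPR rank.

    scores maps method -> (average FDR, average TPR); rank 1 is the
    lowest FDR or the highest TPR.
    """
    by_fdr = sorted(scores, key=lambda m: scores[m][0])
    by_tpr = sorted(scores, key=lambda m: -scores[m][1])
    rank_sum = {m: by_fdr.index(m) + by_tpr.index(m) + 2 for m in scores}
    return sorted(scores, key=lambda m: rank_sum[m])
```

For example, with toy scores `{"A": (0.04, 0.5), "B": (0.30, 0.6), "C": (0.06, 0.2)}` for three hypothetical methods, the rank sums are 3, 4 and 5, so the order is A, B, C.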

Performance on matched-pair data (100 subjects with one pre- and one post-treatment sample each, settings 7 and 9, Figure 3). Overall, we see a deterioration of the FDR control performance for most methods compared with their performance in the replicate sampling setting (Figure 3 versus Figure 2). For both stool and vaginal data, only LinDA, MaAsLin2 and LDM can control the FDR close to the target level across settings. Again, LinDA and MaAsLin2 are more powerful than LDM. Among the other methods, only glmmadaptive can control the FDR within 10%, and only for the stool data with high within-subject correlation; the remaining methods fail to control the FDR properly for both stool and vaginal data.

Figure 3.

False positive control and power under the matched-pair design (balanced change setting, A: stool and B: vaginal). ‘+’ and ‘++’ represent moderate and large effect sizes, respectively. ‘High’ and ‘Low’ within-subject correlations are simulated with Inline graphic= 1 and Inline graphic= 4, respectively. Green, yellow, red and gray colors indicate empirical FDR. The green color indicates that the method controls the FDR at the 5% target level (the 95% confidence interval covers 5%). Yellow, red and gray colors indicate the observed FDR level in (0.05–0.1], (0.1, 0.2] and (0.2, 1], respectively. The length of the bar is proportional to the average TPR and the actual TPR is shown in the bar. FDR and TPR ranks are based on the average FDR and TPR scores across settings. The order of the method is arranged based on the sum of the FDR and TPR ranks.

Performance on longitudinal data (40 subjects each with 5 time points, settings 11–12 and 15–16, Figure 4). We test both the effect of the covariate (X) (settings 11–12, Figure 4A and B) and the time variable (T) (settings 15–16, Figure 4C and D). Overall, we observe a similar trend as in previous settings. However, there are several noticeable differences. When testing the effect of X (Figure 4A and B), for the stool data, LinDA is more powerful than MaAsLin2 when the within-subject correlation is low. For the vaginal data, only MaAsLin2 can control the FDR within 10% across settings, while LinDA shows some FDR inflation (10–20%) when the within-subject correlation is high and the effect size is moderate (‘+’). When testing the effect of T (Figure 4C and D), for both stool and vaginal data, only LinDA can control the FDR to the target level across settings, and this FDR control is not at the expense of power. For MaAsLin2, however, we observe some inflated FDR (>10%) when the within-subject correlation is high. In one setting for vaginal data (effect size ‘+’), the FDR inflation of MaAsLin2 is >20%.

Figure 4.

False positive control and power under the general longitudinal design (balanced change setting). A,B: testing the group (X) effect [A: stool and B: vaginal] and C,D: testing the time (T) effect [C: stool and D: vaginal]. ‘+’ and ‘++’ represent moderate and large effect sizes, respectively. ‘High’ and ‘Low’ within-subject correlations are simulated with Inline graphic= 1 and Inline graphic= 4, respectively. Green, yellow, red and gray colors indicate empirical FDR. The green color indicates that the method controls the FDR at the 5% target level (the 95% confidence interval covers 5%). Yellow, red and gray colors indicate the observed FDR level in (0.05–0.1], (0.1, 0.2] and (0.2, 1], respectively. The length of the bar is proportional to the average TPR and the actual TPR is shown in the bar. FDR and TPR ranks are based on the average FDR and TPR scores across settings. The order of the method is arranged based on the sum of the FDR and TPR ranks.

Performance of DAA-c methods under unbalanced changes

When the differential changes are balanced, the compositional effects are considered to be mild. It is interesting to study the performance of the DAA-c methods when the changes are less balanced (i.e. the direction of change is not random) so that the compositional effects are strong. In this new set of simulations, we let the direction of change be the same for all differential taxa. Although such a setting may not be common in practice, it can be used to test the limits of DAA-c methods in addressing compositional effects. We repeat similar analyses under the replicate sampling (Supplementary Figure S4, settings 4 and 6), matched-pair (Figure 5, settings 8 and 10) and longitudinal (Figure 6, settings 13–14 and 17–18) designs.
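A toy calculation may help to show why same-direction changes produce stronger compositional effects. It assumes absolute abundances are known and that only the differential taxa change. A non-differential taxon then keeps its absolute abundance, so its relative abundance shifts by the ratio of the total abundance before and after the change. The numbers and the function name are illustrative and do not come from the simulation framework of the paper.

```python
import math


def null_taxon_log2_shift(abundances, differential, fold, directions):
    """log2 change in the relative abundance of any non-differential taxon.

    Each differential taxon is multiplied by fold ** direction
    (+1 for an increase, -1 for a decrease, 0 for no change).
    """
    after = list(abundances)
    for idx, direction in zip(differential, directions):
        after[idx] = abundances[idx] * fold ** direction
    # The absolute abundance of a null taxon is unchanged, so its
    # proportion changes by total_before / total_after.
    return math.log2(sum(abundances) / sum(after))


# Toy community: 20 taxa with equal abundance, 2 of them differential (10%).
community = [100.0] * 20
balanced = null_taxon_log2_shift(community, [0, 1], 4.0, [1, -1])
unbalanced = null_taxon_log2_shift(community, [0, 1], 4.0, [1, 1])
```

In this toy case the spurious shift for a null taxon is about -0.15 on the log2 scale under the balanced change and about -0.38 under the unbalanced change, even though no null taxon changed in absolute terms.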

Figure 5.

False positive control and power under the matched-pair design (unbalanced change setting, A: stool and B: vaginal). ‘+’ and ‘++’ represent moderate and large effect sizes, respectively. ‘High’ and ‘Low’ within-subject correlations are simulated with Inline graphic= 1 and Inline graphic= 4, respectively. Green, yellow, red and gray colors indicate empirical FDR. The green color indicates that the method controls the FDR at the 5% target level (the 95% confidence interval covers 5%). Yellow, red and gray colors indicate the observed FDR level in (0.05–0.1], (0.1, 0.2] and (0.2, 1], respectively. The length of the bar is proportional to the average TPR and the actual TPR is shown in the bar. FDR and TPR ranks are based on the average FDR and TPR scores across settings. The order of the method is arranged based on the sum of the FDR and TPR ranks.

Figure 6.

False positive control and power under the general longitudinal design (unbalanced change setting). A,B: testing the group (X) effect [A: stool and B: vaginal] and C,D: testing the time (T) effect [C: stool and D: vaginal]. ‘+’ and ‘++’ represent moderate and large effect sizes, respectively. ‘High’ and ‘Low’ within-subject correlations are simulated with Inline graphic= 1 and Inline graphic= 4, respectively. Green, yellow, red and gray colors indicate empirical FDR. The green color indicates that the method controls the FDR at the 5% target level (the 95% confidence interval covers 5%). Yellow, red and gray colors indicate the observed FDR level in (0.05–0.1], (0.1, 0.2] and (0.2, 1], respectively. The length of the bar is proportional to the average TPR and the actual TPR is shown in the bar. FDR and TPR ranks are based on the average FDR and TPR scores across settings. The order of the method is arranged based on the sum of the FDR and TPR ranks.

Unsurprisingly, compared with their performance in the balanced change scenario, the FDR control for all DAA-c methods becomes much worse in the unbalanced settings. None of the methods, including LinDA, MaAsLin2 and LDM, could control the FDR within 10% across settings (Supplementary Figure S4, Figures 5 and 6). The FDR control performance deteriorates with increasing effect size (‘+’ versus ‘++’), indicating the challenge of DAA in the presence of strong compositional effects. The FDR control is poorer for most methods in testing the treatment effect for matched-pair data (Figure 5) or testing the time effect in longitudinal data (Figure 6C and D). A lower within-subject correlation and a lower microbial diversity (vaginal) also tend to decrease the FDR control performance for many methods.

Among the three best-performing methods in the balanced change settings (LinDA, MaAsLin2 and LDM), LinDA has overall the best FDR control: it can control the FDR within 20% for all settings and within 10% when the within-subject correlation is low. Notably, LinDA can control the FDR to the target level when testing the time effect for the longitudinal data, while other methods fail to control FDR properly. The power of LinDA is also comparable to competing methods.

For MaAsLin2, we used the default total sum scaling (TSS) normalization in the comparison. It is interesting to see whether its FDR control improves with alternative normalization methods. We thus replace the default TSS normalization in MaAsLin2 with Geometric Mean of Pairwise Ratios (GMPR) [57], trimmed mean of M values (TMM) [63] and cumulative sum scaling (CSS) [64]. The FDR control of MaAsLin2 does improve significantly, but it still does not perform as well as LinDA (Supplementary Figure S5).
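To make the contrast between normalization schemes concrete, the following is a simplified sketch of TSS and of a GMPR-style size factor. It follows the general idea of GMPR as we understand it (median count ratio between each pair of samples over taxa that are nonzero in both, then a geometric mean of those medians). The published implementation has further details, such as a minimum number of shared taxa, that are omitted here, so this should not be read as the reference algorithm.

```python
import math
from statistics import median


def tss_normalize(counts):
    """Total sum scaling: convert each sample's counts to proportions."""
    return [[c / sum(sample) for c in sample] for sample in counts]


def gmpr_size_factors(counts):
    """Simplified GMPR-style size factors for a list of samples.

    counts is a list of per-sample count lists with taxa in the same order.
    """
    factors = []
    for i, sample_i in enumerate(counts):
        log_medians = []
        for j, sample_j in enumerate(counts):
            if i == j:
                continue
            # Ratios over taxa observed (nonzero) in both samples.
            ratios = [a / b for a, b in zip(sample_i, sample_j) if a > 0 and b > 0]
            if ratios:
                log_medians.append(math.log(median(ratios)))
        if log_medians:
            factors.append(math.exp(sum(log_medians) / len(log_medians)))
        else:
            factors.append(float("nan"))  # no taxa shared with any other sample
    return factors
```

Counts would then be divided by the size factor of their sample before transformation and modeling. Unlike TSS, a ratio-based factor is less affected when a few abundant taxa change strongly.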

Effect of the sample size and the number of taxa on the performance of DAA-c methods

In practice, many microbiome studies are conducted with small sample sizes. Moreover, in DAA at a higher taxonomic rank, the number of tested taxa may be small. Thus, we want to check how the DAA-c methods perform when the sample size or the number of taxa is small. We use the stool data under the replicate sampling design to study the effect of a small number of samples and taxa.

We first decrease the sample size to 40 (20 subjects with 2 samples each), while the number of taxa to be tested remains at 500. Compared with the results for a sample size of 200 (Figure 2), we observe a significant decrease in FDR control for most methods (Supplementary Figure S6a and b). The count-based methods perform poorly, with both severe FDR inflation and low power. In contrast, the linear model-based methods, MaAsLin2, LinDA and LDM, are more robust to small sample sizes when the changes are more balanced (Supplementary Figure S6a). MaAsLin2 and LinDA are substantially more powerful than LDM. MaAsLin2 has the best FDR control in this scenario, while LinDA has some noted FDR inflation in one setting. These results suggest that simpler models (i.e. linear models) may be preferred over complex models (i.e. count-based generalized linear models) when the sample size is small. When the changes are unbalanced (Supplementary Figure S6b), none of the methods, including LinDA, can control the FDR with adequate power when the within-subject correlation is high and the effect size is large. When the within-subject correlation is low, LinDA excels in both FDR control and power, while other methods perform poorly.

We next decrease the number of taxa to 50 by keeping the most abundant taxa in the analysis (Supplementary Figure S6c and d). Again, most evaluated methods show decreased performance in FDR control, due to the increased compositional effects with a smaller number of taxa. When the changes are balanced (Supplementary Figure S6c), MaAsLin2 and LinDA perform much better than other methods with MaAsLin2 being able to control the FDR to the target level across settings. However, when the changes are unbalanced (Supplementary Figure S6d), MaAsLin2 has a significantly higher FDR than LinDA.

Computational efficiency and performance summary

Computational efficiency is an important factor influencing a user’s choice of methods. Therefore, we compare the computational speeds of the evaluated DAA-c methods based on the simulated data. We find that only 6 out of 11 methods (ZINBMM, NBMM, ZIGMM, GLMMPQL, MaAsLin2, LinDA) can complete the analysis of a moderate-sized microbiome dataset (100 subjects, 2 replicates each, 500 taxa) within 10 min on our computer system (x86_64-pc-linux-gnu (64-bit) Red Hat Enterprise Linux Server 7.9, Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz, 8 GB of memory). LinDA and MaAsLin2 are the fastest methods and can finish the analysis within 1 min (Supplementary Figure S7).

Finally, we summarize the DAA performance using different metrics based on our simulation studies (Figure 7). For each evaluation metric, we classify each method as ‘good’, ‘intermediate’ or ‘poor’ (see legend in Figure 7). Although it is difficult to capture the full complexity of the evaluation based on a crude categorization, the heatmap provides a convenient way to convey the major findings in the simulation studies. It is easy to spot that LinDA, LDM and MaAsLin2 have much better FDR control performance than the other methods. LinDA is the only method that can control FDR reasonably well across settings, while LDM and MaAsLin2 can have poor FDR control when the compositional effects are strong (e.g. unbalanced change settings). Remarkably, LinDA is overall more powerful than competing methods.

Figure 7.

Performance summary of DAA-c methods based on various evaluation metrics. For each metric, the performance is categorized into ‘Good’, ‘Intermediate’ and ‘Poor’. For FDR control, ‘Good’, ‘Intermediate’ and ‘Poor’ represent observed FDR (averaged over two effect sizes and low and high within-subject correlations) in (0, 0.05], (0.05, 0.2] and (0.2, 1], respectively. For power, ‘Good’, ‘Intermediate’ and ‘Poor’ represent the top 1/3, mid 1/3 and bottom 1/3 based on their TPR rank. For computational speed, ‘Good’, ‘Intermediate’ and ‘Poor’ correspond to a computational time (min/run) of < 1, (1–60] and > 60, respectively. LDM currently does not support the general longitudinal design and is colored gray in the corresponding rows in the heatmap.
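The categorization rules in the caption can be written out explicitly. The sketch below is one possible reading of them. How to split the methods into thirds when their number is not divisible by three, and how to classify a run time of exactly 1 min, are not stated in the caption and are assumptions here.

```python
def fdr_category(avg_fdr):
    """Classify an average observed FDR using the Figure 7 cutoffs."""
    if avg_fdr <= 0.05:
        return "Good"
    if avg_fdr <= 0.2:
        return "Intermediate"
    return "Poor"


def speed_category(minutes_per_run):
    """Classify the computational time per run (minutes)."""
    if minutes_per_run < 1:
        return "Good"
    if minutes_per_run <= 60:  # exactly 1 min is treated as Intermediate (assumption)
        return "Intermediate"
    return "Poor"


def power_categories(tpr_by_method):
    """Label methods as top, middle or bottom third by TPR rank."""
    ranked = sorted(tpr_by_method, key=tpr_by_method.get, reverse=True)
    n = len(ranked)
    labels = {}
    for position, method in enumerate(ranked):
        if position < n / 3:
            labels[method] = "Good"
        elif position < 2 * n / 3:
            labels[method] = "Intermediate"
        else:
            labels[method] = "Poor"
    return labels
```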

Discovery patterns on real correlated microbiome datasets

We next apply the evaluated DAA-c methods to three publicly available datasets [6, 61, 65] (‘Method’), which are examples of replicate sampling (‘Smoker2010’), matched-pair (‘Nicholas2013’) and longitudinal (‘IBD2017’) designs. Since the ground truth is unknown for the three real datasets, we aim to assess whether the discovery pattern on the real datasets reflects what we have observed in the simulation study. We first evaluate the FDR control of DAA-c methods under the global null by shuffling the sample labels (1000 times) to disrupt the differential signals. ZIBR currently does not support an unequal number of samples per subject and LDM does not support general longitudinal designs, so they were excluded from the comparison on the third dataset. For the first dataset, the smoking status of the subjects was permuted. For the second dataset, the pre-cleaning and post-cleaning status labels were permuted within each subject. For the third dataset, group labels were shuffled and time points were permuted within the subjects. Any differential taxa identified from the permuted datasets are considered to be false positives. Therefore, if we use the Benjamini–Hochberg FDR control procedure to identify differential taxa at 5% FDR, we expect on average 5% of the permuted datasets to yield any false findings. Consistent with the simulation results, most methods do not perform satisfactorily in controlling for false positives (Figure 8A and B). For the ‘Smoker2010’ dataset, LDM, LinDA, MaAsLin2, GLMMPQL, NBMM and ZINBMM can control the FDR to the 5% target level (Figure 8A) and the percentage of differential taxa ranges from 0 to 12% (median: 0%, Figure 8B) on the permuted datasets. ZIGMM has slight FDR inflation (8%) with the percentage of differential taxa ranging from 0 to 14% (median: 0%). In contrast, all other methods have seriously inflated FDR (>20%), and the percentage of differential taxa ranges from 0 to 20% (median: 3%). For the ‘Nicholas2013’ dataset, LinDA, MaAsLin2 and LDM stand out among their competitors with an observed FDR < 5% and the percentage of differential taxa ranging from 0 to 6% (median: 0%). All other methods fail to control the FDR properly. For the ‘IBD2017’ dataset, only MaAsLin2 and LinDA control the FDR around 5% for testing both the group effect and the group-time interaction effect.
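The three permutation schemes described above differ in the unit that is shuffled, which is what preserves the within-subject correlation structure. A minimal sketch under assumed data layouts (dictionaries keyed by a subject identifier, with hypothetical function names) could look like this:

```python
import random


def permute_subject_labels(subject_label, rng):
    """Shuffle a subject-level label (e.g. smoking status or group) across subjects.

    All samples of a subject keep sharing one label after permutation.
    """
    subjects = list(subject_label)
    labels = [subject_label[s] for s in subjects]
    rng.shuffle(labels)
    return dict(zip(subjects, labels))


def permute_within_pairs(pairs, rng):
    """Randomly swap the pre/post labels within each subject's pair of samples."""
    permuted = {}
    for subject, (pre, post) in pairs.items():
        permuted[subject] = (post, pre) if rng.random() < 0.5 else (pre, post)
    return permuted


def permute_time_within_subject(times_by_subject, rng):
    """Shuffle the time points among the samples of each subject."""
    permuted = {}
    for subject, times in times_by_subject.items():
        shuffled = list(times)
        rng.shuffle(shuffled)
        permuted[subject] = shuffled
    return permuted
```

Repeating such a permutation 1000 times with `rng = random.Random(seed)` and rerunning each method gives the null distribution of the number of discoveries summarized in Figure 8A and B.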

Figure 8.

Evaluation of DAA-c methods based on three experimental datasets. A,B. Performance evaluation under the global null setting. A. Performance is assessed by the observed FDR level, calculated as the percentage of the 1000 permuted datasets with any false discoveries. B. Boxplot showing the number of differential taxa at 5% FDR (left y-axis) and the percentage of differential taxa (right y-axis) based on 1000 permuted datasets. C. Overlaps of significant taxa (5% FDR) between DAA-c methods on the real datasets. Set size means the total number of differential taxa discovered by each method. Intersection size means the number of differential taxa commonly found by the methods indicated by the black dots.

Next, we compare the numbers of differential taxa identified at 5% FDR by the DAA-c methods and study their overlaps based on the original real datasets (Figure 8C). The results at 10% FDR can be found in Supplementary Figure S8. As expected, methods that do not control for false positives tend to find more differential taxa (horizontal bars, Figure 8C, with one exception), many of which are unique to themselves (right side of the top vertical bars, Figure 8C). For the ‘Smoker2010’ dataset, LinDA identifies 2 OTUs associated with smoking, while the other methods that control for false positives identify one OTU or none. For the ‘Nicholas2013’ dataset, LDM, LinDA and MaAsLin2, the three methods that have good false positive control on the permuted datasets, do not identify any significant genera. Although other methods identify a few, the large FDR inflation on the permuted datasets casts doubt on the credibility of the identified genera. For the ‘IBD2017’ dataset, when testing the group effect (ICDr versus control), MaAsLin2 and LinDA identify more OTUs than the other methods, indicating that their well-controlled FDR may not come at the expense of power. It is well known that the gut microbiome of IBD patients is very distinguishable from that of healthy controls [66], supporting the findings of MaAsLin2 and LinDA. However, when testing the interaction between group and time, neither MaAsLin2 nor LinDA identifies any significant interactions. Given that interaction detection usually requires a large sample size, such negative findings are not surprising. Overall, the detection patterns on the real datasets are consistent with those from the simulation study.
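The overlap summary can be sketched as a count of taxa per exact combination of methods. This assumes the intersections in Figure 8C are exclusive, as in a standard UpSet-style plot, which the caption does not state explicitly. The method and taxon names below are placeholders.

```python
def exclusive_intersections(discoveries):
    """Count taxa by the exact set of methods that identified them.

    discoveries maps method name -> set of significant taxa.
    Returns a dict mapping frozenset of methods -> number of taxa.
    """
    membership = {}
    for method, taxa in discoveries.items():
        for taxon in taxa:
            membership.setdefault(taxon, set()).add(method)
    counts = {}
    for methods in membership.values():
        key = frozenset(methods)
        counts[key] = counts.get(key, 0) + 1
    return counts


def set_sizes(discoveries):
    """Total number of significant taxa per method (the horizontal bars)."""
    return {method: len(taxa) for method, taxa in discoveries.items()}
```

Taxa found by a single method appear under a one-element key, which corresponds to the discoveries described in the text as unique to that method.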

Discussion

DAA is one of the most fundamental statistical tasks in microbiome data analysis. Given the rising popularity of complex study designs involving temporal [67], spatial [68] and repeated sampling [4] of the microbiome, statistical tools that could properly address the correlation structure in microbiome data are much needed. Although there are several tools for DAA of correlated microbiome data (DAA-c) [26–29, 36–38], their performance has not been evaluated independently by a large-scale benchmarking study. It is unclear whether these methods can control for false positives while retaining sufficient power for real microbiome datasets under diverse settings. In this study, we thus conducted a large-scale simulation study to objectively evaluate the performance of the major existing DAA-c methods under a wide range of settings. We aim to identify and recommend the most robust DAA-c tool to the field. To achieve this end, we designed a semiparametric simulation framework for realistic microbiome data generation in an extension of our previous work for independent data [62]. The proposed simulation framework circumvents the difficulty in properly modeling the zero-inflated, highly skewed abundance distribution by drawing random samples from a large reference dataset. Covariate and confounder effects are added parametrically to generate correlated microbiome data. We show that the generated data capture the basic characteristics of microbiome data [62] and thus are more suitable for benchmarking DAA-c methods than parametric model-simulated data.

Based on the proposed simulation framework, we performed the evaluation covering both low- and high-diversity microbial communities. We show that the FDR control performance of the evaluated DAA-c methods varies tremendously, and most DAA-c methods are still not satisfactory. In fact, none of the evaluated methods could control the FDR to the target level across all settings. FDR control is more difficult for a low-diversity community such as the vaginal microbiome since the data are much sparser. FDR control also deteriorates as the effect size becomes stronger and the direction of change becomes less balanced, since these conditions result in stronger compositional effects. Methods that model the count distribution using the negative binomial (NB) or zero-inflated negative binomial distribution (GLMMPQL, glmmadaptive, NBMM, ZINBMM, glmernb, glmmTMB) tend to have severe FDR inflation even when the compositional effects are small, indicating that the assumed count model may still not capture the abundance variation adequately [69, 70]. Their performance worsens as the sample size becomes smaller, probably because the asymptotic distribution of the test statistic, on which the P-value calculation depends, is not accurate for small sample sizes. These methods tend to perform less well when the within-subject correlation is lower, indicating that they probably overfit the correlations. For ZIBR, which models the proportion data using a zero-inflated beta distribution, large FDR inflation was observed in most settings and the inflation became more serious when the within-subject correlation was higher, indicating that the zero-inflated beta distribution may not fit the data well either. In contrast, the methods based on data transformation and linear models (LinDA, MaAsLin2 and LDM) are more robust and have much better FDR control than the methods that model counts and proportions. When the changes are more balanced, they can control the FDR close to the target level. In terms of power, MaAsLin2 and LinDA are more powerful than LDM, especially when the within-subject correlation is low. However, when the changes are less balanced, the FDR control of MaAsLin2 and LDM deteriorates since they do not specifically address the compositional effects other than by using robust normalization factors. In contrast, LinDA, which corrects the bias due to compositional effects, has substantially better FDR control than MaAsLin2 and LDM when the compositional effect is strong. The improved FDR control does not come at an excessive cost in power. However, FDR inflation was still noted for LinDA when the within-subject correlation was high and the sample size was small. We summarize the pros and cons of the evaluated methods in Supplementary Table S3.

Based on the evaluation, we find that LinDA has the best trade-off between FDR control and power across settings. Due to potential FDR inflation under some settings, we highly recommend using a standard FDR level such as 5% and resisting the temptation to raise the level to find ‘signals’. As an alternative to LinDA, MaAsLin2 can also be applied when the compositional effects are moderate. However, some diagnostics for compositional effects are needed if MaAsLin2 is chosen to perform DAA. The magnitude of compositional effects can be detected by an effect size plot for those differential taxa. If there are many taxa with the same direction of change, applying MaAsLin2 is not recommended. Otherwise, MaAsLin2 can be used due to its power advantage.
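As a numerical companion to the effect size plot suggested above, one could compute how one-sided the estimated effects of the differential taxa are. The sketch below is only an illustration. It assumes signed effect size estimates (e.g. log fold changes) are available for the taxa called differential, and it does not come with a validated cutoff, so any threshold applied to its output would be a judgment call.

```python
def direction_imbalance(effect_sizes):
    """Fraction of nonzero effects that share the majority sign.

    A value near 0.5 suggests balanced changes; a value near 1.0 suggests
    that most differential taxa change in the same direction.
    """
    positive = sum(1 for e in effect_sizes if e > 0)
    negative = sum(1 for e in effect_sizes if e < 0)
    total = positive + negative
    if total == 0:
        return float("nan")  # no nonzero effects to assess
    return max(positive, negative) / total
```

For example, `direction_imbalance([1.2, 0.8, 2.0, -0.5])` returns 0.75, meaning three of the four differential taxa move in the same direction.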

It is well known that there are strong interactions among microbial species in the microbial ecosystem [71]. Such interactions could lead to positive or negative correlations between taxa, adding another layer of complexity in addition to the sample-wise correlations induced by study designs. All the methods evaluated, however, treat the taxa independently and do not exploit such between-taxa correlations. Ignoring this aspect in the analysis could lead to power loss or lack of interpretation.

Finally, we comment that there is still plenty of room for novel methodological development for DAA of correlated microbiome data. Our simulation framework can be easily re-used to evaluate the performance of new methods.

Key Points

  • We have benchmarked the performance of the major existing DAA-c methods using a semiparametric simulation framework, which is capable of generating realistic microbiome data with specific correlation structures.

  • Our evaluation study shows that none of the evaluated methods are optimal across settings and the best performing method depends on the biological truth and data characteristics.

  • Overall, the LinDA method has the best tradeoff between false positive control and power, and is the only method that controls the FDR close to the target level under strong compositional effects.

Supplementary Material

Supplementary_Files_bbac607

Acknowledgements

We thank Dr Zhigang Li for the helpful discussions and suggestions.

Lu Yang is a Postdoctoral Research Fellow in the Department of Quantitative Health Sciences at Mayo Clinic. Her research interests include bioinformatics and biostatistics.

Jun Chen is an Associate Professor of Biostatistics in the Department of Quantitative Health Sciences at Mayo Clinic. His work focuses on the development and application of powerful and robust statistical methods for high-dimensional omics data.

Contributor Information

Lu Yang, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55901, USA.

Jun Chen, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55901, USA.

Funding

Center for Individualized Medicine at Mayo Clinic, National Institutes of Health [R21 HG011662, R01 GM144351 to J.C.] and National Science Foundation [DMS 2113360 to J.C.].

Data and code availability

The datasets and code supporting the conclusions of this article are available in the https://github.com/chloelulu/DAA-c repository. The semiparametric simulation approach is implemented as the ‘SimulateMSeqC’ function in the CRAN GUniFrac package (https://CRAN.R-project.org/package=GUniFrac). All analyses were performed in R v4.0.3 on an x86_64-pc-linux-gnu (64-bit) Red Hat Enterprise Linux Server 7.9 at Mayo Clinic.

References

  • 1. Cho I, Blaser MJ. The human microbiome: at the interface of health and disease. Nat Rev Genet 2012;13:260–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Lugo-Martinez J, Ruiz-Perez D, Narasimhan G, et al. Dynamic interaction network inference from longitudinal microbiome data. Microbiome 2019;7:54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Ma T, Villot C, Renaud D, et al. Linking perturbations to temporal changes in diversity, stability, and compositions of neonatal calf gut microbiota: prediction of diarrhea. ISME J 2020;14:2223–35. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Edwinson AL, Yang L, Peters S, et al. Gut microbial beta-glucuronidases regulate host luminal proteases and are depleted in irritable bowel syndrome. Nat Microbiol 2022;7:680–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Proctor LM, Creasy HH, Fettweis JM, et al. The integrative human microbiome project. Nature 2019;569:641–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Bokulich NA, Mills DA, Underwood MA. Surface microbes in the neonatal intensive care unit: changes with routine cleaning and over time. J Clin Microbiol 2013;51:2617–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Zhou YL, Xu ZJZ, He Y, et al. Gut microbiota offers universal biomarkers across ethnicity in inflammatory bowel disease diagnosis and infliximab response prediction. mSystems 2018;3:e00188–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Kuczynski J, Lauber CL, Walters WA, et al. Experimental and analytical tools for studying the human microbiome. Nat Rev Genet 2012;13:47–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Callahan BJ, McMurdie PJ, Holmes SP. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J 2017;11:2639–43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Pan AY. Statistical analysis of microbiome data: the challenge of sparsity. Curr Opin Endocr Metab Res 2021;19:35–40. [Google Scholar]
  • 11. Silverman JD, Roche K, Mukherjee S, et al. Naught all zeros in sequence count data are the same. Comput Struct Biotechnol J 2020;18:2789–98. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Kaul A, Mandal S, Davidov O, et al. Analysis of microbiome data in the presence of excess zeros. Front Microbiol 2017;8:2114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Li HZ. Microbiome, metagenomics, and high-dimensional compositional data analysis. Annu Rev Stat Appl 2015;2:73–94. [Google Scholar]
  • 14. Weiss S, Xu ZZ, Peddada S, et al. Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome 2017;5:27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Morton JT, Marotz C, Washburne A, et al. Establishing microbial composition measurement standards with reference frames. Nat Commun 2019;10:2719. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Xiao J, Chen L, Johnson S, et al. Predictive Modeling of microbiome data using a phylogeny-regularized generalized linear mixed model. Front Microbiol 2018;9:1391. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Chen J, Bushman FD, Lewis JD, et al. Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis. Biostatistics 2013;14:244–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Aitchison J. The statistical-analysis of compositional data. J Roy Stat Soc B Met 1982;44:139–60. [Google Scholar]
  • 19. Gloor G. ALDEx2: ANOVA-like differential expression tool for compositional data. ALDEX Manual Modular 2015;20:1–11. [Google Scholar]
  • 20. Warton DI, Hui FKC. The arcsine is asinine: the analysis of proportions in ecology. Ecology 2011;92:3–10. [DOI] [PubMed] [Google Scholar]
  • 21. Bokulich NA, Dillon MR, Zhang YL, et al. q2-longitudinal: longitudinal and paired-sample analyses of microbiome data. mSystems 2018;3:e00219–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Wang J, Kalyan S, Steck N, et al. Analysis of intestinal microbiota in hybrid house mice reveals evolutionary divergence in a vertebrate hologenome. Nat Commun 2015;6:6440.
  • 23. Benson AK, Kelly SA, Legge R, et al. Individuality in gut microbiota composition is a complex polygenic trait shaped by multiple environmental and host genetic factors. Proc Natl Acad Sci U S A 2010;107:18933–8.
  • 24. Mallick H, Rahnavard A, McIver LJ, et al. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput Biol 2021;17:e1009442.
  • 25. Gloor GB, Macklaim JM, Pawlowsky-Glahn V, et al. Microbiome datasets are compositional: and this is not optional. Front Microbiol 2017;8:2224.
  • 26. Zhou H, He K, Chen J, et al. LinDA: linear models for differential abundance analysis of microbiome compositional data. Genome Biol 2022;23:95.
  • 27. Zhang XY, Guo BY, Yi NJ. Zero-inflated Gaussian mixed models for analyzing longitudinal microbiome data. PLoS One 2020;15:e0242073.
  • 28. Zhu Z, Satten GA, Mitchell C, et al. Constraining PERMANOVA and LDM to within-set comparisons by projection improves the efficiency of analyses of matched sets of microbiome data. Microbiome 2021;9:133.
  • 29. Chen EZ, Li HZ. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics 2016;32:2611–7.
  • 30. McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol 2014;10:e1003531.
  • 31. Chen J, King E, Deek R, et al. An omnibus test for differential distribution analysis of microbiome sequencing data. Bioinformatics 2018;34:643–51.
  • 32. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. J Stat Softw 2015;67:1–48.
  • 33. Walther-Antonio MRS, Chen J, Multinu F, et al. Potential contribution of the uterine microbiome in the development of endometrial cancer. Genome Med 2016;8:122.
  • 34. Vandeputte D, De Commer L, Tito RY, et al. Temporal variability in quantitative human gut microbiome profiles and implications for clinical research. Nat Commun 2021;12:6740.
  • 35. Nishiwaki H, Hamaguchi T, Ito M, et al. Short-chain fatty acid-producing gut microbiota is decreased in Parkinson’s disease but not in rapid-eye-movement sleep behavior disorder. mSystems 2020;5:e00797-20.
  • 36. Zhang XY, Mallick H, Tang ZX, et al. Negative binomial mixed models for analyzing microbiome count data. BMC Bioinformatics 2017;18:4.
  • 37. Zhang XY, Yi NJ. Fast zero-inflated negative binomial mixed modeling approach for analyzing longitudinal metagenomics data. Bioinformatics 2020;36:2345–51.
  • 38. Rizopoulos D. GLMMadaptive: generalized linear mixed models using adaptive Gaussian quadrature. 2022; https://drizopoulos.github.io/GLMMadaptive/, https://github.com/drizopoulos/GLMMadaptive.
  • 39. Brooks ME, Kristensen K, van Benthem KJ, et al. glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. The R Journal 2017;9:378–400.
  • 40. Vatanen T, Franzosa EA, Schwager R, et al. The human gut microbiome in early-onset type 1 diabetes from the TEDDY study. Nature 2018;562:589–94.
  • 41. Morgan XC, Kabakchiev B, Waldron L, et al. Associations between host gene expression, the mucosal microbiome, and clinical outcome in the pelvic pouch of patients with inflammatory bowel disease. Genome Biol 2015;16:67.
  • 42. Lin H, Peddada SD. Analysis of microbial compositions: a review of normalization and differential abundance analysis. NPJ Biofilms Microbiomes 2020;6:1–13.
  • 43. Weiss SJ, Xu Z, Amir A, et al. Effects of library size variance, sparsity, and compositionality on the analysis of microbiome data. PeerJ PrePrints 2015;3:e1157.
  • 44. Dennis B, Ponciano JM, Taper ML. Replicated sampling increases efficiency in monitoring biological populations. Ecology 2010;91:610–20.
  • 45. Zhou XY, Singh S, Baumann R, et al. Household paired design reduces variance and increases power in multi-city gut microbiome study in multiple sclerosis. Mult Scler J 2021;27:366–79.
  • 46. Faust K, Lahti L, Gonze D, et al. Metagenomics meets time series analysis: unraveling microbial community dynamics. Curr Opin Microbiol 2015;25:56–66.
  • 47. Benjamini Y, Hochberg Y. Controlling the false discovery rate – a practical and powerful approach to multiple testing. J R Stat Soc B 1995;57:289–300.
  • 48. La Rosa PS, Brooks JP, Deych E, et al. Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS One 2012;7:e52078.
  • 49. Chen J, Li HZ. Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. Ann Appl Stat 2013;7:418–42.
  • 50. Hawinkel S, Mattiello F, Bijnens L, et al. A broken promise: microbiome differential abundance methods do not control the false discovery rate. Brief Bioinform 2019;20:210–21.
  • 51. McDonald D, Hyde E, Debelius JW, et al. American gut: an open platform for citizen science microbiome research. mSystems 2018;3.
  • 52. Vujkovic-Cvijin I, Sklar J, Jiang LJ, et al. Host variables confound gut microbiota studies of human disease. Nature 2020;587:448–54.
  • 53. Galazzo G, Best N, Bervoets L, et al. Development of the microbiota and associations with birth mode, diet, and atopic disorders in a longitudinal analysis of stool samples, collected from infancy through early childhood. Gastroenterology 2020;158:1584–96.
  • 54. Brooks ME, Kristensen K, Benthem KJ, et al. glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. R J 2017;9:378–400.
  • 55. Venables WN, Ripley BD. Modern Applied Statistics with S, 4th edn. New York: Springer, 2002.
  • 56. Hu YJ, Satten GA. Testing hypotheses about the microbiome using the linear decomposition model (LDM). Bioinformatics 2020;36:4106–15.
  • 57. Chen L, Reeve J, Zhang LJ, et al. GMPR: a robust normalization method for zero-inflated count data with application to microbiome sequencing data. PeerJ 2018;6:e4600.
  • 58. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat 2001;29:1165–88.
  • 59. Charlson ES, Chen J, Custers-Allen R, et al. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS One 2010;5:e15216.
  • 60. Gonzalez A, Navas-Molina JA, Kosciolek T, et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nat Methods 2018;15:796–8.
  • 61. Halfvarson J, Brislawn CJ, Lamendella R, et al. Dynamics of the human gut microbiome in inflammatory bowel disease. Nat Microbiol 2017;2:17004.
  • 62. Yang L, Chen J. A comprehensive evaluation of microbial differential abundance analysis methods: current status and potential solutions. Microbiome 2022;10:1–23.
  • 63. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010;11:R25.
  • 64. Paulson JN, Stine OC, Bravo HC, et al. Differential abundance analysis for microbial marker-gene surveys. Nat Methods 2013;10:1200–2.
  • 65. Goodrich JK, Waters JL, Poole AC, et al. Human genetics shape the gut microbiome. Cell 2014;159:789–99.
  • 66. Willing BP, Dicksved J, Halfvarson J, et al. A pyrosequencing study in twins shows that gastrointestinal microbial profiles vary with inflammatory bowel disease phenotypes. Gastroenterology 2010;139:1844–1854.e1.
  • 67. Stewart CJ, Ajami NJ, O'Brien JL, et al. Temporal development of the gut microbiome in early childhood from the TEDDY study. Nature 2018;562:583–8.
  • 68. Duncan K, Carey-Ewend K, Vaishnava S. Spatial analysis of gut microbiome reveals a distinct ecological niche associated with the mucus layer. Gut Microbes 2021;13:1874815.
  • 69. Hawinkel S, Rayner JCW, Bijnens L, et al. Sequence count data are poorly fit by the negative binomial distribution. PLoS One 2020;15:e0224909.
  • 70. Li YM, Ge XZ, Peng F, et al. Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biol 2022;23:79.
  • 71. Faust K, Raes J. Microbial interactions: from networks to models. Nat Rev Microbiol 2012;10:538–50.

Associated Data


Supplementary Materials

Supplementary_Files_bbac607

Data Availability Statement

The datasets and code supporting the conclusions of this article are available in the https://github.com/chloelulu/DAA-c repository. The semiparametric simulation approach is implemented as the ‘SimulateMSeqC’ function in the CRAN GUniFrac package (https://CRAN.R-project.org/package=GUniFrac). All analyses were performed in R v4.0.3 on an x86_64-pc-linux-gnu (64-bit) Red Hat Enterprise Linux Server 7.9 at Mayo Clinic.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
