Author manuscript; available in PMC: 2020 Sep 24.
Published in final edited form as: Ann Appl Stat. 2020 Apr 16;14(1):94–115. doi: 10.1214/19-aoas1283

MODELING MICROBIAL ABUNDANCES AND DYSBIOSIS WITH BETA-BINOMIAL REGRESSION

Bryan D Martin 1, Daniela Witten 2, Amy D Willis 3
PMCID: PMC7514055  NIHMSID: NIHMS1607085  PMID: 32983313

Abstract

Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon’s relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon’s relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon’s counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.

Keywords: Relative abundance, microbiome, correlated data, overdispersion, high throughput sequencing, beta-binomial

1. Introduction

Estimating the proportion of a population that belongs to a certain category—the relative abundance—is a problem spanning fields as broad as social science, population health and ecology. For example, researchers may be interested in estimating the proportion of low-income students who attend competitive higher-education institutions (Bastedo and Jaquette (2011)), child mortality rates in Sub-Saharan African regions (Mercer et al. (2015)), or the proportion of diseased leaf tissue in coastal grasslands (Parker et al. (2015)). In most of these settings, it is not possible to sample the entire population of interest, and it is necessary to estimate the true proportion based on a sample of individuals from the population. In this paper we consider the general problem of estimating the prevalence of a category within a population when the category labels of the observed individuals may be correlated.

While this problem is of broad interest, our method is particularly motivated by the ever-increasing number of studies of microbiomes. A microbiome is the collection of microscopic organisms (microbes), along with their genes and metabolites, that inhabit an ecological niche (Poussin et al. (2018)). Microbes live on and in the human body, and in fact, microbial cells may outnumber human cells (Sender, Fuchs and Milo (2016)). Because of this, the relative abundance of a microbe—or a taxon, which refers to a biological grouping of microbes—is a common marker of host or environmental health. For example, the species G. vaginalis has been found to correlate with symptomatic bacterial vaginosis (Callahan et al. (2017)); different genera of Cyanobacteria flourish in response to precipitation and irrigation runoff (Tromas et al. (2018)); and Parkinson’s disease has been associated with reduced levels of the family Prevotellaceae (Hill-Burns et al. (2017)). Accurate and precise estimation of microbial abundances is critical for disease diagnosis and treatment (Qin et al. (2014), Grice (2014), Gevers et al. (2014), Shi et al. (2015)).

A particularly challenging aspect of estimating microbial abundances is that the category labels of microbes are known to be correlated. Microbial communities are spatially organized, with a member of one taxon more likely to be observed close to the same taxon than close to a different taxon (Welch et al. (2016)). In this paper we argue that a correlated-taxon model is a natural approach to estimating relative abundances in this setting. It successfully explains the large number of unobserved taxa in many samples, as well as overdispersion in the abundance of observed taxa relative to models where the occurrences of individual microbes are uncorrelated.

An additional advantage of our method is that it provides a statistical framework for testing for dysbiosis. Dysbiosis describes a microbial imbalance, or a deviation from a healthy microbiome (Petersen and Round (2014), Hooks and O’Malley (2017)). In particular, the term is often used to refer to a change in the stability of a microbiome. For example, inflammatory bowel disease (IBD) has been associated with increases in the variability of the gut microbiome (Halfvarson et al. (2017)), and the microbiomes of IBD patients are often referred to as dysbiotic (Tamboli et al. (2004)). Unlike many methods for modeling relative abundances of microbial taxa, the method that we propose provides a natural framework for hypothesis testing for dysbiosis via the parameters of a heteroskedastic model for taxon abundances. Specifically, we can test whether the variability in a taxon’s counts is associated with some covariate of interest.

Our paper is laid out as follows. In Section 2, we review several existing regression models for microbial abundances. In Section 3, we propose our model, and discuss parameter estimation. We propose approaches for testing for differential abundance and differential variability in Section 4. In Section 5, we show via simulation that our hypothesis testing framework is valid, even with small sample sizes. We apply our method to data from a soil microbiome study in Section 6, and we close with a discussion of our method in Section 7. Software for implementing our model and hypothesis testing procedures is available in the R package corncob, available at github.com/bryandmartin/corncob and provided in Supplement A of the Supplementary Material (Martin, Witten and Willis (2020a)).

2. Literature review

Modeling of population proportions, or relative abundances, has a long history in the statistical literature, and includes basic methods such as z-tests for proportions, and logistic regression. However, modeling microbial abundance data brings with it a number of challenges. For example, the dynamic nature of the microbiome commonly gives rise to a large number of microbial taxa that are only present in a small number of samples, but are highly abundant when present (DiGiulio et al. (2015), Dethlefsen and Relman (2011)). Some microbes may be so rare that they consistently evade detection or are observed at low abundances in all samples (Sogin et al. (2006)). In addition, the number of taxa (typically on the order of thousands) is generally substantially greater than the number of samples (typically fewer than one hundred). Finally, the number of counts that are observed in each sample may differ substantially, and thus the amount of information contained in each sample may differ.

Thus, we focus our literature review on models for microbial abundances. We broadly categorize these models into two approaches: jointly modeling multiple taxa, and modeling each taxon individually. While our proposal pertains to the latter, both approaches are common and each has its advantages and disadvantages, which we now review.

Jointly modeling multiple taxa is a popular approach because it represents the entire microbial community with a single model. However, since these communities are often very diverse (the total number of taxa is large), and different taxa exhibit differing levels of variability, a large number of parameters is typically needed to obtain a good model fit (Kurtz et al. (2015), Sankaran and Holmes (2017)). Hierarchical models of absolute abundances are often used to constrain the number of parameters (e.g., La Rosa et al. (2012), Holmes, Harris and Quince (2012), Chen and Li (2013), Sankaran and Holmes (2017), Cao, Zhang and Li (2017)). However, modeling the variance structure is challenging with few parameters (Sankaran and Holmes (2017)). Many joint taxon models make use of the log-ratio or centered log-ratio transformations to model relative abundances. However, these approaches typically cannot be applied to zero-valued observations (Aitchison (1986), McMurdie and Holmes (2014), Willis and Martin (2018)). Since many taxa are typically unobserved in each sample, these methods commonly make use of pseudo-counts to replace zeros, or incorporate a zero-inflation component into their model (Xia et al. (2013), Mandal et al. (2015), Li et al. (2018), Willis and Martin (2018)). In the case of pseudo-counts, parameter estimation depends on an arbitrarily chosen hyperparameter, while zero-inflated models may lack interpretability.

Because simultaneously modeling large numbers of microbial taxa is challenging, an alternative approach is to model individual taxa one-by-one. We further classify individual taxon models into models for observed relative abundances (the proportion of the observed counts that corresponds to the specific taxon), and models for absolute abundances (the number of observed counts of the taxon). A particularly common model for observed relative abundances is the beta distribution, which is a natural choice since it is supported on (0, 1). Zero-inflated beta regression models have been proposed to account for the large number of zeros often observed in microbial abundance data, corresponding to the absence of a taxon in a sample (Peng, Li and Liu (2016), Chen and Li (2016), Chai et al. (2018)). Nonparametric models for observed relative abundances (White, Nagarajan and Pop (2009), Segata et al. (2011)) and Gaussian models for transformed observed relative abundances (Morgan et al. (2012, 2015)) have also been proposed.

Another option is to model the absolute abundance of a taxon. Popular methods originally designed for RNAseq data, such as DESeq2 (Love, Huber and Anders (2014)) and edgeR (Robinson, McCarthy and Smyth (2010)), make use of the negative binomial distribution. These models can be extended with random effects and a zero-inflation component to account for correlation across subjects and to model additional overdispersion of the counts (Zhang et al. (2017), Fang et al. (2016)). Alternative approaches to modeling absolute abundances include the use of transformations such as cumulative sum scaling (Wahba et al. (1995), Paulson et al. (2013)), trimmed mean of M-values (Robinson and Oshlack (2010), Law et al. (2014)) and ratio approaches (Sohn, Du and An (2015), Chen et al. (2018)).

All of the papers mentioned thus far focus on an association between mean abundance and covariates. In this paper, we propose a beta-binomial regression model for microbial taxon abundances. To the best of our knowledge, this is the first regression model that allows for an association between the variance of a taxon’s abundance and covariates, rather than only an association between the mean abundance and covariates. In addition, our model can accommodate the absence of a taxon in samples, variability in the total number of counts across samples and high variability in the observed relative abundances.

3. The beta-binomial regression model

3.1. A hierarchical model for microbial abundances

In this section, we present a beta-binomial regression model for microbial abundance data. While the beta-binomial model has been extensively studied in the statistics literature (Skellam (1948), Kleinman (1973), Williams (1975), Prentice (1986), McCullagh and Nelder (1989), Aerts et al. (2002), Dolzhenko and Smith (2014), Wagner, Riggs and Mikulich-Gilbertson (2015)), to our knowledge, we are the first to propose a regression framework that can link both discrete and continuous covariates to both a relative abundance parameter and a correlation/overdispersion parameter, as well as the first to apply this model to the analysis of microbial data. We summarize the notation introduced in this section in Table 1.

Table 1.

The notation for the observed random variables, latent random variables and parameters of our proposed beta-binomial model. The subscript i refers to the ith sample. Notation is defined for each taxon.

Notation    Definition

Yi,j        indicator that the jth read corresponds to the taxon of interest
Wi          observed counts, or observed absolute abundance, of the taxon of interest
Mi          sequencing depth, or total number of counts, across all taxa
Wi/Mi       observed relative abundance of the taxon of interest
Zi          latent relative abundance of the taxon of interest
μi          expected relative abundance of the taxon of interest
ϕi          overdispersion, or within-sample correlation of the taxon of interest

Suppose we have n samples of microbial communities, indexed by i = 1, …, n. Let Mi be the sequencing depth, or the number of total counts (or reads) across all taxa, in the ith sample. Let Yi,j for j = 1, …, Mi be an indicator that the jth read corresponds to the taxon of interest. Therefore, $W_i = \sum_{j=1}^{M_i} Y_{i,j}$ is the observed absolute abundance of the taxon of interest in the ith sample.

It is natural to consider the model

$$W_i \mid (Z_i, M_i) \sim \mathrm{Binomial}(M_i, Z_i), \tag{3.1}$$

and to perform inference on Zi, where Zi is the probability of observing the taxon of interest in the ith sample. However, this model is insufficiently flexible to model microbial abundance data. For example, Figure 1 (left) shows 95% prediction intervals from a binomial model fit to the relative abundance of a strain of Rhizobium in 16 experimental replicates of sampling microbes in soil (see Section 6 for details). We see that the data are substantially overdispersed relative to the binomial model, which provides a very poor fit (see McMurdie and Holmes (2014) for further discussion on overdispersion of microbial abundance data).

Fig. 1.

The relative abundance of a strain of Rhizobium in 16 biological replicate samples in a soil microbiology study, and 95% prediction intervals based on a binomial model (left) and the proposed beta-binomial model (right). The data are clearly overdispersed relative to the binomial model, motivating the development of our beta-binomial model.

The overdispersion of the observed relative abundances compared to a binomial model motivates a more flexible model. We propose the following model:

$$W_i \mid (Z_i, M_i) \sim \mathrm{Binomial}(M_i, Z_i), \tag{3.2}$$
$$Z_i \sim \mathrm{Beta}(a_{1,i}, a_{2,i}), \tag{3.3}$$

where $a_{1,i} \in \mathbb{R}^+$ and $a_{2,i} \in \mathbb{R}^+$. In the model (3.2) and (3.3), Zi is itself a random variable, representing the latent relative abundance of the taxon. As we will demonstrate, this hierarchical approach to modeling relative abundance is a major advantage of our approach.

Using the parameterization

$$\mu_i = \frac{a_{1,i}}{a_{1,i} + a_{2,i}}, \tag{3.4}$$

it can be shown that

$$E(W_i \mid M_i) = M_i \times E(Z_i) = M_i \times \mu_i. \tag{3.5}$$

Thus μi ∈ (0, 1) is the expected relative abundance of the taxon in the ith sample. In addition, using the parameterization

$$\phi_i = \frac{1}{a_{1,i} + a_{2,i} + 1}, \tag{3.6}$$

it can be shown that

$$\mathrm{Var}(W_i \mid M_i) = M_i \times \mu_i \times (1 - \mu_i) \times (1 + (M_i - 1) \times \phi_i). \tag{3.7}$$

The multiplicative factor (1 + (Mi − 1) × ϕi) is therefore the overdispersion of the absolute abundance of the taxon for the ith sample relative to a binomial random variable. Furthermore,

$$\mathrm{Corr}(Y_{i,j}, Y_{i,j^*}) = \phi_i \quad \text{for } 1 \le j < j^* \le M_i, \tag{3.8}$$

so ϕi can also be interpreted as the correlation between the taxon indicator variables within the ith sample (Prentice (1986)).
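As an illustration of the parameterization in (3.4), (3.6) and the moments in (3.5) and (3.7), the following Python sketch converts a pair (μ, ϕ) into beta shape parameters, evaluates the mean and variance of the count, and draws from the hierarchical model. The function names and the numerical values used with them are assumptions made for this illustration; they are not part of corncob.

```python
import random

def beta_params(mu, phi):
    """Convert (mu, phi) to beta shape parameters (a1, a2) using (3.4) and (3.6)."""
    total = 1.0 / phi - 1.0  # a1 + a2, since phi = 1 / (a1 + a2 + 1)
    return mu * total, (1.0 - mu) * total

def bb_mean_var(m, mu, phi):
    """Mean and variance of W given sequencing depth m, as in (3.5) and (3.7)."""
    mean = m * mu
    var = m * mu * (1.0 - mu) * (1.0 + (m - 1) * phi)
    return mean, var

def simulate_bb(m, mu, phi, rng):
    """Draw one count: Z ~ Beta(a1, a2), then W | Z ~ Binomial(m, Z)."""
    a1, a2 = beta_params(mu, phi)
    z = rng.betavariate(a1, a2)
    # Sum of m Bernoulli(z) indicators, mirroring the indicators Y_{i,j}.
    return sum(1 for _ in range(m) if rng.random() < z)
```

For instance, with the hypothetical values μ = 0.2, ϕ = 0.1 and M = 100, the overdispersion factor in (3.7) is 1 + 99 × 0.1 = 10.9, so the variance is 10.9 times that of a binomial count with the same mean.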

We then link the expected relative abundance, μi, and the overdispersion, ϕi, to covariates. We define link functions

$$g(\mu_i) = \beta_0 + X_i^T \beta, \tag{3.9}$$
$$h(\phi_i) = \beta_0^* + X_i^{*T} \beta^*, \tag{3.10}$$

where Xi, the ith row of the covariate matrix $X = [X_{ij}] \in \mathbb{R}^{n \times k}$, represents the k covariates associated with μi; $X_i^*$, the ith row of the covariate matrix $X^* = [X_{ij}^*] \in \mathbb{R}^{n \times k^*}$, represents the k* covariates associated with ϕi; β = (β1, …, βk)T; and $\beta^* = (\beta_1^*, \ldots, \beta_{k^*}^*)^T$. X and X* may be identical, or they may be non- or partially-overlapping.

Throughout this paper, we choose the logit transformation for the link functions in (3.9) and (3.10), so that

$$g(x) \equiv h(x) := \log\left(\frac{x}{1 - x}\right).$$

This link function is convenient as it is a bijection between (0, 1) and $\mathbb{R}$. Other choices for the link functions can be used as well, and the link functions for μi and ϕi need not be identical.
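The link equations (3.9) and (3.10) can be sketched in a few lines of Python. The helper names and the coefficient values below are hypothetical and chosen only to show how a single binary covariate shifts both the expected relative abundance and the overdispersion under logit links.

```python
import math

def logit(x):
    """Logit link: maps (0, 1) onto the real line."""
    return math.log(x / (1.0 - x))

def inv_logit(eta):
    """Inverse logit: maps a linear predictor back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def fitted_mu_phi(beta0, beta, x, beta0_star, beta_star, x_star):
    """Invert (3.9) and (3.10) for one sample, assuming logit links for both."""
    eta = beta0 + sum(b * v for b, v in zip(beta, x))
    eta_star = beta0_star + sum(b * v for b, v in zip(beta_star, x_star))
    return inv_logit(eta), inv_logit(eta_star)

# Hypothetical coefficients; the same binary covariate enters both links.
mu_control, phi_control = fitted_mu_phi(-4.0, [1.0], [0.0], -3.0, [0.5], [0.0])
mu_treated, phi_treated = fitted_mu_phi(-4.0, [1.0], [1.0], -3.0, [0.5], [1.0])
```

With these illustrative values, a positive coefficient on the covariate raises both μ and ϕ for the treated sample relative to the control sample.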

This hierarchical model has three key advantages over other approaches. First, the use of a beta random variable as a model for the binomial probability allows us to incorporate overdispersion. Second, the overdispersion parameter (rather than just the mean) can be modeled with covariates. As we will see in Section 6, this is a key advantage of our approach. Finally, our model makes direct use of the absolute abundance (W1, …, Wn) and the total number of counts (M1, …, Mn), rather than simply transforming these quantities into the observed relative abundance (W1/M1, …, Wn/Mn), which would amount to throwing away valuable information about the sequencing depth in each sample. We show the 95% prediction intervals from a beta-binomial model for the soil microbiology study in Figure 1 (right).

3.2. Model fitting

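Given n samples from the model (3.2) and (3.3), the log-likelihood is

$$\begin{aligned}
\log L(\theta \mid W, M) &= \sum_{i=1}^{n} \log\left[\binom{M_i}{W_i} \frac{B(a_{1,i} + W_i,\; a_{2,i} + M_i - W_i)}{B(a_{1,i},\, a_{2,i})}\right] \\
&= \sum_{i=1}^{n} \log\left[\binom{M_i}{W_i} \frac{B\left(\frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{-\beta_0 - X_i^T\beta}} + W_i,\; \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{\beta_0 + X_i^T\beta}} + M_i - W_i\right)}{B\left(\frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{-\beta_0 - X_i^T\beta}},\; \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{\beta_0 + X_i^T\beta}}\right)}\right],
\end{aligned} \tag{3.11}$$

where $W \in \mathbb{N}^n$, $M \in \mathbb{N}^n$, $\beta \in \mathbb{R}^k$, $\beta^* \in \mathbb{R}^{k^*}$, $\theta = (\beta_0, \beta^T, \beta_0^*, \beta^{*T})^T$, and B(·, ·) is the Beta function given by $B(x, y) = \int_0^1 t^{x-1}(1 - t)^{y-1}\,dt$ for $x \in \mathbb{R}^+$ and $y \in \mathbb{R}^+$. We fit the model by maximum likelihood using the trust region optimization algorithm (Fletcher (1987), Nocedal and Wright (1999), Geyer (2015)), which has accelerated computation relative to a line search method.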
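A minimal Python sketch of the log-likelihood (3.11) under logit links is given below. It evaluates the log Beta function through log-gamma functions for numerical stability. The function names and argument layout are assumptions for this illustration and do not reproduce the corncob implementation, which also supplies the gradient and Hessian to the optimizer.

```python
import math

def log_beta(x, y):
    """log B(x, y), computed from log-gamma functions."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)

def log_choose(m, w):
    """log of the binomial coefficient (m choose w)."""
    return math.lgamma(m + 1) - math.lgamma(w + 1) - math.lgamma(m - w + 1)

def bb_log_likelihood(beta0, beta, beta0_star, beta_star, W, M, X, X_star):
    """Beta-binomial log-likelihood (3.11) with logit links for mu and phi.

    W, M are lists of counts and depths; X, X_star are lists of covariate rows.
    """
    total = 0.0
    for w, m, x, xs in zip(W, M, X, X_star):
        eta = beta0 + sum(b * v for b, v in zip(beta, x))
        eta_star = beta0_star + sum(b * v for b, v in zip(beta_star, xs))
        # a1 + a2 = exp(-eta_star); a1 and a2 split it by mu and 1 - mu.
        a1 = math.exp(-eta_star) / (1.0 + math.exp(-eta))
        a2 = math.exp(-eta_star) / (1.0 + math.exp(eta))
        total += log_choose(m, w) + log_beta(a1 + w, a2 + m - w) - log_beta(a1, a2)
    return total
```

A function of this form could be negated and passed to any numerical optimizer; the choice of optimizer is independent of the sketch.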

In this iterative algorithm, a “trust region” is defined around the parameter estimate at each iteration. The algorithm then updates the parameter estimate by minimizing a second-order Taylor series expansion of the objective function, subject to the constraint that the solution is within the trust region. If a proposed update is infeasible (i.e., it is outside of the parameter space), then it is rejected and the trust region shrinks. The minimization of the objective function then repeats with the new constraint. If a proposed update is close to the boundary of the trust region, the trust region expands in the next iteration. We implement the trust algorithm for minimizing the negative log-likelihood using the R package trust (Geyer (2015)).

The log-likelihood is not concave in θ (see Appendix A), so trust region optimization does not guarantee convergence to the global minimum of the objective function. However, under mild conditions, the limit points of the trust algorithm are guaranteed to satisfy the first- and second-order conditions that are necessary for a local minimum (Fletcher (1987), Nocedal and Wright (1999)). We use multiple initializations and select the estimate that has the largest log-likelihood. In practice, there is little difference in the parameter estimates across initializations.

Each iteration of the trust region optimization algorithm makes use of the gradient and Hessian of (3.11). These are given in Appendix B for the case of logit link functions for g(·) and h(·) in (3.9) and (3.10).

4. Hypothesis testing

We now discuss inference on θ. We consider the null hypothesis that Aθ = b, where $A \in \mathbb{R}^{r \times (k + k^* + 2)}$ has full row rank and r < k + k* + 2, $b \in \mathbb{R}^r$, and where θ is the parameter vector introduced in (3.11). Note that this general form for the null hypothesis allows us to test arbitrary subsets and linear combinations of the parameters within $\theta = (\beta_0, \beta^T, \beta_0^*, \beta^{*T})^T$. The Wald test statistic is

$$\hat T_{\mathrm{Wald}} = n (A\hat\theta - b)^T \left(A\, \hat I(\hat\theta)_n^{-1} A^T\right)^{-1} (A\hat\theta - b), \tag{4.1}$$

where

$$\hat\theta = \arg\sup_{\theta} \log L(\theta \mid W, M) \tag{4.2}$$

and $\hat I(\hat\theta)_n$ is the observed Fisher information evaluated at $\hat\theta$:

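$$\hat I(\hat\theta)_n = -\frac{1}{n} \sum_{i=1}^{n} \left[\frac{\partial^2}{\partial\theta\, \partial\theta^T} \log L(\theta \mid W_i, M_i)\right]_{\theta = \hat\theta}. \tag{4.3}$$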
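To make (4.1) concrete, consider the special case r = 1, in which A is a single row. Writing that row as $a^T$ (a symbol introduced here only for illustration), the inner inverse in (4.1) is the reciprocal of a scalar and the statistic reduces to

$$\hat T_{\mathrm{Wald}} = \frac{n\,(a^T \hat\theta - b)^2}{a^T\, \hat I(\hat\theta)_n^{-1}\, a}.$$

If, in addition, a selects the jth element of θ and b = 0, this becomes

$$\hat T_{\mathrm{Wald}} = \frac{n\, \hat\theta_j^2}{\left[\hat I(\hat\theta)_n^{-1}\right]_{jj}},$$

which is the square of an estimate divided by its standard error, provided that $[\hat I(\hat\theta)_n^{-1}]_{jj} / n$ is taken as the estimated variance of $\hat\theta_j$.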

Algorithm 1.

Parametric Bootstrap Wald Test of H0 : Aθ = b

   Require: W, M, X, X*, a large integer B (e.g., B = 10,000)
1: Estimate $\hat\theta$ and $\hat\theta_0$ as in (4.2) and (4.5), respectively, with the trust region optimization procedure.
2: Compute $\hat T_{\mathrm{Wald}}$ as in (4.1) using A, b and $\hat\theta$.
3: for b = 1, …, B do
4:   Simulate $\tilde W^b$ with elements $\tilde W_i^b$ drawn from a beta-binomial distribution with Mi draws and parameters $\hat\theta_0$.
5:   Estimate $\tilde\theta^b$ as in (4.2) using $\tilde W^b$ and M with the trust region optimization procedure.
6:   Compute $\hat T_{\mathrm{Wald}}^b$ as in (4.1) using A, b and $\tilde\theta^b$.
7: Calculate the p-value:
$$\hat p = \frac{1}{B + 1}\left(1 + \sum_{b=1}^{B} \mathbf{1}\left\{\hat T_{\mathrm{Wald}}^b \ge \hat T_{\mathrm{Wald}}\right\}\right).$$
8: return $\hat p$
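The control flow of Algorithm 1 can be sketched in Python as follows. The fitting, simulation and test-statistic routines are passed in as callables; their names and signatures (`fit_full`, `fit_null`, `wald_stat`, `simulate`) are hypothetical placeholders for the estimation steps described above, not functions from corncob.

```python
import random

def bootstrap_p_value(t_obs, t_boot):
    """Step 7: p = (1 + #{b : T_b >= T_obs}) / (B + 1)."""
    exceed = sum(1 for t in t_boot if t >= t_obs)
    return (1 + exceed) / (len(t_boot) + 1)

def parametric_bootstrap_wald(W, M, fit_full, fit_null, wald_stat, simulate,
                              B=1000, rng=None):
    """Parametric bootstrap test following the steps of Algorithm 1.

    fit_full(W, M)      -> unrestricted estimate, as in (4.2)
    fit_null(W, M)      -> estimate restricted to the null, as in (4.5)
    wald_stat(theta)    -> test statistic, as in (4.1)
    simulate(M, theta, rng) -> counts simulated at depths M under theta
    """
    rng = rng or random.Random()
    theta_hat = fit_full(W, M)                  # step 1
    theta_null = fit_null(W, M)                 # step 1
    t_obs = wald_stat(theta_hat)                # step 2
    t_boot = []
    for _ in range(B):                          # step 3
        W_b = simulate(M, theta_null, rng)      # step 4
        theta_b = fit_full(W_b, M)              # step 5
        t_boot.append(wald_stat(theta_b))       # step 6
    return bootstrap_p_value(t_obs, t_boot)     # steps 7 and 8
```

Adding one to both the numerator and the denominator in step 7 ensures the reported p-value is never exactly zero, whatever the number of bootstrap iterations.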

Under the null hypothesis that Aθ = b, we find empirically that $\hat T_{\mathrm{Wald}}$ is well-approximated by a $\chi^2_r$ distribution if n is large (Section 5.1). Alternatively, we can test Aθ = b using a likelihood ratio test statistic, defined as

$$\hat T_{\mathrm{LRT}} = 2\left(\log L(\hat\theta \mid W, M) - \log L(\hat\theta_0 \mid W, M)\right), \tag{4.4}$$

where

$$\hat\theta_0 = \arg\sup_{\theta :\, A\theta = b} \log L(\theta \mid W, M). \tag{4.5}$$

When n is large and Aθ = b, we find that the distribution of $\hat T_{\mathrm{LRT}}$ is well-approximated by a $\chi^2_r$ distribution (Section 5.1).
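The asymptotic version of the likelihood ratio test can be sketched using only the Python standard library. The statistic follows (4.4); the chi-squared tail probability for integer degrees of freedom is computed from the closed forms for r = 1 and r = 2 together with the standard recursion for the regularized upper incomplete gamma function. The function names are assumptions made for this illustration.

```python
import math

def lrt_statistic(loglik_full, loglik_null):
    """Likelihood ratio statistic as in (4.4)."""
    return 2.0 * (loglik_full - loglik_null)

def chi2_sf(t, r):
    """Upper tail probability P(chi2_r >= t) for integer r >= 1.

    Uses Q(1/2, x) = erfc(sqrt(x)), Q(1, x) = exp(-x) and the recursion
    Q(a + 1, x) = Q(a, x) + x^a exp(-x) / Gamma(a + 1), with x = t / 2.
    """
    if r < 1 or r != int(r):
        raise ValueError("r must be a positive integer")
    if t <= 0.0:
        return 1.0
    x = t / 2.0
    if r % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(x))
    else:
        a, q = 1.0, math.exp(-x)
    while 2.0 * a < r:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q

def lrt_p_value(loglik_full, loglik_null, r):
    """Asymptotic p-value, assuming the chi-squared approximation with r df holds."""
    return chi2_sf(lrt_statistic(loglik_full, loglik_null), r)
```

The same tail function applies to the Wald statistic in (4.1), since both statistics are compared with the same reference distribution.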

In practice, we often do not have the sample size necessary to use the $\chi^2_r$ approximation. For this reason, we also implement a parametric bootstrap hypothesis testing procedure. Our parametric bootstrap Wald testing procedure is given in Algorithm 1; the parametric bootstrap likelihood ratio test procedure is provided in Appendix C.

For certain realizations of W, Wald-type inference is uninformative. For example, if k = k* = 1, $X_i = X_i^* \in \{0, 1\}$ for i = 1, …, n, and $\sum_{i : X_i = 1} W_i = 0$, then a parameter estimate diverges to −∞ (see Lemma D.1 in Appendix D for details). This limitation is not unique to our model, and hypothesis testing using Wald tests in the case of complete or quasi-complete separation in logistic regression is known to have the same issue (see Albert and Anderson (1984), Heinze and Schemper (2002), Heinze (2006) for further discussion). In this case, we instead use the likelihood ratio test to test hypotheses about β, such as β = 0. However, in this setting, even the likelihood ratio test does not provide a useful test of certain hypotheses about β*, such as β* = 0 (see Appendix D). Since it is often the case that a taxon is unobserved in certain experimental conditions, the default behaviour for our software in this setting is to return a test statistic of zero for Wald-type tests to indicate that inference is uninformative and the null hypothesis should not be rejected.

While (4.4) and Algorithms 1 and 2 hold for any A and b, they require solving (4.5). This may be difficult to do for certain A and b. In this case, an approximate solution could be obtained by maximizing the likelihood subject to a penalty on $A\theta - b$ (e.g., see Fiacco and McCormick (1968), Ryan (1974)). Alternatively, approximating the distribution of (4.1) with a $\chi^2_r$ distribution does not require restricted maximum likelihood estimation.

In summary, we implement four hypothesis testing procedures: the Wald test, the likelihood ratio test, the parametric bootstrap Wald test and the parametric bootstrap likelihood ratio test. The Wald and likelihood ratio tests permit faster inference than the parametric bootstrap tests. However, the parametric bootstrap procedures successfully control Type I error in small sample sizes. We now demonstrate the performance of all of these hypothesis testing procedures in simulation.

5. Simulation study

We now investigate the performance of our approach, which we call count regression for correlated observations with the beta-binomial, or corncob, under simulation. We study the Type I error rate and the power when testing for both differential abundance and differential variability. We generate sequencing depths $M \in \mathbb{N}^n$ with elements Mi simulated from the empirical distribution of the observed sequencing depths in the data set discussed in Section 6, which ranges from 7821 to 58,655. We use sample sizes n ∈ {10, 30, 100} and a binary covariate $X_i = X_i^* = 0$ for i = 1, …, n/2 − 1 and $X_i = X_i^* = 1$ for i = n/2, …, n. We then simulate absolute abundances $W \in \mathbb{N}^n$ with elements Wi simulated under the data generating model (described below). The parameter values were selected by fitting corncob to the genus Thermomonas in the data set discussed in Section 6 so that simulated data are similar to what might be observed in a real-world experiment. For each simulation, we calculate 10,000 p-values using all four of the hypothesis testing procedures outlined in Section 4: the Wald test, the likelihood ratio test, the parametric bootstrap Wald test and the parametric bootstrap likelihood ratio test. We use 1000 bootstrap iterations for the parametric bootstrap testing procedures.

5.1. Type I error rate

We first confirm that corncob controls Type I error at the nominal level. We generate data using the beta-binomial model with logit link functions for mean and overdispersion, under three settings for β. In the first simulation setting, we test the null hypothesis $H_0 : (\beta_1, \beta_1^*) = (0, 0)$. We generated model parameters by fitting a model to the genus Thermomonas without using soil amendment as a covariate, yielding parameters $(\tilde\beta_0, \tilde\beta_1, \tilde\beta_0^*, \tilde\beta_1^*) = (5.75, 0, 5.24, 0)$. In the second simulation setting, we test the null hypothesis $H_0 : \beta_1^* = 0$. We generated model parameters by fitting a model to the genus Thermomonas using soil amendment as a covariate for μi, yielding parameters $(\tilde\beta_0, \tilde\beta_1, \tilde\beta_0^*, \tilde\beta_1^*) = (5.36, 1.12, 5.69, 0)$. In the third simulation setting, we test the null hypothesis H0 : β1 = 0. We generated model parameters by fitting a model to the genus Thermomonas using soil amendment as a covariate for ϕi, yielding parameters $(\tilde\beta_0, \tilde\beta_1, \tilde\beta_0^*, \tilde\beta_1^*) = (5.51, 0, 5.38, 0.70)$. For all three simulation settings, the null hypotheses are true, so we would expect p-values obtained from testing the null hypotheses to be uniformly distributed.
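Checking uniformity of simulated p-values can be reduced to two summaries: the empirical rejection rate at a nominal level, and empirical quantiles to compare against those of a Uniform(0, 1) distribution. The helper names below are hypothetical and the sketch uses only the Python standard library.

```python
import math

def rejection_rate(p_values, alpha=0.05):
    """Fraction of p-values at or below alpha (the empirical Type I error
    rate when every null hypothesis is true)."""
    return sum(1 for p in p_values if p <= alpha) / len(p_values)

def empirical_quantile(p_values, prob):
    """Empirical quantile by the inverse-CDF rule: the smallest order
    statistic whose empirical CDF is at least prob."""
    ordered = sorted(p_values)
    idx = max(0, math.ceil(prob * len(ordered)) - 1)
    return ordered[idx]

def uniform_quantile_gaps(p_values, probs=(0.01, 0.05, 0.10, 0.25, 0.50)):
    """Differences between empirical quantiles and Uniform(0, 1) quantiles.

    Negative gaps at small probs mean that small p-values occur more often
    than under uniformity, that is, the test is anti-conservative.
    """
    return [empirical_quantile(p_values, q) - q for q in probs]
```

Under a true null hypothesis and a valid test, the rejection rate should be close to alpha and the gaps should be close to zero, up to simulation error.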

The results are shown in Figure 2. For sample sizes of 30 and 100, all testing procedures resulted in approximately uniform p-values, and Type I error is controlled. This suggests that for this experiment, a sample size of 30 is sufficient to approximate the distribution of the Wald and likelihood ratio test statistics using a $\chi^2$ distribution.

Fig. 2.

Quantiles of p-values obtained from the Type I error rate simulation settings compared to quantiles of a Uniform(0, 1) distribution. We test the null hypotheses $H_0 : (\beta_1, \beta_1^*) = (0, 0)$ (left), $H_0 : \beta_1^* = 0$ (middle) and H0 : β1 = 0 (right). A 45-degree line is shown (black). p-values were obtained using Wald (red), likelihood ratio (green), parametric bootstrap Wald (blue) and parametric bootstrap likelihood ratio (purple) tests. Sample sizes used were 10 (dashed), 30 (dotted) and 100 (solid). Quantiles of each test are shown in Table 2 in Appendix E.

Example quantiles from each of the simulation settings are shown in Table 2 in Appendix E. For a sample size of 10, only the parametric bootstrap procedures resulted in approximately uniform p-values and successful Type I error control. The p-values obtained using the Wald and likelihood ratio tests were anti-conservative, suggesting that for this experiment, a sample size of 10 is too small to approximate the distribution of the test statistics using a $\chi^2$ distribution. Therefore, to obtain reliable inference, we recommend the parametric bootstrap procedure when n is smaller than 30.

5.2. Power

We now investigate the power of corncob to reject (i) the null hypothesis H0 : β1 = 0, as well as (ii) the null hypothesis $H_0 : \beta_1^* = 0$. We consider two cases: varying the value of β1, and varying the value of $\beta_1^*$. For both settings, we generated model parameters by fitting a model to the genus Thermomonas using soil amendment as a covariate for μi and ϕi, yielding parameters $(\tilde\beta_0, \tilde\beta_1, \tilde\beta_0^*, \tilde\beta_1^*) = (5.17, 2.46, 5.13, 3.88)$. In the first case (Setting 4 in Figure 3), we set $(\beta_0, \beta_1, \beta_0^*, \beta_1^*) = (\tilde\beta_0, c\tilde\beta_1, \tilde\beta_0^*, \tilde\beta_1^*)$ using c ∈ {0, 0.05, …, 1}. In the second case (Setting 5 in Figure 3), we set $(\beta_0, \beta_1, \beta_0^*, \beta_1^*) = (\tilde\beta_0, \tilde\beta_1, \tilde\beta_0^*, c\tilde\beta_1^*)$ using c ∈ {0, 0.05, …, 1}.

Fig. 3.

Power curves of p-values obtained from the power simulations. Setting 4 (left) tests H0 : β1 = 0. Setting 5 (right) tests $H_0 : \beta_1^* = 0$. A horizontal dashed line is shown at 0.05. p-values were obtained using Wald (red), likelihood ratio (green), parametric bootstrap Wald (blue) and parametric bootstrap likelihood ratio (purple) tests. Sample sizes used were 10 (dashed), 30 (dotted) and 100 (solid).

The results of the power analyses are shown in Figure 3. For both null hypotheses, all sample sizes, and all hypothesis testing procedures, the power increases as both the sample size and the magnitude of the coefficient being tested increase. For sample sizes of 30 and 100, there is little difference in power across the four testing procedures. This is not surprising, given that in the simulations in Section 5.1, all procedures performed similarly with sample sizes of 30 and 100. We do not show results for the procedures that rely on the asymptotic distribution of the test statistics for n = 10, as we saw in Section 5.1 that these procedures did not properly control Type I error.

6. Application to soil data

We now consider a study of the association between soil treatments and soil microbiome composition (Whitman et al. (2016)). In this experiment, there are three groups of soil treatments: no additions, biochar additions and fresh biomass additions. For each treatment group, multiple experimental replicates were taken at three time points: on the first day, after 12 days and after 82 days. The data include n = 119 samples with sequencing depths ranging from 8830 to 194,356. After quality control (as described in Whitman et al. (2016)), a total of 7770 operational taxonomic units were identified using the UPARSE workflow (Edgar (2013)), and taxonomy was assigned using reference databases. Using the assigned taxonomy, we aggregated counts to the genus level, giving 241 genera.

We are interested in applying our method to compare the microbiome of soil with no additions after 82 days (n = 15) to the microbiome of soil with biochar additions after 82 days (n = 16). We remove 13 genera for which the total number of counts in these 31 samples is zero. We apply corncob using soil addition as a covariate for μi and ϕi as in (3.9) and (3.10). We calculate p-values using the parametric bootstrap likelihood ratio test (Algorithm 2) with $B = 10^6$ bootstrap iterations. We compare the results of corncob to those from DESeq2 (Love, Huber and Anders (2014)), edgeR (Robinson, McCarthy and Smyth (2010)), metagenomeSeq (Paulson et al. (2013)) and a zero-inflated beta (ZIB) regression model (Peng, Li and Liu (2016)).

6.1. Detection of differential abundance

We first compare p-values obtained from testing for differential abundance across soil addition group. Roughly speaking, each of the approaches tests for a difference in abundance of a single taxon across conditions, although the details of the model used vary across methods. In the context of corncob, testing for differential abundance amounts to testing the null hypothesis H0 : β = 0, using the notation defined in (3.9). Scatter plots of the negative log-10 p-values for each approach are shown in Figure 4.

Fig. 4.

The negative log-10 p-values obtained by testing for differential abundance using corncob (H0 : β = 0) compared to those from DESeq2 (left-most, Spearman’s correlation coefficient ρ = 0.854), edgeR (middle-left, ρ = 0.783), metagenomeSeq (middle-right, ρ = 0.552) and ZIB (right-most, ρ = 0.705). A 45-degree line is shown. We see that the p-values are on a similar scale overall. Thermomonas (green), Flavisolibacter (red) and Myxococcus (blue) are further examined in Figure 6.

Overall, as p-values calculated using corncob decrease, so do those calculated using other approaches. We observe moderate to strong correlations across the different approaches, with Spearman’s correlation coefficients between the p-values obtained from corncob (H0 : β = 0) and DESeq2, edgeR, metagenomeSeq and ZIB, respectively, of 0.854, 0.783, 0.552, 0.705. corncob calculated a lower p-value for 53.9%, 43.6%, 63.8% and 58.3% of genera compared to DESeq2, edgeR, metagenomeSeq and ZIB, respectively. Median p-values across all genera for corncob, DESeq2, edgeR, metagenomeSeq and ZIB are 0.273, 0.318, 0.297, 0.491 and 0.320, respectively. Therefore, while the p-values produced by corncob are on a similar scale to the other approaches, they may be higher or lower for any given taxon. While each of the approaches uses a different model and makes use of a different test statistic, they are all testing for some difference in the mean abundance of the taxon across the soil addition. Thus, it is unsurprising that the p-values are similar across the approaches.
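Comparisons of this kind rest on Spearman's correlation coefficient between two sets of per-genus p-values. A minimal standard-library sketch is given below; the function names are hypothetical, and ties are handled with average ranks, which is one common convention.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_neg_log10(p_a, p_b):
    """Spearman correlation between -log10 p-values from two methods."""
    a = [-math.log10(p) for p in p_a]
    b = [-math.log10(p) for p in p_b]
    return pearson(average_ranks(a), average_ranks(b))
```

Because Spearman's coefficient depends only on ranks, applying the same monotone transformation to both sets of p-values, such as the negative log-10 transformation, leaves it unchanged; the transformation matters for the scatter plots rather than for the coefficient.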

6.2. Detection of differential variability

We now test for differences in the variability of the abundance of a single taxon across conditions, which we refer to as differential variability. Using corncob and the notation in (3.9)–(3.10), this amounts to testing the null hypothesis H0 : β* = 0. As far as we know, corncob is the only approach that explicitly tests for differential variability. Thus, in this section, we investigate whether testing for differential variability allows us to identify new genera beyond those we identify when testing only for differential abundance.

We compare the results of testing for differential variability to the results of testing for differential abundance using the methods investigated in Section 6.1. Figure 5 shows scatter plots of the negative log-10 transformations of the p-values for testing differential abundance from DESeq2, metagenomeSeq, ZIB and corncob against the p-values for testing differential variability with corncob. We see from Figure 5 that there is only a weak association between the p-values for differential variability obtained using corncob and the p-values for differential abundance obtained using the other approaches. In particular, Spearman’s correlation coefficients are 0.127, 0.234, 0.132, 0.215 and 0.362 between corncob p-values for H0 : β* = 0 and p-values from DESeq2, edgeR, metagenomeSeq, ZIB and corncob for H0 : β = 0, respectively. We omit from Figure 5 the scatter plot comparing the corncob p-values for H0 : β* = 0 to the edgeR p-values because the p-values from edgeR are similar to those from DESeq2. We conclude that applying corncob to test H0 : β* = 0 leads to the discovery of a very different set of genera than those discovered by applying corncob or other approaches to test for differential abundance.

Fig. 5.

The negative log-10 p-values obtained by testing for differential variability using corncob (H0 : β* = 0) compared to the negative log-10 p-values obtained by testing for differential abundance using DESeq2 (left-most, Spearman’s correlation coefficient ρ = 0.127), metagenomeSeq (middle-left, ρ = 0.132), ZIB (middle-right, ρ = 0.215) and corncob (H0 : β = 0) (right-most, ρ = 0.362). A 45-degree line is shown. Thermomonas (green), Flavisolibacter (red) and Myxococcus (blue) are further examined in Figure 6. We omit a scatter plot showing p-values for edgeR (ρ = 0.234); results are similar to DESeq2.

To obtain greater insight into the results shown in Figure 5, we consider the 3 highlighted genera, which we further investigate in Figure 6. The first, Thermomonas, has small p-values for both differential abundance (p = 1.00 × 10^{-6}) and differential variability (p = 1.00 × 10^{-6}) using corncob. The second, Flavisolibacter, has a small p-value for differential abundance (p = 7.44 × 10^{-4}) and a large p-value for differential variability (p = 0.404). The third, Myxococcus, has a large p-value for differential abundance (p = 0.244) and a small p-value for differential variability (p = 8.83 × 10^{-3}), so it would not be identified using the competing approaches (see Table 3 in Appendix F for p-values from all approaches). Figure 6 indicates a clear visual difference between genera that are identified as differentially abundant but not differentially variable, differentially variable but not differentially abundant, and both differentially abundant and differentially variable. Researchers can use corncob to distinguish between these three possibilities.

Fig. 6.

The observed relative abundances of the genera Thermomonas (left), Flavisolibacter (middle) and Myxococcus (right) in 31 soil samples. Each of these genera is highlighted in each panel of Figures 4 and 5. In each panel, the first 16 samples correspond to the biochar additions group (darker color), and the remaining 15 samples correspond to the no additions group (lighter color). 95% prediction intervals for the relative abundances from a corncob fit using soil addition as a covariate for μi and ϕi are shown. Using corncob to test H0 : β = 0 and H0 : β* = 0 indicates that Thermomonas is both differentially abundant (p = 1.00 × 10^{-6}) and differentially variable (p = 1.00 × 10^{-6}), Flavisolibacter is differentially abundant (p = 7.44 × 10^{-4}) and not differentially variable (p = 0.404), and Myxococcus is differentially variable (p = 8.83 × 10^{-3}) and not differentially abundant (p = 0.244). See Table 3 in Appendix F for p-values from all approaches.

In practice, a data analyst will apply a multiple testing procedure to adjust the p-values for multiple comparisons, so we also investigate the number of genera identified as either differentially abundant or differentially variable after applying the Benjamini–Hochberg procedure (Benjamini and Hochberg (1995)) to the p-values obtained using corncob to test H0 : β = 0 and H0 : β* = 0. The results are shown in Figure 7. We see that for a given false discovery rate, in this data set we detect more genera as being differentially abundant than differentially variable; this can also be seen in the right-most panel of Figure 5. All code for performing this analysis is available at github.com/bryandmartin/corncob_supplementary and is provided in Supplement B of the Supplementary Material (Martin, Witten and Willis (2020b)).

Fig. 7.

The estimated false discovery rate using the Benjamini–Hochberg procedure, as a function of the number of genera identified as differentially abundant and differentially variable. For a given false discovery rate, we identify fewer genera that are differentially variable than differentially abundant.
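The Benjamini–Hochberg adjustment behind Figure 7 can be sketched in a few lines. The code below is an illustrative plain-Python version, not the analysis code from the supplement; the function names and the example p-values are hypothetical. For sorted p-values p(1) ≤ … ≤ p(m), the adjusted value for rank i is the minimum over j ≥ i of m p(j) / j, and a genus is counted as a discovery when its adjusted value does not exceed the target false discovery rate.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def n_discoveries(pvalues: Sequence[float], fdr: float) -> int:
    """Number of hypotheses rejected at the given false discovery rate."""
    return sum(q <= fdr for q in benjamini_hochberg(pvalues))


# Hypothetical p-values for four genera (illustration only).
p_example = [0.01, 0.04, 0.03, 0.005]
q_example = benjamini_hochberg(p_example)
```

Evaluating `n_discoveries` over a grid of false discovery rates, separately for the p-values from H0 : β = 0 and from H0 : β* = 0, gives curves of the kind summarized in Figure 7.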

7. Discussion

In this paper, we have proposed a beta-binomial regression model for abundance data. Our model extends existing beta-binomial models by allowing discrete and continuous covariates to be linked to both a relative abundance parameter and an overdispersion parameter. Our method is particularly well-suited to modeling microbial abundance data for a number of reasons. First, microbial taxa are commonly unobserved in many samples. For example, in the data set examined in Section 6, 34% of absolute abundances were zero. Our model can accommodate this without requiring a zero-inflation component or pseudocounts. Second, studies of microbial populations often have small sample sizes. Our simulation study in Section 5 suggests that our parametric bootstrap inference methods (Algorithms 1 and 2) give valid inference even with small samples. Third, the interpretations of μi as the expected relative abundance and of ϕi as the within-sample correlation of taxon labels (i.e., ϕi = Corr(Yi,j, Yi,j′) for j ≠ j′, see Section 3) are intuitive and complement ecological theory (Welch et al. (2016)). Finally, regression models for contrasting microbial populations commonly focus on differential abundance. By conducting inference about ϕi, our model is also able to identify differences in microbial populations associated with differential variability.

Many studies (e.g., see Gerber (2014), Faust et al. (2015), Zhou et al. (2015), among others) employ a longitudinal design to investigate the dynamics of microbial populations over time. To accommodate this setting, future work could incorporate random effects into (3.9) and (3.10).

Our proposed approach models a single taxon’s abundance. A limitation of this approach is that it does not enforce the compositionality constraint (i.e., the estimated expected relative abundances need not sum to 1 across all microbes in the population). Future work could consider a multivariate extension of our approach to enforce the compositionality constraint or incorporate between-taxon correlations.

All methods proposed in this paper are implemented in an R package available at github.com/bryandmartin/corncob and provided in Supplement A of the Supplementary Material (Martin, Witten and Willis (2020a)). Code to reproduce all simulations and data analyses is available at github.com/bryandmartin/corncob_supplementary and provided in Supplement B of the Supplementary Material (Martin, Witten and Willis (2020b)).

Acknowledgements

The authors thank the two anonymous reviewers for the insightful feedback that greatly improved this paper. The authors also thank Thea Whitman for providing data, as well as Pauline Trinh and Moira Differding for contributions to the software package. The authors are also grateful to the R Core Team (R Core Team (2018)) and authors of the packages brglm2 (Kosmidis (2018)), ggplot2 (Wickham (2016)), phyloseq (McMurdie and Holmes (2013)), trust (Geyer (2015)) and VGAM (Yee (2010)), which were used for constructing the figures and running the analyses in this article.

The first author was supported in part by the NSF Grant DGE-1256082.

The second author was supported in part by NSF CAREER Grant DMS-1252624, NIH Grant DP5OD009145 and a Simons Investigator Award in Mathematical Modeling of Living Systems Grants.

APPENDIX A: NONCONCAVITY OF THE BETA-BINOMIAL LOG-LIKELIHOOD

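We show that (3.11) is not guaranteed to be concave in θ. Let $n = 1$, $W = W_1 = 15$ and $M = M_1 = 2000$. Suppose further that $\theta = (\beta_0, \beta_0^*)^T$. Let $\theta_1 = (-3, -5)^T$ and $\theta_2 = (-1, -5)^T$. Then

$$\log L(\theta_1 \mid W_1, M_1) = -8.481, \qquad \log L(\theta_2 \mid W_1, M_1) = -9.816, \qquad \log L(0.5\theta_1 + 0.5\theta_2 \mid W_1, M_1) = -9.251.$$

Therefore there exist $\theta_1$ and $\theta_2$ such that

$$\log L(0.5\theta_1 + 0.5\theta_2 \mid W, M) < 0.5 \log L(\theta_1 \mid W, M) + 0.5 \log L(\theta_2 \mid W, M),$$

which establishes that (3.11) is not concave in θ.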
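
To spell out the comparison, recall that the log-likelihood of a discrete distribution is nonpositive, so the three values reported above are −8.481, −9.816 and −9.251. The average of the two endpoint values then exceeds the midpoint value:

$$0.5\,(-8.481) + 0.5\,(-9.816) = -9.1485 > -9.251 = \log L(0.5\theta_1 + 0.5\theta_2 \mid W_1, M_1),$$

whereas concavity would require the midpoint value to be at least as large as this average.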

APPENDIX B: ANALYTIC EXPRESSIONS FOR THE GRADIENT AND HESSIAN

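Let $\gamma_i = \phi_i / (1 - \phi_i)$ for all $i$, and define $\psi(x) = \int_0^\infty \left( \frac{e^{-t}}{t} - \frac{e^{-xt}}{1 - e^{-t}} \right) dt$ for $x \in \mathbb{R}^+$ to be the digamma function, the derivative of the logarithm of the gamma function. Define $Z_i = (1 \;\; X_i)$ and $Z_i^* = (1 \;\; X_i^*)$ to be the design matrices for covariates associated with $\mu_i$ and $\phi_i$, respectively, including intercept terms. Then the expression for the gradient of (3.11) is given by

$$\frac{\partial \log L(\theta \mid W, M)}{\partial \beta} = \sum_{i=1}^n \left\{ \gamma_i^{-1} \mu_i (1 - \mu_i) Z_i \left[ \psi\!\left(\frac{1 - \mu_i}{\gamma_i}\right) - \psi\!\left(M_i + \frac{1 - \mu_i - W_i \gamma_i}{\gamma_i}\right) + \psi\!\left(W_i + \frac{\mu_i}{\gamma_i}\right) - \psi\!\left(\frac{\mu_i}{\gamma_i}\right) \right] \right\}, \tag{B.1}$$

$$\frac{\partial \log L(\theta \mid W, M)}{\partial \beta^*} = \sum_{i=1}^n \left\{ \gamma_i^{-1} Z_i^* \left[ \psi\!\left(M_i + \frac{1}{\gamma_i}\right) - \psi\!\left(\frac{1}{\gamma_i}\right) + (\mu_i - 1) \left( \psi\!\left(M_i + \frac{1 - \mu_i - W_i \gamma_i}{\gamma_i}\right) - \psi\!\left(\frac{1 - \mu_i}{\gamma_i}\right) \right) + \mu_i \left( \psi\!\left(\frac{\mu_i}{\gamma_i}\right) - \psi\!\left(W_i + \frac{\mu_i}{\gamma_i}\right) \right) \right] \right\}. \tag{B.2}$$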

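Let $\psi^{(1)}(x) = \frac{\partial}{\partial x} \psi(x)$ be the trigamma function. Define $Y_i = (Z_i^T \;\; 0)^T \in \mathbb{R}^{k + k^* + 2}$ and $Y_i^* = (0 \;\; Z_i^{*T})^T \in \mathbb{R}^{k + k^* + 2}$. Then the expression for the Hessian of (3.11), $H$, is given by

$$H = \sum_{i=1}^n \Big[ c_{1,i}\, \mu_i^2 (1 - \mu_i)^2\, Y_i Y_i^T + c_{2,i} \left( \mu_i (1 - \mu_i) Y_i\, \gamma_i Y_i^{*T} + \gamma_i Y_i^*\, \mu_i (1 - \mu_i) Y_i^T \right) + c_{3,i} \left( \gamma_i Y_i^*\, \gamma_i Y_i^{*T} \right) + c_{4,i} \left( \mu_i (1 - \mu_i)(1 - 2\mu_i)\, Y_i Y_i^T \right) + c_{5,i} \left( \gamma_i Y_i^* Y_i^{*T} \right) \Big],$$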

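where

$$c_{1,i} = \left[ \psi^{(1)}\!\left(M_i + (1 - \mu_i - W_i\gamma_i)/\gamma_i\right) - \psi^{(1)}\!\left((1 - \mu_i)/\gamma_i\right) + \psi^{(1)}\!\left(W_i + \mu_i/\gamma_i\right) - \psi^{(1)}\!\left(\mu_i/\gamma_i\right) \right] \gamma_i^{-2},$$

$$\begin{aligned} c_{2,i} = \Big[ &\gamma_i \left( \psi\!\left(M_i - (\mu_i + W_i\gamma_i - 1)/\gamma_i\right) - \psi\!\left((1 - \mu_i)/\gamma_i\right) \right) + \gamma_i \left( \psi\!\left(\mu_i/\gamma_i\right) - \psi\!\left(\mu_i/\gamma_i + W_i\right) \right) \\ &+ (\mu_i - 1) \left( \psi^{(1)}\!\left((1 - \mu_i)/\gamma_i\right) - \psi^{(1)}\!\left(M_i - (\mu_i + W_i\gamma_i - 1)/\gamma_i\right) \right) + \mu_i \left( \psi^{(1)}\!\left(\mu_i/\gamma_i\right) - \psi^{(1)}\!\left(\mu_i/\gamma_i + W_i\right) \right) \Big] \gamma_i^{-3}, \end{aligned}$$

$$\begin{aligned} c_{3,i} = \Big[ &2\gamma_i \psi\!\left(1/\gamma_i\right) + \psi^{(1)}\!\left(1/\gamma_i\right) - 2\gamma_i \psi\!\left(M_i + 1/\gamma_i\right) - \psi^{(1)}\!\left(M_i + 1/\gamma_i\right) \\ &+ (\mu_i - 1)^2 \psi^{(1)}\!\left(M_i - (\mu_i + W_i\gamma_i - 1)/\gamma_i\right) - 2\gamma_i (\mu_i - 1) \psi\!\left(M_i - (\mu_i + W_i\gamma_i - 1)/\gamma_i\right) \\ &- \mu_i^2 \psi^{(1)}\!\left(\mu_i/\gamma_i\right) + \mu_i^2 \psi^{(1)}\!\left(\mu_i/\gamma_i + W_i\right) - (\mu_i - 1)^2 \psi^{(1)}\!\left((1 - \mu_i)/\gamma_i\right) \\ &+ 2\gamma_i (\mu_i - 1) \psi\!\left((1 - \mu_i)/\gamma_i\right) - 2\gamma_i \mu_i \psi\!\left(\mu_i/\gamma_i\right) + 2\gamma_i \mu_i \psi\!\left(\mu_i/\gamma_i + W_i\right) \Big] \gamma_i^{-4}, \end{aligned}$$

$$c_{4,i} = \left[ \psi\!\left((1 - \mu_i)/\gamma_i\right) - \psi\!\left(M_i - (\mu_i + W_i\gamma_i - 1)/\gamma_i\right) + \psi\!\left(\mu_i/\gamma_i + W_i\right) - \psi\!\left(\mu_i/\gamma_i\right) \right] / \gamma_i,$$

$$c_{5,i} = \left[ \psi\!\left(M_i + 1/\gamma_i\right) - \psi\!\left(1/\gamma_i\right) + \mu_i \left( \psi\!\left(\mu_i/\gamma_i\right) - \psi\!\left(\mu_i/\gamma_i + W_i\right) \right) + (\mu_i - 1) \left( \psi\!\left(M_i - (\mu_i + W_i\gamma_i - 1)/\gamma_i\right) - \psi\!\left((1 - \mu_i)/\gamma_i\right) \right) \right] \gamma_i^{-2}.$$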
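
As a sanity check on (B.1) and (B.2), the analytic gradient can be compared with a central finite-difference approximation of the log-likelihood. The sketch below is illustrative only and is not the corncob implementation. It assumes an intercept-only model (so that Zi = Zi* = 1), logit links for both μi and ϕi (so that γi = exp(β0*)), and the beta-binomial parameterization with shape parameters μi/γi and (1 − μi)/γi; the counts are hypothetical, and the digamma function is a hand-rolled approximation because the Python standard library does not provide one.

```python
import math


def digamma(x: float) -> float:
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    shift = 0.0
    while x < 10.0:
        shift -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + series


def link(beta0: float, beta0_star: float):
    """Assumed intercept-only logit links; returns (mu, gamma), gamma = phi / (1 - phi)."""
    mu = 1.0 / (1.0 + math.exp(-beta0))
    gamma = math.exp(beta0_star)
    return mu, gamma


def loglik(beta0, beta0_star, W, M):
    """Beta-binomial log-likelihood with shapes a1 = mu/gamma, a2 = (1 - mu)/gamma."""
    mu, gamma = link(beta0, beta0_star)
    a1, a2 = mu / gamma, (1.0 - mu) / gamma
    total = 0.0
    for w, m in zip(W, M):
        total += (math.lgamma(m + 1) - math.lgamma(w + 1) - math.lgamma(m - w + 1)
                  + math.lgamma(a1 + w) + math.lgamma(a2 + m - w)
                  - math.lgamma(a1 + a2 + m)
                  - math.lgamma(a1) - math.lgamma(a2) + math.lgamma(a1 + a2))
    return total


def gradient(beta0, beta0_star, W, M):
    """Analytic gradient following (B.1) and (B.2) with Z_i = Z_i* = 1."""
    mu, gamma = link(beta0, beta0_star)
    d_beta = d_star = 0.0
    for w, m in zip(W, M):
        psi_fail = digamma((1.0 - mu) / gamma)
        psi_fail_post = digamma(m + (1.0 - mu - w * gamma) / gamma)
        psi_succ = digamma(mu / gamma)
        psi_succ_post = digamma(w + mu / gamma)
        d_beta += (mu * (1.0 - mu) / gamma
                   * (psi_fail - psi_fail_post + psi_succ_post - psi_succ))
        d_star += (digamma(m + 1.0 / gamma) - digamma(1.0 / gamma)
                   + (mu - 1.0) * (psi_fail_post - psi_fail)
                   + mu * (psi_succ - psi_succ_post)) / gamma
    return d_beta, d_star


def finite_difference(beta0, beta0_star, W, M, h=1e-5):
    """Central finite-difference approximation of the same gradient."""
    d_beta = (loglik(beta0 + h, beta0_star, W, M)
              - loglik(beta0 - h, beta0_star, W, M)) / (2 * h)
    d_star = (loglik(beta0, beta0_star + h, W, M)
              - loglik(beta0, beta0_star - h, W, M)) / (2 * h)
    return d_beta, d_star


# Hypothetical counts W out of totals M for four samples (illustration only).
W_demo, M_demo = [3, 0, 7, 12], [50, 40, 60, 80]
analytic = gradient(-2.0, -3.0, W_demo, M_demo)
numeric = finite_difference(-2.0, -3.0, W_demo, M_demo)
```

Under these assumptions the two gradients should agree up to finite-difference error. The same comparison could in principle be applied to the Hessian by differencing the analytic gradient.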

APPENDIX C: PARAMETRIC BOOTSTRAP LIKELIHOOD RATIO TEST

We present Algorithm 2 to conduct a parametric bootstrap likelihood ratio test.

APPENDIX D: LIKELIHOOD RATIO TESTING WITH A ZERO-COUNT GROUP

We prove that testing the null hypothesis H0 : β* = 0 results in a test statistic of zero under certain conditions. We first prove in Lemma D.1 that the supremum of the log-likelihood of the model (3.2)–(3.3) is equal to zero under certain conditions. We use this to prove our main claim in Theorem D.2.

Algorithm 2.

Parametric Bootstrap Likelihood Ratio Test of H0 : = b

   Require: W, M, X, X*, a large integer B (e.g., B = 10,000)
1: Estimate $\hat{\theta}$ and $\hat{\theta}_0$ as in (4.2) and (4.5), respectively, with the trust region optimization procedure.
2: Compute $\hat{T}_{LRT}$ as in (4.4) using W, M, $\hat{\theta}$ and $\hat{\theta}_0$.
3: for b = 1, …, B do
4:   Simulate $\tilde{W}^b$ with elements $\tilde{W}_i^b$ drawn from a beta-binomial distribution with $M_i$ draws and parameters $\hat{\theta}_0$.
5:   Estimate $\tilde{\theta}^b$ as in (4.2) using $\tilde{W}^b$ and M with the trust region optimization procedure.
6:   Estimate $\tilde{\theta}_0^b$ as in (4.5) using $\tilde{W}^b$ and M with the trust region optimization procedure.
7:   Compute $\hat{T}_{LRT}^b$ as in (4.4) using $\tilde{W}^b$, M, $\tilde{\theta}^b$ and $\tilde{\theta}_0^b$.
8: Calculate the p-value:
   $\hat{p} = \frac{1}{B + 1} \left( 1 + \sum_{b=1}^{B} 1\{ \hat{T}_{LRT}^b \geq \hat{T}_{LRT} \} \right)$.
9: return $\hat{p}$
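
The structure of Algorithm 2 can be illustrated with a deliberately simplified stand-in. The sketch below is not the corncob procedure: it swaps the beta-binomial model for a plain binomial model with one success probability per group, so that the maximum likelihood estimates under the full and null models are available in closed form and no trust region optimization is needed. All function names and data are hypothetical. What carries over from Algorithm 2 is the loop: fit the null, simulate from the fitted null, recompute the likelihood ratio statistic on each simulated data set, and compute the p-value as in step 8.

```python
import math
import random


def xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binom_loglik(W, M, p):
    """Binomial log-likelihood, dropping the constant term log C(M_i, W_i)."""
    return sum(xlogy(w, p) + xlogy(m - w, 1.0 - p) for w, m in zip(W, M))


def lrt_statistic(W, M, group):
    """Return (2 * (full - null) log-likelihood, fitted null probability)."""
    p_null = sum(W) / sum(M)
    null = binom_loglik(W, M, p_null)
    full = 0.0
    for g in (0, 1):
        Wg = [w for w, x in zip(W, group) if x == g]
        Mg = [m for m, x in zip(M, group) if x == g]
        full += binom_loglik(Wg, Mg, sum(Wg) / sum(Mg))
    return 2.0 * (full - null), p_null


def bootstrap_lrt_pvalue(W, M, group, B=200, seed=0):
    """Parametric bootstrap p-value for H0: equal success probability in both groups."""
    rng = random.Random(seed)
    t_obs, p_null = lrt_statistic(W, M, group)
    exceed = 0
    for _ in range(B):
        # Simulate counts from the model fitted under the null hypothesis.
        W_sim = [sum(rng.random() < p_null for _ in range(m)) for m in M]
        t_sim, _ = lrt_statistic(W_sim, M, group)
        exceed += t_sim >= t_obs
    return (1 + exceed) / (B + 1)


# Hypothetical counts: group 1 has a clearly higher proportion than group 0.
W_demo = [5, 6, 4, 5, 30, 28, 33, 31]
M_demo = [100] * 8
group_demo = [0, 0, 0, 0, 1, 1, 1, 1]
p_boot = bootstrap_lrt_pvalue(W_demo, M_demo, group_demo, B=200, seed=0)
```

Because the p-value in step 8 adds one to both the numerator and the denominator, it can never be smaller than 1/(B + 1); this is why a large B is recommended when small p-values are of interest.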

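Lemma D.1. Consider the model (3.2)–(3.3) with parameters as in (3.4)–(3.6) and link functions as in (3.9)–(3.10) in the simplified setting with no covariates for $\mu_i$, so that $\theta = (\beta_0, \beta_0^*, \beta^{*T})^T$. Suppose that $\sum_i W_i = 0$. Then

$$\sup_{\beta_0} \log L(\theta \mid W, M) = 0.$$

Proof. We write the log-likelihood

$$\log L(\theta \mid W, M) = \sum_{i=1}^n \log \left[ \binom{M_i}{W_i} \frac{B\!\left( \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{-\beta_0}} + W_i,\; \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{\beta_0}} + M_i - W_i \right)}{B\!\left( \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{-\beta_0}},\; \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{\beta_0}} \right)} \right].$$

Substituting $W_i = 0$, using the definition of $B(\cdot, \cdot)$, and taking the limit as $\beta_0 \to -\infty$ gives

$$\begin{aligned} \lim_{\beta_0 \to -\infty} \log L(\theta \mid W, M) = \lim_{\beta_0 \to -\infty} \sum_{i=1}^n \Bigg\{ &\log \Gamma\!\left( \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{\beta_0}} + M_i \right) + \log \Gamma\!\left( \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{-\beta_0}} + \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{\beta_0}} \right) \\ &- \log \Gamma\!\left( \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{-\beta_0}} + \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{\beta_0}} + M_i \right) - \log \Gamma\!\left( \frac{e^{-\beta_0^* - X_i^{*T}\beta^*}}{1 + e^{\beta_0}} \right) \Bigg\} = 0 \geq \sup_{\beta_0} \log L(\theta \mid W, M), \end{aligned}$$

where the last inequality is because the log-likelihood associated with a discrete distribution cannot exceed 0.

Therefore,

$$\sup_{\beta_0} \log L(\theta \mid W, M) = \lim_{\beta_0 \to -\infty} \log L(\theta \mid W, M) = 0.$$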
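
The limit in Lemma D.1 can be checked numerically. The sketch below evaluates the four log-gamma terms of the log-likelihood with every Wi = 0 for increasingly negative values of β0; it is an illustration under the parameterization written in the proof, with hypothetical values for Mi and for the linear predictor β0* + Xi*Tβ* (called `eta_star` in the code, a name introduced here for convenience).

```python
import math


def zero_count_loglik(beta0, eta_star, M):
    """Beta-binomial log-likelihood when every W_i = 0.

    eta_star[i] stands for beta0* + X_i*^T beta*, so that
    exp(-eta_star[i]) = (1 - phi_i) / phi_i under a logit link for phi_i.
    """
    total = 0.0
    for eta, m in zip(eta_star, M):
        scale = math.exp(-eta)
        a1 = scale / (1.0 + math.exp(-beta0))  # mu_i (1 - phi_i) / phi_i
        a2 = scale / (1.0 + math.exp(beta0))   # (1 - mu_i)(1 - phi_i) / phi_i
        total += (math.lgamma(a2 + m) + math.lgamma(a1 + a2)
                  - math.lgamma(a1 + a2 + m) - math.lgamma(a2))
    return total


# Hypothetical sequencing depths and linear predictors for three samples.
M_demo = [2000, 1500, 1800]
eta_star_demo = [-3.0, -3.0, -2.0]
values = [zero_count_loglik(b0, eta_star_demo, M_demo) for b0 in (-2.0, -6.0, -10.0, -14.0)]
```

In this example the log-likelihood increases toward 0 as β0 decreases but does not attain it for any finite β0, which is why the statement of the lemma involves a supremum rather than a maximum.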

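Theorem D.2. Consider the model (3.2)–(3.3) with parameters as in (3.4)–(3.6) and link functions as in (3.9)–(3.10). Assume that $k = k^* = 1$ and $X_i = X_i^* \in \{0, 1\}$ for $i = 1, \ldots, n$. Suppose that $\sum_{i : X_i = 0} W_i = 0$ and $\sum_{i : X_i = 1} W_i > 0$. Then the likelihood ratio test statistic for testing the null hypothesis that $\beta_1^* = 0$ is equal to 0.

Proof. Let $L_i$ represent the likelihood of the $i$th sample, so that

$$\sum_{i=1}^n \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* \mid W_i, M_i) \equiv \log L(\beta_0, \beta_1, \beta_0^*, \beta_1^* \mid W, M). \tag{D.1}$$

We wish to show that the likelihood ratio test statistic satisfies

$$\sup_{\beta_0, \beta_1, \beta_0^*, \beta_1^*} \sum_{i=1}^n \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* \mid W_i, M_i) - \sup_{\beta_0, \beta_1, \beta_0^*} \sum_{i=1}^n \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* = 0 \mid W_i, M_i) = 0. \tag{D.2}$$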

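First, we notice that for all i, the parameters β0* and β1* enter the likelihood Li(·) only through the term β0* + β1*Xi. This term is equal to β0* for all i such that Xi = 0, and β0* + β1* for all i such that Xi = 1. Similarly, the parameters β0 and β1 enter the likelihood Li(·) only through the term β0 for all i such that Xi = 0, and β0 + β1 for all i such that Xi = 1. Therefore, we can write the first term in (D.2) as the sum of two sub-problems,

$$\begin{aligned} \sup_{\beta_0, \beta_1, \beta_0^*, \beta_1^*} \sum_{i=1}^n \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* \mid W_i, M_i) &= \sup_{\beta_0, \beta_1, \beta_0^*, \beta_1^*} \sum_{i : X_i = 0} \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* \mid W_i, M_i) + \sup_{\beta_0, \beta_1, \beta_0^*, \beta_1^*} \sum_{i : X_i = 1} \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* \mid W_i, M_i) \\ &= \sup_{\beta_0, \beta_1, \beta_0^*, \beta_1^*} \sum_{i : X_i = 1} \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* \mid W_i, M_i), \end{aligned}$$

where the last equality results from Lemma D.1. Similarly, we can write the second term in (D.2),

$$\sup_{\beta_0, \beta_1, \beta_0^*} \sum_{i=1}^n \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* = 0 \mid W_i, M_i) = \sup_{\beta_0, \beta_1, \beta_0^*} \sum_{i : X_i = 1} \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* = 0 \mid W_i, M_i).$$

Thus, to show (D.2), it suffices to show that

$$\sup_{\beta_0, \beta_1, \beta_0^*, \beta_1^*} \sum_{i : X_i = 1} \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* \mid W_i, M_i) - \sup_{\beta_0, \beta_1, \beta_0^*} \sum_{i : X_i = 1} \log L_i(\beta_0, \beta_1, \beta_0^*, \beta_1^* = 0 \mid W_i, M_i) = 0.$$

This follows directly from the fact that the parameters β0* and β1* enter the likelihood Li(·) only through the term β0* + β1* for all i such that Xi = 1, so that fixing β1* = 0 does not restrict the values this term can take. This completes our proof of Theorem D.2. □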

APPENDIX E: QUANTILES OF TYPE I ERROR RATE SIMULATIONS

We present Table 2 to display sample quantiles from the type I error rate simulations discussed in Section 5.1. We show the 5, 25, 50, 75 and 95% quantiles of each test.

Table 2.

The 5, 25, 50, 75 and 95% quantiles of p-values from the type I error rate simulations discussed in Section 5.1. The Wald and LRT procedures use a χ2 distribution to approximate the distributions of the test statistics (4.1) and (4.4), respectively. The PB Wald and PB LRT procedures are the parametric bootstrap hypothesis testing procedures discussed in Algorithms 1 and 2, respectively.

Null hypothesis      Sample size  Procedure  5%     25%    50%    75%    95%

(β1, β1*) = (0, 0)   10           Wald       0.002  0.089  0.329  0.647  0.929
(β1, β1*) = (0, 0)   10           PB Wald    0.046  0.246  0.499  0.749  0.952
(β1, β1*) = (0, 0)   10           LRT        0.013  0.132  0.360  0.655  0.929
(β1, β1*) = (0, 0)   10           PB LRT     0.049  0.247  0.496  0.748  0.953
(β1, β1*) = (0, 0)   30           Wald       0.027  0.202  0.453  0.713  0.941
(β1, β1*) = (0, 0)   30           PB Wald    0.050  0.250  0.497  0.739  0.947
(β1, β1*) = (0, 0)   30           LRT        0.035  0.213  0.458  0.716  0.941
(β1, β1*) = (0, 0)   30           PB LRT     0.049  0.250  0.498  0.741  0.947
(β1, β1*) = (0, 0)   100          Wald       0.039  0.234  0.485  0.738  0.947
(β1, β1*) = (0, 0)   100          PB Wald    0.047  0.247  0.498  0.745  0.949
(β1, β1*) = (0, 0)   100          LRT        0.042  0.238  0.486  0.738  0.947
(β1, β1*) = (0, 0)   100          PB LRT     0.046  0.247  0.498  0.744  0.949
β1* = 0              10           Wald       0.006  0.124  0.381  0.688  0.934
β1* = 0              10           PB Wald    0.039  0.242  0.497  0.754  0.949
β1* = 0              10           LRT        0.014  0.146  0.393  0.689  0.934
β1* = 0              10           PB LRT     0.048  0.248  0.502  0.752  0.948
β1* = 0              30           Wald       0.031  0.212  0.472  0.735  0.949
β1* = 0              30           PB Wald    0.047  0.245  0.499  0.751  0.952
β1* = 0              30           LRT        0.035  0.215  0.473  0.735  0.949
β1* = 0              30           PB LRT     0.048  0.245  0.499  0.750  0.953
β1* = 0              100          Wald       0.045  0.244  0.487  0.740  0.947
β1* = 0              100          PB Wald    0.051  0.254  0.496  0.744  0.947
β1* = 0              100          LRT        0.047  0.245  0.488  0.740  0.947
β1* = 0              100          PB LRT     0.050  0.253  0.494  0.743  0.950
β1 = 0               10           Wald       0.002  0.123  0.387  0.685  0.935
β1 = 0               10           PB Wald    0.048  0.248  0.498  0.745  0.949
β1 = 0               10           LRT        0.015  0.157  0.401  0.687  0.935
β1 = 0               10           PB LRT     0.047  0.251  0.497  0.746  0.949
β1 = 0               30           Wald       0.030  0.216  0.470  0.740  0.953
β1 = 0               30           PB Wald    0.050  0.246  0.496  0.755  0.955
β1 = 0               30           LRT        0.037  0.221  0.472  0.741  0.953
β1 = 0               30           PB LRT     0.049  0.246  0.496  0.752  0.955
β1 = 0               100          Wald       0.044  0.240  0.492  0.747  0.952
β1 = 0               100          PB Wald    0.052  0.249  0.499  0.752  0.953
β1 = 0               100          LRT        0.048  0.242  0.492  0.746  0.952
β1 = 0               100          PB LRT     0.049  0.248  0.501  0.752  0.953

APPENDIX F: HYPOTHESIS TESTING RESULTS FOR EXAMPLE GENERA

We present Table 3 to display the results from the hypothesis tests conducted on the Thermomonas, Flavisolibacter and Myxococcus genera discussed in Section 6.2.

Table 3.

Results from the hypothesis tests conducted on the genera discussed in Section 6.2

Genus            Method                 p-value

Thermomonas      corncob (H0 : β = 0)   1.00 × 10^{-6}
Thermomonas      corncob (H0 : β* = 0)  1.00 × 10^{-6}
Thermomonas      DESeq2                 3.58 × 10^{-15}
Thermomonas      edgeR                  4.96 × 10^{-13}
Thermomonas      metagenomeSeq          0.0543
Thermomonas      ZIB                    0.00163
Flavisolibacter  corncob (H0 : β = 0)   0.000735
Flavisolibacter  corncob (H0 : β* = 0)  0.403
Flavisolibacter  DESeq2                 0.00387
Flavisolibacter  edgeR                  0.143
Flavisolibacter  metagenomeSeq          0.759
Flavisolibacter  ZIB                    0.000941
Myxococcus       corncob (H0 : β = 0)   0.243
Myxococcus       corncob (H0 : β* = 0)  0.00871
Myxococcus       DESeq2                 0.0018
Myxococcus       edgeR                  0.0017
Myxococcus       metagenomeSeq          0.0016
Myxococcus       ZIB                    0.0015

Footnotes

SUPPLEMENTARY MATERIAL

Supplement A: corncob R package (DOI: 10.1214/19-AOAS1283SUPPA; .zip). We provide an R package implementing all methods proposed in this paper.

Supplement B: Figure code (DOI: 10.1214/19-AOAS1283SUPPB; .zip). We provide code to reproduce all simulations and data analyses in this paper.

REFERENCES

  1. Aerts M, Molenberghs G, Geys H and Ryan LM (2002). Topics in Modelling of Clustered Data. CRC Press/CRC, Boca Raton, FL.
  2. Aitchison J (1986). The Statistical Analysis of Compositional Data. Monographs on Statistics and Applied Probability. CRC Press, London. MR0865647. doi:10.1007/978-94-009-4109-0
  3. Albert A and Anderson JA (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika 71 1–10. MR0738319. doi:10.1093/biomet/71.1.1
  4. Bastedo MN and Jaquette O (2011). Running in place: Low-income students and the dynamics of higher education stratification. Educ. Eval. Policy Anal 33 318–339.
  5. Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300. MR1325392
  6. Callahan BJ, DiGiulio DB, Goltsman DSA, Sun CL, Costello EK, Jeganathan P, Biggio JR, Wong RJ, Druzin ML et al. (2017). Replication and refinement of a vaginal microbial signature of preterm birth in two racially distinct cohorts of US women. Proc. Natl. Acad. Sci. USA 114 9966–9971.
  7. Cao Y, Zhang A and Li H (2017). Microbial composition estimation from sparse count data. Preprint. Available at arXiv:1706.02380.
  8. Chai H, Jiang H, Lin L and Liu L (2018). A marginalized two-part Beta regression model for microbiome compositional data. PLoS Comput. Biol 14 e1006329.
  9. Chen J and Li H (2013). Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. Ann. Appl. Stat 7 418–442. MR3086425. doi:10.1214/12-AOAS592
  10. Chen EZ and Li H (2016). A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics 32 2611–2617.
  11. Chen L, Reeve J, Zhang L, Huang S, Wang X and Chen J (2018). GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data. PeerJ 6 e4600.
  12. Dethlefsen L and Relman DA (2011). Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation. Proc. Natl. Acad. Sci. USA 108 4554–4561.
  13. DiGiulio DB, Callahan BJ, McMurdie PJ, Costello EK, Lyell DJ, Robaczewska A, Sun CL, Goltsman DSA, Wong RJ et al. (2015). Temporal and spatial variation of the human microbiota during pregnancy. Proc. Natl. Acad. Sci. USA 112 11060–11065.
  14. Dolzhenko E and Smith AD (2014). Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinform. 15 215. doi:10.1186/1471-2105-15-215
  15. Edgar RC (2013). UPARSE: Highly accurate OTU sequences from microbial amplicon reads. Nat. Methods 10 996–998. doi:10.1038/nmeth.2604
  16. Fang R, Wagner BD, Harris JK and Fillon SA (2016). Zero-inflated negative binomial mixed model: An application to two microbial organisms important in oesophagitis. Epidemiol. Infect 144 2447–2455.
  17. Faust K, Lahti L, Gonze D, de Vos WM and Raes J (2015). Metagenomics meets time series analysis: Unraveling microbial community dynamics. Curr. Opin. Microbiol 25 56–66. doi:10.1016/j.mib.2015.04.004
  18. Fiacco AV and McCormick GP (1968). Nonlinear Programming: Sequential Unconstrained Minimization Techniques. Wiley, New York. MR0243831
  19. Fletcher R (1987). Practical Methods of Optimization, 2nd ed. Wiley, Chichester. MR0955799
  20. Gerber GK (2014). The dynamic microbiome. FEBS Lett. 588 4131–4139.
  21. Gevers D, Kugathasan S, Denson LA, Vázquez-Baeza Y, Van Treuren W, Ren B, Schwager E, Knights D, Song SJ et al. (2014). The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 15 382–392.
  22. Geyer CJ (2015). trust: Trust region optimization. R package version 0.1–7.
  23. Grice EA (2014). The skin microbiome: Potential for novel diagnostic and therapeutic approaches to cutaneous disease. Semin. Cutan. Med. Surg 33 98.
  24. Halfvarson J, Brislawn CJ, Lamendella R, Vázquez-Baeza Y, Walters WA, Bramer LM, D’Amato M, Bonfiglio F, McDonald D et al. (2017). Dynamics of the human gut microbiome in inflammatory bowel disease. Nat. Microbiol 2 17004. doi:10.1038/nmicrobiol.2017.4
  25. Heinze G (2006). A comparative investigation of methods for logistic regression with separated or nearly separated data. Stat. Med 25 4216–4226. MR2307586. doi:10.1002/sim.2687
  26. Heinze G and Schemper M (2002). A solution to the problem of separation in logistic regression. Stat. Med 21 2409–2419.
  27. Hill-Burns EM, Debelius JW, Morton JT, Wissemann WT, Lewis MR, Wallen ZD, Peddada SD, Factor SA, Molho E et al. (2017). Parkinson’s disease and Parkinson’s disease medications have distinct signatures of the gut microbiome. Mov. Disord 32 739–749. doi:10.1002/mds.26942
  28. Holmes I, Harris K and Quince C (2012). Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLoS ONE 7 e30126. doi:10.1371/journal.pone.0030126
  29. Hooks KB and O’Malley MA (2017). Dysbiosis and its discontents. mBio 8 e01492–17. doi:10.1128/mBio.01492-17
  30. Kleinman JC (1973). Proportions with extraneous variance: Single and independent samples. J. Amer. Statist. Assoc 68 46–54.
  31. Kosmidis I (2018). brglm2: Bias reduction in generalized linear models. R package version 0.1.8.
  32. Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ and Bonneau RA (2015). Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput. Biol 11 e1004226. doi:10.1371/journal.pcbi.1004226
  33. Law CW, Chen Y, Shi W and Smyth GK (2014). voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15 R29.
  34. La Rosa PS, Brooks JP, Deych E, Boone EL, Edwards DJ, Wang Q, Sodergren E, Weinstock G and Shannon WD (2012). Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS ONE 7 e52078.
  35. Li Z, Lee K, Karagas MR, Madan JC, Hoen AG, O’Malley AJ and Li H (2018). Conditional regression based on a multivariate zero-inflated logistic-normal model for microbiome relative abundance data. Stat. Biosci 10 587–608.
  36. Love MI, Huber W and Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15 550. doi:10.1186/s13059-014-0550-8
  37. Mandal S, Van Treuren W, White RA, Eggesbø M, Knight R and Peddada SD (2015). Analysis of composition of microbiomes: A novel method for studying microbial composition. Microb. Ecol. Health Dis 26 27663.
  38. Martin BD, Witten D and Willis AD (2020a). Supplement A to “Modeling microbial abundances and dysbiosis with beta-binomial regression.” doi:10.1214/19-AOAS1283SUPPA
  39. Martin BD, Witten D and Willis AD (2020b). Supplement B to “Modeling microbial abundances and dysbiosis with beta-binomial regression.” doi:10.1214/19-AOAS1283SUPPB
  40. McCullagh P and Nelder JA (1989). Generalized Linear Models. Monographs on Statistics and Applied Probability. CRC Press, London. MR3223057. doi:10.1007/978-1-4899-3242-6
  41. McMurdie PJ and Holmes S (2013). phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8 e61217.
  42. McMurdie PJ and Holmes S (2014). Waste not, want not: Why rarefying microbiome data is inadmissible. PLoS Comput. Biol 10 e1003531.
  43. Mercer LD, Wakefield J, Pantazis A, Lutambi AM, Masanja H and Clark S (2015). Space-time smoothing of complex survey data: Small area estimation for child mortality. Ann. Appl. Stat 9 1889–1905. MR3456357. doi:10.1214/15-AOAS872
  44. Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N et al. (2012). Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 13 R79. doi:10.1186/gb-2012-13-9-r79
  45. Morgan XC, Kabakchiev B, Waldron L, Tyler AD, Tickle TL, Milgrom R, Stempak JM, Gevers D, Xavier RJ et al. (2015). Associations between host gene expression, the mucosal microbiome, and clinical outcome in the pelvic pouch of patients with inflammatory bowel disease. Genome Biol. 16 67. doi:10.1186/s13059-015-0637-x
  46. Nocedal J and Wright SJ (1999). Numerical Optimization. Springer Series in Operations Research. Springer, New York. MR1713114. doi:10.1007/b98874
  47. Parker IM, Saunders M, Bontrager M, Weitz AP, Hendricks R, Magarey R, Suiter K and Gilbert GS (2015). Phylogenetic structure and host abundance drive disease pressure in communities. Nature 520 542–544. doi:10.1038/nature14372
  48. Paulson JN, Stine OC, Bravo HC and Pop M (2013). Differential abundance analysis for microbial marker-gene surveys. Nat. Methods 10 1200–1202.
  49. Peng X, Li G and Liu Z (2016). Zero-inflated beta regression for differential abundance analysis with metagenomics data. J. Comput. Biol 23 102–110.
  50. Petersen C and Round JL (2014). Defining dysbiosis and its influence on host immunity and disease. Cell. Microbiol. 16 1024–1033. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Poussin C, Sierro N, Boué S, Battey J, Scotti E, Belcastro V, Peitsch MC, IVANOV NV and Hoeng J (2018). Interrogating the microbiome: Experimental and computational considerations in support of study reproducibility. Drug Discov. Today 23 1644–1657. 10.1016/j.drudis.2018.06.005 [DOI] [PubMed] [Google Scholar]
  52. Prentice RL (1986). Binary regression using an extended beta-binomial distribution, with discussion of correlation induced by covariate measurement errors. J. Amer. Statist. Assoc 81 321–327. [Google Scholar]
  53. Qin N, Yang F, Li a., Prifti E, Chen Y, Shao L, Guo j., Le Chatelier E, Yao J et al. (2014). Alterations of the human gut microbiome in liver cirrhosis. Nature 513 59. [DOI] [PubMed] [Google Scholar]
  54. R CORE TEAM (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Google Scholar]
  55. Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: A bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26 139–140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Robinson MD and Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 11 R25 10.1186/gb-2010-11-3-r25 [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Ryan DM (1974). Penalty and barrier functions In Numerical Methods for Constrained Optimization (Proc. Sympos, National Physical Lab, Teddington, 1974) 175–190. MR0456505 [Google Scholar]
  58. Sankaran K and Holmes SP (2017). Latent variable modeling for the microbiome. Preprint. Available at arXiv:1706.04969. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Segata N, Izard j., Waldron L, Gevers D, Miropolsky L, Garrett WS and Huttenhower C (2011). Metagenomic biomarker discovery and explanation. Genome Biol. 12 R60 10.1186/gb-2011-12-6-r60 [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Sender R, Fuchs S and Milo R (2016). Revised estimates for the number of human and bacteria cells in the body. PLoS Biol. 14 e1002533 10.1371/journal.pbio.1002533 [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Shi B, Chang M, Martin j., Mitreva M, Lux R, Klokkevold P, Sodergren E, Weinstock GM, Haake SK et al. (2015). Dynamic changes in the subgingival microbiome and their potential for diagnosis and prognosis of periodontitis. mBio 6 e01926–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Skellam JG (1948). A probability distribution derived from the binomial distribution by regarding the probability of success as variable between the sets of trials. J. R. Stat. Soc. Ser. B. Stat. Methodol 10 257–261. MR0028539 [Google Scholar]
  63. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM and Herndl GJ (2006). Microbial diversity in the deep sea and the underexplored “rare biosphere.” Proc. Natl. Acad. Sci. USA 103 12115–12120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Sohn MB, Du R and An L (2015). A robust approach for identifying differentially abundant features in metagenomic samples. Bioinformatics 31 2269–2275. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Tamboli CP, Neut C, Desreumaux P and Colombel JF (2004). Dysbiosis in inflammatory bowel disease. Gut 53 1–4. doi:10.1136/gut.53.1.1
  66. Tromas N, Taranu ZE, Martin BD, Willis A, Fortin N, Greer CW and Shapiro BJ (2018). Niche separation increases with genetic distance among bloom-forming cyanobacteria. Front. Microbiol. 9 438. doi:10.3389/fmicb.2018.00438
  67. Wagner B, Riggs P and Mikulich-Gilbertson S (2015). The importance of distribution-choice in modeling substance use data: A comparison of negative binomial, beta binomial, and zero-inflated distributions. Am. J. Drug Alcohol Abuse 41 489–497.
  68. Wahba G, Wang Y, Gu C, Klein R and Klein B (1995). Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy. Ann. Statist. 23 1865–1895. MR1389856. doi:10.1214/aos/1034713638
  69. Welch JLM, Rossetti BJ, Rieken CW, Dewhirst FE and Borisy GG (2016). Biogeography of a human oral microbiome at the micron scale. Proc. Natl. Acad. Sci. USA 113 E791–E800.
  70. White JR, Nagarajan N and Pop M (2009). Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput. Biol. 5 e1000352. doi:10.1371/journal.pcbi.1000352
  71. Whitman T, Pepe-Ranney C, Enders A, Koechli C, Campbell A, Buckley DH and Lehmann J (2016). Dynamics of microbial community composition and soil organic carbon mineralization in soil following addition of pyrogenic and fresh organic matter. ISME J. 10 2918–2930. doi:10.1038/ismej.2016.68
  72. Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer, New York.
  73. Williams DA (1975). 394: The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics 31 949–952.
  74. Willis AD and Martin BD (2018). DivNet: Estimating diversity in networked communities. BioRxiv 305045.
  75. Xia F, Chen J, Fung WK and Li H (2013). A logistic normal multinomial regression model for microbiome compositional data analysis. Biometrics 69 1053–1063. MR3146800. doi:10.1111/biom.12079
  76. Yee TW (2010). The VGAM package for categorical data analysis. J. Stat. Softw. 32 1–34.
  77. Zhang X, Mallick H, Tang Z, Zhang L, Cui X, Benson AK and Yi N (2017). Negative binomial mixed models for analyzing microbiome count data. BMC Bioinform. 18 4.
  78. Zhou Y, Shan G, Sodergren E, Weinstock G, Walker WA and Gregory KE (2015). Longitudinal analysis of the premature infant intestinal microbiome prior to necrotizing enterocolitis: A case-control study. PLoS ONE 10 e0118632. doi:10.1371/journal.pone.0118632

Supplementary Materials

supplement part 1
supplement part 2