Summary
There is heightened interest in using high-throughput sequencing technologies to quantify abundances of microbial taxa and linking the abundance to human diseases and traits. Proper modeling of multivariate taxon counts is essential to the power of detecting this association. Existing models are limited in handling excessive zero observations in taxon counts and in flexibly accommodating complex correlation structures and dispersion patterns among taxa. In this article, we develop a new probability distribution, zero-inflated generalized Dirichlet multinomial (ZIGDM), that overcomes these limitations in modeling multivariate taxon counts. Based on this distribution, we propose a ZIGDM regression model to link microbial abundances to covariates (e.g. disease status) and develop a fast expectation–maximization algorithm to efficiently estimate parameters in the model. The derived tests enable us to reveal rich patterns of variation in microbial compositions including differential mean and dispersion. The advantages of the proposed methods are demonstrated through simulation studies and an analysis of a gut microbiome dataset.
Keywords: Compositional data analysis, Differential abundance, Hierarchical model, Microbiome, Score test, Zero-inflated model
1. Introduction
The human microbiome is the microbe population living in and on the human body. The importance of microbiome in human health and disease has been increasingly recognized. Research in this area has begun to discover relationships between human microbiome and many complex diseases such as cancer, diabetes, psoriasis, and obesity (Cho and Blaser, 2012; Cho and others, 2012; Qin and others, 2012; Ahn and others, 2013; Alekseyenko and others, 2013). Advances in high-throughput sequencing technologies allow scientists to accurately quantify a wide variety of microbes. In most microbiome studies, the data come from high-throughput sequencing of the 16S ribosomal ribonucleic acid (rRNA) gene. This gene is the target in sequencing experiments because it is ubiquitous in the bacterial kingdom, comprised of conserved and variable regions, and evolves at relative constant rates (Kuczynski and others, 2012). These features allow us to determine the types and abundances of different microbes in a sample with high sensitivity. By comparing sequencing reads to a reference 16S rRNA database, each sequence can be assigned to a series of taxonomic identities (i.e. lineage) at the levels of kingdom, phylum, class, order, family, and genus (DeSantis and others, 2006, Cole and others, 2007, Liu and others, 2008, Jovel and others, 2016). Therefore, for each taxonomic level, the final data from one sample can be summarized as a vector of read counts, referred to as taxon counts, with each component corresponding to a taxon at that particular level. The total taxon counts for each sample is bounded by the sequencing depth that does not reflect the actual microbial load in the sample. Thus, the taxon counts are compositional data and only reflect the relative abundances (i.e. proportions) of different microorganisms.
Linking multivariate taxon counts to covariates of interest (as a special case, detecting differential abundance between groups) is the common practice in microbiome studies of human disease. The quasi-conditional association test (QCAT) does not make any distributional assumption on taxon counts (Tang and others, 2017). This non-parametric test is very robust to the underlying correlation structures, but can be less powerful than parametric tests based on suitable distributional assumptions. In addition, the test is underpowered in the presence of excessive zeros in taxon counts. In the microbiome data, most taxa exhibit zero-inflation with zero observations representing the absence of the taxa in the samples (often referred to as structural zeros) or the undersampling of the rare taxa (often referred to as sampling zeros) (Li, 2015). The two-part version of QCAT models positive counts and zero counts separately and combines association evidence from both parts (Tang and others, 2017). This test is more powerful than regular QCAT if taxon counts are zero-inflated. However, the two-part test does not distinguish different types of zeros. Indeed, it is not possible to disentangle structural zeros and sampling zeros unless fitting a good parametric model to the data.
The power of parametric tests heavily depends on how well the adopted probability distribution fits the data. The Dirichlet multinomial (DM) distribution is commonly used to model taxon counts. In the DM, a vector of counts follows the multinomial distribution with underlying proportion parameters sampled from a Dirichlet distribution. The DM has been shown to fit microbiome data better than the multinomial distribution and many statistical methods have been developed based on the DM (La Rosa and others, 2012; Chen and Li, 2013; Wadsworth and others, 2017; Wang and Zhao, 2017). In particular, two DM-based association tests exist: one tests for differential mean and the other tests for both differential mean and dispersion (La Rosa and others, 2012).
The test for differential mean assumes homogeneous dispersion levels between groups, but many scenarios can lead to deviations from this assumption. It is well-known that microbiome compositions are very dynamic—the abundance of one microbe can be altered by the presence/absence of other microbes that cooperate or compete with that microbe, as well as by changes in environmental factors such as temperature, host diet, and lifestyle (Gilbert and others, 2016). Therefore, the disease-microbe association can be moderated by many factors, resulting in heterogeneous dispersion levels between the disease and control groups. A taxon with heterogeneous dispersion could represent a previously undetected interaction with other taxa or environmental factors. Consequently, ignoring the differential dispersion misses the opportunity of identifying these important taxa.
Several recent studies have shown that the DM may not be adequate for microbiome data (O’Brien and others, 2016, Sankaran and Holmes, 2017, Shi and Li, 2017, Tang and others, 2017). Specifically, the DM model intrinsically imposes a negative correlation among taxon counts, whereas the actual data display both positive and negative correlations (Mandal and others, 2015). In addition, the DM has only one dispersion parameter so that it cannot flexibly handle various dispersion patterns and zero-inflation levels among multiple taxa.
Generalized DM (GDM) distribution addresses one key limitation of the DM by allowing more general covariance structure. Comparing to the DM, the GDM chooses a more flexible distribution, the Generalized Dirichlet (GD) (Connor and Mosimann, 1969), as a prior for the multinomial. The GDM model has not been applied to microbiome data. It is not clear if the additional parameters in the GDM are necessary for modeling microbiome data and if the GDM can handle excessive zeros in taxon counts. In addition, the complex form of the GDM probability density function renders numerical challenges in statistical inference (Zhang and others, 2017).
In this article, we develop a new probability distribution—zero-inflated GDM (ZIGDM)—for modeling microbiome compositional data that includes the GDM as a special case. In contrast to the DM model, the ZIGDM has additional parameters to flexibly accommodate the over-dispersion and zero-inflation of the data. We focus on demonstrating the usefulness of this additional flexibility in microbiome association analysis. We propose to use the ZIGDM regression model to link the mean and dispersion levels of the microbial abundance to the covariates of interest. Based on the ZIGDM regression, we derive association tests for detecting differential mean and dispersion. Like the GDM, the ZIGDM model does not belong to the natural exponential family and the parameter estimation is not simple. To meet this challenge, we develop a fast expectation–maximization (EM) algorithm. We use extensive simulation studies to evaluate the performance of these tests. An application to a gut microbiome data leads to a discovery of differential microbial sub-compositions between the body mass index (BMI) groups. The ZIGDM and GDM models fit the gut microbiome data significantly better than the DM model.
2. GD and ZIGD
2.1. GD model for random proportions
We consider
taxa in the microbial composition with
a total of
sequencing reads. We denote the vector of
counts as
with
, and the
underlying unobserved proportions as
with
. To flexibly
model random proportions with a complex correlation structure, Connor and Mosimann (1969) proposed the Generalized Dirichlet (GD)
distribution. Specifically, the GD random proportions
can be constructed
from a set of mutually independent Beta variables
as
![]() |
(2.1) |
This GD construction in (2.1)
resembles the stick-breaking process for constructing a truncated Dirichlet process (Ishwaran and James, 2001). The density function for
can be written as
where
and
are two
parameters in the Beta distribution and
is
the Beta function. Applying the transformation, we obtain the density function of
as
![]() |
(2.2) |
where
for
and
. This distribution is
referred to as GD
, where
and
. Conversely, a
set of mutually independent Beta variables can be derived from a GD random proportions
by the transformation
![]() |
(2.3) |
The GDM is given by using the GD as a prior for the multinomial distribution. Wong (1998) has shown that the GD is a conjugate prior
for the multinomial. Specifically, suppose
follows the
multinomial distribution with a GD
prior on the proportion
parameters
, the posterior probability of
is a
GD
, where
,
,
and
,
. The GDM has been applied to
model multivariate count data with complex correlation structures such as in RNA-seq data
analysis (Zhang and others, 2017).
However, its usefulness in analyzing microbiome data has not been evaluated.
2.2. ZIGD model for random proportions with zero components for absent taxa
The GD model assumes all taxa have positive proportions (i.e. are present in the sample)
and the observed zeros in
are sampling zeros. To model absent
taxa (i.e. structural zeros), we assume
follows zero-inflated
Beta (ZIB) distribution with parameters
, where
represents the probability of
. We, then, transform these
’s to
’s using (2.1), and refer to the resulting
distribution of
as ZIGD
,
where
.
Clearly,
is equivalent to
based on their relationship (2.1). We let
denote
the indicator function. The variable
follows
and indicates the
absence and presence of the taxon
by values of 1 and 0, respectively. If we
assume all taxa are present
, the ZIGD
becomes the GD. Suppose we have
taxa present in the sample. We let
denote the set of indexes for these taxa (i.e.
).
The complement set
includes the
indexes for the absent taxa. Suppose we observe
taxa with zero counts. We
let
denote the set of indexes for these taxa (i.e.
) and
denote the
complement of set
. Note that the two sets
and
are not exclusive: their
intersection
indexed taxa that are present in the sample but have zero counts due to the undersampling
in the sequencing experiment (i.e. sampling zeros).
We can use the ZIGD as a prior for the multinomial and term the resulting marginal
distribution for the counts as ZIGDM. In the remaining content of this section, we show
that the ZIGD is a conjugate prior for the multinomial. In the derivation, we let
denote the sub-vector of
vector
defined by the index set
. Suppose
follows a multinomial with a
ZIGD
prior on the proportion parameters
. We let
.
Because
, we have
,
hence, the posterior probability of the proportions given observed counts can be expressed
as
![]() |
(2.4) |
Because
when
(a
taxon has the proportion of 0 if it is absent from a sample), we have
.
Because
follows the GD
and the GD is a conjugate prior for the multinomial, the posterior probability
follows the GD
.
Because
when
(a taxon
is present in the sample if its count is positive), we have
.
The mass function for
given
can be derived as follows
![]() |
(2.5) |
The last equality holds because
if and only if
by definition. Hence,
the posterior probability
follows a ZIGD with
zero-inflation on the taxa having observed zero counts and the probability of the observed
zero being structural zero is
(
).
3. ZIGDM regression model
Suppose we have
subjects measured on
taxa. We let
and
denote the observed count and the
underlying true proportion for taxon
in subject
, and
denote the
-dimensional
vector including a unit component for the intercept, covariates of interest, and confounding
variables. We assume the count vector
follows
the ZIGDM(
,
,
), where
,
, and
. As
described in the previous section, the equivalent hierarchical model can be expressed as
![]() |
(3.1) |
Under this model,
,
, and
are
three possible sets of parameters that can be linked to
. For each taxon
in subject
, the
controls the probability of absence
and the
and
control the
abundance distribution at the presence. To facilitate the interpretation, we model
and
as opposed
to
and
, where
pertains to the mean of the Beta
variable and
can be viewed as the dispersion
parameter because the variance of the Beta variable takes the form
. It is natural
to use logit link functions because
’s,
’s and
’s take values between 0 and 1:
![]() |
(3.2) |
where
,
,
and
are regression coefficients for taxon
. The design matrices are not necessarily
the same for the three features of abundance distribution (i.e. absence probability, mean,
and dispersion). For ease of presentation, we keep them the same in describing the
regression model.
We write the complete set of parameters as
.
The likelihood-based inference on
is not simple because
the observed log-likelihood function is complicated. We describe below an efficient EM
algorithm for fitting the model and estimating parameters. Formula (3.3) gives the complete data
log-likelihood expressed in terms of
’s:
![]() |
(3.3) |
where
and
. Using
’s instead of
’s allows us to
derive the explicit form of posterior expectations in the E-step and estimate parameters for
each taxon independently in the M-step.
In the
-th E-step, we need to compute the expected
complete data log-likelihood,
![]() |
(3.4) |
where the expectation is with respect to the posterior
distributions of
and
with
being the
parameter estimates in the
-th M-step. Based on the results from
the previous section, we have
![]() |
(3.5) |
where
,
,
and
is the digamma function.
Thus,
can be rewritten as
![]() |
(3.6) |
where
and 
In the
-th M-step, for each taxon
, we obtain
from maximizing
the function
and obtain
and
from
maximizing the function
. The computation
burden for the optimization is the same as a logistic and a weighted Beta regressions.
Because the parameters for individual taxa are updated independently, we can estimate
parameters for all taxa in parallel by distributing the optimization jobs to multiple
computing cores. In summary, the EM algorithm is computationally efficient because of the
simple calculation of posterior expectations and the ability to update parameters in a
taxon-by-taxon manner.
4. Association tests
By testing the different sets of parameters in the ZIGDM regression model, we can reveal a
variety of patterns of variation in microbial communities. In this article, we focus on
testing the null hypothesis that the covariates are not associated with mean
(
)
or dispersion (
),
where
and
are subsets of
and
corresponding to the
covariates of interest. We may test the null hypotheses by using the score, Wald, or
likelihood ratio (LR) statistics. We have adopted score statistics, which are
computationally faster and more stable than Wald and LR statistics (Lin and Tang, 2011). The derivation of the score statistic is included in
the Appendix of supplementary
material available at Biostatistics online. In our numerical
studies, we evaluate the performance of score tests based on the ZIGDM and GDM regression
models. When performing the association test for one feature of the abundance distribution
(e.g. mean), we include only the intercept in models for the other features (e.g. absence
probability, dispersion).
The asymptotic approximation of the test statistics may not be accurate when most of the observations are zeros, especially when the sample size is small. Therefore, we need to use permutation techniques to obtain P-values. It is computationally efficient to obtain the permutation P-values for score statistics because the null model (without confounding variables) would only include the intercept term and needs to be fit only once in the permutation procedure. In particular, we permute the covariate of interest and calculate the score test statistic in each permutation. The permutation P-value is the proportion of the permuted test statistics that are greater than the observed statistic.
5. Simulation evaluations
5.1. Simulation strategy
We conducted extensive simulation studies to investigate the performance of the proposed
and existing methods. The
and
are tests based on the
ZIGDM distribution for detecting differential mean and dispersion, respectively. The
and
are the GDM counterparts. The
and
are
two tests based on the DM distribution. The
is a test for
detecting differential mean and the
is an omnibus test for jointly
detecting differential mean and dispersion. In the current implementation of these DM
tests, the
employs the Wald statistic and
the
employs the LR statistic (La Rosa and others, 2012). We used the
HMP package in R for the DM tests in our numerical studies (La Rosa and others, 2016). The
and
are distribution-free tests
based on generalized score statistics for detecting differential mean. The
employs the two-part model
with one part modeling zero observations and the other part modeling positive abundance.
We used the miLineage package in R for the QCAT tests in our numerical studies.
We simulated six taxon counts for two groups with same sample sizes and tested differential abundance in the six taxa between the two groups. The number of taxa is chosen to be six in order to reflect the fact that testing sub-composition on a taxonomic tree usually involves less than 10 taxa (see details in Section 6). We considered the sample sizes of 100 and 200 in all simulation studies. The total sequencing reads for each sample was simulated from Poisson(1000). In the power evaluation, we perturbed one taxon by changing either its mean abundance or its dispersion level in one group. We used 5000 simulated datasets to evaluate Type I error and power of the tests at the 0.05 significance level.
In the first set of simulations, taxon counts were generated from the DM model. In particular, the vector of proportion parameters for the multinomial distribution was simulated from a Dirichlet distribution with equal mean parameters of 1/6 and the dispersion parameter of 0.3. To evaluate power, we randomly selected a taxon and sampled its mean parameter from Uniform (0, 0.5) or its dispersion parameter from Uniform(0, 0.5).
In the second set of simulations, taxon counts were generated from the GDM or ZIGDM
model. In particular, we first simulated five independent Beta or ZIB variables
with equal Beta
mean parameters of 0.2 and equal Beta dispersion parameters of 0.2. For the ZIGDM, the
zero-inflation levels were sampled without replacement from
. We then
transformed the
’s into a vector of proportions
according to
(2.1). The proportion of the sixth
taxon was determined by
. To evaluate power,
we changed the Beta mean or dispersion parameter of a taxon: the Beta mean parameter for
the differential taxon was sampled from Uniform(0, 0.5); the Beta dispersion parameter for
the differential taxon was sampled from Uniform(0, 1). For the GDM, the differential taxon
was randomly picked; for the ZIGDM, the differential taxon was the one with a certain
level of zero-inflation. For example, when we consider the zero-inflation level of 0.2, we
only perturb the taxon with that particular level of zero-inflation.
In the third set of simulations, taxon counts were generated from the Log-Normal (LN) or
zero-inflated LN (ZILN) model. This set of simulations are designed to evaluate the
robustness of different tests to the underlying distributions. In particular, we first
simulated
from a
multivariate normal distribution with means of zero, variances of one, and a polynomial
decay correlation matrix
given by
.
We then transformed the
’s into a vector of proportions
.
For the ZILN, we randomly changed the observed proportions for each taxon to zero
according to the zero-inflation levels sampled without replacement from
. The proportion
of the sixth taxon was determined by
. To evaluate power
under the LN, we changed the Normal mean or variance for a randomly-picked taxon: its
Normal mean parameter was sampled from Uniform(0, 1), and its Normal variance parameter
was sampled from Uniform(1, 6). To evaluate power under the ZILN, we changed the Normal
mean or variance for the taxon with a certain level of zero-inflation: its Normal mean
parameter was sampled from Uniform(0, 2); its Normal variance parameter was sampled from
Uniform(1, 8).
5.2. Simulation results
The empirical Type I errors for asymptotic and permutation tests are presented in Table 1. The Type I errors are properly controlled in
the permutation tests. The asymptotic distribution-free tests
and
preserve the Type I error in
all scenarios. The rest of the asymptotic tests tend to be too liberal under zero-inflated
models (i.e. ZIGDM and ZILN models), especially when the data are not generated from the
distribution the test assumes. The
asymptotic test is overly
conservative under non-zero-inflated models (i.e. DM, GDM, and LN models). For a fair
comparison, we only report permutation results in the power evaluation.
Table 1.
Type I error of the tests
| Model |
|
|
|
|
|
|
|
|
|
|
|---|---|---|---|---|---|---|---|---|---|---|
| Asymptotic | DM | 100 | 0.058 | 0.041 | 0.064 | 0.057 | 0.051 | 0.012 | 0.048 | 0.042 |
| 200 | 0.058 | 0.058 | 0.054 | 0.054 | 0.050 | 0.009 | 0.049 | 0.051 | ||
| GDM | 100 | 0.047 | 0.033 | 0.060 | 0.057 | 0.042 | 0.022 | 0.042 | 0.028 | |
| 200 | 0.053 | 0.041 | 0.059 | 0.050 | 0.056 | 0.018 | 0.055 | 0.039 | ||
| LN | 100 | 0.072 | 0.041 | 0.072 | 0.041 | 0.058 | 0.027 | 0.047 | 0.048 | |
| 200 | 0.057 | 0.037 | 0.057 | 0.037 | 0.055 | 0.024 | 0.047 | 0.049 | ||
| ZIGDM | 100 | 0.110 | 0.120 | 0.130 | 0.099 | 0.061 | 0.100 | 0.046 | 0.043 | |
| 200 | 0.069 | 0.062 | 0.110 | 0.092 | 0.060 | 0.100 | 0.045 | 0.049 | ||
| ZILN | 100 | 0.085 | 0.072 | 0.180 | 0.190 | 0.064 | 0.230 | 0.047 | 0.050 | |
| 200 | 0.057 | 0.049 | 0.180 | 0.190 | 0.062 | 0.210 | 0.044 | 0.050 | ||
| Permutation | DM | 100 | 0.048 | 0.048 | 0.054 | 0.050 | 0.051 | 0.050 | 0.050 | 0.049 |
| 200 | 0.046 | 0.054 | 0.046 | 0.051 | 0.051 | 0.047 | 0.049 | 0.050 | ||
| GDM | 100 | 0.046 | 0.046 | 0.046 | 0.043 | 0.042 | 0.050 | 0.047 | 0.050 | |
| 200 | 0.050 | 0.054 | 0.050 | 0.050 | 0.052 | 0.050 | 0.055 | 0.052 | ||
| LN | 100 | 0.053 | 0.046 | 0.053 | 0.046 | 0.054 | 0.056 | 0.051 | 0.052 | |
| 200 | 0.052 | 0.047 | 0.052 | 0.047 | 0.046 | 0.049 | 0.047 | 0.050 | ||
| ZIGDM | 100 | 0.041 | 0.052 | 0.048 | 0.056 | 0.051 | 0.047 | 0.054 | 0.046 | |
| 200 | 0.048 | 0.050 | 0.051 | 0.049 | 0.045 | 0.050 | 0.048 | 0.050 | ||
| ZILN | 100 | 0.046 | 0.050 | 0.052 | 0.051 | 0.047 | 0.050 | 0.048 | 0.052 | |
| 200 | 0.048 | 0.049 | 0.049 | 0.049 | 0.049 | 0.049 | 0.044 | 0.048 |
The power of permutation tests under non-zero-inflated models are presented in Table 2. The results for ZIGDM and GDM are identical
under the LN model because almost all simulated taxa counts are positive. When the mean
differs, the differential-mean tests (
,
,
,
, and
)
outperform the differential-dispersion tests (
,
). As the
jointly tests the mean and
dispersion, we view it as both the differential-mean and differential-dispersion tests.
When the dispersion differs, differential-dispersion tests are more powerful than their
differential-mean counterparts. The differential-mean test
yields decent power for detecting differential dispersion if the change in dispersion also
alters the percentages of zero counts in the data (see settings in the DM and GDM models).
When the mean differs, the performances of differential-mean tests are largely similar. In
contrast, when the dispersion differs, the performances of differential-dispersion tests
are diverse: the
is more powerful and robust
than the
and
tests. In practice, we can use statistical model selection criteria (e.g. Bayesian
information criterion, Akaike information criterion) to choose between ZIGDM and GDM
models.
Table 2.
Power of the permutation tests under non-zero-inflated models
| Model | Difference |
|
|
|
|
|
|
|
|
|
|---|---|---|---|---|---|---|---|---|---|---|
| DM | Mean | 100 | 0.52 | 0.33 | 0.67 | 0.47 | 0.60 | 0.66 | 0.58 | 0.59 |
| 200 | 0.67 | 0.54 | 0.76 | 0.62 | 0.71 | 0.76 | 0.70 | 0.72 | ||
| Disp | 100 | 0.17 | 0.72 | 0.42 | 0.77 | 0.05 | 0.76 | 0.06 | 0.64 | |
| 200 | 0.52 | 0.84 | 0.60 | 0.84 | 0.05 | 0.81 | 0.05 | 0.75 | ||
| GDM | Mean | 100 | 0.60 | 0.27 | 0.66 | 0.34 | 0.59 | 0.60 | 0.60 | 0.59 |
| 200 | 0.70 | 0.39 | 0.76 | 0.48 | 0.70 | 0.71 | 0.71 | 0.70 | ||
| Disp | 100 | 0.17 | 0.47 | 0.56 | 0.79 | 0.07 | 0.55 | 0.07 | 0.59 | |
| 200 | 0.23 | 0.54 | 0.67 | 0.86 | 0.07 | 0.62 | 0.08 | 0.66 | ||
| LN | Mean | 100 | 0.48 | 0.06 | 0.48 | 0.06 | 0.50 | 0.51 | 0.52 | 0.52 |
| 200 | 0.63 | 0.06 | 0.63 | 0.05 | 0.65 | 0.66 | 0.66 | 0.66 | ||
| Disp | 100 | 0.05 | 0.70 | 0.05 | 0.70 | 0.24 | 0.39 | 0.17 | 0.18 | |
| 200 | 0.06 | 0.83 | 0.06 | 0.83 | 0.47 | 0.61 | 0.40 | 0.41 |
The power of permutation tests under zero-inflated models when
are
displayed in Figure 1 (see Figure 2 of supplementary material available at
Biostatistics online for the results when
). Each
plot demonstrates the power curve as a function of the zero-inflation levels. When the
mean differs, the
outperforms the other tests.
When the dispersion differs, under the ZIGDM model, the
,
and
perform substantially better
than the other tests; under the ZILN model, the
is the most
powerful test and the other tests have dramatic power loss.
Fig. 1.
Power of the permutation tests under ZIGDM and ZILN models when the sample size is 100. The pattern of variation is indicated above each graph.
In summary, comparing to the DM tests, the GDM and ZIGDM tests are more powerful to detect differential mean/dispersion and are more robust to the underlying distributions. If the data are zero-inflated, the ZIGDM tests are more desirable than the GDM tests. If the data are not zero-inflated, the GDM tests should be preferred over the ZIGDM tests.
6. Gut microbiome and BMI
Gut microbiome plays an important role in obesity by contributing to nutrient digestion and absorption in humans. Wu and others (2011) investigated the relationship between micronutrients and gut microbiome composition. Fecal samples from 98 healthy volunteers were collected, along with their demographic data and diet information. The sample DNA was analyzed by sequencing the V1–V2 region of the 16S rRNA genes in the fecal samples. The sequencing reads were taxonomically classified to the genus level via QIIME (Caporaso and others, 2010). Taxa that appear in less than two samples were removed resulting in 80 genera.
These genera can be mapped to a taxonomic tree according to their taxonomic identities at the higher taxonomic levels. Figure 2 displays the tree with six levels from genus to kingdom. The outer circle of the tree represents lower taxonomic level. Each node on the tree represents a taxon. At each internal node, counts of the sequencing reads that assigned to that node are distributed to multiple taxa at the lower taxonomic levels (i.e. child nodes). As described in the previous literature (Shi and Li, 2017, Tang and others, 2017), in order to identify all differential lineages on the tree, we visited every internal node; for each node, we applied the tests to the sub-composition defined as the vector of taxon counts on its immediate child nodes. We then employed the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001) to control the false discovery rate (FDR). Alternative FDR control methods that respect the hierarchical structure of the taxonomy can potentially increase discovery power (Bogomolov and others, 2017, Lei and others, 2017). Testing sub-compositions on the lineage leverages taxonomic structure to define the unit of the tests in a biologically meaningful way and essentially reduces the dimension in each test. In our analysis, the number of taxa in each test is usually less than 10.
Fig. 2.

Taxonomic tree for the gut microbiome data. The root nodes of the discovered lineages are marked with stars and annotated by the names of the taxa on the root nodes. The lineages under family Prevotellaceae and kingdom Bacteria were identified exclusively by the ZIGDM test. Three most abundant phyla (Bacteroidetes, Firmicutes, and Proteobacteria) are also annotated in the tree.
Because dysbiosis of the gut microbiome has been shown to be associated with obesity (Sanderson and others, 2006), we were
interested in identifying differential bacterial lineages in high vs. normal BMI groups. We
dichotomized the BMI value in our analysis because the current implementation of the DM
tests can only handle the group-wise comparison (La Rosa
and others, 2016). The BMI
25 is the commonly
accepted range for overweight. We performed the ZIGDM, GDM, DM, and QCAT permutation tests
to compare the mean or dispersion level of the microbial abundance between the high BMI
(
25) and normal BMI
(
25) groups over all lineages on the
taxonomic tree at the family, order, class, phylum, and kingdom levels. The
P-values for all the discovered lineages (FDR = 5%) are listed in Table 1
of supplementary material
available at Biostatistics online. The DM and QCAT tests identify two
BMI-associated lineages on the tree: family Ruminococcaceae
(
P-value =
,
P-value =
) and family
Veillonellaceae (
P-value =
,
P-value =
). Besides family
Veillonellaceae, the GDM and ZIGDM tests identified another two
BMI-associated lineages: family Prevotellaceae (
P-value = 0.0014
and
P-value = 0.0014)
and kingdom Bacteria (
P-value =
0.0016). The test for family Prevotellaceae involves two genera. The test
for kingdom Bacteria involves eight phyla, among which the three most
abundant taxa (phyla Bacteroidetes, Firmicutes, and
Proteobacteria) dominate the community and the remaining five taxa are
rare and the majority of their counts are zeros. In Figure
2, all the differential lineages and the three most abundant phyla are annotated in
the tree. Figure 3 of supplementary
material available at Biostatistics online displays the
distributions of relative abundance for the three phyla and demonstrates significant
dispersion differences between BMI groups in phyla Bacteroidetes and
Firmicutes. The dispersion difference for taxa at such high taxonomic
level is probably due to the aggregation of many low-level taxa that interact with each
other and/or with environments. On the other hand, the difference in mean abundance would
usually be difficult to detect at such high level because the differences are diluted by
aggregating many taxa with no difference in mean.
Furthermore, we used the DM, GDM, and ZIGDM models to fit individual sub-compositions defined at internal nodes of the tree. In each model, we linked the mean, dispersion, and absence probability (in the ZIGDM model only) to the binary BMI and obtained the maximum likelihood estimates for the coefficients of the intercept and the binary BMI. We then generated synthetic data from each distribution with the estimated parameters. The GDM and ZIGDM models provide superior fit for most of the sub-compositions. For example, Figure 3 shows quantile-quantile plots of synthetic and observed relative abundances for the families Prevotellaceae and Porphyromonadaceae under order Bacteroidales. Prevotellaceae has many observed zeros (58%) and Porphyromonadaceae has very few (3.1%). The GDM and ZIGDM fit the data much better than the DM for these families and the ZIGDM has the best fit at the lower tail for Prevotellaceae. Another example for the sub-composition of three most abundant phyla under kingdom Bacteria is shown in Figure 4 of supplementary material available at Biostatistics online. Our analysis demonstrates that the additional parameters in GDM/ZIGDM are well-spent to provide better fit to the data, resulting in powerful association tests.
Fig. 3.
Quantile-quantile plots from fitting three distributions to the sub-composition of two families under the order Bacteroidales. A thousand synthetic datasets were randomly generated from each distribution with the estimated parameters. The medians were used to draw points and the 2.5 and 97.5 percentiles were used to draw envelopes.
7. Discussion
In this article, we develop the ZIGDM distribution for modeling microbiome compositional data with excess zeros, and propose score tests based on the ZIGDM regression model to detect differential mean or dispersion level of microbial composition. In contrast to the commonly-used DM model, the ZIGDM model provides a more flexible way of accommodating excess zeros and handling the complex correlation structure and dispersion patterns in taxon count data. We develop an EM algorithm to efficiently estimate parameters in the ZIGDM regression. The EM provides explicit forms of the posterior expectations in the E-step and updates the parameters for individual taxa independently in the M-step. Extensive simulation studies have been conducted to compare the proposed tests to existing ones. The results have demonstrated that the ZIGDM tests are more powerful to detect differential mean/dispersion and are more robust to the underlying distribution if the taxon counts are zero-inflated. If the taxon counts are not zero-inflated, the GDM tests are more desirable. In the analysis of a gut microbiome dataset, the proposed methods identify additional lineages with differential dispersion between BMI groups. We have demonstrated that the GDM provides a superior fit to taxon counts compared to the DM and the ZIGDM can further improve the goodness-of-fit for taxa with many zero counts.
We can potentially develop an omnibus test by combining the mean and dispersion tests. Moreover, besides testing the mean and dispersion, we can incorporate the test for differential presence-absence frequencies. However, the high degrees of freedom of the omnibus test will compromise its statistical power. To reduce the degree of the freedom, we can aggregate the abundances of multiple taxa or use the maximum statistic in the association test (Lin and Tang, 2011). Alternatively, we can construct a test statistic from the minimal P-value among tests for different features of the abundance distribution and use permutation to obtain a uniform P-value (Tang and others, 2016). The performance of these tests in analyzing microbiome data requires further study.
The multivariate association tests cannot handle high-dimensional microbial taxa, especially when the number of taxa is larger than the sample size. This similar problem has been investigated under the DM regression (Chen and Li, 2013; Wang and Zhao, 2017). The ZIGDM model is able to incorporate standard regularization approaches to deal with high dimensionality. Depending on the need in practice, the model can produce sparse estimates with the lasso penalty (Tibshirani, 1996) and the group lasso penalty (Yuan and Lin, 2006). In addition, a phylogenetic structure-constrained penalty function can be used to incorporate important prior knowledge on evolutionary relationships among microbial taxa (Chen and others, 2012). These penalized ZIGDM regressions would enable us to identify microbes with differential mean, dispersion level, and presence-absence frequency.
8. Software
R codes to implement the methods have been incorporated into the software miLineage, which is available at https://tangzheng1.github.io/tanglab/software.html.
Supplementary Material
Supplementary Data
Acknowledgments
We are grateful to the two anonymous reviewers for their helpful comments. We thank Dr. Hongzhe Li for providing the BMI data. Conflict of Interest: None declared.
References
- Ahn, J., Sinha, R., Pei, Z., Dominianni, C., Wu, J., Shi, J., Goedert, J. J., Hayes, R. B. and Yang, L. (2013). Human gut microbiome and risk for colorectal cancer. Journal of the National Cancer Institute 105, 1907–1911. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Alekseyenko, A. V., Perez-Perez, G. I., De Souza, A.,Strober, B.,Gao, Z.,Bihan, M.,Li, K.,Methé, B. A. and Blaser, M. J. (2013). Community differentiation of the cutaneous microbiota in psoriasis. Microbiome 1, 31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 57, 289–300. [Google Scholar]
- Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29, 1165–1188. [Google Scholar]
- Bogomolov, M., Peterson, C. B., Benjamini, Y. and Sabatti, C. (2017). Testing hypotheses on a tree: new error rates and controlling strategies. arXiv preprint arXiv:1705.07529. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Caporaso, J. G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F. D., Costello, E. K., Fierer, N., Pena, A. G., Goodrich, J. K., Gordon, J. I. and others. (2010). QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7, 335–336. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen, J., Bushman, F. D., Lewis, J. D., Wu, G. D. and Li, H. (2012). Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis. Biostatistics 14, 244–258. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen, J. and Li, H. (2013). Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. The Annals of Applied Statistics 7, 418–442. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cho, I. and Blaser, M. J. (2012). The human microbiome: at the interface of health and disease. Nature Reviews Genetics 13, 260–270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cho, I., Yamanishi, S., Cox, L., Methé, B. A., Zavadil, J.,Li, K.,Gao, Z.,Mahana, D.,Raju, K.,Teitler, I. and others. (2012). Antibiotics in early life alter the murine colonic microbiome and adiposity. Nature 488, 621–626. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cole, J. R., Chai, B., Farris, R. J., Wang, Q., Kulam-Syed-Mohideen, A. S., McGarrell, D. M., Bandela, A. M., Cardenas, E., Garrity, G. M. and Tiedje, J. M. (2007). The ribosomal database project (RDP-II): introducing myRDP space and quality controlled public data. Nucleic Acids Research 35, 169–172. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Connor, R. J. and Mosimann, J. E. (1969). Concepts of independence for proportions with a generalization of the Dirichlet distribution. Journal of the American Statistical Association 64, 194–206. [Google Scholar]
- DeSantis, T. Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E. L., Keller, K., Huber, T., Dalevi, D., Hu, P. and Andersen, G. L. (2006). Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and Environmental Microbiology 72, 5069–5072. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gilbert, J. A., Quinn, R. A., Debelius, J., Xu, Z. Z., Morton, J., Garg, N., Jansson, J. K., Dorrestein, P. C. and Knight, R. (2016). Microbiome-wide association studies link dynamic microbial consortia to disease. Nature 535, 94–103. [DOI] [PubMed] [Google Scholar]
- Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96, 161–173. [Google Scholar]
- Jovel, J., Patterson, J., Wang, W., Hotte, N., O’Keefe, S., Mitchel, T., Perry, T., Kao, D., Mason, A. L., Madsen, K. L. and others. (2016). Characterization of the gut microbiome using 16S or shotgun metagenomics. Frontiers in Microbiology 7, 459. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kuczynski, J., Lauber, C. L., Walters, W. A., Parfrey, L. W., Clemente, J. C., Gevers, D. and Knight, R. (2012). Experimental and analytical tools for studying the human microbiome. Nature Reviews Genetics 13, 47–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
- La Rosa, P. S., Brooks, J. P., Deych, E., Boone, E. L., Edwards, D. J., Wang, Q., Sodergren, E., Weinstock, G. and Shannon, W. D. (2012). Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS One 7, e52078. [DOI] [PMC free article] [PubMed] [Google Scholar]
- La Rosa, P. S., Deych, E., Shands, B. and Shannon, W. D. (2016). HMP: Hypothesis Testing and Power Calculations for Comparing Metagenomic Samples from HMP. R package version 1.4.3, https://CRAN.R-project.org/package=HMP. [Google Scholar]
- Lei, L., Ramdas, A. and Fithian, W. (2017). Star: a general interactive framework for FDR control under structural constraints. arXiv preprint arXiv:1710.02776. [Google Scholar]
- Li, H. (2015). Microbiome, metagenomics, and high-dimensional compositional data analysis. Annual Review of Statistics and Its Application 2, 73–94. [Google Scholar]
- Lin, D.-Y. and Tang, Z.-Z. (2011). A general framework for detecting disease associations with rare variants in sequencing studies. The American Journal of Human Genetics 89, 354–367. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu, Z., DeSantis, T. Z., Andersen, G. L. and Knight, R. (2008). Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research 36, e120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mandal, S., Van Treuren, W., White, R. A., Eggesbø, M., Knight, R. and Peddada, S. D. (2015). Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease 26, 27663. [DOI] [PMC free article] [PubMed] [Google Scholar]
- O’Brien, J. D., Record, N. and Countway, P. (2016). The power and pitfalls of Dirichlet-multinomial mixture models for ecological count data. bioRxiv. https://doi.org/10.1101/045468. [Google Scholar]
- Qin, J., Li, Y., Cai, Z., Li, S., Zhu, J., Zhang, F., Liang, S., Zhang, W., Guan, Y., Shen, D. and others. (2012). A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60. [DOI] [PubMed] [Google Scholar]
- Sanderson, S., Boardman, W., Ciofi, C. and Gibson, R. (2006). Human gut microbes associated with obesity. Nature 444, 1022–1023. [DOI] [PubMed] [Google Scholar]
- Sankaran, K. and Holmes, S. (2017). Latent variable modeling for the microbiome. arXiv preprint arXiv: 1706.04969. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shi, P. and Li, H. (2017). A model for paired-multinomial data and its application to analysis of data on a taxonomic tree. Biometrics 73, 1266–1278. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tang, Z.-Z., Chen, G. and Alekseyenko, A. V. (2016). PERMANOVA-S: association test for microbial community composition that accommodates confounders and multiple distances. Bioinformatics 32, 2618–2625. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tang, Z.-Z., Chen, G., Alekseyenko, A. V. and Li, H. (2017). A general framework for association analysis of microbial communities on a taxonomic tree. Bioinformatics 33, 1278–1285. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58, 267–288. [Google Scholar]
- Wadsworth, W. D., Argiento, R., Guindani, M., Galloway-Pena, J., Shelbourne, S. A. and Vannucci, M. (2017). An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data. BMC Bioinformatics 18, 94. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang, T. and Zhao, H. (2017). A Dirichlet-tree multinomial regression model for associating dietary nutrients with gut microorganisms. Biometrics 73, 792–801. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wong, T.-T. (1998). Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation 97, 165–181. [Google Scholar]
- Wu, G. D., Chen, J., Hoffmann, C., Bittinger, K., Chen, Y.-Y., Keilbaugh, S. A., Bewtra, M., Knights, D., Walters, W. A., Knight, R. and others. (2011). Linking long-term dietary patterns with gut microbial enterotypes. Science 334, 105–108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67. [Google Scholar]
- Zhang, Y., Zhou, H., Zhou, J. and Sun, W. (2017). Regression models for multivariate count data. Journal of Computational and Graphical Statistics 26, 1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Supplementary Data













