Summary
Understanding the factors that alter the composition of the human microbiota may help inform personalized healthcare strategies and identify therapeutic drug targets. In many sequencing studies, microbial communities are characterized by a list of taxa, their counts, and their evolutionary relationships represented by a phylogenetic tree. In this article, we consider an extension of the Dirichlet multinomial distribution, called the Dirichlet-tree multinomial distribution, for multivariate, over-dispersed, and tree-structured count data. To address the relationships between these counts and a set of covariates, we propose the Dirichlet-tree multinomial regression model for which we develop a penalized likelihood method for estimating parameters and selecting covariates. For efficient optimization, we adopt the accelerated proximal gradient approach. Simulation studies are presented to demonstrate the good performance of the proposed procedure. An analysis of a data set relating dietary nutrients with bacterial counts is used to show that the incorporation of the tree structure into the model helps increase the prediction power.
Keywords: Dirichlet distributions, Over-dispersion, Sparse group lasso, Tree-structured learning
1. Introduction
Next generation sequencing of DNA extracted from microbial communities has advanced our knowledge of the role of microbiota in human health and disease (Cho and Blaser, 2012). Researchers have found that changes in the human microbiota are associated with a number of pathological states such as obesity and inflammatory bowel disease (Clemente et al., 2012). It has also been found that microbiota populations are sensitive to genetic and environmental influences (Spor et al., 2011). Despite this progress, we are still at the very beginning of fully understanding the factors regulating our microbiota and the impacts of microbiota on our health. In particular, our statistical methods for extracting meaningful information from microbiome studies have not been developed as quickly as experimental techniques.
To process raw data from either targeted amplicon or metagenomic sequencing studies (Kuczynski et al., 2011), Operational Taxonomic Units (OTUs), which are microbial genomic sequences clustered by sequence similarity, are commonly used to partition sequences into discrete groups in place of traditional taxonomic units. For each OTU, its abundance or count is defined as the number of sequences belonging to this OTU. Typically, the most abundant sequence in an OTU is chosen as the representative sequence, and representative sequences from all the OTUs are used to construct a phylogenetic tree among all the OTUs, and/or to map OTUs to a taxonomic reference database (Schloss et al., 2009; Navas-Molina et al., 2013). After these processing steps, a microbial community can be characterized by a list of OTUs and their counts, their phylogenetic relationships, and/or their taxonomy.
The data that motivated the statistical method in this article were part of a human gut microbiome study conducted at the University of Pennsylvania (Wu et al., 2011), where the investigators aimed to assess the effect of diet on gut microbiome composition. For this study, both gut microbiome data and nutrient intake data were available (see Section 6 for details). Using a distance-based analysis procedure, the authors identified a few gut microbiome-associated nutrients that are biologically interpretable. As noted by Chen and Li (2013), however, this method is based on distances between microbiome samples, and thus is unable to provide information on how dietary nutrients affect bacterial taxa. In order to identify key nutrients as well as the taxa they affect, Chen and Li (2013) adopted a regression-based approach, where they treated OTU-abundance data as multivariate count responses, and nutrients as covariates.
Let Y = (Y1, …, YK)⊤ denote the random vector of counts on K bacterial taxa or OTUs. The simplest distribution for multivariate count data is the multinomial distribution. Suppose y = (y1, …, yK)⊤ is a draw from this distribution. Given $y_+ = \sum_{k=1}^{K} y_k$, the probability mass of Y at y is
$$f_M(y; p) = \frac{\Gamma(y_+ + 1)}{\prod_{k=1}^{K} \Gamma(y_k + 1)} \prod_{k=1}^{K} p_k^{y_k},$$
where Γ(·) is the gamma function, and p = (p1, …, pK)⊤ is the vector of probabilities such that $\sum_{k=1}^{K} p_k = 1$. In the multinomial regression model, p is linked to the covariates via the multinomial-Poisson transformation.
Bacterial count data in microbiome studies are usually over-dispersed. The multinomial distribution is inappropriate, however, when more variation is observed than expected. In order to overcome this problem, Chen and Li (2013) assumed that p is random with some prior distribution. Denote by 𝒮^d the d-dimensional simplex. Since the support of p is 𝒮^{K−1}, they used the Dirichlet distribution whose density at u = (u1, …, uK) ∈ 𝒮^{K−1}, parameterized by K positive values α1 > 0, …, αK > 0, is given by
$$f_D(u; \alpha) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} u_k^{\alpha_k - 1},$$
where α = (α1, …, αK)⊤. The Dirichlet distribution is popular mainly because, as a prior distribution, it is conjugate to the multinomial distribution, leading to the Dirichlet multinomial distribution (Mosimann, 1962), also known as the Dirichlet compound multinomial distribution, with probability mass function
$$f_{DM}(y; \alpha) = \frac{\Gamma\left(\sum_{k=1}^{K} y_k + 1\right) \Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\Gamma\left(\sum_{k=1}^{K} y_k + \sum_{k=1}^{K} \alpha_k\right)} \prod_{k=1}^{K} \frac{\Gamma(y_k + \alpha_k)}{\Gamma(\alpha_k)\, \Gamma(y_k + 1)}.$$
Using the Dirichlet multinomial distribution, Chen and Li (2013) considered a Dirichlet-multinomial regression model, and applied it to the analysis of gut microbiome and nutrient intake data.
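As a small numerical companion to the Dirichlet multinomial distribution, the following Python sketch evaluates its log probability mass with log-gamma functions. The function name `dm_logpmf` and the parameter values are illustrative assumptions, not taken from the analysis of Chen and Li (2013).

```python
from itertools import product
from math import exp, lgamma


def dm_logpmf(y, alpha):
    """Log probability mass of the Dirichlet multinomial at counts y."""
    y_sum, a_sum = sum(y), sum(alpha)
    out = lgamma(y_sum + 1) + lgamma(a_sum) - lgamma(y_sum + a_sum)
    for yk, ak in zip(y, alpha):
        out += lgamma(yk + ak) - lgamma(ak) - lgamma(yk + 1)
    return out


# Illustrative check: the masses over all count vectors with total 3 add up to one.
alpha = [0.5, 1.0, 2.0]
total_mass = sum(
    exp(dm_logpmf(y, alpha))
    for y in product(range(4), repeat=3)
    if sum(y) == 3
)
```

Working on the log scale avoids overflow of the gamma function when the counts are in the thousands, as is typical for sequencing data.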
The Dirichlet (multinomial) distribution has three limitations. First, all components must share a common variance parameter. Second, the components are mutually independent, up to the constraint that they must sum up to one (Mosimann, 1962). Finally, the distribution fails to take into account the special and inherent property of microbiome count data, namely, microbial taxa are not independent of each other, but are related evolutionarily in a phylogenetic tree. To address these limitations, we consider a generalization of the Dirichlet (multinomial) distribution. Assuming that the relationships among components of the count vector can be represented as a tree, the generalized distribution encodes the tree structure node-by-node as detailed in the next section. By construction, each component has an independent variance, and components are correlated at subtree levels. Based on this distribution, we propose a regression model to include the effects of covariates, and develop a regularized method for selecting covariates (e.g., nutrients) that are associated with the count responses (e.g., OTUs).
To circumvent the first two issues with the Dirichlet (multinomial) distribution, Billheimer et al. (2001) proposed an alternative distribution for over-dispersed multivariate count data, called the logistic normal multinomial distribution, by replacing the Dirichlet distribution for p with Aitchison’s logistic normal distribution (Aitchison, 1982). The corresponding regression model, which relates covariates to the count vector, was studied by Billheimer et al. (2001), and was recently applied to link dietary nutrients with bacterial counts by Xia et al. (2013). However, like the Dirichlet multinomial distribution, the logistic normal multinomial distribution cannot exploit the tree structure information, and unlike the Dirichlet multinomial distribution and the generalized version considered here, its density does not have a closed-form expression, making the regression estimation computationally intensive.
The rest of the article is organized as follows. We review the Dirichlet-tree multinomial distribution and its properties in Section 2. We then propose the Dirichlet-tree multinomial regression model in Section 3. Regularized likelihood estimation, including the algorithm and tuning, is considered in Section 4. Simulation studies and an application to a human gut microbiome data set, respectively in Sections 5 and 6, are used to evaluate the performance of our method. Finally, Section 7 contains some remarks.
2. Dirichlet-Tree Multinomial Distributions
Data on bacterial community structure can be compiled in a matrix of counts, where samples are represented as rows and taxa as columns. As in Chen and Li (2013) and Xia et al. (2013), we treat the total number of counts, which is determined by the sequencing depth, as an ancillary statistic, and conduct the analysis conditioning on this number. Now suppose we have available a tree T representing the hierarchical structure over the count responses (e.g., a phylogenetic tree of bacterial taxa). We seek to incorporate this structural information into the modeling process.
Denote by ℒ the set of leaf nodes and 𝒱 the set of internal nodes of T, respectively. For each internal node υ ∈ 𝒱, let 𝒞υ be the set of child nodes of υ. For each υ and c ∈ 𝒞υ, define δυc(l) to be 1, if the branch from υ to c leads to l ∈ ℒ, and 0, otherwise. For simplicity, write ℒ = {1, …, K}. Denote by yυc = Σl∈ℒ δυc(l)yl and pυc = Σl∈ℒ δυc(l)pl, respectively, the count and probability in the subtree indexed by c ∈ 𝒞υ.
Let υ0 be the root node of T. For each leaf node l ∈ ℒ, denote by
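$$\upsilon_0^l \to \upsilon_1^l \to \cdots \to \upsilon_{D_l}^l$$
the path from υ0 to l, where $\upsilon_0^l = \upsilon_0$, $\upsilon_{D_l}^l = l$, $\upsilon_{d+1}^l \in \mathcal{C}_{\upsilon_d^l}$ for d = 0, …, Dl − 1, and Dl ≥ 1 is the number of branches in this path. Let bυc = pυc/Σs∈𝒞υ pυs. Then Σc∈𝒞υ bυc = 1, hence each branch of T is assigned a probability. A simple calculation shows that
$$p_l = \prod_{d=0}^{D_l - 1} b_{\upsilon_d^l \upsilon_{d+1}^l}.$$
That is, pl is the product of branch probabilities as we traverse from υ0 to l. See Figure 1 for an illustration. Furthermore, we can write
$$f_M(y; b_\upsilon, \upsilon \in \mathcal{V}) = \frac{\Gamma\left(\sum_{l \in \mathcal{L}} y_l + 1\right)}{\prod_{l \in \mathcal{L}} \Gamma(y_l + 1)} \prod_{\upsilon \in \mathcal{V}} \prod_{c \in \mathcal{C}_\upsilon} b_{\upsilon c}^{y_{\upsilon c}},$$
where bυ = {bυc, c ∈ 𝒞υ}.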
Figure 1.
Two trees, each with four leaf nodes. Shown are branch and leaf probabilities.
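To make the relation between branch and leaf probabilities concrete, here is a minimal Python sketch that multiplies branch probabilities along each root-to-leaf path. The three-leaf tree, the node labels, and the function name `leaf_probabilities` are hypothetical and are not the trees shown in Figure 1.

```python
def leaf_probabilities(children, branch_prob, root):
    """Multiply branch probabilities along every root-to-leaf path."""
    probs = {}
    stack = [(root, 1.0)]
    while stack:
        node, p = stack.pop()
        if node not in children:  # a node without children is a leaf
            probs[node] = p
            continue
        for c in children[node]:
            stack.append((c, p * branch_prob[(node, c)]))
    return probs


# Hypothetical tree: root "r" has children "a" and leaf 3; "a" has leaves 1 and 2.
children = {"r": ["a", 3], "a": [1, 2]}
branch_prob = {("r", "a"): 0.6, ("r", 3): 0.4, ("a", 1): 0.5, ("a", 2): 0.5}
p = leaf_probabilities(children, branch_prob, "r")
```

Because the branch probabilities at each internal node sum to one, the resulting leaf probabilities also sum to one.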
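Instead of assuming one Dirichlet distribution for p, we can specify a separate Dirichlet distribution for bυ, for each internal node υ ∈ 𝒱. Specifically, the joint density function of (bυ, υ ∈ 𝒱) has the form
$$f_{DT}(u_\upsilon, \upsilon \in \mathcal{V}; \alpha_\upsilon, \upsilon \in \mathcal{V}) = \prod_{\upsilon \in \mathcal{V}} \frac{\Gamma\left(\sum_{c \in \mathcal{C}_\upsilon} \alpha_{\upsilon c}\right)}{\prod_{c \in \mathcal{C}_\upsilon} \Gamma(\alpha_{\upsilon c})} \prod_{c \in \mathcal{C}_\upsilon} u_{\upsilon c}^{\alpha_{\upsilon c} - 1},$$
where uυ = (uυc, c ∈ 𝒞υ) ∈ 𝒮^{Kυ−1}, Kυ is the number of children of υ, and αυ = (αυc > 0, c ∈ 𝒞υ) is a vector of positive scalars. This gives the Dirichlet-tree multinomial distribution (Dennis, 1996; Minka, 2004), which is a compound multinomial distribution with probability mass function
$$f_{DTM}(y; \alpha_\upsilon, \upsilon \in \mathcal{V}) = \prod_{\upsilon \in \mathcal{V}} \frac{\Gamma\left(\sum_{c \in \mathcal{C}_\upsilon} y_{\upsilon c} + 1\right) \Gamma\left(\sum_{c \in \mathcal{C}_\upsilon} \alpha_{\upsilon c}\right)}{\Gamma\left(\sum_{c \in \mathcal{C}_\upsilon} y_{\upsilon c} + \sum_{c \in \mathcal{C}_\upsilon} \alpha_{\upsilon c}\right)} \prod_{c \in \mathcal{C}_\upsilon} \frac{\Gamma(y_{\upsilon c} + \alpha_{\upsilon c})}{\Gamma(\alpha_{\upsilon c})\, \Gamma(y_{\upsilon c} + 1)}. \tag{1}$$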
Note that each component in the product Πυ∈𝒱 corresponds to an internal node in the tree, which is described by a Dirichlet multinomial distribution based on the accumulated counts along branches of that node. In the special case where υ0 is the only internal node, that is, 𝒱 = {υ0}, this distribution reduces to the Dirichlet multinomial distribution.
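The node-by-node structure can be sketched in Python as a sum of Dirichlet multinomial log masses, one per internal node, each evaluated on the counts accumulated along the branches of that node. The tree encoding (a dictionary from internal nodes to child lists), the helper names, and the parameter values are assumptions made for illustration only.

```python
from itertools import product
from math import exp, lgamma


def subtree_counts(children, y, root):
    """Return the accumulated leaf count below every branch (v, c)."""
    counts = {}

    def total(node):
        if node not in children:  # leaf: its own count
            return y[node]
        s = 0
        for c in children[node]:
            counts[(node, c)] = total(c)
            s += counts[(node, c)]
        return s

    total(root)
    return counts


def dtm_logpmf(children, y, alpha, root):
    """Log mass of the Dirichlet-tree multinomial: one DM term per internal node."""
    counts = subtree_counts(children, y, root)
    out = 0.0
    for v, kids in children.items():
        y_sum = sum(counts[(v, c)] for c in kids)
        a_sum = sum(alpha[(v, c)] for c in kids)
        out += lgamma(y_sum + 1) + lgamma(a_sum) - lgamma(y_sum + a_sum)
        for c in kids:
            y_vc, a_vc = counts[(v, c)], alpha[(v, c)]
            out += lgamma(y_vc + a_vc) - lgamma(a_vc) - lgamma(y_vc + 1)
    return out


# Hypothetical three-leaf tree and parameters.
children = {"r": ["a", 3], "a": [1, 2]}
alpha = {("r", "a"): 2.0, ("r", 3): 1.0, ("a", 1): 0.5, ("a", 2): 1.5}
total_mass = sum(
    exp(dtm_logpmf(children, {1: a, 2: b, 3: c}, alpha, "r"))
    for a, b, c in product(range(4), repeat=3)
    if a + b + c == 3
)
```

When the root is the only internal node, the loop has a single term and the function returns the ordinary Dirichlet multinomial log mass.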
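It is not difficult to show that
$$E(p_l) = \prod_{d=0}^{D_l - 1} \frac{\alpha_{\upsilon_d^l \upsilon_{d+1}^l}}{\sum_{s \in \mathcal{C}_{\upsilon_d^l}} \alpha_{\upsilon_d^l s}},$$
where $\upsilon_d^l$ denotes the d-th node on the path from υ0 to l. Thus, given $y_+ = \sum_{l \in \mathcal{L}} y_l$, we have $E(Y_l) = y_+ E(p_l)$. Other moments of the Dirichlet-tree distribution can be found in Dennis (1991). In particular, Dennis showed that when 𝒱 ≠ {υ0}, each component pl has an independent variance, and two components corresponding to the subtree indexed by υ ∈ 𝒱 are correlated, due to their dependence on the ancestors of υ.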
3. A Dirichlet-Tree Multinomial Regression Model
The Dirichlet-tree distribution is often used as a prior in a Bayesian analysis (see, e.g., Tam and Schultz, 2007; Haffari and Teh, 2009). On the other hand, to our knowledge the Dirichlet-tree (multinomial) distribution has not been considered as a response distribution, even though multivariate count data (such as counts of bacterial taxa) with observed covariates (such as measurements of dietary nutrients) are frequently encountered in practice. In this section, we use the Dirichlet-tree multinomial distribution to develop a model for regressing the count vector Y on p covariates X1, …, Xp.
To model the effects of covariates, for each υ ∈ 𝒱 and c ∈ 𝒞υ we express log(αυc) as a linear combination of the covariates. Specifically, we assume that
$$\log(\alpha_{\upsilon c}) = X^\top \beta_{\upsilon c}, \tag{2}$$
where X = (X1, …, Xp)⊤, and βυc = (βυc1, …, βυcp)⊤ is the vector of coefficients.
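Let βυ = (βυc, c ∈ 𝒞υ) and β = (βυ, υ ∈ 𝒱). The standard way of estimating β is to use maximum likelihood. Let yi = (yi1, …, yiK)⊤ denote the observed vector of counts and xi = (xi1, …, xip)⊤ the observed vector of covariates, for i = 1, …, n. Up to an additive constant, the log-likelihood function for the observed data is
$$l_{DTM}(\beta) = \sum_{\upsilon \in \mathcal{V}} l_\upsilon(\beta_\upsilon), \tag{3}$$
with
$$l_\upsilon(\beta_\upsilon) = \sum_{i=1}^{n} \left[ \tilde{\Gamma}\left(\sum_{c \in \mathcal{C}_\upsilon} \alpha_{i \upsilon c}\right) - \tilde{\Gamma}\left(\sum_{c \in \mathcal{C}_\upsilon} y_{i \upsilon c} + \sum_{c \in \mathcal{C}_\upsilon} \alpha_{i \upsilon c}\right) + \sum_{c \in \mathcal{C}_\upsilon} \left\{ \tilde{\Gamma}(y_{i \upsilon c} + \alpha_{i \upsilon c}) - \tilde{\Gamma}(\alpha_{i \upsilon c}) \right\} \right],$$
where Γ̃(·) = log{Γ(·)}, $\alpha_{i \upsilon c} = \exp(x_i^\top \beta_{\upsilon c})$, and yiυc = Σl∈ℒδυc(l)yil.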
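The contribution of a single internal node to the log-likelihood can be sketched as follows, assuming the log-linear link in which each branch parameter equals exp(xi⊤βυc). The function name `node_loglik` and the data layout (lists of rows) are hypothetical; the full log-likelihood would be the sum of such terms over all internal nodes.

```python
from math import exp, lgamma, log


def node_loglik(x_rows, y_rows, beta):
    """Log-likelihood term of one internal node, up to an additive constant.

    x_rows: n covariate vectors, each of length p
    y_rows: n vectors of subtree counts, one entry per child of the node
    beta:   one coefficient vector of length p per child of the node
    """
    ll = 0.0
    for x, y in zip(x_rows, y_rows):
        # log-linear link: alpha for each child branch
        alpha = [exp(sum(xj * bj for xj, bj in zip(x, b))) for b in beta]
        a_sum, y_sum = sum(alpha), sum(y)
        ll += lgamma(a_sum) - lgamma(y_sum + a_sum)
        ll += sum(lgamma(yc + ac) - lgamma(ac) for yc, ac in zip(y, alpha))
    return ll


# With all coefficients zero, every alpha equals 1.
value = node_loglik([[0.3, -1.2]], [[1, 2]], [[0.0, 0.0], [0.0, 0.0]])
```

The multinomial coefficient is dropped because it does not depend on β, which is why the value above differs from the full log mass by a constant.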
4. Regularized Likelihood Estimation
One can show that the number of parameters in the Dirichlet-tree multinomial regression model equals the number of covariates, p, times the number of branches, Συ∈𝒱 Kυ, and thus it increases rapidly both as the dimension p grows and as the tree T expands. This makes the maximum likelihood estimation unappealing in terms of both accuracy and interpretability. In order to obtain a reliable and interpretable model, we propose a penalized (negative) log-likelihood method that estimates parameters and selects covariates simultaneously. Specifically, we consider the following objective function
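$$pl_{DTM}(\beta; \lambda_1, \lambda_2) = -l_{DTM}(\beta) + \lambda_1 \sum_{\upsilon \in \mathcal{V}} \sum_{c \in \mathcal{C}_\upsilon} \|\beta_{\upsilon c}\|_1 + \lambda_2 \sum_{\upsilon \in \mathcal{V}} \sum_{c \in \mathcal{C}_\upsilon} \|\beta_{\upsilon c}\|_2, \tag{4}$$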
where λ1 and λ2 are tuning parameters, and ‖·‖1 and ‖·‖2 are the usual l1 and l2 norms.
4.1. Algorithm
Since the penalty function is non-smooth, we adopt the accelerated proximal gradient method (see, e.g., Beck and Teboulle, 2009) to minimize plDTM(β). The algorithm alternates between two sequences {β(t)} and {η(t)}, where t is the index of iteration. At the (t + 1)-th iteration, we approximate −lDTM(β) at η(t) by
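$$-\tilde{l}_{DTM}(\beta; \eta^{(t)}) = -l_{DTM}(\eta^{(t)}) - \left\langle \nabla l_{DTM}(\eta^{(t)}), \beta - \eta^{(t)} \right\rangle + \frac{C}{2} \left\| \beta - \eta^{(t)} \right\|_2^2,$$
where 〈·, ·〉 is the trace inner product, ∇lDTM(·) is the gradient of lDTM(·), and C > 0 is a constant. Let
$$\widetilde{pl}_{DTM}(\beta; \eta^{(t)}) = -\tilde{l}_{DTM}(\beta; \eta^{(t)}) + \lambda_1 \sum_{\upsilon \in \mathcal{V}} \sum_{c \in \mathcal{C}_\upsilon} \|\beta_{\upsilon c}\|_1 + \lambda_2 \sum_{\upsilon \in \mathcal{V}} \sum_{c \in \mathcal{C}_\upsilon} \|\beta_{\upsilon c}\|_2. \tag{5}$$
We update β(t+1) by solving
$$\beta^{(t+1)} = \arg\min_{\beta} \widetilde{pl}_{DTM}(\beta; \eta^{(t)}). \tag{6}$$
We then update η(t+1) as
$$\eta^{(t+1)} = \beta^{(t+1)} + \frac{1 - a_t}{a_t}\, a_{t+1} \left( \beta^{(t+1)} - \beta^{(t)} \right). \tag{7}$$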
Here, at = 2/(t + 2). A simple calculation shows that the optimization problem (6) can be decomposed into a set of subproblems
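$$\min_{\beta_{\upsilon c}} \; \frac{1}{2} \left\| \beta_{\upsilon c} - \left\{ \eta_{\upsilon c}^{(t)} + \frac{1}{C} \nabla l_{\upsilon c}(\eta_\upsilon^{(t)}) \right\} \right\|_2^2 + \frac{\lambda_1}{C} \|\beta_{\upsilon c}\|_1 + \frac{\lambda_2}{C} \|\beta_{\upsilon c}\|_2$$
for υ ∈ 𝒱 and c ∈ 𝒞υ, where ∇lυc(ηυ) is the gradient of lυ(ηυ) with respect to ηυc. Let
$$\tilde{\beta}_{\upsilon c}^{(t+1)} = \mathrm{sgn}\left( z_{\upsilon c}^{(t)} \right) \left( \left| z_{\upsilon c}^{(t)} \right| - \frac{\lambda_1}{C} \right)_+, \qquad z_{\upsilon c}^{(t)} = \eta_{\upsilon c}^{(t)} + \frac{1}{C} \nabla l_{\upsilon c}(\eta_\upsilon^{(t)}),$$
where sgn is the sign function, (·)+ = max(·, 0), and the equality holds element-wise. The solution of the subproblem is easily shown to be
$$\beta_{\upsilon c}^{(t+1)} = \left( 1 - \frac{\lambda_2 / C}{\left\| \tilde{\beta}_{\upsilon c}^{(t+1)} \right\|_2} \right)_+ \tilde{\beta}_{\upsilon c}^{(t+1)}, \tag{8}$$
with the convention that the solution is zero when $\tilde{\beta}_{\upsilon c}^{(t+1)} = 0$.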
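The closed-form solution of each subproblem is the usual two-stage sparse group lasso proximal step: element-wise soft-thresholding followed by group-wise shrinkage. A minimal Python sketch is given below; the function names and the thresholds `t1` and `t2` (standing for λ1/C and λ2/C) are illustrative.

```python
from math import sqrt


def soft_threshold(z, t):
    """Element-wise soft-thresholding: sgn(z) * max(|z| - t, 0)."""
    return [max(abs(zj) - t, 0.0) * (1.0 if zj > 0 else -1.0) for zj in z]


def sparse_group_prox(z, t1, t2):
    """Minimizer of 0.5 * ||b - z||^2 + t1 * ||b||_1 + t2 * ||b||_2."""
    s = soft_threshold(z, t1)
    norm = sqrt(sum(sj * sj for sj in s))
    if norm <= t2:  # the whole group is set to zero
        return [0.0] * len(z)
    scale = 1.0 - t2 / norm
    return [scale * sj for sj in s]


b = sparse_group_prox([4.0, -0.5, -5.0], 1.0, 2.5)
b_zero = sparse_group_prox([0.5, 0.5], 1.0, 1.0)
```

The first stage produces sparsity within a group, and the second stage can remove the group altogether, which is how the penalty yields both within-group and group-level selection.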
Instead of using a constant step size C, we can use the backtracking rule to choose a suitable C at each iteration. In summary, the algorithm for each pair of (λ1, λ2) proceeds as follows.
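S1 Initialize C = Συ∈𝒱 Kυ × p, a0 = 1, β(0), and η(0) = β(0).
S2 Iterate until plDTM (β(t+1)) < pl̃DTM (β(t+1); η(t)).
S2.1 For all υ ∈ 𝒱 and c ∈ 𝒞υ, compute
$$z_{\upsilon c}^{(t)} = \eta_{\upsilon c}^{(t)} + \frac{1}{C} \nabla l_{\upsilon c}(\eta_\upsilon^{(t)})$$
and
$$\tilde{\beta}_{\upsilon c}^{(t+1)} = \mathrm{sgn}\left( z_{\upsilon c}^{(t)} \right) \left( \left| z_{\upsilon c}^{(t)} \right| - \frac{\lambda_1}{C} \right)_+ .$$
S2.2 Update
$$\beta_{\upsilon c}^{(t+1)} = \left( 1 - \frac{\lambda_2 / C}{\left\| \tilde{\beta}_{\upsilon c}^{(t+1)} \right\|_2} \right)_+ \tilde{\beta}_{\upsilon c}^{(t+1)} .$$
S2.3 Set C to be 2C.
S3 Update at+1 = 2/ (t + 3) and
$$\eta^{(t+1)} = \beta^{(t+1)} + \frac{1 - a_t}{a_t}\, a_{t+1} \left( \beta^{(t+1)} - \beta^{(t)} \right).$$
S4 If the objective value converges, stop the algorithm. Otherwise, set t to be t + 1 and go to S2.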
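The steps above can be sketched generically in Python for a single coefficient group. This is a simplified illustration under stated assumptions, not the authors' R implementation: `f` plays the role of the smooth loss −lDTM, `grad` is its gradient (so the step subtracts the gradient of `f`), and the test problem, names, and starting value of `C` are hypothetical.

```python
from math import sqrt


def prox(z, t1, t2):
    """Sparse group lasso proximal step for one group."""
    s = [max(abs(v) - t1, 0.0) * (1.0 if v > 0 else -1.0) for v in z]
    norm = sqrt(sum(v * v for v in s))
    return [0.0] * len(z) if norm <= t2 else [(1.0 - t2 / norm) * v for v in s]


def apg(f, grad, beta0, lam1, lam2, C=1.0, max_iter=500, tol=1e-10):
    """Minimize f(b) + lam1 * ||b||_1 + lam2 * ||b||_2 for a single group b."""

    def penalty(b):
        return lam1 * sum(abs(v) for v in b) + lam2 * sqrt(sum(v * v for v in b))

    beta, eta = list(beta0), list(beta0)
    obj = f(beta) + penalty(beta)
    for t in range(max_iter):
        g, f_eta = grad(eta), f(eta)
        while True:  # backtracking: double C until the quadratic surrogate bounds f
            z = [e - gi / C for e, gi in zip(eta, g)]
            new = prox(z, lam1 / C, lam2 / C)
            diff = [n - e for n, e in zip(new, eta)]
            surrogate = (f_eta + sum(gi * d for gi, d in zip(g, diff))
                         + 0.5 * C * sum(d * d for d in diff))
            if f(new) <= surrogate + 1e-12:
                break
            C *= 2.0
        # momentum weight (1 - a_t) / a_t * a_{t+1} with a_t = 2 / (t + 2) equals t / (t + 3)
        w = t / (t + 3.0)
        eta = [n + w * (n - b) for n, b in zip(new, beta)]
        beta = new
        new_obj = f(beta) + penalty(beta)
        if abs(obj - new_obj) < tol:
            break
        obj = new_obj
    return beta


# Toy smooth loss 0.5 * ||b - target||^2, whose penalized minimizer is known in closed form.
target = [4.0, -0.5, -5.0]
f = lambda b: 0.5 * sum((bj - tj) ** 2 for bj, tj in zip(b, target))
grad = lambda b: [bj - tj for bj, tj in zip(b, target)]
solution = apg(f, grad, [0.0, 0.0, 0.0], 1.0, 2.5, C=0.25)
```

Since the penalty terms are identical on both sides of the backtracking condition, comparing the smooth loss with its quadratic surrogate is equivalent to comparing the penalized objective with its penalized surrogate.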
Three things are noteworthy. First, each of the two sequences is computed analytically, hence the algorithm is easy to implement. Second, the complexity of the algorithm scales well with both the number of covariates and the number of branches. Third, as −lDTM(β) is non-convex, convergence to a global minimizer is not guaranteed. In general, the algorithm converges to a local minimizer, since the objective value decreases over iterations. Empirically, based on our own experience in simulations, the algorithm is fast and efficient.
4.2. Tuning and Implementation
A key issue in the above penalized likelihood estimation is the choice of appropriate values for the tuning parameters. To this end, we consider a re-parametrization of λ1 and λ2 as follows. Let λ = λ1 + λ2 and γ = λ2/(λ1 + λ2). We can write
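$$pl_{DTM}(\beta; \lambda, \gamma) = -l_{DTM}(\beta) + \lambda \left\{ (1 - \gamma) \sum_{\upsilon \in \mathcal{V}} \sum_{c \in \mathcal{C}_\upsilon} \|\beta_{\upsilon c}\|_1 + \gamma \sum_{\upsilon \in \mathcal{V}} \sum_{c \in \mathcal{C}_\upsilon} \|\beta_{\upsilon c}\|_2 \right\}.$$
Denote by β̂(λ, γ) the minimizer of plDTM(β; λ, γ). Two special cases are the lasso estimate β̂(λ, 0) and the group lasso estimate β̂(λ, 1). When 0 < γ < 1, we call β̂(λ, γ) a sparse group lasso estimate.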
For each given γ, we use the Bayesian information criterion (Schwarz, 1978) to select λ. To be specific, we define
$$\mathrm{BIC}(\lambda, \gamma) = -2\, l_{DTM}\{\hat{\beta}(\lambda, \gamma)\} + \log(n) \times \mathrm{df}\{\hat{\beta}(\lambda, \gamma)\},$$
where df{β̂(λ, γ)} is the effective degrees of freedom implied by β̂(λ, γ). In this article, we approximate df{β̂(λ, γ)} by the number of nonzero components in β̂(λ, γ). We then select λ by minimizing BIC(λ, γ) over a set of candidate values. Note that cross validation is an alternative to BIC, but it tends to over-select spurious features (Wasserman and Roeder, 2009). Also, cross validation is computationally too intensive on large data sets.
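The selection rule can be sketched in a few lines of Python, assuming the usual form of the criterion (minus twice the log-likelihood plus log(n) times the degrees of freedom) with the degrees of freedom approximated by the number of nonzero coefficients. The function names and the candidate fits below are made up for illustration.

```python
from math import log


def bic(loglik, beta, n, tol=0.0):
    """BIC with degrees of freedom taken as the number of nonzero coefficients."""
    df = sum(1 for group in beta for b in group if abs(b) > tol)
    return -2.0 * loglik + log(n) * df


def select_lambda(fits, n):
    """fits: list of (lam, loglik, beta) tuples; return the lam with the smallest BIC."""
    return min(fits, key=lambda fit: bic(fit[1], fit[2], n))[0]


# Hypothetical fits at three candidate values of lambda.
fits = [
    (1.0, -120.0, [[0.0, 0.0], [0.0, 0.0]]),
    (0.5, -100.0, [[0.3, 0.0], [0.0, 0.0]]),
    (0.1, -99.0, [[0.3, 0.1], [0.2, 0.0]]),
]
best = select_lambda(fits, n=75)
```

In this made-up example the middle value wins because the small gain in log-likelihood at the smallest λ does not offset the penalty for two extra nonzero coefficients.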
It is now essential to specify an interval for the values of λ, for each given γ. Clearly, β̂(λ, γ) is exactly zero when λ is sufficiently large. To determine such a value, denoted by λmax, we search for the smallest λ so that zero is a minimizer of
$$\frac{1}{2} \left\| \beta_{\upsilon c} - \nabla l_{\upsilon c}(0) \right\|_2^2 + \lambda \left\{ (1 - \gamma) \|\beta_{\upsilon c}\|_1 + \gamma \|\beta_{\upsilon c}\|_2 \right\}$$
for all υ ∈ 𝒱 and c ∈ 𝒞υ. To obtain estimates over a grid of λ values, we start with λmax for which β̂(λ, γ) = 0, decrease it on a log scale, and run the proposed algorithm to compute β̂(λ, γ). Then λ is decreased again and the process is repeated. To reduce the computing time, we use the estimate from the previous λ as a warm start, and to avoid over-fitting, we stop the process once the proportion of nonzero components in β̂(λ, γ) exceeds a certain threshold (e.g., 80%).
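One way to locate λmax numerically is bisection on the sparse group lasso optimality condition at zero: for every group, the norm of the soft-thresholded gradient must not exceed γλ. The Python sketch below takes the per-branch gradients at β = 0 as given; the function names, the grid ratio, and the example gradients are assumptions for illustration.

```python
from math import sqrt


def zero_is_optimal(grads, lam, gamma):
    """Check the optimality condition at zero for every coefficient group."""
    for g in grads:
        s = [max(abs(v) - (1.0 - gamma) * lam, 0.0) for v in g]
        if sqrt(sum(v * v for v in s)) > gamma * lam:
            return False
    return True


def lambda_max(grads, gamma, n_bisect=200):
    """Smallest lam (up to bisection accuracy) at which all groups are zero."""
    # The largest l1 norm of a group gradient is a valid upper bound for any gamma in [0, 1].
    lo, hi = 0.0, max(sum(abs(v) for v in g) for g in grads)
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if zero_is_optimal(grads, mid, gamma):
            hi = mid
        else:
            lo = mid
    return hi


def lambda_grid(lam_max, ratio=0.01, m=20):
    """Decreasing grid on a log scale from lam_max down to ratio * lam_max."""
    return [lam_max * ratio ** (k / (m - 1)) for k in range(m)]


grads = [[3.0, -4.0], [1.0, 2.0]]  # hypothetical gradients at beta = 0
lam_lasso = lambda_max(grads, 0.0)
lam_group = lambda_max(grads, 1.0)
grid = lambda_grid(lam_group)
```

For the lasso (γ = 0) the result is the largest absolute gradient entry, and for the group lasso (γ = 1) it is the largest group-wise Euclidean norm, which gives two easy checks on the bisection.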
It might be cumbersome to interpret potentially tens of thousands of parameters in β̂(λ, γ). To partly overcome this problem, as we will do in the data analysis, we can proceed as follows. Note that for each leaf node l in the tree, which represents a count response, there is a path from the root node υ0 to l. To see how a covariate (e.g., a dietary nutrient) affects this count response (e.g., a bacterial taxon), we can check whether this nutrient appears in the path from υ0 to l. In other words, we look at the problem from the viewpoint of covariate selection and take advantage of the tree topology to summarize the information in β̂(λ, γ).
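This path-based summary is simple to express in code. The following Python sketch walks from a leaf up to the root and collects the indices of covariates with a nonzero coefficient on any branch along the way; the parent map, the coefficient values, and the function name are hypothetical.

```python
def covariates_on_path(parent, beta, leaf, tol=0.0):
    """Indices of covariates with a nonzero coefficient on the root-to-leaf path."""
    selected = set()
    node = leaf
    while node in parent:  # the root has no parent, which ends the walk
        v = parent[node]
        selected.update(j for j, b in enumerate(beta[(v, node)]) if abs(b) > tol)
        node = v
    return sorted(selected)


# Hypothetical tree with root "r", internal node "a", and leaves 1, 2, 3.
parent = {1: "a", 2: "a", "a": "r", 3: "r"}
beta = {
    ("r", "a"): [0.0, 0.2, 0.0],
    ("r", 3): [0.0, 0.0, 0.0],
    ("a", 1): [0.5, 0.0, 0.0],
    ("a", 2): [0.0, 0.0, 0.0],
}
hits_leaf1 = covariates_on_path(parent, beta, 1)
hits_leaf3 = covariates_on_path(parent, beta, 3)
```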
5. Simulation Studies
In this section, we use simulated data to examine the performance of our method. To mimic the association study between nutrient intake and OTU abundance that will be described in Section 6, we generate n = 75 samples, each with K = 28 count responses and p = 100 covariates, and we use the tree in Figure 2 to represent the relationship among the responses. For simplicity, the first 10 covariates are assumed to be relevant (i.e., each of them is associated with at least one response) and the rest redundant.
Figure 2.
The phylogenetic tree. Leaves (i.e., OTUs) are labeled as 1, …, 28, and internal nodes are labeled as 29, …, 55 and indicated by circles.
We first draw the covariate vector X = (X1,…, Xp)⊤ from a multivariate normal distribution with mean vector zero and covariance matrix Σ = (0.5^{|i−j|}) ∈ ℝ^{p×p}. We then generate the count vector Y = (Y1, …, YK)⊤ using the Dirichlet-tree multinomial model with parameters specified in Figure 3. Here, we choose nonzero coefficients to be of the same magnitude just for simplicity and for ease of illustration. The conclusions are qualitatively similar if they vary in magnitude.
Figure 3.
Coefficients of the 10 relevant covariates used in the Dirichlet-tree multinomial model. This figure appears in color in the electronic version of this article.
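A data-generating step of this kind can be sketched with the Python standard library. The covariates are drawn through a first-order autoregressive recursion, which yields the covariance 0.5^{|i−j|}, and the counts are drawn by splitting the total down the tree with Dirichlet branch probabilities. The small tree, the coefficient values, the total count of 1000, and the function names are assumptions and do not reproduce the settings in Figures 2 and 3.

```python
import random
from math import exp, sqrt


def ar1_covariates(p, rho, rng):
    """Draw a zero-mean normal vector with covariance rho ** |i - j|."""
    x = [rng.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        x.append(rho * x[-1] + sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
    return x


def sample_dtm(children, alpha, root, total, rng):
    """Draw leaf counts by splitting `total` node by node down the tree."""
    counts = {}
    stack = [(root, total)]
    while stack:
        node, n = stack.pop()
        if node not in children:  # leaf: record its count
            counts[node] = n
            continue
        kids = children[node]
        # Normalized gamma draws are Dirichlet distributed; choices() normalizes the weights.
        g = [rng.gammavariate(alpha[(node, c)], 1.0) for c in kids]
        draws = rng.choices(range(len(kids)), weights=g, k=n) if n > 0 else []
        for i, c in enumerate(kids):
            stack.append((c, draws.count(i)))
    return counts


rng = random.Random(0)
x = ar1_covariates(5, 0.5, rng)
children = {"r": ["a", 3], "a": [1, 2]}
beta = {  # hypothetical coefficients, one vector per branch
    ("r", "a"): [0.5, 0.0, 0.0, 0.0, 0.0],
    ("r", 3): [0.0, 0.0, 0.0, 0.0, 0.0],
    ("a", 1): [0.0, -0.5, 0.0, 0.0, 0.0],
    ("a", 2): [0.0, 0.0, 0.0, 0.0, 0.0],
}
alpha = {k: exp(sum(xj * bj for xj, bj in zip(x, b))) for k, b in beta.items()}
y = sample_dtm(children, alpha, "r", 1000, rng)
```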
We generated 100 data sets in this way, and for each data set, we applied our method with the tuning parameter λ selected by BIC. Throughout this section, we considered four instances of γ: 0, 0.25, 0.5, and 1. To see how each competitor selected the covariates, we calculated the true positive rate and false positive rate at each tuning parameter value λ, and summarized them by the Receiver Operating Characteristic (ROC) curve that resulted from a grid of λ values. The average ROC curves over 100 data replications are shown in Figure 4. We can see that, for this simulation setup, the sparse group lasso (γ = 0.25 or γ = 0.5) performs better than the lasso (γ = 0) and the group lasso (γ = 1).
Figure 4.
Simulation studies. ROC curves for the four regularized estimators, all based on the Dirichlet-tree multinomial model. The colored triangles show the true positive rate (TPR) and false positive rate (FPR), when the tuning parameter λ was selected by the BIC criterion. (a) n = 75, and (b) n = 100.
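Each point on such an ROC curve corresponds to one value of λ. A minimal Python sketch of the two rates is given below, computed here at the level of covariates, which is an assumption about the bookkeeping; the selected set is made up for illustration.

```python
def tpr_fpr(selected, relevant, p):
    """True and false positive rates of a selected covariate set among p covariates."""
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)
    fp = len(selected - relevant)
    return tp / len(relevant), fp / (p - len(relevant))


# Covariates 0..9 are relevant out of p = 100; a hypothetical fit selects four covariates.
tpr, fpr = tpr_fpr({0, 1, 2, 50}, range(10), 100)
```

Repeating this over the grid of λ values and joining the resulting points gives the curve for one data set.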
For each data set, we further generated a separate set of size n* = 25, and computed the prediction error using the following cost function
where are test samples, , and denotes a predicted value of . For comparison, we also evaluated the performance of the method of Chen and Li (2013), which was based on the Dirichlet multinomial model. The results are reported in Table 1. Our method had higher prediction accuracy than theirs. Furthermore, for this simulation setup, the performance of the sparse group lasso was better than that of the (group) lasso.
Table 1.
Means and standard deviations (in parentheses) of the prediction error, over 100 replications, are reported for the four regularized estimators based on the Dirichlet-tree multinomial (DTM) model, and those based on the Dirichlet multinomial (DM) model
| Model | γ | n = 75 | n = 100 |
|---|---|---|---|
| DTM | 0 | 80 (28) | 65 (24) |
| DTM | 0.25 | 77 (24) | 63 (20) |
| DTM | 0.5 | 79 (23) | 64 (19) |
| DTM | 1 | 138 (34) | 123 (33) |
| DM | 0 | 315 (653) | 154 (160) |
| DM | 0.25 | 145 (95) | 113 (35) |
| DM | 0.5 | 141 (34) | 115 (25) |
| DM | 1 | 212 (37) | 175 (51) |
Finally, we increased the sample size from n = 75 to n = 100. From Figure 4 and Table 1 we see that the performance of our method improved, but the conclusions remain qualitatively similar.
Note that the Dirichlet-tree distribution reduces to the Dirichlet distribution, when the parameters in the former are chosen carefully. However, this nesting structure is lost in regression models: in the presence of predictors, the Dirichlet multinomial regression is not a sub-model of the Dirichlet-tree multinomial regression. This makes the direct comparison between these two models difficult, and is the reason for our use of prediction accuracy as a measure of performance. To make feasible the comparison of ROC curves between our method and that of Chen and Li (2013), we can calculate the true positive rate and false positive rate in terms of how each method identifies the relevant covariates. For the simulation example considered here, our method was shown to be superior to Chen and Li (2013)’s; see Web Appendix A for details.
6. Data Analysis
The human gut harbors trillions of microbes that play a fundamental role in human health and disease (Clemente et al., 2012). Some recent studies have suggested that diet is an important factor that underlies the gut microbiota composition (De Filippo et al., 2010; Wu et al., 2011; David et al., 2014). One important goal of microbiome studies is thus to identify which food constituents specifically promote/inhibit growth and functionality of which microorganisms in the human gut.
The data we consider come from a cross-sectional study associating nutrient intake with the human gut microbiota composition (Wu et al., 2011). In this study, the investigators used 16S rRNA sequencing to characterize fecal samples from 98 healthy volunteers. After processing the generated sequences using the QIIME pipeline with the default parameter settings (Navas-Molina et al., 2013), more than 3000 species-level OTUs were obtained, and the microbiome data were summarized by an OTU table (i.e., a matrix of OTU counts) including the taxonomy of each OTU, and a phylogenetic tree relating these OTUs. To link long-term dietary patterns with the microbiome composition, diet information of each subject was collected using a food frequency questionnaire, and was converted into nutrient intake profiles for more than 200 micro-nutrients.
In our analysis, we combined the species-level OTUs at the genus level, and considered 28 relatively prevalent genera that occurred in more than 25 of the 98 samples. Further, we considered a subset of 97 dietary nutrients (Wu et al., 2011) that were obtained by screening the nutrients using a distance-based nonparametric test (McArdle and Anderson, 2001). Our final data consisted of an OTU-count matrix Y ∈ ℝ^{98×28}, a phylogenetic tree T among 28 genus-level OTUs, as shown in Figure 2, and a nutrient-intake matrix X ∈ ℝ^{98×97}. Each nutrient was standardized to have zero mean and unit variance.
In order to evaluate the performance, we randomly split the 98 samples into balanced training and test sets. Specifically, we set approximately 75% of the observations as training samples, and the rest as test samples. Our method was first applied to the training set, and its performance was then assessed by the test set using the same cost function defined in the previous section. We also compared our method with the method of Chen and Li (2013), which was developed for the Dirichlet multinomial model and thus failed to exploit the tree information. Three values of γ were explored to favor sparse solutions, 0, 0.25, and 0.5.
To reduce the variability, the splitting into training and test sets was repeated 100 times and the error estimates were averaged. The results are summarized in Table 2. We can see that the incorporation of the tree structure helped improve the prediction accuracy.
Table 2.
Means and standard deviations (in parentheses) of the prediction error, based on 100 random splits of 98 samples into training sets (75 samples) and test sets (23 samples), are reported for the three regularized estimators based on the Dirichlet-tree multinomial (DTM) model, and those based on the Dirichlet multinomial (DM) model.
| Model | γ | Prediction error |
|---|---|---|
| DTM | 0 | 170 (88) |
| DTM | 0.25 | 162 (77) |
| DTM | 0.5 | 149 (69) |
| DM | 0 | 189 (112) |
| DM | 0.25 | 169 (95) |
| DM | 0.5 | 154 (76) |
To see how dietary nutrients affected bacterial taxa, we set γ = 0.25 and concentrated on two genera, Bacteroides and Prevotella, that were previously used to define enterotypes of the human gut microbiome (Arumugam et al., 2011; Wu et al., 2011). For each genus-nutrient pair, we checked whether the nutrient appeared in the path from the root node to the leaf node corresponding to the genus. The top 10 nutrients, based on 100 random splits of the total samples into training and test sets, are listed in Table 3. The results are largely consistent with the findings of Wu et al. (2011), where they reported that the Bacteroides enterotype was associated with a number of amino acids and animal protein, while the Prevotella enterotype was associated with a carbohydrate-based diet. Comparing our method with that of Chen and Li (2013), the two subsets of identified nutrients were different. Nevertheless, the conclusions were similar, largely because some nutrients are highly correlated. In Web Appendix B, we provide a heatmap of selection frequencies of all dietary nutrients across 100 random splits.
Table 3.
The 10 most frequently selected nutrients, respectively for Bacteroides and Prevotella, are reported based on 100 random splits of the total samples
| Bacteroides | Prevotella |
|---|---|
| Cholesterol | Catechin, flavan-3-ol |
| Added Germ from wheats | Total Trans/Cis Trans Linoleic |
| Sucrose | Alcohol |
| Valine | Added Germ from wheats |
| Proline | Naringenin, flavanone |
| Methionine | Cholesterol |
| Maltose | Glycemic Index |
| Animal Protein | Sucrose |
| Serine | Maltose |
| Tyrosine | Valine |
7. Discussion
We have considered a generalization of the Dirichlet multinomial distribution for describing multivariate, over-dispersed, and tree-structured count data, called the Dirichlet-tree multinomial distribution, and its associated regression model for relating a set of covariates to these counts. By introducing the l1 + l2 regularization, we have proposed a method for simultaneous parameter estimation and covariate selection. We have conducted simulation studies to demonstrate the efficacy of the new approach. When applied to analyze the motivating data set, our method was shown to improve the accuracy of prediction.
We have implemented the proposed method in R (R Core Team, 2014). Our experience suggests that the procedure works reasonably fast. For instance, for the simulation example in Section 5 where n = 100, K = 28, and p = 100, the average computing time (including tuning by BIC) was less than 1.5 minutes on the Yale Louise high performance cluster (Dell m620 system, 8-core processor, 48G of memory) using R version 3.1.0 and a single processor. Due to the incorporation of tree structure, our method is computationally more intensive than that of Chen and Li (2013). However, the complexity of the proposed algorithm increases linearly with the number of branches, and in the numerical studies our method was 2–4 times slower than Chen and Li’s. We also tried the block-coordinate descent algorithm of Chen and Li (2013), and found that this algorithm was much slower than ours, and that it occasionally broke down for a range of tuning parameters. We believe our method of coupling l1 and l2 regularizations with the Dirichlet-tree multinomial model would provide a useful data analysis tool for associating covariates (such as dietary nutrients) with multivariate, over-dispersed, and tree-structured count data (such as taxa counts).
In the literature, several schemes have been proposed to take advantage of the tree structure over either the covariates or responses (see, for example, Zhao et al., 2009; Jenatton et al., 2011; Kim and Xing, 2012; Garcia et al., 2014). However, these methods are mainly for univariate or multivariate continuous responses, while our method is for multivariate count responses. More importantly, our way of incorporating the tree information into the model is inherently different from the imposition of tree-guided penalties on the regression parameters for achieving structured sparsity. Indeed, our proposed Dirichlet-tree multinomial model is more flexible than the Dirichlet multinomial model in the sense that it allows a rich dependence structure among the count responses.
It might be reasonable to expect that non-zero regression coefficients are similar for phylogenetically related OTUs, and in this case some tree-based fused lasso regularization is preferable to sparse group lasso regularization. But, the resulting optimization is expected to be more challenging. Also, in reality it is possible that sister taxa can have very different metabolic pathways, ecologies, and interactions with the host. A comparison of our method and a tree-based fused lasso on multiple data sets would be an interesting area for future research.
In this article, we assume that the tree structure is known a priori. In the real data example, the tree is learned from 16S rRNA sequences using a fast and approximate maximum likelihood method (Price et al., 2010). Note that the Dirichlet-tree multinomial regression reduces to the Dirichlet multinomial regression, when the tree is a “comb” tree, that is, the root node of the tree is the only internal node (see Figure 1 for illustration). This, together with the simulation results in Section 5, suggests that the correct specification of the tree structure is important. Indeed, tree mis-specification could be a major risk for shotgun sequencing data (Matsen et al., 2010). Another issue is that we exploit the topology of the phylogenetic tree, but ignore the branch lengths. Since each component lυ of the log-likelihood function (3) corresponds to one internal node υ, direct incorporation of branch lengths is cumbersome. For each internal node υ, the distances of its child nodes from it (i.e., branch lengths) may be summarized and then used as the weight for lυ. But, robustness against tree mis-specification could be a concern, since both tree topology and branch lengths are exploited. Also, we need to choose a summary statistic that is biologically meaningful. Alternatively, we may use a new model with new inference algorithms. These are beyond the scope of the present article.
Supplementary Material
Acknowledgments
We are grateful to the Editor, the Associate Editor, and three anonymous referees for their helpful comments. We thank Hongzhe Li and Jun Chen for providing the data. This work was supported by National Natural Science Foundation of China (11601326), and NIH grant R01 GM59507.
Footnotes
Web Appendices A and B referenced in Sections 5 and 6, simulated data, and computer code are available with this article at the Biometrics website on Wiley Online Library.
References
- Aitchison J. The statistical analysis of compositional data. Journal of the Royal Statistical Society, Series B. 1982;44:139–177.
- Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, et al. Enterotypes of the human gut microbiome. Nature. 2011;473:174–180. doi: 10.1038/nature09944.
- Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences. 2009;2:183–202.
- Billheimer D, Guttorp P, Fagan WF. Statistical interpretation of species composition. Journal of the American Statistical Association. 2001;96:1205–1214.
- Chen J, Li H. Variable selection for sparse Dirichlet-multinomial regression with an application to microbiome data analysis. The Annals of Applied Statistics. 2013;7:418–442. doi: 10.1214/12-AOAS592.
- Cho I, Blaser MJ. The human microbiome: at the interface of health and disease. Nature Reviews Genetics. 2012;13:260–270. doi: 10.1038/nrg3182.
- Clemente JC, Ursell LK, Parfrey LW, Knight R. The impact of the gut microbiota on human health: an integrative view. Cell. 2012;148:1258–1270. doi: 10.1016/j.cell.2012.01.035.
- David LA, Maurice CF, Carmody RN, Gootenberg DB, Button JE, Wolfe BE, et al. Diet rapidly and reproducibly alters the human gut microbiome. Nature. 2014;505:559–563. doi: 10.1038/nature12820.
- De Filippo C, Cavalieri D, Di Paola M, Ramazzotti M, Poullet JB, Massart S, et al. Impact of diet in shaping gut microbiota revealed by a comparative study in children from Europe and rural Africa. Proceedings of the National Academy of Sciences. 2010;107:14691–14696. doi: 10.1073/pnas.1005963107.
- Dennis SY. On the hyper-Dirichlet type 1 and hyper-Liouville distributions. Communications in Statistics-Theory and Methods. 1991;20:4069–4081.
- Dennis SY. A Bayesian analysis of tree-structured statistical decision problems. Journal of Statistical Planning and Inference. 1996;53:323–344.
- Garcia TP, Müller S, Carroll RJ, Walzem RL. Identification of important regressor groups, subgroups and individuals via regularization methods: Application to gut microbiome data. Bioinformatics. 2014;30:831–837. doi: 10.1093/bioinformatics/btt608.
- Haffari G, Teh YW. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics; Stroudsburg, PA, USA: 2009. Hierarchical Dirichlet trees for information retrieval; pp. 173–181.
- Jenatton R, Mairal J, Obozinski G, Bach F. Proximal methods for hierarchical sparse coding. The Journal of Machine Learning Research. 2011;12:2297–2334.
- Kim S, Xing EP. Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. The Annals of Applied Statistics. 2012;6:1095–1117.
- Kuczynski J, Lauber CL, Walters WA, Parfrey LW, Clemente JC, Gevers D, et al. Experimental and analytical tools for studying the human microbiome. Nature Reviews Genetics. 2011;13:47–58. doi: 10.1038/nrg3129.
- Matsen FA, Kodner RB, Armbrust EV. pplacer: Linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics. 2010;11:538. doi: 10.1186/1471-2105-11-538.
- McArdle BH, Anderson MJ. Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology. 2001;82:290–297.
- Minka T. The Dirichlet-tree distribution. Technical report. 2004. Paper available online at: http://msr-waypoint.com/en-us/um/people/minka/papers/dirichlet/minka-dirtree.pdf.
- Mosimann JE. On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika. 1962;49:65–82.
- Navas-Molina JA, Peralta-Sánchez JM, González A, McMurdie PJ, Vázquez-Baeza Y, Xu Z, et al. Advancing our understanding of the human microbiome using QIIME. Methods in Enzymology. 2013;531:371–444. doi: 10.1016/B978-0-12-407863-5.00019-8.
- Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5:e9490. doi: 10.1371/journal.pone.0009490.
- R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2014.
- Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology. 2009;75:7537–7541. doi: 10.1128/AEM.01541-09.
- Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.
- Spor A, Koren O, Ley R. Unravelling the effects of the environment and host genotype on the gut microbiome. Nature Reviews Microbiology. 2011;9:279–290. doi: 10.1038/nrmicro2540.
- Tam Y-C, Schultz T. Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, Volume 4, IV–41. IEEE Piscataway; New Jersey, US: 2007. Correlated latent semantic model for unsupervised LM adaptation.
- Wasserman L, Roeder K. High dimensional variable selection. The Annals of Statistics. 2009;37:2178–2201. doi: 10.1214/08-aos646.
- Wu GD, Chen J, Hoffmann C, Bittinger K, Chen Y-Y, Keilbaugh SA, et al. Linking long-term dietary patterns with gut microbial enterotypes. Science. 2011;334:105–108. doi: 10.1126/science.1208344.
- Xia F, Chen J, Fung WK, Li H. A logistic normal multinomial regression model for microbiome compositional data analysis. Biometrics. 2013;69:1053–1063. doi: 10.1111/biom.12079.
- Zhao P, Rocha G, Yu B. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics. 2009;37:3468–3497.