Author manuscript; available in PMC: 2019 Mar 1.
Published in final edited form as: Ann Appl Stat. 2018 Mar 9;12(1):540–566. doi: 10.1214/17-AOAS1102

KERNEL-PENALIZED REGRESSION FOR ANALYSIS OF MICROBIOME DATA

Timothy W Randolph, Sen Zhao, Wade Copeland, Meredith Hullar, Ali Shojaie
PMCID: PMC6138053  NIHMSID: NIHMS978592  PMID: 30224943

Abstract

The analysis of human microbiome data is often based on dimension-reduced graphical displays and clusterings derived from vectors of microbial abundances in each sample. Common to these ordination methods is the use of biologically motivated definitions of similarity. Principal coordinate analysis, in particular, is often performed using ecologically defined distances, allowing analyses to incorporate context-dependent, non-Euclidean structure. In this paper, we go beyond dimension-reduced ordination methods and describe a framework of high-dimensional regression models that extends these distance-based methods. In particular, we use kernel-based methods to show how to incorporate a variety of extrinsic information, such as phylogeny, into penalized regression models that estimate taxon-specific associations with a phenotype or clinical outcome. Further, we show how this regression framework can be used to address the compositional nature of multivariate predictors composed of relative abundances; that is, vectors whose entries sum to a constant. We illustrate this approach with several simulations using data from two recent studies on gut and vaginal microbiomes. We conclude with an application to our own data, where we also incorporate a significance test for the estimated coefficients that represent associations between microbial abundance and percent fat.

Keywords and phrases: compositional data, distance-based analysis, kernel methods, microbial community data, penalized regression

1. Introduction

A common tool in the analysis of data from microbiome studies is a scatterplot of dimension-reduced microbial abundance vectors. This is a display of the samples’ beta diversity which, in ecology, refers to differences among various habitats. When applied to human studies, beta diversity describes the variation in microbial community structure across sampling units (e.g., human subjects): a beta diversity plot displays the n sampling units with respect to the principal coordinates of their microbial abundance vectors, each consisting of measures on the p taxa (bacterial types) observed in the study; see, e.g., Claesson et al. (2012); Koren et al. (2013); Kuczynski et al. (2010); Goodrich et al. (2014). This principal coordinates analysis (PCoA; or multidimensional scaling, MDS) begins with an n×n matrix of pairwise dissimilarities between vectors of taxon abundances. The choice of dissimilarity measure may greatly influence the biological interpretation (Lozupone et al., 2007; Fukuyama et al., 2012). Euclidean distance is rarely used.

A common assay of microbial content is based on counting sequences observed from the 16S rRNA gene, a marker used to identify bacterial species or other taxonomic categories. We will generically refer to “taxa” rather than specifying a category, such as genus or species. A single taxon may be placed within the context of a phylogenetic tree in order to provide evolutionary relationships among taxa. Dissimilarity measures that account for these phylogenetic relationships are assumed to enhance statistical analyses — for instance, to improve the power of statistical tests — because they incorporate the degree of divergence between sequences (Chen et al., 2012) and do not ignore “the correlation between evolutionary and ecological similarity” (Hamady and Knight, 2009). The UniFrac distance (Lozupone and Knight, 2005), in particular, is based on the premise that taxa which share a large fraction of the phylogenetic tree should be viewed as more similar than those sharing a small fraction of the tree. In the unweighted version of UniFrac, each taxon is quantified merely by its presence or absence; the distance between a pair of samples is based on the number of branches in the tree shared by both. Figure 1(a) is a beta diversity plot of n = 100 human microbial abundance vectors with p = 149 taxa based on data from Yatsunenko et al. (2012). Each sample is represented by 2-dimensional coordinates with respect to the unweighted UniFrac distance, and the size of each point is proportional to log(age) of the subject.

Fig 1. PCoA plots of data from Yatsunenko et al. (2012).

(a): PCoA plot with respect to unweighted UniFrac distance where dot size is proportional to log(age) of the subject. (b): PCoA plot with respect to unweighted UniFrac distance, dot size is proportional to yTrue from the model in Eq. (3.2) with ε = 0.

Dissimilarity measures in microbiome studies are many and varied, with a rich collection that, like UniFrac, exploit the phylogenetic structure: Chen et al. (2012) generalize UniFrac by reweighting rare and abundant lineages; double principal coordinate analysis (DPCoA) (Pavoine, Dufour and Chessel, 2004), as shown by Purdom (2011), generalizes PCA by incorporating the covariance that would arise if the data were created by a process modeled by the tree; the edge PCA method of Matsen and Evans (2013) incorporates taxon abundance information at all nodes in a phylogenetic tree, rather than just the leaves of the tree; and Evans and Matsen (2012) formalize the mathematical interpretation of UniFrac as just one example within a large family of Wasserstein (or earth mover’s) metrics. A wide variety of nonphylogenetic dissimilarities are also in common use, such as Bray-Curtis (The Human Microbiome Project Consortium, 2012) and Jensen-Shannon (Koren et al., 2013), among others.

While PCoA plots provide valuable graphical insight into the relationships among microbial profiles and an outcome or phenotype, they do not quantify this association. More importantly, the (sets of) taxa associated with the outcome — and the magnitude or statistical significance of such associations — are not ascertained from a PCoA plot; once a matrix of (dis)similarities between samples is formed, it is not clear how to identify individual taxa that are associated with an outcome. Specifically, given a PCoA plot as in Figure 1(a), with structure imposed by the chosen dissimilarity matrix (e.g., unweighted UniFrac) and with associations implied by a class label or continuous outcome (e.g., age), how does one estimate which taxa or subcommunities are associated with this outcome? We address this question by formulating multivariate regression models that are constrained by the structure of the (dis)similarity matrix. This is made possible by exploiting an equivalence between a taxon-based (primal space) and sample-based (dual space) formulation of our penalized regression models. While exploiting such an equivalence is straightforward in the special case of ridge regression (with purely Euclidean structure), it becomes complicated when more general distance measures are used. To this end, we show how a little-used regularization scheme by Franklin (1978) provides a dual-space regression coefficient estimate that naturally connects to primal-space coefficients. Because a dissimilarity matrix can be used to construct a similarity matrix (as commonly done in classical MDS (Mardia, Kent and Bibby, 1980)), we work with kernels, rather than distances, and allow for general kernels, including those constructed from a nonlinear feature map.

In addition to complications stemming from more general distances, the analysis of microbiome data is also complicated by the compositional nature of the data itself. More specifically, taxon measures typically represent relative, rather than absolute, abundances. The p-variate relative abundance vectors are thus compositional in that they are constrained to a simplex within $\mathbb{R}^p$; such data do not reside in a Euclidean vector space (Aitchison, 2003a). Consequently, spurious correlations arise and standard multiple regression models fail. Our proposed KPR framework, however, addresses this: the centered log-ratio (CLR) transform of the relative abundance vectors first removes the vectors from the simplex, then the estimation process is constrained using a penalization term defined by Aitchison’s variation matrix. This approach takes a different perspective from the recent proposal of Li (2015), which forces the estimated coefficient vector to reside in the simplex. Given that the CLR transforms the compositional vectors to Euclidean space and that the units of the Aitchison variation matrix are the same as the CLR transformed data (Egozcue and Pawlowsky-Glahn, 2011), our constraint seems more suitable for the geometry of the problem.

In summary, we describe a family of high-dimensional regression problems in Section 2, which are designed to incorporate the assumptions that are tacitly implied by various exploratory and graphically-focused PCoA plots common in microbiome studies. We show how phylogenetic and other structure can be incorporated via kernel penalized regression in either the primal (p-dimensional) feature space or the dual (n-dimensional) samples space; see Sections 2.2 and 2.3. Finally, our proposed framework leads to an approach, described in Section 2.4, for addressing well-known problems that arise from applying standard (Euclidean-based) statistical models to compositional data. Section 3 illustrates the proposed framework with simulations based on publicly available data, while Section 4 presents an application to our recent microbiome study of premenopausal women. In this analysis, we obtain estimates of associations between microbial species and percent fat measured in premenopausal women, and also provide inference for these estimates by applying a recent significance test (Zhao and Shojaie, 2016) in our kernel-penalized regression (KPR) framework.

2. Kernel Penalized Regression for Microbiome Data

We describe a family of multiple regression problems aimed at incorporating assumptions that are implicit in principal coordinate analysis (PCoA) plots common in microbiome studies. We begin in Section 2.1 by establishing notation and concepts from existing dimension-reduction (ordination) methods with the goal of extending them to non-truncated (penalized) regression models. Section 2.2 extends PCoA and principal component regression (PCR) to penalized regression models in the primal space in a manner that incorporates structures implicit in recent microbiome analyses. Section 2.3 extends kernel ridge regression to general (non-L2) structure and the use of two kernels. This extension exploits a dual-space regularization scheme of Franklin (1978). Section 2.4 describes how our proposed framework can be applied to formulate a penalized regression model that accounts for the compositional nature of relative abundance data.

We denote by $y_i$, $i = 1, \dots, n$, a real-valued quantified trait and by $x_i = [x_{i1}, \dots, x_{ip}]'$ a p-dimensional vector of microbial abundance values measured for each of n subjects. Denote by X the n × p sample-by-taxon matrix whose ith row is $x_i$. We assume throughout that the columns of X are mean centered. For now, we assume that the abundance values are appropriately normalized/transformed and postpone the treatment of compositional data to Section 2.4. The transpose of a matrix A is denoted by $A'$ and the Frobenius norm is denoted as $\|A\|_F$. $I_p$ denotes the identity matrix on $\mathbb{R}^p$ and the Euclidean norm of a vector $x \in \mathbb{R}^p$ is denoted $\|x\|_p$ or simply $\|x\|$.

2.1. Background for PCoA and principal component regression

Consider first the Euclidean PCoA, which is obtained from the eigenvectors of the kernel matrix $K_I := XX'$ of inner products $K_{ij} = \langle x_i, x_j \rangle$ between samples. Let $J$ be the centering matrix, $J = I - \frac{1}{n}\mathbf{1}\mathbf{1}'$, where $\mathbf{1}$ is the n×1 vector of ones. Then, it can be seen that $XX' = -\frac{1}{2} J \Delta^E J$, where $\Delta^E$ is the n×n matrix of squared Euclidean distances between samples: $\Delta^E_{i,j} = \|x_i - x_j\|_p^2$. The relationship between a kernel and a distance matrix $\Delta$ is more general. In particular, if $\Delta$ is any n × n symmetric matrix of squared dissimilarities between vectors in $\mathbb{R}^p$ then $H = -\frac{1}{2} J \Delta J$ serves as a kernel matrix summarizing similarities; see, e.g., Gower (1966); Pekalska, Paclik and Duin (2002). A particular case involves a p×p symmetric, positive definite matrix Q that defines an inner product $\langle x_i, x_j \rangle_Q = x_i' Q x_j$ on $\mathbb{R}^p$. If $\Delta^Q$ denotes the matrix of squared distances, $\Delta^Q_{i,j} = \|x_i - x_j\|_Q^2 = \langle x_i - x_j, x_i - x_j \rangle_Q$, defined with respect to this inner product, then $XQX' = -\frac{1}{2} J \Delta^Q J$ is also a similarity kernel for the n samples. We will denote this kernel by $K_Q = XQX'$. Similarly, one may start with a matrix $\Delta^U$ of squared distances defined by a tree-based UniFrac dissimilarity (Lozupone and Knight, 2005), and define a similarity kernel by $H = -\frac{1}{2} J \Delta^U J$.
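The double-centering identity can be checked numerically. The following is a minimal NumPy sketch on simulated data, not the authors' code; the matrix `Q` is an arbitrary positive definite example and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                 # column-center X, as assumed in the text

# Squared Euclidean distances between the rows of X
sq_norms = (X ** 2).sum(axis=1)
Delta_E = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T

J = np.eye(n) - np.ones((n, n)) / n    # centering matrix J = I - (1/n) 1 1'
K_from_dist = -0.5 * J @ Delta_E @ J   # kernel recovered from distances

# Same construction for an inner product defined by a positive definite Q
# (Q here is a hypothetical example, not a phylogenetic matrix)
A = rng.normal(size=(p, p))
Q = A @ A.T + p * np.eye(p)
D = X[:, None, :] - X[None, :, :]                # pairwise differences x_i - x_j
Delta_Q = np.einsum("ijk,kl,ijl->ij", D, Q, D)   # (x_i - x_j)' Q (x_i - x_j)
K_Q_from_dist = -0.5 * J @ Delta_Q @ J
```

Under these assumptions `K_from_dist` agrees with `X @ X.T` and `K_Q_from_dist` agrees with `X @ Q @ X.T`, which relies on the columns of `X` being centered.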

In graphical displays, two or three coordinates are typically used to explore the relationship between samples. Let $K = US^2U'$ be the eigendecomposition of any similarity kernel, K, where U is the matrix whose columns are eigenvectors and $S^2 = \mathrm{diag}\{\sigma_j^2\}$ is the diagonal matrix of eigenvalues. The two-dimensional PCoA plot is then the collection of points $\{(\eta_{i1}, \eta_{i2})\}_{i=1}^n := \{(\sigma_1 U_{i1}, \sigma_2 U_{i2})\}_{i=1}^n$; i.e., a plot of the points represented by the first two columns of the matrix US. These points are often colored according to a grouping label or continuous values, $\{y_i\}_{i=1}^n$, to graphically explore the existence of an association between the outcome y and the sample profiles summarized by the first few columns of US. So, a PCoA plot may be viewed as a graphical depiction of a two-component regression model of association:

$$y_i = \gamma_1 \eta_{i1} + \gamma_2 \eta_{i2} + \varepsilon, \quad i = 1, \dots, n, \tag{2.1}$$

where $\eta_1$ and $\eta_2$ are the first two PCoA axes. Ordinary principal component regression corresponds to the case that $\eta_1$ and $\eta_2$ come from the Euclidean kernel $K_I = XX'$. On the other hand, the configuration of points in Figure 1(b) corresponds to the first two eigenvectors of the kernel defined by an unweighted UniFrac distance matrix $\Delta^U$, and the size of individual points corresponds to the values of y from eq. (2.1) with ε = 0.

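Let $A_{(k)}$ denote the first k columns of a matrix A, or its first k rows and columns if A is diagonal. Then, using the singular value decomposition (SVD) $X = USV'$, if we express the dimension-reduced approximation of X as $\breve{X} := U_{(2)} S_{(2)} V_{(2)}'$, then eq. (2.1) can be written as

$$y = \gamma_1 \eta_1 + \gamma_2 \eta_2 + \varepsilon = U_{(2)} S_{(2)} \gamma + \varepsilon = \breve{X} V_{(2)} \gamma + \varepsilon, \tag{2.2}$$

where $\gamma = [\gamma_1\ \gamma_2]'$. Here, $\breve{X} V_{(2)} = U_{(2)} S_{(2)}$ and $\mathrm{Range}(\breve{X}') = \mathrm{Range}(V_{(2)})$. Consequently, the model $y = \breve{X} V_{(2)} \gamma + \varepsilon$ can be written as $y = \breve{X}\beta + \varepsilon$, where β is some vector of the form $\beta = V_{(2)}\gamma$. So inherent in a Euclidean PCoA plot is an implicit coefficient vector, β, which models a linear association between y and $\breve{X}$. Using the SVD of X in (2.2), the PCR estimate of $\beta \in \mathbb{R}^p$ is expressed as

$$\hat{\beta}_{PCR} = (\breve{X}'\breve{X})^{\dagger} \breve{X}' y = V_{(2)} S_{(2)}^{-1} U_{(2)}' y = \sum_{k=1}^{2} \frac{1}{\sigma_k} (u_k' y)\, v_k, \tag{2.3}$$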

where † denotes the Moore-Penrose inverse.
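As an illustration of the two-component PCR estimate, the sketch below computes it from the SVD and compares it with the Moore-Penrose form. It uses simulated data and hypothetical variable names, and is meant only as a numerical check of the algebra.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 8, 5, 2
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                       # column-centered predictors
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U2, s2, V2 = U[:, :k], s[:k], Vt[:k].T    # leading k singular triplets

X_breve = U2 @ np.diag(s2) @ V2.T         # rank-k approximation of X
gamma_hat = (U2.T @ y) / s2               # coefficients on the PCoA axes U2 S2
beta_pcr = V2 @ gamma_hat                 # V_(2) S_(2)^{-1} U_(2)' y

# Moore-Penrose form; the explicit rcond guards against numerically
# nonzero trailing singular values of the rank-k matrix
beta_pinv = np.linalg.pinv(X_breve, rcond=1e-10) @ y
```

The fitted values `X_breve @ beta_pcr` coincide with the regression of `y` on the first two PCoA axes `U2 * s2`.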

2.2. Penalized regression and DPCoA

An alternative to a Euclidean PCR is the ordinary ridge regression (Hoerl and Kennard, 1970),

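$$\hat{\beta}_{ridge} = (X'X + \lambda I)^{-1} X' y = \sum_{k=1}^{n} \left( \frac{\sigma_k^2}{\sigma_k^2 + \lambda} \right) \frac{1}{\sigma_k} (u_k' y)\, v_k, \tag{2.4}$$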

in which the terms are re-weighted, instead of being truncated as in $\hat{\beta}_{PCR}$. The estimate in (2.4) is the solution of the penalized least squares regression problem, $\hat{\beta}_{ridge} = \arg\min_{\beta} \{\|y - X\beta\|^2 + \lambda\|\beta\|^2\}$, where here and throughout λ is a tuning parameter that controls the amount of shrinkage or size of β in the penalty term. Here the penalty is simply the Euclidean (or ℓ2) norm on $\mathbb{R}^p$, but a wide range of penalty terms have been proposed to replace or extend this particular form of regularization; see Bühlmann, Kalisch and Meier (2014) for a review of the most established methods. These methods, such as the lasso, elastic net or SCAD, do not incorporate any extrinsic information, but a variety of other penalization methods have been proposed which aim to do this. For instance, Tanaseichuk, Borneman and Jiang (2014) use a tree-guided penalty (Kim and Xing, 2010) to incorporate such structure into a penalized logistic regression framework to encourage similar coefficients among taxa according to their relationships in the phylogenetic tree. Tibshirani and Taylor (2011) study the solution path for computing a “generalized lasso” estimate in which an ℓ2 penalty is replaced with an ℓ1 penalty applied to a linear transformation of the features, $\lambda\|D\beta\|_1$ for a specified matrix $D$. Within the context of genetic networks, Li and Li (2008) accounted for network structure by augmenting the ℓ1 penalty with a second penalty of the form $\lambda_2 \|\beta\|_{\mathcal{L}}^2 = \lambda_2\, \beta' \mathcal{L} \beta$, where $\mathcal{L}$ denotes the graph Laplacian matrix corresponding to pre-defined connections between genes in a pathway.
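To make the re-weighting concrete, the following sketch compares the closed-form ridge solution with its SVD expansion, in which each term is shrunk by $\sigma_k^2/(\sigma_k^2 + \lambda)$ under the penalty $\lambda\|\beta\|^2$. Data and names are simulated and illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 8, 5, 0.7
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = rng.normal(size=n)

# Closed form: (X'X + lam I)^{-1} X'y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Spectral form: sum_k shrink_k * (1/sigma_k) * (u_k'y) * v_k
U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s**2 / (s**2 + lam)              # filter factors in (0, 1)
beta_svd = Vt.T @ (shrink * (U.T @ y) / s)
```

Setting the filter factors to 1 for the first two terms and 0 for the rest would recover the truncated PCR estimate instead.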

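For now, we consider a positive definite p×p matrix Q with a Cholesky decomposition $Q = LL'$, and a penalty term of the form $\|L^{-1}\beta\|^2 = \|\beta\|_{Q^{-1}}^2 = \beta' Q^{-1} \beta$. The generalized ridge (or Tikhonov regularization (Golub and van Loan, 2012)) estimate with respect to Q is then defined as

$$\hat{\beta}_Q = \arg\min_{\beta} \{\|y - X\beta\|^2 + \lambda \|\beta\|_{Q^{-1}}^2\} = (X'X + \lambda Q^{-1})^{-1} X' y = \sum_{k=1}^{n} \left( \frac{\sigma_k^2}{\sigma_k^2 + \lambda \mu_k^2} \right) \frac{1}{\sigma_k} (u_k' y)\, v_k. \tag{2.5}$$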

This estimate takes the same form as (2.4) but now the vectors $u_k$ and $v_k$ arise from the SVD of $XL = USV'$. Regarding the last equality, note that if A denotes any matrix with p columns, the structure of an estimate $\hat{\beta}_A$ from a penalty term of the form $\|A\beta\|_2^2$ is determined by the joint eigenstructure of the pair (X, A) via the generalized singular value decomposition.¹ In particular, the basis expansion of $\hat{\beta}_Q$ in (2.5) is given in terms of the generalized singular vectors of $(X, L^{-1})$. Although the ridge estimate (with $Q = I_p$) is biased, an informed choice of penalty term can reduce the bias (Randolph, Harezlak and Feng, 2012).
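One way to see the role of the Cholesky factor is that the generalized ridge estimate equals L times an ordinary ridge estimate computed on the transformed predictors XL. The sketch below checks this on simulated data; `Q` is an arbitrary positive definite example rather than a matrix derived from a tree.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 10, 4, 0.5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = rng.normal(size=n)

A = rng.normal(size=(p, p))
Q = A @ A.T + p * np.eye(p)       # hypothetical positive definite Q
L = np.linalg.cholesky(Q)         # lower-triangular factor, Q = L L'

# Direct generalized ridge: (X'X + lam Q^{-1})^{-1} X'y
beta_Q = np.linalg.solve(X.T @ X + lam * np.linalg.inv(Q), X.T @ y)

# Equivalent route: ordinary ridge on Z = XL, then map back with L
Z = X @ L
theta = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
beta_via_Z = L @ theta
```

The substitution $\beta = L\theta$ turns the penalty $\beta'Q^{-1}\beta$ into $\|\theta\|^2$, which is why the two routes agree.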

Now consider the context of phylogenetic information and let δ represent the matrix of squared patristic distances between pairs of taxa, i.e., the sum of branch lengths between each pair of taxa on the leaves of a phylogenetic tree. Set $Q = -\frac{1}{2} J \delta J$, a matrix of similarities between taxa. Double principal coordinate analysis (DPCoA) was proposed by Pavoine, Dufour and Chessel (2004) to provide an alternative to ordinary PCoA that incorporates structure among samples as well as structure implied by the taxa’s distribution among subcommunities, as summarized by Q. Purdom (2011) clarified the original multi-step DPCoA procedure and showed how it can be more simply understood as a generalized PCA (gPCA) in which one obtains the new coordinates from the eigenvectors of $K_Q = XQX'$. Note that when $Q = I_p$, DPCoA reduces to PCA/MDS. As emphasized in Purdom (2011), the use of a non-identity Q matrix incorporates structure from known relationships between the p taxa by exploiting a matrix representation of phylogenetic relationships, thus providing a model for covariance structure.

If we let $Q = LL'$ be a Cholesky decomposition of Q and set $Z := XL$, then the kernel $K_Q = XQX'$ has an eigendecomposition of the form $US^2U'$ with respect to the SVD of $Z = USV'$. This leads to a two-dimensional regression estimate that takes the same form as $\hat{\beta}_{PCR}$ in (2.3). Indeed, we can recover a primal space estimate in terms of singular vectors as

$$\hat{\beta}_{DPCR} := V_{(2)} S_{(2)}^{-1} U_{(2)}' y = \sum_{k=1}^{2} \frac{1}{\sigma_k} (u_k' y)\, v_k. \tag{2.6}$$

That is, implicit in a DPCoA plot is a coefficient vector β̂DPCR which models a two-dimensional linear association between y and Z = XL in the same way that β̂PCR represents a two-dimensional linear association between y and X. Further, XQX′ = (XL)(XL)′ and so U, S and V in β̂DPCR of (2.6) are the same as those in the penalized (non-truncated) estimate, β̂Q, in (2.5). When Q = I, these two estimates reduce to β̂PCR and β̂ridge, respectively.

2.3. Kernel-based regression with two kernels

In addition to similarities among taxa, as in Q, it is often of interest to incorporate similarities among samples as derived, for instance, from UniFrac distances: $H = -\frac{1}{2} J \Delta^U J$. The symmetric positive definite n×n kernel H defines a new inner product on $\mathbb{R}^n$ given by $\langle u, w \rangle_H = u'Hw$, with the corresponding norm $\|u\|_H^2 = \langle u, u \rangle_H$. If we consider both a general kernel, H, and a DPCoA kernel $K_Q = XQX'$, the generalized ridge estimate $\hat{\beta}_Q$ in (2.5) can be extended to

$$\hat{\beta}_{Q,H} := \arg\min_{\beta \in \mathbb{R}^p} \{\|y - X\beta\|_H^2 + \lambda \|\beta\|_{Q^{-1}}^2\} = (X'HX + \lambda Q^{-1})^{-1} X'Hy. \tag{2.7}$$

In this section, we show that the estimate in (2.7) is directly defined based on the generalized eigenvectors of the two kernels $K_Q$ and H. Before proceeding to the general case, let us examine the special case of ridge regression for which $H = I_n$ and $Q = I_p$. A ridge estimate can be obtained by solving an equivalent optimization problem in the dual space $\mathbb{R}^n$, known as kernel ridge regression (Schölkopf and Smola, 2002). Specifically, taking $K_I = XX'$, the ridge estimate in (2.4) can be obtained as $\hat{\beta}_{ridge} = X'\hat{\gamma}_{kernel\ ridge}$, where

$$\hat{\gamma}_{kernel\ ridge} = (K_I + \lambda I)^{-1} y = (K_I^2 + \lambda K_I)^{-1} K_I y = \arg\min_{\gamma \in \mathbb{R}^n} \{\|y - K_I \gamma\|^2 + \lambda \|\gamma\|_{K_I}^2\}. \tag{2.8}$$
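The primal-dual correspondence for ordinary ridge can be verified in a few lines. This is a minimal sketch on simulated data with more taxa than samples, so that the n×n dual system is the smaller one; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 6, 20, 1.5
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = rng.normal(size=n)

# Dual (sample-space) solve with the linear kernel K_I = XX'
K_I = X @ X.T
gamma_kr = np.linalg.solve(K_I + lam * np.eye(n), y)
beta_from_dual = X.T @ gamma_kr           # map back to the primal space

# Primal (taxon-space) ridge solve for comparison
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```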

In the case of ridge, the connection between the dual- and primal-space estimates, γ̂kernel ridge and β̂ridge, relies on the form KI = XX′. Unfortunately, it is less clear how to extend this connection to a general kernel (e.g., UniFrac or polynomial). One way to incorporate a more general kernel K and a second kernel H in (2.8) is to define the penalty in terms of H as

$$\hat{\gamma}^* = (K^2 + \lambda H^{-1})^{-1} K y = \arg\min_{\gamma} \{\|y - K\gamma\|^2 + \lambda \|\gamma\|_{H^{-1}}^2\}, \tag{2.9}$$

which is exactly Tikhonov regularization, but in the dual space; compare eq. (2.5). However, γ̂* ∈ ℝn has no obvious connection to a penalized estimate of β ∈ ℝp and cannot be used to obtain a penalized regression estimate in the primal space, even if K = KI = XX′.

To bridge this gap, we instead apply the Franklin regularization scheme (Franklin, 1978), a little-used alternative to Tikhonov regularization. More specifically, for any kernels K and H, we define the dual estimate

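$$\hat{\gamma}_{H,K} := (K + \lambda H^{-1})^{-1} y = \arg\min_{\gamma \in \mathbb{R}^n} \{\|y - K\gamma\|_{K^{-1}}^2 + \lambda \|\gamma\|_{H^{-1}}^2\}, \tag{2.10}$$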

where the justification for the second equality is given in the supplementary material (Randolph et al., 2017).

Comparing (2.9) and (2.10), one sees that the analytic form of (2.10) involves just K rather than $K^2 = KK$. As shown in Proposition 2.2, this subtle difference is a key for relating a dual-space estimate $\hat{\gamma}_{H,K_Q}$ and its primal-space counterpart, $\hat{\beta}_{Q,H} = QX'\hat{\gamma}_{H,K_Q}$. Before presenting the main result of this section, we provide several equivalent forms of $\hat{\gamma}_{H,K}$.

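$$\hat{\gamma}_{H,K} = (K + \lambda H^{-1})^{-1} y = \arg\min_{\gamma \in \mathbb{R}^n} \{\|y - K\gamma\|_{K^{-1}}^2 + \lambda \|\gamma\|_{H^{-1}}^2\} = \arg\min_{\gamma \in \mathbb{R}^n} \{\|y - K\gamma\|_H^2 + \lambda \|\gamma\|_K^2\} = (HK + \lambda I)^{-1} H y = H (KH + \lambda I)^{-1} y. \tag{2.11}$$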
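The equivalent forms of the Franklin-type dual estimate can be checked numerically. The sketch below assumes both kernels are strictly positive definite (generated at random here, so they are not biologically meaningful) and verifies the three closed forms together with the stationarity of the two objectives.

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 5, 0.8

def random_spd(size, rng):
    """Hypothetical helper: a random symmetric positive definite matrix."""
    A = rng.normal(size=(size, size))
    return A @ A.T + size * np.eye(size)

K, H = random_spd(n, rng), random_spd(n, rng)
y = rng.normal(size=n)
H_inv = np.linalg.inv(H)

g1 = np.linalg.solve(K + lam * H_inv, y)           # (K + lam H^{-1})^{-1} y
g2 = np.linalg.solve(H @ K + lam * np.eye(n), H @ y)   # (HK + lam I)^{-1} H y
g3 = H @ np.linalg.solve(K @ H + lam * np.eye(n), y)   # H (KH + lam I)^{-1} y

# Gradients of the two objectives, evaluated at g1
grad_franklin = -2 * (y - K @ g1) + 2 * lam * H_inv @ g1      # K^{-1}/H^{-1} norms
grad_weighted = -2 * K @ H @ (y - K @ g1) + 2 * lam * K @ g1  # H/K norms
```

Both gradients vanish at the common solution, so the two penalized problems share the same minimizer when K and H are invertible.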

In Proposition 2.2, we also refer to the special case corresponding to the DPCoA ordination. As before, let $Z = XL$ so that $K_Q = XQX' = XLL'X' = ZZ'$. Taking $H = I$, the dual-space estimate in (2.10) is $\hat{\gamma}_{I,K_Q} = (K_Q + \lambda I)^{-1} y$, and so the corresponding primal-space estimate is $\hat{\beta} = Z'\hat{\gamma}_{I,K_Q}$. Since this estimate arises from the DPCoA kernel, we make the following definition.

Definition 2.1

A primal space DPCoA estimate is of the form $\hat{\beta}_{DPCoA} = Z'\hat{\gamma}_{I,K_Q} = L'X'(XQX' + \lambda I)^{-1} y$.

The next proposition collects several properties that emphasize the roles of H and K in our penalized regression framework. In particular, we show that the primal space estimate β̂Q,H can be recovered in terms of two kernels, H and KQ.

Proposition 2.2

Let H and K be any two kernels constructed using the rows of X in the regression model $y = X\beta + \varepsilon$. Then,

  1. $\hat{\gamma}_{H,K}$ is a linear combination of the eigenvectors of the matrix product HK.

  2. For any kernel H and DPCoA kernel $K_Q = XQX'$, the primal- and dual-space estimates in (2.7) and (2.10), respectively, are related as $\hat{\beta}_{Q,H} = QX'\hat{\gamma}_{H,K_Q}$.

  3. For $H = I$ and $Q = LL'$, the generalized ridge and DPCoA estimates are related as $\hat{\beta}_Q = QX'(K_Q + \lambda I_n)^{-1} y = L\hat{\beta}_{DPCoA}$.

The proof, given in the supplementary material (Randolph et al., 2017), makes use of some linear algebraic identities which show, in particular, that

$$\hat{\beta}_{Q,H} = QX'(XQX' + \lambda H^{-1})^{-1} y = QX'\hat{\gamma}_{H,K_Q}. \tag{2.12}$$

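Items 2 and 3 of Proposition 2.2 lend themselves to a direct numerical check. The following sketch uses simulated data and arbitrary positive definite `Q` and `H` (assumptions for illustration only) to compare the primal solve with the dual route.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 7, 12, 0.9
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = rng.normal(size=n)

def random_spd(size, rng):
    """Hypothetical helper: a random symmetric positive definite matrix."""
    A = rng.normal(size=(size, size))
    return A @ A.T + size * np.eye(size)

Q, H = random_spd(p, rng), random_spd(n, rng)
K_Q = X @ Q @ X.T

# Item 2: primal estimate versus Q X' times the dual estimate
beta_primal = np.linalg.solve(X.T @ H @ X + lam * np.linalg.inv(Q), X.T @ H @ y)
gamma_dual = np.linalg.solve(K_Q + lam * np.linalg.inv(H), y)
beta_dual = Q @ X.T @ gamma_dual

# Item 3: with H = I, generalized ridge equals L times the DPCoA estimate
L = np.linalg.cholesky(Q)
beta_dpcoa = L.T @ X.T @ np.linalg.solve(K_Q + lam * np.eye(n), y)
beta_Q = np.linalg.solve(X.T @ X + lam * np.linalg.inv(Q), X.T @ y)
```

When p is much larger than n, the dual route only requires solving an n×n system, which is one practical reason to prefer it.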
Remarks
  1. Types of similarity kernels. In general, a sufficient condition for a matrix K to be a similarity kernel is that it is induced by a feature map ϕ: ℝp → 𝒦. More specifically, the i, j entry of K is defined as the inner product of the observations xi ∈ ℝp with respect to their transformed versions Kij = 〈ϕ(xi), ϕ(xj)〉 in the new inner product space, (𝒦, 〈·, ·〉). Examples include KI = XX′ or KQ = XQX′, where 𝒦 is ℝp with inner product 〈·, ·〉Q (as in DPCoA). It is this quadratic form that we require for KQ in Proposition 2.2(2)–(3); see Freytag et al. (2014) for genomic applications of this form. On the other hand, H can be any symmetric positive semi-definite matrix. Here, we are more interested in biologically-motivated kernels, such as UniFrac or DPCoA, than mathematically-derived ones, such as those constructed from polynomials or radial basis functions (Schölkopf and Smola, 2002).

  2. Co-informative kernels and the HSIC. Any kernels K and H may be used in (2.10) and (2.11), but to be useful in this framework, we assume that they are “co-informative” in the sense that they exhibit a shared eigenstructure; for instance, both should be informative for classifying samples. This concept is illustrated in the simulation of Section 3.3 and Figure 4. The co-informativeness can be made precise using the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005) or its relatives, the distance covariance (Székely and Rizzo, 2009) and the RV statistic (Robert and Escoufier, 1976). Josse and Holmes (2016) provide a nice review of these and related kernel-based tests. The HSIC provides a test for the statistical dependence of two data sets, $X_1$ (n×p) and $X_2$ (n×q), and is based on the eigen-spectrum of covariance operators defined by kernels created from $X_1$ and $X_2$. For two kernels K and H, the empirical HSIC is simply trace(HK). The HSIC is thus of particular interest in item (1) of Proposition 2.2, which shows how two co-informative kernels may be used to obtain a penalized estimate $\hat{\beta}_{Q,H}$.

  3. Linear mixed models and KPR. As an alternative to the regularization framework presented here, one may consider a kernel as a generalized covariance among either the p variables (using Q) or n subjects (using H) (Purdom, 2011; Schaid, 2010). This alternative representation can be made precise using the linear mixed model (LMM) framework (Ruppert, Wand and Carroll, 2003). Specifically, recall from equations (2.7) and (2.11) that
    $$\hat{\beta}_{Q,H} = \arg\min_{\beta \in \mathbb{R}^p} \{\|y - X\beta\|_H^2 + \lambda \|\beta\|_{Q^{-1}}^2\} = QX'(K_Q + \lambda H^{-1})^{-1} y = QX' \arg\min_{\gamma \in \mathbb{R}^n} \{\|y - K_Q\gamma\|_H^2 + \lambda \|\gamma\|_{K_Q}^2\} = QX'\hat{\gamma}_{H,K_Q}.$$

    These regression estimates are compatible with $\beta \sim N(0, \sigma_b^2 Q)$, $\varepsilon \sim N(0, \sigma_e^2 H^{-1})$ and $\mathrm{var}(y) = (\tau K_Q + \lambda H^{-1})^{-1}$. And the estimate $\hat{\gamma}_{H,K_Q}$ is compatible with $\gamma \sim N(0, \sigma_a^2 K_Q^{-1})$ and $\varepsilon \sim N(0, \sigma_e^2 H^{-1})$. With regard to the latter, a genetic similarity between subjects (e.g., kinship) is often used for grouping subjects and several authors have proposed this form of kernel for testing the (global) genetic association with a trait or phenotype, y; see, e.g., Schifano et al. (2012). In particular, these methods use the LMM framework to motivate and define a “kernel association test”. The variance score statistic for testing the null hypothesis of no association between y and X ($H_0: \beta = 0$) is, using our notation above, $T := \|y\|^2_{H^{1/2} K_Q H^{1/2}}$. The kernel association testing framework has been applied to microbiome data using a single kernel at a time derived from UniFrac (Zhao et al., 2015), but this is a test for whether β ≠ 0 and, unlike our KPR framework, provides no insight about which taxa, as represented by coordinates of β, are associated with y.

Fig 4. Analysis of bacterial vaginosis data from Srinivasan et al. (2012).

(a): representation of the samples in the space of the first two PC’s of the edge-matrix kernel H = EE′. The color corresponds to the pH of the sample; (b): heatmap of edge-matrix kernel used to generate the plot in (a); (c): two-dimensional PCA plot based on the genus-level relative abundances, colored according to pH; (d): heatmap of the genus-abundance kernel K = XX′ used to create the plot in (c). In (b) and (d), subjects are ordered by the pH values.

2.4. Regression with compositional data

Data from 16S rRNA gene sequencing methods are random counts of the molecules in each sample. The number of sequence reads assigned to a taxon contains no information about the actual number of molecules in the sample; the total number of reads observed in two samples can vary by several orders of magnitude. Hence, only relative amounts can be investigated. Common approaches for normalizing these data include converting them to proportions (relative percent) or subsampling the sequences to create equal library sizes for each sample (rarefying). These data are “compositional” in the sense that the microbial abundances represent a proportion of a constant total. It is known, however, that compositional measures can result in spurious correlations among taxa (Pearson, 1896; Aitchison, 2003a; Friedman and Alm, 2012), an effect that can be quite extreme when there are a few dominant taxa.

Compositional data reside on the simplex $\mathbb{S}^{p-1}$ of unit-sum vectors in $\mathbb{R}^p$ and so standard multivariate methods do not apply (Aitchison, 2003b; Egozcue and Pawlowsky-Glahn, 2011; Li, 2015; Lovell et al., 2015). In particular, because these data do not naturally reside in a Euclidean vector space, standard regression models based on Euclidean covariance measures are inappropriate. However, ordinary least-squares and ridge regression estimates are of the form $\hat{\beta} = (X'X + \lambda I)^{-1} X'y$ (with λ = 0 and λ > 0, respectively). Thus, these estimates depend on the empirical covariance structure, $X'X$, among taxa, which may include spurious correlations. Similarly, Li (2015) points out that a naïve application of lasso regression is not expected to perform well due to the compositional nature of the covariates. He addresses this issue by applying a lasso regression model to the log-ratio abundances and imposing an additional constant-sum constraint on the coefficient vector, β.

We next show that the generality of KPR for handling non-Euclidean structures can be used to address the compositional nature of microbiome data. In particular, we propose an approach that uses the centered log-ratio transformation of the compositional vectors and an estimate of covariance among the log taxa counts that is obtained via Aitchison’s variation matrix (Aitchison, 2003b; Egozcue and Pawlowsky-Glahn, 2011).

Let X be the n×p sample-by-taxon matrix whose rows are relative percent (compositional) vectors $\{x_i\}_{i=1}^n \subset \mathbb{S}^{p-1}$. The columns of X will be denoted by $x^k$, corresponding to k = 1, ..., p taxa. Let $g(z) = \left(\prod_{k=1}^p z_k\right)^{1/p}$ be the geometric mean of a row vector, z, and denote the centered log-ratio (CLR) transform of $x_i$ by $\tilde{x}_i = \mathrm{clr}(x_i) := \left[\log\frac{x_{i1}}{g(x_i)}, \dots, \log\frac{x_{ip}}{g(x_i)}\right]$. In what follows we denote the matrix of CLR vectors by $\tilde{X}$, and use the normalized variation matrix T, of X, as defined by Aitchison (1982): $T_{k,\ell} = \mathrm{var}\left(\frac{1}{\sqrt{2}} \log\frac{x^k}{x^\ell}\right)$. T is a symmetric dissimilarity matrix with zeros on the diagonal and entries that have squared Aitchison distance units: the Aitchison norm of a vector $x \in \mathbb{S}^{p-1}$ is defined as $\|x\|_a^2 = \frac{1}{2p} \sum_{k,\ell} \left(\log\frac{x_k}{x_\ell}\right)^2$. In fact, $\|x\|_a^2 = \|\mathrm{clr}(x)\|^2$. One can show that T is related to the covariance matrix, C, of the log of the true unobserved taxa counts via $T = v\mathbf{1}' + \mathbf{1}v' - 2C$ (Li, 2015). Consequently, $C = -\frac{1}{2} JTJ$, and we can use C in place of Q in eq. (2.5) to obtain

$$\beta_C = \arg\min_{\beta} \{\|y - \tilde{X}\beta\|_n^2 + \lambda \|\beta\|_{C^{-1}}^2\}. \tag{2.13}$$
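The construction can be sketched as follows on simulated compositions. Two assumptions are made that are not taken from the text: the variation matrix is normalized by $1/\sqrt{2}$ inside the variance, and, because the double-centered C is singular, the estimate is computed through the dual form of (2.12) with H = I rather than by inverting C. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, lam = 12, 5, 1.0

# Hypothetical compositional data: positive values scaled to unit row sums
counts = rng.gamma(shape=2.0, scale=10.0, size=(n, p)) + 1.0
X = counts / counts.sum(axis=1, keepdims=True)
y = rng.normal(size=n)

logX = np.log(X)
X_clr = logX - logX.mean(axis=1, keepdims=True)   # clr: log(x_ik / g(x_i))

# Normalized variation matrix: T[k, l] = var(log(x^k / x^l) / sqrt(2))
diff = logX[:, :, None] - logX[:, None, :]
T = 0.5 * diff.var(axis=0, ddof=1)

J = np.eye(p) - np.ones((p, p)) / p               # p x p centering matrix
C = -0.5 * J @ T @ J                              # similarity among taxa

# Dual-form estimate (assumption: avoids inverting the singular C)
Xc = X_clr - X_clr.mean(axis=0)                   # column-center the CLR data
beta_C = C @ Xc.T @ np.linalg.solve(Xc @ C @ Xc.T + lam * np.eye(n), y)
```

With this normalization, `C` equals one half of the sample covariance of the CLR-transformed data, so the penalty shrinks less along directions in which the CLR abundances vary most.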

As a comparison, we observe that Li (2015) proposed a constrained regression

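$$E(y_i) = \beta_1 \log x_{i1} + \cdots + \beta_p \log x_{ip} \quad \text{subject to} \quad \sum_{j=1}^p \beta_j = 0, \tag{2.14}$$

augmented with a lasso penalty to obtain an estimate of the form

$$\arg\min_{\beta} \left\{ \frac{1}{2n} \Big\| y - \sum_j \log(x^j)\, \beta_j \Big\|_n^2 + \lambda \sum_j |\beta_j| \right\} \quad \text{subject to} \quad \sum_{j=1}^p \beta_j = 0.$$

The zero-sum constraint on β was emphasized for interpretability advantages over the standard lasso estimate. Temporarily denoting $\beta_p = -\sum_{j=1}^{p-1} \beta_j$, we see that (2.14) is equivalent to

$$E(y_i) = \beta_1 \log\frac{x_{i1}}{x_{ip}} + \beta_2 \log\frac{x_{i2}}{x_{ip}} + \cdots + \beta_{p-1} \log\frac{x_{i,p-1}}{x_{ip}} = \beta_1 \log x_{i1} + \beta_2 \log x_{i2} + \cdots + \beta_{p-1} \log x_{i,p-1} - \Big(\sum_{j=1}^{p-1} \beta_j\Big) \cdot \log x_{ip}.$$

Since $\sum_{j=1}^p \beta_j = 0$, this can be rewritten as

$$E(y_i) = \beta_1 \log x_{i1} + \cdots + \beta_p \log x_{ip} - \Big(\sum_{j=1}^p \beta_j\Big) \log g(x_i) = \beta_1 \log\frac{x_{i1}}{g(x_i)} + \cdots + \beta_p \log\frac{x_{ip}}{g(x_i)} \quad \text{subject to} \quad \sum_{j=1}^p \beta_j = 0.$$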

Therefore, Li’s proposal of regression on log-ratio abundances is equivalent to regression on the CLR-transformed data provided a zero-sum constraint is imposed on β. In contrast, however, our formulation does not explicitly impose a constant-sum constraint. In fact, this constraint is not needed because the CLR transform removes the analysis from the simplex to allow an analysis in Euclidean vector space algebra (Egozcue and Pawlowsky-Glahn, 2011). Our model instead incorporates the appropriate covariance structure for the CLR transformation, C.
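The equivalence of the three parameterizations under a zero-sum coefficient vector can be confirmed numerically. This sketch uses simulated Dirichlet compositions and an arbitrary zero-sum `beta`; it only checks the linear predictors and does not fit any model.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 6, 4
X = rng.dirichlet(np.ones(p), size=n)    # rows lie on the unit simplex

beta = rng.normal(size=p)
beta -= beta.mean()                      # impose the zero-sum constraint

logX = np.log(X)
X_clr = logX - logX.mean(axis=1, keepdims=True)
X_alr = logX[:, :-1] - logX[:, [-1]]     # log-ratios against the last taxon

fit_log = logX @ beta                    # sum_j beta_j log x_ij
fit_clr = X_clr @ beta                   # sum_j beta_j log(x_ij / g(x_i))
fit_alr = X_alr @ beta[:-1]              # first p-1 coefficients only
```

If `beta` did not sum to zero, `fit_log` and `fit_clr` would differ by `beta.sum()` times the log geometric mean of each row.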

As a final observation, a positive-definite $C$ in (2.13), or more generally $Q$ in (2.5), can be decomposed as the sum of the identity and a positive semi-definite singular matrix. The identity term constrains $\sum_{j=1}^{p}\beta_j^2$ to be small while the singular term, overall, encourages extrinsic structure (e.g., smoothness). One may also control the size of $\sum_{j=1}^{p}\beta_j^2$ by adding or subtracting values in the diagonal entries of $Q$. This idea is similar to that of “Grace-ridge” in Zhao and Shojaie (2016) where, in addition to the penalty induced by $Q$, the authors propose to further impose a ridge-type penalty in the objective. We apply the significance testing framework of Zhao and Shojaie (2016) in Section 4.

3. Numerical Experiments

To illustrate the proposed framework, we perform several data-driven simulations using publicly available microbiome data. We consider three scenarios from the literature that exploit extrinsic structure from a phylogenetic tree, including DPCoA, UniFrac and edge PCA. To achieve realistic simulations, we simulate “true” signals of the type implied by each of these methods in order to create benchmarks for performance evaluation. Our emphasis is on formalizing the role that such structure plays in penalized regression when modeling associations between the multivariate data, X, and a response variable, y. Since y is directly simulated from X in these settings, the compositional nature of the data discussed in Section 2.4 does not affect the simulation results. We will return to this topic when analyzing the relative abundance data in Section 4.

The numerical experiments in this section are motivated by the relationship between the PCoA plots and PCR described in Section 2.1 and Figure 1(b). This connection can be generalized to a number of other commonly-used graphical representations in the microbiome literature. For instance, any two-dimensional DPCoA plot involves an implicit coefficient vector, β, of associations between y and X.

Throughout this section, we compare the performance of KPR with ridge regression and lasso. Ridge regression provides a direct extension of ordinary least squares and thus is a natural benchmark for comparing various KPR estimates. Lasso, which gives sparse estimates, is used as a benchmark in settings where the true β is sparsely non-zero. The choice of competing methods is limited by our emphasis on estimating β, rather than predicting the outcome y. Indeed, most kernel methods focus on prediction which renders them inappropriate for comparison.

In all simulation experiments, the tuning parameters for KPR, ridge and lasso are chosen using 10-fold cross-validation. Specifically, to compare the prediction performance of KPR, ridge and lasso, we choose the tuning parameters that minimize squared test error in held-out cross validation samples (CV min). On the other hand, the task of estimation usually requires more smoothing than prediction (Cai and Hall, 2006). Therefore, when examining the estimation performances of KPR, ridge and lasso, we use the largest tuning parameters such that the squared test errors are within one standard error of the minimum squared test error (CV 1se), as suggested in Hastie, Tibshirani and Friedman (2009). For comparison, we also consider the tuning parameters corresponding to the minimum squared test error for ridge and lasso.
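The two selection rules can be sketched as follows. This is a generic illustration of the "CV min" and "CV 1se" rules, assuming a hypothetical array `cv_errors` of held-out squared errors with one row per fold and one column per candidate tuning parameter; it is not taken from the authors' code.

```python
import numpy as np

def select_lambda(lambdas, cv_errors):
    """Return (lambda_min, lambda_1se) from a (n_folds, n_lambdas) error array."""
    lambdas = np.asarray(lambdas, dtype=float)
    cv_errors = np.asarray(cv_errors, dtype=float)
    mean_err = cv_errors.mean(axis=0)
    # Standard error of the mean across folds.
    se = cv_errors.std(axis=0, ddof=1) / np.sqrt(cv_errors.shape[0])
    i_min = int(np.argmin(mean_err))
    # CV 1se: largest lambda whose mean error is within one SE of the minimum.
    within = mean_err <= mean_err[i_min] + se[i_min]
    return lambdas[i_min], lambdas[within].max()
```

The one-standard-error rule trades a small increase in held-out error for a larger penalty, which matches the remark that estimation usually calls for more smoothing than prediction.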

3.1. Regression and DPCoA

In our first example, we compare the estimation and prediction performances of KPR, ridge and lasso using the data depicted in Figure 1. The rows of $X$ represent relative abundances of $p = 149$ taxa from $n = 100$ subjects in a study by Yatsunenko et al. (2012). The outcome $y$ is log-transformed age of each subject. For KPR, we use $K_Q = XQX'$ and $H = I$, where $Q = -\frac{1}{2}J\delta J$ is a matrix of similarities between taxa obtained from the matrix of squared patristic distances, $\delta$. Motivated by DPCoA plots, we assume the underlying “true” response $y_{\mathrm{True}}$ is generated from the first two eigenvectors of $K_Q$. Let $L$ be the Cholesky factor of $Q$, i.e., $Q = LL'$, and let $XL = U^L S^L (V^L)'$. Recall that $A_{(k)}$ denotes the first $k$ columns of matrix $A$, or its first $k$ rows and columns if $A$ is diagonal. Motivated by (2.6), we let

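$$\beta_{\mathrm{True}} = s\left(V^L_{(2)}\,(S^L_{(2)})^{-1}\,(U^L_{(2)})'\,y,\ \tau\right), \tag{3.1}$$

where $s(\cdot, \tau)$ is the hard-thresholding operator, i.e., $s(x, \tau) = x \cdot 1(|x| > \tau)$. The threshold $\tau \ge 0$ is set to achieve various levels of sparsity: $\|\beta_{\mathrm{True}}\|_0 \in \{\lfloor 0.2p \rfloor, \lfloor 0.6p \rfloor, p\}$. After generating $\beta_{\mathrm{True}}$, we simulate

$$y_{\mathrm{True}} = U^L_{(2)}\,S^L_{(2)}\,(V^L_{(2)})'\,\beta_{\mathrm{True}}.$$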
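A minimal sketch of this construction is given below. The function names are hypothetical, `Q` is assumed to be positive definite so that a Cholesky factor exists, and the threshold is chosen as the $(k+1)$-th largest magnitude, which keeps exactly $k$ entries only when there are no ties.

```python
import numpy as np

def hard_threshold(x, tau):
    """s(x, tau) = x * 1(|x| > tau), applied entrywise."""
    return x * (np.abs(x) > tau)

def tau_for_support(x, k):
    """Threshold that keeps the k largest-magnitude entries (assuming no ties)."""
    mags = np.sort(np.abs(x))[::-1]
    return 0.0 if k >= x.size else mags[k]

def simulate_truth(X, Q, y, k):
    """Build beta_True and y_True from the two leading singular triplets of X L."""
    L = np.linalg.cholesky(Q)                         # Q = L L'
    U, s, Vt = np.linalg.svd(X @ L, full_matrices=False)
    U2, s2, V2 = U[:, :2], s[:2], Vt[:2].T
    dense = V2 @ ((U2.T @ y) / s2)                    # V_(2) S_(2)^{-1} U_(2)' y
    beta_true = hard_threshold(dense, tau_for_support(dense, k))
    y_true = U2 @ (s2 * (V2.T @ beta_true))           # U_(2) S_(2) V_(2)' beta_True
    return beta_true, y_true
```

Without thresholding (`k` equal to the number of taxa), `y_true` reduces to the projection of `y` onto the span of the two leading left singular vectors.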

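The simulation is repeated 500 times, each with a different $\varepsilon \sim N_n(0, \sigma_\varepsilon^2 I_n)$ in $y_{\mathrm{obs}} = y_{\mathrm{True}} + \varepsilon$. Further, $\sigma_\varepsilon^2$ is set to achieve $R^2 = \mathrm{var}(y_{\mathrm{True}})/(\mathrm{var}(y_{\mathrm{True}}) + \sigma_\varepsilon^2) \in \{0.1, 0.2, \ldots, 0.9\}$. In each repetition, we estimate $\hat{\beta}_{\mathrm{DPCoA}}$ from $y_{\mathrm{obs}}$ according to Definition 2.1. To make the simulation more realistic, we add error to the matrix $Q$ used to simulate $\beta_{\mathrm{True}}$ and $y_{\mathrm{True}}$; that is, we use $Q_{\mathrm{obs}}$, obtained by adding random Gaussian noise to $Q$, to estimate $\hat{\beta}_{\mathrm{DPCoA}}$. The eigenvalues of $Q_{\mathrm{obs}}$ are adjusted to be equal to the eigenvalues of $Q$. The amount of Gaussian noise added to the entries of $Q_{\mathrm{obs}}$ is empirically determined to achieve $\|Q - Q_{\mathrm{obs}}\|_F / \|Q\|_F \in \{0, 0.25, 0.5\}$. As a comparison, we estimate $\hat{\beta}_{\mathrm{Ridge}}$ and $\hat{\beta}_{\mathrm{Lasso}}$ using only $X$ and $y_{\mathrm{obs}}$, without incorporating $Q$. From the estimated coefficients, we compute $\hat{y}_{\mathrm{DPCoA}} = X L_{\mathrm{obs}} \hat{\beta}_{\mathrm{DPCoA}}$, $\hat{y}_{\mathrm{Ridge}} = X\hat{\beta}_{\mathrm{Ridge}}$ and $\hat{y}_{\mathrm{Lasso}} = X\hat{\beta}_{\mathrm{Lasso}}$. The performance metrics are the prediction sum of squared error (PSSE) from $y_{\mathrm{True}}$ and estimation sum squared error (ESSE) from $\beta_{\mathrm{True}}$.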
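Two steps of this design can be sketched in code: solving the $R^2$ definition for the noise variance, and perturbing $Q$ while restoring its eigenvalues. Both functions are hypothetical illustrations. In particular, the text says only that the noise level is "empirically determined", so the sketch reports the resulting relative Frobenius error and leaves the tuning of `scale` (for example, by trial and error) to the user.

```python
import numpy as np

def noise_variance(y_true, r2):
    """Solve R^2 = var(y) / (var(y) + sigma^2) for sigma^2."""
    return np.var(y_true, ddof=1) * (1.0 - r2) / r2

def perturb_kernel(Q, scale, rng):
    """Add symmetric Gaussian noise to Q, then reset eigenvalues to those of Q."""
    p = Q.shape[0]
    E = rng.normal(scale=scale, size=(p, p))
    noisy = Q + (E + E.T) / 2.0                 # symmetrized noise keeps Q_obs symmetric
    _, vecs = np.linalg.eigh(noisy)             # eigenvectors, ascending eigenvalue order
    vals = np.linalg.eigvalsh(Q)                # eigenvalues of Q, also ascending
    Q_obs = (vecs * vals) @ vecs.T              # V diag(vals) V'
    rel_err = np.linalg.norm(Q - Q_obs, "fro") / np.linalg.norm(Q, "fro")
    return Q_obs, rel_err
```

In this sketch the eigenvalues of `Q` are matched to the perturbed eigenvectors by rank order, which is one plausible reading of "adjusted to be equal"; the source does not specify the matching.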

Figure 2 shows the estimation and prediction performance of KPR, ridge and lasso. KPR significantly outperforms both ridge regression and lasso for both prediction and estimation in all settings. As expected, the estimation performance of ridge and lasso improves when using a larger tuning parameter. On the other hand, neither mis-specification of Q nor sparsity of βTrue seems to substantially impact the relative performance of the three methods. This may be due to the fact that KPR estimates the correct target βTrue, even with mis-specified Q, whereas ridge regression and lasso estimate the wrong target.

Fig 2.

Estimation sum squared error (ESSE: left panels) and prediction sum squared errors (PSSE: right panels) of KPR, ridge regression and lasso, and their 95% confidence bands. Standard errors for ESSE and PSSE are estimated based on 500 simulation runs, and are roughly 0.5%–2% (ESSE) and 2%–5% (PSSE) for KPR. We consider three sparsity settings for $\beta_{\mathrm{True}}$, based on (3.1): $\|\beta_{\mathrm{True}}\|_0 = p$ in top panels; $\|\beta_{\mathrm{True}}\|_0 = \lfloor 0.6p \rfloor$ in center panels, and $\|\beta_{\mathrm{True}}\|_0 = \lfloor 0.2p \rfloor$ in bottom panels. For ridge and lasso, tuning parameters that produce the smallest cross-validated squared test error (CV min), and the largest tuning parameters such that the cross-validated squared test errors are within one standard error of the minimum cross-validated squared test error (CV 1se) are considered. For KPR, we consider $\|Q - Q_{\mathrm{obs}}\|_F / \|Q\|_F = 0$ (no Q error), 0.25 (small Q error) and 0.5 (large Q error).

3.2. Regression and PCoA with respect to a UniFrac kernel

In the case of PCoA with respect to a UniFrac matrix $\Delta_U$ of squared dissimilarities, the graphical displays are based on the eigen-decomposition of $H = -\frac{1}{2}J\Delta_U J$. That is, for $H = U^H (S^H)^2 (U^H)' \approx U^H_{(2)} (S^H_{(2)})^2 (U^H_{(2)})'$, the $n$ samples are represented in two dimensions by the columns of $U^H_{(2)} S^H_{(2)}$; this results in points $\{(\eta_{i1}, \eta_{i2})\}_{i=1}^{n} := \{(\sigma_1 U^H_{i1}, \sigma_2 U^H_{i2})\}_{i=1}^{n}$, as plotted in Figure 1. When the points are colored according to a response variable, $\{y_i\}_{i=1}^{n}$, the implied regression model is

$$y = \gamma_1 \eta_1 + \gamma_2 \eta_2 + \varepsilon = U^H_{(2)} S^H_{(2)} \gamma + \varepsilon. \tag{3.2}$$

However, in contrast to PCR in eq. (2.2), where $US = XV$, it is not obvious how to connect $\gamma$ directly to the $p$ coordinates corresponding to the $p$ columns of $X$. Here, we exploit the joint eigenstructure of kernels $K_I$ and $H$ by proceeding as in (2.11) to obtain the estimate $\hat{\beta}_H = X'\hat{\gamma}$ as in (2.12), with $Q = I$.

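In this example, we use the same data as in Section 3.1. For KPR, we use $K = XX'$ and obtain $H = -\frac{1}{2}J\Delta_U J$ using the UniFrac distance matrix. We simulate $\gamma_{\mathrm{True}}$ and $y_{\mathrm{True}}$ from the first two eigenvectors of $H$, as in (3.2):

$$\gamma_{\mathrm{True}} = \left((U^H_{(2)})'(U^H_{(2)})\right)^{-1} S^H_{(2)} (U^H_{(2)})'\,y, \qquad y_{\mathrm{True}} = U^H_{(2)} S^H_{(2)} \gamma_{\mathrm{True}}. \tag{3.3}$$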

This bivariate ordinary least squares regression is illustrated in Figure 1(b).
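The path from a squared-dissimilarity matrix to the two-dimensional regression in (3.2) can be sketched as below. This is an illustrative ordinary least squares fit on the first two principal coordinates, with hypothetical function names; it accepts any squared-dissimilarity matrix and does not compute UniFrac itself.

```python
import numpy as np

def gower_center(D2):
    """H = -1/2 * J D2 J for an n x n matrix of squared dissimilarities."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ D2 @ J

def pcoa_coords(H, k=2):
    """First k principal coordinates U_(k) S_(k), where H = U S^2 U'."""
    vals, vecs = np.linalg.eigh(H)
    order = np.argsort(vals)[::-1][:k]           # indices of the k largest eigenvalues
    # Negative eigenvalues (non-Euclidean dissimilarities) are clipped to zero.
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

def fit_on_coords(eta, y):
    """Ordinary least squares of y on the principal coordinates."""
    gamma, *_ = np.linalg.lstsq(eta, y, rcond=None)
    return gamma, eta @ gamma
```

When the dissimilarities are squared Euclidean distances, `gower_center` recovers the Gram matrix of the column-centered data, so the resulting coordinates coincide with ordinary principal component scores up to sign.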

The simulation is repeated 500 times, each with a different $\varepsilon \sim N_n(0, \sigma_\varepsilon^2 I_n)$ to produce various values of $R^2 \in \{0.1, 0.2, \ldots, 0.9\}$. We compute $\hat{y}_{\mathrm{KPR}} = K\hat{\gamma}_{\mathrm{KPR}}$, where $\hat{\gamma}_{\mathrm{KPR}}$ is estimated using (2.10). Similar to the last example, we do not assume we always observe the $H$ matrix that is used to generate $\gamma_{\mathrm{True}}$ and $y_{\mathrm{True}}$; rather, we use a noisy version, $H_{\mathrm{obs}}$, of $H$ in KPR with $\|H - H_{\mathrm{obs}}\|_F / \|H\|_F \in \{0, 0.25, 0.5\}$.

Although we estimate $\beta$ here as in (2.7) (with $Q = I$), there is no obvious way to simulate a $\beta_{\mathrm{True}}$ using UniFrac and so we do not compare the methods based on their estimation performances, and only consider prediction. For all three methods, we find the tuning parameters that minimize the cross-validated $H_{\mathrm{obs}}$-weighted squared test error. While the use of $H$ in tuning ridge and lasso penalties deviates from the common practice, it results in improved performances, given the important role of $H$ in this simulation. The $H$ matrix also defines the valid distance in this example. Thus, to evaluate the prediction performances of various methods, we use the $H$-weighted prediction sum of squared error (HPSSE), $\|\hat{y} - y_{\mathrm{True}}\|_H^2$.
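As a small illustration, the weighted error can be computed as a quadratic form. The sketch assumes that the $H$-weighted squared norm of a vector $v$ means $v'Hv$, by analogy with the penalty norm used elsewhere in the paper; this section does not spell out the definition, so treat that reading as an assumption.

```python
import numpy as np

def h_weighted_sse(y_hat, y_true, H):
    """Quadratic form (y_hat - y_true)' H (y_hat - y_true)."""
    r = np.asarray(y_hat, dtype=float) - np.asarray(y_true, dtype=float)
    return float(r @ H @ r)
```

With `H` equal to the identity this reduces to the ordinary prediction sum of squared error.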

Figure 3 shows that KPR consistently outperforms ridge regression and lasso in prediction, even with a reasonable amount of misspecification of H. This may be due to the fact that, with the incorporation of the H matrix, KPR estimates the correct target whereas ridge and lasso do not.

Fig 3.

H-weighted prediction sum of squared error (HPSSE) of KPR, ridge and lasso, with 95% confidence bands. Standard errors for HPSSE are estimated based on 500 simulation runs, and are roughly 1%–4% for KPR. For KPR, we consider $\|H - H_{\mathrm{obs}}\|_F / \|H\|_F = 0$ (no H error), 0.25 (small H error) and 0.5 (large H error).

3.3. Regression and PCoA using an edge-matrix kernel

In this section, simulations are based on data from a study of bacterial vaginosis (BV) by Srinivasan et al. (2012) in which 16S rRNA gene samples were collected using vaginal swabs from n = 220 women with and without BV. Here, the outcome y represents pH measured from vaginal fluid of each subject and we consider the association of y with genus-level taxa. In this example, we use the p = 62 genera that exhibit non-zero sequence counts in at least 20% of the subjects. So here, X represents 220 × 62 abundances in a sample-by-genus matrix, and we use a kernel K = XX′. Additionally, however, we define a second kernel H = EE′ based on the “edge mass difference matrix”, E, originally introduced by Matsen and Evans (2013). If the full phylogenetic tree has q edges, each sample can be represented by a vector indexed by all q edges, the $e$th coordinate of which quantifies the difference between the fraction of sequence reads on either side of the edge; i.e., the fraction of reads observed on the root side of the tree minus the fraction of reads on the non-root side. We refer to Matsen and Evans (2013) for details and a discussion of “edge PCA”, which refers to PCA applied to the n × q matrix E. Note, in particular, that abundances from every taxon level in the tree contribute to a similarity between subjects as opposed to abundances at a single taxon level, which is used in UniFrac or DPCoA.

In summary, X represents p = 62 genus-level abundances while E is based on all q = 1770 edges in the original phylogenetic tree. Figure 4(a) shows a PCA plot of the 220 subjects in which their similarity is defined using the edge kernel H = EE′; the color of each dot represents the subject’s pH. Figure 4(b) is a heatmap of the kernel H used to create Figure 4(a). The columns and rows of H represent similarities between samples based on the edge mass difference matrix, ordered by subjects’ pH measurements. Similarly, Figure 4(c) is a (Euclidean) PCA plot based on similarities defined using the genus-level abundance kernel, K = XX′. Figure 4(d) is a heatmap of the kernel K used to create Figure 4(c), and subjects are again ordered by pH. These figures illustrate how two different measures of similarity (two separate kernels) may be co-informative in the sense that they both provide information about grouping of subjects’ microbiota in relation to their pH. It is thus natural to expect that incorporating information from both H and K within the KPR framework may result in improved estimates of association between y = pH and the microbial abundances.

For the simulation, we define a “true” association between pH and the genus-level taxa in X using the 2-dimensional PCR model in eq. (2.2) and (2.3). Specifically, we use the apparent association between y = pH and genus-level abundances in Figure 4(c) to construct a “true” coefficient vector βTrue as follows. Using the SVD of X = USV′, and proceeding as in (3.3), define

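$$\gamma_{\mathrm{True}} = \left[(U_{(2)}S_{(2)})'(U_{(2)}S_{(2)})\right]^{-1}(U_{(2)}S_{(2)})'\,y, \qquad y_{\mathrm{True}} = U_{(2)}S_{(2)}\gamma_{\mathrm{True}}.$$

We then project $y_{\mathrm{True}}$ onto the space spanned by the first two singular vectors of $X$ to define a true coefficient vector as

$$\beta_{\mathrm{True}} = V_{(2)}\,S_{(2)}^{-1}\,U_{(2)}'\,y_{\mathrm{True}}.$$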
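The construction above can be sketched directly from the SVD. The function name `pcr_truth` is hypothetical, and the least squares step uses a generic solver in place of the explicit normal-equations formula.

```python
import numpy as np

def pcr_truth(X, y, k=2):
    """Return (gamma_True, y_True, beta_True) from the k leading singular triplets."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T
    Z = Uk * sk                                   # U_(k) S_(k), the PCA scores
    gamma, *_ = np.linalg.lstsq(Z, y, rcond=None) # OLS of y on the scores
    y_true = Z @ gamma
    beta_true = Vk @ ((Uk.T @ y_true) / sk)       # V_(k) S_(k)^{-1} U_(k)' y_True
    return gamma, y_true, beta_true
```

Because `y_true` already lies in the span of the leading left singular vectors, `X @ beta_true` reproduces `y_true` exactly, so the simulated response is a noiseless linear function of the taxa.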

We now consider how the contribution of H = EE′ can aid in both the prediction of yTrue and the estimation of βTrue even though, by construction, neither is informed by E.

Taking $H = EE'$ in a KPR model of the form (2.10), we compare the resulting estimate of $\beta$ with ridge and lasso estimates. The simulation is repeated 500 times, each with a different $\varepsilon \sim N_n(0, \sigma_\varepsilon^2 I_n)$ to produce various values of $R^2 \in \{0.1, 0.2, \ldots, 0.9\}$. The performance metrics are the estimation sum squared error (ESSE) and the $H$-weighted prediction sum squared error (HPSSE), as in the previous section. In this numerical example, we do not assume we always observe the true $H$ matrix; rather, we use a noisy version, $H_{\mathrm{obs}}$, of $H$ in KPR with $\|H - H_{\mathrm{obs}}\|_F / \|H\|_F \in \{0, 0.25, 0.5\}$. For all three methods, tuning parameter values are chosen to minimize the sum of squared test error weighted by $H_{\mathrm{obs}}$. As in the simulation for DPCoA, we also allow for using the largest tuning parameters such that the squared test error weighted by $H$ is within one standard error of the minimum squared test error.

Figure 5 shows that KPR significantly outperforms ridge and lasso in both prediction and estimation. Even though H is not used to simulate the true association, using the edge kernel in KPR enhances the performance of both estimation and prediction, as long as H is not severely misspecified. Once again, the performance of ridge and lasso estimates improves when using a larger tuning parameter (CV 1se).

Fig 5.

In silico evaluation of using tree-based edge information in regression models. Estimation sum squared error (ESSE) and H-weighted prediction sum squared error (HPSSE) of KPR, ridge regression and lasso, with the 95% confidence bands. Standard errors for ESSE and HPSSE are estimated based on 500 simulation runs, and are roughly 2%–5% (ESSE) and 3%–5% (HPSSE) for KPR. For KPR, we consider $\|H - H_{\mathrm{obs}}\|_F / \|H\|_F = 0$ (no H error), 0.25 (small H error) and 0.5 (large H error).

4. Application to an observational study

We apply our kernel-penalized regression framework to data from 16S rRNA gene collected in a study of premenopausal women (Hullar et al., 2015). This study investigated aspects of gut microbial communities in stool samples from premenopausal women using 454 pyrosequencing of the 16S rRNA gene. The abundances of 127 species were zero for more than 90% of the subjects and were removed from our analysis. The data set we consider consists of p = 128 species sampled from n = 102 women.

To make the measurements comparable between subjects, the species abundances were scaled by the total number of sequences measured in each sample. This scaling produces compositional data (the relative abundances in each sample sum to 1) which introduces analytical complications. In particular, regression analysis using compositional covariates must somehow account for their unit sum constraint (Kurtz et al., 2015; Li, 2015). For this reason, we apply the CLR transformation to the relative abundance values and use this transformed data as the matrix of predictors in the KPR model. Additionally, using Aitchison’s variation matrix (Aitchison, 1982), $T$, we obtain the covariance matrix, $C$, as described prior to eq. (2.13). As $C$ provides more accurate information on the covariance among the true abundances than does the empirical covariance matrix from relative abundances, $X$, or their CLR transform, $\tilde{X}$, we use $C$ in place of $Q$ in (2.5).

In this example, we examine the effect of using the CLR transformed data and covariance $C$ as in (2.13) and fit penalized regression models with the goal of estimating $\tilde{\beta}_C$ in (2.13) for the purpose of identifying specific species that may be associated with percent fat in the cohort described above. To this end, we apply a recently developed significance testing procedure to three high-dimensional models in order to identify species exhibiting evidence of association with subjects’ adiposity. This significance test for graph-constrained estimation, called Grace (Zhao and Shojaie, 2016), provides a means to assign significance to estimates from penalized regression models that incorporate structure of the type provided by $Q$ in (2.5) (or $C$ in (2.13)). The method asymptotically controls the type-I error rate regardless of the choice of $Q$. The special case with $Q = I$ provides a significance test for ordinary ridge regression. In each application of the Grace test, tuning parameters are selected based on the smallest squared test error using 10-fold cross validation. Following Zhao and Shojaie (2016), the assumed sparsity parameter is set to be $\xi = 0.05$. The tuning parameter for the initial estimator is set to be $\lambda_{\mathrm{init}} = 4\hat{\sigma}_\varepsilon\sqrt{3\log p / n}$, where $\hat{\sigma}_\varepsilon$ is the estimated standard deviation of the random error $\varepsilon$, using the scaled lasso (Sun and Zhang, 2012). To assess significance for the sparse models using lasso, we apply the recently proposed significance test for lasso regressions based on the low-dimensional projection estimator (LDPE) (Zhang and Zhang, 2014; Van de Geer et al., 2014), which provides an asymptotically valid test for lasso-penalized regression estimates.

We report on five regression estimation methods for which the significance of regression coefficients can be evaluated using existing high-dimensional testing methods. Two are obtained using the relative abundances, $X$, with respect to: (i) an ordinary ridge penalty and (ii) a lasso penalty. Three are obtained using the CLR transformed abundances, $\tilde{X}$, with respect to: (iii) an ordinary ridge penalty, (iv) a lasso penalty, and (v) the KPR estimate in (2.13). None of these methods results in any species associated with the outcome of percent fat when controlled for false discovery rate (FDR) at 0.1 using the Benjamini-Yekutieli procedure (Benjamini and Yekutieli, 2001). However, when using a cut-off of p = 0.01, the KPR estimate (2.13) results in ten species. With a cut-off of p = 0.005, KPR results in four species. Ordinary ridge regression using the CLR-transformed vectors finds no associations at a cut-off of p = 0.01, whereas using the relative abundances, ridge finds two species at the p = 0.01 cut-off and none at p = 0.005. Lasso regression with the CLR-transformed vectors identifies one species at the p = 0.01 cut-off and none at the p = 0.005 cut-off. When using the relative abundances, lasso identifies two species as significant at the p = 0.01 cut-off and one at the p = 0.005 cut-off. See Table 1 for the list of identified species.

Table 1.

Species found to be associated with percent fat (in increasing order of p-values) at different significance levels using: KPR with centered log-ratio transformed abundances (CLR); ridge and lasso regression with centered log-ratio transformed abundances; and ridge and lasso regression with untransformed relative abundances (rel%).

| Method | p < 0.01 | p < 0.005 | FDR < 0.1 |
| --- | --- | --- | --- |
| KPR + CLR | Bacteroides, Anaerovorax, Acidaminococcus, Blautia, Dethiosulfatibacter, Asaccharobacter, Turicibacter, Lebetimonas, Streptobacillus, Anoxynatronum | Bacteroides, Turicibacter, Acidaminococcus, Dethiosulfatibacter | (none) |
| Ridge + CLR | (none) | (none) | (none) |
| Ridge + rel% | Catonella, Dethiosulfatibacter | (none) | (none) |
| Lasso + CLR | Roseburia | (none) | (none) |
| Lasso + rel% | Dethiosulfatibacter, Micropruina | Dethiosulfatibacter | (none) |
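For readers unfamiliar with the FDR step behind the last column of Table 1, the Benjamini-Yekutieli step-up rule can be sketched as follows. This is a generic implementation of the published procedure with a hypothetical function name, not the code used for the analysis, and it takes a plain vector of p-values as input.

```python
import numpy as np

def benjamini_yekutieli(pvals, q=0.1):
    """Boolean rejection mask controlling FDR at level q under arbitrary dependence."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    c_m = np.sum(1.0 / np.arange(1, m + 1))        # harmonic correction factor
    order = np.argsort(p)
    thresholds = np.arange(1, m + 1) * q / (m * c_m)
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest rank meeting its threshold
        reject[order[:k + 1]] = True               # step-up: reject all smaller p-values
    return reject
```

The harmonic factor makes the rule more conservative than the Benjamini-Hochberg procedure, which is consistent with unadjusted cut-offs of 0.01 and 0.005 identifying species that the FDR column does not.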

5. Discussion

We have formulated a family of regression models that naturally extends the dimension-reduced graphical explorations common to microbiome studies. In this sense, we have simply re-focused the role of the eigen-structures used in ordination methods toward exploiting this structure in penalized regression models. The large family of models developed here provides a supervised statistical learning counterpart to the unsupervised methods of principal coordinate analysis (PCoA).

A primary motivation for PCoA graphical displays is the ability to incorporate biologically-inclined measures of (dis)similarity. The popular use of UniFrac, for instance, is motivated by the desire to impose phylogeny into the analysis. These dissimilarities have also been used for rigorous statistical testing in the context of Anderson’s nonparametric MANOVA (Anderson, 2006) or the closely-related kernel machine regression score test (Chen et al., 2012; Pan, 2011; Zhao et al., 2015) for global association of a multivariate predictor with an outcome. However, the use of UniFrac and other non-Euclidean distances makes it difficult to identify specific associations between the microbial abundance profiles and a phenotype; indeed, none of these analyses proceed to estimate the individual associations. In addition to ordination displays and global tests for associations, a variety of machine learning approaches have emphasized models that predict a response. In contrast, we focus on estimating the coefficient vector, which is a key aspect of any approach used to draw scientific conclusions based on the association of microbial communities with an outcome or phenotype.

An interesting feature of the proposed kernel-penalized regression framework is its ability to sidestep some of the problems inherent in compositional data analysis. Indeed, as emphasized by Li (2015), regression analysis with compositional covariates must somehow acknowledge their unit-sum constraint and spurious correlations. Our approach, which differs somewhat from that of Li (2015), may also be viewed as a penalized version of the low-dimensional linear model for compositions by Tolosana-Delgado and van den Boogaart (2011), who use the isometric log-ratio (ILR) coordinates. We note that ILR coordinates arise from the SVD of mean-centered CLR-transformed data, $\tilde{X}$ (see Egozcue and Pawlowsky-Glahn (2011)), which is also used in our model. However, to estimate $\beta \in \mathbb{R}^p$, we instead used a regularization framework; our penalty in Section 2.4 arises from Aitchison’s total variation matrix whose singular values are the total variances of ILR components. Moreover, the proposed framework also allows us to use existing inference frameworks for high-dimensional regression, and in particular the Grace test (Zhao and Shojaie, 2016), to assess the significance of estimated regression coefficients.

Supplementary Material

Footnotes

1

We refer here to the generalized singular value decomposition (GSVD) of Van Loan (1976), a simultaneous diagonalization of two matrices. A different SVD generalization (Greenacre, 1984) imposes constraints on left and right singular vectors of a matrix.

References

  1. Aitchison J. The Statistical Analysis of Compositional Data. Journal of the Royal Statistical Society: Series B. 1982;44:139–177. [Google Scholar]
  2. Aitchison J. A concise guide to compositional data analysis. 2nd Compositional Data Analysis Workshop.2003a. [Google Scholar]
  3. Aitchison J. The Statistical Analysis of Compositional Data. The Blackburn Press; Caldwell, NJ: 2003b. [Google Scholar]
  4. Anderson MJ. Distance-Based Tests for Homogeneity of Multivariate Dispersions. Biometrics. 2006;62:245–253. doi: 10.1111/j.1541-0420.2005.00440.x. [DOI] [PubMed] [Google Scholar]
  5. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics. 2001;29:1165–1188. [Google Scholar]
  6. Bühlmann P, Kalisch M, Meier L. High-Dimensional Statistics with a View Toward Applications in Biology. Annual Review of Statistics and Its Application. 2014;1:255–278. [Google Scholar]
  7. Cai TT, Hall P. Prediction in functional linear regression. The Annals of Statistics. 2006;34:2159–2179. [Google Scholar]
  8. Chen J, Bittinger K, Charlson ES, Hoffmann C, Lewis J, Wu GD, Collman RG, Bushman FD, Li H. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinfor-matics. 2012;28:2106–2113. doi: 10.1093/bioinformatics/bts342. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Claesson MJ, Jeffery IB, Conde S, Power SE, O’Connor EM, Cu-sack S, Harris HM, Coakley M, Lakshminarayanan B, O’Sullivan O, Fitzgerald GF, Deane J, O’Connor M, Harnedy N, O’Connor K, O’Mahony D, van Sinderen D, Wallace M, Brennan L, Stanton C, Marchesi JR, Fitzgerald AP, Shanahan F, Hill C, Ross RP, O’Toole PW. Gut microbiota composition correlates with diet and health in the elderly. Nature. 2012;488:178–184. doi: 10.1038/nature11319. [DOI] [PubMed] [Google Scholar]
  10. The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486:207–214. doi: 10.1038/nature11234. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Egozcue JJ, Pawlowsky-Glahn V. Basic Concepts and Procedures. In: Pawlowsky-Glahn, Buccianti A, editors. Compositional Data Analysis: Theory and Applications. Vol. 2. John Wiley & Sons; Chichester, West Sussex, UK: 2011. pp. 12–28. [Google Scholar]
  12. Evans SN, Matsen FA. The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples. Journal of the Royal Statistical Society: Series B. 2012;74:569–592. doi: 10.1111/j.1467-9868.2011.01018.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Franklin JN. Minimum principles for ill-posed problems. SIAM Journal on Mathematical Analysis. 1978;9:638–650. [Google Scholar]
  14. Freytag S, Manitz J, Schlather M, Kneib T, Amos CI, Risch A, Chang-Claude J, Heinrich J, Bickeböller H. A network-based kernel machine test for the identification of risk pathways in genome-wide association studies. Human heredity. 2014;76:64–75. doi: 10.1159/000357567. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Friedman J, Alm EJ. Inferring correlation networks from genomic survey data. PLoS Computational Biology. 2012;8:e1002687. doi: 10.1371/journal.pcbi.1002687. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Fukuyama J, McMurdie PJ, Dethlefsen L, Relman DA, Holmes S. Comparisons of distance methods for combining covariates and abundances in microbiome studies. Pacific Symposium on Biocomputing. 2012;17:213–224. [PMC free article] [PubMed] [Google Scholar]
  17. Golub Gh, van Loan CF. Matrix Computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press; Baltimore, MD: 2012. [Google Scholar]
  18. Goodrich JK, Di Rienzi SC, Poole AC, Koren O, Walters WA, Caporaso JG, Knight R, Ley RE. Conducting a microbiome study. Cell. 2014;158:250–262. doi: 10.1016/j.cell.2014.06.037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Gower JC. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 1966;53:325–338. [Google Scholar]
  20. Greenacre MJ. Theory and Applications of Correspondence Analysis. Academic Press; Cambridge, MA: 1984. [Google Scholar]
  21. Gretton A, Herbrich R, Smola A, Bousquet O, Schölkopf B. Kernel methods for measuring independence. The Journal of Machine Learning Research. 2005;6:2075–2129. [Google Scholar]
  22. Hamady M, Knight R. Microbial community profiling for human micro-biome projects: Tools, techniques, and challenges. Genome Research. 2009;19:1141–1152. doi: 10.1101/gr.085464.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer-Verlag; New York, NY: 2009. [Google Scholar]
  24. Hoerl AE, Kennard RW. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics. 1970;12:55–67. [Google Scholar]
  25. Hullar MA, Lancaster SM, Li F, Tseng E, Beer K, Atkinson C, Wähälä K, Copeland WK, Randolph TW, Newton KM, Lampe JW. Enterolignan-Producing Phenotypes Are Associated with Increased Gut Microbial Diversity and Altered Composition in Premenopausal Women in the United States. Cancer Epidemiology Biomarkers & Prevention. 2015;24:546–554. doi: 10.1158/1055-9965.EPI-14-0262. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Josse J, Holmes S. Measuring multivariate association and beyond. Statistics Surveys. 2016;10:132–167. doi: 10.1214/16-SS116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Kim S, Xing EP. Tree-guided group lasso for multi-task regression with structured sparsity. In: Fürnkranz J, Joachims T, editors. Proceedings of the 27th International Conference on Machine Learning (ICML 2010); 2010. pp. 543–550. [Google Scholar]
  28. Koren O, Knights D, Gonzalez A, Waldron L, Segata N, Knight R, Hut-tenhower C, Ley RE. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Computational Biology. 2013;9:e1002863. doi: 10.1371/journal.pcbi.1002863. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Kuczynski J, Liu Z, Lozupone C, McDonald D, Fierer N, Knight R. Microbial community resemblance methods differ in their ability to detect biologically relevant patterns. Nature Methods. 2010;7:813–819. doi: 10.1038/nmeth.1499. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Kurtz ZD, Mueller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and compositionally robust inference of microbial ecological networks. PLoS Computational Biology. 2015;11:e1004226. doi: 10.1371/journal.pcbi.1004226.
  31. Li H. Microbiome, Metagenomics, and High-Dimensional Compositional Data Analysis. Annual Review of Statistics and Its Application. 2015;2:73–94.
  32. Li C, Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24:1175–1182. doi: 10.1093/bioinformatics/btn081.
  33. Lovell D, Pawlowsky-Glahn V, Egozcue JJ, Marguerat S, Bähler J. Proportionality: a valid alternative to correlation for relative data. PLoS Computational Biology. 2015;11:e1004075. doi: 10.1371/journal.pcbi.1004075.
  34. Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology. 2005;71:8228–8235. doi: 10.1128/AEM.71.12.8228-8235.2005.
  35. Lozupone CA, Hamady M, Kelley ST, Knight R. Quantitative and qualitative β diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology. 2007;73:1576–1585. doi: 10.1128/AEM.01996-06.
  36. Mardia KV, Kent JT, Bibby JM. Multivariate analysis. Academic Press; 1980.
  37. Matsen FA, Evans SN. Edge principal components and squash clustering: using the special structure of phylogenetic placement data for sample comparison. PLoS ONE. 2013;8:e56859. doi: 10.1371/journal.pone.0056859.
  38. Pan W. Relationship between genomic distance-based regression and kernel machine regression for multi-marker association testing. Genetic Epidemiology. 2011;35:211–216. doi: 10.1002/gepi.20567.
  39. Pavoine S, Dufour A-B, Chessel D. From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. Journal of Theoretical Biology. 2004;228:523–537. doi: 10.1016/j.jtbi.2004.02.014.
  40. Pearson K. Mathematical Contributions to the Theory of Evolution—On a Form of Spurious Correlation Which May Arise When Indices Are Used in the Measurement of Organs. Proceedings of the Royal Society of London. 1896;60:489–498.
  41. Pekalska E, Paclik P, Duin RP. A generalized kernel approach to dissimilarity-based classification. The Journal of Machine Learning Research. 2002;2:175–211.
  42. Purdom E. Analysis of a data matrix and a graph: Metagenomic data and the phylogenetic tree. The Annals of Applied Statistics. 2011;5:2326–2358.
  43. Randolph TW, Harezlak J, Feng Z. Structured penalties for functional linear models—partially empirical eigenvectors for regression. Electronic Journal of Statistics. 2012;6:323–353. doi: 10.1214/12-EJS676.
  44. Randolph TW, Zhao S, Copeland W, Hullar M, Shojaie A. Supplement to “Kernel-Penalized Regression for Analysis of Microbiome Data”. Annals of Applied Statistics. 2017. doi: 10.1214/17-AOAS1102.
  45. Robert P, Escoufier Y. A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient. Journal of the Royal Statistical Society: Series C. 1976;25:257–265.
  46. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge University Press; New York: 2003.
  47. Schaid DJ. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Human Heredity. 2010;70:109–131. doi: 10.1159/000312641.
  48. Schifano ED, Epstein MP, Bielak LF, Jhun MA, Kardia SLR, Peyser PA, Lin X. SNP set association analysis for familial data. Genetic Epidemiology. 2012;36:797–810. doi: 10.1002/gepi.21676.
  49. Schölkopf B, Smola AJ. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press; Cambridge, MA: 2002.
  50. Srinivasan S, Hoffman NG, Morgan MT, Matsen FA, Fiedler TL, Hall RW, Ross FJ, McCoy CO, Bumgarner R, Marrazzo JM, et al. Bacterial communities in women with bacterial vaginosis: high resolution phylogenetic analyses reveal relationships of microbiota to clinical criteria. PLoS ONE. 2012;7:e37818. doi: 10.1371/journal.pone.0037818.
  51. Sun T, Zhang CH. Scaled sparse linear regression. Biometrika. 2012;99:879–898.
  52. Székely GJ, Rizzo ML. Brownian distance covariance. The Annals of Applied Statistics. 2009;3:1236–1265. doi: 10.1214/09-AOAS312.
  53. Tanaseichuk O, Borneman J, Jiang T. Phylogeny-based classification of microbial communities. Bioinformatics. 2014;30:449–456. doi: 10.1093/bioinformatics/btt700.
  54. Tibshirani RJ, Taylor J. The solution path of the generalized lasso. The Annals of Statistics. 2011;39:1335–1371.
  55. Tolosana-Delgado V, van den Boogaart KG. Linear Models with Compositions in R. In: Pawlowsky-Glahn V, Buccianti A, editors. Compositional Data Analysis: Theory and Applications. Vol. 26. John Wiley & Sons; Chichester, West Sussex, UK: 2011. pp. 12–28.
  56. Van de Geer S, Bühlmann P, Ritov Y, Dezeure R. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics. 2014;42:1166–1202.
  57. Van Loan CF. Generalizing the Singular Value Decomposition. SIAM Journal on Numerical Analysis. 1976;13:76–83.
  58. Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, Magris M, Hidalgo G, Baldassano RN, Anokhin AP, Heath AC, Warner B, Reeder J, Kuczynski J, Caporaso JG, Lozupone CA, Lauber C, Clemente JC, Knights D, Knight R, Gordon JI. Human gut microbiome viewed across age and geography. Nature. 2012;486:222–227. doi: 10.1038/nature11053.
  59. Zhang CH, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B. 2014;76:217–242.
  60. Zhao S, Shojaie A. A Significance Test for Graph-Constrained Estimation. Biometrics. 2016;72:484–493. doi: 10.1111/biom.12418.
  61. Zhao N, Chen J, Carroll IM, Ringel-Kulka T, Epstein MP, Zhou H, Zhou JJ, Ringel Y, Li H, Wu MC. Testing in Microbiome-Profiling Studies with MiRKAT, the Microbiome Regression-Based Kernel Association Test. American Journal of Human Genetics. 2015;96:797–807. doi: 10.1016/j.ajhg.2015.04.003.
