Author manuscript; available in PMC 2015 Jun 25. Published in final edited form as: Biometrics. 2015 Mar 2;71(2):354–363. doi: 10.1111/biom.12292

Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure

Yanming Li 1, Bin Nan 1,, Ji Zhu 2
PMCID: PMC4479976  NIHMSID: NIHMS680623  PMID: 25732839

Summary

We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It suits many biology studies well in detecting associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functioning groups such as genes, pathways or brain regions. The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large p small n problems, and is flexible in handling various complex group structures such as overlapping or nested or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study.

Keywords: coordinate descent algorithm, eQTL, high-dimensional data, genetic association, oracle inequalities, sparsity

1 Introduction

Genomic association studies with a single phenotype have been widely studied. Such association studies often encounter high-dimensional predictors with sparsity, i.e., only a small number of predictors are associated with the response. To select truly associated predictors, it is necessary to use regularization penalties to shrink the coefficients of irrelevant predictors to exactly zero. Popular penalties for regression models with a univariate response include the lasso (Tibshirani, 1996), the adaptive lasso (Zou, 2006), the elastic net (Zou and Hastie, 2005) and the smoothly clipped absolute deviation (Fan and Li, 2001), among many others.

An important characteristic of high-dimensional genomic predictors is the intrinsic group structures. For example, the DNA markers, also known as single nucleotide polymorphisms (SNPs), can often be grouped into genes, and genes can be grouped into biological pathways. Such grouping strategies have been applied successfully to genomic studies in rare variant detection (Zhou et al., 2010; Biswas and Lin, 2012). For group variable selection, Yuan and Lin (2006) proposed the group lasso method for the univariate response case. It penalizes the L2 norm of each predictor group and selects important groups in an “all-in-all-out” fashion. That is, all the predictors in a group are included or excluded simultaneously. However, in real applications, this is rarely the case. Oftentimes, not all the variables in an important group are important. For example, a gene being associated with a certain complex trait does not mean that all the variants within the gene are causal, and a pathway that regulates certain gene expressions does not necessarily indicate that all its components have regulatory effects. Recent efforts have been made to select both important groups and important within-group signals simultaneously. Huang et al. (2009) and Zhou and Zhu (2010) adopted an L_γ penalty, 0 < γ < 1, to select important groups while removing unimportant variables within them; Zhou et al. (2010) used a penalized logistic regression with a mixed L1/L2 penalty to select both common and rare variants in a genome-wide association study; and Simon et al. (2013) proposed the sparse group lasso for selecting both important groups and within-group predictors. However, all the above methods concern a univariate response.

Many other genomic data analyses focus on investigating the associations between high dimensional response variables and high-dimensional covariates, such as gene-gene associations (Park and Hastie, 2008; Zhang et al., 2010), protein-DNA associations (Zamdborg and Ma, 2009) and brain fMRI-DNA (or gene) associations (Stein et al., 2010). Oftentimes pairwise associations are calculated in such studies. For example, many multivariate genome-wide association studies nowadays still look for one association at a time between a single marker and a single trait, and then correct for multiple hypothesis testing (Dudoit et al., 2003; Stein et al., 2010). However, when both responses and predictors are of high dimensions, most of the family-wise type I error controlling procedures are usually too conservative and yield poor performance (Stein et al., 2010), and oftentimes adjusted analysis considering multiple variables simultaneously is more appropriate.

High-dimensional responses also often have natural group structures, for example, pathway group structures for gene expression responses and brain functional regions for fMRI intensity responses. For multivariate responses, Peng et al. (2010) adopted the mixed L1/L2 penalty in an orthonormal setting for identifying hub covariates in a gene regulation network; Obozinski et al. (2011) and Bunea et al. (2011) studied joint support union and joint rank selections; Lounici et al. (2011) proved oracle inequalities for multitask learning. Despite all these efforts, little attention, to our knowledge, has been paid to cases where the responses also have a group structure, although such cases are commonly encountered in biological studies. A possible strategy for multivariate-response analysis is to perform covariate selection for one response variable at a time. In such an analysis the predictor group structure can be considered, but the response group structure is overlooked.

In this article, we propose a regularization method that makes good use of the intrinsic biological group structures of both covariates and responses to facilitate better variable selection for multivariate-response and multiple-predictor data by effectively removing unimportant blocks of regression coefficients. Both the predictor and response group structures, or in general, the block structure of the regression coefficient matrix, are assumed known. Information on many biologically confirmed group structures can be obtained from publicly available repositories, for example, RefSeq gene files from the NCBI Reference Sequence Database (http://www.ncbi.nlm.nih.gov/refseq/), KEGG pathway maps from the Kyoto Encyclopedia of Genes and Genomes (http://www.genome.jp/kegg/), and the Brodmann brain anatomic region atlas from https://surfer.nmr.mgh.harvard.edu/fswiki/BrodmannAreaMaps. The proposed method can handle cases where the number of variables in either responses or predictors is much greater than the sample size, and complex group structures such as overlapping groups where a variable belongs to multiple groups. The estimators enjoy finite sample oracle bounds for the prediction error, the estimation error, and the estimated sparsity of the regression coefficient matrix. Extensive simulations show that the proposed method outperforms competing regularization methods. We applied the proposed method to a yeast gene expression quantitative trait loci (eQTL) study, where the numbers of gene expression responses and genetic marker predictors are both much larger than the sample size. The gene expressions are grouped into biological pathways and the genetic markers are grouped into genes. We demonstrate that, by considering both group structures, the proposed method generates a much more interpretable and predictive eQTL network between the gene expressions and genetic markers, compared with several other commonly used regularized approaches.

2 Multivariate linear model with arbitrary grouping

We consider the multivariate linear model

Y=XB+W, (1)

where Y = (y1,⋯, yq) ∈ ℝn×q is the response matrix of n samples and q variables, X = (x1,⋯, xp) ∈ ℝn×p is the covariate matrix of n samples and p variables, B = (βjk)p×q ∈ ℝp×q is the coefficient matrix, and W = (w1,⋯, wq) ∈ ℝn×q is the matrix of error terms with each $w_k \sim N(0, \sigma_k^2 I_{n\times n})$, k = 1,⋯, q. Assume Y and X are centered so that there is no intercept in B. We adopt the notational convention that the column vectors of X are indexed by j, the column vectors of Y and W are indexed by k, and the samples are indexed by i.

Assume B contains G groups, and each group, denoted Bg with g ∈ {1,⋯, G}, is a subset of two or more elements of B. We denote the group structure by 𝒢 = {B1,⋯, BG}. We use B or Bg to denote either the set of all their elements or the numerical values of all their elements, depending on the context, which should not cause any confusion. Figure 1 illustrates a few examples of group structures, where each highlighted block indicates an important group in 𝒢 and each figure may represent several different group structures. Note that the group structures considered in this article are pre-defined by biological functions, such as genes or pathways. Also note that the union of all groups in 𝒢 does not need to contain all the elements of B; in other words, some βjk may not belong to any group. We say $B_{g_1}$ is nested in $B_{g_2}$ if $B_{g_1}\subseteq B_{g_2}$; $B_{g_1}$ and $B_{g_2}$ are overlapping if $B_{g_1}\cap B_{g_2}$ is not empty. Obviously, nested groups are a special case of overlapping groups. A group structure with overlapping groups is common in biological studies. For example, when grouping genetic variants according to genes or pathways, different genes or pathways can overlap.
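As a small illustration of these definitions, groups can be represented as sets of (j, k) index pairs into B; the following sketch (ours, not from the paper) checks nesting and overlap directly from those sets.

```python
# A toy sketch (not the authors' code): represent each group B_g as a set of
# (j, k) index pairs into the p x q coefficient matrix B.
def make_group(row_idx, col_idx):
    """All (j, k) pairs formed by the given row and column index ranges."""
    return {(j, k) for j in row_idx for k in col_idx}

B_g1 = make_group(range(0, 20), range(0, 20))    # a 20 x 20 block
B_g2 = make_group(range(0, 40), range(0, 20))    # a larger block containing B_g1

def is_nested(g1, g2):
    return g1 <= g2          # B_g1 is nested in B_g2 if it is a subset

def overlaps(g1, g2):
    return len(g1 & g2) > 0  # overlapping if the intersection is nonempty

print(is_nested(B_g1, B_g2), overlaps(B_g1, B_g2))  # True True
```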

Figure 1.

Figure 1

B* group structures. Important groups are shaded. (a) X group structure, (b) XY group structure, (c) X+XY group structure (nesting group structure) and (d) overlapping group structure.

Though the proposed method works for an arbitrary group structure 𝒢 on B, in real applications, a biologically meaningful group structure on B is usually introduced from the group structures of both predictors and responses. Specifically, suppose X has m1 column groups and Y has m2 column groups, then they yield m1 ×m2 intersection block groups on B. We denote this intersection block group structure by 𝒢XY, the row block group structure only determined by the predictor groups by 𝒢X, and the nested group structure containing all groups in 𝒢XY and 𝒢X by 𝒢XY ∪ 𝒢X. In the eQTL association study, a nonzero group in 𝒢XY indicates that the corresponding gene group has SNPs associated with expressions in the corresponding pathway group. A nonzero group in 𝒢X indicates that the corresponding gene group has an effect on some or all of the expressions.
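A hypothetical sketch of how the block structures 𝒢X, 𝒢XY and 𝒢XY ∪ 𝒢X could be assembled from given predictor and response column groupings; the group sizes below are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical predictor and response column groupings (10 groups of 20 each).
p, q = 200, 200
x_groups = [np.arange(i, i + 20) for i in range(0, p, 20)]   # m1 = 10 predictor groups
y_groups = [np.arange(k, k + 20) for k in range(0, q, 20)]   # m2 = 10 response groups

# G_X: row-block groups of B, one per predictor group, spanning all q responses
G_X = [{(int(j), k) for j in xg for k in range(q)} for xg in x_groups]

# G_XY: the m1 x m2 intersection block groups of B
G_XY = [{(int(j), int(k)) for j in xg for k in yg} for xg in x_groups for yg in y_groups]

# The nested structure G_XY ∪ G_X simply pools both collections of groups.
G_X_union_XY = G_X + G_XY
print(len(G_X), len(G_XY), len(G_X_union_XY))   # 10 100 110
```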

For an arbitrary group structure 𝒢 with G groups, let $\sum_{g=1}^{G}\|B_g\|_2$ be the total sum of $L_2$ norms of all groups in 𝒢, where $\|B_g\|_2^2=\sum_{\beta_{jk}\in B_g}\beta_{jk}^2$. The group $L_2$ norm reduces to the Frobenius norm $\|A\|_2=\{\mathrm{tr}(A^TA)\}^{1/2}$ for a matrix group A and to the vector $L_2$ norm $\|a\|_2=(a^Ta)^{1/2}$ for a vector group a. Proofs of the theoretical results in the following sections are provided in the web-based Supplementary Materials.

3 The regularization method and its properties

3.1 The multivariate sparse group lasso

For an arbitrary group structure 𝒢 on B, to simplify the notation, we denote {g: Bg ∈ 𝒢} by {g ∈ 𝒢} as long as it does not cause any confusion. For j = 1,…, p and k = 1, …, q, let λjk ≥ 0 be the adaptive lasso tuning parameter for βjk, with λjk = 0 if βjk is not penalized. Let λg ≥ 0 be the adaptive tuning parameter for group Bg ∈ 𝒢, with λg = 0 if group Bg is not penalized. We consider the following penalized optimization problem for a general regularized multivariate multiple linear regression:

$$\arg\min_{B}\ \frac{1}{2n}\|Y-XB\|_2^2+\sum_{1\le j\le p,\,1\le k\le q}\lambda_{jk}|\beta_{jk}|+\sum_{g\in\mathcal{G}}\lambda_g\|B_g\|_2, \qquad (2)$$

where the L2 penalty term aims to shrink unimportant groups to zero and the L1 penalty term aims to shrink unimportant entries within an important group to zero. We call it the multivariate sparse group lasso (MSGLasso). We exclude the trivial case that λg = 0 for all g ∈ 𝒢 and λjk = 0 for all j, k. To better understand the solution to (2), we develop the following theorem for βjk when all other elements in B are fixed.
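Before stating the theorem, the following minimal sketch (our own illustration, not the authors' implementation) evaluates the MSGLasso objective in (2) for a given B, assuming each group is stored as a list of (j, k) index pairs, lam_l1 plays the role of a common λjk, and lam_grp collects the group tuning parameters λg:

```python
import numpy as np

def msglasso_objective(Y, X, B, groups, lam_l1, lam_grp):
    """Evaluate the penalized criterion in (2).

    groups  : list of groups, each a list of (j, k) index pairs
    lam_l1  : common lasso tuning parameter (plays the role of lambda_jk)
    lam_grp : sequence of group tuning parameters lambda_g
    """
    n = Y.shape[0]
    loss = np.sum((Y - X @ B) ** 2) / (2.0 * n)      # (1/2n) ||Y - XB||_2^2
    l1 = lam_l1 * np.sum(np.abs(B))                  # entrywise lasso penalty
    grp = 0.0
    for g, idx in enumerate(groups):                 # group L2 penalties
        rows, cols = zip(*idx)
        grp += lam_grp[g] * np.sqrt(np.sum(B[list(rows), list(cols)] ** 2))
    return loss + l1 + grp
```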

Theorem 3.1

For an arbitrary group structure 𝒢 on B, let B̂ be the solution to (2) and β̂jk be its jk-th element. If for some group Bg0 ∈ 𝒢 with tuning parameter λg0,

$$\sum_{\{jk:\ \beta_{jk}\in B_{g_0}\}}\left\{\left(|S_{jk}|/n-\lambda_{jk}\right)_+\right\}^2\ \le\ \lambda_{g_0}^2, \qquad (3)$$

then β̂jk = 0 for every βjk ∈ Bg0. Otherwise, β̂jk satisfies

$$\hat\beta_{jk}=\frac{\operatorname{sgn}(S_{jk})\left(|S_{jk}|-n\lambda_{jk}\right)_+}{\|x_j\|_2^2+n\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g\}}\lambda_g/\|\hat B_g\|_2}, \qquad (4)$$

where $S_{jk}=x_j^T(Y-X\hat B_{(-j)})_{\cdot k}$ with $\hat B_{(-j)}$ being $\hat B$ with its j-th row replaced by zeros, the subscript $\cdot k$ refers to the k-th column of a matrix, and $(a)_+=a$ if $a>0$ and 0 otherwise.
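The group-zero criterion (3) and the coordinate-wise expression (4) can be written down directly; the helpers below are our own sketch with hypothetical argument names (S, lam_jk, group_terms), not the released package:

```python
import numpy as np

def soft_plus(a):
    """(a)_+ = a if a > 0 and 0 otherwise."""
    return a if a > 0 else 0.0

def group_is_zero(S, n, lam_jk, lam_g0, group_idx):
    """Criterion (3): the whole group B_g0 is estimated at zero when the sum of
    {(|S_jk|/n - lam_jk)_+}^2 over its entries is at most lam_g0^2."""
    total = sum(soft_plus(abs(S[j, k]) / n - lam_jk[j, k]) ** 2 for (j, k) in group_idx)
    return total <= lam_g0 ** 2

def coordinate_value(S_jk, n, xj_norm_sq, lam_jk, group_terms):
    """One evaluation of (4) for beta_jk with all other entries held fixed.
    group_terms lists (lambda_g, ||B_g||_2) for groups containing beta_jk;
    groups with zero norm are handled by the criterion (3) instead."""
    num = np.sign(S_jk) * soft_plus(abs(S_jk) - n * lam_jk)
    den = xj_norm_sq + n * sum(lam_g / norm_g for lam_g, norm_g in group_terms if norm_g > 0)
    return num / den
```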

Note that Theorem 3.1 gives a general solution form and applies to arbitrary group structures. If there is no group structure assigned on B, then 𝒢 becomes an empty set and (4) reduces to the lasso solution; if λjk = 0 for all j, k, then (3) and (4) provide the group lasso solution. It is of interest to consider certain special group structures that are intuitive and commonly used in many applications. Specifically, we consider model (2) with the following four group structures: (I) 𝒢 = ∅, no group structure assigned on B; (II) 𝒢X; (III) 𝒢XY; (IV) 𝒢XY ∪ 𝒢X. The corresponding optimization problems become

$$\arg\min_{B}\ \frac{1}{2n}\|Y-XB\|_2^2+\lambda|B|_1, \qquad (5)$$
$$\arg\min_{B}\ \frac{1}{2n}\|Y-XB\|_2^2+\lambda|B|_1+\lambda_1\sum_{g_1\in\mathcal{G}_X}\omega_{g_1}^{1/2}\|B_{g_1}\|_2, \qquad (6)$$
$$\arg\min_{B}\ \frac{1}{2n}\|Y-XB\|_2^2+\lambda|B|_1+\lambda_2\sum_{g_2\in\mathcal{G}_{XY}}\omega_{g_2}^{1/2}\|B_{g_2}\|_2, \qquad (7)$$
$$\arg\min_{B}\ \frac{1}{2n}\|Y-XB\|_2^2+\lambda|B|_1+\lambda_1\sum_{g_1\in\mathcal{G}_X}\omega_{g_1}^{1/2}\|B_{g_1}\|_2+\lambda_2\sum_{g_2\in\mathcal{G}_{XY}}\omega_{g_2}^{1/2}\|B_{g_2}\|_2, \qquad (8)$$

where $|B|_1=\sum_{jk}|\beta_{jk}|$ is the $L_1$ norm of B, and $\omega_{g_1}$ and $\omega_{g_2}$ are some weights, in particular, the group sizes. The tuning parameter $\lambda_{jk}=\lambda$ for all lasso penalties, $\lambda_g=\lambda_1\omega_{g_1}^{1/2}$ if $g\in\mathcal{G}_X$, and $\lambda_g=\lambda_2\omega_{g_2}^{1/2}$ if $g\in\mathcal{G}_{XY}$.

In the remainder of this article, we call (5) the Lasso model, (6) the Lasso+X model, (7) the Lasso+XY model, and (8) the Lasso+X+XY model.

Let $\hat B^{L}$, $\hat B^{LX}$, $\hat B^{LXY}$ and $\hat B^{LXXY}$ denote the solutions to (5), (6), (7) and (8), respectively. Their corresponding expressions from Theorem 3.1 further reduce to some interesting simpler forms under the orthonormal design; in particular, $\hat B^{LX}$ and $\hat B^{LXY}$ are just further shrinkages of $\hat B^{L}$, and $\hat B^{LXXY}$ is a further shrinkage of either $\hat B^{LX}$ or $\hat B^{LXY}$. We are also interested in the group lasso cases where λ = 0 in (6), (7) and (8), with their solutions denoted by $\hat B^{GX}$, $\hat B^{GXY}$ and $\hat B^{GXXY}$, respectively. Then the main theorems in Yuan and Lin (2006) and Peng et al. (2010) become special cases.

In the eQTL example that we will analyze later, method (5) does not take advantage of knowing the group structure. Method (6) only concerns the predictor group structure and therefore can select important gene groups. However, it ignores which pathways those genes are associated with. Method (7) considers both predictor and response group structures and therefore can select gene-to-pathway association blocks. Method (8) retains the advantages of both (6) and (7) and is more robust to misspecified group structures.

3.2 Oracle inequalities

The lasso method has been shown to achieve the oracle bounds for both prediction and estimation in the multiple linear regression model, which are the error bounds one would obtain if the true model were given; see, for example, Bickel et al. (2009). Similar bounds also hold for the total of pq regression coefficients in the multivariate multiple linear regression model with a multivariate mixed L1/L2 penalty. For notational simplicity, we consider the following special case of (2) with λjk = λ for all j, k:

$$\arg\min_{B}\ \frac{1}{2n}\|Y-XB\|_2^2+\lambda|B|_1+\sum_{g\in\mathcal{G}}\lambda_g\|B_g\|_2. \qquad (9)$$

We follow the method of Bickel et al. (2009). Let $J_1(B)=\{jk:|\beta_{jk}|\ne 0\}$ be the index set of nonzero elements in B, and $J_2(B)=\{g\in\mathcal{G}:\|B_g\|_2\ne 0\}$ be the index set of nonzero groups in 𝒢. Define $M_1(B)=\sum_{jk}I(\beta_{jk}\ne 0)=|J_1(B)|$ and $M_2(B)=\sum_{g\in\mathcal{G}}I(\|B_g\|_2\ne 0)=|J_2(B)|$. For any matrix $\Delta\in\mathbb{R}^{p\times q}$ and any given index set $J_1\subseteq\{jk:1\le j\le p,\,1\le k\le q\}$, denote by $\Delta_{J_1}$ the projection of Δ onto the index set $J_1$, that is, the matrix with the same elements as Δ on the coordinates in $J_1$ and zeros on the complementary coordinates $J_1^c$. Also, for any group index set $J_2\subseteq\{1,\cdots,|\mathcal{G}|\}$, denote by $\Delta_{J_2}$ the set of projections of Δ onto each of $\{B_g:g\in J_2\}$, that is, $\Delta_{J_2}=\{\Delta_{B_g}:g\in J_2\}$. Denote $M_1(B^*)=r$ and $M_2(B^*)=s$. We then impose a restricted eigenvalue assumption for the multivariate linear regression model with a multivariate mixed $L_1/L_2$ penalty, which leads to the desirable oracle inequalities.

Assumption 3.2

Let J1 ⊆ {jk : 1 ≤ jp, 1 ≤ kq} and J2 ⊆ {1,⋯,|𝒢|} be any index sets that satisfy |J1| ≤ r and |J2| ≤ s. Let ρ̃ = {ρg : g ∈ 𝒢} be a set of positive numbers. Then for any nontrivial matrix Δ ∈ ℝp×q that satisfies

$$|\Delta_{J_1^c}|_1+2\sum_{g\in J_2^c}\rho_g\|\Delta_{B_g}\|_2\ \le\ 3|\Delta_{J_1}|_1+2\sum_{g\in J_2}\rho_g\|\Delta_{B_g}\|_2,$$

the following minimums exist and are positive:

$$\kappa_1(r,s,\tilde\rho)=\min_{J_1,J_2,\Delta\ne 0}\frac{\|X\Delta\|_2}{n^{1/2}\|\Delta_{J_1}\|_2}>0,\qquad \kappa_2(r,s,\tilde\rho)=\min_{J_1,J_2,\Delta\ne 0}\frac{\|X\Delta\|_2}{n^{1/2}\|\Delta_{J_2}\|_2}>0.$$

Theorem 3.3

Consider model (9). Let B* be the true coefficient matrix. Assume each column of the error matrix, wk, follows a multivariate normal distribution $N(0,\sigma_k^2 I_n)$, and all the diagonal elements of the matrix XTX/n are equal to 1. Suppose M1(B*) = r and M2(B*) = s. Let ψmax be the largest eigenvalue of XTX/n, σ = max{σ1,⋯, σq}, $\lambda_g=\rho_g\lambda$ for $g\in\mathcal{G}$, $\rho=\min\{1,\rho_g;\ g\in\mathcal{G}\}$, c be the maximum number of duplicates of a coefficient in overlapping groups in 𝒢, and

$$\lambda=2\sigma A\{\log(pq)/n\}^{1/2}$$

for some constant $A>2^{1/2}$. Furthermore, assume Assumption 3.2 holds with $\kappa_1=\kappa_1(r,s,\tilde\rho)$ and $\kappa_2=\kappa_2(r,s,\tilde\rho)$. Then with probability at least $1-(pq)^{1-A^2/2}$, we have the following oracle bounds for the prediction error, the estimation error and the order of sparsity:

$$\frac{1}{n}\|X(\hat B-B^*)\|_2^2\ \le\ 16\lambda^2\left(\frac{r^{1/2}}{\kappa_1}+\frac{\left(\sum_{g\in J_2(B^*)}\rho_g^2\right)^{1/2}}{\kappa_2}\right)^2,$$
$$|\hat B-B^*|_1\ \le\ \frac{32(c+2)\sigma A}{1+\rho}\left(\frac{\log(pq)}{n}\right)^{1/2}\left(\frac{r^{1/2}}{\kappa_1}+\frac{\left(\sum_{g\in J_2(B^*)}\rho_g^2\right)^{1/2}}{\kappa_2}\right)^2,$$
$$M_1(\hat B)\ \le\ 64\psi_{\max}\left(\frac{r^{1/2}}{\kappa_1}+\frac{\left(\sum_{g\in J_2(B^*)}\rho_g^2\right)^{1/2}}{\kappa_2}\right)^2.$$

The mean squared prediction error is bounded by a factor of order $\lambda^2\sim\log(pq)/n$, the $L_1$ norm of the estimation error is bounded by a factor of order $\{\log(pq)/n\}^{1/2}$, and the estimated order of sparsity is bounded by a constant related to Assumption 3.2. These results are similar to those in Bickel et al. (2009). Note that Theorem 3.3 still holds for flexible $\lambda_{jk}$ in (2), as long as $\lambda_{jk}>0$ for all j, k.

4 The mixed coordinate descent algorithm

Based on Theorem 3.1, the zero groups can be determined according to (3), and the entries in a nonzero group can be determined by solving for the fixed-point solution of (4) using a coordinate descent algorithm. The coordinate descent algorithm updates each coefficient coordinate βjk at each step while fixing all the other coefficients at their current values. Theoretically, the coordinate descent algorithm would work if one could solve (4) for β̂jk exactly. Practically, since β̂jk also appears in the term $\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g,\ \|\hat B_g\|_2>0\}}\lambda_g/\|\hat B_g\|_2$ on the right-hand side of (4), unlike the lasso, a closed-form solution is usually not available, and numerically solving for β̂jk requires iteratively updating (4), which can be time consuming. Here we propose a mixed coordinate descent algorithm, which only updates β̂jk once from $\hat\beta_{jk}^{(m-1)}$ to $\hat\beta_{jk}^{(m)}$ according to (4) without iteratively solving (4). In particular, the algorithm updates β̂jk as follows.

  1. If any of the groups Bg ∈ 𝒢 containing βjk satisfies (3), then the entire group is estimated at zero. Otherwise β̂jk will be updated according to one of cases 2–4 below.

  2. If all the groups containing βjk satisfy $\|\hat B_{g-(jk)}^{(m-1)}\|_2=0$ at the current step, where $\hat B_{g-(jk)}^{(m-1)}$ is $\hat B_g^{(m-1)}$ with its jk-th element replaced by zero, then β̂jk is updated by
    $$\hat\beta_{jk}^{(m)}=\frac{\operatorname{sgn}\!\left(S_{jk}^{(m-1)}\right)\left(\left|S_{jk}^{(m-1)}\right|-n\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g,\ \|\hat B_{g-(jk)}^{(m-1)}\|_2=0\}}\lambda_g-n\lambda_{jk}\right)_+}{\|x_j\|_2^2}.$$
    Notice that in this case, (4) becomes a closed-form lasso solution.
  3. If all the groups containing βjk satisfy $\|\hat B_{g-(jk)}^{(m-1)}\|_2>0$ at the current step and λjk = 0, then $\hat\beta_{jk}^{(m-1)}$ is updated by the group lasso formulation
    $$\hat\beta_{jk}^{(m)}=\frac{S_{jk}^{(m-1)}}{\|x_j\|_2^2+n\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g,\ \|\hat B_{g-(jk)}^{(m-1)}\|_2>0\}}\lambda_g/\|\hat B_g^{(m-1)}\|_2}.$$
    Notice that in this case, all the entries in Bg with $\|\hat B_{g-(jk)}\|_2>0$ will enter as nonzero entries; in other words, the whole group Bg will be selected as an important group.
  4. If some but not all of the groups containing βjk satisfy $\|\hat B_{g-(jk)}^{(m-1)}\|_2=0$ at the current step, then $\hat\beta_{jk}^{(m-1)}$ belongs to a mixture of the lasso case (for groups with $\|\hat B_{g-(jk)}^{(m-1)}\|_2=0$) and the group lasso case (for groups with $\|\hat B_{g-(jk)}^{(m-1)}\|_2>0$), and it is updated by a mixture of the lasso and the group lasso through
    $$\hat\beta_{jk}^{(m)}=\frac{\operatorname{sgn}\!\left(S_{jk}^{(m-1)}\right)\left(\left|S_{jk}^{(m-1)}\right|-n\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g,\ \|\hat B_{g-(jk)}^{(m-1)}\|_2=0\}}\lambda_g-n\lambda_{jk}\right)_+}{\|x_j\|_2^2+n\sum_{\{g\in\mathcal{G}:\ \beta_{jk}\in B_g,\ \|\hat B_{g-(jk)}^{(m-1)}\|_2>0\}}\lambda_g/\|\hat B_g^{(m-1)}\|_2}.$$

Specifically, for a fixed set of values of all the tuning parameters, the algorithm proceeds as follows (a code sketch is given after the steps).

  • Step 1. Standardize the data such that
    $$\sum_{i=1}^n y_{ik}=0,\qquad \sum_{i=1}^n x_{ij}=0,\qquad \sum_{i=1}^n x_{ij}^2=1,\qquad \text{for all } j\in\{1,\cdots,p\},\ k\in\{1,\cdots,q\}.$$
    In our numerical examples, we also standardize yk such that $\sum_{i=1}^n y_{ik}^2=1$ for all k ∈ {1,⋯, q} to minimize the impact of different scales of variation across the yk on the regression coefficients.
  • Step 2. Set initial values for all β̂jk and the iteration index m = 1. We use initial values $\hat\beta_{jk}^{(0)}=0$ in our numerical examples.

  • Step 3. For a given pair (j, k), fix βj′k′ at $\hat\beta_{j'k'}^{(m-1)}$ for all j′ ≠ j or k′ ≠ k. Then update $\hat\beta_{jk}^{(m-1)}$ to $\hat\beta_{jk}^{(m)}$ by cases 1–4 accordingly.

  • Step 4. Repeat Step 3 for all j ∈ {1,⋯, p} and k ∈ {1,⋯, q}, and iterate until $\|\hat B^{(m)}-\hat B^{(m-1)}\|$ reaches a prespecified precision level for some norm ‖·‖. We use the infinity norm in our numerical examples.
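The following is a simplified sketch of one full sweep of the mixed coordinate descent update (our own illustration, not the released MSGLasso package). It assumes the standardization of Step 1 so that ‖xj‖2² = 1, represents groups as lists of (j, k) index pairs, uses a common λjk = lam_l1, and omits the explicit group-screening step (3), relying on the coordinate-wise shrinkage alone:

```python
import numpy as np

def mixed_cd_sweep(X, Y, B, groups, lam_grp, lam_l1):
    """One cyclic sweep of the mixed coordinate descent update over all (j, k)."""
    n, p = X.shape
    q = Y.shape[1]
    member = {}                                # (j, k) -> groups containing that entry
    for g, idx in enumerate(groups):
        for jk in idx:
            member.setdefault(jk, []).append(g)
    R = Y - X @ B                              # running residual matrix
    for j in range(p):
        for k in range(q):
            # S_jk = x_j^T (Y - X B_(-j))_{.k}; add back row j's own contribution
            S = X[:, j] @ R[:, k] + B[j, k]    # uses ||x_j||_2^2 = 1 from Step 1
            thresh = n * lam_l1                # lasso part of the soft threshold
            denom = 1.0                        # ||x_j||_2^2
            for g in member.get((j, k), []):
                rows, cols = zip(*groups[g])
                sq = float(np.sum(B[list(rows), list(cols)] ** 2))
                sq_minus = max(sq - B[j, k] ** 2, 0.0)   # ||B_{g-(jk)}||_2^2
                if sq_minus == 0.0:
                    thresh += n * lam_grp[g]   # group inactive apart from beta_jk: lasso-type shrinkage
                else:
                    denom += n * lam_grp[g] / np.sqrt(sq_minus + B[j, k] ** 2)  # group-lasso-type shrinkage
            new = np.sign(S) * max(abs(S) - thresh, 0.0) / denom
            if new != B[j, k]:
                R[:, k] -= X[:, j] * (new - B[j, k])     # keep the residual in sync
                B[j, k] = new
    return B
```

Repeating such sweeps until the change in B falls below a tolerance mirrors Steps 3 and 4 above.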

Convergence of different types of coordinate descent algorithms has been studied in the literature. Tseng (2001) provided conditions for the convergence of the cyclic coordinate descent algorithm with general separable objective functions. Wu and Lange (2008) proved the convergence of the greedy coordinate descent algorithm with an L2 loss and the lasso penalty. Building on Wu and Lange (2008), we establish the convergence of our mixed coordinate descent algorithm in the following proposition. Details are provided in the supplemental materials, where we also illustrate that the convergence of our mixed coordinate descent algorithm is much faster than that of the coordinate descent algorithm that solves the fixed-point solution to (4) with inner iterations.

Proposition 4.1

A sequence of coordinate estimates iteratively updated by the mixed coordinate descent algorithm converges to a global minimizer of the objective function.

We implemented the MSGLasso and the mixed coordinate descent algorithm in C/C++ and wrapped them into an R package. The package is available in the web-based Supplementary Materials and will soon be uploaded to the CRAN repository.

5 Numerical studies

5.1 Simulations

In this section, we first investigate the numerical performances of Lasso, Lasso+X, Lasso+XY, Lasso+X+XY methods and their group lasso counterparts when the true coefficient matrix B* takes a group structure of either 𝒢X, 𝒢XY or 𝒢XY ∪ 𝒢X. We also compare the proposed MSGLasso method with lasso and group lasso for an overlapping group structure.

All the true group structures considered in our simulations are given in Fig.1(a)–1(d). For each group structure, we consider two scenarios: (i) “all-in-all-out”, where all the coefficients in an important group are important, and (ii) “not-all-in-all-out”, where only a subset of the coefficients in an important group are important. Specifically, we generate B* by setting $\beta_{jk}^*=0$ if it is from an unimportant group, and drawing its value from a uniform distribution on [−5,−1] ∪ [1, 5] and fixing it for the simulations if it is from an important group. The sparsity of an important group in the “not all in all out” setting is randomly set between 1/6 and 1/4.

Each B* is of dimension 200 × 200. For a nonoverlapping group structure, each X row group is of dimension 20 × 200; each XY block group is of dimension 20 × 20. For the overlapping group structure, the groups start on coordinates (1, 21, 41, 61, 101, 121, 141, 181) and end on coordinates (20, 40, 70, 100, 120, 150, 180, 200), for both X and Y variables.

Covariates $X_{i\cdot}^T$, i = 1,⋯, n, are generated from a multivariate normal distribution Np(0, ΣX), where ΣX = diag(Σg1,⋯, Σg10) is block diagonal and each block corresponds to a group of X and has a first-order autoregressive structure. Specifically, $\Sigma_{g_i}(j,k)=\rho^{|j-k|}$ for any j, k pair from the same group, i = 1,⋯, 10. The error terms wik are generated from a normal distribution N(0, σ2), where σ2 is chosen to yield a signal-to-noise ratio of 2. Finally, the responses are generated from Y = XB* + W.
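A sketch of this simulation design as we read it (our reconstruction, not the authors' code); the particular important block and its within-block sparsity below are illustrative stand-ins for the structures in Fig.1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, rho, gsize = 150, 200, 200, 0.5, 20

# Block-diagonal covariance for X: AR(1) within each predictor group
ar1 = rho ** np.abs(np.subtract.outer(np.arange(gsize), np.arange(gsize)))
Sigma_X = np.kron(np.eye(p // gsize), ar1)
X = rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)

# B*: one important 20 x 20 block, roughly 1/5 of its entries nonzero,
# drawn from Uniform on [-5, -1] U [1, 5]
B_star = np.zeros((p, q))
mask = rng.random((gsize, gsize)) < 0.2
signs = rng.choice([-1.0, 1.0], size=(gsize, gsize))
B_star[:gsize, :gsize] = mask * signs * rng.uniform(1.0, 5.0, size=(gsize, gsize))

# Noise variance chosen so that the signal-to-noise ratio is about 2
signal = X @ B_star
sigma2 = signal.var() / 2.0
W = rng.normal(scale=np.sqrt(sigma2), size=(n, q))
Y = signal + W
```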

The optimal values of the tuning parameters may be selected by different criteria. Since the degrees of freedom are difficult to determine for a penalty with multiple tuning parameters, we search for the optimal tuning parameter values using 5-fold cross-validation over a wide range of candidate values. The search starts with the largest candidate tuning parameter values, each of which by itself shrinks all the coefficients to zero. The converged estimates obtained from the previous search step are used as the initial values for B in the next search step with a new set of tuning parameter values. We find this very effective in reducing the computational cost.
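A sketch of the warm-started search over the tuning grid; fit_and_cv is an assumed helper (not from the paper) that returns the 5-fold cross-validated prediction error and the fitted coefficient matrix for fixed tuning values, starting from a given initial B:

```python
import numpy as np

def warm_start_search(X, Y, lam_grid, fit_and_cv):
    """lam_grid is ordered from the largest to the smallest penalties.
    fit_and_cv(X, Y, lam_l1, lam_grp, B_init) is an assumed helper returning
    (cv_error, B_hat) for fixed tuning values, started from B_init."""
    p, q = X.shape[1], Y.shape[1]
    B_init = np.zeros((p, q))          # the largest penalties shrink everything to zero
    best_err, best_lam = np.inf, None
    for lam_l1, lam_grp in lam_grid:
        err, B_hat = fit_and_cv(X, Y, lam_l1, lam_grp, B_init)
        B_init = B_hat                 # warm start for the next, smaller penalties
        if err < best_err:
            best_err, best_lam = err, (lam_l1, lam_grp)
    return best_lam, best_err
```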

For each simulation setup, we run a hundred replications and calculate the averages of the following quantities:

$$\text{false positives}=|\{(i,j)\ \text{pairs}:\hat\beta_{ij}\ne 0\ \text{and}\ \beta_{ij}^*=0\}|,$$
$$\text{false negatives}=|\{(i,j)\ \text{pairs}:\hat\beta_{ij}=0\ \text{and}\ \beta_{ij}^*\ne 0\}|,$$
$$\text{sensitivity}=\frac{|\{(i,j)\ \text{pairs}:\hat\beta_{ij}\ne 0\ \text{and}\ \beta_{ij}^*\ne 0\}|}{|\{(i,j)\ \text{pairs}:\beta_{ij}^*\ne 0\}|},$$
$$\text{specificity}=\frac{|\{(i,j)\ \text{pairs}:\hat\beta_{ij}=0\ \text{and}\ \beta_{ij}^*=0\}|}{|\{(i,j)\ \text{pairs}:\beta_{ij}^*=0\}|},$$
$$\text{prediction error}=\|Y_{\text{test}}-X_{\text{test}}\hat B\|_2^2,$$

where |·| is the number of elements in a set and (Ytest, Xtest) is an independently generated testing set of 100 samples.
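These summaries are straightforward to compute once B̂ is available; a minimal sketch, assuming B_hat, B_star and an independent test pair are given:

```python
import numpy as np

def selection_metrics(B_hat, B_star, X_test, Y_test):
    sel, true = B_hat != 0, B_star != 0
    fp = int(np.sum(sel & ~true))                              # false positives
    fn = int(np.sum(~sel & true))                              # false negatives
    sensitivity = np.sum(sel & true) / np.sum(true)
    specificity = np.sum(~sel & ~true) / np.sum(~true)
    pred_err = float(np.sum((Y_test - X_test @ B_hat) ** 2))   # test-set prediction error
    return fp, fn, sensitivity, specificity, pred_err
```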

Figure 2 summarizes these quantities under the “not all in all out” scenario for all the group structures in Fig.1 at p = q = 200, n = 150, and ρ = 0.5. The results of the proposed Lasso+X+XY method for the nonoverlapping group structures 𝒢X, 𝒢XY and 𝒢XY ∪ 𝒢X as well as for the overlapping group structure are highlighted in black. The methods for the correctly specified group structures are highlighted in grey, except in Fig.2(c) and Fig.2(d), where the implemented group structures are themselves the correctly specified group structures. From Fig.2 we see that correctly incorporating the group structure improves both variable selection and prediction, and our proposed method Lasso+X+XY, or the MSGLasso, performs at least as well as, if not better than, the methods for the correct group structures and yields the lowest prediction errors.

Figure 2.

Figure 2

Simulation results, large p small n, “not all in all out” cases with n = 100, p = q = 200 and ρ = 0.5. SGL: the multivariate sparse group lasso; G: the multivariate group lasso.

Figure 3 illustrates fitted results for a data set randomly chosen from the one hundred replications, where B* has a “not all in all out” group structure, either 𝒢XY ∪ 𝒢X or overlapping, with p = 200, q = 200 and ρ = 0.5. It clearly shows that the MSGLasso results for the correctly specified group structure, in Fig.3(e) and Fig.3(k), yield the most desirable estimates. Methods without the lasso penalty yield too many false positives inside the important groups for the “not all in all out” case even when the groups are correctly specified, while methods with the lasso penalty but incorrectly specified groups yield too many false positives outside the important groups.

Figure 3.

Figure 3

Heatmaps of coefficient matrices, selection effects. (a)–(h): “Not all in all out” X+XY nonoverlapping group structure with n = 100, p = 200, q = 200, and ρ = 0.5. (a) B*; (b) L; (c) LX; (d) LXY ; (e) LXXY ; (f) GX; (g) GXY ; (h) GXXY. (i)–(l): “Not all in all out” overlapping group structure with n = 100, p = 200, q = 200, and ρ = 0.5. (i) B*; (j) L; (k) SGL; (l) G.

5.2 Yeast eQTL data analysis

In this section, we demonstrate our method by analyzing a yeast eQTL data set generated by Brem and Kruglyak (2005), see also Yin and Li (2011), where gene expressions are grouped into, possibly overlapping, pathways and the genetic markers are grouped into genes.

The data set contains 6216 yeast genes assayed for 112 individual segregants. Genotypes of these 112 segregants at 2956 marker positions were also collected using GeneChip Yeast Genome S98 microarrays. The 6216 expressed genes are grouped by Kyoto Encyclopedia of Genes and Genomes pathways and the 2956 markers are grouped by genes, taking isoform genes as the same gene. To illustrate the method, in the reported analysis we only include genes from the following four pathways: the mitogen-activated protein kinase (MAPK) pathway containing 54 genes, the cell cycle pathway containing 116 genes, the cancer pathway containing 20 genes, and the ribosome pathway containing 137 genes. There are in total 315 distinct expressed genes in these pathways, with 5 genes overlapping between MAPK and cell cycle, 5 genes overlapping between MAPK and cancer, 3 genes overlapping between cell cycle and cancer, and 1 gene overlapping among MAPK, cell cycle and cancer. The ribosome pathway does not contain genes overlapping with the other three pathways.

We follow a procedure similar to that of Yin and Li (2011) for prescreening genotype markers by performing univariate linear regressions across all the 315 gene expressions and 2956 markers, and include the 395 markers with p-values of 0.01 or smaller in the final analysis. These 395 markers are embedded in 45 distinct genes.
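A sketch of this prescreening step as we read it (our interpretation, not the authors' code): for each marker we take the smallest p-value over simple linear regressions against each expression trait and keep markers whose smallest p-value is at most 0.01.

```python
import numpy as np
from scipy import stats

def prescreen_markers(X, Y, alpha=0.01):
    """Keep markers whose smallest simple-regression p-value across all
    expression traits is at most alpha."""
    keep = []
    for j in range(X.shape[1]):
        best_p = 1.0
        for k in range(Y.shape[1]):
            result = stats.linregress(X[:, j], Y[:, k])   # marker j vs. expression trait k
            best_p = min(best_p, result.pvalue)
        if best_p <= alpha:
            keep.append(j)
    return keep
```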

Since the fact that some marker within a gene is associated with some gene expression in a pathway does not necessarily imply that the gene is associated with all four pathways, we exclude the 𝒢X group structure and only apply an overlapping 𝒢XY group structure in the data analysis. We cross-validate the performance of the multivariate sparse group lasso, the multivariate lasso, the multivariate group lasso and the univariate lasso. In particular, we randomly divide the 112 samples into five approximately equal sized subsets, set one subset aside as the test set, and use the remaining four subsets as the training set. Then for each model, we run 5-fold cross-validation on the training set to estimate the coefficient matrix, and use the estimated model to compute the prediction error on the test set. We repeat the above procedure until each of the five subsets has been used as the test set once. The overall cross-validated prediction errors, as sums of squares, are reported in Table 1. The univariate lasso is conducted by first selecting variables on the training set using 315 separate lasso regressions, each for a single gene expression variable, and then fitting a multivariate linear regression on only the selected set of covariates to obtain B̂. Our proposed method has the best performance. The univariate lasso gives the highest prediction error; this is expected because the relations among responses are totally overlooked, which leads to high variability and over-fitting (Peng et al., 2010). The proposed method shows roughly a 10% decrease in the cross-validated prediction error over the multivariate lasso method, the second best among the four compared methods.

Table 1.

Comparison of prediction errors between different methods

Method MSG lasso M lasso MG lasso lasso
Prediction error 3094.5 3396.8 3557.4 3683.3

MSG lasso = multivariate sparse group lasso, M lasso = multivariate lasso, MG lasso = multivariate group lasso, lasso = univariate lassos.

We then apply the multivariate sparse group lasso to the entire data set with 315 gene expressions and 395 markers. The final tuning parameters are λ = 7 × 10−2 and λ1 = 2 × 10−4, determined by 5-fold cross-validation. We also investigate the selection stability following Meinshausen and Bühlmann (2010) by calculating the selection frequencies of the top selected associations over one hundred bootstrap datasets. The top associations in terms of size, with selection frequency no less than 95%, are given in Table 2. The p-values in the last column are obtained from marginal simple linear regressions. Overall there are 1422 nonzero elements in the estimated coefficient matrix, which gives an overall estimated sparsity of about 1%. There are 235 markers with nonzero coefficients related to genes in the MAPK pathway, 135 markers related to genes in the cell cycle pathway, 65 markers related to genes in the cancer pathway, and 65 markers related to genes in the ribosome pathway. Among those, 34 markers are related to genes in the overlap of the MAPK and cell cycle pathways, 23 markers are related to genes in the overlap of the MAPK and cancer pathways, and 5 markers are related to the gene in the overlap of the MAPK, cell cycle and cancer pathways.
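A sketch of the bootstrap selection-frequency calculation in the spirit of Meinshausen and Bühlmann (2010); fit is an assumed callable standing in for the cross-validated MSGLasso fit and returning an estimated p × q coefficient matrix:

```python
import numpy as np

def selection_frequencies(X, Y, fit, n_boot=100, seed=0):
    """Proportion of bootstrap resamples in which each beta_jk is selected."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    counts = None
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # bootstrap resample of the rows
        B_hat = fit(X[idx], Y[idx])             # refit the model on the resampled data
        sel = (B_hat != 0).astype(float)
        counts = sel if counts is None else counts + sel
    return counts / n_boot
```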

Table 2.

Top selected expression-marker associations

Index  β̂jk  Sel. Freq.* (%)  Expr.** name  Expr. pathway  Marker Chr:BP***  Marker gene  p-value
1 −1.481 100 YKL178C MAPK 3:201166 YCR041W 2.43e-51
2 1.465 100 YFL026W MAPK 3:201166 YCR041W 2.81e-55
3 −1.264 100 YPL187W MAPK 3:201166 YCR041W 7.10e-45
4 1.061 100 YNL145W MAPK 3:201166 YCR041W 5.54e-39
5 −0.735 100 YGL089C MAPK 3:201166 YCR041W 8.53e-20
6 0.650 100 YFL026W MAPK 3:201167 YCR041W 2.81e-55
7 −0.649 100 YKL178C MAPK 3:201167 YCR041W 2.43e-51
8 −0.554 98 YPL187W MAPK 3:201167 YCR041W 7.10e-45
9 0.452 100 YDR461W MAPK 3:201166 YCR041W 8.42e-14
10 −0.385 98 YPL187W MAPK 3:177850 gCR02 1.65e-33
11 0.352 100 YGR088W MAPK 15:170945 gOL02 1.52e-10
12 0.346 100 YGR088W MAPK 15:174364 gOL02 1.51e-10
13 −0.318 97 YKL178C MAPK 3:177850 gCR02 2.44e-37
14 0.257 98 YGR088W MAPK 10:51003 YJL204C 0.044
15 −0.175 95 YGL089C MAPK 2:681361 YML056C 0.66
* Sel. Freq. = Selection Frequency.
** Expr. = gene expression.
*** Marker is denoted by its physical position in the format “chromosome:basepair”.

Table 3 lists the top pathway-gene groupwise associations in terms of the group L2 norms with a 100% group-wise selection frequency. Out of 180 block groups, 89 groups contain nonzero coefficients. Several top selected genes have been reported in the literature. For example, one of the isoforms of the YCR gene, YCR073C/SSK22, is part of the MAPK cascade involved in the osmosensory signaling pathway. Gene groups YJL and YGR in the Src homology 3 domains interact with gene Pbs2, one of the three kinase components in the MAPK pathway (Zarrinpar et al., 2003). The top association signals detected between the gene expressions in the joint of the MAPK, cell cycle and cancer pathways and markers in the NHR gene group also confirm the regulatory effects of NHR genes on the cell cycle pathway and other autophagy-related genes (Nicole, 2011).

Table 3.

Top selected pathway-gene associations (with 100% selection frequency)

Index  Pathway  Gene  ‖B̂g‖2  Number of nonzero β̂jk in group  Top expr.* in pathway  Top marker** in gene  Top β̂jk in group
1 MAPK YCR 3.06 23 YKL178C 3:201166 −1.481
2 MAPK gOL 0.508 10 YGR088W 15:170945 0.352
3 MAPK gCR 0.499 3 YPL187W 3:177850 −0.385
4 MAPK YJL 0.424 23 YGR088W 10:51003 0.257
5 MAPK NHR 0.420 49 YCL027W 8:111686 −0.184
6 MAPK NBR 0.382 15 YGL089C 2:681361 0.207
7 MAPK YBR 0.372 81 YGR088W 2:368060 0.165
8 ribosome YER 0.342 119 YER102W 5:350744 −0.063
9 cancer YLR 0.286 14 YJR048W 12:674651 0.164
10 MAPK YGR 0.275 3 YGL089C 7:916471 −0.172
11 MAPK YPL 0.274 18 YGR088W 12:428612 0.240
12 MAPK YLR 0.252 62 YCL027W 12:957108 0.092
13 MAPK YER 0.229 23 YPL187W 7:321714 0.135
14 MAPK YML 0.214 23 YGL098C 13:164026 −0.175
15 MAPK YHL 0.205 15 YKL178C 8:98513 −0.128
16 MAPK YNL 0.183 23 YGL089C 14:418269 −0.083
17 MAPK YCL 0.176 27 YCL027W 3:64311 0.140
18 MAPK; cell cycle NHR 0.175 44 YJL157C 8:111686 −0.061
19 MAPK gJL 0.131 9 YFL026W 10:259991 0.098
20 MAPK YOL 0.125 26 YPL187W 15:193911 0.084
21 MAPK; cell cycle; cancer NHR 0.098 5 YBL016W 8:111686 −0.044
22 cell cycle YCR 0.067 5 YLR288C 3:201166 0.046
23 cell cycle YCL 0.063 16 YDL003W 3:64311 −0.035
24 cell cycle YLR 0.029 37 YBR093C 12:674651 0.012
* expr. = gene expression.
** Top marker in gene is denoted by its physical position in the format “chromosome:basepair”.

It is worth noting that none of the association p-values from marginal simple linear regressions between gene YJL and pathway MAPK survives the Bonferroni correction for multiple comparisons. For example, the 14th signal in Table 2 has a univariate marginal p-value of 0.044, and therefore it is unlikely to be picked up by the pairwise analysis. However, the MSGLasso successfully selected this signal in an adjusted analysis with high individual and group selection frequencies; see Tables 2 and 3. This finding is supported by Zarrinpar et al. (2003). It demonstrates that, besides the advantage of dimension reduction, the MSGLasso can also pick out important signals that would be missed by the pairwise method.

The stability selection results show that the first 40 selected top signals do not contain zero within their 2.5%–97.5% bootstrap percentile bands, and the bootstrap Q1–Q3 bands of the top 100 selected signals do not contain zero, indicating that the top signals selected by the proposed method have high selection frequencies across bootstrap samples.

6 Discussion

For a predetermined group structure, the MSGLasso effectively and efficiently selects the important groups and the important individual signals within those groups. There is some interest in the recent literature in learning the group structure and selecting the important variables simultaneously. For example, Yin and Li (2011) proposed a conditional Gaussian graphical model to select nonzero entries in the precision matrix conditional on simultaneously selected predictors. It would be of interest to select important predictors via the MSGLasso based on a data-driven group structure, where the selection of the group structure is a topic for future research.

The L1/L2 penalty in the MSGLasso ensures that the objective function is convex in B. The convexity is essential for the proposed mixed coordinate descent algorithm. Replacing the L1 penalty by the SCAD penalty (Fan and Li, 2001) would be of interest, but the resulting optimization problem is non-convex and thus not guaranteed to converge to the global minimum. More research along this line is needed.


Acknowledgements

The authors thank Dr. Hongzhe Li for providing the yeast eQTL data and for helpful discussions. The research was supported in part by the National Institutes of Health grant R01-AG036802 and the National Science Foundation grants DMS-1007590 and DMS-0748389.

Footnotes

Supplementary Materials

Web Appendices for the proofs of theoretical results referenced in Sections 3 and 4, computing cost comparison and MSGLasso package referenced in Section 4, and additional numerical results are available with this paper at the Biometrics website on Wiley Online Library.

References

  1. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of lasso and Dantzig selector. Ann. Stat. 2009;37:1705–1732.
  2. Biswas S, Lin S. Logistic Bayesian lasso for identifying association with rare haplotypes and application to age-related macular degeneration. Biometrics. 2012;68:587–597. doi: 10.1111/j.1541-0420.2011.01680.x.
  3. Brem RB, Kruglyak L. The landscape of genetic complexity across 5700 gene expression traits in yeast. Proceedings of the National Academy of Sciences. 2005;102:1572–1577. doi: 10.1073/pnas.0408709102.
  4. Bunea F, She Y, Wegkamp M. Optimal selection of reduced rank estimators of high-dimensional matrices. Ann. Stat. 2011;39:1282–1309.
  5. Dudoit S, Shaffer J, Boldrick J. Multiple hypothesis testing in microarray experiments. Statistical Science. 2003;18:71–103.
  6. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 2001;96:1348–1360.
  7. Huang J, Ma S, Xie H, Zhang C. A group bridge approach for variable selection. Biometrika. 2009;96:339–355. doi: 10.1093/biomet/asp020.
  8. Lounici K, Pontil M, Tsybakov AB, van de Geer S. Oracle inequalities and optimal inference under group sparsity. Annals of Statistics. 2011;39:2164–2204.
  9. Meinshausen N, Bühlmann P. Stability selection. J. R. Statist. Soc. B. 2010;72:417–473.
  10. Nicole A. Integration of nutritional status with germline proliferation: characterizing the roles of nhr-88 and nhr-49 in the C. elegans gonad. 2011.
  11. Obozinski G, Wainwright M, Jordan M. Support union recovery in high-dimensional multivariate regression. Ann. Stat. 2011;39:1–47.
  12. Park MY, Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9:30–50. doi: 10.1093/biostatistics/kxm010.
  13. Peng J, Zhu J, Bergamaschi A, Han W, Noh DY, Pollack JR, Wang P. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Stat. 2010;4:53–77. doi: 10.1214/09-AOAS271SUPP.
  14. Simon N, Friedman J, Hastie T, Tibshirani R. A sparse-group lasso. Journal of Computational and Graphical Statistics. 2013;22:231–245.
  15. Stein J, Hua X, Lee S, Ho A, Leow A, Toga A, Saykin A, Shen L, Foroud T, Pankratz N, Huentelman M, Craig D, Gerber J, Allen A, Corneveaux J, Dechairo B, Potkin S, Weiner M, Thompson P, Alzheimer's Disease Neuroimaging Initiative. Voxelwise genome-wide association study (vGWAS). Neuroimage. 2010;53(3):1160–1174. doi: 10.1016/j.neuroimage.2010.02.032.
  16. Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B. 1996;58:267–288.
  17. Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications. 2001;109:275–294.
  18. Wu T, Lange K. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics. 2008;2:224–244.
  19. Yin J, Li H. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann. Appl. Stat. 2011;5:2630–2650. doi: 10.1214/11-AOAS494.
  20. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B. 2006;68:49–67.
  21. Zamdborg L, Ma P. Discovery of protein–DNA interactions by penalized multivariate regression. Nucl. Acids Res. 2009;37:5246–5254. doi: 10.1093/nar/gkp554.
  22. Zarrinpar A, Park SH, Lim WA. Optimization of specificity in a cellular protein interaction network by negative selection. Nature. 2003;426:676–680. doi: 10.1038/nature02178.
  23. Zhang S, Ching W, Tsing N, Leung H, Guo D. A new multiple regression approach for the construction of genetic regulatory networks. Artificial Intelligence in Medicine. 2010;48:153–160. doi: 10.1016/j.artmed.2009.11.001.
  24. Zhou H, Sehl ME, Sinsheimer JS, Lange K. Association screening of common and rare genetic variants by penalized regression. Bioinformatics. 2010;26:2375–2382. doi: 10.1093/bioinformatics/btq448.
  25. Zhou N, Zhu J. Group variable selection via a hierarchical lasso and its oracle property. Statistics and Its Interface. 2010;4:557–574.
  26. Zou H. The adaptive lasso and its oracle properties. J. Am. Statist. Assoc. 2006;101:1418–1429.
  27. Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Statist. Soc. B. 2005;67:301–320.
