Summary
We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It is well suited to many biological studies that aim to detect associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functional group such as a gene, a pathway or a brain region. The method effectively removes unimportant groups as well as unimportant individual coefficients within important groups, particularly for large p small n problems, and is flexible in handling various complex group structures such as overlapping, nested, or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study.
Keywords: coordinate descent algorithm, eQTL, high-dimensional data, genetic association, oracle inequalities, sparsity
1 Introduction
Genomic association studies with a single phenotype have been widely studied. Such association studies often encounter high-dimensional predictors with sparsity, i.e., only a small number of predictors are associated with the response. To select truly associated predictors, it is necessary to use regularization penalties to shrink the coefficients of irrelevant predictors to exactly zero. Popular penalties for regression models with a univariate response include the lasso (Tibshirani, 1996), the adaptive lasso (Zou, 2006), the elastic net (Zou and Hastie, 2005) and the smoothly clipped absolute deviation (Fan and Li, 2001), among many others.
An important characteristic of high-dimensional genomic predictors is their intrinsic group structure. For example, DNA markers, also known as single nucleotide polymorphisms (SNPs), can often be grouped into genes, and genes can be grouped into biological pathways. Such grouping strategies have been applied successfully to genomic studies in rare variant detection (Zhou et al., 2010; Biswas and Lin, 2012). For group variable selection, Yuan and Lin (2006) proposed the group lasso method for the univariate response case. It penalizes the L2 norm of each predictor group and selects important groups in an “all-in-all-out” fashion; that is, all the predictors in a group are included or excluded simultaneously. In real applications, however, this is rarely the case: oftentimes not all the variables in an important group are important. For example, a gene associated with a certain complex trait does not mean that all the variants within the gene are causal, and a pathway that regulates certain gene expressions does not necessarily indicate that all its components have regulatory effects. Recent efforts have been made to select both important groups and important within-group signals simultaneously. Huang et al. (2009) and Zhou and Zhu (2010) adopted an Lγ penalty, 0 < γ < 1, to select important groups while removing unimportant variables within them; Zhou et al. (2010) used a penalized logistic regression with a mixed L1/L2 penalty to select both common and rare variants in a genome-wide association study; and Simon et al. (2013) proposed the sparse group lasso for selecting both important groups and within-group predictors. However, all the above methods concern a univariate response.
Many other genomic data analyses focus on investigating the associations between high-dimensional response variables and high-dimensional covariates, such as gene-gene associations (Park and Hastie, 2008; Zhang et al., 2010), protein-DNA associations (Zamdborg and Ma, 2009) and brain fMRI-DNA (or gene) associations (Stein et al., 2010). Oftentimes pairwise associations are calculated in such studies. For example, many multivariate genome-wide association studies still test one association at a time, between a single marker and a single trait, and then correct for multiple hypothesis testing (Dudoit et al., 2003; Stein et al., 2010). However, when both responses and predictors are high dimensional, most family-wise type I error controlling procedures are too conservative and yield poor performance (Stein et al., 2010), and an adjusted analysis considering multiple variables simultaneously is often more appropriate.
High-dimensional responses also often have natural group structures, for example, pathway group structures for gene expression responses and brain functional regions for fMRI intensity responses. For multivariate responses, Peng et al. (2010) adopted the mixed L1/L2 penalty in an orthonormal setting for identifying hub covariates in a gene regulation network; Obozinski et al. (2011) and Bunea et al. (2011) studied joint support union recovery and joint rank selection, respectively; and Lounici et al. (2011) proved oracle inequalities for multitask learning. Despite all these efforts, little attention, to our knowledge, has been paid to cases where the responses also have a group structure, even though such cases are commonly encountered in biological studies. A possible strategy for multivariate-response analysis is to perform covariate selection for one response variable at a time. Such an analysis can account for the predictor group structure but overlooks the response group structure.
In this article, we propose a regularization method that makes good use of the intrinsic biological group structures on both covariates and responses to facilitate better variable selection for multivariate-response, multiple-predictor data by effectively removing unimportant blocks of regression coefficients. Both the predictor and response group structures, or in general, the block structure of the regression coefficient matrix, are assumed known. Many biologically confirmed group structures can be obtained from publicly available repositories, for example, RefSeq gene files from the NCBI Reference Sequence Database (http://www.ncbi.nlm.nih.gov/refseq/), KEGG pathway maps from the Kyoto Encyclopedia of Genes and Genomes (http://www.genome.jp/kegg/), and the Brodmann brain anatomic region atlas from https://surfer.nmr.mgh.harvard.edu/fswiki/BrodmannAreaMaps. The proposed method can handle cases where the number of variables in either responses or predictors is much greater than the sample size, as well as complex group structures such as overlapping groups where a variable belongs to multiple groups. The estimators enjoy finite sample oracle bounds for the prediction error, the estimation error, and the estimated sparsity of the regression coefficient matrix. Extensive simulations show that the proposed method outperforms competing regularization methods. We applied the proposed method to a yeast gene expression quantitative trait loci (eQTL) study, where the numbers of gene expression responses and genetic marker predictors are both much larger than the sample size. The gene expressions are grouped into biological pathways and the genetic markers are grouped into genes. We demonstrate that, by considering both group structures, the proposed method generates a much more interpretable and predictive eQTL network between the gene expressions and genetic markers, compared with several other commonly used regularized approaches.
2 Multivariate linear model with arbitrary grouping
We consider the multivariate linear model
Y = XB + W,   (1)
where Y = (y1,⋯, yq) ∈ ℝn×q is the response matrix of n samples and q variables, X = (x1,⋯, xp) ∈ ℝn×p is the covariate matrix of n samples and p variables, B = (βjk)p×q ∈ ℝp×q is the coefficient matrix, and W = (w1,⋯, wq) ∈ ℝn×q is the matrix of error terms with each wk ~ N(0, σk²In), k = 1,⋯, q. Assume Y and X are centered so that there is no intercept in B. We adopt the notational convention that the column vectors of X are indexed by j, the column vectors of Y and W are indexed by k, and the samples are indexed by i.
Assume B contains G groups, and each group, denoted as Bg where g ∈ {1,⋯, G}, is a subset of two or more elements in B. We denote the group structure by 𝒢 = {B1,⋯, BG}. We use B or Bg to denote either the set of all their elements or the numerical values of all their elements, depending on the context, which should not cause any confusion. Figure 1 illustrates a few examples of group structures, where each highlighted block indicates an important group in 𝒢 and each figure may represent several different group structures. Note that the group structures considered in this article are pre-defined by biological functions, such as gene or pathways. Also note that the union of all groups in 𝒢 does not need to contain all the elements of B, in other words, some βjk may not belong to any group. We say Bg1 is nested in Bg2 if Bg1 ⊂ Bg2; Bg1 and Bg2 are overlapping if Bg1 ∩ Bg2 is not empty. Obviously, nested groups are a special case of overlapping. A group structure with overlapping groups is common in biological studies. For example, when grouping genetic variants according to genes or pathways, different genes or pathways can overlap.
Figure 1.
B* group structures. Important groups are shaded. (a) X group structure, (b) XY group structure, (c) X+XY group structure (nesting group structure) and (d) overlapping group structure.
Though the proposed method works for an arbitrary group structure 𝒢 on B, in real applications, a biologically meaningful group structure on B is usually introduced from the group structures of both predictors and responses. Specifically, suppose X has m1 column groups and Y has m2 column groups, then they yield m1 ×m2 intersection block groups on B. We denote this intersection block group structure by 𝒢XY, the row block group structure only determined by the predictor groups by 𝒢X, and the nested group structure containing all groups in 𝒢XY and 𝒢X by 𝒢XY ∪ 𝒢X. In the eQTL association study, a nonzero group in 𝒢XY indicates that the corresponding gene group has SNPs associated with expressions in the corresponding pathway group. A nonzero group in 𝒢X indicates that the corresponding gene group has an effect on some or all of the expressions.
For an arbitrary group structure 𝒢 with G groups, let ∑g∈𝒢 ‖Bg‖2 be the total sum of L2 norms of every group in 𝒢, where ‖Bg‖2 = (∑βjk∈Bg βjk²)1/2. The group L2 norm reduces to the Frobenius norm ‖A‖2 = {tr(ATA)}1/2 for a matrix group A and to the vector L2 norm ‖a‖2 = {aTa}1/2 for a vector group a. Proofs of theoretical results in the following sections are provided in the web-based Supplementary Materials.
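To make the notation concrete, the following is a minimal Python sketch of how the mixed group/entrywise penalty can be evaluated for an arbitrary 𝒢, with each group encoded as a boolean mask over B; the mask encoding and all names are illustrative choices of ours, not part of the released package.

```python
import numpy as np

def msglasso_penalty(B, groups, lam_g, lam_jk):
    """Mixed penalty: sum_g lam_g * ||B_g||_2 + sum_{j,k} lam_jk * |b_jk|.

    groups : list of boolean (p, q) masks, one per group in G; masks may
             overlap, be nested, or leave some entries of B ungrouped.
    lam_g  : sequence of group tuning parameters, aligned with `groups`.
    lam_jk : (p, q) array of entrywise lasso tuning parameters.
    """
    group_term = sum(lg * np.sqrt(np.sum(B[m] ** 2))
                     for m, lg in zip(groups, lam_g))
    lasso_term = np.sum(lam_jk * np.abs(B))
    return group_term + lasso_term

# Example: one row-block group and one 2x2 block group on a 4x3 matrix.
B = np.arange(12, dtype=float).reshape(4, 3)
g1 = np.zeros((4, 3), dtype=bool); g1[0, :] = True       # a row group
g2 = np.zeros((4, 3), dtype=bool); g2[2:4, 0:2] = True   # a block group
print(msglasso_penalty(B, [g1, g2], lam_g=[1.0, 1.0],
                       lam_jk=np.full((4, 3), 0.1)))
```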
3 The regularization method and its properties
3.1 The multivariate sparse group lasso
For an arbitrary group structure 𝒢 on B, to simplify the notation, we denote {g: Bg ∈ 𝒢} by {g ∈ 𝒢} as long as it does not cause any confusion. For j = 1,…, p and k = 1, …, q, let λjk ≥ 0 be the adaptive lasso tuning parameter for βjk, with λjk = 0 if βjk is not penalized. Let λg ≥ 0 be the adaptive tuning parameter for group Bg ∈ 𝒢, with λg = 0 if group Bg is not penalized. We consider the following penalized optimization problem for a general regularized multivariate multiple linear regression:
B̂ = argminB {(1/2)‖Y − XB‖2² + ∑g∈𝒢 λg‖Bg‖2 + ∑j,k λjk|βjk|},   (2)
where the L2 penalty term aims to shrink unimportant groups to zero and the L1 penalty term aims to shrink unimportant entries within an important group to zero. We call it the multivariate sparse group lasso (MSGLasso). We exclude the trivial case that λg = 0 for all g ∈ 𝒢 and λjk = 0 for all j, k. To better understand the solution to (2), we develop the following theorem for βjk when all other elements in B are fixed.
Theorem 3.1
For an arbitrary group structure 𝒢 on B, let B̂ be the solution to (2) and β̂jk be its jk-th element. If for some group Bg0 ∈ 𝒢 with a tuning parameter λg0,
{∑βjk∈Bg0 [(|xjT(Y − XB̂(−g0))·k| − λjk)+]²}1/2 ≤ λg0,   (3)

where B̂(−g0) is B̂ with all the elements of Bg0 replaced by zeros,
then β̂jk = 0 for every βjk ∈ Bg0. Otherwise, β̂jk satisfies
β̂jk = (|Sjk| − λjk)+ sign(Sjk) / (xjTxj + ∑{g∈𝒢: βjk∈Bg, ‖B̂g‖2>0} λg/‖B̂g‖2),   (4)
where Sjk = xjT(Y − XB̂(−j))·k, with B̂(−j) being B̂ with its j-th row replaced by zeros, the subscript ·k refers to the k-th column of a matrix, and a+ = a if a > 0 and 0 otherwise.
Note that Theorem 3.1 gives a general solution form and applies to arbitrary group structures. If there is no group structure assigned on B, then 𝒢 is an empty set and (4) reduces to the lasso solution; if λjk = 0 for all j, k, then (3) and (4) provide the group lasso solution. It is of interest to consider certain special group structures that are intuitive and commonly used in many applications. Specifically, we consider model (2) with the following four group structures: (I) 𝒢 = ∅, no group structure assigned on B; (II) 𝒢X; (III) 𝒢XY; (IV) 𝒢XY ∪ 𝒢X. The corresponding optimization problems become
B̂ = argminB {(1/2)‖Y − XB‖2² + λ|B|1},   (5)

B̂ = argminB {(1/2)‖Y − XB‖2² + λ|B|1 + λ1 ∑g∈𝒢X ωg1 ‖Bg‖2},   (6)

B̂ = argminB {(1/2)‖Y − XB‖2² + λ|B|1 + λ2 ∑g∈𝒢XY ωg2 ‖Bg‖2},   (7)

B̂ = argminB {(1/2)‖Y − XB‖2² + λ|B|1 + λ1 ∑g∈𝒢X ωg1 ‖Bg‖2 + λ2 ∑g∈𝒢XY ωg2 ‖Bg‖2},   (8)
where |B|1 = ∑jk |βjk| is the L1 norm of B, and ωg1 and ωg2 are group-specific weights, in particular the group sizes. The tuning parameters are λjk = λ for all lasso penalties, λg = ωg1λ1 if g ∈ 𝒢X, and λg = ωg2λ2 if g ∈ 𝒢XY.
In the remainder of this article, we call (5) the Lasso model, (6) the Lasso+X model, (7) the Lasso+XY model, and (8) the Lasso+X+XY model.
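The structures 𝒢X, 𝒢XY and 𝒢XY ∪ 𝒢X in (6)–(8) can be built mechanically from the predictor and response groupings. A small sketch, using the boolean-mask encoding introduced above (all names are ours):

```python
import numpy as np

def make_group_structures(p, q, x_groups, y_groups):
    """Build the G_X row blocks and the G_XY intersection blocks on B.

    x_groups : list of predictor index lists (column groups of X).
    y_groups : list of response index lists (column groups of Y).
    Returns (G_X, G_XY); the nested structure for model (8) is G_X + G_XY.
    """
    G_X = []
    for rows in x_groups:
        m = np.zeros((p, q), dtype=bool)
        m[rows, :] = True                      # whole rows of B
        G_X.append(m)
    G_XY = []
    for rows in x_groups:
        for cols in y_groups:
            m = np.zeros((p, q), dtype=bool)
            m[np.ix_(rows, cols)] = True       # intersection block
            G_XY.append(m)
    return G_X, G_XY

# Example: ten predictor groups and ten response groups of size 20
# (the simulation layout of Section 5.1) give 10 + 100 groups for (8).
x_groups = [list(range(20 * i, 20 * (i + 1))) for i in range(10)]
y_groups = [list(range(20 * i, 20 * (i + 1))) for i in range(10)]
G_X, G_XY = make_group_structures(200, 200, x_groups, y_groups)
```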
Let B̂L, B̂LX, B̂LXY and B̂LXXY be the solutions to (5), (6), (7) and (8), respectively. Their corresponding expressions from Theorem 3.1 reduce to simpler forms under the orthonormal design; in particular, B̂LX and B̂LXY are further shrinkages of B̂L, and B̂LXXY is a further shrinkage of either B̂LX or B̂LXY. We are also interested in the group lasso cases where λ = 0 in (6), (7) and (8), with their solutions denoted by B̂GX, B̂GXY and B̂GXXY, respectively. The main theorems in Yuan and Lin (2006) and Peng et al. (2010) then become special cases.
In the eQTL example that we analyze later, method (5) does not take advantage of the known group structure. Method (6) concerns only the predictor group structure and therefore can select important gene groups, but it ignores which pathways those genes are associated with. Method (7) considers both predictor and response group structures and therefore can select gene-to-pathway association blocks. Method (8) retains the advantages of both (6) and (7) and is more robust to misspecified group structures.
3.2 Oracle inequalities
The lasso method has been shown to achieve the oracle bounds for both prediction and estimation in the multiple linear regression model, which are the error bounds one would obtain if the true model were given, see for example, Bickel et al. (2009). Similar bounds also hold for a total of pq regression coefficients in the multivariate multiple linear regression model with a multivariate mixed L1/L2 penalty. For notational simplicity, we consider the following special case of (2) with λjk = λ for all j, k:
B̂ = argminB {(1/2)‖Y − XB‖2² + ∑g∈𝒢 λg ‖Bg‖2 + λ|B|1}.   (9)
We follow the method of Bickel et al. (2009). Let J1(B) = {jk : |βjk| ≠ 0} be the index set of nonzero elements in B, and J2(B) = {g ∈ 𝒢 : ‖Bg‖2 ≠ 0} be the index set of nonzero groups in 𝒢. Define M1(B) = ∑jk I(βjk ≠ 0) = |J1(B)| and M2(B) = ∑g∈𝒢 I(‖Bg‖2 ≠ 0) = |J2(B)|. For any matrix Δ ∈ ℝp×q and any given index set J1 ⊆ {jk : 1 ≤ j ≤ p, 1 ≤ k ≤ q}, denote by ΔJ1 the projection of Δ on the index set J1, that is, the matrix with the same elements as Δ on the coordinates in J1 and zeros on the complementary coordinates. Also, for any group index set J2 ⊆ {1,⋯,|𝒢|}, denote by ΔJ2 the set of projections of Δ on each of {Bg : g ∈ J2}, that is, ΔJ2 = {ΔBg : g ∈ J2}. Denote M1(B*) = r and M2(B*) = s. We then impose a restricted eigenvalue assumption for the multivariate linear regression model with a multivariate mixed L1/L2 penalty, which leads to the desirable oracle inequalities.
Assumption 3.2
Let J1 ⊆ {jk : 1 ≤ j ≤ p, 1 ≤ k ≤ q} and J2 ⊆ {1,⋯,|𝒢|} be any index sets that satisfy |J1| ≤ r and |J2| ≤ s. Let ρ̃ = {ρg : g ∈ 𝒢} be a set of positive numbers. Then for any nontrivial matrix Δ ∈ ℝp×q that satisfies
the following minima exist and are positive, denoted by κ1 = κ1(r, s, ρ̃) and κ2 = κ2(r, s, ρ̃):
Theorem 3.3
Consider model (9). Let B* be the true coefficient matrix. Assume each column of the error matrix, wk, follows a multivariate normal distribution N(0, σk²In), and all the diagonal elements of the matrix XTX/n are equal to 1. Suppose M1(B*) = r and M2(B*) = s. Let ψmax be the largest eigenvalue of XTX/n, σ = max{σ1,⋯, σq}, λg = ρgλ for g ∈ 𝒢, ρ = min{1, ρg; g ∈ 𝒢}, c be the maximum number of duplicates of a coefficient in overlapping groups in 𝒢, and
λ = (2Aσ/ρ){log(pq)/n}1/2 for some constant A > 21/2. Furthermore, assume Assumption 3.2 holds with κ1 = κ1(r, s, ρ̃) and κ2 = κ2(r, s, ρ̃). Then with probability at least 1 − (pq)1−A²/2, we have oracle bounds for the prediction error, the estimation error and the order of sparsity.
The mean squared prediction error is bounded by a factor of order λ² ~ log(pq)/n, the l1 norm of the estimation error is bounded by a factor of order λ ~ {log(pq)/n}1/2, and the estimated order of sparsity is bounded by a constant related to Assumption 3.2. These results are similar to those in Bickel et al. (2009). Note that Theorem 3.3 still holds for flexible λjk in (2), as long as λjk > 0 for all j, k.
4 The mixed coordinate descent algorithm
Based on Theorem 3.1, the zero groups can be determined according to (3), and the entries in a nonzero group can be determined by solving for the fixed point solution of (4) using a coordinate descent algorithm. The coordinate descent algorithm updates one coefficient βjk at a time while fixing all the other coefficients at their current values. Theoretically, the coordinate descent algorithm would work if one could solve (4) for β̂jk exactly. Practically, since β̂jk also appears in the term ∑{g∈𝒢: βjk∈Bg, ‖B̂g‖2>0} λg/‖B̂g‖2 on the right hand side of (4), unlike the lasso, a closed form solution is usually not available, and numerically solving for β̂jk requires iteratively updating (4), which can be time consuming. Here we propose a mixed coordinate descent algorithm, which updates β̂jk only once, from β̂jk(m) to β̂jk(m+1), according to (4) without iteratively solving (4). In particular, the algorithm updates β̂jk as follows.
If any of the groups Bg ∈ 𝒢 containing βjk satisfies (3), then the entire group is estimated at zero. Otherwise β̂jk will be updated according to one of the situations (II)–(IV).
- If all the groups containing βjk satisfy ‖B̂g−(jk)‖2 = 0 at the current step, where B̂g−(jk) is B̂g with its jk-th element replaced by zero, then β̂jk is updated by

β̂jk(m+1) = (|Sjk| − λjk − ∑{g∈𝒢: βjk∈Bg} λg)+ sign(Sjk) / (xjTxj).

Notice that in this case, (4) becomes a closed form lasso solution.
- If all the groups containing βjk satisfy ‖B̂g−(jk)‖2 > 0 at the current step and λjk = 0, then β̂jk is updated by the group lasso formulation

β̂jk(m+1) = Sjk / (xjTxj + ∑{g∈𝒢: βjk∈Bg} λg/‖B̂g(m)‖2).

Notice in this case, all the entries in a group Bg with ‖B̂g−(jk)‖2 > 0 will enter as nonzero entries, or in other words, the whole group Bg will be selected as an important group.
- If some but not all groups containing βjk satisfy ‖B̂g−(jk)‖2 = 0 at the current step, then β̂jk belongs to a mixture of the lasso case (for groups with ‖B̂g−(jk)‖2 = 0) and the group lasso case (for groups with ‖B̂g−(jk)‖2 > 0), and it is updated by a mixture of the lasso and the group lasso through

β̂jk(m+1) = (|Sjk| − λjk − ∑{g: βjk∈Bg, ‖B̂g−(jk)‖2=0} λg)+ sign(Sjk) / (xjTxj + ∑{g: βjk∈Bg, ‖B̂g−(jk)‖2>0} λg/‖B̂g(m)‖2).
Specifically, for a fixed set of values of all the tuning parameters, the algorithm proceeds as follows.
- Step 1. Standardize the data such that ∑i xij = 0 and xjTxj = n for all j ∈ {1,⋯, p}, and ∑i yik = 0 for all k ∈ {1,⋯, q}.
In our numerical examples, we also standardize yk such that ykTyk = n, to minimize the impact of different scales of variations across yk on the regression coefficients for all k ∈ {1,⋯, q}. Step 2. Set initial values β̂jk(0) for all β̂jk and the iteration index m = 1. We use initial values β̂jk(0) = 0 in our numerical examples.
Step 3. For a given pair (j, k), fix βj′k′ at β̂j′k′(m) for all j′ ≠ j or k′ ≠ k. Then update β̂jk(m) to β̂jk(m+1) by (I) to (IV) accordingly.
Step 4. Repeat Step 3 for all j ∈ {1,⋯, p} and k ∈ {1,⋯, q}, and iterate until ‖B̂(m) − B̂(m−1)‖ reaches a prespecified precision level for some norm ‖·‖. We use the infinity norm in our numerical examples.
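To make the updates concrete, the following is a minimal NumPy sketch of the mixed coordinate descent under the standardization of Step 1. It is an illustration only, not the released C/C++ implementation: the names (msglasso_fit, soft), the boolean-mask group encoding, the B0 warm-start argument, and the choice to check the group rule (3) once per sweep are all our own simplifications.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator: sign(z) * (|z| - t)_+."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def msglasso_fit(X, Y, groups, lam_g, lam_jk, B0=None, max_iter=200, tol=1e-5):
    """Sketch of the mixed coordinate descent for objective (2).

    X : (n, p), standardized so that x_j'x_j = n;  Y : (n, q), centered.
    groups : list of boolean (p, q) masks defining an arbitrary G
             (overlapping groups are allowed).
    lam_g  : per-group tuning parameters, aligned with `groups`.
    lam_jk : (p, q) array of entrywise lasso tuning parameters.
    B0     : optional warm start for B.
    """
    n, p = X.shape
    q = Y.shape[1]
    B = np.zeros((p, q)) if B0 is None else B0.copy()
    xtx = (X ** 2).sum(axis=0)                  # x_j' x_j, = n after scaling
    R = Y - X @ B                               # running residual matrix
    for _ in range(max_iter):
        B_old = B.copy()
        # Group-level rule (3): zero out whole groups, checked once per sweep.
        killed = np.zeros((p, q), dtype=bool)
        for mask, lg in zip(groups, lam_g):
            Rg = R + X @ np.where(mask, B, 0.0)            # Y - X B^(-g)
            jj, kk = np.nonzero(mask)
            score = np.einsum('ij,ij->j', X[:, jj], Rg[:, kk])
            t = np.maximum(np.abs(score) - lam_jk[jj, kk], 0.0)
            if np.sqrt((t ** 2).sum()) <= lg:              # condition (3) holds
                B[mask] = 0.0
                R = Rg
                killed |= mask
        # Entrywise updates: one pass of (4) per coordinate, cases (II)-(IV).
        for j in range(p):
            for k in range(q):
                if killed[j, k]:
                    continue
                s = X[:, j] @ R[:, k] + xtx[j] * B[j, k]   # S_jk in (4)
                thresh, denom = lam_jk[j, k], xtx[j]
                for mask, lg in zip(groups, lam_g):
                    if not mask[j, k]:
                        continue
                    nrm2 = (B[mask] ** 2).sum()
                    rest = max(nrm2 - B[j, k] ** 2, 0.0)   # ||B_g^{-(jk)}||^2
                    if rest == 0.0:
                        thresh += lg           # acts like an extra L1 penalty
                    else:
                        denom += lg / np.sqrt(nrm2)        # group shrinkage
                b_new = soft(s, thresh) / denom
                R[:, k] -= X[:, j] * (b_new - B[j, k])     # keep residual current
                B[j, k] = b_new
        if np.max(np.abs(B - B_old)) < tol:    # infinity-norm stopping rule
            break
    return B
```

Keeping the running residual matrix R makes each coordinate update O(n), the standard bookkeeping trick for lasso-type coordinate descent.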
Convergence of different types of coordinate descent algorithms has been studied in the literature. Tseng (2001) provided conditions for the convergence of the cyclic coordinate descent algorithm with general separable objective functions. Wu and Lange (2008) proved the convergence of the greedy coordinate descent algorithm with an L2 loss and the lasso penalty. Following Wu and Lange (2008), we show the convergence of our mixed coordinate descent algorithm in the following proposition. Details are provided in the Supplementary Materials, where we also illustrate that our mixed coordinate descent algorithm converges much faster than a coordinate descent algorithm that solves the fixed point solution to (4) with inner iterations.
Proposition 4.1
A sequence of coordinate estimates iteratively updated by the mixed coordinate descent algorithm converges to a global minimizer of the objective function.
We implemented the MSGLasso and the mixed coordinate descent algorithm in C/C++ and wrapped them into an R package. It is available in the web-based Supplementary Materials and will soon be uploaded to the CRAN repository.
5 Numerical studies
5.1 Simulations
In this section, we first investigate the numerical performance of the Lasso, Lasso+X, Lasso+XY and Lasso+X+XY methods and their group lasso counterparts when the true coefficient matrix B* takes a group structure of 𝒢X, 𝒢XY or 𝒢XY ∪ 𝒢X. We also compare the proposed MSGLasso method with the lasso and group lasso for an overlapping group structure.
All the true group structures considered in our simulations are given in Fig. 1(a)–1(d). For each group structure, we consider two scenarios: (i) “all-in-all-out”, where all the coefficients in an important group are important, and (ii) “not-all-in-all-out”, where only a subset of coefficients in an important group are important. Specifically, we generate B* by setting βjk* = 0 if it is from an unimportant group, and drawing its value from a uniform distribution on [−5,−1] ∪ [1, 5], then fixing it across the simulations, if it is from an important group. The proportion of nonzero coefficients in an important group in the “not all in all out” setting is randomly set between 1/6 and 1/4.
Each B* is of dimension 200 × 200. For a nonoverlapping group structure, each X row group is of dimension 20 × 200; each XY block group is of dimension 20 × 20. For the overlapping group structure, the groups start on coordinates (1, 21, 41, 61, 101, 121, 141, 181) and end on coordinates (20, 40, 70, 100, 120, 150, 180, 200), for both X and Y variables.
Covariates xi = (xi1,⋯, xip)T, i = 1,⋯, n, are generated from a multivariate normal distribution Np(0, ΣX), where ΣX = diag(Σg1,⋯, Σg10) is block diagonal and each block, corresponding to a group of X, has a first order autoregressive structure. Specifically, Σgi(j, k) = ρ^|j−k| for any pair (j, k) from the same group, i = 1,⋯, 10. The error terms wik are generated from a normal distribution N(0, σ²), where σ² is chosen to yield a signal to noise ratio of 2. Finally, the responses are generated from Y = XB* + W.
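For concreteness, a sketch of this data-generating mechanism under the stated settings; only one important block is shown, whereas the actual simulations place several important groups following Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, rho, size = 100, 200, 200, 0.5, 20

# Block-diagonal AR(1) covariance: Sigma_g(j, k) = rho**|j - k| within
# each of the ten predictor groups, independence across groups.
idx = np.arange(size)
Sigma_g = rho ** np.abs(idx[:, None] - idx[None, :])
X = np.hstack([rng.multivariate_normal(np.zeros(size), Sigma_g, size=n)
               for _ in range(p // size)])

# B*: zeros outside important groups; inside an important block, a
# random ~1/4 of the entries are drawn from Uniform([-5,-1] u [1,5]).
B_star = np.zeros((p, q))
block = B_star[0:20, 0:20]                       # one illustrative block
nz = rng.random((20, 20)) < 0.25
block[nz] = rng.choice([-1, 1], nz.sum()) * rng.uniform(1, 5, nz.sum())

# Gaussian noise with sigma^2 chosen to give a signal-to-noise ratio of 2.
signal = X @ B_star
sigma = np.sqrt(signal.var() / 2.0)
W = rng.normal(scale=sigma, size=(n, q))
Y = signal + W
```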
The optimal values of the tuning parameters may be selected by different criteria. Since the degrees of freedom are difficult to determine for a penalty with multiple tuning parameters, we search for the optimal tuning parameter values using 5-fold cross-validation over a wide range of candidate values. The search starts with the largest candidate tuning parameter values, each by itself large enough to shrink all the coefficients to zero. The converged estimate B̂ obtained from the previous search step is used as the initial value for B in the next search step with a new set of tuning parameter values. We find this warm-start strategy very effective in reducing the computational cost.
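A sketch of the warm-start search, reusing the msglasso_fit sketch from Section 4 (its B0 argument is our own addition for warm starting); each visited (λ, λg) pair on the path would then be scored by 5-fold cross-validation error.

```python
import numpy as np

def warm_start_path(X, Y, groups, lam_grid, lamg_grid):
    """Grid search over (lambda, lambda_g) with warm starts (a sketch).

    Starts from penalties large enough to shrink B to zero and feeds each
    converged estimate in as the initial value for the next grid point.
    """
    p, q = X.shape[1], Y.shape[1]
    B0 = np.zeros((p, q))
    path = []
    for lam in sorted(lam_grid, reverse=True):        # largest values first
        for lg in sorted(lamg_grid, reverse=True):
            lam_jk = np.full((p, q), lam)
            lam_g = [lg] * len(groups)
            B0 = msglasso_fit(X, Y, groups, lam_g, lam_jk, B0=B0)
            path.append((lam, lg, B0.copy()))
    return path
```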
For each simulation setup, we run one hundred replications and calculate the averages of the variable selection sensitivity and specificity, defined through the numbers of correctly identified nonzero and zero entries of B*, and the prediction error on (Ytest, Xtest), an independently generated testing set of 100 samples.
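A sketch of how these quantities can be computed for one replication, assuming the selection measures are the usual support sensitivity and specificity (the exact normalizations are our reading of the text):

```python
import numpy as np

def evaluate(B_hat, B_star, X_test, Y_test):
    """Selection and prediction summaries for one replication (a sketch)."""
    true_nz = B_star != 0
    est_nz = B_hat != 0
    sensitivity = (est_nz & true_nz).sum() / true_nz.sum()     # nonzeros found
    specificity = (~est_nz & ~true_nz).sum() / (~true_nz).sum()  # zeros kept
    pred_err = np.sum((Y_test - X_test @ B_hat) ** 2) / len(Y_test)
    return sensitivity, specificity, pred_err
```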
Figure 2 summarizes these quantities for simulation setups with “not all in all out” coefficients for all the group structures in Fig. 1 at p = q = 200, n = 100, and ρ = 0.5. The proposed method using Lasso+X+XY for the nonoverlapping group structures 𝒢X, 𝒢XY and 𝒢XY ∪ 𝒢X as well as for the overlapping group structure is highlighted in black. The methods for the correctly specified group structures are highlighted in grey, except in Fig. 2(c) and Fig. 2(d), where the implemented group structures are themselves the correctly specified group structures. From Fig. 2 we see that correctly incorporating group structure improves both variable selection and prediction, and our proposed method Lasso+X+XY, or the MSGLasso, performs at least as well as, if not better than, the methods for the correct group structures and yields the lowest prediction errors.
Figure 2.
Simulation results, large p small n, “not all in all out” cases with n = 100, p = q = 200 and ρ = 0.5. SGL: the multivariate sparse group lasso; G: the multivariate group lasso.
Figure 3 illustrates fitted results for a data set randomly chosen from the one hundred replications, where B* has a “not all in all out” 𝒢XY ∪ 𝒢X or overlapping group structure with p = 200, q = 200 and ρ = 0.5. It clearly shows that the MSGLasso with the correctly specified group structure, in Fig. 3(e) and Fig. 3(k), yields the most desirable estimates. Methods without the lasso penalty yield too many false positives inside the important groups in the “not all in all out” case even when the groups are correctly specified, while methods with the lasso penalty but incorrectly specified groups yield too many false positives outside the important groups.
Figure 3.
Heatmaps of coefficient matrices, selection effects. (a)–(h): “Not all in all out” X+XY nonoverlapping group structure with n = 100, p = 200, q = 200, and ρ = 0.5. (a) B*; (b) B̂L; (c) B̂LX; (d) B̂LXY ; (e) B̂LXXY ; (f) B̂GX; (g) B̂GXY ; (h) B̂GXXY. (i)–(l): “Not all in all out” overlapping group structure with n = 100, p = 200, q = 200, and ρ = 0.5. (i) B*; (j) B̂L; (k) B̂SGL; (l) B̂G.
5.2 Yeast eQTL data analysis
In this section, we demonstrate our method by analyzing a yeast eQTL data set generated by Brem and Kruglyak (2005), see also Yin and Li (2011), where gene expressions are grouped into, possibly overlapping, pathways and the genetic markers are grouped into genes.
The data set contains expression measurements of 6216 yeast genes for 112 individual segregants. Genotypes of these 112 segregants at 2956 marker positions were also collected using GeneChip Yeast Genome S98 microarrays. The 6216 expressed genes are grouped by Kyoto Encyclopedia of Genes and Genomes pathways and the 2956 markers are grouped by genes, taking isoform genes as the same gene. To illustrate the method, the reported analysis includes only genes from the following four pathways: the mitogen-activated protein kinase (MAPK) pathway containing 54 genes, the cell cycle pathway containing 116 genes, the cancer pathway containing 20 genes and the ribosome pathway containing 137 genes. There are in total 315 distinct expressed genes in these pathways, with 5 genes overlapping between MAPK and cell cycle, 5 genes overlapping between MAPK and cancer, 3 genes overlapping between cell cycle and cancer, and 1 gene overlapping among MAPK, cell cycle and cancer. The ribosome pathway does not share genes with the other three pathways.
We follow a prescreening procedure similar to that of Yin and Li (2011), performing univariate linear regressions across all the 315 gene expressions and 2956 markers, and include in the final analysis the 395 markers with a p-value of 0.01 or smaller. These 395 markers are embedded in 45 distinct genes.
Since a marker within a gene being associated with some gene expression in a pathway does not necessarily imply that the gene is associated with all four pathways, we exclude the 𝒢X group structure and apply only an overlapping 𝒢XY group structure in the data analysis. We cross-validate the performance of the multivariate sparse group lasso, the multivariate lasso, the multivariate group lasso and the univariate lasso. In particular, we randomly divide the 112 samples into five approximately equal sized subsets, set one subset aside as the test set, and use the remaining four subsets as the training set. Then for each model, we run 5-fold cross-validation on the training set to estimate the coefficient matrix, and use the estimated model to compute the prediction error on the test set. We repeat the above procedure until each of the five subsets has been used as the test set once. The overall cross-validated prediction errors, measured as sums of squares, are reported in Table 1. The univariate lasso is conducted by first selecting variables on the training set using 315 separate lasso regressions, one for each gene expression variable, and then fitting a multivariate linear regression on only the selected covariates to obtain B̂. Our proposed method has the best performance. The univariate lasso gives the highest prediction error; this is expected because the relations among the responses are totally overlooked, which leads to high variability and over-fitting (Peng et al., 2010). The proposed method shows roughly a 10% decrease in cross-validated prediction error over the multivariate lasso, the second best of the four compared methods.
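The comparison protocol can be sketched as follows, where fit_with_cv stands for any one of the four fitting routines wrapped with its own internal 5-fold tuning; it is a hypothetical helper, not a function from the package.

```python
import numpy as np

def outer_cv_error(X, Y, fit_with_cv, n_folds=5, seed=1):
    """Cross-validated test error: hold out each fold once, tune on the
    remaining folds via internal CV, and accumulate the test sum of squares."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    total = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        B_hat = fit_with_cv(X[train], Y[train])   # tunes by 5-fold CV inside
        total += np.sum((Y[test] - X[test] @ B_hat) ** 2)
    return total
```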
Table 1.
Comparison of prediction errors between different methods
| Method | MSG lasso | M lasso | MG lasso | lasso |
|---|---|---|---|---|
| Prediction error | 3094.5 | 3396.8 | 3557.4 | 3683.3 |
MSG lasso = multivariate sparse group lasso, M lasso = multivariate lasso, MG lasso = multivariate group lasso, lasso = univariate lassos.
We then apply the multivariate sparse group lasso to the entire data set with 315 gene expressions and 395 markers. The final tuning parameters, determined by 5-fold cross-validation, are λ = 7 × 10−2 and λ1 = 2 × 10−4. We also investigate the selection stability following Meinshausen and Bühlmann (2010) by calculating the selection frequencies of the top selected associations over one hundred bootstrap data sets. The top associations in terms of size, with selection frequency no less than 95%, are given in Table 2. The p-values in the last column are obtained from marginal simple linear regressions. Overall there are 1422 nonzero elements in the estimated coefficient matrix, an overall estimated sparsity of about 1%. There are 235 markers with nonzero coefficients related to genes in the MAPK pathway, 135 markers related to genes in the cell cycle pathway, 65 markers related to genes in the cancer pathway, and 65 markers related to genes in the ribosome pathway. Among those, 34 markers are related to genes in the overlap of the MAPK and cell cycle pathways, 23 markers are related to genes in the overlap of the MAPK and cancer pathways, and 5 markers are related to a gene in the overlap of the MAPK, cell cycle and cancer pathways.
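The selection frequencies can be reproduced in outline by refitting on bootstrap resamples at the chosen tuning parameters; fit_fixed below is a hypothetical refitting routine at fixed tuning parameters.

```python
import numpy as np

def bootstrap_selection_freq(X, Y, fit_fixed, n_boot=100, seed=2):
    """Proportion of bootstrap refits in which each coefficient is nonzero,
    in the spirit of Meinshausen and Buhlmann (2010)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    freq = None
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        nz = fit_fixed(X[idx], Y[idx]) != 0
        freq = nz.astype(float) if freq is None else freq + nz
    return freq / n_boot                      # per-coefficient frequency
```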
Table 2.
Top selected expression-marker associations
| Index | β̂jk | Sel. Freq.* (%) | Expr.** name | Expr. pathways | Marker Chr:BP*** | Marker gene | p-value |
|---|---|---|---|---|---|---|---|
| 1 | −1.481 | 100 | YKL178C | MAPK | 3:201166 | YCR041W | 2.43e-51 |
| 2 | 1.465 | 100 | YFL026W | MAPK | 3:201166 | YCR041W | 2.81e-55 |
| 3 | −1.264 | 100 | YPL187W | MAPK | 3:201166 | YCR041W | 7.10e-45 |
| 4 | 1.061 | 100 | YNL145W | MAPK | 3:201166 | YCR041W | 5.54e-39 |
| 5 | −0.735 | 100 | YGL089C | MAPK | 3:201166 | YCR041W | 8.53e-20 |
| 6 | 0.650 | 100 | YFL026W | MAPK | 3:201167 | YCR041W | 2.81e-55 |
| 7 | −0.649 | 100 | YKL178C | MAPK | 3:201167 | YCR041W | 2.43e-51 |
| 8 | −0.554 | 98 | YPL187W | MAPK | 3:201167 | YCR041W | 7.10e-45 |
| 9 | 0.452 | 100 | YDR461W | MAPK | 3:201166 | YCR041W | 8.42e-14 |
| 10 | −0.385 | 98 | YPL187W | MAPK | 3:177850 | gCR02 | 1.65e-33 |
| 11 | 0.352 | 100 | YGR088W | MAPK | 15:170945 | gOL02 | 1.52e-10 |
| 12 | 0.346 | 100 | YGR088W | MAPK | 15:174364 | gOL02 | 1.51e-10 |
| 13 | −0.318 | 97 | YKL178C | MAPK | 3:177850 | gCR02 | 2.44e-37 |
| 14 | 0.257 | 98 | YGR088W | MAPK | 10:51003 | YJL204C | 0.044 |
| 15 | −0.175 | 95 | YGL089C | MAPK | 2:681361 | YML056C | 0.66 |
Sel. Freq. = Selection Frequency.
expr. = gene expression.
Marker is denoted by its physical position in the format of “chromosome:basepair”.
Table 3 lists the top pathway-gene groupwise associations in terms of the group L2 norms with a 100% group-wise selection frequency. Out of 180 block groups, 89 groups contain nonzero coefficients. Several top selected genes have been reported in the literature. For example, one of the isoforms of the YCR gene, YCR073C/SSK22, is part of the MAPK cascade involved in the osmosensory signaling pathway. Gene groups YJL and YGR in the Src homology 3 domains interact with gene Pbs2, one of the three kinase components in the MAPK pathway (Zarrinpar et al., 2003). The top association signals detected between the gene expressions in the overlap of the MAPK, cell cycle and cancer pathways and markers in the NHR gene group also confirm the regulatory effects of NHR genes on the cell cycle pathway and other autophagy-related genes (Nicole, 2011).
Table 3.
Top selected pathway-gene associations (with 100% selection frequency)
| Index | Pathway | Gene | ‖B̂g‖2 | Number of nonzero β̂jk in group | Top expr.* in pathway | Top marker** in gene | Top β̂jk in group |
|---|---|---|---|---|---|---|---|
| 1 | MAPK | YCR | 3.06 | 23 | YKL178C | 3:201166 | −1.481 |
| 2 | MAPK | gOL | 0.508 | 10 | YGR088W | 15:170945 | 0.352 |
| 3 | MAPK | gCR | 0.499 | 3 | YPL187W | 3:177850 | −0.385 |
| 4 | MAPK | YJL | 0.424 | 23 | YGR088W | 10:51003 | 0.257 |
| 5 | MAPK | NHR | 0.420 | 49 | YCL027W | 8:111686 | −0.184 |
| 6 | MAPK | NBR | 0.382 | 15 | YGL089C | 2:681361 | 0.207 |
| 7 | MAPK | YBR | 0.372 | 81 | YGR088W | 2:368060 | 0.165 |
| 8 | ribosome | YER | 0.342 | 119 | YER102W | 5:350744 | −0.063 |
| 9 | cancer | YLR | 0.286 | 14 | YJR048W | 12:674651 | 0.164 |
| 10 | MAPK | YGR | 0.275 | 3 | YGL089C | 7:916471 | −0.172 |
| 11 | MAPK | YPL | 0.274 | 18 | YGR088W | 12:428612 | 0.240 |
| 12 | MAPK | YLR | 0.252 | 62 | YCL027W | 12:957108 | 0.092 |
| 13 | MAPK | YER | 0.229 | 23 | YPL187W | 7:321714 | 0.135 |
| 14 | MAPK | YML | 0.214 | 23 | YGL098C | 13:164026 | −0.175 |
| 15 | MAPK | YHL | 0.205 | 15 | YKL178C | 8:98513 | −0.128 |
| 16 | MAPK | YNL | 0.183 | 23 | YGL089C | 14:418269 | −0.083 |
| 17 | MAPK | YCL | 0.176 | 27 | YCL027W | 3:64311 | 0.140 |
| 18 | MAPK; cell cycle | NHR | 0.175 | 44 | YJL157C | 8:111686 | −0.061 |
| 19 | MAPK | gJL | 0.131 | 9 | YFL026W | 10:259991 | 0.098 |
| 20 | MAPK | YOL | 0.125 | 26 | YPL187W | 15:193911 | 0.084 |
| 21 | MAPK; cell cycle; cancer | NHR | 0.098 | 5 | YBL016W | 8:111686 | −0.044 |
| 22 | cell cycle | YCR | 0.067 | 5 | YLR288C | 3:201166 | 0.046 |
| 23 | cell cycle | YCL | 0.063 | 16 | YDL003W | 3:64311 | −0.035 |
| 24 | cell cycle | YLR | 0.029 | 37 | YBR093C | 12:674651 | 0.012 |
expr. = gene expression.
Top marker in gene is denoted by its physical position in the format of “chromosome:basepair”.
It is worth noting that none of the association p-values from marginal simple linear regressions between gene YJL and pathway MAPK survives the Bonferroni correction for multiple comparisons. For example, the 14th signal in Table 2 has a univariate marginal p-value of 0.044 and is therefore unlikely to be picked up by the pairwise analysis. However, the MSGLasso successfully selected this signal in an adjusted analysis with high individual and group selection frequencies; see Tables 2 and 3. This finding is supported by Zarrinpar et al. (2003). It demonstrates that, besides the advantage of dimension reduction, the MSGLasso can also pick out important signals that would be missed by the pairwise method.
The stability selection results show that none of the top 40 selected signals contains zero within its 2.5%–97.5% bootstrap percentile band, and the bootstrap Q1–Q3 bands of the top 100 selected signals do not contain zero, indicating that the top signals selected by the proposed method have high selection frequencies across bootstrap samples.
6 Discussion
For a predetermined group structure, the MSGLasso effectively and efficiently selects the important groups and the important individual signals within those groups. There has been recent interest in learning the group structure and selecting the important variables simultaneously. For example, Yin and Li (2011) proposed a conditional Gaussian graphical model to select nonzero entries in the precision matrix conditional on simultaneously selected predictors. Applying the MSGLasso with a data-driven group structure would also be of interest, and the selection of the group structure itself is a topic for future research.
The L1/L2 penalty in the MSGLasso ensures that the objective function is convex in B. The convexity is essential for the proposed mixed coordinate descent algorithm. Replacing the L1 penalty by the SCAD penalty (Fan and Li, 2001) would be of interest, but the resulting optimization is non-convex and thus not guaranteed to converge to the global minimum. More research along this line is needed.
Supplementary Material
Acknowledgements
The authors thank Dr. Hongzhe Li for providing the yeast eQTL data and helpful discussions. The research was supported in part by the National Institute of Health grant R01-AG036802 and the National Science Foundation grants DMS-1007590 and DMS-0748389.
Footnotes
Web Appendices for the proofs of theoretical results referenced in Sections 3 and 4, computing cost comparison and MSGLasso package referenced in Section 4, and additional numerical results are available with this paper at the Biometrics website on Wiley Online Library.
References
- Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of lasso and Dantzig selector. Ann. Stat. 2009;37:1705–1732.
- Biswas S, Lin S. Logistic Bayesian lasso for identifying association with rare haplotypes and application to age-related macular degeneration. Biometrics. 2012;68:587–597. doi: 10.1111/j.1541-0420.2011.01680.x.
- Brem RB, Kruglyak L. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences. 2005;102:1572–1577. doi: 10.1073/pnas.0408709102.
- Bunea F, She Y, Wegkamp M. Optimal selection of reduced rank estimators of high-dimensional matrices. Ann. Stat. 2011;39:1282–1309.
- Dudoit S, Shaffer J, Boldrick J. Multiple hypothesis testing in microarray experiments. Statistical Science. 2003;18:71–103.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 2001;96:1348–1360.
- Huang J, Ma S, Xie H, Zhang C. A group bridge approach for variable selection. Biometrika. 2009;96:339–355. doi: 10.1093/biomet/asp020.
- Lounici K, Pontil M, Tsybakov AB, van de Geer S. Oracle inequalities and optimal inference under group sparsity. Ann. Stat. 2011;39:2164–2204.
- Meinshausen N, Bühlmann P. Stability selection. J. R. Statist. Soc. B. 2010;72:417–473.
- Nicole A. Integration of nutritional status with germline proliferation: characterizing the roles of nhr-88 and nhr-49 in the C. elegans gonad. 2011.
- Obozinski G, Wainwright M, Jordan M. Support union recovery in high-dimensional multivariate regression. Ann. Stat. 2011;39:1–47.
- Park MY, Hastie T. Penalized logistic regression for detecting gene interactions. Biostatistics. 2008;9:30–50. doi: 10.1093/biostatistics/kxm010.
- Peng J, Zhu J, Bergamaschi A, Han W, Noh DY, Pollack JR, Wang P. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Stat. 2010;4:53–77. doi: 10.1214/09-AOAS271SUPP.
- Simon N, Friedman J, Hastie T, Tibshirani R. A sparse-group lasso. Journal of Computational and Graphical Statistics. 2013;22:231–245.
- Stein J, Hua X, Lee S, Ho A, Leow A, Toga A, Saykin A, Shen L, Foroud T, Pankratz N, Huentelman M, Craig D, Gerber J, Allen A, Corneveaux J, Dechairo B, Potkin S, Weiner M, Thompson P, and the Alzheimer's Disease Neuroimaging Initiative. Voxelwise genome-wide association study (vGWAS). Neuroimage. 2010;53:1160–1174. doi: 10.1016/j.neuroimage.2010.02.032.
- Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B. 1996;58:267–288.
- Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications. 2001;109:475–494.
- Wu T, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2008;2:224–244.
- Yin J, Li H. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann. Appl. Stat. 2011;5:2630–2650. doi: 10.1214/11-AOAS494.
- Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B. 2006;68:49–67.
- Zamdborg L, Ma P. Discovery of protein-DNA interactions by penalized multivariate regression. Nucl. Acids Res. 2009;37:5246–5254. doi: 10.1093/nar/gkp554.
- Zarrinpar A, Park SH, Lim WA. Optimization of specificity in a cellular protein interaction network by negative selection. Nature. 2003;426:676–680. doi: 10.1038/nature02178.
- Zhang S, Ching W, Tsing N, Leung H, Guo D. A new multiple regression approach for the construction of genetic regulatory networks. Artificial Intelligence in Medicine. 2010;48:153–160. doi: 10.1016/j.artmed.2009.11.001.
- Zhou H, Sehl ME, Sinsheimer JS, Lange K. Association screening of common and rare genetic variants by penalized regression. Bioinformatics. 2010;26:2375–2382. doi: 10.1093/bioinformatics/btq448.
- Zhou N, Zhu J. Group variable selection via a hierarchical lasso and its oracle property. Statistics and Its Interface. 2010;3:557–574.
- Zou H. The adaptive lasso and its oracle properties. J. Am. Statist. Assoc. 2006;101:1418–1429.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Statist. Soc. B. 2005;67:301–320.