Author manuscript; available in PMC: 2013 Oct 1.
Published in final edited form as: J Multivar Anal. 2012 Apr 27;111:241–255. doi: 10.1016/j.jmva.2012.03.013

Simultaneous Multiple Response Regression and Inverse Covariance Matrix Estimation via Penalized Gaussian Maximum Likelihood

Wonyul Lee 1, Yufeng Liu 1,*
PMCID: PMC3392174  NIHMSID: NIHMS374480  PMID: 22791925

Abstract

Multivariate regression is a common statistical tool for practical problems. Many multivariate regression techniques are designed for univariate response cases. For problems with multiple response variables available, one common approach is to apply the univariate response regression technique separately to each response variable. Although it is simple and popular, the univariate response approach ignores the joint information among response variables. In this paper, we propose three new methods for utilizing joint information among response variables. All methods are in a penalized likelihood framework with weighted L1 regularization. The proposed methods provide sparse estimators of the conditional inverse covariance matrix of the response vector given the explanatory variables, as well as sparse estimators of the regression parameters. Our first approach is to estimate the regression coefficients with plug-in estimated inverse covariance matrices, and our second approach is to estimate the inverse covariance matrix with plug-in estimated regression parameters. Our third approach is to estimate both simultaneously. Asymptotic properties of these methods are explored. Our numerical examples demonstrate that the proposed methods perform competitively in terms of prediction, variable selection, and inverse covariance matrix estimation.

Keywords: GLASSO, Inverse covariance matrix estimation, Joint estimation, LASSO, Multiple response, Sparsity

1. Introduction

Parameter estimation and variable selection are two important goals in linear regression analysis. In traditional statistical procedures, these two objectives are often achieved separately. For example, parameter estimation can be done by the least squares regression method and variable selection can be achieved by certain subset selection techniques. However, with a large number of predictors available in practice, these methods may not be feasible. When the dimension gets large, the least squares method may have an overfitting problem which reduces predictive accuracy. When the dimension is larger than the sample size, the least squares regression solution cannot even be calculated directly. In terms of variable selection, the all subset selection method can be unstable because the procedure is not continuous [3], and it can be computationally infeasible when the dimension is large. To solve these problems, a large number of methods have been proposed based on the regularization framework. Some well-known methods include the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani [18], the nonnegative garrote proposed by Breiman [2], and the smoothly clipped absolute deviation (SCAD) proposed by Fan and Li [6]. These regularized methods can help to avoid overfitting. More importantly, these techniques can perform parameter estimation and variable selection simultaneously.

With multiple response variables available, the standard approach to model them is to regress each response variable separately on the same set of explanatory variables. All marginal univariate regression procedures, including the above methods, can be applied to each response. However, this approach may not be optimal since it does not utilize the information among response variables. To solve this multi-response regression problem, Breiman and Friedman [4] proposed a method, called Curds and Whey, that uses the relationship among response variables to improve predictive accuracy. They showed that their method can outperform separate univariate regression approaches when there are correlations among the response variables. However, their method did not address the topic of variable selection. Recently, Yuan et al. [21] proposed a method based on dimension reduction. Their idea is to obtain dimension reduction by encouraging sparsity among singular values of the parameter matrix. However, their approach focuses on dimension reduction rather than variable selection. Thus, it does not give a subset of explanatory variables for each response. Variable selection can be a very important issue when the number of explanatory variables is large or when explanatory variables are highly correlated. To address variable selection, Turlach et al. [19] proposed a penalized method using the max-L1 penalty to select a common subset of explanatory variables for multiple response regression. Their method aims to select a subset which can be used as predictors for all response variables. However, this assumption may be too strong when each response has a different set of explanatory variables.

Recently, Rothman et al. [16] proposed a penalized log-likelihood approach with the multivariate Gaussian assumption. In this paper, we further extend their method and propose three approaches to tackle the multiple response regression problem by utilizing the joint information among multiple response variables. To handle the problem, we need to estimate two parameter matrices, the regression parameter matrix $B$ and the conditional inverse covariance matrix of the response variables, $C = \Sigma^{-1}$. The first two approaches are plug-in methods, i.e., they plug in an estimator of one parameter matrix to solve for the other one. The third approach jointly estimates both parameter matrices. In particular, the first proposed method maximizes a sparse penalized log-likelihood using a previously estimated inverse covariance matrix $\hat{C}$. Similarly, the second proposed method maximizes a sparse penalized log-likelihood using a previously estimated regression parameter matrix $\hat{B}$. The last proposed method simultaneously estimates the regression parameters and the inverse covariance matrix by maximizing a doubly penalized joint likelihood function. These methods involve two penalty terms: the weighted L1 penalty on the inverse covariance matrix $C$ and the weighted L1 penalty on the regression parameter matrix $B$. Note that the joint approach is more general than that of Rothman et al. [16], which used unweighted L1 penalty terms, because our framework allows flexible weights on the penalty terms. To handle the computational difficulty of high dimensional problems, we recommend a prescreening procedure to eliminate noise variables before further estimation.

In the following sections, we describe the proposed methods in more detail with theoretical justification and numerical examples. In Section 2, we introduce our proposed methodology. Section 3 explores the corresponding theoretical properties. Section 4 develops coordinate descent computational algorithms to obtain solutions for the proposed methods. A prescreening step is suggested for the joint method to speed up the computation. Section 5 provides some brief results of our numerical examples. We conclude the paper with some discussion in Section 6. The proofs of the theorems are provided in the Appendix.

2. Methodology

Consider the regression problem of p covariates and m response variables. Suppose the data contain n observations. Let yi = (yi1, …, yim)T; i = 1, …, n, be m-dimensional responses and Y = [y1, …, yn]T be the n × m response matrix. Let xi = (xi1, …, xip)T; i = 1, …, n, be p-dimensional predictors and X = [x1, …, xn]T be the n × p design matrix. For simplicity of notations, let yk = (y1k, …, ynk)T be the k-th response vector (k = 1, …, m) and xj = (x1j, …, xnj)T be the j-th predictor (j = 1, …, p). Consider the following model,

$$Y = XB + e, \quad \text{with } e = [\varepsilon_1, \ldots, \varepsilon_n]^T,$$

where B = {βjk}; j = 1, …, p, k = 1, …, m, is an unknown p × m parameter matrix. The errors εi = (εi1, …, εim)T; i = 1, …, n, are i.i.d. m-dimensional random vectors following a multivariate normal distribution N(0, Σ) with the nonsingular covariance matrix Σ.

Our goal is to estimate $B$ so that we can use $X$ to predict $Y$. A simple way to estimate $B$ is to build $m$ single response models separately; the least squares solution is denoted by $\hat{B}^S = (X^TX)^{-1}X^TY$, provided that $X^TX$ is nonsingular. However, this approach ignores information on $\Sigma$. When $\Sigma$ is diagonal, this separate modeling approach can work well. When $\Sigma$ is not diagonal, however, there can be strong correlations among the response variables, and the separate modeling approach does not make use of this joint information. To produce a better estimator, we consider incorporating $\Sigma$ in the estimation procedure of $B$. Denote $\Sigma^{-1}$ by $C$. If we assume that $\Sigma$ is known, the log-likelihood for $B$ conditional on $X$ is

$$-\frac{1}{2}\,\mathrm{tr}\{(Y-XB)C(Y-XB)^T\}, \qquad (1)$$

up to a constant not depending on $B$. Interestingly, although the likelihood function involves $\Sigma$, the corresponding maximum likelihood estimate of $B$ turns out to be identical to the least squares estimate obtained by fitting each response separately. This implies that the maximizer of (1) does not take any advantage of the known information on $\Sigma$. However, when we impose penalties on the likelihood, the joint method can bring some advantage in estimation.
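The equivalence noted above can be checked numerically. The following is a minimal sketch, assuming simulated data and an arbitrarily chosen known $\Sigma$ (the dimensions, seed, and variable names are illustrative and not from the paper). Setting the gradient of (1) to zero gives $X^TXBC = X^TYC$, which is solved here in vectorized form and compared with the separate least squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 50, 3, 2
X = rng.normal(size=(n, p))
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])   # assumed known error covariance
C = np.linalg.inv(Sigma)
B_true = rng.normal(size=(p, m))
Y = X @ B_true + rng.multivariate_normal(np.zeros(m), Sigma, size=n)

# Separate least squares: one column of B per response.
B_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Joint maximizer of (1): solve X'X B C = X'Y C using
# vec(A B C) = (C' kron A) vec(B) with column-major vec.
lhs = np.kron(C.T, X.T @ X)
rhs = (X.T @ Y @ C).flatten(order="F")
B_joint = np.linalg.solve(lhs, rhs).reshape((p, m), order="F")

max_abs_diff = np.max(np.abs(B_ols - B_joint))
print(max_abs_diff)   # numerically negligible
```

Because $C$ is nonsingular it cancels from the normal equations, which is why the two solutions agree regardless of the correlation chosen.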

In this paper, we propose to build multivariate regression models through joint shrinkage. The goal is to utilize the joint information among the m response variables to improve estimation and prediction. Since Σ is involved in the joint estimation and it is often unknown, we consider three different approaches: two plug-in methods and the doubly penalized approach. The plug-in approach in Section 2.1 uses some estimator Ĉ for C to plug in the penalized likelihood function and then estimate B jointly. The plug-in approach in Section 2.2 estimates C after plugging in a reasonable estimator of B. The doubly penalized approach in Section 2.3 estimates C and B simultaneously via regularizing the estimation of both C and B.

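For discussion, we first assume that $\Sigma$ is known. To regress $Y$ on $X$, we can model them separately, such as applying the LASSO for $m$ different responses. Alternatively, we can use joint shrinkage estimation for the $m$ response variables simultaneously. To demonstrate the difference between separate shrinkage and joint shrinkage, we consider a simple toy example for illustration. Suppose that $m = 2$, $p = 1$, and $X^TX = 1$. Let $\hat{B}^S = (\hat\beta_{11}^S, \hat\beta_{12}^S)$ be the least squares solution and assume that both $\hat\beta_{11}^S$ and $\hat\beta_{12}^S$ are positive and $\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$. With the penalty parameter $\lambda$, the separate LASSO solution is given by

$$\hat\beta_{1m}^{LASSO} = \arg\min_{\beta_{1m}}\{(y_m - X\beta_{1m})^T(y_m - X\beta_{1m}) + \lambda|\beta_{1m}|\} = \left[\hat\beta_{1m}^S - \frac{\lambda}{2}\right]_+; \quad m = 1, 2, \qquad (2)$$

where $[u]_+ = u$ if $u \ge 0$ and $[u]_+ = 0$ if $u < 0$. In the joint shrinkage estimation, however, the solution is given by

$$\arg\min_B\left[\mathrm{tr}\{(Y-XB)C(Y-XB)^T\} + \lambda|\beta_{11}| + \lambda|\beta_{12}|\right]. \qquad (3)$$

We can show that (3) is equivalent to

$$\arg\min_B\left[(B-\hat{B}^S)C(B-\hat{B}^S)^T + \lambda|\beta_{11}| + \lambda|\beta_{12}|\right] \qquad (4)$$

and the solution of (4) is given by

$$\hat\beta_{1m} = \left[\hat\beta_{1m}^S - \frac{\lambda}{2}(1+\rho)\right]_+; \quad m = 1, 2. \qquad (5)$$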

Compared with the separate LASSO solution (2), the solution (5) obtains more shrinkage if $\rho$ is positive, while negative $\rho$ results in less shrinkage. Figure 1 provides some insight on the reason why the amount of shrinkage changes with $\rho$ for the joint method. Solid curves in Figure 1 are contour curves of $(B-\hat{B}^S)C(B-\hat{B}^S)^T$ as a quadratic function of $B$, and dashed lines correspond to the penalty function. When $\rho$ is positive, the quadratic function increases along the 45° line to the horizontal axis more slowly than in the case when $\rho$ is zero. Note that the solution of the joint method with $\rho = 0$ is identical to the separate LASSO solution. Thus, the solution of (4) can be closer to the origin, with more shrinkage, than the solution with $\rho = 0$. On the other hand, the quadratic function with negative $\rho$ increases faster along the 45° line to the horizontal axis. Thus, the solution of (4) tends to be closer to the least squares solution than the solution with $\rho = 0$. Therefore, the joint method can help us to produce more accurate estimators by utilizing the joint information through $C$.

Figure 1. Contour plots for the toy example to illustrate the change of shrinkage with ρ for the joint method.
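The toy example can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it minimizes the objective in (4) by cyclic coordinate descent with soft-thresholding (the helper names `soft` and `joint_toy` and the numerical values are hypothetical) and compares the result with the closed form (5) in the regime where both coefficients stay positive.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator: sign(z) * (|z| - t)_+."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def joint_toy(b_ls, rho, lam, n_iter=500):
    """Minimize (b - b_ls) C (b - b_ls)' + lam * (|b1| + |b2|),
    with C the inverse of [[1, rho], [rho, 1]], by coordinate descent."""
    C = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
    b = np.array(b_ls, dtype=float)
    for _ in range(n_iter):
        for k in (0, 1):
            l = 1 - k
            # Unpenalized minimizer in coordinate k, holding b[l] fixed.
            center = b_ls[k] - C[k, l] * (b[l] - b_ls[l]) / C[k, k]
            b[k] = soft(center, lam / (2.0 * C[k, k]))
    return b

b_ls, lam = [2.0, 1.5], 1.0      # illustrative least squares values
for rho in (-0.5, 0.0, 0.5):
    numeric = joint_toy(b_ls, rho, lam)
    closed_form = np.array(b_ls) - 0.5 * lam * (1.0 + rho)
    print(rho, numeric, closed_form)
```

With these values the shrinkage amount is $\frac{\lambda}{2}(1+\rho)$, so positive $\rho$ pulls both coefficients further toward zero than the separate LASSO does, and negative $\rho$ pulls them less.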

We propose three approaches, including two plug-in methods and one joint method. In Sections 2.1 and 2.2, we introduce two different plug-in penalized likelihood methods, one is for multiple response regression and the other one is for inverse covariance estimation. In the plug-in method for multiple response regression, we estimate C prior to the step of regression and then use the estimator of C to produce a better estimator of B. In the plug-in method for inverse covariance estimation, we estimate B first and then estimate C with the estimator available. In Section 2.3, we estimate B and C together via double penalization. Section 2.4 provides some guidance on three proposed methods and model selection.

2.1. Plug-in Joint Weighted LASSO Estimator

To ensure that estimation of B includes the information on Σ, we propose a joint penalized likelihood method, namely the plug-in joint weighted LASSO (PWL) estimator. In particular, the corresponding penalized likelihood function is as follows

$$\mathrm{tr}\{(Y-XB)C(Y-XB)^T\} + \lambda_1\sum_{j,k} w_{jk}|\beta_{jk}|. \qquad (6)$$

Here $\lambda_1$ is a tuning parameter and $w_{jk} \ge 0$; $j = 1, \ldots, p$, $k = 1, \ldots, m$, are prespecified weights for the L1-penalty of $\beta_{jk}$. If $C$ is an $m \times m$ diagonal matrix with diagonal entries $(\sigma_1^2, \ldots, \sigma_m^2)$, then $y_1, \ldots, y_m$ are mutually independent. In that case, the minimizer of (6) is equivalent to the weighted LASSO solution obtained by applying the weighted LASSO separately to each response vector $y_k$ with the penalty parameter $\lambda_1/\sigma_k^2$ $(k = 1, \ldots, m)$. However, if $C$ is not diagonal, the minimizer of (6) can be different from the separate penalized likelihood method which handles each response vector $y_k$ separately. Our numerical examples indicate that the joint method can be more accurate when the response variables are highly correlated.

In practice, $C$ is often not available. Thus, we need to estimate it. To estimate $C$, we assume that $z_i = (y_i^T, x_i^T)^T$ is an $(m+p)$-dimensional random vector following a multivariate normal distribution $N(\mu, \Sigma_{y,x})$, where $\Sigma_{y,x} = \begin{pmatrix} \Sigma_{y,y} & \Sigma_{y,x} \\ \Sigma_{x,y} & \Sigma_{x,x} \end{pmatrix}$. Because $\Sigma$ is the covariance matrix of $y_i$ conditioned on $x_i$, it can be expressed by $\Sigma = \Sigma_{y,y} - \Sigma_{y,x}\Sigma_{x,x}^{-1}\Sigma_{x,y}$. Therefore, we can estimate $\Sigma$ by first estimating $\Sigma_{y,x}$. To estimate $\Sigma_{y,x}$, we adapt the Graphical LASSO (GLASSO) method proposed by Friedman et al. [7]. The GLASSO method considers the problem of estimating the inverse covariance matrix in the context of sparse Gaussian graphical models [12]. This technique was also considered by Yuan and Lin [23], Banerjee et al. [1] and Rothman et al. [15].

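The GLASSO estimator, $\hat\Sigma_{y,x}^{-1}$, is given as the minimizer of the following penalized likelihood function

$$-\log\det(\Sigma_{y,x}^{-1}) + \frac{1}{n}\sum_{i=1}^{n}(z_i-\bar{z})^T\Sigma_{y,x}^{-1}(z_i-\bar{z}) + \lambda_0\|\Sigma_{y,x}^{-1}\|. \qquad (7)$$

Here $\bar{z}$ is the sample mean, $\|\Sigma_{y,x}^{-1}\|$ is the sum of the absolute values of the off-diagonal elements of $\Sigma_{y,x}^{-1}$, and $\lambda_0$ is a tuning parameter.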

The PWL method is a two-step procedure. With the estimate Σ̂ available, the PWL method solves the following problem

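$$\arg\min_B\left[\mathrm{tr}\{(Y-XB)\hat{C}(Y-XB)^T\} + \lambda_1\sum_{j,k} w_{jk}|\beta_{jk}|\right], \qquad (8)$$

where $\hat\Sigma_{y,x} = \begin{pmatrix} \hat\Sigma_{y,y} & \hat\Sigma_{y,x} \\ \hat\Sigma_{x,y} & \hat\Sigma_{x,x} \end{pmatrix}$, $\hat\Sigma = \hat\Sigma_{y,y} - \hat\Sigma_{y,x}\hat\Sigma_{x,x}^{-1}\hat\Sigma_{x,y}$ and $\hat{C} = \hat\Sigma^{-1}$.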
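The step from an estimated joint covariance to $\hat{C}$ can be sketched as follows. This is an illustrative sketch only: the function name `conditional_precision` is hypothetical, and a random positive definite matrix stands in for a GLASSO estimate of the joint covariance. The code also checks a standard block-inverse identity that the text does not state explicitly, namely that $\hat{C}$ coincides with the $y,y$ block of the inverse of the joint covariance.

```python
import numpy as np

def conditional_precision(Sigma_joint, m):
    """Return C = (S_yy - S_yx S_xx^{-1} S_xy)^{-1}, where the first m
    coordinates of the joint vector are the responses."""
    S_yy = Sigma_joint[:m, :m]
    S_yx = Sigma_joint[:m, m:]
    S_xx = Sigma_joint[m:, m:]
    Sigma_cond = S_yy - S_yx @ np.linalg.solve(S_xx, S_yx.T)
    return np.linalg.inv(Sigma_cond)

rng = np.random.default_rng(1)
m, p = 2, 3
G = rng.normal(size=(m + p, m + p))
# Stand-in for an estimated joint covariance (any positive definite matrix).
Sigma_joint = G @ G.T + (m + p) * np.eye(m + p)

C_hat = conditional_precision(Sigma_joint, m)
C_block = np.linalg.inv(Sigma_joint)[:m, :m]   # y,y block of the joint inverse
print(np.max(np.abs(C_hat - C_block)))
```

In practice this means that once the joint inverse covariance has been estimated, its leading $m \times m$ block can be read off directly instead of inverting the Schur complement.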

2.2. Plug-in Weighted Graphical LASSO Estimator

In Section 2.1, we propose a plug-in method, PWL, which estimates $C$ first and then estimates $B$ given $\hat{C}$. In this section, we propose another plug-in method to estimate $C$. In particular, we first estimate $B$ by using univariate regression techniques. With the estimator $\hat{B}$ available, we propose a penalized likelihood method, the plug-in weighted graphical LASSO (PWGL) estimator, by solving

$$\arg\min_C\left[-n\log\det(C) + \mathrm{tr}\{(Y-X\hat{B})C(Y-X\hat{B})^T\} + \lambda_2\sum_{s\neq t} v_{st}|c_{st}|\right], \qquad (9)$$

where C = {cst}; s = 1, …, m, t = 1, …, m. Here λ2 is a tuning parameter and vst ≥ 0; s = 1, …, m, t = 1, …, m, are prespecified weights for the L1 penalty of cst.

2.3. Doubly Penalized Maximum Likelihood Estimator

In Sections 2.1 and 2.2, we propose two plug-in methods. PWL estimates $C$ first and then estimates $B$ given $\hat{C}$, while PWGL estimates $B$ first and then estimates $C$ given $\hat{B}$. In this section, we propose to estimate $(B, C)$ simultaneously. Since $y_i \mid x_i \sim N(B^Tx_i, \Sigma)$, the log-likelihood of $(B, C)$ conditional on $X$ is

$$\frac{n}{2}\log\det(C) - \frac{1}{2}\,\mathrm{tr}\{(Y-XB)C(Y-XB)^T\}. \qquad (10)$$

It can be shown that the maximum likelihood estimator of $B$ is also given by $(X^TX)^{-1}X^TY$. Interestingly, the resulting estimator of $B$ is the same as the ordinary least squares estimator, which can be obtained without using the information on the relationship among the response vectors $y_1, \ldots, y_m$. To incorporate the information among different response variables in the estimation of $B$, we propose a joint penalized method, the doubly penalized maximum likelihood (DML) estimator, by solving

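$$\arg\min_{B,C}\left[-n\log\det(C) + \mathrm{tr}\{(Y-XB)C(Y-XB)^T\} + \lambda_1\sum_{j,k} w_{jk}|\beta_{jk}| + \lambda_2\sum_{s\neq t} v_{st}|c_{st}|\right]. \qquad (11)$$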

Note that the objective function in (11) is not convex with respect to $(B, C)$. The corresponding optimization can sometimes be unstable when $p \gg n$. This is because the first term in (11) can dominate the other terms if some diagonal elements of $(Y-XB)^T(Y-XB)$ are zero, which may occur when $p \gg n$. This can be shown by taking a diagonal matrix $C$ and increasing the values of its diagonal elements corresponding to the zero diagonal entries of $(Y-XB)^T(Y-XB)$. As a result, the numerical solution of $C$ in (11) can have some large diagonal entries. In practice, a solution of $C$ with very large diagonal entries is not desirable, as it leads to very small residual variances of the corresponding response variables. We recommend first using the plug-in method in Section 2.1 or separate modeling methods to screen the variables and reduce the dimension. Then one can apply the joint method on the reduced set of variables. As shown in our simulation examples, the joint method can often outperform the plug-in methods when $p$ is moderate compared to $n$.
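The instability described above can be illustrated with a small sketch. It assumes simulated data with $p > n$, unit weights, and a minimum-norm interpolating $B$ (all of these choices, and the function name `dml_objective`, are illustrative assumptions rather than part of the paper). With zero residuals and a diagonal $C = cI$, the value of (11) keeps decreasing as $c$ grows.

```python
import numpy as np

def dml_objective(Y, X, B, C, lam1, lam2, W, V):
    """Objective (11); the penalty on C covers off-diagonal entries only."""
    n = Y.shape[0]
    R = Y - X @ B
    _, logdet = np.linalg.slogdet(C)
    off = ~np.eye(C.shape[0], dtype=bool)
    return (-n * logdet + np.sum((R @ C) * R)     # sum((RC)*R) = tr(R C R')
            + lam1 * np.sum(W * np.abs(B))
            + lam2 * np.sum(V[off] * np.abs(C[off])))

rng = np.random.default_rng(2)
n, p, m = 10, 20, 2                    # p > n, so X can interpolate Y
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, m))
B = np.linalg.pinv(X) @ Y              # residuals are numerically zero
W, V = np.ones((p, m)), np.ones((m, m))

values = [dml_objective(Y, X, B, c * np.eye(m), 1.0, 1.0, W, V)
          for c in (1.0, 1e2, 1e4)]
print(values)                          # decreasing in c
```

The penalty on $B$ is constant in $c$ and the off-diagonal penalty on $C$ vanishes, so only the $-n\log\det(C)$ term changes, which is the mechanism behind the large diagonal entries mentioned in the text.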

2.4. Model Selection

The two plug-in methods are preferable if one of $B$ and $C$ is of main interest and the other is already well estimated. Another advantage of the two plug-in methods is that they have lower computational cost than the joint method. On the other hand, the joint method does not require a good estimate of $B$ or $C$. Even though the joint method is computationally more intensive, it often performs better than the two plug-in methods in the sense that it optimizes the log-likelihood of $(B, C)$ jointly.

The tuning parameters $\lambda_1$ and $\lambda_2$ in (8), (9) and (11) control the sparsity of the resulting estimators of $(B, C)$. They can be selected either using validation sets or through K-fold cross-validation. The K-fold cross-validation method randomly splits the dataset into $K$ segments of equal sizes. For the $k$-th fold, we denote the estimated regression parameter matrix and the estimated inverse covariance matrix, obtained using all data excluding those in the $k$-th segment and the tuning parameters $\lambda_1$ and $\lambda_2$, by $(\hat{B}_{\lambda_1}^{(-k)}, \hat{C}_{\lambda_2}^{(-k)})$. We also denote the data in the $k$-th segment as $(Y^{(k)}, X^{(k)})$. Specifically, for the PWL method, we select the optimal tuning parameter $\hat\lambda_1$ which minimizes the prediction error as follows,

$$CV(\lambda_1) = \sum_{k=1}^{K}\|Y^{(k)} - X^{(k)}\hat{B}_{\lambda_1}^{(-k)}\|_F^2, \qquad (12)$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix. For the PWGL method, we select the optimal tuning parameter $\hat\lambda_2$ which minimizes the predictive negative log-likelihood as follows,

$$CV(\lambda_2) = \sum_{k=1}^{K}\left[-n_k\log\det(\hat{C}_{\lambda_2}^{(-k)}) + \mathrm{tr}\{(Y^{(k)} - X^{(k)}\hat{B})\hat{C}_{\lambda_2}^{(-k)}(Y^{(k)} - X^{(k)}\hat{B})^T\}\right], \qquad (13)$$

where nk is the sample size of the k-th segment. For the DML method, we first select the optimal λ̂1 by using (12) with a prespecified λ2 and select λ̂2 by using (13) with the selected optimal λ̂1. It helps to avoid a two dimensional grid search of (λ1, λ2). We have found in simulations that the selected optimal λ̂1s are almost identical for a wide range of prespecified λ2.

In the use of validation sets, we split the dataset into two parts, the training set and the validation set. With a pair of $(\lambda_1, \lambda_2)$, we first estimate $(B, C)$ using the training set. The prediction error and the predictive negative log-likelihood of the resulting estimator are obtained using the validation set as $(Y^{(k)}, X^{(k)})$ in (12) and (13). The validation set is not used to construct the final estimator with the selected $(\hat\lambda_1, \hat\lambda_2)$, while the K-fold cross-validation uses all data for the final estimator with $(\hat\lambda_1, \hat\lambda_2)$.
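The cross-validation criterion (12) can be sketched as a generic routine. This is a minimal illustration under assumptions: the function names are hypothetical, and `ridge_fit` is a stand-in estimator used only to make the example runnable; it is not the PWL solver. Any function mapping training data and a tuning parameter to an estimate of $B$ could be passed in its place.

```python
import numpy as np

def cv_prediction_error(Y, X, lam, fit, K=5, seed=0):
    """Criterion (12): summed squared Frobenius error over K held-out folds."""
    n = Y.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    total = 0.0
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        B_hat = fit(Y[train], X[train], lam)       # estimate without fold k
        total += np.sum((Y[held_out] - X[held_out] @ B_hat) ** 2)
    return total

def ridge_fit(Y, X, lam):
    # Hypothetical stand-in estimator, NOT the PWL method of Section 2.1.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
Y = X @ rng.normal(size=(4, 2)) + rng.normal(size=(60, 2))

grid = [0.01, 0.1, 1.0, 10.0]                      # illustrative grid
errors = [cv_prediction_error(Y, X, lam, ridge_fit) for lam in grid]
best_lam = grid[int(np.argmin(errors))]
print(best_lam)
```

Fixing the fold assignment through `seed` keeps the comparison across grid values on the same splits, which is one reasonable design choice among several.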

3. Asymptotic Properties

To investigate a sparse regression technique, it is necessary to investigate its asymptotic behaviors. Fan and Li [6] pointed out that a good variable selection procedure should have oracle properties. Asymptotically with probability tending to 1, a procedure with oracle properties can identify the true underlying subset of predictor variables. The resulting estimator of the procedure also asymptotically performs as well as if the true underlying subset were known in advance. In this section, we study the asymptotic behavior of our three proposed methods. In particular, we show that with a proper choice of (λ1, λ2), all three methods enjoy the oracle properties.

For the asymptotic analysis, we use the set-up of Fan and Li [6], Yuan and Lin [23] and Zou [25]. The technical derivation uses the results in Knight and Fu [10]. Let $B^* = (\beta_{jk}^*)$; $j = 1, \ldots, p$, $k = 1, \ldots, m$, be the true regression parameter matrix and $C^* = (c_{st}^*)$; $s = 1, \ldots, m$, $t = 1, \ldots, m$, be the true inverse covariance matrix. Let $\mathcal{A} = \{(j,k): \beta_{jk}^* \neq 0\}$ and $\mathcal{C} = \{(s,t): c_{st}^* \neq 0\}$. Then we assume the following conditions for our theoretical results:

  • (A1) $\frac{1}{n}X^TX \to A$, where $A$ is a positive definite matrix.

  • (A2) The cardinality of $\mathcal{A}$ is $|\mathcal{A}| = q_1 > 0$.

  • (A3) There exists $\tilde\beta_{jk}$ which is a $\sqrt{n}$-consistent estimator of $\beta_{jk}^*$; $j = 1, \ldots, p$, $k = 1, \ldots, m$.

  • (A4) The cardinality of $\mathcal{C}$ is $|\mathcal{C}| = q_2 > 0$.

  • (A5) There exists $\tilde{c}_{st}$ which is a $\sqrt{n}$-consistent estimator of $c_{st}^*$; $s = 1, \ldots, m$, $t = 1, \ldots, m$.

Note that conditions (A3) and (A5) are generally satisfied by maximum likelihood estimators or L2 regularized maximum likelihood estimators with proper choices of penalty parameters. For example, the least squares estimator of $B$ can be used as the $\tilde\beta_{jk}$'s and the inverse of the residual sample covariance matrix can be used as the $\tilde{c}_{st}$'s. For the theoretical analysis, we define $w_{jk}$ and $v_{st}$ as $w_{jk} = 1/|\tilde\beta_{jk}|^{\gamma}$; $j = 1, \ldots, p$, $k = 1, \ldots, m$, $\gamma > 0$, and $v_{st} = 1/|\tilde{c}_{st}|$; $s = 1, \ldots, m$, $t = 1, \ldots, m$, respectively.
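The weights defined above can be computed as in the following sketch, which uses the least squares estimator and the inverse residual sample covariance as the initial estimators, as suggested in the text. The function name `adaptive_weights`, the divisor $n$ in the residual covariance, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def adaptive_weights(Y, X, gamma=1.0):
    """Weights w_jk = 1/|beta_tilde_jk|^gamma and v_st = 1/|c_tilde_st|."""
    n = Y.shape[0]
    B_tilde = np.linalg.solve(X.T @ X, X.T @ Y)      # least squares estimate
    R = Y - X @ B_tilde
    C_tilde = np.linalg.inv(R.T @ R / n)             # divisor n is an assumption
    W = 1.0 / np.abs(B_tilde) ** gamma
    V = 1.0 / np.abs(C_tilde)
    return W, V

rng = np.random.default_rng(4)
n = 500
X = rng.normal(size=(n, 3))
B_true = np.array([[3.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
Y = X @ B_true + rng.normal(size=(n, 2))

W, V = adaptive_weights(Y, X, gamma=1.0)
print(np.round(W, 2))   # entries for truly zero coefficients tend to be large
```

Coefficients whose initial estimates are close to zero receive large weights and are therefore penalized more heavily, which is the mechanism behind the selection consistency results in this section.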

In Sections 3.1 and 3.2, we show the plug-in estimators enjoy the oracle properties. Section 3.3 develops the asymptotic theory that reveals the oracle properties of the DML solution.

3.1. Oracle properties of the PWL solution

In this section, we first show that with the known C*, the minimizer of (6) is consistent in variable selection and has the asymptotic normality. Then we show that with a consistent estimator of C*, the PWL estimator also enjoys the same properties.

Define the true regression parameter vector as $\beta^* = (\beta_{11}^*, \ldots, \beta_{p1}^*, \ldots, \beta_{1m}^*, \ldots, \beta_{pm}^*)^T$. Let $\hat\beta_1^{(n)}$ be the estimator of $\beta^*$ obtained by minimizing (6) with the penalty parameter $\lambda_{1,n}$. Let $\beta_{\mathcal{A}}^*$ be the $q_1$-dimensional true parameter vector which consists of the nonzero components in $\beta^*$. Let $\hat\beta_{1,\mathcal{A}}^{(n)}$ be the corresponding estimator of $\beta_{\mathcal{A}}^*$. Let $D = (C^* \otimes A)_{\mathcal{A}}$ be the $q_1 \times q_1$ matrix obtained by removing the $(j + (k-1)p)$-th row and column of $C^* \otimes A$ for $(j, k) \notin \mathcal{A}$. Then the following lemma shows the oracle properties of the penalized likelihood estimator $\hat\beta_1^{(n)}$ with the known $C^*$, as the minimizer of (6) defined previously.

Lemma 1

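(Oracle properties of the minimizer of (6), $\hat\beta_1^{(n)}$, with the known $C^*$) Suppose that $\lambda_{1,n}n^{-1/2} \to 0$ and $\lambda_{1,n}n^{(\gamma-1)/2} \to \infty$ as $n \to \infty$. Under the conditions (A1)–(A3), we have the following results:

  1. (Selection consistency) $\lim_n P(\hat\beta_{1,jk}^{(n)} = 0) = 1$ if $\beta_{jk}^* = 0$;

  2. (Asymptotic normality) $\sqrt{n}(\hat\beta_{1,\mathcal{A}}^{(n)} - \beta_{\mathcal{A}}^*) \to_d N(0, D^{-1})$.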

Lemma 1 tells us that the penalized maximum likelihood estimator with the known $C^*$ satisfies the oracle properties. Since $C^*$ is typically unknown in practice, one often uses an estimator for $C^*$. With slight modification of Lemma 1, we can show that the PWL solution also enjoys the oracle properties. Denote the PWL estimator of $\beta^*$ with the penalty parameter $\lambda_{1,n}$ as $\hat\beta_2^{(n)}$. Let $\hat\beta_{2,\mathcal{A}}^{(n)}$ be the corresponding estimator of $\beta_{\mathcal{A}}^*$.

Theorem 1

(Oracle properties of the PWL solution) In addition to the assumptions in Lemma 1, suppose that Ĉ is a consistent estimator of C*. Under the conditions (A1)–(A3), we have the following results:

  1. (Selection consistency) $\lim_n P(\hat\beta_{2,jk}^{(n)} = 0) = 1$ if $\beta_{jk}^* = 0$;

  2. (Asymptotic normality) $\sqrt{n}(\hat\beta_{2,\mathcal{A}}^{(n)} - \beta_{\mathcal{A}}^*) \to_d N(0, D^{-1})$.

Theorem 1 states that with a consistent estimator of C*, variable selection in the PWL is consistent and the resulting estimator still enjoys the asymptotic normality.

3.2. Oracle properties of the PWGL solution

In this section, we show the oracle properties of the PWGL solution. To this end, we first show the oracle properties of the solution of

$$\arg\min_C\left[-n\log\det(C) + \mathrm{tr}\{(Y-XB^*)C(Y-XB^*)^T\} + \lambda_2\sum_{j\neq k} v_{jk}|c_{jk}|\right], \qquad (14)$$

with the known B*. Then we show that with a consistent estimator of B*, the PWGL estimator still enjoys the same properties.

Denote by $\hat{C}^{(1)}$ the minimizer of (14) with the known $B^*$. Let $\hat{C}_0^{(1)}$ be the matrix obtained from $\hat{C}^{(1)}$ by replacing $\hat{c}_{jk}^{(1)}$ with 0 if $c_{jk}^* = 0$. Then the following lemma shows the oracle properties of $\hat{C}^{(1)}$.

Lemma 2

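(Oracle properties of the minimizer of (14), $\hat{C}^{(1)}$, with known $B^*$) Suppose that $\lambda_{2,n}n^{-1/2} \to 0$ and $\lambda_{2,n} \to \infty$ as $n \to \infty$. Under the conditions (A1), (A4) and (A5), we have the following results:

  1. (Selection consistency) $\lim_n P(\hat{c}_{jk}^{(1)} = 0) = 1$ if $c_{jk}^* = 0$;

  2. (Asymptotic distribution) $\sqrt{n}(\hat{C}_0^{(1)} - C^*) \to_d \arg\min V(U)$,

    where $V(U) = \mathrm{tr}(U\Sigma U\Sigma) + \mathrm{tr}(UW)$ and $W$ is an $m \times m$ random symmetric matrix such that $\mathrm{vec}(W) \sim N(0, \Lambda)$ in which $\mathrm{cov}(w_{ij}, w_{kl}) = \mathrm{cov}(\varepsilon_{1i}\varepsilon_{1j}, \varepsilon_{1k}\varepsilon_{1l})$. The minimum is taken over all symmetric matrices $U$ satisfying $u_{jk} = 0$ if $c_{jk}^* = 0$.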

In Lemma 2, we show that the penalized maximum likelihood estimator with the known $B^*$ satisfies the oracle properties. Since $B^*$ is typically unknown in practice, one often applies a univariate regression technique to obtain an estimator for $B^*$. With slight modification of Lemma 2, we can show that the PWGL solution also enjoys the oracle properties. Denote the PWGL estimator of $C^*$ with the penalty parameter $\lambda_{2,n}$ as $\hat{C}^{(2)}$. Let $\hat{C}_0^{(2)}$ be the matrix obtained from $\hat{C}^{(2)}$ by replacing $\hat{c}_{jk}^{(2)}$ with 0 if $c_{jk}^* = 0$. Then the following theorem shows the oracle properties of the PWGL estimator.

Theorem 2

(Oracle properties of the PWGL solution) In addition to the assumptions in Lemma 2, suppose that $\hat{B}$ is a consistent estimator of $B^*$. Under the above conditions, we have the following results:

  1. (Selection consistency) $\lim_n P(\hat{c}_{jk}^{(2)} = 0) = 1$ if $c_{jk}^* = 0$;

  2. (Asymptotic distribution) $\sqrt{n}(\hat{C}_0^{(2)} - C^*) \to_d \arg\min V(U)$,

    where $V(U) = \mathrm{tr}(U\Sigma U\Sigma) + \mathrm{tr}(UW)$ and $W$ is an $m \times m$ random symmetric matrix such that $\mathrm{vec}(W) \sim N(0, \Lambda)$ in which $\mathrm{cov}(w_{ij}, w_{kl}) = \mathrm{cov}(\varepsilon_{1i}\varepsilon_{1j}, \varepsilon_{1k}\varepsilon_{1l})$. The minimum is taken over all symmetric matrices $U$ satisfying $u_{jk} = 0$ if $c_{jk}^* = 0$.

Theorem 2 states that with a consistent estimator of B*, the PWGL solution satisfies the oracle properties.

3.3. Oracle properties of the DML solution

In Sections 3.1 and 3.2, we establish the oracle properties of the plug-in estimators. In this section, we explore the oracle properties of the DML solution, in which $(\hat{B}, \hat{C})$ are obtained together. First, we show that with a proper choice of $(\lambda_1, \lambda_2)$, there exists a $\sqrt{n}$-consistent local minimizer of (11). Then we show that this local minimizer enjoys the oracle properties as a solution of the DML estimator.

The following lemma shows the existence of a local minimizer of (11) which is $\sqrt{n}$-consistent.

Lemma 3

Suppose that $\lambda_{1,n}n^{-1/2} \to 0$ and $\lambda_{2,n}n^{-1/2} \to 0$. Under the conditions (A1)–(A5), there exists a local minimizer of (11) such that

$$\left\|(\mathrm{vec}(\hat{B})^T, \mathrm{vec}(\hat{C})^T)^T - (\mathrm{vec}(B^*)^T, \mathrm{vec}(C^*)^T)^T\right\| = O_p(1/\sqrt{n}).$$

From Lemma 3, it is clear that there exists a $\sqrt{n}$-consistent doubly penalized maximum likelihood estimator. As the DML estimator of $(B^*, C^*)$, denote by $(\hat{B}^{(n)}, \hat{C})$ the $\sqrt{n}$-consistent local solution of (11) with the penalty parameter $(\lambda_{1,n}, \lambda_{2,n})$. Let $\hat\beta^{(n)} = \mathrm{vec}(\hat{B}^{(n)})$ and let $\hat\beta_{\mathcal{A}}^{(n)}$ be the corresponding estimator of $\beta_{\mathcal{A}}^*$. Let $\hat{C}_0$ be the matrix obtained from $\hat{C}$ by replacing $\hat{c}_{jk}$ with 0 if $c_{jk}^* = 0$. We now show in the following theorem that, with a proper choice of $(\lambda_1, \lambda_2)$, the DML estimator defined as this local minimizer enjoys the oracle properties.

Theorem 3

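(Oracle properties of the DML solution) Suppose that $\lambda_{1,n}n^{-1/2} \to 0$ and $\lambda_{1,n}n^{(\gamma-1)/2} \to \infty$. In addition, suppose that $\lambda_{2,n}n^{-1/2} \to 0$ and $\lambda_{2,n} \to \infty$. Under the conditions (A1)–(A5), we have the following results:

  1. $\lim_n P(\hat\beta_{jk}^{(n)} = 0) = 1$ if $\beta_{jk}^* = 0$;

  2. $\sqrt{n}(\hat\beta_{\mathcal{A}}^{(n)} - \beta_{\mathcal{A}}^*) \to_d N(0, D^{-1})$;

  3. $\lim_n P(\hat{c}_{jk} = 0) = 1$ if $c_{jk}^* = 0$;

  4. $\sqrt{n}(\hat{C}_0 - C^*) \to_d \arg\min V(U)$,

    where $V(U) = \mathrm{tr}(U\Sigma U\Sigma) + \mathrm{tr}(UW)$ and $W$ is an $m \times m$ random symmetric matrix such that $\mathrm{vec}(W) \sim N(0, \Lambda)$ in which $\mathrm{cov}(w_{ij}, w_{kl}) = \mathrm{cov}(\varepsilon_{1i}\varepsilon_{1j}, \varepsilon_{1k}\varepsilon_{1l})$. The minimum is taken over all symmetric matrices $U$ satisfying $u_{jk} = 0$ if $c_{jk}^* = 0$.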

4. Computational Algorithm

In this section, we describe computational algorithms to solve problems (8), (9), and (11). In particular, we apply the GLASSO algorithm for (9). To solve the problems (8) and (11), we apply the coordinate-descent algorithm as described in Peng et al. [14], which can be viewed as a modification of the shooting algorithm [8]. The basic idea of the coordinate-descent algorithm is to optimize each parameter at one time while holding the other parameters fixed at the current solution. The corresponding optimization at each step can be very simple to solve.

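We now describe the coordinate-descent algorithm for the PWL method in detail. Denote $\hat{C}$ by $(\hat{c}_{ij})_{m \times m}$. Then (8) is equivalent to minimizing

$$\sum_{i=1}^{n}\sum_{k,l=1}^{m}\hat{c}_{kl}\Big(y_{ik} - \sum_{j=1}^{p}\beta_{jk}x_{ij}\Big)\Big(y_{il} - \sum_{j=1}^{p}\beta_{jl}x_{ij}\Big) + \lambda_1\sum_{j,k} w_{jk}|\beta_{jk}|. \qquad (15)$$

Consider (15) as a function of $\beta_{jk}$ with the other coefficients fixed. Then the minimizer of (15) is equivalent to

$$\arg\min_{\beta_{jk}}\Big[\sum_{i=1}^{n}\Big\{\hat{c}_{kk}\Big(y_{ik} - \sum_{j'\neq j}\beta_{j'k}x_{ij'} - \beta_{jk}x_{ij}\Big)^2 + 2\sum_{k'\neq k}\hat{c}_{kk'}\Big(y_{ik'} - \sum_{j'}\beta_{j'k'}x_{ij'}\Big)\Big(y_{ik} - \sum_{j'\neq j}\beta_{j'k}x_{ij'} - \beta_{jk}x_{ij}\Big)\Big\} + \lambda_1 w_{jk}|\beta_{jk}|\Big].$$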

This problem is essentially a one-dimensional LASSO optimization which has a closed form solution. Therefore, the algorithm can be summarized as follows:

Algorithm 1: the Coordinate-Descent Algorithm for the PWL Method

  • Step 1

    (Initial value). Set the separate LASSO solution βjk(old); j = 1, …, p, k = 1, …, m, as the initial value for B.

  • Step 2

    (Updating rule). For j = 1, …, p and k = 1, …, m,

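    $\beta_{qr}^{(new)} = \beta_{qr}^{(old)}$ if $(q, r) \neq (j, k)$,
    $$\beta_{jk}^{(new)} = \mathrm{sign}\Big(\frac{\sum_{l=1}^{m}\hat{c}_{lk}(e_l^{(old)})^Tx_j}{\hat{c}_{kk}x_j^Tx_j} + \beta_{jk}^{(old)}\Big)\Big(\Big|\frac{\sum_{l=1}^{m}\hat{c}_{lk}(e_l^{(old)})^Tx_j}{\hat{c}_{kk}x_j^Tx_j} + \beta_{jk}^{(old)}\Big| - \frac{\lambda_1 w_{jk}}{2\hat{c}_{kk}x_j^Tx_j}\Big)_+,$$

    where $e_l^{(old)} = y_l - X\beta_l^{(old)}$ and $\beta_l^{(old)} = (\beta_{1l}^{(old)}, \ldots, \beta_{pl}^{(old)})^T$.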

  • Step 3

    (Iteration). Repeat Step 2 until convergence. Our stopping rule is that the change of the objective function in (8) is less than δ = 0.1.
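Algorithm 1 can be sketched in a few lines. The following is an illustrative implementation under assumptions, not the authors' code: it starts from zero rather than from the separate LASSO solution, uses a tighter stopping tolerance than $\delta = 0.1$, omits the active shooting refinement, and treats $\hat{C}$ as given. The function names and the simulated data are hypothetical.

```python
import numpy as np

def pwl_objective(Y, X, B, C_hat, lam1, W):
    """Objective in (8): tr{(Y-XB) C (Y-XB)'} + lam1 * sum(w_jk |b_jk|)."""
    R = Y - X @ B
    return np.sum((R @ C_hat) * R) + lam1 * np.sum(W * np.abs(B))

def pwl_coordinate_descent(Y, X, C_hat, lam1, W, tol=1e-10, max_sweeps=1000):
    p, m = X.shape[1], Y.shape[1]
    B = np.zeros((p, m))               # assumption: zero start, not LASSO start
    E = Y - X @ B                      # column l holds the residual e_l
    xtx = np.sum(X ** 2, axis=0)       # x_j' x_j for each predictor
    obj = pwl_objective(Y, X, B, C_hat, lam1, W)
    for _ in range(max_sweeps):
        for j in range(p):
            for k in range(m):
                a = C_hat[k, k] * xtx[j]
                # sum_l c_lk e_l' x_j / (c_kk x_j' x_j) + current beta_jk
                s = (X[:, j] @ E @ C_hat[:, k]) / a + B[j, k]
                new = np.sign(s) * max(abs(s) - lam1 * W[j, k] / (2.0 * a), 0.0)
                E[:, k] -= X[:, j] * (new - B[j, k])   # keep residuals in sync
                B[j, k] = new
        new_obj = pwl_objective(Y, X, B, C_hat, lam1, W)
        if obj - new_obj < tol:        # stop when the objective stalls
            break
        obj = new_obj
    return B

rng = np.random.default_rng(5)
n, p, m = 100, 4, 2
X = rng.normal(size=(n, p))
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
C_hat = np.linalg.inv(Sigma)           # treated as a given plug-in estimate
B_true = np.array([[2.0, 0.0], [0.0, -1.5], [0.0, 0.0], [0.0, 0.0]])
Y = X @ B_true + rng.multivariate_normal(np.zeros(m), Sigma, size=n)
W = np.ones((p, m))

B_pwl = pwl_coordinate_descent(Y, X, C_hat, lam1=20.0, W=W)
print(np.round(B_pwl, 3))
```

Updating the residual matrix in place after each coordinate move avoids recomputing $Y - XB$ from scratch, so each update costs on the order of $nm$ operations.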

To be computationally more efficient, we combine the above algorithm with the active shooting algorithm proposed by Peng et al. [14]. The basic idea of the active shooting algorithm is to update the coefficients within the active set until convergence instead of iterating all coefficients at each step. The active set is defined as the set of currently nonzero coefficients and it is typically small. Once the coefficients in the active set converge, then we continue to update other coefficients. This step can speed up the algorithm significantly if the final solution is very sparse.

Next we describe the problem (9) in the GLASSO framework. Since (9) is equivalent to minimizing

$$-\log\det(C) + \mathrm{tr}\Big\{\frac{1}{n}(Y-X\hat{B})^T(Y-X\hat{B})C\Big\} + \frac{\lambda_2}{n}\sum_{j\neq k} v_{jk}|c_{jk}|, \qquad (16)$$

we can apply the GLASSO algorithm [7] to solve (9) by replacing the sample covariance matrix with $\frac{1}{n}(Y-X\hat{B})^T(Y-X\hat{B})$. Therefore, the algorithm for (9) proceeds as follows:

Algorithm 2: the GLASSO Algorithm for the PWGL Method

  • Step 1

    (Estimator of B) Set the separate LASSO solution as the estimator, $\hat{B}$, of $B$.

  • Step 2

    (Estimator of C) Given $\hat{B}$, apply the GLASSO algorithm to solve (16).

Next, we combine Algorithm 1 and the GLASSO algorithm to solve problem (11) for the doubly penalized method DML in Section 2.3. The algorithm can be summarized as follows:

Algorithm 3: the Coordinate-Descent Algorithm for the DML Method

  • Step 1

    (Initial values of B and C). Set the separate LASSO solution βjk(old); j = 1, …, p, k = 1, …, m, as the initial value for B and the solution of (9), C(old), as the initial value of C.

  • Step 2
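    (B updating rule). For a given $C^{(old)}$, update $B^{(old)} \to B^{(new)}$ with
    $$B^{(new)} = \arg\min_B \left[\mathrm{tr}\{(Y - XB)\, C^{(old)} (Y - XB)^T\} + \lambda_1 \sum_{j,k} w_{jk}|\beta_{jk}|\right].$$

    This step can be solved using Algorithm 1.

  • Step 3
    (C updating rule). For a given $B^{(new)}$, update $C^{(old)} \to C^{(new)}$ by
    $$C^{(new)} = \arg\min_C \left[\mathrm{tr}\left\{\frac{1}{n}(Y - XB^{(new)})^T (Y - XB^{(new)})\, C\right\} - \log\det(C) + \frac{\lambda_2}{n}\sum_{s \neq t} v_{st}|c_{st}|\right].$$

    This can be solved using the GLASSO algorithm.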

  • Step 4

    (Iteration). Repeat Steps 2 and 3 until convergence. Our stopping rule is that the change of the objective function in (11) is less than δ = 0.1.
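The alternating structure of Algorithm 3 can be sketched as a small driver loop. This is an illustrative skeleton under stated assumptions rather than the authors' implementation: `update_B` and `update_C` are hypothetical callables standing in for Algorithm 1 and a weighted GLASSO solver, `dml_objective` evaluates the jointly penalized criterion in the form given in (E.1), and `C` is assumed to be positive definite.

```python
import numpy as np

def dml_objective(Y, X, B, C, lam1, W, lam2, V):
    """-n log det(C) + tr{C (Y-XB)^T (Y-XB)} + lam1 sum w|beta| + lam2 sum_{s!=t} v|c|."""
    n = Y.shape[0]
    E = Y - X @ B
    _, logdet = np.linalg.slogdet(C)          # C is assumed positive definite
    off_diag = ~np.eye(C.shape[0], dtype=bool)
    return (-n * logdet + np.sum((E @ C) * E)
            + lam1 * np.sum(W * np.abs(B))
            + lam2 * np.sum(V[off_diag] * np.abs(C[off_diag])))

def dml_alternate(B0, C0, update_B, update_C, objective, delta=0.1, max_iter=100):
    """Alternate the B step and the C step until the objective change is below delta."""
    B, C = B0, C0
    obj = objective(B, C)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        B = update_B(C)            # Step 2: B update for the current C
        C = update_C(B)            # Step 3: C update for the new B
        new_obj = objective(B, C)
        if abs(obj - new_obj) < delta:
            break
        obj = new_obj
    return B, C, n_iter
```

In the unpenalized case the two steps have closed forms (least squares for B, the inverse residual covariance for C), which makes it easy to exercise the loop before plugging in penalized solvers.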

Based on our experiments, the coordinate-descent algorithm is very efficient. However, since the DML method involves estimation of both B and C, the computation can be intensive when the dimension is high. We therefore consider a prescreening step to speed up the computation. In particular, we adapt the group lasso method considered by Yuan and Lin [22] and Meier et al. [11]. The basic idea of the group lasso method is to employ a group penalty in the regression problem so that model selection can be achieved in terms of group selection. In our multiple response variable regression problem, $(\beta_{j1}, \ldots, \beta_{jm})$; j = 1, …, p, can be considered as p groups. Therefore, for the prescreening step, the group lasso estimator, $\hat B^{group}$, of B is given as the minimizer of the following penalized function

$$\sum_{i=1}^{n}\sum_{k=1}^{m}\left(y_{ik} - \sum_{j=1}^{p}\beta_{jk}x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\sqrt{\beta_{j1}^2 + \cdots + \beta_{jm}^2},$$

where λ is a tuning parameter. We screen out a variable if the corresponding coefficients are estimated as zeros for all response variables. In other words, we remove the variable $x_j$ from our model if $\hat\beta_{j1}^{group} = \cdots = \hat\beta_{jm}^{group} = 0$. This prescreening step can not only speed up the computation, but also improve the prediction performance as shown in our examples.
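The screening rule above amounts to dropping every predictor whose row of the group lasso estimate is entirely zero. The sketch below illustrates this under stated assumptions: the group lasso fit itself is taken as given (the array `B_group` is a hypothetical solver output), and the names `group_lasso_objective`, `prescreen` and the tolerance `tol` are introduced only for illustration.

```python
import numpy as np

def group_lasso_objective(Y, X, B, lam):
    """Residual sum of squares plus lam times the sum of row-wise Euclidean norms of B."""
    E = Y - X @ B
    return np.sum(E ** 2) + lam * np.sum(np.sqrt(np.sum(B ** 2, axis=1)))

def prescreen(X, B_group, tol=0.0):
    """Keep predictor j only if its group (row j of B_group) is not estimated as all zeros."""
    keep = np.flatnonzero(np.linalg.norm(B_group, axis=1) > tol)
    return X[:, keep], keep
```

The reduced design matrix returned by `prescreen` would then be passed to the DML fit, and `keep` records which original predictors the remaining columns correspond to.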

5. Numerical Examples

In this section, our proposed methods are compared with several existing methods. The first is the curds and whey (CW) method proposed by Breiman and Friedman [4]. The other two are the separate ridge regression (RR) and the separate LASSO; that is, we apply the RR and the LASSO to each response variable separately. Separate LASSO solutions are constructed by a modification of the LARS algorithm proposed by Efron et al. [5]. The main idea of this modified LARS algorithm was also considered by Osborne et al. [13]. Only brief summaries of the results are provided in this paper. All details about the numerical examples can be found in the online supplementary materials.

In simulated examples, all methods are compared in two ways, B estimation and C estimation. Performance of B estimation is compared in terms of prediction and variable selection. For the comparison of C estimation, we use the entropy criterion [9, 24], which measures the difference between two matrices. In terms of prediction, the proposed DML method works best overall. The PWL method also works reasonably well in all cases, although it is not as accurate as the DML estimator. In the example where the true inverse covariance matrix is not sparse, LASSO gives the worst prediction performance while the other methods show similar performance. This suggests that joint approaches can outperform separate approaches by exploiting the joint information among response variables. DML outperforms LASSO and PWL in terms of identification of zero coefficients. We also notice that the ratios of correctly identified zeros for PWL and DML increase as the sample size increases. This supports the selection consistency shown in Section 3. When the dimension of predictor variables is low, the DML estimator gives the best performance in C estimation. When the dimension of predictor variables is close to the sample size, PWGL outperforms DML. Since the DML method simultaneously estimates both B and C, with a small sample size, the C estimation may not be as good.
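As an illustration of an entropy-type criterion for comparing an estimated matrix with the truth, the sketch below computes one common form of the entropy (Stein) loss. The exact variant used in the simulations is not stated in this section, so the formula $\mathrm{tr}(C^{-1}\hat C) - \log\det(C^{-1}\hat C) - m$ and the function name `entropy_loss` should be read as assumptions, not as the criterion of [9, 24] verbatim.

```python
import numpy as np

def entropy_loss(C_true, C_hat):
    """tr(C_true^{-1} C_hat) - log det(C_true^{-1} C_hat) - m for positive-definite inputs."""
    m = C_true.shape[0]
    M = np.linalg.solve(C_true, C_hat)   # C_true^{-1} C_hat without forming the inverse
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - m
```

This loss is zero when the two matrices coincide and positive otherwise, so smaller values indicate a better estimate of C.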

We apply our methodology to a Glioblastoma multiforme (GBM) cancer data set studied by the Cancer Genome Atlas (TCGA) Research Network [17]. As noted in [20], GBM is the most common primary form of brain tumor in adults. In terms of prediction, our method, DML, performs best, although the difference between DML and the separate LASSO is not statistically significant in view of the standard errors. In terms of the number of genes included in the models, PWL and DML construct sparser models than the separate LASSO. One possible explanation is that there may be some strong positive correlations among the microRNAs, which are the response variables. As we have discussed in the toy example of Section 2, with strong positive correlations among response variables, joint methods tend to obtain more shrinkage than the separate LASSO. To explore this further, correlations among microRNAs are examined. Some strong positive correlations among the microRNAs are detected, while negative correlations are not strong. Interestingly, with far fewer gene expressions than the separate LASSO, PWL and DML perform competitively in terms of prediction accuracy.

6. Discussion

In this paper, we have proposed three methods for utilizing joint information among response variables in a penalized likelihood framework with weighted L1 regularization. Our theoretical investigation shows that our proposed estimators enjoy oracle properties. Simulated examples and an application to the GBM cancer data set demonstrate that our proposed methods perform competitively.

Our current study assumes a Gaussian distribution for the response vector. One future research direction is to extend the proposed methods to other distributional assumptions. Although we mainly focus on the weighted L1 penalty, our methods can be directly extended to other penalty functions as well. It will be interesting to compare the performance of various choices of penalty in this context.

Supplementary Material

01

Acknowledgments

The authors were supported in part by NSF grant DMS-0747575 and NIH grant 5R01CA149569-03.

Appendix A. Proof of Lemma 1

Appendix A.1. Asymptotic normality

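Let $\tilde Y = ((y_1)^T, \ldots, (y_m)^T)^T$ be the nm-dimensional response vector and $\tilde\varepsilon$ be the corresponding nm-dimensional error vector which consists of $\varepsilon_{ik}$; i = 1, …, n, k = 1, …, m. Let $\tilde\beta = (\beta_{11}, \ldots, \beta_{p1}, \ldots, \beta_{1m}, \ldots, \beta_{pm})^T$ be the pm-dimensional vector and $\tilde X = I_m \otimes X$. Then the minimizer of (6) is equivalent to

$$\arg\min_{\tilde\beta}\left[(\tilde Y - \tilde X\tilde\beta)^T (C \otimes I_n)(\tilde Y - \tilde X\tilde\beta) + \lambda_1\sum_{j,k} w_{jk}|\beta_{jk}|\right].$$

Let $\tilde\beta = \tilde\beta^* + \frac{u}{\sqrt n}$ and

$$V_n(u) = \left(\tilde Y - \tilde X\left(\tilde\beta^* + \frac{u}{\sqrt n}\right)\right)^T (C \otimes I_n)\left(\tilde Y - \tilde X\left(\tilde\beta^* + \frac{u}{\sqrt n}\right)\right) + \lambda_{1,n}\sum_{j,k} w_{jk}\left|\beta^*_{jk} + \frac{u_{jk}}{\sqrt n}\right|.$$

Let $\hat u^{(n)} = \arg\min_u V_n(u)$; then $\hat u^{(n)} = \sqrt n(\hat\beta^{1(n)} - \tilde\beta^*)$, where $\hat\beta^{1(n)}$ is written in vectorized form. Note that $\hat u^{(n)} = \arg\min_u V_n(u) = \arg\min_u\{V_n(u) - V_n(0)\}$ and

$$V_n(u) - V_n(0) = \frac{1}{n}u^T\tilde X^T(C \otimes I_n)\tilde X u - \frac{2}{\sqrt n}\tilde\varepsilon^T(C \otimes I_n)\tilde X u + \lambda_{1,n}\sum_{j,k} w_{jk}\left(\left|\beta^*_{jk} + \frac{u_{jk}}{\sqrt n}\right| - |\beta^*_{jk}|\right).$$ (A.1)

We know that $\frac{1}{n}u^T\tilde X^T(C \otimes I_n)\tilde X u = u^T\left(C \otimes \frac{1}{n}X^TX\right)u \to u^T(C \otimes A)u$. For the second term of the right hand side of (A.1), note that $\tilde\varepsilon \sim N(0, \Sigma \otimes I_n)$. Thus, $\frac{1}{\sqrt n}\tilde\varepsilon^T(C \otimes I_n)\tilde X \to_d Z$ where $Z \sim N(0, C \otimes A)$, as $\frac{1}{n}\tilde X^T(C \otimes I_n)(\Sigma \otimes I_n)(C \otimes I_n)\tilde X = \frac{1}{n}\tilde X^T(C \otimes I_n)\tilde X \to C \otimes A$. Now we consider the last term of the right hand side of (A.1):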

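  • If $\beta^*_{jk} = 0$, then $\lambda_{1,n}w_{jk}\left(\left|\beta^*_{jk} + \frac{u_{jk}}{\sqrt n}\right| - |\beta^*_{jk}|\right) = \frac{\lambda_{1,n}}{\sqrt n}w_{jk}|u_{jk}| = \lambda_{1,n}n^{\frac{\gamma-1}{2}}\frac{|u_{jk}|}{|\sqrt n\hat\beta_{jk}|^{\gamma}} \to \infty$ for $u_{jk} \neq 0$, as $\sqrt n\hat\beta_{jk} = O_p(1)$, where $\hat\beta_{jk}$ denotes the $\sqrt n$-consistent initial estimator used in the weight $w_{jk}$.

  • If $\beta^*_{jk} \neq 0$, then $\lambda_{1,n}w_{jk}\left(\left|\beta^*_{jk} + \frac{u_{jk}}{\sqrt n}\right| - |\beta^*_{jk}|\right) = \frac{\lambda_{1,n}}{\sqrt n}w_{jk}\sqrt n\left(\left|\beta^*_{jk} + \frac{u_{jk}}{\sqrt n}\right| - |\beta^*_{jk}|\right)$. Note that $\frac{\lambda_{1,n}}{\sqrt n} \to 0$, $w_{jk} \to_p \frac{1}{|\beta^*_{jk}|^{\gamma}}$ and $\sqrt n\left(\left|\beta^*_{jk} + \frac{u_{jk}}{\sqrt n}\right| - |\beta^*_{jk}|\right) \to u_{jk}\,\mathrm{sign}(\beta^*_{jk})$. By Slutsky's theorem, $\lambda_{1,n}w_{jk}\left(\left|\beta^*_{jk} + \frac{u_{jk}}{\sqrt n}\right| - |\beta^*_{jk}|\right) \to_p 0$.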

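Combining the above statements and applying Slutsky's theorem again, we obtain the following:

$$V_n(u) - V_n(0) \to_d V(u) = \begin{cases} u_{\mathcal A}^T D u_{\mathcal A} - 2u_{\mathcal A}^T Z_{\mathcal A} & \text{if } u_{jk} = 0 \text{ for all } (j,k) \notin \mathcal A, \\ \infty & \text{otherwise}, \end{cases}$$

where $u_{\mathcal A}$ consists of $u_{jk}$ for (j, k) ∈ $\mathcal A$ and $Z_{\mathcal A} \sim N(0, D)$.

Let $\hat u = \arg\min_u V(u)$. Then we have

$$\hat u_{\mathcal A} = D^{-1}Z_{\mathcal A}, \qquad \hat u_{jk} = 0 \quad \forall (j,k) \notin \mathcal A.$$

Note that $V_n(u) - V_n(0)$ is convex and so $\arg\min_u(V_n(u) - V_n(0)) \to_d \arg\min_u V(u)$. Since $Z_{\mathcal A} \sim N(0, D)$, we have $\hat u_{\mathcal A}^{(n)} \to_d N(0, D^{-1})$. Finally, we have that $\hat u_{\mathcal A}^{(n)} = \sqrt n(\hat\beta^{1(n)}_{\mathcal A} - \beta^*_{\mathcal A}) \to_d D^{-1}Z_{\mathcal A}$ as n → ∞.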
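The two Kronecker-product identities used for the quadratic and linear terms of (A.1) can be checked numerically. The sketch below is only an illustration on small random matrices; the function name `kronecker_gaps` is an assumption, and Σ is taken to be the inverse of C as in the Gaussian model.

```python
import numpy as np

def kronecker_gaps(X, C):
    """Largest absolute deviations in the two identities used for (A.1)."""
    n, p = X.shape
    m = C.shape[0]
    X_tilde = np.kron(np.eye(m), X)                    # I_m ⊗ X, shape (nm, pm)
    C_big = np.kron(C, np.eye(n))                      # C ⊗ I_n
    Sigma_big = np.kron(np.linalg.inv(C), np.eye(n))   # Σ ⊗ I_n with Σ = C^{-1}
    # X~^T (C ⊗ I_n) X~ = C ⊗ X^T X
    gap_quadratic = np.max(np.abs(X_tilde.T @ C_big @ X_tilde - np.kron(C, X.T @ X)))
    # (C ⊗ I_n)(Σ ⊗ I_n)(C ⊗ I_n) = C ⊗ I_n
    gap_variance = np.max(np.abs(C_big @ Sigma_big @ C_big - C_big))
    return gap_quadratic, gap_variance
```

Both returned gaps should be at the level of floating-point rounding error for any full-rank inputs.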

Appendix A.2. Selection consistency

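We need to show that ∀(j, k) ∉ $\mathcal A$, $P(\hat\beta^{1(n)}_{jk} \neq 0) \to 0$. For fixed (j, k) ∉ $\mathcal A$, let $(j,k) \in \mathcal A_n^1$. Then $\hat\beta^{1(n)}_{jk} \neq 0$ and so we have that $2\tilde x_{jk}^T(C \otimes I_n)(\tilde Y - \tilde X\hat\beta^{1(n)}) = \lambda_{1,n}w_{jk}\,\mathrm{sign}(\hat\beta^{1(n)}_{jk})$ by the KKT conditions, where $\tilde x_{jk}$ is the (j + (k − 1)p)-th column of $\tilde X$. Therefore, $P(\hat\beta^{1(n)}_{jk} \neq 0) \leq P\left(2\tilde x_{jk}^T(C \otimes I_n)(\tilde Y - \tilde X\hat\beta^{1(n)}) = \lambda_{1,n}w_{jk}\,\mathrm{sign}(\hat\beta^{1(n)}_{jk})\right)$. Note that

$$\frac{2\tilde x_{jk}^T(C \otimes I_n)(\tilde Y - \tilde X\hat\beta^{1(n)})}{\sqrt n} = \frac{2\tilde x_{jk}^T(C \otimes I_n)\tilde X\sqrt n(\tilde\beta^* - \hat\beta^{1(n)})}{n} + \frac{2\tilde x_{jk}^T(C \otimes I_n)\tilde\varepsilon}{\sqrt n}.$$

From the asymptotic normality part, we know that $\frac{2\tilde x_{jk}^T(C \otimes I_n)\tilde X\sqrt n(\tilde\beta^* - \hat\beta^{1(n)})}{n}$ converges in distribution to some normal random variable. We also have that $\frac{2\tilde x_{jk}^T(C \otimes I_n)\tilde\varepsilon}{\sqrt n} \to_d N(0, 4(C \otimes A)_{jk,jk})$, where $(C \otimes A)_{jk,jk}$ is the (j + (k − 1)p)-th diagonal element of $C \otimes A$. As $\frac{\lambda_{1,n}w_{jk}\,\mathrm{sign}(\hat\beta^{1(n)}_{jk})}{\sqrt n} = \lambda_{1,n}n^{\frac{\gamma-1}{2}}\frac{\mathrm{sign}(\hat\beta^{1(n)}_{jk})}{|\sqrt n\hat\beta_{jk}|^{\gamma}} \to \pm\infty$ with $\sqrt n\hat\beta_{jk} = O_p(1)$, where $\hat\beta_{jk}$ is the initial estimator used in the weight $w_{jk}$, we have $P\left(2\tilde x_{jk}^T(C \otimes I_n)(\tilde Y - \tilde X\hat\beta^{1(n)}) = \lambda_{1,n}w_{jk}\,\mathrm{sign}(\hat\beta^{1(n)}_{jk})\right) \to 0$. Therefore, $P(\hat\beta^{1(n)}_{jk} \neq 0) \to 0$ as n → ∞.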

Appendix B. Proof of Theorem 1

The proof is similar to that of Lemma 1 except we replace C by Ĉ.

Appendix B.1. Asymptotic normality

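Note that (8) is equivalent to

$$\arg\min_{\tilde\beta}\left[(\tilde Y - \tilde X\tilde\beta)^T(\hat C \otimes I_n)(\tilde Y - \tilde X\tilde\beta) + \lambda_1\sum_{j,k} w_{jk}|\beta_{jk}|\right].$$

Let $\tilde\beta = \tilde\beta^* + \frac{u}{\sqrt n}$ and

$$\tilde V_n(u) = \left(\tilde Y - \tilde X\left(\tilde\beta^* + \frac{u}{\sqrt n}\right)\right)^T(\hat C \otimes I_n)\left(\tilde Y - \tilde X\left(\tilde\beta^* + \frac{u}{\sqrt n}\right)\right) + \lambda_{1,n}\sum_{j,k} w_{jk}\left|\beta^*_{jk} + \frac{u_{jk}}{\sqrt n}\right|.$$

Let $\hat u^{(n)} = \arg\min_u \tilde V_n(u)$; then $\hat u^{(n)} = \sqrt n(\hat\beta^{2(n)} - \tilde\beta^*)$. We can show that

$$\tilde V_n(u) - \tilde V_n(0) = V_n(u) - V_n(0) + \frac{1}{n}u^T\tilde X^T((\hat C - C) \otimes I_n)\tilde X u - \frac{2}{\sqrt n}\tilde\varepsilon^T((\hat C - C) \otimes I_n)\tilde X u,$$

where $V_n(u)$ is defined in the proof of Lemma 1 (the tilde on $\tilde V_n$ distinguishes the version with $\hat C$). As Ĉ is a consistent estimator of C, $\frac{1}{n}u^T\tilde X^T((\hat C - C) \otimes I_n)\tilde X u \to_p 0$ and $\frac{2}{\sqrt n}\tilde\varepsilon^T((\hat C - C) \otimes I_n)\tilde X u \to_d 0$. From the proof of Lemma 1, we also know that $V_n(u) - V_n(0) \to_d V(u)$. By combining the above statements and using Slutsky's theorem, we have that $\tilde V_n(u) - \tilde V_n(0) \to_d V(u)$. By using the same arguments as in the proof of Lemma 1, finally we have that $\hat u^{(n)}_{\mathcal A} = \sqrt n(\hat\beta^{2(n)}_{\mathcal A} - \beta^*_{\mathcal A}) \to_d D^{-1}Z_{\mathcal A}$ as n → ∞.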

Appendix B.2. Selection consistency

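Now it suffices to show that ∀(j, k) ∉ $\mathcal A$, $P(\hat\beta^{2(n)}_{jk} \neq 0) \to 0$ as n → ∞. For fixed (j, k) ∉ $\mathcal A$, let $(j,k) \in \mathcal A_n^2$. Then $\hat\beta^{2(n)}_{jk} \neq 0$ and so we have that $2\tilde x_{jk}^T(\hat C \otimes I_n)(\tilde Y - \tilde X\hat\beta^{2(n)}) = \lambda_{1,n}w_{jk}\,\mathrm{sign}(\hat\beta^{2(n)}_{jk})$ by the KKT conditions. Therefore, $P(\hat\beta^{2(n)}_{jk} \neq 0) \leq P\left(2\tilde x_{jk}^T(\hat C \otimes I_n)(\tilde Y - \tilde X\hat\beta^{2(n)}) = \lambda_{1,n}w_{jk}\,\mathrm{sign}(\hat\beta^{2(n)}_{jk})\right)$. Note that

$$\frac{2\tilde x_{jk}^T(\hat C \otimes I_n)(\tilde Y - \tilde X\hat\beta^{2(n)})}{\sqrt n} = \frac{2\tilde x_{jk}^T(\hat C \otimes I_n)\tilde X\sqrt n(\tilde\beta^* - \hat\beta^{2(n)})}{n} + \frac{2\tilde x_{jk}^T(\hat C \otimes I_n)\tilde\varepsilon}{\sqrt n}.$$

From the asymptotic normality part and the fact that Ĉ is consistent, we know that $\frac{2\tilde x_{jk}^T(\hat C \otimes I_n)\tilde X\sqrt n(\tilde\beta^* - \hat\beta^{2(n)})}{n}$ converges in distribution to some normal random variable. We also have that $\frac{2\tilde x_{jk}^T(\hat C \otimes I_n)\tilde\varepsilon}{\sqrt n} \to_d N(0, 4(C \otimes A)_{jk,jk})$. As $\frac{\lambda_{1,n}w_{jk}\,\mathrm{sign}(\hat\beta^{2(n)}_{jk})}{\sqrt n} = \lambda_{1,n}n^{\frac{\gamma-1}{2}}\frac{\mathrm{sign}(\hat\beta^{2(n)}_{jk})}{|\sqrt n\hat\beta_{jk}|^{\gamma}} \to \pm\infty$, we have $P\left(2\tilde x_{jk}^T(\hat C \otimes I_n)(\tilde Y - \tilde X\hat\beta^{2(n)}) = \lambda_{1,n}w_{jk}\,\mathrm{sign}(\hat\beta^{2(n)}_{jk})\right) \to 0$. Therefore, $P(\hat\beta^{2(n)}_{jk} \neq 0) \to 0$ as n → ∞.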

Appendix C. Proof of Lemma 2

Let $R = \frac{1}{n}(Y - XB^*)^T(Y - XB^*)$. With given B*, define Q(C) as

$$Q(C) = -n\log\det(C) + n\,\mathrm{tr}(CR) + \lambda_{2,n}\sum_{j \neq k} v_{jk}|c_{jk}|.$$ (C.1)

Appendix C.1. Selection consistency

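Using the definition of Q(C) in (C.1), define $V_n(U)$ as

$$V_n(U) = Q\left(C^* + \frac{U}{\sqrt n}\right) - Q(C^*) = -n\log\det\left(\left(C^* + \frac{U}{\sqrt n}\right)C^{*-1}\right) + n\,\mathrm{tr}\left(\frac{UR}{\sqrt n}\right) + \lambda_{2,n}\sum_{j \neq k} v_{jk}\left(\left|c^*_{jk} + \frac{u_{jk}}{\sqrt n}\right| - |c^*_{jk}|\right).$$

Using a similar argument as in the proof of Theorem 1 in Yuan and Lin [23], it can be shown that

$$V_n(U) = \mathrm{tr}(U\Sigma U\Sigma) + \mathrm{tr}[U\sqrt n(R - \Sigma)] + \lambda_{2,n}\sum_{j \neq k} v_{jk}\left(\left|c^*_{jk} + \frac{u_{jk}}{\sqrt n}\right| - |c^*_{jk}|\right) + o(1).$$

Note that as $v_{st} = \frac{1}{|\tilde c_{st}|}$, $\lambda_{2,n}n^{-\frac{1}{2}} \to 0$, and $\tilde c_{jk} \to_p c^*_{jk}$, we have

$$\lambda_{2,n}\sum_{j \neq k} v_{jk}\left(\left|c^*_{jk} + \frac{u_{jk}}{\sqrt n}\right| - |c^*_{jk}|\right) = \lambda_{2,n}\sum_{c^*_{jk} = 0}\frac{|u_{jk}|}{\sqrt n|\tilde c_{jk}|} + \frac{\lambda_{2,n}}{\sqrt n}\sum_{c^*_{jk} \neq 0}\left(\frac{u_{jk}}{|c^*_{jk}|}\mathrm{sign}(c^*_{jk}) + o(1)\right) = \lambda_{2,n}\sum_{c^*_{jk} = 0}\frac{|u_{jk}|}{\sqrt n|\tilde c_{jk}|} + o_p(1).$$

On the other hand, $\sqrt n(R - \Sigma) \to_d N(0, \Lambda)$ by the central limit theorem as $R = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\varepsilon_i^T$. Therefore, $V_n(U)$ can be written as

$$V_n(U) = \mathrm{tr}(U\Sigma U\Sigma) + \mathrm{tr}(UW_n) + \lambda_{2,n}\sum_{c^*_{jk} = 0}\frac{|u_{jk}|}{\sqrt n|\tilde c_{jk}|} + o_p(1),$$

where $W_n \to_d N(0, \Lambda)$. Denote by Û the minimizer of $V_n(U)$. Note that $\lambda_{2,n} \to \infty$ and $\sqrt n\tilde c_{jk} = O_p(1)$. Therefore, if $c^*_{jk} = 0$, $P(\hat u_{jk} = 0) \to 1$ as n → ∞. This completes the proof of the variable selection consistency.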

Appendix C.2. Asymptotic distribution

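Suppose U satisfies that $u_{jk} = 0$ if $c^*_{jk} = 0$. Then, $V_n(U)$ can be written as

$$V_n(U) = \mathrm{tr}(U\Sigma U\Sigma) + \mathrm{tr}[U\sqrt n(R - \Sigma)] + o_p(1).$$

By using Slutsky's theorem, we have that

$$V_n(U) \to_d V(U) = \mathrm{tr}(U\Sigma U\Sigma) + \mathrm{tr}(UW) \quad \text{where } \mathrm{vec}(W) \sim N(0, \Lambda).$$

Since $V_n(U)$ and $V(U)$ are both convex and $V(U)$ has a unique minimum, $\arg\min V_n(U) \to_d \arg\min V(U)$. From the fact that $\arg\min V_n(U) = \arg\min Q\left(C^* + \frac{U}{\sqrt n}\right) = \sqrt n(\hat C^{01} - C^*)$, we have $\arg\min V_n(U) = \sqrt n(\hat C^{01} - C^*) \to_d \arg\min V(U)$. This completes the proof of the asymptotic distribution.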

Appendix D. Proof of Theorem 2

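With a $\sqrt n$-consistent estimator $\hat B$ of B, let $\hat R = \frac{1}{n}(Y - X\hat B)^T(Y - X\hat B)$. Define Q(C) as

$$Q(C) = -n\log\det(C) + n\,\mathrm{tr}(C\hat R) + \lambda_{2,n}\sum_{j \neq k} v_{jk}|c_{jk}|.$$ (D.1)

By using the above definition, define $V_n(U)$ as

$$V_n(U) = Q\left(C^* + \frac{U}{\sqrt n}\right) - Q(C^*) = -n\log\det\left(\left(C^* + \frac{U}{\sqrt n}\right)C^{*-1}\right) + n\,\mathrm{tr}\left(\frac{U\hat R}{\sqrt n}\right) + \lambda_{2,n}\sum_{j \neq k} v_{jk}\left(\left|c^*_{jk} + \frac{u_{jk}}{\sqrt n}\right| - |c^*_{jk}|\right).$$

Note that

$$n\,\mathrm{tr}\left(\frac{U\hat R}{\sqrt n}\right) = n\,\mathrm{tr}\left(\frac{U(\hat R - R)}{\sqrt n}\right) + n\,\mathrm{tr}\left(\frac{UR}{\sqrt n}\right).$$

Therefore, by the proof of Lemma 2 and Slutsky's theorem, it suffices to show that

$$n\,\mathrm{tr}\left(\frac{U(\hat R - R)}{\sqrt n}\right) = o_p(1).$$ (D.2)

The left-hand side of (D.2) can be written as

$$\begin{aligned} n\,\mathrm{tr}\left(\frac{U(\hat R - R)}{\sqrt n}\right) &= \mathrm{tr}\left(\frac{U}{\sqrt n}(Y - X\hat B)^T(Y - X\hat B)\right) - \mathrm{tr}\left(\frac{U}{\sqrt n}(Y - XB^*)^T(Y - XB^*)\right) \\ &= \mathrm{tr}\left(U\sqrt n(\hat B - B^*)^T\frac{X^TX}{n}(\hat B - B^*)\right) - 2\,\mathrm{tr}\left(U\frac{(Y - XB^*)^TX}{\sqrt n}(\hat B - B^*)\right), \end{aligned}$$

where we add and subtract $XB^*$ in the first term. Since $\sqrt n(\hat B - B^*) = O_p(1)$, $\frac{(Y - XB^*)^TX}{\sqrt n} = O_p(1)$, $\hat B - B^* = o_p(1)$ and $\frac{1}{n}X^TX \to A$, (D.2) holds.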
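The algebraic decomposition behind (D.2) can be verified numerically for a symmetric U. The following NumPy sketch is only an illustration on arbitrary inputs; the function name `d2_decomposition` and its arguments are assumptions, with `B_star` and `B_hat` playing the roles of the true coefficient matrix and its estimator.

```python
import numpy as np

def d2_decomposition(Y, X, B_star, B_hat, U):
    """Return both sides of the decomposition of n tr(U (R_hat - R) / sqrt(n))."""
    n = Y.shape[0]
    E = Y - X @ B_star            # residuals at the true B
    E_hat = Y - X @ B_hat         # residuals at the estimate
    Delta = B_hat - B_star
    lhs = np.trace(U @ (E_hat.T @ E_hat - E.T @ E)) / np.sqrt(n)
    quad = np.trace(U @ (np.sqrt(n) * Delta).T @ (X.T @ X / n) @ Delta)
    cross = 2.0 * np.trace(U @ (E.T @ X / np.sqrt(n)) @ Delta)
    return lhs, quad - cross
```

The two returned values agree up to rounding error whenever `U` is symmetric, which is the case for perturbations of a symmetric matrix C.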

Appendix E. Proof of Lemma 3

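Define Q(B, C) for the jointly penalized likelihood as

$$Q(B, C) = -n\log\det(C) + \mathrm{tr}\{C(Y - XB)^T(Y - XB)\} + \lambda_{1,n}\sum_{j,k} w_{jk}|\beta_{jk}| + \lambda_{2,n}\sum_{s \neq t} v_{st}|c_{st}|.$$ (E.1)

To show the results, we use an idea similar to that of the proof of Theorem 1 in Fan and Li [6]. It suffices to show that for any given δ > 0, there exists a large constant D such that

$$P\left\{\inf_{\|U\| = D} Q\left(B^* + \frac{U_1}{\sqrt n}, C^* + \frac{U_2}{\sqrt n}\right) > Q(B^*, C^*)\right\} \geq 1 - \delta,$$ (E.2)

where $U = (\mathrm{vec}(U_1)^T, \mathrm{vec}(U_2)^T)^T$. Using the definition of Q(B, C) in (E.1), define $V_n(U)$ as

$$V_n(U) = Q\left(B^* + \frac{U_1}{\sqrt n}, C^* + \frac{U_2}{\sqrt n}\right) - Q(B^*, C^*).$$

Since $\left|\beta^*_{jk} + \frac{u_{1jk}}{\sqrt n}\right| - |\beta^*_{jk}| = \frac{|u_{1jk}|}{\sqrt n} \geq 0$ for $\beta^*_{jk} = 0$ and $\left|c^*_{st} + \frac{u_{2st}}{\sqrt n}\right| - |c^*_{st}| = \frac{|u_{2st}|}{\sqrt n} \geq 0$ for $c^*_{st} = 0$,

$$\begin{aligned} V_n(U) \geq{} & -n\log\det\left(\left(C^* + \frac{U_2}{\sqrt n}\right)C^{*-1}\right) + \mathrm{tr}\left\{\left(C^* + \frac{U_2}{\sqrt n}\right)\left(Y - X\left(B^* + \frac{U_1}{\sqrt n}\right)\right)^T\left(Y - X\left(B^* + \frac{U_1}{\sqrt n}\right)\right)\right\} \\ & - \mathrm{tr}\{C^*(Y - XB^*)^T(Y - XB^*)\} + \lambda_{1,n}\sum_{\beta^*_{jk} \neq 0} w_{jk}\left(\left|\beta^*_{jk} + \frac{u_{1jk}}{\sqrt n}\right| - |\beta^*_{jk}|\right) + \lambda_{2,n}\sum_{c^*_{st} \neq 0} v_{st}\left(\left|c^*_{st} + \frac{u_{2st}}{\sqrt n}\right| - |c^*_{st}|\right) \\ ={} & -n\log\det\left(\left(C^* + \frac{U_2}{\sqrt n}\right)C^{*-1}\right) + \mathrm{tr}\left\{\frac{U_2}{\sqrt n}(Y - XB^*)^T(Y - XB^*)\right\} + \mathrm{tr}\left\{\left(C^* + \frac{U_2}{\sqrt n}\right)\left(\frac{XU_1}{\sqrt n}\right)^T\left(\frac{XU_1}{\sqrt n}\right)\right\} \\ & - 2\,\mathrm{tr}\left\{\left(C^* + \frac{U_2}{\sqrt n}\right)(Y - XB^*)^T\left(\frac{XU_1}{\sqrt n}\right)\right\} + \lambda_{1,n}\sum_{\beta^*_{jk} \neq 0} w_{jk}\left(\left|\beta^*_{jk} + \frac{u_{1jk}}{\sqrt n}\right| - |\beta^*_{jk}|\right) + \lambda_{2,n}\sum_{c^*_{st} \neq 0} v_{st}\left(\left|c^*_{st} + \frac{u_{2st}}{\sqrt n}\right| - |c^*_{st}|\right). \end{aligned}$$ (E.3)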

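For the first term and the second term on the right-hand side of (E.3), it has been shown in Lemma 2 that

$$-n\log\det\left(\left(C^* + \frac{U_2}{\sqrt n}\right)C^{*-1}\right) + \mathrm{tr}\left\{\frac{U_2}{\sqrt n}(Y - XB^*)^T(Y - XB^*)\right\} = \mathrm{tr}(U_2\Sigma U_2\Sigma) + \mathrm{tr}(U_2W_n) + o(1).$$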

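Let $\tilde U_1 = \mathrm{vec}(U_1)$. For the third term on the right-hand side of (E.3), as $\frac{1}{n}X^TX \to A$, note that

$$\mathrm{tr}\left\{\left(C^* + \frac{U_2}{\sqrt n}\right)\left(\frac{XU_1}{\sqrt n}\right)^T\left(\frac{XU_1}{\sqrt n}\right)\right\} = \tilde U_1^T\left\{\left(C^* + \frac{U_2}{\sqrt n}\right) \otimes \left(\frac{X^TX}{n}\right)\right\}\tilde U_1 = \tilde U_1^T(C^* \otimes A)\tilde U_1 + o(1).$$

For the fourth term on the right-hand side of (E.3), we have

$$\mathrm{tr}\left\{\left(C^* + \frac{U_2}{\sqrt n}\right)(Y - XB^*)^T\left(\frac{XU_1}{\sqrt n}\right)\right\} = \tilde U_1^T\left(\frac{\tilde X}{\sqrt n}\right)^T\left\{\left(C^* + \frac{U_2}{\sqrt n}\right) \otimes I_n\right\}\tilde\varepsilon.$$

Write $Z_n = \left(\frac{\tilde X}{\sqrt n}\right)^T\left\{\left(C^* + \frac{U_2}{\sqrt n}\right) \otimes I_n\right\}\tilde\varepsilon$. Note that $Z_n \to_d Z$, where Z has a multivariate normal distribution of dimension pm. By combining the above statements, we have

$$V_n(U) \geq \mathrm{tr}(U_2\Sigma U_2\Sigma) + \mathrm{tr}(U_2W_n) + \tilde U_1^T(C^* \otimes A)\tilde U_1 - 2\tilde U_1^TZ_n + o_p(1) + \lambda_{1,n}\sum_{\beta^*_{jk} \neq 0} w_{jk}\left(\left|\beta^*_{jk} + \frac{u_{1jk}}{\sqrt n}\right| - |\beta^*_{jk}|\right) + \lambda_{2,n}\sum_{c^*_{st} \neq 0} v_{st}\left(\left|c^*_{st} + \frac{u_{2st}}{\sqrt n}\right| - |c^*_{st}|\right).$$

As $\lambda_{1,n}n^{-\frac{1}{2}} \to 0$ and $\lambda_{2,n}n^{-\frac{1}{2}} \to 0$, we have

$$\begin{aligned} \lambda_{1,n}\sum_{\beta^*_{jk} \neq 0} w_{jk}\left(\left|\beta^*_{jk} + \frac{u_{1jk}}{\sqrt n}\right| - |\beta^*_{jk}|\right) &= \frac{\lambda_{1,n}}{\sqrt n}\sum_{\beta^*_{jk} \neq 0}\left(\frac{u_{1jk}}{|\beta^*_{jk}|^{\gamma}}\mathrm{sign}(\beta^*_{jk}) + o(1)\right) = o_p(1), \\ \lambda_{2,n}\sum_{c^*_{st} \neq 0} v_{st}\left(\left|c^*_{st} + \frac{u_{2st}}{\sqrt n}\right| - |c^*_{st}|\right) &= \frac{\lambda_{2,n}}{\sqrt n}\sum_{c^*_{st} \neq 0}\left(\frac{u_{2st}}{|c^*_{st}|}\mathrm{sign}(c^*_{st}) + o(1)\right) = o_p(1). \end{aligned}$$

Therefore,

$$V_n(U) \geq \mathrm{tr}(U_2\Sigma U_2\Sigma) + \mathrm{tr}(U_2W_n) + \tilde U_1^T(C^* \otimes A)\tilde U_1 - 2\tilde U_1^TZ_n + o_p(1).$$ (E.4)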

By choosing a sufficiently large D, $V_n(U) > 0$ uniformly on {U : ||U|| = D} with probability greater than 1 − δ, as C* and A are positive-definite, $W_n = O_p(1)$, and $Z_n = O_p(1)$: the quadratic terms in (E.4) then dominate the linear terms. Therefore, (E.2) holds. This completes the proof of this lemma.

Appendix F. Proof of Theorem 3

As defined in Lemma 3, define Q(B, C) for the jointly penalized likelihood as

$$Q(B, C) = -n\log\det(C) + \mathrm{tr}\{C(Y - XB)^T(Y - XB)\} + \lambda_{1,n}\sum_{j,k} w_{jk}|\beta_{jk}| + \lambda_{2,n}\sum_{s \neq t} v_{st}|c_{st}|.$$

Note that $(\hat B^{(n)}, \hat C)$ is a $\sqrt n$-consistent local minimizer of Q(B, C). As $\hat B^{(n)} = \arg\min_B Q(B, \hat C)$ and Ĉ is $\sqrt n$-consistent, the oracle properties of $\hat B^{(n)}$ hold by Theorem 1. Similarly, since $\hat C = \arg\min_C Q(\hat B^{(n)}, C)$ and $\hat B^{(n)}$ is $\sqrt n$-consistent, the oracle properties of Ĉ hold by Theorem 2. This completes the proof of the theorem.

Contributor Information

Wonyul Lee, Email: wonyull@email.unc.edu.

Yufeng Liu, Email: yfliu@email.unc.edu.

References

  • 1. Banerjee O, Ghaoui LE, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research. 2008;9:485–516.
  • 2. Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37:373–384.
  • 3. Breiman L. Heuristics of instability and stabilization in model selection. The Annals of Statistics. 1996;24:2350–2383.
  • 4. Breiman L, Friedman JH. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society Series B. 1997;59:3–54.
  • 5. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32:407–499.
  • 6. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  • 7. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.
  • 8. Fu WJ. Penalized regression: The bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
  • 9. Huang JZ, Liu N, Pourahmadi M, Liu L. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika. 2006;93:85–98.
  • 10. Knight K, Fu W. Asymptotics for lasso-type estimators. The Annals of Statistics. 2000;28:1356–1378.
  • 11. Meier L, van de Geer S, Buhlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society Series B. 2008;70:53–71.
  • 12. Meinshausen N, Buhlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
  • 13. Osborne MR, Presnell B, Turlach BA. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis. 2000;20:389–404.
  • 14. Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association. 2009;104:735–746. doi: 10.1198/jasa.2009.0126.
  • 15. Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics. 2008;2:494–515.
  • 16. Rothman AJ, Levina E, Zhu J. Sparse multiple regression with covariance estimation. Journal of Computational and Graphical Statistics. 2010;19:947–962. doi: 10.1198/jcgs.2010.09188.
  • 17. TCGA. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455:1061–1068. doi: 10.1038/nature07385.
  • 18. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B. 1996;58:267–288.
  • 19. Turlach BA, Venables WN, Wright SJ. Simultaneous variable selection. Technometrics. 2005;47:349–363.
  • 20. Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O’Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN TCGA. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and nf1. Cancer Cell. 2010;17:98–110. doi: 10.1016/j.ccr.2009.12.020.
  • 21. Yuan M, Ekici A, Lu Z, Monteiro R. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society Series B. 2007;69:329–346.
  • 22. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B. 2006;68:49–67.
  • 23. Yuan M, Lin Y. Model selection and estimation in the gaussian graphical model. Biometrika. 2007;94:19–35.
  • 24. Zhu Z, Liu Y. Estimating spatial covariance using penalised likelihood with weighted l1 penalty. Journal of Nonparametric Statistics. 2009;21:925–942.
  • 25. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
