Abstract
Multivariate regression is a common statistical tool for practical problems. Many multivariate regression techniques are designed for univariate response cases. For problems with multiple response variables available, one common approach is to apply the univariate response regression technique separately to each response variable. Although it is simple and popular, the univariate response approach ignores the joint information among response variables. In this paper, we propose three new methods for utilizing joint information among response variables. All methods are in a penalized likelihood framework with weighted L1 regularization. The proposed methods provide sparse estimators of the conditional inverse covariance matrix of the response vector given the explanatory variables as well as sparse estimators of the regression parameters. Our first approach is to estimate the regression coefficients with plug-in estimated inverse covariance matrices, and our second approach is to estimate the inverse covariance matrix with plug-in estimated regression parameters. Our third approach is to estimate both simultaneously. Asymptotic properties of these methods are explored. Our numerical examples demonstrate that the proposed methods perform competitively in terms of prediction, variable selection, and inverse covariance matrix estimation.
Keywords: GLASSO, Inverse covariance matrix estimation, Joint estimation, LASSO, Multiple response, Sparsity
1. Introduction
Parameter estimation and variable selection are two important goals in linear regression analysis. In traditional statistical procedures, these two objectives are often achieved separately. For example, parameter estimation can be done by the least squares regression method and variable selection can be achieved by certain subset selection techniques. However, with a large number of predictors available in practice, these methods may not be feasible. When the dimension gets large, the least squares method may have an overfitting problem which reduces predictive accuracy. When the dimension is larger than the sample size, the least squares regression solution cannot even be calculated directly. In terms of variable selection, the all subset selection method can be unstable because the procedure is not continuous [3], and it can be computationally infeasible when the dimension is large. To solve these problems, a large number of methods have been proposed based on the regularization framework. Some well-known methods include the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani [18], the nonnegative garrote proposed by Breiman [2], and the smoothly clipped absolute deviation (SCAD) proposed by Fan and Li [6]. These regularized methods can help to avoid overfitting. More importantly, these techniques can perform parameter estimation and variable selection simultaneously.
With multiple response variables available, the standard approach is to regress each response variable separately on the same set of explanatory variables. All marginal univariate regression procedures, including the above methods, can be applied to each response. However, this approach may not be optimal since it does not utilize the information among response variables. To solve this multi-response regression problem, Breiman and Friedman [4] proposed a method called curds and whey, which uses the relationship among response variables to improve predictive accuracy. They showed that their method can outperform separate univariate regression approaches when there are correlations among the response variables. However, their method did not address variable selection. Recently, Yuan et al. [21] proposed a method based on dimension reduction. Their idea is to obtain dimension reduction by encouraging sparsity among singular values of the parameter matrix. However, their approach focuses on dimension reduction rather than variable selection. Thus, it does not give a subset of explanatory variables for each response. Variable selection can be a very important issue when the number of explanatory variables is large or when explanatory variables are highly correlated. To address variable selection, Turlach et al. [19] proposed a penalized method using the max-L1 penalty to select a common subset of explanatory variables for multiple response regression. Their method aims to select a subset which can be used as predictors for all response variables. However, this assumption may be too strong when each response depends on a different set of explanatory variables.
Recently, Rothman et al. [16] proposed a penalized log-likelihood approach with the multivariate Gaussian assumption. In this paper, we further extend their method and propose three approaches to tackle the multiple response regression problem by utilizing the joint information among multiple response variables. To handle the problem, we need to estimate two parameter matrices, the regression parameter matrix B and the conditional inverse covariance matrix of the response variables, C = Σ^{-1}. The first two approaches are plug-in methods, i.e., plugging in an estimator of one parameter matrix to solve for the other one. The third approach jointly estimates both parameter matrices. In particular, the first proposed method maximizes a sparse penalized log-likelihood using a previously estimated inverse covariance matrix Ĉ. Similarly, the second proposed method maximizes a sparse penalized log-likelihood using a previously estimated regression parameter matrix B̂. The last proposed method simultaneously estimates the regression parameters and the inverse covariance matrix by maximizing a doubly penalized joint likelihood function. These methods involve two penalty terms: the weighted L1 penalty on the inverse covariance matrix C and the weighted L1 penalty on the regression parameter matrix B. Note that the joint approach is more general than that of Rothman et al. [16], which used unweighted L1 penalty terms, because our framework allows flexible weights on the penalty terms. To handle the computational difficulty of high dimensional problems, we recommend a prescreening procedure to eliminate noise variables before further estimation.
In the following sections, we describe the proposed methods in more detail with theoretical justification and numerical examples. In Section 2, we introduce our proposed methodology. Section 3 explores the corresponding theoretical properties. Section 4 develops coordinate descent computational algorithms to obtain solutions for the proposed methods. A prescreening step is suggested for the joint method to speed up the computation. Section 5 provides some brief results of our numerical examples. We conclude the paper with some discussion in Section 6. The proofs of the theorems are provided in the Appendix.
2. Methodology
Consider the regression problem of p covariates and m response variables. Suppose the data contain n observations. Let yi = (yi1, …, yim)^T; i = 1, …, n, be m-dimensional responses and Y = [y1, …, yn]^T be the n × m response matrix. Let xi = (xi1, …, xip)^T; i = 1, …, n, be p-dimensional predictors and X = [x1, …, xn]^T be the n × p design matrix. For simplicity of notation, let yk = (y1k, …, ynk)^T be the k-th response vector (k = 1, …, m) and xj = (x1j, …, xnj)^T be the j-th predictor (j = 1, …, p). Consider the following model,
$$ y_i = B^T x_i + \varepsilon_i, \quad i = 1, \ldots, n, $$
where B = {βjk}; j = 1, …, p, k = 1, …, m, is an unknown p × m parameter matrix. The errors εi = (εi1, …, εim)T; i = 1, …, n, are i.i.d. m-dimensional random vectors following a multivariate normal distribution N(0, Σ) with the nonsingular covariance matrix Σ.
Our goal is to estimate B so that we can use X to predict Y. A simple way to estimate B is to build m single response models separately; the resulting least squares solution is denoted by B̂^S = (X^T X)^{-1} X^T Y, provided that X^T X is nonsingular. However, this approach ignores information on Σ. When Σ is diagonal, this separate modeling approach can work well. When Σ is not diagonal, however, there can be strong correlations among the response variables, and the separate modeling approach does not make use of this joint information. To produce a better estimator, we consider incorporating Σ in the estimation procedure of B. Denote Σ^{-1} by C. If we assume that Σ is known, the log-likelihood for B conditional on X is
$$ \ell(B) = -\frac{1}{2} \operatorname{tr}\{ (Y - XB)\, C\, (Y - XB)^T \} \tag{1} $$
up to a constant not depending on B. Interestingly, although the likelihood function involves Σ, the corresponding maximum likelihood estimate of B turns out to be identical to the least squares estimate obtained by modeling each response separately. This implies that the maximizer of (1) does not take any advantage of the known information on Σ. However, when we impose penalties on the likelihood, the joint method can bring some advantage in estimation.
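As a brief check of this claim, the following derivation sketches why the maximizer does not depend on C. It assumes that the Gaussian log-likelihood is written as ℓ(B) = −(1/2) tr{(Y − XB) C (Y − XB)^T}, that C is symmetric and nonsingular, and that X^T X is nonsingular.

```latex
\begin{aligned}
\ell(B) &= -\tfrac{1}{2}\,\operatorname{tr}\{(Y - XB)\, C\, (Y - XB)^T\}, \\
\frac{\partial \ell(B)}{\partial B} &= X^T (Y - XB)\, C = 0
  \;\Longrightarrow\; X^T X B = X^T Y
  \quad \text{(multiply on the right by } C^{-1}\text{)}, \\
\hat{B} &= (X^T X)^{-1} X^T Y = \hat{B}^S .
\end{aligned}
```

Once an L1 penalty on B is added, the stationarity condition still contains the factor C in X^T (Y − XB) C but no longer reduces to X^T (Y − XB) = 0, so the penalized solution generally does depend on C.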
In this paper, we propose to build multivariate regression models through joint shrinkage. The goal is to utilize the joint information among the m response variables to improve estimation and prediction. Since Σ is involved in the joint estimation and it is often unknown, we consider three different approaches: two plug-in methods and the doubly penalized approach. The plug-in approach in Section 2.1 uses some estimator Ĉ for C to plug in the penalized likelihood function and then estimate B jointly. The plug-in approach in Section 2.2 estimates C after plugging in a reasonable estimator of B. The doubly penalized approach in Section 2.3 estimates C and B simultaneously via regularizing the estimation of both C and B.
For discussion, we first assume that Σ is known. To regress Y on X, we can model them separately, such as applying the LASSO for m different responses. Alternatively, we can use joint shrinkage estimation for the m response variables simultaneously. To demonstrate the difference between separate shrinkage and joint shrinkage, we consider a simple toy example for illustration. Suppose that m = 2, p = 1, and XT X = 1. Let be the least squares solution and assume that both and are positive and . With the penalty parameter λ, the separate LASSO solution is given by
| (2) |
where [u]+ = u if u ≥ 0 and [u]+ = 0 if u < 0. In the joint shrinkage estimation, however, the solution is given by
| (3) |
We can show that (3) is equivalent to
| (4) |
and the solution of (4) is given by
| (5) |
Compared with the separate LASSO solution (2), the solution (5) obtains more shrinkage if ρ is positive, while negative ρ results in less shrinkage. Figure 1 provides some insight into why the amount of shrinkage changes with ρ for the joint method. Solid curves in Figure 1 are contour curves of (B − B̂^S) C (B − B̂^S)^T as a quadratic function of B, and dashed lines correspond to the penalty function. When ρ is positive, the quadratic function increases more slowly along the line at 45° to the horizontal axis than when ρ is zero. Note that the solution of the joint method with ρ = 0 is identical to the separate LASSO solution. Thus, the solution of (4) can be closer to the origin, with more shrinkage, than the solution with ρ = 0. On the other hand, the quadratic function with negative ρ increases faster along the line at 45° to the horizontal axis. Thus, the solution of (4) tends to be closer to the least squares solution than the solution with ρ = 0. Therefore, the joint method can help us to produce more accurate estimators by utilizing the joint information through C.
Figure 1.
Contour plots for the toy example to illustrate the change of shrinkage with ρ for the joint method.
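The effect of ρ on the amount of shrinkage can be reproduced numerically. The sketch below is only an illustration under stated assumptions: it takes m = 2, p = 1 and X^T X = 1 as in the toy example, assumes that Σ has unit variances and correlation ρ, and assumes that the joint criterion is (b − b_ls) C (b − b_ls)^T + λ(|b1| + |b2|) with C = Σ^{-1}. The function names and the values of b_ls and λ are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def joint_shrinkage(b_ls, rho, lam, n_sweeps=200):
    """Coordinate descent for (b - b_ls) C (b - b_ls)^T + lam * (|b_1| + |b_2|).

    C is the inverse of the ASSUMED covariance [[1, rho], [rho, 1]].
    """
    b_ls = np.asarray(b_ls, dtype=float)
    C = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
    b = b_ls.copy()
    for _ in range(n_sweeps):
        for k in (0, 1):
            o = 1 - k
            # Minimizer of the quadratic part in b[k] with b[o] held fixed.
            z = b_ls[k] - C[k, o] * (b[o] - b_ls[o]) / C[k, k]
            # The L1 term turns the update into a soft-threshold step.
            b[k] = soft_threshold(z, lam / (2.0 * C[k, k]))
    return b

# Hypothetical least squares solution with both components positive.
b_ls = np.array([2.0, 1.5])
for rho in (-0.5, 0.0, 0.5):
    print(rho, joint_shrinkage(b_ls, rho, lam=1.0))
```

Under these assumptions, and as long as both components stay positive, the stationarity condition gives b = b_ls − (λ/2)(1 + ρ), so a positive ρ shrinks more than the separate LASSO (ρ = 0) and a negative ρ shrinks less.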
We propose three approaches, including two plug-in methods and one joint method. In Sections 2.1 and 2.2, we introduce two different plug-in penalized likelihood methods, one is for multiple response regression and the other one is for inverse covariance estimation. In the plug-in method for multiple response regression, we estimate C prior to the step of regression and then use the estimator of C to produce a better estimator of B. In the plug-in method for inverse covariance estimation, we estimate B first and then estimate C with the estimator B̂ available. In Section 2.3, we estimate B and C together via double penalization. Section 2.4 provides some guidance on three proposed methods and model selection.
2.1. Plug-in Joint Weighted LASSO Estimator
To ensure that estimation of B includes the information on Σ, we propose a joint penalized likelihood method, namely the plug-in joint weighted LASSO (PWL) estimator. In particular, the corresponding penalized likelihood function is as follows
| (6) |
Here λ1 is a tuning parameter and wjk ≥ 0; j = 1, …, p, k = 1, …, m, are prespecified weights for the L1-penalty of βjk. If C is an m × m diagonal matrix with diagonal entries ( ), then y1, …, ym are mutually independent. In that case, the minimizer of (6) is equivalent to the weighted LASSO solution obtained by applying the weighted LASSO separately to each response vector yk with the penalty parameter . However, if C is not diagonal, the minimizer of (6) can be different from the separate penalized likelihood method which handles each response vector yk separately. Our numerical examples indicate that the joint method can be more accurate when the response variables are highly correlated.
In practice, C is often not available. Thus, we need to estimate it. To estimate C, we assume that z_i = (y_i^T, x_i^T)^T is an (m+p)-dimensional random vector following a multivariate normal distribution N(μ, Σy,x), where Σy,x is partitioned into the blocks Σyy, Σyx, Σxy and Σxx. Because Σ is the covariance matrix of yi conditioned on xi, it can be expressed by Σ = Σyy − Σyx Σxx^{-1} Σxy. Therefore, we can estimate Σ by first estimating Σy,x. To estimate Σy,x, we adapt the Graphical LASSO (GLASSO) method proposed by Friedman et al. [7]. The GLASSO method considers the problem of estimating the inverse covariance matrix in the context of sparse Gaussian graphical models [12]. This technique was also considered by Yuan and Lin [23], Banerjee et al. [1] and Rothman et al. [15].
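The step from the joint covariance of (y, x) to the conditional covariance of y given x can be sketched as follows. This is a minimal illustration, assuming the standard block partition of the joint covariance with the m response coordinates first; the function and variable names are hypothetical. It also checks a standard linear algebra fact that is convenient here: the inverse of the conditional covariance equals the response block of the joint inverse covariance matrix.

```python
import numpy as np

def conditional_covariance(sigma_joint, m):
    """Covariance of y given x, where y is the first m coordinates of z = (y, x)."""
    s_yy = sigma_joint[:m, :m]
    s_yx = sigma_joint[:m, m:]
    s_xx = sigma_joint[m:, m:]
    # Schur complement: S_yy - S_yx S_xx^{-1} S_xy
    return s_yy - s_yx @ np.linalg.solve(s_xx, s_yx.T)

rng = np.random.default_rng(0)
m, p = 3, 4
a = rng.standard_normal((m + p, m + p))
sigma_joint = a @ a.T + (m + p) * np.eye(m + p)  # a positive definite joint covariance

sigma_cond = conditional_covariance(sigma_joint, m)

# The same C can be read off the joint precision matrix directly.
precision_joint = np.linalg.inv(sigma_joint)
c_from_precision = precision_joint[:m, :m]
print(np.allclose(np.linalg.inv(sigma_cond), c_from_precision))
```

In this sketch a sparse estimate of the joint inverse covariance therefore yields an estimate of C without an additional matrix inversion.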
The GLASSO estimator of Σy,x^{-1} is given as the minimizer of the following penalized likelihood function
| (7) |
Here z̄ is the sample mean, || Σy,x−1|| is the sum of the absolute values of the off-diagonal elements of Σy,x−1, and λ0 is a tuning parameter.
The PWL method is a two-step procedure. With the estimate Σ̂ available, the PWL method solves the following problem
| (8) |
where and Ĉ = Σ̂−1.
2.2. Plug-in Weighted Graphical LASSO Estimator
In Section 2.1, we propose a plug-in method, PWL, which estimates C first and then estimates B given Ĉ. In this section, we propose another plug-in method to estimate C. In particular, we first estimate B by using univariate regression techniques. With the estimator B̂ available, we propose a penalized likelihood method, the plug-in weighted graphical LASSO (PWGL) estimator, by solving
| (9) |
where C = {cst}; s = 1, …, m, t = 1, …, m. Here λ2 is a tuning parameter and vst ≥ 0; s = 1, …, m, t = 1, …, m, are prespecified weights for the L1 penalty of cst.
2.3. Doubly Penalized Maximum Likelihood Estimator
In Sections 2.1 and 2.2, we propose two plug-in methods. PWL estimates C first and then estimates B given Ĉ, while PWGL estimates B first and then estimates C given B̂. In this section, we propose to estimate (B, C) simultaneously. Since y_i | x_i ~ N(B^T x_i, Σ), the log-likelihood of (B, C) conditional on X is
$$ \ell(B, C) = -\frac{nm}{2}\log(2\pi) + \frac{n}{2}\log\det(C) - \frac{1}{2}\operatorname{tr}\{ (Y - XB)\, C\, (Y - XB)^T \}. \tag{10} $$
It can be shown that the maximum likelihood estimator of B is also given by (X^T X)^{-1} X^T Y. Interestingly, the resulting estimator of B is the same as the ordinary least squares estimator, which can be obtained without using the information on the relationship among the response vectors y1, …, ym. To incorporate the information among different response variables in the estimation of B, we propose a joint penalized method, the doubly penalized maximum likelihood (DML) estimator, by solving
| (11) |
Note that the objective function in (11) is not convex with respect to (B, C). The corresponding optimization can sometimes be unstable when p ≥ n. This is because the first term in (11) can dominate the other terms if some diagonal elements of (Y − XB)^T (Y − XB) are zero, which may occur when p ≥ n. This can be shown by taking a diagonal matrix C and increasing the values of its diagonal elements corresponding to the zero diagonal entries in (Y − XB)^T (Y − XB). As a result, the numerical solution of C in (11) can have some large diagonal entries. In practice, a solution of C with very large diagonal entries is not desirable as it corresponds to very small residual variances of the corresponding response variables. We recommend first using the plug-in method in Section 2.1 or separate modeling methods to screen the variables and reduce the dimension. Then one can apply the joint method to the reduced set of variables. As shown in our simulation examples, the joint method can often outperform the plug-in methods when p is moderate compared to n.
2.4. Model Selection
The two plug-in methods are preferable if one of B and C is of main interest and the other is already well estimated. Another advantage of the two plug-in methods is that they have lower computational cost than the joint method. On the other hand, the joint method does not require a good estimate of B or C. Even though the joint method is computationally more intensive, it often performs better than the two plug-in methods because it optimizes the log-likelihood of (B, C) jointly.
The tuning parameters λ1 and λ2 in (8), (9) and (11) control the sparsity of the resulting estimators of (B, C). They can be selected either using validation sets or through K-fold cross-validation. The K-fold cross-validation method randomly splits the dataset into K segments of equal sizes. For the k-th fold, we denote the estimated regression parameter matrix and the estimated inverse covariance matrix using all data excluding those in the k-th segment and the tuning parameters λ1 and λ2 by ( ). We also denote the data in the k-th segment as (Y(k), X(k)). Specifically, for the PWL method, we select the optimal tuning parameter λ̂1 which minimizes the prediction error as follows,
| (12) |
where ‖·‖_F is the Frobenius norm of a matrix. For the PWGL method, we select the optimal tuning parameter λ̂2 which minimizes the predictive negative log-likelihood as follows,
| (13) |
where nk is the sample size of the k-th segment. For the DML method, we first select the optimal λ̂1 by using (12) with a prespecified λ2 and select λ̂2 by using (13) with the selected optimal λ̂1. It helps to avoid a two dimensional grid search of (λ1, λ2). We have found in simulations that the selected optimal λ̂1s are almost identical for a wide range of prespecified λ2.
When validation sets are used, we split the dataset into two parts, the training set and the validation set. With a pair of (λ1, λ2), we first estimate (B, C) using the training set. The prediction error and the predictive negative log-likelihood of the resulting estimator are obtained using the validation set as (Y(k), X(k)) in (12) and (13). The validation set is not used to construct the final estimator with the selected (λ̂1, λ̂2), while K-fold cross-validation uses all data for the final estimator with (λ̂1, λ̂2).
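The K-fold selection of λ1 by prediction error can be sketched as below. This is an illustration only: it assumes that the criterion sums the squared Frobenius norm of the held-out residuals over the folds, and it uses a closed-form ridge fit purely as a stand-in so that the example is self-contained. The names `ridge_fit` and `cv_prediction_error` are hypothetical; in practice the `fit` argument would be replaced by a PWL or DML fitting routine.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Stand-in fitter (NOT one of the proposed estimators): multi-response ridge."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def cv_prediction_error(X, Y, lam_grid, fit=ridge_fit, K=5, seed=0):
    """Return the grid value with the smallest K-fold prediction error, and all errors."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)  # K segments of (nearly) equal size
    errors = np.zeros(len(lam_grid))
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        for i, lam in enumerate(lam_grid):
            B_hat = fit(X[train], Y[train], lam)      # estimate without the k-th segment
            resid = Y[held_out] - X[held_out] @ B_hat
            errors[i] += np.sum(resid ** 2)           # squared Frobenius norm
    return lam_grid[int(np.argmin(errors))], errors

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 5))
B_true = rng.standard_normal((5, 3))
Y = X @ B_true + rng.standard_normal((80, 3))
best_lam, cv_errors = cv_prediction_error(X, Y, [0.01, 0.1, 1.0, 10.0])
print(best_lam, cv_errors)
```

Selecting λ2 by the predictive negative log-likelihood follows the same loop, with the held-out residuals entering a Gaussian likelihood evaluated at the estimated C instead of a squared error.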
3. Asymptotic Properties
To investigate a sparse regression technique, it is necessary to investigate its asymptotic behaviors. Fan and Li [6] pointed out that a good variable selection procedure should have oracle properties. Asymptotically with probability tending to 1, a procedure with oracle properties can identify the true underlying subset of predictor variables. The resulting estimator of the procedure also asymptotically performs as well as if the true underlying subset were known in advance. In this section, we study the asymptotic behavior of our three proposed methods. In particular, we show that with a proper choice of (λ1, λ2), all three methods enjoy the oracle properties.
For the asymptotic analysis, we use the set-up of Fan and Li [6], Yuan and Lin [23] and Zou [25]. The technical derivation uses the results in Knight and Fu [10]. Let B* = {β*jk}; j = 1, …, p, k = 1, …, m, be the true regression parameter matrix and C* = {c*st}; s = 1, …, m, t = 1, …, m, be the true inverse covariance matrix. Let 𝒜 = {(j, k) : β*jk ≠ 0} and ℬ = {(s, t) : c*st ≠ 0}. Then we assume the following conditions for our theoretical results:
(A1) (1/n) X^T X → A, where A is a positive definite matrix.
(A2) The cardinality of 𝒜 is |𝒜| = q1 > 0.
(A3) There exists β̃jk which is a √n-consistent estimator of β*jk; j = 1, …, p, k = 1, …, m.
(A4) The cardinality of ℬ is |ℬ| = q2 > 0.
(A5) There exists c̃st which is a √n-consistent estimator of c*st; s = 1, …, m, t = 1, …, m.
Note that conditions (A3) and (A5) are generally satisfied by maximum likelihood estimators or L2 regularized maximum likelihood estimators with proper choices of penalty parameters. For example, the least squares estimator of B can be used as the β̃jk's and the inverse of the residual sample covariance matrix can be used as the c̃st's. For the theoretical analysis, we define wjk and vst as wjk = 1/|β̃jk|^γ; j = 1, …, p, k = 1, …, m, γ > 0, and vst = 1/|c̃st|^γ; s = 1, …, m, t = 1, …, m, respectively.
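Data-driven weights of this kind can be computed as in the sketch below. It assumes the adaptive LASSO form of Zou [25], that is, weights equal to the inverse absolute value of an initial estimate raised to the power γ, with the least squares estimator and the inverse residual sample covariance matrix as initial estimates. The helper name, the guard `eps`, and the choice to leave the diagonal of C unpenalized are assumptions made for illustration.

```python
import numpy as np

def adaptive_weights(initial_estimate, gamma=1.0, eps=1e-8):
    """Weights 1 / |initial|^gamma; eps guards against division by zero."""
    return 1.0 / np.maximum(np.abs(initial_estimate), eps) ** gamma

rng = np.random.default_rng(0)
n, p, m = 100, 4, 2
X = rng.standard_normal((n, p))
B_true = np.zeros((p, m))
B_true[0] = [2.0, -2.0]                      # only the first predictor is active
Y = X @ B_true + rng.standard_normal((n, m))

B_tilde = np.linalg.solve(X.T @ X, X.T @ Y)  # least squares initial estimate of B
R = Y - X @ B_tilde
C_tilde = np.linalg.inv(R.T @ R / n)         # inverse residual sample covariance

W = adaptive_weights(B_tilde, gamma=1.0)     # weights for the penalty on B
V = adaptive_weights(C_tilde, gamma=1.0)     # weights for the penalty on C
np.fill_diagonal(V, 0.0)                     # ASSUMPTION: diagonal entries not penalized
print(W)
```

Coefficients whose initial estimates are close to zero receive large weights and are penalized heavily, while clearly nonzero coefficients are penalized lightly, which is the mechanism behind the oracle properties studied in this section.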
In Sections 3.1 and 3.2, we show the plug-in estimators enjoy the oracle properties. Section 3.3 develops the asymptotic theory that reveals the oracle properties of the DML solution.
3.1. Oracle properties of the PWL solution
In this section, we first show that with the known C*, the minimizer of (6) is consistent in variable selection and has the asymptotic normality. Then we show that with a consistent estimator of C*, the PWL estimator also enjoys the same properties.
Define the true regression parameter vector as
. Let
be the estimator of β* obtained by minimizing (6) with the penalty parameter λ1,n. Let
be the q1-dimensional true parameter vector which consists of nonzero components in β*. Let
be the corresponding estimators of
. Let D = (C* ⊗ A)
be the q × q matrix obtained by removing the (j + (k −1)m)-th row and column of C* ⊗ A for (j, k) ∉
. Then the following lemma shows the oracle properties of the penalized likelihood estimator
with the known C*, as the minimizer of (6) defined previously.
Lemma 1
(Oracle properties of the minimizer of (6), , with the known C*) Suppose that and as n → ∞. Under the conditions (A1)–(A3), we have the following results:
(Selection consistency) if ;
(Asymptotic normality) .
Lemma 1 tells us that the penalized maximum likelihood estimator with the known C* satisfies the oracle properties. Since C* is typically unknown in practice, one often uses an estimator for C*. With slight modification of Lemma 1, we can show that the PWL solution also enjoys the oracle properties. Denote the PWL estimator of β* with the penalty parameter λ1,n as . Let be the corresponding estimator of .
Theorem 1
(Oracle properties of the PWL solution) In addition to the assumptions in Lemma 1, suppose that Ĉ is a consistent estimator of C*. Under the conditions (A1)–(A3), we have the following results:
(Selection consistency) if ;
(Asymptotic normality) .
Theorem 1 states that with a consistent estimator of C*, variable selection in the PWL is consistent and the resulting estimator still enjoys the asymptotic normality.
3.2. Oracle properties of the PWGL solution
In this section, we show the oracle properties of the PWGL solution. To this end, we first show the oracle properties of the solution of
| (14) |
with the known B*. Then we show that with a consistent estimator of B*, the PWGL estimator still enjoys the same properties.
Denote by Ĉ(1) the minimizer of (14) with the known B*. Let be the matrix obtained from Ĉ(1) by replacing with 0 if . Then the following lemma shows the oracle properties of Ĉ(1).
Lemma 2
(Oracle properties of the minimizer of (14), Ĉ(1), with known B*) Suppose that and λ2,n → ∞ as n → ∞. Under the conditions (A1), (A4) and (A5), we have the following results:
(Selection consistency) if ;
-
(Asymptotic distribution) ,
where V (U) = tr(UΣUΣ) + tr(UW) and W is an m × m random symmetric matrix such that vec(W) ~ N(0, Λ) in which cov(wij, wkl) = cov(ε1iε1j, ε1kε1l). The minimum is taken over all symmetric matrices U satisfying ujk = 0 if .
In Lemma 2, we show that the penalized maximum likelihood estimator with the known B* satisfies the oracle properties. Since B* is typically unknown in practice, one often applies a univariate regression technique to obtain an estimator for B*. With slight modification of Lemma 2, we can show that the PWGL solution also enjoys the oracle properties. Denote the PWGL estimator of C* with the penalty parameter λ2,n as Ĉ(2). Let be the matrix obtained from Ĉ(2) by replacing with 0 if . Then the following theorem shows the oracle properties of the PWGL estimator.
Theorem 2
(Oracle properties of the PWGL solution) In addition to the assumptions in Lemma 2, suppose that B̂ is a consistent estimator of B*. Under the above conditions, we have the following results:
(Selection consistency) if ;
-
(Asymptotic distribution) ,
where V(U) = tr(UΣUΣ) + tr(UW) and W is an m × m random symmetric matrix such that vec(W) ~ N(0, Λ) in which cov(wij, wkl) = cov(ε1iε1j, ε1kε1l). The minimum is taken over all symmetric matrices U satisfying ujk = 0 if .
Theorem 2 states that with a consistent estimator of B*, the PWGL solution satisfies the oracle properties.
3.3. Oracle properties of the DML solution
In Sections 3.1 and 3.2, we establish the oracle properties of plug-in estimators. In this section, we explore oracle properties of the DML solution in which (B̂, Ĉ) are obtained together. First, we show that with a proper choice of (λ1, λ2), there exists a -consistent local minimizer of (11). Then we show that this local minimizer enjoys the oracle properties as a solution of the DML estimator.
The following lemma shows the existence of a local minimizer of (11) which is -consistent.
Lemma 3
Suppose that and . Under the conditions (A1)–(A5), there exists a local minimizer of (11) such that
From Lemma 3, it is clear that there exists a -consistent doubly penalized maximum likelihood estimator. As the DML estimator of (B*, C*), denote by (B̂(n), Ĉ) the -consistent local solution of (11) with the penalty parameter (λ1,n, λ2,n). Let β̂(n) = vec(B̂(n)) and let be the corresponding estimator of . Let Ĉ0 be the matrix obtained from Ĉ by replacing ĉjk with 0 if . We now show that with a proper choice of (λ1, λ2), the DML estimator as this local minimizer enjoys the oracle properties in the following theorem.
Theorem 3
(Oracle properties of the DML solution) Suppose that and . In addition to that, suppose that and λ2,n → ∞. Under the conditions (A1)–(A5), we have the following results:
if ;
;
limn P(ĉjk = 0) = 1 if ;
-
,
where V (U) = tr(UΣUΣ) + tr(UW) and W is a m × m random symmetric matrix such that vec(W) ~ N(0, Λ) in which cov(wij, wkl) = cov(ε1iε1j, ε1kε1l). The minimum is taken over all symmetric matrices U satisfying ujk = 0 if .
4. Computational Algorithm
In this section, we describe computational algorithms to solve problems (8), (9), and (11). In particular, we apply the GLASSO algorithm for (9). To solve the problems (8) and (11), we apply the coordinate-descent algorithm as described in Peng et al. [14], which can be viewed as a modification of the shooting algorithm [8]. The basic idea of the coordinate-descent algorithm is to optimize each parameter at one time while holding the other parameters fixed at the current solution. The corresponding optimization at each step can be very simple to solve.
We now describe the coordinate-descent algorithm for the PWL method in detail. Denote Ĉ by (ĉij)m×m. Then (8) is equivalent to minimizing
| (15) |
Consider (15) as a function of βjk with other coefficients fixed. Then the minimizer of (15) is equivalent to
This problem is essentially a one-dimensional LASSO optimization which has a closed form solution. Therefore, the algorithm can be summarized as follows:
Algorithm 1: the Coordinate-Descent Algorithm for the PWL Method
Step 1 (Initial value). Set the separate LASSO solution for βjk; j = 1, …, p, k = 1, …, m, as the initial value for B.
Step 2 (Updating rule). For j = 1, …, p and k = 1, …, m,
,if q ≠ j and r ≠ k,where and .
Step 3 (Iteration). Repeat Step 2 until convergence. Our stopping rule is that the change of the objective function in (8) is less than δ = 0.1.
To be computationally more efficient, we combine the above algorithm with the active shooting algorithm proposed by Peng et al. [14]. The basic idea of the active shooting algorithm is to update the coefficients within the active set until convergence instead of iterating all coefficients at each step. The active set is defined as the set of currently nonzero coefficients and it is typically small. Once the coefficients in the active set converge, then we continue to update other coefficients. This step can speed up the algorithm significantly if the final solution is very sparse.
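The coordinate-wise update can be sketched in code as follows. This is a minimal sketch rather than the authors' implementation: it assumes that the PWL criterion has the form tr{(Y − XB) C (Y − XB)^T} + λ1 Σ wjk |βjk| without further scaling, starts from zero (or a user-supplied `B_init`) instead of the separate LASSO solution, and omits the active-set speed-up. All function names are hypothetical. Under the assumed criterion, the one-dimensional problem in βjk is a β² − 2bβ + λ1 wjk |β| with a = ckk xj^T xj, whose minimizer is a soft-threshold of b at λ1 wjk / 2 divided by a.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def pwl_objective(Y, X, B, C, lam, W):
    """ASSUMED objective: tr{(Y - XB) C (Y - XB)^T} + lam * sum_jk W_jk |B_jk|."""
    R = Y - X @ B
    return np.sum((R @ C) * R) + lam * np.sum(W * np.abs(B))

def pwl_coordinate_descent(Y, X, C, lam, W, B_init=None, tol=1e-10, max_sweeps=1000):
    p, m = X.shape[1], Y.shape[1]
    B = np.zeros((p, m)) if B_init is None else np.array(B_init, dtype=float)
    col_sq = np.sum(X ** 2, axis=0)          # x_j^T x_j for each predictor
    R = Y - X @ B                            # current residual matrix
    prev = pwl_objective(Y, X, B, C, lam, W)
    for _ in range(max_sweeps):
        for j in range(p):
            for k in range(m):
                a = C[k, k] * col_sq[j]
                # x_j^T (partial residual) c_k, with the (j, k) contribution added back
                b = X[:, j] @ (R @ C[:, k]) + a * B[j, k]
                new = soft_threshold(b, lam * W[j, k] / 2.0) / a
                R[:, k] -= X[:, j] * (new - B[j, k])   # keep residuals in sync
                B[j, k] = new
        cur = pwl_objective(Y, X, B, C, lam, W)
        if prev - cur < tol:                 # stop when the objective stalls
            break
        prev = cur
    return B

rng = np.random.default_rng(0)
n, p, m = 100, 4, 2
X = rng.standard_normal((n, p))
B_true = np.array([[2.0, -1.0], [0.0, 0.0], [1.5, 0.0], [0.0, 0.0]])
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
Y = X @ B_true + rng.multivariate_normal(np.zeros(m), Sigma, size=n)
C = np.linalg.inv(Sigma)                     # plug-in value; here the true C for simplicity
W = np.ones((p, m))
B_hat = pwl_coordinate_descent(Y, X, C, lam=20.0, W=W)
print(B_hat)
```

Because each coordinate update minimizes the assumed objective exactly in that coordinate, the objective never increases from one sweep to the next, which is what makes a stopping rule based on the change of the objective natural.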
Next we describe the problem (9) in the GLASSO framework. Since (9) is equivalent to minimizing
| (16) |
we can apply the GLASSO algorithm [7] to solve (9) by substituting the sample covariance matrix with the residual sample covariance matrix (1/n)(Y − XB̂)^T (Y − XB̂). Therefore, the algorithm for (9) proceeds as follows:
Algorithm 2: the GLASSO Algorithm for the PWGL Method
Step 1 (Estimator of B). Set the separate LASSO solution as the estimator, B̂, of B.
Step 2 (Estimator of C). Given B̂, apply the GLASSO algorithm to solve (16).
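The two steps can be sketched as below. Two substitutions are made to keep the example self-contained, and both are assumptions rather than the method of the paper: the least squares fit stands in for the separate LASSO estimator B̂, and a simple ADMM solver stands in for the GLASSO algorithm of [7] because it handles entrywise weights directly. The assumed criterion is −log det C + tr(S C) + λ2 Σ vst |cst|, where S is the residual sample covariance matrix and the diagonal weights are set to zero. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    """Entrywise soft-thresholding."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def weighted_glasso_admm(S, lam, V, rho=1.0, n_iter=1000):
    """ADMM for min_C -log det C + tr(S C) + lam * sum_st V_st |C_st| (a stand-in solver)."""
    m = S.shape[0]
    Z = np.eye(m)
    U = np.zeros((m, m))
    for _ in range(n_iter):
        # Theta-step: solve rho*Theta - Theta^{-1} = rho*(Z - U) - S by eigendecomposition.
        d, Q = np.linalg.eigh(rho * (Z - U) - S)
        theta_eig = (d + np.sqrt(d ** 2 + 4.0 * rho)) / (2.0 * rho)
        Theta = (Q * theta_eig) @ Q.T
        # Z-step: weighted soft-thresholding produces exact zeros.
        Z = soft_threshold(Theta + U, lam * V / rho)
        # Dual update.
        U = U + Theta - Z
    return Z

def pwgl(Y, X, B_hat, lam2, V):
    """Plug in B_hat, form the residual covariance, then estimate C."""
    n = Y.shape[0]
    R = Y - X @ B_hat
    return weighted_glasso_admm(R.T @ R / n, lam2, V)

rng = np.random.default_rng(0)
n, p, m = 200, 2, 3
X = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, m))
Sigma = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.0]])
Y = X @ B_true + rng.multivariate_normal(np.zeros(m), Sigma, size=n)

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # stand-in for the separate LASSO fit
S_res = (Y - X @ B_hat).T @ (Y - X @ B_hat) / n
V = np.ones((m, m)) - np.eye(m)             # ASSUMPTION: penalize off-diagonal entries only
C_hat = pwgl(Y, X, B_hat, lam2=0.1, V=V)
print(C_hat)
```

With λ2 = 0 the sketch returns the inverse of the residual sample covariance matrix, and with a very large λ2 it returns a diagonal matrix, which gives two simple sanity checks for any solver used in Step 2.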
Next, we combine Algorithm 1 and the GLASSO algorithm to solve problem (11) for the doubly penalized method DML in Section 2.3. The algorithm can be summarized as follows:
Algorithm 3: the Coordinate-Descent Algorithm for the DML Method
Step 1 (Initial values of B and C). Set the separate LASSO solution for βjk; j = 1, …, p, k = 1, …, m, as the initial value for B and the solution of (9), C(old), as the initial value of C.
Step 2 (B updating rule). For a given C(old), update B(old) → B(new) by minimizing the objective function in (11) with respect to B with C fixed at C(old). This step can be solved using Algorithm 1.
Step 3 (C updating rule). For a given B(new), update C(old) → C(new) by minimizing the objective function in (11) with respect to C with B fixed at B(new). This can be solved using the GLASSO algorithm.
Step 4 (Iteration). Repeat Steps 2 and 3 until convergence. Our stopping rule is that the change of the objective function in (11) is less than δ = 0.1.
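The alternating scheme can be sketched as follows. This is an illustration under several assumptions and not the authors' implementation: the joint criterion is taken to be −n log det C + tr{(Y − XB) C (Y − XB)^T} + λ1 Σ wjk |βjk| + λ2 Σ vst |cst|; the B step is a coordinate descent on that criterion with C fixed; and the C step uses a small ADMM solver as a stand-in for the GLASSO algorithm. The initial B comes from the B step with C equal to the identity, which under the assumed criterion amounts to separate weighted LASSO fits. All names, the default tolerance and the tuning values are hypothetical.

```python
import numpy as np

def soft(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def b_step(Y, X, C, lam1, W, B, n_sweeps=200):
    """Coordinate descent in B for tr{(Y-XB) C (Y-XB)^T} + lam1 * sum(W * |B|), C fixed."""
    B = B.copy()
    R = Y - X @ B
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(X.shape[1]):
            for k in range(Y.shape[1]):
                a = C[k, k] * col_sq[j]
                b = X[:, j] @ (R @ C[:, k]) + a * B[j, k]
                new = soft(b, lam1 * W[j, k] / 2.0) / a
                R[:, k] -= X[:, j] * (new - B[j, k])
                max_change = max(max_change, abs(new - B[j, k]))
                B[j, k] = new
        if max_change < 1e-10:
            break
    return B

def c_step(S, lam, V, rho=1.0, n_iter=500):
    """ADMM stand-in for GLASSO: -log det C + tr(S C) + lam * sum(V * |C|)."""
    m = S.shape[0]
    Z, U = np.eye(m), np.zeros((m, m))
    for _ in range(n_iter):
        d, Q = np.linalg.eigh(rho * (Z - U) - S)
        Theta = (Q * ((d + np.sqrt(d ** 2 + 4.0 * rho)) / (2.0 * rho))) @ Q.T
        Z = soft(Theta + U, lam * V / rho)
        U = U + Theta - Z
    return Z

def dml_objective(Y, X, B, C, lam1, lam2, W, V):
    n = Y.shape[0]
    R = Y - X @ B
    _, logdet = np.linalg.slogdet(C)
    return (-n * logdet + np.sum((R @ C) * R)
            + lam1 * np.sum(W * np.abs(B)) + lam2 * np.sum(V * np.abs(C)))

def dml(Y, X, lam1, lam2, W, V, tol=1e-6, max_iter=50):
    n, m = Y.shape
    B = b_step(Y, X, np.eye(m), lam1, W, np.zeros((X.shape[1], m)))  # separate fits
    R = Y - X @ B
    C = c_step(R.T @ R / n, lam2 / n, V)
    history = [dml_objective(Y, X, B, C, lam1, lam2, W, V)]
    for _ in range(max_iter):
        B = b_step(Y, X, C, lam1, W, B)          # Step 2: update B given C
        R = Y - X @ B
        C = c_step(R.T @ R / n, lam2 / n, V)     # Step 3: update C given B
        history.append(dml_objective(Y, X, B, C, lam1, lam2, W, V))
        if abs(history[-2] - history[-1]) < tol:  # Step 4: stop when the objective stalls
            break
    return B, C, history

rng = np.random.default_rng(0)
n, p, m = 100, 5, 3
X = rng.standard_normal((n, p))
B_true = np.zeros((p, m))
B_true[0] = [2.0, -1.0, 1.0]
B_true[1] = [0.0, 1.5, 0.0]
Sigma = np.array([[1.0, 0.6, 0.0], [0.6, 1.0, 0.4], [0.0, 0.4, 1.0]])
Y = X @ B_true + rng.multivariate_normal(np.zeros(m), Sigma, size=n)
W = np.ones((p, m))
V = np.ones((m, m)) - np.eye(m)
B_hat, C_hat, history = dml(Y, X, lam1=10.0, lam2=5.0, W=W, V=V)
print(B_hat)
print(C_hat)
```

Since the joint criterion is not convex in (B, C), a scheme of this kind can only be expected to reach a local solution, and the starting values matter; the sketch is meant for the well-conditioned case with n larger than p.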
Based on our experiment, the coordinate-descent algorithm works very efficiently. Since the DML method involves estimation of both B and C, the computation can be intensive when the dimension is high. We consider a prescreening step to speed up the computation. In particular, we adapt the group lasso method considered by Yuan and Lin [22] and Meier et al. [11]. The basic idea of the group lasso method is to employ group penalty in the regression problem so that model selection can be achieved in terms of group selection. In our multiple response variable regression problem, (βj1, …, βjm); j = 1, …, p, can be considered as p groups. Therefore, for the prescreening step, the group lasso estimator, B̂group of B, is given as the minimizer of the following penalized function
where λ is a tuning parameter. We screen out a variable if the corresponding coefficients are estimated as zeros for all response variables. In other words, we remove the variable xj from our model if β̂group_j1 = ⋯ = β̂group_jm = 0. This prescreening step not only speeds up the computation but can also improve the prediction performance, as shown in our examples.
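A prescreening step of this type can be sketched with a proximal gradient iteration. The sketch assumes the group lasso criterion (1/2)‖Y − XB‖_F² + λ Σj ‖(βj1, …, βjm)‖_2, which is one common scaling and may differ from the one used in the paper by constant factors; the solver choice, the function name and the tuning value are likewise assumptions.

```python
import numpy as np

def group_lasso_prescreen(X, Y, lam, n_iter=500):
    """Proximal gradient for 0.5 * ||Y - XB||_F^2 + lam * sum_j ||B[j, :]||_2.

    Returns the indices of predictors whose coefficient row is not entirely zero,
    together with the fitted coefficient matrix.
    """
    p, m = X.shape[1], Y.shape[1]
    L = np.linalg.eigvalsh(X.T @ X)[-1]        # Lipschitz constant of the gradient
    B = np.zeros((p, m))
    for _ in range(n_iter):
        G = B - X.T @ (X @ B - Y) / L          # gradient step
        norms = np.linalg.norm(G, axis=1, keepdims=True)
        # Row-wise group soft-thresholding: whole rows are set to zero together.
        scale = np.maximum(0.0, 1.0 - (lam / L) / np.maximum(norms, 1e-12))
        B = scale * G
    keep = np.flatnonzero(np.linalg.norm(B, axis=1) > 0)
    return keep, B

rng = np.random.default_rng(0)
n, p, m = 200, 6, 3
X = rng.standard_normal((n, p))
B_true = np.zeros((p, m))
B_true[0] = [1.5, -1.0, 2.0]
B_true[1] = [-2.0, 1.0, 0.5]                   # only the first two predictors matter
Y = X @ B_true + 0.1 * rng.standard_normal((n, m))

keep, B_group = group_lasso_prescreen(X, Y, lam=20.0)
print(keep)                                    # predictors retained for the joint method
X_reduced = X[:, keep]
```

The joint method would then be applied to `X_reduced` only, so the cost of the subsequent B and C updates depends on the number of retained predictors rather than on p.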
5. Numerical Examples
In this section, our proposed methods are compared with several existing methods. The first existing method we compare against is the curds and whey (CW) method proposed by Breiman and Friedman [4]. The other two methods are the separate ridge regression (RR) and the separate LASSO. In particular, we apply the RR and the LASSO to each response variable separately. Separate LASSO solutions are constructed by a modification of the LARS algorithm proposed by Efron et al. [5]. The main idea of this modified LARS algorithm was also considered by Osborne et al. [13]. In this paper, only brief summaries of the results are provided. All details about the numerical examples can be found in the online supplementary materials.
In the simulated examples, all methods are compared on two tasks, B estimation and C estimation. Performance of B estimation is compared in terms of prediction and variable selection. For the comparison of C estimation, we use the entropy criterion [9, 24], which measures the discrepancy between two matrices. In terms of prediction, overall, the proposed DML method works the best. The PWL method also works reasonably well in all cases, although it is not as accurate as the DML estimator. In the example where the true inverse covariance matrix is not sparse, LASSO gives the worst prediction performance while the other methods show similar performance. This suggests that approaches using the joint information among response variables outperform separate approaches. DML outperforms LASSO and PWL in terms of identification of zero coefficients. We also notice that the ratios of correctly identified zeros for PWL and DML increase as the sample size increases. This supports the selection consistency shown in Section 3. When the dimension of predictor variables is low, the DML estimator gives the best performance in C estimation. When the dimension of predictor variables is close to the sample size, PWGL outperforms DML. Since the DML method simultaneously estimates both B and C, its C estimation may be less accurate when the sample size is small.
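A common form of the entropy criterion can be computed as below. The exact definition used in the paper is not shown in this section, so the sketch assumes the form tr(C⁻¹Ĉ) − log|C⁻¹Ĉ| − m for an m × m true matrix C and estimate Ĉ; the function name `entropy_loss` is hypothetical.

```python
import numpy as np

def entropy_loss(C_true, C_hat):
    """Entropy loss between two positive-definite m x m matrices.

    Assumed form: tr(C_true^{-1} C_hat) - log det(C_true^{-1} C_hat) - m.
    The value is nonnegative and equals zero when C_hat equals C_true.
    """
    m = C_true.shape[0]
    M = np.linalg.solve(C_true, C_hat)          # C_true^{-1} C_hat without explicit inverse
    sign, logdet = np.linalg.slogdet(M)
    if sign <= 0:
        raise ValueError("C_true^{-1} C_hat must have a positive determinant")
    return float(np.trace(M) - logdet - m)

C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.5, 0.3],
              [0.0, 0.3, 1.0]])
loss_exact = entropy_loss(C, C)          # estimate equal to the truth
loss_scaled = entropy_loss(C, 2.0 * C)   # estimate off by a factor of two
```

For an estimate equal to a scalar multiple aC of the truth, the assumed form reduces to m(a − 1 − log a), which gives a quick sanity check of an implementation.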
We apply our methodology to a glioblastoma multiforme (GBM) cancer data set studied by the Cancer Genome Atlas (TCGA) Research Network [17]. As noted in [20], GBM is the most common primary form of brain tumor in adults. In terms of prediction, our method, DML, performs best, although the difference between DML and the separate LASSO is not statistically significant in view of the standard errors. In terms of the number of genes included in the models, PWL and DML construct sparser models than the separate LASSO. One possible explanation is that there may be some strong positive correlations among the microRNAs, which are the response variables. As we have discussed in the toy example of Section 2, with strong positive correlations among response variables, joint methods tend to obtain more shrinkage than the separate LASSO. To explore this further, correlations among microRNAs are examined. Some strong positive correlations among the microRNAs are detected, while negative correlations are not strong. Interestingly, with far fewer gene expressions than the separate LASSO, PWL and DML perform competitively in terms of prediction accuracy.
6. Discussion
In this paper, we have proposed three methods for utilizing joint information among response variables in a penalized likelihood framework with weighted L1 regularization. Our theoretical investigation shows that our proposed estimators enjoy oracle properties. Simulated examples and an application to the GBM cancer data set demonstrate that our proposed methods perform competitively.
Our current study assumes a Gaussian distribution of the response vector. One future research direction is to extend the proposed methods to other distributional assumptions. Although we mainly focus on the weighted L1 penalty, our methods can be directly extended to other penalty functions as well. It will be interesting to compare the performance of various choices of penalty in this context.
Supplementary Material
Acknowledgments
The authors were supported in part by NSF grant DMS-0747575 and NIH grant 5R01CA149569-03.
Appendix A. Proof of Lemma 1
Appendix A.1. Asymptotic normality
Let Ỹ = ((y1)T, …, (ym)T)T be the nm-dimensional response vector and ε̃ be the corresponding nm-dimensional error vector which consists of εik; i = 1, …, n, k = 1, …, m. Let β̃ = (β11, …, βp1, …, β1m, …, βpm)T be the pm-dimensional vector and X̃ = Im ⊗ X. Then the minimizer of (6) is equivalent to
Let and
Let û(n) = argminu Vn(u) and then . Note that û(n) = argminu Vn(u) = argminu{Vn(u) – Vn(0)} and
| (A.1) |
We know that . For the second term of the right hand side of (A.1), note that ε̃ ~ N(0, Σ ⊗ In). Thus, where Z ~ N(0, C ⊗ A) as . Now we consider the last term of the right hand side of (A.1):
If , then as .
if , then . Note that and . By the Slutsky’s theorem, .
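By combining the above statements and using Slutsky's theorem again, we obtain $V_n(u) - V_n(0) \to_d V(u)$ for every u, where

$$V(u) = \begin{cases} u_{\mathcal{A}}^{T} D\, u_{\mathcal{A}} - 2\, u_{\mathcal{A}}^{T} Z_{\mathcal{A}}, & \text{if } u_{jk} = 0 \text{ for all } (j,k) \notin \mathcal{A}, \\ \infty, & \text{otherwise,} \end{cases}$$

$u_{\mathcal{A}}$ consists of ujk for (j, k) ∈ $\mathcal{A}$, and $Z_{\mathcal{A}}$ ~ N(0, D).
Let û = argminu V(u). Then we have $\hat{u}_{\mathcal{A}} = D^{-1} Z_{\mathcal{A}}$ and ûjk = 0 for (j, k) ∉ $\mathcal{A}$.
Note that Vn(u) – Vn(0) is convex and so argminu(Vn(u) – Vn(0)) →d argminu V(u). Since $Z_{\mathcal{A}}$ ~ N(0, D), we have $\hat{u}_{\mathcal{A}} \sim N(0, D^{-1})$. Finally, we have that $\sqrt{n}\,(\hat{\beta}^{(n)}_{\mathcal{A}} - \beta^{*}_{\mathcal{A}}) \to_d N(0, D^{-1})$ as n → ∞.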
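For orientation, the decomposition labelled (A.1) typically has the following shape in proofs of this kind (compare Knight and Fu [10] and Zou [25]). The display is a sketch under assumptions, not a transcription of the lost equation: it supposes that the objective in (6) is the C-weighted residual sum of squares plus a weighted L1 penalty, that β̃* denotes the true coefficient vector with u = √n(β̃ − β̃*), and that XᵀX/n → A. The symbols λ_{1,n} and w_{jk} are placeholder names for the tuning parameter and the weights.

```latex
\begin{aligned}
V_n(u) - V_n(0)
  &= u^{T}\Big(\tfrac{1}{n}\,\tilde X^{T}(C \otimes I_n)\tilde X\Big)u
   \;-\; \tfrac{2}{\sqrt{n}}\,u^{T}\tilde X^{T}(C \otimes I_n)\tilde\varepsilon \\
  &\quad + \lambda_{1,n}\sum_{j,k} w_{jk}
     \Big(\big|\beta^{*}_{jk} + \tfrac{u_{jk}}{\sqrt{n}}\big| - \big|\beta^{*}_{jk}\big|\Big), \\[4pt]
\tfrac{1}{n}\,\tilde X^{T}(C \otimes I_n)\tilde X
  &= C \otimes \big(\tfrac{1}{n}X^{T}X\big) \;\longrightarrow\; C \otimes A, \\[4pt]
\operatorname{Var}\!\Big(\tfrac{1}{\sqrt{n}}\,\tilde X^{T}(C \otimes I_n)\tilde\varepsilon\Big)
  &= \tfrac{1}{n}\,\tilde X^{T}(C \otimes I_n)(\Sigma \otimes I_n)(C \otimes I_n)\tilde X
   = C \otimes \big(\tfrac{1}{n}X^{T}X\big).
\end{aligned}
```

The last line uses CΣ = I together with X̃ = Im ⊗ X, and it agrees with the limit Z ~ N(0, C ⊗ A) stated in the text.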
Appendix A.2. Selection consistency
We need to show that ∀(j, k) ∉
,
. For fixed (j, k) ∉
, let
. Then
and so we have that
by the KKT conditions, where x̃jk is the (j + (k − 1)p)-th column of X̃. Therefore,
. Note that
From the asymptotic normality part, we know that converges in distribution to some normal random vector. We also have that , where (C ⊗ A)jk,jk is the (j + (k − 1)p)-th diagonal element of C ⊗ A. As with , we have . Therefore, as n → ∞.
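The indexing and the Kronecker structure used in this appendix can be checked numerically. The sketch below assumes the stacking order β̃ = (β11, …, βp1, …, β1m, …, βpm)T from Appendix A.1, under which βjk occupies position j + (k − 1)p. The helper `flat_index` and the random test matrices are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 50, 3, 2
X = rng.normal(size=(n, p))
L = rng.normal(size=(m, m))
C = L @ L.T + m * np.eye(m)            # an arbitrary positive-definite m x m matrix

X_tilde = np.kron(np.eye(m), X)        # I_m (x) X, shape (n*m, p*m)
W = np.kron(C, np.eye(n))              # C (x) I_n, shape (n*m, n*m)

lhs = X_tilde.T @ W @ X_tilde          # quadratic-form matrix in the stacked problem
rhs = np.kron(C, X.T @ X)              # C (x) X^T X

def flat_index(j, k, p):
    """0-based position of beta_jk (1-based j and k) in the stacked vector."""
    return (j - 1) + (k - 1) * p

# Diagonal entry of C (x) X^T X that belongs to beta_jk, for j = 2, k = 2.
j, k = 2, 2
idx = flat_index(j, k, p)
diag_entry = rhs[idx, idx]
product = C[k - 1, k - 1] * (X.T @ X)[j - 1, j - 1]
```

The identity lhs = rhs is what turns the stacked quadratic form into C ⊗ (XᵀX), and the diagonal entry for βjk factors as ckk times the j-th diagonal entry of XᵀX.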
Appendix B. Proof of Theorem 1
The proof is similar to that of Lemma 1 except we replace C by Ĉ.
Appendix B.1. Asymptotic normality
Note that (8) is equivalent to
Let and
Let and then . We can show that
where Vn(u) is defined in the proof of Lemma 1. As Ĉ is a consistent estimator of C, and . From the proof of Lemma 1, we also know that Vn(u) – Vn(0) →d V(u). By combining the above statements and using the Slutsky’s theorem, we have that . By using the same arguments as in the proof of Lemma 1, finally we have that as n → ∞.
Appendix B.2. Selection consistency
Now it suffices to show that ∀(j, k) ∉
,
as n → ∞. For fixed (j, k) ∉
, let
. Then
and so we have that
by the KKT conditions. Therefore,
. Note that
From the asymptotic normality part and the fact that Ĉ is consistent, we know that converges in distribution to some normal random vector. We also have that . As , we have . Therefore, as n → ∞.
Appendix C. Proof of Lemma 2
Let . With given B*, define Q(C) as
| (C.1) |
Appendix C.1. Selection consistency
Using the definition of Q(C) in (C.1), define Vn(U) as
Using an argument similar to the proof of Theorem 1 in Yuan and Lin [23], it can be shown that
Note that as , and , we have
On the other hand, by the central limit theorem as . Therefore, Vn(U) can be written as
where Wn →d N(0, Λ). Denote by Û the minimizer of Vn(U). Note that λ2,n → ∞ and . Therefore, if , P (ûjk = 0) → 1 as n → ∞. This completes the proof of the variable selection consistency.
Appendix C.2. Asymptotic distribution
Suppose U satisfies that ujk = 0 if . Then, Vn(U) can be written as
By using the Slutsky’s theorem, we have that
Since Vn(U) and V(U) are both convex and V(U) has a unique minimum, argmin Vn(U) →d argmin V(U). From the fact that . This completes the proof of the asymptotic distribution.
Appendix D. Proof of Theorem 2
With a √n-consistent estimator B̂ of B, let . Define Q(C) as
| (D.1) |
By using the above definition, define Vn(U) as
Note that
Therefore, by the proof of Lemma 2 and the Slutsky’s theorem, it suffices to show that
| (D.2) |
The left-hand side of (D.2) can be written as
where we add and subtract XB in the first term. Since , B̂ − B = op(1) and , (D.2) holds.
Appendix E. Proof of Lemma 3
Define Q(B, C) for the jointly penalized likelihood as
| (E.1) |
To show the results, we follow the idea of the proof of Theorem 1 in Fan and Li [6]. It suffices to show that for any given δ > 0, there exists a large constant D such that
| (E.2) |
where U = (vec(U1)T, vec(U2)T)T. Using the definition of Q(B, C) in (E.1), define Vn(U) as
Since for and for ,
| (E.3) |
For the first term and the second term on the right-hand side of (E.3), it has been shown in Lemma 2 that
Let U1 = vec(U1). For the third term on the right-hand side of (E.3), as , note that
For the fourth term on the right-hand side of (E.3), we have
Note that where Z has multivariate normal distribution of dimension n × m. By combining above statements, we have
As and , we have
Therefore,
| (E.4) |
By choosing a sufficiently large D, Vn(U) > 0 uniformly on {U : ||U|| = D} with probability greater than 1 − δ, since C* and A are positive-definite, Wn = Op(1), and Zn = Op(1). Therefore, (E.2) holds. This completes the proof of the lemma.
Appendix F. Proof of Theorem 3
As defined in Lemma 3, define Q(B, C) for the jointly penalized likelihood as
Note that (B̂(n), Ĉ) is a √n-consistent local minimizer of Q(B, C). As B̂(n) = argminB Q(B, Ĉ) and Ĉ is √n-consistent, the oracle properties of B̂(n) hold by Theorem 1. Similarly, since Ĉ = argminC Q(B̂(n), C) and B̂(n) is √n-consistent, the oracle properties of Ĉ hold by Theorem 2. This completes the proof of the theorem.
Contributor Information
Wonyul Lee, Email: wonyull@email.unc.edu.
Yufeng Liu, Email: yfliu@email.unc.edu.
References
1. Banerjee O, Ghaoui LE, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research. 2008;9:485–516.
2. Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37:373–384.
3. Breiman L. Heuristics of instability and stabilization in model selection. The Annals of Statistics. 1996;24:2350–2383.
4. Breiman L, Friedman JH. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society Series B. 1997;59:3–54.
5. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32:407–499.
6. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
7. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.
8. Fu WJ. Penalized regression: The bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
9. Huang JZ, Liu N, Pourahmadi M, Liu L. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika. 2006;93:85–98.
10. Knight K, Fu W. Asymptotics for lasso-type estimators. The Annals of Statistics. 2000;28:1356–1378.
11. Meier L, van de Geer S, Bühlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society Series B. 2008;70:53–71.
12. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
13. Osborne MR, Presnell B, Turlach BA. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis. 2000;20:389–404.
14. Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association. 2009;104:735–746. doi: 10.1198/jasa.2009.0126.
15. Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics. 2008;2:494–515.
16. Rothman AJ, Levina E, Zhu J. Sparse multiple regression with covariance estimation. Journal of Computational and Graphical Statistics. 2010;19:947–962. doi: 10.1198/jcgs.2010.09188.
17. TCGA. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455:1061–1068. doi: 10.1038/nature07385.
18. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B. 1996;58:267–288.
19. Turlach BA, Venables WN, Wright SJ. Simultaneous variable selection. Technometrics. 2005;47:349–363.
20. Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O’Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN TCGA. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and nf1. Cancer Cell. 2010;17:98–110. doi: 10.1016/j.ccr.2009.12.020.
21. Yuan M, Ekici A, Lu Z, Monteiro R. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society Series B. 2007;69:329–346.
22. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B. 2006;68:49–67.
23. Yuan M, Lin Y. Model selection and estimation in the gaussian graphical model. Biometrika. 2007;94:19–35.
24. Zhu Z, Liu Y. Estimating spatial covariance using penalised likelihood with weighted l1 penalty. Journal of Nonparametric Statistics. 2009;21:925–942.
25. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.