Simultaneous Multiple Response Regression and Inverse Covariance Matrix Estimation via Penalized Gaussian Maximum Likelihood

Wonyul Lee; Yufeng Liu

doi:10.1016/j.jmva.2012.03.013

Author manuscript; available in PMC: 2013 Oct 1.

Published in final edited form as: J Multivar Anal. 2012 Apr 27;111:241–255. doi: 10.1016/j.jmva.2012.03.013

PMCID: PMC3392174 NIHMSID: NIHMS374480 PMID: 22791925

Abstract

Multivariate regression is a common statistical tool for practical problems. Many multivariate regression techniques are designed for univariate response cases. For problems with multiple response variables available, one common approach is to apply the univariate response regression technique separately on each response variable. Although it is simple and popular, the univariate response approach ignores the joint information among response variables. In this paper, we propose three new methods for utilizing joint information among response variables. All methods are in a penalized likelihood framework with weighted L₁ regularization. The proposed methods provide sparse estimators of the conditional inverse covariance matrix of the response vector given the explanatory variables, as well as sparse estimators of the regression parameters. Our first approach is to estimate the regression coefficients with plug-in estimated inverse covariance matrices, and our second approach is to estimate the inverse covariance matrix with plug-in estimated regression parameters. Our third approach is to estimate both simultaneously. Asymptotic properties of these methods are explored. Our numerical examples demonstrate that the proposed methods perform competitively in terms of prediction, variable selection, as well as inverse covariance matrix estimation.

Keywords: GLASSO, Inverse covariance matrix estimation, Joint estimation, LASSO, Multiple response, Sparsity

1. Introduction

Parameter estimation and variable selection are two important goals in linear regression analysis. In traditional statistical procedures, these two objectives are often achieved separately. For example, parameter estimation can be done by the least squares regression method and variable selection can be achieved by certain subset selection techniques. However, with a large number of predictors available in practice, these methods may not be feasible. When the dimension gets large, the least squares method may have an overfitting problem which reduces predictive accuracy. When the dimension is larger than the sample size, the least squares regression solution cannot even be calculated directly. In terms of variable selection, the all subset selection method can be unstable because the procedure is not continuous [3], and it can be computationally infeasible when the dimension is large. To solve these problems, a large number of methods have been proposed based on the regularization framework. Some well-known methods include the least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani [18], the nonnegative garrote proposed by Breiman [2], and the smoothly clipped absolute deviation (SCAD) proposed by Fan and Li [6]. These regularized methods can help to avoid overfitting. More importantly, these techniques can perform parameter estimation and variable selection simultaneously.

With multiple response variables available, the standard approach to model them is to regress each response variable separately on the same set of explanatory variables. All marginal univariate regression procedures, including the above methods, can be applied to each response. However, this approach may not be optimal since it does not utilize the information among response variables. To solve this multi-response regression problem, Breiman and Friedman [4] proposed a method called Curds and Whey, which uses the relationship among response variables to improve predictive accuracy. They showed that their method can outperform separate univariate regression approaches when there are correlations among the response variables. However, their method did not address the topic of variable selection. Recently, Yuan et al. [21] proposed a method based on dimension reduction. Their idea is to obtain dimension reduction by encouraging sparsity among singular values of the parameter matrix. However, their approach focuses on dimension reduction rather than variable selection. Thus, it does not give a subset of explanatory variables for each response. Variable selection can be a very important issue when the number of explanatory variables is large or when explanatory variables are highly correlated. To address variable selection, Turlach et al. [19] proposed a penalized method using the max-L₁ penalty to select a common subset of explanatory variables for multiple response regression. Their method aims to select a subset which can be used as predictors for all response variables. However, this assumption may be too strong when each response depends on a different set of explanatory variables.

Recently, Rothman et al. [16] proposed a penalized log-likelihood approach with the multivariate Gaussian assumption. In this paper, we further extend their method and propose three approaches to tackle the multiple response regression problem via utilizing the joint information among multiple response variables. To handle the problem, we need to estimate two parameter matrices, the regression parameter matrix B and the conditional inverse covariance matrix of response variables C = Σ⁻¹. The first two approaches are plug-in methods, i.e., plugging in an estimator of one parameter matrix to solve the other one. The third approach tries to jointly estimate both parameter matrices. In particular, the first proposed method maximizes a sparse penalized log-likelihood using a previously estimated inverse covariance matrix Ĉ. Similarly, the second proposed method maximizes a sparse penalized log-likelihood using a previously estimated regression parameter matrix B̂. The last proposed method simultaneously estimates regression parameters and the inverse covariance matrix by maximizing a doubly penalized joint likelihood function. These methods involve two penalty terms: the weighted L₁ penalty on the inverse covariance matrix C and the weighted L₁ penalty on the regression parameter matrix B. Note that the joint approach is more general than that of Rothman et al. [16], which used unweighted L₁ penalty terms, because our framework allows flexible weights on the penalty terms. To handle the computational difficulty of high dimensional problems, we recommend a prescreening procedure to eliminate noise variables before further estimation.

In the following sections, we describe the proposed methods in more detail, with theoretical justification and numerical examples. In Section 2, we introduce our proposed methodology. Section 3 explores the corresponding theoretical properties. Section 4 develops coordinate descent computational algorithms to obtain solutions for the proposed methods. A prescreening step is suggested for the joint method to speed up the computation. Section 5 provides some brief results of our numerical examples. We conclude the paper with some discussion in Section 6. The proofs of the theorems are provided in the Appendix.

2. Methodology

Consider the regression problem of p covariates and m response variables. Suppose the data contain n observations. Let y_i = (y_i₁, …, y_im)^T; i = 1, …, n, be m-dimensional responses and Y = [y₁, …, y_n]^T be the n × m response matrix. Let x_i = (x_i₁, …, x_ip)^T; i = 1, …, n, be p-dimensional predictors and X = [x₁, …, x_n]^T be the n × p design matrix. For notational simplicity, let y^k = (y_1k, …, y_nk)^T be the k-th response vector (k = 1, …, m) and x^j = (x_1j, …, x_nj)^T be the j-th predictor (j = 1, …, p). Consider the following model,

Y = XB + e, with e = {[ε_{1}, \dots, ε_{n}]}^{T},

where B = {β_jk}; j = 1, …, p, k = 1, …, m, is an unknown p × m parameter matrix. The errors ε_i = (ε_i₁, …, ε_im)^T; i = 1, …, n, are i.i.d. m-dimensional random vectors following a multivariate normal distribution N(0, Σ) with the nonsingular covariance matrix Σ.

Our goal is to estimate B so that we can use X to predict Y. A simple way to estimate B is to build m single response models separately; the resulting least squares solution is denoted by B̂_S = (X^T X)⁻¹X^T Y, provided that X^T X is nonsingular. However, this approach ignores information on Σ. When Σ is diagonal, this separate modeling approach can work well. However, when Σ is not diagonal, there can be strong correlations among the response variables, and the separate modeling approach does not make use of this joint information. To produce a better estimator, we consider incorporating Σ in the estimation procedure of B. Denote Σ⁻¹ by C. If we assume that Σ is known, the log-likelihood for B conditional on X is

- \frac{1}{2} tr {(Y - XB) C {(Y - XB)}^{T}},

(1)

up to a constant not depending on B. Interestingly, although the maximum likelihood function involves Σ, the corresponding maximum likelihood estimate turns out to be identical to the least squares estimate using the separate maximum likelihood method. This implies that the maximizer of (1) does not take any advantage from the known information on Σ. However, when we impose penalties on the likelihood, the joint method can bring some advantage in estimation.
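The claim that the unpenalized maximizer of (1) does not depend on C can be checked numerically. The following is a minimal sketch on simulated data (the sample sizes, seed, and ρ = 0.8 are illustrative assumptions, not values from the paper): the gradient of the negative of (1) with respect to B is −X^T (Y − XB) C, which vanishes at the least squares solution for any C.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 50, 3, 2                      # illustrative sizes (assumed)
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, m))
rho = 0.8                               # illustrative correlation (assumed)
C = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))

def neg_loglik(B, C):
    """Negative of (1): 0.5 * tr{(Y - XB) C (Y - XB)^T}."""
    R = Y - X @ B
    return 0.5 * np.trace(R @ C @ R.T)

# Separate least squares solution (X^T X)^{-1} X^T Y.
B_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient of neg_loglik at B_ols; it is zero because X^T (Y - X B_ols) = 0.
grad = -X.T @ (Y - X @ B_ols) @ C
print(np.max(np.abs(grad)))
```

Because the normal equations X^T (Y − XB) = 0 do not involve C, the known covariance only starts to matter once a penalty is added, which is the point made in the text.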

In this paper, we propose to build multivariate regression models through joint shrinkage. The goal is to utilize the joint information among the m response variables to improve estimation and prediction. Since Σ is involved in the joint estimation and it is often unknown, we consider three different approaches: two plug-in methods and the doubly penalized approach. The plug-in approach in Section 2.1 uses some estimator Ĉ for C to plug in the penalized likelihood function and then estimate B jointly. The plug-in approach in Section 2.2 estimates C after plugging in a reasonable estimator of B. The doubly penalized approach in Section 2.3 estimates C and B simultaneously via regularizing the estimation of both C and B.

For discussion, we first assume that Σ is known. To regress Y on X, we can model them separately, such as applying the LASSO for m different responses. Alternatively, we can use joint shrinkage estimation for the m response variables simultaneously. To demonstrate the difference between separate shrinkage and joint shrinkage, we consider a simple toy example for illustration. Suppose that m = 2, p = 1, and X^T X = 1. Let ${\hat{B}}_{S} = ({\hat{β}}_{11}^{S}, {\hat{β}}_{12}^{S})$ be the least squares solution and assume that both ${\hat{β}}_{11}^{S}$ and ${\hat{β}}_{12}^{S}$ are positive and $Σ = \begin{pmatrix} 1 & ρ \\ ρ & 1 \end{pmatrix}$ . With the penalty parameter λ, the separate LASSO solution is given by

\begin{array}{l} {\hat{β}}_{1 m}^{LASSO} = \underset{β_{1 m}}{argmin} {{(y^{m} - X β_{1 m})}^{T} (y^{m} - X β_{1 m}) + λ ∣ β_{1 m} ∣} \\ = {[{\hat{β}}_{1 m}^{S} - \frac{λ}{2}]}_{+}; m = 1, 2, \end{array}

(2)

where [u]₊ = u if u ≥ 0 and [u]₊ = 0 if u < 0. In the joint shrinkage estimation, however, the solution is given by

\underset{B}{argmin} [tr {(Y - XB) C {(Y - XB)}^{T}} + λ ∣ β_{11} ∣ + λ ∣ β_{12} ∣] .

(3)

We can show that (3) is equivalent to

\underset{B}{argmin} [(B - {\hat{B}}_{S}) C {(B - {\hat{B}}_{S})}^{T} + λ ∣ β_{11} ∣ + λ ∣ β_{12} ∣]

(4)

and the solution of (4) is given by

{\hat{β}}_{1 m} = {[{\hat{β}}_{1 m}^{S} - \frac{λ}{2} (1 + ρ)]}_{+}; m = 1, 2.

(5)

Compared with the separate LASSO solution (2), the solution (5) obtains more shrinkage if ρ is positive, while negative ρ results in less shrinkage. Figure 1 provides some insight into why the amount of shrinkage changes with ρ for the joint method. Solid curves in Figure 1 are contour curves of (B – B̂_S)C(B – B̂_S)^T as a quadratic function of B and dashed lines correspond to the penalty function. When ρ is positive, the quadratic function increases more slowly along the line at 45° to the horizontal axis than when ρ is zero. Note that the solution of the joint method with ρ = 0 is identical to the separate LASSO solution. Thus, the solution of (4) can be closer to the origin, with more shrinkage, than the solution with ρ = 0. On the other hand, the quadratic function with negative ρ increases faster along the same 45° line. Thus, the solution of (4) tends to be closer to the least squares solution than the solution with ρ = 0. Therefore, the joint method can help us to produce more accurate estimators via utilizing the joint information through C.
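The effect of ρ on the amount of shrinkage can be reproduced numerically. The sketch below compares the closed form (5) with a simple coordinate-wise minimization of (4); the function names and the test values (least squares solution (1.0, 0.8), λ = 0.4) are illustrative assumptions, and the comparison is only meaningful when both shrunken coefficients stay positive, as in the toy example.

```python
import numpy as np

def soft_threshold(u, t):
    """Soft-thresholding operator sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * max(abs(u) - t, 0.0)

def joint_shrinkage_closed_form(b_ls, lam, rho):
    """Formula (5): subtract (lam / 2) * (1 + rho) and truncate at zero."""
    return np.maximum(np.asarray(b_ls, dtype=float) - 0.5 * lam * (1.0 + rho), 0.0)

def joint_shrinkage_numeric(b_ls, lam, rho, n_sweeps=500):
    """Minimize (4) by cycling over the two coordinates (illustrative solver)."""
    b_ls = np.asarray(b_ls, dtype=float)
    C = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
    b = b_ls.copy()
    for _ in range(n_sweeps):
        for k in range(2):
            other = 1 - k
            # Unpenalized minimizer in coordinate k with the other coordinate fixed.
            u = b_ls[k] - C[k, other] * (b[other] - b_ls[other]) / C[k, k]
            b[k] = soft_threshold(u, lam / (2.0 * C[k, k]))
    return b

b_ls, lam = (1.0, 0.8), 0.4
for rho in (-0.5, 0.0, 0.5):
    print(rho, joint_shrinkage_closed_form(b_ls, lam, rho),
          joint_shrinkage_numeric(b_ls, lam, rho))
```

With ρ = 0 the result coincides with the separate LASSO solution (2); positive ρ subtracts more than λ/2 from each coefficient and negative ρ subtracts less.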

Figure 1. Contour plots for the toy example to illustrate the change of shrinkage with ρ for the joint method.

We propose three approaches, including two plug-in methods and one joint method. In Sections 2.1 and 2.2, we introduce two different plug-in penalized likelihood methods, one is for multiple response regression and the other one is for inverse covariance estimation. In the plug-in method for multiple response regression, we estimate C prior to the step of regression and then use the estimator of C to produce a better estimator of B. In the plug-in method for inverse covariance estimation, we estimate B first and then estimate C with the estimator B̂ available. In Section 2.3, we estimate B and C together via double penalization. Section 2.4 provides some guidance on three proposed methods and model selection.

2.1. Plug-in Joint Weighted LASSO Estimator

To ensure that estimation of B includes the information on Σ, we propose a joint penalized likelihood method, namely the plug-in joint weighted LASSO (PWL) estimator. In particular, the corresponding penalized likelihood function is as follows

tr {(Y - XB) C {(Y - XB)}^{T}} + λ_{1} \sum_{j, k} w_{j k} ∣ β_{j k} ∣ .

(6)

Here λ₁ is a tuning parameter and w_jk ≥ 0; j = 1, …, p, k = 1, …, m, are prespecified weights for the L₁-penalty of β_jk. If C is an m × m diagonal matrix with diagonal entries ( $σ_{1}^{2}, \dots, σ_{m}^{2}$ ), then y¹, …, y^m are mutually independent. In that case, the minimizer of (6) is equivalent to the weighted LASSO solution obtained by applying the weighted LASSO separately to each response vector y^k with the penalty parameter $λ_{1} / σ_{k}^{2} (k = 1, \dots, m)$ . However, if C is not diagonal, the minimizer of (6) can be different from the separate penalized likelihood method which handles each response vector y^k separately. Our numerical examples indicate that the joint method can be more accurate when the response variables are highly correlated.
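The decoupling for diagonal C can be illustrated by evaluating both objectives at the same B. This is a small sketch with hypothetical function names; `c_diag` holds the diagonal entries of C, and the separate objective is written as a sum over responses of c_kk times a weighted LASSO objective with penalty parameter λ₁ / c_kk.

```python
import numpy as np

def pwl_objective(B, X, Y, C, lam1, W):
    """Objective (6): tr{(Y - XB) C (Y - XB)^T} + lam1 * sum_jk w_jk |beta_jk|."""
    R = Y - X @ B
    return np.trace(R @ C @ R.T) + lam1 * np.sum(W * np.abs(B))

def separate_objective(B, X, Y, c_diag, lam1, W):
    """Sum over responses of c_kk * (RSS_k + (lam1 / c_kk) * weighted L1 norm)."""
    total = 0.0
    for k, c_kk in enumerate(c_diag):
        r = Y[:, k] - X @ B[:, k]
        penalty = np.sum(W[:, k] * np.abs(B[:, k]))
        total += c_kk * (r @ r + (lam1 / c_kk) * penalty)
    return total
```

Since each summand involves only the k-th column of B, minimizing the sum is the same as solving m separate weighted LASSO problems. With a non-diagonal C the cross terms couple the columns and this separation no longer holds.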

In practice, C is often not available. Thus, we need to estimate it. To estimate C, we assume that $z_{i} = {(y_{i}^{T}, x_{i}^{T})}^{T}$ is an (m+p)-dimensional random vector following a multivariate normal distribution $N(μ, Σ_{y, x})$, where $Σ_{y, x} = \begin{pmatrix} Σ_{y, y} & Σ_{y, x} \\ Σ_{x, y} & Σ_{x, x} \end{pmatrix}$ . Because Σ is the covariance matrix of y_i conditioned on x_i, it can be expressed by $Σ = Σ_{y, y} - Σ_{y, x} Σ_{x, x}^{- 1} Σ_{x, y}$ . Therefore, we can estimate Σ by first estimating $Σ_{y, x}$. To estimate $Σ_{y, x}$, we adapt the Graphical LASSO (GLASSO) method proposed by Friedman et al. [7]. The GLASSO method considers the problem of estimating the inverse covariance matrix in the context of sparse Gaussian graphical models [12]. This technique was also considered by Yuan and Lin [23], Banerjee et al. [1] and Rothman et al. [15].

The GLASSO estimator, $\hat{Σ}_{y, x}^{- 1}$ , is given as the minimizer of the following penalized likelihood function

- log det (Σ_{y, x}^{- 1}) + \frac{1}{n} \sum_{i = 1}^{n} {(z_{i} - \bar{z})}^{T} Σ_{y, x}^{- 1} (z_{i} - \bar{z}) + λ_{0} \| Σ_{y, x}^{- 1} \| .

(7)

Here z̄ is the sample mean, $\| Σ_{y, x}^{- 1} \|$ is the sum of the absolute values of the off-diagonal elements of $Σ_{y, x}^{- 1}$, and λ₀ is a tuning parameter.

The PWL method is a two-step procedure. With the estimate Σ̂ available, the PWL method solves the following problem

\underset{B}{argmin} [tr {(Y - XB) \hat{C} {(Y - XB)}^{T}} + λ_{1} \sum_{j, k} w_{j k} ∣ β_{j k} ∣],

(8)

where $\hat{Σ}_{y, x} = \begin{pmatrix} \hat{Σ}_{y, y} & \hat{Σ}_{y, x} \\ \hat{Σ}_{x, y} & \hat{Σ}_{x, x} \end{pmatrix}$, $\hat{Σ} = \hat{Σ}_{y, y} - \hat{Σ}_{y, x} \hat{Σ}_{x, x}^{- 1} \hat{Σ}_{x, y}$ and Ĉ = Σ̂⁻¹.
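The step from an estimated joint covariance to the plug-in Ĉ is a Schur complement. The sketch below illustrates only that step: as a simplifying assumption it uses the sample covariance of z_i in place of the GLASSO estimate from (7), and the function name and simulated data are hypothetical.

```python
import numpy as np

def conditional_precision(S_joint, m):
    """Given a joint covariance of (y, x) with y in the first m coordinates,
    return (C_hat, Sigma_hat) where Sigma_hat = S_yy - S_yx S_xx^{-1} S_xy."""
    S_yy = S_joint[:m, :m]
    S_yx = S_joint[:m, m:]
    S_xx = S_joint[m:, m:]
    Sigma_hat = S_yy - S_yx @ np.linalg.solve(S_xx, S_yx.T)
    return np.linalg.inv(Sigma_hat), Sigma_hat

# Simulated data (sizes, seed, and error correlation 0.6 are assumptions).
rng = np.random.default_rng(1)
n, p, m = 200, 3, 2
X = rng.normal(size=(n, p))
B_true = rng.normal(size=(p, m))
E = rng.multivariate_normal(np.zeros(m), [[1.0, 0.6], [0.6, 1.0]], size=n)
Y = X @ B_true + E

Z = np.hstack([Y, X])                           # rows are z_i^T = (y_i^T, x_i^T)
S_joint = np.cov(Z, rowvar=False, bias=True)    # stand-in for the GLASSO estimate
C_hat, Sigma_hat = conditional_precision(S_joint, m)
print(Sigma_hat)
```

With the plain sample covariance, Σ̂ equals the residual covariance from a least squares fit with intercept, so nothing is gained over separate regression at this stage; the sparsity-inducing penalty in (7) is what makes the GLASSO-based Ĉ different.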

2.2. Plug-in Weighted Graphical LASSO Estimator

In Section 2.1, we propose a plug-in method, PWL, which estimates C first and then estimates B given Ĉ. In this section, we propose another plug-in method to estimate C. In particular, we first estimate B by using univariate regression techniques. With the estimator B̂ available, we propose a penalized likelihood method, the plug-in weighted graphical LASSO (PWGL) estimator, by solving

\underset{C}{argmin} [- n log det (C) + tr {(Y - X \hat{B}) C {(Y - X \hat{B})}^{T}} + λ_{2} \sum_{s \neq t} v_{s t} ∣ c_{s t} ∣],

(9)

where C = {c_st}; s = 1, …, m, t = 1, …, m. Here λ₂ is a tuning parameter and v_st ≥ 0; s = 1, …, m, t = 1, …, m, are prespecified weights for the L₁ penalty of c_st.

2.3. Doubly Penalized Maximum Likelihood Estimator

In Sections 2.1 and 2.2, we propose two plug-in methods. PWL estimates C first and then estimates B given Ĉ while PWGL estimates B first and then estimates C given B̂. In this section, we propose to estimate (B, C) simultaneously. Since y_i|x_i ~ N(B^Tx_i, Σ), the log-likelihood of (B, C) conditional on X is

\frac{n}{2} log det (C) - \frac{1}{2} tr {(Y - XB) C {(Y - XB)}^{T}} .

(10)

It can be shown that the maximum likelihood estimator of B is also given by (X^T X)⁻¹X^TY. Interestingly, the resulting estimator of B is the same as the ordinary least squares estimator, which can be obtained without using the information on the relationship among the response vectors y¹, …, y^m. To incorporate the information among different response variables in estimation of B, we propose a joint penalized method, the doubly penalized maximum likelihood (DML) estimator, by solving

\underset{B, C}{argmin} [- n log det (C) + tr {(Y - XB) C {(Y - XB)}^{T}} + λ_{1} \sum_{j, k} w_{j k} ∣ β_{j k} ∣ + λ_{2} \sum_{s \neq t} v_{s t} ∣ c_{s t} ∣]

(11)

Note that the objective function in (11) is not convex with respect to (B, C). The corresponding optimization can sometimes be unstable when p ≥ n. This is because the first term in (11) can dominate the other terms if some diagonal elements of (Y − XB)^T (Y − XB) are zero, which may occur when p ≥ n. To see this, take a diagonal matrix C and increase the values of its diagonal elements corresponding to the zero diagonal entries of (Y − XB)^T (Y − XB): the trace term and the penalty on the off-diagonal entries of C are unaffected, while − n log det (C) decreases without bound. As a result, the numerical solution of C in (11) can have some large diagonal entries. In practice, the solution of C with very large diagonal entries is not desirable as it leads to very small residual variances of the corresponding response variables. We recommend first using the plug-in method in Section 2.1 or separate modeling methods to screen the variables and reduce the dimensions. Then one can apply the joint method on the reduced set of variables. As shown in our simulation examples, the joint method can often outperform the plug-in methods when p is moderate compared to n.
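The instability for p ≥ n can be seen by evaluating the objective in (11) directly. The following sketch uses a hypothetical helper `dml_objective` and a simulated case with p = n, where B can be chosen so that all residuals vanish; scaling a diagonal C then lowers the objective without limit.

```python
import numpy as np

def dml_objective(B, C, X, Y, lam1, lam2, W, V):
    """Objective in (11); the penalty on C covers off-diagonal entries only."""
    n = X.shape[0]
    R = Y - X @ B
    _, logdet = np.linalg.slogdet(C)
    off_diag = ~np.eye(C.shape[0], dtype=bool)
    return (-n * logdet + np.trace(R @ C @ R.T)
            + lam1 * np.sum(W * np.abs(B))
            + lam2 * np.sum(V[off_diag] * np.abs(C[off_diag])))

# Simulated square design (p = n), so Y = XB can be solved exactly (assumed setup).
rng = np.random.default_rng(2)
n = p = 4
m = 2
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, m))
B_fit = np.linalg.solve(X, Y)           # interpolating fit: residuals are zero
W = np.ones((p, m))
V = np.ones((m, m))

# Objective along C = t * I for growing t; it keeps decreasing.
values = [dml_objective(B_fit, t * np.eye(m), X, Y, 1.0, 1.0, W, V)
          for t in (1.0, 1e3, 1e6)]
print(values)
```

This is only an illustration of the degenerate direction described above; it does not by itself show where a numerical optimizer would end up.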

2.4. Model Selection

The two plug-in methods are preferable if one of B and C is of main interest and the other is already well estimated. Another advantage of the two plug-in methods is that they have lower computational cost than the joint method. On the other hand, the joint method does not require a good estimate of B or C. Even though the joint method is computationally more intensive, it often performs better than the two plug-in methods in the sense that it optimizes the log-likelihood of (B, C) jointly.

The tuning parameters λ₁ and λ₂ in (8), (9) and (11) control the sparsity of the resulting estimators of (B, C). They can be selected either using validation sets or through K-fold cross-validation. The K-fold cross-validation method randomly splits the dataset into K segments of equal sizes. For the k-th fold, we denote by ( ${\hat{B}}_{λ_{1}}^{(- k)}, {\hat{C}}_{λ_{2}}^{(- k)}$ ) the estimated regression parameter matrix and the estimated inverse covariance matrix obtained with the tuning parameters λ₁ and λ₂ from all data excluding those in the k-th segment. We also denote the data in the k-th segment as $(Y^{(k)}, X^{(k)})$. Specifically, for the PWL method, we select the optimal tuning parameter λ̂₁ which minimizes the prediction error as follows,

CV (λ_{1}) = \sum_{k = 1}^{K} {| | Y^{(k)} - X^{(k)} {\hat{B}}_{λ_{1}}^{(- k)} | |}_{F}^{2},

(12)

where ${| | \cdot | |}_{F}^{2}$ is the squared Frobenius norm of a matrix. For the PWGL method, we select the optimal tuning parameter λ̂₂ which minimizes the predictive negative log-likelihood as follows,

CV (λ_{2}) = \sum_{k = 1}^{K} [- n_{k} log det ({\hat{C}}_{λ_{2}}^{(- k)}) + tr {(Y^{(k)} - X^{(k)} \hat{B}) {\hat{C}}_{λ_{2}}^{(- k)} {(Y^{(k)} - X^{(k)} \hat{B})}^{T}}],

(13)

where n_k is the sample size of the k-th segment. For the DML method, we first select the optimal λ̂₁ by using (12) with a prespecified λ₂ and select λ̂₂ by using (13) with the selected optimal λ̂₁. It helps to avoid a two dimensional grid search of (λ₁, λ₂). We have found in simulations that the selected optimal λ̂₁s are almost identical for a wide range of prespecified λ₂.

In the use of validation sets, we split the dataset into two parts, the training set and the validation set. With a pair of (λ₁, λ₂), we first estimate (B, C) using the training set. The prediction error and the predictive negative log-likelihood of the resulting estimator are then obtained by using the validation set in place of $(Y^{(k)}, X^{(k)})$ in (12) and (13). The validation set is not used to construct the final estimator with the selected (λ̂₁, λ̂₂), while the K-fold cross-validation uses all data for the final estimator with (λ̂₁, λ̂₂).
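The two cross-validation criteria (12) and (13) can be sketched as follows. The estimators themselves are passed in as callbacks, `fit_B(X_train, Y_train)` and `fit_C(residuals_train)`; these names and signatures are hypothetical placeholders for whichever PWL or PWGL solver is used at a fixed tuning parameter, and the folds are formed by a random permutation, so they are only approximately equal in size.

```python
import numpy as np

def kfold_indices(n, K, seed=0):
    """Randomly split indices 0..n-1 into K nearly equal segments."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), K)

def cv_prediction_error(X, Y, fit_B, K=5, seed=0):
    """Criterion (12): summed squared Frobenius norm of held-out residuals."""
    n = X.shape[0]
    total = 0.0
    for held_out in kfold_indices(n, K, seed):
        train = np.setdiff1d(np.arange(n), held_out)
        B_hat = fit_B(X[train], Y[train])            # estimate without fold k
        total += np.sum((Y[held_out] - X[held_out] @ B_hat) ** 2)
    return total

def cv_neg_loglik(X, Y, B_hat, fit_C, K=5, seed=0):
    """Criterion (13): predictive negative log-likelihood with B_hat held fixed."""
    n = X.shape[0]
    total = 0.0
    for held_out in kfold_indices(n, K, seed):
        train = np.setdiff1d(np.arange(n), held_out)
        C_hat = fit_C(Y[train] - X[train] @ B_hat)   # estimate without fold k
        R = Y[held_out] - X[held_out] @ B_hat
        _, logdet = np.linalg.slogdet(C_hat)
        total += -len(held_out) * logdet + np.trace(R @ C_hat @ R.T)
    return total
```

In use, one would evaluate `cv_prediction_error` over a grid of λ₁ values (each defining a different `fit_B`) and keep the minimizer, then do the same for λ₂ with `cv_neg_loglik`, mirroring the sequential search described for the DML method.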

3. Asymptotic Properties

To investigate a sparse regression technique, it is necessary to investigate its asymptotic behaviors. Fan and Li [6] pointed out that a good variable selection procedure should have oracle properties. Asymptotically with probability tending to 1, a procedure with oracle properties can identify the true underlying subset of predictor variables. The resulting estimator of the procedure also asymptotically performs as well as if the true underlying subset were known in advance. In this section, we study the asymptotic behavior of our three proposed methods. In particular, we show that with a proper choice of (λ₁, λ₂), all three methods enjoy the oracle properties.

For the asymptotic analysis, we use the set-up of Fan and Li [6], Yuan and Lin [23] and Zou [25]. The technical derivation uses the results in Knight and Fu [10]. Let $B^{*} = (β_{j k}^{*})$ ; j = 1, …, p, k = 1, …, m, be the true regression parameter matrix and $C^{*} = (c_{s t}^{*})$ ; s = 1, …, m, t = 1, …, m, be the true inverse covariance matrix. Let $\mathcal{A} = \{(j, k) : β_{j k}^{*} \neq 0\}$ and $\mathcal{C} = \{(s, t) : c_{s t}^{*} \neq 0\}$ . Then we assume the following conditions for our theoretical results:

(A1) $\frac{1}{n} X^{T} X \to A$, where A is a positive definite matrix.
(A2) The set $\mathcal{A}$ of nonzero entries of B* has cardinality $|\mathcal{A}| = q_{1} > 0$.
(A3) There exists β̃_jk which is a $\sqrt{n}$ -consistent estimator of $β_{j k}^{*}$ ; j = 1, …, p, k = 1, …, m.
(A4) The set $\mathcal{C}$ of nonzero entries of C* has cardinality $|\mathcal{C}| = q_{2} > 0$.
(A5) There exists c̃_st which is a $\sqrt{n}$ -consistent estimator of $c_{s t}^{*}$ ; s = 1, …, m, t = 1, …, m.

Note that conditions (A3) and (A5) are generally satisfied by maximum likelihood estimators or L₂ regularized maximum likelihood estimators with proper choices of penalty parameters. For example, the least squares estimator of B can be used as the β̃_jk's and the inverse of the residual sample covariance matrix can be used as the c̃_st's. For the theoretical analysis, we define w_jk and v_st as $w_{j k} = \frac{1}{{∣ {\tilde{β}}_{j k} ∣}^{γ}}$ ; j = 1, …, p, k = 1, …, m, γ > 0, and $v_{s t} = \frac{1}{∣ {\tilde{c}}_{s t} ∣}$ ; s = 1, …, m, t = 1, …, m, respectively.
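As a rough illustration of these weights, the sketch below builds them from the least squares estimator of B and the inverse of the residual sample covariance matrix, the two initial estimators mentioned above. The function name is hypothetical, the code assumes n > p so that X^T X is invertible, and it makes no attempt to guard against initial estimates that are exactly zero.

```python
import numpy as np

def adaptive_weights(X, Y, gamma=1.0):
    """Return (W, V, B_tilde, C_tilde) with w_jk = 1 / |b_jk|^gamma and
    v_st = 1 / |c_st|, computed from least squares initial estimates."""
    n = X.shape[0]
    B_tilde = np.linalg.solve(X.T @ X, X.T @ Y)     # least squares estimator of B
    R = Y - X @ B_tilde
    C_tilde = np.linalg.inv(R.T @ R / n)            # inverse residual covariance
    W = 1.0 / np.abs(B_tilde) ** gamma
    V = 1.0 / np.abs(C_tilde)
    return W, V, B_tilde, C_tilde
```

Entries with small initial estimates receive large weights and are therefore penalized more heavily, which is the mechanism behind the selection consistency results in this section.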

In Sections 3.1 and 3.2, we show the plug-in estimators enjoy the oracle properties. Section 3.3 develops the asymptotic theory that reveals the oracle properties of the DML solution.

3.1. Oracle properties of the PWL solution

In this section, we first show that with the known C*, the minimizer of (6) is consistent in variable selection and has the asymptotic normality. Then we show that with a consistent estimator of C*, the PWL estimator also enjoys the same properties.

Define the true regression parameter vector as $β^{*} = {(β_{11}^{*}, \dots, β_{p 1}^{*}, \dots, β_{1 m}^{*}, \dots, β_{p m}^{*})}^{T}$ . Let ${\hat{β^{1}}}^{(n)}$ be the estimator of β* obtained by minimizing (6) with the penalty parameter $λ_{1, n}$. Let $β_{A}^{*}$ be the q₁-dimensional true parameter vector which consists of the nonzero components of β*, and let ${\hat{β^{1}}}_{A}^{(n)}$ be the corresponding estimator of $β_{A}^{*}$ . Let D be the q₁ × q₁ matrix obtained from C* ⊗ A by removing the (j + (k − 1)p)-th row and column for every (j, k) with $β_{j k}^{*} = 0$ . Then the following lemma shows the oracle properties of the penalized likelihood estimator ${\hat{β^{1}}}^{(n)}$ with the known C*, as the minimizer of (6) defined previously.

Lemma 1

(Oracle properties of the minimizer of (6), ${\hat{β^{1}}}^{(n)}$ , with the known C*) Suppose that $λ_{1, n} n^{- \frac{1}{2}} \to 0$ and $λ_{1, n} n^{\frac{γ - 1}{2}} \to \infty$ as n → ∞. Under the conditions (A1)–(A3), we have the following results:

(Selection consistency) ${lim}_{n} P ({\hat{β^{1}}}_{j k}^{(n)} = 0) = 1$ if $β_{j k}^{*} = 0$ ;
(Asymptotic normality) $\sqrt{n} ({\hat{β^{1}}}_{A}^{(n)} - β_{A}^{*}) \to_{d} N (0, D^{- 1})$ .

Lemma 1 tells us that the penalized maximum likelihood estimator with the known C* satisfies the oracle properties. Since C* is typically unknown in practice, one often uses an estimator for C*. With slight modification of Lemma 1, we can show that the PWL solution also enjoys the oracle properties. Denote the PWL estimator of β* with the penalty parameter $λ_{1, n}$ as ${\hat{β^{2}}}^{(n)}$ . Let ${\hat{β^{2}}}_{A}^{(n)}$ be the corresponding estimator of $β_{A}^{*}$ .

Theorem 1

(Oracle properties of the PWL solution) In addition to the assumptions in Lemma 1, suppose that Ĉ is a consistent estimator of C*. Under the conditions (A1)–(A3), we have the following results:

(Selection consistency) ${lim}_{n} P ({\hat{β^{2}}}_{j k}^{(n)} = 0) = 1$ if $β_{j k}^{*} = 0$ ;
(Asymptotic normality) $\sqrt{n} ({\hat{β^{2}}}_{A}^{(n)} - β_{A}^{*}) \to_{d} N (0, D^{- 1})$ .

Theorem 1 states that with a consistent estimator of C*, variable selection in the PWL is consistent and the resulting estimator still enjoys the asymptotic normality.

3.2. Oracle properties of the PWGL solution

In this section, we show the oracle properties of the PWGL solution. To this end, we first show the oracle properties of the solution of

\underset{C}{argmin} [- n log det (C) + tr {(Y - X B^{*}) C {(Y - X B^{*})}^{T}} + λ_{2} \sum_{j \neq k} v_{j k} ∣ c_{j k} ∣],

(14)

with the known B*. Then we show that with a consistent estimator of B*, the PWGL estimator still enjoys the same properties.

Denote by Ĉ⁽¹⁾ the minimizer of (14) with the known B*. Let ${\hat{C}}_{0}^{(1)}$ be the matrix obtained from Ĉ⁽¹⁾ by replacing ${\hat{c}}_{j k}^{(1)}$ with 0 if $c_{j k}^{*} = 0$ . Then the following lemma shows the oracle properties of Ĉ⁽¹⁾.

Lemma 2

(Oracle properties of the minimizer of (14), Ĉ⁽¹⁾, with known B*) Suppose that $λ_{2, n} n^{- \frac{1}{2}} \to 0$ and $λ_{2, n} \to \infty$ as n → ∞. Under the conditions (A1), (A4) and (A5), we have the following results:

(Selection consistency) ${lim}_{n} P ({\hat{c}}_{j k}^{(1)} = 0) = 1$ if $c_{j k}^{*} = 0$ ;
(Asymptotic distribution) $\sqrt{n} ({\hat{C}}_{0}^{(1)} - C^{*}) \to_{d} arg min V (U)$ ,

where V(U) = tr(UΣUΣ) + tr(UW) and W is an m × m random symmetric matrix such that vec(W) ~ N(0, Λ) in which $cov(w_{i j}, w_{k l}) = cov(ε_{1 i} ε_{1 j}, ε_{1 k} ε_{1 l})$. The minimum is taken over all symmetric matrices U satisfying u_jk = 0 if $c_{j k}^{*} = 0$ .

In Lemma 2, we show that the penalized maximum likelihood estimator with the known B* satisfies the oracle properties. Since B* is typically unknown in practice, one often applies a univariate regression technique to obtain an estimator for B*. With slight modification of Lemma 2, we can show that the PWGL solution also enjoys the oracle properties. Denote the PWGL estimator of C* with the penalty parameter $λ_{2, n}$ as Ĉ⁽²⁾. Let ${\hat{C}}_{0}^{(2)}$ be the matrix obtained from Ĉ⁽²⁾ by replacing ${\hat{c}}_{j k}^{(2)}$ with 0 if $c_{j k}^{*} = 0$ . Then the following theorem shows the oracle properties of the PWGL estimator.

Theorem 2

(Oracle properties of the PWGL solution) In addition to the assumptions in Lemma 2, suppose that B̂ is a consistent estimator of B*. Under the above conditions, we have the following results:

(Selection consistency) ${lim}_{n} P ({\hat{c}}_{j k}^{(2)} = 0) = 1$ if $c_{j k}^{*} = 0$ ;
(Asymptotic distribution) $\sqrt{n} ({\hat{C}}_{0}^{(2)} - C^{*}) \to_{d} arg min V (U)$ ,

where V(U) = tr(UΣUΣ) + tr(UW) and W is an m × m random symmetric matrix such that vec(W) ~ N(0, Λ) in which $cov(w_{i j}, w_{k l}) = cov(ε_{1 i} ε_{1 j}, ε_{1 k} ε_{1 l})$. The minimum is taken over all symmetric matrices U satisfying u_jk = 0 if $c_{j k}^{*} = 0$ .

Theorem 2 states that with a consistent estimator of B*, the PWGL solution satisfies the oracle properties.

3.3. Oracle properties of the DML solution

In Sections 3.1 and 3.2, we establish the oracle properties of plug-in estimators. In this section, we explore oracle properties of the DML solution in which (B̂, Ĉ) are obtained together. First, we show that with a proper choice of (λ₁, λ₂), there exists a $\sqrt{n}$ -consistent local minimizer of (11). Then we show that this local minimizer enjoys the oracle properties as a solution of the DML estimator.

The following lemma shows the existence of a local minimizer of (11) which is $\sqrt{n}$ -consistent.

Lemma 3

Suppose that $λ_{1, n} n^{- \frac{1}{2}} \to 0$ and $λ_{2, n} n^{- \frac{1}{2}} \to 0$ . Under the conditions (A1)–(A5), there exists a local minimizer of (11) such that

‖ {(vec {(\hat{B})}^{T}, vec {(\hat{C})}^{T})}^{T} - {(vec {(B^{*})}^{T}, vec {(C^{*})}^{T})}^{T} ‖ = O_{p} (1 / \sqrt{n}) .

From Lemma 3, it is clear that there exists a $\sqrt{n}$ -consistent doubly penalized maximum likelihood estimator. As the DML estimator of (B*, C*), denote by (B̂⁽ⁿ⁾, Ĉ) the $\sqrt{n}$ -consistent local solution of (11) with the penalty parameter $(λ_{1, n}, λ_{2, n})$. Let β̂⁽ⁿ⁾ = vec(B̂⁽ⁿ⁾) and let ${\hat{β}}_{A}^{(n)}$ be the corresponding estimator of $β_{A}^{*}$ . Let Ĉ₀ be the matrix obtained from Ĉ by replacing ĉ_jk with 0 if $c_{j k}^{*} = 0$ . We now show that with a proper choice of (λ₁, λ₂), the DML estimator as this local minimizer enjoys the oracle properties in the following theorem.

Theorem 3

(Oracle properties of the DML solution) Suppose that $λ_{1, n} n^{- \frac{1}{2}} \to 0$ and $λ_{1, n} n^{\frac{γ - 1}{2}} \to \infty$ . In addition, suppose that $λ_{2, n} n^{- \frac{1}{2}} \to 0$ and $λ_{2, n} \to \infty$. Under the conditions (A1)–(A5), we have the following results:

${lim}_{n} P ({\hat{β}}_{j k}^{(n)} = 0) = 1$ if $β_{j k}^{*} = 0$ ;
$\sqrt{n} ({\hat{β}}_{A}^{(n)} - β_{A}^{*}) \to_{d} N (0, D^{- 1})$ ;
lim_n P(ĉ_jk = 0) = 1 if $c_{j k}^{*} = 0$ ;
$\sqrt{n} ({\hat{C}}_{0} - C^{*}) \to_{d} arg min V (U)$ ,

where V(U) = tr(UΣUΣ) + tr(UW) and W is an m × m random symmetric matrix such that vec(W) ~ N(0, Λ) in which $cov(w_{i j}, w_{k l}) = cov(ε_{1 i} ε_{1 j}, ε_{1 k} ε_{1 l})$. The minimum is taken over all symmetric matrices U satisfying u_jk = 0 if $c_{j k}^{*} = 0$ .

4. Computational Algorithm

In this section, we describe computational algorithms to solve problems (8), (9), and (11). In particular, we apply the GLASSO algorithm for (9). To solve the problems (8) and (11), we apply the coordinate-descent algorithm as described in Peng et al. [14], which can be viewed as a modification of the shooting algorithm [8]. The basic idea of the coordinate-descent algorithm is to optimize each parameter at one time while holding the other parameters fixed at the current solution. The corresponding optimization at each step can be very simple to solve.

We now describe the coordinate-descent algorithm for the PWL method in detail. Denote Ĉ by $({\hat{c}}_{i j})_{m \times m}$ . Then (8) is equivalent to minimizing

\sum_{i = 1}^{n} \sum_{k, l = 1}^{m} {\hat{c}}_{k l} (y_{i k} - \sum_{j = 1}^{p} β_{j k} x_{i j}) (y_{i l} - \sum_{j = 1}^{p} β_{j l} x_{i j}) + λ_{1} \sum_{j, k} w_{j k} ∣ β_{j k} ∣ .

(15)

Consider (15) as a function of β_jk with other coefficients fixed. Then the minimizer of (15) is equivalent to

\underset{β_{j k}}{argmin} [\sum_{i = 1}^{n} \{{\hat{c}}_{k k} {(y_{i k} - \sum_{j' \neq j} β_{j' k} x_{i j'} - β_{j k} x_{i j})}^{2} + 2 \sum_{k' \neq k} {\hat{c}}_{k k'} (y_{i k'} - \sum_{j' = 1}^{p} β_{j' k'} x_{i j'}) (y_{i k} - \sum_{j' \neq j} β_{j' k} x_{i j'} - β_{j k} x_{i j})\} + λ_{1} w_{j k} ∣ β_{j k} ∣] .

This problem is essentially a one-dimensional LASSO optimization which has a closed form solution. Therefore, the algorithm can be summarized as follows:

Algorithm 1: the Coordinate-Descent Algorithm for the PWL Method

Step 1
(Initial value). Set the separate LASSO solution $β_{j k}^{(old)}$ ; j = 1, …, p, k = 1, …, m, as the initial value for B.
Step 2
(Updating rule). For j = 1, …, p and k = 1, …, m,

$β_{q r}^{(new)} = β_{q r}^{(old)}$ , if (q, r) ≠ (j, k),
$β_{j k}^{(new)} = sign (\frac{\sum_{l = 1}^{m} {\hat{c}}_{l k} {(e_{l}^{(old)})}^{T} x^{j}}{{\hat{c}}_{k k} {(x^{j})}^{T} x^{j}} + β_{j k}^{(old)}) {(| \frac{\sum_{l = 1}^{m} {\hat{c}}_{l k} {(e_{l}^{(old)})}^{T} x^{j}}{{\hat{c}}_{k k} {(x^{j})}^{T} x^{j}} + β_{j k}^{(old)} | - \frac{λ_{1} w_{j k}}{2 {\hat{c}}_{k k} {(x^{j})}^{T} x^{j}})}^{+},$

where $e_{l}^{(old)} = y^{l} - X β^{l (old)}$ and $β^{l (old)} = (β_{1 l}^{(old)}, \dots, β_{p l}^{(old)})$ .
Step 3
(Iteration). Repeat Step 2 until convergence. Our stopping rule is that the change of the objective function in (8) is less than δ = 0.1.
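The updating rule of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `max_sweeps` safeguard, and the use of NumPy are assumptions of this sketch, and any starting value `B_init` (for instance the separate LASSO solution) may be supplied. Each coordinate update uses the current residuals, so a full sweep over (j, k) corresponds to one pass of Step 2.

```python
import numpy as np


def soft_threshold(z, t):
    """One-dimensional LASSO solution: sign(z) * (|z| - t)^+."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def pwl_objective(X, Y, B, C_hat, lam1, W):
    """Objective (15): sum_i sum_{k,l} c_kl e_ik e_il + lam1 * sum_jk w_jk |beta_jk|."""
    E = Y - X @ B
    return np.sum((E @ C_hat) * E) + lam1 * np.sum(W * np.abs(B))


def pwl_coordinate_descent(X, Y, C_hat, lam1, W, B_init, delta=0.1, max_sweeps=1000):
    """Cycle through all beta_jk, applying the closed-form update of Step 2."""
    B = np.array(B_init, dtype=float)
    E = Y - X @ B                      # column l holds the residual vector e_l
    col_ss = np.sum(X ** 2, axis=0)    # (x^j)^T x^j for each predictor j
    prev = pwl_objective(X, Y, B, C_hat, lam1, W)
    for _ in range(max_sweeps):        # max_sweeps is a safeguard added here
        for j in range(X.shape[1]):
            for k in range(Y.shape[1]):
                denom = C_hat[k, k] * col_ss[j]
                # sum_l c_lk e_l^T x^j, divided by c_kk (x^j)^T x^j, plus beta_jk
                z = (X[:, j] @ E @ C_hat[:, k]) / denom + B[j, k]
                b_new = soft_threshold(z, lam1 * W[j, k] / (2.0 * denom))
                E[:, k] -= X[:, j] * (b_new - B[j, k])   # keep residuals current
                B[j, k] = b_new
        cur = pwl_objective(X, Y, B, C_hat, lam1, W)
        if abs(prev - cur) < delta:    # stopping rule of Step 3
            break
        prev = cur
    return B
```

With `lam1 = 0` and a positive-definite `C_hat`, the iteration converges to the ordinary least squares solution, which gives a simple sanity check of the update.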

To be computationally more efficient, we combine the above algorithm with the active shooting algorithm proposed by Peng et al. [14]. The basic idea of the active shooting algorithm is to update the coefficients within the active set until convergence instead of iterating all coefficients at each step. The active set is defined as the set of currently nonzero coefficients and it is typically small. Once the coefficients in the active set converge, then we continue to update other coefficients. This step can speed up the algorithm significantly if the final solution is very sparse.

Next we describe the problem (9) in the GLASSO framework. Since (9) is equivalent to minimizing

- log det (C) + tr {\frac{1}{n} {(Y - X \hat{B})}^{T} (Y - X \hat{B}) C} + \frac{λ_{2}}{n} \sum_{j \neq k} v_{j k} ∣ c_{j k} ∣,

(16)

we can apply the GLASSO algorithm [7] to solve (9) by substituting the sample covariance matrix with $\frac{1}{n} {(Y - X \hat{B})}^{T} (Y - X \hat{B})$ . Therefore, the algorithm for (9) proceeds as follows:

Algorithm 2: the GLASSO Algorithm for the PWGL Method

Step 1
(Estimator of B) Set the separate LASSO solution as the estimator, B̂, of B.
Step 2
(Estimator of C) Given B̂, apply the GLASSO algorithm to solve (16).
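The two ingredients of Algorithm 2 can be sketched as follows: the residual covariance matrix that replaces the sample covariance matrix, and the objective (16) that a GLASSO-type solver would then minimize. The function names are hypothetical and the solver itself is not reproduced here; a solver that accepts entrywise penalty weights v_jk is assumed to be available separately.

```python
import numpy as np


def residual_covariance(X, Y, B_hat):
    """(1/n) (Y - X B_hat)^T (Y - X B_hat), used in place of the sample covariance."""
    E = Y - X @ B_hat
    return E.T @ E / X.shape[0]


def pwgl_objective(C, S, lam2, V, n):
    """Objective (16): -log det(C) + tr(S C) + (lam2 / n) * sum_{j != k} v_jk |c_jk|."""
    try:
        L = np.linalg.cholesky(C)      # fails unless C is positive definite
    except np.linalg.LinAlgError:
        return np.inf
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    off = ~np.eye(C.shape[0], dtype=bool)   # mask of off-diagonal entries
    penalty = (lam2 / n) * np.sum(V[off] * np.abs(C[off]))
    return -logdet + np.trace(S @ C) + penalty
```

Evaluating `pwgl_objective` at the output of a solver and at nearby matrices is one way to check that a candidate Ĉ actually decreases (16).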

Next, we combine Algorithm 1 and the GLASSO algorithm to solve problem (11) for the doubly penalized method DML in Section 2.3. The algorithm can be summarized as follows:

Algorithm 3: the Coordinate-Descent Algorithm for the DML Method

Step 1
(Initial values of B and C). Set the separate LASSO solution $β_{j k}^{(old)}$ ; j = 1, …, p, k = 1, …, m, as the initial value for B and the solution of (9), $C^{(old)}$, as the initial value of C.
Step 2
(B updating rule). For a given $C^{(old)}$, update $B^{(old)} \to B^{(new)}$ with
$B^{(new)} = \underset{B}{argmin} [tr {(Y - XB) C^{(old)} {(Y - XB)}^{T}} + λ_{1} \sum_{j, k} w_{j k} ∣ β_{j k} ∣] .$

This step can be solved using Algorithm 1.
Step 3
(C updating rule). For a given $B^{(new)}$, update $C^{(old)} \to C^{(new)}$ by
$C^{(new)} = \underset{C}{argmin} [tr {\frac{1}{n} {(Y - X B^{(new)})}^{T} (Y - X B^{(new)}) C} - log det (C) + \frac{λ_{2}}{n} \sum_{s \neq t} v_{s t} ∣ c_{s t} ∣] .$

This can be solved using the GLASSO algorithm.
Step 4
(Iteration). Repeat Steps 2 and 3 until convergence. Our stopping rule is that the change of the objective function in (11) is less than δ = 0.1.
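The alternating structure of Algorithm 3 can be sketched as a small driver that takes the two updating rules as callables. This is an illustrative sketch under stated assumptions: the objective is written in the form of the jointly penalized likelihood Q(B, C) used in Appendix E, the names `dml_objective`, `dml_alternate`, `update_B`, `update_C`, and `max_iter` are hypothetical, and the actual B and C updates (Algorithm 1 and a GLASSO solver) must be supplied by the caller.

```python
import numpy as np


def dml_objective(X, Y, B, C, lam1, lam2, W, V):
    """Q(B, C) = -n log det(C) + tr{C (Y - XB)^T (Y - XB)} + both weighted L1 penalties."""
    n = X.shape[0]
    try:
        L = np.linalg.cholesky(C)      # fails unless C is positive definite
    except np.linalg.LinAlgError:
        return np.inf
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    E = Y - X @ B
    off = ~np.eye(C.shape[0], dtype=bool)
    return (-n * logdet + np.sum((E @ C) * E)
            + lam1 * np.sum(W * np.abs(B))
            + lam2 * np.sum(V[off] * np.abs(C[off])))


def dml_alternate(B0, C0, update_B, update_C, objective, delta=0.1, max_iter=100):
    """Repeat Step 2 (B given C) and Step 3 (C given B) until the objective stabilizes."""
    B, C = B0, C0
    prev = objective(B, C)
    for _ in range(max_iter):          # max_iter is a safeguard added here
        B = update_B(C)                # Step 2: e.g. Algorithm 1 with C fixed
        C = update_C(B)                # Step 3: e.g. GLASSO on the residual covariance
        cur = objective(B, C)
        if abs(prev - cur) < delta:    # stopping rule of Step 4
            break
        prev = cur
    return B, C
```

Passing the updates as functions keeps the driver independent of the particular LASSO and GLASSO solvers. In the unpenalized case (λ₁ = λ₂ = 0) the two updates have closed forms, ordinary least squares for B and the inverse residual covariance for C, which is convenient for testing.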

In our experiments, the coordinate-descent algorithm works very efficiently. Since the DML method involves estimation of both B and C, the computation can be intensive when the dimension is high. We consider a prescreening step to speed up the computation. In particular, we adapt the group lasso method considered by Yuan and Lin [22] and Meier et al. [11]. The basic idea of the group lasso method is to employ a group penalty in the regression problem so that model selection can be achieved in terms of group selection. In our multiple response variable regression problem, $(β_{j 1}, \dots, β_{j m})$; j = 1, …, p, can be considered as p groups. Therefore, for the prescreening step, the group lasso estimator of B, ${\hat{B}}^{group}$, is given as the minimizer of the following penalized function

\sum_{i = 1}^{n} \sum_{k = 1}^{m} {(y_{i k} - \sum_{j = 1}^{p} β_{j k} x_{i j})}^{2} + λ \sum_{j = 1}^{p} \sqrt{β_{j 1}^{2} + \dots + β_{j m}^{2}},

where λ is a tuning parameter. We screen out a variable if the corresponding coefficients are estimated as zeros for all response variables. In other words, we remove the variable x^j from our model if ${\hat{β}}_{j 1}^{group} = \dots = {\hat{β}}_{j m}^{group} = 0$ . This prescreening step can not only speed up the computation, but also improve the prediction performance as shown in our examples.
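The screening rule itself is simple to state in code. The sketch below assumes that a group lasso estimate is already available as a p × m nested list (or array) `B_group`; the function names and the tolerance `tol` are hypothetical additions for illustration.

```python
import math


def group_lasso_penalty(B, lam):
    """lam * sum_j sqrt(beta_j1^2 + ... + beta_jm^2), one group per predictor row."""
    return lam * sum(math.sqrt(sum(b * b for b in row)) for row in B)


def screen_predictors(B_group, tol=0.0):
    """Indices j kept after prescreening: rows of B_group that are not entirely zero."""
    return [j for j, row in enumerate(B_group) if any(abs(b) > tol for b in row)]


# Example: predictors 0 and 2 have all-zero coefficient rows and are screened out.
B_group = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.0], [-1.0, 2.0]]
kept = screen_predictors(B_group)
```

The subsequent PWL, PWGL, or DML fit would then use only the columns of X indexed by `kept`.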

5. Numerical Examples

In this section, our proposed methods are compared with several existing methods. The first existing method we compare is the curds and whey (CW) method proposed by Breiman and Friedman [4]. The other two methods are the separate ridge regression (RR) and the separate LASSO. In particular, we apply the RR and the LASSO to each response variable separately. Separate LASSO solutions are constructed by a modification of the LARS algorithm proposed by Efron et al. [5]. The main idea of this modified LARS algorithm was also considered by Osborne et al. [13]. In this paper, some brief summaries of the results are provided. All details about numerical examples can be found in the online supplement materials.

In simulated examples, all methods are compared in two ways, B estimation and C estimation. Performance of B estimation is compared in terms of prediction and variable selection. For the comparison of C estimation, we use the entropy criterion [9, 24] which measures the difference of two matrices. In terms of prediction, overall, the proposed DML method works the best. The PWL method also works reasonably well in all cases, although it is not as accurate as the DML estimator. In the example where the true inverse covariance matrix is not sparse, LASSO gives the worst prediction performance while the other methods show similar performance. This implies that joint approaches outperform separate approaches with the joint information. DML outperforms LASSO and PWL in terms of identification of zero coefficients. We also notice that the ratios of correctly identified zeros for PWL and DML increase as the sample size increases. This supports the selection consistency shown in Section 3. When the dimension of predictor variables is low, the DML estimator gives the best performance in C estimation. When the dimension of predictor variables is close to the sample size, PWGL outperforms DML. Since the DML method simultaneously estimates both B and C, with a small sample size, the C estimation may not be as good.

We apply our methodology to a Glioblastoma multiforme (GBM) cancer data set studied by the Cancer Genome Atlas (TCGA) Research Network [17]. As noted in [20], GBM is the most common primary form of brain tumor in adults. In terms of prediction, our method, DML, performs best, although the difference between DML and the separate LASSO is not statistically significant in view of the standard errors. In terms of the number of genes included in the models, PWL and DML construct sparser models than the separate LASSO. One possible explanation is that there may be some strong positive correlations among the microRNAs, which are the response variables. As we have discussed in the toy example of Section 2, with strong positive correlations among response variables, joint methods tend to obtain more shrinkage than the separate LASSO. To explore this further, correlations among microRNAs are examined. Some strong positive correlations among the microRNAs are detected, while negative correlations are not strong. Interestingly, with far fewer gene expressions than the separate LASSO, PWL and DML perform competitively in terms of prediction accuracy.

6. Discussion

In this paper, we have proposed three methods for utilizing joint information among response variables in a penalized likelihood framework with weighted L₁ regularization. Our theoretical investigation shows that our proposed estimators enjoy oracle properties. Simulated examples and an application to the GBM cancer data set demonstrate that our proposed methods perform competitively.

Our current study assumes Gaussian distribution of the response vector. One future research direction is to extend the proposed method with other distributional assumptions. Although we mainly focus on the weighted L₁ penalty, our methods can be directly extended for other penalty functions as well. It will be interesting to compare the performance of various choices of penalty in this context.

Supplementary Material

NIHMS374480-supplement-01.pdf (772KB, pdf)

Acknowledgments

The authors were supported in part by NSF grant DMS-0747575 and NIH grant 5R01CA149569-03.

Appendix A. Proof of Lemma 1

Appendix A.1. Asymptotic normality

Let Ỹ = ((y¹)^T, …, (y^m)^T)^T be the nm-dimensional response vector and ε̃ be the corresponding nm-dimensional error vector which consists of ε_ik; i = 1, …, n, k = 1, …, m. Let $\tilde{β} = {(β_{1 1}, \dots, β_{p 1}, \dots, β_{1 m}, \dots, β_{p m})}^{T}$ be the pm-dimensional vector and X̃ = I_m ⊗ X. Then the minimizer of (6) is equivalent to

\underset{\tilde{β}}{argmin} [{(\tilde{Y} - \tilde{X} \tilde{β})}^{T} (C \otimes I_{n}) (\tilde{Y} - \tilde{X} \tilde{β}) + λ_{1} \sum_{j, k} w_{j k} ∣ β_{j k} ∣] .
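The stacked form above rests on the identity (Ỹ − X̃β̃)^T (C ⊗ I_n) (Ỹ − X̃β̃) = tr{(Y − XB) C (Y − XB)^T}, which can be checked numerically. The following sketch uses NumPy with hypothetical function names; columns are stacked in column-major order so that β̃ matches the ordering (β₁₁, …, β_p1, …, β_1m, …, β_pm).

```python
import numpy as np


def stacked_quadratic_form(X, Y, B, C):
    """(Y~ - X~ beta~)^T (C kron I_n) (Y~ - X~ beta~) with X~ = I_m kron X."""
    n, m = Y.shape
    y_tilde = Y.reshape(-1, order="F")       # stack the columns y^1, ..., y^m
    beta_tilde = B.reshape(-1, order="F")    # stack the columns of B
    X_tilde = np.kron(np.eye(m), X)          # block-diagonal design, nm x pm
    r = y_tilde - X_tilde @ beta_tilde
    return r @ np.kron(C, np.eye(n)) @ r


def trace_form(X, Y, B, C):
    """tr{(Y - XB) C (Y - XB)^T}, the matrix form of the same quantity."""
    E = Y - X @ B
    return np.trace(E @ C @ E.T)
```

The Kronecker form is convenient for the asymptotic argument, while the trace form avoids building nm × nm matrices and is the one to use in computation.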

Let $\tilde{β} = β^{*} + \frac{u}{\sqrt{n}}$ and

V_{n} (u) = {(\tilde{Y} - \tilde{X} (β^{*} + \frac{u}{\sqrt{n}}))}^{T} (C \otimes I_{n}) (\tilde{Y} - \tilde{X} (β^{*} + \frac{u}{\sqrt{n}})) + λ_{1, n} \sum_{j, k} w_{j k} ∣ β_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} ∣ .

Let û⁽ⁿ⁾ = argmin_u V_n(u) and then ${\hat{u}}^{(n)} = \sqrt{n} ({\hat{β^{1}}}^{(n)} - β^{*})$ . Note that û⁽ⁿ⁾ = argmin_u V_n(u) = argmin_u{V_n(u) – V_n(0)} and

V_{n} (u) - V_{n} (0) = \frac{1}{n} u^{T} {\tilde{X}}^{T} (C \otimes I_{n}) \tilde{X} u - \frac{2}{\sqrt{n}} {\tilde{ε}}^{T} (C \otimes I_{n}) \tilde{X} u + λ_{1, n} \sum_{j, k} w_{j k} (| β_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} | - ∣ β_{j k}^{*} ∣) .

(A.1)

We know that $\frac{1}{n} u^{T} {\tilde{X}}^{T} (C \otimes I_{n}) \tilde{X} u = u^{T} (C \otimes \frac{1}{n} X^{T} X) u \to u^{T} (C \otimes A) u$ . For the second term on the right-hand side of (A.1), note that ε̃ ~ N(0, Σ ⊗ I_n). Thus, $\frac{1}{\sqrt{n}} {\tilde{ε}}^{T} (C \otimes I_{n}) \tilde{X} \to_{d} Z$ where Z ~ N(0, C ⊗ A) as $\frac{1}{n} {\tilde{X}}^{T} (C \otimes I_{n}) (Σ \otimes I_{n}) (C \otimes I_{n}) \tilde{X} = \frac{1}{n} {\tilde{X}}^{T} (C \otimes I_{n}) \tilde{X} \to C \otimes A$ . Now we consider the last term on the right-hand side of (A.1):

If $β_{j k}^{*} = 0$ , then $λ_{1, n} w_{j k} (| β_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} | - ∣ β_{j k}^{*} ∣) = \frac{λ_{1, n}}{\sqrt{n}} w_{j k} ∣ u_{j k} ∣ = λ_{1, n} n^{\frac{γ - 1}{2}} \frac{∣ u_{j k} ∣}{{(\sqrt{n} ∣ {\tilde{β}}_{j k} ∣)}^{γ}} \to \infty$ as $\sqrt{n} {\tilde{β}}_{j k} = O_{p} (1)$ .
If $β_{j k}^{*} \neq 0$ , then $λ_{1, n} w_{j k} (| β_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} | - ∣ β_{j k}^{*} ∣) = \frac{λ_{1, n}}{\sqrt{n}} w_{j k} \sqrt{n} (| β_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} | - ∣ β_{j k}^{*} ∣)$ . Note that $\frac{λ_{1, n}}{\sqrt{n}} \to 0, w_{j k} \to_{p} \frac{1}{{∣ β_{j k}^{*} ∣}^{γ}}$ and $\sqrt{n} (| β_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} | - ∣ β_{j k}^{*} ∣) \to u_{j k} sign (β_{j k}^{*})$ . By Slutsky’s theorem, $λ_{1, n} w_{j k} (| β_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} | - ∣ β_{j k}^{*} ∣) \to_{p} 0$ .

Combining the above statements and applying Slutsky’s theorem again, we obtain the following:

V_{n} (u) - V_{n} (0) \to_{d} V (u) = \begin{cases} u_{A}^{T} D u_{A} - 2 u_{A}^{T} Z_{A} & \text{if } u_{j k} = 0 \text{ for all } (j, k) \notin A, \\ \infty & \text{otherwise}, \end{cases}

where $u_{A}$ consists of $u_{j k}$ for (j, k) ∈ A and $Z_{A}$ ~ N(0, D).

Let û = argmin_u V(u). Then we have

\begin{cases} {\hat{u}}_{A} = D^{- 1} Z_{A}, \\ {\hat{u}}_{j k} = 0 \quad \forall (j, k) \notin A . \end{cases}

Note that V_n(u) – V_n(0) is convex, and so argmin_u(V_n(u) – V_n(0)) →_d argmin_u V(u). Hence ${\hat{u}}_{A}^{(n)} = \sqrt{n} ({\hat{β^{1}}}_{A}^{(n)} - β_{A}^{*}) \to_{d} D^{- 1} Z_{A}$ as n → ∞. Since $Z_{A}$ ~ N(0, D), it follows that ${\hat{u}}_{A}^{(n)} \to_{d} N (0, D^{- 1})$ .
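For completeness, the minimizer of the limiting function and its covariance can be verified directly. The following short derivation is a sketch that assumes D is symmetric and positive definite:

```latex
\nabla_{u_{A}} (u_{A}^{T} D u_{A} - 2 u_{A}^{T} Z_{A}) = 2 D u_{A} - 2 Z_{A} = 0
\quad \Longrightarrow \quad
{\hat{u}}_{A} = D^{- 1} Z_{A},
\qquad
cov ({\hat{u}}_{A}) = D^{- 1} \, cov (Z_{A}) \, D^{- 1} = D^{- 1} D D^{- 1} = D^{- 1} .
```

This is why a limit with covariance D for $Z_{A}$ translates into the N(0, D⁻¹) limit for the estimator.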

Appendix A.2. Selection consistency

We need to show that ∀(j, k) ∉ A, $P ({\hat{β^{1}}}_{j k}^{(n)} \neq 0) \to 0$ . For fixed (j, k) ∉ A, suppose that $(j, k) \in A_{n}^{1}$ . Then $∣ {\hat{β^{1}}}_{j k}^{(n)} ∣ \neq 0$ and so we have that $2 {\tilde{x}}_{j k}^{T} (C \otimes I_{n}) (\tilde{Y} - \tilde{X} {\hat{β^{1}}}^{(n)}) = λ_{1, n} w_{j k} sign ({\hat{β^{1}}}_{j k}^{(n)})$ by the KKT conditions, where x̃_jk is the (j + (k − 1)p)-th column of X̃. Therefore, $P ({\hat{β^{1}}}_{j k}^{(n)} \neq 0) \leq P (2 {\tilde{x}}_{j k}^{T} (C \otimes I_{n}) (\tilde{Y} - \tilde{X} {\hat{β^{1}}}^{(n)}) = λ_{1, n} w_{j k} sign ({\hat{β^{1}}}_{j k}^{(n)}))$ . Note that

\frac{2 {\tilde{x}}_{j k}^{T} (C \otimes I_{n}) (\tilde{Y} - \tilde{X} {\hat{β^{1}}}^{(n)})}{\sqrt{n}} = \frac{2 {\tilde{x}}_{j k}^{T} (C \otimes I_{n}) \tilde{X} \sqrt{n} (β^{*} - {\hat{β^{1}}}^{(n)})}{n} + \frac{2 {\tilde{x}}_{j k}^{T} (C \otimes I_{n}) \tilde{ε}}{\sqrt{n}} .

From the asymptotic normality part, we know that $\frac{2 {\tilde{x}}_{j k}^{T} (C \otimes I_{n}) \tilde{X} \sqrt{n} (β^{*} - {\hat{β^{1}}}^{(n)})}{n}$ converges in distribution to some normal random vector. We also have that $\frac{2 {\tilde{x}}_{j k}^{T} (C \otimes I_{n}) \tilde{ε}}{\sqrt{n}} \to_{d} N (0, {(C \otimes A)}_{j k, j k})$ , where ${(C \otimes A)}_{j k, j k}$ is the (j + (k − 1)p)-th diagonal element of C ⊗ A. As $\frac{λ_{1, n} w_{j k} sign ({\hat{β^{1}}}_{j k}^{(n)})}{\sqrt{n}} = λ_{1, n} n^{\frac{γ - 1}{2}} \frac{sign ({\hat{β^{1}}}_{j k}^{(n)})}{{(\sqrt{n} ∣ {\tilde{β}}_{j k} ∣)}^{γ}} \to \pm \infty$ with $\sqrt{n} {\tilde{β}}_{j k} = O_{p} (1)$ , we have $P (2 {\tilde{x}}_{j k}^{T} (C \otimes I_{n}) (\tilde{Y} - \tilde{X} {\hat{β^{1}}}^{(n)}) = λ_{1, n} w_{j k} sign ({\hat{β^{1}}}_{j k}^{(n)})) \to 0$ . Therefore, $P ({\hat{β^{1}}}_{j k}^{(n)} \neq 0) \to 0$ as n → ∞.

Appendix B. Proof of Theorem 1

The proof is similar to that of Lemma 1 except we replace C by Ĉ.

Appendix B.1. Asymptotic normality

Note that (8) is equivalent to

\underset{\tilde{β}}{argmin} [{(\tilde{Y} - \tilde{X} \tilde{β})}^{T} (\hat{C} \otimes I_{n}) (\tilde{Y} - \tilde{X} \tilde{β}) + λ_{1} \sum_{j, k} w_{j k} ∣ β_{j k} ∣] .

Let $\tilde{β} = β^{*} + \frac{u}{\sqrt{n}}$ and

V_{n}^{*} (u) = {(\tilde{Y} - \tilde{X} (β^{*} + \frac{u}{\sqrt{n}}))}^{T} (\hat{C} \otimes I_{n}) (\tilde{Y} - \tilde{X} (β^{*} + \frac{u}{\sqrt{n}})) + λ_{1, n} \sum_{j, k} w_{j k} | β_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} | .

Let ${\hat{u}}^{(n)} = {argmin}_{u} V_{n}^{*} (u)$ and then ${\hat{u}}^{(n)} = \sqrt{n} ({\hat{β^{2}}}^{(n)} - β^{*})$ . We can show that

V_{n}^{*} (u) - V_{n}^{*} (0) = V_{n} (u) - V_{n} (0) + \frac{1}{n} u^{T} {\tilde{X}}^{T} ((\hat{C} - C) \otimes I_{n}) \tilde{X} u - \frac{2}{\sqrt{n}} {\tilde{ε}}^{T} ((\hat{C} - C) \otimes I_{n}) \tilde{X} u,

where V_n(u) is defined in the proof of Lemma 1. As Ĉ is a consistent estimator of C, $\frac{1}{n} u^{T} {\tilde{X}}^{T} ((\hat{C} - C) \otimes I_{n}) \tilde{X} u \to_{p} 0$ and $\frac{2}{\sqrt{n}} {\tilde{ε}}^{T} ((\hat{C} - C) \otimes I_{n}) \tilde{X} u \to_{d} 0$ . From the proof of Lemma 1, we also know that V_n(u) – V_n(0) →_d V(u). Combining the above statements and using Slutsky’s theorem, we have that $V_{n}^{*} (u) - V_{n}^{*} (0) \to_{d} V (u)$ . By the same arguments as in the proof of Lemma 1, we finally have that ${\hat{u}}_{A}^{(n)} = \sqrt{n} ({\hat{β^{2}}}_{A}^{(n)} - β_{A}^{*}) \to_{d} D^{- 1} Z_{A}$ as n → ∞.

Appendix B.2. Selection consistency

Now it suffices to show that ∀(j, k) ∉ A, $P ({\hat{β^{2}}}_{j k}^{(n)} \neq 0) \to 0$ as n → ∞. For fixed (j, k) ∉ A, suppose that $(j, k) \in A_{n}^{2}$ . Then $∣ {\hat{β^{2}}}_{j k}^{(n)} ∣ \neq 0$ and so we have that $2 {\tilde{x}}_{j k}^{T} (\hat{C} \otimes I_{n}) (\tilde{Y} - \tilde{X} {\hat{β^{2}}}^{(n)}) = λ_{1, n} w_{j k} sign ({\hat{β^{2}}}_{j k}^{(n)})$ by the KKT conditions. Therefore, $P ({\hat{β^{2}}}_{j k}^{(n)} \neq 0) \leq P (2 {\tilde{x}}_{j k}^{T} (\hat{C} \otimes I_{n}) (\tilde{Y} - \tilde{X} {\hat{β^{2}}}^{(n)}) = λ_{1, n} w_{j k} sign ({\hat{β^{2}}}_{j k}^{(n)}))$ . Note that

\frac{2 {\tilde{x}}_{j k}^{T} (\hat{C} \otimes I_{n}) (\tilde{Y} - \tilde{X} {\hat{β^{2}}}^{(n)})}{\sqrt{n}} = \frac{2 {\tilde{x}}_{j k}^{T} (\hat{C} \otimes I_{n}) \tilde{X} \sqrt{n} (β^{*} - {\hat{β^{2}}}^{(n)})}{n} + \frac{2 {\tilde{x}}_{j k}^{T} (\hat{C} \otimes I_{n}) \tilde{ε}}{\sqrt{n}} .

From the asymptotic normality part and the fact that Ĉ is consistent, we know that $\frac{2 {\tilde{x}}_{j k}^{T} (\hat{C} \otimes I_{n}) \tilde{X} \sqrt{n} (β^{*} - {\hat{β^{2}}}^{(n)})}{n}$ converges in distribution to some normal random vector. We also have that $\frac{2 {\tilde{x}}_{j k}^{T} (\hat{C} \otimes I_{n}) \tilde{ε}}{\sqrt{n}} \to_{d} N (0, {(C \otimes A)}_{j k, j k})$ . As $\frac{λ_{1, n} w_{j k} sign ({\hat{β^{2}}}_{j k}^{(n)})}{\sqrt{n}} = λ_{1, n} n^{\frac{γ - 1}{2}} \frac{sign ({\hat{β^{2}}}_{j k}^{(n)})}{{(\sqrt{n} ∣ {\hat{β}}_{j k} ∣)}^{γ}} \to \pm \infty$ , we have $P (2 {\tilde{x}}_{j k}^{T} (\hat{C} \otimes I_{n}) (\tilde{Y} - \tilde{X} {\hat{β^{2}}}^{(n)}) = λ_{1, n} w_{j k} sign ({\hat{β^{2}}}_{j k}^{(n)})) \to 0$ . Therefore, $P ({\hat{β^{2}}}_{j k}^{(n)} \neq 0) \to 0$ as n → ∞.

Appendix C. Proof of Lemma 2

Let $R = \frac{1}{n} {(Y - X B^{*})}^{T} (Y - X B^{*})$ . With given B*, define Q(C) as

Q (C) = - n log det (C) + n tr (C R) + λ_{2, n} \sum_{j \neq k} v_{j k} ∣ c_{j k} ∣ .

(C.1)

Appendix C.1. Selection consistency

Using the definition of Q(C) in (C.1), define V_n(U) as

\begin{array}{l} V_{n} (U) = Q (C^{*} + \frac{U}{\sqrt{n}}) - Q (C^{*}) \\ = - n log det ((C^{*} + \frac{U}{\sqrt{n}}) {C^{*}}^{- 1}) + n tr (\frac{U R}{\sqrt{n}}) + λ_{2, n} \sum_{j \neq k} v_{j k} (∣ c_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} ∣ - ∣ c_{j k}^{*} ∣) . \end{array}

Using a similar argument as in the proof of Theorem 1 in Yuan and Lin (2007), it can be shown that

V_{n} (U) = tr (U Σ U Σ) + tr [U \sqrt{n} (R - Σ)] + λ_{2, n} \sum_{j \neq k} v_{j k} (∣ c_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} ∣ - ∣ c_{j k}^{*} ∣) + o (1) .

Note that as $v_{s t} = \frac{1}{∣ {\tilde{c}}_{s t} ∣}, λ_{2, n} n^{- \frac{1}{2}} \to 0$ , and ${\tilde{c}}_{j k} \to_{p} c_{j k}^{*}$ , we have

\begin{array}{l} λ_{2, n} \sum_{j \neq k} v_{j k} (∣ c_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} ∣ - ∣ c_{j k}^{*} ∣) = λ_{2, n} \sum_{c_{j k}^{*} = 0} \frac{∣ u_{j k} ∣}{\sqrt{n} ∣ {\tilde{c}}_{j k} ∣} + \frac{λ_{2, n}}{\sqrt{n}} \sum_{c_{j k}^{*} \neq 0} (\frac{∣ u_{j k} ∣}{∣ {\tilde{c}}_{j k} ∣} sign (c_{j k}^{*}) + o (1)) \\ = λ_{2, n} \sum_{c_{j k}^{*} = 0} \frac{∣ u_{j k} ∣}{\sqrt{n} ∣ {\tilde{c}}_{j k} ∣} + o_{p} (1) . \end{array}

On the other hand, $\sqrt{n} (R - Σ) \to_{d} N (0, Λ)$ by the central limit theorem as $R = \frac{1}{n} \sum_{i = 1}^{n} ε_{i} ε_{i}^{T}$ . Therefore, V_n(U) can be written as

V_{n} (U) = tr (U Σ U Σ) + tr (U W_{n}) + λ_{2, n} \sum_{c_{j k}^{*} = 0} \frac{∣ u_{j k} ∣}{\sqrt{n} ∣ {\tilde{c}}_{j k} ∣} + o_{p} (1),

where W_n →_d N(0, Λ). Denote by Û the minimizer of V_n(U). Note that $λ_{2, n} \to \infty$ and, for $c_{j k}^{*} = 0$ , $\sqrt{n} ∣ {\tilde{c}}_{j k} ∣ = O_{p} (1)$ . Therefore, if $c_{j k}^{*} = 0$ , P (û_jk = 0) → 1 as n → ∞. This completes the proof of the variable selection consistency.

Appendix C.2. Asymptotic distribution

Suppose U satisfies that u_jk = 0 if $c_{j k}^{*} = 0$ . Then, V_n(U) can be written as

V_{n} (U) = tr (U Σ U Σ) + tr [U \sqrt{n} (R - Σ)] + o_{p} (1) .

By Slutsky’s theorem, we have that

V_{n} (U) \to_{d} V (U) = tr (U Σ U Σ) + tr (U W) where vec (W) \sim N (0, Λ) .

Since V_n(U) and V(U) are both convex and V(U) has a unique minimum, argmin V_n(U) →_d argmin V(U). From the fact that $argmin V_{n} (U) = argmin Q (C^{*} + \frac{U}{\sqrt{n}}) = \sqrt{n} ({\hat{C}}_{0}^{1} - C^{*}), argmin V_{n} (U) = \sqrt{n} ({\hat{C}}_{0}^{1} - C^{*}) \to_{d} argmin V (U)$ . This completes the proof of the asymptotic distribution.

Appendix D. Proof of Theorem 2

With a $\sqrt{n}$ -consistent estimator B̂ of B, let $\hat{R} = \frac{1}{n} {(Y - X \hat{B})}^{T} (Y - X \hat{B})$ . Define Q(C) as

Q (C) = - n log det (C) + n tr (C \hat{R}) + λ_{2, n} \sum_{j \neq k} v_{j k} ∣ c_{j k} ∣ .

(D.1)

By using the above definition, define V_n(U) as

\begin{array}{l} V_{n} (U) = Q (C^{*} + \frac{U}{\sqrt{n}}) - Q (C^{*}) \\ = - n log det ((C^{*} + \frac{U}{\sqrt{n}}) C^{* - 1}) + n tr (\frac{U \hat{R}}{\sqrt{n}}) + λ_{2, n} \sum_{j \neq k} v_{j k} (∣ c_{j k}^{*} + \frac{u_{j k}}{\sqrt{n}} ∣ - ∣ c_{j k}^{*} ∣) . \end{array}

Note that

n tr (\frac{U \hat{R}}{\sqrt{n}}) = n tr (\frac{U (\hat{R} - R)}{\sqrt{n}}) + n tr (\frac{U R}{\sqrt{n}}) .

Therefore, by the proof of Lemma 2 and Slutsky’s theorem, it suffices to show that

n tr (\frac{U (\hat{R} - R)}{\sqrt{n}}) = o_{p} (1) .

(D.2)

The left-hand side of (D.2) can be written as

\begin{array}{l} n tr (\frac{U (\hat{R} - R)}{\sqrt{n}}) = tr (\frac{U}{\sqrt{n}} {(Y - X \hat{B})}^{T} (Y - X \hat{B})) - tr (\frac{U}{\sqrt{n}} {(Y - XB)}^{T} (Y - XB)) \\ = tr (U \sqrt{n} {(\hat{B} - B)}^{T} \frac{X^{T} X}{n} (\hat{B} - B)) - 2 tr (U \frac{{(Y - XB)}^{T} X}{\sqrt{n}} (\hat{B} - B)), \end{array}

where we add and subtract XB in the first term. Since $\sqrt{n} (\hat{B} - B) = O_{p} (1), \frac{{(Y - XB)}^{T} X}{\sqrt{n}} = O_{p} (1)$ , $\hat{B} - B = o_{p} (1)$ and $\frac{1}{n} X^{T} X \to A$ , (D.2) holds.

Appendix E. Proof of Lemma 3

Define Q(B, C) for the jointly penalized likelihood as

Q (B, C) = - n log det (C) + tr {C {(Y - XB)}^{T} (Y - XB)} + λ_{1, n} \sum_{j, k} w_{j k} ∣ β_{j k} ∣ + λ_{2, n} \sum_{s \neq t} v_{s t} ∣ c_{s t} ∣ .

(E.1)

To show the result, we follow the idea of the proof of Theorem 1 in Fan and Li [6]. It suffices to show that for any given δ > 0, there exists a large constant D such that

P {inf_{| | U | | = D} Q (B^{*} + \frac{U_{1}}{\sqrt{n}}, C^{*} + \frac{U_{2}}{\sqrt{n}}) > Q (B^{*}, C^{*})} \geq 1 - δ,

(E.2)

where U = (vec(U₁)^T, vec(U₂)^T)^T. Using the definition of Q(B, C) in (E.1), define V_n(U) as

V_{n} (U) = Q (B^{*} + \frac{U_{1}}{\sqrt{n}}, C^{*} + \frac{U_{2}}{\sqrt{n}}) - Q (B^{*}, C^{*}) .

Since $∣ β_{j k}^{*} + \frac{u_{1 j k}}{\sqrt{n}} ∣ - ∣ β_{j k}^{*} ∣ = ∣ \frac{u_{1 j k}}{\sqrt{n}} ∣$ for $β_{j k}^{*} = 0$ and $∣ c_{s t}^{*} + \frac{u_{2 s t}}{\sqrt{n}} ∣ - ∣ c_{s t}^{*} ∣ = ∣ \frac{u_{2 s t}}{\sqrt{n}} ∣$ for $c_{s t}^{*} = 0$ ,

\begin{array}{l} V_{n} (U) \geq - n log det ((C^{*} + \frac{U_{2}}{\sqrt{n}}) {C^{*}}^{- 1}) + tr {(C^{*} + \frac{U_{2}}{\sqrt{n}}) {(Y - X (B^{*} + \frac{U_{1}}{\sqrt{n}}))}^{T} (Y - X (B^{*} + \frac{U_{1}}{\sqrt{n}}))} \\ - tr {C^{*} {(Y - X B^{*})}^{T} (Y - X B^{*})} + λ_{1, n} \sum_{β_{j k}^{*} \neq 0} w_{j k} (∣ β_{j k}^{*} + \frac{u_{1 j k}}{\sqrt{n}} ∣ - ∣ β_{j k}^{*} ∣) \\ + λ_{2, n} \sum_{c_{s t}^{*} \neq 0} v_{s t} (∣ c_{s t}^{*} + \frac{u_{2 s t}}{\sqrt{n}} ∣ - ∣ c_{s t}^{*} ∣) \\ = - n log det ((C^{*} + \frac{U_{2}}{\sqrt{n}}) {C^{*}}^{- 1}) + tr {\frac{U_{2}}{\sqrt{n}} {(Y - X B^{*})}^{T} (Y - X B^{*})} \\ + tr {(C^{*} + \frac{U_{2}}{\sqrt{n}}) {(\frac{X U_{1}}{\sqrt{n}})}^{T} (\frac{X U_{1}}{\sqrt{n}})} - 2 tr {(C^{*} + \frac{U_{2}}{\sqrt{n}}) {(Y - X B^{*})}^{T} (\frac{X U_{1}}{\sqrt{n}})} \\ + λ_{1, n} \sum_{β_{j k}^{*} \neq 0} w_{j k} (∣ β_{j k}^{*} + \frac{u_{1 j k}}{\sqrt{n}} ∣ - ∣ β_{j k}^{*} ∣) + λ_{2, n} \sum_{c_{s t}^{*} \neq 0} v_{s t} (∣ c_{s t}^{*} + \frac{u_{2 s t}}{\sqrt{n}} ∣ - ∣ c_{s t}^{*} ∣) . \end{array}

(E.3)

For the first and second terms on the right-hand side of (E.3), it has been shown in the proof of Lemma 2 that

- n log det ((C^{*} + \frac{U_{2}}{\sqrt{n}}) {C^{*}}^{- 1}) + tr {\frac{U_{2}}{\sqrt{n}} {(Y - X B^{*})}^{T} (Y - X B^{*})} = tr (U_{2} Σ U_{2} Σ) + tr (U_{2} W_{n}) + o (1) .

Let ${\tilde{U}}_{1} = vec (U_{1})$ . For the third term on the right-hand side of (E.3), as $\frac{1}{n} X^{T} X \to A$ , note that

tr {(C^{*} + \frac{U_{2}}{\sqrt{n}}) {(\frac{X U_{1}}{\sqrt{n}})}^{T} (\frac{X U_{1}}{\sqrt{n}})} = {\tilde{U}}_{1}^{T} {(C^{*} + \frac{U_{2}}{\sqrt{n}}) \otimes (\frac{X^{T} X}{n})} {\tilde{U}}_{1} = {\tilde{U}}_{1}^{T} (C^{*} \otimes A) {\tilde{U}}_{1} + o (1) .

For the fourth term on the right-hand side of (E.3), we have

tr {(C^{*} + \frac{U_{2}}{\sqrt{n}}) {(Y - X B^{*})}^{T} (\frac{X U_{1}}{\sqrt{n}})} = {\tilde{U}}_{1}^{T} {(\frac{\tilde{X}}{\sqrt{n}})}^{T} {(C^{*} + \frac{U_{2}}{\sqrt{n}}) \otimes I_{n}} \tilde{ε} .

Note that ${(\frac{\tilde{X}}{\sqrt{n}})}^{T} {(C^{*} + \frac{U_{2}}{\sqrt{n}}) \otimes I_{n}} \tilde{ε} \to_{d} Z$ where Z has a multivariate normal distribution of dimension pm. Combining the above statements, we have

V_{n} (U) \geq tr (U_{2} Σ U_{2} Σ) + tr (U_{2} W_{n}) + {\tilde{U}}_{1}^{T} (C^{*} \otimes A) {\tilde{U}}_{1} + {\tilde{U}}_{1}^{T} Z_{n} + o_{p} (1) + λ_{1, n} \sum_{β_{j k}^{*} \neq 0} w_{j k} (∣ β_{j k}^{*} + \frac{u_{1 j k}}{\sqrt{n}} ∣ - ∣ β_{j k}^{*} ∣) + λ_{2, n} \sum_{c_{s t}^{*} \neq 0} v_{s t} (∣ c_{s t}^{*} + \frac{u_{2 s t}}{\sqrt{n}} ∣ - ∣ c_{s t}^{*} ∣) .

As $λ_{1, n} n^{- \frac{1}{2}} \to 0$ and $λ_{2, n} n^{- \frac{1}{2}} \to 0$ , we have

\begin{array}{r} λ_{1, n} \sum_{β_{j k}^{*} \neq 0} w_{j k} (∣ β_{j k}^{*} + \frac{u_{1 j k}}{\sqrt{n}} ∣ - ∣ β_{j k}^{*} ∣) = \frac{λ_{1, n}}{\sqrt{n}} \sum_{β_{j k}^{*} \neq 0} (\frac{∣ u_{1 j k} ∣}{{∣ {\tilde{β}}_{j k} ∣}^{γ}} sign (β_{j k}^{*}) + o (1)) = o_{p} (1), \\ λ_{2, n} \sum_{c_{s t}^{*} \neq 0} v_{s t} (∣ c_{s t}^{*} + \frac{u_{2 s t}}{\sqrt{n}} ∣ - ∣ c_{s t}^{*} ∣) = \frac{λ_{2, n}}{\sqrt{n}} \sum_{c_{s t}^{*} \neq 0} (\frac{∣ u_{2 s t} ∣}{∣ {\tilde{c}}_{s t} ∣} sign (c_{s t}^{*}) + o (1)) = o_{p} (1) . \end{array}

Therefore,

V_{n} (U) \geq tr (U_{2} Σ U_{2} Σ) + tr (U_{2} W_{n}) + {\tilde{U}}_{1}^{T} (C^{*} \otimes A) {\tilde{U}}_{1} + {\tilde{U}}_{1}^{T} Z_{n} + o_{p} (1) .

(E.4)

By choosing a sufficiently large D, V_n(U) > 0 uniformly on {U : ||U|| = D} with probability greater than 1 − δ, since C* and A are positive definite, W_n = O_p(1), and Z_n = O_p(1). Therefore, (E.2) holds. This completes the proof of the lemma.

Appendix F. Proof of Theorem 3

As defined in Lemma 3, define Q(B, C) for the jointly penalized likelihood as

Q (B, C) = - n log det (C) + tr {C {(Y - XB)}^{T} (Y - XB)} + λ_{1, n} \sum_{j, k} w_{j k} ∣ β_{j k} ∣ + λ_{2, n} \sum_{s \neq t} v_{s t} ∣ c_{s t} ∣ .

Note that (B̂⁽ⁿ⁾, Ĉ) is a $\sqrt{n}$ -consistent local minimizer of Q(B, C). As B̂⁽ⁿ⁾ = argmin_B Q(B, Ĉ) and Ĉ is $\sqrt{n}$ -consistent, the oracle properties of B̂⁽ⁿ⁾ hold by Theorem 1. Similarly, since Ĉ = argmin_C Q(B̂⁽ⁿ⁾, C) and B̂⁽ⁿ⁾ is $\sqrt{n}$ -consistent, the oracle properties of Ĉ hold by Theorem 2. These complete the proof of this theorem.

Contributor Information

Wonyul Lee, Email: wonyull@email.unc.edu.

Yufeng Liu, Email: yfliu@email.unc.edu.

References

1. Banerjee O, Ghaoui LE, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research. 2008;9:485–516.
2. Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37:373–384.
3. Breiman L. Heuristics of instability and stabilization in model selection. The Annals of Statistics. 1996;24:2350–2383.
4. Breiman L, Friedman JH. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society Series B. 1997;59:3–54.
5. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32:407–499.
6. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
7. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.
8. Fu WJ. Penalized regression: The bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
9. Huang JZ, Liu N, Pourahmadi M, Liu L. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika. 2006;93:85–98.
10. Knight K, Fu W. Asymptotics for lasso-type estimators. The Annals of Statistics. 2000;28:1356–1378.
11. Meier L, van de Geer S, Bühlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society Series B. 2008;70:53–71.
12. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
13. Osborne MR, Presnell B, Turlach BA. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis. 2000;20:389–404.
14. Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association. 2009;104:735–746. doi: 10.1198/jasa.2009.0126.
15. Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics. 2008;2:494–515.
16. Rothman AJ, Levina E, Zhu J. Sparse multiple regression with covariance estimation. Journal of Computational and Graphical Statistics. 2010;19:947–962. doi: 10.1198/jcgs.2010.09188.
17. TCGA. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455:1061–1068. doi: 10.1038/nature07385.
18. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B. 1996;58:267–288.
19. Turlach BA, Venables WN, Wright SJ. Simultaneous variable selection. Technometrics. 2005;47:349–363.
20. Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O’Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN TCGA. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and nf1. Cancer Cell. 2010;17:98–110. doi: 10.1016/j.ccr.2009.12.020.
21. Yuan M, Ekici A, Lu Z, Monteiro R. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society Series B. 2007;69:329–346.
22. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B. 2006;68:49–67.
23. Yuan M, Lin Y. Model selection and estimation in the gaussian graphical model. Biometrika. 2007;94:19–35.
24. Zhu Z, Liu Y. Estimating spatial covariance using penalised likelihood with weighted l1 penalty. Journal of Nonparametric Statistics. 2009;21:925–942.
25. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.

Associated Data

Supplementary Materials

NIHMS374480-supplement-01.pdf (772 KB, PDF)
