Author manuscript; available in PMC 2014 Jan 8.
Published in final edited form as: Stat Anal Data Min. 2012 Oct 29;5(6). doi: 10.1002/sam.11158

Multiple Response Regression for Gaussian Mixture Models with Known Labels

Wonyul Lee 1, Ying Du 1, Wei Sun 1, D Neil Hayes 1, Yufeng Liu 1,*
PMCID: PMC3885347  NIHMSID: NIHMS539872  PMID: 24416092

Abstract

Multiple response regression is a useful regression technique to model multiple response variables using the same set of predictor variables. Most existing methods for multiple response regression are designed for modeling homogeneous data. In many applications, however, one may have heterogeneous data where the samples are divided into multiple groups. Our motivating example is a cancer dataset where the samples belong to multiple cancer subtypes. In this paper, we consider modeling the data coming from a mixture of several Gaussian distributions with known group labels. A naive approach is to split the data into several groups according to the labels and model each group separately. Although it is simple, this approach ignores potential common structures across different groups. We propose new penalized methods to model all groups jointly in which the common and unique structures can be identified. The proposed methods estimate the regression coefficient matrix, as well as the conditional inverse covariance matrix of response variables. Asymptotic properties of the proposed methods are explored. Through numerical examples, we demonstrate that both estimation and prediction can be improved by modeling all groups jointly using the proposed methods. An application to a glioblastoma cancer dataset reveals some interesting common and unique gene relationships across different cancer subtypes.

Keywords: Covariance estimation, GLASSO, Hierarchical penalty, LASSO, Multiple response, Regression, Sparsity

1 Introduction

Multivariate regression with a univariate response is a common and popular technique that builds a model to predict a response variable given a set of predictor variables. In the statistical learning literature, many regression techniques are designed for the univariate response setting. In many applications, however, multiple response variables are available. A simple approach is to regress each response variable separately on the same set of explanatory variables. Although simple and popular, this univariate response approach ignores the joint information among the response variables.

To improve estimation and prediction accuracy, it can be advantageous to model the response variables jointly. Breiman and Friedman [1] proposed a method called “Curds and Whey”, which was shown to achieve improved prediction accuracy over univariate regression techniques when there are correlations among the response variables. To achieve variable selection, Turlach et al. [2] proposed a penalized method using the max-L1 penalty. Their method aims to select a common subset of variables that can be used as predictors for all response variables. Yuan et al. [3] proposed another shrinkage method to model response variables jointly. Their idea is to achieve dimension reduction by encouraging sparsity among the singular values of the regression coefficient matrix. The embedded dimension reduction is especially useful when the dimension of the predictor variables is much higher than the sample size.

Besides the regression coefficient matrix, it can also be useful to estimate the conditional inverse covariance matrix. Under the Gaussian assumption, this matrix closely relates to Gaussian graphical models and provides useful interpretation of the relationship among response variables. Recently, Rothman et al. [4] and Lee and Liu [5] proposed to model response variables jointly in a penalized likelihood framework using L1 regularization. In particular, they assume that the conditional distribution of response variables given predictors is multivariate Gaussian. Under this assumption, they perform joint estimation of the regression coefficient matrix and the conditional inverse covariance matrix of response variables given predictors. In the estimation procedure, Lee and Liu [5] used weighted L1 penalties on both matrices in order to encourage sparsity among the entries of the estimated matrices. Their results indicate that simultaneous modeling of the multiple response variables can provide more accurate estimation of both regression coefficients and the inverse covariance matrix.

The previous work mentioned above assumes that all observations come from a single multivariate Gaussian distribution. However, in some applications this assumption can be too strong. For example, we consider the glioblastoma multiforme (GBM) cancer dataset studied by The Cancer Genome Atlas (TCGA) Research Network [6]. Verhaak et al. [7] showed that GBM patients can be divided into four subtypes based on their gene expressions. According to their study, gene expressions of patients within each subtype are very similar, whereas patients in different subtypes can be very different from each other. Therefore, the assumption of one multivariate Gaussian distribution for all patients may not be valid. In this paper, we consider modeling data arising from a mixture of several Gaussian distributions. Specifically, we model the gene expression data of the patients of a particular subtype by a multivariate Gaussian distribution, which can vary from one subtype to another. Here we assume that the Gaussian mixture labels are given. To tackle this problem, a naive approach is to model each group separately. However, this approach ignores the common structures that may exist across different groups. Therefore, it can be more useful to model all groups jointly so that the common structures can be estimated from the aggregated data.

In this paper, we propose three approaches to model all groups jointly via penalizing parameter matrices together. The first two approaches are plug-in methods and the third one is to estimate all parameter matrices jointly. In particular, for the first approach, we plug in a reasonable estimator of the inverse covariance matrices to estimate the regression coefficient matrices. For the second approach, we estimate the inverse covariance matrices instead after plugging in a good estimator of the regression coefficient matrices. The last approach simultaneously estimates the regression coefficient matrices and the inverse covariance matrices. These methods are penalized log-likelihood approaches with the multivariate mixture Gaussian assumption.

In the following sections, we describe the proposed methods in more detail with theoretical justification and numerical examples. In Section 2, we introduce our proposed methods. Section 3 explores their theoretical properties. Section 4 develops computational algorithms to obtain solutions for the proposed methods. Simulated examples are presented in Section 5 to demonstrate the performance of our methods, and Section 6 provides analysis of a glioblastoma cancer data example. We conclude the paper with some discussion in Section 7. The proofs of the theorems are provided in the Appendix.

2 Methodology

Consider the dataset with G different groups. Suppose the g-th group contains $n_g$ observations of p covariates and m response variables. Let $y_i^{(g)} = (y_{i1}^{(g)}, \ldots, y_{im}^{(g)})^T$, $i = 1, \ldots, n_g$, be the m-dimensional responses and $Y^{(g)} = [y_1^{(g)}, \ldots, y_{n_g}^{(g)}]^T$ be the $n_g \times m$ response matrix in the g-th group. Let $x_i^{(g)} = (x_{i1}^{(g)}, \ldots, x_{ip}^{(g)})^T$, $i = 1, \ldots, n_g$, be the p-dimensional predictors and $X^{(g)} = [x_1^{(g)}, \ldots, x_{n_g}^{(g)}]^T$ be the $n_g \times p$ design matrix in the g-th group. Consider the multiple response linear regression model in the g-th group,

$$Y^{(g)} = X^{(g)}B^{(g)} + e^{(g)}, \quad \text{with } e^{(g)} = [\varepsilon_1^{(g)}, \ldots, \varepsilon_{n_g}^{(g)}]^T,$$

where $B^{(g)} = \{\beta_{jk}^{(g)}\}$, $j = 1, \ldots, p$, $k = 1, \ldots, m$, is an unknown $p \times m$ parameter matrix. The errors $\varepsilon_i^{(g)} = (\varepsilon_{i1}^{(g)}, \ldots, \varepsilon_{im}^{(g)})^T$, $i = 1, \ldots, n_g$, are i.i.d. m-dimensional random vectors following a multivariate normal distribution $N(0, \Sigma^{(g)})$ with a nonsingular covariance matrix $\Sigma^{(g)}$. Let $C^{(g)} = (\Sigma^{(g)})^{-1} = (c_{jj'}^{(g)})_{m \times m}$, $j, j' = 1, \ldots, m$.

Our goal is to estimate {(B(g), C(g))} so that we can perform prediction and achieve graphical interpretation among response variables, where {(B(g), C(g))} = {(B(g), C(g)), g = 1, …, G}. The most direct way to estimate {(B(g), C(g))} is to build G individual maximum likelihood models. More specifically, the maximum likelihood estimator of {(B(g), C(g))} can be obtained by maximizing the following conditional log-likelihoods given X(g),

$$\frac{n_g}{2}\log\det(C^{(g)}) - \frac{1}{2}\,\mathrm{tr}\big\{(Y^{(g)} - X^{(g)}B^{(g)})\,C^{(g)}\,(Y^{(g)} - X^{(g)}B^{(g)})^T\big\}, \quad g = 1, \ldots, G, \qquad (1)$$

up to a constant not depending on (B(g), C(g)). It is well-known that the resulting estimator of B(g) is the ordinary least squares estimator and it does not make use of the joint information among response variables. To incorporate the joint information among response variables in the estimation procedure, we can apply the method proposed by Lee and Liu [5]. In particular, the estimator is given by solving

$$\operatorname*{argmin}_{B^{(g)},\,C^{(g)}}\left\{-l(B^{(g)}, C^{(g)}) + \lambda_1\sum_{j,k}|\beta_{jk}^{(g)}| + \lambda_2\sum_{s \neq t}|c_{st}^{(g)}|\right\}, \qquad (2)$$

where $l(B^{(g)}, C^{(g)}) = n_g \log\det(C^{(g)}) - \mathrm{tr}\{(Y^{(g)} - X^{(g)}B^{(g)})C^{(g)}(Y^{(g)} - X^{(g)}B^{(g)})^T\}$; $g = 1, \ldots, G$.

Motivated by the technique for a single linear model in (2), we consider penalization for (1) to improve estimation. In particular, estimation of {(B(g), C(g))} can be improved if some common information across groups is shared in the estimation procedure. Note that the optimization problem in (2) can be solved individually within each group; therefore, it does not utilize the common information across groups. However, since these groups may share information with similar structure, it can be useful to exploit this connection.

In this section, we propose methods that combine G individual models to improve prediction and estimation. Our goal is to estimate {(B(g), C(g))} simultaneously to identify the common and unique structures across groups. Note that there are two parameter matrices in each group, B(g) and C(g), involved in the estimation, and in many applications only one of them is of main interest. Hence, we consider three different approaches: two plug-in methods and one joint method. In Sections 2.1 and 2.2, we introduce two different plug-in penalized likelihood methods, one for multiple response regression and the other for inverse covariance estimation. In the plug-in method for multiple response regression, we estimate {C(g)} first and then plug it into the likelihood to estimate {B(g)} using penalization. In the plug-in method for inverse covariance estimation, we estimate {B(g)} first and then incorporate the information to estimate {C(g)}. In Section 2.3, we estimate {(B(g), C(g))} together via double penalization.

2.1 Plug-in Hierarchical LASSO estimator

Our goal in this section is to estimate the regression coefficients {B(g)}, assuming that inverse covariance estimates {Ĉ(g)} are available. Although B(g) can differ across g, we expect the groups to share some common structures. In particular, in our cancer application, different groups correspond to patients with different subtypes of brain cancer. Thus, patients from different groups are likely to share many similarities, although there are important differences among the subtypes. This motivates us to perform joint estimation of {B(g)} through shrinkage. It is desirable to identify the common and unique structures of {B(g)} through the penalty.

Suppose we have the inverse covariance estimates {Ĉ(g)} available. Let $\beta_{jk} = (\beta_{jk}^{(1)}, \ldots, \beta_{jk}^{(G)})^T$. The regression parameters $(\beta_{jk}^{(1)}, \ldots, \beta_{jk}^{(G)})$, corresponding to the same response variable and the same predictor variable, are treated as a group. We consider a new penalized likelihood method, namely the plug-in hierarchical LASSO (PHL) estimator, to estimate {B(g)} by solving

$$\operatorname*{argmin}_{\{B^{(g)}\}_{g=1}^G}\ \sum_{g=1}^G \mathrm{tr}\big\{(Y^{(g)} - X^{(g)}B^{(g)})\hat{C}^{(g)}(Y^{(g)} - X^{(g)}B^{(g)})^T\big\} + \lambda_1\sum_{j,k} p(\beta_{jk}), \quad \text{where } p(\beta_{jk}) = \Big(\sum_{g=1}^G |\beta_{jk}^{(g)}|\Big)^{1/2}. \qquad (3)$$

Here λ1 is a tuning parameter. The penalty in (3) was proposed by Zhou and Zhu [8], who call it the hierarchical group penalty. This penalty controls the sparsity of {B̂(g)} hierarchically. At the first level of the hierarchy, the estimator of βjk tends to shrink to a zero vector as a group if all coefficients in the group are small in magnitude. At the second level, if βjk is estimated as a nonzero vector, some coefficients within the group can still be shrunk to zero according to their magnitudes. Zhou and Zhu [8] showed that the penalty in (3) encourages such hierarchical sparsity. Intuitively, note that $p(\beta_{jk})$ can be approximated by $\sum_{g=1}^G \frac{|\beta_{jk}^{(g)}|}{2\big(\sum_{g'=1}^G |\beta_{jk}^{(g'),*}|\big)^{1/2}}$, where $\beta_{jk}^{(g'),*}$ is close to the solution of (3). Therefore, all coefficients in βjk share the same group-level weight, $\big\{2\big(\sum_{g'=1}^G |\beta_{jk}^{(g'),*}|\big)^{1/2}\big\}^{-1}$, while each coefficient receives a different amount of penalty according to its magnitude. As a remark, we would like to point out that p(βjk) serves as a group penalty on each group of coefficients and encourages group shrinkage. A similar idea was previously considered by Turlach et al. [2], Yuan and Lin [9], Zhang et al. [10], and Zhao et al. [11].
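To make the penalty concrete, the short numpy sketch below evaluates p(βjk) over all (j, k) pairs and the group-level weight appearing in the approximation above. The function names and the small constant guarding against division by zero are our own choices.

```python
import numpy as np

def hierarchical_penalty(B_list):
    """Hierarchical group penalty sum_{j,k} (sum_g |beta_jk^(g)|)^(1/2).

    B_list: list of G coefficient matrices, each of shape (p, m)."""
    abs_sum = np.sum([np.abs(B) for B in B_list], axis=0)  # (p, m): sum_g |beta_jk^(g)|
    return float(np.sum(np.sqrt(abs_sum)))

def group_weights(B_list, eps=1e-8):
    """Weights 1 / (2 (sum_g |beta_jk^(g)|)^(1/2)) that turn the penalty into a
    weighted L1 penalty at the current iterate."""
    abs_sum = np.sum([np.abs(B) for B in B_list], axis=0)
    return 1.0 / (2.0 * np.sqrt(abs_sum) + eps)
```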

For the procedure in (3), we need to first estimate {C(g)}. To that end, we obtain initial estimates of {B(g)} by applying univariate regression techniques within each group. Let $\{\hat{B}^{(g),0}\}$ denote the initial estimates. Define $S^{(g)}$ by

$$S^{(g)} = \frac{1}{n_g}\,(Y^{(g)} - X^{(g)}\hat{B}^{(g),0})^T(Y^{(g)} - X^{(g)}\hat{B}^{(g),0}). \qquad (4)$$

Then {Ĉ(g)} can be obtained by solving

$$\operatorname*{argmin}_{C^{(g)}}\left\{-\log\det(C^{(g)}) + \mathrm{tr}(S^{(g)}C^{(g)}) + \lambda_2\sum_{j \neq k} v_{jk}|c_{jk}^{(g)}|\right\}, \quad g = 1, \ldots, G. \qquad (5)$$

This problem is essentially the same as the problem of estimating the inverse covariance matrix in the context of sparse Gaussian graphical models. This technique was considered by many authors [12, 13, 14, 15]. Among various existing algorithms, we adapt the Graphical LASSO (GLASSO) algorithm proposed by Friedman et al. [14].
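As an illustration of this plug-in step, the sketch below computes column-wise initial LASSO fits, forms the residual covariance S(g) in (4), and calls scikit-learn's graphical_lasso to obtain Ĉ(g). It is a minimal sketch under our own choices: the weights v_jk in (5) are taken to be 1, the regularization values are placeholders, and the function name is ours.

```python
import numpy as np
from sklearn.covariance import graphical_lasso
from sklearn.linear_model import Lasso

def plugin_precision_estimates(X_list, Y_list, lasso_alpha=0.1, glasso_alpha=0.1):
    """Column-wise LASSO fits give B_hat^(g),0; GLASSO on the residual
    covariance S^(g) then gives a sparse C_hat^(g) for each group."""
    C_hats = []
    for X, Y in zip(X_list, Y_list):
        n_g, m = Y.shape
        B0 = np.column_stack([
            Lasso(alpha=lasso_alpha, fit_intercept=False).fit(X, Y[:, k]).coef_
            for k in range(m)
        ])                                   # (p, m) initial coefficient matrix
        R = Y - X @ B0                       # residuals, (n_g, m)
        S = R.T @ R / n_g                    # residual covariance S^(g), (m, m)
        _, C_hat = graphical_lasso(S, alpha=glasso_alpha)  # sparse precision estimate
        C_hats.append(C_hat)
    return C_hats
```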

2.2 Plug-in Hierarchical Graphical LASSO estimator

In Section 2.1, we considered a plug-in method, PHL, which estimates {C(g)} first and then estimates {B(g)} given {Ĉ(g)}. In this section, we propose another plug-in method that uses {B̂(g)} to estimate {C(g)}. In particular, we first estimate {B(g)} by applying univariate regression techniques and then obtain {S(g)}, defined as $S^{(g)} = \frac{1}{n_g}(Y^{(g)} - X^{(g)}\hat{B}^{(g)})^T(Y^{(g)} - X^{(g)}\hat{B}^{(g)})$. With {S(g)} available, we propose a penalized likelihood method, the plug-in hierarchical graphical LASSO (PHGL) estimator, obtained by solving

$$\operatorname*{argmin}_{\{C^{(g)}\}_{g=1}^G}\ \sum_{g=1}^G\big\{-n_g\log\det(C^{(g)}) + n_g\,\mathrm{tr}(S^{(g)}C^{(g)})\big\} + \lambda_2\sum_{s \neq t}\Big(\sum_{g=1}^G |c_{st}^{(g)}|\Big)^{1/2}, \qquad (6)$$

where λ2 is a tuning parameter.

This approach is closely related to the method previously considered by Guo et al. [16]. They considered the problem of estimating the inverse covariance matrix of Y(g). However, we estimate the conditional inverse covariance matrix of Y(g) given X(g). Even though the optimization problem in (6) is technically the same as that in their method, our resulting estimator has different graphical interpretations.

2.3 Doubly Penalized Sparse Estimator

In Sections 2.1 and 2.2, we considered two plug-in methods for estimation of {(B(g), C(g))}. In this section, we propose to estimate {(B(g), C(g))} simultaneously. We would like to incorporate the information among different response variables in estimation of {B(g)} and encourage all groups to share some common structure among {(B(g), C(g))}. We propose a joint penalized method, the doubly penalized sparse estimator (DPS), by solving

$$\operatorname*{argmin}_{\{B^{(g)},\,C^{(g)}\}_{g=1}^G}\ \sum_{g=1}^G\{-l_g(B^{(g)}, C^{(g)})\} + \lambda_1\sum_{j,k}\Big(\sum_{g=1}^G |\beta_{jk}^{(g)}|\Big)^{1/2} + \lambda_2\sum_{s \neq t}\Big(\sum_{g=1}^G |c_{st}^{(g)}|\Big)^{1/2}, \qquad (7)$$

where $l_g(B^{(g)}, C^{(g)}) = n_g\log\det(C^{(g)}) - \mathrm{tr}\{(Y^{(g)} - X^{(g)}B^{(g)})C^{(g)}(Y^{(g)} - X^{(g)}B^{(g)})^T\}$. As a group penalty, the first penalty term in (7) encourages hierarchical sparsity among {B(g)}. In the meantime, the second penalty term in (7) serves as a group penalty for {C(g)}.
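For concreteness, the sketch below evaluates the DPS objective in (7) for candidate parameter matrices; the inverse-covariance penalty is applied to the off-diagonal entries, matching (7), and the function name is ours.

```python
import numpy as np

def dps_objective(B_list, C_list, X_list, Y_list, lam1, lam2):
    """Objective in (7): sum_g {-l_g(B^(g), C^(g))} plus the two hierarchical
    group penalties on {B^(g)} and on the off-diagonal entries of {C^(g)}."""
    neg_loglik = 0.0
    for B, C, X, Y in zip(B_list, C_list, X_list, Y_list):
        n_g = Y.shape[0]
        R = Y - X @ B
        _, logdet = np.linalg.slogdet(C)
        neg_loglik += -n_g * logdet + np.trace(R @ C @ R.T)
    pen_B = lam1 * np.sum(np.sqrt(np.sum([np.abs(B) for B in B_list], axis=0)))
    abs_C = np.sum([np.abs(C) for C in C_list], axis=0)
    off_diag = ~np.eye(abs_C.shape[0], dtype=bool)
    pen_C = lam2 * np.sum(np.sqrt(abs_C[off_diag]))
    return neg_loglik + pen_B + pen_C
```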

Note that the objective function in (7) is not convex with respect to {(B(g), C(g))}, and the optimization can be unstable when max{n1, …, nG} < p. In that case one can choose {B(g)} so that the residuals, and hence the trace terms in lg(B(g), C(g)), are exactly zero; the log-determinant term then dominates the objective, which can keep decreasing as the diagonal entries of {C(g)} increase. As a result, the numerical solution of {C(g)} in (7) can have very large diagonal entries. Such solutions are not desirable in practice since they imply that the residual variances of the response variables are nearly zero. Therefore, if max{n1, …, nG} < p, the plug-in methods in Sections 2.1 and 2.2 are recommended and can often perform better than the DPS method.

2.4 Model Selection

In Sections 2.1–2.3, we proposed two plug-in methods and one joint method for estimation of {(B(g), C(g))}. To apply these methods, we first need to select the tuning parameters λ1 and λ2 in (3), (6), and (7), which control the sparsity of the resulting estimators. For tuning parameter selection, K-fold cross-validation can be combined with our methods. In particular, K-fold cross-validation randomly splits the dataset into K parts of equal size. Denote the data in the k-th segment by $\{(X_{(k)}^{(g)}, Y_{(k)}^{(g)})\}$. For any given λ1, λ2, and k, we estimate the regression coefficient matrices and the inverse covariance matrices using all data except the k-th part and denote them by $\{(\hat{B}_{\lambda_1,(-k)}^{(g)}, \hat{C}_{\lambda_2,(-k)}^{(g)})\}$. For the PHL method, we select the tuning parameter λ̂1 that minimizes the prediction error defined by

$$CV(\lambda_1) = \sum_{k=1}^K\sum_{g=1}^G \big\|Y_{(k)}^{(g)} - X_{(k)}^{(g)}\hat{B}_{\lambda_1,(-k)}^{(g)}\big\|_F^2, \qquad (8)$$

where ||·||F is the Frobenius norm of a matrix. For the PHGL method, we select the optimal tuning parameter λ̂2 which maximizes the predictive log-likelihood defined by

$$CV(\lambda_2) = \sum_{k=1}^K\sum_{g=1}^G\Big[n_{(g,k)}\log\det\big(\hat{C}_{\lambda_2,(-k)}^{(g)}\big) - \mathrm{tr}\big\{(Y_{(k)}^{(g)} - X_{(k)}^{(g)}\hat{B}^{(g)})\hat{C}_{\lambda_2,(-k)}^{(g)}(Y_{(k)}^{(g)} - X_{(k)}^{(g)}\hat{B}^{(g)})^T\big\}\Big], \qquad (9)$$

where n(g,k) is the sample size of the g-th group in the k-th segment. In the DPS method, we first choose the optimal λ̂1 using (8) with a prespecified λ2, and then select the optimal λ̂2 using (9) with the selected λ̂1. This avoids a two-dimensional grid search over (λ1, λ2). We have found in simulations that, within a certain range of λ2, the particular value of λ2 has very little effect on the optimal λ̂1. In particular, in all simulated examples in Section 5, the selected optimal values of λ1 are almost identical for any prespecified value of λ2 within the interval $[2^{-6}, 2^{6}]$. To prespecify a value of λ2, the optimal tuning parameter from the PHGL method can be used.

The tuning parameters can also be selected using validation sets. In particular, we split the dataset into a training set and a validation set. For given λ1 and λ2, we construct the corresponding models by applying our methods to the training set. By using the validation set as $\{(X_{(k)}^{(g)}, Y_{(k)}^{(g)})\}$ in (8) and (9), we can compute the prediction error and the predictive log-likelihood on this set to select the tuning parameters. The cross-validation method is computationally more intensive than using validation sets. We use validation sets for the simulated examples and 5-fold cross-validation for the glioblastoma cancer data example.
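The validation-set version of this selection is straightforward to code. The sketch below picks λ̂1 by minimizing the held-out prediction error, the single-split analogue of (8). `fit_phl` is a hypothetical fitting routine standing in for the PHL solver; it is assumed to take lists of per-group training matrices plus a tuning value and return the list of estimated coefficient matrices.

```python
import numpy as np

def select_lambda1(fit_phl, X_tr, Y_tr, X_val, Y_val, lambda_grid):
    """Return the lambda_1 in lambda_grid with the smallest validation
    prediction error; `fit_phl` is a hypothetical PHL fitting routine."""
    errors = []
    for lam in lambda_grid:
        B_hats = fit_phl(X_tr, Y_tr, lam)
        pe = sum(np.linalg.norm(Yv - Xv @ B, 'fro') ** 2
                 for Xv, Yv, B in zip(X_val, Y_val, B_hats))
        errors.append(pe)
    return lambda_grid[int(np.argmin(errors))]
```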

3 Asymptotic Properties

In this section, we investigate the asymptotic behavior of our three proposed methods as the sample sizes go to infinity. In particular, we show that the resulting estimators of all three methods satisfy consistency and sparsity with proper choices of tuning parameters. To this end, we use the set-up of Fan and Li [17], Yuan and Lin [12], and Zou [18]. The technical derivation uses the results in Knight and Fu [19]. Without loss of generality, we assume that n = n1 = ··· = nG and n goes to infinity. Define a vector operator for any matrix $A = [a_1, \ldots, a_p]$ by $\mathrm{Vec}(A) = (a_1^T, \ldots, a_p^T)^T$. Let $\beta^* = (\mathrm{Vec}(B^{*,(1)})^T, \ldots, \mathrm{Vec}(B^{*,(G)})^T)^T$ be the true regression parameter vector and $c^* = (\mathrm{Vec}(C^{*,(1)})^T, \ldots, \mathrm{Vec}(C^{*,(G)})^T)^T$ be the vector of the entries of the true inverse covariance matrices. The following theorem shows the $\sqrt{n}$-consistency and the sparsity of the solution of (3).

Theorem 1

Suppose that $\lambda_1 n^{-1/2} \to 0$ as $n \to \infty$ and that $\hat{C}^{(g)}$ in (3) is a consistent estimator of $C^{*,(g)}$, $g = 1, \ldots, G$. Furthermore, suppose that $\frac{1}{n}X^{(g)T}X^{(g)} \to A^{(g)}$ as $n \to \infty$, where $A^{(g)}$ is a positive definite matrix, $g = 1, \ldots, G$.

  1. (Consistency) There exists a local minimizer of (3) such that $\|\hat{\beta} - \beta^*\| = O_p(1/\sqrt{n})$, where $\hat{\beta} = (\mathrm{Vec}(\hat{B}^{(1)})^T, \ldots, \mathrm{Vec}(\hat{B}^{(G)})^T)^T$;

  2. (Sparsity) If $\lambda_1 n^{-1/4} \to \infty$, then $\lim_{n\to\infty} P(\hat{\beta}_{jk}^{(g)} = 0) = 1$ whenever $\beta_{jk}^{*,(g)} = 0$.

Theorem 1 states that, with a consistent estimator of $C^{*,(g)}$, the PHL estimator is $\sqrt{n}$-consistent. Furthermore, it can identify the true subset of predictor variables asymptotically with probability tending to 1. Similar asymptotic properties hold for the PHGL estimator, as stated in the following theorem.

Theorem 2

Suppose that $\lambda_2 n^{-1/2} \to 0$ as $n \to \infty$ and that $\hat{B}^{(g)}$ in (6) is a $\sqrt{n}$-consistent estimator of $B^{*,(g)}$, $g = 1, \ldots, G$.

  1. (Consistency) There exists a local minimizer of (6) such that $\|\hat{c} - c^*\| = O_p(1/\sqrt{n})$, where $\hat{c} = (\mathrm{Vec}(\hat{C}^{(1)})^T, \ldots, \mathrm{Vec}(\hat{C}^{(G)})^T)^T$;

  2. (Sparsity) If $\lambda_2 n^{-1/4} \to \infty$, then $\lim_{n\to\infty} P(\hat{c}_{jk}^{(g)} = 0) = 1$ whenever $c_{jk}^{*,(g)} = 0$.

In Theorems 1 and 2, we establish the consistency and sparsity of the plug-in estimators. The following theorem shows similar asymptotic properties of the DPS solution, in which {B̂(g)} and {Ĉ(g)} are obtained together.

Theorem 3

Suppose that $\lambda_1 n^{-1/2} \to 0$ and $\lambda_2 n^{-1/2} \to 0$ as $n \to \infty$. In addition, suppose that $\frac{1}{n}X^{(g)T}X^{(g)} \to A^{(g)}$ as $n \to \infty$, where $A^{(g)}$ is a positive definite matrix, $g = 1, \ldots, G$.

  1. (Consistency) There exists a local minimizer of (7) such that
    $\|(\hat{\beta}^T, \hat{c}^T)^T - (\beta^{*T}, c^{*T})^T\| = O_p(1/\sqrt{n})$,

    where $\hat{\beta} = (\mathrm{Vec}(\hat{B}^{(1)})^T, \ldots, \mathrm{Vec}(\hat{B}^{(G)})^T)^T$ and $\hat{c} = (\mathrm{Vec}(\hat{C}^{(1)})^T, \ldots, \mathrm{Vec}(\hat{C}^{(G)})^T)^T$;

  2. (Sparsity of {B̂(g)}) If $\lambda_1 n^{-1/4} \to \infty$, then $\lim_{n\to\infty} P(\hat{\beta}_{jk}^{(g)} = 0) = 1$ whenever $\beta_{jk}^{*,(g)} = 0$;

  3. (Sparsity of {Ĉ(g)}) If $\lambda_2 n^{-1/4} \to \infty$, then $\lim_{n\to\infty} P(\hat{c}_{jk}^{(g)} = 0) = 1$ whenever $c_{jk}^{*,(g)} = 0$.

4 Computational Algorithm

In this section, we describe computational algorithms to solve problems (3), (6), and (7). In particular, we iteratively apply the coordinate-descent algorithms used by Lee and Liu [5], in combination with the local linear approximation (LLA) [20].

We now describe the algorithm for the PHL method in detail. Denote the estimate of $\beta_{jk}^{(g)}$ from the i-th iteration by $(\hat{\beta}_{jk}^{(g)})^{(i)}$. By applying the LLA, the penalty term in (3) at the (i + 1)-th iteration can be approximated as follows,

$$p(\beta_{jk}) = \Big(\sum_{g=1}^G |\beta_{jk}^{(g)}|\Big)^{1/2} \approx \sum_{g=1}^G \frac{|\beta_{jk}^{(g)}|}{2\big(\sum_{g'=1}^G |(\hat{\beta}_{jk}^{(g')})^{(i)}|\big)^{1/2}}.$$

Then, at the (i + 1)-th iteration, the problem (3) is decomposed into G individual optimization problems

$$\operatorname*{argmin}_{B^{(g)}}\ \mathrm{tr}\big\{(Y^{(g)} - X^{(g)}B^{(g)})\hat{C}^{(g)}(Y^{(g)} - X^{(g)}B^{(g)})^T\big\} + \lambda_1\sum_{j,k} w_{jk}|\beta_{jk}^{(g)}|, \qquad (10)$$

where $w_{jk} = \frac{1}{2}\big(\sum_{g'=1}^G |(\hat{\beta}_{jk}^{(g')})^{(i)}|\big)^{-1/2}$ and g = 1, …, G. The optimization problem (10) is exactly the problem of estimating the regression parameter matrix with a plug-in inverse covariance matrix. It can be solved by applying the coordinate-descent algorithm for the plug-in weighted LASSO method proposed by Lee and Liu [5]. The basic idea of the coordinate-descent algorithm is to optimize one parameter at a time with the other parameters fixed at their current values. The algorithm for (3) proceeds as follows:

Algorithm for the PHL Method

  • Step 1 (Initial value). Set the separate LASSO solutions $\{(\hat{B}^{(g)})^{(0)}\}$, g = 1, …, G, as the initial values for {B(g)}.

  • Step 2 (Updating rule). For g = 1, …, G, update $(\hat{B}^{(g)})^{(i)}$ by applying the coordinate-descent algorithm for the plug-in weighted LASSO method [5] to problem (10).

  • Step 3 (Iteration). Repeat Step 2 until convergence.
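The sketch below illustrates one LLA iteration of this scheme in numpy: it forms the weights w_jk from the current iterate and then runs a plain coordinate descent with soft-thresholding on the weighted problem (10) within each group. This is a simplified illustration under our own implementation choices, not the authors' exact algorithm.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def phl_lla_pass(B_list, C_hat_list, X_list, Y_list, lam1, n_sweeps=20, eps=1e-8):
    """One LLA iteration for (3): weights w_jk from the current {B_hat^(g)},
    then coordinate descent on the weighted problem (10) in each group."""
    abs_sum = np.sum([np.abs(B) for B in B_list], axis=0)
    W = 1.0 / (2.0 * np.sqrt(abs_sum) + eps)       # LLA weights, shared across groups

    new_Bs = []
    for B, C, X, Y in zip(B_list, C_hat_list, X_list, Y_list):
        B = B.copy()
        R = Y - X @ B                              # residual matrix, (n_g, m)
        xj_sq = np.sum(X ** 2, axis=0)             # [X^T X]_jj, j = 1..p
        for _ in range(n_sweeps):
            for j in range(B.shape[0]):
                for k in range(B.shape[1]):
                    a = xj_sq[j] * C[k, k]         # curvature of the quadratic part
                    if a <= eps:
                        continue
                    grad_jk = X[:, j] @ R @ C[:, k]          # [X^T (Y - XB) C]_jk
                    z = B[j, k] + grad_jk / a                # unpenalized coordinate minimizer
                    b_new = soft_threshold(z, lam1 * W[j, k] / (2.0 * a))
                    if b_new != B[j, k]:
                        R[:, k] -= X[:, j] * (b_new - B[j, k])   # keep residuals in sync
                        B[j, k] = b_new
        new_Bs.append(B)
    return new_Bs
```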

Next we describe the algorithm for the PHGL method in Section 2.2. Similar to the algorithm for the PHL method, we first apply the LLA to the objective function in (6) with the current estimates {(Ĉ(g))(i)}. Then, at the (i + 1)-th iteration, the problem (6) is decomposed into G individual optimization problems

$$\operatorname*{argmin}_{C^{(g)}}\ \big\{-n_g\log\det(C^{(g)}) + n_g\,\mathrm{tr}(S^{(g)}C^{(g)})\big\} + \lambda_2\sum_{s \neq t} v_{st}|c_{st}^{(g)}|, \qquad (11)$$

where $v_{st} = \frac{1}{2}\big(\sum_{g'=1}^G |(\hat{c}_{st}^{(g')})^{(i)}|\big)^{-1/2}$ and g = 1, …, G. The problem (11) can be solved by applying the GLASSO algorithm. Therefore, the algorithm for (6) proceeds as follows:

Algorithm for the PHGL Method

  • Step 1 (Initial value). Set the separate GLASSO solutions $\{(\hat{C}^{(g)})^{(0)}\}$, g = 1, …, G, as the initial values for {C(g)}.

  • Step 2 (Updating rule). For g = 1, …, G, update {(Ĉ(g))(i)} by applying the GLASSO algorithm to the problem (11).

  • Step 3 (Iteration). Repeat Step 2 until convergence.
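A compact sketch of one LLA iteration for PHGL is given below. It only forms the weights v_st from the current iterate; the weighted problem (11) itself is handed to `weighted_glasso`, a hypothetical solver for the graphical lasso with an elementwise penalty matrix (scikit-learn's graphical_lasso accepts only a scalar penalty, so a weighted variant is assumed here). Treating the diagonal as unpenalized is also our assumption.

```python
import numpy as np

def phgl_lla_pass(C_list, S_list, n_list, lam2, weighted_glasso, eps=1e-8):
    """One LLA iteration for (6): build v_st = 1/(2 (sum_g |c_st^(g)|)^(1/2))
    at the current iterate, then solve the weighted problem (11) per group.
    `weighted_glasso(S, Lambda)` is a hypothetical solver minimizing
    -log det C + tr(SC) + sum_{s,t} Lambda_st |c_st|."""
    abs_sum = np.sum([np.abs(C) for C in C_list], axis=0)
    V = 1.0 / (2.0 * np.sqrt(abs_sum) + eps)      # weights v_st, shared across groups
    np.fill_diagonal(V, 0.0)                      # leave diagonal entries unpenalized (assumption)
    # Dividing (11) by n_g puts it in the solver's form with penalty (lam2 / n_g) * V.
    return [weighted_glasso(S, (lam2 / n_g) * V) for S, n_g in zip(S_list, n_list)]
```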

Next, we combine the above two algorithms to solve problem (7) for the doubly penalized method, DPS. The algorithm can be summarized as follows:

Algorithm for the DPS Method

  • Step 1 (Initial values of {B(g)} and {C(g)}). Set the separate LASSO solutions $\{(\hat{B}^{(g)})^{(0)}\}$ as the initial values for {B(g)} and the separate GLASSO solutions $\{(\hat{C}^{(g)})^{(0)}\}$ as the initial values for {C(g)}.

  • Step 2 ({C(g)} updating rule). Given the current $\{(\hat{B}^{(g)})^{(i)}\}$, update $\{(\hat{C}^{(g)})^{(i)}\}$ by applying the algorithm for the PHGL method.

  • Step 3 ({B(g)} updating rule). Given the updated $\{(\hat{C}^{(g)})^{(i)}\}$, update $\{(\hat{B}^{(g)})^{(i)}\}$ by applying the algorithm for the PHL method.

  • Step 4 (Iteration). Repeat Steps 2 and 3 until convergence.
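Schematically, the outer DPS loop alternates the two inner solvers while refreshing the residual covariances, as in the short sketch below. `update_C` and `update_B` stand in for the PHGL-type and PHL-type steps (for instance, wrappers around the sketches given earlier), and the fixed iteration count is a placeholder for a convergence check.

```python
def dps_fit(B_init, C_init, X_list, Y_list, lam1, lam2,
            update_C, update_B, n_outer=10):
    """Alternate the two block updates of the DPS problem (7): given {B^(g)},
    update {C^(g)}; given the updated {C^(g)}, update {B^(g)}."""
    B_list = [B.copy() for B in B_init]
    C_list = [C.copy() for C in C_init]
    for _ in range(n_outer):                      # replace with a convergence check in practice
        # residual covariances S^(g) at the current {B^(g)}
        S_list = [(Y - X @ B).T @ (Y - X @ B) / Y.shape[0]
                  for X, Y, B in zip(X_list, Y_list, B_list)]
        C_list = update_C(C_list, S_list, lam2)                  # PHGL-type step
        B_list = update_B(B_list, C_list, X_list, Y_list, lam1)  # PHL-type step
    return B_list, C_list
```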

As we point out in Section 2.3, when max{n1, …, nG} < p, the solution can possibly be unstable with very small residual variances. In that case, the plug-in methods may perform better.

5 Simulated Examples

In this section, simulation studies are carried out to assess the performance of our proposed methods. In particular, we compare our proposed methods with several existing methods. All five methods are described below.

  • Method 1 (M1). We model each group separately. In particular, we apply the doubly penalized maximum likelihood (DML) method by Lee and Liu [5] separately to each group. The estimator is given by solving (2). This method will be referred to as DML1.

  • Method 2 (M2). In this approach, all groups are combined into one dataset as if they come from a common Gaussian distribution. We apply the DML method to the combined dataset. We call this method DML2.

  • Method 3 (M3). We first estimate {B(g)} by applying LASSO to each response variable separately in each group. Once we have an estimator of {B(g)}, we compute residuals and apply GLASSO to estimate {C(g)}. In particular, the estimator of {C(g)} is given by solving (5). The resulting estimator of {B(g)} will be called the LASSO estimator, and the resulting estimator of {C(g)} will be referred to as the GLASSO estimator.

  • Method 4 (M4). An initial estimate of {B(g)} is obtained by applying LASSO. With the initial estimate of {B(g)}, we apply our proposed plug-in method, PHGL, to estimate {C(g)} jointly. Once we have the estimator of {C(g)}, another plug-in method, PHL, is applied to obtain the final estimate of {B(g)}.

  • Method 5 (M5). We model all groups jointly by applying our proposed method, DPS. In this approach, we estimate both {B(g)} and {C(g)} simultaneously.

Note that Methods 1 and 3 model all groups separately, and Method 2 does not allow any unique structure for each group. On the other hand, our proposed methods (Methods 4 and 5) model all groups jointly while allowing unique structures for each group.

We set G = 3, p = 20, and m = 20. For each group, we generate training, validation, and testing sets, each of size n = 40. Each dataset is generated as follows. First, we produce B and C common to all groups. Figure 1 shows the common structure across groups. We create unique structures for each group by adding additional nonzero parameters. In particular, for each B(g), we randomly pick zero entries and replace them with values randomly chosen from the interval [1, 3]. For each C(g), we randomly pick zero entries and replace them with values randomly chosen from the interval [−1, −0.5] ∪ [0.5, 1]. We define ρ as the ratio of the number of unique nonzero entries to the number of common nonzero entries. We consider two values of ρ: the case ρ = 0 does not allow any unique structure for each group, and the second case has ρ = 0.25. Finally, $y_i^{(g)}$ is generated from $N(B^{(g)T}x_i^{(g)}, (C^{(g)})^{-1})$, where $x_i^{(g)}$, i = 1, …, n, are i.i.d. vectors from $N(0, I_p)$.
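The data-generating step for one group can be written compactly as below. This sketch reflects only the sampling mechanism (x_i ~ N(0, I_p), Gaussian errors with covariance (C(g))^{-1}), not the construction of the common and unique sparsity patterns, and the function name is ours.

```python
import numpy as np

def simulate_group(B_g, C_g, n=40, seed=None):
    """Generate (X, Y) for one group: rows of X are N(0, I_p) and
    Y = X B^(g) + E with error covariance (C^(g))^{-1}."""
    rng = np.random.default_rng(seed)
    p, m = B_g.shape
    Sigma_g = np.linalg.inv(C_g)                          # error covariance
    X = rng.standard_normal((n, p))
    E = rng.multivariate_normal(np.zeros(m), Sigma_g, size=n)
    return X, X @ B_g + E
```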

Figure 1.

Regression parameter structure and inverse covariance structure that are common to all groups. Nonzero entries are colored black and zero entries are colored white.

To assess prediction performance, we use the prediction error defined as,

$$PE = \frac{1}{nmG}\sum_{g=1}^G \big\|Y^{(g)} - \hat{Y}^{(g)}\big\|_F^2,$$

where ||·||F is the Frobenius norm of a matrix.

To compare performance in the estimation of {C(g)}, we report the average entropy loss and the average Frobenius loss which are defined as,

$$EL = \frac{1}{G}\sum_{g=1}^G\Big[\mathrm{tr}\big(\Sigma^{(g)}\hat{C}^{(g)}\big) - \log\big|\Sigma^{(g)}\hat{C}^{(g)}\big| - m\Big], \qquad FL = \frac{1}{G}\sum_{g=1}^G \big\|C^{(g)} - \hat{C}^{(g)}\big\|_F^2 \big/ \big\|C^{(g)}\big\|_F^2.$$
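These three criteria are simple to compute once the fitted values and the estimated inverse covariance matrices are in hand; the sketch below mirrors the definitions above (equal group sizes n are assumed for PE, and the function names are ours).

```python
import numpy as np

def prediction_error(Y_list, Y_hat_list):
    """PE = (1/(n m G)) sum_g ||Y^(g) - Y_hat^(g)||_F^2 (equal group sizes)."""
    G = len(Y_list)
    n, m = Y_list[0].shape
    return sum(np.linalg.norm(Y - Yh, 'fro') ** 2
               for Y, Yh in zip(Y_list, Y_hat_list)) / (n * m * G)

def entropy_loss(Sigma_list, C_hat_list):
    """EL = (1/G) sum_g [tr(Sigma^(g) C_hat^(g)) - log|Sigma^(g) C_hat^(g)| - m]."""
    losses = []
    for Sigma, C_hat in zip(Sigma_list, C_hat_list):
        M = Sigma @ C_hat
        _, logdet = np.linalg.slogdet(M)
        losses.append(np.trace(M) - logdet - M.shape[0])
    return float(np.mean(losses))

def frobenius_loss(C_list, C_hat_list):
    """FL = (1/G) sum_g ||C^(g) - C_hat^(g)||_F^2 / ||C^(g)||_F^2."""
    return float(np.mean([np.linalg.norm(C - Ch, 'fro') ** 2 /
                          np.linalg.norm(C, 'fro') ** 2
                          for C, Ch in zip(C_list, C_hat_list)]))
```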

Table 1 and Figures 2–4 summarize the results. When ρ = 0, M2 outperforms the others in both prediction and estimation of {C(g)}. This is expected because M2 assumes that all groups come from the same distribution, and that assumption is valid when ρ = 0. Therefore, by combining all groups, M2 uses more information than the other methods. Note that our proposed methods, M4 and M5, also show competitive performance in prediction. When ρ = 0.25, M5, one of our proposed methods, shows the best performance under all criteria. This implies that modeling all groups jointly can improve both prediction and estimation of {C(g)} when the groups share some common structures.

Table 1.

Average prediction error, entropy loss, and Frobenius loss based on 100 replications (The numbers in parentheses are standard errors)

                     ρ       M1: DML1       M2: DML2       M3: LASSO      M4: PHL        M5: DPS
Prediction Error     0       2.23 (0.011)   1.59 (0.006)   1.90 (0.008)   1.61 (0.006)   1.61 (0.006)
                     0.25    2.05 (0.021)   4.51 (0.018)   1.76 (0.009)   1.50 (0.007)   1.50 (0.007)

                     ρ       M1: DML1       M2: DML2       M3: GLASSO     M4: PHGL       M5: DPS
Entropy Loss         0       11.52 (0.153)  1.17 (0.020)   4.69 (0.077)   4.40 (0.079)   2.47 (0.043)
                     0.25    11.58 (0.149)  8.62 (0.046)   5.22 (0.058)   5.27 (0.087)   3.31 (0.051)
Frobenius Loss       0       0.82 (0.019)   0.05 (0.001)   0.36 (0.006)   0.34 (0.010)   0.15 (0.003)
                     0.25    1.20 (0.027)   0.47 (0.002)   0.42 (0.012)   0.46 (0.012)   0.22 (0.004)

Figure 2.

Boxplots of prediction errors of all methods based on 100 replications. Left: all groups are the same. Right: there are both common and unique structures across groups.

Figure 4.

Boxplots of Frobenius losses of all methods based on 100 replications. Left: all groups are the same. Right: there are both common and unique structures across groups.

Table 2 summarizes the computational times of M4 and M5 relative to that of M3. In terms of computational cost, M5 is more intensive than the other methods, while M4 shows competitive computational time. For instance, when ρ = 0.25, the computational time of M5 is 30.09 times that of M3. M5 is computationally more intensive because it estimates all parameter matrices simultaneously. However, in terms of performance, M5 outperforms M3 in both prediction and estimation of {C(g)} in our simulated examples.

Table 2.

Averages of the computational times of M4 and M5 relative to M3, based on 100 replications (numbers in parentheses are standard errors). For example, when ρ = 0, the computational time of M4 is 3.92 times that of M3.

                              M3    M4: PHL       M5: DPS
Simulated examples  ρ = 0     1     3.92 (0.05)   38.40 (0.35)
                    ρ = 0.25  1     3.44 (0.04)   30.09 (0.35)

6 Application on the Glioblastoma Cancer Data

In this section, we apply our proposed methods to a glioblastoma multiforme (GBM) cancer dataset. In our application, there are 17814 genes and 534 micro-RNAs for 482 GBM patients. The patients were classified into 4 gene expression-based subtypes, namely Classical, Mesenchymal, Neural, and Proneural, with sample sizes of 127, 145, 85, and 125, respectively [7]. One important goal is to regress genes on micro-RNAs to investigate the effect of micro-RNAs on gene expressions. The other goal is to estimate the conditional inverse covariance matrix of gene expressions given micro-RNAs. This matrix can help us interpret the conditional relationships among genes given micro-RNAs.

To proceed with the analysis, preprocessing is necessary, and there are many possible approaches. For example, Bair and Tibshirani [21] developed procedures that utilize both gene expression data and clinical data to select a list of genes for identifying cancer subtypes. In our analysis, the preprocessing proceeds as follows. Verhaak et al. [7] established 840 signature genes which are highly distinctive for the four subtypes. We use these 840 signature genes to explore the distinctive effects of micro-RNAs on them. Our proposed methods are most useful for genes with correlated residuals. Therefore, to apply our proposed methods, the genes are first grouped into several gene modules so that genes within each module are more closely related to each other. Then our proposed methods are applied to each module separately. This approach is sensible for our methodology because a gene module is a set of closely related genes. To detect such gene modules, we perform the weighted gene co-expression network analysis (WGCNA) of Zhang and Horvath [22]. WGCNA detects modules using hierarchical clustering with the topological overlap dissimilarity measure [23]. Zhang and Horvath [22] pointed out that WGCNA can detect biologically meaningful modules.

By performing WGCNA on the 840 signature genes, we found 14 modules with 60 genes per module on average. One of them is particularly interesting because many genes in the module, such as EGFR and PDGFA, are involved in cell proliferation. Moreover, Verhaak et al. [7] demonstrated the essential roles of these genes in GBM tumorigenesis. Therefore, we focus on this module hereafter. In particular, there are 90 genes in this module. Among them, we choose the top 40 genes with the largest Median Absolute Deviations (MADs), since for the 50 genes with low MADs all regression coefficients are estimated to be nearly zero, which does not provide meaningful interpretation. We also select a subset of micro-RNAs which are predicted to target at least one of the selected genes and have large MADs. As a result, 40 genes and 50 micro-RNAs are used in this analysis.
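The MAD-based screening step can be expressed in a few lines, as in the illustrative sketch below (a generic filter under our own naming, not the authors' exact preprocessing code); `expr` is a samples-by-genes expression matrix.

```python
import numpy as np

def top_genes_by_mad(expr, k=40):
    """Keep the k columns (genes) with the largest median absolute deviation."""
    mad = np.median(np.abs(expr - np.median(expr, axis=0)), axis=0)
    keep = np.argsort(mad)[::-1][:k]          # indices of the k largest MADs
    return expr[:, keep], keep
```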

We consider four approaches to estimate the regression coefficient matrices and the residual inverse covariance matrices. In the first approach, we assume that the Gaussian distributions in all subtypes are the same. Therefore, all subtypes are combined into one data set and we apply the doubly penalized maximum likelihood (DML) method by Lee and Liu [5] to the combined data. In particular, the estimator can be obtained by solving (2). In the second approach, we apply LASSO and GLASSO. The detailed description of this approach is presented in M3 in Section 5. In the third approach, we apply our proposed plug-in methods, PHL and PHGL. The last approach uses the DPS method in which all matrices are jointly estimated. The third and fourth approaches can help us to discover the common and unique structures to each group.

For performance assessment, we randomly divide the data set of each subgroup into a training set of size 70 and a test set of the remainder. The tuning parameters are selected using 5-fold cross-validation as discussed in Section 2.4. We perform the random splitting 100 times. By using the test set, we assess prediction performance of several methods including our proposed methods.

Table 3 shows the average PE over 100 replications. Note that the DPS, PHL, and LASSO methods outperform the DML method. This implies that a single Gaussian assumption for all subtypes might not be reasonable. The LASSO gives comparable, though slightly better, prediction accuracy than our PHL and DPS methods. One potential reason is that the LASSO allows different tuning parameter values for each response. This more flexible tuning may help the LASSO attain slightly better PE.

Table 3.

Averages of PE based on 100 replications (The numbers in parentheses are standard errors)

DML LASSO PHL DPS
PE 1.373(0.004) 1.025(0.003) 1.050(0.004) 1.065(0.004)

Figure 5 shows the averaged estimated regression coefficients over 100 replications of several micro-RNAs for some selected genes. The DPS estimates are used to produce the heatmap. The results show some interesting relationships between genes and micro-RNAs that are specific to certain GBM subtypes. For instance, we observed a negative correlation between miR222 and its predicted target GLI2 in the Mesenchymal subtype. GLI2 is an essential transcription factor mediating cytokine expression in cancer cells [24]. It has been shown that knockdown of GLI2 mRNA significantly decreased the migratory ability of human glioblastoma cells [25]. Our results therefore suggest that the accelerated inflammatory response observed in the GBM Mesenchymal subtype might be partially through miR222-dependent GLI2 regulation [7].

Figure 5.

Heatmap of averaged estimated regression coefficients of several micro-RNAs for some selected genes. The DPS estimates are used to generate the heatmap.

Another example is the anti-correlation between miR130b and its predicted target ARAP2 (CENTD1) in the GBM Neural subtype. This subtype is typically associated with gene ontology (GO) categories such as neuron projection, axon, and synaptic transmission. Yoon et al. [26] have reported that ARAP2 associates with focal adhesions and functions downstream of RhoA to regulate focal adhesion dynamics in glioblastoma cells. Consistent with this report, our findings suggest that miR130b regulates ARAP2 specifically in the Neural subtype.

Additionally, we have observed subtype-specific correlations between micro-RNAs and non-target genes, indicating an indirect regulation between the two. For instance, our results identify distinct EGFR–miR21 correlations in different subtypes. Several studies have shown that EGFR regulates miR21 in several cancers, including human glioblastoma and lung cancer [27, 28]. Here our observation further indicates that this regulation is subtype-specific in GBM. In the Neural subtype, there is a positive correlation between EGFR and miR21, while negative correlations are observed in the Mesenchymal and Proneural subtypes.

Figure 6 shows the estimated conditional inverse covariance structure of genes given micro-RNAs. This structure is obtained from the model fitted using our proposed DPS method. Black edges represent the common structure shared among all subgroups, while grey edges represent structures unique to some subgroups. Verhaak et al. [7] claimed that FGFR3, PDGFA, and EGFR are all Classical genes in the sense that they tend to be highly expressed only in the Classical subtype. Thus, one might expect some connectivity among them. However, in Figure 6, none of them are connected in any subtype. This implies that, in all subtypes, these genes can be conditionally independent given the other genes once we remove the effects of the given 50 micro-RNAs, even though they are marginally correlated. Therefore, joint modeling of all subtypes using our DPS method can help us interpret the similarities and differences of the conditional gene relationships given micro-RNAs among different cancer subtypes.

Figure 6.

A graphical model of gene expressions based on the estimated inverse covariance matrix. Black lines are common edges across all subgroups. Grey lines are unique edges to some subgroups. The DPS estimates are used to generate the network.

7 Discussion

In this paper, we propose three methods for modeling several groups jointly to estimate both the regression coefficient matrices and the conditional inverse covariance matrices. All methods are derived in a penalized likelihood framework with hierarchical group penalties. Our theoretical investigation shows that the proposed estimators are consistent and can identify true zero parameters with probability tending to 1 as the sample size goes to infinity. Simulated examples demonstrate that our proposed methods can improve estimation of both the regression coefficient matrices and the conditional inverse covariance matrices.

In very high dimensional problems, our joint method (DPS) may have numerical difficulty as discussed in Section 2.3. In that case, the proposed plug-in methods are recommended and can often perform better than the DPS method. In certain applications such as our GBM cancer example, a preprocessing step can be first performed before applying the DPS method to reduce dimensions. There is a well-developed literature on preprocessing. For example, see Bair and Tibshirani [21]. With moderate dimensions of predictors and response variables, the joint method can be applied and its performance can be very competitive.

Our current theoretical study covers the case where n goes to infinity. For high dimensional cases, it will also be interesting to investigate the asymptotic behavior of our methods when the dimension of predictors, p, and the dimension of response variables, m, both go to infinity.

Our methods are based on the multivariate Gaussian assumption. Recently, there have been research developments on extending Gaussian graphical models to non-Gaussian settings, such as Liu et al. [29] and Cai et al. [30]. Another research direction is to extend our methods to non-Gaussian situations. Further exploration is needed.

Figure 3.

Boxplots of entropy losses of all methods based on 100 replications. Left: all groups are the same. Right: there are both common and unique structures across groups.

Acknowledgments

The authors are partially supported by NSF Grant DMS-0747575 and NIH Grant NIH/NCI R01 CA-149569. The authors are indebted to the editor, the associate editor, and two referees, whose helpful comments and suggestions led to a much improved presentation.

A Proof

A.1 Proof of Theorem 1

A.1.1 Consistency

Let $\beta = (\mathrm{Vec}(B^{(1)})^T, \ldots, \mathrm{Vec}(B^{(G)})^T)^T$ and define $Q(\beta)$ as

$$Q(\beta) = \sum_{g=1}^G \mathrm{tr}\big\{(Y^{(g)} - X^{(g)}B^{(g)})\hat{C}^{(g)}(Y^{(g)} - X^{(g)}B^{(g)})^T\big\} + \lambda_1\sum_{j,k} p(\beta_{jk}). \qquad (12)$$

To show the results, we use a technique similar to that in the proof of Theorem 1 of Fan and Li [17]. It suffices to show that for any given δ > 0, there exists a large constant D such that

$$P\left\{\sup_{\|U\| = D} Q\Big(\beta^* + \tfrac{1}{\sqrt{n}}U\Big) > Q(\beta^*)\right\} > 1 - \delta, \qquad (13)$$

where $U = (U^{(1)T}, \ldots, U^{(G)T})^T$ is an $mpG$-dimensional vector.

Let $y^{(g)} = \mathrm{Vec}(Y^{(g)})$, $X^{m,(g)} = I_m \otimes X^{(g)}$, and $\beta^{(g)} = \mathrm{Vec}(B^{(g)})$, $g = 1, \ldots, G$. Then we can rewrite $Q(\beta)$ in (12) as

$$Q(\beta) = \sum_{g=1}^G (y^{(g)} - X^{m,(g)}\beta^{(g)})^T(\hat{C}^{(g)} \otimes I_n)(y^{(g)} - X^{m,(g)}\beta^{(g)}) + \lambda_1\sum_{j,k} p(\beta_{jk}). \qquad (14)$$

Define $V_n(U) = Q(\beta^* + \tfrac{1}{\sqrt{n}}U) - Q(\beta^*)$. Using (14), we can show that

$$V_n(U) = \sum_{g=1}^G U^{(g)T}\Big(\hat{C}^{(g)} \otimes \tfrac{1}{n}X^{(g)T}X^{(g)}\Big)U^{(g)} - \sum_{g=1}^G \frac{2}{\sqrt{n}}\varepsilon^{(g)T}(\hat{C}^{(g)} \otimes I_n)X^{m,(g)}U^{(g)} + \lambda_1\sum_{j,k}\Big\{\Big(\sum_{g=1}^G \big|\beta_{jk}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{jk}^{(g)}\big|\Big)^{1/2} - \Big(\sum_{g=1}^G |\beta_{jk}^{*,(g)}|\Big)^{1/2}\Big\}, \qquad (15)$$

where $\varepsilon^{(g)} = \mathrm{Vec}(e^{(g)})$, $g = 1, \ldots, G$. Define $\mathcal{I} = \{(j,k) : \beta_{jk}^{*,(g)} \neq 0 \text{ for some } g = 1, \ldots, G\}$. Since $\big(\sum_{g=1}^G |\beta_{jk}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{jk}^{(g)}|\big)^{1/2} - \big(\sum_{g=1}^G |\beta_{jk}^{*,(g)}|\big)^{1/2} = \big(\sum_{g=1}^G |\tfrac{1}{\sqrt{n}}u_{jk}^{(g)}|\big)^{1/2} \geq 0$ for $(j, k) \notin \mathcal{I}$, we have that

$$V_n(U) \geq \sum_{g=1}^G U^{(g)T}\Big(\hat{C}^{(g)} \otimes \tfrac{1}{n}X^{(g)T}X^{(g)}\Big)U^{(g)} - \sum_{g=1}^G \frac{2}{\sqrt{n}}\varepsilon^{(g)T}(\hat{C}^{(g)} \otimes I_n)X^{m,(g)}U^{(g)} + \lambda_1\sum_{(j,k)\in\mathcal{I}}\Big\{\Big(\sum_{g=1}^G \big|\beta_{jk}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{jk}^{(g)}\big|\Big)^{1/2} - \Big(\sum_{g=1}^G |\beta_{jk}^{*,(g)}|\Big)^{1/2}\Big\}. \qquad (16)$$

For the first term on the right-hand side of (16), note that

$$\sum_{g=1}^G U^{(g)T}\Big(\hat{C}^{(g)} \otimes \tfrac{1}{n}X^{(g)T}X^{(g)}\Big)U^{(g)} = \sum_{g=1}^G U^{(g)T}\big(C^{*,(g)} \otimes A^{(g)}\big)U^{(g)} + o_p(1)$$

as $\frac{1}{n}X^{(g)T}X^{(g)} \to A^{(g)}$ and $\hat{C}^{(g)} \to_p C^{*,(g)}$, $g = 1, \ldots, G$.

For the second term on the right-hand side of (16), note that

$$\left|\sum_{g=1}^G \frac{2}{\sqrt{n}}\varepsilon^{(g)T}(\hat{C}^{(g)} \otimes I_n)X^{m,(g)}U^{(g)}\right| \leq 2\sum_{g=1}^G \left\|\tfrac{1}{\sqrt{n}}\varepsilon^{(g)T}(\hat{C}^{(g)} \otimes I_n)X^{m,(g)}\right\|\,\|U^{(g)}\| \leq 2\sum_{g=1}^G \left\|\tfrac{1}{\sqrt{n}}\varepsilon^{(g)T}(\hat{C}^{(g)} \otimes I_n)X^{m,(g)}\right\|\,\|U\| = O_p(1)\,\|U\|$$

as $\frac{1}{\sqrt{n}}\varepsilon^{(g)T}(\hat{C}^{(g)} \otimes I_n)X^{m,(g)} \to_d Z$, where Z has a multivariate normal distribution. For the third term on the right-hand side of (16), we can show that

$$\lambda_1\sum_{(j,k)\in\mathcal{I}}\Big\{\Big(\sum_{g=1}^G \big|\beta_{jk}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{jk}^{(g)}\big|\Big)^{1/2} - \Big(\sum_{g=1}^G |\beta_{jk}^{*,(g)}|\Big)^{1/2}\Big\} = \lambda_1\sum_{(j,k)\in\mathcal{I}}\sum_{g=1}^G \frac{1}{\gamma_{jk}}\Big\{\big|\beta_{jk}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{jk}^{(g)}\big| - |\beta_{jk}^{*,(g)}|\Big\} = \frac{\lambda_1}{\sqrt{n}}\sum_{(j,k)\in\mathcal{I}}\sum_{g=1}^G \frac{1}{\gamma_{jk}}\Big\{u_{jk}^{(g)}\,\mathrm{sign}(\beta_{jk}^{*,(g)}) + o(1)\Big\} = o_p(1),$$

where $\gamma_{jk} = \big(\sum_{g=1}^G |\beta_{jk}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{jk}^{(g)}|\big)^{1/2} + \big(\sum_{g=1}^G |\beta_{jk}^{*,(g)}|\big)^{1/2}$.

By combining above statements, we have

$$V_n(U) \geq \sum_{g=1}^G U^{(g)T}\big(C^{*,(g)} \otimes A^{(g)}\big)U^{(g)} + O_p(1)\,\|U\| + o_p(1).$$

By choosing a sufficiently large D, $V_n(U) > 0$ uniformly on $\{U : \|U\| = D\}$ with probability greater than 1 − δ, since $C^{*,(g)}$ and $A^{(g)}$ are positive definite. Therefore, (13) holds. This completes the proof of the consistency.

A.1.2 Sparsity

It is sufficient to show that, with probability tending to 1 as n → ∞, for any (j, k) such that $\beta_{jk}^{*,(g)} = 0$, the partial derivative of Q in (12) with respect to $\beta_{jk}^{(g)}$ at $\hat{\beta}_{jk}^{(g)}$ has the same sign as $\hat{\beta}_{jk}^{(g)}$. Let $\beta^{(g)} = \mathrm{Vec}(B^{(g)})$ and note that

$$\frac{\partial}{\partial \beta^{(g)}}\,(y^{(g)} - X^{m,(g)}\beta^{(g)})^T(\hat{C}^{(g)} \otimes I_n)(y^{(g)} - X^{m,(g)}\beta^{(g)})\Big|_{\beta^{(g)} = \hat{\beta}^{(g)}} = (\hat{C}^{(g)} \otimes X^{(g)T}X^{(g)})(\hat{\beta}^{(g)} - \beta^{*,(g)}) - (\hat{C}^{(g)} \otimes X^{(g)T})\varepsilon^{(g)} = \sqrt{n}\Big\{\big(\hat{C}^{(g)} \otimes \tfrac{1}{n}X^{(g)T}X^{(g)}\big)\sqrt{n}(\hat{\beta}^{(g)} - \beta^{*,(g)}) - \tfrac{1}{\sqrt{n}}(\hat{C}^{(g)} \otimes X^{(g)T})\varepsilon^{(g)}\Big\} = \sqrt{n}\,O_p(1)$$

as $(\hat{C}^{(g)} \otimes \tfrac{1}{n}X^{(g)T}X^{(g)}) \to_p C^{*,(g)} \otimes A^{(g)}$, $\sqrt{n}(\hat{\beta}^{(g)} - \beta^{*,(g)}) = O_p(1)$, and $\tfrac{1}{\sqrt{n}}(\hat{C}^{(g)} \otimes X^{(g)T})\varepsilon^{(g)} \to_d Z$, where Z has a multivariate normal distribution. Therefore, the partial derivative of Q can be written as

$$\frac{\partial Q}{\partial \beta_{jk}^{(g)}}\bigg|_{\beta_{jk}^{(g)} = \hat{\beta}_{jk}^{(g)}} = \sqrt{n}\,O_p(1) + \frac{\lambda_1\,\mathrm{sign}(\hat{\beta}_{jk}^{(g)})}{2\big(\sum_{g'=1}^G |\hat{\beta}_{jk}^{(g')}|\big)^{1/2}} = \sqrt{n}\left(O_p(1) + \frac{\lambda_1 n^{-1/4}\,\mathrm{sign}(\hat{\beta}_{jk}^{(g)})}{2\big(\sum_{g'=1}^G |\sqrt{n}\,\hat{\beta}_{jk}^{(g')}|\big)^{1/2}}\right).$$

Since $\lambda_1 n^{-1/4} \to \infty$ as $n \to \infty$, the sign of the derivative is completely determined by that of $\hat{\beta}_{jk}^{(g)}$. This completes the proof of the sparsity.

A.2 Proof of Theorem 2

A.2.1 Consistency

Let $c = (\mathrm{Vec}(C^{(1)})^T, \ldots, \mathrm{Vec}(C^{(G)})^T)^T$ and define $Q(c)$ as

$$Q(c) = \sum_{g=1}^G\big\{-n\log\det(C^{(g)}) + n\,\mathrm{tr}(S^{(g)}C^{(g)})\big\} + \lambda_2\sum_{s \neq t}\Big(\sum_{g=1}^G |c_{st}^{(g)}|\Big)^{1/2}. \qquad (17)$$

To show the results, we use a technique similar to that in the proof of Theorem 1. It suffices to show that for any given δ > 0, there exists a large constant D such that

$$P\left\{\sup_{\|U\| = D} Q\Big(c^* + \tfrac{1}{\sqrt{n}}U\Big) > Q(c^*)\right\} > 1 - \delta, \qquad (18)$$

where $U = (\mathrm{Vec}(U^{(1)})^T, \ldots, \mathrm{Vec}(U^{(G)})^T)^T$ is an $m^2 G$-dimensional vector.

Using (17), define Vn(U) as

$$V_n(U) = Q\Big(c^* + \tfrac{1}{\sqrt{n}}U\Big) - Q(c^*) = \sum_{g=1}^G\Big\{-n\log\det\Big(\big(C^{*,(g)} + \tfrac{U^{(g)}}{\sqrt{n}}\big)(C^{*,(g)})^{-1}\Big) + n\,\mathrm{tr}\Big(U^{(g)}\tfrac{S^{(g)}}{\sqrt{n}}\Big)\Big\} + \lambda_2\sum_{s \neq t}\Big\{\Big(\sum_{g=1}^G \big|c_{st}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{st}^{(g)}\big|\Big)^{1/2} - \Big(\sum_{g=1}^G |c_{st}^{*,(g)}|\Big)^{1/2}\Big\}.$$

Using an argument similar to that in the proof of Lemma 2 in Lee and Liu [5], it can be shown that

$$V_n(U) = \sum_{g=1}^G \mathrm{tr}\big(U^{(g)}\Sigma^{(g)}U^{(g)}\Sigma^{(g)}\big) + \sum_{g=1}^G \mathrm{tr}\big[U^{(g)}\sqrt{n}\,(S^{(g)} - \Sigma^{(g)})\big] + \lambda_2\sum_{s \neq t}\Big\{\Big(\sum_{g=1}^G \big|c_{st}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{st}^{(g)}\big|\Big)^{1/2} - \Big(\sum_{g=1}^G |c_{st}^{*,(g)}|\Big)^{1/2}\Big\} + o(1). \qquad (19)$$

For the second term on the right-hand side of (19), by using an argument similar to that in the proof of Theorem 2 in Lee and Liu [5], it can be shown that

$$\sum_{g=1}^G \mathrm{tr}\big[U^{(g)}\sqrt{n}\,(S^{(g)} - \Sigma^{(g)})\big] = \sum_{g=1}^G \mathrm{tr}\big[U^{(g)}\sqrt{n}\,(S^{*,(g)} - \Sigma^{(g)})\big] + o_p(1)$$

where $S^{*,(g)} = \frac{1}{n}(Y^{(g)} - X^{(g)}B^{*,(g)})^T(Y^{(g)} - X^{(g)}B^{*,(g)})$. Note that $\sqrt{n}\,(S^{*,(g)} - \Sigma^{(g)})$ converges in distribution to a multivariate normal distribution by the central limit theorem.

For the third term on the right-hand side of (19), by using an argument similar to that in the proof of Theorem 1, it can be shown that

$$\lambda_2\sum_{s \neq t}\Big\{\Big(\sum_{g=1}^G \big|c_{st}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{st}^{(g)}\big|\Big)^{1/2} - \Big(\sum_{g=1}^G |c_{st}^{*,(g)}|\Big)^{1/2}\Big\} = o_p(1).$$

By combining the above statements, we can conclude that the first term on the right-hand side of (19) dominates the other terms. Therefore, by choosing a sufficiently large D, $V_n(U) > 0$ uniformly on $\{U : \|U\| = D\}$ with probability greater than 1 − δ. This completes the proof of the consistency.

A.2.2 Sparsity

Similar to the proof of the sparsity in Theorem 1, it is sufficient to show that, with probability tending to 1 as n → ∞, for any (s, t) such that $c_{st}^{*,(g)} = 0$, the partial derivative of Q in (17) with respect to $c_{st}^{(g)}$ at $\hat{c}_{st}^{(g)}$ has the same sign as $\hat{c}_{st}^{(g)}$. Note that

$$\frac{\partial Q}{\partial c_{st}^{(g)}}\bigg|_{c_{st}^{(g)} = \hat{c}_{st}^{(g)}} = n\big(s_{st}^{(g)} - \hat{\sigma}_{st}^{(g)}\big) + \frac{\lambda_2\,\mathrm{sign}(\hat{c}_{st}^{(g)})}{2\big(\sum_{g'=1}^G |\hat{c}_{st}^{(g')}|\big)^{1/2}},$$

where $S^{(g)} = (s_{st}^{(g)})$ and $(\hat{C}^{(g)})^{-1} = (\hat{\sigma}_{st}^{(g)})$. By using the argument in the proof of Theorem 2 in Guo et al. [16], one can show that $s_{st}^{(g)} - \hat{\sigma}_{st}^{(g)} = O_p(1/\sqrt{n})$. Therefore, we have

$$\frac{\partial Q}{\partial c_{st}^{(g)}}\bigg|_{c_{st}^{(g)} = \hat{c}_{st}^{(g)}} = \sqrt{n}\left(O_p(1) + \frac{\lambda_2 n^{-1/4}\,\mathrm{sign}(\hat{c}_{st}^{(g)})}{2\big(\sum_{g'=1}^G |\sqrt{n}\,\hat{c}_{st}^{(g')}|\big)^{1/2}}\right).$$

Since $\lambda_2 n^{-1/4} \to \infty$ as $n \to \infty$, the sign of the derivative is completely determined by that of $\hat{c}_{st}^{(g)}$. This completes the proof of the sparsity.

A.3 Proof of Theorem 3

A.3.1 Consistency

Define Q(β, c) as

$$Q(\beta, c) = \sum_{g=1}^G\{-l_g(\beta, c)\} + \lambda_1\sum_{j,k}\Big(\sum_{g=1}^G |\beta_{jk}^{(g)}|\Big)^{1/2} + \lambda_2\sum_{s \neq t}\Big(\sum_{g=1}^G |c_{st}^{(g)}|\Big)^{1/2}, \qquad (20)$$

where $l_g(\beta, c) = n\log\det(C^{(g)}) - \mathrm{tr}\{(Y^{(g)} - X^{(g)}B^{(g)})C^{(g)}(Y^{(g)} - X^{(g)}B^{(g)})^T\}$.

To show the results, we use a technique similar to that in the proof of Theorem 1. It suffices to show that for any given δ > 0, there exists a large constant D such that

$$P\left\{\sup_{\|U\| = D} Q\Big(\beta^* + \tfrac{1}{\sqrt{n}}U_1,\; c^* + \tfrac{1}{\sqrt{n}}U_2\Big) > Q(\beta^*, c^*)\right\} > 1 - \delta, \qquad (21)$$

where $U = (U_1^T, U_2^T)^T$, $U_1 = (\mathrm{Vec}(U_1^{(1)})^T, \ldots, \mathrm{Vec}(U_1^{(G)})^T)^T$, and $U_2 = (\mathrm{Vec}(U_2^{(1)})^T, \ldots, \mathrm{Vec}(U_2^{(G)})^T)^T$.

Using (20), define $V_n(U) = Q(\beta^* + \tfrac{1}{\sqrt{n}}U_1, c^* + \tfrac{1}{\sqrt{n}}U_2) - Q(\beta^*, c^*)$. It can be shown that

$$\begin{aligned} V_n(U) = {} & \sum_{g=1}^G\Big\{-n\log\det\Big(\big(C^{*,(g)} + \tfrac{U_2^{(g)}}{\sqrt{n}}\big)(C^{*,(g)})^{-1}\Big) + n\,\mathrm{tr}\Big(U_2^{(g)}\tfrac{S^{*,(g)}}{\sqrt{n}}\Big)\Big\} \\ & + \sum_{g=1}^G \mathrm{tr}\Big[\big(C^{*,(g)} + \tfrac{U_2^{(g)}}{\sqrt{n}}\big)\Big(\tfrac{X^{(g)}U_1^{(g)}}{\sqrt{n}}\Big)^T\Big(\tfrac{X^{(g)}U_1^{(g)}}{\sqrt{n}}\Big)\Big] \\ & - 2\sum_{g=1}^G \mathrm{tr}\Big[\big(C^{*,(g)} + \tfrac{U_2^{(g)}}{\sqrt{n}}\big)\big(Y^{(g)} - X^{(g)}B^{*,(g)}\big)^T\Big(\tfrac{X^{(g)}U_1^{(g)}}{\sqrt{n}}\Big)\Big] \\ & + \lambda_1\sum_{j,k}\Big\{\Big(\sum_{g=1}^G \big|\beta_{jk}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{1,jk}^{(g)}\big|\Big)^{1/2} - \Big(\sum_{g=1}^G |\beta_{jk}^{*,(g)}|\Big)^{1/2}\Big\} \\ & + \lambda_2\sum_{s \neq t}\Big\{\Big(\sum_{g=1}^G \big|c_{st}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{2,st}^{(g)}\big|\Big)^{1/2} - \Big(\sum_{g=1}^G |c_{st}^{*,(g)}|\Big)^{1/2}\Big\}. \end{aligned} \qquad (22)$$

For the first term on the right-hand side of (22), it has been shown in Theorem 2 that

$$\sum_{g=1}^G\Big\{-n\log\det\Big(\big(C^{*,(g)} + \tfrac{U_2^{(g)}}{\sqrt{n}}\big)(C^{*,(g)})^{-1}\Big) + n\,\mathrm{tr}\Big(U_2^{(g)}\tfrac{S^{*,(g)}}{\sqrt{n}}\Big)\Big\} = \sum_{g=1}^G \mathrm{tr}\big(U_2^{(g)}\Sigma^{(g)}U_2^{(g)}\Sigma^{(g)}\big) + O_p(1).$$

For the second and third terms on the right-hand side of (22), by using an argument similar to that in the proof of Lemma 3 in Lee and Liu [5], it can be shown that

$$\sum_{g=1}^G \mathrm{tr}\Big[\big(C^{*,(g)} + \tfrac{U_2^{(g)}}{\sqrt{n}}\big)\Big(\tfrac{X^{(g)}U_1^{(g)}}{\sqrt{n}}\Big)^T\Big(\tfrac{X^{(g)}U_1^{(g)}}{\sqrt{n}}\Big)\Big] \geq \sum_{g=1}^G U_1^{(g)T}\big(C^{*,(g)} \otimes A^{(g)}\big)U_1^{(g)} + o_p(1)$$

and

$$\sum_{g=1}^G \mathrm{tr}\Big[\big(C^{*,(g)} + \tfrac{U_2^{(g)}}{\sqrt{n}}\big)\big(Y^{(g)} - X^{(g)}B^{*,(g)}\big)^T\Big(\tfrac{X^{(g)}U_1^{(g)}}{\sqrt{n}}\Big)\Big] = O_p(1).$$

For the fourth and fifth terms on the right-hand side of (22), it has been shown in Theorems 1 and 2 that

$$\lambda_1\sum_{j,k}\Big\{\Big(\sum_{g=1}^G \big|\beta_{jk}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{1,jk}^{(g)}\big|\Big)^{1/2} - \Big(\sum_{g=1}^G |\beta_{jk}^{*,(g)}|\Big)^{1/2}\Big\} = o_p(1), \qquad \lambda_2\sum_{s \neq t}\Big\{\Big(\sum_{g=1}^G \big|c_{st}^{*,(g)} + \tfrac{1}{\sqrt{n}}u_{2,st}^{(g)}\big|\Big)^{1/2} - \Big(\sum_{g=1}^G |c_{st}^{*,(g)}|\Big)^{1/2}\Big\} = o_p(1).$$

By combining the above statements, we can conclude that the right-hand side of (22) is dominated by $\sum_{g=1}^G \mathrm{tr}(U_2^{(g)}\Sigma^{(g)}U_2^{(g)}\Sigma^{(g)})$ and $\sum_{g=1}^G U_1^{(g)T}(C^{*,(g)} \otimes A^{(g)})U_1^{(g)}$. Therefore, by choosing a sufficiently large D, $V_n(U) > 0$ uniformly on $\{U : \|U\| = D\}$ with probability greater than 1 − δ. This completes the proof of the consistency.

A.3.2 Sparsity

Note that $(\hat{\beta}, \hat{c})$ is a $\sqrt{n}$-consistent local minimizer of $Q(\beta, c)$ defined in (20). Since $\hat{\beta} = \operatorname*{argmin}_{\beta} Q(\beta, \hat{c})$ and $\hat{c}$ is $\sqrt{n}$-consistent, the sparsity of $\hat{\beta}$ holds by Theorem 1. Similarly, since $\hat{c} = \operatorname*{argmin}_{c} Q(\hat{\beta}, c)$ and $\hat{\beta}$ is $\sqrt{n}$-consistent, the sparsity of $\hat{c}$ holds by Theorem 2. This completes the proof of the theorem.

References

  • 1.Breiman Leo, Friedman Jerome H. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society Series B. 1997;59:3–54. [Google Scholar]
  • 2.Turlach Berwin A, Venables William N, Wright Stephen J. Simultaneous variable selection. Technometrics. 2005;47:349–363. [Google Scholar]
  • 3.Yuan Ming, Ekici Ali, Lu Zhaosong, Monteiro Renato. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society. Series B. 2007;69:329–346. [Google Scholar]
  • 4.Rothman Adam J, Levina Elizaveta, Zhu Ji. Sparse multiple regression with covariance estimation. Journal of Computational and Graphical Statistics. 2010;19:947–962. doi: 10.1198/jcgs.2010.09188. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Lee Wonyul, Liu Yufeng. Simultaneous multiple response regression and inverse covariance matrix estimation via penalized gaussian maximum likelihood. Journal of Multivariate Analysis. doi: 10.1016/j.jmva.2012.03.013. to appear. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.TCGA. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455:1061–1068. doi: 10.1038/nature07385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Verhaak Roel GW, Hoadley Katherine A, Purdom Elizabeth, Wang Victoria, Qi Yuan, Wilkerson Matthew D, Ryan Miller C, Ding Li, Golub Todd, Mesirov Jill P, Alexe Gabriele, Lawrence Michael, OKelly Michael, Tamayo Pablo, Weir Barbara A, Gabriel Stacey, Winckler Wendy, Gupta Supriya, Jakkula Lakshmi, Feiler Heidi S, Graeme Hodgson J, David James C, Sarkaria Jann N, Brennan Cameron, Kahn Ari, Spellman Paul T, Wilson Richard K, Speed Terence P, Gray Joe W, Meyerson Matthew, Getz Gad, Perou Charles M, Neil Hayes D TCGA. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and nf1. Cancer Cell. 2010;17:98–110. doi: 10.1016/j.ccr.2009.12.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Zhou Nengfeng, Zhu Ji. Group variable selection via a hierarchical lasso and its oracle property. Statistics and Its Interface. 2010;3:557–574. [Google Scholar]
  • 9.Yuan Ming, Lin Yi. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. Series B. 2006;68:49–67. [Google Scholar]
  • 10.Zhang Hao Helen, Liu Yufeng, Wu Yichao, Zhu Ji. Variable selection for the multicategory svm via adaptive sup-norm regularization. Electronic Journal of Statistics. 2008;2:149–167. [Google Scholar]
  • 11.Zhao Peng, Rocha Guilherme, Yu Bin. Grouped and hierarchical model selection through composite absolute penalties. The Annals of Statistics. 2009;37:3468–3497. [Google Scholar]
  • 12.Yuan Ming, Lin Yi. Model selection and estimation in the gaussian graphical model. Biometrika. 2007;94:19–35. [Google Scholar]
  • 13.Banerjee Onureena, El Ghaoui Laurent, d’Aspremont Alexandre. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. Journal of Machine Learning Research. 2008;9:485–516. [Google Scholar]
  • 14.Friedman Jerome, Hastie Trevor, Tibshirani Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Rothman Adam J, Bickel Peter J, Levina Elizaveta, Zhu Ji. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics. 2008;2:494–515. [Google Scholar]
  • 16.Guo Jian, Levina Elizabeth, Michailidis George, Zhu Ji. Joint estimation of multiple graphical models. Biometrika. 2011;98:1–15. doi: 10.1093/biomet/asq060. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Fan Jianqing, Li Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360. [Google Scholar]
  • 18.Zou Hui. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429. [Google Scholar]
  • 19.Knight Keith, Fu Wenjiang. Asymptotics for lasso-type estimators. The Annals of Statistics. 2000;28:1356–1378. [Google Scholar]
  • 20.Zou Hui, Li Runze. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics. 2008;36:1509–1533. doi: 10.1214/009053607000000802. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Bair Eric, Tibshirani Robert. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology. 2004;2:511–522. doi: 10.1371/journal.pbio.0020108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Zhang Bin, Horvath Steve. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology. 2005;4:1–45. doi: 10.2202/1544-6115.1128. [DOI] [PubMed] [Google Scholar]
  • 23.Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabsi A-L. Hierarchical organization of modularity in metabolic networks. Science. 2002;297:1151–1155. doi: 10.1126/science.1073374. [DOI] [PubMed] [Google Scholar]
  • 24.Elsawa Sherine F, Almada Luciana L, Ziesmer Steven C, Novak Anne J, Witzig Thomas E, Ansell Stephen M, Fernandez-Zapico Martin E. Gli2 transcription factor mediates cytokine cross-talk in the tumor microenvironment. The Journal of Biological Chemistry. 2011;286:21524–21534. doi: 10.1074/jbc.M111.234146. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Uchida Hiroyuki, Arita Kazunori, Yunoue Shunji, Yonezawa Hajime, Shinsato Yoshinari, Kawano Hiroto, Hirano Hirofumi, Hanaya Ryosuke, Tokimura Hiroshi. Role of sonic hedgehog signaling in migration of cell lines established from cd133-positive malignant glioma cells. J Neurooncol. 2011;104:697–704. doi: 10.1007/s11060-011-0552-2. [DOI] [PubMed] [Google Scholar]
  • 26.Yoon Hye-Young, Miura Koichi, Jebb Cuthbert E, Davis Kathryn Kay, Ahvazi Bijan, Casanova James E, Randazzo Paul A. Arap2 effects on the actin cytoskeleton are dependent on arf6-specific gtpase-activating-protein activity and binding to rhoa-gtp. Journal of Cell Science. 2006;119:4650–4666. doi: 10.1242/jcs.03237. [DOI] [PubMed] [Google Scholar]
  • 27.Zhou Xuan, Ren Yu, Moore Lynette, Mei Mei, You Yongping, Xu Peng, Wang Baoli, Wang Guangxiu, Jia Zhifan, Pu Peiyu, Zhang Wei, Kang Chunsheng. Downregulation of mir-21 inhibits egfr pathway and suppresses the growth of human glioblastoma cells independent of pten status. Laboratory investigation. 2010;90:144–155. doi: 10.1038/labinvest.2009.126. [DOI] [PubMed] [Google Scholar]
  • 28.Seike Masahiro, Goto Akiteru, Okano Tetsuya, Bowman Elise D, Schetter Aaron J, Horikawa Izumi, Mathe Ewy A, Jen Jin, Yang Ping, Sugimura Haruhiko, Gemma Akihiko, Kudoh Shoji, Croce Carlo M, Harris Curtis C. Mir-21 is an egfr-regulated anti-apoptotic factor in lung cancer in never-smokers. PNAS. 2009;106:12085–12090. doi: 10.1073/pnas.0905234106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Liu Han, Lafferty John, Wasserman Larry. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research. 2009;10: 2295–2328. [PMC free article] [PubMed] [Google Scholar]
  • 30.Cai Tony, Liu Weidong, Luo Xi. A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607. [Google Scholar]
