Joint Estimation of Precision Matrices in Heterogeneous Populations

Takumi Saegusa; Ali Shojaie

doi:10.1214/16-EJS1137

. Author manuscript; available in PMC: 2017 May 2.

Published in final edited form as: Electron J Stat. 2016 May 31;10(1):1341–1392. doi: 10.1214/16-EJS1137

Joint Estimation of Precision Matrices in Heterogeneous Populations

Takumi Saegusa ¹, Ali Shojaie ²

PMCID: PMC5412991 NIHMSID: NIHMS844108 PMID: 28473876

Abstract

We introduce a general framework for estimation of inverse covariance, or precision, matrices from heterogeneous populations. The proposed framework uses a Laplacian shrinkage penalty to encourage similarity among estimates from disparate, but related, subpopulations, while allowing for differences among matrices. We propose an efficient alternating direction method of multipliers (ADMM) algorithm for parameter estimation, as well as its extension for faster computation in high dimensions by thresholding the empirical covariance matrix to identify the joint block diagonal structure in the estimated precision matrices. We establish both variable selection and norm consistency of the proposed estimator for distributions with exponential or polynomial tails. Further, to extend the applicability of the method to the settings with unknown populations structure, we propose a Laplacian penalty based on hierarchical clustering, and discuss conditions under which this data-driven choice results in consistent estimation of precision matrices in heterogenous populations. Extensive numerical studies and applications to gene expression data from subtypes of cancer with distinct clinical outcomes indicate the potential advantages of the proposed method over existing approaches.

Keywords and phrases: Graph Laplacian, graphical modeling, heterogeneous populations, hierarchical clustering, high-dimensional estimation, precision matrix, sparsity

1. Introduction

Estimation of large inverse covariance, or precision, matrices has received considerable attention in recent years. This interest is in part driven by the advent of high-dimensional data in many scientific areas, including high throughput omics measurements, functional magnetic resonance images (fMRI), and applications in finance and industry. Applications of various statistical methods in such settings require an estimate of the (inverse) covariance matrix. Examples include dimension reduction using principal component analysis (PCA), classification using linear or quadratic discriminant analysis (LDA/QDA), and discovering conditional independence relations in Gaussian graphical models (GGM).

In high-dimensional settings, where the data dimension p is often comparable or larger than the sample size n, regularized estimation procedures often result in more reliable estimates. Of particular interest is the use of sparsity inducing penalties, specifically the ℓ₁ or lasso penalty [30], which encourages sparsity in off-diagonal elements of the precision matrix [7, 8, 33, 34]. Theoretical properties of ℓ₁-penalized precision matrix estimation have been studied under both multivariate normality, as well as some relaxations of this assumption [4, 19, 25, 26].

Sparse estimation is particularly relevant in the setting of GGMs, where conditional independencies among variables correspond to zero off-diagonal elements of the precision matrix [14]. The majority of existing approaches for estimation of high-dimensional precision matrices, including those cited in the previous paragraph, assume that the observations are identically distributed, and correspond to a single population. However, data sets in many application areas include observations from several distinct subpopulations. For instance, gene expression measurements are often collected for both healthy subjects, as well as patients diagnosed with different subtypes of cancer. Despite increasing evidence for differences among genetic networks of cancer and healthy subjects [11, 27], the networks are also expected to share many common edges. Separate estimation of graphical models for each of the subpopulations would ignore the common structure of the precision matrices, and may thus be inefficient; this inefficiency can be particularly significant in high-dimensional low sample settings, where p ≫ n.

To address the need for estimation of graphical models in related subpopulations, few methods have been recently proposed for joint estimation of K precision matrices $Ω^{(k)} = {(ω_{i j}^{(k)})}_{i, j = 1}^{p} \in ℝ^{p \times p}$ , k = 1, …, K [6, 9]. These methods extend the penalized maximum likelihood approach by combining the Gaussian likelihoods for the K subpopulations

ℓ_{n} (Ω) = \frac{1}{n} \sum_{k = 1}^{K} n_{k} (log det (Ω^{(k)}) - tr ({\hat{Σ}}_{n}^{(k)} Ω^{(k)})) .

(1)

Here, n_k and ${\hat{Σ}}_{n}^{(k)}$ are the number of observations and the sample covariance matrix for the kth subpopulation, respectively, $n = \sum_{k = 1}^{K} n_{k}$ is the total sample size and tr(·) and det(·) denote matrix trace and determinant.

To encourage similarity among estimated precision matrices, Guo et al. [9] modeled the (i, j)-element of Ω^(k) as product of a common factor θ_ij and group-specific parameters $γ_{i j}^{(k)}$ , i.e. $ω_{i j}^{(k)} = δ_{i j} γ_{i j}^{(k)}$ . Identifiability of the estimates is ensured by assuming δ_ij ≥ 0. A zero common factor δ_ij = 0 induces sparsity across all subpopulations, whereas $γ_{i j}^{(k)} = 0$ results in condition-specific sparsity for $ω_{i j}^{(k)}$ . This reparametrization results in a non-convex optimization problem based on the Gaussian likelihood with ℓ₁-penalties ∑_i≠j δ_ij and $\sum_{i \neq j} \sum_{k = 1}^{K} | γ_{i j}^{(k)} |$ . Danaher et al. [6] proposed two alternative estimators by adding an additional convex penalty to the graphical lasso objective function: either a fused lasso penalty $\sum_{i \neq j} \sum_{k \neq k'} | ω_{i j}^{(k)} - ω_{i j}^{k'} |$ (FGL), or a group lasso penalty $\sum_{i \neq j} \sqrt{\sum_{k = 1}^{K} {(ω_{i j}^{(k)})}^{2}}$ (GGL). The fused lasso penalty has also been used by Kolar et al. [13], for joint estimation of multiple graphical models in multiple time points. The fused lasso penalty strongly encourages the values of $ω_{i j}^{(k)}$ to be similar across all subpopulations, both in values as well as sparsity patterns. On the other hand, the group lasso penalty results in similar estimates by shrinking all $ω_{i j}^{(k)}$ across subpopulations to zero if $\sum_{k = 1}^{K} {(ω_{i j}^{(k)})}^{2}$ is small.

Despite their differences, methods of Guo et al. [9] and Danaher et al. [6] inherently assume that precision matrices in K subpopulations are equally similar to each other, in that they encourage $ω_{i j}^{(k)}$ and $ω_{i j}^{(k')}$ and $ω_{i j}^{(k)}$ and $ω_{i j}^{(k ″)}$ to be equally similar. However, when K > 2, some subpopulations are expected to be more similar to each other than others. For instance, it is expected that genetic networks of two subtypes of cancer be more similar to each other than to the network of normal cells. Similarly, differences among genetic networks of various strains of a virus or bacterium are expected to correspond to the evolutionary lineages of their phylogenetic trees. Unfortunately, existing methods for joint estimation of multiple graphical models ignore this heterogeneity in multiple subpopulations. Furthermore, existing methods assume subpopulation memberships are known, which limits their applicability in settings with complex but unknown population structures; an important example is estimation of genetic networks of cancer cells with unknown subtypes.

In this paper, we propose a general framework for joint estimation of multiple precision matrices by capturing the heterogeneity among subpopulations. In this framework, similarities among disparate subpopulations are presented using a subpopulation networkG(V, E, W), a weighted graph whose node set V is the set of subpopulations. The edges in E and the weights W_kk′ for (k, k′) ∈ E represent the degree of similarity between any two subpopulations k, k′. In the special case where W_kk′ = 1 for all k, k′, the subpopulation similarities are only captured by the structure of the graph G. An example of such a subpopulation network is the line graph corresponding to observations over multiple time points, which is used in estimation of time-varying graphical models [13]. As we will show in Section 2.3, other existing methods for joint estimation of multiple graphical models, e.g. proposals of Danaher et al. [6], can also be seen as special cases of this general framework.

Our proposed estimator is the solution to a convex optimization problem based on the Gaussian likelihood with both ℓ₁ and graph Laplacian [15] penalties. The graph Laplacian has been used in other applications for incorporating a priori knowledge in classification [24], for principal component analysis on network data [28], and for penalized linear regression with correlated covariates [10, 15, 17, 18, 32, 37]. The Laplacian penalty encourages similarity among estimated precision matrices according to the subpopulation network G. The ℓ₁-penalty, on the other hand, encourages sparsity in the estimated precision matrices. Together, these two penalties capture both unique patterns specific to each subpopulation, as well as common patterns shared among different subpopulations.

We first discuss the setting where G(V, E, W) is known from external information, e.g. known phylogenetic trees (Section 2), and later discuss the estimation of the subpopulation memberships and similarities using hierarchical clustering (Section 4). We propose an alternating methods of multipliers (ADMM) algorithm [3] for parameter estimation, as well as its extension for efficient computation in high dimensions by decomposing the problem into block-diagonal matrices. Although we use the Gaussian likelihood, our theoretical results also hold for non-Gaussian distributions. We establish model selection and norm consistency of the proposed estimator under different model assumptions (Section 3), with improved rates of convergence over existing methods based on penalized likelihood. We also establish the consistency of the proposed algorithm for the estimation of multiple precision matrices, in settings where the subpopulation network G or subpopulation memberships are unknown. To achieve this, we establish the consistency of hierarchical clustering in high dimensions, by generalizing recent results of Borysov et al. [1] to the setting of arbitrary covariance matrices, which is of independent interest.

The rest of the paper is organized as follows. In Section 2 we describe the formal setup of the problem and present our estimator. Theoretical properties of the proposed estimator are studied in Section 3, and Section 4 discusses the extension of the method to the setting where the subpopulation network is unknown. The ADMM algorithm for parameter estimation and its extension for efficient computation in high dimensions are presented in Section 5. Results of the numerical studies, using both simulated and real data examples, are presented in Section 6. Section 7 concludes the paper with a discussion. Technical proofs are collected in the Appendix.

2. Model and Estimator

2.1. Problem Setup

Consider K subpopulations with distributions ℘^(k), k = 1, …, K. Let X^(k) = (X^(k),1, …, X^(k),p)^T ∈ ℝ^p be a random vector from the kth subpopulation with mean μ_k and the covariance matrix $Σ_{0}^{(k)} = {(σ_{i j}^{(k)})}_{i, j = 1}^{p}$ . Suppose that an observation comes from the kth subpopulation with probability π_k > 0.

Our goal is to estimate the precision matrices $Ω_{0}^{(k)} \equiv {(Σ_{0}^{(k)})}^{- 1} \equiv {(ω_{i j}^{(k)})}_{i, j = 1}^{p}$ , k = 1, …, K. To this end, we use the Gaussian log-likelihood based on the correlation matrix (see Rothman et al. [26]) as a working model for estimation of true $Ω_{0}^{(k)}$ , k = 1, …, K. Let $X_{i}^{(k)}$ , i = 1, …, n_k, be independent and identically distributed (i.i.d.) copies from ℘^(k), k = 1, …, K. We denote the correlation matrices and their inverse by $Θ^{(k)} = {(θ_{i j}^{(k)})}_{i, j = 1}^{p}$ , and $Ψ^{(k)} = {(ψ_{i j}^{(k)})}_{i, j = 1}^{p}$ , k = 1, …, K, respectively. The Gaussian log-likelihood based on the correlation matrix can then be written as

{\tilde{ℓ}}_{n} (Θ) = \frac{1}{n} \sum_{k = 1}^{K} n_{k} (log det (Θ^{(k)}) - tr (Ψ_{n}^{(k)} Θ^{(k)})),

(2)

where $Ψ_{n}^{(k)}$ , k = 1, …, K is the sample correlation matrix for subpopulation k.

Examining the derivative of (2), which consists of $Ψ_{0}^{(k)} - Ψ_{n}^{(k)}$ , k = 1, …, K, justifies its use as a working model for non-Gaussian data: the stationary points of (2) is $Ψ_{n}^{(k)}$ , which gives a consistent estimate of $Ψ_{0}^{(k)}$ . Thus we do not, in general, need to assume multivariate normality. However, in certain applications, for instance LDA/QDA and GGM, the resulting estimate is useful only if the data follows a multivariate normal distribution.

2.2. The Laplacian Shrinkage Estimator

Let Θ = (Θ⁽¹⁾, …, Θ^(K)) and write $Θ_{i j} = {(θ_{i j}^{(1)}, \dots, θ_{i j}^{(K)})}^{T} \in ℝ^{K}$ , i, j = 1, …, p for a vector of (i, j)-elements across subpopulations. Our proposed estimator, Laplacian Shrinkage for Inverse Covariance matrices from Heterogeneous populations (LASICH), first estimates the inverse of the correlation matrices for each of the K subpopulations, and then transforms them into the estimator of inverse covariance matrices, as in Rothman et al. [26]. In particular, we first obtain the estimate Θ̂ of the true inverse correlation matrix by solving the following optimization problem

{\hat{Θ}}_{ρ_{n}} \equiv \underset{Θ = Θ^{T}, Θ ≻ 0}{arg min} - {\tilde{ℓ}}_{n} (Θ) + ρ_{n} {‖ Θ ‖}_{1} + ρ_{n} ρ_{2} {‖ Θ ‖}_{L} \equiv \underset{Θ = Θ^{T}, Θ ≻ 0}{arg min} - {\tilde{ℓ}}_{n} (Θ) + ρ_{n} \sum_{k = 1}^{K} \sum_{i \neq j} | Θ_{i j}^{(k)} | + ρ_{n} ρ_{2} \sum_{i \neq j} {‖ Θ_{i j} ‖}_{L},

(3)

where Θ = Θ^T enforces the symmetry of individual inverse correlation matrices, i.e. Θ^(k) = (Θ^(k))^T, and Θ ≻ 0 requires that Θ^(k) is positive definite for k = 1, …, K. The ℓ₁-penalty ${‖ Θ ‖}_{1} = \sum_{k = 1}^{K} {‖ Θ^{(k)} ‖}_{1}$ in (3) encourages sparsity in estimated inverse correlation matrices. The graph Laplacian penalty, on the other hand, exploits the information in the subpopulation network G to encourage similarity among values of $θ_{i j}^{(k)}$ and $θ_{i j}^{(k')}$ . The tuning parameters ρ_n and ρ_nρ₂ control the size of each penalty term.

Figure 1 illustrates the motivation for the graph Laplacian penalty ‖Θ_ij‖_L in (3). The gray-scale images in the figure show the hypothetical sparsity patterns of precision matrices Θ⁽¹⁾, Θ⁽²⁾, Θ⁽³⁾ for three related subpopulations. Here, Θ⁽¹⁾ consists of two blocks with one “hub” node in each block; in Θ⁽²⁾ and Θ⁽³⁾ one of the blocks is changed into a “banded” structure. It can be seen that one of the two blocks in both Θ⁽²⁾ and Θ⁽³⁾ have a similar sparsity pattern as Θ⁽¹⁾. However, Θ⁽²⁾ and Θ⁽³⁾ are not similar. The subpopulation network G in this figure captures the relationship among precision matrices of the three subpopulations. Such complex relationships cannot be captured using the existing approaches, e.g. Danaher et al. [6], Guo et al. [9], which encourage all precision matrices to be equally similar to each other. More generally, G can be a weighted graph, G(V, E, W), whose nodes represent the subpopulations 1, …, K. The edge weights W : E → ℝ₊ represent the similarity among pairs of subpopulations, with larger values of W_kk′ ≡ W (k, k′) > 0 corresponding to more similarity between precision matrices of subpopulations k and k′.

Fig 1 — Illustration of similarities in the sparsity patterns of precision matrices Ω⁽¹⁾, Ω⁽²⁾ and Ω⁽³⁾. Nonzero and zero off-diagonal entries are colored in black and white, respectively, while diagonal entires are colored in gray. The associated subpopulation network G reflects the similarities between precision matrices of subpopulations 1 and 2 and 1 and 3. The simulation experiments in Section 6.1 use a similar subpopulation network in a high-dimensional setting.

In this section, we assume that the weighted graph G is externally available, and defer the discussion of data-driven choices of G, based on hierarchical clustering, to Section 4. Given G, the (unnormalized) graph Laplacian penalty ‖Θ_ij‖_L is defined as

{‖ Θ_{i j} ‖}_{L} = {\sum_{k, k' = 1}^{K} W_{k k'} {(θ_{i j}^{(k)} - θ_{i j}^{(k')})}^{2}}^{1 / 2}

(4)

where W_kk′ = 0 if k and k′ are not connected. The Laplacian shrinkage penalty can be alternatively written as ${‖ Θ_{i j} ‖}_{L} = Θ_{i j}^{T} L Θ_{i j}$ , where $L = {(l_{k k'})}_{k, k' = 1}^{K} \in ℝ^{K \times K}$ is the Laplacian matrix [5] of the subpopulation network G defined as

l_{k k'} = {\begin{matrix} d_{k} - W_{k k}, & k = k', d_{k} \neq 0, \\ - W_{k k'}, & k \neq k', \\ 0, & otherwise, \end{matrix}

where d_k = ∑_k′≠kW_kk′ is the degree of node k in G with W_kk′ = 0 if k and k′ are not connected. The Laplacian shrinkage penalty can also be defined in terms of the normalized graph Laplacian, I − D^−1/2W D^−1/2, where D = diag(d₁, …, d_K) is the diagonal degree matrix. The normalized Laplacian penalty,

{‖ Θ_{i j} ‖}_{L} = {\sum_{k, k' = 1}^{K} W_{k k'} {(\frac{θ_{i j}^{(k)}}{\sqrt{d_{k}}} - \frac{θ_{i j}^{(k')}}{\sqrt{d_{k'}}})}^{2}}^{1 / 2},

which we also denote as ‖Θ_ij‖_L, imposes smaller shrinkage on coefficients associated with highly connected subpopulations. We henceforth primarily focus on the normalized penalty.

Given estimates of the inverse correlation matrices Θ̂⁽¹⁾, …, Θ̂^(K) from (3), we obtain estimates of precision matrices Ω^(k) by noting that Ω^(k) = Ξ^(k)Θ^(k)Ξ^(k), where Ξ^(k) is the diagonal matrix of reciprocals of the standard deviations $Ξ^{(k)} = diag ({σ_{11}^{(k)}}^{- 1 / 2}, \dots, {σ_{p p}^{(k)}}^{- 1 / 2})$ . Our estimator ${\hat{Ω}}_{ρ_{n}} = ({\hat{Ω}}_{ρ_{n}}^{(1)}, \dots, {\hat{Ω}}_{ρ_{n}}^{(K)})$ of precision matrices Ω is thus defined as

{\hat{Ω}}_{ρ_{n}}^{(k)} = {{\hat{Ξ}}^{(k)}}^{- 1} {\hat{Θ}}_{ρ_{n}}^{(k)} {{\hat{Ξ}}^{(k)}}^{- 1}, k = 1, \dots, K,

where ${\hat{Ξ}}^{(k)} = diag (1 / {{\hat{σ}}_{11}^{(k)}}^{1 / 2}, \dots, 1 / {{\hat{σ}}_{p p}^{(k)}}^{1 / 2})$ with sample variance ${\hat{σ}}_{i i}^{(k)}$ for the ith element in the kth subpopulation.

A number of alternative strategies can be used instead of the graph Laplacian penalty in (3). First, similarity among coefficients of precision matrices can also be imposed using a ridge-type penalty, ${‖ Θ_{i j} ‖}_{L}^{2}$ . The main difference is that our penalty ‖Θ_ij‖_L discourages the inclusion of edges $θ_{i j}^{(1)}, \dots, θ_{i j}^{(K)}$ if they are very different across the K subpopulations. Another option is to use the graph trend filtering [31], which impose a fused lasso penalty over the subpopulation graph G. Finally, ignoring the weights W_kk′ in (4), the Laplacian shrinkage penalty resembles the Markov random field (MRF) prior used in Bayesian variable selection with structured covariates [16]. While our paper was under review, we became aware of the recent work by Peterson et al. [23], who utilize an MRF prior to develop a Bayesian framework for estimation of multiple Gaussian graphical models. This method assumes that edges between pairs of random variable are formed independently, and is hence more suited for Erdős-Rényi networks. Our penalized estimation framework can be seen as an alternative to using an MRF prior to estimate the precision matrices in a mixture of Gaussian distributions.

2.3. Connections to Other Estimators

To connect our proposed estimator to existing methods for joint estimation of multiple graphical models, we first give an alternative interpretation of the graph Laplacian penalty ${‖ Θ_{i j} ‖}_{L} = {(Θ_{i j}^{T} L Θ_{i j})}^{1 / 2}$ as a norm for a transformed version of $θ_{i j}^{(k)} s$ . More specifically, consider the mapping g_G : ℝ^K → ℝ^K defined based on the Laplacian matrix for graph G

g_{G} (Θ_{i j}) = {\begin{matrix} 0, & k = k', \\ \sqrt{W_{k k'}} (\frac{θ_{i j}^{(k)}}{\sqrt{2 d_{k}}} - \frac{θ_{i j}^{(k')}}{\sqrt{2 d_{k'}}}), & k \neq k', \end{matrix}

if G has at least one edge. For a graph with no edges, define g_G(Θ_ij) = I_K⊗Θ_ij = diag(Θ_ij), where I_K is the K-identity matrix, and ⊗ denotes the Kronecker product. It can then be seen that the graph Laplacian penalty can be rewritten as

{‖ Θ_{i j} ‖}_{L} = {‖ g_{G} (Θ_{i j}) ‖}_{F} .

where ‖·‖_F is the Frobenius norm.

Using the above interpretation, other methods for joint estimation of multiple graphical models can be seen as penalties on transformations g_G(Θ_ij) corresponding to different graphs G. We illustrate this connection using the hypothetical subpopulation network shown in Figure 2a.

Fig 2 — Comparison of subpopulation networks used in the penalty for different methods for joint estimation of multiple precision matrices: a) the true network, modeled by LASICH; b) FGL; c) GGL & Guo et al; and d) estimation of time-varying networks (Kolar & Xing, 2009); see Section 2.3 for details.

Consider first the FGL penalty of Danaher et al. [6], applied to elements of the inverse correlation matrix $| θ_{i j}^{(k)} - θ_{i j}^{(k')} |$ . Let G_C be a complete unweighted graph (W_kk′ = 1 ∀k ≠ k′), in which all $(\begin{matrix} K \\ 2 \end{matrix})$ node-pairs are connected to each other (Figure 2b). It is then easy to see that

\sum_{k \neq l} | θ_{i j}^{(k)} - θ_{i j}^{(l)} | = \sqrt{2 (K - 1)} {‖ g_{G_{C}} (Θ_{i j}) ‖}_{1},

where the factor of $\sqrt{2 (K - 1)}$ can be absorbed into the tuning parameter for the FGL penalty. A similar argument can also be applied to the GGL penalty of Danaher et al. [6], ‖Θ_ij‖, by considering instead an empty graph G_e with no edges between nodes (Figure 2c). In this case, the mapping g_G would give a diagonal matrix with elements $θ_{i j}^{(k)}$ , and hence ‖Θ_ij‖ = ‖g_{G_e}(Θ_ij)‖_F.

Unlike proposals of Danaher et al. [6], the estimator of Guo et al. [9] is based on a non-convex penalty, and does not naturally fit into the above framework. However, Lemma 2 in Guo et al. [9] establishes a connection between the optimal solutions of the original optimization problem, with those obtained by considering a single penalty of the form ${\sum_{k = 1}^{K} | θ_{i j}^{(k)} |}^{1 / 2} \equiv {‖ Θ_{i j} ‖}_{1, 2}$ . Similar to GGL, the connection with the method of Guo et al. [9] can be build based on the above alternative formulation, by considering again the empty graph G_e (Figure 2c), but instead the ‖·‖_1,2 penalty, which is a member of the CAP family of penalties [36]. More specifically,

{\sum_{k = 1}^{K} | ω_{i j}^{(k)} |}^{1 / 2} = {‖ g_{G_{e}} (Θ_{i j}) ‖}_{1, 2} .

Using the above framework, it is also easy to see the connection between our proposed estimator and the proposal of Kolar et al. [13]: the total variation penalty in Kolar et al. [13] is closely related to FGL, with summation over differences in consecutive time points. It is therefore clear that the penalty of Kolar et al. [13] (up to constant multipliers) can be obtained by applying the graph Laplacian penalty defined for a line graph connecting the time points (Figure 2d).

The above discussion highlights the generality of the proposed estimator, and its connection to existing methods. In particular, while FGL and GGL/Guo et al. [9] consider extreme cases with isolated, or fully connected nodes, one can obtain more flexibility in estimation of multiple precision matrices by defining the penalty based on the known subpopulation network, e.g. based on phylogenetic trees or spatio-temporal similarities between fMRI samples. The clustering-based approach of Section 4 further extends the applicability of the proposed estimator to the settings where the subpopulation network in not known a priori. The simulation results in Section 6 show that the additional flexibility of the proposed estimator can result in significant improvements in estimation of multiple precision matrices, when K > 2. The above discussion also suggests that other variants of the proposed estimator can be defined, by considering other norms. We leave such extensions to future work.

3. Theoretical Properties

In this section, we establish norm and model selection consistency of the LASICH estimator. We consider a high-dimensional setting p ≫ n_k, k = 1, …, K, where both n and p go to infinity. As mentioned in the Introduction, the normality assumption is not required for establishing these results. We instead require conditions on tails of random vectors X^(k) for each k = 1, …, K. We consider two cases, exponential tails and polynomial tails, which both allow for distributions other than multivariate normal.

Condition 1 (Exponential Tails)

There exists a constant c₁ ∈ (0, ∞) such that

𝔼 [exp {t (X_{j}^{(k)} - μ_{j}^{(k)}) / {(σ_{j j}^{(k)})}^{1 / 2}}] \leq e^{c_{1}^{2} t^{2} / 2}, \forall t \in ℝ, k = 1, \dots, K, j = 1, \dots, p .

Condition 2 (Polynomial Tails)

There exist constants c₂, c₃ > 0 and c₄such that

𝔼 [{X_{j}^{(k)} / {(σ_{j j}^{(k)})}^{1 / 2}}^{4 (c_{2} + c_{3} + 1)}] \leq c_{4}, k = 1, \dots, K, j = 1, \dots, p .

Since we adopt the correlation-based Gaussian log-likelihood, we require the boundedness of the true variances to control the error between true and sample correlation matrices.

Condition 3 (Bounded variance)

There exist constants c₅ > 0 andc₆ < ∞ such that $c_{5} \leq {min}_{k, j} σ_{j j}^{(k)}$ and ${max}_{k, j} σ_{j j}^{(k)} \leq c_{6}$ .

Condition 4 (Sample size)

Let $λ_{Θ} \equiv {max}_{k} {‖ Θ_{0}^{(k)} ‖}_{2}$ . Let

C_{1} \equiv {2 c_{5}^{- 2} + c_{5} + c_{6}^{- 3 / 2} + 2 c_{5}^{- 5 / 2} c_{6} + {(c_{5}^{- 4} + 2 c_{5}^{- 5} c_{6})}^{1 / 2}}^{- 1} .

(Exponential tails). It holds that
$n \geq max {\frac{12}{{min}_{k} π_{k}}, 2^{18} 3^{3} C_{1}^{2} {(1 + 4 c_{1}^{2})}^{2} c_{6}^{2} λ_{Θ}^{4} {(1 + {‖ L ‖}_{2}^{1 / 2})}^{2} s} log p,$
and log p/n → 0.
(Polynomial tails). Let $C_{2} = {sup}_{n} {ρ_{n} \sqrt{n / log p}} = O (1)$ where ρ_nis given in Lemma 1 in theAppendixand c₇ > 0 be some constant. It holds that
$n \geq max {\frac{p^{1 / c_{2}}}{c_{7}^{1 / c_{2}}}, 2^{7} 3^{2} C_{1}^{2} C_{2}^{2} K min_{k} π_{k} λ_{Θ}^{4} {(1 + {‖ L ‖}_{2}^{1 / 2})}^{2} s log p} .$

Condition 4 determines the sufficient sample size n = Σ_k for consistent estimation of precision matrices Θ⁽¹⁾, …, Θ^(K) in relation to, among other quantities, the number of variables p, the sparsity pattern s and the spectral norm of the Laplacian matrix ‖L‖₂ of the subpopulation network G. While a general characterization of ‖L‖₂ is difficult, investigating its value in special cases provides insight into the effect of the underlying population structure on the required sample size. Consider, for instance, two extreme cases: for a fully connected graph G associated with K subpopulations, ‖L‖₂ = 1/(K − 1); for a minimally connected “line” graph, corresponding to e.g. multiple time points, ‖L‖₂ = 2: with K = 5, 30% more samples are needed for the line graph, compared to a fully connected network. The above calculations match our intuition that fewer samples are needed to consistently estimate precision matrices of K subpopulations that share greater similarities. This, of course, makes sense, as information can be better shared when estimating parameters of similar subpopulations. Note that, here L represents the Laplacian matrix of the true subpopulation network capturing the underlying population structure. The above conditions thus do not provide any insight into the effect of misspecifying the relationship between subpopulations, i.e., when an incorrect L is used. This is indeed an important issue that garners additional investigation; see Zhao and Shojaie [37] for some insight in the context of inference for high dimensional regression. In Section 4, we will discuss a data-driven choice of L that results in consistent estimation of precision matrices.

Before presenting the asymptotic results, we introduce some additional notations. For a matrix $A = {(a_{i j})}_{i, j = 1}^{p} \in ℝ^{p \times p}$ , we denote the spectral norm ‖A‖₂ = max_{x∈ℝ^p,‖x‖=1}‖Ax‖, and the element-wise ℓ_∞-norm ‖A‖_∞ = max_i,j |a_i,j| where ‖x‖ is the Euclidean norm for a vector x. We also write the induced ℓ_∞-norm ‖A‖_∞/∞ = sup_{‖x‖_∞=1}‖Ax‖_∞ where ‖x‖_∞ = max_i |x_i| for x = (x₁, …, x_p). For the ease of presentation, the results in this section are presented in asymptotic form; non-asymptotic results and proofs are deferred to the Appendix.

3.1. Consistency in Spectral Norm

Let $s \equiv ⧣ {(i, j) : ω_{0, i j}^{(k)} \neq 0, i, j = 1, \dots, p, i \neq j, k = 1, \dots, K}$ , and $d = {max}_{k, i} ⧣ {(i, j) : ω_{0, i j}^{(k)} \neq 0, j = 1, \dots, p, i \neq j}$ . The following theorem establishes the rate of convergence of the LASICH estimator, in spectral norm, under either exponential or polynomial tail conditions (Condition 1 or 2). Convergence rates for LASICH in ℓ_∞-and Frobenius norm are discussed in Section 3.3.

Theorem 1

Suppose Conditions 3 and 4 hold. Under Condition 1 or 2,

\sum_{k = 1}^{K} {‖ {\hat{Ω}}_{ρ_{n}}^{(k)} - Ω_{0}^{(k)} ‖}_{2} = O_{P} (\sqrt{\frac{λ_{Θ}^{4} (s + 1) log p}{n}}),

as n, p → ∞ where ρ_nis given in Lemma 1 in theAppendixwith γ = min_k π_k/2.

Theorem 1 is proved in the Appendix. The proof builds on tools from Negahban et al. [20]. However, our estimation procedure does not match their general framework: First, we do not penalize the diagonal elements of the inverse correlation matrices; our penalty is thus not a norm. Second, the Laplacian matrix is nonpositive definite. Thus, the Laplacian shrinkage penalty is not strictly convex. The results from Negahban et al. [20] are thus not directly applicable to our problem. To establish the estimation consistency, we first show, in Lemma 3, that the function r(·) = ‖·‖₁ + ρ₂‖·‖_L is a seminorm, and is, moreover, convex and decomposable. We also characterize the subdifferential of this seminorm in Lemma 6, based on the spectral decomposition of the graph Laplacian L. The rest of the proof uses tools from Negahban et al. [20], Rothman et al. [26] and Ravikumar et al. [25], as well as new inequalities and concentration bounds. In particular, in Lemma 4 we establish a new ℓ_∞ bound for the empirical covariance matrix for random variables with polynomial tails, which is used to established the consistency in the spectral norm under Condition 2.

The convergence rate in Theorem 1 compares favorably to several other methods based on penalized likelihood. Few results are currently available for estimation of multiple precision matrices. An exception is Guo et al. [9], who obtained a slower rate of convergence O_p({(s + p) log p/n}^1/2) under the normality assumption and based on a bound on the Frobenius norm. Our rates of convergence are comparable to the results of Rothman et al. [26] for spectral norm convergence of a single precision matrix, obtained under the normality assumption. Ravikumar et al. [25], on the other hand, assumed the irrepresentability condition to obtain the rate O_p({min{s + p, d²} log p/n}^1/2) and O_p({min{s + p, d²}p^{τ/(c₂+c₃+1)}/n}^1/2), under exponential and polynomial tail conditions, respectively, where τ > 2 is some scalar. The rate in Theorem 1 is obtained without assuming the irrepresentability condition. In fact, our rates of convergence are faster than those of Ravikumar et al. [25] given the irrepresentability condition 5 (see Corollary 1). Cai et al. [4] obtained improved rates of convergence under both tail conditions for an estimator that is not found by minimizing the penalized likelihood objective function, and may be nonpositive definite. Finally, note that the results in [4, 25, 26] are for separate estimation of precision matrices and hold for the minimum sample size across subpopulations, min_kn_k, whereas our results hold for the total samples size Σ_kn_k.

3.2. Model Selection Consistency

Let $S^{(k)} = {(i, j) : ω_{0, i j}^{(k)} \neq 0, i, j = 1, \dots, p}$ be the support of $Ω_{0}^{(k)}$ , and denote by d the maximum number of nonzero elements in any rows of $Ω_{0}^{(k)}$ , k = 1, …, K. Define the event

ℳ ({\hat{Ω}}_{ρ_{n}}, Ω_{0}) \equiv {sign ({\hat{ω}}_{ρ_{n}, i j}^{(k)}) = sign (ω_{0, i j}^{(k)}), i, j = 1, \dots, p, k = 1, \dots, K},

(5)

where sign(a) is 1 if a > 0, 0 if a = 0 and −1 if a < 0. We say that an estimator Ω̂_{ρ_n} of Ω₀ is model-selection consistent if $P {ℳ ({\hat{Ω}}_{ρ_{n}}, Ω_{0})} \to 1$ .

We begin by discussing an irrepresentability condition for estimation of multiple graphical models. This restrictive condition is commonly assumed to establish model selection consistency of lasso-type estimators, and is known to be almost necessary [19, 35]. For the graphical lasso, Ravikumar et al. [25] showed that the irrepresentability condition amounts to a constraint on the correlation between entries of the Hessian matrix Γ = Ω⁻¹ ⊗ Ω⁻¹ in the set S corresponding to nonzero elements of Ω, and those outside this set. Our irrepresentability condition is motivated by that in Ravikumar et al. [25], however, we adjust the index set S to also account for covariances of “non-edge variables” that are correlated with each other. More specifically, the description of irrepresentability condition in Ravikumar et al. [25] involves Γ_SS consisting only of elements σ_ijσ_kl with (i, j) ∈ S and (k, l) ∈ S. However, σ_ij ≠ 0 for (i, j) ∉ S is not taken into account by this definition. We thus adjust the index set S so that Γ_SS also includes elements σ_ijσ_kl if (i, k) ∈ S and (j, l) ∈ S. This definition is based on the crucial observations that Γ = Σ ⊗ Σ involves the covariance matrix Σ instead of the precision matrix Ω, and that some variables are correlated (i.e., σ_ij ≠ 0) even though they may be conditionally independent (i.e., ω_ij = 0). Defining S^(k) for k = 1, …, K as above, we assume the following condition.

Condition 5 (Irrepresentability condition)

The inverse $Θ_{0}^{(k)}$ of the correlation matrix $Ψ_{0}^{(k)}$ satisfies the irrepresentability condition for S^(k)with parameter α: (a) ${(Θ_{0}^{(k)} \otimes Θ_{0}^{(k)})}_{S^{(k)} S^{(k)}}$ and ${(Ψ_{0}^{(k)} \otimes Ψ_{0}^{(k)})}_{S^{(k)} S^{(k)}}$ are invertible, and (b) there exists some α ∈ (0, 1] such that

max_{(i, j) \in {(S^{(k)})}^{c}} {‖ Γ_{{(i, j)} \times S^{(k)}}^{(k)} {Γ_{S^{(k)} S^{(k)}}^{(k)}}^{- 1} ‖}_{1} \leq 1 - α,

(6)

for k = 1, …, K where $Γ^{(k)} \equiv Ψ_{0}^{(k)} \otimes Ψ_{0}^{(k)}$ .

In addition to the irrepresentability condition, we require bounds on the magnitude of $θ_{i j}^{(k)} \neq 0$ and their normalized difference.

Condition 6 (Lower bounds for the inverse correlation matrices)

There exists a constant c₈ ∈ ℝ such that

θ_{min} \equiv min_{k = 1, \dots, K, i \neq j} | θ_{0, i j}^{(k)} | \geq c_{8} > 0 .

Moreover, for Ω_0,ij ≠ 0, LΩ_0,ij ≠ 0 and there exists a constant c₉ > 0 such that

min_{l_{k k'} \neq 0, \frac{ω_{0, i j}^{(k)}}{\sqrt{d_{k}}} - \frac{ω_{0, i j}^{(k')}}{\sqrt{d_{k'}}} \neq 0} | \frac{θ_{0, i j}^{(k)}}{\sqrt{d_{k}}} - \frac{θ_{0, i j}^{(k')}}{\sqrt{d_{k'}}} | \geq c_{9} .

The first lower bound in Condition 6 is the usual “min-beta” condition for model selection consistency of lasso-type estimators. The second lower bound, which is represented here for the normalized Laplacian penalty, is a mild condition which ensures estimates based on inverse correlation matrices can be mapped to precision matrices. For any pair of subpopulations k and k′ connected in G it requires that if the difference in (normalized) entries of the entires of the precision matrices are nonzero, the difference in (normalized) entries of inverse correlation matrices are bounded away from zero. In other words, the bound guarantees that Θ_0,ij is not in the null space of L, whenever Ω_0,ij is outside of the null space. This bound can be relaxed if we use a positive definite matrix L_ε = L + εI for ε > 0 small.

Our last condition for establishing the model selection consistency concerns the minimum sample size and the tuning parameter for the graph Laplacian penalty. This condition is necessary to control the ℓ_∞-bound of the error Θ̂_{ρ_n} − Θ₀, as in Ravikumar et al. [25]. Our minimum sample size requirement is related to the irrepresentability condition. Let κ_Γ be the maximum of the absolute column sums of the matrices {(Γ^(k))⁻¹}_S^(k)S^(k), k = 1, …, K, and κ_Ψ be the maximum of the absolute column sums of the matrices $Ψ_{0}^{(k)}$ , k = 1, …, K. The minimum sample size in Ravikumar et al. [25] is also a function of the irrepresentability constant, in particular, their κ_Γ involves ${(Γ_{S^{(k)} S^{(k)}}^{(k)})}^{- 1}$ . There is, therefore, a subtle difference between our definition and theirs: in our definition, the matrix is first inverted and then partitioned, while in Ravikumar et al. [25], the matrix is first partitioned and then inverted. Corollary 2 establishes the model selection consistency under a weaker sample size requirement, by exploiting instead the control of the spectral norm in Theorem 1.

Condition 7 (Sample size and regularization parameters)

Let

C_{3} = max {\frac{2^{6} 3^{4} κ_{Ψ}^{2} κ_{Γ}^{2}}{{min}_{k} π_{k}^{2}} max {1, \frac{2^{6} 7^{2} κ_{Ψ}^{4} κ_{Γ}^{2}}{α^{2} {min}_{k} π_{k}^{2}}}, \frac{36}{c_{8}^{2}}, \frac{2^{4} 3^{2}}{c_{9}^{2} {min}_{k} d_{k}}}

(Exponential tails). It holds
$n > \frac{12 log p}{{min}_{k} π_{k}} max {1, 2^{6} 3^{2} C_{1}^{2} {(1 + c_{1}^{2})}^{2} c_{6}^{2} C_{3} d^{2}}$
(Polynomial tails). It holds $n > max {p^{1 / c_{2}} c_{7}^{- 1 / c_{2}}, C_{1}^{2} C_{2}^{2} C_{3} d^{2} log p}$ .
It holds that $ρ_{2} \leq α^{2} / {4 {‖ L ‖}_{2}^{1 / 2} (2 - α)}$ .

With these condition, we obtain

Theorem 2

Suppose that Conditions 3, 5, 6 and 7 hold. Under Condition 1 or 2, P(ℳ(Ω̂_{ρ_n}, Ω₀)) → 1 as n, p → ∞ where ρ_nis given in Lemma 1 in theAppendixwith γ = min_k π_k/2.

3.3. Additional Results

In this section, we establish norm and variable selection consistency of LASICH under alternative assumptions. Our first result gives better rates of convergence for consistency in the ℓ_∞-, spectral and Frobenius norms, under the condition for model selection consistency. Our rates in Corollary 1 improve the previous results by Ravikumar et al. [25], and are comparable to that of Cai et al. [4] in the ℓ_∞- and spectral norms under both tail conditions.

Corollary 1

Suppose the conditions in Theorem 2 hold. Then, under Condition 1 or 2,

\sum_{k = 1}^{K} {‖ {\hat{Ω}}_{ρ_{n}}^{(k)} - Ω_{0}^{(k)} ‖}_{F} = O_{P} (\sqrt{\frac{min {λ_{Θ}^{4} p (s + 1), κ_{Γ}^{2} (s + p)} log p}{n}}),

\sum_{k = 1}^{K} {‖ {\hat{Ω}}_{ρ_{n}}^{(k)} - Ω_{0}^{(k)} ‖}_{2} = O_{P} (\sqrt{\frac{min {λ_{Θ}^{4} (s + 1), κ_{Γ}^{2} d^{2}} log p}{n}}),

\sum_{k = 1}^{K} {‖ {\hat{Ω}}_{ρ_{n}}^{(k)} - Ω_{0}^{(k)} ‖}_{\infty} = O_{P} (\sqrt{\frac{κ_{Γ}^{2} log p}{n}}) .

Our next result in Corollary 2 establishes the model selection consistency under a weaker version of the irrepresentability condition (Condition 6). Aside from the difference in the index sets S^(k), the form of the Condition 6 and the assumption of invertibility of ${(Ψ_{0}^{(k)} \otimes Ψ_{0}^{(k)})}_{S^{(k)} S^{(k)}}$ are similar to those in Ravikumar et al. [25]. On the other hand, Ravikumar et al. [25] do not require invertibility of ${(Θ_{0}^{(k)} \otimes Θ_{0}^{(k)})}_{S^{(k)} S^{(k)}}$ . However, their proof is based on an application of Brouwer’s fixed point theorem, which does not hold for the corresponding function (Eq. (70) in page 973) since it involves a matrix inverse, and is hence not continuous on its range. The additional inevitability assumption in Condition 6 is used to address this issue in Lemma 11. The condition can be relaxed if we assume an alternative scaling of the sample size stated in Condition 8 below instead of Condition 7.

Condition 8

Let $λ_{Ψ} = {max}_{k} ‖ Ψ_{0}^{(k)} ‖$ . Suppose $ρ_{2} \leq α^{2} / {4 {‖ L ‖}_{2}^{1 / 2} (2 - α)}$ and

(Exponential tails)
$n > 2^{19} 3^{3} {min_{k} π_{k}}^{- 3} C_{1}^{2} {(1 + 4 c_{1}^{2})}^{2} c_{6}^{2} λ_{Θ}^{4} {(1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2})}^{2} s log p max {λ_{Ψ}, 4 λ_{Θ}^{4} α^{- 1}},$
or
(Polynomial tails)
$n > 2^{12} 3^{3} {min_{k} π_{k}}^{- 2} K^{2} C_{1}^{2} C_{2}^{2} λ_{Θ}^{4} {(1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2})}^{2} s log p max {λ_{Ψ}, 4 λ_{Θ}^{4} α^{- 1}} .$

Corollary 2

Suppose that Conditions 3, 6 and 8 hold. Suppose also that Condition 5 holds without requiring the invertibility of ${(Θ_{0}^{(k)} \otimes Θ_{0}^{(k)})}_{S^{(k)} S^{(k)}}$ . Then, under Condition 1 or 2, P(ℳ(Ω̂_{ρ_n}, Ω₀)) → 1 as n, p → ∞ where ρ_nis given in Lemma 1 in theAppendix with γ = min_k π_k/2.

4. Laplacian Shrinkage based on Hierarchical Clustering

Our proposed LASICH approach utilizes the information in the subpopulation network G. In practice, however, similarity between subpopulations may be difficult to ascertain or quantify. In this section, we present a modified LASICH framework, called HC-LASICH, which utilizes hierarchical clustering to learn the relationships among subpopulations. The information from hierarchical clustering is then used to define the weighted subpopulation network. Importantly, HC-LASICH can even be used in settings where the subpopulation membership is unavailable, for instance, to learn the genetic network of cancer patients, where cancer subtypes may be unknown.

We use hierarchical clustering with a complete, single or average linkage to estimate both the subpopulation memberships and the weighted subpopulation network G. Specifically, the length of a path between two subpopulations in the dendrogram is used as a measure of dissimilarity between two subpopulations; the weights for the subpopulation networks are simply defined by taking the inverse of these lengths. Throughout this section, we assume that the number of subpopulations K is known. While a number of methods have been proposed for estimating the number of subpopulations in hierarchical clustering (see e.g. Borysov et al. [1] and the references therein), the problem is beyond the scope of this paper.

Let I = (I⁽¹⁾, …, I^(K)) be the subpopulation membership indicator such that I follows the multinomial distribution Mult_K (1, (π₁, …, π_K)) with parameter 1 and subpopulation membership probabilities (π₁, …, π_K) ∈ (0, 1)^K. Note that I is missing and is to be estimated. Let I_i, i = 1, …, n be i.i.d. copies of I and ${\hat{I}}_{i} = ({\hat{I}}_{i}^{1}, \dots, {\hat{I}}_{i}^{K})$ be an estimated subpopulation indicator for the ith observation via hierarchical clustering. Based on the estimated subpopulation membership and subpopulation network Ĝ, we apply our method to obtain the estimator, HC-LASICH, ${\hat{Ω}}_{H C, ρ_{n}} = ({\hat{Ω}}_{H C, ρ_{n}}^{(1)}, \dots, {\hat{Ω}}_{H C, ρ_{n}}^{(K)})$ . Interestingly, HC-LASICH enjoys the same theoretical properties as LASICH, under the normality assumption. To show this, we first establish the consistency of hierarchical clustering in high dimensions, which is of independent interest. Our result is motivated by the recent work of [1], who study the consistency of hierarchical clustering for independent normal variables X^(k) ~ N(μ^(k), σ^(k)I); we establish similar results for multivariate normal distributions with arbitrary covariance structures. We make the following assumption.

Condition 9

For k, k′ = 1, …, K, let

{\bar{λ}}^{(k)} = p^{- 1} \sum_{j = 1}^{p} λ^{(k), j},

μ^{(k, k')} = p^{- 1} {‖ Λ_{k, k'}^{1 / 2} Q_{k, k'}^{T} {[Σ^{(k)} + Σ^{(k')}]}^{1 / 2} [μ^{(k)} - μ^{(k')}] ‖}^{2},

where λ^(k),jis the eigenvalues of Σ^(k)with λ^(k),1 ≤ λ^(k),2 ≤ … ≤ λ^(k),p, and the spectral decomposition of Σ^(k) + Σ^(k′)is $Σ^{(k)} + Σ^{(k')} = Q_{k, k'} Λ_{k, k'} Q_{k, k'}^{T}$ . It holds that

μ^{(k, k')} > 2 min {{\bar{λ}}^{(k)}, {\bar{λ}}^{(k')}} - λ^{(k), p} - λ^{(k'), p}, k \neq k', k, k' = 1, \dots, K,

0 < c_{10} \leq λ^{(k), j} \leq c_{11} < \infty, ‖ μ^{(k)} ‖ \leq c_{11}, k = 1, \dots, K, j = 1, \dots, p .

for constants m and M.

Under the normality assumption, the following results shows that the probability of successful clustering converges to 1, as p, n → ∞.

Theorem 3

Suppose that that X^(k), k = 1, …, K, is normally distributed. Under Condition 9,

P ({\hat{I}}_{i} = I_{i}, i = 1, \dots, n) \to 1,

(7)

as n, p → ∞.

To proof of Theorem 3 generalizes recent results of Borysov et al. [1] to the case of arbitrary covariance structures. A key component of the proof is a new bound on the ℓ₂ norm of a multivariate normal random variable with arbitrary mean and covariance matrix established in Lemma 14. The proof of the lemma uses new concentration inequalities for high-dimensional problems in [2], and may be of independent interest.

Note that the consistent estimation of subpopulation memberships (7) implies that the estimated hierarchy among clusters also matches the true hierarchy. Thus, with successful clustering established in Theorem 3, theoretical properties of Ω̂_{HC, ρ_n} naturally follow.

Theorem 4

Suppose that X^(k), k = 1, …, K, is normally distributed and that Condition 9 holds. (i) Under the conditions of Theorem 1,

\sum_{k = 1}^{K} {‖ {\hat{Ω}}_{H C, ρ_{n}}^{(k)} - Ω_{0}^{(k)} ‖}_{2} = O_{P} (\sqrt{\frac{λ_{Θ}^{4} (s + 1) log p}{n}}) .

Suppose, moreover, that the conditions of Theorem 2 holds. Then

\sum_{k = 1}^{K} {‖ {\hat{Ω}}_{H C, ρ_{n}}^{(k)} - Ω_{0}^{(k)} ‖}_{F} = O_{P} (\sqrt{\frac{min {λ_{Θ}^{4} p (s + 1), κ_{Γ}^{2} (s + p)} log p}{n}}),

\sum_{k = 1}^{K} {‖ {\hat{Ω}}_{H C, ρ_{n}}^{(k)} - Ω_{0}^{(k)} ‖}_{2} = O_{P} (\sqrt{\frac{min {λ_{Θ}^{4} (s + 1), κ_{Γ}^{2} d^{2}} log p}{n}}),

\sum_{k = 1}^{K} {‖ {\hat{Ω}}_{H C, ρ_{n}}^{(k)} - Ω_{0}^{(k)} ‖}_{\infty} = O_{P} (\sqrt{\frac{κ_{Γ}^{2} log p}{n}}) .

(ii) Under the conditions of Theorem 2,

P (ℳ ({\hat{Ω}}_{H C, ρ_{n}}, Ω_{0})) \to 1, as n, p \to \infty .

5. Algorithms

We develop an alternating directions method of multipliers (ADMM) to efficiently solve the convex optimization problem (3).

Let $A^{(k)} = {(a_{i j}^{(k)})}_{i, j = 1}^{p} \in ℝ^{p \times p}, B^{(k)} = {(b_{i j}^{(k)})}_{i, j = 1}^{p} \in ℝ^{p \times p}, C^{(k)} = {(c_{i j}^{(k)})}_{i, j = 1}^{p} \in ℝ^{p \times p}, D^{(k)} = {(d_{i j}^{(k)})}_{i, j = 1}^{p} \in ℝ^{p \times p}$ , k = 1, … K. Define A = (A⁽¹⁾, …, A^(K)), B = (B⁽¹⁾, …, B^(K)), C = (C⁽¹⁾, …, C^(K)), D = (D⁽¹⁾, …, D^(K)), and $c_{i j} \equiv {(c_{i j}^{(1)}, \dots, c_{i j}^{(K)})}^{T} \in ℝ^{K}, d_{i j} \equiv {(d_{i j}^{(1)}, \dots, d_{i j}^{(K)})}^{T} \in ℝ^{K}, e_{C, i j} \equiv {(e_{C, i j}^{(1)}, \dots, e_{C, i j}^{(K)})}^{T} \in ℝ^{K}$ where $E_{C}^{(k)} = {(e_{C, i j}^{(k)})}_{i, j = 1}^{p}$ .

To facilitate the computation, we consider instead a perturbed graph Laplacian L_ε = L + εI, where I is the identity matrix and ε > 0 is a small perturbation. The difference between solutions to the original and modified optimization problem is largely negligible for small ε; however, the positive definiteness of L_ε results in more efficient computation. A similar idea was used in Guo et al. [9] and Rothman et al. [26] to avoid dividision by zero. The optimization problem (3) with L replaced by L_ε can then be written as

minimize \sum_{k = 1}^{K} \frac{n_{k}}{n} (tr (Ψ_{n}^{(k)} A^{(k)}) - log det (A^{(k)})) + ρ_{n} \sum_{k = 1}^{K} {‖ B^{(k)} ‖}_{1} + ρ_{n} ρ_{2} \sum_{i \neq j} {(c_{i j}^{T} L_{ε} c_{i j})}^{1 / 2}

(8)

s . t . A^{(k)} = D^{(k)}, B^{(k)} = D^{(k)}, L_{ε} c_{i j} = L_{ε} d_{i j} k = 1, \dots, K, i, j = 1, \dots, p .

Using Lagrange multipliers E = (E_A, E_B, E_C)^T, with $E_{A} = (E_{A}^{(1)}, \dots, E_{A}^{(K)})$ with $E_{A}^{(k)} \in ℝ^{p \times p}$ , k = 1, …, K, $E_{B} = (E_{B}^{(1)}, \dots, E_{B}^{(K)})$ with $E_{B}^{(k)} \in ℝ^{p \times p}$ , k = 1, …, K, and $E_{C} = (E_{C}^{(1)}, \dots, E_{C}^{(K)})$ with $E_{C}^{(k)} \in ℝ^{p \times p}$ , k = 1, …, K, the augmented Lagrangian in scaled form is given by

L_{ϱ} (A, B, C, D, E) \equiv n^{- 1} \sum_{k = 1}^{K} n_{k} (tr (Ψ_{n}^{(k)} A^{(k)}) - log det (A^{(k)})) + ρ_{n} \sum_{k = 1}^{K} {‖ B^{(k)} ‖}_{1} + ρ_{n} ρ_{2} \sum_{i \neq j} {(c_{i j}^{T} L_{ε} c_{i j})}^{1 / 2} + \frac{ϱ}{2} \sum_{k = 1}^{K} {‖ A^{(k)} - D^{(k)} + E_{A}^{(k)} ‖}_{F}^{2} + \frac{ϱ}{2} \sum_{k = 1}^{K} {‖ B^{(k)} - D^{(k)} + E_{B}^{(k)} ‖}_{F}^{2} + \frac{ϱ}{2} \sum_{i, j} {‖ L_{ε}^{1 / 2} c_{i j} - L_{ε}^{1 / 2} d_{i j} + e_{C, i j} ‖}_{F}^{2} .

Here ϱ > 0 is a regularization parameter and $L_{ε}^{1 / 2}$ is the square root of L_ε with $L_{ε} = {(L_{ε}^{1 / 2})}^{T} L_{ε}^{1 / 2}$ .

The proposed ADMM algorithm is as follows.

Step 0. Initialize A^(k) = A^(k),0, B^(k) = B^(k),0, C^(k) = C^(k),0, D^(k) = D^(k),0, $E_{A}^{(k)} = E_{A}^{(k), 0}, E_{B}^{(k)} = E_{B}^{(k), 0}, E_{C}^{(k)} = E_{C}^{(k), 0}$ and choose ϱ > 0. Select a scalar ϱ > 0.
Step m. Given the (m − 1)th estimates,
- –
  (Update A^(k)) Find A^m minimizing $- ℓ_{n} (A) - (ϱ / 2) \sum_{k = 1}^{K} ‖ A^{(k)} - D^{(k), m - 1} - E_{A}^{(k), m - 1} ‖$ (see pages 46–47 of Boyd et al. [3] for details).
- –
  (Update B^(k)) Compute $B_{i j}^{(k), m} = S_{ρ_{n} / ϱ} (D_{i j}^{(k), m - 1} - E_{B, i j}^{(k), m - 1})$ , where S_y(x) is x − y if x > y, is 0 if |x| ≤ y, and is x + y if x < −y.
- –
  (Update C^(k)) For (x)₊ = max{x, 0}, compute
  $c_{i j}^{m} = {(1 - \frac{ρ_{n} ρ_{2}}{ϱ ‖ L_{ε}^{1 / 2} d_{i j}^{m - 1} - e_{C, i j}^{m - 1} ‖})}_{+} (d_{i j}^{m - 1} - L_{ε}^{- 1 / 2} e_{C, i j}^{m - 1}) .$
- –
  (Update D^(k)) Compute
  $d_{i j}^{m} = {(2 I + L_{ε})}^{- 1} {a_{i j}^{m} + e_{A, i j}^{m - 1} + b_{i j}^{m} + e_{B, i j}^{m - 1} + L_{ε} c_{i j}^{m} + {(L_{ε}^{1 / 2})}^{T} e_{C, i j}^{m - 1}} .$
- –
  (Update E_A) Compute $E_{A}^{(k), m} = E_{A}^{(k)} + A^{(k), m} - D^{(k), m}$ .
- –
  (Update E_B) Compute $E_{B}^{(k), m} = E_{B}^{(k)} + B^{(k), m} - D^{(k), m}$ ,
- –
  (Update E_C) Compute $e_{C, i j}^{(k), m} = e_{C, i j}^{(k)} + L^{1 / 2} (c_{i j}^{(k), m} - d_{i j}^{(k), m})$ .
Repeat the iteration until the maximum of the errors $r_{A}^{(k)} = A^{(k)} - D^{(k)}, r_{B}^{(k), m} = B^{(k), m} - D^{(k), m}, r_{C}^{(k), m} = C^{(k), m} - D^{(k), m}$ , s^(k),m = ϱ(D^(k),m − D^(k),m−1) in the Frobenius norm is less than a specified tolerance level.

The proposed ADMM algorithm facilitates the estimation of parameters of moderately large problems. However, parameter estimation in high dimensions can be computationally challenging. We next present a result that determines whether the solution to the optimization problem (3), for given values of tuning parameters ρ_n, ρ₂, is block diagonal. (Note that this result is an exact statement about the solution to (3), and does not assume block sparsity of the true precision matrices; see Theorems 1 and 2 of Danaher et al. [6] for similar results.) More specifically, the condition in Proposition 1 provides a very fast check, based on the entries of the empirical correlation matrices $Ψ_{n}^{(k)}$ , k = 1, …, K, to identify the block sparsity pattern in ${\hat{Ω}}_{ρ_{n}}^{(k)}$ , k = 1, …, K after some permutation of the features.

Let U_L = [u₁ … u_K] ∈ ℝ^K×K where u₁, …, u_K’s are eigenvectors of L corresponding to 0, λ_L,2, …, λ_L,K. Define $Λ_{L}^{- 1 / 2}$ as the diagonal matrix with diagonal elements 0, $λ_{L, 2}^{- 1 / 2}, \dots, λ_{L, K}^{- 1 / 2}$ .

Proposition 1

The solution ${\hat{Ω}}_{ρ_{n}}^{(k)}$ , k = 1, …, Kto the optimization problem(3)consists of the block diagonal matrices with the same block structure diag(Ω₁, …, Ω_B) among all groups if and only if for $Ψ_{n, i j} = {(ψ_{n, i j}^{(1)}, \dots, ψ_{n, i j}^{(K)})}^{T}$

min_{υ \in {[- 1, 1]}^{K}} ‖ Λ_{L}^{- 1 / 2} U_{L} (\frac{n_{k}}{n} Ψ_{n, i j} - ρ_{n} υ) ‖ \leq ρ_{n} ρ_{2},

(9)

and for all i, j such that the (i, j) element is outside the blocks.

The proof of the Proposition is similar to Theorems 1 of Danaher et al. [6] and is hence omitted. Condition 9 can be easily verified by applying quadratic programming to the left hand side of the inequality. The solution to (3) can then be equivalently found by solving the optimization problem separately for each of the blocks; this can result in significant computational advantages for moderate to large values of ρ_nρ₂.

6. Numerical Results

6.1. Simulation Experiments

We compare our method with four existing methods, graphical lasso, the method of Guo et al. [9], FGL and GGL of Danaher et al. [6]. For graphical lasso, estimation was carried out separately for each group with the same regularization parameter.

Our simulation setting is motivated by estimation of gene networks for healthy subjects and patients with two similar diseases caused by inactivation of certain biological pathways. We consider K = 3 groups with sample sizes n = (50, 100, 50) and dimension p = 100. Data are generated from multivariate normal distributions $N (μ^{(k)}, {(Ω_{0}^{(k)})}^{- 1})$ , k = 1, 2, 3; all precision matrices $Ω_{0}^{(k)}$ are block diagonal with 4 blocks of equal size.

To create the precision matrices, we first generated a graph with 4 components of equal size, each as either an Erdős-Rényi or scale free graphs with 95 total edges. We randomly assigned Unif((−7, −5) ∪ (.5, .7)) values to nonzero entries of the corresponding adjacency matrix A and obtained a matrix Ã. We then added 0.1 to the diagonal of Ã to obtain a positive definite matrix $Ω_{0}^{(1)}$ . For each of subpopulations 2 and 3, we removed one of the components of the graph by setting the off diagonal entries of Ã to zero, and added a perturbation from Unif(−2, .2) to nonzero entries in Ã. Positive definite matrices $Ω_{0}^{(2)}$ and $Ω_{0}^{(3)}$ were obtained by adding 0.1 to the diagonal elements. All partial correlations ranges from .28 to .54 in the absolute values. A similar setting was considered in in Danaher et al. [6], where the graph included more components, but no perturbation was added. We consider two simulation settings, with known and unknown subpopulation network G.

6.1.1. Known subpopulation network G

In this case, we set μ^(k) = 0, k = 1, 2, 3 and use the graph in Figure 1 as the subpopulation network.

Figures 3a,c show the average number of true positive edges versus the average number of detected edges over 50 simulated data sets. Results for multiple choices of the second tuning parameter are presented for FGL, GGL and LASICH. It can be seen that in both cases, LASICH outperforms other methods, when using relatively large values of ρ₂. Smaller values of ρ₂, on the other hand, give similar results as other methods of joint estimation of multiple graphical models. These results indicate that, when the available subpopulation network is informative, the Laplacian shrinkage constraint can result in significant improvement in estimation of the underlying network.

Fig 3 — Simulation results for joint estimation of multiple precision matrices with known subpopulation memberships. Results show the average number of true positive edges (a & c) and estimation error, in Frobenius norm (b & d) over 50 data sets with n = 200 multivariate normal observations generated from a graphical model with p = 100 features; results in top row (a & b) are for an Erdős-Rényi graph and those in bottom row (c & d) are for a scale free (power-law) graph.

Figures 3b,d show the estimation error, in Frobenius norm, versus the number of detected edges. LASICH has larger errors when the estimated graphs have very few edges, but, its error decreases as the number of detected edges increase, eventually yielding smaller errors than other methods. The non-convex penalty of Guo et al. [9] performs well in terms of estimation error, although determining the appropriate range of tuning parameter for this method may be difficult.

6.1.2. Unknown subpopulation network G

In this case, the subpopulation memberships and the subpopulation network G are estimated based on hierarchical clustering. We randomly generated μ⁽¹⁾ from a multivariate normal distribution with a covariance matrix σ²I. For subpopulations 2 and 3, the elements of μ⁽¹⁾ corresponding to the empty components of the graph were set to zero to obtain μ⁽²⁾ and μ⁽³⁾. Hierarchical clustering with complete linkage was applied to data to obtain the dendrogram; we took inverse of distances in the dendrogram to obtain similarity weights used in the graph Laplacian.

Figures 4 compares the performance of HC-LASICH, in terms of support recovery, to competing methods, in the setting where the subpopulation memberships and network are estimated from data (Section 4). Here the differences in subpopulation means μ^(k,k′) are set up to evaluate the effect of clustering accuracy. The four settings considered correspond to average Rand indices of .6 .7, .8 and .9 across 50 data sets, respectively. Here the second tuning parameter for HC-LASICH, GGL and FGL is chosen according to the best performing model in Figure 3. As expected, changing the mean structure, and correspondingly the Rand index, does not affect the performance of other methods. The results indicate that, as long as features can be clustered in a meaningful way, HC-LASICH can result in improved support recovery. Data-adaptive choices of the tuning parameter corresponding to the Laplacian shrinkage penalty may result in further improvements in the performance of the HC-LASICH. However, we do not pursue such choices here.

Fig 4 — Simulation results for joint estimation of multiple precision matrices with unknown subpopulation memberships. Results show the average number of true positive edges over 50 data sets with n = 200 multivariate normal observations generated from a graphical model with over an Erdős-Rényi graph with p = 100 features. Results for HC-LASICH and FGL/GGL correspond to the best choice of the second tuning parameter among those in Figure 3a. The Rand indices for HC-LASICH are averages over 50 generated data sets.

6.2. Genetic Networks of Cancer Subtypes

Breast cancer is heterogenous with multiple clinically verified subtypes [22]. Jönsson et al. [12] used copy number variation and gene expression measurements to identify new subtypes of breast cancer and showed that the identified subtypes have distinct clinical outcomes. The genetic networks of these different subtypes are expected to share similarities, but to also have unique features. Moreover, the similarities among the networks are expected to corroborate with the clustering of the subtypes based on their molecular profiles. We applied network estimation methods of Section 6.1 to a subset of the microarray gene expression data from Jönsson et al. [12], containing data for 218 patients classified into three previously known subtypes of breast cancer: 46 Luminal-simple, 105 Luminal-complex and 67 Basal-complex samples. For ease of presentation, we focused on 50 genes with largest variances. The hierarchical clustering results of Jönsson et al. [12], reproduced in Figure 5 for the above three subtypes, were used to identify the subpopulation membership; reciprocals of distances in the dendrogram were used to define similarities among subtypes used in the graph Laplacian penalty.

Fig 5 — Dendrogram of hierarchical clustering of three subtypes of breast cancer from Jönsson et al. (2010) along with estimated gene networks using graphical lasso (Glasso), method of Guo et al., FGL and GGL of Daneher et al. (2014) and LASICH. Blue edges are common to Luminal subtypes and black edges are shared by all three subtypes; condition specific edges are drawn in gray.

To facilitate the comparison, tuning parameters were selected such that the estimated networks of the three subtypes using each method contained a total of 150 edges. For methods with two tuning parameters, pairs of tuning parameters were determined using the Bayesian information criterion (BIC), as described in Guo et al. [9]. Estimated genetic networks of the three cancer subtypes are shown in Figure 5. For each method, edges common in all three subtypes, those common in Luminal subtypes and subtype specific edges are distinguished.

In this example, separate graphical lasso estimates and FGL/GGL estimates are two extremes. Estimated network topologies from graphical lasso vary from subtype to subtype, and common structures are obscured; this variability may be because similarities among subtypes are not incorporated in the estimation. In contrast, FGL and GGL give identical networks for all subtypes, perhaps because both methods encourage the estimated networks of all subtypes to be equally similar. Intermediate results are obtained using LASICH and the method of Guo et al. [9]. The main difference between these two methods is that Guo et al. [9] finds more edges common to all three subtypes, whereas LASICH finds more edges common to the Luminal subtypes. This difference is likely because LASICH prioritizes the similarity between the Luminal subtypes via graph Laplacian while the method of Guo et al. [9] does not distinguish between the three subtypes. The above example highlights the potential advantages of LASICH in providing network estimates that better corroborate with the known hierarchy of subpopulations.

7. Discussion

We introduced a flexible method for joint estimation of multiple precision matrices, called LASICH, which is particularly suited for settings where observations belong to three or more subpopulations. In the proposed method, the relationships among heterogenous subpopulations is captured by a weighted network, whose nodes correspond to subpopulations, and whose edges capture their similarities. As a result, LASICH can model complex relationships among subpopulations, defined, for example, based on hierarchical clustering of samples.

We established asymptotic properties of the proposed estimator in the setting where the relationship among subpopulations is externally defined. We also extended the method to the setting of unknown relationships among subpopulations, by showing that clusters estimated from the data can accurately capture the true relationships. The proposed method generalizes existing convex penalties for joint estimation of graphical models, and can be particularly advantageous in settings with multiple subpopulations.

A particularly appealing feature of the proposed extension of LASICH is that it can also be applied in settings where the subpopulation memberships are unknown. The latter setting is closely related to estimation of precision matrices for mixture of Gaussian distributions. Both approaches have limitations and drawbacks: on the one hand, the extension of LASICH to unknown subpopulation memberships requires certain assumptions on differences of population means (Section 4). On the other hand, estimation of precision matrices for mixture of Gaussians is computationally challenging, and known rates of convergence of parameter estimation in mixture distributions (e.g. in Städler et al. [29]) are considerably slower.

Throughout this paper we assumed that the number of subpopulations is known. Extensions of this method to estimation of graphical models in populations with an unknown number of subpopulations would be particularly interesting for analysis of genetic networks associated with heterogeneity in cancer samples, and are left for future research.

Acknowledgments

This work was partially supported by NSF grants DMS-1161565 & DMS-1561814 to AS.

Appendix

8. Appendix: Proofs and Technical Detials

We denote true inverse correlation matrices as $Θ_{0} = (Θ_{0}^{(1)}, \dots, Θ_{0}^{(K)})$ and true correlation matrices as $Ψ_{0} = (Ψ_{0}^{(1)}, \dots, Ψ_{0}^{(K)})$ , where $Θ_{0}^{(k)} \equiv {(Ψ_{0}^{(k)})}^{- 1} \equiv {(θ_{0, i j}^{(k)})}_{i, j = 1}^{p}$ , and $Ψ_{0}^{(k)} = {(ψ_{0, i j}^{(k)})}_{i, j = 1}^{p}$ . The estimates of the population parameters are dented as ${\hat{Σ}}_{n}^{(k)} = {({\hat{σ}}_{i j})}_{i, j = 1}^{p}, Ψ_{n}^{(k)} = {(ψ_{n, i j})}_{i, j = 1}^{p}$ , and ${\hat{Θ}}_{ρ_{n}}^{(k)} = {({\hat{θ}}_{ρ_{n}, i j}^{(k)})}_{i, j = 1}^{p}$ . For a vector x = (x₁, …, x_p)^T and J ⊂ {1, …, p}, we denote x_J = (x_j, j ∈ J)^T. For a matrix A, λ_k(A) is the kth smallest eigenvalue and A⃗ is the vectorization of A. For J ⊂ {(i, j) : i, j = 1, …, p} and A ∈ ℝ^p×p, A⃗_J is a vector in ℝ^|J| obtained by removing elements corresponding to (i, j) ∉ J from A⃗. A zero-filled matrix A_J ∈ ℝ^p×p is obtained from A by replacing a_ij by 0 for (i, j) ∉ J.

8.1. Consistency in Matrix Norms

Theorem 1 is a direct consequence of the following result.

Lemma 1

Suppose that Condition 1 holds. Let γ ∈ (0, min_k π_k) be arbitrary. For
$n \geq max {\frac{6}{γ} log p, \frac{2^{15} 3^{3} C_{1}^{2}}{γ^{3}} {(1 + 4 c_{1}^{2})}^{2} max_{k, i} {σ_{i i}^{(k)}}^{2} λ_{Θ}^{4} {(1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2})}^{2} s log p}$
and $ρ_{n} = 2^{3} \sqrt{6} C_{1} (1 + 4 c_{1}^{2}) γ^{- 1 / 2} {max}_{k, i} σ_{i i}^{(k)} \sqrt{log p / n}$ , we have with probability (1 − 2K/p)(1 − 2K exp(−2n(min_k π_k − γ)²)) that
$\sum_{k = 1}^{K} {‖ {\hat{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} ‖}_{F} \leq \frac{2^{15 / 2} 3^{3 / 2} C_{1}}{γ^{3 / 2}} (1 + 4 c_{1}^{2}) max_{k, i} σ_{i i}^{(k)} λ_{Θ}^{2} (1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2}) \sqrt{\frac{s log p}{n}} .$

Suppose that Condition 2 holds with p ≤ c₇n^c₂, c₂, c₃, c₇ > 0. For ρ_n = C₁Kδ_nsatisfying

2^{4} 3^{2} C_{1} ρ_{n}^{2} γ^{- 2} s {(1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2})}^{2} λ_{Θ}^{4} \leq 1 / 4

and

τ > (2^{7} + 2^{3} \sqrt{1 + 2^{4} 3^{2} c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2}}) / (9 c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2})

we have with probability (1 − 2K exp(−2n(min_k π_k − γ)²))ν _nthat

\sum_{k = 1}^{K} {‖ {\hat{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} ‖}_{F} \leq 2^{4} 3^{3 / 2} C_{1} γ^{- 2} K (1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2}) λ_{Θ}^{2} s^{1 / 2} δ_{n},

where

δ_{n} \equiv max_{k, i} {σ_{i i}^{(k)}}^{2} c_{4} (4 + τ) γ^{- 1} \frac{log p}{n} + (1 + 2 max_{k, i} | μ^{(k), i} |) \sqrt{max_{k, i} {σ_{i i}^{(k)}}^{2} c_{4} (4 + τ) γ^{- 1} \frac{log p}{n}} + 2 max_{k, i, j} 𝔼 | X^{(k), i} X^{(k), j} | I (| X^{(k), i} X^{(k), j} | \geq \sqrt{\frac{γ n}{log p}}) + 4 {max_{k, i} 𝔼 | X^{(k), i} | I (| X^{(k), i} | \geq \sqrt{\frac{γ n}{log p}})}^{2} + 2 (1 + 2 max_{k, i} | μ^{(k), i} |) max_{k, i} 𝔼 | X^{(k), i} | I (| X^{(k), i} | \geq \sqrt{\frac{γ n}{log p}}) = O (\sqrt{\frac{log p}{n}}),

and

ν_{n} \equiv \frac{3 c_{7} c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2} {(log p)}^{c_{2} + c_{3} + 1}}{γ^{c_{3}} n^{c_{3}}} + \frac{c_{7} c_{4} {max}_{k, i} σ_{i i}^{(k)} {(log p)}^{2 (c_{2} + c_{3} + 1)}}{n^{c_{2} + c_{3} + 1}} + 8 p^{2} exp (- \frac{{max}_{k, i} σ_{i i}^{(k)} c_{4} (4 + γ) log p}{2 {max}_{k, i} σ_{i i}^{(k)} c_{4} + \sqrt{{max}_{k, i} {σ_{i i}^{(k)}}^{2} c_{4} (64 + 16 τ)} / 3}) = o (1) .

Our proofs adopt several tools from Negahban et al. [20]. Note however that our penalty does not penalize the diagonal elements, and is hence a seminorm; thus, their results do not apply to our case. We first introduce several notations. To treat multiple precision matrices in a unified way, our parameter space is defined to be the set ℝ̃^(pK)×(pK) of (pK) × (pK) symmetric block diagonal matrices, where the kth diagonal block is a p × p matrix corresponding to the precision matrix of subpopulation k. We write A ∈ ℝ̃^(pK)×(pK) for a K-tuple ${(A^{(k)})}_{k = 1}^{K}$ of diagonal blocks A^(k) ∈ ℝ^p×p. Note that for A, B ∈ ℝ̃^(pK)×(pK), ${〈 A, B 〉}_{p K} = \sum_{k = 1}^{K} {〈 A^{(k)}, B^{(k)} 〉}_{p}$ where 〈·, ·〉_p is the trace inner product on ℝ^p×p. In this parameter space, we evaluate the following map from ℝ̃^(pK)×(pK) to ℝ given by

f (Δ) = - {\tilde{ℓ}}_{n} (Θ_{0} + Δ) + {\tilde{ℓ}}_{n} (Θ_{0}) + ρ_{n} {r (Θ_{0} + Δ) - r (Θ_{0})},

where r : ℝ̃^(pK)×(pK) ↦ ℝ is given by r(Θ) = ‖Θ‖₁+ ρ₂‖Θ‖_L. This map provides information on the behavior of our criterion function in the neighborhood of Θ₀. A similar map with a different penalty was studied in Rothman et al. [26]. A key observation is that f(0) = 0 and f(Δ̂_n) ≤ 0 where Δ̂_n = Θ̂_{ρ_n} − Θ₀.

The following lemma provides a non-asymptotic bound on the Frobenius norm of Δ (see Lemma 4 in Negahban et al. [21] for a similar lemma in a different context). Let $S = \cup_{k = 1}^{K} S^{(k)}$ be the union of the supports of $Ω_{0}^{(k)}$ . Define a model subspace $ℳ = {Ω \in {\tilde{ℝ}}^{(p K) \times (p K)} : ω_{i j}^{(k)} = 0, (i, j) \notin S, k = 1, \dots, K}$ and its orthocomplement $ℳ^{⊥} = {Ω \in {\tilde{ℝ}}^{(p K) \times (p K)} : ω_{i j}^{(k)} = 0, (i, j) \in S, k = 1, \dots K}$ under the trace inner product in ℝ̃^(pK)×(pK). For $A = {(a_{i j})}_{i, j = 1}^{p K} \in {\tilde{ℝ}}^{(p K) \times (p K)}$ , we write A = A_ℳ + A_ℳ^⊥ where A_ℳ and A_ℳ^⊥ are the projection of A into ℳ and _ℳ^⊥, in the Frobenius norm, respectively. In other words, the (i, j)-element of A_ℳ is a_ij if (i, j) ∈ S and zero otherwise, and the (i, j)-element of A_ℳ^⊥ is a_ij if (i, j) ∉ S and zero otherwise. Note that Θ₀ ∈ ℳ. Define the set 𝒞 = {Δ ∈ ℝ̃^(pK)×(pK) : r(Δ_ℳ^⊥) ≤ 3r(Δ_ℳ)}.

Lemma 2

Let ε > 0 be arbitrary. Suppose $ρ_{n} \geq 2 {max}_{1 \leq k \leq K} {‖ {\hat{Ψ}}_{n}^{(k)} - Ψ_{0}^{(k)} ‖}_{\infty}$ . Iff (Δ) > 0 for all elements Δ ∈ 𝒞 ⋂ {Δ ∈ ℝ̃^(pK)×(pK) : ‖Δ‖_F = ε} then ‖Δ̂_n‖_F ≤ ε.

Proof

We first show that Δ̂_n ∈ 𝒞. We have by the convexity of −ℓ̃_n(Θ) that

- {\tilde{ℓ}}_{n} (Θ_{0} + {\hat{Δ}}_{n}) + {\tilde{ℓ}}_{n} (Θ_{0}) \geq - | 〈 - \nabla {\tilde{ℓ}}_{n} (Θ_{0}), {\hat{Δ}}_{n} 〉 | .

It follows from Lemma 3(iv) with our choice ρ_n that the right hand side of the inequality is further bounded below by −2⁻¹ ρ_n (r(Δ̂_n,ℳ) + r(Δ̂_n,ℳ^⊥)). Applying Lemma 3(iii), we obtain

0 \geq f ({\hat{Δ}}_{n}) = - {\tilde{ℓ}}_{n} (Θ_{0} + {\hat{Δ}}_{n}) + {\tilde{ℓ}}_{n} (Θ_{0}) + r (Θ_{0} + {\hat{Δ}}_{n}) - r (Θ_{0}) \geq \frac{ρ_{n}}{2} r ({\hat{Δ}}_{n, ℳ^{⊥}}) - \frac{3 ρ_{n}}{2} r ({\hat{Δ}}_{n, ℳ}),

or r(Δ̂_n,ℳ^⊥) ≤ 3r(Δ̂_n,ℳ). This verifies Δ̂_n ∈ 𝒞. Note that f, as a function of Δ is sum of two convex functions ℓ_n and r, and is hence convex. Thus, the rest of the proof follows exactly as Lemma 4 in Negahban et al. [21].

Lemma 3

Let Δ ∈ ℝ̃^(pK)×(pK).

The gradient of ℓ̃_n(Θ₀) is a block diagonal matrix given by
$\nabla {\tilde{ℓ}}_{n} (Θ_{0}) = n^{- 1} diag {n_{1} (Ψ_{0}^{(1)} - {\hat{Ψ}}_{n}^{(1)}), \dots, n_{K} (Ψ_{0}^{(K)} - {\hat{Ψ}}_{n}^{(K)})} .$ (10)
Let c > 0 be a constant. For ‖Δ‖_F ≤ c and n_k/n ≥ γ > 0 for all k and n,
$- {\tilde{ℓ}}_{n} (Θ_{0} + Δ) + {\tilde{ℓ}}_{n} (Θ_{0}) + 〈 \nabla {\tilde{ℓ}}_{n} (Θ_{0}), Δ 〉 \geq \frac{γ}{2 {λ_{Θ} + c}^{2}} {‖ Δ ‖}_{F}^{2} \equiv κ_{ℓ_{n}, c} {‖ Δ ‖}_{F}^{2} .$ (11)
The map r is a seminorm, convex, and decomposable with respect to (ℳ, ℳ^⊥) in the sense that r(Θ₁ + Θ₂) = r(Θ₁) + r(Θ₂) for every Θ₁ ∈ ℳ and Θ₂ ∈ ℳ^⊥. Moreover,
$r (Θ_{0} + Δ) - r (Θ_{0}) \geq r (Δ_{ℳ^{⊥}}) - r (Δ_{ℳ}) .$
For Δ ∈ ℝ̃^(pK)×(pK),
$| 〈 \nabla {\tilde{ℓ}}_{n} (Θ_{0}), Δ 〉 | \leq r (Δ) max_{1 \leq k \leq K} {‖ {\hat{Ψ}}_{n}^{(k)} - Ψ_{0}^{(k)} ‖}_{\infty} .$ (12)
For Θ ∈ ℝ̃^(pK)×(pK),
$r (Θ_{ℳ}) \leq {(s + 1)}^{1 / 2} (1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2}) {‖ Θ_{ℳ} ‖}_{F} .$

Proof

The result follows by taking derivatives blockwise.

Rothman et al. [26] (page 500–502) showed that

- {\tilde{ℓ}}_{n} (Θ_{0} + Δ) + {\tilde{ℓ}}_{n} (Θ_{0}) - 〈 - \nabla {\tilde{ℓ}}_{n} (Θ_{0}), Δ 〉 = \sum_{k = 1}^{K} \frac{n_{k}}{n} (- log det (Θ_{0}^{(k)} + Δ^{(k)}) + log det (Θ_{0}^{(k)}) + 〈 Ψ_{0}^{(k)}, Δ^{(k)} 〉) \geq \sum_{k = 1}^{K} \frac{n_{k}}{n} \frac{{‖ Δ^{(k)} ‖}_{F}^{2}}{2 {min}_{0 \leq υ \leq 1} {{‖ Θ_{0}^{(k)} ‖}_{2} + υ {‖ Δ^{(k)} ‖}_{2}}^{2}} .

Since ‖A‖₂ ≤ ‖A‖_F, n_k/n ≥ γ and ‖Δ‖_F ≤ c, this is further bounded below by

\sum_{k = 1}^{K} \frac{γ}{2} \frac{{‖ Δ^{(k)} ‖}_{F}^{2}}{{{‖ Θ_{0}^{(k)} ‖}_{2} + {‖ Δ^{(k)} ‖}_{F}}^{2}} \geq κ_{ℓ_{n}, c} {‖ Δ ‖}_{F}^{2} .

Because the graph Laplacian L is a positive semidefinite matrix, the triangle inequality r(Θ₁ + Θ₂) ≤ r(Θ₁) + r(Θ₂) holds. To see this let L = L̃L̃^T be any Cholesky decomposition of L. Then
${{(x + y)}^{T} L (x + y)}^{1 / 2} = ‖ {\tilde{L}}^{T} (x + y) ‖ \leq ‖ {\tilde{L}}^{T} x ‖ + ‖ {\tilde{L}}_{y} ‖ = {x^{T} L x}^{1 / 2} + {y^{T} L y}^{1 / 2} .$

It is clear that r(cΘ) = cr(Θ) for any constant c. Thus, given that r does not penalize the diagonal elements, it is a seminorm. The decomposability follows from the definition of r. The convexity follows from the same argument for the triangle inequality. Since Θ₀ + Δ = Θ₀ + Δ_ℳ + Δ_ℳ^⊥, the triangle inequality and the decomposability of r yield
$r (Θ_{0} + Δ) - r (Θ_{0}) \geq r (Θ_{0} + Δ_{ℳ^{⊥}}) - r (Δ_{ℳ}) - r (Θ_{0}) = r (Δ_{ℳ^{⊥}}) - r (Δ_{ℳ}) .$
We show that, for A, B ∈ ℝ̃^(pK)×(pK) with diag(B) = 0, 〈A, B〉 ≤ r(A)‖B‖_∞. If A is a diagonal matrix (or if A = 0), the inequality trivially holds since 〈A, B〉 = 0. If not, r(A) ≠ 0 so that
$\frac{〈 A, B 〉}{r (A)} \leq \frac{{‖ A ‖}_{1} {‖ B ‖}_{\infty}}{{‖ A ‖}_{1}} = {‖ B ‖}_{\infty} .$

Since the diagonal elements of ∇ℓ̃_n(Θ₀) are all zero, the result follows.
For s ≠ 0, we have
$\frac{r (Θ_{ℳ})}{{‖ Θ_{ℳ} ‖}_{F}} \leq sup_{Θ \in ℳ} \frac{\sum_{k = 1}^{K} {‖ Θ^{(k)} ‖}_{1}}{{‖ Θ - diag (Θ) ‖}_{F}} + sup_{Θ \in ℳ} \frac{ρ_{2} \sum_{i \neq j} \sqrt{θ_{i j}^{T} L θ_{i j}}}{{‖ Θ ‖}_{F}} \leq s^{1 / 2} + ρ_{2} sup_{Θ \in ℳ} \frac{\sum_{i \neq j} \sqrt{{‖ L ‖}_{2} {‖ θ_{i j} ‖}_{F}^{2}}}{{‖ Θ ‖}_{F}} \leq s^{1 / 2} (1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2}) .$

In the last inequality we used that $\sqrt{\sum_{j = 1}^{J} \sum_{i = 1}^{I} a_{i j}^{2}} \geq J^{- 1 / 2} \sum_{j = 1}^{J} \sqrt{\sum_{i = 1}^{I} a_{i j}^{2}}$ , which follows by the concavity of the square root function. For s = 0, we trivially have $0 = r (Θ_{ℳ}) \leq s^{1 / 2} {1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2}} {‖ Θ_{ℳ} ‖}_{F}$ . Combining these two cases yields the desired result.

Next, we obtain an upper bound for ${max}_{1 \leq k \leq K} {‖ {\hat{Ψ}}_{n}^{(k)} - Ψ_{0}^{(k)} ‖}_{\infty}$ , which holds with high-probability assuming the tail conditions of the random vectors.

Lemma 4

Suppose that n_k/n ≥ γ > 0 for all k and n.

Suppose that Condition 1 holds. Then for n ≥ 6γ⁻¹ log p we have
$P ({‖ {\hat{Σ}}_{n} - Σ_{0} ‖}_{\infty} \geq 2^{3} \sqrt{6} {(1 + 4 c_{1}^{2})}^{2} γ^{- 1 / 2} max_{k, i} σ_{i i}^{(k)} \sqrt{\frac{log p}{γ n}}) \leq 2 K / p .$ (13)
Suppose that Condition 2 holds with c₂, c₃ > 0 and p ≤ c₇n^c₂. Then we have for $τ > {max}_{k} (2^{7} + 2^{3} \sqrt{1 + 2^{4} 3^{2} c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2}}) / (9 c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2})$
$P ({‖ {\hat{Σ}}_{n} - Σ_{0} ‖}_{\infty} \geq \sum_{k = 1}^{K} δ_{n}^{(k)}) \leq K ν_{n}$ (14)
where
$δ_{n}^{(k)} \equiv (1 + 2 max_{i} | μ^{(k), i} |) (2 δ_{n, 1}^{(k)} + δ_{n, 2}^{(k)}) + {(δ_{n, 1}^{(k)})}^{2} + {(δ_{n, 2}^{(k)})}^{2} + 2 δ_{n, 3}^{(k)},$
with
$δ_{n, 1}^{(k)} \equiv max_{i, j} 𝔼 | X_{l}^{(k), i} X_{l}^{(k), j} | I (| X_{l}^{(k), i} X_{l}^{(k), j} | \geq n_{k}^{1 / 2} {(log p)}^{- 1 / 2}),$

$δ_{n, 2}^{(k)} \equiv {c_{4} max_{k, i} {σ_{i i}^{(k)}}^{2} (4 + τ) log p / n_{k}}^{1 / 2},$

$δ_{n, 3}^{(k)} \equiv max_{i} 𝔼 | X_{l}^{(k), i} | I (| X_{l}^{(k), i} | \geq n_{k}^{1 / 2} {(log p)}^{- 1 / 2}) .$
Suppose that Condition 3 holds and that P(‖Σ̂_n − Σ₀‖_∞ ≥ b_n) = o(1) and b_n = o(1) as n → ∞. Then P(‖Ψ̂_n − Ψ₀‖_∞ ≥ C₁b_n) = o(1).

Proof

This was proved by Ravikumar et al. [25].

Note that

{\hat{Σ}}_{n}^{(k)} - Σ^{(k)} = n_{k}^{- 1} \sum_{l = 1}^{n_{k}} {(X_{l}^{(k)})}^{\otimes 2} - 𝔼 {(X^{(k)})}^{\otimes 2} - {({\bar{X}}^{(k)} - μ^{(k)})}^{\otimes 2} - μ^{(k)} {({\bar{X}}^{(k)} - μ^{(k)})}^{T} - ({\bar{X}}^{(k)} - μ^{(k)}) {(μ^{(k)})}^{T} .

We first evaluate the probability in (14) for

n_{k}^{- 1} \sum_{l = 1}^{n_{k}} {(X_{l}^{(k)})}^{\otimes 2} - 𝔼 {(X^{(k)})}^{\otimes 2}

. Let

Y_{l}^{(k), i j} \equiv X_{l}^{(k), i} X_{l}^{(k), j} - 𝔼 X_{l}^{(k), i} X_{l}^{(k), j},

Ȳ_{l}^{(k), i j} \equiv X_{l}^{(k), i} X_{l}^{(k), j} I (| X_{l}^{(k), i} X_{l}^{(k), j} | \leq \sqrt{\frac{n_{k}}{log p}}) - 𝔼 X_{l}^{(k), i} X_{l}^{(k), j} I (| X_{l}^{(k), i} X_{l}^{(k), j} | \leq \sqrt{\frac{n_{k}}{log p}}),

Ỹ_{l}^{(k), i j} \equiv Y_{l}^{(k), i j} - Ȳ_{l}^{(k), i j} .

We have

P (max_{i, j} | \sum_{l = 1}^{n_{k}} Ỹ_{l}^{(k), i j} | \geq 2 n_{k} δ_{n, 1}^{(k)}) \leq P (max_{i, j} | \sum_{l = 1}^{n_{k}} X_{l}^{(k), i} X_{l}^{(k), j} I (| X_{l}^{(k), i} X_{l}^{(k), j} | \geq \sqrt{\frac{n_{k}}{log p}}) | \geq n_{k} δ_{n, 1}^{(k)}) \leq P (max_{l, i} {(X_{l}^{(k), i})}^{2} \geq n_{k}^{1 / 2} {(log p)}^{- 1 / 2}) (x y \leq max {x^{2}, y^{2}}) \leq p n_{k} \frac{𝔼 X_{0 i}^{4 (c_{2} + c_{3} + 1)} {(log p)}^{c_{2} + c_{3} + 1}}{n_{k}^{c_{2} + c_{3} + 1}} (Markov ’ s inequality) \leq \frac{c_{7} c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2} {(log p)}^{c_{2} + c_{3} + 1}}{n_{k}^{c_{3}}} (p \leq c_{7} n^{c_{2}}) \leq \frac{c_{7} c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2} {(log p)}^{c_{2} + c_{3} + 1}}{γ^{c_{3}} n^{c_{3}}} \equiv ν_{n, 1},

(15)

where the first inequality follows from the triangle inequality. Note that

𝔼 {(Ȳ_{l}^{(k), i j})}^{2} \leq 𝔼 {[X_{l}^{(k), i} X_{l}^{(k), j} I (| X_{l}^{(k), i} X_{l}^{(k), j} | \leq \sqrt{\frac{n_{k}}{log p}})]}^{2} \leq 𝔼 {| X_{l}^{(k), i} X_{l}^{(k), j} |}^{2} \leq 2^{- 1} (𝔼 {(X_{l}^{(k), i})}^{4} + 𝔼 {(X_{l}^{(k), j})}^{4}) \leq c_{4} max_{k, i} {σ_{i i}^{(k)}}^{2} .

It follows from Bernstein’s inequality that

P (max_{i, j} | \sum_{l = 1}^{n_{k}} Ȳ_{l}^{(k), i j} | \geq n_{k} δ_{n, 2}^{(k)}) \leq 2 p^{2} exp (- \frac{c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2} (4 + τ) log p}{2 c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2} + 2 \sqrt{c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2} (64 + 16 τ)} / 3}) \equiv ν_{n, 2} .

(16)

Now, for

τ > (2^{7} + 2^{3} \sqrt{1 + 2^{4} 3^{2} c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2}}) / (9 c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2})

, ν_n,2 → 0 as p → ∞. Note that for this to hold it suffices to have

\frac{3 c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2} (4 + τ)}{6 c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2} + 8 \sqrt{c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2} (4 + τ)}} > 2,

so that the power in the exponent is negative. This inequality reduces to

3 c_{4} max_{k, i} {σ_{i i}^{(k)}}^{2} τ > 16 \sqrt{c_{4} max_{k, i} {σ_{i i}^{(k)}}^{2} (4 + τ)} .

We can solve this by changing a quadratic equation for τ, since τ of our interest is positive. Combining (15) and (16) yields

P ({‖ \frac{1}{n_{k}} \sum_{i = 1}^{n_{k}} {(X_{l}^{(k)})}^{\otimes 2} - 𝔼 {(X^{(k)})}^{\otimes 2} ‖}_{\infty} \geq 2 δ_{n, 1}^{(k)} + δ_{n, 2}^{(k)}) \leq ν_{n, 1} + ν_{n, 2} .

(17)

Let

Z_{l}^{(k), i} \equiv X_{l}^{(k), i} - 𝔼 X_{l}^{(k), i},

{\bar{Z}}_{l}^{(k), i} \equiv X_{l}^{(k), i} I (| X_{l}^{(k), i} | \leq n_{k}^{1 / 2} {(log p)}^{- 1 / 2}) - 𝔼 X_{l}^{(k), i} I (| X_{l}^{(k), i} | \leq n_{k}^{1 / 2} {(log p)}^{- 1 / 2}),

{\tilde{Z}}_{l}^{(k), i} \equiv U_{l}^{(k), i} - {\bar{Z}}_{l}^{(k), i} .

Proceeding as for

Y_{l}^{(k), i j}

’s, we have

P (max_{i} | \sum_{l = 1}^{n_{k}} {\tilde{Z}}_{l}^{(k), i} | \geq 2 n_{k} δ_{n, 3}^{(k)}) \leq \frac{c_{7} c_{4} {max}_{k, i} {σ_{i i}^{(k)}}^{2} {(log p)}^{2 (c_{2} + c_{3} + 1)}}{γ^{c_{2} + c_{3} + 1} n^{c_{2} + c_{3} + 1}} \equiv ν_{n, 3},

and

P (max_{i} | \sum_{k = 1}^{n} {\bar{Z}}_{l}^{(k), i} | \geq n_{k} δ_{n, 2}^{(k)}) \leq ν_{n, 2} .

Thus, we have

P ({‖ {({\bar{X}}^{(k)} - μ^{(k)})}^{\otimes 2} ‖}_{\infty} \geq {(δ_{n, 2}^{(k)})}^{2} + {(2 δ_{n, 3}^{(k)})}^{2}) \leq P (max_{i} | {\bar{X}}^{(k), i} - μ^{(k), i} | \geq \sqrt{{(δ_{n, 1}^{(k)})}^{2} + {(δ_{n, 2}^{(k)})}^{2}}) \leq P (max_{i} | \sum_{k = 1}^{n} {\bar{Z}}_{l}^{(k), i} | \geq n_{k} δ_{n, 2}^{(k)}) + P (max_{i} | \sum_{l = 1}^{n_{k}} {\tilde{Z}}_{l}^{(k), i} | \geq 2 n_{k} δ_{n, 3}^{(k)}) \leq ν_{n, 2} + ν_{n, 3},

(18)

and

P ({‖ ({\bar{X}}^{(k)} - μ^{(k)}) {(μ^{(k)})}^{T} ‖}_{\infty} \geq max_{i} | μ^{(k), i} | (2 δ_{n, 1}^{(k)} + δ_{n, 2}^{(k)})) \leq P (max_{i} | {\bar{X}}^{(k), i} - μ^{(k), i} | \geq 2 δ_{n, 1}^{(k)} + δ_{n, 2}^{(k)}) \leq ν_{n, 1} + ν_{n, 2} .

(19)

Combining (17)–(19) yields

P ({‖ {\hat{Σ}}_{n}^{(k)} - Σ^{(k)} ‖}_{\infty} \geq (1 + 2 max_{i} | μ^{(k), i} |) (2 δ_{n, 1}^{(k)} + δ_{n, 2}^{(k)}) + {(δ_{n, 2}^{(k)})}^{2} + {(2 δ_{n, 3}^{(k)})}^{2}) \leq 3 ν_{n, 1} + 4 ν_{n, 2} + ν_{n, 3} = ν_{n} .

Note that $δ_{n, 1}^{(k)}, δ_{n, 2}^{(k)}, δ_{n, 3}^{(k)}$ , ν_n,1, ν_n,2, ν_n,3 → 0 as n, p → ∞ if log p/n → 0. Note also that $δ_{n, 1}^{(k)}, δ_{n, 2}^{(k)}$ and ${(δ_{n, 3}^{(k)})}^{2}$ are $O (\sqrt{log p / n})$ on the set where n_k/n ≥ γ.

For example, we have by Jensen’s inequality that

\sqrt{\frac{n}{log p}} {(δ_{n, 3}^{(k)})}^{2} = \sqrt{\frac{n}{log p}} max_{i} {𝔼 | X^{(k), i} | I {| X^{(k), i} | \geq n_{k}^{1 / 2} {(log p)}^{- 1 / 2}}}^{2} \leq max_{i} 𝔼 \frac{n}{n_{k}} \sqrt{\frac{n_{k}}{log p}} {| X^{(k), i} |}^{2} I {| X^{(k), i} | \geq n_{k}^{1 / 2} {(log p)}^{- 1 / 2}} \leq γ^{- 1} max_{i} 𝔼 {| X^{(k), i} |}^{3} I {| X^{(k), i} | \geq n_{k}^{1 / 2} {(log p)}^{- 1 / 2}} \leq c_{4} γ^{- 1} max_{i} {σ_{i i}^{(k)}}^{2} .

Given that

| σ_{0, i j}^{(k)} | \leq \sqrt{σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}},

| ψ_{n, i j}^{(k)} - ψ_{0, i j}^{(k)} | = | \frac{{\hat{σ}}_{n, i j}^{(k)}}{\sqrt{{\hat{σ}}_{n, i i}^{(k)} {\hat{σ}}_{n, j j}^{(k)}}} - \frac{σ_{0, i j}^{(k)}}{\sqrt{σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}}} | = \frac{| \sqrt{σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}} ({\hat{σ}}_{n, i j}^{(k)} - σ_{0, i j}^{(k)}) + σ_{0, i j}^{(k)} (\sqrt{σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}} - \sqrt{{\hat{σ}}_{n, i i}^{(k)} {\hat{σ}}_{n, j j}^{(k)}}) |}{\sqrt{{\hat{σ}}_{n, i i}^{(k)} {\hat{σ}}_{n, j j}^{(k)} σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}}} \leq \frac{\sqrt{σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}}}{\sqrt{{\hat{σ}}_{n, i i}^{(k)} {\hat{σ}}_{n, j j}^{(k)} σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}}} {| {\hat{σ}}_{n, i j}^{(k)} - σ_{0, i j}^{(k)} | + | \sqrt{σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}} - \sqrt{{\hat{σ}}_{n, i i}^{(k)} {\hat{σ}}_{n, j j}^{(k)}} |},

wherein

\sqrt{σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}} - \sqrt{{\hat{σ}}_{n, i i}^{(k)} {\hat{σ}}_{n, j j}^{(k)}} = \frac{\sqrt{σ_{0, j j}^{(k)}}}{\sqrt{σ_{0, i i}^{(k)}} + \sqrt{{\hat{σ}}_{n, i i}^{(k)}}} (σ_{0, i i}^{(k)} - {\hat{σ}}_{n, i i}^{(k)}) + \frac{\sqrt{{\hat{σ}}_{n, i i}^{(k)}}}{\sqrt{σ_{0, j j}^{(k)}} + \sqrt{{\hat{σ}}_{n, j j}^{(k)}}} (σ_{0, j j}^{(k)} - {\hat{σ}}_{n, j j}^{(k)}) .

Since b_n → 0, b_n ≤ c₅/2 for n sufficiently large by Condition 3. On the event ‖Σ̂_n − Σ₀‖_∞ ≤ b_n with n large,

0 < c_{5} / 2 \leq σ_{0, i i}^{(k)} - c_{5} / 2 \leq {\hat{σ}}_{n, i i}^{(k)} \leq σ_{0, i i}^{(k)} + c_{5} / 2 \leq c_{6} + c_{5} / 2

. Thus,

\frac{\sqrt{σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}}}{\sqrt{{\hat{σ}}_{n, i i}^{(k)} {\hat{σ}}_{n, j j}^{(k)} σ_{0, i i}^{(k)} σ_{0, j j}^{(k)}}} \leq \frac{2 (c_{5} + 2 c_{6})}{c_{5}^{2}}

\frac{\sqrt{σ_{0, j j}^{(k)}}}{\sqrt{σ_{0, i i}^{(k)}} + \sqrt{{\hat{σ}}_{n, i i}^{(k)}}} \leq \frac{\sqrt{c_{6}}}{2 \sqrt{c_{5}}}

\frac{\sqrt{{\hat{σ}}_{n, i i}^{(k)}}}{\sqrt{σ_{0, j j}^{(k)}} + \sqrt{{\hat{σ}}_{n, j j}^{(k)}}} \leq \frac{\sqrt{c_{5} + 2 c_{6}}}{2 \sqrt{c_{5}}} .

It follows that

| ψ_{n, i j}^{(k)} - ψ_{0, i j}^{(k)} | \leq {2 c_{5}^{- 2} + c_{5} + c_{6}^{- 3 / 2} + 2 c_{5}^{- 5 / 2} c_{6} + {(c_{5}^{- 4} + 2 c_{5}^{- 5} c_{6})}^{1 / 2}} max_{k, i, j} | {\hat{σ}}_{n, i j}^{(k)} - σ_{0, i j}^{(k)} | .

Thus, we have

P ({‖ {\hat{Ψ}}_{n} - Ψ_{0} ‖}_{\infty} \geq C_{1} b_{n}) \leq P ({‖ {\hat{Ψ}}_{n} - Ψ_{0} ‖}_{\infty} \geq C_{1} b_{n}, {‖ {\hat{Σ}}_{n} - Σ_{0} ‖}_{\infty} < b_{n}) + P ({‖ {\hat{Σ}}_{n} - Σ_{0} ‖}_{\infty} \geq b_{n}) \leq 2 P ({‖ {\hat{Σ}}_{n} - Σ_{0} ‖}_{\infty} \geq b_{n}) \to 0 .

So far we have assumed n_k/n ≥ γ in lemmas. We evaluate the probability of this event noting that n_k ~ Binom(n, π_k).

Lemma 5

Let ε > 0 such that γ ≡ min_k π_k − ε > 0. Then

P (min_{k} n_{k} / n \leq min_{k} π_{k} - ε) \leq 2 K exp (- 2 n ε^{2}) .

(20)

Proof

We have by Hoeffding’s inequality that

P (min_{k} n_{k} / n \leq min_{k} π_{k} - ε) \leq P (\exists k, n_{k} / n \leq min_{k} π_{k} - ε) \leq P (\exists k, n_{k} / n \leq π_{k} - ε) \leq P (\exists k, | n_{k} / n - π_{k} | \geq ε) \leq \sum_{k = 1}^{K} P (| n_{k} / n - π_{k} | \geq ε) \leq 2 K exp (- 2 n ε^{2}) .

Proof of Lemma 1

We apply Lemma 2 to obtain the non-asymptotic error bounds.

We first compute a lower bound for f(Δ). Suppose ε ≤ c. For Δ ∈ 𝒞 ∩ {Δ ∈ ℝ̃^(pK)×(pK) : ‖Δ‖_F = ε}, we have by Lemma 3(ii) and (iii) that

f (Δ) \geq - 〈 {\tilde{ℓ}}_{n} (Θ_{0}), Δ 〉 + κ_{ℓ_{n}, c} {‖ Δ ‖}_{F}^{2} + ρ_{n} {r (Δ_{ℳ^{⊥}}) - r (Δ_{ℳ})} .

The assumption on ρ_n and Lemma 3(iii) and (iv) then yield

| 〈 {\tilde{ℓ}}_{n} (Θ_{0}), Δ 〉 | \leq \frac{ρ_{n}}{2} {r (Δ_{ℳ}) + r (Δ_{ℳ^{⊥}})} .

From this inequality and Lemma 3(v) we have

f (Δ) \geq κ_{ℓ_{n}, c} {‖ Δ ‖}_{F}^{2} - \frac{3 ρ_{n}}{2} r (Δ_{ℳ}) \geq κ_{ℓ_{n}, c} {‖ Δ ‖}_{F}^{2} - \frac{3 ρ_{n}}{2} {(s + 1)}^{1 / 2} (1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2}) {‖ Δ ‖}_{F} .

Viewing the right hand side of the above inequality as a quadratic equation in ‖Δ‖_F, we have f(Δ) > 0 if

{‖ Δ ‖}_{F} \geq \frac{3 ρ_{n}}{κ_{ℓ_{n}, c}} {(s + 1)}^{1 / 2} (1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2}) \equiv ε_{c} > 0 .

Thus, if we show that there exists a c₀ > 0 such that ε_c₀ ≤ c₀, Lemma 2 yields that ‖Θ̂_{ρ_n} − Θ₀‖_F ≤ ε_c₀.

Consider the inequality (x + y)²z^1/2 ≤ y where x, y, z ≥ 0. This inequality holds for (x, y, z) such that x = y and xz^1/2 = 1/4. We apply the inequality above with x = λ_Θ, y = c, $z = 2^{4} 3^{2} ρ_{n}^{2} γ^{- 2} s {(1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2})}^{2}$ and solve xz ≤ 1/4 for n. (i) For $ρ_{n} = 2^{3} \sqrt{6} C_{1} {(1 + 4 c_{1}^{2})}^{2} γ^{- 1 / 2} {max}_{k, i} σ_{i i}^{(k)} \sqrt{log p / n}$ , xz ≤ 1/4 yields

n \geq max {\frac{6 log p}{γ}, \frac{2^{15} 3^{3} C_{1}^{2} {(1 + 4 c_{1}^{2})}^{2}}{γ^{3}} max_{k, i} {σ_{i i}^{(k)}}^{2} λ_{Θ}^{4} {(1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2})}^{2} s log p},

and (x + y)⁴z becomes

ε_{{max}_{k} {{‖ Θ_{0}^{(k)} ‖}_{2}}}^{2} \leq 2^{15} 3^{3} {(1 + c_{1}^{2})}^{2} max_{k, i} {(σ_{i i}^{(k)})}^{2} {(1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2})}^{2} γ^{- 3} λ_{Θ}^{4} \frac{s log p}{n} .

(ii) For ρ_n = C₁Kδ_n, there is no closed form solution for n. Note that δ_n → 0 if log p/n → 0 so that xz ≤ 1/4 holds for n sufficiently large, given that $\sum_{k = 1}^{K} δ_{n}^{(k)} \leq K δ_{n}$ .

Computing appropriate probabilities using Lemmas 4 and 5 completes the proof.

Proof of Theorem 1

The estimation error ${‖ {\hat{Ω}}_{ρ_{n}}^{(k)} - Ω_{0} ‖}_{2}^{(2)}$ in the spectral norm can be bounded and evaluated in the same way as in the proof of Theorem 2 of Rothman et al. [26] together with Lemma 1.

8.2. Model Selection Consistency

Our proof is based on the primal-dual witness approach of Ravikumar et al. [25], with some modifications to overcome a difficulty in their proof when applying the fixed point theorem to a discontinuous function. First, we define the oracle estimator ${\overset{ˇ}{Θ}}_{ρ_{n}} = ({\overset{ˇ}{Θ}}_{ρ_{n}}^{(1)}, \dots, {\overset{ˇ}{Θ}}_{ρ_{n}}^{(K)})$ by

{\overset{ˇ}{Θ}}_{ρ_{n}} = \underset{Θ^{(k)} > 0, Θ^{(k)} = {(Θ^{(k)})}^{T}, Θ_{{(s^{(k)})}^{c}}^{(k)} = 0}{arg min} n^{- 1} \sum_{k = 1}^{K} n_{k} (tr (Ψ_{n}^{(k)} Θ^{(k)}) - log det (Θ^{(k)})) + ρ_{n} \sum_{k = 1}^{K} {‖ Θ^{(k)} ‖}_{1} + ρ_{n} ρ_{2} \sum_{i, j} \sqrt{Θ_{i j}^{T} L Θ_{i j}},

(21)

where $Θ_{{(S^{(k)})}^{c}}^{(k)} = 0$ indicates that $Θ_{(i, j)}^{(k)} = 0$ for (i, j) ∉ S^(k).

Lemma 6

Let A ∈ ℝ^p×pbe a positive semidefinite matrix with eigenvalues 0 ≤ λ₁ ≤ λ₂ ≤ ⋯ ≤ λ_pand corresponding eigenvectors u_i satisfying u_i ⊥ u_j, i ≠ j and ‖u_i‖ = 1. The subdifferential $\partial \sqrt{x^{T} A x}$ of $f (x) = \sqrt{x^{T} A x}$ is
$\partial \sqrt{x^{T} A x} = {\begin{matrix} A x / \sqrt{x^{T} A x}, & A x \neq 0, \\ {U Λ^{1 / 2} y : ‖ y ‖ \leq 1}, & A x = 0 . \end{matrix}$
where U ∈ ℝ^p×phas u_i as the ith columns and Λ^1/2is the diagonal matrix with $λ_{i}^{1 / 2}$ , i = 1, …, p, as diagonal elements. Furthermore, the subgradients are bounded above, i.e.
${‖ \nabla f (x) ‖}_{\infty} \leq {‖ A ‖}_{2}^{1 / 2}, for all \nabla f (x) \in \partial \sqrt{x^{T} A x} .$
Let A ∈ ℝ^p×pbe a positive semidefinite matrix and S = {S_i} ⊂ {1, …, p}. Suppose A_SS has eigenvalues 0 ≤ λ_1,S ≤ λ_2,S ≤ ⋯ ≤ λ_|S|,Sand corresponding eigenvectors u_i,S satisfying u_i,S ⊥ u_j,S, i ≠ j and ‖u_i,S‖ = 1. Let g_S : ℝ^|S| → ℝ^pbe a map defined by g_S(x) = y where y_i = x_{S_j} for i = S_j for and y_i = 0 for i ∉ S. The subdifferential $h_{A, S} (x) = \sqrt{g_{S} {(x)}^{T} A g_{S} (x)}$ equals to the subdifferential of $\sqrt{x^{T} A_{S S} x}$ given by
$\partial \sqrt{x^{T} A_{S S} x} = {\begin{matrix} A_{S S} x / \sqrt{x^{T} A_{S S} x}, & A_{S S} x \neq 0, \\ U_{S} Λ_{S}^{1 / 2} {y : ‖ y ‖ \leq 1}, & A_{S S} x = 0 . \end{matrix}$
where U_S ∈ ℝ^|S|×|S|has u_i,S as the ith columns and $Λ_{S}^{1 / 2}$ is the diagonal matrix with $λ_{i, S}^{1 / 2}$ , i = 1, …, |S|, as diagonal elements. For x with A_SSx ≠ 0, there is a relationship between $\partial \sqrt{x^{T} A_{S S} x}$ and $\partial \sqrt{y^{T} A y}$ at y = g_S(x) given by
${\frac{A y}{\sqrt{y^{T} A y}}}_{S} = \frac{A_{S S} x}{\sqrt{x^{T} A_{S S} x}},$

${\frac{A y}{\sqrt{y^{T} A y}}}_{S^{c}} = \frac{A_{S^{c} S} x}{\sqrt{x^{T} A_{S S} x}} .$

Subgradients are bounded above:
${‖ \nabla h_{A, S} (x) ‖}_{\infty} \leq {‖ A_{S S} ‖}_{2}^{1 / 2} \leq {‖ A ‖}_{2}^{1 / 2}, \forall \nabla f_{A, S} (x) \in \partial \sqrt{x^{T} A_{S S} x} .$

Proof

For x with Ax ≠ 0, f(x) is differentiable and the subgradient of f at x is simply the matrix derivative. By definition, for x with Ax = 0, the subgradient υ of f at x satisfies the following inequality
$\sqrt{y^{T} A y} \geq 〈 y - x, υ 〉,$ (22)
for all y. Choosing y = 2x and y = 0 yield 0 ≥ 〈x, υ〉 and 0 ≥ − 〈x, υ〉, implying 〈x, υ〉 = 0. The inequality (22) reduces to $\sqrt{y^{T} A y} \geq 〈 y, υ 〉$ , for any y. If Ay = 0, a similar argument implies that 〈y, υ〉 = 0. Hence υ ⊥ y for every y with Ay = 0.

Let j₀ be the smallest index such that λ_j₀ > 0. Because u_j ’s form an orthonormal basis, any arbitrary vector y can be written as $y = \sum_{j = 1}^{p} β_{j} u_{j}$ . Moreover, the null space of A is the span of u₁, …, u_j₀−1. Thus, the subgradient υ can be written as $υ = \sum_{j = j_{0}}^{p} α_{j} u_{j}$ . Thus, using the spectral decomposition of A as $A = \sum_{j = j_{0}}^{p} λ_{j} u_{j} u_{j}^{T}$ , we can write $f (y) = {\sum_{j = j_{0}}^{p} λ_{j} β_{j}^{2}}^{1 / 2}$ . On the other hand, $〈 y, υ 〉 = \sum_{j = j_{0}}^{p} α_{j} β_{j}$ . Thus, the inequality (22) further reduces to
${\sum_{j = j_{0}}^{p} λ_{j} β_{j}^{2}}^{1 / 2} \geq \sum_{j = j_{0}}^{p} α_{j} β_{j}, \forall β_{j} \in ℝ .$

It follows from the Cauchy-Schwartz inequality that the left hand side of the inequality is bounded from above;
$\sum_{j = j_{0}}^{p} α_{j} β_{j} = \sum_{j = j_{0}}^{p} \frac{α_{j}}{λ_{j}^{1 / 2}} λ_{j}^{1 / 2} β_{j} \leq {\sum_{j = j_{0}}^{p} \frac{α_{j}^{2}}{λ_{j}}}^{1 / 2} {\sum_{j = j_{0}}^{p} λ_{j} β_{j}^{2}}^{1 / 2} .$

Thus,
$\partial f (x) = {υ : υ = \sum_{j = j_{0}}^{p} α_{j} υ_{j}, \sum_{j = j_{0}}^{p} \frac{α_{j}^{2}}{λ_{j}} \leq 1, α_{j} \in ℝ} .$

It is easy to see that this set is the image of the map UΛ^1/2 on the closed ball of radius 1.

Given that ‖x‖_∞ ≤ ‖x‖, to establish the bound in the ℓ_∞-norm, we compute the bound in the Euclidean norm. We use the same notation as in (i). For x with Ax ≠ 0,
$‖ \frac{A x}{\sqrt{x^{T} A x}} ‖ = \frac{‖ U Λ^{1 / 2} Λ^{1 / 2} U^{T} x ‖}{‖ Λ^{1 / 2} U^{T} x ‖} \leq {‖ U Λ^{1 / 2} ‖}_{2} .$

But ${‖ U Λ^{1 / 2} ‖}_{2} = {sup}_{‖ x ‖ = 1} ‖ U Λ^{1 / 2} x ‖ = {sup}_{x \in ℝ^{K}} ‖ U Λ^{1 / 2} (U^{T} x) ‖ / ‖ U^{T} x ‖ = {‖ A ‖}_{2}^{1 / 2}$ , because ‖U^T x‖ = ‖x‖. For x with Ax = 0, $‖ Λ^{1 / 2} y / ‖ y ‖ ‖ \leq {‖ A ‖}_{2}^{1 / 2}$ for every y. Because of the form of the subdifferential and the fact that ‖U x‖ = ‖x‖, the result follows.
Let B_S be a product of elementary matrices for row and column exchange such that B_Sg_S(x) = (x, 0). Notice that $B_{S} = B_{S}^{- 1}$ and that $B_{S} = B_{S}^{T}$ since B_S only rearranges elements of vectors and exchanges rows by multiplication from the left. Note also that ‖B_S‖₂ ≤ ‖B_S‖_∞/∞ = 1, since ‖C‖₂ ≤ ‖C‖_∞/∞ for C = C^T and each row of B_S has only one element with value 1. Because
${h_{A, S} (x)}^{2} = g_{S} {(x)}^{T} A g_{S} (x) = {(B_{S} g_{S} (x))}^{T} (B_{S} A B_{S}) (B_{S} g_{S} (x)) = x^{T} A_{S S} x,$
the subdifferential of h_A,S(x) follows from (ii). For x with A_SSx ≠ x and y = g_S(x), $A y = B_{S} A B_{S} {(x, 0)}^{T} = B_{S} {(A_{S S} x, A_{S^{c} S}^{T} x)}^{T} \neq 0$ because of invertibility of B_S. The relationship holds since
$[\begin{matrix} {(A y / \sqrt{y^{T} A y})}_{S} \\ {(A y / \sqrt{y^{T} A y})}_{S^{c}} \end{matrix}] = B_{S} \frac{A y}{\sqrt{y^{T} A y}} = \frac{1}{\sqrt{x^{T} A_{S S} x}} [\begin{matrix} A_{S S} x \\ A_{S^{c} S} x \end{matrix}]$

An ℓ_∞-bound follows from (i) and the fact that ${‖ A_{S S} ‖}_{2} \leq {‖ B_{S} ‖}_{2}^{2} {‖ A ‖}_{2} = {‖ A ‖}_{2}$ .

Lemma 7

For sample correlation matrices ${\hat{Ψ}}_{n} = ({\hat{Ψ}}_{n}^{(1)}, \dots, {\hat{Ψ}}_{n}^{(K)})$ and any ρ_n > 0, the convex problem(3)has a unique solution ${\hat{Θ}}_{ρ_{n}} = ({\hat{Θ}}_{ρ_{n}}^{(1)}, \dots, {\hat{Θ}}_{ρ_{n}}^{(K)})$ with ${\hat{Θ}}_{ρ_{n}}^{(k)} > 0$ , k = 1, …, K, characterized by

n^{- 1} n_{k} (ψ_{n, i j}^{(k)} - {[{{\hat{Θ}}_{ρ_{n}}^{(k)}}^{- 1}]}_{i j}) + ρ_{n} Û_{1, i j}^{(k)} + ρ_{n} ρ_{2} Û_{2, i j}^{(k)} = 0,

(23)

with $Û_{1, i j}^{(k)} \in \partial | {\hat{θ}}_{ρ_{n}, i j}^{(k)} |$ and ${(Û_{2, i j}^{(1)}, \dots, Û_{2, i j}^{(K)})}^{T} \in \partial \sqrt{{\hat{Θ}}_{ρ_{n}, i j}^{T} L {\hat{Θ}}_{ρ_{n}, i j}}$ for every i ≠ j and k = 1, …, K. Moreover,

n^{- 1} n_{k} (ψ_{n, i i}^{(k)} - {[{{\hat{Θ}}_{ρ_{n}}^{(k)}}^{- 1}]}_{i i}) + ρ_{n} Û_{1, i j}^{(k)} + ρ_{n} ρ_{2} Û_{2, i j}^{(k)} = 0,

(24)

with $Û_{1, i j}^{(k)} = Û_{2, i j}^{(k)} = 0$ for every i = 1, …, p, andk = 1, …, K.

For each (i, j) ∈ S, let $S_{i j} = {k : Θ_{0, i j}^{(k)} \neq 0}$ . The convex problem(21)has a unique solution ${\overset{ˇ}{Θ}}_{ρ_{n}} = ({\overset{ˇ}{Θ}}_{ρ_{n}}^{(1)}, \dots, {\overset{ˇ}{Θ}}_{ρ_{n}}^{(K)})$ with ${\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} > 0$ characterized by

n^{- 1} n_{k} (ψ_{n, i j}^{(k)} - {[{{\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)}}^{- 1}]}_{i j}) + ρ_{n} Ǔ_{1, i j}^{(k)} + ρ_{n} ρ_{2} Ǔ_{2, i j}^{(k)} = 0,

(25)

with $Ǔ_{1, i j}^{(k)} \in \partial | {\overset{ˇ}{θ}}_{ρ_{n}, i j}^{(k)} |$ and $Ǔ_{2, i j}^{(k)} \in \partial \sqrt{{{\overset{ˇ}{Θ}}_{ρ_{n}, i j}}_{S_{i j}}^{T} L_{S_{i j} S_{i j}} {{\overset{ˇ}{Θ}}_{ρ_{n}, i j}}_{S_{i j}}}$ for every i ≠ j and k = 1, …, K. Moreover,

n^{- 1} n_{k} (ψ_{n, i i}^{(k)} - {[{{\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)}}^{- 1}]}_{i i}) + ρ_{n} Ǔ_{1, i j}^{(k)} + ρ_{n} ρ_{2} Ǔ_{2, i j}^{(k)} = 0,

(26)

with $Ǔ_{1, i j}^{(k)} = Ǔ_{2, i j}^{(k)} = 0$ for every i = 1, …, p, and k = 1, …, K.

Proof

A proof for the uniqueness of the solution is similar to the proof of Lemma 3 of Ravikumar et al. [25]. The rest is the KKT condition using Lemma 6.

We choose a pair Ũ = (Ũ₁, Ũ₂) of the subgradients of the first and second regularization terms evaluated at Θ̌_ρn. For each (i, j) with Ω_0,ij = 0 or with LΘ̌_{ρ_n,ij} = 0, set

Ũ_{1, i j}^{(k)} = ρ_{n}^{- 1} n^{- 1} n_{k} (- ψ_{n, i j}^{(k)} + {[{{\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)}}^{- 1}]}_{i j}), Ũ_{2, i j}^{(k)} = 0, k = 1, \dots, K .

For (i, j) with $ω_{0, i j}^{(k)} \neq 0$ , for all k = 1, …, K, set

Ũ_{1, i j}^{(k)} = Ǔ_{1, i j}^{(k)}, Ũ_{2, i j}^{(k)} = Ǔ_{2, i j}^{(k)}, k = 1, \dots, K .

For (i, j) with LΘ̌_{ρ_n,ij} ≠ 0, Ω_0,ij ≠ 0 but $ω_{0, i j}^{(k')} = 0$ for some k′, set

Ũ_{1, i j}^{(k)} = ρ_{n}^{- 1} n^{- 1} n_{k} (- ψ_{n, i j}^{(k)} + {[{{\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)}}^{- 1}]}_{i j}) - ρ_{2} \frac{l_{k} {\overset{ˇ}{Θ}}_{ρ_{n}, i j}}{\sqrt{{\overset{ˇ}{Θ}}_{ρ_{n}, i j}^{T} L {\overset{ˇ}{Θ}}_{ρ_{n}, i j}}},

and

Ũ_{2, i j}^{(k)} = \frac{l_{k}^{T} {\overset{ˇ}{Θ}}_{ρ_{n}, i j}}{\sqrt{{\overset{ˇ}{Θ}}_{ρ_{n}, i j}^{T} L {\overset{ˇ}{Θ}}_{ρ_{n}, i j}}},

if $ω_{0, i j}^{(k)} = 0$ . Otherwise, let

Ũ_{1, i j}^{(k)} = Ǔ_{1, i j}^{(k)}, Ũ_{2, i j}^{(k)} = \frac{l_{k}^{T} {\overset{ˇ}{Θ}}_{ρ_{n}, i j}}{\sqrt{{\overset{ˇ}{Θ}}_{ρ_{n}, i j}^{T} L {\overset{ˇ}{Θ}}_{ρ_{n}, i j}}} .

Here, l_k is the kth row of L.

The main idea of the proof is to show that (Θ̌_{ρ_n}, Ũ) satisfies the optimality conditions of the original problem with probability tending to 1. In particular, we show the following equation, which holds by construction of Ũ₁ and Ũ₂, is in fact the KKT condition of the original problem (3):

n^{- 1} n_{k} ({\hat{Ψ}}_{n}^{(k)} - {{\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)}}^{- 1}) + ρ_{n} Ũ_{1}^{(k)} + ρ_{n} ρ_{2} Ũ_{2}^{(k)} = 0 .

(27)

To this end, we show that Ũ₁ and Ũ₂ are both subgradients of the original problem. We can then conclude that the oracle estimator in the restricted problem (21) is the solution to the original problem (3). Then it follows from the uniqueness of the solution that Θ̌_{ρ_n} = Θ̂_{ρ_n}.

Let $Ξ^{(k)} = {\hat{Ψ}}_{n}^{(k)} - Ψ_{0}^{(k)}, R^{(k)} (Δ^{(k)}) = {{\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)}}^{- 1} - Ψ_{0}^{(k)} + Ψ_{0}^{(k)} Δ^{(k)} Ψ_{0}^{(k)}$ , and ${\overset{ˇ}{Δ}}^{(k)} = {\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)}$ .

Lemma 8

Suppose that max{‖Ξ^(k)‖_∞, R^(k)(Δ̌^(k))‖_∞} ≤ αρ_n/8, and $ρ_{2} \leq α^{2} / {4 {‖ L ‖}_{2}^{1 / 2} (2 - α)}$ . Suppose moreover thatLΘ̌_{ρ_n,ij} ≠ 0 for (i, j) ∈ S. Then $| Ũ_{1, i j}^{(k)} | < 1$ for (i, j) ∈ (S^(k))^c.

Proof

We rewrite (27) to obtain

\frac{n_{k}}{n} Ψ_{0}^{(k)} {\overset{ˇ}{Δ}}^{(k)} Ψ_{0}^{(k)} + \frac{n_{k}}{n} Ξ^{(k)} - \frac{n_{k}}{n} R^{(k)} ({\overset{ˇ}{Δ}}^{(k)}) + ρ_{n} Ũ_{1}^{(k)} + ρ_{n} ρ_{2} Ũ_{2}^{(k)} = 0 .

We further rewrite the above equation via vectorization;

\frac{n_{k}}{n} (Ψ_{0}^{(k)} \otimes Ψ_{0}^{(k)}) {\vec{\overset{ˇ}{Δ}}}^{(k)} + \frac{n_{k}}{n} {\vec{Ξ}}^{(k)} - \frac{n_{k}}{n} {\vec{R}}^{(k)} ({\overset{ˇ}{Δ}}^{(k)}) + ρ_{n} {\vec{Ũ}}_{1}^{(k)} + ρ_{n} ρ_{2} {\vec{Ũ}}_{2}^{(k)} = 0 .

We separate this equation into two equations depending on S^(k);

\frac{n_{k}}{n} Γ_{S^{(k)} S^{(k)}}^{(k)} {\vec{\overset{ˇ}{Δ}}}_{S^{(k)}}^{(k)} + \frac{n_{k}}{n} {\vec{Ξ}}_{S^{(k)}}^{(k)} - \frac{n_{k}}{n} {\vec{R}}_{S^{(k)}}^{(k)} ({\overset{ˇ}{Δ}}^{(k)}) + ρ_{n} {\vec{Ũ}}_{1, S^{(k)}}^{(k)} + ρ_{n} ρ_{2} {\vec{Ũ}}_{2, S^{(k)}}^{(k)} = 0,

\frac{n_{k}}{n} Γ_{{(S^{(k)})}^{c} S^{(k)}}^{(k)} {\vec{\overset{ˇ}{Δ}}}_{S^{(k)}}^{(k)} + \frac{n_{k}}{n} {\vec{Ξ}}_{{(S^{(k)})}^{c}}^{(k)} - \frac{n_{k}}{n} {\vec{R}}_{{(S^{(k)})}^{c}}^{(k)} ({\overset{ˇ}{Δ}}^{(k)}) + ρ_{n} {\vec{Ũ}}_{1, {(S^{(k)})}^{c}}^{(k)} + ρ_{n} ρ_{2} {\vec{Ũ}}_{2, {(S^{(k)})}^{c}}^{(k)} = 0 .

(28)

where (Ũ⃗_l)_J ≡ Ũ⃗_k,J, l = 1, 2. Here we used ${\overset{ˇ}{Δ}}_{{(S^{(k)})}^{c}}^{(k)} = 0$ . Since $Γ_{S^{(k)} S^{(k)}}^{(k)}$ is invertible, we solve the first equation to obtain

\frac{n_{k}}{n} {\vec{\overset{ˇ}{Δ}}}_{S^{(k)}}^{(k)} = {(Γ_{S^{(k)} S^{(k)}}^{(k)})}^{- 1} {- \frac{n_{k}}{n} {\vec{Ξ}}_{S^{(k)}}^{(k)} + \frac{n_{k}}{n} {\vec{R}}_{S^{(k)}}^{(k)} ({\overset{ˇ}{Δ}}^{(k)}) - ρ_{n} {\vec{Ũ}}_{1, S^{(k)}}^{(k)} - ρ_{n} ρ_{2} {\vec{Ũ}}_{2, S^{(k)}}^{(k)}} .

Substituting this expression into (28) yields

{\vec{Ũ}}_{1, {(S^{(k)})}^{c}}^{(k)} = ρ_{n}^{- 1} Γ_{{(S^{(k)})}^{c} S^{(k)}}^{(k)} {(Γ_{S^{(k)} S^{(k)}}^{(k)})}^{- 1} (\frac{n_{k}}{n} {\vec{Ξ}}_{S^{(k)}}^{(k)} - \frac{n_{k}}{n} {\vec{R}}_{S^{(k)}}^{(k)} ({\overset{ˇ}{Δ}}^{(k)})) + Γ_{{(S^{(k)})}^{c} S^{(k)}}^{(k)} {(Γ_{S^{(k)} S^{(k)}}^{(k)})}^{- 1} {\vec{Ũ}}_{1, S^{(k)}}^{(k)} + ρ_{2} Γ_{{(S^{(k)})}^{c} S^{(k)}}^{(k)} {(Γ_{S^{(k)} S^{(k)}}^{(k)})}^{- 1} {\vec{Ũ}}_{2, S^{(k)}}^{(k)} - ρ_{n}^{- 1} (\frac{n_{k}}{n} {\vec{Ξ}}_{{(S^{(k)})}^{c}}^{(k)} - \frac{n_{k}}{n} {\vec{R}}_{{(S^{(k)})}^{c}}^{(k)} ({\overset{ˇ}{Δ}}^{(k)})) - ρ_{2} {\vec{Ũ}}_{2, {(S^{(k)})}^{c}}^{(k)} .

Taking the ℓ_∞-norm yields

{‖ {\vec{Ũ}}_{1, {(S^{(k)})}^{c}}^{(k)} ‖}_{\infty} \leq ρ_{n}^{- 1} {‖ Γ_{{(S^{(k)})}^{c} S^{(k)}}^{(k)} {(Γ_{S^{(k)} S^{(k)}}^{(k)})}^{- 1} ‖}_{\infty / \infty} ({‖ {\vec{Ξ}}_{S^{(k)}}^{(k)} ‖}_{\infty} + {‖ {\vec{R}}_{S^{(k)}}^{(k)} ({\overset{ˇ}{Δ}}^{(k)}) ‖}_{\infty}) + {‖ Γ_{{(S^{(k)})}^{c} S^{(k)}}^{(k)} {(Γ_{S^{(k)} S^{(k)}}^{(k)})}^{- 1} ‖}_{\infty / \infty} ({‖ {\vec{Ũ}}_{1, S^{(k)}}^{(k)} ‖}_{\infty} + ρ_{2} {‖ {\vec{Ũ}}_{2, S^{(k)}}^{(k)} ‖}_{\infty}) + ρ_{n}^{- 1} ({‖ {\vec{Ξ}}_{{(S^{(k)})}^{c}}^{(k)} ‖}_{\infty} + {‖ {\vec{R}}_{{(S^{(k)})}^{c}}^{(k)} ({\overset{ˇ}{Δ}}^{(k)}) ‖}_{\infty}) + ρ_{2} {‖ {\vec{Ũ}}_{2, {(S^{(k)})}^{c}}^{(k)} ‖}_{\infty} \leq \frac{2 - α}{ρ_{n}} ({‖ {\vec{Ξ}}_{{(S^{(k)})}^{c}}^{(k)} ‖}_{\infty} + {‖ {\vec{R}}_{{(S^{(k)})}^{c}}^{(k)} ({\overset{ˇ}{Δ}}^{(k)}) ‖}_{\infty}) + 1 - α + (2 - α) ρ_{2} {‖ L ‖}_{2}^{1 / 2} .

Here we used that ‖A_x‖_∞ ≤ ‖A‖_∞/∞ ≤ ‖A‖_∞ and ${‖ Γ_{{(S^{(k)})}^{c} S^{(k)}}^{(k)} {(Γ_{S^{(k)} S^{(k)}}^{(k)})}^{- 1} ‖}_{\infty / \infty} \leq 1 - α$ , and applied Lemma 6 to bound ‖Ũ⃗_{2, (S^(k))^c}‖_∞ and ‖Ũ⃗_{2, (S^(k))}‖_∞ by ${‖ L ‖}_{2}^{1 / 2}$ . We also used ${‖ {\vec{Ũ}}_{1, S^{(k)}}^{(k)} ‖}_{\infty} = {‖ {\vec{Ǔ}}_{1, S^{(k)}}^{(k)} ‖}_{\infty} \leq 1$ by construction of Ũ₁ and the assumption that ${\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} \neq 0$ . It follows by the assumption of the lemma that

{‖ Ũ_{{(S^{(k)})}^{c}}^{(k)} ‖}_{\infty} \leq \frac{2 - α}{ρ_{n}} \frac{α ρ_{n}}{4} + (1 - α) + (2 - α) ρ_{2} {‖ L ‖}_{2}^{1 / 2} \leq 1 - \frac{α}{2} - \frac{α^{2}}{4} + \frac{α^{2}}{4} < 1 .

Lemma 9 (Lemma 5 of Ravikumar et al. [25])

Suppose that ‖Δ‖_∞ ≤ 1/(3κ_Ψd) with

{(Δ^{(k)})}_{(S^{(k)} \cup {(i, i) : i = 1, \dots, p]})^{c}} = 0 .

Then ‖H^(k)‖_∞/∞ ≤ 3/2 where $H^{(k)} \equiv \sum_{j = 1}^{\infty} {(- 1)}^{j} {(Ψ_{0}^{(k)} Δ^{(k)})}^{j}$ , k = 1, …, K, andR^(k)(Δ^(k)) has representation $R^{(k)} (Δ^{(k)}) = Ψ_{0}^{(k)} Δ^{(k)} Ψ_{0} Δ H^{(k)} Ψ_{0}^{(k)}$ with ${‖ R^{(k)} (Δ^{(k)}) ‖}_{\infty} \leq (3 / 2) d {‖ Δ^{(k)} ‖}_{\infty}^{2} {(κ_{Ψ})}^{3}$ .

Lemma 10

Suppose ${‖ Δ ‖}_{2} \leq 1 / (2 {max}_{k} {‖ Ψ_{0}^{(k)} ‖}_{2})$ with $Δ_{{(S^{(k)} \cup {{(i, i)}_{i = 1}^{p}]})}^{c}}^{(k)} = 0$ . Then ‖H^(k)‖_∞/∞ ≤ 2 where $H^{(k)} \equiv \sum_{t = 1}^{\infty} {(- 1)}^{t} {(Ψ_{0}^{(k)} Δ^{(k)})}^{t}$ , k = 1, …, K, andR^(k)(Δ^(k)) has representation $R^{(k)} (Δ^{(k)}) = Ψ_{0}^{(k)} Δ^{(k)} Ψ_{0} Δ H^{(k)} Ψ_{0}^{(k)}$ with ${‖ R^{(k)} (Δ^{(k)}) ‖}_{\infty} \leq 2 λ_{Θ}^{3} {‖ Δ^{(k)} ‖}_{2}^{2}$ .

Proof

Note that the Neumann series for a matrix (I − A)⁻¹ converges if the operator norm of A is strictly less than 1, and that the ℓ_∞-norm is bounded by the operator norm. A proof is similar to that of Lemma 5 of Ravikumar et al. [25] with the induced infinity norm ‖·‖_∞/∞ replaced by the operator norm in appropriate inequalities.

The following lemma is similar to the statement of Lemma 6 of Ravikumar et al. [25].

Lemma 11

Suppose that

r \equiv \frac{4}{{min}_{k} π_{k}} κ_{Γ} (max_{k} {‖ Ξ^{(k)} ‖}_{\infty} + ρ_{n} + ρ_{n} ρ_{2} {‖ L ‖}_{2}^{1 / 2}) < \frac{1}{6 d max {κ_{Ψ}, κ_{Ψ}^{3} κ_{Γ}}},

for k = 1, …, K. Suppose moreover that ${(Θ_{0}^{(k)} \otimes Θ_{0}^{(k)})}_{S^{(k)} S^{(k)}}$ are invertible for k = 1, …, K. Then with probability $1 - 2 K exp (- n {min}_{k} π_{k}^{2} / 2)$ ,

max_{k} {‖ {\tilde{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} ‖}_{\infty} \leq (3 / 2) r .

Proof

We apply Shauder’s fixed point theorem on the event min_k π_k/2 ≤ n_k/n, which holds with probability $1 - 2 K exp (- n {min}_{k} π_{k}^{2} / 2)$ by Lemma 5 with ε = min_k π_k/2. We first define the function f_k and its domain 𝒟_k to which the fixed point theorem applies. Let S̅^(k) = S^(k) ∪ {(i, i) : 1 ≤ i ≤ p}, and define

𝒟_{k} = {A \in 𝕊^{p \times p} : x^{T} (A + Θ_{0}^{(k)}) x \geq 0, \forall_{x} \in ℝ^{p}, {‖ A_{{\bar{S}}^{(k)}} ‖}_{\infty} \leq r, A_{{({\bar{S}}^{(k)})}^{c}} = 0},

where 𝕊^p×p is the space of symmetric p × p matrices. Then 𝒟_k is a convex, compact subset of the set of 𝕊^p×p.

Let $Ǔ_{l}^{(k)} \in ℝ^{p \times p}$ , l = 1, 2, be zero-filled matrices whose (i, j)-element is $Ǔ_{l, i j}^{(k)}$ in Lemma 7 if (i, j) ∈ S^(k) and zero otherwise. Define the map g_k on the set of invertible matrices in ℝ^p×p by $g_{k} (B) = (n_{k} / n) (B^{- 1} - {\hat{Ψ}}_{n}^{(k)}) - ρ_{n} Ǔ_{1}^{(k)} - ρ_{n} ρ_{2} Ǔ_{2}^{(k)}$ . Note that ${g_{k} ({\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)})}_{S^{(k)}} = 0$ is the KKT condition for the restricted problem (21). Let δ > 0 be a constant such that δ < min{1/2, 1/{10(4dr + 1)}}r and $δ + r \leq 1 / {6 d max {κ_{Ψ}, κ_{Ψ}^{3} κ_{Γ}}$ . Define a continuous function f_k : 𝒟_k ↦ 𝒟_k as

({f_{k} (A))}_{i j} = {\begin{matrix} A_{i j} & i = j, \\ {h_{k} (A) Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)} + A}_{i j}, & i \neq j, (i, j) \in S^{(k)} \\ 0, & otherwise, \end{matrix}

where

h_{k} (A) \equiv \frac{2^{- 1} min {λ_{1} (A + Θ_{0}^{(k)}), 2^{- 1}} + 2^{- 1}}{max {| λ_{1} ({Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)}}_{S^{(k)}} - I) |, 1}} .

Let ${\tilde{f}}_{k} (A) = h_{k} (A) Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)}$ . Then f_k(A) = (f̃_k(A))_S^(k) + A for A ∈ 𝒟_k.

We now verify the conditions of Shauder’s fixed point theorem below. Once these conditions are established, the theorem yields that f_k(A) = A. Since (f_k(A))_(S̅^(k))^c = A for any A ∈ 𝒟_k, and h_k(A) > 0, the solution A to f_k(A) = A is determined by ${(Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)})}_{S^{(k)}} = 0$ . Vectorizing this equation to obtain ${(Θ_{0}^{(k)} \otimes Θ_{0}^{(k)})}_{S^{(k)} S^{(k)}} {g_{k} (A + Θ_{0}^{(k)} + δ I)}_{S^{(k)}} = 0$ , it follows from the invertibility of ${(Θ_{0}^{(k)} \otimes Θ_{0}^{(k)})}_{S^{(k)} S^{(k)}}$ that ${g_{k} (A + Θ_{0}^{(k)} + δ I)}_{S^{(k)}} = 0$ . By the uniqueness of the KKT condition, the solution is $A = {\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} - δ I$ . Since A ∈ 𝒟_k, and δ < r/2, we conclude ${‖ {\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} ‖}_{\infty} \leq (3 / 2) r$ .

In the following, we write A⃗ = vec(A) for a matrix A for notational convenience. For J ⊂ {(i, j) : i, j = 1, …, p}, vec(A)_J should be understood as A⃗_J.

The function f_k is continuous on 𝒟_k. To see this, note first that $A + Θ_{0}^{(k)} + δ I$ is positive definite for every A ∈ 𝒟_k so that the inversion is continuous. Note also that all elements in the matrices involved with eigenvalues in h_k(A) are uniformly bounded in 𝒟_k, and hence the eigenvalues are also uniformly bounded.

To show that f_k(A) ∈ 𝒟_k, first we show that $f_{k} (A) + Θ_{0}^{(k)}$ is positive semidefinite. This follows because for any x ∈ ℝ^p

x^{T} (f_{k} (A) + Θ_{0}^{(k)}) x = x^{T} {{({\tilde{f}}_{k} (A))}_{S^{(k)}} - I} x + x^{T} (A + Θ_{0}^{(k)}) x + x^{T} x \geq h_{k} (A) λ_{1} ({Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)}}_{S^{(k)}} - I) {‖ x ‖}^{2} + λ_{1} (A + Θ_{0}^{(k)}) {‖ x ‖}^{2} + {‖ x ‖}^{2} \geq 0 .

To see this, note that if $λ_{A} \equiv λ_{1} ({Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)}}_{S^{(k)}} - I)$ is positive, then the inequality easily follows. On the other hand, if λ_A < −1, we have

h_{k} (A) λ_{A} {‖ x ‖}^{2} \geq - 2^{- 1} min {λ_{1} (A + Θ_{0}^{(k)}), 2^{- 1}} {‖ x ‖}^{2} - 2^{- 1} {‖ x ‖}^{2} \geq - (λ_{1} (A + Θ_{0}^{(k)}) / 2 + 1 / 2) {‖ x ‖}^{2} .

Lastly, if −1 ≤ λ_A < 0, we have

h_{k} (A) λ_{A} {‖ x ‖}^{2} \geq - | λ_{A} | [2^{- 1} min {λ_{1} (A + Θ_{0}^{(k)}), 2^{- 1}} + 1 / 2] {‖ x ‖}^{2} \geq - | λ_{A} | (λ_{1} (A + Θ_{0}^{(k)}) / 2 + 1 / 2) {‖ x ‖}^{2} .

Next, we show that ‖f_k(A)_S̅^(k)‖_∞ ≤ r. Because diag(f_k(A)) = diag(A), it suffices to show ‖f_k(A)_S^(k)‖_∞ ≤ r. Since $δ + r \leq 1 / {6 d max {κ_{Ψ}, κ_{Ψ}^{3} κ_{Γ}}$ ,

{‖ Ψ_{0}^{(k)} (A + δ I) ‖}_{\infty / \infty} \leq κ_{Ψ} d {‖ A + δ I ‖}_{\infty} \leq κ_{Ψ} d (r + δ) \leq 1 / 3 .

It then follows from Lemma 9 that

R (A + δ I) = {(A + δ I + Θ_{0}^{(k)})}^{- 1} - Ψ_{0}^{(k)} + Ψ_{0}^{(k)} (A + δ I) Ψ_{0}^{(k)} = {Ψ_{0}^{(k)} (A + δ I)}^{2} H^{(k)} Ψ_{0}^{(k)} .

Thus, adding and subtracting $Ψ_{0}^{(k)}$ yields

{\tilde{f}}_{k} (A) + A = h_{k} (A) Θ_{0}^{(k)} ((n_{k} / n) {Ψ_{0}^{(k)} (A + δ I)}^{2} H^{(k)} Ψ_{0}^{(k)} - (n_{k} / n) Ξ^{(k)} - ρ_{n} Ǔ_{1}^{(k)} - ρ_{n} ρ_{2} Ǔ_{2}^{(k)}) Θ_{0}^{(k)} + (1 - (n_{k} / n) h_{k} (A)) A - (n_{k} / n) δ h_{k} (A) I .

Vectorization and restriction on S^(k) gives

{‖ vec {(f_{k} (A))}_{S^{(k)}} ‖}_{\infty} = {‖ vec {({\tilde{f}}_{k} (A) + A)}_{S^{(k)}} ‖}_{\infty} \leq (n_{k} / n) h_{k} (A) {‖ {{(Γ^{(k)})}^{- 1}}_{S^{(k)} S^{(k)}} vec {({Ψ_{0}^{(k)} (A + δ I)}^{2} H^{(k)} Ψ_{0}^{(k)})}_{S^{(k)}} ‖}_{\infty} + (1 - (n_{k} / n) h_{k} (A)) {‖ vec {(A)}_{S^{(k)}} ‖}_{\infty} + (n_{k} / n) δ + h_{k} (A) {‖ {{(Γ^{(k)})}^{- 1}}_{S^{(k)} S^{(k)}} {vec {((n_{k} / n) Ξ^{(k)})}_{S^{(k)}} + ρ_{n} vec {(Ǔ_{1}^{(k)})}_{S^{(k)}} + ρ_{n} ρ_{2} vec {(Ǔ_{2}^{(k)})}_{S^{(k)}}} ‖}_{\infty},

(29)

where ${{(Γ^{(k)})}^{- 1}}_{S^{(k)} S^{(k)}} = {(Θ^{(k)} \otimes Θ_{0}^{(k)})}_{S^{(k)} S^{(k)}}$ . Here we used h_k(A) ≤ (1/4 + 1/2)/1 = 3/4. For the first term of the upper bound in (29), it follows by the inequality ‖Ax‖_∞ ≤ ‖A‖_∞/∞‖x‖_∞ for A ∈ ℝ^p×p and x ∈ ℝ^p, Lemma 9 and the choice of δ satisfying $δ + r \leq 1 / {6 d max {κ_{Ψ}, κ_{Ψ}^{3} κ_{Γ}}$ that

{‖ {{(Γ^{(k)})}^{- 1}}_{S^{(k)} S^{(k)}} vec {({Ψ_{0}^{(k)} (A + δ I)}^{2} H^{(k)} Ψ_{0}^{(k)})}_{S^{(k)}} ‖}_{\infty} \leq κ_{Γ} {‖ R^{(k)} (A + δ I) ‖}_{\infty} \leq κ_{Γ} \frac{3}{2} d {‖ A + δ I ‖}_{\infty}^{2} κ_{Ψ}^{3} \leq κ_{Γ} \frac{3}{2} d {‖ A + δ I ‖}_{\infty} (r + δ) κ_{Ψ}^{3} \leq (r + δ) / 4 .

For the second term, it follows by the assumption, the inequality that ‖Ax‖_∞ ≤ ‖A‖_∞/∞‖x‖_∞ for A ∈ ℝ^p×p and x ∈ ℝ^p, and Lemma 6 that

{‖ {{(Γ^{(k)})}^{- 1}}_{S^{(k)} S^{(k)}} {\frac{n_{k}}{n} vec {(Ξ^{(k)})}_{S^{(k)}} + ρ_{n} vec {(Ǔ_{1}^{(k)})}_{S^{(k)}} + ρ_{n} ρ_{2} vec {(Ǔ_{2}^{(k)})}_{S^{(k)}}} ‖}_{\infty} \leq κ_{Γ} (‖ Ξ^{(k)} + ρ_{n} + ρ_{n} ρ_{2} {‖ L ‖}_{2}^{1 / 2}) = (min_{k} π_{k}) r / 4 \leq (n_{k} / n) r / 2 .

Thus, we can further bound ‖vec((f̃_k(A) + A)_S^(k))‖_∞ by

\frac{n_{k}}{n} h_{k} (A) \frac{r + δ}{4} + \frac{n_{k}}{n} h_{k} (A) \frac{r}{2} + (1 - \frac{n_{k}}{n} h_{k} (A)) r + \frac{n_{k}}{n} δ = r {1 - \frac{n_{k}}{n} \frac{h_{k} (A)}{4}} + \frac{n_{k}}{n} {1 + \frac{h_{k} (A)}{4}} δ .

(30)

Since

{‖ {(Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)})}_{S^{(k)}} ‖}_{\infty} \leq {‖ A_{S^{(k)}} ‖}_{\infty} + {‖ Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)} + A)_{S^{(k)}} ‖}_{\infty},

and δ ≤ r/2, a similar reasoning shows that

{‖ {(Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)})}_{S^{(k)}} ‖}_{\infty} \leq (n_{k} / n) {‖ {{(Γ^{(k)})}^{- 1}}_{S^{(k)} S^{(k)}} vec {({Ψ_{0}^{(k)} (A + δ I)}^{2} H^{(k)} Ψ_{0}^{(k)})}_{S^{(k)}} ‖}_{\infty} + (2 - (n_{k} / n)) {‖ vec {(A)}_{S^{(k)}} ‖}_{\infty} + (n_{k} / n) δ + {‖ {{(Γ^{(k)})}^{- 1}}_{S^{(k)} S^{(k)}} {(n_{k} / n) vec {(Ξ^{(k)})}_{S^{(k)}} + ρ_{n} vec {(Ǔ_{1}^{(k)})}_{S^{(k)}} + ρ_{n} ρ_{2} vec ({(Ǔ_{2}^{(k)})}_{S^{(k)}})} ‖}_{\infty} \leq \frac{r + δ}{4} + \frac{r}{2} + 2 r + δ \leq 4 r .

Thus, the inequality ‖B‖₂ ≤ ‖B‖_∞/∞ for B = B^T implies that

| λ_{1} ({Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)}}_{S^{(k)}} - I) | \leq {‖ λ_{1} ({Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)}}_{S^{(k)}}) ‖}_{2} + 1 \leq {‖ λ_{1} ({Θ_{0}^{(k)} g_{k} (A + Θ_{0}^{(k)} + δ I) Θ_{0}^{(k)}}_{S^{(k)}}) ‖}_{\infty / \infty} + 1 \leq 4 d r + 1 .

Hence h_k(A) ≥ 1/(8dr + 2) for every A ∈ 𝒟_k.

Now (30) is further bounded by r:

r {1 - \frac{n_{k}}{n} \frac{h_{k} (A)}{4}} + \frac{n_{k}}{n} {1 + \frac{h_{k} (A)}{4}} δ \leq r {1 - \frac{n_{k}}{n} \frac{h_{k} (A)}{4}} + \frac{n_{k}}{n} {1 + \frac{h_{k} (A)}{4}} \frac{r}{10 (4 d r + 1)} \leq r {1 - \frac{n_{k}}{n} \frac{h_{k} (A)}{4}} + \frac{n_{k}}{n} {1 + \frac{h_{k} (A)}{4}} \frac{h_{k} (A) r}{5} \leq r - \frac{n_{k}}{n} \frac{h_{k} (A) - h_{k}^{2} (A)}{20} r \leq r .

Here we used the fact that δ ≤ r/{10(4dr + 1)} and 1/(8dr + 2) ≤ h_k(A) < 1. Thus, ‖(f_k(A))_S^(k)‖_∞ ≤ r.

Since (f_k(A))_(S^(k))^c = 0 by definition, all the conditions for the fixed point theorem are established. This completes the proof.

We are now ready to prove Theorem 2. Note that Condition 7 implies that

ρ_{n} < min {\frac{{min}_{k} π_{k}}{72 d κ_{Γ}} min {\frac{1}{κ_{Ψ}}, \frac{1}{κ_{Ψ}^{3} κ_{Γ}}, \frac{{min}_{k} π_{k}}{56 κ_{Ψ}^{3} κ_{Γ}} α}, \frac{c_{8}}{6}, \frac{c_{9} {min}_{k} \sqrt{d_{k}}}{12}} .

Proof of Theorem 2

We prove that the oracle estimator Θ̌_{ρ_n} satisfies (I) the model selection consistency and (II) the KKT conditions of the original problem (3) with (Θ̌_{ρ_n}, Ũ₁, Ũ₂). The model selection consistency of Θ̂_{ρ_n} = Θ̌_{ρ_n} then follows by the uniqueness of the solution to the original problem. The following discussion is on the event that min_k π_k/2 ≤ n_k/n, k = 1, …, K, and max_k‖Ξ^(k)‖_∞ ≤ α/8. Note that this event has probability approaching 1 by Lemmas 4 and 5.

First we obtain an ℓ_∞-bound of the error of the oracle estimator. Note that by Condition 7 and the fact that α ∈ [0, 1)

\frac{α}{8} + 1 + ρ_{2} {‖ L ‖}_{2}^{1 / 2} \leq \frac{α}{8} + 1 + \frac{α^{2}}{4 (2 - α)} \leq 3 .

Thus, it follows from Condition 7 that

\frac{4}{{min}_{k} π_{k}} κ_{Γ} ({‖ Ξ^{(k)} ‖}_{\infty} + ρ_{n} + ρ_{n} ρ_{2} {‖ L ‖}_{2}^{1 / 2}) < \frac{12 κ_{Γ}}{{min}_{k} π_{k}} \frac{{min}_{k} π_{k}}{72 d κ_{Γ}} min {\frac{1}{κ_{Ψ}}, \frac{1}{κ_{Ψ}^{3} κ_{Γ}}} = \frac{1}{6 d max {κ_{Ψ}, κ_{Ψ}^{3} κ_{Γ}}} .

Because ${(Θ_{0}^{(k)} \otimes Θ_{0}^{(k)})}_{S^{(k)} S^{(k)}}$ is invertible by Condition 5, we can apply Lemma 11 to obtain ${‖ {\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} ‖}_{\infty} \leq (6 / {min}_{k} π_{k}) κ_{Γ} ({‖ Ξ^{(k)} ‖}_{\infty} + ρ_{n} + ρ_{n} ρ_{2} {‖ L ‖}_{2}^{1 / 2})$ with probability approaching 1.

As a consequence of the ℓ_∞-bound, Θ̌_{ρ_n,ij} ≠ 0 for (i, j) ∈ S, because ${‖ {\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} ‖}_{\infty} \leq 3 ρ_{n} \leq c_{8} / 2 < {min}_{k = 1, \dots, K, i \neq j} | θ_{0, i j}^{(k)} |$ by Conditions 6 and 7. This establishes the model selection consistency of the oracle estimator.

Next, we show that the Oracle estimator satisfies the KKT condition of the original problem (3). As the first step, we prove $Ũ_{1, i j}^{(k)} \in \partial {\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)}$ for every i, j, k with probability approaching 1. Since Θ̌_{ρ_n,ij} ≠ 0 for (i, j) ∈ S with probability approaching 1, $Ũ_{1, i j}^{(k)} = Ǔ_{1, i j}^{(k)}$ for (i, j) ∈ S^(k) by construction. For (i, j) ∈ (S^(k))^c, we need to prove $| Ũ_{1, i j}^{(k)} | < 1$ for every i, j, k. To this end, it suffices to verify that ${‖ R^{(k)} ({\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)}) ‖}_{\infty} \leq α / 8$ and apply Lemma 8. Applying Lemma 9 with ${‖ {\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} ‖}_{\infty} \leq (6 / {min}_{k} π_{k}) κ_{Γ} ({‖ Ξ^{(k)} ‖}_{\infty} + ρ_{n} + ρ_{n} ρ_{2} {‖ L ‖}_{2}^{1 / 2})$ and Condition 7 gives

{‖ R^{(k)} ({\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)}) ‖}_{\infty} \leq \frac{3}{2} d κ_{Ψ}^{3} {‖ {\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} ‖}_{\infty}^{2} \leq \frac{3}{2} d κ_{Ψ}^{3} \frac{324 κ_{Γ}^{2}}{{min}_{k} π_{k}^{2}} ρ_{n}^{2} \leq \frac{486 d κ_{Ψ}^{3} κ_{Γ}^{2}}{{min}_{k} π_{k}^{2}} {\frac{{min}_{k} π_{k}}{72 d κ_{Γ}} \frac{{min}_{k} π_{k}}{56 κ_{Ψ}^{3} κ_{Γ}} α} ρ_{n} \leq \frac{α}{8} α .

Next, we prove that $Ũ_{2, i j} \in \partial \sqrt{{\overset{ˇ}{Θ}}_{ρ_{n}, i j} L {\overset{ˇ}{Θ}}_{ρ_{n}, i j}}$ for every (i, j). For (i, j) with $ω_{0, i j}^{(k)} \neq 0$ for all k = 1, …, K, $Ũ_{2, i j} = Ǔ_{ρ_{n}} \in \partial \sqrt{{\overset{ˇ}{Θ}}_{ρ_{n}, i j} L {\overset{ˇ}{Θ}}_{ρ_{n}, i j}}$ . For (i, j) with Ω_0,ij = 0, $Ũ_{2, i j} = 0 \in \partial \sqrt{{\overset{ˇ}{Θ}}_{ρ_{n}, i j} L {\overset{ˇ}{Θ}}_{ρ_{n}, i j}}$ by Lemma 6. For (i, j) with Ω_0,ij ≠ 0 and $ω_{0, i j}^{(k')} = 0$ for some k′,

Ũ_{2, i j} = L {\overset{ˇ}{Θ}}_{ρ_{n}, i j} / \sqrt{{\overset{ˇ}{Θ}}_{ρ_{n}, i j} L {\overset{ˇ}{Θ}}_{ρ_{n}, i j}} \in \partial \sqrt{{\overset{ˇ}{Θ}}_{ρ_{n}, i j} L {\overset{ˇ}{Θ}}_{ρ_{n}, i j}}

if LΘ̌_{ρ_n,ij} ≠ 0. To see LΘ̌_{ρ_n,ij} ≠ 0 holds with probability approaching 1, let (k, k′) ∈ S with k ≠ k′ such that $Θ_{0, i j}^{(k)} / \sqrt{d_{k}} - Θ_{0, i j}^{(k')} / \sqrt{d_{k'}} \neq 0$ . This pair (k, k′) exists by Condition 6 and the assumption LΘ_0,ij ≠ 0. We assume without loss of generality $θ_{0, i j}^{(k)} / \sqrt{d_{k}} - θ_{0, i j}^{(k')} / \sqrt{d_{k'}} > 0$ . Since ${‖ {\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} ‖}_{\infty} \leq 3 ρ_{n} \leq c_{9} {min}_{k} \sqrt{d_{k}} / 12$ , it follows from Condition 7 that

\frac{{\overset{ˇ}{θ}}_{ρ_{n}, i j}^{(k)}}{\sqrt{d_{k}}} - \frac{{\overset{ˇ}{θ}}_{ρ_{n}, i j}^{(k')}}{\sqrt{d_{k'}}} \geq \frac{θ_{0, i j}^{(k)}}{\sqrt{d_{k}}} - \frac{θ_{0, i j}^{(k')}}{\sqrt{d_{k'}}} - 3 ρ_{n} (\frac{1}{\sqrt{d_{k}}} + \frac{1}{\sqrt{d_{k'}}}) \geq c_{9} - 3 ρ_{n} (max_{W_{k, k'} \neq 0} \frac{1}{\sqrt{d_{k}}} + \frac{1}{\sqrt{d_{k'}}}) \geq \frac{1}{2} c_{9} .

Hence, ${\overset{ˇ}{Θ}}_{ρ_{n}, i j}^{T} L {\overset{ˇ}{Θ}}_{ρ_{n}, i j} \geq W_{k k'} c_{9}^{2} / 4 > 0$ or LΘ̌_{ρ_n,ij} ≠ 0.

Finally, we show that Equation (27) for the KKT condition holds. For the (i, j)-element of the equation with Ω_{0, ij} = 0, this equation hold by construction for every k = 1, …, K. For the (i, j)-element with $ω_{0, i j}^{(k)} \neq 0$ for every k = 1, …, K, the equation holds for every k = 1, …, K, because it is the equation for the KKT condition of the corresponding element in a restricted problem (21). For (i, j)-element with Ω_{0, ij} ≠ 0 and $ω_{0, i j}^{(k')} = 0$ for some k′, note that Θ̌_{ρ_n,ij} ≠ 0 with probability approaching 1 and that the rearrangement in Θ_ij and corresponding exchange of rows and columns of L for each i, j does not change the original and restricted optimization problems (3) and (21). Thus, with the appropriate rearrangement of elements and exchange of rows and columns, $Ũ_{2, i j}^{(k)}$ with $ω_{0, i j}^{(k)} \neq 0$ is in fact $Ǔ_{2, i j}^{(k)}$ . Thus for such k the equation holds because of the corresponding KKT condition in the restricted problem (21). For other k, the equation holds by construction. We thus conclude the oracle estimator satisfies the KKT condition of the original problem (3). This completes the proof.

Proof of Corollary 1

In the proof of Theorem 2, the ℓ_∞-bound of the error yields

{‖ {\hat{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)} ‖}_{\infty} = O_{P} (κ_{Γ} ρ_{n}) .

Note that if one of two matrices A and B is diagonal, ‖AB‖_∞ ≤ ‖A‖_∞‖B‖_∞. Thus, we can proceed in the same way as in the proof of Theorem 2 of Rothman et al. [26] to conclude that

{‖ {\hat{Ω}}_{n}^{(k)} - Ω_{0}^{(k)} ‖}_{\infty} = O_{P} (κ_{Γ} ρ_{n}) .

The result follows from a similar argument to the proof of Corollary 3 in Ravikumar et al. [25].

Proof of Corollary 2

It follows from Condition 8 and Lemma 1 applied to Θ̌_{ρ_n} that ${‖ {\overset{ˇ}{Θ}}_{ρ_{n}}^{(k)} - Θ_{0}^{(k)}) ‖}_{2} \leq 1 / (2 λ_{Θ})$ . Then we can apply Lemma 10 instead of Lemma 9. The rest is similar to the proof of Theorem 2.

Hierarchical Clustering

For simplicity, we prove Theorem 3 for the case of K = 2; the proof can be easily generalized to K > 2. Let X and Y be the random variable from the first and subpopulation, respectively. Suppose that X = (X₁, …, X_p)^T ~ N(μ_X, Σ_X) with μ_X = (μ_1,X, …, μ_p,X) and the spectral decomposition $Σ_{X} = Q_{X} Λ_{X} Q_{X}^{T}$ of Σ_X where λ_1,X, …, λ_p,X are the eigenvalues of Σ_X and that Y ~ N(μY, Σ_Y) with μ_Y = (μ_1,Y, …, μ_p,Y) and the spectral decomposition $Σ_{Y} = Q_{Y} Λ_{Y} Q_{Y}^{T}$ of Σ_Y where λ_1,Y, …, λ_p,Y are the eigenvalues of Σ_Y. Define Z = (X − Y) = (Z₁, …, Z_p)^T ~ N(μ_Z, Σ_Z) with μ_Z = (μ_1,Z, …, μ_p,Z) and the spectral decomposition $Σ_{Z} = Q_{Z} Λ_{Z} Q_{Z}^{T}$ of Σ_Z where λ_1,Z, …, λ_{p, Z} are the eigenvalues of Σ_Z. Let $\tilde{X} = {({\tilde{X}}_{1}, \dots, \tilde{X})}^{T} = Λ_{X}^{1 / 2} Q_{X}^{T} Σ_{X}^{- 1 / 2} X, Ỹ = {(Ỹ_{1}, \dots, Ỹ)}^{T} = Λ_{Y}^{1 / 2} Q_{Y}^{T} Σ_{Y}^{- 1 / 2} Y$ and $\tilde{Z} = {({\tilde{Z}}_{1}, \dots, \tilde{Z})}^{T} = Λ_{Z}^{1 / 2} Q_{Z}^{T} Σ_{Z}^{- 1 / 2} Z$ . Then X̃ ~ N(μ̃_X, Λ_X), Ỹ ~ N(μ̃_Y, Λ_Y) and Z̃ ~ N(μ̃_Z, Λ_Z), where

{\tilde{μ}}_{X} = {({\tilde{μ}}_{1, X}, \dots, {\tilde{μ}}_{p, X})}^{T} \equiv Λ_{X}^{1 / 2} Q_{X}^{T} Σ_{X}^{- 1 / 2} μ_{X},

{\tilde{μ}}_{Y} = {({\tilde{μ}}_{1, Y}, \dots, {\tilde{μ}}_{p, Y})}^{T} \equiv Λ_{Y}^{1 / 2} Q_{Y}^{T} Σ_{Y}^{- 1 / 2} μ_{Y},

{\tilde{μ}}_{Z} = {({\tilde{μ}}_{1, Z}, \dots, {\tilde{μ}}_{p, Z})}^{T} \equiv Λ_{Z}^{1 / 2} Q_{Z}^{T} Σ_{Z}^{- 1 / 2} μ_{Z} .

Let also

μ_{\tilde{X}}^{2} = ‖ {\tilde{μ}}_{X}^{2} ‖ / p, μ_{Ỹ}^{2} = ‖ {\tilde{μ}}_{Y}^{2} ‖ / p, μ_{\tilde{Z}}^{2} = ‖ {\tilde{μ}}_{Z}^{2} ‖ / p,

{\bar{λ}}_{X} = \sum_{k = 1}^{p} λ_{k, X} / p, {\bar{λ}}_{Y} = \sum_{k = 1}^{p} λ_{k, Y} / p, {\bar{λ}}_{Z} = \sum_{k = 1}^{p} λ_{k, Z} / p .

Lemma 12 (Lemma 1 of Borysov et al. [1])

Let W₁, …, W_p be independent non-negative random variables with finite second moments. Let $S = \sum_{j = 1}^{p} (W_{j} - 𝔼 W_{j})$ and $υ = \sum_{j = 1}^{p} 𝔼 W_{j}^{2}$ . Then for any t > 0 P(S ≤ −t) ≤ exp(−t²/(2υ)).

The following lemma is an extension of Lemma 2 in Borysov et al. [1].

Lemma 13

Let $0 < a < μ_{\tilde{X}}^{2} + {\bar{λ}}_{X}$ . Then

P ({‖ X ‖}^{2} < a p) \leq exp (- \frac{p^{2} {(μ_{\tilde{X}}^{2} + {\bar{λ}}_{X} - a)}^{2}}{2 \sum_{j = 1}^{p} ({\tilde{μ}}_{j, X}^{4} + 6 {\tilde{μ}}_{k, X}^{2} λ_{j, X} + 3 λ_{j, X}^{2})}) .

Proof

Note that elements of X̃ are independent and that X̃_j ~ N(μ̃_j,X, λ_j,X). Thus, we have

𝔼 {\tilde{X}}_{j}^{2} = {\tilde{μ}}_{j, X}^{2} + λ_{j, X}, Var ({\tilde{X}}_{j}^{2}) = 2 (λ_{j, X}^{2} + 2 {\tilde{μ}}_{j, X}^{2} λ_{j, X}),

𝔼 {\tilde{X}}_{j}^{4} = {\tilde{μ}}_{j, X}^{4} + 6 {\tilde{μ}}_{j, X}^{2} λ_{j, X} + 3 λ_{j, X}^{2} .

Applying Lemma 12 with $W_{i} = {\tilde{X}}_{i}^{2}$ , since P (‖X‖² < ap) = P (‖X̃‖² < ap), we get

P ({‖ X ‖}^{2} < a p) = P [\sum_{j = 1}^{p} ({\tilde{X}}_{j}^{2} - {\tilde{μ}}_{j, X}^{2} - λ_{j, X}) < - p (μ_{\tilde{X}}^{2} + {\bar{λ}}_{X} - a)] \leq exp (- \frac{p^{2} {(μ_{\tilde{X}}^{2} + {\bar{λ}}_{X} - a)}^{2}}{2 \sum_{j = 1}^{p} ({\tilde{μ}}_{j, X}^{4} + 6 {\tilde{μ}}_{j, X}^{2} λ_{j, X} + 3 λ_{j, X}^{2})}) .

The following is an extension of Lemma 3 in Borysov et al. [1].

Lemma 14

Let $a > {\bar{λ}}_{X} + μ_{\tilde{X}}^{2}$ . Then

P ({‖ X ‖}^{2} > a p) \leq exp (- \frac{1}{2} (p + \sum_{j = 1}^{p} \frac{a}{λ_{j, X}} - \sum_{j = 1}^{p} \sqrt{1 + 2 \frac{a}{λ_{j, X}}})) .

Proof

By Markov’s inequality, for $t > \sum_{j = 1}^{p} λ_{j, X} + {\tilde{μ}}_{j, X}^{2}$ , we get

P (\sum_{j = 1}^{p} X_{j}^{2} \geq t) = P (\sum_{j = 1}^{p} {\tilde{X}}_{j}^{2} \geq t) = P [exp (\sum_{j = 1}^{p} γ {\tilde{X}}_{j}^{2} - γ λ_{j, X} - γ {\tilde{μ}}_{j, X}^{2}) \geq exp (γ t - γ \sum_{j = 1}^{p} (λ_{j, x} + {\tilde{μ}}_{j, X}^{2}))] \leq exp (- γ (t - \sum_{j = 1}^{p} ({\tilde{μ}}_{j, X}^{2} + λ_{j, X}))) \prod_{j = 1}^{p} 𝔼 exp ((γ λ_{j, X}) {\tilde{X}}_{j}^{2} / λ_{j, x}) = exp (- γ (t - \sum_{j = 1}^{p} {\tilde{μ}}_{j, X}^{2})) \prod_{j = 1}^{p} exp (- γ λ_{j, X} - \frac{1}{2} log (1 - 2 γ λ_{j, X})) \times exp (\frac{γ {\tilde{μ}}_{j, X}^{2}}{1 - 2 γ λ_{j, X}}) .

Since for all u ∈ (0, 1), −log(1 − u) − u ≤ u²/{2(1 − u)} (see page 28 of Boucheron et al. [2]), the above display is bounded above by

exp (- γ (t - \sum_{i = 1}^{p} {\tilde{μ}}_{i, X}^{2})) \prod_{i = 1}^{p} exp (\frac{γ^{2} λ_{i, X}^{2}}{1 - 2 γ λ_{i, X}}) exp (\frac{γ {\tilde{μ}}_{i, X}^{2}}{1 - 2 γ λ_{i, X}}) .

Using the following result from Boucheron et al. [2]

inf_{γ \in (0, 1 / c)} \frac{υ γ^{2}}{2 (1 - c γ)} - t γ = - \frac{υ}{c^{2}} h (\frac{c t}{υ}) .

wherein $h (u) = 1 + u - \sqrt{1 + 2 u}$ , u > 0, we further obtain the upper bound

exp (γ \sum_{i = 1}^{p} {\tilde{μ}}_{i, X}^{2}) \prod_{i = 1}^{p} exp (- \frac{1}{2} (1 + \frac{t}{λ_{i, X} p} - \sqrt{1 + \frac{2 t}{λ_{i, X} p}})) exp (\frac{γ {\tilde{μ}}_{i, X}^{2}}{1 - 2 γ λ_{i, X}}) .

Taking γ ↓ 0, the upper bound becomes

exp (- \frac{1}{2} (p + \sum_{i = 1}^{p} \frac{t}{λ_{i, X} p} - \sum_{i = 1}^{p} \sqrt{1 + \frac{2 t}{λ_{i, X} p}})) .

Choosing t = ap, we have

P (\sum_{i = 1}^{p} {\tilde{X}}_{i}^{2} \geq a p) \leq exp (- \frac{1}{2} (p + \sum_{i = 1}^{p} \frac{a}{λ_{i, X}} - \sum_{i = 1}^{p} \sqrt{1 + 2 \frac{a}{λ_{i, X}}})) .

Note that f(u) = (1 + 2u)^1/2 ≤ u for u ≥ 0 because f′(0) = 1 and f′ is decreasing for u > 0. Thus, $P (\sum_{i = 1}^{p} {\tilde{X}}_{i}^{2} \geq a p) \to 0$ as p → ∞.

Proof of Theorem 3

For simplicity, we present the proof for the case of K = 2; the proof can be easily generalized to K > 2. Let n₁ and n₂ be the sample sizes for the first and second subpopulations, respectively. Define

E_{1} = {max_{i, j} ‖ X_{i} - X_{j} ‖ < min_{k, l} ‖ X_{k} - Y_{l} ‖}, E_{3} = {max_{i, j} {‖ X_{i} - X_{j} ‖}^{2} < a p},

E_{2} = {max_{i, j} ‖ Y_{i} - Y_{j} ‖ < min_{k, l} ‖ X_{k} - Y_{l} ‖}, E_{4} = {max_{i, j} {‖ Y_{i} - Y_{j} ‖}^{2} < a p},

E_{5} = {max_{k, l} {‖ X_{k} - Y_{l} ‖}^{2} > a p},

for a fixed a > 0 satisfying the assumption. The intersection E₁ ∩ E₂ is contained in the event that the clustering performs in the way that two subpopulations are joined in the last step. The intersection E₃ ∩ E₄ ∩ E₅ is also contained in E₁ ∩ E₂, or in other words, $P ({(E_{1} \cap E_{2})}^{c}) \leq P (E_{3}^{c}) + P (E_{4}^{c}) + P (E_{5}^{c})$ . Thus, it suffices to show that $P (E_{3}^{c}) + P (E_{4}^{c}) + P (E_{5}^{c}) \to 0$ as n, p → ∞.

For $E_{3}^{c}$ and $E_{4}^{c}$ we have by Lemma 14 that

P (E_{3}^{c}) \leq \sum_{i, j}^{n} P ({‖ X_{i} - X_{j} ‖}^{2} > a p) = \frac{n_{1} (n_{1} - 1)}{2} P ({‖ X_{1} - X_{2} ‖}^{2} > a p) \leq \frac{n_{1} (n_{1} - 1)}{2} exp (- \frac{1}{2} (p + \sum_{l = 1}^{p} \frac{a}{2 λ_{l, X}} - \sum_{l = 1}^{p} \sqrt{1 + \frac{a}{λ_{l, X}}})) \leq exp (- \frac{1}{2} (p + \sum_{l = 1}^{p} \frac{a}{2 λ_{l, X}} - \sum_{l = 1}^{p} \sqrt{1 + \frac{a}{λ_{l, X}}}) + 2 log n_{1}) = exp (- \frac{p}{2} (1 + \frac{1}{p} \sum_{l = 1}^{p} \frac{a}{2 λ_{l, X}} - \frac{1}{p} \sum_{l = 1}^{p} \sqrt{1 + \frac{a}{λ_{l, X}}} + 4 \frac{log n_{1}}{p}))

and that

P (E_{4}^{c}) \leq exp (- \frac{p}{2} (1 + \frac{1}{p} \sum_{l = 1}^{p} \frac{a}{2 λ_{l, Y}} - \frac{1}{p} \sum_{l = 1}^{p} \sqrt{1 + \frac{a}{λ_{l, Y}}} + 4 \frac{log n_{2}}{p})) .

for a satisfying a > 2 max{λ̄_X, λ̄_Y}.

Note that log n_k/p → 0, k = 1, 2 as n₁, n₂, p → ∞. Moreover $x - \sqrt{1 + 2 x} \geq 0$ for x > 0. Thus, $P (E_{3}^{c}) \to 0$ and $P (E_{4}^{c}) \to 0$ as n₁, n₂, p → ∞. For $E_{5}^{c}$ , we have by Lemma 13 that

P (E_{5}^{c}) \leq \sum_{i, j} P ({‖ X_{i} - Y_{j} ‖}^{2} < a p) \leq n_{1} n_{2} P ({‖ X_{1} - Y_{1} ‖}^{2} < a p) \leq exp (- \frac{p^{2} {(μ_{\tilde{Z}}^{2} + {\bar{λ}}_{Z} - a)}^{2}}{2 \sum_{l = 1}^{p} ({\tilde{μ}}_{i, Z}^{4} + 6 {\tilde{μ}}_{l, Z}^{2} λ_{l, Z} + 3 λ_{l, Z}^{2})} + log n_{1} n_{2})

for $a < μ_{\tilde{Z}}^{2} + {\bar{λ}}_{Z}$ . Given the assumption c₁₀ ≤ λ_j,X ≤ c₁₁, c₁₀ ≤ λ_j,Y ≤ c₁₁, max{|μ_j,X|, |μ_j,Y|} ≤ c₁₁, j = 1, 2, …. Thus, we get $P (E_{5}^{c}) \to 0$ as n₁, n₂, p → 1.

Since 2λ̄_X − λ_p,X − λ_p,Y ≥ 2λ̄_X − λ̄_Z, and 2λ̄_Y − λ_p,X − λ_p,Y ≥ 2λ̄_Y − λ̄_Z, the assumption that $μ_{\tilde{Z}}^{2} > 2 min {{\bar{λ}}_{X}, {\bar{λ}}_{Y}} - λ_{p, X} - λ_{p, Y}$ implies that there exists a such that a < μ̄_Z̃ + λ̄_Z and a > 2 max{λ̄_X, λ̄_Y}. This completes the proof.

Contributor Information

Takumi Saegusa, Department of Mathematics, University of Maryland, College Park, MD 20742 USA.

Ali Shojaie, Department of Biostatistics, University of Washington, Seattle, WA 98195 USA.

References

1.Borysov Petro, Hannig Jan, Marron JS. Asymptotics of hierarchical clustering for growing dimension. Journal of Multivariate Analysis. 2014;124:465–479. [Google Scholar]
2.Boucheron Stéphane, Lugosi Gábor, Massart Pascal. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press; 2013. [Google Scholar]
3.Boyd Stephen, Parikh Neal, Chu Eric, Peleato Borja, Eckstein Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning. 2011;3(1):1–122. [Google Scholar]
4.Cai Tony, Liu Weidong, Luo Xi. A constrained ℓ1 minimization approach to sparse precision matrix estimation. J. Amer. Statist. Assoc. 2011;106(494):594–607. ISSN 0162-1459. [Google Scholar]
5.Chung Fan RK. Spectral graph theory. Vol. 92. American Mathematical Soc; 1997. [Google Scholar]
6.Danaher Patrick, Wang Pei, Witten Daniela M. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology. 2014;76(2):373–397. doi: 10.1111/rssb.12033. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.d’Aspremont Alexandre, Banerjee Onureena, Ghaoui Laurent El. First-order methods for sparse covariance selection. SIAM J. Matrix Anal. Appl. 2008;30(1):56–66. ISSN 0895-4798. [Google Scholar]
8.Friedman Jerome, Hastie Trevor, Tibshirani Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2007;9(3):432–441. doi: 10.1093/biostatistics/kxm045. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Guo Jian, Levina Elizaveta, Michailidis George, Zhu Ji. Joint estimation of multiple graphical models. Biometrika. 2011;98(1):1–15. doi: 10.1093/biomet/asq060. ISSN 0006-3444. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Huang Jian, Ma Shuangge, Li Hongzhe, Zhang Cun-Hui. The sparse Laplacian shrinkage estimator for high-dimensional regression. Ann. Statist. 2011;39(4):2021–2046. doi: 10.1214/11-aos897. ISSN 0090-5364. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Ideker Trey, Krogan Nevan J. Differential network biology. Molecular systems biology. 2012;8(1) doi: 10.1038/msb.2011.99. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Jönsson Göran, Staaf Johan, Vallon-Christersson Johan, Ringnér Markus, Holm Karolina, Hegardt Cecilia, Gunnarsson Haukur, Fagerholm Rainer, Strand Carina, Agnarsson Bjarni A, et al. Genomic subtypes of breast cancer identified by array-comparative genomic hybridization display distinct molecular and clinical characteristics. Breast Cancer Research. 2010;12(3):1–14. doi: 10.1186/bcr2596. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Kolar Mladen, Song Le, Xing Eric P. Sparsistent learning of varying-coefficient models with structural changes. Advances in Neural Information Processing Systems. 2009:1006–1014. [Google Scholar]
14.Lauritzen Steffen L. Graphical models. Oxford University Press; 1996. [Google Scholar]
15.Li Caiyan, Li Hongzhe. Variable selection and regression analysis for graph-structured covariates with an application to genomics. Ann. Appl. Stat. 2010;4(3):1498–1516. doi: 10.1214/10-AOAS332. ISSN 1932-6157. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Li Fan, Zhang Nancy R. Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American Statistical Association. 2010;105(491):1202–1214. [Google Scholar]
17.Liu F, Lozano AC, Chakraborty S, LI F. A graph laplacian prior for variable selection and grouping. Biometrika. 2011;98(1):1–31. [Google Scholar]
18.Liu Fei, Chakraborty Sounak, Li Fan, Liu Yan, Lozano Aurelie C, et al. Bayesian regularization via graph laplacian. Bayesian Analysis. 2014;9(2):449–474. [Google Scholar]
19.Meinshausen Nicolai, Bühlmann Peter. High-dimensional graphs and variable selection with the lasso. Ann. Statist. 2006;34(3):1436–1462. ISSN 0090-5364. [Google Scholar]
20.Negahban Sahand N, Ravikumar Pradeep, Wainwright Martin J, Yu Bin. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Stat. Sci. 2012;27(4):538–557. [Google Scholar]
21.Negahban Sahand N, Ravikumar Pradeep, Wainwright Martin J, Yu Bin. Supplementary material for “a unified framework for high-dimensional analysis of m-estimators with decomposable regularizers”. Stat. Sci. 2012 [Google Scholar]
22.Perou Charles M, Sørlie Therese, Eisen Michael B, van de Rijn Matt, Jeffrey Stefanie S, Rees Christian A, Pollack Jonathan R, Ross Douglas T, Johnsen Hilde, Akslen Lars A, et al. Molecular portraits of human breast tumours. Nature. 2000;406(6797):747–752. doi: 10.1038/35021093. [DOI] [PubMed] [Google Scholar]
23.Peterson Christine, Stingo Francesco C, Vannucci Marina. Bayesian inference of multiple gaussian graphical models. Journal of the American Statistical Association. 2015;110(509):159–174. doi: 10.1080/01621459.2014.896806. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Rapaport Franck, Zinovyev Andrei, Dutreix Marie, Barillot Emmanuel, Vert Jean-Philippe. Classification of microarray data using gene networks. BMC Bioinformatics. 2007;8 doi: 10.1186/1471-2105-8-35. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Ravikumar Pradeep, Wainwright Martin J, Raskutti Garvesh, Yu Bin. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat. 2011;5:935–980. ISSN 1935-7524. [Google Scholar]
26.Rothman Adam J, Bickel Peter J, Levina Elizaveta, Zhu Ji. Sparse permutation invariant covariance estimation. Electron. J. Stat. 2008;2:494–515. ISSN 1935-7524. [Google Scholar]
27.Sedaghat Nafiseh, Saegusa Takumi, Randolph Timothy, Shojaie Ali. Comparative study of computational methods for reconstructing genetic networks of cancer-related pathways. Cancer Informatics. 2014;13(Suppl 2):55–66. 09. doi: 10.4137/CIN.S13781. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Shojaie Ali, Michailidis George. Penalized principal component regression on graphs for analysis of subnetworks. In: Lafferty John D, Williams Christopher KI, Shawe-Taylor John, Zemel Richard S, Culotta Aron., editors. NIPS. Curran Associates, Inc; 2010. pp. 2155–2163. [Google Scholar]
29.Städler Nicolas, Bühlmann Peter, Van De Geer Sara. ℓ1-penalization for mixture regression models. Test. 2010;19(2):209–256. [Google Scholar]
30.Tibshiranit Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 1996;58(1):267–288. [Google Scholar]
31.Wang Yu-Xiang, Sharpnack James, Smola Alex, Tibshirani Ryan J. Trend filtering on graphs. arXiv preprint arXiv:1410.7690. 2014 [Google Scholar]
32.Weinberger Kilian Q, Sha Fei, Zhu Qihui, Saul Lawrence K. Graph laplacian regularization for large-scale semidefinite programming. Advances in neural information processing systems (NIPS) 2006:1489–1496. [Google Scholar]
33.Yuan Ming. High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 2010;11:2261–2286. ISSN 1532-4435. [Google Scholar]
34.Yuan Ming, Lin Yi. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):19–35. ISSN 0006-3444. [Google Scholar]
35.Zhao Peng, Yu Bin. On model selection consistency of lasso. The Journal of Machine Learning Research. 2006;7:2541–2563. [Google Scholar]
36.Zhao Peng, Rocha Guilherme, Yu Bin. The composite absolute penalties family for grouped and hierarchical variable selection’. Annals of Statistics. 2009;37(6A):3468–3497. [Google Scholar]
37.Zhao Sen, Shojaie Ali. A significance test for graph-constrained estimation. Biometrics (forthcoming) 2015 doi: 10.1111/biom.12418. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R1] 1.Borysov Petro, Hannig Jan, Marron JS. Asymptotics of hierarchical clustering for growing dimension. Journal of Multivariate Analysis. 2014;124:465–479. [Google Scholar]

[R2] 2.Boucheron Stéphane, Lugosi Gábor, Massart Pascal. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press; 2013. [Google Scholar]

[R3] 3.Boyd Stephen, Parikh Neal, Chu Eric, Peleato Borja, Eckstein Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning. 2011;3(1):1–122. [Google Scholar]

[R4] 4.Cai Tony, Liu Weidong, Luo Xi. A constrained ℓ1 minimization approach to sparse precision matrix estimation. J. Amer. Statist. Assoc. 2011;106(494):594–607. ISSN 0162-1459. [Google Scholar]

[R5] 5.Chung Fan RK. Spectral graph theory. Vol. 92. American Mathematical Soc; 1997. [Google Scholar]

[R6] 6.Danaher Patrick, Wang Pei, Witten Daniela M. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology. 2014;76(2):373–397. doi: 10.1111/rssb.12033. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.d’Aspremont Alexandre, Banerjee Onureena, Ghaoui Laurent El. First-order methods for sparse covariance selection. SIAM J. Matrix Anal. Appl. 2008;30(1):56–66. ISSN 0895-4798. [Google Scholar]

[R8] 8.Friedman Jerome, Hastie Trevor, Tibshirani Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2007;9(3):432–441. doi: 10.1093/biostatistics/kxm045. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Guo Jian, Levina Elizaveta, Michailidis George, Zhu Ji. Joint estimation of multiple graphical models. Biometrika. 2011;98(1):1–15. doi: 10.1093/biomet/asq060. ISSN 0006-3444. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.Huang Jian, Ma Shuangge, Li Hongzhe, Zhang Cun-Hui. The sparse Laplacian shrinkage estimator for high-dimensional regression. Ann. Statist. 2011;39(4):2021–2046. doi: 10.1214/11-aos897. ISSN 0090-5364. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Ideker Trey, Krogan Nevan J. Differential network biology. Molecular systems biology. 2012;8(1) doi: 10.1038/msb.2011.99. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Jönsson Göran, Staaf Johan, Vallon-Christersson Johan, Ringnér Markus, Holm Karolina, Hegardt Cecilia, Gunnarsson Haukur, Fagerholm Rainer, Strand Carina, Agnarsson Bjarni A, et al. Genomic subtypes of breast cancer identified by array-comparative genomic hybridization display distinct molecular and clinical characteristics. Breast Cancer Research. 2010;12(3):1–14. doi: 10.1186/bcr2596. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Kolar Mladen, Song Le, Xing Eric P. Sparsistent learning of varying-coefficient models with structural changes. Advances in Neural Information Processing Systems. 2009:1006–1014. [Google Scholar]

[R14] 14.Lauritzen Steffen L. Graphical models. Oxford University Press; 1996. [Google Scholar]

[R15] 15.Li Caiyan, Li Hongzhe. Variable selection and regression analysis for graph-structured covariates with an application to genomics. Ann. Appl. Stat. 2010;4(3):1498–1516. doi: 10.1214/10-AOAS332. ISSN 1932-6157. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Li Fan, Zhang Nancy R. Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American Statistical Association. 2010;105(491):1202–1214. [Google Scholar]

[R17] 17.Liu F, Lozano AC, Chakraborty S, LI F. A graph laplacian prior for variable selection and grouping. Biometrika. 2011;98(1):1–31. [Google Scholar]

[R18] 18.Liu Fei, Chakraborty Sounak, Li Fan, Liu Yan, Lozano Aurelie C, et al. Bayesian regularization via graph laplacian. Bayesian Analysis. 2014;9(2):449–474. [Google Scholar]

[R19] 19.Meinshausen Nicolai, Bühlmann Peter. High-dimensional graphs and variable selection with the lasso. Ann. Statist. 2006;34(3):1436–1462. ISSN 0090-5364. [Google Scholar]

[R20] 20.Negahban Sahand N, Ravikumar Pradeep, Wainwright Martin J, Yu Bin. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Stat. Sci. 2012;27(4):538–557. [Google Scholar]

[R21] 21.Negahban Sahand N, Ravikumar Pradeep, Wainwright Martin J, Yu Bin. Supplementary material for “a unified framework for high-dimensional analysis of m-estimators with decomposable regularizers”. Stat. Sci. 2012 [Google Scholar]

[R22] 22.Perou Charles M, Sørlie Therese, Eisen Michael B, van de Rijn Matt, Jeffrey Stefanie S, Rees Christian A, Pollack Jonathan R, Ross Douglas T, Johnsen Hilde, Akslen Lars A, et al. Molecular portraits of human breast tumours. Nature. 2000;406(6797):747–752. doi: 10.1038/35021093. [DOI] [PubMed] [Google Scholar]

[R23] 23.Peterson Christine, Stingo Francesco C, Vannucci Marina. Bayesian inference of multiple gaussian graphical models. Journal of the American Statistical Association. 2015;110(509):159–174. doi: 10.1080/01621459.2014.896806. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] 24.Rapaport Franck, Zinovyev Andrei, Dutreix Marie, Barillot Emmanuel, Vert Jean-Philippe. Classification of microarray data using gene networks. BMC Bioinformatics. 2007;8 doi: 10.1186/1471-2105-8-35. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Ravikumar Pradeep, Wainwright Martin J, Raskutti Garvesh, Yu Bin. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat. 2011;5:935–980. ISSN 1935-7524. [Google Scholar]

[R26] 26.Rothman Adam J, Bickel Peter J, Levina Elizaveta, Zhu Ji. Sparse permutation invariant covariance estimation. Electron. J. Stat. 2008;2:494–515. ISSN 1935-7524. [Google Scholar]

[R27] 27.Sedaghat Nafiseh, Saegusa Takumi, Randolph Timothy, Shojaie Ali. Comparative study of computational methods for reconstructing genetic networks of cancer-related pathways. Cancer Informatics. 2014;13(Suppl 2):55–66. 09. doi: 10.4137/CIN.S13781. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Shojaie Ali, Michailidis George. Penalized principal component regression on graphs for analysis of subnetworks. In: Lafferty John D, Williams Christopher KI, Shawe-Taylor John, Zemel Richard S, Culotta Aron., editors. NIPS. Curran Associates, Inc; 2010. pp. 2155–2163. [Google Scholar]

[R29] 29.Städler Nicolas, Bühlmann Peter, Van De Geer Sara. ℓ1-penalization for mixture regression models. Test. 2010;19(2):209–256. [Google Scholar]

[R30] 30.Tibshiranit Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 1996;58(1):267–288. [Google Scholar]

[R31] 31.Wang Yu-Xiang, Sharpnack James, Smola Alex, Tibshirani Ryan J. Trend filtering on graphs. arXiv preprint arXiv:1410.7690. 2014 [Google Scholar]

[R32] 32.Weinberger Kilian Q, Sha Fei, Zhu Qihui, Saul Lawrence K. Graph laplacian regularization for large-scale semidefinite programming. Advances in neural information processing systems (NIPS) 2006:1489–1496. [Google Scholar]

[R33] 33.Yuan Ming. High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 2010;11:2261–2286. ISSN 1532-4435. [Google Scholar]

[R34] 34.Yuan Ming, Lin Yi. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):19–35. ISSN 0006-3444. [Google Scholar]

[R35] 35.Zhao Peng, Yu Bin. On model selection consistency of lasso. The Journal of Machine Learning Research. 2006;7:2541–2563. [Google Scholar]

[R36] 36.Zhao Peng, Rocha Guilherme, Yu Bin. The composite absolute penalties family for grouped and hierarchical variable selection’. Annals of Statistics. 2009;37(6A):3468–3497. [Google Scholar]

[R37] 37.Zhao Sen, Shojaie Ali. A significance test for graph-constrained estimation. Biometrics (forthcoming) 2015 doi: 10.1111/biom.12418. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Joint Estimation of Precision Matrices in Heterogeneous Populations

Takumi Saegusa

Ali Shojaie

Abstract

1. Introduction

2. Model and Estimator

2.1. Problem Setup

2.2. The Laplacian Shrinkage Estimator

Fig 1.

2.3. Connections to Other Estimators

Fig 2.

3. Theoretical Properties

Condition 1 (Exponential Tails)

Condition 2 (Polynomial Tails)

Condition 3 (Bounded variance)

Condition 4 (Sample size)

3.1. Consistency in Spectral Norm

Theorem 1

3.2. Model Selection Consistency

Condition 5 (Irrepresentability condition)

Condition 6 (Lower bounds for the inverse correlation matrices)

Condition 7 (Sample size and regularization parameters)

Theorem 2

3.3. Additional Results

Corollary 1

Condition 8

Corollary 2

4. Laplacian Shrinkage based on Hierarchical Clustering

Condition 9

Theorem 3

Theorem 4

5. Algorithms

Proposition 1

6. Numerical Results

6.1. Simulation Experiments

6.1.1. Known subpopulation network G

Fig 3.

6.1.2. Unknown subpopulation network G

Fig 4.

6.2. Genetic Networks of Cancer Subtypes

Fig 5.

7. Discussion

Acknowledgments

Appendix

8. Appendix: Proofs and Technical Detials

8.1. Consistency in Matrix Norms

Lemma 1

Lemma 2

Proof

Lemma 3

Proof

Lemma 4

Proof

Lemma 5

Proof

Proof of Lemma 1

Proof of Theorem 1

8.2. Model Selection Consistency

Lemma 6

Proof

Lemma 7

Proof

Lemma 8

Proof

Lemma 9 (Lemma 5 of Ravikumar et al. [25])

Lemma 10

Proof

Lemma 11

Proof

Proof of Theorem 2

Proof of Corollary 1

Proof of Corollary 2

Hierarchical Clustering

Lemma 12 (Lemma 1 of Borysov et al. [1])

Lemma 13

Proof

Lemma 14

Proof

Proof of Theorem 3