Abstract
Motivated by the analysis of gene expression data measured over different tissues or over time, we consider matrix-valued random variables and the matrix normal distribution, where the two precision matrices have graphical interpretations for genes and tissues, respectively. We present an l1 penalized likelihood method and an efficient coordinate descent-based computational algorithm for model selection and estimation in such matrix normal graphical models (MNGMs). We provide theoretical results on the asymptotic distributions, the rates of convergence of the estimates and the sparsistency, allowing both the number of genes and the number of tissues to diverge as the sample size goes to infinity. Simulation results demonstrate that the MNGMs can lead to better estimates of the precision matrices and better identification of the graph structures than the standard Gaussian graphical models. We illustrate the methods with an analysis of mouse gene expression data measured over ten different tissues.
Keywords: Gaussian graphical model, Gene networks, High dimensional data, l1 penalized likelihood, Matrix normal distribution, Sparsistency
1. Introduction
Gaussian graphical models (GGMs) provide natural tools for modeling the conditional independence relationships among a set of random variables [1, 2]. Many methods for estimating standard GGMs have been developed in recent years, especially in high-dimensional settings. Meinshausen and Bühlmann [3] took a neighborhood selection approach to this problem by fitting an l1 penalized regression or Lasso [4] to each variable using the other variables as predictors. They showed that this neighborhood selection procedure consistently estimates the set of non-zero elements of the precision matrix. Other authors have proposed algorithms for the exact maximization of the l1-penalized log-likelihood. Yuan and Lin [5], Banerjee et al. [6] and Dahl et al. [7] adapted interior point optimization methods to solve this problem. Based on the work of Banerjee et al. [6] and a block-wise coordinate descent algorithm, Friedman et al. [8] developed the graphical Lasso (glasso) for sparse inverse covariance estimation, which is computationally very efficient even when the dimension is greater than the sample size. Yuan [9] developed a linear programming procedure for high dimensional inverse covariance matrix estimation and obtained oracle inequalities for the estimation error in terms of several matrix norms. Theoretical properties of this type of method have also been developed by Yuan and Lin [5], Ravikumar et al. [10], Rothman et al. [11] and Lam and Fan [12]. Cai et al. [13] developed a constrained l1 minimization approach to sparse precision matrix estimation, extending the idea of the Dantzig selector [14] developed for sparse high dimensional regressions.
The standard likelihood framework for building Gaussian graphical models assumes that samples are independent and identically distributed from a multivariate Gaussian distribution. This assumption is often violated in certain applications. For example, in genomics, gene expression data of p genes collected over q different tissues from the same subject are often correlated. For a given sample, let Y be the p × q matrix of the expression data, where the jth column corresponds to the expression data of the p genes measured in the jth tissue, and the ith row corresponds to the expression of the ith gene over the q different tissues. Instead of assuming that the columns or rows are independent, we assume that the matrix-variate random variable Y follows a matrix normal distribution [15, 1, 16], where both row and column precision matrices can be specified. The matrix-variate normal distribution has been studied in the analysis of multivariate linear models under the assumption of independence and homoscedasticity for the structure of the among-row and among-column covariance matrices of the observation matrix [17, 18]. Such a model has also been applied to spatio-temporal data [19, 20]. In genomics, Teng and Huang [21] proposed to use a Kronecker product matrix to model gene-experiment interactions, which leads to the gene expression matrix following a matrix normal distribution. The gene expression matrix measured over multiple tissues is transposable, meaning that potentially both the rows and the columns are correlated. Such a matrix-valued normal distribution was also used by Allen and Tibshirani [22] and Efron [23] for modeling gene expression data in order to account for gene expression dependency across different experiments. Dutilleul [24] developed the maximum likelihood estimation (MLE) algorithm for the matrix normal distribution. Mitchell et al. [25] developed a likelihood ratio test for separability of the covariances. Muralidharan [26] used a matrix normal framework for detecting column dependence when rows are correlated and estimating the strength of the row correlation.
The precision matrices of the matrix normal distribution provide the conditional independence structures of the row and column variables [1], where the non-zero off-diagonal elements of the precision matrices correspond to conditional dependencies among the row or column elements of the matrix normal distribution. The matrix normal models with specified non-zero elements of the precision matrices define the matrix normal graphical models (MNGMs). This is analogous to the relationship between the Gaussian graphical model and the precision matrix of a multivariate normal distribution. Despite the flexibility of the matrix normal distribution and the MNGMs in modeling transposable data, methods for model selection and estimation of such models have not been fully developed, especially in high dimensional settings. Wang and West [27] developed a Bayesian approach for the MNGMs using a Markov chain Monte Carlo sampling scheme that employs an efficient method for simulating hyper-inverse Wishart variates for both decomposable and nondecomposable graphs. Allen and Tibshirani [22, 28] proposed penalized likelihood approaches for such matrix normal models, where both l1-norm and l2-norm penalty functions are used on the precision matrices.
The focus of this paper is to develop a model selection and estimation method for the MNGMs based on an l1 penalized likelihood approach under the assumption that both row and column precision matrices are sparse. Our penalized estimation method is the same as that proposed in [22, 29, 28] when the l1 penalty is used. Allen and Tibshirani [22, 28] only considered the setting where there is one observed matrix-variate normal data matrix and used the estimated covariance matrices for imputing missing data and for de-correlating the noise in the underlying data. We focus on evaluating how well such an l1 penalized estimation method recovers the underlying graphical structures that correspond to the row and column precision matrices when we have n i.i.d. samples from a matrix normal distribution. In addition, we provide asymptotic justification of the estimates and show that the estimates enjoy asymptotic and oracle properties similar to those of the penalized estimates for the standard GGMs [30, 12, 5] even when the dimensions p = pn and q = qn diverge as the number of observations n → ∞. Furthermore, if consistent estimates of the precision matrices are available and used in the adaptive l1 penalty functions, the resulting estimates have the property of sparsistency.
The rest of the paper is organized as follows. We introduce the MNGMs as motivated by the analysis of gene expression data across multiple tissues in Section 2. In Section 3 we present an l1 penalized likelihood estimate of such an MNGM and an iterative coordinate descent procedure for the optimization. We present the asymptotic properties of the estimates in Section 4, both in the classical setting when the dimensions are fixed and in the setting allowing the dimensions to diverge as the sample size goes to infinity. In Section 5 we present simulation results and comparisons with the standard Gaussian graphical model. We present an application of the MNGM in Section 6 to an analysis of mouse gene expression data measured over 10 different tissues. Finally, in Section 7 we give a brief discussion. The proofs of all the theorems are given in the Appendix.
2. Matrix Normal Graphical Model for Multi-tissue Gene Expression Data
We consider the gene expression data measured over different tissues. Let Y be the random p × q matrix of the gene expression levels of p genes over q tissues. Let vec(A) be the vectorization of a matrix A obtained by stacking the columns of the matrix A on top of one another. Instead of assuming that the expression levels are independent over different tissues, following [21], we can model this gene expression matrix as
Y = G + T + IGT + ε,    (1)
where G and T are the expected (constant) effects of the genes and tissues, respectively, IGT are the interaction effects, assumed to be random with vec(IGT) following a multivariate normal distribution with zero mean and covariance matrix V ⊗ U, where the covariance matrices U and V respectively represent the gene and tissue dependencies, and ε represents small random normal noise with zero mean arising from nuisance sources. With negligible nuisance effects, vec(Y) follows a multivariate normal distribution with mean vec(M) = vec(G + T) and covariance matrix V ⊗ U [21].
Treating the data Y as a matrix-valued random variable, we say Y follows a matrix normal distribution, if Y has a density function
p(Y | M, U, V) = k(U, V) exp{−(1/2) tr[U−1(Y − M)V−1(Y − M)T]},    (2)
where k(U, V) = (2π)−pq/2|U|−q/2|V|−p/2 is the normalizing constant, M is the mean matrix, U is the p × p row covariance matrix and V is the q × q column covariance matrix. This definition is equivalent to the definition via the Kronecker product [31, Sections 8.8 and 9.2]. Specifically,
vec(Y) ~ Npq(vec(M), V ⊗ U).    (3)
We denote the corresponding precision matrices as A = U−1 and B = V−1. This model assumes a particular decomposable covariance matrix for vec(Y), one that is called separable in the geostatistics context [32]. The parameters U and V are defined only up to a positive multiplicative constant. We can set b11 to any positive constant to make the parameters identifiable.
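As a numerical sanity check of the equivalence between the matrix form (2) and the Kronecker form (3), the short sketch below compares the two log-densities on a random example; it assumes the standard matrix normal kernel exp{−(1/2) tr[U−1(Y − M)V−1(Y − M)T]} and a column-stacking vec, under which the covariance of vec(Y) is V ⊗ U.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
p, q = 3, 4

def random_spd(d, rng):
    # a well-conditioned symmetric positive definite matrix
    L = rng.normal(size=(d, d))
    return L @ L.T + d * np.eye(d)

U, V = random_spd(p, rng), random_spd(q, rng)   # row and column covariances
M, Y = rng.normal(size=(p, q)), rng.normal(size=(p, q))

# matrix normal log-density, written from the kernel in (2)
R = Y - M
log_mn = (-0.5 * p * q * np.log(2 * np.pi)
          - 0.5 * q * np.linalg.slogdet(U)[1]
          - 0.5 * p * np.linalg.slogdet(V)[1]
          - 0.5 * np.trace(np.linalg.solve(U, R) @ np.linalg.solve(V, R.T)))

# multivariate normal with Kronecker covariance, as in (3)
vec = lambda X: X.flatten(order="F")            # stack the columns
log_vec = multivariate_normal.logpdf(vec(Y), mean=vec(M), cov=np.kron(V, U))

print(np.isclose(log_mn, log_vec))              # True
```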
The following proposition shows that there is a graphical model interpretation for the two precision matrices A and B in the matrix normal model (2).
Proposition 2.1. Assume that Y ~ MNp,q(M; U, V). If we partition the columns of Y as Y = (y1, ···, yq), then it holds for γ, μ ∈ Γ = {1, ···, q} with γ ≠ μ that
yγ ⊥ yμ | YΓ\{γ,μ}   if and only if   bγμ = 0,    (4)
where B = {bαβ}α,β∈Γ = V−1 is the column precision matrix of the distribution; similarly, if we partition the rows of Y as Y = (y1, ···, yp)T, then it holds for δ, η ∈ Ξ = {1, ···, p} with δ ≠ η that
yδ ⊥ yη | YΞ\{δ,η}   if and only if   aδη = 0,    (5)
where A = {aδη}δ,η∈Ξ = U−1 is the row precision matrix of the distribution.
This proposition is based on a proposition in Lauritzen [1]. A detailed proof can be found in the Appendix. Without loss of generality, we assume M = 0 in the rest of this paper since the mean can be easily estimated.
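A quick numerical illustration of Proposition 2.1, under the column-stacking convention of (3): the precision matrix of vec(Y) is then B ⊗ A, whose (γ, μ) block equals bγμA, so the columns yγ and yμ are conditionally independent given the remaining columns exactly when bγμ = 0. The particular matrices and indices below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, gamma, mu = 3, 4, 0, 2

# a column precision matrix B with b[gamma, mu] = 0 (diagonally dominant, hence PD)
B = np.array([[2.0, 0.3, 0.0, 0.2],
              [0.3, 2.0, 0.4, 0.0],
              [0.0, 0.4, 2.0, 0.5],
              [0.2, 0.0, 0.5, 2.0]])
L = rng.normal(size=(p, p))
A = L @ L.T + p * np.eye(p)                     # an arbitrary row precision matrix

U, V = np.linalg.inv(A), np.linalg.inv(B)
Omega = np.linalg.inv(np.kron(V, U))            # precision of vec(Y), equal to kron(B, A)

block = Omega[gamma * p:(gamma + 1) * p, mu * p:(mu + 1) * p]
print(np.allclose(block, B[gamma, mu] * A, atol=1e-8))   # True: the block is b_{gamma mu} A = 0
```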
3. l1-Penalized Maximum Likelihood Estimation of the Precision Matrices
We propose to estimate the precision matrices A = U−1, B = V−1 in model (2) by maximizing a penalized likelihood function. Since for any c > 0, p(Y | A, B) = p(Y | cA, B/c), A and B are not uniquely identified. We set b11 = 1 for the purpose of parameter identification. We propose to estimate A and B by minimizing the following penalized negative log-likelihood function
Φ(A, B) = −(nq/2) log|A| − (np/2) log|B| + (1/2) Σk=1,...,n tr(A Yk B YkT) + n Σi,j pλij(aij) + n Σi,j pρij(bij),    (6)
where pλij(·) is the penalty function for the element aij of A with tuning parameter λij, while pρij(·) is the corresponding penalty function for bij with tuning parameter ρij. We consider both the l1 penalty, with pλij(aij) = λ|aij| and pρij(bij) = ρ|bij|, and the adaptive l1 penalty, with pλij(aij) = λ|ãij|−γ1|aij| and pρij(bij) = ρ|b̃ij|−γ2|bij|, where à = {ãij} and B̃ = {b̃ij} are consistent estimates of A and B and γ1 > 0 and γ2 > 0 are two constants.
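As a small illustration of how the adaptive penalty is parameterized, the sketch below assembles the element-wise tuning parameters λij = λ|ãij|−γ1 and ρij = ρ|b̃ij|−γ2 from initial consistent estimates (for example, the MLEs used in Section 5 when p, q < n). The floor eps is a numerical safeguard of this sketch rather than part of the method, and applying such element-wise weights requires a solver that accepts a penalty matrix rather than a single scalar.

```python
import numpy as np

def adaptive_l1_weights(A_tilde, B_tilde, lam, rho, gamma1=1.0, gamma2=1.0, eps=1e-10):
    """Element-wise adaptive-l1 tuning parameters built from initial estimates."""
    Lam = lam * np.maximum(np.abs(A_tilde), eps) ** (-gamma1)   # lambda_ij = lam * |a~_ij|^(-gamma1)
    Rho = rho * np.maximum(np.abs(B_tilde), eps) ** (-gamma2)   # rho_ij    = rho * |b~_ij|^(-gamma2)
    return Lam, Rho
```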
It is easy to check that the objective function (6) is a bi-convex function in A and B. We propose the following iterative procedure to minimize this function:
1. Initialization: B̂(0) = Iq.
2. In the ith step, given the current estimate B̂(i) of B, update A by
Â(i+1) = arg minA≻0 Φ(A, B̂(i)),    (7)
which is a graphical-lasso-type problem with input covariance SB̂(i) = (nq)−1 Σk Yk B̂(i) YkT.
3. In the (i+1)th step, given the current estimate Â(i+1) of A, update B by
B̂(i+1) = arg minB≻0 Φ(Â(i+1), B),    (8)
which is a graphical-lasso-type problem with input covariance SÂ(i+1) = (np)−1 Σk YkT Â(i+1) Yk.
4. Iterate Steps 2 and 3 until convergence.
5. Scale (Â, B̂) = (Â/c, cB̂) such that b̂11 = 1.
Optimizations (7) and (8) can be solved using the block coordinate descent algorithm in the same way as that developed for estimating the precision matrix in standard Gaussian graphical models [8]. We use the glasso program [8] in this paper for these optimizations when the l1 or the adaptive l1 penalty functions are used. The glasso algorithm guarantees that the estimates Â and B̂ are positive definite.
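The alternating scheme above can be sketched using scikit-learn's graphical_lasso as the solver for the graphical-lasso-type subproblems. This is a minimal sketch, not the glasso program of [8] itself: the function name is illustrative, the pooled covariance inputs follow the standard flip-flop reading of the updates, and scalar penalties λ and ρ are used in place of element-wise (adaptive) ones, which would require a solver that accepts a penalty matrix.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def mngm_flip_flop(Ys, lam, rho, max_iter=20, tol=1e-4):
    """Alternating glasso updates for the precision matrices (A, B); Ys has shape (n, p, q)."""
    n, p, q = Ys.shape
    A_hat, B_hat = np.eye(p), np.eye(q)                 # Step 1: B_hat^(0) = I_q
    for _ in range(max_iter):
        A_old, B_old = A_hat, B_hat
        # Step 2: update A given B via glasso on the pooled covariance
        S_A = sum(Y @ B_hat @ Y.T for Y in Ys) / (n * q)
        A_hat = graphical_lasso(S_A, alpha=lam)[1]
        # Step 3: update B given A via glasso on the pooled covariance
        S_B = sum(Y.T @ A_hat @ Y for Y in Ys) / (n * p)
        B_hat = graphical_lasso(S_B, alpha=rho)[1]
        # Step 4: stop once both estimates have stabilized
        if max(np.abs(A_hat - A_old).max(), np.abs(B_hat - B_old).max()) < tol:
            break
    c = B_hat[0, 0]                                     # Step 5: rescale so that b_11 = 1
    return A_hat * c, B_hat / c
```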
Note that in Step 5 of the algorithm, we rescale the A and B matrices to ensure that b̂11 = 1. However, when the l1 or the adaptive l1 penalty functions are used, the solution to (6) is always unique in the sense that, for given λ and ρ, there is a unique scaling factor c* in the equivalence class {(cA0, B0/c) : c > 0} that minimizes Φ(A, B), where ||·||1 denotes the matrix l1 norm. This can be seen by noting that the likelihood part of Φ is invariant under the rescaling (A, B) → (cA, B/c), so only the penalty terms depend on c, and then applying the arithmetic-geometric mean inequality to these terms. Equality holds, and hence the minimum is attained, only at the value c* that balances the two penalty terms. Hence A = c*A0 and B = B0/c* are uniquely determined.
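To spell out the scaling argument for the plain l1 penalty (writing ||·||1 for the matrix l1 norm mentioned above; the omitted display may differ in its exact form), only the penalty terms vary along the class {(cA0, B0/c) : c > 0}:

```latex
% penalty along the equivalence class (cA_0, B_0/c), c > 0:
c\,\lambda\,\lVert A_0\rVert_{1} + \frac{\rho}{c}\,\lVert B_0\rVert_{1}
  \;\ge\; 2\sqrt{\lambda\rho\,\lVert A_0\rVert_{1}\lVert B_0\rVert_{1}},
% with equality, by the arithmetic-geometric mean inequality, if and only if
c^{*} = \sqrt{\frac{\rho\,\lVert B_0\rVert_{1}}{\lambda\,\lVert A_0\rVert_{1}}}\,.
```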
Finally, the tuning parameters λ and ρ in the l1 penalty functions are chosen using the cross-validated likelihood function.
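One simple implementation of the cross-validated likelihood criterion is sketched below; the grid search, the fold structure and the use of the held-out matrix normal log-likelihood as the score are choices of this sketch (the text does not spell them out), and it reuses the mngm_flip_flop function from the previous sketch.

```python
import numpy as np
from itertools import product

def mn_loglik(Ys, A, B):
    """Average matrix normal log-likelihood with mean 0 and precision matrices A, B."""
    n, p, q = Ys.shape
    quad = np.mean([np.trace(A @ Y @ B @ Y.T) for Y in Ys])
    return (-0.5 * p * q * np.log(2 * np.pi)
            + 0.5 * q * np.linalg.slogdet(A)[1]
            + 0.5 * p * np.linalg.slogdet(B)[1]
            - 0.5 * quad)

def cv_select(Ys, lams, rhos, n_folds=5):
    """Choose (lambda, rho) maximizing the cross-validated log-likelihood."""
    n = Ys.shape[0]
    folds = np.array_split(np.arange(n), n_folds)
    best, best_score = None, -np.inf
    for lam, rho in product(lams, rhos):
        score = 0.0
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            A_hat, B_hat = mngm_flip_flop(Ys[train_idx], lam, rho)   # sketch above
            score += mn_loglik(Ys[test_idx], A_hat, B_hat)
        if score > best_score:
            best, best_score = (lam, rho), score
    return best
```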
4. Asymptotic Theorems
Throughout this paper, for a given p × q matrix A = (aij), we denote by ||A|| the operator or spectral norm of A (its largest singular value), by ||A||∞ = maxi,j |aij| the element-wise l∞ norm of A, and by |||A|||∞ = maxi Σj |aij| the matrix l∞ norm of A. Furthermore, we use ||A||F = (Σi,j |aij|2)1/2 as the Frobenius norm of A. Denote by λmin(A) and λmax(A) the smallest and largest eigenvalues of the matrix A.
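For concreteness, these norms can be computed with a small helper (not from the paper); taking the matrix l∞ norm to be the maximum absolute row sum, the usual induced norm, is the convention adopted above.

```python
import numpy as np

def section4_norms(A):
    """The matrix norms used in Section 4, for a real matrix A."""
    return {
        "operator":        np.linalg.norm(A, 2),           # ||A||: largest singular value
        "elementwise_inf": np.max(np.abs(A)),               # ||A||_inf = max_ij |a_ij|
        "matrix_inf":      np.max(np.abs(A).sum(axis=1)),   # |||A|||_inf: max absolute row sum
        "frobenius":       np.linalg.norm(A, "fro"),        # ||A||_F
    }
```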
4.1. Asymptotic theorems when p and q are fixed
We first consider the asymptotic distributions of the penalized maximum likelihood estimates in the setting where p and q are fixed as n → ∞. The following theorem provides the asymptotic distribution of the estimate (Â, B̂).
Theorem 1. For n independent identically distributed observations Y1, ··· Yn from a matrix normal distribution MN (0; A−1, B−1), the optimizer (Â, B̂) of the penalized negative log-likelihood function (6) with the l1 penalty functions has the following property:
If n1/2 λ → λ0 ≥ 0, n1/2 ρ → ρ0 ≥ 0, as n → ∞, then
in distribution, where
in which W is a random variable such that W ~ N (0, σ2), where
This result parallels that of Yuan and Lin [5] for the l1 penalized likelihood estimate of the precision matrix in the standard Gaussian graphical model.
Suppose that we have cn-consistent estimators of A and B, denoted by à = (ãij)1≤i,j≤p and B̃ = (b̃ij)1≤i,j≤q, that is, cn(à − A) = Op(1) and cn(B̃ − B) = Op(1). We consider the penalized likelihood estimates using the adaptive l1 penalty functions
in the objective function (6), where γ1 and γ2 are two constants. The following theorem shows that the resulting estimates of the precision matrices have an oracle property that parallels that of Fan et al. [30] for the standard Gaussian graphical model.
Theorem 2. For n independent identically distributed observations Y1, ···, Yn from a matrix normal distribution MN(0; A−1, B−1), the optimizer (Â, B̂) of the objective function (6) with adaptive l1 penalty functions has the oracle property in the sense of Fan and Li [33]. That is, when n1/2λ = Op(1) and n1/2ρ = Op(1) as n → ∞, for some γ1 > 0 and γ2 > 0:
(1) asymptotically, the estimates Â and B̂ have the same sparsity patterns as the true precision matrices A and B;
(2) the non-zero entries of Â and B̂ are cn-consistent and asymptotically normal.
4.2. Asymptotic theorems when p = pn and q = qn diverge
The next two theorems provide the convergence rates and sparsistency properties of the estimates allowing p = pn, q = qn to diverge as n → ∞. We use A0 and B0 to denote the true precision matrices and SA = {(i, j) : a0,ij ≠ 0} and SB = {(k, l) : b0,kl ≠ 0} to denote the supports of the true matrices, respectively. Let sn1 = card(SA) − pn and sn2 = card(SB) − qn be the numbers of nonzero off-diagonal elements of A0 and B0, respectively. We assume the following regularity conditions:
- There exist constants ε1 and ε2 such that
- There exist constants ε3 and ε4 such that
- The tuning parameter λn satisfies
- The tuning parameter ρn satisfies
Conditions (A) and (B) uniformly bound the eigenvalues of A0 and B0, which facilitates the proof of consistency. These conditions are also assumed for the penalized likelihood estimation of the standard Gaussian graphical models [34, 12]. The upper bounds on λn and ρn in conditions (C) and (D) are related to the control of the bias caused by the l1 penalty terms in the objective function [33, 35, 12].
Denote Sn = n−1 Σk=1,...,n vec(Yk)vec(Yk)T. It is easy to check that Σ0 = V0 ⊗ U0 = ESn. We use the double indices (i, j) and (k, l) to refer to a row or a column of Sn or Σ0. The following lemma provides a tail probability bound for the entries of (Sn − Σ0).
Lemma 4.1. Suppose the matrix observations Yk's are i.i.d. from a matrix normal distribution, Yk ~ MN(0; U, V), and , . Then we have the tail bound:
(9) |
for some constants C1, C2 and δ that depend on ε1, ε3 only.
In this lemma, if we choose for some M such that |t| ≤ δ, then
The next theorem provides the rates of convergence of the penalized likelihood estimates  and B̂ in terms of the Frobenius norms.
Theorem 3 (Rate of convergence). Under the regularity conditions (A)-(D), suppose that qn(pn + sn1)(log pn + log qn)k/n = O(1) for some k > 1 and pn(qn + sn2)(log pn + log qn)l/n = O(1) for some l > 1. Then, when the l1 penalty functions are used, there exists a local minimizer (Â, B̂) of (6) such that ||Â − A0||F2 = Op(qn(pn + sn1)(log pn + log qn)/n) and ||B̂ − B0||F2 = Op(pn(qn + sn2)(log pn + log qn)/n).
Theorem 3 states explicitly how the number of nonzero elements and the dimensionality of both precision matrices affect the rates of convergence of the estimates. Since there are (qn + sn2)(pn + sn1) nonzero elements in the Kronecker product and each of them can be estimated at best with rate n−1/2, the total square errors are at least of rate qn(pn + sn1)/n for estimating A and pn(qn + sn2)/n for estimating B. The price that we pay for high dimensionality is a logarithmic factor (log pn + log qn). The estimates Â and B̂ converge to their true values in Frobenius norm as long as qn(pn + sn1)/n and pn(qn + sn2)/n are of rate O((log pn + log qn)−l) for some l > 1, which decays to zero slowly. This means that in practice pnqn can be comparable to n without violating the results. Compared to the rates of convergence of the l1 penalized likelihood estimates of the precision matrix in the standard GGM [12], the convergence rates for Â and B̂ are inflated by factors of qn and pn, respectively. If qn (or pn) is fixed as n → ∞, then the rate for Â (or B̂) is exactly the same as that given in [12] for the standard Gaussian graphical models.
When an adaptive l1 penalty function is used, we have the following sparsistency result for the penalized estimates. Here sparsistency refers to the property that all parameters in A0 and B0 that are zero are estimated as exactly zero with probability tending to one. We use Sc to denote the complement of a set S.
Theorem 4 (Sparsistency). Under the conditions given in Theorem 3, suppose the penalty functions in (6) are the adaptive l1 penalties, pλij(aij) = |aij|/|ãij|γ1 and pρkl(bkl) = |bkl|/|b̃kl|γ2 for some γ1 > 0 and γ2 > 0, where à = (ãij) and B̃ = (b̃kl) are any en- and fn-consistent estimators, i.e., en||à − A0||∞ = Op(1) and fn||B̃ − B0||∞ = Op(1). For any local minimizer (Â, B̂) of (6) satisfying
and ||Â − A0||2 = Op(cn), ||B̂ − B0||2 = Op(dn) for some sequences cn → 0 and dn → 0. If
(10) |
and
(11) |
then with probability tending to 1, âij = 0 for all (i, j) ∈ SAc and b̂kl = 0 for all (k, l) ∈ SBc.
The sparsistency result requires a lower bound on the rates of the regularization parameters λn and ρn. On the other hand, the regularity conditions (C) and (D) impose an upper bound on λn and ρn in order to control the estimation biases. These requirements on the tuning parameters are similar to those for the GGMs. However, in the case of the matrix normal estimation, the conditions for λn depend not only on the dimension pn of A, the rate of the consistent estimator à and the rate of error of Â in the l2 norm, but also on the dimension qn and the sparsity sn2 of the matrix B0. Similarly, the conditions for ρn depend not only on the rate of B̃ and the rate of error of B̂ in the l2 norm, but also on the dimension and sparsity of A0. In addition, condition (10) in the theorem, combined with the regularity condition (C), implies that
and
These are the requirements for both the rate of the consistent estimator à in its element-wise l∞ norm and rate of the operator norm of Â. Similarly, condition (11) and regularity Condition (D) imply that
and
5. Monte Carlo Simulations
5.1. Comparison candidates and measurements
In this section we present results from Monte Carlo simulations to examine the performance of the penalized likelihood method and to compare it to several naive methods for estimating the two precision matrices. The first method uses only data from one row or one column in order to ensure that the observations are independent. Specifically, to estimate the row precision matrix A, we choose the ith column of every observation matrix Yk (k = 1, ···, n), denoted by yk·i. Since yk·i ~ N(M·i, viiU), we can estimate the precision matrix A up to a multiplicative constant by fitting a standard GGM. Without loss of generality, we choose the first column yk·1 in our simulations. We call this procedure the Gaussian graphical model using the column data (GGM-C). Similarly, we can estimate the precision matrix B by choosing the first row yk1· from every Yk (k = 1, ···, n). We call this procedure the Gaussian graphical model using the row data (GGM-R). The second approach simply ignores the dependency of the data across the columns or rows: it estimates A by treating the q columns as independent observations and estimates B by treating the p rows as independent observations in a standard Gaussian graphical model. We call this procedure the Gaussian graphical model assuming independence of the row or column variables (GGM-I). For all three procedures (GGM-C, GGM-R and GGM-I), we use the glasso algorithm to estimate the two precision matrices. When p, q < n, we also consider the adaptive version of the glasso, where the maximum likelihood estimates are used as the initial consistent estimates of the precision matrices.
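To make the three baselines concrete, one way to assemble their covariance inputs before calling a glasso solver is sketched below; the uncentered second-moment matrices (the simulated data have mean zero), the use of scikit-learn's graphical_lasso and the function name are choices of this sketch, and GGM-C and GGM-R recover A and B only up to the unknown multipliers v11 and u11.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def baseline_estimates(Ys, lam):
    """Precision estimates from the naive GGM baselines; Ys has shape (n, p, q), mean zero."""
    n, p, q = Ys.shape
    inputs = {
        "GGM-C":   Ys[:, :, 0],                              # first column of each Y_k: n p-vectors
        "GGM-R":   Ys[:, 0, :],                              # first row of each Y_k: n q-vectors
        "GGM-I-A": Ys.transpose(0, 2, 1).reshape(n * q, p),  # all columns treated as nq i.i.d. p-vectors
        "GGM-I-B": Ys.reshape(n * p, q),                     # all rows treated as np i.i.d. q-vectors
    }
    est = {}
    for name, X in inputs.items():
        S = X.T @ X / X.shape[0]                             # second-moment matrix (mean assumed zero)
        est[name] = graphical_lasso(S, alpha=lam)[1]         # keep the estimated precision matrix
    return est
```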
We compare the performance of different estimators of the precision matrices A and B by calculating different matrix norms of the estimation errors. Let ΔA = A –  and ΔB = B – B̂ be the estimation errors of the estimators  and B̂, respectively. We compare ||ΔA||∞, |||ΔA|||∞, ||ΔA|| and ||ΔA||F for Â, and ||ΔB||∞, |||ΔB|||∞, ||ΔB|| and ||ΔB||F for B̂.
In order to evaluate how well different procedures recover the graphical structures defined by the precision matrices, we define a non-zero entry in a sparse precision matrix as a “positive” and define the specificity (SPE), sensitivity (SEN) and Matthews correlation coefficient (MCC) scores as follows:

SPE = TN/(TN + FP),    SEN = TP/(TP + FN),

MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
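These scores can be computed from the off-diagonal support of an estimated precision matrix as in the sketch below; restricting attention to off-diagonal entries (the diagonal is always non-zero) and the zero threshold tol are choices made here.

```python
import numpy as np

def graph_recovery_scores(Theta_hat, Theta_true, tol=1e-8):
    """Specificity, sensitivity and MCC for recovering the off-diagonal support."""
    off = ~np.eye(Theta_true.shape[0], dtype=bool)
    pred = np.abs(Theta_hat[off]) > tol          # estimated non-zero entries ("positives")
    true = np.abs(Theta_true[off]) > tol
    tp = np.sum(pred & true)
    tn = np.sum(~pred & ~true)
    fp = np.sum(pred & ~true)
    fn = np.sum(~pred & true)
    spe = tn / (tn + fp)
    sen = tp / (tp + fn)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else np.nan   # undefined when a margin is zero
    return spe, sen, mcc
```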
5.2. Models and data generation
We generate sparse precision matrices A and B using a similar scheme as in [36] and [30]. To be specific, our generating procedure can be described as:
where i ≠ j and δij is a Bernoulli random variable with success probability p+. Then the off-diagonal elements aij (j = 1, ···, p and j ≠ i) of each row i are divided by 1.5 times the l1 norm of the off-diagonal elements of that row. A is then symmetrized and U = A−1 is obtained. Note that the diagonal elements of U generated in this way are heterogeneous. We further modify A by WA, where W is a diagonal matrix. Since the A generated as above is diagonally dominant, W = diag(w1, ···, wp) is generated as follows: first we choose an upper bound wmax for the wi's; here we use wmax = 1.2. Then for each j, we generate a uniformly distributed random variable r in the interval (Σi,i≠j|aij|/|ajj|, 1) and let wj = rwmax. This guarantees the diagonal dominance of the matrix WA and hence its positive definiteness. We further define U = (WA)−1. The matrices B and V are generated in a similar way.
After we generate the parameters (A, B) and (U, V), we generate the matrix normal data by first generating pq-dimensional normal vectors zk from N(0, V ⊗ U) and then rearranging them into p × q matrices Yk such that vec(Yk) = zk for k = 1, ···, n.
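A direct way to generate such data, mirroring the description above, is to draw vec(Yk) from N(0, V ⊗ U) and reshape. Forming the full pq × pq covariance is wasteful for large p and q (drawing Yk = U1/2 Zk V1/2 with i.i.d. standard normal Zk scales better), and the column-stacking vec with the kron(V, U) ordering is the same convention assumed in the earlier sketches.

```python
import numpy as np

def sample_matrix_normal(n, U, V, rng):
    """Draw n observations Y_k with vec(Y_k) ~ N(0, V kron U); returns an (n, p, q) array."""
    p, q = U.shape[0], V.shape[0]
    Z = rng.multivariate_normal(np.zeros(p * q), np.kron(V, U), size=n)
    # rearrange each pq-vector z_k into a p x q matrix Y_k with vec(Y_k) = z_k
    return np.stack([z.reshape(p, q, order="F") for z in Z])
```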
In the following, let pA+ (or pB+) be the probability that an off-diagonal element of the matrix A (or B) is non-zero, which measures the degree of sparsity of the matrix. We consider five models of different dimensions and different degrees of sparsity; Models 1-4 have n = 100 with p = q = 30, 80, 150 and 500, respectively, and Model 5 has n = 20 with p = q = 600 (see Tables 1-4).
We use a 5-fold cross validation to tune the regularization parameters for Models 1-4 and 3-fold cross validation for Model 5 due to its small sample size. The simulations are repeated 50 times.
5.3. Simulation results
We present in Tables 1 and 2 the results of the different procedures in terms of estimating the precision matrices and recovering the corresponding graphical structures when the l1 penalty functions are used. For all four models considered, we observe that the MNGM results in smaller estimation errors and better performance in identifying the graphical structures defined by the precision matrices than the naive applications of the Gaussian graphical models. This is true both for the settings where p, q < n (Models 1 and 2) and where p, q > n (Models 3 and 4). We observe that when only one row or one column is chosen from each observation and the standard GGM is used (GGM-R or GGM-C), the estimation errors are much higher than for the MNGM or the GGM when the rows or columns are treated as independent. Similarly, both sensitivities and specificities are lower if only data from one row or one column are used. This can be explained by the relatively small sample size available to these procedures. On the other hand, if the dependency of the columns or rows is ignored and the data of the columns or rows are treated as independent, direct application of the Gaussian graphical model (GGM-I) results in smaller specificities and more false positives. As a benchmark comparison, for Models 1 and 2, we also present in Table 1 the errors of the MLEs of A and B. It is clear that the MNGM gives better estimates than the MLEs. The MLEs for Models 3 and 4 do not exist.
Table 1.
Precision Matrix | Measure | MNGM | GGM-I | GGM-C GGM-R | MLE |
---|---|---|---|---|---|
Model 1, n = 100, p = 30, q = 30 |
|||||
A | ||ΔA|| | 0.17(0.026) | 0.27(0.015) | 0.62(0.057) | 0.25(0.037) |
|||ΔA|||∞ | 0.35(0.042) | 0.53(0.064) | 1.23(0.133) | 0.64(0.059) | |
||ΔA||∞ | 0.08(0.019) | 0.15(0.013) | 0.34(0.066) | 0.09(0.021) | |
||ΔA||F | 0.43(0.051) | 0.73(0.025) | 1.73(0.130) | 0.60(0.044) | |
SPEA | 0.68(0.025) | 0.32(0.156) | 0.82(0.147) | ||
SENA | 1.00(0.000) | 1.00(0.000) | 0.36(0.298) | ||
MCCA | 0.54(0.022) | 0.28(0.108) | 0.23(0.068) | ||
B | ||ΔB|| | 0.15(0.026) | 0.25(0.013) | 0.61(0.052) | 0.22(0.027) |
|||ΔB|||∞ | 0.32(0.044) | 0.48(0.040) | 1.28(0.100) | 0.60(0.051) | |
||ΔB||∞ | 0.08(0.018) | 0.14(0.013) | 0.31(0.056) | 0.07(0.016) | |
||ΔB||F | 0.38(0.049) | 0.68(0.026) | 1.64(0.099) | 0.57(0.033) | |
SPEB | 0.68(0.031) | 0.40(0.060) | 0.70(0.067) | ||
SENB | 0.99(0.009) | 1.00(0.006) | 0.63(0.142) | ||
MCCB | 0.47(0.027) | 0.29(0.037) | 0.25(0.055) | ||
Model 2, n = 100, p = 80, q = 80 |
|||||
A | ||ΔA|| | 0.17(0.022) | 0.31(0.020) | 0.75(0.089) | 0.25(0.021) |
|||ΔA|||∞ | 0.37(0.030) | 0.58(0.030) | 1.26(0.089) | 0.93(0.045) | |
||ΔA||∞ | 0.07(0.015) | 0.19(0.015) | 0.47(0.079) | 0.06(0.014) | |
||ΔA||F | 0.67(0.082) | 1.56(0.122) | 3.30(0.515) | 0.96(0.038) | |
SPEA | 0.89(0.080) | 0.69(0.009) | 1.00(0.000) | ||
SENA | 1.00(0.000) | 1.00(0.000) | 0.00(0.000) | ||
MCCA | 0.68(0.100) | 0.43(0.008) | - | ||
B | ||ΔB|| | 0.14(0.013) | 0.13(0.012) | 0.72(0.147) | 0.22(0.020) |
|||ΔB|||∞ | 0.45(0.047) | 0.42(0.049) | 1.45(0.117) | 0.88(0.040) | |
||ΔB||∞ | 0.06(0.008) | 0.06(0.010) | 0.53(0.180) | 0.05(0.011) | |
||ΔB||F | 0.56(0.023) | 0.54(0.025) | 2.97(0.499) | 0.91(0.023) | |
SPEB | 0.86(0.102) | 0.69(0.010) | 1.00(0.000) | ||
SENB | 1.00(0.000) | 1.00(0.000) | 0.00(0.000) | ||
MCCB | 0.64(0.124) | 0.42(0.008) | 0.06(0.000) |
MNGM: the matrix normal graphical model with l1 penalties; GGM-I: Gaussian graphical model treating rows or columns as independent; GGM-R/GGM-C: Gaussian graphical model that uses only data from the first column or the first row; MLE: maximum likelihood estimates. For each measurement, mean and standard deviation are calculated over 50 replications.
Table 2.
Precision Matrix | Measure | MNGM | GGM-I | GGM-C GGM-R |
---|---|---|---|---|
Model 3, n = 100, p = 150, q = 150 |
||||
A | ||ΔA|| | 0.12(0.014) | 0.31(0.013) | 0.78(0.094) |
|||ΔA|||∞ | 0.32(0.028) | 0.59(0.027) | 1.45(0.092) | |
||ΔA||∞ | 0.05(0.011) | 0.20(0.010) | 0.49(0.074) | |
||ΔA||F | 0.61(0.069) | 2.26(0.120) | 4.72(0.802) | |
SPEA | 0.84(0.005) | 0.80(0.004) | 1.00(0.000) | |
SENA | 1.00(0.000) | 1.00(0.000) | 0.00(0.000) | |
MCCA | 0.45(0.006) | 0.40(0.004) | 0.05(0.022) | |
B | ||ΔB|| | 0.10(0.009) | 0.10(0.009) | 0.77(0.186) |
|||ΔB||| | 0.29(0.022) | 0.32(0.025) | 1.38(0.208) | |
||ΔB||∞ | 0.04(0.007) | 0.04(0.007) | 0.58(0.236) | |
||ΔB||F | 0.53(0.025) | 0.56(0.024) | 4.25(0.856) | |
SPEB | 0.83(0.005) | 0.80(0.003) | 1.00(0.000) | |
SENB | 1.00(0.000) | 1.00(0.000) | 0.01(0.000) | |
MCCB | 0.43(0.007) | 0.40(0.004) | 0.07(0.021) | |
Model 4, n = 100, p = 500, q = 500 |
||||
A | ||ΔA|| | 0.10(0.008) | 0.22(0.008) | 3.69(0.521) |
|||ΔA|||∞ | 0.27(0.018) | 0.45(0.019) | 4.23(0.502) | |
||ΔA||∞ | 0.04(0.007) | 0.14(0.006) | 3.63(0.581) | |
||ΔA||F | 0.95(0.078) | 2.94(0.131) | 43.68(6.153) | |
SPEA | 0.99(0.001) | 0.95(0.001) | 1.00(0.002) | |
SENA | 1.00(0.00) | 1.00(0.00) | 0.01(0.038) | |
MCCA | 0.76(0.008) | 0.52(0.003) | 0.13(0.030) | |
B | ||ΔB|| | 0.08(0.006) | 0.08(0.006) | 1.17(0.026) |
|||ΔB|||∞ | 0.26(0.019) | 0.26(0.019) | 6.88(0.809) | |
||ΔB||∞ | 0.03(0.003) | 0.03(0.004) | 0.34(0.088) | |
||ΔB||F | 0.79(0.028) | 0.76(0.031) | 13.07(0.773) | |
SPEB | 0.98(0.001) | 0.97(0.001) | 0.64(0.055) | |
SENB | 1.00(0.000) | 1.00(0.000) | 0.65(0.095) | |
MCCB | 0.75(0.007) | 0.62(0.003) | 0.06(0.015) |
MNGM: the matrix normal graphical model with l1 penalties; GGM-I: Gaussian graphical model treating rows or columns as independent; GGM-R/GGM-C: Gaussian graphical model that uses only data from the first column or the first row. For each measurement, mean and standard deviation are calculated over 50 replications.
When p, q < n, as in Models 1 and 2, we have also implemented the penalized likelihood estimation with the adaptive l1 penalty functions and performed simulation comparisons with the standard l1 penalty functions, where the maximum likelihood estimates of A and B are obtained and used as weights in the adaptive l1 penalty functions. We present the results in Table 3. Compared to the results in Table 1, we observe that using the adaptive penalties in the MNGM and the GGM-I can lead to better estimates of the precision matrices and better recovery of the graphical structures defined by these precision matrices. However, if we only select one row or column and estimate the precision matrices using the GGM (GGM-R/GGM-C), the estimates based on the adaptive l1 penalty functions are in general not as good as those based on the l1 penalty functions. This is due to the fact that when only one row or one column is used, the sample size is small and the MLEs of the precision matrices may not provide sensible estimates of the weights in the adaptive penalty functions, which can lead to poor performance of the resulting estimates.
Table 3.
Precision Matrix | Measure | MNGM | GGM-I | GGM-C GGM-R |
---|---|---|---|---|
Model 1, n = 100, p = 30, q = 30 |
||||
A | ||ΔA|| | 0.15(0.024) | 0.26(0.014) | 0.64(0.033) |
|||ΔA|||∞ | 0.30(0.037) | 0.51(0.037) | 1.10(0.079) | |
||ΔA||∞ | 0.08(0.021) | 0.14(0.010) | 0.35(0.061) | |
||ΔA||F | 0.37(0.046) | 0.69(0.025) | 1.74(0.059) | |
SPEA | 0.81(0.018) | 0.41(0.028) | 0.95(0.016) | |
SENA | 1.00(0.000) | 1.00(0.000) | 0.24(0.069) | |
MCCA | 0.67(0.022) | 0.34(0.018) | 0.27(0.059) | |
B | ||ΔB|| | 0.14(0.024) | 0.24(0.012) | 0.66(0.040) |
|||ΔB|||∞ | 0.28(0.047) | 0.48(0.037) | 1.14(0.091) | |
||ΔB||∞ | 0.08(0.019) | 0.13(0.012) | 0.35(0.043) | |
||ΔB||F | 0.35(0.052) | 0.65(0.025) | 1.70(0.078) | |
SPEB | 0.81(0.023) | 0.42(0.023) | 0.94(0.012) | |
SENB | 0.99(0.011) | 1.00(0.006) | 0.32(0.055) | |
MCCB | 0.60(0.029) | 0.30(0.015) | 0.32(0.056) | |
Model 2, n = 100, p = 80, q = 80 |
||||
A | ||ΔA|| | 0.12(0.019) | 0.28(0.020) | 2.36(0.756) |
|||ΔA|||∞ | 0.28(0.034) | 0.49(0.034) | 2.94(0.792) | |
||ΔA||∞ | 0.06(0.013) | 0.18(0.015) | 2.28(0.773) | |
||ΔA||F | 0.48(0.064) | 1.44(0.123) | 12.19(4.586) | |
SPEA | 0.92(0.005) | 0.85(0.005) | 1.00(0.000) | |
SENA | 1.00(0.000) | 1.00(0.000) | 0.00(0.000) | |
MCCA | 0.73(0.012) | 0.59(0.008) | - | |
B | ||ΔB|| | 0.12(0.011) | 0.12(0.012) | 4.78(1.206) |
|||ΔB|||∞ | 0.33(0.045) | 0.34(0.049) | 12.91(3.251) | |
||ΔB||∞ | 0.06(0.009) | 0.06(0.011) | 0.74(0.320) | |
||ΔB||F | 0.44(0.027) | 0.47(0.030) | 10.48(2.145) | |
SPEB | 0.92(0.005) | 0.85(0.008) | 0.12(0.010) | |
SENB | 1.00(0.000) | 1.00(0.000) | 0.90(0.019) | |
MCCB | 0.73(0.013) | 0.59(0.012) | 0.02(0.018) |
MNGM: the matrix normal graphical model with adaptive l1 penalties; GGM-I: Gaussian graphical model treating rows or columns as independent; GGM-R/GGM-C: Gaussian graphical model that uses only data from the first column or the first row. For each measurement, mean and standard deviation are calculated over 50 replications.
As expected, since the precision matrices A and B are generated similarly and both are of the same dimensions, the estimates of these two precision matrices based on the MNGM are very comparable for all four models considered. Some differences in performances for estimating A and B in Model 4 are observed when the GGM-I or GGM-R/GGM-C is used. This is largely due to the large variability in selecting the tuning parameters when the dependence of the data is ignored as in GGM-I or when only partial data are used as in GGM-R/GGM-C.
Finally, Model 5 with n = 20, p = q = 600 mimics the scenario where n ≪ min(p, q). The performance of the MNGM as shown in Table 4 is still quite comparable to that in the previous four models. However, the estimates from the GGM-I or GGM-R/GGM-C are significantly worse, resulting in much lower sensitivities and larger estimation errors.
Table 4.
Precision Matrix | Measure | MNGM | GGM-I | GGM-C GGM-R |
---|---|---|---|---|
Model 5, n = 20, p = 600, q = 600 |
||||
A | ||ΔA|| | 0.14(0.013) | 0.85(0.012) | 6.85(1.532) |
|||ΔA||| | 0.67(0.041) | 1.52(0.015) | 10.9(3.799) | |
||ΔA||∞ | 0.05(0.009) | 0.43(0.01) | 6.78(1.563) | |
||ΔA||F | 1.58(0.091) | 10.17(0.188) | 56.92(14.796) | |
SPEA | 0.84(0.005) | 1(0) | 0.98(0.025) | |
SENA | 1(0) | 0.03(0.001) | 0.04(0.042) | |
MCCA | 0.22(0.004) | 0.18(0.004) | 0.02(0.003) | |
B | ||ΔB|| | 0.15(0.01) | 0.54(0.015) | 1.12(0.069) |
|||ΔB||| | 0.75(0.037) | 1.32(0.024) | 4.06(0.344) | |
||ΔB||∞ | 0.05(0.009) | 0.23(0.005) | 0.74(0.191) | |
||ΔB||F | 1.68(0.06) | 6.84(0.023) | 11.96(0.807) | |
SPEB | 0.81(0.006) | 1(0) | 0.93(0.001) | |
SENB | 1(0) | 0.03(0.001) | 0.11(0.008) | |
MCCB | 0.21(0.004) | 0.16(0.004) | 0.01(0.003) |
MNGM: the matrix normal graphical model with l1 penalties; GGM-I: Gaussian graphical model treating rows or columns as independent; GGM-R/GGM-C: Gaussian graphical model that uses only data from the first column or the first row. For each measurement, mean and standard deviation are calculated over 50 replications.
6. Real Data Analysis
We applied the MNGM to an analysis of the mouse gene expression data measured over different tissues from the Atlas of Gene Expression in Mouse Aging (AGEMAP) database [37]. In this study, the authors profiled the effects of aging on gene expression in different mouse tissues dissected from C57BL/6 mice. Mice were of ages 1, 6, 16, and 24 months, with ten mice per age cohort and five mice of each sex. Sixteen tissues (cerebellum, cerebrum, striatum, hippocampus, spinal cord, adrenal glands, heart, lung, liver, kidney, muscle, spleen, thymus, bone marrow, eye, and gonads) were dissected from each mouse. For each tissue, mRNA was isolated and hybridized to two filter membranes containing a total of 16,896 cDNA clones corresponding to 8,932 genes. We leave out the data from six tissues (cerebellum, bone marrow, heart, gonads, striatum and liver) because some mice did not have data on these tissues. Due to the small sample size n = 40, we consider a set of 40 genes that belong to the mouse vascular endothelial growth factor (VEGF) signaling pathway and have measured expression levels over all 10 tissues.
Figure 1 shows the scatter plots of the pairwise correlations of the expression levels of these 40 genes in different tissues, indicating that many gene pairs have similar correlations across different tissues and that the gene expression levels are clearly not independent across multiple tissues. The plots suggest that the assumption of a Kronecker covariance structure for the gene-tissue matrix normal data is helpful in studying the covariance structure of the genes across different tissues.
Our goal is to study the dependency structure of these 40 genes of the VEGF pathway using the expression data across all 10 tissues. When the standard GGM is applied to the data from each tissue separately, gene networks are identified for five out of the 10 tissues: adrenal, kidney, lung, thymus and eye. However, no gene links are identified for the other five tissues. The corresponding gene network graphs are shown in Figure 2 for each of the five tissues. The networks identified based on the tissue-specific data include only a few VEGF genes, indicating a lack of power in recovering biologically meaningful links based on data from a single tissue. The differences among the networks identified from different tissues might also be due to the fact that the genes of the VEGF pathway are not perturbed enough in some tissues to allow inference of the conditional independence structures among the genes. On the other hand, if all the data are pooled together and the dependency of gene expression across tissues is ignored, the GGM results in a very dense network with 373 links, which is biologically difficult to interpret given that biological networks are expected to be sparse.
Figure 3 shows the gene and tissue networks estimated by the proposed MNGM, including a gene network of 27 links among 22 VEGF genes and a tissue network with 15 edges among the 10 tissues. Compared to the networks estimated based on data from a single tissue (see Figure 2), we observe that more links are identified among these genes, and many links identified by the MNGM appear in one of the graphs identified based on the tissue-specific data. The difference between the overall network identified by the MNGM and the tissue-specific networks may also be due to the fact that the dependence structures of the VEGF genes differ across tissues. It is interesting to note that many links identified by the MNGM may reflect the underlying VEGF signaling pathway [38]. For example, the binding of VEGF to VEGFR-2 leads to dimerization of the receptor, followed by intracellular activation of PLCgamma (Plcg). It is interesting that several forms of the PLCgamma gene, such as Plcg1 and Plcg2, and their downstream genes Nfat5 and Pla2g6 are part of the network. Several genes on the PKC-Raf kinase-MEK-mitogen-activated protein kinase (MAPK) pathway, such as Mapk13, Mapk14 and Mapkapk2, are also interconnected.
The tissue network shown in Figure 3(b) should be interpreted as the conditional dependency structure among the tissues with respect to the expression patterns observed for the genes on the VEGF pathway. It is interesting to observe links among lung, spleen and kidney in the vascular tissue group, and links between eye and cerebral cortex and between thymus and hippocampus in the neural tissue group. It is also interesting to observe that the adrenal tissue in the steroid-responsive group is linked to both the vascular and the neural tissue groups. A similar clustering of tissue groups based on their gene expression data was also observed in [37].
7. Discussion
Motivated by the analysis of gene expression data measured over different tissues on the same set of samples, we have proposed to apply the matrix normal distribution to model the data jointly and have developed a penalized likelihood method to estimate the row and column precision matrices, assuming that both matrices are sparse. Our simulation results have clearly demonstrated that such models can result in better estimates of the precision matrices and better identification of the corresponding graphical structures than naive applications of the Gaussian graphical models. Our analysis of the mouse gene expression data demonstrated that, by effectively combining the expression data from multiple tissues of the same subjects, the matrix normal graphical model can lead to conditional independence graphs with meaningful biological interpretations. We also demonstrated that ignoring the dependency of gene expression across different tissues can lead to more false positive links and dense graphs, which are difficult to interpret biologically.
The matrix normal distribution provides a natural way of modeling the dependency of data measured over different conditions. If the underlying precision matrices are sparse, the proposed penalized likelihood estimation can lead to identification of the non-zero elements of these precision matrices. We observe that the proposed l1 regularized estimation can lead to better estimates of these sparse precision matrices than the MLEs. The estimated precision matrices can in turn be applied to co-expression analysis [21], differential expression analysis [22] and the estimation of missing gene expression data. Other applications of the proposed methods include face recognition [39].
The methods proposed in this paper and the related theorems can also be extended to the array normal distribution by extending the matrix-variate normal to the tensor setting using the Tucker product [40]. Such array normal distributions were recently studied by Hoff [41]. Allen [29] proposed an l1 penalized estimation procedure for such an array normal distribution by regularizing the separable tensor precision matrices. Similar techniques can be applied to derive the estimation error bounds and to prove sparsistency when the adaptive l1 penalties are used. As multi-dimensional data with possible correlations among the variables of each dimension become more prevalent, further development of estimation methods and the relevant theory is important.
Acknowledgement
This research was supported by NIH grants ES009911 and CA127334. We thank the reviewers for many helpful comments and for pointing out several omitted references.
Appendix
Proof of Proposition 2.1
Before we state the proof of Proposition 2.1, we need the following lemma [1]:
Lemma Appendix .1. Using the same notation as in the main text, if we partition the columns of Y as Y = (Y1, Y2), where Y1 is a p × r and Y2 a p × s random matrix with r + s = q, then the conditional distribution of Y1 given Y2 = y2 is , where M = (M1, M2) and .
Proof of Lemma Appendix .1. See Proposition C.8 in [1, Appendix C].
Proof of Proposition 2.1. From Lemma Appendix .1, we know Y{γ,μ} given YΓ{γ,μ} is distributed as matrix normal . From Proposition C.5 of [1, Appendix C], we have
So
From Proposition C.6 of [1, Appendix C] we know if and only if bγμ = 0. Similar argument can be applied to the rows.
Proof of Theorem 1
Proof. Let M = MT be p × p, N = NT be q × q symmetric random matrices. Denote
Using the same argument as in [5], we have
Let , then
Denote Zk = vecYk, then , . Next we compute E(Tk) and var(Tk). First, , so E(Tk) = n−1/2[qtr(MU) + ptr(NV) + n−1/2tr(NV)tr(MU)]. Next, and , where . If Z ~ N(0, Σ), then
(.1) |
Using (.1), we obtain
and
Let , then by the central limit theorem, Wn = n(T̄ – ETk) → N(0, σ2), where σ2 = 2[qtr(MUMU) + ptr(NVNV) + 2tr(MU)tr(NV)]. Finally,
so nfn(M, N) → f(M, N).
Proof of Theorem 2
Proof. We prove this theorem by verifying the regularity conditions (A), (B) and (C) of [33] [42, also]. We use (A)ij to denote the (i, j)th element of the matrix, aij. The log-likelihood function is
So
On the other hand, is Wishart-distributed, so
and E(∂l/∂aij) = 0. Similarly, one can verify E(∂l/∂bij) = 0, so the first part of condition (A) is verified. For the second part, we need to check
From the property of the Wishart distribution,
On the other hand, dA−1 = −A−1dAA−1, so
So E(∂l/∂aij∂l/∂akl) = E(−∂2l/(∂aij∂akl)) holds. We can similarly verify that E(∂l/∂bij∂l/∂bkl) = E(−∂2l/(∂bij∂bkl)). Denote the orthogonal bases ei = (0, ···, 1, ··· 0), which is a vector of all zero except the ith element. We have
Using the same notation as in the proof of Theorem 1, let Zk = vecYk, then . We then have
and , and . Denote , , so . Using (.1), we have
So E(∂l/∂bkl∂l/∂aij) = n/2uijvkl, and E{−∂2l/(∂bkl∂aij)} = E(∂l/∂bkl∂l/∂aij). The condition (A) is verified.
Next, we verify condition (B). We have
One can verify that , where is the (i, j)th entry of Yk's. So . Denote I(A, B) as the Fisher information matrix, then
(.2) |
To see I(A, B) is non-negative definite, one only needs to check is so. This is equivalent to check that for any vector t ≠ 0 in ,
Denote q × q matrix D such that vecD = t, then
(.3) |
Since V is non-negative definite, V1/2 is well defined, denote V1/2DV1/2 = A, then AT = V1/2DTV1/2 and (.3) = tr(ATA) – 1/q[tr(A)]2. But in general, one has the inequality [tr(ATB)]2 ≤ tr(ATA)tr(BTB), so , thus we proved the condition (B).
Since the third derivative of the log-likelihood function doesn't involve any random variable, condition (C) is easy to satisfy. Theorem 2 thus holds.
Proof of Lemma 4.1
Proof. We use the notation Y(k) to refer to Yk for convenience. We then have
Since vec(Y(s)) is normally distributed, Lemma A.3 of [34] leads to the fact that there exist some constants δ, C1 and C2 depending on ε1 and ε3 only such that
Hence we have
which proves the lemma.
Proof of Theorem 3
Proof. Let W1 be a symmetric matrix of dimension pn and W2 be a symmetric matrix of dimension qn. Let DW1, DW2 be their diagonal matrices, and RW1 = W1 – DW1, RW2 = W2 – DW2 be their off-diagonal matrices, respectively. Set ΔW1 = αnRW1 + βnDW1 and ΔW2 = δnRW2 + βnDW2. We show that, for , and , for a set defined as
for sufficiently large constants C1, C2, C3 and C4. Denote , , then
(.4) |
So ϕ(A1, B1) − ϕ(A0, B0) = I1 + I2 + I3 + I4 + I5, where
Denote ΔA = ΔW1, ΔB = ΔW2 and recall the definitions of . Using Taylor's expansion with integral residues, we have
(.5) |
where Av = A0 + vΔA, Bv = B0 + vΔB. One can easily check that
and similarly, (vecA0)TΣ0(vecΔB) = pntr(V0ΔB). Then I1 can be further simplified as I1 = K1 + K2 + K3 + K4 + K5 + K6, where
Note that
(.6) |
(.7) |
Since
hence
then
so
(.8) |
and similarly
(.9) |
Generally, for two squared p × p matrix M and q × q matrix N, we have and then
Let , , we have
Combining with (.8) and (.9) we know that |K5| is dominated by K1 + K2 with a large probability.
Next we bound |K3| and |K4|. We have
where if we use double index to indicate a row or column in Sn, Σ0 or a position in vecΔA, vecB0, vecA0 and vecΔB, we have
and
From Lemma 4.1 we know that
Then
(.10) |
This together with (.6) shows that L1 is dominated by K1 by choosing sufficiently large C1 and C2. Symmetrically,
By choosing sufficiently large C3 and C4, this together with (.7) shows L3 can be dominated by K2. Also
(.11) |
from the condition of λn in theorem, and using the similar technique in [12], it can be shown that L2 is dominated by I2. Similarly, L4 is dominated by I4. Thus we proved |K3| + |K4| can be dominated by K1 + K2 + I2 + I4. It is easy to show that |K6| is of smaller order of K3 and K4, hence is also dominated by K1 + K2 + I2 + I4. We next show that
(.12) |
where the middle term in (.12) is from regularity condition (C), thus I3 is dominated by K1 if we choose sufficiently large constants C1 and C2. Similarly, we get and is dominated by K2 when C3 and C4 are large. Hence the proof.
Proof of Theorem 4
Proof. For (Â, B̂), a minimizer of (6), where Â = (âij) and B̂ = (b̂kl), the derivative of ϕ(A, B) with respect to aij, evaluated at (Â, B̂), is
where Û = (ûij) = Â−1, and A0 and B0 are the true parameters. If we can show that the sign of ∂ϕ(Â, B̂)/∂aij depends on sgn(aij) only with probability tending to 1, the optimum is then at 0, so that âij = 0 for (i, j) ∈ SAc with probability tending to 1. Let
(.13) |
we then have
Using the same argument as in [12],
and then .
Since Y(s) ~ MN(0; U0, V0), we have Y(s)T ~ MN(0; V0, U0) and . Let , where is a qn × 1 vector, for i = 1, ···, pn. We have
(.14) |
(.15) |
and
(.16) |
(.17) |
I1 can be simplified as . We have the following proposition:
Proposition Appendix .1. Under the notations above, we have
Proof of Proposition Appendix .1. To save notation, we write q for qn in the rest of this proof. Denote and . From (.17) we have
(.18) |
Note that (.18) does not depend on the sample index s, and the sum in I1 is equivalent to a sum over nqn normal observations. By Lemma A.3 of [34], we have
Hence the Proposition Appendix .1 is proved.
From Proposition Appendix .1, we know . Next we bound |I3|. From (.14) we know that , and since
we have
Then we have |I3| ≤ L1 + L2, where
Since and , then
On the other hand, by (.16) and Lemma A.3 of [34],
Denoting vector of 1's, then
Therefore,
Then . Since the theorem requires the condition qn(pn + sn1)(log pn + log qn)k/n = O(1) for some k > 1, we know that . So
Concluding from above, we have
On the other hand, for some constant c, from the condition in the theorem we have
So the sign of ∂ϕ(Â, B̂)/∂aij is dominated by sgn(aij), and thus we have proved the sparsistency for Â. A similar proof applies to B̂.
References
1. Lauritzen SL. Graphical Models. Clarendon Press; Oxford: 1996.
2. Whittaker J. Graphical Models in Applied Multivariate Analysis. Wiley; 1990.
3. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Annals of Statistics. 2006;34:1436–1462.
4. Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B. 1996;58:267–288.
5. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.
6. Banerjee O, Ghaoui LE, d'Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Machine Learning Research. 2008;9:485–516.
7. Dahl J, Vandenberghe L, Roychowdhury V. Covariance selection for non-chordal graphs via chordal embedding. Optimization Methods and Software. 2008;23:501–420.
8. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi:10.1093/biostatistics/kxm045.
9. Yuan M. Sparse inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research. 2010;11:2261–2286.
10. Ravikumar P, Wainwright M, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics. 2011;5:935–980.
11. Rothman A, Levina E, Zhu J. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association. 2009;104:177–186.
12. Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrices estimation. The Annals of Statistics. 2009;37:4254–4278. doi:10.1214/09-AOS720.
13. Cai T, Liu W, Luo X. A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607.
14. Candes E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics. 2007;35:2313–2351.
15. Dawid A. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika. 1981;68:265–274.
16. Gupta A, Nagar D. Matrix Variate Distributions. Volume 104 of Monographs and Surveys in Pure and Applied Mathematics. Chapman & Hall/CRC Press; Boca Raton, FL: 1999.
17. Finn JD. A General Model for Multivariate Analysis. Holt, Rinehart and Winston; New York: 1974.
18. Timm NH. Multivariate analysis of variance of repeated measurements. Handbook of Statistics. 1980;1:41–87.
19. Mardia KV, Goodall C. Spatial-temporal analysis of multivariate environmental monitoring data. Multivariate Environmental Statistics. 1993;6:347–385.
20. Huizenga HM, De Munck JC, Waldorp LJ, Grasman R. Spatiotemporal EEG/MEG source analysis based on a parametric noise covariance model. IEEE Trans. Biomed. Eng. 2002;49:533–539. doi:10.1109/TBME.2002.1001967.
21. Teng S, Huang H. A statistical framework to infer functional gene associations from multiple biologically interrelated microarray experiments. Journal of the American Statistical Association. 2009;104:465–473.
22. Allen G, Tibshirani R. Transposable regularized covariance models with an application to missing data imputation. Annals of Applied Statistics. 2010;4(2):764–790. doi:10.1214/09-AOAS314.
23. Efron B. Are a set of microarrays independent of each other? The Annals of Applied Statistics. 2009;13(3):922–942. doi:10.1214/09-AOAS236.
24. Dutilleul P. The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation. 1999;64:105–123.
25. Mitchell M, Genton M, Gumpertz M. A likelihood ratio test for separability of covariances. Journal of Multivariate Analysis. 2006;97(5):1025–1043.
26. Muralidharan O. Detecting column dependence when rows are correlated and estimating the strength of the row correlation. Electronic Journal of Statistics. 2010;4:1527–1546.
27. Wang H, West M. Bayesian analysis of matrix normal graphical models. Biometrika. 2009;96:821–834. doi:10.1093/biomet/asp049.
28. Allen G, Tibshirani R. Inference with transposable data: modeling the effects of row and column correlations. Journal of the Royal Statistical Society, Series B (Theory & Methods). 2011; in press. doi:10.1111/j.1467-9868.2011.01027.x.
29. Allen G. Comment on article by Hoff. Bayesian Analysis. 2011;6(2):197–202.
30. Fan J, Feng Y, Wu Y. Network exploration via the adaptive lasso and SCAD penalties. The Annals of Applied Statistics. 2009;3:521–541. doi:10.1214/08-AOAS215SUPP.
31. Graybill FA. Matrices with Applications in Statistics. Second Edition. Wadsworth; Belmont: 1983.
32. Cressie N. Statistics for Spatial Data. Wiley; New York: 1993.
33. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
34. Bickel P, Levina E. Regularized estimation of large covariance matrices. Annals of Statistics. 2008;36(1):199–227.
35. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
36. Li H, Gui J. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics. 2006;7:302–317. doi:10.1093/biostatistics/kxj008.
37. Zahn JM, Poosala S, Owen A, Ingram DK, Lustig A, et al. AGEMAP: a gene expression database for aging in mice. PLoS Genetics. 2007;3(11):2326–2337. doi:10.1371/journal.pgen.0030201.
38. Holmes K, Roberts O, Thomas A, Cross M. Vascular endothelial growth factor receptor-2: structure, function, intracellular signalling and therapeutic inhibition. Cellular Signalling. 2007;19:2003–2012. doi:10.1016/j.cellsig.2007.05.013.
39. Zhang Y, Schneider J. Learning multiple tasks with a sparse matrix-normal penalty. In: Lafferty J, Williams CKI, Shawe-Taylor J, Zemel R, Culotta A, editors. Advances in Neural Information Processing Systems, Vol. 23. 2010. pp. 2550–2558.
40. Tucker LR. The extension of factor analysis to three-dimensional matrices. In: Gulliksen H, Frederiksen N, editors. Contributions to Mathematical Psychology. Holt, Rinehart and Winston; New York: 1964.
41. Hoff P. Separable covariance arrays via the Tucker product, with applications to multivariate relational data. Bayesian Analysis. 2011;6(2):179–196.
42. Lehmann E. Theory of Point Estimation. Wadsworth and Brooks/Cole; Pacific Grove, CA: 1983.