Author manuscript; available in PMC: 2019 Nov 1.
Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2018 Jun 14;80(5):927–950. doi: 10.1111/rssb.12278

Multiple Matrix Gaussian Graphs Estimation

Yunzhang Zhu, Lexin Li
PMCID: PMC6261498  NIHMSID: NIHMS969429  PMID: 30505211

Abstract

Matrix-valued data, where the sampling unit is a matrix consisting of rows and columns of measurements, are emerging in numerous scientific and business applications. Matrix Gaussian graphical model is a useful tool to characterize the conditional dependence structure of rows and columns. In this article, we employ nonconvex penalization to tackle the estimation of multiple graphs from matrix-valued data under a matrix normal distribution. We propose a highly efficient nonconvex optimization algorithm that can scale up for graphs with hundreds of nodes. We establish the asymptotic properties of the estimator, which requires less stringent conditions and has a sharper probability error bound than existing results. We demonstrate the efficacy of our proposed method through both simulations and real functional magnetic resonance imaging analyses.

Keywords: Conditional independence, Gaussian graphical model, Matrix normal distribution, Nonconvex penalization, Resting-state functional magnetic resonance imaging, Sparsistency

1 Introduction

The Gaussian graphical model has been widely used to describe the conditional dependence relationships, encoded in a partial correlation matrix, among a set of interacting variables. A large number of statistical methods have been proposed to estimate a sparse Gaussian graphical model (Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Friedman et al., 2008; Ravikumar et al., 2011; Cai et al., 2011, among others). There are also extensions from estimation of a single graph to multiple graphs across groups (Guo et al., 2011; Danaher et al., 2014; Zhu et al., 2014; Lee and Liu, 2015; Cai et al., 2016). All of those methods assume that the vector of interacting variables follows a normal distribution. In recent years, matrix-valued data, where each sampling unit is a matrix consisting of rows and columns of measurements, are rapidly emerging. Accordingly, the matrix normal distribution is becoming increasingly popular in modeling such matrix-valued observations (Zhou, 2014). Under this distribution, there have been some recent developments in sparse graphical model estimation that aim to characterize the dependence of rows and columns of matrix data (Yin and Li, 2012; Leng and Tang, 2012; Tsiligkaridis et al., 2013). In this article, we aim at estimation of multiple graphs for matrix data under a matrix normal distribution.

Our motivation is brain connectivity analysis based on resting-state functional magnetic resonance imaging (fMRI); meanwhile, our proposal is equally applicable to many other network data analyses. Brain functional connectivity reveals the synchronization of brain systems through correlations in neurophysiological measures of brain activity. When measured during the resting state, it maps the intrinsic functional architecture of the brain (Varoquaux and Craddock, 2013). Brain connectivity analysis is now at the foreground of neuroscience research. Accumulated evidence suggests that the connectivity network alters with the presence of numerous neurological disorders, and that such alterations hold useful insights into disease pathologies (Fox and Greicius, 2010). In a typical functional connectivity study, fMRI data are collected for multiple subjects from a disease group and a normal control group. For each individual subject, the observed fMRI data take the form of a region by time matrix, where the number of brain regions is usually on the order of 10² and the number of time points around 150 to 200. From this matrix, a region by region correlation matrix is estimated to describe the brain connectivity graph, one for each diagnostic group separately. In this graph, nodes represent brain regions, and links measure dependence between the brain regions, where partial correlation is a commonly used correlation measure (Fornito et al., 2013). Brain connectivity analysis is then turned into the problem of estimating partial correlation matrices under Gaussian graphical models across multiple groups.

Our proposal integrates matrix normal distribution, multiple partial correlation matrices estimation, and nonconvex penalization. Such an integration distinguishes our proposal from the existing solutions. For the matrix-valued data, directly applying the existing graphical model estimation methods that assume a vector normal distribution, in effect, requires the columns of the matrix data to be independent, which is obviously not true. For instance, for fMRI, the columns correspond to time series of repeatedly measured brain activities and are highly correlated. Whitening may help reduce the between-column correlation. In Section 5, we compare and show that our method substantially outperforms two state-of-the-art vector normal based multi-graph estimation methods, Lee and Liu (2015) and Cai et al. (2016), both facilitated by whitening. Among the few solutions on graphical model estimation under a matrix normal distribution (Yin and Li, 2012; Leng and Tang, 2012; Zhou, 2014), none tackles estimation of multiple graphs across different populations; they instead focus on a single graph. Our proposal is also different from the two recent multi-graph estimation methods of Qiu et al. (2016) and Han et al. (2016), in both study goals and estimation approaches. Specifically, Qiu et al. (2016) aimed to estimate a graph at any given location, e.g., age, whereas Han et al. (2016) aimed to capture and summarize the commonality underlying a collection of individual graphs. By contrast, our goal is to simultaneously estimate multiple graphs, one from each of a given group of subjects. Besides, Qiu et al. (2016) proposed a two-step procedure, which first obtained a smoothed estimate of the sample covariance matrix through kernel smoothing, then plugged it into the constrained ℓ1 minimization method of Cai et al. (2011) for precision matrix estimation. Han et al. (2016) first obtained an estimate of all the individual graphs, again using Cai et al. (2011), then plugged it into an objective function that minimizes the Hamming distance between the target median graph and the individual graphs. For our proposal, we employ a likelihood based loss function, plus a combination of a nonconvex sparsity penalty and a nonconvex group sparsity penalty to induce both sparsity and similarity across multiple partial correlation matrices. Our choice of loss function ensures both theoretical properties and positive-definiteness of the estimator. Meanwhile, our choice of penalty function is motivated by the belief that the true graph is approximately sparse, and that the difference of graphs across multiple groups is approximately sparse too. In other words, those graphs may exhibit different connectivity patterns, but are also encouraged to be similar to each other. Moreover, nonconvex penalization in high-dimensional models has often been shown to outperform its convex counterpart both in theory and in practice (Fan and Li, 2001a; Zhang, 2010; Shen et al., 2012). In the context of graphical models under a vector normal distribution, nonconvex penalization has been shown to deliver more precise and concise graph estimates (Fan et al., 2009; Shen et al., 2012).

The novelty of our proposal lies in both its computational and its theoretical contributions. Computationally, recognizing that nonconvex optimization is more challenging than convex optimization, we propose a highly efficient and scalable algorithm through a combination of two modern optimization techniques, the minorize-maximization algorithm (MM, Hunter and Lange, 2004), and the alternating direction method of multipliers (ADMM, Boyd et al., 2011). The proposed algorithm is fast, yielding a computation time comparable to its convex counterpart. It is also much faster than the competing methods of Lee and Liu (2015) and Cai et al. (2016). In addition, our method scales reasonably well and can handle graphs with the number of nodes ranging into the hundreds. It is noteworthy that this range covers the typical size of a brain connectivity network in neuroimaging analysis.

Theoretically, we study the asymptotic properties of the proposed optimization problem and establish sharp theoretical results. We focus on two scenarios: imposing only the sparsity penalty, and imposing only the group sparsity penalty. Such an investigation would shed new insights on the connection and difference of the two types of penalties, and also facilitate a direct comparison with existing theoretical results.

Specifically, the first scenario corresponds to performing sparse graph estimation across multiple groups separately. In the context of single graph estimation under the vector normal distribution, theoretical analysis of correct identification of the sparse structure has been carried out in Ravikumar et al. (2011); Fan et al. (2014); Loh and Wainwright (2014). Compared to Ravikumar et al. (2011), who employed a convex ℓ1 penalty and thus required the irrepresentable condition, our sparsistency result does not require this rather stringent condition, thanks to the use of the nonconvex penalty. Compared to Fan et al. (2014), we obtain a sharper probability error bound and an improved minimum signal strength condition. Moreover, we do not require a consistent initial estimator as Fan et al. (2014) did. Compared to Loh and Wainwright (2014), our result is directly comparable, but we develop a new proof technique that easily generalizes to multiple graphs. In the context of single graph estimation under the matrix normal distribution, Leng and Tang (2012) provided estimation and sparseness pursuit guarantees; however, their results were established for some unknown local minimizer of their optimization function. By contrast, we obtain the theoretical properties for the actual local optimum computed by the optimization algorithm. In addition, Zhou (2014) studied the estimation error, whereas we focus on the sparsity pattern reconstruction of the graphical dependency.

The second scenario corresponds to estimation of multiple graphs jointly. Both Danaher et al. (2014) and Zhu et al. (2014) studied multiple graph estimation with fusion type penalties. However, Danaher et al. (2014) did not provide any theoretical result on graph recovery, whereas Zhu et al. (2014) obtained sparsistency for the global, but not local, solution of their optimization function. Both Lee and Liu (2015) and Cai et al. (2016) provided the theoretical guarantee for multi-graph structure recovery, but none provided the positive-definiteness guarantee for the resulting estimator.

The rest of the article is organized as follows. Section 2 presents the model and the penalized objective function. Section 3 develops the optimization algorithm. Section 4 studies the asymptotic properties. Section 5 presents the simulations, and Section 6 the fMRI data analyses. Section 7 concludes the paper with a discussion. All technical proofs are relegated to an online Supplementary Appendix.

2 Model

2.1 Penalized optimization

Suppose the observed data, X_ki, i = 1, …, n_k, k = 1, …, K, are from K heterogeneous populations, with n_k observations from the kth group. Each observation X_ki is a p × q matrix, with p denoting the spatial dimension and q the temporal dimension. We assume X_ki follows a matrix normal distribution, i.e.,

$$X_{k1}, \ldots, X_{kn_k} \overset{\text{i.i.d.}}{\sim} N\big(M_k,\ \Sigma_k^S \otimes \Sigma_k^T\big), \quad k = 1, \ldots, K,$$

where M_k = E[X_ki], Σ_k^S ∈ ℝ^{p×p} and Σ_k^T ∈ ℝ^{q×q} denote the spatial and temporal covariance matrices, respectively, and ⊗ is the Kronecker product. This matrix normal assumption has been frequently adopted in numerous applications involving matrix-valued observations (Yin and Li, 2012; Leng and Tang, 2012). It is also scientifically plausible in the context of neuroimaging analysis. For instance, standard neuroimaging processing software, such as SPM (Friston et al., 2007) and FSL (Smith et al., 2004), adopts a framework that assumes the data are normally distributed per voxel (location) with a noise factor and an autoregressive structure, which shares a similar spirit with the matrix normal formulation. We further discuss potential relaxation of this assumption in Section 7.
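To make the Kronecker structure concrete, the following sketch simulates from this matrix normal model using the standard identity X = M + A Z Bᵀ, where AAᵀ = Σ^S, BBᵀ = Σ^T, and Z has i.i.d. standard normal entries. The function name and the example covariances (a compound-symmetric spatial matrix and an AR(1) temporal matrix) are illustrative choices, not from the paper.

```python
import numpy as np

def sample_matrix_normal(M, sigma_S, sigma_T, rng):
    """Draw one p x q matrix X ~ N(M, sigma_S (x) sigma_T).

    Uses X = M + A Z B^T, where A A^T = sigma_S, B B^T = sigma_T,
    and Z has i.i.d. standard normal entries.
    """
    p, q = M.shape
    A = np.linalg.cholesky(sigma_S)   # p x p factor of the spatial covariance
    B = np.linalg.cholesky(sigma_T)   # q x q factor of the temporal covariance
    Z = rng.standard_normal((p, q))
    return M + A @ Z @ B.T

rng = np.random.default_rng(0)
p, q = 4, 6
M = np.zeros((p, q))
sigma_S = 0.5 * np.eye(p) + 0.5                    # compound-symmetric spatial covariance
sigma_T = np.array([[0.6 ** abs(i - j) for j in range(q)] for i in range(q)])  # AR(1) temporal
X = sample_matrix_normal(M, sigma_S, sigma_T, rng)
```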

Our primary object of interest is the spatial partial correlation matrix,

$$\Omega_k = \mathrm{Diag}(\Sigma_k^S)^{1/2}\,(\Sigma_k^S)^{-1}\,\mathrm{Diag}(\Sigma_k^S)^{1/2}, \quad k = 1, \ldots, K.$$

Under the normal distribution, a zero partial correlation coefficient implies the conditional independence of two nodes given all others in the graph. By contrast, the mean term Mk and the temporal correlation matrix are to be treated as nuisance parameters. This is mainly driven by our motivating application of brain connectivity analysis, where the primary interest is to estimate the connectivity pattern of spatial regions of the brain. Nevertheless we note that our proposed methodology is applicable to estimation of the temporal partial correlation matrix as well.

Under the matrix normal distribution, a natural solution seeks to minimize over (Ω1, …, ΩK) the negative log likelihood function, aside from a constant,

$$\sum_{k=1}^K n_k\big\{\mathrm{trace}(\Omega_k\hat\Gamma_k) - \log\det(\Omega_k)\big\}, \quad (1)$$

where Γ̂_k is a sample estimator of the correlation matrix Γ_k; for instance,

$$\hat\Gamma_k = \mathrm{DiagScale}\Big\{\sum_{i=1}^{n_k}(X_{ki} - \bar X_k)(X_{ki} - \bar X_k)^T\Big\}, \quad k = 1, \ldots, K,$$

where DiagScale(C) = Diag(C)^{−1/2} C Diag(C)^{−1/2} for any square matrix C. That is, we plug into (1) a set of consistent correlation estimators. The estimator Γ̂_k was studied by Zhou (2014), and its rate of convergence has been established in the high-dimensional regime, which facilitates our subsequent asymptotic investigation. Directly solving (1), however, may encounter some challenges. First, the number of unknown parameters in Ω_1, …, Ω_K may far exceed the sample size, making inversion of Γ̂_k problematic. Second, we are generally interested in finding pairs of nodes that are conditionally independent given the others. However, minimizing (1) would not yield any exact zero estimates in Ω_1, …, Ω_K, rendering the interpretation difficult. Third, it is often desirable to encourage the estimated graphs to be similar across groups, under the belief that the differences in graphical structure usually concentrate on some local areas of the nodes. For instance, in brain connectivity analysis, the brain region connections are usually sparse (Zhang et al., 2015), and the differences of brain connections across different populations usually localize in some subnetworks of the brain (Toussaint et al., 2014).
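As an illustration, the estimator Γ̂_k above can be computed in a few lines; `diag_scale` and `sample_spatial_correlation` are hypothetical names used only for this sketch.

```python
import numpy as np

def diag_scale(C):
    """DiagScale(C) = Diag(C)^{-1/2} C Diag(C)^{-1/2}: rescale to unit diagonal."""
    d = 1.0 / np.sqrt(np.diag(C))
    return C * np.outer(d, d)

def sample_spatial_correlation(X_list):
    """Gamma_hat_k from the n_k observed p x q matrices of group k,
    pooling over columns (time points)."""
    Xbar = np.mean(X_list, axis=0)
    S = sum((X - Xbar) @ (X - Xbar).T for X in X_list)
    return diag_scale(S)
```

Since the pooled matrix S is positive semi-definite, the rescaled entries lie in [−1, 1] and the diagonal is exactly one.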

To address those challenges, we propose to estimate the K partial correlation matrices Ω_1, …, Ω_K by solving the following penalized optimization,

$$\underset{\lambda_{\max}(\Omega_k)\le R;\ k=1,\ldots,K}{\text{minimize}}\quad \sum_{k=1}^K n_k\big\{\mathrm{trace}(\Omega_k\hat\Gamma_k) - \log\det(\Omega_k)\big\} + \sum_{k=1}^K n_k\sum_{i\ne j} p_{\lambda_{1k}}(|\omega_{kij}|) + n_{\min}\sum_{i\ne j} p_{\lambda_2}\Big(\sqrt{\omega_{1ij}^2 + \cdots + \omega_{Kij}^2}\Big), \quad (2)$$

where λ_max(Ω_k) denotes the largest eigenvalue of Ω_k, n_min = min_{1≤k≤K} n_k, a, R > 0, λ_{1k}, k = 1, ⋯, K, and λ_2 are the tuning parameters, and the penalty function p_λ(·) : ℝ+ → ℝ+ satisfies the following conditions:

  1. p_λ(x) is nondecreasing and differentiable on ℝ+, and p_λ(0) = 0;

  2. lim_{x→0+} p′_λ(x) = λ;

  3. p_λ(x) + x²/b is convex for some constant b > 0;

  4. p′_λ(x) = 0 for |x| > aλ, for some constant a ≥ b/2.

A few remarks are in order. First, the condition a ≥ b/2 ensures the existence of a penalty function satisfying conditions 3 and 4 simultaneously, and different choices of a, b correspond to different nonconvex penalties. For instance, a > 2, b = 2/(a − 1) leads to the penalty function of Fan and Li (2001a), and a = b/2, b > 0 to that of Zhang (2010). Other types of nonconvex penalty can also be used here, e.g., the truncated ℓ1 penalty (Shen et al., 2012), or the ℓq penalty with q < 1. Second, our penalty function consists of two parts: a sparsity penalty that encourages sparsity in each individual partial correlation matrix, and a group sparsity penalty that encourages common sparsity patterns across different partial correlation matrices. Third, our penalty function is in general nonconvex, and using a nonconvex penalty is beneficial in several ways. It leads to nearly unbiased parameter estimation, facilitates cross-validation for parameter tuning, and can achieve a better sparsity pursuit guarantee under less stringent assumptions (Fan et al., 2009; Shen et al., 2012).
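For concreteness, the choice a = b/2 of Zhang (2010) is the minimax concave penalty (MCP), which satisfies conditions 1-4 with b = 2a. A minimal sketch, where the argument `a` plays the role of the concavity constant γ in the usual MCP parameterization:

```python
import numpy as np

def mcp(x, lam, a):
    """Minimax concave penalty (Zhang, 2010): p_lam(x) = lam*x - x^2/(2a)
    on [0, a*lam], constant a*lam^2/2 beyond a*lam."""
    x = np.abs(x)
    return np.where(x <= a * lam, lam * x - x ** 2 / (2 * a), a * lam ** 2 / 2)

def mcp_grad(x, lam, a):
    """Derivative p'_lam(x) = (lam - x/a)_+ : equals lam at 0+, vanishes beyond a*lam."""
    return np.maximum(lam - np.abs(x) / a, 0.0)
```

One can verify the four conditions directly: the derivative equals λ at the origin (condition 2), vanishes beyond aλ (condition 4), and the curvature is bounded below by −1/a = −2/b (condition 3).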

2.2 Parameter tuning

Parameter tuning is always challenging for high-dimensional models, and we propose the following cross-validation approach to tune the parameters in (2). Motivated by our theoretical analysis in Section 4, we let λ_{1k} = λ_1 √{log(p∨q)/(n_k q)}, k = 1, ⋯, K, where p∨q = max(p, q), and let λ = (λ_1, λ_2)^T. We select λ by minimizing a prediction criterion using 5-fold cross-validation. That is, we divide the data set for each group into five parts D_1, ⋯, D_5. Under group k, define Γ̂_{k,l} and Γ̂_{k,−l} to be the sample correlation matrices calculated based on the samples in D_l and in {D_1, ⋯, D_5} \ D_l, l = 1, ⋯, 5, respectively. Similarly, define Ω̂_{k,−l}(λ) to be the partial correlation matrix calculated based on Γ̂_{k,−l}, l = 1, ⋯, 5, under the tuning parameter λ. Then we define the criterion as,

$$\mathrm{CV}(\lambda) = \frac{1}{5K}\sum_{l=1}^5\sum_{k=1}^K\Big[-\log\det\{\hat\Omega_{k,-l}(\lambda)\} + \mathrm{trace}\{\hat\Gamma_{k,l}\hat\Omega_{k,-l}(\lambda)\} - p\Big].$$

The optimal tuning parameter is selected as λ̂ = arg min_λ CV(λ), which is then used to obtain the final cross-validated estimator (Ω̂_1, ⋯, Ω̂_K). Minimization of CV(λ) is carried out using a simple grid search over the domain of the tuning parameters. Following both the common practice in nonconvex penalization and our own theoretical analysis, we choose not to tune a and b in p_λ(·), but instead set b = 2a and a equal to some constant divided by λ_1. We choose not to tune R either, since our method is not sensitive to the value of R as long as it is reasonably large. We also make some remarks comparing cross-validation based tuning under a convex versus a nonconvex penalty. When comparing the goodness-of-fit of two selected models, one essentially compares the likelihood function evaluated at the constrained maximum likelihood estimator (MLE), i.e., the MLE over the selected support of the parameters. A convex penalty such as ℓ1 does not yield a constrained MLE; rather, it shrinks the MLE to achieve an optimal bias-variance trade-off, so the cross-validation score of a convex penalized estimator is not suitable for model comparison. By contrast, a nonconvex penalized estimator is nearly identical to the constrained MLE given the selected support (Fan and Li, 2001a; Zhang, 2010; Shen et al., 2012). As such, a nonconvex penalty is better suited to cross-validation tuning for sparsity identification. In graphical model estimation with a convex penalty, cross-validation and the more traditional Bayesian information criterion have been shown to perform poorly (Liu et al., 2010). We further compare the two penalty functions numerically in Section 5.
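The criterion CV(λ) above can be sketched as follows, where `fit` stands in for solving the penalized problem (2) on the training-fold correlation matrices; it is a placeholder, not the paper's implementation.

```python
import numpy as np

def cv_score(Omega, Gamma_test, p):
    """One fold/group term of CV: -log det(Omega) + trace(Gamma_test Omega) - p."""
    _, logdet = np.linalg.slogdet(Omega)
    return -logdet + np.trace(Gamma_test @ Omega) - p

def cross_validate(lambdas, fit, gamma_train, gamma_test, p, K, L=5):
    """Grid search for lambda.

    fit(lam, Gamma)   -> estimate Omega_hat_{k,-l}(lam) from a training-fold Gamma
    gamma_train[l][k] -> Gamma_hat_{k,-l};  gamma_test[l][k] -> Gamma_hat_{k,l}
    """
    scores = [
        sum(cv_score(fit(lam, gamma_train[l][k]), gamma_test[l][k], p)
            for l in range(L) for k in range(K)) / (L * K)
        for lam in lambdas
    ]
    return lambdas[int(np.argmin(scores))]
```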

3 Computation

Nonconvex optimization is in general more challenging than convex optimization. In this section, we develop a highly efficient and scalable optimization algorithm for nonconvex minimization of (2). The algorithm consists of two core components: the minorize-maximization algorithm that optimizes (2) through a sequence of convex relaxations (Hunter and Lange, 2004), and the alternating direction method of multipliers that optimizes each convex relaxation (Boyd et al., 2011). We first summarize our optimization procedure in Algorithm 1, then discuss each individual component in detail. We conclude this section with a discussion regarding the overall computational cost.

3.1 Sequential convex relaxation through MM algorithm

The MM algorithm is commonly employed for solving nonconvex optimization problems approximately. Its key idea is to decompose the objective function into a difference of two convex functions. In our setting, we linearize the nonconvex penalty around the previous iterate x^{(t)}, i.e.,

$$p_\lambda(|x|) \approx p_\lambda(|x^{(t)}|) + p'_\lambda(|x^{(t)}|)\big(|x| - |x^{(t)}|\big),$$

to obtain a convex approximation at x(t). Accordingly, we solve the nonconvex optimization (2) by considering a sequence of convex relaxations until we get a stationary point. Specifically, based on (Ω^1(t),,Ω^K(t)) at step t, we minimize the following convex relaxation,

$$\sum_{k=1}^K \frac{n_k}{n_{\min}}\big\{\mathrm{trace}(\Omega_k\hat\Gamma_k) - \log\det(\Omega_k)\big\} + \sum_{k=1}^K\sum_{l<l'} b_{kll'}^{(t)}|\omega_{kll'}| + \sum_{l<l'} c_{ll'}^{(t)}\sqrt{\sum_{k=1}^K \omega_{kll'}^2}, \quad (3)$$

subject to λmax(Ωk) ≤ R; k = 1, …, K, where

$$b_{kll'}^{(t)} = \frac{n_k}{n_{\min}}\,p'_{\lambda_1}\big(|\omega_{kll'}^{(t)}|\big), \qquad c_{ll'}^{(t)} = p'_{\lambda_2}\Big(\sqrt{\sum_{k=1}^K \big(\omega_{kll'}^{(t)}\big)^2}\Big).$$

We then obtain the solution (Ω^1(t+1),,Ω^K(t+1)) at the (t + 1)th step, and iterate over t until convergence.
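The MM weight updates above amount to evaluating the penalty derivative at the current iterate. A hedged sketch, with a hypothetical `pen_grad` (here the MCP derivative) standing in for p′_λ:

```python
import numpy as np

def pen_grad(x, lam, a):
    """MCP derivative p'_lam(x) = (lam - |x|/a)_+ (illustrative choice of penalty)."""
    return np.maximum(lam - np.abs(x) / a, 0.0)

def mm_weights(Omega_list, n, lam1, lam2, a):
    """Weights b_{kll'}^{(t)} and c_{ll'}^{(t)} of the convex relaxation (3),
    obtained by linearizing the nonconvex penalty at the current iterates."""
    n = np.asarray(n, dtype=float)
    Omegas = np.stack(Omega_list)                      # K x p x p current iterates
    b = (n / n.min())[:, None, None] * pen_grad(Omegas, lam1, a)
    group_norm = np.sqrt((Omegas ** 2).sum(axis=0))    # sqrt(sum_k omega_{kll'}^2)
    c = pen_grad(group_norm, lam2, a)
    return b, c
```

Entries that are currently zero receive the full weight λ, while entries larger than aλ receive zero weight, which is the source of the near-unbiasedness of the nonconvex estimate.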

3.2 Alternating direction method of multipliers

To solve each relaxation (3), we propose an ADMM algorithm. Specifically, we introduce K new variables Δ_k = (δ_{kll′})_{1≤l,l′≤p}, such that Δ_k = Ω_k, and K associated dual variables Θ_k = (θ_{kll′})_{1≤l,l′≤p}, k = 1, ⋯, K. The ADMM algorithm solves (3) by iteratively applying the following updating scheme, for k = 1, …, K, and 1 ≤ l ≤ l′ ≤ p,

$$\Omega_k^{(m+1)} = \arg\min_{\lambda_{\max}(\Omega)\le R}\Big\{\frac{n_k}{n_{\min}}\big(\mathrm{trace}(\Omega\hat\Gamma_k) - \log\det\Omega\big) + \frac{\rho}{2}\big\|\Omega - \Delta_k^{(m)} + \Theta_k^{(m)}\big\|_F^2\Big\}, \quad (4a)$$

$$\big(\delta_{kll'}^{(m+1)}\big)_{k=1}^K = \arg\min_{\delta\in\mathbb{R}^K}\Big\{\frac{\rho}{2}\sum_{k=1}^K\big(\delta_k - \omega_{kll'}^{(m+1)} - \theta_{kll'}^{(m)}\big)^2 + \sum_{k=1}^K b_{kll'}^{(t)}|\delta_k| + c_{ll'}^{(t)}\|\delta\|_2\Big\}, \quad (4b)$$

$$\Theta_k^{(m+1)} = \Theta_k^{(m)} + \Omega_k^{(m+1)} - \Delta_k^{(m+1)}.$$

The first update (4a) can be carried out efficiently according to the next lemma.

Lemma 1

Consider the following optimization problem,

$$\underset{\Omega \succ 0,\ \lambda_{\max}(\Omega)\le R}{\text{minimize}}\quad \mathrm{trace}(\Omega\Delta) - \log\det\Omega + \frac{c}{2}\|\Omega\|_F^2.$$

Let Δ = UDU^T be the eigen-decomposition of Δ. The solution to the above problem is given by

$$\Omega = UQU^T,$$

where Q is a diagonal matrix with diagonal elements

$$Q_{ii} = \arg\min_{0 < x \le R}\Big\{xD_{ii} - \log(x) + \frac{cx^2}{2}\Big\}, \quad i = 1, \ldots, p.$$
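Because the scalar objective x D_ii − log x + cx²/2 is convex on (0, R], its minimizer is the stationary point of cx² + D_ii x − 1 = 0, namely x = (−D_ii + √(D_ii² + 4c))/(2c), truncated at R. A minimal sketch of the resulting update, under the hypothetical name `prox_logdet`:

```python
import numpy as np

def prox_logdet(Delta, c, R):
    """Solve min_{Omega > 0, lam_max(Omega) <= R}
       trace(Omega Delta) - log det(Omega) + (c/2)||Omega||_F^2   (Lemma 1).

    Each eigenvalue solves the scalar problem x*D_ii - log x + c*x^2/2,
    whose stationary point is x = (-D_ii + sqrt(D_ii^2 + 4c)) / (2c)."""
    D, U = np.linalg.eigh(Delta)                 # Delta = U diag(D) U^T
    q = (-D + np.sqrt(D ** 2 + 4 * c)) / (2 * c)
    q = np.minimum(q, R)                         # enforce the eigenvalue cap
    return (U * q) @ U.T
```

When the cap R is inactive, the result satisfies the stationarity condition Δ − Ω⁻¹ + cΩ = 0 and is automatically positive definite.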

The second update (4b) has an analytical solution according to the next lemma.

Lemma 2

Consider the following generic minimization problem,

$$\underset{x\in\mathbb{R}^K}{\text{minimize}}\quad \frac{1}{2}\sum_{k=1}^K(x_k - a_k)^2 + \sum_{k=1}^K b_k|x_k| + \nu\sqrt{\sum_{k=1}^K x_k^2}.$$

Its solution is given by

$$x = \bigg\{1 - \frac{\nu}{\big[\sum_{k=1}^K \big(S_{b_k}(a_k)\big)^2\big]^{1/2}}\bigg\}_+\big(S_{b_1}(a_1), \ldots, S_{b_K}(a_K)\big),$$

where Sb(a) = Sign(a)(|a| − b)+ is the soft-thresholding function.
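Lemma 2 is the familiar sparse-group thresholding rule: soft-threshold elementwise, then shrink the group norm. A short sketch, with hypothetical names `soft` and `group_threshold`:

```python
import numpy as np

def soft(a, b):
    """Elementwise soft-thresholding S_b(a) = sign(a)(|a| - b)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def group_threshold(a, b, nu):
    """Closed-form solution of Lemma 2:
    min_x 0.5*sum (x_k - a_k)^2 + sum b_k|x_k| + nu*||x||_2."""
    s = soft(np.asarray(a, dtype=float), np.asarray(b, dtype=float))
    norm = np.linalg.norm(s)
    if norm == 0.0:
        return np.zeros_like(s)
    return max(1.0 - nu / norm, 0.0) * s
```

Either penalty alone recovers a familiar special case: with ν = 0 this is the lasso prox, and with b = 0 it is the group-lasso prox.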

The proofs of Lemmas 1 and 2 are given in the Appendix.

3.3 Overall computational cost

We make a few remarks regarding the overall computation of our algorithm. First, the per-iteration computational complexity of the ADMM step in Algorithm 1 is O(Kp³). Such a cubic dependence on p is essentially inevitable if one is to obtain a positive-definite matrix estimate. If positive-definiteness is not required, there are alternative loss functions such as the pseudo-likelihood loss, and faster algorithms are possible. We have chosen the likelihood loss partly because of the positive-definiteness requirement, and partly because it is more amenable to theoretical analysis. Second, although nonconvex optimization is in general more challenging, our nonconvex algorithm achieves a computation time comparable to its convex counterpart, as we report in Section 5. This is due to the fast convergence of the step that tackles the nonconvexity, i.e., the MM step of convex relaxations. Our numerical study shows that the MM step usually converges in only a few iterations. Consequently, the main computational cost of the algorithm is dominated by the convex optimization step of ADMM. Third, our optimization algorithm scales reasonably well, and can handle networks with up to a few hundred nodes. It is noteworthy that, in functional connectivity analysis, the typical size of a region-based brain network is in the tens to a few hundred. As such, our method is well suited for brain connectivity type applications. Finally, we comment that some of the steps in our algorithm can be parallelized to further speed up the computation.

4 Asymptotics

Our asymptotic analysis focuses on two scenarios. We first study the case when there is only the sparsity penalty, i.e., when λ_2 = 0. We then study the case when there is only the group sparsity penalty, i.e., when λ_1 = 0. Considering these cases provides new insights into the connection and difference of the two types of penalty functions. Meanwhile, it allows a direct comparison with existing theoretical results in Gaussian graphical models. We also note that we did not pursue the scenario where both λ_1 and λ_2 are non-zero, for two reasons. Although it is undoubtedly of interest to study the theoretical properties when both penalties are present, such a characterization would naturally require an explicit quantification of the similarity between the true graphs. This kind of knowledge is almost surely unavailable in reality, making the asymptotic result less relevant practically. Moreover, there is a lack of tools to overcome some technical difficulties in analyzing the KKT conditions when both sparse and group sparse penalties are employed. There is no existing work of this type even for the vector normal case. We defer this pursuit to future research.

That being said, we also clarify our theoretical contributions. For the separate graph estimation scenario with λ_2 = 0, we provide a new sparsistency result that achieves a sharper error bound, requires less stringent conditions, and holds for the actual local optimum of the estimation algorithm. For the multi-graph joint estimation scenario with λ_1 = 0, we establish the sparsistency for the actual local instead of global minimizer, and guarantee multi-graph structure recovery, symmetry, and positive-definiteness. Moreover, we develop a new proof technique that permits a direct generalization from the single graph case to the multi-graph case. This proof technique is new in the literature, and is potentially useful for theoretical analysis of other models as well.

4.1 Sparsity penalty only with λ2 = 0

First we consider the case where we impose the sparsity penalty only and set λ_2 = 0 in (2). Let A_k^0 = {(i, j) : ω_{kij}^0 ≠ 0} denote the support of the true partial correlation matrix Ω_k^0 = (ω_{kij}^0), i, j = 1, …, p, k = 1, ⋯, K. We define the oracle estimator Ω̂_{k,A_k^0} as

$$\hat\Omega_{k,A_k^0} = \arg\min_{\Omega:\,\omega_{ij}=0 \text{ for } (i,j)\notin A_k^0}\big\{\mathrm{trace}(\hat\Gamma_k\Omega) - \log\det(\Omega)\big\}, \quad (5)$$

which is essentially the MLE over the supports A_k^0, k = 1, …, K. Moreover, let n_min = min_{1≤k≤K} n_k and n_max = max_{1≤k≤K} n_k. We impose the following assumptions.

  • A1
    Let $\Gamma_k^0 = \mathrm{Diag}(\Sigma_k^S)^{-1/2}\,\Sigma_k^S\,\mathrm{Diag}(\Sigma_k^S)^{-1/2}$ denote the true correlation matrix. Assume that, for all k = 1, ⋯, K,
    $$c_0^{-1} < \lambda_{\min}(\Gamma_k^0) \le \lambda_{\max}(\Gamma_k^0) < c_0 \quad\text{and}\quad c_0^{-1} < \lambda_{\min}(\Sigma_k^{T0}) \le \lambda_{\max}(\Sigma_k^{T0}) < c_0,$$
    for some positive real number c_0.
  • A2
    Let $c_1 = \max_k \|\Gamma_k^0\|_{\infty,\infty}$ and $c_2 = \max_k \|I_k\|_{\infty,\infty}$, where $I_k = \frac{1}{2}[\Omega_k^0]^{-1} \otimes_s [\Omega_k^0]^{-1}$ is the Fisher information matrix in group k, and $\|A\|_{\infty,\infty} = \max_{1\le j\le p}\sum_k |A_{jk}|$ is the ℓ∞/ℓ∞ operator norm of a matrix A. Let $s_0 = \max_{1\le k\le K}\max_{1\le j\le p}\sum_{i=1}^p I\{(i,j)\in A_k^0\}$, where I is the indicator function. Assume that
    $$2c_1c_2c_3\big(1 + 2c_1^2c_2\big)\, s_0 \sqrt{\frac{\log(p\vee q)}{n_k q}} \le 1, \quad k = 1, \ldots, K,$$
    where c_3 is some absolute constant.

Assumption (A1) is a commonly imposed condition when analyzing the theoretical properties of many types of precision matrix estimators; see, for example, Fan et al. (2009); Cai et al. (2016). Assumption (A2) restricts the scaling of the graph sparsity level measured by s0 as a function of sample size n and graph size p. Similar scaling has been used in Fan et al. (2009); Loh and Wainwright (2014). It is also noteworthy that the quantities c0, c1, c2, and s0 can grow with the sample size, the spatial dimension, and the temporal dimension. Under these assumptions, we have the following result.

Theorem 1

Under Assumptions (A1) and (A2), and the condition that,

$$\min_{(i,j)\in A_k^0}|\omega_{kij}^0| > \big\{2c_2c_3 + (1 + c_1^2c_2)\,c_3\,(c_0 + 2c_1^{-1})^2\big\}\sqrt{\frac{\log(p\vee q)}{n_k q}}, \quad (6)$$

for k = 1, ⋯, K, there exist λ_1 and a such that the oracle estimator Ω̂_{k,A_k^0} is the unique minimizer of problem (2) when R = 2a, b = 2a, and λ_2 = 0, with probability at least 1 − 6K(p∨q)^{−2}, as n, p → ∞.

This theorem shows that the oracle estimator is the unique minimizer of (2) under λ2 = 0. That is, when the maximum node degree s0 does not grow too fast as (n, p) goes to infinity, for some choice of the tuning parameters, solving (2) could reconstruct the true structure of the K graphs with probability tending to one. This result holds when the minimum signal satisfies the condition (6). If we further assume ci, i = 1, 2, 3, are all constants, then the minimum signal condition (6) roughly requires that

$$\min_{(i,j)\in A_k^0}|\omega_{kij}^0| \ge O\bigg(\sqrt{\frac{\log(p\vee q)}{n_k q}}\bigg), \quad k = 1, \ldots, K. \quad (7)$$

Comparing (7) to the minimum signal strength condition in Fan et al. (2014), their condition is suboptimal in terms of its dependence on the column/row sparsity s_0, in that it requires min_{(i,j)∈A_u} |ω_{ij}^0| > O(√{s_0² log p/n}). By contrast, we only require min_{(i,j)∈A_u} |ω_{ij}^0| > O(√{log p/n}). Our result is comparable to that of Loh and Wainwright (2014). However, their proof used a primal-dual witness technique, whereas our proof proceeds in two steps, by first establishing the rate of convergence for the oracle estimator, and then proving that the oracle estimator is the unique local minimum. An advantage of our two-step proof is that it is straightforward to generalize to the multiple partial correlation matrices case when a group sparsity penalty is further imposed. Finally, unlike Leng and Tang (2012), who established the oracle property for some unknown local minimizer of their objective function, we obtain the result for our actual local minimizer.

4.2 Group sparsity penalty only with λ1 = 0

Next we consider the case where we impose the group sparsity penalty only and set λ_1 = 0 in (2). For this case, it is impossible to recover the oracle estimator unless A_1^0 = ⋯ = A_K^0, since the graph estimators obtained by using only the group sparsity penalty share the same support across all groups. On the other hand, it is still feasible to recover the oracle estimator over A_u = ∪_{k=1}^K A_k^0. Specifically, we define the oracle estimator Ω̂_{1,A_u}, ⋯, Ω̂_{K,A_u} as

$$\hat\Omega_{k,A_u} = \arg\min_{\Omega:\,\omega_{ij}=0 \text{ for } (i,j)\notin A_u}\big\{\mathrm{trace}(\hat\Gamma_k\Omega) - \log\det(\Omega)\big\},$$

which is essentially the MLE over the joint set Au. We also modify the assumption (A2) slightly and introduce the next assumption.

(A3) Let $\tilde s_0 = \max_{1\le j\le p}\sum_{i=1}^p I\{(i,j)\in A_u\}$. Assume that

$$2c_1c_2c_3\big(1 + 2c_1^2c_2\big)\,\tilde s_0 \sqrt{\frac{\log(p\vee q)}{n_k q}} \le 1, \quad k = 1, \ldots, K.$$

Assumption (A3) is directly comparable to (A2). In (A3), s̃_0 is the sparsity level of the union of all K graphs, whereas s_0 in (A2) is the maximum sparsity level over all graphs. Clearly s̃_0 ≥ s_0; and when the sparsity pattern differs significantly across groups, s̃_0 can be much larger than s_0. In this sense, the group sparsity penalty is most effective when the sparsity patterns are similar across different groups. Under (A1) and (A3), we have the following result.

Theorem 2

Under Assumptions (A1) and (A3), and the condition that

$$\min_{(i,j)\in A_u}\sqrt{\sum_{k=1}^K\big(\omega_{kij}^0\big)^2} > 2c_2c_3\sqrt{\frac{K\log(p\vee q)}{n_{\min}q}} + (1 + c_1^2c_2)\,c_3\,(c_0 + 2c_1^{-1})^2\,\frac{n_{\max}}{n_{\min}}\sqrt{\frac{K\log(p\vee q)}{n_{\min}q}}, \quad (8)$$

for k = 1, …, K, there exist λ_2 and a such that the oracle estimator Ω̂_{k,A_u}, k = 1, ⋯, K, is the unique minimizer of (2) when R = 2a, b = 2a, and λ_1 = 0, with probability at least 1 − 6K(p∨q)^{−2}, as n, p → ∞.

This theorem says that, if the size of the union of supports Ak0 is not too large, the oracle estimator is the unique local optimum of (2) under λ1 = 0, and can recover the true graph structure with probability tending to one. Again, if we treat ci, i = 1, 2, 3, as constants, then the condition (8) becomes

$$\min_{(i,j)\in A_u}\sqrt{\sum_{k=1}^K\big(\omega_{kij}^0\big)^2} > O\bigg(\frac{n_{\max}}{n_{\min}}\sqrt{\frac{K\log(p\vee q)}{n_{\min}q}}\bigg). \quad (9)$$

Comparing the two minimum signal strength conditions (7) and (9) reveals some useful insights about the two penalties. Neither condition is uniformly stronger or weaker than the other. When the sample sizes n_1, ⋯, n_K are well balanced, and the sparsity patterns are similar across all groups, adding a group sparsity penalty facilitates the graph recovery. This can be seen by inspecting the extreme case where n_1 = ⋯ = n_K = ñ and the sparsity patterns are identical. In this case, the condition for using the group sparsity penalty reduces to min_{(i,j)∈A_u} √{Σ_{k=1}^K (ω_{kij}^0)²} ≥ √K · O(√{log(p∨q)/(ñq)}), which is clearly less stringent than the condition (7) required for using the sparsity penalty, because min_{(i,j)∈A_u} √{Σ_{k=1}^K (ω_{kij}^0)²} ≥ √K min_k min_{(i,j)∈A_k^0} |ω_{kij}^0|. On the other hand, if the sample sizes are highly unbalanced, or the sparsity patterns are markedly different across groups, then using the sparsity penalty requires a less stringent condition. Compared to some existing vector-based multi-graph analyses, our result is for the actual local minimizer, rather than the global minimizer as in Zhu et al. (2014). Moreover, we guarantee both multi-graph structure recovery and positive-definiteness of the estimator, while Lee and Liu (2015) and Cai et al. (2016) cannot guarantee the latter.

5 Simulations

5.1 Setup

We study the finite-sample performance of our method through simulations. To evaluate the accuracy of sparsity identification, we employ the average false positive (FP) and average false negative (FN) rates, defined as,

FP = \frac{1}{K} \sum_{k=1}^{K} \frac{\sum_{1\le l < l' \le p} I(\omega_{kll'} = 0,\ \hat{\omega}_{kll'} \ne 0)}{\sum_{1\le l < l' \le p} I(\omega_{kll'} = 0)}, \qquad FN = \frac{1}{K} \sum_{k=1}^{K} \frac{\sum_{1\le l < l' \le p} I(\omega_{kll'} \ne 0,\ \hat{\omega}_{kll'} = 0)}{\sum_{1\le l < l' \le p} I(\omega_{kll'} \ne 0)},

where the sums are taken over the off-diagonal elements of Ω_k. To evaluate the accuracy of parameter estimation, we employ the entropy loss (EL) and quadratic loss (QL), defined as,

EL_k = \operatorname{trace}(\Omega_k^{-1} \hat{\Omega}_k) - \log\det(\Omega_k^{-1} \hat{\Omega}_k) - p, \qquad QL_k = \operatorname{trace}\{ (\Omega_k^{-1} \hat{\Omega}_k - I)^2 \}, \quad k = 1, \ldots, K.
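For concreteness, the four evaluation criteria above can be computed in a few lines of NumPy. This is an illustrative sketch, not the code used in our experiments; `omega` denotes the true Ω_k and `omega_hat` its estimate.

```python
import numpy as np

def entropy_loss(omega, omega_hat):
    # EL_k = trace(Omega_k^{-1} Omega_hat_k) - log det(Omega_k^{-1} Omega_hat_k) - p
    p = omega.shape[0]
    m = np.linalg.solve(omega, omega_hat)          # Omega^{-1} Omega_hat
    _, logdet = np.linalg.slogdet(m)
    return np.trace(m) - logdet - p

def quadratic_loss(omega, omega_hat):
    # QL_k = trace{ (Omega_k^{-1} Omega_hat_k - I)^2 }
    p = omega.shape[0]
    m = np.linalg.solve(omega, omega_hat) - np.eye(p)
    return np.trace(m @ m)

def fp_fn(omegas, omega_hats, tol=1e-8):
    # Average false positive / false negative rates over the K groups,
    # computed over the off-diagonal entries only.
    fps, fns = [], []
    for om, oh in zip(omegas, omega_hats):
        off = ~np.eye(om.shape[0], dtype=bool)
        zero = np.abs(om[off]) < tol
        fps.append(np.mean(np.abs(oh[off][zero]) >= tol))
        fns.append(np.mean(np.abs(oh[off][~zero]) < tol))
    return np.mean(fps), np.mean(fns)
```

Both losses vanish exactly when the estimate equals the truth, and EL_k is nonnegative for any positive-definite pair, which makes these criteria convenient for cross-method comparison.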

We generate the data from a matrix normal distribution. We consider three spatial dependence structures: a chain graph, a hub graph, and a random graph, as shown in Figure 1. We fix the temporal dependence structure as an order-one autoregressive model. We focus on two-group graph estimation, i.e., K = 2, although our method applies equally to more than two groups. We first generate a graph following one of the three structures in Figure 1 for one group, then construct the graph for the other group by randomly adding a few edges to the first graph. We vary the per-group sample size nk ∈ {10, 20}, the spatial dimension p ∈ {100, 200}, and the temporal dimension q ∈ {50, 100}. In the interest of space, we report the results for nk = 10 in the online Supplementary Appendix.
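A matrix normal sample with a given spatial precision matrix and an order-one autoregressive temporal covariance can be simulated by sandwiching an i.i.d. Gaussian matrix between the two covariance square roots. The sketch below is illustrative only; the chain-graph edge weight 0.3 and the AR(1) coefficient 0.5 are placeholder values, not the exact settings used in our simulations.

```python
import numpy as np

def ar1_cov(q, rho=0.5):
    # Order-one autoregressive temporal covariance: Sigma_T[s, t] = rho^|s - t|
    idx = np.arange(q)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def chain_precision(p, w=0.3):
    # Chain graph: tridiagonal spatial precision matrix
    return np.eye(p) + w * (np.eye(p, k=1) + np.eye(p, k=-1))

def rmatnorm(spatial_cov, temporal_cov, rng):
    # X = A Z B^T with A A^T = Sigma_S and B B^T = Sigma_T gives
    # vec(X) ~ N(0, Sigma_T kron Sigma_S), i.e., a matrix normal sample.
    a = np.linalg.cholesky(spatial_cov)
    b = np.linalg.cholesky(temporal_cov)
    z = rng.standard_normal((a.shape[0], b.shape[0]))
    return a @ z @ b.T

rng = np.random.default_rng(0)
p, q = 20, 15
omega_s = chain_precision(p)                           # spatial precision (the graph)
x = rmatnorm(np.linalg.inv(omega_s), ar1_cov(q), rng)  # one p x q observation
```

Repeating the last line nk times per group, with a few edges added to the precision matrix of the second group, reproduces the structure of the simulation design described above.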

Figure 1. Three types of graphs used in our simulation studies.

5.2 Comparison

We compare our method with several competing alternatives. The first is a matrix Gaussian multi-graph estimation method using the convex penalty, i.e., a combination of the ℓ1 and the group ℓ1 penalties. The other two are state-of-the-art vector Gaussian multi-graph estimation methods, Lee and Liu (2015) and Cai et al. (2016). Both estimate multiple graphs that share a common structure, and both utilize convex penalties. Since these two methods were designed for vector-valued rather than matrix-valued data, we first apply whitening to reduce the temporal correlations among the columns of the matrix data, then apply Lee and Liu (2015) and Cai et al. (2016). All tuning parameters are selected via 5-fold cross-validation. Tables 1 to 3 summarize the results based on 100 data replications for the three spatial graph structures in Figure 1. In summary, our proposed method clearly outperforms the alternative solutions in terms of both sparsity identification and graph estimation accuracy.
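The whitening step can be sketched as follows: estimate the q × q temporal covariance by pooling across subjects and rows, then right-multiply each data matrix by the inverse square root of that estimate so the columns become approximately uncorrelated. This is a hypothetical illustration with a simple pooled moment estimator; the exact estimator used in our comparison may differ.

```python
import numpy as np

def whiten_columns(xs):
    # xs: list of p x q centered data matrices from one group.
    # Pool an estimate of the q x q temporal covariance, then remove it
    # by right-multiplying with Sigma_T^{-1/2}.
    p, _ = xs[0].shape
    sigma_t = sum(x.T @ x for x in xs) / (p * len(xs))
    evals, evecs = np.linalg.eigh(sigma_t)
    w = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric inverse square root
    return [x @ w for x in xs]
```

After whitening, the columns of each matrix can be treated as approximately i.i.d. p-dimensional vectors, to which the vector-based multi-graph estimators are then applied.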

Table 1.

Chain graph. Reported are the average and standard deviation (in parenthesis) of the accuracy criteria based on 100 data replications. Also reported is the average running time (in seconds). Evaluation criteria include the false positive rate (FP), the false negative rate (FN), the entropy loss (ELk), and the quadratic loss (QLk). We compare the proposed nonconvex based multi-graph estimation method (denoted as Nonconvex) with its convex counterpart (denoted as Convex), the method of Lee and Liu (2015) (denoted as Lee & Liu), and the method of Cai et al. (2016) (denoted as Cai et al.).

nk p q Method FP FN EL1 EL2 QL1 QL2 Time
20 100 100 Nonconvex 0.003 (0.005) 0.000 (0.000) 0.093 (0.015) 0.105 (0.015) 0.230 (0.036) 0.265 (0.038) 116
Convex 0.059 (0.058) 0.000 (0.000) 4.030 (2.160) 3.520 (1.700) 7.850 (4.410) 7.080 (3.640) 85
Lee & Liu 0.413 (0.058) 0.000 (0.000) 1.475 (0.043) 0.960 (0.059) 3.991 (0.121) 2.562 (0.181) 1050
Cai et al. 0.005 (0.005) 6e-04 (0.002) 14.40 (1.100) 16.90 (2.000) 48.00 (5.000) 55.50 (9.300) 631

50 Nonconvex 0.001 (0.003) 0.000 (0.000) 0.194 (0.028) 0.216 (0.027) 0.484 (0.072) 0.547 (0.069) 159
Convex 0.053 (0.045) 0.000 (0.000) 6.500 (3.350) 5.460 (2.440) 13.10 (7.410) 11.70 (5.800) 102
Lee & Liu 0.382 (0.008) 0.000 (0.000) 1.835 (0.070) 1.253 (0.070) 5.063 (0.218) 3.486 (0.216) 1096
Cai et al. 7e-04 (7e-04) 0.000 (0.000) 7.200 (0.750) 9.300 (1.500) 22.10 (2.300) 28.30 (4.700) 608

200 100 Nonconvex 0.000 (0.001) 0.000 (0.000) 0.192 (0.023) 0.196 (0.018) 0.475 (0.056) 0.475 (0.044) 474
Convex 0.032 (0.032) 0.000 (0.000) 11.40 (5.020) 9.110 (3.780) 22.70 (10.40) 18.70 (8.130) 398
Lee & Liu 0.166 (0.002) 0.000 (0.000) 2.800 (0.059) 1.600 (0.044) 7.200 (0.170) 4.000 (0.120) 9308
Cai et al. 1e-04 (2e-04) 0.000 (0.000) 13.50 (1.200) 19.10 (2.400) 41.20 (3.800) 54.20 (7.800) 7875

50 Nonconvex 0.000 (0.000) 0.000 (0.000) 0.390 (0.039) 0.384 (0.034) 0.972 (0.099) 0.933 (0.085) 730
Convex 0.023 (0.027) 0.000 (0.000) 16.70 (6.510) 13.10 (4.550) 34.10 (14.50) 27.20 (10.400) 672
Lee & Liu 0.212 (0.031) 0.000 (0.000) 3.400 (0.088) 1.900 (0.180) 8.900 (0.220) 5.000 (0.520) 11770
Cai et al. 0.001 (0.001) 0.000 (0.000) 8.200 (0.680) 10.50 (1.800) 23.70 (1.700) 29.00 (4.300) 7281

Table 3.

Random graph. The setup is the same as Table 1.

nk p q Method FP FN EL1 EL2 QL1 QL2 Time
20 100 100 Nonconvex 0.005 (0.005) 0.000 (0.000) 0.121 (0.016) 0.146 (0.016) 0.328 (0.044) 0.457 (0.054) 195
Convex 0.143 (0.045) 0.001 (0.002) 2.230 (1.160) 3.010 (1.430) 4.850 (2.730) 6.540 (3.510) 178
Lee & Liu 0.471 (0.030) 0.000 (0.000) 0.950 (0.034) 1.400 (0.043) 2.600 (0.120) 3.700 (0.150) 1107
Cai et al. 0.006 (0.006) 0.030 (0.008) 10.50 (1.000) 6.900 (0.660) 40.80 (5.700) 23.90 (3.700) 605

50 Nonconvex 0.029 (0.008) 0.001 (0.002) 0.276 (0.034) 0.409 (0.094) 0.736 (0.094) 1.200 (0.293) 176
Convex 0.304 (0.061) 0.001 (0.002) 1.690 (0.854) 2.050 (0.883) 3.720 (1.990) 4.510 (2.190) 153
Lee & Liu 0.390 (0.011) 3e-04 (8e-04) 1.300 (0.048) 1.800 (0.068) 3.800 (0.160) 5.200 (0.280) 1145
Cai et al. 0.005 (0.005) 0.026 (0.009) 8.100 (0.800) 6.100 (0.600) 29.20 (3.900) 22.00 (3.600) 580

200 100 Nonconvex 0.022 (0.001) 0 (0.001) 0.284 (0.021) 0.459 (0.077) 0.776 (0.057) 1.830 (0.355) 1206
Convex 0.333 (0.016) 0.001 (0.001) 1.610 (0.073) 2.510 (0.075) 3.78 (0.18) 6.120 (0.193) 1082
Lee & Liu 0.204 (0.033) 0.002 (0.001) 3.000 (0.057) 4.000 (0.089) 8.000 (0.180) 10.90 (0.310) 9600
Cai et al. 0.007 (0.010) 0.079 (0.006) 14.60 (0.800) 13.10 (0.780) 60.00 (6.000) 96.20 (11.60) 8310

50 Nonconvex 0.020 (0.003) 0.020 (0.006) 0.700 (0.058) 1.810 (0.169) 1.850 (0.157) 6.480 (0.638) 1303
Convex 0.299 (0.033) 0.009 (0.002) 3.400 (0.645) 5.030 (0.873) 7.900 (1.470) 11.90 (2.090) 1071
Lee & Liu 0.334 (0.024) 0.006 (0.003) 3.800 (0.088) 6.000 (0.140) 10.70 (0.290) 22.70 (1.400) 11904
Cai et al. 0.004 (0.004) 0.077 (0.008) 11.10 (0.880) 13.20 (0.980) 38.50 (3.700) 98.60 (14.10) 7959

Compared with its convex counterpart, our proposed nonconvex method achieves both a smaller false positive rate and a smaller estimation error. For instance, under the chain graph with nk = 20 and p = q = 100, the average entropy loss for the first graph is 0.093 for our method, with the standard deviation SD = 0.015, versus 4.030 for the convex method, with SD = 2.160. Meanwhile, the average false positive rate for our method is 0.003, with SD = 0.005, versus 0.059 for the convex method, with SD = 0.058. Similar numerical advantages of the nonconvex solution are consistently observed across different graph structures, sample sizes, and spatial and temporal dimensions. These results, to some extent, also reflect the advantage of a nonconvex penalty over a convex one when the parameter tuning is done via cross-validation.

Compared with the method of Lee and Liu (2015), our proposal again performs much better in both sparsity identification and graph estimation accuracy. In particular, the method of Lee and Liu (2015) yields a much higher false positive rate than our approach, while the false negative rates of the two are comparable. Moreover, the estimation error of our method is 3 to 10 times smaller than that of Lee and Liu (2015). Compared with the method of Cai et al. (2016), our proposal performs about the same in terms of sparsity identification, but improves substantially in graph estimation. Indeed, the graph estimation error of Cai et al. (2016) is the worst among all solutions, and in some situations its estimation error is 1000 times higher than that of our proposed method. Since both Lee and Liu (2015) and Cai et al. (2016) rely on convex penalties, these results partially reflect the conflict between selection consistency and estimation accuracy that is not uncommon when employing a convex penalty (Shen et al., 2012). They also show the advantage of working directly with the matrix data, rather than with vector-valued data after whitening.

5.3 Computation

We also examine in detail the computational cost of our proposed solution. All computations were done on a single core of a Xeon E5-2690 v3 at 2.6GHz with 128GB memory. We first report and compare the running times of the various methods for the simulation examples in Section 5.2. The last column of Tables 1 to 3 records the average running time, in seconds, rounded to integers. Our proposed method is slower than its convex counterpart, but only slightly, and the two running times are comparable. For instance, for the chain graph with nk = 20, p = 200, q = 100, the average running time for our method is 474 seconds, and that for the convex solution is 398 seconds. This is because the MM step, which forms a convex relaxation of our nonconvex objective function, usually converges in only a few iterations; consequently, the main computational cost of our algorithm is dominated by the ADMM convex optimization step. On the other hand, we have observed a 4 to 50 fold slowdown in running time for the two competing methods of Lee and Liu (2015) and Cai et al. (2016). For the aforementioned setup, the average time for Lee and Liu (2015) is 9,308 seconds, and for Cai et al. (2016) is 7,875 seconds. This is partly because those two alternatives use the interior point method in optimization, which slows down significantly as the graph size increases. As a further illustration, we also report the computational time when the number of network nodes gradually increases from p = 25 to p = 500 in the online Supplementary Appendix. Our method is comparable to the convex solution in running time, but is much faster than Lee and Liu (2015) and Cai et al. (2016), especially when the graph dimension p is large.

6 Data analysis

6.1 Autism spectrum disorder study

Autism spectrum disorder (ASD) is an increasingly prevalent neurodevelopmental disorder; its estimated prevalence was 1 in every 68 American children according to the Centers for Disease Control and Prevention in 2014. It is characterized by symptoms such as social difficulties, communication deficits, stereotyped behaviors and cognitive delays (Rudie et al., 2013). We analyzed a resting-state fMRI dataset from the Autism Brain Imaging Data Exchange (ABIDE) study (Di Martino et al., 2014). The imaging was performed on Siemens Magnetom Trio scanners, with the scan parameters: voxel size = 3 × 3 × 4mm, slice thickness = 4mm, number of slices = 34, repetition time = 3s, and echo time = 28ms. During image acquisition, all subjects were asked to lie still, stay awake, and keep their eyes open under a white background with a black central fixation cross. After removing the images with poor quality or substantial missing values, we focused on a dataset of 795 subjects, among whom 362 have ASD and 433 are normal controls. See Table 4 for the basic demographic information of the study subjects. All fMRI scans have been preprocessed through a standard pipeline, including slice timing correction, motion correction, denoising by regressing out motion parameters and white matter and cerebrospinal fluid time courses, spatial smoothing, band-pass filtering, and registration. Each brain image was then parcellated into 116 regions of interest using the Anatomical Automatic Labeling (AAL) atlas (Tzourio-Mazoyer et al., 2002). The time series of the voxels within each region were then averaged, resulting in a spatial-temporal data matrix for each individual subject, with spatial dimension p = 116 and temporal dimension q = 146. We also note that, other than simple averaging, there are alternative approaches to summarizing the voxel data within each region (Kang et al., 2016); our proposed method is equally applicable to data with a different summary.
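The region-averaging step that maps each subject's voxel-level time series to a p × q matrix can be sketched as follows; the label array and dimensions here are hypothetical placeholders, not the actual AAL parcellation data.

```python
import numpy as np

def region_average(voxel_ts, labels):
    # voxel_ts: n_voxels x q array of voxel time series.
    # labels:   length-n_voxels array of region labels (e.g., 116 AAL regions).
    # Returns a p x q matrix whose rows are the region-averaged time series.
    regions = np.unique(labels)
    return np.vstack([voxel_ts[labels == r].mean(axis=0) for r in regions])
```

Applying this to every subject yields the p = 116 by q = 146 spatial-temporal matrices analyzed in this section; any other within-region summary could be substituted for the mean.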

Table 4.

Demographic information of the ASD dataset and the ADHD dataset.

Group ASD study
ADHD study
Case Control Case Control
Sample size 362 433 96 91
Age (mean ± sd) 16.72 ± 8.253 16.27 ± 6.893 11.38 ± 2.757 12.38 ± 3.112
Male/female 341/48 348/85 73/23 44/47

Of scientific interest is how brain functional connectivity differs between the ASD subjects and the normal controls. We applied our nonconvex penalized multi-graph estimation method to these data, and tuned the parameters using 5-fold cross-validation. A quick examination of the quantile-quantile plot (not shown here) suggested that normality holds approximately for these data. Figure 2 reports the results. To facilitate the graphical presentation, we plot only the top 2% of the identified links for the autism and normal control groups. The grey links are the ones found common in both groups, while the red links are unique to each group. Our findings are in general consistent with the ASD literature. For instance, we observed decreased connectivity between the two hemispheres, as shown by the red links in the control group between the left and right halves of the graph (Vissers et al., 2012). We also found some brain regions with different connectivity patterns between the two groups of subjects, such as the inferior frontal gyrus and the fusiform gyrus, which have been noted in previous studies too (Rudie et al., 2013; Di Martino et al., 2014; Tyszka et al., 2014).

Figure 2. Estimated connectivity networks for the ABIDE data. The left panel is for the ASD group, and the right panel for the normal control. Shown are the top 2% links, where the grey links are the ones found common in both groups, and the red links are unique to each group.

6.2 Attention deficit hyperactivity disorder study

Attention deficit hyperactivity disorder (ADHD) is one of the most commonly diagnosed child-onset neurodevelopmental disorders and has an estimated childhood prevalence of 5 to 10% worldwide (Pelham et al., 2007). Symptoms include difficulty in staying focused and paying attention, difficulty in controlling behavior, and over-activity. These symptoms may persist into adolescence and adulthood, resulting in a lifelong impairment (Biederman et al., 2000). We analyzed a resting-state fMRI dataset from the ADHD-200 Global Competition. The fMRI images were acquired on Siemens Allegra 3T scanners at New York University, with the scan parameters: voxel size = 3 × 3 × 4mm, slice thickness = 4mm, number of slices = 33, repetition time = 2s, and echo time = 15ms. During acquisition, all subjects were asked to stay awake and not to think about anything under a black screen. For each subject, one or two fMRI scans were acquired, and for each scan, a quality control assessment (pass or questionable) was given by the data curators. We only used the scans that passed the quality control. If both scans of a subject passed the quality control, we arbitrarily chose the first scan. If neither scan passed the quality control, we removed that subject from further analysis. This resulted in 187 subjects, among whom 96 are combined ADHD subjects and 91 are typically developing controls. See Table 4 for the demographic information. All the scans have been preprocessed using the Athena pipeline, including slice timing correction, motion correction, denoising, spatial smoothing, band-pass filtering, and registration. Each image was then parcellated using the AAL atlas, and the resulting data is a spatial-temporal matrix, with spatial dimension p = 116 and temporal dimension q = 172.

Our study goal is to estimate and compare the functional connectivity networks between the ADHD and control groups. We applied our method, tuned by 5-fold cross-validation. The quantile-quantile plot suggested the data are approximately normal. Figure 3 shows the top 2% of the identified links for the two groups. We found a number of brain regions that exhibit different connectivity patterns between the ADHD and control groups, including the frontal gyrus, cingulate gyrus, cerebellum and cerebellar vermis. Such findings are generally in agreement with the ADHD literature. Specifically, the prefrontal cortex is responsible for many higher-order mental functions, including those that regulate attention and behavior, and it is commonly believed that ADHD is associated with alterations in the prefrontal cortex (Arnsten and Li, 2005). The cingulate gyrus is associated with cognitive processes, and there is evidence of anterior cingulate dysfunction in ADHD patients (Bush et al., 2005). The cerebellum is responsible for motor control and cognitive functions such as attention and language, and dysfunction in the cerebellum and anomalies in the cerebellar vermis of ADHD patients have been reported (Toplak et al., 2006; Goetz et al., 2014).

Figure 3. Estimated connectivity networks for the ADHD data. The left panel is for the ADHD group, and the right panel for the normal control. Shown are the top 2% links, where the grey links are the ones found common in both groups, and the red links are unique to each group.

7 Discussion

In this article, we have proposed a nonconvex penalized method to simultaneously estimate multiple graphs from matrix-valued data. We have developed an efficient optimization algorithm, and established some sharp theoretical results. Numerical analysis has demonstrated clear advantages of our method compared to some alternative solutions.

We have advocated a nonconvex penalty, since it produces a nearly unbiased estimator, is better suited for cross-validation tuning, and can achieve a better theoretical guarantee under less stringent assumptions. Meanwhile, we recognize its potential limitations. In terms of prediction and estimation accuracy, a nonconvex penalty tends to work better when the signal in the data is sparse and has a relatively large magnitude. On the other hand, a convex penalty tends to perform better if the signal is not sparse and if there are many small signals. This phenomenon has been consistently observed in the context of high-dimensional linear model selection and graph estimation (Fan et al., 2009; Zhang, 2010; Shen et al., 2012). We also clarify that the proposed penalized optimization formulation in (2) is in general a nonconvex problem. However, (2) can be convex in the special cases when λ1 = 0 or λ2 = 0 under some choices of the parameters, e.g., R = 2a, b = 2a. The convexity allows us to establish the desired theoretical properties for those special cases. This is a strategy commonly used in high-dimensional theoretical analysis. For instance, for variable selection, typically it is only shown that there exist some tuning parameter values at which the solution is selection consistent or attains the oracle property (Fan and Li, 2001b; Zhang, 2010; Shen et al., 2013).

A key assumption for our proposal is the matrix normal distribution. Such an assumption is widely used, and is scientifically plausible in the context of brain connectivity analysis. On the other hand, we recognize that this assumption may not always hold. There are two possible ways to relax it. The first is to consider a different loss function; for instance, the D-trace loss function (Zhang and Zou, 2012), or the pseudo-likelihood loss function (Lee and Hastie, 2015). The penalty function developed in our solution can be coupled with those alternative loss functions. We have chosen the likelihood-based loss function because it is more amenable to theoretical analysis, thanks to the strong convexity of the negative log-likelihood loss, and because it yields a positive-definite estimator for the precision matrix. The second type of relaxation comes from recent developments that extend a vector Gaussian graphical model to a semiparametric model (Liu et al., 2012), or a fully nonparametric model (Lee et al., 2016). Parallel extensions of those methods to matrix-valued data are warranted for future research.

We have primarily focused on graph estimation in this article, which is a different problem than graph inference, even though both can produce, in effect, a sparse representation of the graph structure. We recognize that graph-based inference is a very challenging problem, and is currently an active area of research that receives increasing attention (Janková and van de Geer, 2015; Xia and Li, 2017). An alternative solution is Bayesian graph estimation (Peterson et al., 2015; Zhu et al., 2016), which could automatically produce a valid inference for all the parameters, provided that the prior is appropriately specified. However, a major challenge for the class of Bayesian solutions is the computation and the scalability to large graphs. The sampling method used in Bayesian analysis is computationally much more expensive than the optimization in the frequentist solution. For the Gaussian graphical model, each Markov chain Monte Carlo (MCMC) iteration requires O(p3) operations, and the number of MCMC steps required for mixing is usually much larger than the number of ADMM steps in our optimization. Scalable Bayesian graph estimation is an important future direction.

Supplementary Material

Supp info

Table 2.

Hub graph. The setup is the same as Table 1.

nk p q Method FP FN EL1 EL2 QL1 QL2 Time
20 100 100 Nonconvex 0.006 (0.006) 0.000 (0.000) 0.086 (0.013) 0.111 (0.018) 0.408 (0.070) 0.459 (0.077) 226
Convex 0.199 (0.049) 0.000 (0.000) 1.290 (0.894) 1.360 (0.861) 4.31 (2.76) 4.180 (2.580) 197
Lee & Liu 0.467 (0.024) 0.000 (0.000) 1.100 (0.033) 1.000 (0.037) 3.500 (0.180) 3.90 (0.260) 1072
Cai et al. 5e-04 (0.001) 0.000 (0.000) 20.90 (3.300) 22.40 (3.200) 577.5 (169.4) 574.5 (155.4) 603

50 Nonconvex 0.012 (0.007) 0.001 (0.003) 0.183 (0.025) 0.309 (0.049) 0.875 (0.149) 1.130 (0.162) 272
Convex 0.193 (0.055) 0.001 (0.003) 2.450 (1.330) 2.530 (1.170) 8.010 (4.370) 7.340 (3.620) 257
Lee & Liu 0.399 (0.016) 0.000 (0.000) 1.500 (0.060) 1.600 (0.073) 6.000 (0.460) 6.500 (0.570) 1108
Cai et al. 0.005 (0.003) 9e-05 (6e-04) 17.00 (2.300) 18.40 (2.400) 412.4 (94.70) 419.5 (94.20) 611

200 100 Nonconvex 0.001 (0.001) 0.000 (0.000) 0.171 (0.020) 0.198 (0.024) 0.818 (0.106) 0.797 (0.097) 915
Convex 0.099 (0.028) 0.000 (0.000) 2.870 (1.680) 2.630 (1.390) 10.40 (4.950) 8.280 (3.880) 857
Lee & Liu 0.247 (0.021) 0.000 (0.000) 2.100 (0.054) 1.900 (0.065) 6.600 (0.250) 7.200 (0.430) 9073
Cai et al. 0.002 (0.001) 0.000 (0.000) 48.80 (15.10) 47.50 (13.40) 1615 (835.8) 1418 (680.7) 7533

50 Nonconvex 0.001 (0.000) 0.002 (0.003) 0.354 (0.034) 0.470 (0.073) 1.710 (0.214) 1.740 (0.249) 1295
Convex 0.098 (0.029) 0.000 (0.001) 7.480 (3.840) 6.280 (2.890) 25.20 (13.90) 18.90 (9.770) 1089
Lee & Liu 0.222 (0.011) 1e-04 (5e-04) 3.300 (0.089) 2.700 (0.079) 11.70 (0.560) 11.80 (0.660) 11136
Cai et al. 0.006 (0.001) 5e-05 (3e-04) 30.90 (3.500) 30.60 (3.100) 721.7 (141.4) 649.1 (112.8) 7297

Algorithm 1. The MM algorithm and the ADMM algorithm for solving (2).

Acknowledgments

Zhu’s research was supported in part by NSF grants DMS-1721445 and DMS-1712580. Li’s research was supported in part by NSF grant DMS-1613137 and NIH grant AG034570. The authors thank the editor, the associate editor and three referees for their valuable comments and suggestions.

References

  1. Arnsten AF, Li B-M. Neurobiology of executive functions: Catecholamine influences on prefrontal cortical functions. Biological Psychiatry. 2005;57(11):1377–1384. doi: 10.1016/j.biopsych.2004.08.019. [DOI] [PubMed] [Google Scholar]
  2. Biederman J, Mick E, Faraone SV. Age-dependent decline of symptoms of attention deficit hyperactivity disorder: Impact of remission definition and symptom type. American Journal of Psychiatry. 2000;157(5):816–818. doi: 10.1176/appi.ajp.157.5.816. [DOI] [PubMed] [Google Scholar]
  3. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning. 2011;3(1):1–122. [Google Scholar]
  4. Bush G, Valera EM, Seidman LJ. Functional neuroimaging of attention-deficit/hyperactivity disorder: A review and suggested future directions. Biological Psychiatry. 2005;57(11):1273–1284. doi: 10.1016/j.biopsych.2005.01.034. [DOI] [PubMed] [Google Scholar]
  5. Cai TT, Li H, Liu W, Xie J. Joint estimation of multiple high-dimensional precision matrices. Statistica Sinica. 2016;26:445–464. doi: 10.5705/ss.2014.256. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Cai TT, Liu W, Luo X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. J Amer Statist Assoc. 2011;106(494):594–607. [Google Scholar]
  7. Danaher P, Wang P, Witten DM. The joint graphical lasso for inverse covariance estimation across multiple classes. J R Statist Soc B. 2014;76(2):373–397. doi: 10.1111/rssb.12033. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Di Martino A, Yan C-G, Li Q, Denio E, Castellanos FX, Alaerts K, Anderson JS, Assaf M, Bookheimer SY, Dapretto M, Deen B, Delmonte S, Dinstein I, Ertl-Wagner B, Fair DA, Gallagher L, Kennedy DP, Keown CL, Keysers C, Lainhart JE, Lord C, Luna B, Menon V, Minshew NJ, Monk CS, Mueller S, Muller R-A, Nebel MB, Nigg JT, O’Hearn K, Pelphrey KA, Peltier SJ, Rudie JD, Sunaert S, Thioux M, Tyszka JM, Uddin LQ, Verhoeven JS, Wenderoth N, Wiggins JL, Mostofsky SH, Milham MP. The autism brain imaging data exchange: Towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular Psychiatry. 2014;19(6):659–667. doi: 10.1038/mp.2013.78. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Fan J, Feng Y, Wu Y. Network exploration via the adaptive lasso and SCAD penalties. Ann Appl Stat. 2009;3(2):521–541. doi: 10.1214/08-AOAS215SUPP. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001a;96(456):1348–1360. [Google Scholar]
  11. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association. 2001b;96(456):1348–1360. [Google Scholar]
  12. Fan J, Xue L, Zou H. Strong oracle optimality of folded concave penalized estimation. Annals of statistics. 2014;42(3):819. doi: 10.1214/13-aos1198. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Fornito A, Zalesky A, Breakspear M. Graph analysis of the human connectome: Promise, progress, and pitfalls. NeuroImage. 2013;80:426–444. doi: 10.1016/j.neuroimage.2013.04.087. [DOI] [PubMed] [Google Scholar]
  14. Fox MD, Greicius M. Clinical applications of resting state functional connectivity. Frontiers in Systems Neuroscience. 2010;4(19) doi: 10.3389/fnsys.2010.00019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. doi: 10.1093/biostatistics/kxm045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Friston K, Ashburner J, Kiebel S, Nichols T, Penny W, editors. Statistical Parametric Mapping: the Analysis of Functional Brain Images. Academic Press; London: 2007. [Google Scholar]
  17. Goetz M, Vesela M, Ptacek R. Notes on the role of the cerebellum in adhd. Austin J Psychiatry Behav Sci. 2014;1(3):1013. [Google Scholar]
  18. Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2011;98(1):1–15. doi: 10.1093/biomet/asq060. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Han F, Han X, Liu H, Caffo B. Sparse median graphs estimation in a high-dimensional semiparametric model. The Annals of Applied Statistics. 2016;10(3):1397–1426. [Google Scholar]
  20. Hunter DR, Lange K. A tutorial on mm algorithms. The American Statistician. 2004;58(1):30–37. [Google Scholar]
  21. Janková J, van de Geer S. Honest confidence regions and optimality in high-dimensional precision matrix estimation. arXiv preprint arXiv:1507.02061 2015 [Google Scholar]
  22. Kang J, Bowman FD, Mayberg H, Liu H. A depression network of functionally connected regions discovered via multi-attribute canonical correlation graphs. NeuroImage. 2016;141:431–441. doi: 10.1016/j.neuroimage.2016.06.042. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Lee JD, Hastie TJ. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics. 2015;24(1):230–253. doi: 10.1080/10618600.2014.900500. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Lee K-Y, Li B, Zhao H. On additive partial correlation operator and nonparametric estimation of graphical models. Biometrika. 2016 doi: 10.1093/biomet/asw028. in press. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Lee W, Liu Y. Joint estimation of multiple precision matrices with common structures. Journal of Machine Learning Research. 2015;16:1035–1062. [PMC free article] [PubMed] [Google Scholar]
  26. Leng C, Tang CY. Sparse matrix graphical models. J Amer Statist Assoc. 2012;107(499):1187–1200. [Google Scholar]
  27. Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High-dimensional semiparametric Gaussian copula graphical models. Ann Statist. 2012;40(4):2293–2326. [Google Scholar]
  28. Liu H, Roeder K, Wasserman L. Stability approach to regularization selection (stars) for high dimensional graphical models. In: Lafferty JD, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A, editors. Advances in Neural Information Processing Systems. Vol. 23. Curran Associates, Inc; 2010. pp. 1432–1440. [PMC free article] [PubMed] [Google Scholar]
  29. Loh P-L, Wainwright MJ. Support recovery without incoherence: A case for nonconvex regularization. arXiv preprint arXiv:1412.5632 2014 [Google Scholar]
  30. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006:1436–1462. [Google Scholar]
  31. Pelham WE, Foster EM, Robb JA. The economic impact of attention-deficit/hyperactivity disorder in children and adolescents. Ambulatory Pediatrics. 2007;7(1, Supplement):121–131. doi: 10.1016/j.ambp.2006.08.002. Measuring Outcomes in Attention Deficit Hyperactivity Disorder. [DOI] [PubMed] [Google Scholar]
  32. Peterson C, Stingo FC, Vannucci M. Bayesian inference of multiple gaussian graphical models. Journal of the American Statistical Association. 2015;110(509):159–174. doi: 10.1080/01621459.2014.896806. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Qiu H, Han F, Liu H, Caffo B. Joint estimation of multiple graphical models from high dimensional time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2016;78(2):487–504. doi: 10.1111/rssb.12123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron J Stat. 2011;5:935–980. [Google Scholar]
  35. Rudie J, Brown J, Beck-Pancer D, Hernandez L, Dennis E, Thompson P, Bookheimer S, Dapretto M. Altered functional and structural brain network organization in autism. NeuroImage: Clinical. 2013;2:79–94. doi: 10.1016/j.nicl.2012.11.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Shen X, Pan W, Zhu Y. Likelihood-based selection and sharp parameter estimation. J Amer Statist Assoc. 2012;107(497):223–232. doi: 10.1080/01621459.2011.645783. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Shen X, Pan W, Zhu Y, Zhou H. On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics. 2013;65(5):807–832. doi: 10.1007/s10463-012-0396-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Smith SM, Jenkinson M, Woolrich MW, Beckmann CF, Behrens TE, Johansen-Berg H, Bannister PR, Luca MD, Drobnjak I, Flitney DE, Niazy RK, Saunders J, Vickers J, Zhang Y, Stefano ND, Brady JM, Matthews PM. Advances in functional and structural MR image analysis and implementation as FSL. NeuroImage. 2004;23(Supplement 1):S208–S219. doi: 10.1016/j.neuroimage.2004.07.051. Mathematics in Brain Imaging. [DOI] [PubMed] [Google Scholar]
  39. Toplak ME, Dockstader C, Tannock R. Temporal information processing in ADHD: Findings to date and new methods. Journal of Neuroscience Methods. 2006;151(1):15–29. doi: 10.1016/j.jneumeth.2005.09.018. Towards a Neuroscience of Attention-Deficit/Hyperactivity Disorder (ADHD). [DOI] [PubMed] [Google Scholar]
  40. Toussaint P-J, Maiz S, Coynel D, Doyon J, Messé A, de Souza LC, Sarazin M, Perlbarg V, Habert M-O, Benali H. Characteristics of the default mode functional connectivity in normal ageing and Alzheimer’s disease using resting state fMRI with a combined approach of entropy-based and graph theoretical measurements. NeuroImage. 2014;101:778–786. doi: 10.1016/j.neuroimage.2014.08.003. [DOI] [PubMed] [Google Scholar]
  41. Tsiligkaridis T, Hero AO, III, Zhou S. On convergence of Kronecker graphical lasso algorithms. IEEE Trans Signal Process. 2013;61(7):1743–1755. [Google Scholar]
  42. Tyszka JM, Kennedy DP, Paul LK, Adolphs R. Largely typical patterns of resting-state functional connectivity in high-functioning adults with autism. Cerebral Cortex. 2014;24(7):1894–1905. doi: 10.1093/cercor/bht040. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Tzourio-Mazoyer N, Landeau B, Papathanassiou D, Crivello F, Etard O, Delcroix N, Mazoyer B, Joliot M. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage. 2002;15(1):273–289. doi: 10.1006/nimg.2001.0978. [DOI] [PubMed] [Google Scholar]
  44. Varoquaux G, Craddock RC. Learning and comparing functional connectomes across subjects. NeuroImage. 2013;80:405–415. doi: 10.1016/j.neuroimage.2013.04.007. Mapping the Connectome. [DOI] [PubMed] [Google Scholar]
  45. Vissers ME, Cohen MX, Geurts HM. Brain connectivity and high functioning autism: A promising path of research that needs refined models, methodological convergence, and stronger behavioral links. Neuroscience and Biobehavioral Reviews. 2012;36(1):604–625. doi: 10.1016/j.neubiorev.2011.09.003. [DOI] [PubMed] [Google Scholar]
  46. Xia Y, Li L. Hypothesis testing of matrix graph model with application to brain connectivity analysis. Biometrics. 2017 doi: 10.1111/biom.12633. in press. [DOI] [PubMed] [Google Scholar]
  47. Yin J, Li H. Model selection and estimation in the matrix normal graphical model. Journal of Multivariate Analysis. 2012;107:119–140. doi: 10.1016/j.jmva.2012.01.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):19–35. [Google Scholar]
  49. Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. Ann Statist. 2010;38(2):894–942. [Google Scholar]
  50. Zhang T, Wu J, Li F, Caffo B, Boatman-Reich D. A dynamic directional model for effective brain connectivity using electrocorticographic (ECoG) time series. J Amer Statist Assoc. 2015;110(509):93–106. doi: 10.1080/01621459.2014.988213. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Zhang T, Zou H. Sparse precision matrix estimation via lasso penalized d-trace loss. Biometrika. 2012;99(1):1–18. [Google Scholar]
  52. Zhou S. Gemini: graph estimation with matrix variate normal instances. Ann Statist. 2014;42(2):532–562. [Google Scholar]
  53. Zhu H, Strawn N, Dunson DB. Bayesian graphical models for multivariate functional data. Journal of Machine Learning Research. 2016;17(204):1–27. [Google Scholar]
  54. Zhu Y, Shen X, Pan W. Structural Pursuit Over Multiple Undirected Graphs. J Amer Statist Assoc. 2014;109(508):1683–1696. doi: 10.1080/01621459.2014.921182. [DOI] [PMC free article] [PubMed] [Google Scholar]

Supplementary Materials

Supp info