Author manuscript; available in PMC: 2015 Sep 21.
Published in final edited form as: J Am Stat Assoc. 2014 Sep;109(507):1188–1204. doi: 10.1080/01621459.2014.882842

On an Additive Semigraphoid Model for Statistical Networks With Application to Pathway Analysis

Bing Li 1, Hyonho Chun 2, Hongyu Zhao 3
PMCID: PMC4577248  NIHMSID: NIHMS561317  PMID: 26401064

Abstract

We introduce a nonparametric method for estimating non-gaussian graphical models based on a new statistical relation called additive conditional independence, which is a three-way relation among random vectors that resembles the logical structure of conditional independence. Additive conditional independence allows us to use one-dimensional kernel regardless of the dimension of the graph, which not only avoids the curse of dimensionality but also simplifies computation. It also gives rise to a parallel structure to the gaussian graphical model that replaces the precision matrix by an additive precision operator. The estimators derived from additive conditional independence cover the recently introduced nonparanormal graphical model as a special case, but outperform it when the gaussian copula assumption is violated. We compare the new method with existing ones by simulations and in genetic pathway analysis.

Key words and phrases: Additive conditional independence, additive precision operator, copula, conditional independence, covariance operator, gaussian graphical model, nonparanormal graphical model, reproducing kernel Hilbert space

1 Introduction

The gaussian graphical model is one of the most important tools for statistical network analysis, partly because its estimation can be neatly formulated as identifying the zero entries in the precision matrix, which turns estimation of a graph into sparse estimation of a high-dimensional precision matrix. For recent advances in this area, see Meinshausen and Buhlmann (2006), Yuan and Lin (2007), Bickel and Levina (2008), Peng, Wang, Zhou, and Zhu (2009), Guo, Levina, Michailidis, and Zhu (2009), Lam and Fan (2009).

Specifically, let X = (X1, …, Xp) be a p-dimensional random vector distributed as N(μ, Σ), where Σ is positive definite. Let Γ = {1, …, p}, E ⊆ {(i, j) ∈ Γ × Γ : i ≠ j}, and Θ = Σ−1. Let θij be the (i, j)th element of Θ. Then X follows a gaussian graphical model (GGM) with respect to the graph 𝓖 = {Γ, E} iff

Xi ⫫ Xj ∣ X−(i,j),  (i, j) ∉ E,  (1)

where X−(i,j) represents the set {Xk : k ≠ i, j}. Under the multivariate gaussian assumption, the above relation is equivalent to

θij = 0,  (i, j) ∉ E.  (2)

Based on this equivalence the estimation of 𝓖 reduces to identification of the 0 entries in Θ, which can be done by sparse estimation procedures such as thresholding (Bickel and Levina, 2008), LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), and adaptive LASSO (Zou, 2006).

Since the multivariate gaussian assumption limits the applicability of the GGM, there have been intense developments in generalizing it to non-gaussian cases in the past few years, chiefly through the use of nonparametric copula transformations. See Liu, Lafferty, and Wasserman (2009), Liu, Han, Yuan, Lafferty, and Wasserman (2013), and Xue and Zou (2013). The nonparametric gaussian copula graphical model (GCGM) is a flexible and effective model that relaxes the marginal gaussian assumption but preserves the inter-dependence structure of a multivariate gaussian distribution, so that the equivalence between (1) and (2) still holds in an altered form. It assumes the existence of unknown injections f1, …, fp such that (f1(X1), …, fp(Xp)) is multivariate gaussian. The GCGM keeps the simplicity and elegance of the GGM but greatly expands its scope.

However, the gaussian copula assumption can be violated even for commonly used interactions. For example, if (U, V) is a random vector where V = U² + ε, and U and ε are i.i.d. N(0, 1), then no transformation can render (f1(U), f2(V)) bivariate gaussian. In fact, the copula transformations in this case are: f1(u) = u, f2(v) = Φ−1(FV(v)), where FV is the c.d.f. of the convolution of N(0, 1) and χ1², and Φ is the c.d.f. of N(0, 1). Figure 1 plots (U, Φ−1(FV(V))) for a sample of 100 observed pairs, which shows a strong non-gaussian pattern even though U and Φ−1(FV(V)) are marginally gaussian. Under such circumstances the gaussian copula assumption no longer holds, and our goal is to develop a broader class of graphical models that cover such cases.

Figure 1. Violation of the copula assumption. The vertical axis is Φ−1(FV(V)). Both U and Φ−1(FV(V)) are N(0, 1), but their joint distribution is strongly non-gaussian.
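To make the example concrete, the following is a minimal simulation sketch of this copula-violation mechanism, assuming Python with numpy and scipy; the Monte Carlo approximation of FV and all variable names are illustrative devices, not part of the original analysis.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Copula-violation example: V = U^2 + eps with U, eps i.i.d. N(0, 1).
n = 100
U = rng.standard_normal(n)
V = U**2 + rng.standard_normal(n)

# F_V (the convolution of N(0,1) and chi^2_1) has no simple closed form,
# so approximate it with a large Monte Carlo reference sample.
ref = np.sort(rng.standard_normal(10**6)**2 + rng.standard_normal(10**6))
F_V = (np.searchsorted(ref, V) + 1) / (len(ref) + 2)

# Marginal gaussianization: both coordinates are (approximately) N(0, 1),
# yet their joint distribution retains a quadratic, non-gaussian pattern.
Z = norm.ppf(F_V)
print(np.corrcoef(U, Z)[0, 1])        # near-zero linear correlation
print(np.corrcoef(U**2, Z)[0, 1])     # strong dependence through U^2
```

Plotting Z against U reproduces the pattern of Figure 1.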

In developing the new graphical models we would like to preserve an appealing feature of the GCGM; that is, the nonparametric operations are applied to one-dimensional random variables individually, rather than to a high-dimensional random vector jointly. This feature allows us to avoid high-dimensional smoothing, which is the source of the curse of dimensionality (Bellman, 1957). However, if we insist on using conditional independence (1) as the criterion for constructing graphical models, then we are inevitably led to a fully fledged nonparametric procedure involving smoothing over the entire p-dimensional space, and the nice structures of the gaussian graphical model such as (2) would be lost. This motivates us to introduce a new statistical relation, additive conditional independence (ACI), to replace conditional independence as our criterion. Although ACI is different from conditional independence, it satisfies the axioms for a semi-graphoid (Pearl and Verma, 1987) shared by conditional independence, which captures the essence of a graph. Furthermore, under the gaussian copula assumption, ACI reduces to conditional independence. The graphical model defined by ACI, which we refer to as the Additive Semi-graphoid Model (ASG), preserves a structure parallel to (2), in which the precision matrix is replaced by an additive precision operator.

The paper is organized as follows. In section 2 we outline the axiomatic system for a semi-graphoid logical relation, and introduce the notion of ACI as well as the ASG model it induces. In section 3 we investigate the relation between ACI and conditional independence under the gaussian copula assumption. In sections 4 and 5, we introduce the additive conditional covariance operator and additive precision operator, and study their relations with ACI. In sections 6 and 7 we introduce two estimators for the ASG model based on these operators. In section 8 we propose cross-validation procedures to determine the tuning parameters. In sections 9 and 10 we compare the ASG model with GGM and GCGM by simulation and in an applied setting of genetic pathway inference. We conclude in section 11 with suggested future research directions. All proofs are presented in the Appendix.

2 Semi-graphoid axioms and additive conditional independence

Before introducing additive conditional independence, we first restate the axiomatic system developed by Pearl and Verma (1987) and Pearl, Geiger, and Verma (1989), in a slightly different form. Let 2Γ be the class of all subsets of Γ. We call any subset ℛ ⊆ 2Γ × 2Γ × 2Γ a three-way relation on Γ. This terminology follows the tradition of two-way relations in mathematics (see, for example, Kelley, 1955, page 6).

Definition 1 (Semi-graphoid Axioms)

A three-way relation ℛ is called a semi-graphoid if it satisfies the following conditions:

  1. (symmetry) (A, C, B) ∈ ℛ ⇒ (B, C, A) ∈ ℛ;

  2. (decomposition) (A, C, B ∪ D) ∈ ℛ ⇒ (A, C, B) ∈ ℛ;

  3. (weak union) (A, C, B ∪ D) ∈ ℛ ⇒ (A, C ∪ B, D) ∈ ℛ;

  4. (contraction) (A, C ∪ B, D) ∈ ℛ, (A, C, B) ∈ ℛ ⇒ (A, C, B ∪ D) ∈ ℛ.

These axioms are extracted from conditional independence to convey the general idea of “B is irrelevant for understanding A once C is known”, or “C separates A and B”. Conditional independence is a special case of the semi-graphoid three-way relation. More specifically, let X = {Xi : i ∈ Γ} be a random vector and, for any A ⊆ Γ, let XA be the set {Xi : i ∈ A}. Let ℛ be the class of (A, C, B) such that XA ⫫ XB ∣ XC (XA and XB are conditionally independent given XC); then it can be shown that ℛ indeed satisfies the semi-graphoid axioms (Dawid, 1979).

Because, like conditional independence, a semi-graphoid relation (A, C, B) ∈ ℛ also conveys the idea that C separates A and B, it seems reasonable to use the semi-graphoid relation itself as a criterion for constructing a graph. Indeed, the very statement “there is no direct link between node i and node j” means, precisely, that “{i, j}c separates i and j”. In other words, it is reasonable to use the logical statement ({i}, {i, j}c, {j}) ∈ ℛ to determine the absence of edge between i and j. Using the semi-graphoid relation in place of probabilistic conditional independence can free us from some awkward restrictions that come with a fully fledged probability model.

Let X = (X1, …, Xp) be a random vector. For any subvector U = (U1, …, Ur) of X, let L2(PU) be the class of functions of U such that Ef(U) = 0, Ef2(U) < ∞. For each Ui, let 𝓐Ui denote a subset of L2(PUi). Let 𝓐U denote the additive family

𝓐U1 + ⋯ + 𝓐Ur = {f1 + ⋯ + fr : f1 ∈ 𝓐U1, …, fr ∈ 𝓐Ur}.

For two subspaces 𝓐 and 𝓑 of L2(PX), let 𝓐 ⊖ 𝓑 = 𝓐 ∩ 𝓑⊥, where orthogonality is in terms of the L2(PX)-inner product. In the following, an unadorned inner product 〈·, ·〉 always represents the L2(PX)-inner product.

Definition 2

Let U, V, and W be subvectors of X. We say that U and V are additively conditionally independent (ACI) given W iff

(𝓐U + 𝓐W) ⊖ 𝓐W ⊥ (𝓐V + 𝓐W) ⊖ 𝓐W,  (3)

where ⊥ indicates orthogonality in the L2(PX)-inner product. We write relation (3) as U ⫫A V ∣ W.

Strictly speaking, ACI should be stated with respect to the triplet of functional families (𝓐U, 𝓐V, 𝓐W), but in most cases we can ignore this qualification without loss of clarity. Lauritzen (1996, page 31) mentioned a relation similar to (3) where 𝓐U, 𝓐V, and 𝓐W are Euclidean subspaces, but absent from that construction are the underlying random vectors U, V, W, which are crucial to our development.

The next theorem gives seven equivalent conditions for ACI, each of which can be used as its definition. As we will see in the proof of Theorem 2, these various forms are immensely useful in exploring the ramifications of ACI. In the following, for an operator P on a Hilbert space 𝓗, let ker P and ranP denote the kernel and range of P, and let ran¯P be the closure of ranP.

Theorem 1

Let U, V, W be subvectors of X. The following statements are equivalent:

  1. (𝓐U + 𝓐W) ⊖ 𝓐W ⊥ (𝓐V + 𝓐W) ⊖ 𝓐W;

  2. 𝓐UW ⊖ 𝓐W ⊥ 𝓐VW ⊖ 𝓐W;

  3. (IP𝓐W) 𝓐U ⊥ (IP𝓐W)𝓐V;

  4. 𝓐U ⊆ ker(P𝓐VW ⊖ 𝓐W);

  5. 𝓐U ⊆ ker(P𝓐VWP𝓐W);

  6. 𝓐UW ⊆ ker(P𝓐VW ⊖ 𝓐W);

  7. 𝓐UW ⊆ ker(P𝓐VWP𝓐W).

The next theorem shows that ACI satisfies the semi-graphoid axioms, which gives justification of using it to construct graphical models.

Theorem 2

The additive conditional independence relation is a semi-graphoid.

We are now ready to define the additive semi-graphoid model.

Definition 3

A random vector X follows an additive semi-graphoid model with respect to a graph 𝓖 = (Γ, E) iff

Xi ⫫A Xj ∣ X−(i,j),  (i, j) ∉ E.  (4)

We write this condition as X ~ ASG(𝓖).

3 Relation with copula graphical models

While in the last section we have seen that it is reasonable to use ACI instead of conditional independence as a criterion for constructing graphs, we now investigate the special cases where ACI reduces to conditional independence.

Theorem 3

Suppose

  1. X has a gaussian copula distribution with copula functions f1, …, fp;

  2. U, V, and W are subvectors of X;

  3. 𝓐Xi = span{fi}, i = 1, …, p.

Then U ⫫A V ∣ W if and only if U ⫫ V ∣ W.

Under the specific form of the ASG model in Theorem 3, there is no edge between i and j iff the (i, j)th entry of the precision matrix of (f1(X1), …, fp(Xp)) is 0. The latter is the population-level criterion used by the GCGM estimator. In this sense, ASG can be regarded as a generalization of the nonparanormal models.

Since our main interests are in the ASG models with 𝓐Xi = L2(PXi), it is also of interest to ask whether ⫫A coincides with ⫫ in this case. The next theorem and the subsequent numerical investigation show that they are very nearly equivalent for all practical purposes.

Theorem 4

Suppose

  1. X has a gaussian copula distribution with copula functions f1, …, fp;

  2. U, V, and W are sub-vectors of X;

    3. 𝓐Xi = L2(PXi).

Then U ⫫A V ∣ W implies U ⫫ V ∣ W.

The proof, which is omitted, is essentially the same as that of Theorem 3. Interestingly, under 𝓐Xi = L2(PXi), ⫫ does not imply ⫫A even under the gaussian copula assumption. However, the following numerical investigation suggests that this implication holds approximately, with vanishingly small error. We are able to carry out this investigation because an upper bound of

ρUV|W² = sup{⟨f − P𝓐Wf, g − P𝓐Wg⟩² : f ∈ L2(PU), g ∈ L2(PV), ||f|| = ||g|| = 1}

can be calculated explicitly under the copula gaussian distribution. To our knowledge this bound has not been recorded in the literature, so it is of interest in its own right.

Proposition 1

Suppose X has a gaussian copula distribution with copula functions f1, …, fp. Without loss of generality, suppose E[fi(Xi)] = 0, var[fi(Xi)] = 1. Let

U = fi(Xi),  V = fj(Xj),  W = {fk(Xk) : k ≠ i, j},

and RVW = cov(V, W), RWW = var(W), RWU = cov(W, U), ρUV = cor(U, V). Then

ρUV|W ≤ max{|ρUV^α − RVW^α(RWW^α)^{−1}RWU^α| : α = 1, 2, …} ≡ τUV|W,  (5)

where A^α is the α-fold Hadamard product A ⊙ ⋯ ⊙ A and |·| denotes the determinant.

Using this proposition we conduct the following numerical investigation. First we generate a positive definite random matrix C with C12 = C21 = 0 as follows. We generate a matrix C̃ from the Wishart distribution W(Ip, k), where k is the degrees of freedom, and then set C̃12 and C̃21 to 0. This does not guarantee C̃ to be positive definite. If it is not, we repeat this process until we get a positive definite matrix, which is then set to be C. Let RXX = [diag(C)]^{1/2}C^{−1}[diag(C)]^{1/2}. Let RWU, RWV, and RWW be the (p − 2) × 1, (p − 2) × 1, and (p − 2) × (p − 2) matrices

{(RXX)i1 : i = 3, …, p},  {(RXX)i2 : i = 3, …, p},  {(RXX)ij : i, j = 3, …, p},

and let RUW = RWU⊤, RVW = RWV⊤. Note that, if X ~ N(0, RXX), then it is guaranteed that U ⫫ V ∣ W, and var(Xi) = 1.

We choose the degrees of freedom k to be relatively small to prevent C̃/k from converging to Ip, but large enough to avoid singularity. In particular, we take k = ⌈5p/4⌉, ⌈6p/4⌉, …, ⌈8p/4⌉, which guarantees a wide range of covariance matrices. We take p = 20, 40, …, 100. We compute τUV|W by taking the maximum of the first 10 terms in (5), which in our simulations is always the global maximum. For each combination of (p, k), we generate 500 random matrices and present the averages of τUV|W in Table 1. Because ρUV|W = 0 iff U ⫫A V ∣ W, the small numbers in Table 1 indicate that for all practical purposes we can treat U ⫫ V ∣ W ⇒ U ⫫A V ∣ W as valid under the gaussian copula model, even though it is not a mathematical fact.

Table 1.

Average values of τUV|W under U ⫫ V ∣ W

p      k = ⌈5p/4⌉   k = ⌈6p/4⌉   k = ⌈7p/4⌉   k = ⌈8p/4⌉
20     0.00310      0.00270      0.00114      0.00087
40     0.00472      0.00240      0.00095      0.00051
60     0.00485      0.00185      0.00069      0.00024
80     0.00435      0.00183      0.00053      0.00018
100    0.00483      0.00125      0.00036      0.00018
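The Monte Carlo scheme behind Table 1 can be sketched as follows, assuming numpy; the function name is illustrative and k = 25 corresponds to ⌈5p/4⌉ for p = 20.

```python
import numpy as np

rng = np.random.default_rng(1)

def tau_uvw(p=20, k=25, n_terms=10):
    # Draw a Wishart(I_p, k) matrix, zero its (1,2)/(2,1) entries, and
    # repeat until the result is positive definite.
    while True:
        W = rng.standard_normal((k, p))
        C = W.T @ W
        C[0, 1] = C[1, 0] = 0.0
        if np.all(np.linalg.eigvalsh(C) > 0):
            break
    D = np.diag(np.sqrt(np.diag(C)))
    R = D @ np.linalg.inv(C) @ D                  # R_XX as described above
    RWU, RWV, RWW = R[2:, [0]], R[2:, [1]], R[2:, 2:]
    rho_uv = R[0, 1]
    # tau_{UV|W}: maximum over the first n_terms Hadamard powers in (5)
    vals = [abs((rho_uv**a - (RWV**a).T @ np.linalg.inv(RWW**a) @ (RWU**a)).item())
            for a in range(1, n_terms + 1)]
    return max(vals)

print(np.mean([tau_uvw() for _ in range(50)]))    # small, as in Table 1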

In a similar way to how it relates to the gaussian copula graphical models, the ASG is also related to the more general transelliptical graphical models recently introduced by Liu, Han, and Zhang (2012). A random vector X is said to have a transelliptical distribution with shape parameter Σ (a p × p positive definite matrix) if there exist injections f1, …, fp from ΩX to ℝ such that the distribution of U = (f1(X1), …, fp(Xp)) depends on u only through u⊤Σ^{−1}u. Furthermore, X is said to follow a transelliptical graphical model with respect to a graph 𝓖 = (Γ, E) iff X has a transelliptical distribution and E = {(i, j) : (Σ^{−1})ij ≠ 0}. In this case we write X ~ TEG(𝓖). The following corollary states that the ASG reduces to the transelliptical graphical model if the additive families 𝓐Xi are chosen to be span(fi) and the underlying distribution of X is assumed to be transelliptical. Its proof is parallel to that of Theorem 3 and is omitted.

Corollary 1

Suppose

  1. X has a transelliptical distribution with respect to a positive definite shape parameter Σ and injections f1, …, fp;

  2. 𝓐Xi = span(fi).

Then X ~ ASG(𝓖) iff X ~ TEG(𝓖).

4 Additive conditional covariance operator

In this and the next sections we introduce two linear operators that will give rise to two estimation procedures for ASG. The first is the additive conditional covariance operator (ACCO). Consider the bilinear form

𝓐U × 𝓐V → ℝ,  (f, g) ↦ E[f(U)g(V)].

By the Riesz representation theorem, this induces a bounded linear operator ΣUV : 𝓐V → 𝓐U such that ⟨f, ΣUVg⟩ = E[f(U)g(V)]. We call the operator

ΣUV|W = ΣUV − ΣUWΣWV  (6)

the additive conditional covariance operator from V to U given W. This operator is different from the conditional variance operator introduced by Fukumizu, Bach, and Jordan (2009), because of the additive nature of the functional spaces 𝓐U, 𝓐V, and 𝓐W. In particular, the following identity no longer holds:

⟨f, ΣUV|Wg⟩ = E[cov(f(U), g(V) ∣ W)].

In the following, for a Hilbert space 𝓗, a subspace 𝓛 of 𝓗, and an operator T defined on 𝓗, we use T|𝓛 to denote the operator T restricted on 𝓛. That is, T|𝓛 is a mapping from 𝓛 to 𝓗 with (T|𝓛)f = Tf for all f ∈ 𝓛. The following lemma gives a relation between the projection operator and the covariance operator.

Lemma 1

Let P𝓐W: 𝓐X → 𝓐W be the projection on to 𝓐W. Then P𝓐W|𝓐U = ΣWU.

The next theorem shows that ΣUV|W = 0 iff U ⫫A V ∣ W, which is the basis for our first ASG estimator. This resembles the relation between the conditional covariance operator and conditional independence; that is, if CUV|W is the conditional covariance operator in Fukumizu, Jordan, and Bach (2009), then CUV|W = 0 iff U ⫫ V ∣ W.

Theorem 5

U ⫫A V ∣ W iff ΣUV|W = 0.

5 Additive precision operator

We now introduce the second linear operator – the additive precision operator (APO), and show that a statement resembling the equivalence of (1) and (2) can be made for ASG if we replace the precision matrix by APO. As a result, the estimation of the ASG model can be turned into the sparse estimation of APO.

As a preparation, we first develop the notion of a matrix of operators on a direct-sum functional space. We regard 𝓐X as a Hilbert space equipped with the direct-sum inner product

⟨f1 + ⋯ + fp, g1 + ⋯ + gp⟩𝓐X = Σ_{i=1}^p ⟨fi, gi⟩𝓐Xi.

Suppose for each i, j we are given a linear operator TXjXi: 𝓐Xi → 𝓐Xj. Then we can construct a linear operator T : 𝓐X → 𝓐X as follows

T(f1 + ⋯ + fp) = Σ_{i=1}^p Σ_{j=1}^p TXjXi fi.  (7)

This construction is legitimate because in the additive case f1 + ··· + fp uniquely determines (f1, …, fp). In fact, if we fix (c2, …, cp) in ΩX2 × ··· × ΩXp and vary x1 in ΩX1, then using E[f1(X1)] = 0 we have

f1(x1) = [f1(x1) + f2(c2) + ⋯ + fp(cp)] − E[f1(X1) + f2(c2) + ⋯ + fp(cp)].

Conversely, a linear operator T : 𝓐X → 𝓐X uniquely determines operators TXjXi : 𝓐Xi → 𝓐Xj that satisfies (7), as follows. For each fi ∈ 𝓐Xi, Tfi can be represented as h1 + ··· + hp, where hj ∈ 𝓐Xj. Moreover, h1 + ··· + hp uniquely determines (h1, …, hp), and hence also hj. We define TXjXi : 𝓐Xi → 𝓐Xj by TXjXi(fi) = hj. Thus, there is a one-to-one correspondence between T and {TXjXi} and they are related by (7). Henceforth we denote this one-to-one correspondence by the equality

T = {TXjXi : i, j = 1, …, p}.

Moreover, T is bounded iff each TXjXi is bounded, as shown in the next proposition.

Proposition 2

The following statements are equivalent

  1. TXjXi ∈ 𝓑 (𝓐Xi, 𝓐Xj) for i, j = 1, …, p;

  2. T ∈ 𝓑 (𝓐X, 𝓐X).

Our construction of T = {TXjXi} echoes the construction of Bach (2008). However, our construction is relative to a different inner product and will play a new role under ACI. We are now ready to introduce the additive precision operator.

Definition 4

The operator ϒ : 𝓐X → 𝓐X defined by the matrix of operators {ΣXiXj: i ∈ Γ, j ∈ Γ} is called the additive covariance operator. If it is invertible then its inverse Θ is called the additive precision operator.

It can be easily checked that, for any f, g ∈ 𝓐X,

⟨f, ϒg⟩𝓐X = cov[f(X), g(X)] = ⟨f, g⟩.  (8)

This relation also shows why we equipped 𝓐X with the direct-sum inner product, rather than the L2(PX)-inner product, when defining ϒ: under the L2(PX)-inner product, (8) would force ϒf = f for any f ∈ 𝓐X; that is, ϒ would be the identity mapping from 𝓐X to 𝓐X. In the meantime, we note that, by construction, ϒXiXj = ΣXiXj for all i, j = 1, …, p.

We should note that it is generally unreasonable to assume that the inverse of a covariance operator is a bounded operator, because the sequence of eigenvalues of such an operator typically tends to 0. However, in our construction, the inner product ⟨fi, fi⟩𝓐X is the marginal variance var[fi(Xi)]. Thus our covariance operator ϒ is more like a correlation than a covariance. In particular, the ϒii are identity operators by construction. If we assume the ϒij are compact for all i ≠ j, then ϒ is of the form “identity plus compact perturbation”. Such operators, if invertible, are bounded from below by cI for some c > 0 (Bach, 2008), and therefore have bounded inverses. We summarize this fact in the following proposition.

Proposition 3

Suppose ϒ is invertible and, for each ij, ϒij is compact. Then ϒ−1 is bounded.

To establish the relation between the precision operator Θ and ACI, we need the block inversion formulas for operators. For any subvector U of X, let V denote Uc, and define the matrices of operators

TUU = {TXiXj : Xi ∈ U, Xj ∈ U},  TUV = {TXiXj : Xi ∈ U, Xj ∈ V},  TVU = {TXiXj : Xi ∈ V, Xj ∈ U},  TVV = {TXiXj : Xi ∈ V, Xj ∈ V}.

From these definitions it follows that TUU 𝓐U ⊆ 𝓐U, and

⟨f, TUUg⟩𝓐X = ⟨f, Tg⟩𝓐X for any f, g ∈ 𝓐U.

The same can be said of TVV, TUV, and TVU. The following operator identities resemble the matrix identities derived from block-matrix inverse.

Lemma 2

Suppose that T ∈ 𝓑 (𝓐X, 𝓐X) is a self adjoint, positive definite operator. Then: the operators

TUU,  TVV,  TUU − TUVTVV^{−1}TVU,  TVV − TVUTUU^{−1}TUV

are bounded, self adjoint, and positive definite, and the following identities hold

(T^{−1})UU = (TUU − TUVTVV^{−1}TVU)^{−1},  (T^{−1})VV = (TVV − TVUTUU^{−1}TUV)^{−1},
(T^{−1})UV = −(T^{−1})UUTUVTVV^{−1} = −TUU^{−1}TUV(T^{−1})VV,  (T^{−1})VU = −(T^{−1})VVTVUTUU^{−1} = −TVV^{−1}TVU(T^{−1})UU,  (9)

where (T−1)UU and (T−1)VV are the blocks of T−1 corresponding to 𝓐U and 𝓐V.

The above lemma shows that if ϒ has a bounded inverse, then ΘUV is a well defined operator in 𝓑(𝓐V, 𝓐U). Moreover, it behaves just like the off-diagonal block of the inverse of a covariance matrix. As we will see in the Appendix, manipulating the blocks of ϒ and Θ in the same way we do with a gaussian covariance matrix leads to the following equivalence.

Theorem 6

Suppose (U, V, W) is a partition of X and ϒ is invertible. Then U ⫫A V ∣ W iff ΘUV = 0.

This relation is parallel to the equivalence between (1) and (2) in the gaussian case, which means we can turn estimating ASG into estimating the locations of the zero entries of APO, as shown in the next corollary.

Corollary 2

A random vector X follows an ASG(𝓖) model if and only if ΘXiXj = 0 whenever (i, j) ∉ E.

Note that the graph ASG(𝓖) is based entirely on pairwise additive conditional independence (4), which is similar to the pairwise Markov property for classical graphical models based on conditional independence. In the classical setting, pairwise Markov property with respect to a graph 𝓖 = (Γ, E) means

Xi ⫫ Xj ∣ X−(i,j),  (i, j) ∉ E.

Under fairly mild conditions, this property also implies the global Markov property, which means if A, B, C are subsets of Γ and C separates A and B in 𝓖, then

XA ⫫ XB ∣ XC.

See Lauritzen (1996, Chapter 3). It is of theoretical interest to see if the parallel implication holds true in our additive setting. For convenience, let us agree to call property (4) the pairwise additive Markov property with respect to 𝓖. We are interested in whether XA ⫫A XB ∣ XC holds whenever C separates A and B according to the graph 𝓖 determined by (4), a property that can be appropriately defined as the global additive Markov property with respect to 𝓖.

Using Theorem 6 we can easily derive a partial answer to this question, as follows. Suppose {A, B, C} is a partition of Γ and C separates A and B according to 𝓖 defined by (4). Then all paths between A and B intersect C. In other words, the blocks ΘAB and ΘBA of Θ consist of zero operators. By Theorem 6, this is equivalent to XA ⫫A XB ∣ XC. We will leave it for future research to establish whether this statement is true when {A, B, C} is not necessarily a partition.

In the next two sections we introduce two estimators of the ASG model, one based on the ACCO introduced in the last section, and one based on the APO.

6 The ACCO-based estimator

By Theorem 5, it is reasonable to determine the absence of edges by smallness of the sample estimate of ||ΣXiXj|X−(i,j)||. To develop this estimate, we first derive the finite-sample matrix representation of the operator ΣXiXj|X−(i,j).

For each i = 1, …, p, let κi : ΩXi × ΩXi → ℝ be a positive definite kernel. Let

𝓐Xi = span{κi(·, X1i) − Enκi(Xi, X1i), …, κi(·, Xni) − Enκi(Xi, Xni)},

where Enκi(Xi, Xki) = n^{−1} Σ_{ℓ=1}^n κi(Xℓi, Xki). Let In be the n × n identity matrix and 1n be the n-dimensional vector with all its entries equal to 1. Let Q be the projection matrix In − 1n1n⊤/n. Let κ̊i denote the vector-valued function

xi ↦ (κi(xi, X1i) − Enκi(Xi, X1i), …, κi(xi, Xni) − Enκi(Xi, Xni))⊤.

Then, any f ∈ 𝓐Xi can be expressed as f = κ̊i⊤[f] for some [f] ∈ ℝn, which is called the coordinate of f in the spanning system κ̊i. A coordinate system is relative to a class of functions, and sometimes we need to use symbols such as [f]𝓐Xi for clarity. But in most of the discussions below this is unnecessary. In this coordinate notation we have (f(X1i), …, f(Xni))⊤ = QKi[f], where Ki = {κi(Xki, Xℓi) : k, ℓ = 1, …, n}. Also, for any f, g ∈ 𝓐Xi,

⟨f, g⟩𝓐Xi = n^{−1}(QKi[f])⊤(QKi[g]) = n^{−1}[f]⊤KiQKi[g].

Let K−(i,j) denote the n(p − 2) × n matrix obtained by removing the ith and jth blocks of the np × n matrix (K1, …, Kp). Then, for any f ∈ 𝓐X−(i,j), we have

(f(X1−(i,j)), …, f(Xn−(i,j)))⊤ = QK−(i,j)⊤[f]  (10)

for some [f] ∈ ℝn(p−2). Finally, for any linear operator T defined on a finite dimensional Hilbert space 𝓗, there is a matrix [T] such that [T f] = [T][f] for each function f ∈ 𝓗. The matrix [T] is called the matrix representation of T. This notation system is adopted from Horn and Johnson (1985) and was used in Li, Artemiou, and Li (2011), Li, Chun, and Zhao (2012), and Lee, Li, and Chiaromonte (2013).
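For readers who prefer code, a minimal sketch of the centered kernel quantities used above, assuming numpy and the gaussian radial basis kernel (the value gamma = 0.5 is an arbitrary illustration):

```python
import numpy as np

def rbf_gram(x, gamma):
    """Gram matrix K_i = {kappa_i(x_k, x_l)} for the gaussian radial basis kernel."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-gamma * d2)

def centering(n):
    """Q = I_n - 1_n 1_n^T / n, the centering projection used throughout."""
    return np.eye(n) - np.ones((n, n)) / n

# Example: kernel matrix of one coordinate X^i and the product K_i Q K_i,
# which appears in (11)-(14) and in the inner product on A_{X^i}.
x_i = np.random.default_rng(2).standard_normal(50)
K_i = rbf_gram(x_i, gamma=0.5)
Q = centering(len(x_i))
KQK = K_i @ Q @ K_i
```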

Lemma 3

The following relations hold:

KiQKi[ΣXiXj] = KiQKj,  KiQKi[ΣXiX−(i,j)] = KiQK−(i,j)⊤,  K−(i,j)QK−(i,j)⊤[ΣX−(i,j)Xi] = K−(i,j)QKi.  (11)

Because KiQKi and K−(i,j)QK−(i,j)⊤ are singular, we solve the equations in (11) by Tychonoff regularization:

[ΣXiXj] = (KiQKi + ε1In)^{−1}KiQKj,  [ΣXiX−(i,j)] = (KiQKi + ε1In)^{−1}KiQK−(i,j)⊤,  [ΣX−(i,j)Xi] = (K−(i,j)QK−(i,j)⊤ + ε2In(p−2))^{−1}K−(i,j)QKi.  (12)

Note that we use two different tuning constants ε1 and ε2, because the matrices involved have different dimensions. We will discuss how to choose these constants in Section 8. The matrix representation of ΣXjXi|X−(i,j) is then

[ΣXjXi|X−(i,j)] = [ΣXjXi] − [ΣXjX−(i,j)][ΣX−(i,j)Xi].  (13)

By definition, ||ΣXjXi|X−(i,j)||² is the maximum of

⟨ΣXjXi|X−(i,j)f, ΣXjXi|X−(i,j)f⟩𝓐Xj = n^{−1}[f]⊤[ΣXjXi|X−(i,j)]⊤KjQKj[ΣXjXi|X−(i,j)][f]

subject to ⟨f, f⟩𝓐Xi = [f]⊤KiQKi[f] = 1. To solve this generalized eigenvalue problem, let v = (KiQKi)^{1/2}[f]. Then, using inverse Tychonoff regularization, [f] = (KiQKi + ε1In)^{−1/2}v. So it is equivalent to maximizing

v⊤(KiQKi + ε1In)^{−1/2}[ΣXjXi|X−(i,j)]⊤KjQKj[ΣXjXi|X−(i,j)](KiQKi + ε1In)^{−1/2}v

subject to v⊤v = 1, which implies that ||ΣXiXj|X−(i,j)||² is the largest eigenvalue of

ΛXiXj = (KiQKi + ε1In)^{−1/2}[ΣXjXi|X−(i,j)]⊤KjQKj[ΣXjXi|X−(i,j)](KiQKi + ε1In)^{−1/2}.

Since the factor KjQKj[ΣXjXi|X−(i,j)] on the right-hand side is of the form

KjQKj(KjQKj + ε1In)^{−1}KjQ(In − K−(i,j)⊤(K−(i,j)QK−(i,j)⊤ + ε2In(p−2))^{−1}K−(i,j))QKi,

it is reasonable to replace the first factor KjQKj by KjQKj + ε1In, so that ΛXiXj becomes the quadratic form ΔXiXjΔXiXj⊤, where

ΔXiXj = (KiQKi + ε1In)^{−1/2}Ki(Q − QK−(i,j)⊤(K−(i,j)QK−(i,j)⊤ + ε2In(p−2))^{−1}K−(i,j)Q)Kj(KjQKj + ε1In)^{−1/2}.  (14)

Hence, at the sample level, ||ΣXiXj|X−(i,j)|| is the largest singular value of ΔXiXj. Because ΔXiXj = ΔXjXi⊤, we have the symmetry ||ΣXiXj|X−(i,j)|| = ||ΣXjXi|X−(i,j)||.

The above procedure involves inversion of K−(i,j)QK−(i,j)⊤ + ε2In(p−2), which is a large matrix if the graph 𝓖 contains many nodes. Nevertheless, as the next proposition shows, the actual amount of computation is not large: it is essentially that of a spectral decomposition of an n × n matrix. Let diag(A1, …, Ap) be the block-diagonal matrix with diagonal blocks A1, …, Ap, and V be a matrix in ℝ^{s×t}, with s > t, which has singular value decomposition

V = (L1, L0) diag(D, 0) (R1, R0)⊤,  (15)

where L1 ∈ ℝ^{s×u}, L0 ∈ ℝ^{s×(s−u)}, R1 ∈ ℝ^{t×u}, R0 ∈ ℝ^{t×(t−u)}, and D ∈ ℝ^{u×u} is a diagonal matrix with diagonal elements δ1 ≥ ⋯ ≥ δu > 0.

Proposition 4

If the conditions in the last paragraph are satisfied, then

  1. (VV⊤ + εIs)^{−1} = VR1D^{−1}((D² + εIu)^{−1} − ε^{−1}Iu)D^{−1}R1⊤V⊤ + ε^{−1}Is;

  2. V⊤(VV⊤ + εIs)^{−1}V = R1D(D² + εIu)^{−1}DR1⊤.
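As a quick numerical sanity check of these two identities (a sketch assuming numpy; the transposes follow the reconstruction above):

```python
import numpy as np

rng = np.random.default_rng(3)
s, t, eps = 40, 10, 1e-2
V = rng.standard_normal((s, t))

# Thin SVD: V = L1 D R1^T, with u the number of positive singular values
L1, d, R1t = np.linalg.svd(V, full_matrices=False)
R1, D, u = R1t.T, np.diag(d), len(d)

# Part 1: (V V^T + eps I_s)^{-1} expressed through the u x u matrix D
lhs1 = np.linalg.inv(V @ V.T + eps * np.eye(s))
rhs1 = (V @ R1 @ np.linalg.inv(D)
        @ (np.linalg.inv(D @ D + eps * np.eye(u)) - np.eye(u) / eps)
        @ np.linalg.inv(D) @ R1.T @ V.T + np.eye(s) / eps)
print(np.allclose(lhs1, rhs1))   # True

# Part 2: V^T (V V^T + eps I_s)^{-1} V
lhs2 = V.T @ lhs1 @ V
rhs2 = R1 @ D @ np.linalg.inv(D @ D + eps * np.eye(u)) @ D @ R1.T
print(np.allclose(lhs2, rhs2))   # True
```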

The above proposition allows us to calculate the inverse of the s × s matrix VV⊤ + εIs essentially by a spectral decomposition of the t × t matrix V⊤V. We now summarize the estimation procedure as an algorithm. Suppose we observe an i.i.d. sample X1, …, Xn from X, where each Xa is a p-dimensional random vector (Xa1, …, Xap).

  1. Standardize each component of X marginally. That is, for each i = 1, …, p, we standardize X1i, …, Xni, so that EnXi = 0 and varnXi = 1.

  2. Choose a positive definite kernel κ: ℝ × ℝ → ℝ, such as the gaussian radial kernel. Choose the kernel parameter γ using, for example, the procedure in Li, Chun, and Zhao (2012). Compute the kernel matrices Ki and K−(i,j).

  3. Compute ΔXiXj in (14), using part 2 of Proposition 4 to compute the matrix QK−(i,j)⊤(K−(i,j)QK−(i,j)⊤ + ε2In(p−2))^{−1}K−(i,j)Q, with V = K−(i,j)Q, s = n(p − 2), t = n, and u = n − 1. The tuning constants ε1 and ε2 are determined using the method described in Section 8.

  4. For each i > j, compute the largest singular value of ΔXiXj. If it is smaller than a certain threshold, declare that there is no edge between i and j.

We emphasize again that throughout this procedure, the largest matrix whose spectral decomposition is needed is of dimension n × n, regardless of the dimension p of X.
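A schematic implementation of Steps 2-4 might look as follows, assuming numpy and the gaussian radial basis kernel. For clarity it inverts the n(p − 2) × n(p − 2) matrix directly instead of using the Proposition 4 shortcut, and the tuning constants are fixed rather than chosen by GCV; the function names and values are illustrative only.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, U = np.linalg.eigh(S)
    return U @ np.diag(1.0 / np.sqrt(w)) @ U.T

def acco_stat(X, i, j, gamma=None, eps1=1e-3, eps2=1e-1):
    """Largest singular value of Delta_{X^i X^j} in (14); a small value suggests no edge (i, j)."""
    n, p = X.shape
    gamma = 1.0 / (2 * p) if gamma is None else gamma
    Q = np.eye(n) - np.ones((n, n)) / n
    gram = lambda x: np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)
    Ki, Kj = gram(X[:, i]), gram(X[:, j])
    # K_{-(i,j)}: stack the kernel matrices of the remaining p - 2 coordinates
    Krest = np.vstack([gram(X[:, k]) for k in range(p) if k not in (i, j)])
    M = np.linalg.inv(Krest @ Q @ Krest.T + eps2 * np.eye(Krest.shape[0]))
    middle = Q - Q @ Krest.T @ M @ Krest @ Q
    Delta = inv_sqrt(Ki @ Q @ Ki + eps1 * np.eye(n)) @ Ki @ middle @ Kj \
            @ inv_sqrt(Kj @ Q @ Kj + eps1 * np.eye(n))
    return np.linalg.svd(Delta, compute_uv=False)[0]

# Toy check on Model I of Section 9: X2 = X1^2 + noise, X3 independent noise
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))
X[:, 1] = X[:, 0] ** 2 + rng.standard_normal(100)
X = (X - X.mean(0)) / X.std(0)                    # Step 1: marginal standardization
print(acco_stat(X, 0, 1), acco_stat(X, 0, 2))     # pair (1,2) vs. null pair (1,3)
```

In this toy check one would expect the statistic for the true edge (1, 2) to exceed that for the null pair (1, 3).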

7 The APO-based estimator

Our second estimator is based on Corollary 2, which states that Xi ⫫A Xj ∣ X−(i,j) iff ΘXjXi = 0. Thus we can declare the absence of an edge between i and j if the sample version of ||ΘXjXi|| is below a threshold. We first derive the matrix representations of the operator ϒ and its inverse Θ. Let K1:p = (K1, …, Kp).

Theorem 7

The matrix representation of ϒ satisfies the following equation:

diag(K1QK1, …, KpQKp)[ϒ] = K1:p⊤QK1:p.  (16)

The matrix [Θ] is obtained by solving (16) for [ϒ]^{−1} with Tychonoff regularization:

[Θ] = (K1:p⊤QK1:p + ε3Inp)^{−1}diag(K1QK1, …, KpQKp).  (17)

To find ||ΘXjXi||, we maximize

⟨ΘXjXifi, ΘXjXifi⟩𝓐Xj = n^{−1}[fi]⊤[ΘXjXi]⊤KjQKj[ΘXjXi][fi]

subject to ⟨fi, fi⟩𝓐Xi = [fi]⊤KiQKi[fi] = 1. Following a development similar to the one leading to (14), we see that ||ΘXjXi||² is the largest eigenvalue of

(KiQKi + ε1In)^{−1/2}[ΘXjXi]⊤(KjQKj + ε1In)[ΘXjXi](KiQKi + ε1In)^{−1/2},

where [ΘXjXi] is the (j, i)th block of (17). Equivalently, ||ΘXjXi|| is the largest singular value of

ΓXiXj = (KiQKi + ε1In)^{−1/2}[ΘXjXi]⊤(KjQKj + ε1In)^{1/2}.

We now summarize the algorithm for the APO-based estimator. The first two steps are the same as in the ACCO-based algorithm; the last two steps are listed below.

  • 3′ Compute [Θ] according to (17), where the inverse (K1:p⊤QK1:p + ε3Inp)^{−1} is computed using part 1 of Proposition 4, with V = K1:p⊤Q, ε = ε3, s = np, and u = n − 1. The tuning constant ε3 is determined by the method in Section 8.

  • 4′ For each i, j = 1, …, p, i > j, compute the largest singular value of ΓXiXj. If it is smaller than a certain threshold, then there is no edge between i and j.
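A corresponding sketch of Steps 3′-4′, assuming numpy and scipy; it forms the np × np matrix in (17) directly (feasible for moderate n and p, whereas Proposition 4 part 1 could be used instead), and applies the transpose on the (j, i)th block as in the Γ expression above. All names are illustrative.

```python
import numpy as np
from scipy.linalg import block_diag

def inv_sqrt(S):
    w, U = np.linalg.eigh(S)
    return U @ np.diag(1.0 / np.sqrt(w)) @ U.T

def sqrt_sym(S):
    w, U = np.linalg.eigh(S)
    return U @ np.diag(np.sqrt(w)) @ U.T

def apo_edge_stats(X, gamma=None, eps1=1e-3, eps3=1e-1):
    """Largest singular value of Gamma_{X^i X^j} for every pair i > j, from the single matrix [Theta]."""
    n, p = X.shape
    gamma = 1.0 / (2 * p) if gamma is None else gamma
    Q = np.eye(n) - np.ones((n, n)) / n
    gram = lambda x: np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)
    K = [gram(X[:, k]) for k in range(p)]
    K1p = np.hstack(K)                                     # K_{1:p}, n x np
    D = block_diag(*[Ki @ Q @ Ki for Ki in K])             # diag(K_1 Q K_1, ..., K_p Q K_p)
    Theta = np.linalg.solve(K1p.T @ Q @ K1p + eps3 * np.eye(n * p), D)   # equation (17)
    reg_inv = [inv_sqrt(Ki @ Q @ Ki + eps1 * np.eye(n)) for Ki in K]
    reg_pos = [sqrt_sym(Ki @ Q @ Ki + eps1 * np.eye(n)) for Ki in K]
    stats = {}
    for i in range(p):
        for j in range(i):
            Th_ji = Theta[j * n:(j + 1) * n, i * n:(i + 1) * n]   # (j, i)th block of [Theta]
            Gamma = reg_inv[i] @ Th_ji.T @ reg_pos[j]
            stats[(i, j)] = np.linalg.svd(Gamma, compute_uv=False)[0]
    return stats

# stats = apo_edge_stats(X)   # X: n x p marginally standardized data matrix
```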

In concluding this section, we would like to mention that the operator norm used for both the ACCO and APO can be replaced by other norms, such as the Hilbert-Schmidt norm. We have experimented with both norms in our simulation studies (not presented) and found that their performances are similar.

8 Tuning parameters for regularization

In this section we propose cross-validation and generalized cross-validation procedures to select ε’s in the Tychonoff regularization. We need to invert three types of matrices

type I: KiQKi,   type II: K−(i,j)QK−(i,j)⊤,   type III: K1:p⊤QK1:p.

As before, we denote the constants in the regularized inverses as ε1, ε2, and ε3.

Since our cross-validation is based on the total error in predicting one set of functions by another set of functions, we first formulate such predictions in generic terms. Let U and V be random vectors and W = (U, V). Let 𝓕 = {f0, f1, …, fr} be a set of functions of U, and 𝓖 = {g0, g1, …, gs} be a set of functions of V, where g0 and f0 represent the constant function 1, which are included to account for any constant terms in prediction. Suppose we have an i.i.d. sample {W1, …, Wn} of W. Let A = {a1, …, am} ⊆ {1, …, n}, B = {b1, …, bn−m} = Ac. Let {Wa : a ∈ A} be the training set and {Wb : b ∈ B} be the testing set. Suppose ÊV|U(ε) : 𝓖 → 𝓕 is a training-set-based operator such that, for each g ∈ 𝓖, ÊV|U(ε)g is the best prediction of g(V) based on the functions in 𝓕, where ε is the tuning constant to be determined. The total error for predicting any function in 𝓖 evaluated at the testing set is

Σ_{b∈B} Σ_{μ=0}^{s} [gμ(Vb) − (ÊV|U(ε)gμ)(Ub)]².  (18)

Now let LU be the (r + 1) × m matrix whose ith row is the vector {fi−1(Ua) : a ∈ A}, and LV be the (s + 1) × m matrix whose kth row is the vector {gk−1(Va) : a ∈ A}. Let ℓU and ℓV denote the vector-valued functions (f0, …, fr)⊤ and (g0, …, gs)⊤. Then, following an argument similar to Lee, Li, and Chiaromonte (2013), we can express (18) as

Σ_{b∈B} ||ℓV(Vb) − LVLU⊤(LULU⊤ + εIr+1)^{−1}ℓU(Ub)||²,

where ||·|| is the Euclidean norm. If we let GU be the (r + 1) × (n − m) matrix whose bth column is ℓU(Ub), and GV be the (s + 1) × (n − m) matrix whose bth column is ℓV(Vb), then the above can be further simplified as

||GV − LVLU⊤(LULU⊤ + εIr+1)^{−1}GU||²,

where ||·|| is the Frobenius matrix norm. Other matrix norms are also reasonable. In our simulation we used the operator norm, which performed well.

Return now to our cross-validation problem. In the k-fold cross validation, data are divided into k subsets. At each step, one subset is used as the testing set and union of the rest is used as the training set. Let i, j, … index the p components of x, a, b, … index the n subjects, and ν index the k steps in the k-fold cross-validation.

At the νth step, let {Xa : a ∈ A} be the training set and {Xb : b ∈ B} be the testing set. Reset κ̊i to be the ℝm-valued function xi ↦ (κ(xi, Xa1i), …, κ(xi, Xami))⊤. Let ℓi, ℓ(i,j), and ℓ be the vector-valued functions defined by

ℓi(xi) = (1, κ̊i(xi)⊤)⊤,  ℓ(i,j)(xi, xj) = (1, κ̊i(xi)⊤, κ̊j(xj)⊤)⊤,  ℓ(x) = (1, κ̊1(x1)⊤, …, κ̊p(xp)⊤)⊤.

Let ℓ−i be the ℝn(p−1)+1-valued function obtained by removing κ̊i from ℓ, and ℓ−(i,j) be the ℝn(p−2)+1-valued function obtained by removing κ̊i and κ̊j from ℓ. Let Li denote the matrix (ℓi(Xa1i), …, ℓi(Xami)), and define L−i, L(i,j), and L−(i,j) similarly. Let Gi denote the matrix (ℓi(Xb1i), …, ℓi(Xbn−mi)), and define G−i, G(i,j), and G−(i,j) similarly.

To determine ε1, we use the functions in ℓi to predict the functions in ℓ−i at Xb for any b ∈ B. That is,

CV1(ν)(ε1) = Σ_{i=1}^p Σ_{b∈B} ||ℓ−i(Xb−i) − L−iLi⊤(LiLi⊤ + ε1Im+1)^{−1}ℓi(Xbi)||² = Σ_{i=1}^p ||G−i − L−iLi⊤(LiLi⊤ + ε1Im+1)^{−1}Gi||².  (19)

Repeat this process for ν = 1, …, k and form CV1(ε1) = Σ_{ν=1}^k CV1(ν)(ε1). We choose ε1 by minimizing CV1(ε1) over a grid. To determine ε2, we use the functions in ℓ−(i,j) to predict the functions in ℓ(i,j) at each Xb, b ∈ B, which leads to

CV2(ν)(ε2) = 2Σ_{i>j} ||G(i,j) − L(i,j)L−(i,j)⊤(L−(i,j)L−(i,j)⊤ + ε2In(p−2)+1)^{−1}G−(i,j)||².  (20)

Let CV2(ε2) = Σ_{ν=1}^k CV2(ν)(ε2). We choose ε2 by minimizing CV2(ε2) over a grid. Here, we use part 1 of Proposition 4 to calculate (L−(i,j)L−(i,j)⊤ + ε2In(p−2)+1)^{−1}.

Alternatively, we can use the generalized cross validation criteria parallel to (19) and (20) to speed up computation (Wahba, 1990). Rather than splitting the data into training and testing samples, we compute the residual sum of squares from the entire data, and regularize it by the effective degrees of freedom as defined by the traces of relevant projection matrices. Specifically, the GCV-criterion parallel to (19) is

GCV1(ε1) = Σ_{i=1}^p ||L−i − L−iLi⊤(LiLi⊤ + ε1In+1)^{−1}Li||² / {1 − tr(Li⊤(LiLi⊤ + ε1In+1)^{−1}Li)/n}².

Here, Li and L−i are computed from the full data, and Gi and G−i in (19) are replaced by Li and L−i. Similarly, the GCV-criterion parallel to (20) is

GCV2(ε2) = 2Σ_{i>j} ||L(i,j) − L(i,j)L−(i,j)⊤(L−(i,j)L−(i,j)⊤ + ε2In(p−2)+1)^{−1}L−(i,j)||² / {1 − tr(L−(i,j)⊤(L−(i,j)L−(i,j)⊤ + ε2In(p−2)+1)^{−1}L−(i,j))/n}².

Here, again, the matrix L−(i,j)⊤(L−(i,j)L−(i,j)⊤ + ε2In(p−2)+1)^{−1}L−(i,j) is to be calculated by part 2 of Proposition 4 with V = L−(i,j), s = n(p − 2) + 1, and t = n. Finally, we recommend taking ε3 to be the same as ε2. This is based on the intuition that K−(i,j)QK−(i,j)⊤ and K1:p⊤QK1:p are of similar forms and dimensions. This choice works well in our simulation studies.
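A schematic grid search for ε1 based on GCV1 might look as follows, assuming numpy; the Frobenius norm is used in place of the operator norm for simplicity, and the grid values and function name are illustrative.

```python
import numpy as np

def gcv1(X, eps1_grid, gamma=None):
    """Choose eps1 by minimizing the GCV_1 criterion over a grid."""
    n, p = X.shape
    gamma = 1.0 / (2 * p) if gamma is None else gamma
    Q = np.eye(n) - np.ones((n, n)) / n
    gram = lambda x: np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)
    # L_i: constant row of ones stacked on the centered kernel rows (full data)
    L = [np.vstack([np.ones(n), gram(X[:, k]) @ Q]) for k in range(p)]
    scores = []
    for eps1 in eps1_grid:
        total = 0.0
        for i in range(p):
            Li = L[i]
            Lmi = np.vstack([L[k] for k in range(p) if k != i])        # L_{-i}
            H = Li.T @ np.linalg.solve(Li @ Li.T + eps1 * np.eye(Li.shape[0]), Li)
            total += np.linalg.norm(Lmi - Lmi @ H) ** 2 / (1.0 - np.trace(H) / n) ** 2
        scores.append(total)
    return eps1_grid[int(np.argmin(scores))]

# eps1_hat = gcv1(X, eps1_grid=[5e-4, 5e-3, 5e-2, 5e-1, 5.0, 50.0, 500.0, 5000.0])
```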

From our experience, the GCV-criteria give similar tuning constants to the CV-criteria but increase the computation speed severalfold. For this reason, we used the GCV criteria in our subsequent simulations and application.

9 Simulation studies

In this section we compare our ASG estimators with GGM and GCGM estimators, as proposed by Yuan and Lin (2007) and Liu et al (2013). Since our estimators are based on thresholding the norms of the operators ΣXiXj|X−(i,j) and ΘXjXi, it would be fair to compare them with the thresholded versions of GGM and GCGM. However, because L1-regularization is widely used for these estimators, we will instead compare them with the L1-regularized GGM and GCGM.

Comparison 1

We first compare the ASG estimators with GGM and GCGM when the copula assumption is violated. Consider the following models:

  • Model I : X1 = ε1, X2 = X1² + ε2, X3 = ε3,

  • Model II : X1 = ε1, X2 = ε2, X3 = X1² + ε3, X4 = X1² + X2² + ε4, X5 = ε5,

where ε1, …, ε5 are i.i.d. N(0, 1). The edge set for Model I is {(1, 2)}; the edge set for Model II is {(1, 3), (1, 4), (2, 4)}. The sample size is n = 100 and we repeated the simulation 100 times and the averaged ROC curves for correct estimation of the edge sets are presented in the upper panels of Figure 2. As expected, our ASG estimators perform much better than the GGM and GCGM when the copula assumption does not hold.
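For reference, data from Models I and II can be generated as follows (a sketch assuming numpy; the function names are illustrative):

```python
import numpy as np

def model_I(n, rng):
    e = rng.standard_normal((n, 3))
    X = e.copy()
    X[:, 1] = X[:, 0] ** 2 + e[:, 1]                     # X2 = X1^2 + eps2; edge set {(1, 2)}
    return X

def model_II(n, rng):
    e = rng.standard_normal((n, 5))
    X = e.copy()
    X[:, 2] = X[:, 0] ** 2 + e[:, 2]                     # X3 = X1^2 + eps3
    X[:, 3] = X[:, 0] ** 2 + X[:, 1] ** 2 + e[:, 3]      # X4 = X1^2 + X2^2 + eps4
    return X                                             # edge set {(1,3), (1,4), (2,4)}

rng = np.random.default_rng(0)
X_I, X_II = model_I(100, rng), model_II(100, rng)
```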

Figure 2. Comparison of ACCO, APO, GGM, and GCGM. Upper panels: ROC curves for Models I and II, where the gaussian copula assumption is violated. Lower panels: ROC curves for Models III and IV, where the gaussian copula assumption holds.

For this and all the other comparisons, we have used the radial basis function (RBF) kernel with γ = 1/(2p), where γ is as it appears in the parametric form used in Li, Chun, and Zhao (2012). In Comparison 6 we give some results for alternative kernels and parameters.

Comparison 2

We now consider two models that satisfy the gaussian or copula gaussian assumption, which are therefore favorable to GGM and GCGM, to see how the ASG estimators compare with them:

  • Model III : X ~ N(0, Θ−1),

  • Model IV : (f1(X1), …, fp(Xp)) ~ N(0, Θ−1),

where Θ is the 10 × 10 precision matrix with diagonal entries 1, 1, 1, 1.333, 3.010, 3.203, 1.543, 1.270, 1.554, and nonzero off-diagonal entries θ3,5 = 1.418, θ4,6 = −1.484, θ4,10 = −0.744, θ5,9 = 0.519, θ5,10 = −0.577, θ7,10 = 1.104, and θ8,9 = −0.737. For Model IV, X1, …, Xp each has a marginal gamma distribution with rate parameter 1 and shape parameter generated from unif(1, 5). The sample size is 50. The averaged ROC curves (over 100 replications) for the 4 estimators are shown in the lower panels in Figure 2. From these curves we can see the ASG estimators are comparable to GGM and GCGM when the gaussian or copula gaussian assumption hold.
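The mechanism of Model IV, a gaussian copula with a prescribed precision matrix and gamma marginals, can be sketched as follows, assuming numpy and scipy. The precision matrix below is a small illustrative placeholder, not the Θ specified in the text.

```python
import numpy as np
from scipy.stats import norm, gamma

def copula_gamma_sample(n, Theta, shapes, rng):
    """Draw X with gaussian copula N(0, Theta^{-1}) and gamma(shape, rate 1) marginals."""
    p = Theta.shape[0]
    Sigma = np.linalg.inv(Theta)
    d = np.sqrt(np.diag(Sigma))
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    U = norm.cdf(Z / d)                       # uniform marginals, copula of N(0, Theta^{-1})
    return gamma.ppf(U, a=shapes)             # gamma marginals with rate parameter 1

rng = np.random.default_rng(5)
p = 10
# Illustrative sparse precision matrix (NOT the one in the text)
Theta = np.eye(p)
Theta[2, 4] = Theta[4, 2] = 0.4
Theta[3, 9] = Theta[9, 3] = -0.3
shapes = rng.uniform(1, 5, size=p)            # shape parameters from unif(1, 5), as in Model IV
X = copula_gamma_sample(50, Theta, shapes, rng)
```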

The tuning parameters are selected by GCV1 and GCV2 from the coarse grid (5 × 10−4, …, 5 × 103). The optimal tuning parameters do vary across repetitions, but the ballpark figures are relatively stable. For example, for Model IV, the typical values are 5 × 10−4 for ε1 and 5 × 10−1 for ε2. To evaluate the GCV criteria, in Figure 3 we compare the ROC curves using the GCV-tuned ε̂1, ε̂2 with those for ε̂1 × 10−1 and ε̂2 × 10−1 and those for ε̂1 × 10 and ε̂2 × 10. We see that smaller tuning parameters perform better at low specificity, larger tuning parameters perform better at high specificity, and the GCV tuning criteria balance these two cases.

Figure 3. Performance of GCV under Model IV.

Comparison 3

We now investigate the performance of our additive methods under a nonadditive model and compare with a fully nonparametric method that does not rely on the additive assumption. The fully nonparametric method is based on the conditional covariance operator (CCO) introduced by Fukumizu, Bach, and Jordan (2009). Consider the model

  • Model V : Xi = εi, i = 1, 2, 5; X3 = X1 + ε3; X4 = (X1 + X2)² + ε4,

where ε1, …, ε5 are i.i.d. N(0, 1). The nonadditivity is introduced through the term (X1 + X2)².

Let μ : ℝp−2 × ℝp−2 → ℝ be a positive definite kernel and MX−(i,j) be the kernel matrix {μ(Xr−(i,j), Xs−(i,j)) : r, s = 1, …, n}. Let

GXi = QKXiQ,  GXj = QKXjQ,  GX−(i,j) = QMX−(i,j)Q,  QX−(i,j)(ε2) = In − GX−(i,j)(GX−(i,j) + ε2In)^{−1}.

Let VXiXj|X−(i,j) be the CCO and V̂XiXj|X−(i,j) be its sample estimate. Along the lines of Fukumizu, Bach, and Jordan (2009), we derive the operator norm ||V̂XiXj|X−(i,j)|| as the largest singular value of

(GXi + ε1In)^{1/2}QX−(i,j)(ε2)(GXj + ε1In)^{1/2}.  (21)

We declare no edge between nodes i and j if this value is smaller than a threshold. A derivation of this is sketched in the Appendix. The tuning parameters are determined by GCV criteria similar to those described in Section 8, except that the kernel matrix KX−(i,j) based on univariate kernels is replaced by the multivariate kernel matrix MX−(i,j). We ran this simulation for n = 50 using ACCO, APO, GCGM, and this fully nonparametric estimator (denoted as fuku in the plot). The averaged ROC curves are shown in Figure 4. We can see that our methods work better than Fukumizu et al.'s method even in this nonadditive case.

Figure 4. Comparison of ASG and the fully nonparametric estimator on Model V.

Comparison 4

In this study we compare the 4 estimators by applying them to the data sets from the DREAM4 Challenges at http://wiki.c2b2.columbia.edu/dream/index.php?title=D4c2, in which subnetworks are extracted from real biological networks and gene expressions are simulated from stochastic differential equations, reflecting their biological interactions. Under this simulation scenario, two genes are not guaranteed to be linearly associated. In the DREAM4 samples, p = 10, n = 31 for all five networks. In the analysis, we combined all data sets from various experiments (wildtype, knockout, knockdown and multifactorial) except for the time-course experiments, and applied each method to the combined data sets across all five network configurations in the DREAM4 Challenges. Figure 5 shows the averaged ROC curves for the 5 subnetworks. We note that APO and ACCO perform the best for the first subnetwork, and APO performs the best in the fifth subnetwork. In the other three subnetworks there is no clear separation among the four estimators. These comparisons suggest a possible benefit of our ASG estimators for finding non-linear biological interactions.

Figure 5. Comparison of the ROC curves of the ACCO, APO, GGM, and GCGM estimators on the 5 subnetworks in the DREAM4 Challenges.

Comparison 5

In this study we increase the sample size n and dimension p to investigate the accuracy and computational feasibility of our methods. We use Model IV in Comparison 2 with (p, n) = (200, 400) and (p, n) = (400, 400). We compare APO with the fully nonparametric method and the averaged ROC curves are given in Figure 6. For larger p and n the ACCO is more computationally expensive than APO because one has to compute ||Σ̂XiXj|X−(i,j)|| for all i, j = 1, …, p, ij. However, this problem is mitigated by APO, where the large (np × np) matrix [Θ] only needs to be calculated once.

Figure 6. Comparison of ASG and the fully nonparametric method for p = 200, n = 400 (left panel) and p = 400, n = 400 (right panel) using Model IV.

For p = 200 and n = 400, we selected (by GCV) ε̂1 = 5 × 10−5 and ε̂2 = 0.5. Each run with fixed ε1 and ε2 took around 900 seconds (15 minutes) on a machine with an Intel(R) Core(TM) i7-3770 CPU at 3.40GHz. For p = 400 and n = 400, we selected by GCV ε̂1 = 5 × 10−5 and ε̂2 = 0.5. Each run with fixed ε1 and ε2 took around 1900 seconds on the same computer. The GGM was not used for this simulation, since it took a very long time to get the estimates (more than a day for a single tuning parameter). We also performed 10 replications. It appears that our additive operator performs better than Fukumizu et al.'s operator in high-dimensional cases.

Comparison 6

We now investigate the behavior of different kernels using Models III and IV. We have experimented with 3 types of kernels: the radial basis function kernel, the polynomial (POLY) kernel, and the rational quadratic (RATIONAL) kernel. The specific parametric forms we used for these kernels are as they appear in Li, Chun, and Zhao (2012). For each kernel, we used two tuning parameters: for RBF, we use γ = 1/9, 1/(9p); for POLY, we use r = 2, 3; for RATIONAL, c = 200, 400. The ROC curves for these two models are given in Figure 7. We see that, with the exception of the polynomial kernel with degree 3, the performances of these kernels are somewhat similar, with the RBF kernel with γ = 1/(9p) being the most favorable. The cubic polynomial kernel does not perform very well because it tends to amplify outlying observations, making the results unstable.

Figure 7. Comparing different kernels using Model III (left panel) and Model IV (right panel).

10 Pathway analysis

We applied the four estimators to a real data set from Dimas et al (2009), in which gene expressions were measured from 85 individuals participating in the GenCord project. The same data set was also used in Chun, Zhang, and Zhao (2012). The participants were of Western European origin, and cell lines were obtained from their umbilical cords. In this analysis, we focused on one cell type, the B cells (lymphoblastoid cell lines). As in Chun et al (2012), we mapped the probes to 17,945 autosomal RefSeq genes, and used the resulting gene-level data in our analysis. We then extracted the pathway information from the KEGG database (http://www.genome.jp/kegg/), and used it for partitioning genes and for evaluating the estimators by taking the interactions in the database as true interactions.

Due to the limited sample sizes, we only used the pathways whose sizes (i.e. number of genes) are greater than 2 and less than 36. There are a total of 300 such pathways. After ignoring directionality of regulations in KEGG pathways, we computed the false positives and true positives of the estimates and computed the area under curves (AUC) of the ROCs. We then evaluated the significance of ROC curves based on the p-values of the Wilcoxon rank sum test. Figure 8 shows the AUCs of all networks by ACCO and APO. The left panel shows AUCs based on ACCO, with significant values highlighted by green circles. The right panel shows those based on APO, with significant values highlighted by blue circles. In each case significance means the p-value is less than 10−4. We see that ACCO yields a larger number of significant ROCs than GGM. To reconfirm this pattern we computed the number of significant ROC curves by ACCO and GGM for three different significance levels, as presented in Table 2, which again shows that the numbers of significant ROC curves by ACCO are consistently higher at all three significance levels.
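One way to compute a pathway's AUC together with the Wilcoxon rank-sum p-value from a vector of edge scores is sketched below, assuming scipy; the exact evaluation pipeline used in the paper may differ, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def auc_and_pvalue(scores, truth):
    """AUC of edge scores against KEGG 'true' edges, with the Wilcoxon rank-sum (Mann-Whitney) p-value."""
    pos = scores[truth == 1]
    neg = scores[truth == 0]
    U, pval = mannwhitneyu(pos, neg, alternative="greater")
    auc = U / (len(pos) * len(neg))        # AUC equals the normalized Mann-Whitney U statistic
    return auc, pval

# scores: e.g. the largest singular values of Delta_{X^i X^j} for every gene pair in a pathway
# truth:  1 if the pair interacts in KEGG (ignoring direction), 0 otherwise
```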

Figure 8. Comparison of number of significant ROC curves by the ACCO and by GGM estimators for DREAM4 networks with p-values < 10−4.

Table 2.

Numbers of significant pathways with varying significance levels

p-values ACCO APO GCGM GGM
< 10−3 17 21 14 10
< 10−4 10 9 5 5
< 10−5 8 7 2 2

To provide further insights into the comparison, we listed in Table 3, Panel I, all the pathways with significant ROCs by ACCO (p-value < 10−5) whose AUCs by ACCO are larger than those by GGM. There were a total of 6 such pathways. In Table 3, Panel II, we listed all the pathways with significant ROCs by GGM whose AUCs by GGM are larger than those by ACCO. There were a total of 2 such pathways.

Table 3.

Significant pathways whose AUCs by ACCO are larger than those by GGM (panel I) and whose AUCs by ACCO are smaller than those by GGM (panel II)

Panel   Pathway^a   p    AUC (ACCO)   AUC (APO)   AUC (GCGM)   AUC (GGM)
I 1 35 0.63 0.59 0.52 0.53
2 24 0.66 0.66 0.56 0.58
3 25 0.61 0.58 0.54 0.54
4 18 0.67 0.60 0.62 0.59
5 18 0.69 0.63 0.62 0.62
6 16 0.69 0.59 0.55 0.55

II 7 14 0.60 0.65 0.62 0.66
8 17 0.73 0.71 0.64 0.79
^a Labels in this column indicate the following pathways. 1: Activation of Csk by cAMP-dependent Protein Kinase Inhibits Signaling through the T Cell Receptor; 2: Cell Cycle: G2 M Checkpoint; 3: Cyclins and Cell Cycle Regulation; 4: Lck and Fyn tyrosine kinases in initiation of TCR Activation; 5: Th1 Th2 Differentiation; 6: hsa00710; 7: Proteasome Complex pathway; 8: Human Peptide GPCRs.

Some pathways were selected by more than one method. Specifically,

#(ACCO ∩ APO) = 4,  #(ACCO ∩ GGM) = 1,  #(ACCO ∩ GCGM) = 1,  #(APO ∩ GGM) = 1,  #(APO ∩ GCGM) = 1,  #(GGM ∩ GCGM) = 1,

where #(A ∩ B) denotes the number of pathways found by both method A and method B. It appears that ACCO and APO yield the most similar results. Furthermore, a single pathway was found by all methods.

From the analysis of these 300 pathways we see that the ASG estimators perform better than GGM and GCGM in a majority of cases. A plausible interpretation is that intrinsically nonlinear interactions (i.e. those that cannot be described by the gaussian copula assumption) are prevalent among genetic networks, and using a more flexible interaction assumption such as the ASG has real benefits in modeling of gene regulations.

11 Conclusions

In this paper we introduced two new estimators for graphical models based on the notion of additive conditional independence. They substantially broaden the applicability of the current graphical model estimators based on the gaussian or copula gaussian assumption, but at the same time avoid using high-dimensional kernels that can hamper estimation accuracy. As we have demonstrated by simulation and data analysis, the new methods work particularly well when the interactions exhibit intrinsic nonlinearity, which cannot be captured by the gaussian or copula gaussian models, and are comparable to existing methods when the copula gaussian assumption holds.

Our generalization of precision matrix to additive precision operator allows us to adopt essentially the same estimation strategy of the gaussian graphical model at the operator level. Using this generalization we succeeded in preserving the simplicity of the gaussian dependence structure without suffering from its austere distributional restriction.

This basic idea opens up many other possibilities that cannot be explored within the scope of this paper. For example, although both the ACCO and APO estimators are based on truncating small operators (as measured by the operator norm), it is conceivable that more sophisticated sparse techniques, such as LASSO, SCAD, and adaptive LASSO, can be adapted to this setting as well. How to implement these techniques in the operator setting, and how much they would improve the truncation-based estimators presented here, would be interesting questions for future research. Moreover, the asymptotic properties of the proposed estimators, such as consistency and asymptotic distributions, also need to be developed in further studies.

Supplementary Material


Acknowledgments

Bing Li’s research was supported in part by NSF grant DMS-1106815.

Hyonho Chun’s research was supported in part by NSF grant DMS-1107025.

Hongyu Zhao’s research was supported in part by NIH grants R01-GM59507, P01-CA154295, and NSF grant DMS-1106738.

We greatly appreciate the thorough and thought-provoking reviews by three referees and an Associate Editor, which contain many useful suggestions and led to substantial improvements of this work.

Appendix: proofs

We first state without proof several simple geometric facts that will be useful for proving Theorems 1 and 2. Let P𝓛 be the projection on to a subspace 𝓛 of L2(PX), U and W be subvectors of X. Then

𝓐UW = 𝓐U + 𝓐W.  (22)
P_{𝓐UW} = P_{𝓐UW ⊖ 𝓐W} + P_{𝓐W}.  (23)
𝓐UW ⊖ 𝓐W = (I − P_{𝓐W})𝓐U.  (24)

Proof of Theorem 1

1 ⇔ 2 follows from relation (22); 2 ⇔ 3 follows from relation (24); 6 ⇔ 7 follows from (23); 4 ⇔ 5 follows from the relation (23).

2 ⇒ 4. By statement 2, P𝓐VW ⊖ 𝓐W (𝓐UW ⊖ 𝓐W) = {0}. By relation (24) and the fact that 𝓐VW ⊖ 𝓐W ⊥ 𝓐W, we have

P_{𝓐VW ⊖ 𝓐W}(𝓐UW ⊖ 𝓐W) = P_{𝓐VW ⊖ 𝓐W}(I − P_{𝓐W})𝓐U = P_{𝓐VW ⊖ 𝓐W}𝓐U.

4 ⇒ 6. This is because

P_{𝓐VW ⊖ 𝓐W}𝓐UW = P_{𝓐VW ⊖ 𝓐W}(𝓐U + 𝓐W) = P_{𝓐VW ⊖ 𝓐W}𝓐W = {0},

where the first equality follows from (22); the second from assertion 4; the third from 𝓐W ⊥ 𝓐VW ⊖ 𝓐W.

6 ⇒ 2. Because ker(P_{𝓐VW ⊖ 𝓐W}) = [ran(P_{𝓐VW ⊖ 𝓐W})]⊥ = (𝓐VW ⊖ 𝓐W)⊥, we have

𝓐UW ⊥ 𝓐VW ⊖ 𝓐W,

which implies 𝓐UW ⊖ 𝓐W ⊥ 𝓐VW ⊖ 𝓐W because 𝓐UW ⊖ 𝓐W ⊆ 𝓐UW.

Proof of Theorem 2

Axiom 1 is obvious. To prove 2, 3, 4, let U, V, W, R be disjoint sub-vectors of X and 𝓐U, 𝓐V, 𝓐W, 𝓐R be their respective additive families.

2. If U ⫫A VR ∣ W, then

(𝓐U + 𝓐W) ⊖ 𝓐W ⊥ (𝓐VR + 𝓐W) ⊖ 𝓐W.

Axiom 2 follows because (𝓐V + 𝓐W) ⊖ 𝓐W ⊆ (𝓐VR + 𝓐W) ⊖ 𝓐W.

3. By Theorem 1, part 3, U ⫫A VR ∣ W is equivalent to

(I − P_{𝓐W})𝓐U ⊥ (I − P_{𝓐W})𝓐VR.  (25)

By Axiom 2, we have U ⫫A R ∣ W, which, by Theorem 1, part 5, implies P_{𝓐W}f = P_{𝓐WR}f for all f ∈ 𝓐U. Hence the left hand side of (25) can be rewritten as (I − P_{𝓐WR})𝓐U. Meanwhile, the right hand side of (25) contains (I − P_{𝓐W})𝓐V. Hence

(I − P_{𝓐WR})𝓐U ⊥ (I − P_{𝓐W})𝓐V.  (26)

Because 𝓐W ⊆ 𝓐WR, we have ran(IP𝓐WR) ⊥ ran(P𝓐WR). Hence

(I − P_{𝓐WR})𝓐U ⊥ P_{𝓐WR ⊖ 𝓐W}𝓐V.  (27)

Because

(I − P_{𝓐WR})𝓐V = (I − P_{𝓐W} − P_{𝓐WR ⊖ 𝓐W})𝓐V ⊆ (I − P_{𝓐W})𝓐V + P_{𝓐WR ⊖ 𝓐W}𝓐V,

relations (26) and (27) imply the desired relation

(I − P_{𝓐WR})𝓐U ⊥ (I − P_{𝓐WR})𝓐V.

4. By Theorem 1, part 3, U ⫫A V ∣ W and U ⫫A R ∣ WV are equivalent to

(I − P_{𝓐W})𝓐U ⊥ (I − P_{𝓐W})𝓐V,  (I − P_{𝓐VW})𝓐U ⊥ (I − P_{𝓐VW})𝓐R.  (28)

We need to show (I − P_{𝓐W})𝓐U ⊥ (I − P_{𝓐W})(𝓐V + 𝓐R), which, by the first relation in (28), is implied by

(I − P_{𝓐W})𝓐U ⊥ (I − P_{𝓐W})𝓐R.  (29)

Because U ⫫A V ∣ W, the left side of (29) is (I − P_{𝓐VW})𝓐U. By (23), the right side of (29) is (I − P_{𝓐VW} + P_{𝓐VW ⊖ 𝓐W})𝓐R. So, under (28), relation (29) is equivalent to

(I − P_{𝓐VW})𝓐U ⊥ (I − P_{𝓐VW} + P_{𝓐VW ⊖ 𝓐W})𝓐R.

Since (IP𝓐VW) 𝓐U ⊥ (IP𝓐VW) 𝓐R, it suffices to show that

(I − P_{𝓐VW})𝓐U ⊥ P_{𝓐VW ⊖ 𝓐W}𝓐R,  (30)

which is true because

ran(P_{𝓐VW ⊖ 𝓐W}) = 𝓐VW ⊖ 𝓐W = (I − P_{𝓐W})𝓐VW = ran((I − P_{𝓐W})P_{𝓐VW})

and (IP𝓐VW)(IP𝓐W) P𝓐VW = 0.

Proof of Theorem 3

The statement U ⫫A V ∣ W means

⟨f − P_{𝓐W}f, g − P_{𝓐W}g⟩ = 0 for all f ∈ 𝓐U, g ∈ 𝓐V.  (31)

Let A, B, and C be the sets of indices for U, V, and W and h(W) be the vector {fk(Xk) : k ∈ C}. By condition 3, f and g can be written as Σ_{i∈A} αi fi and Σ_{j∈B} βj fj for some α = (α1, …, αr) and β = (β1, …, βs). Since (f(U), h(W)) is jointly gaussian, the conditional expectation E[f(U)|h(W)] belongs to 𝓐W. For the same reason E[g(V)|h(W)] belongs to 𝓐W. It follows that P_{𝓐W}f = E[f(U)|h(W)], P_{𝓐W}g = E[g(V)|h(W)], and

⟨f − P_{𝓐W}f, g − P_{𝓐W}g⟩ = E[cov(f(U), g(V) ∣ h(W))] = cov[f(U), g(V) ∣ h(W)],

where the second equality holds because the random vector (f(U), g(V), h(W)) is multivariate gaussian. Consequently, (31) is equivalent to f(U) ⫫ g(V) ∣ h(W) for all f ∈ 𝓐U and g ∈ 𝓐V, which is equivalent to

{fi(Xi) : i ∈ A} ⫫ {fj(Xj) : j ∈ B} ∣ h(W).

Because f1, …, fp are injective, this is equivalent to UV|W.

Proof of Proposition 1

Let fL2(PU) and gL2(PV) be any functions with ||f|| = ||g|| = 1. Because P𝓐W is idempotent and self adjoint, we have

⟨f − P_{𝓐W}f, g − P_{𝓐W}g⟩ = ⟨f, g⟩ − ⟨f, P_{𝓐W}g⟩.

For each γ = 1, …, p − 2, let { ψαγw: α = 1, 2, …} be the hermite polynomials in the variable wγ. Similarly, let { ψαu: α = 1, 2, …} and { ψαv: α = 1, 2, …} be the hermite polynomials in the variables u and v. By Lancaster (1957), these polynomials form orthonormal bases in L2(PWγ), L2(PU), and L2(PV). It follows that the set { ψαγw: α = 1, 2, …, γ = 1, …, p − 2} is an orthonormal basis for 𝓐W. Moreover, by Lancaster (1957), whenever αβ,

{ψαu} ∪ {ψαv} ∪ {ψαγw : γ = 1, …, p − 2} ⊥ {ψβu} ∪ {ψβv} ∪ {ψβγw : γ = 1, …, p − 2}.

So, if we let 𝓐Wα = span{ψαγw : γ = 1, …, p − 2}, then P_{𝓐W}g = Σ_{α=1}^∞ P_{𝓐Wα}g. Each term in this sum is

P_{𝓐Wα}g = P_{𝓐Wα}(Σ_{β=1}^∞ ⟨g, ψβv⟩ψβv) = Σ_{β=1}^∞ ⟨g, ψβv⟩P_{𝓐Wα}ψβv = ⟨g, ψαv⟩P_{𝓐Wα}ψαv.

By Lancaster (1957) again,

⟨ψαv, ψαγw⟩ = ρVWγ^α,  ⟨ψαγw, ψαδw⟩ = ρWγWδ^α,

which yield P_{𝓐Wα}ψαv = RVW^α(RWW^α)^{−1}ψαw, where ψαw = (ψα1w, …, ψα,p−2w)⊤. Hence

P_{𝓐W}g = Σ_{α=1}^∞ P_{𝓐Wα}g = Σ_{α=1}^∞ ⟨g, ψαv⟩RVW^α(RWW^α)^{−1}ψαw.

Using this we compute

⟨f, P_{𝓐W}g⟩ = Σ_{α=1}^∞ ⟨g, ψαv⟩⟨f, RVW^α(RWW^α)^{−1}ψαw⟩ = Σ_{α=1}^∞ ⟨g, ψαv⟩ Σ_{β=1}^∞ ⟨f, ψβu⟩⟨ψβu, RVW^α(RWW^α)^{−1}ψαw⟩ = Σ_{α=1}^∞ ⟨g, ψαv⟩⟨f, ψαu⟩⟨ψαu, RVW^α(RWW^α)^{−1}ψαw⟩ = Σ_{α=1}^∞ ⟨g, ψαv⟩⟨f, ψαu⟩RVW^α(RWW^α)^{−1}RWU^α.

By a similar calculation, ⟨f, g⟩ = Σ_{α=1}^∞ ⟨f, ψαu⟩⟨g, ψαv⟩ρUV^α. Hence

⟨f, g⟩ − ⟨f, P_{𝓐W}g⟩ = Σ_{α=1}^∞ ⟨g, ψαv⟩⟨f, ψαu⟩(ρUV^α − RVW^α(RWW^α)^{−1}RWU^α).

Finally, by the Cauchy-Schwarz inequality,

(⟨f, g⟩ − ⟨f, P_{𝓐W}g⟩)² ≤ Σ_{α=1}^∞ ⟨g, ψαv⟩² Σ_{α=1}^∞ ⟨f, ψαu⟩²(ρUV^α − RVW^α(RWW^α)^{−1}RWU^α)² = Σ_{α=1}^∞ ⟨f, ψαu⟩²(ρUV^α − RVW^α(RWW^α)^{−1}RWU^α)² ≤ max{(ρUV^α − RVW^α(RWW^α)^{−1}RWU^α)² : α = 1, 2, …},

as desired.

Proof of Lemma 1

For any f ∈ 𝓐U and g ∈ 𝓐W, we have

⟨(P𝓐W|𝓐U − ΣWU)f, g⟩𝓐W = ⟨P_{𝓐W}f, g⟩𝓐W − ⟨ΣWUf, g⟩.  (32)

The first term on the right-hand side is

⟨P_{𝓐W}f, g⟩ = ⟨f, P_{𝓐W}g⟩ = ⟨f, g⟩ = E[f(U)g(W)].

By the definition of ΣWU, the second term also equals E[f(U)g(W)]. Hence ⟨(P𝓐W|𝓐U − ΣWU)f, g⟩𝓐W = 0 for all f ∈ 𝓐U and g ∈ 𝓐W.

Proof of Theorem 5

If U ⫫_𝓐 V | W, then, for any f ∈ 𝓐U, g ∈ 𝓐V,

\[0 = \langle (I - P_{\mathcal{A}_W})f,\; (I - P_{\mathcal{A}_W})g\rangle = \langle f, g\rangle - \langle f, P_{\mathcal{A}_W} g\rangle, \tag{33}\]

where the second equality follows from the definition of a projection. Using Lemma 1, we calculate the two terms on the right as

\[\langle f, g\rangle = \langle f, \Sigma_{UV} g\rangle, \qquad \langle f, P_{\mathcal{A}_W} g\rangle = \langle f, \Sigma_{WV} g\rangle = \langle f, \Sigma_{UW}\Sigma_{WV} g\rangle.\]

Hence (33) is equivalent to 〈f, (ΣUV − ΣUW ΣWV)g〉 = 0, which implies ΣUV − ΣUW ΣWV = 0. The arguments can obviously be reversed.

Proof of Proposition 2

1 ⇒ 2. For any f ∈ 𝓐X with ∣∣f∣∣𝓐X ≤ 1, we have, by the triangle inequality,

\[\|Tf\|_{\mathcal{A}_X} = \Big\|\sum_{i=1}^{p}\sum_{j=1}^{p}T_{X_jX_i}f_i\Big\|_{\mathcal{A}_X} \le \sum_{i=1}^{p}\sum_{j=1}^{p}\|T_{X_jX_i}f_i\| \le p^2\max_{i,j=1,\ldots,p}\|T_{X_jX_i}\| < \infty.\]

2 ⇒ 1. For any fk ∈ 𝓐Xk and ∣∣fk∣∣𝓐k ≤ 1, we have

\[\|Tf_k\|^2_{\mathcal{A}_X} = \langle Tf_k, Tf_k\rangle_{\mathcal{A}_X} = \sum_{i=1}^{p}\langle T_{X_iX_k}f_k,\; T_{X_iX_k}f_k\rangle \ge \|T_{XX_k}f_k\|^2,\]

which implies

\[\sup\{\|T_{XX_k}f_k\|: \|f_k\|\le 1\} \le \sup\{\|Tf_k\|_{\mathcal{A}_X}: \|f_k\|\le 1\} = \sup\{\|Tf_k\|_{\mathcal{A}_X}: \|f_k\|_{\mathcal{A}_X}\le 1\} \le \|T\| < \infty.\]

Proof of Lemma 2

First, we show that TUU (and hence also TVV) is bounded, self adjoint, and positive definite. Let h ∈ 𝓐U. Then TUU h ∈ 𝓐U. Hence

\[\langle T_{UU}h,\, T_{UU}h\rangle_{\mathcal{A}_X} = \langle T_{UU}h,\, Th\rangle_{\mathcal{A}_X} \le \|T_{UU}h\|_{\mathcal{A}_X}\,\|Th\|_{\mathcal{A}_X},\]

which implies that ∣∣TUUh∣∣𝓐X is either 0 or no greater than ||T|| ||h||. Hence TUU is bounded. That TUU is self adjoint follows from 〈g, TUUh𝓐X = 〈g, Th𝓐X for all g, h ∈ 𝓐U. Because T > 0, if h ∈ 𝓐U and h ≠ 0, then 〈h, Th𝓐X > 0. Because 〈h, Th𝓐X = 〈h, TUUh𝓐X we have 〈h, TUUh𝓐X > 0. Thus TUU > 0.

That $T_{UU} - T_{UW}T_{WW}^{-1}T_{WU}$ and $T_{VV} - T_{VW}T_{WW}^{-1}T_{WV}$ are bounded and self adjoint can be proved similarly. To show that they are positive definite, and to establish the subsequent identities, let IUU be the identity operator from 𝓐U to 𝓐U, and OUV be the zero operator from 𝓐V to 𝓐U. Define IVV and OVU similarly. Then

\[\begin{pmatrix} (T^{-1})_{UU} & (T^{-1})_{UV} \\ (T^{-1})_{VU} & (T^{-1})_{VV} \end{pmatrix}\begin{pmatrix} T_{UU} & T_{UV} \\ T_{VU} & T_{VV} \end{pmatrix} = \begin{pmatrix} I_{UU} & O_{UV} \\ O_{VU} & I_{VV} \end{pmatrix}.\]

Hence

\[(T^{-1})_{UU}T_{UU} + (T^{-1})_{UV}T_{VU} = I_{UU}, \tag{34}\]
\[(T^{-1})_{UU}T_{UV} + (T^{-1})_{UV}T_{VV} = O_{UV}. \tag{35}\]

By part 1, TVV is invertible. Hence, by (35),

\[(T^{-1})_{UV} = -(T^{-1})_{UU}T_{UV}T_{VV}^{-1}. \tag{36}\]

Substitute this into (34) to obtain

\[(T^{-1})_{UU}\big(T_{UU} - T_{UV}T_{VV}^{-1}T_{VU}\big) = I_{UU}, \tag{37}\]

which implies that $T_{UU} - T_{UV}T_{VV}^{-1}T_{VU}$ is invertible and the first two identities in (9) hold. Similarly, we can prove that $T_{VV} - T_{VU}T_{UU}^{-1}T_{UV}$ is invertible and the last two identities in (9) hold.
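The identities (34)–(37) are operator versions of the standard block-matrix inversion formulas, and in finite dimensions they can be verified directly. The Python sketch below is such a finite-dimensional sanity check only (it does not reproduce the operator setting of the lemma); the block sizes and the random matrix are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)
    u, v = 3, 4
    M = rng.standard_normal((u + v, u + v))
    T = M @ M.T + (u + v) * np.eye(u + v)      # symmetric positive definite
    Tinv = np.linalg.inv(T)

    Tuu, Tuv = T[:u, :u], T[:u, u:]
    Tvu, Tvv = T[u:, :u], T[u:, u:]
    Iuu_blk, Iuv_blk = Tinv[:u, :u], Tinv[:u, u:]

    # (T^{-1})_{UV} = -(T^{-1})_{UU} T_{UV} T_{VV}^{-1}, cf. (36)
    print(np.allclose(Iuv_blk, -Iuu_blk @ Tuv @ np.linalg.inv(Tvv)))
    # (T^{-1})_{UU} (T_{UU} - T_{UV} T_{VV}^{-1} T_{VU}) = I_{UU}, cf. (37)
    print(np.allclose(Iuu_blk @ (Tuu - Tuv @ np.linalg.inv(Tvv) @ Tvu), np.eye(u)))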

Proof of Theorem 6

Let A, B, C be the index sets of U, V, W, let Z = (U, V), and define

\[\begin{aligned}
&\Upsilon_{ZZ} = \{\Sigma_{X_iX_j}: i, j \in A \cup B\}, \qquad \Upsilon_{ZW} = \{\Sigma_{X_iX_j}: i \in A \cup B,\ j \in C\}, \\
&\Upsilon_{WZ} = \{\Sigma_{X_iX_j}: i \in C,\ j \in A \cup B\}, \qquad \Upsilon_{WW} = \{\Sigma_{X_iX_j}: i \in C,\ j \in C\}.
\end{aligned}\]

These operators are different from the operators such as ΣUV in Section 4 because they are defined by different inner products. For example, ΣUU is the identity mapping but ϒUU is not. By Lemma 2, $\Theta_{ZZ} = (\Upsilon_{ZZ} - \Upsilon_{ZW}\Upsilon_{WW}^{-1}\Upsilon_{WZ})^{-1}$. Because

\[\Upsilon_{ZZ} - \Upsilon_{ZW}\Upsilon_{WW}^{-1}\Upsilon_{WZ} = \begin{pmatrix} \Upsilon_{UU} - \Upsilon_{UW}\Upsilon_{WW}^{-1}\Upsilon_{WU} & \Upsilon_{UV} - \Upsilon_{UW}\Upsilon_{WW}^{-1}\Upsilon_{WV} \\ \Upsilon_{VU} - \Upsilon_{VW}\Upsilon_{WW}^{-1}\Upsilon_{WU} & \Upsilon_{VV} - \Upsilon_{VW}\Upsilon_{WW}^{-1}\Upsilon_{WV} \end{pmatrix},\]

we have, by Lemma 2 again, $(\Theta_{ZZ}^{-1})_{UV} = -(\Theta_{ZZ}^{-1})_{UU}\Theta_{UV}\Theta_{VV}^{-1}$. This can be rewritten as

\[\Upsilon_{UV} - \Upsilon_{UW}\Upsilon_{WW}^{-1}\Upsilon_{WV} = -\big(\Upsilon_{UU} - \Upsilon_{UW}\Upsilon_{WW}^{-1}\Upsilon_{WU}\big)\Theta_{UV}\Theta_{VV}^{-1}. \tag{38}\]

Therefore, ΘUV = 0 iff $\Upsilon_{UV} - \Upsilon_{UW}\Upsilon_{WW}^{-1}\Upsilon_{WV} = 0$.

Now let f ∈ 𝓐U, g ∈ 𝓐V. Then, by the definition of a projection,

\[\langle f - P_{\mathcal{A}_W}f,\; g - P_{\mathcal{A}_W}g\rangle = \langle f, g\rangle - \langle f, P_{\mathcal{A}_W}g\rangle.\]

The first term on the right is 〈f, ϒUVg𝓐X. Because, for any h ∈ 𝓐W,

\[\langle h,\; \Upsilon_{WW}^{-1}\Upsilon_{WV}g\rangle = \langle h,\; \Upsilon_{WV}g\rangle_{\mathcal{A}_X} = \langle h, g\rangle,\]

we have $\Upsilon_{WW}^{-1}\Upsilon_{WV} = P_{\mathcal{A}_W \mid \mathcal{A}_V}$. Hence

\[\langle f, P_{\mathcal{A}_W}g\rangle = \langle f,\; \Upsilon_{WW}^{-1}\Upsilon_{WV}g\rangle = \langle f,\; \Upsilon_{UW}\Upsilon_{WW}^{-1}\Upsilon_{WV}g\rangle_{\mathcal{A}_X}.\]

That is, we have shown that

\[\langle f - P_{\mathcal{A}_W}f,\; g - P_{\mathcal{A}_W}g\rangle = \langle f,\; (\Upsilon_{UV} - \Upsilon_{UW}\Upsilon_{WW}^{-1}\Upsilon_{WV})g\rangle_{\mathcal{A}_X},\]

which means $\Upsilon_{UV} - \Upsilon_{UW}\Upsilon_{WW}^{-1}\Upsilon_{WV} = 0$ iff U ⫫_𝓐 V | W.
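In finite dimensions, the argument above reduces to the familiar fact that a zero off-diagonal block of a precision matrix is equivalent to a vanishing partial cross-covariance given the remaining block. The Python sketch below illustrates only this matrix analogue of (38); the Υ's in the proof are operators on 𝓐X, and the matrix construction here is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(4)
    u, v, w = 2, 2, 3
    d = u + v + w

    # a symmetric positive definite "precision" matrix with a zero UV block
    S = rng.standard_normal((d, d))
    S = S @ S.T
    S[:u, u:u + v] = 0.0
    S[u:u + v, :u] = 0.0
    Theta = S + (1.0 + abs(np.linalg.eigvalsh(S).min())) * np.eye(d)

    Ups = np.linalg.inv(Theta)                 # matrix analogue of Upsilon
    Uuv = Ups[:u, u:u + v]
    Uuw, Uwv = Ups[:u, u + v:], Ups[u + v:, u:u + v]
    Uww = Ups[u + v:, u + v:]

    # Upsilon_{UV} - Upsilon_{UW} Upsilon_{WW}^{-1} Upsilon_{WV} vanishes when Theta_{UV} = 0
    print(np.allclose(Uuv - Uuw @ np.linalg.inv(Uww) @ Uwv, 0.0))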

Proof of Lemma 3

The proofs of the three equalities are similar, so we only prove the last one. By (10), for any f ∈ 𝓐X−(i,j) and g ∈ 𝓐Xi, we have

\[\langle f,\; \Sigma_{X_{-(i,j)}X_i}g\rangle_{\mathcal{A}_{X_{-(i,j)}}} = n^{-1}[f]^{\top}K_{-(i,j)}QK_{-(i,j)}\big[\Sigma_{X_{-(i,j)}X_i}\big][g].\]

Alternatively,

\[\langle f,\; \Sigma_{X_{-(i,j)}X_i}g\rangle_{\mathcal{A}_{X_{-(i,j)}}} = \mathrm{cov}_n[f(X), g(X)] = n^{-1}[f]^{\top}K_{-(i,j)}QK_i[g].\]

Hence the right-hand sides of the above two equalities are the same for all [f] and [g], which implies the third relation in (11).

Proof of Proposition 4

  1. Write (15) as $V = L\Psi R^{\top}$, where L = (L1, L0), R = (R1, R0), and Ψ = diag(D, 0). Because L and R are orthogonal matrices,
    \[(VV^{\top} + \varepsilon I_s)^{-1} = (L\Psi\Psi^{\top}L^{\top} + \varepsilon I_s)^{-1} = L(\Psi\Psi^{\top} + \varepsilon I_s)^{-1}L^{\top}. \tag{39}\]
    By construction,
    \[(\Psi\Psi^{\top} + \varepsilon I_s)^{-1} = \{\mathrm{diag}(D^2 + \varepsilon I_u,\ \varepsilon I_{s-u})\}^{-1} = \mathrm{diag}\{(D^2 + \varepsilon I_u)^{-1},\ \varepsilon^{-1}I_{s-u}\}. \tag{40}\]
    Hence the right-hand side of (39) is $L_1(D^2 + \varepsilon I_u)^{-1}L_1^{\top} + \varepsilon^{-1}L_0L_0^{\top}$, which, because L is an orthogonal matrix, can be rewritten as
    \[L_1(D^2 + \varepsilon I_u)^{-1}L_1^{\top} + \varepsilon^{-1}(I_s - L_1L_1^{\top}) = L_1\big((D^2 + \varepsilon I_u)^{-1} - \varepsilon^{-1}I_u\big)L_1^{\top} + \varepsilon^{-1}I_s.\]

    By (15) we have $V = L_1DR_1^{\top}$, and hence $L_1 = VR_1D^{-1}$. Substitute this into the right-hand side to prove part 1.

  2. Multiply both sides of (39) by $V^{\top}$ from the left and by V from the right, to obtain
    \[V^{\top}(VV^{\top} + \varepsilon I_s)^{-1}V = R\Psi^{\top}(\Psi\Psi^{\top} + \varepsilon I_s)^{-1}\Psi R^{\top}.\]

    Substitute $R\Psi^{\top} = R_1D$ and (40) into the right-hand side above to prove part 2.
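The two identities arrived at in the proof can be checked numerically. The following Python sketch generates a rank-u matrix V, takes its singular value decomposition, and verifies the part-1 expression (with L1 written as $VR_1D^{-1}$) and the part-2 expression; the dimensions and the ridge parameter ε are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    s, t, u, eps = 6, 5, 3, 0.1

    # a rank-u matrix V, playing the role of V in (15)
    V = rng.standard_normal((s, u)) @ rng.standard_normal((u, t))

    # keep the u leading singular triplets: V = L1 D R1'
    L, d, Rt = np.linalg.svd(V)
    L1, D, R1 = L[:, :u], np.diag(d[:u]), Rt[:u, :].T

    lhs = np.linalg.inv(V @ V.T + eps * np.eye(s))

    # part 1, with L1 replaced by V R1 D^{-1}
    L1_alt = V @ R1 @ np.linalg.inv(D)
    rhs1 = L1_alt @ (np.linalg.inv(D @ D + eps * np.eye(u)) - np.eye(u) / eps) @ L1_alt.T \
           + np.eye(s) / eps
    print(np.allclose(lhs, rhs1))

    # part 2: V'(VV' + eps I_s)^{-1} V = R1 D (D^2 + eps I_u)^{-1} D R1'
    rhs2 = R1 @ D @ np.linalg.inv(D @ D + eps * np.eye(u)) @ D @ R1.T
    print(np.allclose(V.T @ lhs @ V, rhs2))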

Proof of Theorem 7

Let f, g ∈ 𝓐X. Then f = f1 + ··· + fp and ϒg = h1 + ··· + hp where fk, hk ∈ 𝓐Xk, k = 1, …, p. Hence

\[\langle f, \Upsilon g\rangle_{\mathcal{A}_X} = n^{-1}\sum_{i=1}^{p}\langle f_i, h_i\rangle_{\mathcal{A}_{X_i}} = n^{-1}\sum_{i=1}^{p}[f_i]^{\top}K_iQK_i[h_i]. \tag{41}\]

Because

\[[f]_{\mathcal{A}_X} = \big([f_1]_{\mathcal{A}_{X_1}}, \ldots, [f_p]_{\mathcal{A}_{X_p}}\big), \qquad [\Upsilon g]_{\mathcal{A}_X} = \big([h_1]_{\mathcal{A}_{X_1}}, \ldots, [h_p]_{\mathcal{A}_{X_p}}\big),\]

the right-hand side of (41) can be rewritten in matrix form as

\[n^{-1}[f]^{\top}\mathrm{diag}(K_1QK_1, \ldots, K_pQK_p)[\Upsilon][g]. \tag{42}\]

Alternatively, the inner product in (41) can be written as

\[\langle f, \Upsilon g\rangle_{\mathcal{A}_X} = n^{-1}\sum_{i=1}^{p}\sum_{j=1}^{p}\langle f_i,\; \Sigma_{X_iX_j}g_j\rangle_{\mathcal{A}_{X_i}} = n^{-1}\sum_{i=1}^{p}\sum_{j=1}^{p}[f_i]^{\top}K_iQK_j[g_j] = n^{-1}[f]^{\top}K_{1:p}QK_{1:p}^{\top}[g]. \tag{43}\]

Since (42) is equal to (43) for all [f] and [g], equality (16) holds.
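The step from the double sum to the single quadratic form in (43) can be checked directly once $K_{1:p}$ is read as the $np \times n$ vertical stack of K1, …, Kp, so that the last expression is $n^{-1}[f]^{\top}K_{1:p}QK_{1:p}^{\top}[g]$ (this reading of the notation is ours). The Python sketch below verifies the identity, with arbitrary symmetric matrices standing in for the Ki and the usual centering matrix standing in for Q; the common factor $n^{-1}$ is omitted because it appears on both sides.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 8, 3

    # symmetric stand-ins for the Gram matrices K_1, ..., K_p
    K = [a @ a.T for a in (rng.standard_normal((n, n)) for _ in range(p))]
    Q = np.eye(n) - np.ones((n, n)) / n        # centering matrix (illustrative choice)
    f = [rng.standard_normal(n) for _ in range(p)]
    g = [rng.standard_normal(n) for _ in range(p)]

    # double sum in the middle of (43)
    lhs = sum(f[i] @ K[i] @ Q @ K[j] @ g[j] for i in range(p) for j in range(p))

    # stacked form, with K_{1:p} the (np x n) vertical stack of K_1, ..., K_p
    K1p = np.vstack(K)
    fvec, gvec = np.concatenate(f), np.concatenate(g)
    rhs = fvec @ K1p @ Q @ K1p.T @ gvec
    print(np.allclose(lhs, rhs))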

Operator norm of CCO

By Fukumizu et al. (2009),

\[\big[\hat V_{X_iX_j \mid X_{-(i,j)}}\big] = n^{-1}Q_{X_{-(i,j)}}(\varepsilon_2)\,G_{X_j}. \tag{44}\]

The operator norm of $\hat V_{X_iX_j \mid X_{-(i,j)}}$ is obtained by maximizing $\langle f_i, \hat V_{X_iX_j \mid X_{-(i,j)}} f_j\rangle^2$ subject to $\langle f_i, f_i\rangle = \langle f_j, f_j\rangle = 1$. By linear algebra, this is equivalent to

\[\text{maximizing } \langle f_i,\; \hat V_{X_iX_j \mid X_{-(i,j)}}\hat V_{X_jX_i \mid X_{-(i,j)}} f_i\rangle \quad \text{subject to } \langle f_i, f_i\rangle = 1. \tag{45}\]

By (44), the first expression in (45) is $n^{-2}[f_i]^{\top}G_{X_i}Q_{X_{-(i,j)}}(\varepsilon_2)G_{X_j}Q_{X_{-(i,j)}}(\varepsilon_2)G_{X_i}[f_i]$. The second expression in (45) is $[f_i]^{\top}G_{X_i}[f_i] = 1$. This is a generalized eigenvalue problem. To convert this to an ordinary eigenvalue problem, let $\phi_i = G_{X_i}^{1/2}[f_i]$. Solve this equation for $[f_i]$ with Tychonoff regularization, to obtain $[f_i] = (G_{X_i} + \varepsilon_1 I_n)^{-1/2}\phi_i$. The optimization problem (45) reduces to maximizing (ignoring the factor $n^{-2}$):

\[\phi_i^{\top}(G_{X_i} + \varepsilon_1 I_n)^{-1/2}G_{X_i}Q_{X_{-(i,j)}}(\varepsilon_2)G_{X_j}Q_{X_{-(i,j)}}(\varepsilon_2)G_{X_i}(G_{X_i} + \varepsilon_1 I_n)^{-1/2}\phi_i\]

subject to $\phi_i^{\top}\phi_i = 1$. As we did with ACCO and APO, we use the symmetrized version

\[\phi_i^{\top}(G_{X_i} + \varepsilon_1 I_n)^{1/2}Q_{X_{-(i,j)}}(\varepsilon_2)(G_{X_j} + \varepsilon_1 I_n)Q_{X_{-(i,j)}}(\varepsilon_2)(G_{X_i} + \varepsilon_1 I_n)^{1/2}\phi_i,\]

which is of the form $A^{\top}A$ with A being the matrix in (21).
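Computationally, the symmetrized form says that the regularized squared operator norm is the largest eigenvalue of $A^{\top}A$, that is, the squared leading singular value of A, for a factor such as $A = (G_{X_j} + \varepsilon_1 I_n)^{1/2}Q_{X_{-(i,j)}}(\varepsilon_2)(G_{X_i} + \varepsilon_1 I_n)^{1/2}$ (since (21) is not reproduced here, this particular factorization is our reading, valid when Q is symmetric). A minimal Python sketch of this final step, taking the Gram matrices and the matrix $Q_{X_{-(i,j)}}(\varepsilon_2)$ as given inputs and using an illustrative function name:

    import numpy as np

    def _sym_sqrt(M):
        """Symmetric square root of a symmetric positive semidefinite matrix."""
        w, U = np.linalg.eigh(M)
        return U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T

    def cco_sq_norm(G_i, G_j, Q, eps1):
        """Largest eigenvalue of A'A, with A = (G_j + eps1 I)^{1/2} Q (G_i + eps1 I)^{1/2};
        Q is assumed symmetric, as the symmetrized quadratic form requires."""
        n = G_i.shape[0]
        A = _sym_sqrt(G_j + eps1 * np.eye(n)) @ Q @ _sym_sqrt(G_i + eps1 * np.eye(n))
        return np.linalg.svd(A, compute_uv=False)[0] ** 2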

Contributor Information

Bing Li, Email: bing@stat.psu.edu.

Hyonho Chun, Email: chunh@purdue.edu.

Hongyu Zhao, Email: hongyu.zhao@yale.edu.

References

  1. Bellman RE. Dynamic Programming. Princeton University Press; 1957.
  2. Bach FR. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research. 2008;9:1179–1225.
  3. Bickel PJ, Levina E. Covariance regularization by thresholding. Annals of Statistics. 2008;36:2577–2604.
  4. Chun H, Zhang X, Zhao H. Gene regulation network inference with joint sparse Gaussian graphical models. 2012. Unpublished manuscript.
  5. Dimas AS, Deutsch S, Stranger BE, Montgomery SB, Borel C, Attar-Cohen H, Ingle C, Beazley C, Gutierrez Arcelus M, Sekowska M, Gagnebin M, Nisbett J, Deloukas P, Dermitzakis ET, Antonarakis SE. Common regulatory variation impacts gene expression in a cell type-dependent manner. Science. 2009;325:1246–1250. doi: 10.1126/science.1174148.
  6. Dawid AP. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B. 1979;41:1–31.
  7. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  8. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.
  9. Fukumizu K, Bach FR, Jordan MI. Kernel dimension reduction in regression. Annals of Statistics. 2009;37:1871–1905.
  10. Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2009;98:1–15. doi: 10.1093/biomet/asq060.
  11. Kelley JL. General Topology. New York: Springer; 1955.
  12. Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics. 2009;37:4254–4278. doi: 10.1214/09-AOS720.
  13. Lancaster HO. Some properties of the bivariate normal distribution considered in the form of a contingency table. Biometrika. 1957;44:289–292.
  14. Lauritzen SL. Graphical Models. Oxford: Clarendon Press; 1996.
  15. Lauritzen SL, Dawid AP, Larsen BN, Leimer HG. Independence properties of directed Markov fields. Networks. 1990;20:491–505.
  16. Lee K-Y, Li B, Chiaromonte F. A general theory for nonlinear sufficient dimension reduction: formulation and estimation. Annals of Statistics. 2013. In press.
  17. Li B, Artemiou A, Li L. Principal support vector machines for linear and nonlinear sufficient dimension reduction. Annals of Statistics. 2011;36:3182–3210.
  18. Li B, Chun H, Zhao H. Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association. 2012;107:152–167. doi: 10.1080/01621459.2011.644498.
  19. Liu H, Han F, Zhang C. Transelliptical graphical models. In: Advances in Neural Information Processing Systems. Vol. 25. Curran Associates, Inc.; 2013. pp. 800–808.
  20. Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High dimensional semiparametric gaussian copula graphical models. Annals of Statistics. 2013. In press.
  21. Liu H, Lafferty J, Wasserman L. The nonparanormal: semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research. 2009;10:2295–2328.
  22. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Annals of Statistics. 2006;34:1436–1462.
  23. Pearl J, Geiger D, Verma T. Conditional independence and its representations. Kybernetika. 1989;25:33–44.
  24. Pearl J, Paz A. Graphoids: a graph-based logic for reasoning about relevancy relations. In: Proceedings of the European Conference on Artificial Intelligence; Brighton, UK. 1986.
  25. Pearl J, Verma T. The logic of representing dependencies by directed acyclic graphs. In: Proceedings of the American Association of Artificial Intelligence Conference; Seattle, Washington. 1987.
  26. Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association. 2009;104:735–746. doi: 10.1198/jasa.2009.0126.
  27. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  28. Wahba G. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics. 1990.
  29. Xue L, Zou H. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Annals of Statistics. 2013. In press.
  30. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.
  31. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
