Abstract
We introduce a nonparametric method for estimating non-gaussian graphical models based on a new statistical relation called additive conditional independence, which is a three-way relation among random vectors that resembles the logical structure of conditional independence. Additive conditional independence allows us to use one-dimensional kernels regardless of the dimension of the graph, which not only avoids the curse of dimensionality but also simplifies computation. It also gives rise to a parallel structure to the gaussian graphical model that replaces the precision matrix by an additive precision operator. The estimators derived from additive conditional independence cover the recently introduced nonparanormal graphical model as a special case, but outperform it when the gaussian copula assumption is violated. We compare the new method with existing ones by simulations and in genetic pathway analysis.
Key words and phrases: Additive conditional independence, additive precision operator, copula, conditional independence, covariance operator, gaussian graphical model, nonparanormal graphical model, reproducing kernel Hilbert space
1 Introduction
The gaussian graphical model is one of the most important tools for statistical network analysis, partly because its estimation can be neatly formulated as identifying the zero entries in the precision matrix, which turns estimation of a graph into sparse estimation of a high-dimensional precision matrix. For recent advances in this area, see Meinshausen and Bühlmann (2006), Yuan and Lin (2007), Bickel and Levina (2008), Peng, Wang, Zhou, and Zhu (2009), Guo, Levina, Michailidis, and Zhu (2009), and Lam and Fan (2009).
Specifically, let X = (X1, …, Xp)⊤ be a p-dimensional random vector distributed as N(μ, Σ), where Σ is positive definite. Let Γ = {1, …, p}, E ⊆ {(i, j) ∈ Γ × Γ : i ≠ j}, and Θ = Σ−1. Let θij be the (i, j)th element of Θ. Then X follows a gaussian graphical model (GGM) with respect to the graph 𝓖 = {Γ, E} iff
(i, j) ∉ E ⟺ Xi ⫫ Xj | X−(i,j), (1)
where X−(i,j) represents the set {Xk : k ≠ i, j}. Under the multivariate gaussian assumption, the above relation is equivalent to
(i, j) ∉ E ⟺ θij = 0. (2)
Based on this equivalence, the estimation of 𝓖 reduces to identification of the zero entries in Θ, which can be done by sparse estimation procedures such as thresholding (Bickel and Levina, 2008), LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), and adaptive LASSO (Zou, 2006).
Since the multivariate gaussian assumption limits the applicability of the GGM, there have been intense developments in generalizing it to non-gaussian cases in the past few years, chiefly through the use of nonparametric copula transformations. See Liu, Lafferty, and Wasserman (2009), Liu, Han, Yuan, Lafferty, and Wasserman (2013), and Xue and Zou (2013). The nonparametric gaussian copula graphical model (GCGM) is a flexible and effective model that relaxes the marginal gaussian assumption but preserves the inter-dependence structure of a multivariate gaussian distribution, so that the equivalence between (1) and (2) still holds in an altered form. It assumes the existence of unknown injections f1, …, fp such that (f1(X1), …, fp(Xp))⊤ is multivariate gaussian. The GCGM keeps the simplicity and elegance of the GGM but greatly expands its scope.
However, the gaussian copula assumption can be violated even for commonly used interactions. For example, if (U, V) is a random vector where V = U2 + ε, and U and ε are i.i.d. N(0, 1), then no transformation can render (f1(U), f2(V)) bivariate gaussian. In fact, the copula transformations in this case are: f1(u) = u, f2(v) = Φ−1∘FV(v), where FV is the convolution of N(0, 1) and the χ21 distribution (the distribution of U2), and Φ is the c.d.f. of N(0, 1). Figure 1 plots (U, Φ−1∘FV(V)) for a sample of 100 observed pairs, which shows a strong non-gaussian pattern even though U and Φ−1∘FV(V) are marginally gaussian. Under such circumstances the gaussian copula assumption no longer holds, and our goal is to develop a broader class of graphical models that cover such cases.
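As a quick numerical check, the sketch below reproduces the pattern of Figure 1 using numpy and scipy; the empirical c.d.f. of V (computed from ranks) stands in for the exact FV, which is an approximation not used in the text.

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(0)
n = 100
u = rng.standard_normal(n)               # U ~ N(0, 1)
v = u ** 2 + rng.standard_normal(n)      # V = U^2 + eps, eps ~ N(0, 1) independent of U

# Empirical stand-in for Phi^{-1} o F_V: ranks rescaled into (0, 1), then norm.ppf.
v_gauss = norm.ppf(rankdata(v) / (n + 1))

# Marginally both coordinates are (close to) N(0, 1), yet the joint scatter of
# (u, v_gauss) shows the strong quadratic pattern of Figure 1.
print(np.corrcoef(u, v_gauss)[0, 1])        # small: the pair is nearly uncorrelated
print(np.corrcoef(u ** 2, v_gauss)[0, 1])   # large: strong non-gaussian dependence
```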
Figure 1.
Violation of copula assumption. The vertical axis is Φ−1∘FV(V). Both U and Φ−1∘FV(V) are N (0, 1) but their joint distribution is strongly non-gaussian.
In developing the new graphical models we would like to preserve an appealing feature of the GCGM; that is, the nonparametric operation acts upon one-dimensional random variables individually, rather than on a high-dimensional random vector jointly. This feature allows us to avoid high-dimensional smoothing, which is the source of the curse of dimensionality (Bellman, 1957). However, if we insist on using conditional independence (1) as the criterion for constructing graphical models, then we are inevitably led to a fully fledged nonparametric procedure involving smoothing over the entire p-dimensional space, and the nice structures of the gaussian graphical model such as (2) would be lost. This motivates us to introduce a new statistical relation — additive conditional independence, or ACI, to replace conditional independence as our criterion. Although ACI is different from conditional independence, it satisfies the axioms for a semi-graphoid (Pearl and Verma, 1987) shared by conditional independence, which captures the essence of a graph. Furthermore, under the gaussian copula assumption, ACI reduces to conditional independence. The graphical model defined by ACI — which we refer to as the Additive Semi-graphoid Model (ASG) — preserves a parallel structure to (2).
The paper is organized as follows. In section 2 we outline the axiomatic system for a semi-graphoid logical relation, and introduce the notion of ACI as well as the ASG model it induces. In section 3 we investigate the relation between ACI and conditional independence under the gaussian copula assumption. In sections 4 and 5, we introduce the additive conditional covariance operator and additive precision operator, and study their relations with ACI. In sections 6 and 7 we introduce two estimators for the ASG model based on these operators. In section 8 we propose cross validation procedures to determine the tuning parameters. In sections 9 and 10 we compare the ASG model with GGM and GCGM by simulation and in an applied setting of genetic pathway inference. We conclude in section 11 with suggested future research directions. All proofs are presented in the Appendix.
2 Semi-graphoid axioms and additive conditional independence
Before introducing additive conditional independence, we first restate the axiomatic system developed by Pearl and Verma (1987) and Pearl, Geiger, and Verma (1989), in a slightly different form. Let 2Γ be the class of all subsets of Γ. We call any subset ℛ ⊆ 2Γ × 2Γ × 2Γ a three-way relation on Γ. This terminology follows the tradition of two-way relations in mathematics (see, for example, Kelley, 1955, page 6).
Definition 1 (Semi-graphoid Axioms)
A three-way relation ℛ is called a semi-graphoid if it satisfies the following conditions:
(symmetry) (A, C, B) ∈ ℛ ⇒ (B, C, A) ∈ ℛ;
(decomposition) (A, C, B ∪ D) ∈ ℛ ⇒ (A, C, B) ∈ ℛ;
(weak union) (A, C, B ∪ D) ∈ ℛ ⇒ (A, C ∪ B, D) ∈ ℛ;
(contraction) (A, C ∪ B, D) ∈ ℛ, (A, C, B) ∈ ℛ ⇒ (A, C, B ∪ D) ∈ ℛ.
These axioms are extracted from conditional independence to convey the general idea of “B is irrelevant for understanding A once C is known”, or “C separates A and B”. Conditional independence is a special case of the semi-graphoid three-way relation. More specifically, let X = {Xi : i ∈ Γ} be a random vector and, for any A ⊆ Γ, let XA denote {Xi : i ∈ A}. If ℛ is the class of (A, C, B) such that XA ⫫ XB|XC (XA and XB are conditionally independent given XC), then it can be shown that ℛ indeed satisfies the semi-graphoid axioms (Dawid, 1979).
Because, like conditional independence, a semi-graphoid relation (A, C, B) ∈ ℛ also conveys the idea that C separates A and B, it seems reasonable to use the semi-graphoid relation itself as a criterion for constructing a graph. Indeed, the very statement “there is no direct link between node i and node j” means, precisely, that “{i, j}c separates i and j”. In other words, it is reasonable to use the logical statement ({i}, {i, j}c, {j}) ∈ ℛ to determine the absence of edge between i and j. Using the semi-graphoid relation in place of probabilistic conditional independence can free us from some awkward restrictions that come with a fully fledged probability model.
Let X = (X1, …, Xp)⊤ be a random vector. For any subvector U = (U1, …, Ur)⊤ of X, let L2(PU) be the class of functions of U such that Ef(U) = 0, Ef2(U) < ∞. For each Ui, let 𝓐Ui denote a subset of L2(PUi). Let 𝓐U denote the additive family 𝓐U = 𝓐U1 + ··· + 𝓐Ur = {f1 + ··· + fr : fi ∈ 𝓐Ui}.
For two subspaces 𝓐 and 𝓑 of L2(PX), let 𝓐 ⊖ 𝓑 = 𝓐 ∩ 𝓑⊥, where orthogonality is in terms of the L2(PX)-inner product. In the following, an unadorned inner product 〈·, ·〉 always represents the L2(PX)-inner product.
Definition 2
Let U, V, and W be subvectors of X. We say that U and V are additively conditionally independent (ACI) given W iff
(𝓐U + 𝓐W) ⊖ 𝓐W ⊥ (𝓐V + 𝓐W) ⊖ 𝓐W, (3)
where ⊥ indicates orthogonality in the L2(PX)-inner product. We write relation (3) as U ⫫AV|W.
Strictly speaking, ACI should be stated with respect to the triplet of functional families (𝓐U, 𝓐V, 𝓐W), but in most cases we can ignore this qualification without loss of clarity. Lauritzen (1996, page 31) mentioned a relation similar to (3) where 𝓐U, 𝓐V, and 𝓐W are Euclidean subspaces, but absent from that construction are the underlying random vectors U, V, W, which is crucial to our development.
The next theorem gives seven equivalent conditions for ACI, each of which can be used as its definition. As we will see in the proof of Theorem 2, these various forms are immensely useful in exploring the ramifications of ACI. In the following, for an operator P on a Hilbert space 𝓗, let ker P and ran P denote the kernel and range of P, and let cl(ran P) denote the closure of ran P.
Theorem 1
Let U, V, W be subvectors of X. The following statements are equivalent:
(𝓐U + 𝓐W) ⊖ 𝓐W ⊥ (𝓐V + 𝓐W) ⊖ 𝓐W;
𝓐U∪W ⊖ 𝓐W ⊥ 𝓐V∪W ⊖ 𝓐W;
(I − P𝓐W) 𝓐U ⊥ (I − P𝓐W)𝓐V;
𝓐U ⊆ ker(P𝓐V∪W ⊖ 𝓐W);
𝓐U ⊆ ker(P𝓐V∪W − P𝓐W);
𝓐U∪W ⊆ ker(P𝓐V∪W ⊖ 𝓐W);
𝓐U∪W ⊆ ker(P𝓐V∪W − P𝓐W).
The next theorem shows that ACI satisfies the semi-graphoid axioms, which justifies using it to construct graphical models.
Theorem 2
The additive conditional independence relation is a semi-graphoid.
We are now ready to define the additive semi-graphoid model.
Definition 3
A random vector X follows an additive semi-graphoid model with respect to a graph 𝓖 = (Γ, E) iff
(i, j) ∉ E ⟺ Xi ⫫A Xj | X−(i,j). (4)
We write this condition as X ~ ASG(𝓖).
3 Relation with copula graphical models
While in the last section we have seen that it is reasonable to use ACI instead of conditional independence as a criterion for constructing graphs, we now investigate the special cases where ACI reduces to conditional independence.
Theorem 3
Suppose
X has a gaussian copula distribution with copula functions f1, …, fp;
U, V, and W are subvectors of X;
𝓐Xi = span{fi}, i = 1, …, p.
Then U ⫫AV|W if and only if U ⫫V|W.
Under the specific form of the ASG model in Theorem 3, there is no edge between i and j iff the (i, j)th entry of the precision matrix of (f1(X1), …, fp(Xp))⊤ is 0. The latter is the population-level criterion used by the GCGM estimator. In this sense, ASG can be regarded as a generalization of the nonparanormal models.
Since our main interests are in the ASG models with 𝓐Xi = L2(PXi), it is also of interest to ask whether ⫫A coincides with ⫫ in this case. The next theorem and the subsequent numerical investigation show that they are very nearly equivalent for all practical purposes.
Theorem 4
Suppose
X has a gaussian copula distribution with copula functions f1, …, fp;
U, V, and W are subvectors of X;
𝓐Xi = L2(PXi).
Then U ⫫AV|W implies U ⫫ V|W.
The proof, which is omitted, is essentially the same as that of Theorem 3. Interestingly, under 𝓐Xi = L2(PXi), ⫫ does not imply ⫫A even under the gaussian copula assumption. However, the following numerical investigation suggests that this implication holds approximately, with vanishingly small error. We are able to carry out this investigation because an upper bound of ρUV|W can be calculated explicitly under the copula gaussian distribution. To our knowledge this bound has not been recorded in the literature, so it is of interest in its own right.
Proposition 1
Suppose X has a gaussian copula distribution with copula functions f1, …, fp. Without loss of generality, suppose E[fi(Xi)] = 0, var[fi(Xi)] = 1. Let
and RVW = cov(V, W), RWW = var(W), RWU = cov(W, U), ρUV = cor(U, V). Then
| (5) |
where A⊙α is the α-fold Hadamard product A ⊙ ··· ⊙ A and |·| denotes the determinant.
Using this proposition we conduct the following numerical investigation. First we generate a positive definite random matrix C with C12 = C21 = 0 as follows. We generate C̃ from the Wishart distribution W(Ip, k), where k is the degrees of freedom, and then set C̃12 and C̃21 to 0. This does not guarantee C̃ to be positive definite. If it is not, we repeat this process until we get a positive definite matrix, which is then set to be C. Let RXX = [diag(C)]1/2C−1[diag(C)]1/2. Let RWU, RWV, and RWW be the (p − 2) × 1, (p − 2) × 1, and (p − 2) × (p − 2) matrices
and let . Note that, if X ~ N (0, RXX), then it is guaranteed that U ⫫ V|W, and var(Xi) = 1.
We choose the degrees of freedom k to be relatively small to prevent C/k from converging to Ip, but large enough to avoid singularity. In particular, we take k = ⌈5p/4⌉, ⌈6p/4⌉, …, ⌈8p/4⌉, which guarantees a wide range of covariance matrices. We take p = 20, 40, …, 100. We compute τUV|W by taking the maximum of the first 10 terms in (5), which in our simulations is always the global maximum. For each combination of (p, k), we generate 500 random matrices and present the averages of τUV|W in Table 1. Because ρUV|W = 0 iff U ⫫AV|W, the small numbers in Table 1 indicate that for all practical purposes we can treat U ⫫ V|W ⇒ U ⫫AV|W as valid under the gaussian copula model, even though it is not a mathematical fact.
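A sketch of the matrix-generating step of this investigation is given below (scipy's wishart provides the W(Ip, k) draws; the computation of τUV|W itself requires display (5), which is not reproduced here, so the sketch stops at RXX).

```python
import numpy as np
from scipy.stats import wishart

def generate_RXX(p, k, rng):
    """Draw C from W(I_p, k), zero the (1,2)/(2,1) entries, redraw until the
    result is positive definite, then rescale as described in the text.  The
    precision matrix of the resulting R keeps a zero (1,2) entry, so that
    X ~ N(0, R) satisfies U = X1 conditionally independent of V = X2 given
    W = X_{3:p}."""
    while True:
        C = wishart.rvs(df=k, scale=np.eye(p), random_state=rng)
        C[0, 1] = C[1, 0] = 0.0
        if np.all(np.linalg.eigvalsh(C) > 0):   # accept only a positive definite C
            break
    d = np.sqrt(np.diag(C))
    return np.diag(d) @ np.linalg.inv(C) @ np.diag(d)

rng = np.random.default_rng(1)
for p in (20, 40):
    k = int(np.ceil(5 * p / 4))
    R = generate_RXX(p, k, rng)
    # the (1,2) entry of the precision of R is zero up to rounding error
    print(p, np.linalg.inv(R)[0, 1])
```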
Table 1.
Average values of τUV|W under U ⫫ V|W
| p | k = ⌈5p/4⌉ | k = ⌈6p/4⌉ | k = ⌈7p/4⌉ | k = ⌈8p/4⌉ |
|---|---|---|---|---|
| 20 | 0.00310 | 0.00270 | 0.00114 | 0.00087 |
| 40 | 0.00472 | 0.00240 | 0.00095 | 0.00051 |
| 60 | 0.00485 | 0.00185 | 0.00069 | 0.00024 |
| 80 | 0.00435 | 0.00183 | 0.00053 | 0.00018 |
| 100 | 0.00483 | 0.00125 | 0.00036 | 0.00018 |
In the same way that it is related to the gaussian copula graphical models, the ASG is also related to the more general transelliptical graphical models recently introduced by Liu, Han, and Zhang (2012). A random vector X is said to have a transelliptical distribution with shape parameter Σ (a p × p positive definite matrix) if there exist injections f1, …, fp from ΩX to ℝ such that the distribution of U = (f1(X1), …, fp(Xp))⊤ depends only on u⊤Σu. Furthermore, X is said to follow a transelliptical graphical model with respect to a graph 𝓖 = (Γ, E) iff X has a transelliptical distribution and E = {(i, j) : (Σ−1)ij ≠ 0}. In this case we write X ~ TEG(𝓖). The following corollary states that ASG reduces to the transelliptical graphical model if the additive families 𝓐Xi are chosen to be span(fi) and the underlying distribution of X is assumed to be transelliptical. Its proof is parallel to that of Theorem 3 and is omitted.
Corollary 1
Suppose
X has a transelliptical distribution with respect to a positive definite shape parameter Σ and injections f1, …, fp;
𝓐Xi = span(fi).
Then X ~ ASG(𝓖) iff X ~ TEG(𝓖).
4 Additive conditional covariance operator
In this and the next sections we introduce two linear operators that will give rise to two estimation procedures for ASG. The first is the additive conditional covariance operator (ACCO). Consider the bilinear form 𝓐U × 𝓐V → ℝ, (f, g) ↦ E[f(U)g(V)].
By the Riesz representation theorem, this induces a bounded linear operator ΣUV : 𝓐V → 𝓐U such that 〈f, ΣUVg〉 = E[f(U)g(V)]. We call the operator
ΣUV|W = ΣUV − ΣUW ΣWV (6)
the additive conditional covariance operator from V to U given W. This operator is different from the conditional variance operator introduced by Fukumizu, Bach, and Jordan (2009), because of the additive nature of the functional spaces 𝓐U, 𝓐V, and 𝓐W. In particular, the following identity no longer holds:
In the following, for a Hilbert space 𝓗, a subspace 𝓛 of 𝓗, and an operator T defined on 𝓗, we use T|𝓛 to denote the operator T restricted on 𝓛. That is, T|𝓛 is a mapping from 𝓛 to 𝓗 with (T|𝓛)f = Tf for all f ∈ 𝓛. The following lemma gives a relation between the projection operator and the covariance operator.
Lemma 1
Let P𝓐W: 𝓐X → 𝓐W be the projection on to 𝓐W. Then P𝓐W|𝓐U = ΣWU.
The next theorem shows that ΣUV|W = 0 iff U ⫫AV|W, which is the basis for our first ASG estimator. This resembles the relation between the conditional covariance operator and conditional independence; that is, if CUV|W is the conditional covariance operator in Fukumizu, Bach, and Jordan (2009), then CUV|W = 0 iff U ⫫ V|W.
Theorem 5
U⫫AV|W iff ΣUV|W = 0.
5 Additive precision operator
We now introduce the second linear operator – the additive precision operator (APO), and show that a statement resembling the equivalence of (1) and (2) can be made for ASG if we replace the precision matrix by APO. As a result, the estimation of the ASG model can be turned into the sparse estimation of APO.
As a preparation, we first develop the notion of a matrix of operators on a direct-sum functional space. Let 𝓐⊕X be the Hilbert space of functions consisting of the same set of functions as 𝓐X but with the inner product 〈f, g〉𝓐⊕X = Σi=1p E[fi(Xi)gi(Xi)], where f = f1 + ··· + fp and g = g1 + ··· + gp with fi, gi ∈ 𝓐Xi.
Suppose for each i, j we are given a linear operator TXjXi: 𝓐Xi → 𝓐Xj. Then we can construct a linear operator T : 𝓐⊕X → 𝓐⊕X as follows
| (7) |
This construction is legitimate because in the additive case f1 + ··· + fp uniquely determines (f1, …, fp). In fact, if we fix (c2, …, cp) in ΩX2 × ··· × ΩXp and vary x1 in ΩX1, then using E[f1(X1)] = 0 we have
Conversely, a linear operator T : 𝓐⊕X → 𝓐⊕X uniquely determines operators TXjXi : 𝓐Xi → 𝓐Xj that satisfy (7), as follows. For each fi ∈ 𝓐Xi, Tfi can be represented as h1 + ··· + hp, where hj ∈ 𝓐Xj. Moreover, h1 + ··· + hp uniquely determines (h1, …, hp), and hence also hj. We define TXjXi : 𝓐Xi → 𝓐Xj by TXjXi(fi) = hj. Thus, there is a one-to-one correspondence between T and {TXjXi} and they are related by (7). Henceforth we denote this one-to-one correspondence by the equality T = {TXjXi : i, j = 1, …, p}.
Moreover, T is bounded iff each TXjXi is bounded, as shown in the next proposition.
Proposition 2
The following statements are equivalent
TXjXi ∈ 𝓑 (𝓐Xi, 𝓐Xj) for i, j = 1, …, p;
T ∈ 𝓑 (𝓐⊕X, 𝓐⊕X).
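At the sample level all of these operators become finite matrices, and the correspondence T = {TXjXi} is then ordinary block-matrix bookkeeping. The small numpy illustration below only demonstrates this indexing; the blocks are arbitrary random matrices and are not part of any estimator defined later.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 3, 4    # p components; each coordinate block is m-dimensional here

# blocks[j][i] plays the role of T_{X_j X_i}
blocks = [[rng.standard_normal((m, m)) for i in range(p)] for j in range(p)]
T = np.block(blocks)                      # the assembled operator

# f = f_1 + ... + f_p is represented by stacking the coordinates of the f_i
f = [rng.standard_normal(m) for i in range(p)]
lhs = T @ np.concatenate(f)

# the j-th block of Tf is sum_i T_{X_j X_i} f_i, in the spirit of the construction above
rhs = np.concatenate([sum(blocks[j][i] @ f[i] for i in range(p)) for j in range(p)])
print(np.allclose(lhs, rhs))              # True
```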
Our construction of T = {TXjXi} echoes the construction of Bach (2008). However, our construction is relative to a different inner product and will play a new role under ACI. We are now ready to introduce the additive precision operator.
Definition 4
The operator ϒ : 𝓐⊕X → 𝓐⊕X defined by the matrix of operators {ΣXiXj: i ∈ Γ, j ∈ Γ} is called the additive covariance operator. If it is invertible then its inverse Θ is called the additive precision operator.
It can be easily checked that, for any f, g ∈ 𝓐⊕X,
〈f, ϒg〉𝓐⊕X = E[f(X)g(X)] = 〈f, g〉. (8)
This relation also shows why we used 𝓐⊕X instead of 𝓐X to define ϒ in the first place: if we had used 𝓐X, then ϒ would have satisfied ϒf = f for any f ∈ 𝓐X; that is, ϒ would have been the identity mapping from 𝓐X to 𝓐X. In the meantime, we note that, by construction, ϒXiXj = ΣXiXj for all i, j = 1, …, p.
We should note that it is generally unreasonable to assume that the inverse of a covariance operator is a bounded operator, because we typically assume that the sequence of eigenvalues of such operators tends to 0. However, in our construction, the inner product 〈fi, fi〉𝓐⊕X is the marginal variance var[fi(Xi)]. Thus our covariance operator ϒ is more like a correlation than a covariance. In particular, the ϒii are identity operators by construction. If we assume that ϒij is compact for all i ≠ j, then ϒ is of the form “identity plus compact perturbation”. Such operators, if invertible, are bounded from below by cI for some c > 0 (Bach, 2008), and therefore have bounded inverses. We summarize this fact in the following proposition.
Proposition 3
Suppose ϒ is invertible and, for each i ≠ j, ϒij is compact. Then ϒ−1 is bounded.
To establish the relation between the precision operator Θ and ACI, we need the block inversion formulas for operators. For any subvector U of X, let V denote Uc, and define the matrices of operators
From these definitions it follows that TUU 𝓐⊕U ⊆ 𝓐⊕U, and
The same can be said of TVV, TUV, and TVU. The following operator identities resemble the matrix identities derived from block-matrix inverse.
Lemma 2
Suppose that T ∈ 𝓑 (𝓐⊕X, 𝓐⊕X) is a self adjoint, positive definite operator. Then the operators TUU, TVV, TUU − TUV TVV−1 TVU, and TVV − TVU TUU−1 TUV
are bounded, self adjoint, and positive definite, and the following identities hold
| (9) |
where (T−1)UU and (T−1)VV are the blocks of T−1 corresponding to 𝓐⊕U and 𝓐⊕V.
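The display for (9) is not recoverable from this version of the text. For a positive definite block operator, the identities that the subsequent proofs manipulate are presumably the standard Schur-complement (block-inverse) formulas, recorded here for reference:

```latex
\begin{aligned}
(T^{-1})_{UU} &= \bigl(T_{UU} - T_{UV}\,T_{VV}^{-1}\,T_{VU}\bigr)^{-1}, &
(T^{-1})_{UV} &= -\,(T^{-1})_{UU}\,T_{UV}\,T_{VV}^{-1},\\
(T^{-1})_{VV} &= \bigl(T_{VV} - T_{VU}\,T_{UU}^{-1}\,T_{UV}\bigr)^{-1}, &
(T^{-1})_{VU} &= -\,(T^{-1})_{VV}\,T_{VU}\,T_{UU}^{-1}.
\end{aligned}
```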
The above lemma shows that if ϒ has a bounded inverse, then ΘUV is a well-defined operator in 𝓑 (𝓐⊕V, 𝓐⊕U). Moreover, it behaves just like the off-diagonal block of the inverse of a covariance matrix. As we will see in the Appendix, manipulating the blocks of ϒ and Θ in the same way we do with a gaussian covariance matrix leads to the following equivalence.
Theorem 6
Suppose (U, V, W) is a partition of X and ϒ is invertible. Then U ⫫A V|W iff ΘUV = 0.
This relation is parallel to the equivalence between (1) and (2) in the gaussian case, which means we can turn estimating ASG into estimating the locations of the zero entries of APO, as shown in the next corollary.
Corollary 2
A random vector X follows an ASG(𝓖) model if and only if ΘXiXj = 0 whenever (i, j) ∉ E.
Note that the ASG(𝓖) model is based entirely on the pairwise additive conditional independence relations in (4), which is similar to the pairwise Markov property for classical graphical models based on conditional independence. In the classical setting, the pairwise Markov property with respect to a graph 𝓖 = (Γ, E) means
Under fairly mild conditions, this property also implies the global Markov property, which means if A, B, C are subsets of Γ and C separates A and B in 𝓖, then
See Lauritzen (1996, Chapter 3). It is of theoretical interest to see if the parallel implication holds true in our additive setting. For convenience, let us agree to call property (4) the pairwise additive Markov property with respect to 𝓖. We are interested in whether XA ⫫AXB|XC holds whenever C separates A and B according to the graph 𝓖 determined by (4), a property that can be appropriately defined as the global additive Markov property with respect to 𝓖.
Using Theorem 6 we can easily derive a partial answer to this question, as follows. Suppose {A, B, C} is a partition of Γ and C separates A and B according to 𝓖 defined by (4). Then all paths between A and B intersect C. In other words, the blocks ΘAB and ΘBA of Θ consist of zero operators. Hence, by Theorem 6, this is equivalent to XA ⫫AXB|XC. We will leave it for future research to establish whether this statement is true when {A, B, C} is not necessarily a partition.
In the next two sections we introduce two estimators of the ASG model, one based on the ACCO introduced in the last section, and one based on the APO.
6 The ACCO-based estimator
By Theorem 5, it is reasonable to determine the absence of edges by smallness of the sample estimate of ||ΣXiXj|X−(i,j)||. To develop this estimate, we first derive the finite-sample matrix representation of the operator ΣXiXj|X−(i,j).
For each i = 1, …, p, let κi : ΩXi × ΩXi → ℝ be a positive definite kernel. Let
where . Let In be the n × n identity matrix and 1n be the n-dimensional vector with all its entries equal to 1. Let Q be the centering projection matrix In − n−11n1n⊤. Let κ˚i denote the vector-valued function
Then, any f ∈ 𝓐Xi can be expressed as for some [f] ∈ ℝn, which is called the coordinate of f in the spanning system κ˚i. A coordinate system is relative to a class of functions, and sometimes we need to use symbols such as [f]𝓐Xi for clarity. But in most of the discussions below this is unnecessary. In this coordinate notation we have , where . Also, for any f, g ∈ 𝓐Xi,
Let K−(i,j) denote the n(p − 2) × n matrix obtained by removing the ith and jth blocks of the np × n matrix (K1, …, Kp)⊤. Then, for any f ∈ 𝓐X−(i,j) we have
| (10) |
for some [f] ∈ ℝn(p−2). Finally, for any linear operator T defined on a finite dimensional Hilbert space 𝓗, there is a matrix [T] such that [T f] = [T][f] for each function f ∈ 𝓗. The matrix [T] is called the matrix representation of T. This notation system is adopted from Horn and Johnson (1985) and was used in Li, Artemiou, and Li (2011), Li, Chun, and Zhao (2012), and Lee, Li, and Chiaromonte (2013).
Lemma 3
The following relations hold:
| (11) |
Because KiQKi and are singular, we solve the equations in (11) by Tychonoff regularization:
| (12) |
Note that we use two different tuning constants ε1 and ε2, because the matrices involved have different dimensions. We will discuss how to choose these constants in Section 8. The matrix representation of ΣXjXi|X−(i,j) is then
| (13) |
By definition, ||ΣXjXi|X−(i,j)||2 is the maximum of
subject to 〈f, f〉𝓐Xi = [f]⊤ KiQKi[f] = 1. To solve this generalized eigenvalue problem, let v = (KiQKi)1/2[f]. Then, using inverse Tychonoff regularization, [f] = (KiQKi + ε1In)−1/2v. So it is equivalent to maximizing
subject to v⊤ v = 1, which implies that ||ΣXiXj|X−(i,j)||2 is the largest eigenvalue of
Since the factor KjQKj[ΣXjXi|X−(i,j)] in the right-hand side is of the form
it is reasonable to replace the first factor KjQKj by KjQKj + ε1In, so that ΛXiXj becomes the quadratic form , where
| (14) |
Hence, at the sample level, ||ΣXiXj|X−(i,j)|| is the largest singular value of ΔXiXj. Because , we have the symmetry ||ΣXiXj|X−(i,j)|| = ||ΣXjXi|X−(i,j)||.
The above procedure involves inversion of , which is a large matrix if the graph 𝓖 contains many nodes. Nevertheless, as the next proposition shows, the actual amount of computation is not large — it is essentially that of the spectral decomposition of an n × n matrix. Let diag(A1, …, Ap) be the block-diagonal matrix with diagonal blocks A1, …, Ap, and let V be a matrix in ℝs×t, with s > t, which has singular value decomposition
| (15) |
where L1 ∈ ℝs×u, L0 ∈ ℝs×(s−u), R1 ∈ ℝt×u, R0 ∈ ℝt×(t−u), and D ∈ ℝu×u is a diagonal matrix with diagonal elements δ1 ≥ ··· ≥ δu > 0.
Proposition 4
If the conditions in the last paragraph are satisfied, then
;
.
The above proposition allows us to calculate the inverse of the s × s matrix VV⊤ + εIs essentially through the spectral decomposition of the t × t matrix V⊤V. We now summarize the estimation procedure as an algorithm. Suppose we observe an i.i.d. sample X1, …, Xn from X, where each observation is a p-dimensional random vector.
Standardize each component of X marginally. That is, for each i = 1, …, p, we standardize the observations of Xi so that EnXi = 0 and varnXi = 1.
Choose a positive definite kernel κ: ℝ × ℝ → ℝ, such as the gaussian radial kernel. Choose the kernel parameter γ using, for example, the procedure in Li, Chun, and Zhao (2012). Compute the kernel matrices Ki and K−(i,j).
Compute ΔXiXj in (14), using part 2 of Proposition 4 to compute the matrix , with V = K−(i,j)Q, s = n(p − 2), t = n, and u = n − 1. The tuning constants ε1 and ε2 are determined using the method described in Section 8.
For each i > j, compute the largest singular value of ΔXiXj. If it is smaller than a certain threshold, declare that there is no edge between i and j.
We emphasize again that throughout this procedure, the largest matrix whose spectral decomposition is needed is of dimension n × n, regardless of the dimension p of X.
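Since displays (12)–(14) are not reproduced above, the sketch below covers only the generic ingredients of steps 1–3: marginal standardization, the centered Gram matrices KiQ, and a Woodbury-type identity that delivers the computational saving described around Proposition 4. Whether this identity is exactly the statement of the proposition is an assumption; the function names and toy data are purely illustrative.

```python
import numpy as np

def rbf_gram(x, gamma):
    """n x n Gram matrix {kappa(x_a, x_b)} for the gaussian RBF kernel."""
    return np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)

def regularized_inverse(V, eps):
    """(V V^T + eps I_s)^{-1} computed from the t x t matrix V^T V
    (a Woodbury-type identity), so only a small solve is needed when t << s."""
    s, t = V.shape
    small = np.linalg.inv(V.T @ V + eps * np.eye(t))
    return (np.eye(s) - V @ small @ V.T) / eps

rng = np.random.default_rng(0)
n, p = 50, 5
gamma = 1.0 / (2 * p)                               # the RBF parameter used in Comparison 1
X = rng.standard_normal((n, p))

# step 1: marginal standardization
X = (X - X.mean(axis=0)) / X.std(axis=0)

# step 2: kernel matrices K_i and the centering projection Q
Q = np.eye(n) - np.ones((n, n)) / n
K = [rbf_gram(X[:, i], gamma) for i in range(p)]

# step 3 (partial): the centered Gram matrices K_i Q enter the regularized
# formulas (12)-(14); here we only illustrate the cheap inversion device.
KQ_stack = np.vstack([Ki @ Q for Ki in K])          # an (np) x n matrix like K_{1:p}Q
inv_big = regularized_inverse(KQ_stack, eps=1e-3)   # (np) x (np), via an n x n solve
print(inv_big.shape)
```

The point of the inversion device is exactly the one emphasized in the text: no matrix larger than n × n ever needs to be factorized.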
7 The APO-based estimator
Our second estimator is based on Corollary 2, which states that Xi ⫫A Xj|X−(i,j) iff ΘXjXi = 0. Thus we can determine the absence of an edge between i and j if the sample version of ||ΘXjXi|| is below a threshold. We first derive the matrix representations of the operator ϒ and its inverse Θ. Let K1:p = (K1, …, Kp)⊤.
Theorem 7
The matrix representation of ϒ satisfies the following equation:
| (16) |
The matrix [Θ] is obtained by solving (16) for [ϒ]−1 with Tychonoff regularization:
| (17) |
To find ||ΘXjXi||, we maximize
subject to 〈fi, fi〉𝓐Xi = [fi]⊤KiQKi[fi] = 1. Following a development similar to the one that leads to (14), we see that ||ΘXjXi||2 is the largest eigenvalue of
where [ΘXjXi] is the (j, i)th block of (17). Equivalently, ||ΘXjXi|| is the largest singular value of
We now summarize the algorithm for the APO-based estimator. The first two steps are the same as in the ACCO-based algorithm; the last two steps are listed below.
3′ Compute [Θ] according to (17), where the inverse is computed using part 1 of Proposition 4, with V = K1:pQ, ε = ε3, s = np, and u = n − 1. The tuning constant ε3 is determined by the method in Section 8.
4′ For each i, j = 1, …, p, i > j, compute the largest singular value of ΓXiXj. If it is smaller than a certain threshold, then there is no edge between i and j.
In concluding this section, we would like to mention that the operator norm used for both the ACCO and APO can be replaced by other norms, such as the Hilbert-Schmidt norm. We have experimented with both norms in our simulation studies (not presented), and found that their performances are similar.
8 Tuning parameters for regularization
In this section we propose cross-validation and generalized cross-validation procedures to select ε’s in the Tychonoff regularization. We need to invert three types of matrices
As before, we denote the constants in the regularized inverses as ε1, ε2, and ε3.
Since our cross-validation is based on the total error in predicting one set of functions by another set of functions, we first formulate such predictions in generic terms. Let U and V be random vectors and W = (U⊤, V⊤)⊤. Let 𝓕 = {f0, f1, …, fr} be a set of functions of U, and 𝓖 = {g0, g1, …, gs} be a set of functions of V, where g0 and f0 represent the constant function 1, which are included to account for any constant terms in prediction. Suppose we have an i.i.d. sample {W1, …, Wn} of W. Let A = {a1, …, am} ⊆ {1, …, n}, B = {b1, …, bn−m} = Ac. Let {Wa : a ∈ A} be the training set and {Wb : b ∈ B} be the testing set. Suppose is a training-set-based operator such that, for each g ∈ 𝓖, is the best prediction of g(V) based on functions in 𝓕, where ε is the tuning constant to be determined. The total error for predicting any function in 𝓖 evaluated at the testing set is
| (18) |
Now let LU be the (r + 1) × m matrix whose ith row is the vector {fi+1(Ua) : a ∈ A}, and LV be the (s + 1) × m matrix whose kth row is the vector {gk+1(Va) : a ∈ A}. Let ℓU and ℓV denote the vector-valued functions (f0, …, fr)⊤ and (g0, …, gs)⊤. Then, following an argument similar to Lee, Li, and Chiaromonte (2013), we can express (18) as
where ||·|| is the Euclidean norm. If we let GU be the (r + 1) × (n − m) matrix whose bth column is ℓU (Ub), and GV be the (s + 1) × (n − m) matrix whose bth column is ℓV (Vb), then the above can be further simplified as
where ||·|| is the Frobenius matrix norm. Other matrix norms are also reasonable. In our simulation we used the operator norm, which performed well.
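Displays (18)–(20) are not reproduced here, but the overall selection logic is an ordinary grid search over ε with training/testing splits. The schematic below uses a generic ridge-type (Tychonoff-regularized) predictor in place of the operator defined above; it conveys only the shape of the procedure, not its exact matrices.

```python
import numpy as np

def cv_select_eps(F, G, eps_grid, k=5, rng=None):
    """Schematic k-fold choice of a Tychonoff constant: each column of G is
    predicted from the columns of F by ridge regression on the training fold
    and scored by squared prediction error on the held-out fold."""
    rng = rng or np.random.default_rng(0)
    n, r = F.shape
    folds = np.array_split(rng.permutation(n), k)
    errors = []
    for eps in eps_grid:
        err = 0.0
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            B = np.linalg.solve(F[train].T @ F[train] + eps * np.eye(r),
                                F[train].T @ G[train])
            err += np.sum((G[test] - F[test] @ B) ** 2)
        errors.append(err)
    return eps_grid[int(np.argmin(errors))]

# toy usage over a coarse grid like the one mentioned in Section 9
rng = np.random.default_rng(1)
F = rng.standard_normal((60, 8))
G = F @ rng.standard_normal((8, 3)) + 0.1 * rng.standard_normal((60, 3))
eps_grid = np.array([5e-4, 5e-3, 5e-2, 5e-1, 5e0, 5e1, 5e2, 5e3])
print(cv_select_eps(F, G, eps_grid))
```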
Return now to our cross-validation problem. In the k-fold cross validation, data are divided into k subsets. At each step, one subset is used as the testing set and the union of the rest is used as the training set. Let i, j, … index the p components of X, a, b, … index the n subjects, and ν index the k steps in the k-fold cross-validation.
At the νth step, let {Xa : a ∈ A} be the training set and {Xb : b ∈ B} be the testing set. Reset κ˚i to be the ℝm-valued function . Let ℓi, ℓ(i,j), and ℓ be the vector-valued functions defined by
Let ℓ−i be the ℝn(p−1)+1-valued function obtained by removing κ˚i from ℓ, and ℓ−(i,j) be the ℝn(p−2)+1-valued function obtained by removing κ˚i and κ˚j from ℓ. Let Li denote the matrix ( ), and define L−i, L(i,j), and L−(i,j) similarly. Let Gi denote the matrix ( ), and define G−i, G(i,j), and G−(i,j) similarly.
To determine ε1, we use the functions in ℓi to predict the functions in ℓ−i at Xb for any b ∈ B. That is,
| (19) |
Repeat this process for ν = 1, …, k and form . We choose ε1 by minimizing CV1(ε1) over a grid. To determine ε2, we use the functions in ℓ−(i,j) to predict the functions in ℓ(i,j) at each Xb, b ∈ B, which leads to
| (20) |
Let . We choose ε2 by minimizing CV2(ε2) over a grid. Here, we use part 1 of Proposition 4 to calculate .
Alternatively, we can use the generalized cross validation criteria parallel to (19) and (20) to speed up computation (Wahba, 1990). Rather than splitting the data into training and testing samples, we compute the residual sum of squares from the entire data, and regularize it by the effective degrees of freedom as defined by the traces of relevant projection matrices. Specifically, the GCV-criterion parallel to (19) is
Here, Li and L−i are computed from the full data, and Gi in (19) is now replaced by Li. Similarly, the GCV-criterion parallel to (20) is
Here, again, the matrix is to be calculated by part 2 of Proposition 4 with V = L−(i,j), s = n(p − 2) + 1 and u = 1. Finally, we recommend taking ε3 to be the same as ε2. This is based on the intuition that and are of similar forms and dimensions. This choice works well in our simulation studies.
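The displays of the two GCV criteria are not reproduced above. For reference, the generic GCV criterion of Wahba (1990), of which GCV1 and GCV2 are analogues, is

```latex
\mathrm{GCV}(\varepsilon)
  \;=\;
  \frac{n^{-1}\,\bigl\|(I - S_{\varepsilon})\,y\bigr\|^{2}}
       {\bigl[\,n^{-1}\,\operatorname{tr}(I - S_{\varepsilon})\,\bigr]^{2}},
```

where Sε denotes the smoothing (“hat”) matrix of the regularized fit; in the criteria above, the traces of the relevant projection matrices play the role of tr(I − Sε).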
In our experience, the GCV criteria give tuning constants similar to those from the CV criteria but increase the computation speed severalfold. For this reason, we used the GCV criteria in our subsequent simulations and application.
9 Simulation studies
In this section we compare our ASG estimators with GGM and GCGM estimators, as proposed by Yuan and Lin (2007) and Liu et al (2013). Since our estimators are based on thresholding the norms of the operators ΣXiXj|X−(i,j) and ΘXjXi, it would be fair to compare them with the thresholded versions of GGM and GCGM. However, because L1-regularization is widely used for these estimators, we will instead compare them with the L1-regularized GGM and GCGM.
Comparison 1
We first compare the ASG estimators with GGM and GCGM when the copula assumption is violated. Consider the following models:
Model I : X1 = ε1, , X3 = ε3,
Model II :X1 = ε1, X2 = ε2, , X5 = ε5,
where ε1, …, ε5 are i.i.d. N(0, 1). The edge set for Model I is {(1, 2)}; the edge set for Model II is {(1, 3), (1, 4), (2, 4)}. The sample size is n = 100; we repeated the simulation 100 times, and the averaged ROC curves for correct estimation of the edge sets are presented in the upper panels of Figure 2. As expected, our ASG estimators perform much better than the GGM and GCGM when the copula assumption does not hold.
Figure 2.
Comparison of ACCO, APO, GGM, and GCGM. Upper panels: ROC curves for Models I and II, where the gaussian copula assumption is violated. Lower panels: ROC curves for Models III and IV, where the gaussian copula assumption holds.
For this and all the other comparisons, we have used the radial basis function (RBF) kernel with γ = 1/(2p), where γ is as it appears in the parametric form used in Li, Chun, and Zhao (2012). In Comparison 6 we give some results for alternative kernels and parameters.
Comparison 2
We now consider two models that satisfy the gaussian or copula gaussian assumption, which are therefore favorable to GGM and GCGM, to see how the ASG estimators compare with them:
Model III : X ~ N(0, Θ−1),
Model IV : (f1(X1), …, fp(Xp))⊤ ~ N(0, Θ−1),
where Θ is the 10 × 10 precision matrix with diagonal entries 1, 1, 1, 1.333, 3.010, 3.203, 1.543, 1.270, 1.554, and nonzero off-diagonal entries θ3,5 = 1.418, θ4,6 = −1.484, θ4,10 = −0.744, θ5,9 = 0.519, θ5,10 = −0.577, θ7,10 = 1.104, and θ8,9 = −0.737. For Model IV, X1, …, Xp each has a marginal gamma distribution with rate parameter 1 and shape parameter generated from unif(1, 5). The sample size is 50. The averaged ROC curves (over 100 replications) for the 4 estimators are shown in the lower panels in Figure 2. From these curves we can see that the ASG estimators are comparable to GGM and GCGM when the gaussian or copula gaussian assumption holds.
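A sketch of how data from a model of the form of Model IV can be generated is shown below; the full Θ is only partially legible above, so an arbitrary sparse positive definite precision matrix and illustrative gamma shape parameters are used instead.

```python
import numpy as np
from scipy.stats import norm, gamma

rng = np.random.default_rng(0)
p, n = 10, 50

# stand-in for the sparse precision matrix Theta of Model IV (not the one in the text)
Theta = np.eye(p)
Theta[2, 4] = Theta[4, 2] = 0.4
Theta[3, 5] = Theta[5, 3] = -0.3
Sigma = np.linalg.inv(Theta)

# gaussian copula step: (f_1(X_1), ..., f_p(X_p)) ~ N(0, Theta^{-1})
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# marginal step: give each coordinate a gamma margin with rate 1 and a
# shape parameter drawn from unif(1, 5), as described for Model IV
shapes = rng.uniform(1, 5, size=p)
U = norm.cdf(Z / np.sqrt(np.diag(Sigma)))                      # uniform margins
X = np.column_stack([gamma.ppf(U[:, i], a=shapes[i]) for i in range(p)])
print(X.shape)   # (n, p) sample whose copula is gaussian with precision Theta
```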
The tuning parameters are selected by GCV1 and GCV2 from the coarse grid (5 × 10−4, …, 5 × 103). The optimal tuning parameters do vary across repetitions, but the ballpark figures are relatively stable. For example, for Model IV, the typical values are 5 × 10−4 for ε1 and 5 × 10−1 for ε2. To evaluate the GCV criteria, in Figure 3 we compare the ROC curves using the GCV-tuned ε̂1, ε̂2 with those for ε̂1 × 10−1 and ε̂2 × 10−1 and those for ε̂1 × 10 and ε̂2 × 10. We see that smaller tuning parameters perform better at low specificity and larger tuning parameters perform better at high specificity, while the GCV tuning criteria balance these two cases.
Figure 3.
Performance of GCV under Model IV.
Comparison 3
We now investigate the performance of our additive methods under a nonadditive model and compare with a fully nonparametric method that does not rely on the additive assumption. The fully nonparametric method is based on the conditional covariance operator (CCO) introduced by Fukumizu, Bach, and Jordan (2009). Consider the model
Model V : Xi = εi, i = 1, 2, 5; X3 = X1 + ε3; X4 = (X1 + X2)2 + ε4,
where ε1, …, ε5 are i.i.d. N(0, 1). The nonadditivity is introduced through the term (X1 + X2)2.
Let μ : ℝp−2 × ℝp−2 → ℝ be a positive definite kernel and MX−(i,j) be the kernel matrix { , s = 1, …, n}. Let
Let VXiXj|X−(i,j) be the CCO and V̂XiXj|X−(i,j) be its sample estimate. Along the lines of Fukumizu, Bach, and Jordan (2009) we derive the operator norm of V̂XiXj|X−(i,j) as the largest singular value of
| (21) |
We declare no edge between nodes i and j if this value is smaller than a threshold. A derivation of this is sketched in the Appendix. The tuning parameters are determined by GCV criteria similar to those described in Section 8, except that the univariate kernel KX−(i,j) is replaced by the multivariate kernel MX−(i,j). We ran this simulation for n = 50 using ACCO, APO, GCGM, and this fully nonparametric estimator (denoted as fuku in the plot). The averaged ROC curves are shown in Figure 4. We can see that our methods work better than Fukumizu et al.'s method even in this nonadditive case.
Figure 4.
Comparison of ASG and the fully nonparametric estimator on Model V.
Comparison 4
In this study we compare the 4 estimators by applying them to the data sets from the DREAM4 Challenges at http://wiki.c2b2.columbia.edu/dream/index.php?title=D4c2, in which subnetworks are extracted from real biological networks and gene expressions are simulated from stochastic differential equations, reflecting their biological interactions. Under this simulation scenario, two genes are not guaranteed to be linearly associated. In the DREAM4 samples, p = 10, n = 31 for all five networks. In the analysis, we combined all data sets from the various experiments (wildtype, knockout, knockdown and multifactorial) except for the time-course experiments, and applied each method to the combined data sets across all five network configurations in the DREAM4 Challenges. Figure 5 shows the averaged ROC curves for the 5 subnetworks. We note that APO and ACCO perform the best for the first subnetwork, and APO performs the best in the fifth subnetwork. In the other three subnetworks there is no clear separation among the four estimators. These comparisons suggest a possible benefit of our ASG estimators for finding non-linear biological interactions.
Figure 5.
Comparison of the ROC curves by ACCO, APO, GGM, GCGM estimators on 5 subnetworks in DREAM4 Challenges.
Comparison 5
In this study we increase the sample size n and dimension p to investigate the accuracy and computational feasibility of our methods. We use Model IV in Comparison 2 with (p, n) = (200, 400) and (p, n) = (400, 400). We compare APO with the fully nonparametric method and the averaged ROC curves are given in Figure 6. For larger p and n the ACCO is more computationally expensive than APO because one has to compute ||Σ̂XiXj|X−(i,j)|| for all i, j = 1, …, p, i ≠ j. However, this problem is mitigated by APO, where the large (np × np) matrix [Θ] only needs to be calculated once.
Figure 6.
Comparison of ASG and fully nonparametric method for p = 200, n = 400 (left panel) and p = 400, n = 400 (right panel) using model IV
For p = 200 and n = 400, we selected (by GCV) ε̂1 = 5 × 10−5 and ε̂2 = 0.5. Each run with fixed ε1 and ε2 took around 900 seconds (15 minutes) on a machine with an Intel(R) Core(TM) i7-3770 CPU at 3.40GHz. For p = 400 and n = 400, we selected by GCV ε̂1 = 5 × 10−5 and ε̂2 = 0.5. Each run with fixed ε1 and ε2 took around 1900 seconds on the same computer. The GGM was not used for this simulation, since it took a very long time to get the estimates (more than a day for a single tuning parameter). We also performed 10 replications. It appears that our additive operator performs better than Fukumizu et al.'s operator in high-dimensional cases.
Comparison 6
We now investigate the behavior of different kernels using Models III and IV. We have experimented with 3 types of kernels: the radial basis function kernel, the polynomial (POLY) kernel, and the rational quadratic (RATIONAL) kernel. The specific parametric forms we used for these kernels are as they appear in Li, Chun, and Zhao (2012). For each kernel, we used two tuning parameter values: for RBF, we use γ = 1/9, 1/(9p); for POLY, we use r = 2, 3; for RATIONAL, c = 200, 400. The ROC curves for these two models are given in Figure 7. We see that, with the exception of the polynomial kernel of degree 3, the performances of these kernels are somewhat similar, with the RBF kernel with γ = 1/(9p) being the most favorable. The cubic polynomial kernel does not perform very well because it tends to amplify outlying observations, making the results unstable.
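For concreteness, commonly used one-dimensional versions of these three kernels are sketched below; whether these match the exact parameterizations of Li, Chun, and Zhao (2012) is an assumption of this sketch.

```python
import numpy as np

def rbf(x, y, gamma):
    """Radial basis function kernel."""
    return np.exp(-gamma * (x - y) ** 2)

def poly(x, y, r):
    """Polynomial kernel of degree r."""
    return (x * y + 1.0) ** r

def rational_quadratic(x, y, c):
    """Rational quadratic kernel with offset c."""
    d2 = (x - y) ** 2
    return 1.0 - d2 / (d2 + c)

# parameter settings of the kind compared in this study
print(rbf(0.3, -0.5, 1 / 9), poly(0.3, -0.5, 2), rational_quadratic(0.3, -0.5, 200.0))
```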
Figure 7.
Comparing different kernels using Model III (left panel) and Model IV (right panel).
10 Pathway analysis
We applied the four estimators to a real data set from Dimas et al (2009), in which gene expressions were measured from 85 individuals participating in the GenCord project. The same data set was also used in Chun, Zhang, and Zhao (2012). The participants were of Western European origin, and cell lines were obtained from their umbilical cords. In this analysis, we focused on one cell type, the B cells (lymphoblastoid cell lines). As in Chun et al (2012), we mapped the probes to 17,945 autosomal RefSeq genes, and used the resulting gene-level data in our analysis. We then extracted the pathway information from the KEGG database (http://www.genome.jp/kegg/), and used it for partitioning genes and for evaluating the estimators by taking the interactions in the database as true interactions.
Due to the limited sample sizes, we only used the pathways whose sizes (i.e. number of genes) are greater than 2 and less than 36. There are a total of 300 such pathways. After ignoring directionality of regulations in KEGG pathways, we computed the false positives and true positives of the estimates and computed the area under curves (AUC) of the ROCs. We then evaluated the significance of ROC curves based on the p-values of the Wilcoxon rank sum test. Figure 8 shows the AUCs of all networks by ACCO and APO. The left panel shows AUCs based on ACCO, with significant values highlighted by green circles. The right panel shows those based on APO, with significant values highlighted by blue circles. In each case significance means the p-value is less than 10−4. We see that ACCO yields a larger number of significant ROCs than GGM. To reconfirm this pattern we computed the number of significant ROC curves by ACCO and GGM for three different significance levels, as presented in Table 2, which again shows that the numbers of significant ROC curves by ACCO are consistently higher at all three significance levels.
Figure 8.
AUCs of the ROC curves for the 300 KEGG pathways by the ACCO (left panel) and APO (right panel) estimators; pathways with p-values less than 10−4 are highlighted.
Table 2.
Numbers of significant pathways with varying significance levels
| p-values | ACCO | APO | GCGM | GGM |
|---|---|---|---|---|
| < 10−3 | 17 | 21 | 14 | 10 |
| < 10−4 | 10 | 9 | 5 | 5 |
| < 10−5 | 8 | 7 | 2 | 2 |
To provide further insights into the comparison, we listed in Table 3, Panel I, all the pathways with significant ROCs by ACCO (p-value < 10−5) whose AUCs by ACCO are larger than those by GGM. There were a total of 6 such pathways. In Table 3, Panel II, we listed all the pathways with significant ROCs by GGM whose AUCs by GGM are larger than those by ACCO. There were a total of 2 such pathways.
Table 3.
Significant pathways whose AUCs by ACCO are larger than those by GGM (panel I) and whose AUCs by ACCO are smaller than those by GGM (panel II)
| Panel | Pathwayᵃ | p | AUC (ACCO) | AUC (APO) | AUC (GCGM) | AUC (GGM) |
|---|---|---|---|---|---|---|
| I | 1 | 35 | 0.63 | 0.59 | 0.52 | 0.53 |
| | 2 | 24 | 0.66 | 0.66 | 0.56 | 0.58 |
| | 3 | 25 | 0.61 | 0.58 | 0.54 | 0.54 |
| | 4 | 18 | 0.67 | 0.60 | 0.62 | 0.59 |
| | 5 | 18 | 0.69 | 0.63 | 0.62 | 0.62 |
| | 6 | 16 | 0.69 | 0.59 | 0.55 | 0.55 |
| II | 7 | 14 | 0.60 | 0.65 | 0.62 | 0.66 |
| | 8 | 17 | 0.73 | 0.71 | 0.64 | 0.79 |
ᵃ Labels in this column indicate the following pathways. 1: Activation of Csk by cAMP-dependent Protein Kinase Inhibits Signaling through the T Cell Receptor; 2: Cell Cycle: G2 M Checkpoint; 3: Cyclins and Cell Cycle Regulation; 4: Lck and Fyn tyrosine kinases in initiation of TCR Activation; 5: Th1 Th2 Differentiation; 6: hsa00710; 7: Proteasome Complex pathway; 8: Human Peptide GPCRs.
Some pathways were selected by more than one method. Specifically,
where #(A ∩ B) means the number of pathways found by both method A and method B. It appears that ACCO and APO yield the most similar results. Furthermore, a single pathway was found by all four methods.
From the analysis of these 300 pathways we see that the ASG estimators perform better than GGM and GCGM in a majority of cases. A plausible interpretation is that intrinsically nonlinear interactions (i.e. those that cannot be described by the gaussian copula assumption) are prevalent among genetic networks, and that using a more flexible interaction assumption such as the ASG has real benefits in modeling gene regulation.
11 Conclusions
In this paper we introduced two new estimators for graphical models based on the notion of additive conditional independence. They substantially broaden the applicability of the current graphical model estimators based on the gaussian or copula gaussian assumption, but at the same time avoid using high-dimensional kernels that can hamper estimation accuracy. As we have demonstrated by simulation and data analysis, the new methods work particularly well when the interactions exhibit intrinsic nonlinearity, which cannot be captured by the gaussian or copula gaussian models, and are comparable to existing methods when the copula gaussian assumption holds.
Our generalization of the precision matrix to the additive precision operator allows us to adopt essentially the same estimation strategy as the gaussian graphical model, at the operator level. Using this generalization we succeeded in preserving the simplicity of the gaussian dependence structure without suffering from its austere distributional restriction.
This basic idea opens up many other possibilities that cannot be explored within the scope of this paper. For example, although both the ACCO and APO estimators are based on truncating small operators (as measured by the operator norm), it is conceivable that more sophisticated sparse techniques, such as LASSO, SCAD, and adaptive LASSO, can be adapted to this setting as well. How to implement these techniques in the operator setting, and how much they would improve the truncation-based estimators presented here, would be interesting questions for future research. Moreover, the asymptotic properties of the proposed estimators, such as consistency and asymptotic distributions, also need to be developed in further studies.
Supplementary Material
Acknowledgments
Bing Li’s research was supported in part by NSF grant DMS-1106815.
Hyonho Chun’s research was supported in part by NSF grant DMS-1107025.
Hongyu Zhao’s research was supported in part by NIH grants R01-GM59507, P01-CA154295, and NSF grant DMS-1106738.
We greatly appreciate the thorough and thought-provoking reviews by three referees and an Associate Editor, which contain many useful suggestions and led to substantial improvements of this work.
Appendix: proofs
We first state without proof several simple geometric facts that will be useful for proving Theorems 1 and 2. Let P𝓛 be the projection on to a subspace 𝓛 of L2(PX), U and W be subvectors of X. Then
| (22) |
| (23) |
| (24) |
Proof of Theorem 1
1 ⇔ 2 follows from relation (22); 2 ⇔ 3 follows from relation (24); 6 ⇔ 7 follows from (23); 4 ⇔ 5 follows from the relation (23).
2 ⇒ 4. By statement 2, P𝓐V∪W ⊖ 𝓐W (𝓐U∪W ⊖ 𝓐W) = {0}. By relation (24) and the fact that 𝓐V∪W ⊖ 𝓐W ⊥ 𝓐W, we have
4 ⇒ 6. This is because
where the first equality follows from (22); the second from assertion 4; the third from 𝓐W ⊥ 𝓐V∪W ⊖ 𝓐W.
6 ⇒ 2. Because ker(P𝓐V∪W ⊖ 𝓐W) = [ran(P𝓐V∪W ⊖ 𝓐W)]⊥ = (𝓐V∪W ⊖ 𝓐W)⊥, we have
which implies 𝓐U∪W ⊖ 𝓐W ⊥ 𝓐V∪W ⊖ 𝓐W because 𝓐U∪W ⊖ 𝓐W ⊆ 𝓐U∪W.
Proof of Theorem 2
Axiom 1 is obvious. To prove 2, 3, 4, let U, V, W, R be disjoint sub-vectors of X and 𝓐U, 𝓐V, 𝓐W, 𝓐R be their respective additive families.
2. If U ⫫A V ∪ R|W, then
Axiom 2 follows because (𝓐V + 𝓐W) ⊖ 𝓐W ⊆ (𝓐V∪R + 𝓐W) ⊖ 𝓐W.
3. By Theorem 1, part 3, U ⫫A V ∪ R|W is equivalent to
| (25) |
By Axiom 2, we have U ⫫A R|W, which, by Theorem 1, part 5, implies P𝓐W f = P𝓐W∪R f for all f ∈ 𝓐U. Hence the left hand side of (25) can be rewritten as (I − P𝓐W∪R) 𝓐U. Meanwhile, the right hand side of (25) contains (I − P𝓐W)𝓐V. Hence
| (26) |
Because 𝓐W ⊆ 𝓐W∪R, we have ran(I − P𝓐W∪R) ⊥ ran(P𝓐W∪R). Hence
| (27) |
Because
relation (26) and the second expression in (27) imply the desired relation
4. By Theorem 1, part 3, U ⫫AV|W and U ⫫AR|W ∪ V are equivalent to
| (28) |
We need to show (I − P𝓐W) 𝓐U ⊥ (I − P𝓐W)(𝓐V + 𝓐R), which, by the first relation in (28), is implied by
| (29) |
Because U ⫫A V|W, the left side of (29) is (I − P𝓐V∪W) 𝓐U. By (23), the right side of (29) is (I − P𝓐V∪W + P𝓐V∪W ⊖ 𝓐W) 𝓐R. So, under (28), relation (29) is equivalent to
Since (I − P𝓐V∪W) 𝓐U ⊥ (I − P𝓐V∪W) 𝓐R, it suffices to show that
| (30) |
which is true because
and (I − P𝓐V∪W)(I − P𝓐W) P𝓐V∪W = 0.
Proof of Theorem 3
The statement U ⫫A V|W means
| (31) |
Let A, B, and C be the sets of indices for U, V, and W and h(W) be the vector {fk(Xk) : k ∈ C}. By condition 3, f and g can be written as Σi∈A αifi and Σj∈B βj fj for some α = (α1, …, αr)⊤ and β = (β1, …, βs)⊤. Since (f(U), h(W)) is jointly gaussian, the conditional expectation E[f(U)|h(W)] belongs to 𝓐W. For the same reason E[g(V)|h(W)] belongs to 𝓐W. It follows that P𝓐Wf = E[f(U)|h(W)], P𝓐Wg = E[g(V)|h(W)], and
where the second equality holds because the random vector (f(U), g(V), h(W)) is multivariate gaussian. Consequently, (31) is equivalent to f(U) ⫫ g(V)|h(W) for all f ∈ 𝓐U and g ∈ 𝓐V, which is equivalent to
Because f1, …, fp are injective, this is equivalent to U ⫫ V|W.
Proof of Proposition 1
Let f ∈ L2(PU) and g ∈ L2(PV) be any functions with ||f|| = ||g|| = 1. Because P𝓐W is idempotent and self adjoint, we have
For each γ = 1, …, p − 2, let { : α = 1, 2, …} be the Hermite polynomials in the variable wγ. Similarly, let { : α = 1, 2, …} and { : α = 1, 2, …} be the Hermite polynomials in the variables u and v. By Lancaster (1957), these polynomials form orthonormal bases in L2(PWγ), L2(PU), and L2(PV). It follows that the set { : α = 1, 2, …, γ = 1, …, p − 2} is an orthonormal basis for 𝓐W. Moreover, by Lancaster (1957), whenever α ≠ β,
So, if we let , then . Each term in this sum is
By Lancaster (1957) again,
which yield , where . Hence
Using this we compute
By the similar calculation we found . Hence
Finally, by the Cauchy-Schwarz inequality,
as desired.
Proof of Lemma 1
For any f ∈ 𝓐U and g ∈ 𝓐W, we have
| (32) |
The first term on the right-hand side is
Hence 〈(P𝓐W ∣ 𝓐U − ΣWU)f,g〉𝓐W = 0 for all f ∈ 𝓐U and g ∈ 𝓐W.
Proof of Theorem 5
If U ⫫A V|W, then, for any f ∈ 𝓐U, g ∈ 𝓐V,
| (33) |
where the second equality follows from the definition of a projection. Using Lemma 1, we calculate the two terms on the right as
Hence (33) is equivalent to 〈f, (ΣUV − ΣUW ΣWV)g〉 = 0, which implies ΣUV − ΣUW ΣWV = 0. The arguments can obviously be reversed.
Proof of Proposition 2
1 ⇒ 2. For any f ∈ 𝓐⊕X with ∣∣f∣∣𝓐⊕X ≤ 1, we have, by the triangle inequality,
2 ⇒ 1. For any fk ∈ 𝓐Xk and ∣∣fk∣∣𝓐k ≤ 1, we have
which implies
Proof of Lemma 2
First, we show that TUU (and hence also TVV) is bounded, self adjoint, and positive definite. Let h ∈ 𝓐⊕U. Then TUU h ∈ 𝓐⊕U. Hence
which implies that ∣∣TUUh∣∣𝓐⊕X is either 0 or no greater than ||T|| ||h||. Hence TUU is bounded. That TUU is self adjoint follows from 〈g, TUUh〉𝓐⊕X = 〈g, Th〉𝓐⊕X for all g, h ∈ 𝓐⊕U. Because T > 0, if h ∈ 𝓐⊕U and h ≠ 0, then 〈h, Th〉𝓐⊕X > 0. Because 〈h, Th〉𝓐⊕X = 〈h, TUUh〉𝓐⊕X we have 〈h, TUUh〉𝓐⊕X > 0. Thus TUU > 0.
That and are bounded and self adjoint can be proved similarly. To show that they are positive definite and the subsequent identities, let IUU be the identity operator from 𝓐⊕U to 𝓐⊕U, and OUV be the zero operator from 𝓐⊕V to 𝓐⊕U. Define IVV and OVU similarly. Then
Hence
| (34) |
| (35) |
By part 1, TVV is invertible. Hence, by (35),
| (36) |
Substitute this into (34) to obtain
| (37) |
which implies that TUU − TUV TVV−1 TVU is invertible and the first two identities in (9) hold. Similarly, we can prove that TVV − TVU TUU−1 TUV is invertible and the last two identities in (9) hold.
Proof of Theorem 6
Let A, B, C be the index sets of U, V, W, Z = U ∪ V, and
These operators are different from the operators such as ΣUV in Section 4 because they are defined by different inner products. For example, ΣUU is the identity mapping but ϒUU is not. By Lemma 2, . Because
we have, by Lemma 2 again, . This can be rewritten as
| (38) |
Therefore, ΘUV = 0 iff .
Now let f ∈ 𝓐U, g ∈ 𝓐V. Then, by the definition of a projection,
The first term on the right is 〈f, ϒUVg〉𝓐⊕X. Because, for any h ∈ 𝓐W,
we have . Hence
That is, we have shown that
which means iff U ⫫A V|W.
Proof of Lemma 3
The proofs of the three equalities are similar, so we only prove the last one. By (10), for any f ∈ 𝓐X−(i,j), g ∈ 𝓐Xi, we have
Alternatively,
Hence the right-hand sides of the above two equalities are the same for all [f] and [g], which implies the third relation in (11).
Proof of Proposition 4
1. Write (15) as V = LΨR⊤ where L = (L1, L0), R = (R1, R0), and Ψ = diag(D, 0). Because L and R are orthogonal matrices,
| (39) |
By construction,
| (40) |
Hence the right-hand side of (39) is , which, because L is an orthogonal matrix, can be rewritten as . By (15) we have , and hence L1 = VR1D−1. Substitute this into the right-hand side to prove part 1.
2. Multiply both sides of (39) by V⊤ from the left and by V from the right, to obtain
Substitute RΨ⊤ = R1D and (40) into the right-hand side above to prove part 2.
Proof of Theorem 7
Let f, g ∈ 𝓐⊕X. Then f = f1 + ··· + fp and ϒg = h1 + ··· + hp where fk, hk ∈ 𝓐Xk, k = 1, …, p. Hence
| (41) |
Because
The right-hand side of (41) can be rewritten in matrix form as
| (42) |
Alternatively, the inner product in (41) can be written as
| (43) |
Since (42) is equal to (43) for all [f] and [g], equality (16) holds.
Operator norm of CCO
| (44) |
The operator norm of V̂XiXj|X−(i,j) is the maximum of 〈fi, V̂XiXj|X−(i,j) fj〉2 subject to 〈fi, fi〉 = 〈fj, fj〉 = 1. By linear algebra, this is equivalent to
| (45) |
By (44), the first expression in (45) is n−2[fi]⊤GXi QX−(i,j) (ε2)GXi QX−(i,j) (ε2)GXi [fi]. The second expression in (45) is [fi]⊤ GXi [fi] = 1. This is a generalized eigenvalue problem. To convert this to an ordinary eigenvalue problem, let . Solve this equation for [fi] with Tychonoff regularization, to obtain [fi] = (GXi + ε1In)−1/2 ϕi. The optimization problem (45) reduces to maximizing (ignoring n−2):
subject to . As we did with ACCO and APO, we use the symmetrized version
which is of the form AA⊤ with A being the matrix in (21).
Contributor Information
Bing Li, Email: bing@stat.psu.edu.
Hyonho Chun, Email: chunh@purdue.edu.
Hongyu Zhao, Email: hongyu.zhao@yale.edu.
References
- Bellman RE. Dynamic Programming. Princeton University Press; 1957.
- Bach FR. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research. 2008;9:1179–1225.
- Bickel PJ, Levina E. Covariance regularization by thresholding. Annals of Statistics. 2008;36:2577–2604.
- Chun H, Zhang X, Zhao H. Gene regulation network inference with joint sparse Gaussian graphical models. 2012. Unpublished manuscript.
- Dimas AS, Deutsch S, Stranger BE, Montgomery SB, Borel C, Attar-Cohen H, Ingle C, Beazley C, Gutierrez Arcelus M, Sekowska M, Gagnebin M, Nisbett J, Deloukas P, Dermitzakis ET, Antonarakis SE. Common regulatory variation impacts gene expression in a cell type-dependent manner. Science. 2009;325:1246–1250.
- Dawid AP. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B. 1979;41:1–31.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441.
- Fukumizu K, Bach FR, Jordan MI. Kernel dimension reduction in regression. Annals of Statistics. 2009;37:1871–1905.
- Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2009;98:1–15.
- Kelley JL. General Topology. Springer; New York: 1955.
- Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics. 2009;37:4254–4278.
- Lancaster HO. Some properties of the bivariate normal distribution considered in the form of a contingency table. Biometrika. 1957;44:289–292.
- Lauritzen SL. Graphical Models. Oxford: Clarendon Press; 1996.
- Lauritzen SL, Dawid AP, Larsen BN, Leimer HG. Independence properties of directed Markov fields. Networks. 1990;20:491–505.
- Lee K-Y, Li B, Chiaromonte F. A general theory for nonlinear sufficient dimension reduction: formulation and estimation. Annals of Statistics. 2013. In print.
- Li B, Artemiou A, Li L. Principal support vector machines for linear and nonlinear sufficient dimension reduction. Annals of Statistics. 2011;39:3182–3210.
- Li B, Chun H, Zhao H. Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association. 2012;107:152–167.
- Liu H, Han F, Zhang C. Transelliptical graphical models. In: Advances in Neural Information Processing Systems 25. Curran Associates, Inc.; 2012. pp. 800–808.
- Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High dimensional semiparametric gaussian copula graphical models. Annals of Statistics. 2013. In print.
- Liu H, Lafferty J, Wasserman L. The nonparanormal: semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research. 2009;10:2295–2328.
- Meinshausen N, Bühlmann P. High-dimensional graphs with the lasso. Annals of Statistics. 2006;34:1436–1462.
- Pearl J, Geiger D, Verma T. Conditional independence and its representations. Kybernetika. 1989;25:33–44.
- Pearl J, Paz A. Graphoids: a graph-based logic for reasoning about relevancy relations. In: Proceedings of the European Conference on Artificial Intelligence; Brighton, UK. 1986.
- Pearl J, Verma T. The logic of representing dependencies by directed acyclic graphs. In: Proceedings of the American Association of Artificial Intelligence Conference; Seattle, Washington. 1987.
- Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association. 2009;104:735–746.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Wahba G. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics. 1990.
- Xue L, Zou H. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Annals of Statistics. 2013. In print.
- Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.