On an Additive Semigraphoid Model for Statistical Networks With Application to Pathway Analysis

Bing Li; Hyonho Chun; Hongyu Zhao

doi:10.1080/01621459.2014.882842

. Author manuscript; available in PMC: 2015 Sep 21.

Published in final edited form as: J Am Stat Assoc. 2014 Sep;109(507):1188–1204. doi: 10.1080/01621459.2014.882842

On an Additive Semigraphoid Model for Statistical Networks With Application to Pathway Analysis

Bing Li ¹, Hyonho Chun ², Hongyu Zhao ³

PMCID: PMC4577248 NIHMSID: NIHMS561317 PMID: 26401064

Abstract

We introduce a nonparametric method for estimating non-gaussian graphical models based on a new statistical relation called additive conditional independence, which is a three-way relation among random vectors that resembles the logical structure of conditional independence. Additive conditional independence allows us to use one-dimensional kernel regardless of the dimension of the graph, which not only avoids the curse of dimensionality but also simplifies computation. It also gives rise to a parallel structure to the gaussian graphical model that replaces the precision matrix by an additive precision operator. The estimators derived from additive conditional independence cover the recently introduced nonparanormal graphical model as a special case, but outperform it when the gaussian copula assumption is violated. We compare the new method with existing ones by simulations and in genetic pathway analysis.

Key words and phrases: Additive conditional independence, additive precision operator, copula, conditional independence, covariance operator, gaussian graphical model, nonparanormal graphical model, reproducing kernel Hilbert space

1 Introduction

The gaussian graphical model is one of the most important tools for statistical network analysis, partly because its estimation can be neatly formulated as identifying the zero entries in the precision matrix, which turns estimation of a graph into sparse estimation of a high-dimensional precision matrix. For recent advances in this area, see Meinshausen and Buhlmann (2006), Yuan and Lin (2007), Bickel and Levina (2008), Peng, Wang, Zhou, and Zhu (2009), Guo, Levina, Michailidis, and Zhu (2009), Lam and Fan (2009).

Specifically, let X = (X¹, …, X^p)^⊤ be a p-dimensional random vector distributed as N(μ, Σ), where Σ is positive definite. Let Γ = {1, …, p}, E ⊆ {(i, j) ∈ Γ × Γ : i ≠ j}, and Θ = Σ⁻¹. Let θ_ij be the (i, j)th element of Θ. Then X follows a gaussian graphical model (GGM) with respect to the graph 𝓖 = {Γ, E} iff

X^{i} ⫫ X^{j} ∣ X^{- (i, j)}, (i, j) \notin E,

(1)

where X⁻⁽^i,j⁾ represents the set {X^k : k ≠ i, j}. Under the multivariate gaussian assumption, the above relation is equivalent to

θ_{i j} = 0, (i, j) \notin E .

(2)

Based on this equivalence the estimation of 𝓖 reduces to identification of the 0 entries in Θ, which can be done by sparse estimation procedures such as thresholding (Bickel and Levina, 2008), LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), and adaptive LASSO (Zuo, 2006).

Since the multivariate gaussian assumption limits the applicability of the GGM, there have been intense developments in generalizing it to non-gaussian cases in the past few years, chiefly through the use of nonparametric copula transformations. See Liu, Lafferty, and Wasserman (2009), Liu, Han, Yuan, Lafferty, and Wasserman (2013), and Xue and Zou (2013). The nonparametric gaussian copula graphical model (GCGM) is a flexible and effective model that relaxes the marginal gaussian assumption but preserves the inter-dependence structure of a multivariate gaussian distribution, so that the equivalence between (1) and (2) still holds in an altered form. It assumes the existence of unknown injections f₁, …, f_p such that (f₁(X¹), …, f_p(X^p))^⊤ is multivariate gaussian. The GCGM keeps the simplicity and elegance of the GGM but greatly expands its scope.

However, the gaussian copula assumption can be violated even for commonly used interactions. For example, if (U, V) is a random vector where V = U² + ε, and U and ε are i.i.d. N(0, 1), then no transformation can render (f₁(U), f₂(V)) bivariate gaussian. In fact, the copula transformations in this case are: f₁(u) = u, f₂(v) = Φ⁻¹∘F_V(v), where F_V is the convolution of N(0, 1) and $χ_{1}^{2}$ and Φ is the c.d.f. of N(0, 1). Figure 1 plots (U, Φ⁻¹∘F_V(V)) for a sample of 100 observed pairs, which shows a strong non-gaussian pattern even though U and Φ⁻¹∘F_V(V) are marginally gaussian. Under such circumstances the gaussian copula assumption no longer holds, and our goal is to develop a broader class of graphical models that cover such cases.

Violation of copula assumption. The vertical axis is Φ⁻¹∘*F_V*(V). Both U and Φ⁻¹∘*F_V*(V) are N (0, 1) but their joint distribution is strongly non-gaussian.

In developing the new graphical models we would like to preserve an appealing feature of the GCGM; that is, the nonparametric operation is acted upon one-dimensional random variables individually, rather than a high-dimensional random vector jointly. This feature allows us to avoid high-dimensional smoothing, which is the source of the curse of dimensionality (Bellman, 1957). However, if we insist on using conditional independence (1) as the criterion for constructing graphical models, then we are inevitably lead to a fully fledged nonparametric procedure involving smoothing over the entire p-dimensional space, and the nice structures of the gaussian graphical model such as (2) would be lost. This motivates us to introduce a new statistical relation — additive conditional independence, or ACI, to replace conditional independence as our criterion. Although ACI is different from conditional independence, it satisfies the axioms for a semi-graphoid (Pearl and Verma, 1987) shared by conditional independence, which captures the essence of a graph. Furthermore, under the gaussian copula assumption, ACI reduces to conditional independence. The graphical model defined by ACI — which we refer to as the Additive Semi-graphoid Model (ASG) — preserves a parallel structure to (2

The paper is organized follows. In section 2 we outline the axiomatic system for a semi-graphoid logical relation, and introduce the notion of ACI as well as the ASG model it induces. In section 3 we investigate the relation between ACI and conditional independence under the gaussian copula assumption. In sections 4 and 5, we introduce the additive conditional covariance operator and additive precision operator, and study their relations with ACI. In sections 6 and 7 we introduce two estimators for the ASG model based on these operators. In section 8 we propose cross validation procedures to determine the tuning parameters. In sections 9 and 10 we compare the ASG model with GGM and GCGM by simulation and in an applied setting of genetic pathway inference. We conclude in section 11 with suggested future research directions. All proofs are presented in the Appendix.

2 Semi-graphoid axioms and additive conditional independence

Before introducing additive conditional independence, we first restate the axiomatic system developed by Pearl and Verma (1987) and Pearl, Geiger, and Verma (1989), in a slightly different form. Let 2^Γ be the class of all subsets of Γ. We call any subset ℛ ⊆ 2^Γ × 2^Γ × 2^Γ a three-way relation on Γ. This terminology follows the tradition of two-way relations in mathematics (see, for example, Kelley, 1955, page 6).

Definition 1 (Semi-graphoid Axioms)

A three-way relation ℛ is called a semi-graphoid if it satisfies the following conditions:

(symmetry) (A, C, B) ∈ ℛ ⇒ (B, C, A) ∈ ℛ;
(decomposition) (A, C, B ∪ D) ∈ ℛ ⇒ (A, C, B) ∈ ℛ;
(weak union) (A, C, B ∪ D) ∈ ℛ ⇒ (A, C ∪ B, D) ∈ ℛ;
(contraction) (A, C ∪ B, D) ∈ ℛ, (A, C, B) ∈ ℛ ⇒ (A, C, B ∪ D) ∈ ℛ.

These axioms are extracted from conditional independence to convey the general idea of “B is irrelevant for understanding A once C is known”, or “C separates A and B”. Conditional independence is a special case of the semi-graphoid three-way relation. More specifically, let X = {Xⁱ : i ∈ Γ} be a random vector and, for any A ∈ Γ, let X^A be the {Xⁱ : i ∈ A}. Let ℛ be the class of (A, C, B) such that X^A ⫫ X^B|X^C (X^A and X^B are conditionally independent given X^C), then it can be shown that ℛ indeed satisfies the semi-graphoid axioms (Dawid, 1979).

Because, like conditional independence, a semi-graphoid relation (A, C, B) ∈ ℛ also conveys the idea that C separates A and B, it seems reasonable to use the semi-graphoid relation itself as a criterion for constructing a graph. Indeed, the very statement “there is no direct link between node i and node j” means, precisely, that “{i, j}^c separates i and j”. In other words, it is reasonable to use the logical statement ({i}, {i, j}^c, {j}) ∈ ℛ to determine the absence of edge between i and j. Using the semi-graphoid relation in place of probabilistic conditional independence can free us from some awkward restrictions that come with a fully fledged probability model.

Let X = (X¹, …, X^p)^⊤ be a random vector. For any subvector U = (U¹, …, U^r)^⊤ of X, let L₂(P_U) be the class of functions of U such that Ef(U) = 0, Ef²(U) < ∞. For each Uⁱ, let 𝓐_Uⁱ denote a subset of L₂(P_Uⁱ). Let 𝓐_U denote the additive family

A_{U^{1}} + \dots + A_{U^{r}} = {f_{1} + \dots + f_{r} : f_{1} \in A_{U^{1}}, \dots, f_{r} \in A_{U^{r}}} .

For two subspaces 𝓐 and 𝓑 of L₂(P_X), let 𝓐 ⊖ 𝓑 = 𝓐 ∩ 𝓑^⊥, where orthogonality is in terms of the L₂(P_X)-inner product. In the following, an unadorned inner product 〈·, ·〉 always represents the L₂(P_X)-inner product.

Definition 2

Let U, V, and W be subvectors of X. We say that U and V are additively conditionally independent (ACI) given W iff

(A_{U} + A_{W}) ⊖ A_{W} ⊥ (A_{V} + A_{W}) ⊖ A_{W},

(3)

where ⊥ indicates orthogonality in the L₂(P_X)-inner product. We write relation (3) as U ⫫_AV|W.

Strictly speaking, ACI should be stated with respect to the triplet of functional families (𝓐_U, 𝓐_V, 𝓐_W), but in most cases we can ignore this qualification without loss of clarity. Lauritzen (1996, page 31) mentioned a relation similar to (3) where 𝓐_U, 𝓐_V, and 𝓐_W are Euclidean subspaces, but absent from that construction are the underlying random vectors U, V, W, which is crucial to our development.

The next theorem gives seven equivalent conditions for ACI, each of which can be used as its definition. As we will see in the proof of Theorem 2, these various forms are immensely useful in exploring the ramifications of ACI. In the following, for an operator P on a Hilbert space 𝓗, let ker P and ranP denote the kernel and range of P, and let $\bar{ran} P$ be the closure of ranP.

Theorem 1

Let U, V, W be subvectors of X. The following statements are equivalent:

(𝓐_U + 𝓐_W) ⊖ 𝓐_W ⊥ (𝓐_V + 𝓐_W) ⊖ 𝓐_W;
𝓐_U∪W ⊖ 𝓐_W ⊥ 𝓐_V∪W ⊖ 𝓐_W;
(I − P_{𝓐_W}) 𝓐_U ⊥ (I − P_{𝓐_W})𝓐_V;
𝓐_U ⊆ ker(P_{𝓐_V∪W ⊖ 𝓐_W});
𝓐_U ⊆ ker(P_{𝓐_V∪W} − P_{𝓐_W});
𝓐_U∪W ⊆ ker(P_{𝓐_V∪W ⊖ 𝓐_W});
𝓐_U∪W ⊆ ker(P_{𝓐_V∪W} − P_{𝓐_W}).

The next theorem shows that ACI satisfies the semi-graphoid axioms, which gives justification of using it to construct graphical models.

Theorem 2

The additive conditional independence relation is a semi-graphoid.

We are now ready to define the additive semi-graphoid model.

Definition 3

A random vector X follows an additive semi-graphoid model with respect to a graph 𝓖 = (Γ, E) iff

X^{i} ⫫_{A} X^{j} ∣ X^{- (i, j)} \Leftrightarrow (i, j) \notin E .

(4)

We write this condition as X ~ ASG(𝓖).

3 Relation with copula graphical models

While in the last section we have seen it is reasonable to use ACI instead of conditional independence as a criterion for constructing graphs, we now investigate the special cases where ACI reduces conditional independence.

Theorem 3

Suppose

X has a gaussian copula distribution with copula functions f₁, …, f_p;
U, V, and W are subvectors of X;
𝓐_Xⁱ = span{f_i}, i = 1, …, p.

Then U ⫫_AV|W if and only if U ⫫V|W.

Under the specific form of ASG model in Theorem 3, there is no edge between i and j iff the (i, j)th entry of the precision matrix of (f₁(X¹), …, f_p(X^p))^⊤ is 0. The latter is population-level criterion used by the GCGM estimator. In this sense, ASG can be regarded as a generalization of the nonparanormal models.

Since our main interests are in the ASG models with 𝓐_Xⁱ = L₂(P_Xⁱ), it is also of interest to ask whether ⫫_A coincides with ⫫ in this case. The next theorem and the subsequent numerical investigation show that they are very nearly equivalent for all practical purposes.

Theorem 4

Suppose

X has a gaussian copula distribution with copula functions f₁, …, f_p;
U, V, and W are sub-vectors of X;

3. 𝓐_Xⁱ = L₂(P_Xⁱ).

Then U ⫫_AV|W implies U ⫫ V|W.

The proof, which is omitted, is essentially the same as that of Theorem 3. Interestingly, under 𝓐_Xⁱ = L₂(P_Xⁱ), ⫫ does not imply ⫫_A even under the gaussian copula assumption. However, the following numerical investigation suggests that this implication holds approximately, with vanishingly small error. We are able to carry out this investigation because an upper bound of

ρ_{U V ∣ W}^{2} = sup {{〈 f - P_{A_{W}} f, g - P_{A_{W}} g 〉}^{2} : f \in L_{2} (P_{U}), g \in L_{2} (P_{V}), ‖ f ‖ = ‖ g ‖ = 1}

can be calculated explicitly under the copula gaussian distribution. To our knowledge this bound has not been recorded in the literature; so it is of interest in its own.

Proposition 1

Suppose X has a gaussian copula distribution with copula functions f₁, …, f_p. Without loss of generality, suppose E[f_i(Xⁱ)] = 0, var[f_i(Xⁱ)] = 1. Let

U = f_{i} (X^{i}), V = f_{j} (X^{j}), W = {f_{k} {(X)}^{k} : k \neq i, j},

and R_VW = cov(V, W), R_WW = var(W), R_WU = cov(W, U), ρ_UV = cor(U, V). Then

ρ_{U V ∣ W} \leq max {∣ ρ_{U V}^{α} - R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} R_{W U}^{⊙ α} ∣ : α = 1, 2, \dots} \equiv τ_{U V ∣ W},

(5)

A^⊙^α is the α-fold Hadamard product A ⊙ ··· ⊙ A and |·| means determinant.

Using this proposition we conduct the following numerical investigation. First we generate a positive definite random matrix C with C₁₂ = C₂₁ = 0 as follows. We generate C̃ from the Wishart distribution W(I_p, k), where k is the degrees of freedom, and then set C̃₁₂ and C̃₂₁ to 0. This does not guarantee C̃ to be positive definite. If it is not, we repeat this process until we get a positive definite matrix, which is then set to be C. Let R_XX = [diag(C)]^1/2C⁻¹[diag(C)]^1/2. Let R_WU, R_WV, and R_WW be the (p − 2) × 1, (p − 2) × 1, and (p − 2) × (p − 2) matrices

{{(R_{X X})}_{i 1} : i = 3, \dots, p}, {{(R_{X X})}_{i 2} : i = 3, \dots, p}, {{(R_{X X})}_{i j} : i, j = 3, \dots, p},

and let $R_{U W} = R_{W U}^{⊤}, R_{V W} = R_{W V}^{⊤}$ . Note that, if X ~ N (0, R_XX), then it is guaranteed that U ⫫ V|W, and var(Xⁱ) = 1.

We choose the degrees of freedom k to be relatively small to prevent C/k from converging to I_p, but large enough to avoid singularity. In particular, we take k = ⌈5p/4⌉, ⌈6p/4⌉, …, ⌈8p/4⌉, which guarantees a wide range of covariance matrices. We take p = 20, 40, …, 100. We compute τ_UV_|_W by taking the maximum of the first 10 terms in (5), which in our simulations is always the global maximum. For each combination of (p, k), we generate 500 random matrices and present the averages of τ_UV_|_W in Table 1. Because ρ_UV_|_W = 0 iff U ⫫_AV|W, the small numbers in Table 1 indicate that for all practical purposes we can treat U ⫫ V|W ⇒ U ⫫_AV|W as valid under the gaussian copula model, even though it is not a mathematical fact.

Table 1.

Average values of τ_UV_|_W under U ⫫ V|W

p	k
p	⌈5p/4⌉	⌈6p/4⌉	⌈7p/4⌉	⌈8p/4⌉

20	0.00310	0.00270	0.00114	0.00087
40	0.00472	0.00240	0.00095	0.00051
60	0.00485	0.00185	0.00069	0.00024
80	0.00435	0.00183	0.00053	0.00018
100	0.00483	0.00125	0.00036	0.00018

Open in a new tab

In the similar way it is related to the gaussian copula graphical models, the ASG is also related to the more general transelliptical graphical models recently introduced by Liu, Han, and Zhang (2012). A random vector X is said to have a transelliptical distribution with shape parameter Σ (a p × p positive definite matrix) if there exist injections f₁, …, f_p on Ω_X to ℝ such that the distribution of U = (f₁(X¹), …, f_p(X^p))^⊤ depends only on u^⊤Σu. Furthermore, X is said to follow a transelliptical graphical model with respect to a graph 𝓖 = (Γ, E) iff X has a transelliptical distribution and E = {(i, j) : (Σ⁻¹)_ij ≠ 0}. In this case we write X ~ TEG(𝓖). The following corollary states that ASG reduces to the transelliptical graphical model if the additive families 𝓐_Xⁱ are chosen to be span (f_i) and the underlying distribution of X is assumed to be transelliptical. Its proof is parallel to that of Theorem 3 and is omitted.

Corollary 1

Suppose

X has a transelliptical distribution with respect to a positive definite shape parameter Σ and injections f₁, …, f_p;
𝓐_Xⁱ = span(f_i).

Then X ~ ASG(𝓖) iff X ~ TEG(𝓖).

4 Additive conditional covariance operator

In this and the next sections we introduce two linear operators that will give rise to two estimation procedures for ASG. The first is the additive conditional covariance operator (ACCO). Consider the bilinear form

A_{U} \times A_{V} \to ℝ, (f, g) \mapsto E [f (U) g (V)] .

By the Riesz representation theorem, this induces a bounded linear operator Σ_UV : 𝓐_V → 𝓐_U such that 〈f, Σ_UVg〈 = E[f (U)g(V)]. We call the operator

\sum_{U V ∣ W} = \sum_{U V} - \sum_{U W} \sum_{W V}

(6)

the additive conditional covariance operator from V to U given W. This operator is different from the conditional variance operator introduced by Fukumizu, Bach, and Jordan (2009), because of the additive nature of the functional spaces 𝓐_U, 𝓐_V, and 𝓐_W. In particular, the following identity no longer holds:

〈 f, \sum_{U V ∣ W} g 〉 = E [cov (f (U), g (V) ∣ W)] .

In the following, for a Hilbert space 𝓗, a subspace 𝓛 of 𝓗, and an operator T defined on 𝓗, we use T|𝓛 to denote the operator T restricted on 𝓛. That is, T|𝓛 is a mapping from 𝓛 to 𝓗 with (T|𝓛)f = Tf for all f ∈ 𝓛. The following lemma gives a relation between the projection operator and the covariance operator.

Lemma 1

Let P_{𝓐_W}: 𝓐_X → 𝓐_W be the projection on to 𝓐_W. Then P_{𝓐_W}|𝓐_U = Σ_WU.

The next theorem shows that Σ_UV_|_W = 0 iff U ⫫_AV|W, which is the basis for our first ASG estimator. This resembles the relation between conditional covariance operator and conditional independence; that is, if C_UV_|_W is the conditional covariance operator in Fukumizu, Jordan, and Bach (2009), then C_UV_|_W = 0 iff U ⫫ V|W.

Theorem 5

U⫫_AV|W iff Σ_UV_|_W = 0.

5 Additive precision operator

We now introduce the second linear operator – the additive precision operator (APO), and show that a statement resembling the equivalence of (1) and (2) can be made for ASG if we replace the precision matrix by APO. As a result, the estimation of the ASG model can be turned into the sparse estimation of APO.

As a preparation, we first develop the notion of a matrix of operators on a direct-sum functional space. Let 𝓐_⊕X be the Hilbert space of functions consisting of the same set of functions as 𝓐_X but with the inner product

{〈 f_{1} + \dots + f_{p}, g_{1} + \dots + g_{p} 〉}_{A_{\oplus X}} = \sum_{i = 1}^{p} {〈 f_{i}, g_{i} 〉}_{A_{X^{i}}} .

Suppose for each i, j we are given a linear operator T_X^jXⁱ: 𝓐_Xⁱ → 𝓐_X^j. Then we can construct a linear operator T : 𝓐_⊕X → 𝓐_⊕X as follows

T (f_{1} + \dots + f_{p}) = \sum_{i = 1}^{p} \sum_{j = 1}^{p} T_{X^{j} X^{i}} f_{i} .

(7)

This construction is legitimate because in the additive case f₁ + ··· + f_p uniquely determines (f₁, …, f_p). In fact, if we fix (c₂, …, c_p) in Ω_X² × ··· × Ω_X^p and vary x¹ in Ω_X¹, then using E[f₁(X¹)] = 0 we have

f_{1} (x^{1}) = f_{1} (x^{1}) + f_{2} {(c)}^{2} + \dots + f_{p} (c^{p}) - E [f_{1} (X^{1}) + f_{2} (c^{2}) + \dots + f_{p} (c^{p})] .

Conversely, a linear operator T : 𝓐_⊕X → 𝓐_⊕X uniquely determines operators T_X^jXⁱ : 𝓐_Xⁱ → 𝓐_X^j that satisfies (7), as follows. For each f_i ∈ 𝓐_{⊕X_i}, Tf_i can be represented as h₁ + ··· + h_p, where h_j ∈ 𝓐_X^j. Moreover, h₁ + ··· + h_p uniquely determines (h₁, …, h_p), and hence also h_j. We define T_X^jXⁱ : 𝓐_Xⁱ → 𝓐_X^j by T_X^jXⁱ(f_i) = h_j. Thus, there is a one-to-one correspondence between T and {T_X^jXⁱ} and they are related by (7). Henceforth we denote this one-to-one correspondence by the equality

T = {T_{X^{j} X^{i}} : i, j = 1, \dots, p} .

Moreover, T is bounded iff each T_X^jXⁱ is bounded, as shown in the next proposition.

Proposition 2

The following statements are equivalent

T_X^jXⁱ ∈ 𝓑 (𝓐_Xⁱ, 𝓐_X^j) for i, j = 1, …, p;
T ∈ 𝓑 (𝓐_⊕X, 𝓐_⊕X).

Our construction of T = {T_X^jXⁱ} echoes the construction of Bach (2008). However, our construction is relative to a different inner product and will play a new role under ACI. We are now ready to introduce the additive precision operator.

Definition 4

The operator ϒ : 𝓐_⊕X → 𝓐_⊕X defined by the matrix of operators {Σ_XⁱX^j: i ∈ Γ, j ∈ Γ} is called the additive covariance operator. If it is invertible then its inverse Θ is called the additive precision operator.

It can be easily checked that, for any f, g ∈ 𝓐_⊕X,

{〈 f, ϒ g 〉}_{A_{\oplus X}} = cov [f (X), g (X)] = 〈 f, g 〉 .

(8)

This relation also shows why we used 𝓐_⊕X instead of 𝓐_X to define ϒ in the first place: if we had used 𝓐_X, then ϒ would have satisfied ϒf = f for any f ∈ 𝓐_X; that is, ϒ would have been the identity mapping from 𝓐_X to 𝓐_X. In the meantime, we note that, by construction, ϒ_XⁱX^j = Σ_XⁱX^j for all i, j = 1, …, p.

We should note that it is generally unreasonable to assume that the inverse of a covariance operator to be a bounded operator, because we typically assume the sequence of eigenvalues of such operators tend to 0. However, in our construction, the inner product 〈f_i, f_i〉_{𝓐_⊕X} is the marginal variance var[f_i(Xⁱ)]. Thus our covariance operator ϒ is more like a correlation than a covariance. In particular, ϒ_ii are identity operators by construction. If we assume ϒ_ij are compact for all i ≠ j, then ϒ is of the form “identity plus compact perturbation”. Such operators, if invertible, are bounded from below by cI for some c > 0 (Bach, 2008), and therefore have bounded inverses. We summarize this fact in the following proposition.

Proposition 3

Suppose ϒ is invertible and, for each i ≠ j, ϒ_ij is compact. Then ϒ⁻¹ is bounded.

To establish the relation between the precision operator Θ and ACI, we need the block inversion formulas for operators. For any subvector U of X, let V denote U^c, and define the matrices of operators

\begin{array}{l} T_{U U} = {T_{X^{i} X^{j}} : X_{i} \in U, X_{j} \in U}, T_{U V} = {T_{X^{i} X^{j}} : X_{i} \in U, X_{j} \in V}, \\ T_{V U} = {T_{X^{i} X^{j}} : X \in i \in V, X_{j} \in U}, T_{V V} = {T_{X^{i} X^{j}} : X_{i} \in V, X_{j} \in V} . \end{array}

From these definitions it follows that T_UU 𝓐_⊕U ⊆ 𝓐_⊕U, and

{〈 f, T_{U U} g 〉}_{A_{\oplus X}} = {〈 f, T g 〉}_{A_{\oplus X}} for any f, g \in A_{\oplus U} .

The same can be said of T_VV, T_UV, and T_VU. The following operator identities resemble the matrix identities derived from block-matrix inverse.

Lemma 2

Suppose that T ∈ 𝓑 (𝓐_⊕X, 𝓐_⊕X) is a self adjoint, positive definite operator. Then: the operators

T_{U U}, T_{V V}, T_{U U} - T_{U V} T_{V V}^{- 1} T_{V U}, T_{V V} - T_{V U} T_{U U}^{- 1} T_{U V}

are bounded, self adjoint, and positive definite, and the following identities hold

\begin{array}{l} {(T^{- 1})}_{U U} = {(T_{U U} - T_{U V} T_{V V}^{- 1} T_{V U})}^{- 1}, \\ {(T^{- 1})}_{V V} = {(T_{V V} - T_{V U} T_{U U}^{- 1} T_{U V})}^{- 1}, \\ {(T^{- 1})}_{U V} = - {(T^{- 1})}_{U U} T_{U V} T_{V V}^{- 1} = - T_{U U}^{- 1} T_{U V} {(T^{- 1})}_{V V}, \\ {(T^{- 1})}_{V U} = - {(T^{- 1})}_{V V} T_{V U} T_{U U}^{- 1} = - T_{V V}^{- 1} T_{V U} {(T^{- 1})}_{U U}, \end{array}

(9)

where (T⁻¹)_UU and (T⁻¹)_VV are the blocks of T⁻¹ corresponding to 𝓐_⊕U and 𝓐_⊕V.

The above lemma shows that if ϒ has a bounded inverse, then Θ_UV is a well defined operator in 𝓑 (𝓐_⊕V, 𝓐_⊕U). Moreover, it behaves just like the off-diagonal block of the inverse of a covariance matrix. As we will see in the Appendix, manipulating the blocks of ϒ and Θ in the same way we do with a gaussian covariance matrix lead to the following equivalence.

Theorem 6

Suppose (U, V, W) is a partition of X and ϒ is invertible. Then U ⫫_A V|W iff Θ_UV = 0.

This relation is parallel to the equivalence between (1) and (2) in the gaussian case, which means we can turn estimating ASG into estimating the locations of the zero entries of APO, as shown in the next corollary.

Corollary 2

A random vector X follows an ASG(𝓖) model if and only if Θ_XⁱX^j = 0 whenever (i, j) ∉ E.

Note that the graph ASG(𝓖) is based entirely on pairwise additive conditional independence (4), which is similar to the pairwise Markov property for classical graphical models based on conditional independence. In the classical setting, pairwise Markov property with respect to a graph 𝓖 = (Γ, E) means

X^{i} ⫫ X^{j} ∣ X^{- (i, j)} \Leftrightarrow (i, j) \notin E .

Under fairly mild conditions, this property also implies the global Markov property, which means if A, B, C are subsets of Γ and C separates A and B in 𝓖, then

X^{A} ⫫ X^{B} ∣ X^{C} .

See Lauritzen (1996, Chapter 3). It is of theoretical interest to see if the parallel implication holds true in our additive setting. For convenience, let us agree to call property (4) the pairwise additive Markov property with respect to 𝓖. We are interested in whether X^A ⫫_AX^B|X^C holds whenever C separates A and B according to the graph 𝓖 determined by (4), a property that can be appropriately defined as the global additive Markov property with respect to 𝓖.

Using Theorem 6 we can easily derive a partial answer to this question, as follows. Suppose {A, B, C} is a partition of Γ and C separates A and B according to 𝓖 defined by (4). Then all paths between A and B intersects C. In other words, the block Θ_AB and Θ_BA of Θ consists of zero operators. However, by Theorem 6, this is equivalent to X^A ⫫_AX^B|X^C. We will leave it for future research to establish whether this statement is true when {A, B, C} is not necessarily a partition.

In the next two sections we introduce two estimators of the ASG model, one based on the ACCO introduced in the last section, and one based on the APO.

6 The ACCO-based estimator

By Theorem 5, it is reasonable to determine the absence of edges by smallness of the sample estimate of ||Σ_{XⁱX^j|X^−(i,j)}||. To develop this estimate, we first derive the finite-sample matrix representation of the operator Σ_{XⁱX^j|X^−(i,j)}.

For each i = 1, …, p, let κ_i : Ω_Xⁱ × ΩX_ⁱ → ℝ be a positive definite kernel. Let

A_{X^{i}} = span {κ_{i} (\cdot, X_{1}^{i}) - E_{n} κ_{i} (X^{i}, X_{1}^{i}), \dots, κ_{i} (\cdot, X_{n}^{i}) - E_{n} κ_{i} (X^{i}, X_{n}^{i})},

where $E_{n} κ_{i} (X^{i}, X_{k}^{i}) = n^{- 1} \sum_{ℓ = 1}^{n} κ_{i} (X_{ℓ}^{i}, X_{k}^{i})$ . Let I_n be the n × n identity matrix and 1_n be the n-dimensional vector with all its entries equal to 1. Let Q be the projection matrix $I_{n} - 1_{n} 1_{n}^{⊤} / n$ . Let κ˚_i denote the vector-valued function

x^{i} \mapsto {(κ_{i} (x^{i}, X_{1}^{i}) - E_{n} κ_{i} (X^{i}, X_{1}^{i}), \dots, κ_{i} (x^{i}, X_{n}^{i}) - E_{n} κ_{i} (X^{i}, X_{n}^{i}))}^{⊤} .

Then, any f ∈ 𝓐_Xⁱ can be expressed as $f = {\overset{\circ}{κ}}_{i}^{⊤} [f]$ for some [f] ∈ ℝⁿ, which is called the coordinate of f in the spanning system κ˚_i. A coordinate system is relative to a class of functions, and sometimes we need to use symbols such as [f]_{𝓐_Xⁱ} for clarity. But in most of the discussions below this is unnecessary. In this coordinate notation we have ${(f (X_{1}^{i}), \dots, f (X_{n}^{i}))}^{⊤} = {Q K}_{i} [f]$ , where $K_{i} = {κ_{i} (X_{k}^{i}, X_{ℓ}^{i}) : k, ℓ = 1, \dots, n}$ . Also, for any f, g ∈ 𝓐_Xⁱ,

{〈 f, g 〉}_{A_{X^{i}}} = n^{- 1} {({Q K}_{i} [f])}^{⊤} ({Q K}_{i} [g]) = n^{- 1} {[f]}^{⊤} K_{i} {Q K}_{i} [g] .

Let K₋₍_i_,_j₎ denote the n(p − 2) × n matrix obtained by removing the ith and jth block of the np × n matrix (K₁, …, K_p)^⊤. Then, for any f ∈ 𝓐_X^−(i,j we have

{(f (X_{1}^{- (i, j)}), \dots, f (X_{n}^{- (i, j)}))}^{⊤} = {Q K}_{- (i, j)}^{⊤} [f]

(10)

for some [f] ∈ ℝⁿ⁽^p⁻²⁾. Finally, for any linear operator T defined on a finite dimensional Hilbert space 𝓗, there is a matrix [T] such that [T f] = [T][f] for each function f ∈ 𝓗. The matrix [T] is called the matrix representation of T. This notation system is adopted from Horn and Johnson (1985) and was used in Li, Artemiou, and Li (2011), Li, Chun, and Zhao (2012), and Lee, Li, and Chiaromonte (2013).

Lemma 3

The following relations hold:

\begin{matrix} K_{i} {Q K}_{i} [\sum_{X^{i} X^{j}}] = K_{i} {Q K}_{j}, K_{i} {Q K}_{i} [\sum_{X^{i} X^{- (i, j)}}] = K_{i} {Q K}_{- (i, j)}^{⊤} \\ K_{- (i, j)} {Q K}_{- (i, j)}^{⊤} [\sum_{X^{- (i, j)} X^{i}}] = K_{- (i, j)} {Q K}_{i} . \end{matrix}

(11)

Because K_iQK_i and $K_{- (i, j)} {Q K}_{- (i, j)}^{⊤}$ are singular, we solve the equations in (11) by Tychonoff regularization:

\begin{array}{l} [\sum_{X^{i} X^{j}}] = {(K_{i} {Q K}_{i} + ε_{1} I_{n})}^{- 1} K_{i} {Q K}_{j}, \\ [\sum_{X^{i} X^{- (i, j)}}] = {(K_{i} {Q K}_{i} + ε_{1} I_{n})}^{- 1} K_{i} {Q K}_{- (i, j)}^{⊤}, \\ [\sum_{X^{- (i, j)} X^{i}}] = {(K_{- (i, j)} {Q K}_{- (i, j)}^{⊤} + ε_{2} I_{n (p - 2)})}^{- 1} K_{- (i, j)} {Q K}_{i} . \end{array}

(12)

Note that we use two different tuning constants ε₁ and ε₂, because the matrices involved have different dimensions. We will discuss how to choose these constants in Section 8. is then The matrix representation of Σ_{X^jXⁱ|X^−(i,j)} is then

[\sum_{X^{j} X^{i} ∣ X^{- (i, j)}}] = [\sum_{X^{j} X^{i}}] - [\sum_{X^{j} X^{- (i, j)}}] [\sum_{X^{- (i, j)} X^{i}}] .

(13)

By definition, ||Σ_{X^jXⁱ|X^−(i,j)}||² is the maximum of

{〈 \sum_{X^{j} X^{i} ∣ X^{- (i, j)}} f, \sum_{X^{j} X^{i} ∣ X^{- (i, j)}} f 〉}_{A_{X^{j}}} = n^{- 1} {[f]}^{⊤} {[\sum_{X^{j} X^{i} ∣ X^{- (i, j)}}]}^{⊤} K_{j} {Q K}_{j} [\sum_{X^{j} X^{i} ∣ X^{- (i, j)}}] [f]

subject to 〈f, f〉_{𝓐_Xⁱ} = [f]^⊤ K_iQK_i[f] = 1. To solve this generalized eigenvalue problem, let v = (K_iQK_i)^1/2[f]. Then, using inverse Tychonoff regularization, [f] = (K_iQK_i + ε₁I_n)^−1/2v. So it is equivalent to maximizing

v^{⊤} {(K_{i} {Q K}_{i} + ε_{1} I_{n})}^{- 1 / 2} {[\sum_{X^{j} X^{i} ∣ X^{- (i, j)}}]}^{⊤} K_{j} {Q K}_{j} [\sum_{X^{j} X^{i} ∣ X^{- (i, j)}}] {(K_{i} {Q K}_{i} + ε_{1} I_{n})}^{- 1 / 2} v

subject to v^⊤ v = 1, which implies that ||Σ_{XⁱX^j|X^−(i,j)}||² is the largest eigenvalue of

Λ_{X^{i} X^{j}} = {(K_{i} {Q K}_{i} + ε_{1} I_{n})}^{- 1 / 2} {[\sum_{X^{j} X^{i} ∣ X^{- (i, j)}}]}^{⊤} K_{j} {Q K}_{j} [\sum_{X^{j} X^{i} ∣ X^{- (i, j)}}] {(K_{i} {Q K}_{i} + ε_{1} I_{n})}^{- 1 / 2} .

Since the factor K_jQK_j[Σ_{X^jXⁱ|X^−(i,j)}] in the right-hand side is of the form

K_{j} {Q K}_{j} {(K_{j} {Q K}_{j} + ε_{1} I_{n})}^{- 1} K_{j} Q (I_{n} - K_{- (i, j)}^{⊤} {(K_{- (i, j)} {Q K}_{- (i, j)}^{⊤} + ε_{2} I_{n (p - 2)})}^{- 1} K_{- (i, j)}) {Q K}_{i},

it is reasonable to replace the first factor K_jQK_j by K_jQK_j + ε₁I_n, so that Λ_XⁱX^j becomes the quadratic form $Δ_{X^{i} X^{j}} Δ_{X^{i} X^{j}}^{⊤}$ , where

\begin{array}{l} Δ_{X^{i} X^{j}} = {(K_{i} {Q K}_{i} + ε_{1} I_{n})}^{- 1 / 2} K_{i} \\ (Q - {Q K}_{- (i, j)}^{⊤} {(K_{- (i, j)} {Q K}_{- (i, j)}^{⊤} + ε_{2} I_{n (p - 2)})}^{- 1} K_{- (i, j)} Q) K_{j} {(K_{j} {Q K}_{j} + ε_{1} I_{n})}^{- 1 / 2} . \end{array}

(14)

Hence, at the sample level, ||Σ_{XⁱX^j|X^−(i,j)}|| is the largest singular value of Δ_XⁱX^j. Because $Δ_{X^{i} X^{j}}^{⊤} = Δ_{X^{j} X^{i}}$ , we have the symmetry ||Σ_{XⁱX^j|X^−(i,j)}|| = ||Σ_{X^jXⁱ|X^−(i,j)}||.

The above procedure involves inversion of $K_{- (i, j)} {Q K}_{- (i, j)}^{⊤} + ε_{2} I_{n (p - 2)}$ , which is a large matrix if the graph 𝓖 contains many nodes. Nevertheless, as the next proposition shows the actual amount computation is not large — it is essentially that of spectral decomposition of an n by n matrix. Let diag(A₁, …, A_p) be the block-diagonal matrix with diagonal blocks A₁, …, A_p, and V be a matrix in ℝ^s^×^t, with s > t, which has singular value decomposition

V = (L_{1}, L_{0}) diag (D, 0) {(R_{1}, R_{0})}^{⊤},

(15)

where L₁ ∈ ℝ^s^×^u, L₀ ∈ ℝ^s^×(^s⁻^u⁾, R₁ ∈ ℝ^t^×^u, R₀ ∈ ℝ^t^×(^t⁻^u⁾, and D ∈ ℝ^u^×^u is a diagonal matrix with diagonal elements δ₁ ≥ ··· ≥ δ_u > 0.

Proposition 4

If the conditions in the last paragraph are satisfied, then

${({V V}^{⊤} + ε I_{s})}^{- 1} = {V R}_{1} D^{- 1} ({(D^{2} + ε I_{u})}^{- 1} - ε^{- 1} I_{u}) D^{- 1} R_{1}^{⊤} V^{⊤} + ε^{- 1} I_{s}$ ;
$V^{⊤} {({V V}^{⊤} + ε I_{s})}^{- 1} V = R_{1} D {(D^{2} + ε I_{u})}^{- 1} {D R}_{1}^{⊤}$ .

The above proposition allows us to calculate the inversions of s × s matrix (VV^⊤ + εI_s) essentially by spectral decomposition of t × t matrix V^⊤ V. We now summarize the estimation procedure as an algorithm. Suppose we observe an i.i.d. sample X₁, …, X_n from X, where X_i is a p-dimensional random vector ${(X_{i}^{1}, \dots, X_{i}^{p})}^{⊤}$ .

Standardize each component of X marginally. That is, for each i = 1, …, p, we standardize $X_{1}^{i}, \dots, X_{n}^{i}$ , so that E_nXⁱ = 0 and var_nXⁱ = 1.
Choose a positive definite kernel κ: ℝ × ℝ → ℝ, such as the gaussian radial kernel. Choose the kernel parameter γ using, for example, the procedure in Li, Chun, and Zhao (2012). Compute the kernel matrices K_i and K₋₍_i_,_j₎.
Compute Δ_XⁱX^j in (14), using part 2 of Proposition 4 to compute the matrix ${Q K}_{- (i, j)}^{⊤} {(K_{- (i, j)} {Q K}_{- (i, j)}^{⊤} + ε_{2} I_{n (p - 2)})}^{- 1} K_{- (i, j)} Q$ , with V = K₋₍_i_,_j₎Q, s = n(p − 2), t = n, and u = n − 1. The tuning constants ε₁ and ε₂ are determined using the method described in Section 8.
For each i > j, compute the largest singular value of Δ_XⁱX^j. If it is smaller than a certain threshold, declare that there is no edge between i and j.

We emphasize again that throughout this procedure, the largest matrix whose spectral decomposition is needed is of dimension n × n, regardless of the dimension p of X.

7 The APO-based estimator

Our second estimator is based on Corollary 2, which states that Xⁱ ⫫ _AX^j|X⁻⁽ⁱ^,^j⁾ iff Θ_X^jXⁱ = 0. Thus we can determine the absence of edge between i, j if the sample version of ||Θ_X^jXⁱ|| is below a threshold. We first derive the matrix representations of the operator ϒ and its inverse Θ. Let K_1:_p = (K₁, …, K_p)^⊤.

Theorem 7

The matrix representation of ϒ satisfies the following equation:

diag (K_{1} {Q K}_{1}, \dots, K_{p} {Q K}_{p}) [ϒ] = K_{1 : p} {Q K}_{1 : p}^{⊤} .

(16)

The matrix [Θ] is obtained by solving (16) for [ϒ]⁻¹ with Tychonoff regularization:

[Θ] = {(K_{1 : p} {Q K}_{1 : p}^{⊤} + ε_{3} I_{n p})}^{- 1} diag (K_{1} {Q K}_{1}, \dots, K_{p} {Q K}_{p}) .

(17)

To find ||Θ_X^jXⁱ||, we maximize

{〈 Θ_{X^{j} X^{i}} f_{i}, Θ_{X^{j} X^{i}} f_{i} 〉}_{A_{X^{j}}} = n^{- 1} {[f_{i}]}^{⊤} {[Θ_{X^{j} X^{i}}]}^{⊤} K_{j} {Q K}_{j} [Θ_{X^{j} X^{i}}] [f_{i}]

subject to 〈f_i, f_i〉_{𝓐_Xⁱ} = [f_i]^⊤K_iQK_i[f_i] = 1. Following the similar development that leads to (14), we see that ||Θ_X^jXⁱ||² is the largest eigenvalue of

{(K_{i} {Q K}_{i} + ε_{1} I_{n})}^{- 1 / 2} {[Θ_{X^{j} X^{i}}]}^{⊤} (K_{j} {Q K}_{j} + ε_{1} I_{n}) [Θ_{X^{j} X^{i}}] {(K_{i} {Q K}_{i} + ε_{1} I_{n})}^{- 1 / 2},

where [Θ_X^jXⁱ] is the (j, i)th block of (17). Equivalently, ||Θ_X^jXⁱ|| is the largest singular value of

Γ_{X^{i} X^{j}} = {(K_{i} {Q K}_{i} + ε_{1} I_{n})}^{- 1 / 2} {[Θ_{X^{j} X^{i}}]}^{⊤} {(K_{j} {Q K}_{j} + ε_{1} I_{n})}^{1 / 2} .

We now summarize the algorithm for the APO-based estimator. The first two steps are the same as in the ACCO-based algorithm; the last two steps are listed below.

3′ Compute [Θ] according to (17), where the inverse ${(K_{1 : p} {Q K}_{1 : p}^{⊤} + ε_{3} I_{n p})}^{- 1}$ is computed using part 1 of Proposition 4, with V = K_1:_pQ, ε = ε₃, s = np, and u = n − 1. The tuning constant ε₃ is determined by the method in Section 8.
4′ For each i, j = 1, …, p, i > j, compute the largest singular value of Γ_XⁱX^j. If it is smaller than a certain threshold, then there is no edge between i and j.

In concluding this section, we would like to mention that the operator norm used for both the ACCO and APO can be replaced by other norms, such as the Hilbert Schmidt norm. We have experimented both norms in our simulation studies (not presented), and found that their performances are similar.

8 Tuning parameters for regularization

In this section we propose cross-validation and generalized cross-validation procedures to select ε’s in the Tychonoff regularization. We need to invert three types of matrices

type I : K_{i} {Q K}_{i}, type II : K_{- (i, j)} {Q K}_{- (i, j)}^{⊤}, type III : K_{1 : p} {Q K}_{1 : p}^{⊤} .

As before, we denote the constants in the regularized inverses as ε₁, ε₂, and ε₃.

Since our cross-validation is based on the total error in predicting one set of functions by another set of functions, we first formulate such predictions in generic terms. Let U and V be random vectors and W = (U^⊤, V^⊤)^⊤. Let 𝓕 = {f₀, f₁, …, f_r} be a set of functions of U, and 𝓖 = {g₀, g₁, …, g_s} be a set of functions of V, where g₀ and f₀ represent the constant function 1, which are included to account for any constant terms in prediction. Suppose we have an i.i.d. sample {W₁, …, W_n} of W. Let A = {a₁, …, a_m} ⊆ {1, …, n}, B = {b₁, …, b_n₋_m} = A^c. Let {W_a : a ∈ A} be the training set and {W_b : b ∈ B} be the testing set. Suppose ${\hat{E}}_{V ∣ U}^{(ε)} : G \to F$ is a training-set-based operator such that, for each g ∈ 𝓖, ${\hat{E}}_{V ∣ U}^{(ε)} g$ is the best prediction of g(V) based on functions in 𝓕, where ε is the tuning constant to be determined. The total error for predicting any function in 𝓖 evaluated at the testing set is

\sum_{b \in B} \sum_{μ = 0}^{s} {[g_{μ} (V_{b}) - ({\hat{E}}_{V ∣ U}^{(ε)} g_{μ}) (U_{b})]}^{2} .

(18)

Now let L_U be the (r + 1) × m matrix whose ith row is the vector {f_i₊₁(U_a) : a ∈ A}, and L_V be the (s + 1) × m matrix whose kth row is the vector {g_k₊₁(V_a) : a ∈ A}. Let ℓ_U and ℓ_V denote the vector-valued functions (f₀, …, f_r)^⊤ and (g₀, …, g_r)^⊤. Then, following the argument similar to Lee, Li, and Chiaromonte (2013), we can express (18) as

\sum_{b \in B} {‖ ℓ_{V} (V_{b}) - L_{V} L_{U}^{⊤} {(L_{U} L_{U}^{⊤} + ε I_{r + 1})}^{- 1} ℓ_{U} (U_{b}) ‖}^{2},

where ||·|| is the Euclidean norm. If we let G_U be the (r + 1) × (n − m) matrix whose bth column is ℓ_U (U_b), and G_V be the (s + 1) × (n − m) matrix whose bth column is ℓ_V (V_b), then the above can be further simplified as

{‖ G_{V} - L_{V} L_{U}^{⊤} {(L_{U} L_{U}^{⊤} + ε I_{r + 1})}^{- 1} G_{U} ‖}^{2},

where ||·|| is the Frobenius matrix norm. Other matrix norms are also reasonable. In our simulation we used the operator norm, which performed well.

Return now to our cross-validation problem. In the k-fold cross validation, data are divided into k subsets. At each step, one subset is used as the testing set and union of the rest is used as the training set. Let i, j, … index the p components of x, a, b, … index the n subjects, and ν index the k steps in the k-fold cross-validation.

At the νth step, let {X_a : a ∈ A} be the training set and {X_b : b ∈ B} be the testing set. Reset κ˚_i to be the ℝ^m-valued function $x^{i} \mapsto {(κ (x^{i}, X_{a_{1}}^{i}), \dots, κ (x^{i}, X_{a_{m}}^{i}))}^{⊤}$ . Let ℓ_i, ℓ₍_i_,_j₎, and ℓ be the vector-valued functions defined by

\begin{matrix} ℓ_{i} (x^{i}) = {(1, {\overset{\circ}{κ}}_{i}^{⊤} (x^{i}))}^{⊤}, ℓ_{(i, j)} (x^{i}, x^{j}) = {(1, {\overset{\circ}{κ}}_{i}^{⊤} (x^{i}), {\overset{\circ}{κ}}_{j}^{⊤} (x^{j}))}^{⊤}, \\ ℓ (x) = {(1, {\overset{\circ}{κ}}_{1}^{⊤} (x^{1}), \dots, {\overset{\circ}{κ}}_{p}^{⊤} (x^{p}))}^{⊤} . \end{matrix}

Let ℓ₋_i be the ℝⁿ⁽^p⁻¹⁾⁺¹-valued function obtained by removing κ˚_i from ℓ, and ℓ₋₍_i_,_j₎ be the ℝⁿ⁽^p⁻²⁾⁺¹-valued function obtained by removing κ˚_i and κ˚_j from ℓ. Let L_i denote the matrix ( $ℓ_{i} (X_{a_{1}}^{i}), \dots, ℓ_{i} (X_{a_{m}}^{i})$ ), and define L₋_i, L₍_i_,_j₎, and L₋₍_i_,_j₎ similarly. Let G_i denote the matrix ( $ℓ_{i} (X_{b_{1}}^{i}, \dots, ℓ_{i} (X_{b_{n - m}}^{i})$ ), and define G₋_i, G₍_i_,_j₎, and G₋₍_i_,_j₎ similarly.

To determine ε₁, we use the functions in ℓ_i to predict the functions in ℓ₋_i at X_b for any b ∈ B. That is,

\begin{array}{l} {CV}_{1}^{(ν)} (ε_{1}) = \sum_{i = 1}^{p} \sum_{b \in B} {‖ ℓ_{- i} (X_{b}^{- i}) - L_{- i} L_{i}^{⊤} {(L_{i} L_{i}^{⊤} + ε_{1} I_{m + 1})}^{- 1} ℓ_{i} (X_{b}^{i}) ‖}^{2} \\ = \sum_{i = 1}^{p} {‖ G_{- i} - L_{- i} L_{i}^{⊤} {(L_{i} L_{i}^{⊤} + ε_{1} I_{m + 1})}^{- 1} G_{i} ‖}^{2} . \end{array}

(19)

Repeat this process for ν = 1, …, k and form ${CV}_{1} (ε_{1}) = \sum_{ν = 1}^{k} {CV}_{1}^{(ν)} (ε_{1})$ . We choose ε₁ by minimizing CV₁(ε₁) over a grid. To determine ε₂, we use the functions in ℓ₋₍_i_,_j₎ to predict the functions in ℓ₍_i_,_j₎ at each X_b, b ∈ B, which leads to

{CV}_{2}^{(ν)} (ε_{2}) = 2 \sum_{i > j} {‖ G_{(i, j)} - L_{(i, j)} L_{- (i, j)}^{⊤} {(L_{- (i, j)} L_{- (i, j)}^{⊤} + ε_{2} I_{n (p - 2) + 1})}^{- 1} G_{- (i, j)} ‖}^{2} .

(20)

Let ${CV}_{2} (ε_{2}) = \sum_{ν = 1}^{k} {CV}_{2}^{(ν)} (ε_{2})$ . We choose ε₂ by minimizing CV₂(ε₂) over a grid. Here, we use part 1 of Proposition 4 to calculate ${(L_{- (i, j)} L_{- (i, j)}^{⊤} + ε_{2} I_{n (p - 2) + 1})}^{- 1}$ .

Alternatively, we can use the generalized cross validation criteria parallel to (19) and (20) to speed up computation (Wahba, 1990). Rather than splitting the data into training and testing samples, we compute the residual sum of squares from the entire data, and regularize it by the effective degrees of freedom as defined by the traces of relevant projection matrices. Specifically, the GCV-criterion parallel to (19) is

{GCV}_{1} (ε_{1}) = \sum_{i = 1}^{p} \frac{{‖ L_{- i} - L_{- i} L_{i}^{⊤} {(L_{i} L_{i}^{⊤} + ε_{1} I_{n + 1})}^{- 1} L_{i} ‖}^{2}}{{1 - tr (L_{i}^{⊤} {(L_{i} L_{i}^{⊤} + ε_{1} I_{n + 1})}^{- 1} L_{i}) / n}^{2}} .

Here, L_i and L₋_i are computed from the full data, and G_i in (19) is now replaced by L_i. Similarly, the GCV-criterion parallel to (20) is

{GCV}_{2} (ε_{2}) = 2 \sum_{i > j} \frac{{‖ L_{(i, j)} - L_{(i, j)} L_{- (i, j)}^{⊤} {(L_{- (i, j)} L_{- (i, j)}^{⊤} + ε_{2} I_{n + 1})}^{- 1} L_{- (i, j)} ‖}^{2}}{{1 - tr (L_{- (i, j)}^{⊤} {(L_{- (i, j)} L_{- (i, j)}^{⊤} + ε_{2} I_{n (p - 2) + 1})}^{- 1} L_{- (i, j)}) / n}^{2}} .

Here, again, the matrix $L_{- (i, j)}^{⊤} {(L_{- (i, j)} L_{- (i, j)}^{⊤} + ε_{2} I_{n (p - 2) + 1})}^{- 1} L_{- (i, j)}$ is to be calculated by part 2 of Proposition 4 with V = L₋₍_i_,_j₎, s = n(p − 2) + 1 and u = 1. Finally, we recommend taking ε₃ to be the same as ε₂. This is based on the intuition that $K_{- (i, j)} {Q K}_{- (i, j)}^{⊤}$ and $K_{1 : p} {Q K}_{1 : p}^{⊤}$ are of the similar forms and dimensions. This choice works well in our simulation studies.

From our experience, the GCV-criteria give the similar tuning constants as the CV-criteria but increase the computation speed by several folds. For this reason, we used the GCV criteria in our subsequent simulations and application.

9 Simulation studies

In this section we compare our ASG estimators with GGM and GCGM estimators, as proposed by Yuan and Lin (2007) and Liu et al (2013). Since our estimators are based on thresholding the norms of the operators Σ_{XⁱX^j|X^−(i,j)} and Θ_X^jXⁱ, it would be fair to compare them with the thresholded versions of GGM and GCGM. However, because L₁-regularization is widely used for these estimators, we will instead compare them with the L₁-regularized GGM and GCGM.

Comparison 1

We first compare the ASG estimators with GGM and GCGM when the copula assumption is violated. Consider the following models:

Model I : X₁ = ε_1, $X_{2} = X_{1}^{2} + ε_{2}$ , X₃ = ε₃,
Model II :X₁ = ε₁, X₂ = ε₂, $X_{3} = X_{1}^{2} + ε_{3}, X_{4} = X_{1}^{2} + X_{2}^{2} + ε_{4}$ , X₅ = ε₅,

where ε₁, …, ε₅ are i.i.d. N(0, 1). The edge set for Model I is {(1, 2)}; the edge set for Model II is {(1, 3), (1, 4), (2, 4)}. The sample size is n = 100 and we repeated the simulation 100 times and the averaged ROC curves for correct estimation of the edge sets are presented in the upper panels of Figure 2. As expected, our ASG estimators perform much better than the GGM and GCGM when the copula assumption does not hold.

Comparison of ACCO, APO, GGM, and GCGM. Upper panels: ROC curves for Models I and II, where the gaussian copula assumption is violated. Lower panels: ROC curves for Models III and IV, where the gaussian copula assumption holds.

For this and all the other comparisons, we have used the radial basis function (RBF) kernel with γ = 1/(2p), where γ is as it appears in the parametric form used in Li, Chun, and Zhao (2012). In Comparison 6 we give some results for alternative kernels and parameters.

Comparison 2

We now consider two models that satisfy the gaussian or copula gaus-sian assumption, which are therefore favorable to GGM and GCGM, to see how the ASG estimators compare with them:

Model III : X ~ N(0, Θ⁻¹),
Model IV : (f₁(X¹), …, f_p(X^p))^⊤ ~ N(0, Θ⁻¹),

where Θ is the 10 × 10 precision matrix with diagonal entries 1, 1, 1, 1.333, 3.010, 3.203, 1.543, 1.270, 1.554, and nonzero off-diagonal entries θ_3,5 = 1.418, θ_4,6 = −1.484, θ_4,10 = −0.744, θ_5,9 = 0.519, θ_5,10 = −0.577, θ_7,10 = 1.104, and θ_8,9 = −0.737. For Model IV, X¹, …, X^p each has a marginal gamma distribution with rate parameter 1 and shape parameter generated from unif(1, 5). The sample size is 50. The averaged ROC curves (over 100 replications) for the 4 estimators are shown in the lower panels in Figure 2. From these curves we can see the ASG estimators are comparable to GGM and GCGM when the gaussian or copula gaussian assumption hold.

The tuning parameters are selected by GCV₁ and GCV₂ from the coarse grids (5 × 10⁻⁴, …, 5 × 10³). The optimal tuning parameters do vary across repetitions, but the ballpark figures are relatively stable. For example, for model IV, the typical values are 5 × 10⁻⁴ for ε₁ and 5 × 10⁻¹ for ε₂. To evaluate the GCV criteria, in Figure 3 we compare the ROC curves using the GCV-tuned ε̂₁, ε̂₂ with with those for ε̂₁ × 10⁻¹ and ε̂₂ × 10⁻¹ and those for ε̂₁ × 10 and ε̂₂ × 10. We see that smaller tuning parameters performs better at low specificity, larger tuning parameters performs better at high specificity, but the GCV tuning criteria balances these two cases.

Comparison 3

We now investigate the performance of our additive methods under a nonadditive model and compare with a fully nonparametric method that does not rely on the additive assumption. The fully nonparametric method is based on the conditional covariance operator (CCO) introduced by Fukumizu, Bach, and Jordan (2009). Consider the model

Model V : X_i = ε_i, i = 1, 2, 5; X₃ = X₁ + ε₃; X₄ = (X₁ + X₂)² + ε₄,

where ε₁, …, ε₅ be i.i.d. N(0, 1). The nonadditivity is introduced through the term (X₁ + X₂)².

Let μ : ℝ^p⁻² × ℝ^p⁻² → ℝ be a positive definite kernel and M_X^−(i,j) be the kernel matrix { $μ (X_{r}^{- (i, j)}, X_{s}^{- (i, j)})$ , s = 1, …, n}. Let

\begin{matrix} G_{X^{i}} = {Q K}_{X^{i}} Q, G_{X^{j}} = {Q K}_{X^{j}} Q, G_{X^{- (i, j)}} = {Q M}_{X^{- (i, j)}} Q, \\ Q_{X^{- (i, j)}} (ε_{2}) = I_{n} - G_{X^{- (i, j)}} {(G_{X^{- (i, j)}} + ε_{2} I_{n})}^{- 1} . \end{matrix}

Let V_{XⁱX^j|X−(^i,j)} be the CCO and V̂_{XⁱX^j|X^−(i,j)} be its sample estimate. Along the lines of Fukumizu, Bach, and Jordan (2009) we derive the operator norm V̂_{XⁱX^j|X^−(i,j)} as the largest singular value of

{(G_{X^{i}}^{1 / 2} + ε_{1} I_{n})}^{1 / 2} Q_{X^{- (i, j)}} (ε_{2}) {(G_{X^{j}} + ε_{1} I_{n})}^{1 / 2} .

(21)

We declare no edge between node i and j if this value is smaller than a threshold. A derivation of this is sketched in the Appendix. The tuning parameters are determined the similar GCV criteria as described in Section 8 except that the univariate kernel K_X^−(i,j) is replaced by the multivariate kernel M_X^−(i,j). We ran this simulation for n = 50 using ACCO, APO, GCGM, and this fully nonparametric estimator (denoted as fuku in the plot). The averaged ROC curves are shown in Figure 4. We can see that our methods work better Fukumizu et al’s method even in this nonadditive case.

Comparison of ASG and the fully nonparametric estimator on Model V

Comparison 4

In this study we compare the 4 estimators by applying them to the data sets from the DREAM4 Challenges at http://wiki.c2b2.columbia.edu/dream/index.php?title=D4c2 in which subnetworks are extracted from a real biological networks, and gene expressions are simulated from stochastic differential equations, reflecting their biological interactions. Under this simulation scenario, two genes are not guaranteed to be linearly associated. In the DREAM 4 samples, p = 10, n = 31 for all five networks. In the analysis, we combined all data sets from various experiments (wildtype, knockout, knockdown and multifactorial) except for the time-course experiments, and apply each method to the combined data sets across all five network configurations in the DREAM4 Challenges. Figure 5 shows the averaged ROC curves for the 5 subnetworks. We note that APO and ACCO perform the best for the first subnetwork, and APO performs the best in the fifth subnetwork. In the other three subnetworks there is no clear separation among the four estimators. These comparisons suggest a possible benefit of our ASG estimators for finding non-linear biological interactions.

Comparison of the ROC curves by ACCO, APO, GGM, GCGM estimators on 5 subnetworks in DREAM4 Challenges.

Comparison 5

In this study we increase the sample size n and dimension p to investigate the accuracy and computational feasibility of our methods. We use Model IV in Comparison 2 with (p, n) = (200, 400) and (p, n) = (400, 400). We compare APO with the fully nonparametric method and the averaged ROC curves are given in Figure 6. For larger p and n the ACCO is more computationally expensive than APO because one has to compute ||Σ̂_{XⁱX^j|X^−(i,j)}|| for all i, j = 1, …, p, i ≠ j. However, this problem is mitigated by APO, where the large (np × np) matrix [Θ] only needs to be calculated once.

Comparison of ASG and fully nonparametric method for p = 200, n = 400 (left panel) and p = 400, n = 400 (right panel) using model IV

For p = 200 and n = 400, we selected (by GCV) ε̂₁ = 5 × 10⁻⁵, ε̂₂ = 0.5. Each run with fixed ε₁ and ε₂ took around 900 seconds (15 mins) in an Intel(R) Core(TM) i7-3770 CPU at 3.40GHz machine. For p = 400 and n = 400, we selected by GCV ε̂₁ = 5e⁻⁵, ε̂₂ = 0.5. Each run with fixed ε₁ and ε₂ took around 1900 seconds in the same computer. The GGM were not used for this simulation, since it took a very long time to get the estimates (more than a day for a single tuning parameter). We also performed 10 replications. It appears that our additive operator performs better than Fukumizu et al’s operator in high-dimensional cases.

Comparison 6

We now investigate the behavior of different kernels using Models III and IV. We have experimented with 3 types of kernels: the radial basis function kernel, the polynomial (POLY) kernel, and the rational quadratic (RATIONAL) kernel. The specific parametric forms we used for these kernels are as appear in Li, Chun, and Zhao (2012). For each kernel, we used two tuning parameters: for RBF, we use γ = 1/9, 1/(9p); for POLY, we use r = 2, 3; for RATIONAL, c = 200, 400. The ROC curves for these two models are given in Figure 7. We see that, with the exception of polynomial kernel with degree 3, these performance of these kernels are somewhat similar, with RBF, γ = 1/(9p) most favorable. The cubic polynomial kernel does not perform very well because it tends to amplify outlying observations, making their results unstable.

Comparing different kernels using Model III (left panel) and Model IV (right panel).

10 Pathway analysis

We applied the four estimators to a real data set from Dimas et al (2009), in which gene expressions were measured from 85 individuals participating in the GenCord project. The same data set was also used in Chun, Zhang, and Zhao (2012). The participants were of Western European origin, and cell lines were obtained from their umbilical cords. In this analysis, we focused on one cell type, the B cells (lymphoblastoid cell lines). As in Chun et al (2012), we mapped the probes to 17,945 autosomal RefSeq genes, and use the resulting gene level data in our analysis. We then extracted the pathway information from the KEGG database (http://www.genome.jp/kegg/), and used it for partitioning genes and for evaluating the estimators by taking the interactions in the database as true interactions.

Due to the limited sample sizes, we only used the pathways whose sizes (i.e. number of genes) are greater than 2 and less than 36. There are a total of 300 such pathways. After ignoring directionality of regulations in KEGG pathways, we computed the false positives and true positives of the estimates and computed the area under curves (AUC) of the ROCs. We then evaluated the significance of ROC curves based on the p-values of the Wilcoxon rank sum test. Figure 8 shows the AUCs of all networks by ACCO and APO. The left panel shows AUCs based on ACCO, with significant values highlighted by green circles. The right panel shows those based on APO, with significant values highlighted by blue circles. In each case significance means the p-value is less than 10⁻⁴. We see that ACCO yields a larger number of significant ROCs than GGM. To reconfirm this pattern we computed the number of significant ROC curves by ACCO and GGM for three different significance levels, as presented in Table 2, which again shows that the numbers of significant ROC curves by ACCO are consistently higher at all three significance levels.

Comparison of number of significant ROC curves by the ACCO and by GGM estimators for DREAM4 networks with p-values < 10⁻⁴.

Table 2.

Numbers of significant pathways with varying significance levels

p-values	ACCO	APO	GCGM	GGM
< 10⁻³	17	21	14	10
< 10⁻⁴	10	9	5	5
< 10⁻⁵	8	7	2	2

Open in a new tab

To provide further insights into the comparison, we listed in Table 3, Panel I, all the pathways with significant ROCs by ACCO (p-value < 10⁻⁵) whose AUCs by ACCO are larger than those by GGM. There were a total of 6 such pathways. In Table 3, Panel II, we listed all the pathways with significant ROCs by GGM whose AUCs by GGM are larger than those by ACCO. There were a total of 2 such pathways.

Table 3.

Significant pathways whose AUCs by ACCO are larger than those by GGM (panel I) and whose AUCs by ACCO are smaller than those by GGM (panel II)

Panel	Pathway^a	p	AUC
Panel	Pathway^a	p	ACCO	APO	GCGM	GGM
I	1	35	0.63	0.59	0.52	0.53
	2	24	0.66	0.66	0.56	0.58
	3	25	0.61	0.58	0.54	0.54
	4	18	0.67	0.60	0.62	0.59
	5	18	0.69	0.63	0.62	0.62
	6	16	0.69	0.59	0.55	0.55

II	7	14	0.60	0.65	0.62	0.66
II	8	17	0.73	0.71	0.64	0.79

Open in a new tab

Labels in this column indicate the following pathways. 1: Activation of Csk by cAMP-dependent Protein Kinase Inhibits Signaling through the T Cell Receptor; 2: Cell Cycle: G2 M Checkpoint; 3: Cyclins and Cell Cycle Regulation; 4: Lck and Fyn tyrosine kinases in initiation of TCR Activation; 5: Th1 Th2 Differentiation; 6: hsa00710; 7: Proteasome Complex pathway; 8: Human Peptide GPCRs.

Some pathways were selected by more than one methods. Specifically,

\begin{array}{l} # (ACCO \cap APO) = 4, # (ACCO \cap GGM) = 1, # (ACCO \cap GCGM) = 1, \\ # (APO \cap GGM) = 1, # (APO \cap GCGM) = 1, # (GGM \cap GCGM) = 1, \end{array}

where #(A ∩ B) means the number of pathways found by method A and method B. It appears that ACCO and APO yield most similar results. Furthermore, single pathway was found by all methods.

From the analysis of these 300 pathways we see that the ASG estimators perform better than GGM and GCGM in a majority of cases. A plausible interpretation is that intrinsically nonlinear interactions (i.e. those that cannot be described by the gaussian copula assumption) are prevalent among genetic networks, and using a more flexible interaction assumption such as the ASG has real benefits in modeling of gene regulations.

11 Conclusions

In this paper we introduced two new estimators for graphical models based on the notion of additive conditional independence. They substantially broaden the applicability of the current graphical model estimators based on the gaussian or copula gaussian assumption, but at the same time avoid using high-dimensional kernels that can hamper estimation accuracy. As we have demonstrated by simulation and data analysis, the new methods work particularly well when the interactions exhibit intrinsic nonlinearity, which cannot be captured by the gaussian or copula gaussian models, and are comparable to existing methods when the copula gaussian assumption holds.

Our generalization of precision matrix to additive precision operator allows us to adopt essentially the same estimation strategy of the gaussian graphical model at the operator level. Using this generalization we succeeded in preserving the simplicity of the gaussian dependence structure without suffering from its austere distributional restriction.

This basic idea opens up many other possibilities that cannot be explored within the scope of this paper. For example, although both the ACCO and APO estimators are based on truncating small operators (as measured by the operator norm), it is conceivable that more sophisticated sparse techniques, such as LASSO, SCAD, and adaptive LASSO, can be adapted to this setting as well. How to implement these techniques in the operator setting, and how much they would improve the truncation-based estimators presented here, would be interesting questions for future research. Moreover, the asymptotic properties of the proposed estimators, such as consistency and asymptotic distributions, also need to be developed in further studies.

Supplementary Material

NIHMS561317-supplement-Supplementary_Material.zip^{(10.1KB, zip)}

Acknowledgments

Bing Li’s research was supported in part by NSF grant DMS-1106815.

Hyonho Chun’s research was supported in part by NSF grant DMS-1107025.

Hongyu Zhao’s research was supported in part by NIH grants R01-GM59507, P01-CA154295, and NSF grant DMS-1106738.

We greatly appreciate the thorough and thought provoking reviews by three referees and an Associate Editor, which contain many useful suggestions and lead to substantial improvements of this work.

Appendix: proofs

We first state without proof several simple geometric facts that will be useful for proving Theorems 1 and 2. Let P_𝓛 be the projection on to a subspace 𝓛 of L₂(P_X), U and W be subvectors of X. Then

A_{U \cup W} = A_{U} + A_{W} .

(22)

P_{A_{U \cup W}} = P_{A_{U \cup W} ⊖ A_{W}} + P_{A_{W}} .

(23)

A_{U \cup W} ⊖ A_{W} = (I - P_{A_{W}}) A_{U}

(24)

Proof of Theorem 1

1 ⇔ 2 follows from relation (22); 2 ⇔ 3 follows from relation (24); 6 ⇔ 7 follows from (23); 4 ⇔ 5 follows from the relation (23).

2 ⇒ 4. By statement 2, P_{𝓐_V∪W ⊖ 𝓐_W} (𝓐_U∪W ⊖ 𝓐_W) = {0}. By relation (24) and the fact that 𝓐_V∪W ⊖ 𝓐_W ⊥ 𝓐_W, we have

P_{A_{V \cup W} ⊖ A_{W}} (A_{U \cup W} ⊖ A_{W}) = P_{A_{V \cup W} ⊖ A_{W}} (I - P_{A_{W}}) A_{U} = P_{A_{V \cup W} ⊖ A_{W}} A_{U} .

4 ⇒ 6. This is because

P_{A_{V \cup W} ⊖ A_{W}} A_{U \cup W} = P_{A_{V \cup W} ⊖ A_{W}} (A_{U} + A_{W}) = P_{A_{V \cup W} ⊖ A_{W}} A_{W} = {0},

where the first equality follows from (22); the second from assertion 4; the third from 𝓐_W ⊥ 𝓐_V∪W ⊖ 𝓐_W.

6 ⇒ 2. Because ker(P_{𝓐_V∪W ⊖ 𝓐_W}) = [ran(P_{𝓐_V∪W ⊖ 𝓐_W})]^⊥ = (𝓐_V∪W ⊖ 𝓐_W)^⊥, we have

A_{U \cup W} ⊥ A_{V \cup W} ⊖ A_{W},

which implies 𝓐_U∪W ⊖ 𝓐_W ⊥ 𝓐_V∪W ⊖ 𝓐_W because 𝓐_U∪W ⊖ 𝓐_W ⊆ 𝓐_U∪W.

Proof of Theorem 2

Axiom 1 is obvious. To prove 2, 3, 4, let U, V, W, R be disjoint sub-vectors of X and 𝓐_U, 𝓐_V, 𝓐_W, 𝓐_R be their respective additive families.

2. If U ⫫_A V ∪ R|W, then

(A_{U} + A_{W}) ⊖ A_{W} ⊥ (A_{V \cup R} + A_{W}) ⊖ A_{W} .

Axiom 2 follows because (𝓐_V + 𝓐_W) ⊖ 𝓐_W ⊆ (𝓐_V∪R + 𝓐_W) ⊖ 𝓐_W.

3. By Theorem 1, part 3, U ⫫_A V ∪ R|W is equivalent to

(I - P_{A_{W}}) A_{U} ⊥ (I - P_{A_{W}}) A_{V \cup R}

(25)

By Axiom 2, we have U ⫫_A R|W, which, by Theorem 1, part 5, implies P_{𝓐_W} f = P_{𝓐_W∪R} f for all f ∈ 𝓐_U. Hence the left hand side of (25) can be rewritten as (I − P_{𝓐_W∪R}) 𝓐_U. Meanwhile, the right hand side of (25) contains (I − P_{𝓐_W})𝓐_V. Hence

(I - P_{A_{W \cup R}}) A_{U} ⊥ (I - P_{A_{W}}) A_{V} .

(26)

Because 𝓐_W ⊆ 𝓐_W∪R, we have ran(I − P_{𝓐_W∪R}) ⊥ ran(P_{𝓐_W∪R}). Hence

(I - P_{A_{W \cup R}}) A_{U} ⊥ P_{A_{W \cup R} ⊖ A_{W}} A_{V} .

(27)

Because

(I - P_{A_{W \cup R}}) A_{V} = (I - P_{A_{W}} - P_{A_{W \cup R} ⊖ A_{W}}) A_{V} \subseteq (I - P_{A_{W}}) A_{V} + P_{A_{W \cup R} ⊖ A_{W}} A_{V},

relation (26) and the second expression in (27) implies the desired relation

(I - P_{A_{W \cup R}}) A_{U} ⊥ (I - P_{A_{W \cup R}}) A_{V} .

4. By Theorem 1, part 3, U ⫫_AV|W and U ⫫_AR|W ∪ V are equivalent to

(I - P_{A_{W}}) A_{U} ⊥ (1 - P_{A_{W}}) A_{V}, (I - P_{A_{V \cup W}}) A_{U} ⊥ (I - P_{A_{V \cup W}}) A_{R} .

(28)

We need to show (I − P_{𝓐_W}) 𝓐_U ⊥ (I − P_{𝓐_W})(𝓐_V + 𝓐_R), which, by the first relation in in (28), is implied by

(I - P_{A_{W}}) A_{U} ⊥ (I - P_{A_{W}}) A_{R} .

(29)

Because U ⫫_A V|W, the left side of (29) is (I − P_{𝓐_V∪W}) 𝓐_U. By (23), the right side of (29) is (I − P_{𝓐_V∪W} + P_{𝓐_V∪W ⊖ 𝓐_W}) 𝓐_R. So, under (28), relation (29) is equivalent to

(I - P_{A_{V \cup W}}) A_{U} ⊥ (I - P_{A_{V \cup W}} + P_{A_{V \cup W} ⊖ A_{W}}) A_{R} .

Since (I − P_{𝓐_V∪W}) 𝓐_U ⊥ (I − P_{𝓐_V∪W}) 𝓐_R, it suffices to show that

(I - P_{A_{V \cup W}}) A_{U} ⊥ P_{A_{V \cup W} ⊖ A_{W}} A_{R},

(30)

which is true because

ran (P_{A_{V \cup W} ⊖ A_{W}}) = A_{V \cup W} ⊖ A_{W} = (I - P_{A_{W}}) (A_{V \cup W}) = ran ((I - P_{A_{W}}) P_{A_{V \cup W}})

and (I − P_{𝓐_V∪W})(I − P_{𝓐_W}) P_{𝓐_V∪W} = 0.

Proof of Theorem 3

The statement U ⫫_A V|W means

〈 f - P_{A_{W}} f, g - P_{A_{W}} g 〉 = 0 \forall f \in A_{U}, g \in A_{V} .

(31)

Let A, B, and C be the sets of indices for U, V, and W and h(W) be the vector {f_k(X^k) : k ∈ C}. By condition 3, f and g can be written as Σ_i_∈_A α_if_i and Σ_j_∈_B β_j f_j for some α = (α₁, …, α_r)^⊤ and β = (β₁, …, β_s)^⊤. Since (f(U), h(W)) is jointly gaussian, the conditional expectation E[f(U)|h(W)] belongs to 𝓐_W. For the same reason E[g(V)|h(W)] belongs to 𝓐_W. It follows that P_{𝓐_W}f = E[f(U)|h(W)], P_{𝓐_W}g = E[g(V)|h(W)], and

〈 f - P_{A_{W}} f, g - P_{A_{W}} g 〉 = E [cov (f (U), g (V) ∣ h (W))] = cov [f (U), g (V) ∣ h (W)],

where the second equality holds because the random vector (f(U), g(V), h(W)) is multivariate gaussian. Consequently, (31) is equivalent to f(U) ⫫ g(V)|h(W) for all f ∈ 𝓐_U and g ∈ 𝓐_V, which is equivalent to

{f_{i} (X^{i}) : i \in A} ⫫ {f_{j} (X^{j}) : j \in B} ∣ h (W) .

Because f₁, …, f_p are injective, this is equivalent to U ⫫ V|W.

Proof of Proposition 1

Let f ∈ L₂(P_U) and g ∈ L₂(P_V) be any functions with ||f|| = ||g|| = 1. Because P_{𝓐_W} is idempotent and self adjoint, we have

〈 f - P_{A_{W}} f, g - P_{A_{W}} g 〉 = 〈 f, g 〉 - 〈 f, P_{A_{W}} g 〉 .

For each γ = 1, …, p − 2, let { $ψ_{α γ}^{w}$ : α = 1, 2, …} be the hermite polynomials in the variable w^γ. Similarly, let { $ψ_{α}^{u}$ : α = 1, 2, …} and { $ψ_{α}^{v}$ : α = 1, 2, …} be the hermite polynomials in the variables u and v. By Lancaster (1957), these polynomials form orthonormal bases in L₂(P_W^γ), L₂(P_U), and L₂(P_V). It follows that the set { $ψ_{α γ}^{w}$ : α = 1, 2, …, γ = 1, …, p − 2} is an orthonormal basis for 𝓐_W. Moreover, by Lancaster (1957), whenever α ≠ β,

{ψ_{α}^{u}} \cup {ψ_{α}^{v}} \cup {ψ_{α γ}^{w} : γ = 1, \dots, p - 2} ⊥ {ψ_{β}^{u}} \cup {ψ_{β}^{v}} \cup {ψ_{β_{γ}}^{w} : γ = 1, \dots, p - 2} .

So, if we let $A_{W}^{α} = span {ψ_{α γ}^{w} : γ = 1, \dots, p - 2}$ , then $P_{A_{W}} g = \sum_{α = 1}^{\infty} P_{A_{W}^{α}} g$ . Each term in this sum is

P_{A_{W}^{α}} g = P_{A_{W}^{α}} (\sum_{β = 1}^{\infty} 〈 g, ψ_{β}^{v} 〉 ψ_{β}^{v}) = \sum_{β = 1}^{\infty} 〈 g, ψ_{β}^{v} 〉 P_{A_{W}^{α}} ψ_{β}^{v} = 〈 g, ψ_{α}^{v} 〉 P_{A_{W}^{α}} ψ_{α}^{v}

By Lancaster (1957) again,

〈 ψ_{α}^{v}, ψ_{α γ}^{w} 〉 = ρ_{V W^{γ}}^{α}, 〈 ψ_{α γ}^{w}, ψ_{α δ}^{w} 〉 = ρ_{W^{γ} W^{δ}}^{α},

which yield $P_{A_{W}^{α}} ψ_{α}^{v} = R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} ψ_{α *}^{w}$ , where $ψ_{α *}^{w} = {(ψ_{α 1}^{w}, \dots, ψ_{α, p - 2}^{w})}^{⊤}$ . Hence

P_{A_{W}} g = \sum_{α = 1}^{\infty} P_{A_{W}^{α}} g = \sum_{α = 1}^{\infty} 〈 g, ψ_{α}^{v} 〉 R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} ψ_{α *}^{w} .

Using this we compute

\begin{array}{l} 〈 f, P_{A_{W}} g 〉 = \sum_{α = 1}^{\infty} 〈 g, ψ_{α}^{v} 〉 〈 f, R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} ψ_{α *}^{w} 〉 \\ = \sum_{α = 1}^{\infty} 〈 g, ψ_{α}^{v} 〉 \sum_{β = 1}^{\infty} 〈 f, ψ_{β}^{u} 〉 〈 ψ_{β}^{u}, R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} ψ_{α *}^{w} 〉 \\ = \sum_{α = 1}^{\infty} 〈 g, ψ_{α}^{v} 〉 〈 f, ψ_{α}^{u} 〉 〈 ψ_{α}^{u}, R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} ψ_{α *}^{w} 〉 \\ = \sum_{α = 1}^{\infty} 〈 g, ψ_{α}^{v} 〉 〈 f, ψ_{α}^{u} 〉 R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} R_{W U}^{⊙ α} . \end{array}

By the similar calculation we found $〈 f, g 〉 = \sum_{α = 1}^{\infty} 〈 f, ψ_{α}^{u} 〉 〈 g, ψ_{α}^{v} 〉 ρ_{U V}^{α}$ . Hence

〈 f, g 〉 - 〈 f, P_{A_{W}} g 〉 = \sum_{α = 1}^{\infty} 〈 g, ψ_{α}^{v} 〉 〈 f, ψ_{α}^{u} 〉 (ρ_{U V}^{α} - R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} R_{W U}^{⊙ α})

Finally, by the Cauchy-Schwarz inequality,

\begin{array}{l} {(〈 f, g 〉 - 〈 f, P_{A_{W}} g 〉)}^{2} \leq \sum_{α = 1}^{\infty} {〈 g, ψ_{α}^{v} 〉}^{2} \sum_{α = 1}^{\infty} {〈 f, ψ_{α}^{u} 〉}^{2} {(ρ_{U V}^{α} - R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} R_{W U}^{⊙ α})}^{2} \\ = \sum_{α = 1}^{\infty} {〈 f, ψ_{α}^{u} 〉}^{2} {(ρ_{U V}^{α} - R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} R_{W U}^{⊙ α})}^{2} \\ \leq max {{(ρ_{U V}^{α} - R_{V W}^{⊙ α} {(R_{W W}^{⊙ α})}^{- 1} R_{W U}^{⊙ α})}^{2} : α = 1, 2, \dots}, \end{array}

as desired.

Proof of Lemma 1

For any f ∈ 𝓐_U and g ∈ 𝓐_W, we have

〈 (P_{A_{W}} ∣ A_{U} - \sum_{W U}) f, g 〉 = {〈 P_{A_{W}} f, g 〉}_{A_{W}} - 〈 \sum_{W U} f, g 〉 .

(32)

The first term on the right-hand side is

〈 P_{A_{W}} f, g 〉 = 〈 f, P_{A_{W}} g 〉 = 〈 f, g 〉 = E [f (U) g (W)] .

Hence 〈(P_{𝓐_W} ∣ 𝓐_U − Σ_WU)f,g〉_{𝓐_W} = 0 for all f ∈ 𝓐_W and g ∈ 𝓐_U.

Proof of Theorem 5

If U ⫫_A V|W, then, for any f ∈ 𝓐_U, g ∈ 𝓐_V,

0 = 〈 (I - P_{A_{W}}) f, (I - P_{A_{W}}) g 〉 = 〈 f, g 〉 - 〈 f, P_{A_{W}} g 〉,

(33)

where the second equality follows from the definition of a projection. Using Lemma 1, we calculate the two terms on the right as

〈 f, g 〉 = 〈 f, \sum_{U V} g 〉, 〈 f, P_{A_{W}} g 〉 = 〈 f, \sum_{W V} g 〉 = 〈 f, \sum_{U W} \sum_{W V} g 〉 .

Hence (33) is equivalent to 〈f, (Σ_UV − Σ_UW Σ_WV)g〉 = 0, which implies Σ_UV − Σ_UW Σ_WV = 0. The arguments can obviously be reversed.

Proof of Proposition 2

1 ⇒ 2. For any f ∈ 𝓐_⊕X, ∣∣f∣∣_{𝓐_⊕X} ≤ 1, we have, by the triangular inequality,

\begin{array}{l} {‖ T f ‖}_{A_{\oplus X}} \leq \sum_{i = 1}^{p} \sum_{j = 1}^{p} ‖ T_{X^{j} X^{i}} f_{i} ‖ \\ \leq \sum_{i = 1}^{p} \sum_{j = 1}^{p} ‖ T_{X^{j} X^{i}} ‖ ‖ f_{i} ‖ \leq p^{2} {max}_{i, j = 1, \dots, p} ‖ T_{X^{j} X^{i}} ‖ < \infty . \end{array}

2 ⇒ 1. For any f_k ∈ 𝓐_X^k and ∣∣f_k∣∣_{𝓐_k} ≤ 1, we have

{‖ T f_{k} ‖}_{A_{\oplus X}}^{2} = {〈 T f_{k}, T f_{k} 〉}_{A_{\oplus X}} = \sum_{i = 1}^{p} 〈 T_{X^{i} X^{k}} f_{k}, T_{X^{i} X^{k}} f_{k} 〉 \geq {‖ T_{X^{ℓ} X^{k}} f_{k} ‖}^{2},

which implies

\begin{array}{l} sup {‖ T_{X^{ℓ} X^{k}} f_{k} ‖ : ‖ f_{k} ‖ \leq 1} \leq sup {{‖ T f_{k} ‖}_{A_{\oplus X}} : ‖ f_{k} ‖ \leq 1} \\ = sup {{‖ T f_{k} ‖}_{A_{\oplus X}} : {‖ f_{k} ‖}_{A_{\oplus X}} \leq 1} = ‖ T ‖ < \infty . \end{array}

Proof of Lemma 2

First, we show that T_UU (and hence also T_VV) is bounded, self adjoint, and positive definite. Let h ∈ 𝓐_⊕U. Then T_UU h ∈ 𝓐_⊕U. Hence

{〈 T_{U U} h, T_{U U} h 〉}_{A_{\oplus X}} = {〈 T_{U U} h, T h 〉}_{A_{\oplus X}} \leq {‖ T_{U U} h ‖}_{A_{\oplus X}} {‖ T h ‖}_{A_{\oplus X}}

which implies that ∣∣T_UUh∣∣_{𝓐_⊕X} is either 0 or no greater than ||T|| ||h||. Hence T_UU is bounded. That T_UU is self adjoint follows from 〈g, T_UUh〉_{𝓐_⊕X} = 〈g, Th〉_{𝓐_⊕X} for all g, h ∈ 𝓐_⊕U. Because T > 0, if h ∈ 𝓐_⊕U and h ≠ 0, then 〈h, Th〉_{𝓐_⊕X} > 0. Because 〈h, Th〉_{𝓐_⊕X} = 〈h, T_UUh〉_{𝓐_⊕X} we have 〈h, T_UUh〉_{𝓐_⊕X} > 0. Thus T_UU > 0.

That $T_{U U} - T_{U W} T_{W W}^{- 1} T_{W U}$ and $T_{V V} - T_{V W} T_{W W}^{- 1} T_{W V}$ are bounded and self adjoint can be proved similarly. To show that they are positive definite and the subsequent identities, let I_UU be the identity operator from 𝓐_⊕U to 𝓐_⊕U, and O_UV be the zero operator from 𝓐_⊕V to 𝓐_⊕U. Define I_VV and O_VU similarly. Then

(\begin{matrix} {(T^{- 1})}_{U U} & {(T^{- 1})}_{U V} \\ {(T^{- 1})}_{V U} & {(T^{- 1})}_{V V} \end{matrix}) (\begin{matrix} T_{U U} & T_{U V} \\ T_{V U} & T_{V V} \end{matrix}) = (\begin{matrix} I_{U U} & O_{U V} \\ O_{V U} & I_{V V} \end{matrix}) .

Hence

{(T^{- 1})}_{U U} T_{U U} + {(T^{- 1})}_{U V} T_{V U} = I_{U U}

(34)

{(T^{- 1})}_{U U} T_{U V} + {(T^{- 1})}_{U V} T_{V V} = O_{U V} .

(35)

By part 1, T_{V V} is invertible. Hence, by (35),

{(T^{- 1})}_{U V} = - {(T^{- 1})}_{U U} T_{U V} T_{V V} .

(36)

Substitute this into (34) to obtain

{(T^{- 1})}_{U U} (T_{U U} - T_{U V} T_{V V} T_{V U}) = I_{U U}

(37)

which implies that T_UU − T_UV T_VV T_VU is invertible and the first two identities in (9) hole. Similarly, we can prove that T_VV − T_VU T_UU T_UV is invertible and the last two identities in (9) hold.

Proof of Theorem 6

Let A, B, C be the index sets of U, V, W, Z = U ∪ V, and

\begin{array}{l} ϒ_{Z Z} = {\sum_{X^{i} X^{j}} : i, j \in A \cup B}, ϒ_{Z W} = {\sum_{X^{i} X^{j}} : i \in A \cup B, j \in C}, \\ ϒ_{W Z} = {\sum_{X^{i} X^{j}} : i \in C, j \in A \cup B}, ϒ_{W W} = {\sum_{X^{i} X^{j}} : i \in C, j \in C} . \end{array}

These operators are different from the operators such as Σ_UV in Section 4 because they are defined by different inner products. For example, Σ_UU is the identity mapping but ϒ_UU is not. By Lemma 2, $Θ_{Z Z} = {(ϒ_{Z Z} - ϒ_{Z W} ϒ_{W W}^{- 1} ϒ_{W Z})}^{- 1}$ . Because

ϒ_{Z Z} - ϒ_{Z W} ϒ_{W W}^{- 1} ϒ_{W Z} = (\begin{matrix} ϒ_{U U} - ϒ_{U W} ϒ_{W W}^{- 1} ϒ_{W U} & ϒ_{U V} - ϒ_{U W} ϒ_{W W}^{- 1} ϒ_{W V} \\ ϒ_{V U} - ϒ_{V W} ϒ_{W W}^{- 1} ϒ_{W U} & ϒ_{V V} - ϒ_{V W} ϒ_{W W}^{- 1} ϒ_{W V} \end{matrix}),

we have, by Lemma 2 again, ${(Θ_{Z Z}^{- 1})}_{U V} = - {(Θ_{Z Z}^{- 1})}_{U U} Θ_{U V} Θ_{V V}^{- 1}$ . This can be rewritten as

ϒ_{U V} = ϒ_{U W} ϒ_{W W}^{- 1} ϒ_{W V} = - (ϒ_{U U} - ϒ_{U W} ϒ_{W W}^{- 1} ϒ_{U W}) Θ_{U V} Θ_{V V}^{- 1} .

(38)

Therefore, Θ_UV = 0 iff $ϒ_{U V} - ϒ_{U W} ϒ_{W W}^{- 1} ϒ_{W V} = 0$ .

Now let f ∈ 𝓐_U, g ∈ 𝓐_V. Then, by the definition of a projection,

〈 f - P_{A_{W}} f, g - P_{A_{W}} g 〉 = 〈 f, g 〉 - 〈 f, P_{A_{W}} g 〉 .

The first term on the right is 〈f, ϒ_UVg〉_{𝓐_⊕X}. Because, for any h ∈ 𝓐_W,

〈 h, ϒ_{W W}^{- 1} ϒ_{W V} g 〉 = {〈 h, ϒ_{W V} g 〉}_{A_{\oplus X}} = 〈 h, g 〉,

we have $ϒ_{W W}^{- 1} ϒ_{W V} = P_{A_{W}} ∣ A_{V}$ . Hence

〈 f, P_{A_{W}} g 〉 = 〈 f, ϒ_{W W}^{- 1} ϒ_{W V} g 〉 = {〈 f, ϒ_{U W} ϒ_{W W}^{- 1} ϒ_{W V} g 〉}_{A_{\oplus X}} .

That is, we have shown that

〈 f - P_{A_{W}} f, g - P_{A_{W}} g 〉 = {〈 f, (ϒ_{U V} - ϒ_{U W} ϒ_{W W}^{- 1} ϒ_{W V}) g 〉}_{A_{\oplus X}},

which means $ϒ_{U V} - ϒ_{U W} ϒ_{W W}^{- 1} ϒ_{W V} = 0$ iff U ⫫_A V|W.

Proof of Lemma 3

The proof of the three equalities are similar. So we only prove the last one. By (10), for any f ∈ 𝓐_X^−(i,j, g ∈ 𝓐_Xⁱ, we have

{〈 f, \sum_{X^{- (i, j)} X^{i}} g 〉}_{A_{X^{- (i, j)}}} = n^{- 1} {[f]}^{⊤} K_{- (i, j)} {Q K}_{- (i, j)}^{⊤} [\sum_{X^{- (i, j)} X^{i}}] [g] .

Alternatively,

{〈 f, \sum_{X^{- (i, j)} X^{i}} g 〉}_{A_{X^{- (i, j)}}} = {cov}_{n} [f (X), g (X)] = n^{- 1} {[f]}^{⊤} K_{- (i, j)} {Q K}_{i} [g] .

Hence the right-hand sides of the above two equalities are the same for all [f] and [g], which implies the third relation in (11).

Proof of Proposition 4

Write (15) as V = LΨR^⊤ where L = (L₁, L₀), R = (R₁, R₀), and Ψ = diag(D, 0). Because L and R are orthogonal matrices,
${({V V}^{⊤} + ε I_{s})}^{- 1} = {(L Ψ Ψ^{⊤} L^{⊤} + ε I_{s})}^{- 1} = L {(Ψ Ψ^{⊤} + ε I_{s})}^{- 1} L^{⊤} .$ (39)

By construction,
${(Ψ Ψ^{⊤} + ε I_{s})}^{- 1} = {diag (D^{2} + ε I_{u}, ε I_{s - u})}^{- 1} = diag {{(D^{2} + ε I_{u})}^{- 1}, ε^{- 1} I_{s - u}} .$ (40)

Hence the right-hand side of (39) is $L_{1} {(D^{2} + ε I_{u})}^{- 1} L_{1}^{⊤} + ε^{- 1} L_{0} L_{0}^{⊤}$ , which, because L is an orthogonal matrix, can be rewritten as
$L_{1} {(D^{2} + ε I_{u})}^{- 1} L_{1}^{⊤} + ε^{- 1} (I_{s} - L_{1} L_{1}^{⊤}) = L_{1} ({(D^{2} + ε I_{u})}^{- 1} - ε^{- 1} I_{u}) L_{1}^{⊤} + ε^{- 1} I_{s} .$

By (15) we have $V = L_{1} {D R}_{1}^{⊤}$ , and hence L₁ = V R₁D⁻¹. Substitute this into the right-hand side to prove part 1.
Multiply both sides of (39) by V^⊤ from the left and by V from the right, to obtain
$V^{⊤} {({V V}^{⊤} + ε I_{s})}^{- 1} V = R Ψ^{⊤} {(⊤ ⊤^{⊤} + ε I_{s})}^{- 1} Ψ R .$

Substitute RΨ^⊤ = R₁D and (40) into the right-hand side above to prove part 2.

Proof of Theorem 7

Let f, g ∈ 𝓐_⊕X. Then f = f₁ + ··· + f_p and ϒg = h₁ + ··· + h_p where f_k, h_k ∈ 𝓐_X^k, k = 1, …, p. Hence

{〈 f, ϒ g 〉}_{A_{\oplus X}} = n^{- 1} \sum_{i = 1}^{p} {〈 f_{i}, h_{i} 〉}_{A_{X^{i}}} = n^{- 1} \sum_{i = 1}^{p} {[f_{i}]}^{⊤} K_{i} {Q K}_{i} [h_{i}]

(41)

Because

{[f]}_{A_{\oplus X}} = {({[f_{1}]}_{A_{X^{1}}}^{⊤}, \dots, {[f_{p}]}_{A_{X^{p}}}^{⊤})}^{⊤}, {[ϒ g]}_{A_{\oplus X}} = {({[h_{1}]}_{A_{X^{1}}}^{⊤}, \dots, {[h_{p}]}_{A_{X^{p}}}^{⊤})}^{⊤},

The right-hand side of (41) can be rewritten in matrix form as

n^{- 1} {[f]}^{⊤} diag (K_{1} {Q K}_{1}, \dots, K_{p} {Q K}_{p}) [ϒ] [g] .

(42)

Alternatively, the inner product in (41) can be written as

\begin{array}{l} {〈 f, ϒ g 〉}_{A_{\oplus X}} = n^{- 1} \sum_{i = 1}^{p} \sum_{j = 1}^{p} {〈 f_{i}, \sum_{X^{i} X^{i}} g_{j} 〉}_{A_{X^{i}}} \\ = n^{- 1} \sum_{i = 1}^{p} \sum_{j = 1}^{p} {[f_{i}]}^{⊤} K_{i} {Q K}_{j} [g_{j}] = n^{- 1} {[f]}^{⊤} K_{1 : p} {Q K}_{1 : p}^{⊤} [g] . \end{array}

(43)

Since (42) is equal to (43) for all [f] and [g], equality (16) holds.

Operator norm of CCO

By Fukumizu et al (2009),

[{\hat{V}}_{X^{i} X^{j} ∣ X^{- (i, j)}}] = n^{- 1} Q_{X^{- (i, j)}} (ε_{2}) G_{X^{i}} .

(44)

The operator norm of V̂_{XⁱX^j|X^−(i,j)} is the maximum of 〈f_i, V̂_{XⁱX^j|X^−(i,j)} f_j〉² subject to 〈f_i, f_i〉 = 〈f_j, f_j〉 = 1. By linear algebra, this is equivalent to

maximizing 〈 f_{i}, {\hat{V}}_{X^{i} X^{j} ∣ X^{- (i, j)}} {\hat{V}}_{X^{j} X^{i} ∣ X^{- (i, j)}} f_{i} 〉 subject to 〈 f_{i}, f_{i} 〉 = 1.

(45)

By (44), the first expression in (45) is n⁻²[f_i]^⊤G_Xⁱ Q_X^−(i,j) (ε₂)G_Xⁱ Q_X^−(i,j) (ε₂)G_Xⁱ [f_i]. The second expression in (45) is [f_i]^⊤ G_Xⁱ [f_i] = 1. This is a generalized eigenvalue problem. To convert this to an ordinary eigenvalue problem, let $ϕ_{i} = G_{X^{i}}^{1 / 2} [f_{i}]$ . Solve this equation for [f_i] with Tychonoff regularization, to obtain [f_i] = (G_Xⁱ + ε₁I_n)^−1/2 ϕ_i. The optimization problem (45) reduces to maximizing (ignoring n⁻²):

ϕ_{i}^{⊤} {(G_{X^{i}} + ε_{1} I_{n})}^{- 1 / 2} G_{X^{i}} Q_{X^{- (i, j)}} (ε_{2}) G_{X^{j}} Q_{X^{- (i, j)}} (ε_{2}) G_{X^{i}} {(G_{X^{i}} + ε_{1} I_{n})}^{- 1 / 2} ϕ_{i}

subject to $ϕ_{i}^{⊤} ϕ_{i} = 1$ . As we did with ACCO and APO, we use the symmetrized version

ϕ_{i}^{⊤} {(G_{X^{i}} + ε_{1} I_{n})}^{1 / 2} Q_{X^{- (i, j)}} (ε_{2}) (G_{X^{j}} + ε_{1} I_{n}) Q_{X^{- (i, j)}} (ε_{2}) {(G_{X^{i}} + ε_{1} I_{n})}^{1 / 2} ϕ_{i},

which is of the form AA^⊤ with A being the matrix in (21).

Contributor Information

Bing Li, Email: bing@stat.psu.edu.

Hyonho Chun, Email: chunh@purdue.edu.

Hongyu Zhao, Email: hongyu.zhao@yale.edu.

References

Bellman RE. Dynamic Programming. Princeton University Press; 1957. [Google Scholar]
Bach FR. Consistency of the Group Lasso and Multiple Kernel Learning. Journal of Machine Learning Research. 2008;9:1179–1225. [Google Scholar]
Bickel PJ, Levina E. Covariance Regularization by Thresholding. Annals of Statistics. 2008;36:2577–2604. [Google Scholar]
Chun H, Zhang X, Zhao H. Gene regulation network inference with joint sparse Gaussian graphical models. 2012. Unpublished manuscript. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dimas AS, Deutsch S, Stranger BE, Montgomery SB, Borel C, Attar-Cohen H, Ingle C, Beazley C, Gutierrez Arcelus M, Sekowska M, Gagnebin M, Nisbett J, Deloukas P, Dermitzakis ET, Antonarakis SE. Common regulatory variation impacts gene expression in a cell type-dependent manner. Science. 2009;325:1246–1250. doi: 10.1126/science.1174148. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dawid AP. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B. 1979;41:1–31. [Google Scholar]
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of American Statistical Association. 2001;96:1348–1360. [Google Scholar]
Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fukumizu K, Bach FR, Jordan MI. Kernel dimension reduction in regression. Annals of Statistics. 2009;4:1871–1905. [Google Scholar]
Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2009;98:1–15. doi: 10.1093/biomet/asq060. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kelley JL. General Topology. Springer; New York: 1955. [Google Scholar]
Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics. 2009;37:4254–4278. doi: 10.1214/09-AOS720. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lancaster. Some properties of the bivariate normal distribution considered in the form of a contingency table. Biometrika. 1957;44:289–292. [Google Scholar]
Lauritzen SL. Graphical Models. Oxford: Clarendon Press; 1996. [Google Scholar]
Lauritzen SL, Dawid AP, Larsen BN, Leimer HG. Independence properties of directed Markov fields. Networks. 1990;20:491–505. [Google Scholar]
Lee K-Y, Li B, Chiaromonte F. A general theory for nonlinear sufficient dimension reduction: formulation and estimation. Annals of Statistics. 2013 In print. [Google Scholar]
Li B, Artemiou A, Li L. Principal support vector machines for linear and nonlinear sufficient dimension reduction. Annals of Statistics. 2011;36:3182–3210. [Google Scholar]
Li B, Chun H, Zhao H. Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association. 2012;107:152–167. doi: 10.1080/01621459.2011.644498. [DOI] [PMC free article] [PubMed] [Google Scholar]
Liu H, Han F, Zhang C. Advances in Neural Information Processing System. Vol. 25. Curran Associates, Inc; 2013. Transelliptical Graphical Models; pp. 800–808. [Google Scholar]
Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High dimensional semiparametric gaussian copula graphical models. Annals of Statistics. 2013 In print. [Google Scholar]
Liu H, Lafferty J, Wasserman L. The nonparanormal: semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research. 2009;10:2295–2328. [PMC free article] [PubMed] [Google Scholar]
Meinshausen N, Bühlmann P. High-dimensional graphs with the lasso. Annals of Statistics. 2006;34:1436–1462. [Google Scholar]
Pearl J, Geiger D, Verma T. Conditional independence and its representations. Kybernetika. 1989;25:33–44. [Google Scholar]
Pearl J, Paz A. Graphoids: a graph-based logic for reasoning about relevancy relations. Proceedings of the European Conference on Artificial Intelligence; Brighton, UK. 1986. [Google Scholar]
Pearl J, Verma T. The logic of representing dependencies by directed acyclic graphs. Proceedings of the American Association of Artificial Intelligence Conference; Seattle, Washington. 1987. [Google Scholar]
Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. Journal of American Statistical Association. 2009;104:735–746. doi: 10.1198/jasa.2009.0126. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288. [Google Scholar]
Wahba G. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics.1990. [Google Scholar]
Xue L, Zou H. Regularized Rank-based Estimation of High-dimensional Nonparanormal Graphical Models. Annals of Statistics. 2013 In print. [Google Scholar]
Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35. [Google Scholar]
Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material

NIHMS561317-supplement-Supplementary_Material.zip^{(10.1KB, zip)}

[R1] Bellman RE. Dynamic Programming. Princeton University Press; 1957. [Google Scholar]

[R2] Bach FR. Consistency of the Group Lasso and Multiple Kernel Learning. Journal of Machine Learning Research. 2008;9:1179–1225. [Google Scholar]

[R3] Bickel PJ, Levina E. Covariance Regularization by Thresholding. Annals of Statistics. 2008;36:2577–2604. [Google Scholar]

[R4] Chun H, Zhang X, Zhao H. Gene regulation network inference with joint sparse Gaussian graphical models. 2012. Unpublished manuscript. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Dimas AS, Deutsch S, Stranger BE, Montgomery SB, Borel C, Attar-Cohen H, Ingle C, Beazley C, Gutierrez Arcelus M, Sekowska M, Gagnebin M, Nisbett J, Deloukas P, Dermitzakis ET, Antonarakis SE. Common regulatory variation impacts gene expression in a cell type-dependent manner. Science. 2009;325:1246–1250. doi: 10.1126/science.1174148. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] Dawid AP. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B. 1979;41:1–31. [Google Scholar]

[R7] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of American Statistical Association. 2001;96:1348–1360. [Google Scholar]

[R8] Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Fukumizu K, Bach FR, Jordan MI. Kernel dimension reduction in regression. Annals of Statistics. 2009;4:1871–1905. [Google Scholar]

[R10] Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2009;98:1–15. doi: 10.1093/biomet/asq060. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Kelley JL. General Topology. Springer; New York: 1955. [Google Scholar]

[R12] Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics. 2009;37:4254–4278. doi: 10.1214/09-AOS720. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] Lancaster. Some properties of the bivariate normal distribution considered in the form of a contingency table. Biometrika. 1957;44:289–292. [Google Scholar]

[R14] Lauritzen SL. Graphical Models. Oxford: Clarendon Press; 1996. [Google Scholar]

[R15] Lauritzen SL, Dawid AP, Larsen BN, Leimer HG. Independence properties of directed Markov fields. Networks. 1990;20:491–505. [Google Scholar]

[R16] Lee K-Y, Li B, Chiaromonte F. A general theory for nonlinear sufficient dimension reduction: formulation and estimation. Annals of Statistics. 2013 In print. [Google Scholar]

[R17] Li B, Artemiou A, Li L. Principal support vector machines for linear and nonlinear sufficient dimension reduction. Annals of Statistics. 2011;36:3182–3210. [Google Scholar]

[R18] Li B, Chun H, Zhao H. Sparse estimation of conditional graphical models with application to gene networks. Journal of the American Statistical Association. 2012;107:152–167. doi: 10.1080/01621459.2011.644498. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Liu H, Han F, Zhang C. Advances in Neural Information Processing System. Vol. 25. Curran Associates, Inc; 2013. Transelliptical Graphical Models; pp. 800–808. [Google Scholar]

[R20] Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High dimensional semiparametric gaussian copula graphical models. Annals of Statistics. 2013 In print. [Google Scholar]

[R21] Liu H, Lafferty J, Wasserman L. The nonparanormal: semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research. 2009;10:2295–2328. [PMC free article] [PubMed] [Google Scholar]

[R22] Meinshausen N, Bühlmann P. High-dimensional graphs with the lasso. Annals of Statistics. 2006;34:1436–1462. [Google Scholar]

[R23] Pearl J, Geiger D, Verma T. Conditional independence and its representations. Kybernetika. 1989;25:33–44. [Google Scholar]

[R24] Pearl J, Paz A. Graphoids: a graph-based logic for reasoning about relevancy relations. Proceedings of the European Conference on Artificial Intelligence; Brighton, UK. 1986. [Google Scholar]

[R25] Pearl J, Verma T. The logic of representing dependencies by directed acyclic graphs. Proceedings of the American Association of Artificial Intelligence Conference; Seattle, Washington. 1987. [Google Scholar]

[R26] Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. Journal of American Statistical Association. 2009;104:735–746. doi: 10.1198/jasa.2009.0126. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288. [Google Scholar]

[R28] Wahba G. Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics.1990. [Google Scholar]

[R29] Xue L, Zou H. Regularized Rank-based Estimation of High-dimensional Nonparanormal Graphical Models. Annals of Statistics. 2013 In print. [Google Scholar]

[R30] Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35. [Google Scholar]

[R31] Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429. [Google Scholar]

PERMALINK

On an Additive Semigraphoid Model for Statistical Networks With Application to Pathway Analysis

Bing Li

Hyonho Chun

Hongyu Zhao

Abstract

1 Introduction

Figure 1.

2 Semi-graphoid axioms and additive conditional independence

Definition 1 (Semi-graphoid Axioms)

Definition 2

Theorem 1

Theorem 2

Definition 3

3 Relation with copula graphical models

Theorem 3

Theorem 4

Proposition 1

Table 1.

Corollary 1

4 Additive conditional covariance operator

Lemma 1

Theorem 5

5 Additive precision operator

Proposition 2

Definition 4

Proposition 3

Lemma 2

Theorem 6

Corollary 2

6 The ACCO-based estimator

Lemma 3

Proposition 4

7 The APO-based estimator

Theorem 7

8 Tuning parameters for regularization

9 Simulation studies

Comparison 1

Figure 2.

Comparison 2

Figure 3.

Comparison 3

Figure 4.

Comparison 4

Figure 5.

Comparison 5

Figure 6.

Comparison 6

Figure 7.

10 Pathway analysis

Figure 8.

Table 2.

Table 3.

11 Conclusions

Supplementary Material

Acknowledgments

Appendix: proofs

Proof of Theorem 1

Proof of Theorem 2

Proof of Theorem 3

Proof of Proposition 1

Proof of Lemma 1

Proof of Theorem 5

Proof of Proposition 2

Proof of Lemma 2

Proof of Theorem 6

Proof of Lemma 3

Proof of Proposition 4

Proof of Theorem 7

Operator norm of CCO

Contributor Information

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases