Abstract
In many applications the graph structure in a network arises from two sources: intrinsic connections and connections due to external effects. We introduce a sparse estimation procedure for graphical models that is capable of isolating the intrinsic connections by removing the external effects. Technically, this is formulated as a conditional graphical model, in which the external effects are modeled as predictors, and the graph is determined by the conditional precision matrix. We introduce two sparse estimators of this matrix based on the reproducing kernel Hilbert space combined with the lasso and the adaptive lasso. We establish the sparsity, variable selection consistency, oracle property, and the asymptotic distributions of the proposed estimators. We also develop their convergence rate when the dimension of the conditional precision matrix goes to infinity. The methods are compared with sparse estimators for unconditional graphical models, and with the constrained maximum likelihood estimate that assumes a known graph structure. The methods are applied to a genetic data set to construct a gene network conditioning on single-nucleotide polymorphisms.
Keywords: Conditional random field, Gaussian graphical models, Lasso and adaptive lasso, Oracle property, Reproducing kernel Hilbert space, Sparsity, Sparsistency, von Mises expansion
1. INTRODUCTION
Sparse estimation of Gaussian graphical models has undergone intense development in recent years, partly due to their wide applications in such fields as information retrieval and genomics, and partly due to the increasing maturity of statistical theories and techniques surrounding sparse estimation. See, for example, Meinshausen and Bühlmann (2006), Yuan and Lin (2007), Bickel and Levina (2008), Peng et al. (2009), Guo et al. (2009), and Lam and Fan (2009). The precursor of this line of work is the Gaussian graphical model in which the graph structure is assumed to be known; see Dempster (1972) and Lauritzen (1996).
Let Y = (Y1, …, Yp)⊤ be a random vector, Γ = {1, …, p}, and E ⊆ Γ × Γ. Let 𝒢 = (Γ, E) be the graph with its vertices in Γ and edges in E. We say that Y follows a Gaussian graphical model (GGM) with respect to 𝒢 (Lauritzen 1996) if Y has a multivariate normal distribution and
Yi ⫫ Yj | Y−(i,j) whenever (i, j) ∉ E, (1)
where A ⫫ B|C means A and B are independent given C, and Y−(i,j) denotes the set {Y1, …, Yp}\{Yi, Yj}. Let ωij be the (i, j)th entry of [var(Y)]−1. Then, Yi ⫫ Yj|Y−(i,j) if and only if ωij = 0. Thus, estimating E is equivalent to estimating the set {(i, j) ∈ Γ × Γ: ωij = 0}. Conventional Gaussian graphical models focus on maximum likelihood estimation of ωij given the knowledge of E. In sparse estimation of graphical models, however, E is itself estimated by sparse regularization, such as lasso and adaptive lasso (Tibshirani 1996; Zou 2006).
The conditional graphical model, with which we are concerned, is motivated by the analysis of gene networks and the regulating effects of DNA markers. Let X1, …, Xq represent the genetic markers at q locations in a genome, and let Y1, …, Yp represent the expression levels of p genes. The objective is to infer how the q genetic markers affect the expression levels of the p genes and how these p genes affect each other. Since some markers may have regulating effects on more than one gene, the connections among the genes are of two kinds: the connections due to shared regulation by the same marker, and the innate connections among the genes aside from their shared regulators. In this setting, we are interested in identifying the network of genes after removing the effects from shared regulations by the markers.
The situation is illustrated by Figure 1, in which X represents a single marker, and Y1, Y2, Y3 represent the expressions of three genes. If we consider the marginal distribution of the random vector (Y1, Y2, Y3), then there are two (undirected) edges in the unconditional graphical model: 1 ↔ 2 and 2 ↔ 3, as represented by the solid and the dotted line segments. However, if we condition on the marker and consider the conditional distribution of Y1, Y2, Y3|X, then there is only one (undirected) edge, 2 ↔ 3, in the conditional graphical model, as represented by the solid line segment.
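To make the distinction concrete, the following small R simulation reproduces the structure of Figure 1 (the coefficient values and sample size are illustrative choices of ours, not taken from the article): Y1 and Y2 are both regulated by X, while only Y2 and Y3 are intrinsically connected.

```r
set.seed(1)
n  <- 1e5
X  <- rnorm(n)
Y1 <- 2 * X + rnorm(n)        # regulated by X only
Y2 <- 2 * X + rnorm(n)        # regulated by X
Y3 <- Y2 + rnorm(n)           # intrinsically connected to Y2, not to X
Y  <- cbind(Y1, Y2, Y3)

## Marginal (unconditional) precision matrix: the (1,2) and (2,3) entries are
## clearly nonzero, giving the two edges 1-2 and 2-3 of the unconditional graph
round(solve(cov(Y)), 2)

## Conditional precision matrix: regress out X first; the (1,2) entry is now
## essentially zero, leaving only the intrinsic edge 2-3
round(solve(cov(residuals(lm(Y ~ X)))), 2)
```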
In mathematical terms, a conditional graphical model can be represented by
Yi ⫫ Yj | (X, Y−(i,j)) whenever (i, j) ∉ E, (2)
where X = (X1, …, Xq)⊤. Compared with the unconditional graphical model, here we have an additional random vector X, whose effects we would like to remove when constructing the network for Y1, …, Yp. The conditional graphical model (2) was introduced by Lafferty, McCallum, and Pereira (2001) under the name “conditional random field.” However, in that article, the graph 𝒢 = (Γ, E) was assumed known and maximum likelihood was used to estimate the relevant parameters.
Our strategy for estimating the conditional graphical model (2) is as follows. In the first step, we propose a class of flexible and easy-to-implement initial (nonsparse) estimators for the conditional variance matrix Σ = var(Y | X), based on a conditional variance operator between the reproducing kernel Hilbert spaces (RKHS) of X and Y. In the second step, we incorporate the nonsparse estimators with two types of sparse penalties, the lasso and the adaptive lasso, to obtain sparse estimators of the conditional precision matrix Σ−1.
We have chosen RKHS as our method to estimate Σ for several reasons. First, it does not require a rigid regression model for Y versus X. Second, the dimension of X only appears through a kernel function, so that the dimension of the largest matrix we need to invert is the sample size n, regardless of the dimension of X. This feature is particularly attractive when we deal with a large number of predictors. Finally, RKHS provides a natural mechanism to impose regularization on regression. Here, it is important to realize that our problem involves two kinds of regularization: one for the components of var(Y | X) and the other for the regression of Y on X. Considering the nature of our problem, the former must be sparse, but the latter need not be. Indeed, since estimating the regression parameter is not our purpose, it seems more natural to introduce the regularization for regression through RKHS than to try to parameterize the regression and then regularize the parameters.
The rest of the article is organized as follows. In Section 2, we introduce a conditional variance operator in RKHS and describe its relation with the conditional Gaussian graphical model. In Section 3, we derive two RKHS-based estimators of the conditional variance Σ. In Section 4, we subject the RKHS estimators to sparse penalties to estimate the conditional precision matrix Θ = Σ−1. In Sections 5, 6, and 7, we establish the asymptotic properties of the sparse estimators for a fixed dimension p. In Section 8, we derive the convergence rate of the sparse estimators when p goes to infinity with the sample size. In Section 9, we discuss some issues involved in implementation. In Sections 10 and 11, we investigate the performance of the proposed methods through simulations and a data analysis.
2. CONDITIONAL VARIANCE OPERATOR IN RKHS
We begin with a formal definition of the conditional Gaussian graphical model. Throughout this article, we use 𝔼 to denote expectation, to avoid confusion with the edge set E. We use ℙ to denote probability measures.
Definition 1
We say that (X, Y) follows a conditional Gaussian graphical model (CGGM) with respect to a graph 𝒢 = (Γ, E) if
relation (2) holds;
Y | X ~ N(𝔼(Y | X), Σ) for some nonrandom, positive-definite matrix Σ.
Note that we do not assume a regression model for Y versus X. However, we do require that Y | X is multivariate normal with a constant conditional variance, which is satisfied if Y = f (X) + ε, ε ⫫ X, and ε ~ N(0, Σ) for an arbitrary f.
Let ΩX ⊆ ℝq and ΩY ⊆ ℝp be the supports of X and Y. Let κX: ΩX × ΩX → ℝ, κY: ΩY × ΩY → ℝ be positive-definite kernels and ℋX and ℋY be their corresponding RKHS’s. For further information about RKHS and the choices of kernels, see Aronszajn (1950) and Vapnik (1998). For two Hilbert spaces ℋ1 and ℋ2, and a bounded bilinear form b: ℋ1 × ℋ2 → ℝ, there uniquely exist bounded linear operators A: ℋ1 → ℋ2 and B: ℋ2 → ℋ1 such that 〈f, Bg〉 = 〈Af, g〉 = b(f, g) for any f ∈ ℋ1 and g ∈ ℋ2. Applying this fact to the bounded bilinear forms
(f, g) ↦ cov[f(X), g(X)], (f, g) ↦ cov[f(Y), g(Y)], (f, g) ↦ cov[f(X), g(Y)],
we obtain three bounded linear operators
ΣXX: ℋX → ℋX, ΣYY: ℋY → ℋY, ΣXY: ℋY → ℋX. (3)
Furthermore, ΣXY can be factorized as ΣXY = ΣXX^{1/2} RXY ΣYY^{1/2}, where RXY: ℋY → ℋX is a uniquely defined bounded linear operator. The conditional variance operator of Y, given X, is then defined as the bounded operator from ℋY to ℋY:
ΣYY|X = ΣYY − ΣYY^{1/2} RXY* RXY ΣYY^{1/2},
where RXY* denotes the adjoint of RXY.
This construction is due to Fukumizu, Bach, and Jordan (2009).
Let L2(ℙX) denote the class of functions of X that are square integrable with respect to ℙX, let ℋX + ℝ denote the set of functions {h + c: h ∈ ℋX, c ∈ ℝ}, and let cl(·) denote the closure of a set in L2(ℙX). The next theorem describes how the conditional operator ΣYY|X uniquely determines the CGGM, and suggests a way to estimate Σ.
Theorem 1
Suppose
ℋX ⊆ L2(ℙX) and, for each i = 1, …, p, 𝔼(Yi | X) ∈ cl(ℋX + ℝ);
κY is the linear kernel: κY (a, b) = 1 + a⊤b, where a, b ∈ ℝp.
Then,
〈fα, ΣYY|X fβ〉 = α⊤Σβ for all α, β ∈ ℝp, where fα ∈ ℋY denotes the function y ↦ α⊤y. (4)
The assumption ℋX ⊆ L2(ℙX) is satisfied if the function κX(x, x) belongs to L2(ℙX), which is a mild requirement.
Proof
By assumption 2, any member of ℋY can be written as α⊤y for some α ∈ ℝp. Hence, by Proposition 2 of Fukumizu et al. (2009),
By assumption 1, for any ε > 0, there is an f ∈ ℋX + ℝ such that var{𝔼(α⊤Y | X) − f(X)} < ε. So, the second term on the right-hand side above is 0, and we have
(5) |
In the meantime, we note that
Applying Equation (5) to the left-hand sides of the above equations, we obtain
as desired.
Condition 1 in the theorem can be replaced by the stronger assumption that ℋX + ℝ is a dense subset of L2(ℙX), which is satisfied by some well-known kernels, such as the Gaussian radial kernel. See Fukumizu et al. (2009).
3. TWO RKHS ESTIMATORS OF CONDITIONAL COVARIANCE
To construct a sample estimate of Equation (4), we need to represent operators as matrices in a finite-dimensional space. To this end, we first introduce a coordinate notation system, adopted from Horn and Johnson (1985, p. 31) with slight modifications. Let ℋ be a finite-dimensional Hilbert space and ℬ = {b1, …, bm} ⊆ ℋ be a set of functions that span ℋ but may not be linearly independent. We refer to ℬ as a spanning system (as opposed to a basis). Any f ∈ ℋ can be written as α1b1 + ··· + αmbm for some α = (α1, …, αm)⊤ ∈ ℝm. This vector is denoted by [f], and is called a ℬ-coordinate of f. Note that [f] is not unique unless b1, …, bm are linearly independent. However, this does not matter, because the quantities we compute below do not depend on which ℬ-coordinate of f is used. The same reasoning also applies to the nonuniqueness of coordinates below.
Let ℋ1 and ℋ2 be finite-dimensional Hilbert spaces with spanning systems ℬ1 = {b11, …, b1m1} and ℬ2 = {b21, …, b2m2}. Let T: ℋ1 → ℋ2 be a linear operator. The (ℬ1, ℬ2)-representation of T, denoted by [T], is the m2 × m1 matrix
Then, [T][f] is a ℬ2-coordinate of Tf. We write this relation as
(6) |
Let ℋ3 be another finite-dimensional Hilbert space with a spanning system ℬ3. Let T1: ℋ1 → ℋ2 and T2: ℋ2 → ℋ3 be linear operators. Then,
(7) |
The equality means the right-hand side is a (ℬ1, ℬ3)-representation of T2T1. Finally, if ℋ is a finite-dimensional Hilbert space with a spanning system ℬ and T: ℋ → ℋ is a self-adjoint and positive-semidefinite linear operator, then, for any c > 0,
(8) |
Let (X1, Y1), …, (Xn, Yn) be independent copies of (X, Y), let ℙn be the empirical measure based on this sample, and let 𝔼n be the expectation with respect to ℙn. Let
ℬX = {κX(·, Xi) − 𝔼n[κX(·, X)]: i = 1, …, n}, ℬY = {κY(·, Yi) − 𝔼n[κY(·, Y)]: i = 1, …, n}, (9)
where, for example, κX(·, X) stands for the function x ↦ κX(x, X). We center the functions κX(·, Xi) because constants do not play a role in our development. Let Σ̂XX, Σ̂YY, and Σ̂XY be as defined in the last section, but with ℋX and ℋY replaced by span(ℬX) and span(ℬY), and cov replaced by the sample covariance. Let ℬX and ℬY denote the spanning systems in Equation (9). Let KX and KY represent the n × n kernel matrices {κX(Xi, Xj)} and {κY(Yi, Yj)}. For a symmetric matrix A, let A† represent its Moore–Penrose inverse. Let Qn = In − n−1Jn, where In is the n × n identity matrix and Jn is the n × n matrix whose entries are all 1. To simplify notation, we abbreviate Qn by Q throughout the rest of the article. The following lemma crystallizes some known results, which can be proved by coordinate manipulation via formulas (6), (7), and (8).
Lemma 1
The following relations hold:
[Σ̂ XX] = n−1QKXQ, [Σ̂YY]= n−1QKYQ;
[Σ̂YX]= n−1QKXQ, [Σ̂XY]= n−1QKY Q;
[Σ̂YY | X] = n−1[QKY Q − (QKXQ)(QKXQ)† (QKY Q)].
When the dimension q of X is large relative to n, it is beneficial to use a regularized version of (QKXQ)†. Here, we employ two types of regularization: the principal-component (PC) regularization and the ridge-regression (RR) regularization. Let A be a positive-semidefinite matrix with eigenvalues λ1, …, λn and eigenvectors v1, …, vn. Let ε ≥ 0. We call the matrix A†(ε) = Σ{i: λi > ε} λi−1 vivi⊤ the PC-inverse of A. Note that A†(0) = A†. We call the matrix A‡(ε) = (A + εIn)−1 the RR-inverse of A. The following result can be verified by simple calculation.
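As a concrete illustration, the R sketch below computes the centered Gram matrices, the two regularized inverses (coded according to our reading of the PC and RR definitions above), and the coordinate representation of Σ̂YY|X from Lemma 1, part 3; the function names and the simulated data are ours.

```r
## Sample of size n: DX (n x q predictors), DY (n x p responses), centering matrix Q
n  <- 100; q <- 5; p <- 10
DX <- matrix(rnorm(n * q), n, q)
DY <- matrix(rnorm(n * p), n, p)
Q  <- diag(n) - matrix(1 / n, n, n)

KX <- (1 + tcrossprod(DX))^2        # polynomial kernel (1 + a'b)^r with r = 2
KY <- 1 + tcrossprod(DY)            # linear kernel kappa_Y(a, b) = 1 + a'b
QKXQ <- Q %*% KX %*% Q
QKYQ <- Q %*% KY %*% Q

## PC-inverse: invert only the eigenvalues exceeding eps
pc_inverse <- function(A, eps = 0) {
  e <- eigen(A, symmetric = TRUE)
  keep <- e$values > eps
  V <- e$vectors[, keep, drop = FALSE]
  V %*% diag(1 / e$values[keep], sum(keep)) %*% t(V)
}
## RR-inverse: ridge-type regularization
rr_inverse <- function(A, eps = 0) solve(A + eps * diag(nrow(A)))

## Coordinate representation of the conditional variance operator (Lemma 1, part 3,
## with the Moore-Penrose inverse replaced by a regularized inverse)
Sigma_YYgX_rep <- (QKYQ - QKXQ %*% pc_inverse(QKXQ, eps = 1e-6) %*% QKYQ) / n
```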
Lemma 2
For any f1, f2 ∈ span(ℬY), we have 〈f1, f2〉 = ([f1])⊤QKYQ[f2].
We now derive the sample estimate of the conditional covariance matrix Σ= var(Y | X). Let DY denote the matrix (Y1, …, Yn)⊤.
Theorem 2
Let Σ̂YY|X: span(ℬY) → span(ℬY) be defined by the coordinate representation
[Σ̂YY|X] = n−1[QKYQ − (QKXQ)(QKXQ)*(εn)(QKYQ)], (10)
where * can be either † or ‡, KY is the linear kernel matrix, and {εn} is a sequence of nonnegative numbers. Then
{〈fi, Σ̂YY|X fj〉: i, j = 1, …, p} = n−1DY⊤Q[In − (QKXQ)(QKXQ)*(εn)]QDY, (11)
where fi ∈ span(ℬY) denotes the ith coordinate function y ↦ yi.
Despite its appearance, the matrix is actually a symmetric matrix. One can also show that Equation (11) is a positive-semidefinite matrix. We denote the matrix (11) as Σ̂PC(εn) if * = †, and as Σ̂RR(εn) if * = ‡. In the following, ei represents a p-dimensional vector whose ith entry is 1 and other entries are 0. For an expression that represents a vector, such as [f], let ([f])i denote its ith entry.
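Continuing the R sketch above, the p × p estimates can be formed directly from the data matrix DY once κY is taken to be the linear kernel (so that QKYQ = QDYDY⊤Q); the explicit matrix expression used here follows our reconstruction of Equation (11) from Lemmas 1 and 2, so it is a sketch rather than a verbatim transcription.

```r
## p x p RKHS estimates of Sigma = var(Y | X) with the linear kernel for Y
sigma_pc <- crossprod(Q %*% DY,
                      (diag(n) - QKXQ %*% pc_inverse(QKXQ, eps = 1e-6)) %*% Q %*% DY) / n
sigma_rr <- crossprod(Q %*% DY,
                      (diag(n) - QKXQ %*% rr_inverse(QKXQ, eps = 1e-2)) %*% Q %*% DY) / n

## The matrices are symmetric in exact arithmetic; symmetrize to remove rounding error
sigma_pc <- (sigma_pc + t(sigma_pc)) / 2
sigma_rr <- (sigma_rr + t(sigma_rr)) / 2
```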
Proof of Theorem 2
Note that, for an f ∈ span(ℬY), [f] is any vector a ∈ ℝn such that f(y) = a⊤QDY y. Because fi ∈ span(ℬY) with DY⊤Q[fi] = ei, we have
Then, by Lemma 2,
When κY is the linear kernel, KY = Jn + DYDY⊤, so that QKYQ = QDYDY⊤Q. Hence, the right-hand side reduces to
Now substitute Equation (10) into the above expression to complete the proof.
4. INTRODUCING SPARSE PENALTY
Let Σ̂ denote either of the RKHS estimators given by Equation (11). We are interested in the sparse estimation of Θ = Σ−1. Consider the objective function
tr(Σ̂Θ) − log|Θ|. (12)
This objective function has the same form as that used in unconditional Gaussian graphical models, such as those considered by Lauritzen (1996), Yuan and Lin (2007), and Friedman, Hastie, and Tibshirani (2008), except that the sample covariance matrix therein is replaced by the RKHS estimate of conditional covariance matrix.
To achieve sparsity in Θ, we introduce two types of penalized versions of the objective function (12):
ϒn(Θ) = tr(Σ̂Θ) − log|Θ| + λn Σi≠j |θij|, (13)
Λn(Θ) = tr(Σ̂Θ) − log|Θ| + λn Σi≠j |θ̃ij|−γ|θij|, (14)
where, in Equation (14), γ is a positive number and {θ̃ij} is an n1/2-consistent, nonsparse estimate of Θ. For more details about the development of these two types of penalty functions, see Tibshirani (1996), Zou (2006), Zhao and Yu (2006), and Zou and Li (2008). Yuan and Lin (2007) used lasso and nonnegative garrote (Breiman 1995) penalty functions for the unconditional graphical model, the latter of which is similar to the adaptive lasso with γ = 1. In the following, we will write
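The two penalized objectives can be evaluated directly in R; the sketch below assumes the standard Yuan–Lin form reconstructed in Equations (12)–(14) above (log-determinant loss plus an off-diagonal ℓ1 penalty), with symmetric inputs.

```r
## Lasso objective (13): Sigma_hat and Theta are symmetric p x p matrices
obj_lasso <- function(Theta, Sigma_hat, lambda) {
  offdiag <- abs(Theta); diag(offdiag) <- 0
  sum(Sigma_hat * Theta) -                                  # = tr(Sigma_hat %*% Theta) by symmetry
    as.numeric(determinant(Theta, logarithm = TRUE)$modulus) +
    lambda * sum(offdiag)
}

## Adaptive lasso objective (14): weights from a nonsparse pilot estimate Theta_tilde
obj_alasso <- function(Theta, Sigma_hat, lambda, Theta_tilde, gamma = 1) {
  w <- abs(Theta_tilde)^(-gamma); diag(w) <- 0
  sum(Sigma_hat * Theta) -
    as.numeric(determinant(Theta, logarithm = TRUE)$modulus) +
    lambda * sum(w * abs(Theta))
}
```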
5. VON MISES EXPANSIONS OF THE RKHS ESTIMATORS
The sparse and oracle properties of the estimators introduced in Section 4 depend heavily on the asymptotic properties of the RKHS estimators Σ̂PC(εn) and Σ̂RR(εn). In this section, we derive their von Mises expansions (von Mises 1947). For simplicity, we base our asymptotic development on the polynomial kernel. That is, we let
κX(a, b) = (1 + a⊤b)r, where r is a positive integer. (15)
For a matrix A and an integer k ≥ 2, let A⊗k be the k-fold Kronecker product A ⊗ ··· ⊗ A. For k = 1, we adopt the convention A⊗1 = A. For a k-dimensional array B = {bi1…ik : i1, …, ik = 1, …, q}, we define the “vec half” operator as
This is a generalization of the vech operator for matrices introduced by Henderson and Searle (1979). It can be shown that there exists a matrix Gk,q, of full column rank, such that vec(B) = Gk,q vech(B). The specific form of Gk,q is not important to us. For k = 1, we adopt the convention vech(B) = B, G1,q = Iq. Let
U = (vech(X⊗1)⊤, vech(X⊗2)⊤, …, vech(X⊗r)⊤)⊤.
The estimators Σ̂PC(εn) and Σ̂RR(εn) involve the n × n matrices QKXQ and QKYQ, which are difficult to handle asymptotically because their dimensions grow with n. We now give an asymptotically equivalent expression that involves only matrices of fixed dimensions. Let
Σ̃ = varn(Y) − covn(Y, U)[varn(U)]†covn(U, Y), (16)
where varn(·) and covn(·, ·) denote the sample variance and covariance matrices.
Lemma 3
Suppose κX is the polynomial kernel (15) and var(U) is positive definite.
If εn = o(n), then, with probability tending to 1, Σ̂PC(εn) = Σ̃;
If εn = o(n1/2), then Σ̂RR(εn) = Σ̃ + oP(n−1/2).
It is interesting to note that the different choices of inversion require different convergence rates for εn.
Proof
By simple computation, we find
Hence, where . So,
(17) |
Let s be the dimension of U. Then, rank(QKXQ) = s. Let λ1 ≥ ··· ≥ λs > 0 be the nonzero eigenvalues and v1, …, vs be the corresponding eigenvectors of this matrix. Since λ1/n, …, λs/n converge in probability to the eigenvalues of a positive-definite matrix, we have λs = nc + oP(n) for some c > 0. This, together with εn = o(n), implies λs > εn with probability tending to 1. Consequently, with probability tending to 1,
(18) |
Now it is easy to verify that if B ∈ ℝs×t is a matrix of full column rank and A ∈ ℝt×t is a positive-definite matrix, then
(19) |
Thus, the right-hand side of Equation (18) is Σ̃, which proves part 1. To prove part 2, we first note that
(20) |
Since the function −εn/(εn + λ) is increasing for λ> 0, we have
Hence,
(21) |
By Equation (19),
Hence, the right-hand side of Equation (21) is of the order . In other words,
Here, we invoke Equation (19) again to complete the proof.
Note that when r = 1, p < n, and εn = 0, Σ̂PC(εn) reduces to
varn(Y) − covn(Y, X)[varn(X)]−1covn(X, Y),
where varn(·) and covn(·, ·) denote the sample variance and covariance matrices. This is exactly the sample estimate of the residual variance for linear regression.
Lemma 3 allows us to derive the asymptotic expansions of Σ̂PC(εn) and Σ̂RR(εn) from that of Σ̃, which is a (matrix-valued) function of sample moments. Let 𝒫 be a convex family of probability measures defined on ΩXY that contains all the empirical distributions ℙn and the true distribution ℙ0 of (X, Y). Let T: 𝒫 → ℝp×p be the following statistical functional:
T(ℙ) = varℙ(Y) − covℙ(Y, U)[varℙ(U)]†covℙ(U, Y), (22)
where varℙ and covℙ denote the variance and covariance computed under ℙ.
In this notation, Σ̃ = T (ℙn). In the following, we use var and cov to denote the variance and covariance under ℙ0. For the polynomial kernel (15), the evaluation T (ℙ0) has a special meaning, as described in the next lemma. Its proof is standard, and is omitted.
Lemma 4
Suppose:
Entries of 𝔼(Y | X) are polynomials in X1, …, Xq of degrees no more than r;
Y | X ~ N(𝔼(Y | X), Σ), where Σ is a nonrandom matrix.
Then, T(ℙ0) = Σ.
Let ℙα = (1 − α)ℙ0 + αℙn, where α ∈ [0, 1]. Let Dα denote the differential operator ∂/∂α, and let Dα=0 denote the operation of taking the derivative with respect to α and then evaluating it at α = 0. It is well known that if T is Hadamard differentiable with respect to the norm ||·||∞ in 𝒫, then
T(ℙn) − T(ℙ0) = Dα=0T(ℙα) + oP(n−1/2). (23)
Since the functional T in our case is a smooth function of sample moments, it is Hadamard differentiable under very general conditions. See, for example, Reeds (1976), Fernholz (1983), Bickel et al. (1993), and Ren and Sen (1991). The next lemma can be verified by straightforward computation.
Lemma 5
Let V1 and V2 be square-integrable random vectors. Then,
Dα=0 covℙα(V1, V2) = (𝔼n − 𝔼)q(V1, V2),
where q(V1, V2) = (V1 − 𝔼V1)(V2 − 𝔼V2)⊤.
We will adopt the following notational system:
The next theorem gives the first-order von-Mises expansion of Σ̃.
Theorem 3
Suppose the functional in Equation (22) is Hadamard differentiable, and the conditions in Lemma 4 hold. Then,
Σ̃ = Σ + 𝔼n(M) + oP(n−1/2), (24)
where M = M0 − 𝔼(M0) and
Proof
We have
We now apply Lemma 5 to obtain
By the chain rule for differentiation and Lemma 5,
Hence,
This can be rewritten as 𝔼n(M0) − 𝔼(M0) = 𝔼n(M), where M0 is the matrix in the theorem.
Using this expansion, we can write down the asymptotic distribution of Σ̃.
Corollary 1
Suppose the functional in Equation (22) is Hadamard differentiable. Then,
(25) |
where
By Lemma 3, Theorem 3, and Slutsky’s theorem, we arrive at the following von Mises expansions for the two RKHS estimators Σ̂PC(εn) and Σ̂RR(εn) of the conditional variance Σ.
Corollary 2
Suppose the functional in Equation (22) is Hadamard differentiable, and the conditions in Lemma 4 hold.
Although in this article we have only studied the asymptotic distribution for the polynomial kernel, the basic formulation and analysis could potentially be extended to other kernels under some regularity conditions. We leave this to future research.
6. SPARSITY AND ASYMPTOTIC DISTRIBUTION: THE LASSO
In this section, we study the asymptotic properties of the sparse estimator based on the objective function ϒn(Θ) in Equation (13). Let 𝒮 denote the class of all p × p symmetric matrices. For a matrix A ∈ 𝒮, let σi(A) be the ith eigenvalue of A. Note that, for any integer r, we have
(26) |
Let En be the sample estimate of E; that is, En = {(i, j): θ̂ij ≠ 0}, where Θ̂ = {θ̂ij} is the minimizer of Equation (13). Following Lauritzen (1996), for a matrix A = {aij} and a graph 𝒢 = (Γ, E), let A(𝒢) denote the matrix obtained from A by setting aij to 0 whenever (i, j) ∉ E. Let
𝒮(𝒢) = {A ∈ 𝒮: A = A(𝒢)}.
We define a bivariate sign function as follows. For any numbers a, b, let
sign(a, b) = sign(a) if a ≠ 0, and sign(a, b) = sign(b) if a = 0.
The following lemma will prove useful. Its proof is omitted.
Lemma 6
The bivariate sign function sign(a, b) has the following properties:
For any c1 > 0, c2 > 0, sign(c1a, c2b) = sign(a, b);
For sufficiently small |b|, |a + b| − |a| = sign(a, b)b.
The next theorem generalizes Theorem 1 of Yuan and Lin (2007). It applies to any random matrix Σ̂ with expansion (24), including Σ̂PC(εn) and Σ̂RR(εn).
Theorem 4
Suppose (X, Y) follows a CGGM with respect to a graph 𝒢 = (Γ, E). Let Θ̂ be the minimizer of Equation (13), where n1/2λn → λ0 > 0 and Σ̂ has the expansion (24). Let W ∈ ℝp×p be a random matrix such that vec(W) is distributed as N[0, var(vec(M))]. Then, the following assertions hold.
0 < limn →∞ ℙ(En = E) < 1;
For any ε > 0, there is a λ0 > 0 such that limn →∞ ℙ(En = E) > 1 − ε;
n1/2(Θ̂ − Θ0) converges in distribution to argmin{Φ(Δ, W): Δ ∈ 𝒮}, where
Φ(Δ, W) = 2−1tr(ΔΣΔΣ) + tr(ΔW) + λ0[Σ(i,j)∈E, i≠j sign(θ0,ij)δij + Σ(i,j)∈Ec |δij|].
The inequality limn→∞ ℙ(En = E) > 0 implies that, as n → ∞, there is a positive probability of estimating a parameter as 0 when it is truly 0. A stronger property is that this probability tends to 1. For clarity, we refer to the former property as sparsity, and the latter as sparsistency (see Fan and Li 2001; Lam and Fan 2009). According to this definition, the two inequalities in part 1 mean that the lasso is sparse but not sparsistent. Note that the unconstrained maximum likelihood estimate—and indeed any regular estimate—is not sparse. Part 2 means that, even though the lasso is not sparsistent, we can make it as close to sparsistent as we wish by choosing a sufficiently large λ0. Part 3 gives the asymptotic distribution of n1/2(Θ̂ − Θ0). It is not the same as that of the maximum likelihood estimate under the constraint θij = 0 for (i, j) ∈ Ec. That is, it does not have the oracle property. The proof of part 3 is similar to that of Theorem 1 in Yuan and Lin (2007) in the context of GGM. However, to our knowledge there were no previous results parallel to parts 1 and 2 for GGM. Knight and Fu (2000) contains some basic ideas for the asymptotics of lasso-type estimators.
Proof of Theorem 4
We prove the three assertions in the order 3, 1, 2.
3. Let Θ0 be the true value of Θ, and let
By an argument similar to Yuan and Lin (2007, Theorem 1), it can be shown that
Since , we have
Both Φn(Δ) and Φ(Δ, W) are strictly convex with probability 1. Applying Theorem 4.4 of Geyer (1994) we see that
However, by construction, if Δ̂ is the (almost surely unique) minimizer of Φn(Δ), then Δ̂ = n1/2(Θ̂ − Θ0). This proves part 3.
1. For a generic function f(t) defined on t ∈ ℝs, let ∂−ti f and ∂+ti f be the left and right partial derivatives of f with respect to the ith component of t. When f is differentiable with respect to ti, we write ∂ti f. Note that En = E if and only if Φ(Δ, W) is minimized within 𝒮(𝒢). This happens if and only if
(27) |
Here, we only consider the cases i ≥ j because Δ is a symmetric matrix. Let L(Δ, W) = 2−1tr(ΔΣΔΣ) + tr(ΔW), and let P(Δ, W) denote the remaining penalty term, so that Φ(Δ, W) = L(Δ, W) + P(Δ, W).
Then, ∂L(Δ, W)/∂Δ = ΣΔΣ + W. For (i, j) ∈ E, P(Δ, W) is differentiable with respect to δij and ∂δij P(Δ, W) = λ0 sign(θ0,ij). For (i, j) ∈ Ec, P(Δ, W) is not differentiable with respect to δij, but has left and right derivatives, given by −λ0 and λ0, respectively. Condition (27) now reduces to
(28) |
where Δ ∈ ( ), sij = λ0sign(θ0,ij) if (i, j) ∈ E and i ≠ j and sij = 0 if i = j.
Now consider the event
We need to show that ℙ(G) > 0. Since Δ belongs to 𝒮(𝒢), it has as many free parameters as there are equations in the first line of Equation (28). Since Σ is a nonsingular matrix, for any {wij: i ≥ j, (i, j) ∈ E}, there exists a unique Δ that satisfies the first line of Equation (28), and it is a linear function of {wij: (i, j) ∈ E, i ≥ j}. Writing this function as Δ({wij: (i, j) ∈ E, i ≥ j}), we see that Equation (28) is satisfied if and only if
Since the mapping
from ℝp(p+1)/2 to ℝcard(Ec)/2 is continuous, the set τ−1[(−λ0, λ0)card(Ec)/2] is open in ℝp(p+1)/2. Furthermore, this open set is nonempty because if we let
then τ ({wij : i ≥ j}) = 0. Because {wij : i ≥ j} has a multivariate normal distribution with a nonsingular covariance matrix, any nonempty open set in ℝp(p+1)/2 has positive probability. This proves limn→∞ ℙ(En = E) > 0.
Similarly, the set τ−1[(λ0, 3λ0)card(Ec)/2] is open in ℝp(p+1)/2, and if we let
then τ({wij: i ≥ j}) = (2λ0, …, 2λ0)⊤ ∈ (λ0, 3λ0)card(Ec)/2. Hence, the W-probability of τ−1[(λ0, 3λ0)card(Ec)/2] is positive, implying limn→∞ ℙ(En = E) < 1.
2. For each (i, j) ∈ Ec, i ≥ j, let Uij = [ΣΔ({wμν: (μ, ν) ∈ E, μ ≥ ν})Σ]ij + wij. Then, for any η > 0, there is a λ0 > 0 such that ℙ(Uij ∈ [−λ0, λ0] for all (i, j) ∈ Ec, i ≥ j) > 1 − η. Hence,
This proves part 2 because η can be arbitrarily small.
7. SPARSISTENCY AND ORACLE PROPERTY: THE ADAPTIVE LASSO
We now turn to the adaptive lasso based on Equation (14). We still use Θ̂ = {θ̂ij} to denote the minimizer of Equation (14). For a matrix Δ = {δij} ∈ 𝒮 and a set C ⊆ Γ × Γ, let δC = {δij: (i, j) ∈ C, i ≥ j}, which is to be interpreted as a vector in which the index i moves first, followed by the index j.
Theorem 5
Suppose that (X, Y) follows a CGGM with respect to a graph 𝒢 = (Γ, E). Let Θ̂ be the minimizer of Equation (14), where Σ̂ has the expansion (24), n1/2λn → 0, and n(1+γ)/2λn → ∞.
Suppose Θ̃ in Equation (14) is an n1/2-consistent estimate of Θ0 with ℙ(θ̃ij ≠ 0) = 1 for (i, j) ∈ Ec. Then,
limn→∞ ℙ(En = E) = 1;
n1/2(Θ̂ − Θ0) converges in distribution to argmin{L(Δ, G): Δ ∈ 𝒮(𝒢)}.
Part 1 asserts that the adaptive lasso is sparsistent; part 2 asserts that it is asymptotically equivalent to the minimizer of L(Δ, G) when E is known. In this sense Θ̂ is oracle. The condition P (θ̃ij ≠ 0) = 1 means that θ̃ij is not sparse, which guarantees that |θ̃ij |−γ is well defined. For example, we can use the inverse of one of the RKHS estimates as Θ̃.
Proof
Let Δ ∈ 𝒮 and Ψn(Δ) = n[Λn(Θ0 + n−1/2Δ) − Λn(Θ0)]. Consider the difference
By Lemma 6, for large enough n, the right-hand side can be rewritten as
(29) |
If (i, j) ∈ E, then |θ̃ij |−γ = OP(1). Because n1/2 λn → 0, the summand for such (i, j) converges to 0 in probability. Hence, Equation (29) reduces to
(30) |
If (i, j) ∉ E, then θ̃ij = OP(n−1/2), and hence
n1/2λn|θ̃ij|−γ = n(1+γ)/2λn(n1/2|θ̃ij|)−γ → ∞ in probability.
From this, we see that Equation (30) converges in probability to ∞ unless δEc = 0, in which case it converges in probability to 0. In other words,
This implies that
Since both Ψn(Δ) and Ψ(Δ, G) are convex and Ψ(Δ, G) has a unique minimum, by the epi-convergence results of Geyer (1994), we have
This proves part 2 because the right-hand side is, in fact, argmin{L(Δ, G): Δ ∈ 𝒮(𝒢)}, and the left-hand side is the limit in distribution of n1/2(Θ̂ − Θ0).
The function Ψ(Δ, G) is always minimized in a region of Δ in which it is not ∞. As a consequence, if Δ minimizes Ψ(Δ, G) over 𝒮, then δEc = 0. Hence, ℙ(En ⊆ E) → 1. In the meantime, since n1/2(Θ̂ − Θ0) = OP(1), we have Θ̂ → Θ0 in probability. If (i, j) ∈ E, then θ0,ij ≠ 0. Thus, we see that ℙ(En ⊇ E) → 1. This proves part 1.
We now derive the explicit expression of the asymptotic distribution of Θ̂. Let G be the unique matrix in ℝp2×[card(E)+p]/2 such that for any Δ ∈ 𝒮(𝒢), vec(Δ) = GδE. For example, if p = 3 and E = {(1, 1), (2, 2), (3, 3), (1, 2), (2, 1), (1, 3), (3, 1)}, then δE = (δ11, δ21, δ31, δ22, δ33)⊤, and G is defined by
Corollary 3
Under the assumptions of Theorem 5,
where
(31) |
Proof
Note that
Hence,
This is a quadratic function, minimized by δE = −[G⊤(Σ ⊗ Σ)G]−1G⊤vec(W). In terms of Δ, the minimizer is vec(Δ) = −G[G⊤(Σ ⊗ Σ)G]−1G⊤vec(W). The corollary now follows from Theorem 5.
The asymptotic variance of the adaptive lasso estimator can be estimated by replacing the moments in Equations (25) and (31) by their sample estimates.
8. CONVERGENCE RATE FOR HIGH-DIMENSIONAL GRAPH
In the last two sections, we have studied the asymptotic properties of our sparse CGGM estimators with the number of nodes p in the graph held fixed. We now investigate the case where p = pn tends to infinity. Due to the limited space, we shall focus on the lasso-penalized RKHS estimator with PC regularization. That is, the minimizer of ϒn(Θ) with Σ̂ in Equation (12) is taken to be the RKHS estimator Σ̂PC(εn) based on κX(a, b) = (1 + a⊤b)r. Throughout this section, Θ̂ denotes this estimator. The large-pn convergence rate for the lasso estimator of the (unconditional) GGM has been studied by Rothman et al. (2008).
In this case, Σ, Θ, and Y should in principle be written as Σ(n), Θ(n), and Y(n) because they now depend on n. However, to avoid complicated notation, we still use Σ, Θ, and Y, keeping in mind their dependence on n. Following Rothman et al. (2008), we develop the convergence rate in Frobenius norm. Let ||·||1, ||·||F, and ||·||∞ be the L1-norm, Frobenius norm, and L∞-norm of a matrix A ∈ ℝd1×d2:
||A||1 = Σi,j |aij|, ||A||F = (Σi,j aij2)1/2, ||A||∞ = maxi,j |aij|,
where aij denotes the (i, j)th entry of A. Let ρ(A) denote the number of nonzero entries of A. For easy reference, we list some properties of these matrix functions in the following proposition.
Proposition 1
Let A ∈ ℝd1×d2, B ∈ ℝd2×d3. Then,
The first inequality follows from the definition of ||·||∞; the second from Hölder’s inequality; the third from the Cauchy–Schwarz inequality. The last two inequalities were used in Rothman et al. (2008). Let ℕ = {1, 2, …}. Consider an array of random matrices {A(n)1, …, A(n)n: n ∈ ℕ}, where the A(n)i are independent copies of a random matrix A(n) ∈ ℝpn×q; pn may depend on n but q is fixed. Let (A(n))rs denote the (r, s)th entry of A(n).
Lemma 7
Let {A(n)1, …, A(n)n: n ∈ ℕ} be an array of random matrices in ℝpn×q such that, for each n, A(n)1, …, A(n)n is an iid sample of a random matrix A(n). Suppose that the moment generating functions of the entries (A(n))st, say φnst, are finite on an interval (−δ, δ), and that their second derivatives are uniformly bounded over this interval for all s = 1, …, pn, t = 1, …, q, and n ∈ ℕ. If pn → ∞, then ||𝔼n(A(n)) − 𝔼(A(n))||∞ = OP([(log pn)/n]1/2).
Proof
Let μnst = 𝔼[(A(n))st]. By the assumption on φnst, there is C > 0 such that |μnst| ≤ C for all s, t, n. Let (B(n))st = (A(n))st − μnst, and let ψnst be the moment generating function of (B(n))st. Then, ψnst(τ) = e−μnstτφnst(τ). For any a > 0,
By Taylor’s theorem, and noticing that ψnst(0) = 1 and ψ′nst(0) = 0, we have
for some 0 ≤ ξnst ≤ n−1/2. By assumption, there is C1 > 0 such that lim supn→∞ φ″ (ξnst) ≤ C1 for all s = 1,…, pn and t = 1,…, q. Hence,
Thus, we have . By the same argument, we can show that . Therefore,
It follows that
In particular,
which implies the desired result.
In the following, we call any array of random matrices satisfying the conditions in Lemma 7 a standard array. We now establish the convergence rate of ||Θ̂ − Θ0||F. Let sn denote the number of nonzero off-diagonal entries of Θ0. Let Z = Y − 𝔼(Y | X). Let Xt and Ys denote the components of X and Y; their powers are denoted by (Xt)r and (Ys)r. For a symmetric matrix A, let σmax(A) and σmin(A) denote its maximum and minimum eigenvalues.
Theorem 6
Let Θ̂ be the sparse estimator defined in the first paragraph of this section with λn ~ (log pn/n)1/2, εn = o(n), and κ(a, b) = (1 + a⊤b)r. Suppose that (X, Y) follows a CGGM, and satisfies the following additional assumptions:
Y, YU⊤, and ZU⊤ are standard arrays of random matrices;
for all n ∈ ℕ, σmax(Σ) < ∞ and σmin(Σ) > 0;
pn → ∞ and pn (pn + sn)1/2(log pn)5/2 = o(n3/2);
the fixed-dimensional matrix VU = var(U) is nonsingular;
each component of 𝔼(Y | X = x) is a polynomial in x1, …, xq of at most rth order.
Then, ||Θ̂ − Θ0||F = OP ([(pn + sn) log pn/n]1/2).
Note that we can allow pn(pn + sn)1/2 to get arbitrarily close to n3/2. This condition is slightly stronger than the corresponding condition in Rothman et al. (2008) for the unconditional case, which requires (pn + sn) log pn = o(n1/2). Also note that ||Θ̂ − Θ0||F is the sum, instead of average, of elements (roughly pn + sn nonzero elements). With this in mind, the convergence rate [(pn + sn) log pn/n]1/2 is quite fast. This is the same rate as that given in Rothman et al. (2008) for the unconditional GGM.
Proof of Theorem 6
Let rn = [(pn + sn) log pn/n]1/2, and consider the sphere {Δ ∈ 𝒮: ||Δ||F = Mrn} for some M > 0. Let
where
(32) |
Then, Θ̂ minimizes L(Θ) if and only if Δ̂ = Θ̂ − Θ0 minimizes Gn(Δ). As argued by Rothman et al. (2008), since Gn(Δ) is convex in Δ and Gn(Δ̂) ≤ 0, the minimizer Δ̂ resides within the sphere if Gn(Δ) is positive and bounded away from 0 on this sphere. That is, it suffices to show
The proof of Lemma 3 shows, in the context of fixed p, that λs > εn with probability tending to 1 if εn = o(n). This result still holds here because the dimension q of X remains fixed. Consequently, ℙ(Σ̂PC(εn) = Σ̃) → 1, where Σ̃ is as defined in Equation (16), but now its dimension increases with n. Thus, we can replace the Σ̂PC(εn) in Equation (32) by Σ̃. Let
Let , and . Then,
The term μY|U − μ̂Y|U can be further decomposed as ZI + ZII + ZIII, where
The function Gn(Δ) can now be rewritten as
where . Since Z ~ N(0, Σ), we can use the same argument in the proof of Theorem 1 in Rothman et al. (2008) to show that
Thus, our theorem will be proved if we can show that
for A being any one of the following eight random matrices
(33) |
By Proposition 1, inequalities (2) and (3), we have, for Δ in the sphere defined above,
Thus, it suffices to show that,
(34) |
Since ||𝔼n(A)||∞ = ||𝔼n(A⊤)||∞, we only need to consider the following A:
From the definitions of Z and ZI, we have
(35) |
where μ̂Z = 𝔼n(Z). By the first inequality of Proposition 1,
(36) |
Since Z ⫫ X, we have 𝔼[Z(U − μU)⊤] = 0. Hence, by Lemma 7,
(37) |
Similarly,
(38) |
Since μ̂U − μU has a fixed-dimension finite-variance matrix, by the central limit theorem,
(39) |
By definition,
By Proposition 1, first inequality,
Substituting Equations (38) and (39) into the above inequality, we find
(40) |
Since Z is multivariate normal whose components have means 0 and bounded variance, it is a standard array. Hence,
(41) |
Since the dimension of V̂U is fixed, its entries have finite variances, and VU is nonsingular, we have, by the central limit theorem,
(42) |
Substituting Equations (37), (39), (40), (41), and (42) into Equation (36), we find that
which, by condition (3), satisfies Equation (34).
The orders of magnitude of the remaining three terms can be derived similarly. We present the results below, omitting the details:
By condition (3), all of these terms satisfy the relation in Equation (34).
9. IMPLEMENTATION
In this section, we address two issues in implementation: the choice of the tuning parameter and the minimization of the objective functions (13) and (14). For the choice of the tuning parameter, we use a BIC-type criterion (Schwarz 1978) similar to that used in Yuan and Lin (2007). Let Θ̂(λ) = {θ̂ij(λ): i, j ∈ Γ} be the lasso or the adaptive lasso estimate of Θ0 in the conditional graphical model for a specific choice λ of the tuning parameter. Let En(λ) = {(i, j): θ̂ij(λ) ≠ 0}, and
BIC(λ) = tr[Σ̂Θ̂(λ)] − log|Θ̂(λ)| + (log n/n) card{(i, j) ∈ En(λ): i ≤ j}.
The tuning parameter is then chosen to be
λ̂BIC = argminλ>0 BIC(λ).
Practically, we evaluate this criterion on a grid of points in (0, ∞) and choose the minimizer among these points.
For the minimization of Equations (13) and (14), we follow the graphical lasso procedure proposed by Friedman et al. (2008), but with the sample covariance matrix therein replaced by the RKHS estimates, Σ̂PC(εn) or Σ̂RR(εn), of the conditional covariance matrix Σ. The graphical lasso (glasso) procedure is available as a package in the R language.
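Concretely, this procedure can be run by passing the RKHS conditional covariance estimate to the glasso package in place of the sample covariance matrix and scanning a λ grid with the BIC-type criterion. The sketch below follows our reconstruction of the criterion above; implementing the adaptive lasso through an entrywise penalty matrix is our implementation choice, not one prescribed by the article.

```r
library(glasso)

## Sparse CGGM fit: glasso applied to the RKHS conditional covariance estimate,
## with the tuning parameter chosen on a grid by the BIC-type criterion
fit_cggm <- function(Sigma_hat, n, lambdas = 10^seq(-3, 0, length.out = 25)) {
  best <- NULL
  for (lam in lambdas) {
    g   <- glasso(Sigma_hat, rho = lam)            # g$wi is the estimated precision matrix
    Th  <- g$wi
    df  <- sum(Th[upper.tri(Th, diag = TRUE)] != 0)
    bic <- sum(Sigma_hat * Th) -
      as.numeric(determinant(Th, logarithm = TRUE)$modulus) + (log(n) / n) * df
    if (is.null(best) || bic < best$bic) best <- list(lambda = lam, Theta = Th, bic = bic)
  }
  best
}

## Adaptive lasso variant: glasso accepts a matrix-valued rho, so entrywise
## penalties lambda / |theta_tilde|^gamma can be supplied directly
## (assumes the pilot estimate Theta_tilde has no zero entries, as in Section 7)
adaptive_rho <- function(lambda, Theta_tilde, gamma = 1) {
  R <- lambda / abs(Theta_tilde)^gamma
  diag(R) <- 0                                     # leave the diagonal unpenalized
  R
}
## Example: fit <- glasso(sigma_pc, rho = adaptive_rho(0.1, solve(sigma_pc)))
```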
10. SIMULATION STUDIES
In this section, we compare the sparse estimators for the CGGM with the sparse estimators of the GGM, with the maximum likelihood estimators of the CGGM, and with two naive estimators. We also explore several reproducing kernels and investigate their performances for estimating the CGGM.
10.1 Comparison With Estimators for GGM
We use three criteria for this comparison (a small code sketch for computing them follows the list):
- False positive rate at λ̂BIC. This is defined as the percentage of edges identified as belonging to E when they are not; that is,
- False negative rate at λ̂BIC. This is defined as the percentage of edges identified as belonging to Ec when they are not; that is,
Rate of correct paths. The above two criteria are both specific to the tuning method (in our case, BIC). To assess the potential capability of an estimator of E, independently of the tuning methods used, we use the percentage of cases where E belongs to the path {En(λ) : λ ∈ (0, ∞)}, where En(λ) is an estimator of E for a fixed λ. We write this criterion as PATH.
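The criteria can be computed from the estimated and true edge sets as follows; this is one natural implementation (the exact denominators in the missing displays above are not specified, so the rates below are our reading of the definitions), with edges taken as unordered pairs i < j.

```r
## Edge set of a precision matrix: unordered pairs (i, j), i < j, with a nonzero entry
edge_set <- function(Theta, tol = 1e-8) which(abs(Theta) > tol & upper.tri(Theta), arr.ind = TRUE)
edge_key <- function(E) paste(E[, 1], E[, 2])

## False positive / false negative rates for one fitted graph against the truth
fp_fn <- function(Theta_hat, Theta_true) {
  eh <- edge_key(edge_set(Theta_hat)); et <- edge_key(edge_set(Theta_true))
  p  <- nrow(Theta_true); all_pairs <- p * (p - 1) / 2
  c(FP = length(setdiff(eh, et)) / max(all_pairs - length(et), 1),
    FN = length(setdiff(et, eh)) / max(length(et), 1))
}

## PATH: does the true edge set appear anywhere along the solution path?
on_path <- function(Theta_path, Theta_true) {   # Theta_path: list of estimates over lambda
  et <- sort(edge_key(edge_set(Theta_true)))
  any(vapply(Theta_path, function(Th) identical(sort(edge_key(edge_set(Th))), et), logical(1)))
}
```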
Example 1
This is the example, illustrated in Figure 1, in which p = 3, q = 1, and (X, Y) satisfies CGGM with E = {(1, 1), (2, 2), (3, 3), (2, 3), (3, 2)}. The conditional distribution of Y | X is specified by
Y = βX + ε, (43)
where β = (β1, β2, 0)⊤, X ~ N(0, 1), and ε ~ N(0, Σ), X ⫫ ε, and
For each simulated sample, β1 and β2 are generated, independently, from the uniform distribution defined on the set (−6, −3) ∪ (3, 6). We use two sample sizes n = 50, 100. The linear kernel κX(a, b) = 1 + a⊤b is used for the initial RKHS estimate. The results are presented in the first three rows of Table 1. Entries in the table are the means calculated across 200 simulated samples.
Table 1.

| Example/scenario | Criteria | LASSO, GGM (n = 50) | LASSO, CGGM (n = 50) | ALASSO, GGM (n = 50) | ALASSO, CGGM (n = 50) | LASSO, GGM (n = 100) | LASSO, CGGM (n = 100) | ALASSO, GGM (n = 100) | ALASSO, CGGM (n = 100) |
|---|---|---|---|---|---|---|---|---|---|
| EX1 | FP | 1 | 0.51 | 1 | 0.15 | 1 | 0.40 | 1 | 0.05 |
| | FN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | PATH | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| EX2-SC1 | FP | 0.84 | 0.02 | 0.54 | 0.06 | 0.93 | 0.01 | 0.67 | 0.02 |
| | FN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | PATH | 0 | 1 | 0.01 | 0.63 | 0 | 1 | 0.98 | 1 |
| EX2-SC2 | FP | 0.98 | 0.55 | 0.96 | 0.31 | 1 | 0.51 | 1 | 0.22 |
| | FN | 0.06 | 0.01 | 0.15 | 0.02 | 0.02 | 0 | 0.08 | 0 |
| | PATH | 0 | 0.57 | 0 | 0.71 | 0 | 0.79 | 0 | 0.98 |
| EX2-SC3 | FP | 0.71 | 0.68 | 0.18 | 0.19 | 0.75 | 0.77 | 0.10 | 0.11 |
| | FN | 0 | 0 | 0.01 | 0.02 | 0 | 0 | 0 | 0 |
| | PATH | 0 | 0.43 | 0.80 | 0.52 | 0 | 0.56 | 1 | 0.87 |
| EX3 | FP | 0.79 | 0.23 | 0.41 | 0.09 | 0.83 | 0.16 | 0.59 | 0.03 |
| | FN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | PATH | 0 | 0.94 | 0.95 | 0.99 | 0 | 1 | 1 | 1 |
The table indicates that the unconditional sparse estimators have much higher false positive rate than false negative rate. This is because they tend to pick up connections among the components of Y that are due to X. In comparison, the conditional sparse estimators (both lasso and adaptive lasso) can successfully remove the edges effected by X, resulting in more accurate identification of the graph.
Example 2
In this example, we consider three scenarios in which the effect of an external source on the network varies in degree, resulting in different amounts of gain achievable by a conditional graphical model.
We still assume the linear regression model (43), but with p = 5. The first scenario is shown in the top two panels in Figure 2, where all the components of β are nonzero and Σ diagonal. In this case, the conditional graph is totally disconnected (left panel); whereas the unconditional graph is a complete graph (right panel). Specifically, the parameters are
The panels in the third row of Figure 2 represent the other extreme, where only Y1 is related to X, and each Yi for i ≠ 1 shares an edge with Y1 but has no other edges. In this case, the conditional and unconditional graphs are identical. The parameters are specified as follows. The first row (and first column) of Θ is (6.020, −0.827, −1.443, −1.186, −0.922); the remaining 4 × 4 block is I4. The first entry of β is 0.656, and rest entries are 0. Between these two extremes is scenario 2 (second row in Figure 2), where the conditional and unconditional graphs differ only by one edge: 1 ↔ 4. The parameters are
For each scenario, we generate 200 samples of sizes n = 50, 100 and compute the three criteria across the 200 samples. The results are presented row 4 through row 12 of Table 1. We see significant improvements of the rates of correct identification of the graphical structure by the sparse estimator of CGGM whenever the conditional graph differs from the unconditional graph. Also note that, for scenario 3, where the two graphs are the same, the adaptive lasso estimator for GGM performs better than the adaptive lasso estimator for CGGM. This is because the latter needs to estimate more parameters.
Example 3
In this example, we investigate a situation where the edges in the unconditional graphical model have two external sources. We use model (43) with p = 5, q = 2, X ~ N(0, I2),
The results are summarized in the last three rows of Table 1, from which we can see similar improvements by the sparse CGGM estimators.
10.2 Comparison With Maximum Likelihood Estimates of CGGM
In Section 7, we showed that the adaptive lasso estimate for the CGGM possesses the oracle property; that is, its asymptotic variance reaches the lower bound among regular estimators when the graph is assumed known. Hence, it makes sense to compare the adaptive lasso estimate with the maximum likelihood estimate of Θ under the constraints θij = 0, (i, j) ∉ E, which is known to be optimal among regular estimates. In this section, we make such a comparison, using all three examples in Section 10.1. As a benchmark, we also compare these two estimates with the maximum likelihood estimate under the full model, which for the linear kernel is [Σ̂PC(0)]†. For this comparison we use the squared Frobenius norm ||Θ̂ − Θ0||F2, which characterizes the closeness of two precision matrices rather than that of graphs.
Table 2 shows that the adaptive lasso estimator is rather close to the constrained MLE, with the unconstrained MLE trailing noticeably behind. In three out of five cases, the constrained MLE performs better than the adaptive lasso, which is not surprising because, although the two estimators are equivalent asymptotically, the former employs the true graphical structure unavailable for adaptive lasso, making it more accurate for the finite sample. When the errors of adaptive lasso are lower than the constrained MLE, the differences are within the margins of error.
Table 2.

| Example/scenario | ALASSO | MLE (unconstrained) | MLE (constrained) |
|---|---|---|---|
| EX1 | 1.458 ± 0.125 | 2.256 ± 0.171 | 1.575 ± 0.155 |
| EX2-SC1 | 0.142 ± 0.008 | 0.400 ± 0.018 | 0.122 ± 0.007 |
| EX2-SC2 | 1.858 ± 0.120 | 2.294 ± 0.148 | 1.663 ± 0.116 |
| EX2-SC3 | 1.147 ± 0.071 | 2.139 ± 0.186 | 0.969 ± 0.081 |
| EX3 | 0.303 ± 0.016 | 0.773 ± 0.555 | 0.327 ± 0.275 |
10.3 Exploring Different Reproducing Kernels
In this section, we explore three types of kernels for RKHS
κRB(a, b) = exp(−γ||a − b||2), κRQ(a, b) = 1 − ||a − b||2/(||a − b||2 + c), κPN(a, b) = (1 + a⊤b)r, (44)
and investigate their performances as initial estimates for lasso and adaptive lasso. These kernels are widely used for RKHS (Genton 2001). For the CGGM, we use a nonlinear regression model with four combinations of dimensions: q = 10, 20, p = 50, 100. The nonlinear regression model is specified by
(45) |
where β1 and β2 are q-dimensional vectors
The distribution of (ε1, …, εp)⊤ is multivariate normal with mean 0 and precision matrix
where each Γ is the precision matrix in Example 3.
The following specifications apply throughout the rest of Section 10: γ = 1/(9q) for RB, c = 200 for RQ, and r = 2 for PN (because the predictors in Equation (45) are quadratic polynomials); the RKHS estimator Σ̂PC(εn) is used as the initial estimator for the lasso and adaptive lasso, where εn is chosen so that the first 70 eigenvectors of QKXQ are retained. Ideally, the kernel parameters γ, c, and εn should be chosen by data-driven methods such as cross-validation. However, this is beyond the scope of the present article and will be further developed in a future study. Our choices are based on trial and error in pilot runs. Our experience indicates that the sparse estimators perform well and are reasonably stable when εn is chosen so that 10% ~ 30% of the eigenvectors of QKXQ are included. The sample size is n = 100 and the simulation is based on 200 samples. To save computing time, we use the BIC to optimize λn for the first sample and use the selected value for the remaining 199 samples.
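The three kernels can be coded as Gram-matrix functions; the RB, RQ, and PN forms below are the standard ones from Genton (2001) and reflect our reading of the reconstructed display (44), with the stated tuning values plugged in. The rule for choosing εn so that the first 70 eigenvectors of QKXQ are retained is also sketched.

```r
sq_dist <- function(A) {                       # matrix of squared Euclidean distances
  G <- tcrossprod(A)
  outer(diag(G), diag(G), "+") - 2 * G
}
kernel_rb <- function(A, gamma) exp(-gamma * sq_dist(A))                  # Gaussian radial basis
kernel_rq <- function(A, c) { D <- sq_dist(A); 1 - D / (D + c) }          # rational quadratic
kernel_pn <- function(A, r) (1 + tcrossprod(A))^r                         # polynomial

## Example data and the tuning values used in Section 10.3
n  <- 100; q <- 10
DX <- matrix(rnorm(n * q), n, q)
Q  <- diag(n) - matrix(1 / n, n, n)
KX <- kernel_rb(DX, gamma = 1 / (9 * q))

## Choose eps_n so that the PC-inverse retains the first 70 eigenvectors of QKXQ
## (ties are ignored in this sketch)
ev    <- eigen(Q %*% KX %*% Q, symmetric = TRUE, only.values = TRUE)$values
eps_n <- if (length(ev) > 70) ev[71] else 0
```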
In Table 3, we compare the sparse estimators lasso and adaptive lasso, whose initial estimates are derived from kernels in Equation (44), with the full and constrained MLEs. The full MLE is computed using the knowledge that the predictor is a quadratic polynomial of x1, …, xp. The constrained MLE uses, in addition, the knowledge of the conditional graph ; that is, the positions of the zero entries of the true conditional precision matrix.
Table 3.
p | q | LASSO
|
ALASSO
|
MLE
|
|||||
---|---|---|---|---|---|---|---|---|---|
PN | RB | RQ | PN | RB | RQ | Full | Constrained | ||
50 | 10 | 1.84 | 6.88 | 8.14 | 1.84 | 2.31 | 2.71 | 29.41 | 3.82 |
20 | 11.03 | 11.03 | 11.03 | 17.15 | 17.14 | 17.14 | 299.28 | 94.18 | |
100 | 10 | 5.96 | 20.07 | 24.06 | 3.35 | 4.33 | 5.17 | 171.32 | 7.73 |
20 | 20.06 | 20.05 | 20.05 | 29.10 | 28.30 | 28.31 | 1787.32 | 186.39 |
Table 3 shows that in all cases the sparse estimators perform substantially better than the full MLEs. For q = 10, the adaptive lasso estimates based on all three kernels also perform better than both the full and constrained MLEs, whereas the accuracy of most of the lasso estimates is between that of the full and the constrained MLEs. For q = 20, all sparse estimators perform substantially better than both the full and the constrained MLEs. From these results, we can see the effects of two types of regularization: the sparse regularization of the conditional precision matrix and the kernel-PCA regularization for the predictor. The first regularization counteracts the increase in the number of parameters in the conditional precision matrix as p increases, and the second counteracts the increase in the number of terms in a quadratic polynomial as q increases, both resulting in substantially reduced estimation error.
10.4 Comparisons With Two Naive Estimators
We now compare our sparse RKHS estimators for the CGGM with two simple methods: the linear regression and the simple thresholding.
10.4.1 Naive Linear Regression
A simple estimate of CGGM is to first apply multivariate linear regression of Y versus X, regardless of the true regression relation, and then apply a sparse penalty to the residual variance matrix. To make a fair comparison with linear regression, we consider the following class of regression models:
(46) |
This is a convex combination of a linear model and a quadratic model: it is linear when a = 1, quadratic when a = 0, and a mixture of both when 0 < a < 1. The distribution of ε is as specified in Section 10.3. The fraction 1/4 in the quadratic term in Equation (46) is introduced so that the linear and quadratic terms have similar signal-to-noise ratios. In Table 4, we compare the estimation errors of the CGGM based on Equation (46) obtained by sparse linear regression and by the sparse RKHS estimator, using the adaptive lasso as the penalty. We take q = 10, p = 50, and n = 500. The Gaussian radial basis kernel is used for the kernel method, with the tuning parameters γ and εn the same as specified in Section 10.3.
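A sketch of the naive linear-regression competitor: fit a multivariate linear regression of Y on X, take the residual covariance matrix, and pass it to the same sparse step (the function fit_cggm is from the earlier sketch; this is our rendering of the comparison, since the display for model (46) is not reproduced here).

```r
## Naive linear-regression estimate of the conditional covariance matrix
naive_linear_sigma <- function(DX, DY) {
  R <- residuals(lm(DY ~ DX))        # multivariate least-squares residuals
  crossprod(R) / nrow(R)             # residual covariance matrix
}
## Plug into the same sparse step used for the RKHS estimate, e.g.:
## fit_lin <- fit_cggm(naive_linear_sigma(DX, DY), n = nrow(DY))
```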
Table 4.

| a | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1 |
|---|---|---|---|---|---|---|
| Linear | 10.49 | 10.48 | 10.48 | 10.42 | 10.10 | 0.75 |
| Kernel | 1.56 | 2.13 | 1.77 | 1.72 | 1.67 | 1.56 |
We see that the sparse RKHS estimate performs substantially better in all cases except a = 1, where Equation (46) is exactly a linear model. This suggests that the linear-regression sparse estimate of the CGGM is rather sensitive to nonlinearity: even a small proportion of nonlinearity in the mixture makes the kernel method favorable. In comparison, the sparse RKHS estimate is stable and accurate across different types of regression relations.
10.4.2 Simple Thresholding
A naive approach to sparsity is to drop small entries of a matrix. For example, we can estimate Θ by setting to zero the entries of the (nonsparse) precision estimate whose absolute values are smaller than some τ > 0. Let Θ̃(τ) denote this estimator. Using the example in Section 10.3 with p = 50, q = 10, n = 500, we now compare the sparse RKHS estimators with Θ̃(τ). The curve in Figure 3 shows the error versus τ ∈ [0, 1]. Each point on the curve is the error averaged over 200 simulated samples. The estimate Σ̂PC is based on the Gaussian radial basis kernel, whose tuning parameters γ and εn are as given in Section 10.3. The two horizontal lines in Figure 3 represent the errors of the lasso and the adaptive lasso estimators based on Σ̂PC, which are read off from Table 3.
The figure shows that the simple thresholding estimate, even for the best threshold, does not perform as well as either of our sparse estimates. Note that in practice Θ0 is unknown, and the optimal τ cannot be obtained by minimizing the curve in Figure 3. With this in mind, we expect the actual gap between the thresholding estimate and the sparse estimates to be even greater than that shown in the figure.
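The thresholding competitor amounts to two lines of R; in this sketch all entries are thresholded (whether the diagonal is exempted is not specified in the article), and the unpenalized precision estimate is taken to be the inverse of the RKHS conditional covariance estimate.

```r
## Hard-thresholding estimator: zero out entries with |value| < tau
threshold_theta <- function(Theta_full, tau) Theta_full * (abs(Theta_full) >= tau)

## Example, using the PC-regularized RKHS estimate from the earlier sketch:
## Theta_tilde_tau <- threshold_theta(solve(sigma_pc), tau = 0.2)
```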
11. NETWORK INFERENCE FROM eQTL DATA
In this section, we apply our CGGM sparse estimators to two gene sets extracted from an expression quantitative trait loci (eQTL) dataset to infer gene networks. The dataset was collected from an F2 intercross between the inbred lines C3H/HeJ and C57BL/6J (Ghazalpour et al. 2006) and contains 3,421 transcripts and 1,065 markers from the liver tissues of 135 female mice (n = 135). The purpose of our analysis is to identify direct gene interactions by fitting the CGGM to the eQTL dataset. Although a gene network can be inferred from expression data alone, such a network would contain edges due to confounders such as shared genetic causal variants. The available marker data in the eQTL dataset allow us to isolate the confounded edges by conditioning on the genomic information.
We restrict our attention to subsets of genes, partly to accommodate the small sample size. In the eQTL analysis tradition, subsets of genes can be identified by two methods: co-mapping and co-expression. For co-mapping, each gene expression trait is mapped to markers, and the transcripts that are mapped to the same locus are grouped together. For co-expression, highly correlated gene expressions are grouped together.
We first consider the subset of genes identified by co-mapping. It has been reported (Neto, Keller, Attie, and Yandell 2010) that 14 transcripts are mapped to a marker on chromosome 2 (at 55.95 cM). As this locus is linked to many transcripts, it is called a hot-spot locus. It is evident that this marker should be included as a covariate for CGGM. In addition, we include a marker on chromosome 15 (at 76.65 cM) as a covariate, because it is significantly linked to gene Apbb1ip (permutation p-value < 0.005), conditioning on the effect of the marker on chromosome 2. The transcript mapping is performed using the qtl package in R (Broman, Wu, Sen, and Churchill 2003).
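The transcript-to-marker mapping step can be carried out with standard R/qtl calls along the following lines; this is a schematic sketch only (the file name, phenotype column, and conditioning step are placeholders, not the authors' exact pipeline).

```r
library(qtl)

cross <- read.cross(format = "csv", file = "eqtl_cross.csv")   # hypothetical file name
cross <- calc.genoprob(cross, step = 1)

## Single-QTL genome scan for one expression trait, with a permutation threshold
out  <- scanone(cross, pheno.col = 1, method = "hk")
perm <- scanone(cross, pheno.col = 1, method = "hk", n.perm = 1000)
summary(out, perms = perm, alpha = 0.005)          # loci significant at permutation p < 0.005

## Conditioning on the chromosome 2 hot-spot marker: add its genotype as a covariate
## (missing-genotype handling omitted in this sketch)
g2   <- pull.geno(cross)[, find.marker(cross, chr = 2, pos = 55.95)]
out2 <- scanone(cross, pheno.col = 1, method = "hk", addcovar = g2)
```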
The GGM detects 52 edges; the CGGM detects 34 edges, all in the set detected by the GGM. The two graphs are presented in the upper panel of Figure 4, where edges detected by both GGM and CGGM are represented by black solid lines, and edges detected by GGM alone are represented by blue dotted lines. In the left panel of Table 5, we compare the connectivity of genes of each graph. Among the 14 transcripts, Pscdbp contains the hot spot locus within the ±100 kb boundaries of its location. Interestingly, in CGGM, this cis transcript has the lowest connectivity among the 14 genes, but one of the genes with which it is associated (Apbb1ip) is a hub gene, connected with seven other genes. In sharp contrast, GGM shows high connectivity of Pscdbp itself.
Table 5.
Co-mapping
|
Co-expression
|
||||||
---|---|---|---|---|---|---|---|
Gene | GGM | CGGM | Diff. | Gene | GGM | CGGM | Diff. |
Il16 | 7 | 3 | 4 | Aqr | 7 | 6 | 1 |
Frzb | 7 | 5 | 2 | Nola3 | 8 | 2 | 6 |
Apbb1ip | 8 | 7 | 1 | AI648866 | 9 | 6 | 3 |
Clec2 | 6 | 4 | 2 | Zfp661 | 8 | 5 | 3 |
Riken | 6 | 4 | 2 | Myef2 | 11 | 5 | 6 |
Il10rb | 8 | 5 | 3 | Synpo2 | 9 | 3 | 6 |
Myo1f | 8 | 5 | 3 | Slc24a5 | 6 | 5 | 1 |
Aif1 | 6 | 5 | 1 | Nol5a | 7 | 4 | 3 |
Unc5a | 11 | 8 | 3 | Sdh1 | 10 | 6 | 4 |
Trpv2 | 9 | 6 | 3 | Dtwd1 | 9 | 4 | 5 |
Tnfsf6 | 9 | 4 | 5 | B2m | 8 | 6 | 2 |
Stat4 | 3 | 3 | 0 | Tmem87a | 6 | 3 | 3 |
Pscdbp | 9 | 3 | 6 | Zc3hdc8 | 6 | 5 | 1 |
D13Ertd275e | 7 | 6 | 1 |
We next study the subset identified by co-expression. We use a hierarchical clustering approach in conjunction with the average agglomeration procedure to partition the transcripts into 10 groups. The relevant dissimilarity measure is 1 − |ρij|, where ρij is the Pearson correlation between transcripts i and j. Among the 10 groups, we choose a group that contains 15 transcripts with mean absolute correlation equal to 0.78. Thirteen of the 15 transcripts have annotations, and they are used in our analysis. Using the qtl package in R, each transcript is mapped to markers, and seven markers on chromosome 2 are significantly linked to transcripts (permutation p-value < 0.005). Among those, we drop two markers that are identical to adjacent markers. We thus use five markers (Chr2@100.18, Chr2@112.75, Chr2@115.95, Chr2@120.72, Chr2@124.12) as covariates.
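The co-expression grouping can be reproduced with base R as follows (a sketch; expr stands for a hypothetical samples-by-transcripts expression matrix, and the group-selection rule at the end is a placeholder for the authors' choice of the 15-transcript cluster).

```r
## Hierarchical clustering of transcripts with dissimilarity 1 - |Pearson correlation|
## and average agglomeration, cut into 10 groups
d      <- as.dist(1 - abs(cor(expr)))      # expr: samples in rows, transcripts in columns
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 10)

## Mean absolute pairwise correlation within a selected group
sel <- which(groups == 1)                  # placeholder: pick one of the 10 groups
M   <- abs(cor(expr[, sel]))
mean(M[upper.tri(M)])
```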
The GGM identifies 52 edges and the CGGM identifies 30 edges. Among these edges, 28 are shared by both methods. The two graphs are presented in the lower panel of Figure 4, where edges detected by both GGM and CGGM are represented by black solid lines, edges detected by GGM alone are represented by blue dotted lines, and edges detected by CGGM alone are represented by red broken lines. Among the 13 transcripts, Dtwd1 is the closest to all markers (distances < 300 kbp). The connectivity of each method is shown in the right panel of Table 5. The number of genes that are connected to Dtwd1 is reduced by 5 by CGGM. Unlike in the co-mapping network, in the co-expression network, CGGM detects two edges (Aqr–Zfp661, Zfp661–Tmem87a) that are not detected by GGM. These additional edges could be caused by the error in regression estimation. For example, if the regression coefficients for some of markers are 0, but are estimated to be nonzero, then spurious correlations arise among residuals. This indicates that the accurate identification of the correct covariates is more important in the co-expression network.
In summary, in the network inference from the transcripts identified by co-mapping, we see that, after conditioning on the markers, the cis-regulated transcript (Pscdbp) is connected to relatively few genes, although one of its neighbors has high connectivity. Without conditioning on the markers, however, it appears to have high connectivity itself. In other words, without conditioning on markers, such cis-regulated transcripts might be misinterpreted as hub genes. In the network inference from the transcripts identified by co-expression, we see that after conditioning on the markers a few additional edges are detected, which may result from inaccurate identification of covariates for an individual transcript. Thus, including the correct set of markers for each transcript can be important.
Acknowledgments
Bing Li’s research was supported in part by NSF grants DMS-0704621, DMS-0806058, and DMS-1106815.
Hyonho Chun’s research was supported in part by NSF grant DMS-1107025.
Hongyu Zhao’s research was supported in part by NSF grants DMS-0714817 and DMS-1106738 and NIH grants R01 GM59507 and P30 DA018343.
We would like to thank three referees and an Associate Editor for their many excellent suggestions, which led to substantial improvements on an earlier draft. In particular, the asymptotic development in Section 8 was inspired by two reviewers’ comments.
Contributor Information
Bing Li, Email: bing@stat.psu.edu, Professor of Statistics, The Pennsylvania State University, 326 Thomas Building, University Park, PA 16802.
Hyonho Chun, Email: chunh@purdue.edu, Assistant Professor of Statistics, Purdue University, 250 N. University Street, West Lafayette, IN 47907.
Hongyu Zhao, Email: hongyu.zhao@yale.edu, Professor of Biostatistics, Yale University, Suite 503, 300 George Street, New Haven, CT 06510.
References
- Aronszajn N. Theory of Reproducing Kernels. Transactions of the American Mathematical Society. 1950;68:337–404.
- Bickel PJ, Levina E. Covariance Regularization by Thresholding. The Annals of Statistics. 2008;36:2577–2604.
- Bickel PJ, Ritov Y, Klaassen CAJ, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore, MD: The Johns Hopkins University Press; 1993.
- Breiman L. Better Subset Regression Using the Nonnegative Garrote. Technometrics. 1995;37:373–384.
- Broman KW, Wu H, Sen S, Churchill GA. R/qtl: QTL Mapping in Experimental Crosses. Bioinformatics. 2003;19:889–890. doi: 10.1093/bioinformatics/btg112.
- Dempster AP. Covariance Selection. Biometrika. 1972;32:95–108.
- Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fernholz LT. Von Mises Calculus for Statistical Functionals. Lecture Notes in Statistics, Vol. 19. New York: Springer; 1983.
- Friedman J, Hastie T, Tibshirani R. Sparse Inverse Covariance Estimation With the Graphical Lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.
- Fukumizu K, Bach FR, Jordan MI. Kernel Dimension Reduction in Regression. The Annals of Statistics. 2009;4:1871–1905.
- Genton MG. Classes of Kernels for Machine Learning: A Statistics Perspective. Journal of Machine Learning Research. 2001;2:299–312.
- Geyer CJ. On the Asymptotics of Constrained M-Estimation. The Annals of Statistics. 1994;22:1998–2010.
- Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA, Lusis AJ, Horvath S. Integrating Genetic and Network Analysis to Characterize Genes Related to Mouse Weight. PLoS Genetics. 2006;2:e130. doi: 10.1371/journal.pgen.0020130.
- Guo J, Levina E, Michailidis G, Zhu J. Joint Estimation of Multiple Graphical Models. 2009. Unpublished manuscript, available at http://www.stat.lsa.umich.edu/gmichail/manuscript-jasa-09.pdf. doi: 10.1093/biomet/asq060.
- Henderson HV, Searle SR. Vec and Vech Operators for Matrices, With Some Uses in Jacobians and Multivariate Statistics. Canadian Journal of Statistics. 1979;7:65–81.
- Horn RA, Johnson CR. Matrix Analysis. New York: Cambridge University Press; 1985.
- Knight K, Fu W. Asymptotics for Lasso-Type Estimators. The Annals of Statistics. 2000;28:1356–1378.
- Lafferty J, McCallum A, Pereira F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001); 2001. pp. 282–289.
- Lam C, Fan J. Sparsistency and Rates of Convergence in Large Covariance Matrix Estimation. The Annals of Statistics. 2009;37:4254–4278. doi: 10.1214/09-AOS720.
- Lauritzen SL. Graphical Models. Oxford: Clarendon Press; 1996.
- Meinshausen N, Bühlmann P. High-Dimensional Graphs With the Lasso. The Annals of Statistics. 2006;34:1436–1462.
- von Mises R. On the Asymptotic Distribution of Differentiable Statistical Functions. The Annals of Mathematical Statistics. 1947;18:309–348.
- Neto EC, Keller MP, Attie AD, Yandell BS. Causal Graphical Models in Systems Genetics: A Unified Framework for Joint Inference of Causal Network and Genetic Architecture for Correlated Phenotypes. The Annals of Applied Statistics. 2010;4:320–339. doi: 10.1214/09-aoas288.
- Peng J, Wang P, Zhou N, Zhu J. Partial Correlation Estimation by Joint Sparse Regression Models. Journal of the American Statistical Association. 2009;104:735–746. doi: 10.1198/jasa.2009.0126.
- Reeds JA. On the Definition of Von Mises Functionals. PhD dissertation, Harvard University; 1976.
- Ren J, Sen PK. On Hadamard Differentiability of Extended Statistical Functional. Journal of Multivariate Analysis. 1991;39:30–43.
- Rothman AJ, Bickel P, Levina E, Zhu J. Sparse Permutation Invariant Covariance Estimation. Electronic Journal of Statistics. 2008;2:494–515.
- Schwarz GE. Estimating the Dimension of a Model. The Annals of Statistics. 1978;6:461–464.
- Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Vapnik NV. Statistical Learning Theory. New York: Wiley; 1998.
- Yuan M, Lin Y. Model Selection and Estimation in the Gaussian Graphical Model. Biometrika. 2007;94:19–35.
- Zhao P, Yu B. On Model Selection Consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
- Zou H. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Li R. One-Step Sparse Estimates in Nonconcave Penalized Likelihood Models (with discussion). The Annals of Statistics. 2008;36:1509–1533. doi: 10.1214/009053607000000802.