Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2014 Feb 24.
Published in final edited form as: J Am Stat Assoc. 2012 Jun 11;107(497):152–167. doi: 10.1080/01621459.2011.644498

Sparse Estimation of Conditional Graphical Models With Application to Gene Networks

Bing Li 1, Hyonho Chuns 2, Hongyu Zhao 3
PMCID: PMC3932550  NIHMSID: NIHMS542227  PMID: 24574574

Abstract

In many applications the graph structure in a network arises from two sources: intrinsic connections and connections due to external effects. We introduce a sparse estimation procedure for graphical models that is capable of isolating the intrinsic connections by removing the external effects. Technically, this is formulated as a conditional graphical model, in which the external effects are modeled as predictors, and the graph is determined by the conditional precision matrix. We introduce two sparse estimators of this matrix using the reproduced kernel Hilbert space combined with lasso and adaptive lasso. We establish the sparsity, variable selection consistency, oracle property, and the asymptotic distributions of the proposed estimators. We also develop their convergence rate when the dimension of the conditional precision matrix goes to infinity. The methods are compared with sparse estimators for unconditional graphical models, and with the constrained maximum likelihood estimate that assumes a known graph structure. The methods are applied to a genetic data set to construct a gene network conditioning on single-nucleotide polymorphisms.

Keywords: Conditional random field, Gaussian graphical models, Lasso and adaptive lasso, Oracle property, Reproducing kernel Hilbert space, Sparsity, Sparsistency, von Mises expansion

1. INTRODUCTION

Sparse estimation of the Gaussian graphical models has undergone intense development during the recent years, partly due to their wide applications in such fields as information retrieval and genomics, and partly due to the increasing maturity of statistical theories and techniques surrounding sparse estimation. See, for example, Meinshausen and Buhlmann (2006), Yuan and Lin (2007), Bickel and Levina (2008), Peng et al. (2009), Guo et al. (2009), and Lam and Fan (2009). The precursor of this line of work is the Gaussian graphical model in which the graph structure is assumed to be known; see Dempster (1972) and Lauritzen (1996).

Let Y = (Y1, …, Yp) be a random vector, Γ = {1, …, p}, and E ⊆ Γ × Γ. Let Inline graphic = (Γ, E) be the graph with its vertices in Γ and edges in E. We say that Y follows a Gaussian graphical model (GGM) with respect to Inline graphic (Lauritzen 1996) if Y has a multivariate normal distribution and

YiYjY-(i,j),(i,j)Ec, (1)

where AB|C means A and B are independent given C, and Y−(i,j) denotes the set {Y1, …, Yp}\{Yi, Yj}. Let ωij be the (i, j)th entry of [var(Y)]−1. Then, YiYj|Y−(i,j) if and only if ωij = 0. Thus, estimating E is equivalent to estimating the set {(i, j) ∈ Γ × Γ: ωij = 0}. Conventional Gaussian graphical models focus on maximum likelihood estimation of ωij given the knowledge of E. In sparse estimation of graphical models, however, E is itself estimated by sparse regularization, such as lasso and adaptive lasso (Tibshirani 1996; Zou 2006).

The conditional graphical model, with which we are concerned, is motivated by the analysis of gene networks and the regulating effects of DNA markers. Let X1, …, Xq represent the genetic markers at q locations in a genome, and let Y1, …, Yp represent the expression levels of p genes. The objective is to infer how the q genetic markers affect the expression levels of the p genes and how these p genes affect each other. Since some markers may have regulating effects on more than one gene, the connections among the genes are of two kinds: the connections due to shared regulation by the same marker, and the innate connections among the genes aside from their shared regulators. In this setting, we are interested in identifying the network of genes after removing the effects from shared regulations by the markers.

The situation is illustrated by Figure 1, in which X represents a single marker, and Y1, Y2, Y3 represent the expressions of three genes. If we consider the marginal distribution of the random vector (Y1, Y2, Y3), then there are two (undirected) edges in the unconditional graphical model: 1 ↔ 2 and 2 ↔ 3, as represented by the solid and the dotted line segments. However, if we condition on the marker and consider the conditional distribution of Y1, Y2, Y3|X, then there is only one (undirected) edge, 2 ↔ 3, in the conditional graphical model, as represented by the solid line segment.

Figure 1.

Figure 1

Unconditional and conditional graphical models.

In mathematical terms, a conditional graphical model can be represented by

YiYj{Y-(i,j),X}for(i,j)E, (2)

where X = (X1, …, Xq). Compared with the unconditional graphical model, here we have an additional random vector X, whose effects we would like to remove when constructing the network for Y1, …, Yp. The conditional graphical model (2) was introduced by Lafferty, McCallum, and Pereira (2001) under the name “conditional random field.” However, in that article, the graph Inline graphic = (Γ, E) was assumed known and maximum likelihood was used to estimate the relevant parameters.

Our strategy for estimating the conditional graphical model (2) is as follows. In the first step, we propose a class of flexible and easy-to-implement initial (nonsparse) estimators for the conditional variance matrix Σ = var(Y | X), based on a conditional variance operator between the reproducing kernel Hilbert spaces (RKHS) of X and Y. In the second step, we incorporate the nonsparse estimators with two types of sparse penalties, the lasso and the adaptive lasso, to obtain sparse estimators of the conditional precision matrix Σ−1.

We have chosen RKHS as our method to estimate Σ for several reasons. First, it does not require a rigid regression model for Y versus X. Second, the dimension of X only appears through a kernel function, so that the dimension of the largest matrix we need to invert is the sample size n, regardless of the dimension of X. This feature is particularly attractive when we deal with a large number of predictors. Finally, RKHS provides a natural mechanism to impose regularization on regression. Here, it is important to realize that our problem involves two kinds of regularization: one for the components of var(Y | X) and the other for the regression of Y on X. Considering the nature of our problem, the former must be sparse, but the latter need not be. Indeed, since estimating the regression parameter is not our purpose, it seems more natural to introduce the regularization for regression through RKHS than to try to parameterize the regression and then regularize the parameters.

The rest of the article is organized as follows. In Section 2, we introduce a conditional variance operator in RKHS and describe its relation with the conditional gaussian graphical model. In Section 3, we derive two RKHS-based estimators of conditional variance Σ. In Section 4, we subject the RKHS estimators to sparse penalties to estimate the conditional precision matrix Θ = Σ−1. In Sections 5, 6, and 7, we establish the asymptotic properties of the sparse estimators for a fixed dimension p. In Section 8, we derive the convergence rate of the sparse estimators when p goes to infinity with the sample size. In Section 9, we discuss some issues involved in implementation. In Sections 10 and 11, we investigate and explore the performance of the proposed methods through simulation and data analysis.

2. CONDITIONAL VARIANCE OPERATOR IN RKHS

We begin with a formal definition of the conditional Gaussian graphical model. Throughout this article, we use Inline graphic to denote expectation to avoid confusion with the edge set E. We use ℙ to denote probability measures.

Definition 1

We say that (X, Y) follows a conditional Gaussian graphical model (CGGM) with respect to a graph Inline graphic = (Γ, E) if

  1. relation (2) holds;

  2. Y | X ~ N( Inline graphic(Y | X), Σ) for some nonrandom, positive-definite matrix Σ.

Note that we do not assume a regression model for Y versus X. However, we do require that Y | X is multivariate normal with a constant conditional variance, which is satisfied if Y = f (X) + ε, εX, and ε ~ N(0, Σ) for an arbitrary f.

Let ΩX ⊆ ℝq and ΩY ⊆ ℝp be the support of X and Y. Let κX: ΩX × ΩX → ℝ, κY: ΩY × ΩY → ℝ be positive-definite kernels and Inline graphic and Inline graphic be their corresponding RKHS’s. For further information about RKHS and the choices of kernels, see Aronszajn (1950) and Vapnik (1998). For two Hilbert spaces Inline graphic and Inline graphic, and a bounded bilinear form b: Inline graphic × Inline graphic → ℝ, there uniquely exist bounded linear operators A: Inline graphicInline graphic and B: Inline graphicInline graphic such that 〈f, BgInline graphic = 〈Af, gInline graphic = b(f, g) for any fInline graphic and gInline graphic. Applying this fact to the bounded bilinear forms

cov[f1(X),f2(X)],cov[g1(Y),g2(Y)],cov[f(X),g(Y)],

we obtain three bounded linear operators

XX:HXHX,YY:HYHY,XY:HYHX. (3)

Furthermore, ΣXY can be factorized as XX12RXYYY12, where RXY: Inline graphicInline graphic is a uniquely defined bounded linear operator. The conditional variance operator of Y, given X, is then defined as the bounded operator from Inline graphic to Inline graphic:

YYX=YY-YY12RYXRXYYY12.

This construction is due to Fukumizu, Bach, and Jordan (2009).

Let L2(ℙX) denote the class of functions of X that are squared integrable with respect to ℙX, Inline graphic + ℝ denote the set of functions {h + c: hInline graphic, c ∈ ℝ}, and cl(·) denote the closure of a set in L2(ℙX). The next theorem describes how the conditional operator ΣYY | X uniquely determines the CGGM, and suggests a way to estimate Σ.

Theorem 1

Suppose

  1. Inline graphicL2(ℙX) and, for each i = 1, …, p, Inline graphic(Yi|X) ∈ cl( Inline graphic + ℝ);

  2. κY is the linear kernel: κY (a, b) = 1 + ab, where a, b ∈ ℝp.

Then,

={yr,YYXysHY:r,s=1,,p}. (4)

The assumption Inline graphicL2(ℙX) is satisfied if the function κX(x, x) belongs to L2(ℙX), which is a mild requirement.

Proof

By assumption 2, any member of Inline graphic can be written as αy for some α ∈ ℝp. Hence, by Proposition 2 of Fukumizu et al. (2009),

αy,YYX(αy)HY=inffHX+var[αY-f(X)]=E[var(αYX)]+inffHX+var{E[(αYX)-f(X)]}.

By assumption 1, for any ε > 0, there is an fInline graphic + ℝ such that var{ Inline graphic[(αY | X) − f (X)]} < ε. So, the second term on the right-hand side above is 0, and we have

αy,YYX(αy)HY=E[var(αYX)]. (5)

In the meantime, we note that

yi+yj,YYX(yi+yj)HY-yi-yj,YYX(yi-yj)HY=4yi,YYXyjHY,E[var(Yi+YjX)]-E[var(Yi-YjX)]=4E[cov(Yi,YjX)].

Applying Equation (5) to the left-hand sides of the above equations, we obtain

yi,YYXyjHY=E[cov(Yi,YjX)]=cov(Yi,YjX),

as desired.

Condition 1 in the theorem can be replaced by the stronger assumption that Inline graphic + ℝ is a dense subset of L2(PX), which is satisfied by some well known kernels, such as the Gaussian radial kernel. See Fukumizu et al. (2009).

3. TWO RKHS ESTIMATORS OF CONDITIONAL COVARIANCE

To construct a sample estimate of Equation (4), we need to represent operators as matrices in a finite-dimensional space. To this end, we first introduce a coordinate notation system, adopted from Horn and Johnson (1985, p. 31) with slight modifications. Let Inline graphic be a finite-dimensional Hilbert space and Inline graphic = {b1, …, bm}⊆ Inline graphic be a set of functions that span Inline graphic but may not be linearly independent. We refer to Inline graphic as a spanning system (as opposed to a basis). Any fInline graphic can be written as α1b1 + ··· + αmbm for some α = (α1, …, αm) ∈ ℝm. This vector is denoted by [f]Inline graphic, and is called a Inline graphic-coordinate of f. Note that [f]Inline graphic is not unique unless b1, …, bm are linearly independent. However, this does not matter because i=1m([f]B)ibi unique regardless of the form of [f]Inline graphic. The same reasoning also applies to the nonuniqueness of coordinates below.

Let Inline graphic and Inline graphic be finite-dimensional Hilbert spaces with spanning systems Inline graphic = {b11, …, b1m1}and Inline graphic = {b21, …, b2m2}. Let T : Inline graphicInline graphic be a linear operator. The ( Inline graphic, Inline graphic)-representation of T, denoted by Inline graphic[T]Inline graphic, is the m2 × m1 matrix

{([Tb1i]B2)j:j=1,,m2,i=1,,m1}.

Then, (Inline graphic [T] Inline graphic)[f] Inline graphic is a Inline graphic-coordinate of Tf. We write this relation as

[Tf]B2=([T]B2B1)[f]B1. (6)

Let Inline graphic be another finite-dimensional Hilbert space with a spanning system Inline graphic. Let T1: Inline graphicInline graphic and T2 : Inline graphicInline graphic be linear operators. Then,

[T2T1]B3B1=([T2]B3B2)([T1]B2B1). (7)

The equality means the right-hand side is a ( Inline graphic, Inline graphic)-representation of T2T1. Finally, if Inline graphic is a finite-dimensional Hilbert space with a spanning system Inline graphic and T : Inline graphicInline graphic is a self-adjoint and positive-semidefinite linear operator, then, for any c > 0,

B[Tc]B=([T]BB)c. (8)

Let (X1, Y1), …, (Xn, Yn) be independent copies of (X, Y), ℙn be the empirical measure based on this sample, and Inline graphic be the integral with respect to ℙn. Let

H^X=span{κX(·,Xi)-EnκX(·,X):i=1,,n},H^Y=span{κY(·,Yi)-EnκY(·,Y):i=1,,n}, (9)

where, for example, Inline graphicκX(·, X) stands for the function xInline graphicκX(x, X). We center the functions κX(·, Xi) because constants do not play a role in our development. Let Σ̂XX, Σ̂YY, and Σ̂XY be as defined in the last section but with Inline graphic, Inline graphic replaced by Inline graphic, Inline graphic, and cov replaced by the sample covariance. Let Inline graphic and Inline graphic denote the spanning systems in Equation (9). Let KX and KY represent the n × n kernel matrices {κX(Xi, Xj)} and {κY (Yi, Yj)}. For a symmetric matrix, A, let A represent its Moore-Penrose inverse. Let Qn = Inn −1Jn, In is the n × n identity matrix, and Jn be the n × n matrix whose entries are 1. To simplify notation, we abbreviate Qn by Q throughout the rest of the article. The following lemma crystallizes some known results, which can be proved by coordinate manipulation via formulas (6), (7), and (8).

Lemma 1

The following relations hold:

  1. Inline graphic[Σ̂ XX] Inline graphic = n−1QKXQ, Inline graphic[Σ̂YY]Inline graphic= n−1QKYQ;

  2. Inline graphic[Σ̂YX]Inline graphic= n−1QKXQ, Inline graphic[Σ̂XY]Inline graphic= n−1QKY Q;

  3. Inline graphic[Σ̂YY | X]Inline graphic = n−1[QKY Q − (QKXQ)(QKXQ) (QKY Q)].

When the dimension q of X is large relative to n, it is beneficial to use regularized version of (QKXQ). Here, we employ two types of regularization: the principal-component (PC) regularization and the ridge-regression (RR) regularization. Let A be a positive-semidefinite matrix with eigenvalues λ1, …, λ n and eigen-vectors v1, …, vn. Let ε ≥0. We call the matrix (A)ε=i=1nλi-1viviI(λi>ε) the PC-inverse of A. Note that (A)0=A. We call the matrix (A)ε=(A+εIn)-1 the RR-inverse of A. The following result can be verified by simple calculation.

Lemma 2

For any f1, f2Inline graphic, we have 〈 f1, f2Inline graphic = ([f1]Inline graphic)QKY Q[f2]Inline graphic.

We now derive the sample estimate of the conditional covariance matrix Σ= var(Y | X). Let DY denote the matrix (Y1, …, Yn).

Theorem 2

Let Σ̂YY | X: Inline graphicInline graphic be defined by the coordinate representation

[^YYX]BYBY=n-1[QKYQ-(QKXQ)(QKXQ)εn(QKYQ)], (10)

where * can be either † or ‡, KY is the linear kernel matrix, and {εn} is a sequence of nonnegative numbers. Then

{yi,^YYXyjH^Y}=n-1[DYQDY-DYTQ(QKXQ)(QKXQ)εnQDY]. (11)

Despite its appearance, the matrix (QKXQ)(QKXQ)ε is actually a symmetric matrix. One can also show that Equation (11) is a positive-semidefinite matrix. We denote the matrix (11) as Σ̂PC(εn) if * = †, and as Σ̂RR(εn) if * = ‡. In the following, ei represents a p-dimensional vector whose ith entry is 1 and other entries are 0. For an expression that represents a vector, such as [f]Inline graphic, let ([f]Inline graphic)i denote its ith entry.

Proof of Theorem 1

Note that, for an fInline graphic, [f]Inline graphic is any vector a ∈ ℝp such that f (y) = aQDY y. Because yi=eiy=ei(DYQDY)-1DYQDYy, we have

[yi]BY=DY(DYQDY)-1ei.

Then, by Lemma 2,

yi,^YYXyjHY=ei(DYQDY)-1DYQKYQ([^YYX]BYBY)DY(DYQDY)-1ej.

When κY is the linear kernel, KY=DYDY. Hence, the right-hand side reduces to

eiDYQ([^YYX]BYBY)DY(DYQDY)-1ej.

Now substitute Equation (10) into the above expression to complete the proof.

4. INTRODUCING SPARSE PENALTY

Let Σ̂ denote either of the RKHS estimators given by Equation (11). We are interested in the sparse estimation of Θ= Σ−1. Let

Ln(Θ)=-logdet(Θ)+tr(Θ^). (12)

This objective function has the same form as that used in unconditional Gaussian graphical models, such as those considered by Lauritzen (1996), Yuan and Lin (2007), and Friedman, Hastie, and Tibshirani (2008), except that the sample covariance matrix therein is replaced by the RKHS estimate of conditional covariance matrix.

To achieve sparsity in Θ, we introduce two types of penalized versions of the objective function (12):

lasso:ϒn(Θ)=Ln(Θ)+λnijθij, (13)
adaptivelasso:Λn(Θ)=Ln(Θ)+λnijθij-γθij, (14)

where, in Equation (14), γ is a positive number and {θ̃ij} is a n-consistent, nonsparse estimate of Θ. For more details about the development of these two types of penalty functions; see Tibshirani (1996), Zou (2006), Zhao and Yu (2006), and Zou and Li (2008). Yuan and Lin (2007) used lasso and nonnegative garrote (Breiman 1995) penalty functions for the unconditional graphical model, the latter of which is similar to adaptive lasso with γ = 1. In the following, we will write

Pn(Θ)=λnijθij,Πn(Θ)=λnijθij-γθij.

5. VON MISES EXPANSIONS OF THE RKHS ESTIMATORS

The sparse and oracle properties of the estimators introduced in Section 4 depend heavily on the asymptotic properties of the RKHS estimators Σ̂PC(εn) and Σ̂RR(εn). In this section, we derive their von Mises expansions (von Mises 1947). For simplicity, we base our asymptotic development on the polynomial kernel. That is, we let

κX(a,b)=(ab+1)r,r=1,2, (15)

For a matrix A and an integer k ≥ 2, let Ak be the k-fold Kronecker product A ⊗ ··· ⊗ A. For k = 1, we adopt the convention A⊗1 = A. For a k-dimensional array B = {bi1ik : i1, …, ik = 1, …, q}, we define the “vec half” operator as

vech(B)={bi1ik:1iki1q}.

This is a generalization of the vech operator for matrices introduced by Henderson and Searle (1979). It can be shown that there exists a matrix Gk,q, of full column rank, such that vec(B) = Gk,q vech(B). The specific form of Gk,q is not important to us. For k = 1, we adopt the convention vech(B) = B, G1,q = Iq. Let

U=(vech(X1)vech(Xr)),Ui=(vech(Xi1)vech(Xir)),andDU=(U1,,Un).

The estimators Σ̂PC(εn) and Σ̂RR(εn) involve n × n matrices QKXQ and (QKXQ)εn, which are difficult to handle asymptotically because their dimensions grow with n. We now give an asymptotically equivalent expression which only involves matrices of fixed dimensions. Let

=n-1[DYQDY-DYQDU(DUQDU)-1DUQDY]. (16)

Lemma 3

Suppose κX is the polynomial kernel (15) and var(U) is positive definite.

  1. If εn = o(n), then, with probability tending to 1, Σ̂PC(εn) = Σ̃;

  2. If εn=o(n12), then ^RR(εn)=+oP(n-12).

It is interesting to note that different choices of inversion requires different convergence rates for εn.

Proof

By simple computation, we find

(XiXj+1)r=1+k=1r(rk)[vech(Xik)]Gk,qGk,qvech(Xik).

Hence, KX=Jn+DUCDU where C=diag((r1)G1,qG1,q,,(rr)Gr,qGr,q). So,

QKXQ=QDUCDUQ. (17)

Let s be the dimension of U. Then, rank(QKXQ) = s. Let λ1 ≥ ··· ≥ λs > 0 be the nonzero eigenvalues and v1, …, vs be the corresponding vectors of this matrix. Since λ1,…, λs are also eigenvalues of n(n-1C12DUQDUC12), which converges in probability to the positive-definite matrix C12var(U)C12, we have λs = nc + oP(n) for some c > 0. This, together with εn = o(n), implies (QKXQ)εn=(QDUCDUQ) with probability tending to 1. Consequently, with probability tending to 1,

^PC(εn)=n-1[DYQDY-DYQ(QDUCDUQ)(QDUCDUQ)QDY]. (18)

Now it is easy to verify that if B ∈ ℝs×t is a matrix of full column rank and A ∈ ℝt×t is a positive-definite matrix, then

(BAB)(BAB)=B(BB)-1B. (19)

Thus, the right-hand side of Equation (18) is Σ̃, which proves part 1. To prove part 2, we first note that

(QKXQ)(QKXQ)εn=(QKXQ-i=1sλiεnλi+εnvivi)(QKXQ). (20)

Since the function −εn/(εn + λ) is increasing for λ> 0, we have

-εnλs+εn(QKXQ)(QKXQ)(QKXQ)(QKXQ)εn-(QKXQ)(QKXQ)0.

Hence,

0n-1DYQ[(QKXQ)(QKXQ)εn-(QKXQ)(QKXQ)]QDY-εnλs+εnn-1DYQ(QKXQ)(QKXQ)QDY, (21)

By Equation (19),

n-1DYQ(QKXQ)(QKXQ)QDY=(n-1DYQDU)(n-1DUQDU)-1(n-1DUQDY)Pcov(Y,U)[var(U)]-1cov(U,Y).

Hence, the right-hand side of Equation (21) is of the order oP(n-12). In other words,

^RR(εn)=n-1[DYQDY-DYQ(QDUCDUQ)(QDUCDUQ)QDY]+oP(n-12).

Here, we evoke Equation (19) again to complete the proof.

Note that when r = 1 and p < n, and εn = 0, Σ̂PC(εn) reduces to

n-1[DYQDY-DYQDX(DXQDX)-1DXQDY]=varn(Y)-covn(Y,X)[varn(X)]-1covn(X,Y),

where varn(·) and covn(·, ·) denote the sample variance and covariance matrices. This is exactly the sample estimate of the residual variance for linear regression.

Lemma 3 allows us to derive the asymptotic expansions of Σ̂PC(εn) and Σ̂RR(εn) from that of Σ̃, which is a (matrix-valued) function of sample moments. Let Inline graphic be a convex family of probability measures defined on ΩXY that contains all the empirical distributions ℙn and the true distribution ℙ0 of (X, Y). Let T : Inline graphic → ℝp×p be the following statistical functional:

var(Y)-cov(Y,U)[var(U)]-1cov(U,Y). (22)

In this notation, Σ̃ = T (ℙn). In the following, we use var and cov to denote the variance and covariance under ℙ0. For the polynomial kernel (15), the evaluation T (ℙ0) has a special meaning, as described in the next lemma. Its proof is standard, and is omitted.

Lemma 4

Suppose:

  1. Entries of Inline graphic (Y | X) are polynomials in X1, …, Xq of degrees no more than r;

  2. Y | X ~ N( Inline graphic (Y | X), Σ), where Σ is a nonrandom matrix.

Then, T (ℙP0) = Σ.

Let ℙα = (1 − α)ℙ0 + αn, where α ∈ [0, 1]. Let Dα denote the differential operator ∂/∂α, and let Dα=0 denote the operation of taking derivative with respect to α and then evaluating the derivative at α = 0. It is well known that if T is Hadamard differentiable with respect to the norm || · || in Inline graphic, then

T(n)=T(0)+Dα=0T(α)+oP(n-12). (23)

Since the functional T in our case is a smooth function of sample moments, it is Hadamard differentiable under very general conditions. See, for example, Reeds (1976), Fernholz (1983), Bickel et al. (1993), and Ren and Sen (1991). The next lemma can be verified by straightforward computation.

Lemma 5

Let V1 and V2 be square-integrable random vectors. Then,

Dα=0[covα(V1,V2)]=Enq(V1,V2)-Eq(V1,V2),

where q(V1, V2) = (V1Inline graphicV1)(V2Inline graphicV2).

We will adopt the following notational system:

μY=E(Y),μU=E(U),VY=var(Y),VU=var(U),VUY=cov(U,Y).

The next theorem gives the first-order von-Mises expansion of Σ̃.

Theorem 3

Suppose the functional in Equation (22) is Hadamard differentiable, and the conditions in Lemma 4 hold. Then,

=+EnM+oP(n-12), (24)

where M = M0Inline graphicM0 and

M0=[Y-μY-VYUVU-1(U-μU)]×[Y-μY-VYUVU-1(U-μU)].
Proof

We have

DαT(α)=Dα[varα(Y)]-Dα{covα(Y,U)×[varα(U)]-1covα(U,Y)}.

We now apply Lemma 5 to obtain

Dα=0{covα(Y,U)[varα(U)]-1covα(U,Y)}=[Enq(Y,U)-Eq(Y,U)]VU-1VUY+VYUDα=0×{[varα(U)]-1}VUY+VYUVU-1[Enq(U,Y)-Eq(U,Y)].

By the chain rule for differentiation and Lemma 5,

Dα=0{[varα(U)]-1}=-VU-1[Enq(U,U)-Eq(U,U)]VU-1.

Hence,

Dα=0T(α)=[Enq(Y,Y)-Eq(Y,Y)]-[Enq(Y,U)-Eq(Y,U)]VU-1VUY+VYUVU-1[Enq(U,U)-Eq(U,U)]VU-1VUY-VYUVU-1[Enq(U,Y)-Eq(U,Y)].

This can be rewritten as Inline graphicM0Inline graphicM0, where M0 is the matrix in the theorem.

Using this expansion, we can write down the asymptotic distribution of Σ̃.

Corollary 1

Suppose the functional in Equation (22) is Hadamard differentiable. Then,

nvec(-)DN(0,Ξ), (25)

where

Ξ=(Ip,-VYUVU-1)2var[(Y-μYU-μU)(Y-μYU-μU)](Ip-VU-1VUY)2.

By Lemma 3, Theorem 3, and Slutsky’s theorem, we arrive at the following von Mises expansions for the two RKHS estimators Σ̂PC(εn) and Σ̂RR(εn) of the conditional variance Σ.

Corollary 2

Suppose the functional in Equation (22) is Hadamard differentiable, and the conditions in Lemma 4 hold.

  1. If εn=o(n12), then Σ̂RR(εn) has expansion (24).

  2. If εn = o(n), then Σ̂PC(εn) has expansion (24).

Although in this article we have only studied the asymptotic distribution for the polynomial kernel, the basic formulation and analysis could potentially be extended to other kernels under some regularity conditions. We leave this to future research.

6. SPARSITY AND ASYMPTOTIC DISTRIBUTION: THE LASSO

In this section, we study the asymptotic properties of the sparse estimator based on the objective function ϒn(Θ) in Equation (13). Let Inline graphic denote the class of all p × p symmetric matrices. For a matrix AInline graphic, let σi (A) be the ith eigenvalue of A. Note that, for any integer r, we have

tr(Ar)=i=1pσir(A). (26)

Let En be the sample estimate of E; that is, En = {(i, j): θ̂ij ≠ 0}, where Θ̂ = {θ̂ij}is the minimizer of Equation (13). Following Lauritzen (1996), for a matrix A = {aij} and a graph Inline graphic = (Γ, E), let A( Inline graphic) denote the matrix that sets aij to 0 whenever (i, j) ∉ E. Let

p×p(G)={A(G):Ap×p},Sp×p(G)={A(G):ASp×p}.

We define a bivariate sign function as follows. For any numbers a, b, let

sign(a,b)=sign(a)I(a0)+sign(b)I(a=0).

The following lemma will prove useful. Its proof is omitted.

Lemma 6

The bivariate sign function sign(a, b) has the following properties:

  1. For any c1 > 0, c2 > 0, sign(c1a, c2b) = sign(a, b);

  2. For sufficiently small |b|, |a + b| − |a| = sign(a, b)b.

The next theorem generalizes Theorem 1 of Yuan and Lin (2007). It applies to any random matrix Σ̂ with expansion (24), including Σ̂PC(εn) and Σ̂RR(εn).

Theorem 4

Suppose (X, Y) follows CGGM with respect to a graph (Γ, E). Let Θ̂ be the minimizer of Equation (13), where nλnλ0>0 and Σ̂ having the expansion (24). Let W ∈ ℝp×p be a random matrix such that vec(W) is distributed as N[0, var(vec(M))]. Then, the following assertions hold.

  1. 0 < limn →∞ ℙ(En = E) < 1;

  2. For any ε > 0, there is a λ0 > 0 such that limn →∞ ℙ(En = E) > 1 − ε;

  3. n(Θ^-Θ0)Dargmin{Φ(Δ,W):ΔSp×p}, where
    Φ(Δ,W)=tr(ΔΔ/2+ΔW)+λ0ijsign(θ0,ij,δij)δij.

The inequality limn→∞ ℙ(En = E) > 0 implies that, as n → ∞, there is a positive probability to estimate a parameter as 0 when it is 0. A stronger property is that this probability tends to 1. For clarity, we refer to the former property as sparsity, and the latter as sparsistency (see Fan and Li 2001; Lam and Fan 2009). According to this definition, the two inequalities in part 1 mean that lasso is sparse but not sparsistent. Note that the unconstrained maximum likelihood estimate—and indeed any regular estimate—is not sparse. Part 2 means that, even though lasso is not sparsistent, we can make it as close to sparsistent as we wish by choosing a sufficiently large λ0. Part 3 gives the asymptotic distribution of n(Θ^-Θ0). It is not the same as that of the maximum likelihood estimate under the constraint θij = 0 for (i, j) ∈ Ec. That is, it does not have the oracle property. The proof of part 3 is similar to that of Theorem 1 in Yuan and Lin (2007) in the context of GGM. However, to our knowledge there were no previous results parallel to parts 1 and 2 for GGM. Knight and Fu (2000) contains some basic ideas for the asymptotics for lassotype estimators.

Proof of Theorem 2

We prove the three assertions in the order 3, 1, 2.

3. Let Θ0 be the true value of Θ, and let

Φn(Δ)=n[ϒn(Θ0+n-1/2Δ)-ϒn(Θ0)].

By an argument similar to Yuan and Lin (2007, Theorem 1), it can be shown that

n[Ln(Θ0+n-1/2Δ)-Ln(Θ0)]=tr(ΔΔ)/2+n1/2tr(ΔEnM),n[Pn(Θ0+n-1/2Δ)-Pn(Θ0)]=λ0ijsign(θ0,ij,δij)δij+o(1).

Since n1/2EnMDW, we have

Φn(Δ)Dtr(ΔΔ/2+ΔW)+λ0ijsign(θ0,ij,δij)δij=Φ(Δ,W).

Both Φn(Δ) and Φ(Δ, W) are strictly convex with probability 1. Applying Theorem 4.4 of Geyer (1994) we see that

argmin{Φn(Δ):ΔSp×p}Dargmin{Φ(Δ,W):ΔSp×p}.

However, by construction, if Δ̂ is the (almost surely unique) minimizer of Φn(Δ), then Δ̂ = n1/2(Θ̂Θ0). This proves part 3.

1. For a generic function f(t) defined on t ∈ ℝs, let tiL and tiR be the left and right partial derivatives with respect to the ith component of t. When f is differentiable with respect to ti, we write ti=tiL=tiR. Note that En = E if and only if Φ(Δ, W) is minimized within Inline graphic( Inline graphic). This happens if and only if

δijLΦ(Δ,W)0δijRΦ(Δ,W),(i,j)Γ×Γ,ij,andΔSp×p(G). (27)

Here, we only consider the cases ij because Δ is a symmetric matrix. Let

L(Δ,W)=tr(ΔΔ/2+ΔW),P(Δ,W)=λ0ijsign(θ0,ij,δij)δij.

Then, ∂L(Δ, W)/∂Δ = ΣΔΣ + W. For (i, j) ∈ E, P (Δ, W) is differentiable with respect to δij and ∂δij P(Δ, W) = λ0sign(θ0,ij). For (i, j) ∈ Ec, P (Δ, W) is not differentiable with respect to δij, but has left and right derivatives, given by δijLP(Δ,W)=-λ0 and δijRP(Δ,W)=λ0. Condition (27) now reduces to

{(Δ)ij+wij+sij=0,if(i,j)E,ij,(Δ)ij+wij[-λ0,λ0],if(i,j)Ec,ij, (28)

where ΔInline graphic( Inline graphic), sij = λ0sign(θ0,ij) if (i, j) ∈ E and ij and sij = 0 if i = j.

Now consider the event

G={{wij:ij}:Equation(28)issatisfiedforsomeΔSp×p(G)}.

We need to show that ℙ(G) > 1. Since Δ belongs to Inline graphic( Inline graphic), it has as many free parameters as there are equations in the first line of Equation (28). Since Σ is a nonsingular matrix, for any {wij: ij, (i, j) ∈ E}, there exists a unique Δ that satisfies the first line of Equation (28) and it is a linear function of {wij: (i, j) ∈ E, ij}. Writing this function as Δ({wij : (i, j) ∈ E, ij}), we see that Equation (28) is satisfied if and only if

[Δ({wμν:(μ,ν)E,μν})]ij+wij[-λ0,λ0],forij,(i,j)Ec.

Since the mapping

τ:{wij:ij}{2[Δ({wμν:(μ,ν)E,μν})]ij+2wij:ij,(i,j)Ec}

from ℝp(p+1) to ℝcard(Ec)/2 is continuous, the set τ−1[(−λ0, λ0)card(Ec)/2] is open in ℝp(p+1)/2. Furthermore, this open set is nonempty because if we let

wij=-[Δ({wμν:(μ,ν)E,μν})]ij,for(i,j)Ec,ij,

then τ ({wij : ij}) = 0. Because {wij : ij} has a multivariate normal distribution with a nonsingular covariance matrix, any nonempty open set in ℝp(p+1)/2 has positive probability. This proves limn→∞ ℙ(En = E) > 0.

Similarly, the set τ−1[(λ0, 3λ0)card(Ec)/2] is open in ℝp(p+1)/2, and if we let

wij=-[Δ({wμν:(μ,ν)E,μν})]ij+2λ0,for(i,j)Ec,ij,

then τ({wij: ij}) = (2λ0, …, 2λ0) ∈ (λ0, 3λ0)card(Ec)/2. Hence, the W-probability of τ−1[(λ0, 3λ0)card(Ec)/2] is positive, implying limn→∞ ℙ(En = E) < 1.

2. For each (i, j) ∈ Ec, ij, Let Uij = [ΣΔ ({wμν : (μ, ν) ∈ E, μν}) Σ]ij + wij. Then, for any η > 0, there is a λ0ij>0 such that (Uij[-λ0ij,λ0ij])>1-η. Let λ0=max{λ0ij:(i,j)Ec,ij}, ij}. Then, ℙ(Uij ∈ [−λ0, λ0]) > 1 − η. Hence,

limn(En=E)=(Uij[-λ0,λ0],(i,j)Ec,ij)1-(i,j)Ec,ij{1-(Uij[-λ0,λ0])}1-card(Ec)η/2.

This proves part 2 because η can be arbitrarily small.

7. SPARSISTENCY AND ORACLE PROPERTY: THE ADAPTIVE LASSO

We now turn to the adaptive lasso based on Equation (14). We still use Θ̂ = {θ̂ij} to denote the minimizer of Equation (14). For a matrix Δ = {δij} ∈ Inline graphic, and a set C ∈ Γ × Γ, let δC = {δij : (i, j) ∈ C, ij}, which is to be interpreted as a vector where index i moves first, followed by index j.

Theorem 5

Suppose that (X, Y) follows CGGM with respect to a graph (Γ, E). Let Θ̂ be the minimizer of Equation (14), where Σ̂ has expansion (24), and

limnn1/2λn=0,limnn(1+γ)/2λn=.

Suppose Θ̃ in Equation (14) is a n-consistent estimate of Θ0 with ℙ(θ̃ij ≠ 0) = 1 for (i, j) ∈ Ec. Then,

  1. limn→∞ ℙ(En = E) = 1;

  2. n(Θ^-Θ0)Dargmin{L(Δ,G);ΔSp×p(G)}.

Part 1 asserts that the adaptive lasso is sparsistent; part 2 asserts that it is asymptotically equivalent to the minimizer of L(Δ, G) when E is known. In this sense Θ̂ is oracle. The condition P (θ̃ij ≠ 0) = 1 means that θ̃ij is not sparse, which guarantees that |θ̃ij |γ is well defined. For example, we can use the inverse of one of the RKHS estimates as Θ̃.

Proof

Let ΔInline graphic and Ψn(Δ) = nn(Θ0 + n−1/2Δ) − Λn(Θ0)]. Consider the difference

n[Πn(Θ0+n-1/2Δ)-Πn(Θ0)]=λnnijθij-γ(θ0,ij+n-1/2δij-θ0,ij).

By Lemma 6, for large enough n, the right-hand side can be rewritten as

ijn1/2λnθij-γsign(θ0,ij,n-1/2δij)δij. (29)

If (i, j) ∈ E, then |θ̃ij |γ = OP(1). Because n1/2 λn → 0, the summand for such (i, j) converges to 0 in probability. Hence, Equation (29) reduces to

(i,j)Ecn1/2λnθij-γsign(δij)δij+oP(1)=(i,j)Ecn1/2λnθij-γδij+oP(1). (30)

If (i, j) ∉ E, then θ̃ij = OP(n−1/2), and hence

n1/2λnθij-γ=λnn(1-γ)/2n1/2θij-γ=λnn(1+γ)/2OP(1)P.

From this, we see that Equation (30) converges in probability to ∞ unless δEc = 0, in which case it converges in probability to 0. In other words,

n[Πn(Θ0+n-1/2Δ)-Πn(Θ0)]P{0,δEc=0,,δEc0,

This implies that

Ψn(Δ)DΨ(Δ,G),whereΨ(Δ,G)={L(Δ,G),δEc=0,,δEc0.

Since both Ψn(Δ) and Ψ(Δ, G) are convex and Ψ(Δ, G) has a unique minimum, by the epi-convergence results of Geyer (1994), we have

argmin{Ψn(Δ):ΔSp×p}Dargmin{Ψ(Δ,G):ΔSp×p}.

This proves part 2 because the right-hand side is, in fact, argmin{L(Δ, G) : ΔInline graphic( Inline graphic)}, and the left-hand side is n(Θ^-Θ0).

The function Ψ(Δ, W) is always minimized in a region of Δ in which it is not ∞. As a consequence, if Δ minimizes Ψ(Δ, W) over Inline graphic, then δEc = 0. Hence, ℙ(EnE) → 1. In the meantime, since Θ^-Θ0=OP(n-12), we have θ^ij=θ0,ij+OP(n-12). If (i, j) ∈ E, then θ0,ij ≠ 0. Thus, we see that ℙ(EnE) → 1. This proves part 1.

We now derive the explicit expression of the asymptotic distribution of Θ̂. Let G be the unique matrix in ℝp2×[card(E)+p]/2 such that for any ΔInline graphic( Inline graphic), vec(Δ) = GδE. For example, if p = 3 and E = {(1, 1), (2, 2), (3, 3), (1, 2), (2, 1), (1, 3), (3, 1)}, then δE = (δ11, δ21, δ31, δ22, δ33), and G is defined by

G=(100000000010100000001000100000010000000000001).

Corollary 3

Under the assumptions of Theorem 5,

nvec(Θ^-Θ0)DN(0,V),

where

V=G[G(2Ip)G]-1GΞG[G(2Ip)G)]-1G. (31)
Proof

Note that

tr(ΔΔ)=vec(Δ)vec(Δ)=vec(Δ)(2Ip)vec(Δ)tr(ΔW)=vec(W)vec(Δ).

Hence,

L(Δ,W)=δEG(2Ip)GδE/2+vec(W)GδE.

This is a quadratic function minimized by δE = −[G(Σ2Ip)G]−1Gvec(W). In terms of Δ, the minimizer is vec(Δ) = −G[G(Σ2Ip)G]−1Gvec(W). The corollary now follows from Theorem 5.

The asymptotic variance of n(Θ^-Θ0) can be estimated by replacing the moments in Equation (25) and (31) by their sample estimates.

8. CONVERGENCE RATE FOR HIGH-DIMENSIONAL GRAPH

In the last two sections, we have studied the asymptotic properties of our sparse CGGM estimators with the number of nodes p in the graph Inline graphic held fixed. We now investigate the case where p = pn tends to infinity. Due to the limited space, we shall focus on the lasso-penalized RKHS estimator with PC regularization. That is, the minimizer of ϒn(Θ) with Σ̂ in Equation (12) is taken to be the RKHS estimator Σ̂PC(εn) based on κX(a, b) = (1 + ab)r. Throughout this section, Θ̂ denotes this estimator. The large-pn convergence rate for the lasso estimator of the (unconditional) GGM has been studied by Rothman et al. (2008).

In this case, Σ, Θ, and Y should in principle be written as Σ(n), Θ(n), and Y(n) because they now depend on n. However, to avoid complicated notation, we still use Σ, Θ, and Y, keeping in mind their dependence on n. Following Rothman et al. (2008), we develop the convergence rate in Frobenius norm. Let ||·||1, ||·||F, and ||·|| be the L1-norm, Frobenius norm, and the L-norm of a matrix A ∈ ℝd1×d2:

A1=i=1d1j=1d2aij,AF=(i=1d1j=1d2aij2)1/2,A=maxaij,

where aij denotes the (i, j)th entry of A. Let ρ(A) denote the number of nonzero entries of A. For easy reference, we list some properties of these matrix functions in the following proposition.

Proposition 1

Let A ∈ ℝd1×d2, B ∈ ℝd2×d3. Then,

ABd2AB,tr(AB)A1B,A1ρ(A)AF.

The first inequality follows from the definition of ||·||; the second from Hölder’s inequality; the third from the Cauchy-Schwarz inequality. The last two inequalities were used in Rothman et al. (2008). Let ℕ = {1, 2, …}. Consider the array of random matrices: { Ai(n):i=1,,n, n ∈ ℕ}, where A(n) ∈ ℝpn×q, pn may depend on n but q is fixed. Let (A(n))rs denote the (r, s)th entry of A(n).

Lemma 7

Let { Ai(n):i=1,,n, n ∈ ℕ} be an array of random matrices in ℝpn×q, each of whose rows is an iid sample of a random matrix A(n). Suppose that the moment generating functions of (A(n))st, say φnst, are finite on an interval (−δ, δ), and their second derivatives are uniformly bounded over this interval for all s = 1,…, pn, t = 1,…, q, n ∈ ℕ. If pn → ∞, then En(A(n))-E(A(n))+OP(logpn/n).

Proof

Let μnst = Inline graphic(A(n))st. Since μnst1+φnst(0), there is C > 0 such that |μnst| ≤ C for all s, t, n. Let (B(n))st = (A(n))stμnst, and ψnst be the moment generating function of (B(n))st. Then, ψnst (τ) = eμnstτφnst (τ). For any a > 0,

(n(En(B(n))st)>a)=(enEn(B(n))st>ea)e-a[ψnst(n-1/2)]n.

By Taylor’s theorem and noticing that ψnst(0)=0, we have

ψnst(n-1/2)=1+ψnst(ξnst)/(2n)1+eCφnst(ξnst)/(2n)

for some 0 ≤ ξnstn−1/2. By assumption, there is C1 > 0 such that lim supn→∞ φ″ (ξnst) ≤ C1 for all s = 1,…, pn and t = 1,…, q. Hence,

[ψnst(n-1/2)]n[1+eCC1/(2n)]neeCC1/2C2.

Thus, we have limsupnP(nEn(B(n))st>a)e-aC2. By the same argument, we can show that limsupnP(nEn(B(n))st<-a)e-aC2. Therefore,

limsupnP(nEn(B(n))st>a)2e-aC2.

It follows that

limsupn(En(B(n))>cn)limsupns=1pnt=1q(nEn[(A(n))st>ncn)2qC2e-ncn+logpn.

In particular,

limsupn(En(B(n))>2logpn/n)qC2/pn0,

which implies the desired result.

In the following, we call any array of random matrices satisfying the conditions in Lemma 7 a standard array. We now establish the convergence rate of ||Θ̂Θ0||F. Let sn denote the number of nonzero off-diagonal entries of Θ0. Let Z = YμYE(Y | X). Let Xt and Ys denote the components of X and Y. Their powers are denoted by (Xt)r and (Ys)r. For a symmetric matrix A, let σmax(A) and σmin(A) denote the maximum and minimum eigenvalues of A.

Theorem 6

Let Θ̂ be the sparse estimator defined in the first paragraph of this section with λn ~ (log pn/n)1/2, εn = o(n), and κ(a, b) = (1 + ab)r. Suppose that (X, Y) follows a CGGM, and satisfies the following additional assumptions:

  1. Y, YU, and ZU are standard arrays of random matrices;

  2. for all n ∈ ℕ, σmax(Σ) < ∞ and σmin(Σ) > 0;

  3. pn → ∞ and pn (pn + sn)1/2(log pn)5/2 = o(n3/2);

  4. the fixed-dimensional matrix VU = var(U) is nonsingular;

  5. each component of Inline graphic(Y | X = x) is a polynomial in x1,…, xq of at most rth order.

Then, ||Θ̂Θ0||F = OP ([(pn + sn) log pn/n]1/2).

Note that we can allow pn(pn + sn)1/2 to get arbitrarily close to n3/2. This condition is slightly stronger than the corresponding condition in Rothman et al. (2008) for the unconditional case, which requires (pn + sn) log pn = o(n1/2). Also note that ||Θ̂Θ0||F is the sum, instead of average, of pn2 elements (roughly pn + sn nonzero elements). With this in mind, the convergence rate [(pn + sn) log pn/n]1/2 is quite fast. This is the same rate as that given in Rothman et al. (2008) for the unconditional GGM.

Proof of Theorem 3

Let rn = [(pn + sn) log pn/n]1/2, and Inline graphic = {ΔInline graphic: ||Δ||2 = Mrn} for some M > 0. Let,

Gn(Δ)=ϒn(Θ0+Δ)-ϒn(Θ0),

where

ϒn(Θ)=-logdet(Θ)+tr(Θ^PC(εn))+λnijθij. (32)

Then, Θ̂ minimizes L(Θ) if and only if Δ̂ = Θ̂Θ0 minimizes Gn(Δ). As argued by Rothman et al. (2008), since Gn(Δ) is convex in Δ and Gn(Δ̂) ≤ 0, the minimizer Δ̂ resides within the sphere Inline graphic if Gn(Δ) is positive and bounded away from 0 on this sphere. That is, it suffices to show

(inf{Gn(Δ):ΔDn}>0)1.

The proof of Lemma 3 shows, in the context of fixed p, that (QKXQ)εn=(QDUCDUQ) with probability tending to 1 if εn = o(n). This result still holds here because the dimension q of X remains fixed. Consequently ℙ(Σ̂PC(εn) = Σ̃) → 1, where Σ̃ is as defined in Equation (16) but now its dimension increases with n. Thus, we can replace the Σ̂PC(εn) in Equation (32) by Σ̃. Let

μ^U=En(U),V^YU=n-1DYQDU,V^U=n-1DUQDU.

Let μ^YU-V^YUV^U-1(U-μ^U), and μYU=E(YX)=VYUVU-1(U-μU). Then,

=En(Z+μYU-μ^YU+μY-μ^Y)(Z+μYU-μ^YU+μY-μ^Y).

The term μY|Uμ̂Y|U can be further decomposed as ZI + ZII + ZIII, where

ZI=-(V^YU-VYU)V^U-1(U-μ^U),ZII=VYU[VU-1(U-μU)-V^U-1(U-μ^U)],ZIII=μY-μ^Y.

The function Gn(Δ) can now be rewritten as

Gn(Δ)=Ln(Θ0+Δ)-Ln(Θ0)+v[I,II,III]tr(ΔEn(ZZν))+v{I,II,III}tr(ΔEn(ZvZ))+v{I,II,III}τ{I,II,III}tr(ΔEn(ZνZτ)),

where Ln(Θ)=tr(ΘEn(ZZ))-logdet(Θ)+λnijθij. Since Z ~ N(0, Σ), we can use the same argument in the proof of Theorem 1 in Rothman et al. (2008) to show that

(inf{Ln(Θ0+Δ)-Ln(Θ0):ΔDn}>0)1.

Thus, our theorem will be proved if we can show that

sup{tr(ΔEn(A)):ΔDn}=oP(1)

for A being any one of the following eight random matrices

ZZν,ZνZ,ZνZτ,ν,τ=I,II,III. (33)

By Proposition 1, inequalities (2) and (3), we have, for ΔInline graphic,

tr(ΔEn(A))En(A)Δ1En(A)ρ(Δ)Δ2MEn(A)pnrn.

Thus, it suffices to show that,

En(A)pn(pn+sn)1/2(logpn)1/2n-1/2=oP(1). (34)

Since || Inline graphic(A)|| = || Inline graphic(A)||, we only need to consider the following A:

ZZI,ZZII,ZZIII,ZIZI,ZIZII,ZIZIII,ZIIZII,ZIIZIII,ZIIIZIII.

From the definitions of Z and ZI, we have

En(ZZI)=-En[Z(U-μU)]V^U-1(V^UY-VUY)-μ^Z(μU-μ^U)V^U-1(V^UY-VUY), (35)

where μ̂Z = Inline graphic(Z). By the first inequality of Proposition 1,

En(ZZI)q2En[Z(U-μU)]×V^U-1V^UY-VUY+q2μ^Z(μU-μ^U)×V^U-1V^UY-VUY. (36)

Since ZX, we have E[Z(UμU)] = 0. Hence, by Lemma 7,

En[Z(U-μU)]=OP(n-12logpn). (37)

Similarly,

En[(Y-μY)(U-μU)-VYU]=OP(n-12logpn),μ^Y-μY=OP(n-12logpn). (38)

Since μ̂UμU has a fixed-dimension finite-variance matrix, by the central limit theorem,

μ^U-μU=OP(n-1/2). (39)

By definition,

V^YU-VYU=En[(Y-μY)(U-μU)-VYU]-(μ^Y-μY)(μ^U-μU).

By Proposition 1, first inequality,

V^YU-VYUEn[(Y-μY)(U-μU)-VYU]+μ^Y-μYμ^U-μU.

Substituting Equations (38) and (39) into the above inequality, we find

V^YU-VYU=OP(n-12logpn). (40)

Since Z is multivariate normal whose components have means 0 and bounded variance, it is a standard array. Hence,

μ^Z=OP(n-12logpn). (41)

Since the dimension of U is fixed, its entries have finite variances, and VU is nonsingular, we have, by the central limit theorem,

V^U-1=OP(1). (42)

Substituting Equations (37), (39), (40), (41), and (42) into Equation (36), we find that

En(ZZI)=OP(n-1(logpn)2),

which, by condition (3), satisfies Equation (34).

The order of magnitudes of the rest of the three terms can be derived similarly. We present the results below, omitting the details:

En(ZZII)=OP(n-1logpn),En(ZZIII)=OP(n-1(logpn)2),En(ZIZI)=OP(n-1(logpn)2),En(ZIZII)=OP(n-1logpn),En(ZIZIII)=OP(n-3/2(logpn)2),En(ZIIZII)=OP(n-1),En(ZIIZIII)=OP(n-3/2logpn),En(ZIIIZIII)=OP(n-1(logpn)2).

By condition (3), all of these terms satisfy the relation in Equation (34).

9. IMPLEMENTATION

In this section, we address two issues in implementation: the choice of the tuning parameter and the minimization of the objective functions (13) and (14). For the choice of the tuning parameter, we use a BIC-type criterion (Schwarz 1978) similar to that used in Yuan and Lin (2007). Let Θ̂(λ) = {θ̂ij(λ) : i, j ∈ Γ} be the lasso or the adaptive lasso estimate of Θ0 in the conditional graphical model for a specific choice of λ of the tuning parameter. Let En(λ) = {(i, j) : θ̂ij(λ) ≠ 0}, and

BIC(λ)=-logdet[Θ^(λ)]+tr[Θ^(λ)n]+logncard[En(λ)]+p2n.

The tuning parameter is then chosen to be

λ^BIC=argmin{BIC(λ):λ(0,)}.

Practically, we evaluate this criterion on a grid of points in (0, ∞) and choose the minimizer among these points.

For the minimization of Equations (13) and (14), we follow the graphical lasso procedure proposed by Friedman et al. (2008), but with the sample covariance matrix therein replaced by the RKHS estimates, Σ̂PC(εn) or Σ̂RR(εn), of the conditional covariance matrix Σ. The graphical lasso (glasso) procedure is available as a package in the R language.

10. SIMULATION STUDIES

In this section, we compare the sparse estimators for the CGGM with the sparse estimators of the GGM, with the maximum likelihood estimators of the CGGM, and with two naive estimators. We also explore several reproducing kernels and investigate their performances for estimating the CGGM.

10.1 Comparison With Estimators for GGM

We use three criteria for this comparison:

  1. False positive rate at λ̂BIC. This is defined as the percentage of edges identified as belonging to E when they are not; that is,
    FP=card{(i,j):i>j,θ0,ij=0,θ^ij0}/card{(i,j):i>j,θ0,ij=0}.
  2. False negative rate at λ̂BIC. This is defined as the percentage of edges identified as belonging to Ec when they are not; that is,
    FN=card{(i,j):i>j,θ0,ij0,θ^ij=0}/card{(i,j):i>j,θ0,ij0}.
  3. Rate of correct paths. The above two criteria are both specific to the tuning method (in our case, BIC). To assess the potential capability of an estimator of E, independently of the tuning methods used, we use the percentage of cases where E belongs to the path {En(λ) : λ ∈ (0, ∞)}, where En(λ) is an estimator of E for a fixed λ. We write this criterion as PATH.

Example 1

This is the example, illustrated in Figure 1, in which p = 3, q = 1, and (X, Y) satisfies CGGM with E = {(1, 1), (2, 2), (3, 3), (2, 3), (3, 2)}. The conditional distribution of Y | X is specified by

Y=βX+ε, (43)

where β = (β1, β2, 0), X ~ N(0, 1), and ε ~ N(0, Σ), Xε, and

-1=Θ=(500042.5302.532).

For each simulated sample, β1 and β2 are generated, independently, from the uniform distribution defined on the set (−6, −3) ∪ (3, 6). We use two sample sizes n = 50, 100. The linear kernel κX(a, b) = 1 + ab is used for the initial RKHS estimate. The results are presented in the first three rows of Table 1. Entries in the table are the means calculated across 200 simulated samples.

Table 1.

Comparison of graph estimation accuracy among the lasso and the adaptive lasso estimators of GGM and CGGM for Example 1, Example 2 (including three scenarios), and Example 3. “ALASSO” means adaptive lasso

Example/scenario Criteria n = 50
n = 100
LASSO
ALASSO
LASSO
ALASSO
GGM CGGM GGM CGGM GGM CGGM GGM CGGM
EX1 FP 1 0.51 1 0.15 1 0.40 1 0.05
FN 0 0 0 0 0 0 0 0
PATH 0 1 0 1 1 1 0 1
EX2-SC1 FP 0.84 0.02 0.54 0.06 0.93 0.01 0.67 0.02
FN 0 0 0 0 0 0 0 0
PATH 0 1 0.01 0.63 0 1 0.98 1
EX2-SC2 FP 0.98 0.55 0.96 0.31 1 0.51 1 0.22
FN 0.06 0.01 0.15 0.02 0.02 0 0.08 0
PATH 0 0.57 0 0.71 0 0.79 0 0.98
EX2-SC3 FP 0.71 0.68 0.18 0.19 0.75 0.77 0.10 0.11
FN 0 0 0.01 0.02 0 0 0 0
PATH 0 0.43 0.80 0.52 0 0.56 1 0.87
EX3 FP 0.79 0.23 0.41 0.09 0.83 0.16 0.59 0.03
FN 0 0 0 0 0 0 0 0
PATH 0 0.94 0.95 0.99 0 1 1 1

The table indicates that the unconditional sparse estimators have much higher false positive rate than false negative rate. This is because they tend to pick up connections among the components of Y that are due to X. In comparison, the conditional sparse estimators (both lasso and adaptive lasso) can successfully remove the edges effected by X, resulting in more accurate identification of the graph.

Example 2

In this example, we consider three scenarios in which the effect of an external source on the network varies in degree, resulting in different amounts of gain achievable by a conditional graphical model.

We still assume the linear regression model (43), but with p = 5. The first scenario is shown in the top two panels in Figure 2, where all the components of β are nonzero and Σ diagonal. In this case, the conditional graph is totally disconnected (left panel); whereas the unconditional graph is a complete graph (right panel). Specifically, the parameters are

Figure 2.

Figure 2

Conditional and unconditional graphical models in Examples 2 and 3. Left panels: the conditional graphical models. Right panels: the corresponding unconditional graphical models. Black nodes indicate response variables; gray nodes are the predictors. Edges in the conditional and unconditional graphs are indicated by black lines; regression of Y on X is indicated by directed gray lines with arrows. (The online version of this figure is in color.)

=Θ=I5,β=(0.656,0.551,0.757,0.720,0.801).

The panels in the third row of Figure 2 represent the other extreme, where only Y1 is related to X, and each Yi for i ≠ 1 shares an edge with Y1 but has no other edges. In this case, the conditional and unconditional graphs are identical. The parameters are specified as follows. The first row (and first column) of Θ is (6.020, −0.827, −1.443, −1.186, −0.922); the remaining 4 × 4 block is I4. The first entry of β is 0.656, and rest entries are 0. Between these two extremes is scenario 2 (second row in Figure 2), where the conditional and unconditional graphs differ only by one edge: 1 ↔ 4. The parameters are

Θ=(4.487-1.186-1.44300-1.1861.464-0.6680.752-0.681-1.443-0.6681.963-1.0840.98100.752-1.0842.220-1.1050-0.6810.981-1.1051),β=(0.6560.55100.6010).

For each scenario, we generate 200 samples of sizes n = 50, 100 and compute the three criteria across the 200 samples. The results are presented row 4 through row 12 of Table 1. We see significant improvements of the rates of correct identification of the graphical structure by the sparse estimator of CGGM whenever the conditional graph differs from the unconditional graph. Also note that, for scenario 3, where the two graphs are the same, the adaptive lasso estimator for GGM performs better than the adaptive lasso estimator for CGGM. This is because the latter needs to estimate more parameters.

Example 3

In this example, we investigate a situation where the edges in the unconditional graphical model have two external sources. We use model (43) with p = 5, q = 2, X ~ N(0, I2),

Θ=(1.683-0.827000-0.8271000001.7220.8490000.8491000001),β=(0.656000.5510.757000.72000.800).

The results are summarized in the last three rows of Table 1, from which we can see similar improvements by the sparse CGGM estimators.

10.2 Comparison With Maximum Likelihood Estimates of CGGM

In Section 7, we showed that the adaptive lasso estimate for the CGGM possesses oracle property; that is, its asymptotic variance reaches the lower bound among regular estimators when the graph Inline graphic is assumed known. Hence, it makes sense to compare adaptive lasso estimate with the maximum likelihood estimate of Θ under the constraints θij = 0, (i, j) ∉ E, which is known to be optimal among regular estimates. In this section, we make such a comparison, using all three examples in Section 10.1. As a benchmark, we also compare these two estimates with the maximum likelihood estimate under the full model, which for the linear kernel is [Σ̂PC(0)]. For this comparison we use the squared Frobenius norm Θ^-Θ0F2, which characterizes the closeness of two precision matrices rather than that of graphs.

Table 2 shows that the adaptive lasso estimator is rather close to the constrained MLE, with the unconstrained MLE trailing noticeably behind. In three out of five cases, the constrained MLE performs better than the adaptive lasso, which is not surprising because, although the two estimators are equivalent asymptotically, the former employs the true graphical structure unavailable for adaptive lasso, making it more accurate for the finite sample. When the errors of adaptive lasso are lower than the constrained MLE, the differences are within the margins of error.

Table 2.

Comparison of parameter estimation accuracy among adaptive lasso, and unconstrained and constrained MLE for CGGM. Entries are of the form a ± b, where a is the mean, and b the standard deviation, of criterion Θ^-Θ0F2 computed from 200 simulated samples

Example/scenario ALASSO MLE
Unconstrained Constrained
EX1 1.458 ± 0.125 2.256 ± 0.171 1.575 ± 0.155
EX2-SC1 0.142 ± 0.008 0.400 ± 0.018 0.122 ± 0.007
EX2-SC2 1.858 ± 0.120 2.294 ± 0.148 1.663 ± 0.116
EX2-SC3 1.147 ± 0.071 2.139 ± 0.186 0.969 ± 0.081
EX3 0.303 ± 0.016 0.773 ± 0.555 0.327 ± 0.275

10.3 Exploring Different Reproducing Kernels

In this section, we explore three types of kernels for RKHS

polynomial(PN):κX(a,b)=(ab+1)r,Gaussradialbasis(RB):κX(a,b)=exp(-γa-b2),rationalquadratic(RQ):κX(a,b)=1-a-b2/×(a-b2+c), (44)

and investigate their performances as initial estimates for lasso and adaptive lasso. These kernels are widely used for RKHS (Genton 2001). For the CGGM, we use a nonlinear regression model with four combinations of dimensions: q = 10, 20, p = 50, 100. The nonlinear regression model is specified by

Yi={(β1X+1)2+εi,i=1,3,6,8,,p-4,p-2,(β2X)2+εi,i=2,4,5,7,9,10,,p-3,p-1,p, (45)

where β1 and β2 are q-dimensional vectors

β1=(1,,1q/2,0,,0q/2),β2=(0,,0q/2,1,-1,,(-1)q/2+1q/2).

The distribution of (ε1, …, εp) is multivariate normal with mean 0 and precision matrix

(Γ00Γ)

where each Γ is the precision matrix in Example 3.

The following specifications apply throughout the rest of Section 10: γ = 1/(9q) for RB, c = 200 for RQ, and r = 2 for PN (because the predictors in Equation (45) are quadratic polynomials); the RKHS estimator Σ̂PC(εn) is used as the initial estimator for lasso and adaptive lasso, where εn are chosen so that the first 70 eigenvectors of QKXQ are retained. Ideally, the kernel parameters γ, c, and εn should be chosen by data-driven methods such as cross-validation. However, this is beyond the scope of the present article and will be further developed in a future study. Our choices are based on trial and error in pilot runs. Our experience indicates that the sparse estimators perform well and are reasonably stable when εn is chosen so that 10% ~ 30% of the eigenvectors of QKXQ are included. The sample size is n = 100 and the simulation is based on 200 samples. To save computing time we use the BIC to optimize λn for the first sample and use it for the rest 199 samples.

In Table 3, we compare the sparse estimators lasso and adaptive lasso, whose initial estimates are derived from kernels in Equation (44), with the full and constrained MLEs. The full MLE is computed using the knowledge that the predictor is a quadratic polynomial of x1, …, xp. The constrained MLE uses, in addition, the knowledge of the conditional graph Inline graphic; that is, the positions of the zero entries of the true conditional precision matrix.

Table 3.

Exploration of different reproducing kernels. Entries are Θ^-Θ0F2 averaged over 200 simulation samples

p q LASSO
ALASSO
MLE
PN RB RQ PN RB RQ Full Constrained
50 10 1.84 6.88 8.14 1.84 2.31 2.71 29.41 3.82
20 11.03 11.03 11.03 17.15 17.14 17.14 299.28 94.18
100 10 5.96 20.07 24.06 3.35 4.33 5.17 171.32 7.73
20 20.06 20.05 20.05 29.10 28.30 28.31 1787.32 186.39

Table 3 shows that in all cases the sparse estimators perform substantially better than the full MLEs. For q = 10, the adaptive lasso estimates based on all three kernels also perform better than both the full and constrained MLEs; whereas the accuracy of most of the lasso estimates are between the full and the constrained MLEs. For q = 20, all sparse estimators perform substantially better than both the full and the constrained MLEs. From these results, we can see the effects of two types of regularization: the sparse regularization of the conditional precision matrix and the kernel-PCA regularization for the predictor. The first regularization counteracts the increase in the number of parameters in the conditional precision matrix as p increases, and the second counteracts the increase in the number of terms in a quadratic polynomial as q increases, both resulting in substantially reduced estimation error.

10.4 Comparisons With Two Naive Estimators

We now compare our sparse RKHS estimators for the CGGM with two simple methods: the linear regression and the simple thresholding.

10.4.1 Naive Linear Regression

A simple estimate of CGGM is to first apply multivariate linear regression of Y versus X, regardless of the true regression relation, and then apply a sparse penalty to the residual variance matrix. To make a fair comparison with linear regression, we consider the following class of regression models:

Y=(1-a)(βX)2/4+a(βX)+ε,0a1. (46)

This is a convex combination of a linear model and a quadratic model: it is linear when a = 1, quadratic when a = 0, and a mixture of both when 0 < a < 1. The distribution of ε is as specified in Section 10.3. The fraction 1/4 in the quadratic term in Equation (46) is introduced so that the linear and quadratic terms have the similar signal-to-noise ratios. In Table 4, we compare the estimation error of the CGGM based on Equation (46) by sparse linear regression and by sparse RKHS estimator using the adaptive lasso as penalty. We take q = 10, p = 50, and n = 500. The Gauss radial basis is used for the kernel method, with tuning parameters γ and εn being the same as specified in Section 10.3.

Table 4.

Comparison with linear regression. Entries are $\|\hat{\Theta} - \Theta_0\|_F^2$ averaged over 200 samples

a        0       0.2     0.4     0.6     0.8     1
Linear   10.49   10.48   10.48   10.42   10.10   0.75
Kernel   1.56    2.13    1.77    1.72    1.67    1.56

We see that the sparse RKHS estimate performs substantially better in all cases except a = 1, where Equation (46) is exactly a linear model. This suggests that the linear-regression sparse estimate of the CGGM is rather sensitive to nonlinearity: even a small nonlinear component in the mixture makes the kernel method preferable. In comparison, the sparse RKHS estimate is stable and accurate across different types of regression relations.

10.4.2 Simple Thresholding

A naive approach to sparsity is to drop small entries of a matrix. For example, we can estimate Θ by setting to zero the entries of $\hat{\Sigma}_{\mathrm{PC}}^{-1}$ whose absolute values are smaller than some τ > 0. Let Θ̃(τ) denote this estimator. Using the example in Section 10.3 with p = 50, q = 10, and n = 500, we now compare the sparse RKHS estimators with Θ̃(τ). The curve in Figure 3 is the error $\|\tilde{\Theta}(\tau) - \Theta_0\|_F^2$ versus τ ∈ [0, 1]. Each point on the curve is the error averaged over 200 simulated samples. The estimate Σ̂PC is based on the Gaussian radial basis kernel, whose tuning parameters γ and εn are as given in Section 10.3. The two horizontal lines in Figure 3 represent the errors of the lasso and the adaptive lasso estimators based on Σ̂PC, which are read off from Table 3.
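A minimal R sketch of this thresholding rule follows (our own illustration; Sigma_hat stands in for Σ̂PC and Theta0 for the true precision matrix, both assumed available):

    # Hard-threshold the inverse of an initial covariance estimate:
    # entries of Sigma_hat^{-1} with absolute value below tau are set to zero.
    threshold_precision <- function(Sigma_hat, tau) {
      Theta_tilde <- solve(Sigma_hat)
      Theta_tilde[abs(Theta_tilde) < tau] <- 0
      Theta_tilde
    }

    # Squared Frobenius error against the truth, traced over a grid of
    # thresholds as in Figure 3.
    frob_error <- function(A, B) sum((A - B)^2)
    # taus   <- seq(0, 1, by = 0.02)
    # errors <- sapply(taus, function(t)
    #   frob_error(threshold_precision(Sigma_hat, t), Theta0))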

Figure 3. Comparison of lasso, adaptive lasso, and simple thresholding.

The figure shows that the simple thresholding estimate, even at its best threshold, does not perform as well as either of our sparse estimates. Note that in practice Θ0 is unknown, so the optimal τ cannot be obtained by minimizing the curve in Figure 3. With this in mind, we expect the actual gap between the thresholding estimate and the sparse estimates to be even greater than that shown in the figure.

11. NETWORK INFERENCE FROM eQTL DATA

In this section, we apply our sparse CGGM estimators to two subsets of an expression quantitative trait loci (eQTL) dataset to infer gene networks. The dataset comes from an F2 intercross between the inbred lines C3H/HeJ and C57BL/6J (Ghazalpour et al. 2006). It contains 3,421 transcripts and 1,065 markers from the liver tissues of 135 female mice (n = 135). The purpose of our analysis is to identify direct gene interactions by fitting the CGGM to the eQTL dataset. Although a gene network can be inferred from expression data alone, such a network would contain edges due to confounders such as shared genetic causal variants. The marker data available in the eQTL dataset allow us to remove these confounded edges by conditioning on the genomic information.

We restrict our attention to subsets of genes, partly to accommodate the small sample size. In the eQTL analysis tradition, subsets of genes can be identified by two methods: co-mapping and co-expression. For co-mapping, each gene expression trait is mapped to markers, and the transcripts that map to the same locus are grouped together. For co-expression, highly correlated gene expressions are grouped together.

We first consider the subset of genes identified by co-mapping. It has been reported (Neto, Keller, Attie, and Yandell 2010) that 14 transcripts map to a marker on chromosome 2 (at 55.95 cM). Because this locus is linked to many transcripts, it is called a hot-spot locus, and this marker should clearly be included as a covariate in the CGGM. In addition, we include a marker on chromosome 15 (at 76.65 cM) as a covariate, because it is significantly linked to the gene Apbb1ip (permutation p-value < 0.005) after conditioning on the effect of the marker on chromosome 2. The transcript mapping is performed using the qtl package in R (Broman, Wu, Sen, and Churchill 2003).
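For readers wishing to reproduce this kind of conditional single-trait scan, a hedged sketch using the qtl package is given below (our own illustration, not the authors' script; the cross object, the hot-spot marker name, and the permutation count are placeholders):

    library(qtl)

    # cross: an R/qtl cross object built with read.cross() from the eQTL data,
    # whose phenotypes include the expression trait of interest (placeholder).
    cross <- calc.genoprob(cross, step = 1)

    # Genotype at the chromosome-2 hot-spot marker, used as an additive
    # covariate so that the reported linkage is conditional on its effect.
    g2 <- pull.geno(cross)[, "hotspot_marker"]      # placeholder marker name

    ph    <- find.pheno(cross, "Apbb1ip")
    out   <- scanone(cross, pheno.col = ph, addcovar = g2, method = "hk")
    operm <- scanone(cross, pheno.col = ph, addcovar = g2,
                     method = "hk", n.perm = 1000)  # permutation test
    summary(out, perms = operm, alpha = 0.005, pvalues = TRUE)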

The GGM detects 52 edges; the CGGM detects 34 edges, all of which are also detected by the GGM. The two graphs are presented in the upper panel of Figure 4, where edges detected by both the GGM and the CGGM are drawn as black solid lines, and edges detected by the GGM alone are drawn as blue dotted lines. In the left panel of Table 5, we compare the connectivity of the genes in each graph. Among the 14 transcripts, the hot-spot locus lies within ±100 kb of the location of Pscdbp. Interestingly, in the CGGM this cis-regulated transcript has the lowest connectivity among the 14 genes, yet one of the genes with which it is connected (Apbb1ip) is a hub gene, connected with seven other genes. In sharp contrast, the GGM shows high connectivity of Pscdbp itself.

Figure 4. Gene networks based on GGM and CGGM. Upper panel: data based on co-mapping selection. Lower panel: data based on co-expression selection. An edge detected by both GGM and CGGM is represented by a solid line; an edge detected by GGM alone is represented by a dotted line; an edge detected by CGGM alone is represented by a broken line. (The online version of this figure is in color.)

Table 5.

Connectivity of genes in the conditional and unconditional graphical models. The cis-regulated transcripts are indicated with boldface

Co-mapping                                       Co-expression
Gene           GGM   CGGM   Diff.                Gene          GGM   CGGM   Diff.
Il16           7     3      4                    Aqr           7     6      1
Frzb           7     5      2                    Nola3         8     2      6
Apbb1ip        8     7      1                    AI648866      9     6      3
Clec2          6     4      2                    Zfp661        8     5      3
Riken          6     4      2                    Myef2         11    5      6
Il10rb         8     5      3                    Synpo2        9     3      6
Myo1f          8     5      3                    Slc24a5       6     5      1
Aif1           6     5      1                    Nol5a         7     4      3
Unc5a          11    8      3                    Sdh1          10    6      4
Trpv2          9     6      3                    Dtwd1         9     4      5
Tnfsf6         9     4      5                    B2m           8     6      2
Stat4          3     3      0                    Tmem87a       6     3      3
Pscdbp         9     3      6                    Zc3hdc8       6     5      1
D13Ertd275e    7     6      1

We next study the subset identified by co-expression. We use hierarchical clustering with average-linkage agglomeration to partition the transcripts into 10 groups. The dissimilarity measure is 1 − |ρij|, where ρij is the Pearson correlation between transcripts i and j. Among the 10 groups, we choose a group that contains 15 transcripts with mean absolute correlation equal to 0.78. Thirteen of the 15 transcripts have annotations, and these are used in our analysis. Using the qtl package in R, each transcript is mapped to markers, and seven markers on chromosome 2 are significantly linked to transcripts (permutation p-value < 0.005). Among those, we drop two markers that are identical to adjacent markers. We thus use five markers (Chr2@100.18, Chr2@112.75, Chr2@115.95, Chr2@120.72, Chr2@124.12) as covariates.
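A minimal R sketch of this grouping step (our own illustration; expr is a placeholder expression matrix with one column per transcript):

    # Dissimilarity 1 - |Pearson correlation| between transcripts,
    # average-linkage hierarchical clustering, cut into 10 groups.
    expr   <- matrix(rnorm(135 * 500), nrow = 135)  # placeholder data
    dis    <- as.dist(1 - abs(cor(expr)))           # transcript dissimilarity
    hc     <- hclust(dis, method = "average")
    groups <- cutree(hc, k = 10)

    # Summarize the within-group mean absolute correlation of one group
    idx <- which(groups == 1)                       # placeholder group choice
    M   <- abs(cor(expr[, idx, drop = FALSE]))
    mean(M[upper.tri(M)])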

The GGM identifies 52 edges and the CGGM identifies 30 edges; 28 edges are shared by the two methods. The two graphs are presented in the lower panel of Figure 4, where edges detected by both the GGM and the CGGM are drawn as black solid lines, edges detected by the GGM alone as blue dotted lines, and edges detected by the CGGM alone as red broken lines. Among the 13 transcripts, Dtwd1 is the closest to all markers (distances < 300 kb). The connectivity under each method is shown in the right panel of Table 5. The CGGM reduces the number of genes connected to Dtwd1 by five. Unlike in the co-mapping network, in the co-expression network the CGGM detects two edges (Aqr–Zfp661, Zfp661–Tmem87a) that are not detected by the GGM. These additional edges could be caused by errors in the regression estimation: for example, if the regression coefficients for some of the markers are zero but are estimated to be nonzero, spurious correlations arise among the residuals. This indicates that accurate identification of the correct covariates is particularly important in the co-expression network.

In summary, in the network inference from the transcripts identified by co-mapping, we see that after conditioning on the markers, the cis-regulated transcript (Pscdbp) is connected to only a few genes, one of which has high connectivity. Without conditioning on the markers, however, it appears to have high connectivity itself; that is, the cis-regulated transcript might be misinterpreted as a hub gene. In the network inference from the transcripts identified by co-expression, we see that after conditioning on the markers, a few additional edges are detected, which may result from inaccurate identification of the covariates for an individual transcript. Thus, including the correct set of markers for each transcript can be important.

Acknowledgments

Bing Li’s research was supported in part by NSF grants DMS-0704621, DMS-0806058, and DMS-1106815.

Hyonho Chun’s research was supported in part by NSF grant DMS-1107025.

Hongyu Zhao’s research was supported in part by NSF grants DMS-0714817 and DMS-1106738 and NIH grants R01 GM59507 and P30 DA018343.

We would like to thank three referees and an Associate Editor for their many excellent suggestions, which led to substantial improvements on an earlier draft. In particular, the asymptotic development in Section 8 was inspired by two reviewers' comments.

Contributor Information

Bing Li, Email: bing@stat.psu.edu, Professor of Statistics, The Pennsylvania State University, 326 Thomas Building, University Park, PA 16802.

Hyonho Chun, Email: chunh@purdue.edu, Assistant Professor of Statistics, Purdue University, 250 N. University Street, West Lafayette, IN 47907.

Hongyu Zhao, Email: hongyu.zhao@yale.edu, Professor of Biostatistics, Yale University, Suite 503, 300 George Street, New Haven, CT 06510.

References

  1. Aronszajn N. Theory of Reproducing Kernels. Transactions of the American Mathematical Society. 1950;68:337–404.
  2. Bickel PJ, Levina E. Covariance Regularization by Thresholding. The Annals of Statistics. 2008;36:2577–2604.
  3. Bickel PJ, Ritov Y, Klaassen CAJ, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore, MD: The Johns Hopkins University Press; 1993.
  4. Breiman L. Better Subset Regression Using the Nonnegative Garrote. Technometrics. 1995;37:373–384.
  5. Broman KW, Wu H, Sen S, Churchill GA. R/qtl: QTL Mapping in Experimental Crosses. Bioinformatics. 2003;19:889–890. doi: 10.1093/bioinformatics/btg112.
  6. Dempster AP. Covariance Selection. Biometrics. 1972;28:157–175.
  7. Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  8. Fernholz LT. Von Mises Calculus for Statistical Functionals. Lecture Notes in Statistics, Vol. 19. New York: Springer; 1983.
  9. Friedman J, Hastie T, Tibshirani R. Sparse Inverse Covariance Estimation With the Graphical Lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.
  10. Fukumizu K, Bach FR, Jordan MI. Kernel Dimension Reduction in Regression. The Annals of Statistics. 2009;37:1871–1905.
  11. Genton MG. Classes of Kernels for Machine Learning: A Statistics Perspective. Journal of Machine Learning Research. 2001;2:299–312.
  12. Geyer CJ. On the Asymptotics of Constrained M-Estimation. The Annals of Statistics. 1994;22:1993–2010.
  13. Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA, Lusis AJ, Horvath S. Integrating Genetic and Network Analysis to Characterize Genes Related to Mouse Weight. PLoS Genetics. 2006;2:e130. doi: 10.1371/journal.pgen.0020130.
  14. Guo J, Levina E, Michailidis G, Zhu J. Joint Estimation of Multiple Graphical Models. 2009. Unpublished manuscript, available at http://www.stat.lsa.umich.edu/gmichail/manuscript-jasa-09.pdf. doi: 10.1093/biomet/asq060.
  15. Henderson HV, Searle SR. Vec and Vech Operators for Matrices, With Some Uses in Jacobians and Multivariate Statistics. Canadian Journal of Statistics. 1979;7:65–81.
  16. Horn RA, Johnson CR. Matrix Analysis. New York: Cambridge University Press; 1985.
  17. Knight K, Fu W. Asymptotics for Lasso-Type Estimators. The Annals of Statistics. 2000;28:1356–1378.
  18. Lafferty J, McCallum A, Pereira F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001); 2001. pp. 282–289.
  19. Lam C, Fan J. Sparsistency and Rates of Convergence in Large Covariance Matrix Estimation. The Annals of Statistics. 2009;37:4254–4278. doi: 10.1214/09-AOS720.
  20. Lauritzen SL. Graphical Models. Oxford: Clarendon Press; 1996.
  21. Meinshausen N, Bühlmann P. High-Dimensional Graphs and Variable Selection With the Lasso. The Annals of Statistics. 2006;34:1436–1462.
  22. von Mises R. On the Asymptotic Distribution of Differentiable Statistical Functions. The Annals of Mathematical Statistics. 1947;18:309–348.
  23. Neto EC, Keller MP, Attie AD, Yandell BS. Causal Graphical Models in Systems Genetics: A Unified Framework for Joint Inference of Causal Network and Genetic Architecture for Correlated Phenotypes. The Annals of Applied Statistics. 2010;4:320–339. doi: 10.1214/09-aoas288.
  24. Peng J, Wang P, Zhou N, Zhu J. Partial Correlation Estimation by Joint Sparse Regression Models. Journal of the American Statistical Association. 2009;104:735–746. doi: 10.1198/jasa.2009.0126.
  25. Reeds JA. On the Definition of Von Mises Functionals. PhD dissertation, Harvard University; 1976.
  26. Ren J, Sen PK. On Hadamard Differentiability of Extended Statistical Functional. Journal of Multivariate Analysis. 1991;39:30–43.
  27. Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse Permutation Invariant Covariance Estimation. Electronic Journal of Statistics. 2008;2:494–515.
  28. Schwarz G. Estimating the Dimension of a Model. The Annals of Statistics. 1978;6:461–464.
  29. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  30. Vapnik VN. Statistical Learning Theory. New York: Wiley; 1998.
  31. Yuan M, Lin Y. Model Selection and Estimation in the Gaussian Graphical Model. Biometrika. 2007;94:19–35.
  32. Zhao P, Yu B. On Model Selection Consistency of Lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
  33. Zou H. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association. 2006;101:1418–1429.
  34. Zou H, Li R. One-Step Sparse Estimates in Nonconcave Penalized Likelihood Models (with discussion). The Annals of Statistics. 2008;36:1509–1533. doi: 10.1214/009053607000000802.
