Biometrika. 2016 Aug 24;103(3):513–530. doi: 10.1093/biomet/asw028

On an additive partial correlation operator and nonparametric estimation of graphical models

Kuang-Yao Lee, Bing Li, Hongyu Zhao

Abstract

We introduce an additive partial correlation operator as an extension of partial correlation to the nonlinear setting, and use it to develop a new estimator for nonparametric graphical models. Our graphical models are based on additive conditional independence, a statistical relation that captures the spirit of conditional independence without having to resort to high-dimensional kernels for its estimation. The additive partial correlation operator completely characterizes additive conditional independence, and has the additional advantage of putting marginal variation on appropriate scales when evaluating interdependence, which leads to more accurate statistical inference. We establish the consistency of the proposed estimator. Through simulation experiments and analysis of the DREAM4 Challenge dataset, we demonstrate that our method performs better than existing methods in cases where the Gaussian or copula Gaussian assumption does not hold, and that a more appropriate scaling for our method further enhances its performance.

Keywords: Additive conditional covariance operator, Additive conditional independence, Copula, Gaussian graphical model, Partial correlation, Reproducing kernel

1. Introduction

We propose a new statistical object, the additive partial correlation operator, for estimating nonparametric graphical models. This operator is an extension of the partial correlation coefficient (Muirhead, 2005) to the nonlinear setting. It is akin to the additive conditional covariance operator of Li et al. (2014) but achieves better scaling, leading to enhanced estimation accuracy, when characterizing conditional independence in graphical models.

Let $X = (X^1, \ldots, X^p)^{\mathrm T}$ be a $p$-dimensional random vector. Let $\mathcal G = (\Gamma, \mathcal E)$ be an undirected graph, where $\Gamma = \{1, \ldots, p\}$ represents the set of vertices corresponding to the $p$ random variables and $\mathcal E$ represents the set of undirected edges. For convenience we assume that $\mathcal E$ contains no edges of the form $(i, i)$. A common approach to modelling an undirected graph is to associate separation with conditional independence; that is, node $i$ and node $j$ are separated if and only if $X^i$ and $X^j$ are independent given the rest of $X$. In symbols,

\[ (i, j) \notin \mathcal E \iff X^i \perp\!\!\!\perp X^j \mid X^{-(i,j)}, \qquad (1) \]

where $X^{-(i,j)}$ represents $X$ with its $i$th and $j$th components removed. Intuitively, this means that nodes $i$ and $j$ are connected if and only if, after removing the effects of all the other nodes, $X^i$ and $X^j$ still depend on each other. In other words, nodes $i$ and $j$ are connected in the graph if and only if $X^i$ and $X^j$ have a direct relation. The statistical problem is to estimate $\mathcal E$ based on a sample of $X$.

One of the most commonly used statistical models for (1) is the Gaussian graphical model, which assumes that $X$ satisfies (1) and is distributed as $N(\mu, \Sigma)$ for a nonsingular covariance matrix $\Sigma$. An appealing property of the multivariate Gaussian distribution is that conditional independence is completely characterized by the zero entries of the precision matrix. Specifically, let $\Theta = \Sigma^{-1}$ be the precision matrix and $\theta_{ij}$ its $(i,j)$th element. Then

\[ \theta_{ij} = 0 \iff X^i \perp\!\!\!\perp X^j \mid X^{-(i,j)}. \qquad (2) \]

Thus, under Gaussianity, estimation of $\mathcal E$ amounts to identifying the zero entries or, equivalently, the sparsity pattern of the precision matrix $\Theta$. Many procedures have been developed to estimate the Gaussian graphical model. For example, Yuan & Lin (2007), Banerjee et al. (2008) and Friedman et al. (2008) considered penalized maximum likelihood estimation with $L_1$ penalties on $\Theta$. Based on a relation between partial correlations and regression coefficients, Meinshausen & Bühlmann (2006) and Peng et al. (2009) proposed to select the neighbours of each node by solving multiple lasso problems (Tibshirani, 1996). Other recent advances include the work of Bickel & Levina (2008a, b), who used hard thresholding to determine the sparsity pattern, Lam & Fan (2009), who used the smoothly clipped absolute deviation penalty (Fan & Li, 2001), and Yuan (2010) and Cai et al. (2011), who used the Dantzig selector (Candès & Tao, 2007).
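For concreteness in the Gaussian case, the following is a minimal sketch of precision-matrix-based edge selection using the graphical lasso of Friedman et al. (2008), as implemented in scikit-learn; the simulated data and the penalty level alpha are illustrative choices, not recommendations from this paper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))        # n = 200 observations, p = 10 nodes

model = GraphicalLasso(alpha=0.1).fit(X)  # penalized Gaussian likelihood, L1 penalty
Theta = model.precision_                  # estimated precision matrix

# Under the equivalence (2), the estimated edge set is the off-diagonal
# support of the precision matrix.
edges = {(i, j) for i in range(10) for j in range(i + 1, 10)
         if abs(Theta[i, j]) > 1e-8}
```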

Since Gaussianity could be restrictive in applications, many recent papers have considered extensions. The challenge is not only to relax Gaussianity but also to preserve the simplicity of the conditional independence structure imparted by the Gaussian distribution. One elegant solution is to assume a copula Gaussian model, under which the data can be transformed marginally to multivariate Gaussianity; see Liu et al. (2009, 2012), Xue & Zou (2012) and Harris & Drton (2013). The copula Gaussian model preserves the equivalence (2) for the transformed $X$, without requiring the $X^i$ to be marginally Gaussian. Other work on non-Gaussian graphical models includes Fellinghauer et al. (2013) and Voorman et al. (2014). In their settings, a given node is associated with its neighbours via either a semiparametric or a nonparametric model.

Another extension is the additive semigraphoid model of Li et al. (2014), which is based on a new statistical relation called additive conditional independence. By generalizing the precision matrix to the additive precision operator and replacing the conditional independence in (2) by additive conditional independence, Li et al. (2014) showed that the equivalence (2) emerges at the linear operator level, at which no distributional assumption is needed.

The primary motivation for introducing additive conditional independence is to maintain nonparametric flexibility without employing high-dimensional kernels. The distribution of points in a Euclidean space becomes increasingly sparse as the dimension of the space increases. For a kernel estimator in such spaces to be effective, we need to increase the bandwidth; otherwise we may have very few observations within a local ball of radius equal to the bandwidth. Increasing bandwidth, however, also increases bias. Therefore we face the dilemma of either increased bias or lack of data in each local region, a phenomenon known as the curse of dimensionality (Bellman, 1957). To avoid this problem while extracting useful information from high-dimensional data, one must impose some kind of additional structure, such as parametric models, sparsity or linear indices. The structure imposed by additive conditional independence is additivity, which allows us to employ only one-dimensional kernels, thus avoiding high dimensionality. The cost is that the graphical model is no longer characterized by conditional independence. Nonetheless, Li et al. (2014) have shown that additive conditional independence satisfies the semigraphoid axioms (Pearl & Verma, 1987; Pearl et al., 1989), a set of four fundamental properties of conditional independence.

To estimate the additive semigraphoid model, Li et al. (2014) proposed the additive conditional covariance and additive precision operators, which extend the conditional covariance and precision matrices and characterize additive conditional independence without distributional assumptions. In the classical setting, the conditional covariance $\operatorname{cov}(U, V \mid W)$ between two random variables $U$ and $V$ given a third random variable $W$ describes the strength of dependence between $U$ and $V$ after removing the effect of $W$. However, it is confounded by statistical variations in $U$ and $V$, which have nothing to do with the conditional dependence. Partial correlation is designed to remove these effects, so that only the conditional dependence is retained. The additive partial correlation operator that we propose serves the same purpose in the nonlinear setting. We will also propose an estimator of the new operator, and establish its consistency along with that of the estimator of the additive conditional covariance operator, which was not proved in Li et al. (2014). Based on the additive partial correlation operator, we develop an estimator for the additive semigraphoid model and establish the consistency of this procedure.

All the proofs, as well as some additional propositions and numerical results, are presented in the Supplementary Material.

2. Additive conditional independence and graphical models

2.1. Additive conditional independence

Let $(\Omega, \mathcal F, P)$ be a probability space, $\Omega_X$ a subset of $\mathbb R^p$, and $X : \Omega \to \Omega_X$ a random vector. Let $P_X$ be the distribution of $X$. Let $X^i$ be the $i$th component of $X$, and let $\Omega_{X^i}$ be the support of $X^i$. For a subvector $U$ of $X$, let $P_U$ be the distribution of $U$ and $L_2(P_U)$ the centred $L_2$ class

\[ L_2(P_U) = \{ f : E\{f(U)\} = 0, \ E\{f^2(U)\} < \infty \}. \]

We assume that all functions in $L_2(P_U)$ have mean zero, because constants have no bearing on our construction. Additive conditional independence (Li et al., 2014) was introduced in terms of the $L_2$ geometry. Suppose that $U$, $V$ and $W$ are subvectors of $X$. For a subvector such as $U = (U^1, \ldots, U^q)$, let $\mathscr A_U$ be the additive family formed by functions in each $L_2(P_{U^k})$; that is,

\[ \mathscr A_U = \{ f_1 + \cdots + f_q : f_k \in L_2(P_{U^k}), \ k = 1, \ldots, q \}. \]

Note that $\mathscr A_U$ and $L_2(P_U)$ are different: the former consists of additive functions, while the latter has no such restriction. If $\mathscr S_1$ and $\mathscr S_2$ are subspaces of $L_2(P_X)$, we write $\mathscr S_1 \perp \mathscr S_2$ if $\langle f, g \rangle = 0$ for all $f \in \mathscr S_1$ and $g \in \mathscr S_2$, where $\langle \cdot, \cdot \rangle$ denotes the inner product in $L_2(P_X)$. If $\mathscr S_2 \subseteq \mathscr S_1$, we denote by $\mathscr S_1 \ominus \mathscr S_2$ the set of functions $f \in \mathscr S_1$ such that $f \perp \mathscr S_2$. Also, let $\mathscr A_{UW}$ denote the subspace $\overline{\mathscr A_U + \mathscr A_W}$.

Definition 1 —

We say that $U$ and $V$ are additively independent conditional on $W$ if and only if

\[ (\mathscr A_{UW} \ominus \mathscr A_W) \perp (\mathscr A_{VW} \ominus \mathscr A_W). \qquad (3) \]

We denote this relation by $U \perp\!\!\!\perp_A V \mid W$.

Li et al. (2014) showed that the three-way relation $\perp\!\!\!\perp_A$ satisfies the four semigraphoid axioms (Pearl & Verma, 1987; Pearl et al., 1989), which are features abstracted from probabilistic conditional independence suitable for describing a graph.

Based on (3), Li et al. (2014) proposed the following graphical model. Let $\mathcal G = (\Gamma, \mathcal E)$ be as defined in §1.

Definition 2 —

We say that $X$ follows an additive semigraphoid model with respect to $\mathcal G$ if

\[ (i, j) \notin \mathcal E \iff X^i \perp\!\!\!\perp_A X^j \mid X^{-(i,j)}. \]

Li et al. (2014) developed theoretical results and estimation methods for the additive semigraphoid model using the additive $L_2$ spaces $\mathscr A_U$.

2.2. Additive reproducing kernel Hilbert spaces

Rather than use $L_2$ geometry, here we use reproducing kernel Hilbert space geometry to derive our new operator and related methods. This is mainly because many asymptotic tools for linear operators have recently been developed in the reproducing kernel Hilbert space setting (Fukumizu et al., 2007; Bach, 2008). The advantage of this alternative formulation will become clear in §4. Let $\kappa_i : \Omega_{X^i} \times \Omega_{X^i} \to \mathbb R$ be a positive-definite kernel. For convenience, we assume the $\kappa_i$ and the $\Omega_{X^i}$ to be the same for all $i$ and write the common kernel function as $\kappa$. Let $\mathscr H_{X^i}$ be the reproducing kernel Hilbert space of functions of $X^i$ based on the kernel $\kappa$; that is, $\mathscr H_{X^i}$ is the space spanned by $\{\kappa(\cdot, s) : s \in \Omega_{X^i}\}$, with its inner product given by $\langle \kappa(\cdot, s), \kappa(\cdot, t) \rangle = \kappa(s, t)$. In our theoretical developments, we require that all the functions in $\mathscr H_{X^i}$ be square-integrable, which is guaranteed by the following assumption:

Assumption 1 —

$E\{\kappa(X^i, X^i)\} < \infty$ for $i = 1, \ldots, p$.

This condition is satisfied by most of the commonly used kernels, including the radial basis function.

Let $U = (U^1, \ldots, U^q)$ be a subvector of $X$. The additive reproducing kernel Hilbert space $\mathscr H_U$ of functions of $U$ is defined as follows.

Definition 3 —

The space $\mathscr H_U$ is the direct sum $\mathscr H_{U^1} \oplus \cdots \oplus \mathscr H_{U^q}$ in the sense that

\[ \mathscr H_U = \{ f_1 + \cdots + f_q : f_k \in \mathscr H_{U^k}, \ k = 1, \ldots, q \}, \]

with inner product $\langle f, g \rangle_{\mathscr H_U} = \sum_{k=1}^q \langle f_k, g_k \rangle_{\mathscr H_{U^k}}$.

Equivalently, $\mathscr H_U$ can be viewed as the reproducing kernel Hilbert space generated by the additive kernel $\kappa_U(u, u') = \kappa(u^1, u'^1) + \cdots + \kappa(u^q, u'^q)$, where $u = (u^1, \ldots, u^q)$ and $u' = (u'^1, \ldots, u'^q)$.
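As a small illustration of Definition 3, the following sketch builds the Gram matrix of the additive kernel $\kappa_U$ by summing one-dimensional radial basis function Gram matrices; the function names and the bandwidth are illustrative assumptions.

```python
import numpy as np

def rbf_gram(x, gamma):
    """Gram matrix of kappa(s, t) = exp{-gamma^2 (s - t)^2} for one coordinate."""
    d = x[:, None] - x[None, :]
    return np.exp(-gamma**2 * d**2)

def additive_gram(U, gamma):
    """Gram matrix of the additive kernel kappa_U for an n x q subvector sample."""
    return sum(rbf_gram(U[:, k], gamma) for k in range(U.shape[1]))

rng = np.random.default_rng(1)
U = rng.standard_normal((100, 3))   # n = 100 samples of a 3-dimensional subvector
K_U = additive_gram(U, gamma=1.0)
```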

2.3. Other notation

For two Hilbert spaces $\mathscr H_1$ and $\mathscr H_2$, we let $\mathscr B(\mathscr H_1, \mathscr H_2)$ denote the class of all bounded operators from $\mathscr H_1$ to $\mathscr H_2$ and $\mathscr B_{\rm HS}(\mathscr H_1, \mathscr H_2)$ the class of all Hilbert–Schmidt operators. When $\mathscr H_1 = \mathscr H_2 = \mathscr H$, we denote these classes simply by $\mathscr B(\mathscr H)$ and $\mathscr B_{\rm HS}(\mathscr H)$. The symbols $\|\cdot\|$ and $\|\cdot\|_{\rm HS}$ stand for the operator and Hilbert–Schmidt norms. For $f \in \mathscr H_1$ and $g \in \mathscr H_2$, the tensor product $f \otimes g$ is the mapping $\mathscr H_2 \to \mathscr H_1$, $h \mapsto \langle g, h \rangle_{\mathscr H_2} f$. For two matrices or Euclidean vectors $A$ and $B$, $A \otimes B$ denotes their Kronecker product. The symbol $I$ stands for the identity mapping in a functional space, whereas $I_n$ means the $n \times n$ identity matrix. The symbol $1_n$ stands for the vector of length $n$ whose entries are all ones. For an operator $A$, $\ker(A)$ represents the null space of $A$ and $\operatorname{ran}(A)$ the range of $A$; that is,

\[ \ker(A) = \{ f : Af = 0 \}, \qquad \operatorname{ran}(A) = \{ Af : f \in \mathscr H_1 \}. \]

Also, $\overline{\operatorname{ran}}(A)$ stands for the closure of $\operatorname{ran}(A)$.

3. Additive partial correlation operator

3.1. The additive conditional covariance operator

The additive conditional covariance operator was proposed by Li et al. (2014) in terms of the $L_2$ geometry; here we redefine it in the reproducing kernel Hilbert space geometry.

For subvectors $U$ and $V$ of $X$, by the Riesz representation theorem there exists a unique operator $\Sigma_{UV} \in \mathscr B(\mathscr H_V, \mathscr H_U)$ such that (Conway, 1994, p. 31)

\[ \langle f, \Sigma_{UV} g \rangle_{\mathscr H_U} = \operatorname{cov}\{ f(U), g(V) \} \quad \text{for all } f \in \mathscr H_U, \ g \in \mathscr H_V. \]

We define $\Sigma_{UU}$ and $\Sigma_{VV}$ similarly. The nonadditive versions of these operators were introduced by Baker (1973) and Fukumizu et al. (2004, 2009). Moreover, by Baker (1973), for any $U$ and $V$ there exists a unique operator

\[ R_{UV} : \mathscr H_V \to \mathscr H_U \]

such that $\Sigma_{UV} = \Sigma_{UU}^{1/2} R_{UV} \Sigma_{VV}^{1/2}$. The operator $R_{UV}$ is the correlation operator between $U$ and $V$. Write $W = (W^1, \ldots, W^r)$. Let $\Lambda_{WW}$ denote the $r \times r$ diagonal matrix of operators whose diagonal entries are the operators $\Sigma_{W^1W^1}, \ldots, \Sigma_{W^rW^r}$, and let $R_{WW}$ be the $r \times r$ matrix of operators whose $(k, l)$th element is $R_{W^kW^l}$. Then it is obvious that $\Sigma_{WW} = \Lambda_{WW}^{1/2} R_{WW} \Lambda_{WW}^{1/2}$. Notice that $R_{W^kW^k}$ is the identity operator. Define operators such as $\Lambda_{UU}$, $R_{UW}$ and $R_{WV}$ in a similar way. We make the following assumption about the entries of $R_{WW}$.

Assumption 2 —

For $k \neq l$, $R_{W^kW^l}$ is a compact operator.

In the Supplementary Material, we show that $R_{WW}$ is invertible and its inverse is bounded. We are now ready to define the additive conditional covariance operator.

Definition 4 —

Suppose that Assumptions 1 and 2 hold. Then the operator

\[ \Sigma_{UV \cdot W} = \Sigma_{UV} - \Sigma_{UU}^{1/2} R_{UW} R_{WW}^{-1} R_{WV} \Sigma_{VV}^{1/2} \]

is called the additive conditional covariance operator of $(U, V)$ given $W$.

Again, this definition also accommodates operators such as $\Sigma_{UU \cdot W}$ and $\Sigma_{VV \cdot W}$.

3.2. The additive partial correlation operator

We now introduce the additive partial correlation operator and establish its population-level properties. A straightforward way to define the additive partial correlation operator might be as

\[ \Sigma_{UU \cdot W}^{-1/2} \, \Sigma_{UV \cdot W} \, \Sigma_{VV \cdot W}^{-1/2}, \qquad (4) \]

but caution is needed here because $\Sigma_{UU \cdot W}$ and $\Sigma_{VV \cdot W}$ are Hilbert–Schmidt operators and their eigenvalues tend to zero, so that there is no guarantee that (4) will be well-defined. The following theorem, which echoes Theorem 1 of Baker (1973), shows that (4) is well-defined under minimal conditions.

Theorem 1 —

Suppose that Assumptions 1 and 2 hold. Then there exists a unique operator $R_{UV \cdot W} \in \mathscr B(\mathscr H_V, \mathscr H_U)$ such that:

  1. $R_{UV \cdot W} = P_U R_{UV \cdot W} P_V$, where $P_U = P_{\overline{\operatorname{ran}}(\Sigma_{UU \cdot W})}$, $P_V = P_{\overline{\operatorname{ran}}(\Sigma_{VV \cdot W})}$, and $P_{\mathscr S}$ denotes the projection onto a subspace $\mathscr S$ in $\mathscr H$;

  2. $\| R_{UV \cdot W} \| \leq 1$;

  3. $\Sigma_{UV \cdot W} = \Sigma_{UU \cdot W}^{1/2} R_{UV \cdot W} \Sigma_{VV \cdot W}^{1/2}$.

Theorem 1 justifies the following definition.

Definition 5 —

Under Assumptions 1 and 2, the operator $R_{UV \cdot W}$ in Theorem 1 is called the additive partial correlation operator.

The additive partial correlation operator is defined via a reproducing kernel Hilbert space, whereas additive conditional independence is characterized via the $L_2$ spaces $\mathscr A_U$. In the Supplementary Material we show that when the kernel function is sufficiently rich that it is a characteristic kernel (Fukumizu et al., 2008, 2009), projections onto $\mathscr A_U$ can be well approximated by elements in reproducing kernel Hilbert spaces. Specifically, this requires the following assumption.

Assumption 3 —

Each $\mathscr H_{X^i}$ is a dense subset of $L_2(P_{X^i})$ up to a constant; that is, for each $f \in L_2(P_{X^i})$ there is a sequence $\{f_k\}$ in $\mathscr H_{X^i}$ such that $\| f_k - f \|_{L_2(P_{X^i})} \to 0$ as $k \to \infty$.

We are now ready to state the first main result: one can use the additive conditional covariance or additive partial correlation operator to characterize additive conditional independence.

Theorem 2 —

If Assumptions 1–3 hold, then

\[ U \perp\!\!\!\perp_A V \mid W \iff \Sigma_{UV \cdot W} = 0 \iff R_{UV \cdot W} = 0. \]

3.3. Estimators

Here we define sample estimators of $\Sigma_{UV \cdot W}$ and $R_{UV \cdot W}$. Let $X_1, \ldots, X_n$ be independent copies of $X$. Let $E_n$ represent the sample average: $E_n\{f(X)\} = n^{-1} \sum_{a=1}^n f(X_a)$. We define the estimate of $\Sigma_{UV}$ by replacing the expectation with the sample average $E_n$; that is,

\[ \hat\Sigma_{UV} = E_n [ \{ \kappa_U(\cdot, U) - E_n \kappa_U(\cdot, U) \} \otimes \{ \kappa_V(\cdot, V) - E_n \kappa_V(\cdot, V) \} ]. \]

Let $\hat\Sigma$ be the $p \times p$ matrix of operators whose $(i,j)$th entry is $\hat\Sigma_{X^iX^j}$, and let $\hat\Sigma_{UW}$, $\hat\Sigma_{WW}$ and so forth be the submatrices corresponding to subvectors $U$, $V$ and $W$. Let $\{\epsilon_n\}$ be a sequence of positive constants converging to zero. We define the estimator of $\Sigma_{UV \cdot W}$ as

\[ \hat\Sigma_{UV \cdot W} = \hat\Sigma_{UV} - \hat\Sigma_{UW} ( \hat\Sigma_{WW} + \epsilon_n I )^{-1} \hat\Sigma_{WV}. \qquad (5) \]

Let $\{\eta_n\}$ be another sequence of positive constants converging to zero. We define the estimator of $R_{UV \cdot W}$ as

\[ \hat R_{UV \cdot W} = ( \hat\Sigma_{UU \cdot W} + \eta_n I )^{-1/2} \, \hat\Sigma_{UV \cdot W} \, ( \hat\Sigma_{VV \cdot W} + \eta_n I )^{-1/2}. \qquad (6) \]

The tuning parameters $\epsilon_n$ and $\eta_n$ in (5) and (6) play roles similar to that of the penalty in ridge regression (Hoerl & Kennard, 1970). Technically, they ensure the invertibility of the relevant linear operators and the consistency of the estimators. In practice, they often bring efficiency gains in high dimensions due to their shrinkage effects. Interestingly, as we will see in the next section, $\eta_n$ needs to converge to zero more slowly than $\epsilon_n$ in order for $\hat R_{UV \cdot W}$ to be consistent.
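The following is a finite-dimensional sketch of the ridge-regularized estimators (5) and (6), with each operator represented by the empirical covariance of centred feature scores; the feature matrices Fu, Fv, Fw and the helper names are assumptions made for illustration, not the paper's coordinate formulas (those appear in §5).

```python
import numpy as np

def cov_block(F1, F2):
    """Empirical cross-covariance of centred feature matrices (n x d1, n x d2)."""
    return F1.T @ F2 / F1.shape[0]

def inv_sqrt(S, ridge):
    """(S + ridge I)^{-1/2} via eigendecomposition; the ridge keeps eigenvalues positive."""
    w, V = np.linalg.eigh((S + S.T) / 2)
    w = np.clip(w, 0.0, None) + ridge
    return V @ np.diag(w ** -0.5) @ V.T

def apc_estimate(Fu, Fv, Fw, eps, eta):
    """Analogues of (5) and (6) for centred feature matrices Fu, Fv, Fw."""
    Sww = cov_block(Fw, Fw) + eps * np.eye(Fw.shape[1])
    solve_w = lambda M: np.linalg.solve(Sww, M)
    Suv_w = cov_block(Fu, Fv) - cov_block(Fu, Fw) @ solve_w(cov_block(Fw, Fv))
    Suu_w = cov_block(Fu, Fu) - cov_block(Fu, Fw) @ solve_w(cov_block(Fw, Fu))
    Svv_w = cov_block(Fv, Fv) - cov_block(Fv, Fw) @ solve_w(cov_block(Fw, Fv))
    return inv_sqrt(Suu_w, eta) @ Suv_w @ inv_sqrt(Svv_w, eta)   # analogue of (6)
```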

4. Consistency and convergence rate

We first establish the consistency of $\hat\Sigma_{UV \cdot W}$. Besides serving as an intermediate step in proving the consistency of $\hat R_{UV \cdot W}$, the consistency of $\hat\Sigma_{UV \cdot W}$ is of interest in its own right, because it was not proved in Li et al. (2014), where the additive conditional covariance operator was originally proposed under the $L_2$ geometry. To derive the convergence rate, we need an additional assumption.

Assumption 4 —

There is an operator $B_{WV} \in \mathscr B(\mathscr H_V, \mathscr H_W)$ such that

\[ \Sigma_{WV} = \Sigma_{WW} B_{WV}. \]

The operator $B_{WV}$ also appeared in Lee et al. (2016), where it was called the regression operator because it can be written in the form $\Sigma_{WW}^{-1} \Sigma_{WV}$, resembling the coefficient vector in linear regression. Assumption 4 is essentially a smoothness condition: it requires that the main components in the relation between $W$ and $V$ be sufficiently concentrated on the low-frequency components of the covariance operator $\Sigma_{WW}$, in the following sense. If $\Sigma_{WW}$ is invertible, then Assumption 4 requires $\Sigma_{WW}^{-1} \Sigma_{WV}$ to be a compact operator. Since, under mild conditions, $\Sigma_{WV}$ is a Hilbert–Schmidt operator (Fukumizu et al., 2007), $\Sigma_{WW}^{-1}$ is an unbounded operator. Intuitively, in order for $\Sigma_{WW}^{-1} \Sigma_{WV}$ to be compact, the range space of $\Sigma_{WV}$ should be sufficiently concentrated on the eigenspaces of $\Sigma_{WW}$ corresponding to its large eigenvalues, or the low-frequency components. As a simple special case of this scenario, Lee et al. (2016, Proposition 1) showed that Assumption 4 is satisfied if the range of $\Sigma_{WV}$ is a finite-dimensional reducing subspace of $\Sigma_{WW}$. This is true, for example, when the polynomial kernel of finite order is used. For kernels inducing infinite-dimensional spaces, Assumption 4 holds if there exist only finitely many eigenfunctions of $\Sigma_{WW}$ that carry nontrivial correlations with any function in $\mathscr H_V$. Of course, these sufficient conditions can be relaxed with careful examination.

We state the consistency of the additive conditional covariance and additive partial correlation operators in the following two theorems, which require different rates for the ridge parameters. For two positive sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \prec b_n$ if and only if $a_n / b_n \to 0$, and we write $a_n \preceq b_n$ if and only if $a_n / b_n$ is bounded.

Theorem 3 —

If Assumptions 1, 2 and 4 are satisfied and $n^{-1/2} \prec \epsilon_n \prec 1$, then

\[ \| \hat\Sigma_{UV \cdot W} - \Sigma_{UV \cdot W} \|_{\rm HS} \to 0 \quad \text{in probability as } n \to \infty. \]

Theorem 4 —

If Assumptions 1, 2 and 4 are satisfied and

\[ n^{-1/2} \prec \epsilon_n \prec \eta_n \prec 1, \qquad (7) \]

then $\| \hat R_{UV \cdot W} - R_{UV \cdot W} \| \to 0$ in probability as $n \to \infty$.

We return now to the estimation of the additive semigraphoid graphical model in Definition 2. The estimators of the additive conditional covariance operator and the additive partial correlation operator lead to the following thresholding methods for estimating the additive semigraphoid model:

\[ \hat{\mathcal E}_1 = \{ (i,j) : \| \hat\Sigma_{X^iX^j \cdot X^{-(i,j)}} \|_{\rm HS} > \delta \}, \qquad \hat{\mathcal E}_2 = \{ (i,j) : \| \hat R_{X^iX^j \cdot X^{-(i,j)}} \|_{\rm HS} > \rho \}, \]

where $\delta$ and $\rho$ are thresholding constants for the additive conditional covariance operator and additive partial correlation operator, respectively. By combining Theorems 2, 3 and 4, it is easy to show that $\hat{\mathcal E}_1$ and $\hat{\mathcal E}_2$ are consistent estimators of the true edge set $\mathcal E$, in the following sense.

Theorem 5 —

Suppose that Assumptions 1–4 hold and $X$ satisfies the additive semigraphoid model in Definition 2 with respect to the graph $\mathcal G = (\Gamma, \mathcal E)$. Suppose further that $\{\epsilon_n\}$ and $\{\eta_n\}$ are positive sequences satisfying (7). Then, for sufficiently small $\delta$ and $\rho$, as $n \to \infty$,

\[ \operatorname{pr}( \hat{\mathcal E}_1 = \mathcal E ) \to 1, \qquad \operatorname{pr}( \hat{\mathcal E}_2 = \mathcal E ) \to 1. \]

The foregoing asymptotic development is under the assumption that $p$ is fixed as $n \to \infty$. We believe it should be possible to prove consistency in the setting where $p \to \infty$ as $n \to \infty$, perhaps along the lines of Bickel & Levina (2008a, b). We leave this to future research.

5. Implementation of estimation of graphical models

5.1. Coordinate representation

The estimators in (5) and (6) are defined in operator form. To compute them, we need to represent the operators as matrices. In the subsequent development we describe this process in the context of estimating the graphical models. We adopt the system of notation for coordinate representation from Horn & Johnson (1985); see also Li et al. (2012). Let $\mathscr H$ be a generic $m$-dimensional Hilbert space with spanning system $\mathcal B = \{b_1, \ldots, b_m\}$. For any $f \in \mathscr H$, there is a vector $c = (c_1, \ldots, c_m)^{\mathrm T} \in \mathbb R^m$ such that $f = \sum_{k=1}^m c_k b_k$. The vector $c$ is called the coordinate of $f$ with respect to $\mathcal B$ and is written as $[f]_{\mathcal B}$. Suppose that $\mathscr H'$ is another Hilbert space, spanned by $\mathcal B' = \{b_1', \ldots, b_{m'}'\}$, and $A$ is a linear operator from $\mathscr H$ to $\mathscr H'$. Then the coordinate of $A$ relative to $\mathcal B$ and $\mathcal B'$ is the matrix $([Ab_1]_{\mathcal B'}, \ldots, [Ab_m]_{\mathcal B'})$, denoted by $[A]_{\mathcal B' \mathcal B}$. If $\mathscr H''$ is a third finite-dimensional Hilbert space, with spanning system $\mathcal B''$, and $B : \mathscr H' \to \mathscr H''$ is a linear operator, then $[BA]_{\mathcal B'' \mathcal B} = [B]_{\mathcal B'' \mathcal B'} [A]_{\mathcal B' \mathcal B}$. When there is no ambiguity regarding the spanning system used, we abbreviate $[f]_{\mathcal B}$ to $[f]$, $[A]_{\mathcal B' \mathcal B}$ to $[A]$, and so on. One can also show that $[Af]_{\mathcal B'} = [A]_{\mathcal B' \mathcal B} [f]_{\mathcal B}$ for any $f \in \mathscr H$. In the rest of this section, square brackets $[\,\cdot\,]$ will be reserved exclusively for the coordinate notation.
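A toy numerical check of the coordinate rules above: applying a composition of operators corresponds to multiplying their coordinate matrices. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))   # [A], coordinates of an operator H -> H'
B = rng.standard_normal((2, 4))   # [B], coordinates of an operator H' -> H''
f = rng.standard_normal(3)        # [f], coordinates of an element of H

# [BAf] = [B][A][f]: the coordinate of the composed image is the matrix product.
assert np.allclose(B @ (A @ f), (B @ A) @ f)
```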

5.2. Norms of the estimated additive partial correlation operator

For each $i = 1, \ldots, p$ and $a = 1, \ldots, n$, let $X^i_a$ denote the $i$th component of the vector $X_a$. Consider the reproducing kernel Hilbert space

\[ \mathscr H_i = \overline{\operatorname{span}} \{ \kappa(\cdot, s) : s \in \Omega_{X^i} \}. \]

Let $\kappa^i_a$ be the Riesz representation of the linear functional $\mathscr H_i \to \mathbb R$, $f \mapsto f(X^i_a)$, and let $\mathcal B_i = \{ \kappa^i_a - E_n(\kappa^i) : a = 1, \ldots, n \}$, where $E_n(\kappa^i) = n^{-1} \sum_{a=1}^n \kappa^i_a$. For our purposes, it suffices to consider the subspace $\operatorname{span}(\mathcal B_i)$ of $\mathscr H_i$, because it is the range of operators such as $\hat\Sigma_{X^iX^i}$ and $\hat\Sigma_{X^iX^j}$. For this reason, we define this subspace to be $\mathscr M_i$.

Let $K_i = \{ \kappa(X^i_a, X^i_b) \}_{a,b=1}^n$ be the Gram kernel matrix. Let $Q = I_n - 1_n 1_n^{\mathrm T}/n$, which is the projection onto the orthogonal complement of $\operatorname{span}(1_n)$ in $\mathbb R^n$. Let $G_i = Q K_i Q$, and let $G_{-(i,j)}$ be the $n \times (p-2)n$ matrix obtained by removing the $i$th and $j$th blocks from the $n \times pn$ matrix $(G_1, \ldots, G_p)$. Let $D_{-(i,j)}$ be the $(p-2)n$-dimensional block-diagonal matrix whose diagonal blocks are the $p - 2$ blocks of $G_{-(i,j)}$, each of dimension $n \times n$. To avoid complicated notation, throughout this subsection we write the estimated operators $\hat\Sigma_{X^iX^j \cdot X^{-(i,j)}}$ and $\hat R_{X^iX^j \cdot X^{-(i,j)}}$ simply as $\hat\Sigma_{ij\cdot}$ and $\hat R_{ij\cdot}$.
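A brief sketch of the centred Gram matrices just defined: $Q$ annihilates the constant vector, so every row and column of $G_i = Q K_i Q$ is orthogonal to $1_n$; the kernel and bandwidth below are illustrative.

```python
import numpy as np

def centred_gram(K):
    """Compute G = Q K Q with Q = I - 1 1^T / n."""
    n = K.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    return Q @ K @ Q

rng = np.random.default_rng(3)
x = rng.standard_normal(50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)   # one-dimensional Gaussian kernel
G = centred_gram(K)
assert np.allclose(G @ np.ones(50), 0.0, atol=1e-8)
```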

By straightforward calculations, details of which are given in the Supplementary Material, we have the following coordinate representations:

(8)

Let $\dagger$ indicate the Moore–Penrose inverse. Substituting the representations in (8) into the definitions of the estimators yields explicit matrix formulas, in terms of $G_i$, $G_j$, $G_{-(i,j)}$ and $D_{-(i,j)}$, for computing $\| \hat\Sigma_{ij\cdot} \|_{\rm HS}$ (9) and, as derived in the Supplementary Material, $\| \hat R_{ij\cdot} \|_{\rm HS}$ (10).

The following result links the additive partial correlation operator with the partial correlation when a linear kernel is considered.

Corollary 1 —

Let $\kappa(s, t) = st$ be the linear kernel. Then, as $n \to \infty$, $\| \hat R_{ij\cdot} \|_{\rm HS}$ converges in probability to the absolute value of the partial correlation between $X^i$ and $X^j$ given $X^{-(i,j)}$.
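Corollary 1 can be checked numerically against the classical formula that expresses the partial correlation through the precision matrix; the data-generating step below is an illustrative assumption.

```python
import numpy as np

def partial_corr(X, i, j):
    """Sample partial correlation of columns i and j given the rest."""
    P = np.linalg.inv(np.cov(X, rowvar=False))   # sample precision matrix
    return -P[i, j] / np.sqrt(P[i, i] * P[j, j])

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 4))
X[:, 1] += 0.8 * X[:, 0]                         # induce conditional dependence
print(abs(partial_corr(X, 0, 1)))                # the limit in Corollary 1
```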

5.3. Reduced kernel and generalized crossvalidation

To make our method readily applicable to relatively large networks with thousands of nodes, we now propose, as alternatives to (9) and (10), simplified algorithms for estimating $\| \hat\Sigma_{ij\cdot} \|_{\rm HS}$ and $\| \hat R_{ij\cdot} \|_{\rm HS}$. Lower-frequency eigenfunctions of kernels often play dominant roles, and the numbers of statistically significant eigenvalues of kernel matrices are often much smaller than $n$; see, for example, Lee & Huang (2007) and Chen et al. (2010). By employing only the dominant low-frequency eigenfunctions, we can greatly reduce the amount of computation without incurring much loss of accuracy. Let the eigendecomposition of the kernel matrix $G_i$ be written as

\[ G_i = U_{i1} D_{i1} U_{i1}^{\mathrm T} + U_{i2} D_{i2} U_{i2}^{\mathrm T}, \qquad (11) \]

where $(U_{i1}, D_{i1})$ corresponds to the first $d$ eigenvalues of $G_i$ and $(U_{i2}, D_{i2})$ corresponds to the last $n - d$ eigenvalues. Instead of the original bases $\mathcal B_i$, we now work with the reduced bases spanned by the eigenfunctions corresponding to $D_{i1}$, which will be written simply as $\mathcal C$.
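A sketch of the reduced-basis step (11): keep only the $d$ leading eigenpairs of the centred Gram matrix. Note that numpy's eigh returns eigenvalues in ascending order, so the dominant pairs come last; the function name is illustrative.

```python
import numpy as np

def reduced_basis(G, d):
    """Return the d largest eigenvalues of G and their eigenvectors."""
    w, V = np.linalg.eigh(G)
    idx = np.argsort(w)[::-1][:d]
    return w[idx], V[:, idx]          # shapes (d,) and (n, d)

# Usage: with G from the earlier sketch, lam, U1 = reduced_basis(G, 5)
# gives the low-frequency coordinates used in place of the full basis.
```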

Using derivations similar to (8) and (10), we find the coordinate representation of the additive conditional covariance operator with respect to the new basis $\mathcal C$ (12), and correspondingly the Hilbert–Schmidt norms of the additive conditional covariance operator and the additive partial correlation operator can be computed from Frobenius matrix norms of the reduced coordinate matrices (13). In (12) we need to invert a matrix whose dimension is proportional to $p$, which could be large if $p$ is large. However, as shown in Proposition 4 of Li et al. (2014), calculation of this matrix can be reduced to the eigendecomposition of an $n \times n$ matrix.

For the choice of $d$, we follow Fan et al. (2011) and determine it adaptively according to the sample size $n$. Specifically, we take

(14)

We use the reduced kernel bases consistently for all the simulations and the real-data analysis in §6. Based on our experience, using the reduced bases not only cuts the computation time substantially but also gives very high accuracy compared with using the full bases.

Next, we introduce a generalized crossvalidation procedure to choose the thresholds $\delta$ and $\rho$. Our process roughly follows Li et al. (2014). Given a threshold, let $\hat{\mathcal E}$ be the graph estimated by either criterion in (13), and define the neighbours of node $i$ as $N_i = \{ j : (i, j) \in \hat{\mathcal E} \}$. Our strategy is to regress each node on its neighbours and obtain the residuals; the generalized crossvalidation criterion is then used to minimize the total prediction error. Specifically, the threshold is determined by minimizing

(15)

where the ridge tuning parameters $\tau_i$ are chosen differently for each node, as shown in the next subsection.
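The following sketch computes the standard generalized crossvalidation score for a linear smoother, which is the building block of criteria such as (15); treating each neighbourhood regression as kernel ridge regression is an assumption made for illustration, and the function names are not from the paper.

```python
import numpy as np

def gcv_score(y, H):
    """n ||y - Hy||^2 / {n - tr(H)}^2 for a linear smoother y_hat = H y."""
    n = y.shape[0]
    resid = y - H @ y
    return n * float(resid @ resid) / (n - np.trace(H)) ** 2

def ridge_smoother(G_N, tau):
    """Smoother of a kernel ridge regression on the neighbours' Gram matrix G_N."""
    return G_N @ np.linalg.inv(G_N + tau * np.eye(G_N.shape[0]))
```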

5.4. Algorithm

The following algorithm summarizes the estimating procedure for the additive semigraphoid model based on the estimated additive partial correlation operator and the estimated additive conditional covariance operator.

Step 1 —

For each $i = 1, \ldots, p$, standardize $X^i_1, \ldots, X^i_n$ such that $E_n(X^i) = 0$ and $E_n\{ (X^i)^2 \} = 1$.

Step 2 —

Select the kernel $\kappa$, for example as the radial basis function $\kappa(s, t) = \exp\{ -\gamma^2 (s - t)^2 \}$, where $\gamma$ is the bandwidth parameter. As in Lee et al. (2013), we recommend choosing $\gamma$ via

\[ \gamma^{-1} = \binom{n}{2}^{-1} \sum_{a < b} \| X_a - X_b \|. \]
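A small sketch of the bandwidth rule in Step 2, taking $1/\gamma$ to be the average pairwise distance between observations; the function name and the example data are illustrative.

```python
import numpy as np

def heuristic_gamma(X):
    """1/gamma = average pairwise Euclidean distance over all n(n-1)/2 pairs."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(X.shape[0], k=1)
    return 1.0 / dist[iu].mean()

rng = np.random.default_rng(4)
gamma = heuristic_gamma(rng.standard_normal((100, 5)))
```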

Step 3 —

Use the selected $\kappa$ and $\gamma$ to compute the kernel matrices $K_i$, their centred counterparts $G_i$, and the eigendecomposition (11) for each $i = 1, \ldots, p$. Choose $d$ according to (14).

Step 4 —

Determine the tuning parameters $\epsilon_n$, $\eta_n$ and $\tau_i$ as fractions of the largest singular values of the relevant matrices to be penalized; that is, each parameter is set to a constant, $c_1$, $c_2$ or $c_3$, times the largest singular value of the matrix it regularizes, where the constants control the smoothing effects. We fix $c_1$ and choose $c_2$ and $c_3$ based on a criterion similar to that used in Step 3. Finally, a further approximation can be used to simplify the computation.

Step 5 —

For each pair $(i, j)$, calculate $\| \hat\Sigma_{ij\cdot} \|_{\rm HS}$ or $\| \hat R_{ij\cdot} \|_{\rm HS}$ using (9) and (10) or their fast versions given in (13).

Step 6 —

Compute the thresholds that minimize (15), and determine the graph using either of the two criteria. For example, if $\rho^*$ is the best threshold for the additive partial correlation criterion, then remove $(i, j)$ from the edge set if $\| \hat R_{ij\cdot} \|_{\rm HS} \leq \rho^*$.
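The following schematic driver strings Steps 1–6 together, reusing the helpers sketched earlier (heuristic_gamma, centred_gram, reduced_basis); the pairwise norm routine apc_norm stands in for the computations in (9), (10) and (13) and is an assumed interface, not the paper's formula.

```python
import numpy as np

def estimate_graph(X, d, thresh, apc_norm):
    n, p = X.shape
    X = (X - X.mean(0)) / X.std(0)                       # Step 1: standardize
    gamma = heuristic_gamma(X)                           # Step 2: bandwidth
    bases = []
    for i in range(p):                                   # Step 3: reduced bases
        K = np.exp(-gamma**2 * (X[:, i, None] - X[None, :, i])**2)
        lam, U1 = reduced_basis(centred_gram(K), d)
        bases.append(U1 * np.sqrt(lam))
    edges = set()
    for i in range(p):                                   # Steps 5-6: norms, threshold
        for j in range(i + 1, p):
            if apc_norm(bases, i, j) > thresh:
                edges.add((i, j))
    return edges
```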

6. Numerical study

6.1. Additive and high-dimensional settings

By means of simulated examples, we compare the additive partial correlation operator with the additive conditional covariance operator of Li et al. (2014) and the methods of Yuan & Lin (2007), Liu et al. (2009), Fellinghauer et al. (2013) and Voorman et al. (2014). The additive partial correlation operator is able to identify the graph whose underlying distribution does not satisfy the Gaussian or copula Gaussian assumption. To demonstrate this feature, we generate dependent random variables that do not have Gaussian or copula Gaussian distributions using the structural equation models of Pearl (2009). Specifically, given an edge set $\mathcal E$, we generate $X^1, \ldots, X^p$ sequentially via

\[ X^i = \sum_{j < i : \, (i,j) \in \mathcal E} f(X^j) + \epsilon^i \qquad (i = 1, \ldots, p), \]

where $f$ is the link function and $\epsilon^1, \ldots, \epsilon^p$ are independent and identically distributed standard Gaussian variables. If $f$ is linear, the joint distribution is Gaussian; otherwise, the joint distribution may be neither Gaussian nor copula Gaussian.

We consider the following graphical models based on three choices of $f$.

  • Model I: $f(t) = t$.

  • Model II: Inline graphic.

  • Model III: Inline graphic

The sample sizes are taken to be $n = 50$ and 100, with the number of nodes $p$ held fixed.

We use the hub structure to generate the underlying graphs and the corresponding edge sets $\mathcal E$. Hubs are commonly observed in networks such as gene regulatory networks and citation networks; see Newman (2003). Specifically, given a graph of size $p$, ten independent hubs are generated so that each module is of equal degree. For each of the six combinations of model and sample size, we generate 100 samples and produce the averaged receiver operating characteristic curves and the areas under these curves. To draw the curves, we need to compute the false positive and true positive rates. Suppose $\hat{\mathcal E}$ is an estimate of $\mathcal E$; then the formal definitions of these two measures are

\[ \mathrm{TP} = \frac{ | \hat{\mathcal E} \cap \mathcal E | }{ | \mathcal E | }, \qquad \mathrm{FP} = \frac{ | \hat{\mathcal E} \cap \mathcal E^c | }{ | \mathcal E^c | }, \qquad (16) \]

where $\mathcal E^c$ denotes the complement of $\mathcal E$ in the set of all possible edges.
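The rates in (16) can be computed directly from edge sets, treating edges as unordered pairs; the toy example is illustrative.

```python
def tp_fp(E_hat, E_true, p):
    """True and false positive rates of an estimated edge set among p nodes."""
    all_pairs = {(i, j) for i in range(p) for j in range(i + 1, p)}
    tp = len(E_hat & E_true) / len(E_true)
    fp = len(E_hat - E_true) / len(all_pairs - E_true)
    return tp, fp

# Example: p = 4 nodes, true edges {(0,1),(1,2)}, estimate {(0,1),(2,3)}.
print(tp_fp({(0, 1), (2, 3)}, {(0, 1), (1, 2)}, 4))   # (0.5, 0.25)
```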

The receiver operating characteristic curves are plotted in Fig. 1.

Fig. 1. Receiver operating characteristic curves for different estimators: the additive partial correlation operator; the additive conditional covariance operator; the method of Yuan & Lin (2007); the method of Liu et al. (2009); the method of Fellinghauer et al. (2013); and the method of Voorman et al. (2014) with the default basis. The two middle panels also display the method of Voorman et al. (2014) with the correct basis for Model II. In each panel the horizontal axis shows the false positive rate and the vertical axis the true positive rate.

For all the comparisons in §§6.1 and 6.2, we use the radial basis function for both the additive conditional covariance and the additive partial correlation operators. For Model I, we see that the methods of Yuan & Lin (2007) and Liu et al. (2009) perform better than the nonparametric methods. This is not surprising, as Gaussianity holds under Model I and both methods use the $L_1$ penalty, which is more efficient than thresholding. Nevertheless, the performance of the additive partial correlation operator is not far behind. For example, the areas under the receiver operating characteristic curves for the additive partial correlation operator have an average of 0·98 for the two curves in Model I, only slightly smaller than the average of the areas under the curves for the methods of Yuan & Lin (2007) and Liu et al. (2009), which is 1·00.

For Models II and III, under which neither Gaussianity nor copula Gaussianity is satisfied, the methods of Yuan & Lin (2007) and Liu et al. (2009) do not perform well. In contrast, both the additive conditional covariance and the additive partial correlation operators still perform remarkably well. Moreover, the receiver operating characteristic curves of the additive partial correlation operator are consistently better than those of the additive conditional covariance operator for Models I and II and for sample sizes 50 and 100, indicating the benefit of the better scaling achieved by the additive partial correlation operator. We also observe that the performance of the method of Fellinghauer et al. (2013) is not very stable. Since their method is based on random forests, it may be affected by the curse of dimensionality from which a fully nonparametric approach tends to suffer. The method of Voorman et al. (2014) is implemented using the R package spacejam (R Development Core Team, 2016), whose default basis is the cubic polynomial. It improves on the methods of Yuan & Lin (2007) and Liu et al. (2009), but does not perform as well as the additive partial correlation operator. To investigate the effect of the choice of basis on the method of Voorman et al. (2014), we compute its receiver operating characteristic curve for Model II using the correct basis. Notably, with the correct basis this method performs best under Model II among all the competing methods. Results for smaller graphs are presented in the Supplementary Material.

6.2. Nonadditive and low-dimensional settings

We also investigate a setting where the relationships between nodes are nonadditive and the dimension of the graph is relatively low, which favours a fully nonparametric method such as the method of Fellinghauer et al. (2013). Specifically, we consider

  • Model IV: Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic, Inline graphic,

where the error terms are independent and identically distributed standard Gaussian variables.

Our goal is to recover the graph determined by the set of pairwise conditional independence relations: $(i, j)$ is excluded from the edge set whenever $X^i \perp\!\!\!\perp X^j \mid X^{-(i,j)}$. Under Model IV, the edge set is determined accordingly by the structural equations. The graphical model based on pairwise conditional independence cannot fully describe the interdependence in $X$, because it cannot capture three-way or multi-way conditional dependence. A fully descriptive approach in such situations would be to use a hypergraph (Lauritzen, 1996, p. 21). Nevertheless, the pairwise conditional independence graphical model is well-defined and helps to illustrate the difference between an additive and a fully nonparametric model.

We compute the receiver operating characteristic curves for 100 replicates, which are presented in the Supplementary Material. Since the model is nonlinear, we compare the additive partial correlation operator only with the additive conditional covariance operator and the methods of Voorman et al. (2014) and Fellinghauer et al. (2013). The method of Fellinghauer et al. (2013) performs best, because it allows the conditional expectation of each node to be a nonadditive function of its neighbouring nodes. On the other hand, the additive partial correlation operator still performs reasonably well. This indicates that, in spite of its additive formulation, the additive partial correlation operator is capable of identifying conditional independence even in nonadditive models.

6.3. Effects of the choices of kernels, ridge parameters and number of eigenfunctions

In this subsection we study the performance of the additive partial correlation operator with different choices of kernel. We investigate six types of kernel: the radial basis function, the rational quadratic kernel with parameters 200 and 400, the linear kernel, the quadratic kernel, and the Laplacian kernel. The choice of parameters for the rational quadratic kernel follows Li et al. (2014). For each model, ten replicates are generated. The averaged receiver operating characteristic curves for the six kernels are presented in the Supplementary Material. The results suggest that all the nonlinear kernels give comparable performance across Models I, II and III. As expected, the linear kernel fails for Models II and III.

Next, we investigate the sensitivity of the proposed estimator to the tuning parameters $\epsilon_n$ and $\eta_n$. We take 20 equally spaced grid points in each of two ranges centred at the values of $\epsilon_n$ and $\eta_n$ computed via the empirical formulas in §5.4. Then, for each of the 400 combinations, a receiver operating characteristic curve is produced and its area under the curve is computed. The means of the areas for Models I, II and III are 0·995, 0·956 and 0·971, respectively, with standard deviations of 0·004, 0·004 and 0·011. These values indicate that the performance of the proposed estimator is reasonably robust with respect to the choice of tuning parameters. The actual receiver operating characteristic curves for different combinations of tuning parameters are plotted in the Supplementary Material.

We also investigated the effect of using different numbers of eigenfunctions. For each of Models I–III, we increase $d$ from 1 to $n$ and, for each fixed $d$, produce a receiver operating characteristic curve and compute the area under the curve. The areas under the curves are reported in the Supplementary Material. The results show that the effect of using a different number of eigenfunctions varies across the three models, which is to be expected as they have different complexities. Specifically, a single eigenfunction achieves the largest area under the curve for the linear model, but for the nonlinear models the optimal areas are achieved when more eigenfunctions are used. We also see that our choice of $d$ in (14) is not far from the best choice for all three models.

6.4. Exploring the generalized crossvalidation procedure

In this subsection we investigate the performance and computational cost of the generalized crossvalidation procedure introduced in §5.3, and compare it with the method of Voorman et al. (2014) using two different selection criteria: the Akaike information criterion and the Bayesian information criterion. Three measures are used to evaluate the comparisons: the true positive and false positive rates in (16), and a synthetic score defined as

\[ \mathrm{SC} = \{ (1 - \mathrm{TP})^2 + \mathrm{FP}^2 \}^{1/2}. \qquad (17) \]

Table 1 shows the averages of these criteria over 100 replicates for Model III. We omit the results obtained from the method of Voorman et al. (2014) using the Bayesian information criterion, because the Akaike information criterion for the same method always performs better in this setting. Our procedure consistently picks thresholds located around the best scenario. In comparison, the method of Voorman et al. (2014) with the Akaike information criterion does not perform as well as our estimator.

Table 1.

Comparison of the tuning procedures for (a) the additive partial correlation operator with generalized crossvalidation and (b) the method of Voorman et al. (2014) with the Akaike information criterion; TP and FP are defined in (16) and SC in (17); larger TP, smaller FP and lower SC indicate better performance

       TP    FP    SC      TP    FP    SC      TP    FP    SC
(a)   97·1  10·9  0·11    98·1  17·8  0·18    98·7  22·1  0·22
(b)   58·8  24·7  0·61    90·0  71·6  0·72    78·6  51·5  0·56

We also compare the computational costs of the two methods in estimating larger networks with $p = 1000$ or 5000. For the tuning parameters, 40 grid points are used for both the additive partial correlation operator and the method of Voorman et al. (2014). The results are reported in Table 2. With $p = 1000$, our algorithm takes only minutes to complete, and for $p = 5000$ it is still reasonably efficient. In terms of estimation accuracy, our method has smaller SC than the method of Voorman et al. (2014). The complexity of the additive partial correlation operator grows as $O(p^2)$. However, for handling graphs with thousands of nodes, our method is faster than the regression-based approaches.

Table 2.

Comparison of computing times for (a) the additive partial correlation operator with generalized crossvalidation and (b) the method of Voorman et al. (2014) with the Akaike information criterion. All experiments were conducted on an Intel Xeon E5520 CPU

                  p = 1000                         p = 5000
      Minutes   SC     Minutes   SC      Minutes   SC     Minutes   SC
(a)     2·3    0·35      8·6    0·23      58·4    0·59     213·3    0·54
(b)    47·7    0·87    122·8    0·68     553·4    0·97    1572·8    0·96

6.5. Application to the DREAM4 Challenges data

We apply the six methods to a dataset from the DREAM4 Challenges project (Marbach et al., 2010). The goal of this study is to infer network structure from gene expression data. The topologies of the graphs are obtained by extracting subgraphs from real biological networks. The gene expression levels are generated based on a system of ordinary differential equations governing the dynamics of the biological interactions between the genes. There are five networks of size 100 to be estimated in this dataset. For each network, we stack up observations from three different experimental conditions, wild-type, knockdown and knockout, so that the overall sample size $n$ is 201. The estimated graphs are then produced using the additive partial correlation operator, the additive conditional covariance operator of Li et al. (2014), and the methods of Voorman et al. (2014), Fellinghauer et al. (2013), Yuan & Lin (2007) and Liu et al. (2009). The areas under the receiver operating characteristic curves are reported in Table 3, and the actual curves are displayed in the Supplementary Material. We see that the additive partial correlation operator consistently performs best among the six estimators.

Table 3.

Areas under the receiver operating characteristic curves for the DREAM4 Challenges dataset, obtained from (a) the additive partial correlation operator, (b) the additive conditional covariance operator, (c) the method of Voorman et al. (2014), (d) the method of Fellinghauer et al. (2013), (e) the method of Liu et al. (2009), (f) the method of Yuan & Lin (2007), and (g) the championship method

             (a)    (b)    (c)    (d)    (e)    (f)    (g)
Network 1   0·86   0·67   0·79   0·73   0·61   0·74   0·91
Network 2   0·81   0·62   0·70   0·64   0·57   0·70   0·81
Network 3   0·83   0·70   0·77   0·68   0·64   0·73   0·83
Network 4   0·83   0·71   0·76   0·71   0·61   0·72   0·83
Network 5   0·77   0·66   0·70   0·73   0·61   0·70   0·75

The original DREAM4 project was open to public challenges, so it is reasonable to compare our results with those submitted by the participating teams. In column (g) of Table 3 we show the areas under the receiver operating characteristic curves obtained from the method of the championship team. The additive partial correlation operator yields the best areas under the curves for four of the five networks; in particular, it performs better than the method of the championship team for Network 5. As mentioned in Marbach et al. (2010), the best-performing approach used a combination of multiple models, including ordinary differential equations. Our operator replicates the most competitive results without employing any prior information on the model setting, which demonstrates the benefit of relaxing the distributional assumption; moreover, its additive structure does not seem to hamper its accuracy in this application.

7. Concluding remarks

In establishing the consistency of the additive conditional covariance operator and the additive partial correlation operator, we have developed a theoretical framework that is not limited to the current setting; it can be applied to other problems where additive conditional independence and linear operators are involved. Moreover, the idea of characterizing conditional independence by small values of the additive partial correlation operator has ramifications beyond those explored in this paper. For instance, the penalty in the proposed additive partial correlation operator is based on hard thresholding, but other penalties, such as the lasso-type penalties, may be more efficient in dealing with sparsity in the estimation of operators. We leave these extensions and refinements to future research.

Supplementary material

Supplementary material available at Biometrika online includes the proofs of the theoretical results and additional plots for the numerical studies.


Acknowledgments

We are grateful to three referees for their constructive comments and helpful suggestions. Bing Li's research was supported in part by the U.S. National Science Foundation; Hongyu Zhao's research was supported in part by both the National Science Foundation and the National Institutes of Health.

References

  1. Bach F. R. (2008). Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res. 9, 1179–225.
  2. Baker C. R. (1973). Joint measures and cross-covariance operators. Trans. Am. Math. Soc. 186, 273–89.
  3. Banerjee O., El Ghaoui L. & d'Aspremont A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516.
  4. Bellman R. E. (1957). Dynamic Programming. Princeton: Princeton University Press.
  5. Bickel P. J. & Levina E. (2008a). Covariance regularization by thresholding. Ann. Statist. 36, 2577–604.
  6. Bickel P. J. & Levina E. (2008b). Regularized estimation of large covariance matrices. Ann. Statist. 36, 199–227.
  7. Cai T., Liu W. & Luo X. (2011). A constrained $L_1$ minimization approach to sparse precision matrix estimation. J. Am. Statist. Assoc. 106, 594–607.
  8. Candès E. & Tao T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35, 2313–51.
  9. Chen P.-C., Lee K.-Y., Lee T.-J., Lee Y.-J. & Huang S.-Y. (2010). Multiclass support vector classification via coding and regression. Neurocomputing 73, 1501–12.
  10. Conway J. B. (1994). A Course in Functional Analysis, 2nd ed. New York: Springer.
  11. Fan J., Feng Y. & Song R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Am. Statist. Assoc. 106, 544–57.
  12. Fan J. & Li R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 96, 1348–60.
  13. Fellinghauer B., Bühlmann P., Ryffel M., von Rhein M. & Reinhardt J. D. (2013). Stable graphical model estimation with random forests for discrete, continuous, and mixed variables. Comp. Statist. Data Anal. 64, 132–52.
  14. Friedman J. H., Hastie T. J. & Tibshirani R. J. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–41.
  15. Fukumizu K., Bach F. R. & Gretton A. (2007). Statistical consistency of kernel canonical correlation analysis. J. Mach. Learn. Res. 8, 361–83.
  16. Fukumizu K., Bach F. R. & Jordan M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 5, 73–99.
  17. Fukumizu K., Bach F. R. & Jordan M. I. (2009). Kernel dimension reduction in regression. Ann. Statist. 37, 1871–905.
  18. Fukumizu K., Gretton A., Sun X. & Schölkopf B. (2008). Kernel measures of conditional dependence. Adv. Neural Info. Proces. Syst. 20, 489–96.
  19. Harris N. & Drton M. (2013). PC algorithm for nonparanormal graphical models. J. Mach. Learn. Res. 14, 3365–83.
  20. Hoerl A. E. & Kennard R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
  21. Horn R. A. & Johnson C. R. (1985). Matrix Analysis. Cambridge: Cambridge University Press.
  22. Lam C. & Fan J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. 37, 4254–78.
  23. Lauritzen S. L. (1996). Graphical Models. Oxford: Oxford University Press.
  24. Lee K.-Y., Li B. & Chiaromonte F. (2013). A general theory for nonlinear sufficient dimension reduction: Formulation and estimation. Ann. Statist. 41, 221–49.
  25. Lee K.-Y., Li B. & Zhao H. (2016). Variable selection via additive conditional independence. J. R. Statist. Soc. B, to appear, doi:10.1111/rssb.12150.
  26. Lee Y.-J. & Huang S.-Y. (2007). Reduced support vector machines: A statistical theory. IEEE Trans. Neural Networks 18, 1–13.
  27. Li B., Chun H. & Zhao H. (2012). Sparse estimation of conditional graphical models with application to gene networks. J. Am. Statist. Assoc. 107, 152–67.
  28. Li B., Chun H. & Zhao H. (2014). On an additive semi-graphoid model for statistical networks with application to pathway analysis. J. Am. Statist. Assoc. 109, 1188–204.
  29. Liu H., Han F., Yuan M., Lafferty J. & Wasserman L. (2012). High-dimensional semiparametric Gaussian copula graphical models. Ann. Statist. 40, 2293–326.
  30. Liu H., Lafferty J. & Wasserman L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res. 10, 2295–328.
  31. Marbach D., Prill R. J., Schaffter T., Mattiussi C., Floreano D. & Stolovitzky G. (2010). Revealing strengths and weaknesses of methods for gene network inference. Proc. Nat. Acad. Sci. 107, 6286–91.
  32. Meinshausen N. & Bühlmann P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34, 1436–62.
  33. Muirhead R. J. (2005). Aspects of Multivariate Statistical Theory, 2nd ed. New York: Wiley.
  34. Newman M. (2003). The structure and function of complex networks. SIAM Rev. 45, 167–256.
  35. Pearl J. (2009). Causality: Models, Reasoning and Inference, 2nd ed. Cambridge: Cambridge University Press.
  36. Pearl J., Geiger D. & Verma T. (1989). Conditional independence and its representations. Kybernetika 25, 33–44.
  37. Pearl J. & Verma T. (1987). The logic of representing dependencies by directed graphs. In Proceedings of the Sixth National Conference on Artificial Intelligence, vol. 1. AAAI Press, pp. 374–9.
  38. Peng J., Wang P., Zhou N. & Zhu J. (2009). Partial correlation estimation by joint sparse regression models. J. Am. Statist. Assoc. 104, 735–46.
  39. R Development Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org.
  40. Tibshirani R. J. (1996). Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B 58, 267–88.
  41. Voorman A., Shojaie A. & Witten D. (2014). Graph estimation with joint additive models. Biometrika 101, 85–101.
  42. Xue L. & Zou H. (2012). Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Ann. Statist. 40, 2541–71.
  43. Yuan M. (2010). High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 11, 2261–86.
  44. Yuan M. & Lin Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.

