Author manuscript; available in PMC: 2019 Mar 11.
Published in final edited form as: Adv Neural Inf Process Syst. 2017 Dec;30:4446–4456.

The Expxorcist: Nonparametric Graphical Models Via Conditional Exponential Densities

Arun Sai Suggala, Mladen Kolar, Pradeep Ravikumar
PMCID: PMC6411086  NIHMSID: NIHMS979858  PMID: 30867622

Abstract

Non-parametric multivariate density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose near-parametric assumptions on the form of the density functions. In this paper, we leverage recent developments to propose a class of non-parametric models which have very attractive computational and statistical properties. Our approach relies on the simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form.

1. Introduction

Let X = (X1, …, Xp) be a p-dimensional random vector. Let G = (V, E) be the graph that encodes conditional independence assumptions underlying the distribution of X, that is, each node of the graph corresponds to a component of vector X and $(a, b) \in E$ if and only if $X_a \not\perp\!\!\!\perp X_b \mid X_{\neg ab}$ with $X_{\neg ab} := \{X_c : c \in V \setminus \{a, b\}\}$. The graphical model represented by G is then the set of distributions over X that satisfy the conditional independence assumptions specified by the graph G.

There has been a considerable line of work on learning parametric families of such graphical model distributions from data [22, 20, 13, 28], where the distribution is indexed by a finite-dimensional parameter vector. The goal of this paper, however, is to specify and learn nonparametric families of graphical model distributions, indexed by infinite-dimensional parameters, for which there has been comparatively limited work. Non-parametric multivariate density estimation broadly, even without the graphical model constraint, has not proved as popular in practical machine learning contexts, for both statistical and computational reasons. Loosely, estimating a non-parametric multivariate density under mild assumptions typically requires the number of samples to scale exponentially in the dimension p of the data, which is infeasible even in the big-data era when n is very large. And the resulting estimators are typically computationally expensive or intractable, for instance requiring repeated computations of multivariate integrals.

We present a review of multivariate density estimation that is necessarily incomplete but sets up our proposed approach. A common approach dating back to [15] uses the logistic density transform to satisfy the unity and positivity constraints for densities, and considers densities of the form $f(X) = \frac{\exp(\eta(X))}{\int_{\mathcal{X}} \exp(\eta(x))\,dx}$, with some constraints on $\eta$ for identifiability such as $\eta(X_0) = 0$ for some $X_0 \in \mathcal{X}$ or $\int_{\mathcal{X}} \eta(x)\,dx = 0$.

With the logistic density transform, differing approaches for non-parametric density estimation can be contrasted in part by their assumptions on the infinite-dimensional function space domain of η(·). An early approach [8] considered function spaces of functions with bounded “roughness” functionals. The predominant line of work however has focused on the setting where η(·) lies in a Reproducing Kernel Hilbert Space (RKHS), dating back to [21]. Consider the estimation of these logistic density transforms η(X) given n i.i.d. samples $\mathcal{X}^n = \{X^{(i)}\}_{i=1}^{n}$ drawn from $f_\eta(X)$. A natural loss functional is penalized log likelihood, with a penalty functional that ensures a smooth fit with respect to the function space domain: $\ell(\eta; \mathcal{X}^n) := -\frac{1}{n}\sum_{i \in [n]} \eta(X^{(i)}) + \log \int \exp(\eta(x))\,dx + \lambda\,\mathrm{pen}(\eta)$, for functions η(·) that lie in an RKHS $\mathcal{H}$, and where $\mathrm{pen}(\eta) = \|\eta\|_{\mathcal{H}}^2$ is the squared RKHS norm. This was studied by many [21, 11, 6]. A crucial caveat is that the representer theorem for RKHSs does not hold. Nonetheless, one can consider finite-dimensional function space approximations consisting of the linear span of kernel functions evaluated at the sample points [12]. Computationally this still scales poorly with the dimension due to the need to compute multidimensional integrals of the form $\int \exp(\eta(x))\,dx$ which do not, in general, decompose. These approximations also do not come with strong statistical guarantees.

We briefly note that the function space assumption that η(·) lies in an RKHS could also be viewed from the lens of an infinite-dimensional exponential family [4]. Specifically, let $\mathcal{H}$ be a Reproducing Kernel Hilbert Space with reproducing kernel k(·, ·), and inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$. Then $\eta(X) = \langle \theta(\cdot), k(X, \cdot)\rangle_{\mathcal{H}}$, so that the density f(X) can in turn be viewed as a member of an infinite-dimensional exponential family with sufficient statistics $k(X, \cdot): \mathcal{X} \to \mathcal{H}$, and natural parameter $\theta(\cdot) \in \mathcal{H}$. Following this viewpoint, [4] propose estimators via linear span approximations similar to [11].

Due to the computational caveat with exact likelihood based functionals, a line of approaches has focused on penalized surrogate likelihoods instead. [14] study the following loss functional: $\ell(\eta; \mathcal{X}^n) := \frac{1}{n}\sum_{i \in [n]} \exp(-\eta(X^{(i)})) + \int \eta(x)\rho(x)\,dx + \lambda\,\mathrm{pen}(\eta)$, where ρ(X) is some fixed known density with the same support as the unknown density f(X). While this estimation procedure is much more computationally amenable than minimizing the exact penalized likelihood, the caveat, however, is that for a general RKHS this requires solving higher order integrals. The next level of simplification has thus focused on the form of the logistic transform function itself. There has been a line of work on an ANOVA type decomposition of the logistic density function into node-wise and pairwise terms: $\eta(X) = \sum_{s=1}^{p}\eta_s(X_s) + \sum_{s=1}^{p}\sum_{t=s+1}^{p}\eta_{st}(X_s, X_t)$. A line of work has coupled such a decomposition with the assumption that each of the terms lies in an RKHS. This does not immediately provide a computational benefit: with penalized likelihood based loss functionals, the loss functional does not necessarily decompose into such node and pairwise terms. [24] thus couple this ANOVA type pairwise decomposition with a score matching based objective. [10] use the above decomposition with the surrogate loss functional of [14] discussed above, but note that this still requires the aforementioned function space approximation as a linear span of kernel evaluations, as well as two-dimensional integrals.

A line of recent work has thus focused on further stringent assumptions on the density function space, by assuming some components of the logistic transform to be finite-dimensional. [30] use an ANOVA decomposition but assume the terms belong to finite-dimensional function spaces instead of RKHSs, specified by a pre-defined finite set of basis functions. [29] consider logistic transform functions η(·) that have the pairwise decomposition above, with a specific class of parametric pairwise functions $\beta_{st}X_sX_t$, and non-parametric node-wise functions. [17, 16] consider the problem of estimating monotonic node-wise functions such that the transformed random vector is multivariate Gaussian; which could also be viewed as estimating a Gaussian copula density.

To summarize the (necessarily incomplete) review above, non-parametric density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose stringent near-parametric assumptions on the form of the (logistic transform of the) density functions. In this paper, we leverage recent developments to propose a very computationally simple non-parametric density estimation algorithm, that still comes with strong statistical guarantees. Moreover, the density could be viewed as a graphical model distribution, with a corresponding sparse conditional independence graph.

Our approach relies on the following simple function space assumption: that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form. As we show, for there to exist a consistent joint density, the logistic density transform with respect to a particular base measure necessarily decomposes into the following semi-parametric form: $\eta(X) = \sum_{s=1}^{p}\theta_s B_s(X_s) + \sum_{s=1}^{p}\sum_{t=s+1}^{p}\theta_{st}B_s(X_s)B_t(X_t)$ in the pairwise case, with both a parametric component $\{\theta_s : s = 1, \ldots, p\}$, $\{\theta_{st} : s < t;\ s, t = 1, \ldots, p\}$, as well as non-parametric components $\{B_s : s = 1, \ldots, p\}$. We call this class of models the “expxorcist”, following other “ghostbusting” semi-parametric models such as the nonparanormal and nonparanormal skeptic [17, 16].

Since the conditional distributions are exponential families, we show that there exist computationally amenable estimators, even in our more general non-parametric setting, where the sufficient statistics have to be estimated as well. The statistical analysis in our non-parametric setting however is more subtle, due in part to non-convexity and in part to the non-parametric setting. We also show how the Expxorcist class of densities is closely related to a semi-parametric exponential family copula density that generalizes the Gaussian copula density of [17, 16]. We corroborate the applicability of our class of models with experiments on synthetic and real data sets.

2. Multivariate Density Specification via Conditional Densities

We are interested in the approach of estimating a multivariate density by estimating node-conditional densities. Since node-conditional densities focus on the density of a single variable, though conditioned on the rest of the variables, estimating these is potentially a simpler problem, both statistically and computationally, than estimating the entire joint density itself. Let us consider the general non-parametric conditional density estimation problem. Given the general multivariate density $f(X) = \frac{\exp(\eta(X))}{\int_{\mathcal{X}}\exp(\eta(x))\,dx}$, the conditional density of a variable $X_s$ given the rest of the variables $X_{-s}$ is given by $f(X_s \mid X_{-s}) = \frac{\exp(\eta((X_s, X_{-s})))}{\int_{\mathcal{X}_s}\exp(\eta((x, X_{-s})))\,dx}$, which does not have a multi-dimensional integral, but otherwise does not have a computationally amenable form. There has been a line of work on such conditional density estimation, mirroring developments in multivariate density estimation [9, 18, 23], but unlike parametric settings, there are no large sample complexity gains with non-parametric conditional density estimation under general settings. There have also been efforts to use ANOVA decompositions in a conditional density context [31, 26].

In addition to computational and sample complexity caveats, recall that in our context, we would like to use conditional density estimates to infer a joint multivariate density. A crucial caveat with using the above estimates to do so is that it is not clear when the estimated node-conditional densities would be consistent with a joint multivariate density. There has been a line of work on this question (of when conditional densities are consistent with a joint density) for parametric densities; see [1] for an overview, with more recent results in [27, 5, 2, 25]. Overall, while estimating node-conditional densities could be viewed as surrogate estimation of a joint density, arbitrary node-conditional distributions need not be consistent in general with any joint density. There has however been a line of work in recent years [3, 28], where it was shown that when the node-conditional distributions belong to an exponential family, then under certain conditions on their parameterization, there do exist multivariate densities consistent with the node-conditional densities. In the next section, we leverage these results towards non-parametric estimation of conditional densities.

3. Conditional Densities of an Exponential Family Form

We first recall the definition of an exponential family in the context of a conditional density.

Definition 1. A conditional density of a random variable $Y \in \mathcal{Y}$ given covariates $Z \equiv (Z_1, \ldots, Z_m) \in \mathcal{Z}$ is said to have an exponential family form if it can be written as $f(Y \mid Z) = \exp\big(B(Y)^T E(Z) + C(Y) + D(Z)\big)$, for some functions $B : \mathcal{Y} \to \mathbb{R}^k$ (for some finite integer k > 0), $E : \mathcal{Z} \to \mathbb{R}^k$, $C : \mathcal{Y} \to \mathbb{R}$ and $D : \mathcal{Z} \to \mathbb{R}$.

Thus, $f(Y \mid Z)$ belongs to a finite-dimensional exponential family with sufficient statistics $B(Y)$, base measure $\exp(C(Y))$, and with natural parameter $E(Z)$ and where $-D(Z)$ is the log-partition function. Contrast this with a general conditional density $f(Y \mid Z) = \exp(h(Y, Z) + C(Y) + D(Z))$ with respect to the base measure $\exp(C(Y))$ and $-D(Z)$ being the log-normalization constant, and it can be seen that a conditional density of the exponential family form has its logistic density transform $h(Y, Z)$ factorize as $B(Y)^T E(Z)$.
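As a simple illustrative instance of Definition 1 (our example, not from the original text): if $Y \mid Z \sim \mathcal{N}(\mu(Z), 1)$ for some function $\mu(\cdot)$, then $f(Y \mid Z) = \exp\big(Y\mu(Z) - Y^2/2 - \tfrac{1}{2}\log(2\pi) - \mu(Z)^2/2\big)$, which has the exponential family form with $B(Y) = Y$, $E(Z) = \mu(Z)$, $C(Y) = -Y^2/2 - \tfrac{1}{2}\log(2\pi)$ and $D(Z) = -\mu(Z)^2/2$.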

Consider the case where the sufficient statistic function is real-valued. The non-parametric estimation problem of a conditional density of exponential form then reduces to the estimation of the sufficient statistics function $B(\cdot)$ and the exponential natural parameter function $E(\cdot)$, assuming the base measure $C(\cdot)$ is given. But when would such estimated conditional densities be consistent with a joint density? To answer this question, we draw upon developments in [28]. Suppose that the node-conditional distributions of each random variable $X_s$ conditioned on the rest of the random variables have the exponential family form as in Definition 1, so that for each $s \in V$

$P(X_s \mid X_{-s}) \propto \exp\{E_s(X_{-s})\,B_s(X_s) + C_s(X_s)\},$ (1)

for some arbitrary functions $E_s(\cdot)$, $B_s(\cdot)$, $C_s(\cdot)$ that specify a valid conditional density. Then [28] show that these node-conditional densities are consistent with a unique joint density over the random vector X, that moreover factors according to a set of cliques $\mathcal{C}$ in the graph G, if and only if the functions $\{E_s(\cdot)\}_{s \in V}$ specifying the node-conditional distributions have the form $E_s(X_{-s}) = \theta_s + \sum_{C \in \mathcal{C}: s \in C}\theta_C\prod_{t \in C, t \neq s}B_t(X_t)$, where $\{\theta_s\} \cup \{\theta_C\}_{C \in \mathcal{C}}$ is a set of parameters. Moreover, the corresponding consistent joint distribution has the following form

$P(X) \propto \exp\Big\{\sum_{s \in V}\theta_s B_s(X_s) + \sum_{C \in \mathcal{C}}\theta_C\prod_{s \in C}B_s(X_s) + \sum_{s \in V}C_s(X_s)\Big\}.$ (2)

In this paper, we are interested in the non-parametric estimation of the Expxorcist class of densities in (2), where we estimate both the finite-dimensional parameters $\{\theta_s\} \cup \{\theta_C\}_{C \in \mathcal{C}}$, as well as the functions $\{B_s(X_s)\}_{s \in V}$. We assume we are given the base measures $\{C_s(X_s)\}_{s \in V}$, so that the joint density is with respect to a given product base measure $\prod_{s \in V}\exp(C_s(X_s))$, as is common in the multivariate density estimation literature. Note that this is not a very restrictive assumption: in practice the base measure at each node can be well approximated using the empirical univariate marginal density of that node. We could also extend our algorithm, which we present next, to estimate the base measures along with the sufficient statistic functions.
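As a rough illustration of approximating the log base measure from the empirical univariate marginal (our own sketch; the helper name and the use of SciPy's `gaussian_kde` are assumptions, not the authors' procedure):

```python
import numpy as np
from scipy.stats import gaussian_kde

def log_base_measure(samples_s, grid):
    """Approximate the log base measure C_s on a grid via a kernel density
    estimate of the univariate marginal of node s."""
    kde = gaussian_kde(samples_s)
    return np.log(np.maximum(kde(grid), 1e-12))  # floor to avoid log(0)
```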

4. Regularized Conditional Likelihood Estimation for Exponential Family Form Densities

We consider the nonparametric estimation problem of estimating a joint density of the form in (2), focusing on the pairwise case where the factors have size at most k = 2, so that the joint density takes the form

$P(X) \propto \exp\Big\{\sum_{s \in V}\theta_s B_s(X_s) + \sum_{(s,t) \in E}\theta_{st}B_s(X_s)B_t(X_t) + \sum_{s \in V}C_s(X_s)\Big\}.$ (3)

As detailed in the previous section, estimating this joint density can be reduced to estimating its node-conditional densities, which take the form

$P(X_s \mid X_{-s}) \propto \exp\Big\{B_s(X_s)\Big(\theta_s + \sum_{t \in N_G(s)}\theta_{st}B_t(X_t)\Big) + C_s(X_s)\Big\}.$ (4)

We now introduce some notation which we use in the sequel. Let $\Theta = \{\theta_s\}_{s \in V} \cup \{\theta_{st}\}_{s \neq t}$ and $\Theta_s = \{\theta_s\} \cup \{\theta_{st}\}_{t \in V\setminus\{s\}}$. Let $B = \{B_s\}_{s \in V}$ be the set of sufficient statistics. Let $\mathcal{X}_s$ be the domain of $X_s$, which we assume is bounded, and let $L^2(\mathcal{X}_s)$ be the Hilbert space of square integrable functions over $\mathcal{X}_s$ with respect to the Lebesgue measure. We assume that the sufficient statistics $B_s(\cdot) \in L^2(\mathcal{X}_s)$.

Note that the model in Equation (3) is unidentifiable. To overcome this issue we impose additional constraints on its parameters. Specifically, we require $B_s(X_s)$ to satisfy $\int_{\mathcal{X}_s} B_s(X)\,dX = 0$, $\int_{\mathcal{X}_s} B_s(X)^2\,dX = 1$ and $\theta_s \geq 0$, $\forall s \in V$.
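To see the unidentifiability concretely (an illustrative calculation, not from the original text): for any constants $c_s \neq 0$, the rescaling $B_s \leftarrow c_s B_s$, $\theta_s \leftarrow \theta_s/c_s$, $\theta_{st} \leftarrow \theta_{st}/(c_s c_t)$ leaves every term $\theta_s B_s(X_s)$ and $\theta_{st}B_s(X_s)B_t(X_t)$ in (3), and hence the density, unchanged; adding a constant to $B_s$ can likewise be absorbed into the node-wise terms and the normalization. The zero-mean, unit-norm and sign constraints above fix exactly these degrees of freedom.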

Optimization objective:

Let $\mathcal{X}^n = \{X^{(1)}, \ldots, X^{(n)}\}$ be n i.i.d. samples drawn from a joint density of the form in Equation (3), with parameters $\Theta^*$, $B^*$. And let $\mathcal{L}_s(\Theta_s, B; \mathcal{X}^n)$ be the node conditional negative log likelihood at node s

$\mathcal{L}_s(\Theta_s, B; \mathcal{X}^n) = \frac{1}{n}\sum_{i=1}^{n}\Big\{-B_s(X_s^{(i)})\Big(\theta_s + \sum_{t \in V\setminus\{s\}}\theta_{st}B_t(X_t^{(i)})\Big) + A(X_{-s}^{(i)}; \Theta_s, B)\Big\},$

where $A(X_{-s}; \Theta_s, B)$ is the log partition function. To estimate the unknown parameters, we solve the following regularized node conditional log-likelihood estimation problem at each node $s \in V$

$\min_{\Theta_s, B}\ \mathcal{L}_s(\Theta_s, B; \mathcal{X}^n) + \lambda_n\|\Theta_s\|_1 \quad\text{s.t.}\quad \theta_s \geq 0,\ \int_{\mathcal{X}_t} B_t(X)\,dX = 0,\ \int_{\mathcal{X}_t} B_t(X)^2\,dX = 1\ \ \forall t \in V.$ (5)

The equality constraints on the norms of the functions $B_t(\cdot)$ make the above optimization problem a difficult one to solve. While the norm constraints on $B_t(\cdot)$, $\forall t \in V\setminus\{s\}$, can be handled through re-parametrization, the constraint on $B_s(\cdot)$ cannot be handled efficiently. To make the optimization more amenable for numerical optimization techniques, we solve a closely related optimization problem. At each node $s \in V$, we consider the following re-parametrization of B: $B_s(X_s) \leftarrow \theta_s B_s(X_s)$, $B_t(X_t) \leftarrow (\theta_{st}/\theta_s)B_t(X_t)$, $\forall t \in V\setminus\{s\}$. With a slight abuse of notation we redefine $\mathcal{L}_s$ using this re-parametrization as

$\mathcal{L}_s(B; \mathcal{X}^n) = \frac{1}{n}\sum_{i=1}^{n}\Big\{-B_s(X_s^{(i)})\Big(1 + \sum_{t \in V\setminus\{s\}}B_t(X_t^{(i)})\Big) + A(X_{-s}^{(i)}; B)\Big\},$ (6)

where A(X−s; B) is the log partition function. We solve the following optimization problem, which is closely related to the original optimization in Equation (5)

$\min_{B}\ \mathcal{L}_s(B; \mathcal{X}^n) + \lambda_n\sum_{t \in V}\sqrt{\int_{\mathcal{X}_t} B_t(X)^2\,dX} \quad\text{s.t.}\quad \int_{\mathcal{X}_t} B_t(X)\,dX = 0\ \ \forall t \in V.$ (7)

For more details on the relation between (5) and (7), please refer to the Appendix.

Algorithm:

We now present our algorithm for the optimization of (7). In the sequel, for simplicity, we assume that the domains $\mathcal{X}_t$ of the random variables $X_t$ are all the same and equal to $\mathcal{X}$. In order to estimate the functions $B_t$, we expand them over a uniformly bounded, orthonormal basis $\{\phi_k(\cdot)\}_{k=0}^{\infty}$ of $L^2(\mathcal{X})$ with $\phi_0(\cdot) \propto 1$. Expansion of the functions $B_t(\cdot)$ over this basis yields

$B_t(X) = \sum_{k=1}^{m}\alpha_{t,k}\phi_k(X) + \rho_{t,m}(X) \quad\text{where}\quad \rho_{t,m}(X) = \alpha_{t,0}\phi_0(X) + \sum_{k=m+1}^{\infty}\alpha_{t,k}\phi_k(X).$

Note that the constraint $\int_{\mathcal{X}} B_t(X)\,dX = 0$ in Equation (7) translates to $\alpha_{t,0} = 0$. To convert the infinite dimensional optimization problem in (7) into a finite dimensional problem, we truncate the basis expansion to the top m terms and approximate $B_t(\cdot)$ as $\sum_{k=1}^{m}\alpha_{t,k}\phi_k(\cdot)$. The optimization problem in Equation (7) can then be rewritten as

$\min_{\alpha_m}\ \mathcal{L}_{s,m}(\alpha_m; \mathcal{X}^n) + \lambda_n\sum_{t \in V}\|\alpha_{t,m}\|_2,$ (8)

where $\alpha_{t,m} = \{\alpha_{t,k}\}_{k=1}^{m}$, $\alpha_m = \{\alpha_{t,m}\}_{t \in V}$ and $\mathcal{L}_{s,m}$ is defined as

$\mathcal{L}_{s,m}(\alpha_m; \mathcal{X}^n) = \frac{1}{n}\sum_{i=1}^{n}\Big\{-\sum_{k=1}^{m}\alpha_{s,k}\phi_k(X_s^{(i)})\Big(1 + \sum_{t \in V\setminus\{s\}}\sum_{l=1}^{m}\alpha_{t,l}\phi_l(X_t^{(i)})\Big) + A(X_{-s}^{(i)}; \alpha_m)\Big\}.$
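As a concrete aid, here is a minimal sketch (ours, not the authors' code) of a uniformly bounded orthonormal cosine basis on $\mathcal{X} = [-1, 1]$ together with the truncated expansion of $B_t$; the cosine basis mirrors the one used in the experiments of Section 7, but the helper names are illustrative assumptions.

```python
import numpy as np

def cosine_basis(x, m, a=-1.0, b=1.0):
    """Evaluate an orthonormal cosine basis phi_1..phi_m of L2([a, b]) at points x.

    phi_k(x) = sqrt(2/(b-a)) * cos(k*pi*(x-a)/(b-a)) integrates to 0 and has unit
    L2 norm, so the zero-mean constraint alpha_{t,0} = 0 is enforced simply by
    omitting the constant function phi_0.
    """
    x = np.asarray(x, dtype=float)
    k = np.arange(1, m + 1)
    return np.sqrt(2.0 / (b - a)) * np.cos(np.pi * np.outer(x - a, k) / (b - a))  # (len(x), m)

def B_truncated(x, alpha_t, a=-1.0, b=1.0):
    """Truncated sufficient statistic B_t(x) = sum_{k=1}^m alpha_{t,k} phi_k(x)."""
    return cosine_basis(x, len(alpha_t), a, b) @ alpha_t
```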

Iterative minimization of (8):

Note that the objective in (8) is non-convex. In this work, we use a simple alternating minimization technique for its optimization. In this technique, we alternately minimize over $\alpha_{s,m}$ and $\{\alpha_{t,m}\}_{t \in V\setminus\{s\}}$ while fixing the other block of parameters. The resulting optimization problem in each of the alternating steps is convex. We use proximal gradient descent to optimize these sub-problems. To compute the objective and its gradients, we need to numerically evaluate the one-dimensional integrals in the log partition function. To do this, we choose a uniform grid of points over the domain and use quadrature rules to approximate the integrals.
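The following is a rough sketch (our illustration under the assumptions of a common bounded domain, log base measure $C_s = 0$, and a simple Riemann-sum quadrature) of the alternating proximal gradient scheme just described; `phi` can be the `cosine_basis` helper from the previous sketch, and all names are hypothetical.

```python
import numpy as np

def node_loss_and_grad(alpha, X, s, phi, grid, w):
    """Negative node-conditional log-likelihood (8) and its gradient.

    alpha: (p, m) coefficients (row t holds alpha_{t,1..m}).  X: (n, p) data.
    s: conditioned node.  phi: callable x -> (len(x), m) basis matrix.
    grid, w: quadrature nodes and (positive) weights on the common domain.
    """
    n, p = X.shape
    Phi_grid = phi(grid)                                  # basis at quadrature nodes, (G, m)
    B_grid_s = Phi_grid @ alpha[s]                        # B_s on the grid, (G,)
    Phi_X = np.stack([phi(X[:, t]) for t in range(p)])    # (p, n, m)
    B_X = np.einsum('tnm,tm->tn', Phi_X, alpha)           # B_t(X_t^(i)), (p, n)
    E = 1.0 + B_X.sum(axis=0) - B_X[s]                    # natural parameter per sample, (n,)

    # log-partition A^(i) ~ log sum_g w_g exp(B_s(x_g) * E^(i)), and grid responsibilities.
    logits = np.outer(E, B_grid_s) + np.log(w)[None, :]   # (n, G)
    A = np.logaddexp.reduce(logits, axis=1)               # (n,)
    resp = np.exp(logits - A[:, None])                    # conditional density on the grid

    loss = np.mean(-B_X[s] * E + A)
    grad = np.zeros_like(alpha)
    # d/d alpha_s: (E_cond[phi_k] - phi_k(X_s)) * E, averaged over samples.
    grad[s] = ((resp @ Phi_grid - Phi_X[s]) * E[:, None]).mean(axis=0)
    # d/d alpha_t, t != s: phi_l(X_t) * (E_cond[B_s] - B_s(X_s)), averaged over samples.
    EB_s = resp @ B_grid_s
    for t in range(p):
        if t != s:
            grad[t] = (Phi_X[t] * (EB_s - B_X[s])[:, None]).mean(axis=0)
    return loss, grad

def group_soft_threshold(v, tau):
    """Proximal operator of tau * ||v||_2 (group soft-thresholding)."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= tau else (1.0 - tau / nrm) * v

def fit_node(X, s, phi, grid, w, lam, m, n_outer=20, n_inner=50, lr=0.1, seed=0):
    """Alternating proximal gradient descent for problem (8) at node s (a rough sketch)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    alpha = 0.01 * rng.standard_normal((p, m))
    for _ in range(n_outer):
        for block in ('s', 'rest'):                       # alternate over the two convex blocks
            for _ in range(n_inner):
                _, grad = node_loss_and_grad(alpha, X, s, phi, grid, w)
                if block == 's':
                    alpha[s] = group_soft_threshold(alpha[s] - lr * grad[s], lr * lam)
                else:
                    for t in range(p):
                        if t != s:
                            alpha[t] = group_soft_threshold(alpha[t] - lr * grad[t], lr * lam)
    return alpha
```

A typical call would use `grid = np.linspace(-1, 1, 200)` with uniform weights `w = np.full(200, 2 / 200)`; the group soft-thresholding step is the proximal operator of the $\sum_{t}\|\alpha_{t,m}\|_2$ penalty in (8).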

Convergence:

Although (8) is non-convex, we can show that under certain conditions on the objective function, the alternating minimization procedure converges to the global minimum. In recent work, [32] analyze alternating minimization for low rank matrix factorization problems and show that it converges to a global minimum if the sequence of convex sub-problems is strongly convex and satisfies certain other regularity conditions. The analysis of [32] can be extended to show global convergence of alternating minimization for (8).

5. Statistical Properties

In this section we provide parameter estimation error rates for the node conditional estimator in Equation (8). Note that these rates are for the re-parameterized model described in Equation (6) and can be easily translated to guarantees on the original model described in Equations (3), (4).

Notation:

Let $\mathbb{B}_2(x, r) = \{y : \|y - x\|_2 \leq r\}$ be the $\ell_2$ ball with center x and radius r. Let $\{B^*_t(\cdot)\}_{t \in V}$ be the true functions of the re-parametrized model, which we would like to estimate from the data. Denote the basis expansion coefficients of $B_t(\cdot)$ with respect to the orthonormal basis $\{\phi_k(\cdot)\}_{k=0}^{\infty}$ by $\alpha_t$, which is an infinite dimensional vector, and let $\alpha^*_t$ be the coefficients of $B^*_t(\cdot)$. And let $\alpha^*_{t,m}$ be the coefficients corresponding to the top m basis functions in the basis expansion of $B^*_t(\cdot)$. Note that $\int B_t(X)^2\,dX = \|\alpha_t\|_2^2$. Let $\alpha^* = \{\alpha^*_t\}_{t \in V}$ and $\alpha^*_m = \{\alpha^*_{t,m}\}_{t \in V}$. Let $\bar{\mathcal{L}}_{s,m}(\alpha_m) = \mathbb{E}[\mathcal{L}_{s,m}(\alpha_m; \mathcal{X}^n)]$ be the population version of the sample loss defined in Equation (8). We will often omit $\mathcal{X}^n$ from $\mathcal{L}_{s,m}(\alpha_m; \mathcal{X}^n)$ when clear from the context. We let $(\alpha^*_t - \alpha^*_{t,m})$ denote the difference between the infinite dimensional vector $\alpha^*_t$ and the vector obtained by appropriately padding $\alpha^*_{t,m}$ with zeros. Finally, we define the norm $\mathcal{R}(\cdot)$ as $\mathcal{R}(\alpha_m) = \sum_{t \in V}\|\alpha_{t,m}\|_2$ and its dual as $\mathcal{R}^*(\alpha_m) = \sup_{t \in V}\|\alpha_{t,m}\|_2$. The norms on an infinite dimensional vector $\alpha$ are defined similarly.

We now state our key assumption on the loss function $\mathcal{L}_{s,m}(\cdot)$. This assumption imposes a strong curvature condition on $\mathcal{L}_{s,m}$ along certain directions in a ball around $\alpha^*_m$.

Assumption 1. There exist $r_m > 0$ and constants $c, \kappa > 0$ such that for any $\Delta_m \in \mathbb{B}_2(0, r_m)$ the gradient of the sample loss $\mathcal{L}_{s,m}$ satisfies: $\langle\nabla\mathcal{L}_{s,m}(\alpha^*_m + \Delta_m) - \nabla\mathcal{L}_{s,m}(\alpha^*_m), \Delta_m\rangle \geq \kappa\|\Delta_m\|_2^2 - c\sqrt{\frac{m\log p}{n}}\,\mathcal{R}(\Delta_m)$.

Similar assumptions are increasingly common in the analysis of non-convex estimators; see [19] and references therein. We are now ready to state our results which give the parameter estimation error rates, the proofs of which can be found in the Appendix. We first provide a deterministic bound on the error $\|\hat{\alpha}_m - \alpha^*_m\|_2$ in terms of the random quantity $\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m))$. We derive probabilistic results in the subsequent corollaries.

Theorem 2. Let $N_s$ be the true neighborhood of node s, with $|N_s| = d$. Suppose $\mathcal{L}_{s,m}$ satisfies Assumption 1. If the regularization parameter $\lambda_n$ is chosen such that $\lambda_n \geq 2\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m)) + 2c\sqrt{\frac{m\log p}{n}}$, then any stationary point $\hat{\alpha}_m$ of (8) in $\mathbb{B}_2(\alpha^*_m, r_m)$ satisfies:

$\|\hat{\alpha}_m - \alpha^*_m\|_2 \leq \frac{6\sqrt{2d}}{\kappa}\lambda_n.$

We now provide a set of sufficient conditions under which the random quantity $\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m))$ can be bounded.

Assumption 2. There exists a constant L > 0 such that the gradient of the population loss $\bar{\mathcal{L}}_{s,m}$ at $\alpha^*_m$ satisfies: $\mathcal{R}^*(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)) \leq L\,\mathcal{R}^*(\alpha^* - \alpha^*_m)$.

Corollary 3. Suppose the conditions in Theorem 2 are satisfied. Moreover, let $\gamma = \sup_{i \in \mathbb{N},\, X \in \mathcal{X}}|\phi_i(X)|$ and $\tau_m = \sup_{t \in V,\, X \in \mathcal{X}}\big|\sum_{i=1}^{m}\alpha^*_{t,i}\phi_i(X)\big|$. Suppose $\bar{\mathcal{L}}_{s,m}$ satisfies Assumption 2. If the regularization parameter $\lambda_n$ is chosen such that $\lambda_n \geq 2L\,\mathcal{R}^*(\alpha^* - \alpha^*_m) + c\gamma\tau_m\sqrt{\frac{md^2\log p}{n}}$, then with probability at least $1 - 2m/p^2$ any stationary point $\hat{\alpha}_m$ of (8) in $\mathbb{B}_2(\alpha^*_m, r_m)$ satisfies:

$\|\hat{\alpha}_m - \alpha^*_m\|_2 \leq \frac{6\sqrt{2d}}{\kappa}\lambda_n.$

Theorem 2 and Corollary 3 bound the error of the estimated coefficients in the truncated expansion. The approximation error of the truncated expansion itself depends on the function space assumption, as well as the basis chosen, but can be simply combined with the statement of the above corollary to derive the overall error. As an instance, we present a corollary below for the specific case of Sobolev space of order two, and the trigonometric basis.

Corollary 4. Suppose the conditions in Corollary 3 are satisfied. Moreover, suppose the true functions $B^*_t(\cdot)$ lie in a Sobolev space of order two. Let $\{\phi_k\}_{k=0}^{\infty}$ be the trigonometric basis of $L^2(\mathcal{X})$. If the optimization problem (8) is solved with $\lambda_n = c_1(d^2\log p/n)^{2/5}$ and $m = c_2(n/(d^2\log p))^{1/5}$, then with probability at least $1 - 2m/p^2$ any stationary point $\hat{\alpha}_m$ of (8) in $\mathbb{B}_2(\alpha^*_m, r_m)$ satisfies:

$\|\hat{\alpha}_m - \alpha^*\|_2 \leq c_3\Big(\frac{d^{13/4}\log p}{n}\Big)^{2/5},$

where c1, c2, c3 depend on L, κ, γ, τm.

Discussion on Assumption 1:

We now provide a set of sufficient conditions which ensure the restricted strong convexity (RSC) condition. Suppose the population risk $\bar{\mathcal{L}}_{s,m}(\cdot)$ is strongly convex in a ball of radius $r_m$ around $\alpha^*_m$

$\langle\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m + \Delta_m) - \nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m), \Delta_m\rangle \geq \kappa\|\Delta_m\|_2^2 \quad \forall \Delta_m \in \mathbb{B}_2(0, r_m).$ (9)

Moreover, suppose the empirical gradients converge uniformly to the population gradients

$\sup_{\alpha_m \in \mathbb{B}_2(\alpha^*_m, r_m)}\mathcal{R}^*\big(\nabla\mathcal{L}_{s,m}(\alpha_m) - \nabla\bar{\mathcal{L}}_{s,m}(\alpha_m)\big) \leq c\sqrt{\frac{m\log p}{n}}.$ (10)

For example, this condition holds with high probability when the gradient of $\mathcal{L}_{s,m}(\alpha_m)$ w.r.t. $\alpha_{t,m}$, for any $t \in [p]$, is a sub-Gaussian process. Equations (9), (10) are easier to check and ensure that $\mathcal{L}_{s,m}(\alpha_m)$ satisfies the RSC property in Assumption 1.

6. Connections to Exponential Family MRF Copulas

The Expxorcist class of models could be viewed as being closely related to an exponential family MRF [28] copula density. Consider the parametric exponential family MRF joint density in (3): $P_{\mathrm{MRF};\theta}(X) \propto \exp\{\sum_{s \in V}\theta_s B_s(X_s) + \sum_{(s,t) \in E(G)}\theta_{st}B_s(X_s)B_t(X_t) + \sum_{s \in V}C_s(X_s)\}$, where the distribution is indexed by the finite-dimensional parameters $\{\theta_s\}_{s \in V}$, $\{\theta_{st}\}_{(s,t) \in E}$, and where in contrast to the previous sections, we assume we are given the sufficient statistics functions $\{B_s(\cdot)\}_{s \in V}$ as well as the nodewise base measures $\{C_s(\cdot)\}_{s \in V}$. Now consider the following non-parametric problem. Given a random vector X, suppose we are interested in estimating monotonic node-wise functions $\{f_s(X_s)\}_{s \in V}$ such that $(f_1(X_1), \ldots, f_p(X_p))$ follows $P_{\mathrm{MRF};\theta}$ for some θ. Letting $f(X) = (f_1(X_1), \ldots, f_p(X_p))$, we have that the density of $f(X)$ is $P_{\mathrm{MRF};\theta}(f(X))$, so that the density of X can be written as $P(X) \propto P_{\mathrm{MRF};\theta}(f(X))\prod_{s \in V}f_s'(X_s)$. This is now a semi-parametric estimation problem, where the unknowns are the functions $\{f_s(X_s)\}_{s \in V}$ as well as the finite-dimensional parameters θ. To simplify this density, suppose we assume that the given node-wise sufficient statistics are linear, so that $B_s(z) = z$ for all $s \in V$, so that the density reduces to

$P(X) \propto \exp\Big\{\sum_{s \in V}\theta_s f_s(X_s) + \sum_{(s,t) \in E(G)}\theta_{st}f_s(X_s)f_t(X_t) + \sum_{s \in V}\big(C_s(f_s(X_s)) + \log f_s'(X_s)\big)\Big\}.$ (11)

In contrast, the Expxorcist nonparametric exponential family graphical model takes the form

$P(X) \propto \exp\Big\{\sum_{s \in V}\theta_s f_s(X_s) + \sum_{(s,t) \in E(G)}\theta_{st}f_s(X_s)f_t(X_t) + \sum_{s \in V}C_s(X_s)\Big\}.$ (12)

It can be seen that the two densities have very similar forms, except that the density in (11) has a more complex base measure that depends on the unknown functions $\{f_s\}_{s \in V}$ and importantly the functions $\{f_s\}_{s \in V}$ in (11) are monotonic.

The class of densities in (11) can be cast as an exponential family MRF copula density. Suppose we denote the CDF of the parametric exponential family MRF joint density by $F_{\mathrm{MRF};\theta}(X)$, with nodewise marginal CDFs $F_{\mathrm{MRF};\theta,s}(X_s)$. Then the marginal CDF of the density (11) can be written as $F_s(x_s) = P[X_s \leq x_s] = P[f_s(X_s) \leq f_s(x_s)] = F_{\mathrm{MRF};\theta,s}(f_s(x_s))$, so that

$f_s(x_s) = F_{\mathrm{MRF};\theta,s}^{-1}(F_s(x_s)).$ (13)

It then follows that $F(X) = F_{\mathrm{MRF};\theta}\big(F_{\mathrm{MRF};\theta,1}^{-1}(F_1(X_1)), \ldots, F_{\mathrm{MRF};\theta,p}^{-1}(F_p(X_p))\big)$, where $F(X)$ is the CDF of the density (11). By letting $F_{\mathrm{COP};\theta}(U) = F_{\mathrm{MRF};\theta}\big(F_{\mathrm{MRF};\theta,1}^{-1}(U_1), \ldots, F_{\mathrm{MRF};\theta,p}^{-1}(U_p)\big)$ be the exponential family MRF copula function, we see that the CDF of X is precisely $F(X) = F_{\mathrm{COP};\theta}(F_1(X_1), \ldots, F_p(X_p))$, which is specified by the marginal CDFs $\{F_s(X_s)\}_{s \in V}$ and the copula $F_{\mathrm{COP};\theta}$ corresponding to the exponential family MRF density. In other words, the non-parametric extension in (11) of the exponential family MRF densities is precisely an exponential family MRF copula density. This development thus generalizes the non-parametric extension of Gaussian MRF densities via the Gaussian copula nonparanormal densities [17]. The caveats with the copula density, however, are two-fold: the node-wise functions are restricted to be monotonic, and the estimation of these as in (13) requires the estimation of inverses of marginal CDFs of an exponential family MRF, which is intractable in general. Thus, minor differences in the expressions of the Expxorcist density (12) and an exponential family MRF copula density (11) nonetheless have seemingly large consequences for tractable estimation of these densities from data.

7. Experiments

We present experimental results on both synthetic and real datasets. We compare our estimator, Expxorcist, with the Nonparanormal model of [17] and the Gaussian Graphical Model (GGM). We use glasso [7] to estimate the GGM and the two-step estimator of [17] to estimate the Nonparanormal model.

7.1. Synthetic Experiments

Data:

We generated synthetic data from the Expxorcist model with chain and grid graph structures. For both graph structures, we set $\theta_s = 1$, $\forall s \in V$, $\theta_{st} = 1$, $\forall(s, t) \in E$, and fixed the domain $\mathcal{X}$ to $[-1, 1]$. We experimented with two choices for the sufficient statistics $B_s(X)$: $\sin(4\pi X)$ and $\exp(-20(X - 0.5)^2) + \exp(-20(X + 0.5)^2) - 1$, and picked the log base measure $C_s(X)$ to be 0. The grid graph we considered has a $10 \times (p/10)$ structure. We used Gibbs sampling to sample data from these models. We also generated data from a Gaussian distribution with chain and grid graph structures. To generate this data we set the off-diagonal non-zero entries of the inverse covariance matrix to 0.49 for the chain graph and 0.25 for the grid graph, and the diagonal entries to 1.
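For illustration, a rough sketch of such a Gibbs sampler built from the node conditionals in (4) (ours; the paper does not spell out its sampler, and the grid-discretized conditional draw is an assumption):

```python
import numpy as np

def gibbs_sample(theta, B, n_samples, grid, burn_in=500, seed=0):
    """Gibbs sampler for the pairwise model (3) with log base measure C_s = 0.

    theta: (p, p) symmetric matrix with theta[s, s] = theta_s and theta[s, t] = theta_st.
    B: callable x -> B(x), the common sufficient statistic (e.g. lambda x: np.sin(4 * np.pi * x)).
    grid: fine grid over the common domain; each node conditional is discretized on it.
    """
    rng = np.random.default_rng(seed)
    p = theta.shape[0]
    B_grid = B(grid)
    x = rng.uniform(grid.min(), grid.max(), size=p)
    samples = []
    for it in range(burn_in + n_samples):
        for s in range(p):
            # natural parameter of the node conditional: theta_s + sum_t theta_st B(x_t)
            eta = theta[s, s] + sum(theta[s, t] * B(x[t]) for t in range(p) if t != s)
            logp = eta * B_grid
            prob = np.exp(logp - logp.max())
            prob /= prob.sum()
            x[s] = rng.choice(grid, p=prob)
        if it >= burn_in:
            samples.append(x.copy())
    return np.array(samples)
```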

Evaluation Metric:

We compared the performance of Expxorcist against the baselines on graph structure recovery using ROC curves. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) over different choices of the regularization parameter, where TPR is the fraction of correctly detected edges and FPR is the fraction of mis-identified non-edges.

Experiment Settings:

For this experiment we set p = 50 and n ∈ {100, 200, 500}, and varied the regularization parameter λ from $10^{-2}$ to 1. To fit the data to the non-parametric model (3), we used the cosine basis and truncated the basis expansion to the top 30 terms. In practice, one could choose the number of basis functions (m) based on domain knowledge (e.g. “smooth” functions), or in the absence of such knowledge, use hold-out validation or cross validation. Given $\hat{N}(s)$, the estimated neighborhood of node s, we estimated the overall graph structure as $\bigcup_{s \in V}\bigcup_{t \in \hat{N}(s)}\{(s, t)\}$. To reduce the variance in the ROC plots, we averaged results over 10 repetitions.
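For concreteness, a minimal sketch (ours, not the authors' code) of combining estimated neighborhoods into a graph and computing one (TPR, FPR) point for the ROC curve:

```python
import numpy as np

def combine_neighborhoods(neighborhoods, p):
    """OR-combine estimated neighborhoods {N_hat(s)} into a symmetric adjacency matrix."""
    A = np.zeros((p, p), dtype=bool)
    for s, nbrs in enumerate(neighborhoods):
        for t in nbrs:
            A[s, t] = A[t, s] = True
    return A

def tpr_fpr(A_hat, A_true):
    """True/false positive rates over the off-diagonal edge set, for one ROC point."""
    iu = np.triu_indices_from(A_true, k=1)
    est, true = A_hat[iu], A_true[iu]
    tpr = (est & true).sum() / max(true.sum(), 1)
    fpr = (est & ~true).sum() / max((~true).sum(), 1)
    return tpr, fpr
```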

Results:

Figure 1 shows the ROC plots obtained from this experiment. Due to lack of space, we present additional experimental results in the Appendix. It can be seen that Expxorcist has much better performance on non-Gaussian data. On these datasets, even at n = 500 the baselines chose edges essentially at random. This suggests that in the presence of multiple modes and fat tails, Expxorcist is a better model. Expxorcist has slightly worse performance than the baselines on Gaussian data. However, this is expected because it learns a broader family of distributions than the Nonparanormal.

Figure 1:


ROC plots from the synthetic experiments. Top and bottom rows show plots for chain and grid graphs respectively. The left column shows plots for data generated from our non-parametric model with $B_s(X) = \sin(4\pi X)$ and n = 500, the center column shows plots for the other choice of sufficient statistic with n = 500, and the right column shows plots for Gaussian data with n = 200.

7.2. Futures Intraday Data

We now present our analysis of the Futures price returns. This dataset was downloaded from http://www.kibot.com/. We focus on the 26 most liquid instruments traded at the Chicago Mercantile Exchange (CME). The instruments span different sectors like Energy, Agriculture, Currencies, Equity Indices, Metals and Interest Rates. We focus on the hours of maximum liquidity (9am Eastern to 3pm Eastern) and look at the 1 minute price returns. The return distribution is a mixture of 1 minute returns with the overnight return. Since overnight returns tend to be bigger than the 1 minute returns within the day, the return distribution is multimodal and fat-tailed. We treat each instrument as a random variable and the 1 minute returns as independent samples drawn from these random variables. We use the data collected in February 2010 as training data and the data from March 2010 as held-out data for tuning parameter selection. After removing samples with missing entries we are left with 894 training and 650 held-out data samples. We fit Expxorcist and the baselines on this data with the same parameter settings described above. For each of these models, we select the best tuning parameter through the log likelihood on held-out data. However, this criterion resulted in complete graphs for Nonparanormal and GGM (325 edges) and a relatively sparser graph for Expxorcist (168 edges). So for a better comparison of these models, we selected tuning parameters for each of the models such that the resulting graphs have almost the same number of edges. Figure 2 shows the learned graphs for one such choice of tuning parameters, which resulted in ~52 edges in the graphs. Nonparanormal and GGM resulted in very similar graphs, so we only present Nonparanormal here. It can be seen that Expxorcist is able to identify the clusters better than Nonparanormal. More detailed graphs and the comparison with GGM can be found in the Appendix.

Figure 2:


Graph Structures learned for the Futures Intraday Data. The Expxorcist graph shown here was obtained by selecting λ = 0.1. Nodes are colored based on their categories. Edge thickness is proportional to the magnitude of the interaction.

8. Conclusion

In this work we considered the problem of non-parametric density estimation and introduced Expxorcist, a new family of non-parametric graphical models. Our approach relies on a simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form. We proposed an estimator for Expxorcist that is computationally efficient and comes with statistical guarantees. Our empirical results suggest that, in the presence of multiple modes and fat tails in the data, our non-parametric model is a better choice than the Nonparanormal model of [17].

9. Acknowledgement

A.S. and P.R. acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1447574, DMS-1264033, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences. M. K. acknowledges support by an IBM Corporation Faculty Research Fund at the University of Chicago Booth School of Business.

Supplementary material for Nonparametric Graphical Models Via Conditional Exponential Densities

A. Node Conditional Maximum Likelihood Estimation

In this section we derive the relation between optimization problems (5) and (7) defined in Section 4 of the main paper. We start with the optimization problem (5), which is defined in terms of the original parameters of non-parametric graphical model (3):

$\min_{\Theta_s, B}\ \mathcal{L}_s(\Theta_s, B; \mathcal{X}^n) + \lambda_n\|\Theta_s\|_1 \quad\text{s.t.}\quad \theta_s \geq 0,\ \int_{\mathcal{X}_t} B_t(X)\,dX = 0,\ \int_{\mathcal{X}_t} B_t(X)^2\,dX = 1\ \ \forall t \in V,$

where $\mathcal{L}_s(\Theta_s, B; \mathcal{X}^n)$ is the node conditional negative log likelihood at node s:

$\mathcal{L}_s(\Theta_s, B; \mathcal{X}^n) = \frac{1}{n}\sum_{i=1}^{n}\Big\{-B_s(X_s^{(i)})\Big(\theta_s + \sum_{t \in V\setminus\{s\}}\theta_{st}B_t(X_t^{(i)})\Big) + A(X_{-s}^{(i)}; \Theta_s, B)\Big\}.$

Let $\tilde{B}_t(X_t) = \theta_{st}B_t(X_t)$, $\forall t \in V\setminus\{s\}$. Using this parametrization, $\mathcal{L}_s(\cdot)$ can be written as:

$\mathcal{L}_s(\theta_s, B_s, \tilde{B}_{-s}; \mathcal{X}^n) = \frac{1}{n}\sum_{i=1}^{n}\Big\{-B_s(X_s^{(i)})\Big(\theta_s + \sum_{t \in V\setminus\{s\}}\tilde{B}_t(X_t^{(i)})\Big) + A(X_{-s}^{(i)}; \theta_s, B_s, \tilde{B}_{-s})\Big\}.$

Note that, given $\tilde{B}_t$, one can recover $\theta_{st}$ (and thus $B_t$) by computing the $L^2(\mathcal{X}_t)$ norm of $\tilde{B}_t$. Using this re-parametrization the original optimization in Equation (5) can be written as the following equivalent problem:

$\min_{\theta_s, B_s, \tilde{B}_{-s}}\ \mathcal{L}_s(\theta_s, B_s, \tilde{B}_{-s}; \mathcal{X}^n) + \lambda_n\sum_{t \in V\setminus\{s\}}\sqrt{\int_{\mathcal{X}_t}\tilde{B}_t(X)^2\,dX} + \lambda_n\theta_s \quad\text{s.t.}\quad \theta_s \geq 0,\ \int_{\mathcal{X}_s} B_s(X)^2\,dX = 1,\ \int_{\mathcal{X}_t}\tilde{B}_t(X)\,dX = 0\ \ \forall t \in V.$

The above problem still has the equality constraint on the norm of the function $B_s(\cdot)$. As pointed out in the main paper, this makes the above optimization problem a difficult one to solve. To make the optimization more amenable for numerical optimization techniques, we solve a closely related optimization problem. At each node $s \in V$, we consider the following re-parametrization of B: $B_s(X_s) \leftarrow \theta_s B_s(X_s)$, $B_t(X_t) \leftarrow (\theta_{st}/\theta_s)B_t(X_t)$, $\forall t \in V\setminus\{s\}$. With a slight abuse of notation we redefine $\mathcal{L}_s$ using this re-parametrization as:

$\mathcal{L}_s(B; \mathcal{X}^n) = \frac{1}{n}\sum_{i=1}^{n}\Big\{-B_s(X_s^{(i)})\Big(1 + \sum_{t \in V\setminus\{s\}}B_t(X_t^{(i)})\Big) + A(X_{-s}^{(i)}; B)\Big\},$

where A(X−s; B) is the log partition function. We solve the following optimization problem, which is closely related to the original optimization in Equation (5):

$\min_{B}\ \mathcal{L}_s(B; \mathcal{X}^n) + \lambda_n\sum_{t \in V}\sqrt{\int_{\mathcal{X}_t} B_t(X)^2\,dX} \quad\text{s.t.}\quad \int_{\mathcal{X}_t} B_t(X)\,dX = 0\ \ \forall t \in V.$

Note that $\theta_s$ needs to be bounded away from 0 for this re-parametrized optimization problem to give consistent estimates of the true parameters.

B. Statistical Properties

B.1. Proof of Theorem 2

Proof. Define $\mathcal{C} = \{\alpha^*_m + \Delta : \mathcal{R}(\Delta_{N_s^c}) \leq 3\mathcal{R}(\Delta_{N_s})\}$, where $\Delta_{N_s}$ is the sub-vector of $\Delta$ restricted to the coordinates specified by the variables $\{\alpha_{t,m} : t \in N_s \cup \{s\}\}$. Let $\mathcal{F}_{s,m}$ denote the optimization objective in Equation (8):

$\mathcal{F}_{s,m}(\alpha_m; \mathcal{X}^n) = \mathcal{L}_{s,m}(\alpha_m; \mathcal{X}^n) + \lambda_n\mathcal{R}(\alpha_m).$

We prove the theorem in two parts. In the first part we show that $\mathcal{F}_{s,m}$ doesn't have any stationary points in $\mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C}^c$. In the second part we show that $\mathcal{F}_{s,m}$ doesn't have any stationary points in $\mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C} \setminus \mathbb{B}_2(\alpha^*_m, r_s)$, where $r_s = (6\sqrt{2d}/\kappa)\lambda_n$. The proof of the Theorem then follows by combining the results from these two parts.

(a). No stationary points in $\mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C}^c$:

Let $\alpha_m \in \mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C}^c$ and let $\Delta = \alpha_m - \alpha^*_m$. Let $\partial f(x)$ denote the set of sub-gradients of a function $f(\cdot)$ at x. For any $z(\alpha_m) \in \partial\mathcal{F}_{s,m}(\alpha_m)$, where $z(\alpha_m) = \nabla\mathcal{L}_{s,m}(\alpha_m) + \lambda_n v(\alpha_m)$, $v(\alpha_m) \in \partial\mathcal{R}(\alpha_m)$, we have:

$\langle z(\alpha_m), \alpha_m - \alpha^*_m\rangle = \langle z(\alpha_m), \Delta\rangle = \langle\nabla\mathcal{L}_{s,m}(\alpha_m), \Delta\rangle + \lambda_n\langle v(\alpha_m), \Delta\rangle = \langle\nabla\mathcal{L}_{s,m}(\alpha_m) - \nabla\mathcal{L}_{s,m}(\alpha^*_m), \Delta\rangle + \langle\nabla\mathcal{L}_{s,m}(\alpha^*_m), \Delta\rangle + \lambda_n\langle v(\alpha_m), \Delta\rangle.$ (1)

We now bound each of the terms in the RHS of the above equation. From Assumption 1 on the RSC property of the sample loss, we have

$\langle\nabla\mathcal{L}_{s,m}(\alpha_m) - \nabla\mathcal{L}_{s,m}(\alpha^*_m), \Delta\rangle \geq \kappa\|\Delta\|_2^2 - c\sqrt{\tfrac{m\log p}{n}}\,\mathcal{R}(\Delta).$

From the definition of $N_s$ we have $(\alpha^*_m)_{N_s^c} = 0$. So we have

$\langle v(\alpha_m), \Delta\rangle = \langle v(\alpha_m)_{N_s^c}, (\alpha_m)_{N_s^c}\rangle + \langle v(\alpha_m)_{N_s}, \Delta_{N_s}\rangle \geq \mathcal{R}(\Delta_{N_s^c}) - \mathcal{R}(\Delta_{N_s}),$

where the last inequality follows from the properties of the sub-gradient of the norm $\mathcal{R}(\cdot)$. Finally, from the definition of the dual norm $\mathcal{R}^*$, we have:

$\langle\nabla\mathcal{L}_{s,m}(\alpha^*_m), \Delta\rangle \geq -\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m))\,\mathcal{R}(\Delta).$

Substituting these results in Equation (1) we get:

$\langle z(\alpha_m), \alpha_m - \alpha^*_m\rangle \geq \kappa\|\Delta\|_2^2 - \mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m))\,\mathcal{R}(\Delta) - c\sqrt{\tfrac{m\log p}{n}}\,\mathcal{R}(\Delta) + \lambda_n\big(\mathcal{R}(\Delta_{N_s^c}) - \mathcal{R}(\Delta_{N_s})\big).$ (2)

Note that since $\alpha_m \in \mathcal{C}^c$ we have $\mathcal{R}(\Delta_{N_s^c}) > 3\mathcal{R}(\Delta_{N_s})$. Moreover, from our choice of $\lambda_n$ we know that $\lambda_n \geq 2\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m)) + 2c\sqrt{\tfrac{m\log p}{n}}$. Substituting this in the above equation we get:

$\langle z(\alpha_m), \alpha_m - \alpha^*_m\rangle \geq \kappa\|\Delta\|_2^2 + \big(\mathcal{R}(\Delta_{N_s^c}) - 3\mathcal{R}(\Delta_{N_s})\big)\Big(\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m)) + c\sqrt{\tfrac{m\log p}{n}}\Big).$ (3)

The above inequality shows that $\langle z(\alpha_m), \alpha_m - \alpha^*_m\rangle > 0$, $\forall \alpha_m \in \mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C}^c$.

Now suppose $\alpha_m \in \mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C}^c$ is a stationary point. Then from first-order necessary conditions we know that $\langle z(\alpha_m), \alpha_m - \alpha^*_m\rangle \leq 0$. However, this contradicts the result we obtained in Equation (3). This shows that there are no stationary points in $\mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C}^c$.

(b). No stationary points in $\mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C} \setminus \mathbb{B}_2(\alpha^*_m, r_s)$:

The proof of this part follows along the lines of the previous part. Let $\alpha_m \in \mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C} \setminus \mathbb{B}_2(\alpha^*_m, r_s)$. From Equations (1), (2) we know that:

$\langle z(\alpha_m), \alpha_m - \alpha^*_m\rangle \geq \kappa\|\Delta\|_2^2 - \mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m))\,\mathcal{R}(\Delta) - c\sqrt{\tfrac{m\log p}{n}}\,\mathcal{R}(\Delta) + \lambda_n\langle v(\alpha_m), \Delta\rangle.$ (4)

Since $\alpha_m \in \mathcal{C}$ we have $\mathcal{R}(\Delta) \leq 4\sqrt{d+1}\,\|\Delta\|_2 \leq 4\sqrt{2d}\,\|\Delta\|_2$. Substituting this in the above equation we get:

$\langle z(\alpha_m), \alpha_m - \alpha^*_m\rangle \geq \kappa\|\Delta\|_2^2 - \mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m))\,\mathcal{R}(\Delta) - c\sqrt{\tfrac{m\log p}{n}}\,\mathcal{R}(\Delta) - \lambda_n\mathcal{R}(\Delta) \geq \kappa\|\Delta\|_2^2 - \mathcal{R}(\Delta)\Big(\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m)) + c\sqrt{\tfrac{m\log p}{n}} + \lambda_n\Big) \geq \kappa\|\Delta\|_2^2 - 4\sqrt{2d}\,\|\Delta\|_2\Big(\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m)) + c\sqrt{\tfrac{m\log p}{n}} + \lambda_n\Big) \geq \Big(\kappa\|\Delta\|_2 - 4\sqrt{2d}\cdot\tfrac{3}{2}\lambda_n\Big)\|\Delta\|_2 = \Big(\kappa\|\Delta\|_2 - 6\sqrt{2d}\,\lambda_n\Big)\|\Delta\|_2.$ (5)

The above inequality shows that $\langle z(\alpha_m), \alpha_m - \alpha^*_m\rangle > 0$, $\forall \alpha_m \in \mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C} \setminus \mathbb{B}_2(\alpha^*_m, r_s)$. This shows that there are no stationary points in $\mathbb{B}_2(\alpha^*_m, r_m) \cap \mathcal{C} \setminus \mathbb{B}_2(\alpha^*_m, r_s)$.

Following results from parts (a) and (b) we conclude that any stationary point in $\mathbb{B}_2(\alpha^*_m, r_m)$ satisfies $\|\alpha_m - \alpha^*_m\|_2 \leq \frac{6\sqrt{2d}}{\kappa}\lambda_n$.□

B.2. Proof of Corollary 3

Before we proceed to the proof of Corollary 3, we introduce some notation we use in its proof. We say Z is a σ-sub-Gaussian random variable, if it satisfies the following tail property:

$P\big(|Z - \mathbb{E}[Z]| \geq \epsilon\big) \leq 2\exp\Big\{-\frac{\epsilon^2}{2\sigma^2}\Big\}.$

We use the notation $Z \sim SG(\sigma^2)$ to say that a random variable Z is σ-sub-Gaussian. We use the following standard result from concentration theory: if $Z_1, \ldots, Z_n$ are n i.i.d. $SG(\sigma^2)$ random variables, then $\frac{1}{n}\sum_{i=1}^{n} Z_i \sim SG(\sigma^2/n)$.

The following Lemma provides an upper bound for $\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m))$ that holds with high probability.

Lemma 1. Let $N_s$ be the true neighborhood of node s, with $|N_s| = d$. Moreover, let

$\gamma = \sup_{i \in \mathbb{N},\, X \in \mathcal{X}} |\phi_i(X)|$

and

$\tau_m = \sup_{t \in V,\, X \in \mathcal{X}} \Big|\sum_{i=1}^{m} \alpha^*_{t,i}\phi_i(X)\Big|.$

Suppose $\bar{\mathcal{L}}_{s,m}$ satisfies Assumption 2. Then with probability at least $1 - 2m/p^2$ we have:

$\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m)) \leq L\,\mathcal{R}^*(\alpha^* - \alpha^*_m) + c_1\gamma\tau_m\sqrt{\frac{md^2\log p}{n}},$

where c1 > 0 is a constant.

Proof. From the triangle inequality, we have:

$\mathcal{R}^*(\nabla\mathcal{L}_{s,m}(\alpha^*_m)) \leq \mathcal{R}^*\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m) - \nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big) + \mathcal{R}^*\big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big).$

We now upper bound each of the terms in the RHS of the above equation.

(a). Upper bound for $\mathcal{R}^*\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m) - \nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big)$:

The gradient of $\mathcal{L}_{s,m}(\alpha_m)$ at $\alpha^*_m$ is given by:

$\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{s,k} = \frac{1}{n}\sum_{i=1}^{n}\Big\{-\phi_k(X_s^{(i)})\,E_s(X_{-s}^{(i)};\alpha^*_m) + \mathbb{E}_{\alpha^*_m}\big[\phi_k(X_s)\,E_s(X_{-s}^{(i)};\alpha^*_m)\,\big|\,X_{-s}^{(i)}\big]\Big\}$

and for $t \in V\setminus\{s\}$:

$\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{t,k} = \frac{1}{n}\sum_{i=1}^{n}\Big\{-\phi_k(X_t^{(i)})\,B_s(X_s^{(i)};\alpha^*_m) + \mathbb{E}_{\alpha^*_m}\big[\phi_k(X_t^{(i)})\,B_s(X_s;\alpha^*_m)\,\big|\,X_{-s}^{(i)}\big]\Big\},$

where $E_s(X_{-s};\alpha_m) = 1 + \sum_{t \in V\setminus\{s\}}\sum_{j=1}^{m}\alpha_{t,j}\phi_j(X_t)$, $B_s(X_s;\alpha_m) = \sum_{j=1}^{m}\alpha_{s,j}\phi_j(X_s)$, $\mathbb{E}_{\alpha}[\cdot]$ denotes expectation with respect to the density parametrized by α, and $\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{t,k}$ is the gradient of $\mathcal{L}_{s,m}(\alpha_m)$ evaluated at $\alpha^*_m$ with respect to the variable $\alpha_{t,k}$.

We now show that $\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{t,k}$ concentrates well around its expectation. Note that

$\mathbb{E}\big[\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{t,k}\big] = \big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big)_{t,k}.$

Let us first define random variables $\{Y_{s,k}\}_{k=1}^{m}$ and $\{Y_{t,k}\}_{k=1}^{m}$, $t \in V\setminus\{s\}$, as:

$Y_{s,k}(X^{(i)}) = -\phi_k(X_s^{(i)})\,E_s(X_{-s}^{(i)};\alpha^*_m) + \mathbb{E}_{\alpha^*_m}\big[\phi_k(X_s)\,E_s(X_{-s}^{(i)};\alpha^*_m)\,\big|\,X_{-s}^{(i)}\big],$

and

$Y_{t,k}(X^{(i)}) = -\phi_k(X_t^{(i)})\,B_s(X_s^{(i)};\alpha^*_m) + \mathbb{E}_{\alpha^*_m}\big[\phi_k(X_t^{(i)})\,B_s(X_s;\alpha^*_m)\,\big|\,X_{-s}^{(i)}\big].$

To ease the notation, we denote $Y_{s,k}(X^{(i)})$ by $Y_{s,k}^{(i)}$. We now rewrite $\nabla\mathcal{L}_{s,m}(\alpha^*_m)$ in terms of the random variables $\{Y_{s,k}\}_{k=1}^{m}$ and $\{Y_{t,k}\}_{k=1}^{m}$ as follows:

$\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{s,k} = \frac{1}{n}\sum_{i=1}^{n} Y_{s,k}^{(i)}, \qquad \big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{t,k} = \frac{1}{n}\sum_{i=1}^{n} Y_{t,k}^{(i)}.$

For any k ∈ [1, m], i ∈ [1, n] we have:

$|Y_{s,k}^{(i)}| \leq \big|\phi_k(X_s^{(i)})\,E_s(X_{-s}^{(i)};\alpha^*_m)\big| + \big|\mathbb{E}_{\alpha^*_m}\big[\phi_k(X_s)\,E_s(X_{-s}^{(i)};\alpha^*_m)\,\big|\,X_{-s}^{(i)}\big]\big| \leq \gamma\,\big|E_s(X_{-s}^{(i)};\alpha^*_m)\big| + \gamma\,\mathbb{E}_{\alpha^*_m}\big[\big|E_s(X_{-s}^{(i)};\alpha^*_m)\big|\,\big|\,X_{-s}^{(i)}\big],$

where the last inequality follows from Jensen's inequality and the fact that $\gamma = \sup_{j \in \mathbb{N},\, X \in \mathcal{X}}|\phi_j(X)|$.

We now bound $\big|E_s(X_{-s}^{(i)};\alpha^*_m)\big|$:

$\big|E_s(X_{-s}^{(i)};\alpha^*_m)\big| = \Big|1 + \sum_{t \in V\setminus\{s\}}\sum_{j=1}^{m}\alpha^*_{t,j}\phi_j(X_t^{(i)})\Big| \leq 1 + \sum_{t \in V\setminus\{s\}}\Big|\sum_{j=1}^{m}\alpha^*_{t,j}\phi_j(X_t^{(i)})\Big| \leq 1 + d\tau_m \leq d(1 + \tau_m).$

Substituting this in the above equation we get:

$|Y_{s,k}^{(i)}| \leq 2\gamma d(1 + \tau_m).$

Using similar arguments we can show that $|Y_{t,k}^{(i)}| \leq 2\gamma\tau_m$, $\forall t \in V\setminus\{s\}$. This shows that $Y_{s,k}^{(i)} \sim SG\big(4\gamma^2 d^2(1+\tau_m)^2\big)$ and $Y_{t,k}^{(i)} \sim SG\big(4\gamma^2\tau_m^2\big)$, $\forall t \in V\setminus\{s\}$. Since $\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{t,k}$ is a sum of n i.i.d. random variables $\{Y_{t,k}^{(i)}\}_{i=1}^{n}$ we have:

$\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{s,k} \sim SG\Big(\frac{4\gamma^2 d^2(1+\tau_m)^2}{n}\Big), \qquad \big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{t,k} \sim SG\Big(\frac{4\gamma^2\tau_m^2}{n}\Big).$

From the concentration properties of a sub-Gaussian random variable we have:

$P\Big(\big|\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{s,k} - \big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big)_{s,k}\big| \leq \epsilon\Big) \geq 1 - 2\exp\Big\{-\frac{n\epsilon^2}{8\gamma^2(1+\tau_m)^2 d^2}\Big\}$

and

$P\Big(\big|\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{t,k} - \big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big)_{t,k}\big| \leq \epsilon\Big) \geq 1 - 2\exp\Big\{-\frac{n\epsilon^2}{8\gamma^2\tau_m^2}\Big\}, \quad \forall t \in V\setminus\{s\}.$

Now let $\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_t = \big\{\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{t,k}\big\}_{k=1}^{m}$. By a union bound we get:

$P\Big(\big\|\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_s - \big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big)_s\big\|_2 \geq \epsilon\Big) \leq \sum_{k=1}^{m} P\Big(\big|\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_{s,k} - \big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big)_{s,k}\big| \geq \tfrac{\epsilon}{\sqrt{m}}\Big) \leq 2m\exp\Big\{-\frac{n\epsilon^2}{8\gamma^2(1+\tau_m)^2 m d^2}\Big\}$

and

$P\Big(\big\|\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_t - \big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big)_t\big\|_2 \geq \epsilon\Big) \leq 2m\exp\Big\{-\frac{n\epsilon^2}{8\gamma^2\tau_m^2 m}\Big\}.$

By using the union bound again we get:

$P\Big(\mathcal{R}^*\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m) - \nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big) \geq \epsilon\Big) = P\Big(\sup_{t \in V}\big\|\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big)_t - \big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big)_t\big\|_2 \geq \epsilon\Big) \leq 2m\exp\Big\{-\frac{n\epsilon^2}{8\gamma^2(1+\tau_m)^2 m d^2}\Big\} + 2m(p-1)\exp\Big\{-\frac{n\epsilon^2}{8\gamma^2\tau_m^2 m}\Big\} \leq 2mp\exp\Big\{-\frac{n\epsilon^2}{8\gamma^2(1+\tau_m)^2 m d^2}\Big\}.$

Choosing $\epsilon = \sqrt{\frac{24\gamma^2(1+\tau_m)^2 m d^2\log p}{n}}$, we get the following: with probability at least $1 - 2m/p^2$,

$\mathcal{R}^*\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m) - \nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big) \leq \sqrt{\frac{24\gamma^2(1+\tau_m)^2 m d^2\log p}{n}}.$
(b). Upper bound for $\mathcal{R}^*\big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big)$:

From Assumption 2 we have the following upper bound for $\mathcal{R}^*\big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big)$:

$\mathcal{R}^*\big(\nabla\bar{\mathcal{L}}_{s,m}(\alpha^*_m)\big) \leq L\,\mathcal{R}^*(\alpha^*_m - \alpha^*).$

Combining the results from parts (a) and (b) we get the following: with probability at least $1 - 2m/p^2$,

$\mathcal{R}^*\big(\nabla\mathcal{L}_{s,m}(\alpha^*_m)\big) \leq L\,\mathcal{R}^*(\alpha^*_m - \alpha^*) + c_1\gamma\tau_m\sqrt{\frac{md^2\log p}{n}},$

where c1 > 0 is a constant.□

Following the results from Theorem 2 and Lemma 1, we conclude that if the regularization parameter $\lambda_n$ is chosen such that

$\lambda_n \geq 2L\,\mathcal{R}^*(\alpha^*_m - \alpha^*) + c_1\gamma\tau_m\sqrt{\frac{md^2\log p}{n}},$

then with probability at least $1 - 2m/p^2$, any stationary point $\hat{\alpha}_m$ in $\mathbb{B}_2(\alpha^*_m, r_m)$ satisfies

$\|\hat{\alpha}_m - \alpha^*_m\|_2 \leq \frac{6\sqrt{2d}}{\kappa}\lambda_n.$

B.3. Proof of Corollary 4

Let $\hat{\alpha}_m$ be any stationary point in $\mathbb{B}_2(\alpha^*_m, r_m)$. And suppose $\lambda_n$ is chosen such that:

$\lambda_n = 2L\,\mathcal{R}^*(\alpha^*_m - \alpha^*) + c_1\gamma\tau_m\sqrt{\frac{md^2\log p}{n}}.$

In this section we derive bounds for the overall estimation error $\|\hat{\alpha}_m - \alpha^*\|_2$. From the triangle inequality, we have:

$\|\hat{\alpha}_m - \alpha^*\|_2 \leq \|\hat{\alpha}_m - \alpha^*_m\|_2 + \|\alpha^*_m - \alpha^*\|_2.$

From Theorem 2, we have a bound for $\|\hat{\alpha}_m - \alpha^*_m\|_2$. So, here we focus on bounding $\|\alpha^*_m - \alpha^*\|_2$.

Since the true functions $B^*_t(\cdot)$ lie in a Sobolev space of order two, we know that there exists a constant $c_2 > 0$ such that [1]:

$\sup_{t \in V}\|\alpha^*_{t,m} - \alpha^*_t\|_2 \leq \frac{c_2}{m^2}.$

Combining this result with the result from Theorem 2, we get, with probability at least $1 - 2m/p^2$:

$\|\hat{\alpha}_m - \alpha^*\|_2 \leq \|\hat{\alpha}_m - \alpha^*_m\|_2 + \|\alpha^*_m - \alpha^*\|_2 \leq \frac{6\sqrt{2d}}{\kappa}\lambda_n + \frac{c_2\sqrt{d}}{m^2} = \frac{6\sqrt{2d}}{\kappa}\Big(2L\,\mathcal{R}^*(\alpha^*_m - \alpha^*) + c_1\gamma\tau_m\sqrt{\frac{md^2\log p}{n}}\Big) + \frac{c_2\sqrt{d}}{m^2} \leq \frac{6\sqrt{2d}}{\kappa}\Big(\frac{2c_2 L}{m^2} + c_1\gamma\tau_m\sqrt{\frac{md^2\log p}{n}}\Big) + \frac{c_2\sqrt{d}}{m^2} \leq c_3\sqrt{d}\Big[\frac{L}{\kappa}\frac{1}{m^2} + \gamma\tau_m\sqrt{\frac{md^2\log p}{n}}\Big],$

where $c_3 > 0$ is a constant. Choosing $m = \Big(\frac{L}{\kappa\gamma\tau_m}\Big)^{2/5}\Big(\frac{n}{d^2\log p}\Big)^{1/5}$ gives us the following error bound:

$\|\hat{\alpha}_m - \alpha^*\|_2 \leq c_4\Big(\frac{L\gamma^4\tau_m^4}{\kappa}\Big)^{1/5}\Big(\frac{d^{13/4}\log p}{n}\Big)^{2/5},$

and the corresponding $\lambda_n$ for this choice of m is given by:

$\lambda_n = c_5\Big(\frac{L\gamma^4\tau_m^4}{\kappa}\Big)^{1/5}\Big(\frac{d^2\log p}{n}\Big)^{2/5},$

where c4, c5 > 0 are constants.

Figure 1:


ROC plots for data generated from the non-parametric graphical model with $B_s(X) = \sin(4\pi X)$. Top row shows results for the chain graph with $C_s(X) = 0$. Bottom row shows results for the grid graph with $C_s(X) = 0$.

C. Experiments

C.1. Synthetic Data

In this section we present additional results from our synthetic experiments. Specifically, we present ROC plots for n ∈ {100, 200, 500}. We use the same parameter settings as described in Section 7 of the main paper to generate the synthetic data. Figure 1 shows ROC plots for data generated from the non-parametric graphical model with $B_s(X) = \sin(4\pi X)$. Figure 2 shows ROC plots for $B_s(X) = \exp(-20(X - 0.5)^2) + \exp(-20(X + 0.5)^2) - 1$ and Figure 3 shows ROC plots for the Gaussian data.

C.2. Futures Intraday Data

In this section we present the graph learned by GGM for the Futures Intraday data and also present more detailed graphs learned by all three estimators. As pointed out in Section 7, selecting the tuning parameter based on the held-out log likelihood resulted in very dense graphs for Nonparanormal and GGM. So we use a different technique to compare all the models: we fix a tuning parameter for Expxorcist and select tuning parameters for the baselines that result in graphs with the same number of edges. Figure 4 shows the graph structures learned for one such choice of tuning parameters. Figures 5, 7, 6 present more detailed versions of the corresponding graphs in Figure 4.

Figure 2:


ROC plots for data generated from the non-parametric graphical model with $B_s(X) = \exp(-20(X - 0.5)^2) + \exp(-20(X + 0.5)^2) - 1$. Top row shows results for the chain graph with $C_s(X) = 0$. Bottom row shows results for the grid graph with $C_s(X) = 0$.

Figure 3:


ROC plots for data generated from Gaussian distributions. Top row shows plots for chain graph and bottom row shows plots for grid graph.

Figure 4:


Graph Structures learned for the Futures Intraday Data. The Expxorcist graph shown here was obtained by selecting λ = 0.1. Nodes are colored based on their categories. Edge thickness is proportional to the magnitude of the interaction.

Figure 5:


Nonparanormal.

Figure 6:


Expxorcist.

Figure 7:


GGM.

Contributor Information

Arun Sai Suggala, Carnegie Mellon University Pittsburgh, PA 15213.

Mladen Kolar, University of Chicago Chicago, IL 60637.

Pradeep Ravikumar, Carnegie Mellon University Pittsburgh, PA 15213.

References

  • [1].Arnold Barry C., Enrique Castillo, and José María Sarabia. Conditionally specified distributions: an introduction. Stat. Sci, 16(3):249–274, 2001. With comments and a rejoinder by the authors. [Google Scholar]
  • [2].Patrizia Berti, Emanuela Dreassi, and Pietro Rigo. Compatibility results for conditional distributions. J. Multivar. Anal, 125:190–203, 2014. [Google Scholar]
  • [3].Julian Besag. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. B, pages 192–236, 1974. [Google Scholar]
  • [4].Stéphane Canu and Alex Smola. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714–720, March 2006. [Google Scholar]
  • [5].Hua Yun Chen. Compatibility of conditionally specified models. Statist. Probab. Lett, 80(7-8):670–677, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Ronaldo Dias. Density estimation via hybrid splines. J. Statist. Comput. Simulation, 60(4):277–293, 1998. [Google Scholar]
  • [7].Friedman Jerome H., Hastie Trevor J., and Tibshirani Robert J.. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Good IJ and Gaskins RA. Nonparametric roughness penalties for probability densities. Biometrika, 58:255–277, 1971. [Google Scholar]
  • [9].Chong Gu. Smoothing spline density estimation: conditional distribution. Stat. Sinica, 5(2):709–726, 1995. [Google Scholar]
  • [10].Chong Gu, Yongho Jeon, and Yi Lin. Nonparametric density estimation in high-dimensions. Stat. Sinica, 23:1131–1153, 2013. [Google Scholar]
  • [11].Chong Gu and Chunfu Qiu. Smoothing spline density estimation: theory. Ann. Stat, 21(1):217–234, 1993. [Google Scholar]
  • [12].Chong Gu and Jingyuan Wang. Penalized likelihood density estimation: direct cross-validation and scalable approximation. Stat. Sinica, 13(3):811–826, 2003. [Google Scholar]
  • [13].Ali Jalali, Pradeep Ravikumar, Vishvas Vasuki, and Sujay Sanghavi. On learning discrete graphical models using group-sparse regularization. In AISTATS, pages 378–387, 2011. [Google Scholar]
  • [14].Yongho Jeon and Yi Lin. An effective method for high-dimensional log-density anova estimation, with application to nonparametric graphical model building. Stat. Sinica, 16(2):353–374, 2006. [Google Scholar]
  • [15].Tom Leonard. Density estimation, stochastic processes and prior information. J. R. Stat. Soc. B, 40(2):113–146, 1978. With discussion. [Google Scholar]
  • [16].Han Liu, Fang Han, Ming Yuan, John D. Lafferty, and Larry A. Wasserman. High-dimensional semiparametric Gaussian copula graphical models. Ann. Stat, 40(4):2293–2326, 2012. [Google Scholar]
  • [17].Han Liu, Lafferty John D., and Wasserman Larry A.. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res, 10:2295–2328, 2009. [Google Scholar]
  • [18].Mâsse Benoît R. and Truong Young K.. Conditional logspline density estimation. Canad. J. Statist, 27(4):819–832, 1999. [Google Scholar]
  • [19].Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016. [Google Scholar]
  • [20].Pradeep Ravikumar, Martin J Wainwright, Lafferty John D, et al. High-dimensional ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010. [Google Scholar]
  • [21].Silverman BW. On the estimation of a probability density function by the maximum penalized likelihood method. Ann. Stat, 10(3):795–810, 1982. [Google Scholar]
  • [22].Speed TP and Kiiveri HT. Gaussian markov distributions over finite graphs. The Annals of Statistics, pages 138–150, 1986. [Google Scholar]
  • [23].Stone Charles J., Hansen Mark H., Charles Kooperberg, and Truong Young K.. Polynomial splines and their tensor products in extended linear modeling. Ann. Stat, 25(4):1371–1470, 1997. With discussion and a rejoinder by the authors and Jianhua Z. Huang. [Google Scholar]
  • [24].Siqi Sun, Jinbo Xu, and Mladen Kolar. Learning structured densities via infinite dimensional exponential families. In Advances in Neural Information Processing Systems, pages 2287–2295, 2015. [Google Scholar]
  • [25].Cristiano Varin, Nancy Reid, and David Firth. An overview of composite likelihood methods. Stat. Sinica, 21(1):5–42, 2011. [Google Scholar]
  • [26].Arend Voorman, Ali Shojaie, and Witten Daniela M.. Graph estimation with joint additive models. Biometrika, 101(1):85–101, March 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [27].Wang Yuchung J. and Ip Edward H.. Conditionally specified continuous distributions. Biometrika, 95(3):735–746, 2008. [Google Scholar]
  • [28].Eunho Yang, Pradeep Ravikumar, Genevera I Allen, and Zhandong Liu. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16(1):3813–3847, 2015. [PMC free article] [PubMed] [Google Scholar]
  • [29].Zhuoran Yang, Yang Ning, and Han Liu. On semiparametric exponential family graphical models. arXiv preprint arXiv:1412.8697, 2014. [Google Scholar]
  • [30].Xiaotong Yuan, Ping Li, Tong Zhang, Qingshan Liu, and Guangcan Liu. Learning additive exponential family graphical models via ℓ2,1-norm regularized M-estimation. In Advances in Neural Information Processing Systems, pages 4367–4375, 2016. [Google Scholar]
  • [31].Helen Zhang Hao and Yi Lin. Component selection and smoothing for nonparametric regression in exponential families. Stat. Sinica, 16(3):1021–1041, 2006. [Google Scholar]
  • [32].Tuo Zhao, Zhaoran Wang, and Han Liu. Nonconvex low rank matrix factorization via inexact first order oracle. Advances in Neural Information Processing Systems, 2015. [Google Scholar]
  • [1].Bickel P, Diggle P, Feinberg S, Gather U, Olkin I, and Zeger S. Springer series in statistics. 2009. [Google Scholar]
