The Expxorcist: Nonparametric Graphical Models Via Conditional Exponential Densities

Arun Sai Suggala; Mladen Kolar; Pradeep Ravikumar

. Author manuscript; available in PMC: 2019 Mar 11.

Published in final edited form as: Adv Neural Inf Process Syst. 2017 Dec;30:4446–4456.

The Expxorcist: Nonparametric Graphical Models Via Conditional Exponential Densities

Arun Sai Suggala ^1,^*, Mladen Kolar ^2,^†, Pradeep Ravikumar ^3,^‡

PMCID: PMC6411086 NIHMSID: NIHMS979858 PMID: 30867622

Abstract

Non-parametric multivariate density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose near-parametric assumptions on the form of the density functions. In this paper, we leverage recent developments to propose a class of non-parametric models which have very attractive computational and statistical properties. Our approach relies on the simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form.

1. Introduction

Let X = (X₁, …, X_p) be a p-dimensional random vector. Let G = (V, E) be the graph that encodes conditional independence assumptions underlying the distribution of X, that is, each node of the graph corresponds to a component of vector X and (a, b) ∈ E if and only if $X_{a} ⫫̸ X_{b} ∣ X_{¬ab}$ with X_¬ab := {X_c ∣ c ∈ V \{a, b}}. The graphical model represented by G is then the set of distributions over X that satisfy the conditional independence assumptions specified by the graph G.

There has been a considerable line of work on learning parametric families of such graphical model distributions from data [22, 20, 13, 28], where the distribution is indexed by a finite-dimensional parameter vector. The goal of this paper, however, is on specifying and learning nonparametric families of graphical model distributions, indexed by infinite-dimensional parameters, and for which there has been comparatively limited work. Non-parametric multivariate density estimation broadly, even without the graphical model constraint, has not proved as popular in practical machine learning contexts, for both statistical and computational reasons. Loosely, estimating a non-parametric multivariate density, with mild assumptions, typically requires the number of samples to scale exponentially in the dimension p of the data, which is infeasible even in the big-data era when n is very large. And the resulting estimators are typically computationally expensive or intractable, for instance requiring repeated computations of multivariate integrals.

We present a review of multivariate density estimation, that is necessarily incomplete but sets up our proposed approach. A common approach dating back to [15] uses the logistic density transform to satisfy the unity and positivity constraints for densities, and considers densities of the form $f (X) = \frac{exp (η (X))}{\int_{X} exp (η (x)) dx}$ , with some constraints on η for identifiability such as η(X₀) = 0 for some $X_{0} \in X$ or $\int_{X} η (x) dx = 0$ .

With the logistic density transform, differing approaches for non-parametric density estimation can be contrasted in part by their assumptions on the infinite-dimensional function space domain of η(·). An early approach [8] considered function spaces of functions with bounded “roughness” functionals. The predominant line of work however has focused on the setting where η(·) lies in a Reproducing Kernel Hilbert Space (RKHS), dating back to [21]. Consider the estimation of these logistic density transforms η(X) given n i.i.d. samples $X_{n} = {X^{(i)}}_{i = 1}^{n}$ drawn from f_η(X). A natural loss functional is penalized log likelihood, with a penalty functional that ensures a smooth fit with respect to the function space domain: $ℓ (η; X_{n}) ≔ - \frac{1}{n} \sum_{i \in [n]} η (X^{(i)}) + \log \int exp (η (x)) dx + λ pen (η)$ , for functions η(·) that lie in an RKHS $H$ , and where $pen (η) = {‖ η ‖}_{H}^{2}$ is the squared RKHS norm. This was studied by many [21, 11, 6]. A crucial caveat is that the representer theorem for RKHSs does not hold. Nonetheless, one can consider finite-dimensional function space approximations consisting of the linear span of kernel functions evaluated at the sample points [12]. Computationally this still scales poorly with the dimension due to the need to compute multidimensional integrals of the form ∫ exp(η(x)dx which do not, in general, decompose. These approximations also do not come with strong statistical guarantees.

We briefly note that the function space assumption that η(·) lies in an RKHS could also be viewed from the lens of an infinite-dimensional exponential family [4]. Specifically, let $H$ be a Reproducing Kernel Hilbert Space with reproducing kernel k(·, ·), and inner product ${〈 \cdot, \cdot 〉}_{H}$ . Then $η (X) = {〈 θ (\cdot), k (X, \cdot) 〉}_{H}$ , so that the density f (X) can in turn be viewed as a member of an infinite-dimensional exponential family with sufficient statistics $k (X, \cdot) : X \mapsto H$ , and natural parameter $θ (\cdot) \in H$ . Following this viewpoint, [4] propose estimators via linear span approximations similar to [11].

Due to the computational caveat with exact likelihood based functionals, a line of approaches have focused on penalized surrogate likelihoods instead. [14] study the following loss functional: $ℓ (η; X_{n}) ≔ \frac{1}{n} \sum_{i \in [n]} exp (- η (X^{(i)})) + \int η (x) ρ (x) dx + λ pen (η)$ , where ρ(X) is some fixed known density with the same support as the unknown density f (X). While this estimation procedure is much more computationally amenable than minimizing the exact penalized likelihood, the caveat, however, is that for a general RKHS this requires solving higher order integrals. The next level of simplification has thus focused on the form of the logistic transform function itself. There has been a line of work on an ANOVA type decomposition of the logistic density function into node-wise and pairwise terms: $η (X) = \sum_{s = 1}^{p} η_{s} (X_{s}) + \sum_{s = 1}^{p} \sum_{t = s + 1}^{p} η_{st} (X_{s}, X_{t})$ . A line of work has coupled such a decomposition with the assumption that each of the terms lie in an RKHS. This does not immediately provide a computational benefit: with penalized likelihood based loss functionals, the loss functional does not necessarily decompose into such node and pairwise terms. [24] thus couple this ANOVA type pairwise decomposition with a score matching based objective. [10] use the above decomposition with the surrogate loss functional of [14] discussed above, but note that this still requires the aforementioned function space approximation as a linear span of kernel evaluations, as well as two-dimensional integrals.

A line of recent work has thus focused on further stringent assumptions on the density function space, by assuming some components of the logistic transform to be finite-dimensional. [30] use an ANOVA decomposition but assume the terms belong to finite-dimensional function spaces instead of RKHSs, specified by a pre-defined finite set of basis functions. [29] consider logistic transform functions η(·) that have the pairwise decomposition above, with a specific class of parametric pairwise functions β_stX_sX_t, and non-parametric node-wise functions. [17, 16] consider the problem of estimating monotonic node-wise functions such that the transformed random vector is multivariate Gaussian; which could also be viewed as estimating a Gaussian copula density.

To summarize the (necessarily incomplete) review above, non-parametric density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose stringent near-parametric assumptions on the form of the (logistic transform of the) density functions. In this paper, we leverage recent developments to propose a very computationally simple non-parametric density estimation algorithm, that still comes with strong statistical guarantees. Moreover, the density could be viewed as a graphical model distribution, with a corresponding sparse conditional independence graph.

Our approach relies on the following simple function space assumption: that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form. As we show, for there to exist a consistent joint density, the logistic density transform with respect to a particular base measure necessarily decomposes into the following semi-parametric form: $η (X) = \sum_{s = 1}^{p} θ_{s} B_{s} (X_{s}) + \sum_{s = 1}^{p} \sum_{t = s + 1}^{p} θ_{st} B_{s} (X_{s}) B_{t} (X_{t})$ in the pairwise case, with both a parametric component {θ_s : s = 1, …, p}, {θ_st : s < t; s, t = 1, …, p}, as well as non-parametric components {B_s : s = 1, …, p}. We call this class of models the “expxorcist”, following other “ghostbusting” semi-parametric models such as the nonparanormal and nonparanormal skeptic [17, 16].

Since the conditional distributions are exponential families, we show that there exist computationally amenable estimators, even in our more general non-parametric setting, where the sufficient statistics have to be estimated as well. The statistical analysis in our non-parametric setting however is more subtle, due in part to non-convexity and in part to the non-parametric setting. We also show how the Expxorcist class of densities is closely related to a semi-parametric exponential family copula density that generalizes the Gaussian copula density of [17, 16]. We corroborate the applicability of our class of models with experiments on synthetic and real data sets.

2. Multivariate Density Specification via Conditional Densities

We are interested in the approach of estimating a multivariate density by estimating node-conditional densities. Since node-conditional densities focus on the density of a single variable, though conditioned on the rest of the variables, estimating these is potentially a simpler problem, both statistically and computationally, than estimating the entire joint density itself. Let us consider the general non-parametric conditional density estimation problem. Given the general multivariate density $f (X) = \frac{exp (η (X))}{\int_{X} exp (η (x)) dx}$ , the conditional density of a variable X_s given the rest of the variables X_−s is given by $f (X_{s} ∣ X_{- s}) = \frac{exp (η ((X_{s}, X_{- s})))}{\int_{X_{s}} exp (η ((x, X_{- s}))) dx}$ , which does not have a multi-dimensional integral, but otherwise does not have a computationally amenable form. There has been a line of work on such conditional density estimation, mirroring developments in multivariate density estimation [9, 18, 23], but unlike parametric settings, there are no large sample complexity gains with non-parametric conditional density estimation under general settings. There have also been efforts to use ANOVA decompositions in a conditional density context [31, 26].

In addition to computational and sample complexity caveats, recall that in our context, we would like to use conditional density estimates to infer a joint multivariate density. A crucial caveat with using the above estimates to do so is that it is not clear when the estimated node-conditional densities would be consistent with a joint multivariate density. There has been a line of work on this question (of when conditional densities are consistent with a joint density) for parametric densities; see [1] for an overview, with more recent results in [27, 5, 2, 25]. Overall, while estimating node-conditional densities could be viewed as surrogate estimation of a joint density, arbitrary node-conditional distributions need not be consistent in general with any joint density. There has however been a line of work in recent years [3, 28], where it was shown that when the node-conditional distributions belong to an exponential family, then under certain conditions on their parameterization, there do exist multivariate densities consistent with the node-conditional densities. In the next section, we leverage these results towards non-parametric estimation of conditional densities.

3. Conditional Densities of an Exponential Family Form

We first recall the definition of an exponential family in the context of a conditional density.

Definition 1. A conditional density of a random variable $Y \in Y$ given covariates $Z ≔ (Z_{1}, \dots, Z_{m}) \in Z$ is said to have an exponential family form if it can be written as f(Y ∣ Z) = exp(B(Y)^T E(Z) + C(Y) + D(Z)), for some functions $B : Y \mapsto R^{k}$ (for some finite integer k > 0), $E : Z \mapsto R^{k}, C : Y \mapsto R$ and $D : Z \mapsto R$ .

Thus, f (Y ∣ Z) belongs to a finite-dimensional exponential family with sufficient statistics B(Y), base measure exp(C(Y)), and with natural parameter E(Z) and where −D(Z) is the log-partition function. Contrast this with a general conditional density f (Y ∣ Z) = exp(h(Y, Z) + C(Y) + D(Z)) with respect to the base measure exp(C(Y )) and −D(Z) being the log-normalization constant, and it can be seen that a conditional density of the exponential family form has its logistic density transform h(Y, Z) that factorizes as B(Y)^T E(Z).

Consider the case where the sufficient statistic function is real-valued. The non-parametric estimation problem of a conditional density of exponential form then reduces to the estimation of the sufficient statistics function B(·), the exponential natural parameter function E(·), assuming the base measure C(·) is given. But when would such estimated conditional densities be consistent with a joint density? To answer this question, we draw upon developments in [28]. Suppose that the node-conditional distributions of each random variable X_s conditioned on the rest of random variables have the exponential family form as in Definition 1, so that for each s ∈ V

P (X_{s} ∣ X_{- s}) \propto exp {E_{s} (X_{- s}) B_{s} (X_{s}) + C_{s} (X_{s})},

(1)

for some arbitrary functions E_s(·), B_s(·), C_s(·) that specify a valid conditional density. Then [28] show that these node-conditional densities are consistent with a unique joint density over the random vector X, that moreover factors according to a set of cliques $C$ in the graph G, if and only if the functions {E_s(·)}_s∈V specifying the node-conditional distributions have the form $E_{s} (X_{- s}) = θ_{s} + \sum_{C \in C : s \in C} θ_{C} \prod_{t \in C, t \neq s} B_{t} (X_{t})$ , where ${θ_{s}} \cup {θ_{C}}_{C \in C}$ is a set of parameters. Moreover, the corresponding consistent joint distribution has the following form

P (X) \propto exp {\sum_{s \in V} θ_{s} B_{s} (X_{s}) + \sum_{C \in C} θ_{C} \prod_{s \in C} B_{s} (X_{s}) + \sum_{s \in V} C_{s} (X_{s})} .

(2)

In this paper, we are interested in the non-parametric estimation of the Expxorcist class of densities in (2), where we estimate both the finite-dimensional parameters ${θ_{s}} \cup {θ_{C}}_{C \in C}$ , as well as the functions {B_s(X_s)}_s∈V. We assume we are given the base measures {C_s(X_s)}_s∈V, so that the joint density is with respect to a given product base measure ∏_s∈V exp(C_s(X_S)), as is common in the multivariate density estimation literature. Note that this is not a very restrictive assumption. In practice the base measure at each node can be well approximated using the empirical univariate marginal density of that node. We could also extend our algorithm, which we present next, to estimate the base measures along with sufficient statistic functions.

4. Regularized Conditional Likelihood Estimation for Exponential Family Form Densities

We consider the nonparametric estimation problem of estimating a joint density of the form in (2), focusing on the pairwise case where the factors have size at most k = 2, so that the joint density takes the form

P (X) \propto exp {\sum_{s \in V} θ_{s} B_{s} (X_{s}) + \sum_{(s, t) \in E} θ_{st} B_{s} (X_{s}) B_{t} (X_{t}) + \sum_{s \in V} C_{s} (X_{s})} .

(3)

As detailed in the previous section, estimating this joint density can be reduced to estimating its node-conditional densities, which take the form

P (X_{s} ∣ X_{- s}) \propto exp {B_{s} (X_{s}) (θ_{s} + \sum_{t \in N_{G} (s)} θ_{st} B_{t} (X_{t})) + C_{s} (X_{s})} .

(4)

We now introduce some notation which we use in the sequel. Let Θ = {θ_s}_s∈V ∪ {θ_st}_s≠t and Θ_s = θ_s ∪ {θ_st}_t∈V\{s}. Let B = {B_s}_s∈V be the set of sufficient statistics. Let $X_{s}$ be the domain of X_s, which we assume is bounded and $L^{2} (X_{s})$ be the Hilbert space of square integrable functions over $X_{s}$ with respect to Lebesgue measure. We assume that the sufficient statistics $B_{s} (\cdot) \in L^{2} (X_{s})$ .

Note that the model in Equation (3) is unidentifiable. To overcome this issue we impose additional constraints on its parameters. Specifically, we require B_s(X_s) to satisfy $\int_{X_{s}} B_{s} (X) dX = 0$ , $\int_{X_{s}} B_{s} {(X)}^{2} dX = 1$ and θ_s ≥ 0, ∀s ∈ V.

Optimization objective:

Let $X_{n} = {X^{(1)}, \dots X^{(n)}}$ be n i.i.d. samples drawn from a joint density of the form in Equation (3), with parameters Θ*, B*. And let $L_{s} (Θ_{s}, B; X_{n})$ be the node conditional negative log likelihood at node s

L_{s} (Θ_{s}, B; X_{n}) = \frac{1}{n} \sum_{i = 1}^{n} {- B_{s} (X_{s}^{(i)}) (θ_{s} + \sum_{t \in V \ s} θ_{st} B_{t} (X_{t}^{(i)})) + A (X_{- s}^{(i)}; Θ_{s}, B)},

where A(X_−s; Θ_s, B) is the log partition function. To estimate the unknown parameters, we solve the following regularized node conditional log-likelihood estimation problem at each node s ∈ V

\begin{matrix} min_{Θ_{s}, B} L_{s} (Θ_{s}, B; X_{n}) + λ_{n} {‖ Θ_{s} ‖}_{1} \\ s.t. θ_{s} \geq 0, \int_{X_{i}} B_{t} (X) dX = 0 \int_{X_{t}} B_{t} {(X)}^{2} dX = 1 \forall t \in V . \end{matrix}

(5)

The equality constraints on the norm of functions B_t(·) makes the above optimization problem a difficult one to solve. While the norm constraints on B_t(·), ∀t ∈ V \ s can be handled through re-parametrization, the constraint on B_s(·) can not be handled efficiently. To make the optimization more amenable for numerical optimization techniques, we solve a closely related optimization problem. At each node s ∈ V, we consider the following re-parametrization of B: B_s(X_s) ← θ_sB_s(X_s), B_t(X_t) ← (θ_st/θ_s)B_t(X_t), ∀t ∈ V \ {s}. With a slight abuse of notation we redefine $L_{s}$ using this re-parametrization as

L_{s} (B; X_{n}) = \frac{1}{n} \sum_{i = 1}^{n} {- B_{s} (X_{s}^{(i)}) (1 + \sum_{t \in V \ s} B_{t} (X_{t}^{(i)})) + A (X_{- s}^{(i)}; B)},

(6)

where A(X_−s; B) is the log partition function. We solve the following optimization problem, which is closely related to the original optimization in Equation (5)

\begin{matrix} min_{B} L_{s} (B; X_{n}) + λ_{n} \sum_{t \in V} \sqrt{\int_{X_{t}} B_{t} {(X)}^{2} dX} \\ s.t. \int_{X_{t}} B_{t} (X) dX = 0 \forall t \in V . \end{matrix}

(7)

For more details on the relation between (5) and (7), please refer to Appendix.

Algorithm:

We now present our algorithm for optimization of (7). In the sequel, for simplicity, we assume that the domains $X_{t}$ of random variables X_t are all the same and equal to $X$ . In order to estimate functions B_t, we expand them over a uniformly bounded, orthonormal basis { ${ϕ_{k} (\cdot)}_{k = 0}^{\infty}$ of $L^{2} (X)$ with ϕ₀(·) ∝ 1. Expansion of the functions B_t(·) over this basis yields

B_{t} (X) = \sum_{k = 1}^{m} α_{t, k} ϕ_{k} (X) + ρ_{t, m} (X) where ρ_{t, m} (X) = α_{t, 0} ϕ_{0} (X) + \sum_{k = m + 1}^{\infty} α_{t, k} ϕ_{k} (X) .

Note that the constraint $\int_{X} B_{t} (X) dX = 0$ in Equation (7), translates to α_t,0 = 0. To convert the infinite dimensional optimization problem in (7) into a finite dimensional problem, we truncate the basis expansion to the top m terms and approximate B_t(·) as $\sum_{k = 1}^{m} α_{t, k} ϕ_{k} (\cdot)$ . The optimization problem in Equation (7) can then be rewritten as

min_{α_{m}} L_{s, m} (α_{m}; X_{n}) + λ_{n} \sum_{t \in V} {‖ α_{t, m} ‖}_{2},

(8)

where $α_{t, m} = {α_{t, k}}_{k = 1}^{m}$ , α_m = {α_t,m}_t∈V and $L_{s, m}$ is defined as

L_{s, m} (α_{m}; X_{n}) = \frac{1}{n} \sum_{i = 1}^{n} {- \sum_{k = 1}^{m} α_{s, k} ϕ_{k} (X_{s}^{(i)}) (1 + \sum_{t \in V \ {s}} \sum_{l = 1}^{m} α_{t, l} ϕ_{l} (X_{t}^{(i)})) + A (X_{- s}^{(i)}; α_{m})} .

Iterative minimization of (8):

Note that the objective in (8) is non-convex. In this work, we use a simple alternating minimization technique for its optimization. In this technique, we alternately minimize α_s,m, {α_t,m}_t∈V\s while fixing the other parameters. The resulting optimization problem in each of the alternating steps is convex. We use Proximal Gradient Descent to optimize these sub-problems. To compute the objective and its gradients, we need to numerically evaluate the one-dimensional integrals in the log partition function. To do this, we choose a uniform grid of points over the domain and use quadrature rules to approximate the integrals.

Convergence:

Although (8) is non-convex, we can show that under certain conditions on the objective function, the alternating minimization procedure converges to the global minimum. In a recent work [32] analyze alternating minimization for low rank matrix factorization problems and show that it converges to a global minimum if the sequence of convex problems are strongly convex and satisfy certain other regularity condition. The analysis of [32] can be extended to show global convergence of alternating minimization for (8).

5. Statistical Properties

In this section we provide parameter estimation error rates for the node conditional estimator in Equation (8). Note that these rates are for the re-parameterized model described in Equation (6)and can be easily translated to guarantees on the original model described in Equations (3), (4).

Notation:

Let $B_{2} (x, r) = {y : {‖ y - x ‖}_{2} \leq r}$ be the ℓ₂ ball with center x and radius r. Let ${B_{t}^{*} (\cdot)}_{t \in V}$ be the true functions of the re-parametrized model, which we would like to estimate from the data. Denote the basis expansion coefficients of B_t(·) with respect to orthonormal basis ${ϕ_{k} (\cdot)}_{k = 0}^{\infty}$ by α_t, which is an infinite dimensional vector and let $α_{t}^{*}$ be the coefficients of $B_{t}^{*} (\cdot)$ . And let α_t,m be the coefficients corresponding to the top m basis in the basis expansion of B_t(·). Note that $\int B_{t} {(X)}^{2} dX = {‖ α_{t} ‖}_{2}^{2}$ . Let α = {α_t}_t∈V and α_m = {α_t,m}_t∈V. Let ${\overset{‒}{L}}_{s, m} (α_{m}) = E [L_{s, m} (α_{m}; X_{n})]$ be the population version of the sample loss defined in Equation (8). We will often omit $X_{n}$ from $L_{s, m} (α_{m}; X_{n})$ when clear from the context. We let (α_t – α_t,m) be the difference between infinite dimensional vector α_t and the vector obtained by appropriately padding α_t,m with zeros. Finally, we define the norm $R (\cdot)$ as $R (α_{m}) = \sum_{t \in V} {‖ α_{t, m} ‖}_{2}$ and its dual as $R^{*} (α_{m}) = {sup}_{t \in V} {‖ α_{t, m} ‖}_{2}$ . The norms on infinite dimensional vector α are similarly defined.

We now state our key assumption on the loss function $L_{s, m} (\cdot)$ . This assumption imposes strong curvature condition on $L_{s, m}$ along certain directions in a ball around $α_{m}^{*}$ .

Assumption 1. There exists r_m > 0 and constants c, κ > 0 such that for any $Δ_{m} \in B_{2} (0, r_{m})$ the gradient of the sample loss $L_{s, m}$ satisfies: $〈 \nabla L_{s, m} (α_{m}^{*} + Δ_{m}) - \nabla L_{s, m} (α_{m}^{*}), Δ_{m} 〉 \geq κ {‖ Δ_{m} ‖}_{2}^{2} - c \sqrt{\frac{m \log (p)}{n}} R (Δ_{m})$ .

Similar assumptions are increasingly common in analysis of non-convex estimators, see [19] and references therein. We are now ready to state our results which give the parameter estimation error rates, the proofs of which can be found in Appendix. We first provide a deterministic bound on the error ${‖ α_{m} - α_{m}^{*} ‖}_{2}$ in terms of the random quantity $R^{*} (\nabla L_{s, m} (α_{m}^{*}))$ . We derive probabilistic results in the subsequent corollaries.

Theorem 2. Let $N_{s}$ be the true neighborhood of node s, with $∣ N_{s} ∣ = d$ . Suppose $L_{s, m}$ satisfies Assumption 1. If the regularization parameter λ_n is chosen such that $λ_{n} \geq 2 R^{*} (\nabla L_{s, m} (α_{m}^{*})) + 2 c \sqrt{\frac{m \log (p)}{n}}$ , then any stationary point ${\hat{α}}_{m}$ of (8) in $B_{2} (α_{m}^{*}, r_{m})$ satisfies:

{‖ {\hat{α}}_{m} - α_{m}^{*} ‖}_{2} \leq \frac{6 \sqrt{2}}{κ} \sqrt{d} λ_{n} .

We now provide a set of sufficient conditions under which the random quantity $R^{*} (\nabla L_{s, m} (α_{m}^{*}))$ can be bounded.

Assumption 2. There exists a constant L > 0 such that the gradient of the population loss ${\overset{‒}{L}}_{s, m}$ at $α_{m}^{*}$ satisfies: $R^{*} (\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*})) \leq L R^{*} (α^{*} - α_{m}^{*})$ .

Corollary 3. Suppose the conditions in Theorem 2 are satisfied. Moreover, let $γ = {sup}_{i \in N, X \in X} ∣ ϕ_{i} (X) ∣$ and $τ_{m} = {sup}_{t \in V, X \in X} ∣ \sum_{i = 1}^{m} α_{t, i}^{*} ϕ_{i} (X) ∣$ . Suppose $L_{s, m}$ satisfies Assumption 2. If the regularization parameter λ_n is chosen such that $λ_{n} \geq 2 L R^{*} (α^{*} - α_{m}^{*}) + c γ τ_{m} \sqrt{\frac{{md}^{2} \log (p)}{n}}$ , then then with probability at least 1 – 2m/p² any stationary point ${\hat{α}}_{m}$ of (8) in $B_{2} (α_{m}^{*}, r_{m})$ satisfies:

{‖ {\hat{α}}_{m} - α_{m}^{*} ‖}_{2} \leq \frac{6 \sqrt{2}}{κ} \sqrt{d} λ_{n} .

Theorem 2 and Corollary 3 bound the error of the estimated coefficients in the truncated expansion. The approximation error of the truncated expansion itself depends on the function space assumption, as well as the basis chosen, but can be simply combined with the statement of the above corollary to derive the overall error. As an instance, we present a corollary below for the specific case of Sobolev space of order two, and the trigonometric basis.

Corollary 4. Suppose the conditions in Corollary 3 are satisfied. Moreover, suppose the true functions $B_{t}^{*} (\cdot)$ lie in a Sobolev space of order two. Let ${ϕ_{k}}_{k = 0}^{\infty}$ be the trigonometric basis of $L^{2} (X)$ . If the optimization problem (8) is solved with λ_n = c₁(d² log(p)/n)^2/5 and m = c₂(n/d² log(p))^1/5, then with probability at least 1 – 2m/p² any stationary point ${\hat{α}}_{m}$ of (8) in $B_{2} (α_{m}^{*}, r_{m})$ satisfies:

{‖ {\hat{α}}_{m} - α^{*} ‖}_{2} \leq c_{3} {(\frac{d^{13 ∕ 4} \log (p)}{n})}^{2 ∕ 5},

where c₁, c₂, c₃ depend on L, κ, γ, τ_m.

Discussion on Assumption 1:

We now provide a set of sufficient conditions which ensure the restricted strong convexity (RSC) condition. Suppose the population risk ${\overset{‒}{L}}_{s, m} (\cdot)$ is strongly convex in a ball of radius r_m around $α_{m}^{*}$

〈 \nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*} + Δ_{m}) - \nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}), Δ_{m} 〉 \geq κ {‖ Δ_{m} ‖}_{2}^{2} \forall Δ_{m} \in B_{2} (0, r_{m}) .

(9)

Moreover, suppose the empirical gradients converge uniformly to the population gradients

sup_{α_{m} \in B_{2} (α_{m}^{*}, r_{m})} R^{*} (\nabla L_{s, m} (α_{m}) - \nabla {\overset{‒}{L}}_{s, m} (α_{m})) \leq c \sqrt{\frac{m \log p}{n}} .

(10)

For example, this condition holds with high probability when the gradient of $L_{s, m} (α_{m})$ w.r.t α_t,m, for any t ∈ [p] is a sub-Gaussian process. Equations (9),(10) are easier to check and en sure that $L_{s, m} (α_{m})$ satisfies the RSC property in Assumption 1.

6. Connections to Exponential Family MRF Copulas

The Expxorcist class of models could be viewed as being closely related to an exponential family MRF [28] copula density. Consider the parametric exponential family MRF joint density in (3): $P_{MRF; θ} (X) \propto exp {\sum_{s \in V} θ_{s} B_{s} (X_{s}) + \sum_{(s, t) \in E (G)} θ_{st} B_{s} (X_{s}) B_{t} (X_{t}) + \sum_{s \in V} C_{s} (X_{s})}$ , where the distribution is indexed by the finite-dimensional parameters {θ_s}_s∈V, {θ_st}_(s,t)∈E, and where in contrast to the previous sections, we assume we are given the sufficient statistics functions {B_s(·)}_s∈V as well as the nodewise base measures {C_s(·)}_s∈V. Now consider the following non-parametric problem. Given a random vector X, suppose we are interested in estimating monotonic node-wise functions {f_s(X_s)}_s∈V such that (f₁(X₁), …, f_p(X_p)) follows $P_{MRF; θ}$ for some θ. Letting f(X) = (f₁(X₁), …, f_p(X_p)), we have that $P (f (X)) = P_{MRF; θ} (f (X))$ , so that the density of X can be written as $P (X) \propto P (f (X)) \prod_{s \in V} f_{s}^{'} (X_{s})$ . This is now a semi-parametric estimation problem, where the unknowns are the functions {f_s(X_s)}_s∈V as well as the finite-dimensional parameters θ. To simplify this density, suppose we assume that the given node-wise sufficient statistics are linear, so that B_s(z) = z, for all s ∈ V, so that density reduces to

P (X) \propto exp {\sum_{s \in V} θ_{s} f_{s} (X_{s}) + \sum_{(s, t) \in E (G)} θ_{st} f_{s} (X_{s}) f_{t} (X_{t}) + \sum_{s \in V} (C_{s} (f_{s} (X_{s})) + \log f_{s}^{'} (X_{s}))} .

(11)

In contrast, the Expxorcist nonparametric exponential family graphical model takes the form

P (X) \propto exp {\sum_{s \in V} θ_{s} f_{s} (X_{s}) + \sum_{(s, t) \in E (G)} θ_{st} f_{s} (X_{s}) f_{t} (X_{t}) + \sum_{s \in V} C_{s} (X_{s})} .

(12)

It can be seen that the two densities have very similar forms, except that the density in (11) has a more complex base measure that depends on the unknown functions {f_s}_s∈V and importantly the functions {f_s}_s∈V in (11) are monotonic.

The class of densities in (11) can be cast as an exponential family MRF copula density. Suppose we denote the CDF of the parametric exponential family MRF joint density by F_MRF;θ (X), with nodewise marginal CDFs F_MRF;θ,s(X_s). Then the marginal CDF of the density (11) can be written as $F_{s} (x_{s}) = P [X_{s} \leq x_{s}] = P [f_{s} (X_{s}) \leq f_{s} (x_{s})] = F_{MRF; θ, s} (f_{s} (x_{s}))$ , so that

f_{s} (x_{s}) = F_{MRF; θ, s}^{- 1} (F_{s} (x_{s})) .

(13)

It then follows that: $F (X) = F_{MRF; θ} (F_{MRF; θ, 1}^{- 1} (F_{1} (X_{1})), \dots, F_{MRF; θ, p}^{- 1} (F_{p} (X_{p})))$ , where F (X) is the CDF of density (11). By letting $F_{COP; θ} (U) = F_{MRF; θ} (F_{MRF; θ, 1}^{- 1} (U_{1}), \dots, F_{MRF; θ, p}^{- 1} (U_{p}))$ be the exponential family MRF copula density function, we see that the CDF of X is precisely: F (X) = F_COP;θ (F₁(X₁), …, F_p(X_p)), which is specified by the marginal CDFs {F_s(X_s)}_s∈V and the copula density F_COP;θ corresponding to the exponential family MRF density. In other words, the non-parametric extension in (11) of the exponential family MRF densities is precisely an exponential family MRF copula density. This development thus generalizes the non-parametric extension of Gaussian MRF densities via the Gaussian copula nonparanormal densities [17]. The caveats with the copula density however are two-fold: the node-wise functions are restricted to be monotonic, but also the estimation of these as in (13) requires the estimation of inverses of marginal CDFs of an exponential family MRF, which is intractable in general. Thus, minor differences in the expressions of the Expxorcist density (12) and an exponential family MRF copula density (11) nonetheless have seemingly large consequences for tractable estimation of these densities from data.

7. Experiments

We present experimental results on both synthetic and real datasets. We compare our estimator, Expxorcist, with the Nonparanormal model of [17] and Gaussian Graphical Model (GGM). We use glasso [7] to estimate GGM and the two step estimator of [17] to estimate Nonparanormal model.

7.1. Synthetic Experiments

Data:

We generated synthetic data from the Expxorcist model with chain and grid graph structures. For both the graph structures, we set θ_s = 1, ∀s ∈ V, θ_st = 1, ∀(s, t) ∈ E and fix the domain $X$ to [−1, 1]. We experimented with two choices for sufficient statistics B_s(X): sin(4πX) and [exp (−20(X – 0.5)² + exp (−20(X + 0.5)² – 1] and picked the log base measure C_s(X) to be 0. The grid graph we considered has a 10 × (p/10) structure. We used Gibbs sampling to sample data from these models. We also generated data from Gaussian distribution with chain and grid graph structures. To generate this data we set the off diagonal non-zero entries of inverse covariance matrix to 0.49 for chain graph and 0.25 for grid graph and diagonal entries to 1.

Evaluation Metric:

We compared the performance of Expxorcist against baselines, on graph structure recovery, using ROC curves. The ROC curve plots the true positive rate (TPR) against false positive rate (FPR) over different choices of regularization parameter, where TPR is the fraction of correctly detected edges and FPR is the fraction of mis-identified non edges.

Experiment Settings:

For this experiment we set p = 50 and n ∈ {100, 200, 500} and varied the regularization parameter λ from 10−2 to 1. To fit the data to the non parametric model (3), we used cosine basis and truncated the basis expansion to top 30 terms. In practice, one could choose the number of basis (m) based on domain knowledge (e.g. “smooth” functions), or in the absence of which, one could use hold-out validation/cross validation. Given $\hat{N} (s)$ , the estimated neighborhood for node s, we estimated the overall graph structure as: $\cup_{s \in V} \cup_{t \in \hat{N} (s)} {(s, t)}$ . To reduce the variance in the ROC plots, we averaged results over 10 repetitions.

Results:

Figure 1 shows the ROC plots obtained from this experiment. Due to the lack of space, we present more experimental results in Appendix. It can be seen that Expxorcist has much better performance on non-Gaussian data. On these datasets, even at n = 500 the baselines chose edges at random. This suggests that in the presence of multiple modes and fat tails, Expxorcist is a better model. Expxorcist has slightly poor performance than baselines on Gaussian data. However, this is expected because it learns a broader family of distributions than Nonparanormal.

7.2. Futures Intraday Data

We now present our analysis on the Futures price returns. This dataset was downloaded from http://www.kibot.com/. We focus on the Top-26 most liquid instruments being traded at the Chicago Mercantile Exchange (CME). The instruments span different sectors like Energy, Agriculture, Currencies, Equity Indices, Metals and Interest Rates. We focus on the hours of maximum liquidity (9am Eastern to 3pm Eastern) and look at the 1 minute price returns. The return distribution is a mixture of 1 minute returns with the overnight return. Since overnight returns tend to be bigger than the 1 minute return within the day, the return distribution is multimodal and fat-tailed. We treat each instrument as a random variable and the 1 minute returns as independent samples drawn from these random variables. We use the data collected in February 2010 as training data and data from March 2010 as held out data for tuning parameter selection. After removing samples with missing entries we are left with 894 training and 650 held out data samples. We fit Expxorcist and baselines on this data with the same parameter settings described above. For each of these models, we select the best tuning parameter through log likelihood on held out data. However, this criteria resulted in complete graphs for Nonparanormal and GGM (325 edges) and a relatively sparser graph for Expxorcist (168 edges). So for a better comparison of these models, we selected tuning parameters for each of the models such that the resulting graphs have almost the same number of edges. Figure 2 shows the learned graphs for one such choice of tuning parameters, which resulted in ~ 52 edges in the graphs. Nonparanormal and GGM resulted in very similar graphs, so we only present Nonparanormal here. It can be seen that Expxorcist is able to identify the clusters better than Nonparanormal. More detailed graphs and comparison with GGM can be found in Appendix.

Figure 2: — Graph Structures learned for the Futures Intraday Data. The Expxorcist graph shown here was obtained by selecting λ = 0.1. Nodes are colored based on their categories. Edge thickness is proportional to the magnitude of the interaction.

8. Conclusion

In this work we considered the problem of non-parametric density estimation and introduced Expxorcist, a new family of non-parametric graphical models. Our approach relies on a simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form. We proposed an estimator for Expxorcist that is computationally efficient and comes with statistical guarantees. Our empirical results suggest that, in the presence of multiple modes and fat tails in the data, our non-parametric model is a better choice than the Nonparanormal model of [17].

9. Acknowledgement

A.S. and P.R. acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1447574, DMS-1264033, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences. M. K. acknowledges support by an IBM Corporation Faculty Research Fund at the University of Chicago Booth School of Business.

Supplementary material for Nonparametric Graphical Models Via Conditional Exponential Densities

A. Node Conditional Maximum Likelihood Estimation

In this section we derive the relation between optimization problems (5) and (7) defined in Section 4 of the main paper. We start with the optimization problem (5), which is defined in terms of the original parameters of non-parametric graphical model (3):

\begin{matrix} min_{Θ_{s}, B} L_{s} (Θ_{s}, B; X_{n}) + λ_{n} {‖ Θ_{s} ‖}_{1} \\ s.t. θ_{s} \geq 0, \int_{X_{t}} B_{t} (X) dX = 0, \int_{X_{t}} B_{t} {(X)}^{2} dX = 1 \forall t \in V, \end{matrix}

where $L_{s} (Θ_{s}, B; X_{n})$ is the node conditional negative log likelihood at node s:

L_{s} (Θ_{s}, B; X_{n}) = \frac{1}{n} \sum_{i = 1}^{n} {- B_{s} (X_{s}^{(i)}) (θ_{s} + \sum_{t \in V \ s} θ_{st} B_{t} (X_{t}^{(i)})) + A (X_{- s}^{(i)}; θ_{s}, B)} .

Let ${\tilde{B}}_{t} (X_{t}) = θ_{st} B_{t} (X_{t}), \forall t \in V \ {s}$ . Using this parametrization $L_{s} (\cdot)$ can be written as:

L_{s} (θ_{s}, B_{s}, {\tilde{B}}_{- s}; X_{n}) = \frac{1}{n} \sum_{i = 1}^{n} {- B_{s} (X_{s}^{(i)}) (θ_{s} + \sum_{t \in V \ s} {\tilde{B}}_{t} (X_{t}^{(i)})) + A (X_{- s}^{(i)}; Θ_{s}, B_{s}, {\tilde{B}}_{- s})} .

Note that, given ${\tilde{B}}_{t}$ , one can recover θ_st (and thus B_t) by computing the $L^{2} (X_{t})$ norm of ${\tilde{B}}_{t}$ . Using this re-parametrization the original optimization in Equation (5) can be written as the following equivalent problem:

\begin{matrix} min_{θ_{s}, B_{s}, {\tilde{B}}_{- s}} L_{s} (θ_{s}, B_{s}, {\tilde{B}}_{- s}; X_{n}) + λ_{n} \sum_{t \in V \ s} \sqrt{\int_{X_{t}} {\tilde{B}}_{t} {(X)}^{2} dX} + λ_{n} ∣ θ_{s} ∣ \\ s.t. θ_{s} \geq 0, \int_{X_{s}} B_{s} {(X)}^{2} dX = 1, \int_{X_{t}} {\tilde{B}}_{t} (X) dX = 0 \forall t \in V . \end{matrix}

The above problem still has the equality constraint on the norm of functions B_s(·). As pointed out in the main paper, this makes the above optimization problem a difficult one to solve. To make the optimization more amenable for numerical optimization techniques, we solve a closely related optimization problem. At each node s ∈ V, we consider the following re-parametrization of B: B_s(X_s) ← θ_sB_s(X_s), B_t(X_t) (θ_st/θ_s)B_t(X_t)∀t ∈ V \ {s}. With a slight abuse of notation we redefine $L_{s}$ using this re-parametrization as:

L_{s} (B; X_{n}) = \frac{1}{n} \sum_{i = 1}^{n} {- B_{s} (X_{s}^{(i)}) (1 + \sum_{t \in V \ s} B_{t} (X_{t}^{(i)})) + A (X_{- s}^{(i)}; B)},

where A(X_−s; B) is the log partition function. We solve the following optimization problem, which is closely related to the original optimization in Equation (5):

\begin{matrix} min_{B} L_{s} (B; X_{n}) + λ_{n} \sum_{t \in V} \sqrt{\int_{X_{t}} B_{t} {(X)}^{2} dX} \\ s.t. \int_{X_{t}} B_{t} (X) dX = 0 \forall t \in V . \end{matrix}

Note that $θ_{s}^{*}$ need to be bounded away from 0, for this re-parametrized optimization problem to give consistent estimates of the true parameters.

B. Statistical Properties

B.1. Proof of Theorem 2

Proof. Define $C = {α_{m}^{*} + Δ : R (Δ_{N_{s}^{c}}) \leq 3 R (Δ_{N_{s}})}$ , where $Δ_{N_{s}}$ is the sub-vector of Δ restricted to the coordinates specified by variables ${α_{t, m} : t \in N_{s} \cup {s}}$ . Let $F_{s, m}$ denote the optimization objective in Equation (8):

F_{s, m} (α_{m}; X_{n}) = L_{s, m} (α_{m}; X_{n}) + λ_{n} R (α_{m}) .

We prove the theorem in two parts. In the first part we show that $F_{s, m}$ doesn’t have any stationary points in $B_{2} (α_{m}^{*}, r_{m}) \cap C^{c}$ . In the second part we show that $F_{s, m}$ doesn’t have any stationary points in $B_{2} (α_{m}^{*}, r_{m}) \cap C \ B_{2} (α_{m}^{*}, r_{s})$ , where $r_{s} = (6 \sqrt{2} ∕ κ) \sqrt{d} λ_{n}$ . The proof of the Theorem then follows by combining the results from these two parts.

(a). No stationary points in $B_{2} (α_{m}^{*}, r_{m}) \cap C^{c}$ :

Let $α_{m} \in B_{2} (α_{m}^{*}, r_{m}) \cap C^{c}$ and let $Δ = α_{m} - α_{m}^{*}$ . Let ∂f(x) denote the set of sub-gradients of a function f(.) at x. For any $z (α_{m}) \in \partial F_{s, m} (α_{m})$ , where $z (α_{m}) = \nabla L_{s, m} (α_{m}) + λ_{n} v (α_{m})$ , $v (α_{m}) \in \partial R (α_{m})$ , we have:

\begin{matrix} 〈 z (α_{m}), α_{m} - α_{m}^{*} 〉 = & 〈 z (α_{m}), Δ 〉 \\ = & 〈 \nabla L_{s, m} (α_{m}), Δ 〉 + λ_{n} 〈 v (α_{m}), Δ 〉 \\ = & 〈 \nabla L_{s, m} (α_{m}) - \nabla L_{s, m} (α_{m}^{*}), Δ 〉 + 〈 \nabla L_{s, m} (α_{m}^{*}), Δ 〉 + λ_{n} 〈 v (α_{m}), Δ 〉, \end{matrix}

(1)

We now bound each of the terms in the RHS of above equation. From Assumption 1 on RSC property of the sample loss, we have

〈 \nabla L_{s, m} (α_{m}) - \nabla L_{s, m} (α_{m}^{*}), Δ 〉 \geq κ {‖ Δ ‖}_{2}^{2} - c \sqrt{\frac{m \log (p)}{n}} R (Δ) .

From the definition of $N_{s}$ we have ${(α_{m}^{*})}_{N_{s}^{c}} = 0$ . So we have

〈 v (α_{m}), Δ 〉 = 〈 v {(α_{m})}_{N_{s}^{c}}, {(α_{m})}_{N_{s}^{c}} 〉 + 〈 v {(α_{m})}_{N_{s}}, Δ_{N_{s}} 〉 \geq R (Δ_{N_{s}^{c}}) - R (Δ_{N_{s}}),

where the last inequality follows from the properties of sub-gradient of the norm $R (.)$ . Finally, from the definition of dual norm $R^{*}$ , we have:

〈 \nabla L_{s, m} (α_{m}^{*}), Δ 〉 \geq - R^{*} (\nabla L_{s, m} (α_{m}^{*})) R (Δ) .

Substituting these results in Equation (1) we get:

\begin{matrix} 〈 z (α_{m}), α_{m} - α_{m}^{*} 〉 \geq & κ {‖ Δ ‖}_{2}^{2} - R^{*} (\nabla L_{s, m} (α_{m}^{*})) R (Δ) \\ - (c \sqrt{\frac{m \log (p)}{n}}) R (Δ) + λ_{n} (R (Δ_{N_{s}^{c}}) - R (Δ_{N_{s}})) . \end{matrix}

(2)

Note that since $α_{m} \in C^{c}$ we have $R (Δ_{N_{s}^{c}}) \geq 3 R (Δ_{N_{s}})$ . Moreover, from our choice of λ_n we know that $λ_{n} \geq 2 R^{*} (\nabla L_{s, m} (α_{m}^{*})) + 2 c \sqrt{\frac{m \log (p)}{n}}$ . Substituting this in the above equation we get:

〈 z (α_{m}), α_{m} - α_{m}^{*} 〉 \geq κ {‖ Δ ‖}_{2}^{2} + (R (Δ_{N_{s}^{c}}) - 3 R (Δ_{N_{s}})) (R^{*} (\nabla L_{s, m} (α_{m}^{*})) + c \sqrt{\frac{m \log (p)}{n}}) .

(3)

The above inequality shows that $〈 z (α_{m}), α_{m} - α_{m}^{*} 〉 > 0$ , $\forall α_{m} \in B_{2} (α_{m}^{*}, r_{m}) \cap C^{c}$ .

Now suppose $α_{m} \in B_{2} (α_{m}^{*}, r_{m}) \cap C^{c}$ is a stationary point. Then from first-order necessary conditions we know that $〈 z (α_{m}), α_{m}^{*} - α_{m} 〉 \geq 0$ . However, this contradicts the result we obtained in Equation (3). This shows that there are no stationary points in $B_{2} (α_{m}^{*}, r_{m}) \cap C^{c}$ .

(b). No stationary points in $B_{2} (α_{m}^{}, r_{m}) \cap C \ B_{2} (α_{m}^{}, r_{s})$ :

The proof of this part follows along the lines of the previous part. Let $α_{m} \in B_{2} (α_{m}^{*}, r_{m}) \cap C \ B_{2} (α_{m}^{*}, r_{s})$ . From Equations (1), (2) we know that:

\begin{matrix} 〈 z (α_{m}), α_{m} - α_{m}^{*} 〉 \geq & κ {‖ Δ ‖}_{2}^{2} - R^{*} (\nabla L_{s, m} (α_{m}^{*})) R (Δ) \\ - (c \sqrt{\frac{m \log (p)}{n}}) R (Δ) + λ_{n} 〈 v (α_{m}), Δ 〉 . \end{matrix}

(4)

Since $α_{m} \in C$ we have $R (Δ) \leq 4 \sqrt{d + 1} {‖ Δ ‖}_{2} \leq 4 \sqrt{2 d} {‖ Δ ‖}_{2}$ . Substituting this in the above equation we get:

\begin{matrix} 〈 z (α_{m}), α_{m} - α_{m}^{*} 〉 \geq & κ {‖ Δ ‖}_{2}^{2} - R^{*} (\nabla L_{s, m} (α_{m}^{*})) R (Δ) - (c \sqrt{\frac{m \log (p)}{n}}) R (Δ) - λ_{n} R (Δ) \\ \geq & κ {‖ Δ ‖}_{2}^{2} - R (Δ) (R^{*} (\nabla L_{s, m} (α_{m}^{*})) + c \sqrt{\frac{m \log (p)}{n}} + λ_{n}) \\ \geq & κ {‖ Δ ‖}_{2}^{2} - 4 \sqrt{2 d} {‖ Δ ‖}_{2} (R^{*} (L_{s, m} (α_{m}^{*})) + c \sqrt{\frac{m \log (p)}{n}} + λ_{n}) \\ \geq & (κ {‖ Δ ‖}_{2} - 4 \sqrt{2 d} (R^{*} (\nabla L_{s, m} (α_{m}^{*})) + c \sqrt{\frac{m \log (p)}{n}} + λ_{n})) {‖ Δ ‖}_{2} \\ \geq & (κ {‖ Δ ‖}_{2} - 4 \sqrt{2 d} (\frac{3}{2} λ_{n})) {‖ Δ ‖}_{2} . \end{matrix}

(5)

The above inequality shows that $〈 z (α_{m}), α_{m} - α_{m}^{*} 〉 > 0$ , $\forall α_{m} \in B_{2} (α_{m}^{*}, r_{m}) \cap C \ B_{2} (α_{m}^{*}, r_{s})$ . This shows that there are no stationary points in $B_{2} (α_{m}^{*}, r_{m}) \cap C \ B_{2} (α_{m}^{*}, r_{s})$ .

Following results from parts (a) and (b) we conclude that any stationary point in $B_{2} (α_{m}^{*}, r_{m})$ satisfies ${‖ α_{m} - α_{m}^{*} ‖}_{2} \leq \frac{6 \sqrt{2}}{κ} \sqrt{d} λ_{n}$ .□

B.2. Proof of Corollary 3

Before we proceed to the proof of Corollary 3, we introduce some notation we use in its proof. We say Z is a σ–sub-Gaussian random variable, if it satisfies the following tail property:

P (∣ Z - E [Z] ∣ \geq ϵ) \leq 2 exp {- \frac{ϵ^{2}}{2 σ^{2}}} .

We use the notation Z ∈ SG(σ²) to say that a random variable Z is σ–sub-Gaussian. We use the following standard result from concentration theory: if Z₁ … Z_n are n i.i.d SG(σ²) random variables then $\frac{1}{n} \sum_{i = 1}^{n} Z_{i} \in SG (\frac{σ^{2}}{n})$ .

The following Lemma provides an upper bound for $R^{*} (\nabla L_{s, m} (α_{m}^{*}))$ that holds with high probability.

Lemma 1. Let $N_{s}$ be the true neighborhood of node s, with $∣ N_{s} ∣ = d$ . Moreover, let

γ = sup_{i \in N, X \in X} ∣ ϕ_{i} (X) ∣

and

τ_{m} = sup_{t \in V, X \in X} ∣ \sum_{i = 1}^{m} α_{t, i}^{*} ϕ_{i} (X) ∣ .

Suppose $L_{s, m}$ satisfies Assumption 2. Then with probability at least 1 – 2m/p² we have:

R^{*} (\nabla L_{s, m} (α_{m}^{*})) \leq L R^{*} (α^{*} - α_{m}^{*}) + c_{1} γ τ_{m} \sqrt{\frac{{md}^{2} \log (p)}{n}},

where c₁ > 0 is a constant.

Proof. From triangle inequality, we have:

R^{*} (\nabla L_{s, m} (α_{m}^{*})) \leq R^{*} (\nabla L_{s, m} (α_{m}^{*}) - \nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*})) + R^{*} (\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*})) .

We now upper bound each of the terms in the RHS of the above equation.

(a). Upper bound for $R^{} (\nabla L_{s, m} (α_{m}^{}) - \nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}))$ :

The gradient of $L_{s, m} (α_{m})$ at $α_{m}^{*}$ is given by:

{(\nabla L_{s, m} (α_{m}^{*}))}_{s, k} = \frac{1}{n} \sum_{i = 1}^{n} - (ϕ_{k} (X_{s}^{(i)}) E_{- s} (X_{- s}^{(i)}; α_{m}^{*})) + E_{α_{m}^{*}} [ϕ_{k} (X_{s}) E_{- s} (X_{- s}^{(i)}; α_{m}^{*}) ∣ X_{- s}^{(i)}]

and for t ∈ V \ {s}:

{(\nabla L_{s, m} (α_{m}^{*}))}_{t, k} = \frac{1}{n} \sum_{i = 1}^{n} - (ϕ_{k} (X_{t}^{(i)}) B_{s} (X_{s}^{(i)}; α_{m}^{*})) + E_{α_{m}^{*}} [ϕ_{k} (X_{t}^{(i)}) B_{s} (X_{s}; α_{m}^{*}) ∣ X_{- s}^{(i)}],

where $E_{- s} (X_{- s}; α_{m}) = 1 + \sum_{t \in V \ {s}} \sum_{j = 1}^{m} α_{t, j} ϕ_{j} (X_{t})$ and $B_{s} (X_{s}; α_{m}) = \sum_{j = 1}^{m} α_{s, j} ϕ_{j} (X_{s})$ and $E_{α} [.]$ denotes expectation with respect to the density parametrized by α and $(\nabla L_{s, m} {(α_{m}^{*}))}_{t, k}$ is the gradient of $L_{s, m} (α_{m})$ evaluated at $α_{m}^{*}$ with respect to variable α_t,k.

We now show that $(\nabla L_{s, m} {(α_{m}^{*}))}_{t, k}$ concentrates well around its expectation. Note that

E [(\nabla L_{s, m} {(α_{m}^{*}))}_{t, k}] = {(\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}))}_{t, k} .

Lets first define random variable ${Y_{s, m}}_{k = 1}^{m}$ and ${Y_{t, k}}_{k = 1}^{m}, \forall t \in V \ {s}$ as:

Y_{s, k} (X^{(i)}) = - (ϕ_{k} (X_{s}^{(i)}) E_{- s} (X_{- s}^{(i)}; α_{m}^{*})) + E_{α_{m}^{*}} [ϕ_{k} (X_{s}) E_{- s} (X_{- s}^{(i)}; α_{m}^{*}) ∣ X_{- s}^{(i)}],

and

Y_{t, k} (X^{(i)}) = - (ϕ_{k} (X_{t}^{(i)}) B_{s} (X_{s}^{(i)}; α_{m}^{*})) + E_{α_{m}^{*}} [ϕ_{k} (X_{t}^{(i)}) B_{s} (X_{s}; α_{m}^{*}) ∣ X_{- s}^{(i)}] .

To ease the notation, we denote Y_s,k(X⁽ⁱ⁾) by $Y_{s, k}^{(i)}$ . We now rewrite $\nabla L_{s, m} (α_{m}^{*})$ in terms of random variables ${Y_{s, k}}_{k = 1}^{m}$ and ${Y_{t, k}}_{k = 1}^{m}$ as follows:

{(\nabla L_{s, m} (α_{m}^{*}))}_{s, k} = \frac{1}{n} \sum_{i = 1}^{n} Y_{s, k}^{(i)}, {(\nabla L_{s, m} (α_{m}^{*}))}_{t, k} = \frac{1}{n} \sum_{i = 1}^{n} Y_{t, k}^{(i)} .

For any k ∈ [1, m], i ∈ [1, n] we have:

\begin{matrix} ∣ Y_{s, k}^{(i)} ∣ \leq & ∣ (ϕ_{k} (X_{s}^{(i)}) E_{- s} (X_{- s}^{(i)}; α_{m}^{*})) ∣ + ∣ E_{α_{m}^{*}} [ϕ_{k} (X_{s}) E_{- s} (X_{- s}^{(i)}; α_{m}^{*}) ∣ X_{- s}^{(i)}] ∣ \\ \leq & γ ∣ (E_{- s} (X_{- s}^{(i)}; α_{m}^{*})) ∣ + γ E_{α_{m}^{*}} [∣ E_{- s} (X_{- s}^{(i)}; α_{m}^{*}) ∣ ∣ X_{- s}^{(i)}], \end{matrix}

where the last inequality follows from Jensen’s inequality and the fact that $γ = sup_{j \in N, X \in X} ∣ ϕ_{j} (X) ∣$ .

We now bound $∣ (E_{- s} (X_{- s}^{(i)}; α_{m}^{*})) ∣$ :

\begin{matrix} ∣ (E_{- s} (X_{- s}^{(i)}; α_{m}^{*})) ∣ = & ∣ 1 + \sum_{t \in V \ {s}} \sum_{j = 1}^{m} α_{t, j}^{*} ϕ_{j} (X_{t}^{(i)}) ∣ \\ \leq & 1 + \sum_{t \in V \ {s}} ∣ \sum_{j = 1}^{m} α_{t, j}^{*} ϕ_{j} (X_{j}^{(i)}) ∣ \\ \leq & 1 + d τ_{m} \leq d (1 + τ_{m}) . \end{matrix}

Substituting this in the above equation we get:

∣ Y_{s, k}^{(i)} ∣ \leq 2 γ d (1 + τ_{m}) .

Using similar arguments we can show that $∣ Y_{t, k}^{(i)} ∣ \leq 2 γ τ_{m}$ , ∀t ∈ V \ {s}. This shows that $Y_{s, k}^{(i)} \in SG (4 γ^{2} d^{2} {(1 + τ_{m})}^{2})$ and $Y_{t, k}^{(i)} \in SG (4 γ^{2} τ_{m}^{2})$ , ∀t ∈ V \ {s}. Since ${(\nabla L_{s, m} (α_{m}^{*}))}_{t, k}$ is a sum of n i.i.d random variables ${Y_{t, k}^{(i)}}_{i = 1}^{n}$ we have:

{(\nabla L_{s, m} (α_{m}^{*}))}_{s, k} \in SG (\frac{4 γ^{2} d^{2} {(1 + τ_{m})}^{2}}{n}), {(\nabla L_{s, m} (α_{m}^{*}))}_{t, k} \in SG (\frac{4 γ^{2} τ_{m}^{2}}{n}) .

From the concentration properties of a sub-Gaussian random variable we have:

P (∣ {(\nabla L_{s, m} (α_{m}^{*}))}_{s, k} - {(\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}))}_{s, k} ∣ \leq ϵ) \geq 1 - 2 exp {- \frac{n ϵ^{2}}{8 γ^{2} {(1 + τ_{m})}^{2} d^{2}}} .

and

P (∣ {(\nabla L_{s, m} (α_{m}^{*}))}_{t, k} - {(\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}))}_{t, k} ∣ \leq ϵ) \geq 1 - 2 exp {- \frac{n ϵ^{2}}{8 γ^{2} τ_{m}^{2}}}, \forall t \in V \ {s} .

Now let ${(\nabla L_{s, m} (α_{m}^{*}))}_{t} = {{(\nabla L_{s, m} (α_{m}^{*}))}_{t, k}}_{k = 1}^{m}$ . By union bound we get:

\begin{matrix} P ({‖ {(\nabla L_{s, m} (α_{m}^{*}))}_{s} - {(\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}))}_{s} ‖}_{2} \geq ϵ) \\ \leq \sum_{k = 1}^{m} P (∣ {(\nabla L_{s, m} (α_{m}^{*}))}_{s, k} - {(\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}))}_{s, k} ∣ \geq \frac{ϵ}{\sqrt{m}}) \\ \leq 2 m exp {- \frac{n ϵ^{2}}{8 γ^{2} {(1 + τ_{m})}^{2} {md}^{2}}} \end{matrix}

and

P ({‖ {(\nabla L_{s, m} (α_{m}^{*}))}_{t} - {(\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}))}_{t} ‖}_{2} \geq ϵ) \leq 2 m exp {- \frac{n ϵ^{2}}{8 γ^{2} τ_{m}^{2} m}} .

By using union bound again we get:

\begin{matrix} P (R^{*} (\nabla L_{s, m} (α_{m}^{*}) - \nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*})) \geq ϵ) = & P (sup_{t \in V} ‖ {(\nabla L_{s, m} (α_{m}^{*}))}_{t} - {(\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}))}_{t} ‖ \geq ϵ) \\ \leq & 2 m exp {- \frac{n ϵ^{2}}{8 γ^{2} {(1 + τ_{m})}^{2} {md}^{2}}} \\ + 2 m (p - 1) exp {- \frac{n ϵ^{2}}{8 γ^{2} τ_{m}^{2} m}} \\ \leq & 2 mp exp {- \frac{n ϵ^{2}}{8 γ^{2} {(1 + τ_{m})}^{2} {md}^{2}}} . \end{matrix}

Choosing $ϵ = \sqrt{24 γ^{2} {(1 + τ_{m})}^{2} {md}^{2} \log (p) ∕ n}$ , we get the following: with probability at least

R^{*} (\nabla L_{s, m} (α_{m}^{*}) - \nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*})) \leq \sqrt{24 γ^{2} {(1 + τ_{m})}^{2} {md}^{2} \log (p) ∕ n} .

(b). Upper bound for $R^{} (\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{}))$ :

From Assumption 2 we have the following upper bound for $R^{*} (\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}))$ :

R^{*} (\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*})) \leq L R^{*} (α_{m}^{*} - α^{*}) .

Combining the results from parts (a) and (b) we get the following: with probability at least 1–2m/p²

R^{*} (\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*})) \leq L R^{*} (α_{m}^{*} - α^{*}) + c_{1} γ τ_{m} \sqrt{{md}^{2} \log (p) ∕ n},

where c₁ > 0 is a constant.□

Following results from Theorem 2, Lemma 1, we conclude that if the regularization parameter λ_n is chosen such that

λ_{n} \geq 2 L R^{*} (α_{m}^{*} - α^{*}) + c_{1} γ τ_{m} \sqrt{{md}^{2} \log (p) ∕ n},

then with probability at least 1 – 2m/^p2, any stationary point α_m in $B_{2} (α_{m}^{*}, r_{m})$ satisfies

{‖ α_{m} - α_{m}^{*} ‖}_{2} \leq \frac{6 \sqrt{2}}{κ} \sqrt{d} λ_{n} .

B.3. Proof of Corollary 4

Let α_m be any stationary point in $B_{2} (α_{m}^{*}, r_{m})$ . And suppose λ_n is chosen such that:

λ_{n} = 2 L R^{*} (α_{m}^{*} - α^{*}) + c_{1} γ τ_{m} \sqrt{{md}^{2} \log (p) ∕ n} .

In this section we derive bounds for the overall estimation error ∥α_m – α*∥₂. From triangle inequality, we have:

{‖ α_{m} - α^{*} ‖}_{2} \leq {‖ α_{m} - α_{m}^{*} ‖}_{2} + {‖ α_{m}^{*} - α^{*} ‖}_{2} .

From Theorem 2, we have a bound for ${‖ α_{m} - α_{m}^{*} ‖}_{2}$ . So, here we focus on bounding ${‖ α_{m}^{*} - α^{*} ‖}_{2}$ .

Since the true parameters $B_{t}^{*} (.)$ lie in a Sobolev space of order two, we know that there exists a constant c₂ > 0 such that [1]:

sup_{t \in V} {‖ α_{t, m}^{*} - α_{t}^{*} ‖}_{2} \leq \frac{c_{2}}{m^{2}} .

Combining this result with the result from Theorem 2, we get, with probability at least 1 – 2m/p²:

\begin{matrix} {‖ α_{m} - α^{*} ‖}_{2} \leq & {‖ α_{m} - α_{m}^{*} ‖}_{2} + {‖ α_{m}^{*} - α^{*} ‖}_{2} \\ \leq & \frac{6 \sqrt{2}}{κ} \sqrt{d} λ_{n} + \frac{c_{2} \sqrt{d}}{m^{2}} \\ = & \frac{6 \sqrt{2}}{κ} \sqrt{d} (2 L R^{*} (α_{m}^{*} - α^{*}) + c_{1} γ τ_{m} \sqrt{{md}^{2} \log (p) ∕ n}) + \frac{c_{2} \sqrt{d}}{m^{2}} \\ \leq & \frac{6 \sqrt{2}}{κ} \sqrt{d} (\frac{2 c_{2} L}{m^{2}} + c_{1} γ τ_{m} \sqrt{{md}^{2} \log (p) ∕ n}) + \frac{c_{2} \sqrt{d}}{m^{2}} \\ \leq & c_{3} \sqrt{d} [\frac{L}{κ} \frac{1}{m^{2}} + γ τ_{m} \sqrt{{md}^{2} \log (p) ∕ n}], \end{matrix}

where c₃ > 0 is a constant. Choosing $m = {(\frac{L}{κ γ τ_{m}})}^{2 ∕ 5} {(\frac{n}{d^{2} \log (p)})}^{1 ∕ 5}$ gives us the following error bound:

{‖ α_{m} - α^{*} ‖}_{2} \leq c_{4} {(\frac{L γ^{4} τ_{m}^{4}}{κ})}^{1 ∕ 5} {(\frac{d^{13 ∕ 4} \log (p)}{n})}^{2 ∕ 5},

and the corresponding λ_n for this choice of m is given by:

λ_{n} = c_{5} {(\frac{L γ^{4} τ_{m}^{4}}{κ})}^{1 ∕ 5} {(\frac{d^{2} \log (p)}{n})}^{2 ∕ 5},

where c₄, c₅ > 0 are constants.

C. Experiments

C.1. Synthetic Data

In this section we present additional results from our synthetic experiments. Specifically, we present ROC plots for n ∈ {100, 200, 500}. We use the same parameter settings as described in Section 7 of the main paper, to generate synthetic data. Figure 1 shows ROC plots for data generated from non-parametric graphical model with B_s(X) = sin(4πX). Figure 2 shows ROC plots for B_s(X) = [exp (−20(X – 0.5)²) + exp (−20(X + 0.5)²) – 1] and Figure 3 shows ROC plots for Gaussian data.

C.2. Futures Intraday Data

In this section we present the graph learned by GGM for Futures Intraday data and also present more detailed graphs learned by all the three estimators. As pointed out in Section 7, selecting tuning parameter based on held out log likelihood resulted in very dense graphs for Nonparanormal and GGM. So we use a different technique to compare all the models. We fix a tuning parameter for Expxorcist and select tuning parameters for the baselines that resulted in graphs with same number of edges. Figure 4 shows the graph structures learned for one such choice of tuning parameters. Figures 5, 7, 6 present more detailed graphs for the corresponding graphs in Figure 4.

ROC plots for data generated from non-parametric graphical model with *B_s*(X) = [exp (−20(X – 0.5)²) + exp (−20(X + 0.5)²) – 1]. Top row shows results for chain graph with *C_s*(X) = 0. Bottom row shows results for grid graph with *C_s*(X) = 0.

Figure 3: — ROC plots for data generated from Gaussian distributions. Top row shows plots for chain graph and bottom row shows plots for grid graph.

Figure 4: — Graph Structures learned for the Futures Intraday Data. The Expxorcist graph shown here was obtained by selecting λ = 0.1. Nodes are colored based on their categories. Edge thickness is proportional to the magnitude of the interaction.

Contributor Information

Arun Sai Suggala, Carnegie Mellon University Pittsburgh, PA 15213.

Mladen Kolar, University of Chicago Chicago, IL 60637.

Pradeep Ravikumar, Carnegie Mellon University Pittsburgh, PA 15213.

References

[1].Arnold Barry C., Enrique Castillo, and José María Sarabia. Conditionally specified distributions: an introduction. Stat. Sci, 16(3):249–274, 2001. With comments and a rejoinder by the authors. [Google Scholar]
[2].Patrizia Berti, Emanuela Dreassi, and Pietro Rigo. Compatibility results for conditional distributions. J. Multivar. Anal, 125:190–203, 2014. [Google Scholar]
[3].Julian Besag. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. B, pages 192–236, 1974. [Google Scholar]
[4].Stéphane Canu and Alex Smola. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714–720, March 2006. [Google Scholar]
[5].Hua Yun Chen. Compatibility of conditionally specified models. Statist. Probab. Lett, 80(7-8):670–677, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
[6].Ronaldo Dias. Density estimation via hybrid splines. J. Statist. Comput. Simulation, 60(4):277–293, 1998. [Google Scholar]
[7].Friedman Jerome H., Hastie Trevor J., and Tibshirani Robert J.. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
[8].Good IJ and Gaskins RA. Nonparametric roughness penalties for probability densities. Biometrika, 58:255–277, 1971. [Google Scholar]
[9].Chong Gu. Smoothing spline density estimation: conditional distribution. Stat. Sinica, 5(2):709–726, 1995. [Google Scholar]
[10].Chong Gu, Yongho Jeon, and Yi Lin. Nonparametric density estimation in high-dimensions. Stat. Sinica, 23:1131–1153, 2013. [Google Scholar]
[11].Chong Gu and Chunfu Qiu. Smoothing spline density estimation: theory. Ann. Stat, 21(1):217–234, 1993. [Google Scholar]
[12].Chong Gu and Jingyuan Wang. Penalized likelihood density estimation: direct cross-validation and scalable approximation. Stat. Sinica, 13(3):811–826, 2003. [Google Scholar]
[13].Ali Jalali, Pradeep Ravikumar, Vishvas Vasuki, and Sujay Sanghavi. On learning discrete graphical models using group-sparse regularization. In AISTATS, pages 378–387, 2011. [Google Scholar]
[14].Yongho Jeon and Yi Lin. An effective method for high-dimensional log-density anova estimation, with application to nonparametric graphical model building. Stat. Sinica, 16(2):353–374, 2006. [Google Scholar]
[15].Tom Leonard. Density estimation, stochastic processes and prior information. J. R. Stat. Soc. B, 40(2):113–146, 1978. With discussion. [Google Scholar]
[16].Han Liu, Fang Han, Ming Yuan, John D. Lafferty, and Larry A. Wasserman. High-dimensional semiparametric Gaussian copula graphical models. Ann. Stat, 40(4):2293–2326, 2012. [Google Scholar]
[17].Han Liu, Lafferty John D., and Wasserman Larry A.. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res, 10:2295–2328, 2009. [Google Scholar]
[18].Mâsse Benoît R. and Truong Young K.. Conditional logspline density estimation. Canad. J. Statist, 27(4):819–832, 1999. [Google Scholar]
[19].Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016. [Google Scholar]
[20].Pradeep Ravikumar, Martin J Wainwright, Lafferty John D, et al. High-dimensional ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010. [Google Scholar]
[21].Silverman BW. On the estimation of a probability density function by the maximum penalized likelihood method. Ann. Stat, 10(3):795–810, 1982. [Google Scholar]
[22].Speed TP and Kiiveri HT. Gaussian markov distributions over finite graphs. The Annals of Statistics, pages 138–150, 1986. [Google Scholar]
[23].Stone Charles J., Hansen Mark H., Charles Kooperberg, and Truong Young K.. Polynomial splines and their tensor products in extended linear modeling. Ann. Stat, 25(4):1371–1470, 1997. With discussion and a rejoinder by the authors and Jianhua Z. Huang. [Google Scholar]
[24].Siqi Sun, Jinbo Xu, and Mladen Kolar. Learning structured densities via infinite dimensional exponential families. In Advances in Neural Information Processing Systems, pages 2287–2295, 2015. [Google Scholar]
[25].Cristiano Varin, Nancy Reid, and David Firth. An overview of composite likelihood methods. Stat. Sinica, 21(1):5–42, 2011. [Google Scholar]
[26].Arend Voorman, Ali Shojaie, and Witten Daniela M.. Graph estimation with joint additive models. Biometrika, 101(1):85–101, March 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
[27].Wang Yuchung J. and Ip Edward H.. Conditionally specified continuous distributions. Biometrika, 95(3):735–746, 2008. [Google Scholar]
[28].Eunho Yang, Pradeep Ravikumar, Genevera I Allen, and Zhandong Liu. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16(1):3813–3847, 2015. [PMC free article] [PubMed] [Google Scholar]
[29].Zhuoran Yang, Yang Ning, and Han Liu. On semiparametric exponential family graphical models. arXiv preprint arXiv:1412.8697, 2014. [Google Scholar]
[30].Xiaotong Yuan, Ping Li, Tong Zhang, Qingshan Liu, and Guangcan Liu. Learning additive exponential family graphical models via ℓ_{2, 1}-norm regularized m-estimation. In Advances in Neural Information Processing Systems, pages 4367–4375, 2016. [Google Scholar]
[31].Helen Zhang Hao and Yi Lin. Component selection and smoothing for nonparametric regression in exponential families. Stat. Sinica, 16(3):1021–1041, 2006. [Google Scholar]
[32].Tuo Zhao, Zhaoran Wang, and Han Liu. Nonconvex low rank matrix factorization via inexact first order oracle. Advances in Neural Information Processing Systems, 2015. [Google Scholar]
[1].Bickel P, Diggle P, Feinberg S, Gather U, Olkin I, and Zeger S. Springer series in statistics. 2009. [Google Scholar]

[R1] [1].Arnold Barry C., Enrique Castillo, and José María Sarabia. Conditionally specified distributions: an introduction. Stat. Sci, 16(3):249–274, 2001. With comments and a rejoinder by the authors. [Google Scholar]

[R2] [2].Patrizia Berti, Emanuela Dreassi, and Pietro Rigo. Compatibility results for conditional distributions. J. Multivar. Anal, 125:190–203, 2014. [Google Scholar]

[R3] [3].Julian Besag. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. B, pages 192–236, 1974. [Google Scholar]

[R4] [4].Stéphane Canu and Alex Smola. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714–720, March 2006. [Google Scholar]

[R5] [5].Hua Yun Chen. Compatibility of conditionally specified models. Statist. Probab. Lett, 80(7-8):670–677, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] [6].Ronaldo Dias. Density estimation via hybrid splines. J. Statist. Comput. Simulation, 60(4):277–293, 1998. [Google Scholar]

[R7] [7].Friedman Jerome H., Hastie Trevor J., and Tibshirani Robert J.. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] [8].Good IJ and Gaskins RA. Nonparametric roughness penalties for probability densities. Biometrika, 58:255–277, 1971. [Google Scholar]

[R9] [9].Chong Gu. Smoothing spline density estimation: conditional distribution. Stat. Sinica, 5(2):709–726, 1995. [Google Scholar]

[R10] [10].Chong Gu, Yongho Jeon, and Yi Lin. Nonparametric density estimation in high-dimensions. Stat. Sinica, 23:1131–1153, 2013. [Google Scholar]

[R11] [11].Chong Gu and Chunfu Qiu. Smoothing spline density estimation: theory. Ann. Stat, 21(1):217–234, 1993. [Google Scholar]

[R12] [12].Chong Gu and Jingyuan Wang. Penalized likelihood density estimation: direct cross-validation and scalable approximation. Stat. Sinica, 13(3):811–826, 2003. [Google Scholar]

[R13] [13].Ali Jalali, Pradeep Ravikumar, Vishvas Vasuki, and Sujay Sanghavi. On learning discrete graphical models using group-sparse regularization. In AISTATS, pages 378–387, 2011. [Google Scholar]

[R14] [14].Yongho Jeon and Yi Lin. An effective method for high-dimensional log-density anova estimation, with application to nonparametric graphical model building. Stat. Sinica, 16(2):353–374, 2006. [Google Scholar]

[R15] [15].Tom Leonard. Density estimation, stochastic processes and prior information. J. R. Stat. Soc. B, 40(2):113–146, 1978. With discussion. [Google Scholar]

[R16] [16].Han Liu, Fang Han, Ming Yuan, John D. Lafferty, and Larry A. Wasserman. High-dimensional semiparametric Gaussian copula graphical models. Ann. Stat, 40(4):2293–2326, 2012. [Google Scholar]

[R17] [17].Han Liu, Lafferty John D., and Wasserman Larry A.. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res, 10:2295–2328, 2009. [Google Scholar]

[R18] [18].Mâsse Benoît R. and Truong Young K.. Conditional logspline density estimation. Canad. J. Statist, 27(4):819–832, 1999. [Google Scholar]

[R19] [19].Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016. [Google Scholar]

[R20] [20].Pradeep Ravikumar, Martin J Wainwright, Lafferty John D, et al. High-dimensional ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010. [Google Scholar]

[R21] [21].Silverman BW. On the estimation of a probability density function by the maximum penalized likelihood method. Ann. Stat, 10(3):795–810, 1982. [Google Scholar]

[R22] [22].Speed TP and Kiiveri HT. Gaussian markov distributions over finite graphs. The Annals of Statistics, pages 138–150, 1986. [Google Scholar]

[R23] [23].Stone Charles J., Hansen Mark H., Charles Kooperberg, and Truong Young K.. Polynomial splines and their tensor products in extended linear modeling. Ann. Stat, 25(4):1371–1470, 1997. With discussion and a rejoinder by the authors and Jianhua Z. Huang. [Google Scholar]

[R24] [24].Siqi Sun, Jinbo Xu, and Mladen Kolar. Learning structured densities via infinite dimensional exponential families. In Advances in Neural Information Processing Systems, pages 2287–2295, 2015. [Google Scholar]

[R25] [25].Cristiano Varin, Nancy Reid, and David Firth. An overview of composite likelihood methods. Stat. Sinica, 21(1):5–42, 2011. [Google Scholar]

[R26] [26].Arend Voorman, Ali Shojaie, and Witten Daniela M.. Graph estimation with joint additive models. Biometrika, 101(1):85–101, March 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] [27].Wang Yuchung J. and Ip Edward H.. Conditionally specified continuous distributions. Biometrika, 95(3):735–746, 2008. [Google Scholar]

[R28] [28].Eunho Yang, Pradeep Ravikumar, Genevera I Allen, and Zhandong Liu. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16(1):3813–3847, 2015. [PMC free article] [PubMed] [Google Scholar]

[R29] [29].Zhuoran Yang, Yang Ning, and Han Liu. On semiparametric exponential family graphical models. arXiv preprint arXiv:1412.8697, 2014. [Google Scholar]

[R30] [30].Xiaotong Yuan, Ping Li, Tong Zhang, Qingshan Liu, and Guangcan Liu. Learning additive exponential family graphical models via ℓ_{2, 1}-norm regularized m-estimation. In Advances in Neural Information Processing Systems, pages 4367–4375, 2016. [Google Scholar]

[R31] [31].Helen Zhang Hao and Yi Lin. Component selection and smoothing for nonparametric regression in exponential families. Stat. Sinica, 16(3):1021–1041, 2006. [Google Scholar]

[R32] [32].Tuo Zhao, Zhaoran Wang, and Han Liu. Nonconvex low rank matrix factorization via inexact first order oracle. Advances in Neural Information Processing Systems, 2015. [Google Scholar]

[R33] [1].Bickel P, Diggle P, Feinberg S, Gather U, Olkin I, and Zeger S. Springer series in statistics. 2009. [Google Scholar]

PERMALINK

The Expxorcist: Nonparametric Graphical Models Via Conditional Exponential Densities

Arun Sai Suggala

Mladen Kolar

Pradeep Ravikumar

Abstract

1. Introduction

2. Multivariate Density Specification via Conditional Densities

3. Conditional Densities of an Exponential Family Form

4. Regularized Conditional Likelihood Estimation for Exponential Family Form Densities

Optimization objective:

Algorithm:

Iterative minimization of (8):

Convergence:

5. Statistical Properties

Notation:

Discussion on Assumption 1:

6. Connections to Exponential Family MRF Copulas

7. Experiments

7.1. Synthetic Experiments

Data:

Evaluation Metric:

Experiment Settings:

Results:

Figure 1:

7.2. Futures Intraday Data

Figure 2:

8. Conclusion

9. Acknowledgement

Supplementary material for Nonparametric Graphical Models Via Conditional Exponential Densities

A. Node Conditional Maximum Likelihood Estimation

B. Statistical Properties

B.1. Proof of Theorem 2

(a). No stationary points in B2(αm∗,rm)∩Cc:

(b). No stationary points in B2(αm∗,rm)∩C\B2(αm∗,rs):

B.2. Proof of Corollary 3

(a). Upper bound for R∗(∇Ls,m(αm∗)−∇L‒s,m(αm∗)):

(b). Upper bound for R∗(∇L‒s,m(αm∗)):

B.3. Proof of Corollary 4

Figure 1:

C. Experiments

C.1. Synthetic Data

C.2. Futures Intraday Data

Figure 2.

Figure 3:

Figure 4:

Figure 5:

Figure 6:

Figure 7:

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

(a). No stationary points in $B_{2} (α_{m}^{*}, r_{m}) \cap C^{c}$ :

(b). No stationary points in $B_{2} (α_{m}^{}, r_{m}) \cap C \ B_{2} (α_{m}^{}, r_{s})$ :

(a). Upper bound for $R^{} (\nabla L_{s, m} (α_{m}^{}) - \nabla {\overset{‒}{L}}_{s, m} (α_{m}^{*}))$ :

(b). Upper bound for $R^{} (\nabla {\overset{‒}{L}}_{s, m} (α_{m}^{}))$ :