Abstract
Non-parametric multivariate density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose near-parametric assumptions on the form of the density functions. In this paper, we leverage recent developments to propose a class of non-parametric models which have very attractive computational and statistical properties. Our approach relies on the simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form.
1. Introduction
Let X = (X1, …, Xp) be a p-dimensional random vector. Let G = (V, E) be the graph that encodes the conditional independence assumptions underlying the distribution of X, that is, each node of the graph corresponds to a component of the vector X, and (a, b) ∉ E if and only if Xa is conditionally independent of Xb given the remaining variables X¬ab := {Xc ∣ c ∈ V \{a, b}}. The graphical model represented by G is then the set of distributions over X that satisfy the conditional independence assumptions specified by the graph G.
There has been a considerable line of work on learning parametric families of such graphical model distributions from data [22, 20, 13, 28], where the distribution is indexed by a finite-dimensional parameter vector. The goal of this paper, however, is to specify and learn nonparametric families of graphical model distributions, indexed by infinite-dimensional parameters, for which there has been comparatively limited work. Non-parametric multivariate density estimation broadly, even without the graphical model constraint, has not proved as popular in practical machine learning contexts, for both statistical and computational reasons. Loosely, estimating a non-parametric multivariate density, with mild assumptions, typically requires the number of samples to scale exponentially in the dimension p of the data, which is infeasible even in the big-data era when n is very large. And the resulting estimators are typically computationally expensive or intractable, for instance requiring repeated computations of multivariate integrals.
We present a review of multivariate density estimation that is necessarily incomplete but sets up our proposed approach. A common approach dating back to [15] uses the logistic density transform to satisfy the positivity and unit-integrability constraints for densities, and considers densities of the form fη(X) = exp(η(X)) / ∫ exp(η(x)) dx, with some constraints on η for identifiability, such as η(X0) = 0 for some fixed point X0, or ∫ η(x) dx = 0.
With the logistic density transform, differing approaches for non-parametric density estimation can be contrasted in part by their assumptions on the infinite-dimensional function space domain of η(·). An early approach [8] considered function spaces of functions with bounded “roughness” functionals. The predominant line of work however has focused on the setting where η(·) lies in a Reproducing Kernel Hilbert Space (RKHS), dating back to [21]. Consider the estimation of these logistic density transforms η(X) given n i.i.d. samples drawn from fη(X). A natural loss functional is the penalized log likelihood, with a penalty functional that ensures a smooth fit with respect to the function space domain: −(1/n) Σi=1…n η(X(i)) + log ∫ exp(η(x)) dx + λ ‖η‖²_H, for functions η(·) that lie in an RKHS H, and where ‖η‖²_H is the squared RKHS norm. This was studied by many [21, 11, 6]. A crucial caveat is that the representer theorem for RKHSs does not hold for this loss functional, because of the log-normalization term. Nonetheless, one can consider finite-dimensional function space approximations consisting of the linear span of kernel functions evaluated at the sample points [12]. Computationally this still scales poorly with the dimension due to the need to compute multidimensional integrals of the form ∫ exp(η(x)) dx, which do not, in general, decompose. These approximations also do not come with strong statistical guarantees.
We briefly note that the function space assumption that η(·) lies in an RKHS could also be viewed through the lens of an infinite-dimensional exponential family [4]. Specifically, let H be a Reproducing Kernel Hilbert Space with reproducing kernel k(·, ·) and inner product ⟨·, ·⟩_H. Then η(X) = ⟨η, k(X, ·)⟩_H, so that the density f(X) can in turn be viewed as a member of an infinite-dimensional exponential family with sufficient statistics k(X, ·) and natural parameter η. Following this viewpoint, [4] propose estimators via linear span approximations similar to [11].
Due to the computational caveat with exact likelihood based functionals, a line of approaches has focused on penalized surrogate likelihoods instead. [14] study a penalized surrogate loss functional built around ρ(X), some fixed known density with the same support as the unknown density f(X). While this estimation procedure is much more computationally amenable than minimizing the exact penalized likelihood, the caveat, however, is that for a general RKHS this requires solving higher order integrals. The next level of simplification has thus focused on the form of the logistic transform function itself. There has been a line of work on an ANOVA type decomposition of the logistic density transform into node-wise and pairwise terms: η(X) = Σs∈V ηs(Xs) + Σs<t ηst(Xs, Xt). A line of work has coupled such a decomposition with the assumption that each of the terms lies in an RKHS. This does not immediately provide a computational benefit: with penalized likelihood based loss functionals, the loss functional does not necessarily decompose into such node and pairwise terms. [24] thus couple this ANOVA type pairwise decomposition with a score matching based objective. [10] use the above decomposition with the surrogate loss functional of [14] discussed above, but note that this still requires the aforementioned function space approximation as a linear span of kernel evaluations, as well as two-dimensional integrals.
A line of recent work has thus focused on further stringent assumptions on the density function space, by assuming some components of the logistic transform to be finite-dimensional. [30] use an ANOVA decomposition but assume the terms belong to finite-dimensional function spaces instead of RKHSs, specified by a pre-defined finite set of basis functions. [29] consider logistic transform functions η(·) that have the pairwise decomposition above, with a specific class of parametric pairwise functions βst Xs Xt, and non-parametric node-wise functions. [17, 16] consider the problem of estimating monotonic node-wise functions such that the transformed random vector is multivariate Gaussian, which could also be viewed as estimating a Gaussian copula density.
To summarize the (necessarily incomplete) review above, non-parametric density estimation faces strong statistical and computational bottlenecks, and the more practical approaches impose stringent near-parametric assumptions on the form of the (logistic transform of the) density functions. In this paper, we leverage recent developments to propose a very computationally simple non-parametric density estimation algorithm, that still comes with strong statistical guarantees. Moreover, the density could be viewed as a graphical model distribution, with a corresponding sparse conditional independence graph.
Our approach relies on the following simple function space assumption: that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form. As we show, for there to exist a consistent joint density, the logistic density transform with respect to a particular base measure necessarily decomposes into the following semi-parametric form: η(X) = Σs∈V θs Bs(Xs) + Σs<t θst Bs(Xs) Bt(Xt) in the pairwise case, with both a parametric component {θs : s = 1, …, p}, {θst : s < t; s, t = 1, …, p}, as well as non-parametric components {Bs : s = 1, …, p}. We call this class of models the “Expxorcist”, following other “ghostbusting” semi-parametric models such as the nonparanormal and the nonparanormal skeptic [17, 16].
Since the conditional distributions are exponential families, we show that there exist computationally amenable estimators, even in our more general non-parametric setting, where the sufficient statistics have to be estimated as well. The statistical analysis in our non-parametric setting however is more subtle, due in part to non-convexity and in part to the non-parametric setting. We also show how the Expxorcist class of densities is closely related to a semi-parametric exponential family copula density that generalizes the Gaussian copula density of [17, 16]. We corroborate the applicability of our class of models with experiments on synthetic and real data sets.
2. Multivariate Density Specification via Conditional Densities
We are interested in the approach of estimating a multivariate density by estimating node-conditional densities. Since node-conditional densities focus on the density of a single variable, albeit conditioned on the rest of the variables, estimating these is potentially a simpler problem, both statistically and computationally, than estimating the entire joint density itself. Let us consider the general non-parametric conditional density estimation problem. Given the general multivariate density f(X) = exp(η(X)) / ∫ exp(η(x)) dx, the conditional density of a variable Xs given the rest of the variables X−s is given by f(Xs ∣ X−s) = exp(η(X)) / ∫ exp(η(xs, X−s)) dxs, which involves only a one-dimensional rather than a multi-dimensional integral, but otherwise does not have a computationally amenable form. There has been a line of work on such conditional density estimation, mirroring developments in multivariate density estimation [9, 18, 23], but unlike parametric settings, there are no large sample complexity gains with non-parametric conditional density estimation under general settings. There have also been efforts to use ANOVA decompositions in a conditional density context [31, 26].
In addition to computational and sample complexity caveats, recall that in our context, we would like to use conditional density estimates to infer a joint multivariate density. A crucial caveat with using the above estimates to do so is that it is not clear when the estimated node-conditional densities would be consistent with a joint multivariate density. There has been a line of work on this question (of when conditional densities are consistent with a joint density) for parametric densities; see [1] for an overview, with more recent results in [27, 5, 2, 25]. Overall, while estimating node-conditional densities could be viewed as surrogate estimation of a joint density, arbitrary node-conditional distributions need not be consistent in general with any joint density. There has however been a line of work in recent years [3, 28], where it was shown that when the node-conditional distributions belong to an exponential family, then under certain conditions on their parameterization, there do exist multivariate densities consistent with the node-conditional densities. In the next section, we leverage these results towards non-parametric estimation of conditional densities.
3. Conditional Densities of an Exponential Family Form
We first recall the definition of an exponential family in the context of a conditional density.
Definition 1. A conditional density of a random variable Y given covariates Z is said to have an exponential family form if it can be written as f(Y ∣ Z) = exp(B(Y)ᵀ E(Z) + C(Y) + D(Z)), for some functions B(·): 𝒴 ↦ ℝᵏ and C(·): 𝒴 ↦ ℝ (for some finite integer k > 0), and E(·): 𝒵 ↦ ℝᵏ and D(·): 𝒵 ↦ ℝ, where 𝒴, 𝒵 denote the domains of Y and Z.
Thus, f(Y ∣ Z) belongs to a finite-dimensional exponential family with sufficient statistics B(Y), base measure exp(C(Y)), and natural parameter E(Z), where −D(Z) is the log-partition function. Contrast this with a general conditional density f(Y ∣ Z) = exp(h(Y, Z) + C(Y) + D(Z)) with respect to the base measure exp(C(Y)), with −D(Z) the log-normalization constant: a conditional density of the exponential family form has a logistic density transform h(Y, Z) that factorizes as B(Y)ᵀ E(Z).
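As a concrete illustration of Definition 1 (a worked instance of our own, not drawn from the original), a Gaussian conditional with linear mean fits this form with k = 1:

```latex
% Suppose Y | Z ~ N(mu(Z), sigma^2) with mu(Z) = sum_t beta_t Z_t. Then
f(Y \mid Z)
 = \exp\Big(
     \underbrace{\tfrac{Y}{\sigma^{2}}}_{B(Y)}\,
     \underbrace{\mu(Z)}_{E(Z)}
     \;\underbrace{-\,\tfrac{Y^{2}}{2\sigma^{2}}
        - \tfrac{1}{2}\log(2\pi\sigma^{2})}_{C(Y)}
     \;\underbrace{-\,\tfrac{\mu(Z)^{2}}{2\sigma^{2}}}_{D(Z)}
   \Big)
% so the sufficient statistic is B(Y) = Y / sigma^2, the base measure is
% exp(C(Y)), the natural parameter is E(Z) = mu(Z), and -D(Z) is indeed
% the log-partition function, exactly as in Definition 1 with k = 1.
```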
Consider the case where the sufficient statistic function is real-valued, i.e., k = 1. The non-parametric estimation problem of a conditional density of exponential family form then reduces to the estimation of the sufficient statistics function B(·) and the natural parameter function E(·), assuming the base measure C(·) is given. But when would such estimated conditional densities be consistent with a joint density? To answer this question, we draw upon developments in [28]. Suppose that the node-conditional distributions of each random variable Xs conditioned on the rest of the random variables have the exponential family form of Definition 1, so that for each s ∈ V
f(Xs ∣ X−s) = exp( Bs(Xs) Es(X−s) + Cs(Xs) + Ds(X−s) )   (1)
for some arbitrary functions Es(·), Bs(·), Cs(·) that specify a valid conditional density. Then [28] show that these node-conditional densities are consistent with a unique joint density over the random vector X, that moreover factors according to a set of cliques in the graph G, if and only if the functions {Es(·)}s∈V specifying the node-conditional distributions have the form Es(X−s) = θs + Σt∈V∖{s} θst Bt(Xt) + Σt,u∈V∖{s} θstu Bt(Xt) Bu(Xu) + …, where Θ := {θs, θst, θstu, …} is a set of parameters indexed by the cliques of G. Moreover, the corresponding consistent joint distribution has the following form
f(X; Θ, B) = exp( Σs∈V [ θs Bs(Xs) + Cs(Xs) ] + Σs<t θst Bs(Xs) Bt(Xt) + … − A(Θ, B) ),   (2)
where A(Θ, B) is the log-normalization constant, and the higher-order terms involve products of the Bs(·) over larger cliques of G.
In this paper, we are interested in the non-parametric estimation of the Expxorcist class of densities in (2), where we estimate both the finite-dimensional parameters Θ, as well as the functions {Bs(Xs)}s∈V. We assume we are given the base measures {Cs(Xs)}s∈V, so that the joint density is with respect to a given product base measure ∏s∈V exp(Cs(Xs)), as is common in the multivariate density estimation literature. Note that this is not a very restrictive assumption. In practice the base measure at each node can be well approximated using the empirical univariate marginal density of that node. We could also extend our algorithm, which we present next, to estimate the base measures along with the sufficient statistic functions.
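The remark that the base measure can be approximated by the empirical univariate marginal is straightforward to implement. The following is a minimal sketch of one such approximation (our illustration; the kernel density estimator, grid, and function names are our choices, not specified in the paper):

```python
import numpy as np
from scipy.stats import gaussian_kde

def estimate_log_base_measure(samples_s, grid):
    """Approximate C_s(x) = log f_s(x) from samples of node s.

    samples_s : 1-D array of n observations of X_s
    grid      : 1-D array of points at which to evaluate C_s
    Returns the log of a kernel density estimate of the univariate
    marginal, one common way to instantiate the base measure exp(C_s).
    """
    kde = gaussian_kde(samples_s)
    density = np.maximum(kde(grid), 1e-12)  # guard against log(0)
    return np.log(density)

# Example: a roughly bimodal X_s on [-1, 1]
x = np.concatenate([np.random.normal(-0.5, 0.15, 500),
                    np.random.normal(+0.5, 0.15, 500)])
grid = np.linspace(-1, 1, 200)
C_s = estimate_log_base_measure(x, grid)
```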
4. Regularized Conditional Likelihood Estimation for Exponential Family Form Densities
We consider the nonparametric estimation problem of estimating a joint density of the form in (2), focusing on the pairwise case where the factors have size at most k = 2, so that the joint density takes the form
f(X; Θ, B) = exp( Σs∈V θs Bs(Xs) + Σ(s,t)∈E θst Bs(Xs) Bt(Xt) + Σs∈V Cs(Xs) − A(Θ, B) )   (3)
As detailed in the previous section, estimating this joint density can be reduced to estimating its node-conditional densities, which take the form
f(Xs ∣ X−s) = exp( Bs(Xs) ( θs + Σt∈V∖{s} θst Bt(Xt) ) + Cs(Xs) − A(X−s; Θs, B) ),   (4)
where A(X−s; Θs, B) = log ∫ exp( Bs(x) ( θs + Σt∈V∖{s} θst Bt(Xt) ) + Cs(x) ) dx is a one-dimensional log-partition function.
We now introduce some notation which we use in the sequel. Let Θ = {θs}s∈V ∪ {θst}s≠t and Θs = {θs} ∪ {θst}t∈V∖{s}. Let B = {Bs}s∈V be the set of sufficient statistics. Let 𝒳s be the domain of Xs, which we assume is bounded, and let L²(𝒳s) be the Hilbert space of square integrable functions over 𝒳s with respect to the Lebesgue measure. We assume that the sufficient statistics satisfy Bs ∈ L²(𝒳s) for all s ∈ V.
Note that the model in Equation (3) is unidentifiable. To overcome this issue we impose additional constraints on its parameters. Specifically, we require Bs(Xs) to satisfy ‖Bs‖₂ = 1, and θs ≥ 0, ∀s ∈ V.
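To see the non-identifiability concretely (a short check of our own): rescaling each sufficient statistic while inversely rescaling the parameters leaves the exponent in (3) unchanged,

```latex
% Map B_s -> c_s B_s, theta_s -> theta_s / c_s, theta_st -> theta_st / (c_s c_t):
\frac{\theta_s}{c_s}\,\big(c_s B_s(X_s)\big) = \theta_s B_s(X_s),
\qquad
\frac{\theta_{st}}{c_s c_t}\,\big(c_s B_s(X_s)\big)\big(c_t B_t(X_t)\big)
   = \theta_{st}\,B_s(X_s)\,B_t(X_t)
% for any c_s, c_t > 0, so the density (3) is unchanged. The constraints
% ||B_s||_2 = 1 and theta_s >= 0 fix this scale (and sign) freedom.
```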
Optimization objective:
Let {X(i)}i=1…n be n i.i.d. samples drawn from a joint density of the form in Equation (3), with true parameters Θ*, B*. And let ℓ(X; Θs, B) be the node conditional negative log likelihood at node s:
ℓ(X; Θs, B) = −Bs(Xs) ( θs + Σt∈V∖{s} θst Bt(Xt) ) − Cs(Xs) + A(X−s; Θs, B),
where A(X−s; Θs, B) is the log partition function. To estimate the unknown parameters, we solve the following regularized node conditional log-likelihood estimation problem at each node s ∈ V
minimize over Θs, B:  (1/n) Σi=1…n ℓ(X(i); Θs, B) + λn Σt∈V∖{s} |θst|,  subject to ‖Bt‖₂ = 1 ∀t ∈ V, θs ≥ 0.   (5)
The equality constraints on the norms of the functions Bt(·) make the above optimization problem a difficult one to solve. While the norm constraints on Bt(·), ∀t ∈ V \ {s} can be handled through re-parametrization, the constraint on Bs(·) cannot be handled efficiently. To make the problem more amenable to numerical optimization techniques, we solve a closely related optimization problem. At each node s ∈ V, we consider the following re-parametrization of B: Bs(Xs) ← θs Bs(Xs), Bt(Xt) ← (θst/θs) Bt(Xt), ∀t ∈ V \ {s}; under this re-parametrization the term Bs(Xs)(θs + Σt θst Bt(Xt)) becomes Bs(Xs)(1 + Σt Bt(Xt)). With a slight abuse of notation we redefine ℓ(X; B) using this re-parametrization as
ℓ(X; B) = −Bs(Xs) ( 1 + Σt∈V∖{s} Bt(Xt) ) − Cs(Xs) + A(X−s; B),   (6)
where A(X−s; B) is the log partition function. We solve the following optimization problem, which is closely related to the original optimization in Equation (5)
minimize over B:  (1/n) Σi=1…n ℓ(X(i); B) + λn Σt∈V∖{s} ‖Bt‖₂,  subject to ⟨Bt, 1⟩ = 0 ∀t ∈ V.   (7)
For more details on the relation between (5) and (7), please refer to Appendix A.
Algorithm:
We now present our algorithm for the optimization of (7). In the sequel, for simplicity, we assume that the domains of the random variables Xt are all the same and equal to 𝒳. In order to estimate the functions Bt, we expand them over a uniformly bounded, orthonormal basis {ϕk}k=0…∞ of L²(𝒳) with ϕ0(·) ∝ 1. Expansion of the functions Bt(·) over this basis yields
Bt(Xt) = Σk=0…∞ αt,k ϕk(Xt).
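For concreteness, here is a minimal sketch of such a truncated orthonormal expansion, using the cosine basis on [0, 1] (the interval, normalization, and function names are our illustrative choices; any uniformly bounded orthonormal basis with ϕ0 ∝ 1 works):

```python
import numpy as np

def cosine_features(x, m):
    """Evaluate the first m non-constant cosine basis functions at points x.

    phi_k(x) = sqrt(2) * cos(pi * k * x) for k = 1..m; together with
    phi_0 = 1 these form an orthonormal basis of L^2([0, 1]).
    Returns an array of shape (len(x), m).
    """
    x = np.asarray(x, dtype=float)[:, None]      # shape (n, 1)
    k = np.arange(1, m + 1)[None, :]             # shape (1, m)
    return np.sqrt(2.0) * np.cos(np.pi * k * x)

def B_truncated(x, alpha):
    """B_t(x) = sum_{k=1}^m alpha_k phi_k(x): the truncated expansion."""
    return cosine_features(x, len(alpha)) @ alpha
```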
Note that the constraint ⟨Bt, 1⟩ = 0 in Equation (7) translates to αt,0 = 0. To convert the infinite dimensional optimization problem in (7) into a finite dimensional problem, we truncate the basis expansion to the top m terms and approximate Bt(·) as Bt(·) ≈ Σk=1…m αt,k ϕk(·). The optimization problem in Equation (7) can then be rewritten as
minimize over αm:  (1/n) Σi=1…n ℓ(X(i); αm) + λn Σt∈V∖{s} ‖αt,m‖₂,   (8)
where αt,m = (αt,1, …, αt,m) ∈ ℝᵐ, αm = {αt,m}t∈V, and ℓ(X; αm) is defined as
ℓ(X; αm) = −( Σk=1…m αs,k ϕk(Xs) ) ( 1 + Σt∈V∖{s} Σk=1…m αt,k ϕk(Xt) ) − Cs(Xs) + A(X−s; αm).
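To make this finite-dimensional objective concrete, here is a sketch of how ℓ(X; αm) can be evaluated numerically (ours, not the authors' released code); it uses a simple rectangle rule on a uniform grid, drops the −Cs(Xs) term since it does not depend on αm, and assumes the `cosine_features` helper from the previous snippet:

```python
import numpy as np

def node_loss(X, s, alpha, grid, C_s_grid, features):
    """Node-conditional negative log likelihood of (8) for one sample X,
    up to the additive term -C_s(X_s), which does not depend on alpha.

    X        : (p,) one sample; s : index of the conditioned node
    alpha    : dict t -> (m,) coefficient vector alpha_{t,m}
    grid     : (G,) uniform quadrature grid over the common domain
    C_s_grid : (G,) log base measure C_s evaluated on the grid
    features : callable x -> (len(x), m) basis matrix,
               e.g. features = lambda x: cosine_features(x, m)
    """
    # natural parameter: eta = 1 + sum_{t != s} B_t(X_t)
    eta = 1.0 + sum((features(np.array([X[t]])) @ alpha[t]).item()
                    for t in alpha if t != s)
    # one-dimensional log-partition A(X_{-s}; alpha), rectangle-rule quadrature
    dx = grid[1] - grid[0]
    B_s_grid = features(grid) @ alpha[s]
    A = np.log(np.sum(np.exp(B_s_grid * eta + C_s_grid)) * dx)
    B_s_x = (features(np.array([X[s]])) @ alpha[s]).item()
    return -B_s_x * eta + A
```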
Iterative minimization of (8):
Note that the objective in (8) is non-convex. In this work, we use a simple alternating minimization technique for its optimization: we alternately minimize over αs,m and over {αt,m}t∈V∖{s}, fixing the other block of parameters. The resulting optimization problem in each of the alternating steps is convex. We use proximal gradient descent to optimize these sub-problems. To compute the objective and its gradients, we need to numerically evaluate the one-dimensional integrals in the log partition function. To do this, we choose a uniform grid of points over the domain and use quadrature rules to approximate the integrals.
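A minimal sketch of this alternating scheme follows (ours, not the authors' code). It assumes the gradients of the smooth loss are supplied by the caller (e.g. by autodiff or numerical differentiation of the quadrature-based objective sketched above), and uses the standard group soft-thresholding proximal operator for the penalty λn Σt ‖αt,m‖₂; the step size and iteration counts are placeholders:

```python
import numpy as np

def group_soft_threshold(v, tau):
    """Proximal operator of tau * ||v||_2 (group soft-thresholding)."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= tau else (1.0 - tau / nrm) * v

def alternating_minimization(objective_grad, alpha, s, lam,
                             step=0.1, n_outer=50, n_inner=100):
    """Alternating proximal gradient descent for the objective in Eq. (8).

    objective_grad(alpha, t) should return the gradient of the smooth
    empirical loss with respect to the block alpha[t]; it is assumed to
    be supplied by the caller.
    alpha : dict t -> (m,) initial coefficient blocks; modified in place.
    """
    others = [t for t in alpha if t != s]
    for _ in range(n_outer):
        # Block 1: minimize over alpha_{s,m} (unpenalized, plain gradient steps)
        for _ in range(n_inner):
            alpha[s] = alpha[s] - step * objective_grad(alpha, s)
        # Block 2: minimize over {alpha_{t,m}} with the group-lasso penalty
        for _ in range(n_inner):
            for t in others:
                g = objective_grad(alpha, t)
                alpha[t] = group_soft_threshold(alpha[t] - step * g,
                                                step * lam)
    return alpha
```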
Convergence:
Although (8) is non-convex, we can show that under certain conditions on the objective function, the alternating minimization procedure converges to the global minimum. In recent work, [32] analyze alternating minimization for low rank matrix factorization problems and show that it converges to a global minimum if the sequence of convex sub-problems is strongly convex and satisfies certain other regularity conditions. The analysis of [32] can be extended to show global convergence of alternating minimization for (8).
5. Statistical Properties
In this section we provide parameter estimation error rates for the node conditional estimator in Equation (8). Note that these rates are for the re-parameterized model described in Equation (6) and can be easily translated to guarantees on the original model described in Equations (3), (4).
Notation:
Let 𝔹2(x, r) be the ℓ2 ball with center x and radius r. Let B* = {Bt*}t∈V be the true functions of the re-parametrized model, which we would like to estimate from the data. Denote the basis expansion coefficients of Bt(·) with respect to the orthonormal basis {ϕk} by αt, which is an infinite dimensional vector, and let αt* be the coefficients of Bt*(·). And let αt,m be the coefficients corresponding to the top m basis functions in the basis expansion of Bt(·). Note that, by Parseval's identity, ‖αt‖2 = ‖Bt‖2. Let α = {αt}t∈V and αm = {αt,m}t∈V; the true quantities α* and α*,m are defined analogously. Let ℒ̄(αm) := 𝔼[ℓ(X; αm)] be the population version of the sample loss ℒn(αm) := (1/n) Σi=1…n ℓ(X(i); αm) defined in Equation (8). We will often omit the superscript m from αm when clear from the context. We let (αt – αt,m) be the difference between the infinite dimensional vector αt and the vector obtained by appropriately padding αt,m with zeros. Finally, we define the norm ‖αm‖1,2 as Σt∈V ‖αt,m‖2 and its dual ‖αm‖∞,2 as maxt∈V ‖αt,m‖2. The norms on the infinite dimensional vector α are defined similarly.
We now state our key assumption on the sample loss ℒn. This assumption imposes a strong curvature condition on ℒn, up to a statistical tolerance term, in a ball around the truncated true parameter α*,m.
Assumption 1. There exist rm > 0 and constants c, κ > 0 such that for any αm ∈ 𝔹2(α*,m, rm), the gradient of the sample loss ℒn satisfies:
⟨∇ℒn(αm) − ∇ℒn(α*,m), αm − α*,m⟩ ≥ κ ‖αm − α*,m‖2² − c √(m log p / n) ‖αm − α*,m‖1,2.
Similar assumptions are increasingly common in the analysis of non-convex estimators; see [19] and references therein. We are now ready to state our results, which give the parameter estimation error rates; the proofs can be found in the Appendix. We first provide a deterministic bound on the error in terms of the random quantity ‖∇ℒn(α*,m)‖∞,2. We derive probabilistic results in the subsequent corollaries.
Theorem 2. Let N(s) be the true neighborhood of node s, with d = |N(s)|. Suppose ℒn satisfies Assumption 1. If the regularization parameter λn is chosen such that λn ≥ 2 max{ ‖∇ℒn(α*,m)‖∞,2, c √(m log p / n) }, then any stationary point α̂m of (8) in 𝔹2(α*,m, rm) satisfies:
‖α̂m − α*,m‖2 ≤ 5√(d+1) λn / κ.
We now provide a set of sufficient conditions under which the random quantity ‖∇ℒn(α*,m)‖∞,2 can be bounded.
Assumption 2. There exists a constant L > 0 such that the gradient of the population loss at α*,m satisfies: ‖∇ℒ̄(α*,m)‖∞,2 ≤ L ‖α* − α*,m‖2.
Corollary 3. Suppose the conditions in Theorem 2 are satisfied. Moreover, let γ := supk ‖ϕk‖∞ and τm := ‖α* − α*,m‖2. Suppose ℒ̄ satisfies Assumption 2. If the regularization parameter λn is chosen such that λn = 2c1 ( γ √(m log p / n) + L τm ), then with probability at least 1 – 2m/p² any stationary point α̂m of (8) in 𝔹2(α*,m, rm) satisfies:
‖α̂m − α*,m‖2 ≤ c′ √d ( γ √(m log p / n) + L τm ) / κ,
where c1, c′ > 0 are constants.
Theorem 2 and Corollary 3 bound the error of the estimated coefficients in the truncated expansion. The approximation error of the truncated expansion itself depends on the function space assumption, as well as the basis chosen, but can be simply combined with the statement of the above corollary to derive the overall error. As an instance, we present a corollary below for the specific case of Sobolev space of order two, and the trigonometric basis.
Corollary 4. Suppose the conditions in Corollary 3 are satisfied. Moreover, suppose the true functions {Bt*}t∈V lie in a Sobolev space of order two. Let {ϕk} be the trigonometric basis of L²(𝒳). If the optimization problem (8) is solved with λn = c1 (d² log(p)/n)^{2/5} and m = c2 (n/(d² log(p)))^{1/5}, then with probability at least 1 – 2m/p² any stationary point α̂m of (8) in 𝔹2(α*,m, rm) satisfies:
‖α̂m − α*‖2 ≤ c3 √d (d² log(p)/n)^{2/5},
where c1, c2, c3 depend on L, κ, γ, τm.
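The choice of m in Corollary 4 comes from the usual bias–variance balancing. A back-of-the-envelope version of the calculation (ours, with constants and the exact dependence on d suppressed):

```latex
% Sobolev smoothness of order two gives a truncation bias of order m^{-2},
% while the estimation error of the truncated model is of order
% \sqrt{m \log p / n}. Balancing the two terms:
m^{-2} \asymp \sqrt{\frac{m \log p}{n}}
\;\Longrightarrow\;
m^{5} \asymp \frac{n}{\log p}
\;\Longrightarrow\;
m \asymp \Big(\frac{n}{\log p}\Big)^{1/5},
\qquad
m^{-2} \asymp \Big(\frac{\log p}{n}\Big)^{2/5}.
% Tracking the neighborhood size d through the same calculation, as in
% Corollary 4, replaces log p by d^2 log p, giving the stated choices of
% m and lambda_n and the (d^2 log p / n)^{2/5} rate.
```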
Discussion on Assumption 1:
We now provide a set of sufficient conditions which ensure the restricted strong convexity (RSC) condition of Assumption 1. Suppose the population risk ℒ̄ is strongly convex in a ball of radius rm around α*,m:
⟨∇ℒ̄(αm) − ∇ℒ̄(α*,m), αm − α*,m⟩ ≥ κ ‖αm − α*,m‖2², ∀ αm ∈ 𝔹2(α*,m, rm).   (9)
Moreover, suppose the empirical gradients converge uniformly to the population gradients
sup over αm ∈ 𝔹2(α*,m, rm) of ‖∇ℒn(αm) − ∇ℒ̄(αm)‖∞,2 ≤ (c/2) √(m log p / n), with high probability.   (10)
For example, this condition holds with high probability when the gradient of ℓ(X; αm) with respect to αt,m, for any t ∈ [p], is a sub-Gaussian process. Equations (9), (10) are easier to check and ensure that ℒn satisfies the RSC property in Assumption 1: combining them gives ⟨∇ℒn(αm) − ∇ℒn(α*,m), αm − α*,m⟩ ≥ κ ‖αm − α*,m‖2² − c √(m log p / n) ‖αm − α*,m‖1,2.
6. Connections to Exponential Family MRF Copulas
The Expxorcist class of models could be viewed as being closely related to an exponential family MRF [28] copula density. Consider the parametric exponential family MRF joint density in (3), where the distribution is indexed by the finite-dimensional parameters {θs}s∈V, {θst}(s,t)∈E, and where, in contrast to the previous sections, we assume we are given the sufficient statistics functions {Bs(·)}s∈V as well as the nodewise base measures {Cs(·)}s∈V. Now consider the following non-parametric problem. Given a random vector X, suppose we are interested in estimating monotonic node-wise functions {fs(Xs)}s∈V such that (f1(X1), …, fp(Xp)) follows the exponential family MRF density (3) for some θ. Letting f(X) = (f1(X1), …, fp(Xp)), we have, by the change of variables formula, that the density of X can be written as fMRF;θ(f1(X1), …, fp(Xp)) ∏s∈V fs′(Xs). This is now a semi-parametric estimation problem, where the unknowns are the functions {fs(Xs)}s∈V as well as the finite-dimensional parameters θ. To simplify this density, suppose we assume that the given node-wise sufficient statistics are linear, so that Bs(z) = z for all s ∈ V, so that the density reduces to
f(X) ∝ exp( Σs∈V θs fs(Xs) + Σ(s,t)∈E θst fs(Xs) ft(Xt) + Σs∈V [ Cs(fs(Xs)) + log fs′(Xs) ] ).   (11)
In contrast, the Expxorcist nonparametric exponential family graphical model takes the form
f(X) ∝ exp( Σs∈V θs Bs(Xs) + Σ(s,t)∈E θst Bs(Xs) Bt(Xt) + Σs∈V Cs(Xs) ).   (12)
It can be seen that the two densities have very similar forms, except that the density in (11) has a more complex base measure that depends on the unknown functions {fs}s∈V and importantly the functions {fs}s∈V in (11) are monotonic.
The class of densities in (11) can be cast as an exponential family MRF copula density. Suppose we denote the CDF of the parametric exponential family MRF joint density by FMRF;θ(X), with nodewise marginal CDFs FMRF;θ,s(Xs). Then, since each fs is monotonic, the marginal CDF of the density (11) can be written as Fs(Xs) = FMRF;θ,s(fs(Xs)), so that
fs(Xs) = F⁻¹MRF;θ,s( Fs(Xs) ).   (13)
It then follows that F(X) = FMRF;θ( F⁻¹MRF;θ,1(F1(X1)), …, F⁻¹MRF;θ,p(Fp(Xp)) ), where F(X) is the CDF of the density (11). By letting FCOP;θ(u1, …, up) := FMRF;θ( F⁻¹MRF;θ,1(u1), …, F⁻¹MRF;θ,p(up) ) be the exponential family MRF copula, we see that the CDF of X is precisely F(X) = FCOP;θ(F1(X1), …, Fp(Xp)), which is specified by the marginal CDFs {Fs(Xs)}s∈V and the copula FCOP;θ corresponding to the exponential family MRF density. In other words, the non-parametric extension in (11) of the exponential family MRF densities is precisely an exponential family MRF copula density. This development thus generalizes the non-parametric extension of Gaussian MRF densities via the Gaussian copula nonparanormal densities [17]. The caveats with the copula density however are two-fold: the node-wise functions are restricted to be monotonic, and the estimation of these as in (13) requires the estimation of inverses of marginal CDFs of an exponential family MRF, which is intractable in general. Thus, minor differences in the expressions of the Expxorcist density (12) and an exponential family MRF copula density (11) nonetheless have seemingly large consequences for tractable estimation of these densities from data.
7. Experiments
We present experimental results on both synthetic and real datasets. We compare our estimator, Expxorcist, with the Nonparanormal model of [17] and Gaussian Graphical Model (GGM). We use glasso [7] to estimate GGM and the two step estimator of [17] to estimate Nonparanormal model.
7.1. Synthetic Experiments
Data:
We generated synthetic data from the Expxorcist model with chain and grid graph structures. For both graph structures, we set θs = 1, ∀s ∈ V, θst = 1, ∀(s, t) ∈ E, and fixed the domain to [−1, 1]. We experimented with two choices for the sufficient statistics Bs(X): sin(4πX) and [exp(−20(X − 0.5)²) + exp(−20(X + 0.5)²) − 1], and picked the log base measure Cs(X) to be 0. The grid graph we considered has a 10 × (p/10) structure. We used Gibbs sampling to sample data from these models. We also generated data from Gaussian distributions with chain and grid graph structures. To generate this data we set the off-diagonal non-zero entries of the inverse covariance matrix to 0.49 for the chain graph and 0.25 for the grid graph, and the diagonal entries to 1.
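A minimal sketch of the Gibbs sampler we allude to (our illustration; grid-based inverse-CDF sampling of each node conditional is one standard way to implement it, and the grid size, burn-in, and function names are placeholders):

```python
import numpy as np

def gibbs_sample(edges, B, C, p, n_samples, n_burn=500, grid_size=400,
                 theta=1.0, rng=None):
    """Gibbs sampling from the pairwise Expxorcist model (3) on [-1, 1].

    edges : set of (s, t) pairs; B, C : callables B_s(x), C_s(x), shared
    across nodes here; theta : common value of theta_s and theta_st.
    Each sweep resamples X_s from f(X_s | X_-s) by inverse-CDF sampling
    on a uniform grid.
    """
    rng = rng or np.random.default_rng(0)
    nbrs = {s: [t for (a, b) in edges for t in (a, b)
                if s in (a, b) and t != s] for s in range(p)}
    grid = np.linspace(-1, 1, grid_size)
    Bg, Cg = B(grid), C(grid)
    X = rng.uniform(-1, 1, size=p)
    out = []
    for it in range(n_burn + n_samples):
        for s in range(p):
            eta = theta + theta * sum(B(X[t]) for t in nbrs[s])
            logits = Bg * eta + Cg
            prob = np.exp(logits - logits.max())
            cdf = np.cumsum(prob); cdf /= cdf[-1]
            X[s] = grid[np.searchsorted(cdf, rng.uniform())]
        if it >= n_burn:
            out.append(X.copy())
    return np.array(out)

# Example usage with the first choice of sufficient statistic:
# chain = {(s, s + 1) for s in range(49)}
# data = gibbs_sample(chain, lambda x: np.sin(4 * np.pi * x),
#                     lambda x: np.zeros_like(x), p=50, n_samples=100)
```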
Evaluation Metric:
We compared the performance of Expxorcist against the baselines on graph structure recovery using ROC curves. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) over different choices of the regularization parameter, where TPR is the fraction of true edges that are correctly detected and FPR is the fraction of non-edges that are incorrectly reported as edges.
Experiment Settings:
For this experiment we set p = 50 and n ∈ {100, 200, 500}, and varied the regularization parameter λ from 10⁻² to 1. To fit the data to the non-parametric model (3), we used the cosine basis and truncated the basis expansion to the top 30 terms. In practice, one could choose the number of basis functions (m) based on domain knowledge (e.g. “smooth” functions), or in the absence of such knowledge, one could use hold-out validation or cross validation. Given N̂(s), the estimated neighborhood of node s, we estimated the overall edge set as Ê = ∪s∈V {(s, t) : t ∈ N̂(s)}. To reduce the variance in the ROC plots, we averaged results over 10 repetitions.
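For completeness, a small sketch of how the edge-recovery metrics can be computed from the estimated and true edge sets (ours; it assumes undirected edges stored as sorted pairs):

```python
def roc_point(true_edges, est_edges, p):
    """TPR/FPR for graph recovery; edges are sets of sorted pairs (s, t)."""
    all_pairs = {(s, t) for s in range(p) for t in range(s + 1, p)}
    non_edges = all_pairs - true_edges
    tpr = len(est_edges & true_edges) / max(len(true_edges), 1)
    fpr = len(est_edges & non_edges) / max(len(non_edges), 1)
    return tpr, fpr

# Sweeping the regularization parameter and plotting the resulting
# (fpr, tpr) pairs traces out the ROC curves reported in Figure 1.
```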
Results:
Figure 1 shows the ROC plots obtained from this experiment. Due to lack of space, we present further experimental results in the Appendix. It can be seen that Expxorcist has much better performance on non-Gaussian data. On these datasets, even at n = 500 the baselines essentially chose edges at random. This suggests that in the presence of multiple modes and fat tails, Expxorcist is a better model. Expxorcist has slightly worse performance than the baselines on Gaussian data. However, this is expected because it learns a broader family of distributions than the Nonparanormal.
Figure 1:
ROC plots from synthetic experiments. Top and bottom rows show plots for chain and grid graphs respectively. The left column shows plots for data generated from our non-parametric model with Bs(X) = sin(4πX) and n = 500; the center column shows plots for the other choice of sufficient statistic with n = 500. The right column shows plots for Gaussian data with n = 200.
7.2. Futures Intraday Data
We now present our analysis of Futures price returns. This dataset was downloaded from http://www.kibot.com/. We focus on the 26 most liquid instruments traded at the Chicago Mercantile Exchange (CME). The instruments span different sectors such as Energy, Agriculture, Currencies, Equity Indices, Metals and Interest Rates. We focus on the hours of maximum liquidity (9am Eastern to 3pm Eastern) and look at the 1-minute price returns. The return distribution is a mixture of 1-minute returns with the overnight return; since overnight returns tend to be bigger than the 1-minute returns within the day, the return distribution is multimodal and fat-tailed. We treat each instrument as a random variable and the 1-minute returns as independent samples drawn from these random variables. We use the data collected in February 2010 as training data and the data from March 2010 as held out data for tuning parameter selection. After removing samples with missing entries we are left with 894 training and 650 held out data samples. We fit Expxorcist and the baselines on this data with the same parameter settings described above. For each of these models, we select the best tuning parameter through log likelihood on the held out data. However, this criterion resulted in complete graphs for Nonparanormal and GGM (325 edges) and a relatively sparser graph for Expxorcist (168 edges). So for a better comparison of these models, we selected tuning parameters for each of the models such that the resulting graphs have almost the same number of edges. Figure 2 shows the learned graphs for one such choice of tuning parameters, which resulted in ~52 edges in each graph. Nonparanormal and GGM resulted in very similar graphs, so we only present Nonparanormal here. It can be seen that Expxorcist is able to identify the sector clusters better than Nonparanormal. More detailed graphs and a comparison with GGM can be found in the Appendix.
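A sketch of the kind of preprocessing described above (ours; the column names and file layout are hypothetical, since the raw kibot.com format is not specified here):

```python
import numpy as np
import pandas as pd

def one_minute_returns(csv_path):
    """1-minute log returns restricted to 9am-3pm Eastern for one instrument.

    Assumes a hypothetical CSV with 'timestamp' and 'price' columns,
    timestamps already in Eastern time; the real kibot.com layout may differ.
    """
    df = pd.read_csv(csv_path, parse_dates=["timestamp"])
    prices = (df.set_index("timestamp")["price"]
                .between_time("09:00", "15:00")
                .resample("1min").last().dropna())
    return np.log(prices).diff().dropna()

# Stacking the return series of all 26 instruments column-wise (dropping
# minutes with missing entries) yields the n x p sample matrix used above.
```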
Figure 2:
Graph Structures learned for the Futures Intraday Data. The Expxorcist graph shown here was obtained by selecting λ = 0.1. Nodes are colored based on their categories. Edge thickness is proportional to the magnitude of the interaction.
8. Conclusion
In this work we considered the problem of non-parametric density estimation and introduced Expxorcist, a new family of non-parametric graphical models. Our approach relies on a simple function space assumption that the conditional distribution of each variable conditioned on the other variables has a non-parametric exponential family form. We proposed an estimator for Expxorcist that is computationally efficient and comes with statistical guarantees. Our empirical results suggest that, in the presence of multiple modes and fat tails in the data, our non-parametric model is a better choice than the Nonparanormal model of [17].
9. Acknowledgement
A.S. and P.R. acknowledge the support of ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1447574, DMS-1264033, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences. M. K. acknowledges support by an IBM Corporation Faculty Research Fund at the University of Chicago Booth School of Business.
Supplementary material for Nonparametric Graphical Models Via Conditional Exponential Densities
A. Node Conditional Maximum Likelihood Estimation
In this section we derive the relation between the optimization problems (5) and (7) defined in Section 4 of the main paper. We start with the optimization problem (5), which is defined in terms of the original parameters of the non-parametric graphical model (3):
minimize over Θs, B:  (1/n) Σi=1…n ℓ(X(i); Θs, B) + λn Σt∈V∖{s} |θst|,  subject to ‖Bt‖₂ = 1 ∀t ∈ V, θs ≥ 0,
where ℓ(X; Θs, B) is the node conditional negative log likelihood at node s:
ℓ(X; Θs, B) = −Bs(Xs) ( θs + Σt∈V∖{s} θst Bt(Xt) ) − Cs(Xs) + A(X−s; Θs, B).
Let B̃t(Xt) := θst Bt(Xt), ∀t ∈ V ∖ {s}. Using this parametrization, ℓ(X; Θs, B) can be written as:
ℓ(X; θs, Bs, B̃) = −Bs(Xs) ( θs + Σt∈V∖{s} B̃t(Xt) ) − Cs(Xs) + A(X−s; θs, Bs, B̃).
Note that, given B̃t, one can recover θst (and thus Bt) by computing the norm of B̃t, since ‖Bt‖₂ = 1 implies θst = ‖B̃t‖₂. Using this re-parametrization the original optimization in Equation (5) can be written as the following equivalent problem:
minimize over θs, Bs, B̃:  (1/n) Σi=1…n ℓ(X(i); θs, Bs, B̃) + λn Σt∈V∖{s} ‖B̃t‖₂,  subject to ‖Bs‖₂ = 1, θs ≥ 0.
The above problem still has the equality constraint on the norm of the function Bs(·). As pointed out in the main paper, this makes the above optimization problem a difficult one to solve. To make the optimization more amenable to numerical optimization techniques, we solve a closely related optimization problem. At each node s ∈ V, we consider the following re-parametrization of B: Bs(Xs) ← θs Bs(Xs), Bt(Xt) ← (θst/θs) Bt(Xt), ∀t ∈ V ∖ {s}. With a slight abuse of notation we redefine ℓ(X; B) using this re-parametrization as:
ℓ(X; B) = −Bs(Xs) ( 1 + Σt∈V∖{s} Bt(Xt) ) − Cs(Xs) + A(X−s; B),
where A(X−s; B) is the log partition function. We solve the following optimization problem, which is closely related to the original optimization in Equation (5):
minimize over B:  (1/n) Σi=1…n ℓ(X(i); B) + λn Σt∈V∖{s} ‖Bt‖₂,  subject to ⟨Bt, 1⟩ = 0 ∀t ∈ V.
Note that the true parameter θs needs to be bounded away from 0 for this re-parametrized optimization problem to give consistent estimates of the true parameters.
B. Statistical Properties
B.1. Proof of Theorem 2
Proof. Define Δ := αm − α*,m, and for a subset S ⊆ V let ΔS be the sub-vector of Δ restricted to the coordinates specified by the variables in S. Let F(αm) denote the optimization objective in Equation (8):
F(αm) = ℒn(αm) + λn P(αm),  where P(αm) := Σt∈V∖{s} ‖αt,m‖2.
We prove the theorem in two parts. In the first part we show that F doesn't have any stationary points in 𝔸1 := {αm ∈ 𝔹2(α*,m, rm) : ‖ΔSᶜ‖1,2 ≤ 3‖ΔS‖1,2 and ‖Δ‖2 > r̄}. In the second part we show that F doesn't have any stationary points in 𝔸2 := {αm ∈ 𝔹2(α*,m, rm) : ‖ΔSᶜ‖1,2 > 3‖ΔS‖1,2 and ‖Δ‖2 > r̄}, where S := N(s) ∪ {s} and r̄ := 5√(d+1) λn / κ. The proof of the Theorem then follows by combining the results from these two parts, since 𝔸1 ∪ 𝔸2 = {αm ∈ 𝔹2(α*,m, rm) : ‖Δ‖2 > r̄}.
(a). No stationary points in 𝔸1:
Let αm ∈ 𝔸1 and let Δ := αm − α*,m. Let ∂f(x) denote the set of sub-gradients of a function f(·) at x. For any g ∈ ∂F(αm), where g = ∇ℒn(αm) + λn z, z ∈ ∂P(αm), we have:
⟨g, Δ⟩ = ⟨∇ℒn(αm) − ∇ℒn(α*,m), Δ⟩ + ⟨∇ℒn(α*,m), Δ⟩ + λn ⟨z, Δ⟩.   (1)
We now bound each of the terms on the RHS of the above equation. From Assumption 1 on the RSC property of the sample loss, and since λn ≥ 2c√(m log p / n), we have
⟨∇ℒn(αm) − ∇ℒn(α*,m), Δ⟩ ≥ κ ‖Δ‖2² − (λn/2) ‖Δ‖1,2.
From the definition of P we have ⟨z, Δ⟩ ≥ P(αm) − P(α*,m). So we have
λn ⟨z, Δ⟩ ≥ λn ( ‖ΔSᶜ‖1,2 − ‖ΔS‖1,2 ),
where the last inequality follows from the properties of the sub-gradients of the norm ‖·‖1,2 and the fact that αt,m* = 0 for t ∉ S. Finally, from the definition of the dual norm ‖·‖∞,2 and our choice of λn, we have:
⟨∇ℒn(α*,m), Δ⟩ ≥ −‖∇ℒn(α*,m)‖∞,2 ‖Δ‖1,2 ≥ −(λn/2) ‖Δ‖1,2.
Substituting these results in Equation (1) we get:
⟨g, Δ⟩ ≥ κ ‖Δ‖2² − λn ‖Δ‖1,2 + λn ( ‖ΔSᶜ‖1,2 − ‖ΔS‖1,2 ).   (2)
Note that since αm ∈ 𝔸1 we have ‖ΔSᶜ‖1,2 ≤ 3‖ΔS‖1,2, and hence ‖Δ‖1,2 ≤ 4‖ΔS‖1,2 ≤ 4√(d+1) ‖Δ‖2. Moreover, ‖ΔS‖1,2 ≤ √(d+1) ‖Δ‖2. Substituting this in the above equation we get:
⟨g, Δ⟩ ≥ κ ‖Δ‖2² − 5√(d+1) λn ‖Δ‖2.   (3)
The above inequality shows that ⟨g, Δ⟩ > 0, ∀ g ∈ ∂F(αm), whenever ‖Δ‖2 > 5√(d+1) λn / κ, which holds for every αm ∈ 𝔸1.
Now suppose αm ∈ 𝔸1 is a stationary point. Then from the first-order necessary conditions we know that 0 ∈ ∂F(αm), so that ⟨g, Δ⟩ = 0 for the choice g = 0. However, this contradicts the result we obtained in Equation (3). This shows that there are no stationary points in 𝔸1.
(b). No stationary points in 𝔸2:
The proof of this part follows along the lines of the previous part. Let αm ∈ 𝔸2 and Δ := αm − α*,m. From Equations (1), (2) we know that:
⟨g, Δ⟩ ≥ κ ‖Δ‖2² − λn ( ‖ΔS‖1,2 + ‖ΔSᶜ‖1,2 ) + λn ( ‖ΔSᶜ‖1,2 − ‖ΔS‖1,2 ) = κ ‖Δ‖2² − 2 λn ‖ΔS‖1,2.   (4)
Since ‖ΔS‖1,2 ≤ √(d+1) ‖Δ‖2, substituting this in the above equation we get:
⟨g, Δ⟩ ≥ κ ‖Δ‖2² − 2√(d+1) λn ‖Δ‖2.   (5)
The above inequality shows that ⟨g, Δ⟩ > 0, ∀ g ∈ ∂F(αm), whenever ‖Δ‖2 > 2√(d+1) λn / κ; in particular, this holds for every αm ∈ 𝔸2, since r̄ > 2√(d+1) λn / κ. This shows that there are no stationary points in 𝔸2.
Following the results from parts (a) and (b) we conclude that any stationary point α̂m of (8) in 𝔹2(α*,m, rm) satisfies ‖α̂m − α*,m‖2 ≤ r̄ = 5√(d+1) λn / κ.□
B.2. Proof of Corollary 3
Before we proceed to the proof of Corollary 3, we introduce some notation used in its proof. We say Z is a σ–sub-Gaussian random variable if it satisfies the following tail property:
P( |Z| > ε ) ≤ 2 exp( −ε² / (2σ²) ), ∀ ε > 0.
We use the notation Z ∈ SG(σ²) to say that a random variable Z is σ–sub-Gaussian. We use the following standard result from concentration theory: if Z1, …, Zn are n i.i.d. SG(σ²) random variables, then (1/n) Σi=1…n Zi ∈ SG(σ²/n).
The following Lemma provides an upper bound for ‖∇ℒn(α*,m)‖∞,2 that holds with high probability.
Lemma 1. Let N(s) be the true neighborhood of node s, with d = |N(s)|. Moreover, let
γ := supk ‖ϕk‖∞
and
τm := ‖α* − α*,m‖2.
Suppose ℒ̄ satisfies Assumption 2. Then with probability at least 1 – 2m/p² we have:
‖∇ℒn(α*,m)‖∞,2 ≤ c1 ( γ √(m log p / n) + L τm ),
where c1 > 0 is a constant.
Proof. From the triangle inequality, we have:
‖∇ℒn(α*,m)‖∞,2 ≤ ‖∇ℒn(α*,m) − ∇ℒ̄(α*,m)‖∞,2 + ‖∇ℒ̄(α*,m)‖∞,2.
We now upper bound each of the terms on the RHS of the above equation.
(a). Upper bound for ‖∇ℒn(α*,m) − ∇ℒ̄(α*,m)‖∞,2:
The gradient of ℒn at α*,m is given, coordinate-wise, by:
∇αs,k ℒn(α*,m) = (1/n) Σi=1…n ( 𝔼α*,m[ ϕk(x) ∣ X−s(i) ] − ϕk(Xs(i)) ) ( 1 + Σt∈V∖{s} Bt*(Xt(i)) ),
and for t ∈ V ∖ {s}:
∇αt,k ℒn(α*,m) = (1/n) Σi=1…n ϕk(Xt(i)) ( 𝔼α*,m[ Bs*(x) ∣ X−s(i) ] − Bs*(Xs(i)) ),
where Bt* := Σk=1…m αt,k* ϕk, 𝔼α[· ∣ X−s] denotes expectation with respect to the node-conditional density parametrized by α, and ∇αt,k denotes the gradient with respect to the variable αt,k.
We now show that ∇ℒn(α*,m) concentrates well around its expectation ∇ℒ̄(α*,m). Note that each coordinate of the gradient is an average of n i.i.d. bounded random variables.
Let us first define the random variables Ys,k(X) and Yt,k(X) as:
Ys,k(X) := ( 𝔼α*,m[ ϕk(x) ∣ X−s ] − ϕk(Xs) ) ( 1 + Σt∈V∖{s} Bt*(Xt) )
and
Yt,k(X) := ϕk(Xt) ( 𝔼α*,m[ Bs*(x) ∣ X−s ] − Bs*(Xs) ).
To ease the notation, we denote Ys,k(X(i)) by Ys,k(i), and similarly Yt,k(i). We now rewrite ∇ℒn(α*,m) in terms of the random variables Ys,k and Yt,k as follows:
∇αs,k ℒn(α*,m) = (1/n) Σi=1…n Ys,k(i),  ∇αt,k ℒn(α*,m) = (1/n) Σi=1…n Yt,k(i).
For any k ∈ [1, m], i ∈ [1, n] we have:
|Ys,k(i)| ≤ 2γ ( 1 + Σt∈V∖{s} |Bt*(Xt(i))| ),
where the last inequality follows from Jensen's inequality and the fact that supk ‖ϕk‖∞ ≤ γ. We now bound the term 1 + Σt∈V∖{s} |Bt*(Xt)|: since αt,m* = 0 for t ∉ N(s) ∪ {s}, and each Bt* is uniformly bounded over the compact domain, this term is bounded by a constant depending only on γ and the radius of the parameter ball. Substituting this in the above equation we get:
Ys,k(i) ∈ SG(c² γ²),
for a constant c > 0. Using similar arguments we can show that Yt,k(i) ∈ SG(c² γ²), ∀ t ∈ V ∖ {s}. This shows that Ys,k(i) and Yt,k(i), ∀ t ∈ V ∖ {s}, are sub-Gaussian. Since each coordinate ∇αt,k ℒn(α*,m) is an average of n i.i.d. random variables, we have:
∇αt,k ℒn(α*,m) − ∇αt,k ℒ̄(α*,m) ∈ SG(c² γ² / n).
From the concentration properties of a sub-Gaussian random variable we have:
P( |∇αt,k ℒn(α*,m) − ∇αt,k ℒ̄(α*,m)| > ε ) ≤ 2 exp( −n ε² / (2 c² γ²) ),
and similarly for the coordinates of the s-th block. Now, taking a union bound over the m coordinates within a block, we get:
P( ‖∇αt,m ℒn(α*,m) − ∇αt,m ℒ̄(α*,m)‖2 > √m ε ) ≤ 2m exp( −n ε² / (2 c² γ²) ),
and, by using a union bound again over the p blocks, we get:
P( ‖∇ℒn(α*,m) − ∇ℒ̄(α*,m)‖∞,2 > √m ε ) ≤ 2mp exp( −n ε² / (2 c² γ²) ).
Choosing ε = cγ √(6 log p / n), we get the following: with probability at least 1 – 2m/p²,
‖∇ℒn(α*,m) − ∇ℒ̄(α*,m)‖∞,2 ≤ c′ γ √(m log p / n).
(b). Upper bound for ‖∇ℒ̄(α*,m)‖∞,2:
From Assumption 2 we have the following upper bound for ‖∇ℒ̄(α*,m)‖∞,2:
‖∇ℒ̄(α*,m)‖∞,2 ≤ L ‖α* − α*,m‖2 = L τm.
Combining the results from parts (a) and (b) we get the following: with probability at least 1 – 2m/p²,
‖∇ℒn(α*,m)‖∞,2 ≤ c1 ( γ √(m log p / n) + L τm ),
where c1 > 0 is a constant.□
Following the results from Theorem 2 and Lemma 1, we conclude that if the regularization parameter λn is chosen such that
λn = 2 c1 ( γ √(m log p / n) + L τm ),
then with probability at least 1 – 2m/p², any stationary point α̂m of (8) in 𝔹2(α*,m, rm) satisfies
‖α̂m − α*,m‖2 ≤ 5√(d+1) λn / κ ≤ c′ √d ( γ √(m log p / n) + L τm ) / κ.
B.3. Proof of Corollary 4
Let α̂m be any stationary point of (8) in 𝔹2(α*,m, rm), and suppose λn is chosen as in Corollary 3:
λn = 2 c1 ( γ √(m log p / n) + L τm ).
In this section we derive bounds for the overall estimation error ‖α̂m − α*‖2. From the triangle inequality, we have:
‖α̂m − α*‖2 ≤ ‖α̂m − α*,m‖2 + ‖α*,m − α*‖2.
From Theorem 2, we have a bound for ‖α̂m − α*,m‖2. So, here we focus on bounding ‖α*,m − α*‖2.
Since the true functions lie in a Sobolev space of order two, we know that there exists a constant c2 > 0 such that [1]:
‖α*,m − α*‖2 ≤ c2 √d m⁻².
Combining this result with the result from Theorem 2, we get, with probability at least 1 – 2m/p²:
‖α̂m − α*‖2 ≤ c3 √d ( γ √(m log p / n) + m⁻² ),
where c3 > 0 is a constant. Choosing m = c4 ( n / (d² log p) )^{1/5}, which balances the estimation and truncation terms, gives us the following error bound:
‖α̂m − α*‖2 ≤ c3 √d ( d² log p / n )^{2/5},
and the corresponding λn for this choice of m is given by:
λn = c5 ( d² log p / n )^{2/5},
where c4, c5 > 0 are constants.
Figure 1:
ROC plots for data generated from the non-parametric graphical model with Bs(X) = sin(4πX). Top row shows results for the chain graph with Cs(X) = 0. Bottom row shows results for the grid graph with Cs(X) = 0.
C. Experiments
C.1. Synthetic Data
In this section we present additional results from our synthetic experiments. Specifically, we present ROC plots for n ∈ {100, 200, 500}. We use the same parameter settings as described in Section 7 of the main paper to generate the synthetic data. Figure 1 shows ROC plots for data generated from the non-parametric graphical model with Bs(X) = sin(4πX). Figure 2 shows ROC plots for Bs(X) = [exp(−20(X − 0.5)²) + exp(−20(X + 0.5)²) − 1], and Figure 3 shows ROC plots for Gaussian data.
C.2. Futures Intraday Data
In this section we present the graph learned by GGM for the Futures Intraday data, along with more detailed graphs learned by all three estimators. As pointed out in Section 7, selecting the tuning parameter based on held-out log likelihood resulted in very dense graphs for Nonparanormal and GGM. So we use a different technique to compare the models: we fix a tuning parameter for Expxorcist and select tuning parameters for the baselines that result in graphs with the same number of edges. Figure 4 shows the graph structures learned for one such choice of tuning parameters. Figures 5, 6, and 7 present more detailed versions of the corresponding graphs in Figure 4.
Figure 2:
ROC plots for data generated from the non-parametric graphical model with Bs(X) = [exp(−20(X − 0.5)²) + exp(−20(X + 0.5)²) − 1]. Top row shows results for the chain graph with Cs(X) = 0. Bottom row shows results for the grid graph with Cs(X) = 0.
Figure 3:
ROC plots for data generated from Gaussian distributions. Top row shows plots for chain graph and bottom row shows plots for grid graph.
Figure 4:
Graph Structures learned for the Futures Intraday Data. The Expxorcist graph shown here was obtained by selecting λ = 0.1. Nodes are colored based on their categories. Edge thickness is proportional to the magnitude of the interaction.
Figure 5:
Nonparanormal.
Figure 6:
Expxorcist.
Figure 7:
GGM.
Contributor Information
Arun Sai Suggala, Carnegie Mellon University, Pittsburgh, PA 15213.
Mladen Kolar, University of Chicago, Chicago, IL 60637.
Pradeep Ravikumar, Carnegie Mellon University, Pittsburgh, PA 15213.
References
- [1] Barry C. Arnold, Enrique Castillo, and José María Sarabia. Conditionally specified distributions: an introduction. Stat. Sci., 16(3):249–274, 2001. With comments and a rejoinder by the authors.
- [2] Patrizia Berti, Emanuela Dreassi, and Pietro Rigo. Compatibility results for conditional distributions. J. Multivar. Anal., 125:190–203, 2014.
- [3] Julian Besag. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. B, pages 192–236, 1974.
- [4] Stéphane Canu and Alex Smola. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714–720, March 2006.
- [5] Hua Yun Chen. Compatibility of conditionally specified models. Statist. Probab. Lett., 80(7-8):670–677, 2010.
- [6] Ronaldo Dias. Density estimation via hybrid splines. J. Statist. Comput. Simulation, 60(4):277–293, 1998.
- [7] Jerome H. Friedman, Trevor J. Hastie, and Robert J. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
- [8] I. J. Good and R. A. Gaskins. Nonparametric roughness penalties for probability densities. Biometrika, 58:255–277, 1971.
- [9] Chong Gu. Smoothing spline density estimation: conditional distribution. Stat. Sinica, 5(2):709–726, 1995.
- [10] Chong Gu, Yongho Jeon, and Yi Lin. Nonparametric density estimation in high-dimensions. Stat. Sinica, 23:1131–1153, 2013.
- [11] Chong Gu and Chunfu Qiu. Smoothing spline density estimation: theory. Ann. Stat., 21(1):217–234, 1993.
- [12] Chong Gu and Jingyuan Wang. Penalized likelihood density estimation: direct cross-validation and scalable approximation. Stat. Sinica, 13(3):811–826, 2003.
- [13] Ali Jalali, Pradeep Ravikumar, Vishvas Vasuki, and Sujay Sanghavi. On learning discrete graphical models using group-sparse regularization. In AISTATS, pages 378–387, 2011.
- [14] Yongho Jeon and Yi Lin. An effective method for high-dimensional log-density ANOVA estimation, with application to nonparametric graphical model building. Stat. Sinica, 16(2):353–374, 2006.
- [15] Tom Leonard. Density estimation, stochastic processes and prior information. J. R. Stat. Soc. B, 40(2):113–146, 1978. With discussion.
- [16] Han Liu, Fang Han, Ming Yuan, John D. Lafferty, and Larry A. Wasserman. High-dimensional semiparametric Gaussian copula graphical models. Ann. Stat., 40(4):2293–2326, 2012.
- [17] Han Liu, John D. Lafferty, and Larry A. Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res., 10:2295–2328, 2009.
- [18] Benoît R. Mâsse and Young K. Truong. Conditional logspline density estimation. Canad. J. Statist., 27(4):819–832, 1999.
- [19] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.
- [20] Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
- [21] B. W. Silverman. On the estimation of a probability density function by the maximum penalized likelihood method. Ann. Stat., 10(3):795–810, 1982.
- [22] T. P. Speed and H. T. Kiiveri. Gaussian Markov distributions over finite graphs. The Annals of Statistics, pages 138–150, 1986.
- [23] Charles J. Stone, Mark H. Hansen, Charles Kooperberg, and Young K. Truong. Polynomial splines and their tensor products in extended linear modeling. Ann. Stat., 25(4):1371–1470, 1997. With discussion and a rejoinder by the authors and Jianhua Z. Huang.
- [24] Siqi Sun, Jinbo Xu, and Mladen Kolar. Learning structured densities via infinite dimensional exponential families. In Advances in Neural Information Processing Systems, pages 2287–2295, 2015.
- [25] Cristiano Varin, Nancy Reid, and David Firth. An overview of composite likelihood methods. Stat. Sinica, 21(1):5–42, 2011.
- [26] Arend Voorman, Ali Shojaie, and Daniela M. Witten. Graph estimation with joint additive models. Biometrika, 101(1):85–101, March 2014.
- [27] Yuchung J. Wang and Edward H. Ip. Conditionally specified continuous distributions. Biometrika, 95(3):735–746, 2008.
- [28] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, and Zhandong Liu. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16(1):3813–3847, 2015.
- [29] Zhuoran Yang, Yang Ning, and Han Liu. On semiparametric exponential family graphical models. arXiv preprint arXiv:1412.8697, 2014.
- [30] Xiaotong Yuan, Ping Li, Tong Zhang, Qingshan Liu, and Guangcan Liu. Learning additive exponential family graphical models via ℓ2,1-norm regularized M-estimation. In Advances in Neural Information Processing Systems, pages 4367–4375, 2016.
- [31] Hao Helen Zhang and Yi Lin. Component selection and smoothing for nonparametric regression in exponential families. Stat. Sinica, 16(3):1021–1041, 2006.
- [32] Tuo Zhao, Zhaoran Wang, and Han Liu. Nonconvex low rank matrix factorization via inexact first order oracle. Advances in Neural Information Processing Systems, 2015.
- [1] P. Bickel, P. Diggle, S. Feinberg, U. Gather, I. Olkin, and S. Zeger. Springer Series in Statistics. 2009.