Entropy. 2019 Jun 3;21(6):559. doi: 10.3390/e21060559

Parameter Estimation with Data-Driven Nonparametric Likelihood Functions

Shixiao W. Jiang 1 and John Harlim 1,2,3,*
PMCID: PMC7515048  PMID: 33267273

Abstract

In this paper, we consider a surrogate modeling approach using a data-driven nonparametric likelihood function constructed on a manifold on which the data lie (or to which they are close). The proposed method represents the likelihood function using a spectral expansion formulation known as the kernel embedding of the conditional distribution. To respect the geometry of the data, we employ this spectral expansion using a set of data-driven basis functions obtained from the diffusion maps algorithm. The theoretical error estimate suggests that the error bound of the approximate data-driven likelihood function is independent of the variance of the basis functions, which allows us to determine the amount of training data needed for accurate likelihood function estimations. Supporting numerical results that demonstrate the robustness of the data-driven likelihood functions for parameter estimation are given on instructive examples involving stochastic and deterministic differential equations. When the dimension of the data manifold is strictly less than the dimension of the ambient space, we found that the proposed approach (which does not require knowledge of the data manifold) is superior to likelihood functions constructed using standard parametric basis functions defined on the ambient coordinates. In an example where the data manifold is not smooth and unknown, the proposed method is more robust than an existing polynomial chaos surrogate model, the non-intrusive spectral projection, which assumes a parametric likelihood. In fact, the estimation accuracy is comparable to direct MCMC estimates with only eight likelihood function evaluations that can be done offline, as opposed to 4000 sequential function evaluations, whenever direct MCMC can be performed. A robust, accurate estimation is also found using a likelihood function trained on statistical averages of the chaotic 40-dimensional Lorenz-96 model on a wide parameter domain.

Keywords: Bayesian inference, MCMC, diffusion maps, nonparametric likelihood function, surrogate modeling, reproducing kernel Hilbert space, kernel embedding of the conditional distribution

1. Introduction

Bayesian inference is a popular approach for solving inverse problems with far-reaching applications, such as parameter estimation and uncertainty quantification (see for example [1,2,3]). In this article, we will focus on a classical Bayesian inference problem of estimating the conditional distribution of hidden parameters of dynamical systems from a given set of noisy observations. In particular, let x(t;θ) be a time-dependent state variable, which implicitly depends on the parameter θ through the following initial value problem,

$\dot{x} = f(x, \theta), \quad x(0) = x_0.$ (1)

Here, for any fixed θ, f can be either deterministic or stochastic. Our goal is to estimate the conditional distribution of θ, given discrete-time noisy observations $y = \{y_1, \ldots, y_T\}$, where:

$y_i = g(x_i, \xi_i), \quad i = 1, \ldots, T.$ (2)

Here, $x_i \equiv x(t_i; \theta)$ are the solutions of Equation (1) for a specific hidden parameter θ, g is the observation function, and $\xi_i$ are unbiased noises representing the measurement or model error. Although the proposed approach can also estimate the conditional density of the initial condition $x_0$, we will not explore this inference problem in this article.

Given a prior density, p0(θ), Bayes’ theorem states that the conditional distribution of the parameter θ can be estimated as,

$p(\theta | y) \propto p(y | \theta)\, p_0(\theta),$ (3)

where p(y|θ) denotes the likelihood function of θ given the measurements y that depend on a hidden parameter value θ through (2). In most applications, the statistics of the conditional distribution p(θ|y) are the quantity of interest. For example, one can use the mean statistic as a point estimator of θ and the higher order moments for uncertainty quantification. To realize this goal, one draws samples of p(θ|y) and estimates these statistics via Monte Carlo averages over these samples. In this application, Markov Chain Monte Carlo (MCMC) is a natural sampling method that plays a central role in the computational statistics behind most Bayesian inference techniques [4].

In our setup, we assume that for any θ, one can simulate:

$y_i(\theta) = g(x_i(\theta), \xi_i), \quad i = 1, \ldots, T,$ (4)

where $x_i(\theta) \equiv x(t_i; \theta)$ denote solutions to the initial value problem in Equation (1). If the observation function has the following form,

$g(x_i(\theta), \xi_i) = h(x_i(\theta)) + \xi_i,$ (5)

where $\xi_i$ are i.i.d. noises, then one can define the likelihood function of θ, $p(y|\theta)$, as a product of the density functions of the noises $\xi_i$,

$p(y|\theta) \equiv \prod_{i=1}^{T} p(\xi_i) = \prod_{i=1}^{T} p\big(y_i - h(x_i(\theta))\big).$ (6)

When the observations are noise-less, $\xi_i = 0$, and the underlying system is an Itô diffusion process with additive or multiplicative noises, one can use Bayesian imputation to approximate the likelihood function [5]. In both parametric approaches, it is worth noting that the dependence of the likelihood function on the parameter is implicit through the solutions $x_i(\theta)$. Practically, this implicit dependence is the source of the computational burden in evaluating the likelihood function, since it requires solving the dynamical model in (1) for every proposal in the MCMC chain. In the case when simulating $y_i(\theta)$ is computationally feasible but the likelihood function is intractable, one can use, e.g., the Approximate Bayesian Computation (ABC) rejection algorithm [6,7] for Bayesian inference. Basically, the ABC rejection scheme generates the samples of $p(\theta|y)$ by comparing the simulated $y_i(\theta)$ to the observed data, $y_i$, with an appropriate choice of comparison metric for each proposal $\theta \sim p_0(\theta)$. In general, however, repetitive evaluation of (4) can be expensive when the dynamics in (1) is high-dimensional and/or stiff, or when T is large, or when the function g is an average of a long time series. Our goal is to address this situation in addition to not knowing the approximate likelihood function.
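For concreteness, when the observation model takes the additive form in (5) with i.i.d. Gaussian noise of standard deviation σ, the parametric likelihood (6) reduces to a product of Gaussian densities of the residuals. The following is a minimal sketch of such a likelihood evaluation; the model solver solve_model and the observation map h are placeholders for this illustration, not objects defined in this paper:

```python
import numpy as np

def log_likelihood_parametric(theta, y_obs, solve_model, h, sigma):
    """Log of the parametric likelihood (6) for additive i.i.d. Gaussian noise.
    solve_model(theta) is assumed to return the model states x_i(theta) at the
    observation times; h maps states to observables (both are placeholders)."""
    x = solve_model(theta)                       # x_i(theta), i = 1, ..., T
    resid = np.asarray(y_obs) - h(x)             # y_i - h(x_i(theta))
    # sum of log N(0, sigma^2) densities of the residuals
    return -0.5 * np.sum(resid**2) / sigma**2 \
           - resid.size * np.log(sigma * np.sqrt(2.0 * np.pi))
```

Note that every call to this function requires a fresh model solve, which is precisely the repeated cost that the surrogate approach below is designed to avoid.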

Broadly speaking, the existing approaches to overcome repetitive evaluation of (4) require knowledge of an approximate likelihood function such as in (6). They can be grouped into two classes. The first class consists of methods that improve/accelerate the sampling strategy; for example, the Hamiltonian Monte Carlo [8], adaptive MCMC [9], and delayed rejection adaptive Metropolis [10], just to name a few. The second class consists of methods that avoid solving the dynamical model in (1) when running the MCMC chain by replacing it with a computationally more efficient model on a known parameter domain. This class of approaches, also known as surrogate modeling, includes Gaussian process models [11], polynomial chaos [12,13], and enhanced model error [14]; for example, the non-intrusive spectral projection [13] approximates $x_i(\theta)$ in (6) with a polynomial chaos expansion. Another related approach, which also avoids MCMC on top of integrating (1), is to employ a polynomial expansion of the likelihood function [15]. This method represents the parametric likelihood function in (6) with orthonormal basis functions of a Hilbert space weighted by the prior measure. This choice of basis functions makes the computation of the statistics of the posterior density straightforward, and thus, MCMC is not needed.

In this paper, we consider a surrogate modeling approach where a nonparametric likelihood function is constructed using a data-driven spectral expansion. By nonparametric, we mean that our approach does not require any parametric form or assume any distribution as in (6). Instead, we approximate the likelihood function using the kernel embedding of conditional distribution formulation introduced in [16,17]. In our application, we will extend their formulation onto a Hilbert space weighted by the sampling measure of the training dataset as in [18]. We will rigorously demonstrate that using orthonormal basis functions of this data-driven weighted Hilbert space, the error bound is independent of the variance of the basis functions, which allows us to determine the amount of training data for accurate likelihood function estimations.

Computationally, assuming that the observations lie on (or close to) a Riemannian manifold $\mathcal{N}$ embedded in $\mathbb{R}^n$ with sampling density $q(y)$, we apply the diffusion maps algorithm [19,20] to approximate orthonormal basis functions $\varphi_k \in L^2(\mathcal{N}, q)$ using the training dataset. Subsequently, a nonparametric likelihood function is represented as a weighted sum of these data-driven basis functions, where the coefficients are precomputed using the kernel embedding formulation. In this fashion, our approach respects the geometry of the data manifold. Using this nonparametric likelihood function, we then generate the MCMC chain for estimating the conditional distribution of hidden parameters. For the present work, our aim is to demonstrate that one can obtain accurate and robust parameter estimation by implementing a simple Bayesian inference algorithm, the Metropolis scheme, with the data-driven nonparametric likelihood function. We should also point out that the present method is computationally feasible on low-dimensional parameter space, like any other surrogate modeling approach. Possible ways to overcome this dimensionality issue will be discussed.

This paper is organized as follows: In Section 2, we review the formulation of the reproducing kernel Hilbert space to estimate conditional density functions. In Section 3, we discuss the error estimate of the likelihood function approximation. In Section 4, we discuss the construction of the analytic basis functions for the Euclidean data manifold, as well as the data-driven basis functions with the diffusion maps algorithm for data that lie on an embedded Riemannian geometry. In Section 5, we provide numerical results with a parameter estimation application on instructive examples. In one of the examples, where the dynamical model is low-dimensional and the observation is in the form of (5), we compare the proposed approach with the direct MCMC and non-intrusive spectral projection method (both schemes use a likelihood of the form (6)). In addition, we will also demonstrate the robustness of the proposed approach on an example where g is a statistical average of a long-time trajectory (in which case the likelihood is intractable) and the dynamical model has relatively high-dimensional chaotic dynamics such that repetitive evaluation of (4) is numerically expensive. In Section 6, we conclude this paper with a short summary. We accompany this paper with Appendices for treating a large amount of data and for more numerical results.

2. Conditional Density Estimation via Reproducing Kernel Weighted Hilbert Spaces

Let $y \in \mathcal{N} \subseteq \mathbb{R}^n$, where $\mathcal{N}$ is a smooth manifold with intrinsic dimension $d \le n$. In practice, we measure the observations in the ambient coordinates and denote their components as $y = \{y^1, \ldots, y^n\}$. For the parameter θ space, $\mathcal{M}$ has a Euclidean structure with components $\theta = \{\theta^1, \ldots, \theta^m\}$, so $\mathcal{M}$ is assumed to be either an m-dimensional hyperrectangle or $\mathbb{R}^m$. For training, we are given $M$ training parameters $\{\theta_j\}_{j=1,\ldots,M} = \{\theta_j^1, \ldots, \theta_j^m\}_{j=1,\ldots,M}$. For each training parameter $\theta_j$, we generate a discrete time series of length $N$ of noisy observation data $y_{i,j} = \{y_{i,j}^1, \ldots, y_{i,j}^n\} \in \mathbb{R}^n$ for $i = 1, \ldots, N$ and $j = 1, \ldots, M$. Here, the sub-index i and the sub-index j of $y_{i,j}$ correspond to the ith observation datum for the jth training parameter $\theta_j$. Our goal for training is to learn the conditional density $p(y|\theta)$ from the training dataset $\{\theta_j\}_{j=1,\ldots,M}$ and $\{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$ for arbitrary y and θ within the range of $\{\theta_j\}_{j=1,\ldots,M}$.

The construction of the conditional density p(y|θ) is based on a machine learning tool known as the kernel embedding of the conditional distribution formulation introduced in [16,17]. In their formulation, the representation of conditional distributions is an element of a Reproducing Kernel Hilbert Space (RKHS).

Recently, the representation using a Reproducing Kernel Weighted Hilbert Space (RKWHS) was introduced in [18]. That is, let $\Psi_k := \psi_k q$ be the orthonormal basis of $L^2(\mathcal{N}, q^{-1})$, where they are the eigenbasis of an integral operator,

$\mathcal{K}f(y) = \int_{\mathcal{N}} K(y, y')\, f(y')\, q^{-1}(y')\, dV, \quad f \in L^2(\mathcal{N}, q^{-1}),$ (7)

that is, $\mathcal{K}\Psi_k = \lambda_k \Psi_k$.

In the case where $\mathcal{N}$ is compact and $\mathcal{K}$ is Hilbert–Schmidt, the kernel can be written as,

$K(y, y') = \sum_{k=1}^{\infty} \lambda_k \Psi_k(y) \Psi_k(y'),$ (8)

which converges in $L^2(\mathcal{N}, q^{-1})$. Define the feature map $\Phi : \mathcal{N} \to \ell^2$ as,

$\Phi(y) := \{\Phi_k(y) = \lambda_k \Psi_k(y) : k \in \mathbb{Z}^+\}, \quad y \in \mathcal{N}.$ (9)

Therefore, any $f \in L^2(\mathcal{N}, q^{-1})$ can be represented as $f = \sum_{k=1}^{\infty} \hat{f}_k \Psi_k = \sum_{k=1}^{\infty} \frac{\hat{f}_k}{\lambda_k} \Phi_k$, where $\hat{f}_k = \langle f, \Psi_k \rangle_{q^{-1}} = \langle f, \psi_k \rangle := \int_{\mathcal{N}} f(y) \psi_k(y)\, dV$, provided that $\sum_k |\hat{f}_k|^2 / \lambda_k < \infty$. If we define $\langle f, g \rangle_{H_{q^{-1}}} := \sum_{k=1}^{\infty} \frac{\hat{f}_k \hat{g}_k}{\lambda_k}$, we can write the kernel in (8) as $K(y, y') = \langle \Phi(y), \Phi(y') \rangle_{H_{q^{-1}}}$. Throughout this manuscript, we denote the RKHS $H_{q^{-1}}(\mathcal{N})$ generated by the feature map Φ in (9) as the space of square-integrable functions with the reproducing property,

$f(y) = \langle f, K(\cdot, y) \rangle_{H_{q^{-1}}} := \sum_{k=1}^{\infty} \frac{\hat{f}_k \langle K(\cdot, y), \Psi_k \rangle_{q^{-1}}}{\lambda_k} = \sum_{k=1}^{\infty} \frac{\hat{f}_k}{\lambda_k} \Phi_k(y) = \langle f, \Phi(y) \rangle_{H_{q^{-1}}}, \quad y \in \mathcal{N},$

induced by the basis $\Psi_k$ of $L^2(\mathcal{N}, q^{-1})$. While this definition deceptively suggests that $H_{q^{-1}}(\mathcal{N})$ is similar to $L^2(\mathcal{N}, q^{-1})$, we should also point out that the RKHS requires that the Dirac functional $\delta_x : H_{q^{-1}}(\mathcal{N}) \to \mathbb{R}$, defined as $\delta_x f = f(x)$, be continuous. Since $L^2$ contains equivalence classes of functions, it is not an RKHS, and $H_{q^{-1}}(\mathcal{N}) \subset L^2(\mathcal{N}, q^{-1})$. See, e.g., Chapter 4 of [21] for more details. Using the same definition, we denote by $H_{\tilde{q}^{-1}}(\mathcal{M})$ the RKHS induced by an orthonormal basis of $L^2(\mathcal{M}, \tilde{q}^{-1})$ of functions of the parameter θ.

In this work, we will represent conditional density functions using the RKWHS induced by the data, where the bases will be constructed using the diffusion maps algorithm. The outcome of the training is an estimate of the conditional density, $\hat{p}(y|\theta)$, for arbitrary y and θ within the range of $\{\theta_j\}_{j=1,\ldots,M}$.

2.1. Review of Nonparametric RKWHS Representation of Conditional Density Functions

We first review the RKWHS representation of conditional density functions deduced in [18]. Let $\psi_k(y)$ be the orthonormal basis functions of $L^2(\mathcal{N}, q)$, where $\mathcal{N}$ contains the domain of the training data $y_{i,j}$, and the weight function $q(y)$ is defined with respect to the volume form inherited by $\mathcal{N}$ from the ambient space $\mathbb{R}^n$. Let $\varphi_l(\theta) \in L^2(\mathcal{M}, \tilde{q})$ be the orthonormal basis functions in the parameter θ space, where the training parameters are $\theta_j \in \mathcal{M}$, with weight function $\tilde{q}(\theta)$. For finite modes, $k = 1, \ldots, K_1$ and $l = 1, \ldots, K_2$, a nonparametric RKWHS representation of the conditional density can be written as follows [18]:

$\hat{p}(y|\theta) = \sum_{k=1}^{K_1} \hat{c}_{Y|\theta,k}\, \psi_k(y)\, q(y),$ (10)

where $\hat{p}(y|\theta)$ denotes an estimate of the conditional density $p(y|\theta) \in H_{q^{-1}}(\mathcal{N})$, and the expansion coefficients are defined as:

$\hat{c}_{Y|\theta,k} = \sum_{l=1}^{K_2} \left[ C_{Y\Theta} C_{\Theta\Theta}^{-1} \right]_{kl} \varphi_l(\theta).$ (11)

Here, the matrix $C_{Y\Theta}$ is $K_1 \times K_2$, and the matrix $C_{\Theta\Theta}$ is $K_2 \times K_2$; their components can be approximated by Monte Carlo averages [18]:

$\left[ C_{Y\Theta} \right]_{ks} = \mathbb{E}_{Y\Theta}[\psi_k \varphi_s] \approx \frac{1}{MN} \sum_{j=1}^{M} \sum_{i=1}^{N} \psi_k(y_{i,j})\, \varphi_s(\theta_j),$ (12)
$\left[ C_{\Theta\Theta} \right]_{sl} = \mathbb{E}_{\Theta\Theta}[\varphi_s \varphi_l] \approx \frac{1}{M} \sum_{j=1}^{M} \varphi_s(\theta_j)\, \varphi_l(\theta_j),$ (13)

where the expectations $\mathbb{E}$ are taken with respect to the sampling densities of the training dataset $\{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$ and $\{\theta_j\}_{j=1,\ldots,M}$. The equation for the expansion coefficients in Equation (11) is based on the theory of kernel embedding of the conditional distribution [16,17,18]. See [18] for the detailed proof of Equations (11)–(13). Note that for the RKWHS representation, the weight functions q and $\tilde{q}$ can be different from the sampling densities of the training dataset $\{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$ and $\{\theta_j\}_{j=1,\ldots,M}$, respectively. This generalizes the representation in [18], which sets the weights q and $\tilde{q}$ to be the sampling densities of the training dataset $y_{i,j}$ and $\theta_j$, respectively. If the assumption of $p(y|\theta) \in H_{q^{-1}}(\mathcal{N})$ is not satisfied, then $C_{\Theta\Theta}$ can be singular. In such a case, one can follow the suggestion in [16,17] to regularize the linear regression in (11) by replacing $C_{\Theta\Theta}^{-1}$ with $(C_{\Theta\Theta} + \lambda I_{K_2})^{-1}$, where $\lambda \in \mathbb{R}$ is an empirically-chosen parameter and $I_{K_2}$ denotes an identity matrix of size $K_2 \times K_2$.

Incidentally, it is worth mentioning that the conditional density in (10) and (11) is represented as a regression in infinite-dimensional spaces with basis functions $\psi_k(y)$ and $\varphi_l(\theta)$. The expression (10) is a nonparametric representation in the sense that we do not assume any particular distribution for the density function $p(y|\theta)$. In this representation, only the training dataset $\{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$ and $\{\theta_j\}_{j=1,\ldots,M}$ with appropriate basis functions is used to specify the coefficients $\hat{c}_{Y|\theta,k}$ and the densities $\hat{p}(y|\theta)$. In Section 4, we will demonstrate how to construct the appropriate basis completely from the training data, motivated by the theoretical result in Section 3 below.

2.2. Simplification of the Expansion Coefficients (11)

If the weight function $\tilde{q}(\theta)$ is the sampling density of the training parameters $\{\theta_j\}_{j=1,\ldots,M}$, the matrix $C_{\Theta\Theta}$ in (13) can be simplified to a $K_2 \times K_2$ identity matrix,

$\left[ C_{\Theta\Theta} \right]_{sl} = \mathbb{E}_{\Theta\Theta}[\varphi_s \varphi_l] = \int_{\mathcal{M}} \varphi_s(\theta)\, \varphi_l(\theta)\, \tilde{q}(\theta)\, d\theta = \delta_{sl},$ (14)

where $\delta_{sl}$ is the Kronecker delta function. Here, the second equality follows from the weight $\tilde{q}(\theta)$ being the sampling density, and the third equality follows from the orthonormality of $\varphi_l(\theta) \in L^2(\mathcal{M}, \tilde{q})$ with respect to the weight function $\tilde{q}$. Then, the expansion coefficients $\hat{c}_{Y|\theta,k}$ in (11) can be simplified to,

$\hat{c}_{Y|\theta,k} = \sum_{l=1}^{K_2} \left[ C_{Y\Theta} \right]_{kl} \varphi_l(\theta),$ (15)

with the $K_1 \times K_2$ matrix $C_{Y\Theta}$ still given by (12). In this work, we always take the weight function $\tilde{q}(\theta)$ to be the sampling density of the training parameters $\{\theta_j\}_{j=1,\ldots,M}$ for the simplification of the expansion coefficients $\hat{c}_{Y|\theta,k}$ in (15). This assumption is not too restrictive since the training parameters are specified by the users.

Finally, the formula in (10) combined with the expansion coefficients $\hat{c}_{Y|\theta,k}$ in (15) and the matrix $C_{Y\Theta}$ in (12) forms an RKWHS representation of the conditional density $p(y|\theta)$ for arbitrary y and θ. Numerically, the training outcome is the matrix $C_{Y\Theta}$ in (12), and then, the conditional density $\hat{p}(y|\theta)$ can be represented by (10) with coefficients (15) using the basis functions $\{\psi_k(y)\}_{k=1}^{K_1}$ and $\{\varphi_l(\theta)\}_{l=1}^{K_2}$. From the above, one can see that two important questions naturally arise as a consequence of the usage of the RKWHS representation: first, whether the representation $\hat{p}(y|\theta)$ in (10) is valid in estimating the conditional density $p(y|\theta)$; second, how to construct the orthonormal basis functions $\psi_k(y) \in L^2(\mathcal{N}, q)$ and $\varphi_l(\theta) \in L^2(\mathcal{M}, \tilde{q})$. We will address these two important questions in the next two sections.
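To make the estimator concrete, the following is a minimal sketch of the training step (12) and the evaluation of (10) with coefficients (15), assuming the basis functions have already been evaluated on the training set; the array layout and function names are illustrative choices of this sketch, not notation from the paper.

```python
import numpy as np

def train_C_YTheta(psi_train, phi_train):
    """Monte Carlo estimate of the K1 x K2 matrix C_{Y Theta} in (12).
    psi_train: (M, N, K1) array of psi_k evaluated at the training data y_{i,j};
    phi_train: (M, K2) array of phi_l evaluated at the training parameters theta_j."""
    M, N, K1 = psi_train.shape
    psi_bar = psi_train.mean(axis=1)       # inner sum over i, divided by N
    return psi_bar.T @ phi_train / M       # (K1, K2), Eq. (12)

def conditional_density(psi_y, q_y, phi_theta, C_YTheta):
    """Evaluate the RKWHS representation (10) with coefficients (15) at one point.
    psi_y: (K1,) values psi_k(y); q_y: scalar weight q(y); phi_theta: (K2,) values."""
    c_hat = C_YTheta @ phi_theta           # expansion coefficients (15)
    return float(c_hat @ psi_y) * q_y      # hat{p}(y | theta)
```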

3. Error Estimation

In this section, we focus on the error estimation of the expansion coefficient $\hat{c}_{Y|\theta_j,k}$ and, later, of the conditional density $\hat{p}(y|\theta_j)$ at the training parameter $\theta_j$. The notation $\hat{c}_{Y|\theta_j,k}$ is defined as the expansion coefficient $\hat{c}_{Y|\theta,k}$ in (15), evaluated at the training parameter $\theta_j$. Let the total number of basis functions in parameter space, $K_2$, be equal to the total number of training parameters, M, that is, $K_2 = M$. Denoting $\Phi = [\vec{\varphi}_1, \ldots, \vec{\varphi}_M] \in \mathbb{R}^{M \times M}$, where the jth component of $\vec{\varphi}_l$ approximates the basis function evaluated at the training parameter, $\varphi_l(\theta_j)$, we can write the last equality in (14) in a compact form as $M^{-1} \Phi^\top \Phi = I_M$. This also means that $M^{-1} \Phi \Phi^\top = I_M$, the components of which are,

$\frac{1}{M} \sum_{l=1}^{M} \varphi_l(\theta_s)\, \varphi_l(\theta_j) = \delta_{sj}.$ (16)

For the training parameter $\theta_j$, we can simplify the expansion coefficient $\hat{c}_{Y|\theta_j,k}$ by substituting Equation (12) into Equation (15),

$\hat{c}_{Y|\theta_j,k} = \sum_{l=1}^{M} \left[ C_{Y\Theta} \right]_{kl} \varphi_l(\theta_j) \approx \sum_{l=1}^{M} \frac{1}{MN} \sum_{s=1}^{M} \sum_{i=1}^{N} \psi_k(y_{i,s})\, \varphi_l(\theta_s)\, \varphi_l(\theta_j) = \frac{1}{N} \sum_{i=1}^{N} \psi_k(y_{i,j}),$ (17)

where the last equality follows from (16).

3.1. Error Estimation Using Arbitrary Bases

We first study the error estimation for the expansion coefficient $\hat{c}_{Y|\theta_j,k}$. For each training parameter $\theta_j$, the conditional density function $p(y|\theta_j) \in H_{q^{-1}}(\mathcal{N})$ can be analytically represented in the form,

$p(y|\theta_j) = \sum_{k=1}^{\infty} c_{Y|\theta_j,k}\, \psi_k(y)\, q(y),$ (18)

due to the completeness of $L^2(\mathcal{N}, q)$. Here, the analytic expansion coefficient $c_{Y|\theta_j,k}$ is given by,

$c_{Y|\theta_j,k} = \langle p(\cdot|\theta_j), \psi_k \rangle.$ (19)

Note that the estimator $\hat{c}_{Y|\theta_j,k}$ in (17) is a Monte Carlo approximation of the expansion coefficient $c_{Y|\theta_j,k}$ in (19), i.e.,

$c_{Y|\theta_j,k} = \langle p(\cdot|\theta_j), \psi_k \rangle = \mathbb{E}_{Y|\theta_j}[\psi_k(Y)] \approx \frac{1}{N} \sum_{i=1}^{N} \psi_k(y_{i,j}),$ (20)

where the last approximation follows from the fact that the training dataset $\{y_{i,j}\}_{i=1,\ldots,N}$ consists of samples of the conditional density $p(y|\theta_j)$. Note also that in the following theorems and propositions, the condition $p(y|\theta_j) \in H_{q^{-1}}(\mathcal{N})$ is required. In Section 5.2 and Appendix B, we will provide an example to discuss this condition in detail. Next, we provide the unbiasedness and consistency of the estimator $\hat{c}_{Y|\theta_j,k}$.

Proposition 1.

Let $\{y_{i,j}\}_{i=1,\ldots,N}$ be i.i.d. samples of $Y|\theta_j$ with density $p(y|\theta_j)$. Let $p(y|\theta_j) \in H_{q^{-1}}(\mathcal{N})$ and let $\psi_k(y)$ form a complete orthonormal basis of $L^2(\mathcal{N}, q)$. Assume that $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ is finite; then $\hat{c}_{Y|\theta_j,k}$ defined in (17) is an unbiased and consistent estimator for $c_{Y|\theta_j,k}$ in (19).

Proof. 

The estimator $\hat{c}_{Y|\theta_j,k}$ is unbiased,

$\mathbb{E}\big[\hat{c}_{Y|\theta_j,k}\big] = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{Y|\theta_j}[\psi_k(Y_{i,j})] = c_{Y|\theta_j,k},$ (21)

where the expectation is taken with respect to the conditional density $p(y|\theta_j)$. If the variance, $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$, is finite, then the variance of $\hat{c}_{Y|\theta_j,k}$ converges to zero as the number of training data $N \to \infty$,

$\mathrm{Var}\big[\hat{c}_{Y|\theta_j,k}\big] = \frac{1}{N} \mathrm{Var}_{Y|\theta_j}[\psi_k(Y)] \to 0, \quad \text{as } N \to \infty.$ (22)

Then, we can obtain that the estimator $\hat{c}_{Y|\theta_j,k}$ is consistent,

$\Pr\Big(\big|\hat{c}_{Y|\theta_j,k} - c_{Y|\theta_j,k}\big| > \varepsilon\Big) \le \frac{\mathrm{Var}\big[\hat{c}_{Y|\theta_j,k}\big]}{\varepsilon^2} \to 0, \quad \text{as } N \to \infty, \quad \text{for } \varepsilon > 0,$

where Chebyshev’s inequality has been used. □

If the estimator of $p(y|\theta_j)$ is given by the representation with an infinite number of basis functions, $\tilde{p}(y|\theta_j) = \sum_{k=1}^{\infty} \hat{c}_{Y|\theta_j,k}\, \psi_k(y)\, q(y)$, then the estimator $\tilde{p}(y|\theta_j)$ is pointwise unbiased for every observation y. However, in the numerical implementation, only a finite number of basis functions can be used in the representation (10). Numerically, the estimator of $p(y|\theta_j)$ is given by the representation (10) at the training parameter $\theta_j$,

$\hat{p}(y|\theta_j) = \sum_{k=1}^{K_1} \hat{c}_{Y|\theta_j,k}\, \psi_k(y)\, q(y).$

Then, the pointwise error of the estimator, $\hat{e}(y|\theta_j)$, can be defined as:

$\hat{e}(y|\theta_j) \equiv p(y|\theta_j) - \hat{p}(y|\theta_j) = \sum_{k=K_1+1}^{\infty} c_{Y|\theta_j,k}\, \psi_k(y)\, q(y) + \sum_{k=1}^{K_1} \big( c_{Y|\theta_j,k} - \hat{c}_{Y|\theta_j,k} \big)\, \psi_k(y)\, q(y).$ (23)

It can be seen that the estimator $\hat{p}(y|\theta_j)$ is no longer unbiased or consistent due to the first error term in (23), induced by the modes $k > K_1$. Next, we estimate the expectation and the variance of an $L^2$-norm error of $\hat{p}(y|\theta_j)$ for all training parameters $\theta_j$.

Theorem 1.

Let the condition in Proposition 1 be satisfied for all $\{\theta_j\}_{j=1,\ldots,M}$, and let $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ be finite for all $k \in \mathbb{N}^+$. Define the $L^2$-norm error,

$\|\hat{e}\|_{L^2} = \left( \sum_{j=1}^{M} \int_{\mathcal{N}} \hat{e}(y|\theta_j)^2\, q^{-1}(y)\, dV \right)^{1/2},$ (24)

where $\hat{e}(y|\theta_j)$ is the pointwise error in (23), and dV is the volume form inherited by the manifold $\mathcal{N}$ from the ambient space $\mathbb{R}^n$ [18,20]. Then,

$\mathbb{E}\big[\|\hat{e}\|_{L^2}\big] \le \left( \sum_{j=1}^{M} \sum_{k=K_1+1}^{\infty} c_{Y|\theta_j,k}^2 + \frac{1}{N} \sum_{j=1}^{M} \sum_{k=1}^{K_1} \mathrm{Var}_{Y|\theta_j}[\psi_k(Y)] \right)^{1/2},$ (25)
$\mathrm{Var}\big[\|\hat{e}\|_{L^2}\big] \le \sum_{j=1}^{M} \sum_{k=K_1+1}^{\infty} c_{Y|\theta_j,k}^2 + \frac{1}{N} \sum_{j=1}^{M} \sum_{k=1}^{K_1} \mathrm{Var}_{Y|\theta_j}[\psi_k(Y)],$ (26)

where $\mathbb{E}$ and $\mathrm{Var}$ are defined with respect to the joint distribution of $p(y|\theta_j)$ for all $\{\theta_j\}_{j=1,\ldots,M}$. Moreover, $\mathbb{E}\big[\|\hat{e}\|_{L^2}\big]$ and $\mathrm{Var}\big[\|\hat{e}\|_{L^2}\big]$ converge to zero as $K_1 \to \infty$ and then $N \to \infty$, where the limiting operations of $K_1$ and $N$ are not commutative.

Proof. 

The expectation of $\|\hat{e}\|_{L^2}$ can be estimated as,

$\big( \mathbb{E}\big[\|\hat{e}\|_{L^2}\big] \big)^2 \le \mathbb{E}\left[ \sum_{j=1}^{M} \int_{\mathcal{N}} \hat{e}(y|\theta_j)^2\, q^{-1}(y)\, dV \right] = \mathbb{E}\left[ \sum_{j=1}^{M} \int_{\mathcal{N}} \left( \sum_{k=K_1+1}^{\infty} c_{Y|\theta_j,k}\, \psi_k(y) + \sum_{k=1}^{K_1} \big( c_{Y|\theta_j,k} - \hat{c}_{Y|\theta_j,k} \big)\, \psi_k(y) \right)^2 q(y)\, dV \right],$ (27)

where the first inequality follows from Jensen’s inequality. Here, the randomness comes from the estimators $\hat{c}_{Y|\theta_j,k}$. Due to the orthonormality of the basis functions, $\psi_k \in L^2(\mathcal{N}, q)$, the error estimation in (27) can be simplified as,

$\big( \mathbb{E}\big[\|\hat{e}\|_{L^2}\big] \big)^2 \le \sum_{j=1}^{M} \sum_{k=K_1+1}^{\infty} c_{Y|\theta_j,k}^2 + \sum_{j=1}^{M} \sum_{k=1}^{K_1} \mathbb{E}_{Y|\theta_j}\Big[ \big( c_{Y|\theta_j,k} - \hat{c}_{Y|\theta_j,k} \big)^2 \Big] = \sum_{j=1}^{M} \sum_{k=K_1+1}^{\infty} c_{Y|\theta_j,k}^2 + \frac{1}{N} \sum_{j=1}^{M} \sum_{k=1}^{K_1} \mathrm{Var}_{Y|\theta_j}[\psi_k(Y)],$ (28)

where the inequality follows from the linearity of expectation, and the equality follows from $\mathbb{E}\big[\hat{c}_{Y|\theta_j,k}\big] = c_{Y|\theta_j,k}$ in (21) and $\mathrm{Var}\big[\hat{c}_{Y|\theta_j,k}\big] = \frac{1}{N} \mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ in (22). In the error estimation (28), the first term is deterministic, and the second term is random. We have so far proven that the expectation $\mathbb{E}\big[\|\hat{e}\|_{L^2}\big]$ is bounded by (25). Similarly, we can prove that the variance $\mathrm{Var}\big[\|\hat{e}\|_{L^2}\big]$ is bounded by (26).

Next, we prove that the expectation $\mathbb{E}\big[\|\hat{e}\|_{L^2}\big]$ converges to zero as $K_1 \to \infty$ and then $N \to \infty$. Parseval’s theorem states that:

$\sum_{k=1}^{\infty} c_{Y|\theta_j,k}^2 = \int_{\mathcal{N}} p(y|\theta_j)^2\, q^{-1}(y)\, dV < +\infty, \quad \text{for all } \theta_j,$ (29)

where the inequality follows from $p(y|\theta_j) \in H_{q^{-1}}(\mathcal{N}) \subset L^2(\mathcal{N}, q^{-1})$ for all $\theta_j$. For $\varepsilon > 0$, there exists an integer $\tilde{K}_1(\theta_j)$ for $\theta_j$ such that:

$\sum_{k=\tilde{K}_1(\theta_j)}^{\infty} c_{Y|\theta_j,k}^2 < \frac{\varepsilon}{2M}.$ (30)

Let:

$K_1 = \max\big\{ \tilde{K}_1(\theta_1), \ldots, \tilde{K}_1(\theta_M) \big\},$ (31)

then the first term in (28) can be bounded by $\varepsilon/2$,

$\sum_{j=1}^{M} \sum_{k=K_1+1}^{\infty} c_{Y|\theta_j,k}^2 < \frac{\varepsilon}{2}.$ (32)

Since the variance $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ is assumed to be finite for all k and j, there exists a constant $D > 0$ such that $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ can be bounded above by this constant D,

$\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)] \le D, \quad \text{for all } k = 1, \ldots, K_1 \text{ and } j = 1, \ldots, M.$ (33)

Then, for $\varepsilon > 0$, there exists a sufficiently large number of training data,

$N_{\min} = \frac{2 M K_1 D}{\varepsilon},$ (34)

such that whenever $N > N_{\min}$, then:

$\frac{1}{N} \sum_{j=1}^{M} \sum_{k=1}^{K_1} \mathrm{Var}_{Y|\theta_j}[\psi_k(Y)] < \frac{\varepsilon}{2}.$ (35)

Since $\varepsilon > 0$ is arbitrary, by substituting Equation (32) and Equation (35) into the error estimation (28), we obtain that $\mathbb{E}\big[\|\hat{e}\|_{L^2}\big]$ converges to zero as $K_1 \to \infty$ and then $N \to \infty$. Note that we first take $K_1 \to \infty$ to ensure that the first error term in (28) vanishes and then take $N \to \infty$ to ensure that the second error term in (28) vanishes. Thus, the limiting operations of $K_1 \to \infty$ and $N \to \infty$ are not commutative. Similarly, we can prove that the variance $\mathrm{Var}\big[\|\hat{e}\|_{L^2}\big]$ converges to zero as $K_1 \to \infty$ and then $N \to \infty$. □

Theorem 1 provides the intuition for specifying the number of training observation data N to achieve any desired accuracy $\varepsilon > 0$ given fixed M parameters and sufficiently large $K_1$. It can be seen from Theorem 1 that, numerically, the expectation $\mathbb{E}\big[\|\hat{e}\|_{L^2}\big]$ in (25) and the variance $\mathrm{Var}\big[\|\hat{e}\|_{L^2}\big]$ in (26) can be bounded within an arbitrarily small ε by choosing sufficiently large $K_1$ and N. Specifically, there are two error terms in Equations (25) and (26), the first being deterministic, induced by the modes $k > K_1$, and the second random, induced by the modes $k \le K_1$. For the deterministic term ($k > K_1$), the error can be bounded by $\varepsilon/2$ by choosing a sufficiently large $K_1$ satisfying (31). In our implementation, the number of basis functions $K_1$ is empirically chosen to be large enough in order to make the first error term in Equations (25) and (26) for $k > K_1$ as small as possible.

For the random term ($k \le K_1$), the error can be bounded by $\varepsilon/2$ by choosing a sufficiently large N satisfying $N > N_{\min} = 2 M K_1 D / \varepsilon$ (Equation (34)). The minimum number of training data, $N_{\min}$, depends on the upper bound D of $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$. However, the upper bound D may not exist for some problems. This means that for some problems, the assumption of finite $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ in Theorem 1 may not be satisfied. Even if the upper bound D exists, it is typically not easy to evaluate its value given an arbitrary basis $\psi_k \in L^2(\mathcal{N}, q)$, since one needs to evaluate $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ for all $k = 1, \ldots, K_1$ and $j = 1, \ldots, M$. Note that Theorem 1 holds true for representing $\hat{p}(y|\theta_j)$ with an arbitrary basis $\psi_k \in L^2(\mathcal{N}, q)$ as long as $p(y|\theta_j) \in H_{q^{-1}}(\mathcal{N})$ for all $\theta_j$ and $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ is finite for $k \le K_1$ and $j = 1, \ldots, M$. Next, we provide several cases in which $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ is finite for all k and j.

Remark 1.

If the weighted Hilbert space $L^2(\mathcal{N}, q)$ is defined on a compact manifold $\mathcal{N}$ and has smooth basis functions $\psi_k$, then $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ is finite for a fixed $k \in \mathbb{N}^+$ and $j = 1, \ldots, M$. This assertion follows from the fact that continuous functions on a compact manifold are bounded. The smoothness assumption is not unreasonable in many applications since the orthonormal basis functions are obtained as solutions of an eigenvalue problem of a self-adjoint second-order elliptic differential operator. Note that the bound here is not necessarily a uniform bound of $\psi_k(Y)$ for all $k \in \mathbb{N}^+$ and $j = 1, \ldots, M$. As long as $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ is finite for $k \le K_1$ and $j = 1, \ldots, M$, the upper bound D is finite, and then, Theorem 1 holds.

Remark 2.

If the manifold $\mathcal{N}$ is a hyperrectangle in $\mathbb{R}^n$ and the weight q is a uniform distribution on $\mathcal{N}$, then $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$ is finite for a fixed $k \in \mathbb{N}^+$ and $j = 1, \ldots, M$. This assertion is an immediate consequence of Remark 1.

In Theorem 1, $N_{\min}$ depends on the upper bound D of $\mathrm{Var}_{Y|\theta_j}[\psi_k(Y)]$, as shown in (34). In the following, we will specify a Hilbert space, referred to as a data-driven Hilbert space, so that $N_{\min}$ is independent of D and depends only on M, $K_1$, and ε. As a consequence, we can easily determine how many training data N are needed to bound the second error term in Equations (25) and (26).

3.2. Error Estimation Using a Data-Driven Hilbert Space

We now turn to the discussion of a specific data-driven Hilbert space $L^2(\mathcal{N}, \bar{q})$ with orthonormal basis functions $\bar{\psi}_k$. Our goal is to specify the weight function $\bar{q}$ such that the minimum number of training data, $\bar{N}_{\min}$, only depends on M, $K_1$, and ε. Here, the overline $\bar{\cdot}$ corresponds to the specific data-driven Hilbert space. The second error term in (28) can be further estimated as,

$\frac{1}{N} \sum_{j=1}^{M} \sum_{k=1}^{K_1} \mathrm{Var}_{Y|\theta_j}[\bar{\psi}_k(Y)] \le \frac{1}{N} \sum_{k=1}^{K_1} \sum_{j=1}^{M} \mathbb{E}_{Y|\theta_j}\big[\bar{\psi}_k^2(Y)\big] = \frac{M}{N} \sum_{k=1}^{K_1} \int_{\mathcal{N}} \bar{\psi}_k^2(y) \left( \frac{1}{M} \sum_{j=1}^{M} p(y|\theta_j) \right) dV,$ (36)

where the basis functions are substituted with the specific $\bar{\psi}_k$. Notice that the $\bar{\psi}_k(y)$ are orthonormal basis functions with respect to the weight $\bar{q}$ in $L^2(\mathcal{N}, \bar{q})$. One specific choice of the weight function $\bar{q}(y)$ is:

$\bar{q}(y) = \frac{1}{M} \sum_{j=1}^{M} p(y|\theta_j),$ (37)

where $\bar{q}(y)$ has been normalized, i.e.,

$\int_{\mathcal{N}} \bar{q}(y)\, dV = \int_{\mathcal{N}} \frac{1}{M} \sum_{j=1}^{M} p(y|\theta_j)\, dV = 1.$ (38)

For the data-driven Hilbert space, we always use a normalized weight function $\bar{q}(y)$. Note that the weight function $\bar{q}(y)$ in (37) is a discretization of the marginal density function of Y with Θ marginalized out,

$\bar{q}(y) = \frac{1}{M} \sum_{j=1}^{M} p(y|\theta_j) \approx \int_{\mathcal{M}} p(y|\theta)\, \tilde{q}(\theta)\, d\theta = \int_{\mathcal{M}} p(y, \theta)\, d\theta,$ (39)

where $p(y, \theta)$ denotes the joint density of $(Y, \Theta)$. Essentially, the weight function $\bar{q}(y)$ in (37) is the sampling density of all the training data $\{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$, which motivates us to refer to $L^2(\mathcal{N}, \bar{q})$ as a data-driven Hilbert space.

Next, we prove that by specifying the data-driven basis functions $\bar{\psi}_k \in L^2(\mathcal{N}, \bar{q})$, the variance $\mathrm{Var}_{Y|\theta_j}[\bar{\psi}_k(Y)]$ is finite for all $k \in \mathbb{N}^+$ and $j = 1, \ldots, M$. Subsequently, we can obtain a minimum number of training data, $\bar{N}_{\min}$, that only depends on M, $K_1$, and ε, such that the expectation $\mathbb{E}\big[\|\hat{e}\|_{L^2}\big]$ in (25) and the variance $\mathrm{Var}\big[\|\hat{e}\|_{L^2}\big]$ in (26) are bounded above by any $\varepsilon > 0$.

Proposition 2.

Let $\{y_{i,j}\}_{i=1,\ldots,N}$ be i.i.d. samples of $Y|\theta_j$ with density $p(y|\theta_j)$. Let $p(y|\theta_j) \in H_{\bar{q}^{-1}}(\mathcal{N})$ for all $\{\theta_j\}_{j=1,\ldots,M}$ with weight $\bar{q}$ specified in (37), and let $\bar{\psi}_k$ be the complete orthonormal basis of $L^2(\mathcal{N}, \bar{q})$. Then, $\mathrm{Var}_{Y|\theta_j}[\bar{\psi}_k(Y)]$ is finite for all $k \in \mathbb{N}^+$ and $j = 1, \ldots, M$.

Proof. 

Notice that for all $k \in \mathbb{N}^+$, we have:

$\frac{1}{M} \sum_{j=1}^{M} \mathrm{Var}_{Y|\theta_j}[\bar{\psi}_k(Y)] \le \frac{1}{M} \sum_{j=1}^{M} \mathbb{E}_{Y|\theta_j}\big[\bar{\psi}_k^2(Y)\big] = \int_{\mathcal{N}} \bar{\psi}_k^2(y) \left( \frac{1}{M} \sum_{j=1}^{M} p(y|\theta_j) \right) dV = \int_{\mathcal{N}} \bar{\psi}_k^2(y)\, \bar{q}(y)\, dV = 1,$ (40)

where the last equality follows directly from the orthonormality of the basis functions $\bar{\psi}_k(y) \in L^2(\mathcal{N}, \bar{q})$. From Equation (40), we can conclude that for all $k \in \mathbb{N}^+$ and $j = 1, \ldots, M$, the variance $\mathrm{Var}_{Y|\theta_j}[\bar{\psi}_k(Y)]$ is finite. □

Theorem 2.

Given the same hypothesis as in Proposition 2, then:

$\mathbb{E}\big[\|\hat{e}\|_{L^2}\big] \le \left( \sum_{j=1}^{M} \sum_{k=K_1+1}^{\infty} c_{Y|\theta_j,k}^2 + \frac{M K_1}{N} \right)^{1/2},$ (41)
$\mathrm{Var}\big[\|\hat{e}\|_{L^2}\big] \le \sum_{j=1}^{M} \sum_{k=K_1+1}^{\infty} c_{Y|\theta_j,k}^2 + \frac{M K_1}{N},$ (42)

where $\|\hat{e}\|_{L^2}$ is defined by (24) and $c_{Y|\theta_j,k}$ is given by (19). Moreover, $\mathbb{E}\big[\|\hat{e}\|_{L^2}\big]$ and $\mathrm{Var}\big[\|\hat{e}\|_{L^2}\big]$ converge to zero as $K_1 \to \infty$ and then $N \to \infty$, where the limiting operations of $K_1$ and $N$ are not commutative.

Proof. 

According to Proposition 2, the variance $\mathrm{Var}_{Y|\theta_j}[\bar{\psi}_k(Y)]$ is finite for all $k \in \mathbb{N}^+$ and $j = 1, \ldots, M$. According to Proposition 1, since $\mathrm{Var}_{Y|\theta_j}[\bar{\psi}_k(Y)]$ is finite, the estimator $\hat{c}_{Y|\theta_j,k}$ is both unbiased and consistent for $c_{Y|\theta_j,k}$. All conditions in Theorem 1 are satisfied, so we can obtain the error estimation of the expectation $\mathbb{E}\big[\|\hat{e}\|_{L^2}\big]$ in (25) and the error estimation of the variance $\mathrm{Var}\big[\|\hat{e}\|_{L^2}\big]$ in (26). Moreover, the second error term in (25) and (26) can be bounded using Equation (40), so that we obtain our error estimations (41) and (42).

Choose $K_1$ as in (31) such that the first term in (41) and (42) is bounded by $\varepsilon/2$. The second term $M K_1 / N$ in (41) and (42) can be bounded by an arbitrarily small $\varepsilon/2$ if the number of training data N satisfies:

$N > \bar{N}_{\min} \equiv 2 M K_1 / \varepsilon.$ (43)

Then, both the expectation $\mathbb{E}\big[\|\hat{e}\|_{L^2}\big]$ and the variance $\mathrm{Var}\big[\|\hat{e}\|_{L^2}\big]$ can be bounded by ε. Since $\varepsilon > 0$ is arbitrary, the proof is complete. □

Recall that when applying arbitrary basis functions to represent $\hat{p}(y|\theta)$ in (10), it is typically not easy to evaluate the upper bound D in (33), which implies that it is not easy to determine how many observation data, $N_{\min}$ (Equation (34)), should be used for training. However, by applying the data-driven basis functions $\bar{\psi}_k$ to represent $\hat{p}(y|\theta)$ in (10), the minimum number of training data, $\bar{N}_{\min}$ (Equation (43)), becomes independent of D and depends only on M, $K_1$, and ε, as can be seen from Theorem 2. To make the error induced by the modes $k \le K_1$ smaller than a desired $\varepsilon/2$, we can easily determine how many observation data, $\bar{N}_{\min}$ (Equation (43)), should be used for training. In this sense, the specific data-driven Hilbert space $L^2(\mathcal{N}, \bar{q})$ with the corresponding basis functions $\bar{\psi}_k$ is a good choice for representing (10).
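As a rough numerical illustration of the bound (43) (an arithmetic reading of the estimate, not a prescription taken from the paper), consider M = 64 training parameters and $K_1 = 400$ basis functions, which matches the setting of Example I below:

$\bar{N}_{\min} = \frac{2 M K_1}{\varepsilon} = \frac{2 \cdot 64 \cdot 400}{\varepsilon} = \frac{51{,}200}{\varepsilon}, \qquad \text{e.g., } \varepsilon = 0.08 \;\Rightarrow\; \bar{N}_{\min} = 640{,}000,$

which is of the same order as the N = 640,000 training observations per parameter used in that example.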

We have so far theoretically verified the validity of the representation (10) in estimating the conditional density $p(y|\theta_j)$ (Theorem 1). In particular, using the data-driven basis $\bar{\psi}_k \in L^2(\mathcal{N}, \bar{q})$, we can easily control the error of the conditional density estimation by specifying the number of training data N (Theorem 2). To summarize, the training procedures can be outlined as follows:

  • (1-A)

    Generate the training dataset, including training parameters $\{\theta_j\}_{j=1,\ldots,M}$ and observations $\{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$. The length of training data N is empirically determined based on the criteria (34) or (43).

  • (1-B)

    Construct the basis functions for the parameter θ space and for the observation y space by using the training dataset. For the y space, we need to empirically choose the number of basis functions $K_1$ to make the error induced by the modes $k > K_1$ as small as possible. In particular, for the data-driven Hilbert space, we will provide a detailed discussion on how to estimate the data-driven basis functions of $L^2(\mathcal{N}, \bar{q})$ with the sampling density $\bar{q}$ from the training data in Section 4 below. Note that this basis estimation will introduce additional errors beyond the results in this section, which assumed the data-driven basis functions to be given.

  • (1-C)

    Train the matrix $C_{Y\Theta}$ in (12) and then estimate the conditional density $\hat{p}(y|\theta)$ by using the nonparametric RKWHS representation (10) with the expansion coefficients $\hat{c}_{Y|\theta,k}$ in (15).

  • (1-D)
    Finally, for new observations $y = \{y_1, \ldots, y_T\}$, define the likelihood function as a product of the conditional densities of the new observations y given any θ,
    $p(y|\theta) \equiv \prod_{t=1}^{T} \hat{p}(y_t|\theta).$ (44)
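The following is a minimal sketch of Procedure (1-D), assuming the basis values $\psi_k(y_t)$ and the weights $q(y_t)$ at the new observations are available (for the data-driven basis, they can be obtained with the Nyström extension of Section 4.2.2); the names and the small density floor are implementation choices of this sketch, not prescriptions from the paper.

```python
import numpy as np

def log_likelihood_surrogate(phi_theta, Psi_new, q_new, C_YTheta, floor=1e-300):
    """Log of the nonparametric likelihood (44): sum_t log hat{p}(y_t | theta).
    Psi_new: (T, K1) values psi_k(y_t) at the new observations;
    q_new: (T,) weights q(y_t); phi_theta: (K2,) values phi_l(theta);
    C_YTheta: (K1, K2) trained matrix from (12)."""
    c_hat = C_YTheta @ phi_theta              # expansion coefficients (15)
    p_t = (Psi_new @ c_hat) * q_new           # hat{p}(y_t | theta), Eq. (10)
    # guard against nonpositive density estimates before taking the logarithm
    return float(np.sum(np.log(np.maximum(p_t, floor))))
```

Since $C_{Y\Theta}$ and the basis values at the observations can be precomputed offline, each likelihood evaluation in the MCMC chain reduces to two small matrix–vector products.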

Next, we address the second important question for the RKWHS representation (Procedure (1-B)): how to construct basis functions for θ and y. In particular, we focus on how to construct the data-driven basis functions for y.

4. Basis Functions

This section is organized as follows. In Section 4.1, we discuss how to employ analytical basis functions for the parameter θ and for the observation y, as in the usual polynomial chaos expansion. In Section 4.2, we discuss how to construct the data-driven basis functions $\bar{\psi}_k \in L^2(\mathcal{N}, \bar{q})$, with $\mathcal{N}$ being the manifold of the training dataset $\{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$ and the weight $\bar{q}$ in (37) being the sampling density of $\{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$.

4.1. Analytic Basis Functions

If no prior information about the parameter space other than its domain is known, we can assume that the training parameters are uniformly distributed on the parameter θ space. In particular, we choose M well-sampled training parameters $\{\theta_j\}_{j=1,\ldots,M} = \{\theta_j^1, \ldots, \theta_j^m\}_{j=1,\ldots,M}$ in an m-dimensional box $\mathcal{M} \subset \mathbb{R}^m$,

$\mathcal{M} \equiv [\theta_{\min}^1, \theta_{\max}^1] \times \cdots \times [\theta_{\min}^m, \theta_{\max}^m],$ (45)

where × denotes a Cartesian product and the two parameters $\theta_{\min}^s$ and $\theta_{\max}^s$ are the minimum and maximum values of the uniform distribution for the sth coordinate of the θ space. Here, the well-sampled uniform distribution corresponds to a regular grid, which is a tessellation of the m-dimensional Euclidean space $\mathbb{R}^m$ by congruent parallelotopes. The two parameters $\theta_{\min}^s$ and $\theta_{\max}^s$ are determined by:

$\theta_{\min}^s = \min_{j=1,\ldots,M} \theta_j^s - \gamma \Big( \max_{j=1,\ldots,M} \theta_j^s - \min_{j=1,\ldots,M} \theta_j^s \Big), \qquad \theta_{\max}^s = \max_{j=1,\ldots,M} \theta_j^s + \gamma \Big( \max_{j=1,\ldots,M} \theta_j^s - \min_{j=1,\ldots,M} \theta_j^s \Big).$ (46)

For M regularly-spaced grid points $\theta_j^s$, we set $\gamma = 0.5 (M^s - 1)^{-1}$ in all of our numerical examples below, where $M^s$ is the number of training parameters in the sth coordinate. For example, see Figure 1 for the 2D well-sampled uniformly-distributed data $\{(5,5), (6,5), \ldots, (12,12)\}$ (blue circles). In this case, the two-dimensional box $\mathcal{M}$ is $[4.5, 12.5]^2$ (red square).

Figure 1.

(Color online) An example of well-sampled 2D uniformly-distributed data points (blue circles). The boundary of the uniform distribution is depicted with a red square. Furthermore, these well-sampled data points correspond to the training parameters in Example I in Section 5. In this example, the well-sampled uniformly-distributed training parameters are $(\sigma_{X_1}^2, \sigma_{X_2}^2) \in \{(i,j)\}_{i=5,\ldots,12}^{j=5,\ldots,12}$ (blue circles). The equal spacing distance in both coordinates is one. The two-dimensional box $\mathcal{M}$ is $[4.5, 12.5]^2$ (red square).

On this simple geometry, we will choose $\varphi_k$ to be the tensor product of the basis functions on each coordinate. Notice that we have taken the weight function $\tilde{q}$ to be the sampling density of the training parameters in order to simplify the expansion coefficient $\hat{c}_{Y|\theta,k}$ in (15). In this case, the weight $\tilde{q}$ is a uniform distribution on $\mathcal{M}$. Then, for the sth coordinate of the parameter, $\theta^s$, the weight function $\tilde{q}^s(\theta^s)$ is a uniform distribution on the interval $[\theta_{\min}^s, \theta_{\max}^s]$, and one can choose the following cosine basis functions,

$\Phi_{k^s}(\theta^s) = \begin{cases} 1, & \text{if } k^s = 0, \\ \sqrt{2} \cos\left( k^s \pi \dfrac{\theta^s - \theta_{\min}^s}{\theta_{\max}^s - \theta_{\min}^s} \right), & \text{else}, \end{cases}$ (47)

where the $\Phi_{k^s}(\theta^s)$ form a complete orthonormal basis of $L^2([\theta_{\min}^s, \theta_{\max}^s], \tilde{q}^s)$. This choice of basis functions corresponds exactly to the data-driven basis functions produced by the diffusion maps algorithm on a uniformly-distributed dataset on a compact interval, which will be discussed in Section 4.2. Although other choices such as the Legendre polynomials can be used, such an alternative would lead to a larger value of the constant D in (34) that controls the minimum number of training data for accurate estimation.

Subsequently, we set $L^2(\mathcal{M}, \tilde{q}) = \bigotimes_{s=1}^{m} L^2([\theta_{\min}^s, \theta_{\max}^s], \tilde{q}^s)$, where ⊗ denotes the Hilbert tensor product, and $\tilde{q}(\theta) = \prod_{s=1}^{m} \tilde{q}^s(\theta^s)$ is the uniform distribution on the m-dimensional box $\mathcal{M}$. Correspondingly, the basis functions $\varphi_k(\theta)$ are a tensor product of the $\Phi_{k^s}(\theta^s)$ for $s = 1, \ldots, m$,

$\varphi_k(\theta) = \prod_{s=1}^{m} \Phi_{k^s}(\theta^s) = \Phi_{k^1}(\theta^1) \cdots \Phi_{k^m}(\theta^m),$ (48)

where $k = (k^1, \ldots, k^m)$ and $\theta = (\theta^1, \ldots, \theta^m)$. Based on the property of the tensor product of Hilbert spaces, $\varphi_k(\theta)$ forms a complete orthonormal basis of $L^2(\mathcal{M}, \tilde{q})$.
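A minimal sketch of the cosine basis (47) and its tensor product (48) is given below; the function names and the way multi-indices are enumerated are choices of this sketch, assuming the interval boundaries from (46) are already available.

```python
import numpy as np
from itertools import product

def cosine_basis_1d(theta_s, k_s, theta_min, theta_max):
    """Orthonormal cosine basis (47) on [theta_min, theta_max] with uniform weight."""
    if k_s == 0:
        return 1.0
    return np.sqrt(2.0) * np.cos(k_s * np.pi * (theta_s - theta_min)
                                 / (theta_max - theta_min))

def tensor_cosine_basis(theta, multi_indices, bounds):
    """Tensor-product basis (48).  theta: length-m point; multi_indices: iterable of
    tuples (k^1, ..., k^m); bounds: list of (theta_min^s, theta_max^s) per coordinate."""
    values = []
    for k in multi_indices:
        v = 1.0
        for s, k_s in enumerate(k):
            v *= cosine_basis_1d(theta[s], k_s, *bounds[s])
        values.append(v)
    return np.array(values)

# e.g., 20 modes per coordinate in two dimensions gives 400 tensor basis functions
multi_indices = list(product(range(20), repeat=2))
```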

We now turn to the discussion of how to construct analytic basis functions for y. The approach is similar to the one for the parameter θ, except that the domain of the data is specified empirically and the weight function is chosen to correspond to some well-known analytical basis functions, independent of the sampling distribution of the data y. That is, we assume the geometry of the data has the following tensor structure, $\mathcal{N} = \mathcal{N}^1 \times \cdots \times \mathcal{N}^n$, where $\mathcal{N}^s$ will be specified empirically based on the ambient space coordinates of y. Let $y^s$ be the sth ambient component of y; we can choose a weighted Hilbert space $L^2(\mathcal{N}^s, q^s(y^s; \alpha^s))$ with the weight $q^s$ depending on the parameters $\alpha^s$ and normalized to satisfy $\int_{\mathbb{R}} q^s(y^s; \alpha^s)\, dy^s = 1$. For each coordinate, let $\Psi_{k^s}(y^s; \alpha^s)$ be the corresponding orthonormal basis functions, which possess analytic expressions. Subsequently, we can obtain a set of complete orthonormal basis functions $\psi_k \in L^2(\mathcal{N}, q)$ for y by taking the tensor product of these $\Psi_{k^s}$ as in (48).

For example, if the weight $q^s$ is uniform, $\mathcal{N}^s \subset \mathbb{R}$ is simply a one-dimensional interval. In this case, we can choose the cosine basis functions $\Psi_{k^s}$ for y as in (47), such that the parameters $\alpha^s$ correspond to the boundaries of the domain $\mathcal{N}^s$, which can be estimated as in (46). In our numerical experiments below, we will set γ = 0.1. Another choice is to set the weight $q^s(y^s; \alpha^s)$ to be Gaussian. In this case, the domain is assumed to be the real line, $\mathcal{N}^s = \mathbb{R}$. For this choice, the corresponding orthonormal basis functions $\Psi_{k^s}$ are Hermite polynomials, and the parameters $\alpha^s$, corresponding to the mean and variance of the Gaussian distribution, can be empirically estimated from the training data.

In the remainder of this paper, we will always use the cosine basis functions for θ. The application of (10) using cosine basis functions for y is referred to as the cosine representation. The application of (10) using Hermite basis functions for y is referred to as the Hermite representation.

4.2. Data-Driven Basis Functions

In this section, we discuss how to construct a set of data-driven basis functions $\bar{\psi}_k \in L^2(\mathcal{N}, \bar{q})$, with $\mathcal{N}$ being the manifold of the training dataset $\{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$ and the weight $\bar{q}$ in (37) being the sampling density of $y_{i,j}$ for all $i = 1, \ldots, N$ and $j = 1, \ldots, M$. The issues here are that the analytical expression of the sampling density $\bar{q}$ is unknown and that the Riemannian metric inherited by the data manifold $\mathcal{N}$ from the ambient space $\mathbb{R}^n$ is also unknown. Fortunately, these issues can be overcome by the diffusion maps algorithm [18,19,20].

4.2.1. Learning the Data-Driven Basis Functions

Given a dataset $y_{i,j} \in \mathcal{N} \subseteq \mathbb{R}^n$ with the sampling density $\bar{q}(y)$ in (37), defined with respect to the volume form inherited by the manifold $\mathcal{N}$ from the ambient space $\mathbb{R}^n$, one can use the kernel-based diffusion maps algorithm to construct an $MN \times MN$ matrix $L$ that approximates a weighted Laplacian operator, $\mathcal{L} = \nabla \log \bar{q} \cdot \nabla + \Delta$, acting on functions that satisfy Neumann boundary conditions if the compact manifold $\mathcal{N}$ has a boundary. The eigenvectors $\vec{\bar{\psi}}_k$ of the matrix $L$ are discrete approximations of the eigenfunctions $\bar{\psi}_k(y)$ of the operator $\mathcal{L}$, which form an orthonormal basis of the weighted Hilbert space $L^2(\mathcal{N}, \bar{q})$. Connecting to the discussion of the RKWHS in Section 2, the eigenfunctions of the adjoint $\mathcal{L}^* = -\mathrm{div}(\cdot\, \nabla \log \bar{q}) + \Delta$, that is $\{\Psi_k := \bar{\psi}_k \bar{q}\}$, can be approximated using an integral operator as in (7) with the appropriate kernel constructed by the diffusion maps algorithm, up to a diagonal conjugation. Basically, $H_{\bar{q}^{-1}}(\mathcal{N})$ is the data-driven reproducing kernel Hilbert space defined with the feature map in (9), induced by the eigenfunctions of $\mathcal{L}^*$.

Each component of the eigenvector $\vec{\bar{\psi}}_k \in \mathbb{R}^{MN}$ is a discrete estimate of the eigenfunction $\bar{\psi}_k(y_{i,j})$, evaluated at the training data point $y_{i,j}$. The sampling density $\bar{q}$ defined in (37) is estimated using a kernel density estimation method [22]. In contrast to the analytic continuous basis functions in Section 4.1 above, the data-driven basis functions $\bar{\psi}_k \in L^2(\mathcal{N}, \bar{q})$ are represented nonparametrically by the discrete eigenvectors $\vec{\bar{\psi}}_k \in \mathbb{R}^{MN}$ obtained from the diffusion maps algorithm. The outcome of the training is a discrete estimate of the conditional density, $\hat{p}(y_{i,j}|\theta)$, which estimates the representation $\hat{p}(y|\theta)$ in (10) at each training data point $y_{i,j}$.

In our implementation, we use the Variable-Bandwidth Diffusion Maps (VBDM) algorithm introduced in [20], which extends the diffusion maps to non-compact manifolds without a boundary. See the supplementary material of [23] for the MATLAB code of this algorithm. We should point out that this discrete approximation induces errors in the basis function, which are estimated in detail in [24]. These errors are in addition to the error estimations in Section 3.
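For illustration, the following is a simplified, fixed-bandwidth diffusion maps sketch of the basis construction; the paper uses the variable-bandwidth algorithm (VBDM) of [20], so the bandwidth handling, the normalization details, and the names below are assumptions of this sketch rather than the actual algorithm. The α = 1/2 density normalization is chosen because it targets the weighted Laplacian $\mathcal{L} = \nabla \log \bar{q} \cdot \nabla + \Delta$, whose eigenfunctions are orthonormal in $L^2(\mathcal{N}, \bar{q})$.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def diffusion_maps_basis(Y, K1, epsilon):
    """Approximate the leading K1 eigenfunctions of L = grad(log qbar).grad + Laplacian
    on the point cloud Y of shape (B, n) (e.g., the box-averaged data)."""
    D2 = cdist(Y, Y, 'sqeuclidean')
    K = np.exp(-D2 / (4.0 * epsilon))              # Gaussian kernel
    q_eps = K.sum(axis=1)                          # kernel density estimate (up to scaling)
    K_half = K / np.sqrt(np.outer(q_eps, q_eps))   # alpha = 1/2 normalization
    d_half = K_half.sum(axis=1)
    S = K_half / np.sqrt(np.outer(d_half, d_half)) # symmetric conjugate of the Markov matrix
    eigval, U = eigh(S)                            # ascending eigenvalues
    idx = np.argsort(-eigval)[:K1]                 # keep the K1 leading modes
    Psi = U[:, idx] / np.sqrt(d_half)[:, None]     # undo conjugation: right eigenvectors
    B = Y.shape[0]
    Psi *= np.sqrt(B) / np.linalg.norm(Psi, axis=0)  # empirical L^2(qbar) normalization
    lam = (eigval[idx] - 1.0) / epsilon            # approximate eigenvalues of L
    return Psi, lam
```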

We note that if the data are uniformly distributed on a one-dimensional bounded interval, then the VBDM solutions are the cosine basis functions, which are the eigenfunctions of the Laplacian operator on a bounded interval with Neumann boundary conditions. This means that the cosine functions in (47) that are used to represent each component of θ are analogous to the data-driven basis functions. The difference is that with the parametric choice in (47), one avoids VBDM at the expense of specifying the boundaries of the domain, $[\theta_{\min}^s, \theta_{\max}^s]$. In the remainder of this paper, we refer to an application of (10) with cosine basis functions for θ and VBDM basis functions for y as the VBDM representation.

However, a direct application of the VBDM algorithm suffers from an expensive computational cost for large training datasets. Basically, we need an algorithm that allows us to subsample from the training dataset while preserving the sampling distribution of the full dataset. In Appendix A, we provide a simple box-averaging method to achieve this goal. In the remainder of this paper, we will denote the reduced data obtained via the box-averaging method in Appendix A by $\{\bar{y}_b\}_{b=1,\ldots,B}$, where $B \ll MN$. We refer to them as the box-averaged data points. When the number of training data is too large, we apply the VBDM algorithm on these box-averaged data to obtain the discrete estimate of the eigenfunctions $\bar{\psi}_k(\bar{y}_b)$.
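Appendix A is not reproduced here; as a rough illustration only, one plausible reading of the box-averaging idea is to bin the data on a regular grid in the ambient coordinates and replace the points in each nonempty box by their mean, as sketched below. The details (grid construction, handling of empty boxes) are assumptions of this sketch and may differ from the method in Appendix A.

```python
import numpy as np

def box_average(Y, bins_per_dim):
    """Reduce a dataset Y of shape (MN, n) to box-averaged points on a regular grid.
    bins_per_dim: list of n integers (B_1, ..., B_n)."""
    bins = np.asarray(bins_per_dim)
    lo, hi = Y.min(axis=0), Y.max(axis=0)
    # integer box index of each point along each coordinate
    idx = np.floor((Y - lo) / (hi - lo + 1e-12) * bins).astype(int)
    idx = np.minimum(idx, bins - 1)
    # group points by box and average within each nonempty box
    keys, inverse = np.unique(idx, axis=0, return_inverse=True)
    counts = np.bincount(inverse, minlength=len(keys)).astype(float)
    Ybar = np.empty((len(keys), Y.shape[1]))
    for d in range(Y.shape[1]):
        Ybar[:, d] = np.bincount(inverse, weights=Y[:, d], minlength=len(keys)) / counts
    return Ybar
```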

The second issue arises from the discrete representation of the conditional density in the observation y space using the VBDM algorithm. Notice that the VBDM representation, $\hat{p}(y_{i,j}|\theta)$, is only estimated at each training data point $y_{i,j}$. A natural problem is to extend the representation onto new observations $y_t \notin \{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$ that are not part of the training dataset (Procedure (1-D)). Next, we address this issue.

4.2.2. Nyström Extension

We now discuss an extension method to evaluate the basis functions $\bar{\psi}_k$ on a new data point that does not belong to the training dataset. Given such an extension method, we can proceed with Procedure (1-D) by evaluating $\bar{\psi}_k(y_t)$ on new observations $y_t \notin \{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$, which in turn gives $\hat{p}(y_t|\theta)$. Second, this extension is also needed in the training Procedure (1-C) when MN is large. More specifically, for training the matrix $C_{Y\Theta}$ in (12), we need to know the estimate of the eigenfunction $\bar{\psi}_k(y_{i,j})$ for all the original training data $y_{i,j}$. Computationally, however, we can only construct the discrete estimate of the eigenfunction $\bar{\psi}_k(\bar{y}_b)$ at the reduced box-averaged data points $\bar{y}_b$. This suggests that we need to extend the eigenfunctions $\bar{\psi}_k(\bar{y}_b)$ onto all the original training data $\{y_{i,j}\}_{i=1,\ldots,N}^{j=1,\ldots,M}$.

For the convenience of the discussion, the training data that are used to construct the eigenfunctions are denoted by $\{y_r^{old}\}_{r=1,\ldots,R}$, and all the data that are not part of $\{y_r^{old}\}_{r=1,\ldots,R}$ are denoted by $y^{new}$. To extend the eigenfunctions $\bar{\psi}_k(y_r^{old})$ onto the data point $y^{new} \notin \{y_r^{old}\}_{r=1,\ldots,R}$, one approach would be to use the Nyström extension [25], which is based on the basic theory of RKHS [26]. Let $H_{\bar{q}}(\mathcal{N})$ be the RKWHS with a symmetric positive kernel $\hat{T} : \mathcal{N} \times \mathcal{N} \to \mathbb{R}$ defined as,

$\hat{T}(y, y') = \sum_{k=1}^{\infty} \lambda_k\, \bar{\psi}_k(y)\, \bar{\psi}_k(y'),$

where $\lambda_k$ is the eigenvalue of $\mathcal{L}$ associated with the eigenfunction $\bar{\psi}_k$. Then, for any function $f \in H_{\bar{q}}(\mathcal{N})$, the Moore–Aronszajn theorem states that one can evaluate f at $a \in \mathcal{N}$ with the following inner product, $f(a) = \langle f, \hat{T}(a, \cdot) \rangle_{H_{\bar{q}}}$. In our application, this amounts to evaluating,

$\bar{\psi}_k(y^{new}) = \frac{1}{R} \sum_{r=1}^{R} T\big( y^{new}, y_r^{old} \big)\, \bar{\psi}_k\big( y_r^{old} \big),$ (49)

where the non-symmetric kernel function $T : \mathcal{N} \times \mathcal{N} \to \mathbb{R}$ (constructed by the diffusion maps algorithm) is related to the symmetric kernel $\hat{T}$ by,

$T(y_i, y_j) = \bar{q}^{-1/2}(y_i)\, \hat{T}(y_i, y_j)\, \bar{q}^{1/2}(y_j),$

with $\bar{q}(y_i)$ being the sampling density of $\{y_r^{old}\}_{r=1,\ldots,R}$ at $y_i$. See the detailed evaluation of the kernels $\hat{T}$ and $T$ for the Nyström extension in [27]. After obtaining the estimate of the eigenfunction $\bar{\psi}_k(y^{new})$ using the Nyström extension, we can train the matrix $C_{Y\Theta}$ in (12) for large MN and then obtain the representation of the conditional density at an arbitrary new observation $y_t$, $\hat{p}(y_t|\theta)$.
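Once the kernel values $T(y^{new}, y_r^{old})$ are available (their construction follows the diffusion maps algorithm and is detailed in [27]; here the kernel matrix is simply assumed to be given), the extension (49) is a single matrix–vector product, as in the sketch below.

```python
import numpy as np

def nystrom_extend(T_new_old, Psi_old):
    """Nystrom extension (49): evaluate the eigenfunctions at new points.
    T_new_old: (n_new, R) matrix of kernel values T(y_new, y_r^old), assumed given;
    Psi_old: (R, K1) eigenfunction values at the old (e.g., box-averaged) points."""
    R = Psi_old.shape[0]
    return T_new_old @ Psi_old / R             # (n_new, K1) extended values
```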

To summarize this section, we have constructed two different sets of basis functions for y: the analytic basis functions of $L^2(\mathcal{N}, q)$, such as the Hermite and cosine basis functions, which assume that the manifold is $\mathbb{R}^n$ or a hyperrectangle, respectively, and the data-driven basis functions of $L^2(\mathcal{N}, \bar{q})$, with $\mathcal{N}$ being the data manifold and $\bar{q}$ the sampling density, which are computed using the VBDM algorithm.

5. Parameter Estimation Using the Metropolis Scheme

First, we briefly review the Metropolis scheme for estimating the posterior density $p(\theta|y)$ given new observations $y = \{y_1, \ldots, y_T\}$ for a specific parameter θ. The key idea of the Metropolis scheme is to construct a Markov chain that converges to samples of the conditional density $p(\theta|y)$ as the target density. In our application, the parameter estimation procedures can be outlined as follows:

  • (2-A)

    Suppose we have $\theta_0$ with $p(\theta_0|y) > 0$; then for $i \ge 1$, we sample $\theta^* \sim \kappa(\theta_{i-1}, \theta^*)$. Here, κ is the proposal kernel density. For example, the random walk Metropolis algorithm generates proposals with $\kappa(\theta_{i-1}, \theta^*) = \mathcal{N}(\theta_{i-1}, C)$, where C, the proposal covariance, is a tunable nuisance parameter.

  • (2-B)

    Accept the proposal, $\theta_i = \theta^*$, with probability $\min\left( \frac{p(\theta^*|y)}{p(\theta_{i-1}|y)}, 1 \right)$; otherwise, set $\theta_i = \theta_{i-1}$. Repeat Procedures (2-A) and (2-B) above. Notice that the posterior $p(\theta|y)$ can be determined from the prior $p_0(\theta)$ and the likelihood $p(y|\theta)$ based on Bayes’ theorem (3). The likelihood function $p(y|\theta)$ is defined as a product of the conditional densities of the new observations $y = \{y_1, \ldots, y_T\}$ in (44) (Procedure (1-D)). The conditional densities of the new observations y given θ are obtained from the training Procedure (1-C).

  • (2-C)

    Generate a sufficiently long chain and use the chain’s statistic as an estimator of the true parameter θ. Take multiple runs of the chain started at different initial θ0, and examine whether all these runs converge to the same distribution. The convergence of all the examples below has been validated using 10 randomly-chosen different initial conditions.
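A minimal sketch of Procedures (2-A)–(2-C) with a Gaussian random walk proposal is given below; log_post is assumed to return the log of the (unnormalized) posterior, e.g., the surrogate log-likelihood (44) plus the log prior, and the function name and interface are choices of this sketch.

```python
import numpy as np

def metropolis(log_post, theta0, C, n_iter, rng=None):
    """Random walk Metropolis: Procedures (2-A) and (2-B), iterated n_iter times."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    L = np.linalg.cholesky(np.atleast_2d(np.asarray(C, dtype=float)))
    lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        proposal = theta + L @ rng.standard_normal(theta.size)   # (2-A) Gaussian proposal
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:                 # (2-B) accept/reject
            theta, lp = proposal, lp_prop
        chain[i] = theta
    return chain   # (2-C): discard burn-in and use the chain statistics
```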

In the remainder of this section, we present numerical results of the Metropolis scheme using the proposed data-driven likelihood function on various instructive examples, where the likelihood function is either explicitly known, can be approximated as in (6), or is intractable. In an example where the explicit likelihood is known, our goal is to show that the approach numerically converges to the true posterior estimate. In the second example, where the dimension of the data manifold is strictly less than the ambient dimension, we will show that the RKHS framework with knowledge of the intrinsic geometry is superior. When the intrinsic geometric information is unknown, the proposed data-driven likelihood function is competitive. In the third example, with low-dimensional dynamics and an observation model of the form (5), we compare the proposed approach with standard methods, including the direct MCMC and non-intrusive spectral projection (both use the likelihood function of the form (6)). In our last example, we consider an observation model for which the likelihood function is intractable and the cost of evaluating the observation model in (4) is numerically expensive.

5.1. Example I: Two-Dimensional Ornstein–Uhlenbeck Process

Consider an Ornstein–Uhlenbeck (OU) process as follows:

$dX = -\frac{1}{2} X\, dt + \Sigma^{1/2}\, dW_t,$ (50)

where $X \equiv (X_1, X_2)$ denotes the state variable, $W_t = (W_1, W_2)$ denotes a two-dimensional Wiener process, and $\Sigma \in \mathbb{R}^{2 \times 2}$ is a diagonal matrix with main diagonal components $\sigma_{X_1}^2$ and $\sigma_{X_2}^2$ to be estimated. In the stationary regime, the solution of Equation (50), $X = (X_1, X_2)$, admits a Gaussian distribution $X \sim \mathcal{N}(0, \Sigma)$,

$p(X|\Sigma) = \det(2\pi\Sigma)^{-\frac{1}{2}} \exp\left( -\frac{1}{2} X^\top \Sigma^{-1} X \right).$ (51)

Our goal here is to estimate the posterior density and the posterior mean of the parameters $(\sigma_{X_1}^2, \sigma_{X_2}^2)$, given a finite number, T, of observations, $X \equiv (X_1, \ldots, X_T)$, for hidden true parameters $\big( (\sigma_{X_1}^2)^\dagger, (\sigma_{X_2}^2)^\dagger \big) = (6.5, 6.3)$, where each $X_t$ is an i.i.d. sample of (51) with $\Sigma = \Sigma^\dagger$. This example is shown here to verify the validity of the framework of our RKWHS representations for the parameter estimation application.

One can show that the likelihood function for this problem is an inverse matrix gamma distribution, $\Sigma \sim \mathrm{IMG}\left( \frac{T}{2} - \frac{3}{2}, 2, \Psi \right)$, where $\Psi = X (X)^\top \in \mathbb{R}^{2 \times 2}$. If the prior is defined to be also an inverse matrix gamma distribution, $\Sigma \sim \mathrm{IMG}(\alpha_0, 2, 0)$, for some value of $\alpha_0$, then the posterior density $p(\Sigma|X)$ can be obtained by applying Bayes’ theorem,

$p(\Sigma|X) \sim \mathrm{IMG}\left( \alpha_0 + \frac{T}{2}, 2, \Psi \right).$ (52)

The posterior mean can thereafter be obtained as,

$\Sigma^{PM} = \begin{pmatrix} (\sigma_{X_1}^2)^{PM} & 0 \\ 0 & (\sigma_{X_2}^2)^{PM} \end{pmatrix} = \frac{\Psi}{T + 2\alpha_0 - 3}.$ (53)

To compare with the analytic conditional density $p(X|\Sigma)$ in (51), we trained three RKWHS representations of the conditional density function, $\hat{p}(X|\Sigma)$, by using the same training dataset. For training, we used M = 64 well-sampled uniformly-distributed training parameters (shown in Figure 1), $(\sigma_{X_1}^2, \sigma_{X_2}^2)$, where $\sigma_{X_j}^2 \in \{5, 6, \ldots, 12\}$, which are denoted by $\{\Sigma_j\}_{j=1}^{M}$. For each training parameter $\Sigma_j$, we generated N = 640,000 well-sampled normally-distributed observation data from the density in (51) with $\Sigma = \Sigma_j$. For the Hermite and cosine representations, we used 20 basis functions for each coordinate, and then we could construct $K_1 = 400$ basis functions of the two-dimensional observation, X, by taking the tensor product. For the VBDM representation, we first reduced the data from $MN = 8^2 \times 640{,}000$ to $B = B_1 \times B_2 = 100 \times 100$ points by the box-averaging method (Appendix A). Subsequently, we trained $K_1 = 400$ data-driven basis functions from the B box-averaged data using the VBDM algorithm [20].

Figure 2a displays the analytic conditional density (51), and Figure 2b–d display the pointwise errors of the conditional densities, $\hat{e}(X|\Sigma) \equiv p(X|\Sigma) - \hat{p}(X|\Sigma)$, for the training parameter $(\sigma_{X_1}^2, \sigma_{X_2}^2) = (5, 5)$. It can be seen from Figure 2b–d that all the pointwise errors are small compared to the analytic $p(X|\Sigma)$ in Figure 2a, so that all representations of the conditional densities $\hat{p}(X|\Sigma)$ are in excellent agreement with the analytic $p(X|\Sigma)$ (Figure 2a). This suggests that for the Hermite representation, the upper bound D in (33) in Theorem 1 is finite, so that the representation is valid in estimating the conditional density, as can be seen from Figure 2b. On the other hand, the upper bounds D in (33) for the cosine and the VBDM representations are always finite, as mentioned in Remark 2 and Proposition 2, respectively. We should also point out that for this example, the VBDM representation performed the worst, with errors of order $10^{-4}$, compared to the Hermite and cosine representations, whose errors were on the order of $10^{-6}$. This larger error in the VBDM representation arises because the data-driven basis functions were estimated by discrete eigenvectors $\vec{\bar{\psi}}_k \in \mathbb{R}^B$, so additional errors [20] were introduced through this discrete approximation (especially in the high modes) on the box-averaged data, $\{\bar{y}_b\}_{b=1,\ldots,B}$, B = 10,000. On the other hand, for the Hermite and cosine representations, the analytic basis functions are known, so that the errors could be approximated by (25) in Theorem 1.

Figure 2.

(Color online) (a) The analytic conditional density $p(X|\Sigma)$ (51). For comparison, plotted are the pointwise errors of the conditional density functions, $\hat{e}(X|\Sigma) \equiv p(X|\Sigma) - \hat{p}(X|\Sigma)$, for the (b) Hermite, (c) cosine, and (d) VBDM representations. The density and all the error functions are plotted on the B = 10,000 box-averaged data points. The training parameter is $\Sigma \equiv (\sigma_{X_1}^2, \sigma_{X_2}^2) = (5, 5)$.

We now estimate the posterior density (52) and mean (53) by using the MCMC method (Procedures (2-A)–(2-C)). We generated T = 400 well-sampled normally-distributed data as the observations from the true values of the variance, $\Sigma^\dagger = \big( (\sigma_{X_1}^2)^\dagger, (\sigma_{X_2}^2)^\dagger \big) = (6.5, 6.3)$. From the analytical formula (53), we obtained the posterior mean $(\sigma_{X_1}^2, \sigma_{X_2}^2)^{PM} = (6.03, 5.84)$. Here, the posterior mean deviated greatly from the true value since we only used T = 400 normally-distributed observation data as new observations. If many more new observation data were used, the analytical posterior mean (53) would get closer to the true value, $\big( (\sigma_{X_1}^2)^\dagger, (\sigma_{X_2}^2)^\dagger \big)$. In our simulation, we set the parameter in the prior to $\alpha_0 = 1$ and the proposal covariance to $C = 0.01 I$. For each chain, the initial condition $(\sigma_{X_1,0}^2, \sigma_{X_2,0}^2)$ was drawn randomly from $\mathcal{U}[5, 12]^2$, and 800,000 iterations were generated for the chain.

Figure 3b,c,d display the densities of the chain by using Hermite, cosine, and VBDM representation, respectively. The densities are plotted using the kernel density estimate on the chain ignoring the first 10,000 iterations. For comparison, Figure 3a displays the analytic posterior density (52). It can be seen from Figure 3 that the posterior densities by the three representations were in excellent agreement with each other and with the analytic posterior density (52). Figure 3 also shows the comparison between the posterior mean (53) and the MCMC mean estimates. From our numerical results, MCMC mean estimates by all representations and the analytic posterior mean (53) were identical within numerical accuracy. Therefore, for this 2D OU-process example, all representations were valid in estimating the posterior density and posterior mean of parameter Σ.

Figure 3.

(Color online) Comparison of the posterior density functions $p(\Sigma|X)$. (a) Analytical posterior density $p(\Sigma|X)$ (52). (b) Hermite representation. (c) Cosine representation. (d) VBDM representation. The true value is $\Sigma^\dagger \equiv \big( (\sigma_{X_1}^2)^\dagger, (\sigma_{X_2}^2)^\dagger \big) = (6.5, 6.3)$ (blue plus). The analytic posterior mean is $(\sigma_{X_1}^2, \sigma_{X_2}^2)^{PM} = (6.03, 5.84)$ (green cross). The MCMC mean estimate using the Hermite representation is $(\sigma_{X_1}^2, \sigma_{X_2}^2) = (6.05, 5.87)$ (black square). The MCMC mean estimate using the cosine representation is $(\sigma_{X_1}^2, \sigma_{X_2}^2) = (6.05, 5.87)$ (black triangle). The MCMC mean estimate using the VBDM representation is $(\sigma_{X_1}^2, \sigma_{X_2}^2) = (6.04, 5.86)$ (black circle).

Next, we will investigate a system for which the intrinsic dimension d of the data manifold where the observations lie is smaller than the dimension of ambient space n.

5.2. Example II: Three-Dimensional System of SDE’s on a Torus

Consider a system of SDEs on a torus defined in the intrinsic coordinates $(\theta, \phi) \in [0, 2\pi)^2$:

$d \begin{pmatrix} \theta \\ \phi \end{pmatrix} = a(\theta, \phi)\, dt + b(\theta, \phi)\, d \begin{pmatrix} W_1 \\ W_2 \end{pmatrix},$ (54)

where W1 and W2 are two independent Wiener processes, and the drift and diffusion coefficients are:

$a(\theta, \phi) = \begin{pmatrix} \dfrac{\frac{1}{2} + \frac{1}{8} \cos\theta \cos 2\phi + \frac{1}{2} \cos(\theta + \pi/2)}{10} \\[2mm] \frac{1}{2} \cos(\theta + \phi/2) + \cos(\theta + \pi/2) \end{pmatrix}, \qquad b(\theta, \phi) = \begin{pmatrix} D + D \sin\theta & \frac{1}{4} \cos(\theta + \phi) \\ \frac{1}{4} \cos(\theta + \phi) & \frac{1}{40} + \frac{1}{40} \sin\phi \cos\theta \end{pmatrix}.$

The initial condition is $(\theta, \phi) = (\pi, \pi)$. Here, D is the parameter to be estimated. This example exhibits non-gradient drift, anisotropic diffusion, and multiple time scales. Both the observations and the training dataset were generated by numerically solving the SDE in (54) for appropriate parameters D with a time step $\Delta t = 0.1$ and then mapping the data into the ambient space, $\mathbb{R}^3$, via the standard embedding of the torus given by:

$\mathbf{x} \equiv (x, y, z) = \big( (2 + \sin\theta) \cos\phi,\; (2 + \sin\theta) \sin\phi,\; \cos\theta \big).$ (55)

Here, $\mathbf{x} \equiv (x, y, z)$ are the observations. This system on a torus satisfies $d < n$, where d = 2 is the intrinsic dimension of $\mathbf{x}$ and n = 3 is the dimension of the ambient space $\mathbb{R}^n$. Our goal is to estimate the posterior density and the posterior mean of the parameter D given discrete-time observations of $\mathbf{x}$, which are the solutions of (54) for a specific parameter D.

For training, we used M = 8 well-sampled, uniformly-distributed training parameters, {D_j = j/4}_{j=1}^{8}. For each training parameter D_j, we generated N = 54,000 observations of x by solving the SDE (54) with parameter D_j. For the Hermite and cosine representations, we constructed 10 basis functions for each of the x, y, z coordinates in Euclidean space. Taking the tensor product of these basis functions, we obtained K_1 = 1000 basis functions on the ambient space R³. For the VBDM representation, we first computed B = B_1 × B_2 × B_3 = 30³ box-averaged data points by the data reduction method in Appendix A. However, we found that some of these B box-averaged data points were far away from the torus. After discarding these points, we retained B̃ = 26,020 box-averaged data points that were close enough to the torus for training. Then, we trained K_1 = 1000 data-driven basis functions on the manifold N from these 26,020 box-averaged data points using the VBDM algorithm.
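As an illustration of the tensor-product construction for the ambient-space representations, the following sketch evaluates 10 cosine basis functions per coordinate on a bounding box of the data and forms their K_1 = 1000 tensor products. The choice of bounding box and normalization is an assumption; the Hermite case would replace the 1D factor with Hermite functions under a Gaussian weight.

```python
import itertools
import numpy as np

def cosine_basis_1d(x, k, lo, hi):
    """Orthonormal cosine basis on [lo, hi] with respect to the uniform density."""
    s = (x - lo) / (hi - lo)
    return np.ones_like(s) if k == 0 else np.sqrt(2.0) * np.cos(np.pi * k * s)

def tensor_cosine_basis(X, n_per_dim=10):
    """Evaluate tensor-product cosine basis functions at the data X (shape N x 3).

    Returns an array of shape (N, n_per_dim**3), i.e., K1 = 1000 columns.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)    # bounding box of the ambient data (assumption)
    cols = []
    for k1, k2, k3 in itertools.product(range(n_per_dim), repeat=3):
        cols.append(cosine_basis_1d(X[:, 0], k1, lo[0], hi[0])
                    * cosine_basis_1d(X[:, 1], k2, lo[1], hi[1])
                    * cosine_basis_1d(X[:, 2], k3, lo[2], hi[2]))
    return np.column_stack(cols)

# Psi = tensor_cosine_basis(observations)   # N x 1000 matrix of basis evaluations
```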

Unlike the previous example, the derivation of an analytical expression for the likelihood function p(x|D_j) is not trivial. The difficulty is due to the fact that the diffusion coefficient, b(θ,ϕ), is state dependent. While direct MCMC with an approximate likelihood function constructed using Bayesian imputation [5] can in principle be performed, we did not pursue this computation since the cost of generating the path {x_i}, i = 1,…,T, at each sampling step was prohibitive in our setup below (where T = 10,000 and we generate a chain of length 400,000). For diagnostic comparisons, we constructed another representation of p̂(x|D_j), named the intrinsic Fourier representation, which can be regarded as an accurate approximation of p(x|D_j) since it uses basis functions defined on the intrinsic coordinates (θ,ϕ) instead of on x ∈ R³. See Appendix B for a detailed construction and convergence discussion of the intrinsic Fourier representation. We should point out that this intrinsic representation is not available in general, since one may not know the embedding of the data manifold.

Figure 4 displays the comparison of the density estimates. It can be observed from Figure 4 that the VBDM representation was in good agreement with the intrinsic Fourier representation, whereas the Hermite and cosine representations of p̂(x|D_j) deviated significantly from it. The reason, in short, is that if the density p(θ,ϕ|D) in the (θ,ϕ) coordinates is in H([0,2π)²) ⊂ L²([0,2π)²), then the corresponding VBDM representation with respect to dV(x) is in H_{q̄^{-1}}(N). However, the Hermite and cosine representations, defined with respect to dx for x ∈ R³, are not in H_{q̄^{-1}}(R³). A more detailed explanation of this assertion is presented in Appendix B.

Figure 4.


(Color online) Comparison of the conditional densities p̂(x|D_j) estimated using the Hermite representation (first row), cosine representation (second row), VBDM representation (third row), and intrinsic Fourier representation (fourth row). The left (a,d,g,j), middle (b,e,h,k), and right (c,f,i,l) columns correspond to the densities at the training parameters D_1 = 0.25, D_4 = 1.00, and D_7 = 1.75, respectively. K_1 = 1000 basis functions are used for all representations. For a fair visual comparison, all conditional densities are plotted on the same box-averaged data points and normalized to satisfy (1/B̃) Σ_{b=1}^{B̃} p̂(x_b|D_j)/q̄(x_b) = 1, where q̄ is the estimated sampling density of the box-averaged data {x_b}_{b=1}^{B̃}.

We now compare the MCMC estimates with the true value, D = 0.9, given T = 10,000 observations. For this simulation, we set the prior to be uniformly distributed and empirically chose C = 0.01 for the proposal. Figure 5 displays the posterior densities of the chains for all representations (each density estimate was constructed using a KDE on a chain of length 400,000). Also displayed is the comparison between the true value D and the MCMC mean estimates of all representations. The mean estimate by the intrinsic Fourier representation nearly overlaps with the true value D = 0.9, as shown in Figure 5. The mean estimate by the VBDM representation is closer to the true value D than the estimates by the Hermite and cosine representations. Moreover, it can be seen from Figure 5 that the posterior density by the VBDM representation is close to the posterior by the intrinsic Fourier representation, whereas the posterior densities by the Hermite and cosine representations are not. We should point out that this result is encouraging considering that the training parameter domain is rather wide, D_j ∈ [1/4, 2]. This result suggests that when the intrinsic dimension is less than the ambient space dimension, d < n, the VBDM representation (which does not require knowledge of the embedding function in (55)), with data-driven basis functions in L²(N, q̄), is superior to representations with analytic basis functions defined on the ambient coordinates of R³.

Figure 5.


(Color online) Comparison of the posterior density functions for all representations. Also plotted are the mean estimates by the Hermite representation, D̂ = 0.78 (blue triangle); the cosine representation, D̂ = 0.79 (red square); the VBDM representation, D̂ = 0.88 (black circle); the intrinsic Fourier representation, D̂ = 0.90 (green circle); and the true parameter value D = 0.9 (magenta asterisk).

5.3. Example III: Five-Dimensional Lorenz-96 Model

Consider the Lorenz-96 model [28]:

\[
\frac{dx_j}{dt} = x_{j-1}\big(x_{j+1} - x_{j-2}\big) - x_j + F, \qquad j = 1,\ldots,J, \tag{56}
\]

with periodic boundary conditions, x_{j+J} = x_j. For the example in this section, we set J = 5. The initial condition is x_j(0) = sin(2πj/5). Our goal here is to estimate the posterior density and posterior mean of the hidden parameter F, given a time series of noisy observations y = (y_1, y_2, y_3, y_4, y_5), where:

\[
y_j(t_m) = x_j(t_m) + \epsilon_{m,j}, \qquad \epsilon_{m,j} \sim \mathcal{N}(0,\sigma^2), \qquad m = 1,\ldots,T,
\]

with noise variance σ² = 0.01. Here, x_j(t_m) denotes the approximate solution (computed with the Runge–Kutta method) for a specific parameter value F at the discrete times t_m = msΔt, where Δt = 0.05 is the integration time step and s is the observation interval. Since the embedding function of the observation data is unknown, we do not have a parametric analog of the intrinsic Fourier representation as in the previous example.

In this low-dimensional setting, we can compare the proposed method with basic techniques, including the direct MCMC and the Non-Intrusive Spectral Projection (NISP) method [13]. By direct MCMC, we refer to employing the random walk Metropolis scheme directly on the following likelihood function,

\[
p(y|F) \propto \exp\!\left( -\sum_{m=1}^{T}\sum_{j=1}^{5} \frac{\big(y_j(t_m) - x_j(t_m;F)\big)^2}{2\sigma^2} \right), \tag{57}
\]

where σ² is the noise variance and x_j(t_m;F) is the solution of the initial value problem in Equation (56) with the parameter F at time t_m. Note that evaluating x_j(t_m;F) is time consuming if the model time TsΔt is long or the MCMC chain has many iterations. In our implementation, we generated the chain for 4000 iterations. This amounts to 4000 sequential evaluations of the likelihood function in (57), where each evaluation requires integrating the model in (56) with the proposed parameter value F* up to model time TsΔt. We used a uniform prior distribution and C = 0.1 for the proposal.
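For concreteness, a minimal sketch of this direct MCMC is given below. Here `integrate_lorenz96` is a hypothetical placeholder for the Runge–Kutta solver of (56), and the uniform prior bounds are illustrative, not taken from the paper; the proposal standard deviation is the square root of the stated proposal variance C = 0.1.

```python
import numpy as np

def log_likelihood(F, y_obs, t_obs, sigma2, integrate_lorenz96):
    """Gaussian log-likelihood (57); integrate_lorenz96(F, t_obs) is a placeholder
    returning the model solution x_j(t_m; F) with shape (T, 5)."""
    x = integrate_lorenz96(F, t_obs)
    return -np.sum((y_obs - x) ** 2) / (2.0 * sigma2)

def random_walk_metropolis(log_like, F0, n_iter=4000, step=np.sqrt(0.1),
                           prior=(0.0, 20.0), seed=0):
    """Random walk Metropolis; prior bounds are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    F, ll = F0, log_like(F0)
    for i in range(n_iter):
        F_prop = F + step * rng.normal()              # Gaussian random walk proposal
        if prior[0] <= F_prop <= prior[1]:
            ll_prop = log_like(F_prop)
            if np.log(rng.uniform()) < ll_prop - ll:  # Metropolis accept/reject
                F, ll = F_prop, ll_prop
        chain[i] = F
    return chain

# Illustrative usage (y_obs, t_obs, and integrate_lorenz96 supplied by the user):
# chain = random_walk_metropolis(
#     lambda F: log_likelihood(F, y_obs, t_obs, 0.01, integrate_lorenz96), F0=7.5)
```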

For the NISP method [13], we used the same Gaussian likelihood function (57) with approximated solutions x_j. In particular, we approximated the solutions x_j by x̃_j(t,F), for j = 1,…,5, of the form:

\[
\tilde{x}_j(t,F) = \sum_{k=1}^{K} \hat{x}_{j,k}(t)\,\varphi_k(F), \tag{58}
\]

where the φ_k(F) are chosen to be orthonormal cosine basis functions, the x̂_{j,k}(t) are the expansion coefficients, and K is the number of basis functions. Subsequently, we prescribed a fixed set of nodes {F_j = 7.55 + 0.1j}_{j=1}^{8} to be used for training x̂_{j,k}(t). Practically, this training procedure requires only eight model evaluations, which can be done in parallel, where each evaluation involves integrating the model with the specified F_j up to model time TsΔt. The number of basis functions is K = 8. After specifying the coefficients x̂_{j,k}(t) such that x̃_j(t,F_i) = x_j(t;F_i) at the training nodes, we obtain an approximation x̃_j(t,F) of the solutions for all parameters F. Using these approximations x̃_j(t,F), in place of x_j(t_m;F) in (57), we can generate the Markov chain using the Metropolis scheme. Again, we used a uniform prior distribution and C = 0.1 for the proposal. In our MCMC implementation, we generated the chain for 40,000 iterations; this involved only evaluating (58), instead of integrating the true dynamical model in (56), at the proposed parameter value F*.
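As an illustration of this training step, the following sketch fits the coefficients x̂_{j,k}(t) in (58) by a least-squares solve at the M = 8 training nodes (with K = 8 basis functions this reduces to interpolation), and then evaluates the surrogate at an arbitrary F. The cosine basis is assumed here to be defined on the node interval [7.65, 8.45]; the function names and the normalization are illustrative, not the authors' implementation.

```python
import numpy as np

def cosine_basis_in_F(F, K, F_lo=7.65, F_hi=8.45):
    """Evaluate K orthonormal cosine basis functions phi_k(F) on [F_lo, F_hi]."""
    s = (np.atleast_1d(F) - F_lo) / (F_hi - F_lo)
    cols = [np.ones_like(s)] + [np.sqrt(2.0) * np.cos(np.pi * k * s) for k in range(1, K)]
    return np.column_stack(cols)                       # shape (len(F), K)

def fit_nisp_coefficients(F_nodes, X_train, K=8):
    """Solve for the coefficients x_hat_{j,k}(t) in the expansion (58).

    X_train : array of shape (M, T, 5) -- model solutions at the M = 8 training nodes.
    Returns coefficients of shape (K, T, 5).
    """
    Phi = cosine_basis_in_F(F_nodes, K)                # (M, K) collocation matrix
    M, T, J = X_train.shape
    coeffs, *_ = np.linalg.lstsq(Phi, X_train.reshape(M, T * J), rcond=None)
    return coeffs.reshape(K, T, J)

def nisp_surrogate(F, coeffs, K=8):
    """Evaluate the surrogate x_tilde_j(t, F) in (58) at a proposal value F."""
    phi = cosine_basis_in_F(F, K)[0]                   # (K,)
    return np.tensordot(phi, coeffs, axes=1)           # (T, 5)
```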

For the RKWHS representations, we also used M = 8 uniformly-distributed training parameters, {F_j = 7.55 + 0.1j}_{j=1}^{8}. As in the NISP, this training procedure required only eight model integrations with parameter value F_j up to model time TsΔt, resulting in a total of MN = 8Ts training data. In this example, we did not reduce the data using the box-averaging method in Appendix A. In fact, for some cases, such as s = 1 and T = 50, the total number of training data was only MN = 400, which was too few for estimating the eigenfunctions. Of course, one could consider more training parameters to enlarge the training dataset, but for a fair comparison with NISP, we chose instead to add 10 i.i.d. Gaussian noise realizations to each data point, resulting in a total of MN = 4000 training data. This configuration (with a small dataset) is a challenging setting for the VBDM, since the nonparametric method is advantageous in the limit of a large dataset. When 8Ts is sufficiently large, we do not need to enlarge the dataset by adding multiple i.i.d. Gaussian noises.

For the Hermite and cosine representations, we constructed five Hermite and cosine basis functions, respectively, for each coordinate, yielding a total of K_1 = 5^5 = 3125 basis functions in R^5. For the VBDM representation, we directly applied the VBDM algorithm to train K_1 = 3125 data-driven basis functions on the manifold N from the MN = 4000 training data. From the VBDM algorithm, the estimated intrinsic dimension was d ≈ 2, which is smaller than the dimension of the ambient space, n = 5. We then applied a uniform prior distribution and C = 0.01 for the proposal. As in NISP, we generated the chain for 40,000 iterations, which amounted to evaluating (44), instead of integrating the true dynamical model in (56), at each iteration.

We now compare the posterior densities and mean estimates for the case of s = 1 and T = 50 noisy observations y(t_m) corresponding to the true parameter value F = 8. Figure 6 displays the posterior densities of the chains and the mean estimates for the direct MCMC method, the NISP method, and all representations. It can be seen from Figure 6 that the mean estimate by the VBDM representation was in good agreement with the true value F. In contrast, the mean estimates by the Hermite and cosine representations deviated substantially from the true value. Based on this numerical result, where the estimated intrinsic dimension d ≈ 2 of the observations was lower than the ambient space dimension n = 5, the data-driven VBDM representation was superior to the Hermite and cosine representations. It can be further observed that the direct MCMC, NISP, and VBDM representation all provided good mean estimates of the true value. However, notice that we only ran the model M = 8 times for the NISP method and the VBDM representation, whereas we ran the model 4000 times for the direct MCMC method.

Figure 6.


(Color online) Comparison of the posterior density functions among the direct MCMC method, the NISP method, and all RKWHS representations. Also plotted are the true parameter value F = 8 (black cross), the mean estimate by the direct MCMC method F̂ = 8.00 (green circle), the mean estimate by the NISP method F̂ = 8.00 (magenta square), and the mean estimates by the Hermite representation F̂ = 8.21 (blue triangle), the cosine representation F̂ = 8.10 (red square), and the VBDM representation F̂ = 7.99 (black circle). The noisy observations are y_j(t_m) for s = 1, T = 50.

In real applications where the observations are not simulated by the model, we expect the observation configuration to be pre-determined. Therefore, it is important to have an algorithm that is robust under various observation configurations. In our next numerical experiment, we checked such robustness by comparing the direct MCMC method, the NISP method, and the VBDM representation for different choices of s and T (Figure 7a). It can be observed from Figure 7a that both the direct MCMC method and the VBDM representation provide reasonably accurate mean estimates for all cases of s and T. However, notice again that the direct MCMC method requires many more model runs than the VBDM representation. It can be further observed that the NISP method only provides a good mean estimate for observation times up to TsΔt = 200Δt when the eight uniform nodes {F_j = 7.55 + 0.1j}_{j=1}^{8} are used. The reason is that the solution approximated by the NISP method is only accurate for observation times up to 200Δt (see the green and red curves in Figure 7b). This result suggests that our surrogate modeling approach using the VBDM representation can provide accurate and robust mean estimates under various observation configurations.

Figure 7.


(Color online) (a) Comparison of the mean estimates among the direct MCMC method, the NISP method, and the VBDM representation for different choices of s and T. Also plotted is the true parameter value F = 8 (green curve). (b) Comparison of the exact solution obtained by numerical integration and the approximate solution obtained by the NISP method at the training parameter F = 7.65 and at the parameter value F = 8, which is not in the training parameter set.

5.4. Example IV: The 40-Dimensional Lorenz-96 Model

In this section, we consider estimating the parameter F in the Lorenz-96 model in (56), but for a J = 40 dimensional system. We now consider observing the autocorrelation functions of several energetic Fourier modes of the system phase-space variables. In particular, let {x̂_k(t_m;F)}_{k=-J/2+1,…,J/2} be the kth discrete Fourier mode of {x_j(t_m;F)}_{j=1,…,J}, where t_m = mΔt with Δt = 0.05. Let the observation function be defined as in (4) with four-dimensional {y_m(F)}_{m=0,…,T}, whose components are the autocorrelation functions of the Fourier modes k_j,

\[
y_{m,j}(F) = \mathbb{E}\big[ \hat{x}_{k_j}(t_m;F)\, \hat{x}_{k_j}(t_0;F) \big], \qquad m = 0,\ldots,T, \quad j = 1,\ldots,4,
\]

of the energetic Fourier modes, k_j ∈ {7, 8, 9, 14}. See [29] for a detailed discussion of the statistical equilibrium behavior of this model for various values of F. Such observations arise naturally since some of the model parameters can be identified from non-equilibrium statistical information via linear response statistics [30,31]. In our numerics, we approximate the correlation function by averaging over a long trajectory,

\[
\mathbb{E}\big[ \hat{x}_{k_j}(t_m;F)\, \hat{x}_{k_j}(t_0;F) \big] \approx \frac{1}{L}\sum_{\ell=1}^{L} \hat{x}_{k_j}(t_{m+\ell};F)\, \hat{x}_{k_j}(t_{\ell};F), \tag{59}
\]

with L = 10^6. Here, each of these Fourier modes is assumed to have zero empirical mean. We consider observing the autocorrelation function up to time index T = 50 (corresponding to 2.5 time units).
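A minimal sketch of this observation map is given below, assuming the long trajectory is stored as a NumPy array. Taking the complex conjugate and the real part of the lagged product is a convention assumed here so that the observations are real-valued; it should be adjusted to match (59) exactly if a different convention is intended. All names are illustrative.

```python
import numpy as np

def mode_autocorrelation(x_traj, modes=(7, 8, 9, 14), max_lag=50):
    """Time-averaged autocorrelation (59) of selected discrete Fourier modes.

    x_traj : array of shape (L + max_lag, J) -- a long trajectory of the J = 40 state.
    Returns an array of shape (max_lag + 1, len(modes)).
    """
    x_hat = np.fft.fft(x_traj, axis=1)[:, list(modes)]   # discrete Fourier modes k_j
    x_hat -= x_hat.mean(axis=0)                           # enforce zero empirical mean
    L = x_traj.shape[0] - max_lag
    corr = np.empty((max_lag + 1, len(modes)))
    for m in range(max_lag + 1):
        # lag-m product averaged over the trajectory; conjugate and real part are
        # conventions assumed here for real-valued observations
        corr[m] = np.real(np.mean(x_hat[m:m + L] * np.conj(x_hat[:L]), axis=0))
    return corr
```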

With this setup, the corresponding likelihood function p(y_m|F) is not easily approximated (since it is not of the form (6)), and it is computationally demanding to generate y_{m,j}(F), since each evaluation requires integrating the 40-dimensional Lorenz-96 model up to time index L = 10^6. This expensive computational cost makes either the direct MCMC or approximate Bayesian computation infeasible. We should also point out that a long trajectory is needed in the evaluation of (59), making this problem intractable with NISP even if a parametric likelihood function were available. This is because the trajectory approximated by the polynomial chaos expansion in NISP is only accurate for short times, as shown in the previous example. We consider constructing the likelihood function from a wide range of training parameter values, F_i = 6 + 0.1(i−1), i = 1,…,M = 31. This parameter domain is rather wide and includes the weakly chaotic regime (F = 6) and the strongly chaotic regime (F = 8). See [32] for a complete list of chaotic measures in these regimes, including the largest Lyapunov exponent and the Kolmogorov–Sinai entropy.

In this setup, we had a total of MN = M(T+1) = 31 × 51 = 1581 training data points y_m(F_i) ∈ R^4. We consider an RKHS representation with K_1 = 500 basis functions. We demonstrate the performance on 30 sets of observations y_m(F_s), where in each case F_s does not belong to the training parameter set, namely F_s = 6.05 + 0.1(s−1), s = 1,…,30. In each simulation, the initial value of the MCMC chain is drawn randomly, F ∼ U(6.5, 8.5); the prior is uniform; and C = 0.01 for the proposal. In Figure 8, we show the mean estimates and error bars (based on one standard deviation) computed by averaging over the MCMC chain of length 40,000 in each case. Notice the robustness of these estimates over a wide range of true parameter values F, using a likelihood function constructed from a single set of training parameter values on [6,9].

Figure 8.


(Color online) Mean error estimates and error bars for various true values of F that are not in the training parameters.

6. Conclusions

We have developed a parameter estimation framework in which MCMC is employed with a nonparametric likelihood function. Our approach approximates the likelihood function using the kernel embedding of conditional distributions based on an RKWHS. By analyzing the error estimate in Theorem 1, we have verified the validity of our RKWHS representation of the conditional density as long as p(y|θ_j) belongs to the space H_{q^{-1}}(N) induced by the basis of L²(N,q) and Var_{Y|θ_j}[ψ_k(Y)] is finite. Furthermore, the analysis suggests that if the weight q is chosen to be the sampling density of the data, then Var_{Y|θ_j}[ψ_k(Y)] is always finite. This justifies the use of the Variable Bandwidth Diffusion Maps (VBDM) algorithm for estimating the data-driven basis functions of the Hilbert space weighted by the sampling density on the data manifold.

We have demonstrated the proposed approach on four numerical examples. In the first example, where the dimension of the data manifold equals the dimension of the ambient space, d = n, the RKHS representation with the VBDM basis yielded a parameter estimate as accurate as those obtained with the analytic basis representations. However, in the examples where the dimension of the data manifold was strictly less than the dimension of the ambient space, d < n, only the VBDM representation provided accurate estimates of the true parameter value. We also found that the VBDM representation produced mean estimates that were robustly accurate (with accuracies comparable to the direct MCMC) over various observation configurations for which the NISP was not accurate. This numerical comparison was based on only eight model evaluations, which can be done in parallel for both VBDM and NISP, whereas the direct MCMC involved 4000 sequential model evaluations. Finally, we demonstrated robust, accurate parameter estimation on an example where the analytic likelihood function was intractable and would be computationally demanding even if it were available. Most importantly, this result was obtained by training on a wide parameter domain that included different chaotic dynamical behaviors.

From our numerical experiments, we conclude that the proposed nonparametric representation is advantageous in any of the following configurations: (1) when a parametric likelihood function is not known, as in Example IV; (2) when the observation time is long, as in Example II, or for large sT in Examples III and IV. Ultimately, the main advantage of this method (as a surrogate model) arises when direct MCMC or ABC, which require sequential model evaluations, is computationally infeasible.

While the theoretical and numerical results are encouraging as a proof of concept for using the VBDM representation in many other parameter estimation applications, there are still practical limitations to overcome. As in other surrogate modeling approaches, one needs knowledge of the feasible domain of the parameters. Even when the parameter domain is given and wide, it is practically infeasible to generate a training dataset by evaluating the model on specified training grid points over this domain when the dimension of the parameter space is large (e.g., of order 10), even if a Smolyak sparse grid is used. One possible way to overcome both issues simultaneously is to use “crude” methods, such as ensemble Kalman filtering or smoothing, to obtain the training parameters. We refer to such methods as “crude” since parameter estimation with ensemble Kalman filtering is sensitive to the initial conditions, especially when the persistent model is used as the dynamical model for the parameters [23]. However, with such crude methods, we can at least obtain a set of parameters that reflects the observational data, instead of specifying training parameters uniformly or randomly, which can lead to unphysical training parameters. Another issue that arises in the VBDM representation is the expensive computational cost when the amount of data MN is large. When the dimension of the observations is low (as in the examples in this paper), the data reduction technique described in Appendix A is sufficient. For higher-dimensional problems, a more sophisticated data reduction is needed. Alternatively, one can explore representations using other orthonormal data-driven bases, such as the QR-factorized basis functions, as a less expensive alternative to the eigenbasis [27].

Abbreviations

The following abbreviations are used in this manuscript:

VBDM Variable Bandwidth Diffusion Maps
RKHS Reproducing Kernel Hilbert Space
RKWHS Reproducing Kernel Weighted Hilbert Space
MCMC Markov Chain Monte Carlo
ABC Approximate Bayesian Computation
NISP Non-Intrusive Spectral Projection

Appendix A. Data Reduction

When MN is very large, the VBDM algorithm becomes numerically expensive, since it involves solving an eigenvalue problem for a matrix of size MN × MN. Notice that the number of training parameters M grows exponentially as a function of the parameter dimension, m, if well-sampled, uniformly-distributed training parameters are used. To overcome this large training data problem, we employ an empirical data reduction method that reduces the original MN training data points {y_{i,j}}, i = 1,…,N, j = 1,…,M, to a much smaller number B ≪ MN of training data points while preserving the sampling density q̄(y) in (37). Subsequently, we apply the VBDM algorithm to these reduced training data points. It is worth mentioning that this data reduction method is numerically practical for low-dimensional datasets, although in the following we introduce it for a general n-dimensional dataset.

The basic idea of our method is to first cluster the training dataset {y_{i,j}}, i = 1,…,N, j = 1,…,M, into B boxes and then take the average of the data points in each box as a reduced training data point. First, we cluster the training data y_{i,j}, based on the ascending order of the first coordinate y_{i,j}^{(1)}, into B_1 groups such that each group has the same number, MN/B_1, of data points. After this first clustering, we obtain B_1 groups, each denoted by G^1_{k_1} for k_1 = 1,…,B_1. Here, the superscript 1 denotes the first clustering and the subscript k_1 denotes the k_1-th group. Second, for each group G^1_{k_1}, we cluster the training data y_{i,j} inside G^1_{k_1}, based on the ascending order of the second coordinate y_{i,j}^{(2)}, into B_2 groups such that each group has the same number, MN/(B_1 B_2), of data points. After the second clustering, we obtain a total of B_1 B_2 groups, each denoted by G^2_{k_1 k_2} for k_1 = 1,…,B_1 and k_2 = 1,…,B_2. We repeat this clustering n times, where n is the dimension of the ambient space of the observation y. After n clusterings, we obtain B = ∏_{s=1}^{n} B_s groups, each denoted by G^n_{k_1 k_2 … k_n} with k_s = 1,…,B_s for all s = 1,…,n. Each group is a box (see Figure A1 for an example). After taking the average of the data points in each box G^n_{k_1 k_2 … k_n}, we obtain B reduced training data points. In the remainder of this paper, we denote these B reduced training data points by {ȳ_b}_{b=1,…,B} and refer to them as the box-averaged data points. Intuitively, this algorithm partitions the domain into hyperrectangles such that Pr(ȳ ∈ G^n_{k_1…k_n}) ≈ 1/B. Note that the idea of our data reduction method is analogous to that of multivariate k-nearest neighbor density estimates [33,34]; the corresponding error estimates can be found in [33,34].
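A minimal sketch of this recursive clustering and box averaging is given below, assuming NumPy; `np.array_split` is used so that the groups along each coordinate have (nearly) equal size, and the function name is illustrative.

```python
import numpy as np

def box_average(Y, boxes_per_dim):
    """Recursive clustering and box averaging of the training data Y (shape MN x n).

    boxes_per_dim : sequence (B_1, ..., B_n); returns B = prod(B_s) box-averaged points.
    """
    def split(block, dims):
        if not dims:                                   # no coordinates left: average the box
            return [block.mean(axis=0)]
        d, B_d = dims[0]
        order = np.argsort(block[:, d])                # sort by the d-th coordinate
        groups = np.array_split(block[order], B_d)     # (nearly) equal-size groups
        out = []
        for g in groups:
            out.extend(split(g, dims[1:]))
        return out

    dims = list(enumerate(boxes_per_dim))
    return np.vstack(split(np.asarray(Y, dtype=float), dims))

# Illustrative usage for Example II: reduce the MN ambient-space training data in R^3
# to B = 30^3 = 27,000 box-averaged points.
# Y_bar = box_average(Y_train, boxes_per_dim=(30, 30, 30))
```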

To examine whether the distribution of these box-averaged data points is close to the sampling density q̄(y) of the original dataset, we apply our reduction method to several numerical examples. Figure A1a shows the reduction result for a small number (64) of uniformly distributed data points in [0,1] × [0,1]. Here, B_1 = 4 and B_2 = 4, so that there are B = 16 boxes in total and each box contains 4 uniformly distributed data points (blue circles). It can be seen that the box-averaged data points (red circles) are far away from the well-sampled uniform data points (cyan crosses). However, when the number of uniformly distributed data points (blue circles) is increased to 6400, the box-averaged data points (red circles) are very close to the well-sampled uniform data points (cyan crosses), as shown in Figure A1b. This suggests that the box-averaged data points nearly admit the uniform distribution when there is a large number of original uniform data points. Figure A1c,d shows the comparison of the kernel density estimates applied to the box-averaged data for different B, for the standard normal distribution and for the distribution proportional to exp[−(X_1^2 + X_1^3 + X_1^4)], respectively. It can be seen that the reduced box-averaged data points nearly preserve the distribution of the original large dataset, N = 640,000.

Figure A1.


(Color online) Data reduction for (a) a small number (64) of uniformly distributed data points, and (b) a large number (6400) of uniformly distributed data points. The 64 blue circles correspond to uniformly distributed data, the 16 cyan crosses correspond to well-sampled uniformly distributed data, and the 16 red circles correspond to box-averaged data. Boxes are partitioned by horizontal and vertical black lines; the vertical black lines correspond to the first clustering and the horizontal lines to the second clustering. Panels (c) and (d) display the comparison of kernel density estimates on the box-averaged data for different numbers B, for (c) the standard normal distribution and (d) the distribution proportional to exp[−(X_1^2 + X_1^3 + X_1^4)], respectively. For comparison, the analytic probability density of each distribution is also plotted. The total number of points is 640,000. It can be seen that the reduced box-averaged data points nearly preserve the distribution of the original dataset.

When MN is very large, the VBDM algorithm for the construction of the data-driven basis functions [Procedure (1-B)] can be outlined as follows. We first use our data reduction method to obtain B ≪ MN box-averaged data points {ȳ_b}_{b=1,…,B} ⊂ N ⊂ R^n with sampling density q̄(y) ≈ (1/M) Σ_{j=1}^{M} p(y|θ_j) in (37). The sampling density q̄ is estimated at the box-averaged data points ȳ_b, using all of the box-averaged data points {ȳ_b}_{b=1,…,B}, by a kernel density estimation method. Implementing the VBDM algorithm, we obtain orthonormal eigenvectors ψ̄_k ∈ R^B, which are discrete estimates of the eigenfunctions ψ̄_k(y) ∈ L²(N, q̄). The bth component of the eigenvector ψ̄_k is a discrete estimate of the eigenfunction value ψ̄_k(ȳ_b) at the box-averaged data point ȳ_b. Due to the dramatic reduction of the training data, the computation of these eigenvectors ψ̄_k ∈ R^B is much cheaper than the computation of the eigenvectors ψ̄_k ∈ R^{MN} using the original training dataset {y_{i,j}}, i = 1,…,N, j = 1,…,M. We can then obtain a discrete representation (10) of the conditional density at the box-averaged data points ȳ_b, namely p̂(ȳ_b|θ).

Appendix B. Additional Results on Example II

In this section, we discuss the intrinsic Fourier representation constructed for the numerical comparisons in Example II and provide a more detailed discussion of the numerical results.

We first discuss the construction of the intrinsic Fourier representation of the true conditional density, p(x|D), defined with respect to the volume form inherited by N from the ambient space R^n, for the system on the torus (54) in Example II. Using the embedding (55) of (θ,ϕ) into x ≡ (x,y,z), we obtain the following equality,

\[
1 = \int_{N} p(x|D)\, dV(x) = \int_{[0,2\pi)^2} p\big(x(\theta,\phi)\,|\,D\big)\, \big\| x_\theta \times x_\phi \big\|\, d\theta\, d\phi \equiv \int_{[0,2\pi)^2} p_{IC}(\theta,\phi|D)\, d\theta\, d\phi, \tag{A1}
\]

where dV(x) = ‖x_θ × x_ϕ‖ dθ dϕ is the volume form and p_IC denotes the true conditional density as a function of the intrinsic coordinates (θ,ϕ). Assuming that p_IC(θ,ϕ|D) ∈ H([0,2π)²) ⊂ L²([0,2π)²), and using the relation in (A1), we construct the intrinsic Fourier representation as follows,

\[
\hat{p}(x|D) = \frac{\hat{p}_{IC}(\theta,\phi|D)}{\big\| x_\theta \times x_\phi \big\|}, \tag{A2}
\]

where p̂_IC(θ,ϕ|D) is an RKWHS representation (10) of the conditional density p_IC(θ,ϕ|D) with a set of orthonormal Fourier basis functions ψ_k(θ,ϕ) ∈ L²([0,2π)²). Here, the ψ_k(θ,ϕ) are formed by the tensor product of the two sets of orthonormal Fourier basis functions {1, √2 cos(mθ), √2 sin(mθ)} and {1, √2 cos(mϕ), √2 sin(mϕ)} for m ∈ N^+. Note that for the intrinsic Fourier representation, we need to know the embedding (55) and the data (θ,ϕ) in intrinsic coordinates for training, both of which are available for this example. In contrast, for the Hermite, cosine, and VBDM representations, we only need the observation data x for training.

The convergence of p̂(x|D) to the true density can be explained as follows. For the system (54) in the intrinsic coordinates (θ,ϕ), where p_IC(θ,ϕ|D) ∈ H([0,2π)²) for all parameters D, the variances Var_{θ,ϕ|D}[ψ_k(θ,ϕ)] are bounded for all D and all k ∈ N^+, by the compactness of [0,2π)² and the uniform boundedness of ψ_k(θ,ϕ) in k. According to Theorem 1, we obtain the convergence of the representation p̂_IC(θ,ϕ|D). Then, since ‖x_θ × x_ϕ‖ is smooth and bounded away from zero on the torus, we obtain the convergence of the intrinsic Fourier representation p̂(x|D) in (A2).
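For concreteness, a direct computation with the embedding (55) (the case r = 1) confirms that this factor is smooth and bounded away from zero:
\[
x_\theta = \big(\cos\theta\cos\phi,\ \cos\theta\sin\phi,\ -\sin\theta\big), \qquad
x_\phi = \big(-(2+\sin\theta)\sin\phi,\ (2+\sin\theta)\cos\phi,\ 0\big),
\]
\[
x_\theta \times x_\phi = (2+\sin\theta)\big(\sin\theta\cos\phi,\ \sin\theta\sin\phi,\ \cos\theta\big),
\qquad
\big\| x_\theta \times x_\phi \big\| = 2+\sin\theta \ \ge\ 1 .
\]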

Next, we give an intuitive explanation of why, in the regime d < n, the VBDM representation provides a good approximation whereas the Hermite and cosine representations do not. Essentially, the VBDM representation uses basis functions of a weighted Hilbert space of functions defined with respect to a volume form Ṽ that is conformally equivalent to the volume form V inherited by the data manifold N from the ambient space R^n. That is, the weighted Hilbert space L²(N, q̄^{-1}) is defined as

\[
L^2(N,\bar q^{-1}) = \Big\{ f : N \to \mathbb{R} \ \Big|\ \int_N |f(x)|^2\, d\tilde V(x) < \infty \Big\},
\]

where dṼ(x) = q̄(x)^{-1} dV(x) denotes the volume form conformally changed by the sampling density q̄. We should point out that a key point of the diffusion maps algorithm [19] is to introduce an appropriate normalization to avoid a bias in the geometry induced by the sampling density q̄ when the data are not sampled according to the Riemannian metric inherited by N from the ambient space R^n. Furthermore, the orthonormal basis functions of the Hilbert space L²(N, q̄^{-1}) are the eigenfunctions of the adjoint (with respect to L²(N)) of the operator L = ∇log(q̄)·∇ + Δ that is constructed by the VBDM algorithm. Incidentally, the adjoint operator L* is the Fokker–Planck operator of a gradient system forced by stochastic noise. The point is that this adjoint operator acts on density functions in the weighted Hilbert space L²(N, q̄^{-1}). Since the Hilbert space L²(N, q̄^{-1}) is the function space of a Fokker–Planck operator that acts on densities defined with respect to the geometry of the data, representing the conditional density with basis functions of the weighted Hilbert space L²(N, q̄^{-1}) is a natural choice. Thus, the error estimate in Theorem 2 is valid for controlling the error of the estimate.
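To make the Fokker–Planck interpretation concrete, the following is a sketch under the reconstruction of the operator stated above, L = ∇log(q̄)·∇ + Δ:
\[
\mathcal{L}f = \nabla\log\bar q \cdot \nabla f + \Delta f,
\qquad
\mathcal{L}^{*} p = -\nabla\cdot\big(p\,\nabla\log\bar q\big) + \Delta p,
\]
so that \(\partial_t p = \mathcal{L}^{*} p\) is the Fokker–Planck equation of the gradient system \(dX_t = \nabla\log\bar q(X_t)\,dt + \sqrt{2}\,dW_t\), whose equilibrium density is \(\bar q\).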

Next, we show that the representation of the true density p_EX in the ambient space R³ is not a function in H_{q^{-1}}(R) ⊂ L²(R, q^{-1}), where for the Hermite representation R = R³ and q is a normal density, and for the cosine representation R is a hyperrectangle containing the torus and q is a uniform density. Recall that the torus is parametrized by:

\[
x \equiv (x,y,z) = \big( (2 + r\sin\theta)\cos\phi,\ (2 + r\sin\theta)\sin\phi,\ r\cos\theta \big), \tag{A3}
\]

where θ and ϕ are angles that make a full circle, and r is the radius of the tube, known as the minor radius. All observation data are located on the torus with r = 1. The generalized conditional density p(r,θ,ϕ|D) in the (r,θ,ϕ) coordinates can then be defined using the Dirac delta function as follows,

\[
p(r,\theta,\phi|D) = p_{IC}(\theta,\phi|D)\,\delta(r-1), \tag{A4}
\]

where p_IC, defined in (A1), denotes the conditional density function in the intrinsic coordinates (θ,ϕ), and δ is the Dirac delta function. After the coordinate transformation, the density p_EX : R³ → R can be obtained as

\[
p_{EX}(x|D) = \frac{p(r,\theta,\phi|D)}{J} = \frac{p_{IC}(\theta,\phi|D)\,\delta(r-1)}{J}, \tag{A5}
\]

where J is the Jacobian determinant det(∂(x,y,z)/∂(r,θ,ϕ)). One can verify that p_EX(x|D) is a generalized conditional density, that is, ∫_{R³} p_EX(x|D) dx = 1. It is now clear that, due to the Dirac delta function δ(r−1) in (A5), the density p_EX(x|D) is no longer in the weighted Hilbert space, i.e., p_EX(x|D) ∉ H_{q^{-1}}(R). Consequently, the error estimate in Theorem 1 is no longer valid for controlling the error of the conditional density.

Here, the key point is that for the cosine and Hermite representations, the volume integral is with respect to dx, and the complete basis is obtained from the tensor product of three sets of basis functions in the (x,y,z) coordinates. In order to represent a conditional density function p_EX(x|D) defined only on an intrinsically two-dimensional torus, theoretically an infinite number of basis functions is needed; numerically, only a finite number of basis functions can be used. As a result, the density p_EX(x|D) in (A5) cannot be well approximated by the Hermite and cosine representations (10). Moreover, if only a finite number of Hermite or cosine basis functions is used, the Gibbs phenomenon is typically observed, i.e., the Dirac delta function δ(r−1) in (A5) is approximated by a function with a single tall spike at r = 1 and oscillations on both sides along the r direction. On the other hand, the data-driven basis functions obtained via the diffusion maps algorithm are smooth functions defined on the data manifold N. Therefore, while the Gibbs phenomenon still occurs in this spectral expansion, it is due to the finite truncation in representing positive smooth functions (densities) on the data manifold, and not due to the singularity in the ambient direction in (A5).

Author Contributions

Both authors contributed equally. Conceptualization, J.H.; methodology, J.H. and S.W.J.; software, S.W.J.; validation, S.W.J. and J.H.; formal analysis, S.W.J. and J.H.; investigation, S.W.J. and J.H.; resources, J.H.; data curation, S.W.J.; writing, original draft preparation, S.W.J. and J.H.; writing, review and editing, S.W.J. and J.H.; visualization, S.W.J.; supervision, J.H.; project administration, J.H.; funding acquisition, J.H.

Funding

This research was funded by the Office of Naval Research Grant Number N00014-16-1-2888. J.H. would also like to acknowledge support from the NSF Grant DMS-1619661.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

References

1. Kaipio J., Somersalo E. Statistical and Computational Inverse Problems. Springer; New York, NY, USA: 2005.
2. Sullivan T.J. Introduction to Uncertainty Quantification. Volume 63. Springer; Cham, Switzerland: 2015.
3. Dashti M., Stuart A.M. The Bayesian Approach to Inverse Problems. In: Ghanem R., Higdon D., Owhadi H., editors. Handbook of Uncertainty Quantification. Springer International Publishing; Cham, Switzerland: 2017. pp. 311–428.
4. Brooks S., Gelman A., Jones G., Meng X.L. Handbook of Markov Chain Monte Carlo. CRC Press; London, UK: 2011.
5. Golightly A., Wilkinson D. Markov Chain Monte Carlo Algorithms for SDE Parameter Estimation. 2010. pp. 253–276. Available online: http://www.mas.ncl.ac.uk/ nag48/diffchap.pdf (accessed on 2 June 2019).
6. Tavaré S., Balding D.J., Griffiths R.C., Donnelly P. Inferring coalescence times from DNA sequence data. Genetics. 1997;145:505–518. doi: 10.1093/genetics/145.2.505.
7. Turner B.M., Van Zandt T. A tutorial on approximate Bayesian computation. J. Math. Psychol. 2012;56:69–85. doi: 10.1016/j.jmp.2012.02.005.
8. Neal R.M. MCMC using Hamiltonian dynamics. In: Brooks S., Gelman A., Jones G., Meng X.-L., editors. Handbook of Markov Chain Monte Carlo. CRC Press; London, UK: 2011. pp. 113–167. Chapter 5.
9. Beck J.L., Au S.K. Bayesian updating of structural models and reliability using Markov Chain Monte Carlo simulation. J. Eng. Mech. 2002;128:380–391. doi: 10.1061/(ASCE)0733-9399(2002)128:4(380).
10. Haario H., Laine M., Mira A., Saksman E. DRAM: Efficient adaptive MCMC. Stat. Comput. 2006;16:339–354. doi: 10.1007/s11222-006-9438-0.
11. Higdon D., Kennedy M., Cavendish J.C., Cafeo J.A., Ryne R.D. Combining field data and computer simulations for calibration and prediction. SIAM J. Sci. Comput. 2004;26:448–466. doi: 10.1137/S1064827503426693.
12. Marzouk Y., Najm H., Rahn L. Stochastic spectral methods for efficient Bayesian solution of inverse problems. J. Comput. Phys. 2007;224:560–586. doi: 10.1016/j.jcp.2006.10.010.
13. Marzouk Y., Xiu D. A stochastic collocation approach to Bayesian inference in inverse problems. Commun. Comput. Phys. 2009;6:826–847. doi: 10.4208/cicp.2009.v6.p826.
14. Huttunen J.M., Kaipio J.P., Somersalo E. Approximation errors in nonstationary inverse problems. Inverse Probl. Imag. 2007;1:77–93. doi: 10.3934/ipi.2007.1.77.
15. Nagel J.B., Sudret B. Spectral likelihood expansions for Bayesian inference. J. Comput. Phys. 2016;309:267–294. doi: 10.1016/j.jcp.2015.12.047.
16. Song L., Huang J., Smola A., Fukumizu K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. Proceedings of the 26th Annual International Conference on Machine Learning; Montreal, QC, Canada, 14–18 June 2009. ACM; New York, NY, USA: 2009. pp. 961–968.
17. Song L., Fukumizu K., Gretton A. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Process. Mag. 2013;30:98–111. doi: 10.1109/MSP.2013.2252713.
18. Berry T., Harlim J. Correcting biased observation model error in data assimilation. Mon. Weather Rev. 2017;145:2833–2853. doi: 10.1175/MWR-D-16-0428.1.
19. Coifman R.R., Lafon S. Diffusion maps. Appl. Comput. Harmon. Anal. 2006;21:5–30. doi: 10.1016/j.acha.2006.04.006.
20. Berry T., Harlim J. Variable bandwidth diffusion kernels. Appl. Comput. Harmon. Anal. 2016;40:68–96. doi: 10.1016/j.acha.2015.01.001.
21. Steinwart I., Christmann A. Support Vector Machines. Springer; New York, NY, USA: 2008.
22. Berry T., Harlim J. Forecasting turbulent modes with nonparametric diffusion models: Learning from noisy data. Physica D. 2016;320:57–76. doi: 10.1016/j.physd.2016.01.012.
23. Harlim J. Data-Driven Computational Methods: Parameter and Operator Estimations. Cambridge University Press; Cambridge, UK: 2018.
24. Berry T., Sauer T. Consistent manifold representation for topological data analysis. Found. Data Sci. 2019;1:1–38. doi: 10.3934/fods.2019001.
25. Nyström E.J. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Math. 1930;54:185–204. doi: 10.1007/BF02547521. (In German)
26. Aronszajn N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950;68:337–404. doi: 10.1090/S0002-9947-1950-0051437-7.
27. Harlim J., Yang H. Diffusion Forecasting Model with Basis Functions from QR-Decomposition. J. Nonlinear Sci. 2018;28:847–872. doi: 10.1007/s00332-017-9430-1.
28. Lorenz E. Predictability: A problem partly solved. Proceedings of the Seminar on Predictability; Shinfield Park, Reading, UK, 4–8 September 1995. ECMWF; Reading, UK: 1995. pp. 1–18.
29. Majda A., Abramov R., Grote M. Information Theory and Stochastics for Multiscale Nonlinear Systems. CRM Monograph Series. American Mathematical Society; Providence, RI, USA: 2005.
30. Harlim J., Li X., Zhang H. A Parameter Estimation Method Using Linear Response Statistics. J. Stat. Phys. 2017;168:146–170. doi: 10.1007/s10955-017-1788-9.
31. Zhang H., Li X., Harlim J. A Parameter Estimation Method Using Linear Response Statistics: Numerical Scheme. Chaos. 2019;29:033101. doi: 10.1063/1.5081744.
32. Abramov R., Majda A. Blended response algorithm for linear fluctuation-dissipation for complex nonlinear dynamical systems. Nonlinearity. 2007;20:2793–2821. doi: 10.1088/0951-7715/20/12/004.
33. Loftsgaarden D.O., Quesenberry C.P. A nonparametric estimate of a multivariate density function. Ann. Math. Stat. 1965;36:1049–1051. doi: 10.1214/aoms/1177700079.
34. Mack Y., Rosenblatt M. Multivariate k-nearest neighbor density estimates. J. Multivar. Anal. 1979;9:1–15. doi: 10.1016/0047-259X(79)90065-4.
