Proceedings. Mathematical, Physical, and Engineering Sciences. 2022 Jun 22;478(2262):20210710. doi: 10.1098/rspa.2021.0710

Unbiased estimation of the Hessian for partially observed diffusions

Neil K Chada 1, Ajay Jasra 1, Fangyuan Yu 1
PMCID: PMC9215219  PMID: 35756881

Abstract

In this article, we consider the development of unbiased estimators of the Hessian of the log-likelihood function, with respect to parameters, for partially observed diffusion processes. These processes arise in numerous applications, where such diffusions require derivative information, either through the Jacobian or Hessian matrix. As time-discretizations of diffusions induce a bias, we provide an unbiased estimator of the Hessian. This is based on using Girsanov's Theorem and randomization schemes developed by McLeish (2011 Monte Carlo Methods Appl. 17, 301–315 (doi:10.1515/mcma.2011.013)) and Rhee & Glynn (2016 Op. Res. 63, 1026–1043). We show that the resulting estimator of the Hessian is unbiased and of finite variance. We numerically test and verify this by comparing the methodology here to that of a newly proposed particle filtering methodology. We test this on a range of diffusion models, which include different Ornstein–Uhlenbeck processes and the Fitzhugh–Nagumo model, arising in neuroscience.

Keywords: partially observed diffusions, randomization methods, Hessian estimation, coupled conditional particle filter

1. Introduction

In many scientific disciplines, diffusion processes [1] are used to model and describe important phenomena. Particular applications where such processes arise include biological sciences, finance, signal processing and atmospheric sciences [2–5]. Mathematically, diffusion processes take the general form

$$dX_t=a_\theta(X_t)\,dt+\sigma(X_t)\,dW_t,\qquad X_0=x\in\mathbb{R}^d, \tag{1.1}$$

where $X_t\in\mathbb{R}^d$, $\theta\in\Theta$ is a parameter, $X_0=x$ is the initial condition with $x$ given, $a:\Theta\times\mathbb{R}^d\to\mathbb{R}^d$ denotes the drift term, $\sigma:\mathbb{R}^d\to\mathbb{R}^{d\times d}$ denotes the diffusion coefficient and $\{W_t\}_{t\ge 0}$ is a standard $d$-dimensional Brownian motion. In practice, it is often difficult to have direct access to such continuous processes; instead one has discrete-time partial observations of the process $\{X_t\}_{t\ge 0}$, denoted $Y_{t_1},\dots,Y_{t_n}$, where $0<t_1<\cdots<t_n=T$, such that $Y_{t_p}\in\mathbb{R}^{d_y}$. Such processes are referred to as partially observed diffusion processes (PODPs), where one is interested in doing inference on the hidden process (1.1) given the observations. In order to do such inference, one must time-discretize the process, which induces a discretization bias. For (1.1), this can arise through common discretization forms such as an Euler or Milstein scheme [6]. Therefore, an important question, related to inference, is how one can reduce or remove the discretization bias. Such a discussion motivates the development and implementation of unbiased estimators.

(a) Motivating example

To help motivate unbiased estimation for PODPs, we provide an interesting application, which we will test in this work. Our model example is the Fitzhugh–Nagumo (FHN) model for a neuron, which is a second-order ODE model arising in neuroscience, describing the action potential generation within a neuronal axon. We consider a stochastic version of it, which is represented as the following:

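$$\begin{bmatrix}dX_t^{(1)}\\ dX_t^{(2)}\end{bmatrix}=\begin{bmatrix}\theta_1\big(X_t^{(1)}-(X_t^{(1)})^3-X_t^{(2)}\big)\\ \theta_2X_t^{(1)}-X_t^{(2)}+\theta_3\end{bmatrix}dt+\begin{bmatrix}\sigma_1\\ \sigma_2\end{bmatrix}dW_t,\qquad X_0=u_0, \tag{1.2}$$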

where $X_t$ is the membrane potential. There has been some interest in parameter estimation [7–9], related to various variants of the FHN model. This particular model is well known within the community, and commonly acts as a toy problem in the field of mathematical neurosciences. For this reason, we will use this example within our numerical experiments. In figure 1, we provide a simulation of (1.2), for arbitrary choices of $\theta=(\theta_1,\theta_2,\theta_3)$, which demonstrates the interesting behaviour and dynamics. In the plot, we have also plotted non-noisy observations for $X_t^{(1)}$.

Figure 1.

Simulation of FHN model for t=50. Purple crosses in the top subplot represent discrete-time observations. (Online version in colour.)
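A simulation of this kind can be sketched with an Euler–Maruyama scheme. The sketch below is illustrative only: the function name `simulate_fhn`, the parameter values, the step size and the use of independent Brownian increments for each coordinate (a diagonal diffusion coefficient) are assumptions, not the settings used to produce figure 1. The drift signs follow the standard FHN form $\theta_1(x^{(1)}-(x^{(1)})^3-x^{(2)})$ and $\theta_2x^{(1)}-x^{(2)}+\theta_3$.

```python
import math
import random


def simulate_fhn(theta, sigma, x0, t_end, level, seed=0):
    """Euler-Maruyama path of a stochastic FHN model with step size 2**-level."""
    theta1, theta2, theta3 = theta
    sigma1, sigma2 = sigma
    rng = random.Random(seed)
    dt = 2.0 ** (-level)
    n_steps = int(round(t_end / dt))
    x1, x2 = x0
    path = [(x1, x2)]
    for _ in range(n_steps):
        # Assumption: each coordinate has its own independent Brownian increment.
        dw1 = rng.gauss(0.0, math.sqrt(dt))
        dw2 = rng.gauss(0.0, math.sqrt(dt))
        drift1 = theta1 * (x1 - x1 ** 3 - x2)
        drift2 = theta2 * x1 - x2 + theta3
        x1, x2 = x1 + drift1 * dt + sigma1 * dw1, x2 + drift2 * dt + sigma2 * dw2
        path.append((x1, x2))
    return path


# Hypothetical parameter values, chosen only so that the path is well behaved.
level = 8
path = simulate_fhn(theta=(5.0, 1.5, 0.8), sigma=(0.1, 0.1),
                    x0=(0.0, 0.0), t_end=50.0, level=level)
# Noise-free readings of the first coordinate at integer times 1, ..., 50.
observations = [path[k * 2 ** level][0] for k in range(1, 51)]
```

Reading the first coordinate at integer times mimics the non-noisy discrete-time observations described above; adding Gaussian noise to these readings would give a partially observed model of the type studied in §4.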

(b) Methodology

The unbiased estimation of PODPs has been an important, yet challenging topic. Seminal work in this direction is the idea of exact simulation, proposed in various works [10–12]. The underlying idea behind exact simulation is that, through a particular transformation, one can acquire an unbiased estimator, subject to certain restrictions on the form of the diffusion and its dimension. Since then there have been a number of extensions aimed at going beyond this, with respect to more general multidimensional diffusions and continuous-time dynamics [13,14]. More recently, attention has been paid to unbiased estimation, for Bayesian computation, through the work of Rhee and Glynn [15,16], in which they provide unbiased and finite variance estimators by introducing randomization. In particular, these methods allow us to unbiasedly estimate an expectation of a functional, by randomizing on the level of the time-discretization in a type of multilevel Monte Carlo (MLMC) approach [17], where there is a coupling between different levels. As a result, this methodology has been considered in the context of both filtering and Bayesian computation [18–21] and gradient estimation [22]. One advantage of this approach is that, with couplings, it is relatively simple to use and implement computationally, and it can be exploited on a range of different model problems or set-ups.

In this work, we are interested in developing an unbiased estimator of the Hessian for PODPs. Typically, the Hessian is not required for PODPs, but it is currently of interest because state-of-the-art stochastic gradient methodologies exploit Hessian information for an improved speed of convergence. Methods using this information include Newton-type methods [23,24], which have improved rates of convergence over first-order stochastic gradient methods. It is also well known that these methods are typically biased, as one typically does not compute the full Hessian, owing to the computational burden. This provides our motivation, firstly for developing an unbiased estimator, and secondly for doing so for the Hessian. In order to develop an unbiased estimator, our methodology will largely follow that described in [22], with the extension of this from the score function to the Hessian. Other works that consider unbiased estimation of the gradient include [25,26]. In particular, we will exploit the use of the conditional particle filter (CPF), first considered by Andrieu et al. [27,28]. We provide an expression for the Hessian of the likelihood, while introducing an Euler time-discretization of the diffusion process in order to implement our unbiased estimator. We then describe how one can attain unbiased estimators, which is based on various couplings of the CPF. We then compare this methodology with the methods of [18,29] for the Hessian computation, assessing the unbiased estimator through both the variance and bias. This will be conducted on both a single and multidimensional Ornstein–Uhlenbeck (OU) process, as well as a more complicated form of the FHN model. We remark that our estimator of the Hessian is unbiased, but if the inverse Hessian is required, it is possible to adapt the forthcoming methodology to that context as well.

(c) Outline

In §2, we present our setting for our diffusion process. We also present a derived expression for the Hessian, with an appropriate time-discretization. Then, in §3, we describe our algorithm in detail for the unbiased estimator of the Hessian. This will be primarily based on a coupling of a coupled CPF. This will lead to §4 where we present our numerical experiments, which provide variance and bias plots. We compare the methodology of this work with that of the Delta particle filter. This comparison will be tested on a range of diffusion processes, which include an OU process and the FHN model. We summarize our findings in §5.

2. Model

In this section, we introduce our setting and notation regarding our partially observed diffusions. This will include a number of assumptions. We will then provide an expression for the Hessian of the likelihood function, with a time-discretization based on the Euler scheme. This will include a discussion on the stochastic model where we define the marginal likelihood. Finally, we present a result indicating the approximation of the Hessian computation as we take the limit of the discretization level.

(a) Notation

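Let $(\mathsf{X},\mathcal{X})$ be a measurable space. For $\varphi:\mathsf{X}\to\mathbb{R}$, we write $\mathcal{B}_b(\mathsf{X})$ for the collection of bounded measurable functions and $\mathcal{C}^j(\mathsf{X})$ for the collection of $j$-times, $j\in\mathbb{N}$, continuously differentiable functions; we omit the index $j$ if the functions are simply continuous. If $\varphi:\mathsf{X}\to\mathbb{R}^d$ we write $\mathcal{C}_d^j(\mathsf{X})$ and $\mathcal{C}_d(\mathsf{X})$. For $\varphi:\mathbb{R}^d\to\mathbb{R}$, let $\mathrm{Lip}_{\|\cdot\|_2}(\mathbb{R}^d)$ denote the collection of real-valued functions that are Lipschitz w.r.t. $\|\cdot\|_2$ ($\|\cdot\|_p$ denotes the $\mathbb{L}_p$-norm of a vector $x\in\mathbb{R}^d$). That is, $\varphi\in\mathrm{Lip}_{\|\cdot\|_2}(\mathbb{R}^d)$ if there exists a $C<+\infty$ such that for any $(x,y)\in\mathbb{R}^{2d}$

$$|\varphi(x)-\varphi(y)|\le C\|x-y\|_2.$$

We write $\|\varphi\|_{\mathrm{Lip}}$ as the Lipschitz constant of a function $\varphi\in\mathrm{Lip}_{\|\cdot\|_2}(\mathbb{R}^d)$. For $\varphi\in\mathcal{B}_b(\mathsf{X})$, we write the supremum norm $\|\varphi\|=\sup_{x\in\mathsf{X}}|\varphi(x)|$. $\mathcal{P}(\mathsf{X})$ denotes the collection of probability measures on $(\mathsf{X},\mathcal{X})$. For a measure $\mu$ on $(\mathsf{X},\mathcal{X})$ and a $\varphi\in\mathcal{B}_b(\mathsf{X})$, the notation $\mu(\varphi)=\int_{\mathsf{X}}\varphi(x)\,\mu(dx)$ is used. $B(\mathbb{R}^d)$ denotes the Borel sets on $\mathbb{R}^d$. $dx$ is used to denote the Lebesgue measure. Let $K:\mathsf{X}\times\mathcal{X}\to[0,\infty)$ be a non-negative operator and $\mu$ be a measure, then we use the notations $\mu K(dy)=\int_{\mathsf{X}}\mu(dx)K(x,dy)$ and, for $\varphi\in\mathcal{B}_b(\mathsf{X})$, $K(\varphi)(x)=\int_{\mathsf{X}}\varphi(y)K(x,dy)$. For $A\in\mathcal{X}$, the indicator is written $\mathbb{I}_A(x)$. $\mathcal{U}_A$ denotes the uniform distribution on the set $A$. $\mathcal{N}_s(\mu,\Sigma)$ (resp. $\psi_s(x;\mu,\Sigma)$) denotes an $s$-dimensional Gaussian distribution (density evaluated at $x\in\mathbb{R}^s$) of mean $\mu$ and covariance $\Sigma$. If $s=1$ we omit the subscript $s$. For a vector/matrix $X$, $X^{\top}$ is used to denote the transpose of $X$. For $A\in\mathcal{X}$, $\delta_A(du)$ denotes the Dirac measure of $A$, and if $A=\{x\}$ with $x\in\mathsf{X}$, we write $\delta_x(du)$. For a vector-valued function in $d$ dimensions (resp. $d$-dimensional vector), $\varphi(x)$ (resp. $x$) say, we write the $i$th component ($i\in\{1,\dots,d\}$) as $\varphi(x)^{(i)}$ (resp. $x^{(i)}$). For a $d\times q$ matrix $x$, we write the $(i,j)$th entry as $x^{(ij)}$. For $\mu\in\mathcal{P}(\mathsf{X})$ and $X$ a random variable on $\mathsf{X}$ with distribution associated with $\mu$ we use the notation $X\sim\mu(\cdot)$.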

(b) Diffusion process

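Let $\theta\in\Theta\subseteq\mathbb{R}^{d_\theta}$ be fixed and consider a diffusion process on the probability space $(\Omega,\mathcal{F},\{\mathcal{F}_t\}_{t\ge 0},\mathbb{P}_\theta)$, such that

$$dX_t=a_\theta(X_t)\,dt+\sigma(X_t)\,dW_t,\qquad X_0=x\in\mathbb{R}^d, \tag{2.1}$$

where $X_t\in\mathbb{R}^d$, $X_0=x$ with $x$ given, $a:\Theta\times\mathbb{R}^d\to\mathbb{R}^d$ is the drift term, $\sigma:\mathbb{R}^d\to\mathbb{R}^{d\times d}$ is the diffusion coefficient and $\{W_t\}_{t\ge 0}$ is a standard $d$-dimensional Brownian motion. We assume that for any fixed $\theta\in\Theta$, $a_\theta^{(i)}\in\mathcal{C}^2(\mathbb{R}^d)$ and $\sigma^{(ij)}\in\mathcal{C}^2(\mathbb{R}^d)$ for $(i,j)\in\{1,\dots,d\}^2$. For fixed $x\in\mathbb{R}^d$, we have $a_\theta(x)^{(i)}\in\mathcal{C}(\Theta)$ for $i\in\{1,\dots,d\}$.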

Furthermore, we make the following additional assumption, termed (D1).

  • (i)

    Uniform ellipticity: $\Sigma(x):=\sigma(x)\sigma(x)^{\top}$ is uniformly positive definite over $x\in\mathbb{R}^d$.

  • (ii)
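    Globally Lipschitz: for any $\theta\in\Theta$, there exists a positive constant $C<\infty$ such that
    $$|a_\theta(x)^{(i)}-a_\theta(x')^{(i)}|+|\sigma(x)^{(ij)}-\sigma(x')^{(ij)}|\le C\|x-x'\|_2,$$
    for all $(x,x')\in\mathbb{R}^d\times\mathbb{R}^d$, $(i,j)\in\{1,\dots,d\}^2$.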

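Let $0<t_1<\cdots<t_n=T$ be a given collection of time points. Following [22], by the use of Girsanov's Theorem, for any $\mathbb{P}_\theta$-integrable $\varphi:\Theta\times\mathbb{R}^{nd}\to\mathbb{R}$,

$$\mathbb{E}_\theta[\varphi_\theta(X_{t_1},\dots,X_{t_n})]=\mathbb{E}_{\mathbb{Q}}\left[\varphi_\theta(X_{t_1},\dots,X_{t_n})\frac{d\mathbb{P}_\theta}{d\mathbb{Q}}(\mathbf{X}_T)\right], \tag{2.2}$$

where $\mathbb{E}_\theta$ denotes the expectation w.r.t. $\mathbb{P}_\theta$, we set $\mathbf{X}_T=\{X_t\}_{t\in[0,T]}$, and the change of measure is given by

$$\frac{d\mathbb{P}_\theta}{d\mathbb{Q}}(\mathbf{X}_T)=\exp\left\{-\frac{1}{2}\int_0^T\|b_\theta(X_s)\|_2^2\,ds+\int_0^Tb_\theta(X_s)^{\top}dW_s\right\},$$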

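with $b_\theta(x)=\Sigma(x)^{-1}\sigma(x)^{\top}a_\theta(x)$ a $d$-dimensional vector. Below, we consider a change of measure to the law $\mathbb{Q}$, which is induced by $dX_t=\sigma(X_t)\,dW_t$, where $X_t$ solves such a process. Since $\mathbb{P}_\theta$ and $\mathbb{Q}$ are equivalent, by Girsanov's Theorem we can write

$$\rho_\theta(\mathbf{X}_T)=\varphi_\theta(X_{t_1:t_n})\frac{d\mathbb{P}_\theta}{d\mathbb{Q}}(\mathbf{X}_T),$$

where the corresponding Radon–Nikodym derivative is

$$\frac{d\mathbb{P}_\theta}{d\mathbb{Q}}(\mathbf{X}_T)=\exp\left\{-\frac{1}{2}\int_0^T\|b_\theta(X_s)\|_2^2\,ds+\int_0^Tb_\theta(X_s)^{\top}\Sigma(X_s)^{-1}\sigma(X_s)^{\top}dX_s\right\}.$$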

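Now if we assume that $\varphi_\theta$ is differentiable w.r.t. $\theta$, then one has for $i\in\{1,\dots,d_\theta\}$

$$G_\theta^{(i)}:=\frac{\partial}{\partial\theta^{(i)}}\big(\log\{\mathbb{E}_\theta[\varphi_\theta(X_{t_1},\dots,X_{t_n})]\}\big)=\mathbb{E}_{\overline{\mathbb{P}}_\theta}\left[\frac{\partial}{\partial\theta^{(i)}}\big(\log\{\rho_\theta(\mathbf{X}_T)\}\big)\right], \tag{2.3}$$

where $\overline{\mathbb{P}}_\theta=\varphi_\theta\,\mathbb{P}_\theta/\mathbb{P}_\theta(\varphi_\theta)$, with $\mathbb{P}_\theta(\varphi_\theta)=\mathbb{E}_\theta[\varphi_\theta(X_{t_1},\dots,X_{t_n})]$, and $\mathbb{P}_\theta$ is the law of the diffusion process (1.1). From herein, we will use the short-hand notation $\varphi_\theta(X_{t_1},\dots,X_{t_n})=\varphi_\theta(X_{t_1:t_n})$ and also set, for $i\in\{1,\dots,d_\theta\}$,

$$G_\theta(\mathbf{X}_T)^{(i)}=\frac{\partial}{\partial\theta^{(i)}}\big(\log\{\rho_\theta(\mathbf{X}_T)\}\big).$$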

(c) Hessian expression

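Given the expression (2.3), our objective is now to write the matrix of second derivatives, for $(i,j)\in\{1,\dots,d_\theta\}^2$,

$$H_\theta^{(ij)}:=\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(\log\{\mathbb{E}_\theta[\varphi_\theta(X_{t_1:t_n})]\}\big),$$

in terms of expectations w.r.t. $\overline{\mathbb{P}}_\theta$.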

We have the following simple calculation:

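$$\begin{aligned}
\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(\log\{\mathbb{E}_\theta[\varphi_\theta(X_{t_1:t_n})]\}\big)
&=\frac{\partial}{\partial\theta^{(i)}}\left(\frac{\frac{\partial}{\partial\theta^{(j)}}\{\mathbb{E}_\theta[\varphi_\theta(X_{t_1:t_n})]\}}{\mathbb{E}_\theta[\varphi_\theta(X_{t_1:t_n})]}\right)\\
&=\frac{\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\{\mathbb{E}_\theta[\varphi_\theta(X_{t_1:t_n})]\}}{\mathbb{E}_\theta[\varphi_\theta(X_{t_1:t_n})]}-\frac{\frac{\partial}{\partial\theta^{(i)}}\{\mathbb{E}_\theta[\varphi_\theta(X_{t_1:t_n})]\}\,\frac{\partial}{\partial\theta^{(j)}}\{\mathbb{E}_\theta[\varphi_\theta(X_{t_1:t_n})]\}}{\mathbb{E}_\theta[\varphi_\theta(X_{t_1:t_n})]^2}\\
&=:T_1-T_2.
\end{aligned}$$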

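Under relatively weak conditions, one can express $T_1$ and $T_2$ as

$$T_1=\mathbb{E}_{\overline{\mathbb{P}}_\theta}\left[\frac{\partial\log\{\rho_\theta(\mathbf{X}_T)\}}{\partial\theta^{(i)}}\frac{\partial\log\{\rho_\theta(\mathbf{X}_T)\}}{\partial\theta^{(j)}}\right]+\mathbb{E}_{\overline{\mathbb{P}}_\theta}\left[\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(\log\{\rho_\theta(\mathbf{X}_T)\}\big)\right]$$

and

$$T_2=\mathbb{E}_{\overline{\mathbb{P}}_\theta}\left[\frac{\partial\log\{\rho_\theta(\mathbf{X}_T)\}}{\partial\theta^{(i)}}\right]\mathbb{E}_{\overline{\mathbb{P}}_\theta}\left[\frac{\partial\log\{\rho_\theta(\mathbf{X}_T)\}}{\partial\theta^{(j)}}\right].$$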

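Therefore, since $H_\theta^{(ij)}=T_1-T_2$, we have the following expression:

$$H_\theta^{(ij)}=-\,\mathbb{E}_{\overline{\mathbb{P}}_\theta}\left[\frac{\partial\log\{\rho_\theta(\mathbf{X}_T)\}}{\partial\theta^{(i)}}\right]\mathbb{E}_{\overline{\mathbb{P}}_\theta}\left[\frac{\partial\log\{\rho_\theta(\mathbf{X}_T)\}}{\partial\theta^{(j)}}\right]+\mathbb{E}_{\overline{\mathbb{P}}_\theta}\left[\frac{\partial\log\{\rho_\theta(\mathbf{X}_T)\}}{\partial\theta^{(i)}}\frac{\partial\log\{\rho_\theta(\mathbf{X}_T)\}}{\partial\theta^{(j)}}\right]+\mathbb{E}_{\overline{\mathbb{P}}_\theta}\left[\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(\log\{\rho_\theta(\mathbf{X}_T)\}\big)\right]. \tag{2.4}$$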

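Defining, for $(i,j)\in\{1,\dots,d_\theta\}^2$,

$$H_\theta(\mathbf{X}_T)^{(ij)}:=\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(\log\{\rho_\theta(\mathbf{X}_T)\}\big),$$

one can write more succinctly

$$H_\theta^{(ij)}=-\,\mathbb{E}_{\overline{\mathbb{P}}_\theta}[G_\theta(\mathbf{X}_T)^{(i)}]\,\mathbb{E}_{\overline{\mathbb{P}}_\theta}[G_\theta(\mathbf{X}_T)^{(j)}]+\mathbb{E}_{\overline{\mathbb{P}}_\theta}[G_\theta(\mathbf{X}_T)^{(i)}G_\theta(\mathbf{X}_T)^{(j)}]+\mathbb{E}_{\overline{\mathbb{P}}_\theta}[H_\theta(\mathbf{X}_T)^{(ij)}].$$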
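The identity can be sanity-checked numerically on a toy model where every expectation is a finite sum. The sketch below is an illustration under stated assumptions, not part of the paper's experiments: the three-state latent variable, the weights `w` and the functions `s`, `c` are made up, with $\rho_\theta(x)=w_x\exp(\theta s_x-\tfrac12\theta^2c_x)$ and a scalar $\theta$. It uses the sign convention $T_1-T_2$, i.e. second moment of the score plus expected second derivative minus the squared mean of the score, and compares it with a central finite difference of $\log\sum_x\rho_\theta(x)$.

```python
import math

# Made-up three-state latent model: rho_theta(x) = w[x] * exp(theta*s[x] - 0.5*theta**2*c[x]).
w = [0.2, 0.5, 0.3]
s = [-1.0, 0.4, 2.0]
c = [0.5, 1.0, 1.5]


def log_evidence(theta):
    """log of sum_x rho_theta(x), the analogue of log E_theta[phi_theta]."""
    return math.log(sum(wx * math.exp(theta * sx - 0.5 * theta ** 2 * cx)
                        for wx, sx, cx in zip(w, s, c)))


def hessian_by_identity(theta):
    """E[G^2] + E[H] - E[G]^2 under the normalised measure proportional to rho_theta."""
    rho = [wx * math.exp(theta * sx - 0.5 * theta ** 2 * cx)
           for wx, sx, cx in zip(w, s, c)]
    total = sum(rho)
    post = [r / total for r in rho]                  # plays the role of P-bar_theta
    g = [sx - theta * cx for sx, cx in zip(s, c)]    # d/dtheta of log rho_theta
    h = [-cx for cx in c]                            # second derivative of log rho_theta
    e_g = sum(p * gx for p, gx in zip(post, g))
    e_gg = sum(p * gx * gx for p, gx in zip(post, g))
    e_h = sum(p * hx for p, hx in zip(post, h))
    return e_gg + e_h - e_g * e_g


theta, eps = 0.7, 1e-4
finite_diff = (log_evidence(theta + eps) - 2.0 * log_evidence(theta)
               + log_evidence(theta - eps)) / eps ** 2
identity = hessian_by_identity(theta)
```

The two numbers `finite_diff` and `identity` should agree up to the finite-difference error, which is the scalar analogue of $H_\theta^{(ij)}=T_1-T_2$.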

(d) Stochastic model

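Consider a sequence of random variables $(Y_{t_1},\dots,Y_{t_n})$ where $0<t_1<\cdots<t_n=T$ and $Y_{t_p}\in\mathbb{R}^{d_y}$, which are assumed to have the following joint conditional Lebesgue density

$$p_\theta(y_{t_1},\dots,y_{t_n}\mid\{x_s\}_{0\le s\le T})=\prod_{p=1}^ng_\theta(y_{t_p}\mid x_{t_p}),$$

where $g:\Theta\times\mathbb{R}^d\times\mathbb{R}^{d_y}\to\mathbb{R}^+$ and, for any $(\theta,x)\in\Theta\times\mathbb{R}^d$, $\int_{\mathbb{R}^{d_y}}g_\theta(y\mid x)\,dy=1$, with $dy$ the Lebesgue measure. Now if one considers instead realizations of the random variables $(Y_{t_1},\dots,Y_{t_n})$ in the conditioning of the joint density, we have a state-space model with marginal likelihood

$$p_\theta(y_{t_1},\dots,y_{t_n}):=\mathbb{E}_\theta\left[\prod_{p=1}^ng_\theta(y_{t_p}\mid X_{t_p})\right].$$

Note that the framework to be investigated in this article is not restricted to this special case, but we shall focus on it for the rest of the paper. So to clarify, $\varphi_\theta(x_{t_1},\dots,x_{t_n})=\prod_{p=1}^ng_\theta(y_{t_p}\mid x_{t_p})$ from herein.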

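In reference to (2.3) and (2.4), we have that

$$\frac{\partial\log\{\rho_\theta(\mathbf{X}_T)\}}{\partial\theta^{(i)}}=\sum_{p=1}^n\frac{\partial}{\partial\theta^{(i)}}\big(\log\{g_\theta(y_{t_p}\mid x_{t_p})\}\big)-\frac{1}{2}\int_0^T\frac{\partial}{\partial\theta^{(i)}}\big(\|b_\theta(X_s)\|_2^2\big)\,ds+\frac{\partial}{\partial\theta^{(i)}}\left(\int_0^Tb_\theta(X_s)^{\top}\Sigma(X_s)^{-1}\sigma(X_s)^{\top}dX_s\right)$$

and

$$\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(\log\{\rho_\theta(\mathbf{X}_T)\}\big)=\sum_{p=1}^n\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(\log\{g_\theta(y_{t_p}\mid x_{t_p})\}\big)-\frac{1}{2}\int_0^T\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(\|b_\theta(X_s)\|_2^2\big)\,ds+\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\left(\int_0^Tb_\theta(X_s)^{\top}\Sigma(X_s)^{-1}\sigma(X_s)^{\top}dX_s\right).$$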

(e) Time-discretization

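From herein, we take the simplification that $t_p=p$, $p\in\{1,\dots,n\}$, $T=n$. Let $l\in\mathbb{N}_0$ be a given level of discretization, and consider the Euler discretization of step size $\Delta_l=2^{-l}$, for $k\in\{1,2,\dots,\Delta_l^{-1}T\}$ with $\tilde{X}_0=x$:

$$\tilde{X}_{k\Delta_l}=\tilde{X}_{(k-1)\Delta_l}+a_\theta(\tilde{X}_{(k-1)\Delta_l})\Delta_l+\sigma(\tilde{X}_{(k-1)\Delta_l})[W_{k\Delta_l}-W_{(k-1)\Delta_l}]. \tag{2.5}$$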
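To make the role of the level $l$ concrete, the following sketch runs the Euler recursion at levels $l$ and $l-1$ on the same Brownian increments, so that the coarse path uses the sum of two consecutive fine increments. This is the usual multilevel way of coupling two discretization levels and is shown only as an illustration; it is not a reproduction of the paper's algorithms. The function name `coupled_euler_ou`, the scalar drift $-\theta_1x$ with constant $\sigma$, and all numerical values are assumptions.

```python
import math
import random


def coupled_euler_ou(theta1, sigma, x0, t_end, level, seed=0):
    """Euler paths at levels `level` and `level - 1` driven by shared Brownian increments."""
    if level < 1:
        raise ValueError("level must be at least 1 to have a coarser partner")
    rng = random.Random(seed)
    dt_fine = 2.0 ** (-level)
    n_fine = int(round(t_end / dt_fine))
    fine, coarse = [x0], [x0]
    pending = 0.0  # sum of fine increments not yet used by the coarse path
    for k in range(1, n_fine + 1):
        dw = rng.gauss(0.0, math.sqrt(dt_fine))
        xf = fine[-1]
        fine.append(xf - theta1 * xf * dt_fine + sigma * dw)
        pending += dw
        if k % 2 == 0:
            # One coarse step of size 2 * dt_fine uses the two fine increments.
            xc = coarse[-1]
            coarse.append(xc - theta1 * xc * 2.0 * dt_fine + sigma * pending)
            pending = 0.0
    return fine, coarse


fine_path, coarse_path = coupled_euler_ou(theta1=0.5, sigma=0.5, x0=1.0,
                                          t_end=1.0, level=8)
gap = abs(fine_path[-1] - coarse_path[-1])
```

Because both paths see the same noise, `gap` is small, which is what keeps the variance of level differences under control in multilevel and randomized schemes.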

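Set $x_T^l=(x,\tilde{x}_{\Delta_l},\dots,\tilde{x}_T)$. We then consider the vector-valued function $G^l:\Theta\times(\mathbb{R}^d)^{\Delta_l^{-1}T+1}\to\mathbb{R}^{d_\theta}$ and the matrix-valued function $H^l:\Theta\times(\mathbb{R}^d)^{\Delta_l^{-1}T+1}\to\mathbb{R}^{d_\theta\times d_\theta}$ defined as, for $(i,j)\in\{1,\dots,d_\theta\}^2$,

$$G_\theta^l(x_T^l)^{(i)}=\sum_{p=1}^n\frac{\partial}{\partial\theta^{(i)}}\big(\log\{g_\theta(y_p\mid\tilde{x}_p)\}\big)-\frac{\Delta_l}{2}\sum_{k=0}^{\Delta_l^{-1}T-1}\frac{\partial}{\partial\theta^{(i)}}\big(\|b_\theta(\tilde{x}_{k\Delta_l})\|_2^2\big)+\sum_{k=0}^{\Delta_l^{-1}T-1}\frac{\partial}{\partial\theta^{(i)}}\big(b_\theta(\tilde{x}_{k\Delta_l})^{\top}\Sigma(\tilde{x}_{k\Delta_l})^{-1}\sigma(\tilde{x}_{k\Delta_l})^{\top}[\tilde{x}_{(k+1)\Delta_l}-\tilde{x}_{k\Delta_l}]\big)$$

and

$$H_\theta^l(x_T^l)^{(ij)}=\sum_{p=1}^n\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(\log\{g_\theta(y_p\mid\tilde{x}_p)\}\big)-\frac{\Delta_l}{2}\sum_{k=0}^{\Delta_l^{-1}T-1}\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(\|b_\theta(\tilde{x}_{k\Delta_l})\|_2^2\big)+\sum_{k=0}^{\Delta_l^{-1}T-1}\frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(j)}}\big(b_\theta(\tilde{x}_{k\Delta_l})^{\top}\Sigma(\tilde{x}_{k\Delta_l})^{-1}\sigma(\tilde{x}_{k\Delta_l})^{\top}[\tilde{x}_{(k+1)\Delta_l}-\tilde{x}_{k\Delta_l}]\big).$$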

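Then, noting (2.4), we have an Euler approximation of the Hessian

$$H_\theta^{l,(ij)}:=-\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})G_\theta^l(X_T^l)^{(i)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}\,\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})G_\theta^l(X_T^l)^{(j)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}+\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})G_\theta^l(X_T^l)^{(i)}G_\theta^l(X_T^l)^{(j)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}+\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})H_\theta^l(X_T^l)^{(ij)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}.$$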

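In the context of the model in §2d, if one sets

$$\pi_\theta^l(dx_T^l)\propto\left\{\prod_{p=1}^ng_\theta(y_p\mid\tilde{x}_p)\,p_\theta^l(\tilde{x}_{p-1}\mid\tilde{x}_p)\right\}dx_T^l,$$

where $p_\theta^l$ is the transition density induced by the discretized diffusion process (2.5) (over unit time), and we use the abuse of notation that $dx_T^l$ is the Lebesgue measure on the coordinates $(\tilde{x}_{\Delta_l},\dots,\tilde{x}_T)$, then one has that

$$H_\theta^{l,(ij)}=-\,\pi_\theta^l(G_\theta^{l,(i)})\,\pi_\theta^l(G_\theta^{l,(j)})+\pi_\theta^l(G_\theta^{l,(i)}G_\theta^{l,(j)})+\pi_\theta^l(H_\theta^{l,(ij)}), \tag{2.6}$$

where we are using the short-hand $G_\theta^{l,(i)}=G_\theta^l(X_T^l)^{(i)}$ and $H_\theta^{l,(ij)}=H_\theta^l(X_T^l)^{(ij)}$ etc.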

We have the following result, whose proof and assumption (D2) are in appendix A.

Proposition 2.1. —

Assume (D1-D2). Then for any $(i,j)\in\{1,\dots,d_\theta\}^2$, we have

$$\lim_{l\to\infty}H_\theta^{l,(ij)}=H_\theta^{(ij)}.$$

The main strategy of the proof is by strong convergence, which means that one can characterize an upper bound on $|H_\theta^{l,(ij)}-H_\theta^{(ij)}|$ of $O(\Delta_l^{1/2})$, but that rate is most likely not sharp, as one expects $O(\Delta_l)$.

3. Algorithm

The objective of this section, using only approximations of (2.6), is to obtain an unbiased estimate of $H_\theta^{(ij)}$ for any fixed $\theta\in\Theta$ and $(i,j)\in\{1,\dots,d_\theta\}^2$. Our approach is essentially an application of the methodology in [22] and so we provide a review of that approach in the sequel.

(a) Strategy

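To focus our description, we shall suppose that we are interested in computing an unbiased estimate of $G_\theta^{(i)}$ for some fixed $i$; we remark that this specialization is not needed and is only used for notational convenience. An Euler approximation of $G_\theta^{(i)}$ is $\pi_\theta^l(G_\theta^{l,(i)})=:G_\theta^{l,(i)}$, where the symbol on the right-hand side is overloaded to denote the expectation of the function $G_\theta^{l,(i)}$ under $\pi_\theta^l$. To further simplify the notation, we will simply write $G_\theta^l$ instead of $G_\theta^{l,(i)}$.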

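Suppose that one can construct a sequence of random variables $(\hat{\pi}_\theta^l(G_\theta^l))_{l\in\mathbb{N}_0}$ on a potentially extended probability space with expectation operator $\overline{\mathbb{E}}_\theta$, such that for each $l\in\mathbb{N}_0$, $\overline{\mathbb{E}}_\theta[\hat{\pi}_\theta^l(G_\theta^l)]=\pi_\theta^l(G_\theta^l)$. Moreover, consider the independent sequence of random variables $(\Xi_\theta^l)_{l\in\mathbb{N}_0}$, which are constructed so that for $l\in\mathbb{N}_0$

$$\overline{\mathbb{E}}_\theta[\Xi_\theta^l]:=\overline{\mathbb{E}}_\theta[\hat{\pi}_\theta^l(G_\theta^l)]-\overline{\mathbb{E}}_\theta[\hat{\pi}_\theta^{l-1}(G_\theta^{l-1})]=\pi_\theta^l(G_\theta^l)-\pi_\theta^{l-1}(G_\theta^{l-1}), \tag{3.1}$$

with $\overline{\mathbb{E}}_\theta[\hat{\pi}_\theta^{-1}(G_\theta^{-1})]:=\pi_\theta^{-1}(G_\theta^{-1}):=0$. Now let $\mathbb{P}_L$ be a positive probability mass function on $\mathbb{N}_0$ and set $\overline{\mathbb{P}}_L(l)=\sum_{p=l}^{\infty}\mathbb{P}_L(p)$. Now if

$$\sum_{l\in\mathbb{N}_0}\frac{1}{\overline{\mathbb{P}}_L(l)}\left\{\overline{\mathbb{V}\mathrm{ar}}_\theta[\Xi_\theta^l]+\big(G_\theta^{l,(i)}-G_\theta^{(i)}\big)^2\right\}<+\infty, \tag{3.2}$$

and one samples $L$ from $\mathbb{P}_L$ independently of the sequence $(\Xi_\theta^l)_{l\in\mathbb{N}_0}$, then by e.g. ([17], Theorem 5) the estimate

$$\hat{G}_\theta^{(i)}:=\sum_{l=0}^{L}\frac{\Xi_\theta^l}{\overline{\mathbb{P}}_L(l)} \tag{3.3}$$

is an unbiased and finite variance estimator of $G_\theta^{(i)}$. Before describing our approach in fuller detail, which requires numerous techniques and methodologies, we first present our main result, which is an unbiasedness result for our estimator of the Hessian. This is given through the following proposition.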
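The mechanics of (3.3) can be illustrated on a toy problem where the increments are deterministic, so that only the randomization over $L$ is at play. In the sketch below, which is an illustration under stated assumptions rather than the paper's estimator, the biased sequence is $a_l=1-2^{-l}$ with limit $1$, the increments are $a_l-a_{l-1}$ (with $a_{-1}:=0$), and $\mathbb{P}_L$ is geometric with hypothetical parameter `q`, so that $\overline{\mathbb{P}}_L(l)=q^l$. The function names are made up for this example.

```python
import math
import random


def sample_level(q, rng):
    """Draw L with P(L >= l) = q**l, i.e. P_L(l) = (1 - q) * q**l."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.floor(math.log(u) / math.log(q)))


def randomized_sum(increment, q, rng):
    """One draw of sum_{l=0}^{L} increment(l) / P(L >= l)."""
    level = sample_level(q, rng)
    return sum(increment(l) / q ** l for l in range(level + 1))


def toy_increment(l):
    # Increments of the biased sequence a_l = 1 - 2**-l, whose limit is 1.
    return 0.0 if l == 0 else 2.0 ** (-l)


rng = random.Random(1)
draws = [randomized_sum(toy_increment, 0.7, rng) for _ in range(20000)]
estimate = sum(draws) / len(draws)
```

Every individual $a_l$ is biased for the limit, yet the average of the randomized sums targets the limit itself, because each increment is reweighted by the probability that it is included. With $q=0.7$ the summability condition of the type (3.2) holds for this toy sequence, since the squared bias decays like $4^{-l}$.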

Proposition 3.1. —

Assume (D1-D2). Then there exist choices of $\mathbb{P}_L$ so that (3.9) is an unbiased and finite variance estimator of $H_\theta^{(ij)}$ for each $(i,j)\in D_\theta$.

Proof. —

This is the same as ([22], theorem 2), except one must repeat the arguments of that paper given lemma A.3 in the appendix and given the rate in the proof of proposition 2.1. Since the arguments and calculations are almost identical, they are omitted in their entirety.

The main point is that the choice of $\mathbb{P}_L$ is as in [22], which is: in the case that $\sigma$ is constant, $\mathbb{P}_L(l)\propto\Delta_l(l+1)\log_2(2+l)^2$, and in the non-constant case, $\mathbb{P}_L(l)\propto\Delta_l^{1/2}(l+1)\log_2(2+l)^2$; both choices achieve finite variance and costs to achieve an error of $O(\epsilon)$ with high probability as in ([16], propositions 4 and 5).

The main issue is to construct the sequence of independent random variables $(\Xi_\theta^l)_{l\in\mathbb{N}_0}$ such that (3.1) and (3.2) hold and such that the expected computational cost for doing so is not unreasonable as a function of $\Delta_l$: a method for doing this is in [22], as we will now describe.

(b) Computing $\Xi_\theta^0$

The computation of $\Xi_\theta^0$ is performed by using exactly the coupled conditional particle filter (CCPF) that has been introduced in [30]. This is an algorithm which allows one to construct a random variable $\hat{\pi}_\theta^0(G_\theta^0)$ such that $\overline{\mathbb{E}}_\theta[\hat{\pi}_\theta^0(G_\theta^0)]=\pi_\theta^0(G_\theta^0)$ and we will set $\Xi_\theta^0=\hat{\pi}_\theta^0(G_\theta^0)$.

We begin by introducing the Markov kernel $C^l:\Theta\times\mathsf{X}^l\to\mathcal{P}(\mathsf{X}^l)$ in algorithm 1. To that end, we will use the notation $x_{\Delta_l:k}^{i,l}\in(\mathbb{R}^d)^{k\Delta_l^{-1}}$, where $l\in\mathbb{N}_0$ is the level of discretization, $i\in\{1,\dots,N\}$ is a particle (sample) indicator, $k\in\{1,\dots,n\}$ is a time parameter and $x_{\Delta_l:k}^{i,l}=(x_{\Delta_l}^{i,l},x_{2\Delta_l}^{i,l},\dots,x_k^{i,l})$. The kernel described in algorithm 1 is called the CPF, as developed in [27], and allows one to generate, under minor conditions, an ergodic Markov chain of invariant measure $\pi_\theta^l$. By itself, it does not provide unbiased estimates of expectations w.r.t. $\pi_\theta^l$, unless $\pi_\theta^l$ is the initial distribution of the Markov chain. However, the kernel will be of use in our subsequent discussion.

Our approach generates a Markov chain $\{Z_m\}_{m\in\mathbb{N}_0}$ on the space $\mathsf{Z}^0:=\mathsf{X}^0\times\mathsf{X}^0$, $Z_m\in\mathsf{Z}^0$. In order to describe how one can simulate this Markov chain, we introduce several objects which will be needed. The first of these is the kernel $\check{p}^l:\Theta\times\mathbb{R}^{2d}\to\mathcal{P}(\mathbb{R}^{2d})$, which we need in the case $l=0$ and whose simulation is described in algorithm 2. The Markov kernel is used to simulate intermediate points from $x_{k-1}$ to the next observation at time $k$, with time step $\Delta_l$. We will also need to simulate the maximal coupling of two probability mass functions on $\{1,\dots,N\}$, for some $N\in\mathbb{N}$, and this is described in algorithm 3.

Remark 3.2. —

Step 4 of algorithm 3 can be modified to the case where one generates the pair $(i,j)\in\{1,\dots,N\}^2$ from any coupling of the two probability mass functions $(r_4,r_5)$. In our simulations in §4, we will do this by sampling by inversion from $(r_4,r_5)$, using the same uniform random variable. However, to simplify the mathematical analysis that we will give in the appendix, we consider exactly algorithm 3 in our calculations.
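Both couplings mentioned here can be sketched for generic probability mass functions. The code below shows the textbook maximal coupling construction (sample from the overlap with probability equal to its mass, otherwise sample independently from the two residuals) and the common-uniform inversion alternative. It is a minimal sketch: it does not reproduce algorithm 3 step by step, the function names are hypothetical, and indices are 0-based rather than in $\{1,\dots,N\}$.

```python
import random


def sample_index(weights, u):
    """Inverse-CDF sample from normalised weights using the uniform draw u."""
    total = 0.0
    for index, weight in enumerate(weights):
        total += weight
        if u < total:
            return index
    return len(weights) - 1  # guard against rounding in the cumulative sum


def maximal_coupling(p, q, rng):
    """Draw (i, j) with i ~ p, j ~ q and P(i == j) as large as possible."""
    overlap = [min(a, b) for a, b in zip(p, q)]
    alpha = sum(overlap)  # probability that the two indices coincide
    if rng.random() < alpha:
        i = sample_index([o / alpha for o in overlap], rng.random())
        return i, i
    residual_p = [(a - o) / (1.0 - alpha) for a, o in zip(p, overlap)]
    residual_q = [(b - o) / (1.0 - alpha) for b, o in zip(q, overlap)]
    return (sample_index(residual_p, rng.random()),
            sample_index(residual_q, rng.random()))


def common_uniform_coupling(p, q, rng):
    """Draw (i, j) by inverting both CDFs with the same uniform."""
    u = rng.random()
    return sample_index(p, u), sample_index(q, u)


rng = random.Random(0)
p, q = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
pairs = [maximal_coupling(p, q, rng) for _ in range(20000)]
meeting_frequency = sum(i == j for i, j in pairs) / len(pairs)
```

For these two made-up mass functions the overlap mass is $0.2+0.3+0.2=0.7$, so `meeting_frequency` should be close to that value. When the two mass functions are identical, both couplings return equal indices, which is the property that lets coupled trajectories stay together once they have met.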

To describe the CCPF kernel, we must first introduce a driving CCPF, which is presented in algorithm 4. The driving CCPF is nothing more than an ordinary coupled particle filter, except the final pair of trajectories is 'frozen' as is given in the algorithm (that is, $(x_{1:n},\bar{x}_{1:n})$ as in step 1 of algorithm 4) and allowed to interact with the rest of the particle system. Given the ingredients in algorithms 2–4, we are now in a position to describe the CCPF kernel, which is a Markov kernel $K^0:\Theta\times\mathsf{Z}^0\to\mathcal{P}(\mathsf{Z}^0)$, whose simulation is presented in algorithm 5. We will consider the Markov chain $\{Z_m\}_{m\in\mathbb{N}_0}$, $Z_m=(X_{1:n}(m),\bar{X}_{1:n}(m))$, generated by the CCPF kernel in algorithm 5 and with initial distribution

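$$\nu_\theta^0\big(d(x_{1:n},\bar{x}_{1:n})\big)=\int_{\mathsf{Z}^0}\left(\prod_{k=1}^np_\theta^0(x'_{k-1},x'_k)\right)\left(\prod_{k=1}^np_\theta^0(\bar{x}'_{k-1},\bar{x}'_k)\right)C_\theta^0(x'_{1:n},dx_{1:n})\,\delta_{\{\bar{x}'_{1:n}\}}(d\bar{x}_{1:n})\,d(x'_{1:n},\bar{x}'_{1:n}), \tag{3.4}$$

where primes mark the variables that are integrated out and $x'_0=\bar{x}'_0=x$.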

We remark that in algorithm 5, marginally, $x_{1:n}^i$ (resp. $\bar{x}_{1:n}^j$) has been generated according to $C_\theta^0(x_{1:n},\cdot)$ (resp. $C_\theta^0(\bar{x}_{1:n},\cdot)$). A rather important point is that if the two input trajectories in step 1 of algorithm 5 are equal, i.e. $x_{1:n}=\bar{x}_{1:n}$, then the output trajectories will also be equal. To that end, define the stopping time associated with the given Markov chain

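$$\tau^0=\inf\{m\ge 1:X_{1:n}(m)=\bar{X}_{1:n}(m)\}.$$

Then, setting $m^\star\in\{2,3,\dots\}$, one has the following estimator:

$$\hat{\pi}_\theta^0(G_\theta^0):=G_\theta^0(X_{1:n}(m^\star))+\sum_{m=m^\star+1}^{\tau^0-1}\{G_\theta^0(X_{1:n}(m))-G_\theta^0(\bar{X}_{1:n}(m))\}, \tag{3.5}$$

and one sets $\Xi_\theta^0=\hat{\pi}_\theta^0(G_\theta^0)$. The procedure for computing $\Xi_\theta^0$ is summarized in algorithm 6.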

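(c) Computing $(\Xi_\theta^l)_{l\in\mathbb{N}}$

We are now concerned with the task of computing $(\Xi_\theta^l)_{l\in\mathbb{N}}$ such that (3.1)–(3.2) are satisfied. Throughout the section $l\in\mathbb{N}$ is fixed. We will generate a Markov chain $\{\check{Z}_m^l\}_{m\in\mathbb{N}_0}$ on the space $\mathsf{Z}^l\times\mathsf{Z}^{l-1}$, where $\mathsf{Z}^l=\mathsf{X}^l\times\mathsf{X}^l$ and $\check{Z}_m^l\in\mathsf{Z}^l\times\mathsf{Z}^{l-1}$. In order to construct our Markov chain kernel, as in the previous section, we will need to provide some algorithms. We begin with the Markov kernel $\check{q}^l:\Theta\times\mathbb{R}^{4d}\to\mathcal{P}(\mathbb{R}^{\Delta_l^{-1}2d}\times\mathbb{R}^{\Delta_{l-1}^{-1}2d})$, which will be needed and whose simulation is described in algorithm 7. We will also need to sample a coupling for four probability mass functions on $\{1,\dots,N\}$ and this is presented in algorithm 8.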

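To continue onwards, we will consider a generalization of that in algorithm 4. The driving CCPF at level $l$ is described in algorithm 10. Now given algorithms 7–10, we are in a position to give our Markov kernel, $\check{K}^l:\Theta\times\mathsf{Z}^l\times\mathsf{Z}^{l-1}\to\mathcal{P}(\mathsf{Z}^l\times\mathsf{Z}^{l-1})$, which we shall call the coupled-CCPF (C-CCPF) and which is given in algorithm 11. To assist the subsequent discussion, we will introduce the marginal Markov kernel

$$\check{q}_\theta^{(l)}\big([x_0^l,x_0^{l-1}],d[x_{\Delta_l:1}^l,x_{\Delta_{l-1}:1}^{l-1}]\big):=\int_{(\mathbb{R}^d)^{\Delta_l^{-1}}\times(\mathbb{R}^d)^{\Delta_{l-1}^{-1}}}\check{q}_\theta^l\big([(x_0^l,\bar{x}_0^l),(x_0^{l-1},\bar{x}_0^{l-1})],d[(x_{\Delta_l:1}^l,\bar{x}_{\Delta_l:1}^l),(x_{\Delta_{l-1}:1}^{l-1},\bar{x}_{\Delta_{l-1}:1}^{l-1})]\big). \tag{3.6}$$

Given this kernel, one can describe the CCPF at two different levels $l,l-1$ in algorithm 9. Algorithm 9 details a Markov kernel $\check{C}^l:\Theta\times\mathsf{X}^l\times\mathsf{X}^{l-1}\to\mathcal{P}(\mathsf{X}^l\times\mathsf{X}^{l-1})$ which we will use in the initialization of our Markov chain to be described below.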

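We will consider the Markov chain $\{\check{Z}_m^l\}_{m\in\mathbb{N}_0}$, with

$$\check{Z}_m^l=\big((X_{\Delta_l:n}^l(m),\bar{X}_{\Delta_l:n}^l(m)),(X_{\Delta_{l-1}:n}^{l-1}(m),\bar{X}_{\Delta_{l-1}:n}^{l-1}(m))\big),$$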

generated by the C-CCPF kernel in algorithm 11 and with initial distribution

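$$\begin{aligned}
\check{\nu}_\theta^l\big(d[(x_{\Delta_l:n}^l,\bar{x}_{\Delta_l:n}^l),(x_{\Delta_{l-1}:n}^{l-1},\bar{x}_{\Delta_{l-1}:n}^{l-1})]\big)=\int_{\mathsf{Z}^l\times\mathsf{Z}^{l-1}}&\left(\prod_{k=1}^n\check{q}_\theta^{(l)}\big([x_{k-1}^{l,\prime},x_{k-1}^{l-1,\prime}],d[x_{k-1+\Delta_l:k}^{l,\prime},x_{k-1+\Delta_{l-1}:k}^{l-1,\prime}]\big)\right)\\
&\times\left(\prod_{k=1}^n\check{q}_\theta^{(l)}\big([\bar{x}_{k-1}^{l,\prime},\bar{x}_{k-1}^{l-1,\prime}],d[\bar{x}_{k-1+\Delta_l:k}^{l,\prime},\bar{x}_{k-1+\Delta_{l-1}:k}^{l-1,\prime}]\big)\right)\\
&\times\check{C}_\theta^l\big([x_{\Delta_l:n}^{l,\prime},x_{\Delta_{l-1}:n}^{l-1,\prime}],d[x_{\Delta_l:n}^l,x_{\Delta_{l-1}:n}^{l-1}]\big)\\
&\times\delta_{\{\bar{x}_{\Delta_l:n}^{l,\prime},\bar{x}_{\Delta_{l-1}:n}^{l-1,\prime}\}}\big(d[\bar{x}_{\Delta_l:n}^l,\bar{x}_{\Delta_{l-1}:n}^{l-1}]\big),
\end{aligned} \tag{3.7}$$

where primes mark the variables that are integrated out and $x_0^{l,\prime}=x_0^{l-1,\prime}=\bar{x}_0^{l,\prime}=\bar{x}_0^{l-1,\prime}=x$. An important point, as in the case of algorithm 5, is that if the two input trajectories in step 1 of algorithm 11 are equal, i.e. $x_{\Delta_l:n}^l=\bar{x}_{\Delta_l:n}^l$, or $x_{\Delta_{l-1}:n}^{l-1}=\bar{x}_{\Delta_{l-1}:n}^{l-1}$, then the associated output trajectories will also be equal. As before, we define the stopping times associated with the given Markov chain $(\check{Z}_m^l)_{m\in\mathbb{N}_0}$, for $s\in\{l,l-1\}$,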

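$$\tau^s=\inf\{m\ge 1:X_{\Delta_s:n}^s(m)=\bar{X}_{\Delta_s:n}^s(m)\}.$$

Then, setting $m^\star\in\{2,3,\dots\}$, one has the following estimator:

$$\hat{\pi}_\theta^s(G_\theta^s):=G_\theta^s(X_{\Delta_s:n}^s(m^\star))+\sum_{m=m^\star+1}^{\tau^s-1}\{G_\theta^s(X_{\Delta_s:n}^s(m))-G_\theta^s(\bar{X}_{\Delta_s:n}^s(m))\}, \tag{3.8}$$

and one sets $\Xi_\theta^l=\hat{\pi}_\theta^l(G_\theta^l)-\hat{\pi}_\theta^{l-1}(G_\theta^{l-1})$. The procedure for computing $\Xi_\theta^l$ is summarized in algorithm 12.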

(d) Estimate and remarks

Given the commentary above, we are ready to present the procedure for our unbiased estimate of $H_\theta^{(ij)}$ for each $(i,j)\in D_\theta:=\{(i,j)\in\{1,\dots,d_\theta\}^2:i\le j\}$; recall that $H_\theta$ is a symmetric matrix. The two main algorithms we will use (algorithms 6 and 12) are stated in terms of providing $\Xi_\theta^l$ for one specified function $G_\theta^{l,(i)}$ (recall that $i$ was suppressed from the notation). However, the algorithms can be run once and provide an unbiased estimate of $\pi_\theta^l(G_\theta^{l,(i)})-\pi_\theta^{l-1}(G_\theta^{l-1,(i)})$ for every $i\in\{1,\dots,d_\theta\}$, and of $\pi_\theta^l(G_\theta^{l,(i)}G_\theta^{l,(j)})-\pi_\theta^{l-1}(G_\theta^{l-1,(i)}G_\theta^{l-1,(j)})$ and $\pi_\theta^l(H_\theta^{l,(ij)})-\pi_\theta^{l-1}(H_\theta^{l-1,(ij)})$ for every $(i,j)\in D_\theta$. To that end, we will write $\Xi_\theta^l(G_\theta^{l,(i)})$, $\Xi_\theta^l(G_\theta^{l,(i)}G_\theta^{l,(j)})$ and $\Xi_\theta^l(H_\theta^{l,(ij)})$ to denote the appropriate estimators.

Our approach consists of the following steps, repeated for $k\in\{1,\dots,M\}$:

  • (i)

    Generate $(L_k,\tilde{L}_k)$ according to $\mathbb{P}_L\otimes\mathbb{P}_L$.

  • (ii)

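    Compute $\Xi_\theta^{k,0}(G_\theta^{0,(i)})$ for every $i\in\{1,\dots,d_\theta\}$ and $\Xi_\theta^{k,0}(G_\theta^{0,(i)}G_\theta^{0,(j)})$, $\Xi_\theta^{k,0}(H_\theta^{0,(ij)})$ for every $(i,j)\in D_\theta$ using algorithm 6. Independently, compute $\tilde{\Xi}_\theta^{k,0}(G_\theta^{0,(i)})$ for every $i\in\{1,\dots,d_\theta\}$ using algorithm 6.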

  • (iii)

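    If $L_k>0$ then independently for each $l\in\{1,\dots,L_k\}$ and independently of step (ii), calculate $\Xi_\theta^{k,l}(G_\theta^{l,(i)})$ for every $i\in\{1,\dots,d_\theta\}$ and $\Xi_\theta^{k,l}(G_\theta^{l,(i)}G_\theta^{l,(j)})$, $\Xi_\theta^{k,l}(H_\theta^{l,(ij)})$ for every $(i,j)\in D_\theta$ using algorithm 12.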

  • (iv)

    If $\tilde{L}_k>0$ then independently for each $l\in\{1,\dots,\tilde{L}_k\}$ and independently of steps (ii) and (iii), calculate $\tilde{\Xi}_\theta^{k,l}(G_\theta^{l,(i)})$ for every $i\in\{1,\dots,d_\theta\}$ using algorithm 12.

  • (v)
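    Compute for every $(i,j)\in D_\theta$
    $$\hat{H}_\theta^{k,(ij)}=-\left(\sum_{l=0}^{L_k}\frac{\Xi_\theta^{k,l}(G_\theta^{l,(i)})}{\overline{\mathbb{P}}_L(l)}\right)\left(\sum_{l=0}^{\tilde{L}_k}\frac{\tilde{\Xi}_\theta^{k,l}(G_\theta^{l,(j)})}{\overline{\mathbb{P}}_L(l)}\right)+\sum_{l=0}^{L_k}\frac{\Xi_\theta^{k,l}(G_\theta^{l,(i)}G_\theta^{l,(j)})}{\overline{\mathbb{P}}_L(l)}+\sum_{l=0}^{L_k}\frac{\Xi_\theta^{k,l}(H_\theta^{l,(ij)})}{\overline{\mathbb{P}}_L(l)}.$$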

Then our estimator is, for each $(i,j)\in D_\theta$,

$$\hat{H}_\theta^{(ij)}=\frac{1}{M}\sum_{k=1}^M\hat{H}_\theta^{k,(ij)}. \tag{3.9}$$
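The final assembly step can be sketched as plain arithmetic once the level increments are available. The code below is a hedged illustration: the function names and the numbers are hypothetical, the increments are taken as given inputs (in the paper they come from algorithms 6 and 12), and the sign convention is the one implied by $T_1-T_2$ in §2c, namely cross term plus second-derivative term minus the product of two independent score estimates.

```python
def tail_probabilities(pmf):
    """P-bar_L(l) = sum over p >= l of P_L(p), for a pmf with finite support."""
    tails, running = [], 0.0
    for mass in reversed(pmf):
        running += mass
        tails.append(running)
    return tails[::-1]


def weighted_sum(increments, tails):
    """sum_l increments[l] / P-bar_L(l) over the levels 0, ..., L actually sampled."""
    return sum(xi / tails[l] for l, xi in enumerate(increments))


def hessian_entry(xi_gi, xi_gj_tilde, xi_gigj, xi_h, tails):
    """One replicate of the (i, j) entry; xi_gj_tilde comes from the independent run."""
    product_of_scores = weighted_sum(xi_gi, tails) * weighted_sum(xi_gj_tilde, tails)
    return weighted_sum(xi_gigj, tails) + weighted_sum(xi_h, tails) - product_of_scores


def average(replicates):
    """Monte Carlo average over the M independent replicates."""
    return sum(replicates) / len(replicates)


# Made-up numbers: a truncated pmf on levels 0, 1, 2 and increments for L = 1, L-tilde = 0.
tails = tail_probabilities([0.5, 0.25, 0.25])
replicate = hessian_entry(xi_gi=[1.0, 0.5], xi_gj_tilde=[2.0],
                          xi_gigj=[3.0, 0.5], xi_h=[-1.0, -0.25], tails=tails)
```

The two score factors use separate level draws and separate increments, which is what makes the expectation of their product equal to the product of expectations.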

The algorithm and the various settings are described and investigated in detail in [22] as well as enhanced estimators. We do not discuss the methodology further in this work.

Remark 3.3. —

As we will see in the succeeding section, we will compare our methodology which is based on the C-CCPF to that of another methodology, which is the ΔPF, within particle Markov chain Monte Carlo. Specifically, it will be a particle marginal Metropolis Hastings algorithm. We omit such a description of the latter, as we only use it as a comparison, but we refer the reader to [18] for a more concrete description. However, we emphasize that it is only asymptotically unbiased, in relation to the Hessian identity (2.4).

Remark 3.4. —

It is important to emphasize that when the inverse Hessian is required, as it is for Newton methodologies, we can debias both the C-CCPF and the ΔPF. This can be achieved by using the same techniques which are presented in the work of Jasra et al. [21].

4. Numerical experiments

In this section, we demonstrate that our estimate of the Hessian is unbiased through various different experiments. We consider testing this through the study of the variance and bias of the mean square error, while also providing plots related to the Newton-type learning. Our results will be demonstrated on three underlying diffusion processes: a univariate OU process, a multivariate OU process and the FHN model. We compare our methodology to that using the ΔPF instead of the coupled-CCPF within our unbiased estimator.

(a) Ornstein–Uhlenbeck process

Our first set of numerical experiments will be conducted on a univariate OU process, which takes the form

$$dX_t=-\theta_1X_t\,dt+\sigma\,dW_t,\qquad X_0=x_0,$$

where $x_0\in\mathbb{R}^+$ is our initial condition, $\theta_1\in\mathbb{R}$ is a parameter of interest and $\sigma\in\mathbb{R}$ is the diffusion coefficient. For our discrete observations, we assume that we have Gaussian measurement errors, $Y_t\mid X_t\sim g_\theta(\cdot\mid X_t)=\mathcal{N}(X_t,\theta_2)$ for $t\in\{1,\dots,T\}$ and for some $\theta_2\in\mathbb{R}^+$. Our observations will be generated with parameter choices $\theta=(\theta_1,\theta_2)=(0.46,0.38)$, $x_0=1$ and $T=500$. Throughout the simulation, one observation sequence $\{Y_1,Y_2,\dots,Y_T\}$ is used. The true distribution of the observations can be computed analytically, therefore the Hessian is known. In figure 2, we present the surface plots comparing the true Hessian with the estimated Hessian, obtained by the Rhee & Glynn estimator (3.9) truncated at discretization level $L=8$; this is done by letting $\mathbb{P}_L(l)\propto\Delta_l\,\mathbb{I}(l\le L)$. We use $M=10^4$ to obtain the estimated Hessian surface plot. Both surface plots are evaluated at $\theta_1,\theta_2\in\{0.2,0.3,0.4,\dots,1.0\}$. In figure 3, we test the convergence of the bias of the Hessian estimate (3.9) with respect to its truncated discretization level. This essentially tests the result in lemma A.2. We use $L\in\{2,3,4,5,6,7\}$ and plot the bias against $\Delta_L$.

Figure 2.

Experiments for the OU model. (a) True values of the Hessian. (b) Estimated values of the Hessian. (Online version in colour.)

Figure 3.

Experiments for the OU model: bias values of Hessian estimate. (Online version in colour.)

The bias is obtained by using $M=10^4$ i.i.d. samples, and taking the entry-wise difference with the true Hessian. Note that the Hessian estimate here is evaluated at the true parameter choice. As the parameter $\theta$ is two-dimensional, we present four log-log plots where the rate represents the fitted slope of log-scaled bias against log-scaled $\Delta_L$. We observe that the Hessian estimate bias is of order $\Delta_L^{\alpha}$ where $\alpha\in\{0.9629,0.7536,0.7361,0.8949\}$ respectively for the four entries, which verifies our result in lemma A.2. We also compare the wall-clock time cost of obtaining one realization of the Hessian estimate (3.9) with the cost of obtaining one realization of the score estimate (see [22]), both truncated at the same discretization levels $L\in\{2,3,4,5,6,7\}$; here $M=100$. The comparison result is provided in figure 4a.
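The fitted rates quoted here are slopes of a straight-line fit on log-log axes. A minimal sketch of such a fit is given below; the helper name `loglog_slope` is hypothetical, and the bias values are synthetic numbers generated from an exact power law, not the paper's measurements.

```python
import math


def loglog_slope(step_sizes, values):
    """Least-squares slope of log(|values|) against log(step_sizes)."""
    xs = [math.log(d) for d in step_sizes]
    ys = [math.log(abs(v)) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


levels = [2, 3, 4, 5, 6, 7]
step_sizes = [2.0 ** (-l) for l in levels]
# Synthetic data following bias = 3 * Delta**0.9 exactly (made up for illustration).
synthetic_bias = [3.0 * d ** 0.9 for d in step_sizes]
rate = loglog_slope(step_sizes, synthetic_bias)
```

On this synthetic input the fit recovers the exponent used to generate the data. The same function applied to cost measurements would return a negative slope, since cost grows as the step size shrinks.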

Figure 4.

Experiments for the OU model. (a) Cost of Hessian & score estimate. (b) Incremental Hessian estimate variance summed over all entries. (Online version in colour.)

We observe that the cost of obtaining the Hessian estimate is on average three times more expensive than obtaining a score estimate. The reason for this is that we need to simulate three CCPF paths in order to obtain one summand in the Hessian estimate, while to estimate the score function, we need only one path. We also record the fitted slope of log-scaled cost against log-scaled $\Delta_L$ for both estimates; the cost for Hessian estimates is roughly proportional to $\Delta_L^{-0.9}$. To verify the rate obtained in lemma A.3, we compare the variance of the Hessian incremental estimate with respect to discretization level $L\in\{1,2,3,4,5,6\}$. The incremental variance is approximated with the sample variance over $10^3$ repetitions, and we sum over all $2\times 2$ entries and present the log-log plot of the summed variance against $\Delta_L$ in figure 4b. We observe that the incremental variance is of order $\Delta_L^{1.15}$ for the OU process model. This verifies the result obtained in lemma A.3. It is known that when truncated, the Rhee & Glynn method essentially serves as a variance reduction method. As a result, compared to the discrete Hessian estimate (2.6), the truncated Hessian estimate (3.9) will require less cost to achieve the same MSE target.

We present in figure 5a the log-log plot of cost against MSE for the discrete Hessian estimate (2.6) and the Rhee & Glynn (R&G) Hessian estimate (3.9). We observe that (3.9) requires much lower cost for a given MSE target compared to (2.6). For (2.6), the cost is proportional to $O(\epsilon^{-2.974})$ for an MSE target of order $O(\epsilon^2)$, while for (3.9) the cost is proportional to $O(\epsilon^{-2.428})$. The average cost ratio between (2.6) and (3.9) under the same MSE target is 5.605. In figure 5b, we present the log-log plot of cost against MSE for (3.9) and the Hessian estimate obtained by the ΔPF method. We observe that under a similar MSE target, the latter method on average has cost 5.054 times less than (3.9). In figure 6, we present the convergence plots for the stochastic gradient descent (SGD) method with score estimate and the Newton method with score and Hessian estimate. For both methods, the parameter is initialized at $(0.1,0.1)$, and the learning rate for the SGD method is set to 0.002. Our conclusion from figure 5a is that the methodology using R&G has a better rate, relating the MSE to computational cost, which favours our methodology. For figure 5b, we note that the methodology exploiting the ΔPF and our methodology, i.e. 'R&G Hessian', result in the same rate. As we know the former is unbiased, this comparison indicates our methodology is also unbiased, despite being more expensive by 5.054 times. For figure 6, we observe as expected that the Newton method requires fewer iterations (five, compared to 132 for SGD) to reach the true parameters of interest, i.e. $\theta_1$ and $\theta_2$.
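The difference in iteration counts between a first-order update and a Newton update can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the experiment above: it uses the exact score and Hessian of a made-up concave quadratic log-likelihood (matrix `a`), rather than unbiased estimates, and the learning rate, tolerance and helper names are hypothetical. Only the starting point and the target mirror the values quoted in the text.

```python
def score(theta, target, a):
    """Gradient of the toy log-likelihood -0.5 * (theta - target)^T a (theta - target)."""
    d0, d1 = theta[0] - target[0], theta[1] - target[1]
    return (-(a[0][0] * d0 + a[0][1] * d1), -(a[1][0] * d0 + a[1][1] * d1))


def solve_2x2(m, v):
    """Solve m x = v for a 2x2 matrix m by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return ((m[1][1] * v[0] - m[0][1] * v[1]) / det,
            (m[0][0] * v[1] - m[1][0] * v[0]) / det)


def iterations_until_close(step, theta, target, tol=1e-6, max_iter=10000):
    """Number of updates needed before theta is within tol of the target."""
    for iteration in range(1, max_iter + 1):
        theta = step(theta)
        if max(abs(theta[0] - target[0]), abs(theta[1] - target[1])) < tol:
            return iteration
    return max_iter


a = ((4.0, 1.0), (1.0, 2.0))              # made-up curvature matrix
hessian = ((-4.0, -1.0), (-1.0, -2.0))    # Hessian of the toy log-likelihood, equal to -a
target, start, learning_rate = (0.46, 0.38), (0.1, 0.1), 0.1


def gradient_step(theta):
    g = score(theta, target, a)
    return (theta[0] + learning_rate * g[0], theta[1] + learning_rate * g[1])


def newton_step(theta):
    delta = solve_2x2(hessian, score(theta, target, a))
    return (theta[0] - delta[0], theta[1] - delta[1])


gradient_iterations = iterations_until_close(gradient_step, start, target)
newton_iterations = iterations_until_close(newton_step, start, target)
```

On an exactly quadratic objective the Newton update lands on the maximizer in a single step, while the first-order update needs many. With noisy score and Hessian estimates, as in the experiments, more Newton iterations are needed, but the qualitative gap is the same.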

Figure 5.


Hessian estimate cost against summed MSE for the OU model. (Online version in colour.)

Figure 6.


Parameter estimate for the OU model. (a) SGD with score estimate. (b) Newton method with score & Hessian estimate. (Online version in colour.)

(b) . Multivariate Ornstein–Uhlenbeck process model

Our second model of interest is a two-dimensional OU process defined as

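$$\begin{bmatrix} dX_t^{(1)} \\ dX_t^{(2)} \end{bmatrix} = \begin{bmatrix} \theta_1-\theta_2 X_t^{(1)} \\ -\theta_3 X_t^{(2)} \end{bmatrix} dt + \begin{bmatrix} \sigma_1 \\ \sigma_2 \end{bmatrix} dW_t, \qquad X_0=x_0,$$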

where $x_0\in\mathbb{R}^2$ is the initial condition and $(\sigma_1,\sigma_2)\in\mathbb{R}^+\times\mathbb{R}^+$ are the diffusion coefficients. We assume Gaussian measurement errors, $Y_t\,|\,X_t\sim g_\theta(\cdot\,|\,X_t)=\mathcal{N}_2(X_t,\theta_4 I_2)$, where $I_2$ is the two-dimensional identity matrix. We generate one sequence of observations up to time $T=500$ with parameter choice $\theta=(\theta_1,\theta_2,\theta_3,\theta_4)=(0.48,0.78,0.37,0.32)$, $\sigma_1=0.8$, $\sigma_2=0.6$, $x_0=(1,1)^{\top}$. As before, we study various properties of (3.9) with the true parameter choice. In figure 7, we present the log-log plot of bias against $\Delta_L$ for (3.9), where the five points are evaluated with $L\in\{2,3,4,5,6\}$. The bias is approximated by the difference between (3.9) and the true Hessian with $M=10^4$; we sum over all entry-wise biases and present the sum on the plot. We observe that the summed bias is of order $\Delta_L^{0.972}$. This verifies the result in lemma A.2. In figure 8a, we present a log-log plot of cost against $\Delta_L$ for (3.9) and the R&G score estimate, both with $M=10$. We observe that the cost of (3.9) is proportional to $\Delta_L^{-0.866}$. This rate is similar to that of the score estimate; on average, the cost ratio between (3.9) and the score estimate is 3.495.
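A minimal sketch of how such an observation sequence could be generated is given below. It assumes the drift reads $(\theta_1-\theta_2 x^{(1)},\,-\theta_3 x^{(2)})$, unit-spaced observation times, an Euler scheme with step $2^{-\texttt{level}}$, independent Brownian components and $\theta_4$ acting as the observation variance; the function name and the choice `level=5` are hypothetical and not taken from the article's code.

```python
import numpy as np

def simulate_ou_observations(theta, sigma, x0, T, level, rng):
    """Euler simulation of the 2D OU model with noisy observations at t = 1,...,T."""
    th1, th2, th3, th4 = theta
    dt = 2.0 ** (-level)                 # assumed dyadic step size
    steps_per_unit = 2 ** level
    sig = np.asarray(sigma, dtype=float)
    x = np.array(x0, dtype=float)
    observations = np.empty((T, 2))
    for t in range(T):
        for _ in range(steps_per_unit):
            drift = np.array([th1 - th2 * x[0], -th3 * x[1]])
            x = x + drift * dt + sig * np.sqrt(dt) * rng.standard_normal(2)
        # Gaussian measurement error with covariance th4 * I_2 (assumed reading).
        observations[t] = x + np.sqrt(th4) * rng.standard_normal(2)
    return observations

rng = np.random.default_rng(0)
observations = simulate_ou_observations(
    theta=(0.48, 0.78, 0.37, 0.32), sigma=(0.8, 0.6), x0=(1.0, 1.0),
    T=500, level=5, rng=rng,
)
```

Because the OU process has Gaussian transitions, the hidden state could instead be sampled exactly; the Euler loop is used here only because it mirrors the discretized setting studied in the article.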

Figure 7.


Hessian estimate bias summed over all entries for the multivariate OU diffusion model. (Online version in colour.)

Figure 8.


Experiments for the multivariate OU diffusion model. (a) Cost of Hessian and score estimate. (b) Incremental Hessian estimate variance summed over all entries. (Online version in colour.)

In figure 8b, we present the log-log plot of the summed incremental variance of the Hessian estimate against $\Delta_L$. We compute the entry-wise sample variance of the incremental Hessian estimate over $10^3$ repetitions, and plot the summed variance against $\Delta_L$. We observe that the Hessian incremental variance is proportional to $\Delta_L^{1.068}$. This verifies the result in lemma A.3. In figure 9a, we present the log-log plot of the cost against MSE for (3.9) and (2.6), where the MSE is approximated by averaging over $10^3$ i.i.d. repetitions of both estimators. We observe that, under a summed MSE target of $O(\epsilon^{2})$, the cost for (3.9) is of order $O(\epsilon^{-2.362})$, while the cost for (2.6) is of order $O(\epsilon^{-2.958})$. On average, the cost ratio between (2.6) and (3.9) is 3.575. This verifies the variance reduction effect of the truncated R&G scheme. In figure 9b, we present the log-log plot of the cost against MSE for (3.9) and the Hessian estimate using the ΔPF. We observe that, under a similar MSE target, the latter method on average costs 3.165 times less than (3.9). In figure 10, we present the convergence plots for the SGD and Newton methods. Both the score estimate and the Hessian estimate (3.9) are obtained with $M=2\times 10^3$, truncated at level $L=8$. The learning rate for the SGD is set to 0.005. Training is deemed to have converged when the relative Euclidean distance between the trained and true $\theta$ is no bigger than 0.02. We initialize the training parameter at $(0.1,0.1,0.1,0.1)$, and we observe that the SGD method converges in 122 iterations, compared to four iterations for the modified Newton method. In terms of actual training time until convergence, the Newton method is roughly 7.6 times faster than the SGD method.
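The role of truncation can be seen on a toy example. The sketch below uses a generic single-term randomized estimator of Rhee & Glynn type on a deterministic sequence; it is not the estimator (3.9) itself. The level quantities `level_values`, the dyadic step sizes and the level distribution `probs` are hypothetical choices.

```python
import numpy as np

L_max = 8                                   # truncation level, as in the text
levels = np.arange(L_max + 1)
delta = 2.0 ** (-levels)                    # assumed dyadic step sizes
level_values = 1.0 - delta                  # toy level-l quantity; its limit is 1
# xi_0 = a_0 and xi_l = a_l - a_{l-1} for l >= 1.
increments = np.diff(level_values, prepend=0.0)

probs = delta ** 0.75                       # hypothetical level distribution
probs = probs / probs.sum()

def single_term_estimate(rng):
    """Draw a level at random and return the reweighted increment."""
    level = rng.choice(levels, p=probs)
    return increments[level] / probs[level]

# Exact expectation of the estimator: sum over levels of P(l) * xi_l / P(l).
exact_mean = float(np.sum(probs * (increments / probs)))
draw = single_term_estimate(np.random.default_rng(0))
```

The expectation telescopes to `level_values[L_max]`, so the truncated estimator is unbiased for the level-`L_max` quantity rather than for the limit; the residual bias is of the order of the finest step size.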

Figure 9.


Hessian estimate cost against summed MSE for the multivariate OU diffusion model. (Online version in colour.)

Figure 10.


Parameter estimate for the multivariate OU model. (a) SGD with the score estimate. (b) Newton method with score and Hessian estimate. (Online version in colour.)

(c) . FitzHugh–Nagumo model

Our next model is a two-dimensional ordinary differential equation arising in neuroscience, known as the FitzHugh–Nagumo (FHN) model [7,31]. It describes the membrane potential of a neuron and a (latent) recovery variable modelling the ion channel kinetics. We consider a stochastically perturbed extended version, given as

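$$\begin{bmatrix} dX_t^{(1)} \\ dX_t^{(2)} \end{bmatrix} = \begin{bmatrix} \theta_1\big(X_t^{(1)}-(X_t^{(1)})^3-X_t^{(2)}\big) \\ \theta_2 X_t^{(1)}-X_t^{(2)}+\theta_3 \end{bmatrix} dt + \begin{bmatrix} \sigma_1 \\ \sigma_2 \end{bmatrix} dW_t, \qquad X_0=u_0.$$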

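For the discrete observations, we assume Gaussian measurement errors, $Y_t\,|\,X_t\sim g_\theta(\cdot\,|\,X_t)=\mathcal{N}_2(X_t,\theta_4 I_2)$, where $(\theta_1,\theta_2,\theta_3,\theta_4)\in\mathbb{R}^+\times\mathbb{R}\times\mathbb{R}\times\mathbb{R}^+$, $(\sigma_1,\sigma_2)\in\mathbb{R}^+\times\mathbb{R}^+$ are the diffusion coefficients and, as before, $\{W_t\}_{t\geq 0}$ is a Brownian motion. We generate one observation sequence with parameter choices $\theta=(\theta_1,\theta_2,\theta_3,\theta_4)=(0.89,0.98,0.5,0.79)$, $\sigma=(0.2,0.4)$. As the true distribution of the observations is not available analytically, we use $L=10$ to simulate $\{Y_1,Y_2,\dots,Y_T\}$, where $T=500$.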
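A hedged sketch of this data-generation step is shown below. It assumes the drift reads $\big(\theta_1(x^{(1)}-(x^{(1)})^3-x^{(2)}),\ \theta_2 x^{(1)}-x^{(2)}+\theta_3\big)$, an Euler scheme with step $2^{-L}$, unit-spaced observation times and $\theta_4$ as the observation variance. The initial condition `u0=(0, 0)` is a placeholder, since its value is not stated here, and the run uses a short horizon `T=20` purely to keep the example fast.

```python
import numpy as np

def fhn_drift(x, theta):
    """Drift of the stochastic FHN model under the assumed reading of the SDE."""
    th1, th2, th3, _ = theta
    return np.array([th1 * (x[0] - x[0] ** 3 - x[1]),
                     th2 * x[0] - x[1] + th3])

def simulate_fhn_observations(theta, sigma, u0, T, level, rng):
    """Euler simulation at a fine level, with Gaussian observations at t = 1,...,T."""
    dt = 2.0 ** (-level)                 # assumed dyadic step size
    sig = np.asarray(sigma, dtype=float)
    x = np.array(u0, dtype=float)
    observations = np.empty((T, 2))
    for t in range(T):
        for _ in range(2 ** level):
            x = x + fhn_drift(x, theta) * dt + sig * np.sqrt(dt) * rng.standard_normal(2)
        observations[t] = x + np.sqrt(theta[3]) * rng.standard_normal(2)
    return observations

rng = np.random.default_rng(1)
fhn_observations = simulate_fhn_observations(
    theta=(0.89, 0.98, 0.5, 0.79), sigma=(0.2, 0.4), u0=(0.0, 0.0),  # u0 is hypothetical
    T=20, level=10, rng=rng,
)
```

Setting `T=500` reproduces the length of the sequence described in the text, at the price of a longer pure-Python loop.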

In figure 11, we compare the bias of (3.9), truncated at discretization level $L\in\{2,3,4,5,6,7\}$, and plot it against $\Delta_L$ (log-log plot). The summed bias is obtained by taking the element-wise difference between an average of $10^3$ i.i.d. realizations of the Hessian estimate and the true Hessian, and then summing over all element-wise differences. The true Hessian is approximated by (3.9) with $M=10^4$ and $L=10$. We observe that the summed bias is of order $O(\Delta_L^{1.402})$. This verifies the result in lemma A.2. In figure 12a, we present the log-log plot of cost against $\Delta_L$ for (3.9) and the R&G score estimate, both with $M=10$. We observe that the cost of (3.9) is of order $O(\Delta_L^{-0.866})$, while the cost for the score estimate is of order $O(\Delta_L^{-0.867})$. The average cost ratio between (3.9) and the score estimate is 3.495. In figure 12b, we present the log-log plot of the summed incremental variance against $\Delta_L$. We observe that the summed incremental variance is of order $O(\Delta_L^{1.118})$. This verifies the result in lemma A.3.
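The summed-bias diagnostic described above can be sketched as follows. The use of absolute values when summing the element-wise differences is an assumption (the text does not say how signs are handled), and the function name and the synthetic inputs are hypothetical.

```python
import numpy as np

def summed_bias(hessian_samples, reference_hessian):
    """Sum of absolute entry-wise differences between the sample mean and a reference."""
    mean_estimate = np.mean(np.asarray(hessian_samples, dtype=float), axis=0)
    return float(np.sum(np.abs(mean_estimate - reference_hessian)))

# Synthetic check: 1000 identical 4 x 4 'estimates', each entry offset by 0.1
# from a hypothetical reference Hessian.
reference = -np.eye(4)
samples = np.tile(reference + 0.1, (1000, 1, 1))
bias = summed_bias(samples, reference)
```

In the experiment, `hessian_samples` would hold the $10^3$ realizations of the truncated estimate at a given level and `reference_hessian` the high-accuracy approximation of the true Hessian.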

Figure 11.


Hessian estimate bias summed over all entries for the FHN model.

Figure 12.


Experiments for the FHN model. (a) Cost of Hessian & score estimate. (b) Incremental Hessian estimate variance summed over all entries. (Online version in colour.)

In figure 13a, we again present a log-log plot of the cost against the MSE, summed over all entries, for both (3.9) and (2.6). We observe that, under an MSE target of $\epsilon^{2}$, (3.9) requires cost of order $O(\epsilon^{-2.482})$, while (2.6) requires cost of order $O(\epsilon^{-2.97})$. The average cost ratio between (3.9) and (2.6) under the same MSE target is 3.987. This verifies the variance reduction effect of the truncated R&G scheme. Figure 13b shows the log-log plot of cost against summed MSE for (3.9) and the Hessian estimate using the ΔPF. We observe that, under a similar MSE target, the latter method on average costs 4.627 times less than (3.9). In figure 14, we present the convergence plots of SGD and the modified Newton method. For the modified Newton method, we set all the off-diagonal entries of the Hessian estimate to zero, and add 0.0001 to the diagonal entries to avoid singularity. When the $L_2$ norm of the score is smaller than 0.1, we scale the search step by a learning rate of 0.002. Both the score estimate and the Hessian estimate (3.9) are obtained with $M=2\times 10^3$, truncated at level $L=8$. The learning rate for the SGD is set to 0.001. Training is deemed to have converged when the relative Euclidean distance between the trained and true $\theta$ is no bigger than 0.02. We initialize the training parameter at $(0.8,0.8,0.8,0.8)$, and observe that the stochastic Newton method reaches convergence with fewer iterations than the SGD method.
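The modification of the Newton step described above can be sketched as follows. The sign convention (a step of the form $-H^{-1}g$ added to the current parameter) is an assumption, as are the function names and the toy inputs; the constants 0.0001, 0.1, 0.002 and 0.02 are those quoted in the text.

```python
import numpy as np

def modified_newton_step(theta, score, hessian,
                         jitter=1e-4, small_score=0.1, damping=0.002):
    """One diagonal, regularized Newton update as described in the text."""
    theta = np.asarray(theta, dtype=float)
    score = np.asarray(score, dtype=float)
    # Drop off-diagonal entries and add the jitter to the diagonal.
    diagonal = np.diag(np.asarray(hessian, dtype=float)) + jitter
    step = -score / diagonal               # Newton direction for a diagonal matrix
    # Once the score is small, scale the search step by the learning rate.
    if np.linalg.norm(score) < small_score:
        step = damping * step
    return theta + step

def has_converged(theta, theta_true, tol=0.02):
    """Relative Euclidean distance criterion used to stop training."""
    theta_true = np.asarray(theta_true, dtype=float)
    return np.linalg.norm(np.asarray(theta) - theta_true) / np.linalg.norm(theta_true) <= tol

# Toy inputs (hypothetical): a 2 x 2 Hessian estimate and two score vectors.
H = np.array([[-2.0, 0.5], [0.5, -4.0]])
full_step = modified_newton_step([0.0, 0.0], [1.0, -2.0], H)
damped_step = modified_newton_step([0.0, 0.0], [0.01, 0.0], H)
```

Discarding the off-diagonal entries makes the linear solve trivial and avoids inverting a noisy full matrix, at the price of ignoring parameter interactions.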

Figure 13.


Hessian estimate cost against summed MSE for FHN model. (Online version in colour.)

Figure 14.


Parameter estimate for the FHN model. (a) SGD with score estimate. (b) Modified Newton method with score & Hessian estimate. (Online version in colour.)

5. Summary

In this work, we were interested in developing an unbiased estimator of the Hessian of the log-likelihood for PODPs. This task is of interest because estimates of the Hessian are typically biased by time-discretization and are costly to compute, while using the Hessian improves convergence over using the score function alone. We presented a general expression for the Hessian and proved that, in the limit of the discretization level, it is consistent with the continuous form. We demonstrated that we were able to reduce the bias arising from the discretization. This was shown through various numerical experiments on a range of diffusion processes, which highlighted not only the reduction in bias, but also that convergence is better than when computing and using only the score function. In terms of research directions beyond what we have done, it would be of interest, firstly, to extend this to more complicated diffusion models, such as those arising in mathematical finance [32,33], for example rough volatility models. Another potential direction would be to consider diffusion bridges and analyse how one could adapt the tools used here. This has been of interest, with recent works such as [34,35]. One could also aim to derive similar results for unbiased estimation using alternative discretization schemes, such as the Milstein scheme. This should result in different rates of convergence; however, the analysis would be different. To do so, one would also require such a newly developed analysis for the score function [22]. Finally, one could aim to apply this to other applications in which Monte Carlo methods are exploited, such as phylogenetics. In particular, as we have tested our methodology on various OU processes, one could test this further on an SDE arising in phylogenetics, presented and discussed in [36,37].

Supplementary Material

Appendix A. Proofs for proposition 2.1

In this section, we will consider a diffusion process $\{X_t^{x}\}_{t\geq 0}=X_T^{x}$ which follows (2.1) and has initial condition $X_0=x\in\mathbb{R}^d$, and we will also consider Euler discretizations (2.5), at some given level $l$, which are driven by the same Brownian motion as $\{X_t^{x}\}_{t\geq 0}$ and have the same initial condition, written $(\tilde{X}_{\Delta_l}^{x},\tilde{X}_{2\Delta_l}^{x},\dots)$. We also consider another diffusion process $\{X_t^{x_*}\}_{t\geq 0}$ which also follows (2.1), with initial condition $X_0=x_*\in\mathbb{R}^d$ and the same Brownian motion as $\{X_t^{x}\}_{t\geq 0}$, together with associated Euler discretizations, at level $l$, which are driven by the same Brownian motion as $\{X_t^{x_*}\}_{t\geq 0}$ and have the same initial condition, written $(\tilde{X}_{\Delta_l}^{x_*},\tilde{X}_{2\Delta_l}^{x_*},\dots)$. The purpose of signifying the initial condition will be made apparent later on in the appendix. The expectation operator for the described process is written $\mathbb{E}_\theta$.

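We require the following additional assumption, called (D2); all derivatives are assumed to be well defined.

  • $[\Sigma^{-1}]_{j,k}\in\mathcal{B}_b(\mathbb{R}^d)\cap\textrm{Lip}_{\|\cdot\|_2}(\mathbb{R}^d)$, $(j,k)\in\{1,\dots,d\}^2$.

  • For any $\theta\in\Theta$, $a_\theta^{j}\in\mathcal{B}_b(\mathbb{R}^d)$, $\sigma^{j,k}\in\mathcal{B}_b(\mathbb{R}^d)$, $(j,k)\in\{1,\dots,d\}^2$.

  • For any $\theta\in\Theta$, there exists $0<\underline{C}<\overline{C}<+\infty$ such that for any $(x,y)\in\mathbb{R}^d\times\mathbb{R}^{d_y}$, $\underline{C}\leq g_\theta(y|x)\leq\overline{C}$. In addition, for any $(\theta,y)\in\Theta\times\mathbb{R}^{d_y}$, $g_\theta(y|\cdot)\in\textrm{Lip}_{\|\cdot\|_2}(\mathbb{R}^d)$.

  • For any $(\theta,y)\in\Theta\times\mathbb{R}^{d_y}$, $\frac{\partial}{\partial\theta^{(i)}}\big(\log\{g_\theta(y|\cdot)\}\big)\in\mathcal{B}_b(\mathbb{R}^d)\cap\textrm{Lip}_{\|\cdot\|_2}(\mathbb{R}^d)$, $i\in\{1,\dots,d_\theta\}$.

  • For any $\theta\in\Theta$,
    $\frac{\partial}{\partial\theta^{(i)}}\big(b_\theta^{(j)}\big),\ \frac{\partial}{\partial\theta^{(i)}}\big(b_\theta^{(j)}\big)^2,\ \frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(k)}}\big(b_\theta^{(j)}\big),\ \frac{\partial^2}{\partial\theta^{(i)}\partial\theta^{(k)}}\big(b_\theta^{(j)}\big)^2\in\mathcal{B}_b(\mathbb{R}^d)\cap\textrm{Lip}_{\|\cdot\|_2}(\mathbb{R}^d)$,
    $(i,k,j)\in\{1,\dots,d_\theta\}^2\times\{1,\dots,d\}$.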

The following result is proved in [22] and is lemma 1 of that article.

Lemma A.1. —

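Assume (D1–D2). Then for any $(n,r,\theta,i)\in\mathbb{N}\times[1,\infty)\times\Theta\times\{1,\dots,d_\theta\}$, there exists a $C<\infty$ such that for any $(l,x)\in\mathbb{N}_0\times\mathbb{R}^d$

$$\mathbb{E}_\theta\Big[\big|G_\theta^{l}(X_T^{l,x})^{(i)}-G_\theta(X_T^{x})^{(i)}\big|^{r}\Big]^{1/r}\leq C\Delta_l^{1/2}.$$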

We have the following result which can be proved using very similar arguments to lemma A.1.

Lemma A.2. —

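Assume (D1–D2). Then for any $(n,r,\theta,i,j)\in\mathbb{N}\times[1,\infty)\times\Theta\times\{1,\dots,d_\theta\}^2$, there exists a $C<\infty$ such that for any $(l,x)\in\mathbb{N}_0\times\mathbb{R}^d$:

$$\begin{aligned}
\mathbb{E}_\theta\Big[\big|\varphi_\theta(\tilde{X}_{1:n}^{x})G_\theta^{l}(X_T^{l,x})^{(i)}-\varphi_\theta(X_{1:n}^{x})G_\theta(X_T^{x})^{(i)}\big|^{r}\Big]^{1/r}&\leq C\Delta_l^{1/2},\\
\mathbb{E}_\theta\Big[\big|\varphi_\theta(\tilde{X}_{1:n}^{x})H_\theta^{l}(X_T^{l,x})^{(ij)}-\varphi_\theta(X_{1:n}^{x})H_\theta(X_T^{x})^{(ij)}\big|^{r}\Big]^{1/r}&\leq C\Delta_l^{1/2}\\
\text{and}\quad\mathbb{E}_\theta\Big[\big|\varphi_\theta(\tilde{X}_{1:n}^{x})G_\theta^{l}(X_T^{l,x})^{(i)}G_\theta^{l}(X_T^{l,x})^{(j)}-\varphi_\theta(X_{1:n}^{x})G_\theta(X_T^{x})^{(i)}G_\theta(X_T^{x})^{(j)}\big|^{r}\Big]^{1/r}&\leq C\Delta_l^{1/2}.
\end{aligned}$$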

Proof of proposition 2.1. —

In the following proof, we will suppress the initial condition from the notation. We have

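$$H_\theta^{l,(ij)}-H_\theta^{(ij)}=\sum_{j=1}^{3}T_j,$$

where

$$\begin{aligned}
T_1&=\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})G_\theta^{l}(X_T^{l})^{(i)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}\,\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})G_\theta^{l}(X_T^{l})^{(j)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}-\frac{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})G_\theta(X_T)^{(i)}]}{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})]}\,\frac{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})G_\theta(X_T)^{(j)}]}{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})]},\\
T_2&=\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})G_\theta^{l}(X_T^{l})^{(i)}G_\theta^{l}(X_T^{l})^{(j)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}-\frac{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})G_\theta(X_T)^{(i)}G_\theta(X_T)^{(j)}]}{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})]}\\
\text{and}\quad T_3&=\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})H_\theta^{l}(X_T^{l})^{(ij)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}-\frac{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})H_\theta(X_T)^{(ij)}]}{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})]}.
\end{aligned}$$

We remark that

$$\big|\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]-\mathbb{E}_\theta[\varphi_\theta(X_{1:n})]\big|\leq C\Delta_l^{1/2},\qquad\text{(A 1)}$$

by using (D2) and convergence of Euler approximations of diffusions. For $T_1$, we have

$$\begin{aligned}
T_1={}&\left(\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})G_\theta^{l}(X_T^{l})^{(i)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}-\frac{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})G_\theta(X_T)^{(i)}]}{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})]}\right)\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})G_\theta^{l}(X_T^{l})^{(j)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}\\
&+\frac{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})G_\theta(X_T)^{(i)}]}{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})]}\left(\frac{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})G_\theta^{l}(X_T^{l})^{(j)}]}{\mathbb{E}_\theta[\varphi_\theta(\tilde{X}_{1:n})]}-\frac{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})G_\theta(X_T)^{(j)}]}{\mathbb{E}_\theta[\varphi_\theta(X_{1:n})]}\right).
\end{aligned}$$

So one can easily deduce by ([22], theorem 1) and (D2) that for some $C$ that does not depend upon $l$

$$T_1\leq C\Delta_l^{1/2}.$$

Now for any real numbers $a,b,c,d$ with $b,d$ non-zero, we have the simple identity

$$\frac{a}{b}-\frac{c}{d}=\frac{a}{bd}[d-b]+\frac{1}{d}[a-c].$$

So for $T_2,T_3$, combining this identity with lemma A.2, (A 1) and (D2), one can easily conclude that for some $C$ that does not depend upon $l$

$$\max\{T_2,T_3\}\leq C\Delta_l^{1/2}.$$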

From here, the proof is easily concluded.
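As an illustration of the last step, the bound for $T_3$ can be spelled out as follows. The shorthand $a,b,c,d$ is introduced here only for this sketch, and the uniform bounds on $|a|$, $|b|$ and $|d|$ are assumed to follow from (D2) rather than being derived.

```latex
% Shorthand (illustrative only):
%   a = E_theta[ varphi_theta(\tilde X_{1:n}) H_theta^l(X_T^l)^{(ij)} ],  b = E_theta[ varphi_theta(\tilde X_{1:n}) ],
%   c = E_theta[ varphi_theta(X_{1:n}) H_theta(X_T)^{(ij)} ],             d = E_theta[ varphi_theta(X_{1:n}) ].
\begin{align*}
T_3 &= \frac{a}{b}-\frac{c}{d}
     = \frac{a}{bd}\,[d-b] + \frac{1}{d}\,[a-c], \\
|T_3| &\leq \frac{|a|}{|b|\,|d|}\,|d-b| + \frac{1}{|d|}\,|a-c|.
\end{align*}
% |d-b| <= C \Delta_l^{1/2} by (A 1), and |a-c| <= C \Delta_l^{1/2} by lemma A.2 with r = 1
% (moving the absolute value inside the expectation). If b and d are bounded away from
% zero and |a| is bounded uniformly in l, both terms are O(\Delta_l^{1/2}).
```

The same argument applies to $T_2$, with the product $G_\theta^{(i)}G_\theta^{(j)}$ in place of $H_\theta^{(ij)}$.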

We end the section with a couple of results which are more-or-less direct corollaries of ([22], remarks 1 & 2). We do not prove them.

Lemma A.3. —

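Assume (D1–D2). Then for any $(i,j,r,\theta)\in\{1,\dots,d_\theta\}^2\times[1,\infty)\times\Theta$, there exists a $C<\infty$ such that for any $(l,x,x_*)\in\mathbb{N}\times\mathbb{R}^{2d}$

$$\begin{aligned}
\mathbb{E}_\theta\Big[\big|H_\theta^{l}(X_T^{l,x})^{(ij)}-H_\theta^{l-1}(X_T^{l-1,x_*})^{(ij)}\big|^{r}\Big]^{1/r}&\leq C\big(\Delta_l^{1/2}+\|x-x_*\|_2\big),\\
\mathbb{E}_\theta\Big[\big|G_\theta^{l}(X_T^{l,x})^{(i)}G_\theta^{l}(X_T^{l,x})^{(j)}-G_\theta^{l-1}(X_T^{l-1,x_*})^{(i)}G_\theta^{l-1}(X_T^{l-1,x_*})^{(j)}\big|^{r}\Big]^{1/r}&\leq C\big(\Delta_l^{1/2}+\|x-x_*\|_2\big).
\end{aligned}$$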
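To connect this lemma with the variance plots in the numerical section, the following special case may help. It is an illustrative consequence obtained by taking $r=2$ and letting the two initial conditions coincide, not an additional result of the article.

```latex
% Special case: r = 2 and identical initial conditions, so the norm term vanishes.
\begin{align*}
\mathbb{E}_\theta\Big[\big|H_\theta^{l}(X_T^{l,x})^{(ij)}-H_\theta^{l-1}(X_T^{l-1,x})^{(ij)}\big|^{2}\Big]^{1/2}
  &\leq C\,\Delta_l^{1/2}, \\
\text{hence}\quad
\mathbb{E}_\theta\Big[\big|H_\theta^{l}(X_T^{l,x})^{(ij)}-H_\theta^{l-1}(X_T^{l-1,x})^{(ij)}\big|^{2}\Big]
  &\leq C^{2}\,\Delta_l .
\end{align*}
```

A second moment of order $\Delta_l$ corresponds to a slope of about one on a log-log plot against the step size, which is the benchmark against which the fitted exponents 1.15, 1.068 and 1.118 reported above can be read.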

Appendix B. Algorithms

(The algorithms are provided as images in the published article, files rspa20210710f18.jpg to rspa20210710f26.jpg.)

Data accessibility

Data and code for the paper can be found at https://github.com/fangyuan-ksgk/Hessian_Estimate.

Authors' contributions

N.K.C.: conceptualization, investigation, methodology, resources, supervision, writing—original draft, writing—review and editing; A.J.: formal analysis, methodology, project administration, resources, supervision, writing—original draft, writing—review and editing; F.Y.: resources, software, visualization.

All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Conflict of interest declaration

We declare we have no competing interest.

Funding

This work was supported by KAUST baseline funding.

References

  • 1. Cappé O, Moulines E, Ryden T. 2005. Inference in Hidden Markov Models, vol. 580. New York, NY: Springer.
  • 2. Bain A, Crisan D. 2009. Fundamentals of stochastic filtering. New York, NY: Springer.
  • 3. Majda A, Wang X. 2006. Non-linear dynamics and statistical theories for basic geophysical flows. Cambridge, UK: Cambridge University Press.
  • 4. Ricciardi LM. 1977. Diffusion processes and related topics in biology. Lecture Notes in Biomathematics. Berlin, Heidelberg: Springer.
  • 5. Shreve SE. 2004. Stochastic calculus for finance II: continuous-time model. New York, NY: Springer Science & Business Media.
  • 6. Kloeden PE, Platen E. 2013. Numerical solution of stochastic differential equations, vol. 23. New York, NY: Springer Science & Business Media.
  • 7. Bierkens J, van der Meulen F, Schauer M. 2020. Simulation of elliptic and hypo-elliptic conditional diffusions. Adv. Appl. Probab. 52, 173-212. (doi:10.1017/apr.2019.54)
  • 8. Che Y, Geng L-H, Han C, Cui S, Wang J. 2012. Parameter estimation of the FitzHugh-Nagumo model using noisy measurements for membrane potential. Chaos 22, 023139. (doi:10.1063/1.4729458)
  • 9. Doruk R, Abosharb L. 2019. Estimating the parameters of Fitzhugh-Nagumo neurons from neural spiking data. Brain Sci. 9, 364. (doi:10.3390/brainsci9120364)
  • 10. Beskos A, Roberts GO. 2005. Exact simulation of diffusions. Ann. Appl. Probab. 15, 2422-2444. (doi:10.1214/105051605000000485)
  • 11. Beskos A, Papaspiliopoulos O, Roberts GO. 2006. Retrospective exact simulation of diffusion sample paths with applications. Bernoulli 12, 1077-1098. (doi:10.3150/bj/1165269151)
  • 12. Fearnhead P, Papaspiliopoulos O, Roberts GO. 2008. Particle filters for partially observed diffusions. J. R. Stat. Soc. B 70, 755-777. (doi:10.1111/j.1467-9868.2008.00661.x)
  • 13. Blanchet J, Zhang F. 2017. Exact simulation for multivariate Ito diffusions. (http://arxiv.org/abs/1706.05124)
  • 14. Fearnhead P, Papaspiliopoulos O, Roberts GO, Stuart AM. 2010. Random-weight particle filtering of continuous time processes. J. R. Stat. Soc. Ser. B Stat. Methodol. 72, 497-512. (doi:10.1111/j.1467-9868.2010.00744.x)
  • 15. Glynn PW. 2014. Exact estimation for Markov chain equilibrium expectations. J. Appl. Probab. 51, 377-389. (doi:10.1239/jap/1417528487)
  • 16. Rhee CH, Glynn P. 2016. Unbiased estimation with square root convergence for SDE models. Op. Res. 63, 1026-1043. (doi:10.1287/opre.2015.1404)
  • 17. Vihola M. 2018. Unbiased estimators and multilevel Monte Carlo. Op. Res. 66, 448-462. (doi:10.1287/opre.2017.1670)
  • 18. Chada NK, Franks J, Jasra A, Law KJH, Vihola M. 2021. Unbiased inference for discretely observed hidden Markov model diffusions. SIAM/ASA J. Unc. Quant. 9, 763-787. (doi:10.1137/20m131549x)
  • 19. Heng J, Jasra A, Law KJH, Tarakanov A. 2021. On unbiased estimation for discretized models. (http://arxiv.org/abs/2102.12230)
  • 20. Jacob PE, O’Leary J, Atchadé YF. 2020. Unbiased Markov chain Monte Carlo methods with couplings. J. R. Stat. Soc. B 82, 543-600. (doi:10.1111/rssb.12336)
  • 21. Jasra A, Law KJH. 2021. Unbiased filtering for a class of partially observed diffusion processes. Adv. Appl. Probab. (to appear).
  • 22. Heng J, Houssineau J, Jasra A. 2021. On unbiased score estimation for partially observed diffusions. (http://arxiv.org/abs/2105.04912)
  • 23. Agarwal N, Bullins B, Hazan E. 2017. Second-order stochastic optimization for machine learning in linear time. J. Mach. Learn. Res. 18, 1-40.
  • 24. Byrd RH, Chin GM, Neveitt W, Nocedal J. 2011. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim. 21, 977-995. (doi:10.1137/10079923X)
  • 25. Jasra A, Law KJH, Lu D. 2021. Unbiased estimation of the gradient of the log-likelihood in inverse problems. Stat. Comp. 31, 1-18. (doi:10.1007/s11222-021-09994-6)
  • 26. Shi Y, Cornish R. 2021. On multilevel Monte Carlo unbiased gradient estimation for deep latent variable models. In Proc. of The 24th Int. Conf. on Artificial Intelligence and Statistics, vol. 130, pp. 3925–3933, PMLR.
  • 27. Andrieu C, Doucet A, Holenstein R. 2010. Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. B 72, 269-342. (doi:10.1111/j.1467-9868.2009.00736.x)
  • 28. Andrieu C, Lee A, Vihola M. 2018. Uniform ergodicity of the iterated conditional SMC and geometric ergodicity of particle Gibbs samplers. Bernoulli 24, 842-872. (doi:10.3150/15-BEJ785)
  • 29. Jasra A, Law KJH, Zhou Y. 2018. Bayesian static parameter estimation for partially observed diffusions via multilevel Monte Carlo. SIAM J. Sci. Comp. 40, A887-A902. (doi:10.1137/17M1112595)
  • 30. Jacob P, Lindsten F, Schön T. 2021. Smoothing with couplings of conditional particle filters. J. Am. Stat. Assoc. 115, 721-729. (doi:10.1080/01621459.2018.1548856)
  • 31. Ditlevsen S, Samson A. 2019. Hypoelliptic diffusions: filtering and inference from complete and partial observations. J. R. Stat. Soc. B 81, 361-384. (doi:10.1111/rssb.12307)
  • 32. Gatheral J. 2006. The volatility surface: a practitioner’s guide. New York, NY: John Wiley & Sons.
  • 33. Gatheral J, Jaisson T, Rosenbaum M. 2018. Volatility is rough. Quant. Financ. 18, 933-949. (doi:10.1080/14697688.2017.1393551)
  • 34. van der Meulen F, Schauer M. 2018. Bayesian estimation of incompletely observed diffusions. Stochastics 90, 641-662. (doi:10.1080/17442508.2017.1381097)
  • 35. Schauer M, Van Der Meulen F, Van Zanten H. 2017. Guided proposals for simulating multi-dimensional diffusion bridges. Bernoulli 23, 2917-2950. (doi:10.3150/16-BEJ833)
  • 36. Hansen T, Martins E. 1996. Translating between microevolutionary process and macroevolutionary patterns: the correlation structure of interspecific data. Evolution 50, 1404-1417. (doi:10.1111/j.1558-5646.1996.tb03914.x)
  • 37. Jhwueng T, Vasileios M. 2014. Phylogenetic Ornstein–Uhlenbeck regression curves. Stat. Probab. Lett. 89, 110-117. (doi:10.1016/j.spl.2014.02.023)
  • 38. Thorisson H. 2002. Coupling, stationarity, and regeneration. New York, NY: Springer.



Articles from Proceedings. Mathematical, Physical, and Engineering Sciences are provided here courtesy of The Royal Society
