Author manuscript; available in PMC: 2025 Jun 15.
Published in final edited form as: J Comput Phys. 2024 Mar 29;507:112971. doi: 10.1016/j.jcp.2024.112971

Learning nonparametric ordinary differential equations from noisy data

Kamel Lahouel b, Michael Wells a, Victor Rielly a, Ethan Lew c, David Lovitza a, Bruno M Jedynak a,*
PMCID: PMC11090484  NIHMSID: NIHMS1986948  PMID: 38745873

Abstract

Learning nonparametric systems of Ordinary Differential Equations (ODEs) x˙=f(t,x) from noisy data is an emerging machine learning topic. We use the well-developed theory of Reproducing Kernel Hilbert Spaces (RKHS) to define candidates for f for which the solution of the ODE exists and is unique. Learning f consists of solving a constrained optimization problem in an RKHS. We propose a penalty method that iteratively uses the Representer theorem and Euler approximations to provide a numerical solution. We prove a generalization bound for the L2 distance between x and its estimator. Experiments are provided for the FitzHugh–Nagumo oscillator, the Lorenz system, and for predicting the Amyloid level in the cortex of aging subjects. In all cases, we show competitive results compared with the state-of-the-art.

1. Introduction

1.1. Description of the problem and related works

Fitting a system of nonparametric ordinary differential equations (ODEs) x˙=f(t,x) to longitudinal data could lead to scientific breakthroughs in disciplines where ODEs or dynamical systems have been used for a long time, including physics, chemistry, and biology, see [1]. By nonparametric, we mean that there is no need to specify the functional form of the vector-field f using a pre-defined finite dimensional parameter. Instead, this force field belongs to a functional space and the number of parameters that characterize this vector field depends on the amount of data available. This provides a great advantage in situations where the form of the vector field is unknown but data is available for learning. The functional spaces considered are Reproducing Kernel Hilbert Spaces (RKHS) [2], allowing for efficient optimization among other desirable properties.

A particular difficulty arises when the data is sparse and noisy. This is often the case for longitudinal healthcare data obtained during hospital visits. These visits provide measurements that are sparse in time, with a high level of individual variability. The work presented in this paper has been motivated in part by the need to model the accumulation of the Amyloid protein in the brain of aging subjects. Understanding how amyloid contributes to the manifestation of Alzheimer’s is a crucial task. The algorithm discussed here will (we hope) shed more light on the development of this devastating disease.

Fitting data to nonparametric ODEs is an inverse problem. It requires making assumptions on the initial state of the solution and on the vector field. Furthermore, one needs to make assumptions about the noise model and provide a tractable optimization algorithm.

We now provide a short bibliographic survey. Further references can be found in the cited papers. First, note that if the time derivative ($\dot{x}$) were observed, then fitting ODEs to noisy data would reduce to solving a regression problem. This remark has led to the methods known as "gradient matching" and to the earliest success in fitting ODEs to data, see e.g. [3, 4]. It consists in estimating the gradient from the data, then performing nonparametric regression to fit the vector field f and, eventually, iterating, see [5]. These methods become inefficient when the data is sparse and/or noisy.

Another approach consists in modeling f with polynomials [6]. Alternatively, one could model f using the units of a Deep Neural Network, see [7, 8]. These methods integrate the solution along the vector field from guessed initial conditions and compare the resulting trajectories with the observations. Optimization is used iteratively to refine the estimation of f and the initial conditions. Stochastic gradient descent and backpropagation are used in the latter case. Another modeling approach is to assume that f belongs to an RKHS. This idea, also known under the name of kernel method, can be traced back to [9]. It was successfully applied to fluid mechanics in [10]. This is the conceptual approach pursued here. We believe that this approach is well-motivated since there is a tight connection between the regularity (smoothness) properties of a kernel and the regularity properties of f. Specifically, one can choose an RKHS of vector-valued functions for which one is guaranteed the existence and uniqueness of the corresponding initial value problem. This is a necessary step in proving that more data would result in more accurate predictions. Another advantage of kernel methods is that there is no need to choose a dictionary of functions as in [4]. Instead, one selects a kernel, which, our experiments suggest, is easier. In [11], the authors assume that each coordinate of the trajectory belongs to a real-valued RKHS where the functions' input is time. In their approach, they first retrieve the full trajectory solving a kernel ridge regression problem. Next, they solve for the vector field given the full trajectory, assuming that each coordinate of the vector field can be written as a sum of a linear combination of functions, which are defined on each coordinate of the trajectory. Our framework allows for linear combinations of pairwise products of such functions, as well. The functions characterizing such a vector field are assumed to be in a real-valued RKHS taking a single coordinate as input. In our approach, we make an assumption on the vector field. This soft constraint translates to a soft constraint on the set of trajectories, without imposing additional constraints on the trajectory itself. As a result, we solve one optimization problem as opposed to the two-step approach in [11]. Moreover, we allow for higher-order interaction terms compared to the pairwise single coordinates interaction assumed in the mentioned work. In [12], the authors use a Gaussian process (GP) for the vector field. This is the Bayesian counterpart of the frequentist RKHS modeling, see [13] for a review of the similarities and differences between RKHSs and GPs. Comparisons between a collection of algorithms representative of the state of the art and the proposed algorithm are provided in the experiments section.

Figure 1 provides a visual and easy-to-understand illustration of the results generated by the algorithms presented in this paper. The details of this experiment are provided in section 4.3.1. We see that the proposed algorithm is able to recover a noisy trajectory and extrapolate the data, contrary to a method that would use a regression model and ignore the ODE.

Fig. 1.

(a) Predicted vector field of the Lorenz system. The black arrows are the prediction and the grey arrows the true vector field. Red points are observations. The red curve is a predicted trajectory while the grey curve is the true trajectory. (b) is the x-dimension, (c) the y-dimension, and (d) the z-dimension. The red points are the observations. This plot also shows a prediction beyond the last observation in the data.

1.2. Main contributions

The main contributions of this paper are as follows:

  1. We present an RKHS model for fitting nonparametric ODEs to observational data. Conditions for existence and uniqueness of the solutions of the corresponding initial value problem are expressed in terms of the regularity of the kernel;

  2. We propose a novel algorithm for estimating nonparametric ODEs and the initial condition(s) from noisy data. This algorithm solves a constrained optimization problem using a penalty method;

  3. We derive and prove a consistency result for the prediction of the state (interpolation) at unobserved times. This is, to the best of our knowledge, the first such result for the problem of fitting nonparametric ODEs to data.

  4. We provide experiments with simulated data. We compare the proposed algorithm to six existing methods representing the state of the art, for various noise levels. We show that the ODE-RKHS algorithm is competitive.

  5. We provide an experiment modeling the accumulation of Amyloid in the cortex of aging subjects. The data is sparse with, on average, three data points per trajectory (subject) and 179 trajectories. We show competitive performance compared to the state of the art.

The rest of this paper is organized as follows: Section 2 presents some background material as well as the model and the algorithms. The consistency results are presented in Section 3 and proved in Appendix A. The experiments appear in Section 4 while Section 5 provides concluding remarks. Appendix B provides examples of kernels.

2. Model and algorithm

2.1. Background on Reproducing Kernel Hilbert Spaces (RKHSs)

Basic notions and notations associated with RKHS are important for understanding the algorithms and derivations presented in this paper. We thus provide a short presentation. We limit ourselves to RKHS over the field of real numbers instead of complex numbers as this is sufficient throughout this paper. We begin with the univariate real-valued case and we continue with the vector-valued case which allows us to describe vector fields, central to this paper.

2.1.1. Real-valued RKHS

Real-valued RKHSs are Hilbert spaces of real-valued functions $\mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is a nonempty space. The critical assumption which makes them "reproducing" is that the evaluation functional is continuous. The evaluation functional at $x \in \mathcal{X}$ is the mapping from the RKHS $H$ to $\mathbb{R}$ which associates to a function its evaluation at $x$, that is $f \mapsto f(x)$. Thanks to the Riesz representation theorem, evaluating a function in an RKHS is a geometric operation consisting in computing an inner product. Effectively, for any $x \in \mathcal{X}$, there is a unique vector $k_x \in H$ such that

$f(x) = \langle f, k_x \rangle_H$ (1)

where $\langle \cdot, \cdot \rangle_H$ is the scalar product associated with $H$. In what follows, we will simply notate $\langle \cdot, \cdot \rangle$ for this inner product. Moreover, let us define, for any $x, y \in \mathcal{X}$, the so-called kernel

$k(x,y) = \langle k_x, k_y \rangle$ (2)

and let us use this to characterize the function $k_x$. Evaluating $k_x$ at $y$ and using the Riesz representation provides

$k_x(y) = \langle k_x, k_y \rangle = \langle k_y, k_x \rangle = k(y,x)$ (3)

Thus the function $k_x(\cdot)$ is the function $k(\cdot, x)$ and for any $f \in H$,

$f(x) = \langle f, k(\cdot, x) \rangle$ (4)

This is the reproducing property of the kernel. Replacing the function $f$ by $k_y$, and using (3), we obtain that

$k_y(x) = \langle k_y, k(\cdot, x) \rangle = \langle k(\cdot, y), k(\cdot, x) \rangle = k(y,x)$ (5)
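As a concrete illustration (ours, not part of the paper), the reproducing property (4) can be exercised numerically for a function built as a finite combination of kernel sections. The Gaussian kernel, the centers, and the coefficients below are arbitrary choices, and the snippet is only a minimal sketch.

```python
import numpy as np

# A small numerical sketch: a function f built as a finite combination of
# Gaussian kernel sections, evaluated through the reproducing property (4),
# together with its RKHS norm. All choices below are arbitrary.
def k(x, y, ell=1.0):
    return np.exp(-0.5 * (x - y) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
centers = rng.normal(size=5)            # points x_1, ..., x_5
alpha = rng.normal(size=5)              # f = sum_i alpha_i k(., x_i)

x0 = 0.3
# f(x0) = <f, k(., x0)> = sum_i alpha_i k(x_i, x0)   (reproducing property)
f_x0 = np.sum(alpha * k(centers, x0))

# ||f||_H^2 = sum_{i,j} alpha_i alpha_j k(x_i, x_j) = alpha^T K alpha
K = k(centers[:, None], centers[None, :])
norm_sq = alpha @ K @ alpha
print(f_x0, norm_sq)
```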

2.1.2. Vector-valued RKHSs

Vector-valued RKHSs generalize the real-valued case. The construction is similar. Consider a Hilbert space of functions from $\mathcal{X}$ to $\mathbb{R}^d$. Assume, moreover, as in the real-valued case, that the evaluation functional is continuous. The Riesz representation theorem then states that for any $x \in \mathcal{X}$ and $v \in \mathbb{R}^d$, there exists a unique element in $H$, notated $K_{x,v}$, such that $v^T f(x) = \langle f, K_{x,v} \rangle$. The kernel of $H$ is then the $(d,d)$ matrix $K(x,y)$ whose element at the $i$th row and $j$th column is defined by

$K_{ij}(x,y) = \langle K_{x,e_i}, K_{y,e_j} \rangle$ (6)

where $e_1, \ldots, e_d$ is the natural basis of $\mathbb{R}^d$. Let us use (6) to characterize the function $K_{x,v}$. We start with $K_{y,e_j}$ and use the reproducing property as well as the symmetry of the inner product.

$e_i^T K_{y,e_j}(x) = \langle K_{y,e_j}, K_{x,e_i} \rangle = \langle K_{x,e_i}, K_{y,e_j} \rangle = K_{ij}(x,y) = e_i^T K(x,y) e_j$ (7)

Thus $K_{y,e_j}(\cdot) = K(\cdot, y) e_j$, and

$v^T f(x) = \langle f, K_{x,v} \rangle = \langle f, K(\cdot, x) v \rangle$ (8)

which is the reproducing property for vector-valued RKHSs. Applying (8) to the function $x \mapsto K(x,y) w$, for $w \in \mathbb{R}^d$, provides

$v^T K(x,y) w = \langle K(\cdot, y) w, K(\cdot, x) v \rangle = \langle K(\cdot, x) v, K(\cdot, y) w \rangle$ (9)

Lastly, a useful property of the kernel $K$ is that $K(x,y)^T = K(y,x)$. Indeed,

$K_{ji}(x,y) = e_j^T K(x,y) e_i = \langle K(\cdot, x) e_j, K(\cdot, y) e_i \rangle = \langle K(\cdot, y) e_i, K(\cdot, x) e_j \rangle = e_i^T K(y,x) e_j = K_{ij}(y,x)$ (10)

Choosing $\mathcal{X} = \mathbb{R}^d$ allows for defining autonomous vector fields, that is, functions $\mathbb{R}^d \to \mathbb{R}^d$, and choosing a suitable kernel restricts the RKHS to Lipschitz continuous vector fields, as will be discussed in Section 2.3.

2.2. Notations

The observations are characterized by multiple time series. There are $n$ time series. The $i$th one is of length $m_i$. It is characterized by $m_i$ couples $(t_{ij}, y_{ij}(t_{ij}))$, $j = 1, \ldots, m_i$, where $t_{ij} \in [0,T]$ for some maximum predefined time $T$, and the observations $y_{ij}(t_{ij})$ belong to $\mathbb{R}^d$.

We aim to make predictions at new time points along a time series having one or several noisy snapshots. To this end, we explore the following nonparametric ODE model:

$\dot{x} = f(t, x), \qquad y_{ij}(t_{ij}) = x(t_{ij}) + \epsilon_{ij}$ (11)

where $i = 1, \ldots, n$ and $j = 1, \ldots, m_i$. The noise $\epsilon_{ij}$ is bounded or sub-Gaussian. This model is nonparametric because $f$ is not specified parametrically. We assume that $f$ belongs to an RKHS of smooth functions for which the solution of the ODE exists and is unique, see Section 2.3. Background material on RKHSs can be found in [14] and vector-valued RKHSs are reviewed in [15]. The rest of the paper is written for the autonomous case when $f(t,x) = f(x)$ and for the simpler situation where $m_i$ is the same for all time series and where the time points $t_{ij}$ are the same for all the time series, i.e. do not depend on $i$. However, we will point to the modifications for the non-autonomous setting when necessary, as well as for the situation of irregular sampling.

2.3. Existence and uniqueness

It is a classical result, see [16], that the initial value problem (IVP):

$\dot{x}(t) = f(x(t)) \quad \text{and} \quad x(0) = x_0,$ (12)

where $f: \mathbb{R}^d \to \mathbb{R}^d$ is Lipschitz continuous, has a unique solution defined on the domain $[0, +\infty)$.

Let $H$ be an RKHS of vector-valued functions $\mathbb{R}^d \to \mathbb{R}^d$ and let $K$ be the reproducing kernel of $H$. $K$ is a $(d,d)$ matrix-valued kernel. It is then natural to ask: what is a sufficient condition on $K$ which ensures that all $f \in H$ are Lipschitz continuous? The following lemma provides an answer.

Lemma 1. If $f: \mathbb{R}^d \to \mathbb{R}^d$ belongs to an RKHS with kernel $K$ such that:

$d_{K_{ii}}^2(u,v) := K_{ii}(u,u) - 2K_{ii}(u,v) + K_{ii}(v,v) \le N_K^2 |u-v|^2, \quad u, v \in \mathbb{R}^d, \; i = 1, \ldots, d,$ (13)

for some constant $N_K$, then the IVP (12) has a unique solution defined on $[0, +\infty)$.

Proof. Notice that for every $i = 1, \ldots, d$

$|f_i(u) - f_i(v)|^2 = |\langle K(u, \cdot) e_i - K(v, \cdot) e_i, f \rangle_H|^2$ (14)
$\le \|K(u, \cdot) e_i - K(v, \cdot) e_i\|_H^2 \, \|f\|_H^2$ (15)
$= d_{K_{ii}}^2(u,v) \, \|f\|_H^2$ (16)

where $e_1, \ldots, e_d$ is the natural basis of $\mathbb{R}^d$. Here we have used the reproducing property of the matrix-valued kernel and the Cauchy-Schwarz inequality. Combined with (13), this shows that each $f_i$ is Lipschitz continuous with constant $N_K \|f\|_H$, so $f$ is Lipschitz continuous and the classical existence and uniqueness result applies. □

Thus, one can choose a kernel that guarantees the existence and uniqueness of the solution of the IVP, which will lead to provable asymptotic performance. We believe that this simple result is a good motivator for the proposed modeling approach.

Let us discuss some examples of kernels satisfying Lemma 1. The simplest matrix-valued kernels are separable kernels. They are obtained by choosing a scalar kernel $K_1$ and a positive semi-definite matrix $A$. Then,

$K(x,y) = K_1(x,y) A$ (17)

The diagonal elements of $K$ are then positive multiples of $K_1$. Thus, if $K_1$ verifies the regularity condition of Lemma 1, then so do all the separable kernels based on $K_1$. The scalar kernels satisfying the hypothesis of Lemma 1 include the linear kernel, the Gaussian kernel, the rational quadratic kernel, the sinc kernel, and the Matérn kernels with $p > 3/2$. Kernels for which the functions in their corresponding RKHSs are not guaranteed to provide unique solutions to the corresponding IVP, due to lack of regularity, include the polynomial kernels of order at least two, the Laplacian kernel, and the Matérn kernels with $p \le 3/2$. Details are provided in Appendix B. The condition of Lemma 1 has a nice interpretation in the case where explicit kernels are used. Indeed, when a feature map associated with the kernel is given explicitly, the condition of Lemma 1 is equivalent to assuming Lipschitz continuous features. The details are provided in Appendix B.
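As a quick numerical sanity check (ours, not from the paper), condition (13) can be verified for a separable Gaussian kernel $K(x,y) = K_1(x,y) A$. The bandwidth, the matrix $A$, and the candidate constant $N_K = \sqrt{\max_i A_{ii}}/\ell$ below are our choices; the constant follows from the elementary inequality $1 - e^{-s} \le s$.

```python
import numpy as np

# Numerical sanity check of condition (13) for the separable Gaussian kernel
# K(x, y) = K1(x, y) * A.  ell and A are arbitrary; N_K = sqrt(max_i A_ii)/ell.
rng = np.random.default_rng(1)
d, ell = 3, 1.5
A = np.diag([0.5, 1.0, 2.0])                     # positive semi-definite matrix

def K1(u, v):
    return np.exp(-np.sum((u - v) ** 2) / (2 * ell ** 2))

N_K = np.sqrt(A.diagonal().max()) / ell
for _ in range(1000):
    u, v = rng.normal(size=d), rng.normal(size=d)
    for i in range(d):
        d2 = A[i, i] * (K1(u, u) - 2 * K1(u, v) + K1(v, v))   # d_{K_ii}^2(u, v)
        assert d2 <= N_K ** 2 * np.sum((u - v) ** 2) + 1e-12
```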

Note on the non-autonomous case: When the vector field is time-dependent, denoted by $f(t,x)$, the kernel is defined on $\mathbb{R}^d \times [0, \infty)$. It is sufficient to assume a global Lipschitz condition with respect to the space variable [16], namely: there exists a constant $L_K$ such that for every $t \ge 0$, $u, v \in \mathbb{R}^d$ and $i \in \{1, \ldots, d\}$:

$|f_i(t,u) - f_i(t,v)| \le L_K |u - v|$ (18)

It is therefore sufficient to assume a kernel $K$ defined on $\mathbb{R}^d \times [0, \infty)$ satisfying the condition of Lemma 1 with respect to the space variable, as it ensures the following inequality:

$d_{K_{ii}}^2((t,u), (t,v)) \le N_K^2 |u - v|^2$ (19)

2.4. From constrained to unconstrained optimization

We first construct the optimization algorithm in the case n=1. All the observations are from a single trajectory with the same initial condition. Thus, we temporarily drop the double indexing with subjects and times to simplify the notation.

Assume the observation times are $t_1 < \cdots < t_m$. Consider the following constrained minimization problem:

$\min_{x, f} \; \frac{1}{m} \sum_{j=1}^m |y_j - x(t_j)|^2 + \lambda \|f - f_0\|_H^2,$ (20)

under the constraints

$f \in H$, the RKHS with matrix-valued kernel $K$, $\quad x(t) = x(t_1) + \int_{t_1}^{t} f(x(s)) \, ds$, for $t_1 \le t \le t_m$. (21)

The function $f_0 \in H$ is an initial guess for $f$. Section 2.6 describes a gradient matching algorithm for selecting $f_0$. $K$ is a kernel that satisfies Lemma 1.

Consider a regular one-dimensional grid over the interval $[t_1, t_m]$. Specifically, we choose

$s_l = t_1 + l h$ (22)

with $l = 0, \ldots, k$, and we assume that $h$ is small enough so that there are integers $k_1 = 0 < k_2 < \cdots < k_m$ such that the observation times are

$t_j = t_1 + k_j h, \quad j = 1, \ldots, m.$ (23)

In practice, the observation times are rounded to fit on this grid. Note that with this notation, $t_j = s_{k_j}$. We now proceed through a series of transformations to rewrite this constrained optimization problem into an unconstrained one.

First, we replace the constraints on x by a finite number of constraints as follows:

$f \in H$, the RKHS with kernel $K$, $\quad x(s_{l+1}) = x(s_l) + \int_{s_l}^{s_{l+1}} f(x(s)) \, ds$, $\quad l = 0, \ldots, k-1.$ (24)

Second, we discretize the constraints using the Euler method of integration:

$f \in H$, the RKHS with kernel $K$, $\quad x(s_{l+1}) = x(s_l) + h f(x(s_l))$, $\quad l = 0, \ldots, k-1.$ (25)

Third, we replace the constrained optimization problem by an unconstrained one using a single Lagrange constant $\gamma > 0$. Notate $z_l = x(s_l)$, $l = 0, \ldots, k$,

$\min_{z \in \mathbb{R}^{d(k+1)}, f \in H} J(z, f, \gamma),$ (26)

with

$J(z, f, \gamma) = \frac{1}{m} \sum_{j=1}^m |y_j - z_{k_j}|^2 + \gamma \frac{1}{k} \sum_{l=0}^{k-1} |z_{l+1} - z_l - h f(z_l)|^2 + \lambda \|f - f_0\|_H^2.$ (27)
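The objective (27) translates directly into code. The sketch below is ours: `f` is any callable vector field, `z` the discretized trajectory, `obs_idx` the indices $k_j$, and `rkhs_norm_sq` a placeholder standing in for $\|f - f_0\|_H^2$ (which, in the actual method, is computed from the kernel expansion of $f$).

```python
import numpy as np

# Sketch of the unconstrained objective (27): data term, Euler residual term,
# and RKHS penalty. Variable names are ours, not the paper's.
def J(z, f, gamma, y, obs_idx, h, lam, rkhs_norm_sq):
    m, k = len(y), len(z) - 1
    data_term = np.mean([np.sum((y[j] - z[obs_idx[j]]) ** 2) for j in range(m)])
    euler_res = np.mean([np.sum((z[l + 1] - z[l] - h * f(z[l])) ** 2)
                         for l in range(k)])
    return data_term + gamma * euler_res + lam * rkhs_norm_sq
```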

It is instructive to remark the similarities between the loss function in equation (27) and the loss proposed in Physics-informed Neural Networks [17], where the observations are generated from an unknown partial differential equation. Indeed, the total loss function in Physics-informed Neural Networks can be decomposed as a sum of two functions: one that measures the deviation of the solution from the observations, and a second one, usually called the residual term, which measures the violation of the partial differential equation constraint that the solution must satisfy. In our context,

$\frac{1}{m} \sum_{j=1}^m |y_j - z_{k_j}|^2$

corresponds to the first function, and

$\frac{1}{k} \sum_{l=0}^{k-1} |z_{l+1} - z_l - h f(z_l)|^2$

corresponds to the residual term. However, there are some notable differences. In Physics-informed Neural Networks, the form of the PDE is known up to finite-dimensional parameters. The loss is viewed as a function of the solution to the partial differential equation and these finite-dimensional parameters. The solution itself is modeled by a neural network. In our case, the loss is viewed as a function of the vector field and the initial solution. The differential equation is therefore characterized by the RKHS, which is usually infinite-dimensional. Moreover, equation (27) contains a regularization term penalizing vector fields with a large RKHS norm, which is typical of loss functions parametrized by RKHS functions.

2.5. Penalty method

The penalty method is an iterative method that consists of enforcing the constraints by increasing a penalty parameter, in this case $\gamma$. The schematic of the method is presented in Algorithm 1. At each step, the functional $J(z, f, \gamma)$ in (27) is minimized with respect to $(z, f)$, for a fixed value of $\gamma$. Then, $\gamma$ is increased. The optimization for $(z, f)$ is done in an alternating fashion, first optimizing over $z$ for a fixed $f$, then optimizing over $f$ for the newly updated $z$.

Let us now describe these optimization steps in more detail. For a fixed $\gamma$ and $f$, $J(z, f, \gamma)$ in (27) is non-convex in $z$ due to the presence of $f(z_l)$. Therefore we replace $f$ by its first-order Taylor expansion evaluated at the value $z_l^{(s)}$ obtained in the previous iteration $s$:

$f(z_l) \approx f(z_l^{(s)}) + (z_l - z_l^{(s)})^T \nabla_{z_l} f(z_l^{(s)})$ (28)

Note that with this approximation, J is convex, quadratic, and sparse in z. This allows the use of an efficient linear solver for this minimization. The number of unknowns is d(k+1).

Note on the non-autonomous case: When the vector field is time-dependent, the vector field is evaluated at points of the form $f(t_l, z_l)$. Notice that the $t_l$'s are the time points of the grid, therefore fixed and known. Hence, the linearization in equation (29) is made only with respect to the space variable:

$f(t_l, z_l) \approx f(t_l, z_l^{(s)}) + (z_l - z_l^{(s)})^T \nabla_{z_l} f(t_l, z_l^{(s)})$ (29)

For a fixed $\gamma$ and $z$, minimizing $J$ in $f$ is equivalent to a multivariate kernel ridge regression problem. After the change of variable $g = f - f_0$, and setting

$u_l = (z_{l+1} - z_l)/h - f_0(z_l), \quad l = 0, \ldots, k-1,$ (30)

we use the representer theorem to show that the minimizer in $f \in H$ of $J$ is of the form

$f(z) = f_0(z) + \sum_{l=0}^{k} K(z, z_l) w_l,$ (31)

where $w_l \in \mathbb{R}^d$. Let $W = (w_1^T, \ldots, w_{k+1}^T)^T$ be of dimension $(d(k+1), 1)$ and similarly let $U = (u_1^T, \ldots, u_{k+1}^T)^T$, and let $\mathbf{K}$ be the matrix with $(d,d)$ block elements $\mathbf{K}_{kl} = K(z_k, z_l)$. We find that $W$ is a minimizer of the convex quadratic function

$\frac{\gamma h^2}{k} |U - \mathbf{K} W|^2 + \lambda W^T \mathbf{K} W$ (32)

and thus $W$ is the solution to the linear system:

$\left( \mathbf{K} + \frac{\lambda k}{\gamma h^2} I \right) W = U$ (33)
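As a concrete sketch of this $f$-update (our transcription, with simplified bookkeeping), the linear system (33) can be assembled and solved as follows for a separable kernel $K(x,y) = K_1(x,y) A$. The alignment of `U` with the grid points and the choice of a separable kernel are our assumptions.

```python
import numpy as np
from scipy.linalg import solve

# Sketch of the f-update (33) for a separable kernel K(x, y) = K1(x, y) * A:
# build the block Gram matrix over the grid points z_l and solve the
# regularized linear system for the coefficients W of expansion (31).
def f_update(z, U, h, gamma, lam, K1, A):
    npts, d = z.shape                        # npts = k + 1 grid points in R^d
    G1 = np.array([[K1(zi, zj) for zj in z] for zi in z])
    G = np.kron(G1, A)                       # block Gram matrix (npts*d, npts*d)
    k = npts - 1
    W = solve(G + (lam * k / (gamma * h ** 2)) * np.eye(npts * d), U.ravel())
    return W.reshape(npts, d)                # one coefficient w_l per grid point
```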

The schematic algorithm is provided in Algorithm 1.

Algorithm 1.

Penalty method for ODE-RKHS

1: Init: $h, \rho, \lambda, f^{(0)}, \gamma^{(0)}, s = 0$
2: while termination condition is not met do
3:   $z^{(s+1)} \leftarrow \arg\min_{z \in \mathbb{R}^{d(k+1)}} J(z, f^{(s)}, \gamma^{(s)})$
4:   $f^{(s+1)} \leftarrow \arg\min_{f \in H} J(z^{(s+1)}, f, \gamma^{(s)})$
5:   $\gamma^{(s+1)} \leftarrow \gamma^{(s)} (1 + \rho)$
6:   $s \leftarrow s + 1$
7:   Check termination condition
8: end while
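A schematic Python transcription of Algorithm 1 is given below (a sketch, not the authors' implementation). Here `z_update` and `f_update` stand for the two inner minimizations of Section 2.5, and $f$ is represented by its coefficient vector so that the early-stopping ratio of Section 2.6 can be computed.

```python
import numpy as np

# Schematic transcription of Algorithm 1; z_update and f_update abstract the
# two inner minimizations, and f is represented by a coefficient array.
def penalty_method(y, obs_idx, h, rho, lam, f0, gamma0, z_update, f_update,
                   max_iter=500, tol=1e-3):
    f, gamma, z = f0, gamma0, None
    for _ in range(max_iter):
        z = z_update(y, obs_idx, f, gamma, h, z)    # line 3: minimize J over z
        f_new = f_update(z, f, gamma, lam)          # line 4: minimize J over f
        gamma *= 1.0 + rho                          # line 5: increase the penalty
        if np.linalg.norm(f_new - f) <= tol * np.linalg.norm(f):
            return z, f_new                         # early stopping (Section 2.6)
        f = f_new
    return z, f
```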

2.6. Initial condition and termination criteria

Since the algorithm will converge to a local minimum of the cost function, the choice of the initial condition is important. We use a gradient matching method.

  1. Approximate the time derivatives of $x$ at the observed times, $\dot{x}(t_j)$, denoted $\hat{\dot{x}}(t_j)$

  2. Estimate $f_0 \in H$ using ridge regression, i.e. minimize over $H$

$G(f_0) = \frac{1}{m} \sum_{j=1}^m |\hat{\dot{x}}(t_j) - f_0(y_j)|^2 + \lambda \|f_0\|_H^2$ (34)

There are several possibilities for the approximation in the first step depending on the sparsity of the data and the amount of noise. In the experiments below, we use central differences.

The termination condition of Algorithm 1 includes a fixed number of iterations $S$ and a threshold on the quantity $\|f^{(s+1)} - f^{(s)}\| / \|f^{(s)}\|$ which allows for early stopping.
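A minimal sketch of this gradient-matching initialization is given below (ours). It uses central differences followed by the kernel ridge regression (34); the scalar Gaussian kernel applied coordinate-wise (i.e. a separable kernel with $A = I$) and the hyperparameter values are our simplifying assumptions.

```python
import numpy as np

# Sketch of the initialization of Section 2.6: central differences for the
# time derivatives, then kernel ridge regression (34) for f0, with a scalar
# Gaussian kernel applied coordinate-wise.
def init_f0(t, y, lam=1e-2, ell=1.0):
    # t: (m,) increasing times, y: (m, d) observations
    xdot = np.gradient(y, t, axis=0)                       # central differences
    K = np.exp(-((y[:, None, :] - y[None, :, :]) ** 2).sum(-1) / (2 * ell ** 2))
    alpha = np.linalg.solve(K + lam * len(t) * np.eye(len(t)), xdot)

    def f0(x):
        kx = np.exp(-((y - x) ** 2).sum(-1) / (2 * ell ** 2))
        return kx @ alpha                                  # estimate of f0(x) in R^d
    return f0
```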

2.7. Multiple trajectories

We present here the extension of the method to multiple trajectories, say n>1 subjects. We assume the same number of observations for each subject and regular sampling to simplify the presentation.

First, we replace (20) and (21) with

$\min_{x, f} \; \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m |y_{ij} - x_i(t_{ij})|^2 + \lambda \|f - f_0\|_H^2,$ (35)

under the constraints

$f \in H$, the RKHS with matrix-valued kernel $K$, $\quad x_i(t) = x_i(t_1) + \int_{t_1}^{t} f(x_i(s)) \, ds$, for $t_1 \le t \le t_m$, $i = 1, \ldots, n$ (36)

We then proceed along the same steps as for the single trajectory case, leading to the unconstrained optimization problem, generalizing (26) and (27).

Algorithm 2.

Multi Trajectories Penalty method for ODE-RKHS

1: Init: $h, \rho, \lambda, f^{(0)}, \gamma^{(0)}, s = 0$
2: while termination condition is not met do
3:   for $i = 1, \ldots, n$ do
4:     $z_i^{(s+1)} \leftarrow \arg\min_{z_i \in \mathbb{R}^{d(k+1)}} J_{\text{multi}}(z, f^{(s)}, \gamma^{(s)})$
5:   end for
6:   $f^{(s+1)} \leftarrow \arg\min_{f \in H} J_{\text{multi}}(z^{(s+1)}, f, \gamma^{(s)})$
7:   $\gamma^{(s+1)} \leftarrow \gamma^{(s)} (1 + \rho)$
8:   $s \leftarrow s + 1$
9:   Check termination condition
10: end while

Notate $z_{il} = x_i(s_l)$, $l = 0, \ldots, k$, $i = 1, \ldots, n$, and $z = (z_1, \ldots, z_n)$,

$\min_{z \in \mathbb{R}^{nd(k+1)}, f \in H} J_{\text{multi}}(z, f, \gamma),$ (37)

with

$J_{\text{multi}}(z, f, \gamma) = \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m |y_{ij} - z_{i k_j}|^2 + \gamma \frac{1}{nk} \sum_{i=1}^n \sum_{l=0}^{k-1} |z_{i,l+1} - z_{il} - h f(z_{il})|^2 + \lambda \|f - f_0\|_H^2.$ (38)

The key point is that $J_{\text{multi}}$ decouples the trajectories, such that the optimization over $z$ can be carried out separately for each trajectory. However, all the observations contribute to the estimation of $f$. The algorithm is presented in Alg 2. In Line 6, we use the no-trick formulation using Gaussian quadrature Fourier features as described in [18].
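To illustrate this explicit-feature update (a sketch, not the authors' code), the snippet below uses plain Monte-Carlo random Fourier features as a stand-in for the Gaussian-quadrature features of [18]; with explicit features the kernel ridge step of Line 6 reduces to an ordinary ridge regression per output coordinate. The exact regularization bookkeeping would follow (38); the constant below mirrors (33) and is our simplification.

```python
import numpy as np

# Sketch of the f-update in Line 6 of Algorithm 2 with an explicit random
# Fourier feature map (Monte-Carlo features as a stand-in for [18]).
def fourier_ridge_update(Z, U, h, gamma, lam, omega, b):
    # Z: (N, d) stacked grid points z_il over all trajectories
    # U: (N, d) Euler residual targets (z_{i,l+1} - z_il)/h - f0(z_il)
    # omega: (d, nF) frequencies, b: (nF,) phases of the feature map
    nF = omega.shape[1]
    Phi = np.sqrt(2.0 / nF) * np.cos(Z @ omega + b)          # (N, nF) features
    reg = lam * Z.shape[0] / (gamma * h ** 2)                # cf. (33); simplified
    W = np.linalg.solve(Phi.T @ Phi + reg * np.eye(nF), Phi.T @ U)
    return lambda x: np.sqrt(2.0 / nF) * np.cos(x @ omega + b) @ W
```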

2.8. Computational Complexity

We analyze the complexity of the algorithm Alg 2. The key parameters are:

  1. d: the dimension of the observed vectors;

  2. n: the number of observed trajectories;

  3. k: the number of samples in the discretization of the time interval;

  4. S: the number of steps in Alg 2;

  5. nF: the number of Fourier features.

We use $O(p^3)$ for the time complexity of solving a (dense) linear system with $p$ variables and $O(w^2 p)$ in the case of a band matrix of width $w$, see [19]. Alg 2, line 4 consists in solving a linear system of size $dk$ with a band matrix of bandwidth $w = 3d$, thus $O(k d^3)$ computations. Line 6 consists in solving $d$ full linear systems of dimension $n_F$, thus $O(d n_F^3)$ computations. In total, we find $O(S n k d^3 + S d n_F^3)$. Note that $k$ is typically chosen proportional to the average number of data points per trajectory. Thus, overall, the algorithm is linear in the number of observations but cubic in the dimension of the observations.

2.9. Non autonomous systems, covariates, and irregular sampling

Non autonomous systems and covariates are handled by modifying the kernel. The issue of irregular sampling is addressed by replacing the first term of (27) by

$\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^{m_i} (t_{i,j+1} - t_{ij}) |y_{ij} - z_{i k_j}|^2$ (39)

with $t_{i, m_i + 1} = T$, $i = 1, \ldots, n$.

3. Consistency of the solution: A finite sample result

In this section, we assume that the algorithm solves the following optimization problem (where $t_{m+1} = T$ by definition):

$\min_{z \in \mathbb{R}^{d(k+1)}, f \in H} \sum_{j=1}^m (t_{j+1} - t_j) |y_j - z_{k_j}|^2,$ (40)

Under the constraints:

  1. $\|f - f_0\|_H \le R$, $|z_0| \le r$

  2. $z_{l+1} = z_l + h f(z_l)$, $0 \le l \le k-1$

Notice that constraint 2 corresponds to the Euler method for the ODE $\dot{x} = f(x)$. Therefore, by linearly interpolating between the times of subdivision $s_l$, $0 \le l \le k$, we can generate a solution $\hat{x}(\cdot)$ defined on $[0,T]$. We denote by $x^*(\cdot)$ the true trajectory generating the noisy observations $y_j$ at each time $t_j$. The purpose of this section is to present a result controlling (in probability) the squared $L^2$ norm of $\hat{x} - x^*$:

$\|\hat{x} - x^*\|_{L^2}^2 := \int_0^T |\hat{x}(t) - x^*(t)|^2 \, dt$ (41)

Let us make the following assumptions:

  • A1: There exists $f^* \in H$ with $\|f^* - f_0\|_H \le R$ and $|x_0^*| \le r$ such that $x^*(0) = x_0^*$ and $\dot{x}^*(t) = f^*(x^*(t))$ for every $0 \le t \le T$.

  • A2: The noise variables $\epsilon_j$ are independent and bounded in absolute value by a constant $M_\epsilon$. (We can assume that the variables are sub-Gaussian instead of bounded if we want to generalize this result.)

  • A3: The kernel $K$ is $\mathcal{C}^2(\mathbb{R}^d)$ in its first argument (this implies that it is also $\mathcal{C}^2(\mathbb{R}^d)$ in its second argument).

  • A4: The kernel K satisfies (13).

We refer to section 2.3 for examples of kernels satisfying A3 and A4.

These assumptions are sufficient for obtaining the main theorem of this section, controlling $\|\hat{x} - x^*\|_{L^2}^2$ with high probability.

Theorem 1. Assuming A1, A2, A3 and A4, there exist positive constants $K_1, K_2, K_3$ and $K_4$, depending only on $R, r, T, M_\epsilon, N_K$ and the kernel $K$, such that for every $\epsilon > 0$, with probability less than $\exp\left( -\frac{K_2 \epsilon^2}{d \sum_{j=1}^m (t_{j+1} - t_j)^2} \right)$:

$\|\hat{x} - x^*\|_{L^2}^2 \ge K_1 d \sqrt{\sum_{j=1}^m (t_{j+1} - t_j)^2} + K_4 d \sum_{j=1}^m (t_{j+1} - t_j)^2 + h^2 K_3 d + \epsilon.$ (42)

For a better understanding of Theorem 1, assume a regular sampling of the interval $[0,T]$ with $m$ points, so that for every $j$, $t_{j+1} - t_j = \frac{1}{m}$. In that case, under the same hypothesis, for any $\epsilon > 0$, with probability less than $\exp\left( -\frac{K_2 m \epsilon^2}{d} \right)$:

$\|\hat{x} - x^*\|_{L^2}^2 \ge \frac{K_1 d}{\sqrt{m}} + \frac{K_4 d}{m} + h^2 K_3 d + \epsilon.$ (43)

A proof of Theorem 1 is provided in the appendix. We provide here a description of the main ideas. The third term on the right hand side of inequality (42) corresponds to the global truncation error between the numerical solution of the ODE and the true solution. The second term corresponds to the error between $\|\hat{x} - x^*\|_{L^2}^2$ and $\frac{1}{m} \sum_{j=1}^m |x^*(t_j) - \hat{x}(t_j)|^2$. The first term is the leading term, assuming that $h$ is always less than $\frac{1}{m}$. Assume that $\hat{x}$ solves the continuous-constraints optimization problem (without an Euler approximation), i.e.:

$\min_{x, f} \; \frac{1}{m} \sum_{j=1}^m |y_j - x(t_j)|^2,$ (44)

under the constraints $\|f - f_0\|_H \le R$, $|x_0| \le r$ and $x(t) = x_0 + \int_0^t f(x(u)) \, du$, $0 \le t \le T$; we can then consider the "generalization" error:

$\frac{1}{m} \sum_{j=1}^m |x^*(t_j) - \hat{x}(t_j)|^2.$ (45)

An upper bound of this error is given by the first term. The main tool used to obtain the upper bound is Dudley's chaining inequality, see [20]. We notice that for every $i = 1, \ldots, d$, the set of coordinate functions $x_i$, where $x$ and $f$ satisfy the constraints of the continuous problem, is included in a set of functions that are uniformly Lipschitz continuous and bounded (the Lipschitz constant and the bound do not depend on $x_0$ and $f$). Upper bounds on covering numbers of such function classes are well known, see [20], hence the use of Dudley's inequality.

One can easily transform the inequality on the probability of Theorem 1 into an inequality on $E\|\hat{x} - x^*\|_{L^2}^2$. Indeed, let us assume for simplicity a regular sampling of $m$ points on the interval $[0,T]$. We denote by:

$\hat{E}_{L^2} := \|\hat{x} - x^*\|_{L^2}^2 - \frac{K_1 d}{\sqrt{m}} - \frac{K_4 d}{m} - h^2 K_3 d.$ (46)

Using theorem 1, we have the following inequality:

$E(|\hat{E}_{L^2}|) = \int_0^\infty \mathbb{P}(|\hat{E}_{L^2}| \ge \epsilon) \, d\epsilon$ (47)
$\le \int_0^\infty \exp\left( -\frac{K_2 m \epsilon^2}{d} \right) d\epsilon$ (48)
$= \sqrt{\frac{\pi d}{4 K_2 m}}$ (49)

This implies the following result.

Corollary 1. Assume we have a regular sampling of m points on the interval [0,T]. Then:

$E(\|\hat{x} - x^*\|_{L^2}^2) \le \sqrt{\frac{\pi d}{4 K_2 m}} + \frac{K_1 d}{\sqrt{m}} + \frac{K_4 d}{m} + h^2 K_3 d.$ (50)

To illustrate the inequality in (50), we conducted a simple toy experiment where the conditions of the theorem are satisfied, and evaluated the convergence rate. In this experiment we considered a one-dimensional autonomous system. We randomly initialized the weights of a function determined by 200 Fourier random features, recorded the norm of the function, and generated a trajectory of 5120 samples using this function. Then we took ten independent and identically distributed random samples of noise with a standard deviation of .05. This provided us with 10 noisy trajectories of 5120 samples (of the same trajectory but with different samples of noise). Finally, we sub-sampled each of these ten noisy trajectories to get 2560 samples, 1280 samples, ... all the way down to 5 samples. This gave us, for each of the ten noisy trajectories, training sets with 5, 10, 20, 40, ..., 5120 samples. We trained the algorithm on each of these datasets and reported the average L2 (squared) error between the estimated trajectory and the true one over the ten trajectories at each level of sparsity. In figure 2, we provide a plot of the log of the average L2 (squared) errors as a function of the log of the number of samples used during training. Equation (50) predicts a slope of $-.5$ or steeper. We fit the data to a line of slope $-.8$, consistent with (50). We provide a plot with a line of slope $-.5$ for comparison.
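The slope check itself is a one-line log-log regression. The sketch below is our reconstruction (not the authors' script), with a synthetic error sequence standing in for the measured average squared $L^2$ errors.

```python
import numpy as np

# Sketch of the rate check: fit the log-log slope of the average squared L2
# error against the training-set size and compare with the m^{-1/2} rate
# suggested by (50).
def loglog_slope(sizes, avg_sq_errors):
    slope, _ = np.polyfit(np.log(sizes), np.log(avg_sq_errors), deg=1)
    return slope

# Example with a synthetic error sequence decaying like m^{-0.8}:
sizes = np.array([5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120])
print(loglog_slope(sizes, sizes ** -0.8))   # approximately -0.8
```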

Fig. 3.

Illustration of the ODE-RKHS Algorithm: The dots show the observations. The estimated trajectories are shown with lines and curves with corresponding colors. Steps i=1,25,50, and 75 are shown from left to right and from top to bottom

4. Experiments

We report experiments for simulated data as well as for real data. In each case, we compare the performances of the proposed algorithm, generically named ODE-RKHS, with six other algorithms. This section is organized as follows: In subsection one, we present the various benchmark methods used for comparison. In subsection two, we present the tuning of the hyperparameters for the ODE-RKHS method. In subsection three, we describe fifteen simulated datasets and an example medical dataset. Finally, in subsection four, we report and comment on the performance of the ODE-RKHS method compared with the benchmark methods on all the datasets.

4.1. Benchmark methods

These algorithms constitute, to the best of our knowledge, the current state of the art for learning nonparametric ODEs from noisy data. We briefly review these algorithms and provide references below.

  1. Nonparametric Ordinary Differential Equations: Nonparametric Ordinary Differential Equations (npODE) is presented in [12]. The authors use a Bayesian model with Gaussian processes (GP). It is the Bayesian counterpart of the frequentist model presented in this paper. Unlike GP regression, where the optimization can be computed in closed form, an approximate optimization method is required. The authors use inducing points, see [21], and sensitivity equations, see [22]. The npODE code was downloaded from http://www.github.com/cagatayyildiz/npode in February 2021. Given the normalized trajectory sets, we ran the algorithm with a scale factor of 1 and an $\ell_0$ of 1. For the 2D systems, we used a width of the inducing point grid $W = 6$, matching the demonstration examples. For the 6D Lorenz96, we encountered out-of-memory errors for $W > 2$, possibly indicating an empirical scaling issue with the method. We thus used $W = 2$ for this system.

  2. Sparse Identification of Nonlinear Dynamics (Fourier and Polynomial Candidate Functions): Sparse Identification of Nonlinear Dynamics (SINDy) is a highly cited technique for identifying nonlinear dynamics from data, see [4]. SINDy predicts governing dynamics equations using gradient matching via sparse regression. In the experiments shown, we test SINDy with two different libraries of possible functions: polynomials up to order three and Fourier features. We choose the SR3 sparsity regularization for its superior performance, detailed in [23], which has a threshold value as a hyperparameter. Other hyperparameters in our tests include the polynomial library's degree and the size and lengthscale of the Fourier features library. A grid search tuner was employed to determine the best hyperparameter values, with the same holdout and evaluation sets as in the competing algorithms. pySINDy v1.6.3 was used for the implementation [24]. We use the AutoKoopman library to tune the hyperparameters, described in [25].

  3. Extended Dynamic Mode Decomposition: The Koopman operator is an infinite dimensional linear operator that captures the dynamics of a non-linear dynamical system. Dynamic Mode Decomposition (DMD), described in [10], can approximate the Koopman operator's eigenvalues and eigenvectors based on observations of the system state. Extended DMD (EDMD) generalizes DMD to nonlinear system learning by approximating the Koopman operator in a high-dimensional space of observables, see [26]. These observables must be selected before using EDMD, and can be chosen ad hoc or by using library learning methods [27]. We use random Fourier features as the observable functions for these experiments, as specified in [28]. We use the AutoKoopman library to tune the hyperparameters via Bayesian optimization, available at https://github.com/EthanJamesLew/AutoKoopman.

  4. Kernel Analog Forecasting: Analog forecasting is a time series prediction method that follows the evolution of the historical time series that most closely matches the current state. Kernel analog forecasting (KAF) replaces single-analog forecasting with weighted ensembles of analogs constructed using local similarity kernels that employ several dynamics-dependent features designed to improve forecast skill [29] [30]. Our KAF implementation is based on https://github.com/rward314/StreamingKAF. Hyperparameters are the kernel function and the rank used for the number of eigenvalues found from the data-defined kernel matrix. We selected a Gaussian kernel and grid-tuned the rank and kernel lengthscale. We use the same eigenvalue multiplier of $10^{-4}$ as the referenced code.

  5. Sparse Cyclic Recovery: We implement the method formulated in [31], which is well suited for the experiments as it is designed for learning structured dynamical systems from under-sampled and possibly noisy state-space measurements. For index-invariant systems, the method generates cyclic permutations to augment the training data. Then, it builds a library of Legendre polynomials of candidate functions and does basis pursuit with thresholding to recover the dynamics. The hyperparameters involved are the parameters for the Douglas-Rachford algorithm used to solve the Legendre basis pursuit (L-BP) problem and the Legendre polynomial degree; we tune these parameters via grid search. We referenced the parameters used in their GitHub project https://github.com/linanzhang/SparseCyclicRecovery. We utilize the same candidate functions as the paper, but tune the noise threshold $\sigma$ and the $\mu, \tau$ parameters of the optimizer. Because of computational limitations, we set the maximum number of optimization iterations to $10^4$.

4.2. Validation, initialization, and selection of hyper-parameters in the ODE-RKHS algorithm

We use the Multi Trajectories Penalty method for ODE-RKHS described in Alg. 2, and a Gaussian kernel. For each coordinate, we chose a bandwidth equal to 20% of the range of the data. We set $\gamma = 1$ and fit $\lambda, \rho$ using a validation set consisting of 20 percent of the training data. We set a maximum of $S = 500$ iterations and used the early stopping criterion of stopping when the ratio $\|f^{(s+1)} - f^{(s)}\| / \|f^{(s)}\|$ was less than $10^{-3}$. Initialization of $f_0$ was done via gradient matching, see section 2.6.

4.3. Datasets

We ran experiments with the same training, validation and test sets for all the algorithms. Testing consisted of computing predicted trajectories starting at the initial condition of each test trajectory.

4.3.1. Oscillator data

The FitzHugh-Nagumo (FHN) oscillator data is a controlled experiment with known and easy-to-visualize 2D trajectories. It has helped calibrate the algorithm described in this paper. It was also demonstrated in [12] for the npODE algorithm. We ran experiments using a simulated dataset generated as follows:

$\dot{v} = v - v^3/3 - w + 1$
$\dot{w} = 0.08 (v + 0.7 - 0.8 w)$ (51)

Intermediate and final results of the ODE-RKHS algorithm are presented in Fig. 3 for the FHN data. Notice that during the first steps, shown on the top line, the estimated trajectories with solid color lines are rough but fit the data closely. During the later steps, shown on the bottom line, the trajectories are smoother but still fit the data.

We generated a set of 50 noiseless trajectories. There were 201 observations per trajectory, one for each .1 increment in time. To generate the training sets, we added samples of Gaussian noise to these fifty trajectories. There were five levels of noise, with respective standard deviations $\sigma \in \{0.120, 0.365, 0.610, 0.855, 1.100\}$. We generated a single test set of 100 trajectories without noise, again with 201 observations per trajectory separated by .1 time increments.
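A sketch of this data generation (our reconstruction, not the authors' script) is given below. The initial conditions and the integration tolerance are our choices; the total time span of 20 is inferred from 201 observations at .1 increments.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch of the FHN data generation: noiseless trajectories of (51) sampled
# every 0.1 time units, plus Gaussian noise at the five listed levels.
def fhn(t, s):
    v, w = s
    return [v - v ** 3 / 3 - w + 1, 0.08 * (v + 0.7 - 0.8 * w)]

rng = np.random.default_rng(0)
t_eval = np.linspace(0.0, 20.0, 201)
clean = []
for _ in range(50):
    x0 = rng.uniform(-3, 3, size=2)          # initial condition (our choice)
    sol = solve_ivp(fhn, (0.0, 20.0), x0, t_eval=t_eval, rtol=1e-8)
    clean.append(sol.y.T)                     # (201, 2) trajectory

noise_levels = [0.120, 0.365, 0.610, 0.855, 1.100]
training_sets = [[tr + rng.normal(scale=s, size=tr.shape) for tr in clean]
                 for s in noise_levels]
```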

4.3.2. Lorenz63 data

Our next experiment was on the Lorenz system defined by the equations

$\dot{x} = 10(y - x)$
$\dot{y} = x(28 - z) - y$
$\dot{z} = xy - \frac{8}{3} z$ (52)

We generated 50 noiseless trajectories with 201 observations per trajectory, each separated by a 0.01 increment in time. Next, we generated samples of Gaussian noise with levels $\sigma \in \{0.5, 1.2, 1.9, 2.6, 3.3\}$. We added the noise samples to the noiseless trajectories to generate five training sets. Then we generated a single test set consisting of 100 trajectories, each with 201 observations at 0.01 time increments.

4.3.3. Lorenz96

The Lorenz96 data arises from [34]. The chaotic system is defined for $n = 6$ dimensions by:

$\dot{x}_k = (x_{k+1} - x_{k-2}) x_{k-1} - x_k + F, \quad k = 1, \ldots, 6$ (53)

We have selected $F = 8$. Indices wrap around cyclically, so that $x_{-1} = x_5$, $x_0 = x_6$, and $x_7 = x_1$. To construct the training set, we generated a set of 100 noiseless trajectories, each with 100 observations. The observations were separated by a time increment of 0.01. We added five levels of Gaussian noise to the noiseless data to generate five different training sets. The standard deviations of the noise generated here are $\sigma \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$. Our test set consisted of 150 noiseless trajectories, each with 100 points on them. The time increment between observations was the same as in the training set.
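A sketch of a single Lorenz96 trajectory with the first noise level is given below (our reconstruction; the initial condition is our choice, and the cyclic wrap-around is implemented with modular indexing).

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch of the 6-dimensional Lorenz96 system (53) with F = 8 and cyclic
# wrap-around, sampled at 0.01 time increments.
def lorenz96(t, x, F=8.0):
    # dx_k/dt = (x_{k+1} - x_{k-2}) x_{k-1} - x_k + F, indices taken cyclically
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

rng = np.random.default_rng(0)
t_eval = np.linspace(0.0, 0.99, 100)                 # 100 observations
x0 = 8.0 * np.ones(6) + rng.normal(scale=0.5, size=6)
sol = solve_ivp(lorenz96, (0.0, 0.99), x0, t_eval=t_eval, rtol=1e-8)
noisy = sol.y.T + rng.normal(scale=0.1, size=sol.y.T.shape)   # noise level 1
```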

4.3.4. The accumulation of Amyloid in the cortex of aging subjects

The accumulation of Amyloid in the brain is believed to be one of the earliest pathological mechanisms of Alzheimer’s disease, beginning more than a decade before the onset of clinical symptoms, see [35].

Based on observations from several longitudinal Amyloid positron emission tomography (PET) studies, it is believed that the rate of Amyloid accumulation is closely associated with the level of Amyloid at the same age, see [36]. We develop a principled mathematical model capturing this phenomenon and use it to predict the accumulation of Amyloid across individuals longitudinally.

We used Pittsburgh compound B (PiB) PET scans from the Wisconsin Registry for Alzheimer's Prevention (WRAP) to assess global Amyloid burden, measured by the Distribution Volume Ratio (DVR)¹. The number of subjects in this study is $n = 179$, with 3.06 visits on average, over an average span of 6.84 years. We fit the model in (11) to the posterior cingulum, precuneus and gyrus rectus DVRs, averaging the left and right DVR in each case. These regions are known to show Amyloid accumulation early in the disease process. Figure 4 provides a visualization of the trajectories estimated using ODE-RKHS super-imposed (same color) with the data. This shows that the estimated trajectories are qualitatively accurate.

Fig. 4.

Amyloid prediction experiment. Horizontal axis is in years. Vertical axis corresponds to DVR. The left-most image corresponds to the gyrus rectus, the middle to the cingulum and the right to the precuneus.

4.4. Evaluation

Testing consisted of computing predicted trajectories starting at the initial condition of the test trajectories and computing the following error measurement for each predicted trajectory:

$\mathrm{Err} := \sum_{i=2}^{n} (t_i - t_{i-1}) |y_i - \hat{y}_i|^2$ (54)

where $t_i$ is the $i$th observation time, $y_i$ is the $i$th observation of the test trajectory, $\hat{y}_i$ is the $i$th point of the predicted trajectory and $n$ is the number of observations in the trajectory. For each dataset, we report the average error measurement over the test set trajectories.
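A direct transcription of (54) is given below (a sketch; we assume the test and predicted trajectories are numpy arrays sampled at the same observation times).

```python
import numpy as np

# Error measure (54) for one test trajectory.
def trajectory_error(t, y_true, y_pred):
    dt = np.diff(t)                                        # t_i - t_{i-1}, i = 2..n
    sq = np.sum((y_true[1:] - y_pred[1:]) ** 2, axis=1)    # |y_i - y_hat_i|^2
    return np.sum(dt * sq)
```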

In table 1, we report the performance of the ODE-RKHS method and other benchmark methods on the Amyloid dataset. The average L2 norm errors (Err) between predicted Amyloid level trajectories and true level trajectories are reported. The ODE-RKHS algorithm yields the lowest average L2 error among the seven compared methods.

Table 1.

Results for Amyloid data. The minimum error is marked with an asterisk.

Method              Err
npODE               .59
KAF                 .84
Koopman             .52
L-BP                .40
ODE-RKHS            .36*
SINDy Fourier       .42
SINDy Polynomial    .39

In table 2, we report the performance of ODE-RKHS and the benchmark methods on the 3 simulated datasets (FHN, Lorenz63, and Lorenz96) with the 5 simulated levels of noise, level 1 corresponding to the noise with the smallest standard deviation. The ODE-RKHS algorithm performed best in 10 out of the 15 simulated test sets. The second best performing method was SINDy Polynomial with the lowest error in just 2 out of the 15 simulated datasets. Moreover, the 2 cases where SINDy polynomial performed best correspond to the lowest noise levels of the Lorenz63 dataset, indicating that our method is more robust to higher noise levels.

Table 2.

Performance Table for the 3 Simulated Datasets

Noise Level 1
Method              FHN      Lorenz63   Lorenz96
npODE               1.53     17.49      1.61
KAF                 6.019    22.75      2.18
Koopman             2.27     5.96       .25*
L-BP                5.55     16.35      1.02
ODE-RKHS            .53*     9.06       .30
SINDy Fourier       5.59     23.37      .52
SINDy Polynomial    1.28     2.18*      1.13

Noise Level 2
Method              FHN      Lorenz63   Lorenz96
npODE               1.57     18.75      1.29
KAF                 8.35     22.34      2.17
Koopman             3.15     13.67      1.09
L-BP                6.87     18.33      1.10
ODE-RKHS            1.16*    11.24      .42*
SINDy Fourier       5.54     21.63      .84
SINDy Polynomial    2.60     10.73*     1.76

Noise Level 3
Method              FHN      Lorenz63   Lorenz96
npODE               3.07     20.06      1.31
KAF                 8.25     21.96      2.16
Koopman             3.57     16.12      1.11
L-BP                5.48     19.63      1.17
ODE-RKHS            1.83*    13.38*     .52*
SINDy Fourier       6.50     22.88      1.00
SINDy Polynomial    2.84     15.88      1.23

Noise Level 4
Method              FHN      Lorenz63   Lorenz96
npODE               4.33     19.61      1.95
KAF                 8.53     21.82      2.16
Koopman             7.18     17.92      1.02
L-BP                6.58     21.03      1.09
ODE-RKHS            2.20*    14.38*     .83*
SINDy Fourier       9.47     22.07      1.24
SINDy Polynomial    5.57     20.67      3.03

Noise Level 5
Method              FHN      Lorenz63   Lorenz96
npODE               4.37     19.45      2.10
KAF                 7.62     21.49      2.15
Koopman             7.12     18.97*     1.23
L-BP                7.51     21.68      1.34
ODE-RKHS            1.97*    21.20      1.23
SINDy Fourier       12.29    22.19      1.18*
SINDy Polynomial    9.00     23.23      1.67

Err values are reported; the minimum value in each column is marked with an asterisk. ODE-RKHS performs best in 10 out of the 15 datasets.

5. Discussion

We proposed an algorithm for learning non-parametric ODEs assuming that the function f generating the vector field in Rd belongs to a vector-valued RKHS with a kernel satisfying certain regularity conditions. The data input of the algorithm consists of noisy observations at different times of multiple trajectories. The algorithm is linear in the number of observations but cubic in their dimension. We proved the consistency of the estimated trajectory, showing that the L2 squared distance between the estimated trajectory and the true one vanishes as more observations are collected. We assessed the algorithm with simulated and real data and obtained results that consistently compare favorably with the state of the art on a wide range of noise levels.

Fig. 2.

On the left we plot the log of the average L2 squared error between the true trajectory and the estimated one as a function of the log of the number of samples. A linear regression yields a slope of $-.8$, indicating convergence at a rate between $\frac{1}{m}$ and $\frac{1}{\sqrt{m}}$. On the right we plot the predicted trajectories when we have 5 observations, together with the true trajectory (in the dotted line).

Acknowledgements

The work at Portland State University was partly funded by the National Institutes of Health grants RO1AG021155, R01EY032284, and R01AG027161, National Science Foundation grant #2136228, and the Google Research Award "Kernel PDE". The funding sources had no involvement in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication. The material of Galois, Inc. is based upon work supported by the Air Force Research Laboratory (AFRL) and DARPA under Contract No. FA8750-20-C-0534. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s). They do not necessarily reflect the views of the Air Force Research Laboratory (AFRL) and DARPA.

Appendix A. Consistency of the estimator of the trajectory

Appendix A.1. Assuming we solve the problem without Euler approximation

This section gives the proof of the theorem presented in section 3 of the main text. We present the proof for d=1 since the generalization to multiple dimensions is straightforward. We also present the proof for the case of autonomous systems. Keeping the notations of the main text, we make the following assumptions:

  • A1: There exists $f^* \in H$ with $\|f^* - f_0\|_H \le R$ and $|x_0^*| \le r$ such that $x^*(0) = x_0^*$ and $\dot{x}^*(t) = f^*(x^*(t))$ for every $0 \le t \le T$.

  • A2: The noise variables $\epsilon_j$ are independent and bounded by a constant $M_\epsilon$, with a variance denoted by $\sigma^2$. (We can assume that the variables are sub-Gaussian instead of bounded if we want to generalize this result.)

  • A3: The kernel $K$ is $\mathcal{C}^2(\mathbb{R})$ in its first argument (this implies that it is also $\mathcal{C}^2(\mathbb{R})$ in its second argument).

  • A4: The kernel K satisfies the hypothesis of lemma 1.

Without loss of generality, we will assume that f0=0 in our proof.

Let $H$ be the RKHS with reproducing kernel $K$. Let $f \in H$ such that $\|f\|_H \le R$. We know, using assumption A4 and Lemma 1, that $f$ is uniformly Lipschitz, with a Lipschitz constant that does not depend on $f$, which we denote by $L_1$. Specifically,

$|f(x) - f(y)| \le L_1 |x - y|$ (A.1)

with $L_1 = N_K R$. Using (A.1), we will prove the following lemma:

Lemma 2. Assuming A4, consider the set of solutions to the problem

$\dot{x} = f(x), \quad x(t_0) = x_0$ (A.2)

where $f$ belongs to the RKHS with kernel $K$, $|x_0| \le r$ and $t \in [0,T]$. Then any solution $x$ in this set of solutions is bounded by a uniform constant $B_1$ that only depends on $T, R, L_1$ and $L_3^2 := \sup_{|x| < C} |K(x,x)|$.

Specifically,

$|x(t) - x(t_0)| \le B_1 = T L_3 R \, e^{L_1 T}$ (A.3)

Proof. We start by taking $f$ in our class of functions and $x_0$ such that $|x_0| \le r$. We therefore can write:

$x(t) - x_0 = \int_0^t \left( f(x(s)) - f(x_0) \right) ds + t f(x_0)$ (A.4)
$\le \int_0^t |f(x(s)) - f(x_0)| \, ds + t \|f\|_H \sqrt{K(x_0, x_0)}$ (A.5)
$\le L_1 \int_0^t |x(s) - x_0| \, ds + T L_3 R$ (A.6)

Now denote by $G(t) := |x(t) - x_0|$. If we prove that $G(t)$ is bounded by a constant depending only on $T, R, L_1$ and $L_3$, we will be done. So far we have:

$G(t) \le L_1 \int_0^t G(s) \, ds + T L_3 R$ (A.7)

Denote by $V(t) := \int_0^t G(s) \, ds$. We have that:

$V'(t) \le L_1 V(t) + T L_3 R$ (A.8)

which implies:

$e^{-L_1 t} V'(t) - L_1 e^{-L_1 t} V(t) \le T L_3 R \, e^{-L_1 t}$ (A.9)

Integrating the inequality between 0 and $t$ and using the fact that $V(0) = G(0) = 0$, we obtain:

$e^{-L_1 t} V(t) \le \frac{T L_3 R}{L_1} \left( 1 - e^{-L_1 t} \right)$ (A.10)

or, equivalently,

$V(t) \le \frac{T L_3 R}{L_1} \left( e^{L_1 t} - 1 \right)$ (A.11)

Finally, since $V'(t) = G(t) \le L_1 V(t) + T L_3 R$, we have:

$G(t) \le T L_3 R \, e^{L_1 t} \le T L_3 R \, e^{L_1 T}$ (A.12)

Let us now introduce the following notations:

  • We denote by $x(x_0, f, t)$ the solution to the ODE with derivative $f$ and initial condition $x_0$.

  • $y_i$ is the observed noisy point from the trajectory at time $t_i$.

  • $x^*(t)$ is the true trajectory evaluated at time $t$.

We now proceed with the following reasoning. We assume that our trajectory minimizes

$\hat{L}(f, x_0) := \sum_{i=1}^m (t_{i+1} - t_i) \left( (x(x_0, f, t_i) - y_i)^2 - \sigma^2 \right)$ (A.13)

over $f, x_0$ such that $\|f\|_H \le R$ and $|x_0| \le r$. We denote the minimizer by $(\hat{f}, \hat{x}_0)$.

When $x_0$ and $f$ are fixed and not data dependent (deterministic), the expected value of $\hat{L}(f, x_0)$ is:

$L(f, x_0) := \sum_{i=1}^m (t_{i+1} - t_i) \left( x(x_0, f, t_i) - x^*(t_i) \right)^2$ (A.14)

Notice that A1 implies:

$\min_{\|f\|_H \le R, |x_0| \le r} L(f, x_0) = L(f^*, x_0^*) = \sum_{i=1}^m (t_{i+1} - t_i) \left( x^*(t_i) - x^*(t_i) \right)^2 = 0$ (A.15)

Our goal is to evaluate $L(\hat{f}, \hat{x}_0)$ and obtain a generalization bound. We have:

$L(\hat{f}, \hat{x}_0) = L(\hat{f}, \hat{x}_0) - \hat{L}(\hat{f}, \hat{x}_0) + \hat{L}(\hat{f}, \hat{x}_0) - \hat{L}(f^*, x_0^*) + \hat{L}(f^*, x_0^*) - L(f^*, x_0^*)$ (A.16)

And therefore, since the middle term in (A.16), $\hat{L}(\hat{f}, \hat{x}_0) - \hat{L}(f^*, x_0^*)$, is non-positive (because $(\hat{f}, \hat{x}_0)$ minimizes $\hat{L}$),

$L(\hat{f}, \hat{x}_0) \le 2 \sup_{\|f\|_H \le R, |x_0| \le r} |L(f, x_0) - \hat{L}(f, x_0)|$ (A.17)

We thus consider the following quantity:

$\mathrm{Err} := \sup_{\|f\|_H \le R, |x_0| \le r} |\hat{L}(f, x_0) - L(f, x_0)|$ (A.18)

Expanding this quantity we get:

$\sup_{\|f\|_H \le R, |x_0| \le r} \left| \sum_{i=1}^m (t_{i+1} - t_i) \left( y_i^2 - x^*(t_i)^2 - \sigma^2 - 2 x(x_0, f, t_i)(y_i - x^*(t_i)) \right) \right|$ (A.19)

Notice that if we replace, for a single given $i$, $y_i = x^*(t_i) + \epsilon_i$ by $\tilde{y}_i = x^*(t_i) + \tilde{\epsilon}_i$, the quantity in equation (A.19) changes by a quantity bounded by $K_2 (t_{i+1} - t_i)$, where $K_2$ can be bounded by $4(B_1 + r + M_\epsilon) M_\epsilon + 4(B_1 + r) M_\epsilon$. Therefore, using McDiarmid's inequality [37]:

$\mathbb{P}\left( \mathrm{Err} \ge E(\mathrm{Err}) + \epsilon \right) \le \exp\left( -\frac{2 \epsilon^2}{K_2^2 \sum_{i=1}^m (t_{i+1} - t_i)^2} \right)$ (A.20)

We therefore need to provide an upper bound of E(Err). For that, we are going to view:

$|\hat{L}(f, x_0) - L(f, x_0)| = \left| \sum_{i=1}^m (t_{i+1} - t_i) \left( y_i^2 - x^*(t_i)^2 - \sigma^2 - 2 x(x_0, f, t_i)(y_i - x^*(t_i)) \right) \right|$ (A.21)

as a stochastic process indexed by $x$, where $x \in \mathcal{X}$, the set of all solutions $x(f, x_0, \cdot)$ for all $\|f\|_H \le R$ and $|x_0| \le r$. In other words, we view the process $\hat{L}(f, x_0) - L(f, x_0)$, indexed by $f$ and $x_0$, as:

$|\hat{L}(x) - L(x)|$ (A.22)

where $x \in \mathcal{X}$ is some $x(f, x_0, \cdot)$. Notice that Err is also:

$\sup_{x \in \mathcal{X}} |\hat{L}(x) - L(x)|$ (A.23)

Notice that $\mathcal{X}$ is a subset of the continuous functions defined on $[0,T]$. Therefore we can equip $\mathcal{X}$ with the metric structure $(\mathcal{X}, \|\cdot\|_\infty)$. We will apply Dudley's inequality (see e.g. [20], Theorem 8.1.3) to bound:

$E(\mathrm{Err}) = E\left( \sup_{\|f\|_H \le R, |x_0| \le r} |\hat{L}(f, x_0) - L(f, x_0)| \right)$ (A.24)

To apply Dudley’s inequality, we are going to use the following lemma.

Lemma 3. The solutions $x \in \mathcal{X}$ are Lipschitz with a Lipschitz constant that is uniform over $\mathcal{X}$, i.e., there exists a constant $L_6$ such that for every $x \in \mathcal{X}$, $t \in [0,T]$ and $s \in [0,T]$:

$|x(t) - x(s)| \le L_6 |t - s|$ (A.25)

$L_6$ depends on $R, B_1, r$ and the kernel $K$.

Proof. Let $x_0$ such that $|x_0| \le r$ and $f$ such that $\|f\|_H \le R$. We have:

$|\dot{x}(x_0, f, t)| = |f(x(t))|$ (A.26)
$\le R \sup_{|x| \le B_1 + r} \sqrt{K(x,x)}$ (A.27)

As a consequence, if we denote by $\mathcal{N}(\mathcal{X}, \epsilon)$ the covering number of $\mathcal{X}$ with radius $\epsilon$, we have the existence of a constant $L_7$ ($L_7$ only depends on $B_1$, $r$ and $L_6$) such that:

$\mathcal{N}(\mathcal{X}, \epsilon) \le \exp\left( \frac{L_7}{\epsilon} \right),$ (A.28)

where we used a known upper bound that can be found for example in [20] (exercise 8.2.7) on the covering number of uniformly bounded Lipschitz continuous functions defined on a finite interval.

Using this result combined with Dudley’s inequality, we obtain the existence of a constant L8 (depending only on L7) such that:

Proposition 1.

$E(\mathrm{Err}) \le L_8 \sqrt{\sum_{i=1}^m (t_{i+1} - t_i)^2}$ (A.29)

Proof. Apply Dudley's inequality to Err, using inequality (A.28), the fact that the diameter of $\mathcal{X}$ is finite, bounded by $2(B_1 + r)$, and that for every $M < \infty$

$\int_0^M \sqrt{\log\left( \mathcal{N}(\mathcal{X}, \epsilon) \right)} \, d\epsilon \le \int_0^M \sqrt{\log\left( \exp\left( \frac{L_7}{\epsilon} \right) \right)} \, d\epsilon < \infty$ (A.30)

As a consequence, using (A.20) and Proposition 1, we obtain the following inequality:

$\mathbb{P}\left( \mathrm{Err} \ge L_8 \sqrt{\sum_{i=1}^m (t_{i+1} - t_i)^2} + \epsilon \right) \le \exp\left( -\frac{2 \epsilon^2}{K_2^2 \sum_{i=1}^m (t_{i+1} - t_i)^2} \right)$ (A.31)

Using inequalities (A.17) and (A.31) we finally obtain the following theorem:

Theorem 2. With assumptions A1, A2, A3 and A4, there exist constants $L_9$ and $K_2$ depending only on $R, r, T, M_\epsilon$ and the kernel $K$ such that for every $\epsilon$:

$\mathbb{P}\left( L(\hat{f}, \hat{x}_0) \ge L_9 \sqrt{\sum_{i=1}^m (t_{i+1} - t_i)^2} + \epsilon \right) \le \exp\left( -\frac{2 \epsilon^2}{K_2^2 \sum_{i=1}^m (t_{i+1} - t_i)^2} \right)$ (A.32)

Appendix A.2. Including the Euler approximation

In reality, the solution (trajectory) that we propose for every $f$ and $x_0$ is not $x(x_0, f, \cdot)$, the solution of the ODE, but $\tilde{x}(x_0, f, h, \cdot)$, the solution obtained with Euler's method of time step $h$. The idea is to use the fact that, under some sufficient conditions, we know how to bound the error between Euler's method and the true solution. For example, we know that if $f$ is Lipschitz with Lipschitz constant $L_1$ and the solution $x(x_0, f, \cdot)$ is $\mathcal{C}^2$ with a constant $L_{11}$ such that:

$|\ddot{x}(x_0, f, t)| \le L_{11}, \quad 0 \le t \le T$ (A.33)

then we have the following global truncation error bound [38]:

$\max_{1 \le i \le m} |x(x_0, f, t_i) - \tilde{x}(x_0, f, h, t_i)| \le \frac{h L_{11}}{2 L_1} \left( e^{L_1 T} - 1 \right)$ (A.34)

We already showed that f is Lipschitz with some constant L1. To ensure the condition of inequality (A.33), notice that:

$\ddot{x}(x_0, f, t) = f'(x(x_0, f, t)) \, f(x(x_0, f, t))$ (A.35)

Since we already showed that the solutions $x(x_0, f, \cdot)$ are uniformly bounded by $B_1 + r$, it is sufficient to ensure that $f$ is $\mathcal{C}^1$. This is true if we assume that our kernel $K$ is $\mathcal{C}^2$, and hence (A.34) will be ensured.

Taking into account the Euler approximation and the error bound, the steps of the consistency proof are identical, with the following important difference in equation (A.15) from the previous section:

$\min_{\|f\|_H \le R, |x_0| \le r} L(f, x_0) \le L(f^*, x_0^*)$ (A.36)

with

$L(f^*, x_0^*) = \sum_{i=1}^m (t_{i+1} - t_i) \left( \tilde{x}^*(t_i, h) - x^*(t_i) \right)^2 \le h^2 \frac{L_{11}^2 T}{4 L_1^2} \left( e^{L_1 T} - 1 \right)^2 =: h^2 L_{12}$ (A.37)

With this modification, theorem 2 becomes:

Theorem 3. Assuming A1, A2, A3 and A4, there exist constants $K_2$, $L_{12}$ and $L_{13}$ depending only on $R, r, T, M_\epsilon$ and the kernel $K$ such that for every $\epsilon$:

$\mathbb{P}\left( L(\hat{f}, \hat{x}_0) \ge L_{13} \sqrt{\sum_{i=1}^m (t_{i+1} - t_i)^2} + h^2 L_{12} + \epsilon \right) \le \exp\left( -\frac{2 \epsilon^2}{K_2^2 \sum_{i=1}^m (t_{i+1} - t_i)^2} \right)$ (A.38)

Appendix A.3. L2 squared distance between the true solution and the estimated trajectory

In reality, $L(\hat{f}, \hat{x}_0)$ is an approximation of the squared $L^2$ norm

$\|x(\hat{f}, \hat{x}_0, \cdot) - x^*(\cdot)\|_{L^2}^2 := \int_0^T \left( x(\hat{f}, \hat{x}_0, t) - x^*(t) \right)^2 dt$ (A.39)

Since we proved that the solutions are uniformly bounded by $B_1 + r$ and $\dot{x}$ is bounded by $L_6$, the function $t \mapsto (x(\hat{f}, \hat{x}_0, t) - x^*(t))^2$ is Lipschitz with Lipschitz constant $8(B_1 + r) L_6$ (we just bound the norm of the derivative). Therefore:

$\left| \|x(\hat{f}, \hat{x}_0, \cdot) - x^*(\cdot)\|_{L^2}^2 - L(\hat{f}, \hat{x}_0) \right| \le 8 (B_1 + r) L_6 \sum_{i=1}^m (t_{i+1} - t_i)^2$ (A.40)

which proves Theorem 1 of the main text.

Appendix B. Kernels

We are interested in listing kernels that satisfy Lemma 1, and thus can be used to model ODEs admitting a single solution. There are cases when one can directly verify the hypothesis of Lemma 1. In the case of translation invariant kernels, one can use the Bochner theorem to provide a sufficient condition as explained in the next section.

Appendix B.1. Translation invariant kernels

We consider translation invariant scalar positive definite kernels over $\mathbb{R}^d$, that is, kernels for which

$k(u,v) = h(u - v), \quad u, v \in \mathbb{R}^d$ (B.1)

The Bochner theorem provides a characterization of translation invariant kernels. Specifically, there exists a probability density $q$ with respect to the Lebesgue measure over $\mathbb{R}^d$ such that

$h(x) = h(0) \int_{\mathbb{R}^d} e^{i x^T y} \, q(y) \, dy$ (B.2)

Furthermore, since we restrict our attention to real-valued kernels,

$h(x) = h(0) \int_{\mathbb{R}^d} \cos(x^T y) \, q(y) \, dy$ (B.3)

The gradient of h is then formally the vector of length d

$\nabla h(x) = -h(0) \int_{\mathbb{R}^d} y \sin(x^T y) \, q(y) \, dy$ (B.4)

and the Hessian of h is formally the matrix

$\nabla^2 h(x) = -h(0) \int_{\mathbb{R}^d} y y^T \cos(x^T y) \, q(y) \, dy$ (B.5)

Translation invariant kernels that satisfy Lemma 1 are such that

$Q(x) := c |x|^2 + 2 (h(x) - h(0)) \ge 0$ (B.6)

for some constant $c > 0$ and for any $x \in \mathbb{R}^d$. Notice that $Q(0) = 0$. Next, since $\nabla h(0) = 0$, $\nabla Q(0) = 0$. Moreover,

$\nabla^2 Q(x) = 2 c I + 2 \nabla^2 h(x)$ (B.7)

where $I$ is the identity matrix. Next, since $\nabla^2 h(x)$ is a symmetric matrix, it has real eigenvalues. Suppose these eigenvalues are bounded uniformly from below. In that case, one can choose a constant $c$ large enough such that $\nabla^2 Q(x)$ is positive definite for each $x \in \mathbb{R}^d$, which implies that $Q$ is convex; and since $Q(0) = 0$ and $\nabla Q(0) = 0$, $Q(x) \ge 0$ for each $x \in \mathbb{R}^d$ and the conditions of Lemma 1 are satisfied. A sufficient condition for this to happen is that all the coordinates of $\nabla^2 h$ are bounded, i.e., for each $i \in \{1, \ldots, d\}$, $E[Y_i^2] < \infty$, where $Y_i$ is a random variable with density $q_i$, the $i$th marginal of $q$.

Appendix B.2. Explicit Kernels:

We begin by observing the condition

$d_{K_{ii}}^2(u,v) \le N_K^2 |u - v|^2, \quad u, v \in \mathbb{R}^d, \; i = 1, \ldots, d$ (B.8)

is equivalent to the condition:

$\sum_{i=1}^d d_{K_{ii}}^2(u,v) \le N^2 |u - v|^2$ (B.9)

Consider the case where $K$ is an explicit kernel. That is to say, there exists a finite ($p$) dimensional feature space and a mapping $\Phi: \mathbb{R}^d \to \mathbb{R}^{p \times d}$ for which:

$K(u,v) = \Phi(u)^T \Phi(v)$ (B.10)

The Fourier random features used in our experiments fall in this category.

Lemma 4.

$\sum_{i=1}^d d_{K_{ii}}^2(u,v) = \|\Phi(u) - \Phi(v)\|^2$ (B.11)

where $\|\cdot\|$ is the Frobenius norm.

Proof:

$\sum_{i=1}^d \left( k_{ii}(u,u) - 2 k_{ii}(u,v) + k_{ii}(v,v) \right) = \sum_{i=1}^d \left( e_i^T \Phi(u)^T \Phi(u) e_i - 2 e_i^T \Phi(u)^T \Phi(v) e_i + e_i^T \Phi(v)^T \Phi(v) e_i \right)$ (B.12)
$= \sum_{i=1}^d e_i^T \left( \Phi(u)^T \Phi(u) - \Phi(u)^T \Phi(v) - \Phi(v)^T \Phi(u) + \Phi(v)^T \Phi(v) \right) e_i$ (B.13)
$= \sum_{i=1}^d e_i^T (\Phi(u) - \Phi(v))^T (\Phi(u) - \Phi(v)) e_i$ (B.14)
$= \mathrm{Trace}\left( (\Phi(u) - \Phi(v))^T (\Phi(u) - \Phi(v)) \right)$ (B.15)
$= \|\Phi(u) - \Phi(v)\|^2$ (B.16)

Therefore, for explicit kernels, the condition of Lemma 1 is equivalent to the feature map being Lipschitz continuous with respect to the Frobenius norm.
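To illustrate Lemma 4 (a numerical sketch only; the feature map below is an arbitrary random Fourier construction, not the exact kernel used in the experiments), the following snippet builds an explicit matrix-valued kernel $K(u,v)=\Phi(u)^T\Phi(v)$ and checks that $\sum_i dK_{ii}^2(u,v)$ equals $\|\Phi(u)-\Phi(v)\|^2$ in the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 2, 50  # state dimension and number of random features (illustrative choices)

# One random Fourier feature vector per output coordinate: Phi(u) is p x d,
# with column i given by cos(W_i u + b_i) applied entrywise.
W = rng.standard_normal(size=(d, p, d))
b = rng.uniform(0.0, 2 * np.pi, size=(d, p))

def Phi(u):
    """Feature map Phi: R^d -> R^{p x d}, so that K(u, v) = Phi(u)^T Phi(v)."""
    cols = [np.sqrt(2.0 / p) * np.cos(W[i] @ u + b[i]) for i in range(d)]
    return np.stack(cols, axis=1)  # shape (p, d)

def K(u, v):
    return Phi(u).T @ Phi(v)       # d x d matrix-valued kernel

u, v = rng.standard_normal(d), rng.standard_normal(d)

# Left-hand side of Lemma 4: sum_i [K_ii(u,u) - 2 K_ii(u,v) + K_ii(v,v)]
lhs = np.trace(K(u, u) - 2 * K(u, v) + K(v, v))
# Right-hand side: squared Frobenius norm of the feature difference
rhs = np.linalg.norm(Phi(u) - Phi(v), "fro") ** 2

print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")  # the two agree up to round-off
```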

Appendix B.3. Examples of kernels which satisfy the assumptions of Lemma 1

Let us denote

$P(u,v) = K_1(u,u) + K_1(v,v) - 2K_1(u,v)$ (B.17)
  1. The linear kernel
    $K_1(u,v) = u^T A v$ (B.18)
    where $A$ is a psd matrix. Indeed,
    $P(u,v) = (u-v)^T A (u-v) \le \|u-v\|^2 \sup_{1\le i\le d}\lambda_i$ (B.19)
    where $\lambda_i$ are the eigenvalues of $A$, using the Rayleigh quotient property.
  2. The Gaussian kernel:
    $K_1(u,v) = \exp\!\left(-\tfrac{1}{2}(u-v)^T A (u-v)\right)$ (B.20)
    where $A$ is a psd matrix. Indeed,
    $P(u,v) = 2 - 2\exp\!\left(-\tfrac{1}{2}(u-v)^T A(u-v)\right) \le (u-v)^T A(u-v) \le \|u-v\|^2 \sup_{1\le i\le d}\lambda_i$ (B.21)
    where $\lambda_i$ are the eigenvalues of $A$ and the first inequality follows from the elementary bound $e^x \ge 1+x$. (A numerical check of (B.19) and (B.21) is sketched after this list.)
  3. The rational quadratic kernel:
    $K_1(x,y) = \dfrac{\|x-y\|^2}{\|x-y\|^2+\theta}, \quad \theta > 0$ (B.22)
    Note that in this case,
    $P(u,v) \le \dfrac{1}{\theta}\|u-v\|^2$ (B.23)
  4. The sinc kernel
    $K_1(u,v) = \prod_{i=1}^d \dfrac{\sin(u_i-v_i)}{u_i-v_i}$ (B.24)
    We use the fact that $K_1$ is a translation-invariant kernel with associated density $q(y) = \prod_{i=1}^d q_1(y_i)$ with
    $q_1(z) = \tfrac{1}{2}$ for $-1 \le z \le 1$ (B.25)
  5. The Matérn kernel with $p > 3/2$. This kernel is translation invariant with associated density $q(y) = \prod_{i=1}^d q_1(y_i)$ with
    $q_1(z) \propto \dfrac{1}{(1+z^2)^p}$ (B.26)
    and
    $\mathbb{E}[X^2] < \infty, \quad X \sim q_1$ (B.27)
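As a quick sanity check of the bounds in the first two examples (an illustrative sketch only, not part of the argument), the snippet below evaluates $P(u,v)$ at random pairs and verifies the inequalities (B.19) and (B.21); the matrix $A$, the dimension, and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = np.diag([0.5, 1.0, 2.0])            # illustrative psd matrix
lam_max = np.linalg.eigvalsh(A).max()   # sup of the eigenvalues of A

def P(k, u, v):
    """P(u, v) = K1(u, u) + K1(v, v) - 2 K1(u, v), cf. (B.17)."""
    return k(u, u) + k(v, v) - 2 * k(u, v)

linear = lambda u, v: u @ A @ v
gaussian = lambda u, v: np.exp(-0.5 * (u - v) @ A @ (u - v))

ok = True
for _ in range(10_000):
    u, v = rng.standard_normal(d), rng.standard_normal(d)
    bound = lam_max * np.sum((u - v) ** 2)
    ok &= P(linear, u, v) <= bound + 1e-12    # bound (B.19)
    ok &= P(gaussian, u, v) <= bound + 1e-12  # bound (B.21)
print("bounds (B.19) and (B.21) hold on all samples:", bool(ok))
```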

Appendix C. An example of a non-autonomous system

In this appendix we provide a toy example of a non-autonomous system, namely a lightly damped harmonic oscillator driven by a sinusoidal input force

$\ddot{y} + 0.001\,\dot{y} + 10000\,y = \cos(t)$ (C.1)

The kernel is an explicit Fourier random feature kernel with $p=200$ random features as well as a constant term, where time is included as an input together with the spatial variables. Each feature was centered and standardized, with the mean and standard deviation computed from the training set only. The functions in the corresponding RKHS then take the form

$f(x_1,x_2,t) = \begin{pmatrix} \sum_{i=1}^p \alpha_i \cos\bigl(\langle (z_{1,i},z_{2,i},z_{3,i}),(x_1,x_2,t)\rangle\bigr) + \beta_i \sin\bigl(\langle (z_{1,i},z_{2,i},z_{3,i}),(x_1,x_2,t)\rangle\bigr) + \omega_1 \\ \sum_{i=1}^p \gamma_i \cos\bigl(\langle (z_{1,i},z_{2,i},z_{3,i}),(x_1,x_2,t)\rangle\bigr) + \delta_i \sin\bigl(\langle (z_{1,i},z_{2,i},z_{3,i}),(x_1,x_2,t)\rangle\bigr) + \omega_2 \end{pmatrix}$ (C.2)

where the $z$ variables are sampled iid from a standard normal distribution and the parameters $\alpha_i, \beta_i, \gamma_i, \delta_i$, $i=1,\dots,p$, along with $\omega_j$, $j=1,2$, are learned from the training set. Figure C.5 illustrates the output of the ODE-RKHS algorithm for this system.

Fig. C.5.

(a): plot of the 2D system where the z-axis is time. Black arrows: true vector field. Grey arrows: estimated vector field. Black curves: true trajectories. Red curves: estimated trajectories. (b): Grey points: initial conditions. Black curves: true trajectories. Red curves: estimated trajectories.
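To make the construction in (C.2) concrete, here is a minimal sketch of the Fourier random feature parameterization with time appended to the state; the array shapes and function names are illustrative, the weights are untrained placeholders, and the centering/standardization step described above is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 200                               # number of random Fourier features, as in the text

# Random frequencies z = (z1, z2, z3), one triple per feature, sampled iid
# from a standard normal; the input is u = (x1, x2, t).
Z = rng.standard_normal(size=(p, 3))

def features(x1, x2, t):
    """Random Fourier features [cos(z . u), sin(z . u), 1] for u = (x1, x2, t)."""
    u = np.array([x1, x2, t])
    zu = Z @ u                        # shape (p,)
    return np.concatenate([np.cos(zu), np.sin(zu), [1.0]])  # length 2p + 1

def f_hat(x1, x2, t, theta1, theta2):
    """Vector field of the form (C.2): one linear combination of the features per coordinate.

    theta1 packs (alpha_1..alpha_p, beta_1..beta_p, omega_1); theta2 likewise
    packs (gamma, delta, omega_2). These weights are placeholders for the
    parameters learned from the training set.
    """
    phi = features(x1, x2, t)
    return np.array([theta1 @ phi, theta2 @ phi])

# Example call with random (untrained) weights, just to show the shapes.
theta1 = rng.standard_normal(2 * p + 1)
theta2 = rng.standard_normal(2 * p + 1)
print(f_hat(0.1, -0.2, 0.0, theta1, theta2))  # a 2-vector: (dx1/dt, dx2/dt)
```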

Footnotes


1

The data used for this experiment has been obtained from the Wisconsin Registry for Alzheimer’s Prevention. See https://wrap.wisc.edu/. A request for accessing this data can be initiated from this website.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
