Author manuscript; available in PMC: 2025 Jun 15.
Published in final edited form as: J Comput Phys. 2024 Mar 29;507:112971. doi: 10.1016/j.jcp.2024.112971

Learning nonparametric ordinary differential equations from noisy data

Kamel Lahouel b, Michael Wells a, Victor Rielly a, Ethan Lew c, David Lovitza a, Bruno M Jedynak a,*
PMCID: PMC11090484  NIHMSID: NIHMS1986948  PMID: 38745873

Abstract

Learning nonparametric systems of Ordinary Differential Equations (ODEs) x˙=f(t,x) from noisy data is an emerging machine learning topic. We use the well-developed theory of Reproducing Kernel Hilbert Spaces (RKHS) to define candidates for f for which the solution of the ODE exists and is unique. Learning f consists of solving a constrained optimization problem in an RKHS. We propose a penalty method that iteratively uses the Representer theorem and Euler approximations to provide a numerical solution. We prove a generalization bound for the L2 distance between x and its estimator. Experiments are provided for the FitzHugh–Nagumo oscillator, the Lorenz system, and for predicting the Amyloid level in the cortex of aging subjects. In all cases, we show competitive results compared with the state-of-the-art.

1. Introduction

1.1. Description of the problem and related works

Fitting a system of nonparametric ordinary differential equations (ODEs) x˙=f(t,x) to longitudinal data could lead to scientific breakthroughs in disciplines where ODEs or dynamical systems have been used for a long time, including physics, chemistry, and biology, see [1]. By nonparametric, we mean that there is no need to specify the functional form of the vector-field f using a pre-defined finite dimensional parameter. Instead, this force field belongs to a functional space and the number of parameters that characterize this vector field depends on the amount of data available. This provides a great advantage in situations where the form of the vector field is unknown but data is available for learning. The functional spaces considered are Reproducing Kernel Hilbert Spaces (RKHS) [2], allowing for efficient optimization among other desirable properties.

A particular difficulty arises when the data is sparse and noisy. This is often the case for longitudinal healthcare data obtained during hospital visits. These visits provide measurements that are sparse in time, with a high level of individual variability. The work presented in this paper has been motivated in part by the need to model the accumulation of the Amyloid protein in the brain of aging subjects. Understanding how amyloid contributes to the manifestation of Alzheimer’s is a crucial task. The algorithm discussed here will (we hope) shed more light on the development of this devastating disease.

Fitting data to nonparametric ODEs is an inverse problem. It requires making assumptions on the initial state of the solution and on the vector field. Furthermore, one needs to make assumptions about the noise model and provide a tractable optimization algorithm.

We now provide a short bibliographic survey. Further references can be found in the cited papers. First, note that if the time derivative ($\dot{x}$) were observed, then fitting ODEs to noisy data would reduce to solving a regression problem. This remark has led to the methods known as "gradient matching" and to the earliest success in fitting ODEs to data, see e.g. [3, 4]. It consists in estimating the gradient from the data, then performing nonparametric regression to fit the vector field f and, eventually, iterating, see [5]. These methods become inefficient when the data is sparse and/or noisy.

Another approach consists in modeling f with polynomials [6]. Alternatively, one could model f using the units of a Deep Neural Network, see [7, 8]. These methods integrate the solution along the vector field from guessed initial conditions and compare the resulting trajectories with the observations. Optimization is used iteratively to refine the estimation of f and the initial conditions. Stochastic gradient descent and backpropagation are used in the latter case. Another modeling approach is to assume that f belongs to an RKHS. This idea, also known under the name of kernel method, can be traced back to [9]. It was successfully applied to fluid mechanics in [10]. This is the conceptual approach pursued here. We believe that this approach is well-motivated since there is a tight connection between the regularity (smoothness) properties of a kernel and the regularity properties of f. Specifically, one can choose an RKHS of vector-valued functions for which one is guaranteed the existence and uniqueness of the corresponding initial value problem. This is a necessary step in proving that more data would result in more accurate predictions. Another advantage of kernel methods is that there is no need to choose a dictionary of functions as in [4]. Instead, one selects a kernel, which, our experiments suggest, is easier. In [11], the authors assume that each coordinate of the trajectory belongs to a real-valued RKHS where the functions' input is time. In their approach, they first retrieve the full trajectory solving a kernel ridge regression problem. Next, they solve for the vector field given the full trajectory, assuming that each coordinate of the vector field can be written as a sum of a linear combination of functions, which are defined on each coordinate of the trajectory. Our framework allows for linear combinations of pairwise products of such functions, as well. The functions characterizing such a vector field are assumed to be in a real-valued RKHS taking a single coordinate as input. In our approach, we make an assumption on the vector field. This soft constraint translates to a soft constraint on the set of trajectories, without imposing additional constraints on the trajectory itself. As a result, we solve one optimization problem as opposed to the two-step approach in [11]. Moreover, we allow for higher-order interaction terms compared to the pairwise single coordinates interaction assumed in the mentioned work. In [12], the authors use a Gaussian process (GP) for the vector field. This is the Bayesian counterpart of the frequentist RKHS modeling, see [13] for a review of the similarities and differences between RKHSs and GPs. Comparisons between a collection of algorithms representative of the state of the art and the proposed algorithm are provided in the experiments section.

Figure 1 provides a visual and easy-to-understand illustration of the results generated by the algorithms presented in this paper. The details of this experiment are provided in section 4.3.1. We see that the proposed algorithm is able to recover a noisy trajectory and extrapolate the data, contrary to a method that would use a regression model and ignore the ODE.

Fig. 1.

(a) Predicted vector field of the Lorenz system. The black arrows are the prediction and the grey arrows the true vector field. Red points are observations. The red curve is a predicted trajectory while the grey curve is the true trajectory. (b) is the x-dimension, (c) the y-dimension, and (d) the z-dimension. The red points are the observations. This plot also shows a prediction beyond the last observation in the data.

1.2. Main contributions

The main contributions of this paper are as follows:

  1. We present an RKHS model for fitting nonparametric ODEs to observational data. Conditions for existence and uniqueness of the solutions of the corresponding initial value problem are expressed in terms of the regularity of the kernel;

  2. We propose a novel algorithm for estimating nonparametric ODEs and the initial condition(s) from noisy data. This algorithm solves a constrained optimization problem using a penalty method;

  3. We derive and prove a consistency result for the prediction of the state (interpolation) at unobserved times. This is, to the best of our knowledge, the first such result for the problem of fitting nonparametric ODEs to data.

  4. We provide experiments with simulated data. We compare the proposed algorithm to six existing methods representing the state of the art, for various noise levels. We show that the ODE-RKHS algorithm is competitive.

  5. We provide an experiment modeling the accumulation of Amyloid in the cortex of aging subjects. The data is sparse with, on average, three data points per trajectory (subject) and 179 trajectories. We show competitive performance compared to the state of the art.

The rest of this paper is organized as follows: Section 2 presents some background material as well as the model and the algorithms. The consistency results are presented in Section 3 and proved in Appendix A. The experiments appear in Section 4 while Section 5 provides concluding remarks. Appendix B provides examples of kernels.

2. Model and algorithm

2.1. Background on Reproducing Kernel Hilbert Spaces (RKHSs)

Basic notions and notations associated with RKHS are important for understanding the algorithms and derivations presented in this paper. We thus provide a short presentation. We limit ourselves to RKHS over the field of real numbers instead of complex numbers as this is sufficient throughout this paper. We begin with the univariate real-valued case and we continue with the vector-valued case which allows us to describe vector fields, central to this paper.

2.1.1. Real-valued RKHS

Real-valued RKHSs are Hilbert spaces of real-valued functions $\mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is a nonempty space. The critical assumption which makes them "reproducing" is that the evaluation functional is continuous. The evaluation functional at $x \in \mathcal{X}$ is the mapping from the RKHS $H$ to $\mathbb{R}$ which associates to a function its evaluation at $x$, that is $f \mapsto f(x)$. Thanks to the Riesz representation theorem, evaluating a function in an RKHS is a geometric operation consisting in computing an inner product. Effectively, for any $x \in \mathcal{X}$, there is a unique vector $k_x \in H$ such that

$f(x) = \langle f, k_x \rangle_H$ (1)

where $\langle \cdot, \cdot \rangle_H$ is the scalar product associated with $H$. In what follows, we will simply notate $\langle \cdot, \cdot \rangle$ for this inner product. Moreover, let us define, for any $x, y \in \mathcal{X}$, the so-called kernel

$k(x,y) = \langle k_x, k_y \rangle$ (2)

and let us use this to characterize the function $k_x$. Evaluating $k_x$ at $y$ and using the Riesz representation provides

$k_x(y) = \langle k_x, k_y \rangle = \langle k_y, k_x \rangle = k(y,x)$ (3)

Thus the function $k_x(\cdot)$ is the function $k(\cdot, x)$ and for any $f \in H$,

$f(x) = \langle f, k(\cdot, x) \rangle$ (4)

This is the reproducing property of the kernel. Replacing the function $f$ by $k_y$, and using (3), we obtain that

$k_y(x) = \langle k_y, k(\cdot, x) \rangle = \langle k(\cdot, y), k(\cdot, x) \rangle = k(y,x)$ (5)
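As a concrete illustration (ours, not part of the paper), the reproducing property (4) can be exercised numerically for a function built as a finite combination of kernel sections. The Gaussian kernel, the centers, and the coefficients below are arbitrary choices, and the snippet is only a minimal sketch.

```python
import numpy as np

# A small numerical sketch: a function f built as a finite combination of
# Gaussian kernel sections, evaluated through the reproducing property (4),
# together with its RKHS norm. All choices below are arbitrary.
def k(x, y, ell=1.0):
    return np.exp(-0.5 * (x - y) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
centers = rng.normal(size=5)            # points x_1, ..., x_5
alpha = rng.normal(size=5)              # f = sum_i alpha_i k(., x_i)

x0 = 0.3
# f(x0) = <f, k(., x0)> = sum_i alpha_i k(x_i, x0)   (reproducing property)
f_x0 = np.sum(alpha * k(centers, x0))

# ||f||_H^2 = sum_{i,j} alpha_i alpha_j k(x_i, x_j) = alpha^T K alpha
K = k(centers[:, None], centers[None, :])
norm_sq = alpha @ K @ alpha
print(f_x0, norm_sq)
```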

2.1.2. Vector-valued RKHSs

Vector-valued RKHSs generalize the real-valued case. The construction is similar. Consider a Hilbert space of functions from $\mathcal{X}$ to $\mathbb{R}^d$. Assume, moreover, as in the real-valued case, that the evaluation functional is continuous. The Riesz representation theorem then states that for any $x \in \mathcal{X}$ and $v \in \mathbb{R}^d$, there exists a unique element in $H$, notated $K_{x,v}$, such that $v^T f(x) = \langle f, K_{x,v} \rangle$. The kernel of $H$ is then the $(d,d)$ matrix $K(x,y)$ whose element at the $i$th row and $j$th column is defined by

$K_{ij}(x,y) = \langle K_{x,e_i}, K_{y,e_j} \rangle$ (6)

where $e_1, \ldots, e_d$ is the natural basis of $\mathbb{R}^d$. Let us use (6) to characterize the function $K_{x,v}$. We start with $K_{y,e_j}$ and use the reproducing property as well as the symmetry of the inner product.

$e_i^T K_{y,e_j}(x) = \langle K_{y,e_j}, K_{x,e_i} \rangle = \langle K_{x,e_i}, K_{y,e_j} \rangle = K_{ij}(x,y) = e_i^T K(x,y) e_j$ (7)

Thus $K_{y,e_j}(\cdot) = K(\cdot, y) e_j$, and

$v^T f(x) = \langle f, K_{x,v} \rangle = \langle f, K(\cdot, x) v \rangle$ (8)

which is the reproducing property for vector-valued RKHSs. Applying (8) to the function $x \mapsto K(x,y) w$, for $w \in \mathbb{R}^d$, provides

$v^T K(x,y) w = \langle K(\cdot, y) w, K(\cdot, x) v \rangle = \langle K(\cdot, x) v, K(\cdot, y) w \rangle$ (9)

Lastly, a useful property of the kernel $K$ is that $K(x,y)^T = K(y,x)$. Indeed,

$K_{ji}(x,y) = e_j^T K(x,y) e_i = \langle K(\cdot, x) e_j, K(\cdot, y) e_i \rangle = \langle K(\cdot, y) e_i, K(\cdot, x) e_j \rangle = e_i^T K(y,x) e_j = K_{ij}(y,x)$ (10)

Choosing $\mathcal{X} = \mathbb{R}^d$ allows for defining autonomous vector fields, that is, functions $\mathbb{R}^d \to \mathbb{R}^d$, and choosing a suitable kernel restricts the RKHS to Lipschitz continuous vector fields, as will be discussed in Section 2.3.

2.2. Notations

The observations are characterized by multiple time series. There are $n$ time series. The $i$th one is of length $m_i$. It is characterized by $m_i$ couples $(t_{ij}, y_{ij}(t_{ij}))$, $j = 1, \ldots, m_i$, where $t_{ij} \in [0,T]$ for some maximum predefined time $T$, and the observations $y_{ij}(t_{ij})$ belong to $\mathbb{R}^d$.

We aim to make predictions at new time points along a time series having one or several noisy snapshots. To this end, we explore the following nonparametric ODE model:

$\dot{x} = f(t, x), \qquad y_{ij}(t_{ij}) = x(t_{ij}) + \epsilon_{ij}$ (11)

where $i = 1, \ldots, n$ and $j = 1, \ldots, m_i$. The noise $\epsilon_{ij}$ is bounded or sub-Gaussian. This model is nonparametric because $f$ is not specified parametrically. We assume that $f$ belongs to an RKHS of smooth functions for which the solution of the ODE exists and is unique, see Section 2.3. Background material on RKHSs can be found in [14] and vector-valued RKHSs are reviewed in [15]. The rest of the paper is written for the autonomous case when $f(t,x) = f(x)$ and for the simpler situation where $m_i$ is the same for all time series and where the time points $t_{ij}$ are the same for all the time series, i.e. do not depend on $i$. However, we will point to the modifications for the non-autonomous setting when necessary, as well as for the situation of irregular sampling.

2.3. Existence and uniqueness

It is a classical result, see [16], that the initial value problem (IVP):

$\dot{x}(t) = f(x(t)) \quad \text{and} \quad x(0) = x_0,$ (12)

where $f: \mathbb{R}^d \to \mathbb{R}^d$ is Lipschitz continuous, has a unique solution defined on the domain $[0, +\infty)$.

Let $H$ be an RKHS of vector-valued functions $\mathbb{R}^d \to \mathbb{R}^d$ and let $K$ be the reproducing kernel of $H$. $K$ is a $(d,d)$ matrix-valued kernel. It is then natural to ask: what is a sufficient condition on $K$ which ensures that all $f \in H$ are Lipschitz continuous? The following lemma provides an answer.

Lemma 1. If $f: \mathbb{R}^d \to \mathbb{R}^d$ belongs to an RKHS with kernel $K$ such that:

$d_{K_{ii}}^2(u,v) := K_{ii}(u,u) - 2K_{ii}(u,v) + K_{ii}(v,v) \le N_K^2 |u-v|^2, \quad u, v \in \mathbb{R}^d, \; i = 1, \ldots, d,$ (13)

for some constant $N_K$, then the IVP (12) has a unique solution defined on $[0, +\infty)$.

Proof. Notice that for every $i = 1, \ldots, d$

$|f_i(u) - f_i(v)|^2 = |\langle K(u, \cdot) e_i - K(v, \cdot) e_i, f \rangle_H|^2$ (14)
$\le \|K(u, \cdot) e_i - K(v, \cdot) e_i\|_H^2 \, \|f\|_H^2$ (15)
$= d_{K_{ii}}^2(u,v) \, \|f\|_H^2$ (16)

where $e_1, \ldots, e_d$ is the natural basis of $\mathbb{R}^d$. Here we have used the reproducing property of the matrix-valued kernel and the Cauchy-Schwarz inequality. Combined with (13), this shows that each $f_i$ is Lipschitz continuous with constant $N_K \|f\|_H$, so $f$ is Lipschitz continuous and the classical existence and uniqueness result applies. □

Thus, one can choose a kernel that guarantees the existence and uniqueness of the solution of the IVP, which will lead to provable asymptotic performance. We believe that this simple result is a good motivator for the proposed modeling approach.

Let us discuss some examples of kernels satisfying Lemma 1. The simplest matrix-valued kernels are separable kernels. They are obtained by choosing a scalar kernel $K_1$ and a positive semi-definite matrix $A$. Then,

$K(x,y) = K_1(x,y) A$ (17)

The diagonal elements of $K$ are then positive multiples of $K_1$. Thus, if $K_1$ verifies the regularity condition of Lemma 1, then so do all the separable kernels based on $K_1$. The scalar kernels satisfying the hypothesis of Lemma 1 include the linear kernel, the Gaussian kernel, the rational quadratic kernel, the sinc kernel, and the Matérn kernels with $p > 3/2$. Kernels for which the functions in their corresponding RKHSs are not guaranteed to provide unique solutions to the corresponding IVP, due to lack of regularity, include the polynomial kernels of order at least two, the Laplacian kernel, and the Matérn kernels with $p \le 3/2$. Details are provided in Appendix B. The condition of Lemma 1 has a nice interpretation in the case where explicit kernels are used. Indeed, when a feature map associated with the kernel is given explicitly, the condition of Lemma 1 is equivalent to assuming Lipschitz continuous features. The details are provided in Appendix B.
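As a quick numerical sanity check (ours, not from the paper), condition (13) can be verified for a separable Gaussian kernel $K(x,y) = K_1(x,y) A$. The bandwidth, the matrix $A$, and the candidate constant $N_K = \sqrt{\max_i A_{ii}}/\ell$ below are our choices; the constant follows from the elementary inequality $1 - e^{-s} \le s$.

```python
import numpy as np

# Numerical sanity check of condition (13) for the separable Gaussian kernel
# K(x, y) = K1(x, y) * A.  ell and A are arbitrary; N_K = sqrt(max_i A_ii)/ell.
rng = np.random.default_rng(1)
d, ell = 3, 1.5
A = np.diag([0.5, 1.0, 2.0])                     # positive semi-definite matrix

def K1(u, v):
    return np.exp(-np.sum((u - v) ** 2) / (2 * ell ** 2))

N_K = np.sqrt(A.diagonal().max()) / ell
for _ in range(1000):
    u, v = rng.normal(size=d), rng.normal(size=d)
    for i in range(d):
        d2 = A[i, i] * (K1(u, u) - 2 * K1(u, v) + K1(v, v))   # d_{K_ii}^2(u, v)
        assert d2 <= N_K ** 2 * np.sum((u - v) ** 2) + 1e-12
```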

Note on the non-autonomous case: When the vector field is time-dependent, denoted by $f(t,x)$, the kernel is defined on $\mathbb{R}^d \times [0, \infty)$. It is sufficient to assume a global Lipschitz condition with respect to the space variable [16], namely: there exists a constant $L_K$ such that for every $t \ge 0$, $u, v \in \mathbb{R}^d$ and $i \in \{1, \ldots, d\}$:

$|f_i(t,u) - f_i(t,v)| \le L_K |u - v|$ (18)

It is therefore sufficient to assume a kernel $K$ defined on $\mathbb{R}^d \times [0, \infty)$ satisfying the condition of Lemma 1 with respect to the space variable, as it ensures the following inequality:

$d_{K_{ii}}^2((t,u), (t,v)) \le N_K^2 |u - v|^2$ (19)

2.4. From constrained to unconstrained optimization

We first construct the optimization algorithm in the case n=1. All the observations are from a single trajectory with the same initial condition. Thus, we temporarily drop the double indexing with subjects and times to simplify the notation.

Assume the observation times are $t_1 < \cdots < t_m$. Consider the following constrained minimization problem:

$\min_{x, f} \; \frac{1}{m} \sum_{j=1}^m |y_j - x(t_j)|^2 + \lambda \|f - f_0\|_H^2,$ (20)

under the constraints

$f \in H$, the RKHS with matrix-valued kernel $K$, $\quad x(t) = x(t_1) + \int_{t_1}^{t} f(x(s)) \, ds$, for $t_1 \le t \le t_m$. (21)

The function $f_0 \in H$ is an initial guess for $f$. Section 2.6 describes a gradient matching algorithm for selecting $f_0$. $K$ is a kernel that satisfies Lemma 1.

Consider a regular one-dimensional grid over the interval $[t_1, t_m]$. Specifically, we choose

$s_l = t_1 + l h$ (22)

with $l = 0, \ldots, k$, and we assume that $h$ is small enough so that there are integers $k_1 = 0 < k_2 < \cdots < k_m$ such that the observation times are

$t_j = t_1 + k_j h, \quad j = 1, \ldots, m.$ (23)

In practice, the observation times are rounded to fit on this grid. Note that with this notation, $t_j = s_{k_j}$. We now proceed through a series of transformations to rewrite this constrained optimization problem into an unconstrained one.

First, we replace the constraints on x by a finite number of constraints as follows:

$f \in H$, the RKHS with kernel $K$, $\quad x(s_{l+1}) = x(s_l) + \int_{s_l}^{s_{l+1}} f(x(s)) \, ds$, $\quad l = 0, \ldots, k-1.$ (24)

Second, we discretize the constraints using the Euler method of integration:

$f \in H$, the RKHS with kernel $K$, $\quad x(s_{l+1}) = x(s_l) + h f(x(s_l))$, $\quad l = 0, \ldots, k-1.$ (25)

Third, we replace the constrained optimization problem by an unconstrained one using a single Lagrange constant $\gamma > 0$. Notate $z_l = x(s_l)$, $l = 0, \ldots, k$,

$\min_{z \in \mathbb{R}^{d(k+1)}, f \in H} J(z, f, \gamma),$ (26)

with

$J(z, f, \gamma) = \frac{1}{m} \sum_{j=1}^m |y_j - z_{k_j}|^2 + \gamma \frac{1}{k} \sum_{l=0}^{k-1} |z_{l+1} - z_l - h f(z_l)|^2 + \lambda \|f - f_0\|_H^2.$ (27)
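The objective (27) translates directly into code. The sketch below is ours: `f` is any callable vector field, `z` the discretized trajectory, `obs_idx` the indices $k_j$, and `rkhs_norm_sq` a placeholder standing in for $\|f - f_0\|_H^2$ (which, in the actual method, is computed from the kernel expansion of $f$).

```python
import numpy as np

# Sketch of the unconstrained objective (27): data term, Euler residual term,
# and RKHS penalty. Variable names are ours, not the paper's.
def J(z, f, gamma, y, obs_idx, h, lam, rkhs_norm_sq):
    m, k = len(y), len(z) - 1
    data_term = np.mean([np.sum((y[j] - z[obs_idx[j]]) ** 2) for j in range(m)])
    euler_res = np.mean([np.sum((z[l + 1] - z[l] - h * f(z[l])) ** 2)
                         for l in range(k)])
    return data_term + gamma * euler_res + lam * rkhs_norm_sq
```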

It is instructive to remark the similarities between the loss function in equation (27) and the loss proposed in Physics-informed Neural Networks [17], where the observations are generated from an unknown partial differential equation. Indeed, the total loss function in Physics-informed Neural Networks can be decomposed as a sum of two functions: one that measures the deviation of the solution from the observations, and a second one, usually called the residual term, which measures the violation of the partial differential equation constraint that the solution must satisfy. In our context,

$\frac{1}{m} \sum_{j=1}^m |y_j - z_{k_j}|^2$

corresponds to the first function, and

$\frac{1}{k} \sum_{l=0}^{k-1} |z_{l+1} - z_l - h f(z_l)|^2$

corresponds to the residual term. However, there are some notable differences. In Physics-informed Neural Networks, the form of the PDE is known up to finite-dimensional parameters. The loss is viewed as a function of the solution to the partial differential equation and these finite-dimensional parameters. The solution itself is modeled by a neural network. In our case, the loss is viewed as a function of the vector field and the initial solution. The differential equation is therefore characterized by the RKHS, which is usually infinite-dimensional. Moreover, equation (27) contains a regularization term penalizing vector fields with a large RKHS norm, which is typical of loss functions parametrized by RKHS functions.

2.5. Penalty method

The penalty method is an iterative method that consists of enforcing the constraints by increasing a penalty parameter, in this case $\gamma$. The schematic of the method is presented in Algorithm 1. At each step, the functional $J(z, f, \gamma)$ in (27) is minimized with respect to $(z, f)$, for a fixed value of $\gamma$. Then, $\gamma$ is increased. The optimization for $(z, f)$ is done in an alternating fashion, first optimizing over $z$ for a fixed $f$, then optimizing over $f$ for the newly updated $z$.

Let us now describe these optimization steps in more detail. For a fixed $\gamma$ and $f$, $J(z, f, \gamma)$ in (27) is non-convex in $z$ due to the presence of $f(z_l)$. Therefore we replace $f$ by its first-order Taylor expansion evaluated at the value $z_l^{(s)}$ obtained in the previous iteration $s$:

$f(z_l) \approx f(z_l^{(s)}) + (z_l - z_l^{(s)})^T \nabla_{z_l} f(z_l^{(s)})$ (28)

Note that with this approximation, J is convex, quadratic, and sparse in z. This allows the use of an efficient linear solver for this minimization. The number of unknowns is d(k+1).

Note on the non-autonomous case: When the vector field is time-dependent, the vector field is evaluated at points of the form $f(t_l, z_l)$. Notice that the $t_l$'s are the time points of the grid, therefore fixed and known. Hence, the linearization in equation (29) is made only with respect to the space variable:

$f(t_l, z_l) \approx f(t_l, z_l^{(s)}) + (z_l - z_l^{(s)})^T \nabla_{z_l} f(t_l, z_l^{(s)})$ (29)

For a fixed $\gamma$ and $z$, minimizing $J$ in $f$ is equivalent to a multivariate kernel ridge regression problem. After the change of variable $g = f - f_0$, and setting

$u_l = (z_{l+1} - z_l)/h - f_0(z_l), \quad l = 0, \ldots, k-1,$ (30)

we use the representer theorem to show that the minimizer in $f \in H$ of $J$ is of the form

$f(z) = f_0(z) + \sum_{l=0}^{k} K(z, z_l) w_l,$ (31)

where $w_l \in \mathbb{R}^d$. Let $W = (w_1^T, \ldots, w_{k+1}^T)^T$ be of dimension $(d(k+1), 1)$ and similarly let $U = (u_1^T, \ldots, u_{k+1}^T)^T$, and let $\mathbf{K}$ be the matrix with $(d,d)$ block elements $\mathbf{K}_{kl} = K(z_k, z_l)$. We find that $W$ is a minimizer of the convex quadratic function

$\frac{\gamma h^2}{k} |U - \mathbf{K} W|^2 + \lambda W^T \mathbf{K} W$ (32)

and thus $W$ is the solution to the linear system:

$\left( \mathbf{K} + \frac{\lambda k}{\gamma h^2} I \right) W = U$ (33)
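As a concrete sketch of this $f$-update (our transcription, with simplified bookkeeping), the linear system (33) can be assembled and solved as follows for a separable kernel $K(x,y) = K_1(x,y) A$. The alignment of `U` with the grid points and the choice of a separable kernel are our assumptions.

```python
import numpy as np
from scipy.linalg import solve

# Sketch of the f-update (33) for a separable kernel K(x, y) = K1(x, y) * A:
# build the block Gram matrix over the grid points z_l and solve the
# regularized linear system for the coefficients W of expansion (31).
def f_update(z, U, h, gamma, lam, K1, A):
    npts, d = z.shape                        # npts = k + 1 grid points in R^d
    G1 = np.array([[K1(zi, zj) for zj in z] for zi in z])
    G = np.kron(G1, A)                       # block Gram matrix (npts*d, npts*d)
    k = npts - 1
    W = solve(G + (lam * k / (gamma * h ** 2)) * np.eye(npts * d), U.ravel())
    return W.reshape(npts, d)                # one coefficient w_l per grid point
```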

The schematic algorithm is provided in Algorithm 1.

Algorithm 1.

Penalty method for ODE-RKHS

1: Init: $h, \rho, \lambda, f^{(0)}, \gamma^{(0)}, s = 0$
2: while termination condition is not met do
3:   $z^{(s+1)} \leftarrow \arg\min_{z \in \mathbb{R}^{d(k+1)}} J(z, f^{(s)}, \gamma^{(s)})$
4:   $f^{(s+1)} \leftarrow \arg\min_{f \in H} J(z^{(s+1)}, f, \gamma^{(s)})$
5:   $\gamma^{(s+1)} \leftarrow \gamma^{(s)} (1 + \rho)$
6:   $s \leftarrow s + 1$
7:   Check termination condition
8: end while
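A schematic Python transcription of Algorithm 1 is given below (a sketch, not the authors' implementation). Here `z_update` and `f_update` stand for the two inner minimizations of Section 2.5, and $f$ is represented by its coefficient vector so that the early-stopping ratio of Section 2.6 can be computed.

```python
import numpy as np

# Schematic transcription of Algorithm 1; z_update and f_update abstract the
# two inner minimizations, and f is represented by a coefficient array.
def penalty_method(y, obs_idx, h, rho, lam, f0, gamma0, z_update, f_update,
                   max_iter=500, tol=1e-3):
    f, gamma, z = f0, gamma0, None
    for _ in range(max_iter):
        z = z_update(y, obs_idx, f, gamma, h, z)    # line 3: minimize J over z
        f_new = f_update(z, f, gamma, lam)          # line 4: minimize J over f
        gamma *= 1.0 + rho                          # line 5: increase the penalty
        if np.linalg.norm(f_new - f) <= tol * np.linalg.norm(f):
            return z, f_new                         # early stopping (Section 2.6)
        f = f_new
    return z, f
```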

2.6. Initial condition and termination criteria

Since the algorithm will converge to a local minimum of the cost function, the choice of the initial condition is important. We use a gradient matching method.

  1. Approximate the time derivatives of $x$ at the observed times, $\dot{x}(t_j)$, denoted $\hat{\dot{x}}(t_j)$

  2. Estimate $f_0 \in H$ using ridge regression, i.e. minimize over $H$

$G(f_0) = \frac{1}{m} \sum_{j=1}^m |\hat{\dot{x}}(t_j) - f_0(y_j)|^2 + \lambda \|f_0\|_H^2$ (34)

There are several possibilities for the approximation in the first step depending on the sparsity of the data and the amount of noise. In the experiments below, we use central differences.

The termination condition of Algorithm 1 includes a fixed number of iterations $S$ and a threshold on the quantity $\|f^{(s+1)} - f^{(s)}\| / \|f^{(s)}\|$ which allows for early stopping.
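A minimal sketch of this gradient-matching initialization is given below (ours). It uses central differences followed by the kernel ridge regression (34); the scalar Gaussian kernel applied coordinate-wise (i.e. a separable kernel with $A = I$) and the hyperparameter values are our simplifying assumptions.

```python
import numpy as np

# Sketch of the initialization of Section 2.6: central differences for the
# time derivatives, then kernel ridge regression (34) for f0, with a scalar
# Gaussian kernel applied coordinate-wise.
def init_f0(t, y, lam=1e-2, ell=1.0):
    # t: (m,) increasing times, y: (m, d) observations
    xdot = np.gradient(y, t, axis=0)                       # central differences
    K = np.exp(-((y[:, None, :] - y[None, :, :]) ** 2).sum(-1) / (2 * ell ** 2))
    alpha = np.linalg.solve(K + lam * len(t) * np.eye(len(t)), xdot)

    def f0(x):
        kx = np.exp(-((y - x) ** 2).sum(-1) / (2 * ell ** 2))
        return kx @ alpha                                  # estimate of f0(x) in R^d
    return f0
```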

2.7. Multiple trajectories

We present here the extension of the method to multiple trajectories, say n>1 subjects. We assume the same number of observations for each subject and regular sampling to simplify the presentation.

First, we replace (20) and (21) with

$\min_{x, f} \; \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m |y_{ij} - x_i(t_{ij})|^2 + \lambda \|f - f_0\|_H^2,$ (35)

under the constraints

$f \in H$, the RKHS with matrix-valued kernel $K$, $\quad x_i(t) = x_i(t_1) + \int_{t_1}^{t} f(x_i(s)) \, ds$, for $t_1 \le t \le t_m$, $i = 1, \ldots, n$ (36)

We then proceed along the same steps as for the single trajectory case, leading to the unconstrained optimization problem, generalizing (26) and (27).

Algorithm 2.

Multi Trajectories Penalty method for ODE-RKHS

1: Init: $h, \rho, \lambda, f^{(0)}, \gamma^{(0)}, s = 0$
2: while termination condition is not met do
3:   for $i = 1, \ldots, n$ do
4:     $z_i^{(s+1)} \leftarrow \arg\min_{z_i \in \mathbb{R}^{d(k+1)}} J_{\text{multi}}(z, f^{(s)}, \gamma^{(s)})$
5:   end for
6:   $f^{(s+1)} \leftarrow \arg\min_{f \in H} J_{\text{multi}}(z^{(s+1)}, f, \gamma^{(s)})$
7:   $\gamma^{(s+1)} \leftarrow \gamma^{(s)} (1 + \rho)$
8:   $s \leftarrow s + 1$
9:   Check termination condition
10: end while

Notate $z_{il} = x_i(s_l)$, $l = 0, \ldots, k$, $i = 1, \ldots, n$, and $z = (z_1, \ldots, z_n)$,

$\min_{z \in \mathbb{R}^{nd(k+1)}, f \in H} J_{\text{multi}}(z, f, \gamma),$ (37)

with

$J_{\text{multi}}(z, f, \gamma) = \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m |y_{ij} - z_{i k_j}|^2 + \gamma \frac{1}{nk} \sum_{i=1}^n \sum_{l=0}^{k-1} |z_{i,l+1} - z_{il} - h f(z_{il})|^2 + \lambda \|f - f_0\|_H^2.$ (38)

The key point is that $J_{\text{multi}}$ decouples the trajectories, such that the optimization over $z$ can be carried out separately for each trajectory. However, all the observations contribute to the estimation of $f$. The algorithm is presented in Alg 2. In Line 6, we use the no-trick formulation using Gaussian quadrature Fourier features as described in [18].
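To illustrate this explicit-feature update (a sketch, not the authors' code), the snippet below uses plain Monte-Carlo random Fourier features as a stand-in for the Gaussian-quadrature features of [18]; with explicit features the kernel ridge step of Line 6 reduces to an ordinary ridge regression per output coordinate. The exact regularization bookkeeping would follow (38); the constant below mirrors (33) and is our simplification.

```python
import numpy as np

# Sketch of the f-update in Line 6 of Algorithm 2 with an explicit random
# Fourier feature map (Monte-Carlo features as a stand-in for [18]).
def fourier_ridge_update(Z, U, h, gamma, lam, omega, b):
    # Z: (N, d) stacked grid points z_il over all trajectories
    # U: (N, d) Euler residual targets (z_{i,l+1} - z_il)/h - f0(z_il)
    # omega: (d, nF) frequencies, b: (nF,) phases of the feature map
    nF = omega.shape[1]
    Phi = np.sqrt(2.0 / nF) * np.cos(Z @ omega + b)          # (N, nF) features
    reg = lam * Z.shape[0] / (gamma * h ** 2)                # cf. (33); simplified
    W = np.linalg.solve(Phi.T @ Phi + reg * np.eye(nF), Phi.T @ U)
    return lambda x: np.sqrt(2.0 / nF) * np.cos(x @ omega + b) @ W
```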

2.8. Computational Complexity

We analyze the complexity of the algorithm Alg 2. The key parameters are:

  1. d: the dimension of the observed vectors;

  2. n: the number of observed trajectories;

  3. k: the number of samples in the discretization of the time interval;

  4. S: the number of steps in Alg 2;

  5. nF: the number of Fourier features.

We use $O(p^3)$ for the time complexity of solving a (dense) linear system with $p$ variables and $O(w^2 p)$ in the case of a band matrix of width $w$, see [19]. Alg 2, line 4 consists in solving a linear system of size $dk$ with a band matrix of bandwidth $w = 3d$, thus $O(k d^3)$ computations. Line 6 consists in solving $d$ full linear systems of dimension $n_F$, thus $O(d n_F^3)$ computations. In total, we find $O(S n k d^3 + S d n_F^3)$. Note that $k$ is typically chosen proportional to the average number of data points per trajectory. Thus, overall, the algorithm is linear in the number of observations but cubic in the dimension of the observations.

2.9. Non autonomous systems, covariates, and irregular sampling

Non autonomous systems and covariates are handled by modifying the kernel. The issue of irregular sampling is addressed by replacing the first term of (27) by

$\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^{m_i} (t_{i,j+1} - t_{ij}) |y_{ij} - z_{i k_j}|^2$ (39)

with $t_{i, m_i + 1} = T$, $i = 1, \ldots, n$.

3. Consistency of the solution: A finite sample result

In this section, we assume that the algorithm solves the following optimization problem (where $t_{m+1} = T$ by definition):

$\min_{z \in \mathbb{R}^{d(k+1)}, f \in H} \sum_{j=1}^m (t_{j+1} - t_j) |y_j - z_{k_j}|^2,$ (40)

Under the constraints:

  1. $\|f - f_0\|_H \le R$, $|z_0| \le r$

  2. $z_{l+1} = z_l + h f(z_l)$, $0 \le l \le k-1$

Notice that constraint 2 corresponds to the Euler method for the ODE $\dot{x} = f(x)$. Therefore, by linearly interpolating between the times of subdivision $s_l$, $0 \le l \le k$, we can generate a solution $\hat{x}(\cdot)$ defined on $[0,T]$. We denote by $x^*(\cdot)$ the true trajectory generating the noisy observations $y_j$ at each time $t_j$. The purpose of this section is to present a result controlling (in probability) the squared $L^2$ norm of $\hat{x} - x^*$:

$\|\hat{x} - x^*\|_{L^2}^2 := \int_0^T |\hat{x}(t) - x^*(t)|^2 \, dt$ (41)

Let us make the following assumptions:

  • A1: There exists $f^* \in H$ with $\|f^* - f_0\|_H \le R$ and $|x_0^*| \le r$ such that $x^*(0) = x_0^*$ and $\dot{x}^*(t) = f^*(x^*(t))$ for every $0 \le t \le T$.

  • A2: The noise variables $\epsilon_j$ are independent and bounded in absolute value by a constant $M_\epsilon$. (We can assume that the variables are sub-Gaussian instead of bounded if we want to generalize this result.)

  • A3: The kernel $K$ is $\mathcal{C}^2(\mathbb{R}^d)$ in its first argument (this implies that it is also $\mathcal{C}^2(\mathbb{R}^d)$ in its second argument).

  • A4: The kernel K satisfies (13).

We refer to section 2.3 for examples of kernels satisfying A3 and A4.

These assumptions are sufficient for obtaining the main theorem of this section, controlling $\|\hat{x} - x^*\|_{L^2}^2$ with high probability.

Theorem 1. Assuming A1, A2, A3 and A4, there exist positive constants $K_1, K_2, K_3$ and $K_4$, depending only on $R, r, T, M_\epsilon, N_K$ and the kernel $K$, such that for every $\epsilon > 0$, with probability less than $\exp\left( -\frac{K_2 \epsilon^2}{d \sum_{j=1}^m (t_{j+1} - t_j)^2} \right)$:

$\|\hat{x} - x^*\|_{L^2}^2 \ge K_1 d \sqrt{\sum_{j=1}^m (t_{j+1} - t_j)^2} + K_4 d \sum_{j=1}^m (t_{j+1} - t_j)^2 + h^2 K_3 d + \epsilon.$ (42)

For a better understanding of Theorem 1, assume a regular sampling of the interval $[0,T]$ with $m$ points, so that for every $j$, $t_{j+1} - t_j = \frac{1}{m}$. In that case, under the same hypothesis, for any $\epsilon > 0$, with probability less than $\exp\left( -\frac{K_2 m \epsilon^2}{d} \right)$:

$\|\hat{x} - x^*\|_{L^2}^2 \ge \frac{K_1 d}{\sqrt{m}} + \frac{K_4 d}{m} + h^2 K_3 d + \epsilon.$ (43)

A proof of Theorem 1 is provided in the appendix. We provide here a description of the main ideas. The third term on the right hand side of inequality (42) corresponds to the global truncation error between the numerical solution of the ODE and the true solution. The second term corresponds to the error between $\|\hat{x} - x^*\|_{L^2}^2$ and $\frac{1}{m} \sum_{j=1}^m |x^*(t_j) - \hat{x}(t_j)|^2$. The first term is the leading term, assuming that $h$ is always less than $\frac{1}{m}$. Assume that $\hat{x}$ solves the continuous-constraints optimization problem (without an Euler approximation), i.e.:

$\min_{x, f} \; \frac{1}{m} \sum_{j=1}^m |y_j - x(t_j)|^2,$ (44)

under the constraints $\|f - f_0\|_H \le R$, $|x_0| \le r$ and $x(t) = x_0 + \int_0^t f(x(u)) \, du$, $0 \le t \le T$; we can then consider the "generalization" error:

$\frac{1}{m} \sum_{j=1}^m |x^*(t_j) - \hat{x}(t_j)|^2.$ (45)

An upper bound of this error is given by the first term. The main tool used to obtain the upper bound is Dudley's chaining inequality, see [20]. We notice that for every $i = 1, \ldots, d$, the set of coordinate functions $x_i$, where $x$ and $f$ satisfy the constraints of the continuous problem, is included in a set of functions that are uniformly Lipschitz continuous and bounded (the Lipschitz constant and the bound do not depend on $x_0$ and $f$). Upper bounds on covering numbers of such function classes are well known, see [20], hence the use of Dudley's inequality.

One can easily transform the inequality on the probability of Theorem 1 into an inequality on $E\|\hat{x} - x^*\|_{L^2}^2$. Indeed, let us assume for simplicity a regular sampling of $m$ points on the interval $[0,T]$. We denote by:

$\hat{E}_{L^2} := \|\hat{x} - x^*\|_{L^2}^2 - \frac{K_1 d}{\sqrt{m}} - \frac{K_4 d}{m} - h^2 K_3 d.$ (46)

Using theorem 1, we have the following inequality:

$E(|\hat{E}_{L^2}|) = \int_0^\infty \mathbb{P}(|\hat{E}_{L^2}| \ge \epsilon) \, d\epsilon$ (47)
$\le \int_0^\infty \exp\left( -\frac{K_2 m \epsilon^2}{d} \right) d\epsilon$ (48)
$= \sqrt{\frac{\pi d}{4 K_2 m}}$ (49)

This implies the following result.

Corollary 1. Assume we have a regular sampling of m points on the interval [0,T]. Then:

$E(\|\hat{x} - x^*\|_{L^2}^2) \le \sqrt{\frac{\pi d}{4 K_2 m}} + \frac{K_1 d}{\sqrt{m}} + \frac{K_4 d}{m} + h^2 K_3 d.$ (50)

To illustrate the inequality in (50), we conducted a simple toy experiment where the conditions of the theorem are satisfied, and evaluated the convergence rate. In this experiment we considered a one-dimensional autonomous system. We randomly initialized the weights of a function determined by 200 Fourier random features, recorded the norm of the function, and generated a trajectory of 5120 samples using this function. Then we took ten independent and identically distributed random samples of noise with a standard deviation of .05. This provided us with 10 noisy trajectories of 5120 samples (of the same trajectory but with different samples of noise). Finally, we sub-sampled each of these ten noisy trajectories to get 2560 samples, 1280 samples, ... all the way down to 5 samples. This gave us, for each of the ten noisy trajectories, training sets with 5, 10, 20, 40, ..., 5120 samples. We trained the algorithm on each of these datasets and reported the average L2 (squared) error between the estimated trajectory and the true one over the ten trajectories at each level of sparsity. In figure 2, we provide a plot of the log of the average L2 (squared) errors as a function of the log of the number of samples used during training. Equation (50) predicts a slope of $-.5$ or steeper. We fit the data to a line of slope $-.8$, consistent with (50). We provide a plot with a line of slope $-.5$ for comparison.
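The slope check itself is a one-line log-log regression. The sketch below is our reconstruction (not the authors' script), with a synthetic error sequence standing in for the measured average squared $L^2$ errors.

```python
import numpy as np

# Sketch of the rate check: fit the log-log slope of the average squared L2
# error against the training-set size and compare with the m^{-1/2} rate
# suggested by (50).
def loglog_slope(sizes, avg_sq_errors):
    slope, _ = np.polyfit(np.log(sizes), np.log(avg_sq_errors), deg=1)
    return slope

# Example with a synthetic error sequence decaying like m^{-0.8}:
sizes = np.array([5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120])
print(loglog_slope(sizes, sizes ** -0.8))   # approximately -0.8
```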

Fig. 3.

Illustration of the ODE-RKHS Algorithm: The dots show the observations. The estimated trajectories are shown with lines and curves with corresponding colors. Steps i=1,25,50, and 75 are shown from left to right and from top to bottom

4. Experiments

We report experiments for simulated data as well as for real data. In each case, we compare the performances of the proposed algorithm, generically named ODE-RKHS, with six other algorithms. This section is organized as follows: In subsection one, we present the various benchmark methods used for comparison. In subsection two, we present the tuning of the hyperparameters for the ODE-RKHS method. In subsection three, we describe fifteen simulated datasets and an example medical dataset. Finally, in subsection four, we report and comment on the performance of the ODE-RKHS method compared with the benchmark methods on all the datasets.

4.1. Benchmark methods

These algorithms constitute, to the best of our knowledge, the current state of the art for learning nonparametric ODEs from noisy data. We briefly review these algorithms and provide references below.

  1. Nonparametric Ordinary Differential Equations: Nonparametric Ordinary Differential Equations (npODE) is presented in [12]. The authors use a Bayesian model with Gaussian processes (GP). It is the Bayesian counterpart of the frequentist model presented in this paper. Unlike GP regression, where the optimization can be computed in closed form, an approximate optimization method is required. The authors use inducing points, see [21], and sensitivity equations, see [22]. The npODE code was downloaded from http://www.github.com/cagatayyildiz/npode in February 2021. Given the normalized trajectory sets, we ran the algorithm with a scale factor of 1 and an $\ell_0$ of 1. For the 2D systems, we used a width of the inducing point grid $W = 6$, matching the demonstration examples. For the 6D Lorenz96, we encountered out-of-memory errors for $W > 2$, possibly indicating an empirical scaling issue with the method. We thus used $W = 2$ for this system.

  2. Sparse Identification of Nonlinear Dynamics (Fourier and Polynomial Candidate Functions): Sparse Identification of Nonlinear Dynamics (SINDy) is a highly cited technique for identifying nonlinear dynamics from data, see [4]. SINDy predicts governing dynamics equations using gradient matching via sparse regression. In the experiments shown, we test SINDy with two different libraries of possible functions: polynomials up to order three and Fourier features. We choose the SR3 sparsity regularization for its superior performance, detailed in [23], which has a threshold value as a hyperparameter. Other hyperparameters in our tests include the polynomial library's degree and the size and lengthscale of the Fourier features library. A grid search tuner was employed to determine the best hyperparameter values, with the same holdout and evaluation sets as in the competing algorithms. pySINDy v1.6.3 was used for the implementation [24]. We use the AutoKoopman library to tune the hyperparameters, described in [25].

  3. Extended Dynamic Mode Decomposition: The Koopman operator is an infinite dimensional linear operator that captures the dynamics of a non-linear dynamical system. Dynamic Mode Decomposition (DMD), described in [10], can approximate the Koopman operator's eigenvalues and eigenvectors based on observations of the system state. Extended DMD (EDMD) generalizes DMD to nonlinear system learning by approximating the Koopman operator in a high-dimensional space of observables, see [26]. These observables must be selected before using EDMD, and can be chosen ad hoc or by using library learning methods [27]. We use random Fourier features as the observable functions for these experiments, as specified in [28]. We use the AutoKoopman library to tune the hyperparameters via Bayesian optimization, available at https://github.com/EthanJamesLew/AutoKoopman.

  4. Kernel Analog Forecasting: Analog forecasting is a time series prediction method that follows the evolution of the historical time series that most closely matches the current state. Kernel analog forecasting (KAF) replaces single-analog forecasting with weighted ensembles of analogs constructed using local similarity kernels that employ several dynamics-dependent features designed to improve forecast skill [29] [30]. Our KAF implementation is based on https://github.com/rward314/StreamingKAF. Hyperparameters are the kernel function and the rank used for the number of eigenvalues found from the data-defined kernel matrix. We selected a Gaussian kernel and grid-tuned the rank and kernel lengthscale. We use the same eigenvalue multiplier of $10^{-4}$ as the referenced code.

  5. Sparse Cyclic Recovery: We implement the method formulated in [31], which is well suited for the experiments as it is designed for learning structured dynamical systems from under-sampled and possibly noisy state-space measurements. For index-invariant systems, the method generates cyclic permutations to augment the training data. Then, it builds a library of Legendre polynomials of candidate functions and does basis pursuit with thresholding to recover the dynamics. The hyperparameters involved are the parameters for the Douglas-Rachford algorithm used to solve the Legendre basis pursuit (L-BP) problem and the Legendre polynomial degree; we tune these parameters via grid search. We referenced the parameters used in their GitHub project https://github.com/linanzhang/SparseCyclicRecovery. We utilize the same candidate functions as the paper, but tune the noise threshold $\sigma$ and the $\mu, \tau$ parameters of the optimizer. Because of computational limitations, we set the maximum number of optimization iterations to $10^4$.

4.2. Validation, initialization, and selection of hyper-parameters in the ODE-RKHS algorithm

We use the Multi Trajectories Penalty method for ODE-RKHS described in Alg. 2, and a Gaussian kernel. For each coordinate, we chose a bandwidth equal to 20% of the range of the data. We set $\gamma = 1$ and fit $\lambda, \rho$ using a validation set consisting of 20 percent of the training data. We set a maximum of $S = 500$ iterations and used the early stopping criterion of stopping when the ratio $\|f^{(s+1)} - f^{(s)}\| / \|f^{(s)}\|$ was less than $10^{-3}$. Initialization of $f_0$ was done via gradient matching, see section 2.6.

4.3. Datasets

We ran experiments with the same training, validation and test sets for all the algorithms. Testing consisted of computing predicted trajectories starting at the initial condition of each test trajectory.

4.3.1. Oscillator data

The FitzHugh-Nagumo (FHN) oscillator data is a controlled experiment with known and easy-to-visualize 2D trajectories. It has helped calibrate the algorithm described in this paper. It was also demonstrated in [12] for the npODE algorithm. We ran experiments using a simulated dataset generated as follows:

$\dot{v} = v - v^3/3 - w + 1$
$\dot{w} = 0.08 (v + 0.7 - 0.8 w)$ (51)

Intermediate and final results of the ODE-RKHS algorithm are presented in Fig. 3 for the FHN data. Notice that during the first steps, shown on the top line, the estimated trajectories with solid color lines are rough but fit the data closely. During the later steps, shown on the bottom line, the trajectories are smoother but still fit the data.

We generated a set of 50 noiseless trajectories. There were 201 observations per trajectory, one for each .1 increment in time. To generate the training sets, we added samples of Gaussian noise to these fifty trajectories. There were five levels of noise, with respective standard deviations $\sigma \in \{0.120, 0.365, 0.610, 0.855, 1.100\}$. We generated a single test set of 100 trajectories without noise, again with 201 observations per trajectory separated by .1 time increments.
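A sketch of this data generation (our reconstruction, not the authors' script) is given below. The initial conditions and the integration tolerance are our choices; the total time span of 20 is inferred from 201 observations at .1 increments.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch of the FHN data generation: noiseless trajectories of (51) sampled
# every 0.1 time units, plus Gaussian noise at the five listed levels.
def fhn(t, s):
    v, w = s
    return [v - v ** 3 / 3 - w + 1, 0.08 * (v + 0.7 - 0.8 * w)]

rng = np.random.default_rng(0)
t_eval = np.linspace(0.0, 20.0, 201)
clean = []
for _ in range(50):
    x0 = rng.uniform(-3, 3, size=2)          # initial condition (our choice)
    sol = solve_ivp(fhn, (0.0, 20.0), x0, t_eval=t_eval, rtol=1e-8)
    clean.append(sol.y.T)                     # (201, 2) trajectory

noise_levels = [0.120, 0.365, 0.610, 0.855, 1.100]
training_sets = [[tr + rng.normal(scale=s, size=tr.shape) for tr in clean]
                 for s in noise_levels]
```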

4.3.2. Lorenz63 data

Our next experiment was on the Lorenz system defined by the equations

$\dot{x} = 10(y - x)$
$\dot{y} = x(28 - z) - y$
$\dot{z} = xy - \frac{8}{3} z$ (52)

We generated 50 noiseless trajectories with 201 observations per trajectory, each separated by a 0.01 increment in time. Next, we generated samples of Gaussian noise with levels $\sigma \in \{0.5, 1.2, 1.9, 2.6, 3.3\}$. We added the noise samples to the noiseless trajectories to generate five training sets. Then we generated a single test set consisting of 100 trajectories, each with 201 observations at 0.01 time increments.

4.3.3. Lorenz96

The Lorenz96 data arises from [34]. The chaotic system is defined for $n = 6$ dimensions by:

$\dot{x}_k = (x_{k+1} - x_{k-2}) x_{k-1} - x_k + F, \quad k = 1, \ldots, 6$ (53)

We have selected $F = 8$. Indices wrap around cyclically, so that $x_{-1} = x_5$, $x_0 = x_6$, and $x_7 = x_1$. To construct the training set, we generated a set of 100 noiseless trajectories, each with 100 observations. The observations were separated by a time increment of 0.01. We added five levels of Gaussian noise to the noiseless data to generate five different training sets. The standard deviations of the noise generated here are $\sigma \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$. Our test set consisted of 150 noiseless trajectories, each with 100 points on them. The time increment between observations was the same as in the training set.
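A sketch of a single Lorenz96 trajectory with the first noise level is given below (our reconstruction; the initial condition is our choice, and the cyclic wrap-around is implemented with modular indexing).

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch of the 6-dimensional Lorenz96 system (53) with F = 8 and cyclic
# wrap-around, sampled at 0.01 time increments.
def lorenz96(t, x, F=8.0):
    # dx_k/dt = (x_{k+1} - x_{k-2}) x_{k-1} - x_k + F, indices taken cyclically
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

rng = np.random.default_rng(0)
t_eval = np.linspace(0.0, 0.99, 100)                 # 100 observations
x0 = 8.0 * np.ones(6) + rng.normal(scale=0.5, size=6)
sol = solve_ivp(lorenz96, (0.0, 0.99), x0, t_eval=t_eval, rtol=1e-8)
noisy = sol.y.T + rng.normal(scale=0.1, size=sol.y.T.shape)   # noise level 1
```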

4.3.4. The accumulation of Amyloid in the cortex of aging subjects

The accumulation of Amyloid in the brain is believed to be one of the earliest pathological mechanisms of Alzheimer’s disease, beginning more than a decade before the onset of clinical symptoms, see [35].

Based on observations from several longitudinal Amyloid positron emission tomography (PET) studies, it is believed that the rate of Amyloid accumulation is closely associated with the level of Amyloid at the same age, see [36]. We develop a principled mathematical model capturing this phenomenon and use it to predict the accumulation of Amyloid across individuals longitudinally.

We used Pittsburgh compound B (PiB) PET scans from the Wisconsin Registry for Alzheimer's Prevention (WRAP) to assess global Amyloid burden, measured by the Distribution Volume Ratio (DVR)¹. The number of subjects in this study is $n = 179$, with 3.06 visits on average, over an average span of 6.84 years. We fit the model in (11) to the posterior cingulum, precuneus and gyrus rectus DVRs, averaging the left and right DVR in each case. These regions are known to show Amyloid accumulation early in the disease process. Figure 4 provides a visualization of the trajectories estimated using ODE-RKHS super-imposed (same color) with the data. This shows that the estimated trajectories are qualitatively accurate.

Fig. 4.

Amyloid prediction experiment. Horizontal axis is in years. Vertical axis corresponds to DVR. The left-most image corresponds to the gyrus rectus, the middle to the cingulum and the right to the precuneus.

4.4. Evaluation

Testing consisted of computing predicted trajectories starting at the initial condition of the test trajectories and computing the following error measurement for each predicted trajectory:

$\mathrm{Err} := \sum_{i=2}^{n} (t_i - t_{i-1}) |y_i - \hat{y}_i|^2$ (54)

where $t_i$ is the $i$th observation time, $y_i$ is the $i$th observation of the test trajectory, $\hat{y}_i$ is the $i$th point of the predicted trajectory and $n$ is the number of observations in the trajectory. For each dataset, we report the average error measurement over the test set trajectories.
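A direct transcription of (54) is given below (a sketch; we assume the test and predicted trajectories are numpy arrays sampled at the same observation times).

```python
import numpy as np

# Error measure (54) for one test trajectory.
def trajectory_error(t, y_true, y_pred):
    dt = np.diff(t)                                        # t_i - t_{i-1}, i = 2..n
    sq = np.sum((y_true[1:] - y_pred[1:]) ** 2, axis=1)    # |y_i - y_hat_i|^2
    return np.sum(dt * sq)
```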

In table 1, we report the performance of the ODE-RKHS method and other benchmark methods on the Amyloid dataset. The average L2 norm errors (Err) between predicted Amyloid level trajectories and true level trajectories are reported. The ODE-RKHS algorithm yields the lowest average L2 error among the seven compared methods.

Table 1.

Results for Amyloid data. The minimum error is marked with an asterisk.

Method              Err
npODE               .59
KAF                 .84
Koopman             .52
L-BP                .40
ODE-RKHS            .36*
SINDy Fourier       .42
SINDy Polynomial    .39

In table 2, we report the performance of ODE-RKHS and the benchmark methods on the 3 simulated datasets (FHN, Lorenz63, and Lorenz96) with the 5 simulated levels of noise, level 1 corresponding to the noise with the smallest standard deviation. The ODE-RKHS algorithm performed best in 10 out of the 15 simulated test sets. The second best performing method was SINDy Polynomial with the lowest error in just 2 out of the 15 simulated datasets. Moreover, the 2 cases where SINDy polynomial performed best correspond to the lowest noise levels of the Lorenz63 dataset, indicating that our method is more robust to higher noise levels.

Table 2.

Performance Table for the 3 Simulated Datasets

Noise Level 1
Method              FHN      Lorenz63   Lorenz96
npODE               1.53     17.49      1.61
KAF                 6.019    22.75      2.18
Koopman             2.27     5.96       .25*
L-BP                5.55     16.35      1.02
ODE-RKHS            .53*     9.06       .30
SINDy Fourier       5.59     23.37      .52
SINDy Polynomial    1.28     2.18*      1.13

Noise Level 2
Method              FHN      Lorenz63   Lorenz96
npODE               1.57     18.75      1.29
KAF                 8.35     22.34      2.17
Koopman             3.15     13.67      1.09
L-BP                6.87     18.33      1.10
ODE-RKHS            1.16*    11.24      .42*
SINDy Fourier       5.54     21.63      .84
SINDy Polynomial    2.60     10.73*     1.76

Noise Level 3
Method              FHN      Lorenz63   Lorenz96
npODE               3.07     20.06      1.31
KAF                 8.25     21.96      2.16
Koopman             3.57     16.12      1.11
L-BP                5.48     19.63      1.17
ODE-RKHS            1.83*    13.38*     .52*
SINDy Fourier       6.50     22.88      1.00
SINDy Polynomial    2.84     15.88      1.23

Noise Level 4
Method              FHN      Lorenz63   Lorenz96
npODE               4.33     19.61      1.95
KAF                 8.53     21.82      2.16
Koopman             7.18     17.92      1.02
L-BP                6.58     21.03      1.09
ODE-RKHS            2.20*    14.38*     .83*
SINDy Fourier       9.47     22.07      1.24
SINDy Polynomial    5.57     20.67      3.03

Noise Level 5
Method              FHN      Lorenz63   Lorenz96
npODE               4.37     19.45      2.10
KAF                 7.62     21.49      2.15
Koopman             7.12     18.97*     1.23
L-BP                7.51     21.68      1.34
ODE-RKHS            1.97*    21.20      1.23
SINDy Fourier       12.29    22.19      1.18*
SINDy Polynomial    9.00     23.23      1.67

Err values are reported; the minimum value in each column is marked with an asterisk. ODE-RKHS performs best in 10 out of the 15 datasets.

5. Discussion

We proposed an algorithm for learning non-parametric ODEs assuming that the function f generating the vector field in Rd belongs to a vector-valued RKHS with a kernel satisfying certain regularity conditions. The data input of the algorithm consists of noisy observations at different times of multiple trajectories. The algorithm is linear in the number of observations but cubic in their dimension. We proved the consistency of the estimated trajectory, showing that the L2 squared distance between the estimated trajectory and the true one vanishes as more observations are collected. We assessed the algorithm with simulated and real data and obtained results that consistently compare favorably with the state of the art on a wide range of noise levels.

Fig. 2.

On the left we plot the log of the average L2 squared error between the true trajectory and the estimated one as a function of the log of the number of samples. A linear regression yields a slope of $-.8$, indicating convergence at a rate between $\frac{1}{m}$ and $\frac{1}{\sqrt{m}}$. On the right we plot the predicted trajectories when we have 5 observations, together with the true trajectory (in the dotted line).

Acknowledgements

The work at Portland State University was partly funded by the National Institutes of Health grants RO1AG021155, R01EY032284, and R01AG027161, National Science Foundation grant #2136228, and the Google Research Award "Kernel PDE". The funding sources had no involvement in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication. The material of Galois, Inc. is based upon work supported by the Air Force Research Laboratory (AFRL) and DARPA under Contract No. FA8750-20-C-0534. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s). They do not necessarily reflect the views of the Air Force Research Laboratory (AFRL) and DARPA.

Appendix A. Consistency of the estimator of the trajectory

Appendix A.1. Assuming we solve the problem without Euler approximation

This section gives the proof of the theorem presented in section 3 of the main text. We present the proof for d=1 since the generalization to multiple dimensions is straightforward. We also present the proof for the case of autonomous systems. Keeping the notations of the main text, we make the following assumptions:

  • A1: There exists $f^* \in H$ with $\|f^* - f_0\|_H \le R$ and $|x_0^*| \le r$ such that $x^*(0) = x_0^*$ and $\dot{x}^*(t) = f^*(x^*(t))$ for every $0 \le t \le T$.

  • A2: The noise variables $\epsilon_j$ are independent and bounded by a constant $M_\epsilon$, with a variance denoted by $\sigma^2$. (We can assume that the variables are sub-Gaussian instead of bounded if we want to generalize this result.)

  • A3: The kernel $K$ is $\mathcal{C}^2(\mathbb{R})$ in its first argument (this implies that it is also $\mathcal{C}^2(\mathbb{R})$ in its second argument).

  • A4: The kernel K satisfies the hypothesis of lemma 1.

Without loss of generality, we will assume that f0=0 in our proof.

Let $H$ be the RKHS with reproducing kernel $K$. Let $f \in H$ such that $\|f\|_H \le R$. We know, using assumption A4 and Lemma 1, that $f$ is uniformly Lipschitz, with a Lipschitz constant that does not depend on $f$, which we denote by $L_1$. Specifically,

$|f(x) - f(y)| \le L_1 |x - y|$ (A.1)

with $L_1 = N_K R$. Using (A.1), we will prove the following lemma:

Lemma 2. Assuming A4, consider the set of solutions to the problem

$\dot{x} = f(x), \quad x(t_0) = x_0$ (A.2)

where $f$ belongs to the RKHS with kernel $K$, $|x_0| \le r$ and $t \in [0,T]$. Then any solution $x$ in this set of solutions is bounded by a uniform constant $B_1$ that only depends on $T, R, L_1$ and $L_3^2 := \sup_{|x| < C} |K(x,x)|$.

Specifically,

$|x(t) - x(t_0)| \le B_1 = T L_3 R \, e^{L_1 T}$ (A.3)

Proof. We start by taking $f$ in our class of functions and $x_0$ such that $|x_0| \le r$. We therefore can write:

$x(t) - x_0 = \int_0^t \left( f(x(s)) - f(x_0) \right) ds + t f(x_0)$ (A.4)
$\le \int_0^t |f(x(s)) - f(x_0)| \, ds + t \|f\|_H \sqrt{K(x_0, x_0)}$ (A.5)
$\le L_1 \int_0^t |x(s) - x_0| \, ds + T L_3 R$ (A.6)

Now denote by $G(t) := |x(t) - x_0|$. If we prove that $G(t)$ is bounded by a constant depending only on $T, R, L_1$ and $L_3$, we will be done. So far we have:

$G(t) \le L_1 \int_0^t G(s) \, ds + T L_3 R$ (A.7)

Denote by $V(t) := \int_0^t G(s) \, ds$. We have that:

$V'(t) \le L_1 V(t) + T L_3 R$ (A.8)

which implies:

$e^{-L_1 t} V'(t) - L_1 e^{-L_1 t} V(t) \le T L_3 R \, e^{-L_1 t}$ (A.9)

Integrating the inequality between 0 and $t$ and using the fact that $V(0) = G(0) = 0$, we obtain:

$e^{-L_1 t} V(t) \le \frac{T L_3 R}{L_1} \left( 1 - e^{-L_1 t} \right)$ (A.10)

or, equivalently,

$V(t) \le \frac{T L_3 R}{L_1} \left( e^{L_1 t} - 1 \right)$ (A.11)

Finally, since $V'(t) = G(t) \le L_1 V(t) + T L_3 R$, we have:

$G(t) \le T L_3 R \, e^{L_1 t} \le T L_3 R \, e^{L_1 T}$ (A.12)

Let us now introduce the following notations:

  • We denote by $x(x_0, f, t)$ the solution to the ODE with derivative $f$ and initial condition $x_0$.

  • $y_i$ is the observed noisy point from the trajectory at time $t_i$.

  • $x^*(t)$ is the true trajectory evaluated at time $t$.

We now proceed with the following reasoning. We assume that our trajectory minimizes

$\hat{L}(f, x_0) := \sum_{i=1}^m (t_{i+1} - t_i) \left( (x(x_0, f, t_i) - y_i)^2 - \sigma^2 \right)$ (A.13)

over $f, x_0$ such that $\|f\|_H \le R$ and $|x_0| \le r$. We denote the minimizer by $(\hat{f}, \hat{x}_0)$.

When $x_0$ and $f$ are fixed and not data dependent (deterministic), the expected value of $\hat{L}(f, x_0)$ is:

$L(f, x_0) := \sum_{i=1}^m (t_{i+1} - t_i) \left( x(x_0, f, t_i) - x^*(t_i) \right)^2$ (A.14)

Notice that A1 implies:

$\min_{\|f\|_H \le R, |x_0| \le r} L(f, x_0) = L(f^*, x_0^*) = \sum_{i=1}^m (t_{i+1} - t_i) \left( x^*(t_i) - x^*(t_i) \right)^2 = 0$ (A.15)

Our goal is to evaluate $L(\hat{f}, \hat{x}_0)$ and obtain a generalization bound. We have:

$L(\hat{f}, \hat{x}_0) = L(\hat{f}, \hat{x}_0) - \hat{L}(\hat{f}, \hat{x}_0) + \hat{L}(\hat{f}, \hat{x}_0) - \hat{L}(f^*, x_0^*) + \hat{L}(f^*, x_0^*) - L(f^*, x_0^*)$ (A.16)

And therefore, since the middle term in (A.16), $\hat{L}(\hat{f}, \hat{x}_0) - \hat{L}(f^*, x_0^*)$, is non-positive (because $(\hat{f}, \hat{x}_0)$ minimizes $\hat{L}$),

$L(\hat{f}, \hat{x}_0) \le 2 \sup_{\|f\|_H \le R, |x_0| \le r} |L(f, x_0) - \hat{L}(f, x_0)|$ (A.17)

We thus consider the following quantity:

$\mathrm{Err} := \sup_{\|f\|_H \le R, |x_0| \le r} |\hat{L}(f, x_0) - L(f, x_0)|$ (A.18)

Expanding this quantity we get:

$\sup_{\|f\|_H \le R, |x_0| \le r} \left| \sum_{i=1}^m (t_{i+1} - t_i) \left( y_i^2 - x^*(t_i)^2 - \sigma^2 - 2 x(x_0, f, t_i)(y_i - x^*(t_i)) \right) \right|$ (A.19)

Notice that if we replace, for a single given $i$, $y_i = x^*(t_i) + \epsilon_i$ by $\tilde{y}_i = x^*(t_i) + \tilde{\epsilon}_i$, the quantity in equation (A.19) changes by a quantity bounded by $K_2 (t_{i+1} - t_i)$, where $K_2$ can be bounded by $4(B_1 + r + M_\epsilon) M_\epsilon + 4(B_1 + r) M_\epsilon$. Therefore, using McDiarmid's inequality [37]:

$\mathbb{P}\left( \mathrm{Err} \ge E(\mathrm{Err}) + \epsilon \right) \le \exp\left( -\frac{2 \epsilon^2}{K_2^2 \sum_{i=1}^m (t_{i+1} - t_i)^2} \right)$ (A.20)

We therefore need to provide an upper bound of E(Err). For that, we are going to view:

$|\hat{L}(f, x_0) - L(f, x_0)| = \left| \sum_{i=1}^m (t_{i+1} - t_i) \left( y_i^2 - x^*(t_i)^2 - \sigma^2 - 2 x(x_0, f, t_i)(y_i - x^*(t_i)) \right) \right|$ (A.21)

as a stochastic process indexed by $x$, where $x \in \mathcal{X}$, the set of all solutions $x(f, x_0, \cdot)$ for all $\|f\|_H \le R$ and $|x_0| \le r$. In other words, we view the process $\hat{L}(f, x_0) - L(f, x_0)$, indexed by $f$ and $x_0$, as:

$|\hat{L}(x) - L(x)|$ (A.22)

where $x \in \mathcal{X}$ is some $x(f, x_0, \cdot)$. Notice that Err is also:

$\sup_{x \in \mathcal{X}} |\hat{L}(x) - L(x)|$ (A.23)

Notice that $\mathcal{X}$ is a subset of the continuous functions defined on $[0,T]$. Therefore we can equip $\mathcal{X}$ with the metric structure $(\mathcal{X}, \|\cdot\|_\infty)$. We will apply Dudley's inequality (see e.g. [20], Theorem 8.1.3) to bound:

$E(\mathrm{Err}) = E\left( \sup_{\|f\|_H \le R, |x_0| \le r} |\hat{L}(f, x_0) - L(f, x_0)| \right)$ (A.24)

To apply Dudley’s inequality, we are going to use the following lemma.

Lemma 3. The solutions $x \in \mathcal{X}$ are Lipschitz with a Lipschitz constant that is uniform over $\mathcal{X}$, i.e., there exists a constant $L_6$ such that for every $x \in \mathcal{X}$, $t \in [0,T]$ and $s \in [0,T]$:

$|x(t) - x(s)| \le L_6 |t - s|$ (A.25)

$L_6$ depends on $R, B_1, r$ and the kernel $K$.

Proof. Let $x_0$ such that $|x_0| \le r$ and $f$ such that $\|f\|_H \le R$. We have:

$|\dot{x}(x_0, f, t)| = |f(x(t))|$ (A.26)
$\le R \sup_{|x| \le B_1 + r} \sqrt{K(x,x)}$ (A.27)

As a consequence, if we denote by $\mathcal{N}(\mathcal{X}, \epsilon)$ the covering number of $\mathcal{X}$ with radius $\epsilon$, we have the existence of a constant $L_7$ ($L_7$ only depends on $B_1$, $r$ and $L_6$) such that:

$\mathcal{N}(\mathcal{X}, \epsilon) \le \exp\left( \frac{L_7}{\epsilon} \right),$ (A.28)

where we used a known upper bound that can be found for example in [20] (exercise 8.2.7) on the covering number of uniformly bounded Lipschitz continuous functions defined on a finite interval.

Using this result combined with Dudley’s inequality, we obtain the existence of a constant L8 (depending only on L7) such that:

Proposition 1.

$E(\mathrm{Err}) \le L_8 \sqrt{\sum_{i=1}^m (t_{i+1} - t_i)^2}$ (A.29)

Proof. Apply Dudley's inequality to Err, using inequality (A.28), the fact that the diameter of $\mathcal{X}$ is finite, bounded by $2(B_1 + r)$, and that for every $M < \infty$

$\int_0^M \sqrt{\log\left( \mathcal{N}(\mathcal{X}, \epsilon) \right)} \, d\epsilon \le \int_0^M \sqrt{\log\left( \exp\left( \frac{L_7}{\epsilon} \right) \right)} \, d\epsilon < \infty$ (A.30)

As a consequence, using (A.20) and Proposition 1, we obtain the following inequality:

$\mathbb{P}\left( \mathrm{Err} \ge L_8 \sqrt{\sum_{i=1}^m (t_{i+1} - t_i)^2} + \epsilon \right) \le \exp\left( -\frac{2 \epsilon^2}{K_2^2 \sum_{i=1}^m (t_{i+1} - t_i)^2} \right)$ (A.31)

Using inequalities (A.17) and (A.31) we finally obtain the following theorem:

Theorem 2. With assumptions A1, A2, A3 and A4, there exist constants $L_9$ and $K_2$ depending only on $R, r, T, M_\epsilon$ and the kernel $K$ such that for every $\epsilon$:

$\mathbb{P}\left( L(\hat{f}, \hat{x}_0) \ge L_9 \sqrt{\sum_{i=1}^m (t_{i+1} - t_i)^2} + \epsilon \right) \le \exp\left( -\frac{2 \epsilon^2}{K_2^2 \sum_{i=1}^m (t_{i+1} - t_i)^2} \right)$ (A.32)

Appendix A.2. Including the Euler approximation

In reality, the solution (trajectory) that we propose for every $f$ and $x_0$ is not $x(x_0, f, \cdot)$, the solution of the ODE, but $\tilde{x}(x_0, f, h, \cdot)$, the solution obtained with Euler's method of time step $h$. The idea is to use the fact that, under some sufficient conditions, we know how to bound the error between Euler's method and the true solution. For example, we know that if $f$ is Lipschitz with Lipschitz constant $L_1$ and the solution $x(x_0, f, \cdot)$ is $\mathcal{C}^2$ with a constant $L_{11}$ such that:

$|\ddot{x}(x_0, f, t)| \le L_{11}, \quad 0 \le t \le T$ (A.33)

then we have the following global truncation error bound [38]:

$\max_{1 \le i \le m} |x(x_0, f, t_i) - \tilde{x}(x_0, f, h, t_i)| \le \frac{h L_{11}}{2 L_1} \left( e^{L_1 T} - 1 \right)$ (A.34)

We already showed that f is Lipschitz with some constant L1. To ensure the condition of inequality (A.33), notice that:

$\ddot{x}(x_0, f, t) = f'(x(x_0, f, t)) \, f(x(x_0, f, t))$ (A.35)

Since we already showed that the solutions $x(x_0, f, \cdot)$ are uniformly bounded by $B_1 + r$, it is sufficient to ensure that $f$ is $\mathcal{C}^1$. This is true if we assume that our kernel $K$ is $\mathcal{C}^2$, and hence (A.34) will be ensured.

Taking into account the Euler approximation and the error bound, the steps of the consistency proof are identical, with the following important difference in equation (A.15) from the previous section:

$\min_{\|f\|_H \le R, |x_0| \le r} L(f, x_0) \le L(f^*, x_0^*)$ (A.36)

with

$L(f^*, x_0^*) = \sum_{i=1}^m (t_{i+1} - t_i) \left( \tilde{x}^*(t_i, h) - x^*(t_i) \right)^2 \le h^2 \frac{L_{11}^2 T}{4 L_1^2} \left( e^{L_1 T} - 1 \right)^2 =: h^2 L_{12}$ (A.37)

With this modification, theorem 2 becomes:

Theorem 3. Assuming A1, A2, A3 and A4, there exist constants $K_2$, $L_{12}$ and $L_{13}$ depending only on $R, r, T, M_\epsilon$ and the kernel $K$ such that for every $\epsilon$:

$\mathbb{P}\left( L(\hat{f}, \hat{x}_0) \ge L_{13} \sqrt{\sum_{i=1}^m (t_{i+1} - t_i)^2} + h^2 L_{12} + \epsilon \right) \le \exp\left( -\frac{2 \epsilon^2}{K_2^2 \sum_{i=1}^m (t_{i+1} - t_i)^2} \right)$ (A.38)

Appendix A.3. L2 squared distance between the true solution and the estimated trajectory

In reality, $L(\hat{f}, \hat{x}_0)$ is an approximation of the squared $L^2$ norm

$\|x(\hat{f}, \hat{x}_0, \cdot) - x^*(\cdot)\|_{L^2}^2 := \int_0^T \left( x(\hat{f}, \hat{x}_0, t) - x^*(t) \right)^2 dt$ (A.39)

Since we proved that the solutions are uniformly bounded by $B_1 + r$ and $\dot{x}$ is bounded by $L_6$, the function $t \mapsto (x(\hat{f}, \hat{x}_0, t) - x^*(t))^2$ is Lipschitz with Lipschitz constant $8(B_1 + r) L_6$ (we just bound the norm of the derivative). Therefore:

$\left| \|x(\hat{f}, \hat{x}_0, \cdot) - x^*(\cdot)\|_{L^2}^2 - L(\hat{f}, \hat{x}_0) \right| \le 8 (B_1 + r) L_6 \sum_{i=1}^m (t_{i+1} - t_i)^2$ (A.40)

which proves Theorem 1 of the main text.

Appendix B. Kernels

We are interested in listing kernels that satisfy Lemma 1, and thus can be used to model ODEs admitting a single solution. There are cases when one can directly verify the hypothesis of Lemma 1. In the case of translation invariant kernels, one can use the Bochner theorem to provide a sufficient condition as explained in the next section.

Appendix B.1. Translation invariant kernels

We consider translation invariant scalar positive definite kernels over $\mathbb{R}^d$, that is, kernels for which

$k(u,v) = h(u - v), \quad u, v \in \mathbb{R}^d$ (B.1)

The Bochner theorem provides a characterization of translation invariant kernels. Specifically, there exists a probability density $q$ with respect to the Lebesgue measure over $\mathbb{R}^d$ such that

$h(x) = h(0) \int_{\mathbb{R}^d} e^{i x^T y} \, q(y) \, dy$ (B.2)

Furthermore, since we restrict our attention to real-valued kernels,

$h(x) = h(0) \int_{\mathbb{R}^d} \cos(x^T y) \, q(y) \, dy$ (B.3)

The gradient of h is then formally the vector of length d

$\nabla h(x) = -h(0) \int_{\mathbb{R}^d} y \sin(x^T y) \, q(y) \, dy$ (B.4)

and the Hessian of h is formally the matrix

$\nabla^2 h(x) = -h(0) \int_{\mathbb{R}^d} y y^T \cos(x^T y) \, q(y) \, dy$ (B.5)

Translation invariant kernels that satisfy Lemma 1 are such that

$Q(x) := c |x|^2 + 2 (h(x) - h(0)) \ge 0$ (B.6)

for some constant $c > 0$ and for any $x \in \mathbb{R}^d$. Notice that $Q(0) = 0$. Next, since $\nabla h(0) = 0$, $\nabla Q(0) = 0$. Moreover,

$\nabla^2 Q(x) = 2 c I + 2 \nabla^2 h(x)$ (B.7)

where $I$ is the identity matrix. Next, since $\nabla^2 h(x)$ is a symmetric matrix, it has real eigenvalues. Suppose these eigenvalues are bounded uniformly from below. In that case, one can choose a constant $c$ large enough such that $\nabla^2 Q(x)$ is positive definite for each $x \in \mathbb{R}^d$, which implies that $Q$ is convex; and since $Q(0) = 0$ and $\nabla Q(0) = 0$, $Q(x) \ge 0$ for each $x \in \mathbb{R}^d$ and the conditions of Lemma 1 are satisfied. A sufficient condition for this to happen is that all the coordinates of $\nabla^2 h$ are bounded, i.e., for each $i \in \{1, \ldots, d\}$, $E[Y_i^2] < \infty$, where $Y_i$ is a random variable with density $q_i$, the $i$th marginal of $q$.

Appendix B.2. Explicit Kernels:

We begin by observing the condition

$d_{K_{ii}}^2(u,v) \le N_K^2 |u - v|^2, \quad u, v \in \mathbb{R}^d, \; i = 1, \ldots, d$ (B.8)

is equivalent to the condition:

$\sum_{i=1}^d d_{K_{ii}}^2(u,v) \le N^2 |u - v|^2$ (B.9)

Consider the case where $K$ is an explicit kernel. That is to say, there exists a finite ($p$) dimensional feature space and a mapping $\Phi: \mathbb{R}^d \to \mathbb{R}^{p \times d}$ for which:

$K(u,v) = \Phi(u)^T \Phi(v)$ (B.10)

The Fourier random features used in our experiments fall in this category.

Lemma 4.

$\sum_{i=1}^d d_{K_{ii}}^2(u,v) = \|\Phi(u) - \Phi(v)\|^2$ (B.11)

where $\|\cdot\|$ is the Frobenius norm.

Proof:

$\sum_{i=1}^d \left( k_{ii}(u,u) - 2 k_{ii}(u,v) + k_{ii}(v,v) \right) = \sum_{i=1}^d \left( e_i^T \Phi(u)^T \Phi(u) e_i - 2 e_i^T \Phi(u)^T \Phi(v) e_i + e_i^T \Phi(v)^T \Phi(v) e_i \right)$ (B.12)
$= \sum_{i=1}^d e_i^T \left( \Phi(u)^T \Phi(u) - \Phi(u)^T \Phi(v) - \Phi(v)^T \Phi(u) + \Phi(v)^T \Phi(v) \right) e_i$ (B.13)
$= \sum_{i=1}^d e_i^T (\Phi(u) - \Phi(v))^T (\Phi(u) - \Phi(v)) e_i$ (B.14)
$= \mathrm{Trace}\left( (\Phi(u) - \Phi(v))^T (\Phi(u) - \Phi(v)) \right)$ (B.15)
$= \|\Phi(u) - \Phi(v)\|^2$ (B.16)

Therefore, for explicit kernels, the condition of Lemma 1 is equivalent to the feature map being Lipschitz continuous with respect to the Frobenius norm.
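To illustrate Lemma 4 (a numerical sketch only; the feature map below is an arbitrary random Fourier construction, not the exact kernel used in the experiments), the following snippet builds an explicit matrix-valued kernel $K(u,v)=\Phi(u)^T\Phi(v)$ and checks that $\sum_i dK_{ii}^2(u,v)$ equals $\|\Phi(u)-\Phi(v)\|^2$ in the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 2, 50  # state dimension and number of random features (illustrative choices)

# One random Fourier feature vector per output coordinate: Phi(u) is p x d,
# with column i given by cos(W_i u + b_i) applied entrywise.
W = rng.standard_normal(size=(d, p, d))
b = rng.uniform(0.0, 2 * np.pi, size=(d, p))

def Phi(u):
    """Feature map Phi: R^d -> R^{p x d}, so that K(u, v) = Phi(u)^T Phi(v)."""
    cols = [np.sqrt(2.0 / p) * np.cos(W[i] @ u + b[i]) for i in range(d)]
    return np.stack(cols, axis=1)  # shape (p, d)

def K(u, v):
    return Phi(u).T @ Phi(v)       # d x d matrix-valued kernel

u, v = rng.standard_normal(d), rng.standard_normal(d)

# Left-hand side of Lemma 4: sum_i [K_ii(u,u) - 2 K_ii(u,v) + K_ii(v,v)]
lhs = np.trace(K(u, u) - 2 * K(u, v) + K(v, v))
# Right-hand side: squared Frobenius norm of the feature difference
rhs = np.linalg.norm(Phi(u) - Phi(v), "fro") ** 2

print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")  # the two agree up to round-off
```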

Appendix B.3. Examples of kernels which satisfy the assumptions of Lemma 1

Let us denote

$P(u,v) = K_1(u,u) + K_1(v,v) - 2K_1(u,v)$ (B.17)
  1. The linear kernel
    $K_1(u,v) = u^T A v$ (B.18)
    where $A$ is a psd matrix. Indeed,
    $P(u,v) = (u-v)^T A (u-v) \le \|u-v\|^2 \sup_{1\le i\le d}\lambda_i$ (B.19)
    where $\lambda_i$ are the eigenvalues of $A$, using the Rayleigh quotient property.
  2. The Gaussian kernel:
    $K_1(u,v) = \exp\!\left(-\tfrac{1}{2}(u-v)^T A (u-v)\right)$ (B.20)
    where $A$ is a psd matrix. Indeed,
    $P(u,v) = 2 - 2\exp\!\left(-\tfrac{1}{2}(u-v)^T A(u-v)\right) \le (u-v)^T A(u-v) \le \|u-v\|^2 \sup_{1\le i\le d}\lambda_i$ (B.21)
    where $\lambda_i$ are the eigenvalues of $A$ and the first inequality follows from the elementary bound $e^x \ge 1+x$. (A numerical check of (B.19) and (B.21) is sketched after this list.)
  3. The rational quadratic kernel:
    $K_1(x,y) = \dfrac{\|x-y\|^2}{\|x-y\|^2+\theta}, \quad \theta > 0$ (B.22)
    Note that in this case,
    $P(u,v) \le \dfrac{1}{\theta}\|u-v\|^2$ (B.23)
  4. The sinc kernel
    $K_1(u,v) = \prod_{i=1}^d \dfrac{\sin(u_i-v_i)}{u_i-v_i}$ (B.24)
    We use the fact that $K_1$ is a translation-invariant kernel with associated density $q(y) = \prod_{i=1}^d q_1(y_i)$ with
    $q_1(z) = \tfrac{1}{2}$ for $-1 \le z \le 1$ (B.25)
  5. The Matérn kernel with $p > 3/2$. This kernel is translation invariant with associated density $q(y) = \prod_{i=1}^d q_1(y_i)$ with
    $q_1(z) \propto \dfrac{1}{(1+z^2)^p}$ (B.26)
    and
    $\mathbb{E}[X^2] < \infty, \quad X \sim q_1$ (B.27)
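As a quick sanity check of the bounds in the first two examples (an illustrative sketch only, not part of the argument), the snippet below evaluates $P(u,v)$ at random pairs and verifies the inequalities (B.19) and (B.21); the matrix $A$, the dimension, and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = np.diag([0.5, 1.0, 2.0])            # illustrative psd matrix
lam_max = np.linalg.eigvalsh(A).max()   # sup of the eigenvalues of A

def P(k, u, v):
    """P(u, v) = K1(u, u) + K1(v, v) - 2 K1(u, v), cf. (B.17)."""
    return k(u, u) + k(v, v) - 2 * k(u, v)

linear = lambda u, v: u @ A @ v
gaussian = lambda u, v: np.exp(-0.5 * (u - v) @ A @ (u - v))

ok = True
for _ in range(10_000):
    u, v = rng.standard_normal(d), rng.standard_normal(d)
    bound = lam_max * np.sum((u - v) ** 2)
    ok &= P(linear, u, v) <= bound + 1e-12    # bound (B.19)
    ok &= P(gaussian, u, v) <= bound + 1e-12  # bound (B.21)
print("bounds (B.19) and (B.21) hold on all samples:", bool(ok))
```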

Appendix C. An example of a non-autonomous system

In this appendix we provide a toy example of a non-autonomous system, namely a lightly damped harmonic oscillator driven by a sinusoidal input force

$\ddot{y} + 0.001\,\dot{y} + 10000\,y = \cos(t)$ (C.1)

The kernel is an explicit Fourier random feature kernel with $p=200$ random features as well as a constant term, where time is included as an input together with the spatial variables. Each feature was centered and standardized, with the mean and standard deviation computed from the training set only. The functions in the corresponding RKHS then take the form

$f(x_1,x_2,t) = \begin{pmatrix} \sum_{i=1}^p \alpha_i \cos\bigl(\langle (z_{1,i},z_{2,i},z_{3,i}),(x_1,x_2,t)\rangle\bigr) + \beta_i \sin\bigl(\langle (z_{1,i},z_{2,i},z_{3,i}),(x_1,x_2,t)\rangle\bigr) + \omega_1 \\ \sum_{i=1}^p \gamma_i \cos\bigl(\langle (z_{1,i},z_{2,i},z_{3,i}),(x_1,x_2,t)\rangle\bigr) + \delta_i \sin\bigl(\langle (z_{1,i},z_{2,i},z_{3,i}),(x_1,x_2,t)\rangle\bigr) + \omega_2 \end{pmatrix}$ (C.2)

where the $z$ variables are sampled iid from a standard normal distribution and the parameters $\alpha_i, \beta_i, \gamma_i, \delta_i$, $i=1,\dots,p$, along with $\omega_j$, $j=1,2$, are learned from the training set. Figure C.5 illustrates the output of the ODE-RKHS algorithm for this system.

Fig. C.5.

(a): plot of the 2D system where the z-axis is time. Black arrows: true vector field. Grey arrows: estimated vector field. Black curves: true trajectories. Red curves: estimated trajectories. (b): Grey points: initial conditions. Black curves: true trajectories. Red curves: estimated trajectories.
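To make the construction in (C.2) concrete, here is a minimal sketch of the Fourier random feature parameterization with time appended to the state; the array shapes and function names are illustrative, the weights are untrained placeholders, and the centering/standardization step described above is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 200                               # number of random Fourier features, as in the text

# Random frequencies z = (z1, z2, z3), one triple per feature, sampled iid
# from a standard normal; the input is u = (x1, x2, t).
Z = rng.standard_normal(size=(p, 3))

def features(x1, x2, t):
    """Random Fourier features [cos(z . u), sin(z . u), 1] for u = (x1, x2, t)."""
    u = np.array([x1, x2, t])
    zu = Z @ u                        # shape (p,)
    return np.concatenate([np.cos(zu), np.sin(zu), [1.0]])  # length 2p + 1

def f_hat(x1, x2, t, theta1, theta2):
    """Vector field of the form (C.2): one linear combination of the features per coordinate.

    theta1 packs (alpha_1..alpha_p, beta_1..beta_p, omega_1); theta2 likewise
    packs (gamma, delta, omega_2). These weights are placeholders for the
    parameters learned from the training set.
    """
    phi = features(x1, x2, t)
    return np.array([theta1 @ phi, theta2 @ phi])

# Example call with random (untrained) weights, just to show the shapes.
theta1 = rng.standard_normal(2 * p + 1)
theta2 = rng.standard_normal(2 * p + 1)
print(f_hat(0.1, -0.2, 0.0, theta1, theta2))  # a 2-vector: (dx1/dt, dx2/dt)
```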

Footnotes


1

The data used for this experiment has been obtained from the Wisconsin Registry for Alzheimer’s Prevention. See https://wrap.wisc.edu/. A request for accessing this data can be initiated from this website.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
