Abstract
We present a novel weak formulation and discretization for discovering governing equations from noisy measurement data. This method of learning differential equations from data fits into a new class of algorithms that replace pointwise derivative approximations with linear transformations and variance reduction techniques. Compared to the standard SINDy algorithm presented in [S. L. Brunton, J. L. Proctor, and J. N. Kutz, Proc. Natl. Acad. Sci. USA, 113 (2016), pp. 3932–3937], our so-called weak SINDy (WSINDy) algorithm allows for reliable model identification from data with large noise (often with ratios greater than 0.1) and reduces the error in the recovered coefficients to enable accurate prediction. Moreover, the coefficient error scales linearly with the noise level, leading to high-accuracy recovery in the low-noise regime. Altogether, WSINDy combines the simplicity and efficiency of the SINDy algorithm with the natural noise reduction of integration, as demonstrated in [H. Schaeffer and S. G. McCalla, Phys. Rev. E, 96 (2017), 023302], to arrive at a robust and accurate method of sparse recovery.
Keywords: data-driven model selection, nonlinear dynamics, sparse recovery, generalized least squares, Galerkin method, adaptive grid
AMS subject classifications: 37M10, 62J99, 62-07, 65R99
1. Problem statement.
Consider a first-order dynamical system in $d$ dimensions of the form

(1.1) $\dot{x}(t) = F(x(t)), \qquad x(0) = x_0 \in \mathbb{R}^d, \qquad t \in [0, T],$

and measurement data given at timepoints $(t_1, \dots, t_K)$ by

$y_k = x(t_k) + \epsilon_k, \qquad k \in [K],$

where throughout we use the bracket notation $[K] := \{1, \dots, K\}$. The variable $\epsilon \in \mathbb{R}^{K \times d}$ represents a matrix of independent and identically distributed measurement noise. The focus of this article is the reconstruction of the dynamics (1.1) from the noisy measurements $Y := (y_1; \dots; y_K) \in \mathbb{R}^{K \times d}$.
The SINDy algorithm (sparse identification of nonlinear dynamics [4]) has been shown to be successful in solving this problem for sparsely represented nonlinear dynamics when noise is small and dynamic scales do not vary across multiple orders of magnitude. This framework assumes that the function $F$ in (1.1) is given componentwise by
(1.2) $F_j(x) = \sum_{i=1}^{J} w^\star_{ij}\, f_i(x), \qquad j \in [d],$

for some known family of functions $(f_i)_{i \in [J]}$ and a sparse weight matrix $W^\star = (w^\star_{ij}) \in \mathbb{R}^{J \times d}$. The problem is then transformed into solving for $W^\star$ by building a data matrix $\Theta(Y)$ given by

$\Theta(Y) := \big( f_1(Y) \;\; f_2(Y) \;\; \cdots \;\; f_J(Y) \big) \in \mathbb{R}^{K \times J},$

so that the candidate functions are directly evaluated at the noisy data. Solving (1.1) for $W^\star$ then reduces to identifying a sparse weight matrix $\widehat{W}$ such that

(1.3) $\dot{Y} \approx \Theta(Y)\, \widehat{W},$

where $\dot{Y}$ is a numerical time derivative of the data $Y$. Sequentially thresholded least squares is then used to arrive at a sparse solution.
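For concreteness, the sequential-thresholding loop admits a compact implementation. The following Python sketch is illustrative only (the function name, arguments, and iteration cap are ours, not from [4]): it alternates an unregularized least squares solve with hard thresholding of small coefficients.

```python
import numpy as np

def stls(Theta, dY, lam, maxit=10):
    """Sequentially thresholded least squares (illustrative sketch).

    Theta : (K, J) library matrix, columns f_i evaluated at the data
    dY    : (K, d) approximate time derivatives of the data
    lam   : threshold below which coefficients are zeroed out
    """
    W = np.linalg.lstsq(Theta, dY, rcond=None)[0]       # initial least squares fit
    for _ in range(maxit):
        small = np.abs(W) < lam                          # candidate terms to prune
        W[small] = 0.0
        for j in range(dY.shape[1]):                     # refit surviving terms per coordinate
            keep = ~small[:, j]
            if keep.any():
                W[keep, j] = np.linalg.lstsq(Theta[:, keep], dY[:, j], rcond=None)[0]
    return W
```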
1.1. Background.
Research into statistically rigorous selection of mathematical models from data can be traced back to Akaike’s seminal work in the 1970s [1, 2]. In the last 20 years, there has been substantial work in this area at the interface between applied mathematics, computer science, and statistics (see [3, 11, 12, 19, 22, 23] for both theory and applications). More recently, the formulation of system discovery problems in terms of a candidate basis of nonlinear functions (1.2) and subsequent discretization (1.3) was introduced in [21] in the context of catastrophe prediction. The authors of [21] used compressed sensing techniques to enforce sparsity. Since then there has been an explosion of interest in the problem of identifying nonlinear dynamical systems from data, with some of the primary techniques being Gaussian process regression [15], deep neural networks [16], Bayesian inference [26, 27], and classical methods from numerical analysis [7, 9, 25]. These techniques have been successfully applied to the discovery of both ordinary and partial differential equations.
The variety of discovery algorithms differ qualitatively in the interpretability of the resulting data-driven dynamical system, the scope and efficiency of the algorithm, and the robustness to noise, scale separation, etc. For instance, a neural-network-based data-driven dynamical system does not easily lend itself to physical interpretation, while the SINDy algorithm identifies governing equations which can be analyzed directly. Moreover, it is well known that the training stage for neural networks and other iterative learning algorithms can be computationally costly. Concerning the scope of an algorithm, several methods have been independently developed to discover models under the assumption of some prior knowledge of the governing equations, notably for low-degree polynomial chaotic systems, cyclic ODEs, interacting particles, and Hamiltonian dynamics [20, 18, 13, 24]. In each of these cases the authors derive probabilistic recovery guarantees depending on the number of available trajectories, the size of the candidate model library, the level of incoherence of the data, and/or the sparsity of the governing equations.
The vast majority of algorithms and recovery guarantees assume that pointwise derivatives of the data either are available or can be reliably computed. This severely limits an algorithm's robustness to noise and hence its applicability to real-world data. Here we relax this assumption and provide rigorous justification for the weak formulation of the dynamics as a means to circumvent this ubiquitous problem in model selection. Building on the SINDy framework, we present the robust discovery algorithm WSINDy (weak SINDy), which operates under the assumption that the time derivative is unavailable and that the only prior knowledge of the governing equations is their inclusion in a large model library. We also focus on the realistic scenario where only a single noisy trajectory of the state variable is available; however, extension to multiple trajectories is of course possible. For simplicity, we restrict numerical experiments to autonomous ODEs for their amenability to analysis. Natural next steps are to explore identification of PDEs and nonautonomous dynamical systems. We note that the use of integral equations for system identification was introduced in [17], where compressed sensing techniques were used to enforce sparsity, and that this technique can be seen as a special case of the method introduced here.
In section 2 we introduce the algorithm with analysis of the resulting error structure. Section 3 contains numerical results showing identification of six ODE systems over a range of noise levels and parameter regimes. In section 4, we provide concluding remarks as well as natural next directions for this line of research. In Appendix A we include a detailed comparison between WSINDy and SINDy as well as further information on the generalized least squares method.
2. WSINDy.
We approach the problem of system identification (1.3) from a nonstandard perspective by utilizing the weak form of the differential equation. Recall that for any smooth test function $\phi$ (absolutely continuous is sufficient) and interval $[a, b] \subset [0, T]$, (1.1) admits the weak formulation

(2.1) $\phi(b)\, x(b) - \phi(a)\, x(a) = \int_a^b \phi'(t)\, x(t)\, dt + \int_a^b \phi(t)\, F(x(t))\, dt.$

With $\phi \equiv 1$, we arrive at the integral equation of the dynamics explored in [17]. If we instead take $\phi$ to be nonconstant and compactly supported in $(a, b)$, we arrive at

(2.2) $-\int_a^b \phi'(t)\, x(t)\, dt = \int_a^b \phi(t)\, F(x(t))\, dt.$
Assuming a representation of the form (1.2), we then define the generalized residual for a given test function $\phi$ by replacing $F$ with a candidate element $\Theta\, w := \sum_i w_i f_i$ from the span of $(f_i)_{i \in [J]}$ and $x$ with the data $y$ as follows:

(2.3) $\mathcal{R}(w; \phi) := \int_a^b \phi'(t)\, y(t)\, dt + \int_a^b \phi(t)\, \Theta(y(t))\, w\, dt.$

Clearly, with $w = w^\star$ (a column of $W^\star$) and $y = x$ we have $\mathcal{R}(w^\star; \phi) = 0$ for all $\phi$ compactly supported in $(a, b)$; however, $Y$ is a discrete set of data, so (2.3) can at best be approximated numerically. Measurement noise then presents a significant barrier to accurate identification of $W^\star$.
2.1. Method overview.
For analogy with traditional Galerkin methods, consider the forward problem of solving a dynamical system such as (1.1) for $x$. The Galerkin approach is to seek a solution $\hat{x}$ represented in a chosen trial basis $(\psi_j)_{j \in [J]}$ such that the residual $\mathcal{R}(\hat{x})$, defined by

$\mathcal{R}(\hat{x}) := \dot{\hat{x}} - F(\hat{x}),$

is minimized over all test functions living in the span of a given test function basis $(\phi_k)_{k \in [N]}$. If the trial and test function bases are known analytically, inner products of the form $\langle \phi_k, \psi_j \rangle$ appearing in the residual can be computed exactly. Thus, the computational error results only from representing the solution in a finite-dimensional function space.
The method we present here can be considered a data-driven Galerkin method of solving (2.2) for $W$, where the trial "basis" is given by the set of gridfunctions $(f_i(Y))_{i \in [J]}$ evaluated at the data and only the test function basis $(\phi_k)_{k \in [N]}$ is known analytically. In this way, inner products appearing in the residual must be approximated numerically, implying that the accuracy of the recovered weights is ultimately limited by the quadrature scheme used to discretize the inner products. Using Lemma 2 below, we show that the correct coefficients may be recovered to effective machine precision accuracy (given by the tolerance of the forward ODE solver) from noise-free trajectories by discretizing (2.2) using the trapezoidal rule and choosing $\phi$ to decay smoothly to zero at the boundaries of its support. Specifically, in this article we demonstrate this fact by choosing test functions from a particular family of unimodal piecewise polynomials defined in (2.6).
Having chosen a quadrature scheme, the next accuracy barrier is presented by measurement noise, which introduces randomness into the residuals $\mathcal{R}(w; \phi_k)$. Numerical integration then couples the residuals for $\phi_j$ and $\phi_k$ whenever $\phi_j$ and $\phi_k$ have overlapping support. In this way, the residual does not have an ideal error structure for ordinary least squares but may be amenable to generalized least squares. Below we analyze the distribution of the residuals to arrive at a generalized least squares approach where an approximate covariance matrix can be computed directly from the test functions. This analysis also suggests that placing test functions near steep gradients in the dynamics may improve recovery; hence we develop a derivative-free method for adaptively clustering test functions near steep gradients.
Remark 1.
The weak formulation of the dynamics introduces a wealth of information: given $K$ timepoints, (2.2) affords $O(K^2)$ residuals, one for each possible support $[a, b] = [t_i, t_j]$ with $i < j$. Of course, one could also assimilate the responses of multiple families of test functions; however, the computational complexity of such an exhaustive approach quickly becomes intractable. We stress that even with large noise, our proposed method identifies the correct nonlinearities with accurate weight recovery while keeping the number of test functions $N$ lower than the number of timepoints $K$.
2.2. Algorithm: WSINDy.
We state here the WSINDy algorithm in full generality. We propose a generalized least squares approach with approximate covariance matrix $\mathcal{C}$. Below we derive a particular choice of $\mathcal{C}$ which utilizes the action of the test functions on the data $Y$. Sequential thresholding on the weight coefficients with thresholding parameter $\lambda$ is used to enforce sparsity, where a $\lambda$ smaller in magnitude than the smallest nonzero true coefficient is necessary for recovery. Lastly, an $\ell^2$-regularization term with coefficient $\gamma$ is included for problems involving rank deficiency. Methods of choosing optimal values of $\lambda$ and $\gamma$ directly from a given dataset do exist, for instance, by selecting the optimal position in a Pareto front [5]; however, this is not the focus of our current study, and thus we select values that work across multiple examples. Specifically, in the experiments below we set $\gamma = 0$ with the exception of the nonlinear pendulum and the five-dimensional linear system, examples which show that regularization can be used to discover dynamics from excessively large libraries. For noise-free data the algorithm is only weakly dependent on $\lambda$, while for noisy data the values of $\lambda$ used are reported with each experiment.
- Construct the matrix of trial gridfunctions $\Theta(Y) = ( f_1(Y) \;\cdots\; f_J(Y) )$.
- Construct integration matrices $V, V' \in \mathbb{R}^{N \times K}$, such that $(V \Theta(Y))_{ki}$ and $(V' Y)_{kj}$ discretize $\int \phi_k\, f_i(y)\, dt$ and $\int \phi'_k\, y_j\, dt$, respectively.
- Compute the Gram matrix $G := V\, \Theta(Y)$ and right-hand side $b := -V' Y$, so that $G \in \mathbb{R}^{N \times J}$ and $b \in \mathbb{R}^{N \times d}$.
- Solve the generalized least squares problem with $\ell^2$-regularization

  $\widehat{W} := \arg\min_{W} \left\| \mathcal{C}^{-1/2} (G W - b) \right\|_2^2 + \gamma^2 \| W \|_2^2,$

  using sequential thresholding with parameter $\lambda$ to enforce sparsity.
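The linear algebra above fits in a short script. The following Python sketch assumes the integration matrices V and Vp (for $\phi_k$ and $\phi'_k$) have already been assembled, uses the approximate covariance $\mathcal{C} = V'(V')^T$ derived in section 2.3.1, and adds a small diagonal jitter before the Cholesky factorization for numerical safety; all names are illustrative.

```python
import numpy as np

def wsindy_solve(V, Vp, Theta, Y, lam=1e-4, gamma=0.0, maxit=10):
    """Sketch of the WSINDy generalized least squares solve with thresholding."""
    G = V @ Theta                                  # Gram matrix, (N, J)
    b = -Vp @ Y                                    # right-hand side, (N, d)
    C = Vp @ Vp.T + 1e-12 * np.eye(V.shape[0])     # approximate covariance + jitter
    L = np.linalg.cholesky(C)                      # C = L L^T
    Gw, bw = np.linalg.solve(L, G), np.linalg.solve(L, b)   # whitened system
    # l2-regularization via row stacking: min ||Gw W - bw||^2 + gamma^2 ||W||^2
    A = np.vstack([Gw, gamma * np.eye(G.shape[1])])
    rhs = np.vstack([bw, np.zeros((G.shape[1], Y.shape[1]))])
    W = np.linalg.lstsq(A, rhs, rcond=None)[0]
    for _ in range(maxit):                         # sequential thresholding
        small = np.abs(W) < lam
        W[small] = 0.0
        for j in range(W.shape[1]):
            keep = ~small[:, j]
            if keep.any():
                W[keep, j] = np.linalg.lstsq(A[:, keep], rhs[:, j], rcond=None)[0]
    return W
```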
With this as our core algorithm, we can now consider a residual analysis (section 2.3) leading to a generalized least squares framework. We can also develop theoretical results related to the test functions (section 2.4), yielding a more thorough understanding of the impact of using uniform (section 2.4.1) and adaptive (section 2.4.2) placement of test functions along the time axis.
2.3. Residual analysis.
Performance of WSINDy is determined by the behavior of the residuals

$r_k(W) := (G W - b)_k, \qquad k \in [N],$

denoted $R(W) := G W - b$ for the entire residual matrix. Here we analyze the residual for autonomous $F$ to highlight key aspects for future analysis, as well as to arrive at an appropriate choice of approximate covariance $\mathcal{C}$. We also provide a heuristic argument in favor of placing test functions near steep gradients in the dynamics.
A key difficulty in recovering the true weights is that for nonlinear systems the residual evaluated at the true weights is biased: $\mathbb{E}[R(W^\star)] \neq 0$. Any minimization of $R$ thus introduces a bias in the recovered weights $\widehat{W}$. Nevertheless, we can understand how different test functions impact the residual by linearizing around the true trajectory $X$ and isolating the dominant error terms:

$R(W) = e_0 + e_\Theta + e_{\dot{y}} + e_{\mathrm{int}} + e_{\mathrm{hot}},$

where $G^\star := V\, \Theta(X)$ denotes the noise-free Gram matrix. The errors manifest in the following ways:

- $e_0 := G^\star (W - W^\star)$ is the misfit between the candidate weights $W$ and the true weights $W^\star$.
- $e_\Theta := V\, (\Theta(Y) - \Theta(X))\, W \approx V\, (\nabla\Theta(X) \cdot \epsilon)\, W$ results from measurement error in the trial gridfunctions.
- $e_{\dot{y}} := V' \epsilon$ results from replacing $x$ with $Y$ in the left-hand side of (2.2).
- $e_{\mathrm{int}}$ is a deterministic integration error.
- $e_{\mathrm{hot}}$ is the remainder term in the truncated Taylor expansion of $\Theta(Y)$ around $X$:

  $e_{\mathrm{hot}} = O(\|\epsilon\|^2).$
Clearly, recovery of $W^\star$ when $\epsilon = 0$ is straightforward: $e_0$ and $e_{\mathrm{int}}$ are the only error terms; thus one only needs to select a quadrature scheme that ensures that the integration error is negligible, and $W^\star$ will be the minimizer. A primary focus of this study is the use of a specific family of piecewise polynomial test functions defined below for which the trapezoidal rule is highly accurate (see Lemma 2). Figure 3.1 demonstrates this fact on noise-free data.
FIG. 3.1.
Noise-free data ($\sigma_{NR} = 0$): plots of the relative coefficient error $E_2$ (defined in (3.2)) vs. the test function degree $p$. V1-V4 indicate different ODE parameters (see Table 2). For the Lorenz system the parameters are fixed, and 40 different initial conditions are sampled from a uniform distribution. In each case, the recovered coefficients rapidly converge to within the accuracy of the ODE solver ($10^{-10}$).
For $\epsilon \neq 0$, accurate recovery of $W^\star$ requires one to choose hyperparameters that emphasize the true misfit term $e_0$ by enforcing that the other error terms are of lower order. We look for a covariance approximation $\mathcal{C}$ and test functions $(\phi_k)_{k \in [N]}$ that approximately enforce $\mathcal{C}^{-1/2} R(W^\star) \sim \mathcal{N}(0, \sigma^2 I)$, justifying the least squares approach. In the next subsection we address the issue of approximating the covariance matrix, providing justification for using $\mathcal{C} := V' (V')^T$. The following subsection provides a heuristic argument for how to reduce corruption from the error terms $e_\Theta$ and $e_{\dot{y}}$ by placing test functions near steep gradients in the data.
2.3.1. Approximate covariance $\mathcal{C}$.
Neglecting the deterministic integration error, which can be made small (see Lemma 2 below), and higher-order noise terms, the residual evaluated at the true weights is approximately

$R(W^\star) \approx V\, (\nabla\Theta(X) \cdot \epsilon)\, W^\star + V' \epsilon,$

where $\approx$ implies that equality holds to leading order in $\epsilon$. Given the variances of the two terms, proportional to $\sigma^2 \|\phi_k\, \nabla F(x)\|_2^2$ and $\sigma^2 \|\phi'_k\|_2^2$, respectively, the true distribution of $R(W^\star)$ depends on $\nabla F(x)$, which is not known a priori. If it holds that $\|\phi'_k\|_2 \gg \|\phi_k\, \nabla F(x)\|_2$, a leading order approximation to the covariance $\Sigma$ of $R(W^\star)$ is

$\Sigma \approx \sigma^2\, V' (V')^T =: \sigma^2\, \mathcal{C},$

using that $\mathrm{Cov}\big( \langle \phi'_j, \epsilon \rangle, \langle \phi'_k, \epsilon \rangle \big) = \sigma^2 \langle \phi'_j, \phi'_k \rangle$. For this reason, we employ localized test functions and adopt the heuristic below.
2.3.2. Adaptive refinement.
Next we show that by localizing $\phi$ around regions where $|\dot{x}|$ is large, we get an approximate cancellation of the error terms $e_\Theta$ and $e_{\dot{y}}$. Consider the one-dimensional case where $k$ is an arbitrary time index and $y_k = x(t_k) + \epsilon_k$ is an observation. When $|\dot{x}(t_k)|$ is large compared to $|\epsilon_k|$, we approximately have

(2.4) $y_k = x(t_k) + \epsilon_k \approx x(t_k + \delta_k)$

for some small $\delta_k$, i.e., the perturbed value lands close to the true trajectory at the shifted time $t_k + \delta_k$. To understand the heuristic behind this approximation, let $t_k + \delta_k$ be the point of intersection between the tangent line to $x$ at $t_k$ and the horizontal line of height $y_k$. Then

$\delta_k = \frac{\epsilon_k}{\dot{x}(t_k)};$

hence $|\dot{x}(t_k)| \gg |\epsilon_k|$ implies that $y_k$ will approximately lie on the true trajectory. As well, regions where $|\dot{x}|$ is small will not yield accurate recovery in the case of noisy data, since perturbations are more likely to exit the relevant region of phase space. If we linearize a trial function $f$ using the approximation (2.4) we get

(2.5) $f(y_k) \approx f(x(t_k)) + \delta_k\, \dot{x}(t_k)\, f'(x(t_k)) = f(x(t_k)) + \delta_k\, \frac{d}{dt} f(x(t_k)).$
Assuming $\phi$ is sufficiently localized around $t_k$ that $\delta \approx \delta_k$ is roughly constant over its support, (2.4) also implies that

$\langle \phi', \epsilon \rangle \approx \delta\, \langle \phi', \dot{x} \rangle,$

hence $e_{\dot{y}} \approx \delta\, \langle \phi', \dot{x} \rangle$, while (2.5) implies

$\langle \phi,\, f(y) - f(x) \rangle \approx \delta\, \Big\langle \phi,\, \frac{d}{dt} f(x) \Big\rangle = -\delta\, \langle \phi', f(x) \rangle,$

having integrated by parts. Collecting the terms together (with $F(x) = \Theta(x)\, w^\star$, so that summing over trial functions replaces $f(x)$ with $\dot{x}$) yields that the residual takes the form

$\mathcal{R}(w^\star; \phi) \approx \delta\, \langle \phi', \dot{x} \rangle - \delta\, \langle \phi', \dot{x} \rangle + O(\delta^2) = O(\delta^2),$

and we see that $e_\Theta$ and $e_{\dot{y}}$ have effectively cancelled. In higher dimensions this interpretation does not appear to be as illuminating, but nevertheless, for any given coordinate $j$, it does hold that terms in the error expansion vanish around points where $|\dot{x}_j|$ is large, precisely because $\delta_k = \epsilon_k / \dot{x}_j(t_k)$ is then small.
2.4. Test function basis.
Here we introduce a test function space $\mathcal{S}$ and quadrature scheme to minimize integration errors and enact the heuristic arguments above, which rely on $\phi$ having fast decay towards the boundaries of its support and being sufficiently localized to ensure $\|\phi'\|_2 \gg \|\phi\, \nabla F\|_2$. We define $\mathcal{S}$ to be the space of unimodal piecewise polynomials of the form

(2.6) $\phi(t) = \begin{cases} C\, (t - a)^p\, (b - t)^q, & t \in [a, b], \\ 0, & \text{otherwise}, \end{cases}$

where $p, q \geq 1$ and $C > 0$. The normalization

$C = \frac{1}{p^p q^q} \left( \frac{p + q}{b - a} \right)^{p + q}$

ensures that $\|\phi\|_\infty = 1$. Functions in $\mathcal{S}$ are nonnegative, unimodal, and compactly supported in $[a, b]$ with $\min(p, q) - 1$ continuous derivatives. Larger $p$ and $q$ imply faster decay towards the endpoints of the support. For $p = q$, we refer to $p$ as the degree of $\phi$.
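Constructing $\phi$ and $\phi'$ on a time grid is straightforward; the Python sketch below is our own helper (with the maximum located at $t^\star = (pb + qa)/(p + q)$, so the normalization $\|\phi\|_\infty = 1$ follows directly) and is reused in the later sketches.

```python
import numpy as np

def test_fun(t, a, b, p, q):
    """phi(t) = C (t-a)^p (b-t)^q on [a,b], zero outside, with max phi = 1.
    Returns phi and its derivative phi' evaluated on the grid t."""
    tstar = (p * b + q * a) / (p + q)                    # location of the maximum
    C = 1.0 / ((tstar - a) ** p * (b - tstar) ** q)      # enforces ||phi||_inf = 1
    phi = np.zeros_like(t, dtype=float)
    dphi = np.zeros_like(t, dtype=float)
    inside = (t > a) & (t < b)
    ti = t[inside]
    phi[inside] = C * (ti - a) ** p * (b - ti) ** q
    # phi'(t) = C (t-a)^(p-1) (b-t)^(q-1) [p(b-t) - q(t-a)]
    dphi[inside] = C * (ti - a) ** (p - 1) * (b - ti) ** (q - 1) \
                     * (p * (b - ti) - q * (ti - a))
    return phi, dphi
```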
To ensure the integration error in approximating inner products is negligible, we rely on the following lemma, which provides a bound on the error in discretizing the weak derivative relation

(2.7) $\int_a^b \phi'(t)\, y(t)\, dt = -\int_a^b \phi(t)\, \dot{y}(t)\, dt$

using the trapezoidal rule for compactly supported $\phi$. Following the lemma we introduce two strategies for choosing the parameters of the test functions $(\phi_k)_{k \in [N]}$.
Lemma 2 (numerical error in weak derivatives).
Let $y$ have continuous derivatives of order $n \geq 1$ on $[a, b]$, and define $f := \phi' y$. If $\phi$ has roots of multiplicity $p$ and $q$ at $t = a$ and $t = b$, respectively, and $m := \min(p, q, n)$, then

(2.8) $\left| \int_a^b \phi'(t)\, y(t)\, dt - \mathrm{T}_{\Delta t}[\phi' y] \right| \leq C_m\, \Delta t^{\, 2\lceil m/2 \rceil},$

where $\mathrm{T}_{\Delta t}$ denotes the composite trapezoidal rule on the grid $t_k = a + k\, \Delta t$. In other words, the composite trapezoidal rule discretizes the weak derivative relation (2.7) to order $\Delta t^{\, 2\lceil m/2 \rceil}$.
Proof.
This is a simple consequence of the Euler-Maclaurin formula. If $f$ is a smooth function, then the following asymptotic expansion holds:

$\mathrm{T}_{\Delta t}[f] - \int_a^b f(t)\, dt \sim \sum_{s \geq 1} \frac{B_{2s}}{(2s)!}\, \Delta t^{2s} \left( f^{(2s-1)}(b) - f^{(2s-1)}(a) \right),$

where the $B_{2s}$ are the Bernoulli numbers. The asymptotic expansion provides corrections to the trapezoidal rule that realize machine precision accuracy up until a certain order, after which terms in the expansion grow and the series diverges [6, Chapter 3]. In our case, $f = \phi' y$, where the root conditions on $\phi$ imply that

$f^{(j)}(a) = f^{(j)}(b) = 0 \quad \text{for all } j \leq \min(p, q) - 2.$

So for $m = \min(p, q)$ odd, we have that the first surviving boundary terms appear at order $\Delta t^{m + 1}$, giving (2.8). For even $m$, the leading term is of order $\Delta t^{m}$, with a slightly different coefficient.
For $\phi \in \mathcal{S}$ with $p = q$, the exact leading order error term in (2.8) is

(2.9) $e(p, \Delta t) = \frac{B_{2\lceil p/2 \rceil}}{(2\lceil p/2 \rceil)!}\, \Delta t^{\, 2\lceil p/2 \rceil} \left( f^{(2\lceil p/2 \rceil - 1)}(b) - f^{(2\lceil p/2 \rceil - 1)}(a) \right),$

which is negligible for a wide range of reasonable $p$ and $\Delta t$ values. The Bernoulli numbers eventually grow factorially, like $B_{2s} \sim 2\, (2s)! / (2\pi)^{2s}$, but for smaller values of $s$ the coefficients in (2.9) are moderate. For instance, at the resolutions used in our experiments the leading error term remains below machine precision over a wide range of degrees (in one representative case, for all $p$ between 7 and 819), eventually growing to the value 0.495352 only for extreme parameter choices. For these reasons, in what follows we choose test functions from $\mathcal{S}$ and discretize all integrals using the trapezoidal rule. Unless otherwise stated, each test function $\phi_k$ satisfies $p_k = q_k$ and so is fully determined by the tuple $(p_k, a_k, b_k)$ indicating its polynomial degree and support. In the next two subsections we propose two different strategies for determining these parameters using the data $Y$.
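The practical upshot of Lemma 2 is easy to verify numerically. In this illustrative Python check (our own construction), both sides of the weak derivative relation (2.7) are discretized with the trapezoidal rule for a degree-16 test function on 201 gridpoints; the two quadratures agree to roughly machine precision.

```python
import numpy as np

# With phi vanishing to high order at the endpoints of its support, the
# trapezoidal discretizations of both sides of (2.7) agree almost exactly.
a, b, p = 0.0, 1.0, 16
t = np.linspace(a, b, 201)
dt = t[1] - t[0]
g = (t - a) ** p * (b - t) ** p
C = 1.0 / g.max()                                  # normalize so max(phi) = 1
phi = C * g
dphi = C * p * (t - a) ** (p - 1) * (b - t) ** (p - 1) * ((b - t) - (t - a))
y = np.sin(5 * t)                                  # smooth test signal
dy = 5 * np.cos(5 * t)
lhs = dt * np.sum(dphi * y)                        # trapezoid; endpoint values vanish
rhs = -dt * np.sum(phi * dy)
print(abs(lhs - rhs) / abs(rhs))                   # ~1e-15
```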
2.4.1. Strategy 1: Uniform grid.
The simplest strategy for choosing a basis of test functions is to place $\phi_1, \dots, \phi_N$ uniformly on the interval $[t_1, t_K]$ with fixed degree $p$ and fixed support size

$L := \frac{b - a}{\Delta t} + 1$

(i.e., $L$ is the number of timepoints that each $\phi_k$ is supported on). The triple $(p, L, \rho)$, where $\rho$ is the shift parameter introduced below, then defines the scheme, where each piece affects the distribution of the residual $R(W^\star)$.
Step 1: Choosing L.
Heuristically, the support size of $\phi$ relates to the Fourier transform of the data. If the support is small compared to the dominant wavemodes in the dynamics, then high-frequency noise will dominate the values of the inner products $\langle \phi, f_i(Y) \rangle$. If it is much larger than the dominant wavemodes, then too much averaging may occur, leading to unresolved dynamics. A natural choice is then to set the support length equal to the period of a known active wavemode¹ $\omega^\star$:

$L\, \Delta t = \frac{2\pi}{\omega^\star}.$

In the noise-free and small-noise experiments below we fix $L$ as reported in Table 2 and leave optimal selection of $L$ based on Fourier analysis to future work.
Step 2: Determining $p$.
In light of the derivation above of the approximate covariance matrix $\mathcal{C}$, we define the parameter

$\hat{r} := \frac{\|\phi'\|_2}{\|\phi\|_2},$

which serves as an estimate for the ratio between the standard deviations of the two dominant error terms $\langle \phi', \epsilon \rangle$ and $\langle \phi\, \nabla F, \epsilon \rangle$ in the residual. Larger $\hat{r}$ indicates better agreement with the approximate covariance matrix $\mathcal{C}$, since $\mathcal{C}$ accounts only for the term $\langle \phi', \epsilon \rangle$. Furthermore, for $\phi \in \mathcal{S}$ with $p = q$ we have the exact formula

$\frac{\|\phi'\|_2}{\|\phi\|_2} = \frac{2}{b - a} \sqrt{\frac{p\, (4p + 1)}{2\, (2p - 1)}},$

obtained by evaluating $\|\phi\|_2$ and $\|\phi'\|_2$ in terms of the gamma function. Given a target ratio $\hat{r}$ and support $[a, b]$, a polynomial degree $p$ may then be selected by inverting this relation.
Step 3: Determining $\rho$.
Next we introduce the shift parameter $\rho$ defined by

$\rho := \phi_k\!\left( \frac{t_{c_k} + t_{c_{k+1}}}{2} \right),$

the height at which two neighboring test functions $\phi_k$ and $\phi_{k+1}$ (centered at $t_{c_k}$ and $t_{c_{k+1}} = t_{c_k} + s\, \Delta t$) intersect, which determines the shift $s$ from $p$ and $L$. In words, $\rho$ measures the amount of overlap between successive test functions. More overlap increases the correlation between rows in the residual and hence leads to larger off-diagonal elements in the covariance matrix $\Sigma$. Larger $\rho$ implies that neighboring functions overlap on more points, with $\rho \to 1$ indicating that $\phi_{k+1} \to \phi_k$; specifically, neighboring test functions overlap on $L - s$ timepoints. In Figures 3.2 and 3.3 we vary the parameters $\hat{r}$ and $\rho$ and observe that results agree with intuition: larger $\hat{r}$ (better agreement with $\mathcal{C}$) and larger $\rho$ (more test functions) lead to better recovery of $W^\star$. We summarize the uniform grid algorithm below.
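In code, this placement strategy amounts to a few lines. The following Python sketch reuses the test_fun helper sketched above and applies trapezoidal weights (a plain $\Delta t$ scaling, since each $\phi_k$ vanishes at the boundary of its support); all names are ours.

```python
import numpy as np

def uniform_test_matrices(t, L, p, s):
    """Integration matrices V, Vp for test functions of degree p placed
    uniformly along the grid t, each supported on L points and shifted by
    s points relative to its neighbor (illustrative sketch)."""
    K, dt = len(t), t[1] - t[0]
    starts = np.arange(0, K - L + 1, s)            # left endpoints of the supports
    V = np.zeros((len(starts), K))
    Vp = np.zeros((len(starts), K))
    for n, k in enumerate(starts):
        a, b = t[k], t[k + L - 1]
        phi, dphi = test_fun(t[k:k + L], a, b, p, p)
        V[n, k:k + L] = dt * phi                   # trapezoid weights; phi(a) = phi(b) = 0
        Vp[n, k:k + L] = dt * dphi
    return V, Vp
```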
FIG. 3.2.
Small-noise regime: dynamic recovery of the Duffing equation. Top: heat map of the average error $E_2$ (left) and sample standard deviation of $E_2$ (right) over 200 instantiations of noise with $\sigma_{NR} = 0.04$ (4% noise) vs. $\hat{r}$ and $\rho$. Bottom: $E_2$ vs. $\sigma_{NR}$ for fixed $\rho$ and various $\hat{r}$. For large enough $\hat{r}$ and $\rho$, the average error is roughly an order of magnitude below $\sigma_{NR}$.
FIG. 3.3.
Small-noise regime: dynamic recovery of the van der Pol oscillator. Top: heat map of the average error $E_2$ (left) and sample standard deviation of $E_2$ (right) over 200 instantiations of noise with $\sigma_{NR} = 0.04$ (4% noise) vs. $\hat{r}$ and $\rho$. Bottom: $E_2$ vs. $\sigma_{NR}$ for fixed $\rho$ and various $\hat{r}$. Similar to the Duffing equation, the average error falls to roughly an order of magnitude below $\sigma_{NR}$, although for van der Pol this regime is reached only for larger $\hat{r}$ and $\rho$.
WSINDy with uniform test function grid:
- Construct the matrix of trial gridfunctions $\Theta(Y)$.
- Construct integration matrices $V, V'$ from the test functions determined by $(p, L, \rho)$ as described above. Compute the Gram matrix $G := V\, \Theta(Y)$ and right-hand side $b := -V' Y$.
- Compute the approximate covariance $\mathcal{C} := V' (V')^T$ and its Cholesky factorization $\mathcal{C} = \mathrm{L} \mathrm{L}^T$.
- Solve the generalized least squares problem with $\ell^2$-regularization

  $\widehat{W} := \arg\min_{W} \left\| \mathrm{L}^{-1} (G W - b) \right\|_2^2 + \gamma^2 \| W \|_2^2,$

  using sequential thresholding with parameter $\lambda$ to enforce sparsity.
2.4.2. Strategy 2: Adaptive grid.
Motivated by the arguments above, we now introduce an algorithm for constructing a test function basis localized near points of large change in the dynamics. This occurs in three steps: (1) construct a weak approximation $\mathbf{v} \approx \dot{x}$ to the derivative of the dynamics, (2) sample centers $(t_{c_k})_{k \in [N]}$ from a cumulative distribution with density proportional to the total variation $|\mathbf{v}|$, and (3) construct test functions $\phi_k$ centered at $t_{c_k}$, using a width-at-half-max parameter to determine the parameters $(p_k, a_k, b_k)$ of each $\phi_k$. Each of these steps is numerically stable and carried out independently along each coordinate of the dynamics. A visual diagram is provided in Figure 2.1.
FIG. 2.1.
Adaptive grid construction used on data from the Duffing equation with 10% noise ($\sigma_{NR} = 0.1$). As desired, the centers $t_{c_k}$ are clustered near steep gradients in the dynamics despite large measurement noise. (The weak derivative approximation $\mathbf{v}$ is plotted in the upper left instead of the data in order to visualize both the gradients and the selected centers.)
Step 1: Weak derivative approximation.
Define $\mathbf{v} := D\, Y$, where the matrix $D$ enacts a linear convolution with the derivative of a chosen test function $\psi \in \mathcal{S}$ of degree $p_1$ and support size $L_1$, so that

$\mathbf{v}_k = -\Delta t \sum_{i} \psi'(t_k - t_i)\, y_i \approx \dot{x}(t_k).$

The parameters $p_1$ and $L_1$ are chosen by the user, with the smallest admissible choices corresponding, up to normalization, to taking a centered finite difference derivative with a 3-point stencil. A smaller degree $p_1$ results in more smoothing and minimizes the corruption from noise while still accurately locating steep gradients in the dynamics. For the examples below we arbitrarily² use a lower-degree test function with small support.
Step 2: Selecting the centers $t_{c_k}$.
Having computed $\mathbf{v}$, define $\mathcal{Y}$ to be the cumulative sum of $|\mathbf{v}|$ normalized so that $\max \mathcal{Y} = 1$. In this way $\mathcal{Y}$ is a valid cumulative distribution function with density proportional to the total variation of the estimated dynamics. We then find the centers by sampling from $\mathcal{Y}$. Let $U := (u_1, \dots, u_N)$ be equally spaced points in $[0, 1]$, with $N$ being the number of test functions; we then define $t_{c_k} := \mathcal{Y}^{-1}(u_k)$, or numerically,

$c_k := \min \{ i \in [K] : \mathcal{Y}_i \geq u_k \}.$

This stage requires the user to select the number of test functions $N$.
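In code, the inverse-CDF sampling above reduces to a cumulative sum and a sorted search; the following Python sketch (our own helper) returns center indices for one coordinate of the dynamics.

```python
import numpy as np

def adaptive_centers(v, N):
    """Sample N test-function center indices from the distribution whose
    density is proportional to |v|, the weak derivative estimate of one
    coordinate of the data (illustrative sketch of section 2.4.2, step 2)."""
    F = np.cumsum(np.abs(v))
    F /= F[-1]                                 # cumulative distribution on grid indices
    U = (np.arange(N) + 0.5) / N               # equally spaced quantiles in (0, 1)
    return np.searchsorted(F, U)               # center indices c_k with F[c_k] >= u_k
```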
Step 3: Construction of test functions $\phi_k$.
Having chosen the location of the centerpoint $t_{c_k}$ for each test function $\phi_k$, we are left to choose the degree $p_k$ of the polynomial and the support $[a_k, b_k]$. The degree is chosen according to the width-at-half-max parameter, which specifies the difference in timepoints between each center $t_{c_k}$ and the nearest point where $\phi_k$ attains half its maximum value, while the support is chosen to be consistent with the centerpoint. This gives a nonlinear system of two equations in two unknowns which can be easily solved (e.g., using fzero in MATLAB). This can be done for one reference test function, with the rest obtained by translation. The optimal value of the width-at-half-max depends on the timescales of the dynamics and can be chosen from the data using the Fourier transform as in the uniform grid case; however, for simplicity we fix its value in the large-noise examples below.
The adaptive grid WSINDy algorithm is summarized as follows:
- Construct the matrix of trial gridfunctions $\Theta(Y)$.
- Construct integration matrices $V, V'$ from the test functions determined by steps 1-3 as described above. Compute the Gram matrix $G := V\, \Theta(Y)$ and right-hand side $b := -V' Y$.
- Compute the approximate covariance $\mathcal{C} := V' (V')^T$ and its Cholesky factorization $\mathcal{C} = \mathrm{L} \mathrm{L}^T$.
- Solve the generalized least squares problem with $\ell^2$-regularization

  $\widehat{W} := \arg\min_{W} \left\| \mathrm{L}^{-1} (G W - b) \right\|_2^2 + \gamma^2 \| W \|_2^2,$

  using sequential thresholding with parameter $\lambda$ to enforce sparsity.
3. Numerical experiments.
We now show that WSINDy is capable of recovering the correct dynamics to high accuracy over a range of noise levels. We examine the systems in Table 1, which exhibit several canonical behaviors, namely growth and decay, nonlinear oscillations, and chaotic dynamics, in dimensions $d \in \{2, 3, 5\}$. To generate true trajectory data we use ode45 in MATLAB with absolute and relative tolerance $10^{-10}$ and collect $K$ samples uniformly³ in time with sampling rate $\Delta t$. The parameters $K$ and $\Delta t$ are chosen to provide a balance between illustrating ODE behaviors and avoiding an overabundance of observations. Gaussian white noise with mean zero and variance $\sigma^2$ is added to the exact trajectories $X \in \mathbb{R}^{K \times d}$, where $\sigma$ is computed by specifying a noise ratio $\sigma_{NR}$ and setting

(3.1) $\sigma := \sigma_{NR}\, \frac{\|X\|_F}{\sqrt{K d}},$

where the Frobenius norm of a matrix $A \in \mathbb{R}^{K \times d}$ is defined by

$\|A\|_F := \Big( \sum_{k=1}^{K} \sum_{j=1}^{d} A_{kj}^2 \Big)^{1/2}.$

The ratio of noise to signal is then approximately equal to the noise ratio: $\|\epsilon\|_F / \|X\|_F \approx \sigma_{NR}$.
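The noise model (3.1) is a one-liner in practice; the sketch below (our own helper) returns both the noisy data and the noise level $\sigma$ implied by a given $\sigma_{NR}$.

```python
import numpy as np

def add_noise(X, sigma_nr, seed=0):
    """Add i.i.d. Gaussian noise at a prescribed noise ratio, following (3.1):
    sigma is chosen so that ||eps||_F / ||X||_F is approximately sigma_nr."""
    K, d = X.shape
    sigma = sigma_nr * np.linalg.norm(X, 'fro') / np.sqrt(K * d)
    eps = sigma * np.random.default_rng(seed).standard_normal((K, d))
    return X + eps, sigma
```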
TABLE 1.
ODEs used in numerical experiments, with the number of timepoints $K$ and the sampling rate $\Delta t$. For Linear 5D, Duffing, van der Pol, and Lotka–Volterra we measure the accuracy in the recovered system as a system parameter varies (see Table 2).

| Name | $K$ | $\Delta t$ |
| --- | --- | --- |
| Linear 5D | 1401 | 0.025 |
| Duffing | 3001 | 0.01 |
| Van der Pol | 3001 | 0.01 |
| Lotka–Volterra | 1001 | 0.01 |
| Nonlinear pendulum | 501 | 0.1 |
| Lorenz | 10001 | 0.001 |
We measure the accuracy of the recovered dynamical system using the relative error in the recovered coefficients,

(3.2) $E_2 := \frac{\| \widehat{W} - W^\star \|_F}{\| W^\star \|_F},$

and the relative error between the noise-free data $X$ and the data-driven dynamics $\widehat{X}$ simulated along the same timepoints:

(3.3) $E_T := \frac{\| \widehat{X} - X \|_F}{\| X \|_F}.$
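Both metrics are direct to compute; a minimal sketch (our own naming) follows.

```python
import numpy as np

def coefficient_error(W_hat, W_true):
    """Relative coefficient error, as in (3.2)."""
    return np.linalg.norm(W_hat - W_true) / np.linalg.norm(W_true)

def trajectory_error(X_hat, X_true):
    """Relative trajectory error, as in (3.3), with X_hat the learned
    system simulated at the same timepoints as the noise-free data X_true."""
    return np.linalg.norm(X_hat - X_true) / np.linalg.norm(X_true)
```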
The collection of ODEs in Table 1 are all first-order autonomous systems; however, they exhibit a diverse range of dynamics. The Linear 5D system (for negative parameter values) and Duffing's equation are both examples of damped oscillators, showing that WSINDy is able to discern whether such motion is governed by linear or nonlinear coupling between variables. For positive parameter values, the Linear 5D system exhibits exponential growth. The van der Pol oscillator, Lotka–Volterra system, and nonlinear pendulum demonstrate that a stable limit cycle with abrupt changes may manifest from vastly different nonlinear mechanisms, which turn out to be identifiable using the weak form. Finally, the Lorenz system exhibits deterministic chaos, and hence the dynamics cover a wide range of Fourier modes, which easily become corrupted with noise.
3.1. Noise-free data.
The goal of the following noise-free experiments is to demonstrate convergence of the recovered weights $\widehat{W}$ to the true weights $W^\star$ to within the accuracy tolerance of the ODE solver (fixed at $10^{-10}$ throughout). In light of Lemma 2, this should occur as the decay rate of the test functions is increased, which for test functions in the class $\mathcal{S}$ (see (2.6)) is realized by increasing the polynomial degree $p$. Hence, over the range of parameter values in Table 2, for each system we test convergence as $p$ increases. We use the uniform grid approach with shift parameter chosen such that the number of test functions $N$ equals the number of trial functions $J$, resulting in square Gram matrices $G$. The support of the basis functions along the timegrid is set to $L$ points as reported in Table 2. The data-driven trial basis includes all monomials in the state variables up to degree 5 as well as the trigonometric terms $\sin(k x_j)$, $\cos(k x_j)$ for $k \in \{1, 2\}$ and $j \in [d]$. We set the regularization parameter to zero ($\gamma = 0$), with the exception of the nonlinear pendulum, where a small positive $\gamma$ is used, and fix the sparsity threshold $\lambda$ across systems. We note that a nonzero $\gamma$ is always necessary to discover the nonlinear pendulum from combined trigonometric and polynomial libraries since $\sin(x)$ is well-approximated by polynomial terms; however, the same is not true for low-order polynomial systems. In the cases considered here, sequential thresholding successfully removes trigonometric library terms for ODE systems with polynomial dynamics despite initially ill-conditioned Gram matrices resulting from combining polynomial and trigonometric terms.
TABLE 2.
Specifications of the parameters used in the simulations illustrated in Figure 3.1. The second column lists the values of the varied system parameter; $L$ is the test function support size in timepoints, $s$ is the shift between neighboring test functions, and $J$ is the number of trial functions.

| ODE | Parameter values | $L$ | $s$ | $J$ |
| --- | --- | --- | --- | --- |
| Linear 5D | (−0.3, −0.2, −0.1, 0.1) | 57 | 5 | 252 |
| Duffing | (0.01, 0.1, 1, 10) | 121 | 99 | 29 |
| Van der Pol | (0.01, 0.1, 1, 10) | 121 | 99 | 29 |
| Lotka–Volterra | (0.005, 0.01, 0.1, 1) | 41 | 33 | 29 |
| Pendulum | — | 21 | 16 | 29 |
| Lorenz | — | 401 | 141 | 68 |
Figure 3.1 shows that in the limit of large $p$, WSINDy recovers the correct weight matrix of each system in Table 1 to an accuracy of $10^{-10}$. For the Linear 5D system, we vary the growth/decay parameter, showing that the system is identifiable to high accuracy despite an excessively large trial library (252 terms). For Duffing's equation and the van der Pol oscillator, the same convergence trend is observed for parameter values spanning several orders of magnitude. Accuracy is slightly worse for the Lotka–Volterra equation at the smallest parameter value, which corresponds to highly infrequent predator-prey interactions and leads to solutions with large amplitudes and gradients. For the nonlinear pendulum, we test that WSINDy is able to identify the nonlinearity $\sin(x)$ for both large and small initial amplitudes, noting that large amplitudes produce strongly nonlinear oscillations, while small amplitudes produce small-angle oscillations where $\sin(x) \approx x$. In addition, for the pendulum we use fewer samples and a larger time step and hence observe a decreased convergence rate. For the Lorenz equations we vary the initial conditions, generating 40 random initial conditions from a region covering the strange attractor, and show convergence over all cases.
3.2. Small-noise regime.
We now turn to the case of low to moderate noise levels, examining small noise ratios $\sigma_{NR}$ for the van der Pol oscillator and Duffing's equation. We examine recovery as the parameters $\hat{r}$ and $\rho$ vary, where $\hat{r} = \|\phi'\|_2 / \|\phi\|_2$ and $\rho$ is the height of intersection of two neighboring test functions $\phi_k$ and $\phi_{k+1}$ (with $\rho \to 0$ indicating nearly disjoint supports and $\rho \to 1$ indicating $\phi_{k+1} \to \phi_k$). Using the analysis from section 2.3, increasing $\hat{r}$ affects the distribution of the residual by magnifying the portion that is linear in the noise. For fixed support size $L$, larger $\hat{r}$ corresponds to a higher polynomial degree $p$. A larger shift parameter $\rho$ corresponds to more test functions (higher $N$) but also to higher correlation between rows in the residual, as neighboring rows become dependent when the supports of $\phi_j$ and $\phi_k$ sufficiently overlap. We again use the uniform grid approach. For each system we generate 200 instantiations of noise and record the coefficient error $E_2$ over the range of $\hat{r}$ and $\rho$ values.
From Figures 3.2 and 3.3 we observe two properties. Firstly, the coefficient error monotonically decreases with increasing $\hat{r}$ and $\rho$; hence accurate recovery requires sufficient overlap between test functions (large enough shift parameter $\rho$) and sufficiently localized test functions that amplify the portion of the residual that is linear in the noise. Secondly, for large enough $\hat{r}$ and $\rho$, the error in the coefficients scales linearly with the noise, with $E_2$ roughly an order of magnitude below $\sigma_{NR}$, i.e., two to three significant digits in the recovered coefficients at the noise levels considered. In Appendix A we show that this second property does not hold for standard SINDy; in particular, the method of differentiation must change depending on the noise level in order to reach a desired accuracy.
3.3. Large-noise regime.
Figures 3.4 to 3.9 show that adaptive placement of test functions (Strategy 2) can be employed to discover dynamics in the large-noise regime with fewer test functions. We test that each system in Table 1 can be discovered under $\sigma_{NR} = 0.1$ (10% noise) from only 250 test functions distributed near steep gradients in the dynamics, which are located using the scheme in section 2.4.2. We set the width-at-half-max of the test functions to a fixed number of timepoints. To exemplify the separation of scales and the severity of the corruption from noise, the noisy data $Y$, true data $X$, and trajectories $\widehat{X}$ of the learned dynamical systems are shown in dynamo view and in phase space (for the planar systems). We extend the simulation time by 50% to show that the data-driven system captures the true limiting behavior. We set $\gamma = 0$ except in the Linear 5D and nonlinear pendulum examples, where a small positive $\gamma$ is used. For the trial basis we use all monomials up to degree 5 in the state variables, and for the pendulum we include the trigonometric terms $\sin(k x_j)$, $\cos(k x_j)$ for $k \in \{1, 2\}$ and $j \in [2]$.
FIG. 3.4.
Large-noise regime: Linear 5D system with damping, $\sigma_{NR} = 0.1$. All correct terms were identified, with small errors in both the recovered weights and the trajectory.
FIG. 3.9.
Large-noise regime: Lorenz system with $\sigma_{NR} = 0.1$. All correct terms were identified; the trajectory error is large, as expected due to the chaotic nature of the solution. Using data up until the divergence time (first 1500 timepoints), the trajectory error is 0.027.
In each case the correct terms are identified with coefficient error roughly an order of magnitude below the noise ratio, in agreement with the trend observed in the small-noise regime. For the Linear 5D, Duffing, and Lotka–Volterra systems (Figures 3.4, 3.5, and 3.7) the data-driven trajectory is indistinguishable from the true data to the eye, with small trajectory error. For the van der Pol oscillator and nonlinear pendulum (Figures 3.6 and 3.8), the learned trajectory follows a limit cycle with an attractor that is indistinguishable from the true data (see phase plane plots); however, an error in the period of oscillation of roughly 0.6% leads to a larger trajectory error. The data-driven trajectory for the Lorenz equation diverges from the true trajectory partway through the observation window (Figure 3.9), which is expected from chaotic dynamics, but still remains close to the Lorenz attractor.
FIG. 3.5.
Large-noise regime: Duffing equation, $\sigma_{NR} = 0.1$. All correct terms were identified, with small errors in both the recovered weights and the trajectory.
FIG. 3.7.
Large-noise regime: Lotka–Volterra system with $\sigma_{NR} = 0.1$. All correct nonzero terms were identified, with small errors in both the recovered weights and the trajectory.
FIG. 3.6.
Large-noise regime: van der Pol oscillator, $\sigma_{NR} = 0.1$. All correct terms were identified. The data-driven trajectory has a slightly shorter oscillation period of 10.14 time units compared to the true 10.2, resulting in an eventual offset from the true data and hence a larger trajectory error. Measured over the time interval [0, 8] the trajectory error is 0.065.
FIG. 3.8.
Large-noise regime: nonlinear pendulum, $\sigma_{NR} = 0.1$. All correct nonzero terms were identified, with small errors in the recovered weights and in the trajectory.
4. Concluding remarks.
We have developed and investigated a data-driven model selection algorithm based on the weak formulation of differential equations. The algorithm utilizes the reformulation of the model selection problem as a sparse regression problem for the weights of a candidate function basis introduced in [21] and generalized in [4] as the SINDy algorithm. Our WSINDy algorithm can be seen as a generalization of the sparse recovery scheme using integral terms found in [17], where dynamics were recovered from noisy data using the integral equation. We have shown that by extending the integral equation to the weak form and using test functions with certain localization and smoothness properties, one may discover the dynamics over a wide range of noise levels, with accuracy scaling favorably with noise: $E_2 = O(\sigma_{NR})$.
A natural line of inquiry is to consider how WSINDy compares with conventional SINDy. There are several notable advantages of WSINDy; in particular, by considering the weak form of the equations, WSINDy completely avoids the approximation of pointwise derivatives which significantly reduces the accuracy of conventional SINDy. When using SINDy, one must choose an appropriate numerical differentiation scheme depending on the noise level (e.g., finite differences are not robust to large noise but work well for small noise). For WSINDy, test functions from the space $\mathcal{S}$ (see section 2.4) together with the trapezoidal rule are effective in both low-noise and high-noise regimes. We demonstrate these observations in Appendix A by comparing WSINDy to SINDy under several numerical differentiation schemes. On the other hand, it may be the case that less data is required by standard SINDy. For the examples shown here, WSINDy works optimally for test functions supported on at least 15 timepoints, while many derivative approximations require fewer consecutive points.
WSINDy also utilizes the linearity of inner products with test functions to estimate the covariance structure of the residual, performing model selection in a generalized least squares framework. This is a much more appropriate setting given that the residuals are neither independent nor identically distributed; however, we note that our implementations in this article employ approximate covariance matrices and could benefit from further refinement and investigation. In Appendix B we show that using generalized least squares with the approximate covariance $\mathcal{C}$ improves some results over ordinary least squares, but not significantly. We leave incorporation of more detailed knowledge of the covariance structure to future work. In addition, generalized least squares could potentially improve traditional model selection algorithms that rely on pointwise derivative estimates by similarly exploiting linear operators. Ultimately, a thorough analysis of the advantages of generalized least squares for model selection deserves further study.
Lastly, the most obvious extensions lie in generalizing the WSINDy method to spatiotemporal datasets. WSINDy as presented here in the context of ODEs is an exciting proof of concept with natural extensions to spatiotemporal and multiresolution settings building upon the extensive results in numerical and functional analysis for weak and variational formulations of physical problems.
Acknowledgments.
Code used in this manuscript is publicly available on GitHub at https://github.com/MathBioCU/WSINDy. The authors would like to thank Prof. Vanja Dukic (University of Colorado at Boulder, Department of Applied Mathematics) and Kadierdan Kaheman (University of Washington at Seattle, Department of Applied Mathematics) for helpful discussions.
Funding:
This research was supported in part by the NSF/NIH Joint DMS/NIGMS Mathematical Biology Initiative grant R01GM126559 and in part by the NSF Computing and Communications Foundations Division grant CCF-1815983. This work also utilized resources from the University of Colorado Boulder Research Computing Group, which is supported by the National Science Foundation (awards ACI-1532235 and ACI-1532236), the University of Colorado Boulder, and Colorado State University.
Appendix A. Comparison between WSINDy and SINDy.
Here we compare WSINDy and SINDy using the van der Pol oscillator, Lotka–Volterra system, and Lorenz equation. For WSINDy we place test functions along the time axis according to the uniform grid strategy. For SINDy, we examine three differentiation methods: total variation regularized derivatives (SINDy-TV), centered second-order finite difference (SINDy-FD-2), and centered fourth-order finite difference (SINDy-FD-4). For SINDy-TV we use default settings and set the regularization parameter equal to the time step.
For each system and noise level we generate 200 independent instantiations of noise and record the average coefficient error (3.2) as well as the average true positivity ratio (TPR) [10]:
(A.1) $\mathrm{TPR} := \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP}},$

where TP is the number of correctly identified nonzero terms, FP is the number of falsely identified nonzero terms, and FN is the number of terms that are falsely identified as having a coefficient of zero. Since the feasible range of sparsity thresholds depends on the noise level, we adopt the selection methodology in [14] to choose an appropriate value for each instantiation of noise: $\lambda$ is chosen from the set $\{10^{-5}, 10^{-4.9}, \dots, 10^{0}\}$ (i.e., the 51 values from $10^{-5}$ to 1 equally spaced logarithmically) as the minimizer of a loss function $\mathcal{L}(\lambda)$ that balances the fit of the thresholded solution against the number of active terms relative to the size $J$ of the model library, where $\widehat{W}(\lambda)$ is the sequentially thresholded least squares solution for sparsity threshold $\lambda$ (for further details see [14]).
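The scan over candidate thresholds is cheap since each solve is a small linear system. The sketch below is a hedged, generic rendering of this selection loop: solve(lam) stands in for the (W)SINDy solve at threshold lam, and the illustrative loss combines a fit term with the fraction of active library terms; the precise loss used in [14] may differ.

```python
import numpy as np

def select_lambda(solve, J, lambdas=np.logspace(-5, 0, 51)):
    """Pick the threshold minimizing an illustrative fit-plus-sparsity loss.
    `solve(lam)` is assumed to return (W, residual_norm); J is the library size."""
    best_loss, best_lam = np.inf, lambdas[0]
    for lam in lambdas:
        W, res = solve(lam)
        loss = res + np.count_nonzero(W) / J    # fit term + fraction of active terms
        if loss < best_loss:
            best_loss, best_lam = loss, lam
    return best_lam
```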
From Figures A.1, A.2, and A.3 we observe that for small noise, the coefficient error for WSINDy follows the linear trend $E_2 = O(\sigma_{NR})$ observed in the text and that SINDy-FD-4 behaves similarly but with slightly worse accuracy. For larger noise, SINDy diverges in accuracy and in identification of the correct nonzero terms for each differentiation scheme, while WSINDy maintains a TPR of at least 0.8 up to 40% noise for each system. WSINDy thus provides an advantage across the entire noise spectrum examined, all while employing the same weak discretization scheme.
FIG. A.1.
Comparison between WSINDy and SINDy: van der Pol. Clockwise from top left: small-noise TPR (defined in (A.1)), large-noise TPR, large-noise $E_2$ (defined in (3.2)), small-noise $E_2$.
FIG. A.2.
Comparison between WSINDy and SINDy: Lotka–Volterra. Clockwise from top left: small-noise TPR (defined in (A.1)), large-noise TPR, large-noise $E_2$ (defined in (3.2)), small-noise $E_2$.
FIG. A.3.
Comparison between WSINDy and SINDy: Lorenz system. Clockwise from top left: small-noise TPR (defined in (A.1)), large-noise TPR, large-noise $E_2$ (defined in (3.2)), small-noise $E_2$.
Appendix B. Generalized least squares vs. ordinary least squares.
FIG. B.1.
Comparison between WSINDy with GLS and WSINDy with ordinary least squares using the Duffing equation. Results are averaged over 200 instantiations of noise.
Generalized least squares (GLS) aims to account for correlations between the residuals [8]. Given a linear model $b = G w^\star + e$, where $\mathbb{E}[e] = 0$ and $\mathrm{Cov}(e) = \Sigma$, the GLS estimator of the parameters upon observing $b$ is

$\widehat{w}_{\mathrm{GLS}} := (G^T \Sigma^{-1} G)^{-1} G^T \Sigma^{-1}\, b.$

This provides the best linear unbiased estimator of $w^\star$ in the sense that if $\widetilde{w}$ is any other linear unbiased estimator, then $\widehat{w}_{\mathrm{GLS}}$ has lower variance: $\mathrm{Cov}(\widetilde{w}) - \mathrm{Cov}(\widehat{w}_{\mathrm{GLS}})$ is positive semidefinite.
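Numerically, the GLS estimate is best computed by whitening with a Cholesky factor rather than forming $\Sigma^{-1}$ explicitly; a minimal sketch, assuming $\Sigma$ is symmetric positive definite:

```python
import numpy as np

def gls(G, b, Sigma):
    """GLS estimate (G^T Sigma^{-1} G)^{-1} G^T Sigma^{-1} b via whitening:
    with Sigma = L L^T, solve the ordinary least squares problem
    min || L^{-1} (G w - b) ||_2."""
    L = np.linalg.cholesky(Sigma)
    Gw = np.linalg.solve(L, G)            # L^{-1} G
    bw = np.linalg.solve(L, b)            # L^{-1} b
    return np.linalg.lstsq(Gw, bw, rcond=None)[0]
```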
Above we derived an approximate covariance matrix $\mathcal{C}$ to use in the GLS implementation of WSINDy, although the true covariance depends on the underlying unknown dynamical system and hence is unattainable. In addition, since in our case the error $e$ depends on the noise entering through $\Theta(Y)$, the assumption $\mathbb{E}[e] = 0$ is violated. Nevertheless, we find that the large-noise regime does benefit from using GLS over ordinary least squares. Figure B.1 shows that for the Duffing equation, GLS extends the range of noise levels over which the dynamics are reliably recovered, as well as increases the accuracy in the recovered coefficients. This suggests that further improvements can be made with a more refined covariance matrix.
Footnotes
1. Such that the corresponding mode of the data's Fourier transform is not negligible.
2. We find that a lower-degree test function with small support effectively locates steep gradients in noisy trajectories.
3. We leave a detailed study of nonuniform time sampling to future work.
REFERENCES
- [1] H. Akaike, A new look at the statistical model identification, IEEE Trans. Automat. Control, 19 (1974), pp. 716–723, https://doi.org/10.1109/TAC.1974.1100705.
- [2] H. Akaike, On entropy maximization principle, in Applications of Statistics, P. R. Krishnaiah, ed., North-Holland, Amsterdam, 1977, pp. 27–41.
- [3] D. M. Bortz and P. W. Nelson, Model selection and mixed-effects modeling of HIV infection dynamics, Bull. Math. Biol., 68 (2006), pp. 2005–2025, https://doi.org/10.1007/s11538-006-9084-x.
- [4] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proc. Natl. Acad. Sci. USA, 113 (2016), pp. 3932–3937.
- [5] A. Cortiella, K.-C. Park, and A. Doostan, Sparse identification of nonlinear dynamical systems via reweighted l1-regularized least squares, Comput. Methods Appl. Mech. Engrg. (2021), 113620.
- [6] G. Dahlquist and Å. Björck, Numerical Methods in Scientific Computing, Volume I, SIAM, Philadelphia, 2008.
- [7] S. H. Kang, W. Liao, and Y. Liu, IDENT: Identifying differential equations with numerical time evolution, J. Sci. Comput., 87 (2021), 1.
- [8] T. Kariya and H. Kurata, Generalized Least Squares, John Wiley & Sons, New York, 2004.
- [9] R. T. Keller and Q. Du, Discovery of dynamics using linear multistep methods, SIAM J. Numer. Anal., 59 (2021), pp. 429–455.
- [10] J. Lagergren, J. T. Nardini, G. M. Lavigne, E. M. Rutter, and K. B. Flores, Learning partial differential equations for biological transport models from noisy spatio-temporal data, Proc. A, 476 (2020), 20190800.
- [11] J. H. Lagergren, J. T. Nardini, G. Michael Lavigne, E. M. Rutter, and K. B. Flores, Learning partial differential equations for biological transport models from noisy spatiotemporal data, Proc. A, 476 (2020), 20190800, https://doi.org/10.1098/rspa.2019.0800.
- [12] G. Lillacci and M. Khammash, Parameter estimation and model selection in computational biology, PLoS Comput. Biol., 6 (2010), e1000696, https://doi.org/10.1371/journal.pcbi.1000696.
- [13] F. Lu, M. Maggioni, and S. Tang, Learning interaction kernels in heterogeneous systems of agents from multiple trajectories, J. Mach. Learn. Res., 22 (2021), pp. 1–67.
- [14] D. A. Messenger and D. M. Bortz, Weak SINDy for Partial Differential Equations, arXiv preprint, arXiv:2007.02848 [math.NA], 2020, https://arxiv.org/abs/2007.02848.
- [15] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Machine learning of linear differential equations using Gaussian processes, J. Comput. Phys., 348 (2017), pp. 683–693.
- [16] S. H. Rudy, J. N. Kutz, and S. L. Brunton, Deep learning of dynamics and signal-noise decomposition with time-stepping constraints, J. Comput. Phys., 396 (2019), pp. 483–506.
- [17] H. Schaeffer and S. G. McCalla, Sparse model selection via integral terms, Phys. Rev. E, 96 (2017), 023302.
- [18] H. Schaeffer, G. Tran, R. Ward, and L. Zhang, Extracting structured dynamical systems using sparse optimization with very few samples, Multiscale Model. Simul., 18 (2020), pp. 1435–1461.
- [19] T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. P. Stumpf, Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems, J. R. Soc. Interface, 6 (2009), pp. 187–202, https://doi.org/10.1098/rsif.2008.0172.
- [20] G. Tran and R. Ward, Exact recovery of chaotic systems from highly corrupted data, Multiscale Model. Simul., 15 (2017), pp. 1108–1129.
- [21] W.-X. Wang, R. Yang, Y.-C. Lai, V. Kovanis, and C. Grebogi, Predicting catastrophes in nonlinear dynamical systems by compressive sensing, Phys. Rev. Lett., 106 (2011), 154101.
- [22] D. J. Warne, R. E. Baker, and M. J. Simpson, Using experimental data and information criteria to guide model selection for reaction–diffusion problems in mathematical biology, Bull. Math. Biol., 81 (2019), pp. 1760–1804, https://doi.org/10.1007/s11538-019-00589-x.
- [23] H. Wu and L. Wu, Identification of significant host factors for HIV dynamics modelled by non-linear mixed-effects models, Stat. Med., 21 (2002), pp. 753–771, https://doi.org/10.1002/sim.1015.
- [24] K. Wu, T. Qin, and D. Xiu, Structure-preserving method for reconstructing unknown Hamiltonian systems from trajectory data, SIAM J. Sci. Comput., 42 (2020), pp. A3704–A3729.
- [25] K. Wu and D. Xiu, Numerical aspects for approximating governing equations using data, J. Comput. Phys., 384 (2019), pp. 200–221.
- [26] S. Zhang and G. Lin, Robust data-driven discovery of governing physical laws with error bars, Proc. A, 474 (2018), 20180305.
- [27] S. Zhang and G. Lin, Robust Subsampling-Based Sparse Bayesian Inference to Tackle Four Challenges (Large Noise, Outliers, Data Integration, and Extrapolation) in the Discovery of Physical Laws from Data, arXiv preprint, arXiv:1907.07788 [stat.ML], 2019, https://arxiv.org/abs/1907.07788.