Abstract
Ordinary differential equation (ODE) is widely used in modeling biological and physical processes in science. In this article, we propose a new reproducing kernel-based approach for estimation and inference of ODE given noisy observations. We do not assume the functional forms in ODE to be known, or restrict them to be linear or additive, and we allow pairwise interactions. We perform sparse estimation to select individual functionals, and construct confidence intervals for the estimated signal trajectories. We establish the estimation optimality and selection consistency of kernel ODE under both the low-dimensional and high-dimensional settings, where the number of unknown functionals can be smaller or larger than the sample size. Our proposal builds upon the smoothing spline analysis of variance (SS-ANOVA) framework, but tackles several important problems that are not yet fully addressed, and thus extends the scope of existing SS-ANOVA as well. We demonstrate the efficacy of our method through numerous ODE examples.
Keywords: Component selection and smoothing operator, High dimensionality, Ordinary differential equations, Smoothing spline analysis of variance, Reproducing kernel Hilbert space
1. Introduction
Ordinary differential equation (ODE) has been widely used to model dynamic systems and biological and physical processes in a variety of scientific applications. Examples include infectious disease (Liang and Wu 2008), genomics (Cao and Zhao 2008; Chou and Voit 2009; Ma et al. 2009; Lu et al. 2011; Henderson and Michailidis 2014; Wu et al. 2014), neuroscience (Izhikevich 2007; Zhang et al. 2015, 2017; Cao, Sandstede, and Luo 2019), among many others. A system of ODEs takes the form,
dx(t)/dt = F(x(t)), t ∈ [0, 1],   (1)
where x(t) = (x1(t), …, xp(t))⊤ denotes the system of p variables of interest, F = {F1, …, Fp} denotes the set of unknown functionals that characterize the regulatory relations among x(t), and t indexes time in an interval standardized to [0, 1]. Typically, the system (1) is observed on discrete time points {t1, …, tn} with measurement errors,
yi = x(ti) + εi, i = 1, …, n,   (2)
where yi = (yi1, …, yip)⊤ denotes the observed data at time ti, εi = (εi1, …, εip)⊤ denotes the vector of measurement errors, whose entries εij are usually assumed to follow independent normal distributions with mean 0 and variance σj², j = 1, …, p, and n denotes the number of time points. Besides, an initial condition x(0) is usually given for the system (1).
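To fix ideas, the following minimal sketch simulates data from a toy two-variable system of the form (1) and generates noisy observations as in (2). The particular functional F used here is purely illustrative and is not one of the ODE models considered in this article.

```python
# Toy data-generating sketch for models (1)-(2): integrate dx/dt = F(x) with the
# forward Euler method, then observe at n discrete time points with iid Gaussian
# measurement error. The system F below is illustrative only.
import numpy as np

def F(x):
    # toy regulatory system: x1 activates x2, x2 represses x1 (hypothetical)
    return np.array([1.0 - x[0] * x[1], x[0] - 0.5 * x[1]])

def euler_solve(x0, T=1.0, step=0.01):
    # forward Euler integration of dx/dt = F(x) on [0, T]
    ts = np.arange(0.0, T + step, step)
    xs = np.zeros((len(ts), len(x0)))
    xs[0] = x0
    for i in range(1, len(ts)):
        xs[i] = xs[i - 1] + step * F(xs[i - 1])
    return ts, xs

rng = np.random.default_rng(0)
n, sigma = 40, 0.1                                   # time points and noise level
ts, xs = euler_solve(x0=np.array([1.0, 0.5]))
obs_t = np.linspace(0.0, 1.0, n)                     # evenly spaced observation times
x_at_obs = np.vstack([np.interp(obs_t, ts, xs[:, j]) for j in range(2)]).T
y = x_at_obs + sigma * rng.normal(size=x_at_obs.shape)   # noisy observations as in (2)
```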
In a biological or physical system, a central question of interest is to uncover the structure of the system of ODEs in terms of which variables regulate which other variables, given the observed noisy time-course data {yi : i = 1, …, n}. Specifically, we say that xk regulates xj, if Fj is a functional of xk. In other words, xk controls the change of xj through the functional Fj on the derivative dxj/dt. Therefore, the functionals F = {F1, …, Fp} encode the regulatory relations of interest, and are often assumed to take the form,
Fj(x(t)) = θj0 + ∑k=1,…,p Fjk(xk(t)) + ∑k≠l Fjkl(xk(t), xl(t)), j = 1, …, p,   (3)
where θj0 denotes the intercept, and Fjk and Fjkl represent the main effect and two-way interaction, respectively. Higher order interactions are possible, but two-way interactions are the most common structure studied in ODE (Ma et al. 2009; Zhang et al. 2015).
There have been numerous pioneering works studying statistical modeling of ODEs. However, nearly all existing solutions constrain the forms of F. Broadly speaking, there are three categories of functional forms imposed. The first category considers linear functionals for F. For instance, Lu et al. (2011) studied a system of linear ODEs to model dynamic gene regulatory networks. Zhang et al. (2015) extended the linear ODE to include the interactions to model brain connectivity networks. The model of Zhang et al. (2015), other than differentiating between the variables that encode the neuronal activities and the ones that represent the stimulus signals, is in effect of the form,
dxj(t)/dt = θj0 + ∑k=1,…,p θjk xk(t) + ∑k≠l θjkl xk(t)xl(t), j = 1, …, p,   (4)
whereas the model of Lu et al. (2011) is similar to (4) but focuses on the main-effect terms only. In both cases, Fj takes a linear form. Dattner and Klaassen (2015) further extended the functional Fj in (4) to a generalized linear form, but without the interactions, that is,
dxj(t)/dt = θj0 + ∑k=1,…,p ψ(xk(t))⊤θjk, j = 1, …, p,   (5)
where θjk is the corresponding vector of unknown coefficients, and ψ is a finite set of known basis functions. The second category considers additive functionals for F. Particularly, Henderson and Michailidis (2014), Wu et al. (2014), and Chen, Shojaie, and Witten (2017) considered the generalized additive model for Fj,
dxj(t)/dt = θj0 + ∑k=1,…,p {ψ(xk(t))⊤θjk + δjk(xk(t))}, j = 1, …, p,   (6)
where θjk is the corresponding vector of unknown coefficients, ψ is a finite set of common basis functions, and δjk is the residual function. Different from Dattner and Klaassen (2015), the residual δjk is unknown. The functional Fj in (6) takes an additive form. Finally, there is a category of ODE solutions focusing on the scenario where the functional forms for F are known (González, Vujačić, and Wit 2014; Zhang, Cao, and Carroll 2015; Mikkelsen and Hansen 2017).
These works have laid a solid foundation for statistical modeling of ODE. However, in many scientific applications, the forms of the functionals F are unknown, and the linear or additive forms on F can be restrictive. Besides, it is highly nontrivial to couple the basis function-based solutions with the interactions. We give an example in Section 2.1, where a commonly used enzyme network ODE system involves both nonlinear functionals and two-way interactions. Such examples are often the rule rather than the exception, motivating us to consider a more flexible form of ODE. Moreover, the existing ODE methods have primarily focused on sparse estimation, but few tackled the problem of statistical inference, which is challenging due to the complicated correlation structure of ODE.
In this article, we propose a novel approach of kernel ordinary differential equation (KODE) for estimation and inference of the ODE system in (1) given noisy observations from (2). We adopt the general formulation of (3), but we do not assume the functional forms of F are known, or restrict them to be linear or additive, and we allow pairwise interactions. As such, we consider a more general ODE system that encompasses (4)–(6) as special cases. We further introduce sparsity regularization to achieve selection of individual functionals in (3), which yields a sparse recovery of the regulatory relations among F, and improves the model interpretability. Moreover, we derive the confidence interval for the estimated signal trajectory xj(t). We establish the estimation optimality and selection consistency of kernel ODE, under both low-dimensional and high-dimensional settings, where the number of unknown functionals p can be smaller or larger than the number of time points n, and we study the regime-switching phenomenon. These differences clearly separate our proposal from the existing ODE solutions in the literature.
Our proposal is built upon the smoothing spline analysis of variance (SS-ANOVA) framework that was first introduced by Wahba et al. (1995), then further developed in regression and functional data analysis settings by Huang (1998), Lin and Zhang (2006), and Zhu, Yao, and Zhang (2014). We adopt a component selection and smoothing operator (COSSO) type penalty similar to that of Lin and Zhang (2006) for regularization, and conceptually, our work extends COSSO to the ODE setting. However, our proposal considerably differs from COSSO and the existing SS-ANOVA methods, in multiple ways. First, unlike the standard SS-ANOVA models, the regressors of kernel ODE are not directly observed and need to be estimated from the data with error. This extra layer of randomness and estimation error introduces additional difficulty to SS-ANOVA. Second, we employ the integral of the estimated trajectories in the loss function to improve the estimation properties (Dattner and Klaassen 2015). The use of the integral and the inclusion of the interaction terms pose identifiability questions that we tackle explicitly. Third, we establish the estimation optimality and selection consistency in the RKHS framework, which is substantially different from Zhu, Yao, and Zhang (2014), and requires new technical tools. Moreover, our theoretical analysis extends that of Chen, Shojaie, and Witten (2017) from the finite bases setting of cubic splines to the infinite bases setting of RKHS. Finally, for statistical inference, we derive the confidence bands to provide uncertainty quantification for the penalized estimators of the signal trajectories in the ODE model. Our solution builds on the confidence intervals idea of Wahba (1983). But unlike the classical methods focusing on the fixed dimensionality p (Wahba 1983; Opsomer and Ruppert 1997), we allow a diverging p that can far exceed the sample size n. In summary, our proposal tackles several crucial problems that are not yet fully addressed in the existing SS-ANOVA framework, and it is far from a straightforward extension. We believe the proposed kernel ODE method not only makes a useful addition to the toolbox of ODE modeling, but also extends the scope of SS-ANOVA-based kernel learning.
The rest of the article is organized as follows. We propose kernel ODE in Section 2, and develop the estimation algorithm and inference procedure in Section 3. We derive the consistency and optimality of the proposed method in Section 4. We investigate the numerical performance in Section 5, and illustrate with a real data example in Section 6. We conclude the paper with a discussion in Section 7, and relegate all proofs and some additional numerical results to the Supplementary Appendix.
2. Kernel Ordinary Differential Equations
2.1. Motivating example
We consider an enzymatic regulatory network as an example to demonstrate that nonlinear functionals as well as interactions are common in the system of ODEs. Ma et al. (2009) found that all circuits of three-node enzyme network topologies that perform biochemical adaptation can be well approximated by two architectural classes: a negative feedback loop with a buffering node, and an incoherent feedforward loop with a proportioner node. The mechanism of the first class follows the Michaelis–Menten kinetic equations (Tzafriri 2003),
(7) |
where x1(t), x2(t), x3(t) are three interacting nodes, such that x1(t) receives the input, x2(t) plays the diverse regulatory role, and x3(t) transmits the output, x0 is the initial input stimulus, and c1, …, c6 denote the catalytic rate parameters, C1, …, C6 the Michaelis–Menten constants, and the remaining parameters the enzyme concentrations. See Figure 1(a) for a graphical illustration of this ODE system. In this model, the functionals F1, F2, F3 are all nonlinear, and both F2 and F3 involve two-way interactions. It is of great interest to estimate Fj’s given the observed data, to verify model (7), and to carry out statistical inference of the unknown parameters. This example, along with many other ODE systems with nonlinear functionals and interaction terms, motivates us to consider a general ODE system as in (3).
Figure 1.
(a) Diagram of the NFBLB regulatory network following (7). (b) Phase dynamics for the three nodes x1, x2, x3 over time [0, 1], with a random input x0 uniformly drawn from [0.5, 1.5]. (c) Illustration of the NFBLB network in terms of the interactions in KODE.
2.2. Two-Step Collocation Estimation
Before presenting our method, we first briefly review the two-step collocation estimation method, which is commonly used for parameter estimation in ODE, and is also useful in our setting. The method was first proposed by Varah (1982), then extended to various ODE models. In the first step, it fits a smoothing estimate,
where J1(·) is a smoothness penalty on the function space over which zj is minimized. In the second step, it solves an optimization problem to estimate the model parameters, for j = 1, …, p. Particularly, Varah (1982) considered the derivative of the smoothed estimate and the following minimization,
Wu et al. (2014) developed a similar two-step collocation method for their additive ODE model (6), and estimated the model parameters θj0 and θjk, for j, k = 1, …, p, with a standardized group ℓ1-penalty,
They further discussed adaptive group ℓ1 and regular ℓ1-penalties. Meanwhile, Henderson and Michailidis (2014) considered an extra ℓ2-penalty.
Alternatively, in the second step, Dattner and Klaassen (2015) proposed to focus on the integral of the trajectory, rather than its derivative, and they estimated the model parameters θj0 and θjk in (5), for j = 1, …, p, by,
They found that this modification from the derivative to the integral leads to a more robust estimate and also an easier derivation of the asymptotic properties. Chen, Shojaie, and Witten (2017) adopted this idea for their additive ODE model (6), and estimated the parameters θj0, θjk, and δjk, for j, k = 1, …, p, by
2.3. Kernel ODE
We build the proposed kernel ODE within the smoothing spline ANOVA framework; see Wahba et al. (1995) and Gu (2013) for more background on SS-ANOVA. Specifically, for each k = 1, …, p, consider a space of functions of xk(t) with zero marginal integral, defined on a compact domain. Let {1} denote the space of constant functions. We construct the tensor product space as
(8) |
We assume the functionals Fj, j = 1, …, p, in the ODE model (3) are located in the space constructed in (8). The identifiability of the terms in (3) is assured by side conditions specified through averaging operators, for k = 1, …, p. Let ‖·‖ denote the norm of this space, and consider the orthogonal projections of Fj onto the main-effect and two-way interaction subspaces. We consider a two-step collocation estimation method, by first obtaining a smoothing spline estimate x̂j of xj, where
x̂j = arg minzj (1/n) ∑i=1,…,n {yij − zj(ti)}² + λnj J1(zj),   (9)
then estimating θj0 and Fj by the following penalized optimization,
minθj0, Fj (1/n) ∑i=1,…,n {x̂j(ti) − x̂j(0) − θj0 ti − ∫0ti [∑k=1,…,p Fjk(x̂k(t)) + ∑k≠l Fjkl(x̂k(t), x̂l(t))] dt}² + τnj J2(Fj).   (10)
Our proposal deals with the integral in (10), rather than the derivative dxj(t)/dt, which is in a similar spirit as Dattner and Klaassen (2015). Besides, it involves two penalty functions, J1 in (9) and J2 in (10), with λnj and τnj as two tuning parameters. We next make some remarks about this proposal.
For the functionals, the formulation in (10) is highly flexible, nonlinear, and incorporates two-way interactions. Meanwhile, it naturally covers the linear ODE in (4) and (5), and the additive ODE in (6), as special cases. In particular, if each component space is the linear functional space, then any F of the form in (4) belongs to the space in (8). If each component space is spanned by some known generalized functions ψ, then any F in (5) belongs to it. If the space is the additive functional space, equipped with the ℓ2-norm, then for Fjk(xk(t)) = ψ(xk(t))⊤θjk, the penalty on the main effects becomes a group ℓ1-type penalty on the coefficients θjk, which is exactly the same as the ODE model of Chen, Shojaie, and Witten (2017).
For the penalties, the first penalty function J1 is the squared norm of the RKHS used in (9). It is used for estimating xj(t), and this RKHS does not have to be the same as the one used for estimating Fj. The second penalty function J2 is a sum of RKHS norms on the main effects and pairwise interactions. This penalty is similar to the COSSO penalty of Lin and Zhang (2006). But as we outline in Section 1, our extension is far from trivial. We also note that we do not impose a hierarchical structure for the main effects and interactions, in that if an interaction term is selected, the corresponding main effect term does not have to be selected (Wang et al. 2009). This is motivated by the observation that, for example, in the enzymatic regulatory network example in Section 2.1, the interaction terms x1(t)x3(t) and x2(t)x3(t) both appear in the ODE regulating x3(t), but the main effect terms x1(t) and x2(t) are not present.
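As an illustration of the first collocation step, the following sketch uses a kernel ridge regression over time as a stand-in for the smoothing spline estimate in (9); the Gaussian kernel and the fixed tuning parameter are illustrative choices rather than the ones used in this article. It then accumulates the running integral of the fitted trajectory, which is the type of quantity entering the loss in (10).

```python
# A minimal sketch of step one: smooth one noisy coordinate over time, then
# compute its cumulative integral on a dense grid. Kernel and lambda are assumed.
import numpy as np

def rbf(s, t, bandwidth=0.1):
    # Gaussian kernel on time points (illustrative choice, not the paper's kernel)
    return np.exp(-np.subtract.outer(s, t) ** 2 / (2.0 * bandwidth ** 2))

def smooth_trajectory(obs_t, y_j, lam=1e-3):
    # kernel ridge fit of one coordinate x_j(t) from its noisy observations
    K = rbf(obs_t, obs_t)
    alpha = np.linalg.solve(K + len(obs_t) * lam * np.eye(len(obs_t)), y_j)
    return lambda t_new: rbf(np.atleast_1d(t_new), obs_t) @ alpha

rng = np.random.default_rng(1)
obs_t = np.linspace(0.0, 1.0, 40)
y_j = np.sin(2.0 * np.pi * obs_t) + 0.1 * rng.normal(size=40)   # toy noisy data
grid = np.linspace(0.0, 1.0, 201)                               # dense time grid
xhat = smooth_trajectory(obs_t, y_j)(grid)
# cumulative trapezoidal integral along the dense grid; in (10) the integrand
# would be Fj evaluated along all fitted trajectories (here, for illustration,
# the single fitted trajectory itself is integrated)
cum_int = np.concatenate(([0.0], np.cumsum(np.diff(grid) * 0.5 * (xhat[1:] + xhat[:-1]))))
```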
Theorem 1.
Assume that the RKHS can be decomposed as in (8). Then there exists a minimizer of (10) for any tuning parameter τnj ≥ 0. Moreover, the minimizer lies in a finite-dimensional space.
Theorem 1 is a generalization of the well-known representer theorem (Wahba 1990). The difference is that, unlike the smoothing splines model studied in Wahba (1990), the minimization of (10) involves an integral in the loss function, and the penalty is not a norm but a convex pseudo-norm. A direct implication of Theorem 1 is that, although the minimization with respect to Fj in (10) is taken over an infinite-dimensional space, the solution to (10) can actually be found in a finite-dimensional space. We next develop an estimation algorithm to solve (10).
3. Estimation and Inference
3.1. Estimation Procedure
The estimation of the proposed kernel ODE system consists of two major steps. The first step is the smoothing spline estimation in (9), which is standard and the tuning of the smoothness parameter λnj is often done through generalized cross-validation (see, e.g., Gu 2013). The second step is to solve (10). Toward that end, we first propose an optimization problem that is equivalent to (10), but is computationally easier to tackle. We then develop an estimation algorithm to solve this new equivalent problem.
Specifically, we consider the following optimization problem, for j = 1, …, p,
(11) |
subject to θk ≥ 0, θkl ≥ 0, k, l = 1, …, p, k ≠ l, where θj collects the parameters θjk and θjkl to estimate, and ηnj, κnj ≥ 0 are the tuning parameters, j = 1, …, p. Comparing (11) to (10), we introduce the parameters θjk and θjkl to control the sparsity of the main effect and interaction terms in Fj. This is similar to Lin and Zhang (2006). The two optimization problems (10) and (11) are equivalent, in the following sense. We have,
where the equality holds for a particular choice of θjk. A similar result holds for θjkl. In other words, if Fj minimizes (10), then it also minimizes (11) with appropriately chosen θjk and θjkl, for any k, l = 1, …, p, k ≠ l. Meanwhile, if a solution minimizes (11), then the corresponding Fj minimizes (10).
Next, we devise an iterative alternating optimization approach to solve (11). That is, we first estimate θj0 given fixed Fj and θj, then estimate the functional Fj given fixed θj0 and θj, and finally estimate θj given fixed θj0 and Fj.
For given Fj and θj, we have that,
where Ti(t) = 1{0 ≤ t ≤ ti}.
For given θj0 and θj, the optimization problem (11) becomes,
(12) |
Let Kk denote the Mercer kernel generating the RKHS of functions of xk(t), k = 1, …, p. Then Kkl ≡ KkKl is the reproducing kernel of the RKHS for the corresponding two-way interaction component. Let Kθ denote the weighted combination of these kernels, with weights given by θjk and θjkl. By the representer theorem (Wahba 1990), the solution to (12) is of the form,
(13) |
for some and . Write and . Let B be an n × 1 vector whose ith entry is , i = 1, …, n. Let Σ be an n × n matrix whose (i, i′)th entry is , i, i′ = 1, …, n. Plugging (13) into (12), we obtain the following quadratic minimization problem in terms of {bj, cj},
which has a closed-form solution. Consider the QR decomposition B = [Q1 Q2][R 0]⊤, where Q1 ∈ ℝn×1, Q2 ∈ ℝn×(n−1), and [Q1 Q2] is orthogonal such that B⊤Q2 = 01×(n−1). Write Wj = Σ + nηnjIn, where In is the n × n identity matrix, and let zj denote the n × 1 response vector of the above quadratic problem. Then the minimizers are cj = Q2(Q2⊤WjQ2)−1Q2⊤zj and bj = R−1Q1⊤(zj − Wjcj).
Following the usual smoothing splines literature, we tune the parameter ηnj in (12) by minimizing the generalized cross-validation criterion (GCV, Wahba et al. 1995), GCV(ηnj) = n‖{In − Aj(ηnj)}zj‖² / [tr{In − Aj(ηnj)}]²,
where the smoothing matrix Aj(ηnj) is of the form,
Aj(ηnj) = In − nηnj Q2(Q2⊤WjQ2)−1Q2⊤.   (14)
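To make the inner least-squares step concrete, the sketch below implements the standard smoothing-spline algebra described around (12)–(14): the QR decomposition of B, the matrix Wj = Σ + nηnjIn, the resulting closed-form coefficients, and a GCV search over ηnj. The exact entries of B and Σ are elided above, so toy inputs are used; this is an illustration of the algebra under those assumptions, not the paper's implementation.

```python
# Closed-form solve of  ||z - B b - Sigma c||^2 / n + eta * c' Sigma c,
# followed by GCV over eta. B and Sigma are toy stand-ins here.
import numpy as np

def solve_bc(B, Sigma, z, eta):
    n = len(z)
    W = Sigma + n * eta * np.eye(n)                        # W_j = Sigma + n*eta*I
    Q, R = np.linalg.qr(B, mode="complete")                # B = [Q1 Q2][R; 0]
    q = B.shape[1]
    Q1, Q2, R1 = Q[:, :q], Q[:, q:], R[:q, :q]
    c = Q2 @ np.linalg.solve(Q2.T @ W @ Q2, Q2.T @ z)
    b = np.linalg.solve(R1, Q1.T @ (z - W @ c))
    A = np.eye(n) - n * eta * Q2 @ np.linalg.solve(Q2.T @ W @ Q2, Q2.T)  # smoother (14)
    return b, c, A

def gcv(A, z):
    # standard GCV score for a linear smoother A
    n = len(z)
    resid = (np.eye(n) - A) @ z
    return n * (resid @ resid) / np.trace(np.eye(n) - A) ** 2

# toy inputs: assume B_i = t_i and a generic positive semi-definite Sigma
rng = np.random.default_rng(0)
n = 40
t = np.linspace(0.05, 1.0, n)
M = rng.normal(size=(n, n))
Sigma = M @ M.T / n
B = t[:, None]
z = np.sin(2.0 * np.pi * t) + 0.1 * rng.normal(size=n)
etas = 10.0 ** np.arange(-6, 0)
scores = [gcv(solve_bc(B, Sigma, z, eta)[2], z) for eta in etas]
eta_best = etas[int(np.argmin(scores))]
```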
For given θj0 and Fj, θj is the solution to a usual ℓ1-penalized regression problem,
(15) |
subject to θk ≥ 0, θkl ≥ 0, k, l = 1, …, p, k ≠ l, where the “response” is zj, the “predictor” is G, whose first p columns are Σkcj with k = 1, …, p, and the last p(p − 1) columns are Σklcj with k, l = 1, …, p, k ≠ l, and Σk and Σkl are both n × n matrices whose (i, i′)th entries are defined as those of Σ but with the kernels Kk and Kkl in place of Kθ, respectively, where i, i′ = 1, …, n, j = 1, …, p. We employ the Lasso for (15) in our implementation, and tune the parameter κnj using 10-fold cross-validation, following the usual Lasso literature.
We repeat the above optimization steps iteratively until some stopping criterion is met; that is, when the estimates in two consecutive iterations are close enough, or when the number of iterations reaches some maximum number. In our simulations, we have found that the algorithm converges quickly, usually within 10 iterations. Another issue is the identifiability of and in (11) in the sense of unique solutions. We introduce the collinearity indices and to reflect the identifiability. Specifically, let denote a p2 × p2 matrix, whose entries are , , , , j, k, l = 1, …, p. Then and are defined by the diagonals of . When some and are much larger than one, then the identifiability issue occurs (Gu 2013). This is often due to insufficient amount of data relative to the complexity of the model we fit. In this case, we find that increasing ηnj and κnj in (11) often helps with the identifiability issue, as it helps reduce the model complexity.
We summarize the above estimation procedure in Algorithm 1.
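A structural sketch of the alternating scheme is given below. It reuses solve_bc from the previous sketch; the Gram matrices in Sigma_blocks (one per main-effect and per interaction term), the response vector zj, and the response adjustment in the θ-step are hypothetical stand-ins rather than the exact constructions of Section 3.1, and the nonnegative ℓ1-step uses an off-the-shelf Lasso solver.

```python
# Skeleton of the alternating updates in Algorithm 1 (illustrative only).
import numpy as np
from sklearn.linear_model import Lasso

def theta_step(resp, Sigma_blocks, cj, kappa):
    # "design" G: first p columns Sigma_k @ cj, last p(p-1) columns Sigma_kl @ cj
    G = np.column_stack([S @ cj for S in Sigma_blocks])
    fit = Lasso(alpha=kappa, positive=True, fit_intercept=False, max_iter=10000)
    fit.fit(G, resp)
    return fit.coef_                                   # theta_jk, theta_jkl >= 0

def kode_algorithm1(zj, B, Sigma_blocks, eta, kappa, max_iter=10, tol=1e-4):
    theta = np.ones(len(Sigma_blocks))                 # start with all terms active
    b = c = None
    for _ in range(max_iter):
        Sigma_theta = sum(th * S for th, S in zip(theta, Sigma_blocks))
        b, c, _A = solve_bc(B, Sigma_theta, zj, eta)   # quadratic step, as in (12)
        # response adjustment for the intercept-type term is an assumption
        theta_new = theta_step(zj - B @ b, Sigma_blocks, c, kappa)   # sparse step (15)
        if np.max(np.abs(theta_new - theta)) < tol:    # simple stopping rule
            theta = theta_new
            break
        theta = theta_new
    return b, c, theta
```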
3.2. Confidence Intervals
Next, we derive the confidence intervals for the estimated trajectory . This is related to post-selection inference, as the actual coverage probability of the confidence interval ignoring the preceding sparse estimation uncertainty can be dramatically smaller than the nominal level. Our result extends the recent work of Berk et al. (2013) and Bachoc, Leeb, and Pötscher (2019) from linear regression models to nonparametric ODE models, while our setting is more challenging, as it involves infinite-dimensional functional objects.
Let θ̂j denote the estimator of θj obtained from Algorithm 1, and denote Mj as the index set of the nonzero entries of the sparse estimator θ̂j. Note that Mj is allowed to be an empty set. We then refit by least squares with Mj as the support, minimizing the unpenalized objective function in (15), that is, (zj − Gθj)⊤ (zj − Gθj), over θj supported on Mj. Plugging this refitted estimate into (13) gives the corresponding estimate of the functional Fj as,
For a nominal level α ∈ (0, 1) and i = 1, …, n, define the cutoff value as the smallest constant satisfying that,
(16) |
where the weight vector is the ith row of the smoothing matrix defined in (14) under the selected model Mj, and σj² is the variance of the error term ϵij in (2). We then construct the confidence interval for the prediction of the true trajectory xj(t) following model selection as,
(17) |
for any i = 1, …, n and j = 1, …, p.
Next, we show that the confidence interval in (17) has the desired coverage probability. Later we develop a procedure to estimate the cutoff value in (16) given the data.
Theorem 2.
Let Mj be the index set of the nonzero entries of the sparse estimator θ̂j. Then the choice of the cutoff value in (16) does not depend on Fj, and the confidence interval in (17) satisfies the coverage property, for any i = 1, …, n and j = 1, …, p, in that,
A few remarks are in order. First, the coverage in Theorem 2 is guaranteed for all sparse estimation and selection procedures. As such, following the terminology of Berk et al. (2013), the interval in (17) is a universally valid post-selection confidence interval. Second, if we replace the cutoff value in (17) by zα/2, that is, the α/2 cutoff value of a standard normal distribution, then (17) reduces to the “naive” confidence interval. It is constructed as if Mj were fixed a priori, and it ignores any uncertainty or error of the sparse estimation step. This naive confidence interval, however, does not have the coverage property as in Theorem 2, and thus is not a truly valid confidence interval. Finally, data splitting is a commonly used alternative strategy for post-selection inference. But it is not directly applicable in our ODE setting, because it is difficult to split the time series data into independent parts.
Next, we devise a procedure to compute this cutoff value.
Proposition 1.
The cutoff value in (16) is the same as the solution t ≥ 0 satisfying,
where V is uniformly distributed on the unit sphere in ℝn, and U is a nonnegative random variable such that U² follows a chi-squared distribution χ2(n).
Following Proposition 1, we compute the cutoff value as follows. We first generate N iid copies of random vectors V1, …, VN uniformly distributed on the unit sphere in ℝn, and calculate the corresponding quantity in Proposition 1 for each Vν, ν = 1, …, N. Let DU denote the cumulative distribution function of U, which is determined by the cumulative distribution function of the χ2(n) distribution. We then obtain the cutoff value by searching for the constant c that solves the resulting Monte Carlo version of the equation in Proposition 1, using, for example, a bisection search.
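The following sketch illustrates this Monte Carlo recipe. Since the exact quantity maximized over the candidate models is elided above, the matrix of unit vectors called directions below is a hypothetical stand-in; only the "simulate V on the sphere, average P(U ≥ c/quantity) over the draws with U ~ chi(n), and solve for c" structure is shown.

```python
# Monte Carlo computation of the cutoff value (illustrative, per Proposition 1).
import numpy as np
from scipy.stats import chi
from scipy.optimize import brentq

def cutoff_value(directions, n, alpha=0.05, N=5000, seed=0):
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(N, n))
    V /= np.linalg.norm(V, axis=1, keepdims=True)       # uniform on the unit sphere
    m = np.max(np.abs(V @ directions.T), axis=1)        # maximal projection per draw
    def excess(c):
        # Monte Carlo estimate of P(U * m(V) >= c) - alpha, with U ~ chi(n)
        return np.mean(chi.sf(c / m, df=n)) - alpha
    return brentq(excess, 0.0, 10.0 * np.sqrt(n))       # bisection-type root search

# toy usage: 20 hypothetical candidate models, n = 40 time points
rng = np.random.default_rng(1)
D = rng.normal(size=(20, 40))
D /= np.linalg.norm(D, axis=1, keepdims=True)
K_alpha = cutoff_value(D, n=40)
# the interval (17) would then be of the form  prediction +/- K_alpha * sigma_hat * scale,
# with sigma_hat the estimated noise level and scale the relevant smoother norm
```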
Finally, we estimate the error variance σj² in (17) using the usual noise estimator in the context of RKHS (Wahba 1990).
We also remark that the inference on the prediction of the trajectory xj(t) following model selection, as described in Theorem 2, amounts to the inference on the estimation of the integral of Fj along the trajectory. This type of inference is of great importance in dynamic systems (Izhikevich 2007; Chou and Voit 2009; Ma et al. 2009). Our solution takes the selected model as an approximation to the truth, but does not require that the true data generation model be among the candidates of model selection. We note that it is also possible to do inference on the individual components of Fj directly; for example, one could construct the confidence interval for Fjk in (3). But this is achieved at the cost of imposing additional assumptions, including the requirement that the true data generation model is among the class of pairwise interaction models as in (3), and the orthogonality property as in Chernozhukov, Hansen, and Spindler (2015), or its equivalent characterization as in Zhang and Zhang (2014) and Javanmard and Montanari (2014). For nonparametric kernel estimators, the orthogonality property is shown to hold if the covariates xj's are assumed to be weakly dependent (Lu, Kolar, and Liu 2020). It is interesting to further investigate whether such a property holds in the context of the kernel ODE model under a similar condition of weakly dependent covariates. We leave this for future research.
4. Theoretical Properties
We next establish the estimation optimality and selection consistency of kernel ODE. These theoretical results hold for both the low-dimensional and high-dimensional settings, where the number of functionals p can be smaller or larger than the sample size n. We first introduce two assumptions.
Assumption 1.
The number of nonzero functional components is bounded; that is, the cardinality of {k : Fjk ≠ 0} ∪ {1 ≤ l ≠ k ≤ p : Fjkl ≠ 0} is bounded for any j = 1, …, p.
Assumption 2.
For any , there exists a random variable B, with , and
Assumption 1 concerns the complexity of the functionals. Similar assumptions have been adopted in the sparse additive model over RKHS when Fjkl = 0 (see, e.g., Koltchinskii and Yuan 2010; Raskutti, Wainwright, and Yu 2011). Assumption 2 is an inverse Poincaré inequality type condition, which places regularization on the fluctuation in Fj relative to the ℓ2-norm. The same assumption was also used in additive models in RKHS (Zhu, Yao, and Zhang 2014).
We begin with the error bound for the estimated trajectory uniformly for j = 1, …, p. This is a relatively standard result, which is needed for both analyzing the error of the functional estimators in kernel ODE, and establishing the selection consistency later.
Theorem 3 (Optimal estimation of the trajectory).
Suppose the RKHS used in (9) is embedded into a β1th order Sobolev space, β1 > 1/2. Then the smoothing spline estimate from (9) satisfies that, for any j = 1, …, p,
which achieves the minimax optimal rate.
Next, we derive the convergence rate for the estimated functional Fj. Because the trajectory is estimated, establishing the optimal rate of convergence requires extra theoretical attention, and is related to recent work on errors in variables for lasso-type regressions (Loh and Wainwright 2012; Zhu, Yao, and Zhang 2014). The proof involves several tools for the Rademacher processes (van der Vaart and Wellner 1996), and the concentration inequalities for empirical processes (Talagrand 1996; Yuan and Zhou 2016).
Theorem 4 (Optimal estimation of the functional).
Suppose that each Fj, j = 1, …, p, lies in the space satisfying (8), and that this RKHS is embedded into a β2th order Sobolev space, β2 > 1. Suppose Assumptions 1 and 2 hold. Then, as long as Fj is not a constant function, the KODE estimate from (10) satisfies that, for any j = 1, …, p,
which achieves the minimax optimal rate.
This theorem is one of our key results, and we make a few remarks. First, there are three error terms in Theorem 4, which are attributed to the estimation of the interactions, the Lasso estimation, and the measurement errors in variables, respectively. In particular, the third error term arises because x(t) is unobserved, and is instead measured at discrete time points and subject to measurement errors. Since this error term achieves the optimal rate, it fully characterizes the influence of the estimated trajectories on the resulting functional estimator. Moreover, β1 and β2 measure the orders of smoothness for estimating xj and Fj, respectively. They can be different, which makes it flexible when choosing kernels for the estimation procedure. For instance, if there is prior knowledge that x(t) is smooth, we may choose β1 > β2, and the resulting estimator then achieves the same convergence rate as if x(t) were directly observed and there were no integral involved in the loss function, for example, in the setting of Lin and Zhang (2006).
Second, there exists a regime-switching phenomenon, depending on the dimensionality p relative to the sample size n. On one hand, in the ultrahigh-dimensional setting, the leading term of the minimax optimal rate in Theorem 4 is Op(log p/n), which matches the minimax optimal rate for estimating a p-dimensional linear regression when the vector of regression coefficients has a bounded number of nonzero entries (Raskutti, Wainwright, and Yu 2011). Hence, we pay no extra price in terms of the rate of convergence for adopting a nonparametric modeling of Fj in (3), when compared with the more restrictive linear ODE model in (4) (Zhang et al. 2015). On the other hand, in the low-dimensional setting, the leading term of the optimal rate is the same as the optimal rate of estimating Fj as if we knew a priori that Fj comes from a two-dimensional tensor product functional space, rather than the p-variate functional space in (8); see also Lin (2000) for a similar observation.
Third, the optimal rate in Theorem 4 is immune to the “curse of dimensionality”, in the following sense. We introduce p(p − 1) pairwise interaction components in (8), and hence, for each xj(t), j = 1, …, p, a total of p2 functions need to be estimated. A direct application of an existing basis expansion approach, for instance, Brunton, Proctor, and Kutz (2016), leads to a rate that degrades quickly as p increases. By contrast, we proceed in a different way, where we simultaneously aim for the flexibility of a nonparametric ODE model, by letting the functional space obey a tensor product structure as in (8), while exploiting the interaction structure of the system. As a result, our optimal error bound does not depend on the dimensionality p.
Finally, the incorporation of the integral in the loss function in (10) makes the estimation error of the functional estimator depend on the convergence of the estimated trajectory. As a comparison, if we used the derivative instead of the integral, then the estimation error would depend on the convergence of the estimated derivative (Wu et al. 2014). However, it is known that derivative estimation in a reproducing kernel Hilbert space has a slower convergence rate than function estimation (Cox 1983); that is, the estimated derivative converges at a slower rate than the estimated trajectory itself. This demonstrates the advantage of working with the integral in our KODE formulation, and our result echoes the observation for the additive ODE model (Chen, Shojaie, and Witten 2017).
Next, we establish the selection consistency of KODE. Putting all the functionals {F1, …, Fp} together forms a network of regulatory relations among the p variables {x1(t), …, xp(t)}. Recall that we say xk is a regulator of xj if Fjk in (3) is nonzero, or if Fjkl is nonzero for some l ≠ k. Denote the set of the true regulators and the estimated regulators of xj(t) by
respectively, j = 1, …, p. We need some extra regularity conditions on the minimum regulatory effect and the design matrix, which are commonly adopted in the literature of Lasso regression (Zhao and Yu 2006; Ravikumar, Wainwright, and Lafferty 2010). In the interest of space, we defer those conditions to Section S.1.6.2 of the Appendix. The next theorem establishes that KODE is able to recover the true regulatory network asymptotically.
Theorem 5 (Recovery of the regulatory network).
Suppose that each Fj, j = 1, …, p, lies in the space satisfying (8), and that this RKHS is embedded into a β2th order Sobolev space with β2 > 1. Suppose Assumption 1, and Assumptions 3, 4, and 5 in the Appendix, hold. Then KODE correctly recovers the true regulatory network, in that, for all j = 1, …, p,
5. Simulation Studies
5.1. Setup
We study the empirical performance of the proposed KODE using two ODE examples, the enzyme regulatory network in Section 5.2, and the Lotka–Volterra equations in Section 5.3. For a given system of ODEs and the initial condition, we obtain the numerical solutions of the ODEs using the Euler method with step size 0.01. The data observations are drawn from the solutions at an evenly spaced time grid, with measurement errors. To implement KODE, we fit the smoothing spline to estimate xj(t) in (9) using a Matérn kernel, where the smoothing parameter λnj is chosen by GCV, and the bandwidth ν is chosen by 10-fold cross-validation. We compute the integral in (10) numerically with independent sets of 1000 Monte Carlo points. We compare KODE with the linear ODE with interactions in (4) (Zhang et al. 2015), and the additive ODE in (6) (Chen, Shojaie, and Witten 2017). Due to the lack of available code online, we implement the two competing methods in the framework of Algorithm 1, using a linear kernel for (4), and an additive Matérn kernel for (6). We evaluate the performance using the prediction error, plus the false discovery proportion and power for edge selection of the corresponding regulatory network. Furthermore, we compare with the family of ODE solutions assuming a known F (Zhang, Cao, and Carroll 2015; Mikkelsen and Hansen 2017) in Section S2.1 of the Appendix. We also carry out a sensitivity analysis in Section S2.2 of the Appendix to study the robustness of the choice of kernel function and initial parameters.
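As an illustration of the kernels involved, the sketch below evaluates a Matérn kernel on scalar inputs and forms the interaction kernel Kkl = KkKl as the pointwise product of two main-effect kernels, as described in Section 3.1. The specific Matérn order and bandwidth parameterization used in this article are elided above, so the Matérn 5/2 form and the bandwidth value here are assumptions.

```python
# Matern main-effect kernel (assumed 5/2 order) and product interaction kernel.
import numpy as np

def matern52(u, v, nu=0.5):
    # Matern 5/2 kernel between two sets of scalar inputs, bandwidth nu (assumed)
    d = np.abs(np.subtract.outer(u, v)) / nu
    return (1.0 + np.sqrt(5.0) * d + 5.0 * d ** 2 / 3.0) * np.exp(-np.sqrt(5.0) * d)

def main_effect_gram(xk_vals, nu=0.5):
    # Gram matrix of the main-effect kernel K_k over observed values of x_k(t)
    return matern52(xk_vals, xk_vals, nu)

def interaction_gram(xk_vals, xl_vals, nu=0.5):
    # Gram matrix of K_kl = K_k * K_l (elementwise product of the two Grams)
    return main_effect_gram(xk_vals, nu) * main_effect_gram(xl_vals, nu)

# toy usage on two trajectories evaluated at 40 time points
rng = np.random.default_rng(0)
x1_vals, x3_vals = rng.uniform(0.0, 1.0, 40), rng.uniform(0.0, 1.0, 40)
K13 = interaction_gram(x1_vals, x3_vals)      # kernel for the x1-x3 interaction
```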
5.2. Enzymatic Regulatory Network
The first example is a three-node enzyme regulatory network of a negative feedback loop with a buffering node (Ma et al. 2009, NFBLB). The ODE system is given in (7) in Section 2.1. Figure 1(a) shows the NFBLB network diagram consisting of the three interacting nodes: x1 receives the input, x3 transmits the output, and x2 plays a regulatory role, leading a negative regulatory link to x3. We note that, although biological circuits can have more than three nodes, many of those circuits can be reduced to a three-node framework, given that multiple molecules often function as a single virtual node. Moreover, despite the diversity of possible network topologies, NFBLB is one of the two core three-node topologies that can perform adaptation, in the sense that the system resets itself after responding to a stimulus; see Ma et al. (2009) for more discussion of NFBLB. For the ODE system in (7), we set the catalytic rate parameters of the enzymes as c1 = c2 = c3 = c5 = c6 = 10, c4 = 1, the Michaelis–Menten constants as C1 = ⋯ = C6 = 0.1, and fix the concentration parameters of the enzymes. These parameters achieve the adaptation shown in Figure 1(b). The output node x3 shows a strong initial response to the stimulus, and also exhibits strong adaptation, since its post-stimulus steady state x3 = 0.052 is close to the prestimulus state x3 = 0. The input x0 is drawn uniformly from [0.5, 1.5], with the initial value x(0) = 0, and the measurement errors are iid normal with mean zero and variance σj². The time points are evenly distributed, ti = (i − 1)/20, i = 1, …, n. In this example, p = 3, and for each function xj(t), j = 1, 2, 3, there are p2 = 9 functions to estimate, and in total there are 27 functions to estimate under the sample size n = 40.
Figure 2 reports the true and estimated trajectory of x3(t), with 95% upper and lower confidence bounds, of the three ODE methods, where we use the tensor product Matérn kernel for KODE in (10). The noise level is set as σj = 0.1, j = 1, 2, 3, and the results are averaged over 500 data replications. It is seen that the KODE estimate has a smaller variance than the additive and linear ODE estimates. Moreover, the confidence interval of KODE achieves the desired coverage for the true trajectory. In contrast, the confidence intervals of additive and linear ODE models mostly fail to include the truth. This is because there is a discrepancy between the additive and linear ODE model specifications and the true ODE model in (7), and this discrepancy accumulates as the course of the ODE evolves.
Figure 2.
The true (black solid line) and the estimated (blue dashed line) trajectory of x3(t), with the 95% upper and lower confidence bounds (red dotted lines). The results are averaged over 500 data replications. (a) KODE; (b) additive ODE; (c) linear ODE.
Figure 3 reports the prediction and selection performance of the three ODE methods, with varying noise level σj ∈ {0.01, 0.02, …, 0.1}, j = 1, 2, 3. The results are averaged over 500 data replications. The prediction error is defined as the square root of the sum of predictive mean squared errors for x1(t), x2(t), x3(t) at the unseen “future” time point t = 2. The false discovery proportion is defined as the proportion of falsely selected edges in the regulatory network out of the total number of edges. The empirical power is defined as the proportion of selected true edges in the network. It is seen that KODE clearly outperforms the two alternative solutions in both prediction and selection accuracy. Moreover, we report graphically the sparse recovery of this regulatory network in Section S2.3 of the Appendix.
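For concreteness, a minimal sketch of how the two selection metrics can be computed from an estimated and a true edge set is given below. The edge sets themselves are hypothetical, and the denominator used for the false discovery proportion (the number of selected edges) is one reading of the definition above.

```python
# Selection metrics: false discovery proportion and empirical power (illustrative).
def fdp_and_power(estimated_edges, true_edges):
    estimated_edges, true_edges = set(estimated_edges), set(true_edges)
    false_discoveries = estimated_edges - true_edges
    fdp = len(false_discoveries) / max(len(estimated_edges), 1)
    power = len(estimated_edges & true_edges) / max(len(true_edges), 1)
    return fdp, power

# toy usage with a 3-node network: (k, j) means x_k regulates x_j (hypothetical sets)
truth = {(1, 1), (2, 3), (3, 2), (1, 3), (3, 3)}
estimate = {(1, 1), (2, 3), (3, 3), (2, 2)}
print(fdp_and_power(estimate, truth))     # (0.25, 0.6)
```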
Figure 3.
The prediction and selection performance of three ODE methods with varying noise level. The results are averaged over 500 data replications. (a) Prediction error; (b) false discovery proportion; (c) empirical power.
5.3. Lotka–Volterra Equations
The second example is the high-dimensional Lotka–Volterra equations, which are pairs of first-order nonlinear differential equations describing the dynamics of biological systems in which predators and prey interact (Volterra 1928). We consider a 10-node system,
(18) |
where α1,j = 1.1 + 0.2(j − 1), α2,j = 0.4 + 0.2(j − 1), α3,j = 0.1 + 0.2(j − 1), and α4,j = 0.4 + 0.2(j − 1), j = 1, …, 5. The parameters α2,j and α3,j define the interaction between the two populations such that dx2j−1(t)/dt and dx2j(t)/dt are nonadditive functions of x2j−1 and x2j, where x2j−1 is the prey and x2j is the predator. Figure 4(a) shows an illustration of the interaction between x1(t) and x2(t). The input is drawn uniformly from [5, 15]^10, with the initial value x2j−1(0) = x2j(0), and the measurement errors are iid normal with mean zero and variance σj², where σj again reflects the noise level. The time points are evenly distributed in [0, 100] with n = 200. In this example, p = 10, and for each function xj(t), j = 1, …, 10, there are p2 = 100 functions to estimate, and in total there are 1000 functions to estimate under the sample size n = 200.
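The following sketch reproduces this simulation setup under the assumption that each pair follows the classic Lotka–Volterra form; since the display (18) is not reproduced above, the exact placement and signs of the αi,j parameters in the code are an assumption.

```python
# 10-node predator-prey simulation (assumed classic Lotka-Volterra form),
# Euler integration with step 0.01 and n = 200 noisy observations on [0, 100].
import numpy as np

def lv_rhs(x):
    dx = np.zeros(10)
    for j in range(1, 6):                          # five prey-predator pairs
        a1 = 1.1 + 0.2 * (j - 1)                   # prey growth
        a2 = 0.4 + 0.2 * (j - 1)                   # predation
        a3 = 0.1 + 0.2 * (j - 1)                   # predator growth from predation
        a4 = 0.4 + 0.2 * (j - 1)                   # predator death
        prey, pred = x[2 * j - 2], x[2 * j - 1]
        dx[2 * j - 2] = a1 * prey - a2 * prey * pred
        dx[2 * j - 1] = a3 * prey * pred - a4 * pred
    return dx

rng = np.random.default_rng(0)
step, T, n, sigma = 0.01, 100.0, 200, 1.0
grid = np.linspace(0.0, T, int(round(T / step)) + 1)
traj = np.zeros((len(grid), 10))
traj[0] = np.repeat(rng.uniform(5.0, 15.0, size=5), 2)   # x_{2j-1}(0) = x_{2j}(0)
for i in range(1, len(grid)):
    traj[i] = traj[i - 1] + step * lv_rhs(traj[i - 1])
obs_t = np.linspace(0.0, T, n)
idx = np.round(obs_t / step).astype(int)
y = traj[idx] + sigma * rng.normal(size=(n, 10))          # noisy observations
```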
Figure 4.
(a) The true trajectories of the prey x1(t) and the predator x2(t). (b) The estimated trajectory (blue dashed line), with the 95% upper and lower confidence bounds (red dotted lines), by KODE. (c) By additive ODE. The results are averaged over 500 data replications.
Figure 4(b) and (c) report the estimated trajectory of x1(t), with 95% upper and lower confidence bounds, of KODE and additive ODE, where the noise level is set as σj = 1, j = 1, …, 10. The confidence interval of KODE achieves a better empirical coverage for the true trajectory compared to that of additive ODE. For this example, we use the linear kernel for KODE in (10), since the functional forms in (18) are known to be linear. For this reason, we only compare KODE with the additive ODE method. Moreover, in the implementation, the estimates are thresholded to be nonnegative to ensure the physical constraint that the population sizes cannot be negative. Figure 5 reports the prediction and selection performance of the two ODE methods, with varying noise level σj ∈ {1, 2, …, 10}, j = 1, …, 10. All the results are averaged over 500 data replications. It is seen that the KODE estimate achieves a smaller prediction error, and a higher selection accuracy, since KODE allows flexible nonadditive structures, which results in significantly smaller bias and variance in functional estimation as compared to the additive modeling.
Figure 5.
The prediction and selection performance of two ODE methods with varying noise level. The results are averaged over 500 data replications. (a) Prediction error; (b) false discovery proportion; (c) empirical power.
6. Application to Gene Regulatory Network
We illustrate KODE with a gene regulatory network application. Schaffter, Marbach, and Floreano (2011) developed an open-source platform called GeneNetWeaver (GNW) that generates in silico benchmark gene expression data using dynamical models of gene regulations and nonlinear ODEs. The generated data have been used for evaluating the performance of network inference methods in the DREAM3 competition (Marbach et al. 2009), and were also analyzed by Henderson and Michailidis (2014) and Chen, Shojaie, and Witten (2017) in additive ODE modeling. GNW extracts two regulatory networks of E.coli (E.coli1, E.coli2), and three regulatory networks of yeast (yeast1, yeast2, yeast3), each of which is available in two sizes, with p = 10 nodes and p = 100 nodes. This yields in total 10 combinations of network structures. Figures 6(a) and (b) show an example of the 10-node and the 100-node E.coli1 networks, respectively. The systems of ODEs for each extracted network are based on a thermodynamic approach, which leads to a nonadditive and nonlinear ODE structure (Marbach et al. 2010). Besides, the network structures are sparse; for example, for the 10-node E.coli1 network, there are 11 edges out of 90 possible pairwise edges, and for the 100-node E.coli1 network, there are 125 edges out of 9900 possible pairwise edges. Moreover, for the 10-node network, GNW provides R = 4 perturbation experiments, and for the 100-node network, GNW provides R = 46 experiments. In each experiment, GNW generates the time-course data with different initial conditions of the ODE system to emulate the diversity of gene expression trajectories (Marbach et al. 2009). Figures 6(c)–(f) show the time-course data under the R = 4 experiments for the 10-node E.coli1 network. All the trajectories are measured at n = 21 evenly spaced time points in [0, 1]. We add independent measurement errors from a normal distribution with mean zero and standard deviation 0.025, which is the same as in the DREAM3 competition and the data analysis of Henderson and Michailidis (2014) and Chen, Shojaie, and Witten (2017).
Figure 6.
(a) The 10-node E.coli1 network. (b) The 100-node E.coli1 network. (c–f) Four perturbation experiments for the 10-node E.coli1 network, where each experiment corresponds to a different initial condition of the ODE system.
The kernel ODE model we have developed focuses on data from a single experiment, but it can be easily generalized to incorporate multiple experiments. Specifically, suppose we observe the data at n time points for the p variables under R experiments, with an unknown initial condition for each experiment r = 1, …, R. Then we modify the KODE method in (9) and (10), by seeking θj0 and Fj that minimize
(19) |
where the smoothing spline estimate for the rth experiment is obtained by,
Algorithm 1 can be modified accordingly to work with multiple experiments.
We again compare KODE with the additive ODE (Chen, Shojaie, and Witten 2017) and the linear ODE (Zhang et al. 2015), adopting the same implementation as in the simulations. Since we know the true edges of the generated gene regulatory networks, we use the area under the ROC curve (AUC) as the evaluation criterion. Table 1 reports the results averaged over 100 data realizations for all ten combinations of network structures. It is clearly seen that KODE outperforms both alternative methods in all cases. We further report graphically the sparse recovery of the 10-node E.coli1 network in Section S2.4 of the Appendix. This example shows that our proposed KODE is a competitive and useful tool for ODE modeling. In addition, it also shows that the proposed method can scale up and work with reasonably large networks. For instance, for the network with p = 100 nodes, there are p2 = 10,000 functions to estimate, and the sample size is n = 21 with R = 46 perturbations.
Table 1.
The area under the ROC curve, and the 95% confidence interval, for 10 combinations of network structures from GNW.
 | KODE (p = 10) | Additive ODE (p = 10) | Linear ODE (p = 10) | KODE (p = 100) | Additive ODE (p = 100) | Linear ODE (p = 100) |
---|---|---|---|---|---|---
E.coli1 | 0.582 | 0.541 | 0.460 | 0.711 | 0.677 | 0.640 |
(0.577, 0.587) | (0.535, 0.547) | (0.453, 0.467) | (0.708, 0.714) | (0.672, 0.682) | (0.637, 0.643) | |
E.coli2 | 0.662 | 0.632 | 0.562 | 0.685 | 0.659 | 0.533 |
(0.658, 0.666) | (0.625, 0.639) | (0.555, 0.569) | (0.681, 0.689) | (0.652, 0.666) | (0.527, 0.539) | |
Yeast1 | 0.603 | 0.541 | 0.436 | 0.619 | 0.589 | 0.569 |
(0.599, 0.607) | (0.536, 0.546) | (0.430, 0.442) | (0.616, 0.622) | (0.581, 0.597) | (0.562, 0.576) | |
Yeast2 | 0.599 | 0.562 | 0.536 | 0.606 | 0.588 | 0.541 |
(0.595, 0.603) | (0.555, 0.570) | (0.530, 0.542) | (0.603, 0.609) | (0.582, 0.594) | (0.536, 0.546) | |
Yeast3 | 0.612 | 0.569 | 0.487 | 0.621 | 0.613 | 0.609 |
(0.608, 0.616) | (0.564, 0.573) | (0.481, 0.493) | (0.617, 0.625) | (0.607, 0.619) | (0.605, 0.613) |
NOTE: The results are averaged over 100 data replications. Boldface indicates the method with larger AUC.
7. Conclusion and Discussion
In this article, we have developed a new reproducing kernel-based approach for a general family of ODE models that learn a dynamic system from noisy time-course data. We employ sparsity regularization to select individual functionals and uncover the underlying regulatory network, and we derive the post-selection confidence interval for the estimated signal trajectory. Our proposal is built upon but also extends the smoothing spline analysis of variance framework. We establish the theoretical properties of the method, while allowing the number of functionals to be either smaller or larger than the number of time points.
In numerous scientific applications, ODE is often employed to understand the regulatory effects and causal mechanisms within a dynamic system under interventions. Our proposed KODE method can be applied for this very purpose. There are different formulations of causal modeling for dynamic systems in the literature. We next consider and illustrate with two relatively common scenarios, one regarding dynamic causal modeling (DCM) under experimental stimuli (Friston, Harrison, and Penny 2003), and the other about kinetic modeling that is invariant across heterogeneous experiments (Pfister, Bauer, and Peters 2019).
The first scenario concerns DCM that infers the regulatory effects within a dynamic system under experimental stimuli (Friston, Harrison, and Penny 2003). Specifically, the DCM characterizes the variations of the state variables under the stimulus inputs via a set of ODEs, dx(t)/dt = F(x(t), u(t)), where the functional F is modeled by a bilinear form,
(20) |
In this model, reflects the strength of intrinsic connection from xk(t) to xj(t), reflects the effect of the lth input stimulus ul(t) on xj(t), and reflects the influence of ul(t) on the directional connection between xk(t) and xj(t), j, k = 1, …, p, l = 1, …, q. Note that and can be different, and thus the effect from xk(t) to xj(t) and that from xj(t) to xk(t) can be different. Similarly, and can be different. As such, model (20) encodes a directional network, and under certain conditions, a causal network. DCM has been widely used in biology and neuroscience (see, e.g., Friston, Harrison, and Penny 2003; Zhang et al. 2015, 2017; Cao, Sandstede, and Luo 2019).
We can combine the proposed KODE with the DCM model (20) straightforwardly. Such a combination allows us to estimate and infer the causal regulatory effects under experimental stimuli without specifying the forms of the functionals F. This is appealing, as there has been evidence suggesting that the regulatory effects can be nonlinear (Buxton et al. 2004; Friston et al. 2019). More specifically, we model F such that,
(21) |
Similar to the tensor product space defined in (8), consider the spaces of functions of xk(t) and ul(t) with zero marginal integral, respectively. We impose that the functionals Fj, j = 1, …, p, in (21) are located in the following space,
Parallel to (20), the functions , and in (21) capture the regulatory effects, and together, they encode a directional network. Moreover, Algorithm 1 of KODE is directly applicable to estimate , , and . As we have shown in our simulations, the DCM model (21) based on KODE should outperform (20) that is based on linear ODE.
The second scenario concerns learning the causal structure of kinetic systems by identifying a stable model from noisy observations generated from heterogeneous experiments. Pfister, Bauer, and Peters (2019) proposed the CausalKinetiX method, where the main idea is to optimize a noninvariance score to identify a causal ODE model that is invariant across heterogeneous experiments. Again, we can combine the proposed KODE with CausalKinetiX to learn the causal structure, while balancing between predictability and causality of the ODE model, and extending from a linear ODE model to a more flexible ODE model. We refer to this integrated method as KODE-CKX.
More specifically, consider R heterogeneous experiments, which stem from interventions such as manipulations of initial or environmental conditions. Following Algorithm 1 of KODE, we obtain for each experiment r ∈ {1, …, R}, and j = 1, …, p. Let denote the index set of the nonzero entries of the sparse estimator . We propose the following four-step procedure to score each model . In the first step, we obtain the smoothing spline estimate by (9) using the data from the rth experiment. In the second step, we apply Algorithm 1 to compute , by setting κnj = 0, restricting , and using the data from all other experiments except for the rth experiment. Here leaving out the rth experiment is to ensure a good generalization capability. In the third step, we estimate the signal trajectory under the derivative constraint,
(22) |
for i = 1, …, n, j = 1, …, p. In the last step, similar as CausalKinetiX, we obtain for each model the noninvariance score,
where , and are the residual sums of squares based on and , respectively. Due to the additional constraint in (22), is always larger than . Following Pfister, Bauer, and Peters (2019), the model with a small score is predictive and invariant. Such an invariant ODE model allows researchers to predict the behavior of the dynamic system under interventions, and it is closely related to the causal mechanism of the underlying dynamic system from the structural causal model and modularity perspective (Rubenstein et al. 2018; Pfister, Bauer, and Peters 2019). Compared to CausalKinetiX, our proposed KODE-CKX further extends the linear ODE to a general class of nonlinear and nonadditive ODEs.
To verify the empirical performance of KODE-CKX and to compare with CausalKinetiX, we consider the 100-node E.coli1 gene regulatory network example in Section 6. Figure 7 compares the models with the smallest noninvariance score from KODE-CKX and CausalKinetiX, respectively, based on 100 data replications. Comparing Figures 7(a) and (b), it is seen that in the majority of cases, KODE-CKX is able to recover the causal parents, and it outperforms CausalKinetiX by achieving a smaller number of false discoveries. Here, the measurement errors were drawn from a normal distribution with mean zero and standard deviation 0.025, the same setup as in Section 6. We next further evaluate the performance of the two methods when we vary the standard deviation of the measurement errors. Figure 7(c) reports the AUC averaged over 100 data replications. It is seen again that, for all noise levels, KODE-CKX performs better than CausalKinetiX.
Figure 7.
The selection performance of KODE-CKX and CausalKinetiX. The results are averaged over 100 data replications. (a) Number of false discoveries in the estimated model based on KODE-CKX; (b) number of false discoveries in the estimated model based on CausalKinetiX; (c) area under ROC under different noise levels.
In summary, our proposed KODE is readily applicable to numerous scenarios to facilitate the understanding of the regulatory causal mechanisms within a dynamic system from noisy data under interventions.
Supplementary Material
Acknowledgments
We thank the Editor, the Associate Editor, and two referees for their constructive comments and suggestions.
Funding
Xiaowu Dai’s research was partially supported by CDAR, Department of Economics, the University of California, Berkeley, and this work was done while Xiaowu Dai was visiting the Simons Institute for the Theory of Computing. Lexin Li’s research was partially supported by NSF grant DMS-1613137, and NIH grants R01AG061303, R01AG062542, and R01AG034570.
Footnotes
Supplementary Materials
The supplementary material contains proofs and additional numerical results for the main article.
References
- Bachoc F, Leeb H, and Pötscher BM (2019), “Valid Confidence Intervals for Post-Model-Selection Predictors,” The Annals of Statistics, 47, 1475–1504.
- Berk R, Brown L, Buja A, Zhang K, and Zhao L (2013), “Valid Post-Selection Inference,” The Annals of Statistics, 41, 802–837.
- Brunton SL, Proctor JL, and Kutz JN (2016), “Discovering Governing Equations From Data by Sparse Identification of Nonlinear Dynamical Systems,” Proceedings of the National Academy of Sciences of the United States of America, 113, 3932–3937.
- Buxton RB, Uludağ K, Dubowitz DJ, and Liu TT (2004), “Modeling the Hemodynamic Response to Brain Activation,” Neuroimage, 23, S220–S233.
- Cao J, and Zhao H (2008), “Estimating Dynamic Models for Gene Regulation Networks,” Bioinformatics, 24, 1619–1624.
- Cao X, Sandstede B, and Luo X (2019), “A Functional Data Method for Causal Dynamic Network Modeling of Task-Related fMRI,” Frontiers in Neuroscience, 13, 127.
- Chen S, Shojaie A, and Witten DM (2017), “Network Reconstruction From High-Dimensional Ordinary Differential Equations,” Journal of the American Statistical Association, 112, 1697–1707.
- Chernozhukov V, Hansen C, and Spindler M (2015), “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach,” Annual Review of Economics, 7, 649–688.
- Chou I-C, and Voit EO (2009), “Recent Developments in Parameter Estimation and Structure Identification of Biochemical and Genomic Systems,” Mathematical Biosciences, 219, 57–83.
- Cox DD (1983), “Asymptotics for M-Type Smoothing Splines,” The Annals of Statistics, 11, 530–551.
- Dattner I, and Klaassen CAJ (2015), “Optimal Rate of Direct Estimators in Systems of Ordinary Differential Equations Linear in Functions of the Parameters,” Electronic Journal of Statistics, 9, 1939–1973.
- Friston KJ, Harrison L, and Penny W (2003), “Dynamic Causal Modelling,” Neuroimage, 19, 1273–1302.
- Friston KJ, Preller KH, Mathys C, Cagnan H, Heinzle J, Razi A, and Zeidman P (2019), “Dynamic Causal Modelling Revisited,” Neuroimage, 199, 730–744.
- González J, Vujačić I, and Wit E (2014), “Reproducing Kernel Hilbert Space Based Estimation of Systems of Ordinary Differential Equations,” Pattern Recognition Letters, 45, 26–32.
- Gu C (2013), Smoothing Spline ANOVA Models, New York: Springer-Verlag.
- Henderson J, and Michailidis G (2014), “Network Reconstruction Using Nonparametric Additive ODE Models,” PLOS ONE, 9, 1–15.
- Huang JZ (1998), “Projection Estimation in Multiple Regression With Application to Functional ANOVA Models,” The Annals of Statistics, 26, 242–272.
- Izhikevich E (2007), Dynamical Systems in Neuroscience, Cambridge, MA: MIT Press.
- Javanmard A, and Montanari A (2014), “Confidence Intervals and Hypothesis Testing for High-Dimensional Regression,” Journal of Machine Learning Research, 15, 2869–2909.
- Koltchinskii V, and Yuan M (2010), “Sparsity in Multiple Kernel Learning,” The Annals of Statistics, 38, 3660–3695.
- Liang H, and Wu H (2008), “Parameter Estimation for Differential Equation Models Using a Framework of Measurement Error in Regression Models,” Journal of the American Statistical Association, 103, 1570–1583.
- Lin Y (2000), “Tensor Product Space ANOVA Models,” The Annals of Statistics, 28, 734–755.
- Lin Y, and Zhang HH (2006), “Component Selection and Smoothing in Multivariate Nonparametric Regression,” The Annals of Statistics, 34, 2272–2297.
- Loh P-L, and Wainwright MJ (2012), “High-Dimensional Regression With Noisy and Missing Data: Provable Guarantees With Nonconvexity,” The Annals of Statistics, 40, 1637–1664.
- Lu J, Kolar M, and Liu H (2020), “Kernel Meets Sieve: Post-Regularization Confidence Bands for Sparse Additive Model,” Journal of the American Statistical Association, 115, 2084–2099.
- Lu T, Liang H, Li H, and Wu H (2011), “High-Dimensional ODEs Coupled With Mixed-Effects Modeling Techniques for Dynamic Gene Regulatory Network Identification,” Journal of the American Statistical Association, 106, 1242–1258.
- Ma W, Trusina A, El-Samad H, Lim WA, and Tang C (2009), “Defining Network Topologies That Can Achieve Biochemical Adaptation,” Cell, 138, 760–773.
- Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, and Stolovitzky G (2010), “Revealing Strengths and Weaknesses of Methods for Gene Network Inference,” Proceedings of the National Academy of Sciences of the United States of America, 107, 6286–6291.
- Marbach D, Schaffter T, Mattiussi C, and Floreano D (2009), “Generating Realistic in Silico Gene Networks for Performance Assessment of Reverse Engineering Methods,” Journal of Computational Biology, 16, 229–239.
- Mikkelsen FV, and Hansen NR (2017), “Learning Large Scale Ordinary Differential Equation Systems,” arXiv no. 1710.09308.
- Opsomer JD, and Ruppert D (1997), “Fitting a Bivariate Additive Model by Local Polynomial Regression,” The Annals of Statistics, 25, 186–211.
- Pfister N, Bauer S, and Peters J (2019), “Learning Stable and Predictive Structures in Kinetic Systems,” Proceedings of the National Academy of Sciences of the United States of America, 116, 25405–25411.
- Raskutti G, Wainwright MJ, and Yu B (2011), “Minimax Rates of Estimation for High-Dimensional Linear Regression Over ℓq-Balls,” IEEE Transactions on Information Theory, 57, 6976–6994.
- Ravikumar P, Wainwright MJ, and Lafferty J (2010), “High-Dimensional Ising Model Selection Using ℓ1-Regularized Logistic Regression,” The Annals of Statistics, 38, 1287–1319.
- Rubenstein PK, Bongers S, Schölkopf B, and Mooij JM (2018), “From Deterministic ODEs to Dynamic Structural Causal Models,” in Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI).
- Schaffter T, Marbach D, and Floreano D (2011), “GeneNetWeaver: In Silico Benchmark Generation and Performance Profiling of Network Inference Methods,” Bioinformatics, 27, 2263–2270.
- Talagrand M (1996), “New Concentration Inequalities in Product Spaces,” Inventiones Mathematicae, 126, 505–563.
- Tzafriri AR (2003), “Michaelis–Menten Kinetics at High Enzyme Concentrations,” Bulletin of Mathematical Biology, 65, 1111–1129.
- van der Vaart AW, and Wellner JA (1996), Weak Convergence and Empirical Processes, New York: Springer-Verlag.
- Varah JM (1982), “A Spline Least Squares Method for Numerical Parameter Estimation in Differential Equations,” SIAM Journal on Scientific and Statistical Computing, 3, 28–46.
- Volterra V (1928), “Variations and Fluctuations of the Number of Individuals in Animal Species Living Together,” ICES Journal of Marine Science, 3, 3–51.
- Wahba G (1983), “Bayesian ‘Confidence Intervals’ for the Cross-Validated Smoothing Spline,” Journal of the Royal Statistical Society, Series B, 45, 133–150.
- Wahba G (1990), Spline Models for Observational Data, Philadelphia, PA: SIAM.
- Wahba G, Wang Y, Gu C, Klein R, and Klein B (1995), “Smoothing Spline ANOVA for Exponential Families, With Application to the Wisconsin Epidemiological Study of Diabetic Retinopathy,” The Annals of Statistics, 23, 1865–1895.
- Wang S, Nan B, Zhu N, and Zhu J (2009), “Hierarchically Penalized Cox Regression With Grouped Variables,” Biometrika, 96, 307–322.
- Wu H, Lu T, Xue H, and Liang H (2014), “Sparse Additive Ordinary Differential Equations for Dynamic Gene Regulatory Network Modeling,” Journal of the American Statistical Association, 109, 700–716.
- Yuan M, and Zhou D-X (2016), “Minimax Optimal Rates of Estimation in High Dimensional Additive Models,” The Annals of Statistics, 44, 2564–2593.
- Zhang C-H, and Zhang SS (2014), “Confidence Intervals for Low Dimensional Parameters in High Dimensional Linear Models,” Journal of the Royal Statistical Society, Series B, 76, 217–242.
- Zhang T, Wu J, Li F, Caffo B, and Boatman-Reich D (2015), “A Dynamic Directional Model for Effective Brain Connectivity Using Electrocorticographic (ECoG) Time Series,” Journal of the American Statistical Association, 110, 93–106.
- Zhang T, Yin Q, Caffo B, Sun Y, and Boatman-Reich D (2017), “Bayesian Inference of High-Dimensional, Cluster-Structured Ordinary Differential Equation Models With Applications to Brain Connectivity Studies,” The Annals of Applied Statistics, 11, 868–897.
- Zhang X, Cao J, and Carroll RJ (2015), “On the Selection of Ordinary Differential Equation Models With Application to Predator-Prey Dynamical Models,” Biometrics, 71, 131–138.
- Zhao P, and Yu B (2006), “On Model Selection Consistency of Lasso,” Journal of Machine Learning Research, 7, 2541–2563.
- Zhu H, Yao F, and Zhang HH (2014), “Structured Functional Additive Regression in Reproducing Kernel Hilbert Spaces,” Journal of the Royal Statistical Society, Series B, 76, 581–603.