Geometric Neural Ordinary Differential Equations: From Manifolds to Lie Groups

Yannik P Wotte; Federico Califano; Stefano Stramigioli

doi:10.3390/e27080878

2025 Aug 19;27(8):878. doi: 10.3390/e27080878

Editors: Fanzhang Li¹, Li Liu¹

PMCID: PMC12385718 PMID: 40870350

Abstract

Neural ordinary differential equations (neural ODEs) are a well-established tool for optimizing the parameters of dynamical systems, with applications in image classification, optimal control, and physics learning. Although dynamical systems of interest often evolve on Lie groups and more general differentiable manifolds, theoretical results for neural ODEs are frequently phrased on $R^{n}$ . We collect recent results for neural ODEs on manifolds and present a unifying derivation of various results that serves as a tutorial to extend existing methods to differentiable manifolds. We also extend the results to the recent class of neural ODEs on Lie groups, highlighting a non-trivial extension of manifold neural ODEs that exploits the Lie group structure.

Keywords: neural ordinary differential equations, differential geometry, Lie groups, machine learning, optimal control

1. Introduction

Ordinary differential equations (ODEs) are ubiquitous in the engineering sciences, from modeling and control of simple physical systems like pendulums and mass–spring–dampers, or more complicated robotic arms and drones, to the description of high-dimensional spatial discretizations of distributed systems, such as fluid flows, chemical reactions, or quantum oscillators. Neural ordinary differential equations (neural ODEs) [1,2] are ODEs parameterized by neural networks. Given a state x, and parameters $θ$ representing the weights and biases of a neural network, a neural ODE reads as follows:

\dot{x} = f_{θ} (x, t), x (0) = x_{0} .

(1)
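As a minimal sketch of Equation (1), the snippet below evaluates a small neural-network vector field and integrates it with a classical Runge-Kutta scheme. The network size, the weight values in `THETA`, and the names `f_theta` and `rk4_flow` are illustrative assumptions and are not taken from the cited works.

```python
import math

# Hypothetical parameters theta of a tiny network: (2 states + time) -> 3 hidden -> 2 outputs.
THETA = {
    "W1": [[0.5, -0.3, 0.1], [0.2, 0.8, -0.4], [-0.6, 0.1, 0.3]],
    "b1": [0.0, 0.1, -0.1],
    "W2": [[0.7, -0.2, 0.4], [-0.5, 0.3, 0.6]],
    "b2": [0.05, -0.05],
}

def f_theta(theta, x, t):
    """Vector field f_theta(x, t): one tanh hidden layer applied to the input (x, t)."""
    inp = list(x) + [t]
    hidden = [math.tanh(sum(w * v for w, v in zip(row, inp)) + b)
              for row, b in zip(theta["W1"], theta["b1"])]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(theta["W2"], theta["b2"])]

def rk4_flow(theta, x0, T, n_steps):
    """Approximate x(T) for x' = f_theta(x, t), x(0) = x0, with classical RK4."""
    h = T / n_steps
    x, t = list(x0), 0.0
    for _ in range(n_steps):
        k1 = f_theta(theta, x, t)
        k2 = f_theta(theta, [a + 0.5 * h * k for a, k in zip(x, k1)], t + 0.5 * h)
        k3 = f_theta(theta, [a + 0.5 * h * k for a, k in zip(x, k2)], t + 0.5 * h)
        k4 = f_theta(theta, [a + h * k for a, k in zip(x, k3)], t + h)
        x = [a + h / 6.0 * (p + 2.0 * q + 2.0 * r + s)
             for a, p, q, r, s in zip(x, k1, k2, k3, k4)]
        t += h
    return x

# State at time T = 1 for the initial condition x_0 = (1, 0).
x_T = rk4_flow(THETA, [1.0, 0.0], T=1.0, n_steps=100)
```

Training a neural ODE then amounts to adjusting `THETA` so that a cost evaluated on such trajectories decreases.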

Since their introduction in [1] as the continuum limit of recurrent neural networks, neural ODEs have found applications well beyond simple classification tasks: learning highly nonlinear dynamics of multi-physical systems from sparse data [3,4,5], optimal control of nonlinear systems [6], medical imaging [7], and real-time handling of irregular time series [8], to name but a few. Discontinuous state transitions and dynamics [9,10], time-dependent parameters [11], augmented neural ODEs [12], and physics-preserving formulations [13,14] present further extensions that increase the expressivity of neural ODEs.

However, these methods are typically phrased for states $x \in R^{n}$ . For many physical systems of interest, such as robot arms, humanoid robots, and drones, the state lives on differentiable manifolds and Lie groups [15,16]. More generally, the manifold hypothesis in machine learning raises the expectation that many high-dimensional data-sets evolve on intrinsically lower-dimensional, albeit more complicated, manifolds [17]. Neural ODEs on manifolds [18,19] presented significant steps to address this gap, with the first optimization methods for neural ODEs on manifolds. Yet, the general tools and approaches available on $R^{n}$ , such as running costs, augmented states, time-dependent parameters, control inputs, or discontinuous state transitions, are rarely addressed in a manifold context. Similar issues persist in a Lie group context, where neural ODEs on Lie groups [20,21] have been formalized.

Our goal is to extend further architectures and costs for neural ODEs from $R^{n}$ to arbitrary manifolds (cf. Table 1), and in particular Lie groups, and to equip the reader with the technical background for their own extensions. Here the main conceptual challenge lies in phrasing chart-independent optimization methods [18,19] in a manner that easily adapts to a variety of neural ODE architectures and cost functions [1,3,12]. To this end we present a systematic approach for deriving geometric versions of the adjoint sensitivity method [1,2], which is a memory-efficient and scalable tool for the optimization of neural ODEs (cf. Section 1.1). Such benefits extend to manifolds and Lie groups [18,19,20,21]. A second challenge, both conceptual and practical, lies in expressing various manifolds in terms of local charts and in expressing neural-net-parameterized functions, dynamics, and tensor fields in local charts. To this end we classify existing methods into extrinsic [19,20] and intrinsic [18,21,22] approaches, a distinction inspired by well-known differential geometric concepts. In our context the distinction suggests different parameterizations, affects numerical integration techniques, and affects scaling to high-dimensional dynamics. Specifically, our contributions are as follows:

Systematic derivation of adjoint methods for neural ODEs on manifolds and Lie groups, highlighting the differences and equivalence of various approaches—for an overview, see also Table 1;
Summarizing the state of the art of manifold and Lie group neural ODEs by formalizing the notion of extrinsic and intrinsic neural ODEs;
A tutorial on neural ODEs on manifolds and Lie groups, with a focus on the derivation of coordinate-agnostic adjoint methods for optimization of various neural ODE architectures. Readers will gain a conceptual understanding of the geometric nature of the underlying variables, see a coordinate-free derivation of adjoint methods, and learn to incorporate additional geometric and physical structures. On the practical side, this will aid in the derivation and implementation of adjoint methods with non-trivial terms for various architectures, also with regard to coordinate expressions and chart transformations.

Table 1.

Summary of neural ODEs on manifolds and Lie groups presented in this article.

Name of Neural ODE	Subtype	Trajectory Cost	Subsection	Originally Introduced in
Neural ODEs on manifolds (Section 3)	Extrinsic	Running and final cost	Section 3.1.1	Final cost [19], running cost [21]
	Intrinsic	Running and final cost, intermittent cost	Section 3.1.2 and Section 3.2.1	Final cost [18], running cost [21], intermittent cost (this work)
	Augmented, time-dependent parameters	Final cost	Section 3.2.2	Augmenting $M$ to $T M$ [23], Augmenting $M$ to $M \times N$ (this work)
Neural ODEs on Lie groups (Section 4)	Extrinsic	Final cost and intermittent cost	Section 4.1	In [20]
	Intrinsic, dynamics in local charts	Running and final cost	Section 4.2	In [21,24]
	Intrinsic, dynamics on Lie algebra	Running and final cost	Section 4.2	In [21]


The remainder of this article is organized as follows. A brief state of the art on neural ODEs concludes this introduction. Section 2 provides a background on differentiable manifolds, Lie groups, and the coordinate-free adjoint method. Section 3 describes neural ODEs on manifolds and derives parameter updates via the adjoint method for various common architectures and cost functions, including time-dependent parameters, augmented neural ODEs, running costs, and intermittent cost terms. Section 4 describes neural ODEs on matrix Lie groups, explaining the merits of treating Lie groups separately from general differentiable manifolds. Both Section 3 and Section 4 also classify methods into extrinsic and intrinsic approaches. We conclude with a discussion in Section 5, highlighting advantages, disadvantages, challenges, and promise of the presented material. Appendix A includes a background on Hamiltonian systems, which appear when transforming the adjoint method into a form that is unique to Lie groups.

1.1. Literature Review

For a general introduction to neural ODEs, see [25]. Neural ODEs on $R^{n}$ with fixed parameters were first introduced by [1], and parameter optimization via the adjoint method allowed for intermittent and final cost terms on each trajectory. The generalized adjoint method [2] also allows for running cost terms. Memory-efficient checkpointing is introduced in [26] to address stability issues of adjoint methods. Augmented neural ODEs [12] introduced augmented state spaces to allow neural ODEs to express arbitrary diffeomorphisms. Time-varying parameters were introduced by [11], with similar benefits to augmented neural ODEs. Neural ODEs with discrete transitions were formulated in [9,10], with [9] also learning event-triggered transitions common in engineering applications. Neural controlled differential equations (CDEs) were introduced in [27] for handling irregular time series, and parameter updates reapply the adjoint method [1]. Neural stochastic differential equations (SDEs) were introduced in [28], relying on a stochastic variant of the adjoint method for the parameter update. The previously mentioned literature phrases dynamics of neural ODEs on $R^{n}$ .

Recent trends in research on neural ODEs focus on structure preservation to improve performance and reduce training time by appropriately restricting the class of parameterized vector fields. This includes symmetry preservation by equivariant [23] and approximately equivariant neural ODEs [29], which tackle symmetric and approximately symmetric time series and dynamics, e.g., in N-body dynamics and molecular dynamics. It also includes physics preservation in a physics learning context, where Hamiltonian neural networks [30,31] and (generalized) Lagrangian neural networks [32,33,34] improve performance by guaranteeing energy conservation. In control and model order reduction, port-Hamiltonian neural ODEs [3,35] further allow for learning models that interact with external ports in a power-preserving manner. These methods also phrase dynamics on $R^{n}$ and frequently apply the adjoint method for parameter updates.

Neural ODEs on manifolds were first introduced by [19], including an adjoint method on manifolds for final cost terms and application to continuous normalizing flows on Riemannian manifolds, but embedding manifolds into $R^{n}$ . Neural ODEs on Riemannian manifolds are expressed in local exponential charts in [18], avoiding embedding into $R^{n}$ and considering final cost terms in the optimization. Charts for unknown, non-trivial latent manifolds together with dynamics in local charts are learned from high-dimensional data in [22], also including discretized solutions to partial differential equations. Parameterized equivariant neural ODEs on manifolds are constructed in [23], also commenting on state augmentation to express arbitrary (equivariant) flows on manifolds.

Neural ODEs on Lie groups were first introduced in [36] on the Lie group $S E (3)$ to learn the port-Hamiltonian dynamics of a drone from an experiment, expressing group elements on an embedding $R^{12}$ , and the approach was formalized to port-Hamiltonian systems on arbitrary matrix Lie groups in [20], embedding $m \times m$ matrices in $R^{m^{2}}$ .

Neural ODEs on $S E (3)$ were phrased in local exponential charts in [24] to optimize a controller for a rigid body using a chart-based adjoint method in local exponential charts. As an alternative, a Lie algebra-based adjoint method on general Lie groups was introduced in [21], foregoing Lie group-specific numerical issues of applying the adjoint method in local charts.

The choice of numerical solver in integrating neural ODEs and adjoint sensitivity equations is a nuanced area with much active research, especially for highly stiff [37], highly nonlinear [38,39], and structure-preserving neural ODEs [40]. We point towards the aforementioned sources for the interested reader. Results are expected to carry over into a manifold and Lie group context, where they hold in local charts. Also Lie group integrators [41,42] may be of interest for geometrically exact integration but are not well-investigated in a neural ODE context [20,21].

The optimization of neural ODEs via adjoint sensitivity methods is also referred to as “optimize-then-discretize” [25,43], since the formulation of the continuous adjoint system (called “optimize”) precedes their numerical solution (called “discretize”). This is opposed to “discretize-then-optimize” approaches, in which the neural ODE is first solved numerically (discretize) and gradients are then backpropagated through the numerical solver (optimize) [25,37,43]. Comparing the two, the constant memory efficiency of “optimize-then-discretize” approaches allows them to scale better to high-dimensional systems, giving them an edge for cases with more than 100 parameters and states [43]. Instead, “discretize-then-optimize” boasts higher accuracy and speed for low-dimensional systems, as well as highly stiff systems in which adjoint methods struggle with stability [37]. A popular discrete alternative to neural ODEs for physics-informed dynamics learning is given by variational integrator networks (VINs) [44,45], phrasing Lagrangian and Hamiltonian dynamics as discrete systems that conserve energy and the symplectic structure of the continuum dynamics [46,47]. Recent work [48] on Lie group forced VINs (LieFVINs) also allows inputs to the Lagrangian and Hamiltonian dynamics to be included in the variational formulation, allowing discrete optimal control. Both VINs and LieFVINs are applicable in a Lie group context, where they conserve geometry, symplecticity, and energy. The approach does not use adjoint methods for optimization and outperforms neural ODEs in the investigated conservative, low-dimensional dynamical systems [44,45,48]. Compared to continuous neural ODEs, both VINs and LieFVINs are discrete, which removes overhead from ODE solvers for lightweight applications, but their necessarily energy-based formulation presently restricts their use cases to conservative physical systems. 
We mention this promising area for completeness but narrow our attention to a geometric “optimize-then-discretize” approach via adjoint methods in the remainder of this article.

1.2. Notation

For a complete introduction to differential geometry see, e.g., [49], and for Lie group theory see [50].

Calligraphic letters $M, N, \dots$ denote smooth manifolds. For conceptual clarity, the reader may think of these manifolds as embedded in a high-dimensional $R^{N}$ , e.g., $M \subset R^{N}$ . The set $C^{\infty} (M, N)$ contains smooth functions between $M$ and $N$ , and we define $C^{\infty} (M) : = C^{\infty} (M, R)$ .

The tangent space at $x \in M$ is $T_{x} M$ and the cotangent space is $T_{x}^{*} M$ . The tangent bundle of $M$ is $T M$ , and the cotangent bundle of $M$ is $T^{*} M$ . Then $X (M)$ denotes the set of vector fields over $M$ , and $Ω^{k} (M)$ denotes the set of k forms, where $Ω^{1} (M)$ are co-vector fields and $Ω^{0} (M) = C^{\infty} (M)$ are smooth functions $V : M \to R$ . The exterior derivative is denoted as $d : Ω^{k} (M) \to Ω^{k + 1} (M)$ . For functions $V \in C^{\infty} (M \times N, R)$ , with $x \in M$ , $y \in N$ , we denote by $(d_{x} V) (y) \in T_{x}^{*} M$ the partial differential at $x \in M$ . Curves $x : R \to M$ are denoted as $x (t)$ , and their tangent vectors are denoted as $\dot{x} \in T_{x (t)} M$ .

A Lie group is denoted by G and its elements by $g, h$ . The group identity is $e \in G$ , and I denotes the identity matrix. The Lie algebra of G is $g$ , and its dual is $g^{*}$ . Letters $\tilde{A}, \tilde{B}$ denote vectors in the Lie algebra, while letters $A, B$ denote vectors in $R^{n}$ .

In coordinate expressions, lower indices are covariant and upper indices are contravariant components of tensors. For example for a $(0, 2)$ -tensor M the components $M_{i j}$ are covariant, and for non-degenerate M the components of its inverse $M^{- 1}$ are $M^{i j}$ , which are contravariant. We use the Einstein summation convention $a_{i} b^{i} : = \sum_{i} a_{i} b^{i}$ ; i.e., the product of variables with repeated lower and upper indices implies a sum.

Denoting W as a topological space, D the Borel $σ$ -algebra, and $P : D \to [0, 1]$ a probability measure, the tuple $(W, D, P)$ denotes a probability space. Given a vector space L and a random variable $C : W \to L$ , the expectation of C with respect to $P$ is $E_{w \sim P} (C) : = \int_{W} C (w) d P (w)$ .

2. Background

2.1. Smooth Manifolds

Given an n-dimensional manifold $M$ , with $U \subset M$ being an open set and $Q : U \to R^{n}$ a homeomorphism, we call $(U, Q)$ a chart and we denote the coordinates of $x \in U$ as

(q^{1}, \dots, q^{n}) : = Q (x), x \in U \subset M .

(2)

Smooth manifolds admit charts $(U_{1}, Q_{1})$ and $(U_{2}, Q_{2})$ with smooth transition maps $Q_{21} = Q_{2} \circ Q_{1}^{- 1}$ defined on the intersection $U_{1} ⋂ U_{2}$ , and a collection $A$ of charts $(U, Q)$ with smooth transition maps is called a smooth atlas. For examples of local charts for particular manifolds, see [49], Example 1.4, Example 1.5. A vector field $f \in X (M)$ assigns a vector $f (x) \in T_{x} M$ at any point $x \in M$ . This defines a dynamic system, also shown in a local chart $(U, Q)$ with components $f^{i} (q), {\dot{q}}^{i} \in R$ :

\begin{matrix} \dot{x} & = f (x) = f^{i} (q) \frac{\partial}{\partial Q^{i}}; & x (0) = x_{0}, \end{matrix}

(3)

\begin{matrix} {\dot{q}}^{i} & = f^{i} (q); & q (0) = Q (x_{0}) . \end{matrix}

(4)

Solutions of (3) are then found by numerical integration of (4), applying chart transitions (e.g., $q_{2} (t) = Q_{21} (q_{1} (t))$ from $q_{1} (t) = Q_{1} (x (t))$ to $q_{2} (t) = Q_{2} (x (t))$ ) during integration to avoid coordinate singularities (cf. Section 3.1.2). Denote the solution of (3) by the flow operator

Ψ_{f}^{t} : M \to M; Ψ_{f}^{t} (x_{0}) : = x (t) .

(5)
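The chart-based integration described above can be sketched on the unit circle, used here as an assumed toy manifold. The two stereographic charts $q_{N} = x / (1 - y)$ and $q_{S} = x / (1 + y)$ have the transition map $q_{S} = 1 / q_{N}$, and a rotation with constant angular velocity has the chart components $\pm ω (1 + q^{2}) / 2$. The switching rule $| q | > 1$ and all function names are illustrative choices.

```python
import math

OMEGA = 1.0  # assumed constant angular velocity of the rotation field

def f_chart(q, chart):
    """Chart component of the rotation field in chart 'N' (q = x/(1-y)) or 'S' (q = x/(1+y))."""
    sign = 1.0 if chart == "N" else -1.0
    return sign * OMEGA * (1.0 + q * q) / 2.0

def to_point(q, chart):
    """Inverse chart map: coordinate q -> point (x, y) on the unit circle."""
    x = 2.0 * q / (1.0 + q * q)
    y = (q * q - 1.0) / (1.0 + q * q)
    return (x, y) if chart == "N" else (x, -y)

def flow(angle0, T, n_steps):
    """Integrate the chart ODE (4) with RK4, switching charts whenever |q| > 1."""
    h = T / n_steps
    chart = "S"
    q = math.cos(angle0) / (1.0 + math.sin(angle0))  # coordinate of x_0 in chart S
    switches = 0
    for _ in range(n_steps):
        k1 = f_chart(q, chart)
        k2 = f_chart(q + 0.5 * h * k1, chart)
        k3 = f_chart(q + 0.5 * h * k2, chart)
        k4 = f_chart(q + h * k3, chart)
        q += h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        if abs(q) > 1.0:
            # Chart transition q_2 = Q_21(q_1) = 1 / q_1 keeps the coordinate bounded.
            q = 1.0 / q
            chart = "N" if chart == "S" else "S"
            switches += 1
    return to_point(q, chart), switches

# One full revolution should return (approximately) to the starting point.
point, n_switches = flow(0.3, 2.0 * math.pi, 2000)
```

Neither chart covers the whole circle, so a single-chart integration would eventually hit a coordinate singularity; the transitions avoid this while the represented point on the circle is unaffected.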

For a real-valued function $V \in C^{\infty} (M)$ , its differential is the covector field

d V \in Ω^{1} (M); d V = \frac{\partial V}{\partial q^{i}} d Q^{i} .

(6)

Additionally, given a smooth manifold $N$ and a smooth map $φ : N \to M$ , with $(U, Q)$ and $(\bar{U}, \bar{Q})$ appropriate charts of $M$ and $N$ , respectively, the pullback of $d V$ via $φ$ is

φ^{*} d V \in Ω^{1} (N); φ^{*} d V : = d (V \circ φ) = \frac{\partial φ^{j}}{\partial {\bar{q}}^{i}} \frac{\partial V}{\partial q^{j}} d {\bar{Q}}^{i} .

(7)

With a Riemannian metric M (i.e., a symmetric, positive-definite (0,2) tensor field) on $M$ , the gradient of V is a uniquely defined vector field $\nabla V \in X (M)$ given by

\nabla V : = M^{- 1} d V = M^{i j} \frac{\partial V}{\partial q^{j}} \frac{\partial}{\partial q^{i}} .

(8)
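A small worked example of Equation (8), assuming the plane in polar coordinates $(r, φ)$ with metric components $M = diag (1, r^{2})$ and the function $V = r \cos φ$ (that is, $V = x$ in Cartesian coordinates). The function names are illustrative.

```python
import math

def gradient_polar(r, phi):
    """Components of grad V in polar coordinates for V = r*cos(phi), following Eq. (8)."""
    dV = [math.cos(phi), -r * math.sin(phi)]      # components of the differential, Eq. (6)
    M_inv = [[1.0, 0.0], [0.0, 1.0 / (r * r)]]    # inverse of the metric M = diag(1, r^2)
    return [sum(M_inv[i][j] * dV[j] for j in range(2)) for i in range(2)]

def to_cartesian(r, phi, v):
    """Expand polar components v^i on the coordinate basis vectors d/dr and d/dphi."""
    e_r = (math.cos(phi), math.sin(phi))
    e_phi = (-r * math.sin(phi), r * math.cos(phi))
    return (v[0] * e_r[0] + v[1] * e_phi[0], v[0] * e_r[1] + v[1] * e_phi[1])

grad = gradient_polar(2.0, 0.7)
# Since V = x, the Cartesian components of the gradient should be close to (1, 0).
grad_xy = to_cartesian(2.0, 0.7, grad)
```

The components of the differential and of the gradient differ by the factor $1 / r^{2}$ in the $φ$ slot, which illustrates why the two objects coincide only for the Euclidean metric in Cartesian coordinates.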

When $M = R^{n}$ , we assume that M is the Euclidean metric and pick coordinates such that the components of the gradient and differential are the same. Finally, we define the Lie derivative of 1-forms, which differentiates $ω \in Ω^{1} (M)$ along a vector field $f \in X (M)$ and returns $L_{f} ω \in Ω^{1} (M)$ :

L_{f} ω : = \frac{d}{d t} {({Ψ_{f}^{t}}^{*} ω)}_{t = 0} = ω_{j} (\frac{\partial}{\partial q^{i}} f^{j}) d Q^{i} + (\frac{\partial}{\partial q^{j}} ω_{i}) f^{j} d Q^{i} .

(9)

2.2. Lie Groups

Lie groups are smooth manifolds with a compatible group structure. We consider real matrix Lie groups $G \subseteq G L (m, R)$ , i.e., subgroups of the general linear group

G L (m, R) : = \{g \in R^{m \times m} \mid \det (g) \neq 0\} .

(10)

For $g, h \in G$ the left and right translations by h are, respectively, the matrix multiplications

\begin{matrix} L_{h} (g) : = h g, \end{matrix}

(11)

\begin{matrix} R_{h} (g) : = g h . \end{matrix}

(12)

The Lie algebra of G is the vector space $g \subseteq gl (m, R)$ , with $gl (m, R) = R^{m \times m}$ being the Lie algebra of $G L (m, R)$ .

Define a basis $E : = \{{\tilde{E}}_{1}, \dots, {\tilde{E}}_{n}\}$ with ${\tilde{E}}_{i} \in g \subset R^{m \times m}$ . The (invertible linear) map $Λ : R^{n} \to g$ , which together with its inverse $Λ^{- 1}$ is often denoted by the operators “hat” $\land : R^{n} \to R^{m \times m}$ and “vee” $\lor : R^{m \times m} \to R^{n}$ , respectively (e.g., [51]), is defined as

Λ : R^{n} \to g; (A^{1}, \dots, A^{n}) \mapsto \sum_{i} A^{i} {\tilde{E}}_{i} .

(13)

The dual of $g$ is denoted $g^{*}$ , and given the map $Λ$ we call $Λ^{*} : g^{*} \to R^{n}$ its dual. For $\tilde{A}, \tilde{B} \in g$ the small adjoint ${ad}_{\tilde{A}} (\tilde{B})$ is a bilinear map, and the large adjoint ${Ad}_{g} (\tilde{A})$ is a linear map

\begin{matrix} ad & : g \times g \to g; {ad}_{\tilde{A}} (\tilde{B}) = \tilde{A} \tilde{B} - \tilde{B} \tilde{A}, \end{matrix}

(14)

\begin{matrix} Ad & : G \times g \to g; {Ad}_{g} (\tilde{A}) = g \tilde{A} g^{- 1} . \end{matrix}

(15)

In the remainder of this article, we exclusively use the adjoint representations ${ad}_{A} : R^{n} \to R^{n}$ , written without a tilde in the subscript A, and ${Ad}_{g} : R^{n} \to R^{n}$ , which are obtained as

\begin{matrix} {ad}_{A} & : = Λ^{- 1} ({ad}_{Λ (A)} Λ (\cdot)), \end{matrix}

(16)

\begin{matrix} {Ad}_{g} & : = Λ^{- 1} ({Ad}_{g} Λ (\cdot)) . \end{matrix}

(17)
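To make Equations (13), (16), and (17) concrete, the following sketch assumes $G = S O (3)$ with the standard basis of skew-symmetric matrices, for which $Λ$ is the usual "hat" map. In this case the matrix of ${ad}_{A}$ should coincide with $Λ (A)$ and the matrix of ${Ad}_{g}$ with $g$ itself. The helper names are illustrative.

```python
import math

def hat(a):
    """Lambda for so(3): R^3 -> skew-symmetric 3x3 matrices, Eq. (13)."""
    return [[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]]

def vee(M):
    """Inverse of hat on skew-symmetric matrices."""
    return [M[2][1], M[0][2], M[1][0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

BASIS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def ad_matrix(a):
    """Matrix of ad_A on R^3, Eq. (16): column k is vee of the commutator [hat(a), hat(e_k)]."""
    cols = []
    for e_k in BASIS:
        AB = matmul(hat(a), hat(e_k))
        BA = matmul(hat(e_k), hat(a))
        cols.append(vee([[AB[i][j] - BA[i][j] for j in range(3)] for i in range(3)]))
    return [[cols[k][i] for k in range(3)] for i in range(3)]

def Ad_matrix(g):
    """Matrix of Ad_g on R^3, Eq. (17): column k is vee(g hat(e_k) g^{-1})."""
    g_inv = transpose(g)  # the inverse equals the transpose for rotation matrices only
    cols = [vee(matmul(matmul(g, hat(e_k)), g_inv)) for e_k in BASIS]
    return [[cols[k][i] for k in range(3)] for i in range(3)]

a = [0.2, -0.5, 0.9]
c, s = math.cos(0.4), math.sin(0.4)
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]  # rotation about the z-axis
ad_a = ad_matrix(a)
Ad_R = Ad_matrix(R)
```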

The exponential map $exp : g \to G$ is a local diffeomorphism given by the matrix exponential ([50], Chapter 3.7)

exp (\tilde{A}) : = \sum_{n = 0}^{\infty} \frac{1}{n!} {\tilde{A}}^{n} .

(18)

Its inverse $log : U_{log} \to g$ is given by the matrix logarithm, and it is well-defined on a subset $U_{log} \subset G$ ([50], Chapter 2.3):

log (g) = \sum_{n = 1}^{\infty} {(- 1)}^{n + 1} \frac{{(g - I)}^{n}}{n} .

(19)

Often, the infinite sums in (18) and (19) can be further reduced to finite sums of m terms by use of the Cayley–Hamilton theorem [52]. A chart $(U_{h}, Q_{h})$ on G that assigns zero coordinates to $h \in G$ can be defined using (19) and (13):

\begin{matrix} U_{h} & = \{h g \mid g \in U_{log}\}, \end{matrix}

(20)

\begin{matrix} Q_{h} & : U_{h} \to R^{n}; g \mapsto Λ^{- 1} log (h^{- 1} g), \end{matrix}

(21)

\begin{matrix} Q_{h}^{- 1} & : R^{n} \to G; q \mapsto h exp (Λ (q)) . \end{matrix}

(22)

The chart $(U_{h}, Q_{h})$ is called an exponential chart, and a collection $A$ of exponential charts $(U_{h}, Q_{h})$ that cover the manifold is called an exponential atlas.
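The exponential chart of Equations (20)–(22) can be sketched for the assumed example $G = S O (3)$, where the Cayley–Hamilton reduction yields the well-known Rodrigues closed forms with three terms. The truncated series `exp_series` mirrors Equation (18) and serves only as a cross-check; the closed-form logarithm below is valid for rotation angles away from $π$. All function names are illustrative.

```python
import math

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def hat(a):
    return [[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def exp_series(A, terms=30):
    """Truncated matrix exponential series, Eq. (18)."""
    result = [row[:] for row in I3]
    term = [row[:] for row in I3]
    for n in range(1, terms):
        term = [[v / n for v in row] for row in matmul(term, A)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

def exp_so3(q):
    """Closed form exp(hat(q)) = I + a*hat(q) + b*hat(q)^2 (Rodrigues formula)."""
    th = math.sqrt(sum(v * v for v in q))
    if th < 1e-8:
        a, b = 1.0, 0.5
    else:
        a, b = math.sin(th) / th, (1.0 - math.cos(th)) / (th * th)
    A = hat(q)
    A2 = matmul(A, A)
    return [[I3[i][j] + a * A[i][j] + b * A2[i][j] for j in range(3)] for i in range(3)]

def log_so3(R):
    """Coordinates vee(log(R)) of a rotation matrix with angle below pi."""
    c = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    th = math.acos(c)
    s = 0.5 if th < 1e-8 else th / (2.0 * math.sin(th))
    return [s * (R[2][1] - R[1][2]), s * (R[0][2] - R[2][0]), s * (R[1][0] - R[0][1])]

def chart(h, g):
    """Q_h(g) = Lambda^{-1} log(h^{-1} g), Eq. (21); h^{-1} = h^T for rotations."""
    return log_so3(matmul(transpose(h), g))

def chart_inv(h, q):
    """Q_h^{-1}(q) = h exp(Lambda(q)), Eq. (22)."""
    return matmul(h, exp_so3(q))

h = exp_so3([0.0, 0.0, 1.0])
q = [0.3, -0.2, 0.5]
q_back = chart(h, chart_inv(h, q))  # round trip through the chart centered at h
```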

The differential of a function $V \in C^{\infty} (G, R)$ is the co-vector field $d V \in Ω^{1} (G)$ (see also Equation (6)). For any given $g \in G$ we further transform the co-vector $d V (g) \in T_{g}^{*} G$ to a left-trivialized differential, which collects the components of the gradient expressed in $g^{*}$ :

\begin{matrix} d_{g}^{L} V : = Λ^{*} L_{g}^{*} d V (g) = \frac{\partial}{\partial q} V {(g (I + Λ (q)))}_{| q = 0} \in R^{n} . \end{matrix}

(23)

For a derivation of this coordinate expression, see ([21], Section 3).
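As a hedged numerical illustration of Equation (23), the sketch below assumes $G = S O (3)$ and the function $V (g) = \frac{1}{2} \| g - I \|_{F}^{2}$, extended to all $3 \times 3$ matrices so that $V (g (I + Λ (q)))$ can be evaluated directly. For this choice a short calculation gives the components $- tr (g {\tilde{E}}_{i})$, which the central differences should reproduce. The function names are illustrative.

```python
import math

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def hat(a):
    return [[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def rot_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(t):
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def V(g):
    """Assumed example cost: half the squared Frobenius distance to the identity."""
    return 0.5 * sum((g[i][j] - I3[i][j]) ** 2 for i in range(3) for j in range(3))

def left_trivialized_differential(V, g, eps=1e-6):
    """Central differences of q -> V(g (I + hat(q))) at q = 0, following Eq. (23)."""
    grad = []
    for i in range(3):
        vals = []
        for sign in (1.0, -1.0):
            L = hat([sign * eps if k == i else 0.0 for k in range(3)])
            pert = [[I3[r][c] + L[r][c] for c in range(3)] for r in range(3)]
            vals.append(V(matmul(g, pert)))
        grad.append((vals[0] - vals[1]) / (2.0 * eps))
    return grad

def analytic_differential(g):
    """Closed form for this particular V: component i equals -trace(g hat(e_i))."""
    return [-sum(matmul(g, hat(e))[j][j] for j in range(3)) for e in I3]

g = matmul(rot_z(0.4), rot_x(0.7))
dV_numeric = left_trivialized_differential(V, g)
dV_exact = analytic_differential(g)
```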

2.3. Gradient over a Flow

We are interested in computing the gradient of functions with respect to the initial state of a flow. The adjoint sensitivity equations are a set of differential equations that achieve this. In the following, we show a derivation of the adjoint sensitivity on manifolds ([21], App. A2). Given a function $C : M \to R$ , a vector field $f \in X (M)$ , the associated flow $Ψ_{f}^{t} : M \to M$ , and a final time $T \in R$ , the goal of the adjoint sensitivity method on manifolds is to compute the gradient

d (C \circ Ψ_{f}^{T}) (x_{0}) .

In the adjoint method we define a co-state $λ (t) = d (C \circ Ψ_{f}^{T - t}) (x (t)) \in T_{x (t)}^{*} M$ , which represents the differential of $C (x (T))$ with respect to $x (t)$ . The adjoint sensitivity method describes its dynamics, which are integrated backwards in time from the known final condition $λ (T) = d C (x (T))$ ; see also Figure 1. The adjoint sensitivity method is stated in Theorem 1.

Figure 1. (a) The problem of computing the gradient over a flow, highlighting the cotangent spaces $d C (x (T)) \in T_{x (T)}^{*} M$ and $d (C \circ Ψ_{f}^{T}) (x_{0}) = {(Ψ_{f}^{T})}^{*} d C (x (T)) \in T_{x_{0}}^{*} M$ . (b) In the adjoint method we set $λ (t) = d (C \circ Ψ_{f}^{T - t}) (x (t))$ , whose dynamics are uniquely determined by the property $L_{f} λ = 0$ , allowing us to find $λ (0) = d (C \circ Ψ_{f}^{T}) (x_{0})$ by integrating $\dot{λ}$ backwards from $λ (T) = d C (x (T))$ .

Theorem 1

(Adjoint sensitivity on manifolds). The gradient of a function $C \circ Ψ_{f}^{T}$ is

$d (C \circ Ψ_{f}^{T}) (x_{0}) = λ (0),$ (24)

where $λ (t) \in T_{x (t)}^{*} M$ is the co-state. In a local chart $(U, Q)$ of $M$ with coordinates $q (t) = Q (x (t))$ , $λ (t) = λ_{i} (t) d Q^{i}$ , the state and co-state satisfy the dynamics

$\begin{matrix} {\dot{q}}^{j} & = f^{j} (q), q (0) = Q (x_{0}), \end{matrix}$ (25)

$\begin{matrix} {\dot{λ}}_{i} & = - λ_{j} \frac{\partial}{\partial q^{i}} f^{j} (q), λ_{i} (T) = \frac{\partial C}{\partial q^{i}} (x (T)) . \end{matrix}$ (26)

Proof.

Define the co-state $λ (t) \in T_{x (t)}^{*} M$ as

$\begin{matrix} λ (t) & : = {(Ψ_{f}^{T - t})}^{*} d C (x (T)) . \end{matrix}$ (27)

Then Equation (24) is recovered by application of Equation (7):

$λ (0) = {(Ψ_{f}^{T})}^{*} d C (x (T)) = d (C \circ Ψ_{f}^{T}) (x_{0}) .$ (28)

A derivation of the dynamics governing $λ (t)$ constitutes the remainder of this proof. By definition of $λ (t)$ and the Lie derivative (9), we have that $L_{f} λ (t) = 0$ :

$\begin{matrix} L_{f} λ (t) & = \frac{d}{d s} {({(Ψ_{f}^{s})}^{*} λ (t + s))}_{s = 0} \\ & = \frac{d}{d s} λ (t) = 0 . \end{matrix}$ (29)

If we further treat $λ$ as a 1-form $λ \in Ω^{1} (M)$ (denoted as $λ$ by an abuse of notation), we obtain

$\begin{matrix} L_{f} λ = & λ_{j} (\frac{\partial}{\partial q^{i}} f^{j}) d Q^{i} + (\frac{\partial}{\partial q^{j}} λ_{i}) f^{j} d Q^{i} = 0 . \end{matrix}$

The components satisfy the partial differential equation

$λ_{j} \frac{\partial}{\partial q^{i}} f^{j} + f^{j} \frac{\partial}{\partial q^{j}} λ_{i} = 0 .$ (30)

Impose that $λ (t) = λ (Ψ_{f}^{t} (x_{0}))$ (this defines the 1-form $λ$ along $x (t)$ ); then

${\dot{λ}}_{i} = \frac{\partial λ_{i}}{\partial q^{j}} {\dot{q}}^{j} = \frac{\partial λ_{i}}{\partial q^{j}} f^{j} .$ (31)

Combining Equations (30) and (31) leads to Equation (26):

${\dot{λ}}_{i} = - λ_{j} \frac{\partial}{\partial q^{i}} f^{j} .$ (32)

Expanding the final condition $λ (T) = d C (x (T))$ in local coordinates (see Equation (6)) gives

$λ (T) = \frac{\partial C}{\partial q^{i}} (x (T)) d Q^{i} = λ_{i} (T) d Q^{i} \Leftrightarrow λ_{i} (T) = \frac{\partial C}{\partial q^{i}} (x (T)) .$ (33)

□
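Theorem 1 can be checked numerically on a simple example. The sketch below assumes a pendulum in a single chart, $f (q) = (q^{2}, - \sin q^{1})$, with cost $C (q) = \frac{1}{2} ((q^{1})^{2} + (q^{2})^{2})$; it integrates Equation (25) forwards, then Equations (25) and (26) jointly backwards, so that the gradient can be compared with finite differences of $C \circ Ψ_{f}^{T}$. The example system and all names are illustrative.

```python
import math

def f(q):
    """Pendulum vector field in chart coordinates q = (angle, angular velocity)."""
    return [q[1], -math.sin(q[0])]

def C(q):
    """Assumed final cost."""
    return 0.5 * (q[0] ** 2 + q[1] ** 2)

def rk4_step(rhs, z, h):
    """One classical RK4 step for an autonomous system z' = rhs(z); h may be negative."""
    k1 = rhs(z)
    k2 = rhs([a + 0.5 * h * k for a, k in zip(z, k1)])
    k3 = rhs([a + 0.5 * h * k for a, k in zip(z, k2)])
    k4 = rhs([a + h * k for a, k in zip(z, k3)])
    return [a + h / 6.0 * (p + 2.0 * q + 2.0 * r + s)
            for a, p, q, r, s in zip(z, k1, k2, k3, k4)]

def flow(q0, T, n):
    """Approximate the flow Psi_f^T(q0), Eq. (25)."""
    q = list(q0)
    for _ in range(n):
        q = rk4_step(f, q, T / n)
    return q

def adjoint_gradient(q0, T, n):
    """Return (lambda(0), q(0)) by integrating state and co-state backwards from t = T."""
    qT = flow(q0, T, n)
    z = [qT[0], qT[1], qT[0], qT[1]]  # (q(T), lambda(T)) with lambda_i(T) = dC/dq^i
    def rhs(z):
        q1, q2, l1, l2 = z
        # Co-state dynamics (26): lambda_i' = -lambda_j * d f^j / d q^i.
        return [q2, -math.sin(q1), l2 * math.cos(q1), -l1]
    for _ in range(n):
        z = rk4_step(rhs, z, -T / n)
    return z[2:], z[:2]

grad, q_start = adjoint_gradient([1.0, 0.5], T=2.0, n=2000)
```

Only the final state is stored here; the state trajectory is reconstructed during the backward pass, which is the source of the memory savings of the method.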

Given a chart transition from a chart $(U_{1}, Q_{1})$ to a chart $(U_{2}, Q_{2})$ , e.g., during numerical integration of (26), the respective co-state components $λ_{1, i}$ and $λ_{2, i}$ are related by a transformation $A_{i}^{j} = \partial_{i} (Q_{1}^{j} \circ Q_{2}^{- 1})$ as follows:

\begin{matrix} λ_{2, i} & = A_{i}^{j} λ_{1, j} . \end{matrix}

(34)
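A one-dimensional illustration of the co-state transformation (34), assuming the unit circle with stereographic charts $q_{1} = x / (1 - y)$ and $q_{2} = x / (1 + y)$, for which $Q_{1} \circ Q_{2}^{- 1} (q_{2}) = 1 / q_{2}$, and a rotation field with chart components $\pm ω (1 + q^{2}) / 2$. The pairing $λ (f)$ should be the same number in both charts. The numerical values and names are illustrative.

```python
OMEGA = 1.0  # assumed angular velocity of the rotation field

def f1(q1):
    """Component of the rotation field in chart 1."""
    return OMEGA * (1.0 + q1 * q1) / 2.0

def f2(q2):
    """Component of the same field in chart 2."""
    return -OMEGA * (1.0 + q2 * q2) / 2.0

def costate_to_chart2(lam1, q2):
    """Eq. (34) in one dimension: lambda_2 = A * lambda_1 with A = d(Q_1 o Q_2^{-1})/dq_2."""
    A = -1.0 / (q2 * q2)  # derivative of q_1 = 1 / q_2
    return A * lam1

q2 = 0.8
q1 = 1.0 / q2          # the same point expressed in chart 1
lam1 = 0.37            # arbitrary co-state component in chart 1
lam2 = costate_to_chart2(lam1, q2)

pairing_chart1 = lam1 * f1(q1)
pairing_chart2 = lam2 * f2(q2)
```

The two pairings agree because the co-state transforms with the transposed Jacobian of the transition map while the vector field transforms with its inverse.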

A fact that will become useful in Section 4 is that Equations (25) and (26) have a Hamiltonian form. Define the control Hamiltonian $H_{c} : T^{*} M \to R$ as

H_{c} (x, λ) = λ (f (x, t)) = λ_{i} (f^{i} (q, t)) .

(35)

Then Equation (25) and Equation (26), respectively, of Theorem 1 follow as the Hamiltonian equations on $T^{*} M$ :

\begin{matrix} {\dot{q}}^{j} & = \frac{\partial H_{c}}{\partial λ_{j}} = f^{j} (q, t), \end{matrix}

(36)

\begin{matrix} {\dot{λ}}_{i} & = - \frac{\partial H_{c}}{\partial q^{i}} = - λ_{j} \frac{\partial}{\partial q^{i}} f^{j} (q, t) . \end{matrix}

(37)

For a background on Hamilton’s equations, see also Appendix A.

3. Neural ODEs on Manifolds

A neural ODE on a manifold is an NN-parameterized vector field in $X (M)$ —or including time dependence, it is an NN-parameterized vector field in $X (M \times R)$ , with t in the $R$ slot and $\dot{t} = 1$ . Given parameters $θ \in R^{n_{θ}}$ , we denote this parameterized vector field as $f_{θ} (x, t) : = f (x, t, θ)$ . This results in the dynamic system

\dot{x} = f_{θ} (x, t), x (0) = x_{0} .

(38)

The key idea of neural ODEs is to tackle various flow approximation tasks by optimizing the parameters with respect to a to-be-specified optimization problem. Denote a finite time horizon T and intermittent times $T_{1}, T_{2}, \dots < T$ . Denote a general trajectory cost by

C_{f_{θ}}^{T} (x_{0}, θ) = F (θ, Ψ_{f_{θ}}^{T_{1}} (x_{0}), Ψ_{f_{θ}}^{T_{2}} (x_{0}), \dots, Ψ_{f_{θ}}^{T} (x_{0})) + \int_{0}^{T} r (Ψ_{f_{θ}}^{s} (x_{0}), s) d s,

(39)

with an intermittent and final cost term F and running cost r. Given a probability space $(M, D, P)$ , we define the total cost as

J (θ) : = E_{x_{0} \sim P} C_{f_{θ}}^{T} (x_{0}, θ) .

(40)

The minimization problem takes the form

\begin{matrix} min_{θ} J (θ) . \end{matrix}

(41)

Note that (41) is not subject to any dynamic constraint—the flow already appears explicitly in the cost $C_{f_{θ}}^{T}$ .

Normally, the optimization problem is solved by means of a stochastic gradient descent optimization algorithm [53]. In this, a batch of N initial conditions $x_{i}$ is sampled from the probability distribution corresponding to the probability measure $P$ . Writing $C_{i} = C_{f_{θ}}^{T} (x_{i}, θ)$ , the parameter gradient $\frac{\partial}{\partial θ} J (θ)$ is approximated as

\frac{\partial}{\partial θ} J (θ) = E_{x_{0} \sim P} \frac{\partial}{\partial θ} C_{f_{θ}}^{T} (x_{0}, θ) \approx \frac{1}{N} \sum_{i = 1}^{N} \frac{\partial}{\partial θ} C_{i} .

(42)

In this section, we show how to optimize the parameters $θ$ for various choices of neural ODEs and cost functions, with (39) being the most general case of a cost, and highlight similarities in the various derivations. In the following, the gradient $\frac{\partial}{\partial θ} C_{i}$ is computed via the adjoint method on manifolds for various scenarios. The advantage of the adjoint method over, e.g., automatic differentiation of $C_{i}$ (backpropagation through an ODE solver) is that its memory cost is constant with respect to the network depth T.

3.1. Constant Parameters and Running and Final Cost

Here we consider neural ODEs of the form (38), with constant parameters $θ$ and cost functions of the form

C_{f_{θ}}^{T} (x_{0}, θ) = F (Ψ_{f_{θ}}^{T} (x_{0}), θ) + \int_{0}^{T} r (Ψ_{f_{θ}}^{s} (x_{0}), θ, s) d s,

(43)

with a final cost term F and a running cost term r. This generalizes [2] to manifolds. Compared to existing manifold methods for neural ODEs [18,54], the running cost is new.

The parameter gradient’s components $\frac{\partial}{\partial θ} C_{f_{θ}}^{T} ((x_{0}, t_{0}), θ) \in R^{n_{θ}}$ are then computed by Theorem 2 (see also [21]):

Theorem 2

(Generalized Adjoint Method on Manifolds). Given the dynamics (38) and the cost (43), the parameter gradient’s components $\frac{\partial}{\partial θ} C_{f_{θ}}^{T} ((x_{0}, t_{0}), θ) \in R^{n_{θ}}$ are computed by

$\frac{\partial}{\partial θ} C_{f_{θ}}^{T} ((x_{0}, t_{0}), θ) = (\frac{\partial F}{\partial θ}) (x (T), θ) + \int_{0}^{T} \frac{\partial}{\partial θ} (λ_{j} f_{θ}^{j} (q (s), s) + r (q (s), θ, s)) d s,$ (44)

where the state $x (s) \in M$ and co-state $λ (s) \in T_{x (s)}^{*} M$ satisfy, in a local chart $(U, Q)$ with $q (t) = Q (x (t))$ , $λ (t) = λ_{i} (t) d Q^{i}$ ,

$\begin{matrix} {\dot{q}}^{j} & = f_{θ}^{j} (q, t), q (0) = Q (x_{0}), t (0) = t_{0}, \end{matrix}$ (45)

$\begin{matrix} {\dot{λ}}_{i} & = - λ_{j} \frac{\partial}{\partial q^{i}} f_{θ}^{j} (q, t) - \frac{\partial r}{\partial q^{i}}, λ_{i} (T) = \frac{\partial F}{\partial q^{i}} (x (T), θ) . \end{matrix}$ (46)

Proof.

Define the augmented state space as $M^{'} = M \times R^{n_{θ}} \times R \times R$ to include the original state $x \in M$ , parameters $θ \in R^{n_{θ}}$ , accumulated running cost $L \in R$ , and time $t \in R$ in the augmented state $x^{'} : = (x, θ, L, t) \in M^{'}$ . In addition, define the augmented dynamics $f_{aug} \in X (M^{'})$ as

${\dot{x}}^{'} = f_{aug} (x^{'}) = \begin{pmatrix} f_{θ} (x, t) \\ 0 \\ r (x, θ, t) \\ 1 \end{pmatrix}, x^{'} (0) = x_{0}^{'} : = \begin{pmatrix} x_{0} \\ θ \\ 0 \\ t_{0} \end{pmatrix} .$ (47)

This is an autonomous system with final state $x^{'} (T) = (x (T), θ, \int_{0}^{T} r (x, θ, s) d s, T)$ . Next, define the cost $C_{aug} : M^{'} \to R$ on the augmented space:

$C_{aug} (x^{'}) = F (x, θ) + L .$ (48)

Then Equation (43) can be rewritten as the evaluation of a terminal cost $C_{aug} (x^{'} (T))$ :

$C_{f_{θ}}^{T} (x_{0}) = (C_{aug} \circ Ψ_{f_{aug}}^{T}) (x_{0}^{'}) .$ (49)

By Theorem 1, the gradient $d (C_{aug} \circ Ψ_{f_{aug}}^{T})$ is given by

$d (C_{aug} \circ Ψ_{f_{aug}}^{T}) (x_{0}^{'}) = λ (0),$ (50)

and by Equation (26), the components of $λ (s)$ satisfy

$\begin{matrix} {\dot{λ}}_{i} = - λ_{j} \frac{\partial}{\partial {q^{i}}^{'}} f_{aug}^{j}, λ_{i} (T) = \frac{\partial}{\partial {q^{i}}^{'}} C_{aug} (x^{'} (T)) \end{matrix}$ (51)

Split the co-state into $λ_{q}, λ_{θ}, λ_{L}, λ_{t}$ ; then their components’ dynamics are as follows:

$\begin{matrix} {\dot{λ}}_{q, i} & = - \frac{\partial}{\partial q^{i}} (λ_{q, j} f_{θ}^{j} (q, t) + λ_{L} r (q, θ, t)), λ_{q} (T) = \frac{\partial F}{\partial q} (x (T), θ), \end{matrix}$ (52)

$\begin{matrix} {\dot{λ}}_{θ, i} & = - \frac{\partial}{\partial θ^{i}} (λ_{q, j} f_{θ}^{j} (q, t) + λ_{L} r (q, θ, t)), λ_{θ} (T) = \frac{\partial F}{\partial θ} (x (T), θ), \end{matrix}$ (53)

$\begin{matrix} {\dot{λ}}_{L} & = 0, λ_{L} (T) = \frac{\partial}{\partial L} C_{aug} (x (T), θ) = 1, \end{matrix}$ (54)

$\begin{matrix} {\dot{λ}}_{t} & = - \frac{\partial}{\partial t} (λ_{q, j} f_{θ}^{j} (q, t) + λ_{L} r (q, θ, t)), λ_{t} (T) = \frac{\partial}{\partial t} C_{aug} (x (T), θ) = 0 . \end{matrix}$ (55)

The component $λ_{L} = 1$ is constant, so Equation (52) coincides with (46). Integrating (53) from $s = 0$ to $s = T$ recovers Equation (44). $λ_{t}$ does not appear in any of the other equations, so Equation (55) may be ignored. □

In summary, the above proof depends on identifying a suitable augmented manifold $M^{'}$ such that the augmented dynamics $f_{aug} \in X (M^{'})$ are autonomous and the cost function $C_{aug} : M^{'} \to R$ on the augmented manifold rephrases the cost (43) as a final cost $C_{aug} (x^{'} (T))$ . This allows Theorem 1 to be applied to derive the corresponding adjoint method. In Section 3.2, this process will be the main technical tool for generalizations of Theorem 2. The next sections describe common special cases of (38) and Theorem 2.
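As an illustration of Theorem 2 in a single global chart (i.e., $M = R^{n}$ ), the following minimal sketch integrates the state forward together with the accumulated running cost $L$ , then integrates the state, co-state (46), and $λ_{θ}$ from (53) backwards. The linear dynamics $f_{θ} (q) = A q$ with $θ = A$ , the quadratic costs, the fixed-step RK4 integrator, and all function names are assumptions chosen for illustration and are not taken from the cited works.

```python
import numpy as np

def rk4_step(rhs, z, h):
    """One classical RK4 step for an autonomous system; h < 0 steps backwards."""
    k1 = rhs(z)
    k2 = rhs(z + 0.5 * h * k1)
    k3 = rhs(z + 0.5 * h * k2)
    k4 = rhs(z + h * k3)
    return z + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def cost_and_gradient(A, q0, q_target, T=1.0, steps=200):
    """Cost F + L and its gradient w.r.t. theta = A via the adjoint method.

    Assumed example: f_theta(q) = A q, r(q) = 0.5 |q|^2,
    F(q) = 0.5 |q - q_target|^2 (so dF/dtheta = 0).
    """
    n = q0.size
    h = T / steps

    # Forward pass on the augmented state (q, L), cf. Equation (47).
    def forward_rhs(z):
        q = z[:n]
        return np.concatenate([A @ q, [0.5 * (q @ q)]])

    z = np.concatenate([q0, [0.0]])
    for _ in range(steps):
        z = rk4_step(forward_rhs, z, h)
    qT, L = z[:n], z[n]
    cost = 0.5 * np.sum((qT - q_target) ** 2) + L

    # Backward pass on (q, lambda, lambda_theta), cf. Equations (46) and (53).
    def backward_rhs(z):
        q, lam = z[:n], z[n:2 * n]
        dq = A @ q
        dlam = -A.T @ lam - q                     # -d/dq (lam^T A q + r)
        dlam_theta = -np.outer(lam, q).ravel()    # -d/dA (lam^T A q + r)
        return np.concatenate([dq, dlam, dlam_theta])

    z = np.concatenate([qT, qT - q_target, np.zeros(n * n)])
    for _ in range(steps):
        z = rk4_step(backward_rhs, z, -h)
    grad = z[2 * n:].reshape(n, n)                # lambda_theta(0)
    return cost, grad
```

A finite-difference comparison of `cost_and_gradient` against perturbed values of `A` is a convenient sanity check for the sign conventions in (46) and (53).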

3.1.1. Vanilla Neural ODEs and Extrinsic Neural ODEs on Manifolds

The case of neural ODEs on $R^{n}$ (e.g., [2]) is obtained by setting $M = R^{n}$ . Scalar functions, vector fields, and tensor fields are readily expressed, see Table 2.

Table 2.

Parameterization of functions in extrinsic neural ODEs.

Function	Vanilla Neural ODE	Extrinsic Neural ODE
Scalar fields $V_{θ} (x) \in R$	$V_{θ} : R^{n} \to R$	$V_{θ} : R^{N} \to R$
Vector fields $f_{θ} (x, t) \in T_{x} M$	$f_{θ} : R^{n} \times R \to R^{n}$	$f_{θ}^{↑} : R^{N} \times R \to R^{N}$ with tangency constraint [19], optional stabilization [55]
Components of $(p, q)$ -tensor fields $M_{j_{1}, \dots, j_{q}}^{i_{1}, \dots, i_{p}} (x) \in R$	$M_{j_{1}, \dots, j_{q}}^{i_{1}, \dots, i_{p}} : R^{n} \to R$	$M_{j_{1}, \dots, j_{q}}^{i_{1}, \dots, i_{p}} : R^{N} \to R$ , see footnote ¹

¹ A tangency constraint on contravariant components of $(p, q)$ -tensors is not necessarily required for the vector field $f_{θ}^{↑}$ to remain tangent to $ι (M)$ and depends on the vector field under investigation.

There is an overlap with extrinsic neural ODEs on manifolds (described, for instance, in [19]), which optimize the neural ODE on an embedding space $R^{N}$ , see also Figure 2.

Figure 2. In the extrinsic formulation of neural ODEs on manifolds, the manifold $M$ is embedded in $R^{N}$ as $ι (M) \subset R^{N}$ , and a neural ODE $f_{θ}^{↑} \in X (R^{N})$ is optimized.

We denote the embedding as $ι : M \to R^{N}$ , where $x \in M$ and $y \in R^{N}$ . Optimizing the neural ODE on $R^{N}$ requires extending the dynamics $f_{θ} (\cdot, t) \in X (M)$ to a vector field $f_{θ}^{↑} (\cdot, t) \in X (R^{N})$ such that

$ι_{*} f_{θ} (x, t) = f_{θ}^{↑} (ι (x), t) .$ (56)

The dynamics $f_{θ}^{↑} (y, t)$ are then used in Theorem 2, and the co-state accordingly lives in $T^{*} R^{N}$ .

As shown in [19], the resulting parameter gradients are equivalent to those resulting from an application in local charts, as long as it can be guaranteed that the integral curves of $f_{θ}^{↑} (y, t)$ remain within $ι (M) \subset R^{N}$ , i.e., that the integration is geometrically exact. Geometrically exact integration has to be guaranteed separately, either by integration in local charts [18] or by stabilization techniques [55].

A strong upside of an extrinsic formulation is that existing neural ODE packages (e.g., [56]) can be applied directly. A downside to extrinsic neural ODEs is that finding $f_{θ}^{↑} (y, t)$ may not be immediate, since tangency to $ι (M)$ is required, see also Table 2. Finally, the extrinsic dimension N can be much larger than the intrinsic dimension $n = dim M$ , leading to computational overhead that does not fully exploit the manifold hypothesis. Extrinsic methods for neural ODEs are the preferred choice when the intrinsic dimension n is small and there is a known embedding $ι (M) \subset R^{N}$ with low extrinsic dimension N. Then the computational overhead due to $N > n$ is negligible, and stabilization techniques [55] can be applied to guarantee geometrically exact integration.
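To make the tangency constraint concrete, the following sketch treats the unit sphere $S^{2} \subset R^{3}$ : an unconstrained ambient vector field is projected onto the tangent space of the sphere, and a simple feedback term pulls the state back towards $|y| = 1$ . The linear ambient field, the particular form of the stabilizing term, and all function names are assumptions made for illustration; in particular, the stabilizing term is not necessarily the technique of [55].

```python
import numpy as np

def ambient_field(y, theta):
    """Unconstrained stand-in for a network output in R^3 (assumed linear map)."""
    return theta @ y

def lifted_field(y, theta, gain=1.0):
    """Tangent projection onto T_y S^2 plus a term stabilizing |y| = 1."""
    n = y / np.linalg.norm(y)
    h = ambient_field(y, theta)
    tangent = h - n * (n @ h)                    # remove the normal component
    stabilization = -gain * (y @ y - 1.0) * y    # vanishes on the sphere
    return tangent + stabilization

def integrate(y0, theta, T=5.0, steps=1000):
    """Fixed-step RK4 integration of the lifted dynamics in R^3."""
    h = T / steps
    y = np.array(y0, dtype=float)
    for _ in range(steps):
        k1 = lifted_field(y, theta)
        k2 = lifted_field(y + 0.5 * h * k1, theta)
        k3 = lifted_field(y + 0.5 * h * k2, theta)
        k4 = lifted_field(y + h * k3, theta)
        y = y + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

theta = np.array([[0.0, 1.0, 0.5], [-0.3, 0.0, 1.0], [0.2, -1.0, 0.1]])
y_final = integrate([0.0, 0.0, 1.0], theta)
print("distance from sphere:", abs(np.linalg.norm(y_final) - 1.0))
```

On the sphere itself the stabilizing term is zero, so the lifted field agrees there with the tangent projection of the ambient field; away from the sphere it only counteracts numerical drift.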

3.1.2. Intrinsic Neural ODEs on Manifolds

The intrinsic case of neural ODEs on manifolds [18] is described by integrating the dynamics in local charts, see also Figure 3.

Figure 3. In the intrinsic formulation of neural ODEs on manifolds, the neural ODE $f_{θ} \in X (M)$ is optimized in local charts, here $(U_{1}, Q_{1})$ and $(U_{2}, Q_{2})$ , and the state and co-state undergo chart transitions.

The advantage of intrinsic over extrinsic neural ODEs on manifolds is that the dimension of the resulting equations is as low as possible in the intrinsic case for a given manifold. The flexibility of chart representations gives intrinsic neural ODEs on manifolds the power to represent high-dimensional data distributions at their latent dimension, see especially [22] for learning charts from data and [18,21] for chart-switching methods during numerical integration. Numerical integration in local charts is also geometrically exact by default. However, the parameterization of scalar functions, vector fields, and tensor fields with neural networks in local charts, as well as their differentiation with respect to parameters, presents a source of complexity. There are three common methods to parameterize scalar-valued functions $V \in C^{\infty} (M)$ in local charts (vector fields and tensor-valued functions directly follow by parameterizing their scalar component functions in an analogous way):

A partition of unity $σ_{i} : R^{n} \to R$ with respect to a collection of charts $(U_{i}, Q_{i})$ can be used to sum over chart components $V_{i} : R^{n} \to R$ as $V (x) : = \sum_{i} σ_{i} (Q_{i} (x)) V_{i} (Q_{i} (x))$ , see examples in [49], Chapter 2, and [24].
The function V can be directly defined by chart representatives $V_{i} : = V \circ Q_{i}^{- 1}$ , enforcing compatibility between overlapping charts $(U_{i}, Q_{i}), (U_{j}, Q_{j})$ by soft constraints, which are implemented as additional cost terms $∥ V_{i} (q_{i}) - V_{j} \circ Q_{j} \circ Q_{i}^{- 1} (q_{i}) ∥$ that are minimized on chart overlaps $U_{i} \cap U_{j}$ , see [18,22].
Given an embedding $ι : M \to R^{N}$ and $\bar{V} \in C^{\infty} (R^{N})$ , an extrinsic representation $V_{i} : = \bar{V} \circ ι \circ Q_{i}^{- 1}$ is possible, see [19,21].
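The first two methods can be sketched on the circle $S^{1}$ with two angle charts. The charts, the partition of unity $σ_{1} = \cos^{2} (q / 2)$ , $σ_{2} = \sin^{2} (q / 2)$ , the hand-written chart representatives, and the squared (rather than plain) norm in the compatibility penalty are all assumptions made for illustration and are not taken from the cited works.

```python
import numpy as np

def chart1(x):
    """Angle in (-pi, pi); covers S^1 minus the point (-1, 0)."""
    return np.arctan2(x[1], x[0])

def chart2(x):
    """Angle in (0, 2*pi); covers S^1 minus the point (1, 0)."""
    return np.arctan2(x[1], x[0]) % (2 * np.pi)

def transition_1_to_2(q1):
    """Chart transition Q_2 o Q_1^{-1} on the overlap (q1 != 0)."""
    return q1 % (2 * np.pi)

def sigma1(q1):
    return np.cos(q1 / 2) ** 2    # vanishes at the point missing from chart 1

def sigma2(q2):
    return np.sin(q2 / 2) ** 2    # vanishes at the point missing from chart 2

def V_partition(x, V1, V2):
    """Method 1: chart components summed with partition-of-unity weights."""
    q1, q2 = chart1(x), chart2(x)
    return sigma1(q1) * V1(q1) + sigma2(q2) * V2(q2)

def compatibility_penalty(V1, V2, q1_samples):
    """Method 2: soft constraint penalizing mismatch on the chart overlap."""
    mismatch = V1(q1_samples) - V2(transition_1_to_2(q1_samples))
    return np.mean(mismatch ** 2)

# Assumed chart representatives (stand-ins for neural networks).
V1 = lambda q: np.cos(q) + 0.1 * np.sin(2 * q)
V2_compatible = lambda q: np.cos(q) + 0.1 * np.sin(2 * q)
V2_offset = lambda q: np.cos(q) + 0.1 * np.sin(2 * q) + 0.1

samples = np.concatenate([np.linspace(-3.0, -0.1, 50), np.linspace(0.1, 3.0, 50)])
print(compatibility_penalty(V1, V2_compatible, samples))
print(compatibility_penalty(V1, V2_offset, samples))
```

In a training loop, the penalty would be added to the cost so that the representatives learn to agree on overlaps, whereas the partition-of-unity variant is globally well-defined by construction.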

Advantages and disadvantages are summarized in Table 3.

Table 3.

Parameterization of scalar functions and tensor components in intrinsic neural ODEs.

Partition of Unity [24,49]	Soft Constraint [18,22]	Pullback [19,21]
Components from all local charts are summed, weighted by a partition of unity.	Function is directly represented in local charts.	Function is pulled back to local chart.
Allows representation of arbitrary smooth functions.	Functions are smooth where charts do not overlap, but are not well-defined at chart transitions.	Allows representation of arbitrary smooth functions.
Differentiating functions generally requires differentiating through chart transition maps, creating computational overload [24].	Chart transition maps do not have to be differentiated.	Chart representations of the embedding $ι (M)$ are differentiated, possibly creating computational overload.

In available state-of-the-art packages for neural ODEs, the chart dynamics are phrased as discontinuous dynamics with state transitions $Q_{1} \circ Q_{2}^{- 1} : R^{n} \to R^{n}$ , but implementation is not yet streamlined for local charts, chart transitions of the co-state (cf. Section 2.3), and custom adjoint sensitivity equations.

3.1.3. Structure Preservation

Structure-preserving architectures narrow down the class of learnable neural ODEs from arbitrary vector fields $X (M)$ to subsets of $X (M)$ with particular properties, improving training speed and performance (cf. Section 1.1). Examples of such subsets are (symmetry-preserving) equivariant dynamical systems [23] and (physics-preserving) Hamiltonian, Lagrangian, and port-Hamiltonian dynamical systems [3]. Given that a structure-preserving parameterization of the neural ODE is known in closed form, these are readily implemented in the above formalism.

For example, reusing results from Table 2 and Table 3, Hamiltonian and Lagrangian neural ODEs [30,32] are fully determined by scalar functions $H_{θ} \in C^{\infty} (T^{*} Q), L_{θ} \in C^{\infty} (T Q)$ , respectively, and their gradients. Hamiltonian neural ODEs are advantageous for joint learning of the dynamics and energy of conservative physical systems, where the learned Hamiltonian vector fields $X_{H_{θ}} \in X (T^{*} Q)$ are guaranteed to conserve the Hamiltonian $H_{θ}$ representing the total energy. Lagrangian neural ODEs likewise enable learning the dynamics of conservative physical systems and enable incorporation of dissipative terms [34] but do not directly represent the total energy.

Port-Hamiltonian neural ODEs [3,35] offer further expressiveness: besides a scalar Hamiltonian $H_{θ} \in C^{\infty} (M)$ , they offer degrees of freedom in a skew-symmetric $(2, 0)$ -tensor $J_{θ}$ (called a Poisson tensor), a positive-definite $(2, 0)$ -tensor $R_{θ}$ (called a dissipation tensor), and a linear input map $B_{θ} (x) : R^{k} \to T_{x} M$ . This allows learning the dynamics of non-conservative dynamical systems that can be coupled with known physical systems and control inputs through the input map, while jointly learning the total energy $H_{θ}$ , rate of energy dissipation $R_{θ} (d H_{θ}, d H_{θ})$ , and externally supplied power (see [57] for an introductory overview). Most physical systems can be represented in a port-Hamiltonian form [58], giving this parameterization a high degree of expressiveness that has been used in dynamics learning [3], control [20,21], and model order reduction [35]. Although not yet investigated in practice, this expressiveness may also be a disadvantage compared to Lagrangian or Hamiltonian neural networks, resulting in overfitting when, e.g., small dissipation terms are learned where there is no dissipation. Generally speaking, choosing the most specific structure-preserving neural network is advised.
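The energy bookkeeping described above can be checked numerically on a small example. The sketch below assumes canonical coordinates $x = (q, p) \in R^{2}$ , dynamics of the common form $\dot{x} = (J - R) \nabla H_{θ} + B u$ with output $y = B^{⊤} \nabla H_{θ}$ , a mass-spring Hamiltonian with parameters $θ = (k, m)$ , and a dissipation matrix that is only positive semi-definite. These choices and all names are illustrative assumptions, not the parameterizations of [3,35].

```python
import numpy as np

J = np.array([[0.0, 1.0], [-1.0, 0.0]])    # canonical skew-symmetric structure

def hamiltonian(x, theta):
    """Assumed energy of a mass-spring system, theta = (stiffness k, mass m)."""
    k, m = theta
    q, p = x
    return 0.5 * k * q ** 2 + 0.5 * p ** 2 / m

def grad_hamiltonian(x, theta):
    k, m = theta
    q, p = x
    return np.array([k * q, p / m])

def port_hamiltonian_field(x, u, theta, damping, B):
    """Returns (xdot, y) for xdot = (J - R) dH + B u and y = B^T dH."""
    R = np.diag([0.0, damping])             # dissipation acts on the momentum
    dH = grad_hamiltonian(x, theta)
    xdot = (J - R) @ dH + B * u
    y = B @ dH
    return xdot, y

def power_balance_residual(x, u, theta, damping, B):
    """dH/dt minus (supplied power - dissipated power); zero up to rounding."""
    R = np.diag([0.0, damping])
    dH = grad_hamiltonian(x, theta)
    xdot, y = port_hamiltonian_field(x, u, theta, damping, B)
    return dH @ xdot - (y * u - dH @ R @ dH)

x = np.array([0.7, -0.4])
B = np.array([0.0, 1.0])                    # force input on the momentum
print(power_balance_residual(x, u=0.3, theta=(2.0, 1.5), damping=0.2, B=B))
```

Setting `damping=0` and `u=0` recovers the Hamiltonian case, in which the energy rate `dH @ xdot` vanishes identically because `J` is skew-symmetric.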

3.2. Extensions

The proof of Theorem 2 depended on identifying a suitable augmented manifold $M^{'}$ , autonomous augmented dynamics $f_{aug} \in X (M^{'})$ , and an augmented cost function $C_{aug} : M^{'} \to R$ that rephrases the cost (43) as a final cost $C_{aug} (x (T))$ to apply Theorem 1. This approach generalizes to various other scenarios, including different cost terms, augmented neural ODEs on manifolds, and time-dependent parameters, presented in the following.

3.2.1. Nonlinear and Intermittent Cost Terms

We consider here the case of neural ODEs on manifolds of the form (38) with cost (39). This is a generalization of [1], in which intermittent cost terms appear for neural ODEs on $R^{n}$ . For the final and intermittent cost term $F_{θ} : M \times M \times \dots \times M \to R$ , we denote by $d_{k} F_{θ} \in T_{x}^{*} M$ the differential with respect to the k-th slot and denote $θ$ as a subscript to avoid confusion. The components of $d_{k} F$ will be denoted $\frac{\partial F}{\partial^{k} q^{i}}$ . In this case, the parameter gradient is determined by repeated application of Theorem 2:

Theorem 3

(Generalized Adjoint Method on Manifolds). Given the dynamics (38) and the cost (39), the parameter gradient’s components $\frac{\partial}{\partial θ} C_{f_{θ}}^{T} ((x_{0}, t_{0}), θ) \in R^{n_{θ}}$ are computed by

$\begin{matrix} \frac{\partial}{\partial θ} C_{f_{θ}}^{T} ((x_{0}, t_{0}), θ) = & \frac{\partial F_{θ}}{\partial θ} (x (T_{1}), x (T_{2}), \dots, x (T)) \\ & + \int_{0}^{T} \frac{\partial}{\partial θ} (λ_{j} f_{θ}^{j} (q (s), s) + r (q (s), θ, s)) d s , \end{matrix}$ (57)

where the state $x (s) \in M$ satisfies (45) and the co-state $λ (s) \in T_{x (s)}^{*} M$ satisfies dynamics with discrete updates at times $T_{1}, \dots, T_{N - 1}$ given by

$\begin{matrix} {\dot{λ}}_{q, i} & = - \frac{\partial}{\partial q^{i}} (λ_{q, j} f_{θ}^{j} (q, t) + r (q, θ, t)), λ_{q, i} (T) = \frac{\partial F_{θ}}{\partial^{N} q^{i}} (x (T_{1}), \dots, x (T)), \end{matrix}$ (58)

$\begin{matrix} λ_{i} (T_{k, -}) & = λ_{i} (T_{k, +}) + \frac{\partial F_{θ}}{\partial^{k} q^{i}} (x (T_{1}), \dots, x (T)), \end{matrix}$ (59)

with $T_{k, -}$ being the instant after a discrete update at time $T_{k}$ (recall that co-state dynamics are integrated backwards, so $T_{k, -} < T_{k} < T_{k, +}$ ) and $T_{k, +}$ the instant before.

Proof.

We introduce an augmented manifold $M^{'} = M \times \dots \times M \times R^{n_{θ}} \times R \times R$ to include N copies of the original state $x \in M$ , parameters $θ \in R^{n_{θ}}$ , accumulated running cost $L \in R$ , and time $t \in R$ in the augmented state $x^{'} : = (x_{1}, \dots, x_{N}, θ, L, t) \in M^{'}$ . Let

$ϱ_{T_{i}} (t) = \begin{cases} 1, & t \leq T_{i} \\ 0, & t > T_{i} \end{cases} ,$ (60)

and define the augmented dynamics $f_{aug} \in X (M^{'})$ as

${\dot{x}}^{'} = f_{aug} (x^{'}) = (\begin{matrix} ϱ_{T_{1}} (t) f_{θ} (x_{1}, t) \\ ⋮ \\ ϱ_{T_{N - 1}} (t) f_{θ} (x_{N - 1}, t) \\ f_{θ} (x_{N}, t) \\ 0 \\ r (x_{N}, θ, t) \\ 1 \end{matrix}), x^{'} (0) = x_{0}^{'} : = (\begin{matrix} x_{0} \\ ⋮ \\ x_{0} \\ x_{0} \\ θ \\ 0 \\ t_{0} \end{matrix}) .$ (61)

This is an autonomous system with final state

$x^{'} (T) = (x (T_{1}), \dots, x (T_{N - 1}), x (T), θ, \int_{0}^{T} r (x, θ, s) d s, T) .$ (62)

Next, define the cost $C_{aug} : M^{'} \to R$ on the augmented space:

$C_{aug} (x^{'}) = F_{θ} (x_{1}, \dots, x_{N}) + L .$ (63)

Then Equation (39) can be rewritten as the evaluation of a terminal cost $C_{aug} (x^{'} (T))$ :

$C_{f_{θ}}^{T} (x_{0}) = (C_{aug} \circ Ψ_{f_{aug}}^{T}) (x_{0}^{'}) .$ (64)

Apply Equation (26), and split the co-state into $λ_{q_{1}}, \dots, λ_{q_{N}}, λ_{θ}, λ_{L}, λ_{t}$ ; then their components’ dynamics are as follows:

$\begin{matrix} {\dot{λ}}_{q_{1}, i} & = - \frac{\partial}{\partial q^{i}} (λ_{q_{1}, j} ϱ_{T_{1}} (t) f_{θ}^{j} (q_{1}, t)), λ_{q_{1}} (T) = \frac{\partial F_{θ}}{\partial^{1} q} (x (T_{1}), \dots, x (T)), \\ ⋮ \end{matrix}$ (65)

$\begin{matrix} {\dot{λ}}_{q_{N}, i} & = - \frac{\partial}{\partial^{N} q^{i}} (λ_{q_{N}, j} f_{θ}^{j} (q_{N}, t) + λ_{L} r (q_{N}, θ, t)), λ_{q_{N}} (T) = \frac{\partial F_{θ}}{\partial^{N} q} (x (T_{1}), \dots, x (T)), \end{matrix}$ (66)

$\begin{matrix} {\dot{λ}}_{θ, i} & = - \frac{\partial}{\partial θ^{i}} (λ_{q_{1}, j} ϱ_{T_{1}} (t) f_{θ}^{j} (q_{1}, t) + \dots + λ_{q_{N}, j} f_{θ}^{j} (q_{N}, t) + λ_{L} r (q, θ, t)), \\ λ_{θ} (T) & = \frac{\partial F_{θ}}{\partial θ} (x (T_{1}), \dots, x (T)) . \end{matrix}$ (67)

We excluded the dynamics of $λ_{t}$ , which does not appear in any of the other equations, and the constant $λ_{L} = 1$ . Finally, define the cumulative co-state

$λ_{q} = ϱ_{T_{1}} (t) λ_{q_{1}} + \dots + λ_{q_{N}} .$ (68)

Its dynamics at $t \in [0, T] \setminus \{T_{1}, \dots, T_{N - 1}\}$ are given by the sum of (65) to (66), letting $q = q_{N}$ :

$\begin{matrix} {\dot{λ}}_{q, i} & = {\dot{λ}}_{q_{1}, i} + \dots + {\dot{λ}}_{q_{N}, i} \end{matrix}$ (69)

$\begin{matrix} = - \frac{\partial}{\partial q^{i}} (λ_{q, j} f_{θ}^{j} (q, t) + r (q, θ, t)) \end{matrix}$ (70)

$\begin{matrix} λ_{q} (T) & = \frac{\partial F_{θ}}{\partial^{N} q} (x (T_{1}), \dots, x (T)), \end{matrix}$ (71)

with discrete jumps (59) accounting for the final conditions of $λ_{q_{1}}, \dots, λ_{q_{N}}$ , and the dynamics (67) can be rewritten as

${\dot{λ}}_{θ, i} = - \frac{\partial}{\partial θ^{i}} (λ_{q, j} f_{θ}^{j} (q, t) + r (q, θ, t)), λ_{θ} (T) = \frac{\partial F_{θ}}{\partial θ} (x (T_{1}), \dots, x (T)) .$ (72)

Integrating this from $s = 0$ to $s = T$ recovers Equation (57). □

Cost terms of this form are interesting for optimization of, e.g., periodic orbits [59] or trajectories on manifolds, where conditions at multiple checkpoints $Ψ_{f_{θ}}^{T_{i}} (x_{0})$ may appear in the cost.
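The discrete co-state update of Theorem 3 can be checked on a scalar example with a closed-form answer. The sketch below assumes the dynamics $\dot{x} = θ x$ , one intermediate checkpoint $T_{1}$ , the cost $F (x (T_{1}), x (T)) = x (T_{1}) x (T)$ , no running cost, and a fixed-step RK4 integrator; all of these are illustrative assumptions. For this choice, $C = x_{0}^{2} e^{θ (T_{1} + T)}$ , so the exact gradient is $(T_{1} + T) C$ .

```python
import numpy as np

def rk4(rhs, z, t0, t1, steps=200):
    """Fixed-step RK4 for an autonomous system; t1 < t0 integrates backwards."""
    h = (t1 - t0) / steps
    for _ in range(steps):
        k1 = rhs(z)
        k2 = rhs(z + 0.5 * h * k1)
        k3 = rhs(z + 0.5 * h * k2)
        k4 = rhs(z + h * k3)
        z = z + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

def cost_and_gradient(theta, x0, T1, T):
    """Cost x(T1) * x(T) and dC/dtheta via the adjoint method with one jump."""
    flow = lambda z: theta * z
    x_T1 = rk4(flow, np.array([x0]), 0.0, T1)[0]
    x_T = rk4(flow, np.array([x_T1]), T1, T)[0]
    cost = x_T1 * x_T

    def backward(z):
        x, lam, _ = z
        # state, co-state, and lambda_theta dynamics (all with minus signs)
        return np.array([theta * x, -theta * lam, -lam * x])

    # Final condition: derivative of F w.r.t. its last slot, here x(T1).
    z = rk4(backward, np.array([x_T, x_T1, 0.0]), T, T1)
    # Discrete update at T1: add derivative of F w.r.t. its first slot, x(T).
    z[1] += x_T
    z = rk4(backward, z, T1, 0.0)
    return cost, z[2]                    # z[2] = lambda_theta(0)

cost, grad = cost_and_gradient(theta=0.3, x0=1.5, T1=0.4, T=1.0)
print(grad, (0.4 + 1.0) * cost)          # adjoint gradient vs. closed form
```

Omitting the update at $T_{1}$ would drop the contribution of the first slot of $F$ and yield only part of the gradient.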

3.2.2. Augmented Neural ODEs on Manifolds and Time-Dependent Parameters

With state $x \in M$ , augmented state $α \in N$ (not to be confused with $x^{'} \in M^{'}$ ), and parameterized $φ_{θ} : M \to N$ , augmented neural ODEs on manifolds are neural ODEs on the manifold $M \times N$ of the form

$(\begin{matrix} \dot{x} \\ \dot{α} \end{matrix}) = (\begin{matrix} f_{θ} (x, α) \\ g_{θ} (x, α) \end{matrix}); (\begin{matrix} x (0) \\ α (0) \end{matrix}) = (\begin{matrix} x_{0} \\ φ_{θ} (x_{0}) \end{matrix}) .$ (73)

Time t is not included explicitly in these dynamics, since it can be included in $α$ . This case also includes the scenario of time-dependent parameters $\bar{θ} (t)$ as part of $α$ . As the trajectory cost, we take a final cost

$C_{f_{θ}, g_{θ}}^{T} (x_{0}, θ) = F (Ψ_{f_{θ}, g_{θ}}^{T} (x_{0}, φ_{θ} (x_{0})), θ) .$ (74)

This is a generalization of [11,12].

Theorem 4

(Adjoint Method for Augmented Neural ODEs on Manifolds). Given the dynamics (73) and the cost (74), the parameter gradient’s components $\frac{\partial}{\partial θ} C_{f_{θ}, g_{θ}}^{T} ((x_{0}, φ_{θ} (x_{0})), θ) \in R^{n_{θ}}$ are computed by

$\begin{matrix} \frac{\partial}{\partial θ} C_{f_{θ}, g_{θ}}^{T} ((x_{0}, φ (x_{0})), θ) = & (\frac{\partial F}{\partial θ}) (x (T), α (T), θ) + \frac{\partial φ^{j}}{\partial θ} λ_{α, j} (0) \\ + \int_{0}^{T} \frac{\partial}{\partial θ} (λ_{x, j} f_{θ}^{j} (q (s)) + λ_{α, j} g_{θ}^{j} (q (s))) d s . \end{matrix}$ (75)

where the states $x (s) \in M, α (s) \in N$ satisfy (73) and the co-states $λ_{x} (s) \in T_{x (s)}^{*} M, λ_{α} (s) \in T_{α (s)}^{*} N$ satisfy, in local charts $(U, Q)$ on $M$ and $(\bar{U}, \bar{Q})$ on $N$ ,

$\begin{matrix} {\dot{λ}}_{x, i} & = - \frac{\partial}{\partial q^{i}} (λ_{x, j} f_{θ}^{j} (q, \bar{q}, t) + λ_{α, j} g_{θ}^{j} (q, \bar{q}, t)), λ_{x, i} (T) = \frac{\partial F}{\partial q^{i}} (x (T), α (T), θ), \end{matrix}$ (76)

$\begin{matrix} {\dot{λ}}_{α, i} & = - \frac{\partial}{\partial {\bar{q}}^{i}} (λ_{x, j} f_{θ}^{j} (q, \bar{q}, t) + λ_{α, j} g_{θ}^{j} (q, \bar{q}, t)), λ_{α, i} (T) = \frac{\partial F}{\partial {\bar{q}}^{i}} (x (T), α (T), θ) . \end{matrix}$ (77)

Proof.

Define the augmented state space as $M^{'} = M \times N \times R^{n_{θ}}$ to include the states $x \in M, α (s) \in N$ and parameters $θ \in R^{n_{θ}}$ in the augmented state $x^{'} : = (x, α, θ) \in M^{'}$ . In addition, define the augmented dynamics $f_{aug} \in X (M^{'})$ as

${\dot{x}}^{'} = f_{aug} (x^{'}) = (\begin{matrix} f_{θ} (x, α) \\ g_{θ} (x, α) \\ 0 \end{matrix}), x^{'} (0) = x_{0}^{'} : = (\begin{matrix} x_{0} \\ φ_{θ} (x_{0}) \\ θ \end{matrix}) .$ (78)

This is an autonomous system with final state $x^{'} (T) = (x (T), α (T), θ)$ . Next, define the cost $C_{aug} : M^{'} \to R$ on the augmented space:

$C_{aug} (x^{'}) = F (x, α, θ) .$ (79)

Then Equation (74) can be rewritten as the evaluation of a terminal cost $C_{aug} (x^{'} (T))$ . The gradient $d (C_{aug} \circ Ψ_{f_{aug}}^{T})$ is given by an application of Equation (26). Split the co-state into $λ_{x}, λ_{α}, λ_{θ}$ ; then their components’ dynamics are as follows:

$\begin{matrix} {\dot{λ}}_{x, i} & = - \frac{\partial}{\partial q^{i}} (λ_{x, j} f_{θ}^{j} (q, \bar{q}, t) + λ_{α, j} g_{θ}^{j} (q, \bar{q}, t)), λ_{x} (T) = \frac{\partial F}{\partial q} (x (T), α (T), θ), \end{matrix}$ (80)

$\begin{matrix} {\dot{λ}}_{α, i} & = - \frac{\partial}{\partial {\bar{q}}^{i}} (λ_{x, j} f_{θ}^{j} (q, \bar{q}, t) + λ_{α, j} g_{θ}^{j} (q, \bar{q}, t)), λ_{α, i} (T) = \frac{\partial F}{\partial {\bar{q}}^{i}} (x (T), α (T), θ), \end{matrix}$ (81)

$\begin{matrix} {\dot{λ}}_{θ, i} & = - \frac{\partial}{\partial θ^{i}} (λ_{x, j} f_{θ}^{j} (q, \bar{q}, t) + λ_{α, j} g_{θ}^{j} (q, \bar{q}, t)), λ_{θ} (T) = \frac{\partial F}{\partial θ} (x (T), α (T), θ) . \end{matrix}$ (82)

Since $α (0) = φ_{θ} (x_{0})$ also depends on $θ$ , the total gradient of the cost with respect to $θ$ is given by

$\frac{\partial}{\partial θ^{i}} C_{f_{θ}}^{T} ((x_{0}, φ_{θ} (x_{0})), θ) = λ_{θ, i} (0) + \frac{\partial φ^{j}}{\partial θ^{i}} λ_{α, j} (0) .$ (83)

Integrate (82) to find $λ_{θ, i} (0)$ ; then Equation (75) is recovered. □

Augmented neural ODEs are universal function approximators ([25], Chapter 2). Potential applications of augmented neural ODEs on manifolds include, e.g., the optimization of guiding vector fields for path-following of closed or self-intersecting paths [60], where state augmentation sits at the core of formulating singularity-free guiding vector fields for self-intersecting paths. In the same context, discontinuous initializations $φ_{θ} (x_{0})$ allow globally stabilizing controllers to be represented for topologically non-trivial manifolds (e.g., the sphere $S^{2}$ ), on which smooth controllers cannot be globally stabilizing. A further degenerate application of Theorem 4 is obtained by removing x, i.e., fixing $x = 0$ and $f_{θ} (x, α) = 0$ in Equation (73). Then both the dynamics $g_{θ} (α)$ and initial condition $α (0) = φ_{θ} (0)$ are parameterized by $θ$ , allowing joint optimization of the parameters and initial condition. This is interesting for joint optimization and numerical continuation, e.g., [59].
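The role of the extra term $\frac{\partial φ^{j}}{\partial θ} λ_{α, j} (0)$ in Theorem 4 can be illustrated with scalar states. The sketch below assumes $M = N = R$ , parameters $θ = (a, b, c)$ , dynamics $f_{θ} = a \tanh (x) + α$ and $g_{θ} = - b α$ , initialization $φ_{θ} (x_{0}) = c x_{0}$ , and final cost $F = \frac{1}{2} (x (T)^{2} + α (T)^{2})$ ; these choices, the RK4 integrator, and all names are illustrative assumptions.

```python
import numpy as np

def rk4(rhs, z, t0, t1, steps=200):
    """Fixed-step RK4 for an autonomous system; t1 < t0 integrates backwards."""
    h = (t1 - t0) / steps
    for _ in range(steps):
        k1 = rhs(z)
        k2 = rhs(z + 0.5 * h * k1)
        k3 = rhs(z + 0.5 * h * k2)
        k4 = rhs(z + h * k3)
        z = z + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

def forward(theta, x0, T):
    a, b, c = theta
    rhs = lambda z: np.array([a * np.tanh(z[0]) + z[1], -b * z[1]])
    return rk4(rhs, np.array([x0, c * x0]), 0.0, T)   # alpha(0) = phi_theta(x0)

def cost(theta, x0, T):
    x, alpha = forward(theta, x0, T)
    return 0.5 * (x ** 2 + alpha ** 2)

def adjoint_gradient(theta, x0, T):
    a, b, c = theta
    xT, alphaT = forward(theta, x0, T)

    def rhs(z):
        x, al, lx, la = z[:4]
        return np.array([
            a * np.tanh(x) + al,                  # x dynamics
            -b * al,                              # alpha dynamics
            -lx * a * (1.0 - np.tanh(x) ** 2),    # co-state of x
            -(lx - b * la),                       # co-state of alpha
            -lx * np.tanh(x),                     # lambda_theta, component a
            la * al,                              # lambda_theta, component b
        ])

    z = rk4(rhs, np.array([xT, alphaT, xT, alphaT, 0.0, 0.0]), T, 0.0)
    lam_alpha0 = z[3]
    # The dynamics do not depend on c, so its gradient comes only from the
    # initialization term (d phi / d c) * lambda_alpha(0) = x0 * lambda_alpha(0).
    return np.array([z[4], z[5], x0 * lam_alpha0])

theta = np.array([0.5, 0.8, -0.6])
print(adjoint_gradient(theta, x0=1.2, T=1.0))
```

Here the gradient with respect to $c$ is produced entirely by the initialization term, which a plain application of Theorem 2 to the dynamics alone would miss.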

4. Neural ODEs on Lie Groups

Just as a neural ODE on a manifold is an NN-parameterized vector field in $X (M)$ (or, including time, $X (M \times R)$ ), a neural ODE on a Lie group can be seen as a parameterized vector field in $X (G)$ (or $X (G \times R)$ ). Similarly to Equation (38), this results in a dynamic system

$\dot{g} = f_{θ} (g, t), g (0) = g_{0} .$ (84)

Yet, Lie groups offer more structure than manifolds: the Lie algebra $g$ provides a canonical space to represent tangent vectors, and its dual $g^{*}$ provides a canonical space to represent the co-state. Similarly, canonical (exponential) charts offer a structure for integrating dynamic systems [41]. Frequently, dynamics on a Lie group induce dynamics on a manifold $M$ : by means of an action

$Φ : G \times M \to M; (g, x) \mapsto Φ (g, x),$ (85)

evolutions $g (t)$ induce evolutions $x (t) = Φ (g (t), x_{0})$ on $M$ . This makes neural ODEs on Lie groups interesting in their own right.

In this section, we describe optimizing (41) for the cost

$C_{f_{θ}}^{T} (g_{0}, θ) = F (Ψ_{f_{θ}}^{T} (g_{0}), θ) + \int_{0}^{T} r (Ψ_{f_{θ}}^{s} (g_{0}), θ, s) d s,$ (86)

with a final cost term F and a running cost term r. We highlight the extrinsic approach and two intrinsic approaches, where one of the latter is particular to Lie groups.

4.1. Extrinsic Neural ODEs on Lie Groups

The extrinsic formulation of neural ODEs on Lie groups was first introduced by [20] and applies ideas of [54] (see also Section 3.1.1). Given $G \subset G L (m, R)$ , this formulation treats the dynamic system (84) as a dynamic system on $R^{m^{2}}$ . Denote $vec : R^{m \times m} \to R^{m^{2}}$ as an invertible map that stacks the components of an input matrix into a component vector (in canonical coordinates on $R^{m \times m}$ and $R^{m^{2}}$ , though this choice is not required), and let ${proj}_{G} : R^{m \times m} \to G$ be a projection onto $G \subset R^{m \times m}$ . Further denote $A_{y} = {vec}^{- 1} (y)$ and $g_{y} = {proj}_{G} (A_{y})$ . A lift $f_{θ}^{↑} (y, t)$ can then be defined as

$f_{θ}^{↑} (y, t) = vec (A_{y} g_{y}^{- 1} f_{θ} (g_{y}, t)) .$ (87)

As was the case for extrinsic neural ODEs on manifolds, the cost gradient resulting from this optimization is well-defined and equivalent to any intrinsically defined procedure. However, the dimension $m^{2}$ of the vectorization can be significantly larger than the intrinsic dimension of the Lie group.
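A minimal sketch of the lift (87) for $G = S O (3)$ follows. The source does not fix a particular ${proj}_{G}$ ; here an SVD-based nearest-rotation projection is assumed, together with row-major stacking for $vec$ and a left-invariant example field $f_{θ} (g, t) = g Λ (θ)$ with constant $θ \in R^{3}$ . All function names are hypothetical.

```python
import numpy as np

def hat(w):
    """Lambda: R^3 -> so(3), the skew-symmetric matrix of w."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def vec(A):
    return A.reshape(-1)            # row-major stacking (assumed convention)

def unvec(y):
    return y.reshape(3, 3)

def proj_SO3(A):
    """Assumed projection: nearest rotation matrix via the SVD."""
    U, _, Vt = np.linalg.svd(A)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])   # enforce det = +1
    return U @ D @ Vt

def f_group(g, theta, t):
    """Example left-invariant dynamics on SO(3) with constant body velocity."""
    return g @ hat(theta)

def lifted_field(y, theta, t):
    """Lift of f_group to R^9 following Equation (87)."""
    A = unvec(y)
    g = proj_SO3(A)
    return vec(A @ g.T @ f_group(g, theta, t))       # g^{-1} = g^T on SO(3)

rng = np.random.default_rng(0)
g = proj_SO3(rng.normal(size=(3, 3)))
theta = np.array([0.1, -0.2, 0.3])
# On the group, A_y = g_y, so the lift reduces to vec(f_theta(g, t)).
print(np.allclose(lifted_field(vec(g), theta, 0.0), vec(f_group(g, theta, 0.0))))
```

Off the group, the factor $A_{y} g_{y}^{- 1}$ differs from the identity, so the lift extends the group dynamics smoothly to a neighborhood of $G$ in $R^{m \times m}$ .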

4.2. Intrinsic Neural ODEs on Lie Groups

Theorem 2 directly applies to optimization of neural ODEs on Lie groups, given the local exponential charts (20) and (21) on G. This does not make full use of the available structure on Lie groups. Frequently, dynamical systems are of a left-invariant form (88) or a right-invariant form (89)

$\dot{g} = g Λ (ρ_{θ}^{L} (g, t)),$ (88)

$\dot{g} = Λ (ρ_{θ}^{R} (g, t)) g .$ (89)

Denote $K (q) : T_{q} R^{n} \to R^{n}$ as the derivative of the exponential map (see [21] for details). Then the chart representatives $f_{θ}^{i}$ in a local exponential chart $(U_{h}, Q_{h})$ are

$f_{θ}^{L, i} (q, t) = {(K^{- 1})}_{j}^{i} (q) ρ^{L, j} (Q_{h}^{- 1} (q)),$ (90)

$f_{θ}^{R, i} (q, t) = {(K^{- 1})}_{j}^{i} (q) {Ad}_{Q_{h}^{- 1} (q)} ρ^{R, j} (Q_{h}^{- 1} (q)) .$ (91)

Application of Theorem 2 then requires computing $\frac{\partial}{\partial q^{j}} f_{θ}^{L, i} (q, t)$ or $\frac{\partial}{\partial q^{j}} f_{θ}^{R, i} (q, t)$ . However, this leads to significant computational overhead due to differentiation of the terms ${(K^{- 1})}_{j}^{i} (q)$ (see [21]). Instead of applying Theorem 2, i.e., expressing dynamics in local charts, the dynamics can also be expressed in the Lie algebra $g$ . Theorem 1 has a Hamiltonian form, which can be directly transformed into Hamiltonian equations on a Lie group (see also Appendix A). Applying this reasoning to Theorem 2, we arrive at the following form, which forgoes differentiating ${(K^{- 1})}_{j}^{i} (q)$ :

Theorem 5

(Left Generalized Adjoint Method on Matrix Lie Groups). Given the dynamics (88) and the cost (86), or the dynamics (89) with $ρ_{θ}^{L} (g, t) = {Ad}_{g^{- 1}} ρ_{θ}^{R} (g, t)$ , the parameter gradient $\frac{\partial}{\partial θ} C_{f_{θ}}^{T} (g_{0})$ of the cost is given by the integral equation

$\frac{\partial}{\partial θ} C_{f_{θ}}^{T} (g_{0}) = \frac{\partial F}{\partial θ} (g (T), θ) + \int_{0}^{T} \frac{\partial}{\partial θ} (λ_{g}^{⊤} ρ_{θ}^{L} (g, s) + r (g, θ, s)) d s,$ (92)

where the state $g (t) \in G$ and co-state $λ_{g} (t) \in R^{n}$ are the solutions of the system of equations

$\begin{matrix} \dot{g} & = f_{θ} (g, t), g (0) = g_{0}, \end{matrix}$ (93)

$\begin{matrix} {\dot{λ}}_{g} & = - d_{g}^{L} (λ_{g}^{⊤} ρ_{θ}^{L} (g, t) + r (g, θ, t)) + {ad}_{ρ_{θ}^{L} (g, t)}^{⊤} λ_{g}, λ_{g} (T) = d_{g}^{L} F (g (T), θ) . \end{matrix}$ (94)

Proof.

This is proven in two steps. First, define the time- and parameter-dependent control Hamiltonian $H_{c} : T^{*} M \times R^{n_{θ}} \times R \to R$ as

$H_{c} (x, λ, θ, t) = λ (f_{θ} (x, t)) + r (x, θ, t) = λ_{i} (f_{θ}^{i} (q, t)) + r (q, θ, t) .$ (95)

The equations for the state and co-state dynamics (45) and (46), respectively, of Theorem 2 follow as the Hamiltonian equations on $T^{*} M$ :

$\begin{matrix} {\dot{q}}^{j} & = \frac{\partial H_{c}}{\partial λ_{j}} = f_{θ}^{j} (q, t), \end{matrix}$ (96)

$\begin{matrix} {\dot{λ}}_{i} & = - \frac{\partial H_{c}}{\partial q^{i}} = - λ_{j} \frac{\partial}{\partial q^{i}} f_{θ}^{j} (q, t) - \frac{\partial r}{\partial q^{i}} . \end{matrix}$ (97)

And the integral Equation (44) reads

$\frac{\partial}{\partial θ} C_{f_{θ}}^{T} ((x_{0}, t_{0}), θ) = \frac{\partial F}{\partial θ} (x (T), θ) + \int_{0}^{T} \frac{\partial H_{c}}{\partial θ} d t .$ (98)

Second, rewrite the control Hamiltonian (95) on a Lie group G, i.e., $H_{c} : T^{*} G \times R^{n_{θ}} \times R \to R$ . By substituting $λ_{g} (t) = Λ^{*} L_{g}^{*} λ (t)$ (see also Equation (A6)), this induces $H_{c} : G \times g^{*} \times R^{n_{θ}} \times R \to R$ ,

$H_{c} (g, λ_{g}, θ, t) = λ_{g}^{⊤} ρ_{θ}^{L} (g, t) + r (g, θ, t) .$ (99)

Finally, Hamilton’s equations (96) and (97) are rewritten in their form on a matrix Lie group by means of (A7) and (A8), which recovers Equations (93) and (94):

$\begin{matrix} \dot{g} & = g Λ (\frac{\partial H_{c}}{\partial λ_{g}}), \end{matrix}$ (100)

$\begin{matrix} {\dot{λ}}_{g} & = - d_{g}^{L} H_{c} + {ad}_{\frac{\partial H_{c}}{\partial λ_{g}}}^{⊤} λ_{g} . \end{matrix}$ (101)

To find the final condition for $λ_{g}$ , use that $λ_{g} (t) = Λ^{*} L_{g}^{*} λ (t)$ :

$λ_{g} (T) = Λ^{*} L_{g}^{*} λ (T) = Λ^{*} L_{g}^{*} d F (g (T), θ) = d_{g}^{L} F (g (T), θ) .$ (102)

□

Similar equations also hold on abstract (non-matrix) Lie groups, see [21]. Compared to the extrinsic method of Section 4.1, Theorem 5 has the advantage that the dimension of the co-state $λ_{g}$ is as low as possible. Compared to the chart-based approach on Lie groups, Theorem 5 forgoes differentiating through the terms ${(K^{- 1})}_{j}^{i} (q)$ , avoiding overhead. Compared to a chart-based approach on manifolds, the choice of charts is also canonical on Lie groups. Although the Lie group approach avoids many of the pitfalls of intrinsic neural ODEs on manifolds, implementation in existing neural ODE packages is currently cumbersome: the adjoint sensitivity equations (94) have a non-standard form, requiring adapted dynamics of the co-state $λ_{g}$ , but these equations are rarely intended for modification in existing packages. Packages for geometry-preserving integrators on Lie groups, such as [41], are also not readily available for arbitrary Lie groups.
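As a sanity check of Theorem 5, the following sketch evaluates (92) and (94) on $S O (3)$ in a deliberately simple special case: $ρ_{θ}^{L} (g, t) = θ \in R^{3}$ is constant, there is no running cost, and $F (g) = \frac{1}{2} ∥ g - g_{target} ∥^{2}$ (Frobenius norm). Then the $d_{g}^{L}$ term in (94) vanishes, the co-state obeys $\dot{λ}_{g} = {ad}_{θ}^{⊤} λ_{g}$ with ${ad}_{θ} = Λ (θ)$ , and the gradient is $\int_{0}^{T} λ_{g} d s$ . The cost, the closed-form flow, the RK4 integrator, and all names are assumptions made for illustration.

```python
import numpy as np

def hat(w):
    """Lambda: R^3 -> so(3); it also represents ad_w on so(3) ~ R^3."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

BASIS = [hat(e) for e in np.eye(3)]

def expm_so3(w):
    """Rodrigues' formula for the matrix exponential of hat(w)."""
    a = np.linalg.norm(w)
    W = hat(w)
    if a < 1e-12:
        return np.eye(3) + W
    return np.eye(3) + np.sin(a) / a * W + (1.0 - np.cos(a)) / a ** 2 * (W @ W)

def final_cost(g, g_target):
    return 0.5 * np.sum((g - g_target) ** 2)

def left_differential(g, g_target):
    """Components d/d(eps) F(g exp(eps * hat(e_i))) at eps = 0."""
    dF = g - g_target
    return np.array([np.sum(dF * (g @ E)) for E in BASIS])

def cost(theta, g0, g_target, T):
    return final_cost(g0 @ expm_so3(T * theta), g_target)

def adjoint_gradient(theta, g0, g_target, T, steps=200):
    gT = g0 @ expm_so3(T * theta)        # flow of g' = g hat(theta)
    lam_T = left_differential(gT, g_target)

    def rhs(z):
        lam = z[:3]
        # co-state: ad_theta^T lam ; lambda_theta: -d/dtheta (lam^T theta)
        return np.concatenate([hat(theta).T @ lam, -lam])

    z = np.concatenate([lam_T, np.zeros(3)])
    h = -T / steps                        # integrate backwards from T to 0
    for _ in range(steps):
        k1 = rhs(z)
        k2 = rhs(z + 0.5 * h * k1)
        k3 = rhs(z + 0.5 * h * k2)
        k4 = rhs(z + h * k3)
        z = z + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return z[3:]                          # lambda_theta(0)

g_target = expm_so3(np.array([0.3, -0.2, 0.5]))
theta = np.array([0.1, 0.4, -0.3])
print(adjoint_gradient(theta, np.eye(3), g_target, T=1.0))
```

Note that the co-state lives in $R^{3}$ rather than $R^{9}$ , and no derivative of the exponential map enters the computation.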

4.3. Extensions

The proof of Theorem 5 relied on finding a control Hamiltonian formulation for Theorem 2. This approach generalizes to methods in Section 3.2, which rely on the use of Theorem 1. This is because Theorem 1 itself has a Hamiltonian form ([21,54]).

A further straightforward extension of the methods presented in this section is given by port-Hamiltonian neural ODEs on Lie groups [20]. In [20], these are systems with a configuration on a Lie group G and momentum on $g^{*}$ . In terms of the theory presented above, such port-Hamiltonian dynamics can be phrased as a dynamic system on a product Lie group $G \times g^{*}$ (taking vector addition as the group operation on $g^{*}$ ), recovering both extrinsic [20] and intrinsic [21] port-Hamiltonian neural ODEs on Lie groups.

5. Discussion

We discuss advantages and disadvantages of the main flavors of the presented formulations for manifold neural ODEs, expanding on the previous sections. We focus on extrinsic (embedding dynamics in $R^{N}$ ) and intrinsic (integrating in local charts) formulations. The prior comments can be summarized as follows:

The extrinsic formulation is readily implemented if the low-dimensional manifold $M$ and an embedding into $R^{N}$ are known. This comes at the possible cost of geometric inexactness and a higher dimension of the co-state and sensitivity equations.
The co-state in the intrinsic formulation has a generally lower dimension, which reduces the dimension of the sensitivity equations. The chart-based formulation also guarantees geometrically exact integration of dynamics. This comes at the mild cost of having to define local charts and chart-transitions.

This dimensionality reduction is unlikely to have a high impact when the manifold $M$ is known and low-dimensional, e.g., for the sphere $M = S^{2}$ or similar manifolds. However, when applying the manifold hypothesis to high-dimensional data, there might be non-trivial latent manifolds for which the embedding is not immediate and where the latent manifold is of a much lower dimension than the embedding data manifold. Then the intrinsic method becomes difficult to avoid. If geometric exactness of the integration is desired, local charts also need to be defined for the extrinsic approach, in which case the intrinsic approach may offer further advantages.

In order to derive neural ODEs on Lie groups, three approaches are possible: the extrinsic and intrinsic formulations on manifolds directly carry over to matrix Lie groups, embedding $G \subset G L (m, R)$ in $R^{m^{2}}$ or using local exponential charts, respectively. A third option is a novel intrinsic method for neural ODEs on matrix Lie groups, which makes full use of the Lie group structure by phrasing dynamics on $g$ (as is more common on Lie groups) and the co-state on $g^{*}$ , avoiding difficulties of the chart-based formalism in differentiating extra terms.

Prior comments on advantages and disadvantages of these flavors can be summarized as follows:

The extrinsic formulation on matrix Lie groups can come at a much higher cost than that on manifolds, since the intrinsic dimension of G can be much lower than $m^{2}$ , which leads to a higher dimension of the co-state and sensitivity equations. Geometrically exact integration procedures are more readily available for matrix Lie groups, integrating $\dot{g}$ in local exponential charts.
The chart-based formulation on matrix Lie groups struggles when dynamics are not naturally phrased in local charts. This is common; dynamics are often more naturally phrased on $g$ . This difficulty is alleviated by the algebra-based formulation on matrix Lie groups. Both are intrinsic approaches that feature co-state dynamics of the lowest possible dimension. However, the algebra-based approach still lacks a readily available software implementation.
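To make the dimension argument concrete, the following short sketch compares the co-state dimension of the extrinsic embedding in $R^{m^{2}}$ with the intrinsic dimension of the Lie algebra for two common groups. The helper names are hypothetical; the dimension formulas for SO(n) and SE(n) are standard, with SE(n) represented by homogeneous $(n+1) \times (n+1)$ matrices.

```python
def so_dims(n):
    """SO(n) in GL(n, R): (intrinsic dimension, embedding dimension m^2)."""
    return n * (n - 1) // 2, n * n

def se_dims(n):
    """SE(n) in GL(n+1, R): (intrinsic dimension, embedding dimension m^2)."""
    return n * (n + 1) // 2, (n + 1) ** 2

for name, (intrinsic, extrinsic) in {"SO(3)": so_dims(3),
                                     "SE(3)": se_dims(3)}.items():
    print(f"{name}: co-state dimension {intrinsic} (intrinsic) "
          f"vs {extrinsic} (embedded in R^(m^2))")
```

For SE(3), for example, the intrinsic co-state has 6 components, whereas the embedded co-state has 16.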

The authors believe that the algebra-based formulation is more convenient in principle and consider software implementations of the algebra-based approach as possible future work.

In summary, we presented a unified, geometric approach to extend various methods for neural ODEs on $R^{N}$ to neural ODEs on manifolds and Lie groups. Optimization of neural ODEs on manifolds was based on the adjoint method on manifolds. Given a novel cost function C and neural ODE architecture f, the strategy to present the results in a unified fashion was to identify a suitable augmented manifold $M_{aug}$ , augmented dynamics $f_{aug} \in X (M_{aug})$ , and cost $C_{aug} : M_{aug} \to R$ such that the original cost function can be rephrased as $C = C_{aug} \circ Ψ_{f_{aug}}^{T}$ . To further derive optimization of intrinsic neural ODEs on Lie groups, we found a Hamiltonian formulation of the adjoint method on manifolds and subsequently transformed it into Hamiltonian equations on a matrix Lie group.

Appendix A. Hamiltonian Dynamics on Lie Groups

We briefly review Hamiltonian systems on manifolds and matrix Lie groups (see also [21], App. A1).

Given a manifold $Q$ with coordinate maps $Q^{i} : Q \to R$ and $p_{i}$ in the basis $d Q^{i}$ on $T_{q}^{*} Q$ , we define the symplectic form $ω \in Ω^{2} (T^{*} Q)$ as

ω = d p_{i} \land d Q^{i} .

(A1)

A Hamiltonian $H \in C^{\infty} (T^{*} Q, R)$ implicitly defines a unique vector field $X_{H} \in X (T^{*} Q)$ by requiring, for all $Y \in X (T^{*} Q)$ ,

d H (Y) = ω (X_{H}, Y) .

(A2)

In coordinates, $X_{H}$ has the components

\dot{q}^{i} = \frac{\partial H}{\partial p_{i}},

(A3)

\dot{p}_{i} = - \frac{\partial H}{\partial q^{i}} .

(A4)
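As a small numerical illustration of (A3) and (A4), the sketch below evaluates the components of $X_{H}$ by central finite differences. The pendulum Hamiltonian, the evaluation point, and the function names are assumptions made for this example only.

```python
import math

def hamiltonian(q, p):
    """Pendulum with unit mass, length, and gravity (illustrative choice)."""
    return 0.5 * p * p + (1.0 - math.cos(q))

def hamiltonian_vector_field(H, q, p, eps=1e-6):
    """Components (q_dot, p_dot) of X_H from (A3)-(A4), via central differences."""
    dH_dq = (H(q + eps, p) - H(q - eps, p)) / (2.0 * eps)
    dH_dp = (H(q, p + eps) - H(q, p - eps)) / (2.0 * eps)
    return dH_dp, -dH_dq

q, p = 0.3, -0.7
q_dot, p_dot = hamiltonian_vector_field(hamiltonian, q, p)

# Analytic values for this Hamiltonian: q_dot = p, p_dot = -sin(q).
print(q_dot, p)
print(p_dot, -math.sin(q))
```

In practice, automatic differentiation would typically replace the finite differences; they are used here only to keep the sketch dependency-free.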

On a Lie group G, the group structure allows the identification of $T^{*} G \equiv G \times g^{*} \equiv G \times R^{n}$ , e.g., using the pullback ${L_{g}}^{*} : T_{g}^{*} G \to g^{*}$ of the left-translation map $L_{g} : G \to G$ , and $Λ^{*} : g^{*} \to R^{n}$ , to define $P_{g} \in R^{n}$ as

P_{g} = Λ^{*} L_{g}^{*} P .

(A5)

Then the left Hamiltonian $H^{L} : G \times g^{*} \to R$ is defined in terms of $H : T^{*} G \to R$ as

H^{L} (g, P_{g}) = H (g, P) .

(A6)

For a matrix Lie group, the left Hamiltonian equations read as follows:

\dot{g} = g Λ \left( \frac{\partial H^{L}}{\partial P} \right),

(A7)

\dot{P} = - d_{g}^{L} H^{L} + \mathrm{ad}_{\frac{\partial H^{L}}{\partial P}}^{⊤} P,

(A8)

with $Λ : R^{n} \to g$ as in (13) and $d_{g}^{L} H \in R^{n}$ as in (23).
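A minimal sketch of (A7) and (A8) for the free rigid body on SO(3) is given below. It assumes that $Λ$ is the usual hat map $R^{3} \to so(3)$, so that $\mathrm{ad}_{ω}$ acts as $ω \times (\cdot)$ and $\mathrm{ad}_{ω}^{⊤} P = P \times ω$, and that $H^{L} (g, P) = \frac{1}{2} \sum_{i} P_{i}^{2} / J_{i}$ does not depend on g, so that $d_{g}^{L} H^{L} = 0$. The inertia values, initial momentum, step size, and the integrator (an exponential update for g combined with a classical Runge-Kutta step for P) are illustrative choices, not the article's implementation.

```python
import math

J = (1.0, 2.0, 3.0)  # principal moments of inertia (illustrative values)

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def hat(w):
    """Assumed Lambda: R^3 -> so(3), the usual hat map."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def exp_so3(w):
    """Rodrigues' formula for the matrix exponential exp(hat(w))."""
    th = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if th < 1e-12:
        a, b = 1.0, 0.5
    else:
        a, b = math.sin(th) / th, (1.0 - math.cos(th)) / th ** 2
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

def energy(P):
    return 0.5 * sum(p * p / j for p, j in zip(P, J))

def omega(P):
    """dH^L/dP for the kinetic-energy Hamiltonian."""
    return tuple(p / j for p, j in zip(P, J))

def P_dot(P):
    """(A8) with d_g^L H^L = 0: ad^T_omega P = P x omega."""
    return cross(P, omega(P))

def rk4(P, h):
    k1 = P_dot(P)
    k2 = P_dot(tuple(p + 0.5 * h * k for p, k in zip(P, k1)))
    k3 = P_dot(tuple(p + 0.5 * h * k for p, k in zip(P, k2)))
    k4 = P_dot(tuple(p + h * k for p, k in zip(P, k3)))
    return tuple(p + h * (a + 2 * b + 2 * c + d) / 6.0
                 for p, a, b, c, d in zip(P, k1, k2, k3, k4))

g = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
P = (1.0, 0.5, -0.3)
E0 = energy(P)
h = 0.01
for _ in range(1000):
    # (A7): g_dot = g Lambda(omega), discretized as g <- g exp(h hat(omega)).
    g = matmul(g, exp_so3(tuple(h * w for w in omega(P))))
    P = rk4(P, h)

gTg = matmul([list(r) for r in zip(*g)], g)
orth_err = max(abs(gTg[i][j] - (1.0 if i == j else 0.0))
               for i in range(3) for j in range(3))
energy_err = abs(energy(P) - E0)
print(f"orthogonality error: {orth_err:.2e}, energy error: {energy_err:.2e}")
```

Because g is updated by right-multiplication with a group element, it remains in SO(3) up to round-off, which mirrors the geometric exactness discussed in Section 5. Under the stated assumptions, (A8) reduces to Euler's equations for the body angular momentum.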

Author Contributions

Conceptualization, Y.P.W.; methodology, Y.P.W.; software, Y.P.W.; validation, Y.P.W.; formal analysis, Y.P.W.; investigation, Y.P.W.; resources, S.S.; data curation, Y.P.W.; writing—original draft preparation, Y.P.W.; writing—review and editing, Y.P.W., F.C., and S.S.; visualization, Y.P.W.; supervision, F.C. and S.S.; project administration, S.S.; funding acquisition, S.S. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Funding Statement

This research received no external funding.


References

1. Chen R.T.Q., Rubanova Y., Bettencourt J., Duvenaud D. Neural Ordinary Differential Equations; Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018); Montreal, QC, Canada. 3–8 December 2018; pp. 31–60. Available online: http://arxiv.org/abs/1806.07366 (accessed on 13 August 2025).
2. Massaroli S., Poli M., Park J., Yamashita A., Asama H. Dissecting neural ODEs. Adv. Neural Inf. Process. Syst. 2020;2020:3952–3963. Available online: http://arxiv.org/abs/2002.08071 (accessed on 13 August 2025).
3. Zakwan M., Natale L.D., Svetozarevic B., Heer P., Jones C., Trecate G.F. Physically Consistent Neural ODEs for Learning Multi-Physics Systems. IFAC-PapersOnLine. 2023;56:5855–5860. doi: 10.1016/j.ifacol.2023.10.079.
4. Sholokhov A., Liu Y., Mansour H., Nabi S. Physics-informed neural ODE (PINODE): Embedding physics into models using collocation points. Sci. Rep. 2023;13:10166. doi: 10.1038/s41598-023-36799-6.
5. Ghanem P., Demirkaya A., Imbiriba T., Ramezani A., Danziger Z., Erdogmus D. Learning Physics Informed Neural ODEs with Partial Measurements. Proc. AAAI Conf. Artif. Intell. 2024;AAAI-25. doi: 10.1609/aaai.v39i16.33846. Available online: http://arxiv.org/abs/2412.08681 (accessed on 13 August 2025).
6. Massaroli S., Poli M., Califano F., Park J., Yamashita A., Asama H. Optimal Energy Shaping via Neural Approximators. SIAM J. Appl. Dyn. Syst. 2022;21:2126–2147. doi: 10.1137/21M1414279.
7. Niu H., Zhou Y., Yan X., Wu J., Shen Y., Yi Z., Hu J. On the applications of neural ordinary differential equations in medical image analysis. Artif. Intell. Rev. 2024;57:236. doi: 10.1007/s10462-024-10894-0.
8. Oh Y., Kam S., Lee J., Lim D.Y., Kim S., Bui A.A.T. Comprehensive Review of Neural Differential Equations for Time Series Analysis. arXiv. 2025. Available online: http://arxiv.org/abs/2502.09885 (accessed on 13 August 2025).
9. Poli M., Massaroli S., Scimeca L., Chun S., Oh S.J., Yamashita A., Asama H., Park J., Garg A. Neural Hybrid Automata: Learning Dynamics with Multiple Modes and Stochastic Transitions; Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021); Online. 6–14 December 2021; pp. 9977–9989. Available online: http://arxiv.org/abs/2106.04165 (accessed on 13 August 2025).
10. Chen R.T.Q., Amos B., Nickel M. Learning Neural Event Functions for Ordinary Differential Equations; Proceedings of the Ninth International Conference on Learning Representations (ICLR 2021); Virtual. 3–7 May 2021. Available online: http://arxiv.org/abs/2011.03902 (accessed on 13 August 2025).
11. Davis J.Q., Choromanski K., Varley J., Lee H., Slotine J.J., Likhosterov V., Weller A., Makadia A., Sindhwani V. Time Dependence in Non-Autonomous Neural ODEs. arXiv. 2020. doi: 10.48550/arXiv.2005.01906. Available online: http://arxiv.org/abs/2005.01906 (accessed on 13 August 2025).
12. Dupont E., Doucet A., Teh Y.W. Augmented Neural ODEs; Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS 2019); Vancouver, BC, Canada. 8–14 December 2019. Available online: http://arxiv.org/abs/1904.01681 (accessed on 13 August 2025).
13. Chu H., Miyatake Y., Cui W., Wei S., Furihata D. Structure-Preserving Physics-Informed Neural Networks with Energy or Lyapunov Structure; Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI 24); Jeju, Republic of Korea. 3–9 August 2024.
14. Kütük M., Yücel H. Energy dissipation preserving physics informed neural network for Allen–Cahn equations. J. Comput. Sci. 2025;87:102577. doi: 10.1016/j.jocs.2025.102577.
15. Bullo F., Murray R.M. Tracking for fully actuated mechanical systems: A geometric framework. Automatica. 1999;35:17–34. doi: 10.1016/S0005-1098(98)00119-8.
16. Marsden J.E., Ratiu T.S. Introduction to Mechanics and Symmetry. Volume 17. Springer; New York, NY, USA: 1999.
17. Whiteley N., Gray A., Rubin-Delanchy P. Statistical exploration of the Manifold Hypothesis. arXiv. 2025. Available online: http://arxiv.org/abs/2208.11665 (accessed on 13 August 2025).
18. Lou A., Lim D., Katsman I., Huang L., Jiang Q., Lim S.N., De Sa C. Neural Manifold Ordinary Differential Equations; Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020); Online. 6–12 December 2020. Available online: http://arxiv.org/abs/2006.10254 (accessed on 13 August 2025).
19. Falorsi L., Forré P. Neural Ordinary Differential Equations on Manifolds. arXiv. 2020. doi: 10.48550/arXiv.2006.06663. Available online: http://arxiv.org/abs/2006.06663 (accessed on 13 August 2025).
20. Duong T., Altawaitan A., Stanley J., Atanasov N. Port-Hamiltonian Neural ODE Networks on Lie Groups for Robot Dynamics Learning and Control. IEEE Trans. Robot. 2024;40:3695–3715. doi: 10.1109/TRO.2024.3428433.
21. Wotte Y.P., Califano F., Stramigioli S. Optimal potential shaping on SE(3) via neural ordinary differential equations on Lie groups. Int. J. Robot. Res. 2024;43:2221–2244. doi: 10.1177/02783649241256044.
22. Floryan D., Graham M.D. Data-driven discovery of intrinsic dynamics. Nat. Mach. Intell. 2022;4:1113–1120. doi: 10.1038/s42256-022-00575-4.
23. Andersdotter E., Persson D., Ohlsson F. Equivariant Manifold Neural ODEs and Differential Invariants. arXiv. 2024. doi: 10.48550/arXiv.2401.14131. Available online: http://arxiv.org/abs/2401.14131 (accessed on 13 August 2025).
24. Wotte Y.P. Master’s Thesis. University of Twente; Enschede, The Netherlands: 2021. Optimal Potential Shaping on SE(3) via Neural Approximators.
25. Kidger P. Ph.D. Thesis. Mathematical Institute, University of Oxford; Oxford, UK: 2022. On Neural Differential Equations.
26. Gholami A., Keutzer K., Biros G. ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs; Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 19); Macao. 10–16 August 2019. Available online: http://arxiv.org/abs/1902.10298 (accessed on 13 August 2025).
27. Kidger P., Morrill J., Foster J., Lyons T.J. Neural Controlled Differential Equations for Irregular Time Series; Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020); Online. 6–12 December 2020. Available online: http://arxiv.org/abs/2005.08926 (accessed on 13 August 2025).
28. Li X., Wong T.L., Chen R.T.Q., Duvenaud D. Scalable Gradients for Stochastic Differential Equations; Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020); Online. 26–28 August 2020. Available online: http://arxiv.org/abs/2001.01328 (accessed on 13 August 2025).
29. Liu Y., Cheng J., Zhao H., Xu T., Zhao P., Tsung F., Li J., Rong Y. SEGNO: Generalizing Equivariant Graph Neural Networks with Physical Inductive Biases; Proceedings of the 12th International Conference on Learning Representations (ICLR 2024); Vienna, Austria. 7–11 May 2024. Available online: http://arxiv.org/abs/2308.13212 (accessed on 13 August 2025).
30. Greydanus S., Dzamba M., Yosinski J. Hamiltonian Neural Networks; Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); Vancouver, BC, Canada. 8–14 December 2019. Available online: http://arxiv.org/abs/1906.01563 (accessed on 13 August 2025).
31. Finzi M., Wang K.A., Wilson A.G. Simplifying Hamiltonian and Lagrangian Neural Networks via Explicit Constraints; Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020); Online. 6–12 December 2020. Available online: http://arxiv.org/abs/2010.13581 (accessed on 13 August 2025).
32. Cranmer M., Greydanus S., Hoyer S., Research G., Battaglia P., Spergel D., Ho S. Lagrangian Neural Networks; Proceedings of the ICLR 2020 Deep Differential Equations Workshop; Addis Ababa, Ethiopia. 26 April 2020. Available online: http://arxiv.org/abs/2003.04630 (accessed on 13 August 2025).
33. Bhattoo R., Ranu S., Krishnan N.M. Learning the Dynamics of Particle-based Systems with Lagrangian Graph Neural Networks. Mach. Learn. Sci. Technol. 2023;4:015003. doi: 10.1088/2632-2153/acb03e.
34. Xiao S., Zhang J., Tang Y. Generalized Lagrangian Neural Networks. arXiv. 2024. doi: 10.48550/arXiv.2401.03728. Available online: http://arxiv.org/abs/2401.03728 (accessed on 13 August 2025).
35. Rettberg J., Kneifl J., Herb J., Buchfink P., Fehr J., Haasdonk B. Data-Driven Identification of Latent Port-Hamiltonian Systems. arXiv. 2024:37–99. Available online: http://arxiv.org/abs/2408.08185 (accessed on 13 August 2025).
36. Duong T., Atanasov N. Hamiltonian-based Neural ODE Networks on the SE(3) Manifold For Dynamics Learning and Control; Proceedings of the Robotics: Science and Systems (RSS 2021); Online. 12–16 July 2021. Available online: http://arxiv.org/abs/2106.12782v3 (accessed on 13 August 2025).
37. Fronk C., Petzold L. Training stiff neural ordinary differential equations with explicit exponential integration methods. Chaos. 2025;35:33154. doi: 10.1063/5.0251475.
38. Kloberdanz E., Le W. Artificial Neural Networks and Machine Learning—ICANN 2023. Volume 14262. Springer; Cham, Switzerland: 2023. S-SOLVER: Numerically Stable Adaptive Step Size Solver for Neural ODEs; pp. 388–400. (Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
39. Akhtar S.W. On Tuning Neural ODE for Stability, Consistency and Faster Convergence. SN Comput. Sci. 2025;6:318. doi: 10.1007/s42979-025-03832-6.
40. Zhu A., Jin P., Zhu B., Tang Y. On Numerical Integration in Neural Ordinary Differential Equations; Proceedings of the 39th International Conference on Machine Learning (ICML 2022); Baltimore, MD, USA. 17–23 July 2022; Baltimore, MD, USA: ML Research Press; 2022. pp. 27527–27547. Available online: http://arxiv.org/abs/2206.07335 (accessed on 13 August 2025).
41. Munthe-Kaas H. High order Runge-Kutta methods on manifolds. Appl. Numer. Math. 1999;29:115–127. doi: 10.1016/S0168-9274(98)00030-0.
42. Celledoni E., Marthinsen H., Owren B. An introduction to Lie group integrators—Basics, New Developments and Applications. J. Comput. Phys. 2014;257:1040–1061. doi: 10.1016/j.jcp.2012.12.031.
43. Ma Y., Dixit V., Innes M.J., Guo X., Rackauckas C. A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential Equation Solutions; Proceedings of the 2021 IEEE High Performance Extreme Computing Conference (HPEC 2021); Online. 20–24 September 2021.
44. Saemundsson S., Terenin A., Hofmann K., Deisenroth M.P. Variational Integrator Networks for Physically Structured Embeddings; Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS 2020); Online. 26–28 August 2020; pp. 3078–3087. Available online: http://arxiv.org/abs/1910.09349 (accessed on 13 August 2025).
45. Desai S.A., Mattheakis M., Roberts S.J. Variational integrator graph networks for learning energy-conserving dynamical systems. Phys. Rev. E. 2021;104:035310. doi: 10.1103/PhysRevE.104.035310.
46. Bobenko A.I., Suris Y.B. Mathematical Physics Discrete Time Lagrangian Mechanics on Lie Groups, with an Application to the Lagrange Top. Commun. Math. Phys. 1999;204:147–188. doi: 10.1007/s002200050642.
47. Marsden J.E., Pekarsky S., Shkoller S., West M. Variational Methods, Multisymplectic Geometry and Continuum Mechanics. J. Geom. Phys. 2001;38:253–284. doi: 10.1016/S0393-0440(00)00066-8.
48. Duruisseaux V., Duong T., Leok M., Atanasov N. Lie Group Forced Variational Integrator Networks for Learning and Control of Robot Systems; Proceedings of the 5th Annual Conference on Learning for Dynamics and Control; Philadelphia, PA, USA. 15–16 June 2023; pp. 1–21. Available online: http://arxiv.org/abs/2211.16006 (accessed on 13 August 2025).
49. Lee J.M. Introduction to Smooth Manifolds. 2nd ed. Springer; New York, NY, USA: 2012.
50. Hall B.C. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction. Volume 222. Springer; Berlin/Heidelberg, Germany: 2015. Graduate Texts in Mathematics (GTM).
51. Solà J., Deray J., Atchuthan D. A micro Lie theory for state estimation in robotics. arXiv. 2021. doi: 10.48550/arXiv.1812.01537. Available online: http://arxiv.org/abs/1812.01537 (accessed on 13 August 2025).
52. Visser M., Stramigioli S., Heemskerk C. Cayley-Hamilton for roboticists. IEEE Int. Conf. Intell. Robot. Syst. 2006;1:4187–4192. doi: 10.1109/IROS.2006.281911.
53. Robbins H., Monro S. A Stochastic Approximation Method. Ann. Math. Stat. 1951;22:400–407. doi: 10.1214/aoms/1177729586.
54. Falorsi L., de Haan P., Davidson T.R., Forré P. Reparameterizing Distributions on Lie Groups; Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019); Naha, Okinawa, Japan. 16–18 April 2019. Available online: http://arxiv.org/abs/1903.02958 (accessed on 13 August 2025).
55. White A., Kilbertus N., Gelbrecht M., Boers N. Stabilized Neural Differential Equations for Learning Dynamics with Explicit Constraints; Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023); New Orleans, LA, USA. 10–16 December 2023. Available online: http://arxiv.org/abs/2306.09739 (accessed on 13 August 2025).
56. Poli M., Massaroli S., Yamashita A., Asama H., Park J. TorchDyn: A Neural Differential Equations Library. arXiv. 2020. doi: 10.48550/arXiv.2009.09346. Available online: http://arxiv.org/abs/2009.09346 (accessed on 13 August 2025).
57. Schaft A.V.D., Jeltsema D. Port-Hamiltonian Systems Theory: An Introductory Overview. Volume 1. Now Publishers Inc.; Hanover, MA, USA: 2014. pp. 173–378.
58. Rashad R., Califano F., van der Schaft A.J., Stramigioli S. Twenty years of distributed port-Hamiltonian systems: A literature review. IMA J. Math. Control Inf. 2020;37:1400–1422. doi: 10.1093/imamci/dnaa018.
59. Wotte Y.P., Dummer S., Botteghi N., Brune C., Stramigioli S., Califano F. Discovering efficient periodic behaviors in mechanical systems via neural approximators. Optim. Control Appl. Methods. 2023;44:3052–3079. doi: 10.1002/oca.3025.
60. Yao W. A Singularity-Free Guiding Vector Field for Robot Navigation. Springer; Cham, Switzerland: 2023. pp. 159–190.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Not applicable.

[B1-entropy-27-00878] 1.Chen R.T.Q., Rubanova Y., Bettencourt J., Duvenaud D. Neural Ordinary Differential Equations; Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018); Montreal, QC, Canada. 3–8 December 2018; [(accessed on 13 August 2025)]. pp. 31–60. Available online: http://arxiv.org/abs/1806.07366. [Google Scholar]

[B2-entropy-27-00878] 2.Massaroli S., Poli M., Park J., Yamashita A., Asama H. Dissecting neural ODEs. [(accessed on 13 August 2025)];Adv. Neural Inf. Process. Syst. 2020 2020:3952–3963. Available online: http://arxiv.org/abs/2002.08071. [Google Scholar]

[B3-entropy-27-00878] 3.Zakwan M., Natale L.D., Svetozarevic B., Heer P., Jones C., Trecate G.F. Physically Consistent Neural ODEs for Learning Multi-Physics Systems. IFAC-PapersOnLine. 2023;56:5855–5860. doi: 10.1016/j.ifacol.2023.10.079. [DOI] [Google Scholar]

[B4-entropy-27-00878] 4.Sholokhov A., Liu Y., Mansour H., Nabi S. Physics-informed neural ODE (PINODE): Embedding physics into models using collocation points. Sci. Rep. 2023;13:10166. doi: 10.1038/s41598-023-36799-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B5-entropy-27-00878] 5.Ghanem P., Demirkaya A., Imbiriba T., Ramezani A., Danziger Z., Erdogmus D. Learning Physics Informed Neural ODEs with Partial Measurements. [(accessed on 13 August 2025)];Proc. AAAI Conf. Artif. Intell. 2024 AAAI-25 doi: 10.1609/aaai.v39i16.33846. Available online: http://arxiv.org/abs/2412.08681. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B6-entropy-27-00878] 6.Massaroli S., Poli M., Califano F., Park J., Yamashita A., Asama H. Optimal Energy Shaping via Neural Approximators. SIAM J. Appl. Dyn. Syst. 2022;21:2126–2147. doi: 10.1137/21M1414279. [DOI] [Google Scholar]

[B7-entropy-27-00878] 7.Niu H., Zhou Y., Yan X., Wu J., Shen Y., Yi Z., Hu J. On the applications of neural ordinary differential equations in medical image analysis. Artif. Intell. Rev. 2024;57:236. doi: 10.1007/s10462-024-10894-0. [DOI] [Google Scholar]

[B8-entropy-27-00878] 8.Oh Y., Kam S., Lee J., Lim D.Y., Kim S., Bui A.A.T. Comprehensive Review of Neural Differential Equations for Time Series Analysis. [(accessed on 13 August 2025)];arXiv. 2025 Available online: http://arxiv.org/abs/2502.09885.2502.09885 [Google Scholar]

[B9-entropy-27-00878] 9.Poli M., Massaroli S., Scimeca L., Chun S., Oh S.J., Yamashita A., Asama H., Park J., Garg A. Neural Hybrid Automata: Learning Dynamics with Multiple Modes and Stochastic Transitions; Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021); Online. 6–14 December 2021; [(accessed on 13 August 2025)]. pp. 9977–9989. Available online: http://arxiv.org/abs/2106.04165. [Google Scholar]

[B10-entropy-27-00878] 10.Chen R.T.Q., Amos B., Nickel M. Learning Neural Event Functions for Ordinary Differential Equations; Proceedings of the Ninth International Conference on Learning Representations (ICLR 2021); Virtual. 3–7 May 2021; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/2011.03902. [Google Scholar]

[B11-entropy-27-00878] 11.Davis J.Q., Choromanski K., Varley J., Lee H., Slotine J.J., Likhosterov V., Weller A., Makadia A., Sindhwani V. Time Dependence in Non-Autonomous Neural ODEs. [(accessed on 13 August 2025)];arXiv. 2020 doi: 10.48550/arXiv.2005.01906. Available online: http://arxiv.org/abs/2005.01906.2005.01906 [DOI] [Google Scholar]

[B12-entropy-27-00878] 12.Dupont E., Doucet A., Teh Y.W. Augmented Neural ODEs; Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS 2019); Vancouver, BC, Canada. 8–14 December 2019; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/1904.01681. [Google Scholar]

[B13-entropy-27-00878] 13.Chu H., Miyatake Y., Cui W., Wei S., Furihata D. Structure-Preserving Physics-Informed Neural Networks with Energy or Lyapunov Structure; Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI 24); Jeju, Republic of Korea. 3–9 August 2024; [DOI] [Google Scholar]

[B14-entropy-27-00878] 14.Kütük M., Yücel H. Energy dissipation preserving physics informed neural network for Allen–Cahn equations. J. Comput. Sci. 2025;87:102577. doi: 10.1016/j.jocs.2025.102577. [DOI] [Google Scholar]

[B15-entropy-27-00878] 15.Bullo F., Murray R.M. Tracking for fully actuated mechanical systems: A geometric framework. Automatica. 1999;35:17–34. doi: 10.1016/S0005-1098(98)00119-8. [DOI] [Google Scholar]

[B16-entropy-27-00878] 16.Marsden J.E., Ratiu T.S. Introduction to Mechanics and Symmetry. Volume 17. Springer; New York, NY, USA: 1999. [DOI] [Google Scholar]

[B17-entropy-27-00878] 17.Whiteley N., Gray A., Rubin-Delanchy P. Statistical exploration of the Manifold Hypothesis. [(accessed on 13 August 2025)];arXiv. 2025 Available online: http://arxiv.org/abs/2208.11665.2208.11665 [Google Scholar]

[B18-entropy-27-00878] 18.Lou A., Lim D., Katsman I., Huang L., Jiang Q., Lim S.N., De Sa C. Neural Manifold Ordinary Differential Equations; Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020); Online. 6–12 December 2020; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/2006.10254. [Google Scholar]

[B19-entropy-27-00878] 19.Falorsi L., Forré P. Neural Ordinary Differential Equations on Manifolds. [(accessed on 13 August 2025)];arXiv. 2020 doi: 10.48550/arXiv.2006.06663. Available online: http://arxiv.org/abs/2006.06663.2006.06663 [DOI] [Google Scholar]

[B20-entropy-27-00878] 20.Duong T., Altawaitan A., Stanley J., Atanasov N. Port-Hamiltonian Neural ODE Networks on Lie Groups for Robot Dynamics Learning and Control. IEEE Trans. Robot. 2024;40:3695–3715. doi: 10.1109/TRO.2024.3428433. [DOI] [Google Scholar]

[B21-entropy-27-00878] 21.Wotte Y.P., Califano F., Stramigioli S. Optimal potential shaping on SE(3) via neural ordinary differential equations on Lie groups. Int. J. Robot. Res. 2024;43:2221–2244. doi: 10.1177/02783649241256044. [DOI] [Google Scholar]

[B22-entropy-27-00878] 22.Floryan D., Graham M.D. Data-driven discovery of intrinsic dynamics. Nat. Mach. Intell. 2022;4:1113–1120. doi: 10.1038/s42256-022-00575-4. [DOI] [Google Scholar]

[B23-entropy-27-00878] 23.Andersdotter E., Persson D., Ohlsson F. Equivariant Manifold Neural ODEs and Differential Invariants. [(accessed on 13 August 2025)];arXiv. 2024 doi: 10.48550/arXiv.2401.14131. Available online: http://arxiv.org/abs/2401.14131.2401.14131 [DOI] [Google Scholar]

[B24-entropy-27-00878] 24.Wotte Y.P. Master’s Thesis. University of Twente; Enschede, The Netherlands: 2021. Optimal Potential Shaping on SE(3) via Neural Approximators. [Google Scholar]

[B25-entropy-27-00878] 25.Kidger P. Ph.D. Thesis. Mathematical Institute, University of Oxford; Oxford, UK: 2022. On Neural Differential Equations. [Google Scholar]

[B26-entropy-27-00878] 26.Gholami A., Keutzer K., Biros G. ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs; Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 19); Macao. 10–16 August 2019; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/1902.10298. [Google Scholar]

[B27-entropy-27-00878] 27.Kidger P., Morrill J., Foster J., Lyons T.J. Neural Controlled Differential Equations for Irregular Time Series; Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020); Online. 6–12 December 2020; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/2005.08926. [Google Scholar]

[B28-entropy-27-00878] 28.Li X., Wong T.L., Chen R.T.Q., Duvenaud D. Scalable Gradients for Stochastic Differential Equations; Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020); Online. 26–28 August 2020; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/2001.01328. [Google Scholar]

[B29-entropy-27-00878] 29.Liu Y., Cheng J., Zhao H., Xu T., Zhao P., Tsung F., Li J., Rong Y. SEGNO: Generalizing Equivariant Graph Neural Networks with Physical Inductive Biases; Proceedings of the 12th International Conference on Learning Representations (ICLR 2024); Vienna, Austria. 7–11 May 2024; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/2308.13212. [Google Scholar]

[B30-entropy-27-00878] 30.Greydanus S., Dzamba M., Yosinski J. Hamiltonian Neural Networks; Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); Vancouver, BC, Canada. 8–14 December 2019; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/1906.01563. [Google Scholar]

[B31-entropy-27-00878] 31.Finzi M., Wang K.A., Wilson A.G. Simplifying Hamiltonian and Lagrangian Neural Networks via Explicit Constraints; Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020); Online. 6–12 December 2020; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/2010.13581. [Google Scholar]

[B32-entropy-27-00878] 32.Cranmer M., Greydanus S., Hoyer S., Research G., Battaglia P., Spergel D., Ho S. Lagrangian Neural Networks; Proceedings of the ICLR 2020 Deep Differential Equations Workshop; Addis Ababa, Ethiopia. 26 April 2020; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/2003.04630. [Google Scholar]

[B33-entropy-27-00878] 33.Bhattoo R., Ranu S., Krishnan N.M. Learning the Dynamics of Particle-based Systems with Lagrangian Graph Neural Networks. Mach. Learn. Sci. Technol. 2023;4:015003. doi: 10.1088/2632-2153/acb03e. [DOI] [Google Scholar]

[B34-entropy-27-00878] 34.Xiao S., Zhang J., Tang Y. Generalized Lagrangian Neural Networks. [(accessed on 13 August 2025)];arXiv. 2024 doi: 10.48550/arXiv.2401.03728. Available online: http://arxiv.org/abs/2401.03728.2401.03728 [DOI] [Google Scholar]

[B35-entropy-27-00878] 35.Rettberg J., Kneifl J., Herb J., Buchfink P., Fehr J., Haasdonk B. Data-Driven Identification of Latent Port-Hamiltonian Systems. [(accessed on 13 August 2025)];arXiv. 2024 :37–99. Available online: http://arxiv.org/abs/2408.08185. [Google Scholar]

[B36-entropy-27-00878] 36.Duong T., Atanasov N. Hamiltonian-based Neural ODE Networks on the SE(3) Manifold For Dynamics Learning and Control; Proceedings of the Robotics: Science and Systems (RSS 2021); Online. 12–16 July 2021; [(accessed on 13 August 2025)]. Available online: http://arxiv.org/abs/2106.12782v3. [Google Scholar]

[B37-entropy-27-00878] 37.Fronk C., Petzold L. Training stiff neural ordinary differential equations with explicit exponential integration methods. Chaos. 2025;35:33154. doi: 10.1063/5.0251475. [DOI] [PubMed] [Google Scholar]

[B38-entropy-27-00878] 38.Kloberdanz E., Le W. Artificial Neural Networks and Machine Learning—ICANN 2023. Volume 14262. Springer; Cham, Switzerland: 2023. S-SOLVER: Numerically Stable Adaptive Step Size Solver for Neural ODEs; pp. 388–400. (Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). [DOI] [Google Scholar]

39. Akhtar S.W. On Tuning Neural ODE for Stability, Consistency and Faster Convergence. SN Comput. Sci. 2025;6:318. doi: 10.1007/s42979-025-03832-6.

40. Zhu A., Jin P., Zhu B., Tang Y. On Numerical Integration in Neural Ordinary Differential Equations; Proceedings of the 39th International Conference on Machine Learning (ICML 2022); Baltimore, MD, USA. 17–23 July 2022; ML Research Press; 2022. pp. 27527–27547. Available online: http://arxiv.org/abs/2206.07335 (accessed on 13 August 2025).

41. Munthe-Kaas H. High order Runge-Kutta methods on manifolds. Appl. Numer. Math. 1999;29:115–127. doi: 10.1016/S0168-9274(98)00030-0.

42. Celledoni E., Marthinsen H., Owren B. An introduction to Lie group integrators – Basics, New Developments and Applications. J. Comput. Phys. 2014;257:1040–1061. doi: 10.1016/j.jcp.2012.12.031.

43. Ma Y., Dixit V., Innes M.J., Guo X., Rackauckas C. A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential Equation Solutions; Proceedings of the 2021 IEEE High Performance Extreme Computing Conference (HPEC 2021); Online. 20–24 September 2021.

44. Saemundsson S., Terenin A., Hofmann K., Deisenroth M.P. Variational Integrator Networks for Physically Structured Embeddings; Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (AISTATS 2020); Online. 26–28 August 2020. pp. 3078–3087. Available online: http://arxiv.org/abs/1910.09349 (accessed on 13 August 2025).

45. Desai S.A., Mattheakis M., Roberts S.J. Variational integrator graph networks for learning energy-conserving dynamical systems. Phys. Rev. E. 2021;104:035310. doi: 10.1103/PhysRevE.104.035310.

46. Bobenko A.I., Suris Y.B. Discrete Time Lagrangian Mechanics on Lie Groups, with an Application to the Lagrange Top. Commun. Math. Phys. 1999;204:147–188. doi: 10.1007/s002200050642.

47. Marsden J.E., Pekarsky S., Shkoller S., West M. Variational Methods, Multisymplectic Geometry and Continuum Mechanics. J. Geom. Phys. 2001;38:253–284. doi: 10.1016/S0393-0440(00)00066-8.

48. Duruisseaux V., Duong T., Leok M., Atanasov N. Lie Group Forced Variational Integrator Networks for Learning and Control of Robot Systems; Proceedings of the 5th Annual Conference on Learning for Dynamics and Control; Philadelphia, PA, USA. 15–16 June 2023. pp. 1–21. Available online: http://arxiv.org/abs/2211.16006 (accessed on 13 August 2025).

49. Lee J.M. Introduction to Smooth Manifolds. 2nd ed. Springer; New York, NY, USA: 2012.

50. Hall B.C. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction. Volume 222. Springer; Berlin/Heidelberg, Germany: 2015. (Graduate Texts in Mathematics (GTM)).

51. Solà J., Deray J., Atchuthan D. A micro Lie theory for state estimation in robotics. arXiv. 2021. doi: 10.48550/arXiv.1812.01537. Available online: http://arxiv.org/abs/1812.01537 (accessed on 13 August 2025).

52. Visser M., Stramigioli S., Heemskerk C. Cayley-Hamilton for roboticists. IEEE Int. Conf. Intell. Robot. Syst. 2006;1:4187–4192. doi: 10.1109/IROS.2006.281911.

53. Robbins H., Monro S. A Stochastic Approximation Method. Ann. Math. Stat. 1951;22:400–407. doi: 10.1214/aoms/1177729586.

54. Falorsi L., de Haan P., Davidson T.R., Forré P. Reparameterizing Distributions on Lie Groups; Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019); Naha, Okinawa, Japan. 16–18 April 2019. Available online: http://arxiv.org/abs/1903.02958 (accessed on 13 August 2025).

55. White A., Kilbertus N., Gelbrecht M., Boers N. Stabilized Neural Differential Equations for Learning Dynamics with Explicit Constraints; Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023); New Orleans, LA, USA. 10–16 December 2023. Available online: http://arxiv.org/abs/2306.09739 (accessed on 13 August 2025).

56. Poli M., Massaroli S., Yamashita A., Asama H., Park J. TorchDyn: A Neural Differential Equations Library. arXiv. 2020. doi: 10.48550/arXiv.2009.09346. Available online: http://arxiv.org/abs/2009.09346 (accessed on 13 August 2025).

57. van der Schaft A.J., Jeltsema D. Port-Hamiltonian Systems Theory: An Introductory Overview. Volume 1. Now Publishers Inc.; Hanover, MA, USA: 2014. pp. 173–378.

58. Rashad R., Califano F., van der Schaft A.J., Stramigioli S. Twenty years of distributed port-Hamiltonian systems: A literature review. IMA J. Math. Control Inf. 2020;37:1400–1422. doi: 10.1093/imamci/dnaa018.

59. Wotte Y.P., Dummer S., Botteghi N., Brune C., Stramigioli S., Califano F. Discovering efficient periodic behaviors in mechanical systems via neural approximators. Optim. Control Appl. Methods. 2023;44:3052–3079. doi: 10.1002/oca.3025.

60. Yao W. A Singularity-Free Guiding Vector Field for Robot Navigation. Springer; Cham, Switzerland: 2023. pp. 159–190.
