Abstract
The aim of this article is to provide a mathematical analysis of transformer architectures using a self-attention mechanism with layer normalization. In particular, observed patterns in such architectures resembling either clusters or uniform distributions pose a number of challenging mathematical questions. We focus on a special case that admits a gradient flow formulation in the space of probability measures on the unit sphere under a special metric, which allows us to give at least partial answers in a rigorous way. The arising mathematical problems resemble those recently studied in aggregation equations but with additional challenges emerging from restricting the dynamics to the sphere and the particular form of the interaction energy. We provide a rigorous framework for studying the gradient flow, which also suggests a possible metric geometry to study the general case (i.e. one that is not described by a gradient flow). We further analyse the stationary points of the induced self-attention dynamics. The latter are related to stationary points of the interaction energy in the Wasserstein geometry, and we further discuss energy minimizers and maximizers in different parameter settings.
This article is part of the theme issue ‘Partial differential equations in data science’.
Keywords: transformer architectures, self-attention dynamics, gradient flows, interaction energies, stationary states
1. Introduction
Transformer architectures and the associated (self-)attention dynamics have recently attracted strong interest due to the success of artificial intelligence applications relying on them. Examples include large language models such as GPT-4 [1], multimodal large language models such as vision–language transformers [2,3], text-to-image generation such as Stable Diffusion [4] and protein folding with AlphaFold [5,6], which won the Nobel Prize in Chemistry in 2024.
The practical success of transformers and (self-)attention dynamics calls for a detailed mathematical understanding, the development of which started recently in [7–19].
An interesting viewpoint on such dynamics is to interpret it as an interacting particle system [8,20,21], which allows for natural continuous-time and mean-field limits. The latter approach already provided valuable insights into feed-forward neural networks and their training dynamics (cf. [22,23]). In the context of transformers, this viewpoint also provides interesting (so far formal [9]) connections to gradient flows and the minimization of interaction energy for the particle measures. The latter is a topic of great recent interest due to various applications in biology and social interactions. Indeed, the self-attention dynamics in transformers share certain mathematical similarities with models used in opinion formation, which also exhibit similar emergence of clusters in certain cases [24–26]. In this work, we focus on cluster formation in the infinite time horizon. However, we note that the formation of metastable states is of special interest. For the case of isotropic interaction, metastability was studied in [27,28].
In this article, we proceed with the work in [9] on analysing transformer dynamics with layer normalization, focusing in particular on the case when the underlying dynamics has a gradient flow structure. Indeed, the continuum limit of the self-attention dynamics leads to a Wasserstein-type gradient flow for probability measures on the unit sphere of the form
(1.1) $\qquad \partial_t \mu = \nabla_{S^{d-1}} \cdot \left( \frac{\mu}{\kappa[\mu]}\, \nabla_{S^{d-1}} \frac{\delta E}{\delta \mu} \right),$
where $\nabla_{S^{d-1}}$ and $\nabla_{S^{d-1}} \cdot$ are the tangential gradient and divergence, respectively, and $\kappa[\mu]$ is a non-local mobility. The underlying energy in this case is of the form
(1.2) $\qquad E(\mu) = \iint_{S^{d-1} \times S^{d-1}} e^{\langle x, D y \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y),$
with $D \in \mathbb{R}^{d \times d}$ being a symmetric matrix and $\frac{\delta E}{\delta \mu}$ denoting the first variation of $E$. Since $D$ is symmetric and hence diagonalizable, we can equivalently assume that $D$ is a diagonal matrix, since we can use an orthogonal diagonalization and a corresponding transfer of variables to the eigenvectors, which leaves the unit ball unchanged. This will be used in several instances to simplify notation. It also permits a more detailed study of stationary patterns, in particular minimizers and maximizers of the energy.
Compared to the existing literature on such gradient flows, there are three distinct features that motivate our study, namely:
— restriction of the dynamics to the unit sphere (a consequence of the layer normalization);
— non-local mobility (a consequence of the self-attention mechanism), which is related to but still distinctly different from other variations of Wasserstein gradient flows studied recently (cf. [29–32]);
— multiplicative coupling of states in the interaction energy, as opposed to commonly used interaction potentials depending only on the difference of the states (cf., e.g. [33–38]).
We make the gradient flow, formally introduced in [9], rigorous, showing that the transport distance with non-local mobilities is well defined, studying energy dissipation properties of the associated gradient flow and describing the large-time behaviour of the dynamics, specifically the convergence to stationary solutions, at least along subsequences. We further carry out a detailed study of energy minimizers and maximizers of $E$ (extending the previously studied case of $D$ being a multiple of the identity) as well as stationary points of the energy in a Wasserstein setting, which we prove to be equivalent to stationary solutions of the dynamics. For the energy minimizers, we obtain an interesting picture depending on the structure of $D$:
— If there is a positive eigenvalue that is the eigenvalue of maximal absolute value, then a Dirac delta concentrated in the direction of a corresponding eigenvector is a maximizer.
— If the smallest eigenvalue is negative, then only a Dirac delta concentrated in the direction of a corresponding eigenvector is a minimizer.
— If the smallest eigenvalue is zero, then any measure concentrated on the null space of $D$ is a minimizer.
— Dirac deltas concentrated in directions of arbitrary eigenvectors are stationary points. We also find some convex combinations of Dirac deltas that are stationary points.
— If the smallest eigenvalue is positive, we conjecture that the minimizer of the energy has full support on the unit sphere. To obtain some insight, we carry out a second-order asymptotic analysis of the minimizers for $D$ being a small perturbation of the identity.
We support our theoretical findings with several computational experiments and investigate the cases when the energy minimizers or maximizers cannot be characterized explicitly.
The rest of this work is organized as follows. In the remainder of the introduction, we recapitulate the simplified softmax transformer model introduced in [8], with additional layer normalization as considered in [9]. In §2, we provide a rigorous derivation of the gradient flow induced by the considered model. Sections 3 and 4 are dedicated to characterizing optimizers and stationary points of the studied energy, respectively. We support our findings by numerical experiments in §5 and summarize our results in §6.
(a). Self-attention
Transformer architectures [39] were developed in the field of natural language processing. Here, the input is usually a sentence, which is decomposed into a sequence of tokens (e.g. words or syllables). Each token (possibly along with its position in the sentence) is represented as a vector in a high-dimensional vector space. Apart from a conventional feed-forward component, the main feature of a transformer layer is the so-called attention mechanism. This mechanism implements interactions between tokens and was first introduced in [40] in the context of neural machine translation as an alternative to encoder–decoder approaches, the performance of which often deteriorates for large input lengths due to the use of latent representations of fixed dimensions.
Like [9], we shall focus on a simple yet widely used form of attention, the so-called self-attention. It can be formalized as follows: consider an input sequence $(x_1, \ldots, x_n)$, where each $x_i \in \mathbb{R}^d$ represents a $d$-dimensional token and $n$ denotes the number of tokens. The self-attention matrix $A \in \mathbb{R}^{n \times n}$ is given by
(1.3) $\qquad A_{ij} = \frac{e^{\langle x_i, D x_j \rangle}}{\sum_{k=1}^{n} e^{\langle x_i, D x_k \rangle}},$
where we assume $D \in \mathbb{R}^{d \times d}$ to be symmetric. The latter property does not necessarily hold for learned parameters in transformer architectures, but we expect the symmetric part to determine the asymptotic behaviour of the self-attention dynamics. Since the symmetry of $D$ allows one to interpret the dynamics as a gradient flow corresponding to a certain interaction energy, as observed in [9], it will allow us to analyse the asymptotic behaviour for this subclass; the study of the general case is left for future research. An important example of non-symmetric interaction is given by masked attention, which can be used to model causality. We refer to [41–43] for a mean-field interpretation of such dynamics.
By definition, the matrix $A$ is stochastic, i.e. each of its rows is a probability vector. Roughly speaking, the attention matrix determines how strongly a token is influenced by each other token. To determine how tokens influence each other, another matrix $V \in \mathbb{R}^{d \times d}$, called the value matrix, is used. The influence of $x_j$ on $x_i$ can then be written as $A_{ij} V x_j$ and the self-attention layer is given by
(1.4) $\qquad x_i \mapsto x_i + \sum_{j=1}^{n} A_{ij}\, V x_j.$
For our purposes, we assume $V = D$ or $V = -D$ since, in this case, one can show that the particles move along a gradient flow. The general case is the subject of future work.
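To make the construction concrete, the following minimal sketch (our own illustration, not code from the article's repository) computes the self-attention matrix of equation (1.3) and the layer output of equation (1.4) for a random token ensemble; all names and parameter values are our choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3                         # number of tokens, embedding dimension
X = rng.standard_normal((n, d))     # rows are the tokens x_1, ..., x_n

M = rng.standard_normal((d, d))
D = (M + M.T) / 2                   # symmetric interaction matrix
V = D                               # value matrix, V = D as assumed above

logits = X @ D @ X.T                # <x_i, D x_j> for all pairs (i, j)
A = np.exp(logits)
A /= A.sum(axis=1, keepdims=True)   # row-wise softmax: A is stochastic
assert np.allclose(A.sum(axis=1), 1.0)

Y = X + A @ X @ V.T                 # self-attention layer (1.4), applied row-wise
```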
(b). Normalization method
The normalization of intermediate values is a common practice in machine learning models. In the context of neural networks, so-called batch normalization [44] is a popular method to prevent gradients from blowing up and thus to stabilize (and to improve) the training. Since this form of normalization uses information from the entire training batch, [45] proposes layer normalization (LayerNorm), which translates the mean of an intermediate vector to zero and divides it by its standard deviation, and therefore does not depend on any other vector in the batch. While the original implementation of the transformer [39] uses LayerNorm, some of the more recent publications (e.g. Llama, [46]) use a simplified version called Root Mean Square Layer Normalization (RMSNorm) proposed in [47]. Up to a multiplication with learned weights $g \in \mathbb{R}^d$, called gain parameters, RMSNorm performs a projection on to the unit sphere $S^{d-1}$ (where in the following, we shall suppress the superscript $d-1$ and simply write $S$). More precisely, for $x \in \mathbb{R}^d \setminus \{0\}$ we write
$$\mathrm{RMSNorm}(x) = g \odot \frac{x}{\sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}} = g \odot \sqrt{d}\, \frac{x}{\|x\|_2},$$
where, in practice, a division by zero is circumvented by adding a small value $\varepsilon > 0$ into the square root. In our setting, we can assume the norm to be strictly positive as we consider the dynamics in continuous time. Following the setting of [9], we focus on RMSNorm with fixed gain parameters $g_i = \frac{1}{\sqrt{d}}$ for all $i$ and denote the projection on to the unit sphere for $x \neq 0$ by
$$P(x) = \frac{x}{\|x\|_2}.$$
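As a small illustration (a sketch under the assumptions above, with the fixed gains $g_i = 1/\sqrt{d}$ for which RMSNorm coincides with the sphere projection):

```python
import numpy as np

def rms_norm(x, g, eps=1e-8):
    """RMSNorm: divide by the root mean square, then apply the gains g."""
    rms = np.sqrt(np.mean(x**2) + eps)   # eps avoids division by zero
    return g * x / rms

def project_sphere(x):
    """Projection P(x) = x / ||x||_2 onto the unit sphere."""
    return x / np.linalg.norm(x)

d = 4
x = np.array([1.0, -2.0, 0.5, 3.0])
g = np.full(d, 1.0 / np.sqrt(d))         # fixed gains: RMSNorm == projection
assert np.allclose(rms_norm(x, g), project_sphere(x), atol=1e-4)
```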
(c). Simplified transformer layer and time-continuous dynamics
Combining the attention layer with a normalization layer, we arrive at the following update step:
$$x_i^{(k+1)} = P\Big( x_i^{(k)} + \sum_{j=1}^{n} A_{ij}^{(k)}\, V x_j^{(k)} \Big), \qquad i = 1, \ldots, n,$$
where the projection $P$ is applied vector-wise to each row of the token matrix. For the sake of our analysis, we shall deviate from typical practical implementations of transformers and consider the architecture to be a composition of such layers which all share the same matrices $D$ and $V$ in equations (1.3) and (1.4). In [9], it was proposed to study the continuum limit of these updates. This approach has become a popular tool for analysing residual neural networks [48]: as discussed from various perspectives, e.g. in [49–52], the skip connections (i.e. the residual components) of the residual neural network architecture make it possible to interpret it as a forward Euler discretization of an ordinary differential equation. Introducing a time variable $t$ and a small time increment $h > 0$, we get
(1.5) $\qquad x_i(t + h) = P\Big( x_i(t) + h \sum_{j=1}^{n} A_{ij}(t)\, V x_j(t) \Big).$
At this point, the residual component is hidden in the attention layer and cannot easily be extracted since the projection is nonlinear. In the continuous time limit $h \to 0$, remembering that $\|x_i(t)\| = 1$ for any $t$, we arrive at the following system of differential equations:
(1.6) $\qquad \dot{x}_i(t) = \frac{\mathrm{d}}{\mathrm{d}h}\bigg|_{h = 0} P\Big( x_i(t) + h \sum_{j=1}^{n} A_{ij}(t)\, V x_j(t) \Big),$
where the spatial derivatives are understood as derivatives in $\mathbb{R}^d$. With a simple computation, one can further show that for any $x \in S$ and $u \in \mathbb{R}^d$ it holds that
$$\frac{\mathrm{d}}{\mathrm{d}h}\bigg|_{h = 0} P(x + h u) = (\mathrm{Id} - x \otimes x)\, u,$$
where, following [9], we define $P_x := \mathrm{Id} - x \otimes x$. Substituting this into equation (1.6), we arrive at the following dynamics:
(1.7a) $\qquad \dot{x}_i(t) = P_{x_i(t)}\Big( \sum_{j=1}^{n} A_{ij}(t)\, V x_j(t) \Big),$
which serve as a starting point of [9].
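In code, one layer of this stack corresponds to a projected explicit Euler step. The following sketch (our own, with illustrative parameter values) implements equation (1.5) for $V = D$, i.e. the energy-ascending case discussed below:

```python
import numpy as np

def euler_step(X, D, V, h):
    """One step of (1.5): x_i <- P( x_i + h * sum_j A_ij V x_j )."""
    A = np.exp(X @ D @ X.T)                  # unnormalized attention weights
    A /= A.sum(axis=1, keepdims=True)        # row-wise softmax
    Y = X + h * (A @ X @ V.T)                # residual attention update
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)  # project to the sphere

rng = np.random.default_rng(1)
n, d, h = 8, 3, 0.1
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # start on the sphere
D = np.diag([2.0, 1.0, 0.5])
for _ in range(1000):
    X = euler_step(X, D, D, h)               # V = D: particles tend to cluster
```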
(d). Interpretation as an evolution of measures
Instead of studying the dynamics of distinct particles, [9] propose to view equation (1.7) as an evolution of an empirical measure
$$\mu_t^n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i(t)}.$$
The right-hand side of equation (1.7a) can be understood as an integral with respect to $\mu_t^n$; for a generic probability measure $\mu$, this can be written as a measure-dependent velocity field:
(1.8) $\qquad v[\mu](x) = P_x\left( \frac{\int_S e^{\langle x, D y \rangle}\, V y \, \mathrm{d}\mu(y)}{\int_S e^{\langle x, D y \rangle}\, \mathrm{d}\mu(y)} \right),$
and equation (1.7a) turns into $\dot{x}_i(t) = v[\mu_t^n](x_i(t))$. With this notion, we recover the weak continuity equation formulated in [9]: for any test function $\varphi$, one has
(1.9) $\qquad \frac{\mathrm{d}}{\mathrm{d}t} \int_S \varphi \, \mathrm{d}\mu_t^n = \int_S \langle \nabla \varphi(x), v[\mu_t^n](x) \rangle \, \mathrm{d}\mu_t^n(x),$
where, in this case, the spatial derivatives of $\varphi$ have to be understood as derivatives on $S$.
Similarly, Geshkovski et al. [9] propose the interaction energy in equation (1.2), which for an empirical measure reduces to
$$E(\mu_t^n) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{\langle x_i(t), D x_j(t) \rangle}.$$
In this discrete case, a straightforward application of the chain rule and a reordering of the terms yields
$$\frac{\mathrm{d}}{\mathrm{d}t} E(\mu_t^n) = \frac{2}{n^2} \sum_{i=1}^{n} \Big\langle \dot{x}_i(t),\; \sum_{j=1}^{n} e^{\langle x_i(t), D x_j(t) \rangle}\, D x_j(t) \Big\rangle.$$
Under our assumption that the value matrix is given by $V = \pm D$, we see that, up to an application of $P_{x_i(t)}$ and a division by $Z_i(t) := \sum_{k=1}^{n} e^{\langle x_i(t), D x_k(t) \rangle}$, the term in the brackets is given by $\pm \dot{x}_i(t)$. Since $P_{x_i(t)} \dot{x}_i(t) = \dot{x}_i(t)$ for any $i$, we have that
$$\frac{\mathrm{d}}{\mathrm{d}t} E(\mu_t^n) = \pm \frac{2}{n^2} \sum_{i=1}^{n} Z_i(t)\, \|\dot{x}_i(t)\|^2,$$
and hence the energy increases ($V = D$) or decreases ($V = -D$) monotonously along the trajectory of equation (1.7). A formal derivation of the above formulae for general probability measures on smooth manifolds is provided in §2.
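This monotonicity is easy to verify numerically. The sketch below (our own illustration; the matrix and parameters are our choices) runs the descending dynamics with $V = -D$ and checks that the empirical energy does not increase:

```python
import numpy as np

def energy(X, D):
    """Empirical energy E = (1/n^2) * sum_ij exp(<x_i, D x_j>)."""
    return np.exp(X @ D @ X.T).sum() / X.shape[0] ** 2

def step(X, D, V, h):
    """One explicit Euler step of the normalized attention dynamics."""
    A = np.exp(X @ D @ X.T)
    A /= A.sum(axis=1, keepdims=True)
    Y = X + h * (A @ X @ V.T)
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
D = np.diag([1.5, 1.0, 0.5])

E0 = energy(X, D)
for _ in range(500):
    X = step(X, D, -D, 0.05)     # V = -D: the energy should decrease
assert energy(X, D) <= E0
```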
Let us mention that problems with energies similar to $E$ have been studied in the past. The most prominent is an interaction energy with a non-local interaction kernel depending on $x - y$. Choosing the kernel as a Gaussian with covariance matrix $D^{-1}$ (which makes sense only if $D$ is positive definite) results in
(1.10) $\qquad \tilde{E}(\mu) = \iint_{S \times S} e^{-\frac{1}{2} \langle x - y, D (x - y) \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y).$
For $D = \lambda\, \mathrm{Id}$ with $\lambda > 0$, the minimizers and maximizers of the expressions in equations (1.2) and (1.10) are equivalent, as $\langle x, D y \rangle = \lambda - \frac{1}{2} \langle x - y, D(x - y) \rangle$ for all $x, y \in S$. The important difference between equations (1.2) and (1.10) is the rotation-(in)variance of the interaction functions $(x, y) \mapsto e^{\langle x, D y \rangle}$ and $(x, y) \mapsto e^{-\frac{1}{2} \langle x - y, D(x - y) \rangle}$. In the general case, this is not true, but we shall use an analogy to the interaction energy to rewrite
$$E(\mu) = \iint_{S \times S} e^{\frac{1}{2} \langle x, D x \rangle}\, e^{-\frac{1}{2} \langle x - y, D(x - y) \rangle}\, e^{\frac{1}{2} \langle y, D y \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y).$$
(e). Understanding $\langle x, D y \rangle$ on the sphere
For our further analysis, it is crucial to understand the implications of restricting the problem to the unit sphere and the behaviour of the bilinear form on it. For , it is clear that the minimizer of is given by and the maximizer by . This changes for a general and as a result, the minimizer of the energy in equation (1.2) is not given by the uniform distribution on anymore. For a diagonal matrix , the maximizer/minimizer of for a fixed with is given by . Therefore, we know that if and only if (same for and ). For , we already have for any , i.e. each point is a minimizer, maximizer and orthogonal to w.r.t. . A further consequence is that
where denotes the eigenvalue of maximum absolute value of . We further note that all of the following results on minimizers/maximizers as well as stationary points of can be generalized to probability measures concentrated on an ellipsoid instead of a sphere. To see this, we consider the ellipsoid
where is invertible, and the corresponding energy
Since is invertible, any measure is uniquely determined by the pushforward measure , as . Thus, we can rewrite the energy as
and equivalently optimize the energy on the sphere. A special case that leads to measures concentrated on an ellipsoid corresponds to RMSNorm normalization with non-vanishing gain parameters . In this case, the ellipsoid is given by , where is a diagonal matrix with entries .
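The spectral bound stated in this subsection, namely that the bilinear form on the sphere is bounded by the eigenvalue of maximal absolute value and attains this bound at a corresponding eigenvector, is easy to probe numerically. A minimal sketch (our own illustration, not from the article's code):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
D = (M + M.T) / 2
evals, evecs = np.linalg.eigh(D)
lam = np.max(np.abs(evals))               # eigenvalue of maximal absolute value

X = rng.standard_normal((100_000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
vals = np.einsum('ij,jk,ik->i', X, D, X)  # <x, D x> at random sphere points
assert np.all(np.abs(vals) <= lam + 1e-12)

u = evecs[:, np.argmax(np.abs(evals))]    # corresponding eigenvector
assert np.isclose(abs(u @ D @ u), lam)    # the bound is attained at u
```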
2. Gradient flow
As shown above, the particle dynamics can be ‘lifted’ by the use of empirical measures to the space of probability measures over the sphere. As mentioned in [9, Remark 3.3], for arbitrary probability measures, the connection between the particle dynamics and a corresponding continuity equation can be made by a mean-field limit approach. Hence, instead of the particle dynamics, one can study the continuity equation:
(2.1) $\qquad \partial_t \mu_t + \nabla \cdot ( v[\mu_t]\, \mu_t ) = 0,$
with the velocity field given by equation (1.8), which holds in the sense of distributions. Note that, in this section, we scale the energy by a factor of $\frac{1}{2}$ to be consistent with [9]. It was remarked in [9, ch. 3.3] that for $V = \pm D$, the energy,
$$E(\mu) = \frac{1}{2} \iint_{S \times S} e^{\langle x, D y \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y),$$
is monotonic along these dynamics, and the partial differential equation (2.1) can be interpreted as a gradient flow for a modified optimal transport distance. However, as the authors of [9] acknowledge, there is a gap in the literature that prevents them from making this observation rigorous.
In this section, we aim to close this gap. We show that $\mathcal{P}(M)$ equipped with this new distance is a geodesic space with properties similar to the classical $2$-Wasserstein space and prove that solutions of equation (2.1) are curves of maximal slope of $E$ with respect to this distance and thus satisfy the energy dissipation equality
$$E(\mu_T) + \frac{1}{2} \int_0^T \Big( |\mu_t'|^2 + |\partial E|^2(\mu_t) \Big)\, \mathrm{d}t = E(\mu_0).$$
Finally, we study the long-time behaviour of the dynamics and show that subsequences of the flow converge to stationary points of the energy .
Let us mention that the basic analysis of this section related to the novel transport distance can be generalized in a rather straightforward way to the more general case of being non-symmetric and can thus provide the basis for future analysis of the non-gradient flow case with arbitrary and non-symmetric.
(a). Continuity equation on manifolds
Let be a compact -dimensional Riemannian manifold without a boundary, e.g. the sphere . The tangent bundle is given by the disjoint union of all tangent spaces of all . We denote by the space of Borel probability measures on , equipped with the standard narrow topology (e.g. [53, ch. 5.1]). The symbol is used to indicate convergence in this topology. Let be an open interval, a narrowly continuous curve and a Borel velocity field such that . The continuity equation holds in the sense of distributions if
(2.2) |
Here, denotes the differential on the manifold . Sometimes, we shall use to clarify with respect to which variable the differential is taken. We define the set of solutions to the continuity equation as follows:
Furthermore, we define as the subset such that , . For more details, we refer to appendix A(a).
(b). Distance
To interpret equation (2.1) as a gradient flow on $\mathcal{P}(M)$, we need to modify the well-known dynamic formulation of the $2$-Wasserstein distance [54] and introduce the following mobility:
$$\kappa[\mu](x) := \int_M k(x, y)\, \mathrm{d}\mu(y).$$
With this, the modified transport distance between $\mu_0, \mu_1 \in \mathcal{P}(M)$ is defined as follows (see [9, Section 3.4.2]):
(2.3) $\qquad d_k(\mu_0, \mu_1)^2 := \inf\left\{ \int_0^1 \int_M \kappa[\mu_t](x)\, |v_t(x)|^2\, \mathrm{d}\mu_t(x)\, \mathrm{d}t \,:\, (\mu, v) \in \mathrm{CE}(\mu_0, \mu_1) \right\}.$
For $k \equiv 1$, we recover the classical $2$-Wasserstein distance. The dynamics (2.1) correspond to the kernel $k(x, y) = e^{\langle x, D y \rangle}$, but for the sake of generality, we carry out the analysis for a more general class of kernels $k$.
Assumption 1. The kernel $k \colon M \times M \to \mathbb{R}$ is continuous, and there exists a constant $c > 0$ such that $k(x, y) \geq c$ for all $x, y \in M$.
Remark 2.1. The assumption that is bounded from below is vital for our analysis and covers the cases of interest in this article. Nonetheless, it would be interesting to see whether this assumption can be relaxed. For example, instead of a compact manifold , we could consider as the underlying space and take to be a Gaussian or a bounded confidence kernel as studied in [ 55 ].
As the next theorem shows, the infimum in equation (2.3) is actually attained by some . The proof can be found in appendix A(b).
Theorem 2.2 (Existence of minimizers). For every pair with , there exists a couple such that
Furthermore, such minimizers can be equivalently characterized as those of
Using the theorem above, it is easy to show that is a distance on .
Theorem 2.3. The space equipped with is a complete metric space and its topology is equivalent to the one induced by the -Wasserstein distance which, since is compact, is equivalent to the topology of narrow convergence.
Proof. First, we check that is a distance. Indeed, (i) symmetry follows from simply rescaling time by ; (ii) definiteness: Since is bounded from below, implies that for -a.e. . Thus by equation (A 3) ; (iii) the triangle inequality follows from the characterization in equation (2.4) and the gluing property from proposition A.1. To show the equivalence of the distances, we observe that by assumption 1, and since is compact and is continuous, we can also find a such that . This implies that
and the distances are equivalent. Since is complete, has to be complete as well.∎
Let us recall that in a general complete metric space $(X, d)$, a curve $u \colon [0, T] \to X$ is called absolutely continuous if there exists a function $m \in L^1(0, T)$ such that
(2.5) $\qquad d(u(s), u(t)) \leq \int_s^t m(r)\, \mathrm{d}r \qquad \text{for all } 0 \leq s \leq t \leq T.$
For an absolutely continuous curve $u$, its metric derivative is defined by
$$|u'|(t) := \lim_{s \to t} \frac{d(u(s), u(t))}{|s - t|},$$
and it exists for a.e. $t \in (0, T)$. It can be shown that $|u'|$ is minimal in the sense that for all $m$ satisfying equation (2.5), it holds that $|u'|(t) \leq m(t)$ for a.e. $t \in (0, T)$. The next lemma, which is proven in appendix A(c), characterizes absolutely continuous curves in $(\mathcal{P}(M), d_k)$.
Lemma 2.4. Let $(\mu_t)_{t \in [0, T]}$ be an absolutely continuous curve w.r.t. $d_k$. Then there exists a Borel velocity field $v_t$ such that $(\mu, v) \in \mathrm{CE}$ and
$$\int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t \leq |\mu'|^2(t) \qquad \text{for a.e. } t \in (0, T).$$
Conversely, if $(\mu, v) \in \mathrm{CE}$ and $\int_0^T \int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t\, \mathrm{d}t < \infty$, then $(\mu_t)$ is absolutely continuous and
$$|\mu'|^2(t) \leq \int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t \qquad \text{for a.e. } t \in (0, T).$$
A metric space $(X, d)$ is called a length space if
$$d(x, y) = \inf \int_0^1 |u'|(t)\, \mathrm{d}t,$$
where the infimum is taken over all absolutely continuous curves $u \colon [0, 1] \to X$ with $u(0) = x$ and $u(1) = y$. If this infimum is attained by a minimal curve, also called geodesic, we say that $(X, d)$ is a geodesic space. As it turns out, the minimal curves obtained in theorem 2.2 are such geodesics. This can be immediately deduced from equation (A 9) and the definition of the metric velocity.
Corollary 2.5. The space $(\mathcal{P}(M), d_k)$ is a geodesic space.
(c). Gradient flows of the interaction energy
Let $k$ be a symmetric interaction kernel. The interaction energy is given by
$$E(\mu) = \frac{1}{2} \iint_{M \times M} k(x, y)\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y).$$
Let us consider the following inverse duality map:
$$\jmath \colon T^*M \to TM, \qquad \langle \jmath(\xi), \eta \rangle = \xi(\eta) \quad \text{for all } \eta \in TM.$$
Since all tangent spaces are finite-dimensional, this map is well defined. The application of $\jmath$ to a 1-form on $M$ (in particular, a differential of a function) yields a velocity field on $M$. Below we show that gradient flows of the energy with respect to the metric $d_k$ are given by weak solutions to PDEs of the form
(2.6) $\qquad \partial_t \mu_t = \nabla \cdot \left( \frac{\mu_t}{\kappa[\mu_t]}\, \jmath\Big( \mathrm{d}\, \frac{\delta E}{\delta \mu}[\mu_t] \Big) \right),$
where $\frac{\delta E}{\delta \mu}[\mu](x) = \int_M k(x, y)\, \mathrm{d}\mu(y)$. For $M = S$ and $k(x, y) = e^{\langle x, D y \rangle}$, equation (2.6) corresponds precisely to equation (2.1) if $V = -D$. The sole difference between equation (2.6) and classical Wasserstein gradient flows is the presence of the factor $\frac{1}{\kappa[\mu]}$. It arises since the modified transport distance punishes the movement of particles with a high mobility $\kappa[\mu]$. When we interpret $k$ as an interaction kernel between particles, those particles interacting strongly with others are slowed down, while particles with low interaction are sped up.
Lemma 2.6 (Chain rule). Let be an absolutely continuous curve in . Then is absolutely continuous and
Proof. Let us consider an absolutely continuous curve and the function . In the case when , we could use it as a test function in equation (A 3) and immediately obtain
The finiteness follows from the fact that we can bound uniformly on . In the general case, we have to use a rather lengthy time mollification argument, see appendix A(d).∎
Equation (2.7) is reminiscent of the classical chain rule for a function and a curve . The velocity field can be viewed as the ‘derivative’ of the curve , while is the corresponding ‘gradient’ of the interaction energy. Using this chain rule, we can estimate how fast the energy can decrease along a curve . Therefore, curves reaching this bound dissipate the energy as fast as possible and satisfy the so-called energy dissipation equality.
Lemma 2.7. For any absolutely continuous w.r.t. curve , we have that
(2.8) |
Moreover, we have equality if and only if is a weak solution to equation (2.6) .
Proof. We can estimate the right-hand side of equation (2.7) by Hölder’s and Young’s inequalities:
Integrating both sides of equation (2.7) from 0 to T, we obtain equation (2.8). Moreover, equality holds if and only if for a.e. and -a.e. we have ). Hence, is a weak solution to equation (2.6).∎
(d). Metric gradient flows
Let us put the previous calculations into the context of curves of maximal slope [53, ch. 1], which can be viewed as a way to generalize gradient flows to general metric spaces. We assume to be a complete metric space. Let . A function is called a strong upper gradient of if for any absolutely continuous curve the concatenation is Borel and
If is non-increasing in then the application of Young’s inequality yields
This observation allows us to define curves of maximal slope as those that decrease the energy as fast as possible.
Definition 2.8 (Curve of maximal slope). An absolutely continuous curve is called a curve of maximal slope of with respect to its strong upper gradient if is non-increasing and
Lemma 2.9. The map
is a strong upper gradient of and solutions of equation (2.6) coincide with curves of maximal slope of with respect to the strong upper gradient .
Proof. For an absolutely continuous w.r.t. curve , we can find, by lemma 2.4, a velocity field such that and
Then, the chain rule, lemma 2.6 yields
and is a strong upper gradient. The coincidence of solutions of equation (2.6) and curves of maximal slope follows from lemma 2.7.∎
(e). Energy dissipation and large-time behaviour
Due to the missing geodesic convexity properties of the energy, we cannot expect convergence of the evolution to a unique minimizer in the large time limit. However, we can obtain some weaker results by further analysing the energy dissipation property:
(2.9) $\qquad E(\mu_T) + \int_0^T \int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t\, \mathrm{d}t \leq E(\mu_0).$
As $t \to \infty$, we can pick narrowly convergent subsequences of $(\mu_t)$ (i.e. converging weakly star in the Banach space of Radon measures). Moreover, the entropy dissipation inequality above implies
$$\int_0^\infty \int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t\, \mathrm{d}t < \infty;$$
hence, along suitable subsequences, the entropy dissipation,
$$\int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t,$$
converges to zero since it is non-negative and bounded. To establish the existence of subsequences converging to stationary solutions, we need to identify the limit in suitable spaces. Under appropriate regularity assumptions on the interaction kernel (satisfied, for example, for the exponential kernel), this is a direct consequence of the Arzelà–Ascoli theorem.
Lemma 2.10. Let be a compact manifold without a boundary, for some and symmetric. Moreover, let be a sequence of probability measures on . Then the sequences
have uniformly convergent subsequences. If converges narrowly to , then converges uniformly to and converges uniformly to
Lemma 2.10 combined with the entropy dissipation inequality (2.9) yields the following result.
Corollary 2.11. Let be a compact manifold without a boundary, for some and symmetric. Then each weak solution of equation (2.1) with the velocity field given by equation (1.8) has a narrowly convergent subsequence as , the limit of which is a stationary solution.
The following example connects the general results of this section with the transformer dynamics.
Example 2.12. The transformer dynamics for a finite number of particles described by equation (1.7) with $V = -D$ correspond to the choice $M = S$ and $k(x, y) = e^{\langle x, D y \rangle}$. As discussed in §1d, the corresponding empirical measures fulfil the continuity equation (1.9). Thus, they solve equation (2.1) in the weak sense with the velocity field given by equation (1.8), and all requirements of corollary 2.11 are fulfilled. Therefore, there exists a subsequence of the empirical measures that converges narrowly to a stationary solution of the interaction energy defined in equation (1.2).
This section establishes the relation between the particle model in equation (1.7) and gradient flows of interaction energies for the special cases $V = \pm D$. The energy dissipation property in equation (2.8) and the convergence property from corollary 2.11 motivate the study of stationary solutions of the energy $E$, which we carry out in §§3 and 4. We shall start with minimizers and maximizers.
3. Explicit energy minimizers and maximizers
In this section, we compute explicit minimizers and maximizers of the energy (from equation (1.2), i.e. without the factor $\frac{1}{2}$) in different scenarios, depending on the properties of the interaction matrix $D$. We make the dependence on the matrix explicit by employing it as a subscript of the energy. The case $D = \lambda\, \mathrm{Id}$ with $\lambda > 0$ has already been covered in [9, Proposition 3.4], where it is stated that a measure is a maximizer if and only if it is a Dirac delta placed at any point on the sphere, and a minimizer if and only if it is the uniform distribution. As we show below, for more general matrices, the position of optimal Diracs depends strongly on the eigenvalues of the matrix $D$. We further derive a symmetry condition for minimizers of energies with a positive definite interaction matrix $D$. This property yields an alternative, simpler proof that the uniform distribution is the only minimizer for $D = \lambda\, \mathrm{Id}$ with $\lambda > 0$.
(a). Maximal eigenvalue and related maximizers or minimizers
Like for $D = \lambda\, \mathrm{Id}$, there are several cases in which the minimizers or maximizers of the energy are given by Diracs concentrated at a single point. We start with the maximizers when the largest eigenvalue of $D$ is also the eigenvalue of the largest absolute value (or, respectively, minimizers when the smallest eigenvalue of $D$ is also the eigenvalue of the largest absolute value).
Theorem 3.1. Let $\lambda$ be an eigenvalue of maximal absolute value of $D$ and $U_\lambda \subset S$ the set of associated normalized eigenvectors. If $\lambda > 0$ then $\delta_u$ with $u \in U_\lambda$ are the only maximizers of the energy $E_D$. If $\lambda < 0$ then $\delta_u$ with $u \in U_\lambda$ are the only minimizers.
Proof. We consider the case ; the case can be treated similarly. For all , we have with equality if and only if . Thus,
where the inequality is strict if is not concentrated on an eigenvector associated with .∎
An example of the above setting is maximizing the energy for $D = \mathrm{Id}$ [9, Proposition 3.4], where the authors make a connection between the existence of concentrated maximizers and the so-called mode collapse of transformers often observed in practice. For a positive definite $D$, theorem 3.1 shows that the set of maximizers is not only restricted to Dirac measures, but that it is actually finite. We summarize this insight in the following example and refer to §5a for an illustrating numerical example.
Example 3.2. If $D = \lambda\, \mathrm{Id}$ with $\lambda > 0$ then $\delta_x$ is a maximizer of the energy for any $x \in S$. Similarly, for $\lambda < 0$, $\delta_x$ is a minimizer for any $x \in S$. If $D$ is positive definite then $\delta_x$ is a maximizer of $E_D$ only if $x \in U_\lambda$ and $\lambda$ is the largest eigenvalue of $D$. Similarly, for a negative definite $D$, $\delta_x$ is a minimizer only if $x \in U_\lambda$ and $\lambda$ is the smallest eigenvalue of $D$.
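Numerically, this is immediate: for a Dirac measure $\delta_x$, the energy reduces to $e^{\langle x, D x \rangle}$, which is maximal exactly at dominant eigenvectors. A minimal sketch (our own illustration; the matrix is our choice):

```python
import numpy as np

D = np.diag([3.0, 1.0, -2.0])        # lambda_max = 3 has maximal absolute value

def dirac_energy(x, D):
    """E(delta_x) = exp(<x, D x>) for a Dirac measure on the sphere."""
    return np.exp(x @ D @ x)

rng = np.random.default_rng(4)
X = rng.standard_normal((100_000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sampled = np.exp(np.einsum('ij,jk,ik->i', X, D, X))

u = np.array([1.0, 0.0, 0.0])        # eigenvector of the largest eigenvalue
assert dirac_energy(u, D) >= sampled.max()   # no sampled Dirac beats it
```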
In the remainder of this section, we study minimizers for matrices that do not fulfil the conditions of theorem 3.1.
(b). Minimizers for indefinite matrices
We now generalize the statement in theorem 3.1 to minimizers of energies where the matrix $D$ has at least one non-positive eigenvalue. In particular, we do not assume that the smallest eigenvalue is the eigenvalue of maximal absolute value. A key property is the following result that gives a lower bound on the energy in terms of the smallest eigenvalue of $D$.
Lemma 3.3. Let $\bar{x}_\mu$ be the expected value of $x$ under $\mu$, i.e. $\bar{x}_\mu = \int_S x \, \mathrm{d}\mu(x)$. Then
(3.1) $\qquad E_D(\mu) \geq e^{\langle \bar{x}_\mu, D \bar{x}_\mu \rangle}.$
If $D$ is not positive definite and $\lambda_{\min} \leq 0$ is its smallest eigenvalue, it further holds that
(3.2) $\qquad E_D(\mu) \geq e^{\lambda_{\min}}.$
Proof. We use the convexity of the exponential function and of $x \mapsto e^{\langle x, v \rangle}$ for arbitrary $v \in \mathbb{R}^d$, which, with two applications of Jensen's inequality, implies
(3.3) $\qquad E_D(\mu) = \iint_{S \times S} e^{\langle x, D y \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y) \geq e^{\iint_{S \times S} \langle x, D y \rangle\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y)} = e^{\langle \bar{x}_\mu, D \bar{x}_\mu \rangle}.$
Since, further, $\langle \bar{x}_\mu, D \bar{x}_\mu \rangle \geq \lambda_{\min} \|\bar{x}_\mu\|^2$ and $\|\bar{x}_\mu\| \leq 1$, the monotonicity of the exponential function gives us
$$E_D(\mu) \geq e^{\lambda_{\min} \|\bar{x}_\mu\|^2}.$$
If $D$ is not positive definite, we know that $\lambda_{\min} \leq 0$ and the above inequality reduces to inequality (3.2).∎
A direct consequence of lemma 3.3 for indefinite matrices is that a Dirac measure that is concentrated on an eigenvector corresponding to the smallest eigenvalue is a minimizer of the energy. If the smallest eigenvalue is negative, we can even show that all minimizers are of this form. In the case of a vanishing smallest eigenvalue, it is necessary and sufficient that the measure is concentrated on the null space of .
Theorem 3.4. Consider a matrix $D$ that is not positive definite with the smallest eigenvalue $\lambda_{\min} \leq 0$. If $\lambda_{\min} < 0$, a measure minimizes the energy $E_D$ if and only if it is a Dirac measure placed at an eigenvector corresponding to $\lambda_{\min}$. If $\lambda_{\min} = 0$, a measure minimizes the energy if and only if it is concentrated on the null space of $D$.
Proof. We first assume $\lambda_{\min} < 0$. It follows directly from equation (3.2) that every Dirac measure concentrated on an eigenvector corresponding to $\lambda_{\min}$ is a minimizer. We further see that $E_D(\mu) = e^{\lambda_{\min}}$ only if $\bar{x}_\mu$ is an eigenvector corresponding to $\lambda_{\min}$ and $\|\bar{x}_\mu\| = 1$. This can only hold for Dirac measures. Thus, there are no other minimizers.
For $\lambda_{\min} = 0$, it also follows directly from equation (3.2) that every measure concentrated on the null space of $D$ minimizes the energy. However, $e^{\lambda_{\min} \|\bar{x}_\mu\|^2} = 1$ holds for all measures $\mu \in \mathcal{P}(S)$. Still, the estimate in equation (3.3), obtained using Jensen's inequality, is only an equality if $\langle x, D y \rangle = \langle \bar{x}_\mu, D \bar{x}_\mu \rangle$ for $\mu \otimes \mu$-a.e. $(x, y)$. Therefore, all minimizers are concentrated on the null space of $D$.∎
Remark 3.5. In general, theorem 3.4 does not transfer to maximizers for matrices that are not negative definite. To see this, consider with the largest eigenvalue , the smallest eigenvalue and corresponding eigenvectors and . If further , it holds that
and thus, is not a maximizer. In the special case , the above inequality holds for all measures concentrated on the null space of and all .
At this point, we further note that the above strategy does not work for analysing minimizers for positive definite interaction matrices . In this case, lemma 3.3 not only gives us , but also for all , so the inequality is strict for all measures .
(c). Symmetry property for positive definite matrices
The remainder of this section gives the first characterization of minimizers of the energy when the interaction matrix is positive definite. More precisely, we can show that, in this case, all minimizers are symmetric, and the symmetry axes are determined by the eigenvectors of . The first step towards this is to show that the energy is strictly convex if is positive definite.
Lemma 3.6. If $D$ is positive semi-definite (resp. positive definite) then $E_D$ is convex (resp. strictly convex).
Proof. Since is quadratic, convexity (resp. strict convexity) follows from the non-negativity (resp. positivity) of the quadratic form:
for arbitrary signed Radon measures , e.g. [56, Proposition 2.11]. For positive semi-definite, there exists a unique positive semi-definite matrix square root and we can use the transformation . We denote by the pushforward of by , so that
Let , then
The fact that the Gaussian kernel is positive definite (e.g. [57]) yields that unless vanishes. This can only happen if or, in the case of a semi-definite matrix , if is concentrated on the null space and . This yields the assertion.∎
Remark 3.7. The previous convexity result does not guarantee the convergence of the gradient flow in ( equation (2.6) ) to a global minimizer of . For such results, usually a slightly different notion of convexity is required, the so-called geodesic convexity. The following example shows that besides the case of being a multiple of the identity, we do not have geodesic convexity for the classical -Wasserstein distance. We do not expect any improvements for our modified optimal transport distance.
Example 3.8. We consider a simple counterexample in (equipped with the spherical distance) to show that is not convex along -Wasserstein geodesics. Choose
Then is a constant-speed geodesic in the -Wasserstein space connecting and . Clearly, the map is not convex, since
Such a counterexample can always be constructed as long as has two different eigenvectors. Lemma 3.6 does not contradict this counterexample, however, as it only implies the convexity of
Having established convexity, we can show that reflecting a measure along the eigenvectors of and then normalizing it does not increase the energy. Moreover, if is positive definite and is not symmetric with respect to all eigenvectors of , one can always construct a symmetric measure with a smaller energy.
Lemma 3.9. Let be an eigenvector related to an eigenvalue of a positive semi-definite matrix . For a measure , we define as
where denotes a reflection. Then, and the inequality is strict if is positive definite and .
Proof. Since , it is straightforward to see that . The (strict) convexity of the energy yields the assertion.∎
As a direct consequence, we obtain a symmetry property of minimizers for positive definite .
Corollary 3.10. If $D$ is positive definite then each minimizer of $E_D$ is symmetric with respect to its eigenvectors.
If is a positive multiple of the identity, one can easily show using the above result that the uniform distribution is the unique energy minimizer. This has been shown already in [9, Proposition 3.4] using properties of Gegenbauer polynomials [58, Proposition 2.2]. The symmetry property from corollary 3.10 gives an alternative—and straightforward—proof of this fact.
Proposition 3.11. If $D = \lambda\, \mathrm{Id}$ for $\lambda > 0$ then the uniform distribution is the unique energy minimizer.
Proof. If is not uniform, we can find a unit vector such that with as in lemma 3.9, we have
However, for , every unit vector is an eigenvector and lemma 3.9 implies that . Hence, the uniform distribution is the only minimizer of the energy.∎
Remark 3.12. The statement in proposition 3.11 does not transfer to maximizers for negative multiples of the identity. To see this, consider with and let denote the uniform distribution on . The symmetry of yields
where . Since -almost everywhere on the integrand can be strictly bounded from above by . Since it follows that
with . Therefore, cannot be a maximizer of .
Remark 3.13. The above argument can be used to show that for arbitrary , one has
for all symmetric measures if and only if is an eigenvector that corresponds to the eigenvalue of the largest absolute value. In the upcoming section, we use this insight to show that such measures are maximizers of for negative semi-definite .
If has non-positive eigenvalues, theorems 3.1 and 3.4 still show that all minimizers are invariant with respect to reflections , where corresponds to a positive eigenvalue. However, if has negative eigenvalues, such reflections can increase the energy when they are applied to general, non-minimizing measures. This is illustrated by the following example.
Example 3.14. Consider the two-dimensional case with and . For any , denote by the Dirac delta placed at . Fix and let
In the two-dimensional setting, the symmetrization is given by
Denoting, for convenience, , we have
Since is strictly increasing for , we get that since
for any and , and the inequality is strict if and only if and .
(d). Maximizers for negative semi-definite matrices
There is no apparent way to use the proof strategy from the previous section for showing that maximizers for negative definite matrices are symmetric, since the kernel is not negative definite for a negative definite . However, we can show that the quadratic form used to prove lemma 3.6 is non-positive for anti-symmetric measures. This yields a symmetry property of maximizers for negative semi-definite matrices.
Lemma 3.15. Let be a negative semi-definite matrix and a measure on the sphere. Define as
Then and the inequality is strict if and either is negative definite or on the null space .
Proof. We denote by the negation and define
This yields that and
Since is positive semi-definite, the proof of lemma 3.6 shows that and thus . The inequality is strict if and either is negative definite or is concentrated on . The symmetry of the kernel yields . Further, by substituting and , we see that
Reordering the terms leads to
From the conditions on and that lead to , we derive that the above inequality is strict if and either negative definite or on .∎
Corollary 3.16. Let be a maximizer of for a negative definite . Then .
This symmetry property is the missing ingredient for showing that the discrete measures introduced in remarks 3.12 and 3.13 are maximizers for negative semi-definite matrices .
Theorem 3.17. Let $D$ be negative semi-definite and $\lambda_{\min}$ its smallest eigenvalue. Then, a measure maximizes $E_D$ if and only if it is of the form $\frac{1}{2} (\delta_u + \delta_{-u})$, where $u$ is an eigenvector associated with $\lambda_{\min}$.
Proof. By lemma 3.15, it suffices to consider satisfying . Denoting and using the symmetry property of , with the arguments from remark 3.12, we have
where equality is only obtained if holds -almost everywhere on . Since is symmetric, this is equivalent to . For a negative definite , we already know from corollary 3.16 that there are no other measures that maximize . In the negative semi-definite case, we have that any that fulfils has to be concentrated on and, therefore, also in this case, there are no other maximizers.∎
4. Energy variation and stationary points
To study stationary points or local maximizers/minimizers, it is useful to consider the first and second variations of the energy on the Wasserstein space of probability measures on the sphere, as studied previously for Vlasov-type interactions, e.g. the mean-field aggregation equation, cf. [36,59,60]. The first variation of $E$ is given by
(4.1) $\qquad \delta E(\mu)[v] = \frac{\mathrm{d}}{\mathrm{d}s}\bigg|_{s = 0} E(\mu_s),$
where $\mu_s$ satisfies
(4.2) $\qquad \partial_s \mu_s + \nabla \cdot ( \mu_s\, P_x v ) = 0, \qquad \mu_0 = \mu,$
and $P_x$ is the projection to the tangent space of the unit ball at $x$. Here, the velocity field $v$ is an arbitrary Lipschitz function on $\mathbb{R}^d$; by the projection $P_x$, we restrict it further to admissible velocities that keep the distribution on the unit sphere.
The following weak formulation, where is a continuously differentiable test function, will be useful later:
Similar to the first variation, the second variation of can be defined as
(4.3) |
if the derivative on the right-hand side exists. The computation of the first variation is completely analogous to the case of the aggregation equation (cf. [59]) and thus omitted here.
Lemma 4.1. For any Lipschitz continuous vector field $v$, the first variation of the energy in the direction $v$ exists and is given by
(4.4) $\qquad \delta E(\mu)[v] = 2 \iint_{S \times S} e^{\langle x, D y \rangle}\, \langle D y, P_x v(x) \rangle\, \mathrm{d}\mu(y)\, \mathrm{d}\mu(x).$
It is straightforward to see that the first variation vanishes at the extremal points of the energy:
Proposition 4.2. Let $\mu$ be a minimizer or maximizer of the energy. Then $\delta E(\mu)[v] = 0$ for all Lipschitz vector fields $v$.
Proof. Let be the initial value for the transport equation (4.2). For Lipschitz-continuous vector fields, there is a unique solution of the transport equation, and for all times , it is an admissible distribution on the sphere. Hence, if is a minimizer, then
for all , which implies that in the limit . Since is arbitrary and is linear in , we have that . The case of a maximizer is treated in the same way, with an opposite inequality initially.∎
The connection between the transformer dynamics and the energy variations in Wasserstein spaces is readily established in the following.
Lemma 4.3. A probability measure $\mu$ is a stationary solution of equation (2.1) with the velocity field given by equation (1.8) if and only if $\delta E(\mu)[v] = 0$ for all Lipschitz continuous $v$.
Similarly to lemma 4.1, one can obtain an expression for the second variation.
Lemma 4.4. For being Lipschitz continuous, the second variation of the energy in the directions , exists and is given by
(a). Energy variation at concentrated distributions
From lemma 4.1, we see that any measure that fulfils
(4.5) |
is a stationary point of . Here and in the following, with a slight abuse of notation, we denote the -vector by . For concentrated measures, the above condition is also necessary and rather easy to verify, as we see in what follows. We first show that single Dirac measures can only be stationary points if they align with an eigenvector of the matrix .
Lemma 4.5. A Dirac measure $\delta_x$ with $x \in S$ is a stationary point of $E_D$ if and only if $x$ is an eigenvector of $D$.
Proof. The first variation is given by
Since is an arbitrary vector, is a stationary point if and only if
which holds if and only if is an eigenvector of .∎
Intuitively speaking, this means that the force emerging from the interaction of a particle located at an eigenvector $x$ with itself is orthogonal to the tangent space of $S$ at the point $x$ and is thus cancelled out by the projection. The same effect can be observed for convex combinations of a Dirac measure and its reflection.
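For a single Dirac, the stationarity criterion of lemma 4.5 reduces to the vanishing of the tangential component of $D x$. A minimal numerical check (our own sketch, with an illustrative diagonal matrix):

```python
import numpy as np

def tangential_force(x, D):
    """P_x(D x): the self-interaction force after tangential projection."""
    f = D @ x
    return f - (x @ f) * x            # P_x = Id - x x^T for |x| = 1

D = np.diag([2.0, 1.0, 0.5])
e1 = np.array([1.0, 0.0, 0.0])        # eigenvector of D: stationary
x = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)   # generic unit vector

assert np.allclose(tangential_force(e1, D), 0.0)
assert not np.allclose(tangential_force(x, D), 0.0)
```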
Lemma 4.6. For any $\gamma \in [0, 1]$, we have that $\mu = \gamma\, \delta_x + (1 - \gamma)\, \delta_{-x}$ with $x \in S$ is a stationary point of $E_D$ if and only if $x$ is an eigenvector of $D$.
Proof. Using the expression in lemma 4.1, we obtain for any Lipschitz continuous , using the abbreviation , that
We first observe that for any one has that . By comparing the coefficients in the above equation, we obtain that
∎
For the symmetric case $\gamma = \frac{1}{2}$ in the above lemma, we can further show that any convex combination of such stationary points is again a stationary point.
Lemma 4.7. Let $\{u_1, \ldots, u_m\} \subset S$ be a finite subset of eigenvectors of $D$ such that $\langle u_k, u_l \rangle = 0$ for all $k \neq l$. Then for any choice of parameters $\gamma_1, \ldots, \gamma_m \geq 0$ such that $\sum_{k=1}^{m} \gamma_k = 1$, the following measure is a stationary point of $E_D$:
$$\mu = \sum_{k=1}^{m} \frac{\gamma_k}{2} \left( \delta_{u_k} + \delta_{-u_k} \right).$$
Proof. We prove the statement by showing that equation (4.5) holds. For any , it holds that
since only contains eigenvectors of . On the other hand, since we also require for all it follows that and therefore,
for all . In total, this yields
for all and thus also for -almost all .∎
The above proof strategy works only for Dirac measures aligned with the eigenvectors of $D$. However, there exist other discrete measures that are stationary points, as the following example shows. For the sake of simplicity, we restrict ourselves to the two-dimensional case with a positive definite matrix $D$ and a symmetric combination of four Dirac measures. We further assume that $D$ is diagonal; the case of a general symmetric $D$ can be treated similarly with a rotation argument.
Lemma 4.8. Let $d = 2$, $\alpha \in [0, \pi/2]$ and $D = \mathrm{diag}(\lambda_1, \lambda_2)$ be diagonal and positive definite. A discrete measure:
(4.6) $\qquad \mu_\alpha = \frac{1}{4} \left( \delta_{(\cos\alpha, \sin\alpha)} + \delta_{(-\cos\alpha, \sin\alpha)} + \delta_{(\cos\alpha, -\sin\alpha)} + \delta_{(-\cos\alpha, -\sin\alpha)} \right)$
is a stationary point of $E_D$ if and only if either $\alpha \in \{0, \pi/2\}$ or
(4.7) $\qquad \lambda_1 \tanh(\lambda_1 \cos^2\alpha) = \lambda_2 \tanh(\lambda_2 \sin^2\alpha),$
where $\lambda_1, \lambda_2$ denote the diagonal entries of $D$. For any choice of $\lambda_1, \lambda_2 > 0$, there exists exactly one $\alpha \in (0, \pi/2)$ that fulfils the condition in equation (4.7).
Proof. Without loss of generality, we prove the statement for , since otherwise it holds that for a , and thus .
It follows directly from lemma 4.6 that is a stationary point if . Therefore, it remains to show that is a stationary point if and only if equation (4.7) is fulfilled. This means that we have to see when there exists a Lipschitz continuous such that .
We first fix and consider
(4.8) |
Since , we can further write , where . We factor out to rewrite equation (4.8) as with
Lemma 4.1 now gives us that
which can become zero for all admissible if and only if for all . Due to the symmetry properties of our measures , it further holds that is constant on ; therefore, it suffices to consider . Remembering that , we derive
Since , the factor cannot vanish, and the zeros of coincide with those of
This function obtains its minima at and its maxima at and strictly increases or decreases, respectively, in between. Substituting these points into equation (4.9), we see that the minima are strictly negative and the maxima are strictly positive since . Therefore, there exists exactly one zero in the interval . Using the hyperbolic identity in equation (4.9), we arrive at the criterion in equation (4.7).∎
Remark 4.9. Importantly, the angle $\alpha$ that fulfils equation (4.7) depends not only on the ratio of the eigenvalues of $D$ but also on their magnitude since they appear separately within the hyperbolic tangent.
Although the ratio of the eigenvalues does in general not determine the angle $\alpha$ that fulfils equation (4.7), we can still make a qualitative prediction based on the ratio. The left-hand side of equation (4.7) decreases monotonically for $\alpha \in (0, \pi/2)$; for $\lambda_1 = \lambda_2$, the condition is fulfilled for $\alpha = \pi/4$. Therefore, the condition is fulfilled by some $\alpha > \pi/4$ if $\lambda_1 > \lambda_2$ and by some $\alpha < \pi/4$ if $\lambda_1 < \lambda_2$. The numerical experiments in §5b show that the measures characterized by equation (4.7) are not only stationary points but also minimizers among empirical measures consisting of at most four Dirac measures (see also the numerical sketch below). In the remainder of this section, we aim to characterize minimizers for positive definite matrices in arbitrary dimensions $d$.
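Since the condition in equation (4.7) is transcendental, the stationary angle is best found numerically. The following sketch (the values $\lambda_1 = 2$, $\lambda_2 = 1$ are our illustrative choices) locates the root with a standard bracketing solver and cross-checks that it is a critical point of the energy along the four-Dirac family from equation (4.6):

```python
import numpy as np
from scipy.optimize import brentq

lam1, lam2 = 2.0, 1.0
D = np.diag([lam1, lam2])

def criterion(alpha):
    """Condition (4.7): lam1 tanh(lam1 cos^2 a) - lam2 tanh(lam2 sin^2 a)."""
    return lam1 * np.tanh(lam1 * np.cos(alpha)**2) \
         - lam2 * np.tanh(lam2 * np.sin(alpha)**2)

alpha = brentq(criterion, 1e-6, np.pi / 2 - 1e-6)   # unique root in (0, pi/2)

def family_energy(a):
    """Energy of mu_a from (4.6): four symmetric Diracs with weight 1/4."""
    c, s = np.cos(a), np.sin(a)
    P = np.array([[c, s], [-c, s], [c, -s], [-c, -s]])
    return np.exp(P @ D @ P.T).mean()

h = 1e-5   # the root of (4.7) should be a critical point along the family
assert abs(family_energy(alpha + h) - family_energy(alpha - h)) / (2 * h) < 1e-6
```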
(b). Energy variation at the uniform distribution
To characterize minimizers for positive definite $D$, we start by identifying the cases when the uniform distribution is a stationary state. As we show in the following lemma, this can only be the case if the strength of the interaction does not depend on the direction, i.e. the eigenvalues of $D$ all have the same absolute value.
Lemma 4.10. The uniform distribution is a stationary point of $E_D$ if and only if all eigenvalues of $D$ have the same absolute value, i.e. $|\lambda_i| = \lambda$ for all $i$ and some $\lambda \geq 0$.
Proof. To keep the notation simple, we treat here the case , leaving the general proof for to appendix C(a). Let us fix and determine such that . Consider the integral
which can be rewritten with a change of variables as follows (recall that ):
From the above derivations, we see that if and only if is an eigenvector of . This holds true for -almost all if and only if . This automatically yields if . It remains to show that this is also a necessary condition.
Without loss of generality, we assume that , where and are the eigenvalues corresponding to the eigenvectors and , respectively. Then, is strictly negative on the set
Since we can find a Lipschitz continuous such that for -a.e. on and
For all such it holds that , which concludes the proof.∎
Since we already know that minimizers for $D$ with at least one negative eigenvalue are Dirac measures, we can conclude that the uniform distribution is only a minimizer for $D = \lambda\, \mathrm{Id}$ with $\lambda \geq 0$.
Corollary 4.11. The uniform distribution minimizes $E_D$ if and only if $D = \lambda\, \mathrm{Id}$ for some $\lambda \geq 0$.
Proof. We only need to show that there are no other matrices such that is minimized by ; the other direction has been treated in proposition 3.11. The measure can only be a minimizer if it is a stationary point. By lemma 4.10, this implies that all eigenvalues of have to have the same absolute value. If such has at least one negative eigenvalue, it is also the smallest eigenvalue. Thus, by theorem 3.1, the only minimizers are Dirac deltas placed at eigenvectors corresponding to the negative eigenvalue.∎
(c). Perturbation of the identity
It is not clear whether an explicit computation of stationary points for an arbitrary positive definite matrix with at least two distinct eigenvalues is possible, but some insight can be gained with asymptotic analysis. We consider the following perturbed energy:
$$E_\varepsilon(\mu) = \iint_{S \times S} e^{\langle x, (\mathrm{Id} + \varepsilon A) y \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y),$$
where $A$ is a diagonal matrix and $\varepsilon > 0$ is a small parameter. Using the second-order Taylor expansion of the exponential function, we can write
(4.10) $\qquad E_\varepsilon(\mu) = \iint_{S \times S} e^{\langle x, y \rangle} \Big( 1 + \varepsilon \langle x, A y \rangle + \tfrac{\varepsilon^2}{2} \langle x, A y \rangle^2 \Big)\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y) + \mathcal{O}(\varepsilon^3).$
For $\varepsilon = 0$, we know that the unique minimizer is the uniform distribution $\mu_0$ on the sphere. Therefore, we use the following second-order asymptotic ansatz:
(4.11) $\qquad \mu_\varepsilon = \mu_0 + \varepsilon \nu_1 + \varepsilon^2 \nu_2.$
We stress that here we consider the energy as a function on the space of signed Radon measures on the sphere with the total variation norm and not on the space of probability measures with the Wasserstein metric as in §4a. For this reason, the perturbation here is a measure and not a vector field (cf. equation (4.1)).
Substituting equation (4.11) into equation (4.10) and neglecting higher-order terms, we derive
Since further is constant on , it follows that
In particular, we see that the term from equation (4.11) does not contribute to the second-order expansion of the energy. Therefore, minimizing over all possible satisfying equation (4.11) is equivalent to minimizing
over all signed measures with . The first variation in the direction satisfying is given by
(4.12) |
Our goal is now to find an optimal measure , such that its first variation vanishes in any direction such that . To do so, we shall need the following two technical lemmas. To make the definition of the uniform distribution on the sphere rigorous, we denote by the -dimensional Hausdorff measure and write instead of .
Lemma 4.12. Let and . It holds that
(4.13) |
for any , where the constant is positive and depends only on the dimension .
Proof. For the sake of simplicity, here we present the (more intuitive) proof for , leaving the general case to appendix C(b). We write and and derive that
where we use the coordinate transform and two trigonometric identities to separate the summands inside sine and cosine. Since , this yields equation (4.13) with
∎
Lemma 4.13. Let and . It holds that for any
where the constants and are positive and depend only on the dimension .
Proof. For the sake of simplicity, we again present the proof for ; the general case is treated in appendix C(c). Using the same arguments as in the previous proof, we derive
where the mixed terms containing vanish due to symmetry. Further, since , we can write
This yields equation (4.14) with positive constants:
∎
Lemma 4.12 allows us to rewrite the second summand in equation (4.12) such that it contains . Using lemma 4.13, we can then deduce that, up to constants, the measure is a stationary point of .
Theorem 4.14. The measure
fulfils and for all satisfying .
Proof. From the definition of and , it follows that . With lemma 4.12, we write the optimality condition derived from equation (4.12) as
Substituting into the left-hand side and using lemma 4.13, we get
where all terms that do not depend on , including , vanish due to . Substituting completes the proof.∎
Theorem 4.14 gives us the following intuitive characterization. The measure that optimizes the perturbed energy is obtained by taking mass from the uniform distribution where is large and adding it where is small. In other words, we expect minimizers of the energy with a positive definite matrix to have more mass in regions that correspond to small eigenvalues of than in regions that correspond to large ones. This intuition is in line with the results of the particle approximation in figure 3. Furthermore, in figure 5, we also observe that the density obtained in equation (4.11) with the measure from above can indeed be seen as a first-order approximation for small values of .
5. Numerical examples
To illustrate the obtained theoretical results, we perform a series of numerical experiments using a particle approximation of the energy from equation (1.2) with an ensemble of particles $x_1, \ldots, x_n \in S$,
$$E_n(x_1, \ldots, x_n) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{\langle x_i, D x_j \rangle}.$$
We consider the following particle flow, introduced in [9],
$$\dot{x}_i = P_{x_i}\Big( \frac{1}{Z_i} \sum_{j=1}^{n} e^{\langle x_i, D x_j \rangle}\, V x_j \Big), \qquad i = 1, \ldots, n,$$
with normalization factors $Z_i > 0$. If we choose the constant normalization
(5.1) $\qquad Z_i = n,$
this corresponds merely to a step-size rescaling of a standard gradient descent scheme for $E_n$, which is called the (USA) flow in [9]. Choosing the normalization as the partition function
(5.2) $\qquad Z_i = \sum_{j=1}^{n} e^{\langle x_i, D x_j \rangle}$
corresponds more closely to the self-attention dynamics and is labelled the SA flow in [9]. In what follows, we mostly use the normalization in equation (5.2), highlighting minor differences between the two formulations as appropriate. We use the explicit Euler discretization from equation (1.5) with step size $h > 0$ to obtain the following update:
(5.3) $\qquad x_i^{(k+1)} = P\Big( x_i^{(k)} + \frac{h}{Z_i^{(k)}} \sum_{j=1}^{n} e^{\langle x_i^{(k)}, D x_j^{(k)} \rangle}\, V x_j^{(k)} \Big).$
Remark 5.1. For $n = 1$ and $V = D$, this scheme reduces to the following power iteration in the limit $h \to \infty$:
$$x^{(k+1)} = \frac{D x^{(k)}}{\|D x^{(k)}\|}.$$
In this regard, the iteration in equation (5.3) can be seen as a method for approximating the largest eigenvalue and the corresponding eigenvector. We leave further analysis of this connection to future work.
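A compact implementation of the update (5.3) with the SA normalization (5.2), together with a check of the power-iteration limit from remark 5.1, might look as follows (a sketch; the matrix and all parameter values are our choices):

```python
import numpy as np

def sa_step(X, D, V, h):
    """Update (5.3) with the partition-function normalization (5.2)."""
    W = np.exp(X @ D @ X.T)               # e^{<x_i, D x_j>}
    A = W / W.sum(axis=1, keepdims=True)  # softmax rows, Z_i = partition function
    Y = X + h * (A @ X @ V.T)
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

# Remark 5.1: for n = 1 and V = D, a huge step mimics the limit h -> infinity.
D = np.diag([3.0, 1.0, 0.5])
x = np.ones((1, 3)) / np.sqrt(3.0)
for _ in range(50):
    x = sa_step(x, D, D, h=1e6)           # effectively x <- D x / ||D x||
lam = (x @ D @ x.T).item()                # Rayleigh quotient
assert abs(lam - 3.0) < 1e-6              # approximates the largest eigenvalue
```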
The source code for the experiments here is available at https://github.com/TimRoith/TransformerDynamics and uses Python [61], mainly building upon the packages NumPy [62], SciPy [63] and PyTorch [64].
(a). Maximizers for positive definite matrices
To validate our results on maximizers, we first consider a simple set-up of a one-particle system, . We choose and run the scheme in equation (5.3) for iterations. We only report the results for the adaptive normalization from equation (5.2), those for the constant normalization from equation (5.1) being essentially the same. For , we know that every single Dirac is a maximizer, which is indeed observed in figure 1a. Here, each random initialization on the sphere leads to a different final state. In fact, in this case, there is no evolution at all, and the particle stays at its initial position. If is positive definite and has a strictly largest eigenvalue , theorem 3.1 shows that only Diracs at eigenvectors corresponding to are maximizers. This can be observed in figure 1b where the final state is either at or .
Figure 1.
Discrete maximizers on the sphere for particles. The colour indicates the value of at each point on the sphere. (a) For every single Dirac is a maximizer. We show the results for 30 different initializations (b) For the final state is either (0, 0,1) or (0,0,−1).
For multiple particle systems with , lemma 4.6 suggests also that linear combinations of an eigenvector with its negative are stationary points. These linear combinations are not maximizers, but their basin of attraction depends on the eigenvalues of the matrix. In figure 2 (left), we plot the probability (i.e. the proportion of random initializations) of converging to a single cluster versus two clusters as function of the eigenvalues. We fix and vary between and . Note that, as discussed in lemma 4.8 and remark 4.9, the actual values of the eigenvalues matter; not just their ratio. For , the probability of converging to a single cluster is high, whereas for larger values , most trajectories converge to two clusters. The results in figure 2 were obtained with the adaptive normalization from equation (5.2); however, we observed the same quantitative behaviour with the constant normalization from equation (5.1).
Figure 2.
We study the trajectories for a symmetric positive definite matrix with and different initializations using particles. We evaluate the number of clusters at the final iteration with the -means implementation of the SciPy package [63]. The centre of each cluster is close to an eigenvector corresponding to an eigenvalue of maximal absolute value. For , the evolution converges to the optimal state with a single cluster (blue, solid), while for bigger values, it tends to get stuck in the suboptimal stationary state with two clusters (red, hatched) from lemma 4.6.
(b). Minimizers for positive (semi-)definite matrices
We now study discrete minimizers for positive definite matrices. In figure 3, we show how the matrix influences the particle configuration to which the scheme in equation (5.3) converges. Here, too, we used the adaptive normalization from equation (5.2); the results for the constant one from equation (5.1) are largely the same.
Figure 3.
Final states for the minimization scheme after 10 000 steps with particles. The colour indicates the value of at each point on the sphere. In (a), the uniform distribution is the minimizer of the energy. In (b), the particles do not form clusters at single Diracs but rather follow a smooth distribution on the sphere. In (c), any configuration with for all is a minimizer. In (d), any configuration with for all is a minimizer.
Furthermore, in figure 4, we illustrate the results of lemma 4.8 for matrices with varying values . We initialize particles according to equation (5.4) and let the scheme in equation (5.3) run for 10 000 iterations.
Figure 4.
We consider minimizers for the matrix . Starting with the initial configuration described in equation (5.4), we compute the mean of over all particles. For a small step size, the resulting curve is very close to the identity, as predicted by lemma 4.8. If is too big, the dynamics converge to a suboptimal stationary point. We also compare the normalizations given by equations (5.1) and (5.2). We see that with the same step size , the adaptive normalization in equation (5.2) yields faster convergence than the constant one in equation (5.1).
From the final particle state, we compute the value for each particle separately; lemma 4.8 tells us that this should be equal to for the minimizer. In figure 4, we observe that this holds true for the particle configurations computed with the discrete scheme. However, if the step size is too big compared to the value , the system instead converges to the two-cluster stationary point from figure 2. Here, we notice a slight difference between the two normalizations. The adaptive normalization from equation (5.2) allows choosing bigger step sizes than the constant normalization from equation (5.1), enabling faster convergence to the large-time limit.
We further investigate the validity of the asymptotic solution from theorem 4.14 in the two-dimensional case. Here, we deviate from the particle approximation and instead discretize the interval with equidistant grid points and the associated points on the sphere . In this setting, we then aim to minimize
where is a probability vector. Note that already, for , a more sophisticated quadrature rule would be required, e.g. the Lebedev quadrature on the sphere [65]. To deal with the simplex constraint for the vector , we use exponentiated gradient descent, specifically mirror descent with the negative log-entropy as the distance generating function [66], which yields the update
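Since the displayed update is lost in this rendering, the following is a generic sketch of the exponentiated-gradient step (mirror descent with negative log-entropy) on the probability simplex, in the spirit of [66]. The energy gradient `grad`, the step size `eta` and the iteration count are placeholders for the quantities used in the experiment.

```python
import numpy as np

def exponentiated_gradient(grad, p0, eta=0.1, steps=500):
    """Mirror descent with negative log-entropy on the simplex:
    p <- p * exp(-eta * grad(p)), followed by renormalization."""
    p = p0.copy()
    for _ in range(steps):
        g = grad(p)
        p = p * np.exp(-eta * (g - g.min()))  # shift for numerical stability
        p = p / p.sum()                       # project back onto the simplex
    return p
```

The multiplicative form of the update automatically preserves non-negativity, and the renormalization enforces the unit-mass constraint, which is why this scheme is a natural choice for optimization over probability vectors.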
We take the perturbation matrix as , that is, the perturbed matrix is given by . Recall the asymptotic expansion in equation (4.10). As noted in §4c, the contribution of the term vanishes in the second-order expansion of the energy, and we are left with the solution:
(5.7)
where is as in theorem 4.14. We note that this measure has a Lebesgue density that can be evaluated at the grid points in ; we denote the resulting vector by . In figure 5, we compare this solution to the vector obtained by solving equations (5.5)–(5.6). The vector for different values of is shown in figure 5a; in figure 5b, we plot the error .
Figure 5.
Numerical study of the asymptotic solution from theorem 4.14 in two dimensions. (a) The probability vectors computed using equation (5.5) with 500 steps for . (b) The approximation error for the first-order expansion in equation (5.7) (blue, solid) and the conjectured form in equation (5.8) (green, dotted).
Beyond the first-order expansion in equation (5.7), we conjecture that behaves as follows:
(5.8)
where is a function to be determined. Taking a second-order Taylor expansion , we estimate the coefficients via linear regression with the given vectors as data points and obtain . The error of this approximation is shown in figure 5b and is lower than that of the first-order expansion in equation (5.7). We leave the analysis of this ansatz to future work.
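The regression step can be realized, for instance, by an ordinary least-squares fit of a quadratic in the expansion parameter. The helper below is a hypothetical sketch, with `x` the sampled parameter values and `y` the corresponding measured quantities from the experiment.

```python
import numpy as np

def fit_quadratic(x, y):
    """Least-squares fit of y ~ c0 + c1*x + c2*x**2; returns (c0, c1, c2)."""
    A = np.vander(np.asarray(x), 3, increasing=True)  # columns: 1, x, x^2
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(y), rcond=None)
    return coeffs
```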
(c). Maximizers for negative definite and indefinite matrices
We proceed to numerical examples for §3d, i.e. maximization of the energy corresponding to a negative definite matrix. We take a system of particles and consider the two matrices from figure 1 multiplied by . The results are shown in figure 6. We observe that a single final state consists of clusters at , where is an eigenvector corresponding to the smallest eigenvalue, in agreement with theorem 3.17. As shown there, the behaviour does not change if one of the eigenvalues is zero, as only the eigenvectors corresponding to the smallest eigenvalue are relevant. For this reason, we do not consider the semi-definite case separately. The results here are not affected by the choice of the normalization; we only show those obtained with equation (5.2).
Figure 6.
Discrete maximizers on the sphere for negative definite matrices obtained with particles. We visualize the two-cluster final states by connecting the two components of each cluster corresponding to the same run with a line, assigning different colours to the two opposite clusters. The colour of the sphere indicates the value of at each point on the sphere. (a) For D = −Id a single final state has clusters at both and for any . For clarity, we only show results for 6 different initializations. (b) For D = −diag(1,3,4) a single final state has clusters both at (0,0,1) and (0,0,−1). We show the results for 100 different initializations.
Finally, we turn to the case of indefinite matrices. As noted in remark 3.5, for a matrix that is not negative definite, a Dirac delta placed at the eigenvector corresponding to the largest eigenvalue may not be a maximizer. This can be observed numerically as shown in figure 7 where we plot the energies of one- and two-cluster states for with .
Figure 7.
Energies of the states in blue, in red and in green for the matrix with varying values of .
6. Conclusion
In this work, we studied a mathematical model of self-attention layers used in the transformer architecture. Building upon [9], we analysed a continuum limit in the space of probability measures on a sphere. To understand the underlying geometry, we studied a new optimal transport distance with non-local mobility. We proved that the space of probability measures with this distance is a geodesic space and characterized absolutely continuous curves in this space. This allowed us to interpret the continuity equation (2.5) as curves of maximal slope of the interaction energy and to analyse the large-time behaviour using the energy dissipation property, showing that the dynamics converge to a stationary point of the interaction energy.
We analysed these critical points (in particular, minimizers and maximizers) for various types of interactions determined by the matrix in equation (1.2). These results are listed in table 1. We find that the positions of stationary points are strongly connected to normalized eigenvectors of , which form a strict subset of in the case . In other words, the regions where clusters appear depend not only on the initial configuration but also on the interaction matrix itself. This could be related to the mode collapse often observed in practice. It is an interesting question whether an alternative, rotation-invariant architecture could prevent mode collapse.
Table 1.
Summary of results on minimizers/maximizers of the interaction energy in equation (1.2). We denote by and the eigenvectors that correspond to the smallest, respectively largest, eigenvalue of .
property of | minimizers | maximizers
---|---|---
positive definite | symmetric w.r.t. all eigenvectors (corollary 3.10 and §5b) | (theorem 3.1 and §5a)
positive semi-definite | any concentrated on (theorem 3.4 and §5b) | (theorem 3.1)
negative (semi-)definite | (theorem 3.1) | (corollary 3.16 and §5c)
indefinite | (theorem 3.4) | maximal: (theorem 3.1 and §5c)
Several further questions remain open for future work: as already discussed, it would be interesting to study the optimal transport distance for mobilities that cannot be bounded from below, which is the case, for example, in problems of opinion dynamics where the Gaussian kernel on Euclidean space is often used. In this case, the metric is no longer equivalent to . So far, we have only shown that equation (2.6) represents gradient flows in using the concept of curves of maximal slope. We do not know whether these curves satisfy the slightly stronger evolution variational inequality, which would yield an easy stability estimate for solutions of equation (2.6).
From a practical point of view, an even more interesting direction is studying more general flows in that correspond to non-symmetric matrices in equation (1.2), which is common in transformer architectures. As mentioned above, basic properties of the distance carry over to the non-symmetric case, but characterizing the stationary states is non-trivial; one possibility is splitting the effective velocity fields into a dissipative and a (generalized) divergence-free part, similar to non-symmetric Fokker–Planck equations.
Finally, to justify the use of the continuum limit for studying the practical behaviour of transformers, one needs to establish convergence of discrete time-stepping in arbitrary time intervals. Moreover, it is worth studying how the step size influences the behaviour of the system and what effect weight-sharing would have.
Appendix A. Proofs of Section 2
A.1. Continuity equation on manifolds
Let be a compact, -dimensional Riemannian manifold and its tangent bundle. Although is not a vector space, the tangent bundle itself can be considered as a -dimensional Riemannian manifold. For its proper definition and the topology on , we refer to [67, ch. 3 (The Tangent Bundle)]. Velocity fields on manifolds are maps such that , where is the projection map sending each vector in to . We shall regularly commit the mild crime of interpreting as an element in instead of . Let be an open interval, be a Borel family of probability measures on and be a time-dependent Borel velocity field such that
(A 1)
where denotes the norm induced by the inner product of the Riemannian structure. The continuity equation holds in the sense of distributions if
(A 2)
Here, denotes the differential of the map for a fixed .
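Since the displayed conditions (A 1) and (A 2) are elided in this rendering, we record the standard formulations they correspond to, following the conventions of [53]; the integrability exponent below is our assumption.

```latex
% integrability of the velocity field, cf. (A 1)
\int_I \int_M \lVert v_t(x) \rVert_x \, \mathrm{d}\mu_t(x) \, \mathrm{d}t < \infty ;
% distributional form of the continuity equation, cf. (A 2)
\int_I \int_M \Bigl( \partial_t \varphi(t,x)
    + \mathrm{d}_x \varphi(t,x)\,[\,v_t(x)\,] \Bigr)
    \, \mathrm{d}\mu_t(x) \, \mathrm{d}t = 0
    \qquad \text{for all } \varphi \in \mathrm{C}^\infty_{\mathrm{c}}(I \times M).
```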
Proposition A.1 (Properties). Solutions to the continuity equation have the following properties:
— Continuous representative: Let be a Borel family of probability measures satisfying equation (A 2) for a Borel vector field satisfying equation (A 1). Then there exists a narrowly continuous curve such that for a.e. . Moreover, if and we have [53, Lemma 8.1.2]:
(A 3)
— Time rescaling: Let be a strictly increasing absolutely continuous map with absolutely continuous inverse . Then is a distributional solution of the continuity equation if and only if [53, Lemma 8.1.3]
is a distributional solution of the continuity equation on .
— Gluing solutions: Let be two narrowly continuous curves in with . Let further be the corresponding Borel velocity fields such that equation (A 3) is satisfied. Then and defined by
satisfy equation (A 3) [69, Lemma 4.4].
A.2. Proof of Theorem 2.2
We follow the proof strategy from [69] for the ‘flat’ Euclidean case, but since is not a vector space, modifications are required. We start by establishing a compactness result for solutions of continuity equations with finite energy. For our purposes, we define the ‘lifted’ flux in duality with (see [70, Theorem 7.2]) by
(A 4)
Notably, solve the continuity equation in the sense that for all :
(A 5)
where is the extension of on to that is constant along . Further, we define in duality with by
Lemma A.2. Let be a sequence in with
Then there exists a subsequence and a couple satisfying the continuity equation in the sense of equation (A 5) such that
and for the map one has
(A 6)
Proof. Step 1 (Convergence of ):
The estimate
combined with the fact that is compact and [53, Remark 5.1.5] implies tightness of . By disintegrating , we obtain a Borel family such that . Since is compact, is tight, and we extract a further subsequence such that .
Step 2 (Convergence of ):
Consider a function and for set . Since the discontinuity set of is concentrated on and , general convergence theorems (see, e.g. [53, Prop. 5.1.10]) imply
Let us fix a . Since is compact, is tight, and we can extract from any subsequence a further subsequence such that converges narrowly. Then by equations (A 5) and (A 7) and the fact that is dense in , we know that all subsequences have the same limit. Therefore, for a particular . By the previous calculations, we also immediately obtain that satisfy the continuity equation in the sense of equation (A 5). To show equation (A 6), we observe that since is compact
∎
Proof of Theorem 2.2. Step 1:
Let be a minimizing sequence of the functional in equation (2.3) for some . Then the conditions of lemma A.2 are met, and we obtain
where the limit satisfies the continuity equation in the sense of equation (A 5). Equation (A 6) in particular implies that can be disintegrated in the following way:
where . Using [53, Lemma 5.1.7], we now show that for it holds that
where in the last line, we used Jensen’s inequality. Since is linear and satisfy equation (A 5), this implies that and for this couple the infimum in equation (2.3) is attained.
Step 2:
Proposition A.1 and a linear time rescaling show that
(A 8)
We denote by the infimum in equation (2.4) and show that indeed . By Hölder’s inequality, we immediately obtain that . To show the reverse, we follow the arguments of [69, Theorem 5.4] and define for :
Then is strictly increasing, and with , so that its inverse map is well defined and Lipschitz continuous and
By proposition A.1, we have that for , the couple and
with the last term being smaller than or equal to . Sending , we obtain
and hence, . This in particular implies that for every minimizer of the functional in equation (2.3), the equality
holds, which is only the case when is constant for a.e. , implying equation (A 9) by a further time rescaling argument. ∎
A.3. Proof of Lemma 2.4
Proof of Lemma 2.4. If and then by equation (2.4), we have
On the other hand, if is an absolutely continuous curve, then by a standard reparametrization argument [53, Lemma 1.1.4], we may assume to be Lipschitz. For , we set the step size as and choose a family of constant-speed geodesics , such that for
Gluing all geodesics together by proposition A.1, we obtain a curve . Lemma A.2 gives us a subsequence, still denoted by , and a couple such that and . By construction, and coincide on the dense (in ) set . Since both and are narrowly continuous, must hold. Again, equation (A 6) implies that can be disintegrated in the following way:
where . Then with and
(A 10)
Since , we have that
Finally, for equation (A 10) to hold, must hold for a.e. . ∎
A.4. Proof of Lemma 2.6
Proof of Lemma 2.6. From theorem 2.3, we know that the distances and are equivalent. Therefore, we can assume absolute continuity with respect to . Further, by a standard rescaling argument (e.g. [53, Lemma 1.1.4] or [53, Lemma 8.1.3]), it is enough to prove equation (2.7) for -Lipschitz curves (w.r.t. ), i.e. we only need to consider absolutely continuous curves such that
For convenience, we shall set for and for as well as for . We define the function for which
in the distributional sense. Using the mollifier , as described in [71, ch. C.5], one can smooth out in the time direction by setting
By [71, ch. C.5, Theorem 7 (iii)], we have that pointwise, and with the use of the dominated convergence theorem with the upper bound , we calculate
We further have that
where for we use the definition of the distributional derivative and rearrange the integral using the Fubini–Tonelli theorem. To prove , we need to define a piecewise constant approximation of . We fix a and set for
Since is -Lipschitz, we have for all . Then, we estimate
where is the optimal transport plan between and and denotes the dual norm of . (For more details on the static formulation of Wasserstein distances via optimal transport plans, we refer to [53, ch. 6]). We can argue similarly in the mollified case:
We denote and combine equations (A 11) and (A 12) to estimate
where, first, is chosen such that and, second, such that and for each , it holds (by lemma A.3). Therefore is proven.
Finally, by lemma A.4, we obtain that we can use as a test function in equation (A 3) and send to obtain
∎
Lemma A.3. Let be Borel measurable and with
For
it holds
Proof. We adapt [71, ch. C.5, Theorem 7] to our case and start by showing
We approximate in by (see [70, Proposition 7.9]) and calculate
From [71, ch. C.5, Theorem 7], we know that for all because is continuous. Choosing such that and using the dominated convergence theorem, we get . As can be chosen arbitrarily small, we obtain convergence.∎
Lemma A.4. We have .
Proof. Let be a smooth local chart for an open set containing . Then, since , the function is continuous in and the product
is continuous on . Taking any sequence , we can use the dominated convergence theorem to obtain
An upper bound is given by the function . Thus, is continuous in . With the same argument, a similar statement can be shown for
By [72, Theorem 2.8], it follows that and since the local chart was chosen arbitrarily .∎
Appendix B. Spherical coordinates
For many computations in §4, we use spherical coordinates. Up to small notational changes, we use the definition provided in [73]. We define the coordinate transform for as
Here and in the following, denotes the th standard basis vector.
The Jacobian determinant is given by
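Since both the coordinate transform and its Jacobian determinant are elided in this rendering, we record one standard convention, following [73] up to the ordering of the angles, which is an assumption here.

```latex
% spherical coordinates on S^{d-1}, one standard convention (cf. [73])
x_1 = \cos\varphi_1, \qquad
x_k = \cos\varphi_k \prod_{l=1}^{k-1} \sin\varphi_l \quad (2 \le k \le d-1), \qquad
x_d = \prod_{l=1}^{d-1} \sin\varphi_l,
% with \varphi_1,\dots,\varphi_{d-2} \in [0,\pi], \ \varphi_{d-1} \in [0,2\pi);
% the Jacobian determinant then reads
J_d(\varphi) = \prod_{l=1}^{d-2} \sin^{\,d-1-l}(\varphi_l).
```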
To highlight the recursive character of with respect to , we further note that
where the index denotes that we drop the first element, i.e. for , . A practical consequence of this property is the recursive computation formula for the Hausdorff measure of the -dimensional sphere.
Lemma B.1. Denote . For , it holds that
Proof. For , the proof follows from a simple computation and the fact that and . For , we have
where we use the recursive property of the Jacobian determinant. ∎
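The precise constants of lemma B.1 are elided in this rendering, but the standard recursion it corresponds to, namely that the measure of the sphere in one dimension higher is obtained by integrating a power of the sine against the measure of the lower-dimensional sphere, can be checked numerically. The sketch below compares the closed-form surface area with the recursion; the function names are our own.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def sphere_area(d):
    """Hausdorff measure of the d-dimensional unit sphere S^d in R^{d+1}."""
    return 2 * np.pi ** ((d + 1) / 2) / gamma((d + 1) / 2)

# numerical check of |S^d| = |S^{d-1}| * int_0^pi sin^{d-1}(t) dt
for d in range(2, 6):
    integral, _ = quad(lambda t: np.sin(t) ** (d - 1), 0, np.pi)
    assert np.isclose(sphere_area(d), sphere_area(d - 1) * integral)
```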
B.1. Definition using Givens rotations
Spherical coordinates can equivalently be defined using Givens rotations (see e.g. [74, ch. 5.1.8]). A Givens rotation for an angle and indices with is determined by the rotation matrix :
Applying to a vector corresponds to a counterclockwise rotation of by the angle in the -plane. For a given vector of angles , we can thus construct the matrix
(B 1)
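For illustration, a minimal sketch of constructing a Givens rotation and composing the product in equation (B 1) is given below; the composition order and sign convention are assumptions matching one common choice, and the function names are our own.

```python
import numpy as np

def givens(d, i, j, theta):
    """Givens rotation in R^d: counterclockwise rotation by theta in the
    (e_i, e_j)-plane (0-based indices), identity on the complement."""
    G = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i], G[j, j] = c, c
    G[i, j], G[j, i] = -s, s
    return G

def rotation_from_angles(phi):
    """Compose Givens rotations for a vector of angles phi = (phi_1, ..., phi_{d-1});
    the ordering of the factors is an assumption."""
    d = len(phi) + 1
    R = np.eye(d)
    for k, angle in enumerate(phi):
        R = R @ givens(d, k, k + 1, angle)
    return R
```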
The rotation matrix can be written as a two-dimensional rotation of angle in the -plane, as the following lemma shows.
Lemma B.2. Let be the rotation matrix as described in equation (B 1) . Then, it holds that
with , and .
Proof. For , the statement can be verified by inserting and the definition of . For , we define
With this choice of , has the claimed form and due to the orthogonality of Givens matrices. It remains to show that the first two rows of fulfil and . For , reduces to
and clearly, and . For , the proof follows by induction over . ∎
Corollary B.3. Let , then
In particular, if it holds that
With the above results, we obtain
and since Givens matrices are orthogonal, it also holds that
We can therefore also consider rotated spherical coordinates
for a reference point , with the same Jacobian determinant as before, i.e. .
Appendix C. Proofs for Section 4
C.1. Proof of Lemma 4.7
Lemma 4.7 (cont.) Let . The uniform distribution is a stationary point of if and only if all eigenvalues of have the same absolute value, i.e. for some .
Proof. The proof for uses the same arguments as for ; however, the rotation corresponding to a translation of the angle in two dimensions is technically more complicated in higher dimensions. We use the notation and techniques from appendix B (spherical coordinates and rotations ).
Again, we first fix and consider the integral
Similarly to the two-dimensional case, we choose such that
and therefore also
where denotes the first standard basis vector. We rewrite the integral using rotated spherical coordinates and substitute it into the above identity to obtain
where denotes the Jacobian determinant of . To reduce the above integral over the vector to an integral over only the first component , we write
where the subscript denotes that we neglect the first component. Inserting this into , we get
and due to the symmetry of sine and cosine, we have that for any , . We can thus deduce that if and only if is an eigenvector of , exactly as in the case . This holds true for -almost all if and only if all eigenvalues of have the same absolute value, which then automatically yields .
Again, it remains to show that this is also necessary. Without loss of generality, we assume and and to be the eigenvalues of largest, respectively second largest, absolute value corresponding to the eigenvectors , respectively .
From here, the strategy is exactly the same as in the two-dimensional case, which we restate here for completeness. The factor is strictly negative on the set
Since , we can find a Lipschitz continuous such that for -a.e. on and
For all such , it holds that , which concludes the proof.∎
C.2. Proof of Lemma 4.8
Lemma 4.8 (cont.) Let , and . Then, it holds that
for any , where the constant is positive and depends only on the dimension .
Proof. The proof for goes along the lines of the proof for . However, the rotation corresponding to a translation of the angle in two dimensions is technically more complicated in higher dimensions. For an introduction to rotated spherical coordinates used in this proof, we refer the reader to appendix B.
We first fix and choose such that . We proceed to write the integral using rotated spherical coordinates and obtain
Substituting the expressions for and yields
In addition, we note that we can write any and see that
where denotes the first standard basis vector. Substituting the above equality into the integral, we derive
The proof now follows from choosing the constant:
which is positive for all since the function is positive for and both sine and cosine are positive for .∎
C.3. Proof of Lemma 4.9
Lemma 4.9 (cont.) Let , and . Then, for all and , it holds that
where the constants and are positive and depend only on the dimension .
Proof. Using the same arguments as in the previous proof, we obtain
where the mixed term containing vanishes due to symmetry. Since the second term still depends on due to the rotation, we write and decompose into its rotation-invariant and rotation-variant part. More precisely, we use corollary B.3 to get
and thus
Making use of the trigonometric identity , we get
where in the last step, we use the fact that . To prove that the integral over the expression in equation (C 4) can be written as claimed, we observe that for all
where is positive and depends only on , and therefore, also . With this, we derive that
for all and it remains to show that for any
The case is trivial as . For , we write out the integrand and obtain
where we can use the same argument as for equation (C 5) to show that the last summand integrates to zero. Since also for any , we derive equation (C 6). Together with equations (C 4) and (C 5), this yields
The statement now follows from substituting the above into equation (C 3), with constants given by
Since for all , it directly follows that . To show that for all , we first show that . For , this follows directly from . For , we have
Using integration by parts, we further derive that
As shown in lemma B.1, the recursive form of the Jacobian determinant of spherical coordinates yields that
Combining these equalities, we see that
and therefore, with integration by parts, we get
Due to the symmetry of sine and cosine, we get
where the positivity follows from the fact that the function is positive for and both sine and cosine are positive for .∎
Contributor Information
Martin Burger, Email: martin.burger@desy.de.
Samira Kabri, Email: samira.kabri@desy.de.
Yury Korolev, Email: ymk30@bath.ac.uk.
Tim Roith, Email: tim.roith@desy.de.
Lukas Weigand, Email: lukas.weigand@desy.de.
Data accessibility
This article has no additional data.
Declaration of AI use
We have not used AI-assisted technologies in creating this article.
Authors’ contributions
M.B.: conceptualization, formal analysis, funding acquisition, investigation, methodology, supervision, writing—original draft; S.K.: formal analysis, investigation, methodology, visualization, writing—original draft; Y.K.: conceptualization, formal analysis, funding acquisition, investigation, methodology, writing—original draft; T.R.: funding acquisition, investigation, methodology, software, visualization, writing—original draft; L.W.: formal analysis, investigation, methodology, writing—original draft.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration
We declare we have no competing interests.
Funding
M.B. and T.R. acknowledge funding by the German Ministry of Science and Technology (BMBF) under grant agreement No. 01IS24072A (COMFORT). M.B., S.K., T.R. and L.W. acknowledge support from DESY (Hamburg, Germany), a member of the Helmholtz Association HGF. This research was supported in part through the Maxwell computational resources operated at Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany. M.B. and S.K. acknowledge support from the German Research Foundation, project BU 2327/19-1. M.B. and L.W. acknowledge support from the German Research Foundation, project BU 2327/20-1. Y.K. acknowledges support from the German Research Foundation as visiting fellow within the priority programme Foundations of Deep Learning. Part of this study was carried out while S.K. and T.R. were visiting the California Institute of Technology, supported by the DAAD grant for project 57698811 'Bayesian Computations for Large-scale (Nonlinear) Inverse Problems in Imaging'. Y.K. acknowledges the support of the EPSRC (Fellowship EP/V003615/2 and Programme Grant EP/V026259/1). S.K. and Y.K. are grateful for the hospitality of the University of Bath during the workshop 'Machine Learning in Infinite Dimensions', sponsored by the ICMS, LMS, IMI Bath, ProbAI and Maths4DL, where part of this work was undertaken.
References
- 1. OpenAI . 2023. GPT-4 technical report. arXiv:2303.08774. ( 10.48550/arXiv.2303.08774) [DOI]
- 2. Wu J, Gan W, Chen Z, Wan S, Philip SY. 2023. Multimodal large language models: a survey. In 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, pp. 2247–2256. IEEE. ( 10.1109/BigData59044.2023.10386743) [DOI] [Google Scholar]
- 3. Fields C, Kennington C. 2023. Vision language transformers: a survey. arXiv:2307.03254. ( 10.48550/arXiv.2307.03254) [DOI]
- 4. Esser P, et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-First International Conference on Machine Learning. Vienna, Austria: PMLR. [Google Scholar]
- 5. Abramson J, et al. 2024. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630 , 493–500. ( 10.1038/s41586-024-07487-w) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Jumper J, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596 , 583–589. ( 10.1038/s41586-021-03819-2) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Vuckovic J, Baratin A, Combes RT. 2020. A mathematical theory of attention. arXiv 2007.02876. ( 10.48550/arXiv.2007.02876) [DOI] [Google Scholar]
- 8. Sander ME, Ablin P, Blondel M, Peyré G. 2022. Sinkformers: transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pp. 3515–3530. JMLR. [Google Scholar]
- 9. Geshkovski B, Letrouit C, Polyanskiy Y, Rigollet P. 2023. A mathematical perspective on transformers. arXiv 2312.10794. ( 10.48550/arXiv.2312.10794) [DOI] [Google Scholar]
- 10. Calvello E, Kovachki NB, Levine ME, Stuart AM. 2024. Continuum attention for neural operators. arXiv: 2406.06486. ( 10.48550/arXiv.2406.06486) [DOI] [Google Scholar]
- 11. Nguyen TM, Nguyen T, Ho N, Bertozzi AL, Baraniuk RG, Osher SJ. 2024. A primal-dual framework for transformers and neural networks. arXiv 2406.13781. ( 10.48550/arXiv.2406.13781) [DOI] [Google Scholar]
- 12. Wright MA, Gonzalez J. 2021. Transformers are deep infinite-dimensional non-mercer binary kernel machines. arXiv 2106.01506. ( 10.48550/arXiv.2106.01506) [DOI] [Google Scholar]
- 13. Criscitiello C, Rebjock Q, McRae AD, Boumal N. 2024. Synchronization on circles and spheres with nonlinear interactions. arXiv 2405.18273. ( 10.48550/arXiv.2405.18273) [DOI] [Google Scholar]
- 14. Alcalde A, Fantuzzi G, Zuazua E. 2024. Clustering in pure-attention hardmax transformers and its role in sentiment analysis. arXiv Preprint 2407.01602. ( 10.48550/arXiv.2407.01602) [DOI] [Google Scholar]
- 15. Geshkovski B, Rigollet P, Ruiz-Balet D. 2024. Measure-to-measure interpolation using transformers. arXiv Preprint 2411.04551. ( 10.48550/arXiv.2411.04551) [DOI] [Google Scholar]
- 16. Kan K, Li X, Osher S. 2025. OT-Transformer: a continuous-time transformer architecture with optimal transport regularization. arXiv Preprint 2501.18793. ( 10.48550/arXiv.2501.18793) [DOI] [Google Scholar]
- 17. Viswanathan K, Gardinazzi Y, Panerai G, Cazzaniga A, Biagetti M. 2025. The geometry of tokens in internal representations of large language models. arXiv Preprint 2501.10573. ( 10.48550/arXiv.2501.10573) [DOI] [Google Scholar]
- 18. Abella ÁR, Silvestre JP, Tabuada P. 2024. The asymptotic behavior of attention in transformers. arXiv Preprint 2412.02682. ( 10.48550/arXiv.2412.02682) [DOI] [Google Scholar]
- 19. Alcalde A, Fantuzzi G, Zuazua E. 2025. Exact sequence classification with hardmax transformers. arXiv Preprint 2502.02270. ( 10.48550/arXiv.2502.02270) [DOI] [Google Scholar]
- 20. Lu Y, Li Z, He D, Sun Z, Dong B, Qin T, Wang L, Liu T. 2020. Understanding and improving transformer from a multi-particle dynamic system point of view. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations. [Google Scholar]
- 21. Dutta S, Gautam T, Chakrabarti S, Chakraborty T. 2021. Redesigning the transformer architecture with insights from multi-particle dynamical systems. Adv. Neural Inf. Process. Syst. 34 , 5531–5544. [Google Scholar]
- 22. Chizat L, Bach F. 2018. On the global convergence of gradient descent for over-parameterized models using optimal transport. Adv. Neural Inf. Process. Syst. 31 , 3040–3050. [Google Scholar]
- 23. Ding Z, Chen S, Li Q, Wright S. 2021. On the global convergence of gradient descent for multi-layer resnets in the mean-field regime. arXiv 2110.02926. ( 10.48550/arXiv.2110.02926) [DOI] [Google Scholar]
- 24. Hegselmann R, Krause U. 2002. Opinion dynamics and bounded confidence models, analysis and simulation. J. Artif. Soc. Soc. Simulation 5 . [Google Scholar]
- 25. Gómez-Serrano J, Graham C, Le Boudec JY. 2012. The bounded confidence model of opinion dynamics. Math. Models Methods Appl. Sci. 22 , 1150007. ( 10.1142/s0218202511500072) [DOI] [Google Scholar]
- 26. Piccoli B, Rossi F. 2021. Generalized solutions to bounded-confidence models. Math. Models Methods Appl. Sci. 31 , 1237–1276. ( 10.1142/s0218202521400054) [DOI] [Google Scholar]
- 27. Bruno G, Pasqualotto F, Agazzi A. 2024. Emergence of meta-stable clustering in mean-field transformer models. arXiv Preprint 2410.23228. ( 10.48550/arXiv.2410.23228) [DOI] [Google Scholar]
- 28. Geshkovski B, Koubbi H, Polyanskiy Y, Rigollet P. 2024. Dynamic metastability in the self-attention model. arXiv Preprint 2410.06833. ( 10.48550/arXiv.2410.06833) [DOI] [Google Scholar]
- 29. Burger M, Erbar M, Hoffmann F, Matthes D, Schlichting A. 2025. Covariance-modulated optimal transport and gradient flows. Arch. Ration. Mech. Anal. 249 . ( 10.1007/s00205-024-02065-w) [DOI] [Google Scholar]
- 30. Duncan A, Nüsken N, Szpruch L. 2023. On the geometry of Stein variational gradient descent. J. Mach. Learn. Res. 24 , 1–39. [Google Scholar]
- 31. Li W. 2021. Hessian metric via transport information geometry. J. Math. Phys. 62 . ( 10.1063/5.0012605) [DOI] [Google Scholar]
- 32. Lisini S, Matthes D, Savaré G. 2012. Cahn–Hilliard and thin film equations with nonlinear mobility as gradient flows in weighted-Wasserstein metrics. J. Differ. Equ. 253 , 814–850. ( 10.1016/j.jde.2012.04.004) [DOI] [Google Scholar]
- 33. Burger M, Di Francesco M. 2008. Large time behavior of nonlocal aggregation models with nonlinear diffusion. Netw. Heterog. Media 3 , 749–785. ( 10.3934/nhm.2008.3.749) [DOI] [Google Scholar]
- 34. Cañizo JA, Ramos-Lora A. 2024. Discrete minimizers of the interaction energy in collective behavior: a brief numerical and analytic review. arXiv 2403.00594. ( 10.48550/arXiv.2403.00594) [DOI] [Google Scholar]
- 35. Carrillo JA, Chipot M, Huang Y. 2014. On global minimizers of repulsive–attractive power-law interaction energies. Phil. Trans. R. Soc. A 372 , 20130399. ( 10.1098/rsta.2013.0399) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Carrillo J, Figalli A, Patacchini SF. 2017. Geometry of minimizers for the interaction energy with mildly repulsive potentials. Ann. De L’IHP Anal. Non Linéaire 34 , 1299–1308. ( 10.1016/J.ANIHPC.2016.10.004) [DOI] [Google Scholar]
- 37. Shu R. 2024. Wasserstein-infinity stability and mean field limit of discrete interaction energy minimizers. arXiv 2407.18395. ( 10.48550/arXiv.2407.18395) [DOI] [Google Scholar]
- 38. Simione R, Slepčev D, Topaloglu I. 2015. Existence of ground states of nonlocal-interaction energies. J. Stat. Phys. 159 , 972–986. ( 10.1007/s10955-015-1215-z) [DOI] [Google Scholar]
- 39. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 . [Google Scholar]
- 40. Bahdanau D, Cho K, Bengio Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv 1409.0473. ( 10.48550/arXiv.1409.0473) [DOI] [Google Scholar]
- 41. Castin V, Ablin P, Peyré G. 2024. How smooth is attention? In Proceedings of the 41st International Conference on Machine Learning (eds Salakhutdinov R, Kolter Z, Heller K, Weller A, Oliver N, Scarlett J, Berkenkamp F), Proceedings of Machine Learning Research, vol. 235, pp. 5817–5840. Vienna, Austria: PMLR. [Google Scholar]
- 42. Castin V, Ablin P, Carrillo J, Peyré G. 2025. A unified perspective on the dynamics of deep transformers. arXiv Preprint 2501.18322. ( 10.48550/arXiv.2501.18322) [DOI] [Google Scholar]
- 43. Karagodin N, Polyanskiy Y, Rigollet P. 2024. Clustering in causal attention masking. arXiv Preprint 2411.04990. ( 10.48550/arXiv.2411.04990) [DOI] [Google Scholar]
- 44. Ioffe S, Szegedy C. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (eds Bach F, Blei D), vol. 37, pp. 448–456. Lille, France: PMLR. [Google Scholar]
- 45. Lei Ba J, Kiros JR, Hinton GE. 2016. Layer normalization. arXiv 1607.06450. [Google Scholar]
- 46. Touvron H, et al. 2023. Llama: open and efficient foundation language models. arXiv 2302.13971. ( 10.48550/arXiv.2302.13971) [DOI] [Google Scholar]
- 47. Zhang B, Sennrich R. 2019. Root mean square layer normalization. Adv. Neural Inf. Process. Syst. 32 , 12381–12392. [Google Scholar]
- 48. He K, Zhang X, Ren S, Sun J. 2016. Identity mappings in deep residual networks. In Computer vision – ECCV 2016 (eds Leibe B, Matas J, Sebe N, Welling M), pp. 630–645. Cham: Springer International Publishing. ( 10.1007/978-3-319-46493-0_38) [DOI] [Google Scholar]
- 49. Weinan E. 2017. A proposal on machine learning via dynamical systems. Commun. Math. Stat. 5 , 1–11. ( 10.1007/s40304-017-0103-z) [DOI] [Google Scholar]
- 50. Haber E, Ruthotto L. 2018. Stable architectures for deep neural networks. Inverse Probl. 34 , 20. ( 10.1088/1361-6420/aa9a90) [DOI] [Google Scholar]
- 51. Chen RT, Rubanova Y, Bettencourt J, Duvenaud DK. 2018. Neural ordinary differential equations. Adv. Neural Inf. Process. Syst. 31 , 6571–6583. [Google Scholar]
- 52. Thorpe M, van Gennip Y. 2023. Deep limits of residual neural networks. Res. Math. Sci. 10 , 6. ( 10.1007/s40687-022-00370-y) [DOI] [Google Scholar]
- 53. Ambrosio L, Gigli N, Savaré G. 2008. Gradient flows: in metric spaces and in the space of probability measures. Lectures in Mathematics ETH Zürich, 2nd edn. Basel, Switzerland: Birkhäuser. ( 10.1007/978-3-7643-8722-8) [DOI] [Google Scholar]
- 54. Benamou JD, Brenier Y. 2000. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numer. Math. 84 , 375–393. ( 10.1007/s002110050002) [DOI] [Google Scholar]
- 55. Deffuant G, Neau D, Amblard F, Weisbuch G. 2000. Mixing beliefs among interacting agents. Adv. Complex Syst. 03 , 87–98. ( 10.1142/s0219525900000078) [DOI] [Google Scholar]
- 56. Bilyk D, Matzke RW, Vlasiuk O. 2022. Positive definiteness and the Stolarsky invariance principle. J. Math. Anal. Appl. 513 , 126220. ( 10.1016/j.jmaa.2022.126220) [DOI] [Google Scholar]
- 57. Fasshauer GE. 2011. Positive definite kernels: past, present and future. In ’Kernel functionsand meshless methods’ dolomites research notes on approximation (eds Marchi S, Buhmann MD, Plonka-Hoch G). [Google Scholar]
- 58. Bilyk D, Dai F. 2016. Geodesic distance Riesz energy on the sphere. arXiv 1612.08442. ( 10.48550/arXiv.1612.08442) [DOI] [Google Scholar]
- 59. Burger M, Di Francesco M, Franek M. 2013. Stationary states of quadratic diffusion equations with long-range attraction. Commun. Math. Sci. 11 , 709–738. ( 10.4310/cms.2013.v11.n3.a3) [DOI] [Google Scholar]
- 60. Gómez-Castro D. 2024. Beginner’s guide to aggregation-diffusion equations. SeMA J. 1–57 ( 10.1007/s40324-024-00350-y) [DOI] [Google Scholar]
- 61. Rossum G, Drake FL Jr. 1995. Python tutorial. The Netherlands: Centrum voor Wiskunde en Informatica Amsterdam. [Google Scholar]
- 62. Harris CR, et al. 2020. Array programming with NumPy. Nature 585 , 357–362. ( 10.1038/s41586-020-2649-2) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63. Virtanen P, et al. 2020. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17 , 261–272. ( 10.1038/s41592-019-0686-2) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64. Paszke A, et al. 2019. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 , 8026–8037. [Google Scholar]
- 65. Marchuk G, Lebedev VI. 1986. Numerical methods in the theory of neutron transport. New York, NY, USA: Harwood Academic Publishers. [Google Scholar]
- 66. Kivinen J, Warmuth MK. 1997. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput. 132 , 1–63. ( 10.1006/inco.1996.2612) [DOI] [Google Scholar]
- 67. Lee JM. 2013. Introduction to smooth manifolds, pp. 1–31. New York, NY, USA: Springer New York. ( 10.1007/978-1-4419-9982-5_1) [DOI] [Google Scholar]
- 68. Ambrosio L, Fusco N, Pallara D. 2000. Functions of bounded variation and free discontinuity problems, pp. 116–210. Oxford: Oxford University Press. ( 10.1093/oso/9780198502456.003.0003) [DOI] [Google Scholar]
- 69. Dolbeault J, Nazaret B, Savaré G. 2009. A new class of transport distances between measures. Calc. Var. Partial Differ. Equ. 34 , 193–231. ( 10.1007/s00526-008-0182-5) [DOI] [Google Scholar]
- 70. Folland GB. 1999. Real analysis: modern techniques and their applications. Hoboken, NJ: John Wiley & Sons. [Google Scholar]
- 71. Evans LC. 2010. Partial differential equations, 2nd edn. Providence, RI: American Mathematical Society. ( 10.1090/gsm/019) [DOI] [Google Scholar]
- 72. Spivak M. 2018. Calculus on manifolds: a modern approach to classical theorems of advanced calculus. Boca Raton, FL: CRC press. [Google Scholar]
- 73. Blumenson LE. 1960. A derivation of n-dimensional spherical coordinates. Am. Math. Mon. 67 , 63–66. ( 10.2307/2308932) [DOI] [Google Scholar]
- 74. Golub GH, Van Loan CF. 2013. Matrix computations, 4th edn. Philadelphia, PA, USA: Johns Hopkins University Press. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
This article has no additional data.