Skip to main content
Philosophical transactions. Series A, Mathematical, physical, and engineering sciences logoLink to Philosophical transactions. Series A, Mathematical, physical, and engineering sciences
. 2025 Jun 5;383(2298):20240233. doi: 10.1098/rsta.2024.0233

Analysis of mean-field models arising from self-attention dynamics in transformer architectures with layer normalization

Martin Burger 1,2,, Samira Kabri 1, Yury Korolev 3, Tim Roith 1, Lukas Weigand 1
PMCID: PMC12152857  PMID: 40471030

Abstract

The aim of this article is to provide a mathematical analysis of transformer architectures using a self-attention mechanism with layer normalization. In particular, observed patterns in such architectures resembling either clusters or uniform distributions pose a number of challenging mathematical questions. We focus on a special case that admits a gradient flow formulation in the spaces of probability measures on the unit sphere under a special metric, which allows us to give at least partial answers in a rigorous way. The arising mathematical problems resemble those recently studied in aggregation equations but with additional challenges emerging from restricting the dynamics to the sphere and the particular form of the interaction energy. We provide a rigorous framework for studying the gradient flow, which also suggests a possible metric geometry to study the general case (i.e. one that is not described by a gradient flow). We further analyse the stationary points of the induced self-attention dynamics. The latter are related to stationary points of the interaction energy in the Wasserstein geometry, and we further discuss energy minimizers and maximizers in different parameter settings.

This article is part of the theme issue ‘Partial differential equations in data science’.

Keywords: transformer architectures, self-attention dynamics, gradient flows, interaction energies, stationary states

1. Introduction

Transformer architectures and the associated (self-)attention dynamics gained strong interest recently due to the success of artificial intelligence relying on them in several applications. Examples include large language models such as GPT-4 [1], multimodal large language models such as vision language transformers [2,3], text-to-image generation like Stable Diffusion [4] and protein folding with AlphaFold [5,6], which won the Nobel Prize in Chemistry in 2024.

The practical success of transformers and (self-)attention dynamics calls for developing detailed mathematical understanding which started recently in [719].

An interesting viewpoint on such dynamics is to interpret it as an interacting particle system [8,20,21], which allows for natural continuous-time and mean-field limits. The latter approach already provided valuable insights into feed-forward neural networks and their training dynamics (cf. [22,23]). In the context of transformers, this viewpoint also provides interesting (so far formal [9]) connections to gradient flows and the minimization of interaction energy for the particle measures. The latter is a topic of great recent interest due to various applications in biology and social interactions. Indeed, the self-attention dynamics in transformers share certain mathematical similarities with models used in opinion formation, which also exhibit similar emergence of clusters in certain cases [2426]. In this work, we focus on cluster formation in the infinite time horizon. However, we note that the formation of metastable states is of special interest. For the case of isotropic interaction, metastability was studied in [27,28].

In this article, we proceed with the work in [9] on analysing transformer dynamics with layer normalization, focusing in particular on the case when the underlying dynamics has a gradient flow structure. Indeed, the continuum limit of the self-attention dynamics leads to a Wasserstein-type gradient flow for probability measures on the unit sphere S of the form

tμt=S(μtmμtSE(μt)), (1.1)

where S and S are the tangential gradient and divergence, respectively, and mμ=1E(μ) is a non-local mobility. The underlying energy in this case is of the form

E(μ)=SSexDydμ(x)dμ(y), (1.2)

with D being a symmetric matrix and E denoting its first variation. Since D is symmetric and hence diagonalizable, we can equivalently assume that D is a diagonal matrix, since we can use an orthogonal diagonalization and a corresponding transfer of variables to the eigenvectors, which leaves the unit ball unchanged. This will be used in several instances to simplify notation. It also permits a more detailed study of stationary patterns, in particular minimizers and maximizers of the energy.

Compared to the existing literature on such gradient flows, there are three distinct features that motivate our study, namely:

  • restriction of the dynamics to the unit sphere (a consequence of the layer normalization);

  • non-local mobility (a consequence of the self-attention mechanism), which is related to but still distinctly different from other variations of Wasserstein gradient flows studied recently (cf. [2932]);

  • multiplicative coupling of states in the interaction energy, as opposed to commonly used interaction potentials depending only on the difference of the states (cf., e.g. [3338]).

We make the gradient flow, formally introduced in [9], rigorous, showing that the transport distance with non-local mobilities is well defined, studying energy dissipation properties of the associated gradient flow and describing the large-time behaviour of the dynamics, specifically the convergence to stationary solutions, at least along subsequences. We further carry out a detailed study of energy minimizers and maximizers of E (extending the previously studied case of D being a multiple of the identity) as well as stationary points of the energy in a Wasserstein setting, which we prove to be equivalent to stationary solutions of the dynamics. For the energy minimizers, we obtain an interesting picture depending on the structure of D :

  • If there is a positive eigenvalue that is the eigenvalue of maximal absolute value, then a Dirac delta concentrated in the direction of a corresponding eigenvalue is a maximizer.

  • If the smallest eigenvalue is negative, then only a Dirac delta concentrated in the direction of a corresponding eigenvalue is a minimizer.

  • If the smallest eigenvalue is zero, then any measure concentrated on the null space of D is a minimizer.

  • Dirac deltas concentrated in directions of arbitrary eigenvectors are stationary points. We also find some convex combinations of Dirac deltas being stationary points.

  • If the smallest eigenvalue is positive, we conjecture that the minimizer of the energy has full support on the unit sphere. To obtain some insight, we carry out a second-order asymptotic analysis of the minimizers for D being a small perturbation of the identity.

We support our theoretical findings with several computational experiments and investigate the cases when the energy minimizers or maximizers cannot be characterized explicitly.

The rest of this work is organized as follows. In the remainder of the introduction, we recapitulate the simplified softmax transformer model introduced in[8], with additional layer normalization as considered in [9]. In §2, we provide a rigorous derivation of the gradient flow induced by the considered model. Sections 3 and 4 are dedicated to characterizing optimizers or stationary points of the studied energy, respectively. We support our findings by numerical experiments in §5 and summarize our results in §6.

(a). Self-attention

Transformer architectures [39] were developed in the field of natural language processing. Here, the input is usually a sentence, which is decomposed into a sequence of tokens (e.g. words or syllables). Each token (possibly along with its position in the sentence) is represented as a vector in a high-dimensional vector space. Apart from a conventional feed-forward component, the main feature of a transformer layer is the so-called attention mechanism. This mechanism implements interactions between tokens and was first introduced in [40] in the context of neural machine translation as an alternative to encoder–decoder approaches, the performance of which often deteriorates for large input lengths due to the use of latent representations of fixed dimensions.

Like [9], we shall focus on a simple yet widely used form of attention, the so-called self-attention. It can be formalized as follows: consider an input sequence X=[Xi]i=1NN×n , where each Xin represents an n -dimensional token and N denotes the number of tokens. The self-attention matrix AN×N is given by

Aij=exp(XiDXj)k=1Nexp(XiDXk), (1.3)

where we assume Dn×n to be symmetric. The latter property does not necessarily hold for learned parameters in transformer architectures, but we expect the symmetric part to determine the asymptotic behaviour of the self-attention dynamics. Since the symmetry of D allows one to interpret the dynamics as a gradient flow corresponding to a certain interaction energy, as observed in [9], it will allow us to analyse the asymptotic behaviour for this subclass; the study of the general case is left for future research. An important example of non-symmetric interaction is given by masked attention, which can be used to model causality. We refer to [4143] for a mean-field interpretation of such dynamics.

By definition, the matrix A is stochastic, i.e. each of its rows is a probability vector. Roughly speaking, the attention matrix determines how strongly a token is influenced by each other token. To determine how tokens influence each other, another matrix Vn×n , called the value matrix, is used. The influence of Xj on Xi can then be written as AijVXj and the self-attention layer A:N×nN×n is given by

A(X)=[Xi+j=1NAijVXj]i=1N. (1.4)

For our purposes, we assume V=D or V=D since, in this case, one can show that the particles move along a gradient flow. The general case is the subject of future work.

(b). Normalization method

The normalization of intermediate values is a common practice in machine learning models. In the context of neural networks, so-called batch normalization [44] is a popular method to prevent gradients from blowing up and thus to stabilize (and to improve) the training. Since this form of normalization uses information from the entire training batch, [45] proposes layer normalization (LayerNorm), which translates the mean of an intermediate vector to zero and divides it by its standard deviation, and therefore does not depend on any other vector in the batch. While the original implementation of the transformer [39] uses LayerNorm, some of the more recent publications (e.g. Llama, [46]) use a simplified version called Root Mean Square Layer Normalization (RMSNorm) proposed in [47]. Up to a multiplication with learned weights [gi]i=1n , called gain parameters, RMSNorm performs a projection on to the unit sphere Sn1 (where in the following, we shall suppress the superscript and simply write S ). More precisely, for xn we write

RMSNorm(x)i=gixix2,

where, in practice, a division by zero is circumvented by adding a small value ϵ>0 into x2 . In our setting, we can assume the norm to be strictly positive as we consider the dynamics in continuous time. Following the setting of [9], we focus on RMSNorm with fixed gain parameters gi=1 for all i=1,n and denote the projection on to the unit sphere for xn{0} by

Π(x)=xx2.

(c). Simplified transformer layer and time-continuous dynamics

Combining the attention layer with a normalization layer, we arrive at the following update step:

XΠ(A(X)),

where the projection is applied vector-wise to each row of A(X) . For the sake of our analysis, we shall deviate from typical practical implementations of transformers and consider the architecture to be a composition of such layers which all share the same matrices D and V in equations (1.3) and (1.4). In [9], it was proposed to study the continuum limit of these updates. This approach has become a popular tool for analyzing residual neural networks [48]: as discussed from various perspectives, e.g. in [4952], the skip connections (i.e. the residual components) of the residual neural network architecture make it possible to interpret it as a forward Euler discretization of an ordinary differential equation. Introducing a time variable t>0 and a small time increment Δt>0 , we get

Xi(t+Δt)=Π(Xi(t)+Δtj=1NAij(t)VXj(t)),i=1,,N. (1.5)

At this point, the residual component is hidden in the attention layer and cannot easily be extracted since the projection is nonlinear. In the continuous time limit Δt0 , remembering that Π(x)=x for any xS , we arrive at the following system of differential equations:

(1.6)X˙i(t)=xΠ(Xi(t)),j=1NAij(t)VXj(t),i=1,,N,

where the spatial derivatives are understood as derivatives in n . With a simple computation, one can further show that for any xS and zn it holds that

xΠ(x),z=Px(z),

where, following [9], we define Px(z)=zxzx . Substituting this into equation (1.6), we arrive at the following dynamics:

{X˙i(t)=PXi(t)(j=1NAij(t)VXj(t)),(1.7a)Xi(0)=X0,iS,(1.7b)

which serve as a starting point of [9].

(d). Interpretation as an evolution of measures

Instead of studying the dynamics of distinct particles, [9] propose to view equation (1.7) as an evolution of an empirical measure

μt=1Ni=1NδXi(t).

The right-hand side of equation (1.7a) can be understood as an integral with respect to μt ; for a generic probability measure μ , this can be written as a measure-dependent velocity field:

V[μ](x)=Px(SexDyVydμ(y))SexDydμ(y), (1.8)

and equation (1.7a) turns into Xi˙(t)=V[μt](Xi(t)) . With this notion, we recover the weak continuity equation formulated in [9]: for any test function φC1(S×[0,T]) , one has

ddtSϕ(t,x)dμt(x)=ddt1Ni=1Nϕ(t,Xi(t))=1Ni=1Ntϕ(t,Xi(t))+xϕ(t,Xi(t)),V[μt](Xi(t))(1.9)=Stϕ(t,x)+xϕ(t,x),V[μt](x)dμt(x),

where, in this case, the spatial derivatives of φ have to be understood as derivatives on S .

Similarly, Geshkovski et al. [9] propose the interaction energy in equation (1.2), which for an empirical measure μt reduces to

E(μt)=i,j=1NeXi(t)DXj(t).

In this discrete case, a straightforward application of the chain rule and a reordering of the terms yields

ddtE(μt)=2i=1N(j=1NeXi(t)DXj(t)DXj(t))X˙i(t).

Under our assumption that the value matrix is given by V=±D , we see that, up to an application of PXi(t) and a division by j=1NeXi(t)DXj(t) , the term in the brackets is given by Xi˙(t) . Since Px(z)z=Px(z)Px(z) for any xS , zn , we have that

ddtE(μt)=±2i=1NX˙i(t)2j=1NeXi(t)DXj(t)0,

and hence the energy E increases ( V=D ) or decreases ( V=D ) monotonously along the trajectory of μt . A formal derivation of the above formulae for general probability measures on smooth manifolds is provided in §2.

Let us mention that problems with similar energies as E have been studied in the past. The most prominent is an interaction energy with respect to D with a non-local interaction kernel depending on xy . Choosing the kernel as Gaussian with covariance matrix D1 (which makes sense only if D is positive definite) results in

(1.10)Einter(μ)=SSe12(xy)(D(xy))dμ(x)dμ(y).

For D=±Id , the minimizers and maximizers of the expressions in equations (1.2) and (1.10) are equivalent as 12(xy)(xy)=12(xx+yy)±xy=1+±xy for all x,yS . The important difference between equations (1.2) and (1.10) is the rotation-(in)variance of the interaction functions ex(Dy) and e12(xy)(D(xy)) . In the general case, this is not true, but we shall use an analogy to the interaction energy to rewrite

E(μ)=eλSSeλ2|x|2λ2|y|2+xDydμ(x)dμ(y)=eλSSeλ2|xy|2+x((DλId)y)dμ(x)dμ(y).

(e). Understanding xDy on the sphere

For our further analysis, it is crucial to understand the implications of restricting the problem to the unit sphere and the behaviour of the bilinear form xDy on it. For D=Id , it is clear that the minimizer of fy(x)=xDy is given by x=y and the maximizer by x=y . This changes for a general D and as a result, the minimizer of the energy in equation (1.2) is not given by the uniform distribution on S anymore. For a diagonal matrix D , the maximizer/minimizer of fy for a fixed yS with Dy0 is given by x±=±DyDy . Therefore, we know that xDy=0 if and only if xx±=0 (same for > and < ). For Dy=0 , we already have fy(x)=0 for any xS , i.e. each point is a minimizer, maximizer and orthogonal to y w.r.t. D . A further consequence is that

maxx,ySxDy=maxySDyDyDy=maxySDy=|λ|,

where λ denotes the eigenvalue of maximum absolute value of D . We further note that all of the following results on minimizers/maximizers as well as stationary points of ED can be generalized to probability measures concentrated on an ellipsoid instead of a sphere. To see this, we consider the ellipsoid

CS={xn:C1x=1},

where Cn×n is invertible, and the corresponding energy

EDC(μ)=CSCSexDydμ(x)dμ(y).

Since C is invertible, any measure μ is uniquely determined by the pushforward measure ν=C#1μ , as μ=C#ν . Thus, we can rewrite the energy as

EDC(μ)=SSeCxDCydν(x)dν(y)=ECTDC(ν),

and equivalently optimize the energy ECTDC on the sphere. A special case that leads to measures concentrated on an ellipsoid corresponds to RMSNorm normalization with non-vanishing gain parameters gi0 . In this case, the ellipsoid is given by GS , where G is a diagonal matrix with entries [gi]i=1n .

2. Gradient flow

As shown above, the particle dynamics can be ‘lifted’ by the use of empirical measures to the space of probability measures P(S) over the sphere. As mentioned in [9, Remark 3.3], for arbitrary probability measures, the connection between the partial dynamics and a corresponding continuity equation can be made by a mean field limit approach. Hence, instead of the particle dynamics, one can study the continuity equation:

tμ+div(V[μ]μ)=0on [0,T]×S,μ|t=0=μ(0)on S, (2.1)

with the velocity field given by equation (1.8), which holds in the sense of distributions. Note that, in this section, we scale the energy by a factor of 1/2 to be consistent with [9]. It was remarked in [9, ch. 3.3] that for V=±D , the energy,

E(μ)=±12SSexDydμ(x)dμ(y),

is monotonic along these dynamics, and the partial differential equation (2.1) can be interpreted as a gradient flow for a modified optimal transport distance. However, as the authors of [9] acknowledge, there is a gap in the literature that prevents them from making this observation rigorous.

In this section, we aim to close this gap. We show that P(S) equipped with this new distance is a geodesic space with properties similar to the classical 2 -Wasserstein space and prove that solutions of equation (2.1) are curves of maximal slope of E with respect to this distance and thus satisfy the energy dissipation equality

ddtE(μt)=SSexDydμt(y)|V[μt](x)|2dμt(x)for a.e. t.

Finally, we study the long-time behaviour of the dynamics and show that subsequences of the flow converge to stationary points of the energy E .

Let us mention that the basic analysis of this section related to the novel transport distance can be generalized in a rather straightforward way to the more general case of D being non-symmetric and can thus provide the basis for future analysis of the non-gradient flow case with V arbitrary and D non-symmetric.

(a). Continuity equation on manifolds

Let M be a compact n -dimensional Riemannian manifold without a boundary, e.g. the sphere Sn . The tangent bundle TM=xMTxM is given by the disjoint union of all tangent spaces of all xM . We denote by P(M) the space of Borel probability measures on M , equipped with the standard narrow topology (e.g. [53, ch. 5.1]). The symbol is used to indicate convergence in this topology. Let I=(0,T) be an open interval, μ:tμtP(M) a narrowly continuous curve and V:(x,t)M×Ivt(x)TM a Borel velocity field such that 0TM|vt(x)|dμtdt< . The continuity equation holds in the sense of distributions if

(0,T)Mtφ(x,t)+Dφ(x,t),vt(x)dμtdt=0,φCc1(M×(0,T)). (2.2)

Here, D denotes the differential on the manifold M . Sometimes, we shall use Dx to clarify with respect to which variable the differential is taken. We define the set of solutions to the continuity equation as follows:

CE(0,T):={(μ,v):μ:IP(M) is narrowly continuous,0TM|vt(x)|dμtdt<,(μ,v) satisfy the continuity equation,}

Furthermore, we define CE(0,T;νη) as the subset (μ,v) such that μ0=ν , μT=η . For more details, we refer to appendix A(a).

(b). Distance

To interpret equation (2.1) as a gradient flow on P(M) , we need to modify the well-known dynamic formulation of the 2 -Wasserstein distance [54] and introduce the following mobility:

mμ(x)=MK(x,y)dμ(y).

With this, the modified transport distance between μ0,μ1P(M) is defined as follows (see [9, Section 3.4.2]):

Wm,22(μ0,μ1)=inf{01Mmμt(x)|vt(x)|2dμt(x)dt:(μ,v)CE(0,1;μ0μ1)}. (2.3)

For K1 , we recover the classical 2 -Wasserstein distance. The dynamic (2.1) corresponds to the kernel K(x,y)=exDy , but for the sake of generality, we carry out the analysis for a more general class of kernels K .

Assumption 1. The kernel K(x,y)C(M×M) is continuous, and there exists a constant C>0 such that K(x,y)C for all x,yM .

Remark 2.1. The assumption that K is bounded from below is vital for our analysis and covers the cases of interest in this article. Nonetheless, it would be interesting to see whether this assumption can be relaxed. For example, instead of a compact manifold M , we could consider d as the underlying space and take K to be a Gaussian or a bounded confidence kernel K(x,y)=1|xy|1 as studied in [ 55 ].

As the next theorem shows, the infimum in equation (2.3) is actually attained by some (μ,v)CE(0,1;μ0μ1) . The proof can be found in appendix A(b).

Theorem 2.2 (Existence of minimizers). For every pair μ0,μ1P(M) with Wm,2(μ0,μ1)<+ , there exists a couple (μ,v)CE(0,1) such that

Wm,22(μ0,μ1)=01Mmμt(x)|vt(x)|2dμt(x)dt.

Furthermore, such minimizers can be equivalently characterized as those of

(2.4)Wm,2(μ0,μt)=inf{0T(Mmμt(x)|vt(x)|2 dμt(x))12 dt:(μ,v)CE(0,T;μ0μT)}.

Using the theorem above, it is easy to show that Wm,2 is a distance on P(M) .

Theorem 2.3. The space P(M) equipped with Wm,2 is a complete metric space and its topology is equivalent to the one induced by the 2 -Wasserstein distance which, since M is compact, is equivalent to the topology of narrow convergence.

Proof. First, we check that Wm,2 is a distance. Indeed, (i) symmetry follows from simply rescaling time by t~:t[0,T]Tt[0,T] ; (ii) definiteness: Since mμt is bounded from below, Wm,2(μ,ν)=0 implies that vt=0 for μ -a.e. (p,t)M×(0,T) . Thus by equation (A 3) μ=ν ; (iii) the triangle inequality follows from the characterization in equation (2.4) and the gluing property from proposition A.1. To show the equivalence of the distances, we observe that by assumption 1, K(x,y)C and since M×M is compact and K(x,y) is continuous, we can also find a C~ such that K(x,y)C~ . This implies that

1CW2(μ,ν)Wm,2(μ,ν)C~W2(μ,ν)<+μ,νP(M),

and the distances are equivalent. Since (P(M),W2) is complete, (P(M),Wm,2) has to be complete as well.∎

Let us recall that in a general complete metric space (X,d) , a curve γ:[0,T]X is called absolutely continuous if there exists a function mL1(0,T) such that

d(γs,γr)srm(t)dts,r[0,T] with sr. (2.5)

For an absolutely continuous curve γ(t) , its metric derivative is defined by

|γ˙|(t):=limh0d(γt+h,γt)h,

and it exists for a.e. t(0,T) . It can be shown that |γ˙| is minimal in the sense that for all m(t) satisfying equation (2.5), it holds that |γ˙|(t)m(t) for a.e. t(0,T) . The next lemma, which is proven in appendix A(c), characterizes absolutely continuous curves in (P(M),Wm,2) .

Lemma 2.4. Let μt be an absolutely continuous curve w.r.t. W2,m . Then there exists a Borel velocity field (vt)t(0,T) such that (μ,v)CE(0,T) and

(Mmμt(x)|vt(x)|2dμt(x))1/2=|μ˙|(t)fora.e.t(0,T).

Conversely, if (μ,v)CE(0,T) and 0T(Mmμt|vt|2dμt)1/2dt<+ then tμt is absolutely continuous and

|μ˙|(t)(Mmμt(x)|vt(x)|2dμt(x))1/2fora.e.t(0,T).

A metric space is called a length space if

d(x,y)=inf01|γ˙|(t)dt,

where the infimum is taken over all absolutely continuous curves γ:[0,1]X with γ(0)=x and γ(1)=y . If this infimum is obtained by a minimal curve, also called geodesic, we say that (X,d) is a geodesic space. As it turns out, the minimal curves obtained in theorem 2.2 are such geodesics. This can be immediately deduced from equation (A 9) and the definition of the metric velocity,

Corollary 2.5. The space (P(M),Wm,2) is a geodesic space.

(c). Gradient flows of the interaction energy

Let W(x,y)C1(M×M) be a symmetric interaction kernel. The interaction energy is given by

E(μ)12M×MW(x,y)dμ(x)dμ(y).

Let us consider the following inverse duality map:

J2:xTMp|x|arg maxyTMp:|y|=1x(y).

Since all tangent spaces are finite-dimensional, this map is well defined. The application of J2 to a 1-form on M (in particular, a differential of a function) yields a velocity field on M . Below we show that gradient flows of the energy E with respect to the metric Wm,2 are given by weak solutions to PDEs of the form

(2.6)tμ+div(1mμJ2(DW[μ])μ)=0,

where W[μ](x)=MW(x,y)dμ(y) . For M=S , K(x,y)=exDy and W(x,y)=±exDy equation (2.6) corresponds precisely to equation (2.1) if V=±D . The sole difference between equation (2.6) and classical Wasserstein gradient flows is the presence of the factor 1mμ . It arises since the modified transport distance punishes the movement of particles with a high mobility mμ(x) . When we interpret K(x,y) as an interaction kernel between particles, those particles interacting strongly with others are slowed down, while particles with low interaction are sped up.

Lemma 2.6 (Chain rule). Let tμt be an absolutely continuous curve in W2,m . Then tE(μt) is absolutely continuous and

(2.7)ddtE(μt)=MDW[μt](x),vt(x)dμt(x)fora.et(0,T).

Proof. Let us consider an absolutely continuous curve (μ,v)CE(0,1;μν) and the function η:(x,t)M×[0,T]12MW(x,y)dμt(y) . In the case when ηC1(M×[0,T]) , we could use it as a test function in equation (A 3) and immediately obtain

E(μT)E(μ0)=Mη(x,T)dμT(x)Mη(x,0)dμ0(x)=0TMtη(x,t)dμt(x)+MDη(t,x),vt(x)dμt(x)dt=0TMMDxW(x,y),vt(x)dμt(y)dμt(x)dt<+.

The finiteness follows from the fact that we can bound |DxW(x,y)| uniformly on M×M . In the general case, we have to use a rather lengthy time mollification argument, see appendix A(d).∎

Equation (2.7) is reminiscent of the classical chain rule ddtF(x(t))=F(x(t))x˙(t) for a function F:d and a curve x:[0,T]d . The velocity field vt can be viewed as the ‘derivative’ of the curve μt , while DW[μt] is the corresponding ‘gradient’ of the interaction energy. Using this chain rule, we can estimate how fast the energy can decrease along a curve μt . Therefore, curves reaching this bound dissipate the energy as fast as possible and satisfy the so-called energy dissipation equality.

Lemma 2.7. For any absolutely continuous w.r.t. W2,m curve (μt)t(0,T) , we have that

E(μT)E(μ0)+120TMmμt|vt|2dμtdt+120TM1mμt|DW[μt]|2dμtdt0. (2.8)

Moreover, we have equality if and only if (μt)t(0,T) is a weak solution to equation (2.6) .

Proof. We can estimate the right-hand side of equation (2.7) by Hölder’s and Young’s inequalities:

MDW[μt](vt)dμtMmμt|vt|2dμtM1mμt|DW[μt]|2dμt12Mmμt|vt|2dμt12M1mμt|DW[μt]|2dμt.

Integrating both sides of equation (2.7) from 0 to T, we obtain equation (2.8). Moreover, equality holds if and only if for a.e. t and μt -a.e. we have vt=1mμtJ2(DW[μt] ). Hence, μt is a weak solution to equation (2.6).∎

(d). Metric gradient flows

Let us put the previous calculations into the context of curves of maximal slope [53, ch. 1], which can be viewed as a way to generalize gradient flows to general metric spaces. We assume (X,d) to be a complete metric space. Let E:X . A function g:X[0,+] is called a strong upper gradient of E if for any absolutely continuous curve x:[0,T]X the concatenation gx is Borel and

|E(x(t))E(x(s))|stg(x(r))|x˙|(r)dr0stT.

If E(x(t)) is non-increasing in t then the application of Young’s inequality yields

E(x(t))E(x(s))+12stg(x(r))2+|x˙|(r)2dr00stT.

This observation allows us to define curves of maximal slope as those that decrease the energy as fast as possible.

Definition 2.8 (Curve of maximal slope). An absolutely continuous curve x:[0,T]X is called a curve of maximal slope of E with respect to its strong upper gradient g if tE(x(t)) is non-increasing and

E(x(t))E(x(s))+12stg(x(r))2+|x˙|(r)2dr00stT.

Lemma 2.9. The map

g:μM1mμ|DW[μt]|2dμ

is a strong upper gradient of E and solutions of equation (2.6) coincide with curves of maximal slope of E with respect to the strong upper gradient g .

Proof. For an absolutely continuous w.r.t. W2,m curve μt , we can find, by lemma 2.4, a velocity field (vt)t(0,T) such that (μ,v)CE(0,T) and

(Mmμt|vt|2dμt)1/2=|μ˙|(t)for a.e. t(0,T).

Then, the chain rule, lemma 2.6 yields

|E(μt)E(μs)|st|DW[μt],vr|drstg(μr)|μ˙|(r)dr,

and g is a strong upper gradient. The coincidence of solutions of equation (2.6) and curves of maximal slope follows from lemma 2.7.∎

(e). Energy dissipation and large-time behaviour

Due to the missing geodesic convexity properties of the energy, we cannot expect convergence of the evolution to a unique minimizer in the large time limit. However, we can obtain some weaker results by further analysing the energy dissipation property:

E(μt)+120tMmμs(x)|E(μs)|2dμs(x)dsE(μ0). (2.9)

As s , we can pick narrowly convergent subsequences of μs (i.e. converging weakly star in the Banach space of Radon measures). Moreover, the entropy dissipation inequality above implies

0Mmμs(x)|E(μs)|2dμs(x)ds<,

hence, along suitable subsequences, the entropy dissipation,

D(s)=Mmμs(x)|E(μs)|2dμs(x),

converges to zero since it is non-negative and bounded. To establish the existence of subsequences converging to stationary solutions, we need to identify the limit in suitable spaces. Under appropriate regularity assumptions on the interaction kernel W (satisfied, for example, for the exponential kernel), this is a direct consequence of the Arzelà–Ascoli theorem.

Lemma 2.10. Let M be a compact manifold without a boundary, WC1,α(M×M) for some α>0 and symmetric. Moreover, let μn be a sequence of probability measures on M . Then the sequences

mμn=MW(,y)dμn(y)andE(μn)=MxW(,y)dμn(y)

have uniformly convergent subsequences. If μn converges narrowly to μ , then mμn converges uniformly to mμ and E(μn) converges uniformly to E(μ).

Lemma 2.10 combined with the entropy dissipation inequality (2.9) yields the following result.

Corollary 2.11. Let M be a compact manifold without a boundary, WC1,α(M×M) for some α>0 and symmetric. Then each weak solution μt of equation (2.1) with the velocity field given by equation (1.8) has a narrowly convergent subsequence μtn as tn , the limit of which is a stationary solution.

The following example connects the general results of this section with the transformer dynamics.

Example 2.12. The transformer dynamics for a finite number of particles described by equation (1.7) with V=±D correspond to the choice M=S , K(x,y)=exDy and W(x,y)=±exDy . As discussed in §1d, the corresponding empirical measures μt fulfil the continuity equation (1.9). Thus, they solve equation (2.1) in the weak sense with the velocity field given by equation (1.8), and all requirements of corollary 2.11 are fulfilled. Therefore, there exists a subsequence of μt that converges narrowly to a stationary solution of the interaction energy ED defined in equation (1.2).

This section establishes the relation between the particle model in equation (1.7) and gradient flows of interaction energies for the special cases V=±D . The energy dissipation property equation (2.8) and convergence property from corollary 2.11 motivate the study of stationary solutions of the energy ED , which we carry out in §§3 and 4. We shall start with minimizers and maximizers.

3. Explicit energy minimizers and maximizers

In this section, we compute explicit minimizers and maximizers of the energy ED (from equation (1.2), i.e. without the factor 1/2 ) in different scenarios, depending on the properties of the interaction matrix D . We make the dependence on the matrix D explicit by employing it as a subscript of the energy. The case D=Id has already been covered in [9, Proposition 3.4], where it is stated that a measure is a maximizer if and only if it is a Dirac delta placed at any point on the sphere, and a minimizer if and only if it is the uniform distribution. As we show below, for more general matrices, the position of optimal Diracs depends strongly on the eigenvalues of the matrix D . We further derive a symmetry condition for minimizers of energies with a positive definite interaction matrix D . This property yields an alternative, simpler proof that the uniform distribution is the only minimizer for D=Id .

(a). Maximal eigenvalue and related maximizers or minimizers

Like for D=Id , there are several cases in which the minimizers or maximizers of the energy ED are given by Diracs concentrated at a single point. We start with the maximizers when the largest eigenvalue of D is also an eigenvalue of the largest absolute value (or, respectively, minimizers when the smallest eigenvalue of D is also an eigenvalue of the largest absolute value).

Theorem 3.1. Let λ be an eigenvalue of maximal absolute value of D and ZλS the set of associated normalized eigenvectors. If λ>0 then μ=δz with zZλ are the only maximizers of the energy ED . If λ<0 then μ=δz with zZλ are the only minimizers.

Proof. We consider the case λ>0 ; the case λ<0 can be treated similarly. For all x,yS , we have exDyeλ with equality if and only if x=y=±z . Thus,

ED(μ)=SSexDydμ(x)dμ(y)SSeλdμ(x)dμ(y)=eλ=ED(μ),

where the inequality is strict if μ is not concentrated on an eigenvector associated with λ .∎

An example of the above setting is maximizing the energy for D=Id [9, Proposition 3.4], where the authors make a connection between the existence of concentrated maximizers and the so-called mode collapse of transformers often observed in practice. For a positive definite DId , theorem 3.1 shows that the set of maximizers is not only restricted to Dirac measures, but that it is actually finite. We summarize this insight in the following example and refer to §5a for an illustrating numerical example.

Example 3.2. If D=Id then μ=δz is a maximizer of the energy EId for any zS . Similarly, for D=Id , μ=δz is a minimizer for any zS . If DId is positive definite then μ=δz is a maximizer of ED only if Dz=λz and λ is the largest eigenvalue of D . Similarly, for a negative definite DId , μ=δz is a minimizer only if Dz=λz and λ is the smallest eigenvalue of D .

In the remainder of this section, we study minimizers for matrices that do not fulfil the conditions of theorem 3.1.

(b). Minimizers for indefinite matrices

We now generalize the statement in theorem 3.1 to minimizers of energies where the matrix D has at least one non-positive eigenvalue. In particular, we do not assume that the smallest eigenvalue is the eigenvalue of maximal absolute value. A key property is the following result that gives a lower bound on the energy in terms of the smallest eigenvalue of D .

Lemma 3.3. Let x¯ be the expected value of x under μ , i.e. x¯:=Sxdμ(x) . Then

ED(μ)ex¯Dx¯. (3.1)

If D is not positive definite and λmin is its smallest eigenvalue, it further holds that

(3.2)ED(μ)eλmin.

Proof. We use the convexity of exponential functions of the form xexa and yeby for arbitrary a,bn , which, with two applications of Jensen’s inequality, implies

ED(μ)=SSexDydμ(y)dμ(y)SexDx¯dμ(x)ex¯Dx¯. (3.3)

Since, further, x¯Dx¯λminx¯2 and 0x¯1 , the monotonicity of the exponential function gives us

ED(μ)emin{λmin,0}.

If D is not positive definite, we know that λmin0 and the above inequality reduces to inequality (3.2).∎

A direct consequence of lemma 3.3 for indefinite matrices is that a Dirac measure that is concentrated on an eigenvector corresponding to the smallest eigenvalue is a minimizer of the energy. If the smallest eigenvalue is negative, we can even show that all minimizers are of this form. In the case of a vanishing smallest eigenvalue, it is necessary and sufficient that the measure is concentrated on the null space of D .

Theorem 3.4. Consider a matrix D that is not positive definite with the smallest eigenvalue λmin0 . If λmin<0 , a measure minimizes the energy if and only if it is a Dirac measure placed at an eigenvector corresponding to λmin . If λmin=0 , a measure minimizes the energy if and only if it is concentrated on the null space of D .

Proof. We first assume λmin<0 . It follows directly from equation (3.2) that every Dirac measure concentrated on an eigenvector corresponding to λmin is a minimizer. We further see that x¯Dx¯=λmin if only if x¯ is an eigenvector corresponding to λmin and x¯=1 . This can only hold for Dirac measures. Thus, there are no other minimizers.

For λmin=0 , it also follows directly from equation (3.2) that every measure concentrated on the null space of D minimizes the energy. However, x¯Dx¯=λmin holds for all measures that fulfil x¯=0 . Still, the estimate in equation (3.3), obtained using Jensen’s inequality, is only an equality if xDx=x¯Dx¯=0 for μ -a.e. xS . Therefore, all minimizers are concentrated on the null space of D .∎

Remark 3.5. In general, theorem 3.4 does not transfer to maximizers for matrices D that are not negative definite. To see this, consider D with the largest eigenvalue λmax0 , the smallest eigenvalue λmin<0 and corresponding eigenvectors zmin and zmax . If further eλmax<cosh(λmin) , it holds that

ED(δzmax)=eλmax<cosh(λmin)=ED(δzmin+δzmin2)

and thus, δzmax is not a maximizer. In the special case λmax=0 , the above inequality holds for all measures concentrated on the null space of D and all λmin<0 .

At this point, we further note that the above strategy does not work for analysing minimizers for positive definite interaction matrices D . In this case, lemma 3.3 not only gives us ED(μ)e0=1 , but also xDx>0 for all xS , so the inequality is strict for all measures μP(S) .

(c). Symmetry property for positive definite matrices

The remainder of this section gives the first characterization of minimizers of the energy when the interaction matrix is positive definite. More precisely, we can show that, in this case, all minimizers are symmetric, and the symmetry axes are determined by the eigenvectors of D . The first step towards this is to show that the energy ED is strictly convex if D is positive definite.

Lemma 3.6. If D is positive semi-definite (resp. positive definite) then ED is convex (resp. strictly convex).

Proof. Since ED is quadratic, convexity (resp. strict convexity) follows from the non-negativity (resp. positivity) of the quadratic form:

F(μ)=SSexDydμ(x)dμ(y),

for arbitrary signed Radon measures μ , e.g. [56, Proposition 2.11]. For D positive semi-definite, there exists a unique positive semi-definite matrix square root D1/2 and we can use the transformation T(x)=D1/2x . We denote by T#μ the pushforward of μ by T , so that

F(μ)=T(S)T(S)exydT#μ(x)dT#μ(y)=T(S)T(S)e12|xy|2e12|x|2dT#μ(x)e12|y|2dT#μ(y).

Let dη=e12|x|2dT#μ(x) , then

F(μ)=T(S)T(S)e12|xy|2dη(x)dη(y).

The fact that the Gaussian kernel is positive definite (e.g. [57]) yields that F(μ)>0 unless ν vanishes. This can only happen if μ=0 or, in the case of a semi-definite matrix D , if μ is concentrated on the null space N(D) and μ(N(D))=0 . This yields the assertion.∎

Remark 3.7. The previous convexity result does not guarantee the convergence of the gradient flow in ( equation (2.6) ) to a global minimizer of F . For such results, usually a slightly different notion of convexity is required, the so-called geodesic convexity. The following example shows that besides the case of D being a multiple of the identity, we do not have geodesic convexity for the classical 2 -Wasserstein distance. We do not expect any improvements for our modified optimal transport distance.

Example 3.8. We consider a simple counterexample in S1 (equipped with the spherical distance) to show that F is not convex along 2 -Wasserstein geodesics. Choose

D=(2001)and the curveγ:t[0,1](cos(π4+tπ2)sin(π4+tπ2)).

Then μtδγ(t) is a constant-speed geodesic in the 2 -Wasserstein space connecting δγ(0) and δγ(1) . Clearly, the map [0,1]tF(γ(t)) is not convex, since

F(γ(0))=F(γ(1))=e1.5<e2=F(γ(12)).

Such a counterexample can always be constructed as long as D has two different eigenvectors. Lemma 3.6 does not contradict this counterexample, however, as it only implies the convexity of

[0,1]tF((1t)μ0+tμ1).

Having established convexity, we can show that reflecting a measure along the eigenvectors of D and then normalizing it does not increase the energy. Moreover, if D is positive definite and μ is not symmetric with respect to all eigenvectors of D , one can always construct a symmetric measure with a smaller energy.

Lemma 3.9. Let z be an eigenvector related to an eigenvalue λ of a positive semi-definite matrix D . For a measure μ , we define μ~ as

μ~:=12(μ+Hz#μ),Hz(x)=x2(xz)z,

where Hz denotes a reflection. Then, ED(μ~)ED(μ) and the inequality is strict if D is positive definite and μ~μ .

Proof. Since exDy=eHz(x)DHz(y) , it is straightforward to see that ED(μ)=ED(Hz#μ) . The (strict) convexity of the energy yields the assertion.∎

As a direct consequence, we obtain a symmetry property of minimizers for positive definite D .

Corollary 3.10. If D is positive definite then each minimizer is symmetric with respect to its eigenvectors.

If D is a positive multiple of the identity, one can easily show using the above result that the uniform distribution is the unique energy minimizer. This has been shown already in [9, Proposition 3.4] using properties of Gegenbauer polynomials [58, Proposition 2.2]. The symmetry property from corollary 3.10 gives an alternative—and straightforward—proof of this fact.

Proposition 3.11. If D=λId for λ>0 then the uniform distribution is the unique energy minimizer.

Proof. If μ is not uniform, we can find a unit vector z such that with Hz as in lemma 3.9, we have

μ~=12(μ+Hz#μ)μ.

However, for D=λId , every unit vector is an eigenvector and lemma 3.9 implies that ED(μ~)<ED(μ) . Hence, the uniform distribution is the only minimizer of the energy.∎

Remark 3.12. The statement in proposition 3.11 does not transfer to maximizers for negative multiples of the identity. To see this, consider D=λId with λ<0 and let μ0 denote the uniform distribution on S . The symmetry of μ0 yields

ED(μ0)=2S+S+eλxy+eλxy dμ0(x)dμ0(y)=4S+S+cosh(λxy)dμ0(x)dμ0(y),

where S+:={xS:x1>0} . Since |xy|<1 μ0×μ0 -almost everywhere on S+×S+ the integrand can be strictly bounded from above by 4cosh(λ) . Since μ0(S+)=1/2 it follows that

ED(μ0)<cosh(λ)=ED(1/2(δz+δz)),

with zS . Therefore, μ0 cannot be a maximizer of ED .

Remark 3.13. The above argument can be used to show that for arbitrary D , one has

ED(μ)ED(δz+δz2)

for all symmetric measures μ if and only if z is an eigenvector that corresponds to the eigenvalue of the largest absolute value. In the upcoming section, we use this insight to show that such measures are maximizers of ED for negative semi-definite D .

If D has non-positive eigenvalues, theorems 3.1 and 3.4 still show that all minimizers are invariant with respect to reflections Hz , where z corresponds to a positive eigenvalue. However, if D has negative eigenvalues, such reflections can increase the energy when they are applied to general, non-minimizing measures. This is illustrated by the following example.

Example 3.14. Consider the two-dimensional case with D=diag(λ,1) and λ<0 . For any θ[0,2π) , denote by δθ the Dirac delta placed at (cos(θ),sin(θ)) . Fix φ[0,2π) and let

μ=12(δφ+δπ+φ).

In the two-dimensional setting, the symmetrization is given by

μ~=14(δφ+δπ+φ+δφ+δπφ).

Denoting, for convenience, cos(φ)=c , we have

ED(μ)ED(μ~)=12(cosh|(λ1)c2+1|cosh|(λ1)c2+1|).

Since tcosh(t) is strictly increasing for t0 , we get that ED(μ)ED(μ~) since

|(λ1)c2+1|=||λ|c2+1c2||λ|c2+1c2=||λ|c2+1c2|=|(λ1)c2+1|

for any 0c1 and λ0 , and the inequality is strict if and only if 0<c<1 and λ<0 .

(d). Maximizers for negative semi-definite matrices

There is no apparent way to use the proof strategy from the previous section for showing that maximizers for negative definite matrices are symmetric, since the kernel (x,y)exDy is not negative definite for a negative definite D . However, we can show that the quadratic form F used to prove lemma 3.6 is non-positive for anti-symmetric measures. This yields a symmetry property of maximizers for negative semi-definite matrices.

Lemma 3.15. Let D be a negative semi-definite matrix and μ a measure on the sphere. Define μ~ as

dμ~(x)=12(dμ(x)+dμ(x)).

Then ED(μ~)ED(μ) and the inequality is strict if μ~μ and either D is negative definite or μ~=μ on the null space N(D) .

Proof. We denote by N(x)=x the negation and define

μ+:=μ,μ:=N#μ,ζ:=1/2(μμ+).

This yields that dζ(x)=2(dμ(x)dμ(x))=dζ(x) and

ED(ζ)=SSexDydζ(x)dζ(y)=S+S+exDydζ(x)dζ(y)+S+S+exDydζ(x)dζ(y)+2S+S+exDydζ(x)dζ(y)=2S+S+exDyexDydζ(x)dζ(y)=ED(ζ).

Since D is positive semi-definite, the proof of lemma 3.6 shows that ED(ζ)0 and thus ED(ζ)0 . The inequality is strict if ζ0 and either D is negative definite or ζ is concentrated on N(D) . The symmetry of the kernel yields ED(μ)=ED(μ+) . Further, by substituting μ+=μ~+ζ and μ=μ~ζ , we see that

ED(μ~)=14ED(μ+)+14ED(μ)+12ED(μ+,μ)=12ED(μ)+12ED(μ~+ζ,μ~ζ)=12ED(μ)+12ED(μ~)12ED(ζ).

Reordering the terms leads to

ED(μ~)=ED(μ)ED(ζ)ED(μ).

From the conditions on ζ and D that lead to ED<0 , we derive that the above inequality is strict if μ~μ and either D negative definite or μ~=μ on N .∎

Corollary 3.16. Let μ be a maximizer of ED for a negative definite D . Then dμ(x)=dμ(x) .

This symmetry property is the missing ingredient for showing that the discrete measures introduced in remarks 3.12 and 3.13 are maximizers for negative semi-definite matrices D .

Theorem 3.17. Let D be negative semi-definite and λmin<0 its smallest eigenvalue. Then, a measure μ maximizes ED if and only if μ=1/2(δz+δz) where zS is an eigenvector associated with λmin .

Proof. By lemma 3.15, it suffices to consider μ satisfying dμ(x)=dμ(x) . Denoting S+:={xS:x1>0} and using the symmetry property of μ , with the arguments from remark 3.12, we have

ED(μ)coshλmin=ED(μ),

where equality is only obtained if |xDy|=λmin holds μ×μ -almost everywhere on S+×S+ . Since μ is symmetric, this is equivalent to μ=μ . For a negative definite D , we already know from corollary 3.16 that there are no other measures that maximize ED . In the negative semi-definite case, we have that any μ that fulfils ED(μ)=coshλmin has to be concentrated on N(D) and, therefore, also in this case, there are no other maximizers.∎

4. Energy variation and stationary points

To study stationary points or local maximizers/minimizers, it is useful to consider the first and second variations of the energy on the Wasserstein space of probability measures on the sphere, as studied previously for Vlasov-type interactions, e.g. the mean-field aggregation equation, cf. [36,59,60]. The first variation of ED is given by

dED(μ;V)=ddtED(μt)|t=0, (4.1)

where μt satisfies

tμt+(μtPxV)=0,μ0=μ, (4.2)

and Px=IdxxT is the projection to the tangent space of the unit ball at x . Here, the velocity field V is an arbitrary Lipschitz function on n ; by the projection Px , we restrict it further to admissible velocities that keep the distribution on the unit sphere.

The following weak formulation, where φ is a continuously differentiable test function, will be useful later:

ddtSφ(x)dμt(x)=SPxφ(x)V(x)dμt(x).

Similar to the first variation, the second variation of ED can be defined as

d2ED(μ;V.W)=ddtdED(μt,W)|t=0 (4.3)

if the derivative on the right-hand side exists. The computation of the first variation is completely analogous to the case of the aggregation equation (cf. [59]) and thus omitted here.

Lemma 4.1. For any Lipschitz continuous vector field V , the first variation of the energy ED in the direction V exists and is given by

dED(μ;V)=SSex(Dy)PxDyV(x)dμ(x)dμ(y). (4.4)

It is straightforward to see that the first variation vanishes at the extremal points of the energy:

Proposition 4.2. Let μ be a minimizer or maximizer of the energy. Then dED(μ;V)=0 for all Lipschitz vector fields V .

Proof. Let μ be the initial value for the transport equation (4.2). For Lipschitz-continuous vector fields, there is a unique solution μt of the transport equation, and for all times t>0 , it is an admissible distribution on the sphere. Hence, if μ is a minimizer, then

ED(μ)ED(μt)

for all t>0 , which implies that dED(μ;V)0 in the limit t0 . Since V is arbitrary and dED is linear in V , we have that dED(μ;V)=0 . The case of a maximizer is treated in the same way, with an opposite inequality initially.∎

The connection between the transformer dynamics and the energy variations in Wasserstein spaces is readily established in the following.

Lemma 4.3. A probability measure μ is a stationary solution of equation (2.1) with the velocity field given by equation (1.8) if and only if dED(μ;W)=0 for all Lipschitz continuous W .

Similarly to lemma 4.1, one can obtain an expression for the second variation.

Lemma 4.4. For V,W being Lipschitz continuous, the second variation of the energy ED in the directions V , W exists and is given by

dED(μ;V,W)=SSexDy((PxDyV(x))(PxDyW(x))+(Dy)T(PxV(x)))dμ(x)dμ(y).

(a). Energy variation at concentrated distributions

From lemma 4.1, we see that any measure μ that fulfils

SexDyPxDydμ(y)=0for μ-almost all xS, (4.5)

is a stationary point of ED . Here and in the following, with a slight abuse of notation, we denote the 0 -vector by 0 . For concentrated measures, the above condition is also necessary and rather easy to verify, as we see in what follows. We first show that single Dirac measures can only be stationary points if they align with an eigenvector of the matrix D .

Lemma 4.5. A Dirac measure μ=δz is a stationary point of ED if and only if z is an eigenvector of D .

Proof. The first variation is given by

dED(μ;V)=ezDzPzDzV(z).

Since V(z) is an arbitrary vector, μ is a stationary point if and only if

0=PzDz=Dz(zTDz)z,

which holds if and only if z is an eigenvector of D .∎

Intuitively speaking, PzDz=0 means that the force emerging from the interaction of a particle located at eigenvector z with itself is orthogonal to the tangent space of S at point z and is thus cancelled out by the projection. The same effect can be observed for convex combinations of a Dirac measure and its reflection.

Lemma 4.6. For any t[0,1] , we have that tδz+(1t)δz is a stationary point of ED if and only if z is an eigenvector of D .

Proof. Using the expression in lemma 4.1, we obtain for any Lipschitz continuous V , using the abbreviation ι=zDz , that

dED(tδz+(1t)δz;V)=t2e ιPzDzV(z)+(1t)2e ιPzD(z)V(z)+t(1t)e ιPzDzV(z)+t(1t)e ιPzD(z)V(z).

We first observe that for any x,y one has that Pxy=Pxy=Px(y) . By comparing the coefficients in the above equation, we obtain that

dED(t)δz+(1t)δz;V)=0 for all V Lipschitz PzDz=0Dz(zDz)z=0z is an eigenvector. 

For the symmetric case t=1/2 in the above lemma, we can further show that any convex combination of such stationary points is again a stationary point.

Lemma 4.7. Let ZD be a finite subset of eigenvectors of D such that wz=0 for all zZD\{w} . Then for any choice of parameters t:ZD0+ such that zZDt(z)=1 the following measure is a stationary point of ED :

μ=12zZDt(z)(δz+δz).

Proof. We prove the statement by showing that equation (4.5) holds. For any wZD , it holds that

PwDw=PwDw=0,

since ZD only contains eigenvectors of D . On the other hand, since we also require wz=0 for all zZD\{w} it follows that zDw=zDw=0 and therefore,

ewDz=ewDz

for all zZD\{w} . In total, this yields

SewDyPxDy dμ(y)=zZDt(z)(ewDzewDz)Pw(Dz)=0,

for all wZD and thus also for μ -almost all wS .∎

The above proof strategy works only for Dirac measures aligned with the eigenvectors of D . However, there exist other discrete measures that are stationary points, as the following example shows. For the sake of simplicity, we restrict ourselves to the two-dimensional case with a positive definite matrix D and a symmetric combination of four Dirac measures. We further assume that D is diagonal; the case of a general symmetric D can be treated similarly with a rotation argument.

Lemma 4.8. Let n=2 , φ[0,2π) and D be diagonal and positive definite. A discrete measure:

μφ=1|Xφ|xXφδx,whereXφ={X(φ),X(πφ),X(π+φ),X(2πφ)}, (4.6)

is a stationary point of ED if and only if either φ{0,π/2,π} or

tanh(λ1cos2φ)tanh(λ2sin2φ)=λ2λ1, (4.7)

where λ1,λ2 denote the diagonal entries of D . For any choice of λ1,λ2>0 , there exists exactly one φ(0,π/2) that fulfils the condition in equation (4.7).

Proof. Without loss of generality, we prove the statement for φ[0,π/2] , since otherwise it holds that (ψmod2π)[0,π/2] for a ψ{πφ,π+φ,2πφ} , and thus μφ=μψ .

It follows directly from lemma 4.6 that μφ is a stationary point if φ{0,π/2} . Therefore, it remains to show that μφ is a stationary point if and only if equation (4.7) is fulfilled. This means that we have to see when there exists a Lipschitz continuous V such that dED(μφ,V)0 .

We first fix xS and consider

SexDyPxDydμ φ(y)=14((exDX(φ)exDX(φ))PxDX(φ)+(exDX(πφ)exDX(πφ))PxDX(πφ)). (4.8)

Since n=2 , we can further write Pxy=xyx , where x=(x2,x1)T . We factor out x to rewrite equation (4.8) as E(x;μφ)x with

E(x;μφ)=(1/2)(sinh(xDX(φ))xDX(φ)+sinh(xDX(πφ))xDX(πφ)).

Lemma 4.1 now gives us that

dED(μφ,V)=xXφE(x;μφ)xV(x),

which can become zero for all admissible V if and only if E(x;μφ)=0 for all xXφ . Due to the symmetry properties of our measures μφ , it further holds that E(x;μφ) is constant on Xφ ; therefore, it suffices to consider x=X(φ) . Remembering that X(φ)=(cosφ,sinφ)T , we derive

2E(X(φ);μφ)=sinh(λ 1cos 2φ+λ 2sin 2φ)(λ 1+λ 2)sinφcosφ+sinh(λ 1cos 2φ+λ 2sin 2φ)(λ 1+λ 2)sinφcosφ.

Since φ(0,π/2) , the factor sinφcosφ cannot vanish, and the zeros of E(X(φ);μφ) coincide with those of

sinh(λ 1cos 2φ+λ 2sin 2φ)(λ 1+λ 2)+sinh(λ 1cos 2φ+λ 2sin 2φ)(λ 1+λ 2)(4.9)=sinh(λ 1+(λ 1+λ 2)sin 2φ)(λ 1+λ 2)+sinh(λ 1+(λ 1+λ 2)sin 2φ)(λ 1+λ 2).

This function obtains its minima at (φmod2π){0,π} and its maxima at (φmod2π){π/2,3π/2} and strictly increases or decreases, respectively, in between. Substituting these points into equation (4.9), we see that the minima are strictly negative and the maxima are strictly positive since λ1,λ2>0 . Therefore, there exists exactly one zero in the interval (0,π/2) . Using the hyperbolic identity sinh(x+y)=sinhxcoshy+coshxsinhy in equation (4.9), we arrive at the criterion in equation (4.7).∎

Remark 4.9. Importantly, the angle φ that fulfils equation (4.7) depends not only on the ratio of the eigenvalues of D but also on their magnitude since they appear separately within the hyperbolic tangent.

Although the ratio of the eigenvalues does in general not determine the angle φ that fulfils equation (4.7), we can still make a qualitative prediction based on the ratio. The left-hand side of equation (4.7) decreases monotonically for φ[0,π/2) ; for λ1=λ2 , the condition is fulfilled for φ=π/4 . Therefore, the condition is fulfilled by some φ[0,π/4) if λ2>λ1 and by some φ(π/4,π/2] if λ1>λ2 . The numerical experiments in §5b show that the measures characterized by equation (4.7) are not only stationary points but also minimizers among empirical measures consisting of at most four Dirac measures. In the remainder of this section, we aim to characterize minimizers for positive definite matrices D in arbitrary dimensions n2 .

(b). Energy variation at the uniform distribution

To characterize minimizers for positive definite D , we start by identifying the cases when the uniform distribution is a stationary state. As we show in the following lemma, this can only be the case if the strength of the interaction does not depend on the direction, i.e. the eigenvalues of D all have the same absolute value.

Lemma 4.10. The uniform distribution μ=1|Sn1|Hn is a stationary point of ED if and only if all eigenvalues (λi)i=1n of D have the same absolute value, i.e. |λi|=λ for some λ .

Proof. To keep the notation simple, we treat here the case n=2 , leaving the general proof for n>2 to appendix C(a). Let us fix xS and determine φ[0,2π) such that Dx/Dx=(cosφ,sinφ)T . Consider the integral

SexDyPxDydH2(y)=02πeDxcos(ψφ)Px(D(cosψ,sinψ)T)dψ=(),

which can be rewritten with a change of variables θ=ψφ as follows (recall that Px=IdxxT ):

=02πeDxcosθ(cosθ(D2x/DxDxx)+sinθ(Dx/Dx))dθ=(D2x/DxDxx)02πeDxcosθcosθdθ>0+(Dx/Dx)02πeDxcosθsinθdθ=0.

From the above derivations, we see that ()=0 if and only if x is an eigenvector of D2 . This holds true for μ -almost all xS if and only if |λ1|=|λ2| . This automatically yields dED(μ,V)=0 if |λ1|=|λ2| . It remains to show that this is also a necessary condition.

Without loss of generality, we assume that |λ1|>|λ2| , where λ1 and λ2 are the eigenvalues corresponding to the eigenvectors z1 and z2 , respectively. Then, (D2x/DxDxx)z2 is strictly negative on the set

A={xS|(xz1)(|λ2/λ1|,1),(xz2)>0}.

Since μ(A)>0 we can find a Lipschitz continuous V such that Vz1=0 for μ -a.e. on S and

V(x)z2{>0for a.e. xA,=0for a.e. xSA.

For all such V it holds that dED(μ,V)>0 , which concludes the proof.∎

Since we already know that minimizers for D with at least one negative eigenvalue are Dirac measures, we can conclude that the uniform distribution is only a minimizer for D=Id .

Corollary 4.11. The uniform distribution μ=1|Sn1|Hn minimizes ED if and only if D=λId for λ0 .

Proof. We only need to show that there are no other matrices D such that ED is minimized by μ ; the other direction has been treated in proposition 3.11. The measure μ can only be a minimizer if it is a stationary point. By lemma 4.10, this implies that all eigenvalues of D have to have the same absolute value. If such D has at least one negative eigenvalue, it is also the smallest eigenvalue. Thus, by theorem 3.1, the only minimizers are Dirac deltas placed at eigenvectors corresponding to the negative eigenvalue.∎

(c). Perturbation of the identity

It is not clear whether an explicit computation of stationary points for an arbitrary positive definite matrix D with at least two distinct eigenvalues is possible, but some insight can be gained with asymptotic analysis. We consider the following perturbed energy:

Eε(μ):=SSex(Id+εM)ydμ(x)dμ(y),

where M is a diagonal matrix and |ε|1 is a small parameter. Using the second-order Taylor expansion of the exponential function, we can write

Eε(μ)ED(μ)+εSSexyxMydμ(x)dμ(y)+ε2Sexy(xMy)2dμ(x)dμ(y).(4.10)

For ε=0 we know that the unique minimizer μ0 is the uniform distribution on the sphere. Therefore, we use the following second-order asymptotic ansatz:

με:=μ0+εν+ε2w,Sdν=Sdw=0. (4.11)

We stress that here we consider the energy as a function on the space of signed Radon measures on the sphere M(S) with the total variation norm and not on the space of probability measures P(S) with the Wasserstein metric as in §4a. For this reason, the perturbation here is a measure and not a vector field (cf. equation (4.1)).

Substituting equation (4.11) into equation (4.10) and neglecting higher-order terms, we derive

Eε(με)Eε(μ0)εED(μ0,ν)+ε2ED(μ0,w)+ε2ED(ν)+2ε2SSexyxMydμ0(x)dν(y).

Since further ySexydμ0(x) is constant on S , it follows that

ED(μ0,ν)=C(n)Sdν=0andED(μ0,w)=C(n)Sdw=0.

In particular, we see that the term ε2ω from equation (4.11) does not contribute to the second-order expansion of the energy. Therefore, minimizing Eε over all possible με satisfying equation (4.11) is equivalent to minimizing

E~ε(ν):=ε2(ED(ν)+2SSexyxMydμ0(x)dν(y))

over all signed measures ν with ν(S)=0 . The first variation in the direction ν satisfying Sdν=0 is given by

dE~ε(ν,ν)=2ε2(SSexydν(x)dν(y)+SSexyxMydμ0(x)dν(y)). (4.12)

Our goal is now to find an optimal measure ν , such that its first variation vanishes in any direction ν such that Sdν=0 . To do so, we shall need the following two technical lemmas. To make the definition of the uniform distribution on the sphere rigorous, we denote by Hn the n -dimensional Hausdorff measure and write Sn1 instead of S .

Lemma 4.12. Let n2 and μ0=1|Sn1|Hn . It holds that

Sn1exyx dμ0(x)=C1y (4.13)

for any ySn1 , where the constant C1 is positive and depends only on the dimension n .

Proof. For the sake of simplicity, here we present the (more intuitive) proof for n=2 , leaving the general case n>2 to appendix C(b). We write x=(cosφ,sinφ)T and y=(cosψ,sinψ)T and derive that

2πSexyxdμ 0(x)=02πecos(φψ)(cosφ,sinφ)Tdφ=02πecosθ(cos(ψ+θ),sin(ψ+θ))Tdθ=(cosψ,sinψ)T02πecosθcosθdθ+(sinψ,cosψ)T02πecosθsinθdθ,

where we use the coordinate transform θ=φψ and two trigonometric identities to separate the summands inside sine and cosine. Since Secosθsinθdθ=0 , this yields equation (4.13) with

C1=12π02πecosθcosθdθ>0.

Lemma 4.13. Let n2 and μ0=1|Sn1|Hn . It holds that for any ySn1

(4.14)Sn1exyxi2 dμ0(x)=C2yi2+C3,1in,

where the constants C2 and C3 are positive and depend only on the dimension n .

Proof. For the sake of simplicity, we again present the proof for n=2 ; the general case n>2 is treated in appendix C(c). Using the same arguments as in the previous proof, we derive

2πSexyx2dμ 0(x)=02πecosθ(cos2(ψ+θ),sin2(ψ+θ))Tdθ=(cos2ψ,sin2ψ)T02πecosθcos2θdθ+(sin2ψ,cos2ψ)T02πecosθsin2θdθ,

where the mixed terms containing cosθsinθ vanish due to symmetry. Further, since cos2ψ+sin2ψ=1 , we can write

(sin2ψ,cos2ψ)T=(1,1)T(cos2ψ,sin2ψ)T.

This yields equation (4.14) with positive constants:

C2=12π02πecosθ(cos2(θ)sin2(θ))dθ,C3=12π02πecosθsin2(θ)dθ.

Lemma 4.12 allows us to rewrite the second summand in equation (4.12) such that it contains yMy . Using lemma 4.13, we can then deduce that, up to constants, the measure (xMx)μ0(x) is a stationary point of E~ε .

Theorem 4.14. The measure

dν(x)=(αxMx+β)dμ0(x),where    α=C1/C2    and    β=SαxMxdμ0(x),

fulfils Sdν=0 and dEε(ν,ν)=0 for all ν satisfying Sdν=0 .

Proof. From the definition of β and Sdμ0=1 , it follows that Sdν=0 . With lemma 4.12, we write the optimality condition derived from equation (4.12) as

SSexydν(x)dω(y)=C1SyMydω(y).

Substituting ν into the left-hand side and using lemma 4.13, we get

SSexy dν(x)dω(y)=Sα(C2yMy+Tr(M)C3)dω(y)+βSSexy dμ0(x)dω(y)=αC2SyMy dω(y),

where all terms that do not depend on y , including Sexydμ0(x) , vanish due to Sdω=0 . Substituting α=C1/C2 completes the proof.∎

Theorem 4.14 gives us the following intuitive characterization. The measure με that optimizes the perturbed energy is obtained by taking mass from the uniform distribution where (xMx) is large and adding it where (xMx) is small. In other words, we expect minimizers of the energy ED with a positive definite matrix D to have more mass in regions that correspond to small eigenvalues of D than in regions that correspond to large ones. This intuition is in line with the results of the particle approximation in figure 3. Furthermore, in figure 5, we also observe that the density obtained in equation (4.11) with the measure ν from above can indeed be seen as a first-order approximation for small values of ε .

5. Numerical examples

To illustrate the obtained theoretical results, we perform a series of numerical experiments using a particle approximation of the energy from equation (1.2) with an ensemble of N particles X=(X1,,XN) ,

ED(μN(X)), where μN(X)=1Ni=1NδXi.

We consider the following particle flow, introduced in [9],

X˙i(t)=PXi(t)(±1Ji(X)j=1NeXi(t)DXj(t)DXj(t)),

with normalization factors Ji(X) . If we choose the constant normalization

Ji(X)=N, (5.1)

this corresponds merely to a step-size rescaling of a standard gradient descent scheme for ED , which is called the (USA) flow in [9]. Choosing the normalization as the partition function

Ji(X)=j=1NeXi(t)DXj(t), (5.2)

corresponds more closely to the self-attention dynamics and is labelled the SA flow in [9]. In what follows, we mostly use the normalization in equation (5.2), highlighting minor differences between the two formulations as appropriate. We use the explicit Euler discretization from equation (1.5) with step size τ>0 to obtain the following update:

Xi(t+τ)=Π(Xi(t)±τJi(X)j=1NeXi(t)DXj(t)DXj(t)). (5.3)

Remark 5.1. For N=1 and this scheme reduces to the following power iteration in the limit τ :

X1(t+τ)=Π(DX1(t)).

In this regard, the iteration in equation (5.3) can be seen as a method for approximating the largest eigenvalue and the corresponding eigenvector. We leave further analysis of this connection to future work.

The source code for the experiments here is available at https://github.com/TimRoith/TransformerDynamics and uses Python [61], mainly building upon the packages NumPy [62], SciPy [63] and PyTorch [64].

(a). Maximizers for positive definite matrices

To validate our results on maximizers, we first consider a simple set-up of a one-particle system, N=1 . We choose τ=0.075 and run the scheme in equation (5.3) for 1500 iterations. We only report the results for the adaptive normalization from equation (5.2), those for the constant normalization from equation (5.1) being essentially the same. For D=Id , we know that every single Dirac is a maximizer, which is indeed observed in figure 1a. Here, each random initialization on the sphere leads to a different final state. In fact, in this case, there is no evolution at all, and the particle stays at its initial position. If D is positive definite and has a strictly largest eigenvalue λmax , theorem 3.1 shows that only Diracs at eigenvectors zmax corresponding to λmax are maximizers. This can be observed in figure 1b where the final state is either at zmax or zmax .

Figure 1.

Discrete maximizers on the sphere

Discrete maximizers on the sphere for N=1 particles. The colour indicates the value of xDx at each point on the sphere. (a) For D=Id every single Dirac is a maximizer. We show the results for 30 different initializations (b) For D=diag(1,3,4) the final state is either (0, 0,1) or (0,0,−1).

For multiple particle systems with N>1 , lemma 4.6 suggests also that linear combinations of an eigenvector with its negative are stationary points. These linear combinations are not maximizers, but their basin of attraction depends on the eigenvalues of the matrix. In figure 2 (left), we plot the probability (i.e. the proportion of random initializations) of converging to a single cluster versus two clusters as function of the eigenvalues. We fix λ1=1 and vary λ2 between 1 and 1.5 . Note that, as discussed in lemma 4.8 and remark 4.9, the actual values of the eigenvalues matter; not just their ratio. For λ 21 , the probability of converging to a single cluster is high, whereas for larger values λ21.4 , most trajectories converge to two clusters. The results in figure 2 were obtained with the adaptive normalization from equation (5.2); however, we observed the same quantitative behaviour with the constant normalization from equation (5.1).

Figure 2.

We study the trajectories for a symmetric positive definite matrix

We study the trajectories for a symmetric positive definite matrix D=diag(1,λ2) with λ2[1.,1.5] and 100 different initializations using 100 particles. We evaluate the number of clusters at the final iteration with the k -means implementation of the SciPy package [63]. The centre of each cluster is close to an eigenvector corresponding to an eigenvalue of maximal absolute value. For λ21 , the evolution converges to the optimal state with a single cluster (blue, solid), while for bigger values, it tends to get stuck in the suboptimal stationary state with two clusters (red, hatched) from lemma 4.6.

(b). Minimizers for positive (semi-)definite matrices

We now study discrete minimizers for positive definite matrices. In figure 3, we show how the matrix D influences the particle configuration to which the scheme in equation (5.3) converges. Here, too, we used the adaptive normalization from equation (5.2); the results for the constant one from equation (5.1) are largely the same.

Figure 3.

Final states for the minimization scheme after

Final states for the minimization scheme after 10 000 steps with N=400 particles. The colour indicates the value of xDx at each point on the sphere. In (a), the uniform distribution is the minimizer of the energy. In (b), the particles do not form clusters at single Diracs but rather follow a smooth distribution on the sphere. In (c), any configuration with (Xi)1=(Xi)3=0 for all i is a minimizer. In (d), any configuration with (Xi)3=0 for all i is a minimizer.

Furthermore, in figure 4, we illustrate the results of lemma 4.8 for matrices D=diag(1,λ2) with varying values λ2[0.5,8] . We initialize N=4 particles as

Figure 4.

We consider minimizers for the matrix

We consider minimizers for the matrix D=diag(1,λ2) . Starting with the initial configuration described in equation (5.4) , we compute the mean of tanh(cos2φi)/tanh(λ2sin2φi) over all particles. For a small step size, the resulting curve is very close to the identity, as predicted by lemma 4.8. If λ2τ is too big, the dynamics converge to a suboptimal stationary point. We also compare the normalizations given by equations (5.1) and (5.2). We see that with the same step size τ=0.2 , the adaptive normalization in equation (5.1) yields faster convergence than the constant one in equation (5.2).

(5.4)Xi=X(φi)withφi=(i1)π+π/4fori=1,,4,

and let the scheme in equation (5.3) run for 10 000 iterations. From the final particle state, we compute the value tanh(cos2φi)/tanh(λ2sin2φi) for each particle separately; lemma 4.8 tells us that this should be equal to λ2 for the minimizer. In figure 4, we observe that this holds true for the particle configurations computed with the discrete scheme. However, if the step size is too big compared to the value λ2 , the system instead converges to the two-cluster stationary point from figure 2. Here, we notice a slight difference between the two normalizations. The adaptive normalization from equation (5.2) allows choosing bigger step sizes compared to the constant normalization from equation (5.1), enabling faster convergence to the large-time limit.

We further investigate the validity of the asymptotic solution from theorem 4.14 in the two-dimensional case. Here, we deviate from the particle approximation and instead discretize the interval [π,π) with N equidistant grid points Θ[π,π]N and the associated points on the sphere x1,,xNS1 . In this setting, we then aim to minimize

(5.5)E~ε(m)=i,j=1Nexi(Id+εM)xjmimj,

where mN is a probability vector. Note that already, for n=3 , a more sophisticated quadrature rule would be required, e.g. the Lebedev quadrature on the sphere [65]. To deal with the simplex constraint for the vector m , we use exponentiated gradient descent, specifically mirror descent with the negative log-entropy as the distance generating function [66], which yields the update

m(ε)imieτE~ε(m(ε))ij=1Nm(ε)jeτE~(m(ε))j=SoftMax(log(m(ε))τE~ε(m(ε))i.(5.6)

We take the perturbation matrix as M=diag(0,1) , that is, the perturbed matrix D is given by Dε=diag(1,1+ε). Recall the asymptotic expansion in equation (4.10). As noted in §4c, the contribution of the term ε2ω vanishes in the second-order expansion of the energy, and we are left with a solution:

με=μ0+εν, (5.7)

where ν is as in theorem 4.14. We note that this measure has a Lebesgue density that can be evaluated at the grid points in Θ ; we denote the resulting vector by dμε|Θ . In figure 5, we compare this solution to the vector m(ε) obtained by solving equations (5.5)(5.6). The vector m(ε) for different values of ε is shown in figures 5a and 5b, we plot the 2 error |m(ϵ)dμε|Θ|2 .

Figure 5.

Numerical study of the asymptotic solution from Theorem 4

Numerical study of the asymptotic solution from theorem 4.14 in two dimensions. (a) The probability vectors m(ϵ) computed using equation (5.5) with 500 steps for τ=0.1. (b) The l2 approximation error for the first-order expansion in equation (5.7) (blue, solid) and the conjectured form in equation (5.8) (green, dotted)

Beyond the first-order expansion in equation (5.7), we conjecture that m(ε) behaves as follows:

dμεguess(θ)exp(Υ(ε)cos(2θ)), (5.8)

where Υ(ε) is a function to be determined. Taking a second-order Taylor expansion Υ(ε) , we estimate the coefficients via linear regression with the given vectors m(ε) as data points and obtain Υ(ε)1/5ε2+e/2ε . The 2 error of this approximation is shown in figure 5b and is lower than that of the first-order expansion in equation (5.7). We leave the analysis of this ansatz to future work.

(c). Maximizers for negative definite and indefinite matrices

We proceed to numerical examples for §3d, i.e. maximization of the energy corresponding to a negative definite matrix. We take a system of N=100 particles and consider the two matrices from figure 1 multiplied by 1 . The results are shown in figure 6. We observe that a single final state consists of clusters at ±z , where z is an eigenvector corresponding to the smallest eigenvalue, in agreement with theorem 3.17. As shown there, the behaviour does not change if one of the eigenvalues is zero, as only the eigenvectors corresponding to the smallest eigenvalue are relevant. For this reason, we do not consider the semi-definite case separately. The results here are not affected by the choice of the normalization; we only show the ones obtained with that in equation (5.2).

Figure 6.

Discrete maximizers on the sphere for negative definite matrices obtained

Discrete maximizers on the sphere for negative definite matrices obtained with N=100 particles. We visualize the two-cluster final states by connecting the two components of each cluster corresponding to the same run with a line, assigning different colours to the two opposite clusters. The colour of the sphere indicates the value of xDx at each point on the sphere. (a) For D = −Id a single final state has clusters at both z and z for any zS . For clarity, we only show results for 6 different initializations. (b) For D = −diag(1,3,4) a single final state has clusters both at (0,0,1) and (0,0,−1). We show the results for 100 different initializations.

Finally, we turn to the case of indefinite matrices. As noted in remark 3.5, for a matrix D that is not negative definite, a Dirac delta placed at the eigenvector corresponding to the largest eigenvalue may not be a maximizer. This can be observed numerically as shown in figure 7 where we plot the energies of one- and two-cluster states for D=diag(1,λ2) with λ2[1,1] .

Figure 7.

Energies of the states

Energies of the states Xsingle=((0,1)) in blue, Xtwo,1=((0,1),(0,1)) in red and Xtwo,2=((1,0),(1,0)) in green for the matrix D=diag(1,λ2) with varying values of λ2 .

6. Conclusion

In this work, we studied a mathematical model of self-attention layers used in the transformer architecture. Building upon [9], we analysed a continuum limit in the space of probability measures on a sphere. To understand the underlying geometry, we studied a new optimal transport distance Wm,2 with non-local mobility. We proved that the space of probability measures with this distance is a geodesic space and characterized absolutely continuous curves in this space. This allowed us to interpret the continuity equation (2.5) as curves of maximal slope of the interaction energy and to analyse the large-time behaviour using the energy dissipation property, showing that the dynamics converge to a stationary point of the interaction energy.

We analysed these critical points (in particular, minimizers and maximizers) for various types of interactions determined by the matrix D in equation (1.2). These results are listed in table 1. We find that the positions of stationary points are strongly connected to normalized eigenvectors of D , which form a strict subset of S in the case DλId . In other words, the regions where clusters appear do not only depend on the initial configuration, but also on the interaction matrix itself. This could be related to mode collapse often observed in practice. It is an interesting question to understand whether an alternative, rotation-invariant architecture could prevent mode collapse.

Table 1.

Summary of results on minimizers/maximizers of the interaction energy in equation (1.2). We denote by zmin and zmax the eigenvectors that correspond to the smallest, respectively largest, eigenvalue of D .

property of D

minimizers

maximizers

top rule positive definite

symmetric w.r.t. all eigenvectors

(corollary 3.10 and §5b)

μ=δzmax (theorem 3.1 and §5a)

mid-rule positive semi-definite

any μ concentrated on N(D)

(theorem 3.4 and §5b)

μ=δzmax (theorem 3.1 )

negative (semi-)definite

μ=δzmin

(theorem 3.1)

μ=1/2(δzmin+δzmin) (corollary 3.16 and §5c)

indefinite

μ=δzmin

(theorem 3.4)

|λmax| maximal: μ=δzmax (theorem 3.1 and §5c)

Several further questions remain open for future work: as already discussed, it would be interesting to study the optimal transport distance for mobilities mμ that cannot be bounded from below, which is the case, for example, in problems of opinion dynamics where the Gaussian kernel on the Euclidean space is often used. In this case, the metric Wm,2 is no longer equivalent to W2 . So far, we have only shown that equation (2.6) represents gradient flows in (P(M),Wm,2) using the concept of curves of maximal slope. We do not know if these curves satisfy the slightly stronger energy variational inequality, which would yield an easy stability estimate for solutions of equation (2.6).

From a practical point of view, an even more interesting direction is studying more general flows in Wm,2 that correspond to non-symmetrical matrices D in equation (1.2), which is common in transformer architectures. As mentioned above, basic properties of the distance carry over to the non-symmetric case, but characterizing the stationary states is non-trivial; one possibility is splitting the effective velocity fields into a dissipative and a (generalized) divergence-free part, similar to non-symmetric Fokker–Planck equations.

Finally, to justify the use of the continuum limit for studying the practical behaviour of transformers, one needs to establish convergence of discrete time-stepping in arbitrary time intervals. Moreover, it is worth studying how the step size influences the behaviour of the system and what effect weight-sharing would have.

Appendix A. Proofs of Section 2

A.1. Continuity equation on manifolds

Let M be a compact, n -dimensional Riemannian manifold and TM=xMTxM its tangent bundle. Although TM is not a vector space, the tangent bundle TM itself can be considered as 2n -dimensional Riemannian manifold. For its proper definition and the topology on TM , we refer to [67, ch. 3 (The Tangent Bundle)]. Velocity fields on manifolds are maps V:MTM such that πV=IdM , where π:TMM is the projection map sending each vector in TxM to x . We shall regularly commit the mild crime of interpreting V(x) as an element in TxM instead of TM . Let I=(0,T) be an open interval, (μt)tI be a Borel family of probability measures on M and v:(x,t)M×Ivt(x)TxM be a time-dependent Borel velocity field such that

|vt(x)|dμtdt<, (A 1)

where ||:TMx[0,+) denotes the norm induced by the inner product of the Riemannian structure. The continuity equation holds in the sense of distributions if

(0,T)Mtφ(x,t)+Dφ(x,t),vt(x)dμtdt=0φCc1(M×(0,T)). (A 2)

Here, Dφ denotes the differential of the map xMφ(t,x) for a fixed t[0,T] .

Proposition A.1 (Properties). Solutions to the continuity equation have the following properties:

  • Continuous representative: Let μt be a Borel family of probability measures satisfying equation (A 2) for a Borel vector field vt satisfying equation (A 1) . Then there exists a narrowly continuous curve t[0,T]μ~tP(M) such that μt=μ~t for a.e. t(0,T) . Moreover, if φCc1(M×[0,T]) and sr[0,T] we have [53, Lemma 8.1.2]:

    Mφ(x,r)dμ~rMφ(x,s)dμ~s=srMtφ+Dφ(vt)dμtdt. (A3)
  • Time rescaling: Let t:s[0,T]t(s)[0,T] be a strictly increasing absolutely continuous map with absolutely continuous inverse st1 . Then (μt,vt) is a distributional solution of the continuity equation if and only if [68, Lemma 8.1.3]

    μ^:=μt,v^=tvt is a distributional solution of the continuity equation on (0,T) .

  • Gluing solutions: Let {μt}t[0,T1],{νt}t[0,T2] be two narrowly continuous curves in P(M) with μT1=ν0 . Let further {v}t[0,T1],{w}t[0,T2] be the corresponding Borel velocity fields such that equation (A 3) is satisfied. Then {ηt}t[0,T1+T2] and {ut}t[0,T1+T2] defined by

    ηt:={μtif t[0,T1],νtT1if t(T1,T1+T2],ut:={vtif t[0,T1],wtT1if t(T1,T1+T2],

    satisfy equation (A 3) [ 69 , Lemma 4.4].

A.2. Proof of Theorem 2.2

We follow the proof strategy from [69] for the ‘flat’ Euclidean case, but since TM is not a vector space, modifications are required. We start by establishing a compactness result for solutions of continuity equations with finite energy. For our purposes, we define the ‘lifted’ flux JtP(TM×M) in duality with Cc(TM×M) (see [70, Theorem 7.2]) by

TM×Mφ(w,y)dJt(w,y)=MMφ(vt(x),y)dμt(x)dμt(y)φCc(TM×M). (A 4)

Notably, (μt,Jt) solve the continuity equation in the sense that for all sr[0,T] :

MφrdμrMφsdμs=srMtφdμtdt+srTM×MDφ~dJtdtφC1(M×[0,T]), (A 5)

where Dφ~:(w,y)Dφ(π(w)),w is the extension of Dφ on to TM×M that is constant along yM . Further, we define JP(TM×M×[0,T]) in duality with Cc(TM×M×[0,T]) by

TM×M×(0,T)φdJ=0TTM×MφdJtdtφCc(TM×M×[0,T]).

Lemma A.2. Let (μn,vn) be a sequence in CE(0,T) with

supn{01Mmμ(x)|vtn(x)|2dμtn(x)dt}<+.

Then there exists a subsequence and a couple (μ,J) satisfying the continuity equation in the sense of equation (A 5) such that

μtnμtt[0,T]andJnJ,

and for the map g:(v,p)TM×M(π(v),p) one has

g#Jt=μtμtfor a.e. t(0,T). (A 6)

Proof. Step 1 (Convergence of J ):

The estimate

supn0TTM×M|w|2dJtn(w,x)dt1Csupn0TTMmμtn|vtn|2dμtndt<

combined with the fact that M is compact and [53, Remark 5.1.5] implies tightness of JP(TM×M×[0,T]) . By disintegrating J , we obtain a Borel family Jt such that dJ=dJtdt . Since M is compact μ0n is tight, and we extract a further subsequence such that μ0nμ0 .

Step 2 (Convergence of μt ):

Consider a function φC1(M) and for t[0,T] set ζ:(v,y,t)TM×M×[0,T]χ[0,t]Dφ(π(v)),v . Since the discontinuity set of ζ is concentrated on N=TM×M×{0,t} and |F|(N)=0 , general convergence theorems (see, e.g. [3, Prop. 5.1.10]) imply

limn0tTM×MD~φ dJtn dt=limnTM×M×[0,T]ζdJn(A 7)=TM×M×[0,T]ζdJ=0tTM×MD~φ dJt dt.

Let us fix a t(0,T] . Since M is compact, μtn is tight, and we can extract from any subsequence a further subsequence such that μtn converges narrowly. Then by equations (A 5) and (A 7) and the fact that C1 is dense in C0 , we know that all subsequences have the same limit. Therefore, μtnμtP(M) for a particular μt . By the previous calculations, we also immediately obtain that (μ,J) satisfy the continuity equation in the sense of equation (A 5). To show equation (A 6), we observe that since M is compact

g#Jtn=μtnμtnμtμtt[0,T].

Proof of Theorem 2.2. Step 1:

Let (μn,vn)CE(0,1) be a minimizing sequence of the functional in equation (2.3) for some μ0,μ1 . Then the conditions of lemma A.2 are met, and we obtain

μtnμtP(M)[0,T]andJnJP(TM×M×[0,T]),

where the limit satisfies the continuity equation in the sense of equation (A 5). Equation (A 6) in particular implies that J can be disintegrated in the following way:

dJ=dut,x(v)dμt(x)dμt(y)dt,

where ut,x(v)P(TMp=π1(x)) . Using [53, Lemma 5.1.7], we now show that for u¯x,t=Mean(ut,x)=TMxvdut,x(v) it holds that

Wm,2(μ0,μ1)2=limn01Mmμtn(x)|vtn(x)|2dμtn(x)dt=limn01TM×MK(π(v),y)|v|2dJtn(v,y)dt=01TM×MK(π(v),y)|v|2dJt(v,y)dt=01MMK(x,y)dμt(y)TMp|v|2dux,t(v)dμt(x)dt01Mmμt(x)TMx|v|2dδu¯x,t(v)dμt(x)dt=01Mmμt(x)|u¯x,t|2dμt(x)dt,

where in the last line, we used Jensen’s inequality. Since Dφ(x):TMx is linear and (μ,J) satisfy equation (A 5), this implies that (μ,v=(u¯x,t)t[0,T])CE(0,1) and for this couple the infimum in equation (2.3) is obtained.

Step 2:

Proposition A.1 and a linear time rescaling show that

Wm,22(μ0,μT)=inf{T0TMmμt|vt|2dμtdt:(μt,vt)CE(0,T;μ0μT)}. (A 8)

We denote by W¯m,2(μ,ν) the infimum in equation (2.4) and show that indeed W¯m,2(μ,ν)=Wm,2(μ,ν) . By Hölder’s inequality, we immediately obtain that W¯m,2(μ,ν)Wm,2(μ,ν) . To show the reverse, we follow the arguments of [69, Theorem 5.4] and define for (μ,v)CE(0,T;μν) :

sϵ(t)0t(ϵ+Mmμt|vt|2dμt)1/2drfor t[0,T].

Then sϵ is strictly increasing, sϵϵ and sϵ(0,T)=(0,Sϵ) with Sϵsϵ(T) , so that its inverse map tϵ:[0,Sϵ][0,T] is well defined and Lipschitz continuous and

tϵsϵ:=(ϵ+Mmμt|vt|2dμt)1/2for a.e. t(0,T).

By proposition A.1, we have that for μϵ:=μtϵ , vϵ:=tϵvtϵ the couple (μϵ,vϵ)CE(0,Sϵ;μ,ν) and

Wm,22(μ,ν)Sϵ0SϵMmμtϵ|vtϵ|dμtϵds=Sϵ0TMmμt|vt|2dμtϵ+Mmμt|vt|2dμt(ϵ+Mmμt|vt|2dμt)1/2dt,

with the last term being smaller or equal to Sϵ2 . Sending ϵ0 , we obtain

Wm,2(μ,ν)=0T(Mmμt|vt|2dμt)1/2dtfor all (μ,v)CE(0,T;μν)

and hence, Wm,2(μ,ν)=W¯m,2(μ,ν) . This in particular implies that for every minimizer (μ,v)CE(0,1;μν) of the functional in equation (2.3), the equality

(01Mmμt|vt|2dμtdt)1/2=01(Mmμt|vt|2dμt)1/2dt

holds, which is only the case when Mmμt|vt|2dμt is constant for a.e. t(0,T) , implying by a further time rescaling argument∎

Wm,2(μs,μt)=|st|Wm,2(μ0,μ1)0st1. (A 9)

A.3. Proof of Lemma 2.4

Proof of Lemma 2.4. If (μ,v)CE(0,T) and 0T(Mmμt|vt|2dμt)1/2dt<+ then by equation (2.4), we have

Wm(μs,νr)sr(Mmμt|vt|2dμt)1/2dt0srT.

On the other hand, if μt is an absolutely continuous curve, then by a standard reparametrization argument [53, Lemma 1.1.4], we may assume μt to be Lipschitz. For N , we set the step size as τ=T2N and choose a family of constant-speed geodesics (μk,N,vk,N)CE((k1)τ,kτ;μ(k1)τμkτ) , k{1,...,N} such that for t((k1)τ,kτ)

τMmμt|vt|2 dμt=(A8)1τWm2(μ(k1)τ,μkτ)1τ((k1)τkτ|μ˙|(t)dt)2 Hölder (k1)τkτ|μ˙|(t)2 dt.

Gluing all geodesics together by proposition A.1, we obtain a curve (μN,vN)CE(0,1) . Lemma A.2 gives us a subsequence, still denoted by N , and a couple (μ~,v~)CE(0,1) such that μtNμ~t and JJ~ . By construction, μ~t and μt coincide on the dense (in [0,T] ) set {0}{TM2N:M,N,MN} . Since both μ~t and μt are narrowly continuous μ~t=μt must hold. Again, equation (A 6) implies that J can be disintegrated in the following way:

dJ=dut,x(v)dμt(x)dμt(y)dt,

where ut,x(v)P(TMx=π1(x)) . Then (μ,v~)CE(0,T) with v~tTMxwdut,x(w) and

0TMmμt|v~t|2dμtdtJensen0TMMK(x,y)dμt(y)TMp|v|2dux,t(v)dμt(x)dt0TTM×MK(π(v),y)|v|2dJt(v,y)dtlim infn0TTM×MK(π(v),y)|v|2dJtn(v,y)dt=lim infn0TMmμtn|vtn|2dμtdt0T|μ˙|2(t)dt. (A 10)

Since (μ,v~)CE(0,T) , we have that

|μ˙|(t)(Mmμt|vt|2dμt)1/2for a.e. t(0,T).

Finally, for equation (A 10) to hold |μ˙|(t)=(Mmμt|vt|2dμt)1/2 must hold for a.e. t(0,T)

A.4. Proof of Lemma 2.6

Proof of Lemma 2.6. From theorem 2.3, we know that the distances W2 and Wm,2 are equivalent. Therefore, we can assume absolute continuity with respect to W2 . Further, by a standard rescaling argument (e.g. [53, Lemma 1.1.4] or [53, Lemma 8.1.3]), it is enough to prove equation (2.7) for 1 -Lipschitz curves (w.r.t. W2 ), i.e. we only need to consider absolutely continuous curves (μt,vt)CE(0,1;μν) such that

M|vt(x)|2dμt(x)=1for a.e. t(0,T).

For convenience, we shall set μt=μ0 for t0 and μt=μT for tT as well as vt=0 for t[0,T] . We define the function η:(x,t)M×12MW(x,y)dμt(y) for which

tη(t,x)={0if t[0,T],12MDyW(x,y),vt(y)dμt(y)else,

in the distributional sense. Using the mollifier gϵ , as described in [71, ch. C.5], one can smooth out η in the time direction by setting

ηϵ(t,x)η(τ,x)gϵ(tτ)dτ.

By [71, ch. C.5, Theorem 7 (iii)], we have that ηϵη pointwise, and with the use of the dominated convergence theorem with the upper bound |ηϵ|sup(x,y)M×M|W(x,y)|< , we calculate

E(μT)E(μ0)=MηdμTMηdμ0=limϵ0MηϵdμTMηϵdμ0.

We further have that

+>120TMMDyW(y,x),vt(y)dμt(y)dμt(x)dt=()limϵ0120TMMDyW(x,y),vtdμt(y)gϵ(tτ)(t)dτdμt(x)dt=()limϵ00TMη(τ,x)τgϵ(tτ)dτdμt(x)dt=limϵ00TMη(τ,x)tgϵ(tτ)dτdμt(x)dt=limϵ00TMtηϵ(t,x)dμt(x)dt,

where for () we use the definition of the distributional derivative and rearrange the integral using the Fubini–Tonelli theorem. To prove () , we need to define a piecewise constant approximation of μt . We fix a N τ=TN and set for k{1,N}

μ¯tμkτfor t[kτ,(k+1)τ),μ¯TμT.

Since μt is 1 -Lipschitz, we have W2(μt,μ¯t)τ for all t[0,T] . Then, we estimate

|0TMMDyW(x,y),vt(y)dμt(y)dμt(x)dt0TMMDyW(x,y),vt(y)dμt(y)dμ¯t(x)dt|0TM×MM|DyW(x1,y),vt(y)DyW(x2,y),vt(y)|dμt(y)dπt(x1,x2)dt0TM×MM|DyW(x1,y)DyW(x2,y)||vt(y)|dμt(y)dπt(x1,x2)dtC0TM×MM|x1x2||vt(y)|dμt(y)dπt(x1,x2)dt=C0T(M×M|x1x2|dπt(x1,x2))(M|vt(y)|dμt(y))dtC(0TW22(μt,μ¯t)dt0TM|vt(y)|2dμt(y)dt)1/2(A 11)=C(T0TW22(μt,μ¯t)dt)1/2C(T0Tτ2dt)1/2=CT2N,

where πtP(M×M) is the optimal transport plan between μt and μ¯t and || denotes the dual norm of || . (For more details on the static formulation of Wasserstein distances via optimal transport plans, we refer to [53, ch. 6]). We can argue similarly in the mollified case:

|0TMRMDyW(x,y),vτ(y)dμτ(y)gϵ(tτ)dτdμt(x)dt0TMRMDyW(x,y),vτ(y)dμτ(y)gϵ(tτ)dτdμ¯t(x)dt|0TM×MRM|DyW(x1,y),vτ(y)DyW(x2,y),vτ(y)|dμτ(y)gϵ(tτ)dτdπt(x1,x2)dt0TM×MRM|DyW(x1,y)DyW(x2,y)||vτ(y)|dμτ(y)gϵ(tτ)dτdπt(x1,x2)dtC0TM×MRM|x1x2||vτ(y)|dμτ(y)gϵ(tτ)dτdπt(x1,x2)dt=C0T(M×M|x1x2|dπt(x1,x2))(RM|vτ(y)|dμτ(y)gϵ(tτ)dτ)dtC0TW2(μt,μ¯t)RM|vτ(y)|dμτ(y)gϵ(tτ)dτdtCTN0TRM|vτ(y)|dμτ(y)gϵ(tτ)dτdt=CTNRM|vτ(y)|dμτ(y)0Tgϵ(tτ)dtdτCTNRM|vτ(y)|dμτ(y)dτ=CTN0TM|vτ(y)|dμτ(y)dτ(A 12)CTN0T(M|vτ(y)|2dμτ(y))1/2dτCT2N.

We denote C~=sup(x,y)M×M|DyW(x,y)|<+ and combine equations (A 11) and (A 12) to estimate

|0TMMDyW(x,y),vt(y)dμt(y)dμt(x)dt0TMRMDyW(x,y),vτ(y)dμτ(y)gϵ(tτ)dτ dμt(x)dt|2CT2N+|0TMMDyW(x,y),vt(y)dμt(y):=f(x,t)dμ¯t(x)dt0TMRMDyW(x,y),vτ(y)dμτ(y)gϵ(tτ)dτ:=fϵ(x,t)dμ¯t(x)dt|2CT2N+i=1N(i1)τ+ϵiτϵM|ffϵ|dμ¯t dt+(i1)τ(i1)τ+ϵM|ffϵ|dμ¯t dt+iτϵiτM|ffϵ|dμ¯t dt2CT2N+i=1N(i1)τ+ϵiτϵM|ffϵ|dμ¯t dt+(i1)τ(i1)τ+ϵM2C~ dμ¯t dt+iτϵiτM2C~ dμ¯t dt2CT2N+i=1N(i1)τ+ϵiτϵM|ffϵ|dμ¯t dt+4NϵC~δ3+i=1Nδ3N+δ3,

where, first, N is chosen such that N6CT2δ and, second, ϵ such that ϵδ12NC~ and for each i{1,...,N} , it holds (i1)τ+ϵiτϵM|ffϵ|dμ¯tdtδ3N (by lemma A.3). Therefore () is proven.

Finally, by lemma A.4, we obtain nϵC1(M×[0,T]) that we can use as a test function in equation (A 3) and send ϵ0 to obtain

E(μT)E(μ0)=Mη dμTMη dμ0=0TMtη dμt+MDη,vtdμt dt=0TM×MDxW(x,y),vt(x)dμt(x)dμt(y)dt.

Lemma A.3. Let f:M×[0,T] be Borel measurable and μP(M) with

abM|f|dμdt<for 0a<bT.

For

μab(A)μL(a,b)

it holds

fϵL1(μa+ϵbϵ)fL1(μab)andfϵfin L1(μa+ϵbϵ).

Proof. We adapt [71, ch. 5, Theorem 7] to our case and start by showing

fϵL1(μa+ϵbϵ)Ma+ϵbϵab|f(x,τ)|gϵ(tτ)dτdtdμ(x)=Mab|f(x,τ)|a+ϵbϵgϵ(tτ)dtdτdμ(x)=Mab|f(x,τ)|dτdμ(x)=fL1(μab).

We approximate f in L1(μab) by γCc(M×[a,b]) (see [70, Proposition 7.9]) and calculate

ffϵL1(μa+ϵbϵ)fγL1(μa+ϵbϵ)+γγϵL1(μa+ϵbϵ)+γϵfϵL1(μa+ϵbϵ)2fγL1(μab)+γγϵL1(μa+ϵbϵ).

From [71, ch. C.5, Theorem 7], we know that γϵγ for all (x,t)M×[a,b] because γ is continuous. Choosing γ such that fγL1(μab)<δ and using the dominated convergence theorem, we get lim supϵ0ffϵL1(μa+ϵbϵ)2δ . As δ can be chosen arbitrarily small, we obtain convergence.∎

Lemma A.4. We have ηϵC1(M×[0,T]) .

Proof. Let γ:VdU(x) be a smooth local chart for an open set U(x) containing x . Then, since W(x,y)C1(M×M) , the function zMziW(γ(z),y)dμτ is continuous in z and the product

(z,t)MziW(γ(z),y)dμτ(y)gϵ(tτ)

is continuous on V× . Taking any sequence (zn,tn)(z,t) , we can use the dominated convergence theorem to obtain

limntηϵ(γ(zn),tn)=limnRMziW(γ(zn),y)dμτ(y)gϵ(tnτ)dτ=RMziW(γ(z),y)dμτ(y)gϵ(tτ)dτ=tηϵ(γ(z),t).

An upper bound is given by the function supV×WziW(γ(z),y)χ[infntnϵ,supntn+ϵ](τ) . Thus, tηϵ(γ(z),t) is continuous in V×[0,T] . With the same argument, a similar statement can be shown for

tηϵ(x,t)=RMW(x,y)dμτ(y)tgϵ(tτ)(t)dτ.

By [72, Theorem 2.8], it follows that ηϵ(t,γ(z))C1(V×[0,T]) and since the local chart was chosen arbitrarily ηϵC1(M×[0,T]) .∎

Appendix B. Spherical coordinates

For many computations in §4, we use spherical coordinates. Up to small notational changes, we use the definition provided in [73]. We define the coordinate transform Xn:φ[0,π]n2×[0,2π]Sn1 for φ[0,π]n2×[0,2π] as

Xn(φ)=cos(φ1)e1+i=2n1cos(φi)j=1i1sin(φj)ei+j=in1sin(φi)en.

Here and in the following, ein denotes the i th standard basis vector.

The Jacobian determinant is given by

JXn(φ)=i=1n2sinn1i(φi).

To highlight the recursive character of Xn with respect to n , we further note that

Xn(φ)1^=sin(φ1)X n1(φ1^)andJXn(φ)=sinn2(φ1)JXn1(φ1^),

where the index 1^ denotes that we drop the first element, i.e. for φn1 , φ1^=i=2n1φiei1 . A practical consequence of this property is the recursive computation formula for the Hausdorff measure of the n -dimensional sphere.

Lemma B.1. Denote |Sn1|:=Hn(Sn1) . For n2 , it holds that

|Sn1|=|Sn2|0πsinn2φdφ.

Proof. For n=2 , the proof follows from a simple computation and the fact that |S0|=2 and |S1|=2π . For n>2 , we have

|Sn1|=[0,π]n2×[0,2π]JXn(φ)dφ=0πsinn2φ1JXn1(φ1^)dφ=0πsinn2φdφ[0,π]n3×[0,2π]JXn1(ψ)dψ=|§n2|0πsinn2φdφ,

where we use the recursive property of the Jacobian determinant. ∎

B.1. Definition using Givens rotations

Spherical coordinates can equivalently be defined using Givens rotations (see e.g. [74, ch. 5.1.8]). A Givens rotation for an angle φ[0,2π) and indices i,jn with ij is determined by the rotation matrix G(i,j,φ)n×n :

G(i,j,φ)k,l={cos(φ)if k=l=i or k=l=j,1if k=li and k=lj,sin(φ)if k=i,l=j,sin(φ)if k=j,l=i,0otherwise.

Applying G(i,j,φ)T to a vector xRn corresponds to a counterclockwise rotation of x by the angle φ in the (i,j) -plane. For a given vector of angles φ[0,π]n2×[0,2π] , we can thus construct the matrix

R(φ)=G(n1,n,φn1)G(2,3,φ2)G(1,2,φ1)G(2,3,φ2)TG(n1,n,φn1)T. (B 1)

The rotation matrix R(φ) can be written as a two-dimensional rotation of angle φ1 in the (e1,Xn1(φ1^)) -plane, as the following lemma shows.

Lemma B.2. Let R(φ) be the rotation matrix as described in equation (B 1) . Then, it holds that

R(φ)=UG(1,2,φ1)UT,

with UUT=Id , U1,=e1 and U2,=(0,Xn1(φ1^))T .

Proof. For n=2 , the statement can be verified by inserting U=Id and the definition of R(φ) . For n>2 , we define

U=G(2,3,φ2)G(n1,n,φn1).

With this choice of U , R(φ) has the claimed form and UUT=Id due to the orthogonality of Givens matrices. It remains to show that the first two rows of U fulfil U1,=e1 and U2,=(0,Xn1(φ1^))T . For n=3 , U reduces to

U=G(2,3,φ2)=(1000cosφ2sinφ20sinφ2cosφ2),

and clearly, U1,=e1 and U2,=(0,cosφ2,sinφ2)T=(0,X2(φ2))T . For n>3 , the proof follows from induction over n .∎

Corollary B.3. Let x=Xn(φ) , x~=(0,Xn1(φ1^)) then

R(φ)Ty=y(ye1)e1(yx~)x~+(cos(φ1)(ye1)sin(φ1)(yx~))e1+(sin(φ1)(ye1)+cos(φ1)(yx~))x~

In particular, if ye1=0 it holds that

R(φ)Ty=y(yx~)x~+(yx~)(sin(φ1)e1+cos(φ1)x~).

With the above results, we obtain

Xn(φ)=R(φ)Te1,

and since Givens matrices are orthonormal, it also holds that

R(φ)Xn(φ)=e1.

We can also therefore consider rotated spherical coordinates

Xnθ(φ)=R(θ)TXn(φ)

for a reference point x=Xn(θ) , with the same Jacobian determinant as before, i.e. JXnθ(φ)=JXn(φ) .

Appendix C. Proofs for Section 4

C.1. Proof of Lemma 4.10

Lemma 4.7 (cont.) Let n>2 . The uniform distribution μ=1|Sn1|Hn is a stationary point of E if and only if all eigenvalues {λi}i=1n of D have the same absolute value, i.e. |λi|=λ for some λ .

Proof. The proof for n>2 uses the same arguments as for n=2 ; however, the rotation corresponding to a translation of the angle in two dimensions is technically more complicated. We use the notation and techniques from appendix B (spherical coordinates Xn and rotations R ).

Again, we first fix xSn1 and consider the integral

Sn1exDyPxDydHn(y)=().

Similarly to the two-dimensional case, we choose φ[0,π]n2×[0,2π] such that

Xn(φ)=DxDx,

and therefore also

R(φ)Dx=DxR(φ)R(φ)Te1=Dxe1,

where e1=(1,0,,0)Tn denotes the first standard basis vector. We rewrite the integral using rotated spherical coordinates and substitute it into the above identity to obtain

()=[0,π]n2×[0,2π]exDR(φ)TXn(θ)PxDR(φ)TXn(θ)JXn(θ)dθ=[0,π]n2×[0,2π]eDxcos(θ1)(DR(φ)TXn(θ)Dxcos(θ1)x)JXn(θ)dθ,

where JXn denotes the Jacobian determinant of Xn . To reduce the above integral over the vector θ to an integral over only the first component θ1 , we write

R(φ)TXn(θ)=cos(θ1)R(φ)Te1+R(φ)T(0Xn(θ)1^)=cos(θ1)DxDx+sin(θ1)R(φ)T(0Xn1(θ1^)),

where the subscript 1^ denotes that we neglect the first component. Inserting this into () , we get

()=(D2x/DxDxx)0πeDxcos(θ1)cos(θ1)sinn2(θ1)dθ 1+0πeDxcos(θ1)sinn1(θ1)DR(θ)TSn2(0z)dHn1(z)=0dθ 1=C(n,Dx)(D2x/DxDxx),

and due to the symmetry of sine and cosine, we have that C(n,Dx)>0 for any n2 , Dx>0 . We can thus deduce that ()=0 if and only if x is an eigenvector of D2 , exactly as in the case n=2 . This holds true for μ -almost all xSn1 if and only if all eigenvalues of D have the same absolute value, which then automatically yields dED(μ,V)=0 .

Again, it remains to show that this is also necessary. Without loss of generality, we assume |λ1|>|λ2| and λ1 and λ2 to be the eigenvalues of largest, respectively second largest, absolute value corresponding to the eigenvectors z1 , respectively z2 .

From here, the strategy is the exact same as in the two-dimensional case, which we restate here for completeness. The factor (D2x/DxDxx)z2 is strictly negative on the set

A={xSn1|(xz1)(|λ2/λ1|,1),(xz2)>0}.

Since μ(A)>0 , we can find a Lipschitz continuous V such that Vz1=0 for μ -a.e. on Sn1 and

V(x)z2={>0for a.e. xA,=0for a.e. xSn1A.

For all such V , it holds that dED(μ,V)>0 , which concludes the proof.∎

C.2. Proof of Lemma 4.12

Lemma 4.8 (cont.) Let n2 , and μ0=1|Sn1|Hn . Then, it holds that

(C 1)Sn1exyx dμ0(x)=C1y

for any ySn1 , where the constant C1 is positive and depends only on the dimension n .

Proof. The proof for n>2 goes along the lines of the proof for n=2 . However, the rotation corresponding to a translation of the angle in two dimensions is technically more complicated in higher dimensions. For an introduction to rotated spherical coordinates used in this proof, we refer the reader to appendix B.

We first fix ySn1 and choose θn1 such that y=Xn(θ) . We proceed to write the integral using rotated spherical coordinates x=Xnθ(φ) and obtain

Sn1exyxj dμ0(x)=1|Sn1|[0,π]n2×[0,2π]eXnθ(φ)Xn(θ)(Xnθ(φ))iJXn(φ)dφ=().

Substituting the expressions for Xn and Xnθ yields

Xnθ(φ)Xn(θ)=R(θ)TXn(φ)Xn(θ)=Xn(φ)R(θ)Xn(θ)=Xn(φ)e1=cos(φ1).

In addition, we note that we can write any x=x1e1+(0,x2,,xn)T and see that

Xnθ(φ)=R(θ)TXn(φ)=R(θ)Tcos(φ1)e1+R(θ)T(0Xn(φ)1^)=cos(φ1)y+sin(φ1)R(θ)T(0Xn1(φ1^)),

where e1=(1,0,,0)Tn denotes the first standard basis vector. Substituting the above equality into the integral, we derive

()=1|Sn1|[0,π]n2×[0,2π]ecos(φ1)[cos(φ1)y+sin(φ1)R(θ)T(0Xn1(φ1^))]jJXn(φ)dφ=yi|Sn2||Sn1|0πecosφcosφsinn2φdφ+1|Sn1|0πecosφsinn1φ[R(θ)TSn2(0z)dHn1(z)=0]jdφ.

The proof now follows from choosing the constant:

C1=|Sn2||Sn1|0πecosφcosφsinn2φdφ=|Sn2||Sn1|0π/2sinn2φcosφsinh(cosφ)dφ,

which is positive for all n2 since the function ttsinht is positive for t>0 and both sine and cosine are positive for φ(0,π/2) .∎

C.3. Proof of Lemma 4.13

Lemma 4.9 (cont.) Let n2 , and μ0=1|Sn1|Hn . Then, for all ySn1 and 1in , it holds that

(C 2)Sn1exyxi2 dμ0(x)=C2yi2+C3,

where the constants C2 and C3 are positive and depend only on the dimension n .

Proof. Using the same arguments as in the previous proof, we obtain

0πexyxj2 dμ0(x)=yj2|Sn2||Sn1|Sn1ecosφcos2φsinn2φ dφ(C 3)+1|Sn1|0πecosφsinnφSn2[R(θ)T(0z)]j2 dHn1(z)dφ,

where the mixed term containing xiyi vanishes due to symmetry. Since the second term still depends on y due to the rotation, we write y~=(0,Xn1(θ1^)) and decompose z~=(0,z) into its rotation-invariant and rotation-variant part. More precisely, we use corollary B.3 to get

R(θ)Tz~=z~(y~z~)y~+(y~z~)[sin(θ1)e1+cos(θ1)y~]

and thus

[R(θ)Tz~]2=(z~(y~z~)y~)2+(y~z~)2(sin2(θ1)e1+cos2(θ1)y~2)+2cos(θ1)(z~(y~z~)y~)(y~z~)y~.

Making use of the trigonometric identity cos2(θ1)+sin2(θ1)=1 , we get

[R(θ)Tz~]2=z~2+(y~z~)2e1+2(y~z~)2y~22(y~z~)z~y~+2cos(θ1)(z~(y~z~)y~)(y~z~)y~(y~z~)2(cos2(θ1)e1+sin2(θ1)y~2),(C 4)=z~2+(y~z~)2(e1y2)+2(cos(θ1)1)(z~(y~z~)y~)(y~z~)y~,

where in the last step, we use the fact that y2=cos2θ1e1+sin2θ1y~2 . To prove that the integral over the expression in equation (C 4) can be written as claimed, we observe that for all j=2,,n

Sn2z~j2dHn1(z)=:C~,

where C~ is positive and depends only on n , and therefore, also Sn2(z~y~)2dHn1(z)=C~y~2=C~ . With this, we derive that

(C 5)[Sn2z~2+(y~z~)2(e1y2)dHn1(z)]j=C~(1yj2)

for all j=1,,n and it remains to show that for any 1jn

(C 6)Sn2[(z~(y~z~)y~)(y~z~)y~]jdH n1(z)=0.

The case j=1 is trivial as y~1=z~1=0 . For 2jn , we write out the integrand and obtain

[(z~(y~z~)y~)(y~z~)y~]j=(k=1nz~ky~k)z~jy~j(k,l=1nz~kz~ly~ky~l)y~j2=(k=1,kjnz~jz~ky~jy~k)(k,l=1,klnz~kz~ly~ky~l)y~j2+(z~j2(z~y~)2)y~j2,

where we can use the same argument as for equation (C 5) to show that the last summand integrates to zero. Since also Sn2z~jz~kdHn1(z)=0 for any jk , we derive equation (C 6). Togethepossible to interpret it as a forward Euler discretizationr with equations (C 4) and (C 5), this yields

Sn2[R(θ)Tz~]j2dHn1(z)=C~(1yj2).

The statement now follows from substituting the above into equation (C 3), with constants given by

C2=|Sn2||Sn1|0πecosφcos2φsinn2φdφC3,C3=C~|Sn1|0πecosφsinnφdφ.

Since C~>0 for all n2 , it directly follows that C3>0 . To show that C2>0 for all n2 , we first show that C~=|Sn2|/(n1) . For n=2 , this follows directly from C~=|S0|=2 . For n>2 , we have

C~=|Sn3|0πcos2φsinn3φdφ=|Sn3|0πsinn3φsinn1φdφ.

Using integration by parts, we further derive that

0πsinn1φdφ=(n2)/(n1)sinn3φdφ.

As shown in lemma B.1, the recursive form of the Jacobian determinant of spherical coordinates yields that

|Sn2|=|Sn3|0πsinn3φdφ

Combining these equalities, we see that

C~=(1((n2)/(n1)))|Sn2|=|Sn2|/(n1),

and therefore, with integration by parts, we get

C2=|Sn2||Sn1|0πecosφ[cos2φsinn2φ11nsinnφ]dφ=|Sn2||Sn1|0πecosφsinn2φcosφ(cosφ1)dφ.

Due to the symmetry of sine and cosine, we get

C2=|Sn2||Sn1|0π/2ecosφsinn2φcosφ(cosφ1)+ecosφsinn2φcosφ(cosφ+1)dφ=2|Sn2||Sn1|0π/2sinn2φ(cos(φ)cosh(cos(φ))sinh(cos(φ)))>0,

where the positivity follows from the fact that the function ttcosh(t)sinh(t) is positive for t>0 and both sine and cosine are positive for φ(0,π/2) .∎

Contributor Information

Martin Burger, Email: martin.burger@desy.de.

Samira Kabri, Email: samira.kabri@desy.de.

Yury Korolev, Email: ymk30@bath.ac.uk.

Tim Roith, Email: tim.roith@desy.de.

Lukas Weigand, Email: lukas.weigand@desy.de.

Data accessibility

This article has no additional data.

Declaration of AI use

We have not used AI-assisted technologies in creating this article.

Authors’ contributions

M.B.: conceptualization, formal analysis, funding acquisition, investigation, methodology, supervision, writing—original draft; S.K.: formal analysis, investigation, methodology, visualization, writing—original draft; Y.K.: conceptualization, formal analysis, funding acquisition, investigation, methodology, writing—original draft; T.R.: funding acquisition, investigation, methodology, software, visualization, writing—original draft; L.W.: formal analysis, investigation, methodology, writing—original draft.

All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Conflict of interest declaration

We declare we have no competing interests.

Funding

M.B. and T.R. acknowledge funding by the German Ministry of Science and Technology (BMBF) under grant agreement No. 01IS24072A (COMFORT). M.B., S.K., T.R. and L.W. acknowledge support from DESY (Hamburg, Germany), a member of the Helmholtz Association HGF. This research was supported in part through the Maxwell computational resources operated at Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany. M.B. and S.K. acknowledge support from the German Research Foundation, project BU 2327/19-1. M.B. and L.W. acknowledge support from the German Research Foundation, project BU 2327/20-1. Y.K. acknowledges support from the German Research Foundation as visiting fellow within the priority programme Foundations of Deep Learning. Part of this study was carried out while S.K. and T.R. were visiting the California Institute of Technology, supported by the DAAD grant for project 57698811 'Bayesian Computations for Large-scale (Nonlinear) Inverse Problems in Imaging'. Y.K. acknowledges the support of the EPSRC (Fellowship EP/V003615/2 and Programme Grant EP/V026259/1). S.K. and Y.K. are grateful for the hospitality of the University of Bath during the workshop 'Machine Learning in Infinite Dimensions', sponsored by the ICMS, LMS, IMI Bath, ProbAI and Maths4DL, where part of this work was undertaken.

References

  • 1. OpenAI . 2023. GPT-4 technical report. arXiv:2303.08774. ( 10.48550/arXiv.2303.08774) [DOI]
  • 2. Wu J, Gan W, Chen Z, Wan S, Philip SY. 2023. Multimodal large language models: a survey. In 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, pp. 2247–2256. IEEE. ( 10.1109/BigData59044.2023.10386743) [DOI] [Google Scholar]
  • 3. Fields C, Kennington C. 2023. Vision language transformers: a survey. arXiv:2307.03254. ( 10.48550/arXiv.2307.03254) [DOI]
  • 4. Esser P, et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-First International Conference on Machine Learning. Vienna, Austria: PMLR. [Google Scholar]
  • 5. Abramson J, et al. 2024. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630 , 493–500. ( 10.1038/s41586-024-07487-w) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Jumper J, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596 , 583–589. ( 10.1038/s41586-021-03819-2) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Vuckovic J, Baratin A, Combes RT. 2020. A mathematical theory of attention. arXiv 2007.02876. ( 10.48550/arXiv.2007.02876) [DOI] [Google Scholar]
  • 8. Sander ME, Ablin P, Blondel M, Peyré G. 2022. Sinkformers: transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pp. 3515–3530. JMLR. [Google Scholar]
  • 9. Geshkovski B, Letrouit C, Polyanskiy Y, Rigollet P. 2023. A mathematical perspective on transformers. arXiv 2312.10794. ( 10.48550/arXiv.2312.10794) [DOI] [Google Scholar]
  • 10. Calvello E, Kovachki NB, Levine ME, Stuart AM. 2024. Continuum attention for neural operators. arXiv: 2406.06486. ( 10.48550/arXiv.2406.06486) [DOI] [Google Scholar]
  • 11. Nguyen TM, Nguyen T, Ho N, Bertozzi AL, Baraniuk RG, Osher SJ. 2024. A primal-dual framework for transformers and neural networks. arXiv 2106.01506. ( 10.48550/arXiv.2406.13781) [DOI] [Google Scholar]
  • 12. Wright MA, Gonzalez J. 2021. Transformers are deep infinite-dimensional non-mercer binary kernel machines. arXiv 2106.01506. ( 10.48550/arXiv.2106.01506) [DOI] [Google Scholar]
  • 13. Criscitiello C, Rebjock Q, McRae AD, Boumal N. 2024. Synchronization on circles and spheres with nonlinear interactions. arXiv 2405.18273. ( 10.48550/arXiv.2405.18273) [DOI] [Google Scholar]
  • 14. Alcalde A, Fantuzzi G, Zuazua E. 2024. Clustering in pure-attention hardmax transformers and its role in sentiment analysis. arXiv Preprint 2407.01602. ( 10.48550/arXiv.2407.01602) [DOI] [Google Scholar]
  • 15. Geshkovski B, Rigollet P, Ruiz-Balet D. 2024. Measure-to-measure interpolation using transformers. arXiv Preprint 2411.04551. ( 10.48550/arXiv.2411.04551) [DOI] [Google Scholar]
  • 16. Kan K, Li X, Osher S. 2025. OT-Transformer: a continuous-time transformer architecture with optimal transport regularization. arXiv Preprint 2501.18793. ( 10.48550/arXiv.2501.18793) [DOI] [Google Scholar]
  • 17. Viswanathan K, Gardinazzi Y, Panerai G, Cazzaniga A, Biagetti M. 2025. The geometry of tokens in internal representations of large language models. arXiv Preprint 2501.10573. ( 10.48550/arXiv.2501.10573) [DOI] [Google Scholar]
  • 18. Abella ÁR, Silvestre JP, Tabuada P. 2024. The asymptotic behavior of attention in transformers. arXiv Preprint 2412.02682. ( 10.48550/arXiv.2412.02682) [DOI] [Google Scholar]
  • 19. Alcalde A, Fantuzzi G, Zuazua E. 2025. Exact sequence classification with hardmax transformers. arXiv Preprint 2502.02270. ( 10.48550/arXiv.2502.02270) [DOI] [Google Scholar]
  • 20. Lu Y, Li Z, He D, Sun Z, Dong B, Qin T, Wang L, Liu T. Understanding and Improving Transformer From a Multi Particle Dynamic System Point of View. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations. [Google Scholar]
  • 21. Dutta S, Gautam T, Chakrabarti S, Chakraborty T. 2021. Redesigning the transformer architecture with insights from multi-particle dynamical systems. Adv. Neural Inf. Process. Syst. 34 , 5531–5544. [Google Scholar]
  • 22. Chizat L, Bach F. 2018. on the global convergence of gradient descent for over-parameterized models using optimal transport. Adv. Neural Inf. Process. Syst. 31 , 3040–3050. [Google Scholar]
  • 23. Ding Z, Chen S, Li Q, Wright S. 2021. On the global convergence of gradient descent for multi-layer resnets in the mean-field regime. arXiv 2110.02926. ( 10.48550/arXiv.2110.02926) [DOI] [Google Scholar]
  • 24. Hegselmann R, Krause U. 2002. Opinion dynamics and bounded confidence models, analysis and stimulations. J. Artif. Soc. Soc. Simulation 5 . [Google Scholar]
  • 25. Gómez-Serrano J, Graham C, Le Boudec JY. 2012. The bounded confidence model of opinion dynamics. Math. Model. Methods Appl. Sci. 22 , 1150007. ( 10.1142/s0218202511500072) [DOI] [Google Scholar]
  • 26. Piccoli B, Rossi F. 2021. Generalized solutions to bounded-confidence models. Math. Model. Methods Appl. Sci. 31 , 1237–1276. ( 10.1142/s0218202521400054) [DOI] [Google Scholar]
  • 27. Bruno G, Pasqualotto F, Agazzi A. 2024. Emergence of meta-stable clustering in mean-field transformer models. arXiv Preprint 2410.23228. ( 10.48550/arXiv.2410.23228) [DOI] [Google Scholar]
  • 28. Geshkovski B, Koubbi H, Polyanskiy Y, Rigollet P. 2024. Dynamic metastability in the self-attention model. arXiv Preprint 2410.06833. ( 10.48550/arXiv.2410.06833) [DOI] [Google Scholar]
  • 29. Burger M, Erbar M, Hoffmann F, Matthes D, Schlichting A. 2025. Covariance-modulated optimal transport and gradient flows. Arch. Ration. Mech. Anal. 249 . ( 10.1007/s00205-024-02065-w) [DOI] [Google Scholar]
  • 30. Duncan A, Nüsken N, Szpruch L. 2023. On the Geometry of stein variational gradient descent. J. Mach. Learn. Res. 24 , 1–39. [Google Scholar]
  • 31. Li W. 2021. Hessian metric via transport information geometry. J. Math. Phys. 62 . ( 10.1063/5.0012605) [DOI] [Google Scholar]
  • 32. Lisini S, Matthes D, Savaré G. 2012. Cahn–Hilliard and thin film equations with nonlinear mobility as gradient flows in weighted-Wasserstein metrics. J. Differ. Equ. 253 , 814–850. ( 10.1016/j.jde.2012.04.004) [DOI] [Google Scholar]
  • 33. Burger M, Di Francesco M. 2008. Large time behavior of nonlocal aggregation models withnonlinear diffusion. Netw. Heterog. Media 3 , 749–785. ( 10.3934/nhm.2008.3.749) [DOI] [Google Scholar]
  • 34. Cañizo JA, Ramos-Lora A. 2024. Discrete minimizers of the interaction energy in collective behavior: a brief numerical and analytic review. arXiv 2403.00594. ( 10.48550/arXiv.2403.00594) [DOI] [Google Scholar]
  • 35. Carrillo JA, Chipot M, Huang Y. 2014. On global minimizers of repulsive–attractive power-law interaction energies. Phil. Trans. R. Soc. A 372 , 20130399. ( 10.1098/rsta.2013.0399) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Carrillo J, Figalli A, Patacchini SF. 2017. Geometry of minimizers for the interaction energy with mildly repulsive potentials. Ann. De L’IHP Anal. Non Linéaire 34 , 1299–1308. ( 10.1016/J.ANIHPC.2016.10.004) [DOI] [Google Scholar]
  • 37. Shu R. 2024. Wasserstein-infinity stability and mean field limit of discrete interaction energy minimizers. arXiv 2407.18395. ( 10.48550/arXiv.2407.18395) [DOI] [Google Scholar]
  • 38. Simione R, Slepčev D, Topaloglu I. 2015. Existence of ground states of nonlocal-interaction energies. J. Stat. Phys. 159 , 972–986. ( 10.1007/s10955-015-1215-z) [DOI] [Google Scholar]
  • 39. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. [Google Scholar]
  • 40. Bahdanau D. 2014. Neural machine translation by jointly learning to align and translate. arXiv 1409.0473. ( 10.48550/arXiv.1409.0473) [DOI] [Google Scholar]
  • 41. Castin V, Ablin P, Peyré G. Proceedings of Machine Learning Research (eds Salakhutdinov R, Kolter Z, Heller K, Weller A, Oliver N, Scarlett J, Berkenkamp F). In Proceedings of the 41stInternational Conference on Machine Learning, vol. 235, pp. 5817–5840, Vienna, Austria: PMLR. [Google Scholar]
  • 42. Castin V, Ablin P, Carrillo J, Peyré G. 2025. A unified perspective on the dynamics of deep transformers. arXiv Preprint 2501.18322. ( 10.48550/arXiv.2501.18322) [DOI] [Google Scholar]
  • 43. Karagodin N, Polyanskiy Y, Rigollet P. 2024. Clustering in causal attention masking. arXiv Preprint 2411.04990. ( 10.48550/arXiv.2411.04990) [DOI] [Google Scholar]
  • 44. Ioffe S, Szegedy C. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (eds Bach F, Blei D), vol. 37, pp. 448–456. Lille, France: PMLR.
  • 45. Lei Ba J, Kiros JR, Hinton GE. 2016. Layer normalization. arXiv preprint 1607.06450.
  • 46. Touvron H, et al. 2023. LLaMA: open and efficient foundation language models. arXiv preprint 2302.13971. (doi:10.48550/arXiv.2302.13971)
  • 47. Zhang B, Sennrich R. 2019. Root mean square layer normalization. Adv. Neural Inf. Process. Syst. 32, 12381–12392.
  • 48. He K, Zhang X, Ren S, Sun J. 2016. Identity mappings in deep residual networks. In Computer vision – ECCV 2016 (eds Leibe B, Matas J, Sebe N, Welling M), pp. 630–645. Cham, Switzerland: Springer International Publishing. (doi:10.1007/978-3-319-46493-0_38)
  • 49. Weinan E. 2017. A proposal on machine learning via dynamical systems. Commun. Math. Stat. 5, 1–11. (doi:10.1007/s40304-017-0103-z)
  • 50. Haber E, Ruthotto L. 2018. Stable architectures for deep neural networks. Inverse Probl. 34, 014004. (doi:10.1088/1361-6420/aa9a90)
  • 51. Chen RT, Rubanova Y, Bettencourt J, Duvenaud DK. 2018. Neural ordinary differential equations. Adv. Neural Inf. Process. Syst. 31, 6571–6583.
  • 52. Thorpe M, van Gennip Y. 2023. Deep limits of residual neural networks. Res. Math. Sci. 10, 6. (doi:10.1007/s40687-022-00370-y)
  • 53. Ambrosio L, Gigli N, Savaré G. 2008. Gradient flows: in metric spaces and in the space of probability measures, 2nd edn. Lectures in Mathematics ETH Zürich. Basel, Switzerland: Birkhäuser. (doi:10.1007/978-3-7643-8722-8)
  • 54. Benamou JD, Brenier Y. 2000. A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem. Numer. Math. 84, 375–393. (doi:10.1007/s002110050002)
  • 55. Deffuant G, Neau D, Amblard F, Weisbuch G. 2000. Mixing beliefs among interacting agents. Adv. Complex Syst. 3, 87–98. (doi:10.1142/s0219525900000078)
  • 56. Bilyk D, Matzke RW, Vlasiuk O. 2022. Positive definiteness and the Stolarsky invariance principle. J. Math. Anal. Appl. 513, 126220. (doi:10.1016/j.jmaa.2022.126220)
  • 57. Fasshauer GE. 2011. Positive definite kernels: past, present and future. In Dolomites Research Notes on Approximation, special issue ‘Kernel functions and meshless methods’ (eds De Marchi S, Buhmann MD, Plonka-Hoch G).
  • 58. Bilyk D, Dai F. 2016. Geodesic distance Riesz energy on the sphere. arXiv preprint 1612.08442. (doi:10.48550/arXiv.1612.08442)
  • 59. Burger M, Di Francesco M, Franek M. 2013. Stationary states of quadratic diffusion equations with long-range attraction. Commun. Math. Sci. 11, 709–738. (doi:10.4310/cms.2013.v11.n3.a3)
  • 60. Gómez-Castro D. 2024. Beginner’s guide to aggregation-diffusion equations. SeMA J., 1–57. (doi:10.1007/s40324-024-00350-y)
  • 61. van Rossum G, Drake FL Jr. 1995. Python tutorial. Amsterdam, The Netherlands: Centrum voor Wiskunde en Informatica.
  • 62. Harris CR, et al. 2020. Array programming with NumPy. Nature 585, 357–362. (doi:10.1038/s41586-020-2649-2)
  • 63. Virtanen P, et al. 2020. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272. (doi:10.1038/s41592-019-0686-2)
  • 64. Paszke A, et al. 2019. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8026–8037.
  • 65. Marchuk G, Lebedev VI. 1986. Numerical methods in the theory of neutron transport. New York, NY, USA: Harwood Academic Publishers.
  • 66. Kivinen J, Warmuth MK. 1997. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput. 132, 1–63. (doi:10.1006/inco.1996.2612)
  • 67. Lee JM. 2013. Introduction to smooth manifolds, pp. 1–31. New York, NY, USA: Springer New York. (doi:10.1007/978-1-4419-9982-5_1)
  • 68. Ambrosio L, Fusco N, Pallara D. 2000. Functions of bounded variation and free discontinuity problems, pp. 116–210. Oxford, UK: Oxford University Press. (doi:10.1093/oso/9780198502456.003.0003)
  • 69. Dolbeault J, Nazaret B, Savaré G. 2009. A new class of transport distances between measures. Calc. Var. Partial Differ. Equ. 34, 193–231. (doi:10.1007/s00526-008-0182-5)
  • 70. Folland GB. 1999. Real analysis: modern techniques and their applications. Hoboken, NJ, USA: John Wiley & Sons.
  • 71. Evans LC. 2010. Partial differential equations, 2nd edn. Providence, RI, USA: American Mathematical Society. (doi:10.1090/gsm/019)
  • 72. Spivak M. 2018. Calculus on manifolds: a modern approach to classical theorems of advanced calculus. Boca Raton, FL, USA: CRC Press.
  • 73. Blumenson LE. 1960. A derivation of n-dimensional spherical coordinates. Am. Math. Mon. 67, 63–66. (doi:10.2307/2308932)
  • 74. Golub GH, Van Loan CF. 2013. Matrix computations, 4th edn. Baltimore, MD, USA: Johns Hopkins University Press.

Data Availability Statement

This article has no additional data.

