Abstract
The aim of this article is to provide a mathematical analysis of transformer architectures using a self-attention mechanism with layer normalization. In particular, observed patterns in such architectures resembling either clusters or uniform distributions pose a number of challenging mathematical questions. We focus on a special case that admits a gradient flow formulation in the space of probability measures on the unit sphere under a special metric, which allows us to give at least partial answers in a rigorous way. The arising mathematical problems resemble those recently studied in aggregation equations but with additional challenges emerging from restricting the dynamics to the sphere and the particular form of the interaction energy. We provide a rigorous framework for studying the gradient flow, which also suggests a possible metric geometry to study the general case (i.e. one that is not described by a gradient flow). We further analyse the stationary points of the induced self-attention dynamics. The latter are related to stationary points of the interaction energy in the Wasserstein geometry, and we further discuss energy minimizers and maximizers in different parameter settings.
This article is part of the theme issue ‘Partial differential equations in data science’.
Keywords: transformer architectures, self-attention dynamics, gradient flows, interaction energies, stationary states
1. Introduction
Transformer architectures and the associated (self-)attention dynamics have recently attracted strong interest due to the success of artificial intelligence applications relying on them. Examples include large language models such as GPT-4 [1], multimodal large language models such as vision–language transformers [2,3], text-to-image generation such as Stable Diffusion [4] and protein folding with AlphaFold [5,6], which won the Nobel Prize in Chemistry in 2024.
The practical success of transformers and (self-)attention dynamics calls for a detailed mathematical understanding, the development of which started recently in [7–19].
An interesting viewpoint on such dynamics is to interpret it as an interacting particle system [8,20,21], which allows for natural continuous-time and mean-field limits. The latter approach already provided valuable insights into feed-forward neural networks and their training dynamics (cf. [22,23]). In the context of transformers, this viewpoint also provides interesting (so far formal [9]) connections to gradient flows and the minimization of interaction energy for the particle measures. The latter is a topic of great recent interest due to various applications in biology and social interactions. Indeed, the self-attention dynamics in transformers share certain mathematical similarities with models used in opinion formation, which also exhibit similar emergence of clusters in certain cases [24–26]. In this work, we focus on cluster formation in the infinite time horizon. However, we note that the formation of metastable states is of special interest. For the case of isotropic interaction, metastability was studied in [27,28].
In this article, we proceed with the work in [9] on analysing transformer dynamics with layer normalization, focusing in particular on the case when the underlying dynamics has a gradient flow structure. Indeed, the continuum limit of the self-attention dynamics leads to a Wasserstein-type gradient flow for probability measures on the unit sphere of the form
(1.1) $\qquad \partial_t \mu = \nabla_{S^{d-1}} \cdot \left( \frac{\mu}{\kappa[\mu]}\, \nabla_{S^{d-1}} \frac{\delta E}{\delta \mu} \right),$
where $\nabla_{S^{d-1}}$ and $\nabla_{S^{d-1}} \cdot$ are the tangential gradient and divergence, respectively, and $\kappa[\mu]$ is a non-local mobility. The underlying energy in this case is of the form
(1.2) $\qquad E(\mu) = \iint_{S^{d-1} \times S^{d-1}} e^{\langle x, D y \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y),$
with $D \in \mathbb{R}^{d \times d}$ being a symmetric matrix and $\frac{\delta E}{\delta \mu}$ denoting the first variation of $E$. Since $D$ is symmetric and hence diagonalizable, we can equivalently assume that $D$ is a diagonal matrix, since we can use an orthogonal diagonalization and a corresponding transfer of variables to the eigenvectors, which leaves the unit ball unchanged. This will be used in several instances to simplify notation. It also permits a more detailed study of stationary patterns, in particular minimizers and maximizers of the energy.
Compared to the existing literature on such gradient flows, there are three distinct features that motivate our study, namely:
— restriction of the dynamics to the unit sphere (a consequence of the layer normalization);
— non-local mobility (a consequence of the self-attention mechanism), which is related to but still distinctly different from other variations of Wasserstein gradient flows studied recently (cf. [29–32]);
— multiplicative coupling of states in the interaction energy, as opposed to commonly used interaction potentials depending only on the difference of the states (cf., e.g. [33–38]).
We make the gradient flow, formally introduced in [9], rigorous, showing that the transport distance with non-local mobilities is well defined, studying energy dissipation properties of the associated gradient flow and describing the large-time behaviour of the dynamics, specifically the convergence to stationary solutions, at least along subsequences. We further carry out a detailed study of energy minimizers and maximizers of $E$ (extending the previously studied case of $D$ being a multiple of the identity) as well as stationary points of the energy in a Wasserstein setting, which we prove to be equivalent to stationary solutions of the dynamics. For the energy minimizers, we obtain an interesting picture depending on the structure of $D$:
— If there is a positive eigenvalue that is the eigenvalue of maximal absolute value, then a Dirac delta concentrated in the direction of a corresponding eigenvector is a maximizer.
— If the smallest eigenvalue is negative, then only a Dirac delta concentrated in the direction of a corresponding eigenvector is a minimizer.
— If the smallest eigenvalue is zero, then any measure concentrated on the null space of $D$ is a minimizer.
— Dirac deltas concentrated in directions of arbitrary eigenvectors are stationary points. We also find some convex combinations of Dirac deltas that are stationary points.
— If the smallest eigenvalue is positive, we conjecture that the minimizer of the energy has full support on the unit sphere. To obtain some insight, we carry out a second-order asymptotic analysis of the minimizers for $D$ being a small perturbation of the identity.
We support our theoretical findings with several computational experiments and investigate the cases when the energy minimizers or maximizers cannot be characterized explicitly.
The rest of this work is organized as follows. In the remainder of the introduction, we recapitulate the simplified softmax transformer model introduced in [8], with additional layer normalization as considered in [9]. In §2, we provide a rigorous derivation of the gradient flow induced by the considered model. Sections 3 and 4 are dedicated to characterizing optimizers and stationary points of the studied energy, respectively. We support our findings by numerical experiments in §5 and summarize our results in §6.
(a). Self-attention
Transformer architectures [39] were developed in the field of natural language processing. Here, the input is usually a sentence, which is decomposed into a sequence of tokens (e.g. words or syllables). Each token (possibly along with its position in the sentence) is represented as a vector in a high-dimensional vector space. Apart from a conventional feed-forward component, the main feature of a transformer layer is the so-called attention mechanism. This mechanism implements interactions between tokens and was first introduced in [40] in the context of neural machine translation as an alternative to encoder–decoder approaches, the performance of which often deteriorates for large input lengths due to the use of latent representations of fixed dimensions.
Like [9], we shall focus on a simple yet widely used form of attention, the so-called self-attention. It can be formalized as follows: consider an input sequence $(x_1, \ldots, x_n)$, where each $x_i \in \mathbb{R}^d$ represents a $d$-dimensional token and $n$ denotes the number of tokens. The self-attention matrix $A \in \mathbb{R}^{n \times n}$ is given by
(1.3) $\qquad A_{ij} = \frac{e^{\langle x_i, D x_j \rangle}}{\sum_{k=1}^{n} e^{\langle x_i, D x_k \rangle}},$
where we assume $D \in \mathbb{R}^{d \times d}$ to be symmetric. The latter property does not necessarily hold for learned parameters in transformer architectures, but we expect the symmetric part to determine the asymptotic behaviour of the self-attention dynamics. Since the symmetry of $D$ allows one to interpret the dynamics as a gradient flow corresponding to a certain interaction energy, as observed in [9], it will allow us to analyse the asymptotic behaviour for this subclass; the study of the general case is left for future research. An important example of non-symmetric interaction is given by masked attention, which can be used to model causality. We refer to [41–43] for a mean-field interpretation of such dynamics.
By definition, the matrix $A$ is stochastic, i.e. each of its rows is a probability vector. Roughly speaking, the attention matrix determines how strongly a token is influenced by each other token. To determine how tokens influence each other, another matrix $V \in \mathbb{R}^{d \times d}$, called the value matrix, is used. The influence of $x_j$ on $x_i$ can then be written as $A_{ij} V x_j$ and the self-attention layer is given by
(1.4) $\qquad x_i \mapsto x_i + \sum_{j=1}^{n} A_{ij}\, V x_j.$
For our purposes, we assume $V = D$ or $V = -D$ since, in this case, one can show that the particles move along a gradient flow. The general case is the subject of future work.
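To make the construction concrete, the following minimal sketch (our own illustration, not code from the article's repository) computes the self-attention matrix of equation (1.3) and the layer output of equation (1.4) for a random token ensemble; all names and parameter values are our choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3                         # number of tokens, embedding dimension
X = rng.standard_normal((n, d))     # rows are the tokens x_1, ..., x_n

M = rng.standard_normal((d, d))
D = (M + M.T) / 2                   # symmetric interaction matrix
V = D                               # value matrix, V = D as assumed above

logits = X @ D @ X.T                # <x_i, D x_j> for all pairs (i, j)
A = np.exp(logits)
A /= A.sum(axis=1, keepdims=True)   # row-wise softmax: A is stochastic
assert np.allclose(A.sum(axis=1), 1.0)

Y = X + A @ X @ V.T                 # self-attention layer (1.4), applied row-wise
```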
(b). Normalization method
The normalization of intermediate values is a common practice in machine learning models. In the context of neural networks, so-called batch normalization [44] is a popular method to prevent gradients from blowing up and thus to stabilize (and to improve) the training. Since this form of normalization uses information from the entire training batch, [45] proposes layer normalization (LayerNorm), which translates the mean of an intermediate vector to zero and divides it by its standard deviation, and therefore does not depend on any other vector in the batch. While the original implementation of the transformer [39] uses LayerNorm, some of the more recent publications (e.g. Llama, [46]) use a simplified version called Root Mean Square Layer Normalization (RMSNorm) proposed in [47]. Up to a multiplication with learned weights $g \in \mathbb{R}^d$, called gain parameters, RMSNorm performs a projection on to the unit sphere $S^{d-1}$ (where in the following, we shall suppress the superscript $d-1$ and simply write $S$). More precisely, for $x \in \mathbb{R}^d \setminus \{0\}$ we write
$$\mathrm{RMSNorm}(x) = g \odot \frac{x}{\sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2}} = g \odot \sqrt{d}\, \frac{x}{\|x\|_2},$$
where, in practice, a division by zero is circumvented by adding a small value $\varepsilon > 0$ into the square root. In our setting, we can assume the norm to be strictly positive as we consider the dynamics in continuous time. Following the setting of [9], we focus on RMSNorm with fixed gain parameters $g_i = \frac{1}{\sqrt{d}}$ for all $i$ and denote the projection on to the unit sphere for $x \neq 0$ by
$$P(x) = \frac{x}{\|x\|_2}.$$
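As a small illustration (a sketch under the assumptions above, with the fixed gains $g_i = 1/\sqrt{d}$ for which RMSNorm coincides with the sphere projection):

```python
import numpy as np

def rms_norm(x, g, eps=1e-8):
    """RMSNorm: divide by the root mean square, then apply the gains g."""
    rms = np.sqrt(np.mean(x**2) + eps)   # eps avoids division by zero
    return g * x / rms

def project_sphere(x):
    """Projection P(x) = x / ||x||_2 onto the unit sphere."""
    return x / np.linalg.norm(x)

d = 4
x = np.array([1.0, -2.0, 0.5, 3.0])
g = np.full(d, 1.0 / np.sqrt(d))         # fixed gains: RMSNorm == projection
assert np.allclose(rms_norm(x, g), project_sphere(x), atol=1e-4)
```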
(c). Simplified transformer layer and time-continuous dynamics
Combining the attention layer with a normalization layer, we arrive at the following update step:
$$x_i^{(k+1)} = P\Big( x_i^{(k)} + \sum_{j=1}^{n} A_{ij}^{(k)}\, V x_j^{(k)} \Big), \qquad i = 1, \ldots, n,$$
where the projection $P$ is applied vector-wise to each row of the token matrix. For the sake of our analysis, we shall deviate from typical practical implementations of transformers and consider the architecture to be a composition of such layers which all share the same matrices $D$ and $V$ in equations (1.3) and (1.4). In [9], it was proposed to study the continuum limit of these updates. This approach has become a popular tool for analysing residual neural networks [48]: as discussed from various perspectives, e.g. in [49–52], the skip connections (i.e. the residual components) of the residual neural network architecture make it possible to interpret it as a forward Euler discretization of an ordinary differential equation. Introducing a time variable $t$ and a small time increment $h > 0$, we get
(1.5) $\qquad x_i(t + h) = P\Big( x_i(t) + h \sum_{j=1}^{n} A_{ij}(t)\, V x_j(t) \Big).$
At this point, the residual component is hidden in the attention layer and cannot easily be extracted since the projection is nonlinear. In the continuous time limit $h \to 0$, remembering that $\|x_i(t)\| = 1$ for any $t$, we arrive at the following system of differential equations:
(1.6) $\qquad \dot{x}_i(t) = \frac{\mathrm{d}}{\mathrm{d}h}\bigg|_{h = 0} P\Big( x_i(t) + h \sum_{j=1}^{n} A_{ij}(t)\, V x_j(t) \Big),$
where the spatial derivatives are understood as derivatives in $\mathbb{R}^d$. With a simple computation, one can further show that for any $x \in S$ and $u \in \mathbb{R}^d$ it holds that
$$\frac{\mathrm{d}}{\mathrm{d}h}\bigg|_{h = 0} P(x + h u) = (\mathrm{Id} - x \otimes x)\, u,$$
where, following [9], we define $P_x := \mathrm{Id} - x \otimes x$. Substituting this into equation (1.6), we arrive at the following dynamics:
(1.7a) $\qquad \dot{x}_i(t) = P_{x_i(t)}\Big( \sum_{j=1}^{n} A_{ij}(t)\, V x_j(t) \Big),$
which serve as a starting point of [9].
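In code, one layer of this stack corresponds to a projected explicit Euler step. The following sketch (our own, with illustrative parameter values) implements equation (1.5) for $V = D$, i.e. the energy-ascending case discussed below:

```python
import numpy as np

def euler_step(X, D, V, h):
    """One step of (1.5): x_i <- P( x_i + h * sum_j A_ij V x_j )."""
    A = np.exp(X @ D @ X.T)                  # unnormalized attention weights
    A /= A.sum(axis=1, keepdims=True)        # row-wise softmax
    Y = X + h * (A @ X @ V.T)                # residual attention update
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)  # project to the sphere

rng = np.random.default_rng(1)
n, d, h = 8, 3, 0.1
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # start on the sphere
D = np.diag([2.0, 1.0, 0.5])
for _ in range(1000):
    X = euler_step(X, D, D, h)               # V = D: particles tend to cluster
```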
(d). Interpretation as an evolution of measures
Instead of studying the dynamics of distinct particles, [9] propose to view equation (1.7) as an evolution of an empirical measure
$$\mu_t^n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i(t)}.$$
The right-hand side of equation (1.7a) can be understood as an integral with respect to $\mu_t^n$; for a generic probability measure $\mu$, this can be written as a measure-dependent velocity field:
(1.8) $\qquad v[\mu](x) = P_x\left( \frac{\int_S e^{\langle x, D y \rangle}\, V y \, \mathrm{d}\mu(y)}{\int_S e^{\langle x, D y \rangle}\, \mathrm{d}\mu(y)} \right),$
and equation (1.7a) turns into $\dot{x}_i(t) = v[\mu_t^n](x_i(t))$. With this notion, we recover the weak continuity equation formulated in [9]: for any test function $\varphi$, one has
(1.9) $\qquad \frac{\mathrm{d}}{\mathrm{d}t} \int_S \varphi \, \mathrm{d}\mu_t^n = \int_S \langle \nabla \varphi(x), v[\mu_t^n](x) \rangle \, \mathrm{d}\mu_t^n(x),$
where, in this case, the spatial derivatives of $\varphi$ have to be understood as derivatives on $S$.
Similarly, Geshkovski et al. [9] propose the interaction energy in equation (1.2), which for an empirical measure reduces to
$$E(\mu_t^n) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{\langle x_i(t), D x_j(t) \rangle}.$$
In this discrete case, a straightforward application of the chain rule and a reordering of the terms yields
$$\frac{\mathrm{d}}{\mathrm{d}t} E(\mu_t^n) = \frac{2}{n^2} \sum_{i=1}^{n} \Big\langle \dot{x}_i(t),\; \sum_{j=1}^{n} e^{\langle x_i(t), D x_j(t) \rangle}\, D x_j(t) \Big\rangle.$$
Under our assumption that the value matrix is given by $V = \pm D$, we see that, up to an application of $P_{x_i(t)}$ and a division by $Z_i(t) := \sum_{k=1}^{n} e^{\langle x_i(t), D x_k(t) \rangle}$, the term in the brackets is given by $\pm \dot{x}_i(t)$. Since $P_{x_i(t)} \dot{x}_i(t) = \dot{x}_i(t)$ for any $i$, we have that
$$\frac{\mathrm{d}}{\mathrm{d}t} E(\mu_t^n) = \pm \frac{2}{n^2} \sum_{i=1}^{n} Z_i(t)\, \|\dot{x}_i(t)\|^2,$$
and hence the energy increases ($V = D$) or decreases ($V = -D$) monotonously along the trajectory of equation (1.7). A formal derivation of the above formulae for general probability measures on smooth manifolds is provided in §2.
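This monotonicity is easy to verify numerically. The sketch below (our own illustration; the matrix and parameters are our choices) runs the descending dynamics with $V = -D$ and checks that the empirical energy does not increase:

```python
import numpy as np

def energy(X, D):
    """Empirical energy E = (1/n^2) * sum_ij exp(<x_i, D x_j>)."""
    return np.exp(X @ D @ X.T).sum() / X.shape[0] ** 2

def step(X, D, V, h):
    """One explicit Euler step of the normalized attention dynamics."""
    A = np.exp(X @ D @ X.T)
    A /= A.sum(axis=1, keepdims=True)
    Y = X + h * (A @ X @ V.T)
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
D = np.diag([1.5, 1.0, 0.5])

E0 = energy(X, D)
for _ in range(500):
    X = step(X, D, -D, 0.05)     # V = -D: the energy should decrease
assert energy(X, D) <= E0
```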
Let us mention that problems with energies similar to $E$ have been studied in the past. The most prominent is an interaction energy with a non-local interaction kernel depending on $x - y$. Choosing the kernel as a Gaussian with covariance matrix $D^{-1}$ (which makes sense only if $D$ is positive definite) results in
(1.10) $\qquad \tilde{E}(\mu) = \iint_{S \times S} e^{-\frac{1}{2} \langle x - y, D (x - y) \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y).$
For $D = \lambda\, \mathrm{Id}$ with $\lambda > 0$, the minimizers and maximizers of the expressions in equations (1.2) and (1.10) are equivalent, as $\langle x, D y \rangle = \lambda - \frac{1}{2} \langle x - y, D(x - y) \rangle$ for all $x, y \in S$. The important difference between equations (1.2) and (1.10) is the rotation-(in)variance of the interaction functions $(x, y) \mapsto e^{\langle x, D y \rangle}$ and $(x, y) \mapsto e^{-\frac{1}{2} \langle x - y, D(x - y) \rangle}$. In the general case, this is not true, but we shall use an analogy to the interaction energy to rewrite
$$E(\mu) = \iint_{S \times S} e^{\frac{1}{2} \langle x, D x \rangle}\, e^{-\frac{1}{2} \langle x - y, D(x - y) \rangle}\, e^{\frac{1}{2} \langle y, D y \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y).$$
(e). Understanding $\langle x, D y \rangle$ on the sphere
For our further analysis, it is crucial to understand the implications of restricting the problem to the unit sphere and the behaviour of the bilinear form on it. For , it is clear that the minimizer of is given by and the maximizer by . This changes for a general and as a result, the minimizer of the energy in equation (1.2) is not given by the uniform distribution on anymore. For a diagonal matrix , the maximizer/minimizer of for a fixed with is given by . Therefore, we know that if and only if (same for and ). For , we already have for any , i.e. each point is a minimizer, maximizer and orthogonal to w.r.t. . A further consequence is that
where denotes the eigenvalue of maximum absolute value of . We further note that all of the following results on minimizers/maximizers as well as stationary points of can be generalized to probability measures concentrated on an ellipsoid instead of a sphere. To see this, we consider the ellipsoid
where is invertible, and the corresponding energy
Since is invertible, any measure is uniquely determined by the pushforward measure , as . Thus, we can rewrite the energy as
and equivalently optimize the energy on the sphere. A special case that leads to measures concentrated on an ellipsoid corresponds to RMSNorm normalization with non-vanishing gain parameters . In this case, the ellipsoid is given by , where is a diagonal matrix with entries .
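The spectral bound stated in this subsection, namely that the bilinear form on the sphere is bounded by the eigenvalue of maximal absolute value and attains this bound at a corresponding eigenvector, is easy to probe numerically. A minimal sketch (our own illustration, not from the article's code):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
D = (M + M.T) / 2
evals, evecs = np.linalg.eigh(D)
lam = np.max(np.abs(evals))               # eigenvalue of maximal absolute value

X = rng.standard_normal((100_000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
vals = np.einsum('ij,jk,ik->i', X, D, X)  # <x, D x> at random sphere points
assert np.all(np.abs(vals) <= lam + 1e-12)

u = evecs[:, np.argmax(np.abs(evals))]    # corresponding eigenvector
assert np.isclose(abs(u @ D @ u), lam)    # the bound is attained at u
```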
2. Gradient flow
As shown above, the particle dynamics can be ‘lifted’ by the use of empirical measures to the space of probability measures over the sphere. As mentioned in [9, Remark 3.3], for arbitrary probability measures, the connection between the particle dynamics and a corresponding continuity equation can be made by a mean-field limit approach. Hence, instead of the particle dynamics, one can study the continuity equation:
(2.1) $\qquad \partial_t \mu_t + \nabla \cdot ( v[\mu_t]\, \mu_t ) = 0,$
with the velocity field given by equation (1.8), which holds in the sense of distributions. Note that, in this section, we scale the energy by a factor of $\frac{1}{2}$ to be consistent with [9]. It was remarked in [9, ch. 3.3] that for $V = \pm D$, the energy,
$$E(\mu) = \frac{1}{2} \iint_{S \times S} e^{\langle x, D y \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y),$$
is monotonic along these dynamics, and the partial differential equation (2.1) can be interpreted as a gradient flow for a modified optimal transport distance. However, as the authors of [9] acknowledge, there is a gap in the literature that prevents them from making this observation rigorous.
In this section, we aim to close this gap. We show that $\mathcal{P}(M)$ equipped with this new distance is a geodesic space with properties similar to the classical $2$-Wasserstein space and prove that solutions of equation (2.1) are curves of maximal slope of $E$ with respect to this distance and thus satisfy the energy dissipation equality
$$E(\mu_T) + \frac{1}{2} \int_0^T \Big( |\mu_t'|^2 + |\partial E|^2(\mu_t) \Big)\, \mathrm{d}t = E(\mu_0).$$
Finally, we study the long-time behaviour of the dynamics and show that subsequences of the flow converge to stationary points of the energy .
Let us mention that the basic analysis of this section related to the novel transport distance can be generalized in a rather straightforward way to the more general case of being non-symmetric and can thus provide the basis for future analysis of the non-gradient flow case with arbitrary and non-symmetric.
(a). Continuity equation on manifolds
Let be a compact -dimensional Riemannian manifold without a boundary, e.g. the sphere . The tangent bundle is given by the disjoint union of all tangent spaces of all . We denote by the space of Borel probability measures on , equipped with the standard narrow topology (e.g. [53, ch. 5.1]). The symbol is used to indicate convergence in this topology. Let be an open interval, a narrowly continuous curve and a Borel velocity field such that . The continuity equation holds in the sense of distributions if
(2.2) |
Here, denotes the differential on the manifold . Sometimes, we shall use to clarify with respect to which variable the differential is taken. We define the set of solutions to the continuity equation as follows:
Furthermore, we define as the subset such that , . For more details, we refer to appendix A(a).
(b). Distance
To interpret equation (2.1) as a gradient flow on $\mathcal{P}(M)$, we need to modify the well-known dynamic formulation of the $2$-Wasserstein distance [54] and introduce the following mobility:
$$\kappa[\mu](x) := \int_M k(x, y)\, \mathrm{d}\mu(y).$$
With this, the modified transport distance between $\mu_0, \mu_1 \in \mathcal{P}(M)$ is defined as follows (see [9, Section 3.4.2]):
(2.3) $\qquad d_k(\mu_0, \mu_1)^2 := \inf\left\{ \int_0^1 \int_M \kappa[\mu_t](x)\, |v_t(x)|^2\, \mathrm{d}\mu_t(x)\, \mathrm{d}t \,:\, (\mu, v) \in \mathrm{CE}(\mu_0, \mu_1) \right\}.$
For $k \equiv 1$, we recover the classical $2$-Wasserstein distance. The dynamics (2.1) correspond to the kernel $k(x, y) = e^{\langle x, D y \rangle}$, but for the sake of generality, we carry out the analysis for a more general class of kernels $k$.
Assumption 1. The kernel $k \colon M \times M \to \mathbb{R}$ is continuous, and there exists a constant $c > 0$ such that $k(x, y) \geq c$ for all $x, y \in M$.
Remark 2.1. The assumption that is bounded from below is vital for our analysis and covers the cases of interest in this article. Nonetheless, it would be interesting to see whether this assumption can be relaxed. For example, instead of a compact manifold , we could consider as the underlying space and take to be a Gaussian or a bounded confidence kernel as studied in [ 55 ].
As the next theorem shows, the infimum in equation (2.3) is actually attained by some . The proof can be found in appendix A(b).
Theorem 2.2 (Existence of minimizers). For every pair with , there exists a couple such that
Furthermore, such minimizers can be equivalently characterized as those of
Using the theorem above, it is easy to show that is a distance on .
Theorem 2.3. The space equipped with is a complete metric space and its topology is equivalent to the one induced by the -Wasserstein distance which, since is compact, is equivalent to the topology of narrow convergence.
Proof. First, we check that is a distance. Indeed, (i) symmetry follows from simply rescaling time by ; (ii) definiteness: Since is bounded from below, implies that for -a.e. . Thus by equation (A 3) ; (iii) the triangle inequality follows from the characterization in equation (2.4) and the gluing property from proposition A.1. To show the equivalence of the distances, we observe that by assumption 1, and since is compact and is continuous, we can also find a such that . This implies that
and the distances are equivalent. Since is complete, has to be complete as well.∎
Let us recall that in a general complete metric space $(X, d)$, a curve $u \colon [0, T] \to X$ is called absolutely continuous if there exists a function $m \in L^1(0, T)$ such that
(2.5) $\qquad d(u(s), u(t)) \leq \int_s^t m(r)\, \mathrm{d}r \qquad \text{for all } 0 \leq s \leq t \leq T.$
For an absolutely continuous curve $u$, its metric derivative is defined by
$$|u'|(t) := \lim_{s \to t} \frac{d(u(s), u(t))}{|s - t|},$$
and it exists for a.e. $t \in (0, T)$. It can be shown that $|u'|$ is minimal in the sense that for all $m$ satisfying equation (2.5), it holds that $|u'|(t) \leq m(t)$ for a.e. $t \in (0, T)$. The next lemma, which is proven in appendix A(c), characterizes absolutely continuous curves in $(\mathcal{P}(M), d_k)$.
Lemma 2.4. Let $(\mu_t)_{t \in [0, T]}$ be an absolutely continuous curve w.r.t. $d_k$. Then there exists a Borel velocity field $v_t$ such that $(\mu, v) \in \mathrm{CE}$ and
$$\int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t \leq |\mu'|^2(t) \qquad \text{for a.e. } t \in (0, T).$$
Conversely, if $(\mu, v) \in \mathrm{CE}$ and $\int_0^T \int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t\, \mathrm{d}t < \infty$, then $(\mu_t)$ is absolutely continuous and
$$|\mu'|^2(t) \leq \int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t \qquad \text{for a.e. } t \in (0, T).$$
A metric space $(X, d)$ is called a length space if
$$d(x, y) = \inf \int_0^1 |u'|(t)\, \mathrm{d}t,$$
where the infimum is taken over all absolutely continuous curves $u \colon [0, 1] \to X$ with $u(0) = x$ and $u(1) = y$. If this infimum is attained by a minimal curve, also called geodesic, we say that $(X, d)$ is a geodesic space. As it turns out, the minimal curves obtained in theorem 2.2 are such geodesics. This can be immediately deduced from equation (A 9) and the definition of the metric velocity.
Corollary 2.5. The space $(\mathcal{P}(M), d_k)$ is a geodesic space.
(c). Gradient flows of the interaction energy
Let $k$ be a symmetric interaction kernel. The interaction energy is given by
$$E(\mu) = \frac{1}{2} \iint_{M \times M} k(x, y)\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y).$$
Let us consider the following inverse duality map:
$$\jmath \colon T^*M \to TM, \qquad \langle \jmath(\xi), \eta \rangle = \xi(\eta) \quad \text{for all } \eta \in TM.$$
Since all tangent spaces are finite-dimensional, this map is well defined. The application of $\jmath$ to a 1-form on $M$ (in particular, a differential of a function) yields a velocity field on $M$. Below we show that gradient flows of the energy with respect to the metric $d_k$ are given by weak solutions to PDEs of the form
(2.6) $\qquad \partial_t \mu_t = \nabla \cdot \left( \frac{\mu_t}{\kappa[\mu_t]}\, \jmath\Big( \mathrm{d}\, \frac{\delta E}{\delta \mu}[\mu_t] \Big) \right),$
where $\frac{\delta E}{\delta \mu}[\mu](x) = \int_M k(x, y)\, \mathrm{d}\mu(y)$. For $M = S$ and $k(x, y) = e^{\langle x, D y \rangle}$, equation (2.6) corresponds precisely to equation (2.1) if $V = -D$. The sole difference between equation (2.6) and classical Wasserstein gradient flows is the presence of the factor $\frac{1}{\kappa[\mu]}$. It arises since the modified transport distance punishes the movement of particles with a high mobility $\kappa[\mu]$. When we interpret $k$ as an interaction kernel between particles, those particles interacting strongly with others are slowed down, while particles with low interaction are sped up.
Lemma 2.6 (Chain rule). Let be an absolutely continuous curve in . Then is absolutely continuous and
Proof. Let us consider an absolutely continuous curve and the function . In the case when , we could use it as a test function in equation (A 3) and immediately obtain
The finiteness follows from the fact that we can bound uniformly on . In the general case, we have to use a rather lengthy time mollification argument, see appendix A(d).∎
Equation (2.7) is reminiscent of the classical chain rule for a function and a curve . The velocity field can be viewed as the ‘derivative’ of the curve , while is the corresponding ‘gradient’ of the interaction energy. Using this chain rule, we can estimate how fast the energy can decrease along a curve . Therefore, curves reaching this bound dissipate the energy as fast as possible and satisfy the so-called energy dissipation equality.
Lemma 2.7. For any absolutely continuous w.r.t. curve , we have that
(2.8) |
Moreover, we have equality if and only if is a weak solution to equation (2.6) .
Proof. We can estimate the right-hand side of equation (2.7) by Hölder’s and Young’s inequalities:
Integrating both sides of equation (2.7) from 0 to T, we obtain equation (2.8). Moreover, equality holds if and only if for a.e. and -a.e. we have ). Hence, is a weak solution to equation (2.6).∎
(d). Metric gradient flows
Let us put the previous calculations into the context of curves of maximal slope [53, ch. 1], which can be viewed as a way to generalize gradient flows to general metric spaces. We assume to be a complete metric space. Let . A function is called a strong upper gradient of if for any absolutely continuous curve the concatenation is Borel and
If is non-increasing in then the application of Young’s inequality yields
This observation allows us to define curves of maximal slope as those that decrease the energy as fast as possible.
Definition 2.8 (Curve of maximal slope). An absolutely continuous curve is called a curve of maximal slope of with respect to its strong upper gradient if is non-increasing and
Lemma 2.9. The map
is a strong upper gradient of and solutions of equation (2.6) coincide with curves of maximal slope of with respect to the strong upper gradient .
Proof. For an absolutely continuous w.r.t. curve , we can find, by lemma 2.4, a velocity field such that and
Then, the chain rule, lemma 2.6 yields
and is a strong upper gradient. The coincidence of solutions of equation (2.6) and curves of maximal slope follows from lemma 2.7.∎
(e). Energy dissipation and large-time behaviour
Due to the missing geodesic convexity properties of the energy, we cannot expect convergence of the evolution to a unique minimizer in the large time limit. However, we can obtain some weaker results by further analysing the energy dissipation property:
(2.9) $\qquad E(\mu_T) + \int_0^T \int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t\, \mathrm{d}t \leq E(\mu_0).$
As $t \to \infty$, we can pick narrowly convergent subsequences of $(\mu_t)$ (i.e. converging weakly star in the Banach space of Radon measures). Moreover, the entropy dissipation inequality above implies
$$\int_0^\infty \int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t\, \mathrm{d}t < \infty;$$
hence, along suitable subsequences, the entropy dissipation,
$$\int_M \kappa[\mu_t]\, |v_t|^2\, \mathrm{d}\mu_t,$$
converges to zero since it is non-negative and bounded. To establish the existence of subsequences converging to stationary solutions, we need to identify the limit in suitable spaces. Under appropriate regularity assumptions on the interaction kernel (satisfied, for example, for the exponential kernel), this is a direct consequence of the Arzelà–Ascoli theorem.
Lemma 2.10. Let be a compact manifold without a boundary, for some and symmetric. Moreover, let be a sequence of probability measures on . Then the sequences
have uniformly convergent subsequences. If converges narrowly to , then converges uniformly to and converges uniformly to
Lemma 2.10 combined with the entropy dissipation inequality (2.9) yields the following result.
Corollary 2.11. Let be a compact manifold without a boundary, for some and symmetric. Then each weak solution of equation (2.1) with the velocity field given by equation (1.8) has a narrowly convergent subsequence as , the limit of which is a stationary solution.
The following example connects the general results of this section with the transformer dynamics.
Example 2.12. The transformer dynamics for a finite number of particles described by equation (1.7) with $V = -D$ correspond to the choice $M = S$ and $k(x, y) = e^{\langle x, D y \rangle}$. As discussed in §1d, the corresponding empirical measures fulfil the continuity equation (1.9). Thus, they solve equation (2.1) in the weak sense with the velocity field given by equation (1.8), and all requirements of corollary 2.11 are fulfilled. Therefore, there exists a subsequence of the empirical measures that converges narrowly to a stationary solution of the interaction energy defined in equation (1.2).
This section establishes the relation between the particle model in equation (1.7) and gradient flows of interaction energies for the special cases $V = \pm D$. The energy dissipation property in equation (2.8) and the convergence property from corollary 2.11 motivate the study of stationary solutions of the energy $E$, which we carry out in §§3 and 4. We shall start with minimizers and maximizers.
3. Explicit energy minimizers and maximizers
In this section, we compute explicit minimizers and maximizers of the energy (from equation (1.2), i.e. without the factor $\frac{1}{2}$) in different scenarios, depending on the properties of the interaction matrix $D$. We make the dependence on the matrix explicit by employing it as a subscript of the energy. The case $D = \lambda\, \mathrm{Id}$ with $\lambda > 0$ has already been covered in [9, Proposition 3.4], where it is stated that a measure is a maximizer if and only if it is a Dirac delta placed at any point on the sphere, and a minimizer if and only if it is the uniform distribution. As we show below, for more general matrices, the position of optimal Diracs depends strongly on the eigenvalues of the matrix $D$. We further derive a symmetry condition for minimizers of energies with a positive definite interaction matrix $D$. This property yields an alternative, simpler proof that the uniform distribution is the only minimizer for $D = \lambda\, \mathrm{Id}$ with $\lambda > 0$.
(a). Maximal eigenvalue and related maximizers or minimizers
Like for $D = \lambda\, \mathrm{Id}$, there are several cases in which the minimizers or maximizers of the energy are given by Diracs concentrated at a single point. We start with the maximizers when the largest eigenvalue of $D$ is also the eigenvalue of the largest absolute value (or, respectively, minimizers when the smallest eigenvalue of $D$ is also the eigenvalue of the largest absolute value).
Theorem 3.1. Let $\lambda$ be an eigenvalue of maximal absolute value of $D$ and $U_\lambda \subset S$ the set of associated normalized eigenvectors. If $\lambda > 0$ then $\delta_u$ with $u \in U_\lambda$ are the only maximizers of the energy $E_D$. If $\lambda < 0$ then $\delta_u$ with $u \in U_\lambda$ are the only minimizers.
Proof. We consider the case ; the case can be treated similarly. For all , we have with equality if and only if . Thus,
where the inequality is strict if is not concentrated on an eigenvector associated with .∎
An example of the above setting is maximizing the energy for $D = \mathrm{Id}$ [9, Proposition 3.4], where the authors make a connection between the existence of concentrated maximizers and the so-called mode collapse of transformers often observed in practice. For a positive definite $D$, theorem 3.1 shows that the set of maximizers is not only restricted to Dirac measures, but that it is actually finite. We summarize this insight in the following example and refer to §5a for an illustrating numerical example.
Example 3.2. If $D = \lambda\, \mathrm{Id}$ with $\lambda > 0$ then $\delta_x$ is a maximizer of the energy for any $x \in S$. Similarly, for $\lambda < 0$, $\delta_x$ is a minimizer for any $x \in S$. If $D$ is positive definite then $\delta_x$ is a maximizer of $E_D$ only if $x \in U_\lambda$ and $\lambda$ is the largest eigenvalue of $D$. Similarly, for a negative definite $D$, $\delta_x$ is a minimizer only if $x \in U_\lambda$ and $\lambda$ is the smallest eigenvalue of $D$.
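Numerically, this is immediate: for a Dirac measure $\delta_x$, the energy reduces to $e^{\langle x, D x \rangle}$, which is maximal exactly at dominant eigenvectors. A minimal sketch (our own illustration; the matrix is our choice):

```python
import numpy as np

D = np.diag([3.0, 1.0, -2.0])        # lambda_max = 3 has maximal absolute value

def dirac_energy(x, D):
    """E(delta_x) = exp(<x, D x>) for a Dirac measure on the sphere."""
    return np.exp(x @ D @ x)

rng = np.random.default_rng(4)
X = rng.standard_normal((100_000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sampled = np.exp(np.einsum('ij,jk,ik->i', X, D, X))

u = np.array([1.0, 0.0, 0.0])        # eigenvector of the largest eigenvalue
assert dirac_energy(u, D) >= sampled.max()   # no sampled Dirac beats it
```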
In the remainder of this section, we study minimizers for matrices that do not fulfil the conditions of theorem 3.1.
(b). Minimizers for indefinite matrices
We now generalize the statement in theorem 3.1 to minimizers of energies where the matrix $D$ has at least one non-positive eigenvalue. In particular, we do not assume that the smallest eigenvalue is the eigenvalue of maximal absolute value. A key property is the following result that gives a lower bound on the energy in terms of the smallest eigenvalue of $D$.
Lemma 3.3. Let $\bar{x}_\mu$ be the expected value of $x$ under $\mu$, i.e. $\bar{x}_\mu = \int_S x \, \mathrm{d}\mu(x)$. Then
(3.1) $\qquad E_D(\mu) \geq e^{\langle \bar{x}_\mu, D \bar{x}_\mu \rangle}.$
If $D$ is not positive definite and $\lambda_{\min} \leq 0$ is its smallest eigenvalue, it further holds that
(3.2) $\qquad E_D(\mu) \geq e^{\lambda_{\min}}.$
Proof. We use the convexity of the exponential function and of $x \mapsto e^{\langle x, v \rangle}$ for arbitrary $v \in \mathbb{R}^d$, which, with two applications of Jensen's inequality, implies
(3.3) $\qquad E_D(\mu) = \iint_{S \times S} e^{\langle x, D y \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y) \geq e^{\iint_{S \times S} \langle x, D y \rangle\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y)} = e^{\langle \bar{x}_\mu, D \bar{x}_\mu \rangle}.$
Since, further, $\langle \bar{x}_\mu, D \bar{x}_\mu \rangle \geq \lambda_{\min} \|\bar{x}_\mu\|^2$ and $\|\bar{x}_\mu\| \leq 1$, the monotonicity of the exponential function gives us
$$E_D(\mu) \geq e^{\lambda_{\min} \|\bar{x}_\mu\|^2}.$$
If $D$ is not positive definite, we know that $\lambda_{\min} \leq 0$ and the above inequality reduces to inequality (3.2).∎
A direct consequence of lemma 3.3 for indefinite matrices is that a Dirac measure that is concentrated on an eigenvector corresponding to the smallest eigenvalue is a minimizer of the energy. If the smallest eigenvalue is negative, we can even show that all minimizers are of this form. In the case of a vanishing smallest eigenvalue, it is necessary and sufficient that the measure is concentrated on the null space of .
Theorem 3.4. Consider a matrix $D$ that is not positive definite with the smallest eigenvalue $\lambda_{\min} \leq 0$. If $\lambda_{\min} < 0$, a measure minimizes the energy $E_D$ if and only if it is a Dirac measure placed at an eigenvector corresponding to $\lambda_{\min}$. If $\lambda_{\min} = 0$, a measure minimizes the energy if and only if it is concentrated on the null space of $D$.
Proof. We first assume $\lambda_{\min} < 0$. It follows directly from equation (3.2) that every Dirac measure concentrated on an eigenvector corresponding to $\lambda_{\min}$ is a minimizer. We further see that $E_D(\mu) = e^{\lambda_{\min}}$ only if $\bar{x}_\mu$ is an eigenvector corresponding to $\lambda_{\min}$ and $\|\bar{x}_\mu\| = 1$. This can only hold for Dirac measures. Thus, there are no other minimizers.
For $\lambda_{\min} = 0$, it also follows directly from equation (3.2) that every measure concentrated on the null space of $D$ minimizes the energy. However, $e^{\lambda_{\min} \|\bar{x}_\mu\|^2} = 1$ holds for all measures $\mu \in \mathcal{P}(S)$. Still, the estimate in equation (3.3), obtained using Jensen's inequality, is only an equality if $\langle x, D y \rangle = \langle \bar{x}_\mu, D \bar{x}_\mu \rangle$ for $\mu \otimes \mu$-a.e. $(x, y)$. Therefore, all minimizers are concentrated on the null space of $D$.∎
Remark 3.5. In general, theorem 3.4 does not transfer to maximizers for matrices that are not negative definite. To see this, consider with the largest eigenvalue , the smallest eigenvalue and corresponding eigenvectors and . If further , it holds that
and thus, is not a maximizer. In the special case , the above inequality holds for all measures concentrated on the null space of and all .
At this point, we further note that the above strategy does not work for analysing minimizers for positive definite interaction matrices . In this case, lemma 3.3 not only gives us , but also for all , so the inequality is strict for all measures .
(c). Symmetry property for positive definite matrices
The remainder of this section gives the first characterization of minimizers of the energy when the interaction matrix is positive definite. More precisely, we can show that, in this case, all minimizers are symmetric, and the symmetry axes are determined by the eigenvectors of . The first step towards this is to show that the energy is strictly convex if is positive definite.
Lemma 3.6. If $D$ is positive semi-definite (resp. positive definite) then $E_D$ is convex (resp. strictly convex).
Proof. Since is quadratic, convexity (resp. strict convexity) follows from the non-negativity (resp. positivity) of the quadratic form:
for arbitrary signed Radon measures , e.g. [56, Proposition 2.11]. For positive semi-definite, there exists a unique positive semi-definite matrix square root and we can use the transformation . We denote by the pushforward of by , so that
Let , then
The fact that the Gaussian kernel is positive definite (e.g. [57]) yields that unless vanishes. This can only happen if or, in the case of a semi-definite matrix , if is concentrated on the null space and . This yields the assertion.∎
Remark 3.7. The previous convexity result does not guarantee the convergence of the gradient flow in ( equation (2.6) ) to a global minimizer of . For such results, usually a slightly different notion of convexity is required, the so-called geodesic convexity. The following example shows that besides the case of being a multiple of the identity, we do not have geodesic convexity for the classical -Wasserstein distance. We do not expect any improvements for our modified optimal transport distance.
Example 3.8. We consider a simple counterexample in (equipped with the spherical distance) to show that is not convex along -Wasserstein geodesics. Choose
Then is a constant-speed geodesic in the -Wasserstein space connecting and . Clearly, the map is not convex, since
Such a counterexample can always be constructed as long as has two different eigenvectors. Lemma 3.6 does not contradict this counterexample, however, as it only implies the convexity of
Having established convexity, we can show that reflecting a measure along the eigenvectors of and then normalizing it does not increase the energy. Moreover, if is positive definite and is not symmetric with respect to all eigenvectors of , one can always construct a symmetric measure with a smaller energy.
Lemma 3.9. Let be an eigenvector related to an eigenvalue of a positive semi-definite matrix . For a measure , we define as
where denotes a reflection. Then, and the inequality is strict if is positive definite and .
Proof. Since , it is straightforward to see that . The (strict) convexity of the energy yields the assertion.∎
As a direct consequence, we obtain a symmetry property of minimizers for positive definite .
Corollary 3.10. If $D$ is positive definite then each minimizer of $E_D$ is symmetric with respect to its eigenvectors.
If is a positive multiple of the identity, one can easily show using the above result that the uniform distribution is the unique energy minimizer. This has been shown already in [9, Proposition 3.4] using properties of Gegenbauer polynomials [58, Proposition 2.2]. The symmetry property from corollary 3.10 gives an alternative—and straightforward—proof of this fact.
Proposition 3.11. If $D = \lambda\, \mathrm{Id}$ for $\lambda > 0$ then the uniform distribution is the unique energy minimizer.
Proof. If is not uniform, we can find a unit vector such that with as in lemma 3.9, we have
However, for , every unit vector is an eigenvector and lemma 3.9 implies that . Hence, the uniform distribution is the only minimizer of the energy.∎
Remark 3.12. The statement in proposition 3.11 does not transfer to maximizers for negative multiples of the identity. To see this, consider with and let denote the uniform distribution on . The symmetry of yields
where . Since -almost everywhere on the integrand can be strictly bounded from above by . Since it follows that
with . Therefore, cannot be a maximizer of .
Remark 3.13. The above argument can be used to show that for arbitrary , one has
for all symmetric measures if and only if is an eigenvector that corresponds to the eigenvalue of the largest absolute value. In the upcoming section, we use this insight to show that such measures are maximizers of for negative semi-definite .
If has non-positive eigenvalues, theorems 3.1 and 3.4 still show that all minimizers are invariant with respect to reflections , where corresponds to a positive eigenvalue. However, if has negative eigenvalues, such reflections can increase the energy when they are applied to general, non-minimizing measures. This is illustrated by the following example.
Example 3.14. Consider the two-dimensional case with and . For any , denote by the Dirac delta placed at . Fix and let
In the two-dimensional setting, the symmetrization is given by
Denoting, for convenience, , we have
Since is strictly increasing for , we get that since
for any and , and the inequality is strict if and only if and .
(d). Maximizers for negative semi-definite matrices
There is no apparent way to use the proof strategy from the previous section for showing that maximizers for negative definite matrices are symmetric, since the kernel is not negative definite for a negative definite . However, we can show that the quadratic form used to prove lemma 3.6 is non-positive for anti-symmetric measures. This yields a symmetry property of maximizers for negative semi-definite matrices.
Lemma 3.15. Let be a negative semi-definite matrix and a measure on the sphere. Define as
Then and the inequality is strict if and either is negative definite or on the null space .
Proof. We denote by the negation and define
This yields that and
Since is positive semi-definite, the proof of lemma 3.6 shows that and thus . The inequality is strict if and either is negative definite or is concentrated on . The symmetry of the kernel yields . Further, by substituting and , we see that
Reordering the terms leads to
From the conditions on and that lead to , we derive that the above inequality is strict if and either negative definite or on .∎
Corollary 3.16. Let be a maximizer of for a negative definite . Then .
This symmetry property is the missing ingredient for showing that the discrete measures introduced in remarks 3.12 and 3.13 are maximizers for negative semi-definite matrices .
Theorem 3.17. Let $D$ be negative semi-definite and $\lambda_{\min}$ its smallest eigenvalue. Then, a measure maximizes $E_D$ if and only if it is of the form $\frac{1}{2} (\delta_u + \delta_{-u})$, where $u$ is an eigenvector associated with $\lambda_{\min}$.
Proof. By lemma 3.15, it suffices to consider satisfying . Denoting and using the symmetry property of , with the arguments from remark 3.12, we have
where equality is only obtained if holds -almost everywhere on . Since is symmetric, this is equivalent to . For a negative definite , we already know from corollary 3.16 that there are no other measures that maximize . In the negative semi-definite case, we have that any that fulfils has to be concentrated on and, therefore, also in this case, there are no other maximizers.∎
4. Energy variation and stationary points
To study stationary points or local maximizers/minimizers, it is useful to consider the first and second variations of the energy on the Wasserstein space of probability measures on the sphere, as studied previously for Vlasov-type interactions, e.g. the mean-field aggregation equation, cf. [36,59,60]. The first variation of $E$ is given by
(4.1) $\qquad \delta E(\mu)[v] = \frac{\mathrm{d}}{\mathrm{d}s}\bigg|_{s = 0} E(\mu_s),$
where $\mu_s$ satisfies
(4.2) $\qquad \partial_s \mu_s + \nabla \cdot ( \mu_s\, P_x v ) = 0, \qquad \mu_0 = \mu,$
and $P_x$ is the projection to the tangent space of the unit ball at $x$. Here, the velocity field $v$ is an arbitrary Lipschitz function on $\mathbb{R}^d$; by the projection $P_x$, we restrict it further to admissible velocities that keep the distribution on the unit sphere.
The following weak formulation, where is a continuously differentiable test function, will be useful later:
Similar to the first variation, the second variation of can be defined as
(4.3) |
if the derivative on the right-hand side exists. The computation of the first variation is completely analogous to the case of the aggregation equation (cf. [59]) and thus omitted here.
Lemma 4.1. For any Lipschitz continuous vector field $v$, the first variation of the energy in the direction $v$ exists and is given by
(4.4) $\qquad \delta E(\mu)[v] = 2 \iint_{S \times S} e^{\langle x, D y \rangle}\, \langle D y, P_x v(x) \rangle\, \mathrm{d}\mu(y)\, \mathrm{d}\mu(x).$
It is straightforward to see that the first variation vanishes at the extremal points of the energy:
Proposition 4.2. Let $\mu$ be a minimizer or maximizer of the energy. Then $\delta E(\mu)[v] = 0$ for all Lipschitz vector fields $v$.
Proof. Let be the initial value for the transport equation (4.2). For Lipschitz-continuous vector fields, there is a unique solution of the transport equation, and for all times , it is an admissible distribution on the sphere. Hence, if is a minimizer, then
for all , which implies that in the limit . Since is arbitrary and is linear in , we have that . The case of a maximizer is treated in the same way, with an opposite inequality initially.∎
The connection between the transformer dynamics and the energy variations in Wasserstein spaces is readily established in the following.
Lemma 4.3. A probability measure $\mu$ is a stationary solution of equation (2.1) with the velocity field given by equation (1.8) if and only if $\delta E(\mu)[v] = 0$ for all Lipschitz continuous $v$.
Similarly to lemma 4.1, one can obtain an expression for the second variation.
Lemma 4.4. For being Lipschitz continuous, the second variation of the energy in the directions , exists and is given by
(a). Energy variation at concentrated distributions
From lemma 4.1, we see that any measure that fulfils
(4.5) |
is a stationary point of . Here and in the following, with a slight abuse of notation, we denote the -vector by . For concentrated measures, the above condition is also necessary and rather easy to verify, as we see in what follows. We first show that single Dirac measures can only be stationary points if they align with an eigenvector of the matrix .
Lemma 4.5. A Dirac measure $\delta_x$ with $x \in S$ is a stationary point of $E_D$ if and only if $x$ is an eigenvector of $D$.
Proof. The first variation is given by
Since is an arbitrary vector, is a stationary point if and only if
which holds if and only if is an eigenvector of .∎
Intuitively speaking, this means that the force emerging from the interaction of a particle located at an eigenvector $x$ with itself is orthogonal to the tangent space of $S$ at the point $x$ and is thus cancelled out by the projection. The same effect can be observed for convex combinations of a Dirac measure and its reflection.
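For a single Dirac, the stationarity criterion of lemma 4.5 reduces to the vanishing of the tangential component of $D x$. A minimal numerical check (our own sketch, with an illustrative diagonal matrix):

```python
import numpy as np

def tangential_force(x, D):
    """P_x(D x): the self-interaction force after tangential projection."""
    f = D @ x
    return f - (x @ f) * x            # P_x = Id - x x^T for |x| = 1

D = np.diag([2.0, 1.0, 0.5])
e1 = np.array([1.0, 0.0, 0.0])        # eigenvector of D: stationary
x = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)   # generic unit vector

assert np.allclose(tangential_force(e1, D), 0.0)
assert not np.allclose(tangential_force(x, D), 0.0)
```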
Lemma 4.6. For any $\gamma \in [0, 1]$, we have that $\mu = \gamma\, \delta_x + (1 - \gamma)\, \delta_{-x}$ with $x \in S$ is a stationary point of $E_D$ if and only if $x$ is an eigenvector of $D$.
Proof. Using the expression in lemma 4.1, we obtain for any Lipschitz continuous , using the abbreviation , that
We first observe that for any one has that . By comparing the coefficients in the above equation, we obtain that
∎
For the symmetric case $\gamma = \frac{1}{2}$ in the above lemma, we can further show that any convex combination of such stationary points is again a stationary point.
Lemma 4.7. Let $\{u_1, \ldots, u_m\} \subset S$ be a finite subset of eigenvectors of $D$ such that $\langle u_k, u_l \rangle = 0$ for all $k \neq l$. Then for any choice of parameters $\gamma_1, \ldots, \gamma_m \geq 0$ such that $\sum_{k=1}^{m} \gamma_k = 1$, the following measure is a stationary point of $E_D$:
$$\mu = \sum_{k=1}^{m} \frac{\gamma_k}{2} \left( \delta_{u_k} + \delta_{-u_k} \right).$$
Proof. We prove the statement by showing that equation (4.5) holds. For any , it holds that
since only contains eigenvectors of . On the other hand, since we also require for all it follows that and therefore,
for all . In total, this yields
for all and thus also for -almost all .∎
The above proof strategy works only for Dirac measures aligned with the eigenvectors of $D$. However, there exist other discrete measures that are stationary points, as the following example shows. For the sake of simplicity, we restrict ourselves to the two-dimensional case with a positive definite matrix $D$ and a symmetric combination of four Dirac measures. We further assume that $D$ is diagonal; the case of a general symmetric $D$ can be treated similarly with a rotation argument.
Lemma 4.8. Let $d = 2$, $\alpha \in [0, \pi/2]$ and $D = \mathrm{diag}(\lambda_1, \lambda_2)$ be diagonal and positive definite. A discrete measure:
(4.6) $\qquad \mu_\alpha = \frac{1}{4} \left( \delta_{(\cos\alpha, \sin\alpha)} + \delta_{(-\cos\alpha, \sin\alpha)} + \delta_{(\cos\alpha, -\sin\alpha)} + \delta_{(-\cos\alpha, -\sin\alpha)} \right)$
is a stationary point of $E_D$ if and only if either $\alpha \in \{0, \pi/2\}$ or
(4.7) $\qquad \lambda_1 \tanh(\lambda_1 \cos^2\alpha) = \lambda_2 \tanh(\lambda_2 \sin^2\alpha),$
where $\lambda_1, \lambda_2$ denote the diagonal entries of $D$. For any choice of $\lambda_1, \lambda_2 > 0$, there exists exactly one $\alpha \in (0, \pi/2)$ that fulfils the condition in equation (4.7).
Proof. Without loss of generality, we prove the statement for , since otherwise it holds that for a , and thus .
It follows directly from lemma 4.6 that is a stationary point if . Therefore, it remains to show that is a stationary point if and only if equation (4.7) is fulfilled. This means that we have to see when there exists a Lipschitz continuous such that .
We first fix and consider
(4.8) |
Since , we can further write , where . We factor out to rewrite equation (4.8) as with
Lemma 4.1 now gives us that
which can become zero for all admissible if and only if for all . Due to the symmetry properties of our measures , it further holds that is constant on ; therefore, it suffices to consider . Remembering that , we derive
Since , the factor cannot vanish, and the zeros of coincide with those of
This function obtains its minima at and its maxima at and strictly increases or decreases, respectively, in between. Substituting these points into equation (4.9), we see that the minima are strictly negative and the maxima are strictly positive since . Therefore, there exists exactly one zero in the interval . Using the hyperbolic identity in equation (4.9), we arrive at the criterion in equation (4.7).∎
Remark 4.9. Importantly, the angle $\alpha$ that fulfils equation (4.7) depends not only on the ratio of the eigenvalues of $D$ but also on their magnitude since they appear separately within the hyperbolic tangent.
Although the ratio of the eigenvalues does in general not determine the angle $\alpha$ that fulfils equation (4.7), we can still make a qualitative prediction based on the ratio. The left-hand side of equation (4.7) decreases monotonically for $\alpha \in (0, \pi/2)$; for $\lambda_1 = \lambda_2$, the condition is fulfilled for $\alpha = \pi/4$. Therefore, the condition is fulfilled by some $\alpha > \pi/4$ if $\lambda_1 > \lambda_2$ and by some $\alpha < \pi/4$ if $\lambda_1 < \lambda_2$. The numerical experiments in §5b show that the measures characterized by equation (4.7) are not only stationary points but also minimizers among empirical measures consisting of at most four Dirac measures (see also the numerical sketch below). In the remainder of this section, we aim to characterize minimizers for positive definite matrices in arbitrary dimensions $d$.
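Since the condition in equation (4.7) is transcendental, the stationary angle is best found numerically. The following sketch (the values $\lambda_1 = 2$, $\lambda_2 = 1$ are our illustrative choices) locates the root with a standard bracketing solver and cross-checks that it is a critical point of the energy along the four-Dirac family from equation (4.6):

```python
import numpy as np
from scipy.optimize import brentq

lam1, lam2 = 2.0, 1.0
D = np.diag([lam1, lam2])

def criterion(alpha):
    """Condition (4.7): lam1 tanh(lam1 cos^2 a) - lam2 tanh(lam2 sin^2 a)."""
    return lam1 * np.tanh(lam1 * np.cos(alpha)**2) \
         - lam2 * np.tanh(lam2 * np.sin(alpha)**2)

alpha = brentq(criterion, 1e-6, np.pi / 2 - 1e-6)   # unique root in (0, pi/2)

def family_energy(a):
    """Energy of mu_a from (4.6): four symmetric Diracs with weight 1/4."""
    c, s = np.cos(a), np.sin(a)
    P = np.array([[c, s], [-c, s], [c, -s], [-c, -s]])
    return np.exp(P @ D @ P.T).mean()

h = 1e-5   # the root of (4.7) should be a critical point along the family
assert abs(family_energy(alpha + h) - family_energy(alpha - h)) / (2 * h) < 1e-6
```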
(b). Energy variation at the uniform distribution
To characterize minimizers for positive definite $D$, we start by identifying the cases when the uniform distribution is a stationary state. As we show in the following lemma, this can only be the case if the strength of the interaction does not depend on the direction, i.e. the eigenvalues of $D$ all have the same absolute value.
Lemma 4.10. The uniform distribution is a stationary point of $E_D$ if and only if all eigenvalues of $D$ have the same absolute value, i.e. $|\lambda_i| = \lambda$ for all $i$ and some $\lambda \geq 0$.
Proof. To keep the notation simple, we treat here the case , leaving the general proof for to appendix C(a). Let us fix and determine such that . Consider the integral
which can be rewritten with a change of variables as follows (recall that ):
From the above derivations, we see that if and only if is an eigenvector of . This holds true for -almost all if and only if . This automatically yields if . It remains to show that this is also a necessary condition.
Without loss of generality, we assume that , where and are the eigenvalues corresponding to the eigenvectors and , respectively. Then, is strictly negative on the set
Since we can find a Lipschitz continuous such that for -a.e. on and
For all such it holds that , which concludes the proof.∎
Since we already know that minimizers for $D$ with at least one negative eigenvalue are Dirac measures, we can conclude that the uniform distribution is only a minimizer for $D = \lambda\, \mathrm{Id}$ with $\lambda \geq 0$.
Corollary 4.11. The uniform distribution minimizes $E_D$ if and only if $D = \lambda\, \mathrm{Id}$ for some $\lambda \geq 0$.
Proof. We only need to show that there are no other matrices such that is minimized by ; the other direction has been treated in proposition 3.11. The measure can only be a minimizer if it is a stationary point. By lemma 4.10, this implies that all eigenvalues of have to have the same absolute value. If such has at least one negative eigenvalue, it is also the smallest eigenvalue. Thus, by theorem 3.1, the only minimizers are Dirac deltas placed at eigenvectors corresponding to the negative eigenvalue.∎
(c). Perturbation of the identity
It is not clear whether an explicit computation of stationary points for an arbitrary positive definite matrix with at least two distinct eigenvalues is possible, but some insight can be gained with asymptotic analysis. We consider the following perturbed energy:
$$E_\varepsilon(\mu) = \iint_{S \times S} e^{\langle x, (\mathrm{Id} + \varepsilon A) y \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y),$$
where $A$ is a diagonal matrix and $\varepsilon > 0$ is a small parameter. Using the second-order Taylor expansion of the exponential function, we can write
(4.10) $\qquad E_\varepsilon(\mu) = \iint_{S \times S} e^{\langle x, y \rangle} \Big( 1 + \varepsilon \langle x, A y \rangle + \tfrac{\varepsilon^2}{2} \langle x, A y \rangle^2 \Big)\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(y) + \mathcal{O}(\varepsilon^3).$
For $\varepsilon = 0$, we know that the unique minimizer is the uniform distribution $\mu_0$ on the sphere. Therefore, we use the following second-order asymptotic ansatz:
(4.11) $\qquad \mu_\varepsilon = \mu_0 + \varepsilon \nu_1 + \varepsilon^2 \nu_2.$
We stress that here we consider the energy as a function on the space of signed Radon measures on the sphere with the total variation norm and not on the space of probability measures with the Wasserstein metric as in §4a. For this reason, the perturbation here is a measure and not a vector field (cf. equation (4.1)).
Substituting equation (4.11) into equation (4.10) and neglecting higher-order terms, we derive
Since further is constant on , it follows that
In particular, we see that the term from equation (4.11) does not contribute to the second-order expansion of the energy. Therefore, minimizing over all possible satisfying equation (4.11) is equivalent to minimizing
over all signed measures with . The first variation in the direction satisfying is given by
(4.12) |
Our goal is now to find an optimal measure , such that its first variation vanishes in any direction such that . To do so, we shall need the following two technical lemmas. To make the definition of the uniform distribution on the sphere rigorous, we denote by the -dimensional Hausdorff measure and write instead of .
Lemma 4.12. Let and . It holds that
(4.13) |
for any , where the constant is positive and depends only on the dimension .
Proof. For the sake of simplicity, here we present the (more intuitive) proof for , leaving the general case to appendix C(b). We write and and derive that
where we use the coordinate transform and two trigonometric identities to separate the summands inside sine and cosine. Since , this yields equation (4.13) with
∎
Lemma 4.13. Let and . It holds that for any
where the constants and are positive and depend only on the dimension .
Proof. For the sake of simplicity, we again present the proof for ; the general case is treated in appendix C(c). Using the same arguments as in the previous proof, we derive
where the mixed terms containing vanish due to symmetry. Further, since , we can write
This yields equation (4.14) with positive constants:
∎
Lemma 4.12 allows us to rewrite the second summand in equation (4.12) such that it contains . Using lemma 4.13, we can then deduce that, up to constants, the measure is a stationary point of .
Theorem 4.14. The measure
fulfils and for all satisfying .
Proof. From the definition of and , it follows that . With lemma 4.12, we write the optimality condition derived from equation (4.12) as
Substituting into the left-hand side and using lemma 4.13, we get
where all terms that do not depend on , including , vanish due to . Substituting completes the proof.∎
Theorem 4.14 gives us the following intuitive characterization. The measure that optimizes the perturbed energy is obtained by taking mass from the uniform distribution where is large and adding it where is small. In other words, we expect minimizers of the energy with a positive definite matrix to have more mass in regions that correspond to small eigenvalues of than in regions that correspond to large ones. This intuition is in line with the results of the particle approximation in figure 3. Furthermore, in figure 5, we also observe that the density obtained in equation (4.11) with the measure from above can indeed be seen as a first-order approximation for small values of .
5. Numerical examples
To illustrate the obtained theoretical results, we perform a series of numerical experiments using a particle approximation of the energy from equation (1.2) with an ensemble of particles $x_1, \ldots, x_n \in S$,
$$E_n(x_1, \ldots, x_n) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{\langle x_i, D x_j \rangle}.$$
We consider the following particle flow, introduced in [9],
$$\dot{x}_i = P_{x_i}\Big( \frac{1}{Z_i} \sum_{j=1}^{n} e^{\langle x_i, D x_j \rangle}\, V x_j \Big), \qquad i = 1, \ldots, n,$$
with normalization factors $Z_i > 0$. If we choose the constant normalization
(5.1) $\qquad Z_i = n,$
this corresponds merely to a step-size rescaling of a standard gradient descent scheme for $E_n$, which is called the (USA) flow in [9]. Choosing the normalization as the partition function
(5.2) $\qquad Z_i = \sum_{j=1}^{n} e^{\langle x_i, D x_j \rangle}$
corresponds more closely to the self-attention dynamics and is labelled the SA flow in [9]. In what follows, we mostly use the normalization in equation (5.2), highlighting minor differences between the two formulations as appropriate. We use the explicit Euler discretization from equation (1.5) with step size $h > 0$ to obtain the following update:
(5.3) $\qquad x_i^{(k+1)} = P\Big( x_i^{(k)} + \frac{h}{Z_i^{(k)}} \sum_{j=1}^{n} e^{\langle x_i^{(k)}, D x_j^{(k)} \rangle}\, V x_j^{(k)} \Big).$
Remark 5.1. For $n = 1$ and $V = D$, this scheme reduces to the following power iteration in the limit $h \to \infty$:
$$x^{(k+1)} = \frac{D x^{(k)}}{\|D x^{(k)}\|}.$$
In this regard, the iteration in equation (5.3) can be seen as a method for approximating the largest eigenvalue and the corresponding eigenvector. We leave further analysis of this connection to future work.
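A compact implementation of the update (5.3) with the SA normalization (5.2), together with a check of the power-iteration limit from remark 5.1, might look as follows (a sketch; the matrix and all parameter values are our choices):

```python
import numpy as np

def sa_step(X, D, V, h):
    """Update (5.3) with the partition-function normalization (5.2)."""
    W = np.exp(X @ D @ X.T)               # e^{<x_i, D x_j>}
    A = W / W.sum(axis=1, keepdims=True)  # softmax rows, Z_i = partition function
    Y = X + h * (A @ X @ V.T)
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

# Remark 5.1: for n = 1 and V = D, a huge step mimics the limit h -> infinity.
D = np.diag([3.0, 1.0, 0.5])
x = np.ones((1, 3)) / np.sqrt(3.0)
for _ in range(50):
    x = sa_step(x, D, D, h=1e6)           # effectively x <- D x / ||D x||
lam = (x @ D @ x.T).item()                # Rayleigh quotient
assert abs(lam - 3.0) < 1e-6              # approximates the largest eigenvalue
```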
The source code for the experiments here is available at https://github.com/TimRoith/TransformerDynamics and uses Python [61], mainly building upon the packages NumPy [62], SciPy [63] and PyTorch [64].
(a). Maximizers for positive definite matrices
To validate our results on maximizers, we first consider a simple set-up of a one-particle system, . We choose and run the scheme in equation (5.3) for iterations. We only report the results for the adaptive normalization from equation (5.2), those for the constant normalization from equation (5.1) being essentially the same. For , we know that every single Dirac is a maximizer, which is indeed observed in figure 1a. Here, each random initialization on the sphere leads to a different final state. In fact, in this case, there is no evolution at all, and the particle stays at its initial position. If is positive definite and has a strictly largest eigenvalue , theorem 3.1 shows that only Diracs at eigenvectors corresponding to are maximizers. This can be observed in figure 1b where the final state is either at or .
Figure 1.
Discrete maximizers on the sphere for particles. The colour indicates the value of at each point on the sphere. (a) For every single Dirac is a maximizer. We show the results for 30 different initializations (b) For the final state is either (0, 0,1) or (0,0,−1).
For multiple particle systems with , lemma 4.6 suggests also that linear combinations of an eigenvector with its negative are stationary points. These linear combinations are not maximizers, but their basin of attraction depends on the eigenvalues of the matrix. In figure 2 (left), we plot the probability (i.e. the proportion of random initializations) of converging to a single cluster versus two clusters as function of the eigenvalues. We fix and vary between and . Note that, as discussed in lemma 4.8 and remark 4.9, the actual values of the eigenvalues matter; not just their ratio. For , the probability of converging to a single cluster is high, whereas for larger values , most trajectories converge to two clusters. The results in figure 2 were obtained with the adaptive normalization from equation (5.2); however, we observed the same quantitative behaviour with the constant normalization from equation (5.1).
Figure 2.
We study the trajectories for a symmetric positive definite matrix with and different initializations using particles. We evaluate the number of clusters at the final iteration with the -means implementation of the SciPy package [63]. The centre of each cluster is close to an eigenvector corresponding to an eigenvalue of maximal absolute value. For , the evolution converges to the optimal state with a single cluster (blue, solid), while for bigger values, it tends to get stuck in the suboptimal stationary state with two clusters (red, hatched) from lemma 4.6.
(b). Minimizers for positive (semi-)definite matrices
We now study discrete minimizers for positive definite matrices. In figure 3, we show how the matrix influences the particle configuration to which the scheme in equation (5.3) converges. Here, too, we used the adaptive normalization from equation (5.2); the results for the constant one from equation (5.1) are largely the same.
Figure 3.
Final states for the minimization scheme after 10 000 steps with particles. The colour indicates the value of at each point on the sphere. In (a), the uniform distribution is the minimizer of the energy. In (b), the particles do not form clusters at single Diracs but rather follow a smooth distribution on the sphere. In (c), any configuration with for all is a minimizer. In (d), any configuration with for all is a minimizer.
Furthermore, in figure 4, we illustrate the results of lemma 4.8 for matrices with varying values . We initialize particles according to equation (5.4) and let the scheme in equation (5.3) run for 10 000 iterations.
Figure 4.
We consider minimizers for the matrix . Starting with the initial configuration described in equation (5.4), we compute the mean of over all particles. For a small step size, the resulting curve is very close to the identity, as predicted by lemma 4.8. If is too big, the dynamics converge to a suboptimal stationary point. We also compare the normalizations given by equations (5.1) and (5.2). We see that with the same step size , the adaptive normalization in equation (5.2) yields faster convergence than the constant one in equation (5.1).
From the final particle state, we compute the value for each particle separately; lemma 4.8 tells us that this should be equal to for the minimizer. In figure 4, we observe that this holds true for the particle configurations computed with the discrete scheme. However, if the step size is too big compared to the value , the system instead converges to the two-cluster stationary point from figure 2. Here, we notice a slight difference between the two normalizations. The adaptive normalization from equation (5.2) allows choosing bigger step sizes than the constant normalization from equation (5.1), enabling faster convergence to the large-time limit.
We further investigate the validity of the asymptotic solution from theorem 4.14 in the two-dimensional case. Here, we deviate from the particle approximation and instead discretize the interval with equidistant grid points and the associated points on the sphere . In this setting, we then aim to minimize
where is a probability vector. Note that already, for , a more sophisticated quadrature rule would be required, e.g. the Lebedev quadrature on the sphere [65]. To deal with the simplex constraint for the vector , we use exponentiated gradient descent, specifically mirror descent with the negative log-entropy as the distance generating function [66], which yields the update
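Since the displayed update is lost in this rendering, the following is a generic sketch of the exponentiated-gradient step (mirror descent with negative log-entropy) on the probability simplex, in the spirit of [66]. The energy gradient `grad`, the step size `eta` and the iteration count are placeholders for the quantities used in the experiment.

```python
import numpy as np

def exponentiated_gradient(grad, p0, eta=0.1, steps=500):
    """Mirror descent with negative log-entropy on the simplex:
    p <- p * exp(-eta * grad(p)), followed by renormalization."""
    p = p0.copy()
    for _ in range(steps):
        g = grad(p)
        p = p * np.exp(-eta * (g - g.min()))  # shift for numerical stability
        p = p / p.sum()                       # project back onto the simplex
    return p
```

The multiplicative form of the update automatically preserves non-negativity, and the renormalization enforces the unit-mass constraint, which is why this scheme is a natural choice for optimization over probability vectors.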
We take the perturbation matrix as , that is, the perturbed matrix is given by . Recall the asymptotic expansion in equation (4.10). As noted in §4c, the contribution of the term vanishes in the second-order expansion of the energy, and we are left with the solution:
(5.7)
where is as in theorem 4.14. We note that this measure has a Lebesgue density that can be evaluated at the grid points in ; we denote the resulting vector by . In figure 5, we compare this solution to the vector obtained by solving equations (5.5)–(5.6). The vector for different values of is shown in figure 5a; in figure 5b, we plot the error .
Figure 5.
Numerical study of the asymptotic solution from theorem 4.14 in two dimensions. (a) The probability vectors computed using equation (5.5) with 500 steps for . (b) The approximation error for the first-order expansion in equation (5.7) (blue, solid) and the conjectured form in equation (5.8) (green, dotted).
Beyond the first-order expansion in equation (5.7), we conjecture that behaves as follows:
(5.8)
where is a function to be determined. Taking a second-order Taylor expansion , we estimate the coefficients via linear regression with the given vectors as data points and obtain . The error of this approximation is shown in figure 5b and is lower than that of the first-order expansion in equation (5.7). We leave the analysis of this ansatz to future work.
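The regression step can be realized, for instance, by an ordinary least-squares fit of a quadratic in the expansion parameter. The helper below is a hypothetical sketch, with `x` the sampled parameter values and `y` the corresponding measured quantities from the experiment.

```python
import numpy as np

def fit_quadratic(x, y):
    """Least-squares fit of y ~ c0 + c1*x + c2*x**2; returns (c0, c1, c2)."""
    A = np.vander(np.asarray(x), 3, increasing=True)  # columns: 1, x, x^2
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(y), rcond=None)
    return coeffs
```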
(c). Maximizers for negative definite and indefinite matrices
We proceed to numerical examples for §3d, i.e. maximization of the energy corresponding to a negative definite matrix. We take a system of particles and consider the two matrices from figure 1 multiplied by . The results are shown in figure 6. We observe that a single final state consists of clusters at , where is an eigenvector corresponding to the smallest eigenvalue, in agreement with theorem 3.17. As shown there, the behaviour does not change if one of the eigenvalues is zero, as only the eigenvectors corresponding to the smallest eigenvalue are relevant. For this reason, we do not consider the semi-definite case separately. The results here are not affected by the choice of the normalization; we only show those obtained with equation (5.2).
Figure 6.
Discrete maximizers on the sphere for negative definite matrices obtained with particles. We visualize the two-cluster final states by connecting the two components of each cluster corresponding to the same run with a line, assigning different colours to the two opposite clusters. The colour of the sphere indicates the value of at each point on the sphere. (a) For D = −Id a single final state has clusters at both and for any . For clarity, we only show results for 6 different initializations. (b) For D = −diag(1,3,4) a single final state has clusters both at (0,0,1) and (0,0,−1). We show the results for 100 different initializations.
Finally, we turn to the case of indefinite matrices. As noted in remark 3.5, for a matrix that is not negative definite, a Dirac delta placed at the eigenvector corresponding to the largest eigenvalue may not be a maximizer. This can be observed numerically as shown in figure 7 where we plot the energies of one- and two-cluster states for with .
Figure 7.
Energies of the states in blue, in red and in green for the matrix with varying values of .
6. Conclusion
In this work, we studied a mathematical model of self-attention layers used in the transformer architecture. Building upon [9], we analysed a continuum limit in the space of probability measures on a sphere. To understand the underlying geometry, we studied a new optimal transport distance with non-local mobility. We proved that the space of probability measures with this distance is a geodesic space and characterized absolutely continuous curves in this space. This allowed us to interpret the continuity equation (2.5) as curves of maximal slope of the interaction energy and to analyse the large-time behaviour using the energy dissipation property, showing that the dynamics converge to a stationary point of the interaction energy.
We analysed these critical points (in particular, minimizers and maximizers) for various types of interactions determined by the matrix in equation (1.2). These results are listed in table 1. We find that the positions of stationary points are strongly connected to normalized eigenvectors of , which form a strict subset of in the case . In other words, the regions where clusters appear depend not only on the initial configuration but also on the interaction matrix itself. This could be related to the mode collapse often observed in practice. It is an interesting question whether an alternative, rotation-invariant architecture could prevent mode collapse.
Table 1.
Summary of results on minimizers/maximizers of the interaction energy in equation (1.2). We denote by and the eigenvectors that correspond to the smallest, respectively largest, eigenvalue of .
property of | minimizers | maximizers
---|---|---
positive definite | symmetric w.r.t. all eigenvectors (corollary 3.10 and §5b) | (theorem 3.1 and §5a)
positive semi-definite | any concentrated on (theorem 3.4 and §5b) | (theorem 3.1)
negative (semi-)definite | (theorem 3.1) | (corollary 3.16 and §5c)
indefinite | (theorem 3.4) | maximal: (theorem 3.1 and §5c)
Several further questions remain open for future work: as already discussed, it would be interesting to study the optimal transport distance for mobilities that cannot be bounded from below, which is the case, for example, in problems of opinion dynamics where the Gaussian kernel on Euclidean space is often used. In this case, the metric is no longer equivalent to . So far, we have only shown that equation (2.6) represents gradient flows in using the concept of curves of maximal slope. We do not know whether these curves satisfy the slightly stronger evolution variational inequality, which would yield an easy stability estimate for solutions of equation (2.6).
From a practical point of view, an even more interesting direction is studying more general flows in that correspond to non-symmetric matrices in equation (1.2), which is common in transformer architectures. As mentioned above, basic properties of the distance carry over to the non-symmetric case, but characterizing the stationary states is non-trivial; one possibility is splitting the effective velocity fields into a dissipative and a (generalized) divergence-free part, similar to non-symmetric Fokker–Planck equations.
Finally, to justify the use of the continuum limit for studying the practical behaviour of transformers, one needs to establish convergence of discrete time-stepping in arbitrary time intervals. Moreover, it is worth studying how the step size influences the behaviour of the system and what effect weight-sharing would have.
Appendix A. Proofs of Section 2
A.1. Continuity equation on manifolds
Let be a compact, -dimensional Riemannian manifold and its tangent bundle. Although is not a vector space, the tangent bundle itself can be considered as a -dimensional Riemannian manifold. For its proper definition and the topology on , we refer to [67, ch. 3 (The Tangent Bundle)]. Velocity fields on manifolds are maps such that , where is the projection map sending each vector in to . We shall regularly commit the mild crime of interpreting as an element in instead of . Let be an open interval, be a Borel family of probability measures on and be a time-dependent Borel velocity field such that
(A 1)
where denotes the norm induced by the inner product of the Riemannian structure. The continuity equation holds in the sense of distributions if
(A 2)
Here, denotes the differential of the map for a fixed .
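Since the displayed conditions (A 1) and (A 2) are elided in this rendering, we record the standard formulations they correspond to, following the conventions of [53]; the integrability exponent below is our assumption.

```latex
% integrability of the velocity field, cf. (A 1)
\int_I \int_M \lVert v_t(x) \rVert_x \, \mathrm{d}\mu_t(x) \, \mathrm{d}t < \infty ;
% distributional form of the continuity equation, cf. (A 2)
\int_I \int_M \Bigl( \partial_t \varphi(t,x)
    + \mathrm{d}_x \varphi(t,x)\,[\,v_t(x)\,] \Bigr)
    \, \mathrm{d}\mu_t(x) \, \mathrm{d}t = 0
    \qquad \text{for all } \varphi \in \mathrm{C}^\infty_{\mathrm{c}}(I \times M).
```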
Proposition A.1 (Properties). Solutions to the continuity equation have the following properties:
— Continuous representative: Let be a Borel family of probability measures satisfying equation (A 2) for a Borel vector field satisfying equation (A 1). Then there exists a narrowly continuous curve such that for a.e. . Moreover, if and we have [53, Lemma 8.1.2]:
(A 3)
— Time rescaling: Let be a strictly increasing absolutely continuous map with absolutely continuous inverse . Then is a distributional solution of the continuity equation if and only if [53, Lemma 8.1.3]
is a distributional solution of the continuity equation on .
— Gluing solutions: Let be two narrowly continuous curves in with . Let further be the corresponding Borel velocity fields such that equation (A 3) is satisfied. Then and defined by
satisfy equation (A 3) [69, Lemma 4.4].
A.2. Proof of Theorem 2.2
We follow the proof strategy from [69] for the ‘flat’ Euclidean case, but since is not a vector space, modifications are required. We start by establishing a compactness result for solutions of continuity equations with finite energy. For our purposes, we define the ‘lifted’ flux in duality with (see [70, Theorem 7.2]) by
(A 4)
Notably, solve the continuity equation in the sense that for all :
(A 5)
where is the extension of on to that is constant along . Further, we define in duality with by
Lemma A.2. Let be a sequence in with
Then there exists a subsequence and a couple satisfying the continuity equation in the sense of equation (A 5) such that
and for the map one has
(A 6)
Proof. Step 1 (Convergence of ):
The estimate
combined with the fact that is compact and [53, Remark 5.1.5] implies tightness of . By disintegrating , we obtain a Borel family such that . Since is compact, is tight, and we extract a further subsequence such that .
Step 2 (Convergence of ):
Consider a function and for set . Since the discontinuity set of is concentrated on and , general convergence theorems (see, e.g. [53, Prop. 5.1.10]) imply
Let us fix a . Since is compact, is tight, and we can extract from any subsequence a further subsequence such that converges narrowly. Then by equations (A 5) and (A 7) and the fact that is dense in , we know that all subsequences have the same limit. Therefore, for a particular . By the previous calculations, we also immediately obtain that satisfy the continuity equation in the sense of equation (A 5). To show equation (A 6), we observe that since is compact
∎
Proof of Theorem 2.2. Step 1:
Let be a minimizing sequence of the functional in equation (2.3) for some . Then the conditions of lemma A.2 are met, and we obtain
where the limit satisfies the continuity equation in the sense of equation (A 5). Equation (A 6) in particular implies that can be disintegrated in the following way:
where . Using [53, Lemma 5.1.7], we now show that for it holds that
where in the last line, we used Jensen’s inequality. Since is linear and satisfy equation (A 5), this implies that and for this couple the infimum in equation (2.3) is attained.
Step 2:
Proposition A.1 and a linear time rescaling show that
(A 8)
We denote by the infimum in equation (2.4) and show that indeed . By Hölder’s inequality, we immediately obtain that . To show the reverse, we follow the arguments of [69, Theorem 5.4] and define for :
Then is strictly increasing, and with , so that its inverse map is well defined and Lipschitz continuous and
By proposition A.1, we have that for , the couple and
with the last term being smaller than or equal to . Sending , we obtain
and hence, . This in particular implies that for every minimizer of the functional in equation (2.3), the equality
holds, which is only the case when is constant for a.e. , implying equation (A 9) by a further time rescaling argument. ∎
A.3. Proof of Lemma 2.4
Proof of Lemma 2.4. If and then by equation (2.4), we have
On the other hand, if is an absolutely continuous curve, then by a standard reparametrization argument [53, Lemma 1.1.4], we may assume to be Lipschitz. For , we set the step size as and choose a family of constant-speed geodesics , such that for
Gluing all geodesics together by proposition A.1, we obtain a curve . Lemma A.2 gives us a subsequence, still denoted by , and a couple such that and . By construction, and coincide on the dense (in ) set . Since both and are narrowly continuous, must hold. Again, equation (A 6) implies that can be disintegrated in the following way:
where . Then with and
(A 10)
Since , we have that
Finally, for equation (A 10) to hold, must hold for a.e. . ∎
A.4. Proof of Lemma 2.6
Proof of Lemma 2.6. From theorem 2.3, we know that the distances and are equivalent. Therefore, we can assume absolute continuity with respect to . Further, by a standard rescaling argument (e.g. [53, Lemma 1.1.4] or [53, Lemma 8.1.3]), it is enough to prove equation (2.7) for -Lipschitz curves (w.r.t. ), i.e. we only need to consider absolutely continuous curves such that
For convenience, we shall set for and for as well as for . We define the function for which
in the distributional sense. Using the mollifier , as described in [71, ch. C.5], one can smooth out in the time direction by setting
By [71, ch. C.5, Theorem 7 (iii)], we have that pointwise, and with the use of the dominated convergence theorem with the upper bound , we calculate
We further have that
where for we use the definition of the distributional derivative and rearrange the integral using the Fubini–Tonelli theorem. To prove , we need to define a piecewise constant approximation of . We fix a and set for
Since is -Lipschitz, we have for all . Then, we estimate
where is the optimal transport plan between and and denotes the dual norm of . (For more details on the static formulation of Wasserstein distances via optimal transport plans, we refer to [53, ch. 6]). We can argue similarly in the mollified case:
We denote and combine equations (A 11) and (A 12) to estimate
where, first, is chosen such that and, second, such that and for each , it holds (by lemma A.3). Therefore is proven.
Finally, by lemma A.4, we obtain that we can use as a test function in equation (A 3) and send to obtain
∎
Lemma A.3. Let be Borel measurable and with
For
it holds
Proof. We adapt [71, ch. C.5, Theorem 7] to our case and start by showing
We approximate in by (see [70, Proposition 7.9]) and calculate
From [71, ch. C.5, Theorem 7], we know that for all because is continuous. Choosing such that and using the dominated convergence theorem, we get . As can be chosen arbitrarily small, we obtain convergence.∎
Lemma A.4. We have .
Proof. Let be a smooth local chart for an open set containing . Then, since , the function is continuous in and the product
is continuous on . Taking any sequence , we can use the dominated convergence theorem to obtain
An upper bound is given by the function . Thus, is continuous in . With the same argument, a similar statement can be shown for
By [72, Theorem 2.8], it follows that and since the local chart was chosen arbitrarily .∎
Appendix B. Spherical coordinates
For many computations in §4, we use spherical coordinates. Up to small notational changes, we use the definition provided in [73]. We define the coordinate transform for as
Here and in the following, denotes the th standard basis vector.
The Jacobian determinant is given by
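Since both the coordinate transform and its Jacobian determinant are elided in this rendering, we record one standard convention, following [73] up to the ordering of the angles, which is an assumption here.

```latex
% spherical coordinates on S^{d-1}, one standard convention (cf. [73])
x_1 = \cos\varphi_1, \qquad
x_k = \cos\varphi_k \prod_{l=1}^{k-1} \sin\varphi_l \quad (2 \le k \le d-1), \qquad
x_d = \prod_{l=1}^{d-1} \sin\varphi_l,
% with \varphi_1,\dots,\varphi_{d-2} \in [0,\pi], \ \varphi_{d-1} \in [0,2\pi);
% the Jacobian determinant then reads
J_d(\varphi) = \prod_{l=1}^{d-2} \sin^{\,d-1-l}(\varphi_l).
```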
To highlight the recursive character of with respect to , we further note that
where the index denotes that we drop the first element, i.e. for , . A practical consequence of this property is the recursive computation formula for the Hausdorff measure of the -dimensional sphere.
Lemma B.1. Denote . For , it holds that
Proof. For , the proof follows from a simple computation and the fact that and . For , we have
where we use the recursive property of the Jacobian determinant. ∎
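The precise constants of lemma B.1 are elided in this rendering, but the standard recursion it corresponds to, namely that the measure of the sphere in one dimension higher is obtained by integrating a power of the sine against the measure of the lower-dimensional sphere, can be checked numerically. The sketch below compares the closed-form surface area with the recursion; the function names are our own.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def sphere_area(d):
    """Hausdorff measure of the d-dimensional unit sphere S^d in R^{d+1}."""
    return 2 * np.pi ** ((d + 1) / 2) / gamma((d + 1) / 2)

# numerical check of |S^d| = |S^{d-1}| * int_0^pi sin^{d-1}(t) dt
for d in range(2, 6):
    integral, _ = quad(lambda t: np.sin(t) ** (d - 1), 0, np.pi)
    assert np.isclose(sphere_area(d), sphere_area(d - 1) * integral)
```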
B.1. Definition using Givens rotations
Spherical coordinates can equivalently be defined using Givens rotations (see e.g. [74, ch. 5.1.8]). A Givens rotation for an angle and indices with is determined by the rotation matrix :
Applying to a vector corresponds to a counterclockwise rotation of by the angle in the -plane. For a given vector of angles , we can thus construct the matrix
(B 1)
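For illustration, a minimal sketch of constructing a Givens rotation and composing the product in equation (B 1) is given below; the composition order and sign convention are assumptions matching one common choice, and the function names are our own.

```python
import numpy as np

def givens(d, i, j, theta):
    """Givens rotation in R^d: counterclockwise rotation by theta in the
    (e_i, e_j)-plane (0-based indices), identity on the complement."""
    G = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i], G[j, j] = c, c
    G[i, j], G[j, i] = -s, s
    return G

def rotation_from_angles(phi):
    """Compose Givens rotations for a vector of angles phi = (phi_1, ..., phi_{d-1});
    the ordering of the factors is an assumption."""
    d = len(phi) + 1
    R = np.eye(d)
    for k, angle in enumerate(phi):
        R = R @ givens(d, k, k + 1, angle)
    return R
```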
The rotation matrix can be written as a two-dimensional rotation of angle in the -plane, as the following lemma shows.
Lemma B.2. Let be the rotation matrix as described in equation (B 1) . Then, it holds that
with , and .
Proof. For , the statement can be verified by inserting and the definition of . For , we define
With this choice of , has the claimed form and due to the orthogonality of Givens matrices. It remains to show that the first two rows of fulfil and . For , reduces to
and clearly, and . For , the proof follows by induction over . ∎
Corollary B.3. Let , then
In particular, if it holds that
With the above results, we obtain
and since Givens matrices are orthogonal, it also holds that
We can therefore also consider rotated spherical coordinates
for a reference point , with the same Jacobian determinant as before, i.e. .
Appendix C. Proofs for Section 4
C.1. Proof of Lemma 4.7
Lemma 4.7 (cont.) Let . The uniform distribution is a stationary point of if and only if all eigenvalues of have the same absolute value, i.e. for some .
Proof. The proof for uses the same arguments as for ; however, the rotation corresponding to a translation of the angle in two dimensions is technically more complicated in higher dimensions. We use the notation and techniques from appendix B (spherical coordinates and rotations ).
Again, we first fix and consider the integral
Similarly to the two-dimensional case, we choose such that
and therefore also
where denotes the first standard basis vector. We rewrite the integral using rotated spherical coordinates and substitute it into the above identity to obtain
where denotes the Jacobian determinant of . To reduce the above integral over the vector to an integral over only the first component , we write
where the subscript denotes that we neglect the first component. Inserting this into , we get
and due to the symmetry of sine and cosine, we have that for any , . We can thus deduce that if and only if is an eigenvector of , exactly as in the case . This holds true for -almost all if and only if all eigenvalues of have the same absolute value, which then automatically yields .
Again, it remains to show that this is also necessary. Without loss of generality, we assume and and to be the eigenvalues of largest, respectively second largest, absolute value corresponding to the eigenvectors , respectively .
From here, the strategy is exactly the same as in the two-dimensional case, which we restate here for completeness. The factor is strictly negative on the set
Since , we can find a Lipschitz continuous such that for -a.e. on and
For all such , it holds that , which concludes the proof.∎
C.2. Proof of Lemma 4.8
Lemma 4.8 (cont.) Let , and . Then, it holds that
for any , where the constant is positive and depends only on the dimension .
Proof. The proof for goes along the lines of the proof for . However, the rotation corresponding to a translation of the angle in two dimensions is technically more complicated in higher dimensions. For an introduction to rotated spherical coordinates used in this proof, we refer the reader to appendix B.
We first fix and choose such that . We proceed to write the integral using rotated spherical coordinates and obtain
Substituting the expressions for and yields
In addition, we note that we can write any and see that
where denotes the first standard basis vector. Substituting the above equality into the integral, we derive
The proof now follows from choosing the constant:
which is positive for all since the function is positive for and both sine and cosine are positive for .∎
C.3. Proof of Lemma 4.9
Lemma 4.9 (cont.) Let , and . Then, for all and , it holds that
where the constants and are positive and depend only on the dimension .
Proof. Using the same arguments as in the previous proof, we obtain
where the mixed term containing vanishes due to symmetry. Since the second term still depends on due to the rotation, we write and decompose into its rotation-invariant and rotation-variant part. More precisely, we use corollary B.3 to get
and thus
Making use of the trigonometric identity , we get
where in the last step, we use the fact that . To prove that the integral over the expression in equation (C 4) can be written as claimed, we observe that for all
where is positive and depends only on , and therefore, also . With this, we derive that
for all and it remains to show that for any
The case is trivial as . For , we write out the integrand and obtain
where we can use the same argument as for equation (C 5) to show that the last summand integrates to zero. Since also for any , we derive equation (C 6). Together with equations (C 4) and (C 5), this yields
The statement now follows from substituting the above into equation (C 3), with constants given by
Since for all , it directly follows that . To show that for all , we first show that . For , this follows directly from . For , we have
Using integration by parts, we further derive that
As shown in lemma B.1, the recursive form of the Jacobian determinant of spherical coordinates yields that
Combining these equalities, we see that
and therefore, with integration by parts, we get
Due to the symmetry of sine and cosine, we get
where the positivity follows from the fact that the function is positive for and both sine and cosine are positive for .∎
Contributor Information
Martin Burger, Email: martin.burger@desy.de.
Samira Kabri, Email: samira.kabri@desy.de.
Yury Korolev, Email: ymk30@bath.ac.uk.
Tim Roith, Email: tim.roith@desy.de.
Lukas Weigand, Email: lukas.weigand@desy.de.
Data accessibility
This article has no additional data.
Declaration of AI use
We have not used AI-assisted technologies in creating this article.
Authors’ contributions
M.B.: conceptualization, formal analysis, funding acquisition, investigation, methodology, supervision, writing—original draft; S.K.: formal analysis, investigation, methodology, visualization, writing—original draft; Y.K.: conceptualization, formal analysis, funding acquisition, investigation, methodology, writing—original draft; T.R.: funding acquisition, investigation, methodology, software, visualization, writing—original draft; L.W.: formal analysis, investigation, methodology, writing—original draft.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration
We declare we have no competing interests.
Funding
M.B. and T.R. acknowledge funding by the German Ministry of Science and Technology (BMBF) under grant agreement No. 01IS24072A (COMFORT). M.B., S.K., T.R. and L.W. acknowledge support from DESY (Hamburg, Germany), a member of the Helmholtz Association HGF. This research was supported in part through the Maxwell computational resources operated at Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany. M.B. and S.K. acknowledge support from the German Research Foundation, project BU 2327/19-1. M.B. and L.W. acknowledge support from the German Research Foundation, project BU 2327/20-1. Y.K. acknowledges support from the German Research Foundation as visiting fellow within the priority programme Foundations of Deep Learning. Part of this study was carried out while S.K. and T.R. were visiting the California Institute of Technology, supported by the DAAD grant for project 57698811 'Bayesian Computations for Large-scale (Nonlinear) Inverse Problems in Imaging'. Y.K. acknowledges the support of the EPSRC (Fellowship EP/V003615/2 and Programme Grant EP/V026259/1). S.K. and Y.K. are grateful for the hospitality of the University of Bath during the workshop 'Machine Learning in Infinite Dimensions', sponsored by the ICMS, LMS, IMI Bath, ProbAI and Maths4DL, where part of this work was undertaken.
References
- 1. OpenAI . 2023. GPT-4 technical report. arXiv:2303.08774. ( 10.48550/arXiv.2303.08774) [DOI]
- 2. Wu J, Gan W, Chen Z, Wan S, Philip SY. 2023. Multimodal large language models: a survey. In 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, pp. 2247–2256. IEEE. ( 10.1109/BigData59044.2023.10386743) [DOI] [Google Scholar]
- 3. Fields C, Kennington C. 2023. Vision language transformers: a survey. arXiv:2307.03254. ( 10.48550/arXiv.2307.03254) [DOI]
- 4. Esser P, et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-First International Conference on Machine Learning. Vienna, Austria: PMLR. [Google Scholar]
- 5. Abramson J, et al. 2024. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630 , 493–500. ( 10.1038/s41586-024-07487-w) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Jumper J, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596 , 583–589. ( 10.1038/s41586-021-03819-2) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Vuckovic J, Baratin A, Combes RT. 2020. A mathematical theory of attention. arXiv 2007.02876. ( 10.48550/arXiv.2007.02876) [DOI] [Google Scholar]
- 8. Sander ME, Ablin P, Blondel M, Peyré G. 2022. Sinkformers: transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pp. 3515–3530. JMLR. [Google Scholar]
- 9. Geshkovski B, Letrouit C, Polyanskiy Y, Rigollet P. 2023. A mathematical perspective on transformers. arXiv 2312.10794. ( 10.48550/arXiv.2312.10794) [DOI] [Google Scholar]
- 10. Calvello E, Kovachki NB, Levine ME, Stuart AM. 2024. Continuum attention for neural operators. arXiv: 2406.06486. ( 10.48550/arXiv.2406.06486) [DOI] [Google Scholar]
- 11. Nguyen TM, Nguyen T, Ho N, Bertozzi AL, Baraniuk RG, Osher SJ. 2024. A primal-dual framework for transformers and neural networks. arXiv 2406.13781. ( 10.48550/arXiv.2406.13781) [DOI] [Google Scholar]
- 12. Wright MA, Gonzalez J. 2021. Transformers are deep infinite-dimensional non-mercer binary kernel machines. arXiv 2106.01506. ( 10.48550/arXiv.2106.01506) [DOI] [Google Scholar]
- 13. Criscitiello C, Rebjock Q, McRae AD, Boumal N. 2024. Synchronization on circles and spheres with nonlinear interactions. arXiv 2405.18273. ( 10.48550/arXiv.2405.18273) [DOI] [Google Scholar]
- 14. Alcalde A, Fantuzzi G, Zuazua E. 2024. Clustering in pure-attention hardmax transformers and its role in sentiment analysis. arXiv Preprint 2407.01602. ( 10.48550/arXiv.2407.01602) [DOI] [Google Scholar]
- 15. Geshkovski B, Rigollet P, Ruiz-Balet D. 2024. Measure-to-measure interpolation using transformers. arXiv Preprint 2411.04551. ( 10.48550/arXiv.2411.04551) [DOI] [Google Scholar]
- 16. Kan K, Li X, Osher S. 2025. OT-Transformer: a continuous-time transformer architecture with optimal transport regularization. arXiv Preprint 2501.18793. ( 10.48550/arXiv.2501.18793) [DOI] [Google Scholar]
- 17. Viswanathan K, Gardinazzi Y, Panerai G, Cazzaniga A, Biagetti M. 2025. The geometry of tokens in internal representations of large language models. arXiv Preprint 2501.10573. ( 10.48550/arXiv.2501.10573) [DOI] [Google Scholar]
- 18. Abella ÁR, Silvestre JP, Tabuada P. 2024. The asymptotic behavior of attention in transformers. arXiv Preprint 2412.02682. ( 10.48550/arXiv.2412.02682) [DOI] [Google Scholar]
- 19. Alcalde A, Fantuzzi G, Zuazua E. 2025. Exact sequence classification with hardmax transformers. arXiv Preprint 2502.02270. ( 10.48550/arXiv.2502.02270) [DOI] [Google Scholar]
- 20. Lu Y, Li Z, He D, Sun Z, Dong B, Qin T, Wang L, Liu T. 2020. Understanding and improving transformer from a multi-particle dynamic system point of view. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations. [Google Scholar]
- 21. Dutta S, Gautam T, Chakrabarti S, Chakraborty T. 2021. Redesigning the transformer architecture with insights from multi-particle dynamical systems. Adv. Neural Inf. Process. Syst. 34 , 5531–5544. [Google Scholar]
- 22. Chizat L, Bach F. 2018. On the global convergence of gradient descent for over-parameterized models using optimal transport. Adv. Neural Inf. Process. Syst. 31 , 3040–3050. [Google Scholar]
- 23. Ding Z, Chen S, Li Q, Wright S. 2021. On the global convergence of gradient descent for multi-layer resnets in the mean-field regime. arXiv 2110.02926. ( 10.48550/arXiv.2110.02926) [DOI] [Google Scholar]
- 24. Hegselmann R, Krause U. 2002. Opinion dynamics and bounded confidence models, analysis and simulation. J. Artif. Soc. Soc. Simulation 5 . [Google Scholar]
- 25. Gómez-Serrano J, Graham C, Le Boudec JY. 2012. The bounded confidence model of opinion dynamics. Math. Models Methods Appl. Sci. 22 , 1150007. ( 10.1142/s0218202511500072) [DOI] [Google Scholar]
- 26. Piccoli B, Rossi F. 2021. Generalized solutions to bounded-confidence models. Math. Models Methods Appl. Sci. 31 , 1237–1276. ( 10.1142/s0218202521400054) [DOI] [Google Scholar]
- 27. Bruno G, Pasqualotto F, Agazzi A. 2024. Emergence of meta-stable clustering in mean-field transformer models. arXiv Preprint 2410.23228. ( 10.48550/arXiv.2410.23228) [DOI] [Google Scholar]
- 28. Geshkovski B, Koubbi H, Polyanskiy Y, Rigollet P. 2024. Dynamic metastability in the self-attention model. arXiv Preprint 2410.06833. ( 10.48550/arXiv.2410.06833) [DOI] [Google Scholar]
- 29. Burger M, Erbar M, Hoffmann F, Matthes D, Schlichting A. 2025. Covariance-modulated optimal transport and gradient flows. Arch. Ration. Mech. Anal. 249 . ( 10.1007/s00205-024-02065-w) [DOI] [Google Scholar]
- 30. Duncan A, Nüsken N, Szpruch L. 2023. On the geometry of Stein variational gradient descent. J. Mach. Learn. Res. 24 , 1–39. [Google Scholar]
- 31. Li W. 2021. Hessian metric via transport information geometry. J. Math. Phys. 62 . ( 10.1063/5.0012605) [DOI] [Google Scholar]
- 32. Lisini S, Matthes D, Savaré G. 2012. Cahn–Hilliard and thin film equations with nonlinear mobility as gradient flows in weighted-Wasserstein metrics. J. Differ. Equ. 253 , 814–850. ( 10.1016/j.jde.2012.04.004) [DOI] [Google Scholar]
- 33. Burger M, Di Francesco M. 2008. Large time behavior of nonlocal aggregation models with nonlinear diffusion. Netw. Heterog. Media 3 , 749–785. ( 10.3934/nhm.2008.3.749) [DOI] [Google Scholar]
- 34. Cañizo JA, Ramos-Lora A. 2024. Discrete minimizers of the interaction energy in collective behavior: a brief numerical and analytic review. arXiv 2403.00594. ( 10.48550/arXiv.2403.00594) [DOI] [Google Scholar]
- 35. Carrillo JA, Chipot M, Huang Y. 2014. On global minimizers of repulsive–attractive power-law interaction energies. Phil. Trans. R. Soc. A 372 , 20130399. ( 10.1098/rsta.2013.0399) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Carrillo J, Figalli A, Patacchini SF. 2017. Geometry of minimizers for the interaction energy with mildly repulsive potentials. Ann. De L’IHP Anal. Non Linéaire 34 , 1299–1308. ( 10.1016/J.ANIHPC.2016.10.004) [DOI] [Google Scholar]
- 37. Shu R. 2024. Wasserstein-infinity stability and mean field limit of discrete interaction energy minimizers. arXiv 2407.18395. ( 10.48550/arXiv.2407.18395) [DOI] [Google Scholar]
- 38. Simione R, Slepčev D, Topaloglu I. 2015. Existence of ground states of nonlocal-interaction energies. J. Stat. Phys. 159 , 972–986. ( 10.1007/s10955-015-1215-z) [DOI] [Google Scholar]
- 39. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 . [Google Scholar]
- 40. Bahdanau D, Cho K, Bengio Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv 1409.0473. ( 10.48550/arXiv.1409.0473) [DOI] [Google Scholar]
- 41. Castin V, Ablin P, Peyré G. 2024. How smooth is attention? In Proceedings of the 41st International Conference on Machine Learning (eds Salakhutdinov R, Kolter Z, Heller K, Weller A, Oliver N, Scarlett J, Berkenkamp F), Proceedings of Machine Learning Research, vol. 235, pp. 5817–5840. Vienna, Austria: PMLR. [Google Scholar]
- 42. Castin V, Ablin P, Carrillo J, Peyré G. 2025. A unified perspective on the dynamics of deep transformers. arXiv Preprint 2501.18322. ( 10.48550/arXiv.2501.18322) [DOI] [Google Scholar]
- 43. Karagodin N, Polyanskiy Y, Rigollet P. 2024. Clustering in causal attention masking. arXiv Preprint 2411.04990. ( 10.48550/arXiv.2411.04990) [DOI] [Google Scholar]
- 44. Ioffe S, Szegedy C. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (eds Bach F, Blei D), vol. 37, pp. 448–456. Lille, France: PMLR. [Google Scholar]
- 45. Lei Ba J, Kiros JR, Hinton GE. 2016. Layer normalization. arXiv 1607.06450. [Google Scholar]
- 46. Touvron H, et al. 2023. Llama: open and efficient foundation language models. arXiv 2302.13971. ( 10.48550/arXiv.2302.13971) [DOI] [Google Scholar]
- 47. Zhang B, Sennrich R. 2019. Root mean square layer normalization. Adv. Neural Inf. Process. Syst. 32 , 12381–12392. [Google Scholar]
- 48. He K, Zhang X, Ren S, Sun J. 2016. Identity mappings in deep residual networks. In Computer vision – ECCV 2016 (eds Leibe B, Matas J, Sebe N, Welling M), pp. 630–645. Cham: Springer International Publishing. ( 10.1007/978-3-319-46493-0_38) [DOI] [Google Scholar]
- 49. Weinan E. 2017. A proposal on machine learning via dynamical systems. Commun. Math. Stat. 5 , 1–11. ( 10.1007/s40304-017-0103-z) [DOI] [Google Scholar]
- 50. Haber E, Ruthotto L. 2018. Stable architectures for deep neural networks. Inverse Probl. 34 , 20. ( 10.1088/1361-6420/aa9a90) [DOI] [Google Scholar]
- 51. Chen RT, Rubanova Y, Bettencourt J, Duvenaud DK. 2018. Neural ordinary differential equations. Adv. Neural Inf. Process. Syst. 31 , 6571–6583. [Google Scholar]
- 52. Thorpe M, van Gennip Y. 2023. Deep limits of residual neural networks. Res. Math. Sci. 10 , 6. ( 10.1007/s40687-022-00370-y) [DOI] [Google Scholar]
- 53. Ambrosio L, Gigli N, Savaré G. 2008. Gradient flows: in metric spaces and in the space of probability measures. Lectures in Mathematics ETH Zürich, 2nd edn. Basel, Switzerland: Birkhäuser. ( 10.1007/978-3-7643-8722-8) [DOI] [Google Scholar]
- 54. Benamou JD, Brenier Y. 2000. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numer. Math. 84 , 375–393. ( 10.1007/s002110050002) [DOI] [Google Scholar]
- 55. Deffuant G, Neau D, Amblard F, Weisbuch G. 2000. Mixing beliefs among interacting agents. Adv. Complex Syst. 03 , 87–98. ( 10.1142/s0219525900000078) [DOI] [Google Scholar]
- 56. Bilyk D, Matzke RW, Vlasiuk O. 2022. Positive definiteness and the Stolarsky invariance principle. J. Math. Anal. Appl. 513 , 126220. ( 10.1016/j.jmaa.2022.126220) [DOI] [Google Scholar]
- 57. Fasshauer GE. 2011. Positive definite kernels: past, present and future. In ’Kernel functionsand meshless methods’ dolomites research notes on approximation (eds Marchi S, Buhmann MD, Plonka-Hoch G). [Google Scholar]
- 58. Bilyk D, Dai F. 2016. Geodesic distance Riesz energy on the sphere. arXiv 1612.08442. ( 10.48550/arXiv.1612.08442) [DOI] [Google Scholar]
- 59. Burger M, Di Francesco M, Franek M. 2013. Stationary states of quadratic diffusion equations with long-range attraction. Commun. Math. Sci. 11 , 709–738. ( 10.4310/cms.2013.v11.n3.a3) [DOI] [Google Scholar]
- 60. Gómez-Castro D. 2024. Beginner’s guide to aggregation-diffusion equations. SeMA J. 1–57 ( 10.1007/s40324-024-00350-y) [DOI] [Google Scholar]
- 61. Rossum G, Drake FL Jr. 1995. Python tutorial. The Netherlands: Centrum voor Wiskunde en Informatica Amsterdam. [Google Scholar]
- 62. Harris CR, et al. 2020. Array programming with NumPy. Nature 585 , 357–362. ( 10.1038/s41586-020-2649-2) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63. Virtanen P, et al. 2020. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17 , 261–272. ( 10.1038/s41592-019-0686-2) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64. Paszke A, et al. 2019. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 , 8026–8037. [Google Scholar]
- 65. Marchuk G, Lebedev VI. 1986. Numerical methods in the theory of neutron transport. New York, NY, USA: Harwood Academic Publishers. [Google Scholar]
- 66. Kivinen J, Warmuth MK. 1997. Exponentiated gradient versus gradient descent for linear predictors. Inf. Comput. 132 , 1–63. ( 10.1006/inco.1996.2612) [DOI] [Google Scholar]
- 67. Lee JM. 2013. Introduction to smooth manifolds, pp. 1–31. New York, NY, USA: Springer New York. ( 10.1007/978-1-4419-9982-5_1) [DOI] [Google Scholar]
- 68. Ambrosio L, Fusco N, Pallara D. 2000. Functions of bounded variation and free discontinuity problems, pp. 116–210. Oxford: Oxford University Press. ( 10.1093/oso/9780198502456.003.0003) [DOI] [Google Scholar]
- 69. Dolbeault J, Nazaret B, Savaré G. 2009. A new class of transport distances between measures. Calc. Var. Partial Differ. Equ. 34 , 193–231. ( 10.1007/s00526-008-0182-5) [DOI] [Google Scholar]
- 70. Folland GB. 1999. Real analysis: modern techniques and their applications. Hoboken, NJ: John Wiley & Sons. [Google Scholar]
- 71. Evans LC. 2010. Partial differential equations, 2nd edn. Providence, RI: American Mathematical Society. ( 10.1090/gsm/019) [DOI] [Google Scholar]
- 72. Spivak M. 2018. Calculus on manifolds: a modern approach to classical theorems of advanced calculus. Boca Raton, FL: CRC press. [Google Scholar]
- 73. Blumenson LE. 1960. A derivation of n-dimensional spherical coordinates. Am. Math. Mon. 67 , 63–66. ( 10.2307/2308932) [DOI] [Google Scholar]
- 74. Golub GH, Van Loan CF. 2013. Matrix computations, 4th edn. Philadelphia, PA, USA: Johns Hopkins University Press. [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
This article has no additional data.