Abstract
Understanding symmetries within data is crucial for explainability and for enhancing model efficiency in artificial intelligence. This work investigates an approach to neural symmetry detection that leverages the mathematical framework of Lie theory. Our approach projects data into a low-dimensional latent space, where symmetry transformations can be efficiently applied. By leveraging the matrix exponential, we accurately capture both affine and non-affine transformations, enabling improved data augmentation and model selection as potential applications. Our method also estimates transformation magnitude distributions, providing deeper insights into the geometric structure of data. Experiments conducted on augmented MNIST demonstrate the effectiveness of our approach in detecting complex symmetries with multiple transformations. This work paves the way for more interpretable and parameter-efficient AI models by identifying structural priors that align with the inherent symmetries in data.
Keywords: Deep learning, Symmetry detection, Lie theory
Subject terms: Mathematics and computing, Computational science
Introduction
Symmetries play a fundamental role in both physics and mathematics, providing deep insights into the laws governing natural phenomena and the structure of mathematical objects. In physics, for example, symmetries underlie conservation laws1 and are pivotal in formulating theories such as relativity2 and both classical3 and quantum mechanics4,5. Similarly, in mathematics, symmetries help in understanding geometric structures and solving complex equations6. Despite their significance, the integration of symmetry principles into artificial intelligence (AI) remains limited. While certain neural architectures (e.g., convolutional7 or group-equivariant neural networks8,9) inherently exploit a specific symmetry (translations or a specific group transformation, respectively), a universal approach for detecting and leveraging a broader class of symmetry transformations remains elusive. This is a significant shortcoming, because exploiting symmetries can lead to substantial improvements in model efficiency, interpretability, and performance across various AI applications. In particular, we emphasize its potential use for automatic model selection through structural bias learning in the context of geometric deep learning, what we will refer to as Type-II Deep Learning (cf. Type-II Bayes). In other words, given a neural network $f_W$ with weights $W$, a specific symmetry could have been engineered8–11 or learnt to self-impose12 as a bias through a weight-sharing scheme S, such that, for the set of weights $W_S$ determined by the scheme, its associated neural network $f_{W_S}$ is selected via

$$S^{\star} = \arg\max_{S}\; p(S) \int p(\mathcal{D} \mid W_S)\, p(W_S \mid S)\, \mathrm{d}W_S, \qquad (1)$$

with a suitable prior distribution $p(S)$ over possible schemes and where theoretical aspects regarding an appropriate measure over weight-sharing choices are left open to future scrutiny. Practically, a Type-II Neural Network can be thought of as an architecture that not only adjusts its weights during training, but has the ability to optimize its weight-sharing scheme for the task at hand.
In this paper, we tackle deep learning-aided symmetry detection, or neural symmetry detection. One can be interested in learning the fundamental geometrical properties of the distribution of data for various applications, ranging from data exploration and topological data analysis13,14 to model selection and structural prior learning15,16. Motivations include understanding the nature of the data in a broader explainable AI pipeline17 (in contrast to processing big data with a black box) and model efficiency by ensuring a suitable inductive bias, leading to constraints on the hypothesis space while retaining or even improving the expressive power of neural architectures18. Traditionally, advancements in deep learning have relied on computational scaling through the accrual of progressively larger data sets19. Since this is becoming impractical, recent approaches20 have opted for reducing the cost of training or the overall size of the model while maintaining competitive performance. In the context of the latter, the irrelevancies of a given task, usually formalized mathematically as symmetry groups21, can be exploited by geometric methods that either lead to weight-sharing schemes within the neural network9,22,23 or compensate with additional computational operations for when the input undergoes the expected transformation10,24,25. Modern approaches to symmetry detection focus on learning the most likely symmetry group that relates points in a data set. In previous work26,27, the matrix exponential required to quantify the continuous symmetry transformation was approximated in various ways, which sacrifices accuracy for tractability.
We propose instead to evaluate the matrix exponential exactly while avoiding the computationally expensive operation in pixel space by moving to a low-dimensional latent space, using a deep autoencoder-like architecture. This offers more control over the behaviour of latent vectors and features by enforcing continuous transformations, such as rotations, between them, which has been shown to improve the interpretability of features learned by deep models (left column, Fig. 1). Other work that models data geometrically in latent space has interesting connections to neuroscience and information propagation in the brain (top right of Fig. 1, e.g., “topographic” modelling28). In this work, in order to learn and control the amount of symmetry bias in our model, geometric information is stored entirely in the transformation T applied in latent space (e.g., for a Lie group, $T = e^{tG}$), allowing for application to arbitrarily structured data. This bottleneck in the symmetry detection algorithm removes the need for the engineer to pick a G-equivariant neural network for symmetry detection. The transformations are parameterized efficiently, and an extension to non-affine transformations is also possible, leading to more interesting settings such as the broader class of conformal transformations and more general diffeomorphisms. This has the added benefit of learning suitable representations, and has a possible extension to defining the connectivity matrix in the context of structural prior learning (see Section 2.1) and the construction of more general diffeomorphism-equivariant networks. Splitting the learning of the transformation distribution from that of the generator is another challenge; this was made possible by introducing a separate network to estimate the transformation magnitudes, creating a seamless end-to-end pipeline. Compared to other works in this field, this model is efficient (low parameter count for the symmetry generator, and better than brute force) and interpretable (generators are transferable).
Fig. 1.
Taxonomy of Lie-Autoencoding Models. Usual approaches exploit the structure of data by using an appropriate neural network for dimensionality reduction. Here, we differentiate between “agnostic” (MLPs) and (geometrically-)biased autoencoders (e.g., CNNs). G is used to denote a (geometric) prior, but it need not necessarily be a G-equivariant model. This work investigates models with MLP AEs (left column), while most approaches explored in the literature can be situated in the abstract-biased class (bottom right). Here, n denotes the characteristic latent vector size.
The proposed neural symmetry detector model enforces consistency losses to ensure the encoder-decoder pair remains non-trivial, even when the transformation magnitude t is set to zero. At the core of the model, transformations are parameterized using a basis (e.g., affine or quadratic, the latter having the ability to model infinitesimal Special Conformal Transformations), and their magnitudes are estimated by a separate neural network that takes data pairs as input (see Fig. 2). To align transformations in latent and pixel space, we introduce a Taylor expansion-based approximation for matching their estimates, bypassing the computational expense of directly calculating matrix exponentials. This model is evaluated on SyMNIST, a task involving augmented MNIST digit pairs, designed to detect the symmetry transformations applied during augmentation and to estimate the underlying magnitude distribution. We also provide results for SuperSyMNIST, in which the digits in a pair share a label but are different images, including multiple one-parameter transformations and larger canvases. We provide baselines and comparisons with other models, introducing an unsupervised generator-matching technique to link latent-space transformations to pixel-space transformations. Experimental results highlight its performance and suggest avenues for further work, including applications to downstream tasks such as Type-II Neural Networks. We also provide all the code for the latent model, which is designed in a modular way to allow for out-of-the-box tweaking and experimenting. Finally, we note that some of these ideas are already reflected in related literature, such as Noether’s Razor29, latent space symmetry discovery for non-linear symmetry transformations using GANs30 or otherwise31, attempts at learning weight-sharing schemes using permutation matrices32, and the automatic topology design system AutoML33.
Fig. 2.
Latent model architecture. The input pair is passed to the t-network, which predicts the magnitude of the transformation (top branch). Simultaneously, the pair is encoded (middle branch) and a consistency loss enforces the latents to be connected by a matrix multiplication with the result of the discretized exponential map. The generator is updated through its learnable coefficients in the chosen basis. Multiple generators can also be applied in series, introducing a label n for each.
Method
Symmetry generators and structural priors
We start with some background on symmetry detection using the formalism of Lie theory, following an approach by34 that was recently applied to the same problem in31. For a more thorough introduction to the topic, we refer the interested reader to35 and6, with a focus on representation theory and differential equations, respectively. Lastly, we briefly discuss the application of detecting symmetries to structural prior learning for model selection.
Lie groups and Lie algebras
The transformations that describe the symmetries are assumed to form Lie groups. This means they are sufficiently smooth (k-times differentiable, where k is usually chosen to be infinity), closed under composition, associative, have a neutral element, and have smooth inverses. These transformations can be defined by the way in which they act on objects, namely $x' = f(x, t)$, with object $x$. In order to apply differential operators to this object, we can treat it as a (canonical) coordinate vector field. Namely, each point is assigned its spatial coordinates as values, and this vector field is clearly $F(x) = x$. This also introduces a parameter, $t \in \mathbb{R}$, which is related to the magnitude of the transformation, forming what is usually called a one-parameter group. For rotations, this parameter will correspond to the angle; for translations, it will be the distance, etc. Because of continuity in the parameter, we can perform a Taylor expansion of the transformation $f$ for small values of t:

$$f(x, t) = f(x, 0) + t\,\frac{\partial f(x, t)}{\partial t}\bigg|_{t=0} + \mathcal{O}(t^2). \qquad (2)$$
We apply the First Fundamental Theorem of Lie6,34 in order to make the following claim: $\partial f/\partial t\,\big|_{t=0}$ defines the transformation and is related to what is known as the generator of the transformation. Intuitively, this correspondence between action and generator is due to the constraints imposed on the transformation function being a Lie group. The generator can thus be written as a differential operator as follows:

$$G = g_x(x, y)\,\frac{\partial}{\partial x} + g_y(x, y)\,\frac{\partial}{\partial y}. \qquad (3)$$
That is, if one solves the differential equations $\dot{x} = g_x(x, y)$, $\dot{y} = g_y(x, y)$ that characterize the generator, the original transformation function is obtained. More specifically, the solution is the family of functions topologically connected to the identity transformation through continuity in t. The generator is an element of the Lie algebra of the transformation group and is related to the original transformation by what is called the exponential map. This nomenclature emphasizes the connection between the differentiation performed in Equation 2 and exponentiation, easily seen when solving the characteristic equation6, i.e.,

$$\frac{\partial f(x, t)}{\partial t} = G\, f(x, t), \qquad (4)$$

as the solution is $f(x, t) = e^{tG}\,x$ with $e^{tG} = \sum_{k=0}^{\infty} \frac{(tG)^k}{k!}$, where the integer power of G is defined by applying the differential operator iteratively. Since we are applying it to a pixelated image or arbitrary signal, we must evaluate G in some basis by choosing an interpolation scheme. We discuss this in a later section. The inverse procedure, which extracts the generator from the action as shown in Equation 2, is also referred to as the logarithmic map.
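The exponential map can be sanity-checked numerically once the generator is expressed as a matrix: a truncated version of the power series above converges to the exact matrix exponential. A minimal sketch (the generator, magnitude, and truncation depth are illustrative choices):

```python
import numpy as np
from scipy.linalg import expm

def exp_series(G, t, terms=30):
    """Truncated exponential map: sum_{k=0}^{terms-1} (tG)^k / k!."""
    result = np.zeros_like(G, dtype=float)
    term = np.eye(G.shape[0])            # k = 0 term: the identity
    for k in range(terms):
        result += term
        term = term @ (t * G) / (k + 1)  # builds (tG)^{k+1} / (k+1)!
    return result

# Rotation generator acting on 2D coordinates (counterclockwise).
G_rot = np.array([[0.0, -1.0],
                  [1.0,  0.0]])
t = 0.7  # transformation magnitude (here: rotation angle in radians)

R_series = exp_series(G_rot, t)
R_exact = expm(t * G_rot)
assert np.allclose(R_series, R_exact)
# exp(tG) is the rotation matrix [[cos t, -sin t], [sin t, cos t]]
assert np.allclose(R_series, [[np.cos(t), -np.sin(t)],
                              [np.sin(t),  np.cos(t)]])
```

The same series defines the map for any generator in the algebra; only the matrix G changes.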
We focus on one-parameter groups for two reasons: ease of implementation, and the fact that one such inductive bias is already incredibly powerful. One need not look further than CNNs to conclude that identifying translation as a symmetry of a dataset immediately leads to equivariant models that are superbly successful in practice. Multiple transformations also require additional considerations that relate to the algebra itself, such as closure under commutators36, an extension we explored as well for 3-parameter groups.
Connectivity and equivariance: a Type-II NN roadmap
Learning connectivity matrices for deep equivariant models, regardless of whether a symmetry group related to solving the task is given a priori, has been of particular interest to the (geometric) deep learning community9,10,15,16,32. It is worth noting that generators can be related to the connectivity matrix, explicitly so for translations, where the shift matrix determines a power series that tiles the weight matrix accordingly. Formally, this involves picking the right representation of the symmetry group of interest, mapping the continuous differential-operator formalism described above to linear maps (matrices). We choose to write the partial derivatives in compact notation, e.g., $\partial_x \equiv \partial/\partial x$.

Practically, for the one-pixel shift matrix S, we can write, for the weight matrix $W_L$:

$$W_L = \sum_{k=0}^{K} w_k^{(L)}\, S^k. \qquad (5)$$

The above equation defines the weight matrix of one such convolutional layer L, with updatable weights $w_k^{(L)}$.
A collection of multiple power series applied in succession and interlaced with non-linear activation functions is the neural network. Schematically, we have

$$f(x) = \sigma_N\!\big(W_N\,\sigma_{N-1}\!\big(\cdots\,\sigma_1\!\big(W_1 x + b_1\big)\cdots\big) + b_N\big), \qquad (6)$$

with a total number of layers N, biases $b_i$, and activation functions $\sigma_i$. Note that this finite power series is not able to capture the parametrization of fully-connected layers, as all the elements in the basis need to be related by matrix powers. However, as with the Fourier transform of the Dirac delta function, it might be necessary to include infinitely many terms in order to correctly converge to an element of the basis of matrices as a vector space, i.e., the null matrix with a single entry equal to one.
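The tiling of a weight matrix by powers of the shift matrix can be sketched in a few lines; `make_shift` and the weights `w` below are illustrative, not taken from the paper's code:

```python
import numpy as np

def make_shift(n):
    """One-pixel cyclic shift matrix S: (S v)[i] = v[(i-1) mod n]."""
    S = np.zeros((n, n))
    S[np.arange(n), (np.arange(n) - 1) % n] = 1.0
    return S

n = 6
S = make_shift(n)
w = np.array([0.5, -1.0, 2.0])  # updatable weights w_k of one layer

# Weight matrix as a finite power series in S: W = sum_k w_k S^k (cf. Eq. 5)
W = sum(w_k * np.linalg.matrix_power(S, k) for k, w_k in enumerate(w))

# W is circulant: every row is a one-pixel shift of the previous one,
# i.e. the layer performs a cyclic convolution -- translation weight sharing.
for i in range(n):
    assert np.allclose(W[i], np.roll(W[0], i))
```

Each distinct weight appears once per row, which is exactly the weight-sharing scheme S that a convolutional layer imposes.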
In previous work, a main issue was overcoming the computational complexity associated with a high pixel count, especially in learning the exponent of a matrix exponential. A second potential issue is introducing strong spatial correlations as an a priori constraint, defeating the purpose of learning transformations from scratch. If it is already known that the data is spatially structured, one could introduce continuous coordinates12 or use spatial convolutions10 immediately, without conveniently ignoring the possibility of spatially unstructured data. This issue is partially alleviated with the matrix exponential method, as learning a generator that corresponds to the zero matrix leads to the identity matrix, a trivial operation.
Neural symmetry detection
The biggest technical issue to overcome in the symmetry detection task is learning two separate properties of the data set with a single pipeline: one collective (the generator) and one pair-dependent (the transformation magnitude). For each dataset, one model is trained and ideally captures the most salient symmetry transformation relating the data points to each other. The learned symmetry can then be used in downstream tasks. In previous work, especially before the triumph of deep learning techniques, methods such as gradient descent37 and expectation-maximization26 were used. In this work, we are interested in neural symmetry detection, taking inspiration from works leveraging neural networks as function approximators27,31.
Defining the task: SyMNIST and GalaxSym
A symmetry is a transformation that leaves a certain quantity of interest unchanged. In order to define the symmetry under consideration, we must state what is being kept “the same”. For the experiments that follow, we consider the classification task and the symmetry that keeps the underlying label identical. The overarching problem, therefore, is learning transformations that map instances of the same class (i.e., data points with the same underlying label) to each other. For the experiments, we introduce the SyMNIST and GalaxSym tasks. The dataset consists of image data, paired with an augmented version of the original image. The classification labels are not used for prediction. The goal is two-fold: (i) extract the type of transformation G that was applied, and (ii) estimate the distribution of transformation magnitudes t of the seen transformations in the data, which are the “parameters” of the transformation (e.g., rotation angle, scaling factor, etc.). We also have a SuperSyMNIST and SuperGalaxSym extension, in which the second image is still augmented in the usual way but stems from a disparate root image with the same label (See Fig. 3).
Fig. 3.
Visual comparison of different models and tasks. The first row shows results from a linear model AE (i.e., single layer MLP), the second row shows a deep AE applied to SyMNIST (augmented sample pairs). The third and fourth rows show SuperSyMNIST, where different digits with the same label are given to the model. The final row illustrates a 3-parameter group of transformations with their corresponding targets and model outputs. The model has learned to capture the overall shape and transformations well.
Data availability: The datasets analyzed were the widely available MNIST, which can be accessed through many libraries such as torchvision.datasets.MNIST (MNIST can also be found at: https://www.kaggle.com/datasets/hojjatk/mnist-dataset.), and the Galaxy-10 DECaLS dataset (Available at: https://zenodo.org/records/10845026/files/Galaxy10_DECals.h5.). The latter’s images originate from the DESI Legacy Imaging Surveys and were labeled by Galaxy Zoo. The original 128-by-128 RGB images were averaged over the color channels to obtain a grayscale version. The torchvision38,39 library was used to apply affine transformations to the MNIST images. For non-affine transformations, i.e., the SCT, the torch.nn.functional.grid_sample method was used.
Parametrizing the generator
We can parametrize the generator with any given basis for the functional form of its components, to allow for modelling a broad range of symmetry transformations40. In other words, regression is performed on the coefficients of a basis, which can be chosen freely. We wish to have the ability to detect the “typical” symmetries considered in the symmetry detection literature, such as rotation, scaling, and translation, which will be referred to as the canonical symmetries, a subset of the affine transformations. This choice, including coefficient sparsity, also places a prior on p(S) (cf. Eqn. 1), the distribution or hypothesis space of weight-sharing schemes S. Therefore, we pick a quadratic basis, which includes the affine transformations, such that the functions $g_x$ and $g_y$ from (3) have the following form:

$$g_x(x, y) = \alpha_c + \alpha_x x + \alpha_y y + \alpha_{xx} x^2 + \alpha_{xy} xy + \alpha_{yy} y^2, \qquad (7)$$

$$g_y(x, y) = \beta_c + \beta_x x + \beta_y y + \beta_{xx} x^2 + \beta_{xy} xy + \beta_{yy} y^2, \qquad (8)$$

with learnable coefficients $\alpha_i$ and $\beta_i$. For clarity, in the above we replaced the integers in the subscripts by c (for constant terms, related to translations), x, and y accordingly. The above quadratic basis can capture not only the canonical symmetries but others (such as shears, compositions, and special conformal transformations) as well. Note that one can pick an arbitrarily complicated basis for the expressions given above. This is the major appeal of this approach, and we hope to explore different bases in future work.
Example 1: Canonical transformations In order to encode the three canonical symmetries, one should use coefficients that yield $G = \partial_x$, $G = x\,\partial_y - y\,\partial_x$, and $G = x\,\partial_x + y\,\partial_y$, which are the generators of translation (in the x-direction), counterclockwise rotation about the origin, and isotropic scaling w.r.t. the origin, respectively. Additionally, one can write down generators for shearing, anisotropic scaling, or combinations of translation with any other of the above transformations.
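The canonical generators can be checked numerically by representing them as matrices acting on homogeneous coordinates (x, y, 1); this 3-by-3 representation is an illustrative choice, not the paper's pixel-space representation:

```python
import numpy as np
from scipy.linalg import expm

# Generators acting on homogeneous coordinates (x, y, 1):
G_trans_x = np.array([[0., 0., 1.],   # d/dx       -> translation along x
                      [0., 0., 0.],
                      [0., 0., 0.]])
G_rot = np.array([[0., -1., 0.],      # x d/dy - y d/dx -> counterclockwise rotation
                  [1.,  0., 0.],
                  [0.,  0., 0.]])
G_scale = np.array([[1., 0., 0.],     # x d/dx + y d/dy -> isotropic scaling
                    [0., 1., 0.],
                    [0., 0., 0.]])

p = np.array([1.0, 0.0, 1.0])  # the point (1, 0)

# exp(t G_trans_x) translates by t along x
assert np.allclose(expm(0.5 * G_trans_x) @ p, [1.5, 0.0, 1.0])
# exp(t G_rot) rotates counterclockwise by t radians
assert np.allclose(expm((np.pi / 2) * G_rot) @ p, [0.0, 1.0, 1.0])
# exp(t G_scale) scales by e^t
assert np.allclose(expm(np.log(2.0) * G_scale) @ p, [2.0, 0.0, 1.0])
```

Note that the rotation angle, shift distance, and log-scale all enter through the same magnitude parameter t, as the one-parameter group structure prescribes.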
Example 2: Special Conformal transformations A non-affine transformation that plays an important role in the fields of theoretical physics and mathematics is the angle-preserving special conformal transformation (SCT), which has a natural connection to Möbius transformations and conformal field theories41. The SCT can be thought of as a combination of an inversion w.r.t. the unit circle, followed by a translation by a vector b, and finally another inversion (we refer the reader to the supplementary material for an example of its effect on a pixelated smiley face). The vector b plays the role of the parameter here, and taking infinitesimally small values leads to the following expressions for the generators of the SCT, $K_x = (x^2 - y^2)\,\partial_x + 2xy\,\partial_y$ and $K_y = 2xy\,\partial_x + (y^2 - x^2)\,\partial_y$, for the vector pointing in the x- and y-directions, respectively.
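The relation between the finite SCT and its generator can be verified directly: for a small parameter, the exact transformation agrees with the first-order generator action. The closed form of the SCT below is the standard textbook one, assumed rather than quoted from the paper:

```python
import numpy as np

def sct(p, b):
    """Special conformal transformation: x' = (x - b|x|^2) / (1 - 2 b.x + |b|^2 |x|^2)."""
    x2 = p @ p
    denom = 1.0 - 2.0 * (b @ p) + (b @ b) * x2
    return (p - b * x2) / denom

def sct_generator_x(p):
    """Infinitesimal SCT in the x-direction: vector field ((x^2 - y^2), 2xy)."""
    x, y = p
    return np.array([x**2 - y**2, 2.0 * x * y])

p = np.array([0.3, -0.2])
eps = 1e-5
b = np.array([eps, 0.0])  # small SCT parameter along x

exact = sct(p, b)
first_order = p + eps * sct_generator_x(p)  # p + t G p, truncated at first order
assert np.allclose(exact, first_order, atol=1e-8)  # agreement up to O(eps^2)
```

The inversion-translation-inversion picture gives the same result, since the closed form above is exactly that composition.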
Interpolation scheme
To apply the operators to a grid, one must write the partial derivatives as matrices. We use the Shannon-Whittaker interpolation, as is done in37 and27. This automatically assumes the function to be interpolated is periodic, although other interpolation schemes could have been chosen. We note that this scheme introduces some aliasing for transformations of low-resolution images, which forms one of the notable limitations of the current model. One could investigate other choices, such as bicubic interpolation, although similar results were obtained. Nevertheless, we pick this interpolation scheme for its ability to perform the transformations of interest using matrix-vector multiplication. Let I be some real-valued signal. For a discrete set of n points on the real line with $Q(i) = I_i$ for all samples i from 1 to n, the Shannon-Whittaker interpolation reconstructs the signal for all $x \in \mathbb{R}$ as

$$Q(x) = \sum_{i=1}^{n} I_i\, K_n(x - i), \qquad K_n(u) = \frac{1}{n}\left[1 + 2\sum_{k=1}^{(n-1)/2} \cos\!\left(\frac{2\pi k u}{n}\right)\right], \qquad (9)$$

written here for odd n.
To obtain numerical expressions (matrices) for the partial derivatives, Q can be differentiated with respect to its input. This then describes continuous changes in the one-dimensional spatial coordinate at all n points, i.e., a dense matrix D with entries given by the derivative of the interpolation kernel at integer offsets, $D_{ij} = K'(i - j)$. The above can be extended to two dimensions by performing the Kronecker product of the result obtained for one dimension with the identity matrix, $\partial_x \to D \otimes \mathbb{1}_n$ and $\partial_y \to \mathbb{1}_n \otimes D$, mirroring the flattening operation applied to the input images. The parametrized generator for the 2D affine case, for example, looks like:

$$G = \alpha_c G_1 + \beta_c G_2 + \alpha_x G_3 + \alpha_y G_4 + \beta_x G_5 + \beta_y G_6, \qquad (10)$$

where the $G_i$ are the matrices that represent the operators $\partial_x$, $\partial_y$, $x\,\partial_x$, $y\,\partial_x$, $x\,\partial_y$, and $y\,\partial_y$, respectively. This can easily be extended to arbitrarily dimensional data by adding more factors to the above matrices, as was done above for the quadratic basis. One can see that performing this operation in pixel space scales poorly with signal length (or image width) n.
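A sketch of the derivative matrix and its 2D extension follows, using the periodic (Dirichlet-kernel) form of Shannon-Whittaker interpolation for odd n; the kernel choice and function names are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def derivative_matrix(n):
    """d/dx in a periodic Shannon-Whittaker (Dirichlet-kernel) basis, odd n.

    With K(u) = (1/n)[1 + 2 sum_k cos(2 pi k u / n)], the derivative is
    K'(u) = -(4 pi / n^2) sum_k k sin(2 pi k u / n), and D_ij = K'(i - j).
    """
    assert n % 2 == 1, "this Dirichlet-kernel form assumes odd n"
    i = np.arange(n)
    u = i[:, None] - i[None, :]                    # offsets u_ij = i - j
    k = np.arange(1, (n - 1) // 2 + 1)
    D = -(4 * np.pi / n**2) * np.sum(
        k[None, None, :] * np.sin(2 * np.pi * k[None, None, :] * u[:, :, None] / n),
        axis=-1,
    )
    return D

n = 9
D = derivative_matrix(n)

# Exact on band-limited periodic signals: d/di sin(2 pi i / n) = (2 pi / n) cos(2 pi i / n)
i = np.arange(n)
assert np.allclose(D @ np.sin(2 * np.pi * i / n),
                   (2 * np.pi / n) * np.cos(2 * np.pi * i / n))

# 2D partial derivatives on flattened n-by-n images via Kronecker products
I_n = np.eye(n)
d_dx = np.kron(D, I_n)  # derivative along one image axis
d_dy = np.kron(I_n, D)  # derivative along the other axis
```

The two resulting $n^2 \times n^2$ matrices illustrate the poor pixel-space scaling mentioned above: their size grows quartically in the image width.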
Latent model
Learning symmetries using a latent space bottleneck has multiple motivations and has been explored in other recent research30. First, learning symmetries from scratch requires learning the relationships between pixels in the data, if any exist, and this does not scale well with input size, as the space of possible connectivities is factorial in nature (cf. the permutation group). Second, and related to the previous point, is the poor scaling of the size of the generator that needs to be matrix-exponentiated in order to preserve the continuous nature of the symmetry transformations. Finally, the encoder should ideally learn to remove unimportant information from the data, such as background information in an image classification setting. We opt for an adapted autoencoder architecture, whose latent space we use to learn the transformations.
In our model, the autoencoder design allows the model to keep relevant information in the latent space and transform it according to a transformation, the result of the exponential map, that it shares with all other input pairs (Fig. 2). The model takes an image as input and reconstructs the transformed image as output. In order to get an estimate for the parameter (e.g., rotation angle), a separate network is trained together with the autoencoder. The parameter estimate is passed to the matrix exponential function, whose result is then used to matrix-multiply the latent patch(es). Finally, the latent patch(es) are decoded, and a reconstruction loss is enforced on the output-transformed image pair. The learnable parts of the model can be grouped as follows:

the encoder $E : \mathbb{R}^n \to \mathbb{R}^d$;

the decoder $D : \mathbb{R}^d \to \mathbb{R}^n$;

the t-network $t_\phi : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$;

the parametrized generator $G_z \in \mathbb{R}^{d \times d}$.
The loss function consists of a part that ensures both the n-dimensional pixel-space and the d-dimensional latent-space vectorized data pairs transform to each other. For images $x_1$ and $x_2$, one can write these as:

$$\mathcal{L}_{\text{latent}} = \big\| e^{tG_z} E(x_1) - E(x_2) \big\|^2, \qquad (11)$$

$$\mathcal{L}_{\text{pixel}} = \big\| D\big(e^{tG_z} E(x_1)\big) - x_2 \big\|^2, \qquad (12)$$

where $G_z$ is the generator acting on the latent space. With latents $z_1 = E(x_1)$ and $z_2 = E(x_2)$, these can also be written as $\mathcal{L}_{\text{latent}} = \| e^{tG_z} z_1 - z_2 \|^2$ and $\mathcal{L}_{\text{pixel}} = \| D(e^{tG_z} z_1) - x_2 \|^2$. The model allows for various numbers of latent patches (cf. channels) to be transformed in parallel in the latent space.
Generator matching: We also include a loss term that enforces the generator in the latent space to be close to the one in pixel space. Since calculating the matrix exponential in pixel space does not scale well w.r.t. data size, a different approach is needed. Hence, this generator-matching term uses the learned coefficients from the latent space (Eq. 7) and places them in a generator for the pixel space, using a Taylor expansion to compare its effect on the vectorized input, formally

$$\mathcal{L}_{\text{match}} = \Big\| \big(\mathbb{1} + tG' + \tfrac{1}{2}t^2 G'^2 + \cdots\big)\, x_1 - x_2 \Big\|^2, \qquad (13)$$

where the prime denotes the basis evaluated in pixel space, i.e., $G' = \alpha_c G_1' + \beta_c G_2' + \cdots$ with $G_i' \in \mathbb{R}^{n \times n}$. If one wishes to enforce sparsity on the coefficients of the terms in the generator, a sparsity loss (LASSO) can be added in order to enforce correct behavior in the symbolic regression. This assumption helps with the understandability and interpretability of the coefficients, but is also a prior on S.
Furthermore, we include the standard reconstruction loss term for each of the inputs in the data pair individually, $\mathcal{L}_{\text{AE}}$, to train the autoencoder. This loss term simply encourages the encoder-decoder pair to reconstruct individual image inputs well, as is usual for autoencoders. Thus, this term ignores the exponential on the latent space. The total loss, therefore, is:

$$\mathcal{L} = \mathcal{L}_{\text{latent}} + \mathcal{L}_{\text{pixel}} + \mathcal{L}_{\text{match}} + \mathcal{L}_{\text{AE}}. \qquad (14)$$
In order to avoid the scale ambiguity in the exponent of the matrix exponential (namely, $e^{tG} = e^{(ct)(G/c)}$ for any $c \neq 0$), the generator is normalized by enforcing the coefficient vector to have unit norm during training, i.e., $\|\boldsymbol{\alpha}\|_2 = 1$, where $\boldsymbol{\alpha}$ is the vector made up of the coefficients $\alpha_i$ and $\beta_i$. A consequence of this choice is that the t-network is expected to produce zero-valued outputs for image pairs that are not path-connected.
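A minimal numpy sketch of the latent consistency losses, with random linear maps standing in for the trained encoder/decoder and the magnitude t supplied directly rather than predicted by the t-network; all names and shapes are illustrative:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n, d = 64, 8                          # pixel and latent dimensions (toy sizes)
B = rng.normal(size=(3, d, d))        # basis matrices for the latent generator

def latent_losses(x1, x2, t, coeffs, E, D):
    """Latent and pixel consistency losses for one pair, with linear E/D stand-ins."""
    c = coeffs / np.linalg.norm(coeffs)       # unit-norm coefficients remove the t <-> G scale ambiguity
    G_z = np.einsum('i,ijk->jk', c, B)        # latent generator: sum_i c_i B_i
    z1, z2 = E @ x1, E @ x2
    z1_t = expm(t * G_z) @ z1                 # exact matrix exponential in latent space
    loss_latent = np.sum((z1_t - z2) ** 2)    # latents must match after the transformation
    loss_pixel = np.sum((D @ z1_t - x2) ** 2) # decoded transformed latent must match the target
    return loss_latent + loss_pixel

E = rng.normal(size=(d, n)) / np.sqrt(n)   # stand-in encoder
Dm = rng.normal(size=(n, d)) / np.sqrt(d)  # stand-in decoder
x1, x2 = rng.normal(size=n), rng.normal(size=n)
loss = latent_losses(x1, x2, t=0.4, coeffs=np.array([1.0, -0.5, 0.2]), E=E, D=Dm)
assert np.isfinite(loss) and loss >= 0.0
```

Because the exponential is taken of a d-by-d matrix rather than an n-by-n one, the cost of this step is independent of the pixel count.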
Multiple transformations: In order to allow for more complicated transformations that do not form one-parameter groups, a straightforward extension of the proposed model can be accomplished by adding more factors of matrix exponentials, $e^{t_1 G_1} e^{t_2 G_2} \cdots$. This allows for transformations connected to the identity in a continuous way, charting out the space of possible transformations defined by the parameters associated with the various generators. If one wishes to enforce the Lie algebra structure, one can include, by design, closure under commutators, $[G_i, G_j] = \sum_k c_{ijk}\, G_k$. This implementation is present in the codebase. Since no conclusive results were obtained from experiments with this setup, and for the sake of clarity of presentation, we leave this for future work. We note that reconstructions were still very good, so further research in this direction should prove fruitful. Another option would be to enforce this algebraic constraint in the loss.
Computational complexity
The computational complexity of the latent Lie symmetry detector model depends on the choice of loss formulation. With the first-order Taylor term enforcing generator matching, we have a matrix exponential of $\mathcal{O}(d^3)$ compared to $\mathcal{O}(n^3)$ for brute force, where d is the size of the latent patch and n is the total number of pixels or features in each data point. In the latent model, there are now three additional NNs, which scale predominantly with input size in this setting. Depending on the tapering of the MLP, the complexity is always better than $\mathcal{O}(n^2)$. Activating the second-order Taylor term in the loss introduces an $\mathcal{O}(n^3)$ scaling for the matrix-matrix multiplication. Either this term can be omitted or more efficient procedures for calculating it can be implemented in future work. Nevertheless, the computational burden is still lower than that of calculating a matrix exponential. (We refer the interested reader to the comparisons in the supplemental material.)
Results
Implementation and the SyMNIST/GalaxSym tasks
Here, the implementation of the latent model is briefly sketched. For further implementation details, we refer the reader to the Appendix (A). We note that code will be made available with extensive comments and documentation. Transformations are applied using the affine function from the torchvision library38 and torch’s grid_sample for non-affine transformations. The images can also be placed on a larger canvas, which is particularly useful when translations are part of the applied augmentations. The magnitude of the transformation is the parameter sampled from the chosen distribution. This procedure allows for flexibility in testing a neural symmetry detector, as the distribution can be arbitrary and, in theory, so can the transformations. In these experiments, we focus on detecting combinations of various affine transformations and a non-affine transformation, the SCT. We focus on applying our model to the SyMNIST task, but we note the latent model works equally well with Galaxy-10 DECaLS data. Samples for both SyMNIST and SuperSyMNIST are shown in Fig. 3.
The proposed model offers several advantages over previous approaches. Unlike earlier works that rely on supervised learning for estimating symmetries27, or are restricted to small image patches26, our method enables unsupervised detection of both affine and non-affine transformations directly from images. Additionally, the proposed framework can handle arbitrary and multi-modal distributions of transformation magnitudes and multiple one-parameter group transformations, which significantly broadens its applicability compared to methods that assume uniform or low-modal (typically 2 or 3) transformation magnitude distributions42,43. Furthermore, the model directly evaluates the learned symmetries rather than relying on downstream tasks for validation15,32. By learning both the underlying generators and the distributions of transformation magnitudes, our approach provides a comprehensive solution for symmetry detection that is robust, transferable across datasets, and capable of handling complex transformations. Finally, we experimented with abstract (2-dimensional) latents and did not observe as high a quality of reconstructions as for the structured (patchified) latents, most likely due to the low dimension of the latent space. This issue could potentially be fixed in future work by separating the shape (2-dimensional) from the residuals (finer details). (A figure showcasing the 2-dimensional latents is provided in the supplemental materials.)
Transformation magnitude distribution
Histograms of learned transformation magnitudes reveal correct modes for same-sample pairs (Fig. 4), but performance degrades for dissimilar pairs, i.e., SuperSyMNIST. This is most likely due to the fact that MNIST digits are all differently aligned, are not perfectly centered, and that the MSE loss does not capture semantic similarity in pixel space very well, especially for small transformation magnitudes. It seems like the distributions capture aliasing artifacts in the small angle and translation setting (Fig. 4b). Note the broadening of the range (in radians) of the learned distribution when sampling angles in the rotation setting, which breaks down for larger values (Fig. 4a). Peaks are also visible at the origin, even when no pair relates to such an identity transformation (as in Fig. 4d and e). We qualitatively see good shape reconstruction for uniform distributions for small angles and scaling transformations (Fig. 4a and c). Clearly, such a qualitative assessment is not rigorous enough for a thorough analysis.
Fig. 4.
Learnt transformation magnitude histograms for various transformations. Magnitudes for more complicated distributions, compositions, and non-canonical transformations were also learnt. Results for (d) and (e) were produced by categorical sampling of t, the others by uniform sampling.
The Wasserstein distance between the normalized distributions provides a quantitative metric for the learned distributions. Looking at test-time samples, we observe interesting behaviour when traversing t-space (Fig. 5). Certain transitions between modes are seemingly continuous in pixel space, but anomalies can be found in certain models. For a 5-modal distribution, the model seems to learn intrinsic symmetries of the digits as a way to rotate them (top). We also show a drawback of the model, namely that exact distributions are hard to learn, particularly uniform distributions in the SuperSyMNIST case (bottom). In the latter, this is probably because the MSE loss is not the most suitable objective for this task. On average, however, the model seems to recover the correct transformation, and the reconstructions at test time look accurate, especially with regard to the overall shape.
Fig. 5.
Top row: 5-modal SyMNIST with rotation. (Left) Image-target pair and output of the model used for the auxiliary task, including the sampled versus the recovered distribution of normalized transformation parameters (above); a traversal through t-space and reconstructions for unseen transformations, corresponding to the red dotted lines in the t-distribution plot (below). (Right) Fine-grained variation of t emphasizes topological anomalies, with transitions being less problematic for digits such as “1”, “2”, and “0” but more apparent for symmetric digits like “7”; plots are for equally spaced values of t. Bottom row: (Left) Visual comparison of SuperSyMNIST for uniform SCT distributions (non-affine model), highlighting versatility but sensitivity to sharp distribution edges. (Right) Continuous variation of t.
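The Wasserstein comparison used as the quantitative metric above can be sketched as follows. The two-mode magnitude distribution and the noise level are illustrative stand-ins, not the paper's settings.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Ground-truth magnitudes: a 2-modal categorical over transformation
# parameters (illustrative values).
true_t = rng.choice([-0.5, 0.5], size=5000)

# "Learned" magnitudes: the true modes plus small estimation noise.
learned_t = true_t + rng.normal(scale=0.05, size=5000)

# Normalize both samples to a common scale before comparing, then use
# the 1-D Wasserstein (earth mover's) distance as the quality metric.
def normalize(t):
    return (t - t.mean()) / t.std()

w = wasserstein_distance(normalize(true_t), normalize(learned_t))
```

A small `w` indicates the learned magnitude distribution matches the sampling distribution up to scale; the metric is sensitive to missed or spurious modes.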
Symmetry detection performance
Visually, the transformations at test time look good and similar to the ground truth. In most failure cases, the location and overall shape are still correct, but the finer details are missing, blurry, or in an incorrect location. By plotting the vector fields associated with the learnt generator, we can visualize the flow of the transformation (Fig. 7, top). To compare with the ground-truth generator, we split the coefficients into drift, diffusion, and non-affine terms. This makes it easier to identify a correct transformation, such as a rotation, that has drifted due to a spurious non-zero translation coefficient. Additionally, we can inspect the latent patches the model learnt and evaluate the transformations on a sample patch (Fig. 7, bottom). In cases where a rotation was not recovered, we do see evidence of alternative solutions to the symmetry detection task: for certain flow fields and low transformation magnitudes, one can mimic the correct transformation, e.g., by pushing and pulling the sides of the image using a shear rather than a rotation (Fig. 7, bottom left). From looking at the latents, it is also not obvious whether the model has learnt any geometrically structured signals. Qualitatively, on GalaxSym, the pixel-level transformations look excellent both within and just outside the t-distribution (Fig. 6). Generators are closest to the ground truth for translation and scaling, but rarely correct for rotation (checked by flagging imaginary eigenvalues, signalling a center of rotation, in the diffusion matrix). Perhaps the alpha-matching term requires more tuning. The LASSO loss term only seemed to have a marginal effect on performance, hampering reconstruction quality for large values of its weight, perhaps a poor combination with the choice of coefficient normalization.
Fig. 6.
Visualization of learnt transformations on three GalaxSym tasks and one SuperGalaxSym task: rotation, scale, shift, and super-rotation (similar to no augmentation), respectively. The model clearly learns to reproduce the transformation well in pixel space. In the SuperGalaxSym case, it seems to learn scaling as well as rotation, reflecting the distribution of sizes of galaxies in a given class.
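The generator split and the imaginary-eigenvalue check for rotations described above can be sketched as follows, assuming the learned affine generator is given in 2×3 form [L | b] (the coefficients below are hypothetical):

```python
import numpy as np

def split_generator(G):
    """Split a 2x3 affine generator [L | b] into its linear ("diffusion")
    part L and its drift (translation) part b."""
    L = G[:, :2]
    b = G[:, 2]
    return L, b

def looks_like_rotation(L, tol=1e-6):
    """A rotation generator has complex-conjugate eigenvalues: a nonzero
    imaginary part signals a center of rotation."""
    eig = np.linalg.eigvals(L)
    return bool(np.abs(eig.imag).max() > tol)

# A learned rotation generator that has drifted off-center due to a
# spurious translation coefficient (hypothetical values):
G_rot = np.array([[0.0, -1.0,  0.3],
                  [1.0,  0.0, -0.1]])
# A pure shear, which can mimic small rotations in pixel space but has
# purely real eigenvalues:
G_shear = np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0]])

L_rot, drift = split_generator(G_rot)
```

Flagging imaginary eigenvalues in the linear part separates genuine rotations from shear-based mimics, while the drift term exposes the spurious translation.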
Alpha-matching
First- and second-order Taylor expansions were compared to assess the effectiveness of the α-matching loss. While higher-order terms improve pixel-space approximations, the gains are limited for large parameter values. This connects to the Noether Prior and Type-II networks, underscoring the importance of correct inductive biases.
With α-matching (cf. Eqn. 13), a first-order Taylor expansion was used to estimate the pixel-space transformation. This has the obvious drawback that the model is expected to perform better for small values of the sampled transformation parameter. If one instead adds a second-order term, extending the support of the matrix exponential around the identity, slight improvements in the output images are obtained. There was, however, no improvement in the range of parameter values. What we do observe is a shift toward the correct generator for higher values of α (Fig. 7). Not only does the correct generator appear for optimal α, but the orbit of reconstructed digits at test time is more robust.
Fig. 7.
(Top) Comparison of reconstructed images (first and fourth rows, for regularly spaced t-values) for a model trained on rotations, and generator flow field visualizations (second and third rows) for different values of α, using order-1 (first two rows) and order-2 (last two rows) Taylor approximations. (Bottom) When omitted, or when α is too small, alpha matching cannot tie the transformations in pixel space to those in latent space, possibly resulting in a mismatch, such as a shift when a rotation was desired. (Actual latents and predicted t-values, including the action of the exponential map on line segments and Gaussian blobs, are shown for an order-2 model trained on rotated GalaxSym pairs.)
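The effect of the Taylor order can be illustrated numerically. This is a generic sketch of truncating exp(tA) around the identity, not the α-matching loss itself; the rotation generator is illustrative.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, -1.0],   # rotation generator (illustrative)
              [1.0,  0.0]])

def taylor_exp(A, t, order):
    """Truncated Taylor expansion of exp(tA) around the identity."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, order + 1):
        term = term @ (t * A) / k   # accumulate (tA)^k / k!
        out = out + term
    return out

t = 0.3
err1 = np.linalg.norm(expm(t * A) - taylor_exp(A, t, 1))
err2 = np.linalg.norm(expm(t * A) - taylor_exp(A, t, 2))
# The second-order term extends the region around the identity where the
# approximation is usable, but both degrade for large |t|.
```

This mirrors the observation above: the order-2 approximation is tighter near the identity, yet neither truncation helps much once |t| is large.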
Complex transformations
Results on non-affine transformations, such as the SCT (Fig. 5, bottom), demonstrate the model’s capability to handle complex symmetries. For composed transformations (e.g., rotation and scaling), the model detects multimodal distributions, capturing combined effects accurately: for compositions, we sampled from a categorical distribution for both the rotation angle (45 degrees being one of the two values) and the scaling factor (0.5 or 1.5), resulting in a 4-modal distribution (Fig. 4e). Finally, we mention that the model is also capable of handling multiple-parameter groups in latent space, as illustrated in Fig. 2. It is possible either to include a closure factor by introducing an induced one-parameter group, or to omit it entirely. For these multi-parameter cases, only the first-order Taylor term is implemented. The results for auxiliary reconstruction are very good, especially for SyMNIST (SuperSyMNIST reconstructions can be seen in Fig. 3, bottom row), while splitting the learnt generator into the expected one-parameter groups is not always guaranteed, since other superpositions of correct generators are theoretically possible.
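A minimal sketch of the 4-modal composition setup: the 45-degree angle and the 0.5/1.5 scaling factors come from the text, while the second angle (15 degrees here) is a hypothetical stand-in, since its value is not given in this section.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)

# Generators of rotation and isotropic scaling in gl(2):
A_rot = np.array([[0.0, -1.0], [1.0, 0.0]])
A_scale = np.eye(2)

# Categorical parameter choices (15 degrees is a hypothetical stand-in):
angles = np.deg2rad([15.0, 45.0])
log_scales = np.log([0.5, 1.5])

def sample_composition():
    """Compose exp(theta * A_rot) with exp(s * A_scale); these two
    generators commute, so the order of composition does not matter."""
    theta = rng.choice(angles)
    s = rng.choice(log_scales)
    return expm(theta * A_rot) @ expm(s * A_scale), (theta, s)

# Independently categorical angle and scale yield a 4-modal joint
# distribution over (theta, s):
modes = {sample_composition()[1] for _ in range(1000)}
```

Each sampled pair (theta, s) lands on one of four modes, which is the multimodal structure the t-network is expected to recover.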
Discussion
While correct generators were hard to obtain, the pixel-space transformations match the augmentations very well, including the number of modes in the distribution of outputs from the t-network. From qualitative analysis of the results for translated samples, we conclude that we recover the symmetry generator best here, perhaps due to an observed slight bias towards keeping the drift terms non-zero. Rotations, scaling, shear, and SCTs are less conclusive, but recoverable for the right settings of the regularizers, in particular for high enough values and order of the α-matching term. In particular, for small to medium transformation magnitudes (not so small as to fall below the aliasing threshold), SyMNIST augmented with a one-parameter group transformation and a correctly tuned α yields the correct symmetry generator in most cases. Reconstructions and transformation magnitude distributions are quantitatively very good, in particular the latter. Despite the drawbacks of the MSE reconstruction loss, such as blurriness and misalignment with the theoretical objective, we observe relatively good results even in the SuperSyMNIST setting, suggesting good to excellent performance on the overall shape of an object, a promising result assuming one can disentangle texture and high frequencies from overall pose and, presumably, low spatial frequencies. This is most likely rooted in the role of the reconstruction as an auxiliary task, the MSE loss having a smoothing effect. Additionally, we note that the model seems to learn “Platonic” digits, namely recognizable digits with a different font or style than the expected one, that are transformed appropriately (this was less clear in the GalaxSym dataset). Despite learning correct transformations in pixel space, the latent transformations are still hard to get exactly right with a single set of hyperparameters for multiple transformations; they seem to remain underconstrained. From the α-matching experiments, we deduce that this parameter is the most crucial to tune correctly, and we provide some quantitative results showcasing behavior beyond its optimal value. We expect that this term helps the coefficients flow towards a basin where the correct pixel-space transformations are reachable, although placing too much weight on it might be counter-productive for large parameter values.
Finally, we comment on the choice of interpolation scheme and on aliasing issues in latent space. Experimenting with different interpolation choices, we did not observe major differences in the results shown above. Nevertheless, the codebase has options to change the scheme, allowing further experimentation in this direction. We note that the latent patches do not seem to have learnt any geometric structure detectable by eye, despite the traversals in magnitude space being relatively stable and, in all cases, interpretable when decoded to pixel space.
Supplementary Information
Below is the link to the electronic supplementary material.
Appendix: Experimental details
For SyMNIST, the encoder and decoder were fully connected MLPs with 512, 256, 128, and 64 neurons in the hidden layers and LeakyReLU (0.2) activation functions. The output layer of the encoder, and therefore the input layer of the decoder, had size 81 for most of the experiments, or 25 for a few others (e.g., Fig. 4). The Adam optimizer44 was used with a learning rate of 0.001 and a batch size of 1024. All results are shown for 1 channel in latent space, but more channels are possible and can improve reconstruction quality. Wider and deeper MLPs and various numbers of latent channels (1, 4, 16, and 64) were all checked; 16 seemed to work best for the MSE reconstruction loss, but there was no dramatic increase in predictive performance regarding the generators. LayerNorm was also used. All these options, including the ones discussed in the main text, can easily be toggled in the code accompanying this paper (https://github.com/alexgabel/LiePaper). For consistency, the same hyperparameters were used throughout, unless specified otherwise. For the GalaxSym experiments, we opted for an identical architecture with a single channel, since this produced good reconstructions, and 81 latent pixels. The alpha-matching regularizer was 0.001 for all results except the shifted pairs, where a value of 1.0 yielded cleaner samples. The transformation magnitudes were uniformly sampled, with maxima of 15 pixels, 90 degrees, and a 1.5 scaling factor for translation, rotation, and scaling, respectively. For the final rows (SuperGalaxSym), the alpha-matching regularizer was set to 1000.0.
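The encoder shape described above can be sketched in NumPy as follows. Random weights stand in for trained parameters, and the latent-layer activation choice is an assumption; the actual implementation in the linked repository is in PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    """LeakyReLU with negative slope 0.2, as in the text."""
    return np.where(x > 0, x, slope * x)

# Encoder layer widths from the text: 28*28 input, hidden layers of
# 512/256/128/64, and an 81-dimensional latent (a 9x9 patch).
widths = [28 * 28, 512, 256, 128, 64, 81]

# Random weights stand in for trained parameters (illustrative only).
params = [(rng.normal(scale=0.02, size=(m, n)), np.zeros(n))
          for m, n in zip(widths[:-1], widths[1:])]

def encode(x):
    """Forward pass of the fully connected encoder; the decoder mirrors
    this architecture with the layer order reversed."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:   # assumed: no activation on the latent
            h = leaky_relu(h)
    return h

z = encode(rng.normal(size=(4, 28 * 28)))   # batch of 4 flattened images
```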
Author contributions
A.G. did the main research (coding, experiments, mathematical analysis) and wrote the article. As supervisors, E.G. and R.Q. helped write the main text by providing revisions of certain sections, comments, and suggestions, and they suggested applications and variations of the experimental setup.
Data availability
The dataset analysed is the widely available MNIST, which can be accessed through many libraries such as torchvision.datasets.MNIST. It can also be found at https://www.kaggle.com/datasets/hojjatk/mnist-dataset1. It contains 60,000 training images and 10,000 test images of handwritten digits, each in greyscale with a resolution of 28x28 pixels. The torchvision library was used to apply affine transformations to the MNIST images. For non-affine transformations, such as the SCT, the torch.nn.functional.grid_sample method was used.
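For the non-affine case, an SCT warp can be sketched as follows. This NumPy version mimics the `grid_sample`-style resampling with nearest-neighbour interpolation and assumes the standard special conformal coordinate map; the paper's exact parametrization and interpolation mode may differ.

```python
import numpy as np

def sct(coords, b):
    """Special conformal transformation of 2-D coordinates:
    x -> (x - b * |x|^2) / (1 - 2 b.x + |b|^2 |x|^2)."""
    r2 = (coords ** 2).sum(-1, keepdims=True)
    denom = 1.0 - 2.0 * coords @ b + (b @ b) * r2.squeeze(-1)
    return (coords - b * r2) / denom[..., None]

def warp(img, b):
    """Nearest-neighbour resampling on normalized [-1, 1] coordinates,
    analogous in spirit to torch.nn.functional.grid_sample."""
    h, w = img.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing="ij")
    grid = np.stack([xs, ys], axis=-1)                 # output locations
    src = sct(grid.reshape(-1, 2), b).reshape(h, w, 2)  # source locations
    ix = np.clip(np.round((src[..., 0] + 1) / 2 * (w - 1)), 0, w - 1)
    iy = np.clip(np.round((src[..., 1] + 1) / 2 * (h - 1)), 0, h - 1)
    return img[iy.astype(int), ix.astype(int)]

img = np.zeros((28, 28))
img[10:18, 10:18] = 1.0
out = warp(img, np.array([0.1, 0.0]))   # mild SCT along x
```

With b = 0 the map reduces to the identity, which provides a quick sanity check of the sampling grid.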
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-17098-8.
References
- 1.Noether, E. Invariante Variationsprobleme. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse 1918, 235–257 (1918). Available from: http://eudml.org/doc/59024.
- 2.Misner, C. W., Thorne, K. S., & Wheeler, J. A. Gravitation. W. H. Freeman; (1973).
- 3.Arnold, V. I. Mathematical methods of classical mechanics. vol. 60. Springer; (1989).
- 4.Nakahara, M. Geometry, Topology and Physics. Graduate Student Series in Physics. Bristol, UK: Hilger (1990). Available from: http://www.slac.stanford.edu/spires/find/hep/www?key=7208855.
- 5.Frankel, T. The Geometry of Physics: An Introduction. 3rd ed. Cambridge University Press; (2011).
- 6.Olver, P. J. Applications of Lie Groups to Differential Equations. Graduate Texts in Mathematics. Springer New York; (1993). Available from: https://books.google.nl/books?id=sI2bAxgLMXYC.
- 7.Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36, 193–202 (1980).
- 8.Cohen, T., & Welling, M. Group equivariant convolutional networks. In: International conference on machine learning. PMLR; (2016). p. 2990-9.
- 9.Finzi, M., Welling, M., & Wilson, A. G. A Practical Method for Constructing Equivariant Multilayer Perceptrons for Arbitrary Matrix Groups; (2021).
- 10.Kondor, R., & Trivedi, S. On the Generalization of Equivariance and Convolution in Neural Networks to the Action of Compact Groups; (2018).
- 11.Bekkers, E. J. B-Spline CNNs on Lie Groups; (2021). Available from: arXiv:1909.12057.
- 12.Moskalev, A., Sepliarskaia, A., Sosnovik, I., & Smeulders, A. LieGG: Studying Learned Lie Group Generators. In: Advances in Neural Information Processing Systems; (2022).
- 13.Carlsson, G. E. Topology and data. Bulletin of the American Mathematical Society. (2009);46:255-308. Available from: https://api.semanticscholar.org/CorpusID:1472609.
- 14.Wasserman, L. Topological Data Analysis; (2016).
- 15.Zhou, A., Knowles, T., & Finn, C. Meta-learning symmetries by reparameterization. arXiv preprint arXiv:2007.02933. (2020).
- 16.van der Ouderaa, T. F., Immer, A., & van der Wilk, M. Learning Layer-wise Equivariances Automatically using Gradients. In: Thirty-seventh Conference on Neural Information Processing Systems; (2023) .
- 17.Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences 116(44), 22071–22080 (2019). Available from: 10.1073/pnas.1900654116.
- 18.Cohen, T., & Welling, M. Group Equivariant Convolutional Networks. In: Balcan MF, Weinberger KQ, editors. Proceedings of The 33rd International Conference on Machine Learning. vol. 48 of Proceedings of Machine Learning Research. New York, New York, USA: PMLR; (2016). p. 2990-9. Available from: https://proceedings.mlr.press/v48/cohenc16.html.
- 19.L'Heureux, A., Grolinger, K., Elyamany, H. F., & Capretz, M. A. M. Machine Learning With Big Data: Challenges and Approaches. IEEE Access 5, 7776–7797 (2017).
- 20.Zhou, C. et al. LIMA: Less Is More for Alignment; (2023).
- 21.Knigge, D. M., Romero, D. W., & Bekkers, E. J. Exploiting Redundancy: Separable Group Convolutional Networks on Lie Groups; (2022).
- 22.van der Ouderaa, T. F. A., & van der Wilk, M. Learning Invariant Weights in Neural Networks; (2022).
- 23.van der Ouderaa, T. F. A., Immer, A., & van der Wilk, M. Learning Layer-wise Equivariances Automatically using Gradients; (2023).
- 24.Bekkers, E. J. et al. Roto-Translation Covariant Convolutional Networks for Medical Image Analysis. (2018).
- 25.Romero, D. W., & Lohit, S. Learning Partial Equivariances From Data. In: Koyejo S, Mohamed S, Agarwal A, Belgrave D, Cho K, Oh A, editors. Advances in Neural Information Processing Systems. vol. 35. Curran Associates, Inc.; (2022). p. 36466-78. Available from: https://proceedings.neurips.cc/paper_files/paper/2022/file/ec51d1fe4bbb754577da5e18eb54e6d1-Paper-Conference.pdf.
- 26.Sohl-Dickstein, J., Wang, C. M., & Olshausen, B. A. An unsupervised algorithm for learning lie group transformations. arXiv preprint arXiv:1001.1027. 2010.
- 27.Dehmamy, N., Walters, R., Liu, Y., Wang, D. & Yu, R. Automatic symmetry discovery with lie algebra convolutional network. Advances in Neural Information Processing Systems 34, 2503–2515 (2021).
- 28.Keller, T.A., & Welling, M. Neural Wave Machines: Learning Spatiotemporally Structured Representations with Locally Coupled Oscillatory Recurrent Neural Networks. In: Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S, Scarlett J, editors. Proceedings of the 40th International Conference on Machine Learning. vol. 202 of Proceedings of Machine Learning Research. PMLR; 2023. p. 16168-89. Available from: https://proceedings.mlr.press/v202/keller23a.html.
- 29.van der Ouderaa, T. F. A., van der Wilk, M., & de Haan, P. Noether’s razor: Learning Conserved Quantities; (2024). Available from: arXiv:2410.08087.
- 30.Yang, J., Dehmamy, N., Walters, R., & Yu, R. Latent Space Symmetry Discovery; (2024). Available from: arXiv:2310.00105.
- 31.Gabel, A. et al. Learning Lie Group Symmetry Transformations with Neural Networks. (2023).
- 32.van der Linden, P. A., García-Castellanos, A., Vadgama, S., Kuipers, T. P., & Bekkers, E. J. Learning Symmetries via Weight-Sharing with Doubly Stochastic Tensors; (2024). Available from: arXiv:2412.04594.
- 33.Elsken, T., Metzen, J. H., & Hutter, F. Neural Architecture Search: A Survey. Journal of Machine Learning Research. (2019);20(55):1-21. Available from: http://jmlr.org/papers/v20/18-598.html.
- 34.Oliveri, F. Lie Symmetries of Differential Equations: Classical Results and Recent Contributions. Symmetry 2, 658–706 (2010).
- 35.Fulton, W., & Harris, J. Representation Theory: A First Course. Graduate Texts in Mathematics. Springer New York; (1991). Available from: https://books.google.nl/books?id=6GUH8ARxhp8C.
- 36.Roman, A., Forestano, R. T., Matchev, K. T., Matcheva, K., & Unlu, E. B. Oracle-Preserving Latent Flows; (2023).
- 37.Rao, R., & Ruderman, D. Learning Lie groups for invariant visual perception. Advances in neural information processing systems. (1998);11.
- 38.Marcel, S., & Rodriguez, Y. Torchvision: the machine-vision package of Torch. In: Proceedings of the 18th ACM International Conference on Multimedia (MM ’10). New York, NY, USA: Association for Computing Machinery; (2010). p. 1485–1488. Available from: 10.1145/1873951.1874254.
- 39.TorchVision maintainers and contributors. TorchVision: PyTorch’s Computer Vision library. GitHub. https://github.com/pytorch/vision (2016).
- 40.Gabel, A., Quax, R. & Gavves, E. Data-driven Lie point symmetry detection for continuous dynamical systems. Machine Learning: Science and Technology 5(1), 015037 (2024). Available from: 10.1088/2632-2153/ad2629.
- 41.Di Francesco, P., Mathieu, P., & Sénéchal, D. Conformal Field Theory. Graduate Texts in Contemporary Physics. New York, NY: Springer; (1997). Available from: https://cds.cern.ch/record/639405.
- 42.Benton, G., Finzi, M., Izmailov, P. & Wilson, A. G. Learning invariances in neural networks from training data. Advances in Neural Information Processing Systems 33, 17605–17616 (2020).
- 43.Singhal, U., Esteves, C., Makadia, A., & Yu, S. X. Learning to Transform for Generalizable Instance-wise Invariance. (2024).
- 44.Kingma, D. P., & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. (2014).