Measuring and Controlling Solution Degeneracy across Task-Trained Recurrent Neural Networks

Ann Huang; Satpreet H Singh; Flavio Martinelli; Kanaka Rajan

Author manuscript; available in PMC: 2026 May 8.

Published in final edited form as: Adv Neural Inf Process Syst. 2025;38:105259–105304.

PMCID: PMC13151944 NIHMSID: NIHMS2124298 PMID: 42111901

Abstract

Task-trained recurrent neural networks (RNNs) are widely used in neuroscience and machine learning to model dynamical computations. To gain mechanistic insight into how neural systems solve tasks, prior work often reverse-engineers individual trained networks. However, different RNNs trained on the same task and achieving similar performance can exhibit strikingly different internal solutions, a phenomenon known as solution degeneracy. Here, we develop a unified framework to systematically quantify and control solution degeneracy across three levels: behavior, neural dynamics, and weight space. We apply this framework to 3,400 RNNs trained on four neuroscience-relevant tasks—flip-flop memory, sine wave generation, delayed discrimination, and path integration—while systematically varying task complexity, learning regime, network size, and regularization. We find that higher task complexity and stronger feature learning reduce degeneracy in neural dynamics but increase it in weight space, with mixed effects on behavior. In contrast, larger networks and structural regularization reduce degeneracy at all three levels. These findings empirically validate the Contravariance Principle and provide practical guidance for researchers seeking to tune the variability of RNN solutions, either to uncover shared neural mechanisms or to model the individual variability observed in biological systems. This work provides a principled framework for quantifying and controlling solution degeneracy in task-trained RNNs, offering new tools for building more interpretable and biologically grounded models of neural computation.

1. Introduction

Recurrent neural networks (RNNs) are widely used in machine learning and computational neuroscience to model dynamical processes. They are typically trained with standard nonconvex optimization methods and have proven useful as surrogate models for generating hypotheses about the neural mechanisms underlying task performance [1, 2, 3, 4, 5, 6]. Traditionally, the study of task-trained RNNs has focused on reverse-engineering a single trained model, implicitly assuming that networks trained on the same task would converge to similar solutions—even when initialized or trained differently. However, recent work has shown that this assumption does not hold universally, and the solution space of task-trained RNNs can be highly degenerate: networks may achieve the same level of training loss, yet differ in out-of-distribution (OOD) behavior, internal representations, neural dynamics, and connectivity [7, 8, 9, 10, 11, 12, 13]. For instance, [8] found that while trained RNNs may share certain topological features, their representational geometry can vary widely. Similarly, [7] showed that task-trained networks can develop qualitatively distinct neural dynamics and OOD generalization behaviors.

These findings raise fundamental questions about the solution space of task-trained RNNs: What factors govern the solution degeneracy across independently trained RNNs? When the solution space of task-trained RNNs is highly degenerate, to what extent can we trust conclusions drawn from a single model instance? While feedforward networks have been extensively studied in terms of how weight initialization and stochastic training (e.g., mini-batch gradients) lead to divergent solutions, RNNs still lack a systematic and unified understanding of the factors that govern solution degeneracy [14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. Cao and Yamins [24] proposed the Contravariance Principle, which posits that as the computational objective (i.e., the task) becomes more complex, the solution space should become less dispersed—since fewer models can simultaneously satisfy the stricter constraints imposed by harder tasks. While this principle is intuitive and compelling, it has thus far remained largely theoretical and has not been directly validated through empirical studies.

In this paper, we introduce a unified framework for quantifying solution degeneracy at three levels: behavior, neural dynamics, and weight space (Figure 1). Leveraging this framework, we isolate four key factors that control solution degeneracy: task complexity, learning regime, network width, and structural regularization. We apply this framework in a large-scale experiment, training 50 independently initialized RNNs on each of four neuroscience-relevant tasks. By systematically varying each factor, we map how it shapes degeneracy across behavior, dynamics, and weights. We find that as task complexity increases (whether via more input–output channels, higher memory demand, or auxiliary objectives), or as networks undergo stronger feature learning, neural dynamics become more consistent across networks while weight configurations grow more variable. In contrast, increasing network size or imposing structural regularization during training reduces variability at both the dynamics and weight levels. At the behavioral level, each of these factors reliably modulates behavioral degeneracy; however, the relationship between behavioral and dynamical degeneracy is not always consistent.

Figure 1: Schematic of our framework for analyzing solution degeneracy in task-trained RNNs. We evaluate how task complexity, learning regime, network size, and structural regularization influence degeneracy at three levels: behavior (network outputs), neural dynamics (state trajectories), and weight space (connectivity).

Table 1 summarizes how task complexity, learning regime, network size, and regularization affect degeneracy across levels. In both machine learning and neuroscience, the desired level of degeneracy may vary depending on the specific research questions being investigated. This framework offers practical guidance for tailoring training to a given goal—whether encouraging consistency across models [25], or promoting diversity across learned solutions [26, 27, 28].

Table 1: Summary of how each factor affects solution degeneracy. Arrows indicate the direction of change in degeneracy at each level as the factor increases. Contravariant factors shift dynamical and weight degeneracy in opposite directions; covariant factors shift them in the same direction.

Factor	Dynamics	Weights	Behavior
Higher Task complexity (contravariant)	↓	↑	↓
More Feature learning (contravariant)	↓	↑	↑
Larger Network size (covariant)	↓	↓	↓
Regularization (covariant)	↓	↓	↓

Our key contributions are as follows:

A unified framework for analyzing solution degeneracy in task-trained RNNs across behavior, dynamics, and weights.
A systematic sweep of four factors—task complexity, feature learning, network size, and regularization—and a summary of their effects across levels (Table 1), with practical guidance for tuning consistency vs. diversity [25, 26, 27, 28].
A double dissociation: task complexity and feature learning yield contravariant effects on weights vs. dynamics, while network size and regularization yield covariant effects. Here, contravariant means that a factor decreases degeneracy at one level (e.g., dynamics) while increasing it at another (e.g., weights), whereas covariant means both levels change in the same direction.

2. Methods

2.1. Model architecture and training procedure

We use discrete-time nonlinear vanilla recurrent neural networks (RNNs), defined by the update rule $h_{t} = \tanh (W_{h} h_{t - 1} + W_{x} x_{t} + b)$, where $h_{t} \in \mathbb{R}^{n}$ is the hidden state, $x_{t} \in \mathbb{R}^{m}$ is the input, $W_{h} \in \mathbb{R}^{n \times n}$ and $W_{x} \in \mathbb{R}^{n \times m}$ are the recurrent and input weight matrices, and $b \in \mathbb{R}^{n}$ is a bias vector. A learned linear readout is applied to the hidden state to produce the model’s output at each time step. Networks are trained with Backpropagation Through Time (BPTT) [29], which unrolls the RNN over time to compute gradients at each step. All networks are trained using supervised learning with the Adam optimizer without weight decay. Learning rates are tuned per task (Appendix B). For each task, we train 50 RNNs with 128 hidden units. Weights are initialized from the uniform distribution $𝒰 (- 1 / \sqrt{n}, 1 / \sqrt{n})$ and hidden states are initialized to zero.
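As a minimal sketch of the update rule above, the following NumPy code runs the forward pass of one such network. The function names, the bias-free readout, and the choice to draw the bias and readout weights from the same uniform distribution as the recurrent weights are illustrative assumptions, not details stated in the text; training with BPTT and Adam is omitted.

```python
import numpy as np

def init_rnn(n, m, k, rng):
    """Draw parameters from U(-1/sqrt(n), 1/sqrt(n)).

    Using this distribution for the bias and readout is an assumption.
    """
    bound = 1.0 / np.sqrt(n)

    def u(*shape):
        return rng.uniform(-bound, bound, size=shape)

    return {"W_h": u(n, n), "W_x": u(n, m), "b": u(n), "W_out": u(k, n)}

def rnn_forward(params, x):
    """Apply h_t = tanh(W_h h_{t-1} + W_x x_t + b) to an input of shape (T, m)."""
    h = np.zeros(params["W_h"].shape[0])  # hidden state starts at zero
    hidden, outputs = [], []
    for x_t in x:
        h = np.tanh(params["W_h"] @ h + params["W_x"] @ x_t + params["b"])
        hidden.append(h)
        outputs.append(params["W_out"] @ h)  # linear readout (no bias: assumption)
    return np.stack(hidden), np.stack(outputs)

rng = np.random.default_rng(0)
params = init_rnn(n=128, m=3, k=3, rng=rng)
hidden, outputs = rnn_forward(params, rng.standard_normal((20, 3)))
```

Here `hidden` holds the state trajectory used for dynamics-level comparisons and `outputs` holds the readout used for behavior-level comparisons.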

In all experiments, we train networks until they reach a near-asymptotic, task-specific mean-squared error (MSE) threshold on the training set (see Appendix B), after which we allow a patience period of 3 epochs and stop training to measure degeneracy. This early-stopping criterion ensures that networks trained on the same task achieve comparable final losses before any degeneracy analysis.

2.2. Task suite for diagnosing solution degeneracy

We selected a diverse set of four tasks designed to elicit distinct neural dynamics commonly studied in neuroscience. The N-Bit Flip-Flop task captures pattern recognition and memory retrieval processes, analogous to Hopfield-type attractor networks that store discrete binary patterns and retrieve them from partial cues [30, 31]. The Delayed Discrimination task models working memory maintenance in classic delayed-response paradigms [32, 33]. The Sine Wave Generation task represents pattern generation, analogous to Central Pattern Generators (CPGs) that produce self-sustaining rhythmic outputs underlying motor control [34], as well as oscillatory activity observed in motor cortex during movement [35]. Finally, the Path Integration task is inspired by hippocampal and entorhinal circuits that build a cognitive map of the environment to track position by integrating self-motion cues [36]. These tasks have also been used in prior benchmark suites for neuroscience-relevant RNN training [37, 38, 8], underscoring their broad relevance for studying diverse neural computations. Below, we briefly describe the task structure and the typical dynamics required to solve each one.

N-Bit Flip-Flop Task

Each RNN receives $N$ independent input channels taking values in {−1, 0, +1}, which switch with probability $p_{switch}$. The network has $N$ output channels that must retain the most recent nonzero input on their respective channels. The network dynamics form $2^{N}$ fixed points, corresponding to all binary combinations in $\{- 1, + 1\}^{N}$. The output range of this task is [−1, 1] and we apply an early-stopping training MSE threshold at 0.001.
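The input–target structure described above can be sketched as follows. This is an illustrative generator, not the authors' code: it assumes that on each time step every channel independently emits a ±1 pulse with probability `p_switch` (one reading of the switching rule), and that the target is 0 before the first pulse arrives.

```python
import numpy as np

def flip_flop_trial(n_bits, n_steps, p_switch, rng):
    """Return (inputs, targets), each of shape (n_steps, n_bits)."""
    pulse = rng.random((n_steps, n_bits)) < p_switch  # where a pulse occurs
    sign = rng.choice([-1.0, 1.0], size=(n_steps, n_bits))
    inputs = np.where(pulse, sign, 0.0)

    targets = np.zeros_like(inputs)
    state = np.zeros(n_bits)  # assumption: output is 0 before the first pulse
    for t in range(n_steps):
        # A nonzero input overwrites the stored bit; a zero input leaves it unchanged.
        state = np.where(inputs[t] != 0, inputs[t], state)
        targets[t] = state
    return inputs, targets

inputs, targets = flip_flop_trial(n_bits=3, n_steps=100, p_switch=0.2,
                                  rng=np.random.default_rng(0))
```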

Delayed Discrimination Task

The network receives two pulses of amplitudes $f_{1}$ , $f_{2} \in [2, 10]$ , separated by a variable delay $t \in [5, 20]$ time steps, and must output $sign (f_{2} - f_{1})$ . In the $N$ -channel variant, comparisons are made independently across channels. The network forms task-relevant fixed points to retain the amplitude of $f_{1}$ during the delay period. The output range of this task is [−1, 1] and we apply an early-stopping training MSE threshold at 0.01.

Sine Wave Generation

The network receives a static input specifying a target frequency $f \in [1, 30]$ and must generate the corresponding sine wave $\sin (2 π f t)$ over time. We define $N_{freq}$ target frequencies, evenly spaced within the range [1, 30], and use them during training. In the $N$ -channel variant, each input channel specifies a frequency, and the corresponding output channel generates a sine wave at that frequency. For each frequency, the network dynamics form and traverse a limit cycle that produces the corresponding sine wave. The output range of this task is [−1, 1] and we apply an early-stopping training MSE threshold at 0.05.

Path Integration Task

Starting from a random position in 2D, the network receives angular direction $θ$ and speed $v$ at each time step and updates its position estimate. In the 3D variant, the network takes as input azimuth $θ$ , elevation $ϕ$ , and speed $v$ , and outputs updated ( $x$ , $y$ , $z$ ) position. The network performs path integration by accumulating velocity vectors based on the input directions and speeds. After training, the network forms a map of the environment in its internal state space. The output range of this task is [−5, 5] and we apply an early-stopping training MSE threshold at 0.05.
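The 2D target trajectory is the running sum of velocity vectors, which the sketch below makes explicit. It assumes a unit time step and that $θ$ is measured from the x-axis; neither convention is specified in the text, and the function name is illustrative.

```python
import numpy as np

def integrate_path_2d(start, theta, speed):
    """Positions p_t = p_0 + sum_{s<=t} v_s * (cos(theta_s), sin(theta_s)).

    start: array of shape (2,); theta, speed: arrays of shape (T,).
    Assumes a unit time step (an illustrative convention).
    """
    velocity = np.stack([speed * np.cos(theta), speed * np.sin(theta)], axis=1)
    return start + np.cumsum(velocity, axis=0)

# Moving at unit speed along the x-axis for five steps.
positions = integrate_path_2d(np.zeros(2), np.zeros(5), np.ones(5))
```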

In our task suite, trained RNNs develop distinct stable dynamical objects: fixed points (N-Bit Flip-Flop, Delayed Discrimination), limit cycles (Sine Wave Generation), and attractor manifolds (Path Integration). In Appendix E, we extend our task suite to include a next-step prediction task on the chaotic Lorenz 96 system [39], where networks exhibit a chaotic dynamical regime.

2.3. Multi-level framework for quantifying degeneracy

2.3.1. Behavioral degeneracy

We define a novel metric for behavioral degeneracy as the variability in network responses to out-of-distribution (OOD) inputs. For each converged network (one that achieved near-asymptotic training loss), we quantify OOD performance as its mean squared error under a temporal generalization condition. For the Delayed Discrimination task, we doubled the delay period. For all other tasks, we doubled the length of the entire trial to assess generalization under extended temporal contexts. Behavioral degeneracy is defined as the standard deviation of the OOD losses: $σ_{OOD} = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(ℒ_{OOD}^{(i)} - {\bar{ℒ}}_{OOD})}^{2}}$, where $N$ here denotes the number of networks, $ℒ_{OOD}^{(i)}$ is the OOD loss of network $i$, and ${\bar{ℒ}}_{OOD}$ is the mean OOD loss. We focus primarily on the temporal generalization condition because it directly probes RNNs’ sequence-processing capacities and their ability to generalize across extended temporal horizons, but the same metric can be readily applied to other OOD conditions, such as input noise or external perturbations. In the rest of the paper, we use the term behavioral degeneracy [temporal generalization] to explicitly indicate the OOD condition being tested.
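The metric is the population standard deviation (normalized by $N$, not $N - 1$) of per-network OOD losses. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def behavioral_degeneracy(ood_losses):
    """Population standard deviation of per-network OOD losses."""
    losses = np.asarray(ood_losses, dtype=float)
    return float(np.sqrt(np.mean((losses - losses.mean()) ** 2)))
```

This is equivalent to `np.std(ood_losses)` with its default `ddof=0`.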

2.3.2. Dynamical degeneracy

We use Dynamical Similarity Analysis (DSA) [40] to compare the neural dynamics of task-trained networks through pairwise analyses. Previous comparison methods mostly focus on the geometry of the data [41, 42, 43, 44]. However, RNNs implement computations through time-varying trajectories rather than static representations, so two RNNs with similar representational geometry can implement distinct dynamical computations, and vice versa. DSA compares the topological structure of the neural dynamics and has been shown to be more robust to noise and better at identifying behaviorally relevant differences than geometry-based comparison methods [45]. For a pair of networks $X$ and $Y$, DSA projects their time series of activities to a higher-dimensional space and identifies a linear dynamic operator for each system via next-step prediction. The DSA distance between two systems is then computed by minimizing the Frobenius norm of the difference between the operators, up to an orthogonal transformation (rotation and reflection):

$$d_{DSA} (A_{x}, A_{y}) = \min_{C \in O (n)} {‖ A_{x} - C A_{y} C^{- 1} ‖}_{F},$$

where $O (n)$ is the orthogonal group. We define dynamical degeneracy as the average DSA distance across all network pairs. Additional details on the DSA metric are provided in Appendix F. We note that the scale of the DSA distance used to quantify dynamical degeneracy can depend on the choice of DSA hyperparameters. To ensure fair comparison across conditions, we keep all DSA hyperparameters fixed for RNNs trained on the same task. To assess whether the neural dynamics of different trained networks are statistically different, we also establish a null distribution by comparing neural trajectories sampled from the same underlying network (see Appendix F.3 for details).
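To make the first stage of this pipeline concrete, the sketch below delay-embeds a trajectory and fits a linear next-step operator by ordinary least squares. It is not the DSA reference implementation: the embedding scheme, the `n_delays` parameter, and the function names are illustrative assumptions, and the subsequent minimization over $O (n)$ is omitted.

```python
import numpy as np

def delay_embed(x, n_delays):
    """Stack n_delays consecutive states: (T, n) -> (T - n_delays + 1, n * n_delays)."""
    rows = x.shape[0] - n_delays + 1
    return np.concatenate([x[i:i + rows] for i in range(n_delays)], axis=1)

def fit_linear_operator(x, n_delays=1):
    """Least-squares operator A such that z_{t+1} ~ A z_t on the embedded series z."""
    z = delay_embed(x, n_delays)
    past, future = z[:-1], z[1:]
    # Solve past @ A.T ~ future in the least-squares sense.
    a_transposed, *_ = np.linalg.lstsq(past, future, rcond=None)
    return a_transposed.T

# Sanity check on a known linear system: a slowly decaying 2D rotation.
angle = 0.3
true_op = 0.95 * np.array([[np.cos(angle), -np.sin(angle)],
                           [np.sin(angle),  np.cos(angle)]])
states = [np.array([1.0, 0.0])]
for _ in range(49):
    states.append(true_op @ states[-1])
fitted_op = fit_linear_operator(np.stack(states))
```

On noiseless linear data with no delays, `fitted_op` recovers `true_op` up to numerical precision; for nonlinear RNN activity the fitted operator is only a linear approximation in the embedded space.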

We focus on comparing neural dynamics because RNNs implement computations through time-evolving trajectories rather than static input representations. In addition, we assess representational degeneracy using Singular Vector Canonical Correlation Analysis (SVCCA) [41]. As shown in Appendix G, the four factors that influence dynamical degeneracy do not impose the same constraints on representational degeneracy.

2.3.3. Weight degeneracy

We quantify weight-level degeneracy via a permutation-invariant version of the Frobenius norm, defined as:

$$d_{PIF} (W_{1}, W_{2}) = \min_{P \in 𝒫 (n)} {‖ W_{1} - P^{T} W_{2} P ‖}_{F}$$

where $W_{1}$ and $W_{2}$ are the recurrent weight matrices for a pair of RNNs, $𝒫 (n)$ is the set of permutation matrices of size $n \times n$, and ${‖ \cdot ‖}_{F}$ denotes the Frobenius norm. See Appendix F.2 for additional details. To compare $d_{PIF}$ across networks of different sizes, we normalize the above norm by the number of parameters in the weight matrix.
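For intuition, the definition can be evaluated exactly by enumerating all permutations, which is feasible only for very small $n$ (the search space has $n!$ elements). The sketch below is this brute-force illustration, not the procedure used for 128-unit networks (see Appendix F.2 for that); the function name and the `normalize` flag are assumptions.

```python
import itertools
import numpy as np

def pif_distance_bruteforce(w1, w2, normalize=False):
    """Exact permutation-invariant Frobenius distance for tiny square matrices."""
    n = w1.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(n)):
        idx = np.array(perm)
        # Relabel the units of w2: rows and columns are permuted together,
        # which is what P^T W_2 P does for a permutation matrix P.
        permuted = w2[np.ix_(idx, idx)]
        best = min(best, np.linalg.norm(w1 - permuted))
    # Optional normalization by the number of entries in the weight matrix.
    return float(best / (n * n)) if normalize else float(best)

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4))
shuffle = np.array([2, 0, 3, 1])
relabeled = w[np.ix_(shuffle, shuffle)]  # same network, units renumbered
```

Two networks that differ only by a renumbering of hidden units, such as `w` and `relabeled`, have distance zero under this metric even though their plain Frobenius distance is large.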

3. Results

3.1. Task complexity modulates degeneracy across levels

To investigate how task complexity influences dynamical degeneracy, we varied the number of independent input–output channels. This increased the representational load by forcing networks to solve multiple input-output mappings simultaneously. To visualize how neural dynamics vary across networks, we applied two-dimensional Multidimensional Scaling (MDS) to their pairwise distances. As task complexity increased, network dynamics became more similar, forming tighter clusters in the MDS space (Figure 3A). This contravariant relationship between task complexity and dynamical degeneracy was consistent across all tasks (Figure 3B). Higher task demands constrain the space of viable dynamical solutions, leading to greater consistency across independently trained networks.

Figure 3: (A) Two-dimensional MDS embedding of network dynamics shows that independently trained networks converge to more similar trajectories as task complexity increases. (B) Dynamical, (C) weight, and (D) behavioral degeneracy [temporal generalization] across 50 networks as a function of task complexity. Shaded area indicates ±1 standard error.

At the behavioral level, networks trained on more complex tasks consistently showed lower variability in their responses to OOD test inputs (Figure 3D) in the temporal generalization condition. This finding suggests that increased task complexity, by reducing dynamical degeneracy, also leads to more consistent and less degenerate behavior on the temporal generalization condition across networks. Together, the results at the behavioral and dynamical levels support the Contravariance Principle, which posits an inverse relationship between task complexity and the dispersion of network solutions [24].

At the weight level, we found that pairwise distances between converged RNNs’ weight matrices increased consistently with task complexity (Figure 3C). This likely reflects increased dispersion of local minima in weight space for harder tasks. This interpretation is consistent with prior work on mode averaging and loss landscape geometry in feedforward networks, showing that harder tasks tend to yield increasingly isolated minima, separated by steeper barriers [46, 47, 48, 49, 50, 51, 52]. A complementary perspective comes from [53], who introduced the intrinsic dimension—the lowest-dimensional weight subspace that still contains a solution—which can serve as a proxy for task complexity. As task complexity increases, the intrinsic dimension of the weight space expands and each solution occupies a thinner slice of a higher-dimensional space, leading to minima that lie further apart. In Section 3.2, we propose an additional mechanism: an interaction between task complexity and the network’s learning regime that further amplifies weight-space degeneracy.

3.1.1. Additional axes of task complexity

In earlier experiments, we controlled task complexity by varying the number of independent input–output channels, effectively duplicating the task across dimensions. Here, we explore two alternative approaches: increasing the task’s memory demand and adding auxiliary objectives.

Changing memory demand.

Of the four tasks, only Delayed Discrimination requires extended memory, as its performance depends on maintaining the first stimulus across a variable delay. See Appendix D for a quantification of each task’s memory demand. We increased the memory load in Delayed Discrimination by lengthening the delay period. This manipulation reduced degeneracy at the dynamical and behavioral levels but increased it at the weight level, mirroring the effect of increasing task dimensionality (Figure 4A).

Figure 4: In the Delayed Discrimination task, both manipulations (increasing memory demand, A; adding an auxiliary loss, B) reduce dynamical and behavioral degeneracy [temporal generalization] while increasing weight degeneracy. The auxiliary loss also induces additional line attractors in the network’s dynamics, as shown in (C).

Adding auxiliary loss.

We next examined how adding an auxiliary loss affects solution degeneracy in the Delayed Discrimination task. Specifically, the network outputs both the sign and the magnitude of the difference between two stimulus values ( $f_{2} - f_{1}$ ), using separate output channels for each. This manipulation added a second output channel and increased memory demand by requiring the network to track the magnitude of the difference between incoming stimuli. Consistent with our hypothesis, this manipulation reduced dynamical and behavioral degeneracy [temporal generalization] while increasing weight degeneracy (Figure 4B). Crucially, the auxiliary loss induced additional line attractors in the network dynamics, further structuring internal trajectories and aligning neural responses across networks (Figure 4C). While the auxiliary loss increases both output dimensionality and temporal memory demand, we interpret its effect holistically as a structured increase in task complexity.

3.2. Feature learning

3.2.1. Task complexity scales feature learning

In deep learning theory, neural networks can either solve tasks using their random features at initialization, or adapt their weights and internal features to capture task-specific structure [54, 55, 56, 57]. These are referred to as the lazy learning regime, where weights and internal features remain largely unchanged during training, and the rich (or feature) learning regime, where networks reshape their hidden representations and weights to capture task-specific structure [54, 58, 59, 55]. As the complexity of a task grows, the initial random features no longer suffice to solve it, pushing the network beyond the lazy regime and into feature learning, where weights and internal representations adapt more substantially [60, 61]. If more complex task variants, like those in Section 3.1, truly induce greater feature learning, then networks should adapt more from their initializations and traverse a greater distance in weight space, resulting in more dispersed final weights.

We therefore hypothesize that the increased weight degeneracy observed in harder tasks reflects stronger feature learning within the network. To test this idea, we measured feature learning strength in networks trained on different task variants using two complementary metrics [62, 58]:

Weight-change norm: ${‖ W_{T} - W_{0} ‖}_{F}$, where larger values indicate stronger feature learning.
Kernel alignment (KA): The geometry of learning under gradient descent can be described by the neural tangent kernel (NTK), which captures how weight updates affect the network output. The NTK is defined by $K = \nabla_{w} {\hat{y}}^{T} \nabla_{w} \hat{y}$, where $\hat{y}$ denotes the network output. KA measures the directional change of the NTK before and after training: $KA (K^{(T)}, K^{(0)}) = \frac{Tr (K^{(T)} K^{(0)})}{{‖ K^{(T)} ‖}_{F} {‖ K^{(0)} ‖}_{F}}$. Lower KA indicates greater NTK rotation and thus stronger feature learning.
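Both metrics reduce to a few lines once the weights and kernels are available. The sketch below assumes the NTK is formed as the Gram matrix of output gradients, with the Jacobian stored as an array of shape (outputs, parameters); how the Jacobian is obtained (for example, by automatic differentiation) is outside this sketch, and the function names are illustrative.

```python
import numpy as np

def weight_change_norm(w_final, w_init):
    """Frobenius norm of the change in weights over training."""
    return float(np.linalg.norm(w_final - w_init))

def ntk_from_jacobian(jac):
    """Gram matrix of output gradients; jac has shape (n_outputs, n_params)."""
    return jac @ jac.T

def kernel_alignment(k_final, k_init):
    """Tr(K_T K_0) / (||K_T||_F ||K_0||_F); equals 1 when the kernels are
    positive multiples of each other."""
    numerator = np.trace(k_final @ k_init)
    return float(numerator / (np.linalg.norm(k_final) * np.linalg.norm(k_init)))

rng = np.random.default_rng(0)
k0 = ntk_from_jacobian(rng.standard_normal((5, 40)))
ka_same = kernel_alignment(3.0 * k0, k0)  # rescaling alone does not rotate the kernel
```

Because KA is normalized by both Frobenius norms, it is insensitive to an overall rescaling of the kernel and responds only to changes in its direction.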

We find that more complex tasks consistently drive stronger feature learning and greater dispersion in weight space, as reflected by increasing weight-change norm and decreasing kernel alignment across all tasks (Figure 5).

Figure 5: Increased input–output dimensionality leads to higher weight-change norms $({‖ Δ W ‖}_{F})$ and lower kernel alignment (KA). Error bars indicate ±1 standard error.

3.2.2. Controlling feature learning reshapes degeneracy across levels

Our earlier results show that harder tasks induce stronger feature learning, which in turn shapes the dispersion of solutions in weight space. To test whether feature learning causally affects degeneracy, we used a principled network parameterization known as maximum update parameterization ($μ P$), which allows stable feature learning across network widths, even in the infinite-width limit [57, 54, 56, 55]. In this setup, a single hyperparameter ($γ$) controls the strength of feature learning: higher $γ$ values induce a richer feature-learning regime. Under this parameterization, the network update rule, initialization, and learning rate are scaled with respect to network width $N$. For the Adam optimizer, the output is scaled as $f (t) = \frac{1}{γ N} W_{readout} ϕ (h (t))$. The hidden state update is scaled as $h (t + 1) - h (t) = τ (- h (t) + \frac{1}{N} J ϕ (h (t)) + U x (t))$, where $J_{i j} \sim 𝒩 (0, N)$ are the recurrent weights and $ϕ$ is the tanh nonlinearity. The learning rate scales as $η = γ η_{0}$. A detailed explanation of $μ P$ and its relationship to the standard parameterization is given in Appendices K and L. For each task, we trained networks with multiple $γ$ values and confirmed that larger $γ$ consistently induces stronger feature learning, as evidenced by increased weight-change norm and decreased kernel alignment (Appendix M).
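The three scaling rules can be written out directly. In the sketch below, the recurrent weights follow the stated $𝒩 (0, N)$ initialization, while the initializations of $U$ and $W_{readout}$ and the values of $τ$, $γ$, and $η_{0}$ are placeholder assumptions, since the text defers them to the appendices.

```python
import numpy as np

def mup_step(h, x, J, U, tau):
    """h(t+1) = h(t) + tau * (-h(t) + (1/N) J tanh(h(t)) + U x(t))."""
    n = h.shape[0]
    return h + tau * (-h + (J @ np.tanh(h)) / n + U @ x)

def mup_output(h, W_readout, gamma):
    """f(t) = W_readout tanh(h(t)) / (gamma * N)."""
    n = h.shape[0]
    return (W_readout @ np.tanh(h)) / (gamma * n)

def mup_learning_rate(gamma, eta0):
    """eta = gamma * eta0 (the scaling stated for Adam)."""
    return gamma * eta0

rng = np.random.default_rng(0)
N, m, k = 64, 2, 1
J = rng.normal(0.0, np.sqrt(N), size=(N, N))  # variance N, as stated in the text
U = rng.standard_normal((N, m))               # assumption: not specified here
W_readout = rng.standard_normal((k, N))       # assumption: not specified here

h = np.zeros(N)
for _ in range(10):
    h = mup_step(h, rng.standard_normal(m), J, U, tau=0.1)  # tau is a placeholder
lazy_out = mup_output(h, W_readout, gamma=0.5)
rich_out = mup_output(h, W_readout, gamma=2.0)
```

For the same hidden state, a larger $γ$ shrinks the output, so the weights must move further to fit the targets; this is the sense in which $γ$ controls the strength of feature learning.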

We observed that stronger feature learning reduced degeneracy at the dynamical level but increased it at the weight level. When $γ$ is high, networks tend to learn similar task-specific features and converge to consistent dynamics and behavior. In contrast, lazy networks (with small $γ$) rely on their initial random features, leading to more divergent solutions across seeds, even though their weights move less overall (Figure 6). This finding aligns with prior work in feedforward networks, where feature learning was shown to reduce the variance of the neural tangent kernel across converged models [60]. At the behavioral level, however, increasing feature-learning strength leads networks to overfit the training distribution (Appendix J.2). We hypothesize that this overfitting increases both the average OOD loss and the variability of OOD behavior across models (Figure 6) [63, 64, 65, 66], so the higher behavioral degeneracy [temporal generalization] under stronger feature learning may partially reflect overfitting to the training distribution. Clarifying the mechanistic link between dynamical and behavioral degeneracy [temporal generalization] remains an important direction for future work. In Appendix I, we demonstrate that the observed effects of feature learning on degeneracy both interpolate smoothly within the range of $γ$ values and extrapolate beyond the range reported in Figure 6.

Figure 6: Panels show degeneracy at the dynamical, weight, and behavioral levels (top to bottom). Shaded area indicates ±1 standard error.

3.3. Larger networks yield more consistent solutions across levels

Prior work in machine learning and optimization shows that over-parameterization improves convergence by helping gradient methods escape saddle points [67, 68, 69, 70, 71, 72, 16]. We therefore hypothesized that larger RNNs would converge to more consistent solutions across seeds. However, increasing width also tends to push models towards the lazy regime, where feature learning is suppressed [73, 59, 54, 55, 56]. To disentangle these competing effects, we again use the $μ P$ parameterization, which holds feature learning strength constant (via fixed $γ$ ) while scaling width. Although larger networks may yield more consistent solutions via self-averaging, this outcome is not guaranteed without controlling for feature learning. In standard RNNs, increasing width often induces lazier dynamics, which can paradoxically increase dynamical degeneracy rather than reduce it. The $μ P$ setup enables us to isolate the size effect cleanly.

Across all tasks, larger networks consistently exhibit lower degeneracy at the weight, dynamical, and behavioral levels, producing more consistent solutions across random seeds (Figure 7). Our dense sweep over 12 intermediate network sizes from 32 to 512 on the 3-Bit Flip-Flop task in Appendix I further confirms the observed effect of network width on degeneracy. This pattern aligns with findings in vision and language models, where wider networks converge to more similar internal representations [74, 75, 41, 76, 77, 65]. In recurrent networks, only a few studies have investigated this “convergence-with-scale” effect using representation-based metrics [74, 78]. Our results extend these findings by (1) focusing on neural computations across time (i.e., neural dynamics) rather than static representations, and (2) demonstrating convergence-with-scale across weight, dynamical, and behavioral levels in RNNs.

Figure 7: After controlling for feature learning strength ($γ = 1$ held constant across network widths), wider RNNs yield more consistent solutions across all three levels of analysis. Panels show degeneracy at the dynamical, weight, and behavioral levels (top to bottom). Shaded area indicates ±1 standard error.

3.4. Structural regularization reduces solution degeneracy

Low-rank and sparsity constraints are widely used structural regularizers in neuroscience-inspired modeling and efficient machine learning [4, 79, 80, 81, 82]. A low-rank penalty compresses the weight matrices into a few dominant modes, while an $ℓ_{1}$ penalty drives many parameters to zero and induces sparsity. In both cases, task-irrelevant features are pruned, nudging independently initialized networks toward more consistent solutions on the same task. To test this idea, we augmented the task loss with either a nuclear-norm penalty on the recurrent weights, $ℒ = ℒ_{task} + λ_{rank} \sum_{i = 1}^{r} σ_{i}$, where $σ_{i}$ are the singular values of the recurrent matrix, or an $ℓ_{1}$ sparsity penalty, $ℒ = ℒ_{task} + λ_{ℓ_{1}} \sum_{i} ∣ w_{i} ∣$. We focused on the Delayed Discrimination task to control for baseline difficulty, and observed that both regularizers consistently reduced degeneracy across all levels. Similar effects hold in other tasks (Appendix O, Figure 8) and at intermediate regularization strengths (Appendix I).
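The two penalty terms can be evaluated as in the following sketch. Applying the $ℓ_{1}$ penalty to the recurrent matrix only is an assumption (the text does not say which parameters $w_{i}$ ranges over), and the function names and $λ$ value are illustrative. In training, the same expressions would be written in an automatic-differentiation framework so that gradients flow through them.

```python
import numpy as np

def nuclear_norm_penalty(w_rec):
    """Sum of the singular values of the recurrent weight matrix."""
    return float(np.linalg.svd(w_rec, compute_uv=False).sum())

def l1_penalty(w_rec):
    """Sum of absolute values of the entries (recurrent weights only: assumption)."""
    return float(np.abs(w_rec).sum())

def regularized_loss(task_loss, w_rec, lam_rank=0.0, lam_l1=0.0):
    """Task loss plus whichever penalty has a nonzero coefficient."""
    return (task_loss
            + lam_rank * nuclear_norm_penalty(w_rec)
            + lam_l1 * l1_penalty(w_rec))

w_example = np.diag([3.0, -4.0])
loss_low_rank = regularized_loss(1.0, w_example, lam_rank=0.1)
```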

Figure 8: On the Delayed Discrimination task, both regularizers lower degeneracy in dynamics, weights, and behavior. Shaded area indicates ±1 standard error.

4. Discussion

In this work, we introduced a unified framework for quantifying solution degeneracy in task-trained recurrent neural networks (RNNs) at three complementary levels: behavior, neural dynamics, and weights. We systematically varied four factors within our generalizable framework: (i) task complexity (via input–output dimensionality, memory demand, or auxiliary loss), (ii) feature learning strength, (iii) network size, and (iv) structural regularization. We then evaluated their effects on solution degeneracy across a diverse set of neuroscience-relevant tasks.

Two consistent patterns emerged from this analysis. First, increasing task complexity or boosting feature learning produced a contravariant effect: dynamical degeneracy decreased while weight degeneracy increased. Second, increasing network size or applying structural regularization reduced degeneracy at both the weight and dynamical levels—that is, a covariant effect. Here, covariant and contravariant refer to the relationship between weight and dynamic degeneracy—not whether degeneracy increases or decreases overall. For example, task complexity and feature learning reduce dynamical degeneracy but increase weight degeneracy, whereas network size and regularization reduce both.

We also observed that the relationship between dynamical and behavioral degeneracy depends on which factor is varied. For instance, stronger feature learning leads to more consistent neural dynamics on the training task but greater variability in OOD generalization. This suggests that tightly constrained dynamics on the training set do not guarantee more consistent behavior on OOD inputs, and it highlights the need for further empirical and theoretical work on how generalization depends on the internal structure of task-trained networks [83, 84, 85]. This divergence raises a key open question: how much of behavioral consistency generalizes beyond training-aligned dynamics, and what task or network factors drive this decoupling?

These knobs allow researchers to tune the level of degeneracy in task-trained RNNs to suit specific research questions or application needs. For example, researchers may want to suppress degeneracy to study common mechanisms underlying a neural computation. Conversely, to probe individual differences, they can increase degeneracy to expose solution diversity across independently trained networks [86, 87, 88, 89]. Our framework also supports ensemble-based modeling of brain data. By comparing dynamical and behavioral degeneracy across trained networks, it may be possible to match inter-individual variability in models to that observed in animals—helping capture the full distribution of task-solving strategies [90, 91, 92, 93].

Although our analyses use artificial networks, several of the mechanisms we uncover may translate directly to experimental neuroscience. For example, introducing an auxiliary sub-task during behavioral shaping—mirroring our auxiliary-loss manipulation—could constrain the solution space animals explore, thereby reducing behavioral degeneracy [94]. Finally, our contrasting findings motivate theoretical analysis—e.g., using linear RNNs—to understand why some factors induce contravariant versus covariant relationships across behavioral, dynamical, and weight-level degeneracy.

In summary, our work takes a first step toward addressing a classic puzzle in task-driven modeling: what factors shape the variability across independently trained networks? We present a unified framework for quantifying solution degeneracy in task-trained RNNs, identify the key factors that shape the solution landscape, and provide practical guidance for controlling degeneracy to match specific research goals in neuroscience and machine learning.

Limitations and future directions.

This work considers networks equivalent if they achieve similar training loss. Future work could extend the framework to tasks with multiple qualitatively distinct solutions, to examine whether specific factors bias the distribution of networks across those solutions. Another open question is the observed decoupling between dynamical and behavioral degeneracy: how much of behavioral consistency generalizes beyond training-aligned dynamics, and what task or network factors drive this divergence?

Supplementary Material

NIHMS2124298-supplement-Supplementary_Material.zip (19.5 MB, zip)

Figure 2: — Task schematics and representative network trajectories projected onto the top principal components are shown in (A)–(B). The four tasks are: **N-Bit Flip-Flop**: The network must remember the last nonzero input on each of $N$ independent channels. **Delayed Discrimination**: The network compares the magnitude of two pulses, separated by a variable delay, and outputs their sign difference. **Sine Wave Generation**: A static input specifies a target frequency, and the network generates the corresponding sine wave over time. **Path Integration**: The network integrates velocity inputs to track position in a bounded 2D or 3D arena (schematic shows 2D case).

Acknowledgments

We acknowledge funding from NIH (RF1DA056403, U01NS136507), James S. McDonnell Foundation (220020466), Simons Foundation (Pilot Extension-00003332-02), McKnight Endowment Fund, CIFAR Azrieli Global Scholar Program, NSF (2046583), Harvard Medical School Neurobiology Lefler Small Grant Award, Harvard Medical School Dean’s Innovation Award, Alice and Joseph Brooks Fund Postdoctoral Fellowship, and Kempner Graduate Fellowship. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University.

References

[1] Sussillo David. Neural circuits as computational dynamical systems. Current Opinion in Neurobiology, 25:156–163, 2014.
[2] Rajan Kanaka, Harvey Christopher D, and Tank David W. Recurrent network models of sequence generation and memory. Neuron, 90(1):128–142, 2016.
[3] Barak Omri. Recurrent neural networks as versatile tools of neuroscience research. Current Opinion in Neurobiology, 46:1–6, 2017. ISSN 09594388. doi: 10.1016/j.conb.2017.06.003. URL https://linkinghub.elsevier.com/retrieve/pii/S0959438817300429.
[4] Mastrogiuseppe Francesca and Ostojic Srdjan. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron, 99(3):609–623.e29, 2018. doi: 10.1016/j.neuron.2018.07.003.
[5] Vyas Saurabh, Golub Matthew D, Sussillo David, and Shenoy Krishna V. Computation through neural population dynamics. Annual Review of Neuroscience, 43:249–275, 2020.
[6] Driscoll Laura N., Shenoy Krishna, and Sussillo David. Flexible multitask computation in recurrent networks utilizes shared dynamical motifs. Nature Neuroscience, 27(7):1349–1363, July 2024. ISSN 1097-6256, 1546-1726. doi: 10.1038/s41593-024-01668-6. URL https://www.nature.com/articles/s41593-024-01668-6.
[7] Turner Elia, Dabholkar Kabir V, and Barak Omri. Charting and navigating the space of solutions for recurrent neural networks. Advances in Neural Information Processing Systems, 34:25320–25333, 2021.
[8] Maheswaranathan Niru, Williams Alex H., Golub Matthew D., Ganguli Surya, and Sussillo David. Universality and individuality in neural dynamics across recurrent networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[9] Kurtkaya Bariscan, Dinc Fatih, Yuksekgonul Mert, Blanco-Pozo Marta, Cirakman Ege, Schnitzer Mark, Yemez Yucel, Tanaka Hidenori, Yuan Peng, and Miolane Nina. Dynamical phases of short-term memory mechanisms in rnns, 2025. URL https://arxiv.org/abs/2502.17433.
[10] Pagan Marino, Tang Vincent D., Aoi Mikio C., Pillow Jonathan W., Mante Valerio, Sussillo David, and Brody Carlos D. Individual variability of neural computations underlying flexible decisions. Nature, 639(8054):421–429, March 2025. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-024-08433-6. URL https://www.nature.com/articles/s41586-024-08433-6.
[11] Clark David G., Abbott LF, and Sompolinsky Haim. Symmetries and Continuous Attractors in Disordered Neural Circuits. bioRxiv, 2025. doi: 10.1101/2025.01.26.634933. URL https://www.biorxiv.org/content/early/2025/01/26/2025.01.26.634933.
[12] Lappalainen Janne K., Tschopp Fabian D., Prakhya Sridhama, McGill Mason, Nern Aljoscha, Shinomiya Kazunori, Takemura Shin-ya, Gruntman Eyal, Macke Jakob H., and Turaga Srinivas C. Connectome-constrained networks predict neural activity across the fly visual system. Nature, 634(8036):1132–1140, October 2024. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-024-07939-3. URL https://www.nature.com/articles/s41586-024-07939-3.
[13] Murray Keith T. Phase codes emerge in recurrent neural networks optimized for modular arithmetic, 2025. URL https://arxiv.org/abs/2310.07908.
[14] Das Abhranil and Fiete Ila R. Systematic errors in connectivity inferred from activity in strongly recurrent networks. Nature Neuroscience, 23(10):1286–1296, 2020.
[15] Turner Elia and Barak Omri. The simplicity bias in multi-task rnns: shared attractors, reuse of dynamics, and geometric representation. Advances in Neural Information Processing Systems, 36, 2024.
[16] Martinelli Flavio, Simsek Berfin, Gerstner Wulfram, and Brea Johanni. Expand-and-cluster: Parameter recovery of neural networks, 2024. URL https://arxiv.org/abs/2304.12794.
[17] Martinelli Flavio, Van Meegen Alexander, Şimşek Berfin, Gerstner Wulfram, and Brea Johanni. Flat channels to infinity in neural loss landscapes, 2025. URL https://arxiv.org/abs/2506.14951.
[18] Fort Stanislav, Hu Huiyi, and Lakshminarayanan Balaji. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.
[19] Goodfellow Ian J., Vinyals Oriol, and Saxe Andrew M. Qualitatively characterizing neural network optimization problems, 2015. URL https://arxiv.org/abs/1412.6544.
[20] Li Hao, Xu Zheng, Taylor Gavin, Studer Christoph, and Goldstein Tom. Visualizing the loss landscape of neural nets, 2018. URL https://arxiv.org/abs/1712.09913.
[21] Jastrzębski Stanisław, Kenton Zachary, Arpit Devansh, Ballas Nicolas, Fischer Asja, Bengio Yoshua, and Storkey Amos. Three factors influencing minima in sgd, 2018. URL https://arxiv.org/abs/1711.04623.
[22] Chaudhari Pratik, Choromanska Anna, Soatto Stefano, LeCun Yann, Baldassi Carlo, Borgs Christian, Chayes Jennifer, Sagun Levent, and Zecchina Riccardo. Entropy-sgd: Biasing gradient descent into wide valleys, 2017. URL https://arxiv.org/abs/1611.01838.
[23] Kornblith Simon, Norouzi Mohammad, Lee Honglak, and Hinton Geoffrey. Similarity of neural network representations revisited, 2019. URL https://arxiv.org/abs/1905.00414.
[24] Cao Rosa and Yamins Daniel. Explanatory models in neuroscience, part 2: Functional intelligibility and the contravariance principle. Cognitive Systems Research, 85:101200, 2024.
[25] Kepple D, Engelken Rainer, and Rajan Kanaka. Curriculum learning as a tool to uncover learning principles in the brain. In International Conference on Learning Representations, 2022.
[26] Garcia Samuel Liebana, Laffere Aeron, Toschi Chiara, Schilling Louisa, Podlaski Jacek, Fritsche Matthias, Zatka-Haas Peter, Li Yulong, Bogacz Rafal, Saxe Andrew, and Lak Armin. Striatal dopamine reflects individual long-term learning trajectories, December 2023. URL http://biorxiv.org/lookup/doi/10.1101/2023.12.14.571653.
[27] Fascianelli Valeria, Battista Aldo, Stefanini Fabio, Tsujimoto Satoshi, Genovesio Aldo, and Fusi Stefano. Neural representational geometries reflect behavioral differences in monkeys and recurrent neural networks. Nature Communications, 15(1):6479, August 2024. ISSN 2041-1723. doi: 10.1038/s41467-024-50503-w. URL https://www.nature.com/articles/s41467-024-50503-w.
[28] Pan-Vazquez A, Sanchez Araujo Y, McMannon B, Louka M, Bandi A, Haetzel L, International Brain Laboratory, Pillow JW, Daw ND, and Witten IB. Pre-existing visual responses in a projection-defined dopamine population explain individual learning trajectories, 2024. URL https://europepmc.org/article/PPR/PPR811803.
[29] Werbos Paul J. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[30] Hopfield John J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8):2554–2558, 1982.
[31] Jarne Ignacio. Exploring flip flop memories and beyond: training recurrent neural networks with key insights. Frontiers in Systems Neuroscience, 2024.
[32] Funahashi Shintaro, Bruce Charles J., and Goldman-Rakic Patricia S. Mnemonic coding of visual space in the monkey’s dorsolateral prefrontal cortex. Journal of Neurophysiology, 61(2):331–349, 1989.
[33] Goldman-Rakic Patricia S. Cellular basis of working memory. Neuron, 14(3):477–485, 1995.
[34] Marder Eve and Bucher Dirk. Central pattern generators and the control of rhythmic movement. Current Biology, 11(23):R986–R996, 2001.
[35] Churchland Mark M., Cunningham John P., Kaufman Matthew T., Foster Justin D., Nuyujukian Paul, Ryu Stephen I., and Shenoy Krishna V. Neural population dynamics during reaching. Nature, 487(7405):51–56, 2012.
[36] McNaughton Bruce L., Battaglia Francesco P., Jensen Ole, Moser Edvard I., and Moser May-Britt. Path integration and the neural basis of the ‘cognitive map’. Nature Reviews Neuroscience, 7:663–678, 2006.
[37] Yang Guangyu R., Joglekar Madhura R., Song H. Francis, Newsome William T., and Wang Xiao-Jing. Task representations in neural networks trained to perform many cognitive tasks. Nature Neuroscience, 2019.
[38] Khona Mihir, Chandra Shreyas, Ma James J., and Fiete Ila R. Winning the lottery with neural connectivity constraints: Faster learning across cognitive tasks with spatially constrained sparse rnns. Neural Computation, 35(11), 2023. doi: 10.1162/neco_a_01613.
[39] Lorenz Edward N. Predictability: A problem partly solved. In ECMWF Seminar on Predictability, 4–8 September 1995, Reading, U.K., 1996. European Centre for Medium-Range Weather Forecasts.
[40] Ostrow Mitchell, Eisen Adam, Kozachkov Leo, and Fiete Ila. Beyond Geometry: Comparing the Temporal Structure of Computation in Neural Circuits with Dynamical Similarity Analysis, October 2023. URL http://arxiv.org/abs/2306.10168. arXiv:2306.10168 [cs, q-bio].
[41] Raghu Maithra et al. Svcca: Singular vector canonical correlation analysis for deep learning dynamics. In NeurIPS, 2017.
[42] Kriegeskorte Nikolaus, Mur Marieke, and Bandettini Peter. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4, 2008. doi: 10.3389/neuro.06.004.2008.
[43] Williams Alex H., Kunz Erin, Kornblith Simon, and Linderman Scott W. Generalized Shape Metrics on Neural Representations, January 2022. URL http://arxiv.org/abs/2110.14739. arXiv:2110.14739 [cs, stat].
[44] Schrimpf Martin, Kubilius Jonas, Hong Ha, Majaj Najib J., Rajalingham Rishi, Issa Elias B., Kar Kohitij, Bashivan Pouya, Prescott-Roy James, Schmidt Kailyn, Yamins Daniel L. K., and DiCarlo James J. Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv, 2020. doi: 10.1101/407007. URL https://www.biorxiv.org/content/10.1101/407007v2.
[45] Guilhot Quentin, Wójcik Michał J, Achterberg Jascha, and Costa Rui Ponte. Dynamical similarity analysis uniquely captures how computations develop in RNNs, 2025. URL https://openreview.net/forum?id=pXPIQsV1St.
[46] Goodfellow Ian J., Vinyals Oriol, and Saxe Andrew M. Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations (ICLR), 2015. arXiv:1412.6544.
[47] Frankle Jonathan, Dziugaite Gintare Karolina, Roy Daniel, and Carbin Michael. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 3259–3269. PMLR, 2020.
[48] Lucas James R., Bae Juhan, Zhang Michael R., Fort Stanislav, Zemel Richard, and Grosse Roger B. On monotonic linear interpolation of neural network parameters. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 7168–7179. PMLR, 2021.
[49] Fort Stanislav and Jastrzebski Stanislaw. Large scale structure of neural network loss landscapes. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.
[50] Achille Alessandro, Paolini Giovanni, and Soatto Stefano. Where is the information in a deep neural network? CoRR, abs/1905.12213, 2019.
[51] Qu Xingyu and Horvath Samuel. Rethink model re-basin and the linear mode connectivity. arXiv preprint arXiv:2402.05966, 2024.
[52] Ly Andrew and Gong Pulin. Optimization on multifractal loss landscapes explains a diverse range of geometrical and dynamical properties of deep learning. Nature Communications, 16(3252), 2025. doi: 10.1038/s41467-025-58532-9.
[53] Li Chunyuan, Farkhoor Heerad, Liu Rosanne, and Yosinski Jason. Measuring the intrinsic dimension of objective landscapes. CoRR, abs/1804.08838, 2018.
[54] Chizat Léon, Oyallon Edouard, and Bach Francis. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32:2938–2950, 2019.
[55] Woodworth Bryan, Gunasekar Suriya, Lee Jason D., Srebro Nathan, Bhojanapalli Srinadh, Khanna Rina, Chatterji Aaron, and Jaggi Martin. Kernel and rich regimes in deep learning. Journal of Machine Learning Research, 21(243):1–48, 2020.
[56] Geiger Mario, Spigler Stefano, Jacot Arthur, and Wyart Matthieu. Disentangling feature and lazy training in deep neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2020(11):113301, 2020. doi: 10.1088/1742-5468/abc4de.
[57] Bordelon Blake and Pehlevan Cengiz. Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks, October 2022. URL http://arxiv.org/abs/2205.09653. arXiv:2205.09653 [stat].
[58] George Thomas, Lajoie Guillaume, and Baratin Aristide. Lazy vs hasty: Linearization in deep networks impacts learning schedule based on example difficulty. arXiv preprint arXiv:2209.09658, 2022. URL https://arxiv.org/abs/2209.09658.
[59] Lee Jaehoon, Bahri Yuval, Novak Roman, Schoenholz Samuel S., Pennington Jeffrey, and Sohl-Dickstein Jascha. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, volume 32, pages 8572–8583, 2019.
[60] Bordelon Blake and Pehlevan Cengiz. Dynamics of finite width Kernel and prediction fluctuations in mean field neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2024(10):104021, October 2024. ISSN 1742-5468. doi: 10.1088/1742-5468/ad642b. URL https://iopscience.iop.org/article/10.1088/1742-5468/ad642b.
[61] Kumar Tanishq, Bordelon Blake, Gershman Samuel J., and Pehlevan Cengiz. Grokking as the transition from lazy to rich training dynamics. arXiv preprint arXiv:2310.06110, 2023. URL https://arxiv.org/abs/2310.06110.
[62] Liu Yuhan Helena, Baratin Aristide, Cornford Jonathan, Mihalas Stefan, Shea-Brown Eric, and Lajoie Guillaume. How connectivity structure shapes rich and lazy learning in neural circuits. arXiv preprint arXiv:2310.08513, 2023. doi: 10.48550/arXiv.2310.08513. URL https://arxiv.org/abs/2310.08513.
[63] Bansal Yamini, Nakkiran Preetum, and Barak Boaz. Revisiting model stitching to compare neural representations. arXiv preprint arXiv:2106.07682, 2021. URL https://arxiv.org/abs/2106.07682.
[64] Duan Sunny, Matthey Loïc, Saraiva André, Watters Nicholas, Burgess Christopher P., Lerchner Alexander, and Higgins Irina. Unsupervised model selection for variational disentangled representation learning. arXiv preprint arXiv:1905.12614, 2020. URL https://arxiv.org/abs/1905.12614.
[65] Huh Minyoung, Cheung Brian, Wang Tongzhou, and Isola Phillip. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024. URL https://arxiv.org/abs/2405.07987.
[66] Li Yixuan, Yosinski Jason, Clune Jeff, Lipson Hod, and Hopcroft John. Convergent learning: Do different neural networks learn the same representations? arXiv preprint arXiv:1511.07543, 2016. URL https://arxiv.org/abs/1511.07543.
[67] Kawaguchi Kenji. Deep learning without poor local minima. In Advances in Neural Information Processing Systems 29, pages 586–594, 2016.
[68] Nguyen Quynh and Hein Matthias. The loss surface of deep and wide neural networks. CoRR, abs/1704.08045, 2017.
[69] Du Simon S., Lee Jason D., Li Haochuan, Wang Liwei, and Zhai Xiyu. Gradient descent finds global minima of deep neural networks. CoRR, abs/1811.03804, 2018.
[70] Allen-Zhu Zeyuan, Li Yuanzhi, and Song Zhao. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, pages 242–252, 2019.
[71] Zou Difan, Cao Yuan, Zhou Dongruo, and Gu Quanquan. Stochastic gradient descent optimizes over-parameterized deep relu networks. CoRR, abs/1811.08888, 2018.
[72] Şimşek Berfin, Ged François, Jacot Arthur, Spadaro Francesco, Hongler Clément, Gerstner Wulfram, and Brea Johanni. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances, 2021. URL https://arxiv.org/abs/2105.12221.
[73] Jacot Arthur, Gabriel Franck, and Hongler Clément. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31, 2018.
[74] Morcos Ari S., Raghu Maithra, and Bengio Samy. Insights on representational similarity in neural networks with canonical correlation. In NeurIPS, 2018.
[75] Kornblith Simon, Norouzi Mohammad, Lee Honglak, and Hinton Geoffrey. Similarity of neural network representations revisited. In ICML, 2019.
[76] Wolf Fred, Engelken Rainer, Puelma-Touzel Maximilian, Weidinger Juan Daniel Flórez, and Neef Andreas. Dynamical models of cortical circuits. Current Opinion in Neurobiology, 25:228–236, 2014. ISSN 09594388. doi: 10.1016/j.conb.2014.01.017. URL https://linkinghub.elsevier.com/retrieve/pii/S0959438814000324.
[77] Klabunde Felix et al. Contrasim – analyzing neural representations based on contrastive learning. In ICLR, 2024.
[78] Nguyen Thao, Raghu Maithra, and Kornblith Simon. Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=KJNcAkY8tY4.
[79] Beiran Manuel, Dubreuil Alexis, Valente Adrian, Mastrogiuseppe Francesca, and Ostojic Srdjan. Shaping dynamics with multiple populations in low-rank recurrent networks. arXiv preprint arXiv:2007.02062, 2020. doi: 10.48550/arXiv.2007.02062.
[80] Olshausen Bruno A. and Field David J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996. doi: 10.1038/381607a0.
[81] Han Song, Pool Jeff, Tran John, and Dally William J. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015. doi: 10.48550/arXiv.1506.02626.
[82] Glorot Xavier, Bordes Antoine, and Bengio Yoshua. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15, pages 315–323, 2011.
[83] Li Qianyi, Sorscher Ben, and Sompolinsky Haim. Representations and generalization in artificial and brain neural networks. Proceedings of the National Academy of Sciences, 121(27):e2311805121, 2024. doi: 10.1073/pnas.2311805121.
[84] Cohen Uri, Chung SueYeon, Lee Daniel D., and Sompolinsky Haim. Separability and geometry of object manifolds in deep neural networks. Nature Communications, 11:746, 2020. doi: 10.1038/s41467-020-14578-5.
[85] Sorscher Ben, Ganguli Surya, and Sompolinsky Haim. Neural representational geometry underlies few-shot concept learning. Proceedings of the National Academy of Sciences, 119(43):e2200800119, 2022. doi: 10.1073/pnas.2200800119.
[86] Bouthillier Xavier, Delaunay Pierre, Bronzi Mirko, Trofimov Assya, Nichyporuk Brennan, Szeto Justin, Sepahvand Nazanin Mohammadi, Raff Edward, Madan Kanika, Voleti Vikram, et al. Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems, 3:747–769, 2021.
[87] Morik Katharina. Sloppy modeling. Knowledge representation and organization in machine learning, pages 107–134, 2005.
[88] Yang Rubing, Mao Jialin, and Chaudhari Pratik. Does the data induce capacity control in deep learning? In International Conference on Machine Learning, pages 25166–25197. PMLR, 2022.
[89] Singh Satpreet H, van Breugel Floris, Rao Rajesh PN, and Brunton Bingni W. Emergent behaviour and neural dynamics in artificial agents tracking odour plumes. Nature Machine Intelligence, 5(1):58–70, 2023.
[90] Negrón-Oyarzo Ignacio, Espinosa Nelson, Aguilar-Rivera Marcelo, Fuenzalida Marco, Aboitiz Francisco, and Fuentealba Pablo. Coordinated prefrontal–hippocampal activity and navigation strategy-related prefrontal firing during spatial memory formation. Proceedings of the National Academy of Sciences, 115(27):7123–7128, 2018. doi: 10.1073/pnas.1720117115.
[91] Ashwood Zoe C., Roy Nicholas A., Stone Iris R., International Brain Laboratory, Urai Anne E., Churchland Anne K., Pouget Alexandre, and Pillow Jonathan W. Mice alternate between discrete strategies during perceptual decision-making. Nature Neuroscience, 25:201–212, 2022. doi: 10.1038/s41593-021-01007-z.
[92] Cazettes Fanny, Mazzucato Luca, Murakami Masayoshi, Morais João P., Augusto Elisabete, Renart Alfonso, and Mainen Zachary F. A reservoir of foraging decision variables in the mouse brain. Nature Neuroscience, 26:840–849, 2023. doi: 10.1038/s41593-023-01305-8.
[93] Pagan Marino, Tang Vincent D., Aoi Mikio C., Pillow Jonathan W., Mante Valerio, Sussillo David, and Brody Carlos D. Individual variability of neural computations underlying flexible decisions. Nature, 639:421–429, 2025. doi: 10.1038/s41586-024-08433-6.
[94] Howard BR. Control of variability. ILAR journal, 43(4):194–201, 2002.
[95] Schmid Peter J. Dynamic mode decomposition and its variants. Annual Review of Fluid Mechanics, 54(1):225–254, 2022.
[96] Schönemann Peter H. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31(1):1–10, Mar 1966. doi: 10.1007/BF02289451.
[97] Ding Chris, Li Tao, and Jordan Michael I. Nonnegative matrix factorization for combinatorial optimization: Spectral clustering, graph matching, and clique finding. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM ’08), pages 183–192. IEEE, 2008. doi: 10.1109/ICDM.2008.130.
[98] Meng Fanwang, Richer Michael G., Tehrani Alireza, La Jonathan, Kim Taewon David, Ayers PW, and Heidar-Zadeh Farnaz. Procrustes: A python library to find transformations that maximize the similarity between matrices. Computer Physics Communications, 276:108334, 2022. doi: 10.1016/j.cpc.2022.108334. URL https://www.sciencedirect.com/science/article/pii/S0010465522000522.
[99] Gutenkunst Ryan N, Waterfall Joshua J, Casey Fergal P, Brown Kevin S, Myers Christopher R, and Sethna James P. Universally sloppy parameter sensitivities in systems biology models. PLoS computational biology, 3(10):e189, 2007.
[100] Yang Greg, Hu Edward J., Babuschkin Igor, Sidor Szymon, Liu Xiaodong, Farhi David, Ryder Nick, Pachocki Jakub, Chen Weizhu, and Gao Jianfeng. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022. doi: 10.48550/arXiv.2203.03466. Accepted at NeurIPS 2021.
[101] Yang Greg, Yu Dingli, Zhu Chen, and Hayou Soufiane. Tensor programs vi: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023. doi: 10.48550/arXiv.2310.02244. Accepted at ICLR 2024.
[102] Miller Kenneth D and Fumarola Francesco. Mathematical equivalence of two common forms of firing rate models of neural networks. Neural computation, 24(1):25–31, 2012.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material

NIHMS2124298-supplement-Supplementary_Material.zip^{(19.5MB, zip)}

[R1] [1].Sussillo David. Neural circuits as computational dynamical systems. Current opinion in neurobiology, 25:156–163, 2014. [DOI] [PubMed] [Google Scholar]

[R2] [2].Rajan Kanaka, Harvey Christopher D, and Tank David W. Recurrent network models of sequence generation and memory. Neuron, 90(1):128–142, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] [3].Barak Omri. Recurrent neural networks as versatile tools of neuroscience research. 46:1–6. ISSN 09594388. doi: 10.1016/j.conb.2017.06.003. URL https://linkinghub.elsevier.com/retrieve/pii/S0959438817300429. [DOI] [Google Scholar]

[R4] [4].Mastrogiuseppe Francesca and Ostojic Srdjan. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron, 99(3):609–623.e29, 2018. doi: 10.1016/j.neuron.2018.07.003. [DOI] [PubMed] [Google Scholar]

[R5] [5].Vyas Saurabh, Golub Matthew D, Sussillo David, and Shenoy Krishna V. Computation through neural population dynamics. Annual Review of Neuroscience, 43:249–275, 2020. [Google Scholar]

[R6] [6].Driscoll Laura N., Shenoy Krishna, and Sussillo David. Flexible multitask computation in recurrent networks utilizes shared dynamical motifs. Nature Neuroscience, 27(7):1349–1363, July 2024. ISSN 1097-6256, 1546-1726. doi: 10.1038/s41593-024-01668-6. URL https://www.nature.com/articles/s41593-024-01668-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] [7].Turner Elia, Dabholkar Kabir V, and Barak Omri. Charting and navigating the space of solutions for recurrent neural networks. Advances in Neural Information Processing Systems, 34:25320–25333, 2021. [Google Scholar]

[R8] [8].Maheswaranathan Niru, Williams Alex H., Golub Matthew D., Ganguli Surya, and Sussillo David. Universality and individuality in neural dynamics across recurrent networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019. [Google Scholar]

[R9] [9].Kurtkaya Bariscan, Dinc Fatih, Yuksekgonul Mert, Blanco-Pozo Marta, Cirakman Ege, Schnitzer Mark, Yemez Yucel, Tanaka Hidenori, Yuan Peng, and Miolane Nina. Dynamical phases of short-term memory mechanisms in rnns, 2025. URL https://arxiv.org/abs/2502.17433. [Google Scholar]

[R10] [10].Pagan Marino, Tang Vincent D., Aoi Mikio C., Pillow Jonathan W., Mante Valerio, Sussillo David, and Brody Carlos D.. Individual variability of neural computations underlying flexible decisions. Nature, 639(8054):421–429, March 2025. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-024-08433-6. URL https://www.nature.com/articles/s41586-024-08433-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[11] Clark David G., Abbott LF, and Sompolinsky Haim. Symmetries and Continuous Attractors in Disordered Neural Circuits. bioRxiv, 2025. doi: 10.1101/2025.01.26.634933. URL https://www.biorxiv.org/content/early/2025/01/26/2025.01.26.634933.

[12] Lappalainen Janne K., Tschopp Fabian D., Prakhya Sridhama, McGill Mason, Nern Aljoscha, Shinomiya Kazunori, Takemura Shin-ya, Gruntman Eyal, Macke Jakob H., and Turaga Srinivas C. Connectome-constrained networks predict neural activity across the fly visual system. Nature, 634(8036):1132–1140, October 2024. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-024-07939-3. URL https://www.nature.com/articles/s41586-024-07939-3.

[13] Murray Keith T. Phase codes emerge in recurrent neural networks optimized for modular arithmetic, 2025. URL https://arxiv.org/abs/2310.07908.

[14] Das Abhranil and Fiete Ila R. Systematic errors in connectivity inferred from activity in strongly recurrent networks. Nature Neuroscience, 23(10):1286–1296, 2020.

[15] Turner Elia and Barak Omri. The simplicity bias in multi-task RNNs: shared attractors, reuse of dynamics, and geometric representation. Advances in Neural Information Processing Systems, 36, 2024.

[16] Martinelli Flavio, Simsek Berfin, Gerstner Wulfram, and Brea Johanni. Expand-and-cluster: Parameter recovery of neural networks, 2024. URL https://arxiv.org/abs/2304.12794.

[17] Martinelli Flavio, Van Meegen Alexander, Şimşek Berfin, Gerstner Wulfram, and Brea Johanni. Flat channels to infinity in neural loss landscapes, 2025. URL https://arxiv.org/abs/2506.14951.

[18] Fort Stanislav, Hu Huiyi, and Lakshminarayanan Balaji. Deep ensembles: A loss landscape perspective. arXiv preprint arXiv:1912.02757, 2019.

[19] Goodfellow Ian J., Vinyals Oriol, and Saxe Andrew M. Qualitatively characterizing neural network optimization problems, 2015. URL https://arxiv.org/abs/1412.6544.

[20] Li Hao, Xu Zheng, Taylor Gavin, Studer Christoph, and Goldstein Tom. Visualizing the loss landscape of neural nets, 2018. URL https://arxiv.org/abs/1712.09913.

[21] Jastrzębski Stanisław, Kenton Zachary, Arpit Devansh, Ballas Nicolas, Fischer Asja, Bengio Yoshua, and Storkey Amos. Three factors influencing minima in SGD, 2018. URL https://arxiv.org/abs/1711.04623.

[22] Chaudhari Pratik, Choromanska Anna, Soatto Stefano, LeCun Yann, Baldassi Carlo, Borgs Christian, Chayes Jennifer, Sagun Levent, and Zecchina Riccardo. Entropy-SGD: Biasing gradient descent into wide valleys, 2017. URL https://arxiv.org/abs/1611.01838.

[23] Kornblith Simon, Norouzi Mohammad, Lee Honglak, and Hinton Geoffrey. Similarity of neural network representations revisited, 2019. URL https://arxiv.org/abs/1905.00414.

[24] Cao Rosa and Yamins Daniel. Explanatory models in neuroscience, part 2: Functional intelligibility and the contravariance principle. Cognitive Systems Research, 85:101200, 2024.

[25] Kepple D, Engelken Rainer, and Rajan Kanaka. Curriculum learning as a tool to uncover learning principles in the brain. In International Conference on Learning Representations, 2022.

[26] Garcia Samuel Liebana, Laffere Aeron, Toschi Chiara, Schilling Louisa, Podlaski Jacek, Fritsche Matthias, Zatka-Haas Peter, Li Yulong, Bogacz Rafal, Saxe Andrew, and Lak Armin. Striatal dopamine reflects individual long-term learning trajectories, December 2023. URL http://biorxiv.org/lookup/doi/10.1101/2023.12.14.571653.

[27] Fascianelli Valeria, Battista Aldo, Stefanini Fabio, Tsujimoto Satoshi, Genovesio Aldo, and Fusi Stefano. Neural representational geometries reflect behavioral differences in monkeys and recurrent neural networks. Nature Communications, 15(1):6479, August 2024. ISSN 2041-1723. doi: 10.1038/s41467-024-50503-w. URL https://www.nature.com/articles/s41467-024-50503-w.

[28] Pan-Vazquez A, Sanchez Araujo Y, McMannon B, Louka M, Bandi A, Haetzel L, International Brain Laboratory, Pillow JW, Daw ND, and Witten IB. Pre-existing visual responses in a projection-defined dopamine population explain individual learning trajectories, 2024. URL https://europepmc.org/article/PPR/PPR811803.

[29] Werbos Paul J. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

[30] Hopfield John J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8):2554–2558, 1982.

[31] Jarne Ignacio. Exploring flip flop memories and beyond: training recurrent neural networks with key insights. Frontiers in Systems Neuroscience, 2024.

[32] Funahashi Shintaro, Bruce Charles J., and Goldman-Rakic Patricia S. Mnemonic coding of visual space in the monkey’s dorsolateral prefrontal cortex. Journal of Neurophysiology, 61(2):331–349, 1989.

[33] Goldman-Rakic Patricia S. Cellular basis of working memory. Neuron, 14(3):477–485, 1995.

[34] Marder Eve and Bucher Dirk. Central pattern generators and the control of rhythmic movement. Current Biology, 11(23):R986–R996, 2001.

[35] Churchland Mark M., Cunningham John P., Kaufman Matthew T., Foster Justin D., Nuyujukian Paul, Ryu Stephen I., and Shenoy Krishna V. Neural population dynamics during reaching. Nature, 487(7405):51–56, 2012.

[36] McNaughton Bruce L., Battaglia Francesco P., Jensen Ole, Moser Edvard I., and Moser May-Britt. Path integration and the neural basis of the ‘cognitive map’. Nature Reviews Neuroscience, 7:663–678, 2006.

[37] Yang Guangyu R., Joglekar Madhura R., Song H. Francis, Newsome William T., and Wang Xiao-Jing. Task representations in neural networks trained to perform many cognitive tasks. Nature Neuroscience, 2019.

[38] Khona Mihir, Chandra Shreyas, Ma James J., and Fiete Ila R. Winning the lottery with neural connectivity constraints: Faster learning across cognitive tasks with spatially constrained sparse RNNs. Neural Computation, 35(11), 2023. doi: 10.1162/neco_a_01613.

[39] Lorenz Edward N. Predictability: A problem partly solved. In ECMWF Seminar on Predictability, 4–8 September 1995, Reading, U.K., 1996. European Centre for Medium-Range Weather Forecasts.

[40] Ostrow Mitchell, Eisen Adam, Kozachkov Leo, and Fiete Ila. Beyond Geometry: Comparing the Temporal Structure of Computation in Neural Circuits with Dynamical Similarity Analysis, October 2023. URL http://arxiv.org/abs/2306.10168. arXiv:2306.10168 [cs, q-bio].

[41] Raghu Maithra et al. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics. In NeurIPS, 2017.

[42] Kriegeskorte Nikolaus, Mur Marieke, and Bandettini Peter. Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4, 2008. doi: 10.3389/neuro.06.004.2008.

[43] Williams Alex H., Kunz Erin, Kornblith Simon, and Linderman Scott W. Generalized Shape Metrics on Neural Representations, January 2022. URL http://arxiv.org/abs/2110.14739. arXiv:2110.14739 [cs, stat].

[44] Schrimpf Martin, Kubilius Jonas, Hong Ha, Majaj Najib J., Rajalingham Rishi, Issa Elias B., Kar Kohitij, Bashivan Pouya, Prescott-Roy James, Schmidt Kailyn, Yamins Daniel L. K., and DiCarlo James J. Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv, 2020. doi: 10.1101/407007. URL https://www.biorxiv.org/content/10.1101/407007v2.

[45] Guilhot Quentin, Wójcik Michał J, Achterberg Jascha, and Costa Rui Ponte. Dynamical similarity analysis uniquely captures how computations develop in RNNs, 2025. URL https://openreview.net/forum?id=pXPIQsV1St.

[46] Goodfellow Ian J., Vinyals Oriol, and Saxe Andrew M. Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations (ICLR), 2015. arXiv:1412.6544.

[47] Frankle Jonathan, Dziugaite Gintare Karolina, Roy Daniel, and Carbin Michael. Linear mode connectivity and the lottery ticket hypothesis. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 3259–3269. PMLR, 2020.

[48] Lucas James R., Bae Juhan, Zhang Michael R., Fort Stanislav, Zemel Richard, and Grosse Roger B. On monotonic linear interpolation of neural network parameters. In Proceedings of the 38th International Conference on Machine Learning (ICML), pages 7168–7179. PMLR, 2021.

[49] Fort Stanislav and Jastrzebski Stanislaw. Large scale structure of neural network loss landscapes. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

[50] Achille Alessandro, Paolini Giovanni, and Soatto Stefano. Where is the information in a deep neural network? CoRR, abs/1905.12213, 2019.

[51] Qu Xingyu and Horvath Samuel. Rethink model re-basin and the linear mode connectivity. arXiv preprint arXiv:2402.05966, 2024.

[52] Ly Andrew and Gong Pulin. Optimization on multifractal loss landscapes explains a diverse range of geometrical and dynamical properties of deep learning. Nature Communications, 16(3252), 2025. doi: 10.1038/s41467-025-58532-9.

[53] Li Chunyuan, Farkhoor Heerad, Liu Rosanne, and Yosinski Jason. Measuring the intrinsic dimension of objective landscapes. CoRR, abs/1804.08838, 2018.

[54] Chizat Léon, Oyallon Edouard, and Bach Francis. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32:2938–2950, 2019.

[55] Woodworth Bryan, Gunasekar Suriya, Lee Jason D., Srebro Nathan, Bhojanapalli Srinadh, Khanna Rina, Chatterji Aaron, and Jaggi Martin. Kernel and rich regimes in deep learning. Journal of Machine Learning Research, 21(243):1–48, 2020.

[56] Geiger Mario, Spigler Stefano, Jacot Arthur, and Wyart Matthieu. Disentangling feature and lazy training in deep neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2020(11):113301, 2020. doi: 10.1088/1742-5468/abc4de.

[57] Bordelon Blake and Pehlevan Cengiz. Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks, October 2022. URL http://arxiv.org/abs/2205.09653. arXiv:2205.09653 [stat].

[58] George Thomas, Lajoie Guillaume, and Baratin Aristide. Lazy vs hasty: Linearization in deep networks impacts learning schedule based on example difficulty. arXiv preprint arXiv:2209.09658, 2022. URL https://arxiv.org/abs/2209.09658.

[59] Lee Jaehoon, Bahri Yuval, Novak Roman, Schoenholz Samuel S., Pennington Jeffrey, and Sohl-Dickstein Jascha. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, volume 32, pages 8572–8583, 2019.

[60] Bordelon Blake and Pehlevan Cengiz. Dynamics of finite width Kernel and prediction fluctuations in mean field neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2024(10):104021, October 2024. ISSN 1742-5468. doi: 10.1088/1742-5468/ad642b. URL https://iopscience.iop.org/article/10.1088/1742-5468/ad642b.

[61] Kumar Tanishq, Bordelon Blake, Gershman Samuel J., and Pehlevan Cengiz. Grokking as the transition from lazy to rich training dynamics. arXiv preprint arXiv:2310.06110, 2023. URL https://arxiv.org/abs/2310.06110.

[62] Liu Yuhan Helena, Baratin Aristide, Cornford Jonathan, Mihalas Stefan, Shea-Brown Eric, and Lajoie Guillaume. How connectivity structure shapes rich and lazy learning in neural circuits. arXiv preprint arXiv:2310.08513, 2023. doi: 10.48550/arXiv.2310.08513. URL https://arxiv.org/abs/2310.08513.

[63] Bansal Yamini, Nakkiran Preetum, and Barak Boaz. Revisiting model stitching to compare neural representations. arXiv preprint arXiv:2106.07682, 2021. URL https://arxiv.org/abs/2106.07682.

[64] Duan Sunny, Matthey Loïc, Saraiva André, Watters Nicholas, Burgess Christopher P., Lerchner Alexander, and Higgins Irina. Unsupervised model selection for variational disentangled representation learning. arXiv preprint arXiv:1905.12614, 2020. URL https://arxiv.org/abs/1905.12614.

[65] Huh Minyoung, Cheung Brian, Wang Tongzhou, and Isola Phillip. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024. URL https://arxiv.org/abs/2405.07987.

[66] Li Yixuan, Yosinski Jason, Clune Jeff, Lipson Hod, and Hopcroft John. Convergent learning: Do different neural networks learn the same representations? arXiv preprint arXiv:1511.07543, 2016. URL https://arxiv.org/abs/1511.07543.

[67] Kawaguchi Kenji. Deep learning without poor local minima. In Advances in Neural Information Processing Systems 29, pages 586–594, 2016.

[68] Nguyen Quynh and Hein Matthias. The loss surface of deep and wide neural networks. CoRR, abs/1704.08045, 2017.

[69] Du Simon S., Lee Jason D., Li Haochuan, Wang Liwei, and Zhai Xiyu. Gradient descent finds global minima of deep neural networks. CoRR, abs/1811.03804, 2018.

[70] Allen-Zhu Zeyuan, Li Yuanzhi, and Song Zhao. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, pages 242–252, 2019.

[71] Zou Difan, Cao Yuan, Zhou Dongruo, and Gu Quanquan. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. CoRR, abs/1811.08888, 2018.

[72] Şimşek Berfin, Ged François, Jacot Arthur, Spadaro Francesco, Hongler Clément, Gerstner Wulfram, and Brea Johanni. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances, 2021. URL https://arxiv.org/abs/2105.12221.

[73] Jacot Arthur, Gabriel Franck, and Hongler Clément. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31, 2018.

[74] Morcos Ari S., Raghu Maithra, and Bengio Samy. Insights on representational similarity in neural networks with canonical correlation. In NeurIPS, 2018.

[75] Kornblith Simon, Norouzi Mohammad, Lee Honglak, and Hinton Geoffrey. Similarity of neural network representations revisited. In ICML, 2019.

[76] Wolf Fred, Engelken Rainer, Puelma-Touzel Maximilian, Weidinger Juan Daniel Flórez, and Neef Andreas. Dynamical models of cortical circuits. Current Opinion in Neurobiology, 25:228–236, 2014. ISSN 09594388. doi: 10.1016/j.conb.2014.01.017. URL https://linkinghub.elsevier.com/retrieve/pii/S0959438814000324.

[77] Klabunde Felix et al. Contrasim – analyzing neural representations based on contrastive learning. In ICLR, 2024.

[78] Nguyen Thao, Raghu Maithra, and Kornblith Simon. Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=KJNcAkY8tY4. Poster.

[79] Beiran Manuel, Dubreuil Alexis, Valente Adrian, Mastrogiuseppe Francesca, and Ostojic Srdjan. Shaping dynamics with multiple populations in low-rank recurrent networks. arXiv preprint arXiv:2007.02062, 2020. doi: 10.48550/arXiv.2007.02062.

[80] Olshausen Bruno A. and Field David J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996. doi: 10.1038/381607a0.

[81] Han Song, Pool Jeff, Tran John, and Dally William J. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015. doi: 10.48550/arXiv.1506.02626.

[82] Glorot Xavier, Bordes Antoine, and Bengio Yoshua. Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15, pages 315–323, 2011.

[83] Li Qianyi, Sorscher Ben, and Sompolinsky Haim. Representations and generalization in artificial and brain neural networks. Proceedings of the National Academy of Sciences, 121(27):e2311805121, 2024. doi: 10.1073/pnas.2311805121.

[84] Cohen Uri, Chung SueYeon, Lee Daniel D., and Sompolinsky Haim. Separability and geometry of object manifolds in deep neural networks. Nature Communications, 11:746, 2020. doi: 10.1038/s41467-020-14578-5.

[85] Sorscher Ben, Ganguli Surya, and Sompolinsky Haim. Neural representational geometry underlies few-shot concept learning. Proceedings of the National Academy of Sciences, 119(43):e2200800119, 2022. doi: 10.1073/pnas.2200800119.

[86] Bouthillier Xavier, Delaunay Pierre, Bronzi Mirko, Trofimov Assya, Nichyporuk Brennan, Szeto Justin, Sepahvand Nazanin Mohammadi, Raff Edward, Madan Kanika, Voleti Vikram, et al. Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems, 3:747–769, 2021.

[87] Morik Katharina. Sloppy modeling. Knowledge representation and organization in machine learning, pages 107–134, 2005.

[88] Yang Rubing, Mao Jialin, and Chaudhari Pratik. Does the data induce capacity control in deep learning? In International Conference on Machine Learning, pages 25166–25197. PMLR, 2022.

[89] Singh Satpreet H, van Breugel Floris, Rao Rajesh PN, and Brunton Bingni W. Emergent behaviour and neural dynamics in artificial agents tracking odour plumes. Nature Machine Intelligence, 5(1):58–70, 2023.

[90] Negrón-Oyarzo Ignacio, Espinosa Nelson, Aguilar-Rivera Marcelo, Fuenzalida Marco, Aboitiz Francisco, and Fuentealba Pablo. Coordinated prefrontal–hippocampal activity and navigation strategy-related prefrontal firing during spatial memory formation. Proceedings of the National Academy of Sciences, 115(27):7123–7128, 2018. doi: 10.1073/pnas.1720117115.

[91] Ashwood Zoe C., Roy Nicholas A., Stone Iris R., International Brain Laboratory, Urai Anne E., Churchland Anne K., Pouget Alexandre, and Pillow Jonathan W. Mice alternate between discrete strategies during perceptual decision-making. Nature Neuroscience, 25:201–212, 2022. doi: 10.1038/s41593-021-01007-z.

[92] Cazettes Fanny, Mazzucato Luca, Murakami Masayoshi, Morais João P., Augusto Elisabete, Renart Alfonso, and Mainen Zachary F. A reservoir of foraging decision variables in the mouse brain. Nature Neuroscience, 26:840–849, 2023. doi: 10.1038/s41593-023-01305-8.

[93] Pagan Marino, Tang Vincent D., Aoi Mikio C., Pillow Jonathan W., Mante Valerio, Sussillo David, and Brody Carlos D. Individual variability of neural computations underlying flexible decisions. Nature, 639:421–429, 2025. doi: 10.1038/s41586-024-08433-6.

[94] Howard BR. Control of variability. ILAR Journal, 43(4):194–201, 2002.

[95] Schmid Peter J. Dynamic mode decomposition and its variants. Annual Review of Fluid Mechanics, 54(1):225–254, 2022.

[96] Schönemann Peter H. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31(1):1–10, Mar 1966. doi: 10.1007/BF02289451.

[97] Ding Chris, Li Tao, and Jordan Michael I. Nonnegative matrix factorization for combinatorial optimization: Spectral clustering, graph matching, and clique finding. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM '08), pages 183–192. IEEE, 2008. doi: 10.1109/ICDM.2008.130.

[98] Meng Fanwang, Richer Michael G., Tehrani Alireza, La Jonathan, Kim Taewon David, Ayers PW, and Heidar-Zadeh Farnaz. Procrustes: A python library to find transformations that maximize the similarity between matrices. Computer Physics Communications, 276:108334, 2022. doi: 10.1016/j.cpc.2022.108334. URL https://www.sciencedirect.com/science/article/pii/S0010465522000522.

[99] Gutenkunst Ryan N, Waterfall Joshua J, Casey Fergal P, Brown Kevin S, Myers Christopher R, and Sethna James P. Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology, 3(10):e189, 2007.

[100] Yang Greg, Hu Edward J., Babuschkin Igor, Sidor Szymon, Liu Xiaodong, Farhi David, Ryder Nick, Pachocki Jakub, Chen Weizhu, and Gao Jianfeng. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022. doi: 10.48550/arXiv.2203.03466. Accepted at NeurIPS 2021.

[101] Yang Greg, Yu Dingli, Zhu Chen, and Hayou Soufiane. Tensor programs vi: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244, 2023. doi: 10.48550/arXiv.2310.02244. Accepted at ICLR 2024.

[102] Miller Kenneth D and Fumarola Francesco. Mathematical equivalence of two common forms of firing rate models of neural networks. Neural Computation, 24(1):25–31, 2012.


