Abstract
Recurrent neural networks (RNNs) are powerful dynamical models, widely used in machine learning (ML) and neuroscience. Prior theoretical work has focused on RNNs with additive interactions. However, gating, i.e., multiplicative interaction, is ubiquitous in real neurons and is also the central feature of the best-performing RNNs in ML. Here, we show that gating offers flexible control of two salient features of the collective dynamics: (i) timescales and (ii) dimensionality. The gate controlling timescales leads to a novel marginally stable state, where the network functions as a flexible integrator. Unlike previous approaches, gating permits this important function without parameter fine-tuning or special symmetries. Gates also provide a flexible, context-dependent mechanism to reset the memory trace, thus complementing the memory function. The gate modulating the dimensionality can induce a novel, discontinuous chaotic transition, where inputs push a stable system to strong chaotic activity, in contrast to the typically stabilizing effect of inputs. At this transition, unlike additive RNNs, the proliferation of critical points (topological complexity) is decoupled from the appearance of chaotic dynamics (dynamical complexity). The rich dynamics are summarized in phase diagrams, thus providing ML practitioners with a map for principled parameter-initialization choices.
Subject Areas: Interdisciplinary Physics, Nonlinear Dynamics, Statistical Physics
I. INTRODUCTION
Recurrent neural networks (RNNs) are powerful dynamical systems that can represent a rich repertoire of trajectories and are popular models in neuroscience and machine learning. In modern machine learning, RNNs are used to learn complex dynamics from data with rich sequential or temporal structure such as speech [1,2], turbulent flows [3–5], or text sequences [6]. RNNs are also influential in neuroscience as models to study the collective behavior of a large network of neurons [7] (and references therein). For instance, they have been used to explain the dynamics and temporally irregular fluctuations observed in cortical networks [8,9] and how the motor-cortex network generates movement sequences [10,11].
Classical RNN models typically involve units that interact with each other in an additive fashion—i.e., each unit integrates a weighted sum of the output of the rest of the network. However, researchers in machine learning have empirically found that RNNs with gating—a form of multiplicative interaction—can be trained to perform significantly more complex tasks than classical RNNs [6,12]. Gating interactions are also ubiquitous in real neurons due to mechanisms such as shunting inhibition [13]. Moreover, when single-neuron models are endowed with more realistic conductance dynamics, the effective interactions at the network level have gating effects, which confer robustness to time-warped inputs [14]. Thus, RNNs with gating interactions not only have superior information processing capabilities, but they also embody a prominent feature found in real neurons.
Prior theoretical work on understanding the dynamics and functional capabilities of RNNs has mostly focused on RNNs with additive interactions. The original work by Sompolinsky, Crisanti, and Sommers [15] identifies a phase transition in the autonomous dynamics of randomly connected RNNs from stability to chaos. Subsequent work extends this analysis to cases where the random connectivity additionally has correlations [16], a low-rank structured component [17,18], strong self-interaction [19], and heterogeneous variance across blocks [20]. The role of sparse connectivity and the single-neuron nonlinearity is studied in Ref. [9]. The effect of a Gaussian noise input is analyzed in Ref. [21].
In this work, we study the consequences of gating interactions on the dynamics of RNNs. We introduce a gated RNN model that naturally extends a classical RNN by augmenting it with two kinds of gating interactions: (i) an update gate that acts like an adaptive time constant and (ii) an output gate which modulates the output of a neuron. The choice of these forms for the gates is motivated by biophysical considerations (e.g., Refs. [14,22]) and retains the most functionally important aspects of the gated RNNs in machine learning. Our gated RNN reduces to the classical RNN [15,23] when the gates are open and is closely related to the state-of-the-art gated RNNs in machine learning when the dynamics are discretized [24]. We further elaborate on this connection in Sec. VIII.
We develop a theory for the gated RNN based on non-Hermitian random matrix techniques [25,26] and the Martin–Siggia–Rose–De Dominicis-Janssen (MSRDJ) formalism [21,27–32] and use the theory to map out, in a phase diagram, the rich, functionally significant dynamical phenomena produced by gating.
We show that the update gate produces slow modes and a marginally stable critical state. Marginally stable systems are of special interest in the context of biological information processing (e.g., Ref. [33]). Moreover, the network in this marginally stable state can function as a robust integrator—a function that is critical for memory capabilities in biological systems [34–37] but has been hard to achieve without parameter fine-tuning and handcrafted symmetries [38]. Gating permits the network to serve this function without any symmetries or fine-tuning. For a detailed discussion of these issues, we refer the reader to Ref. [39] (pp. 329–350) and Refs. [38,40]. Integratorlike dynamics are also empirically observed in gated machine learning (ML) RNNs successfully trained on complex sequential tasks [41]; our theory shows how gates allow for this robustly.
The output gate allows fine control over the dimensionality of the network activity; control of the dimensionality can be useful during learning tasks [42]. In certain regimes, this gate can mediate an input-driven chaotic transition, where static inputs can push a stable system abruptly to a chaotic state. This behavior with gating is in stark contrast to the typically stabilizing effect of inputs in high-dimensional systems [21,43,44]. The output gate also leads to a novel, discontinuous chaotic transition, where the proliferation of critical points (a static property) is decoupled from the appearance of chaotic transients (a dynamical property); this is in contrast to the tight link between the two properties in additive RNNs as shown by Wainrib and Touboul [45]. This transition is also characterized by a nontrivial state where a stable fixed point coexists with long chaotic transients. Gates also provide a flexible, context-dependent way to reset the state, thus providing a way to selectively erase the memory trace of past inputs.
We summarize these functionally significant phenomena in phase diagrams, which are also practically useful for ML practitioners—indeed, the choice of parameter initialization is known to be one of the most important factors deciding the success of training [46], with best outcomes occurring near critical lines [10,47–49]. Phase diagrams, thus, allow a principled and exhaustive exploration of dynamically distinct initializations.
II. A RECURRENT NEURAL NETWORK MODEL TO STUDY GATING
We study an extension of a classical RNN [15,23] by augmenting it with multiplicative gating interactions. Specifically, we consider two gates: (i) an update (or z) gate which controls the rate of integration and (ii) an output (or r) gate which modulates the strength of the output. The equations describing the gated RNN are given by
$$\frac{dh_i}{dt} = \sigma_z(z_i)\Big[-h_i + \sum_{j=1}^{N} J^h_{ij}\,\sigma_r(r_j)\,\phi(h_j) + I^h_i\Big], \qquad (1)$$
where hi represents the internal state of the ith unit and σ(·)(x) = [1 + exp(−α(·)x + β(·))]−1 are sigmoidal gating functions. The recurrent input to a neuron is Σj Jhij σr(rj)ϕ(hj), where Jhij are the coupling strengths between the units and ϕ(x) = tanh(ghx + βh) is the activation function. ϕ and σz,r are parametrized by gain parameters (gh, αz,r) and biases (βh,z,r), which constitute the parameters of the gated RNN. Finally, Ih represents external input to the network. The gating variables zi(t) and ri(t) evolve according to dynamics driven by the output ϕ[h(t)] of the network:
$$\tau_x\,\frac{dx_i}{dt} = -x_i + \sum_{j=1}^{N} J^x_{ij}\,\phi(h_j) + I^x_i, \qquad (2)$$
where x ∈ {z, r}. Note that the coupling matrices Jz,r for z, r are distinct from Jh. We also allow different inputs Ir and Iz to be fed to the gates. For instance, they might be zero, or they might be equal to Ih up to a scaling factor.
The value of σz(zi) can be viewed as a dynamical time constant for the ith unit, while the output gate σr(ri) modulates the output strength of unit i. In the presence of external input, the r gate can control the relative strengths of the internal (recurrent) activity and the external input Ih. In the limit σz, σr → 1, we recover the dynamics of the classical RNN.
We choose the coupling weights from a Gaussian distribution with variance scaled such that the input to each unit remains O(1). Specifically, Jxij ~ 𝒩(0, 1/N), drawn independently for x ∈ {h, z, r}. This choice of couplings is a popular initialization scheme for RNNs in machine learning [6,46] and also in models of cortical neural circuits [15,20]. If the gating variables are purely internal, then Jz,r is diagonal; however, we do not consider this case below. In the rest of the paper, we analyze the various dynamical regimes the gated RNN exhibits and their functional significance.
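A minimal simulation sketch of Eqs. (1) and (2) helps fix the setup (our construction, not the authors' code: simple Euler integration, zero biases and inputs, τz = τr = 1, and illustrative parameter values):

    import numpy as np

    def simulate_gated_rnn(N=1000, g_h=3.0, a_z=2.0, a_r=2.0,
                           T=200.0, dt=0.05, seed=0):
        """Euler integration of the gated RNN, Eqs. (1)-(2)."""
        rng = np.random.default_rng(seed)
        Jh, Jz, Jr = (rng.normal(0.0, 1.0 / np.sqrt(N), (N, N)) for _ in range(3))
        sig = lambda a, x: 1.0 / (1.0 + np.exp(-a * x))    # gating function, beta = 0
        h, z, r = (0.1 * rng.standard_normal(N) for _ in range(3))
        traj = []
        for _ in range(int(T / dt)):
            ph = np.tanh(g_h * h)                          # phi(h), beta_h = 0
            h = h + dt * sig(a_z, z) * (-h + Jh @ (sig(a_r, r) * ph))
            z = z + dt * (-z + Jz @ ph)
            r = r + dt * (-r + Jr @ ph)
            traj.append(h.copy())
        return np.array(traj)

The update gate σz(z) multiplies the whole right-hand side of the h equation, acting as a per-unit integration rate, while σr(r) multiplies each unit's output inside the recurrent sum.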
III. HOW THE GATES SHAPE THE LINEARIZED DYNAMICS
We first study the linearized dynamics of the gated RNN through the lens of the instantaneous Jacobian and describe how these dynamics are shaped by the gates. The instantaneous Jacobian describes the linearized dynamics about an operating point, and the eigenvalues of the Jacobian inform us about the timescales of growth and decay of perturbations and the local stability of the dynamics. As we show below, the spectral density of the Jacobian depends on equal-time correlation functions, which are the order parameters in the mean-field picture of the dynamics developed in Appendix C. We study how the gates shape the support and the density of Jacobian eigenvalues in the steady state, through their influence on the correlation functions.
The linearized dynamics in the tangent space at an operating point x = (h, z, r) is given by
$$\frac{d}{dt}\,\delta x = \mathcal{D}\,\delta x, \qquad (3)$$
where 𝒟 is the 3N × 3N-dimensional instantaneous Jacobian of the full network equations. Linearization of Eqs. (1) and (2) yields
$$\mathcal{D} = \begin{pmatrix} [\sigma_z]\big(J^h[\sigma_r\phi'] - \mathbb{1}\big) & [\sigma_z'\,\dot h/\sigma_z] & [\sigma_z]\,J^h[\sigma_r'\phi] \\ \tau_z^{-1}\,J^z[\phi'] & -\tau_z^{-1}\,\mathbb{1} & 0 \\ \tau_r^{-1}\,J^r[\phi'] & 0 & -\tau_r^{-1}\,\mathbb{1} \end{pmatrix}, \qquad (4)$$
where [x] denotes a diagonal matrix with the diagonal entries given by the vector x. The term [σz′ ḣ/σz] arises when we linearize about a time-varying state and is zero for fixed points. We introduce the additional shorthand ϕ′(t) = ϕ′(h(t)) and σx′(t) = σx′(x(t)) for x ∈ {z, r}.
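For concreteness, the instantaneous Jacobian of Eq. (4) can be assembled directly (a sketch under the same assumptions as the simulation above: zero biases and inputs, τz = τr = 1); its eigenvalues, e.g., from np.linalg.eigvals, are what Fig. 1 compares against the theory:

    import numpy as np

    def jacobian(h, z, r, Jh, Jz, Jr, g_h, a_z, a_r):
        """3N x 3N instantaneous Jacobian of Eqs. (1)-(2), cf. Eq. (4)."""
        N = h.size
        sig = lambda a, x: 1.0 / (1.0 + np.exp(-a * x))
        dsig = lambda a, x: a * sig(a, x) * (1.0 - sig(a, x))
        phi, dphi = np.tanh(g_h * h), g_h / np.cosh(g_h * h) ** 2
        sz, sr = sig(a_z, z), sig(a_r, r)
        hdot_over_sz = -h + Jh @ (sr * phi)     # = \dot h / sigma_z; zero at fixed points
        D = np.zeros((3 * N, 3 * N))
        D[:N, :N] = sz[:, None] * (Jh * (sr * dphi)[None, :]) - np.diag(sz)
        D[:N, N:2 * N] = np.diag(dsig(a_z, z) * hdot_over_sz)
        D[:N, 2 * N:] = sz[:, None] * (Jh * (dsig(a_r, r) * phi)[None, :])
        D[N:2 * N, :N] = Jz * dphi[None, :]
        D[N:2 * N, N:2 * N] = -np.eye(N)
        D[2 * N:, :N] = Jr * dphi[None, :]
        D[2 * N:, 2 * N:] = -np.eye(N)
        return D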
The Jacobian is a block-structured matrix involving random elements (Jz,h,r) and functions of various state variables. We need additional tools from non-Hermitian random matrix theory (RMT) [26] and dynamical mean-field theory (DMFT) [15] to analyze the spectrum of the Jacobian 𝒟. We provide a detailed, self-contained derivation of the calculations in Appendix C (DMFT) and Appendix A (RMT). Here, we state only the main results derived from these formalisms.
One of the main results is an analytical expression for the spectral curve, which describes the boundary of the Jacobian spectrum, in terms of the moments of the state variables. The most general expression for the spectral curve [Appendix A, Eq. (A34)] involves empirical averages over the 3N-dimensional state variables. However, for large N, we can appeal to a concentration of measure argument to replace these discrete sums with averages over the steady-state distribution from the DMFT (cf. Appendix C)—i.e., we can replace empirical averages of any function of the state variables (1/N) Σi F(hi, zi, ri) with 〈F[h(t), z(t), r(t)]〉, where the brackets indicate average over the steady-state distribution. The DMFT + RMT prediction for the spectral curve for a generic steady-state point is given in Appendix A, Eq. (A35). Strictly speaking, the analysis of the DMFT around a generic time-dependent steady state is complicated by the fact that the distribution for h is not Gaussian (while r and z are Gaussian). For fixed points, however, the distributions of h, z, and r are all Gaussian, and the expression for the spectral curve simplifies considerably. It is given by the set of λ ∈ ℂ which satisfy
$$\left\langle \frac{\sigma_z(z)^2\,\sigma_r(r)^2\,\phi'(h)^2}{|\lambda + \sigma_z(z)|^2} \right\rangle + \frac{\big\langle \sigma_r'(r)^2\,\phi(h)^2 \big\rangle}{|1 + \lambda\tau_r|^2}\,\left\langle \frac{\sigma_z(z)^2\,\phi'(h)^2}{|\lambda + \sigma_z(z)|^2} \right\rangle = 1. \qquad (5)$$
Here, the averages are taken over the Gaussian fixed-point distributions (h, z, r) ~ 𝒩(0, Δh,z,r) which follow from the MFT [Eq. (C26)]. For example, 〈ϕ′(h)2〉 = ∫𝒟x ϕ′(√Δh x)2, where 𝒟x denotes the standard Gaussian measure.
We make two comments on the Jacobian of a time-varying state: (i) One might wonder if any useful information can be gleaned by studying the Jacobian at a time-varying state where the Hartman-Grobman theorem is not valid. Indeed, as we see below, the limiting form of the Jacobian in steady state crucially informs us about the suppression of unstable directions and the emergence of slow dynamics due to pinching and marginal stability in certain parameter regimes (also see Ref. [50]). In other words, the instantaneous Jacobian charts the approach to marginal stability and provides a quantitative justification for the approximate integrator functionality exhibited in Sec. IV B. (ii) Interestingly, the spectral curve calculated using the MFT [Eq. (5)] for a time-varying steady state not deep in the chaotic regime is a very good approximation for the true spectral curve (see Fig. 8 in Appendix A).
Figures 1(a)–1(d) show that the RMT prediction of the spectral support (dark outline) agrees well with the numerically calculated spectrum (red dots) in different dynamical regimes. As a consequence of Eq. (5), we get a condition for the stability of the zero fixed point. The leading edge of the spectral curve for the zero fixed point (FP) crosses the origin when σr(0)ϕ′(0) = 1. So, in the absence of biases, gh > 2 makes the zero FP unstable. More generally, the leading edge of the spectrum crossing the origin gives us the condition for the FP to become unstable:
$$\big\langle \sigma_r(r)^2\,\phi'(h)^2 \big\rangle + \big\langle \sigma_r'(r)^2\,\phi(h)^2 \big\rangle\,\big\langle \phi'(h)^2 \big\rangle > 1. \qquad (6)$$
We see later on that the time-varying state corresponding to this regime is chaotic. We now proceed to analyze how the two gates shape the Jacobian spectrum via the equation for the spectral curve.
A. Update gate facilitates slow modes and output gate causes instability
To understand how each gate shapes the local dynamics, we study their effect on the density of Jacobian eigenvalues and the shape of the spectral support curve—the eigenvalues tell us about the rate of growth or decay of small perturbations and, thus, timescales in the local dynamics, and the spectral curve informs us about stability. For ease of exposition, we consider the case without biases in the main text (βr,z,h = 0); we discuss the role of biases in Appendix H.
Figure 1 shows how the gain parameters of the update and output gates—αz and αr, respectively—shape the Jacobian spectrum. In Figs. 1(a)–1(d), we see that αz has two salient effects on the spectrum: Increasing αz leads to (i) an accumulation of eigenvalues near zero and (ii) a pinching of the spectral curve for certain values of gh wherein the intercept on the imaginary axis gets smaller [Fig. 1(f); also see Sec. IVA]. In Figs. 1(a)–1(d), we also see that increasing the value of αr leads to an increase in the spectral radius, thus pushing the leading edge (max Reλi) to the right and thereby increasing the local dimensionality of the unstable manifold. This behavior of the linearized dynamics is also reflected in the nonlinear dynamics, where, as we show in Sec. V, αr has the effect of controlling the dimensionality of full phase-space dynamics.
The accumulation of eigenvalues near zero with increasing αz suggests the emergence of a wide spectrum of timescales in the local dynamics. To understand this accumulation quantitatively, it is helpful to consider the scenario where αz is large and we replace the tanh activation functions with a piecewise linear approximation. In this limit, the density of eigenvalues within a radius δ of the origin is well approximated by the following functional form (details in Appendix B):
$$\mu\big(\{\lambda : |\lambda| < \delta\}\big) \simeq c_0 + \frac{c_1}{\alpha_z}, \qquad (7)$$
where c0 and c1 are constants that, in general, depend on αr, δ, and gh. Figure 1(e) shows this scaling for a specific value of δ: The dashed line shows the predicted curve, and the circles indicate the actual eigenvalue density calculated using the full Jacobian. In the limit of αz → ∞, we get an extensive number of eigenvalues at zero, and the eigenvalue density converges to (see Appendix B)

$$\mu(\lambda) = (1 - f_z)\,\delta^{(2)}(\lambda) + f_z(1 - f_h)\,\delta^{(2)}(\lambda + 1) + \frac{f_z f_h}{\pi\rho^2}\,\mathbb{1}_{\{|\lambda + 1| \le \rho\}},$$

where fz = 〈σz(z)〉 is the fraction of update gates which are nonzero, fh is the fraction of unsaturated activation functions ϕ(h), and ρ is the radius of the bulk disk centered at λ = −1 (given below in Sec. IV A). For other choices of saturating nonlinearities, the extensive number of eigenvalues at zero remains; however, the expressions are more complicated. Analogous phenomena are observed for discrete-time gated RNNs in Ref. [51], using a similar combination of analytical and numerical techniques [52].
In Sec. VA, we show that the slow modes, as seen from linearization, persist asymptotically (i.e., in the nonlinear regime). This can be seen from the Lyapunov spectrum in Fig. 3(a), which for large αz exhibits an analogous accumulation of Lyapunov exponents near zero.
In the next section, we study the profound functional consequences of the combination of spectral pinching and accumulation of eigenvalues near zero.
IV. MARGINAL STABILITY AND ITS CONSEQUENCES
As the update gate becomes more switchlike (higher αz), we see an accumulation of slow modes and a pinching of the spectral curve which drastically suppresses the unstable directions. In the limit αz → ∞, this can make previously unstable points marginally stable by pinning the leading edge of the spectral curve exactly at zero. Marginally stable systems are of significant interest because of the potential benefits in information processing—for instance, they can generate long timescales in their collective modes [33,39]. Moreover, achieving marginal stability often requires fine-tuning parameters close to a bifurcation point. As we show, gating allows us to achieve a marginally stable critical state over a wide range of parameters; this has typically been highly nontrivial to achieve (e.g., Ref. [39], pp. 329–350, and Ref. [33]). We first investigate the conditions under which marginal stability arises, and then we touch on one of its important functional consequences: the appearance of “line attractors” which allow the system to be used as a robust integrator.
A. Condition for marginal stability
Marginal stability is a consequence of pinching of the spectral curve with increasing αz, wherein the (positive) leading edge of the spectrum and the intercept of the spectral curve on the imaginary axis both shrink with αz [e.g., Fig. 1(f) and compare Figs. 1(a) and 1(c)]. However, we see in Fig. 1(f) (via the intercept) that pinching does not happen if gh is sufficiently large (even as αz → ∞). Here, we provide the conditions when pinching can occur and, thus, marginal stability can emerge. For simplicity, let us consider the case where τr = 1 and there are no biases.
Marginal stability strictly exists only for αz = ∞. We first examine the conditions under which the system can become marginally stable in this limit, and then we explain the route to marginal stability for large but finite αz, i.e., how a time-varying state ends up as a marginally stable fixed point. For αz = ∞, the spectral density has an extensive number N[1 − 〈σz(z)〉] of zero eigenvalues, and the remaining eigenvalues are distributed in a disk centered at λ = −1 with radius ρ. If ρ < 1, then the spectral density has two topologically disconnected configurations (the disk and the zero modes) and the system is marginally stable. If ρ > 1, the zero modes get absorbed in the interior of the disk and the system is unstable with fast, chaotic dynamics. The radius ρ is given by ρ = [fz(Qϕ + Qr)]1/2, where Qϕ = 〈σr(r)2ϕ′(h)2〉 and Qr = 〈σr′(r)2ϕ(h)2〉〈ϕ′(h)2〉, with fz = 〈σz(z)〉. This follows from Eq. (5) by evaluating the z-expectation value assuming σz is a binary variable. Thus, the system is marginally stable in the limit αz = ∞ as long as
$$Q_\phi + Q_r \;=\; \big\langle \sigma_r^2\,\phi'^2 \big\rangle + \big\langle \sigma_r'^2\,\phi^2 \big\rangle\,\big\langle \phi'^2 \big\rangle \;<\; \big\langle \sigma_z(z) \big\rangle^{-1}. \qquad (8)$$
The crucial difference between this expression and Eq. (6) is that the rhs now has a factor of 〈σz〉−1 which can be greater than unity, thus pushing the transition to chaos further out along the gh and αr directions, as depicted in the phase diagram (Fig. 7). For concreteness, we report here how the transition changes at αr = 0. In this setting, the transition to chaos moves from gh = 2 to gh ⪅ 6.2, and the system is marginally stable for 2 < gh ⪅ 6.2.
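The numbers quoted here are easy to reproduce (a sketch under our reading of the fixed-point mean-field theory, not the authors' code: at αr = 0 and zero biases, σr = 1/2 and 〈σz〉 = 1/2, and Δh solves Δh = 〈σr²〉〈ϕ²〉 self-consistently):

    import numpy as np

    def gauss_avg(f, var, n=201):
        """<f(x)> for x ~ N(0, var), via probabilists' Gauss-Hermite quadrature."""
        x, w = np.polynomial.hermite_e.hermegauss(n)
        return np.sum(w * f(np.sqrt(var) * x)) / np.sqrt(2.0 * np.pi)

    def rho_squared(g_h, iters=500):
        """Disk radius squared, rho^2 = <sigma_z><sigma_r^2 phi'^2>, at the
        self-consistent fixed point; marginal stability requires rho < 1."""
        phi2 = lambda x: np.tanh(g_h * x) ** 2
        dphi2 = lambda x: (g_h / np.cosh(g_h * x) ** 2) ** 2
        Delta = 0.2
        for _ in range(iters):
            Delta = 0.25 * gauss_avg(phi2, Delta)    # Delta_h = <sigma_r^2><phi^2>
        return 0.5 * 0.25 * gauss_avg(dphi2, Delta)

    for g in (5.8, 6.0, 6.2, 6.4):
        print(g, rho_squared(g))   # rho^2 crosses 1 near g_h ~ 6.2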
Having identified the region in the phase diagram that can be made marginally stable for αz = ∞, we can now discuss the route to marginal stability for large but finite αz. In other words, how does an unstable chaotic state become marginally stable with increasing αz? Since the marginally stable region is characterized by a disconnected spectral density, evidently increasing αz must lead to singular behavior in the spectral curve. This takes the form of a pinching at the origin. We show that, for values of gh supporting marginal stability, the leading edge λe of the spectrum for the time-varying state gets pinched exponentially fast with αz as λe ~ e−cαz, with c a positive constant (see Appendix B). This accounts for the fact that, already for αz = 15, we observe the pinching in Fig. 1(c). In contrast, the parameters in Fig. 1(d) lie outside the marginally stable region, and, thus, there is no pinching, since the zero modes are asymptotically (in αz) buried in the bulk of the spectrum.
In summary, as αz → ∞ the Jacobian spectrum undergoes a topological transition from a single simply connected domain to two domains, both containing an extensive number of eigenvalues. A finite fraction of eigenvalues end up sitting exactly at zero, while the rest occupy a finite circular region. If the leading edge of the circular region crosses zero in this limit, then the state remains unstable; otherwise, the state becomes marginally stable. The latter case is achieved through a gradual pinching of the spectrum near zero; there is no pinching in the former case.
We emphasize that marginal stability requires more than just an accumulation of eigenvalues near zero. Indeed, this happens even when gh is outside the range supporting marginal stability as αz → ∞, but there is no pinching and the system remains unstable [e.g., see Fig. 1(d)]. We return to this when we describe the phase diagram for the gated RNN (Sec. VII). There, we see that the marginally stable region occupies a macroscopic volume in the parameter space adjoining the critical lines on one side.
B. Functional consequences of marginal stability
The marginally stable critical state produced by gating can subserve the function of a robust integrator. This integratorlike function is crucial for a variety of computational functions such as motor control [34–36], decision making [37], and auditory processing [53]. However, achieving this function has typically required fine-tuning or special handcrafted architectures [38], but gating permits the integrator function over a range of parameters and without any specific symmetries in Jh,z,r. Specifically, for large αz, any perturbation in the span of the eigenvectors corresponding to the eigenvalues with a magnitude close to zero is integrated by the network, and, once the input perturbation ceases, the memory trace of the input is retained for a duration much longer than the intrinsic time constant of the neurons; perturbations along other directions, however, relax with a spectrum of timescales dictated by the inverse of (the real part of) their eigenvalues. Thus, the manifold of slow directions forms an approximate continuous attractor on which input can effortlessly move the state vector around. These approximate continuous attractor dynamics are illustrated in Fig. 2. At time t = 0, an input Ih (with Ir = Iz = 0) is applied till t = 10 (between dashed vertical lines) along an eigenvector of the Jacobian with an eigenvalue close to zero. Inputs along this slow manifold with varying strengths (different shades of red) are integrated by the network as evidenced by the excess projection of the network activity on the left eigenvector uλ corresponding to the slow mode; on the other hand, inputs not aligned with the slow modes decay away quickly (dashed black line). Recall that the intrinsic time constant of the neurons here is set to one unit. The exponentially fast (in αz) pinching of the spectral curve (discussed above in Sec. IVA) suggests this slow-manifold behavior should also hold for moderately large αz (as in Fig. 2). Therefore, even though the state is technically unstable, the local structure of the Jacobian is responsible for giving rise to extremely long timescales and allows the network to operate as an approximate integrator within relatively long windows of time, as demonstrated in Fig. 2.
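A sketch of this probe (our construction, reusing the simulation and Jacobian sketches above; the pulse amplitude and window are illustrative):

    import numpy as np

    def integrator_probe(h, z, r, Jh, Jz, Jr, D, g_h, a_z, a_r,
                         amp=0.5, t_on=10.0, T=50.0, dt=0.05):
        """Drive h along a near-zero mode of the Jacobian D and return the
        projection of h(t) onto the matching left eigenvector (cf. Fig. 2)."""
        sig = lambda a, x: 1.0 / (1.0 + np.exp(-a * x))
        N = h.size
        w, V = np.linalg.eig(D)
        k = np.argmin(np.abs(w))                 # eigenvalue closest to zero
        v = V[:N, k].real                        # h-part of the slow right-eigenvector
        u = np.linalg.inv(V)[k, :N].real         # matching left eigenvector
        Ih = amp * v / np.linalg.norm(v)
        proj = []
        for step in range(int(T / dt)):
            ph = np.tanh(g_h * h)
            drive = Ih if step * dt < t_on else 0.0
            h = h + dt * sig(a_z, z) * (-h + Jh @ (sig(a_r, r) * ph) + drive)
            z = z + dt * (-z + Jz @ ph)
            r = r + dt * (-r + Jr @ ph)
            proj.append(u @ h)
        return np.array(proj)   # stays elevated long after the pulse ends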
Of course, after sufficiently long times, the instability causes the state to evolve and the memory is lost. Exactly how long the memory lasts depends on the asymptotic stability of the network, which is revealed by the Lyapunov spectrum, discussed below in Sec. VA.
V. OUTPUT GATE CONTROLS DIMENSIONALITY AND LEADS TO A NOVEL CHAOTIC TRANSITION
Thus far, we have used insights from local dynamics to study the functional consequences of the gates. To study the salient features of the output gate, it is useful to analyze the effect of inputs and the long-time behavior of the network through the lens of Lyapunov spectra. We see that the output gate controls the dimensionality of the dynamics in the phase space; dimensionality is a salient aspect of the dynamics for task function [42]. The output gate also gives rise to a novel discontinuous chaotic transition, near which inputs (even static ones) can abruptly push a stable system into strongly chaotic behavior—contrary to the typically stabilizing effect of inputs. Below, we begin with the Lyapunov analyses of the dynamics and then proceed to study the chaotic transition.
A. Long-time behavior of the network
We study the asymptotic behavior of the network and the nature of the time-varying state through the lens of its Lyapunov spectra. In this section, where we study the effects of αz, our results are numerical except in cases where αz = 0 [e.g., in Fig. 3(d)]. Lyapunov exponents specify how infinitesimal perturbations δx(t) grow or shrink along the trajectories of the dynamics—in particular, if the growth or decay is exponentially fast, then the rate is dictated by the maximal Lyapunov exponent, defined as [54] λmax ≔ limT→∞ T−1 lim‖δx(0)‖→0 ln[‖δx(T)‖/‖δx(0)‖]. More generally, the set of all Lyapunov exponents—the Lyapunov spectrum—yields the rates at which perturbations along different directions shrink or diverge and, thus, provides a fuller characterization of the asymptotic behavior. We first numerically study how the gates shape the full Lyapunov spectrum (details in Appendix D) and then derive an analytical prediction for the maximum Lyapunov exponent using the DMFT (Sec. VA1) [55].
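The full spectrum is computed with the standard reorthonormalization scheme; a generic sketch (ours, with a forward-Euler tangent update; Appendix D describes the procedure actually used):

    import numpy as np

    def lyapunov_spectrum(step, jac, x0, n_exp, dt=0.05, n_steps=40000, seed=0):
        """Leading n_exp Lyapunov exponents by repeated QR decomposition.
        `step(x)` advances the state by dt; `jac(x)` is the instantaneous Jacobian."""
        rng = np.random.default_rng(seed)
        x = x0.copy()
        Q = np.linalg.qr(rng.standard_normal((x.size, n_exp)))[0]
        lams = np.zeros(n_exp)
        for _ in range(n_steps):
            x = step(x)
            Q, R = np.linalg.qr(Q + dt * (jac(x) @ Q))   # evolve and reorthonormalize
            lams += np.log(np.abs(np.diag(R)))
        return np.sort(lams / (n_steps * dt))[::-1]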
Figures 3(a) and 3(b) show how the update (z) and output (r) gates shape the Lyapunov spectrum. We see that, as the update gets more sensitive (larger αz), the Lyapunov spectrum flattens, pushing more exponents closer to zero, generating long timescales. As the output gate becomes more sensitive (larger αr), all Lyapunov exponents increase, thus increasing the rate of growth in unstable directions.
We can estimate the dimensionality of the activity in the chaotic state by calculating an upper bound DA on the dimension according to a conjecture by Kaplan and Yorke [54]. The Kaplan-Yorke upper bound for the attractor dimension DA is given by
$$D_A = k + \frac{\sum_{i=1}^{k} \lambda_i}{|\lambda_{k+1}|}, \qquad (9)$$
where λi are the rank-ordered Lyapunov exponents and k is the largest index such that Σi=1..k λi ≥ 0. We see in Fig. 3(c) that the sensitivity of the output gate (αr) can shape the dimensionality of the dynamics—a more sensitive output gate leads to higher dimensionality. As we see below, this effect of the output gate is different from how the gain gh shapes dimensionality and can lead to a novel chaotic transition. Even more directly, if the r gate for neurons i1…iK is set to zero, then the activity is constrained to evolve in an (N − K)-dimensional subspace; however, the r gate allows the possibility—i.e., the “inductive bias”—of doing this dynamically.
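Equation (9) translates directly into code (a minimal sketch; k follows the convention just stated):

    import numpy as np

    def kaplan_yorke_dimension(lams):
        """Attractor-dimension bound D_A of Eq. (9) from Lyapunov exponents."""
        lams = np.sort(np.asarray(lams))[::-1]    # rank-ordered, largest first
        csum = np.cumsum(lams)
        if csum[0] < 0:
            return 0.0                            # stable: no attractor dimension
        k = int(np.max(np.where(csum >= 0)[0])) + 1
        if k == lams.size:
            return float(k)
        return k + csum[k - 1] / abs(lams[k])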
1. DMFT prediction for λmax
We would also like to study the chaotic nature of the time-varying phase by means of the maximal Lyapunov exponent and characterize when the transition to chaos occurs. We extend the DMFT for the gated RNN to calculate the maximum Lyapunov exponent, and, to do this, we make use of a technique suggested by Refs. [56,57] and clearly elucidated in Ref. [21]. The details are provided in Appendix E, and the end result of the calculation is the DMFT prediction for λmax as the solution to a generalized eigenvalue problem for κ involving the correlation functions of the state variables:
$$\Big[(1 + 2\kappa)^2 - 4\,\partial_\tau^2\Big]\,\chi_h(\tau) = \frac{\delta C_{\sigma_r\phi}(\tau)}{\delta C_h(\tau)}\,\chi_h(\tau) + \frac{\delta C_{\sigma_r\phi}(\tau)}{\delta C_r(\tau)}\,\chi_r(\tau), \qquad (10)$$
$$\Big[(1 + \kappa\tau_r)^2 - \tau_r^2\,\partial_\tau^2\Big]\,\chi_r(\tau) = \frac{\delta C_{\phi}(\tau)}{\delta C_h(\tau)}\,\chi_h(\tau), \qquad (11)$$
where we denote the two-time correlation function Cx(t, t′) ≡ 〈x(t)x(t′)〉 for different (functions of) state variables x(t) [see Eq. (C25) for more context], and χh(τ), χr(τ) denote the stationary correlation functions of the tangent-space perturbations δh, δr. The largest eigenvalue solution to this problem is the required maximal Lyapunov exponent [58]. Note that this is the analog of the Schrödinger equation for the maximal Lyapunov exponent in the vanilla RNN. When αz = 0 (or small), the h field is Gaussian, and we can use Price’s theorem for Gaussian integrals to replace the variational derivatives on the rhs of Eqs. (10) and (11) by simple correlation functions, for instance, ∂Cϕ(τ)/∂Ch(τ) = Cϕ′(τ). In this limit, we see good agreement between the numerically calculated maximal Lyapunov exponent [Fig. 3(c), dots] compared to the DMFT prediction [Fig. 3(c), solid line] obtained by solving the eigenvalue problem [Eqs. (10) and (11)]. For large values of αz, we see quantitative deviations between the DMFT prediction and the true λmax. Indeed, for large αz, the distribution of h is strongly non-Gaussian, and there is no reason to expect that variational formulas given by Price’s theorem are even approximately correct. For more on this point, see the discussion toward the end of Appendix C.
2. Condition for continuous transition to chaos
The value of αz affects the precise value of the maximal Lyapunov exponent λmax; however, numerics suggest that, across a continuous transition to chaos, the point at which λmax becomes positive is not dependent on αz (data not shown). We can see this more clearly by calculating the transition to chaos when the leading edge of the spectral curve (for a FP) crosses zero. This condition is given by Eq. (6), and we see that it has no dependence on αz or the update gate. We stress that this condition [Eq. (6)] for the transition to chaos—when the stable fixed point becomes unstable—is valid when the chaotic attractor emerges continuously from the fixed point [Fig. 3(c), αr = 0, 2]. However, in the gated RNN, there is another discontinuous transition to chaos [Fig. 3(c), αr = 20]: For sufficiently large αr, the transition to chaos is discontinuous and occurs at a value of gh where the zero FP is still stable (gh < 2 with no biases). To our knowledge, this is a novel type of transition which is not present in the vanilla RNN and not visible from an analysis that considers only the stability of fixed points. We characterize this phenomenon in detail below.
B. Output gate induces a novel chaotic transition
Here, we describe a novel phase, characterized by a proliferation of unstable fixed points and the coexistence of a stable fixed point with chaotic dynamics. It is the appearance of this state that gives rise to the discontinuous transition observed in Fig. 3(c). The appearance of this state is mediated by the output gate becoming more switchlike (i.e., increasing αr) in the quiescent region for gh. To our knowledge, no such comparable phenomenon exists in RNNs with additive interactions. The full details of the calculations for this transition are provided in Appendix G. Here, we simply state and describe the salient features. For ease of presentation, the rest of the section assumes that all biases are zero. The results in this section are strictly valid only for αz = 0. In Appendix G3, we argue that they should also hold for moderate αz.
This discontinuous transition is characterized by a few noteworthy features.
1. Spontaneous emergence of fixed points
When gh < 2.0, the zero fixed point is stable. Moreover, for √2 < gh < 2, when αr crosses a threshold value αrFP(gh), unstable fixed points spontaneously appear in the phase space. The only dynamical signatures of these unstable FPs are short-lived transients which do not scale with system size (see Fig. 11). Thus, we have a condition for the fixed-point transition:
$$\Delta_h = F(\Delta_h) \quad\text{and}\quad \frac{dF}{d\Delta_h} = 1, \qquad F(\Delta_h) \equiv \big\langle \sigma_r(r)^2 \big\rangle_{\Delta_r}\,\big\langle \phi(h)^2 \big\rangle_{\Delta_h}, \quad \Delta_r = \big\langle \phi(h)^2 \big\rangle_{\Delta_h}, \qquad (12)$$
i.e., the condition that a nonzero solution of the time-independent MFT first appears (tangentially). These unstable fixed points correspond to the emergence of nontrivial solutions to the time-independent MFT. Figure 4(a) shows the appearance of fixed-point MFT solutions for a fixed gh, and Fig. 4(b) shows the critical value αrFP as a function of gh. As gh → 2−, we see that αrFP approaches a finite limiting value.
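A quick numerical rendering of Fig. 4(a) (a sketch under our reading of the time-independent MFT, with h and r independent zero-mean Gaussians at a site, Δh = 〈σr²〉〈ϕ²〉 and Δr = 〈ϕ²〉; the printed threshold location is illustrative):

    import numpy as np

    def mft_map(Delta_h, g_h, a_r, n=201):
        """One iteration Delta_h -> <sigma_r^2><phi^2> of the fixed-point MFT."""
        x, w = np.polynomial.hermite_e.hermegauss(n)
        w = w / np.sqrt(2.0 * np.pi)
        phi2 = np.sum(w * np.tanh(g_h * np.sqrt(Delta_h) * x) ** 2)   # = Delta_r
        sr2 = np.sum(w / (1.0 + np.exp(-a_r * np.sqrt(phi2) * x)) ** 2)
        return sr2 * phi2

    g_h = 1.8          # below the classic transition at g_h = 2
    for a_r in (2.0, 10.0, 30.0):
        Delta = 1.0
        for _ in range(2000):
            Delta = mft_map(Delta, g_h, a_r)
        print(a_r, Delta)   # flows to 0 below alpha_r^FP, to a nonzero value above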
These spontaneous MFT fixed-point solutions are unstable according to the criterion Eq. (6) derived from RMT. Moreover, in Appendix J, using a Kac-Rice analysis, we show that in this region the full 3N-dimensional system does indeed have a number of unstable fixed points that grows exponentially fast with N. Thus, this transition line represents a topological trivialization transition as conceived by, e.g., Refs. [59,60]. Our analysis shows that instability is intimately connected to the proliferation of fixed points. Remarkably, however, a time-dependent solution to the DMFT does not emerge across this transition, and the microscopic dynamics are insensitive to the transition in topological complexity, bringing us to the next point.
2. A delayed dynamical transition that shows a decoupling between topological and dynamical complexity
On increasing αr beyond αrFP, there is a second transition when αr crosses a critical value αrDT > αrFP. This happens when we satisfy the condition for the dynamical transition:
(13)
derived in Appendix G2. Figure 4(c) shows how αrDT varies with gh. As gh → 2−, we see that αrDT also tends to a finite value. Across this transition, a dynamical state spontaneously emerges, and the maximum Lyapunov exponent jumps from a negative value to a positive value [Fig. 4(d)]. This state exhibits chaotic dynamics that coexist with the stable zero fixed point. The presence of the stable FP means that the dynamical state is not strictly a chaotic attractor but rather a spontaneously appearing “chaotic set.” On increasing gh beyond 2.0, for large but fixed αr, the stable fixed point disappears, and the state smoothly transitions into the full chaotic attractor characterized above. This picture is summarized in the schematic in Fig. 4(e). This gap between the proliferation of unstable fixed points and the appearance of the chaotic dynamics differs from the result of Wainrib and Touboul [45] for purely additive RNNs, where the proliferation (topological complexity) is tightly linked to the chaotic dynamics (dynamical complexity). Thus, for gated RNNs, there appears to be another distinct mechanism for the transition to chaos, and the accompanying transition is a discontinuous one.
3. Long chaotic transients
For finite systems, across the transition the dynamics eventually flow into the zero FP after chaotic transients. Moreover, we expect this transient time to scale with the system size, and, in the infinite system size limit, the transient time should diverge in spite of the fact that the stable fixed point still exists. This is because the relative volume of the basin of attraction of the fixed point vanishes as N → ∞. In Appendix G [Figs. 11(c) and 11(d)], we do indeed see that the transient time for a fixed gh scales with system size [Fig. 11(c)] once αr is above the second transition (dashed line) and not otherwise [see Figs. 11(a) and 11(e), dashed lines].
4. An input-induced chaotic transition
The discontinuous chaotic transition has a functional consequence: Near the transition, static inputs can push a stable system to strong chaotic activity. This is in contrast to the typically stabilizing effects of inputs on the activity of random additive RNNs [21,43,44]. In Figs. 5(a) and 5(b), we see that, when a static input of nonzero variance is applied to a stable system (a) near the discontinuous chaotic transition (in region 2 in Fig. 7), it induces chaotic activity (b); however, when the same input is applied to the system in the chaotic state [Fig. 5(c)], the dynamics are stabilized (d), as reported before.
This phenomenon for static inputs can be understood using the phase diagram with nonzero biases, discussed in Sec. VII. There, we see how the transition curves move when a random bias βh is included. Near the classic chaotic transition (αr = 0), the bias moves the curve toward larger gh, thus suppressing chaos. Near the discontinuous chaotic transition, the bias pulls the curve toward smaller values of αr, thus promoting chaos. Thus, inputs can have opposite effects of inducing or stabilizing chaos within the same model in different parameter regimes. This phenomenon could, in principle, be leveraged for shaping the interaction between inputs and internal dynamics.
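The effect can be probed with a two-trajectory estimate of λmax (a generic sketch, ours; `step(x, t)` should advance the full state by dt, with the static input either included or set to zero):

    import numpy as np

    def lambda_max(step, x0, n_steps=20000, dt=0.05, eps=1e-7, seed=1):
        """Maximal Lyapunov exponent from two nearby trajectories (Benettin)."""
        rng = np.random.default_rng(seed)
        x = x0.copy()
        y = x0 + eps * rng.standard_normal(x0.size)
        acc = 0.0
        for i in range(n_steps):
            x, y = step(x, i * dt), step(y, i * dt)
            d = np.linalg.norm(y - x)
            acc += np.log(d / eps)
            y = x + (eps / d) * (y - x)     # rescale the separation back to eps
        return acc / (n_steps * dt)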
VI. GATES PROVIDE A FLEXIBLE RESET MECHANISM
Here, we discuss how the gates provide another critical function—a mechanism to flexibly reset the memory trace depending on external input or the internal state. This function complements the memory function; a memory that cannot be erased when needed is not very useful. To build intuition, let us consider a linear network ḣ = −h + Jh, where the matrix has a few eigenvalues that are zero, while the rest have a negative real part. The slow modes are good for memory function; however, that fact also makes it hard to forget memory traces along the slow modes. This trade-off is pointed out in Ref. [61]. To be functionally useful, it is critical that the memory trace can be erased flexibly in a context-dependent manner. The r gate allows this function naturally. Consider the same network, but now augmented with an r gate: ḣ = −h + J(σr ⊙ h). If the gate is turned off (σr = 0) for a short duration, the state h is reset to zero. One can actually be more specific: We may choose a Jr with σr = σ[Jrϕ(h)] such that the r gate turns off whenever ϕ(h) becomes aligned with a particular direction u (e.g., a rank-one Jr ∝ −1u⊤), thus providing an internal-context-dependent reset.
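A toy rendering of this reset (our construction) with a single, externally controlled global gate:

    import numpy as np

    # Linear net hdot = -h + J (s * h) with gate s(t); closing the gate for a
    # few time constants erases the state (||h|| ~ e^{-t} while s = 0).
    rng = np.random.default_rng(0)
    N, dt = 200, 0.01
    J = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N))
    h = rng.standard_normal(N)
    norms = []
    for step in range(int(8.0 / dt)):
        t = step * dt
        s = 0.0 if 3.0 <= t < 4.0 else 1.0   # gate closed on 3 <= t < 4
        h = h + dt * (-h + s * (J @ h))
        norms.append(np.linalg.norm(h))      # drops sharply in the closed window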
Apart from resetting to zero, the z gate also allows the possibility of rapidly scrambling the state to a random value by means of the input-induced chaos. This phenomenon is illustrated in Fig. 6, where the network in the marginally stable state normally functions as a memory (retains traces for long times, as in Fig. 2), but positive inputs Iz (with Ih = Ir = 0) to the z gate above a threshold strength, even for a short duration, can induce chaos, thereby scrambling the state and erasing the previous memory state (Fig. 6, bottom panel). The mechanism for this scrambling can be understood by appealing to Eq. (8). A finite input Iz with nonzero mean is able to change 〈σz(z)〉 and, thus, push the critical line for marginal stability one way or the other. For instance, if 〈Iz〉 > 0, then 〈σz(z)〉 > 1/2, which (for αr = 0) moves the transition to marginal stability to a smaller value of gh. This implies that a marginally stable state can be made chaotic in the presence of Iz with finite mean. This mechanism for input-induced chaos appears to be different from that explored in the previous section, which occurs across the discontinuous chaotic transition. We discuss this more in Sec. VII.
In summary, gating imbues the RNN with the capacity to flexibly reset memory traces, providing an “inductive bias” for context-dependent reset. The specific method of reset depends on the task or function, and this can be selected, e.g., by gradient-based training. This inductive bias for resetting is found to be critical for performance in ML tasks [62].
VII. PHASE DIAGRAMS FOR THE GATED NETWORK
Here, we summarize the rich dynamical phases of the gated RNN and the critical lines separating them. The key parameters determining the critical lines and the phase diagram are the activation and output-gate gains and the associated biases: (gh, βh, αr, βr). The update gate does not play a role in determining continuous or critical chaotic transitions. On the other hand, it influences the discontinuous transition to chaos for sufficiently large values of αz (see Appendix G3 for discussion). Furthermore, the update gate has a strong effect on the dynamical aspects of the states near the critical lines. There are macroscopic regions of the parameter space adjacent to the critical lines where the states can be made marginally stable in the limit of αz → ∞. The shape of this marginal stability region is influenced by βz and Iz.
Figure 7(a) shows the dynamical phases for the network with no biases in the (gh, αr) plane. When gh is below 2.0 and αr is below the fixed-point bifurcation line, the zero fixed point is the only solution (region 1). As discussed in Sec. VB, on crossing the fixed-point bifurcation line [green line, Fig. 7(a)], there is a spontaneous proliferation of unstable fixed points in the phase space (region 2). This can occur only when gh > √2. The proliferation of fixed points is not accompanied by any obvious dynamical signatures. However, if gh is larger still, we can increase αr further to cross a second discontinuous transition where a dynamical state spontaneously appears featuring the coexistence of chaotic activity and a stable fixed point (region 3). When gh is increased beyond the critical value of 2.0, the stable zero fixed point becomes unstable for all αr, and we get a chaotic attractor (region 4). All the critical lines are determined by gh and αr, and αz has no explicit role; however, for large αz there is a large region of the parameter space on the chaotic side of the chaotic transition that can be made marginally stable [hatched region 5 in Fig. 7(a)].
A. Role of biases and static inputs
Biases have the effect of generating nontrivial fixed points and controlling stability by moving the edge of the spectral curve. Another key feature of biases is the suppression of the discontinuous bifurcation transition observed without biases. This is explained in detail in Appendix H. A particularly illuminating illustration of the effects of a bias can be inferred from the critical line (red dashed) for finite bias shown in Fig. 7. This curve, computed using the FP stability criterion (6) combined with the MFT equations [(C28)–(C30)], represents the transition between stability and chaos for finite bias with zero mean and nonzero variance. Equivalently, we may think of this as the critical line for a network with static input (with Ir = Iz = 0). Along the gh axis, we can observe the well-documented phenomenon whereby an input suppresses chaos. This corresponds to the region gh > 2 which lies to the left of the red dashed critical line; it is chaotic in the absence of input and flows to a stable fixed point in the presence of input. However, this behavior is reversed for gh < 2. Here, we see a significant swath of phase space which is stable in the absence of input but which becomes chaotic when input is present. Thus, the stability-to-chaos phase boundary in the presence of biases (or inputs) reveals that the output (r) gate can facilitate an input-induced transition to chaos.
VIII. DISCUSSION
Gating is a form of multiplicative interaction that is a central feature of the best-performing RNNs in machine learning, and it is also a prominent feature of biological neurons. Prior theoretical work on RNNs has largely considered only RNNs with additive interactions. Here, we present the first detailed study of the consequences of gating for RNNs and show that gating can produce dramatically richer behavior with significant functional benefits.
The continuous-time gated RNN (gRNN) we study resembles a popular model used in machine learning applications, the gated recurrent unit (GRU) [see the note below Eq. (C27)]. Previous work [51] looks at the instantaneous Jacobian spectrum for the discrete-time GRU using RMT methods similar to those presented in Appendix A; however, this work does not go beyond time-independent MFT and presents a phase diagram showing only boundaries across which the MFT fixed points become unstable [63]. In the present manuscript, we illuminate the full dynamical phase diagram for our gated RNN, uncovering much richer structure. Both the GRU and our gRNN have a gating function which dynamically scales the time constant, which in both cases leads to a marginally stable phase in the limit of a binary gate. However, the dynamical phase diagram presented in Fig. 7 reveals a novel discontinuous transition to chaos. We conjecture that such a phase transition should also be present in the GRU. Also, Ref. [51] lacks any discussion of the influence of inputs or biases. The present paper includes discussion of the functional significance of the gates in the presence of inputs. We believe these results, combined with the refined dynamical phase diagram, can shed some light on the role of analogous gates in the GRU and other gated ML architectures. We review the significance of the gates in more detail below.
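To make the correspondence concrete, a forward-Euler discretization of Eq. (1) with unit step (our sketch) reads

$$h_{t+1} = \big(1 - \sigma_z(z_t)\big)\odot h_t + \sigma_z(z_t)\odot\Big[J^h\big(\sigma_r(r_t)\odot\phi(h_t)\big) + I^h\Big],$$

which has the same convex-combination structure as the GRU state update, with σz playing the role of the GRU update gate and σr entering like the GRU reset gate.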
A. Significance of the update gate
The update gate modulates the rate of integration. In single-neuron models, such a modulation is shown to make the neuron’s responses robust to time-warped inputs [14]. Furthermore, normative approaches, requiring time reparametrization invariance in ML RNNs, naturally imply the existence of a mechanism that modulates the integration rate [64]. We show that, for a wide range of parameters, a more sensitive (or switchlike) update gate leads to marginal stability. Marginally stable models of biological function have long been of interest with regard to their benefits for information processing (cf. Ref. [33] and references therein). In the gated RNN, a functional consequence of the marginally stable state is the use of the network as a robust integrator—such integratorlike function is shown to be beneficial for a variety of computational functions such as motor control [34–36], decision making [37], and auditory processing [53]. However, previous models of these integrators often require handcrafted symmetries and fine-tuning [38]. We show that gating allows this function without fine-tuning. Signatures of integratorlike behavior are also empirically observed in successfully trained gated ML RNNs on complex tasks [41]. We provide a theoretical basis for how gating produces these. The update gate facilitates accumulation of slow modes and a pinching of the spectral curve which leads to a suppression of unstable directions and overall slowing of the dynamics over a range of parameters. This is a manifestly self-organized slowing down. Other mechanisms for slowing down dynamics have been proposed where the slow timescales of the network dynamics are inherited from other slow internal processes such as synaptic filtering [65,66]; however, such mechanisms differ from the slowing due to gating; they do not seem to display the pinching and clumping, and they also do not rely on self-organized behavior.
B. Significance of the output gate
The output gate dynamically modulates the outputs of individual neurons. Similar shunting mechanisms are widely observed in real neurons and are crucial for performance in ML tasks [62]. We show that the output gate offers fine control over the dimensionality of the dynamics in phase space, and this ability is implicated in task performance in ML RNNs [42]. This gate also gives rise to a novel discontinuous chaotic transition where inputs can abruptly push stable systems to strongly chaotic activity; this is in contrast to the typically stabilizing role of inputs in additive RNNs. In this transition, there is a decoupling between topological and dynamical complexity. The chaotic state across this transition is also characterized by the coexistence of a stable fixed point with chaotic dynamics—in finite-size systems, this manifests as long transients that scale with the system size. We note that there are other systems displaying either a discontinuous chaotic transition or the existence of fixed points overlapping with chaotic (pseudo)attractors [19] or apparent chaotic attractors with finite alignment with particular directions [67]; however, we are not aware of a transition in large RNNs where static inputs can induce strong chaos or where the topological and dynamical complexity are decoupled across the transition. In this regard, the chaotic transition mediated by the output gate seems to be fundamentally different. More generally, the output gate is likely to have a significant role in controlling the influence of external inputs on the intrinsic dynamics.
We also show how the gates complement the memory function of the update gate by providing an important, context- and input-dependent reset mechanism. The ability to erase a memory trace flexibly is critical for function [62]. Gates also provide a mechanism to avoid the accuracy-flexibility trade-off noted for purely additive networks—where the stability of a memory comes at the cost of the ease with which it is rewritten [61].
We summarize the rich behavior of the gated RNN via phase diagrams indicating the critical surfaces and regions of marginal stability. From a practical perspective, the phase diagram is useful in light of the observation that it is often easier to train RNNs initialized in the chaotic regime but close to the critical points. This is often referred to as the “edge of chaos” hypothesis [68–70]. Thus, the phase diagrams provide ML practitioners with a map for principled parameter initialization—one of the most critical choices deciding training success.
ACKNOWLEDGMENTS
K. K. is supported by a C. V. Starr fellowship and a CPBF fellowship (through NSF PHY-1734030). T. C. is supported by a grant from the Simons Foundation (891851, TC). D. J. S. was supported by the NSF through the CPBF (PHY-1734030) and by a Simons Foundation fellowship for the MMLS. This work was partially supported by the NIH under Grant No. R01EB026943. K. K. and D.J.S. thank the Simons Institute for the Theory of Computing at U. C. Berkeley, where part of the research was conducted. T. C. gratefully acknowledges the support of the Initiative for the Theoretical Sciences at the Graduate Center, CUNY, where most of this work was completed. We are most grateful to William Bialek, Jonathan Cohen, Andrea Crisanti, Rainer Engelken, Moritz Helias, Jonathan Kadmon, Jimmy Kim, Itamar Landau, Wave Ngampruetikorn, Katherine Quinn, Friedrich Schuessler, Julia Steinberg, and Merav Stern for fruitful discussions.
APPENDIX A: DETAILS OF RANDOM MATRIX THEORY FOR SPECTRUM OF THE JACOBIAN
In this section, we provide details of the calculation of the bounding curve for the Jacobian spectrum for both fixed points and time-varying states. Our approach to the problem utilizes the method of Hermitian reduction [25,26] to deal with non-Hermitian random matrices. The analysis here resembles that in Ref. [51], which also considers Jacobians that are highly structured random matrices arising from discrete-time gated RNNs.
The Jacobian 𝒟 is a block-structured matrix constructed from the random coupling matrices Jh,z,r and diagonal matrices of the state variables. In the limit of large N, we expect the spectrum to be self-averaging—i.e., the distribution of eigenvalues for a random instance of the network approaches the ensemble-averaged distribution. We can, thus, gain insight about typical dynamical behavior by studying the ensemble- (or disorder-) averaged spectrum of the Jacobian. Our starting point is the disorder-averaged spectral density μ(λ) defined as
$$\mu(\lambda) = \frac{1}{3N}\,\Big\langle \sum_{i=1}^{3N} \delta^{(2)}(\lambda - \lambda_i) \Big\rangle_{J_{h,z,r}}, \qquad (A1)$$
where the λi are the eigenvalues of 𝒟 for a given realization of Jh,z,r and the expectation is taken over the distribution of real Ginibre random matrices from which Jh,z,r are drawn. Using an alternate representation for the Dirac delta function in the complex plane, δ(2)(λ) = π−1∂λ̄(1/λ), we can write the average spectral density as
$$\mu(\lambda) = \frac{1}{3N\pi}\,\partial_{\bar\lambda}\,\Big\langle \mathrm{Tr}\,\big(\lambda\,\mathbb{1}_{3N} - \mathcal{D}\big)^{-1} \Big\rangle, \qquad (A2)$$
where 𝟙3N is the 3N-dimensional identity matrix. 𝒟 is in general non-Hermitian, so the support of the spectrum is not limited to the real line, and the standard procedure of studying the Green’s function by analytic continuation is not applicable, since it is nonholomorphic on the support. Instead, we use the method of Hermitization [25,26] to analyze the resolvent for an expanded 6N × 6N Hermitian matrix H:
$$H = \begin{pmatrix} 0 & \lambda\,\mathbb{1}_{3N} - \mathcal{D} \\ \bar\lambda\,\mathbb{1}_{3N} - \mathcal{D}^{\dagger} & 0 \end{pmatrix}, \qquad (A3)$$
$$\mathcal{G}(\eta; \lambda, \bar\lambda) = \big(\eta\,\mathbb{1}_{6N} - H\big)^{-1}, \qquad \eta \to 0, \qquad (A4)$$
and the Green’s function for the original problem is obtained by considering the lower-left block of 𝒢:
$$G(\lambda, \bar\lambda) = \lim_{\eta \to 0}\,\frac{1}{3N}\,\mathrm{Tr}\,\mathcal{G}_{21}. \qquad (A5)$$
To make this problem tractable, we invoke an ansatz called the local chaos hypothesis [57,71], which posits that, for large random networks in steady state, the state variables are statistically independent of the random coupling matrices Jz,h,r (also see Ref. [72]). This implies that the Jacobian [Eq. (4)] has an explicit linear dependence only on Jh,z,r, and the state variables are governed by their steady-state distribution from the disorder-averaged DMFT (Appendix C). These assumptions make the random matrix problem tractable, and we can evaluate the Green’s function by using the self-consistent Born approximation, which is exact as N → ∞. We detail this procedure below.
The Jacobian itself can be decomposed into structured (A, L, R) and random parts (𝒥):
$$\mathcal{D} = A + L\,\mathcal{J}\,R, \qquad \mathcal{J} = \mathrm{bdiag}\big(J^h, J^z, J^r\big),$$
$$A = \begin{pmatrix} -[\sigma_z] & [\sigma_z'\,\dot h/\sigma_z] & 0 \\ 0 & -\tau_z^{-1}\mathbb{1} & 0 \\ 0 & 0 & -\tau_r^{-1}\mathbb{1} \end{pmatrix}, \quad L = \mathrm{bdiag}\big([\sigma_z], \tau_z^{-1}\mathbb{1}, \tau_r^{-1}\mathbb{1}\big), \quad R = \begin{pmatrix} [\sigma_r\phi'] & 0 & [\sigma_r'\phi] \\ [\phi'] & 0 & 0 \\ [\phi'] & 0 & 0 \end{pmatrix}. \qquad (A6)$$
At this point, we must make a crucial assumption: The structured matrices A, L, and R are independent of the random matrices appearing in 𝒥. This implies that the dynamics is self-averaging and that the state variables reach a steady-state distribution determined by the DMFT and insensitive to the particular quenched disorder 𝒥. This self-averaging assumption leads to theoretical predictions which are in very good agreement with simulations of large networks, as presented in Fig. 1.
This independence assumption renders 𝒟 a linear function of the random matrix 𝒥, whose entries are Gaussian random variables. The next steps are to develop an asymptotic series in the random components of H, compute the resulting moments, and perform a resummation of the series. This is conveniently accomplished by the self-consistent Born approximation (SCBA). The output of the SCBA is a self-consistently determined self-energy functional Σ[𝒢] which succinctly encapsulates the resummation of moments. With this, the Dyson equation for 𝒢 is given by
$$\mathcal{G}^{-1} = \mathcal{G}_0^{-1} - \Sigma[\mathcal{G}], \qquad (A7)$$
where the matrices on the right are defined in terms of 3N × 3N blocks:
$$\mathcal{G}_0^{-1} = \begin{pmatrix} \eta\,\mathbb{1}_{3N} & \lambda\,\mathbb{1}_{3N} - A \\ \bar\lambda\,\mathbb{1}_{3N} - A^{\dagger} & \eta\,\mathbb{1}_{3N} \end{pmatrix}, \qquad (A8)$$
$$\Sigma[\mathcal{G}] = \begin{pmatrix} L\,Q\big[R\,\mathcal{G}_{22}\,R^{\top}\big]\,L^{\top} & 0 \\ 0 & R^{\top}\,Q\big[L^{\top}\,\mathcal{G}_{11}\,L\big]\,R \end{pmatrix}, \qquad (A9)$$
and Q is a superoperator which acts on its argument as follows:
$$Q[M] = \mathrm{bdiag}\Big( \tfrac{1}{N}\mathrm{Tr}\,M^{(11)}\,\mathbb{1}_N,\; \tfrac{1}{N}\mathrm{Tr}\,M^{(22)}\,\mathbb{1}_N,\; \tfrac{1}{N}\mathrm{Tr}\,M^{(33)}\,\mathbb{1}_N \Big), \qquad (A10)$$
where M(ab) denote the N × N blocks of M.
Here, we express the self-energy using the 3N × 3N subblocks of the Green’s function 𝒢:
$$\mathcal{G} = \begin{pmatrix} \mathcal{G}_{11} & \mathcal{G}_{12} \\ \mathcal{G}_{21} & \mathcal{G}_{22} \end{pmatrix}. \qquad (A11)$$
At this point, we have presented all of the necessary ingredients for computing the Green’s function and, thus, determining the spectral properties of the Jacobian. These are the Dyson equation (A7), along with the free Green’s function (A8) and the self-energy (A9). Most of what is left is complicated linear algebra. However, in the interest of completeness, we proceed to unpack these equations and give a detailed derivation of the main equation of interest, the bounding curve of the spectral density.
To proceed further, it is useful to define the following transformed Green’s functions, which can be written in terms of N × N subblocks:
$$\tilde{\mathcal{G}}_{11} \equiv L^{\top}\,\mathcal{G}_{11}\,L, \qquad (A12)$$
$$\tilde{\mathcal{G}}_{22} \equiv R\,\mathcal{G}_{22}\,R^{\top}. \qquad (A13)$$
Denote also the mean trace of these subblocks as
$$g_i \equiv \frac{1}{N}\,\mathrm{Tr}\,\big[\tilde{\mathcal{G}}_{11}\big]_{ii} \;\; (i = 1, 2, 3), \qquad g_{i+3} \equiv \frac{1}{N}\,\mathrm{Tr}\,\big[\tilde{\mathcal{G}}_{22}\big]_{ii} \;\; (i = 1, 2, 3). \qquad (A14)$$
Then the self-energy matrix in Eq. (A9) is block diagonal, i.e., Σ[𝒢] = bdiag(Σ11, Σ22), with
$$\Sigma_{11} = \mathrm{bdiag}\big( g_4\,[\sigma_z]^2,\; g_5\,\tau_z^{-2}\,\mathbb{1},\; g_6\,\tau_r^{-2}\,\mathbb{1} \big), \qquad (A15)$$
$$\Sigma_{22} = \begin{pmatrix} g_1\,[\sigma_r\phi']^2 + (g_2 + g_3)\,[\phi']^2 & 0 & g_1\,[\sigma_r\phi'][\sigma_r'\phi] \\ 0 & 0 & 0 \\ g_1\,[\sigma_r\phi'][\sigma_r'\phi] & 0 & g_1\,[\sigma_r'\phi]^2 \end{pmatrix}. \qquad (A16)$$
With the self-energy in this form, we are able to solve the Dyson equation for the full Green’s function 𝒢 by direct matrix inversion:
$$\mathcal{G} = \big(\mathcal{G}_0^{-1} - \Sigma[\mathcal{G}]\big)^{-1}, \qquad (A17)$$
which can be carried out easily by symbolic manipulation software. The rhs of Eq. (A17) is a function of the mean traces gi, whereas the lhs is a function of the Green’s function before the transformations (A12) and (A13). Thus, to get a set of equations we can solve, we apply these same transformations to both sides of Eq. (A17) after solving the Dyson equation. The final step is to take the limit η → 0, recovering the problem we originally wished to solve.
The result of these manipulations is a set of six equations for the mean traces of the transformed Green’s function defined in Eq. (A14). In order to write these down, we introduce some additional notation. The self-consistent equations are of the form
$$g_i = \Big\langle \frac{\Gamma_i}{\Gamma_d} \Big\rangle, \qquad (A18)$$
where we denote 〈M〉 ≡ N−1TrM for shorthand and i runs from 1 to 6. Denote the state-variable-dependent diagonal matrices as
(A19)
and, because they appear frequently in the resulting equations, define
(A20)
(A21)
(A22)
The denominator in Eq. (A18) is then given by
(A23)
and the numerators Γi are given by
(A24)
(A25)
(A26)
(A27)
(A28)
The numerators and denominator are all diagonal matrices with real entries, which is why we use the simple notation of a ratio when referring to matrix inversion.
Solving these equations gives us the gi as implicit functions of λ. They are, in general, complicated and resist exact solution. However, the situation simplifies considerably when we are looking for the spectral curve. In this case, we are looking for all λ that satisfy the self-consistent equations as the gi → 0.
We must take this limit carefully, since the ratio of these functions can remain constant. For this reason, it is necessary to define
(A29) |
We may do the same for , , and , but it turns out that x2 and x3 are sufficient to compute the spectral curve. Next, divide by and send all , keeping the ratios fixed. Applying this to the equation for results in
(A30) |
Similarly, for and , we get
(A31) |
(A32) |
where the coefficients γi, which are functions of λ, are given by
The linear system of equations (A30)–(A32) is consistent iff
(A33) |
In other words, γi must satisfy Eq. (A33) when . This expression depends on λ and implicitly defines a curve in , which is the boundary of the support of the spectral density.
Plugging in the explicit expression for γi, we get the implicit equation for the spectral curve as all that satisfy
(A34) |
For large systems, we can replace the empirical traces of the state variable by their averages given by the DMFT variances. Then, the equation for the curve for a general steady state is given by
(A35) |
For fixed points, we have , which makes γ3 = γ4 = 0. The equation for the spectral curve simplifies to that which is quoted in the main text [Eq. (5)]:
(A36) |
1. Jacobian spectrum for the case αr = 0
In the case when αr = 0, it is possible to express the Green’s function [Eq. (A5)] in a simpler form. Recall that
(A37) |
Let . Then, the Green’s function is given by
(A38) |
(A39) |
(A40) |
where is defined implicitly to satisfy the equation
(A41) |
The function acts as a sort of order parameter for the spectral density, indicating the transition on the complex plane between zero and finite density μ. Outside the spectral support, λ ∈ Σc, this order parameter vanishes, ξ = 0, and the Green’s function is holomorphic:
(A42) |
which, of course, indicates that the density is zero since . Inside the support λ ∈ Σ, the order parameter ξ ≠ 0, and the Green’s function consequently picks up nonanalytic contributions, proportional to . Since the Green’s function is continuous on the complex plane, it must be continuous across the boundary of the spectral support. This must then occur precisely when the holomorphic solution meets the nonanalytic solution, at ξ = 0. This is the condition used to find the boundary curve above.
APPENDIX B: SPECTRAL CLUMPING AND PINCHING IN THE LIMIT αz → ∞
In this section, we provide details on the accumulation of eigenvalues near zero and the pinching of the leading spectral curve (for certain values of gh) as the update gate becomes switchlike (αz → ∞). To focus on the key aspects of these phenomena, we consider the case when the reset gate is off and there are no biases (αr = 0 and βr,h,z = 0). Moreover, we consider a piecewise linear approximation—sometimes called “hard” tanh—to the tanh function given by
(B1) |
This approximation does not qualitatively change the nature of the clumping.
In the limit αz → ∞, the update gate σz becomes binary with a distribution given by
(B2) |
where fz = 〈σz〉 is the fraction of update gates that are open (i.e., equal to one). Using this, along with the assumption that —which is valid in this regime—we can simplify the expression for the Green’s function [Eqs. (A38)–(A42)] to yield
(B3) |
where fh is the fraction of hard tanh activations that are not saturated. In the limit of small τz and βr = 0, we get the expression for the density given in the text:
(B4) |
Thus, we see an extensive number of eigenvalues at zero.
Now, let us study the regime where αz is large but not infinite. We would like to get the scaling behavior of the leading edge of the spectrum and the density of eigenvalues contained in a radius δ around the origin. We make an ansatz for the spectral edge close to zero , where c is a positive constant. With this ansatz, the equation for the spectral curve reads
(B5) |
In the limit of large αz and βr = 0, this implies
(B6) |
If this has a positive solution for c, then the scaling of the spectral edge as holds. Moreover, whenever there is a positive solution for c, we also expect pinching of the spectral curve, and in the limit αz → ∞ we have marginal stability.
Under the same approximation, we can approximate the eigenvalue density in a radius δ around zero as
(B7) |
where we choose the contour along for θ ∈ [0, 2π) and . In the limit of large αz (thus, δ ≪ 1), we get the scaling form described in the main text:
(B8) |
APPENDIX C: DETAILS OF THE DYNAMICAL MEAN-FIELD THEORY
The DMFT is a powerful analytical framework for studying the dynamics of disordered systems. It traces its origins to the study of dynamical aspects of spin glasses [73,74] and was later applied to the study of random neural networks [9,15,21,75]. In our case, the DMFT reduces the description of the full 3N-dimensional (deterministic) ordinary differential equations (ODEs) describing (h, z, r) to a set of three coupled stochastic differential equations for scalar variables (h, z, r).
Here, we provide a detailed, self-contained description of the dynamical mean-field theory for the gated RNN using the Martin–Siggia–Rose–De Dominicis–Janssen formalism. The starting point is a generating functional—akin to the generating function of a random variable—which takes an expectation over the paths generated by the dynamics. The generating functional is defined as
(C1) |
where xj(t) ≡ [hj(t), zj(t), rj(t)] is the trajectory and is the argument of the generating functional. We also include external fields , which are used to calculate the response functions. The measure in the expectation is a path integral over the dynamics. The generating functional is then used to calculate correlation and response functions using the appropriate (variational) derivatives. For instance, the two-point function for the h field is given by
(C2) |
Up until this point, things are quite general and do not rely on the specific form of the dynamics. However, for large random networks, we expect certain quantities such as the population-averaged correlation function Ch ≡ N−1 Σi〈hi(t)hi(t′)〉 to be self-averaging and, thus, to not vary much across realizations. Thus, we can study the generating functional averaged over the disorder 𝒥 and approximate it with its value evaluated at the saddle point of the action. This approximation gives us the single-site DMFT picture of the dynamics described in Eqs. (C19) and (C20).
To see how this all works, we start with the equations of motion (in vector form)
(C3) |
(C4) |
(C5) |
where ⊙ stands for elementwise multiplication.
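For concreteness, the following is a minimal simulation sketch of these equations of motion under our reading of Eqs. (C3)–(C5): the update gate σz multiplies the entire right-hand side of the h equation (acting as an adaptive time constant), the output gate σr multiplies ϕ(h) inside the recurrent input, the gates are sigmoids σ(αx + β), and ϕ(x) = tanh(ghx + βh). All parameter values and normalization conventions below are illustrative rather than the paper's exact choices.

```python
import numpy as np

# Minimal Euler integration of the gated RNN, under our reading of
# Eqs. (C3)-(C5); gains, biases, and time constants are illustrative.
N, dt, steps = 500, 0.01, 5000
g_h, alpha_z, alpha_r = 1.5, 2.0, 2.0
beta_h = beta_z = beta_r = 0.0
tau_z = tau_r = 1.0

rng = np.random.default_rng(0)
Jh, Jz, Jr = (rng.normal(0.0, 1.0 / np.sqrt(N), (N, N)) for _ in range(3))
phi = lambda h: np.tanh(g_h * h + beta_h)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

h, z, r = (rng.normal(0.0, 1.0, N) for _ in range(3))
for _ in range(steps):
    sz = sigma(alpha_z * z + beta_z)   # update gate: adaptive time constant
    sr = sigma(alpha_r * r + beta_r)   # output gate on the recurrent input
    h = h + dt * sz * (-h + Jh @ (sr * phi(h)))
    z = z + (dt / tau_z) * (-z + Jz @ phi(h))
    r = r + (dt / tau_r) * (-r + Jr @ phi(h))
print("population variance of h:", h.var())
```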
To write down the MSRDJ generating functional, let us discretize the dynamics (in the Itô convention). Note that in this convention the Jacobian is unity.
where we introduce external fields in the dynamics , , and . The generating functional is given by
(C6) |
where , , and xj(t) ≡ [hj(t), zj(t), rj(t)]; also, the expectation is over the dynamics generated by the network. Writing this out explicitly, with δ functions enforcing the dynamics, we get the following integral for the generating functional:
(C7) |
Now, let us introduce the Fourier representation for the δ function; this introduces an auxiliary field variable, which, as we will see, allows us to calculate the response function in the MSRDJ formalism. The generating functional can then be expressed as
(C8) |
where the functions fh,z,r summarize the gated RNN dynamics
Let us now take the continuum limit δt → 0 and formally define the measures 𝒟hi = limδt→0 ∏t dhi(t). We can then write the generating functional as a path integral:
(C9) |
where , x = (hi, zi, ri), , and the action S which gives weights to the paths is given by
(C10) |
The functional is properly normalized, so Z𝒥[0, b] = 1. We can calculate correlation functions and response functions by taking appropriate variational derivatives of the generating functional Z, but first we address the role of the random couplings.
1. Disorder averaging
We are interested in the typical behavior of ensembles of the networks, so we work with the disorder-averaged generating functional ; since Z𝒥 is properly normalized, we are allowed to average Z𝒥 directly. Averaging over involves the following integral:
which evaluates to
and similarly for Jz and Jr we get terms
The disorder-averaged generating functional is then given by
(C11) |
where the disorder-averaged action is given by
(C12) |
With some foresight, we see that the action is extensive in the system size, and we can try to reduce it to a single-site description. However, the action now contains nonlocal terms (e.g., involving both sites i and j), so we introduce the following auxiliary fields to decouple them:
(C13) |
To make the C’s free fields that we integrate over, we enforce these relations using the Fourier representation of δ functions with additional auxiliary fields:
This allows us to make the following transformations to decouple the nonlocal terms in the action :
We see clearly that the Cϕσr and Cϕ auxiliary fields, which represent the (population-averaged) ϕσr − ϕσr and ϕ − ϕ correlation functions, decouple the sites by summarizing all the information present in the rest of the network in terms of two-point functions; two different sites interact only by means of the correlation functions. The disorder-averaged generating functional can now be written as
(C14) |
where C = (Ch, Cz, Cr) and Ĉ = (Ĉh, Ĉz, Ĉr). The sitewise decoupled action Sd contains only terms involving a single site (and the C fields). So, for a given value of Ĉ and C, the different sites are decoupled and driven by the sitewise action
(C15) |
where
2. Saddle-point approximation for N → ∞
So far, we have not made any use of the fact that we are considering large networks. However, noting that N appears in the exponent in the expression for the disorder-averaged generating functional, we can approximate it using a saddle-point approximation:
We approximate the action ℒ in Eq. (C14) by its saddle-point value plus a Hessian term, ℒ ≃ ℒ0 + ℒ2, where the Q and Q̂ fields represent Gaussian fluctuations about the saddle-point values C0 and Ĉ0, respectively. At the saddle point, the action is stationary with respect to variations; thus, the saddle-point values of the C fields satisfy
(C16) |
In evaluating the saddle-point correlation function in the second line, we use the fact that equal-time response functions in the Itô convention are zero [29]. This is perhaps the first significant point of departure from previous studies of disordered neural networks and forces us to confront the multiplicative nature of the z gate. Here, 〈⋯〉0 denotes averages with respect to paths generated by the saddle-point action; thus, these equations are a self-consistency constraint. With the correlation fields fixed at their saddle-point values, if we neglect the contribution of the fluctuations (i.e., ignore ℒ2), then the generating functional is given by a product of identical sitewise generating functionals:
(C17) |
where the sitewise functionals are given by
(C18) |
where .
The sitewise decoupled action is now quadratic with the correlation functions taking on their saddle-point values. This corresponds to an action for each site containing three scalar variables driven by Gaussian processes. This can be seen explicitly by using a Hubbard-Stratonovich transform which makes the action linear at the cost of introducing three auxiliary Gaussian fields ηh, ηz, and ηr with correlation functions , , and , respectively. With this transformation, the action for each site corresponds to stochastic dynamics for three scalar variables given by
(C19) |
(C20) |
(C21) |
where the Gaussian noise processes ηh, ηz, and ηr have correlation functions that must be determined self-consistently:
The intuitive picture of the saddle-point approximation is as follows: The sites of the full network become decoupled, and each is driven by Gaussian processes whose correlation functions summarize the activity of the rest of the network “felt” by that site. It is possible to argue for the final result heuristically, but one then does not have access to the systematic corrections that a field-theory formulation affords.
We comment here on the unique difficulty that gating presents to an analysis of the DMFT. While r(t) and z(t) are both described by Gaussian processes in the DMFT, the multiplicative σz(z) interaction in Eq. (C19) spoils the Gaussianity of h(t). Note that r(t) is always Gaussian and uncorrelated with h(t). In order to solve for the correlation functions, we need to make a factorization assumption, justified numerically in Fig. 10. The situation simplifies at a fixed point, where h = ηh (since σz > 0) and is, thus, Gaussian and independent of r.
In order to solve the DMFT equations, we use a numerical method described in Ref. [76]. Specifically, we generate noise paths ηh,z,r starting with an initial guess for the correlation functions and then iteratively update the correlation functions using the mean-field equations until convergence. The classical method of solving the DMFT by mapping the DMFT equations to a second-order ODE describing the motion of a particle in a potential cannot be used in the presence of multiplicative gates. In Fig. 9, we see that the solution to the mean-field equations agrees well with the true population-averaged correlation function; Fig. 9 also shows the scale of fluctuations around the mean-field solutions (thin black lines).
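A minimal sketch of this iterative scheme, in the simplified setting αz = 0 (so σz = 1/2 is constant) with no biases, is given below: Gaussian noise paths with the current guess for the correlation functions are drawn via a Cholesky factorization of the covariance on a discrete time grid, the single-site equations are Euler integrated, and the correlators are remeasured and updated with damping. Grid sizes, sample counts, and the damping factor are illustrative.

```python
import numpy as np

# Sketch of the iterative DMFT solver for alpha_z = 0 (sigma_z = 1/2)
# and no biases; all sizes and parameters are illustrative.
rng = np.random.default_rng(1)
n, dt, M = 200, 0.1, 2000                 # time grid, step, sample paths
g_h, alpha_r, tau_r = 2.5, 2.0, 1.0
phi = lambda h: np.tanh(g_h * h)
sig_r = lambda r: 1.0 / (1.0 + np.exp(-alpha_r * r))

def sample_paths(C):
    """Draw M Gaussian paths with two-time covariance C."""
    L = np.linalg.cholesky(C + 1e-8 * np.eye(n))
    return rng.normal(size=(M, n)) @ L.T

C_h = C_r = 0.5 + np.eye(n)               # initial guesses for the correlators
for _ in range(50):
    eta_h, eta_r = sample_paths(C_h), sample_paths(C_r)
    h = np.zeros((M, n)); r = np.zeros((M, n))
    for t in range(n - 1):                # Euler steps of Eqs. (C19) and (C21)
        h[:, t + 1] = h[:, t] + 0.5 * dt * (-h[:, t] + eta_h[:, t])
        r[:, t + 1] = r[:, t] + (dt / tau_r) * (-r[:, t] + eta_r[:, t])
    u = sig_r(r) * phi(h)
    C_h = 0.5 * C_h + 0.5 * (u.T @ u) / M             # noise feeding h
    C_r = 0.5 * C_r + 0.5 * (phi(h).T @ phi(h)) / M   # noise feeding r
print("steady-state variance estimate:", C_h[-1, -1])
```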
The correlation functions in the DMFT picture, such as Ch(t, t′) = 〈h(t)h(t′)〉, are the order parameters and correspond to the population-averaged correlation functions in the full network. They turn out to be useful in several of our analyses of the RNN dynamics. Qualitative changes in the correlation functions correspond to transitions between dynamical regimes of the RNN.
In general, the non-Gaussian nature of h makes it impossible to obtain closed equations governing the correlation functions. However, when αz is not too large, Eqs. (C19) and (C20) can be used to derive equations of motion for the correlation functions Ch, Cz, and Cr, which proves useful later on. This requires an assumption that the h and σz correlators separate, which seems reasonable for moderate αz (see Fig. 10). “Squaring” Eqs. (C19) and (C20), we get
(C22) |
(C23) |
(C24) |
where we use the shorthand σz(t) ≡ σz[z(t)], ϕ(t) ≡ ϕ[h(t)], and denote the two-time correlation functions as
(C25) |
where x ∈ {h, z, r, σz, σr, ϕ} and the expectation here is over the random Gaussian fields in Eqs. (C19)–(C21). We assume that the network reaches a steady state, so that the correlation functions are functions only of the time difference τ = t − t′. The role of the z gate as an adaptive time constant is evident in Eq. (C22).
For time-independent solutions, i.e., fixed points, Eqs. (C22)–(C24) simplify to read
(C26) |
(C27) |
where we use Δ instead of C to indicate fixed-point variances and Dx is the standard Gaussian measure. It is interesting to note that these mean-field equations can be mapped to those obtained in Ref. [51] for the discrete-time GRU.
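As an illustration of how these fixed-point equations can be solved in practice, the sketch below iterates the self-consistency conditions with Gauss–Hermite quadrature, using the fact that at a fixed point h and r are independent Gaussians, so the average of σr²ϕ² factorizes. The update rules encode our reading of Eqs. (C26) and (C27); the parameter values are illustrative and chosen beyond the instability of the zero fixed point.

```python
import numpy as np

# Sketch of solving the fixed-point MFT equations (C26)-(C27) by damped
# iteration with Gauss-Hermite quadrature; parameters are illustrative.
g_h, alpha_r = 3.0, 2.0
x, w = np.polynomial.hermite_e.hermegauss(101)
w = w / w.sum()                                  # normalize to the N(0,1) measure
gauss = lambda f, var: np.sum(w * f(np.sqrt(var) * x))

phi2 = lambda u: np.tanh(g_h * u) ** 2
sig2 = lambda u: (1.0 / (1.0 + np.exp(-alpha_r * u))) ** 2

D_h = 1.0
for _ in range(500):
    D_r = gauss(phi2, D_h)                       # Delta_r = <phi^2>
    D_h = 0.9 * D_h + 0.1 * gauss(sig2, D_r) * gauss(phi2, D_h)
print("Delta_h =", D_h, " Delta_r =", D_r)
```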
We also make use of the MFT with static random inputs. For completeness, we include the resulting equations here. With , the MFT time-independent solution satisfies
(C28) |
(C29) |
(C30) |
APPENDIX D: DETAILS OF THE NUMERICS FOR THE LYAPUNOV SPECTRUM
The evolution of perturbations δx(t) along a trajectory follows the tangent-space dynamics governed by the Jacobian
(D1) |
So, after a time T, the initial perturbation δx(0) has evolved to
(D2) |
where 𝒯[⋯] is the time-ordering operator applied to the contents of the bracket. When the infinitesimal perturbations grow (shrink) exponentially, the rate of this exponential growth (decay) is dictated by the maximal Lyapunov exponent defined as [54]
(D3) |
For ergodic systems, this limit exists and is the same for almost all initial conditions, as guaranteed by the Oseledets multiplicative ergodic theorem [54]. Positive values of λmax imply that nearby trajectories diverge exponentially fast, and the system is chaotic. More generally, the set of all Lyapunov exponents—the Lyapunov spectrum—yields the rates at which perturbations along different directions shrink or grow and, thus, provides a fuller characterization of the asymptotic behavior. The first k ordered Lyapunov exponents are given by the growth rates of k linearly independent perturbations. These can be obtained as the logarithms of the eigenvalues of the Oseledets matrix, defined as [54]
(D4) |
However, this expression cannot be directly used to calculate the Lyapunov spectra in practice, since M(t) rapidly becomes ill conditioned. We instead employ a method suggested by Ref. [77] (also cf. Ref. [78] for Lyapunov spectra of RNNs). We start with k orthogonal vectors Q0 = [q1, …, qk] and evolve them using the tangent-space dynamics [Eq. (D1)] for a short time interval t0. The new set of vectors is then given by
(D5) |
We now decompose Q̂ = Q1R1 using a QR decomposition into an orthogonal matrix Q1 and an upper-triangular matrix R1 with positive diagonal elements, which give the rates of shrinkage or expansion of the volume element along the different directions. We iterate this procedure for a long time, t0 × Nl, and the first k ordered Lyapunov exponents are given by
(D6) |
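A compact sketch of this reorthonormalization procedure is given below, demonstrated on the vanilla additive RNN ḣ = −h + Jϕ(h) for brevity (the gated case differs only in the state update and the Jacobian). Step sizes and interval lengths are illustrative; g = 2 puts the vanilla network in the chaotic regime.

```python
import numpy as np

# Sketch of the QR-based algorithm (D5)-(D6) for the first k Lyapunov
# exponents, shown for the vanilla RNN dh/dt = -h + J phi(h).
N, k, g = 200, 10, 2.0
dt, t0, Nl = 0.01, 1.0, 400                  # step, reortho interval, # intervals
rng = np.random.default_rng(2)
J = rng.normal(0.0, g / np.sqrt(N), (N, N))
phi = np.tanh
dphi = lambda h: 1.0 - np.tanh(h) ** 2
jac = lambda h: -np.eye(N) + J * dphi(h)     # instantaneous Jacobian

h = rng.normal(0.0, 1.0, N)
Q = np.linalg.qr(rng.normal(0.0, 1.0, (N, k)))[0]
lyap = np.zeros(k)
for _ in range(Nl):
    for _ in range(int(t0 / dt)):            # evolve state and tangent vectors
        Q = Q + dt * (jac(h) @ Q)
        h = h + dt * (-h + J @ phi(h))
    Q, R = np.linalg.qr(Q)                   # reorthonormalize
    lyap += np.log(np.abs(np.diag(R)))       # accumulate log expansion factors
print("leading Lyapunov exponents:", lyap / (Nl * t0))
```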
APPENDIX E: DETAILS OF THE DMFT PREDICTION FOR λmax
The starting point of the method to calculate the DMFT prediction for λmax is two replicas of the system x1(t) and x2(t) with the same coupling matrices Jh,z,r and the same parameters. If the two systems are started with initial conditions which are close, then the rate of convergence or divergence of the trajectories reveals the maximal Lyapunov exponent. To this end, let us define and study the growth rate of d(t, t). In the large N limit, we expect population averages like to be self-averaging (like in the DMFT for a single system) [79], and, thus, we can write
(E1) |
For trajectories that start nearby, the asymptotic growth rate of d(t) is the maximal Lyapunov exponent. In order to calculate this using the DMFT, we need a way to calculate C12—the correlation between replicas—for a typical instantiation of systems in the large N limit. As suggested by Ref. [21], this can be achieved by considering a joint generating functional for the replicated system:
(E2) |
We then proceed to take the disorder average of this generating functional—in much the same way as for a single system—and this introduces correlations between the state vectors of the two replicas. A saddle-point approximation as in the single-system case (cf. Appendix C) yields a system of coupled stochastic differential equations (SDEs) (one for each replica), similar to Eq. (C20), but now the noise processes in the two replicas are coupled, so that terms like need to be considered. As before, the SDEs imply the equations of motion for the correlation functions
(E3) |
(E4) |
(E5) |
where μ, ν ∈ {1, 2} are the replica indices. Note that the single-replica solution clearly is a solution to this system, reflecting the fact that the marginal statistics of each replica are the same as before. When the replicas are started with initial conditions that are ϵ-close, we expect the inter-replica correlation function to diverge from the single-replica steady-state solution, so we expand C12 to linear order as . From Eq. (E1), we see that , and, thus, the growth rate of yields the required Lyapunov exponent. To this end, we make an ansatz , where 2T = t + s, 2τ = t − s, and κ is the DMFT prediction of the maximal Lyapunov exponent that needs to be solved for. Substituting this back into Eq. (E3), we get a generalized eigenvalue problem for κ as stated in the text [Eqs. (10) and (11)].
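The replica construction can also be checked directly in simulation: run the same network twice from ϵ-close initial conditions and fit the exponential growth rate of the squared distance d(t), which grows as e^{2λmax t}. The sketch below does this for the vanilla additive RNN for brevity; parameters are illustrative.

```python
import numpy as np

# Two-replica estimate of lambda_max from the growth of the squared
# distance between eps-close trajectories of the same network.
N, g, dt, eps = 400, 2.0, 0.01, 1e-8
rng = np.random.default_rng(3)
J = rng.normal(0.0, g / np.sqrt(N), (N, N))
f = lambda h: -h + J @ np.tanh(h)

h1 = rng.normal(0.0, 1.0, N)
for _ in range(2000):                       # settle onto the attractor
    h1 = h1 + dt * f(h1)
h2 = h1 + eps * rng.normal(0.0, 1.0, N)

ts, ds = [], []
for i in range(2000):
    h1 = h1 + dt * f(h1)
    h2 = h2 + dt * f(h2)
    ts.append(i * dt)
    ds.append(np.mean((h1 - h2) ** 2))
slope = np.polyfit(ts, np.log(ds), 1)[0]
print("lambda_max estimate:", slope / 2.0)  # d(t) ~ exp(2 lambda_max t)
```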
APPENDIX F: CALCULATION OF MAXIMAL LYAPUNOV EXPONENT FROM RMT
The DMFT prediction for how gates shape λmax (via the correlation functions) is somewhat involved; thus, we provide an alternate expression for the maximal Lyapunov exponent λmax, derived using RMT, which relates it to the relaxation time of the dynamics. The starting point to get λmax is the Oseledets multiplicative ergodic theorem, which guarantees that [80]
(F1) |
(F2) |
where and 𝒟 is the Jacobian. For the vanilla RNN, the Jacobian is given by
(F3) |
We expect the maximal Lyapunov exponent to be independent of the random network realization and, thus, equal to its value after disorder averaging. Furthermore, to make any progress, we use a short-time approximation for . Defining the diagonal matrix R(t) = ∫t [ϕ′(t′)]dt′, these assumptions give
(F4) |
(F5) |
where the second line in Eq. (F5) follows after disorder averaging over J and keeping only terms to leading order in N. Next, we may apply the DMFT to write
(F6) |
(F7) |
In steady state, the correlation function depends only on the difference of the two times, and, thus, we can write
(F8) |
where we define the relaxation time for the Cϕ′ correlation function
(F9) |
Substituting Eq. (F8) in Eq. (F4), we get
(F10) |
which for long times behaves like . By inserting this into Eq. (F1), we obtain a bound for the maximal Lyapunov exponent for the vanilla RNN:
(F11) |
(F12) |
This formula relates the asymptotic Lyapunov exponent to the relaxation time of a local correlation function in steady state. It is interesting to note that the bound also follows by applying the variational theorem to the potential energy obtained from the Schrödinger equation that arises in computing the Lyapunov exponent using the DMFT (e.g., see Refs. [15,32]). Specifically, if one uses the potential obtained in these works, V(τ) = 1 − Cϕ′(τ), and assumes a uniform “ground state wave function,” the variational theorem implies that the true ground-state energy E0 is upper bounded , which consequently implies the bound (F11).
Now we present the derivation for the mean-squared singular value of the susceptibility matrix for the gated RNN with αz = 0 and βz = −∞. In this limit, σz = 1, and the instantaneous Jacobian becomes the 2N × 2N matrix
(F13) |
(F14) |
(F15) |
where h = h(t) and r = r(t) are time dependent.
Let us define the quantity of interest
(F16) |
(F17) |
where we additionally define Ŝt = ∫t dt′St and the integration is performed elementwise. Expanding the exponentiated matrices and computing moments directly, one finds that, at leading order in N, the moments must contain an equal number of factors of Ĵ and ĴT. Thus, we must evaluate
(F18) |
The ordering of the matrices is important in this expression. Since all of the Ĵ appear to the left of ĴT, the leading-order contributions to the moment come from Wick contractions that are “noncrossing”—in the language of diagrams, the moment is given by a “rainbow” diagram. Consequently, we may evaluate cn by induction. First, the induction step. Define the expected value of the matrix moment
(F19) |
(F20) |
(F21) |
We wish to determine an and bn. Next, define
(F22) |
(F23) |
(F24) |
Now we can directly determine the induction step at the level of matrix moments by Wick contraction of the rainbow diagram:
(F25) |
(F26) |
(F27) |
This implies the following recursion for the diagonal elements of ĉn:
(F28) |
The initial condition is given by observing that , which implies a0 = b0 = 1. The solution to this recursion relation can be written in terms of a transfer matrix
(F29) |
which implies the moment is given by
(F30) |
To evaluate this, we use the fact that the eigenvalues of the transfer matrix are
(F31) |
which are real valued. The eigenvectors are
(F32) |
Then, defining l = (1, 1), the moment can be written
(F33) |
(F34) |
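Numerically, moments of this form are conveniently evaluated by expanding the initial vector l = (1, 1) in the eigenbasis of the transfer matrix, as in the sketch below; the entries of T here are placeholders standing in for the expressions in Eq. (F29).

```python
import numpy as np

# Evaluating T^n l by eigendecomposition, as in Eqs. (F29)-(F34); the
# entries of T are illustrative placeholders for Eq. (F29).
gA, gB = 0.8, 0.5
T = np.array([[gA, gB], [gB, gA]])
evals, evecs = np.linalg.eig(T)              # real eigenvalues gA +/- gB
coeffs = np.linalg.solve(evecs, np.ones(2))  # expand l = (1, 1) in eigenvectors
n = 7
cn = evecs @ (evals ** n * coeffs)           # T^n l without repeated multiplication
print(cn, np.linalg.matrix_power(T, n) @ np.ones(2))   # agreement check
```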
The final expression for the mean-squared singular value is then
(F35) |
After resumming this infinite series, we wind up with an expression in terms of the modified Bessel function:
(F36) |
In the steady state, we approximate these expressions by assuming the correlation functions are time-translation invariant. Thus, we may write, for instance,
(F37) |
and similarly for gQ and gP. Then, the eigenvalues of the transfer matrix become
(F38) |
At late times, using the asymptotic behavior of the modified Bessel function, the moment becomes
(F39) |
which gives the Lyapunov exponent
(F40) |
where the relaxation times τA, τr, and τq are defined as, respectively,
(F41) |
(F42) |
(F43) |
APPENDIX G: DETAILS OF THE DISCONTINUOUS CHAOTIC TRANSITION
In this section, we provide the details for the calculations involved in the discontinuous chaotic transition.
1. Spontaneous emergence of fixed points
For gh < 2.0 and small αr, the zero fixed point is the globally stable state for the dynamics and the only solution to the fixed-point equations [Eq. (C26)] for Δh. However, as we increase αr for a fixed gh, two additional nonzero solutions to Δh spontaneously appear at a critical value as shown in Fig. 4(a). Numerical solutions to the fixed-point equations reveal the form of the bifurcation curve and the associated value of . We see that increases rapidly with decreasing gh, dividing the parameter space into regions with either one or three solutions for Δh. The corresponding vanishes at two boundary values of gh—one at 2.0 and another, gc, below 1.5, where . This naturally leads to the question of whether the fixed-point bifurcation exists for all values of gh below 2.0.
To answer this, we perturbatively solve the fixed-point equations in two asymptotic regimes: (i) gh → 2− and (ii) . Details of the perturbative treatment are in Appendix I 2. For gh = 2 − ϵ, we see that the perturbative problem undergoes a bifurcation from one solution (Δh = 0) to three when αr crosses the bifurcation threshold , and this is the left limit of the bifurcation curve in Fig. 4(b). The larger nonzero solution for the variance at the bifurcation point scales as
(G1) |
where ξ0 and ξ1 are positive constants (see Appendix I 2).
At the other extreme, to determine the smallest value of gh for which a bifurcation is possible, we note from Fig. 4(b) that in this limit αr → ∞, and, thus, we can look for solutions to Δh in the limit: Δh ≪ 1 and αr → ∞ and . In this limit, there is a bifurcation in the perturbative solution when , and, close to the critical point, the fixed-point solution is given by (see Appendix I 2)
(G2) |
Thus, in the region , there exist nonzero solutions to the fixed-point equations once αr is above a critical value . These solutions correspond to unstable fixed points appearing in the phase space.
2. Delayed dynamical transition shows a decoupling between topological and dynamical complexity
The picture from the fixed-point transition above is that, when gh is in the interval (, 2), there is a proliferation of unstable fixed points in the phase space provided . However, it turns out that the spontaneous appearance of these unstable fixed points is not accompanied by any asymptotic dynamical signatures—as measured by the Lyapunov exponents (see Fig. 4) or by the transient times (see Fig. 11). It is only when αr is increased further beyond a second critical value that we see the appearance of chaotic dynamics and long-lived transients. This is significant in light of a result by Wainrib and Touboul [45], who show that the transition to chaotic dynamics (dynamical complexity) in random RNNs is tightly linked to the proliferation of critical points (topological complexity); in their case, the exponential rate of growth of critical points (a topological property) is the same as the maximal Lyapunov exponent (a dynamical property).
Let us characterize the second dynamical transition curve given by [Fig. 4(c), red curve]. For ease of discussion, we turn off the update gate (αz = 0) and introduce a functional Fψ for a 2D Gaussian average of a given function ψ(x):
(G3) |
(G4) |
The DMFT equations for the correlation functions then become
(G5) |
We further make an approximation that τr ≪ 1, which, in turn, implies Cr(τ) ≈ Cϕ(τ). This approximation turns out to hold even for moderately large τr. With these approximations, we can integrate the equations for Ch(τ) to arrive at an equation for the variance . We do this by multiplying by ∂τCh(τ) and integrating from τ to ∞, and we get
(G6) |
Using the boundary condition that Ċh(0) = 0, we get the equation for the variance:
(G7) |
Solving this equation gives the DMFT prediction for the variance for any gh and αr. Beyond the critical value of αr, two nonzero solutions for spontaneously emerge. In order to use Eq. (G7) to find a prediction for the DMFT bifurcation curve , we need to use the additional fact that at the bifurcation point the two solutions coincide, and there is only one nonzero solution. To proceed, we can view the lhs of Eq. (G7) as a function of αr, gh, and . Then, the equation for the bifurcation curve is obtained by solving the following two equations for and :
(G8) |
(G9) |
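Numerically, this is a standard double-root (saddle-node) condition: at the critical αr, the function defined by the lhs of Eq. (G7) vanishes together with its derivative with respect to the variance. The sketch below shows the root-finding structure with a placeholder F that merely has the same qualitative shape; the reader would substitute the actual Gaussian-integral expression from Eq. (G7).

```python
import numpy as np
from scipy.optimize import fsolve

# Locating the bifurcation curve from Eqs. (G8)-(G9): solve F = 0 and
# dF/dDelta = 0 simultaneously. `F` is an illustrative stand-in with the
# same saddle-node structure, not the paper's actual expression.
def F(delta, g_h, alpha_r):
    return delta - alpha_r * np.tanh(g_h * delta) ** 2

def dF(delta, g_h, alpha_r, eps=1e-6):
    return (F(delta + eps, g_h, alpha_r) - F(delta - eps, g_h, alpha_r)) / (2 * eps)

def bifurcation_point(g_h, guess=(0.5, 1.0)):
    eqs = lambda v: [F(v[0], g_h, v[1]), dF(v[0], g_h, v[1])]
    delta_c, alpha_c = fsolve(eqs, guess)
    return delta_c, alpha_c

print(bifurcation_point(1.8))   # (critical variance, critical alpha_r)
```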
To get the condition for the dynamical bifurcation transition, we need to differentiate the lhs of Eq. (G7) with respect to and set it to 0. This involves terms like
(G10) |
We give a brief outline of calculating the first term. It is easier to work in the Fourier domain:
(G11) |
This immediately gives us
(G12) |
Using this fact, we can calculate the derivative of as a straightforward (but long) sum of Gaussian integrals. We then numerically solve Eqs. (G8) and (G9) to get the bifurcation curve shown in Fig. 4(c). Figure 4(d) shows the corresponding variance at the bifurcation point (red curves). We note two salient points: (i) The DMFT bifurcation curve is always above the fixed-point bifurcation curve [black, in Fig. 4(a)], and (ii) the lower critical value of gh which permits a dynamical transition [dashed green curve in Figs. 4(a) and 4(b)] is smaller than the corresponding fixed-point critical value of .
We now calculate the lower critical value of gh and provide an analytical description of the asymptotic behavior near the lower and higher critical values of gh. From the red curve in Fig. 4(c), we know that, as gh tends toward the lower critical value, and . In this limit, we can approximate σr as a step function, and is approximated as
(G13) |
(G14) |
The DMFT equation then reads
Integrating this equation, we get
which has O(Ch(0)2) corrections. From the boundary condition Ċh(0) = 0, we know that ẋ → 0 as x → 1. We thus find that these boundary conditions are consistent only to leading order in Ch(0) when gh is equal to its critical value:
(G15) |
which indicates that Ch(0) must vanish as .
In the other limit when gh → 2−, we see that remains finite and . We assume that, for gh = 2 − ϵ, has a power-series expansion
(G16) |
We also expand Fϕ and to O[Ch(0)2]:
(G17) |
and look for values of αr which permit a nonzero value for c0 in the leading-order solutions to the DMFT. We find that the critical value of αr from the perturbative solution is given by
(G18) |
The DMFT prediction for the dynamical bifurcation agrees well with the full network simulations. In Fig. 4(e), we see that the maximum Lyapunov exponent experiences a discontinuous transition from a negative value (network activity decays to fixed point) to a positive value (activity is chaotic) at the critical value of αr predicted by the DMFT (dashed vertical lines).
3. Influence of update gate on the discontinuous transition
Here, we comment briefly on the possible influence of the z gate on the discontinuous dynamical phase transition given by the curve . Assuming Eq. (C22) is valid (discussed in more detail toward the end of Appendix C), we may rewrite the DMFT equation for the two-point correlation functions as
(G19) |
where
(G20) |
Noting that a time-dependent solution corresponds to a nonzero solution for Ch(0), the boundary condition Ċh(0) = 0 then requires
(G21) |
where we define a new “potential” function which is related to that defined above by
(G22) |
We leave the arguments (gh, αr, ) implicit, for ease of presentation. We proceed to bound the new potential by establishing bounds on . To be explicit, we have
(G23) |
which we express as the sum of a connected component (indicated by a subscript c) and a disconnected component. We can consider two limiting behaviors. When the correlation time tends to zero, the connected component vanishes and (at zero bias βz = 0)
(G24) |
Increasing the correlation time can serve only to increase the two-point function, since σ ≥ 0. In the extreme limit of very long correlation time, we have that
(G25) |
The inequality is saturated at αz = ∞, when σz becomes a step function of its argument. Therefore, the two-point correlation function of the update gate is bounded above and below:
(G26) |
and this bound is uniform in the sense that it holds for all values of the argument . Consequently, we are able to bound the potential
(G27) |
It follows immediately that the derivative is similarly bounded. Consequently, the zeros of and coincide with the zeros of ℱ and , respectively. As a result, the discontinuous transition, determined by Eqs. (G8) and (G9), remains unchanged for values of αz for which Eq. (C22) is valid. Thus, for moderately large αz [approximately 10, where Eq. (C22) is valid], the critical line for the discontinuous transition remains unchanged.
APPENDIX H: THE ROLE OF BIASES
Thus far, we have described the salient dynamical aspects of the gated RNN in the absence of biases. Here, we describe the role of the biases βh (bias of the activation ϕ) and βr (bias of the output gate σr). We first note that, when βh = 0, zero is always a fixed point of the dynamics, and the zero fixed point is stable provided
(H1) |
where ϕ(x) = tanh(ghx + βh). This gives the familiar gh < 2 condition when βr = 0 [81]. Thus, in this case, there is an interplay between gh and βr in determining the leading edge of the Jacobian around the zero fixed point and, thus, its stability. In the limit βr → −∞, the leading edge retreats to . When βh > 0, zero cannot be a fixed point of the dynamics. Therefore, βh facilitates the appearance of nonzero fixed points, and both βr and βh determine the stability of these nonzero fixed points.
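This stability criterion is easy to probe numerically: with βh = 0, the h block of the Jacobian at the origin decouples, and (dropping the positive σz prefactor, which cannot change the sign of the real part) its eigenvalues are −1 + ghσ(βr)λJ, with λJ the eigenvalues of Jh filling the unit disk. The sketch below verifies that the spectral abscissa crosses zero at ghσ(βr) = 1, i.e., at gh = 2 for βr = 0; the decoupling argument and conventions here are our reading of Eq. (H1).

```python
import numpy as np

# Numerical check of the zero-fixed-point stability criterion: the
# spectral abscissa of the h block should cross zero near g_h = 2 when
# beta_r = 0. Sizes and gains are illustrative.
N, beta_r = 1000, 0.0
rng = np.random.default_rng(4)
J = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N))
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
for g_h in (1.5, 2.0, 2.5):
    D = -np.eye(N) + g_h * sigma(beta_r) * J     # sigma_z prefactor > 0 dropped
    print(g_h, np.linalg.eigvals(D).real.max())
```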
To gain some insight into the role of βh in generating fixed points, we treat the mean-field FP equations [Eq. (C26)] perturbatively around the operating point gc where the zero fixed point becomes unstable [Eq. (H1)]. For small βh and ϵ = gh − gc, we can express the solution Δh as a power series in ϵ, and we see that to leading order the fixed-point variance behaves as (details in Appendix I 1)
(H2) |
(H3) |
where ϕ0 ≡ tanh(βh) and f1(αr, βr) and f2(αr, βr) are constant functions with respect to ϵ. Therefore, we see that the bias βh gives rise to nonzero fixed points near the critical point which scale linearly with the bias. In Fig. 12(e), we show this linear scaling of the solution for the case when βh = ϵ, and we see that the prediction (lines) matches the true solution (circles) over a reasonably wide range.
More generally, away from the critical gc, an increasing βh gives rise to fixed-point solutions with increasing variance, and this can arise continuously from zero, or it can arise by stabilizing an unstable, time-varying state depending on the value of βr. In Fig. 12(a), we see how the Δh behaves for increasing βh for different βr, and we can see the stabilizing effect of βh on unstable solutions by looking at its effect on the leading spectral edge [Fig. 12(b)]. In Fig. 12(c), we see that an increasing βr also gives rise to increasing Δh. However, in this case, it has a destabilizing effect by shifting the leading spectral edge to the right. In particular, when βh = 0, increasing βr destabilizes the zero fixed point and gives rise to a time-varying solution. We note that, when βh = 0, varying βr cannot yield stable nonzero FPs. The combined effect of βh and βr can be seen in Fig. 12(f), where the nonzero solutions to the left of the orange line indicate unstable (time-varying) solutions. We choose the parameters to illustrate an interesting aspect of the biases: In some cases, increasing βh can have a nonmonotonic effect on the stability, wherein the solution becomes unstable with increasing βh and is then eventually stabilized for sufficiently large βh.
1. Effect of biases on the phase boundaries
In Figs. 13(a) and 13(b), we look at how the critical line for the chaotic transition, in the αr − gh plane, changes as we vary βh (a) or βr (b). Positive values of βr (“open” output gate) tend to make the transition line less dependent on αr [Fig. 13(b)], and negative values of βr have a stabilizing effect by requiring larger values of gh and αr to transition to chaos. As we see above, higher values of βh have a stabilizing effect, requiring higher gh and αr to make the (nonzero) stable fixed point unstable. In both cases, the critical lines for marginal stability [Figs. 13(a) and 13(b), dashed lines] are also influenced in a similar way. In Figs. 13(c) and 13(d), we see how the stability-to-chaos transition is affected by αr (c) and βr (d). Consistent with the discussion above, larger αr and βr have a destabilizing effect, requiring a larger βh to make the system stable.
APPENDIX I: DETAILS OF THE PERTURBATIVE SOLUTIONS TO THE MEAN-FIELD EQUATIONS
1. Perturbative solutions for the fixed-point variance Δh with biases
In this section, we derive the perturbative solutions for the fixed-point variance Δh with finite biases, near the critical point where the zero fixed point becomes unstable. Recall that fixed-point variances are obtained by solving
(I1) |
(I2) |
The expansion we seek is perturbative in Δh. So, expanding the gating and activation functions about their biases under the assumption , we have a series expansion to :
(I3) |
(I4) |
(I5) |
where we use the following identities involving the derivatives of tanh:
(I6) |
(I7) |
(I8) |
(I9) |
(I10) |
This gives us to
(I11) |
(I12) |
(I13) |
(I14) |
and, therefore,
(I15) |
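Expansions of this kind are conveniently checked with a computer algebra system. The sketch below expands tanh² of a Gaussian argument to O(Δh²) and substitutes the Gaussian moments 〈x²〉 = 1 and 〈x⁴〉 = 3; the function expanded (the βh = 0 case of 〈ϕ²〉) is an illustrative instance of the procedure used in this Appendix.

```python
import sympy as sp

# Checking a small-Delta expansion with a CAS: expand tanh^2, substitute
# the Gaussian argument g*sqrt(Delta)*x, and replace <x^2>=1, <x^4>=3.
x, D, g, u = sp.symbols('x Delta g u', positive=True)
ser = sp.series(sp.tanh(u) ** 2, u, 0, 6).removeO()   # u^2 - 2*u^4/3
avg = sp.expand(ser.subs(u, g * sp.sqrt(D) * x))
avg = avg.subs(x ** 4, 3).subs(x ** 2, 1)             # Gaussian moments
print(sp.collect(avg, D))   # -> Delta*g**2 - 2*Delta**2*g**4
```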
To proceed further, we study the solutions to this equation for small deviations from a critical value of gh. Which critical value should we use? Recall that the zero fixed point becomes unstable when
(I16) |
Therefore, we expand around this operating point, and our small parameter is ϵ = gh − gc, where gc = σr(0)−1. We make an ansatz that we can express Δh as a power series in ϵ:
(I17) |
where η is the exponent for the prefactor scaling and needs to be determined self-consistently. To get the scaling relations for Δh, we need to expand the coefficients in the Taylor series for Δh in terms of ϵ. We note that c0 = tanh(βh)2, and, therefore, these approximations make sense only for small βh. How small should βh be relative to ϵ? We make the following ansatz:
(I18) |
and, thus, if δ > 1/2, then increases slower than ϵ.
We now express the coefficients for small βh:
(I19) |
(I20) |
(I21) |
After solving Eqs. (I15)–(I19) self-consistently in terms of the expansion parameter ϵ, we get the following perturbative solution for δ ≤ 1:
(I22) |
(I23) |
f1(αr, βr) and f2(αr, βr) are constant functions (with respect to ϵ). Therefore, we see a linear scaling with the bias βh.
2. Perturbative solutions for the fixed-point variance Δh in the bifurcation region with no biases
The perturbative treatment of the fixed-point solutions in this case closely follows that described above. For gh = 2 − ϵ, we can express Δh as a power series in ϵ (Δh = c0 + c1ϵ + c2ϵ2) and look for a condition that allows for a nonzero c0 corresponding to the bifurcation point. Since we expect Δh to be small in this regime, we can expand Δr as
(I24) |
and, similarly, we can also approximate
(I25) |
Now, equating coefficients of powers of ϵ, we get that either c0 = 0 or
(I26) |
which is a valid solution when . This is the bifurcation curve limit near gh = 2−.
In the other limit, and . We can work in the regime where to see what values of gh admit a bifurcation in the perturbative solutions. The equation [to ] is given by
(I27) |
Thus, we get a positive solution for Δh when , and, to leading order, the solution scales as
(I28) |
3. Ch(τ) near critical point
Here, we study the asymptotic behavior of Ch(τ) near the critical point gh = 2.0 for small αz. For simplicity, we set the biases to be zero. In this limit, we can assume that Ch(τ) and Cϕ(τ) are small. Let us begin by approximating .
We get, up to ,
(I29) |
(I30) |
(I31) |
(I32) |
This can be obtained, for instance, by expanding σz[z(t)] and taking the Gaussian averages over the argument z(t) in the steady state. The relation between Cϕ(τ) and Cz(τ), in general, does not have a simple form; however, when gh ~ 2, we expect the relaxation time τR ≫ 1, and therefore, we can approximate Cz(τ) ≈ Cϕ(τ). We can then approximate Cϕ as
(I33) |
(I34) |
(I35) |
(I36) |
Note that this also gives us an approximation for Cϕ(0). Putting all this together, the equation governing Ch(τ),
(I37) |
becomes [up to ]
(I38) |
(I39) |
(I40) |
(I41) |
(I42) |
Integrating with respect to τ gives
(I43) |
The boundary conditions are
(I44) |
The second condition implies the constant is 0. And the first condition implies
(I45) |
From this, we can solve for Ch(0) (neglecting terms higher than quadratic) to get a solution that is perturbative in the deviation ϵ from the critical point (gh = 2 + ϵ). To the leading order, the variance grows as
(I46) |
and αz enters the timescale-governing term a1 only at O(ϵ2). At first, it might seem counterintuitive that αz, which effectively controls the dynamical time constant in the equations of motion, should not influence the relaxation rate to leading order. However, this result is for the dynamical behavior close to the critical point, where the relaxation time is a scaling function of ϵ. Moving away from this critical point, the relaxation time becomes finite, and the z gate, and, thus, αz, should have a more visible effect.
APPENDIX J: TOPOLOGICAL COMPLEXITY VIA KAC-RICE FORMULA
The arguments here are similar to those presented in Ref. [82], which uses a self-averaging assumption to express the topological complexity (defined below) in terms of a spectral integral. Let us begin.
The goal is to estimate the total number of fixed points for a dynamical system ẋ = G(x). The Kac-Rice analysis proceeds by constructing the integral over the state space x whose integrand has delta-functional support only on the fixed points:
(J1) |
where 𝒟 = ∂G/∂x is the instantaneous Jacobian. The expectation value here is over the random coupling matrices. The average number of fixed points is related to the so-called topological complexity 𝒞 via the definition
We seek a saddle-point approximation of this quantity below.
For the gated RNN, the state space is x = (h, z, r), and the fixed points satisfy
(J2) |
(J3) |
(J4) |
where for notational shorthand we introduce and , anticipating the mean-field approximation to come. Notice that only the first equation, for h, provides a nontrivial constraint. Once h is found, the second and third equations can be used to determine z and r, respectively. Notice, furthermore, that, since σ(zi) > 0, the solutions hi to the first equation do not depend on zi. Indeed, the dependence on σ(z) can be factorized out of the Kac-Rice integral. This requires noting first that, at a fixed point, Eq. (A6) implies that the Jacobian can be written (setting τr = τz = 1 for simplicity)
(J5) |
and that the determinant can be factorized:
(J6) |
(J7) |
The product of σ(zi) produced by the determinant is canceled by the product of delta functions, using the fact that σ(zi) > 0 and the transformation law
(J8) |
So we see that what evidently matters for the topological complexity is the fixed-point Jacobian:
(J9) |
whose eigenvalues we denote by λi for i = 1, …, N, with the spectral density
(J10) |
The preceding analysis is all basically to show that we could easily have set αz = 0 and gotten the same answer; i.e., the z gate does not influence the topological properties of the dynamics. For αz = ∞, the situation changes drastically, and the analysis likely needs to be significantly reworked. Indeed, in this limit, we most likely do not have discrete fixed points anymore, so the very notion of counting fixed points no longer makes sense.
Having introduced the spectral density, we can rewrite the Kac-Rice integral as
(J11) |
Note that, since the spectral density of 𝒟fp is independent of z, the integral over z is trivial to perform and leaves only h and r in the integrand.
So far, everything is exact. We begin now to make some approximations. The first crucial approximation is that the spectral density is self-averaging. The RMT analysis in the previous sections shows us furthermore that the spectral density depends only on macroscopic correlation functions of the state variables. Let us denote the spectral integral factor
(J12) |
by which we mean that it depends on the particular realization of the random coupling 𝒥 and the state vector x. The self-averaging assumption implies that
(J13) |
i.e., this factor does not depend on the particular realization of 𝒥 but just on the state vector. Equivalently, we are assuming that the spectral density depends only on the configurations h and r and not the particular realization Jh,r. This allows us to pull this factor outside of the expectation value:
(J14) |
Now we give some nonrigorous arguments for how one might evaluate the remaining expectation value. In order to carry out the average over Jh and Jr, we utilize the Fourier representation of the delta function to write
(J15) |
(J16) |
which upon disorder averaging yields
(J18) |
where we define
(J19) |
This is where we make our second crucial assumption: that the empirical averages appearing in Eq. (J19) converge to their average value
(J20) |
(J21) |
This means we are assuming the strong law of large numbers. With this essential step, the integral in Eq. (J18) evaluates to
(J22) |
(J23) |
where and Δr = Cϕ—which are just the time-independent (fixed-point) MFT equations (C26).
Returning to the expression for the complexity, this series of approximations gives us
(J24) |
Let us now describe our derivation more intuitively. We start with the formal expression for the Kac-Rice formula, which uses the delta functional integrand to find fixed points and counts them with the weighting factor related to the Jacobian. Our first assumption allows us to simplify the calculation involving the Jacobian, since we argue that this term is self-averaging. The second assumption allows us to deal with the remaining expectation value of the delta functions. The expectation value adds a number of delta functions (however many there may be for that Jh/r) for each configuration of the connectivity. For continuously distributed connectivity, this implies that the expectation value smears out the delta functions and results in a smooth distribution. What should this distribution be? Well, we know from the mean-field analysis that the state vectors are distributed as Gaussians at a fixed point. Furthermore, the mean-field theory becomes exact for large N. Therefore, we should expect that, in this limit, the delta functions are smeared out into the Gaussian distributions determined by the MFT. This is what our derivation shows.
The final step is to recall that the spectral density depends on the state vectors only via empirical averages. For instance, in the absence of an r gate, the spectral density depends on the empirical average Ĉϕ′. Again invoking the strong law of large numbers, we may argue that the self-averaging goes a step further and that
(J25) |
(J26) |
where
(J27) |
This is precisely the spectral density we study in a preceding Appendix and the one for which we obtain an explicit expression for the spectral curve. These approximations give us the topological complexity
(J28) |
Now we take a closer look at the spectral density. The eigenvalues of 𝒟fp form a circular droplet of finite radius ρ centered on −1. Therefore, the eigenvalues have the form λ = −1 + reiθ, and the spectral density is a function only of r. The value of the radius is found from Eq. (5) by removing the z gate (i.e., setting αz = 0). After some algebraic steps, we find for the radius
(J29) |
(J30) |
Using these facts, we can write the topological complexity as
(J31) |
(J32) |
where is the indicator function, which is one for r < ρ and vanishes for r > ρ. Thus, we see that the topological complexity is zero for ρ < 1. This is precisely the fixed-point stability condition derived in the main text [Eq. (6)]. Conversely, the topological complexity is nonzero for ρ > 1, which corresponds to unstable fixed points. We thus see that, under our set of reasonable approximations, unstable MFT fixed points correspond to a finite topological complexity and, consequently, to a number of “microscopic” fixed points that grows exponentially with N.
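The droplet geometry above is simple to verify numerically in the special case without the r gate, where the fixed-point Jacobian reduces to −I + J diag[ϕ′(h)] and the predicted radius is the root-mean-square gain √〈ϕ′²〉. In the sketch below, h is drawn as a Gaussian surrogate for a fixed-point configuration; the variance and gain values are illustrative.

```python
import numpy as np

# Numerical check of the droplet geometry: without the r gate, the
# eigenvalues of -I + J diag[phi'(h)] fill a disk of radius
# sqrt(<phi'^2>) centered at -1. Values are illustrative.
N, g_h, var_h = 1000, 2.5, 0.5
rng = np.random.default_rng(5)
J = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N))
h = rng.normal(0.0, np.sqrt(var_h), N)
dphi = g_h * (1.0 - np.tanh(g_h * h) ** 2)   # phi'(h) for phi = tanh(g_h h)
ev = np.linalg.eigvals(-np.eye(N) + J * dphi)
print("max |lambda + 1| =", np.abs(ev + 1).max(),
      "  predicted rho =", np.sqrt(np.mean(dphi ** 2)))
```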
The final missing ingredient, necessary to show that region 2 in the phase diagram has an exponentially growing number of fixed points, is to show that the MFT fixed points which appear after the bifurcation are indeed unstable. At the moment, we lack any analytical handle on this. However, we confirm numerically that, along the bifurcation curve, the fixed points are unstable and that increasing the variance Δh serves only to increase ρ. Could the lower branch, on which Δh decreases with αr, behave differently? Evidently not: Δh scales with αr in such a way that ends up growing like , thus once again increasing ρ. Therefore, we conclude that the MFT fixed points appearing after the bifurcation are always unstable, with ρ > 1. This concludes our informal proof of the transition in topological complexity between regions 1 and 2 in the phase diagram in Fig. 7.
References
- [1].Graves A, Mohamed A-R, and Hinton G, in Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, New York, 2013), pp. 6645–6649. [Google Scholar]
- [2].Pascanu R, Gulcehre C, Cho K, and Bengio Y, How to Construct Deep Recurrent Neural Networks, arXiv:1312.6026. [Google Scholar]
- [3].Pathak J, Hunt B, Girvan M, Lu Z, and Ott E, Model-Free Prediction of Large Spatiotemporally Chaotic Systems from Data: A Reservoir Computing Approach, Phys. Rev. Lett 120, 024102 (2018). [DOI] [PubMed] [Google Scholar]
- [4].Vlachas PR, Pathak J, Hunt BR, Sapsis TP, Girvan M, Ott E, and Koumoutsakos P, Backpropagation Algorithms and Reservoir Computing in Recurrent Neural Networks for the Forecasting of Complex Spatiotemporal Dynamics, Neural Netw. 126, 191 (2020). [DOI] [PubMed] [Google Scholar]
- [5].Guastoni L, Srinivasan PA, Azizpour H, Schlatter P, and Vinuesa R, On the Use of Recurrent Neural Networks for Predictions of Turbulent Flows, arXiv:2002.01222. [Google Scholar]
- [6].Jozefowicz R, Zaremba W, and Sutskever I, An Empirical Exploration of Recurrent Network Architectures, Proc. Mach. Learn. Res 37, 2342 (2015). [Google Scholar]
- [7].Vogels TP, Rajan K, and Abbott LF, Neural Network Dynamics, Annu. Rev. Neurosci 28, 357 (2005). [DOI] [PubMed] [Google Scholar]
- [8].Ahmadian Y and Miller KD, What Is the Dynamical Regime of Cerebral Cortex?, arXiv:1908.10101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Kadmon J and Sompolinsky H, Transition to Chaos in Random Neuronal Networks, Phys. Rev. X 5, 041030 (2015). [Google Scholar]
- [10].Sussillo D and Abbott LF, Generating Coherent Patterns of Activity from Chaotic Neural Networks, Neuron 63, 544 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Laje R and Buonomano DV, Robust Timing and Motor Patterns by Taming Chaos in Recurrent Neural Networks, Nat. Neurosci 16, 925 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Hochreiter S and Schmidhuber J, Long Short-Term Memory, Neural Comput. 9, 1735 (1997). [DOI] [PubMed] [Google Scholar]
- [13].Mitchell SJ and Silver RA, Shunting Inhibition Modulates Neuronal Gain during Synaptic Excitation, Neuron 38, 433 (2003). [DOI] [PubMed] [Google Scholar]
- [14].Gütig R and Sompolinsky H, Time-Warp–Invariant Neuronal Processing, PLoS Biol. 7, e1000141 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Sompolinsky H, Crisanti A, and Sommers H-J, Chaos in Random Neural Networks, Phys. Rev. Lett 61, 259 (1988). [DOI] [PubMed] [Google Scholar]
- [16].Martí D, Brunel N, and Ostojic S, Correlations between Synapses in Pairs of Neurons Slow Down Dynamics in Randomly Connected Neural Networks, Phys. Rev. E 97, 062314 (2018). [DOI] [PubMed] [Google Scholar]
- [17].Schuessler F, Dubreuil A, Mastrogiuseppe F, Ostojic S, and Barak O, Dynamics of Random Recurrent Networks with Correlated Low-Rank Structure, Phys. Rev. Research 2, 013111 (2020). [Google Scholar]
- [18].Mastrogiuseppe F and Ostojic S, Linking Connectivity, Dynamics, and Computations in Low-Rank Recurrent Neural Networks, Neuron 99, 609 (2018). [DOI] [PubMed] [Google Scholar]
- [19].Stern M, Sompolinsky H, and Abbott LF, Dynamics of Random Neural Networks with Bistable Units, Phys. Rev. E 90, 062710 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [20].Aljadeff J, Stern M, and Sharpee T, Transition to Chaos in Random Networks with Cell-Type-Specific Connectivity, Phys. Rev. Lett 114, 088101 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [21].Schuecker J, Goedeke S, and Helias M, Optimal Sequence Memory in Driven Random Networks, Phys. Rev. X 8, 041029 (2018). [Google Scholar]
- [22].Brette R, Exact Simulation of Integrate-and-Fire Models with Synaptic Conductances, Neural Comput. 18, 2004 (2006). [DOI] [PubMed] [Google Scholar]
- [23].Amari S-I, Characteristics of Random Nets of Analog Neuron-Like Elements, IEEE Trans. Syst. Man Cybernet SMC-2, 643 (1972). [Google Scholar]
- [24].Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, and Bengio Y, Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation, arXiv:1406.1078. [Google Scholar]
- [25].Chalker JT and Mehlig B, Eigenvector Statistics in Non-Hermitian Random Matrix Ensembles, Phys. Rev. Lett 81, 3367 (1998). [Google Scholar]
- [26].Feinberg J and Zee A, Non-Hermitian Random Matrix Theory: Method of Hermitian Reduction, Nucl. Phys B504, 579 (1997). [Google Scholar]
- [27].Martin PC, Siggia E, and Rose H, Statistical Dynamics of Classical Systems, Phys. Rev. A 8, 423 (1973). [Google Scholar]
- [28].De Dominicis C, Dynamics as a Substitute for Replicas in Systems with Quenched Random Impurities, Phys. Rev. B 18, 4913 (1978). [Google Scholar]
- [29].Hertz JA, Roudi Y, and Sollich P, Path Integral Methods for the Dynamics of Stochastic and Disordered Systems, J. Phys. A 50, 033001 (2017). [Google Scholar]
- [30].Janssen H-K, On a Lagrangean for Classical Field Dynamics and Renormalization Group Calculations of Dynamical Critical Properties, Z. Phys. B 23, 377 (1976). [Google Scholar]
- [31].Crisanti A and Sompolinsky H, Path Integral Approach to Random Neural Networks, Phys. Rev. E 98, 062120 (2018). [Google Scholar]
- [32].Helias M and Dahmen D, Statistical Field Theory for Neural Networks (Springer, New York, 2020). [Google Scholar]
- [33].Mora T and Bialek W, Are Biological Systems Poised at Criticality?, J. Stat. Phys 144, 268 (2011). [Google Scholar]
- [34].Seung HS, Continuous Attractors and Oculomotor Control, Neural Netw. 11, 1253 (1998). [DOI] [PubMed] [Google Scholar]
- [35].Seung HS, How the Brain Keeps the Eyes Still, Proc. Natl. Acad. Sci. U.S.A 93, 13339 (1996). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [36].Seung HS, Lee DD, Reis BY, and Tank DW, Stability of the Memory of Eye Position in a Recurrent Network of Conductance-Based Model Neurons, Neuron 26, 259 (2000). [DOI] [PubMed] [Google Scholar]
- [37].Machens CK, Romo R, and Brody CD, Flexible Control of Mutual Inhibition: A Neural Model of Two-Interval Discrimination, Science 307, 1121 (2005). [DOI] [PubMed] [Google Scholar]
- [38].Chaudhuri R and Fiete I, Computational Principles of Memory, Nat. Neurosci 19, 394 (2016). [DOI] [PubMed] [Google Scholar]
- [39].Bialek W, Biophysics: Searching for Principles (Princeton University Press, Princeton, NJ, 2012). [Google Scholar]
- [40].Goldman MS, Memory without Feedback in a Neural Network, Neuron 61, 621 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [41].Maheswaranathan N, Williams A, Golub MD, Ganguli S, and Sussillo D, Reverse Engineering Recurrent Networks for Sentiment Classification Reveals Line Attractor dynAmics., in Advances in Neural Information Processing Systems, edited by Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, and Garnett R (Curran Associates, Inc., New York, 2019), Vol 32, p. 15696. [PMC free article] [PubMed] [Google Scholar]
- [42]. Farrell M, Recanatesi S, Moore T, Lajoie G, and Shea-Brown E, Recurrent Neural Networks Learn Robust Representations by Dynamically Balancing Compression and Expansion, bioRxiv 10.1101/564476.
- [43]. Molgedey L, Schuchhardt J, and Schuster HG, Suppressing Chaos in Neural Networks by Noise, Phys. Rev. Lett. 69, 3717 (1992).
- [44]. Rajan K, Abbott LF, and Sompolinsky H, Stimulus-Dependent Suppression of Chaos in Recurrent Neural Networks, Phys. Rev. E 82, 011903 (2010).
- [45]. Wainrib G and Touboul J, Topological and Dynamical Complexity of Random Neural Networks, Phys. Rev. Lett. 110, 118101 (2013).
- [46]. Sutskever I, Martens J, Dahl G, and Hinton G, On the Importance of Initialization and Momentum in Deep Learning, Proc. Mach. Learn. Res. 28, 1139 (2013).
- [47]. Legenstein R and Maass W, Edge of Chaos and Prediction of Computational Performance for Neural Circuit Models, Neural Netw. 20, 323 (2007).
- [48]. Jaeger H and Haas H, Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication, Science 304, 78 (2004).
- [49]. Toyoizumi T and Abbott LF, Beyond the Edge of Chaos: Amplification and Temporal Integration by Recurrent Networks in the Chaotic Regime, Phys. Rev. E 84, 051908 (2011).
- [50]. Since the Jacobian spectral density depends on correlation functions (see Appendix A), in the dynamical steady state the spectral density becomes time-translation invariant. In other words, the spectral density also reaches a steady-state distribution, so a snapshot of the spectral density at any given time has the same form. Instability then implies that the eigenvectors must evolve over time in order to keep the dynamics bounded. The timescale of this eigenvector evolution should correspond roughly to the correlation time implied by the DMFT. Within this window, spectral analysis of the Jacobian in the steady state gives a meaningful description of the range of timescales involved. Furthermore, we see empirically that this local structure is highly informative about the true dynamics, in particular for understanding the emergence of continuous attractors and marginal stability, as we discuss in Sec. IV. (A minimal numerical illustration of the snapshot picture is sketched after this reference list.)
- [51]. Can T, Krishnamurthy K, and Schwab DJ, Gating Creates Slow Modes and Controls Phase-Space Complexity in GRUs and LSTMs, Proc. Mach. Learn. Res. 107, 476 (2020).
- [52]. The continuous-time gated RNN we study in this paper is most closely related to the GRU architecture studied in Ref. [51].
- [53]. Eguíluz VM, Ospeck M, Choe Y, Hudspeth AJ, and Magnasco MO, Essential Nonlinearities in Hearing, Phys. Rev. Lett. 84, 5232 (2000).
- [54]. Eckmann J-P and Ruelle D, in The Theory of Chaotic Attractors (Springer, New York, 1985), pp. 273–312.
- [55]. For reference, we also supply a bound on the maximal Lyapunov exponent in Appendix F, showing that the relaxation time of the dynamics enters into an upper bound on λmax.
- [56]. Derrida B and Pomeau Y, Random Networks of Automata: A Simple Annealed Approximation, Europhys. Lett. 1, 45 (1986).
- [57]. Cessac B, Increase in Complexity in Random Neural Networks, J. Phys. I (France) 5, 409 (1995).
- [58]. One might worry that the h and σ(z) correlators are not separable in general. However, this issue arises only for large αz; for moderate αz, the separability assumption is valid.
- [59]. Fyodorov YV, Complexity of Random Energy Landscapes, Glass Transition, and Absolute Value of the Spectral Determinant of Random Matrices, Phys. Rev. Lett. 92, 240601 (2004).
- [60]. Fyodorov YV and Le Doussal P, Topology Trivialization and Large Deviations for the Minimum in the Simplest Random Optimization, J. Stat. Phys. 154, 466 (2014).
- [61]. Pereira J and Wang X-J, A Tradeoff between Accuracy and Flexibility in a Working Memory Circuit Endowed with Slow Feedback Mechanisms, Cereb. Cortex 25, 3586 (2015).
- [62]. Greff K, Srivastava RK, Koutník J, Steunebrink BR, and Schmidhuber J, LSTM: A Search Space Odyssey, IEEE Trans. Neural Netw. Learn. Syst. 28, 2222 (2017).
- [63]. In fact, the fixed-point phase diagrams for the current model and the GRU are in one-to-one correspondence. Importantly, what this static phase diagram lacks is region 3 in Fig. 7.
- [64]. Tallec C and Ollivier Y, Can Recurrent Neural Networks Warp Time?, arXiv:1804.11188.
- [65]. Muscinelli SP, Gerstner W, and Schwalger T, How Single Neuron Properties Shape Chaotic Dynamics and Signal Transmission in Random Neural Networks, PLoS Comput. Biol. 15, e1007122 (2019).
- [66]. Beiran M and Ostojic S, Contrasting the Effects of Adaptation and Synaptic Filtering on the Timescales of Dynamics in Recurrent Networks, PLoS Comput. Biol. 15, e1006893 (2019).
- [67]. Pereira U and Brunel N, Attractor Dynamics in Networks with Learning Rules Inferred from In Vivo Data, Neuron 99, 227 (2018).
- [68]. Bertschinger N and Natschläger T, Real-Time Computation at the Edge of Chaos in Recurrent Neural Networks, Neural Comput. 16, 1413 (2004).
- [69]. Legenstein R and Maass W, in New Directions in Statistical Signal Processing: From Systems to Brain, edited by Haykin S, Principe JC, Sejnowski TJ, and McWhirter J (The MIT Press, Cambridge, MA, 2006), p. 127.
- [70]. Boedecker J, Obst O, Lizier JT, Mayer NM, and Asada M, Information Processing in Echo State Networks at the Edge of Chaos, Theory Biosci. 131, 205 (2012).
- [71]. Geman S and Hwang C-R, A Chaos Hypothesis for Some Large Systems of Random Equations, Z. Wahrscheinlichkeitstheorie Verwandte Gebiete 60, 291 (1982).
- [72]. Strictly speaking, the state variables evolve according to dynamics governed by (and thus dependent on) the J’s. However, the local chaos hypothesis states that large random networks approach a steady state in which the state variables are independent of the J’s and are distributed according to their steady-state distribution. (A simple numerical check of this self-averaging is sketched after this reference list.)
- [73]. Sompolinsky H and Zippelius A, Relaxational Dynamics of the Edwards-Anderson Model and the Mean-Field Theory of Spin-Glasses, Phys. Rev. B 25, 6860 (1982).
- [74]. Sompolinsky H and Zippelius A, Dynamic Theory of the Spin-Glass Phase, Phys. Rev. Lett. 47, 359 (1981).
- [75]. Chow CC and Buice MA, Path Integral Methods for Stochastic Differential Equations, J. Math. Neurosci. 5, 8 (2015).
- [76]. Roy F, Biroli G, Bunin G, and Cammarota C, Numerical Implementation of Dynamical Mean Field Theory for Disordered Systems: Application to the Lotka–Volterra Model of Ecosystems, J. Phys. A 52, 484001 (2019).
- [77]. Geist K, Parlitz U, and Lauterborn W, Comparison of Different Methods for Computing Lyapunov Exponents, Prog. Theor. Phys. 83, 875 (1990).
- [78]. Engelken R, Wolf F, and Abbott L, Lyapunov Spectra of Chaotic Recurrent Neural Networks, arXiv:2006.02427.
- [79]. The local chaos hypothesis employed by Cessac [57] amounts to the same assumption.
- [80]. Strictly speaking, Oseledets theorem guarantees that λmax = limt→∞ (1/2t) log(‖χu‖²/‖u‖²) for almost every u. In particular, we can take u to be the all-ones vector. The term inside the log then becomes (1/N)Σi,j χij² + (1/N)Σi Σj≠k χij χik, and the second term is subleading in N, since the susceptibilities are random functions. This justifies Eq. (F1). (The expansion is written out after this reference list.)
- [81]. In previous work, g = 1 sets the critical value. The difference is simply due to the factor σr(0) = 1/2. The vanilla RNN result is recovered by sending βr → ∞.
- [82]. Ipsen JR and Peterson ADH, Consequences of Dale’s Law on the Stability-Complexity Relationship of Random Neural Networks, Phys. Rev. E 101, 052412 (2020).
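
Numerical sketch for footnote [50]. The following is a minimal illustration of the snapshot picture, using the classic additive rate network dh/dt = −h + Jφ(h) with φ = tanh rather than the full gated model (whose Jacobian has additional gate-dependent blocks; see Appendix A). The network size, gain g, and Euler integration are arbitrary choices made for the example.

```python
# Minimal sketch (footnote [50]): snapshot of the instantaneous Jacobian
# spectrum of a randomly connected additive rate network in its dynamical
# steady state. Illustrative only; parameter values are arbitrary.
import numpy as np

N, g = 1000, 1.5                 # network size and coupling gain
dt, steps = 0.05, 4000           # Euler step and number of relaxation steps
rng = np.random.default_rng(0)
J = g * rng.standard_normal((N, N)) / np.sqrt(N)  # i.i.d. Gaussian couplings

h = rng.standard_normal(N)
for _ in range(steps):           # relax to the dynamical steady state
    h += dt * (-h + J @ np.tanh(h))

# Instantaneous Jacobian at the snapshot: D = -I + J diag(phi'(h)),
# with phi'(h_j) = 1 - tanh(h_j)^2 multiplying column j of J.
D = -np.eye(N) + J * (1.0 - np.tanh(h) ** 2)[None, :]
eigs = np.linalg.eigvals(D)

# Because the spectral density is time-translation invariant in the steady
# state, a snapshot at any late time has the same form; eigenvalues with
# positive real part mark the locally unstable directions discussed above.
print("max Re(eig) at snapshot:", eigs.real.max())
```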
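Numerical sketch for footnote [72]. A quick check of the self-averaging implied by the local chaos hypothesis, again using the additive network for simplicity; all parameter values are arbitrary. Two independent draws of J should yield essentially the same steady-state population statistics.

```python
# Sketch (footnote [72]): steady-state single-unit statistics should not
# depend on the particular realization of the random couplings J.
import numpy as np

def steady_state_moments(seed, N=2000, g=1.5, dt=0.05, steps=4000):
    """Relax the additive rate network to steady state; return moments."""
    rng = np.random.default_rng(seed)
    J = g * rng.standard_normal((N, N)) / np.sqrt(N)
    h = rng.standard_normal(N)
    for _ in range(steps):
        h += dt * (-h + J @ np.tanh(h))
    return h.mean(), h.var()

# Independent coupling matrices give nearly identical population moments,
# consistent with the state variables decoupling from the J's at large N.
for seed in (1, 2):
    mean, var = steady_state_moments(seed)
    print(f"J realization {seed}: mean = {mean:+.3f}, variance = {var:.3f}")
```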
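Worked expansion for footnote [80]. Spelling out the step reconstructed in that note (our reading of the argument, not a quotation of Appendix F): with u the all-ones vector, ‖u‖² = N, and

```latex
% Expansion behind footnote [80]: u is the all-ones vector, so
\[
  \frac{\|\chi u\|^{2}}{\|u\|^{2}}
  = \frac{1}{N}\sum_{i}\Bigl(\sum_{j}\chi_{ij}\Bigr)^{2}
  = \underbrace{\frac{1}{N}\sum_{i,j}\chi_{ij}^{2}}_{\text{diagonal }(j=k)\text{ terms}}
  \;+\; \underbrace{\frac{1}{N}\sum_{i}\sum_{j\neq k}\chi_{ij}\,\chi_{ik}}_{\text{cross terms, subleading in }N}
\]
% The cross terms mix distinct random susceptibilities and average out,
% so \lambda_{\max} is governed by \sum_{i,j}\chi_{ij}^{2}, justifying Eq. (F1).
```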