Abstract
Recurrent neural networks (RNNs) are powerful dynamical models, widely used in machine learning (ML) and neuroscience. Prior theoretical work has focused on RNNs with additive interactions. However, gating, i.e., multiplicative interaction, is ubiquitous in real neurons and is also the central feature of the best-performing RNNs in ML. Here, we show that gating offers flexible control of two salient features of the collective dynamics: (i) timescales and (ii) dimensionality. The gate controlling timescales leads to a novel marginally stable state, where the network functions as a flexible integrator. Unlike previous approaches, gating permits this important function without parameter fine-tuning or special symmetries. Gates also provide a flexible, context-dependent mechanism to reset the memory trace, thus complementing the memory function. The gate modulating the dimensionality can induce a novel, discontinuous chaotic transition, where inputs push a stable system to strong chaotic activity, in contrast to the typically stabilizing effect of inputs. At this transition, unlike additive RNNs, the proliferation of critical points (topological complexity) is decoupled from the appearance of chaotic dynamics (dynamical complexity). The rich dynamics are summarized in phase diagrams, thus providing ML practitioners with a map for principled parameter-initialization choices.
Subject Areas: Interdisciplinary Physics, Nonlinear Dynamics, Statistical Physics
I. INTRODUCTION
Recurrent neural networks (RNNs) are powerful dynamical systems that can represent a rich repertoire of trajectories and are popular models in neuroscience and machine learning. In modern machine learning, RNNs are used to learn complex dynamics from data with rich sequential or temporal structure such as speech [1,2], turbulent flows [3–5], or text sequences [6]. RNNs are also influential in neuroscience as models to study the collective behavior of a large network of neurons [7] (and references therein). For instance, they have been used to explain the dynamics and temporally irregular fluctuations observed in cortical networks [8,9] and how the motor-cortex network generates movement sequences [10,11].
Classical RNN models typically involve units that interact with each other in an additive fashion—i.e., each unit integrates a weighted sum of the output of the rest of the network. However, researchers in machine learning have empirically found that RNNs with gating—a form of multiplicative interaction—can be trained to perform significantly more complex tasks than classical RNNs [6,12]. Gating interactions are also ubiquitous in real neurons due to mechanisms such as shunting inhibition [13]. Moreover, when single-neuron models are endowed with more realistic conductance dynamics, the effective interactions at the network level have gating effects, which confer robustness to time-warped inputs [14]. Thus, RNNs with gating interactions not only have superior information processing capabilities, but they also embody a prominent feature found in real neurons.
Prior theoretical work on understanding the dynamics and functional capabilities of RNNs has mostly focused on RNNs with additive interactions. The original work by Sompolinsky, Crisanti, and Sommers [15] identifies a phase transition in the autonomous dynamics of randomly connected RNNs from stability to chaos. Subsequent work extends this analysis to cases where the random connectivity additionally has correlations [16], a low-rank structured component [17,18], strong self-interaction [19], and heterogeneous variance across blocks [20]. The role of sparse connectivity and the single-neuron nonlinearity is studied in Ref. [9]. The effect of a Gaussian noise input is analyzed in Ref. [21].
In this work, we study the consequences of gating interactions on the dynamics of RNNs. We introduce a gated RNN model that naturally extends a classical RNN by augmenting it with two kinds of gating interactions: (i) an update gate that acts like an adaptive time constant and (ii) an output gate which modulates the output of a neuron. The choice of these forms for the gates is motivated by biophysical considerations (e.g., Refs. [14,22]) and retains the most functionally important aspects of the gated RNNs in machine learning. Our gated RNN reduces to the classical RNN [15,23] when the gates are open and is closely related to the state-of-the-art gated RNNs in machine learning when the dynamics are discretized [24]. We further elaborate on this connection in Sec. VIII.
We develop a theory for the gated RNN based on non-Hermitian random matrix techniques [25,26] and the Martin–Siggia–Rose–De Dominicis-Janssen (MSRDJ) formalism [21,27–32] and use the theory to map out, in a phase diagram, the rich, functionally significant dynamical phenomena produced by gating.
We show that the update gate produces slow modes and a marginally stable critical state. Marginally stable systems are of special interest in the context of biological information processing (e.g., Ref. [33]). Moreover, the network in this marginally stable state can function as a robust integrator—a function that is critical for memory capabilities in biological systems [34–37] but has been hard to achieve without parameter fine-tuning and handcrafted symmetries [38]. Gating permits the network to serve this function without any symmetries or fine-tuning. For a detailed discussion of these issues, we refer the reader to Ref. [39] (pp. 329–350) and Refs. [38,40]. Integratorlike dynamics are also empirically observed in gated machine learning (ML) RNNs successfully trained on complex sequential tasks [41]; our theory shows how gates allow for this robustly.
The output gate allows fine control over the dimensionality of the network activity; control of the dimensionality can be useful during learning tasks [42]. In certain regimes, this gate can mediate an input-driven chaotic transition, where static inputs can push a stable system abruptly to a chaotic state. This behavior with gating is in stark contrast to the typically stabilizing effect of inputs in high-dimensional systems [21,43,44]. The output gate also leads to a novel, discontinuous chaotic transition, where the proliferation of critical points (a static property) is decoupled from the appearance of chaotic transients (a dynamical property); this is in contrast to the tight link between the two properties in additive RNNs as shown by Wainrib and Touboul [45]. This transition is also characterized by a nontrivial state where a stable fixed point coexists with long chaotic transients. Gates also provide a flexible, context-dependent way to reset the state, thus providing a way to selectively erase the memory trace of past inputs.
We summarize these functionally significant phenomena in phase diagrams, which are also practically useful for ML practitioners—indeed, the choice of parameter initialization is known to be one of the most important factors deciding the success of training [46], with best outcomes occurring near critical lines [10,47–49]. Phase diagrams, thus, allow a principled and exhaustive exploration of dynamically distinct initializations.
II. A RECURRENT NEURAL NETWORK MODEL TO STUDY GATING
We study an extension of a classical RNN [15,23] by augmenting it with multiplicative gating interactions. Specifically, we consider two gates: (i) an update (or z) gate which controls the rate of integration and (ii) an output (or r) gate which modulates the strength of the output. The equations describing the gated RNN are given by
$$\frac{dh_i}{dt} = \sigma_z(z_i)\Big[-h_i + \sum_{j=1}^{N} J^h_{ij}\,\sigma_r(r_j)\,\phi(h_j) + I^h_i\Big], \qquad (1)$$
where hi represents the internal state of the ith unit and σ(·)(x) = [1 + exp(−α(·)x + β(·))]−1 are sigmoidal gating functions. The recurrent input to a neuron is Σj Jhij σr(rj)ϕ(hj), where Jhij are the coupling strengths between the units and ϕ(x) = tanh(ghx + βh) is the activation function. ϕ and σz,r are parametrized by gain parameters (gh, αz,r) and biases (βh,z,r), which constitute the parameters of the gated RNN. Finally, Ih represents external input to the network. The gating variables zi(t) and ri(t) evolve according to dynamics driven by the output ϕ[h(t)] of the network:
$$\tau_x\,\frac{dx_i}{dt} = -x_i + \sum_{j=1}^{N} J^x_{ij}\,\phi(h_j) + I^x_i, \qquad (2)$$
where x ∈ {z, r}. Note that the coupling matrices Jz,r for z, r are distinct from Jh. We also allow different inputs Ir and Iz to be fed to the gates. For instance, they might be zero, or they might be equal to Ih up to a scaling factor.
The value of σz(zi) can be viewed as a dynamical time constant for the ith unit, while the output gate σr(ri) modulates the output strength of unit i. In the presence of external input, the r gate can control the relative strengths of the internal (recurrent) activity and the external input Ih. In the limit σz, σr → 1, we recover the dynamics of the classical RNN.
We choose the coupling weights from a Gaussian distribution with variance scaled such that the input to each unit remains O(1). Specifically, Jxij ~ 𝒩(0, 1/N), drawn independently for x ∈ {h, z, r}. This choice of couplings is a popular initialization scheme for RNNs in machine learning [6,46] and also in models of cortical neural circuits [15,20]. If the gating variables are purely internal, then Jz,r is diagonal; however, we do not consider this case below. In the rest of the paper, we analyze the various dynamical regimes the gated RNN exhibits and their functional significance.
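A minimal simulation sketch of Eqs. (1) and (2) helps fix the setup (our construction, not the authors' code: simple Euler integration, zero biases and inputs, τz = τr = 1, and illustrative parameter values):

    import numpy as np

    def simulate_gated_rnn(N=1000, g_h=3.0, a_z=2.0, a_r=2.0,
                           T=200.0, dt=0.05, seed=0):
        """Euler integration of the gated RNN, Eqs. (1)-(2)."""
        rng = np.random.default_rng(seed)
        Jh, Jz, Jr = (rng.normal(0.0, 1.0 / np.sqrt(N), (N, N)) for _ in range(3))
        sig = lambda a, x: 1.0 / (1.0 + np.exp(-a * x))    # gating function, beta = 0
        h, z, r = (0.1 * rng.standard_normal(N) for _ in range(3))
        traj = []
        for _ in range(int(T / dt)):
            ph = np.tanh(g_h * h)                          # phi(h), beta_h = 0
            h = h + dt * sig(a_z, z) * (-h + Jh @ (sig(a_r, r) * ph))
            z = z + dt * (-z + Jz @ ph)
            r = r + dt * (-r + Jr @ ph)
            traj.append(h.copy())
        return np.array(traj)

The update gate σz(z) multiplies the whole right-hand side of the h equation, acting as a per-unit integration rate, while σr(r) multiplies each unit's output inside the recurrent sum.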
III. HOW THE GATES SHAPE THE LINEARIZED DYNAMICS
We first study the linearized dynamics of the gated RNN through the lens of the instantaneous Jacobian and describe how these dynamics are shaped by the gates. The instantaneous Jacobian describes the linearized dynamics about an operating point, and the eigenvalues of the Jacobian inform us about the timescales of growth and decay of perturbations and the local stability of the dynamics. As we show below, the spectral density of the Jacobian depends on equal-time correlation functions, which are the order parameters in the mean-field picture of the dynamics developed in Appendix C. We study how the gates shape the support and the density of Jacobian eigenvalues in the steady state, through their influence on the correlation functions.
The linearized dynamics in the tangent space at an operating point x = (h, z, r) is given by
$$\frac{d}{dt}\,\delta x = \mathcal{D}\,\delta x, \qquad (3)$$
where 𝒟 is the 3N × 3N-dimensional instantaneous Jacobian of the full network equations. Linearization of Eqs. (1) and (2) yields
$$\mathcal{D} = \begin{pmatrix} [\sigma_z]\big(J^h[\sigma_r\phi'] - \mathbb{1}\big) & [\sigma_z'\,\dot h/\sigma_z] & [\sigma_z]\,J^h[\sigma_r'\phi] \\ \tau_z^{-1}\,J^z[\phi'] & -\tau_z^{-1}\,\mathbb{1} & 0 \\ \tau_r^{-1}\,J^r[\phi'] & 0 & -\tau_r^{-1}\,\mathbb{1} \end{pmatrix}, \qquad (4)$$
where [x] denotes a diagonal matrix with the diagonal entries given by the vector x. The term [σz′ ḣ/σz] arises when we linearize about a time-varying state and is zero for fixed points. We introduce the additional shorthand ϕ′(t) = ϕ′(h(t)) and σx′(t) = σx′(x(t)) for x ∈ {z, r}.
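For concreteness, the instantaneous Jacobian of Eq. (4) can be assembled directly (a sketch under the same assumptions as the simulation above: zero biases and inputs, τz = τr = 1); its eigenvalues, e.g., from np.linalg.eigvals, are what Fig. 1 compares against the theory:

    import numpy as np

    def jacobian(h, z, r, Jh, Jz, Jr, g_h, a_z, a_r):
        """3N x 3N instantaneous Jacobian of Eqs. (1)-(2), cf. Eq. (4)."""
        N = h.size
        sig = lambda a, x: 1.0 / (1.0 + np.exp(-a * x))
        dsig = lambda a, x: a * sig(a, x) * (1.0 - sig(a, x))
        phi, dphi = np.tanh(g_h * h), g_h / np.cosh(g_h * h) ** 2
        sz, sr = sig(a_z, z), sig(a_r, r)
        hdot_over_sz = -h + Jh @ (sr * phi)     # = \dot h / sigma_z; zero at fixed points
        D = np.zeros((3 * N, 3 * N))
        D[:N, :N] = sz[:, None] * (Jh * (sr * dphi)[None, :]) - np.diag(sz)
        D[:N, N:2 * N] = np.diag(dsig(a_z, z) * hdot_over_sz)
        D[:N, 2 * N:] = sz[:, None] * (Jh * (dsig(a_r, r) * phi)[None, :])
        D[N:2 * N, :N] = Jz * dphi[None, :]
        D[N:2 * N, N:2 * N] = -np.eye(N)
        D[2 * N:, :N] = Jr * dphi[None, :]
        D[2 * N:, 2 * N:] = -np.eye(N)
        return D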
The Jacobian is a block-structured matrix involving random elements (Jz,h,r) and functions of various state variables. We need additional tools from non-Hermitian random matrix theory (RMT) [26] and dynamical mean-field theory (DMFT) [15] to analyze the spectrum of the Jacobian 𝒟. We provide a detailed, self-contained derivation of the calculations in Appendix C (DMFT) and Appendix A (RMT). Here, we state only the main results derived from these formalisms.
One of the main results is an analytical expression for the spectral curve, which describes the boundary of the Jacobian spectrum, in terms of the moments of the state variables. The most general expression for the spectral curve [Appendix A, Eq. (A34)] involves empirical averages over the 3N-dimensional state variables. However, for large N, we can appeal to a concentration of measure argument to replace these discrete sums with averages over the steady-state distribution from the DMFT (cf. Appendix C)—i.e., we can replace empirical averages of any function of the state variables (1/N) Σi F(hi, zi, ri) with 〈F[h(t), z(t), r(t)]〉, where the brackets indicate average over the steady-state distribution. The DMFT + RMT prediction for the spectral curve for a generic steady-state point is given in Appendix A, Eq. (A35). Strictly speaking, the analysis of the DMFT around a generic time-dependent steady state is complicated by the fact that the distribution for h is not Gaussian (while r and z are Gaussian). For fixed points, however, the distributions of h, z, and r are all Gaussian, and the expression for the spectral curve simplifies considerably. It is given by the set of λ ∈ ℂ which satisfy
$$\left\langle \frac{\sigma_z(z)^2\,\sigma_r(r)^2\,\phi'(h)^2}{|\lambda + \sigma_z(z)|^2} \right\rangle + \frac{\big\langle \sigma_r'(r)^2\,\phi(h)^2 \big\rangle}{|1 + \lambda\tau_r|^2}\,\left\langle \frac{\sigma_z(z)^2\,\phi'(h)^2}{|\lambda + \sigma_z(z)|^2} \right\rangle = 1. \qquad (5)$$
Here, the averages are taken over the Gaussian fixed-point distributions (h, z, r) ~ 𝒩(0, Δh,z,r) which follow from the MFT [Eq. (C26)]. For example, 〈ϕ′(h)2〉 = ∫𝒟x ϕ′(√Δh x)2, where 𝒟x denotes the standard Gaussian measure.
We make two comments on the Jacobian of a time-varying state: (i) One might wonder if any useful information can be gleaned by studying the Jacobian at a time-varying state where the Hartman-Grobman theorem is not valid. Indeed, as we see below, the limiting form of the Jacobian in steady state crucially informs us about the suppression of unstable directions and the emergence of slow dynamics due to pinching and marginal stability in certain parameter regimes (also see Ref. [50]). In other words, the instantaneous Jacobian charts the approach to marginal stability and provides a quantitative justification for the approximate integrator functionality exhibited in Sec. IV B. (ii) Interestingly, the spectral curve calculated using the MFT [Eq. (5)] for a time-varying steady state not deep in the chaotic regime is a very good approximation for the true spectral curve (see Fig. 8 in Appendix A).
Figures 1(a)–1(d) show that the RMT prediction of the spectral support (dark outline) agrees well with the numerically calculated spectrum (red dots) in different dynamical regimes. As a consequence of Eq. (5), we get a condition for the stability of the zero fixed point. The leading edge of the spectral curve for the zero fixed point (FP) crosses the origin when σr(0)ϕ′(0) = 1. So, in the absence of biases, gh > 2 makes the zero FP unstable. More generally, the leading edge of the spectrum crossing the origin gives us the condition for the FP to become unstable:
$$\big\langle \sigma_r(r)^2\,\phi'(h)^2 \big\rangle + \big\langle \sigma_r'(r)^2\,\phi(h)^2 \big\rangle\,\big\langle \phi'(h)^2 \big\rangle > 1. \qquad (6)$$
We see later on that the time-varying state corresponding to this regime is chaotic. We now proceed to analyze how the two gates shape the Jacobian spectrum via the equation for the spectral curve.
A. Update gate facilitates slow modes and output gate causes instability
To understand how each gate shapes the local dynamics, we study their effect on the density of Jacobian eigenvalues and the shape of the spectral support curve—the eigenvalues tell us about the rate of growth or decay of small perturbations and, thus, timescales in the local dynamics, and the spectral curve informs us about stability. For ease of exposition, we consider the case without biases in the main text (βr,z,h = 0); we discuss the role of biases in Appendix H.
Figure 1 shows how the gain parameters of the update and output gates—αz and αr, respectively—shape the Jacobian spectrum. In Figs. 1(a)–1(d), we see that αz has two salient effects on the spectrum: Increasing αz leads to (i) an accumulation of eigenvalues near zero and (ii) a pinching of the spectral curve for certain values of gh wherein the intercept on the imaginary axis gets smaller [Fig. 1(f); also see Sec. IVA]. In Figs. 1(a)–1(d), we also see that increasing the value of αr leads to an increase in the spectral radius, thus pushing the leading edge (max Reλi) to the right and thereby increasing the local dimensionality of the unstable manifold. This behavior of the linearized dynamics is also reflected in the nonlinear dynamics, where, as we show in Sec. V, αr has the effect of controlling the dimensionality of full phase-space dynamics.
The accumulation of eigenvalues near zero with increasing αz suggests the emergence of a wide spectrum of timescales in the local dynamics. To understand this accumulation quantitatively, it is helpful to consider the scenario where αz is large and we replace the tanh activation functions with a piecewise linear approximation. In this limit, the density of eigenvalues within a radius δ of the origin is well approximated by the following functional form (details in Appendix B):
$$\mu\big(\{\lambda : |\lambda| < \delta\}\big) \simeq c_0 + \frac{c_1}{\alpha_z}, \qquad (7)$$
where c0 and c1 are constants that, in general, depend on αr, δ, and gh. Figure 1(e) shows this scaling for a specific value of δ: The dashed line shows the predicted curve, and the circles indicate the actual eigenvalue density calculated using the full Jacobian. In the limit of αz → ∞, we get an extensive number of eigenvalues at zero, and the eigenvalue density converges to (see Appendix B)

$$\mu(\lambda) = (1 - f_z)\,\delta^{(2)}(\lambda) + f_z(1 - f_h)\,\delta^{(2)}(\lambda + 1) + \frac{f_z f_h}{\pi\rho^2}\,\mathbb{1}_{\{|\lambda + 1| \le \rho\}},$$

where fz = 〈σz(z)〉 is the fraction of update gates which are nonzero, fh is the fraction of unsaturated activation functions ϕ(h), and ρ is the radius of the bulk disk centered at λ = −1 (given below in Sec. IV A). For other choices of saturating nonlinearities, the extensive number of eigenvalues at zero remains; however, the expressions are more complicated. Analogous phenomena are observed for discrete-time gated RNNs in Ref. [51], using a similar combination of analytical and numerical techniques [52].
In Sec. VA, we show that the slow modes, as seen from linearization, persist asymptotically (i.e., in the nonlinear regime). This can be seen from the Lyapunov spectrum in Fig. 3(a), which for large αz exhibits an analogous accumulation of Lyapunov exponents near zero.
In the next section, we study the profound functional consequences of the combination of spectral pinching and accumulation of eigenvalues near zero.
IV. MARGINAL STABILITY AND ITS CONSEQUENCES
As the update gate becomes more switchlike (higher αz), we see an accumulation of slow modes and a pinching of the spectral curve which drastically suppresses the unstable directions. In the limit αz → ∞, this can make previously unstable points marginally stable by pinning the leading edge of the spectral curve exactly at zero. Marginally stable systems are of significant interest because of the potential benefits in information processing—for instance, they can generate long timescales in their collective modes [33,39]. Moreover, achieving marginal stability often requires fine-tuning parameters close to a bifurcation point. As we show, gating allows us to achieve a marginally stable critical state over a wide range of parameters; this has typically been highly nontrivial to achieve (e.g., Ref. [39], pp. 329–350, and Ref. [33]). We first investigate the conditions under which marginal stability arises, and then we touch on one of its important functional consequences: the appearance of “line attractors” which allow the system to be used as a robust integrator.
A. Condition for marginal stability
Marginal stability is a consequence of pinching of the spectral curve with increasing αz, wherein the (positive) leading edge of the spectrum and the intercept of the spectral curve on the imaginary axis both shrink with αz [e.g., Fig. 1(f) and compare Figs. 1(a) and 1(c)]. However, we see in Fig. 1(f) (via the intercept) that pinching does not happen if gh is sufficiently large (even as αz → ∞). Here, we provide the conditions when pinching can occur and, thus, marginal stability can emerge. For simplicity, let us consider the case where τr = 1 and there are no biases.
Marginal stability strictly exists only for αz = ∞. We first examine the conditions under which the system can become marginally stable in this limit, and then we explain the route to marginal stability for large but finite αz, i.e., how a time-varying state ends up as a marginally stable fixed point. For αz = ∞, the spectral density has an extensive number N[1 − 〈σz(z)〉] of zero eigenvalues, and the remaining eigenvalues are distributed in a disk centered at λ = −1 with radius ρ. If ρ < 1, then the spectral density has two topologically disconnected configurations (the disk and the zero modes) and the system is marginally stable. If ρ > 1, the zero modes get absorbed in the interior of the disk and the system is unstable with fast, chaotic dynamics. The radius ρ is given by ρ = [fz(Qϕ + Qr)]1/2, where Qϕ = 〈σr(r)2ϕ′(h)2〉 and Qr = 〈σr′(r)2ϕ(h)2〉〈ϕ′(h)2〉, with fz = 〈σz(z)〉. This follows from Eq. (5) by evaluating the z-expectation value assuming σz is a binary variable. Thus, the system is marginally stable in the limit αz = ∞ as long as
$$Q_\phi + Q_r \;=\; \big\langle \sigma_r^2\,\phi'^2 \big\rangle + \big\langle \sigma_r'^2\,\phi^2 \big\rangle\,\big\langle \phi'^2 \big\rangle \;<\; \big\langle \sigma_z(z) \big\rangle^{-1}. \qquad (8)$$
The crucial difference between this expression and Eq. (6) is that the rhs now has a factor of 〈σz〉−1 which can be greater than unity, thus pushing the transition to chaos further out along the gh and αr directions, as depicted in the phase diagram (Fig. 7). For concreteness, we report here how the transition changes at αr = 0. In this setting, the transition to chaos moves from gh = 2 to gh ⪅ 6.2, and the system is marginally stable for 2 < gh ⪅ 6.2.
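The numbers quoted here are easy to reproduce (a sketch under our reading of the fixed-point mean-field theory, not the authors' code: at αr = 0 and zero biases, σr = 1/2 and 〈σz〉 = 1/2, and Δh solves Δh = 〈σr²〉〈ϕ²〉 self-consistently):

    import numpy as np

    def gauss_avg(f, var, n=201):
        """<f(x)> for x ~ N(0, var), via probabilists' Gauss-Hermite quadrature."""
        x, w = np.polynomial.hermite_e.hermegauss(n)
        return np.sum(w * f(np.sqrt(var) * x)) / np.sqrt(2.0 * np.pi)

    def rho_squared(g_h, iters=500):
        """Disk radius squared, rho^2 = <sigma_z><sigma_r^2 phi'^2>, at the
        self-consistent fixed point; marginal stability requires rho < 1."""
        phi2 = lambda x: np.tanh(g_h * x) ** 2
        dphi2 = lambda x: (g_h / np.cosh(g_h * x) ** 2) ** 2
        Delta = 0.2
        for _ in range(iters):
            Delta = 0.25 * gauss_avg(phi2, Delta)    # Delta_h = <sigma_r^2><phi^2>
        return 0.5 * 0.25 * gauss_avg(dphi2, Delta)

    for g in (5.8, 6.0, 6.2, 6.4):
        print(g, rho_squared(g))   # rho^2 crosses 1 near g_h ~ 6.2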
Having identified the region in the phase diagram that can be made marginally stable for αz = ∞, we can now discuss the route to marginal stability for large but finite αz. In other words, how does an unstable chaotic state become marginally stable with increasing αz? Since the marginally stable region is characterized by a disconnected spectral density, evidently increasing αz must lead to singular behavior in the spectral curve. This takes the form of a pinching at the origin. We show that, for values of gh supporting marginal stability, the leading edge λe of the spectrum for the time-varying state gets pinched exponentially fast with αz as λe ~ e−cαz, with c a positive constant (see Appendix B). This accounts for the fact that, already for αz = 15, we observe the pinching in Fig. 1(c). In contrast, the parameters in Fig. 1(d) lie outside the marginally stable region, and, thus, there is no pinching, since the zero modes are asymptotically (in αz) buried in the bulk of the spectrum.
In summary, as αz → ∞ the Jacobian spectrum undergoes a topological transition from a single simply connected domain to two domains, both containing an extensive number of eigenvalues. A finite fraction of eigenvalues end up sitting exactly at zero, while the rest occupy a finite circular region. If the leading edge of the circular region crosses zero in this limit, then the state remains unstable; otherwise, the state becomes marginally stable. The latter case is achieved through a gradual pinching of the spectrum near zero; there is no pinching in the former case.
We emphasize that marginal stability requires more than just an accumulation of eigenvalues near zero. Indeed, this happens even when gh is outside the range supporting marginal stability as αz → ∞, but there is no pinching and the system remains unstable [e.g., see Fig. 1(d)]. We return to this when we describe the phase diagram for the gated RNN (Sec. VII). There, we see that the marginally stable region occupies a macroscopic volume in the parameter space adjoining the critical lines on one side.
B. Functional consequences of marginal stability
The marginally stable critical state produced by gating can subserve the function of a robust integrator. This integratorlike function is crucial for a variety of computational functions such as motor control [34–36], decision making [37], and auditory processing [53]. However, achieving this function has typically required fine-tuning or special handcrafted architectures [38], but gating permits the integrator function over a range of parameters and without any specific symmetries in Jh,z,r. Specifically, for large αz, any perturbation in the span of the eigenvectors corresponding to the eigenvalues with a magnitude close to zero is integrated by the network, and, once the input perturbation ceases, the memory trace of the input is retained for a duration much longer than the intrinsic time constant of the neurons; perturbations along other directions, however, relax with a spectrum of timescales dictated by the inverse of (the real part of) their eigenvalues. Thus, the manifold of slow directions forms an approximate continuous attractor on which input can effortlessly move the state vector around. These approximate continuous attractor dynamics are illustrated in Fig. 2. At time t = 0, an input Ih (with Ir = Iz = 0) is applied till t = 10 (between dashed vertical lines) along an eigenvector of the Jacobian with an eigenvalue close to zero. Inputs along this slow manifold with varying strengths (different shades of red) are integrated by the network as evidenced by the excess projection of the network activity on the left eigenvector uλ corresponding to the slow mode; on the other hand, inputs not aligned with the slow modes decay away quickly (dashed black line). Recall that the intrinsic time constant of the neurons here is set to one unit. The exponentially fast (in αz) pinching of the spectral curve (discussed above in Sec. IVA) suggests this slow-manifold behavior should also hold for moderately large αz (as in Fig. 2). Therefore, even though the state is technically unstable, the local structure of the Jacobian is responsible for giving rise to extremely long timescales and allows the network to operate as an approximate integrator within relatively long windows of time, as demonstrated in Fig. 2.
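A sketch of this probe (our construction, reusing the simulation and Jacobian sketches above; the pulse amplitude and window are illustrative):

    import numpy as np

    def integrator_probe(h, z, r, Jh, Jz, Jr, D, g_h, a_z, a_r,
                         amp=0.5, t_on=10.0, T=50.0, dt=0.05):
        """Drive h along a near-zero mode of the Jacobian D and return the
        projection of h(t) onto the matching left eigenvector (cf. Fig. 2)."""
        sig = lambda a, x: 1.0 / (1.0 + np.exp(-a * x))
        N = h.size
        w, V = np.linalg.eig(D)
        k = np.argmin(np.abs(w))                 # eigenvalue closest to zero
        v = V[:N, k].real                        # h-part of the slow right-eigenvector
        u = np.linalg.inv(V)[k, :N].real         # matching left eigenvector
        Ih = amp * v / np.linalg.norm(v)
        proj = []
        for step in range(int(T / dt)):
            ph = np.tanh(g_h * h)
            drive = Ih if step * dt < t_on else 0.0
            h = h + dt * sig(a_z, z) * (-h + Jh @ (sig(a_r, r) * ph) + drive)
            z = z + dt * (-z + Jz @ ph)
            r = r + dt * (-r + Jr @ ph)
            proj.append(u @ h)
        return np.array(proj)   # stays elevated long after the pulse ends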
Of course, after sufficiently long times, the instability causes the state to evolve and the memory is lost. Exactly how long the memory lasts depends on the asymptotic stability of the network, which is revealed by the Lyapunov spectrum, discussed below in Sec. VA.
V. OUTPUT GATE CONTROLS DIMENSIONALITY AND LEADS TO A NOVEL CHAOTIC TRANSITION
Thus far, we have used insights from local dynamics to study the functional consequences of the gates. To study the salient features of the output gate, it is useful to analyze the effect of inputs and the long-time behavior of the network through the lens of Lyapunov spectra. We see that the output gate controls the dimensionality of the dynamics in the phase space; dimensionality is a salient aspect of the dynamics for task function [42]. The output gate also gives rise to a novel discontinuous chaotic transition, near which inputs (even static ones) can abruptly push a stable system into strongly chaotic behavior—contrary to the typically stabilizing effect of inputs. Below, we begin with the Lyapunov analyses of the dynamics and then proceed to study the chaotic transition.
A. Long-time behavior of the network
We study the asymptotic behavior of the network and the nature of the time-varying state through the lens of its Lyapunov spectra. In this section, where we study the effects of αz, our results are numerical except in cases where αz = 0 [e.g., in Fig. 3(d)]. Lyapunov exponents specify how infinitesimal perturbations δx(t) grow or shrink along the trajectories of the dynamics—in particular, if the growth or decay is exponentially fast, then the rate is dictated by the maximal Lyapunov exponent, defined as [54] λmax ≔ limT→∞ T−1 lim‖δx(0)‖→0 ln[‖δx(T)‖/‖δx(0)‖]. More generally, the set of all Lyapunov exponents—the Lyapunov spectrum—yields the rates at which perturbations along different directions shrink or diverge and, thus, provides a fuller characterization of the asymptotic behavior. We first numerically study how the gates shape the full Lyapunov spectrum (details in Appendix D) and then derive an analytical prediction for the maximum Lyapunov exponent using the DMFT (Sec. VA1) [55].
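The full spectrum is computed with the standard reorthonormalization scheme; a generic sketch (ours, with a forward-Euler tangent update; Appendix D describes the procedure actually used):

    import numpy as np

    def lyapunov_spectrum(step, jac, x0, n_exp, dt=0.05, n_steps=40000, seed=0):
        """Leading n_exp Lyapunov exponents by repeated QR decomposition.
        `step(x)` advances the state by dt; `jac(x)` is the instantaneous Jacobian."""
        rng = np.random.default_rng(seed)
        x = x0.copy()
        Q = np.linalg.qr(rng.standard_normal((x.size, n_exp)))[0]
        lams = np.zeros(n_exp)
        for _ in range(n_steps):
            x = step(x)
            Q, R = np.linalg.qr(Q + dt * (jac(x) @ Q))   # evolve and reorthonormalize
            lams += np.log(np.abs(np.diag(R)))
        return np.sort(lams / (n_steps * dt))[::-1]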
Figures 3(a) and 3(b) show how the update (z) and output (r) gates shape the Lyapunov spectrum. We see that, as the update gets more sensitive (larger αz), the Lyapunov spectrum flattens, pushing more exponents closer to zero, generating long timescales. As the output gate becomes more sensitive (larger αr), all Lyapunov exponents increase, thus increasing the rate of growth in unstable directions.
We can estimate the dimensionality of the activity in the chaotic state by calculating an upper bound DA on the dimension according to a conjecture by Kaplan and Yorke [54]. The Kaplan-Yorke upper bound for the attractor dimension DA is given by
$$D_A = k + \frac{\sum_{i=1}^{k} \lambda_i}{|\lambda_{k+1}|}, \qquad (9)$$
where λi are the rank-ordered Lyapunov exponents and k is the largest index such that Σi=1..k λi ≥ 0. We see in Fig. 3(c) that the sensitivity of the output gate (αr) can shape the dimensionality of the dynamics—a more sensitive output gate leads to higher dimensionality. As we see below, this effect of the output gate is different from how the gain gh shapes dimensionality and can lead to a novel chaotic transition. Even more directly, if the r gate for neurons i1…iK is set to zero, then the activity is constrained to evolve in an (N − K)-dimensional subspace; however, the r gate allows the possibility—i.e., the “inductive bias”—of doing this dynamically.
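Equation (9) translates directly into code (a minimal sketch; k follows the convention just stated):

    import numpy as np

    def kaplan_yorke_dimension(lams):
        """Attractor-dimension bound D_A of Eq. (9) from Lyapunov exponents."""
        lams = np.sort(np.asarray(lams))[::-1]    # rank-ordered, largest first
        csum = np.cumsum(lams)
        if csum[0] < 0:
            return 0.0                            # stable: no attractor dimension
        k = int(np.max(np.where(csum >= 0)[0])) + 1
        if k == lams.size:
            return float(k)
        return k + csum[k - 1] / abs(lams[k])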
1. DMFT prediction for λmax
We would also like to study the chaotic nature of the time-varying phase by means of the maximal Lyapunov exponent and characterize when the transition to chaos occurs. We extend the DMFT for the gated RNN to calculate the maximum Lyapunov exponent, and, to do this, we make use of a technique suggested by Refs. [56,57] and clearly elucidated in Ref. [21]. The details are provided in Appendix E, and the end result of the calculation is the DMFT prediction for λmax as the solution to a generalized eigenvalue problem for κ involving the correlation functions of the state variables:
$$\Big[(1 + 2\kappa)^2 - 4\,\partial_\tau^2\Big]\,\chi_h(\tau) = \frac{\delta C_{\sigma_r\phi}(\tau)}{\delta C_h(\tau)}\,\chi_h(\tau) + \frac{\delta C_{\sigma_r\phi}(\tau)}{\delta C_r(\tau)}\,\chi_r(\tau), \qquad (10)$$
$$\Big[(1 + \kappa\tau_r)^2 - \tau_r^2\,\partial_\tau^2\Big]\,\chi_r(\tau) = \frac{\delta C_{\phi}(\tau)}{\delta C_h(\tau)}\,\chi_h(\tau), \qquad (11)$$
where we denote the two-time correlation function Cx(t, t′) ≡ 〈x(t)x(t′)〉 for different (functions of) state variables x(t) [see Eq. (C25) for more context], and χh(τ), χr(τ) denote the stationary correlation functions of the tangent-space perturbations δh, δr. The largest eigenvalue solution to this problem is the required maximal Lyapunov exponent [58]. Note that this is the analog of the Schrödinger equation for the maximal Lyapunov exponent in the vanilla RNN. When αz = 0 (or small), the h field is Gaussian, and we can use Price’s theorem for Gaussian integrals to replace the variational derivatives on the rhs of Eqs. (10) and (11) by simple correlation functions, for instance, ∂Cϕ(τ)/∂Ch(τ) = Cϕ′(τ). In this limit, we see good agreement between the numerically calculated maximal Lyapunov exponent [Fig. 3(c), dots] compared to the DMFT prediction [Fig. 3(c), solid line] obtained by solving the eigenvalue problem [Eqs. (10) and (11)]. For large values of αz, we see quantitative deviations between the DMFT prediction and the true λmax. Indeed, for large αz, the distribution of h is strongly non-Gaussian, and there is no reason to expect that variational formulas given by Price’s theorem are even approximately correct. For more on this point, see the discussion toward the end of Appendix C.
2. Condition for continuous transition to chaos
The value of αz affects the precise value of the maximal Lyapunov exponent λmax; however, numerics suggest that, across a continuous transition to chaos, the point at which λmax becomes positive is not dependent on αz (data not shown). We can see this more clearly by calculating the transition to chaos when the leading edge of the spectral curve (for a FP) crosses zero. This condition is given by Eq. (6), and we see that it has no dependence on αz or the update gate. We stress that this condition [Eq. (6)] for the transition to chaos—when the stable fixed point becomes unstable—is valid when the chaotic attractor emerges continuously from the fixed point [Fig. 3(c), αr = 0, 2]. However, in the gated RNN, there is another discontinuous transition to chaos [Fig. 3(c), αr = 20]: For sufficiently large αr, the transition to chaos is discontinuous and occurs at a value of gh where the zero FP is still stable (gh < 2 with no biases). To our knowledge, this is a novel type of transition which is not present in the vanilla RNN and not visible from an analysis that considers only the stability of fixed points. We characterize this phenomenon in detail below.
B. Output gate induces a novel chaotic transition
Here, we describe a novel phase, characterized by a proliferation of unstable fixed points and the coexistence of a stable fixed point with chaotic dynamics. It is the appearance of this state that gives rise to the discontinuous transition observed in Fig. 3(c). The appearance of this state is mediated by the output gate becoming more switchlike (i.e., increasing αr) in the quiescent region for gh. To our knowledge, no such comparable phenomenon exists in RNNs with additive interactions. The full details of the calculations for this transition are provided in Appendix G. Here, we simply state and describe the salient features. For ease of presentation, the rest of the section assumes that all biases are zero. The results in this section are strictly valid only for αz = 0. In Appendix G3, we argue that they should also hold for moderate αz.
This discontinuous transition is characterized by a few noteworthy features.
1. Spontaneous emergence of fixed points
When gh < 2.0, the zero fixed point is stable. Moreover, for √2 < gh < 2, when αr crosses a threshold value αrFP(gh), unstable fixed points spontaneously appear in the phase space. The only dynamical signatures of these unstable FPs are short-lived transients which do not scale with system size (see Fig. 11). Thus, we have a condition for the fixed-point transition:
$$\Delta_h = F(\Delta_h) \quad\text{and}\quad \frac{dF}{d\Delta_h} = 1, \qquad F(\Delta_h) \equiv \big\langle \sigma_r(r)^2 \big\rangle_{\Delta_r}\,\big\langle \phi(h)^2 \big\rangle_{\Delta_h}, \quad \Delta_r = \big\langle \phi(h)^2 \big\rangle_{\Delta_h}, \qquad (12)$$
i.e., the condition that a nonzero solution of the time-independent MFT first appears (tangentially). These unstable fixed points correspond to the emergence of nontrivial solutions to the time-independent MFT. Figure 4(a) shows the appearance of fixed-point MFT solutions for a fixed gh, and Fig. 4(b) shows the critical value αrFP as a function of gh. As gh → 2−, we see that αrFP approaches a finite limiting value.
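A quick numerical rendering of Fig. 4(a) (a sketch under our reading of the time-independent MFT, with h and r independent zero-mean Gaussians at a site, Δh = 〈σr²〉〈ϕ²〉 and Δr = 〈ϕ²〉; the printed threshold location is illustrative):

    import numpy as np

    def mft_map(Delta_h, g_h, a_r, n=201):
        """One iteration Delta_h -> <sigma_r^2><phi^2> of the fixed-point MFT."""
        x, w = np.polynomial.hermite_e.hermegauss(n)
        w = w / np.sqrt(2.0 * np.pi)
        phi2 = np.sum(w * np.tanh(g_h * np.sqrt(Delta_h) * x) ** 2)   # = Delta_r
        sr2 = np.sum(w / (1.0 + np.exp(-a_r * np.sqrt(phi2) * x)) ** 2)
        return sr2 * phi2

    g_h = 1.8          # below the classic transition at g_h = 2
    for a_r in (2.0, 10.0, 30.0):
        Delta = 1.0
        for _ in range(2000):
            Delta = mft_map(Delta, g_h, a_r)
        print(a_r, Delta)   # flows to 0 below alpha_r^FP, to a nonzero value above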
These spontaneous MFT fixed-point solutions are unstable according to the criterion Eq. (6) derived from RMT. Moreover, in Appendix J, using a Kac-Rice analysis, we show that in this region the full 3N-dimensional system does indeed have a number of unstable fixed points that grows exponentially fast with N. Thus, this transition line represents a topological trivialization transition as conceived by, e.g., Refs. [59,60]. Our analysis shows that instability is intimately connected to the proliferation of fixed points. Remarkably, however, a time-dependent solution to the DMFT does not emerge across this transition, and the microscopic dynamics are insensitive to the transition in topological complexity, bringing us to the next point.
2. A delayed dynamical transition that shows a decoupling between topological and dynamical complexity
On increasing αr beyond αrFP, there is a second transition when αr crosses a critical value αrDT > αrFP. This happens when we satisfy the condition for the dynamical transition:
(13)
derived in Appendix G2. Figure 4(c) shows how αrDT varies with gh. As gh → 2−, we see that αrDT also tends to a finite value. Across this transition, a dynamical state spontaneously emerges, and the maximum Lyapunov exponent jumps from a negative value to a positive value [Fig. 4(d)]. This state exhibits chaotic dynamics that coexist with the stable zero fixed point. The presence of the stable FP means that the dynamical state is not strictly a chaotic attractor but rather a spontaneously appearing “chaotic set.” On increasing gh beyond 2.0, for large but fixed αr, the stable fixed point disappears, and the state smoothly transitions into the full chaotic attractor characterized above. This picture is summarized in the schematic in Fig. 4(e). This gap between the proliferation of unstable fixed points and the appearance of the chaotic dynamics differs from the result of Wainrib and Touboul [45] for purely additive RNNs, where the proliferation (topological complexity) is tightly linked to the chaotic dynamics (dynamical complexity). Thus, for gated RNNs, there appears to be another distinct mechanism for the transition to chaos, and the accompanying transition is a discontinuous one.
3. Long chaotic transients
For finite systems, across the transition the dynamics eventually flow into the zero FP after chaotic transients. Moreover, we expect this transient time to scale with the system size, and, in the infinite system size limit, the transient time should diverge in spite of the fact that the stable fixed point still exists. This is because the relative volume of the basin of attraction of the fixed point vanishes as N → ∞. In Appendix G [Figs. 11(c) and 11(d)], we do indeed see that the transient time for a fixed gh scales with system size [Fig. 11(c)] once αr is above the second transition (dashed line) and not otherwise [see Figs. 11(a) and 11(e), dashed lines].
4. An input-induced chaotic transition
The discontinuous chaotic transition has a functional consequence: Near the transition, static inputs can push a stable system to strong chaotic activity. This is in contrast to the typically stabilizing effects of inputs on the activity of random additive RNNs [21,43,44]. In Figs. 5(a) and 5(b), we see that, when a static input of nonzero variance is applied to a stable system (a) near the discontinuous chaotic transition (in region 2 in Fig. 7), it induces chaotic activity (b); however, when the same input is applied to the system in the chaotic state [Fig. 5(c)], the dynamics are stabilized (d), as reported before.
This phenomenon for static inputs can be understood using the phase diagram with nonzero biases, discussed in Sec. VII. There, we see how the transition curves move when a random bias βh is included. Near the classic chaotic transition (αr = 0), the bias moves the curve toward larger gh, thus suppressing chaos. Near the discontinuous chaotic transition, the bias pulls the curve toward smaller values of αr, thus promoting chaos. Thus, inputs can have opposite effects of inducing or stabilizing chaos within the same model in different parameter regimes. This phenomenon could, in principle, be leveraged for shaping the interaction between inputs and internal dynamics.
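The effect can be probed with a two-trajectory estimate of λmax (a generic sketch, ours; `step(x, t)` should advance the full state by dt, with the static input either included or set to zero):

    import numpy as np

    def lambda_max(step, x0, n_steps=20000, dt=0.05, eps=1e-7, seed=1):
        """Maximal Lyapunov exponent from two nearby trajectories (Benettin)."""
        rng = np.random.default_rng(seed)
        x = x0.copy()
        y = x0 + eps * rng.standard_normal(x0.size)
        acc = 0.0
        for i in range(n_steps):
            x, y = step(x, i * dt), step(y, i * dt)
            d = np.linalg.norm(y - x)
            acc += np.log(d / eps)
            y = x + (eps / d) * (y - x)     # rescale the separation back to eps
        return acc / (n_steps * dt)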
VI. GATES PROVIDE A FLEXIBLE RESET MECHANISM
Here, we discuss how the gates provide another critical function—a mechanism to flexibly reset the memory trace depending on external input or the internal state. This function complements the memory function; a memory that cannot be erased when needed is not very useful. To build intuition, let us consider a linear network ḣ = −h + Jh, where the matrix has a few eigenvalues that are zero, while the rest have a negative real part. The slow modes are good for memory function; however, that fact also makes it hard to forget memory traces along the slow modes. This trade-off is pointed out in Ref. [61]. To be functionally useful, it is critical that the memory trace can be erased flexibly in a context-dependent manner. The r gate allows this function naturally. Consider the same network, but now augmented with an r gate: ḣ = −h + J(σr ⊙ h). If the gate is turned off (σr = 0) for a short duration, the state h is reset to zero. One can actually be more specific: We may choose a Jr with σr = σ[Jrϕ(h)] such that the r gate turns off whenever ϕ(h) becomes aligned with a particular direction u (e.g., a rank-one Jr ∝ −1u⊤), thus providing an internal-context-dependent reset.
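A toy rendering of this reset (our construction) with a single, externally controlled global gate:

    import numpy as np

    # Linear net hdot = -h + J (s * h) with gate s(t); closing the gate for a
    # few time constants erases the state (||h|| ~ e^{-t} while s = 0).
    rng = np.random.default_rng(0)
    N, dt = 200, 0.01
    J = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N))
    h = rng.standard_normal(N)
    norms = []
    for step in range(int(8.0 / dt)):
        t = step * dt
        s = 0.0 if 3.0 <= t < 4.0 else 1.0   # gate closed on 3 <= t < 4
        h = h + dt * (-h + s * (J @ h))
        norms.append(np.linalg.norm(h))      # drops sharply in the closed window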
Apart from resetting to zero, the z gate also allows the possibility of rapidly scrambling the state to a random value by means of the input-induced chaos. This phenomenon is illustrated in Fig. 6, where the network in the marginally stable state normally functions as a memory (retains traces for long times, as in Fig. 2), but positive inputs Iz (with Ih = Ir = 0) to the z gate above a threshold strength, even for a short duration, can induce chaos, thereby scrambling the state and erasing the previous memory state (Fig. 6, bottom panel). The mechanism for this scrambling can be understood by appealing to Eq. (8). A finite input Iz with nonzero mean is able to change 〈σz(z)〉 and, thus, push the critical line for marginal stability one way or the other. For instance, if 〈Iz〉 > 0, then 〈σz(z)〉 > 1/2, which (for αr = 0) moves the transition to marginal stability to a smaller value of gh. This implies that a marginally stable state can be made chaotic in the presence of Iz with finite mean. This mechanism for input-induced chaos appears to be different from that explored in the previous section, which occurs across the discontinuous chaotic transition. We discuss this more in Sec. VII.
In summary, gating imbues the RNN with the capacity to flexibly reset memory traces, providing an “inductive bias” for context-dependent reset. The specific method of reset depends on the task or function, and this can be selected, e.g., by gradient-based training. This inductive bias for resetting is found to be critical for performance in ML tasks [62].
VII. PHASE DIAGRAMS FOR THE GATED NETWORK
Here, we summarize the rich dynamical phases of the gated RNN and the critical lines separating them. The key parameters determining the critical lines and the phase diagram are the activation and output-gate gains and the associated biases: (gh, βh, αr, βr). The update gate does not play a role in determining continuous or critical chaotic transitions. On the other hand, it influences the discontinuous transition to chaos for sufficiently large values of αz (see Appendix G3 for discussion). Furthermore, the update gate has a strong effect on the dynamical aspects of the states near the critical lines. There are macroscopic regions of the parameter space adjacent to the critical lines where the states can be made marginally stable in the limit of αz → ∞. The shape of this marginal stability region is influenced by βz and Iz.
Figure 7(a) shows the dynamical phases for the network with no biases in the (gh, αr) plane. When gh is below 2.0 and αr is below the fixed-point bifurcation line, the zero fixed point is the only solution (region 1). As discussed in Sec. VB, on crossing the fixed-point bifurcation line [green line, Fig. 7(a)], there is a spontaneous proliferation of unstable fixed points in the phase space (region 2). This can occur only when gh > √2. The proliferation of fixed points is not accompanied by any obvious dynamical signatures. However, if gh is larger still, we can increase αr further to cross a second discontinuous transition where a dynamical state spontaneously appears featuring the coexistence of chaotic activity and a stable fixed point (region 3). When gh is increased beyond the critical value of 2.0, the stable zero fixed point becomes unstable for all αr, and we get a chaotic attractor (region 4). All the critical lines are determined by gh and αr, and αz has no explicit role; however, for large αz there is a large region of the parameter space on the chaotic side of the chaotic transition that can be made marginally stable [hatched region 5 in Fig. 7(a)].
A. Role of biases and static inputs
Biases have the effect of generating nontrivial fixed points and controlling stability by moving the edge of the spectral curve. Another key feature of biases is the suppression of the discontinuous bifurcation transition observed without biases. This is explained in detail in Appendix H. A particularly illuminating illustration of the effects of a bias can be inferred from the critical line (red dashed) for finite bias shown in Fig. 7. This curve, computed using the FP stability criterion (6) combined with the MFT equations [(C28)–(C30)], represents the transition between stability and chaos for finite bias with zero mean and nonzero variance. Equivalently, we may think of this as the critical line for a network with static input (with Ir = Iz = 0). Along the gh axis, we can observe the well-documented phenomenon whereby an input suppresses chaos. This corresponds to the region gh > 2 which lies to the left of the red dashed critical line; it is chaotic in the absence of input and flows to a stable fixed point in the presence of input. However, this behavior is reversed for gh < 2. Here, we see a significant swath of phase space which is stable in the absence of input but which becomes chaotic when input is present. Thus, the stability-to-chaos phase boundary in the presence of biases (or inputs) reveals that the output (r) gate can facilitate an input-induced transition to chaos.
VIII. DISCUSSION
Gating is a form of multiplicative interaction that is a central feature of the best-performing RNNs in machine learning, and it is also a prominent feature of biological neurons. Prior theoretical work on RNNs has largely considered only RNNs with additive interactions. Here, we present the first detailed study of the consequences of gating for RNNs and show that gating can produce dramatically richer behavior with significant functional benefits.
The continuous-time gated RNN (gRNN) we study resembles a popular model used in machine learning applications, the gated recurrent unit (GRU) [see the note below Eq. (C27)]. Previous work [51] looks at the instantaneous Jacobian spectrum for the discrete-time GRU using RMT methods similar to those presented in Appendix A; however, this work does not go beyond time-independent MFT and presents a phase diagram showing only boundaries across which the MFT fixed points become unstable [63]. In the present manuscript, we illuminate the full dynamical phase diagram for our gated RNN, uncovering much richer structure. Both the GRU and our gRNN have a gating function which dynamically scales the time constant, which in both cases leads to a marginally stable phase in the limit of a binary gate. However, the dynamical phase diagram presented in Fig. 7 reveals a novel discontinuous transition to chaos. We conjecture that such a phase transition should also be present in the GRU. Also, Ref. [51] lacks any discussion of the influence of inputs or biases. The present paper includes discussion of the functional significance of the gates in the presence of inputs. We believe these results, combined with the refined dynamical phase diagram, can shed some light on the role of analogous gates in the GRU and other gated ML architectures. We review the significance of the gates in more detail below.
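To make the correspondence concrete, a forward-Euler discretization of Eq. (1) with unit step (our sketch) reads

$$h_{t+1} = \big(1 - \sigma_z(z_t)\big)\odot h_t + \sigma_z(z_t)\odot\Big[J^h\big(\sigma_r(r_t)\odot\phi(h_t)\big) + I^h\Big],$$

which has the same convex-combination structure as the GRU state update, with σz playing the role of the GRU update gate and σr entering like the GRU reset gate.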
A. Significance of the update gate
The update gate modulates the rate of integration. In single-neuron models, such a modulation is shown to make the neuron’s responses robust to time-warped inputs [14]. Furthermore, normative approaches, requiring time reparametrization invariance in ML RNNs, naturally imply the existence of a mechanism that modulates the integration rate [64]. We show that, for a wide range of parameters, a more sensitive (or switchlike) update gate leads to marginal stability. Marginally stable models of biological function have long been of interest with regard to their benefits for information processing (cf. Ref. [33] and references therein). In the gated RNN, a functional consequence of the marginally stable state is the use of the network as a robust integrator—such integratorlike function is shown to be beneficial for a variety of computational functions such as motor control [34–36], decision making [37], and auditory processing [53]. However, previous models of these integrators often require handcrafted symmetries and fine-tuning [38]. We show that gating allows this function without fine-tuning. Signatures of integratorlike behavior are also empirically observed in successfully trained gated ML RNNs on complex tasks [41]. We provide a theoretical basis for how gating produces these. The update gate facilitates accumulation of slow modes and a pinching of the spectral curve which leads to a suppression of unstable directions and overall slowing of the dynamics over a range of parameters. This is a manifestly self-organized slowing down. Other mechanisms for slowing down dynamics have been proposed where the slow timescales of the network dynamics are inherited from other slow internal processes such as synaptic filtering [65,66]; however, such mechanisms differ from the slowing due to gating; they do not seem to display the pinching and clumping, and they also do not rely on self-organized behavior.
B. Significance of the output gate
The output gate dynamically modulates the outputs of individual neurons. Similar shunting mechanisms are widely observed in real neurons and are crucial for performance in ML tasks [62]. We show that the output gate offers fine control over the dimensionality of the dynamics in phase space, and this ability is implicated in task performance in ML RNNs [42]. This gate also gives rise to a novel discontinuous chaotic transition where inputs can abruptly push stable systems to strongly chaotic activity; this is in contrast to the typically stabilizing role of inputs in additive RNNs. In this transition, there is a decoupling between topological and dynamical complexity. The chaotic state across this transition is also characterized by the coexistence of a stable fixed point with chaotic dynamics—in finite-size systems, this manifests as long transients that scale with the system size. We note that there are other systems displaying either a discontinuous chaotic transition or the existence of fixed points overlapping with chaotic (pseudo)attractors [19] or apparent chaotic attractors with finite alignment with particular directions [67]; however, we are not aware of a transition in large RNNs where static inputs can induce strong chaos or where the topological and dynamical complexity are decoupled across the transition. In this regard, the chaotic transition mediated by the output gate seems to be fundamentally different. More generally, the output gate is likely to have a significant role in controlling the influence of external inputs on the intrinsic dynamics.
We also show how the gates complement the memory function of the update gate by providing an important, context- and input-dependent reset mechanism. The ability to erase a memory trace flexibly is critical for function [62]. Gates also provide a mechanism to avoid the accuracy-flexibility trade-off noted for purely additive networks—where the stability of a memory comes at the cost of the ease with which it is rewritten [61].
We summarize the rich behavior of the gated RNN via phase diagrams indicating the critical surfaces and regions of marginal stability. From a practical perspective, the phase diagram is useful in light of the observation that it is often easier to train RNNs initialized in the chaotic regime but close to the critical points. This is often referred to as the “edge of chaos” hypothesis [68–70]. Thus, the phase diagrams provide ML practitioners with a map for principled parameter initialization—one of the most critical choices deciding training success.
ACKNOWLEDGMENTS
K. K. is supported by a C. V. Starr fellowship and a CPBF fellowship (through NSF PHY-1734030). T. C. is supported by a grant from the Simons Foundation (891851, TC). D. J. S. was supported by the NSF through the CPBF (PHY-1734030) and by a Simons Foundation fellowship for the MMLS. This work was partially supported by the NIH under Grant No. R01EB026943. K. K. and D.J.S. thank the Simons Institute for the Theory of Computing at U. C. Berkeley, where part of the research was conducted. T. C. gratefully acknowledges the support of the Initiative for the Theoretical Sciences at the Graduate Center, CUNY, where most of this work was completed. We are most grateful to William Bialek, Jonathan Cohen, Andrea Crisanti, Rainer Engelken, Moritz Helias, Jonathan Kadmon, Jimmy Kim, Itamar Landau, Wave Ngampruetikorn, Katherine Quinn, Friedrich Schuessler, Julia Steinberg, and Merav Stern for fruitful discussions.
APPENDIX A: DETAILS OF RANDOM MATRIX THEORY FOR SPECTRUM OF THE JACOBIAN
In this section, we provide details of the calculation of the bounding curve for the Jacobian spectrum for both fixed points and time-varying states. Our approach to the problem utilizes the method of Hermitian reduction [25,26] to deal with non-Hermitian random matrices. The analysis here resembles that in Ref. [51], which also considers Jacobians that are highly structured random matrices arising from discrete-time gated RNNs.
The Jacobian 𝒟 is a block-structured matrix constructed from the random coupling matrices Jh,z,r and diagonal matrices of the state variables. In the limit of large N, we expect the spectrum to be self-averaging—i.e., the distribution of eigenvalues for a random instance of the network approaches the ensemble-averaged distribution. We can, thus, gain insight about typical dynamical behavior by studying the ensemble- (or disorder-) averaged spectrum of the Jacobian. Our starting point is the disorder-averaged spectral density μ(λ) defined as
$$\mu(\lambda) = \frac{1}{3N}\,\Big\langle \sum_{i=1}^{3N} \delta^{(2)}(\lambda - \lambda_i) \Big\rangle_{J_{h,z,r}}, \qquad (A1)$$
where the λi are the eigenvalues of 𝒟 for a given realization of Jh,z,r and the expectation is taken over the distribution of real Ginibre random matrices from which Jh,z,r are drawn. Using an alternate representation for the Dirac delta function in the complex plane, δ(2)(λ) = π−1∂λ̄(1/λ), we can write the average spectral density as
$$\mu(\lambda) = \frac{1}{3N\pi}\,\partial_{\bar\lambda}\,\Big\langle \mathrm{Tr}\,\big(\lambda\,\mathbb{1}_{3N} - \mathcal{D}\big)^{-1} \Big\rangle, \qquad (A2)$$
where 𝟙3N is the 3N-dimensional identity matrix. 𝒟 is in general non-Hermitian, so the support of the spectrum is not limited to the real line, and the standard procedure of studying the Green’s function by analytic continuation is not applicable, since it is nonholomorphic on the support. Instead, we use the method of Hermitization [25,26] to analyze the resolvent for an expanded 6N × 6N Hermitian matrix H:
$$H = \begin{pmatrix} 0 & \lambda\,\mathbb{1}_{3N} - \mathcal{D} \\ \bar\lambda\,\mathbb{1}_{3N} - \mathcal{D}^{\dagger} & 0 \end{pmatrix}, \qquad (A3)$$
$$\mathcal{G}(\eta; \lambda, \bar\lambda) = \big(\eta\,\mathbb{1}_{6N} - H\big)^{-1}, \qquad \eta \to 0, \qquad (A4)$$
and the Green’s function for the original problem is obtained by considering the lower-left block of 𝒢:
$$G(\lambda, \bar\lambda) = \lim_{\eta \to 0}\,\frac{1}{3N}\,\mathrm{Tr}\,\mathcal{G}_{21}. \qquad (A5)$$
To make this problem tractable, we invoke an ansatz called the local chaos hypothesis [57,71], which posits that, for large random networks in steady state, the state variables are statistically independent of the random coupling matrices Jz,h,r (also see Ref. [72]). This implies that the Jacobian [Eq. (4)] has an explicit linear dependence only on Jh,z,r, and the state variables are governed by their steady-state distribution from the disorder-averaged DMFT (Appendix C). These assumptions make the random matrix problem tractable, and we can evaluate the Green’s function by using the self-consistent Born approximation, which is exact as N → ∞. We detail this procedure below.
The Jacobian itself can be decomposed into structured (A, L, R) and random parts (𝒥):
$$\mathcal{D} = A + L\,\mathcal{J}\,R, \qquad \mathcal{J} = \mathrm{bdiag}\big(J^h, J^z, J^r\big),$$
$$A = \begin{pmatrix} -[\sigma_z] & [\sigma_z'\,\dot h/\sigma_z] & 0 \\ 0 & -\tau_z^{-1}\mathbb{1} & 0 \\ 0 & 0 & -\tau_r^{-1}\mathbb{1} \end{pmatrix}, \quad L = \mathrm{bdiag}\big([\sigma_z], \tau_z^{-1}\mathbb{1}, \tau_r^{-1}\mathbb{1}\big), \quad R = \begin{pmatrix} [\sigma_r\phi'] & 0 & [\sigma_r'\phi] \\ [\phi'] & 0 & 0 \\ [\phi'] & 0 & 0 \end{pmatrix}. \qquad (A6)$$
At this point, we must make a crucial assumption: The structured matrices A, L, and R are independent of the random matrices appearing in 𝒥. This implies that the dynamics is self-averaging and that the state variables reach a steady-state distribution determined by the DMFT and insensitive to the particular quenched disorder 𝒥. This self-averaging assumption leads to theoretical predictions which are in very good agreement with simulations of large networks, as presented in Fig. 1.
This independence assumption renders 𝒟 a linear function of the random matrix 𝒥, whose entries are Gaussian random variables. The next steps are to develop an asymptotic series in the random components of H, compute the resulting moments, and perform a resummation of the series. This is conveniently accomplished by the self-consistent Born approximation (SCBA). The output of the SCBA is a self-consistently determined self-energy functional Σ[𝒢] which succinctly encapsulates the resummation of moments. With this, the Dyson equation for 𝒢 is given by
$$\mathcal{G}^{-1} = \mathcal{G}_0^{-1} - \Sigma[\mathcal{G}], \qquad (A7)$$
where the matrices on the right are defined in terms of 3N × 3N blocks:
$$\mathcal{G}_0^{-1} = \begin{pmatrix} \eta\,\mathbb{1}_{3N} & \lambda\,\mathbb{1}_{3N} - A \\ \bar\lambda\,\mathbb{1}_{3N} - A^{\dagger} & \eta\,\mathbb{1}_{3N} \end{pmatrix}, \qquad (A8)$$
$$\Sigma[\mathcal{G}] = \begin{pmatrix} L\,Q\big[R\,\mathcal{G}_{22}\,R^{\top}\big]\,L^{\top} & 0 \\ 0 & R^{\top}\,Q\big[L^{\top}\,\mathcal{G}_{11}\,L\big]\,R \end{pmatrix}, \qquad (A9)$$
and Q is a superoperator which acts on its argument as follows:
$$Q[M] = \mathrm{bdiag}\Big( \tfrac{1}{N}\mathrm{Tr}\,M^{(11)}\,\mathbb{1}_N,\; \tfrac{1}{N}\mathrm{Tr}\,M^{(22)}\,\mathbb{1}_N,\; \tfrac{1}{N}\mathrm{Tr}\,M^{(33)}\,\mathbb{1}_N \Big), \qquad (A10)$$
where M(ab) denote the N × N blocks of M.
Here, we express the self-energy using the 3N × 3N subblocks of the Green’s function 𝒢:
$$\mathcal{G} = \begin{pmatrix} \mathcal{G}_{11} & \mathcal{G}_{12} \\ \mathcal{G}_{21} & \mathcal{G}_{22} \end{pmatrix}. \qquad (A11)$$
At this point, we have presented all of the necessary ingredients for computing the Green’s function and, thus, determining the spectral properties of the Jacobian. These are the Dyson equation (A7), along with the free Green’s function (A8) and the self-energy (A9). Most of what is left is complicated linear algebra. However, in the interest of completeness, we proceed to unpack these equations and give a detailed derivation of the main equation of interest, the bounding curve of the spectral density.
To proceed further, it is useful to define the following transformed Green’s functions, which can be written in terms of N × N subblocks:
$$\tilde{\mathcal{G}}_{11} \equiv L^{\top}\,\mathcal{G}_{11}\,L, \qquad (A12)$$
$$\tilde{\mathcal{G}}_{22} \equiv R\,\mathcal{G}_{22}\,R^{\top}. \qquad (A13)$$
Denote also the mean trace of these subblocks as
$$g_i \equiv \frac{1}{N}\,\mathrm{Tr}\,\big[\tilde{\mathcal{G}}_{11}\big]_{ii} \;\; (i = 1, 2, 3), \qquad g_{i+3} \equiv \frac{1}{N}\,\mathrm{Tr}\,\big[\tilde{\mathcal{G}}_{22}\big]_{ii} \;\; (i = 1, 2, 3). \qquad (A14)$$
Then the self-energy matrix in Eq. (A9) is block diagonal, i.e., Σ[𝒢] = bdiag(Σ11, Σ22), with
$$\Sigma_{11} = \mathrm{bdiag}\big( g_4\,[\sigma_z]^2,\; g_5\,\tau_z^{-2}\,\mathbb{1},\; g_6\,\tau_r^{-2}\,\mathbb{1} \big), \qquad (A15)$$
$$\Sigma_{22} = \begin{pmatrix} g_1\,[\sigma_r\phi']^2 + (g_2 + g_3)\,[\phi']^2 & 0 & g_1\,[\sigma_r\phi'][\sigma_r'\phi] \\ 0 & 0 & 0 \\ g_1\,[\sigma_r\phi'][\sigma_r'\phi] & 0 & g_1\,[\sigma_r'\phi]^2 \end{pmatrix}. \qquad (A16)$$
With the self-energy in this form, we are able to solve the Dyson equation for the full Green’s function 𝒢 by direct matrix inversion:
$$\mathcal{G} = \big(\mathcal{G}_0^{-1} - \Sigma[\mathcal{G}]\big)^{-1}, \qquad (A17)$$
which can be carried out easily by symbolic manipulation software. The rhs of Eq. (A17) is a function of the mean traces gi, whereas the lhs is a function of the Green’s function before the transformations (A12) and (A13). Thus, to get a set of equations we can solve, we apply these same transformations to both sides of Eq. (A17) after solving the Dyson equation. The final step is to take the limit η → 0, recovering the problem we originally wished to solve.
The result of these manipulations is a set of six equations for the mean traces of the transformed Green’s function defined in Eq. (A14). In order to write these down, we introduce some additional notation. The self-consistent equations are of the form
$$g_i = \Big\langle \frac{\Gamma_i}{\Gamma_d} \Big\rangle, \qquad (A18)$$
where we denote 〈M〉 ≡ N−1TrM for shorthand and i runs from 1 to 6. Denote the state-variable-dependent diagonal matrices as
(A19)
and, because they appear frequently in the resulting equations, define
(A20)
(A21)
(A22)
The denominator in Eq. (A18) is then given by
(A23)
and the numerators Γi are given by
(A24)
(A25)
(A26)
(A27)
(A28)
The numerators and denominator are all diagonal matrices with real entries, which is why we use the simple notation of a ratio when referring to matrix inversion.
Solving these equations gives us the gi as implicit functions of λ. They are, in general, complicated and resist exact solution. However, the situation simplifies considerably when we are looking for the spectral curve. In this case, we are looking for all λ that satisfy the self-consistent equations as the gi → 0.
We must take this limit carefully, since the ratio of these functions can remain constant. For this reason, it is necessary to define
(A29) |
We may do the same for , , and , but it turns out that x2 and x3 are sufficient to compute the spectral curve. Next, divide by and send all , keeping the ratios fixed. Applying this to the equation for results in
(A30) |
Similarly, for and , we get
(A31) |
(A32) |
where the coefficients γi, which are functions of λ, are given by
The linear system of equations (A30)–(A32) is consistent iff
(A33) |
In other words, γi must satisfy Eq. (A33) when . This expression depends on λ and implicitly defines a curve in , which is the boundary of the support of the spectral density.
Plugging in the explicit expression for γi, we get the implicit equation for the spectral curve as all that satisfy
(A34) |
For large systems, we can replace the empirical traces of the state variable by their averages given by the DMFT variances. Then, the equation for the curve for a general steady state is given by
(A35) |
For fixed points, we have , which makes γ3 = γ4 = 0. The equation for the spectral curve simplifies to that which is quoted in the main text [Eq. (5)]:
(A36) |
1. Jacobian spectrum for the case αr = 0
In the case when αr = 0, it is possible to express the Green’s function [Eq. (A5)] in a simpler form. Recall that
(A37) |
Let . Then, the Green’s function is given by
(A38) |
(A39) |
(A40) |
where is defined implicitly to satisfy the equation
(A41) |
The function acts as a sort of order parameter for the spectral density, indicating the transition on the complex plane between zero and finite density μ. Outside the spectral support, λ ∈ Σc, this order parameter vanishes, ξ = 0, and the Green’s function is holomorphic:
(A42) |
which, of course, indicates that the density is zero since . Inside the support λ ∈ Σ, the order parameter ξ ≠ 0, and the Green’s function consequently picks up nonanalytic contributions, proportional to . Since the Green’s function is continuous on the complex plane, it must be continuous across the boundary of the spectral support. This must then occur precisely when the holomorphic solution meets the nonanalytic solution, at ξ = 0. This is the condition used to find the boundary curve above.
APPENDIX B: SPECTRAL CLUMPING AND PINCHING IN THE LIMIT αz → ∞
In this section, we provide details on the accumulation of eigenvalues near zero and the pinching of the leading spectral curve (for certain values of gh) as the update gate becomes switchlike (αz → ∞). To focus on the key aspects of these phenomena, we consider the case when the reset gate is off and there are no biases (αr = 0 and βr,h,z = 0). Moreover, we consider a piecewise linear approximation—sometimes called “hard” tanh—to the tanh function given by
(B1) |
This approximation does not qualitatively change the nature of the clumping.
In the limit αz → ∞, the update gate σz becomes binary with a distribution given by
(B2) |
where fz = 〈σz〉 is the fraction of update gates that are open (i.e., equal to one). Using this, along with the assumption that —which is valid in this regime—we can simplify the expression for the Green’s function [Eqs. (A38)–(A42)] to yield
(B3) |
where fh is the fraction of hard tanh activations that are not saturated. In the limit of small τz and βr = 0, we get the expression for the density given in the text:
(B4) |
Thus, we see an extensive number of eigenvalues at zero.
Now, let us study the regime where αz is large but not infinite. We would like to get the scaling behavior of the leading edge of the spectrum and the density of eigenvalues contained in a radius δ around the origin. We make an ansatz for the spectral edge close to zero , where c is a positive constant. With this ansatz, the equation for the spectral curve reads
(B5) |
In the limit of large αz and βr = 0, this implies
(B6) |
If this has a positive solution for c, then the scaling of the spectral edge as holds. Moreover, whenever there is a positive solution for c, we also expect pinching of the spectral curve, and in the limit αz → ∞ we have marginal stability.
Under the same approximation, we can approximate the eigenvalue density in a radius δ around zero as
(B7) |
where we choose the contour along for θ ∈ [0, 2π) and . In the limit of large αz (thus, δ ≪ 1), we get the scaling form described in the main text:
(B8) |
APPENDIX C: DETAILS OF THE DYNAMICAL MEAN-FIELD THEORY
The DMFT is a powerful analytical framework for studying the dynamics of disordered systems. It traces its origins to the study of dynamical aspects of spin glasses [73,74] and was later applied to the study of random neural networks [9,15,21,75]. In our case, the DMFT reduces the description of the full 3N-dimensional (deterministic) ordinary differential equations (ODEs) describing (h, z, r) to a set of three coupled stochastic differential equations for scalar variables (h, z, r).
Here, we provide a detailed, self-contained description of the dynamical mean-field theory for the gated RNN using the Martin–Siggia–Rose–De Dominicis–Janssen formalism. The starting point is a generating functional—akin to the generating function of a random variable—which takes an expectation over the paths generated by the dynamics. The generating functional is defined as
(C1) |
where xj(t) ≡ [hj(t), zj(t), rj(t)] is the trajectory and is the argument of the generating functional. We also include external fields , which are used to calculate the response functions. The measure in the expectation is a path integral over the dynamics. The generating functional is then used to calculate correlation and response functions using the appropriate (variational) derivatives. For instance, the two-point function for the h field is given by
(C2) |
Up until this point, things are quite general and do not rely on the specific form of the dynamics. However, for large random networks, we expect certain quantities such as the population-averaged correlation function Ch ≡ N−1 Σi〈hi(t)hi(t′)〉 to be self-averaging and, thus, to not vary much across realizations. Thus, we can study the generating functional averaged over the disorder 𝒥 and approximate it with its value evaluated at the saddle point of the action. This approximation gives us the single-site DMFT picture of the dynamics described in Eqs. (C19) and (C20).
To see how this all works, we start with the equations of motion (in vector form)
(C3) |
(C4) |
(C5) |
where ⊙ stands for elementwise multiplication.
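For concreteness, the following is a minimal simulation sketch of these equations of motion under our reading of Eqs. (C3)–(C5): the update gate σz multiplies the entire right-hand side of the h equation (acting as an adaptive time constant), the output gate σr multiplies ϕ(h) inside the recurrent input, the gates are sigmoids σ(αx + β), and ϕ(x) = tanh(ghx + βh). All parameter values and normalization conventions below are illustrative rather than the paper's exact choices.

```python
import numpy as np

# Minimal Euler integration of the gated RNN, under our reading of
# Eqs. (C3)-(C5); gains, biases, and time constants are illustrative.
N, dt, steps = 500, 0.01, 5000
g_h, alpha_z, alpha_r = 1.5, 2.0, 2.0
beta_h = beta_z = beta_r = 0.0
tau_z = tau_r = 1.0

rng = np.random.default_rng(0)
Jh, Jz, Jr = (rng.normal(0.0, 1.0 / np.sqrt(N), (N, N)) for _ in range(3))
phi = lambda h: np.tanh(g_h * h + beta_h)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

h, z, r = (rng.normal(0.0, 1.0, N) for _ in range(3))
for _ in range(steps):
    sz = sigma(alpha_z * z + beta_z)   # update gate: adaptive time constant
    sr = sigma(alpha_r * r + beta_r)   # output gate on the recurrent input
    h = h + dt * sz * (-h + Jh @ (sr * phi(h)))
    z = z + (dt / tau_z) * (-z + Jz @ phi(h))
    r = r + (dt / tau_r) * (-r + Jr @ phi(h))
print("population variance of h:", h.var())
```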
To write down the MSRDJ generating functional, let us discretize the dynamics (in the Itô convention). Note that in this convention the Jacobian is unity.
where we introduce external fields in the dynamics , , and . The generating functional is given by
(C6) |
where , , and xj(t) ≡ [hj(t), zj(t), rj(t)]; also, the expectation is over the dynamics generated by the network. Writing this out explicitly, with δ functions enforcing the dynamics, we get the following integral for the generating functional:
(C7) |
Now, let us introduce the Fourier representation for the δ function; this introduces an auxiliary field variable, which, as we will see, allows us to calculate the response function in the MSRDJ formalism. The generating functional can then be expressed as
(C8) |
where the functions fh,z,r summarize the gated RNN dynamics
Let us now take the continuum limit δt → 0 and formally define the measures 𝒟hi = limδt→0 ∏t dhi(t). We can then write the generating functional as a path integral:
(C9) |
where , x = (hi, zi, ri), , and the action S which gives weights to the paths is given by
(C10) |
The functional is properly normalized, so Z𝒥[0, b] = 1. We can calculate correlation functions and response functions by taking appropriate variational derivatives of the generating functional Z, but first we address the role of the random couplings.
1. Disorder averaging
We are interested in the typical behavior of ensembles of the networks, so we work with the disorder-averaged generating functional ; since Z𝒥 is properly normalized, we are allowed to average Z𝒥 directly. Averaging over involves the following integral:
which evaluates to
and similarly for Jz and Jr we get terms
The disorder-averaged generating functional is then given by
(C11) |
where the disorder-averaged action is given by
(C12) |
With some foresight, we see that the action is extensive in the system size, and we can try to reduce it to a single-site description. However, the action now contains nonlocal terms (e.g., involving both sites i and j), so we introduce the following auxiliary fields to decouple them:
(C13) |
To make the C’s free fields that we integrate over, we enforce these relations using the Fourier representation of δ functions with additional auxiliary fields:
This allows us to make the following transformations to decouple the nonlocal terms in the action :
We see clearly that the Cϕσr and Cϕ auxiliary fields, which represent the (population-averaged) ϕσr − ϕσr and ϕ − ϕ correlation functions, decouple the sites by summarizing all the information present in the rest of the network in terms of two-point functions; two different sites interact only by means of the correlation functions. The disorder-averaged generating functional can now be written as
(C14) |
where C = (Ch, Cz, Cr) and Ĉ = (Ĉh, Ĉz, Ĉr). The sitewise decoupled action Sd contains only terms involving a single site (and the C fields). So, for a given value of Ĉ and C, the different sites are decoupled and driven by the sitewise action
(C15) |
where
2. Saddle-point approximation for N → ∞
So far, we have not made any use of the fact that we are considering large networks. However, noting that N appears in the exponent in the expression for the disorder-averaged generating functional, we can approximate it using a saddle-point approximation:
We approximate the action ℒ in Eq. (C14) by its saddle-point value plus a Hessian term, ℒ ≃ ℒ0 + ℒ2, where the Q and Q̂ fields represent Gaussian fluctuations about the saddle-point values C0 and Ĉ0, respectively. At the saddle point, the action is stationary with respect to variations; thus, the saddle-point values of the C fields satisfy
(C16) |
In evaluating the saddle-point correlation function in the second line, we use the fact that equal-time response functions in the Itô convention are zero [29]. This is perhaps the first significant point of departure from previous studies of disordered neural networks and forces us to confront the multiplicative nature of the z gate. Here, 〈⋯〉0 denotes averages with respect to paths generated by the saddle-point action; thus, these equations are a self-consistency constraint. With the correlation fields fixed at their saddle-point values, if we neglect the contribution of the fluctuations (i.e., ignore ℒ2), then the generating functional is given by a product of identical sitewise generating functionals:
(C17) |
where the sitewise functionals are given by
(C18) |
where .
The sitewise decoupled action is now quadratic with the correlation functions taking on their saddle-point values. This corresponds to an action for each site containing three scalar variables driven by Gaussian processes. This can be seen explicitly by using a Hubbard-Stratonovich transform which makes the action linear at the cost of introducing three auxiliary Gaussian fields ηh, ηz, and ηr with correlation functions , , and , respectively. With this transformation, the action for each site corresponds to stochastic dynamics for three scalar variables given by
(C19) |
(C20) |
(C21) |
where the Gaussian noise processes ηh, ηz, and ηr have correlation functions that must be determined self-consistently:
The intuitive picture of the saddle-point approximation is as follows: The sites of the full network become decoupled, and each is driven by Gaussian processes whose correlation functions summarize the activity of the rest of the network “felt” by that site. It is possible to argue for the final result heuristically, but one then does not have access to the systematic corrections that a field-theory formulation affords.
We comment here on the unique difficulty that gating presents to an analysis of the DMFT. While r(t) and z(t) are both described by Gaussian processes in the DMFT, the multiplicative σz(z) interaction in Eq. (C19) spoils the Gaussianity of h(t). Note that r(t) is always Gaussian and uncorrelated with h(t). In order to solve for the correlation functions, we need to make a factorization assumption, justified numerically in Fig. 10. The situation simplifies at a fixed point, where h = ηh (since σz > 0) and is, thus, Gaussian and independent of r.
In order to solve the DMFT equations, we use a numerical method described in Ref. [76]. Specifically, we generate noise paths ηh,z,r starting with an initial guess for the correlation functions and then iteratively update the correlation functions using the mean-field equations until convergence. The classical method of solving the DMFT by mapping the DMFT equations to a second-order ODE describing the motion of a particle in a potential cannot be used in the presence of multiplicative gates. In Fig. 9, we see that the solution to the mean-field equations agrees well with the true population-averaged correlation function; Fig. 9 also shows the scale of fluctuations around the mean-field solutions (thin black lines).
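A minimal sketch of this iterative scheme, in the simplified setting αz = 0 (so σz = 1/2 is constant) with no biases, is given below: Gaussian noise paths with the current guess for the correlation functions are drawn via a Cholesky factorization of the covariance on a discrete time grid, the single-site equations are Euler integrated, and the correlators are remeasured and updated with damping. Grid sizes, sample counts, and the damping factor are illustrative.

```python
import numpy as np

# Sketch of the iterative DMFT solver for alpha_z = 0 (sigma_z = 1/2)
# and no biases; all sizes and parameters are illustrative.
rng = np.random.default_rng(1)
n, dt, M = 200, 0.1, 2000                 # time grid, step, sample paths
g_h, alpha_r, tau_r = 2.5, 2.0, 1.0
phi = lambda h: np.tanh(g_h * h)
sig_r = lambda r: 1.0 / (1.0 + np.exp(-alpha_r * r))

def sample_paths(C):
    """Draw M Gaussian paths with two-time covariance C."""
    L = np.linalg.cholesky(C + 1e-8 * np.eye(n))
    return rng.normal(size=(M, n)) @ L.T

C_h = C_r = 0.5 + np.eye(n)               # initial guesses for the correlators
for _ in range(50):
    eta_h, eta_r = sample_paths(C_h), sample_paths(C_r)
    h = np.zeros((M, n)); r = np.zeros((M, n))
    for t in range(n - 1):                # Euler steps of Eqs. (C19) and (C21)
        h[:, t + 1] = h[:, t] + 0.5 * dt * (-h[:, t] + eta_h[:, t])
        r[:, t + 1] = r[:, t] + (dt / tau_r) * (-r[:, t] + eta_r[:, t])
    u = sig_r(r) * phi(h)
    C_h = 0.5 * C_h + 0.5 * (u.T @ u) / M             # noise feeding h
    C_r = 0.5 * C_r + 0.5 * (phi(h).T @ phi(h)) / M   # noise feeding r
print("steady-state variance estimate:", C_h[-1, -1])
```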
The correlation functions in the DMFT picture, such as Ch(t, t′) = 〈h(t)h(t′)〉, are the order parameters and correspond to the population-averaged correlation functions in the full network. They turn out to be useful in several of our analyses of the RNN dynamics. Qualitative changes in the correlation functions correspond to transitions between dynamical regimes of the RNN.
In general, the non-Gaussian nature of h makes it impossible to obtain closed equations governing the correlation functions. However, when αz is not too large, Eqs. (C19) and (C20) can be used to derive equations of motion for the correlation functions Ch, Cz, and Cr, which proves useful later on. This requires an assumption that the h and σz correlators separate, which seems reasonable for moderate αz (see Fig. 10). “Squaring” Eqs. (C19) and (C20), we get
(C22) |
(C23) |
(C24) |
where we use the shorthand σz(t) ≡ σz[z(t)], ϕ(t) ≡ ϕ[h(t)], and denote the two-time correlation functions as
(C25) |
where x ∈ {h, z, r, σz, σr, ϕ} and the expectation here is over the random Gaussian fields in Eqs. (C19)–(C21). We assume that the network reaches a steady state, so that the correlation functions are functions only of the time difference τ = t − t′. The role of the z gate as an adaptive time constant is evident in Eq. (C22).
For time-independent solutions, i.e., fixed points, Eqs. (C22)–(C24) simplify to read
(C26) |
(C27) |
where we use Δ instead of C to indicate fixed-point variances and Dx is the standard Gaussian measure. It is interesting to note that these mean-field equations can be mapped to those obtained in Ref. [51] for the discrete-time GRU.
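As an illustration of how these fixed-point equations can be solved in practice, the sketch below iterates the self-consistency conditions with Gauss–Hermite quadrature, using the fact that at a fixed point h and r are independent Gaussians, so the average of σr²ϕ² factorizes. The update rules encode our reading of Eqs. (C26) and (C27); the parameter values are illustrative and chosen beyond the instability of the zero fixed point.

```python
import numpy as np

# Sketch of solving the fixed-point MFT equations (C26)-(C27) by damped
# iteration with Gauss-Hermite quadrature; parameters are illustrative.
g_h, alpha_r = 3.0, 2.0
x, w = np.polynomial.hermite_e.hermegauss(101)
w = w / w.sum()                                  # normalize to the N(0,1) measure
gauss = lambda f, var: np.sum(w * f(np.sqrt(var) * x))

phi2 = lambda u: np.tanh(g_h * u) ** 2
sig2 = lambda u: (1.0 / (1.0 + np.exp(-alpha_r * u))) ** 2

D_h = 1.0
for _ in range(500):
    D_r = gauss(phi2, D_h)                       # Delta_r = <phi^2>
    D_h = 0.9 * D_h + 0.1 * gauss(sig2, D_r) * gauss(phi2, D_h)
print("Delta_h =", D_h, " Delta_r =", D_r)
```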
We also make use of the MFT with static random inputs. For completeness, we include the resulting equations here. With , the MFT time-independent solution satisfies
(C28) |
(C29) |
(C30) |
APPENDIX D: DETAILS OF THE NUMERICS FOR THE LYAPUNOV SPECTRUM
The evolution of perturbations δx(t) along a trajectory follows the tangent-space dynamics governed by the Jacobian
(D1) |
So, after a time T, the initial perturbation δx(0) has evolved to
(D2) |
where 𝒯[⋯] is the time-ordering operator applied to the contents of the bracket. When the infinitesimal perturbations grow (shrink) exponentially, the rate of this exponential growth (decay) is dictated by the maximal Lyapunov exponent defined as [54]
(D3) |
For ergodic systems, this limit exists and is the same for almost all initial conditions, as guaranteed by the Oseledets multiplicative ergodic theorem [54]. Positive values of λmax imply that nearby trajectories diverge exponentially fast, and the system is chaotic. More generally, the set of all Lyapunov exponents—the Lyapunov spectrum—yields the rates at which perturbations along different directions shrink or grow and, thus, provides a fuller characterization of the asymptotic behavior. The first k ordered Lyapunov exponents are given by the growth rates of k linearly independent perturbations. These can be obtained as the logarithms of the eigenvalues of the Oseledets matrix, defined as [54]
(D4) |
However, this expression cannot be directly used to calculate the Lyapunov spectra in practice, since M(t) rapidly becomes ill conditioned. We instead employ a method suggested by Ref. [77] (also cf. Ref. [78] for Lyapunov spectra of RNNs). We start with k orthogonal vectors Q0 = [q1, …, qk] and evolve them using the tangent-space dynamics [Eq. (D1)] for a short time interval t0. The new set of vectors is then given by
(D5) |
We now decompose Q̂ = Q1R1 using a QR decomposition into an orthogonal matrix Q1 and an upper-triangular matrix R1 with positive diagonal elements, which give the rates of shrinkage or expansion of the volume element along the different directions. We iterate this procedure for a long time, t0 × Nl, and the first k ordered Lyapunov exponents are given by
(D6) |
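A compact sketch of this reorthonormalization procedure is given below, demonstrated on the vanilla additive RNN ḣ = −h + Jϕ(h) for brevity (the gated case differs only in the state update and the Jacobian). Step sizes and interval lengths are illustrative; g = 2 puts the vanilla network in the chaotic regime.

```python
import numpy as np

# Sketch of the QR-based algorithm (D5)-(D6) for the first k Lyapunov
# exponents, shown for the vanilla RNN dh/dt = -h + J phi(h).
N, k, g = 200, 10, 2.0
dt, t0, Nl = 0.01, 1.0, 400                  # step, reortho interval, # intervals
rng = np.random.default_rng(2)
J = rng.normal(0.0, g / np.sqrt(N), (N, N))
phi = np.tanh
dphi = lambda h: 1.0 - np.tanh(h) ** 2
jac = lambda h: -np.eye(N) + J * dphi(h)     # instantaneous Jacobian

h = rng.normal(0.0, 1.0, N)
Q = np.linalg.qr(rng.normal(0.0, 1.0, (N, k)))[0]
lyap = np.zeros(k)
for _ in range(Nl):
    for _ in range(int(t0 / dt)):            # evolve state and tangent vectors
        Q = Q + dt * (jac(h) @ Q)
        h = h + dt * (-h + J @ phi(h))
    Q, R = np.linalg.qr(Q)                   # reorthonormalize
    lyap += np.log(np.abs(np.diag(R)))       # accumulate log expansion factors
print("leading Lyapunov exponents:", lyap / (Nl * t0))
```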
APPENDIX E: DETAILS OF THE DMFT PREDICTION FOR λmax
The starting point of the method to calculate the DMFT prediction for λmax is two replicas of the system x1(t) and x2(t) with the same coupling matrices Jh,z,r and the same parameters. If the two systems are started with initial conditions which are close, then the rate of convergence or divergence of the trajectories reveals the maximal Lyapunov exponent. To this end, let us define and study the growth rate of d(t, t). In the large N limit, we expect population averages like to be self-averaging (like in the DMFT for a single system) [79], and, thus, we can write
(E1) |
For trajectories that start nearby, the asymptotic growth rate of d(t) is the maximal Lyapunov exponent. In order to calculate this using the DMFT, we need a way to calculate C12—the correlation between replicas—for a typical instantiation of systems in the large N limit. As suggested by Ref. [21], this can be achieved by considering a joint generating functional for the replicated system:
(E2) |
We then proceed to take the disorder average of this generating functional—in much the same way as for a single system—and this introduces correlations between the state vectors of the two replicas. A saddle-point approximation as in the single-system case (cf. Appendix C) yields a system of coupled stochastic differential equations (SDEs) (one for each replica), similar to Eq. (C20), but now the noise processes in the two replicas are coupled, so that terms like need to be considered. As before, the SDEs imply the equations of motion for the correlation functions
(E3) |
(E4) |
(E5) |
where μ, ν ∈ {1, 2} are the replica indices. Note that the single-replica solution clearly is a solution to this system, reflecting the fact that the marginal statistics of each replica are the same as before. When the replicas are started with initial conditions that are ϵ-close, we expect the inter-replica correlation function to diverge from the single-replica steady-state solution, so we expand C12 to linear order as . From Eq. (E1), we see that , and, thus, the growth rate of yields the required Lyapunov exponent. To this end, we make an ansatz , where 2T = t + s, 2τ = t − s, and κ is the DMFT prediction of the maximal Lyapunov exponent that needs to be solved for. Substituting this back into Eq. (E3), we get a generalized eigenvalue problem for κ as stated in the text [Eqs. (10) and (11)].
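The replica construction can also be checked directly in simulation: run the same network twice from ϵ-close initial conditions and fit the exponential growth rate of the squared distance d(t), which grows as e^{2λmax t}. The sketch below does this for the vanilla additive RNN for brevity; parameters are illustrative.

```python
import numpy as np

# Two-replica estimate of lambda_max from the growth of the squared
# distance between eps-close trajectories of the same network.
N, g, dt, eps = 400, 2.0, 0.01, 1e-8
rng = np.random.default_rng(3)
J = rng.normal(0.0, g / np.sqrt(N), (N, N))
f = lambda h: -h + J @ np.tanh(h)

h1 = rng.normal(0.0, 1.0, N)
for _ in range(2000):                       # settle onto the attractor
    h1 = h1 + dt * f(h1)
h2 = h1 + eps * rng.normal(0.0, 1.0, N)

ts, ds = [], []
for i in range(2000):
    h1 = h1 + dt * f(h1)
    h2 = h2 + dt * f(h2)
    ts.append(i * dt)
    ds.append(np.mean((h1 - h2) ** 2))
slope = np.polyfit(ts, np.log(ds), 1)[0]
print("lambda_max estimate:", slope / 2.0)  # d(t) ~ exp(2 lambda_max t)
```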
APPENDIX F: CALCULATION OF MAXIMAL LYAPUNOV EXPONENT FROM RMT
The DMFT prediction for how gates shape λmax (via the correlation functions) is somewhat involved; thus, we provide an alternate expression for the maximal Lyapunov exponent λmax, derived using RMT, which relates it to the relaxation time of the dynamics. The starting point to get λmax is the Oseledets multiplicative ergodic theorem, which guarantees that [80]
(F1) |
(F2) |
where and 𝒟 is the Jacobian. For the vanilla RNN, the Jacobian is given by
(F3) |
We expect the maximal Lyapunov exponent to be independent of the random network realization and, thus, equal to its value after disorder averaging. Furthermore, to make any progress, we use a short-time approximation for . Defining the diagonal matrix R(t) = ∫t [ϕ′(t′)]dt′, these assumptions give
(F4) |
(F5) |
where the second line in Eq. (F5) follows after disorder averaging over J and keeping only terms to leading order in N. Next, we may apply the DMFT to write
(F6) |
(F7) |
In steady state, the correlation function depends only on the difference of the two times, and, thus, we can write
(F8) |
where we define the relaxation time for the Cϕ′ correlation function
(F9) |
Substituting Eq. (F8) in Eq. (F4), we get
(F10) |
which for long times behaves like . By inserting this into Eq. (F1), we obtain a bound for the maximal Lyapunov exponent for the vanilla RNN:
(F11) |
(F12) |
This formula relates the asymptotic Lyapunov exponent to the relaxation time of a local correlation function in steady state. It is interesting to note that the bound also follows by applying the variational theorem to the potential energy obtained from the Schrödinger equation that arises in computing the Lyapunov exponent using the DMFT (e.g., see Refs. [15,32]). Specifically, if one uses the potential obtained in these works, V(τ) = 1 − Cϕ′(τ), and assumes a uniform “ground state wave function,” the variational theorem implies that the true ground-state energy E0 is upper bounded , which consequently implies the bound (F11).
Now we present the derivation for the mean-squared singular value of the susceptibility matrix for the gated RNN with αz = 0 and βz = −∞. In this limit, σz = 1, and the instantaneous Jacobian becomes the 2N × 2N matrix
(F13) |
(F14) |
(F15) |
where h = h(t) and r = r(t) are time dependent.
Let us define the quantity of interest
(F16) |
(F17) |
where we additionally define Ŝt = ∫t dt′St and the integration is performed elementwise. Expanding the exponentiated matrices and computing moments directly, one finds that, at leading order in N, the moments must contain an equal number of factors of Ĵ and ĴT. Thus, we must evaluate
(F18) |
The ordering of the matrices is important in this expression. Since all of the Ĵ appear to the left of ĴT, the leading-order contributions to the moment come from Wick contractions that are “noncrossing”—in the language of diagrams, the moment is given by a “rainbow” diagram. Consequently, we may evaluate cn by induction. First, the induction step. Define the expected value of the matrix moment
(F19) |
(F20) |
(F21) |
We wish to determine an and bn. Next, define
(F22) |
(F23) |
(F24) |
Now we can directly determine the induction step at the level of matrix moments by Wick contraction of the rainbow diagram:
(F25) |
(F26) |
(F27) |
This implies the following recursion for the diagonal elements of ĉn:
(F28) |
The initial condition is given by observing that , which implies a0 = b0 = 1. The solution to this recursion relation can be written in terms of a transfer matrix
(F29) |
which implies the moment is given by
(F30) |
To evaluate this, we use the fact that the eigenvalues of the transfer matrix are
(F31) |
which are real valued. The eigenvectors are
(F32) |
Then, defining l = (1, 1), the moment can be written
(F33) |
(F34) |
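Numerically, moments of this form are conveniently evaluated by expanding the initial vector l = (1, 1) in the eigenbasis of the transfer matrix, as in the sketch below; the entries of T here are placeholders standing in for the expressions in Eq. (F29).

```python
import numpy as np

# Evaluating T^n l by eigendecomposition, as in Eqs. (F29)-(F34); the
# entries of T are illustrative placeholders for Eq. (F29).
gA, gB = 0.8, 0.5
T = np.array([[gA, gB], [gB, gA]])
evals, evecs = np.linalg.eig(T)              # real eigenvalues gA +/- gB
coeffs = np.linalg.solve(evecs, np.ones(2))  # expand l = (1, 1) in eigenvectors
n = 7
cn = evecs @ (evals ** n * coeffs)           # T^n l without repeated multiplication
print(cn, np.linalg.matrix_power(T, n) @ np.ones(2))   # agreement check
```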
The final expression for the mean-squared singular value is then
(F35) |
After resumming this infinite series, we wind up with an expression in terms of the modified Bessel function:
(F36) |
In the steady state, we approximate these expressions by assuming the correlation functions are time-translation invariant. Thus, we may write, for instance,
(F37) |
and similarly for gQ and gP. Then, the eigenvalues of the transfer matrix become
(F38) |
At late times, using the asymptotic behavior of the modified Bessel function, the moment becomes
(F39) |
which gives the Lyapunov exponent
(F40) |
where the relaxation times τA, τr, and τq are defined as, respectively,
(F41) |
(F42) |
(F43) |
APPENDIX G: DETAILS OF THE DISCONTINUOUS CHAOTIC TRANSITION
In this section, we provide the details for the calculations involved in the discontinuous chaotic transition.
1. Spontaneous emergence of fixed points
For gh < 2.0 and small αr, the zero fixed point is the globally stable state for the dynamics and the only solution to the fixed-point equations [Eq. (C26)] for Δh. However, as we increase αr for a fixed gh, two additional nonzero solutions to Δh spontaneously appear at a critical value as shown in Fig. 4(a). Numerical solutions to the fixed-point equations reveal the form of the bifurcation curve and the associated value of . We see that increases rapidly with decreasing gh, dividing the parameter space into regions with either one or three solutions for Δh. The corresponding vanishes at two boundary values of gh—one at 2.0 and another, gc, below 1.5, where . This naturally leads to the question of whether the fixed-point bifurcation exists for all values of gh below 2.0.
To answer this, we perturbatively solve the fixed-point equations in two asymptotic regimes: (i) gh → 2− and (ii) . Details of the perturbative treatment are in Appendix I 2. For gh = 2 − ϵ, we see that the perturbative problem undergoes a bifurcation from one solution (Δh = 0) to three when αr crosses the bifurcation threshold , and this is the left limit of the bifurcation curve in Fig. 4(b). The larger nonzero solution for the variance at the bifurcation point scales as
(G1) |
where ξ0 and ξ1 are positive constants (see Appendix I 2).
At the other extreme, to determine the smallest value of gh for which a bifurcation is possible, we note from Fig. 4(b) that in this limit αr → ∞, and, thus, we can look for solutions to Δh in the limit: Δh ≪ 1 and αr → ∞ and . In this limit, there is a bifurcation in the perturbative solution when , and, close to the critical point, the fixed-point solution is given by (see Appendix I 2)
(G2) |
Thus, in the region , there exist nonzero solutions to the fixed-point equations once αr is above a critical value . These solutions correspond to unstable fixed points appearing in the phase space.
2. Delayed dynamical transition shows a decoupling between topological and dynamical complexity
The picture from the fixed-point transition above is that, when gh is in the interval (, 2), there is a proliferation of unstable fixed points in the phase space provided . However, it turns out that the spontaneous appearance of these unstable fixed points is not accompanied by any asymptotic dynamical signatures—as measured by the Lyapunov exponents (see Fig. 4) or by the transient times (see Fig. 11). It is only when αr is increased further beyond a second critical value that we see the appearance of chaotic dynamics and long-lived transients. This is significant in light of a result by Wainrib and Touboul [45], who show that the transition to chaotic dynamics (dynamical complexity) in random RNNs is tightly linked to the proliferation of critical points (topological complexity); in their case, the exponential rate of growth of critical points (a topological property) is the same as the maximal Lyapunov exponent (a dynamical property).
Let us characterize the second dynamical transition curve given by [Fig. 4(c), red curve]. For ease of discussion, we turn off the update gate (αz = 0) and introduce a functional Fψ for a 2D Gaussian average of a given function ψ(x):
(G3) |
(G4) |
The DMFT equations for the correlation functions then become
(G5) |
We further make an approximation that τr ≪ 1, which, in turn, implies Cr(τ) ≈ Cϕ(τ). This approximation turns out to hold even for moderately large τr. With these approximations, we can integrate the equations for Ch(τ) to arrive at an equation for the variance . We do this by multiplying by ∂τCh(τ) and integrating from τ to ∞, and we get
(G6) |
Using the boundary condition that Ċh(0) = 0, we get the equation for the variance:
(G7) |
Solving this equation gives the DMFT prediction for the variance for any gh and αr. Beyond the critical value of αr, two nonzero solutions for spontaneously emerge. In order to use Eq. (G7) to find a prediction for the DMFT bifurcation curve , we need to use the additional fact that at the bifurcation point the two solutions coincide, and there is only one nonzero solution. To proceed, we can view the lhs of Eq. (G7) as a function of αr, gh, and . Then, the equation for the bifurcation curve is obtained by solving the following two equations for and :
(G8) |
(G9) |
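Numerically, this is a standard double-root (saddle-node) condition: at the critical αr, the function defined by the lhs of Eq. (G7) vanishes together with its derivative with respect to the variance. The sketch below shows the root-finding structure with a placeholder F that merely has the same qualitative shape; the reader would substitute the actual Gaussian-integral expression from Eq. (G7).

```python
import numpy as np
from scipy.optimize import fsolve

# Locating the bifurcation curve from Eqs. (G8)-(G9): solve F = 0 and
# dF/dDelta = 0 simultaneously. `F` is an illustrative stand-in with the
# same saddle-node structure, not the paper's actual expression.
def F(delta, g_h, alpha_r):
    return delta - alpha_r * np.tanh(g_h * delta) ** 2

def dF(delta, g_h, alpha_r, eps=1e-6):
    return (F(delta + eps, g_h, alpha_r) - F(delta - eps, g_h, alpha_r)) / (2 * eps)

def bifurcation_point(g_h, guess=(0.5, 1.0)):
    eqs = lambda v: [F(v[0], g_h, v[1]), dF(v[0], g_h, v[1])]
    delta_c, alpha_c = fsolve(eqs, guess)
    return delta_c, alpha_c

print(bifurcation_point(1.8))   # (critical variance, critical alpha_r)
```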
To get the condition for the dynamical bifurcation transition, we need to differentiate the lhs of Eq. (G7) with respect to and set it to 0. This involves terms like
(G10) |
We give a brief outline of calculating the first term. It is easier to work in the Fourier domain:
(G11) |
This immediately gives us
(G12) |
Using this fact, we can calculate the derivative of as a straightforward (but long) sum of Gaussian integrals. We then numerically solve Eqs. (G8) and (G9) to get the bifurcation curve shown in Fig. 4(c). Figure 4(d) shows the corresponding variance at the bifurcation point (red curves). We note two salient points: (i) The DMFT bifurcation curve is always above the fixed-point bifurcation curve [black, in Fig. 4(a)], and (ii) the lower critical value of gh which permits a dynamical transition [dashed green curve in Figs. 4(a) and 4(b)] is smaller than the corresponding fixed-point critical value of .
We now calculate the lower critical value of gh and provide an analytical description of the asymptotic behavior near the lower and higher critical values of gh. From the red curve in Fig. 4(c), we know that, as gh tends toward the lower critical value, and . In this limit, we can approximate σr as a step function, and is approximated as
(G13) |
(G14) |
The DMFT equation then reads
Integrating this equation, we get
which has O(Ch(0)2) corrections. From the boundary condition Ċh(0) = 0, we know that ẋ → 0 as x → 1. We thus find that these boundary conditions are consistent only to leading order in Ch(0) when gh is equal to its critical value:
(G15) |
which indicates that Ch(0) must vanish as .
In the other limit when gh → 2−, we see that remains finite and . We assume that, for gh = 2 − ϵ, has a power-series expansion
(G16) |
We also expand Fϕ and to O[Ch(0)2]:
(G17) |
and look for values of αr which permit a nonzero value for c0 in the leading-order solutions to the DMFT. We find that the critical value of αr from the perturbative solution is given by
(G18) |
The DMFT prediction for the dynamical bifurcation agrees well with the full network simulations. In Fig. 4(e), we see that the maximum Lyapunov exponent experiences a discontinuous transition from a negative value (network activity decays to fixed point) to a positive value (activity is chaotic) at the critical value of αr predicted by the DMFT (dashed vertical lines).
3. Influence of update gate on the discontinuous transition
Here, we comment briefly on the possible influence of the z gate on the discontinuous dynamical phase transition given by the curve . Assuming Eq. (C22) is valid (discussed in more detail toward the end of Appendix C), we may rewrite the DMFT equation for the two-point correlation functions as
(G19) |
where
(G20) |
Noting that a time-dependent solution corresponds to a nonzero solution for Ch(0), the boundary condition Ċh(0) = 0 then requires
(G21) |
where we define a new “potential” function which is related to that defined above by
(G22) |
We leave the arguments (gh, αr, ) implicit, for ease of presentation. We proceed to bound the new potential by establishing bounds on . To be explicit, we have
(G23) |
which we express as the sum of a connected component (indicated by a subscript c) and a disconnected component. We can consider two limiting behaviors. When the correlation time tends to zero, the connected component vanishes and (at zero bias βz = 0)
(G24) |
Increasing the correlation time can serve only to increase the two-point function, since σ ≥ 0. In the extreme limit of very long correlation time, we have that
(G25) |
The inequality is saturated at αz = ∞, when σz becomes a step function of its argument. Therefore, the two-point correlation function of the update gate is bounded above and below:
(G26) |
and this bound is uniform in the sense that it holds for all values of the argument . Consequently, we are able to bound the potential
(G27) |
It follows immediately that the derivative is similarly bounded. Consequently, the zeros of and coincide with the zeros of ℱ and , respectively. As a result, the discontinuous transition, determined by Eqs. (G8) and (G9), remains unchanged for values of αz for which Eq. (C22) is valid. Thus, for moderately large αz [approximately 10, where Eq. (C22) is valid], the critical line for the discontinuous transition remains unchanged.
APPENDIX H: THE ROLE OF BIASES
Thus far, we have described the salient dynamical aspects of the gated RNN in the absence of biases. Here, we describe the role of the biases βh (bias of the activation ϕ) and βr (bias of the output gate σr). We first note that, when βh = 0, zero is always a fixed point of the dynamics, and the zero fixed point is stable provided
(H1) |
where ϕ(x) = tanh(ghx + βh). This gives the familiar gh < 2 condition when βr = 0 [81]. Thus, in this case, there is an interplay between gh and βr in determining the leading edge of the Jacobian around the zero fixed point and, thus, its stability. In the limit βr → −∞, the leading edge retreats to . When βh > 0, zero cannot be a fixed point of the dynamics. Therefore, βh facilitates the appearance of nonzero fixed points, and both βr and βh determine the stability of these nonzero fixed points.
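This stability criterion is easy to probe numerically: with βh = 0, the h block of the Jacobian at the origin decouples, and (dropping the positive σz prefactor, which cannot change the sign of the real part) its eigenvalues are −1 + ghσ(βr)λJ, with λJ the eigenvalues of Jh filling the unit disk. The sketch below verifies that the spectral abscissa crosses zero at ghσ(βr) = 1, i.e., at gh = 2 for βr = 0; the decoupling argument and conventions here are our reading of Eq. (H1).

```python
import numpy as np

# Numerical check of the zero-fixed-point stability criterion: the
# spectral abscissa of the h block should cross zero near g_h = 2 when
# beta_r = 0. Sizes and gains are illustrative.
N, beta_r = 1000, 0.0
rng = np.random.default_rng(4)
J = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N))
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
for g_h in (1.5, 2.0, 2.5):
    D = -np.eye(N) + g_h * sigma(beta_r) * J     # sigma_z prefactor > 0 dropped
    print(g_h, np.linalg.eigvals(D).real.max())
```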
To gain some insight into the role of βh in generating fixed points, we treat the mean-field FP equations [Eq. (C26)] perturbatively around the operating point gc where the zero fixed point becomes unstable [Eq. (H1)]. For small βh and ϵ = gh − gc, we can express the solution Δh as a power series in ϵ, and we see that to leading order the fixed-point variance behaves as (details in Appendix I 1)
(H2) |
(H3) |
where ϕ0 ≡ tanh(βh) and f1(αr, βr) and f2(αr, βr) are constant functions with respect to ϵ. Therefore, we see that the bias βh gives rise to nonzero fixed points near the critical point which scale linearly with the bias. In Fig. 12(e), we show this linear scaling of the solution for the case when βh = ϵ, and we see that the prediction (lines) matches the true solution (circles) over a reasonably wide range.
More generally, away from the critical gc, an increasing βh gives rise to fixed-point solutions with increasing variance, and this can arise continuously from zero, or it can arise by stabilizing an unstable, time-varying state depending on the value of βr. In Fig. 12(a), we see how the Δh behaves for increasing βh for different βr, and we can see the stabilizing effect of βh on unstable solutions by looking at its effect on the leading spectral edge [Fig. 12(b)]. In Fig. 12(c), we see that an increasing βr also gives rise to increasing Δh. However, in this case, it has a destabilizing effect by shifting the leading spectral edge to the right. In particular, when βh = 0, increasing βr destabilizes the zero fixed point and gives rise to a time-varying solution. We note that, when βh = 0, varying βr cannot yield stable nonzero FPs. The combined effect of βh and βr can be seen in Fig. 12(f), where the nonzero solutions to the left of the orange line indicate unstable (time-varying) solutions. We choose the parameters to illustrate an interesting aspect of the biases: In some cases, increasing βh can have a nonmonotonic effect on the stability, wherein the solution becomes unstable with increasing βh and is then eventually stabilized for sufficiently large βh.
1. Effect of biases on the phase boundaries
In Figs. 13(a) and 13(b), we look at how the critical line for the chaotic transition, in the αr − gh plane, changes as we vary βh (a) or βr (b). Positive values of βr (“open” output gate) tend to make the transition line less dependent on αr [Fig. 13(b)], and negative values of βr have a stabilizing effect by requiring larger values of gh and αr to transition to chaos. As we see above, higher values of βh have a stabilizing effect, requiring higher gh and αr to make the (nonzero) stable fixed point unstable. In both cases, the critical lines for marginal stability [Figs. 13(a) and 13(b), dashed lines] are also influenced in a similar way. In Figs. 13(c) and 13(d), we see how the stability-to-chaos transition is affected by αr (c) and βr (d). Consistent with the discussion above, larger αr and βr have a destabilizing effect, requiring a larger βh to make the system stable.
APPENDIX I: DETAILS OF THE PERTURBATIVE SOLUTIONS TO THE MEAN-FIELD EQUATIONS
1. Perturbative solutions for the fixed-point variance Δh with biases
In this section, we derive the perturbative solutions for the fixed-point variance Δh with finite biases, near the critical point where the zero fixed point becomes unstable. Recall that fixed-point variances are obtained by solving
(I1) |
(I2) |
The expansion we seek is perturbative in Δh. So, expanding the gating and activation functions about their biases under the assumption , we have a series expansion to :
(I3) |
(I4) |
(I5) |
where we use the following identities involving the derivatives of tanh:
(I6) |
(I7) |
(I8) |
(I9) |
(I10) |
This gives us to
(I11) |
(I12) |
(I13) |
(I14) |
and, therefore,
(I15) |
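Expansions of this kind are conveniently checked with a computer algebra system. The sketch below expands tanh² of a Gaussian argument to O(Δh²) and substitutes the Gaussian moments 〈x²〉 = 1 and 〈x⁴〉 = 3; the function expanded (the βh = 0 case of 〈ϕ²〉) is an illustrative instance of the procedure used in this Appendix.

```python
import sympy as sp

# Checking a small-Delta expansion with a CAS: expand tanh^2, substitute
# the Gaussian argument g*sqrt(Delta)*x, and replace <x^2>=1, <x^4>=3.
x, D, g, u = sp.symbols('x Delta g u', positive=True)
ser = sp.series(sp.tanh(u) ** 2, u, 0, 6).removeO()   # u^2 - 2*u^4/3
avg = sp.expand(ser.subs(u, g * sp.sqrt(D) * x))
avg = avg.subs(x ** 4, 3).subs(x ** 2, 1)             # Gaussian moments
print(sp.collect(avg, D))   # -> Delta*g**2 - 2*Delta**2*g**4
```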
To proceed further, we study the solutions to this equation for small deviations from a critical value of gh. Which critical value should we use? Recall that the zero fixed point becomes unstable when
(I16) |
Therefore, we expand around this operating point, and our small parameter is ϵ = gh − gc, where gc = σr(0)−1. We make an ansatz that we can express Δh as a power series in ϵ:
(I17) |
where η is the exponent for the prefactor scaling and needs to be determined self-consistently. To get the scaling relations for Δh, we need to expand the coefficients in the Taylor series for Δh in terms of ϵ. We note that c0 = tanh(βh)2, and, therefore, these approximations make sense only for small βh. How small should βh be relative to ϵ? We make the following ansatz:
(I18) |
and, thus, if δ > 1/2, then increases slower than ϵ.
We now express the coefficients for small βh:
(I19) |
(I20) |
(I21) |
After solving Eqs. (I15)–(I19) self-consistently in terms of the expansion parameter ϵ, we get the following perturbative solution for δ ≤ 1:
(I22) |
(I23) |
f1(αr, βr) and f2(αr, βr) are constant functions (with respect to ϵ). Therefore, we see a linear scaling with the bias βh.
2. Perturbative solutions for the fixed-point variance Δh in the bifurcation region with no biases
The perturbative treatment of the fixed-point solutions in this case closely follows that described above. For gh = 2 − ϵ, we can express Δh as a power series in ϵ (Δh = c0 + c1ϵ + c2ϵ2) and look for a condition that allows for a nonzero c0 corresponding to the bifurcation point. Since we expect Δh to be small in this regime, we can expand Δr as
(I24) |
and, similarly, we can also approximate
(I25) |
Now, equating coefficients of powers of ϵ, we get that either c0 = 0 or
(I26) |
which is a valid solution when . This is the bifurcation curve limit near gh = 2−.
In the other limit, and . We can work in the regime where to see what values of gh admit a bifurcation in the perturbative solutions. The equation [to ] is given by
(I27) |
Thus, we get a positive solution for Δh when , and, to leading order, the solution scales as
(I28) |
3. Ch(τ) near critical point
Here, we study the asymptotic behavior of Ch(τ) near the critical point gh = 2.0 for small αz. For simplicity, we set the biases to be zero. In this limit, we can assume that Ch(τ) and Cϕ(τ) are small. Let us begin by approximating .
We get, up to ,
(I29) |
(I30) |
(I31) |
(I32) |
This can be obtained, for instance, by expanding σz[z(t)] and taking the Gaussian averages over the argument z(t) in the steady state. The relation between Cϕ(τ) and Cz(τ), in general, does not have a simple form; however, when gh ~ 2, we expect the relaxation time τR ≫ 1, and therefore, we can approximate Cz(τ) ≈ Cϕ(τ). We can then approximate Cϕ as
(I33) |
(I34) |
(I35) |
(I36) |
Note that this also gives us an approximation for Cϕ(0). Putting all this together, the equation governing Ch(τ),
(I37) |
becomes [up to ]
(I38) |
(I39) |
(I40) |
(I41) |
(I42) |
Integrating with respect to τ gives
(I43) |
The boundary conditions are
(I44) |
The second condition implies the constant is 0. And the first condition implies
(I45) |
From this, we can solve for Ch(0) (neglecting terms higher than quadratic) to get a solution that is perturbative in the deviation ϵ from the critical point (gh = 2 + ϵ). To the leading order, the variance grows as
(I46) |
and αz enters the timescale-governing term a1 only at O(ϵ2). At first, it might seem counterintuitive that αz, which effectively controls the dynamical time constant in the equations of motion, should not influence the relaxation rate to leading order. However, this result is for the dynamical behavior close to the critical point, where the relaxation time is a scaling function of ϵ. Moving away from this critical point, the relaxation time becomes finite, and the z gate, and, thus, αz, should have a more visible effect.
APPENDIX J: TOPOLOGICAL COMPLEXITY VIA KAC-RICE FORMULA
The arguments here are similar to those presented in Ref. [82], which uses a self-averaging assumption to express the topological complexity (defined below) in terms of a spectral integral. Let us begin.
The goal is to estimate the total number of fixed points for a dynamical system ẋ = G(x). The Kac-Rice analysis proceeds by constructing the integral over the state space x whose integrand has delta-functional support only on the fixed points:
(J1) |
where 𝒟 = ∂G/∂x is the instantaneous Jacobian. The expectation value here is over the random coupling matrices. The average number of fixed points is related to the so-called topological complexity 𝒞 via the definition
We seek a saddle-point approximation of this quantity below.
For the gated RNN, the state space is x = (h, z, r), and the fixed points satisfy
(J2) |
(J3) |
(J4) |
where for notational shorthand we introduce and , anticipating the mean-field approximation to come. Notice that only the first equation, for h, provides a nontrivial constraint. Once h is found, the second and third equations can be used to determine z and r, respectively. Notice, furthermore, that, since σ(zi) > 0, the solutions hi to the first equation do not depend on zi. Indeed, the dependence on σ(z) can be factorized out of the Kac-Rice integral. This requires noting first that, at a fixed point, Eq. (A6) implies that the Jacobian can be written (setting τr = τz = 1 for simplicity)
(J5) |
and that the determinant can be factorized:
(J6) |
(J7) |
The product of σ(zi) produced by the determinant is canceled by the product of delta functions, using the fact that σ(zi) > 0 and the transformation law
(J8) |
So we see that what evidently matters for the topological complexity is the fixed-point Jacobian:
(J9) |
whose eigenvalues we denote by λi for i = 1, …, N, with the spectral density
(J10) |
The preceding analysis is all basically to show that we could easily have set αz = 0 and gotten the same answer; i.e., the z gate does not influence the topological properties of the dynamics. For αz = ∞, the situation changes drastically, and the analysis likely needs to be significantly reworked. Indeed, in this limit, we most likely do not have discrete fixed points anymore, so the very notion of counting fixed points no longer makes sense.
Having introduced the spectral density, we can rewrite the Kac-Rice integral as
(J11) |
Note that, since the spectral density of 𝒟fp is independent of z, the integral over z is trivial to perform and leaves only h and r in the integrand.
So far, everything is exact. We begin now to make some approximations. The first crucial approximation is that the spectral density is self-averaging. The RMT analysis in the previous sections shows us furthermore that the spectral density depends only on macroscopic correlation functions of the state variables. Let us denote the spectral integral factor
(J12) |
by which we mean that it depends on the particular realization of the random coupling 𝒥 and the state vector x. The self-averaging assumption implies that
(J13) |
i.e., this factor does not depend on the particular realization of 𝒥 but just on the state vector. Equivalently, we are assuming that the spectral density depends only on the configurations h and r and not the particular realization Jh,r. This allows us to pull this factor outside of the expectation value:
(J14) |
Now we give some nonrigorous arguments for how one might evaluate the remaining expectation value. In order to carry out the average over Jh and Jr, we utilize the Fourier representation of the delta function to write
(J15) |
(J16) |
which upon disorder averaging yields
(J18) |
where we define
(J19) |
This is where we make our second crucial assumption: that the empirical averages appearing in Eq. (J19) converge to their average value
(J20) |
(J21) |
This means we are assuming the strong law of large numbers. With this essential step, the integral in Eq. (J18) evaluates to
(J22) |
(J23) |
where and Δr = Cϕ—which are just the time-independent (fixed-point) MFT equations (C26).
Returning to the expression for the complexity, this series of approximations gives us
(J24) |
Let us now describe our derivation more intuitively. We start with the formal expression for the Kac-Rice formula, which uses the delta functional integrand to find fixed points and counts them with the weighting factor related to the Jacobian. Our first assumption allows us to simplify the calculation involving the Jacobian, since we argue that this term is self-averaging. The second assumption allows us to deal with the remaining expectation value of the delta functions. The expectation value adds a number of delta functions (however many there may be for that Jh/r) for each configuration of the connectivity. For continuously distributed connectivity, this implies that the expectation value smears out the delta functions and results in a smooth distribution. What should this distribution be? Well, we know from the mean-field analysis that the state vectors are distributed as Gaussians at a fixed point. Furthermore, the mean-field theory becomes exact for large N. Therefore, we should expect that, in this limit, the delta functions are smeared out into the Gaussian distributions determined by the MFT. This is what our derivation shows.
The final step is to recall that the spectral density depends on the state vectors only via empirical averages. For instance, in the absence of an r gate, the spectral density depends on the empirical average Ĉϕ′. Again invoking the strong law of large numbers, we may argue that the self-averaging goes a step further and that
(J25) |
(J26) |
where
(J27) |
This is precisely the spectral density we study in a preceding Appendix and the one for which we obtain an explicit expression for the spectral curve. These approximations give us the topological complexity
(J28) |
Now we take a closer look at the spectral density. The eigenvalues of 𝒟fp form a circular droplet of finite radius ρ centered on −1. Therefore, the eigenvalues have the form λ = −1 + reiθ, and the spectral density is a function only of r. The value of the radius is found from Eq. (5) by removing the z gate (i.e., setting αz = 0). After some algebraic steps, we find for the radius
(J29) |
(J30) |
Using these facts, we can write the topological complexity as
(J31) |
(J32) |
where is the indicator function, which is one for r < ρ and vanishes for r > ρ. Thus, we see that the topological complexity is zero for ρ < 1. This is precisely the fixed-point stability condition derived in the main text [Eq. (6)]. Conversely, the topological complexity is nonzero for ρ > 1, which corresponds to unstable fixed points. We thus see that, under our set of reasonable approximations, unstable MFT fixed points correspond to a finite topological complexity and, consequently, to a number of “microscopic” fixed points that grows exponentially with N.
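The droplet geometry above is simple to verify numerically in the special case without the r gate, where the fixed-point Jacobian reduces to −I + J diag[ϕ′(h)] and the predicted radius is the root-mean-square gain √〈ϕ′²〉. In the sketch below, h is drawn as a Gaussian surrogate for a fixed-point configuration; the variance and gain values are illustrative.

```python
import numpy as np

# Numerical check of the droplet geometry: without the r gate, the
# eigenvalues of -I + J diag[phi'(h)] fill a disk of radius
# sqrt(<phi'^2>) centered at -1. Values are illustrative.
N, g_h, var_h = 1000, 2.5, 0.5
rng = np.random.default_rng(5)
J = rng.normal(0.0, 1.0 / np.sqrt(N), (N, N))
h = rng.normal(0.0, np.sqrt(var_h), N)
dphi = g_h * (1.0 - np.tanh(g_h * h) ** 2)   # phi'(h) for phi = tanh(g_h h)
ev = np.linalg.eigvals(-np.eye(N) + J * dphi)
print("max |lambda + 1| =", np.abs(ev + 1).max(),
      "  predicted rho =", np.sqrt(np.mean(dphi ** 2)))
```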
The final missing ingredient, necessary to show that region 2 in the phase diagram has an exponentially growing number of fixed points, is to show that the MFT fixed points which appear after the bifurcation are indeed unstable. At the moment, we lack any analytical handle on this. However, we confirm numerically that, along the bifurcation curve, the fixed points are unstable and that increasing the variance Δh serves only to increase ρ. Could the lower branch, on which Δh decreases with αr, behave differently? Evidently not: Δh scales with αr in such a way that ends up growing like , thus once again increasing ρ. Therefore, we conclude that the MFT fixed points appearing after the bifurcation are always unstable, with ρ > 1. This concludes our informal proof of the transition in topological complexity between regions 1 and 2 in the phase diagram in Fig. 7.
References
- [1].Graves A, Mohamed A-R, and Hinton G, in Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, New York, 2013), pp. 6645–6649. [Google Scholar]
- [2].Pascanu R, Gulcehre C, Cho K, and Bengio Y, How to Construct Deep Recurrent Neural Networks, arXiv:1312.6026. [Google Scholar]
- [3].Pathak J, Hunt B, Girvan M, Lu Z, and Ott E, Model-Free Prediction of Large Spatiotemporally Chaotic Systems from Data: A Reservoir Computing Approach, Phys. Rev. Lett 120, 024102 (2018). [DOI] [PubMed] [Google Scholar]
- [4].Vlachas PR, Pathak J, Hunt BR, Sapsis TP, Girvan M, Ott E, and Koumoutsakos P, Backpropagation Algorithms and Reservoir Computing in Recurrent Neural Networks for the Forecasting of Complex Spatiotemporal Dynamics, Neural Netw. 126, 191 (2020). [DOI] [PubMed] [Google Scholar]
- [5].Guastoni L, Srinivasan PA, Azizpour H, Schlatter P, and Vinuesa R, On the Use of Recurrent Neural Networks for Predictions of Turbulent Flows, arXiv:2002.01222. [Google Scholar]
- [6].Jozefowicz R, Zaremba W, and Sutskever I, An Empirical Exploration of Recurrent Network Architectures, Proc. Mach. Learn. Res 37, 2342 (2015). [Google Scholar]
- [7].Vogels TP, Rajan K, and Abbott LF, Neural Network Dynamics, Annu. Rev. Neurosci 28, 357 (2005). [DOI] [PubMed] [Google Scholar]
- [8].Ahmadian Y and Miller KD, What Is the Dynamical Regime of Cerebral Cortex?, arXiv:1908.10101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Kadmon J and Sompolinsky H, Transition to Chaos in Random Neuronal Networks, Phys. Rev. X 5, 041030 (2015). [Google Scholar]
- [10].Sussillo D and Abbott LF, Generating Coherent Patterns of Activity from Chaotic Neural Networks, Neuron 63, 544 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Laje R and Buonomano DV, Robust Timing and Motor Patterns by Taming Chaos in Recurrent Neural Networks, Nat. Neurosci 16, 925 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Hochreiter S and Schmidhuber J, Long Short-Term Memory, Neural Comput. 9, 1735 (1997). [DOI] [PubMed] [Google Scholar]
- [13].Mitchell SJ and Silver RA, Shunting Inhibition Modulates Neuronal Gain during Synaptic Excitation, Neuron 38, 433 (2003). [DOI] [PubMed] [Google Scholar]
- [14].Gütig R and Sompolinsky H, Time-Warp–Invariant Neuronal Processing, PLoS Biol. 7, e1000141 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [15].Sompolinsky H, Crisanti A, and Sommers H-J, Chaos in Random Neural Networks, Phys. Rev. Lett 61, 259 (1988). [DOI] [PubMed] [Google Scholar]
- [16].Martí D, Brunel N, and Ostojic S, Correlations between Synapses in Pairs of Neurons Slow Down Dynamics in Randomly Connected Neural Networks, Phys. Rev. E 97, 062314 (2018). [DOI] [PubMed] [Google Scholar]
- [17].Schuessler F, Dubreuil A, Mastrogiuseppe F, Ostojic S, and Barak O, Dynamics of Random Recurrent Networks with Correlated Low-Rank Structure, Phys. Rev. Research 2, 013111 (2020). [Google Scholar]
- [18].Mastrogiuseppe F and Ostojic S, Linking Connectivity, Dynamics, and Computations in Low-Rank Recurrent Neural Networks, Neuron 99, 609 (2018). [DOI] [PubMed] [Google Scholar]
- [19].Stern M, Sompolinsky H, and Abbott LF, Dynamics of Random Neural Networks with Bistable Units, Phys. Rev. E 90, 062710 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [20].Aljadeff J, Stern M, and Sharpee T, Transition to Chaos in Random Networks with Cell-Type-Specific Connectivity, Phys. Rev. Lett 114, 088101 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [21].Schuecker J, Goedeke S, and Helias M, Optimal Sequence Memory in Driven Random Networks, Phys. Rev. X 8, 041029 (2018). [Google Scholar]
- [22].Brette R, Exact Simulation of Integrate-and-Fire Models with Synaptic Conductances, Neural Comput. 18, 2004 (2006). [DOI] [PubMed] [Google Scholar]
- [23].Amari S-I, Characteristics of Random Nets of Analog Neuron-Like Elements, IEEE Trans. Syst. Man Cybernet SMC-2, 643 (1972). [Google Scholar]
- [24].Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, and Bengio Y, Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation, arXiv:1406.1078. [Google Scholar]
- [25].Chalker JT and Mehlig B, Eigenvector Statistics in Non-Hermitian Random Matrix Ensembles, Phys. Rev. Lett 81, 3367 (1998). [Google Scholar]
- [26].Feinberg J and Zee A, Non-Hermitian Random Matrix Theory: Method of Hermitian Reduction, Nucl. Phys B504, 579 (1997). [Google Scholar]
- [27].Martin PC, Siggia E, and Rose H, Statistical Dynamics of Classical Systems, Phys. Rev. A 8, 423 (1973). [Google Scholar]
- [28].De Dominicis C, Dynamics as a Substitute for Replicas in Systems with Quenched Random Impurities, Phys. Rev. B 18, 4913 (1978). [Google Scholar]
- [29].Hertz JA, Roudi Y, and Sollich P, Path Integral Methods for the Dynamics of Stochastic and Disordered Systems, J. Phys. A 50, 033001 (2017). [Google Scholar]
- [30].Janssen H-K, On a Lagrangean for Classical Field Dynamics and Renormalization Group Calculations of Dynamical Critical Properties, Z. Phys. B 23, 377 (1976). [Google Scholar]
- [31].Crisanti A and Sompolinsky H, Path Integral Approach to Random Neural Networks, Phys. Rev. E 98, 062120 (2018). [Google Scholar]
- [32].Helias M and Dahmen D, Statistical Field Theory for Neural Networks (Springer, New York, 2020). [Google Scholar]
- [33].Mora T and Bialek W, Are Biological Systems Poised at Criticality?, J. Stat. Phys 144, 268 (2011). [Google Scholar]
- [34].Seung HS, Continuous Attractors and Oculomotor Control, Neural Netw. 11, 1253 (1998). [DOI] [PubMed] [Google Scholar]
- [35].Seung HS, How the Brain Keeps the Eyes Still, Proc. Natl. Acad. Sci. U.S.A 93, 13339 (1996). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [36].Seung HS, Lee DD, Reis BY, and Tank DW, Stability of the Memory of Eye Position in a Recurrent Network of Conductance-Based Model Neurons, Neuron 26, 259 (2000). [DOI] [PubMed] [Google Scholar]
- [37].Machens CK, Romo R, and Brody CD, Flexible Control of Mutual Inhibition: A Neural Model of Two-Interval Discrimination, Science 307, 1121 (2005). [DOI] [PubMed] [Google Scholar]
- [38].Chaudhuri R and Fiete I, Computational Principles of Memory, Nat. Neurosci 19, 394 (2016). [DOI] [PubMed] [Google Scholar]
- [39].Bialek W, Biophysics: Searching for Principles (Princeton University Press, Princeton, NJ, 2012). [Google Scholar]
- [40].Goldman MS, Memory without Feedback in a Neural Network, Neuron 61, 621 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- [41].Maheswaranathan N, Williams A, Golub MD, Ganguli S, and Sussillo D, Reverse Engineering Recurrent Networks for Sentiment Classification Reveals Line Attractor dynAmics., in Advances in Neural Information Processing Systems, edited by Wallach H, Larochelle H, Beygelzimer A, d’Alché-Buc F, Fox E, and Garnett R (Curran Associates, Inc., New York, 2019), Vol 32, p. 15696. [PMC free article] [PubMed] [Google Scholar]
- [42]. Farrell M, Recanatesi S, Moore T, Lajoie G, and Shea-Brown E, Recurrent Neural Networks Learn Robust Representations by Dynamically Balancing Compression and Expansion, bioRxiv 10.1101/564476.
- [43]. Molgedey L, Schuchhardt J, and Schuster HG, Suppressing Chaos in Neural Networks by Noise, Phys. Rev. Lett. 69, 3717 (1992).
- [44]. Rajan K, Abbott LF, and Sompolinsky H, Stimulus-Dependent Suppression of Chaos in Recurrent Neural Networks, Phys. Rev. E 82, 011903 (2010).
- [45]. Wainrib G and Touboul J, Topological and Dynamical Complexity of Random Neural Networks, Phys. Rev. Lett. 110, 118101 (2013).
- [46]. Sutskever I, Martens J, Dahl G, and Hinton G, On the Importance of Initialization and Momentum in Deep Learning, Proc. Mach. Learn. Res. 28, 1139 (2013).
- [47]. Legenstein R and Maass W, Edge of Chaos and Prediction of Computational Performance for Neural Circuit Models, Neural Netw. 20, 323 (2007).
- [48]. Jaeger H and Haas H, Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication, Science 304, 78 (2004).
- [49]. Toyoizumi T and Abbott LF, Beyond the Edge of Chaos: Amplification and Temporal Integration by Recurrent Networks in the Chaotic Regime, Phys. Rev. E 84, 051908 (2011).
- [50]. Since the Jacobian spectral density depends on correlation functions (see Appendix A), in the dynamical steady state the spectral density becomes time-translation invariant. In other words, the spectral density also reaches a steady-state distribution, so a snapshot of the spectral density at any given time has the same form. Instability then implies that the eigenvectors must evolve over time in order to keep the dynamics bounded. The timescale of this eigenvector evolution should correspond roughly to the correlation time implied by the DMFT. Within this window, spectral analysis of the Jacobian in the steady state gives a meaningful description of the range of timescales involved. Furthermore, we see empirically that this local structure is highly informative about the true dynamics, in particular for understanding the emergence of continuous attractors and marginal stability, as we discuss in Sec. IV. (A minimal numerical illustration of the snapshot picture is sketched after this reference list.)
- [51]. Can T, Krishnamurthy K, and Schwab DJ, Gating Creates Slow Modes and Controls Phase-Space Complexity in GRUs and LSTMs, Proc. Mach. Learn. Res. 107, 476 (2020).
- [52]. The continuous-time gated RNN we study in this paper is most closely related to the GRU architecture studied in Ref. [51].
- [53]. Eguíluz VM, Ospeck M, Choe Y, Hudspeth AJ, and Magnasco MO, Essential Nonlinearities in Hearing, Phys. Rev. Lett. 84, 5232 (2000).
- [54]. Eckmann J-P and Ruelle D, in The Theory of Chaotic Attractors (Springer, New York, 1985), pp. 273–312.
- [55]. For reference, we also supply a bound on the maximal Lyapunov exponent in Appendix F, showing that the relaxation time of the dynamics enters into an upper bound on λmax.
- [56]. Derrida B and Pomeau Y, Random Networks of Automata: A Simple Annealed Approximation, Europhys. Lett. 1, 45 (1986).
- [57]. Cessac B, Increase in Complexity in Random Neural Networks, J. Phys. I (France) 5, 409 (1995).
- [58]. One might worry that the h and σ(z) correlators are not separable in general. However, this issue arises only for large αz; for moderate αz, the separability assumption is valid.
- [59]. Fyodorov YV, Complexity of Random Energy Landscapes, Glass Transition, and Absolute Value of the Spectral Determinant of Random Matrices, Phys. Rev. Lett. 92, 240601 (2004).
- [60]. Fyodorov YV and Le Doussal P, Topology Trivialization and Large Deviations for the Minimum in the Simplest Random Optimization, J. Stat. Phys. 154, 466 (2014).
- [61]. Pereira J and Wang X-J, A Tradeoff between Accuracy and Flexibility in a Working Memory Circuit Endowed with Slow Feedback Mechanisms, Cereb. Cortex 25, 3586 (2015).
- [62]. Greff K, Srivastava RK, Koutník J, Steunebrink BR, and Schmidhuber J, LSTM: A Search Space Odyssey, IEEE Trans. Neural Netw. Learn. Syst. 28, 2222 (2017).
- [63]. In fact, the fixed-point phase diagrams for the current model and the GRU are in one-to-one correspondence. Importantly, what this static phase diagram lacks is region 3 in Fig. 7.
- [64]. Tallec C and Ollivier Y, Can Recurrent Neural Networks Warp Time?, arXiv:1804.11188.
- [65]. Muscinelli SP, Gerstner W, and Schwalger T, How Single Neuron Properties Shape Chaotic Dynamics and Signal Transmission in Random Neural Networks, PLoS Comput. Biol. 15, e1007122 (2019).
- [66]. Beiran M and Ostojic S, Contrasting the Effects of Adaptation and Synaptic Filtering on the Timescales of Dynamics in Recurrent Networks, PLoS Comput. Biol. 15, e1006893 (2019).
- [67]. Pereira U and Brunel N, Attractor Dynamics in Networks with Learning Rules Inferred from In Vivo Data, Neuron 99, 227 (2018).
- [68]. Bertschinger N and Natschläger T, Real-Time Computation at the Edge of Chaos in Recurrent Neural Networks, Neural Comput. 16, 1413 (2004).
- [69]. Legenstein R and Maass W, in New Directions in Statistical Signal Processing: From Systems to Brain, edited by Haykin S, Principe JC, Sejnowski TJ, and McWhirter J (The MIT Press, Cambridge, MA, 2006), p. 127.
- [70]. Boedecker J, Obst O, Lizier JT, Mayer NM, and Asada M, Information Processing in Echo State Networks at the Edge of Chaos, Theory Biosci. 131, 205 (2012).
- [71]. Geman S and Hwang C-R, A Chaos Hypothesis for Some Large Systems of Random Equations, Z. Wahrscheinlichkeitstheorie Verwandte Gebiete 60, 291 (1982).
- [72]. Strictly speaking, the state variables evolve according to dynamics governed by (and thus dependent on) the J’s. However, the local chaos hypothesis states that large random networks approach a steady state in which the state variables are independent of the J’s and are distributed according to their steady-state distribution. (A simple numerical check of this self-averaging is sketched after this reference list.)
- [73]. Sompolinsky H and Zippelius A, Relaxational Dynamics of the Edwards-Anderson Model and the Mean-Field Theory of Spin-Glasses, Phys. Rev. B 25, 6860 (1982).
- [74]. Sompolinsky H and Zippelius A, Dynamic Theory of the Spin-Glass Phase, Phys. Rev. Lett. 47, 359 (1981).
- [75]. Chow CC and Buice MA, Path Integral Methods for Stochastic Differential Equations, J. Math. Neurosci. 5, 8 (2015).
- [76]. Roy F, Biroli G, Bunin G, and Cammarota C, Numerical Implementation of Dynamical Mean Field Theory for Disordered Systems: Application to the Lotka–Volterra Model of Ecosystems, J. Phys. A 52, 484001 (2019).
- [77]. Geist K, Parlitz U, and Lauterborn W, Comparison of Different Methods for Computing Lyapunov Exponents, Prog. Theor. Phys. 83, 875 (1990).
- [78]. Engelken R, Wolf F, and Abbott L, Lyapunov Spectra of Chaotic Recurrent Neural Networks, arXiv:2006.02427.
- [79]. The local chaos hypothesis employed by Cessac [57] amounts to the same assumption.
- [80]. Strictly speaking, Oseledets theorem guarantees that λmax = limt→∞ (1/2t) log(‖χu‖²/‖u‖²) for almost every u. In particular, we can take u to be the all-ones vector. The term inside the log then becomes (1/N)Σi,j χij² + (1/N)Σi Σj≠k χij χik, and the second term is subleading in N, since the susceptibilities are random functions. This justifies Eq. (F1). (The expansion is written out after this reference list.)
- [81]. In previous work, g = 1 sets the critical value. The difference is simply due to the factor σr(0) = 1/2. The vanilla RNN result is recovered by sending βr → ∞.
- [82]. Ipsen JR and Peterson ADH, Consequences of Dale’s Law on the Stability-Complexity Relationship of Random Neural Networks, Phys. Rev. E 101, 052412 (2020).
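
Numerical sketch for footnote [50]. The following is a minimal illustration of the snapshot picture, using the classic additive rate network dh/dt = −h + Jφ(h) with φ = tanh rather than the full gated model (whose Jacobian has additional gate-dependent blocks; see Appendix A). The network size, gain g, and Euler integration are arbitrary choices made for the example.

```python
# Minimal sketch (footnote [50]): snapshot of the instantaneous Jacobian
# spectrum of a randomly connected additive rate network in its dynamical
# steady state. Illustrative only; parameter values are arbitrary.
import numpy as np

N, g = 1000, 1.5                 # network size and coupling gain
dt, steps = 0.05, 4000           # Euler step and number of relaxation steps
rng = np.random.default_rng(0)
J = g * rng.standard_normal((N, N)) / np.sqrt(N)  # i.i.d. Gaussian couplings

h = rng.standard_normal(N)
for _ in range(steps):           # relax to the dynamical steady state
    h += dt * (-h + J @ np.tanh(h))

# Instantaneous Jacobian at the snapshot: D = -I + J diag(phi'(h)),
# with phi'(h_j) = 1 - tanh(h_j)^2 multiplying column j of J.
D = -np.eye(N) + J * (1.0 - np.tanh(h) ** 2)[None, :]
eigs = np.linalg.eigvals(D)

# Because the spectral density is time-translation invariant in the steady
# state, a snapshot at any late time has the same form; eigenvalues with
# positive real part mark the locally unstable directions discussed above.
print("max Re(eig) at snapshot:", eigs.real.max())
```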
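Numerical sketch for footnote [72]. A quick check of the self-averaging implied by the local chaos hypothesis, again using the additive network for simplicity; all parameter values are arbitrary. Two independent draws of J should yield essentially the same steady-state population statistics.

```python
# Sketch (footnote [72]): steady-state single-unit statistics should not
# depend on the particular realization of the random couplings J.
import numpy as np

def steady_state_moments(seed, N=2000, g=1.5, dt=0.05, steps=4000):
    """Relax the additive rate network to steady state; return moments."""
    rng = np.random.default_rng(seed)
    J = g * rng.standard_normal((N, N)) / np.sqrt(N)
    h = rng.standard_normal(N)
    for _ in range(steps):
        h += dt * (-h + J @ np.tanh(h))
    return h.mean(), h.var()

# Independent coupling matrices give nearly identical population moments,
# consistent with the state variables decoupling from the J's at large N.
for seed in (1, 2):
    mean, var = steady_state_moments(seed)
    print(f"J realization {seed}: mean = {mean:+.3f}, variance = {var:.3f}")
```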
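Worked expansion for footnote [80]. Spelling out the step reconstructed in that note (our reading of the argument, not a quotation of Appendix F): with u the all-ones vector, ‖u‖² = N, and

```latex
% Expansion behind footnote [80]: u is the all-ones vector, so
\[
  \frac{\|\chi u\|^{2}}{\|u\|^{2}}
  = \frac{1}{N}\sum_{i}\Bigl(\sum_{j}\chi_{ij}\Bigr)^{2}
  = \underbrace{\frac{1}{N}\sum_{i,j}\chi_{ij}^{2}}_{\text{diagonal }(j=k)\text{ terms}}
  \;+\; \underbrace{\frac{1}{N}\sum_{i}\sum_{j\neq k}\chi_{ij}\,\chi_{ik}}_{\text{cross terms, subleading in }N}
\]
% The cross terms mix distinct random susceptibilities and average out,
% so \lambda_{\max} is governed by \sum_{i,j}\chi_{ij}^{2}, justifying Eq. (F1).
```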