Exact learning dynamics of deep linear networks with prior knowledge

Clémentine C J Dominé; Lukas Braun; James E Fitzgerald; Andrew M Saxe

doi:10.1088/1742-5468/ad01b8

. 2023 Nov 15;2023(11):114004. doi: 10.1088/1742-5468/ad01b8

Exact learning dynamics of deep linear networks with prior knowledge^{^*}

Clémentine C J Dominé ^1,⁶, Lukas Braun ^2,⁶, James E Fitzgerald ³, Andrew M Saxe ^1,^4,^5,^**

PMCID: PMC10955603 PMID: 38524253

Abstract

Learning in deep neural networks is known to depend critically on the knowledge embedded in the initial network weights. However, few theoretical results have precisely linked prior knowledge to learning dynamics. Here we derive exact solutions to the dynamics of learning with rich prior knowledge in deep linear networks by generalising Fukumizu’s matrix Riccati solution (Fukumizu 1998 Gen 1 1E–03). We obtain explicit expressions for the evolving network function, hidden representational similarity, and neural tangent kernel over training for a broad class of initialisations and tasks. The expressions reveal a class of task-independent initialisations that radically alter learning dynamics from slow non-linear dynamics to fast exponential trajectories while converging to a global optimum with identical representational similarity, dissociating learning trajectories from the structure of initial internal representations. We characterise how network weights dynamically align with task structure, rigorously justifying why previous solutions successfully described learning from small initial weights without incorporating their fine-scale structure. Finally, we discuss the implications of these findings for continual learning, reversal learning and learning of structured knowledge. Taken together, our results provide a mathematical toolkit for understanding the impact of prior knowledge on deep learning.

Keywords: deep learning, learning theory, machine learning

1. Introduction

A hallmark of human learning is our exquisite sensitivity to prior knowledge: what we already know affects how we subsequently learn (Carey 1985). For instance, having learned about the attributes of nine animals, we may learn about the tenth more quickly (McClelland et al 1995, Murphy 2004, McClelland 2013, Flesch et al 2018). In machine learning, the impact of prior knowledge on learning is evident in a range of paradigms including reversal learning (Erdeniz and Atalay 2010), transfer learning (Taylor and Stone 2009, Thrun and Pratt 2012, Lampinen and Ganguli 2018, Gerace et al 2022), continual learning (Kirkpatrick et al 2017, Zenke et al 2017, Parisi et al 2019), curriculum learning (Bengio et al 2009), and meta learning (Javed and White 2019). One form of prior knowledge in deep networks is the initial network state, which is known to strongly impact learning dynamics (Saxe et al 2014, Pennington et al 2017, Bahri et al 2020). Even random initial weights of different variance can yield qualitative shifts in network behaviour between the lazy and rich regimes (Chizat et al 2019), imparting distinct inductive biases on the learning process. More broadly, rich representations such as those obtained through pretraining provide empirically fertile inductive biases for subsequent fine-tuning (Raghu et al 2019). Yet while the importance of prior knowledge to learning is clear, our theoretical understanding remains limited, and fundamental questions remain about the implicit inductive biases of neural networks trained from structured initial weights. A better understanding of the impact of initialisation on gradient-based learning may lead to improved pretraining schemes and illuminate pathologies like catastrophic forgetting in continual learning (McCloskey and Cohen 1989).

Here, we address this gap by deriving exact solutions to the dynamics of learning in deep linear networks as a function of network initialisation, revealing an intricate and systematic dependence. We consider the setting depicted in figure 1(A), where a network is trained with standard gradient descent from a potentially complex initialisation. When trained on the same task, different initialisations can radically change the network’s learning trajectory (figures 1(B)–(D)). Our approach, based on a matrix Riccati formalism (Fukumizu 1998), provides explicit analytical expressions for the network output over time (figures 1(B)–(D) dotted). While simple, deep linear networks have a non-convex loss landscape and have been shown to recapitulate several features of nonlinear deep networks while retaining mathematical tractability.

Figure 1. — Learning with prior knowledge. (A) In our setting, a deep linear network with $N_{i}$ input, $N_{h}$ hidden and $N_{o}$ output neurons is trained from a particular initialisation using gradient descent. (B)–(D) Network output for an example task over training time when starting from (B) small random weights, (C) large random weights, and (D) the weights of a previously learned task. The dynamics depend in detail on the initialisation. Solid lines indicate simulations, dotted lines indicate the analytical solutions we derive in this work.

1.1. Contributions

•
We derive an explicit solution for the gradient flow of the network function, internal representational similarity, and finite-width neural tangent kernel (NTK) of over- and under-complete two-layer deep linear networks for a rich class of initial conditions (section 3).
•
We characterise a set of random initial network states that exhibit fast, exponential learning dynamics and yet converge to rich neural representations. Dissociating fast and slow learning dynamics from the rich and lazy learning regimes (section 4).
•
We analyse how weights dynamically align to task-relevant structure over the course of learning, going beyond prior work that has assumed initial alignment (section 5).
•
We provide exact solutions to continual learning dynamics, reversal learning dynamics and to the dynamics of learning and revising structured representations (section 6).

1.2. Related work

Our work builds on analyses of deep linear networks (Baldi and Hornik 1989, Fukumizu 1998, Saxe et al 2014, 2019, Lampinen and Ganguli 2018, Arora et al 2018a, Tarmoun et al 2021, Atanasov et al 2022), which have shown that this simple model nevertheless has intricate fixed point structure and nonlinear learning dynamics reminiscent of phenomena seen in nonlinear networks. A variety of works has analysed convergence (Arora et al 2018b, Du and Hu 2019), generalisation (Lampinen and Ganguli 2018, Poggio et al 2018, Huh 2020), and the implicit bias of gradient descent (Gunasekar et al 2018, Ji and Telgarsky 2018, Laurent and Brecht 2018, Arora et al 2019a). These works mostly considers the tabula rasa case of small initial random weights, for which exact solutions are known (Saxe et al 2014). By contrast our formalism describes dynamics from a much larger class of initial conditions and can describe alignment dynamics that do not occur in the tabula rasa setting. Most directly, our results build from the matrix Riccati formulation proposed by Fukumizu (1998). Connecting this formulation and matrix factorisation problems yields a better characterisation of the convergence rate (Tarmoun et al 2021). We extend and refine the matrix Riccati result to obtain the dynamics of over- and under-complete networks; to obtain numerically stable forms of the matrix equations; and to more explicitly reveal the impact of initialisation.

A line of theoretical research has considered online learning dynamics in teacher-student settings (Biehl and Schwarze 1995, Saad and Solla 1995, Goldt et al 2019), deriving ordinary differential equations for the average learning dynamics even in nonlinear networks. However, solving these equations requires numerical integration. By contrast, our approach provides explicit analytical solutions for the more restricted case of deep linear networks.

Other approaches for analysing deep network dynamics include the NTK (Jacot et al 2018, Lee et al 2019, Arora et al 2019b) and the mean field approach (Mei et al 2018, Rotskoff and Vanden-Eijnden 2018, Sirignano and Spiliopoulos 2020). While the former can describe nonlinear networks but not the learning dynamics of hidden representations, the later yields a description of representation learning dynamics in wide networks in terms of a partial differential equation. Our work is similar in seeking a subset of more tractable models that are amenable to analysis, but we focus on the impact of initialisation on representation learning dynamics and explicit solutions.

A large body of work has investigated the effect of different random initialisations on learning in deep networks. The role of initialisation in the vanishing gradient problem and proposals for better initialisation schemes have been illuminated by several works drawing on the central limit theorem (Glorot and Bengio 2010, Saxe et al 2014, He et al 2015, Pennington et al 2017, Xiao et al 2018), reviewed in Carleo et al (2019), Arora et al (2020), Bahri et al (2020). These approaches typically guarantee that gradients do not vanish at the start of learning, but do not analytically describe the resulting learning trajectories. Influential work has shown that network initialisation variance mediates a transition from rich representation learning to lazy NTK dynamics (Chizat et al 2019), which we analyse in our framework.

2. Preliminaries and setting

Consider a supervised learning task in which input vectors $x_{n} \in R^{N_{i}}$ from a set of P training pairs ${(x_{n}, y_{n})}_{n = 1 \dots P}$ have to be associated with their target output vectors $y_{n} \in R^{N_{o}}$ . We learn this task with a two-layer linear network model (figure 1(A)) that produces the output prediction

{\hat{y}}_{n} = W_{2} W_{1} x_{n},

with weight matrices $W_{1} \in R^{N_{h} \times N_{i}}$ and $W_{2} \in R^{N_{o} \times N_{h}}$ , where $N_{h}$ is the number of hidden units. The network’s weights are optimised using full batch gradient descent with learning rate η (or respectively time constant $τ = \frac{1}{η}$ ) on the mean squared error loss

L (\hat{y}, y) = \frac{1}{2} ⟨| | \hat{y} - y | |^{2}⟩,

where $⟨ \cdot ⟩$ denotes the average over the dataset. The input and input-output correlation matrices of the dataset are

Finally, the gradient optimisation starts from an initialisation $W_{2} (0), W_{1} (0)$ . Our goal is to understand the full time trajectory of the network’s output and internal representations as a function of this initialisation and the task statistics.

Our starting point is the seminal work of Fukumizu (Fukumizu 1998), which showed that the gradient flow dynamics could be written as a matrix Riccati equation with known solution. In particular, defining

the continuous time dynamics of the matrix $Q Q^{T}$ from initial state $Q (0)$ is

Q Q^{T} (t) = e^{F \frac{t}{τ}} Q (0) {[I + \frac{1}{2} Q {(0)}^{T} (e^{F \frac{t}{τ}} F^{- 1} e^{F \frac{t}{τ}} - F^{- 1}) Q (0)]}^{- 1} Q {(0)}^{T} e^{F \frac{t}{τ}},

if the following four assumptions hold:

Assumption 2.1.

The dimensions of the input and target vectors are identical, that is $N_{i} = N_{o}$ .

Assumption 2.2.

The input data is whitened, that is ${\tilde{Σ}}^{x x} = I$ .

Assumption 2.3.

The network’s weight matrices are zero-balanced at the beginning of training, that is $W_{1} (0) W_{1} (0)^{T} = W_{2} (0)^{T} W_{2} (0)$ . If this condition holds at initialisation, it will persist throughout training (Saxe et al 2014, Arora et al 2018a).

Assumption 2.4.

The input-output correlation of the task and the initial state of the network function have full rank, that is $r a n k ({\tilde{Σ}}^{x y}) = r a n k (W_{2} (0) W_{1} (0)) = N_{i} = N_{o}$ . This implies that the network is not bottlenecked, i.e. $N_{h} \geq \min (N_{i}, N_{o})$ .

For completeness, we include a derivation of this solution in appendix A.

Rather than tracking the weights’ dynamics directly, this approach tracks several key statistics collected in the matrix

graphic file with name jstatad01b8f10_lr.jpg

which can be separated into four quadrants with intuitive meaning: the off-diagonal blocks contain the network function

graphic file with name jstatad01b8f11_lr.jpg

while the on-diagonal blocks contain the correlation structure of the weight matrices. These permit calculation of the temporal evolution of the network’s internal representations including the task-relevant representational similarity matrices (RSMs) (Kriegeskorte et al 2008), i.e. the kernel matrix $ϕ (x)^{T} ϕ (x^{'})$ , of the neural representations in the hidden layer

graphic file with name jstatad01b8f12_lr.jpg

where + denotes the pseudoinverse; and the network’s finite-width NTK (Jacot et al 2018, Lee et al 2019, Arora et al 2019b)

graphic file with name jstatad01b8f13_lr.jpg

where I is the identity matrix and ⊗ is the Kronecker product. For a derivation of these quantities see appendix B. Hence, the solution in equation (5) describes important aspects of network behaviour.

However, in this form, the solution has several limitations. First, it relies on general matrix exponentials and inverses, which are a barrier to explicit understanding. Second, when evaluated numerically, it is often unstable. And third, the equation is only valid for equal input and output dimensions. In the following section we address these limitations.

Implementation and simulation. Simulation details are in appendix H. Code to replicate all simulations and plots are available online ⁶ under a GPLv3 license and requires $<$ 6 h to execute on a single AMD Ryzen 5950x.

3. Exact learning dynamics with prior knowledge

In this section we derive an exact and numerically stable solution for $Q Q^{T}$ that better reveals the learning dynamics, convergence behaviour and generalisation properties of two-layer linear networks with prior knowledge. Further, we alter the equations to be applicable to equal and unequal input and output dimensions, overcoming assumption 2.1.

To place the solution in a more explicit form, we make use of the compact singular value decomposition. Let the compact singular value decomposition of the initial network function and the input-output correlation of the task be

S V D (W_{2} (0) W_{1} (0)) = U S V^{T} and S V D ({\tilde{Σ}}^{y x}) = \tilde{U} \tilde{S} {\tilde{V}}^{T} .

Here, U and $\tilde{U} \in R^{N_{o} \times N_{m}}$ denote the left singular vectors, S and $\tilde{S} \in R^{N_{m} \times N_{m}}$ the square matrix with ordered, non-zero eigenvalues on its diagonal and V and $\tilde{V} \in R^{N_{i} \times N_{m}}$ the corresponding right singular vectors. For unequal input-output dimensions ( $N_{i} \neq N_{o}$ ) the right and left singular vectors are therefore not generally square and orthonormal. Accordingly, for the case $N_{i} > N_{o}$ , we define ${\tilde{U}}_{⊥} \in R^{N_{o} \times (N_{o} - N_{i})}$ as a matrix containing orthogonal column vectors that complete the basis, i.e. make $[\tilde{U} {\tilde{U}}_{⊥}]$ orthonormal. Conversely, we define ${\tilde{V}}_{⊥} \in R^{N_{i} \times (N_{i} - N_{o})}$ for the case of $N_{i} > N_{o}$ .

Assumption 3.1.

Define $B = U^{T} \tilde{U} + V^{T} \tilde{V}$ and $C = U^{T} \tilde{U} - V^{T} \tilde{V}$ . B is non-singular.

Theorem 3.1.

Under the assumptions of whitened inputs, 2.2, zero-balanced weights 2.3, full rank 2.4, and B non-singular 3.1, the temporal dynamics of $Q Q^{T}$ are

with

For a proof of theorem 3.1 please refer to appendix C.

With this solution we can calculate the exact temporal dynamics of the loss, network function, RSMs and NTK (figures 2(A) and (B)). As the solution contains only negative exponentials, it is numerically stable and provides high precision across a wide range of learning rates and network architectures (figures 2(C) and (D)).

Figure 2. — Exact learning dynamics (A) the temporal dynamics of the numerical simulation (coloured lines) of the loss, network function, correlation of input and output weights and the NTK (columns 1–5 respectively) are exactly matched by the analytical solution (black dotted lines) for small initial weight values and (B) large initial weight values. (C) Each line shows the deviation of the analytical loss $\hat{L}$ from the numerical loss $L$ for one of n = 50 networks with random architecture and training data (details in appendix H) across a range of learning rates $η \in [0.05, 0.0005]$ . The deviation mutually decreases with the learning rate. (D) Numerical and analytical learning curves for five randomly sampled example networks (coloured x in (C)).

We note that a solution for the weights $W_{1} (t)$ and $W_{2} (t)$ , i.e. $Q (t)$ , can be derived up to a time varying orthogonal transformation as demonstrated in appendix C. Further, as time-dependent variables only occur in matrix exponentials of diagonal matrices of negative sign, the network approaches a steady state solution.

Theorem 3.2.

Under the assumptions of theorem 3.1, the network function converges to the global minimum $\tilde{U} \tilde{S} {\tilde{V}}^{T}$ and acquires a rich task-specific internal representation, that is $W_{1}^{T} W_{1} = \tilde{V} \tilde{S} {\tilde{V}}^{T}$ and $W_{2} W_{2}^{T} = \tilde{U} \tilde{S} {\tilde{U}}^{T}$ .

The proof of theorem 3.2 is in appendix C. We now turn to several implications of these results.

4. Rich and lazy learning regimes and generalisation

Recent results have shown that large deep networks can operate in qualitatively distinct regimes that depend on their weight initialisations (Chizat et al 2019, Flesch et al 2022), the so called rich and lazy regimes. In the rich regime, learning dynamics can be highly nonlinear and lead to task-specific solutions thought to lead to favourable generalisation properties (Chizat et al 2019, Saxe et al 2019, Flesch et al 2022). By contrast, the lazy regime exhibits simple exponential learning dynamics and exploits high-dimensional nonlinear projections of the data produced by the initial random weights, leading to task-agnostic representations that attain zero training error but possibly lower generalisation performance (Jacot et al 2018, Lee et al 2019, Arora et al 2019b). Traditionally, the rich and lazy learning regimes have been respectively linked to low and high variance initial weights (relative to the network layer size).

To illustrate these phenomena, we consider a semantic learning task in which a set of living things have to be linked to their position in a hierarchical structure (figure 3(A)) (Saxe et al 2014). The representational similarity of the input of the task ( $\tilde{V} \tilde{S} {\tilde{V}}^{T}$ ) reveals its inherent structure (figure 3(B)). For example, the representations of the two fishes are most similar to each other, less similar to birds and least similar to plants. Likewise, the representational similarity of the task’s target values ( $\tilde{U} \tilde{S} {\tilde{U}}^{T}$ ) reveals the primary groups among which items are organised. As a consequence, one can for example predict from an object being a fish that it is an animal and from an object being a plant that it is not a bird. Reflecting these structural relationships in internal representations can allow the rich regime to generalise in ways the lazy regime cannot. Crucially, $Q Q^{T} (t)$ contains the temporal dynamics of the weights’ representational similarity and therefore can be used to study if a network finds a rich or lazy solution.

When training a two layer network from random small initial weights, the weights’ input and output RSM (figure 3(C), upper left and lower right quadrant) are identical to the task’s structure at convergence. However, when training from large initial weights, the RSM reveals that the network has converged to a lazy solution (figure 3(D)). We emphasise that the network function in both cases is identical (figures 3(C) and (D), lower left quadrant). And while their final loss is identical too, their learning dynamics evolve slow and step-wise in the case of small initial weights and fast and exponentially in the case of large initial weights (figure 3(F)), as predicted by previous work (Chizat et al 2019).

However, from theorem 3.2 it directly follows that our setup is guaranteed to find a rich solution in which the weights’ RSM is identical to the task’s RSM, i.e. $W_{1}^{T} W_{1} = \tilde{V} \tilde{S} {\tilde{V}}^{T}$ and $W_{2} W_{2}^{T} = \tilde{U} \tilde{S} {\tilde{U}}^{T}$ . Therefore, as zero-balanced weights may be large, there exist initial states that converge to rich solutions while evolving as rapid exponential learning curves (figures 3(E) and (F)). Crucially, these initialisations are task-agnostic, in the sense that they are independent of the task structure (see Mishkin and Matas 2015). This finding applies to any learning task with well defined input-output correlation. For additional simulations see appendix D. Hence our equation can describe the change in dynamics from step-like to exponential with increasing weight scale, and separate this dynamical phenomenon from the structure of internal representations.

5. Decoupling dynamics

The learning dynamics of deep linear networks depend on the exact initial values of the synaptic weights. Previous solutions studied learning dynamics under the assumption that initial network weights are ‘decoupled’, such that the initial state of the network and the task share the same singular vectors, i.e. that $U = \tilde{U}$ and $V = \tilde{V}$ (Saxe et al 2014). Intuitively, this assumption means that there is no cross-coupling between different singular modes, such that each evolves independently. However, this assumption is violated in most real-world scenarios. As a consequence, most prior work has relied on the empirical observation that learning from tabula rasa small initial weights occurs in two phases: First, the network’s input-output map rapidly decouples; then subsequently, independent singular modes are learned in this decoupled regime. Because this decoupling process is fast when training begins from small initial weights, the learning dynamics are still approximately described by the temporal learning dynamics of the singular values assuming decoupling from the start. This dynamic has been called a silent alignment process (Atanasov et al 2022). Here we leverage our matrix Riccati approach to analytically study the dynamics of this decoupling process. We begin by deriving an alternate form of the exact solution that eases the analysis.

Theorem 5.1.

Let the weight matrices of a two layer linear network be initialised by $W_{1} = A (0) {\tilde{V}}^{T}$ and $W_{2} = \tilde{U} A (0)^{T}$ , where $A (0) \in R^{N_{h} \times N_{i}}$ is an arbitrary, invertible matrix. Then, under the assumptions of equal input-output dimensions 2.1, whitened inputs 2.2, zero-balanced weights 2.3 and full rank 2.4, the temporal dynamics of $Q Q^{T}$ are fully determined by

$A^{T} A (t) = {[e^{- \tilde{S} \frac{t}{τ}} {(A {(0)}^{T} A (0))}^{- 1} e^{- \tilde{S} \frac{t}{τ}} + (I - e^{- 2 \tilde{S} \frac{t}{τ}}) {\tilde{S}}^{- 1}]}^{- 1} .$

For a proof of theorem 5.1, please refer to appendix E. We remark that this form is less general than that in theorem 3.1, and in particular implies $U V = \tilde{U} \tilde{V}$ . Here the matrix $A^{T} A$ represents the dynamics directly in the SVD basis of the task. Off-diagonal elements represent counterproductive coupling between different singular modes (for instance, $[A^{T} A]_{21}$ is the strength of connection from input singular vector 1 to output singular vector 2, which must approach zero to perform the task perfectly), while on-diagonal elements represent the coupling within the same mode (for instance, $[A^{T} A]_{11}$ is the strength of connection from input singular vector 1 to output singular vector 1, which must approach the associated task singular value to perform the task perfectly). Hence the decoupling process can be studied by examining the dynamics by which $A^{T} A$ becomes approximately diagonal.

The outer inverse in equation (13) renders it difficult to study high dimensional networks analytically. Therefore, we focus on small networks with input and output dimension $N_{i} = 2$ and $N_{o} = 2$ , for which a lengthy but explicit analytical solution is given in appendix E. In this setting, the structure of the weight initialisation and task are encoded in the matrices

A {(0)}^{T} A (0) = [\begin{matrix} a_{1} (0) & b (0) \\ b (0) & a_{2} (0) \end{matrix}] and \tilde{S} = [\begin{matrix} s_{1} & 0 \\ 0 & s_{2} \end{matrix}],

where the parameters $a_{1} (0)$ and $a_{2} (0)$ represent the component of the initialisation that is aligned with the task, and b(0) represents cross-coupling, such that taking $b (0) = 0$ recovers previously known and more restricted solutions for the decoupled case (Saxe et al 2014). We use this setting to demonstrate two features of the learning dynamics.

Decoupling dynamics. First, we track decoupling by considering the dynamics of the off-diagonal element b(t) (figures 4(D)–(F) red lines). At convergence, the off-diagonal element shrinks to zero as shown in appendix E. However, strikingly, b(t) can exhibit non-monotonic trajectories with transient peaks or valleys partway through the learning process. In particular, in appendix E we derive the time of the peak magnitude as $t_{peak} = \frac{τ}{4 s} ln \frac{s (s - a_{1} - a_{2})}{a_{1} a_{2} - b (0)^{2}}$ (figure 4(F) green dotted line), which coincides approximately with the time at which the on-diagonal element is half learned. If initialised from small random weights, the off-diagonal remains near-zero throughout learning, reminiscent of the silent alignment effect (Atanasov et al 2022). For large initialisations, no peak is observed and the dynamics are exponential. At intermediate initialisations, the maximum of the off-diagonal is reached before the singular mode is fully learned (appendix E). Intuitively, a particular input singular vector can initially project appreciably onto the wrong output singular vector, corresponding to initial misalignment. This is only revealed when this link is amplified, at which point corrective dynamics remove the counter-productive coupling, as schematised in figure 4(B). We report further measurements of decoupling in appendix E.

Figure 4. — Decoupling dynamics. (A) Analytical (black dotted lines) and numerical (solid lines) of the temporal dynamics of the on- and off-diagonal elements of $A^{T} A$ in blue and red, respectively. (B) Schematic representation of the decoupling process. (C) Three target matrices with dense, unequal diagonal, and equal diagonal structure. (D)and (F) Decoupling dynamics for the top (D), middle (E), and bottom (F) tasks depicted in panel (C). Row F contains analytical predictions for the time of the peak of the off-diagonal (dashed green). The network is initialised as defined in appendix E with small, intermediate and large variance.

Effect of initialisation variance. Next, we revisit the impact of initialisation scale for the on-diagonal dynamics. As shown in figures 4(D)–(F), as the initialisation variance grows the learning dynamics change from sigmoidal to exponential, possibly displaying more complex behaviour at intermediate variance (appendix E). In this simple setting we can analyse this transition in detail. Taking $s_{1} = s_{2} = s$ as in figure 4(F) and $| a_{1} (0) |, | a_{2} (0) |, | b (0) | ≪ 1$ , we recover a sigmoidal trajectory,

a_{1} (t) = \frac{s a_{1} (0)}{e^{\frac{- 2 s t}{τ}} [s - a_{1} (0) - a_{2} (0)] + a_{1} (0) + a_{2} (0)},

while for $| a_{1} (0) |, | a_{2} (0) |, | b (0) | ≫ 0$ the dynamics of the on-diagonal element a ₁ is close to exponential (figures 4(D)–(F) left and right columns). We examine larger networks in appendix E.

6. Applications

The solutions derived in sections 3 and 5 provide tools to examine the impact of prior knowledge on dynamics in deep linear networks. So far we have traced general features of the behaviour of these solutions. In this section, we use this toolkit to develop accounts of several specific phenomena.

Continual Learning. Continual learning (see Parisi et al 2019 for a review) and the pathology of catastrophic forgetting have long been a challenge for neural network models (McCloskey and Cohen 1989, Ratcliff 1990, French 1999). A variety of theoretical work has investigated aspects of continual learning (Tripuraneni et al 2020, Asanuma et al 2021, Doan et al 2021, Lee et al 2021, Shachaf et al 2021). In this setting, starting from an initial set of weights, a network is trained on a sequence of tasks with respective input-output correlations $T_{1} = {\tilde{Σ}}_{1}^{y x}, T_{2} = {\tilde{Σ}}_{2}^{y x}, T_{3} = {\tilde{Σ}}_{3}^{y x}, \dots$ . As shown in figure 5(A), our dynamics immediately enable exact solutions for the full continual learning process, whereby the final state after training on one task becomes the initial network state for the next task. These solutions thus reveal the exact time course of forgetting for arbitrary sequences of tasks.

Figure 5. — Continual learning. (A) Top: network training from small zero-balanced weights on a sequence of tasks (coloured lines show simulation and black dotted lines analytical results). Bottom: evaluation loss for tasks of the sequence (dotted) while training on the current task (solid). As the network function is optimised on the current task, the loss of other tasks increases. (B) Comparison of the numerical and analytical amount of catastrophic forgetting on a first task after training on a second task for n = 50 linear (red), tanh (blue) and ReLU (green) networks. (C) Weight alignment before and after training on a sequence of two tasks for n = 50 networks in linear (red), tanh (blue) and ReLU (green) networks. Shaded area shows $\pm s t d$ . (D) Evaluation loss for each of 5 tasks during training a linear (red), tanh (blue) and ReLU (green) network. (E) Same data es in (D) but evaluated as relative change (i.e. amount of catastrophic forgetting). The top half of each square shows the pre-computed analytical amount of forgetting and the bottom half the numerical value.

Training on later tasks can overwrite previously learned knowledge, a phenomenon known as catastrophic forgetting (McCloskey and Cohen 1989, Ratcliff 1990, French 1999). From theorem 3.2 it follows that from any arbitrary zero-balanced initialisation 2.3, the network converges to the global optimum such that the initialisation is completely overwritten and forgetting is truly catastrophic. In particular, the loss of any other task $T_{i}$ after training to convergence on task $T_{j}$ is $L_{i} (T_{j}) = 1 / 2 | | {\tilde{Σ}}_{j}^{y x} - {\tilde{Σ}}_{i}^{y x} | |_{F}^{2} + c$ , where c is a constant that only depends on training data of task $T_{i}$ (appendix F). As a consequence, the amount of forgetting, i.e. the relative change of loss, is fully determined by the similarity structure of the tasks and thus can be fully determined for a sequence of tasks before the onset of training (figures 5(B) and (E), appendix F). For example, the amount of catastrophic forgetting in task $T_{a}$ , when training on task $T_{c}$ after having trained the network on task $T_{b}$ is $L_{a} (T_{c}) - L_{a} (T_{b})$ . As expected, our results depend on our linear setting and tanh or ReLU nonlinearities can show different behaviour, typically increasing the amount of forgetting (figures 5(B), (D) and (E)). Further, in nonlinear networks, weights become rapidly unbalanced and forgetting values that are calculated before the onset of training do not predict the actual outcome (figures 5(B)–(E)). In summary, our results link exact learning dynamics with catastrophic forgetting and thus provide an analytical tool to study the mechanisms and potential counter measures underlying catastrophic forgetting.

Reversal learning. During reversal learning, pre-existing knowledge has to be relearned, overcoming a previously learned relationship between inputs and outputs. For example, reversal learning occurs when items of a class are mislabelled and later corrected. We show analytically, that reversal learning in fact does not succeed in deep linear networks (appendix G). The pre-existing knowledge lies exactly on the separatrix of a saddle point causing the learning dynamics to converge to zero (figure 6(A)). In contrast, the learning still succeeds numerically, as any noise will perturb the dynamics off the saddle point, allowing learning to proceed (figure 6(A)). However, the dynamics still slow in the vicinity of the saddle point, providing a theoretical explanation for catastrophic slowing in deep linear networks (Lee et al 2022). We note that the analytical solution requires an adaptation of theorem 3.1, as B is generally not invertible in the case of reversal learning (appendix G). Further, as is revealed by the exact learning dynamics (appendix G), shallow networks do succeed without exhibiting catastrophic slowing during reversal learning (figure 6(B)).

Figure 6. — Reversal learning and revising structured knowledge. Scale of x-axis varies in top and bottom rows. (A) Analytical (black dotted) and numerical (solid) learning dynamics of a reversal learning task. The analytical solution gets stuck on a saddle point, whereas the numerical simulation escapes the saddle point and converges to the target. (B) In a shallow network, training on the same task as in A converges analytically (black dotted) and numerically (solid). (C) Semantic learning tasks. Revised living kingdom (top) and colour hierarchy (bottom). (D) SVD of the input-output correlation of the tasks and respective RSMs. (E) Analytical (black dotted) and simulation (solid) loss and (F) learning dynamics of first training on the living kingdom (figure 3(A)) and subsequently on the respective task in (C). The analytical solution fails for the revised animal kingdom as it gets stuck in a saddle point, while the simulation escapes the saddle (top, green circle). Initial training on the living kingdom task from large initial weights and subsequent training on the colour hierarchy have similar convergence times (bottom) (G) multidimensional scaling (MDS) of the network function for initial training on the living kingdom task from small (top) and large initial weights (bottom). Note how despite the seemingly chaotic learning dynamics when starting form large initial weights, both simulations learn the same representation. (H) MDS of subsequent training on the respective task in (C).

Revising structured knowledge. Knowledge is often organised within an underlying, shared structure, of which many can be learned and represented in deep linear networks (Saxe et al 2019). For example, spatial locations can be related to each other using the same cardinal directions, or varying semantic knowledge can be organised using the same hierarchical tree. Here, we investigate if deep linear networks benefit from shared underlying structure. To this end, a network is first trained on the three-level hierarchical tree of section 4 (eight items of the living kingdom, each with a set of eight associated features), and subsequently trained on a revised version of the hierarchy. The revised task varies the relation of inputs and outputs while keeping the same underlying tree structure. If the revision involves swapping two neighbouring nodes on any level of the hierarchy, e.g. the identity of the two fish on the lowest level of the hierarchy (figure 6(C), top), the task is identical to reversal learning, leading to catastrophically slowed dynamics (figures 6(E) and (F), top). When training the network on a new hierarchical tree with identical items but a new set of features, like a colour hierarchy (figure 6(C), bottom), there is no speed advantage in comparison to a random initialisation with similar initial variance (figures 6(E) and (F), bottom). Importantly, from theorem 3.2 it follows, that the learning process can be sped up significantly by initialising from large zero-balanced weights, while converging to a global minimum with identical generalisation properties as when training from small weights (figures 6(G) and (H). In summary, having incorporated structured knowledge before revision does not speed up or even slows down learning in comparison to learning from random zero-balanced weights. Notably, that is despite the tasks’ structure being almost identical (figures 3(B) and 6(D).

7. Discussion

We derive exact solutions to the dynamics of learning with rich prior knowledge in a tractable model class: deep linear networks. While our results broaden the class of two-layer linear network problems that can be described analytically, they remain limited and rely on a set of assumptions (2.1)–(2.4). In particular, weakening the requirement that the input covariance be white and the weights be zero-balanced would enable analysis of the impact of initialisation on internal representations. Nevertheless, these solutions reveal several insights into network behaviour. We show that there exists a large set of initial values, namely zero-balanced weights 2.3, which lead to task-specific representations; and that large initialisations lead to exponential rather than sigmoidal learning curves. We hope our results provide a mathematical toolkit that illuminates the complex impact of prior knowledge on deep learning dynamics.

Acknowledgment

L B was supported by the Woodward Scholarship awarded by Wadham College, Oxford and the Medical Research Council [MR/N013468/1]. C D and A S were supported by the Gatsby Charitable Foundation (GAT3755). Further, A S was supported by a Sir Henry Dale Fellowship from the Wellcome Trust and Royal Society (216386/Z/19/Z) and the Sainsbury Wellcome Centre Core Grant (219627/Z/19/Z). A S is a CIFAR Azrieli Global Scholar in the Learning in Machines & Brains program. J F was supported by the Howard Hughes Medical Institute.

Appendix A. Fukumizu approach

For completeness, we reproduce the derivation from Fukumizu (1998) of equation (5). We consider the learning setting describe in section 2. Under the assumptions of equal input-output dimensions 2.1, whitened inputs 2.2 and zero-balanced weights 2.3, the weights dynamics yield

\begin{aligned} τ \frac{d}{d t} W_{1} = W_{2}^{T} ({\tilde{Σ}}^{y x} - W_{2} W_{1} {\tilde{Σ}}^{x x}), \end{aligned}

\begin{aligned} τ \frac{d}{d t} W_{2} = ({\tilde{Σ}}^{y x} - W_{2} W_{1} {\tilde{Σ}}^{x x}) W_{1}^{T} . \end{aligned}

Under the assumption of whitened inputs 2.2, the dynamics simplify to

\begin{aligned} τ \frac{d}{d t} W_{1} = W_{2}^{T} ({\tilde{Σ}}^{y x} - W_{2} W_{1}), \end{aligned}

\begin{aligned} τ \frac{d}{d t} W_{2} = ({\tilde{Σ}}^{y x} - W_{2} W_{1}) W_{1}^{T} . \end{aligned}

We introduce the variables

Q = [\begin{matrix} W_{1}^{T} \\ W_{2} \end{matrix}] and Q Q^{T} = [\begin{matrix} W_{1}^{T} W_{1} & W_{1}^{T} W_{2}^{T} \\ W_{2} W_{1} & W_{2} W_{2}^{T} \end{matrix}] .

We compute the time derivative

τ \frac{d}{d t} (Q Q^{T}) = τ [\begin{matrix} \frac{d W_{1}^{T}}{d t} W_{1} + W_{1}^{T} \frac{d W_{1}}{d t} & \frac{d W_{1}^{T}}{d t} W_{2}^{T} + W_{1}^{T} \frac{d W_{2}^{T}}{d t} \\ \frac{d W_{2}}{d t} W_{1} + W_{2} \frac{d W_{1}}{d t} & \frac{d W_{2}}{d t} W_{2}^{T} + W_{2} \frac{d W_{2}^{T}}{d t} \end{matrix}] .

Using equations (18) and (19) we compute the four quadrant separately giving

\begin{aligned} τ (\frac{d W_{1}^{T}}{d t} W_{1} + W_{1}^{T} \frac{d W_{1}}{d t}) \end{aligned}

\begin{aligned} = {({\tilde{Σ}}^{y x} - W_{2} W_{1})}^{T} W_{2} W_{1} + W_{1}^{T} W_{2}^{T} ({\tilde{Σ}}^{y x} - W_{2} W_{1}) \end{aligned}

\begin{aligned} = {({\tilde{Σ}}^{y x})}^{T} W_{2} W_{1} + W_{1}^{T} W_{2}^{T} {\tilde{Σ}}^{y x} - W_{1}^{T} W_{2}^{T} W_{2} W_{1} - {(W_{2} W_{1})}^{T} W_{2} W_{1} \end{aligned}

\begin{aligned} = {({\tilde{Σ}}^{y x})}^{T} W_{2} W_{1} + W_{1}^{T} W_{2}^{T} {\tilde{Σ}}^{y x} - W_{1}^{T} W_{2}^{T} W_{2} W_{1} - W_{1}^{T} W_{1} W_{1}^{T} W_{1}, \end{aligned}

\begin{aligned} τ (\frac{d W_{1}^{T}}{d t} W_{2}^{T} + W_{1}^{T} \frac{d W_{2}^{T}}{d t}) \end{aligned}

\begin{aligned} = {({\tilde{Σ}}^{y x} - W_{2} W_{1})}^{T} W_{2} W_{2}^{T} + W_{1}^{T} W_{1} {({\tilde{Σ}}^{y x} - W_{2} W_{1})}^{T} \end{aligned}

\begin{aligned} = {({\tilde{Σ}}^{y x})}^{T} W_{2} W_{2}^{T} + W_{1}^{T} W_{1} {({\tilde{Σ}}^{y x})}^{T} - W_{1}^{T} W_{1} {(W_{2} W_{1})}^{T} - {(W_{2} W_{1})}^{T} W_{2} W_{2}^{T}, \end{aligned}

\begin{aligned} = {({\tilde{Σ}}^{y x})}^{T} W_{2} W_{2}^{T} + W_{1}^{T} W_{1} {({\tilde{Σ}}^{y x})}^{T} - W_{1}^{T} W_{1} W_{1}^{T} W_{2}^{T} - W_{1}^{T} W_{2}^{T} W_{2} W_{2}^{T}, \end{aligned}

\begin{aligned} τ (\frac{d W_{2}}{d t} W_{1} + W_{2} \frac{d W_{1}}{d t}) \end{aligned}

\begin{aligned} = ({\tilde{Σ}}^{y x} - W_{2} W_{1}) W_{1}^{T} W_{1} + W_{2} W_{2}^{T} ({\tilde{Σ}}^{y x} - W_{2} W_{1}) \end{aligned}

\begin{aligned} = {\tilde{Σ}}^{y x} W_{1}^{T} W_{1} + W_{2} W_{2}^{T} {\tilde{Σ}}^{y x} - W_{2} W_{2}^{T} W_{2} W_{1} - W_{2} W_{1} W_{1}^{T} W_{1}, \end{aligned}

\begin{aligned} τ (\frac{d W_{2}}{d t} W_{2}^{T} + W_{2} \frac{d W_{2}^{T}}{d t}) \end{aligned}

\begin{aligned} = ({\tilde{Σ}}^{y x} - W_{2} W_{1}) W_{1}^{T} W_{2}^{T} + W_{2} W_{1} {({\tilde{Σ}}^{y x} - W_{2} W_{1})}^{T} \end{aligned}

\begin{aligned} = {\tilde{Σ}}^{y x} W_{1}^{T} W_{2}^{T} + W_{2} W_{1} {({\tilde{Σ}}^{y x})}^{T} - W_{2} W_{1} W_{1}^{T} W_{2}^{T} - W_{2} W_{1} {(W_{2} W_{1})}^{T} \end{aligned}

\begin{aligned} = {\tilde{Σ}}^{y x} W_{1}^{T} W_{2}^{T} + W_{2} W_{1} {({\tilde{Σ}}^{y x})}^{T} - W_{2} W_{1} W_{1}^{T} W_{2}^{T} - W_{2} W_{1} W_{1}^{T} W_{2}^{T} \end{aligned}

\begin{aligned} = {\tilde{Σ}}^{y x} W_{1}^{T} W_{2}^{T} + W_{2} W_{1} {({\tilde{Σ}}^{y x})}^{T} - W_{2} W_{1} W_{1}^{T} W_{2}^{T} - W_{2} W_{2}^{T} W_{2} W_{2}^{T}, \end{aligned}

where we have used the assumption of zero-balanced weights 2.3 to simplify equations (25) and (37).

Defining

F = [\begin{matrix} 0 & {({\tilde{Σ}}^{y x})}^{T} \\ {\tilde{Σ}}^{y x} & 0 \end{matrix}],

the gradient flow dynamics of $Q Q^{T}$ (t) can be written as a differential matrix Riccati equation

\begin{aligned} τ \frac{d}{d t} (Q Q^{T}) = F Q Q^{T} + Q Q^{T} F - {(Q Q^{T})}^{2} . \end{aligned}

We write $τ \frac{d}{d t} (Q Q^{T})$ for completeness

\begin{aligned} τ \frac{d}{d t} (Q Q^{T}) \\ = [\begin{matrix} 0 & {({\tilde{Σ}}^{y x})}^{T} \\ {\tilde{Σ}}^{y x} & 0 \end{matrix}] [\begin{matrix} W_{1}^{T} W_{1} & W_{1}^{T} W_{2}^{T} \\ W_{2} W_{1} & W_{2} W_{2}^{T} \end{matrix}] \\ + {[\begin{matrix} W_{1}^{T} W_{1} & W_{1}^{T} W_{2}^{T} \\ W_{2} W_{1} & W_{2} W_{2}^{T} \end{matrix}]}^{T} [\begin{matrix} 0 & {({\tilde{Σ}}^{y x})}^{T} \\ {\tilde{Σ}}^{y x} & 0 \end{matrix}] - {[\begin{matrix} W_{1}^{T} W_{1} & W_{1}^{T} W_{2}^{T} \\ W_{2} W_{1} & W_{2} W_{2}^{T} \end{matrix}]}^{2} \end{aligned}

\begin{aligned} = [\begin{matrix} 0 & {({\tilde{Σ}}^{y x})}^{T} \\ {\tilde{Σ}}^{y x} & 0 \end{matrix}] [\begin{matrix} W_{1}^{T} W_{1} & W_{1}^{T} W_{2}^{T} \\ W_{2} W_{1} & W_{2} W_{2}^{T} \end{matrix}] \\ + [\begin{matrix} W_{1}^{T} W_{1} & W_{1}^{T} W_{2}^{T} \\ W_{2} W_{1} & W_{2} W_{2}^{T} \end{matrix}] [\begin{matrix} 0 & {({\tilde{Σ}}^{y x})}^{T} \\ {\tilde{Σ}}^{y x} & 0 \end{matrix}] \\ - [\begin{matrix} W_{1}^{T} W_{1} & W_{1}^{T} W_{2}^{T} \\ W_{2} W_{1} & W_{2} W_{2}^{T} \end{matrix}] [\begin{matrix} W_{1}^{T} W_{1} & W_{1}^{T} W_{2}^{T} \\ W_{2} W_{1} & W_{2} W_{2}^{T} \end{matrix}] \end{aligned}

\begin{aligned} = [\begin{matrix} {({\tilde{Σ}}^{y x})}^{T} W_{2} W_{1} & {({\tilde{Σ}}^{y x})}^{T} W_{2} W_{2}^{T} \\ {\tilde{Σ}}^{y x} W_{1}^{T} W_{1} & {\tilde{Σ}}^{y x} W_{1}^{T} W_{2}^{T} \end{matrix}] \\ + [\begin{matrix} W_{1}^{T} W_{2}^{T} {\tilde{Σ}}^{y x} & W_{1}^{T} W_{1} {({\tilde{Σ}}^{y x})}^{T} \\ W_{2} W_{2}^{T} {\tilde{Σ}}^{y x} & W_{2} W_{1} {({\tilde{Σ}}^{y x})}^{T} \end{matrix}] \\ - [\begin{matrix} W_{1}^{T} W_{1} & W_{1}^{T} W_{2}^{T} \\ W_{2} W_{1} & W_{2} W_{2}^{T} \end{matrix}] [\begin{matrix} W_{1}^{T} W_{1} & W_{1}^{T} W_{2}^{T} \\ W_{2} W_{1} & W_{2} W_{2}^{T} \end{matrix}] \end{aligned}

\begin{aligned} = [\begin{matrix} {({\tilde{Σ}}^{y x})}^{T} W_{2} W_{1} & {({\tilde{Σ}}^{y x})}^{T} W_{2} W_{2}^{T} \\ {\tilde{Σ}}^{y x} W_{1}^{T} W_{1} & {\tilde{Σ}}^{y x} W_{1}^{T} W_{2}^{T} \end{matrix}] \\ + [\begin{matrix} W_{1}^{T} W_{2}^{T} {\tilde{Σ}}^{y x} & W_{1}^{T} W_{1} {({\tilde{Σ}}^{y x})}^{T} \\ W_{2} W_{2}^{T} {\tilde{Σ}}^{y x} & W_{2} W_{1} {({\tilde{Σ}}^{y x})}^{T} \end{matrix}] \\ - [\begin{matrix} W_{1}^{T} W_{1} W_{1}^{T} W_{1} + W_{1}^{T} W_{2}^{T} W_{2} W_{1} & W_{1}^{T} W_{1} W_{1}^{T} W_{2}^{T} + W_{1}^{T} W_{2}^{T} W_{2} W_{2}^{T} \\ W_{2} W_{1} W_{1}^{T} W_{1} + W_{2} W_{2}^{T} W_{2} W_{1} & W_{2} W_{1} W_{1}^{T} W_{2}^{T} + W_{2} W_{2}^{T} W_{2} W_{2}^{T} \end{matrix}] \end{aligned}

\begin{aligned} = [\begin{matrix} {({\tilde{Σ}}^{y x})}^{T} W_{2} W_{1} + W_{1}^{T} W_{2}^{T} {\tilde{Σ}}^{y x} \\ - W_{1}^{T} W_{2}^{T} W_{2} W_{1} - W_{1}^{T} W_{1} W_{1}^{T} W_{1} & {({\tilde{Σ}}^{y x})}^{T} W_{2} W_{2}^{T} + W_{1}^{T} W_{1} {({\tilde{Σ}}^{y x})}^{T} \\ - W_{1}^{T} W_{1} W_{1}^{T} W_{2}^{T} - W_{1}^{T} W_{2}^{T} W_{2} W_{2}^{T} \\ {\tilde{Σ}}^{y x} W_{1}^{T} W_{1} + W_{2} W_{2}^{T} {\tilde{Σ}}^{y x} \\ - W_{2} W_{2}^{T} W_{2} W_{1} - W_{2} W_{1} W_{1}^{T} W_{1} & {\tilde{Σ}}^{y x} W_{1}^{T} W_{2}^{T} + W_{2} W_{1} {({\tilde{Σ}}^{y x})}^{T} \\ - W_{2} W_{1} W_{1}^{T} W_{2}^{T} - W_{2} W_{2}^{T} W_{2} W_{2}^{T} \end{matrix}] \end{aligned}

$□$

The four quadrant of (44) are equivalent to equations (25), (29), (32) and (37) respectively.

Assuming that $Q (0)$ is full rank, the continuous differential equation (39) has a unique solution for all $t ⩾ 0$

\begin{aligned} Q Q^{T} (t) = e^{F \frac{t}{τ}} Q (0) {[I + \frac{1}{2} Q {(0)}^{T} (e^{F \frac{t}{τ}} F^{- 1} e^{F \frac{t}{τ}} - F^{- 1}) Q (0)]}^{- 1} Q {(0)}^{T} e^{F \frac{t}{τ}} . \end{aligned}

Appendix B. Network’s internal representations

B.1. Representational similarity analysis

The task-relevant representational similarity matrix (Kriegeskorte et al 2008) of the hidden layer, calculated from the inputs $H = W_{1} X$ is

\begin{aligned} {R S M}_{I} (t) & = H^{T} (t) H (t) \end{aligned}

\begin{aligned} = {(W_{1} (t) X)}^{T} W_{1} (t) X \end{aligned}

\begin{aligned} = X^{T} (W_{1}^{T} W_{1}) (t) X . \end{aligned}

Similarly, the representational similarity matrix of the hidden layer, calculated from the outputs $\tilde{H} = W_{2}^{+} Y$ , where + denotes the pseudoinverse, is

\begin{aligned} {R S M}_{O} (t) & = {\tilde{H}}^{T} (t) \tilde{H} (t) \end{aligned}

\begin{aligned} = {(W_{2}^{+} (t) Y)}^{T} W_{2}^{+} (t) Y \end{aligned}

\begin{aligned} = Y^{T} {(W_{2} W_{2}^{T} (t))}^{+} Y . \end{aligned}

B.2. Finite-width neural tangent kernel

In the following, we derive the finite-width neural tangent kernel (Jacot et al 2018) for a two-layer linear network. Starting with the network function at time t

F_{t} (X) = W_{2} W_{1} X,

the discrete time gradient descent dynamics of the next time step yields

\begin{aligned} F_{t + 1} (X) & = (W_{2} - η \frac{\partial L}{\partial W_{2}}) (W_{1} - η \frac{\partial L}{\partial W_{1}}) X \end{aligned}

\begin{aligned} = W_{2} W_{1} X - η (W_{2} \frac{\partial L}{\partial W_{1}} + \frac{\partial L}{\partial W_{2}} W_{1} - η \frac{\partial L}{\partial W_{2}} \frac{\partial L}{\partial W_{1}}) X . \end{aligned}

The network function’s gradient flow can then be derived as

\begin{aligned} \frac{F_{t + 1} (X) - F_{t} (X)}{η} & = - (W_{2} \frac{\partial L}{\partial W_{1}} + \frac{\partial L}{\partial W_{2}} W_{1} - η \frac{\partial L}{\partial W_{2}} \frac{\partial L}{\partial W_{1}}) X \end{aligned}

\begin{aligned} [η \to 0] \frac{d}{d t} F (X) & = - (W_{2} \frac{\partial L}{\partial W_{1}} + \frac{\partial L}{\partial W_{2}} W_{1}) X . \end{aligned}

Substituting the partial derivatives

\begin{aligned} \frac{\partial L}{\partial W_{1}} & = \frac{1}{2} \frac{\partial}{\partial W_{1}} | | W_{2} W_{1} X - Y | |_{F}^{2} \end{aligned}

\begin{aligned} = W_{2}^{T} (W_{2} W_{1} X - Y) X^{T} \end{aligned}

and

\begin{aligned} \frac{\partial L}{\partial W_{2}} & = \frac{1}{2} \frac{\partial}{\partial W_{2}} | | W_{2} W_{1} X - Y | |_{F}^{2} \end{aligned}

\begin{aligned} = (W_{2} W_{1} X - Y) X^{T} W_{1}^{T} \end{aligned}

then yields

\frac{d}{d t} F (X) = - W_{2} W_{2}^{T} (W_{2} W_{1} X - Y) X^{T} X - (W_{2} W_{1} X - Y) X^{T} W_{1}^{T} W_{1} X .

Finally, we introduce the identity matrix $I_{N_{o}}$ of size $N_{o}$ and apply row-wise vectoriasation ${v e c}_{r} (F (X)) := f (X)$ and the identity ${v e c}_{r} (A B C) = (A \otimes C^{T}) {v e c}_{r} (B)$ to derive the neural tangent kernel

\begin{aligned} \frac{d}{d t} F (X) & = - W_{2} W_{2}^{T} (W_{2} W_{1} X - Y) X^{T} X - I_{N_{o}} (W_{2} W_{1} X - Y) X^{T} W_{1}^{T} W_{1} X \end{aligned}

\begin{aligned} \Leftrightarrow \frac{d}{d t} f (X) & = - (\underset{N T K}{\underset{⏟}{W_{2} W_{2}^{T} \otimes X^{T} X + I \otimes X^{T} W_{1}^{T} W_{1} X}}) {v e c}_{r} (W_{2} W_{1} X - Y) \end{aligned}

\begin{aligned} = - ([W_{2} \otimes X^{T}, I \otimes X^{T} W_{1}^{T}] {[W_{2} \otimes X^{T}, I \otimes X^{T} W_{1}^{T}]}^{T}) {v e c}_{r} (\frac{\partial L}{\partial F}) \end{aligned}

\begin{aligned} = - ([\nabla_{W_{1}} f, \nabla_{W_{2}} f] {[\nabla_{W_{1}} f, \nabla_{W_{2}} f]}^{T}) \frac{\partial L}{\partial f} \end{aligned}

\begin{aligned} = - (\nabla_{θ} f \nabla_{θ} f^{T}) \frac{\partial L}{\partial f}, \end{aligned}

where $[A, B]$ denotes concatenation.

Appendix C. Exact learning dynamics with prior knowledge

C.1. Proof of theorem 3.1

In the following, we prove that equation (11) is in fact a solution to the matrix Riccati equation arising from gradient flow (equation (39)). We prove the theorem by directly substituting our solution for $Q Q^{T} (t)$ into the matrix Riccati equation.

C.1.1. Unequal input-output dimension.

We start with the following equation

\begin{aligned} Q Q^{T} (t) = & \underset{L}{\underset{⏟}{[O e^{Λ \frac{t}{τ}} O^{T} + 2 M M^{T}] Q (0)}} \\ \times \underset{C^{- 1}}{\underset{⏟}{{[I + \frac{1}{2} Q {(0)}^{T} (O (e^{2 Λ \frac{t}{τ}} - I) Λ^{- 1} O^{T} + 4 \frac{t}{τ} M M^{T}) Q (0)]}^{- 1}}} \end{aligned}

\begin{aligned} \times \underset{R}{\underset{⏟}{Q {(0)}^{T} [O e^{Λ \frac{t}{τ}} O^{T} + 2 M M^{T}]}} \\ = & L C^{- 1} R, \end{aligned}

which is identical to equation (11) in the main text, as we verify in section C.2 (by reversing the derivation from equation (152) to equation (128)). Substituting our solution into the matrix Riccati equation then yields

\begin{aligned} τ \frac{d}{d t} Q Q^{T} & = F Q Q^{T} + Q Q^{T} F - {(Q Q^{T})}^{2} \end{aligned}

\begin{aligned} \Rightarrow τ \frac{d}{d t} L C^{- 1} R & \overset{?}{=} F L C^{- 1} R + L C^{- 1} R F - L C^{- 1} R L C^{- 1} R . \end{aligned}

Next, we note that

O^{T} O = \frac{1}{\sqrt{2}} {[\begin{matrix} \tilde{V} & \tilde{V} \\ \tilde{U} & - \tilde{U} \end{matrix}]}^{T} \frac{1}{\sqrt{2}} [\begin{matrix} \tilde{V} & \tilde{V} \\ \tilde{U} & - \tilde{U} \end{matrix}] = I,

\begin{aligned} O^{T} M & = \frac{1}{\sqrt{2}} [\begin{matrix} {\tilde{V}}^{T} & {\tilde{U}}^{T} \\ {\tilde{V}}^{T} & - {\tilde{U}}^{T} \end{matrix}] \frac{1}{\sqrt{2}} [\begin{matrix} {\tilde{V}}_{⊥} \\ {\tilde{U}}_{⊥} \end{matrix}] \end{aligned}

\begin{aligned} = \frac{1}{2} [\begin{matrix} {\tilde{V}}^{T} {\tilde{V}}_{⊥} + {\tilde{U}}^{T} {\tilde{U}}_{⊥} \\ {\tilde{V}}^{T} {\tilde{V}}_{⊥} - {\tilde{U}}^{T} {\tilde{U}}_{⊥} \end{matrix}] \end{aligned}

\begin{aligned} = 0 \end{aligned}

and

\begin{aligned} M^{T} O & = \frac{1}{\sqrt{2}} [\begin{matrix} {\tilde{V}}_{⊥}^{T} & {\tilde{U}}_{⊥}^{T} \end{matrix}] \frac{1}{\sqrt{2}} [\begin{matrix} \tilde{V} & \tilde{V} \\ \tilde{U} & - \tilde{U} \end{matrix}] \end{aligned}

\begin{aligned} = \frac{1}{2} [\begin{matrix} {\tilde{V}}_{⊥}^{T} \tilde{V} + {\tilde{U}}_{⊥}^{T} \tilde{U} \\ {\tilde{V}}_{⊥}^{T} \tilde{V} - {\tilde{U}}_{⊥}^{T} \tilde{U} \end{matrix}] \end{aligned}

\begin{aligned} = 0 . \end{aligned}

Then, using the chain rule $\partial (A B) = (\partial A) B + A (\partial B)$ and the identities

\frac{d}{d t} (A^{- 1}) = A^{- 1} (\frac{d}{d t} A) A^{- 1} and \frac{d}{d t} (e^{t A}) = A e^{t A} = e^{t A} A

we get

\begin{aligned} τ \frac{d}{d t} Q Q^{T} & = τ \frac{d}{d t} (L C^{- 1} R) \end{aligned}

\begin{aligned} = τ (\frac{d}{d t} L) C^{- 1} R + τ L (\frac{d}{d t} C^{- 1} R) \end{aligned}

\begin{aligned} = τ (\frac{d}{d t} L) C^{- 1} R + τ L C^{- 1} (\frac{d}{d t} R) + τ L (\frac{d}{d t} C^{- 1}) R, \end{aligned}

with

\begin{aligned} τ (\frac{d}{d t} L) C^{- 1} R & = τ O \frac{1}{τ} Λ e^{Λ \frac{t}{τ}} O^{T} Q (0) C^{- 1} R \end{aligned}

\begin{aligned} = O Λ e^{Λ \frac{t}{τ}} O^{T} Q (0) C^{- 1} R \end{aligned}

\begin{aligned} = [O Λ O^{T} O e^{Λ \frac{t}{τ}} O^{T} Q (0) + 2 O Λ \underset{0}{\underset{⏟}{O^{T} M}} M^{T} Q (0)] C^{- 1} R \end{aligned}

\begin{aligned} = F L C^{- 1} R, \end{aligned}

\begin{aligned} τ L C^{- 1} (\frac{d}{d t} R) & = τ L C^{- 1} Q {(0)}^{T} O \frac{1}{τ} e^{Λ \frac{t}{τ}} Λ O^{T} \end{aligned}

\begin{aligned} = L C^{- 1} Q {(0)}^{T} O e^{Λ \frac{t}{τ}} Λ O^{T} \end{aligned}

\begin{aligned} = L C^{- 1} [Q {(0)}^{T} O e^{Λ \frac{t}{τ}} O^{T} O Λ O^{T} + 2 Q {(0)}^{T} M \underset{0}{\underset{⏟}{M^{T} O}} Λ O^{T}] \end{aligned}

\begin{aligned} = L C^{- 1} R F \end{aligned}

and

\begin{aligned} τ L (\frac{d}{d t} C^{- 1}) R & = - τ L C^{- 1} (\frac{d}{d t} C) C^{- 1} R \end{aligned}

\begin{aligned} = - L C^{- 1} [τ \frac{1}{2} Q {(0)}^{T} O 2 \frac{1}{τ} e^{2 Λ \frac{t}{τ}} Λ Λ^{- 1} O^{T} Q (0) \end{aligned}

\begin{aligned} + τ \frac{1}{2} Q {(0)}^{T} 4 \frac{1}{τ} M M^{T} Q (0)] C^{- 1} R \\ = - L C^{- 1} [Q {(0)}^{T} O e^{2 Λ \frac{t}{τ}} O^{T} Q (0) + 2 Q {(0)}^{T} M M^{T} Q (0)] C^{- 1} R \end{aligned}

\begin{aligned} = - L C^{- 1} [Q {(0)}^{T} O e^{Λ \frac{t}{τ}} O^{T} O e^{Λ \frac{t}{τ}} O^{T} Q (0) \\ + 2 Q {(0)}^{T} O e^{Λ \frac{t}{τ}} \underset{0}{\underset{⏟}{O^{T} M}} M^{T} Q (0) \end{aligned}

\begin{aligned} + 2 Q {(0)}^{T} M \underset{0}{\underset{⏟}{M^{T} O}} e^{Λ \frac{t}{τ}} O^{T} Q (0) \\ + 4 Q {(0)}^{T} M M^{T} M M^{T} Q (0)] C^{- 1} R \\ = - L C^{- 1} R L C^{- 1} R . \end{aligned}

Finally, substituting equations (82), (86) and (90) into the left hand side of equation (70) proves equality.

$□$

C.1.2. Equal input-output dimension.

In the case of equal input-output dimensions ${\tilde{U}}_{⊥} = {\tilde{V}}_{⊥} = 0$ equation (67) reduces to

\begin{aligned} Q Q^{T} (t) & = \underset{L}{\underset{⏟}{O e^{Λ \frac{t}{τ}} O^{T} Q (0)}} \\ \times \underset{C^{- 1}}{\underset{⏟}{{[I + \frac{1}{2} Q {(0)}^{T} O e^{2 Λ \frac{t}{τ}} Λ^{- 1} O^{T} Q (0) - \frac{1}{2} Q {(0)}^{T} O Λ^{- 1} O^{T} Q (0)]}^{- 1}}} \end{aligned}

\begin{aligned} \times \underset{R}{\underset{⏟}{Q {(0)}^{T} O e^{Λ \frac{t}{τ}} O^{T}}} = L C^{- 1} R . \end{aligned}

Therefore, analogously to the proof for unequal input-output dimensions, it follows that

\begin{aligned} τ \frac{d}{d t} Q Q^{T} & = τ \frac{d}{d t} L C^{- 1} R \end{aligned}

\begin{aligned} = τ (\frac{d}{d t} L) C^{- 1} R + τ L (\frac{d}{d t} C^{- 1} R) \end{aligned}

\begin{aligned} = τ (\frac{d}{d t} L) C^{- 1} R + τ L C^{- 1} (\frac{d}{d t} R) + τ L (\frac{d}{d t} C^{- 1}) R, \end{aligned}

with

\begin{aligned} τ (\frac{d}{d t} L) C^{- 1} R & = τ O Λ \frac{1}{τ} e^{Λ \frac{t}{τ}} O^{T} Q (0) C^{- 1} R \end{aligned}

\begin{aligned} = O Λ O^{T} O e^{Λ \frac{t}{τ}} O^{T} Q (0) C^{- 1} R \end{aligned}

\begin{aligned} = F L C^{- 1} R, \end{aligned}

\begin{aligned} τ L C^{- 1} (\frac{d}{d t} R) & = τ L C^{- 1} Q {(0)}^{T} O \frac{1}{τ} e^{Λ \frac{t}{τ}} Λ O^{T} \end{aligned}

\begin{aligned} = L C^{- 1} Q {(0)}^{T} O e^{Λ \frac{t}{τ}} O^{T} O Λ O^{T} \end{aligned}

\begin{aligned} = L C^{- 1} R F, \end{aligned}

and

\begin{aligned} τ L (\frac{d}{d t} C^{- 1} R) & = - τ L C^{- 1} (\frac{d}{d t} C) C^{- 1} R \end{aligned}

\begin{aligned} = - τ L C^{- 1} (\frac{1}{2} Q {(0)}^{T} O e^{2 Λ \frac{t}{τ}} \frac{2}{τ} Λ Λ^{- 1} O^{T} Q (0)) C^{- 1} R \end{aligned}

\begin{aligned} = - τ L C^{- 1} Q {(0)}^{T} O e^{Λ \frac{t}{τ}} O^{T} O e^{Λ \frac{t}{τ}} Q (0) C^{- 1} R \end{aligned}

\begin{aligned} = - L C^{- 1} R L C^{- 1} R . \end{aligned}

Finally, substituting equations (100), (103) and (106) into the left hand side of equation (70) proves equality.

$□$

C.2. Derivation of the exact learning dynamics

In the following, we outline how the solution to the matrix Ricatti equation can be acquired. Let the input and output dimension of a two-layer linear network (equation (1)) be denoted by $N_{i}$ and $N_{o}$ respectively. Further, let $N_{m} = min (N_{i}, N_{o})$ denote the smaller one of the two. The compact singular value decomposition of the initial network function and the input-output correlation of the task is then

S V D (W_{2} (0) W_{1} (0)) = U S V^{T} and S V D ({\tilde{Σ}}^{y x}) = \tilde{U} \tilde{S} {\tilde{V}}^{T} .

More specifically, in the case of $N_{i} < N_{o}$ , ${\tilde{U}}^{T} \tilde{U} = {\tilde{V}}^{T} \tilde{V} = \tilde{V} {\tilde{V}}^{T} = I \in R^{N_{i} \times N_{i}}$ but $\tilde{U} {\tilde{U}}^{T} \neq I \in R^{N_{o} \times N_{o}}$ . In this case, we use ${\tilde{U}}_{⊥} \in R^{N_{o} \times (N_{o} - N_{i})}$ to denote the matrix that contains orthogonal column vectors such that the concatenation $[\tilde{U} {\tilde{U}}_{⊥}]$ is orthonormal and ${\tilde{V}}_{⊥} \in R^{N_{i} \times (N_{o} - N_{i})}$ to denote a matrix of zeros.

Conversely, in the case of $N_{i} > N_{o}$ , $\tilde{U} {\tilde{U}}^{T} = {\tilde{U}}^{T} \tilde{U} = {\tilde{V}}^{T} \tilde{V} = I \in R^{N_{o} \times N_{o}}$ but ${\tilde{V}}^{T} \tilde{V} \neq I \in R^{N_{i} \times N_{i}}$ and we define ${\tilde{V}}_{⊥} \in R^{N_{i} \times (N_{i} - N_{o})}$ such that $[\tilde{V} {\tilde{V}}_{⊥}]$ is orthonormal and ${\tilde{U}}_{⊥} \in R^{N_{o} \times (N_{o} - N_{i})}$ to denote a matrix of zeros.

C.2.1. Inverse and matrix exponential of F.

The solution to the matrix Riccati equation as provided by Fukumizu (1998) requires calculation of the inverse F ⁻¹ and the matrix exponential $e^{F \frac{t}{τ}}$ . To this end, we diagonalise F by completing its basis by incorporating zero eigenvalues as illustrated below

\begin{aligned} F & = [\begin{matrix} 0 & \tilde{V} \tilde{S} {\tilde{U}}^{T} \\ \tilde{U} \tilde{S} {\tilde{V}}^{T} & 0 \end{matrix}] \end{aligned}

\begin{aligned} = \frac{1}{\sqrt{2}} [\begin{matrix} \tilde{V} & \tilde{V} & \sqrt{2} {\tilde{V}}_{⊥} \\ \tilde{U} & - \tilde{U} & \sqrt{2} {\tilde{U}}_{⊥} \end{matrix}] [\begin{matrix} \tilde{S} & 0 & 0 \\ 0 & - \tilde{S} & 0 \\ 0 & 0 & 0 \end{matrix}] \frac{1}{\sqrt{2}} {[\begin{matrix} \tilde{V} & \tilde{V} & \sqrt{2} {\tilde{V}}_{⊥} \\ \tilde{U} & - \tilde{U} & \sqrt{2} {\tilde{U}}_{⊥} \end{matrix}]}^{T} \end{aligned}

\begin{aligned} = P Γ P^{T} . \end{aligned}

Note that $P^{T} P = P P^{T} = I$ and therefore $P^{T} = P^{- 1}$ . We then use the diagonalisation of F to rewrite the matrix exponential

\begin{aligned} e^{F \frac{t}{τ}} & = P e^{Γ} P^{T} \end{aligned}

\begin{aligned} = \frac{1}{\sqrt{2}} [\begin{matrix} \tilde{V} & \tilde{V} & \sqrt{2} V_{⊥} \\ \tilde{U} & - \tilde{U} & \sqrt{2} U_{⊥} \end{matrix}] [\begin{matrix} e^{\tilde{S} \frac{t}{τ}} & 0 & 0 \\ 0 & e^{- \tilde{S} \frac{t}{τ}} & 0 \\ 0 & 0 & e^{0} \end{matrix}] \frac{1}{\sqrt{2}} {[\begin{matrix} \tilde{V} & \tilde{V} & \sqrt{2} V_{⊥} \\ \tilde{U} & - \tilde{U} & \sqrt{2} U_{⊥} \end{matrix}]}^{T} \end{aligned}

\begin{aligned} = \frac{1}{2} [\begin{matrix} \tilde{V} e^{\tilde{S} \frac{t}{τ}} {\tilde{V}}^{T} + \tilde{V} e^{- \tilde{S} \frac{t}{τ}} {\tilde{V}}^{T} + 2 {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} & \tilde{V} e^{\tilde{S} \frac{t}{τ}} {\tilde{U}}^{T} - \tilde{V} e^{- \tilde{S} \frac{t}{τ}} {\tilde{U}}^{T} + 2 {\tilde{V}}_{⊥} {\tilde{U}}_{⊥}^{T} \\ \tilde{U} e^{\tilde{S} \frac{t}{τ}} {\tilde{V}}^{T} - \tilde{U} e^{- \tilde{S} \frac{t}{τ}} {\tilde{V}}^{T} + 2 {\tilde{U}}_{⊥} {\tilde{V}}_{⊥}^{T} & \tilde{U} e^{\tilde{S} \frac{t}{τ}} {\tilde{U}}^{T} - \tilde{U} e^{- \tilde{S} \frac{t}{τ}} {\tilde{U}}^{T} + 2 {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} \end{matrix}] \end{aligned}

\begin{aligned} = \frac{1}{\sqrt{2}} [\begin{matrix} \tilde{V} & \tilde{V} \\ \tilde{U} & - \tilde{U} \end{matrix}] [\begin{matrix} e^{\tilde{S} \frac{t}{τ}} & 0 \\ 0 & e^{- \tilde{S} \frac{t}{τ}} \end{matrix}] \frac{1}{\sqrt{2}} {[\begin{matrix} \tilde{V} & \tilde{V} \\ \tilde{U} & - \tilde{U} \end{matrix}]}^{T} + 2 \frac{1}{\sqrt{2}} [\begin{matrix} {\tilde{V}}_{⊥} \\ {\tilde{U}}_{⊥} \end{matrix}] \frac{1}{\sqrt{2}} {[\begin{matrix} {\tilde{V}}_{⊥} \\ {\tilde{U}}_{⊥} \end{matrix}]}^{T} \end{aligned}

\begin{aligned} = O e^{Λ \frac{t}{τ}} O + 2 M M^{T} . \end{aligned}

As the inverse $F^{- 1} = P Γ^{- 1} P^{T}$ is not well defined for a Γ with zero eigenvalues. We study eigenvalues of value zero by analysing the limiting behaviour of

e^{F \frac{t}{τ}} F^{- 1} e^{F \frac{t}{τ}} - F^{- 1}

for a single mode

\begin{aligned} lim_{ϵ \to 0} [e^{\frac{ϵ t}{τ}} \frac{1}{ϵ} e^{\frac{ϵ t}{τ}} - \frac{1}{ϵ}] = & lim_{ϵ \to 0} [\frac{e^{\frac{2 ϵ t}{τ}} - 1}{ϵ}] \end{aligned}

\begin{aligned} \overset{L’Hospital}{\to} & lim_{ϵ \to 0} [\frac{\frac{\partial}{\partial ϵ} (e^{\frac{2 ϵ t}{τ}} - 1)}{\frac{\partial}{\partial ϵ} ϵ}] \end{aligned}

\begin{aligned} = & lim_{ϵ \to 0} 2 \frac{t}{τ} e^{\frac{2 ϵ t}{τ}} \end{aligned}

\begin{aligned} = & 2 \frac{t}{τ} . \end{aligned}

which reveals the time dependent contribution of zero eigenvalues. Thus

\begin{aligned} e^{F \frac{t}{τ}} F^{- 1} e^{F \frac{t}{τ}} - F^{- 1} & = O e^{Λ \frac{t}{τ}} O^{T} O Λ^{- 1} O^{T} O e^{Λ \frac{t}{τ}} O^{T} - O Λ^{- 1} O^{T} + 4 \frac{t}{τ} M M^{T} . \end{aligned}

We continue by substituting the above results into Fukumizu’s equation

\begin{aligned} Q Q^{T} (t) & = [O e^{Λ \frac{t}{τ}} O^{T} + 2 M M^{T}] Q (0) \end{aligned}

\begin{aligned} \times {[I + \frac{1}{2} Q {(0)}^{T} (O e^{Λ \frac{t}{τ}} O^{T} O Λ^{- 1} O^{T} O e^{Λ \frac{t}{τ}} O^{T} - O Λ^{- 1} O^{T} + 4 \frac{t}{τ} M M^{T}) Q (0)]}^{- 1} \\ \times Q {(0)}^{T} [O e^{Λ \frac{t}{τ}} O^{T} + 2 M M^{T}] \\ = [O e^{Λ \frac{t}{τ}} O^{T} + 2 M M^{T}] Q (0) \\ \times {[I + \frac{1}{2} Q {(0)}^{T} (O e^{Λ \frac{t}{τ}} Λ^{- 1} e^{Λ \frac{t}{τ}} O^{T} - O Λ^{- 1} O^{T} + 4 \frac{t}{τ} M M^{T}) Q (0)]}^{- 1} \end{aligned}

\begin{aligned} Q {(0)}^{T} [O e^{Λ \frac{t}{τ}} O^{T} + 2 M M^{T}] \\ = [O e^{Λ \frac{t}{τ}} O^{T} + 2 M M^{T}] Q (0) \\ \times {[I + \frac{1}{2} Q {(0)}^{T} (O (e^{2 Λ \frac{t}{τ}} Λ^{- 1} - Λ^{- 1}) O^{T} + 4 \frac{t}{τ} M M^{T}) Q (0)]}^{- 1} \end{aligned}

\begin{aligned} \times Q {(0)}^{T} [O e^{Λ \frac{t}{τ}} O^{T} + 2 M M^{T}] \\ = [O e^{Λ \frac{t}{τ}} O^{T} + 2 M M^{T}] Q (0) \\ \times {[I + \frac{1}{2} Q {(0)}^{T} (O (e^{2 Λ \frac{t}{τ}} - I) Λ^{- 1} O^{T} + 4 \frac{t}{τ} M M^{T}) Q (0)]}^{- 1} \end{aligned}

\begin{aligned} \times Q {(0)}^{T} [O e^{Λ \frac{t}{τ}} O^{T} + 2 M M^{T}] . \end{aligned}

Then, matrix multiplication on the left side of the equation yields

\begin{aligned} O e^{Λ \frac{t}{τ}} & = \frac{1}{\sqrt{2}} [\begin{matrix} \tilde{V} & \tilde{V} \\ \tilde{U} & - \tilde{U} \end{matrix}] [\begin{matrix} e^{\tilde{S} \frac{t}{τ}} & 0 \\ 0 & e^{- \tilde{S} \frac{t}{τ}} \end{matrix}] \end{aligned}

\begin{aligned} = \frac{1}{\sqrt{2}} [\begin{matrix} \tilde{V} e^{\tilde{S} \frac{t}{τ}} & \tilde{V} e^{- \tilde{S} \frac{t}{τ}} \\ \tilde{U} e^{\tilde{S} \frac{t}{τ}} & - \tilde{U} e^{- \tilde{S} \frac{t}{τ}} \end{matrix}] \end{aligned}

and

\begin{aligned} O^{T} Q (0) & = \frac{1}{\sqrt{2}} {[\begin{matrix} \tilde{V} & \tilde{V} \\ \tilde{U} & - \tilde{U} \end{matrix}]}^{T} [\begin{matrix} V \sqrt{S} R^{T} \\ U \sqrt{S} R^{T} \end{matrix}] \end{aligned}

\begin{aligned} = \frac{1}{\sqrt{2}} [\begin{matrix} {\tilde{V}}^{T} V \sqrt{S} R^{T} + {\tilde{U}}^{T} U \sqrt{S} R^{T} \\ {\tilde{V}}^{T} V \sqrt{S} R^{T} - {\tilde{U}}^{T} U \sqrt{S} R^{T} \end{matrix}] \end{aligned}

\begin{aligned} = \frac{1}{\sqrt{2}} [\begin{matrix} ({\tilde{V}}^{T} V + {\tilde{U}}^{T} U) \sqrt{S} R^{T} \\ ({\tilde{V}}^{T} V - {\tilde{U}}^{T} U) \sqrt{S} R^{T} \end{matrix}], \end{aligned}

such that

\begin{aligned} O e^{Λ \frac{t}{τ}} O^{T} Q (0) & = \frac{1}{2} [\begin{matrix} \tilde{V} e^{\tilde{S} \frac{t}{τ}} & \tilde{V} e^{- \tilde{S} \frac{t}{τ}} \\ \tilde{U} e^{\tilde{S} \frac{t}{τ}} & - \tilde{U} e^{- \tilde{S} \frac{t}{τ}} \end{matrix}] [\begin{matrix} {\tilde{V}}^{T} V \sqrt{S} R^{T} + {\tilde{U}}^{T} U \sqrt{S} R^{T} \\ {\tilde{V}}^{T} V \sqrt{S} R^{T} - {\tilde{U}}^{T} U \sqrt{S} R^{T} \end{matrix}] \end{aligned}

\begin{aligned} = \frac{1}{2} [\begin{matrix} \tilde{V} (e^{\tilde{S} \frac{t}{τ}} ({\tilde{V}}^{T} V + {\tilde{U}}^{T} U) + e^{- \tilde{S} \frac{t}{τ}} ({\tilde{V}}^{T} V - {\tilde{U}}^{T} U)) \sqrt{S} R^{T} \\ \tilde{U} (e^{\tilde{S} \frac{t}{τ}} ({\tilde{V}}^{T} V + {\tilde{U}}^{T} U) - e^{- \tilde{S} \frac{t}{τ}} ({\tilde{V}}^{T} V - {\tilde{U}}^{T} U)) \sqrt{S} R^{T} \end{matrix}] . \end{aligned}

We continue by calculating

\begin{aligned} 4 M M^{T} Q (0) & = 4 \frac{1}{\sqrt{2}} [\begin{matrix} {\tilde{V}}_{⊥} \\ {\tilde{U}}_{⊥} \end{matrix}] \frac{1}{\sqrt{2}} {[\begin{matrix} {\tilde{V}}_{⊥} \\ {\tilde{U}}_{⊥} \end{matrix}]}^{T} [\begin{matrix} V \sqrt{S} R^{T} \\ U \sqrt{S} R^{T} \end{matrix}] \end{aligned}

\begin{aligned} = 2 [\begin{matrix} {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} & {\tilde{V}}_{⊥} {\tilde{U}}_{⊥}^{T} \\ {\tilde{U}}_{⊥} {\tilde{V}}_{⊥}^{T} & {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} \end{matrix}] [\begin{matrix} V \sqrt{S} R^{T} \\ U \sqrt{S} R^{T} \end{matrix}] \end{aligned}

\begin{aligned} = 2 [\begin{matrix} {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} & 0 \\ 0 & {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} \end{matrix}] [\begin{matrix} V \sqrt{S} R^{T} \\ U \sqrt{S} R^{T} \end{matrix}] \end{aligned}

\begin{aligned} = 2 [\begin{matrix} {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V \sqrt{S} R^{T} \\ {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U \sqrt{S} R^{T} \end{matrix}] \end{aligned}

and

\begin{aligned} \frac{1}{2} Q {(0)}^{T} 4 \frac{t}{τ} M M^{T} Q (0) & = \frac{t}{τ} [\begin{matrix} R \sqrt{S} V^{T} R \sqrt{S} U^{T} \end{matrix}] [\begin{matrix} {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V \sqrt{S} R^{T} \\ {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U \sqrt{S} R^{T} \end{matrix}] \end{aligned}

\begin{aligned} = \frac{t}{τ} [\begin{matrix} R \sqrt{S} (V^{T} {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V + U^{T} {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U) \sqrt{S} R^{T} \end{matrix}] \end{aligned}

Next, we define $B = U^{T} \tilde{U} + V^{T} \tilde{V}$ and $C = U^{T} \tilde{U} - V^{T} \tilde{V}$ and rewrite the inverse as

\begin{aligned} {[I + \frac{1}{2} Q {(0)}^{T} O (e^{2 Λ \frac{t}{τ}} - I) Λ^{- 1} O^{T} Q (0) + 2 \frac{t}{τ} Q {(0)}^{T} M M^{T} Q (0)]}^{- 1} \end{aligned}

\begin{aligned} = [I + \frac{1}{4} R \sqrt{S} ([\begin{matrix} B & - C \end{matrix}] (e^{2 Λ \frac{t}{τ}} - I) Λ^{- 1} [\begin{matrix} B^{T} \\ - C^{T} \end{matrix}] \\ + 4 \frac{t}{τ} (V^{T} {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V + U^{T} {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U)) \sqrt{S} R^{T}]^{- 1} . \end{aligned}

Working from the centre out, we have

\begin{aligned} [\begin{matrix} B & - C \end{matrix}] Λ^{- 1} [\begin{matrix} B^{T} \\ - C^{T} \end{matrix}] & = [\begin{matrix} B & - C \end{matrix}] [\begin{matrix} {\tilde{S}}^{- 1} & 0 \\ 0 & - {\tilde{S}}^{- 1} \end{matrix}] [\begin{matrix} B^{T} \\ - C^{T} \end{matrix}] \end{aligned}

\begin{aligned} = [\begin{matrix} B & - C \end{matrix}] [\begin{matrix} {\tilde{S}}^{- 1} B^{T} \\ {\tilde{S}}^{- 1} C^{T} \end{matrix}] \end{aligned}

\begin{aligned} = B {\tilde{S}}^{- 1} B^{T} - C {\tilde{S}}^{- 1} C^{T} \end{aligned}

and

\begin{aligned} [\begin{matrix} B & - C \end{matrix}] e^{2 Λ \frac{t}{τ}} Λ^{- 1} [\begin{matrix} B^{T} \\ - C^{T} \end{matrix}] & = [\begin{matrix} B & - C \end{matrix}] [\begin{matrix} e^{2 \tilde{S} \frac{t}{τ}} {\tilde{S}}^{- 1} & 0 \\ 0 & - e^{- 2 \tilde{S} \frac{t}{τ}} {\tilde{S}}^{- 1} \end{matrix}] [\begin{matrix} B^{T} \\ - C^{T} \end{matrix}] \end{aligned}

\begin{aligned} = [\begin{matrix} B & - C \end{matrix}] [\begin{matrix} e^{2 \tilde{S} \frac{t}{τ}} {\tilde{S}}^{- 1} B^{T} \\ e^{- 2 \tilde{S} \frac{t}{τ}} {\tilde{S}}^{- 1} C^{T} \end{matrix}] \end{aligned}

\begin{aligned} = B e^{2 \tilde{S} \frac{t}{τ}} {\tilde{S}}^{- 1} B^{T} - C e^{- 2 \tilde{S} \frac{t}{τ}} {\tilde{S}}^{- 1} C^{T} . \end{aligned}

Finally, using $A B^{- 1} = (B A^{- 1})^{- 1}$ (and $A^{- 1} B = (B^{- 1} A)^{- 1}$ ) to move terms into the inverse, we rewrite

\begin{aligned} Q Q^{T} (t) & = \frac{1}{2} [\begin{matrix} (\tilde{V} (e^{\tilde{S} \frac{t}{τ}} B^{T} - e^{- \tilde{S} \frac{t}{τ}} C^{T}) + 2 {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V) \sqrt{S} R^{T} \\ (\tilde{U} (e^{\tilde{S} \frac{t}{τ}} B^{T} + e^{- \tilde{S} \frac{t}{τ}} C^{T}) + 2 {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U) \sqrt{S} R^{T} \end{matrix}] \\ \times [I + R \sqrt{S} (\frac{1}{4} B (e^{2 \tilde{S} \frac{t}{τ}} - I) {\tilde{S}}^{- 1} B^{T} - \frac{1}{4} C (e^{- 2 \tilde{S} \frac{t}{τ}} - I) {\tilde{S}}^{- 1} C^{T} \\ + \frac{t}{τ} (V^{T} {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V + U^{T} {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U)) \sqrt{S} R^{T}]^{- 1} \end{aligned}

\begin{aligned} \frac{1}{2} {[\begin{matrix} (\tilde{V} (e^{\tilde{S} \frac{t}{τ}} B^{T} - e^{- \tilde{S} \frac{t}{τ}} C^{T}) + 2 {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V) \sqrt{S} R^{T} \\ (\tilde{U} (e^{\tilde{S} \frac{t}{τ}} B^{T} + e^{- \tilde{S} \frac{t}{τ}} C^{T}) + 2 {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U) \sqrt{S} R^{T} \end{matrix}]}^{T} \\ = \frac{1}{2} [\begin{matrix} \tilde{V} (e^{\tilde{S} \frac{t}{τ}} B^{T} - e^{- \tilde{S} \frac{t}{τ}} C^{T}) + 2 {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V \\ \tilde{U} (e^{\tilde{S} \frac{t}{τ}} B^{T} + e^{- \tilde{S} \frac{t}{τ}} C^{T}) + 2 {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U \end{matrix}] \\ \times [S^{- 1} + \frac{1}{4} B (e^{2 \tilde{S} \frac{t}{τ}} - I) {\tilde{S}}^{- 1} B^{T} - \frac{1}{4} C (e^{- 2 \tilde{S} \frac{t}{τ}} - I) {\tilde{S}}^{- 1} C^{T} \\ {+ \frac{t}{τ} (V^{T} {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V + U^{T} {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U)]}^{- 1} \end{aligned}

\begin{aligned} \frac{1}{2} {[\begin{matrix} \tilde{V} (e^{\tilde{S} \frac{t}{τ}} B^{T} - e^{- \tilde{S} \frac{t}{τ}} C^{T}) + 2 {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V \\ \tilde{U} (e^{\tilde{S} \frac{t}{τ}} B^{T} + e^{- \tilde{S} \frac{t}{τ}} C^{T}) + 2 {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U \end{matrix}]}^{T} \\ = [\begin{matrix} \tilde{V} (I - e^{- \tilde{S} \frac{t}{τ}} C^{T} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}}) + 2 {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}} \\ \tilde{U} (I + e^{- \tilde{S} \frac{t}{τ}} C^{T} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}}) + 2 {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}} \end{matrix}] \\ \times [4 e^{- \tilde{S} \frac{t}{τ}} B^{- 1} S^{- 1} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}} + (I - e^{- 2 \tilde{S} \frac{t}{τ}}) {\tilde{S}}^{- 1} \\ - e^{- \tilde{S} \frac{t}{τ}} B^{- 1} C (e^{- 2 \tilde{S} \frac{t}{τ}} - I) {\tilde{S}}^{- 1} C^{T} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}} \\ + 4 \frac{t}{τ} e^{- \tilde{S} \frac{t}{τ}} B^{- 1} (V^{T} {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V + U^{T} {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U) {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}}]^{- 1} \\ \times {[\begin{matrix} \tilde{V} (I - e^{- \tilde{S} \frac{t}{τ}} C^{T} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}}) + 2 {\tilde{V}}_{⊥} {\tilde{V}}_{⊥}^{T} V B^{- T} e^{- \tilde{S} \frac{t}{τ}} \\ \tilde{U} (I + e^{- \tilde{S} \frac{t}{τ}} C^{T} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}}) + 2 {\tilde{U}}_{⊥} {\tilde{U}}_{⊥}^{T} U B^{- T} e^{- \tilde{S} \frac{t}{τ}} \end{matrix}]}^{T} . \end{aligned}

C.3. Proof of theorem 3.2: Limiting behaviour

As training time increases, all terms including a matrix exponential with negative exponent in equation (11) vanish to zero, as $\tilde{S}$ is a diagonal matrix with entries larger zero

lim_{t \to \infty} e^{- \tilde{S} \frac{t}{τ}} = 0 .

Therefore, in the temporal limit, equation (11) reduces to

\begin{aligned} = [\begin{matrix} \tilde{V} \\ \tilde{U} \end{matrix}] {[\begin{matrix} {\tilde{S}}^{- 1} \end{matrix}]}^{- 1} [\begin{matrix} {\tilde{V}}^{T} & {\tilde{U}}^{T} \end{matrix}] \end{aligned}

\begin{aligned} = [\begin{matrix} \tilde{V} \tilde{S} {\tilde{V}}^{T} & \tilde{V} \tilde{S} {\tilde{U}}^{T} \\ \tilde{U} \tilde{S} {\tilde{V}}^{T} & \tilde{U} \tilde{S} {\tilde{U}}^{T} \end{matrix}] . \end{aligned}

$□$

C.4. Dynamics of $Q (t)$

The solution for the weights $W_{1} (t)$ and $W_{2} (t)$ can be derived up to a time varying orthogonal transformation as demonstrated by Yan et al (1994).

Under the assumptions of whitened inputs 2.2, zero-balanced weights 2.3, full rank 2.4, and equal input-output dimension, the temporal dynamics of $Q (t)$ is given as

\begin{aligned} Q (t) = e^{F \frac{t}{τ}} Q (0) {[I + \frac{1}{2} Q {(0)}^{T} (e^{F \frac{t}{τ}} F^{- 1} e^{F \frac{t}{τ}} - F^{- 1}) Q (0)]}^{- \frac{1}{2}} D (t) . \end{aligned}

where $D (t)$ is an orthogonal matrix of size $N_{h} \times N_{h}$ . From this definition, computing $Q (t) Q (t)^{T}$ , we recover equation (45).

Equation (157) shows that the individual weight matrices are not directly described by parts of the $Q (t) Q (t)^{T}$ solution. Instead, they are fixed only up to a time-dependent orthogonal transformation. To verify this, we numerically compute $D (t)$ as $D (t) = q (t)^{+} Q_{sim} (t)$ where $Q_{sim} (t)$ denotes weights obtained from numerical simulations of gradient descent, + denotes the pseudoinverse ( $q^{+} (t) = (q^{T} (t) q (t))^{- 1} {q (t)}^{T}$ where $q (t)$ is rectangular) and

\begin{aligned} q (t) = e^{F \frac{t}{τ}} Q (0) {[I + \frac{1}{2} Q {(0)}^{T} (e^{F \frac{t}{τ}} F^{- 1} e^{F \frac{t}{τ}} - F^{- 1}) Q (0)]}^{- \frac{1}{2}} . \end{aligned}

We numerically show in figure 7(D) right panel that $D (t)$ generally changes over time. Letting $Q_{d} (t)$ denote the estimated $Q (t)$ using the numerically recovered $D (t)$ , figure 7(D) left and centre panels show that both the dynamics of $Q_{d} (t)$ and $Q_{d} (t) Q_{d} (t)^{T}$ match the temporal dynamics of the simulation. The small derivation between the simulation and the analytical solution for later time points, is due to the imprecision of the pseudoinverse.

In figure 7(C), we report the implementation of equation (158). As expected, the analytical solution does not match the numerical temporal dynamics. However, the solution for $q (t) q (t)^{T}$ recovers the correct dynamics.

Appendix D. Rich and lazy learning regimes and generalisation

Under the assumptions of theorem 3.1, the network function acquires a rich task-specific internal representation at convergence, that is $W_{1}^{T} W_{1} = \tilde{V} \tilde{S} {\tilde{V}}^{T}$ and $W_{2} W_{2}^{T} = \tilde{U} \tilde{S} {\tilde{U}}^{T}$ . Therefore, there exist initial states with large zero-balanced weights that lead to rich solutions.

We more quantitatively capture this phenomena in figure 8. We define the error on the internal representation as figure 3 $| | W_{1}^{T} W_{1} - \tilde{V} \tilde{S} {\tilde{V}}^{T} | |_{F}^{2}$ and $| | W_{2} W_{2}^{T} - \tilde{U} \tilde{S} {\tilde{U}}^{T} |_{F}^{2}$ for W ₁ and W ₂ respectively. Effectively, we measure the richness of the representation and in turn it is generalisation ability. In figure 8, the error remains zero for increasing gain for any network initialised with zero-balanced weights. In other words, the representation at convergences is rich. In contrast, for random initialisation the error increase consequently with increasing gain. As the network is moving away from the small random weight initialisation, the network converges to lazier representation.

Appendix E. Decoupling dynamics

E.1. Proof for theorem 5.1

Let the input and output dimension of a two-layer linear network (equation (1)) be equal, i.e. $N_{i} = N_{o}$ , then equation (11) simplifies to

\begin{aligned} Q Q^{T} (t) & = [\begin{matrix} \tilde{V} (I - e^{- \tilde{S} \frac{t}{τ}} C^{T} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}}) \\ \tilde{U} (I + e^{- \tilde{S} \frac{t}{τ}} C^{T} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}}) \end{matrix}] \\ \times [4 e^{- \tilde{S} \frac{t}{τ}} B^{- 1} S^{- 1} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}} + (I - e^{- 2 \tilde{S} \frac{t}{τ}}) {\tilde{S}}^{- 1} \\ {- e^{- \tilde{S} \frac{t}{τ}} B^{- 1} C (e^{- 2 \tilde{S} \frac{t}{τ}} - I) {\tilde{S}}^{- 1} C^{T} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}}]}^{- 1} \\ \times {[\begin{matrix} \tilde{V} (I - e^{- \tilde{S} \frac{t}{τ}} C^{T} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}}) \\ \tilde{U} (I + e^{- \tilde{S} \frac{t}{τ}} C^{T} {(B^{T})}^{- 1} e^{- \tilde{S} \frac{t}{τ}}) \end{matrix}]}^{T} . \end{aligned}

Further, let the singular value decomposition of the input-output correlation of the task be

S V D ({\tilde{Σ}}^{y x}) = \tilde{U} \tilde{S} {\tilde{V}}^{T}

and suppose that the initial state of the network can be written in the form

S V D (W_{2} (0) W_{1} (0)) = U S V^{T} = \tilde{U} A {(0)}^{T} A (0) {\tilde{V}}^{T} .

First, we note that the initial weights in this setting are not independent of the structure of the target task. In particular,

\begin{aligned} U \sqrt{S} & = \tilde{U} A {(0)}^{T} \end{aligned}

\begin{aligned} \Leftrightarrow {\tilde{U}}^{T} U \sqrt{S} & = A {(0)}^{T} \end{aligned}

\begin{aligned} \Leftrightarrow \sqrt{S} U^{T} \tilde{U} & = A (0) \end{aligned}

and

\begin{aligned} \sqrt{S} V^{T} & = A (0) {\tilde{V}}^{T} \end{aligned}

\begin{aligned} \Leftrightarrow \sqrt{S} V^{T} \tilde{V} & = A (0) \end{aligned}

and therefore

\begin{aligned} \sqrt{S} U^{T} \tilde{U} & = \sqrt{S} V^{T} \tilde{V} \end{aligned}

\begin{aligned} \Leftrightarrow U V^{T} & = \tilde{U} {\tilde{V}}^{T} . \end{aligned}

This further simplifies the equation, as

\begin{aligned} U \sqrt{S} & = \tilde{U} A {(0)}^{T} \end{aligned}

\begin{aligned} \Leftrightarrow U & = \tilde{U} A {(0)}^{T} {\sqrt{S}}^{- 1} \end{aligned}

and

\begin{aligned} \sqrt{S} V^{T} & = A (0) {\tilde{V}}^{T} \end{aligned}

\begin{aligned} \Leftrightarrow V^{T} & = {\sqrt{S}}^{- 1} A (0) {\tilde{V}}^{T} \end{aligned}

\begin{aligned} \Leftrightarrow V & = \tilde{V} A {(0)}^{T} {\sqrt{S}}^{- 1}, \end{aligned}

then recollecting the definition of B and C we get

\begin{aligned} B^{T} & = {\tilde{U}}^{T} U + {\tilde{V}}^{T} V \end{aligned}

\begin{aligned} = {\tilde{U}}^{T} \tilde{U} A {(0)}^{T} {\sqrt{S}}^{- 1} + {\tilde{V}}^{T} \tilde{V} A {(0)}^{T} {\sqrt{S}}^{- 1} \end{aligned}

\begin{aligned} = ({\tilde{U}}^{T} \tilde{U} + {\tilde{V}}^{T} \tilde{V}) A {(0)}^{T} {\sqrt{S}}^{- 1} \end{aligned}

\begin{aligned} = 2 A {(0)}^{T} {\sqrt{S}}^{- 1} \end{aligned}

and

\begin{aligned} C^{T} & = {\tilde{U}}^{T} U - {\tilde{V}}^{T} V \end{aligned}

\begin{aligned} = ({\tilde{U}}^{T} \tilde{U} - {\tilde{V}}^{T} \tilde{V}) A {(0)}^{T} {\sqrt{S}}^{- 1} \end{aligned}

\begin{aligned} = 0 . \end{aligned}

Substituting the new values of B and C into equation (159) then yields

\begin{aligned} Q Q^{T} (t) & = [\begin{matrix} \tilde{V} \\ \tilde{U} \end{matrix}] {[4 e^{- \tilde{S} \frac{t}{τ}} \frac{1}{4} A {(0)}^{- 1} \sqrt{S} S^{- 1} \sqrt{S} A {(0)}^{- T} e^{- \tilde{S} \frac{t}{τ}} + (I - e^{- 2 \tilde{S} \frac{t}{τ}}) {\tilde{S}}^{- 1}]}^{- 1} {[\begin{matrix} \tilde{V} \\ \tilde{U} \end{matrix}]}^{T} \end{aligned}

\begin{aligned} = [\begin{matrix} \tilde{V} \\ \tilde{U} \end{matrix}] {[e^{- \tilde{S} \frac{t}{τ}} {(A {(0)}^{T} A (0))}^{- 1} e^{- \tilde{S} \frac{t}{τ}} + (I - e^{- 2 \tilde{S} \frac{t}{τ}}) {\tilde{S}}^{- 1}]}^{- 1} {[\begin{matrix} \tilde{V} \\ \tilde{U} \end{matrix}]}^{T} . \end{aligned}

Finally, we note that the dynamics can thus be written as

Q Q^{T} (t) = [\begin{matrix} \tilde{V} A^{T} A (t) {\tilde{V}}^{T} & \tilde{V} A^{T} A (t) {\tilde{U}}^{T} \\ \tilde{U} A^{T} A (t) {\tilde{V}}^{T} & \tilde{U} A^{T} A (t) {\tilde{U}}^{T} \end{matrix}]

where

A^{T} A (t) = {[e^{- \tilde{S} \frac{t}{τ}} {(A {(0)}^{T} A (0))}^{- 1} e^{- \tilde{S} \frac{t}{τ}} + (I - e^{- 2 \tilde{S} \frac{t}{τ}}) {\tilde{S}}^{- 1}]}^{- 1} .

$□$

E.2. Solution for $2 \times 2$ dynamics

We consider small networks with input and output dimension $N_{i} = 2$ and $N_{o} = 2$ . In this setting, the structure of the weight initialisation and task are encoded in the matrices

A {(0)}^{T} A (0) = [\begin{matrix} a_{1} (0) & b (0) \\ b (0) & a_{2} (0) \end{matrix}] and \tilde{S} = [\begin{matrix} s_{1} & 0 \\ 0 & s_{2} \end{matrix}],

where the parameters $a_{1} (0)$ and $a_{2} (0)$ represent coupling within a singular mode, and b(0) represents counterproductive cross-coupling between different singular modes.

From equation (13), we have

\begin{aligned} A^{T} A (t) = & [[\begin{matrix} e^{\frac{- s_{1} t}{τ}} & 0 \\ 0 & e^{\frac{- s_{2} t}{τ}} \end{matrix}] {[\begin{matrix} a_{1} (0) & b (0) \\ b (0) & a_{2} (0) \end{matrix}]}^{- 1} [\begin{matrix} e^{\frac{- s_{1} t}{τ}} & 0 \\ 0 & e^{\frac{- s_{2} t}{τ}} \end{matrix}] \\ {+ [[\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}] - [\begin{matrix} e^{\frac{- 2 s_{1} t}{τ}} & 0 \\ 0 & e^{\frac{- 2 s_{2} t}{τ}} \end{matrix}]] {[\begin{matrix} s_{1} & 0 \\ 0 & s_{2} \end{matrix}]}^{- 1}]}^{- 1} \end{aligned}

\begin{aligned} = & [\frac{1}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} [\begin{matrix} e^{\frac{- s_{1} t}{τ}} & 0 \\ 0 & e^{\frac{- s_{2} t}{τ}} \end{matrix}] [\begin{matrix} a_{2} (0) & - b (0) \\ - b (0) & a_{1} (0) \end{matrix}] [\begin{matrix} e^{\frac{- s_{1} t}{τ}} & 0 \\ 0 & e^{\frac{- s_{2} t}{τ}} \end{matrix}] \\ {+ [[\begin{matrix} 1 & 0 \\ 0 & 1 \end{matrix}] - [\begin{matrix} e^{\frac{- 2 s_{1} t}{τ}} & 0 \\ 0 & e^{\frac{- 2 s_{2} t}{τ}} \end{matrix}]] [\begin{matrix} \frac{1}{s_{1}} & 0 \\ 0 & \frac{1}{s_{2}} \end{matrix}]]}^{- 1}, \end{aligned}

where we use

\begin{aligned} {[\begin{matrix} a & b \\ c & d \end{matrix}]}^{- 1} = \frac{1}{a d - b c} [\begin{matrix} d & - b \\ - c & a \end{matrix}] . \end{aligned}

We continue with

\begin{aligned} \begin{array}{l} A^{T} A (t) = [\frac{1}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} [\begin{matrix} e^{\frac{- 2 s_{1} t}{τ}} & 0 \\ 0 & e^{\frac{- 2 s_{2} t}{τ}} \end{matrix}] [\begin{matrix} a_{2} (0) & - b (0) \\ - b (0) & a_{1} (0) \end{matrix}] [\begin{matrix} e^{\frac{- 2 s_{1} t}{τ}} & 0 \\ 0 & e^{\frac{- 2 s_{2} t}{τ}} \end{matrix}] \\ + {[\begin{matrix} \frac{1}{s_{1}} & 0 \\ 0 & \frac{1}{s_{2}} \end{matrix}] - [\begin{matrix} \frac{1}{s_{1}} e^{\frac{- 2 s_{1} t}{τ}} & 0 \\ 0 & \frac{1}{s_{2}} e^{\frac{- 2 s_{2} t}{τ}} \end{matrix}]]}^{- 1} \end{array} \end{aligned}

\begin{aligned} \begin{array}{l} = [\frac{1}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} [\begin{matrix} e^{\frac{- 2 s_{1} t}{τ}} a_{2} (0) & - e^{\frac{- s_{1} t}{τ}} b (0) e^{\frac{- s_{2} t}{τ}} \\ - e^{\frac{- s_{2} t}{τ}} b (0) e^{\frac{- s_{1} t}{τ}} & e^{\frac{- 2 s_{2} t}{τ}} a_{1} (0) \end{matrix}] \\ {+ [\begin{matrix} \frac{1}{s_{1}} & 0 \\ 0 & \frac{1}{s_{2}} \end{matrix}] - [\begin{matrix} \frac{1}{s_{1}} e^{\frac{- 2 s_{1} t}{τ}} & 0 \\ 0 & \frac{1}{s_{2}} e^{\frac{- 2 s_{2} t}{τ}} \end{matrix}]]}^{- 1} \end{array} \end{aligned}

\begin{aligned} = {[\begin{matrix} \frac{e^{\frac{- 2 s_{1} t}{τ}} a_{2} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s_{1}} - \frac{1}{s_{1}} e^{\frac{- 2 s_{1} t}{τ}} & - \frac{e^{\frac{- s_{1} t}{τ}} b (0) e^{\frac{- s_{2} t}{τ}}}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} \\ - \frac{e^{\frac{- s_{2} t}{τ}} b (0) e^{\frac{- s_{1} t}{τ}}}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} & \frac{e^{\frac{- 2 s_{2} t}{τ}} a_{1} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s_{2}} - \frac{1}{s_{2}} e^{\frac{- 2 s_{2} t}{τ}} \end{matrix}]}^{- 1} . \end{aligned}

We use equation (189) and simplify the denominator

\begin{aligned} A^{T} A (t) \\ = \frac{1}{(\frac{e^{\frac{- 2 s_{2} t}{τ}} a_{1} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s_{2}} - \frac{1}{s_{2}} e^{\frac{- 2 s_{2} t}{τ}}) (\frac{e^{\frac{- 2 s_{1} t}{τ}} a_{2} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s_{1}} - \frac{1}{s_{1}} e^{\frac{- 2 s_{1} t}{τ}}) - {(- \frac{e^{\frac{- s_{2} t}{τ}} b (0) e^{\frac{- s_{1} t}{τ}}}{a_{1} (0) a_{2} (0) - b {(0)}^{2}})}^{2}} \\ \times [\begin{matrix} \frac{e^{\frac{- 2 s_{2} t}{τ}} a_{1} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s_{2}} - \frac{1}{s_{2}} e^{\frac{- 2 s_{2} t}{τ}} & \frac{e^{\frac{- s_{1} t}{τ}} b (0) e^{\frac{- s_{2} t}{τ}}}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} \\ \frac{e^{\frac{- s_{2} t}{τ}} b (0) e^{\frac{- s_{1} t}{τ}}}{a_{1} (0) a_{2} (0) - b (0)^{2}} & \frac{e^{\frac{- 2 s_{1} t}{τ}} a_{2} (0)}{a_{1} (0) a_{2} (0) - b (0)^{2}} + \frac{1}{s_{1}} - \frac{1}{s_{1}} e^{\frac{- 2 s_{1} t}{τ}} \end{matrix}] . \end{aligned}

The diagonal element $a_{1} (t)$ is given as

\begin{aligned} a_{1} (t) \\ = \frac{\frac{e^{\frac{- 2 s_{2} t}{τ}} a_{1} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s_{2}} - \frac{1}{s_{2}} e^{\frac{- 2 s_{2} t}{τ}}}{(\frac{e^{\frac{- 2 s_{2} t}{τ}} a_{1} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s_{2}} - \frac{1}{s_{2}} e^{\frac{- 2 s_{2} t}{τ}}) (\frac{e^{\frac{- 2 s_{1} t}{τ}} a_{2} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s_{1}} - \frac{1}{s_{1}} e^{\frac{- 2 s_{1} t}{τ}}) - {(- \frac{e^{\frac{- s_{2} t}{τ}} b (0) e^{\frac{- s_{1} t}{τ}}}{a_{1} (0) a_{2} (0) - b {(0)}^{2}})}^{2}}, \end{aligned}

and interchanging subscripts 1 and 2 yields $a_{2} (t)$ . As a check on this result, by setting $b (0) = 0$ we recover the expression

a_{1} (t) = \frac{a_{1} (0)}{e^{\frac{- 2 s_{1} t}{τ}} + \frac{a_{1} (0)}{s_{1}} (1 - e^{\frac{- 2 s_{1} t}{τ}})},

from Saxe et al (2019).

We further simplify the denominator to

\begin{aligned} A^{T} A (t) \\ = \frac{1}{\frac{1}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} (e^{\frac{- 2 (s_{1} + s_{2}) t}{τ}} (1 - \frac{a_{1} (0)}{s_{1}} - \frac{a_{2} (0)}{s_{2}}) + e^{\frac{- 2 s_{2} t}{τ}} \frac{a_{1} (0)}{s_{1}} + e^{\frac{- 2 s_{1} t}{τ}} \frac{a_{2} (0)}{s_{2}}) + \frac{1}{s_{2} s_{1}}} \\ \times [\begin{matrix} \frac{e^{\frac{- 2 s_{2} t}{τ}} a_{1} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s_{2}} - \frac{1}{s_{2}} e^{\frac{- 2 s_{2} t}{τ}} & \frac{e^{\frac{- s_{1} t}{τ}} b (0) e^{\frac{- s_{2} t}{τ}}}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} \\ \frac{e^{\frac{- s_{2} t}{τ}} b (0) e^{\frac{- s_{1} t}{τ}}}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} & \frac{e^{\frac{- 2 s_{1} t}{τ}} a_{2} (0)}{a_{1} (0) a_{2} (0) - b (0)^{2}} + \frac{1}{s_{1}} - \frac{1}{s_{1}} e^{\frac{- 2 s_{1} t}{τ}} \end{matrix}] \end{aligned}

E.3. Off-Diagonal decoupling dynamics

We track the decoupling by considering the dynamics of the off-diagonal element b(t).

b (t) = \frac{\frac{e^{\frac{- s_{2} t}{τ}} b (0) e^{\frac{- s_{1} t}{τ}}}{a_{1} (0) a_{2} (0) - b {(0)}^{2}}}{\frac{1}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} (e^{\frac{- 2 (s_{1} + s_{2}) t}{τ}} (1 - \frac{a_{1} (0)}{s_{1}} - \frac{a_{2} (0)}{s_{2}}) + e^{\frac{- 2 s_{2} t}{τ}} \frac{a_{1} (0)}{s_{1}} + e^{\frac{- 2 s_{1} t}{τ}} \frac{a_{2} (0)}{s_{2}}) + \frac{1}{s_{2} s_{1}}} .

As t tends to infinity $lim_{t \to \infty} b (t) = 0$ the off-diagonal element shrinks to zero.

We can further simplify the off-diagonal to

b (t) = \frac{b (0)}{e^{\frac{- (s_{1} + s_{2}) t}{τ}} (1 - \frac{a_{1} (0)}{s_{1}} - \frac{a_{2} (0)}{s_{2}}) + e^{\frac{(s_{1} - s_{2}) t}{τ}} \frac{a_{1} (0)}{s_{1}} + e^{\frac{(s_{2} - s_{1}) t}{τ}} \frac{a_{2} (0)}{s_{2}} + \frac{a_{1} (0) a_{2} (0) - b {(0)}^{2}}{s_{2} s_{1}}} .

Equation (198) can exhibit non-monotonic trajectories with transient peaks as shown in figure 4. The qualitative observations for the $2 \times 2$ network hold for larger target matrices as shown in figure 9. For large initialisation, the dynamics are exponential. At intermediate and small initialisation, the maximum of the off-diagonal is reached before the singular mode is fully learned. In the small initialisation scheme, the peak is of negligible size. The respective target matrix for panel (A)–(D), (B)–(E) and (C)–(F) in figure 9 are

Figure 9. — (A)–(C) Network function dynamics (Diagonal elements: blue, Off-diagonal elements: red) learning with learning rate η = 0.01 on the target $5 \times 5$ diagonal matrices shown in equation (198). The network was initialised as defined in section E with Small ( $σ = 1 \times 10^{- 6}$ ), Intermediate (σ = 0.1) and Large (σ = 2) variance, and hidden layer size $N_{h} = 10$ . (A), Dense. (B), Diagonal. (C), Equal diagonal. (D)–(F). Corresponding numerical temporal dynamics of the projection of the network function on- and off-diagonal elements into the singular-basis of the initialisation. Equivalently, the temporal dynamics of the elements of $A A^{T}$ bottom left quadrant. (D), Dense. (E), Diagonal. (F), Equal diagonal.

\begin{aligned} dense [\begin{matrix} 5 & 6 & 3 & 0 & 1 \\ 4, & 1 & 0 & 1 & 2 \\ 3 & 0 & 2 & 4 & 0 \\ 3 & 4 & 0 & 3 & 2 \\ 2 & 0 & 1 & 3 & 4 \end{matrix}], diagonal [\begin{matrix} 5 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 4 \end{matrix}] and equal diagonal [\begin{matrix} 5 & 0 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 & 0 \\ 0 & 0 & 5 & 0 & 0 \\ 0 & 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 0 & 5 \end{matrix}] . \end{aligned}

We characterise these dynamics considering the case where $s_{1} = s_{2} = s$ for the two-by-two solution (i.e. equal diagonal target y) for which we can compute the time of the peak. In this particular case, we can further simplify the off-diagonal to

b (t) = \frac{b (0)}{e^{\frac{- 2 (s) t}{τ}} (1 - \frac{a_{1} (0) + a_{2} (0)}{s}) + \frac{a_{1} (0) + a_{2} (0)}{s} + \frac{a_{1} (0) a_{2} (0) - b {(0)}^{2}}{s^{2}}} .

We find the time of the maximum of the off-diagonal elements to be $t_{peak} = \frac{τ}{4 s} ln \frac{s (s - a_{1} (0) - a_{2} (0))}{a_{1} (0) a_{2} (0) - b (0)^{2}}$ .

The presence of a peak in the off-diagonal values, indicates the decoupling, but as shown in figures 4(D)–(F), the peak size is negligible in comparison to the size of the on-diagonal values for small initial weights. This difference is reminiscent of the silent alignment effect described by Atanasov et al (2022). We further note, that the time scale of decoupling is on the same order as the one reported for the silent alignment effect $t_{sa} = \frac{1}{s}$ .

E.4. On-diagonal dynamics and the effect of initialisation variance

In this section we revisit the impact of initialisation scale for the on-diagonal dynamics. We now start with

\begin{aligned} a_{1} (t) = \frac{\frac{e^{\frac{- 2 s_{2} t}{τ}} a_{1} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s_{2}} - \frac{1}{s_{2}} e^{\frac{- 2 s_{2} t}{τ}}}{\frac{1}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} (e^{\frac{- 2 (s_{1} + s_{2}) t}{τ}} (1 - \frac{a_{1} (0)}{s_{1}} - \frac{a_{2} (0)}{s_{2}}) + e^{\frac{- 2 s_{2} t}{τ}} \frac{a_{1} (0)}{s_{1}} + e^{\frac{- 2 s_{1} t}{τ}} \frac{a_{2} (0)}{s_{2}}) + \frac{1}{s_{2} s_{1}}} . \end{aligned}

The diagonal elements simplify in the cases where $s_{1} = s_{2} = s$ (i.e. target Y is diagonal),

\begin{aligned} a_{1} (t) = \frac{\frac{e^{\frac{- 2 s t}{τ}} a_{1} (0)}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} + \frac{1}{s} - \frac{1}{s} e^{\frac{- 2 s t}{τ}}}{\frac{1}{a_{1} (0) a_{2} (0) - b {(0)}^{2}} (e^{\frac{- 4 s t}{τ}} (1 - \frac{a_{1} (0)}{s} - \frac{a_{2} (0)}{s}) + e^{\frac{- 2 s t}{τ}} \frac{a_{1} (0)}{s} + e^{\frac{- 2 s t}{τ}} \frac{a_{2} (0)}{s}) + \frac{1}{s^{2}}} . \end{aligned}

We consider when $| a_{1} (0) |, | a_{2} (0) |, | b (0) | ≪ 1$ , and recover a sigmoidal trajectory,

a_{1} (t) = \frac{s a_{1} (0)}{e^{\frac{- 2 s t}{τ}} [s - a_{1} (0) - a_{2} (0)] + a_{1} (0) + a_{2} (0)} .

We can compute the time at which $a_{1} (t)$ rises to half its asymptotic value to be

t_{half} = \frac{τ}{2 s} log (\frac{s - a_{1} (0) - a_{2} (0)}{a_{1} (0) - a_{2} (0)}) .

For $| a_{1} (0) |, | a_{2} (0) |, | b (0) | ≫ 0$ the dynamics of the on-diagonal element a ₁ is close to exponential.

The observation for $2 \times 2$ network hold for larger target matrices as shown in figure 9. For large variance initialisations, the dynamics are exponential. At intermediate variance initialisations, we observe more complex behaviour. While at small variance initialisations, the on-diagonal element describes a sigmoidal trajectory.

Appendix F. Continual learning

We consider the case of training a two-layer deep linear network on a sequence of tasks $T_{a}$ , $T_{b}$ , $T_{c}$ , ... with corresponding correlation functions $T_{a} = {\tilde{Σ}}_{a}^{y x}$ , $T_{b} = {\tilde{Σ}}_{b}^{y x} . . .$ . Then, the full batch loss of the ith task at any point in training time is

L_{i} = \frac{1}{2 P} {||W_{2} W_{1} X_{i} - Y_{i}||}_{F}^{2} .

From theorem 3.2 it follows that after training the network to convergence on task $T_{j}$ , the network function is $W_{2} W_{1} = \tilde{U} \tilde{S} {\tilde{V}}^{T} = {\tilde{Σ}}_{j}^{y x}$ . Further, using the assumption of whitened inputs 2.2 and the identities $| | A | |_{F}^{2} = Tr (A A^{T})$ and $Tr (A) + Tr (B) = Tr (A + B)$ , the full batch loss of the i-th task is then

\begin{aligned} L_{i} (T_{j}) & = \frac{1}{2 P} {||{\tilde{Σ}}_{j}^{y x} X_{i} - Y_{i}||}_{F}^{2} \end{aligned}

\begin{aligned} = \frac{1}{2 P} T r (({\tilde{Σ}}_{j}^{y x} X_{i} - Y_{i}) {({\tilde{Σ}}_{j}^{y x} X_{i} - Y_{i})}^{T}) \end{aligned}

\begin{aligned} = \frac{1}{2 P} T r ({\tilde{Σ}}_{j}^{y x} X_{i} X_{i}^{T} {\tilde{Σ}}_{j}^{{y x}^{T}}) - \frac{1}{P} T r ({\tilde{Σ}}_{j}^{y x} X_{i} Y_{i}^{T}) + \frac{1}{2 P} T r (Y_{i} Y_{i}^{T}) \end{aligned}

\begin{aligned} = \frac{1}{2} T r ({\tilde{Σ}}_{j}^{y x} {\tilde{Σ}}_{j}^{{y x}^{T}}) - T r ({\tilde{Σ}}_{j}^{y x} {\tilde{Σ}}_{i}^{{y x}^{T}}) + \frac{1}{2} T r ({\tilde{Σ}}_{i}^{y y}) \end{aligned}

\begin{aligned} = \frac{1}{2} T r (({\tilde{Σ}}_{j}^{y x} - {\tilde{Σ}}_{i}^{y x}) {({\tilde{Σ}}_{j}^{y x} - {\tilde{Σ}}_{i}^{y x})}^{T} - {\tilde{Σ}}_{i}^{y x} {\tilde{Σ}}_{i}^{{y x}^{T}}) + \frac{1}{2} ({\tilde{Σ}}_{i}^{y y}) \end{aligned}

\begin{aligned} = \frac{1}{2} {||{\tilde{Σ}}_{j}^{y x} - {\tilde{Σ}}_{i}^{y x}||}_{F}^{2} \underset{c}{\underset{⏟}{- \frac{1}{2} T r ({\tilde{Σ}}_{i}^{y x} {\tilde{Σ}}_{i}^{{y x}^{T}}) + \frac{1}{2} ({\tilde{Σ}}_{i}^{y y})}} . \end{aligned}

Therefore, the amount of forgetting $F$ on task $T_{i}$ when training on task $T_{k}$ after having trained the network on task $T_{j}$ , i.e. the relative change of loss, is fully determined by the similarity structure of the tasks

\begin{aligned} F_{i} (T_{j}, T_{k}) & = L_{i} (T_{k}) - L_{i} (T_{j}) \end{aligned}

\begin{aligned} = \frac{1}{2} {||{\tilde{Σ}}_{k}^{y x} - {\tilde{Σ}}_{i}^{y x}||}_{F}^{2} + c - \frac{1}{2} {||{\tilde{Σ}}_{j}^{y x} - {\tilde{Σ}}_{i}^{y x}||}_{F}^{2} - c \end{aligned}

\begin{aligned} = \frac{1}{2} ({||{\tilde{Σ}}_{k}^{y x} - {\tilde{Σ}}_{i}^{y x}||}_{F}^{2} - {||{\tilde{Σ}}_{j}^{y x} - {\tilde{Σ}}_{i}^{y x}||}_{F}^{2}) . \end{aligned}

Appendix G. Revising structured knowledge

G.1. Reversal learning dynamics

In the following, we assume that the input dimension is equal to the output dimension. Further, we denote the i-th column of the left and right singular vectors as u _i, ${\tilde{u}}_{i}$ and v _i, ${\tilde{v}}_{i}$ respectively.

Reversal learning occurs when the task and the initial network function share the same left and right singular vectors, i.e. $U = \tilde{U}$ and $V = \tilde{V}$ , except for one or multiple columns of the left singular vectors, for which the direction is reversed:

- u_{i} = {\tilde{u}}_{i} .

We note that, if there is any reversal in the right singular vectors $- v_{i} = {\tilde{v}}_{i}$ , this can be written as a reversal in the left singular vectors, as the signs of the right and left singular vectors are interchangeable. In the reversal learning setting, both $B = U^{T} \tilde{U} + V^{T} \tilde{V}$ and $C = U^{T} \tilde{U} - V^{T} \tilde{V}$ are diagonal matrices. The diagonal entries of C are zero if the singular vectors are aligned and 2 if they are reversed. Similarly, diagonal entries of B are 2 if the singular vectors are aligned and zero if they are reversed. Therefore, in the case of reversal learning, B is a diagonal matrix with 0 values and thus is not invertible. As a consequence, the learning dynamics cannot be described by equation (11). However, as B and C are diagonal matrices, the learning dynamics simplify. Let b _i, c _i, s _i and ${\tilde{s}}_{i}$ denote the i-th diagonal entry of B, C, S and $\tilde{S}$ respectively, then the network dynamics can be rewritten as

\begin{aligned} W_{2} W_{1} (t) & = \frac{1}{2} \tilde{U} (e^{\tilde{S} \frac{t}{τ}} B^{T} + e^{- \tilde{S} \frac{t}{τ}} C^{T}) \\ \times {[S^{- 1} + \frac{1}{4} B (e^{2 \tilde{S} \frac{t}{τ}} - I) {\tilde{S}}^{- 1} B^{T} - \frac{1}{4} C (e^{- 2 \tilde{S} \frac{t}{τ}} - I) {\tilde{S}}^{- 1} C^{T}]}^{- 1} \end{aligned}

\begin{aligned} \frac{1}{2} (e^{\tilde{S} \frac{t}{τ}} B - e^{- \tilde{S} \frac{t}{τ}} C) {\tilde{V}}^{T} \\ = \sum_{i = 1}^{N_{i}} \frac{b_{i}^{2} e^{2 {\tilde{s}}_{i} \frac{t}{τ}} - c_{i}^{2} e^{- 2 {\tilde{s}}_{i} \frac{t}{τ}}}{4 s_{i}^{- 1} + b_{i}^{2} e^{2 {\tilde{s}}_{i} \frac{t}{τ}} {\tilde{s}}_{i}^{- 1} - b_{i}^{2} {\tilde{s}}_{i}^{- 1} - c_{i}^{2} e^{- 2 {\tilde{s}}_{i} \frac{t}{τ}} {\tilde{s}}_{i}^{- 1} + c_{i}^{2} {\tilde{s}}_{i}^{- 1}} {\tilde{u}}_{i} {\tilde{v}}_{i}^{T} \end{aligned}

\begin{aligned} = \sum_{i = 1}^{N_{i}} \frac{s_{i} b_{i}^{2} {\tilde{s}}_{i} - s_{i} c_{i}^{2} {\tilde{s}}_{i} e^{- 4 {\tilde{s}}_{i} \frac{t}{τ}}}{4 {\tilde{s}}_{i} e^{- 2 {\tilde{s}}_{i} \frac{t}{τ}} + s_{i} b_{i}^{2} (1 - e^{- 2 {\tilde{s}}_{i} \frac{t}{τ}}) + s_{i} c_{i}^{2} (e^{- 2 {\tilde{s}}_{i} \frac{t}{τ}} - e^{- 4 {\tilde{s}}_{i} \frac{t}{τ}})} {\tilde{u}}_{i} {\tilde{v}}_{i}^{T} . \end{aligned}

It follows, that in the reversal learning case, i.e. $b = 0$ , for each reversed singular vector, the dynamics vanish to zero

lim_{t \to \infty} \frac{- s_{i} c_{i}^{2} {\tilde{s}}_{i} e^{- 4 {\tilde{s}}_{i} \frac{t}{τ}}}{4 {\tilde{s}}_{i} e^{- 2 {\tilde{s}}_{i} \frac{t}{τ}} + s_{i} c_{i}^{2} (e^{- 2 {\tilde{s}}_{i} \frac{t}{τ}} - e^{- 4 {\tilde{s}}_{i} \frac{t}{τ}})} {\tilde{u}}_{i} {\tilde{v}}_{i}^{T} = 0 .

Analytically, the learning dynamics are initialised and remain on the separatrix of a saddle point, until the corresponding singular value of the network function has vanished and remains zero, corresponding to convergence to the saddle point. When simulated numerically, the learning dynamics escape the saddle points due to imprecision of floating point arithmetic. However, numerical optimisation still suffers from catastrophic slowing (Lee et al 2022), as escaping the saddle point takes time (figure 6(A)). In contrast, in the case of aligned singular vectors ( $c = 0$ ), we recover the equation for the temporal dynamics as described in Saxe et al (2014). Training succeeds, as the singular value of the network function converges to its target value

\begin{aligned} lim_{t \to \infty} \sum_{i = 1}^{N_{i}} \frac{s_{i} b_{i}^{2} {\tilde{s}}_{i}}{4 {\tilde{s}}_{i} e^{- 2 {\tilde{s}}_{i} \frac{t}{τ}} + s_{i} b_{i}^{2} (1 - e^{- 2 {\tilde{s}}_{i} \frac{t}{τ}})} {\tilde{u}}_{i} {\tilde{v}}_{i}^{T} & = \frac{s_{i} b_{i}^{2} {\tilde{s}}_{i}}{s_{i} b_{i}^{2}} {\tilde{u}}_{i} {\tilde{v}}_{i}^{T} \end{aligned}

\begin{aligned} = {\tilde{s}}_{i} {\tilde{u}}_{i} {\tilde{v}}_{i}^{T} . \end{aligned}

In summary, in the case of aligned singular vectors, the learning dynamics can be described by the convergence of singular values. However in the case of reversal learning, analytically, training does not succeed. In simulations, the learning dynamics escape the saddle point due to numerical imprecision, but the learning dynamics are catastrophically slowed in the vicinity of the saddle point.

G.2. Exact learning dynamics in shallow networks

To provide a point of comparison to our deep linear network results, here we derive a solution for the temporal dynamics of reversal learning in a shallow network.

The network’s weights are optimised using full batch gradient descent with learning rate η (or equivalently time constant $τ = 1 / η$ ) on the mean squared error loss given in equation (2), yielding the first task dynamics

\begin{aligned} τ \frac{d}{d t} W = {\tilde{Σ}}^{y x} - W {\tilde{Σ}}^{x x}, \end{aligned}

where ${\tilde{Σ}}^{x x}$ and ${\tilde{Σ}}^{y x}$ is the input and input-output correlation matrices of the dataset. We define

\begin{aligned} S V D (W (0)) = U S V^{T} and S V D ({\tilde{Σ}}^{y x}) = \tilde{U} \tilde{S} {\tilde{V}}^{T} . \end{aligned}

motivating the change of variable $W = U \overline{W} V^{T}$ . We project the weight into the basis of the initialisation

\begin{aligned} τ \frac{d}{d t} U \overline{W} V^{T} & = {\tilde{Σ}}^{y x} - U \overline{W} V^{T} {\tilde{Σ}}^{x x} \end{aligned}

\begin{aligned} τ \frac{d}{d t} U \overline{W} V^{T} & = U U^{T} {\tilde{Σ}}^{y x} V V^{T} - U \overline{W} V^{T} {\tilde{Σ}}^{x x} \end{aligned}

\begin{aligned} τ \frac{d}{d t} \overline{W} & = U^{T} {\tilde{Σ}}^{y x} V - \overline{W} {\tilde{Σ}}^{x x} . \end{aligned}

Under the assumption of whitened inputs 2.2, the dynamics yields

\begin{aligned} τ \frac{d}{d t} \overline{W} & = U^{T} {\tilde{Σ}}^{y x} V - \overline{W} . \end{aligned}

Defining ${\overline{W}}_{i i} = b_{i}$ the diagonal element of the matrix, encoding the strength of the mode i transmitted by the input-to-output weight. Similarly, we write ${(U^{T} {\tilde{Σ}}^{y x} V)}_{i i} = k_{i}$ . Assuming decoupled initial conditions, we obtain the scalar dynamics

\begin{aligned} τ \frac{d}{d t} b_{i} = k_{i} - b_{i} \end{aligned}

with solution

b_{i} = k_{i} (1 - e^{\frac{- t}{τ}}) + b_{i}^{0} e^{\frac{- t}{τ}} .

Reverting the change of variable, the weight trajectory yields

W = U B (t) V^{T} .

This solution is very similar to the one proposed by Saxe et al (2019). However, the key here is that k _i can have negative values. k _i is negative whenever a vector is in the opposite direction to the initialisation (as in the reversal learning setting). We show in figure 6 that the analytical solution derived above matches the numerical temporal dynamics. From equation (228), we note that the shallow network cannot display catastrophic slowing.

Appendix H. Simulations

In the following, we describe the details of the simulation studies. Generally, $N_{i}$ , $N_{h}$ and $N_{o}$ denote the dimension of the input, hidden layer and output (target) respectively. The number of training samples is N and the learning rate is denoted by $η = \frac{1}{τ}$ .

H.1. Zero-balanced weight initialisation

The initial network weights are zero-balanced 2.3 when they satisfy

W_{1} (0) W_{1} {(0)}^{T} = W_{2} {(0)}^{T} W_{2} (0) .

In practice, we use algorithm 1 to initialise the network weights, where α is a scaling factor which is used to control the variance of the weights, i.e. to vary between small and large weight initialisations.

Algorithm 1. Zero-balanced weight initialisation.
Require: $N_{i}, N_{h}, N_{o}, σ$
W
W
$U, S, V \leftarrow S V D (W_{2} W_{1})$
$S \leftarrow \sqrt{S}$
$R \sim N (μ = 0, σ = 1) \in R^{N_{h} \times N_{h}}$
$R,_,_\leftarrow S V D (R)$
if $N_{i} \neq N_{o}$ then
$N_{s} \leftarrow N_{i} i f N_{i} < N_{o} e l s e N_{o}$
$S_{1} \leftarrow [\begin{matrix} S \\ 0_{N_{h} - N_{s} \times N_{s}} \end{matrix}]$
$S_{2} \leftarrow [\begin{matrix} S & 0_{N_{s} \times N_{h} - N_{s}} \end{matrix}]$
$W_{1} \leftarrow R S_{1} V^{T}$
$W_{2} \leftarrow U S_{2} R^{T}$
else
$W_{1} \leftarrow R S V^{T}$
$W_{2} \leftarrow U S R^{T}$
end if
return $W_{1} W_{2}$

Open in a new tab

H.2. Tasks

In the following, we describe the different tasks that are used throughout the simulation studies.

H.2.1. Random regression task.

In a random regression task the inputs $X \in R^{N_{i}, N}$ are sampled from a random normal distribution $X \sim N (μ = 0, σ = 1)$ . The input data X is then whitened, such that $\frac{1}{N} X X^{T} = I$ . The target values $Y \in R^{N_{o}, N}$ are also sampled from a random normal distribution, however, with variance adjusted to the number of output nodes $Y \sim N (μ = 0, α = \frac{1}{\sqrt{N_{o}}})$ . Thus, network inputs and target values are uncorrelated Gaussian noise and therefore, a linear solution does not always exist.

H.2.2. Teacher-student task.

In order to guarantee that a linear solution exists, we use the teacher-student setup. First, inputs X are sampled as in the random regression task. Then, target values Y are generated by sampling a pair of random zero-balanced weights $W_{1} \in R^{N_{h} \times N_{i}}$ and $W_{2} \in R^{N_{o} \times N_{h}}$ and then calculating $Y = W_{2} W_{1} X$ . Like this, it is ensured that a linear solution exists. The variance of the output is varied by changing the variation within the zero-balanced weights σ.

H.2.3. Semantic hierarchy.

Input items in the semantic hierarchy task are encoded as one-hot vectors, i.e. $X = I$ . The corresponding target vectors y_i encoded the position in the hierarchical tree. Where a 1 encoded being a left child of a node, a −1 encoded being a right child of a node and a 0 encoded that the item is not a child of that node. For example, the blue fish is a blue fish, it is a left child of the root node, a left child of the animal node, not part of the plant branch, a right child of the fish node, and not part of the bird, algae or flower branch, leading to the label $[1, 1, 1, 0, - 1, 0, 0, 0]$ . The labels for all objects in the semantic tree as depicted in figure 3(A) is then

\begin{aligned} Y = [\begin{matrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & - 1 & - 1 & - 1 & - 1 \\ 1 & 1 & - 1 & - 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 & - 1 & - 1 \\ 1 & - 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & - 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & - 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & - 1 \end{matrix}] . \end{aligned}

The singular value decomposition for the corresponding correlation matrix ${\tilde{Σ}}^{y x}$ are not unique. The first two, the third and the fourth and the last four singular values are identical. In order to match the numerical and analytical solution, this permutation invariance is removed by adding a small constant perturbation to each column $y_{i}, i \in 1, \dots, N$ of the labels

y_{i} = y_{i} * (1 + \frac{0.1}{i}),

leading to almost but not exactly identical singular values.

H.2.4. Colour hierarchy.

Following the same procedure as described for the semantic hierarchy, the labels for the colour hierarchy as depicted in figure 6(C) are then

\begin{aligned} Y = [\begin{matrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ - 1 & 1 & 1 & - 1 & 1 & 1 & - 1 & - 1 \\ 0 & - 1 & 1 & 0 & - 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & - 1 & 0 & 0 & - 1 & 1 \\ 0 & 0 & 1 & 0 & 0 & - 1 & 0 & 0 \\ 0 & - 1 & 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & - 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & - 1 & 0 \end{matrix}] . \end{aligned}

H.3. Figure 1

Figure 1 panels (B)–(D) show three simulations from varying initial weights on the same teacher-student task. The task was created with σ = 0.35. Farther, $N_{i} = 5$ , $N_{h} = 10$ , $N_{o} = 2$ and N = 10. The learning rate was η = 0.1 and the initial network weights were sampled with σ = 0.01, σ = 0.25 and σ = 0.25 in panels (B), (C) and (D) respectively.

H.4. Figure 2

Figure 2 panels (A) and (B) show a simulation on the same teacher-student task (σ = 0.25), once from small initial weights (σ = 0.01) and once from large initial weights (σ = 0.15). Dimensions were $N_{i} = 4$ , $N_{h} = 5$ , $N_{o} = 3$ and N = 10 and the learning rate was η = 0.05. Panel (C) was generated by running 50 simulations, each with a different initial random seed. For each of the simulations, dimensions were sampled randomly, such that $N_{i} \in [2, 50]$ , $N_{o} \in [2, 50]$ , $N_{h} = [min (N_{i}, N_{o}), 50]$ and $N \in [2 max (N_{i}, N_{h}, N_{o}), 3 max (N_{i}, N_{h}, N_{o})]$ . Then, a random regression task was generated. Subsequently, a linear network was initialised with $σ \sim U [\frac{0.01}{\sqrt{max (N_{i}, N_{o}, N_{h})}}, \frac{0.5}{\sqrt{max (N_{i}, N_{o}, N_{h})}}]$ . The network was then trained until convergence on the same task from the same initial weights for seven different learning rates $η \in {0.05, 0.0232, 0.0107, 0.005, 0.0023, 0.0011, 0.0005}$ .

H.5. Figure 3

Panels (C)–(F) in figure 3 were generated by training a linear network with $N_{i} = 8$ , $N_{h} = 14$ , $N_{o} = 8$ on the N = 8 items of the semantic hierarchy task. The learning rate was η = 0.05 and the initial weights in panels (C), (D) and (E) were sampled from a normal distribution with σ = 0.0001 and σ = 0.42 and zero-balanced weights with σ = 0.44 respectively.

H.6. Figure 4

Figure 4 panel (A) was generated by training a linear network with $N_{i} = 5$ , $N_{h} = 10$ , $N_{o} = 5$ on the target Y as shown in equation (198) (equal diagonal). The network was initialised with σ = 0.1. The learning rate was η = 0.01.

Figure 4 panels (D)–(F) was generated by training a linear network with $N_{i} = 2$ , $N_{h} = 10$ , $N_{o} = 2$ on the target Y as shown in figure 4(C) and input $X = b f i$ . The network was initialised with small σ = 0.000 01, intermediate σ = 0.3 and large σ = 2 synaptic weights. The learning rate was η = 0.0001.

H.7. Figure 5

Figure 5 panel (A) was generated by training a linear network with $N_{i} = 5$ , $N_{h} = 10$ , $N_{o} = 6$ subsequently on four different random regression tasks with N = 25. The learning rate was η = 0.05 and the initial weights were small (σ = 0.0001).

Panels (B) and (C) were generated by running 50 simulations on two subsequent random regression tasks, each with a different initial random seed. The simulation was repeated three times, the first time with a linear, the second time with a tanh and the last time with a ReLU activation function in the hidden layer. Dimension were randomly sampled such that $N_{i} \in [2, 30]$ , $N_{o} \in [2, 30]$ , $N_{h} = [min (N_{i}, N_{o}), 30]$ and N = 100. The standard deviation of the initial weight was chosen such that $σ = \frac{0.5}{\sqrt{0.5 (N_{i} + N_{h})}}$ . The learning rate was η = 0.075.

For panel (D) and (E) the same simulation was repeated for three times, the first time with a linear, the second time with a tanh and the last time with a ReLU activation function. Each time, five random regression tasks with dimensions $N_{i} = 15$ , $N_{h} = 18$ , $N_{o} = 21$ and N = 50 were generated. Then a network with initial weight scale α = 0.025 was sequentially trained with learning rate η = 0.1 on the five random regression tasks.

H.8. Figure 6

Figure 6 panel (A) was generated by training a linear network with $N_{i} = 4$ , $N_{h} = 6$ , $N_{o} = 4$ on a reversal learning task (see section G.1), which was derived from a random regression task. The learning rate was η = 0.05 and initial weights had a standard deviation of σ = 0.25. Panel (B) was generated by training a shallow linear network (see section G.2) on the same reversal learning task, with identical hyperparameters as in panel (A).

For the top and bottom rows of panels (E) and (F) a linear network with $N_{i} = 8$ , $N_{h} = 14$ , $N_{o} = 8$ was trained on the semantic hierarchy task, followed by training the network on the adapted semantic hierarchy as depicted in figure 6(C) top, which is a reversal learning task and the colour hierarchy respectively. The learning rate was η = 0.05 and σ was set to 0.001 and 0.35 respectively.

Footnotes

⁶

https://github.com/saxelab/deep-linear-networks-with-prior-knowledge.

References

Arora R, et al. Princeton University; 2020. Theory of deep learning (in preparation) [Google Scholar]
Arora S, Cohen N, Golowich N, Wei H. A convergence analysis of gradient descent for deep linear neural networks. 2018b (arXiv: 1810.02281)
Arora S, Cohen N, Hazan E. On the optimization of deep networks: implicit acceleration by overparameterization. Int. Conf. on Machine Learning; PMLR; 2018a. pp. pp 244–53. [Google Scholar]
Arora S, Cohen N, Wei H, Luo Y. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems; 2019a. [Google Scholar]
Arora S, Du S S, Wei H, Zhiyuan Li, Salakhutdinov R R, Wang R. On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems; 2019b. p. vol 32. [Google Scholar]
Asanuma H, Takagi S, Nagano Y, Yoshida Y, Igarashi Y, Okada M. Statistical mechanical analysis of catastrophic forgetting in continual learning with teacher and student networks. J. Phys. Soc. Japan. 2021;90:104001. doi: 10.7566/JPSJ.90.104001. [DOI] [Google Scholar]
Atanasov A, Bordelon B, Pehlevan C. Neural networks as kernel learners: the silent alignment effect. Int. Conf. on Learning Representations.2022. [Google Scholar]
Bahri Y, Kadmon J, Pennington J, Schoenholz S S, Sohl-Dickstein J, Ganguli S. Statistical mechanics of deep learning. Annu. Rev. Condens. Matter Phys. 2020;11:501–28. doi: 10.1146/annurev-conmatphys-031119-050745. [DOI] [Google Scholar]
Baldi P, Hornik K. Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 1989;2:53–58. doi: 10.1016/0893-6080(89)90014-2. [DOI] [Google Scholar]
Bengio Y, Louradour Jôme, Collobert R, Weston J. Curriculum learning. Proc. 26th Annual Int. Conf. on Machine Learning; 2009. pp. pp 41–48. [Google Scholar]
Biehl M, Schwarze H. Learning by on-line gradient descent. J. Phys. A: Math. Gen. 1995;28:643. doi: 10.1088/0305-4470/28/3/018. [DOI] [Google Scholar]
Carey S E. Conceptual Change In Childhood. MIT Press; 1985. [Google Scholar]
Carleo G, Cirac I, Cranmer K, Daudet L, Schuld M, Tishby N, Vogt-Maranto L, Zdeborová L. Machine learning and the physical sciences. Rev. Mod. Phys. 2019;91:045002. doi: 10.1103/RevModPhys.91.045002. [DOI] [Google Scholar]
Chizat L, Oyallon E, Bach F. On lazy training in differentiable programming. Advances in Neural Information Processing Systems; 2019. [Google Scholar]
Doan T, Abbana Bennani M, Mazoure B, Rabusseau G, Alquier P. A theoretical analysis of catastrophic forgetting through the ntk overlap matrix. Int. Conf. on Artificial Intelligence and Statistics; PMLR; 2021. pp. pp 1072–80. [Google Scholar]
Erdeniz B, Bedin Atalay N. Simulating probability learning and probabilistic reversal learning using the attention-gated reinforcement learning (agrel) model. 2010 Int. Joint Conf. on Neural Networks (IJCNN); IEEE; 2010. pp. pp 1–6. [Google Scholar]
Flesch T, Balaguer J, Dekker R, Nili H, Summerfield C. Comparing continual task learning in minds and machines. Proc. Natl Acad. Sci. 2018;115:E10313–22. doi: 10.1073/pnas.1800755115. [DOI] [PMC free article] [PubMed] [Google Scholar]
Flesch T, Juechems K, Dumbalska T, Saxe A, Summerfield C. Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron. 2022;110:4212–19. doi: 10.1016/j.neuron.2022.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
French R M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 1999;3:128–35. doi: 10.1016/S1364-6613(99)01294-2. [DOI] [PubMed] [Google Scholar]
Fukumizu K. Effect of batch learning in multilayer neural networks; Int. Conf. Neural Information Processing (ICONIP); 1998. pp. pp 67–70. [Google Scholar]
Gerace F, Saglietti L, Sarao Mannelli S, Saxe A, Zdeborová L. Probing transfer learning with a model of synthetic correlated datasets. Mach. Learn.: Sci. Technol. 2022;3:015030. doi: 10.1088/2632-2153/ac4f3f. [DOI] [Google Scholar]
Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. Proc. Thirteenth 13th Int. Conf. on Artificial Intelligence and Statistics (JMLR Workshop and Conf. Proc.); 2010. pp. pp 249–56. [Google Scholar]
Goldt S, Advani M, Saxe A M, Krzakala F, Zdeborová L. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. Advances in Neural Information Processing Systems; 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gunasekar S, Lee J D, Soudry D, Srebro N. Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems; 2018. p. p 31. [Google Scholar]
Huh D. Curvature-corrected learning dynamics in deep neural networks. Int. Conf. on Machine Learning; PMLR; 2020. pp. pp 4552–60. [Google Scholar]
Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems; 2018. [Google Scholar]
Javed K, White M. Meta-learning representations for continual learning. Advances in Neural Information Processing Systems; 2019. pp. pp 1820–30. [Google Scholar]
Kaiming H, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proc. IEEE Int. Conf. on Computer Vision; 2015. pp. pp 1026–34. [Google Scholar]
Kirkpatrick J, et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl Acad. Sci. 2017;114:3521–6. doi: 10.1073/pnas.1611835114. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kriegeskorte N, Mur M, Bandettini P A. Representational similarity analysis-connecting the branches of systems neuroscience. Front. Syst. Neurosci. 2008;2:4. doi: 10.3389/neuro.06.004.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lampinen A K, Ganguli S. An analytic theory of generalization dynamics and transfer learning in deep linear networks. 2018 (arXiv: 1809.10374)
Laurent T, Brecht J. Deep linear networks with arbitrary loss: all local minima are global. Int. Conf. on Machine Learning; PMLR; 2018. pp. pp 2902–7. [Google Scholar]
Lee J, Xiao L, Schoenholz S, Bahri Y, Novak R, Sohl-Dickstein J, Pennington J. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in Neural Information Processing Systems; 2019. [Google Scholar]
Lee S, Sarao Mannelli S, Clopath C, Goldt S, Saxe A. Maslow’s hammer for catastrophic forgetting: node re-use vs node activation. 2022 (arXiv: 2205.09029)
Lee S, Sebastian G, Saxe A. Continual learning in the teacher-student setup: Impact of task similarity. Int. Conf. on Machine Learning; PMLR; 2021. pp. pp 6109–19. [Google Scholar]
McClelland J L. Incorporating rapid neocortical learning of new schema-consistent information into complementary learning systems theory. J. Exp. Psychol. Gen. 2013;142:1190. doi: 10.1037/a0033812. [DOI] [PubMed] [Google Scholar]
McClelland J L, McNaughton B L, O’Reilly R C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 1995;102:419. doi: 10.1037/0033-295X.102.3.419. [DOI] [PubMed] [Google Scholar]
McCloskey M, Cohen N J. Psychology of Learning and Motivation. vol 24. Elsevier; 1989. Catastrophic interference in connectionist networks: The sequential learning problem; pp. pp 109–65. [Google Scholar]
Mei S, Montanari A, Nguyen P-M. A mean field view of the landscape of two-layer neural networks. Proc. Natl Acad. Sci. 2018;115:E7665–71. doi: 10.1073/pnas.1806579115. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mishkin D, Matas J. All you need is a good init. 2015 (arXiv: 1511.06422)
Murphy G. The big Book of Concepts. MIT Press; 2004. [Google Scholar]
Parisi G I, Kemker R, Part J L, Kanan C, Wermter S. Continual lifelong learning with neural networks: a review. Neural Netw. 2019;113:54–71. doi: 10.1016/j.neunet.2019.01.012. [DOI] [PubMed] [Google Scholar]
Pennington J, Schoenholz S, Ganguli S. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in Neural Information Processing Systems; 2017. [Google Scholar]
Poggio T, Liao Q, Miranda B, Banburski A, Boix X, Hidary J. Theory iiib: generalization in deep networks. 2018 (arXiv: 1806.11379)
Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: understanding transfer learning for medical imaging. Advances in Neural Information Processing Systems; 2019. p. p 32. [Google Scholar]
Ratcliff R. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychol. Rev. 1990;97:285. doi: 10.1037/0033-295X.97.2.285. [DOI] [PubMed] [Google Scholar]
Rotskoff G, Vanden-Eijnden E. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. Advances in Neural Information Processing Systems; 2018. [Google Scholar]
Saad D, Solla S A. Exact solution for on-line learning in multilayer neural networks. Phys. Rev. Lett. 1995;74:4337. doi: 10.1103/PhysRevLett.74.4337. [DOI] [PubMed] [Google Scholar]
Saxe A M, McClelland J L, Ganguli S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. 2nd Int. Conf. on Learning Representations, ICLR 2014 (Conf. Track Proc.) (Banff, AB, Canada, 14–16 April 2014).2014. [Google Scholar]
Saxe A M, McClelland J L, Ganguli S. A mathematical theory of semantic development in deep neural networks. Proc. Natl Acad. Sci. 2019;116:11537–46. doi: 10.1073/pnas.1820226116. [DOI] [PMC free article] [PubMed] [Google Scholar]
Shachaf G, Brutzkus A, Globerson A. A theoretical analysis of fine-tuning with linear teachers. Advances in Neural Information Processing Systems; 2021. [Google Scholar]
Simon D, Wei H. Width provably matters in optimization for deep linear neural networks. Int. Conf. on Machine Learning; PMLR; 2019. pp. pp 1655–64. [Google Scholar]
Sirignano J, Spiliopoulos K. Mean field analysis of neural networks: A central limit theorem. Stoch. Process. Appl. 2020;130:1820–52. doi: 10.1016/j.spa.2019.06.003. [DOI] [Google Scholar]
Tarmoun S, Franca G, Haeffele B D, Vidal R. Understanding the dynamics of gradient flow in overparameterized linear models. Int. Conf. on Machine Learning; PMLR; 2021. pp. pp 10153–61. [Google Scholar]
Taylor M E, Stone P. Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 2009;10:1633–85. [Google Scholar]
Thrun S, Pratt L. Learning to Learn. Springer Science & Business Media; 2012. [Google Scholar]
Tripuraneni N, Jordan M, Jin C. On the theory of transfer learning: The importance of task diversity. Advances in Neural Information Processing Systems; 2020. pp. pp 7852–62. [Google Scholar]
Xiao L, Bahri Y, Sohl-Dickstein J, Schoenholz S, Pennington J. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. Int. Conf. on Machine Learning; PMLR; 2018. pp. pp 5393–402. [Google Scholar]
Yan W-Y, Helmke U, Moore J B. Global analysis of oja’s flow for neural networks. IEEE Trans. Neural Netw. 1994;5:674–83. doi: 10.1109/72.317720. [DOI] [PubMed] [Google Scholar]
Zenke F, Poole B, Ganguli S. Continual learning through synaptic intelligence. Int. Conf. on Machine Learning; PMLR; 2017. pp. pp 3987–95. [PMC free article] [PubMed] [Google Scholar]
Ziwei J, Telgarsky M. Gradient descent aligns the layers of deep linear networks. 2018 (arXiv: 1810.02032)

[jstatad01b8bib1] Arora R, et al. Princeton University; 2020. Theory of deep learning (in preparation) [Google Scholar]

[jstatad01b8bib2] Arora S, Cohen N, Golowich N, Wei H. A convergence analysis of gradient descent for deep linear neural networks. 2018b (arXiv: 1810.02281)

[jstatad01b8bib3] Arora S, Cohen N, Hazan E. On the optimization of deep networks: implicit acceleration by overparameterization. Int. Conf. on Machine Learning; PMLR; 2018a. pp. pp 244–53. [Google Scholar]

[jstatad01b8bib4] Arora S, Cohen N, Wei H, Luo Y. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems; 2019a. [Google Scholar]

[jstatad01b8bib5] Arora S, Du S S, Wei H, Zhiyuan Li, Salakhutdinov R R, Wang R. On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems; 2019b. p. vol 32. [Google Scholar]

[jstatad01b8bib6] Asanuma H, Takagi S, Nagano Y, Yoshida Y, Igarashi Y, Okada M. Statistical mechanical analysis of catastrophic forgetting in continual learning with teacher and student networks. J. Phys. Soc. Japan. 2021;90:104001. doi: 10.7566/JPSJ.90.104001. [DOI] [Google Scholar]

[jstatad01b8bib7] Atanasov A, Bordelon B, Pehlevan C. Neural networks as kernel learners: the silent alignment effect. Int. Conf. on Learning Representations.2022. [Google Scholar]

[jstatad01b8bib8] Bahri Y, Kadmon J, Pennington J, Schoenholz S S, Sohl-Dickstein J, Ganguli S. Statistical mechanics of deep learning. Annu. Rev. Condens. Matter Phys. 2020;11:501–28. doi: 10.1146/annurev-conmatphys-031119-050745. [DOI] [Google Scholar]

[jstatad01b8bib9] Baldi P, Hornik K. Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 1989;2:53–58. doi: 10.1016/0893-6080(89)90014-2. [DOI] [Google Scholar]

[jstatad01b8bib10] Bengio Y, Louradour Jôme, Collobert R, Weston J. Curriculum learning. Proc. 26th Annual Int. Conf. on Machine Learning; 2009. pp. pp 41–48. [Google Scholar]

[jstatad01b8bib11] Biehl M, Schwarze H. Learning by on-line gradient descent. J. Phys. A: Math. Gen. 1995;28:643. doi: 10.1088/0305-4470/28/3/018. [DOI] [Google Scholar]

[jstatad01b8bib12] Carey S E. Conceptual Change In Childhood. MIT Press; 1985. [Google Scholar]

[jstatad01b8bib13] Carleo G, Cirac I, Cranmer K, Daudet L, Schuld M, Tishby N, Vogt-Maranto L, Zdeborová L. Machine learning and the physical sciences. Rev. Mod. Phys. 2019;91:045002. doi: 10.1103/RevModPhys.91.045002. [DOI] [Google Scholar]

[jstatad01b8bib14] Chizat L, Oyallon E, Bach F. On lazy training in differentiable programming. Advances in Neural Information Processing Systems; 2019. [Google Scholar]

[jstatad01b8bib15] Doan T, Abbana Bennani M, Mazoure B, Rabusseau G, Alquier P. A theoretical analysis of catastrophic forgetting through the ntk overlap matrix. Int. Conf. on Artificial Intelligence and Statistics; PMLR; 2021. pp. pp 1072–80. [Google Scholar]

[jstatad01b8bib16] Erdeniz B, Bedin Atalay N. Simulating probability learning and probabilistic reversal learning using the attention-gated reinforcement learning (agrel) model. 2010 Int. Joint Conf. on Neural Networks (IJCNN); IEEE; 2010. pp. pp 1–6. [Google Scholar]

[jstatad01b8bib17] Flesch T, Balaguer J, Dekker R, Nili H, Summerfield C. Comparing continual task learning in minds and machines. Proc. Natl Acad. Sci. 2018;115:E10313–22. doi: 10.1073/pnas.1800755115. [DOI] [PMC free article] [PubMed] [Google Scholar]

[jstatad01b8bib18] Flesch T, Juechems K, Dumbalska T, Saxe A, Summerfield C. Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron. 2022;110:4212–19. doi: 10.1016/j.neuron.2022.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]

[jstatad01b8bib19] French R M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 1999;3:128–35. doi: 10.1016/S1364-6613(99)01294-2. [DOI] [PubMed] [Google Scholar]

[jstatad01b8bib20] Fukumizu K. Effect of batch learning in multilayer neural networks; Int. Conf. Neural Information Processing (ICONIP); 1998. pp. pp 67–70. [Google Scholar]

[jstatad01b8bib21] Gerace F, Saglietti L, Sarao Mannelli S, Saxe A, Zdeborová L. Probing transfer learning with a model of synthetic correlated datasets. Mach. Learn.: Sci. Technol. 2022;3:015030. doi: 10.1088/2632-2153/ac4f3f. [DOI] [Google Scholar]

[jstatad01b8bib22] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. Proc. Thirteenth 13th Int. Conf. on Artificial Intelligence and Statistics (JMLR Workshop and Conf. Proc.); 2010. pp. pp 249–56. [Google Scholar]

[jstatad01b8bib23] Goldt S, Advani M, Saxe A M, Krzakala F, Zdeborová L. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. Advances in Neural Information Processing Systems; 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[jstatad01b8bib24] Gunasekar S, Lee J D, Soudry D, Srebro N. Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems; 2018. p. p 31. [Google Scholar]

[jstatad01b8bib25] Huh D. Curvature-corrected learning dynamics in deep neural networks. Int. Conf. on Machine Learning; PMLR; 2020. pp. pp 4552–60. [Google Scholar]

[jstatad01b8bib26] Jacot A, Gabriel F, Hongler C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems; 2018. [Google Scholar]

[jstatad01b8bib27] Javed K, White M. Meta-learning representations for continual learning. Advances in Neural Information Processing Systems; 2019. pp. pp 1820–30. [Google Scholar]

[jstatad01b8bib28] Kaiming H, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proc. IEEE Int. Conf. on Computer Vision; 2015. pp. pp 1026–34. [Google Scholar]

[jstatad01b8bib29] Kirkpatrick J, et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl Acad. Sci. 2017;114:3521–6. doi: 10.1073/pnas.1611835114. [DOI] [PMC free article] [PubMed] [Google Scholar]

[jstatad01b8bib30] Kriegeskorte N, Mur M, Bandettini P A. Representational similarity analysis-connecting the branches of systems neuroscience. Front. Syst. Neurosci. 2008;2:4. doi: 10.3389/neuro.06.004.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[jstatad01b8bib31] Lampinen A K, Ganguli S. An analytic theory of generalization dynamics and transfer learning in deep linear networks. 2018 (arXiv: 1809.10374)

[jstatad01b8bib32] Laurent T, Brecht J. Deep linear networks with arbitrary loss: all local minima are global. Int. Conf. on Machine Learning; PMLR; 2018. pp. pp 2902–7. [Google Scholar]

[jstatad01b8bib33] Lee J, Xiao L, Schoenholz S, Bahri Y, Novak R, Sohl-Dickstein J, Pennington J. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in Neural Information Processing Systems; 2019. [Google Scholar]

[jstatad01b8bib34] Lee S, Sarao Mannelli S, Clopath C, Goldt S, Saxe A. Maslow’s hammer for catastrophic forgetting: node re-use vs node activation. 2022 (arXiv: 2205.09029)

[jstatad01b8bib35] Lee S, Sebastian G, Saxe A. Continual learning in the teacher-student setup: Impact of task similarity. Int. Conf. on Machine Learning; PMLR; 2021. pp. pp 6109–19. [Google Scholar]

[jstatad01b8bib36] McClelland J L. Incorporating rapid neocortical learning of new schema-consistent information into complementary learning systems theory. J. Exp. Psychol. Gen. 2013;142:1190. doi: 10.1037/a0033812. [DOI] [PubMed] [Google Scholar]

[jstatad01b8bib37] McClelland J L, McNaughton B L, O’Reilly R C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 1995;102:419. doi: 10.1037/0033-295X.102.3.419. [DOI] [PubMed] [Google Scholar]

[jstatad01b8bib38] McCloskey M, Cohen N J. Psychology of Learning and Motivation. vol 24. Elsevier; 1989. Catastrophic interference in connectionist networks: The sequential learning problem; pp. pp 109–65. [Google Scholar]

[jstatad01b8bib39] Mei S, Montanari A, Nguyen P-M. A mean field view of the landscape of two-layer neural networks. Proc. Natl Acad. Sci. 2018;115:E7665–71. doi: 10.1073/pnas.1806579115. [DOI] [PMC free article] [PubMed] [Google Scholar]

[jstatad01b8bib40] Mishkin D, Matas J. All you need is a good init. 2015 (arXiv: 1511.06422)

[jstatad01b8bib41] Murphy G. The big Book of Concepts. MIT Press; 2004. [Google Scholar]

[jstatad01b8bib42] Parisi G I, Kemker R, Part J L, Kanan C, Wermter S. Continual lifelong learning with neural networks: a review. Neural Netw. 2019;113:54–71. doi: 10.1016/j.neunet.2019.01.012. [DOI] [PubMed] [Google Scholar]

[jstatad01b8bib43] Pennington J, Schoenholz S, Ganguli S. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in Neural Information Processing Systems; 2017. [Google Scholar]

[jstatad01b8bib44] Poggio T, Liao Q, Miranda B, Banburski A, Boix X, Hidary J. Theory iiib: generalization in deep networks. 2018 (arXiv: 1806.11379)

[jstatad01b8bib45] Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: understanding transfer learning for medical imaging. Advances in Neural Information Processing Systems; 2019. p. p 32. [Google Scholar]

[jstatad01b8bib46] Ratcliff R. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychol. Rev. 1990;97:285. doi: 10.1037/0033-295X.97.2.285. [DOI] [PubMed] [Google Scholar]

[jstatad01b8bib47] Rotskoff G, Vanden-Eijnden E. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. Advances in Neural Information Processing Systems; 2018. [Google Scholar]

[jstatad01b8bib48] Saad D, Solla S A. Exact solution for on-line learning in multilayer neural networks. Phys. Rev. Lett. 1995;74:4337. doi: 10.1103/PhysRevLett.74.4337. [DOI] [PubMed] [Google Scholar]

[jstatad01b8bib49] Saxe A M, McClelland J L, Ganguli S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. 2nd Int. Conf. on Learning Representations, ICLR 2014 (Conf. Track Proc.) (Banff, AB, Canada, 14–16 April 2014).2014. [Google Scholar]

[jstatad01b8bib50] Saxe A M, McClelland J L, Ganguli S. A mathematical theory of semantic development in deep neural networks. Proc. Natl Acad. Sci. 2019;116:11537–46. doi: 10.1073/pnas.1820226116. [DOI] [PMC free article] [PubMed] [Google Scholar]

[jstatad01b8bib51] Shachaf G, Brutzkus A, Globerson A. A theoretical analysis of fine-tuning with linear teachers. Advances in Neural Information Processing Systems; 2021. [Google Scholar]

[jstatad01b8bib52] Simon D, Wei H. Width provably matters in optimization for deep linear neural networks. Int. Conf. on Machine Learning; PMLR; 2019. pp. pp 1655–64. [Google Scholar]

[jstatad01b8bib53] Sirignano J, Spiliopoulos K. Mean field analysis of neural networks: A central limit theorem. Stoch. Process. Appl. 2020;130:1820–52. doi: 10.1016/j.spa.2019.06.003. [DOI] [Google Scholar]

[jstatad01b8bib54] Tarmoun S, Franca G, Haeffele B D, Vidal R. Understanding the dynamics of gradient flow in overparameterized linear models. Int. Conf. on Machine Learning; PMLR; 2021. pp. pp 10153–61. [Google Scholar]

[jstatad01b8bib55] Taylor M E, Stone P. Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 2009;10:1633–85. [Google Scholar]

[jstatad01b8bib56] Thrun S, Pratt L. Learning to Learn. Springer Science & Business Media; 2012. [Google Scholar]

[jstatad01b8bib57] Tripuraneni N, Jordan M, Jin C. On the theory of transfer learning: The importance of task diversity. Advances in Neural Information Processing Systems; 2020. pp. pp 7852–62. [Google Scholar]

[jstatad01b8bib58] Xiao L, Bahri Y, Sohl-Dickstein J, Schoenholz S, Pennington J. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. Int. Conf. on Machine Learning; PMLR; 2018. pp. pp 5393–402. [Google Scholar]

[jstatad01b8bib59] Yan W-Y, Helmke U, Moore J B. Global analysis of oja’s flow for neural networks. IEEE Trans. Neural Netw. 1994;5:674–83. doi: 10.1109/72.317720. [DOI] [PubMed] [Google Scholar]

[jstatad01b8bib60] Zenke F, Poole B, Ganguli S. Continual learning through synaptic intelligence. Int. Conf. on Machine Learning; PMLR; 2017. pp. pp 3987–95. [PMC free article] [PubMed] [Google Scholar]

[jstatad01b8bib61] Ziwei J, Telgarsky M. Gradient descent aligns the layers of deep linear networks. 2018 (arXiv: 1810.02032)

PERMALINK

Exact learning dynamics of deep linear networks with prior knowledge *

Clémentine C J Dominé

Lukas Braun

James E Fitzgerald

Andrew M Saxe

Abstract

1. Introduction

Figure 1.

1.1. Contributions

1.2. Related work

2. Preliminaries and setting

Assumption 2.1.

Assumption 2.2.

Assumption 2.3.

Assumption 2.4.

3. Exact learning dynamics with prior knowledge

Assumption 3.1.

Theorem 3.1.

Figure 2.

Theorem 3.2.

4. Rich and lazy learning regimes and generalisation

Figure 3.

5. Decoupling dynamics

Theorem 5.1.

Figure 4.

6. Applications

Figure 5.

Figure 6.

7. Discussion

Acknowledgment

Appendix A. Fukumizu approach

Appendix B. Network’s internal representations

B.1. Representational similarity analysis

B.2. Finite-width neural tangent kernel

Appendix C. Exact learning dynamics with prior knowledge

C.1. Proof of theorem 3.1

C.1.1. Unequal input-output dimension.

C.1.2. Equal input-output dimension.

C.2. Derivation of the exact learning dynamics

C.2.1. Inverse and matrix exponential of F.

C.3. Proof of theorem 3.2: Limiting behaviour

C.4. Dynamics of Q(t)

Figure 7.

Appendix D. Rich and lazy learning regimes and generalisation

Figure 8.

Appendix E. Decoupling dynamics

E.1. Proof for theorem 5.1

E.2. Solution for 2×2 dynamics

E.3. Off-Diagonal decoupling dynamics

Figure 9.

E.4. On-diagonal dynamics and the effect of initialisation variance

Appendix F. Continual learning

Appendix G. Revising structured knowledge

G.1. Reversal learning dynamics

G.2. Exact learning dynamics in shallow networks

Appendix H. Simulations

H.1. Zero-balanced weight initialisation

H.2. Tasks

H.2.1. Random regression task.

H.2.2. Teacher-student task.

H.2.3. Semantic hierarchy.

H.2.4. Colour hierarchy.

H.3. Figure 1

H.4. Figure 2

H.5. Figure 3

H.6. Figure 4

H.7. Figure 5

H.8. Figure 6

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Exact learning dynamics of deep linear networks with prior knowledge^{^*}

C.4. Dynamics of $Q (t)$

E.2. Solution for $2 \times 2$ dynamics