PLOS Computational Biology. 2008 Aug 22;4(8):e1000143. doi: 10.1371/journal.pcbi.1000143

Falling towards Forgetfulness: Synaptic Decay Prevents Spontaneous Recovery of Memory

James V Stone, Peter E Jupp
Editor: Karl J Friston
PMCID: PMC2516185  PMID: 18725945

Abstract

Long after a new language has been learned and forgotten, relearning a few words seems to trigger the recall of other words. This “free-lunch learning” (FLL) effect has been demonstrated both in humans and in neural network models. Specifically, previous work proved that linear networks that learn a set of associations, then partially forget them all, and finally relearn some of the associations, show improved performance on the remaining (i.e., nonrelearned) associations. Here, we prove that relearning forgotten associations decreases performance on nonrelearned associations; an effect we call negative free-lunch learning. The difference between free-lunch learning and the negative free-lunch learning presented here is due to the particular method used to induce forgetting. Specifically, if forgetting is induced by isotropic drifting of weight vectors (i.e., by adding isotropic noise), then free-lunch learning is observed. However, as proved here, if forgetting is induced by weight values that simply decay or fall towards zero, then negative free-lunch learning is observed. From a biological perspective, and assuming that nervous systems are analogous to the networks used here, this suggests that evolution may have selected physiological mechanisms that involve forgetting using a form of synaptic drift rather than synaptic decay, because synaptic drift, but not synaptic decay, yields free-lunch learning.

Author Summary

If you learn a skill, then partially forget it, does relearning part of that skill induce recovery of other parts of the skill? More generally, if you learn a set of associations, then partially forget them, does relearning a subset induce recovery of the remaining associations? In previous work, in which participants learned the layout of a scrambled computer keyboard, the answer to this question appeared to be “yes.” More recently, we modeled this “free-lunch learning” effect using artificial neural networks, in which the synaptic strength between each pair of model neurons is a connection weight. We proved that if forgetting is induced by allowing each weight value to drift randomly, then free-lunch learning is almost inevitable. However, if, after learning a set of associations, forgetting is induced by allowing each connection weight to decay or fall toward zero, then relearning a subset of associations decreases performance on the remaining associations. This suggests that evolution may have selected physiological mechanisms that involve forgetting using a form of synaptic drift rather than synaptic decay, because synaptic drift yields free-lunch learning, whereas decay does not.

Introduction

The idea that structural changes underpin the formation of new memories can be traced to the 19th century [1]. More recently, Hebb proposed that “When an axon of cell A is near enough to excite B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased” [2]. It is now widely accepted that learning involves some form of Hebbian adaptation, and a growing body of evidence suggests that Hebbian adaptation is associated with the long-term potentiation (LTP) observed in neuronal systems [3]. LTP is an increase in synaptic efficacy which occurs in the presence of pre-synaptic and post-synaptic activity, and can be specific to a single synapse. One consequence of Hebbian adaptation is that information regarding a specific association is distributed amongst many synaptic connections, and therefore gives rise to a distributed representation of each association.

In [4], participants learned the layout of letters on a “scrambled” keyboard. After a period of forgetting, participants relearned a subset of letter positions. Crucially, this improved performance on the remaining (i.e., nonrelearned) letter positions. However, whereas relearning some associations shows evidence of FLL in some studies [4]–[6], this is not found in all studies [7]. This discrepancy may arise because the many studies of this general phenomenon use a wide variety of materials and procedures, with some measuring recall and others measuring recognition performance, for example. Within psychology, one relevant effect is known as part-set cueing inhibition.

Part-set cueing inhibition [8] occurs when a subject is exposed to part of a set of previously learned items, which is found to reduce recall of nonrelearned items. However, [9] showed that a learned row of words was better recalled if the cues consisted of a subset of words placed in their learned positions than if cue words were placed in other positions. In this case, part-set cueing seems to improve performance, but only if each “part” appears in the spatial position in which it was originally learned. This position-specificity is consistent with the FLL effect reported using the “scrambled keyboard” procedure in [4] but has no obvious concomitant in network models (e.g., [4],[10],[11]).

If the brain stores information as distributed representations, then each neuron contributes to the storage of many associations. Therefore, relearning some old and partially forgotten associations should affect the integrity of other associations learned at about the same time. As noted above, previous work has shown that relearning some forgotten associations does not disrupt other associations, but partially restores them. This FLL effect has also been demonstrated in neural network models ([10],[12]), where it can accelerate evolution of adaptive behaviors [13]. Crucially, in [12], the proof that relearning some associations partially restores other associations assumes that forgetting is caused by the addition of isotropic noise to connection weights, which could result from the cumulative effect of small random changes in connection weights. In contrast, here we prove that if forgetting is induced by shrinking weights towards zero, so that weights “fall” towards the origin, then relearning some associations disrupts other associations.

The protocol used to examine FLL here is the same as that used in [4] and [12] and is as follows (see Figure 1). First, learn a set of n 1+n 2 associations A = A 1 ∪ A 2 consisting of two subsets A 1 and A 2 of n 1 and n 2 associations, respectively. After all learned associations A have been partially forgotten, measure performance error on subset A 1. Finally, relearn only subset A 2 and then remeasure performance on subset A 1. FLL occurs if relearning subset A 2 improves performance on A 1.

Figure 1. Free-lunch learning protocol.

Two subsets of associations A 1 and A 2 are learned. After partial forgetting (see text), performance error E pre on subset A 1 is measured. Subset A 2 is then relearned to pre-forgetting levels of performance, and performance error E post on subset A 1 is re-measured. If E post<E pre then FLL has occurred, and the amount of FLL is δ = E pre − E post. Redrawn from [12].

In order to preclude a common misunderstanding, we emphasize that, for a network with n connection weights, it is assumed that n ≥ n 1+n 2; that is, the number of connection weights on each output unit is not less than the number n 1+n 2 of learned associations. Using the class of linear network models described below, up to n associations can be learned perfectly (see [12]).

The proofs below refer to a network with one output unit. However, these proofs apply to networks with multiple output units, because the n connections to each output unit can be considered as a distinct network, in which case our results can be applied to the network associated with each output unit.

Definition of Performance Error

Each association consists of an input vector x and a corresponding target value d. For a network with weight vector w, the response to an input vector x is y = w·x. We define the performance error for input vectors x 1,…,x k and desired outputs d 1,…,dk to be

$$E(\mathbf{x}_1,\ldots,\mathbf{x}_k;\mathbf{w},d_1,\ldots,d_k) = \sum_{i=1}^{k} (y_i - d_i)^2 \tag{1}$$

where yi = w·x i is the output response to the input vector x i. By putting X = (x 1,…,x k)T, d = (d 1,…,dk)T and

graphic file with name pcbi.1000143.e002.jpg

we can write Equation 1 succinctly as

$$E(X;\mathbf{w},\mathbf{d}) = \lVert X\mathbf{w} - \mathbf{d} \rVert^2 \tag{2}$$

The two subsets A 1 and A 2 consist of n 1 and n 2 associations, respectively. Let w 0 be the network weight vector after A 1 and A 2 are learned. When A 1 and A 2 are forgotten, the network weight vector changes to w 1, say, and the performance error on A 1 becomes E pre = E(X;w 1,d). Finally, relearning A 2 yields a new weight vector, w 2, say, and the performance error on A 1 is E post = E(X;w 2,d). Free-lunch learning has occurred if performance error on A 1 is less after relearning A 2 than it was before relearning A 2 (i.e., if E post<E pre).
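As a minimal sketch, assuming that the performance error is the plain sum of squared differences between outputs and targets (with no factor of 1/2), the error for a linear unit can be computed as follows. The function name `performance_error` and the toy data are illustrative and do not come from the original study.

```python
import numpy as np

def performance_error(X, w, d):
    """Summed squared error ||X w - d||^2 for a linear unit (assumed form of E)."""
    residual = X @ w - d          # y_i - d_i for every association
    return float(residual @ residual)

# Toy data with arbitrary values: two associations, two weights.
X = np.array([[1.0, 0.0],
              [0.0, 2.0]])        # each row is one input vector
d = np.array([1.0, 4.0])          # desired outputs
w_perfect = np.array([1.0, 2.0])  # satisfies X w = d exactly
w_perturbed = 0.5 * w_perfect     # an arbitrary perturbed weight vector

print(performance_error(X, w_perfect, d))    # 0.0
print(performance_error(X, w_perturbed, d))  # 4.25
```

In the protocol, the same function would be evaluated on subset A 1 twice: once with the post-forgetting weights to give E pre, and once with the post-relearning weights to give E post.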

Given weight vectors w 1 and w 2, a matrix X of input vectors, and a vector d of desired outputs, define

$$\delta(\mathbf{w}_1,\mathbf{w}_2;X,\mathbf{d}) = E(X;\mathbf{w}_1,\mathbf{d}) - E(X;\mathbf{w}_2,\mathbf{d}) \tag{3}$$

which we shall also refer to simply as δ.

In previous work [12], we assumed that the “forgetting vector” v (defined as v = w 1 − w 0) has an isotropic distribution. Here we shall assume instead that the post-forgetting weight vector w 1 is given by

$$\mathbf{w}_1 = r\mathbf{w}_0 \tag{4}$$

for some (possibly random) scalar r, so that

$$\mathbf{v} = -(1-r)\mathbf{w}_0 \tag{5}$$

and therefore

$$\mathbf{w}_1 = \mathbf{w}_0 - (1-r)\mathbf{w}_0 \tag{6}$$

The interpretation of Equation 6 is that forgetting consists of making the optimal weight vector w 0 “fall” towards the origin by a falling factor 1−r.

Results

We provide theoretical results, and compare these with results obtained using computer simulations. In essence, our theoretical and simulation results indicate that falling weights induce negative FLL, which decreases with the square of the falling factor 1−r.

Theoretical Results

Our two main theorems are summarised here, and proofs are provided in the Methods section. These theorems apply to a network with n weights which learns n 1+n 2 associations A = A 1 ∪ A 2, and then after partial forgetting, relearns the n 2 associations in A 2.

We prove that if n 1+n 2 ≤ n (so that, in general, the associations A 1 and A 2 are consistent) and the joint distribution of (X 1,d 1) is isotropic (where X 1 and d 1 are the matrix of inputs and the vector of desired outputs for subset A 1 of associations) then the expected value of δ is negative (recall that δ is defined in Equation 3). We then prove that the probability P(δ<0) that δ is negative approaches unity as n 1 approaches ∞.

Theorem 1

For every non-zero value of r, the expected value of δ given r is negative. More precisely,

$$E[\delta \mid r] \propto -(1-r)^2 \le 0 \tag{7}$$

with equality only in trivial cases, and where the constant of proportionality is guaranteed to be positive. Thus, the expected amount of FLL is negative (or zero).

From a physiological perspective, the case r<1 is obviously of interest because it represents synaptic weight decay. However, from a mathematical perspective, Theorem 1 applies to every value of r, and so it also holds for r>1. In other words, any movement of the weight vector w along the line connecting w 0 to the origin yields an expectation of negative FLL, in accordance with Theorem 1.

Theorem 2

Under mild conditions on the distributions of the input/output pairs (X 1,d 1) and (X 2,d 2),

graphic file with name pcbi.1000143.e009.jpg (8)

where x and Inline graphic are any columns of Inline graphic and Inline graphic, respectively, and

graphic file with name pcbi.1000143.e013.jpg

Theorem 2 implies that, if (i) the number (n 1) of associations in A 1 is a fixed non-zero proportion (n 1/n) of the number n of connection weights, (ii) E[∥d 1∥²]E[∥d 2∥⁻²] is bounded as n → ∞, and (iii) γ(n) → 0 as n → ∞ then P(δ>0) → 0 as n → ∞, i.e., the amount of FLL is negative, with a probability which tends to 1 as n → ∞.

For example, if we assume that (i) each input vector x = (x 1,…,xn) is chosen from an isotropic Gaussian distribution and (ii) the variance of xi is Inline graphic then γ(n) = 2/n, Inline graphic, and E[∥d 1∥²]E[∥d 2∥⁻²] = n 1/(n 2−1). This ensures that P(δ>0) → 0 as n → ∞.

Simulation Results

Simulation was carried out on a network with n input units and one output unit. The set A of associations consisted of k input vectors (x 1,…,x k) and k corresponding desired scalar output values (d 1,…,dk). Each input vector comprised n elements x = (x 1,…,xn). The values of xi and di were chosen from a Gaussian distribution with unit variance. A network's output yi is a weighted sum of input values, $y_i = \sum_{j=1}^{n} w_j x_{ij}$, where xij is the jth component of the ith input vector x i, and each weight wj is the connection between the jth input unit and the output unit.

Given that the network error for a given set of k associations is $E = \sum_{i=1}^{k} (y_i - d_i)^2$, the derivative $\nabla_{\mathbf{w}} E$ of E with respect to w yields the delta learning rule $\Delta\mathbf{w} = -\eta \nabla_{\mathbf{w}} E$, where η is the learning rate, which is adjusted according to the number of weights.
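The delta rule can be sketched as plain gradient descent on the summed squared error. The example below is illustrative only: the problem size, the iteration count, and the choice of learning rate (the reciprocal of the largest eigenvalue of XᵀX, with the factor 2 of the gradient absorbed into η) are assumptions, not settings reported in the paper. It also checks a standard property of this rule for an underdetermined linear problem: starting from any weight vector, it converges to the orthogonal projection of that vector onto the set of weight vectors that solve the associations exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 5, 20                          # fewer associations than weights (assumed sizes)
X = rng.standard_normal((k, n))       # rows are input vectors
d = rng.standard_normal(k)            # desired outputs
w_start = rng.standard_normal(n)      # arbitrary starting weights

# Gradient of ||X w - d||^2 is 2 X^T (X w - d); the factor 2 is folded into eta.
eta = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / (largest singular value)^2, an assumed choice
w = w_start.copy()
for _ in range(5000):
    w -= eta * X.T @ (X @ w - d)       # delta-rule update

# Orthogonal projection of w_start onto {w : X w = d}, via the pseudoinverse.
w_projection = w_start + np.linalg.pinv(X) @ (d - X @ w_start)

print(np.max(np.abs(w - w_projection)))  # difference between the two solutions
```

This equivalence is why an iterative learning rule can be replaced by a direct linear-algebra solution without changing the result.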

However, in order to save time, we used an equivalent learning method. Learning of the k = n associations in A = A 1 ∪ A 2 was performed by solving a set of n simultaneous equations using a standard method, after which the weight vector w 0 was obtained; this provided perfect performance on all n associations. Partial forgetting was induced by making weights “fall” towards the origin w 1 = r w 0, after which performance error was E pre. Relearning the n 2 = n/2 associations in A 2 was implemented with k = n 2 as above, after which performance error was E post.

In each simulation, each value in each input vector x i, and each target value di, were chosen from the same Gaussian distribution with unit variance. There were 100 input units, and one output unit. The subsets A 1 and A 2 each consisted of 50 associations. The value of δ = E pre − E post was obtained in each of 100 simulations, using a different random seed for each simulation. In Figure 2, the mean of 100 values of δ is shown for various values of the falling factor 1−r.
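The simulation described above can be sketched as follows. This is a hedged reconstruction rather than the authors' code: it assumes zero-mean unit-variance Gaussian inputs and targets, implements relearning of A 2 as the orthogonal projection of w 1 onto the set of weight vectors that solve A 2 exactly (as stated in the geometric account of relearning), and draws all runs from one seeded generator instead of reseeding per run. The function name `mean_delta_per_association` is hypothetical.

```python
import numpy as np

def mean_delta_per_association(r, n=100, n1=50, runs=100, seed=0):
    """Mean of delta / n1 over `runs` simulated networks with falling weights."""
    rng = np.random.default_rng(seed)
    deltas = []
    for _ in range(runs):
        X = rng.standard_normal((n, n))      # n associations, n weights
        d = rng.standard_normal(n)
        X1, d1 = X[:n1], d[:n1]              # subset A1
        X2, d2 = X[n1:], d[n1:]              # subset A2
        w0 = np.linalg.solve(X, d)           # learn A1 and A2 perfectly
        w1 = r * w0                          # forgetting: weights fall towards zero
        e_pre = np.sum((X1 @ w1 - d1) ** 2)
        # Relearn A2: move w1 the shortest distance that satisfies X2 w = d2.
        w2 = w1 + np.linalg.pinv(X2) @ (d2 - X2 @ w1)
        e_post = np.sum((X1 @ w2 - d1) ** 2)
        deltas.append(e_pre - e_post)
    return float(np.mean(deltas)) / n1

for r in (1.0, 0.75, 0.5, 0.25):
    simulated = mean_delta_per_association(r)
    predicted = -(1 - r) ** 2                # prediction with unit constant
    print(f"falling factor {1 - r:.2f}: simulated {simulated:+.3f}, predicted {predicted:+.3f}")
```

Under these assumptions the simulated mean should lie close to −(1−r)², becoming more negative as the falling factor increases.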

Figure 2. Free-lunch learning decreases as the network's weight vector falls toward the origin.

A network with 100 input units and one output unit learns two subsets A 1 and A 2, each of which consists of 50 associations. After learning A 1 and A 2, the network has a weight vector w = w0, but after partial forgetting, the weight vector is w = w1. If forgetting consists of subtracting a proportion 1−r of w0 such that w1 = w0−(1−r)w0 then the weight vector “falls” towards the origin; the factor 1−r is called the falling factor. After forgetting, performance error on A 1 is E pre, an error which changes to E post after relearning A 2, where this change is δ = E pre − E post. Given that there are n 1 associations in A 1, the expected free-lunch learning per association in A 1 is therefore E[δ/n 1|r]. Solid curve: the expected FLL, E[δ/n 1|r], where this expectation is taken over 100 computer simulations. Dashed curve: theoretical prediction of E[δ/n 1|r] (see Equation 7), using a constant of proportionality equal to unity, so that the predicted free-lunch learning is E predict[δ/n 1|r] = −(1−r)². As predicted, free-lunch learning E[δ/n 1|r] becomes more negative as the falling factor 1−r increases.

The Geometry of Forgetting

We present a brief account of the geometry which underpins the results reported here, for a network with two input units and one output unit, as shown in Figure 3A. This network learns two associations A 1 = (X 1,d 1) and A 2 = (X 2,d 2).

Figure 3. Geometric example of how relearning A 2 increases the error on A 1.

(A) A network with two input units and one output unit, with connection weights ωa and ωb defines a weight vector w = (ωa,ωb). The network learns two associations A 1 and A 2. For example, A 1 is the mapping from input vector x1 = (x 11,x 12) to desired output value d 1, and learning A 1 consists of adjusting w until the network output y 1 = w·x1 equals d 1. (B) For a given association A 2 = (X 2,d 2), the corresponding constraint line in the space defined by (ωa,ωb) is L 2. Irrespective of the precise value of the target output value d 1 in association A 1, if d 1 is distributed isotropically then +d 1 is as probable as −d 1. When averaged over +d 1 and −d 1, the change δ in error on A 1 induced by relearning A 2 can be shown to be −(1−r)² e², where w1 ± = rw0 ±. Since this is less than zero, the expected change E[δ|r]<0. (Figure 3A redrawn from [12]).

Figure 3B provides a geometric example of how relearning A 2 increases the error on A 1. After learning A 1 and A 2, w = w 0. The effects of forgetting and relearning can be seen by ignoring the ± superscripts and subscripts for now. After partial forgetting, w = w 1, and performance error E pre = p². Relearning A 2 yields w 2, the orthogonal projection of w 1 on to L 2, and performance error is E post = q². FLL occurs if δ = E pre − E post > 0, or equivalently if p² − q² > 0 (see [12], Appendices A–C for proofs). Forgetting here consists of reducing w 0 by a factor r<1, so that w 1 = r w 0.

The plus and minus signs in Figure 3B refer to two versions Inline graphic and Inline graphic of association A 1, in which X 1 is the same and the target d 1 has the same magnitude, but opposite signs: Inline graphic and Inline graphic.

We now find the expected change in error induced by relearning a given association A 2. After learning Inline graphic followed by forgetting, the change in error on Inline graphic after relearning A 2 is Inline graphic. After learning Inline graphic followed by forgetting, the change in error on Inline graphic after relearning A 2 is Inline graphic. Using similar triangles in Figure 3B,

graphic file with name pcbi.1000143.e031.jpg (9)
graphic file with name pcbi.1000143.e032.jpg (10)

Therefore, the total change in error on Inline graphic and Inline graphic induced by relearning A 2 (on different occasions) is

graphic file with name pcbi.1000143.e035.jpg (11)
graphic file with name pcbi.1000143.e036.jpg (12)
graphic file with name pcbi.1000143.e037.jpg (13)

Irrespective of the precise value of the target output value d 1 in A 1, if the distribution of d 1 is isotropic then +d 1 is as probable as −d 1. If the total change in error for two instances (Inline graphic and Inline graphic) of A 1 is −2(1−r)² e² then the expected change (conditional on e) is E[δ|e] = −(1−r)² e². Therefore, if forgetting is induced by falling weight values, then the expected change in error E[δ]<0.
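The averaging argument over +d 1 and −d 1 can be checked numerically for a two-weight network. The sketch below uses arbitrary illustrative values and a hypothetical helper `delta_after_relearning`. It relies on one assumption that should be stated loudly: the length e of Figure 3B is identified here with |x1·u|, where u is the point on the constraint line L 2 closest to the origin. That identification follows from working through the algebra, not from the figure itself.

```python
import numpy as np

def delta_after_relearning(x1, d1, x2, d2, r):
    """Change in error on A1 = (x1, d1) after forgetting by factor r and relearning A2."""
    X = np.vstack([x1, x2])
    w0 = np.linalg.solve(X, np.array([d1, d2]))    # learn both associations
    w1 = r * w0                                    # weights fall towards the origin
    e_pre = (w1 @ x1 - d1) ** 2
    w2 = w1 + x2 * (d2 - w1 @ x2) / (x2 @ x2)      # orthogonal projection onto line L2
    e_post = (w2 @ x1 - d1) ** 2
    return e_pre - e_post

# Arbitrary illustrative values.
x1 = np.array([1.0, 0.5])
x2 = np.array([0.3, 1.0])
d1, d2, r = 0.8, 1.2, 0.6

delta_plus = delta_after_relearning(x1, +d1, x2, d2, r)
delta_minus = delta_after_relearning(x1, -d1, x2, d2, r)

u = x2 * d2 / (x2 @ x2)                            # closest point of L2 to the origin
predicted_total = -2 * (1 - r) ** 2 * (x1 @ u) ** 2

print(delta_plus + delta_minus, predicted_total)
```

Either sign of d 1 taken alone can give a positive or a negative δ, but the two changes always sum to a non-positive value, which is what makes the expected change negative.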

Discussion

We have proved and demonstrated that, in one of the simplest forms of neural network model, relearning part of a previously learned set of associations reduces performance on the remaining non-relearned associations. This result is in stark contrast to our previous results, which proved that relearning induced partial recovery of non-relearned items [12]. The only difference between these two studies is the way in which forgetting was induced.

An obvious physiological concomitant of Hebbian learning is long-term potentiation (LTP), which seems to underpin learned behaviors [14]. LTP can last for hours, days or even months, and usually follows an exponential decay [3]. However, some forms of LTP do not seem to decay [15], and have been shown to be stable for up to one year [16]. Such stability is remarkable, but from a statistical point of view, would almost certainly be accompanied by random fluctuations which would have a cumulative effect over time; and indeed, fluctuations are apparent in the stable LTP reported in [16]. Crucially, it is not known if the forgetting of learned behaviors is caused by decaying efficacy at many synapses, or by the cumulative effect of random fluctuations in stable LTP-induced synaptic efficacies. Here, decaying efficacy is analogous to weight values that fall toward zero in network models, whereas the cumulative effect of random fluctuations is analogous to the addition of random noise, or drifting, of weight values in network models.

Given a choice between forgetting via synaptic weights that fall towards zero and weights that drift isotropically, has evolution chosen drifting or falling? If all other things were equal then forgetting via synaptic drift would seem to be the obvious choice. This is because drifting ensures that relearning a subset of associations improves performance on other associations, whereas falling decreases performance. However, other things are rarely equal. The expected magnitude of weights increases with drifting but decreases with falling. (Consider a hypersphere centered on the origin, with radius ∥w 0∥. Simple geometry shows that more than half of all directions emanating from w 0 yield a new weight vector w 1 which lies outside the hypersphere, and therefore E[∥w 1∥]>E[∥w 0∥] (assuming, for example, that all vectors w 1 − w 0 have the same length).) This decrease in weight magnitudes effectively reduces neuronal firing rates, which reduces metabolic costs relative to costs incurred by synaptic drift. Synaptic drift therefore confers mnemonic benefits, but these benefits come at a metabolic price. Thus the increased fitness gained from the mnemonic benefits of synaptic drift must be offset against their metabolic costs. In essence, even free-lunch learning comes at a price.
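The hypersphere argument can be illustrated with a short Monte Carlo check. The dimension, the drift length (half the length of w 0), the falling factor, and the number of samples are all assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
w0 = rng.standard_normal(n)                 # weight vector after learning (arbitrary)
radius = np.linalg.norm(w0)

# Isotropic drift: random directions, all of the same assumed length.
step = 0.5 * radius
v = rng.standard_normal((10000, n))
v *= step / np.linalg.norm(v, axis=1, keepdims=True)
drift_norms = np.linalg.norm(w0 + v, axis=1)

fraction_outside = np.mean(drift_norms > radius)   # share of drifted vectors outside the hypersphere
mean_drift_norm = drift_norms.mean()
fall_norm = np.linalg.norm(0.5 * w0)               # falling with r = 0.5

print(fraction_outside, mean_drift_norm / radius, fall_norm / radius)
```

With these settings most drifted vectors land outside the hypersphere, so the mean weight magnitude grows under drift, whereas falling with r<1 always shrinks it.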

Methods

We proceed by deriving expressions for E pre, E post, and for δ = E pre − E post. We prove that if n 1+n 2 ≤ n then the expected value of δ is negative. We then prove that the probability P(δ<0) that δ is negative approaches unity as n 1 approaches ∞.

Performance Errors

Given a c×n matrix X and a c-dimensional vector d, let L X,d be the affine subspace

$$L_{X,\mathbf{d}} = \{\mathbf{w} : X\mathbf{w} = \mathbf{d}\}$$

of $\mathbb{R}^n$. If X and d are consistent (i.e., there is a w such that Xw = d) then

graphic file with name pcbi.1000143.e042.jpg

Given weight vectors w 1 and w 2, a matrix X of input vectors, and a vector d of desired outputs, define

$$\delta(\mathbf{w}_1,\mathbf{w}_2;X,\mathbf{d}) = E_{\mathrm{pre}} - E_{\mathrm{post}}$$

where E pre = E(X;w 1,d) and E post = E(X;w 2,d). Let Inline graphic be any element of L X ,d. Then

graphic file with name pcbi.1000143.e045.jpg (14)

If X i has rank ni then transposing the QR decomposition of Inline graphic (or, equivalently, using Gram–Schmidt orthonormalisation of the rows of X i) gives

graphic file with name pcbi.1000143.e047.jpg

for unique ni×ni and ni×n matrices T i and Z i with T i lower triangular with positive diagonal elements, and Inline graphic. Simple calculation shows that, for any weight vector w, Inline graphic and Inline graphic are orthogonal. Since Inline graphic, it follows that the matrix Inline graphic represents the operator that projects orthogonally onto the image of Inline graphic. Because

graphic file with name pcbi.1000143.e054.jpg (15)

the image of Inline graphic is contained in that of Inline graphic. As both these images have dimension ni, they must be equal, and so Inline graphic represents the operator which projects orthogonally onto the image of Inline graphic.
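The factorisation used above can be sketched numerically. The example assumes the following reading of the construction: X i = T i Z i, where T i is lower triangular with positive diagonal and the rows of Z i are orthonormal, so that Z iᵀZ i is the orthogonal projector onto the row space of X i. The matrix sizes and the sign-fixing step are illustrative choices, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_i, n = 3, 6
X = rng.standard_normal((n_i, n))      # full row rank with probability 1

# Reduced QR of the transpose: X^T = Q R, with Q (n x n_i) and R (n_i x n_i) upper triangular.
Q, R = np.linalg.qr(X.T)

# Make the diagonal of R positive; flipping matching columns of Q keeps Q R unchanged.
signs = np.sign(np.diag(R))
Q = Q * signs
R = signs[:, None] * R

T = R.T                                # lower triangular, positive diagonal
Z = Q.T                                # n_i x n, orthonormal rows
P = Z.T @ Z                            # orthogonal projector onto the row space of X

print(np.allclose(T @ Z, X), np.allclose(Z @ Z.T, np.eye(n_i)), np.allclose(P @ P, P))
```

The same projector is obtained from the pseudoinverse as `np.linalg.pinv(X) @ X`, which gives an independent check on the construction.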

Now suppose that X and d are consistent, where

graphic file with name pcbi.1000143.e059.jpg

Then, after the network has learned A 1 and A 2, the weight vector w 0 satisfies

graphic file with name pcbi.1000143.e060.jpg (16)

(If, as below, n 1+n 2 ≤ n, X 2 and d 2 are consistent, and (X 1,d 1) has a continuous distribution then Equation 16 holds with probability 1.)

Falling

We now assume that forgetting is induced by weight values “falling” towards the origin at zero, i.e., forgetting consists of shrinking the weight vector w 0 by a (possibly random) factor r towards the “dead state” 0. Thus the post-forgetting weight vector w 1 is given by

$$\mathbf{w}_1 = r\mathbf{w}_0 \tag{17}$$

and so the “forgetting vector” v = w 1 − w 0 is

$$\mathbf{v} = -(1-r)\mathbf{w}_0 \tag{18}$$

The form of forgetting given by Equation 17 is very different from that investigated in [12], where v has an isotropic distribution and is independent of (X 1,d 1) and (X 2,d 2).

Let w 2 be the orthogonal projection of w 1 onto L 2. Then

graphic file with name pcbi.1000143.e063.jpg

Manipulation gives

graphic file with name pcbi.1000143.e064.jpg (19)

and so

graphic file with name pcbi.1000143.e065.jpg (20)

Then Equations 14, 16, and 18–20 yield

graphic file with name pcbi.1000143.e066.jpg (21)
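As a hedged cross-check, δ can also be worked out directly from the assumptions already stated: perfect initial learning, falling weights, and relearning by orthogonal projection onto L 2. The sketch below additionally assumes that X 2 has full row rank, writes X 2⁺ = X 2ᵀ(X 2X 2ᵀ)⁻¹ for its pseudoinverse, and introduces the symbol u purely for convenience; the result may differ in form from the expression in Equation 21.

```latex
\begin{align*}
X_1\mathbf{w}_0 &= \mathbf{d}_1, \qquad X_2\mathbf{w}_0 = \mathbf{d}_2, \qquad \mathbf{w}_1 = r\mathbf{w}_0,\\
E_{\mathrm{pre}} &= \lVert X_1\mathbf{w}_1 - \mathbf{d}_1\rVert^2 = (1-r)^2\lVert\mathbf{d}_1\rVert^2,\\
\mathbf{w}_2 &= \mathbf{w}_1 + X_2^{+}(\mathbf{d}_2 - X_2\mathbf{w}_1)
             = r\mathbf{w}_0 + (1-r)\mathbf{u}, \qquad \mathbf{u} = X_2^{+}\mathbf{d}_2,\\
E_{\mathrm{post}} &= \lVert X_1\mathbf{w}_2 - \mathbf{d}_1\rVert^2
                   = (1-r)^2\lVert\mathbf{d}_1 - X_1\mathbf{u}\rVert^2,\\
\delta &= E_{\mathrm{pre}} - E_{\mathrm{post}}
        = (1-r)^2\left(2\,\mathbf{d}_1^{\mathsf{T}}X_1\mathbf{u} - \lVert X_1\mathbf{u}\rVert^2\right).
\end{align*}
```

If +d 1 and −d 1 are equally probable given X 1, the cross term averages to zero, leaving an expected value of −(1−r)²∥X 1u∥², which is negative unless r = 1 or u = 0.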

The Case of Isotropic Random (X 1,d 1)

In this section we assume that the distribution of (X 1,d 1) is isotropic, i.e., that (UX 1 V,Ud 1) has the same distribution as (X 1,d 1) for all orthogonal n 1×n 1 matrices U and all orthogonal n×n matrices V. Then taking the conditional expectation of Equation 21 for given X 2, d 2, and r gives the following theorem.

Theorem 1

If

  1. n 1+n 2 ≤ n,

  2. X 2 and d 2 are consistent,

  3. the distribution of (X 1,d 1) is continuous and isotropic,

  4. X 1, d 1, and (X 2,d 2,r) are independent.

then

graphic file with name pcbi.1000143.e067.jpg (22)

where x is any column of Inline graphic.

Corollary 1

If 1.-3. of Theorem 1 hold then

graphic file with name pcbi.1000143.e069.jpg (23)

with equality if and only if either r = 1 or d 2 = 0.

Corollary 1 says that (apart from trivial exceptions) the expected amount of FLL is negative.

To obtain Theorem 2, it is useful to have some moments of isotropic distributions. Let x be isotropically distributed on Inline graphic. Then Equations 9.6.1 and 9.6.2 of Mardia and Jupp (2000), together with some algebraic manipulation, yield

graphic file with name pcbi.1000143.e071.jpg (24)
graphic file with name pcbi.1000143.e072.jpg (25)

as in Equations A.14 and A.15 of [12].

The other tool used in proving Theorem 2 is the formula

$$\operatorname{var}(Y \mid Z) = E[\operatorname{var}(Y \mid X, Z) \mid Z] + \operatorname{var}(E[Y \mid X, Z] \mid Z) \tag{26}$$

for any random variables X,Y,Z for which these quantities exist. Equation 26 is an application to the conditional distribution of Y|Z of the standard conditional variance formula that is given in Equation 2b.3.6 on page 97 of [17].

Taking the expectation and variance of Equation 21 as only d 1 varies and using Equation 24 gives

graphic file with name pcbi.1000143.e074.jpg (27)
graphic file with name pcbi.1000143.e075.jpg (28)

Taking the expectation of Equation 28 as only X 1 varies and using Equation 24 gives

graphic file with name pcbi.1000143.e076.jpg (29)

We now suppose that

graphic file with name pcbi.1000143.e077.jpg (30)

Then taking the variance of Equation 27 as only X 1 varies and using Equation 25 gives

graphic file with name pcbi.1000143.e078.jpg (31)

Adding Equations 29 and 31 and using Equation 26 yields

graphic file with name pcbi.1000143.e079.jpg (32)

To obtain an upper bound on the conditional probability of FLL (i.e., on P(δ≥0|X 2,d 2,r)), we use Chebyshev's inequality, which states that, for any random variable Y and any positive value of t

$$P(|Y - E[Y]| \ge t) \le \frac{\operatorname{var}(Y)}{t^2}$$

Applying Chebyshev's inequality to the conditional distribution of δ(w 1,w 2,X 1,d 1) given (X 2,d 2,r), taking t = E[δ(w 1,w 2;X 1,d 1)|X 2,d 2,r], and noting that (by Equation 23) t≤0, we obtain

graphic file with name pcbi.1000143.e081.jpg (33)
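The way Chebyshev's inequality bounds the probability of FLL can be illustrated empirically. The sketch below is not the bound of Equation 33 itself: it applies the standard inequality to simulated values of δ, using the sample mean and sample variance in place of the conditional moments. The network sizes, falling factor, number of runs, and function names are assumptions for illustration. When the mean of δ is negative, any run with δ ≥ 0 deviates from the mean by at least the size of the mean, so the frequency of such runs cannot exceed the variance divided by the squared mean.

```python
import numpy as np

def sample_deltas(n, r=0.5, runs=300, seed=0):
    """Simulate delta for networks with n weights and n1 = n2 = n // 2 associations."""
    rng = np.random.default_rng(seed)
    n1 = n // 2
    deltas = np.empty(runs)
    for t in range(runs):
        X = rng.standard_normal((n, n))
        d = rng.standard_normal(n)
        X1, d1, X2, d2 = X[:n1], d[:n1], X[n1:], d[n1:]
        w1 = r * np.linalg.solve(X, d)                    # learn, then fall
        w2 = w1 + np.linalg.pinv(X2) @ (d2 - X2 @ w1)     # relearn A2 by projection
        deltas[t] = np.sum((X1 @ w1 - d1) ** 2) - np.sum((X1 @ w2 - d1) ** 2)
    return deltas

def chebyshev_summary(deltas):
    """Return sample mean, frequency of delta >= 0, and the Chebyshev-style bound."""
    mean = deltas.mean()
    bound = deltas.var() / mean ** 2      # variance over squared mean
    frequency = np.mean(deltas >= 0)
    return mean, frequency, bound

for n in (20, 100):
    mean, frequency, bound = chebyshev_summary(sample_deltas(n))
    print(f"n={n}: mean delta {mean:.2f}, frequency of FLL {frequency:.3f}, bound {bound:.3f}")
```

The bound is loose, and can exceed 1 for small networks, but it tightens as the variance of δ shrinks relative to the square of its mean.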

Substituting Equations 22 and 32 into Equation 33 gives

graphic file with name pcbi.1000143.e082.jpg (34)

where

graphic file with name pcbi.1000143.e083.jpg

For any positive-definite symmetric matrix A and vector x, diagonalization of A, together with the fact that x+1/x≥2 for positive x, yields

graphic file with name pcbi.1000143.e084.jpg (35)

Combining Equations 34 and 35 with the fact that Inline graphic gives

graphic file with name pcbi.1000143.e086.jpg (36)

Taking the expectation of Equation 36 over X 2 yields

graphic file with name pcbi.1000143.e087.jpg (37)

where x and Inline graphic are any columns of Inline graphic and Inline graphic, respectively.

Taking the expectation of Equation 37 over d 2 and r yields the following theorem.

Theorem 2

If (a) conditions 1.-4. of Theorem 1 hold, (b) the columns Inline graphic of Inline graphic are distributed independently, (c) X 2, d 2, and r are independent, (d) the distribution of (X 2,d 2) is isotropic, and (e) E[∥d 2∥⁻²] is finite then

graphic file with name pcbi.1000143.e093.jpg (38)

where x and Inline graphic are any columns of Inline graphic and Inline graphic, respectively, and

graphic file with name pcbi.1000143.e097.jpg

Corollary 2

If the conditions of Theorem 2 hold and

graphic file with name pcbi.1000143.e098.jpg

where x and Inline graphic are any columns of Inline graphic and Inline graphic, respectively, then

graphic file with name pcbi.1000143.e102.jpg

Thus

graphic file with name pcbi.1000143.e103.jpg

provided that n 1/n and n 2/n are bounded away from zero.

Acknowledgments

Thanks to David Sterratt for asking, “What would happen to free-lunch learning if weights decayed?” and to three anonymous reviewers for their detailed comments.

Footnotes

The authors have declared that no competing interests exist.

No funding was received for this work.

References

  • 1. Tanzi E. I fatti e le induzioni nell'odierna istologia del sistema nervoso. Riv Sper Freniatr Med Leg. 1893;19:419–472.
  • 2. Hebb D. The Organization of Behavior: A Neuropsychological Theory. New York: Wiley; 1949.
  • 3. Abraham W. How long will long-term potentiation last? Philos Trans R Soc Lond B Biol Sci. 2003;358:735–744. doi: 10.1098/rstb.2002.1222.
  • 4. Stone J, Hunkin N, Hornby A. Predicting spontaneous recovery of memory. Nature. 2001;414:167–168. doi: 10.1038/35102676.
  • 5. Coltheart M, Byng S. A treatment for surface dyslexia. In: Seron X, editor. Cognitive Approaches in Neuropsychological Rehabilitation. London: Lawrence Erlbaum Associates; 1989.
  • 6. Weekes B, Coltheart M. Surface dyslexia and surface dysgraphia: treatment studies and their theoretical implications. Cogn Neuropsychol. 1996;13:277–315.
  • 7. Atkins P. What happens when we relearn part of what we previously knew? Predictions and constraints for models of long-term memory. Psychol Res. 2001;65:202–215. doi: 10.1007/s004269900015.
  • 8. Roediger H III. Inhibition in recall from cueing with recall targets. J Verbal Learn Verbal Behav. 1973;12:644–657.
  • 9. Serra M, Nairne J. Part-set cuing of order information: implications for associative theories. Mem Cognit. 2000;28:847–855. doi: 10.3758/bf03198420.
  • 10. Hinton G, Plaut D. Using fast weights to deblur old memories. Proceedings Ninth Annual Conference of the Cognitive Science Society. 1987. pp. 177–186.
  • 11. Atkins P, Murre J. Recovery of unrehearsed items in connectionist models. Connect Sci. 1998;10:99–119.
  • 12. Stone J, Jupp P. Free-lunch learning: modelling spontaneous recovery of memory. Neural Comput. 2007;19:194–217. doi: 10.1162/neco.2007.19.1.194.
  • 13. Stone J. Distributed representations accelerate evolution of adaptive behaviours. PLoS Comput Biol. 2007;3:e147. doi: 10.1371/journal.pcbi.0030147.
  • 14. Whitlock J, Heynen A, Shuler M, Bear M. Learning induces long-term potentiation in the hippocampus. Science. 2006;313:1093–1097. doi: 10.1126/science.1128134.
  • 15. Staubli U, Lynch G. Stable hippocampal long-term potentiation elicited by theta pattern stimulation. Brain Res. 1987;435:227–234. doi: 10.1016/0006-8993(87)91605-2.
  • 16. Abraham WC, Logan B, Greenwood JM, Dragunow M. Induction and experience-dependent consolidation of stable long-term potentiation lasting months in the hippocampus. J Neurosci. 2002;22:9626–9634. doi: 10.1523/JNEUROSCI.22-21-09626.2002.
  • 17. Rao C. Linear Statistical Inference and its Applications. 2nd edition. New York: Wiley; 1973.
