eLife. 2017 Sep 7;6:e22225. doi: 10.7554/eLife.22225

Fundamental bound on the persistence and capacity of short-term memory stored as graded persistent activity

Onur Ozan Koyluoglu 1, Yoni Pertzov 2, Sanjay Manohar 3, Masud Husain 3, Ila R Fiete 4
Editor: Lila Davachi
PMCID: PMC5779315  PMID: 28879851

Abstract

It is widely believed that persistent neural activity underlies short-term memory. Yet, as we show, the degradation of information stored directly in such networks behaves differently from human short-term memory performance. We build a more general framework where memory is viewed as a problem of passing information through noisy channels whose degradation characteristics resemble those of persistent activity networks. If the brain first encodes the information appropriately before passing it into such networks, the information can be stored substantially more faithfully. Within this framework, we derive a fundamental lower-bound on recall precision, which declines with storage duration and number of stored items. We show that human performance, though inconsistent with models involving direct (uncoded) storage in persistent activity networks, can be well-fit by the theoretical bound. This finding is consistent with the view that if the brain stores information in patterns of persistent activity, it might use codes that minimize the effects of noise, motivating the search for such codes in the brain.

Research organism: Human

Introduction

Short-term memory, which refers to the brain’s temporary buffer of readily usable information, is considered to be a critical component of general intelligence (Conway et al., 2003). Despite considerable interest in understanding the neural mechanisms that limit short-term memory, the issue remains relatively unsettled. Human working memory is a complex phenomenon, involving not just short-term memory but executive selection and processing, operating on multiple timescales and across multiple brain areas (Jonides et al., 2008). In this study, we restrict ourselves to obtaining limits on short-term memory performance purely due to noise in persistent activity networks, if analog information is stored directly into these networks, or if it is first well-encoded to make the stored states robust to ongoing noise.

Short-term memory experiments quantify the precision of memory recall. Typically in such experiments, subjects are briefly presented with sensory inputs, which are then removed. After a delay the subjects are asked to estimate from memory some feature of the input. Consistent with everyday experience, memory capacity is severely limited, restricted to just a handful of items (Miller, 1956), and recall performance is worse when there are more items to be remembered. Persistence can also be limited, though forgetting over time is a less severe constraint than capacity: several experiments show that recall performance declines with delay (Luck and Vogel, 1997; Jonides et al., 2008; Barrouillet et al., 2009; Barrouillet et al., 2011; Barrouillet et al., 2012; Pertzov et al., 2013; Wilken and Ma, 2004; Bays et al., 2011; Pertzov et al., 2017; Anderson et al., 2011), at least when many items are stored in memory.

Efforts in experimental and theoretical psychology to understand the nature of these memory constraints (Atkinson and Shiffrin, 1968) have led to quantification of human memory performance, and to phenomenological models that can fit limitations in capacity (Zhang and Luck, 2008; Bays and Husain, 2008; van den Berg et al., 2012) or in persistence (Wilken and Ma, 2004; Barrouillet et al., 2012). They have also led to controversy: about whether memory consists of discrete ‘slots’ for a limited maximum number of items (Miller, 1956; Cowan, 2001; Zhang and Luck, 2008) or is more continuously allocable across a larger, variable number of items (van den Berg et al., 2012; Bays and Husain, 2008); about whether forgetting in short-term memory can be attributed in part to some inherent temporal decay of an activity or memory variable over time (Barrouillet et al., 2012; Campoy, 2012; Ricker and Cowan, 2014; Zhang and Luck, 2009) or is, as more widely supported, primarily due to interference across stored items (Lewandowsky et al., 2009).

These controversies have been difficult to resolve in part because different experimental paradigms lend support to different models, while in some cases the resolution of memory performance data is not high enough to adjudicate between models. In addition, psychological models of memory performance make little contact with its neural underpinnings; thus, it is difficult to mediate between them on the basis of mechanism or electrophysiological studies.

On the mechanistic side, persistent neural activity has been widely hypothesized to form the substrate for short-term memory. The hypothesis is based on a corpus of electrophysiological work establishing a link between short-term memory and persistent neural activity (Funahashi, 2006; Smith and Jonides, 1998; Wimmer et al., 2014). Neural network models of analog persistent activity predict a degradation of information over time (Compte et al., 2000; Brody et al., 2003; Boucheny et al., 2005; Burak and Fiete, 2009; Fung et al., 2010; Mongillo et al., 2008; Burak and Fiete, 2012; Wei et al., 2012), because of noise in synaptic and neural activation. If individual analog features are assumed to be directly stored as variables in such persistent activity networks, the time course of degradation of persistent activity should directly predict the time course of degradation in short-term memory performance. However, these models do not typically consider the direct storage of multiple variables (but see Wei et al., 2012), and in general their predictions have not been directly compared against human psychophysics experiments in which the memory load and delay period are varied.

In the present work, we make the following contributions: (1) Generate psychophysics predictions for information degradation as a function of delay period and number of stored items, if information is stored directly, without recoding, in persistent activity neural networks of a fixed total size; (2) Generate psychophysics predictions (through the use of joint source-channel coding theory) for a model that assumes information is restructured by encoding and decoding stages before and after storage in persistent activity neural networks; (3) Compare these models to new analog measurements (Pertzov et al., 2017) of human memory performance on an analog task as the demands on both maintenance duration and capacity are varied.

We show that the direct storage predictions are at odds with human memory performance. We propose that noisy storage systems, such as persistent activity networks, may be viewed as noisy channels through which information is passed, to be accessed at another time. We use the theory of channel coding and joint source-channel coding to derive the information-theoretic upper-bound on the achievable accuracy of short-term memory as a function of time and number of items to be remembered, assuming a core of graded persistent activity networks. According to the channel coding view, the brain might strategically restructure information before storing it, to use the available neurons in a way that minimizes the impact of noise upon the ability to retrieve that information later. We apply our framework, which requires the assumption of additional encoding and decoding stages in the memory process, to psychophysical data obtained using the technique of delayed estimation (Ma et al., 2014), which provides a sensitive measure of short-term memory recall using a continuous, analog response space, rather than discrete (Yes/No) binary recall responses.

We show that empirical results are in substantially better agreement with the functional form of the theoretical bound than with predictions from a model of direct storage of information in persistent activity networks.

Our treatment of the memory problem is distinct from other recent approaches rooted in information theory (Brady et al., 2009; Sims et al., 2012), which consider only source coding – they assume that internal representations have a limited number of states, then compute the minimal distortion achievable in representing an analog variable with these limited states, after redundancy reduction and other compression. All representations are noise-free. By contrast, our central focus is precisely on noise and its effects on memory degradation over time, because the stored states are assumed to diffuse or random-walk across the set of possible stored states. The emphasis on representation with noise involves channel coding as the central element of our analysis.

Our present work is also complementary to efforts to understand short-term memory as rooted in variables other than persistent activity, for instance the possibility that short-term synaptic plasticity, through facilitation (Mongillo et al., 2008; Barak and Tsodyks, 2014; Mi et al., 2017), might ‘silently’ (Stokes, 2015) store short-term memory, which is reactivated and accessed through intermittent neural activity (Lundqvist et al., 2016).

Results

Analog measurement of human short-term memory

We consider data from subjects performing a delayed estimation task (Figure 1—source data 1). We briefly summarize the paradigm and the main findings; a more detailed description can be found in Pertzov et al. (2017). Subjects view a display with several (K) differently colored and oriented bars that are subsequently removed for the storage (delay) period. Following the storage period, subjects are cued by one of the colored bars in the display, now randomly oriented, and asked to rotate it to its remembered orientation. Bar orientations in the display are drawn randomly from the uniform distribution over all angles (thus the range of orientations lies in the circular interval [0,π]) and the report of the subject is recorded as an analog value, to allow for more detailed and quantitative comparisons with theory (van den Berg et al., 2012). Importantly, both the number of items (K) and the storage duration (T) were varied.

When only a single item had to be remembered, the length of the storage interval had no statistically significant influence on the distribution of responses over the intervals considered (Figure 1B, with different delays marked by different shades and line styles; errors <10 degrees, effect of delay: F(3,36)=1.3,p=0.3; errors between 30-50 degrees: F(3,36)=0.2,p=0.9). By contrast, response accuracy degraded significantly with delay duration when there were 6 items in the stimulus (Figure 1C; true orientation subtracted from all responses to provide a common center at 0 degrees). The number of very precise responses decreased (errors <10 degrees, effect of delay: F(3,36)=6.15,p=0.002), with a corresponding increase in the number of trials with large errors (e.g. errors between 30-50 degrees, effect of delay: F(3,36)=5.4,p=0.004).

Figure 1. Human performance on an analog delayed orientation matching task with variable item number and storage duration.

(A) Setup of a delayed orientation estimation task to probe human short-term memory. A variable number of bars with different colors and uniformly randomly drawn orientations are presented for 500 msec. Following a variable delay, the subjects are asked to adjust the orientation of a cue bar, by using a dial, to match the remembered orientation of the bar of the same color from the presentation. (B) Distribution of responses for one item, plotted so the target orientation is centered at zero. Different shades and line styles represent different delays. Note that responses did not vary significantly with storage duration. (C) Distribution of responses for six items varies with storage duration. (D) Mean squared error of recall on the task of Figure 1A (averaged across subjects and trials, and normalized by 180², the square of the range of the stored variable), as item number and delay duration are systematically varied. Error bars denote SEM across participants.

Figure 1—source data 1. Experiment data used in the manuscript.
DOI: 10.7554/eLife.22225.004

Figure 1—figure supplement 1. Similar variance statistics for bounded versus unbounded domains over range relevant for performance data.

The circular nature of the memory variable is unimportant in computing response statistics. (A) A normal distribution with standard deviation σ = 0.2 over an unbounded domain (solid black curve), together with the corresponding wrapped normal distribution (dashed blue), wrapped around the circular interval [−0.5, 0.5). Note the strong similarity of these two distributions: the wrapped normal barely deviates from the corresponding normal distribution for this value of standard deviation (corresponding to a variance of σ² = 0.04). (B) Left: the computed standard deviation of the distributions in (A), as a function of the standard deviation σ of the unbounded normal distribution used to generate both distributions (solid black: normal; dashed blue: wrapped normal). Right: the difference between the two curves in the left plot. Note that the computed standard deviation of the wrapped normal distribution departs substantially (by more than 5%) from that of the normal distribution only around σ ≈ 0.3, or σ² ≈ 0.09. All the responses in the experiments, both the across-subject averages reported in the main paper and the individual performance averages reported in the Appendix, exhibit an MSE (normalized by squared range) smaller than 0.09. Thus, the boundedness of the angular variable has little effect on the results, even though the range of the coded variable is only π radians.

Overall, the squared error in recalling an item’s orientation (Figure 1D), averaged over subjects, increased with delay duration (F(3,27)=49,p<0.001) and also with item number (F(3,27)=48,p<0.001). The data show a clear interaction between storage interval duration and set size (F(9,81)=17,p<0.001), apparent as steeper degradation slopes for larger set-sizes. In summary, for a small number of items (e.g. K=1,2), increasing the storage duration does not strongly affect performance, but for any fixed delay, increasing item number has a more profound effect.

Finally, at all tested delays and item numbers, the squared errors are much smaller than the squared range of the circular variable, and any sub-linearities in the curves cannot be attributed to the inevitable saturation of a growing variance on a circular domain (Figure 1—figure supplement 1).

Information degradation in persistent activity networks

In this and all following sections, we start from the hypothesis that persistent neural activity underlies short-term information storage in the brain. The hypothesis is founded on evidence of a relationship between the stored variable and specific patterns of elevated (or depressed) neural activity (Taube, 1998; Aksay et al., 2001) that persist into the memory storage period and terminate when the task concludes, and on findings that fluctuations in delay-period neural activity can be predictive of variations in memory performance (Funahashi, 2006; Smith and Jonides, 1998; Blair and Sharp, 1995; Miller et al., 1996; Romo et al., 1999; Supèr et al., 2001; Harrison and Tong, 2009; Wimmer et al., 2014).

Neural network models like the ring attractor generate an activity bump that is a steady state of the network and thus persists when the input is removed, Figure 2A. All rotations of the canonical activity bump form a one-dimensional continuum of steady states, Figure 2B. Relatively straightforward extensions of the ring network can generate 2D or higher-dimensional manifolds of persistent states. However, any noise in network activity, for instance in the form of stochastic spiking (Softky and Koch, 1993; Shadlen and Newsome, 1994), leads to lateral random drift along the manifold in the form of a diffusive (Ornstein-Uhlenbeck) random walk (Compte et al., 2000; Brody et al., 2003; Boucheny et al., 2005; Wu et al., 2008; Burak and Fiete, 2009; Fung et al., 2010; Burak and Fiete, 2012), Figure 2C–D.

Figure 2. Analog persistent activity networks and information decay over time.

(A) In a ring network, each neuron excites its immediate neighbors and inhibits all the rest (weight profiles not shown). A single bump of activity (green) is a steady state of such a network, as are all its translations around the ring. (B) A ‘state-space’ view of activity in the ring network: each axis represents the activity of one neuron in the network; if there are N neurons in the network, this state-space plot is N-dimensional. Any point inside the state space represents some possible instantaneous configuration of activity in the N neurons. The grey curve represents the set of steady states, which traces a 1-dimensional manifold because the stable states are just translations of a canonical activity bump along a single dimension. (C) Top: Grey: a schematic non-noisy activity bump; black vertical lines: schematic spikes emitted by neurons after the state is initialized according to the grey curve. Black curve: a best-fit activity profile for the emitted spikes is shifted relative to the original grey bump simply because of the stochastic spikes. Bottom: the state space view of (B), with the addition of the state corresponding to the non-noisy initial activity bump (grey filled circle), the noisy spiking state (black cross), and the projection of the noisy spiking state to the best-fit or closest non-noisy activity profile (black filled circle). (D) Over longer periods of time, activity fluctuations seen in (C) drive a diffusive drift (random walk) along the manifold of stable states, with a squared error that grows linearly with time.

A defining feature of such random walks is that the squared deviation of the stored state relative to its initial value will grow linearly with elapsed time over short times, Figure 2D, with a proportionality constant 2𝒟 (where 𝒟 is the diffusivity) that depends on quantities like the size of the network and the peak firing rate of neurons (Burak and Fiete, 2012).
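This linear scaling can be checked with a minimal abstraction of the drift: a one-dimensional random walk of the stored angle (a sketch only, not a spiking-network simulation; the diffusivity, time step, and trial count below are arbitrary illustrative values):

```python
import numpy as np

# Minimal sketch: the stored angle performs a pure random walk, abstracting the
# diffusive drift of the activity bump along the attractor manifold.
# All parameter values are arbitrary illustration choices.
rng = np.random.default_rng(0)
D = 0.005        # diffusivity, in (range units)^2 per second (hypothetical)
dt = 0.01        # integration time step (s)
T_total = 2.0    # storage duration (s)
n_trials = 5000

n_steps = int(T_total / dt)
steps = rng.normal(0.0, np.sqrt(2 * D * dt), size=(n_trials, n_steps))
drift = np.cumsum(steps, axis=1)            # stored state relative to its start
mse = np.mean(drift ** 2, axis=0)           # empirical mean squared error vs time

t = dt * np.arange(1, n_steps + 1)
print("empirical MSE growth rate:", np.polyfit(t, mse, 1)[0])   # approx. 2*D
print("predicted growth rate 2*D:", 2 * D)
```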

Memory modeled as direct storage in persistent activity networks

Suppose that the variables in a short-term memory task were directly transferred to persistent activity neural networks with a manifold of fixed points that matched the topology of the represented variable. Thus, K circular variables would be stored, entry-by-entry, in K 1-dimensional (1D) ring networks (Ben-Yishai et al., 1995). (Alternatively, the K variables could be stored in a single network with a K-dimensional manifold of stable states, as described in the Appendix; the performance in neural costs and in fit to the data of this version of direct storage is worse than with storage in K 1D networks, thus we focus on banks of 1D networks.)

When N neural resources (e.g. composed of N sets of M neurons each, for a total of NM neurons) are split into K networks, each network is left with N/K resources (NM/K neurons in our example) for storage of a 1D variable. We know from (Burak and Fiete, 2012) that the diffusivity of the state in each of these 1D persistent activity networks will scale as the inverse of the number of neurons and of the peak firing rate per neuron. In other words, the diffusion coefficient is given by 𝒟¯(K,N) = 𝒟K/N, where 𝒟 is a diffusivity parameter independent of K and N (but 𝒟 ∝ 1/M). So long as the squared error remains small compared to the squared range of the variable, it will grow linearly in time at a rate given by 2𝒟¯(K,N) (indeed, in the psychophysical data, the squared error remains small compared to the squared range of the angular variable; see Figure 1—figure supplement 1). Therefore the mean squared error (MSE) is given by:

$$D_{\mathrm{MSE}}(\Phi,K,T) = \Phi^{2}\,\frac{2\mathcal{D}K}{N}\,T. \qquad (1)$$

The only free parameter in the expression for MSE as a function of time and item number is the ratio N/2𝒟. Because the inverse diffusivity parameter 1/𝒟 scales with the number of neurons (M in our example) when N and K are held fixed, the product N/(2𝒟) is proportional to the total number of neurons (N/(2𝒟) ∝ NM). This ratio therefore functions as a combined neural resource parameter.
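Equation 1 is simple enough to state as a one-line function of the combined resource parameter; the sketch below evaluates it for a few illustrative item numbers and delays (the resource value is a placeholder, not a fitted quantity):

```python
import numpy as np

def direct_storage_mse(K, T, resource, span=1.0):
    """Equation 1, written in terms of the combined resource N/(2*D):
    MSE = span^2 * K * T / resource."""
    return span ** 2 * K * T / resource

resource = 1000.0                           # hypothetical N/(2*D), in seconds
T_values = np.array([0.1, 1.0, 2.0, 3.0])   # delays (s)
for K in (1, 2, 4, 6):
    mse = direct_storage_mse(K, T_values, resource)
    print(f"K={K}: normalized MSE at T={T_values} s -> {np.round(mse, 4)}")
```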

Direct storage is a poor model of memory performance

To fit the theory of direct storage to psychophysics data, we find a single best-fit value (with weighted least-squares) of the free parameter N/2𝒟 across all item numbers and storage durations. For each item number curve, the fits are additionally anchored to the shortest storage period point (T=100 ms), which serves as a proxy for baseline performance at zero delay. Such baseline errors close to zero delay – which may be due to limitations in sensory perception, attentional constraints, constraints on the rate of information encoding (loading) into memory, or other factors – are not the subject of the present study, which seeks to describe how performance will deteriorate over time relative to the zero-delay baseline, as a function of storage duration and item number.
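A minimal version of this fit, using hypothetical arrays standing in for the data of Figure 1D (the numbers below are illustrative, not the measured values), is sketched here; because the anchored model is linear in the single slope parameter, the weighted least-squares solution is available in closed form:

```python
import numpy as np

# Hypothetical normalized-MSE data (rows: item number K; columns: delay T).
# These stand in for Figure 1D; real values would come from the experiment.
K_values = np.array([1, 2, 4, 6])
T_values = np.array([0.1, 1.0, 2.0, 3.0])     # seconds; 0.1 s is the baseline
mse_data = np.array([[0.002, 0.003, 0.003, 0.004],
                     [0.004, 0.006, 0.007, 0.008],
                     [0.010, 0.018, 0.024, 0.028],
                     [0.016, 0.035, 0.050, 0.060]])
sem_data = 0.1 * mse_data + 1e-4              # hypothetical SEMs, used as weights

# Anchored direct-storage model (span normalized to 1):
#   MSE(K, T) = MSE(K, 0.1 s) + K * (T - 0.1) * s,   with s = 2*D/N.
x = (K_values[:, None] * (T_values[None, :] - T_values[0]))[:, 1:]
y = (mse_data - mse_data[:, :1])[:, 1:]
w = 1.0 / sem_data[:, 1:] ** 2                # residuals scaled by inverse SEM

slope = np.sum(w * x * y) / np.sum(w * x * x) # closed-form weighted least squares
print("best-fit resource N/(2*D):", 1.0 / slope, "seconds")
```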

As can be seen in Figure 3A, the direct storage theory provides a poor match to human memory performance (p values that the data could arise by sampling from the model, excluding the 100 ms time-point: 0.07, 0.38, <10⁻⁴ for 1 item; 0.39, <10⁻⁴, 0.2 for 2 items; 0.09, 0.29, 0.08 for 4 items; and <10⁻³, <10⁻⁴, <10⁻⁴ for 6 items). These p-values strongly suggest rejection of the model.

Figure 3. Comparison of direct and coded storage models using persistent activity networks with human memory performance.

(A) Lines: predictions from the direct storage model for human memory. The theory specifies all curves with a single free parameter, after shifting each curve to the measured value of performance at the shortest delay interval of 100 ms. Fits performed by weighted least squares (weights are inverse SEM). (B) Similar to (A), but parameters fit by ordinary least-squares to only the 6-item curve; note the discrepancy in the 1- and 2-item fits. (C–E) Information (ϕ) is directly transmitted (or stored) in a noisy channel, and at the end an estimate ϕ̂ of ϕ is recovered. (C) A scenario involving space-to-earth communication. (D) The scenario for direct storage in noisy memory banks (the noisy channels); the encoder and decoder are simply the identity transformation in the case of direct storage and hence do nothing. (E) The K pieces of information in the K-dimensional vector ϕ are each represented in one of K continuous attractor neural networks of size N/K neurons each. Each attractor representation accumulates squared error linearly over time and inversely with N/K. (F–H) Same as (C–E), but here information is first encoded (ϕ → 𝐗(ϕ)) with appropriate structure and redundancy to combat the channel noise. A good encoder-decoder pair can return an estimate ϕ̂ that has lower error than the direct strategy, even with similar resource use, mitigating the effects of channel noise for high-fidelity information preservation. (H) The K-dimensional ϕ is encoded as the (N-dimensional) codeword 𝐱, each entry of which is stored in one of N persistent activity networks. Squared error in the channel grows linearly with time as before; however, the resources used to build K channels of quality (N/K)·(1/2𝒟) from before are redirected into building N channels of poorer quality 1/(2𝒟) (assuming N > K). The decoder estimates ϕ from the N-dimensional output 𝐲. (I) Same as (A), but the model lines are the lower-bound on mean-squared error obtained from an information-theoretic model of memory with good coding. (Model fit by weighted least-squares; the theory specifies all curves with two free parameters, after shifting each curve to the measured value of performance at the shortest delay interval of 100 ms.)

Figure 3—figure supplement 1. Cross-validated comparison of the direct and well-coded storage models after leaving out T=1s datapoints.

The (A) direct and (B) well-coded storage models are fit to the data, excluding the datapoints at time T = 1 s. This is a leave-one-out or jackknife cross-validation procedure. The well-coded model predicts the withheld datapoints with smaller error than the uncoded/direct coding model. Direct model: sum of weighted least-squares error (WLS error): 103.3984; sum of squares error: 0.022888; squared error on held-out T = 1000 ms point: 0.0043414. Well-coded model (with minimum error near N=10): WLS error: 11.3172; sum of squares error: 0.0016302; squared error on held-out T = 1000 ms point: 0.0011631. BIC score: Delta BIC = BIC(direct model all items WLS) - BIC(coded model all items WLS): 11.4039, in favor of the well-coded model.
Figure 3—figure supplement 2. Cross-validated comparison of the direct and well-coded storage models after leaving out T=2s datapoints.

The (A) direct and (B) well-coded storage models are fit to the data, excluding the datapoints at time T = 2 s. This is a leave-one-out or jackknife cross-validation procedure. The well-coded model predicts the withheld datapoints with smaller error than the uncoded/direct coding model. Direct model: WLS error: 79.2137; sum of squares error: 0.015975; squared error on held-out T = 2000 ms point: 0.010418. Well-coded model (with minimum error near N=5): WLS error: 2.9575; sum of squares error: 0.0007505; squared error on held-out T = 2000 ms point: 0.00083856. BIC scores: Delta BIC = BIC(direct model all items WLS) - BIC(coded model all items WLS): 32.4666, in favor of the well-coded model.
Figure 3—figure supplement 3. Comparison of models after removal of the shortest (100 ms) delay time-point under the argument that it represents a different memory process (iconic memory).

The T = 1000 ms point is now used as the baseline level to analyze the time degradation of stored memory, instead of the T = 100 ms point, which is deleted altogether from the analysis. The argument for this analysis is that T = 100 ms might overlap with the process of iconic memory and should not be used in a comparison across the longer-latency short-term memory interval datapoints. (A) The direct model and (B) the well-coded model, for which (C) fit quality plateaus to a nearly asymptotic constant with increasing N (the asymptotic value is nearly achieved by N=10). Direct model: WLS error: 37.317; sum of squares error: 0.0080949. Well-coded model (no minimum in error in the interior of the range; asymptotic decay of error with N, with the near-asymptotic value reached by N=10; here we use N=100, but results, including BIC scores, are similar for N=10): WLS error: 12.493; sum of squares error: 0.0019871. Delta BIC = BIC(direct model all items WLS) - BIC(coded model all items WLS): 24.8239, in favor of the well-coded model.
Figure 3—figure supplement 4. Redefining item numbers as K=[1 4 8 12] (instead of K=[1 2 4 6]) to take into account the memorization of item color in addition to orientation.

(A) Fits with the direct storage model and (B) the well-coded model. For the well-coded model, fit quality reaches a minimum around N = 10. Direct model: WLS error: 80.4649; sum of squares error: 0.016218. Well-coded model (with minimum error near N=10): WLS error: 12.4617; sum of squares error: 0.0016035. Delta BIC = BIC(direct model all items WLS) - BIC(coded model all items WLS): 68.0032, in favor of the coded model.

Does the direct storage model fail mostly because its dependence on time and item number is linear, while the data exhibit some nonlinear effects at the largest delays? On the contrary, direct storage fails to fit the data even at short delays, when the performance curves are essentially linear (see the systematic underestimation of squared error by the model over the first 2 seconds of delay in the 4- and 6-item curves). If anything, the slight sub-linearity in the 6-item curve at longer delays tends to bring it closer to the other curves and thus to the model, so its effect is to slightly reduce the discrepancy between the data and the fits from direct storage theory.

One view of the results, obtained by selecting model parameters to best match the 6-item curve, is that direct storage theory predicts an insufficiently strong improvement in performance with decreasing item number, Figure 3B (p-values for the direct-storage model when fit to the 6-item responses: <10⁻³, 10⁻³, <10⁻⁴ for 1 item; <10⁻², <10⁻⁴, <10⁻⁴ for 2 items; 0.76, <10⁻², 2×10⁻³ for 4 items; 0.22, 0.39, 0.38 for 6 items, excluding the 100 ms delay time-point; the p-values for the 1- and 2-item curves strongly suggest rejection of the model).

Information-theoretic bound on memory performance with well-coded storage

Even if information storage in persistent activity networks is a central component of short-term memory, describing the storage step is not a sufficient account of memory. This fact is widely appreciated in memory psychophysics, where it has been observed that variations in attention, motivation, and other factors also affect memory performance (Atkinson and Shiffrin, 1968; Matsukura et al., 2007). Here we propose that, even discounting these complex factors, direct storage of a set of continuous variables into persistent activity networks with the same total dimension of stable states lacks generality as a model of memory because it does not consider how pre-encoding of information could affect its subsequent degradation, Figure 3C–E. This omission could help account for the mismatch between predictions from direct storage and human behavior, Figure 3A–B.

Storing information in noisy persistent activity networks means that after a delay there will be some information loss, as described above. Mathematically, information storage in a noisy medium is equivalent to passing the information through a noisy information channel. To allow for high-fidelity communication through a noisy channel, it is necessary to first appropriately encode the signal, Figure 3F. Encoding for error control involves the addition of appropriate forms of redundancy tailored to the channel noise. As shown by Shannon (Shannon, 1948), very different levels of accuracy can be achieved with different forms of encoding for the same amount of coding redundancy and channel noise. Thus, predictions for memory performance after good encoding may differ substantially from the predictions from direct storage even though the underlying storage networks (channels) are identical.

Thus, a more general theory of information storage for short-term memory in the brain would consider the effects of arbitrary encoder-decoder pairs that sandwich the noisy storage stage, Figure 3G. In such a three-stage model, information to be stored is first passed to an encoder, which performs all necessary encoding. Encoding strategies may include source coding or compression of the data as well as, critically, channel coding — the addition of redundancy tailored to the noise in the channel so that, subject to constraints on how much redundancy can be added, the downstream effects of channel noise are minimized (Shannon, 1948). The coded information is stored in persistent activity networks, Figure 3H. Finally, the information is accessed by a decoder or readout, Figure 3G. Here, we derive a bound on the best performance that can be achieved by any coding or decoding strategy, if the storage step involves graded persistent activity.

The encoder transforms the K-dimensional input variable into an N-dimensional codeword, to be stored in a bank of storage networks with an N-dimensional manifold of persistent activity states (in the form of N networks with a 1-dimensional manifold each, or 1 network with an N-dimensional manifold, or something in between). To equalize resource use for the persistent activity networks in both the direct storage and coded storage models of memory, the N stored states have a diffusivity 𝒟 each, in contrast to the diffusivity of 𝒟K/N each for K states (compare Figure 3D–E and G–H). The storage step is equivalent to passage of information through additive Gaussian information channels, with variance proportional to the storage duration T and to the diffusivity. The decoder error-corrects the output of the storage stage and inverts the code to provide an estimate of the stored variable. (For more details, see Materials and methods and Appendix.)

We can use information theory to derive the minimum achievable recall error over all possible encoder-decoder structures, for the given statistics of the variable to be remembered and the noise in the storage information channels. In particular, we use joint source-channel coding theory to first consider at what rate information can be conveyed through a noisy channel for a given level of noise and coding redundancy, then obtain the minimal achievable distortion (recall error) for that information rate (see Materials and methods and Appendix). We obtain the following lower-bound on the recall error:

$$D_{\mathrm{MSE}}(\Phi,K,T) = \frac{\Phi^{2}}{2\pi e}\left(1 + \frac{1}{2\mathcal{D}T}\right)^{-N/K} \qquad (2)$$

This result is the theoretical lower bound on MSE achievable by any system that passes information through a noisy channel with the specified statistics: a Gaussian additive channel noise of zero mean and variance 2𝒟T per channel use, a codeword of dimension N, and a variable to be transmitted (stored) of dimension K, with entries that lie in the range [0,Φ]. The bound becomes tight asymptotically (for large N), but for small N it remains a strict lower-bound. Although the potential for decoding errors is reduced at smaller N, the qualitative dependence of performance on item number and delay should remain the same (Appendix and Polyanskiy et al., 2010). The bound is derived by dividing the total resources (defined here, as in the direct storage case, as the ratio N/2𝒟) evenly across all stored items (details in Appendix), similar to a ‘continuous resource’ conception of memory. The same theoretical treatment will admit different resource allocations, for instance, one could split the resources into a fixed number of pieces and allocate those to a (sub)set of the presented items, more similar to the ‘discrete slots’ model.

A heuristic derivation of the result above can be obtained by first noting that the capacity of a Gaussian channel with a given signal-to-noise ratio (SNR) is $I_{\mathrm{Gauss}} = \tfrac{1}{2}\log(1+\mathrm{SNR})$. The summed capacity of N channels, spread across the K items of the stored variable, produces $I_{\mathrm{per\ item}} = \tfrac{N}{K} I_{\mathrm{Gauss}}$. The variance of a scalar within the unit interval represented by $I$ bits of information is bounded below by $e^{-2I}$. Inserting $I_{\mathrm{per\ item}}$ into the variance expression and $\mathrm{SNR} = 1/(2\mathcal{D}T)$ into $I_{\mathrm{Gauss}}$ yields Equation 2, up to scaling prefactors. The Appendix provides more rigorous arguments that the bound we derive is indeed the best that can theoretically be achieved.
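This heuristic translates directly into a few lines of code; the sketch below evaluates the bound of Equation 2 for illustrative (not fitted) values of N and 𝒟:

```python
import numpy as np

def coded_storage_bound(K, T, N, D, span=1.0):
    """Equation 2: lower bound on MSE with well-coded storage in N diffusive
    channels of diffusivity D, shared across K items, after delay T."""
    snr = 1.0 / (2.0 * D * T)                       # per-channel SNR after time T
    bits_per_item = (N / K) * 0.5 * np.log2(1.0 + snr)
    return span ** 2 / (2 * np.pi * np.e) * 2.0 ** (-2.0 * bits_per_item)

N, D = 8, 0.05                                      # hypothetical parameter values
T_values = np.array([0.5, 1.0, 2.0, 3.0])           # delays (s)
for K in (1, 2, 4, 6):
    bound = coded_storage_bound(K, T_values, N, D)
    print(f"K={K}: bound on normalized MSE -> {np.round(bound, 5)}")
```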

Equation 2 exhibits some characteristic features, including, first, a joint dependence on the number of stored items and the storage duration. According to this expression, the time-course of memory decay depends on the number of items. This effect arises because items compete for the same limited memory resources and when an item is allocated fewer resources it is more susceptible to the effects of noise over time. Second, the scaling with item number is qualitatively different than the scaling with storage duration: Increasing the number of stored items degrades performance much more steeply than increasing the storage interval, because item number is in the exponent. For a single memorized feature or item, the decline in accuracy with storage interval duration is predicted to be weak. On the other hand, increasing the number of memorized items while keeping the storage duration fixed should lead to a rapid deterioration in memory accuracy.

We next consider whether the performance of an optimal encoder (given this lower bound) can be distinguished from the direct storage model based on human performance data. The two predictions differ in their dependence upon the number of independent storage channels or networks, N, which we do not know how to control in human behavior. Equally important, since Equation 2 provides a theoretical limit on performance, it is of interest to learn whether human behavior approximates the limit, and where it might deviate from it.

Comparison of theoretical bound with human performance

In comparing the psychophysical data to the theoretical bound on short-term memory performance, there are two unknown parameters, 1/2𝒟 (the inverse diffusivity in each persistent activity network) and N (the number of such networks), both of which scale linearly with the neural resource of neuron number. The product of these parameters corresponds to total neural resource exactly as in the direct storage case. We fit Equation 2 to human performance data, assuming as in the direct storage model that the total neural resource is fixed across all item numbers and delay durations, and setting the 100 ms delay values of the theoretical curves to their empirical values.
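A minimal sketch of this two-parameter fit, again using hypothetical arrays standing in for Figure 1D, is a grid search over (N, 𝒟) with each item-number curve anchored to its 100 ms value; it mirrors the error surface of Figure 4A, but the numerical values are placeholders:

```python
import numpy as np

K_values = np.array([1, 2, 4, 6])
T_values = np.array([0.1, 1.0, 2.0, 3.0])            # seconds; 0.1 s is the baseline
mse_data = np.array([[0.002, 0.003, 0.003, 0.004],   # hypothetical normalized MSE
                     [0.004, 0.006, 0.007, 0.008],
                     [0.010, 0.018, 0.024, 0.028],
                     [0.016, 0.035, 0.050, 0.060]])
sem_data = 0.1 * mse_data + 1e-4                     # hypothetical SEMs

def coded_bound(K, T, N, D, span=1.0):
    """Equation 2 as a function of item number K, delay T, and parameters N, D."""
    return span ** 2 / (2 * np.pi * np.e) * (1.0 + 1.0 / (2 * D * T)) ** (-N / K)

T0, T_later = T_values[0], T_values[1:]
best = (np.inf, None, None)
for N in range(2, 101):                        # grid over number of channels
    for D in np.logspace(-4, 0, 200):          # grid over diffusivity
        # Anchor each curve to its 100 ms value, then score the later delays.
        shift = mse_data[:, :1] - coded_bound(K_values[:, None], T0, N, D)
        pred = coded_bound(K_values[:, None], T_later[None, :], N, D) + shift
        err = np.sum(((pred - mse_data[:, 1:]) / sem_data[:, 1:]) ** 2)
        if err < best[0]:
            best = (err, N, D)

print("best weighted error %.2f at N=%d, D=%.4g" % best)
```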

The resulting best fit between theory and human behavior is excellent (Figure 4E; p values that the data means may occur by sampling from the model, excluding the T=100 ms time-points: 0.99,0.07,0.75 for 1 item; 0.46,0.07,0.60 for 2 items; 0.54,0.24,0.43 for 4 items; 0.89,0.38,0.32 for 6; all values are larger than 0.05, most much more so. These p values indicate a significantly better fit to data than obtained with the direct storage model).

Figure 4. Multiplicity of reasonable parametric solutions for the well-coded storage model, with N = 5–10 networks providing the best fits to human performance.

(A) The weighted least-squares error (colorbar indicating size of error on right) of the well-coded model fit to psychophysics data as a function of the two fit parameters, 𝒟 and N. The deep blue valley running near the diagonal of the parameter space constitutes a set of reasonable fits to the data. (B) Three fits to the data using parameters along the valley, sampled at N=5,10, and 100. These three parameter sets are indicated by white circles in (A). (C) Blue curve: the weighted least-squares error in the fit between data and theory along the bottom of the valley seen in (A). Gray curve: the total resource use for the corresponding points along the valley.

Figure 4—figure supplement 1. Performance of individual subjects and fits to well-coded storage model.

The responses of individual subjects are also well-fit by the functional form of the theoretically obtained bound on memory performance. Top two panels: the quality of fit (weighted squared error of fit) as a function of the two parameters of the theory (dark blue = best fit; dark red = worst fit), for each of the 10 subjects in the study. Middle two panels: the quality of fit along the valley defined by the dark blue area in the top two panels, plotted as a function of N, for each of the 10 subjects. Most subjects have an optimal value of 4 ≤ N ≤ 10. (The two outliers, subjects 7 and 8, with N=2 and N=20, respectively, have rather flat fit quality along the entire valley as a function of N, and thus other values of N produce very similar fit quality.) Bottom two panels: individual subject performance (circles: mean-squared error averaged across trials; error bars are the across-trial SEM) as a function of storage interval duration, for different item numbers (1, 2, 4 and 6 items; black, blue, cyan and green, respectively). Solid curves: fits from the theoretical bound on performance (minimum weighted squared error).
Figure 4—figure supplement 2. Fits of individual subject performance to direct storage model with hypothesis comparison score between direct and well-coded storage models.

The responses of individual subjects fit with weighted least-squares to the direct storage model. Weights are equal to the empirical standard error (SEM). The Bayesian Information Criterion-based hypothesis comparison score between the well-coded storage model and the direct storage model for individual subjects (ΔBIC) is indicated at the top of each subplot.

If we penalize the well-coded storage model for its extra parameter compared to direct storage (1/2𝒟 and N, versus the single parameter 𝒟/N for the direct storage model) through the Bayesian Information Criterion (BIC), a likelihood-based hypothesis comparison test (that more stringently penalizes model parameters than the AIC, or Akaike Information Criterion), the evidence remains very strongly in favor of the well-coded memory storage model compared to direct storage (ΔBIC ≈ 99 ≫ 10, where 10 is the cutoff for ‘very strong’ support) (Kass and Raftery, 1995). In fact, according to the BIC, the discrepancy in the quality of fit to the data between the models is so great that the increased parameter cost of the well-coded memory model barely perturbs the evidence in its favor. Additional statistical controls are given in the Appendix: jackknife cross-validation of the two models (Figure 3—figure supplement 1, Figure 3—figure supplement 2), exclusion of the T=100 ms point on the grounds that it might represent iconic memory recall rather than short-term memory (Figure 3—figure supplement 3), and redefinition of the number of items in memory to take into account the colors as well as the orientations of the objects (Figure 3—figure supplement 4). The results are qualitatively unchanged, and also do not result in large quantitative deviations in the extracted parameters (discussed below).
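For this kind of model comparison, BIC can be approximated from residual sums of squares under an i.i.d. Gaussian error model; a minimal sketch (the residual values and point count below are placeholders, not the fitted quantities reported here):

```python
import numpy as np

def bic_from_rss(rss, n_points, n_params):
    """BIC under an i.i.d. Gaussian error model with unknown variance:
    BIC = n * ln(RSS / n) + k * ln(n)."""
    return n_points * np.log(rss / n_points) + n_params * np.log(n_points)

n_points = 12              # number of fitted data points (placeholder)
rss_direct = 0.023         # placeholder residual sum of squares, direct model
rss_coded = 0.0016         # placeholder residual sum of squares, coded model

delta_bic = bic_from_rss(rss_direct, n_points, 1) - bic_from_rss(rss_coded, n_points, 2)
# Values above ~10 are conventionally read as 'very strong' evidence for the
# model with the lower BIC (Kass and Raftery, 1995).
print("Delta BIC (direct - coded):", round(delta_bic, 1))
```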

The two-dimensional parameter space for fitting the theory to the data contains a one-dimensional manifold of reasonable solutions, Figure 4A (dark blue valley), most of which provide better fits to the data than the direct storage model. Some of these different fits to the data are shown in Figure 4B. At large values of N, the manifold is roughly a hyperbola in log N and log(1/2𝒟), suggesting that the logarithms of the two neural resource parameters can roughly trade off with each other; indeed, the total resource use in the one-dimensional solution valley is roughly constant at large N, Figure 4C (gray curves). However, at smaller N, the resource use drops with increasing N. The fits are not equally good along the valley of reasonable solutions, and the best fit lies near N=5 independent networks or channels (for jackknife cross-validation fits, see Figure 3—figure supplement 1 and Figure 3—figure supplement 2, where the best fits for the coded model can be closer to N=10; thus, the figure obtained for the number of memory networks should be taken as an order-of-magnitude estimate rather than an exact value). Resource use in the valley declines with increasing N to its asymptotic constant value (thus larger N would yield bigger representational efficiencies); however, by N=5, resource use is already close to its final asymptotic value, thus the gains of increasing the number of separate memory networks beyond N = 5–10 diminish. The theory also provides good fits to individual subject performance for all ten subjects, using parameter values within a factor of 10 (and usually much less than a factor of 10) of each other (see Appendix).

Comparison of neural resource use in direct and well-coded storage models of memory

Finally, we compare the neural resources required for storage in the best-fit direct storage model with those required by the well-coded storage model. We quantify the neural resources required for well-coded storage as the product of the number of networks N with the inverse diffusive coefficient 1/2𝒟. This is proportional to the number of neurons required to implement storage. To replicate human behavior, coded storage requires resources totaling N/2𝒟 ≈ 32 (in units of seconds) for N=5, and N/2𝒟 ≈ 22 (s) for N=10, corresponding to the parameter settings for the fits in Figures 4C and 5B (center), respectively. By contrast, uncoded storage requires a 40-fold increase in N or a 40-fold decrease in the diffusive growth rate in squared error, 2𝒟, per network (or a corresponding increase in the product, N/2𝒟), because N/2𝒟 ≈ 1215 (s) under direct storage, to produce the best-fit result of Figure 3A. Thus, well-coded storage requires substantially fewer resources in the persistent activity networks for similar performance (assuming best fits of each produce similar performance). Equivalently, a memory system with good encoding can achieve substantially better performance with the same total storage resources than if information were directly stored in persistent activity networks.
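The disparity can be made concrete by inverting both models: given a target error at a fixed item number and delay, what total resource N/2𝒟 does each scheme require? The sketch below does this arithmetic for illustrative target values (the resulting ratio is for illustration only and will not reproduce the fitted 40-fold figure exactly):

```python
import numpy as np

def resource_direct(K, T, target_mse, span=1.0):
    """Resource N/(2*D) needed by direct storage (Equation 1) to hit target MSE."""
    return span ** 2 * K * T / target_mse

def resource_coded(K, T, target_mse, N, span=1.0):
    """Resource N/(2*D) needed by well-coded storage with N channels (Equation 2):
    solve span^2/(2*pi*e) * (1 + 1/(2*D*T))^(-N/K) = target_mse for 1/(2*D)."""
    snr_needed = (span ** 2 / (2 * np.pi * np.e * target_mse)) ** (K / N) - 1.0
    return N * T * snr_needed        # equals N * (1/(2*D)), since 1/(2*D) = T*SNR

K, T, target = 6, 3.0, 0.01          # illustrative: 6 items, 3 s delay, target MSE
print("direct storage resource:", round(resource_direct(K, T, target), 1), "s")
for N in (5, 10, 50):
    print(f"coded storage resource (N={N}):",
          round(resource_coded(K, T, target, N), 1), "s")
```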

This result on the disparity in resource use between uncoded and coded information storage is an illustration of the power of strong error-correcting codes. Confronted with the prospect of imperfect information channels, finitely many resources, and the need to store or transmit information faithfully, one may take two different paths.

The first option is to split the total resources into K storage bins, into which the K variables are stored; when there are more variables, there are more bins and each variable receives a smaller bin. The other is to store N quantities in N bins regardless of K, by splitting each of the K variables into N pieces and assigning a piece from each of the different variables to one bin; when there are more variables, each variable gets a smaller piece of the bin. In the former approach, which is similar to the direct storage scenario, increasing N would lead to improvements in the fidelity of each of the K channels, Figure 4D. In the latter approach, which is the strong coding strategy, increasing N would increase the number of channels while keeping their fidelity fixed, Figure 4B. The latter ultimately yields a more efficient use of the same total resources in terms of the final quality of performance, especially for larger values of N, at least without considering the cost of the encoding and decoding steps.

If we hold the total resource N/2𝒟 ∝ NM fixed, the lowest achievable MSE (Equation 2) in the well-coded memory model is reached for maximally large N and thus maximally large 𝒟. However, human memory performance appears to be best-fit by N ≈ 10. If our model does capture the basic architecture of the human memory system, it is not clear why the memory system might operate in a regime of relatively small N. First, note that with increasing N, the total resource cost by N=10 is already down to within 10% of the minimum resource cost reached at much larger N. Second, note that the theory is derived under the ‘diffusive’ memory storage assumption: that within a storage network, information loss is diffusive. Thus, the assumption implicitly made while varying the parameter N in Figure 4C is that as the number of networks (N) is increased, the diffusivity 𝒟 ∝ 1/M per network will simply increase in proportion to keep NM fixed. However, the dynamics of persistent activity networks do not remain purely diffusive once the resource per network drops below a certain level: a new kind of non-diffusive error can start to become important (Schwab DJ & Fiete I (in preparation)). In this regime, the effective diffusivity in the network can grow much faster than the inverse network size. The non-diffusive errors produce large, non-local errors (which may be consistent with ‘pure guessing’ or ‘sudden death’ errors sometimes reported in memory psychophysics [Zhang and Luck, 2009]). It is possible that the memory networks operate in a regime where each channel (memory network) is allocated enough resources to mostly avoid non-diffusive errors, and this limits the number of networks.

Discussion

Key contributions

We have provided a fundamental lower-bound on the error of recall in short-term memory as a function of item number and storage duration, if information is stored in graded persistent activity networks (our noisy channels). This bound on performance with an underlying graded persistent activity mechanism provides a reference point for comparison with human performance regardless of whether the brain employs strong encoding and decoding processes in its memory systems. The comparison can yield insights into the strategies the brain does employ.

Next, we used empirical data from analog measurements of memory error as a function of both temporal delay and the number of stored items. Using results from the theory of diffusion on continuous attractor manifolds in neural networks, we derived an expression for memory performance if the memorized variables were stored directly in graded persistent activity networks. The resulting predictions did not match human performance. The mismatch invites further investigation into whether and how direct-storage models can be modified to account for real memory performance.

Finally, we found that the bound from theory provided an (unexpectedly) good match to human performance, Figure 4. We are not privy to the actual values of the parameters N, 1/2𝒟 in the brain and it is possible the brain uses a value of, to take an arbitrary example, 5×N to achieve a performance reached with N in Equation 2, which would be (quantitatively) ‘suboptimal’. Nevertheless, the possibility that the brain might perform qualitatively according to the functional form of the theoretical bound is highly nontrivial: As we have seen, the addition of appropriate encoding and decoding systems can reduce the degradation in accuracy from scaling polynomially (as 1/N) in the number of neurons, as in direct storage, to scaling exponentially (as e^(−αN) for some α > 0). This is a startling possibility that requires more rigorous examination in future work.

Are neural representations consistent with exponentially strong codes?

Typical population codes for analog variables, as presently understood, exhibit linear gains in performance with N; such codes involve neurons with single-bump or ramp-like tuning curves that are offset or scaled copies of one another. For related reasons, persistent activity networks with such tuning curves also exhibit linear gains in memory performance with N (Burak and Fiete, 2012). These ‘classical population codes’ are ubiquitous in the sensory and motor peripheries as well as some cognitive areas. So far, the only example of an analog neural code known in principle to be capable of exponential scaling with N is the periodic, multi-scale code for location in grid cells of the mammalian entorhinal cortex (Hafting et al., 2005; Sreenivasan and Fiete, 2011; Mathis et al., 2012): with this code, animals can represent an exponentially large set of distinct locations at a fixed local spatial resolution using linearly many neurons (Fiete et al., 2008; Sreenivasan and Fiete, 2011).
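The flavor of this exponential scaling can be conveyed with a toy, noise-free modular code: if an integer location is represented only by its remainders modulo a set of pairwise-coprime periods, the Chinese remainder theorem guarantees unique decodability over the product of the periods, while the number of required phase bins grows only with their sum. This is an idealized illustration of the counting argument, not a model of grid-cell activity:

```python
from math import prod

periods = [3, 5, 7, 11, 13]          # pairwise-coprime module periods (toy values)

def encode(location, periods):
    """Modular code: keep only the phase of the location within each module."""
    return tuple(location % p for p in periods)

capacity = prod(periods)             # distinct locations representable (product)
cost = sum(periods)                  # phase bins needed across modules (sum)

print("modules:", periods)
print("representable range (product of periods):", capacity)
print("coding cost (sum of periods):", cost)

# Brute-force check that all locations in the range map to distinct codewords.
codewords = {encode(x, periods) for x in range(capacity)}
print("unique decodability over full range:", len(codewords) == capacity)
```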

A literal analogy with grid cells would imply that all such codes should look periodic as a function of the represented variable, with a range of periods. A more general view is that the exponential capacity of the grid cell code results from two related features: First, no one group of grid cells with a common spatial tuning period carries full information about the coded variable (the spatial location of the animal) – location cannot be uniquely specified by the spatially periodic group response even in the absence of any noise. Second, the partial location information in different groups is independent because of the distinct spatial periods across groups (Sreenivasan and Fiete, 2011). In this more general view, strong codes need not be periodic, but there should be multiple populations that encode different, independent ‘parts’ of the same variable, which would be manifest as different sub-populations with diverse tuning profiles, and mixed selectivity to multiple variables.

It remains to be seen whether neural representations for short-term visual memory are consistent with strong codes. Intriguingly, neural responses for short-term memory are diverse and do not exhibit tuning that is as simple or uniform as is typical for classical population codes (Miller et al., 1996; Fuster and Alexander, 1971; Romo et al., 1999; Wang, 2001; Funahashi, 2006; Fuster and Jervey, 1981; Rigotti et al., 2013). An interesting prediction of the well-coded model, amenable to experimental testing, is that the representation within a memory channel must be in an optimized format, and that this format is not necessarily the one in which the information was initially presented. The brain would have to perform a transformation from stimulus-space into a well-coded form, and one might expect to observe this transition of the representation at encoding. (See, e.g., recent works (Murray et al., 2017; Spaak et al., 2017), which show the existence of complex and heterogeneous dynamic transformations in primate prefrontal cortex during working memory tasks.) The less orthogonal the original stimulus space is to noise during storage and the more optimized the code for storage to resist degradation, the more different the mnemonic code will be from the sample-evoked signal. Studies that attempt to decode a stimulus from delay-period neural or BOLD activity on the basis of tuning curves obtained from the stimulus-evoked period are well-suited to test this question (Zarahn et al., 1999; Courtney et al., 1997; Pessoa et al., 2002; Jha and McCarthy, 2000; Miller et al., 1996; Baeg et al., 2003; Meyers et al., 2008; Stokes et al., 2013): If it is possible to use early stimulus-evoked responses to accurately decode the stimulus over the delay-period (Zarahn et al., 1999; Courtney et al., 1997; Pessoa et al., 2002; Jha and McCarthy, 2000; Miller et al., 1996), it would suggest that information is not re-coded for noise resistance. On the other hand, a representation that is reshaped during the delay period relative to the stimulus-evoked response (Baeg et al., 2003; Meyers et al., 2008; Stokes et al., 2013) might support the possibility of re-coding for storage.

On the other hand, the encoding and decoding steps for strong codes add considerable complexity to the storage task, and it is unclear whether these steps can be performed efficiently so that the efficiencies of these codes are not nullified by their costs. In light of our current results, it will be interesting to further probe with neurophysiological tools whether storage for short-term visual memory is consistent with strong neural codes. With psychophysics, it will be important to compare human performance and the information-theoretic bound in greater detail. On the theoretical side, studying the decoding complexity of exponential neural codes is a topic of ongoing work (Fiete et al., 2014; Chaudhuri and Fiete, 2015), where we find that non-sparse codes made up of a product of many constraints on small subsets of the codewords might be amenable to strong error correction through simple neural dynamics.

Relationship to existing work and questions for the future

Compared to other information-theoretic considerations of memory (Brady et al., 2009; Sims et al., 2012), the distinguishing feature of our approach is our focus on neuron- or circuit-level noise and the fundamental limits such noise will impose on persistence.

Our theoretical framework permits the incorporation of many additional elements: Variable allocation of resources during stimulus presentation based on task complexity, perceived importance, attention, and information loading rate, may all be incorporated into the present framework. This can be achieved by modeling 1/2𝒟 and N as dependent functions (e.g. as done in [van den Berg et al., 2012; Sims et al., 2012; Elmore et al., 2011]) rather than independent parameters, and by exploiting the flexibility allowed by our model in uneven resource allocation across items in the display (Materials and methods).

The memory psychophysics literature contains evidence of more complex memory effects, including a type of response called ‘sudden death’ or pure guessing (Zhang and Luck, 2009; Anderson et al., 2011). These responses are characterized by not being localized around the true value of the cued variable, and contribute a uniform or pedestal component to the response distribution. Other studies show that these apparent pedestals may not be a separate phenomenon and can, at least in some cases, be modeled by a simple growth in the variance over a bounded (circular) variable of a unimodal response distribution that remains centered at the cue location (van den Berg et al., 2012; Bays, 2014; Ma et al., 2014). In our framework, good encoding ensures that for noise below a threshold, the decoder can recover an improved estimate of the stored variable; however, strong codes exhibit sharp threshold behavior as the noise in the channel is varied smoothly. Once the noise per channel grows beyond the threshold, so-called catastrophic or threshold errors will occur, and the errors will become non-local: this phenomenon will look like sudden death in the memory report. In this sense, an optimal coding and decoding framework operating on top of continuously diffusing states in memory networks is consistent with the existence of sudden death or pure guessing-like responses, even without a distinct underlying mechanistic process in the memory networks themselves. We note, however, that the fits to the data shown here were all in the below-threshold regime.

Another complex effect in memory psychophysics is misbinding, in which one or more of the multiple features (color, orientation, size, etc.) of an item are mistakenly associated with those from another item. This work should be viewed as a model of single-feature memory. Very recently, there have been attempts to model misbinding (Matthey et al., 2015). It may be possible to extend the present model in the direction of (Matthey et al., 2015) by imagining the memory networks to be multi-dimensional attractors encoding multiple features of an item.

It will be important to understand whether, in the direct coding model, modifications with plausible biological interpretations can lead to significantly better agreement with the data. From a purely curve-fitting perspective, the model requires stronger-than-linear improvement in recall accuracy with declining item number, and one might thus convert the combined resource parameter N/𝒟 in Equation 1 into a function that varies inversely with K. This step would result in a better fit, but would correspond in the direct storage model to an increased allocation of total memory resources when the task involves fewer items, an implausible modification. Alternatively, if multiple items are stored within a single persistent activity network, collision effects can limit performance for larger item numbers (Wei et al., 2012), but a quantitative result on performance as a function of delay time and item number remains to be worked out. Further examination of the types of data we have considered here, with respect to predictions that would result from a memory model dependent on direct storage of variables into persistent activity network(s), should help further the goal of linking short-term memory performance with neural network models of persistent activity.

Finally, note that our results stem from considering a specific hypothesis about the neural substrates of short-term memory (that memory is stored in a continuum of persistent activity states) and from the assumption that forgetting in short-term memory is undesirable but neural resources required to maintain information have a cost. It will also be interesting to consider the possibility of information storage in discrete rather than graded persistent activity states, with appropriate discretization of analog information before storage. Such storage networks will yield different bounds on memory performance than derived here (Koulakov et al., 2002; Goldman et al., 2003; Fiete et al., 2014), which should include the existence of small analog errors arising from discretization at the encoding stage, with little degradation over time because of the resistance of discrete states to noise. Also of great interest is to obtain predictions about degradation of short-term memory in activity-silent mechanisms such as synaptic facilitation (Barak and Tsodyks, 2014; Mi et al., 2017; Stokes, 2015; Lundqvist et al., 2016). A distinct alternate perspective on the limited persistence of short-term memory is that forgetting is a design feature that continually clears the memory buffer for future use and that limited memory allows for optimal search and computation that favors generalization instead of overfitting (Cowan, 2001). In this view, neural noise and resource constraints are not bottlenecks and there may be little imperative to optimize neural codes for greater persistence and capacity. To this end, it will be interesting to consider predictions from a theory in which limited memory is a feature, against the predictions we have presented here from the perspective that the neural system must work to avoid forgetting.

Materials and methods

Human psychophysics experiments

Ten neurologically normal subjects (age range 19-35 yr) participated in the experiment after giving informed consent. All subjects reported normal or corrected-to-normal visual acuity. Stimuli were presented at a viewing distance of 60 cm on a 21” CRT monitor. Each trial began with the presentation of a central fixation cross (white, 0.8° diameter) for 500 milliseconds, followed by a memory array consisting of 1, 2, 4, or 6 oriented bars (2° × 0.3° of visual angle) presented on a grey background on an imaginary circle (radius 4.4°) around fixation with equal inter-item distances (centre to centre). The colors of the bars in each trial were randomly selected out of eight easily-distinguishable colors. The stimulus display was followed by a blank delay of 0.1, 1, 2, or 3 seconds, and at the end of each sequence recall for one of the items was tested by displaying a ‘probe’ bar of the same color with a random orientation. Subjects were instructed to rotate the probe using a response dial (Logitech Intl. SA) to match the remembered orientation of the item of the same color in the sequence - henceforth termed the target. Each of the participants performed between 11 and 15 blocks of 80 trials. Each block consisted of 20 trials for each of the 4 possible item numbers, with 5 trials for each delay duration.

Overview of theoretical framework and key steps

Channel coding and channel rate

Consider transmitting information about K scalar variables in the form of codewords of power 1 (i.e., $\sum_{k=1}^{K} P^{(k)} = 1$, where $P^{(k)}$ is the average power allocated to encode item k, with the average taken over N different channel uses, so that the average power actually used is $\frac{1}{N}\sum_{i=1}^{N} (X_i^{(k)})^2 \le P^{(k)}$). The number of channel uses, N, is equivalent in our memory framework to the number of parallel memory channels, each of which introduces a Gaussian white noise of variance 2𝒟T. The rate of growth of variance of the variable stored in persistent activity networks, 2𝒟, is derived in Burak and Fiete (2012); here, when we refer to this diffusivity, it is in dimensionless units where the variable is normalized by its range.

The information throughput (i.e., the information rate per channel use, also known as channel rate) for such channels is bounded by (see Appendix for details):

R_{\mathcal{S}}(T) \equiv \sum_{k \in \mathcal{S}} R^{(k)} \le \frac{1}{2}\log\left(1 + \frac{\sum_{k \in \mathcal{S}} P^{(k)}}{2\mathcal{D}T}\right) \qquad (3)

where 𝒮 refers to any subset of the K items, {1, …, K}. Equation 3 defines an entire region of information rates that are achievable: the total encoding power or the total channel rate, or both, may be allocated to a single item, or distributed across multiple items. Thus, the expression of Equation 3 is compatible with interpretations of memory as either a continuous or a discrete resource (van den Berg et al., 2012; Zhang and Luck, 2008). (E.g., setting P^{(k)} = 0 for any k ≥ 5 would correspond to a 4-slot conceptualization of short-term memory. Distributing P^{(k)} = 1/K over any variable number K of statistically similar items would more closely describe a continuous resource model.) For both conceptualizations, this framework would allow us to consider, if the experimental setup warranted it, different allocations of power P^{(k)} and information rates across the encoded items.

For the delayed orientation matching task considered here, all presented items have equal complexity and a priori importance, so the relevant case is P^{(k)} = 1/K for all k = 1, …, K, together with equal-rate allocation, R^{(1)} = ⋯ = R^{(K)}, resulting in the following bound on per-item or per-feature information throughput in the noisy channel (see Appendix for more detail):

R^{(k)}(T) \le \frac{1}{2K}\log\left(1 + \frac{1}{2\mathcal{D}T}\right). \qquad (4)
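For concreteness, the per-item rate bound of Equation 4 can be evaluated numerically; the sketch below (Python) uses an illustrative diffusivity value and base-2 logarithms, so the rates are in bits per channel use.

```python
import numpy as np

def per_item_rate(K, T, two_D=1.0/2.28):
    """Per-item information rate bound of Equation 4 (bits per channel use),
    with total encoding power normalized to 1 and split evenly over K items."""
    return (1.0/(2*K)) * np.log2(1 + 1.0/(two_D*T))

for K in [1, 2, 4, 6]:
    rates = [per_item_rate(K, T) for T in [0.1, 1.0, 2.0, 3.0]]
    print(f"K={K}: " + "  ".join(f"{r:.3f}" for r in rates))
```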

Next we consider how this bound on information rate in turn constrains the reconstruction error of the source variable (i.e., the K-variable vector to be memorized, ϕ).

Source coding and rate-distortion theory

For a source coder that compresses a source variable, rate-distortion theory relates the source rate to the distortion incurred in reconstructing the source, at least for specific source distributions and specific error (distortion) metrics. For instance, if the source variables are each drawn uniformly from the interval [0, Φ], then the mean-squared error in reconstructing the source, D_MSE, is related to the source rate R through the rate-distortion function (see Appendix):

\frac{1}{2}\log\left(\frac{\Phi^2}{2\pi e D_{MSE}}\right) \le R \le \frac{1}{2}\log\left(\frac{\Phi^2}{12 D_{MSE}}\right). \qquad (5)

Joint source-channel coding

If the source rate is set to equal the maximal channel rate of Equation 4, then, using the expression of Equation 5 from rate-distortion theory, we obtain the predicted bound on distortion in the source variable after source coding and channel transmission. This predicted distortion bound is given in Equation 2. In general problems of information transmission through a noisy channel, it is not necessarily jointly optimal to separately derive the optimal channel rate and the optimal distortion for a given source rate, and then to set the source rate to equal the maximal channel rate; the total distortion of the source passed through the channel need not be lower-bounded by the resulting expression. However, in our case of interest the two-step procedure described above, deriving first the channel capacity and then inserting the capacity into the rate-distortion equation, yields a tight bound on distortion for the memory framework.
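A minimal numerical sketch of this two-step procedure (Python; the values of Φ, N, and 2𝒟 are illustrative placeholders, and base-2 logarithms are assumed):

```python
import numpy as np

def coded_mse_bound(K, T, N=10, two_D=1.0/2.28, Phi=np.pi):
    """Distortion lower bound for well-coded storage (Equation 2 / Equation 17):
    the total error-free rate R = N * R^(k)(T) is inserted into the
    rate-distortion function for a uniform source on [0, Phi]."""
    R_total_bits = (N/(2.0*K)) * np.log2(1 + 1.0/(two_D*T))
    return (Phi**2/(2*np.pi*np.e)) * 2.0**(-2*R_total_bits)

# The bound grows with both the delay T and the number of stored items K.
for K in [1, 2, 4, 6]:
    print(K, [round(coded_mse_bound(K, T), 4) for T in [0.1, 1.0, 2.0, 3.0]])
```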

This concludes the basic derivation, in outline form, of the main theoretical result of the manuscript. The Supplementary Information supplies more steps and detail.

Fitting of theory to data

In all fits of theory to data (for direct and well-coded storage), we assume that recall error at the shortest storage interval of 100 ms reflects baseline errors unrelated to the temporal loss of recall accuracy from noisy storage that is the focus of the present work. Under the assumption that this early (‘initial’) error is independent of the additional errors accrued over the storage period, it is appropriate to treat the baseline (T=100 ms) MSE as an additive contribution to the rest of the MSE (the variance of the sum of independent random variables is the sum of their variances). For this reason, we are justified in treating the T=100 ms errors as given by the data and setting these points as the initial offsets of the theory curves, which go on to explain the temporal (item-dependent) degradation of information placed in noisy storage.

The curves are fit by minimizing the summed weighted squared error between the theoretical prediction and the subject-averaged performance data over all item numbers and storage durations. The theoretical predictions are given by Equation 1 for direct storage and Equation 2 for well-coded storage. The weights in the weighted least-squares are the inverse SEMs for each (item number, storage duration) pair. The parameters of the fit are N/2𝒟 (direct storage model) or N and 2𝒟 (well-coded model). The selected parameter values are common across all item numbers and storage durations. The p values given in the main paper quantify the probability that the observed data means could have arisen from samples drawn from a Gaussian distribution centered on the theoretical prediction.
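The following sketch (Python/SciPy) illustrates one way to implement this weighted least-squares fit for the well-coded model; the data arrays are placeholders, and the handling of the T = 100 ms baseline offset is one reading of the procedure described above rather than the exact code used for the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder data arrays (subject-averaged MSE and SEM per item number and delay);
# in practice these come from the psychophysics experiment.
Ks = np.array([1, 2, 4, 6])
Ts = np.array([0.1, 1.0, 2.0, 3.0])
mse_data = np.ones((4, 4))            # shape (item number, delay); placeholder values
sem_data = 0.1 * np.ones((4, 4))      # placeholder SEMs
baseline = mse_data[:, 0]             # T = 100 ms errors, treated as additive offsets

def coded_prediction(params, K, T, Phi=np.pi):
    N, two_D = params
    return (Phi**2/(2*np.pi*np.e)) * (1 + 1.0/(two_D*T))**(-N/K)

def objective(params):
    # summed weighted squared error; weights are the inverse SEMs
    err = 0.0
    for i, K in enumerate(Ks):
        for j in range(1, len(Ts)):   # fit the delays beyond the 100 ms baseline
            # theory curve offset so that it passes through the T = 100 ms data point
            pred = baseline[i] + coded_prediction(params, K, Ts[j]) \
                               - coded_prediction(params, K, Ts[0])
            err += (mse_data[i, j] - pred)**2 / sem_data[i, j]
    return err

fit = minimize(objective, x0=[10.0, 0.5], bounds=[(1, 100), (1e-3, 10)])
print("best-fit N and 2D:", fit.x)
```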

Model comparison with the Bayesian information criterion

The Bayesian Information Criterion (BIC) is a likelihood-based method for model comparison, with a penalty term that accounts for the number of parameters used in the candidate models. BIC is a Bayesian model comparison method, as discussed in Kass and Raftery (1995).

Given data x that are (assumed to be) drawn from a distribution in the exponential family and a model M(θ) with associated parameters θ (θ is a vector of k parameters), the BIC is given by:

\mathrm{BIC} = -2\hat{L} + k\ln(2\pi n) \qquad (6)

where n is the number of observations, and \hat{L} is the maximized log-likelihood of the model (with parameters θ selected by maximum likelihood). The smaller the BIC, the better the model. The more positive the difference

\Delta\mathrm{BIC} = \mathrm{BIC}(M_2) - \mathrm{BIC}(M_1) \qquad (7)

between a pair of models M_1(θ_1) and M_2(θ_2) (with associated parameters θ_1 and θ_2, respectively, possibly of different dimensions k_1 and k_2), the stronger the evidence for M_1.

To obtain the BIC for the direct and coded models, the model distributions are taken to be Gaussians whose means (for each item number and delay) are given by the theoretical results of Equations 1 and 2, respectively, and whose variance is given by the empirically measured data variance across trials and subjects, computed separately per item number and delay. We used the parameters N = 10, 1/2𝒟 = 2.28 for the well-coded storage model, and (2𝒟/N) = 3.24 × 10⁻⁷ for the direct storage model, to obtain ΔBIC = 172.67. The empirical response variance is computed over the trials of all subjects, for a total of n = 660 observations for each (T, K) or (delay interval, item number) pair. The number of parameters is k = 1 for direct storage and k = 2 for well-coded storage. Setting the parameter numbers to k = 1+4 and k = 2+4 to take into account the 4 values of response errors at the shortest delay of T = 100 ms does not change the ΔBIC score, because the score is dominated by the likelihood term, so that these changes in the parameter penalty term have negligible effect.
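A sketch of the BIC comparison in this form (Python; the observations and the two models' predicted means are placeholders, and a single (delay, item number) condition is shown rather than the full data set):

```python
import numpy as np

def gaussian_loglik(x, mu, var):
    """Log-likelihood of observations x under a Gaussian with given mean and variance."""
    return np.sum(-0.5*np.log(2*np.pi*var) - (x - mu)**2/(2*var))

def bic(loglik, k, n):
    # form used in the text; smaller is better
    return -2*loglik + k*np.log(2*np.pi*n)

# Placeholder: trial-wise errors for one (delay, item number) condition,
# and the two models' predicted means for that condition.
rng = np.random.default_rng(1)
obs = rng.normal(0.3, 0.1, size=660)                 # n = 660 observations (placeholder)
var = obs.var()                                      # empirical variance across trials/subjects
ll_coded = gaussian_loglik(obs, mu=0.29, var=var)    # mean from Equation 2 (placeholder)
ll_direct = gaussian_loglik(obs, mu=0.45, var=var)   # mean from Equation 1 (placeholder)

delta_bic = bic(ll_direct, k=1, n=obs.size) - bic(ll_coded, k=2, n=obs.size)
print("ΔBIC (positive favors the well-coded model):", round(delta_bic, 2))
```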

Appendix

Joint source-channel coding and memory: justification and main results

Noisy information channels as a component of short-term memory systems

Noisy information channels have traditionally been used to model communication systems: in satellite or cell-phone communications, the transmitted information is degraded during passage from one point to another (Shannon, 1959; Wang, 2001; Cover and Thomas, 1991). Such transmission and degradation over space is referred to as a channel use. However, noisy channels are apt descriptors of any system in which information is put in to be accessed at a different place or a different time, with loss occurring in-between (Shannon, 1959; Wang, 2001; Cover and Thomas, 1991). Thus, hard drives are channels, with the main channel noise being the probability of random bit flips (from high-energy cosmic rays). Similarly, neural short-term memory systems store information and are subject to unavoidable loss because of the stochasticity of neural spiking and synaptic activation. In this sense, noise-induced loss in persistent activity networks is like passing the stored information through a noisy channel.

Channel coding

In channel coding, a message is first encoded to add redundancy, then transmitted through the noisy channel, and finally decoded at the decoder. Here, we establish the terminology and basic results from Shannon’s noisy channel coding theory (Shannon, 1959; Cover and Thomas, 1991), which are used in the main paper.

First, consider a task that involves storing or communicating a simple message, q, where q is a uniformly distributed index taking one of Q values: q ∈ {1, …, Q}. The message q is encoded according to a deterministic vector function (an encoding function), to generate the N-dimensional vector 𝐱(q) = (x_1(q), x_2(q), …, x_N(q)), Figure 1. This is the channel-coding step. The codeword 𝐱(q), which is redundant, is sent through the noisy channel, which produces an output 𝐲 according to some conditional distribution p(𝐲|𝐱) (𝐲 is an N-dimensional vector; the channel is specified by the distribution p(𝐲|𝐱)). In a memoryless channel (no feedback from the decoder at the end of the channel back to the encoder at the mouth of the channel), the channel obeys

p(\mathbf{y}|\mathbf{x}) = \prod_{n=1}^{N} p(y_n|x_n), \qquad (8)

where all distributions p(yn|xn) represent an identical distribution that defines the channel (Cover and Thomas, 1991). In this setup, transmission of the scalar source variable q involves N independent channel uses.

The decoder constructs a mapping 𝐲 → {1, …, Q}, to make an estimate q̂ of the received message from the channel outputs 𝐲. If q̂ ≠ q, the decoder has made an error. The error probability is the probability that q is decoded incorrectly, averaged over all q. This scenario, in which q is a single number (representing one of the messages to be communicated) and the decoder receives a single number (observation) from each channel use, is referred to as point-to-point communication (Cover and Thomas, 1991).

If the decoder can correctly decode q, the channel communication rate (also known as the rate per channel use), which quantifies how many information bits (about q) are transmitted per entry of the coded message 𝐱, is given by R = log₂(Q)/N. Shannon showed in his noisy channel coding theorem (Shannon, 1959; Cover and Thomas, 1991) that for any channel, in the limit N → ∞, it is possible in principle to communicate error-free through the channel at any rate up to the channel capacity C, defined by:

C = \max_{p(x)} \frac{1}{N} I(\mathbf{x};\mathbf{y}). \qquad (9)

For specific channels, it is possible to explicitly compute the channel capacity in terms of interesting parameters of the channel model and encoder; below, we will state such results for our channels of interest, for subsequent use in our theoretical analysis.
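The following toy simulation (Python/NumPy; the random codebook and parameter values are illustrative, not a claim about neural codes) makes the channel-coding setup concrete: Q messages are encoded as length-N codewords, corrupted by Gaussian noise, and decoded by nearest codeword. Holding the rate R = log₂(Q)/N fixed below capacity, the error probability typically falls as N grows.

```python
import numpy as np

rng = np.random.default_rng(2)

def error_rate(Q, N, noise_var, trials=2000):
    """Random Gaussian codebook with Q codewords of length N (unit average power),
    AWGN channel, nearest-codeword (maximum-likelihood) decoding."""
    codebook = rng.normal(size=(Q, N))
    codebook /= np.sqrt(np.mean(codebook**2, axis=1, keepdims=True))   # power ~ 1
    q = rng.integers(Q, size=trials)
    y = codebook[q] + np.sqrt(noise_var) * rng.normal(size=(trials, N))
    q_hat = np.argmin(np.sum((y[:, None, :] - codebook[None, :, :])**2, axis=-1), axis=1)
    return np.mean(q_hat != q)

# Rate R = log2(Q)/N is held at 1/2 bit per channel use while N grows;
# the capacity here is 0.5*log2(1 + 1/noise_var) ~ 0.9 bits, so R is below capacity.
for N in [4, 8, 12, 16]:
    Q = 2**(N//2)
    print(f"N={N:2d}  Q={Q:4d}  error rate={error_rate(Q, N, noise_var=0.4):.3f}")
```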

Point-to-point Gaussian channel with a power constraint

For a scalar quantity transmitted over an additive Gaussian white noise channel of variance 2𝒟T, with an average power constraint P for representing the codewords (i.e., $\frac{1}{N}\sum_{i=1}^{N} \|x_i\|^2 \le P$), the channel capacity, or maximum rate at which information can be transmitted without error, is given by (Cover and Thomas, 1991):

C = \frac{1}{2}\log\left(1 + \frac{P}{2\mathcal{D}T}\right). \qquad (10)

Gaussian multiple-access channel

Next, suppose the message is itself multi-dimensional (of dimension K), so that the message is 𝐪 = (q_1, …, q_K). (In a memory task, these K variables may correspond to different features of one item, or one feature each of multiple items, or some distribution of features and items. All features of all items are simply considered as elements of the message, appropriately ordered.)

The general framework for such a scenario is the multiple-access channel (MAC). In a MAC, separate encoders each encode one message element q_k (k = 1, …, K) as an N-dimensional codeword 𝐱_k(q_k). The full message 𝐪 is thus represented by a set of K different N-dimensional codewords, 𝐗(𝐪) = (𝐱_1(q_1), …, 𝐱_K(q_K)). The power of each encoder is limited to P^{(k)}, with a constraint on the summed power (we assume $\sum_{k=1}^{K} P^{(k)} \le 1$). The encoded outputs are transmitted through a channel with a single receiver at the end.

As before, we consider the channel to be Gaussian. In this Gaussian MAC model, the channel output 𝐲 is a single N-dimensional vector, like the output in the point-to-point communication case (Cover and Thomas, 1991). The MAC channel is defined by the distribution p(𝐲|𝐗) = p(𝐲|𝐱_1, …, 𝐱_K). For a Gaussian MAC, p(𝐲|𝐗) is a Gaussian distribution with mean equal to $\sum_{k=1}^{K} \mathbf{x}_k$ and variance equal to the noise variance. The decoder is tasked with reconstructing all K elements of 𝐪 from the N-dimensional 𝐲.

The probability of error is defined as the average probability of error across all K entries of the message. The fundamental limit on information transmission over the MAC is not a single number, but a region in a K-dimensional space: It is possible to allocate power and thus rates differentially to different entries of the message 𝐪, and information capacity varies based on allocation. Through Shannon’s channel coding theorem, the region of achievable information rates for the Gaussian MAC with noise variance 2𝒟T is given by:

R_{\mathcal{S}} \le \frac{1}{2}\log\left(1 + \frac{\sum_{k \in \mathcal{S}} P^{(k)}}{2\mathcal{D}T}\right), \qquad (11)

where 𝒮 refers to any subset of {1, …, K}, and we represent the summed rate for a given 𝒮 as $R_{\mathcal{S}} = \sum_{k \in \mathcal{S}} R^{(k)}$. In memory tasks, we assume the total power constraint is constant, regardless of the number of items, and K corresponds to the number of items. Thus, power allocation per item will generally vary (decrease) with item number.

To summarize, we have a fundamental limit on information transmission rates in a Gaussian multiple-access channel as described above.

Capacity of a Gaussian MAC with equal per-item rate equals point-to-point channel capacity

The summed information rate through a Gaussian MAC channel is maximized when the per-item rate is equal across items. Moreover, at this equal-rate per-item point, the Gaussian MAC model corresponds directly to a point-to-point Gaussian (AWGN) channel coding model, where the channel input has an average power constraint P, which is set to $P = \sum_k P^{(k)}$, where $P^{(k)}$ is the power constraint on the channel input of the k-th encoder of the original Gaussian MAC model. In this equivalent AWGN model, a single encoder is responsible for transmitting all of the K message elements, by dividing the point-to-point channel capacity equally among the message elements. The maximum information rate in a point-to-point AWGN channel is (1/2)log(1+SNR), and therefore the information rate per item, if the rate is divided evenly over all K items, is R^{(k)} = (1/2K)log(1+SNR). This capacity can be achieved by setting the inputs for the AWGN point-to-point channel to be the N-dimensional vector 𝐱, with $\mathbf{x} = \sum_{k=1}^{K} \mathbf{x}_k(q_k)$, where the $\mathbf{x}_k(q_k)$ are the set of K vectors of length N generated from the encoders of the Gaussian MAC. The ith component x_i of 𝐱 is $x_i = \sum_{k=1}^{K} x_i^k(q_k)$, where $x_i^k(q_k)$ is the ith element of the vector 𝐱_k, which encodes the message element q_k; therefore, x_i contains information about all components of the message (joint representation of message elements).

Comparing the expression for the Gaussian MAC information rate with the capacity result from the corresponding point-to-point Gaussian channel, R(k)=(1/2K)log(1+SNR), it is clear that the summed rate of the equal-rate per-item Gaussian MAC can achieve the same (optimal) information rate per item as the point-to-point AWGN channel.

Figure 4B of our main manuscript may be viewed as depicting the AWGN point-to-point channel, with a scalar input xi to each of the N memory networks (AWGN channels). It is interesting to note that both the AWGN channel and Gaussian MAC models suggest that the brain might encode distinct items independently but then store them jointly.
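A small sketch (Python/NumPy; the random codewords are stand-ins for actual encoders) of this superposition: K independently encoded items, each with power 1/K, are summed into a single N-dimensional stored vector with total power near 1, and the per-item rate bound is the equal-split value used in Equation 4.

```python
import numpy as np

rng = np.random.default_rng(3)

N, K = 10, 4                 # memory networks (channel uses) and items
P_k = 1.0/K                  # equal power split across items

# Each item k gets its own codeword x_k(q_k) with average power P_k; the random
# vectors below are stand-ins for actual encoders, not a strong code.
codewords = [np.sqrt(P_k) * rng.normal(size=N) for _ in range(K)]

# Superposition: one stored N-dimensional state carries all items jointly.
x = np.sum(codewords, axis=0)
print("total stored power:", np.mean(x**2))     # ~ 1 on average

# Per-item rate when the total rate is split evenly (the form used in Equation 4):
two_D, T = 1.0/2.28, 2.0
print("per-item rate bound:", (1.0/(2*K)) * np.log2(1 + 1.0/(two_D*T)))
```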

Point-to-point communication through a Gaussian channel with a peak amplitude constraint

Suppose the codewords are amplitude-limited, rather than collectively power-limited, so that each element satisfies $\|x_i\| \le A$ for some amplitude A. If we are considering each entry of the codeword as being stored in a persistent activity network, then the maximal range of each codeword entry is constrained, rather than just the average power across entries. In this sense, amplitude-constrained channels may be more apt descriptors than power-constrained channels.

For comparison with the capacity of a Gaussian channel with a power constraint P, we set without loss of generality $A = \sqrt{P}$. Then, for a scalar quantity transmitted with this amplitude constraint over an additive Gaussian white noise channel of variance 2𝒟T, the channel capacity is similar to that of the power-constrained Gaussian channel, but with the cost of a modest multiplicative pre-factor c that is smaller than, but close to, 1 (Softky and Koch, 1993; Raginsky, 2008):

C = \frac{c}{2}\log\left(1 + \frac{P}{2\mathcal{D}T}\right). \qquad (12)

If the SNR (= P/(2𝒟T)) is such that SNR < 1.05, then c ∈ [0.8, 1] (Raginsky, 2008). Therefore, the channel capacity of the amplitude-constrained Gaussian channel can be 80% or more of the channel capacity of the corresponding power-constrained Gaussian channel. In any case, the power-constrained Gaussian channel capacity expression is a good upper bound on the capacity of the amplitude-constrained version of that channel.

Joint source-channel coding

In memory experiments, it is not possible to directly measure information throughput in the internal storage networks. Rather, a related quantity that can be measured, and is thus the quantity of interest, is the accuracy of recall. In this section, we describe how the general bound on information throughput in the storage networks – derived in the previous section – can be used to strictly upper-bound the accuracy of recall in a specific class of memory tasks.

Consider a task that involves storing or communicating a variable ϕ. This variable is known as the information source. The information source may be analog or discrete, and uniform or not. To remove redundancies in the source distribution or to possibly even further compress the inputs (at the loss of information), the source may be passed through a source-coding step. (For instance, the real interval [-1,1] can be compressed through binary quantization into one bit by assigning the subinterval [-1,0] to the point 0, and [0,1] to 1, at the expense of precision.) The output of the source coder is known as the message, which was the assumed input to the noisy channel in the sections discussed above. The message is a uniformly distributed index q, taking one of Q values, q ∈ {1, …, Q}. The source rate is the number of bits allocated per source symbol, or log₂(Q).

For discrete, memoryless point-to-point Gaussian channels, Shannon’s separation theorem (Shannon, 1959; Cover and Thomas, 1991) holds, which means that to obtain minimal distortion of a source variable that must be communicated through a noisy channel, it is optimal to separately compute the channel information rate, then set the source rate to equal the channel rate. Rate-distortion theory from source coding will then specify the lower bound on distortion with this scheme. Because the separation theorem holds for the point-to-point AWGN channel considered above, and because the point-to-point AWGN rate equals the maximal summed MAC rate, we can apply the separation theorem to our memory framework and then use rate-distortion theory to compute the lower bound on distortion.

To minimize distortion according to the separation theorem, we therefore set the source rate log2(Q) to equal the maximum number of bits that may be transmitted error-free over the channel. With this choice, all messages are transmitted without error in the channel. Then, we apply rate-distortion theory to determine the minimum distortion achievable for the allocated source rate. For a given source rate allocation, the distortion depends on several factors: the statistics of the source (e.g. whether it is uniform, Gaussian, etc.), the source coding scheme, and on the distortion measure (e.g. mean absolute error (an L-1 norm), mean squared error (an L-2 norm), or another metric that quantifies the difference between the true source and its estimate). Closed-form expressions for minimum achievable distortion do not exist for arbitrary sources and distortion metrics, but crucially, there are some useful bounds on specific distortion measures including the mean squared error, which is our focus.

Mean squared error (MSE) distortion

For arbitrary source distributions, the relationship between source rate (R bits per source symbol) and minimum MSE distortion (DMSE(R)) at that rate, is given by:

h(\phi) - \frac{1}{2}\log\left(2\pi e\, D_{MSE}(R)\right) \;\le\; R \;\le\; \frac{1}{2}\log\left(\frac{\sigma_\phi^2}{D_{MSE}(R)}\right)

where h(ϕ) is the differential entropy of the source, σ²_ϕ is the variance of the source, and log is in base-2. The inequality on the right is saturated (becomes an equality) for a Gaussian source (Cover and Thomas, 1991). The inequality on the left is the Shannon Lower Bound (Sims et al., 2012) on MSE distortion for arbitrary memoryless sources, and it, too, is saturated for a Gaussian source (Cover and Thomas, 1991).

Specializing the above expression to a uniform source over the interval [0, Φ], we have h(ϕ) = log(Φ) and σ²_ϕ = Φ²/12. Thus, we obtain

\frac{1}{2}\log\left(\frac{\Phi^2}{2\pi e D_{MSE}}\right) \le R \le \frac{1}{2}\log\left(\frac{\Phi^2}{12 D_{MSE}}\right). \qquad (13)

Inverting the inequalities above to obtain bounds on the MSE distortion, we have

\frac{\Phi^2}{2\pi e}\, 2^{-2R} \;\le\; D_{MSE}(R) \;\le\; \frac{\Phi^2}{12}\, 2^{-2R}. \qquad (14)

Note that the upper and lower bounds are identical in form – proportional to Φ²·2^{−2R} – up to a constant prefactor that lies in the range [1/(2πe), 1/12]. Thus, the lower bound on distortion is given by

D_{MSE}(R,\Phi) = \alpha_{MSE}\,\frac{\Phi^2}{2\pi e}\, 2^{-2R}, \qquad (15)

where αMSE is an unknown constant of size about 1, somewhere in the range [1,2πe/12].

Now, we set the information rate R for the source (bits per source symbol) in the equation above to match the maximum rate for error-free transmission in the noisy storage information channel. The maximum number of bits that can be stored error-free is N times the channel capacity given in Equation 4, because Equation 4 represents the information capacity for each channel use, and each of the N storage networks represents one channel use. Thus, we have R = N R^{(k)}(T), where R^{(k)}(T) is given in Equation 4, and the minimum MSE distortion is:

D_{MSE}(\Phi,K,T) = \alpha_{MSE}\,\frac{\Phi^2}{2\pi e}\left(1 + \frac{P}{2\mathcal{D}T}\right)^{-N/K}. \qquad (16)

Because we are interested in the lower-bound on error, we set α_MSE to the lower bound of its range, α_MSE = 1, so that we obtain the expression given in the main paper (Equation 2):

D_{MSE}(\Phi,K,T) = \frac{\Phi^2}{2\pi e}\left(1 + \frac{P}{2\mathcal{D}T}\right)^{-N/K}. \qquad (17)

Indeed, any other choice of αMSE within its range [1,2πe/12] does not qualitatively affect our subsequent results in the main paper.

To summarize, we derived the bound given in Equation 16 by separately combining two different bounds - the lower-bound on achievable distortion at a source for a given source rate and the upper-bound on information throughput in a noisy information channel. This combination of the two separate bounds, where each bound did not take into account the statistics of the other process (the source bound was computed independently of the channel and the channel independently of the source), is in general sub-optimal. It is tight (optimal) in this case only because the uniform source and Gaussian channel obey the conditions of Shannon’s separation theorem, also known as the joint source-channel coding theorem (Cover and Thomas, 1991; Wang, 2001; MacKay, 2002; Shannon, 1959; Viterbi and Omura, 1979).

Bound on recall accuracy for amplitude-constrained channels

As noted in Section 2 of the Appendix, the power-constrained channel capacity is an upper bound for the amplitude-constrained channel capacity (amplitude $A = \sqrt{P}$). It follows that the lower-bound on distortion for power-constrained channels, Equation 16, is also a lower-bound for the amplitude-constrained channel. Further, because the channel capacity of an amplitude-constrained Gaussian channel is of the same form as the capacity of a power-constrained Gaussian channel, with a prefactor c that is close to 1, we easily see that the specific expression for MSE distortion is modified to be:

D_{MSE}(\Phi,K,T) = \alpha_{MSE}\,\frac{\Phi^2}{2\pi e}\left(1 + \frac{P}{2\mathcal{D}T}\right)^{-cN/K}. \qquad (18)

Because N is a free parameter of the theory, we may simply renormalize cN to equal N. Thus, the theoretical prediction obtained for a power-constrained channel is the same in functional form as that for an amplitude-constrained channel.

In comparing the theoretical prediction against the predictions of direct storage in persistent activity networks, however, we should take into account the factor c, noting that to produce an effective value of N requires N/c many networks, which is greater than N because c<1.

Non-asymptotic considerations

Many of the numerical fits in the paper involve values of N that are not large: N is of order 10. When transmitting information with smaller N, the error-free information rate is lower (Polyanskiy et al., 2010), or conversely, if transmitting at rates close to capacity with smaller numbers of channel uses (N) there can be decoding errors. In deriving our bound on distortion from joint-source channel coding theory, we inserted the asymptotic value of information rate (the capacity) into the rate-distortion function and assumed that information transmission at that rate would be error-free. If errors occur, the resulting distortion will be higher. It is important to note that, even far from the asymptotic limit in N, the derived lower-bound on distortion in Equation 16 remains a strict lower-bound; non-asymptotic effects can raise the overall error, not lower it.

Nevertheless, it is of interest to consider how distortion may be modified for values of N that are not asymptotically large. One would write the total non-asymptotic MSE distortion (DMSEasymp) as the sum of terms:

D_{MSE}^{\mathrm{asymp}} = D_{MSE}\,(1 - p_e) + D_e\, p_e. \qquad (19)

Here, D_MSE is the error-free distortion bound derived above, p_e is the probability of error in the non-asymptotic regime, and D_e is the distortion in case of error. If an error resulted in total loss of information about the transmitted (coded) variable, D_e would scale as Φ², independent of N or other parameters in the problem. The only dependence on N would then enter through the probability of error, p_e. The probability of error vanishes exponentially with N (Polyanskiy et al., 2010), and can be small even for relatively small values of N. The second term is in practice a small contributor to the MSE. Alternatively, one can ask how small N can be, and how far below the asymptotic capacity one must operate, to enable information transmission at or below a given error rate. Analytical and numerical results in Polyanskiy et al. (2010) show that at SNR values lower than the estimated SNR in the memory system model (SNR = P/(2𝒟T) = 1/(2𝒟T) ≈ 2.2 dB at T = 3 s and SNR ≈ 4 dB at T = 2 s; Figure 6 in Polyanskiy et al. (2010) uses SNR = 0 dB and p_e = 10⁻³), it is possible to remain within a factor of 1/3 of the asymptotic information capacity with N < 10. Thus, the non-asymptotic expectation is that the information transmission rate should be scaled down from the asymptotically achievable information rate (the capacity) by some factor c (in this case, c ≈ 3). Thus, through Equation 15, we see that the bound on distortion will remain the same as in Equation 2 of the main manuscript, with the replacement of N/K in the exponent by N/(cK). In other words, the previous values of the fit parameter N in the fits would actually correspond to cN. Thus, it actually takes c times more resources (where c scales slowly with 1/N) to achieve a given level of performance non-asymptotically as asymptotically.
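A small numeric illustration of Equation 19 (Python; the below-threshold distortion value and the choice D_e = Φ² are assumptions for illustration) shows that for small p_e the catastrophic-error term adds little to the total MSE:

```python
import numpy as np

def nonasymptotic_mse(D_mse, p_e, Phi=np.pi):
    """Equation 19: error-free distortion plus a catastrophic-error term.
    D_e is taken to scale as Phi**2 (total loss of the stored value)."""
    D_e = Phi**2
    return D_mse*(1 - p_e) + D_e*p_e

D_mse = 0.05                 # placeholder below-threshold distortion bound
for p_e in [0.0, 1e-3, 1e-2]:
    print(f"p_e={p_e:.0e}: total MSE ~ {nonasymptotic_mse(D_mse, p_e):.4f}")
```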

To summarize, the bound on distortion given in Equation 16 is still a strict lower-bound on distortion in the regime where N is not asymptotically large; moreover, the functional form of the bound can remain largely the same in the non-asymptotic regime because the error probability is small for modest N. In addition, it is possible to achieve a given low error probability at a fixed SNR by simply decreasing the information rate, which increases distortion in a way that is effectively the same as increasing the value of the free parameter N.

Direct (uncoded) storage in persistent activity networks

Modeling short-term memory as direct storage of variables in persistent activity networks produces results that are inconsistent with the data, as shown in the main paper. To obtain predictions for persistence and capacity through direct storage in persistent activity networks, first consider storing a single circular orientation variable, for a single bar in the delayed orientation matching task, as a bump in one ring network (Ben-Yishai et al., 1995; Amit, 1992; Zhang, 1996). The ring network would have neurons from all the N storage networks in our short-term memory system pooled together; thus, the network is N times larger. The mean squared error of a variable stored in a continuous attractor neural network with stochastic neural spiking grows linearly with the storage interval T over short intervals (with ‘short’ defined as all intervals before the root-mean-squared error has grown to be an appreciable fraction of the range of the variable, 2π). Let ϕ/Φ be the coded variable, with ϕ ∈ [0, Φ]. If the rate of growth of error in the individual storage networks of the main paper is 2𝒟 (recall that D̄ = D/P, where D is the coefficient of diffusion (Burak and Fiete, 2012); thus, the quantity D̄ describes the rate at which the stored variable drifts away from its initial value, normalized by the squared range of the variable, per unit power of the representation; alternatively, we may think of the total representational power as being normalized to 1 in all cases), then the rate of growth of squared error in the single ring network is 2𝒟/N (Burak and Fiete, 2012). The factor of N enters because, if all other quantities are held fixed, the diffusion coefficient in continuous attractor memory networks is inversely proportional to network size. Thus, the squared error in the variable at short times T is given by (ϕ(T) − ϕ(0))²/Φ² = 2𝒟T/N. In other words, we have

D_{MSE}(\Phi, K=1, T) = \Phi^2\,\frac{2\mathcal{D}T}{N} \qquad (20)
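A minimal random-walk sketch (Python/NumPy; a diffusion proxy with illustrative parameters, not a spiking ring-network simulation) checks the linear growth of the normalized squared error at rate 2𝒟/N assumed in Equation 20:

```python
import numpy as np

rng = np.random.default_rng(4)

two_D, N = 1.0/2.28, 10       # illustrative normalized diffusivity and pooled network size
dt, T = 0.01, 3.0
steps = int(T/dt)

# Diffusion proxy for the bump position (variable already normalized by its range):
# the variance grows at rate 2D/N per unit time, as for a continuous attractor
# that is N times larger than a single storage network.
phi = np.zeros(5000)          # many independent trials, all starting at zero error
for _ in range(steps):
    phi += np.sqrt((two_D/N)*dt) * rng.normal(size=phi.shape)

print("simulated MSE (normalized by Phi^2):", np.var(phi))
print("Equation 20 prediction            :", (two_D/N)*T)
```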

Next, consider storing K scalar variables, with each component ranging in [0, Φ], each represented in one of K different small networks constructed by dividing up the single storage network above; thus, each small network is 1/K the size of the network above. Relative to Equation 20 above, we therefore have

D_{MSE}(\Phi, K, T) = \Phi^2\,\frac{2\mathcal{D}KT}{N} \qquad (21)

In other words, for memory systems involving direct storage in persistent activity networks without special encoding, we expect the squared error to grow linearly with K and T. The prediction of uncoded storage in persistent activity networks can be compared directly with the prediction from encoded storage (Equation 2), because they involve the same parameters and the same resource use in the memory networks. While adding a proper encoding stage can reduce storage errors exponentially in N, uncoded storage results in decreases with N that are merely polynomial (more specifically, scaling as N⁻¹).

Finally, one may consider directly storing the K-dimensional variable in a single persistent activity network that is a K-dimensional ring network (a K-torus). In this situation, the neurons have to be arranged so that the number of neurons per linear dimension of the network scales as N^{1/K}. Thus, the rate of growth of squared error along each dimension of the network scales as 2𝒟/N^{1/K}, and we have

D_{MSE}(\Phi, K, T) = \Phi^2\,\frac{2\mathcal{D}T}{N^{1/K}} \qquad (22)

This scaling with T remains linear, while the improvement in squared error with N is weaker than the scaling in Equation 21 , which in turn is weaker than the scaling in Equation 2 , and consequently produces worse fits to the data than does Equation 21 . Therefore, we have chosen to contrast the better of two scenarios of direct (uncoded) storage, Equation 21 , against the predictions of the theory of short-term memory proposed in this work.
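To make the comparison of scalings concrete, the sketch below (Python; the parameter values are illustrative and are used in all three expressions purely to expose the functional forms) evaluates Equation 2 alongside Equations 21 and 22 at a fixed delay:

```python
import numpy as np

N, two_D, Phi = 10, 1.0/2.28, np.pi      # illustrative parameter values

def coded(K, T):        # Equation 2 / 17: error shrinks exponentially in N/K
    return (Phi**2/(2*np.pi*np.e)) * (1 + 1.0/(two_D*T))**(-N/K)

def direct_K_networks(K, T):   # Equation 21: K small networks; error grows with K and T
    return Phi**2 * two_D*K*T/N

def direct_torus(K, T):        # Equation 22: one K-dimensional attractor (K-torus)
    return Phi**2 * two_D*T/N**(1.0/K)

T = 2.0
for K in [1, 2, 4, 6]:
    print(f"K={K}: coded={coded(K, T):.4f}  "
          f"K networks={direct_K_networks(K, T):.4f}  torus={direct_torus(K, T):.4f}")
```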

Comparison of direct storage against coded storage in power- or amplitude-constrained channels

In the main text, we compared not only how the predictions of coded versus direct storage compare with each other as a function of T and K, but also compared total resource use to achieve a given performance with the two different models of storage. In the latter comparison, we derive the total neural resource, N/2𝒟, required in the two schemes. We report that direct storage requires a 40-fold larger N/2𝒟 than coded storage, basing our results on the expression for coded storage in power-constrained channels. As noted in Section 3 of the Appendix, the effective N for an amplitude-constrained channel, which might be a more apt constraint for persistent activity networks with bounded ranges, is actually N/c, where c is a prefactor close to but smaller than 1, that represents the fractional loss in channel capacity incurred by enforcing an amplitude rather than power constraint. As described in Raginsky (2008) (see also related work in Softky and Koch (1993)), the cost of replacing a power constraint by an amplitude constraint is modest, with c ∈ [0.8, 1] for an appropriate regime of channel SNR (this is the regime of SNR for our fits to the data). Thus, even with an amplitude constraint for the coded memory scenario, direct storage would require a 30-fold larger N/2𝒟.

Performance of individual subjects and comparison with theory

Here, we supply the data from individual subjects, as well as fits of the theory of Equation 2 and the direct storage model of Equation 1 to their performance.

The individual subject responses and the fits of the well-coded storage model are shown in Figure 4—figure supplement 1. We first plot the quality-of-fit or energy surface of the fits of the well-coded model to the individual subject data (top two rows in Figure 4—figure supplement 1) as the two parameters of the model are varied. These individual-subject solution spaces look qualitatively similar to the across-subject aggregates reported in the main manuscript. All subjects exhibit a 1D manifold of ‘good’ parameter settings, along which the model provides a reasonable match to the data. The quality of fit along the 1D manifold (valley) is shown in the next two rows of Figure 4—figure supplement 1; based on the local minima of these curves, we infer the optimal settings of N and 1/2𝒟 for each subject. The differences between individuals emerge in that the best N values range between 2 and 20, with the best values for most subjects lying between 4 and 11. Subjects whose optimal N deviates from this narrower range have essentially flat valleys between N = 2 and N = 20 (Figure 4—figure supplement 1), and thus the choice of N is not strongly constrained.

The minimum fit errors are necessarily larger than the minimum fit errors for the across-subject averaged data, because of the higher variability of individual subject data (fewer trials per subject than total trials across subjects). Nevertheless, the normalized squared errors of the fits can be quite low, and the theory provides good fits to the psychophysics data for the individual subjects.

We also fit the individual subject data to the direct storage models, to be able to compare the predictions from the two models, Figure 4—figure supplement 2. We then compute the Bayesian Information Criterion score for both the direct storage model and the well-coded storage model, and report the ΔBIC score for hypothesis comparison, Figure 4—figure supplement 2. Positive (negative) ΔBIC scores indicate support for the well-coded (direct) storage model, and an absolute value of 10 or greater indicates very strong support. Note that the ΔBIC scores for the individual subjects are much smaller in magnitude than the aggregate scores for all pooled data in the main manuscript, because the data set for each individual subject is smaller and has less statistical strength. Nevertheless, there is very strong support (|ΔBIC| > 10) for the well-coded model in 4 out of 10 subjects, close to strong support for direct storage in 2 out of 10 subjects (|ΔBIC| ≈ 10), positive support for direct storage in 2 subjects, and essentially insignificant support (|ΔBIC| < 2) in the 2 remaining subjects.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Onur Ozan Koyluoglu, Email: ozan.koyluoglu@berkeley.edu.

Lila Davachi, New York University, United States.

Funding Information

This paper was supported by the following grants:

  • National Science Foundation IIS-1464349 to Onur Ozan Koyluoglu.

  • Israel Science Foundation 1747/14 to Yoni Pertzov.

  • MRC Clinician Scientist Fellowship MR/P00878X to Sanjay Manohar.

  • National Institute for Health Research Oxford Biomedical Centre to Masud Husain.

  • Wellcome Trust to Masud Husain.

  • National Science Foundation IIS-1148973 to Ila R Fiete.

  • Simons Foundation to Ila R Fiete.

  • Howard Hughes Medical Institute Faculty Scholar Award to Ila R Fiete.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Software, Formal analysis, Writing—original draft, Writing—review and editing.

Data curation, Software, Writing—review and editing.

Data curation, Writing—review and editing.

Data curation, Writing—review and editing.

Conceptualization, Software, Writing—original draft, Writing—review and editing.

Ethics

Human subjects: The study reported here conforms to the Declaration of Helsinki, and all procedures were approved by the ethics committee of the National Hospital for Neurology and Neurosurgery (NHNN) prior to the start of the study. Research Ethics Committee number (ERC) 04/Q0406/60. Personal information about individuals was password protected and saved in compliance with the Data Protection Act 1998 (DPA).

Additional files

Transparent reporting form
DOI: 10.7554/eLife.22225.014

References

  1. Aksay E, Gamkrelidze G, Seung HS, Baker R, Tank DW. In vivo intracellular recording and perturbation of persistent activity in a neural integrator. Nature Neuroscience. 2001;4:184–193. doi: 10.1038/84023. [DOI] [PubMed] [Google Scholar]
  2. Amit D. Modeling brain function: The world of attractor neural networks. Cambridge University Press; 1992. [Google Scholar]
  3. Anderson DE, Vogel EK, Awh E. Precision in visual working memory reaches a stable plateau when individual item limits are exceeded. Journal of Neuroscience. 2011;31:1128–1138. doi: 10.1523/JNEUROSCI.4125-10.2011. [DOI] [PMC free article] [PubMed] [Google Scholar] [Retracted]
  4. Atkinson R, Shiffrin R. Human memory: A proposed system and its control processes. The Psychology of Learning and Motivation. 1968;2:89–195. doi: 10.1016/S0079-7421(08)60422-3. [DOI] [Google Scholar]
  5. Baeg EH, Kim YB, Huh K, Mook-Jung I, Kim HT, Jung MW. Dynamics of population code for working memory in the prefrontal cortex. Neuron. 2003;40:177–188. doi: 10.1016/S0896-6273(03)00597-X. [DOI] [PubMed] [Google Scholar]
  6. Barak O, Tsodyks M. Working models of working memory. Current Opinion in Neurobiology. 2014;25:20–24. doi: 10.1016/j.conb.2013.10.008. [DOI] [PubMed] [Google Scholar]
  7. Barrouillet P, De Paepe A, Langerock N. Time causes forgetting from working memory. Psychonomic Bulletin & Review. 2012;19:87–92. doi: 10.3758/s13423-011-0192-8. [DOI] [PubMed] [Google Scholar]
  8. Barrouillet P, Gavens N, Vergauwe E, Gaillard V, Camos V. Working memory span development: a time-based resource-sharing model account. Developmental Psychology. 2009;45:477–490. doi: 10.1037/a0014615. [DOI] [PubMed] [Google Scholar]
  9. Barrouillet P, Portrat S, Vergauwe E, Diependaele K, Camos V. Further evidence for temporal decay in working memory: reply to Lewandowsky and Oberauer (2009) Journal of Experimental Psychology: Learning, Memory, and Cognition. 2011;37:1302–1317. doi: 10.1037/a0022933. [DOI] [PubMed] [Google Scholar]
  10. Bays PM, Gorgoraptis N, Wee N, Marshall L, Husain M. Temporal dynamics of encoding, storage, and reallocation of visual working memory. Journal of Vision. 2011;11:6–15. doi: 10.1167/11.10.6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Bays PM, Husain M. Dynamic shifts of limited working memory resources in human vision. Science. 2008;321:851–854. doi: 10.1126/science.1158023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Bays PM. Noise in neural populations accounts for errors in working memory. Journal of Neuroscience. 2014;34:3632–3645. doi: 10.1523/JNEUROSCI.3204-13.2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Ben-Yishai R, Bar-Or RL, Sompolinsky H. Theory of orientation tuning in visual cortex. PNAS. 1995;92:3844–3848. doi: 10.1073/pnas.92.9.3844. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Blair HT, Sharp PE. Anticipatory head direction signals in anterior thalamus: evidence for a thalamocortical circuit that integrates angular head motion to compute head direction. Journal of Neuroscience. 1995;15:6260–6270. doi: 10.1523/JNEUROSCI.15-09-06260.1995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Boucheny C, Brunel N, Arleo A. A continuous attractor network model without recurrent excitation: maintenance and integration in the head direction cell system. Journal of Computational Neuroscience. 2005;18:205–227. doi: 10.1007/s10827-005-6559-y. [DOI] [PubMed] [Google Scholar]
  16. Brady TF, Konkle T, Alvarez GA. Compression in visual working memory: using statistical regularities to form more efficient memory representations. Journal of Experimental Psychology: General. 2009;138:487–502. doi: 10.1037/a0016797. [DOI] [PubMed] [Google Scholar]
  17. Brody CD, Romo R, Kepecs A. Basic mechanisms for graded persistent activity: discrete attractors, continuous attractors, and dynamic representations. Current Opinion in Neurobiology. 2003;13:204–211. doi: 10.1016/S0959-4388(03)00050-3. [DOI] [PubMed] [Google Scholar]
  18. Burak Y, Fiete IR. Accurate path integration in continuous attractor network models of grid cells. PLoS Computational Biology. 2009;5:e1000291. doi: 10.1371/journal.pcbi.1000291. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Burak Y, Fiete IR. Fundamental limits on persistent activity in networks of noisy neurons. PNAS. 2012;109:17645–17650. doi: 10.1073/pnas.1117386109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Campoy G. Evidence for decay in verbal short-term memory: a commentary on Berman, Jonides, and Lewis (2009) Journal of Experimental Psychology: Learning, Memory, and Cognition. 2012;38:1129–1136. doi: 10.1037/a0026934. [DOI] [PubMed] [Google Scholar]
  21. Chaudhuri R, Fiete IR. CoSyNe Meeting Abstract II-78. Salt Lake City, UT, USA: 2015. Using expander codes to construct Hopfield networks with exponential capacity. [Google Scholar]
  22. Compte A, Brunel N, Goldman-Rakic PS, Wang XJ. Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex. 2000;10:910–923. doi: 10.1093/cercor/10.9.910. [DOI] [PubMed] [Google Scholar]
  23. Conway AR, Kane MJ, Engle RW. Working memory capacity and its relation to general intelligence. Trends in Cognitive Sciences. 2003;7:547–552. doi: 10.1016/j.tics.2003.10.005. [DOI] [PubMed] [Google Scholar]
  24. Courtney SM, Ungerleider LG, Keil K, Haxby JV. Transient and sustained activity in a distributed neural system for human working memory. Nature. 1997;386:608–611. doi: 10.1038/386608a0. [DOI] [PubMed] [Google Scholar]
  25. Cover T, Thomas J. Elements of Information Theory. John Wiley and Sons, Inc; 1991. [DOI] [Google Scholar]
  26. Cowan N. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. Behavioral and Brain Sciences. 2001;24:87–114. doi: 10.1017/S0140525X01003922. [DOI] [PubMed] [Google Scholar]
  27. Elmore LC, Ma WJ, Magnotti JF, Leising KJ, Passaro AD, Katz JS, Wright AA. Visual short-term memory compared in rhesus monkeys and humans. Current Biology. 2011;21:975–979. doi: 10.1016/j.cub.2011.04.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Fiete IR, Burak Y, Brookings T. What grid cells convey about rat location. Journal of Neuroscience. 2008;28:6858–6871. doi: 10.1523/JNEUROSCI.5684-07.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Fiete IR, Schwab DS, Tran NM. A binary Hopfield network with information rate and applications to grid cell decoding. Proceedings of the 2nd Workshop on Biological Distributed Algorithms; Austin, TX, USA: 2014. [Google Scholar]
  30. Funahashi S. Prefrontal cortex and working memory processes. Neuroscience. 2006;139:251–261. doi: 10.1016/j.neuroscience.2005.07.003. [DOI] [PubMed] [Google Scholar]
  31. Fung CC, Wong KY, Wu S. A moving bump in a continuous manifold: a comprehensive study of the tracking dynamics of continuous attractor neural networks. Neural Computation. 2010;22:752–792. doi: 10.1162/neco.2009.07-08-824. [DOI] [PubMed] [Google Scholar]
  32. Fuster JM, Alexander GE. Neuron activity related to short-term memory. Science. 1971;173:652–654. doi: 10.1126/science.173.3997.652. [DOI] [PubMed] [Google Scholar]
  33. Fuster JM, Jervey JP. Inferotemporal neurons distinguish and retain behaviorally relevant features of visual stimuli. Science. 1981;212:952–955. doi: 10.1126/science.7233192. [DOI] [PubMed] [Google Scholar]
  34. Goldman MS, Levine JH, Major G, Tank DW, Seung HS. Robust persistent neural activity in a model integrator with multiple hysteretic dendrites per neuron. Cerebral Cortex. 2003;13:1185–1195. doi: 10.1093/cercor/bhg095. [DOI] [PubMed] [Google Scholar]
  35. Hafting T, Fyhn M, Molden S, Moser MB, Moser EI. Microstructure of a spatial map in the entorhinal cortex. Nature. 2005;436:801–806. doi: 10.1038/nature03721. [DOI] [PubMed] [Google Scholar]
  36. Harrison SA, Tong F. Decoding reveals the contents of visual working memory in early visual areas. Nature. 2009;458:632–635. doi: 10.1038/nature07832. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Jha AP, McCarthy G. Prefrontal Activity during Delayed-response Tasks Requiring Response Selection and Preparation. Proceedings of Cognitive Neuroscience Society 2000 [Google Scholar]
  38. Jonides J, Lewis RL, Nee DE, Lustig CA, Berman MG, Moore KS. The mind and brain of short-term memory. Annual Review of Psychology. 2008;59:193–224. doi: 10.1146/annurev.psych.59.103006.093615. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Kass RE, Raftery AE. Bayes Factors. Journal of the American Statistical Association. 1995;90:773–795. doi: 10.1080/01621459.1995.10476572. [DOI] [Google Scholar]
  40. Koulakov AA, Raghavachari S, Kepecs A, Lisman JE. Model for a robust neural integrator. Nature Neuroscience. 2002;5:775–782. doi: 10.1038/nn893. [DOI] [PubMed] [Google Scholar]
  41. Lewandowsky S, Oberauer K, Brown GD. No temporal decay in verbal short-term memory. Trends in Cognitive Sciences. 2009;13:120–126. doi: 10.1016/j.tics.2008.12.003. [DOI] [PubMed] [Google Scholar]
  42. Luck SJ, Vogel EK. The capacity of visual working memory for features and conjunctions. Nature. 1997;390:279–281. doi: 10.1038/36846. [DOI] [PubMed] [Google Scholar]
  43. Lundqvist M, Rose J, Herman P, Brincat SL, Buschman TJ, Miller EK. Gamma and Beta Bursts Underlie Working Memory. Neuron. 2016;90:152–164. doi: 10.1016/j.neuron.2016.02.028. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Ma WJ, Husain M, Bays PM. Changing concepts of working memory. Nature neuroscience. 2014;17:347–356. doi: 10.1038/nn.3655. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. MacKay DJC. Information Theory, Inference & Learning Algorithms. New York: Cambridge University Press; 2002. [Google Scholar]
  46. Mathis A, Herz AV, Stemmler MB. Resolution of nested neuronal representations can be exponential in the number of neurons. Physical Review Letters. 2012;109:018103. doi: 10.1103/PhysRevLett.109.018103. [DOI] [PubMed] [Google Scholar]
  47. Matsukura M, Luck SJ, Vecera SP. Attention effects during visual short-term memory maintenance: Protection or prioritization? Perception & Psychophysics. 2007;69:1422–1434. doi: 10.3758/BF03192957. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Matthey L, Bays PM, Dayan P. A probabilistic palimpsest model of visual short-term memory. PLOS Computational Biology. 2015;11:e1004003. doi: 10.1371/journal.pcbi.1004003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Meyers EM, Freedman DJ, Kreiman G, Miller EK, Poggio T. Dynamic population coding of category information in inferior temporal and prefrontal cortex. Journal of Neurophysiology. 2008;100:1407–1419. doi: 10.1152/jn.90248.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Mi Y, Katkov M, Tsodyks M. Synaptic Correlates of Working Memory Capacity. Neuron. 2017;93:323–330. doi: 10.1016/j.neuron.2016.12.004. [DOI] [PubMed] [Google Scholar]
  51. Miller EK, Erickson CA, Desimone R. Neural mechanisms of visual working memory in prefrontal cortex of the macaque. Journal of Neuroscience. 1996;16:5154–5167. doi: 10.1523/JNEUROSCI.16-16-05154.1996. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Miller GA. The magical number seven plus or minus two: some limits on our capacity for processing information. Psychological Review. 1956;63:81–97. doi: 10.1037/h0043158. [DOI] [PubMed] [Google Scholar]
  53. Mongillo G, Barak O, Tsodyks M. Synaptic theory of working memory. Science. 2008;319:1543–1546. doi: 10.1126/science.1150769. [DOI] [PubMed] [Google Scholar]
  54. Murray JD, Bernacchia A, Roy NA, Constantinidis C, Romo R, Wang XJ. Stable population coding for working memory coexists with heterogeneous neural dynamics in prefrontal cortex. PNAS. 2017;114:394–399. doi: 10.1073/pnas.1619449114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Pertzov Y, Bays PM, Joseph S, Husain M. Rapid forgetting prevented by retrospective attention cues. Journal of Experimental Psychology: Human Perception and Performance. 2013;39:1224–1231. doi: 10.1037/a0030947. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Pertzov Y, Manohar S, Husain M. Rapid forgetting results from competition over time between items in visual working memory. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2017;43:528–536. doi: 10.1037/xlm0000328. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Pessoa L, Gutierrez E, Bandettini P, Ungerleider L. Neural correlates of visual working memory: fMRI amplitude predicts task performance. Neuron. 2002;35:975–987. doi: 10.1016/S0896-6273(02)00817-6. [DOI] [PubMed] [Google Scholar]
  58. Polyanskiy Y, Poor HV, Verdu S. Channel coding rate in the finite blocklength regime. IEEE Transactions on Information Theory. 2010;56:2307–2359. doi: 10.1109/TIT.2010.2043769. [DOI] [Google Scholar]
  59. Raginsky M. On the information capacity of gaussian channels under small peak power constraints. IEEE. 2008 doi: 10.1109/ALLERTON.2008.4797569. [DOI] [Google Scholar]
  60. Ricker TJ, Cowan N. Differences between presentation methods in working memory procedures: a matter of working memory consolidation. Journal of Experimental Psychology: Learning, Memory, and Cognition. 2014;40:417–428. doi: 10.1037/a0034301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Rigotti M, Barak O, Warden MR, Wang XJ, Daw ND, Miller EK, Fusi S. The importance of mixed selectivity in complex cognitive tasks. Nature. 2013;497:585–590. doi: 10.1038/nature12160. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Romo R, Brody CD, Hernández A, Lemus L. Neuronal correlates of parametric working memory in the prefrontal cortex. Nature. 1999;399:470–473. doi: 10.1038/20939. [DOI] [PubMed] [Google Scholar]
  63. Shadlen MN, Newsome WT. Noise, neural codes and cortical organization. Current Opinion in Neurobiology. 1994;4:569–579. doi: 10.1016/0959-4388(94)90059-0. [DOI] [PubMed] [Google Scholar]
  64. Shannon CE. A Mathematical Theory of Communication. Bell System Technical Journal. 1948;27:379–423, 623–656. doi: 10.1002/j.1538-7305.1948.tb01338.x. [DOI] [Google Scholar]
  65. Shannon CE. Coding theorems for a discrete source with a fidelity criterion. Institute of Radio Engineers, International Convention Record, part 4. 1959;7:142–163. [Google Scholar]
  66. Sims CR, Jacobs RA, Knill DC. An ideal observer analysis of visual working memory. Psychological Review. 2012;119:807–830. doi: 10.1037/a0029856. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Smith EE, Jonides J. Neuroimaging analyses of human working memory. PNAS. 1998;95:12061–12068. doi: 10.1073/pnas.95.20.12061. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Smith JG. The information capacity of amplitude- and variance-constrained scalar gaussian channels. Information and Control. 1971;18:203–219. doi: 10.1016/S0019-9958(71)90346-9. [DOI] [Google Scholar]
  69. Softky WR, Koch C. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience. 1993;13:334–350. doi: 10.1523/JNEUROSCI.13-01-00334.1993. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Spaak E, Watanabe K, Funahashi S, Stokes MG. Stable and Dynamic Coding for Working Memory in Primate Prefrontal Cortex. The Journal of Neuroscience. 2017;37:6503–6516. doi: 10.1523/JNEUROSCI.3364-16.2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Sreenivasan S, Fiete I. Grid cells generate an analog error-correcting code for singularly precise neural computation. Nature Neuroscience. 2011;14:1330–1337. doi: 10.1038/nn.2901. [DOI] [PubMed] [Google Scholar]
  72. Stokes MG, Kusunoki M, Sigala N, Nili H, Gaffan D, Duncan J. Dynamic coding for cognitive control in prefrontal cortex. Neuron. 2013;78:364–375. doi: 10.1016/j.neuron.2013.01.039. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Stokes MG. 'Activity-silent' working memory in prefrontal cortex: a dynamic coding framework. Trends in Cognitive Sciences. 2015;19:394–405. doi: 10.1016/j.tics.2015.05.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Supèr H, Spekreijse H, Lamme VA. A neural correlate of working memory in the monkey primary visual cortex. Science. 2001;293:120–124. doi: 10.1126/science.1060496. [DOI] [PubMed] [Google Scholar]
  75. Tan H, Yao K. Evaluation of rate-distortion functions for a class of independent identically distributed sources under an absolute-magnitude criterion. IEEE Transactions on Information Theory. 1975;21:59–64. doi: 10.1109/TIT.1975.1055335. [DOI] [Google Scholar]
  76. Taube JS. Head direction cells and the neurophysiological basis for a sense of direction. Progress in Neurobiology. 1998;55:225–256. doi: 10.1016/S0301-0082(98)00004-5. [DOI] [PubMed] [Google Scholar]
  77. van den Berg R, Shin H, Chou WC, George R, Ma WJ. Variability in encoding precision accounts for visual short-term memory limitations. PNAS. 2012;109:8780–8785. doi: 10.1073/pnas.1117465109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  78. Vembu S, Verdu S, Steinberg Y. The source-channel separation theorem revisited. IEEE Transactions on Information Theory. 1995;41:44–54. doi: 10.1109/18.370119. [DOI] [Google Scholar]
  79. Viterbi AJ, Omura JK. Principles of digital communication and coding. McGraw-Hill; 1979. [Google Scholar]
  80. Wang XJ. Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosciences. 2001;24:455–463. doi: 10.1016/S0166-2236(00)01868-3. [DOI] [PubMed] [Google Scholar]
  81. Wei Z, Wang XJ, Wang DH. From distributed resources to limited slots in multiple-item working memory: a spiking network model with normalization. Journal of Neuroscience. 2012;32:11228–11240. doi: 10.1523/JNEUROSCI.0735-12.2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Wilken P, Ma WJ. A detection theory account of change detection. Journal of Vision. 2004;4:1120–1135. doi: 10.1167/4.12.11. [DOI] [PubMed] [Google Scholar]
  83. Wimmer K, Nykamp DQ, Constantinidis C, Compte A. Bump attractor dynamics in prefrontal cortex explains behavioral precision in spatial working memory. Nature Neuroscience. 2014;17:431–439. doi: 10.1038/nn.3645. [DOI] [PubMed] [Google Scholar]
  84. Wu S, Hamaguchi K, Amari S. Dynamics and computation of continuous attractors. Neural Computation. 2008;20:994–1025. doi: 10.1162/neco.2008.10-06-378. [DOI] [PubMed] [Google Scholar]
  85. Zarahn E, Aguirre GK, D'Esposito M. Temporal isolation of the neural correlates of spatial mnemonic processing with fMRI. Cognitive Brain Research. 1999;7:255–268. doi: 10.1016/S0926-6410(98)00029-9. [DOI] [PubMed] [Google Scholar]
  86. Zhang K. Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. Journal of Neuroscience. 1996;16:2112–2126. doi: 10.1523/JNEUROSCI.16-06-02112.1996. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Zhang W, Luck SJ. Discrete fixed-resolution representations in visual working memory. Nature. 2008;453:233–235. doi: 10.1038/nature06860. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Zhang W, Luck SJ. Sudden death and gradual decay in visual working memory. Psychological Science. 2009;20:423–428. doi: 10.1111/j.1467-9280.2009.02322.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision letter

Editor: Lila Davachi1

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Fundamental bound on the persistence and capacity of short-term memory stored as graded persistent activity" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and David Van Essen as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Tim Buschman (Reviewer #1); John D Murray (Reviewer #2); Brad Postle (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

The manuscript presents an information-theoretic computational model of STM that suggests an intriguing new way that information may be coded in working memory. The theoretical framework developed here constitutes an important advance in linking neural circuit mechanisms to testable psychophysical behavior (here, working memory precision as a function of duration and load). The quantitative fit of the model to human behavior is compelling and bolsters the relevance of the theoretical advances.

The reviewers were all in agreement about the potential impact of the work presented. However, all also agreed that further discussion of the proposed model and its implications should be added to place this work more thoroughly in the broader context of the field. Some specific suggestions are made below. Furthermore, there are several detailed questions regarding aspects of the model that should also be addressed. I have edited and appended below the revisions that are essential to include in a revision.

Please address the following in a revision:

1) One reviewer noted that 'it takes too long to get to the point in the manuscript at which the reader knows, well, what the main point of the paper will be. It's not in the title, not in the Abstract, and, indeed, not clearly articulated until subsection “Information-theoretic bound on memory performance with well-coded storage” of the manuscript.' The first part of the manuscript is taken up with a lengthy exposition of why and how direct storage models are unsatisfactory. For a general-interest journal, one would want the central idea to be clearly articulated in one of the first paragraphs in the paper (not to mention in the Abstract), then the demonstration that direct storage models are insufficient to be dispatched within a few short paragraphs. Perhaps some of this could be accomplished in part by moving some of the text and analyses to figure legends? As things stand, the figures with their minimalist legends are inscrutable. One idea would be to display the panel from Figure 4E side-by-side with 3A and B, to permit a side-by-side comparison of the different approaches. Indeed, Figures 3 and 4 could be merged, together with much of the text between them.

2) A second major absence from the Introduction, which will raise concerns among many familiar with the current literature, is the near absence of any consideration of the growing number of suggestions that STM might be accomplished by mechanisms other than sustained activity. To name just a few, there's a recent TICS paper by Stokes that is explicitly devoted to this idea, there are several theoretical accounts by Tsodyks, Barak, and colleagues (nicely summarized in a recent Current Opinion review), and there's the nonlinear dynamical systems model of Lundqvist and colleagues, recently illustrated with data from Miller's group.

If some variant of these "activity-silent" accounts is correct, are the ideas presented in this manuscript irrelevant, or are there principles from the present theory that would apply? Additionally/alternatively, are there principles from the present theory that might apply to sustained activity supporting a behavior other than STM?

3) Some of the writing contains incomplete or misleading assertions. For example, the idea that there are constraints on the amount of time that information can be held in STM ignores the fact that a classically held hallmark of STM is precisely that it is not sensitive to the passage of time, per se. (Two examples are the findings of Keppel and Underwood, and the many demonstrations of prolonged retention of information in STM in anterograde amnesic patients.) Indeed, puzzlingly, one of the papers cited by the authors to substantiate their assertion is entitled "No temporal decay in verbal short-term memory."

4) The manuscript makes no contact with the growing literature of multivariate analyses of data from STM tasks, from nonhuman and human electrophysiology, and from human fMRI. Some of these studies show the ability to decode the contents of STM from delay-period activity with decoders trained on the sample-evoked signal. Others suggest that the neural code may be dynamic, with minimal if any cross-temporal generalization (i.e., "off-diagonal" decoding). How does the proposed theory relate to this empirical literature? Without reference to these broader literatures, the present manuscript might be more suitable for a more specialized computational journal.

5) The authors argue that the currently accepted model of working memory predicts a linear increase in mean-squared error (MSE) over time and load (MSE ~ (load)*(time)). In contrast, they find a sub-linear increase in MSE with time (Figure 3A and 3B). This sub-linearity is well fit by the well-coded model. However, some of this non-linearity could be due to other, less-capacity-limited, forms of memory at very short time delays. For example, iconic memory, thought to have an extremely high capacity, is likely still available at 100 ms (some might argue for longer). This could lead to a reduction in the MSE at the lowest time delays. Ideally the authors would control for this using masking stimuli. Alternatively, the authors could control for this by excluding the very short delays from the analysis (possibly increasing the maximum memory delay if needed for fits).

6) As with many working memory paradigms, it is not entirely clear how to define the working memory load in the current task. It seems subjects must remember multiple pieces of information per memorandum (e.g. both color and orientation) in all cases except for the single item. This would suggest memory load is actually 1, 4, 8, and 12. Does this non-linearity account for the poor fit of the linear "direct coding" model? It seems like it might not, given the poor fit in Figure 3B, but it would still be worth testing the two models with different values for memory load. Similarly, recent work has suggested some degree of independence of working memory load across the two visual hemifields. Again, this would suggest only the balanced displays can be directly compared (e.g. 2, 4, and 6 items). Does the well-coded model still provide a better fit if the analysis is restricted to these three conditions?

7) The authors appropriately use BIC to perform model comparison. However, such model comparison criteria often penalize parameters to different degrees. Did the authors also find that the well-coded model generalized to a withheld dataset better than the direct coding model?

8) Recent work has debated whether errors during working memory are due, in part, to guessing or not (e.g. Luck, Awh, Vogel, Bays, etc). In fact, Steve Luck argues for no increase in variance with load (or time?), instead only an increase in guess rate. If fitting a circular Gaussian to the distribution, do the authors find an increase in variance or an increase in baseline (or both)? Related to this, it isn't clear to me how the pure 'sudden-death' framework matches with the diffusivity arguments made here. It seems that perhaps the well-coded model could explain the existence of complete failures to remember if the signal diffuses too much, but the model would still argue for some diffusion of memory over time. This doesn't seem consistent with the current model. I know the authors attempt to address this in the Discussion section of the current manuscript, but I would encourage the authors to clarify their position.

9) This study uses the co-authors' human psychophysical data from Pertzov et al., 2016 Journal of Experimental Psychology. That study decomposed errors into three sources: (1) noisy representation; (2) mis-binding or non-target responses; and (3) random guessing. They reported that all three of these components increased with higher load and with longer delays. How do these prior findings relate to the present study? Are these different sources subsumed by the present model? Or are these important features that the present model (in the diffusive regime) does not account for? Does the present model produce only the first type of errors? The Authors mention that in another regime of the model, non-diffusive errors can produce pure guessing errors. Can the model speak to the mechanisms of mis-binding errors? Please include discussion of this point.

10) Regarding the implications for neural representations: The Authors discuss that one prediction of the model would be signatures of exponentially strong codes in neural representations. As I understand it, one way this could be implemented is that each of the N memory networks has a different spatial period for its periodic coding, as in the case of grid cells. The other feature of the present model is that for multi-item working memory, a memory network contains signals for all of the K items. It would be helpful if the Authors could clarify what the implications for neural representations are of this feature of distributed multi-item coding. Does this imply that single neurons would show mixed selectivity for multiple items? Please include discussion of this point.

eLife. 2017 Sep 7;6:e22225. doi: 10.7554/eLife.22225.017

Author response


Please address the following in a revision:

1) One reviewer noted that 'it takes too long to get to the point in the manuscript at which the reader knows, well, what the main point of the paper will be. It's not in the title, not in the Abstract, and, indeed, not clearly articulated until subsection “Information-theoretic bound on memory performance with well-coded storage” of the manuscript.' The first part of the manuscript is taken up with a lengthy exposition of why and how direct storage models are unsatisfactory. For a general-interest journal, one would want the central idea to be clearly articulated in one of the first paragraphs in the paper (not to mention in the Abstract), and the demonstration that direct storage models are insufficient to be dispatched within a few short paragraphs. Perhaps some of this could be accomplished in part by moving some of the text and analyses to figure legends? As things stand, the figures with their minimalist legends are inscrutable. One idea would be to display the panel from Figure 4E side-by-side with 3A and B, to permit a side-by-side comparison of the different approaches. Indeed, Figures 3 and 4 could be merged, together with much of the text between them.

We have now edited the Abstract and Introduction to convey what the manuscript is about much earlier in the text. Please see the new introductory paragraph: "In the present work, we make the following contributions: 1) Generate psychophysics predictions for information degradation as a function of delay period and number of stored items, if information is stored directly, without recoding, in persistent activity neural networks of a given size over a given time interval; 2) Generate psychophysics predictions (through the use of joint source-channel coding theory) for a model that assumes information is restructured by encoding and decoding stages before and after storage in persistent activity neural networks; 3) Compare these models to new analog measurements (Pertzov et al., 2017) of human memory performance on an analog task as the demands on both maintenance duration and capacity are varied."

Please note that the early results of the manuscript establish the theoretical predictions for direct storage in persistent activity networks. To our knowledge, these predictions about degradation as a function of time and item number with direct storage have not been made explicit before, and so are one part of our results (if they had been made before, it would have been easy to shorten this section and replace it with a citation). It is equally important to state the framework, formalism (including resource use parameters, etc.), and results for the direct storage model in the main results for comparison with the framework and parameters of the well-coded model, so that it is clear that we are making a fair comparison.

The figure captions are fairly long, and in merging plots as well as clarifying the captions as suggested, they have become slightly longer. Thus, moving more of the text of the results to the figure captions is not ideal. We have edited and shortened the direct storage Results section, but have not eviscerated it as we feel it is an integral part of our main result. As suggested, we have also merged Figures 3 and 4, to make a direct comparison between the different models easier for the reader.

2) A second major absence from the Introduction, which will raise concerns among many familiar with the current literature, is the near absence of any consideration of the growing number of suggestions that STM might be accomplished by mechanisms other than sustained activity. To name just a few, there's a recent TICS paper by Stokes that is explicitly devoted to this idea, there are several theoretical accounts by Tsodyks, Barak, and colleagues (nicely summarized in a recent Current Opinion review), and there's the nonlinear dynamical systems model of Lundqvist and colleagues, recently illustrated with data from Miller's group.

If some variant of these "activity-silent" accounts is correct, are the ideas presented in this manuscript irrelevant, or are there principles from the present theory that would apply? Additionally/alternatively, are there principles from the present theory that might apply to sustained activity supporting a behavior other than STM?

We thank the reviewers very much for this comment. Indeed, we did not explicitly discuss activity-silent accounts of STM in our Introduction or Discussion (other than providing a reference to Mongillo and Tsodyks 2008, a model of how synaptic facilitation can aid in the robustness of short-term memory). Given that recent experimental and modeling results in this direction are starting to form a compelling alternative to persistent activity mechanisms for STM, it is important to mention these accounts.

We have added a brief stand-alone passage in the Introduction, stating that our current work is complementary to efforts to explain STM in terms of synaptic facilitation/activity-silent mechanisms. In this passage, we cite the work of Mi, Katkov and Tsodyks, 2016; Barak and Tsodyks, 2014; Stokes, 2015; and Lundqvist et al., 2016.

With respect to the question about whether our model would apply to activity-silent mechanisms: In citing the model of Mongillo and Tsodyks, 2008 in our earlier manuscript, we had considered the possibility of synaptic facilitation as a source of a longer cellular time-constant to serve as the basis of STM, but we viewed that model as another persistent activity model, with activity supporting facilitation and facilitation supporting elevated activity. The facilitation process lent a slower intrinsic time-constant to the persistent activity feedback loop, thus providing a more robust/less fine-tuned way to generate persistent activity. Such a model would be subject to the same diffusion/drift problems as persistent activity models, qualitatively speaking (but quantitatively with lower noise or slower diffusion time-constant), and thus subject to similar degradation as considered in our present work.

The newer models cited in the paragraph may exhibit different dynamics, and be subject to different types of noise, in which case the general principle of restructuring information to improve memory would still hold, but the functional form of error versus number of items and N could be somewhat different. However, if the synaptic facilitation states in these models were subject to a Gaussian drift (e.g. if the facilitation states are analog-valued and some biophysical noise process drives a random walk through the set of possible states even in the absence of neural activity), then they too could be treated as a bank of information channels with Gaussian noise, and our theory would potentially extend to them, but with different parameters.

Since there are not yet good models of noise in the synaptic facilitation variable, for instance, or of the effects of such noise on collective network memory states, we cannot yet directly compute a theoretical bound on memory performance for these mechanisms. However, that is definitely a future interest; with more theoretical work on modeling sources of noise in the activity-silent mechanisms, it will be possible to apply a similar theoretical framework to obtain bounds on memory performance with and without good encoding.

3) Some of the writing contains incomplete or misleading assertions. For example, the idea that there are constraints on the amount of time that information can be held in STM ignores the fact that a classically held hallmark of STM is precisely that it is not sensitive to the passage of time, per se. (Two examples are the findings of Keppel and Underwood, and the many demonstrations of prolonged retention of information in STM in anterograde amnesic patients.) Indeed, puzzlingly, one of the papers cited by the authors to substantiate their assertion is entitled "No temporal decay in verbal short-term memory."

Indeed, as the reviewers note, early studies have emphasized the temporal robustness of STM, and compared to “iconic” memory, STM is much less susceptible to forgetting. Consistent with this, our experimental results clearly demonstrate that single items are remembered with very little degradation over time, and the effects of increasing item number are stronger than the effects of increasing delay on memory performance.

However, there is performance degradation over time, especially for more items. We do not ourselves model pure temporal decay as a mechanism for memory loss, so it was not our intention to convey this in the Introduction. The source of confusion was our phrasing and references. We are now more careful in making a distinction between performance degradation over time versus the possible mechanisms for such degradation (which could include noise or interference or, less likely according to the literature, pure temporal decay mechanisms); please see our edits.

4) The manuscript makes no contact with the growing literature of multivariate analyses of data from STM tasks, from nonhuman and human electrophysiology, and from human fMRI. Some of these studies show the ability to decode the contents of STM from delay-period activity with decoders trained on the sample-evoked signal. Others suggest that the neural code may be dynamic, with minimal if any cross-temporal generalization (i.e., "off-diagonal" decoding). How does the proposed theory relate to this empirical literature? Without reference to these broader literatures, the present manuscript might be more suitable for a more specialized computational journal.

Our formalism indicates that the representation within a memory channel must be in an optimised format, and that this format is not necessarily the same format that information was initially presented in. According to the information-theoretic view, the brain must perform a transformation from stimulus-space into an optimally coded form, and one might expect to observe this transition of the representation at encoding. The less optimal the original stimulus space, the more different the mnemonic code will likely be from the sample-evoked signal.

This insight by the reviewer constitutes a potential key prediction of the model: in domains that are already combinatorially structured, neural representations should remain similar throughout the delay period, whereas in domains amenable to compression at encoding, neural codes during the delay will appear dynamic or at least different from the stimulus-evoked signal. We now include a discussion of this point in the manuscript (Discussion section, paragraph beginning "It remains to be seen whether neural representations for short-term visual memory are consistent …."), also citing papers in the literature that variously show either stable, conserved coding during the delay or varying, different states during the delay.

5) The authors argue that the currently accepted model of working memory predicts a linear increase in mean-squared error (MSE) over time and load (MSE ~ (load)*(time)). In contrast, they find a sub-linear increase in MSE with time (Figure 3A and 3B). This sub-linearity is well fit by the well-coded model. However, some of this non-linearity could be due to other, less-capacity-limited, forms of memory at very short time delays. For example, iconic memory, thought to have an extremely high capacity, is likely still available at 100 ms (some might argue for longer). This could lead to a reduction in the MSE at the lowest time delays. Ideally the authors would control for this using masking stimuli. Alternatively, the authors could control for this by excluding the very short delays from the analysis (possibly increasing the maximum memory delay if needed for fits).

Thank you for this comment. Please note that the real problem with the direct storage (linear) model is not so much that the function in time is linear, as that even the average slopes of the different item-number curves versus time are not fit by the slopes in the linear model: that is, if we fit the 1-item versus time data, then the predicted slope of the 6-item versus time curve is far lower than the average slope of the actual data.

This can be seen in Figure 3A. If we attempt to fit all the curves simultaneously as well as possible, again the slopes of the fits in time are far from the mean slopes of the curves, leaving aside the question of sub-linearity.

If we understand correctly, the reviewer is suggesting the following scenario: Consider some process that has linear degradation of information in time (e.g. direct storage of information into persistent activity networks). Add to this model the assumption that the 100 ms time-point is due to iconic memory. After excluding this 100 ms point, the uncoded model might provide a much better fit than it has so far, and it might also be more competitive with the coded model.

We now perform this analysis, and find that the uncoded model still fails to simultaneously fit the 1- and 6-item versus time data, and remains a substantially poorer fit than the coded model fit to the same data. Excluding the shortest delay therefore does not change the qualitative comparisons.

6) As with many working memory paradigms, it is not entirely clear how to define the working memory load in the current task. It seems subjects must remember multiple pieces of information per memorandum (e.g. both color and orientation) in all cases except for the single item. This would suggest memory load is actually 1, 4, 8, and 12. Does this non-linearity account for the poor fit of the linear "direct coding" model? It seems like it might not, given the poor fit in Figure 3B, but it would still be worth testing the two models with different values for memory load. Similarly, recent work has suggested some degree of independence of working memory load across the two visual hemifields. Again, this would suggest only the balanced displays can be directly compared (e.g. 2, 4, and 6 items). Does the well-coded model still provide a better fit if the analysis is restricted to these three conditions?

This is an excellent suggestion. We have now redefined the item numbers from (1, 2, 4, 6) to (1, 4, 8, 12) and redone the fits. We find that our qualitative conclusions remain unchanged.

7) The authors appropriately use BIC to perform model comparison. However, such model comparison criteria often penalize parameters to different degrees. Did the authors also find that the well-coded model generalized to a withheld dataset better than the direct coding model?

Thank you for another good question. To address this, we redid the analysis by excluding one time-point across all item-number curves, then asked how well the curves obtained from fitting the other time-points predicted the error for the held-out data-point. We repeated this for another time-point. This is like a leave-one-out or jackknife cross-validation procedure. We find that the well-coded model predicts the withheld datapoints with smaller error than the uncoded/direct coding model.

8) Recent work has debated whether errors during working memory are due, in part, to guessing or not (e.g. Luck, Awh, Vogel, Bays, etc). In fact, Steve Luck argues for no increase in variance with load (or time?), instead only an increase in guess rate. If fitting a circular Gaussian to the distribution, do the authors find an increase in variance or an increase in baseline (or both)? Related to this, it isn't clear to me how the pure 'sudden-death' framework matches with the diffusivity arguments made here. It seems that perhaps the well-coded model could explain the existence of complete failures to remember if the signal diffuses too much, but the model would still argue for some diffusion of memory over time. This doesn't seem consistent with the current model. I know the authors attempt to address this in the Discussion section of the current manuscript, but I would encourage the authors to clarify their position.

Thank you for the opportunity to clarify. The direct storage model, which involves only diffusion, does not include a nonlinear "sudden-death" process. Instead, the error of recall will simply grow, continuously and monotonically, over time; it's still possible in this model that a noisy, discrete-in-time experiment will result in the appearance of a sudden-death event where there really is only continuous degradation in the underlying system (e.g. beyond some threshold of memory degradation, noise in the report or observation will make the memory appear to be "gone"). On the other hand, if information is stored in a well-coded way, according to some good error-correcting code, we would expect inherently sharp threshold behavior: such codes display a characteristic level of noise below which they can effectively suppress most error, and above which they are guaranteed to fail, and then their errors are large. Thus, the model would predict a relatively small accumulation of error over some interval, followed by a super-linear increase in squared error. We now clarify this point in the manuscript.

9) This study uses the co-authors' human psychophysical data from Pertzov et al., 2016 Journal of Experimental Psychology. That study decomposed errors into three sources: (1) noisy representation; (2) mis-binding or non-target responses; and (3) random guessing. They reported that all three of these components increased with higher load and with longer delays. How do these prior findings relate to the present study? Are these different sources subsumed by the present model? Or are these important features that the present model (in the diffusive regime) does not account for? Does the present model produce only the first type of errors? The Authors mention that in another regime of the model, non-diffusive errors can produce pure guessing errors. Can the model speak to the mechanisms of mis-binding errors? Please include discussion of this point.

Re. Misbinding: The current model does not address this source of error. We now clarify this fact in the text. Our model in its present form considers a single feature dimension, in this case orientation, and thus does not consider the binding problem or binding errors. Note that our model could in principle be extended to take into account the joint storage of two or more features per item, by representing those features as part of a higher-dimensional continuous attractor network, as in the joint population code model considered by Matthey, Bays and Dayan (2015); this is certainly of future interest to us (but of course outside the scope of the current work). We have now added a note about this point in the Discussion.

Re. sudden death: we have now clarified in the Discussion how sudden death can be consistent with our framework: "In our framework, good encoding ensures that for noise below a threshold, the decoder can recover an improved estimate of the stored variable; however, strong codes exhibit sharp threshold behavior as the noise in the channel is varied smoothly. […] We note, however, that the fits to the data shown here were all in the below-threshold regime."

10) Regarding the implications for neural representations: The Authors discuss that one prediction of the model would be signatures of exponentially strong codes in neural representations. As I understand it, one way this could be implemented is that each of the N memory networks has a different spatial period for its periodic coding, as in the case of grid cells. The other feature of the present model is that for multi-item working memory, a memory network contains signals for all of the K items. It would be helpful if the Authors could clarify what the implications for neural representations are of this feature of distributed multi-item coding. Does this imply that single neurons would show mixed selectivity for multiple items? Please include discussion of this point.

Good question. It is difficult to imagine any scenario involving non-mixed selectivity for items in a strong-coding scheme; thus, mixed selectivity would indeed be a prediction of such a scheme. We already had a longer discussion of the question of tuning curves for strong codes under the heading "Are neural representations consistent with exponentially strong codes?" in the Discussion. We have now added a comment about mixed selectivity there.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Figure 1—source data 1. Experiment data used in the manuscript.

    Subjects viewed a display with several (K) differently colored and oriented bars that were subsequently removed for the storage (delay) period. Following the storage period, subjects were cued by one of the colored bars in the display, now randomly oriented, and asked to rotate it to its remembered orientation. Bar orientations in the display were drawn randomly from the uniform distribution over all angles (thus the range of orientations lies in the circular interval [0,π]) and the report of the subject was recorded as an analog value. (See also [Pertzov et al., 2017]).

    DOI: 10.7554/eLife.22225.004
    Transparent reporting form
    DOI: 10.7554/eLife.22225.014

