Author manuscript; available in PMC: 2019 Apr 1.
Published in final edited form as: Behav Res Methods. 2018 Apr;50(2):576–588. doi: 10.3758/s13428-017-0886-6

The reliability and stability of visual working memory capacity

Z Xu 1,*, KCS Adam 2,*, X Fang 1, EK Vogel 2
PMCID: PMC5632133  NIHMSID: NIHMS866866  PMID: 28389852

Abstract

Because of the central role of working memory capacity in cognition, many studies have used short measures of working memory capacity to examine its relationship to other domains. Here, we measured the reliability and stability of visual working memory capacity, measured using a single-probe change detection task. In Experiment 1, subjects (N = 137) completed a large number of trials of a change detection task (540 in total, 180 each of set–sizes 4, 6, and 8). With large numbers of trials and subjects, reliability estimates were high (α > .9). We then used an iterative downsampling procedure to create a look-up table for expected reliability in experiments with small sample sizes. In Experiment 2, subjects (N = 79) completed 31 sessions of single-probe change-detection. The first 30 sessions took place over 30 consecutive days, and the last session took place 30 days later. This unprecedented number of sessions allowed us to examine the effects of practice on stability and internal reliability. Even after much practice, individual differences were stable over time (average between-session r =.76).

Keywords: visual working memory, reliability, change detection


Working Memory Capacity (WMC) is a core cognitive ability that predicts performance across many domains. For example, WMC predicts attentional control, fluid intelligence and real-world outcomes such as perceiving hazards while driving (Engle, Tuholski, Laughlin, & Conway, 1999; Fukuda, Vogel, Mayr, & Awh, 2010; Wood, Hartley, Furley, & Wilson, 2016). As such, researchers are often interested in devising brief measures of WMC to investigate the relationship of WMC to other cognitive processes. However, truncated versions of WMC tasks could potentially be inadequate for reliably measuring an individual’s capacity. Inadequate measurement could obscure correlations between measures or even differences in performance between experimental conditions. Furthermore, while WMC is considered to be a stable trait of the observer, little work has directly examined the role of extensive practice on the measurement of WMC over time. This is of particular concern due to the popularity of research examining whether training affects WMC (Melby-Lervåg & Hulme, 2013; Shipstead, Redick, & Engle, 2012). Extensive practice on any given cognitive task has the potential to significantly alter the nature of the variance that is determining performance. For example, extensive practice has the potential to induce a restriction of range problem, in which the bulk of the observers reach similar performance levels - thus reducing any opportunity to observe correlations with other measures. Consequently, a systematic study of the reliability and stability of WMC measures is critical for improving the measurement and reproducibility of major phenomena in this field.

In the present study, we seek to establish the reliability and stability of one particular WMC measure: Change Detection. Change detection measures of visual working memory have gained popularity as a means of assessing individual differences in capacity. In a typical change detection task, participants briefly view an array of simple visual items (~100 to 500 ms), such as colored squares, and remember these items across a short delay (~1 to 2 seconds). At test, observers are presented with an item at one of the remembered locations, and they indicate whether the presented test item is the same as the remembered item (“no-change” trial) or is different (“change trial”). Performance can be quantified as raw accuracy or converted into a capacity estimate (“K”). In capacity estimates, performance for change trials and no-change trials is calculated separately as hits (proportion of correct change trials) and false alarms (proportion of incorrect no-change trials) and converted into a set-size dependent score (Cowan, 2001; Pashler, 1988; Rouder, Morey, Morey, & Cowan, 2011).

There are several beneficial features of change detection tasks that have led to their increased popularity. First, change detection memory tasks are simple and short enough to be used with developmental and clinical populations (e.g. Cowan, Fristoe, Elliott, Brunner, & Saults, 2006; Gold, Wilk, McMahon, Buchanan, & Luck, 2003; Lee et al., 2010). Second, the relatively short length of trials lends the task well to neural measures that require large numbers of trials. In particular, neural studies employing change detection tasks have provided strong corroborating evidence of capacity limits in WM (Todd & Marois, 2004; Vogel & Machizawa, 2004), and have yielded insights into potential mechanisms underlying individual differences in working memory capacity (for review, see: Luria, Balaban, Awh, & Vogel, 2016). Finally, change detection tasks and closely-related memory-guided saccade tasks can be used with animal models from pigeons (Gibson, Wasserman, & Luck, 2011) to non-human primates (Buschman, Siegel, Roy, & Miller, 2011), providing a rare opportunity to directly compare behavior and neural correlates of task performance across species (Elmore, Magnotti, Katz, & Wright, 2012; Reinhart et al., 2012).

A main aim of this study is to quantify the effect of measurement error and sample size on the reliability of change detection estimates. In previous studies, change detection estimates of capacity have yielded good reliability estimates (e.g. Pailian & Halberda, 2015; Unsworth et al., 2014). However, measurement error can vary dramatically with the number of trials in a task, thus impacting reliability; Pailian and Halberda (2015) found that reliability of change detection estimates greatly improved when the number of trials was increased. Researchers frequently employ vastly different numbers of trials and subjects in studies of individual differences, but the effect of trial number on change-detection reliability has never been fully characterized. In studies using large batteries of tasks, time and measurement error are forces working in opposition to one another. When researchers want to minimize the amount of time that a task takes, measures are often truncated to expedite administration. Such truncated measures increase measurement noise and potentially harm the reliability of the measure. At present, there is no clear understanding of the minimum numbers of subjects and trials that are necessary to obtain reliable estimates of change detection capacity.

In addition to within-session measurement error, the reliability of individual differences could be compromised by extensive practice. Previously, it was found that visual working memory capacity estimates were stable (r = .77) after 1.5 years between testing sessions (Johnson et al., 2013). However, the effect of extensive practice on change detection estimates of capacity has yet to be characterized. Extensive practice could harm the reliability and stability of measures in a couple of ways. First, it is possible that participants could improve so much that they reach a performance ceiling, thus eliminating variability between individuals. Second, if individual differences are due to the utilization of optimal versus sub-optimal strategies, then participants might converge to a common mean after engaging in extensive practice and finding optimal task strategies. Both of these hypothetical possibilities would call into question the true stability of working memory capacity estimates, and likewise severely harm the statistical reliability of the measure. As such, in Experiment 2 we directly quantified the effect of extensive practice on the stability of working memory capacity estimates.

Overview of Experiments

We measured the reliability and stability of a single-probe change-detection measure of visual working memory capacity. In Experiment 1, we measured the reliability of capacity estimates obtained with a commonly used version of the color change-detection task for a relatively large number of subjects (n = 137) and a larger than typical number of trials (t = 540). In Experiment 2, we measured the stability of capacity estimates across an unprecedented number of testing sessions (31). Because of the large number of sessions, we could investigate the stability of change detection estimates after extended practice and over a period of 60 days.

Experiment 1

Materials and Methods

Participants

A total of 139 individuals (35 males; mean age = 19.97, SD = 1.07) with normal or corrected-to-normal vision participated in the experiment. Participants provided written informed consent, and the study was approved by the Ethics Committee at Southwest University. Participants received monetary compensation for their participation. Two participants were excluded because they had negative average capacity values, resulting in a final sample of 137 subjects.

Stimuli

Stimuli were presented on monitors with a refresh rate of 75 Hz and a screen resolution of 1024 x 768. Participants sat approximately 60 cm from the screen; because a chinrest was not used, all visual angle estimates are approximate. In addition, there were some small variations in monitor size (five 16” CRT monitors, three 19” LCD monitors) across testing rooms, leading to small variations in the size of the colored squares from monitor to monitor. Approximate ranges in degrees of visual angle are therefore reported below.

All stimuli were generated in MATLAB (The MathWorks, Natick, MA) using Psychophysics Toolbox. Colored squares (51 pixels; range of 1.55° to 2.0° visual angle) served as memoranda. Squares could appear anywhere within an area of the monitor subtending approximately 10.3° to 13.35° horizontally and 7.9° to 9.8° vertically. Squares could appear in any of nine distinct colors, and colors were sampled without replacement within each trial (RGB values: Red = 255 0 0; Green = 0 255 0; Blue = 0 0 255; Magenta = 255 0 255; Yellow = 255 255 0; Cyan = 0 255 255; Orange = 255 128 0; White = 255 255 255; Black = 0 0 0). Participants were instructed to fixate a small black dot (approximate range: .36° to .47° visual angle) at the center of the display.

Procedures

Each trial began with a blank fixation period of 1,000 ms. Then, participants briefly viewed an array of 4, 6, or 8 colored squares (150 ms), which they remembered across a blank delay period (1,000 ms). At test, one colored square was presented at one of the remembered locations. There was an equal probability that the probed square was the same color (no-change trial) or a different color (change trial). Participants made an unspeeded response by pressing the “z” key if the color was the same and the “/” key if the color was different. Participants completed 180 trials each of set-sizes 4, 6, and 8 (540 trials total). Trials were divided into 9 blocks, and participants were given a brief rest period (30 seconds) after each block. To calculate capacity, change detection accuracy was transformed into a K estimate using Cowan’s (2001) formula K = N × (H – FA), where N represents the set-size, H is the hit rate (proportion of correct change trials), and FA is the false alarm rate (proportion of incorrect no-change trials). Cowan’s formula is best for single-probe displays like the one employed here. For change detection tasks using whole-display probes, Pashler’s (1988) formula may be more appropriate (Rouder et al., 2011).
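For concreteness, the K computation for a single set-size condition can be sketched in a few lines of MATLAB. The variable names and simulated responses below are illustrative stand-ins, not the authors' code.

```matlab
% Minimal sketch of Cowan's K for one set-size condition.
% Trial labels and responses are simulated placeholders.
nTrials    = 180;                                        % trials at this set-size
setSize    = 4;
isChange   = rand(nTrials, 1) < 0.5;                     % true on change trials
saidChange = (isChange  & rand(nTrials, 1) < 0.80) | ... % toy hits (~80%)
             (~isChange & rand(nTrials, 1) < 0.20);      % toy false alarms (~20%)

hitRate        = mean(saidChange(isChange));    % H: proportion correct change trials
falseAlarmRate = mean(saidChange(~isChange));   % FA: proportion incorrect no-change trials
K = setSize * (hitRate - falseAlarmRate)        % K = N x (H - FA)
```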

Results

Descriptive statistics for each set-size condition are shown in Table 1, and data for both Experiment 1 and 2 are available online on Open Science Framework at https://osf.io/g7txf/. There was a significant difference in performance across set-sizes, F(2,268) = 20.6, p < .001, ηp2 = .133, and polynomial contrasts revealed a significant linear trend, F(1,134) = 36.48, p < .001, ηp2 = .214, indicating that average performance declined slightly with increased memory load.
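A minimal MATLAB sketch of this kind of set-size comparison is given below. The simulated subjects-by-set-size K matrix and the choice of fitrm/ranova (Statistics and Machine Learning Toolbox) are ours, not necessarily the authors' implementation.

```matlab
% Repeated-measures ANOVA over the three set-size K scores (sketch only).
Kss = 2.1 + 0.8*randn(137, 1) + [0.2, 0, -0.15] + 0.3*randn(137, 3);  % toy K scores
tblK   = array2table(Kss, 'VariableNames', {'ss4', 'ss6', 'ss8'});
within = table(categorical([4; 6; 8]), 'VariableNames', {'SetSize'});

rm = fitrm(tblK, 'ss4-ss8 ~ 1', 'WithinDesign', within);
ranova(rm, 'WithinModel', 'SetSize')            % main effect of set-size
```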

Table 1. Descriptive statistics for Experiment 1.

Descriptive statistics are shown separately for each set-size and for the average of the three set-sizes. Kurtosis and skewness values are both centered around 0. Neither kurtosis nor skewness was credibly non-normal in any condition (Cramer, 1997).

             Mean K   SD     Min     Max    Kurtosis   Skewness
Set-Size 4    2.32    .70     .58    3.87     −.49       −.34
Set-Size 6    2.10    .97     .07    4.80     −.18        .34
Set-Size 8    1.98    .97    −.18    4.53     −.52       −.14
Average       2.14    .82     .38    4.31     −.47        .07

Reliability of the Full Sample: Cronbach’s Alpha

We computed Cronbach’s alpha (unstandardized) using K scores from the three set-sizes as items (180 trials contributing to each item), and obtained a value of α = .91 (Cronbach, 1951). We also computed Cronbach’s alpha using K scores from the nine blocks of trials (60 trials contributing to each item) and obtained a nearly identical value of α = .92. Finally, we computed Cronbach’s alpha using raw accuracy for single trials (540 items), and obtained an identical value of α = .92. Thus, change detection estimates had high internal reliability for this large sample of subjects, and the precise method used to divide trials into “items” does not impact Cronbach’s alpha estimates of reliability for the full sample. Further, using raw accuracy versus bias-corrected K scores did not impact reliability.
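For reference, Cronbach's alpha can be computed from a subjects-by-items score matrix with the standard formula. The sketch below uses simulated single-trial accuracy in place of the real data and is not the authors' exact script.

```matlab
% Cronbach's alpha from a subjects-by-items matrix (here, simulated 0/1
% single-trial accuracy for 137 subjects x 540 trials).
nSubj = 137; nItems = 540;
pCorrect = 0.55 + 0.35 * rand(nSubj, 1);       % toy per-subject accuracy levels
X = double(rand(nSubj, nItems) < pCorrect);    % 0/1 single-trial scores

k        = size(X, 2);                         % number of items
itemVar  = var(X, 0, 1);                       % variance of each item across subjects
totalVar = var(sum(X, 2));                     % variance of subjects' total scores
alpha    = (k / (k - 1)) * (1 - sum(itemVar) / totalVar)
```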

Reliability of the Full Sample: Split-half

The split-half correlation of the K scores for even and odd trials was reliable, r = .88, p < .001, 95% CI [.78 .88]. Correcting for attenuation yielded a split-half correlation value of r = .94 (Brown, 1910; Spearman, 1910). Likewise, the capacity scores from individual set-sizes correlated with each other: rss4–ss6 = .84, p < .001, [95% CI .78 .88]; rss6–ss8 = .78, p < .001, [95% CI .72 .85]; rss4–ss8 = .76, p < .001, [95% CI .68 .83]. Split-half correlations for individual set-sizes yielded Spearman-Brown corrected correlation values of r = .91 for set-size 4, r = .86 for set-size 6, and r = .76 for set-size 8, respectively.
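The even/odd split-half procedure with the Spearman-Brown correction can be sketched as follows; again the accuracy matrix is simulated, and the code is illustrative rather than the authors' script.

```matlab
% Split-half reliability (odd vs. even trials) with Spearman-Brown correction.
nSubj = 137; nTrials = 540;
acc = double(rand(nSubj, nTrials) < (0.55 + 0.35 * rand(nSubj, 1)));  % toy accuracy

oddScore  = mean(acc(:, 1:2:end), 2);     % per-subject score, odd trials
evenScore = mean(acc(:, 2:2:end), 2);     % per-subject score, even trials
R     = corrcoef(oddScore, evenScore);
rHalf = R(1, 2);                          % uncorrected split-half correlation
rSB   = 2 * rHalf / (1 + rHalf)           % Spearman-Brown corrected estimate
```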

The drop in capacity from set-size 4 to set-size 8 has been used in the literature as a measure of filtering ability. However, the internal reliability of this difference score has typically been low (Pailian & Halberda, 2015; Unsworth et al., 2014). Likewise, we found here that the split-half reliability of the performance decline from set-size 4 to set-size 8 (“4–8 Drop”) was low, with a Spearman-Brown corrected correlation value of r = .24. While weak, this correlation is the same strength as that reported in earlier work (Unsworth et al., 2014). The split-half reliability of the performance decline from set-size 4 to set-size 6 was slightly higher, r = .39, and the split-half reliability of the difference between set-size 6 and set-size 8 performance was very low, r = .08. The reliability of difference scores can be impacted both by (1) the internal reliability of each measure used to compute the difference and (2) the degree of correlation between the two measures (Rodebaugh et al., 2016). Although the internal reliability of each individual set-size was high, the positive correlation between set-sizes may have decreased the reliability of the set-size difference scores.

An Iterative Downsampling Approach

To investigate the effects of sample size and trial number on reliability estimates, we used an iterative downsampling procedure. Two reliability metrics were assessed: (1) Cronbach’s alpha, using single trial accuracy as items and (2) split-half correlations using all trials. For the downsampling procedure, we randomly sampled subjects and trials from the full dataset. Number of subjects (n) was varied from 5 to 135 in steps of 5. The number of trials (t) was varied from 5 to 540 in steps of 5. Number of subjects and number of trials were factorially combined (2916 cells total). For each cell in the design, we ran 100 sampling iterations. On each iteration, n subjects and t trials were randomly sampled from the full dataset and reliability metrics were calculated for the sample.
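The core of one downsampling cell might look like the following MATLAB sketch. The full-data matrix is simulated here as a stand-in for the real 137 x 540 data set, and the cell size (30 subjects, 120 trials) is just an example.

```matlab
% One cell of the downsampling grid: draw n subjects and t trials at random,
% compute the corrected split-half correlation, and repeat 100 times.
acc = double(rand(137, 540) < (0.55 + 0.35 * rand(137, 1)));  % toy accuracy matrix
n = 30; t = 120; nIter = 100;

rel = nan(nIter, 1);
for it = 1:nIter
    sub = acc(randperm(137, n), randperm(540, t));   % random subjects and trials
    o   = mean(sub(:, 1:2:end), 2);                  % odd-trial scores
    e   = mean(sub(:, 2:2:end), 2);                  % even-trial scores
    R   = corrcoef(o, e);
    rel(it) = 2 * R(1, 2) / (1 + R(1, 2));           % Spearman-Brown correction
end
fprintf('average r = %.2f, worst r = %.2f\n', mean(rel), min(rel));
```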

Figure 1 shows the results of the downsampling procedure for Cronbach’s alpha. Figure 2 shows the results of the downsampling procedure for split-half reliability estimates. In each plot, we show both the average reliability obtained across the 100 iterations (Fig. 1A and Fig. 2A) and the worst reliability obtained across the 100 iterations (Fig. 1B and Fig. 2B). Conceptually, we could think of each iteration of the downsampling procedure as akin to running one “experiment” with subjects randomly sampled from our “population” of 137. While it is good to know the average expected reliability across many experiments, the typical experimenter will only run an experiment once. Thus, considering the “worst case scenario” is instructive for planning the number of subjects and the number of trials to be collected. For a more complete picture of the breadth of reliabilities obtained, we can also consider the variability in reliability across iterations (SD) and the range of reliability values (Fig. 2C–2D). Finally, we repeated this iterative downsampling approach for each individual set-size. Average reliability as well as the variability of reliability for individual set-sizes are shown in Figure 3. Note, each set-size begins with 1/3 as many trials as Figures 1 and 2.

Figure 1. Cronbach’s alpha as a function of the number of trials and the number of subjects in Experiment 1.


In each cell, Cronbach’s alpha was computed for t trials (x-axis) and n subjects (y-axis). (a) Average reliability across 100 iterations. (b) Minimum reliability obtained (worst random sample of subjects and trials).

Figure 2. Spearman-Brown corrected split-half reliability estimates as a function of the number of trials and subjects in Experiment 1.


(a) Average reliability across 100 iterations. (b) Minimum reliability obtained (worst random sample of subjects and trials). (c) Standard deviation of the reliability obtained across samples. (d) Range of reliability values obtained across samples.

Figure 3. Spearman-Brown corrected split-half reliability estimates for each set-size in Experiment 1.


Top 3 panels: Average reliability for each set-size. Bottom 3 panels: Standard Deviation of the reliability for each set-size across 100 downsampling iterations.

Next, we looked at some potential characteristics of samples with low reliability (e.g. iterations with particularly low versus high reliability). We ran 500 sampling iterations of 30 subjects and 120 trials, then we did a median split for high-versus low-reliability samples. There was no significant difference in the mean (p = .86), skewness (p = .60) or kurtosis (p = .70) of high versus low reliability samples. There was, however, a significant effect of sample range and variability. As would be expected, samples with higher reliability had a larger standard deviation, t(498) = 26.7, p <.001, 95% CI [.14 .17], and a wider range, t(498) = 15.2, p < .001, 95% CI [.52 .67]), than samples with low reliability.

A Note for Fixed Capacity + Attention Estimates of Capacity

So far, we have discussed only the most commonly used methods of estimating working memory capacity (K scores and percent correct). Other methods of estimating capacity have been used, and we would like to briefly mention one of them. Rouder and colleagues (2008) suggested adding an attentional lapse parameter to estimates of visual working memory capacity, a model referred to as Fixed Capacity + Attention. Adding an attentional lapse parameter accounts for trials where subjects are inattentive to the task at hand. Specifically, participants commonly make errors on trials that should be well within capacity limits (e.g. set-size 1), and adding a lapse parameter can help to explain these anomalous dips in performance. Unlike typical estimates of capacity in which a K value is computed directly for performance for each set-size and then averaged, this model uses a log-likelihood estimation technique that estimates a single capacity parameter by simultaneously considering performance across all set-sizes and/or change probability conditions. Critically, this model assumes that data is obtained for at least one sub-capacity set-size, and that any error made on this set-size reflects an attentional lapse. If the model is fit to data that lacks at least one sub-capacity set-size (e.g. 1 or 2 items), then the model will fit poorly and provide nonsensical parameter estimates.

Recently, Van Snellenberg and colleagues used the Fixed Capacity + Attention Model to calculate capacity for a change detection task, and they found that the reliability of the model’s capacity parameter was low (r = .35), and did not correlate with other working memory tasks (Van Snellenberg, Conway, Spicer, Read, & Smith, 2014). Critically, however, this study used only relatively high set-sizes (4 and 8), and lacked a sub-capacity set-size, so model fits were likely poor. Using code made available from Rouder et al., we fit a Fixed Capacity + Attention model to our data (Rouder, n.d.). We found that when this model is misapplied (i.e. used on data without at least 1 sub-capacity set-size) the internal reliability of the capacity parameter was low (r uncorrected = .35), and negatively correlated with raw change detection accuracy, r = −.25, p = .004. If we had only applied this model to our data, we would have mistakenly concluded that change detection measures offer poor reliability and do not correlate with other measures of working memory capacity.

Discussion

Here, we have shown that when sufficient numbers of trials and subjects are collected, the reliability of change detection capacity is remarkably high (r >.9). On the other hand, a systematic downsampling method revealed that insufficient trials or insufficient subject numbers could dramatically reduce the reliability obtained in a single experiment. If researchers hope to measure the correlation between visual working memory capacity and some other measure, Figures 1 and 2 can serve as an approximate guide to expected reliability. Because we only had a single sample of the largest n (137), we cannot make definitive claims about the reliability of future samples of this size. However, given the stabilization of correlation coefficients with large sample sizes and the extremely high correlation coefficient obtained, we can be relatively confident that the reliability estimate for our full sample (n = 137) would not change substantially in future samples of university students. Further, we can make claims about how the reliability of small, well-defined sub-samples of this “population” can systematically deviate from an empirical upper bound.

The average capacity obtained for this sample was slightly lower than some other values in the literature, typically cited as around 3–4 items. The slightly lower average for this sample could potentially cause some concern about the generalizability of these reliability values for future samples. For the current manuscript’s sample, average K-scores for set-sizes 4 and 8 were K = 2.3 and K = 2.0, respectively. The largest, most comparable sample to the present sample is a 495-subject sample in work by Fukuda, Woodman, and Vogel (2015). In that sample, the average K-scores for set-sizes 4 and 8 were K = 2.7 and K = 2.4, respectively, and the task design was nearly identical (150 ms encoding time, 1000 ms retention interval, no color repetitions allowed, and set-sizes 4 and 8). The difference of 0.3–0.4 items between these two samples is relatively small, though likely significant. However, for the purposes of estimating reliability, the variance of the distribution is more important than the mean. The variability observed in the present sample (SD = 0.70 for set-size 4, SD = 0.97 for set-size 8) was very similar to that observed in the Fukuda et al. sample (SD = 0.6 for set-size 4 and SD = 1.2 for set-size 8), though unfortunately the Fukuda et al. study did not report reliability. Because of the nearly identical variability of scores across these two samples, we can infer that our reliability results would indeed generalize to other large samples for which change detection scores have been obtained.

We recommend applying an iterative downsampling approach to other measures where expediency of task administration is valued, but reliability is paramount. The stats-savvy reader may note that the Spearman-Brown prophecy formula also allows one to calculate how many observations must be added to improve expected reliability, according to the formula:

N = \frac{\rho^{*}_{xx}(1 - \rho_{xx})}{\rho_{xx}(1 - \rho^{*}_{xx})}

where ρ*_xx is the desired correlation strength, ρ_xx is the observed correlation, and N is the number of times that test length must be multiplied to achieve the desired correlation strength. Critically, however, this formula does not account for the accuracy of the observed correlation. Thus, if one starts from an unreliable correlation coefficient obtained with a small number of subjects and trials, one will obtain an unreliable estimate of the number of observations needed to improve correlation strength. In experiments such as this one, both the number of trials and the number of subjects will drastically change estimates of the number of subjects needed to observe correlations of a desired strength.

Let’s take an example from our iterative downsampling procedure. Imagine that we ran 100 experiments, each with 15 subjects and 150 total trials of change detection. Doing so, we would obtain 100 different estimates of the strength of the true split-half correlation. We could then apply the Spearman-Brown formula to each of these 100 estimates in order to calculate the number of trials needed to obtain a desired reliability of r = .8. So doing, we would find that, on average, we would need around 140 trials to obtain the desired reliability. However, because of the large variability in the observed correlation strength (r = .37 to .97), if we had only run the “best case” experiment (r = .97), we would estimate that we need only 18 trials to obtain our desired reliability of r = .8 with 15 subjects. On the other hand, if we had run the “worst case” experiment (r = .37), then we would estimate that we need 1,030 trials. There are downsides to both types of estimation errors. While a pessimistic estimate of the number of trials needed (>1000) would certainly ensure adequate reliability, this may come at the cost of time and participants’ frustration. Conversely, an overly optimistic estimate of the number of trials needed (<20) would lead to underpowered studies that waste time and funds.
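The arithmetic behind this example is a one-line application of the prophecy formula; the sketch below reproduces it, with small differences from the quoted trial counts attributable to rounding of the observed correlations.

```matlab
% Spearman-Brown "prophecy" from the worked example above: trials needed to
% reach a target reliability of .80 starting from 150 trials, given an
% observed split-half r of .97 (best case) or .37 (worst case).
targetR   = 0.80;
observedR = [0.97, 0.37];
lengthen  = (targetR .* (1 - observedR)) ./ (observedR .* (1 - targetR));
trialsNeeded = ceil(150 * lengthen)   % roughly 19 and 1022 trials with these rounded r values
```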

Finally, we investigated an alternative parameterization of capacity based on a model that assumes a fixed capacity and an attention lapse parameter (Rouder et al., 2008). Critically, this model attempts to explain errors for set-sizes that are well within capacity limits (e.g. 1 item). If researchers inappropriately apply this model to change detection data with only large set-sizes, they would erroneously conclude that change detection tasks yield poor reliability and fail to correlate with other estimates of capacity (e.g. Van Snellenberg et al., 2014).

In Experiment 2, we shifted our focus to the stability of change detection estimates. That is, how consistent are estimates of capacity from day-to-day? We collected an unprecedented number of sessions of change detection performance (31) spanning 60 days. We examined the stability of capacity estimates, defined as the correlation between individuals’ capacity estimates from one day to the next. Since capacity is thought to be a stable trait of the individual, we predicted that individual differences in capacity should be reliable across many testing sessions.

Experiment 2

Materials and Methods

Participants

A total of 79 individuals (22 male, 57 female; mean age = 22.67 years, SD = 2.31) with normal or corrected-to-normal vision participated for monetary compensation. The study was approved by the Ethics Committee of Southwest University.

Stimuli

Some experimental sessions were completed in the lab and others were completed in participants’ homes. In the lab, stimuli were presented on monitors with a refresh rate of 75 Hz. At home, stimuli were presented on laptop screens with somewhat variable refresh rates and sizes. In both cases, participants sat approximately 60 cm from the screen; because a chinrest was not used, all visual angle estimates are approximate. In the lab, there were some small variations in monitor size (five 18.5” LCD monitors, one 19” LCD monitor) across testing rooms, leading to small variations in the size of the colored squares. Approximate ranges in degrees of visual angle are reported below for the lab monitors.

All stimuli were generated in MATLAB (The MathWorks, Natick, MA) using Psychophysics Toolbox. Colored squares (51 pixels; range of 1.28° to 1.46° visual angle) served as memoranda. Squares could appear anywhere within an area of the monitor subtending approximately 14.4° to 14.8° horizontally and 8.1° to 8.4° vertically. Squares could appear in any of nine distinct colors (RGB values: Red = 255 0 0; Green = 0 255 0; Blue = 0 0 255; Magenta = 255 0 255; Yellow = 255 255 0; Cyan = 0 255 255; Orange = 255 128 0; White = 255 255 255; Black = 0 0 0). Colors were sampled without replacement for set-size 4 and set-size 6 trials. Each color could be repeated up to one time in set-size 8 trials (i.e., colors were sampled from a list of 18 colors, with each of the nine unique colors appearing twice). Participants were instructed to fixate a small black dot (~.3° visual angle) at the center of the display.

Procedures

Trial procedures for the change detection task were identical to those of Experiment 1. Participants completed a total of 31 sessions of the change detection task. In each session, participants completed a total of 120 trials (split over 5 blocks), with 40 trials each of set-sizes 4, 6, and 8. Participants were asked to complete the change detection task once a day for 30 consecutive days, and a final (31st) session took place 30 days after the 30th session (Day 60). They could do this task on their own computers or on the experimenters’ computers throughout the day. Participants were instructed that they should complete the task in a relatively quiet environment and not do anything else (e.g. talking to others) at the same time. Experimenters reminded the participants to finish the task and collected the data files every day.

Results

Descriptive Statistics

Descriptive statistics for average K values across the 31 sessions are shown in Table 2. Across all sessions, the average capacity was 2.83 (SD = .23). Change in mean capacity over time is shown in Figure 4A. A repeated measures ANOVA revealed a significant difference in capacity across sessions, F(18.76, 1388.38)¹ = 15.04, p < .001, ηp2 = .169. Subjects’ performance initially improved across sessions, then leveled off. The group-average increase in capacity over time is well described by a two-term exponential model (SSE = .08, RMSE = .06, Adjusted R2 = .94), given by the equation y = 2.776 × e^(.003x) − .798 × e^(−26x). To test the impression that individuals’ improvement slowed over time, we fit several growth curve models to the data using Maximum Likelihood Estimation (‘fitmle.m’) with Subject entered as a random factor. We coded time as days from the first session (Session 1 = 0). Model A included only a random intercept; Model B included a random intercept and a random linear effect of time; Model C added a quadratic effect of time; and Model D added a cubic effect of time. As shown in Table 3, the quadratic model provided the best fit to the data. Further testing revealed that both random slopes and random intercepts were needed to best fit the data (Table 4, Models C1–C4). That is, participants started out with different baseline capacity values, and they improved at different rates. However, the covariance matrix for Model C revealed no systematic relationship between initial capacity (intercept) and either the linear effect of time, r = .21, 95% CI [−.10 .49], or the quadratic effect of time, r = −.14, 95% CI [−.48 .24]. This suggests that there was no meaningful relationship between a participant’s initial capacity and their rate of improvement. To visualize this point, we did a quartile split of session 1 performance and then plotted the change over time for each group (Figure 4B).
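For readers who want to fit a comparable model, a sketch of the quadratic growth-curve model (Model C) is shown below. We assume the ‘fitmle.m’ referenced above corresponds to MATLAB's fitlme (linear mixed-effects models fit by maximum likelihood); the toy data and column names are illustrative only.

```matlab
% Quadratic growth-curve sketch: random intercepts plus random linear and
% quadratic slopes over subjects, fit by maximum likelihood with fitlme.
nSubj = 79; days = [0:29, 59];                        % session days (0 = first session)
[S, D] = ndgrid(1:nSubj, days);
K = 2.4 + 0.5*randn(nSubj, 1) + 0.02*D - 2e-4*D.^2 + 0.3*randn(size(D));  % toy K values
tbl = table(categorical(S(:)), D(:), K(:), 'VariableNames', {'Subject', 'Day', 'K'});
tbl.Day2 = tbl.Day.^2;                                % quadratic time term

mdlC = fitlme(tbl, 'K ~ Day + Day2 + (Day + Day2 | Subject)', 'FitMethod', 'ML');
disp(mdlC)                                            % fixed effects, log-likelihood, BIC
```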

Table 2. Descriptive statistics for Experiment 2.

Descriptive statistics are shown separately for each set-size and for the average of the three set-sizes. Kurtosis and skewness values are both centered around 0. Asterisks denote credible deviation from normality (Cramer, 1997).

          N    Mean   SD     Minimum   Maximum   Kurtosis   Skewness
Day 1     79   2.15   0.85    0.40      4.03      −0.69       0.24
Day 2     79   2.36   0.86    0.07      3.97      −0.24      −0.32
Day 3     79   2.43   0.82    0.80      4.07      −0.62      −0.29
Day 4     78   2.51   0.85    0.40      4.10      −0.31      −0.31
Day 5     79   2.52   0.93    0.57      4.27      −0.55      −0.13
Day 6     79   2.74   0.92    0.53      4.60      −0.39      −0.20
Day 7     79   2.73   0.91    0.67      4.63      −0.88      −0.09
Day 8     79   2.66   0.87    1.03      4.70      −0.66       0.06
Day 9     79   2.81   0.92    0.50      5.07      −0.18      −0.19
Day 10    79   2.86   0.94    0.77      4.70      −0.84       0.01
Day 11    78   2.79   0.94    0.40      4.27      −0.51      −0.55*
Day 12    79   2.83   1.01   −0.10      4.80      −0.38      −0.37
Day 13    78   2.85   0.96    0.37      4.80      −0.57      −0.21
Day 14    79   3.01   0.95    0.93      5.03      −0.46      −0.11
Day 15    78   2.85   0.92    0.37      4.37       0.12      −0.73*
Day 16    79   2.91   0.92    0.23      4.90      −0.05      −0.35
Day 17    79   2.84   0.90    0.87      4.77      −0.51      −0.18
Day 18    79   2.93   1.02    0.53      4.73      −0.40      −0.23
Day 19    79   2.90   0.92    0.87      4.57      −0.69      −0.24
Day 20    79   2.94   0.92    0.47      4.93      −0.03      −0.32
Day 21    79   2.98   0.94    0.80      4.90      −0.08      −0.47
Day 22    79   2.99   0.98    0.83      4.90      −0.65      −0.23
Day 23    79   2.86   1.05    0.23      5.47      −0.17      −0.14
Day 24    78   3.00   0.98    0.97      4.77      −0.74      −0.26
Day 25    79   3.04   0.95    0.67      5.03      −0.41      −0.16
Day 26    79   3.01   0.93    0.43      5.07      −0.28      −0.34
Day 27    79   3.09   1.06    0.43      5.00      −0.51      −0.29
Day 28    79   3.04   0.97    0.33      4.83      −0.22      −0.48
Day 29    79   3.01   1.04    0.77      5.07      −0.38      −0.33
Day 30    79   3.02   1.05    0.33      5.00      −0.48      −0.29
Day 60    79   3.00   1.08   −0.13      5.40       0.29      −0.58*

Figure 4. Average capacity (K) across testing sessions.


Shaded bars represent standard error of the mean. Note that the axis is spliced between days 30 and 60, as no intervening data points were collected during this time. Left: Average change in performance over time. Right: Average change in performance over time for each quartile of subjects (quartile split performed on data from session 1).

Table 3.

Comparison of Linear, Quadratic, and Cubic growth models, all with random intercepts and slopes where applicable.

                  Model A:         Model B:     Model C:      Model D:
                  Intercept Only   Linear       Quadratic     Cubic
Intercept         2.83***          2.60***      2.41***       2.29***
Linear Slope      —                0.014***     .037***       .07**
Quadratic Slope   —                —            −.0005***     −.002*
Cubic Slope       —                —            —             2 × 10^−5 n.s.
−2LL              4366.2           4084.8       3914.7        4231.6
BIC               4389.6           4131.6       3992.7        4348.6

*** p < .001; ** p < .01; * p < .05

Table 4.

Comparison of fixed versus random slopes and intercept.

Model                                   −2LL     BIC
Model C1: Fixed Int., Fixed Slope       6672.3   6703.5
Model C2: Fixed Int., Random Slope      4627.7   4682.3
Model C3: Random Int., Fixed Slope      4009.1   4048.1
Model C4: Random Int., Random Slope     3914.7   3992.7

Within-session reliability

Within-session reliability was assessed using Cronbach’s alpha and split-half correlations. Cronbach’s alpha (using single-trial accuracy as items) yielded an average within-session reliability of α = .76 (SD = .04, Min. = .65, Max. = .83). Equivalently, split-half correlations on K-scores calculated from even versus odd trials revealed an average Spearman-Brown corrected reliability of r = .76 (SD = .05, Min. = .62, Max. = .84). As in Experiment 1, using raw accuracy (Cronbach’s alpha) versus bias-corrected capacity measures (Cowan’s K) did not affect reliability estimates. Within-session reliability increased slightly over time (Figure 5). Cronbach’s alpha values were positively correlated with session number (1–31), r = .82, p < .001, 95% CI [.66, .91], as were split-half correlation values, r = .67, p < .001, 95% CI [.41, .83].

Figure 5. Change in within-session reliability across sessions in Experiment 2.


There was a significant positive relationship between session number (1:31) and internal reliability.

Between-session stability

We first assessed stability over time by computing correlation coefficients for all pairwise combinations of sessions (465 total combinations). Missing sessions were excluded from the correlations, meaning that some pairwise correlations included 78 subjects instead of 79 (see Table 2). All sessions correlated with each other, mean r = .71 (SD = .06, Min. = .48, Max. = .86, all p-values < .001). A heat map of all pairwise correlations is shown in Figure 6. The most temporally distant sessions still correlated with each other: the correlation between Day 1 and Day 30 (28 intervening sessions) was r = .53, p < .001, 95% CI [.35, .67]; the correlation between Day 30 and Day 60 (0 intervening sessions) was r = .81, p < .001, 95% CI [.72, .88]; and the correlation between Day 1 and Day 60 was r = .59, p < .001, 95% CI [.41, .71]. Finally, we observed that between-session stability increased over time, likely due to increased internal reliability across sessions. To quantify this change in stability over time, we calculated the correlation coefficient for temporally adjacent sessions (e.g. the correlation of session 1 and session 2, of session 2 and session 3, etc.). The average adjacent-session correlation was r = .76 (SD = .05, Min. = .64, Max. = .86), and the strength of adjacent-session correlations was positively correlated with session number, r = .68, p < .001, indicating an increase in stability over time.
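A sketch of this pairwise-stability analysis is shown below, using a simulated subjects-by-sessions matrix in place of the real data; the authors' actual script may have differed in detail.

```matlab
% Pairwise between-session correlations for a subjects-by-sessions matrix of
% capacity estimates (79 x 31), with NaNs marking missed sessions.
Kmat = 2.8 + 0.9*randn(79, 1) + 0.3*randn(79, 31);   % toy: shared trait + session noise
Kmat(randperm(numel(Kmat), 5)) = NaN;                % a few missing sessions

R = corr(Kmat, 'Rows', 'pairwise');                  % 31 x 31 correlation matrix
pairMask  = triu(true(size(R)), 1);                  % 465 unique session pairs
meanPairR = mean(R(pairMask));                       % average between-session r
adjacentR = diag(R, 1);                              % session k vs. session k+1
fprintf('mean pairwise r = %.2f, mean adjacent r = %.2f\n', meanPairR, mean(adjacentR));
```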

Figure 6. Correlations between sessions.


Left: Correlations between all possible pairs of sessions. Color represents the correlation coefficient of the capacity estimates from each possible pairwise combination of the 31 sessions. All correlation values were significant, p < .001. Right: Illustration of the sessions that are most distant in time: Day 1 correlated with Day 30 (28 intervening sessions) and Day 30 correlated with Day 60 (no intervening sessions).

Differences by testing location

We tested for systematic differences in performance, reliability, and stability for sessions completed at home versus in the lab. In total, there were 41 subjects who completed all of their sessions in their own home (“home group”), 27 subjects who completed all of their sessions in the lab (“lab group”), and 11 subjects who completed some sessions at home and some in the lab (“mixed group”).

Across all 31 sessions, subjects in the home group had an average capacity of 2.67 (SD = 1.01), those in the lab group had an average capacity of 3.01 (SD = .83) and those in the mixed group had an average capacity of 2.98 (SD = 1.04). On average, scores for sessions in the home group were slightly lower than scores for sessions in the lab group, t(2101) = −7.98, p < .001, 95% CI [−.42, −.25]. Scores for sessions in the mixed group were higher than for sessions in the home group, t(1606) = 5.0, p < .001, 95% CI [.19, .43], but were not different from the lab group, t(1175) = .44, p = .67, 95% CI [−.09, .14]. Interestingly, however, a paired t-test for the mixed group (n = 11) revealed that the same subjects performed slightly better in the lab (M = 3.08) and slightly worse at home, M = 2.85, t(10) = 3.15, p = .01, 95% CI [.07, .39].

Cronbach’s alpha estimates of within-session reliability were slightly higher for sessions completed at home (mean α = .76, SD = .05) than for sessions completed in the lab (mean α = .69, SD = .058), t(60) = 3.75, p < .001, 95% CI [.03, .10]. Likewise, Spearman-Brown corrected correlation coefficients were higher for sessions completed at home (mean r = .79, SD = .07) than in the lab (mean r = .67, SD = .14), t(60) = 4.42, p < .001, 95% CI [.07, .18]. However, these differences in reliability may result from (1) unequal sample sizes between lab and home, (2) unequal average capacity between groups, or (3) unequal variability between groups. After equating sample size between groups and matching samples for average capacity, differences in reliability were no longer stable: across iterations of matched samples, differences in Cronbach’s α ranged from p < .01 to p > .5, and differences in split-half correlation significance ranged from p < .01 to p > .25.

Next, we examined differences in stability for sessions completed at home compared to in the lab. On average, test-retest correlations were higher for home sessions (mean r = .72, SD = .08) than for lab sessions (mean r = .67, SD = .10), t(928) = 8.01, p < .001, 95% CI [.04, .06]. Again, however, differences in test-retest correlations were not reliable after matching sample size and average capacity; differences in correlation significance ranged from p = .01 to p = .98.

Discussion

With extensive practice over multiple sessions, we observed improvement in overall change detection performance. This improvement was most pronounced over early sessions, after which mean performance stabilized for the remaining sessions. The internal reliability of the first session (Spearman Brown corrected r =.71, Cronbach’s α = .67) was within the range predicted by the look-up table created in Experiment 1 for 80 subjects and 120 trials (predicted range: r = .61 to .87 and α = .58 to .80, respectively). Both reliability and stability remained high over the span of 60 days. In fact, reliability and stability increased slightly across sessions. An important consideration for any cognitive measure is whether or not repeated exposure to the task will harm the reliability of the measure. For example, re-exposure to the same logic puzzles will drastically reduce the amount of time needed to solve the puzzles and inflate accuracy. Thus, for such tasks great care must be taken to generate novel test versions to be administered at different dates. Similarly, over-practice effects could lead to a sharp decrease in variability of performance (e.g. ceiling effects, floor effects), which would by definition lead to a decrease in reliability. Here, we demonstrated that while capacity estimates increase when subjects are frequently exposed to a change detection task, the reliability of the measure is not compromised by practice effects or ceiling effects.

We also examined whether reliability was harmed for participants who completed the change detection sessions in their own homes compared to the lab. While remote data collection sacrifices some degree of experimental control, the use of at-home tests is becoming more common with the ease of remote data collection through resources like Amazon’s Mechanical Turk (Mason & Suri, 2012). Reliability was not noticeably disrupted by noise arising from small differences in stimulus size between different testing environments. After controlling for number of subjects and capacity, there was no longer a consistent difference in reliability or stability for sessions completed at home compared to in the lab. However, capacity estimates obtained in subjects’ homes were significantly lower than those obtained in the lab. Larger sample sizes are needed to more fully investigate systematic differences in capacity and reliability between testing environments.

General Discussion

In Experiment 1, we developed a novel approach for estimating expected reliability in future experiments. We collected change detection data from a large number of subjects and trials, and then we used an iterative downsampling procedure to investigate the effect of sample size and trial number on reliability. Average reliability across iterations was fairly impervious to the number of subjects. Instead, average reliability estimates across iterations relied more heavily on the number of trials per subject. On the other hand, the variability of reliability estimates across iterations was highly sensitive to the number of subjects. For example, with only 10 subjects, the average reliability estimate for an experiment with 150 trials was high (α = .75) but the worst iteration (akin to the worst expected experiment out of 100) gave a poor reliability estimate (α = .42). On the other hand, the range between the best and worst reliability estimates decreased dramatically as the number of subjects increased. With 40 subjects, the minimum observed reliability for 150 trials was α =.65.

In Experiment 2, we examined the reliability and stability of change detection capacity estimates across an unprecedented number of testing sessions. Subjects completed 31 sessions of single-probe change-detection. The first 30 sessions took place over 30 consecutive days, and the last session took place 30 days later (Day 60). Average internal reliability for the first session was in the range predicted by the look-up table in Experiment 1. Despite improvements in performance across sessions, between-subject variability in K remained stable over time (average test-retest between all 31 sessions was r =.76; the correlation for the two most distant sessions, Day 1 and Day 60, was r = .59). Interestingly, both within-session reliability and between-session reliability increased across sessions. Rather than diminishing due to practice, reliability of WMC estimates increased across many sessions.

The present work has implications for planning studies with novel measures and for justifying the inclusion of existing measures into clinical batteries such as the Research Domain Criteria (RDoC) project (Cuthbert & Kozak, 2013; Rodebaugh et al., 2016). For basic research, an internal reliability of 0.7 is considered a sufficient “rule of thumb” for investigating correlational relationship between measures (Nunnally, 1978). While this level of reliability (or even lower) will allow researchers to detect correlations, it is not sufficient to confidently assess the scores of individuals. For that, reliability in excess of .9 or even .95 is desirable (Nunnally, 1978). Here, we demonstrate how the number of trials can alter the reliability of working memory capacity estimates; with relatively few trials (~150, around 10 minutes of task time), change detection estimates are sufficiently reliable for correlation studies (α ~ .8), but many more trials are needed (~500) to boost reliability to the level needed to assess individuals (α ~ .9). Another important consideration for a diagnostic measure is its reliability across multiple testing sessions. Some tasks lose their diagnostic value once individuals have been exposed to them once or twice. Here we demonstrate that change detection estimates of working memory capacity are stable, even when participants are well-practiced on the task (3,720 trials over 31 sessions).

One challenge in estimating the “true” reliability of a cognitive task is that reliability depends heavily on sample characteristics. As we have demonstrated, varying the sample size and number of trials can yield very different estimates of the reliability for a perfectly identical task. Other sample characteristics can likewise affect reliability; the most notable of these is sample homogeneity. The sample used here was a large sample of university students, with a fairly wide range in capacities (approximately 0.5 – 4 items). Samples using only a subset of this capacity range (e.g. clinical patient groups with very low capacity) will be less internally reliable because of the restricted range of the sub-population. Indeed, in Experiment 1 we found that sampling iterations with poor reliability tended to have lower variability and a smaller range of scores. Thus, carefully recording sample size, mean, standard deviation, and internal reliability in all experiments will be critical for assessing and improving the reliability of standardized tasks used for cognitive research. In the interest of replicability, open source code repositories (e.g. the Experiment Factory) have sought to make standardized versions of common cognitive tasks better-categorized, open, and easily available (Sochat et al., 2016). However, one potential weakness for task repositories is a lack of documentation about expected internal reliability. Standardization of tasks can be very useful, but it should not be over-applied. In particular, experiments with different goals should use different test lengths that best suit the goals of the experimental question. We feel that projects such as the Experiment Factory will certainly lead to more replicable science, and including estimates of reliability with task code could help to further this goal.

Finally, the results presented here have implications for researchers who are interested in differences between experimental conditions and not individual differences per se. Trial number and sample size will affect the degree of measurement error for each condition used within change detection experiments (e.g. set-sizes, distractor presence, etc.). To detect significant differences between conditions and avoid false positives, it would be desirable to estimate the number of trials needed to ensure adequate internal reliability for each condition of interest within the experiment. Insufficient trial numbers or sample sizes can lead to intolerably low internal reliability, and could spoil an otherwise well-planned experiment.

The results of Experiments 1 and 2 revealed that change detection capacity estimates of visual working memory capacity are both internally reliable and stable across many testing sessions. This finding is consistent with previous studies showing that other measures of working memory capacity are reliable and stable, including complex span measures (Beckmann, Holling, & Kuhn, 2007; Foster et al., 2015; Klein & Fiss, 1999; Waters & Caplan, 1996) and the visuospatial n-back (Hockey & Geffen, 2004). The main analyses from Experiment 1 suggest concrete guidelines for designing studies that require reliable estimates of change detection capacity. When both sample size and trial numbers were high, the reliability of change detection was quite high (α > .9). However, studies with insufficient sample sizes or number of trials frequently had low internal reliability. Consistent with the notion that working memory capacity is a stable trait of the individual, individual differences in capacity remained stable over many sessions in Experiment 2 despite practice-related performance increases.

Both the effects of trial number and sample size are important to consider, and researchers should be cautious about generalizing expected reliability across vastly different sample sizes. For example, in a recent paper by Foster and colleagues (2015), the authors found that cutting the number of complex span trials by two-thirds had only a modest effect on the strength of the correlation between working memory capacity and fluid intelligence. Critically, however, the authors used around 500 subjects, and such a large sample size will act as a buffer against increases in measurement error (i.e. fewer trials per subject). Readers wishing to conduct a new study with a smaller sample size (e.g. 50 subjects) would be ill-advised to dramatically cut trial numbers based on this finding alone; as demonstrated in Experiment 1, cutting trial numbers leads to greater volatility of reliability values for small sample sizes relative to large ones. Given present concerns about power and replicability in psychological research (Open Science Collaboration, 2015), we suggest that rigorous estimation of task reliability, considering both subject and trial numbers, will be useful for planning both new studies and replication efforts.

Acknowledgments

Research was supported by the Project of Humanities and Social Sciences, Ministry of Education, China (15YJA190008), the Fundamental Research Funds for the Central Universities (SWU1309117), NIH grant 2R01 MH087214-06A1 and Office of Naval Research grant N00014-12-1-0972. Datasets for all experiments are available online on Open Science Framework at https://osf.io/g7txf/.

Footnotes

¹ Greenhouse-Geisser values are reported when Mauchly’s Test of Sphericity is violated.

Contributions: Z.X. and E.V. designed the experiments; Z.X. and X.F. collected data. K.A. performed analyses and drafted the manuscript. K.A., Z.X., and E.V. revised the manuscript.

Conflicts of Interest: none

References

  1. Beckmann B, Holling H, Kuhn JT. Reliability of verbal–numerical working memory tasks. Personality and Individual Differences. 2007;43(4):703–714. https://doi.org/10.1016/j.paid.2007.01.011
  2. Brown W. Some experimental results in the correlation of mental abilities. British Journal of Psychology, 1904–1920. 1910;3(3):296–322. https://doi.org/10.1111/j.2044-8295.1910.tb00207.x
  3. Buschman TJ, Siegel M, Roy JE, Miller EK. Neural substrates of cognitive capacity limitations. Proceedings of the National Academy of Sciences. 2011;108(27):11252–11255. https://doi.org/10.1073/pnas.1104666108
  4. Cowan N. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. The Behavioral and Brain Sciences. 2001;24(1):87–114–185. https://doi.org/10.1017/S0140525X01003922
  5. Cowan N, Fristoe NM, Elliott EM, Brunner RP, Saults JS. Scope of attention, control of attention, and intelligence in children and adults. Memory & Cognition. 2006;34(8):1754–1768. https://doi.org/10.3758/BF03195936
  6. Cramer D. Basic statistics for social research: step-by-step calculations and computer techniques using Minitab. London; New York: Routledge; 1997.
  7. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16(3):297–334. https://doi.org/10.1007/BF02310555
  8. Cuthbert BN, Kozak MJ. Constructing constructs for psychopathology: The NIMH research domain criteria. Journal of Abnormal Psychology. 2013;122(3):928–937. https://doi.org/10.1037/a0034028
  9. Elmore LC, Magnotti JF, Katz JS, Wright AA. Change detection by rhesus monkeys (Macaca mulatta) and pigeons (Columba livia). Journal of Comparative Psychology. 2012;126(3):203–212. https://doi.org/10.1037/a0026356
  10. Engle RW, Tuholski SW, Laughlin JE, Conway AR. Working memory, short-term memory, and general fluid intelligence: a latent-variable approach. Journal of Experimental Psychology: General. 1999;128(3):309–331. https://doi.org/10.1037//0096-3445.128.3.309
  11. Foster JL, Shipstead Z, Harrison TL, Hicks KL, Redick TS, Engle RW. Shortened complex span tasks can reliably measure working memory capacity. Memory & Cognition. 2015;43(2):226–236. https://doi.org/10.3758/s13421-014-0461-7
  12. Fukuda K, Vogel E, Mayr U, Awh E. Quantity, not quality: the relationship between fluid intelligence and working memory capacity. Psychonomic Bulletin & Review. 2010;17(5):673–679. https://doi.org/10.3758/17.5.673
  13. Fukuda K, Woodman GF, Vogel EK. Individual Differences in Visual Working Memory Capacity: Contributions of Attentional Control to Storage. In: Jolicoeur P, Lefebvre C, Martinez-Trujillo J, editors. Mechanisms of Sensory Working Memory: Attention and Performance XXV. Elsevier; 2015. pp. 105–120. http://linkinghub.elsevier.com/retrieve/pii/B9780128013717000090
  14. Gibson B, Wasserman E, Luck SJ. Qualitative similarities in the visual short-term memory of pigeons and people. Psychonomic Bulletin & Review. 2011;18(5):979–984. https://doi.org/10.3758/s13423-011-0132-7
  15. Gold JM, Wilk CM, McMahon RP, Buchanan RW, Luck SJ. Working memory for visual features and conjunctions in schizophrenia. Journal of Abnormal Psychology. 2003;112(1):61–71. https://doi.org/10.1037/0021-843X.112.1.61
  16. Hockey A, Geffen G. The concurrent validity and test–retest reliability of a visuospatial working memory task. Intelligence. 2004;32(6):591–605. https://doi.org/10.1016/j.intell.2004.07.009
  17. Johnson MK, McMahon RP, Robinson BM, Harvey AN, Hahn B, Leonard CJ, … Gold JM. The relationship between working memory capacity and broad measures of cognitive ability in healthy adults and people with schizophrenia. Neuropsychology. 2013;27(2):220–229. https://doi.org/10.1037/a0032060
  18. Klein K, Fiss WH. The reliability and stability of the Turner and Engle working memory task. Behavior Research Methods, Instruments, & Computers. 1999;31(3):429–432. https://doi.org/10.3758/bf03200722
  19. Lee EY, Cowan N, Vogel EK, Rolan T, Valle-Inclan F, Hackley SA. Visual working memory deficits in patients with Parkinson’s disease are due to both reduced storage capacity and impaired ability to filter out irrelevant information. Brain. 2010;133(9):2677–2689. https://doi.org/10.1093/brain/awq197
  20. Luria R, Balaban H, Awh E, Vogel EK. The contralateral delay activity as a neural measure of visual working memory. Neuroscience & Biobehavioral Reviews. 2016;62:100–108. https://doi.org/10.1016/j.neubiorev.2016.01.003
  21. Mason W, Suri S. Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods. 2012;44(1):1–23. https://doi.org/10.3758/s13428-011-0124-6
  22. Melby-Lervåg M, Hulme C. Is working memory training effective? A meta-analytic review. Developmental Psychology. 2013;49(2):270–291. https://doi.org/10.1037/a0028228
  23. Nunnally JC. Psychometric theory. 2nd ed. New York: McGraw-Hill; 1978.
  24. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716. https://doi.org/10.1126/science.aac4716
  25. Pailian H, Halberda J. The reliability and internal consistency of one-shot and flicker change detection for measuring individual differences in visual working memory capacity. Memory & Cognition. 2015;43(3):397–420. https://doi.org/10.3758/s13421-014-0492-0
  26. Pashler H. Familiarity and visual change detection. Perception & Psychophysics. 1988;44(4):369–378. https://doi.org/10.3758/BF03210419
  27. Reinhart RMG, Heitz RP, Purcell BA, Weigand PK, Schall JD, Woodman GF. Homologous Mechanisms of Visuospatial Working Memory Maintenance in Macaque and Human: Properties and Sources. Journal of Neuroscience. 2012;32(22):7711–7722. https://doi.org/10.1523/JNEUROSCI.0215-12.2012
  28. Rodebaugh TL, Scullin RB, Langer JK, Dixon DJ, Huppert JD, Bernstein A, … Lenze EJ. Unreliability as a Threat to Understanding Psychopathology: The Cautionary Tale of Attentional Bias. Journal of Abnormal Psychology. 2016. https://doi.org/10.1037/abn0000184
  29. Rouder JN. Applications and Source Code. n.d. Retrieved June 22, 2016, from http://pcl.missouri.edu/apps
  30. Rouder JN, Morey RD, Cowan N, Zwilling CE, Morey CC, Pratte MS. An assessment of fixed-capacity models of visual working memory. Proceedings of the National Academy of Sciences of the United States of America. 2008;105(16):5975–5979. https://doi.org/10.1073/pnas.0711295105
  31. Rouder JN, Morey RD, Morey CC, Cowan N. How to measure working memory capacity in the change detection paradigm. Psychonomic Bulletin & Review. 2011;18(2):324–330. https://doi.org/10.3758/s13423-011-0055-3
  32. Shipstead Z, Redick TS, Engle RW. Is working memory training effective? Psychological Bulletin. 2012;138(4):628–654. https://doi.org/10.1037/a0027473
  33. Sochat VV, Eisenberg IW, Enkavi AZ, Li J, Bissett PG, Poldrack RA. The Experiment Factory: Standardizing Behavioral Experiments. Frontiers in Psychology. 2016;7. https://doi.org/10.3389/fpsyg.2016.00610
  34. Spearman C. Correlation calculated from faulty data. British Journal of Psychology, 1904–1920. 1910;3(3):271–295. https://doi.org/10.1111/j.2044-8295.1910.tb00206.x
  35. Todd JJ, Marois R. Capacity limit of visual short-term memory in human posterior parietal cortex. Nature. 2004;428(6984):751–754. https://doi.org/10.1038/nature02466
  36. Unsworth N, Fukuda K, Awh E, Vogel EK. Working memory and fluid intelligence: Capacity, attention control, and secondary memory retrieval. Cognitive Psychology. 2014;71:1–26. https://doi.org/10.1016/j.cogpsych.2014.01.003
  37. Van Snellenberg JX, Conway ARA, Spicer J, Read C, Smith EE. Capacity estimates in working memory: Reliability and interrelationships among tasks. Cognitive, Affective, & Behavioral Neuroscience. 2014;14(1):106–116. https://doi.org/10.3758/s13415-013-0235-x
  38. Vogel EK, Machizawa MG. Neural activity predicts individual differences in visual working memory capacity. Nature. 2004;428(6984):748–751. https://doi.org/10.1038/nature02447
  39. Waters GS, Caplan D. The measurement of verbal working memory capacity and its relation to reading comprehension. The Quarterly Journal of Experimental Psychology A: Human Experimental Psychology. 1996;49(1):51–75. https://doi.org/10.1080/713755607
  40. Wood G, Hartley G, Furley PA, Wilson MR. Working Memory Capacity, Visual Attention and Hazard Perception in Driving. Journal of Applied Research in Memory and Cognition. 2016. https://doi.org/10.1016/j.jarmac.2016.04.009
