Author manuscript; available in PMC: 2019 Apr 1.
Published in final edited form as: Behav Res Methods. 2018 Apr;50(2):576–588. doi: 10.3758/s13428-017-0886-6

The reliability and stability of visual working memory capacity

Z Xu 1,*, KCS Adam 2,*, X Fang 1, EK Vogel 2
PMCID: PMC5632133  NIHMSID: NIHMS866866  PMID: 28389852

Abstract

Because of the central role of working memory capacity in cognition, many studies have used short measures of working memory capacity to examine its relationship to other domains. Here, we measured the reliability and stability of visual working memory capacity, measured using a single-probe change detection task. In Experiment 1, subjects (N = 137) completed a large number of trials of a change detection task (540 in total, 180 each of set–sizes 4, 6, and 8). With large numbers of trials and subjects, reliability estimates were high (α > .9). We then used an iterative downsampling procedure to create a look-up table for expected reliability in experiments with small sample sizes. In Experiment 2, subjects (N = 79) completed 31 sessions of single-probe change-detection. The first 30 sessions took place over 30 consecutive days, and the last session took place 30 days later. This unprecedented number of sessions allowed us to examine the effects of practice on stability and internal reliability. Even after much practice, individual differences were stable over time (average between-session r =.76).

Keywords: visual working memory, reliability, change detection


Working Memory Capacity (WMC) is a core cognitive ability that predicts performance across many domains. For example, WMC predicts attentional control, fluid intelligence and real-world outcomes such as perceiving hazards while driving (Engle, Tuholski, Laughlin, & Conway, 1999; Fukuda, Vogel, Mayr, & Awh, 2010; Wood, Hartley, Furley, & Wilson, 2016). As such, researchers are often interested in devising brief measures of WMC to investigate the relationship of WMC to other cognitive processes. However, truncated versions of WMC tasks could potentially be inadequate for reliably measuring an individual’s capacity. Inadequate measurement could obscure correlations between measures or even differences in performance between experimental conditions. Furthermore, while WMC is considered to be a stable trait of the observer, little work has directly examined the role of extensive practice on the measurement of WMC over time. This is of particular concern due to the popularity of research examining whether training affects WMC (Melby-Lervåg & Hulme, 2013; Shipstead, Redick, & Engle, 2012). Extensive practice on any given cognitive task has the potential to significantly alter the nature of the variance that is determining performance. For example, extensive practice has the potential to induce a restriction of range problem, in which the bulk of the observers reach similar performance levels - thus reducing any opportunity to observe correlations with other measures. Consequently, a systematic study of the reliability and stability of WMC measures is critical for improving the measurement and reproducibility of major phenomena in this field.

In the present study, we seek to establish the reliability and stability of one particular WMC measure: Change Detection. Change detection measures of visual working memory have gained popularity as a means of assessing individual differences in capacity. In a typical change detection task, participants briefly view an array of simple visual items (~100 to 500 ms), such as colored squares, and remember these items across a short delay (~1 to 2 seconds). At test, observers are presented with an item at one of the remembered locations, and they indicate whether the presented test item is the same as the remembered item (“no-change” trial) or is different (“change trial”). Performance can be quantified as raw accuracy or converted into a capacity estimate (“K”). In capacity estimates, performance for change trials and no-change trials is calculated separately as hits (proportion of correct change trials) and false alarms (proportion of incorrect no-change trials) and converted into a set-size dependent score (Cowan, 2001; Pashler, 1988; Rouder, Morey, Morey, & Cowan, 2011).

There are several beneficial features of change detection tasks that have led to their increased popularity. First, change detection memory tasks are simple and short enough to be used with developmental and clinical populations (e.g. Cowan, Fristoe, Elliott, Brunner, & Saults, 2006; Gold, Wilk, McMahon, Buchanan, & Luck, 2003; Lee et al., 2010). Second, the relatively short length of trials lends the task well to neural measures that require large numbers of trials. In particular, neural studies employing change detection tasks have provided strong corroborating evidence of capacity limits in WM (Todd & Marois, 2004; Vogel & Machizawa, 2004), and have yielded insights into potential mechanisms underlying individual differences in working memory capacity (for review, see: Luria, Balaban, Awh, & Vogel, 2016). Finally, change detection tasks and closely-related memory-guided saccade tasks can be used with animal models from pigeons (Gibson, Wasserman, & Luck, 2011) to non-human primates (Buschman, Siegel, Roy, & Miller, 2011), providing a rare opportunity to directly compare behavior and neural correlates of task performance across species (Elmore, Magnotti, Katz, & Wright, 2012; Reinhart et al., 2012).

A main aim of this study is to quantify the effect of measurement error and sample size on the reliability of change detection estimates. In previous studies, change detection estimates of capacity have yielded good reliability estimates (e.g. Pailian & Halberda, 2015; Unsworth et al., 2014). However, measurement error can vary dramatically with the number of trials in a task, thus impacting reliability; Pailian and Halberda (2015) found that reliability of change detection estimates greatly improved when the number of trials was increased. Researchers frequently employ vastly different numbers of trials and subjects in studies of individual differences, but the effect of trial number on change-detection reliability has never been fully characterized. In studies using large batteries of tasks, time and measurement error are forces working in opposition to one another. When researchers want to minimize the amount of time that a task takes, measures are often truncated to expedite administration. Such truncated measures increase measurement noise and potentially harm the reliability of the measure. At present, there is no clear understanding of the minimum numbers of subjects and trials that are necessary to obtain reliable estimates of change detection capacity.

In addition to within-session measurement error, the reliability of individual differences could be compromised by extensive practice. Previously, it was found that visual working memory capacity estimates were stable (r = .77) after 1.5 years between testing sessions (Johnson et al., 2013). However, the effect of extensive practice on change detection estimates of capacity has yet to be characterized. Extensive practice could harm the reliability and stability of measures in a couple of ways. First, it is possible that participants could improve so much that they reach a performance ceiling, thus eliminating variability between individuals. Second, if individual differences are due to the utilization of optimal versus sub-optimal strategies, then participants might converge to a common mean after engaging in extensive practice and finding optimal task strategies. Both of these hypothetical possibilities would call into question the true stability of working memory capacity estimates, and likewise severely harm the statistical reliability of the measure. As such, in Experiment 2 we directly quantified the effect of extensive practice on the stability of working memory capacity estimates.

Overview of Experiments

We measured the reliability and stability of a single-probe change-detection measure of visual working memory capacity. In Experiment 1, we measured the reliability of capacity estimates obtained with a commonly used version of the color change-detection task for a relatively large number of subjects (n = 137) and a larger than typical number of trials (t = 540). In Experiment 2, we measured the stability of capacity estimates across an unprecedented number of testing sessions (31). Because of the large number of sessions, we could investigate the stability of change detection estimates after extended practice and over a period of 60 days.

Experiment 1

Materials and Methods

Participants

A total of 139 individuals (35 males; mean age = 19.97, SD = 1.07) with normal or corrected-to-normal vision participated in the experiment. Participants provided written informed consent, and the study was approved by the Ethics Committee at Southwest University. Participants received monetary compensation for their participation. Two participants were excluded because they had negative average capacity values, resulting in a final sample of 137 subjects.

Stimuli

Stimuli were presented on monitors with a refresh rate of 75 Hz and a screen resolution of 1024 x 768. Participants sat approximately 60 cm from the screen; because a chinrest was not used, all visual angle estimates are approximate. In addition, there were some small variations in monitor size (five 16” CRT monitors, three 19” LCD monitors) across testing rooms, leading to small variations in the size of the colored squares from monitor to monitor. Approximate ranges in degrees of visual angle are therefore reported below.

All stimuli were generated in MATLAB (The MathWorks, Natick, MA) using Psychophysics Toolbox. Colored squares (51 pixels; range of 1.55° to 2.0° visual angle) served as memoranda. Squares could appear anywhere within an area of the monitor subtending approximately 10.3° to 13.35° horizontally and 7.9° to 9.8° vertically. Squares could appear in any of nine distinct colors, and colors were sampled without replacement within each trial (RGB values: Red = 255 0 0; Green = 0 255 0; Blue = 0 0 255; Magenta = 255 0 255; Yellow = 255 255 0; Cyan = 0 255 255; Orange = 255 128 0; White = 255 255 255; Black = 0 0 0). Participants were instructed to fixate a small black dot (approximate range: .36° to .47° visual angle) at the center of the display.

Procedures

Each trial began with a blank fixation period of 1,000 ms. Then, participants briefly viewed an array of 4, 6, or 8 colored squares (150 ms), which they remembered across a blank delay period (1,000 ms). At test, one colored square was presented at one of the remembered locations. There was an equal probability that the probed square was the same color (no-change trial) or a different color (change trial). Participants made an unspeeded response by pressing the “z” key if the color was the same and the “/” key if the color was different. Participants completed 180 trials each of set-sizes 4, 6, and 8 (540 trials total). Trials were divided into 9 blocks, and participants were given a brief rest period (30 seconds) after each block. To calculate capacity, change detection accuracy was transformed into a K estimate using Cowan’s (2001) formula K = N × (H – FA), where N represents the set-size, H is the hit rate (proportion of correct change trials), and FA is the false alarm rate (proportion of incorrect no-change trials). Cowan’s formula is best for single-probe displays like the one employed here. For change detection tasks using whole-display probes, Pashler’s (1988) formula may be more appropriate (Rouder et al., 2011).
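For concreteness, the K computation for a single set-size condition can be sketched in a few lines of MATLAB. The variable names and simulated responses below are illustrative stand-ins, not the authors' code.

```matlab
% Minimal sketch of Cowan's K for one set-size condition.
% Trial labels and responses are simulated placeholders.
nTrials    = 180;                                        % trials at this set-size
setSize    = 4;
isChange   = rand(nTrials, 1) < 0.5;                     % true on change trials
saidChange = (isChange  & rand(nTrials, 1) < 0.80) | ... % toy hits (~80%)
             (~isChange & rand(nTrials, 1) < 0.20);      % toy false alarms (~20%)

hitRate        = mean(saidChange(isChange));    % H: proportion correct change trials
falseAlarmRate = mean(saidChange(~isChange));   % FA: proportion incorrect no-change trials
K = setSize * (hitRate - falseAlarmRate)        % K = N x (H - FA)
```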

Results

Descriptive statistics for each set-size condition are shown in Table 1, and data for both Experiment 1 and 2 are available online on Open Science Framework at https://osf.io/g7txf/. There was a significant difference in performance across set-sizes, F(2,268) = 20.6, p < .001, ηp2 = .133, and polynomial contrasts revealed a significant linear trend, F(1,134) = 36.48, p < .001, ηp2 = .214, indicating that average performance declined slightly with increased memory load.
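A minimal MATLAB sketch of this kind of set-size comparison is given below. The simulated subjects-by-set-size K matrix and the choice of fitrm/ranova (Statistics and Machine Learning Toolbox) are ours, not necessarily the authors' implementation.

```matlab
% Repeated-measures ANOVA over the three set-size K scores (sketch only).
Kss = 2.1 + 0.8*randn(137, 1) + [0.2, 0, -0.15] + 0.3*randn(137, 3);  % toy K scores
tblK   = array2table(Kss, 'VariableNames', {'ss4', 'ss6', 'ss8'});
within = table(categorical([4; 6; 8]), 'VariableNames', {'SetSize'});

rm = fitrm(tblK, 'ss4-ss8 ~ 1', 'WithinDesign', within);
ranova(rm, 'WithinModel', 'SetSize')            % main effect of set-size
```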

Table 1. Descriptive statistics for Experiment 1.

Descriptive statistics are shown separately for each set-size and for the average of the three set-sizes. Kurtosis and skewness values are both centered around 0. Neither kurtosis nor skewness was credibly non-normal in any condition (Cramer, 1997).

             Mean K   SD     Min     Max    Kurtosis   Skewness
Set-Size 4    2.32    .70     .58    3.87     −.49       −.34
Set-Size 6    2.10    .97     .07    4.80     −.18        .34
Set-Size 8    1.98    .97    −.18    4.53     −.52       −.14
Average       2.14    .82     .38    4.31     −.47        .07

Reliability of the Full Sample: Cronbach’s Alpha

We computed Cronbach’s alpha (unstandardized) using K scores from the three set-sizes as items (180 trials contributing to each item), and obtained a value of α = .91 (Cronbach, 1951). We also computed Cronbach’s alpha using K scores from the nine blocks of trials (60 trials contributing to each item) and obtained a nearly identical value of α = .92. Finally, we computed Cronbach’s alpha using raw accuracy for single trials (540 items), and obtained an identical value of α = .92. Thus, change detection estimates had high internal reliability for this large sample of subjects, and the precise method used to divide trials into “items” does not impact Cronbach’s alpha estimates of reliability for the full sample. Further, using raw accuracy versus bias-corrected K scores did not impact reliability.
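For reference, Cronbach's alpha can be computed from a subjects-by-items score matrix with the standard formula. The sketch below uses simulated single-trial accuracy in place of the real data and is not the authors' exact script.

```matlab
% Cronbach's alpha from a subjects-by-items matrix (here, simulated 0/1
% single-trial accuracy for 137 subjects x 540 trials).
nSubj = 137; nItems = 540;
pCorrect = 0.55 + 0.35 * rand(nSubj, 1);       % toy per-subject accuracy levels
X = double(rand(nSubj, nItems) < pCorrect);    % 0/1 single-trial scores

k        = size(X, 2);                         % number of items
itemVar  = var(X, 0, 1);                       % variance of each item across subjects
totalVar = var(sum(X, 2));                     % variance of subjects' total scores
alpha    = (k / (k - 1)) * (1 - sum(itemVar) / totalVar)
```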

Reliability of the Full Sample: Split-half

The split-half correlation of the K scores for even and odd trials was reliable, r = .88, p < .001, 95% CI [.78 .88]. Correcting for attenuation yielded a split-half correlation value of r = .94 (Brown, 1910; Spearman, 1910). Likewise, the capacity scores from individual set-sizes correlated with each other: rss4–ss6 = .84, p < .001, [95% CI .78 .88]; rss6–ss8 = .78, p < .001, [95% CI .72 .85]; rss4–ss8 = .76, p < .001, [95% CI .68 .83]. Split-half correlations for individual set-sizes yielded Spearman-Brown corrected correlation values of r = .91 for set-size 4, r = .86 for set-size 6, and r = .76 for set-size 8, respectively.
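The even/odd split-half procedure with the Spearman-Brown correction can be sketched as follows; again the accuracy matrix is simulated, and the code is illustrative rather than the authors' script.

```matlab
% Split-half reliability (odd vs. even trials) with Spearman-Brown correction.
nSubj = 137; nTrials = 540;
acc = double(rand(nSubj, nTrials) < (0.55 + 0.35 * rand(nSubj, 1)));  % toy accuracy

oddScore  = mean(acc(:, 1:2:end), 2);     % per-subject score, odd trials
evenScore = mean(acc(:, 2:2:end), 2);     % per-subject score, even trials
R     = corrcoef(oddScore, evenScore);
rHalf = R(1, 2);                          % uncorrected split-half correlation
rSB   = 2 * rHalf / (1 + rHalf)           % Spearman-Brown corrected estimate
```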

The drop in capacity from set-size 4 to set-size 8 has been used in the literature as a measure of filtering ability. However, the internal reliability of this difference score has typically been low (Pailian & Halberda, 2015; Unsworth et al., 2014). Likewise, we found here that the split-half reliability of the performance decline from set-size 4 to set-size 8 (“4–8 Drop”) was low, with a Spearman-Brown corrected correlation value of r = .24. While weak, this correlation is the same strength as that reported in earlier work (Unsworth et al., 2014). The split-half reliability of the performance decline from set-size 4 to set-size 6 was slightly higher, r = .39, and the split-half reliability of the difference between set-size 6 and set-size 8 performance was very low, r = .08. The reliability of difference scores can be impacted both by (1) the internal reliability of each measure used to compute the difference and (2) the degree of correlation between the two measures (Rodebaugh et al., 2016). Although the internal reliability of each individual set-size was high, the positive correlation between set-sizes may have decreased the reliability of the set-size difference scores.

An Iterative Downsampling Approach

To investigate the effects of sample size and trial number on reliability estimates, we used an iterative downsampling procedure. Two reliability metrics were assessed: (1) Cronbach’s alpha, using single trial accuracy as items and (2) split-half correlations using all trials. For the downsampling procedure, we randomly sampled subjects and trials from the full dataset. Number of subjects (n) was varied from 5 to 135 in steps of 5. The number of trials (t) was varied from 5 to 540 in steps of 5. Number of subjects and number of trials were factorially combined (2916 cells total). For each cell in the design, we ran 100 sampling iterations. On each iteration, n subjects and t trials were randomly sampled from the full dataset and reliability metrics were calculated for the sample.
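The core of one downsampling cell might look like the following MATLAB sketch. The full-data matrix is simulated here as a stand-in for the real 137 x 540 data set, and the cell size (30 subjects, 120 trials) is just an example.

```matlab
% One cell of the downsampling grid: draw n subjects and t trials at random,
% compute the corrected split-half correlation, and repeat 100 times.
acc = double(rand(137, 540) < (0.55 + 0.35 * rand(137, 1)));  % toy accuracy matrix
n = 30; t = 120; nIter = 100;

rel = nan(nIter, 1);
for it = 1:nIter
    sub = acc(randperm(137, n), randperm(540, t));   % random subjects and trials
    o   = mean(sub(:, 1:2:end), 2);                  % odd-trial scores
    e   = mean(sub(:, 2:2:end), 2);                  % even-trial scores
    R   = corrcoef(o, e);
    rel(it) = 2 * R(1, 2) / (1 + R(1, 2));           % Spearman-Brown correction
end
fprintf('average r = %.2f, worst r = %.2f\n', mean(rel), min(rel));
```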

Figure 1 shows the results of the downsampling procedure for Cronbach’s alpha. Figure 2 shows the results of the downsampling procedure for split-half reliability estimates. In each plot, we show both the average reliability obtained across the 100 iterations (Fig. 1A and Fig. 2A) and the worst reliability obtained across the 100 iterations (Fig. 1B and Fig. 2B). Conceptually, we could think of each iteration of the downsampling procedure as akin to running one “experiment” with subjects randomly sampled from our “population” of 137. While it is good to know the average expected reliability across many experiments, the typical experimenter will only run an experiment once. Thus, considering the “worst case scenario” is instructive for planning the number of subjects and the number of trials to be collected. For a more complete picture of the breadth of reliabilities obtained, we can also consider the variability in reliability across iterations (SD) and the range of reliability values (Fig. 2C–2D). Finally, we repeated this iterative downsampling approach for each individual set-size. Average reliability as well as the variability of reliability for individual set-sizes are shown in Figure 3. Note, each set-size begins with 1/3 as many trials as Figures 1 and 2.

Figure 1. Cronbach’s alpha as a function of the number of trials and the number of subjects in Experiment 1.


In each cell, Cronbach’s alpha was computed for t trials (x-axis) and n subjects (y-axis). (a) Average reliability across 100 iterations. (b) Minimum reliability obtained (worst random sample of subjects and trials).

Figure 2. Spearman-Brown corrected split-half reliability estimates as a function of the number of trials and subjects in Experiment 1.


(a) Average reliability across 100 iterations. (b) Minimum reliability obtained (worst random sample of subjects and trials). (c) Standard deviation of the reliability obtained across samples. (d) Range of reliability values obtained across samples.

Figure 3. Spearman-Brown corrected split-half reliability estimates for each set-size in Experiment 1.


Top 3 panels: Average reliability for each set-size. Bottom 3 panels: Standard Deviation of the reliability for each set-size across 100 downsampling iterations.

Next, we looked at some potential characteristics of samples with low reliability (e.g. iterations with particularly low versus high reliability). We ran 500 sampling iterations of 30 subjects and 120 trials, then we did a median split for high-versus low-reliability samples. There was no significant difference in the mean (p = .86), skewness (p = .60) or kurtosis (p = .70) of high versus low reliability samples. There was, however, a significant effect of sample range and variability. As would be expected, samples with higher reliability had a larger standard deviation, t(498) = 26.7, p <.001, 95% CI [.14 .17], and a wider range, t(498) = 15.2, p < .001, 95% CI [.52 .67]), than samples with low reliability.

A Note for Fixed Capacity + Attention Estimates of Capacity

So far, we have discussed only the most commonly used methods of estimating working memory capacity (K scores and percent correct). Other methods of estimating capacity have been used, and we would like to briefly mention one of them. Rouder and colleagues (2008) suggested adding an attentional lapse parameter to estimates of visual working memory capacity, a model referred to as Fixed Capacity + Attention. Adding an attentional lapse parameter accounts for trials where subjects are inattentive to the task at hand. Specifically, participants commonly make errors on trials that should be well within capacity limits (e.g. set-size 1), and adding a lapse parameter can help to explain these anomalous dips in performance. Unlike typical estimates of capacity in which a K value is computed directly for performance for each set-size and then averaged, this model uses a log-likelihood estimation technique that estimates a single capacity parameter by simultaneously considering performance across all set-sizes and/or change probability conditions. Critically, this model assumes that data is obtained for at least one sub-capacity set-size, and that any error made on this set-size reflects an attentional lapse. If the model is fit to data that lacks at least one sub-capacity set-size (e.g. 1 or 2 items), then the model will fit poorly and provide nonsensical parameter estimates.

Recently, Van Snellenberg and colleagues used the Fixed Capacity + Attention Model to calculate capacity for a change detection task, and they found that the reliability of the model’s capacity parameter was low (r = .35), and did not correlate with other working memory tasks (Van Snellenberg, Conway, Spicer, Read, & Smith, 2014). Critically, however, this study used only relatively high set-sizes (4 and 8), and lacked a sub-capacity set-size, so model fits were likely poor. Using code made available from Rouder et al., we fit a Fixed Capacity + Attention model to our data (Rouder, n.d.). We found that when this model is misapplied (i.e. used on data without at least 1 sub-capacity set-size) the internal reliability of the capacity parameter was low (r uncorrected = .35), and negatively correlated with raw change detection accuracy, r = −.25, p = .004. If we had only applied this model to our data, we would have mistakenly concluded that change detection measures offer poor reliability and do not correlate with other measures of working memory capacity.

Discussion

Here, we have shown that when sufficient numbers of trials and subjects are collected, the reliability of change detection capacity is remarkably high (r >.9). On the other hand, a systematic downsampling method revealed that insufficient trials or insufficient subject numbers could dramatically reduce the reliability obtained in a single experiment. If researchers hope to measure the correlation between visual working memory capacity and some other measure, Figures 1 and 2 can serve as an approximate guide to expected reliability. Because we only had a single sample of the largest n (137), we cannot make definitive claims about the reliability of future samples of this size. However, given the stabilization of correlation coefficients with large sample sizes and the extremely high correlation coefficient obtained, we can be relatively confident that the reliability estimate for our full sample (n = 137) would not change substantially in future samples of university students. Further, we can make claims about how the reliability of small, well-defined sub-samples of this “population” can systematically deviate from an empirical upper bound.

The average capacity obtained for this sample was slightly lower than some other values in the literature, typically cited as around 3–4 items. The slightly lower average for this sample could potentially cause some concern about the generalizability of these reliability values for future samples. For the current manuscript’s sample, average K-scores for set-sizes 4 and 8 were K = 2.3 and K = 2.0, respectively. The largest, most comparable sample to the present sample is a 495-subject sample in work by Fukuda, Woodman, and Vogel (2015). In that sample, the average K-scores for set-sizes 4 and 8 were K = 2.7 and K = 2.4, respectively, and the task design was nearly identical (150 ms encoding time, 1000 ms retention interval, no color repetitions allowed, and set-sizes 4 and 8). The difference of 0.3–0.4 items between these two samples is relatively small, though likely significant. However, for the purposes of estimating reliability, the variance of the distribution is more important than the mean. The variability observed in the present sample (SD = 0.70 for set-size 4, SD = 0.97 for set-size 8) was very similar to that observed in the Fukuda et al. sample (SD = 0.6 for set-size 4 and SD = 1.2 for set-size 8), though unfortunately the Fukuda et al. study did not report reliability. Because of the nearly identical variability of scores across these two samples, we can infer that our reliability results would indeed generalize to other large samples for which change detection scores have been obtained.

We recommend applying an iterative downsampling approach to other measures where expediency of task administration is valued, but reliability is paramount. The stats-savvy reader may note that the Spearman-Brown prophecy formula also allows one to calculate how many observations must be added to improve expected reliability, according to the formula:

N = \frac{\rho^{*}_{xx}(1 - \rho_{xx})}{\rho_{xx}(1 - \rho^{*}_{xx})}

where ρ*_xx is the desired correlation strength, ρ_xx is the observed correlation, and N is the number of times that test length must be multiplied to achieve the desired correlation strength. Critically, however, this formula does not account for the accuracy of the observed correlation. Thus, if one starts from an unreliable correlation coefficient obtained with a small number of subjects and trials, one will obtain an unreliable estimate of the number of observations needed to improve correlation strength. In experiments such as this one, both the number of trials and the number of subjects will drastically change estimates of the number of subjects needed to observe correlations of a desired strength.

Let’s take an example from our iterative downsampling procedure. Imagine that we ran 100 experiments, each with 15 subjects and 150 total trials of change detection. Doing so, we would obtain 100 different estimates of the strength of the true split-half correlation. We could then apply the Spearman-Brown formula to each of these 100 estimates in order to calculate the number of trials needed to obtain a desired reliability of r = .8. So doing, we would find that, on average, we would need around 140 trials to obtain the desired reliability. However, because of the large variability in the observed correlation strength (r = .37 to .97), if we had only run the “best case” experiment (r = .97), we would estimate that we need only 18 trials to obtain our desired reliability of r = .8 with 15 subjects. On the other hand, if we had run the “worst case” experiment (r = .37), then we would estimate that we need 1,030 trials. There are downsides to both types of estimation errors. While a pessimistic estimate of the number of trials needed (>1000) would certainly ensure adequate reliability, this may come at the cost of time and participants’ frustration. Conversely, an overly optimistic estimate of the number of trials needed (<20) would lead to underpowered studies that waste time and funds.
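The arithmetic behind this example is a one-line application of the prophecy formula; the sketch below reproduces it, with small differences from the quoted trial counts attributable to rounding of the observed correlations.

```matlab
% Spearman-Brown "prophecy" from the worked example above: trials needed to
% reach a target reliability of .80 starting from 150 trials, given an
% observed split-half r of .97 (best case) or .37 (worst case).
targetR   = 0.80;
observedR = [0.97, 0.37];
lengthen  = (targetR .* (1 - observedR)) ./ (observedR .* (1 - targetR));
trialsNeeded = ceil(150 * lengthen)   % roughly 19 and 1022 trials with these rounded r values
```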

Finally, we investigated an alternative parameterization of capacity based on a model that assumes a fixed capacity and an attention lapse parameter (Rouder et al., 2008). Critically, this model attempts to explain errors for set-sizes that are well within capacity limits (e.g. 1 item). If researchers inappropriately apply this model to change detection data with only large set-sizes, they would erroneously conclude that change detection tasks yield poor reliability and fail to correlate with other estimates of capacity (e.g. Van Snellenberg et al., 2014).

In Experiment 2, we shifted our focus to the stability of change detection estimates. That is, how consistent are estimates of capacity from day-to-day? We collected an unprecedented number of sessions of change detection performance (31) spanning 60 days. We examined the stability of capacity estimates, defined as the correlation between individuals’ capacity estimates from one day to the next. Since capacity is thought to be a stable trait of the individual, we predicted that individual differences in capacity should be reliable across many testing sessions.

Experiment 2

Materials and Methods

Participants

A total of 79 individuals (22 male, 57 female; mean age = 22.67 years, SD = 2.31) with normal or corrected-to-normal vision participated for monetary compensation. The study was approved by the Ethics Committee of Southwest University.

Stimuli

Some experimental sessions were completed in the lab and others were completed in participants’ homes. In the lab, stimuli were presented on monitors with a refresh rate of 75 Hz. At home, stimuli were presented on laptop screens with somewhat variable refresh rates and sizes. In both cases, participants sat approximately 60 cm from the screen; because a chinrest was not used, all visual angle estimates are approximate. In the lab, there were some small variations in monitor size (five 18.5” LCD monitors, one 19” LCD monitor) across testing rooms, leading to small variations in the size of the colored squares. Approximate ranges in degrees of visual angle are reported below for the lab monitors.

All stimuli were generated in MATLAB (The MathWorks, Natick, MA) using Psychophysics Toolbox. Colored squares (51 pixels; range of 1.28° to 1.46° visual angle) served as memoranda. Squares could appear anywhere within an area of the monitor subtending approximately 14.4° to 14.8° horizontally and 8.1° to 8.4° vertically. Squares could appear in any of nine distinct colors (RGB values: Red = 255 0 0; Green = 0 255 0; Blue = 0 0 255; Magenta = 255 0 255; Yellow = 255 255 0; Cyan = 0 255 255; Orange = 255 128 0; White = 255 255 255; Black = 0 0 0). Colors were sampled without replacement for set-size 4 and set-size 6 trials. Each color could be repeated up to one time in set-size 8 trials (i.e., colors were sampled from a list of 18 colors, with each of the nine unique colors appearing twice). Participants were instructed to fixate a small black dot (~.3° visual angle) at the center of the display.

Procedures

Trial procedures for the change detection task were identical to those of Experiment 1. Participants completed a total of 31 sessions of the change detection task. In each session, participants completed a total of 120 trials (split over 5 blocks), with 40 trials each of set-sizes 4, 6, and 8. Participants were asked to complete the change detection task once a day for 30 consecutive days, and a final (31st) session took place 30 days after the 30th session (Day 60). They could do this task on their own computers or on the experimenters’ computers throughout the day. Participants were instructed that they should complete the task in a relatively quiet environment and not do anything else (e.g. talking to others) at the same time. Experimenters reminded the participants to finish the task and collected the data files every day.

Results

Descriptive Statistics

Descriptive statistics for average K values across the 31 sessions are shown in Table 2. Across all sessions, the average capacity was 2.83 (SD = .23). Change in mean capacity over time is shown in Figure 4A. A repeated measures ANOVA revealed a significant difference in capacity across sessions, F(18.76, 1388.38)¹ = 15.04, p < .001, ηp2 = .169. Subjects’ performance initially improved across sessions, then leveled off. The group-average increase in capacity over time is well described by a two-term exponential model (SSE = .08, RMSE = .06, Adjusted R2 = .94), given by the equation y = 2.776 × e^(.003x) − .798 × e^(−26x). To test the impression that individuals’ improvement slowed over time, we fit several growth curve models to the data using Maximum Likelihood Estimation (‘fitmle.m’) with Subject entered as a random factor. We coded time as days from the first session (Session 1 = 0). Model A included only a random intercept; Model B included a random intercept and a random linear effect of time; Model C added a quadratic effect of time; and Model D added a cubic effect of time. As shown in Table 3, the quadratic model provided the best fit to the data. Further testing revealed that both random slopes and random intercepts were needed to best fit the data (Table 4, Models C1–C4). That is, participants started out with different baseline capacity values, and they improved at different rates. However, the covariance matrix for Model C revealed no systematic relationship between initial capacity (intercept) and either the linear effect of time, r = .21, 95% CI [−.10 .49], or the quadratic effect of time, r = −.14, 95% CI [−.48 .24]. This suggests that there was no meaningful relationship between a participant’s initial capacity and their rate of improvement. To visualize this point, we did a quartile split of session 1 performance and then plotted the change over time for each group (Figure 4B).
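For readers who want to fit a comparable model, a sketch of the quadratic growth-curve model (Model C) is shown below. We assume the ‘fitmle.m’ referenced above corresponds to MATLAB's fitlme (linear mixed-effects models fit by maximum likelihood); the toy data and column names are illustrative only.

```matlab
% Quadratic growth-curve sketch: random intercepts plus random linear and
% quadratic slopes over subjects, fit by maximum likelihood with fitlme.
nSubj = 79; days = [0:29, 59];                        % session days (0 = first session)
[S, D] = ndgrid(1:nSubj, days);
K = 2.4 + 0.5*randn(nSubj, 1) + 0.02*D - 2e-4*D.^2 + 0.3*randn(size(D));  % toy K values
tbl = table(categorical(S(:)), D(:), K(:), 'VariableNames', {'Subject', 'Day', 'K'});
tbl.Day2 = tbl.Day.^2;                                % quadratic time term

mdlC = fitlme(tbl, 'K ~ Day + Day2 + (Day + Day2 | Subject)', 'FitMethod', 'ML');
disp(mdlC)                                            % fixed effects, log-likelihood, BIC
```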

Table 2. Descriptive statistics for Experiment 2.

Descriptive statistics are shown separately for each set-size and for the average of the three set-sizes. Kurtosis and skewness values are both centered around 0. Asterisks denote credible deviation from normality (Cramer, 1997).

          N    Mean   SD     Minimum   Maximum   Kurtosis   Skewness
Day 1     79   2.15   0.85    0.40      4.03      −0.69       0.24
Day 2     79   2.36   0.86    0.07      3.97      −0.24      −0.32
Day 3     79   2.43   0.82    0.80      4.07      −0.62      −0.29
Day 4     78   2.51   0.85    0.40      4.10      −0.31      −0.31
Day 5     79   2.52   0.93    0.57      4.27      −0.55      −0.13
Day 6     79   2.74   0.92    0.53      4.60      −0.39      −0.20
Day 7     79   2.73   0.91    0.67      4.63      −0.88      −0.09
Day 8     79   2.66   0.87    1.03      4.70      −0.66       0.06
Day 9     79   2.81   0.92    0.50      5.07      −0.18      −0.19
Day 10    79   2.86   0.94    0.77      4.70      −0.84       0.01
Day 11    78   2.79   0.94    0.40      4.27      −0.51      −0.55*
Day 12    79   2.83   1.01   −0.10      4.80      −0.38      −0.37
Day 13    78   2.85   0.96    0.37      4.80      −0.57      −0.21
Day 14    79   3.01   0.95    0.93      5.03      −0.46      −0.11
Day 15    78   2.85   0.92    0.37      4.37       0.12      −0.73*
Day 16    79   2.91   0.92    0.23      4.90      −0.05      −0.35
Day 17    79   2.84   0.90    0.87      4.77      −0.51      −0.18
Day 18    79   2.93   1.02    0.53      4.73      −0.40      −0.23
Day 19    79   2.90   0.92    0.87      4.57      −0.69      −0.24
Day 20    79   2.94   0.92    0.47      4.93      −0.03      −0.32
Day 21    79   2.98   0.94    0.80      4.90      −0.08      −0.47
Day 22    79   2.99   0.98    0.83      4.90      −0.65      −0.23
Day 23    79   2.86   1.05    0.23      5.47      −0.17      −0.14
Day 24    78   3.00   0.98    0.97      4.77      −0.74      −0.26
Day 25    79   3.04   0.95    0.67      5.03      −0.41      −0.16
Day 26    79   3.01   0.93    0.43      5.07      −0.28      −0.34
Day 27    79   3.09   1.06    0.43      5.00      −0.51      −0.29
Day 28    79   3.04   0.97    0.33      4.83      −0.22      −0.48
Day 29    79   3.01   1.04    0.77      5.07      −0.38      −0.33
Day 30    79   3.02   1.05    0.33      5.00      −0.48      −0.29
Day 60    79   3.00   1.08   −0.13      5.40       0.29      −0.58*

Figure 4. Average capacity (K) across testing sessions.


Shaded bars represent standard error of the mean. Note that the axis is spliced between days 30 and 60, as no intervening data points were collected during this time. Left: Average change in performance over time. Right: Average change in performance over time for each quartile of subjects (quartile split performed on data from session 1).

Table 3.

Comparison of Linear, Quadratic, and Cubic growth models, all with random intercepts and slopes where applicable.

                  Model A:         Model B:     Model C:      Model D:
                  Intercept Only   Linear       Quadratic     Cubic
Intercept         2.83***          2.60***      2.41***       2.29***
Linear Slope      —                0.014***     .037***       .07**
Quadratic Slope   —                —            −.0005***     −.002*
Cubic Slope       —                —            —             2 × 10^−5 n.s.
−2LL              4366.2           4084.8       3914.7        4231.6
BIC               4389.6           4131.6       3992.7        4348.6

*** p < .001; ** p < .01; * p < .05

Table 4.

Comparison of fixed versus random slopes and intercept.

Model                                   −2LL     BIC
Model C1: Fixed Int., Fixed Slope       6672.3   6703.5
Model C2: Fixed Int., Random Slope      4627.7   4682.3
Model C3: Random Int., Fixed Slope      4009.1   4048.1
Model C4: Random Int., Random Slope     3914.7   3992.7

Within-session reliability

Within-session reliability was assessed using Cronbach’s alpha and split-half correlations. Cronbach’s alpha (using single-trial accuracy as items) yielded an average within-session reliability of α = .76 (SD = .04, Min. = .65, Max. = .83). Equivalently, split-half correlations on K-scores calculated from even versus odd trials revealed an average Spearman-Brown corrected reliability of r = .76 (SD = .05, Min. = .62, Max. = .84). As in Experiment 1, using raw accuracy (Cronbach’s alpha) versus bias-corrected capacity measures (Cowan’s K) did not affect reliability estimates. Within-session reliability increased slightly over time (Figure 5). Cronbach’s alpha values were positively correlated with session number (1–31), r = .82, p < .001, 95% CI [.66, .91], as were split-half correlation values, r = .67, p < .001, 95% CI [.41, .83].

Figure 5. Change in within-session reliability across sessions in Experiment 2.


There was a significant positive relationship between session number (1:31) and internal reliability.

Between-session stability

We first assessed stability over time by computing correlation coefficients for all pairwise combinations of sessions (465 total combinations). Missing sessions were excluded from the correlations, meaning that some pairwise correlations included 78 subjects instead of 79 (see Table 2). All sessions correlated with each other, mean r = .71 (SD = .06, Min. = .48, Max. = .86, all p-values < .001). A heat map of all pairwise correlations is shown in Figure 6. The most temporally distant sessions still correlated with each other: the correlation between Day 1 and Day 30 (28 intervening sessions) was r = .53, p < .001, 95% CI [.35, .67]; the correlation between Day 30 and Day 60 (0 intervening sessions) was r = .81, p < .001, 95% CI [.72, .88]; and the correlation between Day 1 and Day 60 was r = .59, p < .001, 95% CI [.41, .71]. Finally, we observed that between-session stability increased over time, likely due to increased internal reliability across sessions. To quantify this change in stability over time, we calculated the correlation coefficient for temporally adjacent sessions (e.g. the correlation of session 1 and session 2, of session 2 and session 3, etc.). The average adjacent-session correlation was r = .76 (SD = .05, Min. = .64, Max. = .86), and the strength of adjacent-session correlations was positively correlated with session number, r = .68, p < .001, indicating an increase in stability over time.
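A sketch of this pairwise-stability analysis is shown below, using a simulated subjects-by-sessions matrix in place of the real data; the authors' actual script may have differed in detail.

```matlab
% Pairwise between-session correlations for a subjects-by-sessions matrix of
% capacity estimates (79 x 31), with NaNs marking missed sessions.
Kmat = 2.8 + 0.9*randn(79, 1) + 0.3*randn(79, 31);   % toy: shared trait + session noise
Kmat(randperm(numel(Kmat), 5)) = NaN;                % a few missing sessions

R = corr(Kmat, 'Rows', 'pairwise');                  % 31 x 31 correlation matrix
pairMask  = triu(true(size(R)), 1);                  % 465 unique session pairs
meanPairR = mean(R(pairMask));                       % average between-session r
adjacentR = diag(R, 1);                              % session k vs. session k+1
fprintf('mean pairwise r = %.2f, mean adjacent r = %.2f\n', meanPairR, mean(adjacentR));
```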

Figure 6. Correlations between sessions.


Left: Correlations between all possible pairs of sessions. Color represents the correlation coefficient of the capacity estimates from each possible pairwise combination of the 31 sessions. All correlation values were significant, p < .001. Right: Illustration of the sessions that are most distant in time: Day 1 correlated with Day 30 (28 intervening sessions) and Day 30 correlated with Day 60 (no intervening sessions).

Differences by testing location

We tested for systematic differences in performance, reliability, and stability for sessions completed at home versus in the lab. In total, there were 41 subjects who completed all of their sessions in their own home (“home group”), 27 subjects who completed all of their sessions in the lab (“lab group”), and 11 subjects who completed some sessions at home and some in the lab (“mixed group”).

Across all 31 sessions, subjects in the home group had an average capacity of 2.67 (SD = 1.01), those in the lab group had an average capacity of 3.01 (SD = .83) and those in the mixed group had an average capacity of 2.98 (SD = 1.04). On average, scores for sessions in the home group were slightly lower than scores for sessions in the lab group, t(2101) = −7.98, p < .001, 95% CI [−.42, −.25]. Scores for sessions in the mixed group were higher than for sessions in the home group, t(1606) = 5.0, p < .001, 95% CI [.19, .43], but were not different from the lab group, t(1175) = .44, p = .67, 95% CI [−.09, .14]. Interestingly, however, a paired t-test for the mixed group (n = 11) revealed that the same subjects performed slightly better in the lab (M = 3.08) and slightly worse at home, M = 2.85, t(10) = 3.15, p = .01, 95% CI [.07, .39].

Cronbach’s alpha estimates of within-session reliability were slightly higher for sessions completed at home (mean α = .76, SD = .05) than for sessions completed in the lab (mean α = .69, SD = .058), t(60) = 3.75, p < .001, 95% CI [.03, .10]. Likewise, Spearman-Brown corrected correlation coefficients were higher for sessions completed at home (mean r = .79, SD = .07) than in the lab (mean r = .67, SD = .14), t(60) = 4.42, p < .001, 95% CI [.07, .18]. However, these differences in reliability may result from (1) unequal sample sizes between lab and home, (2) unequal average capacity between groups, or (3) unequal variability between groups. After equating sample size between groups and matching samples for average capacity, differences in reliability were no longer stable: across iterations of matched samples, differences in Cronbach’s α ranged from p < .01 to p > .5, and differences in split-half correlation significance ranged from p < .01 to p > .25.

Next, we examined differences in stability for sessions completed at home compared to in the lab. On average, test-retest correlations were higher for home sessions (mean r = .72, SD = .08) than for lab sessions (mean r = .67, SD = .10), t(928) = 8.01, p < .001, 95% CI [.04, .06]. Again, however, differences in test-retest correlations were not reliable after matching sample size and average capacity; differences in correlation significance ranged from p = .01 to p = .98.

Discussion

With extensive practice over multiple sessions, we observed improvement in overall change detection performance. This improvement was most pronounced over early sessions, after which mean performance stabilized for the remaining sessions. The internal reliability of the first session (Spearman Brown corrected r =.71, Cronbach’s α = .67) was within the range predicted by the look-up table created in Experiment 1 for 80 subjects and 120 trials (predicted range: r = .61 to .87 and α = .58 to .80, respectively). Both reliability and stability remained high over the span of 60 days. In fact, reliability and stability increased slightly across sessions. An important consideration for any cognitive measure is whether or not repeated exposure to the task will harm the reliability of the measure. For example, re-exposure to the same logic puzzles will drastically reduce the amount of time needed to solve the puzzles and inflate accuracy. Thus, for such tasks great care must be taken to generate novel test versions to be administered at different dates. Similarly, over-practice effects could lead to a sharp decrease in variability of performance (e.g. ceiling effects, floor effects), which would by definition lead to a decrease in reliability. Here, we demonstrated that while capacity estimates increase when subjects are frequently exposed to a change detection task, the reliability of the measure is not compromised by practice effects or ceiling effects.

We also examined whether reliability was harmed for participants who completed the change detection sessions in their own homes compared to the lab. While remote data collection sacrifices some degree of experimental control, the use of at-home tests is becoming more common with the ease of remote data collection through resources like Amazon’s Mechanical Turk (Mason & Suri, 2012). Reliability was not noticeably disrupted by noise arising from small differences in stimulus size between different testing environments. After controlling for number of subjects and capacity, there was no longer a consistent difference in reliability or stability for sessions completed at home compared to in the lab. However, capacity estimates obtained in subjects’ homes were significantly lower than those obtained in the lab. Larger sample sizes are needed to more fully investigate systematic differences in capacity and reliability between testing environments.

General Discussion

In Experiment 1, we developed a novel approach for estimating expected reliability in future experiments. We collected change detection data from a large number of subjects and trials, and then we used an iterative downsampling procedure to investigate the effect of sample size and trial number on reliability. Average reliability across iterations was fairly impervious to the number of subjects. Instead, average reliability estimates across iterations relied more heavily on the number of trials per subject. On the other hand, the variability of reliability estimates across iterations was highly sensitive to the number of subjects. For example, with only 10 subjects, the average reliability estimate for an experiment with 150 trials was high (α = .75) but the worst iteration (akin to the worst expected experiment out of 100) gave a poor reliability estimate (α = .42). On the other hand, the range between the best and worst reliability estimates decreased dramatically as the number of subjects increased. With 40 subjects, the minimum observed reliability for 150 trials was α =.65.

In Experiment 2, we examined the reliability and stability of change detection capacity estimates across an unprecedented number of testing sessions. Subjects completed 31 sessions of single-probe change-detection. The first 30 sessions took place over 30 consecutive days, and the last session took place 30 days later (Day 60). Average internal reliability for the first session was in the range predicted by the look-up table in Experiment 1. Despite improvements in performance across sessions, between-subject variability in K remained stable over time (average test-retest between all 31 sessions was r =.76; the correlation for the two most distant sessions, Day 1 and Day 60, was r = .59). Interestingly, both within-session reliability and between-session reliability increased across sessions. Rather than diminishing due to practice, reliability of WMC estimates increased across many sessions.

The present work has implications for planning studies with novel measures and for justifying the inclusion of existing measures into clinical batteries such as the Research Domain Criteria (RDoC) project (Cuthbert & Kozak, 2013; Rodebaugh et al., 2016). For basic research, an internal reliability of 0.7 is considered a sufficient “rule of thumb” for investigating correlational relationship between measures (Nunnally, 1978). While this level of reliability (or even lower) will allow researchers to detect correlations, it is not sufficient to confidently assess the scores of individuals. For that, reliability in excess of .9 or even .95 is desirable (Nunnally, 1978). Here, we demonstrate how the number of trials can alter the reliability of working memory capacity estimates; with relatively few trials (~150, around 10 minutes of task time), change detection estimates are sufficiently reliable for correlation studies (α ~ .8), but many more trials are needed (~500) to boost reliability to the level needed to assess individuals (α ~ .9). Another important consideration for a diagnostic measure is its reliability across multiple testing sessions. Some tasks lose their diagnostic value once individuals have been exposed to them once or twice. Here we demonstrate that change detection estimates of working memory capacity are stable, even when participants are well-practiced on the task (3,720 trials over 31 sessions).

One challenge in estimating the “true” reliability of a cognitive task is that reliability depends heavily on sample characteristics. As we have demonstrated, varying the sample size and number of trials can yield very different estimates of the reliability for a perfectly identical task. Other sample characteristics can likewise affect reliability; the most notable of these is sample homogeneity. The sample used here was a large sample of university students, with a fairly wide range in capacities (approximately 0.5 – 4 items). Samples using only a subset of this capacity range (e.g. clinical patient groups with very low capacity) will be less internally reliable because of the restricted range of the sub-population. Indeed, in Experiment 1 we found that sampling iterations with poor reliability tended to have lower variability and a smaller range of scores. Thus, carefully recording sample size, mean, standard deviation, and internal reliability in all experiments will be critical for assessing and improving the reliability of standardized tasks used for cognitive research. In the interest of replicability, open source code repositories (e.g. the Experiment Factory) have sought to make standardized versions of common cognitive tasks better-categorized, open, and easily available (Sochat et al., 2016). However, one potential weakness for task repositories is a lack of documentation about expected internal reliability. Standardization of tasks can be very useful, but it should not be over-applied. In particular, experiments with different goals should use different test lengths that best suit the goals of the experimental question. We feel that projects such as the Experiment Factory will certainly lead to more replicable science, and including estimates of reliability with task code could help to further this goal.

Finally, the results presented here have implications for researchers who are interested in differences between experimental conditions and not individual differences per se. Trial number and sample size will affect the degree of measurement error for each condition used within change detection experiments (e.g. set-sizes, distractor presence, etc.). To detect significant differences between conditions and avoid false positives, it would be desirable to estimate the number of trials needed to ensure adequate internal reliability for each condition of interest within the experiment. Insufficient trial numbers or sample sizes can lead to intolerably low internal reliability, and could spoil an otherwise well-planned experiment.

The results of Experiments 1 and 2 revealed that change detection capacity estimates of visual working memory capacity are both internally reliable and stable across many testing sessions. This finding is consistent with previous studies showing that other measures of working memory capacity are reliable and stable, including complex span measures (Beckmann, Holling, & Kuhn, 2007; Foster et al., 2015; Klein & Fiss, 1999; Waters & Caplan, 1996) and the visuospatial n-back (Hockey & Geffen, 2004). The main analyses from Experiment 1 suggest concrete guidelines for designing studies that require reliable estimates of change detection capacity. When both sample size and trial numbers were high, the reliability of change detection was quite high (α > .9). However, studies with insufficient sample sizes or number of trials frequently had low internal reliability. Consistent with the notion that working memory capacity is a stable trait of the individual, individual differences in capacity remained stable over many sessions in Experiment 2 despite practice-related performance increases.

Both the effects of trial number and sample size are important to consider, and researchers should be cautious about generalizing expected reliability across vastly different sample sizes. For example, in a recent paper by Foster and colleagues (2015), the authors found that cutting the number of complex span trials by two-thirds had only a modest effect on the strength of the correlation between working memory capacity and fluid intelligence. Critically, however, the authors used around 500 subjects, and such a large sample size will act as a buffer against increases in measurement error (i.e. fewer trials per subject). Readers wishing to conduct a new study with a smaller sample size (e.g. 50 subjects) would be ill-advised to dramatically cut trial numbers based on this finding alone; as demonstrated in Experiment 1, cutting trial numbers leads to greater volatility of reliability values for small sample sizes relative to large ones. Given present concerns about power and replicability in psychological research (Open Science Collaboration, 2015), we suggest that rigorous estimation of task reliability, considering both subject and trial numbers, will be useful for planning both new studies and replication efforts.

Acknowledgments

Research was supported by the Project of Humanities and Social Sciences, Ministry of Education, China (15YJA190008), the Fundamental Research Funds for the Central Universities (SWU1309117), NIH grant 2R01 MH087214-06A1 and Office of Naval Research grant N00014-12-1-0972. Datasets for all experiments are available online on Open Science Framework at https://osf.io/g7txf/.

Footnotes

¹ Greenhouse-Geisser values are reported when Mauchly’s Test of Sphericity is violated.

Contributions: Z.X. and E.V. designed the experiments; Z.X. and X.F. collected data. K.A. performed analyses and drafted the manuscript. K.A., Z.X., and E.V. revised the manuscript.

Conflicts of Interest: none

References

  1. Beckmann B, Holling H, Kuhn JT. Reliability of verbal–numerical working memory tasks. Personality and Individual Differences. 2007;43(4):703–714. https://doi.org/10.1016/j.paid.2007.01.011
  2. Brown W. Some experimental results in the correlation of mental abilities. British Journal of Psychology, 1904–1920. 1910;3(3):296–322. https://doi.org/10.1111/j.2044-8295.1910.tb00207.x
  3. Buschman TJ, Siegel M, Roy JE, Miller EK. Neural substrates of cognitive capacity limitations. Proceedings of the National Academy of Sciences. 2011;108(27):11252–11255. https://doi.org/10.1073/pnas.1104666108
  4. Cowan N. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. The Behavioral and Brain Sciences. 2001;24(1):87–114–185. https://doi.org/10.1017/S0140525X01003922
  5. Cowan N, Fristoe NM, Elliott EM, Brunner RP, Saults JS. Scope of attention, control of attention, and intelligence in children and adults. Memory & Cognition. 2006;34(8):1754–1768. https://doi.org/10.3758/BF03195936
  6. Cramer D. Basic statistics for social research: step-by-step calculations and computer techniques using Minitab. London; New York: Routledge; 1997.
  7. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16(3):297–334. https://doi.org/10.1007/BF02310555
  8. Cuthbert BN, Kozak MJ. Constructing constructs for psychopathology: The NIMH research domain criteria. Journal of Abnormal Psychology. 2013;122(3):928–937. https://doi.org/10.1037/a0034028
  9. Elmore LC, Magnotti JF, Katz JS, Wright AA. Change detection by rhesus monkeys (Macaca mulatta) and pigeons (Columba livia). Journal of Comparative Psychology. 2012;126(3):203–212. https://doi.org/10.1037/a0026356
  10. Engle RW, Tuholski SW, Laughlin JE, Conway AR. Working memory, short-term memory, and general fluid intelligence: a latent-variable approach. Journal of Experimental Psychology: General. 1999;128(3):309–331. https://doi.org/10.1037//0096-3445.128.3.309
  11. Foster JL, Shipstead Z, Harrison TL, Hicks KL, Redick TS, Engle RW. Shortened complex span tasks can reliably measure working memory capacity. Memory & Cognition. 2015;43(2):226–236. https://doi.org/10.3758/s13421-014-0461-7
  12. Fukuda K, Vogel E, Mayr U, Awh E. Quantity, not quality: the relationship between fluid intelligence and working memory capacity. Psychonomic Bulletin & Review. 2010;17(5):673–679. https://doi.org/10.3758/17.5.673
  13. Fukuda K, Woodman GF, Vogel EK. Individual Differences in Visual Working Memory Capacity: Contributions of Attentional Control to Storage. In: Jolicoeur P, Lefebvre C, Martinez-Trujillo J, editors. Mechanisms of Sensory Working Memory: Attention and Performance XXV. Elsevier; 2015. pp. 105–120. http://linkinghub.elsevier.com/retrieve/pii/B9780128013717000090
  14. Gibson B, Wasserman E, Luck SJ. Qualitative similarities in the visual short-term memory of pigeons and people. Psychonomic Bulletin & Review. 2011;18(5):979–984. https://doi.org/10.3758/s13423-011-0132-7
  15. Gold JM, Wilk CM, McMahon RP, Buchanan RW, Luck SJ. Working memory for visual features and conjunctions in schizophrenia. Journal of Abnormal Psychology. 2003;112(1):61–71. https://doi.org/10.1037/0021-843X.112.1.61
  16. Hockey A, Geffen G. The concurrent validity and test–retest reliability of a visuospatial working memory task. Intelligence. 2004;32(6):591–605. https://doi.org/10.1016/j.intell.2004.07.009
  17. Johnson MK, McMahon RP, Robinson BM, Harvey AN, Hahn B, Leonard CJ, … Gold JM. The relationship between working memory capacity and broad measures of cognitive ability in healthy adults and people with schizophrenia. Neuropsychology. 2013;27(2):220–229. https://doi.org/10.1037/a0032060
  18. Klein K, Fiss WH. The reliability and stability of the Turner and Engle working memory task. Behavior Research Methods, Instruments, & Computers. 1999;31(3):429–432. https://doi.org/10.3758/bf03200722
  19. Lee EY, Cowan N, Vogel EK, Rolan T, Valle-Inclan F, Hackley SA. Visual working memory deficits in patients with Parkinson’s disease are due to both reduced storage capacity and impaired ability to filter out irrelevant information. Brain. 2010;133(9):2677–2689. https://doi.org/10.1093/brain/awq197
  20. Luria R, Balaban H, Awh E, Vogel EK. The contralateral delay activity as a neural measure of visual working memory. Neuroscience & Biobehavioral Reviews. 2016;62:100–108. https://doi.org/10.1016/j.neubiorev.2016.01.003
  21. Mason W, Suri S. Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods. 2012;44(1):1–23. https://doi.org/10.3758/s13428-011-0124-6
  22. Melby-Lervåg M, Hulme C. Is working memory training effective? A meta-analytic review. Developmental Psychology. 2013;49(2):270–291. https://doi.org/10.1037/a0028228
  23. Nunnally JC. Psychometric theory. 2nd ed. New York: McGraw-Hill; 1978.
  24. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716. https://doi.org/10.1126/science.aac4716
  25. Pailian H, Halberda J. The reliability and internal consistency of one-shot and flicker change detection for measuring individual differences in visual working memory capacity. Memory & Cognition. 2015;43(3):397–420. https://doi.org/10.3758/s13421-014-0492-0
  26. Pashler H. Familiarity and visual change detection. Perception & Psychophysics. 1988;44(4):369–378. https://doi.org/10.3758/BF03210419
  27. Reinhart RMG, Heitz RP, Purcell BA, Weigand PK, Schall JD, Woodman GF. Homologous Mechanisms of Visuospatial Working Memory Maintenance in Macaque and Human: Properties and Sources. Journal of Neuroscience. 2012;32(22):7711–7722. https://doi.org/10.1523/JNEUROSCI.0215-12.2012
  28. Rodebaugh TL, Scullin RB, Langer JK, Dixon DJ, Huppert JD, Bernstein A, … Lenze EJ. Unreliability as a Threat to Understanding Psychopathology: The Cautionary Tale of Attentional Bias. Journal of Abnormal Psychology. 2016. https://doi.org/10.1037/abn0000184
  29. Rouder JN. Applications and Source Code. n.d. Retrieved June 22, 2016, from http://pcl.missouri.edu/apps
  30. Rouder JN, Morey RD, Cowan N, Zwilling CE, Morey CC, Pratte MS. An assessment of fixed-capacity models of visual working memory. Proceedings of the National Academy of Sciences of the United States of America. 2008;105(16):5975–5979. https://doi.org/10.1073/pnas.0711295105
  31. Rouder JN, Morey RD, Morey CC, Cowan N. How to measure working memory capacity in the change detection paradigm. Psychonomic Bulletin & Review. 2011;18(2):324–330. https://doi.org/10.3758/s13423-011-0055-3
  32. Shipstead Z, Redick TS, Engle RW. Is working memory training effective? Psychological Bulletin. 2012;138(4):628–654. https://doi.org/10.1037/a0027473
  33. Sochat VV, Eisenberg IW, Enkavi AZ, Li J, Bissett PG, Poldrack RA. The Experiment Factory: Standardizing Behavioral Experiments. Frontiers in Psychology. 2016;7. https://doi.org/10.3389/fpsyg.2016.00610
  34. Spearman C. Correlation calculated from faulty data. British Journal of Psychology, 1904–1920. 1910;3(3):271–295. https://doi.org/10.1111/j.2044-8295.1910.tb00206.x
  35. Todd JJ, Marois R. Capacity limit of visual short-term memory in human posterior parietal cortex. Nature. 2004;428(6984):751–754. https://doi.org/10.1038/nature02466
  36. Unsworth N, Fukuda K, Awh E, Vogel EK. Working memory and fluid intelligence: Capacity, attention control, and secondary memory retrieval. Cognitive Psychology. 2014;71:1–26. https://doi.org/10.1016/j.cogpsych.2014.01.003
  37. Van Snellenberg JX, Conway ARA, Spicer J, Read C, Smith EE. Capacity estimates in working memory: Reliability and interrelationships among tasks. Cognitive, Affective, & Behavioral Neuroscience. 2014;14(1):106–116. https://doi.org/10.3758/s13415-013-0235-x
  38. Vogel EK, Machizawa MG. Neural activity predicts individual differences in visual working memory capacity. Nature. 2004;428(6984):748–751. https://doi.org/10.1038/nature02447
  39. Waters GS, Caplan D. The measurement of verbal working memory capacity and its relation to reading comprehension. The Quarterly Journal of Experimental Psychology A: Human Experimental Psychology. 1996;49(1):51–75. https://doi.org/10.1080/713755607
  40. Wood G, Hartley G, Furley PA, Wilson MR. Working Memory Capacity, Visual Attention and Hazard Perception in Driving. Journal of Applied Research in Memory and Cognition. 2016. https://doi.org/10.1016/j.jarmac.2016.04.009
