Abstract
Numerous studies have established that there are benefits of corrective feedback for learning, but the mechanisms of this benefit are not well understood. An important question is whether corrective feedback improves memory via episodic processes or solely via semantic mediation. If episodic processes are involved, then memory for corrective feedback should include contextual details of the feedback episode. The present study tested this hypothesis across 3 experiments (total n = 223) in which participants completed an encoding task that involved cued guessing of category exemplars. Exemplars generated by participants were equally likely to be treated as correct or incorrect, and the “correct” exemplar was presented within a feedback display after each response. Separate versions of the task manipulated font color in either the feedback display or the initial cue/typed response display. Participants were instructed to remember either the correct exemplars or their own typed responses, and the corresponding font colors. Retrieval task (cued recall, free recall, recognition) was varied across experiments. Across all 3 experiments, memory for context associated with corrective feedback was more accurate than memory for context in the other conditions. The findings are consistent with the hypothesis that errorful learning involves episodic memory, not merely semantic mediation.
Keywords: Memory, Generation, Corrective Feedback, Context
A large and growing body of evidence indicates that wrong answers are good for learning – as long as corrective feedback is provided (see Metcalfe, 2017, for a thorough review). A substantial amount of research on corrective feedback in learning has focused on feedback’s relevance to retrieval-based learning, in which testing of recently learned information confers memory advantages compared to re-studying of the information (also referred to as testing effects; Roediger & Karpicke, 2006). Feedback during testing may enhance testing effects (e.g., Kang, McDermott, & Roediger, 2007; McDaniel & Fisher, 1991), and corrective feedback appears to be more beneficial for item memory than confirmatory feedback is (Hays, Kornell, & Bjork, 2010; Pashler, Cepeda, Wixted, & Rohrer, 2005). Corrective feedback is particularly important in variations of the testing effect where learners – usually unsuccessfully – attempt to provide responses for items that have not been previously studied, also known as pretesting (Kornell, Hays, & Bjork, 2009; Kornell & Vaughn, 2016). In this case, corrective feedback essentially acts as the study episode, and produces superior memory relative to simply reading the to-be-remembered information without attempting to generate a response. This advantage of pretesting can also be influenced by the timing of the feedback (Hays, Kornell, & Bjork, 2013; but see Kornell, 2014).
The cognitive mechanisms that support errorful learning are not fully understood. One proposal is that the benefit of error correction represents a specific case of semantic mediation in retrieval-based learning (Pyc & Rawson, 2010; Carpenter, 2011; Carpenter & Yeung, 2017). According to the semantic mediation (or semantic elaboration) hypothesis, retrieval attempts during testing produce a form of elaboration in which concepts are activated that are semantically related to the correct answer and can serve as mediators linking the cue to the correct answer on subsequent retrieval attempts. In errorful learning, a similar process of elaboration may take place, resulting in an incorrect response, which is then linked to the correct response via feedback. Hence, the initial incorrect response acts as one of the mediating concepts to produce the correct response on a later test. Some prior studies have obtained results consistent with this account (Butler, Fazio, & Marsh, 2011; Knight, Ball, Brewer, DeWitt, & Marsh, 2012; Vaughn & Rawson, 2012; but see Clark, 2016, which did not find evidence for feedback acting as a mediator). A number of researchers have argued that semantic mediation is insufficient to account for the benefits of retrieval-based learning, including benefits of error generation and corrective feedback. Alternative accounts have argued that episodic memory processes, rather than semantic network activation, play a central role in retrieval-based learning. For example, Karpicke, Lehman, and Aue’s (2014) episodic context account of retrieval-based learning proposes that the key mechanism contributing to benefits of retrieval practice is the updating of episodic context features in the memory trace for the retrieved item. This account ties in with broader theories of memory updating, such as the memory for change framework (e.g., Wahlheim & Jacoby, 2013), which posits that the recollection of prior episodic details enables the incorporation of prior and current episodes into a combined representation that resolves interference and enhances memory for current information. With regard to feedback specifically, Metcalfe and Huelser (2020) have recently argued that the benefits of learning from corrective feedback are based on episodic updating mechanisms. They reported results from two experiments that suggested: 1) error correction was not better when errors were semantically related (versus unrelated) to correct answers, arguing against semantic mediation; and 2) error correction was better when erroneous responses could be recalled, arguing for episodic updating.
Additional support for the importance of episodic processes, as opposed to semantic mediation, in errorful learning comes from evidence that corrective feedback is simply encoded better than other information presented during a learning trial (enhanced encoding of feedback hypothesis; Potts, Davies, & Shanks, 2019). A key finding in this regard is the “hypercorrection effect” (Butterfield & Metcalfe, 2001) whereby error correction is better for high-confidence errors than low-confidence errors. This has been used to suggest that corrective feedback – particularly when it is surprising to learners that they are being corrected – attracts learners’ attention in a manner that directly benefits subsequent retrieval of the feedback, without the need for semantic mediation.
If learning via corrective feedback involves episodic processes, then its benefits might be expected also to apply to memory for other aspects of the learning episode, such as contextual details. Contextual details are an essential component of episodic recollection (e.g., Tulving, 1972), and play a key role in memory-updating accounts of retrieval-based learning such as that of Karpicke et al. (2014). Contextual information is essential to source memory (Johnson, Hashtroudi, & Lindsay, 1993), which is an important aspect of learning in both everyday and educational settings. Therefore, it is critical to our understanding of errorful learning to know the extent to which error generation and corrective feedback influence memory for contextual details of the learning episode. If corrective feedback about target items also influences memory for contextual details, it would imply that episodic processes are involved in errorful learning. On the other hand, if errorful learning is solely based on semantic mediation, then feedback should have no effect on memory for contextual details of the study episode.
Some evidence exists to suggest that testing effects extend to memory for contextual information (Akan, Stanley, & Benjamin, 2018; Rowland, 2011), and that errorful learning enhances source memory, relative to errorless learning (Cyr & Anderson, 2012). However, to our knowledge only one prior study (Fazio & Marsh, 2009) has directly examined memory for contextual information associated with feedback. Fazio and Marsh found that memory for context associated with feedback was related to participants’ confidence in their responses. That is, contextual details (i.e., font color, voice gender) of corrective feedback were remembered better when participants had high confidence in their erroneous answers, and contextual details of confirmatory feedback were remembered better when participants had low confidence in their correct answers. Their findings were consistent with the idea that hypercorrection (Butterfield & Metcalfe, 2001) applies to both the item and context information contained in feedback. However, Fazio and Marsh’s study had some limitations: it did not compare memory for contextual details of feedback to memory for contextual details associated with other aspects of the learning trial, and did not test for general effects of corrective or confirmatory feedback on context memory, independent of confidence or surprisingness. Additionally, that study used general-knowledge items as stimuli, which made it impossible to randomly assign items to correct or incorrect response conditions.
It should be noted that, to the extent episodic processes are involved in learning from feedback, the effects on context memory might not necessarily be all positive. For example, studies of the generation effect have found that although self-generation reliably improves memory for generated items, it does not necessarily improve memory for context associated with those items. That is, contextual features of items (such as font colors of words) are sometimes remembered more accurately when items are passively studied than when they are generated, constituting a “negative generation effect” (Jurica & Shimamura, 1999). Careful investigations of these negative generation effects on context memory (e.g., Mulligan, 2004; Mulligan, 2011; Mulligan, Lozito, & Rosner, 2006) have established that they tend to occur when the context features to be remembered involve cognitive processes in a different domain than that of the generation task. For example, memory for a visual feature, such as font color, is more negatively affected when a semantic generation task is used; likewise, the negative effect on memory for context can be reversed when to-be-remembered context features are aligned to the generation task (for example, memory for auditory features is enhanced by use of a rhyming generation task; Overman, Richard, & Stephens, 2017). Similar trade-offs could apply to processing of feedback information, whereby processing of corrective feedback about a semantically-generated item might inhibit processing of perceptual details of that item. Alternatively, increased attentional focus on the feedback display might enhance encoding of both meaningful and incidental aspects of the display, similar to the attentional boost effect in memory for incidental information during a target detection task (Spataro, Mulligan, & Rossi-Arnaud, 2013; Swallow & Jiang, 2010).
In summary, to build a more complete theoretical framework of the mechanisms involved in errorful learning, it is important to empirically establish the degree to which context memory is influenced by conditions of item generation and corrective or confirmatory feedback. The present study addressed this knowledge gap by examining memory for contextual features (specifically, font color) of feedback in a learning task with cue-target associations that consisted of semantic categories and exemplars. Importantly, the learning task was arranged so that exemplars generated by participants were equally likely to be treated as correct or incorrect responses, and category cues were randomly assigned to incorrect and correct generation conditions. This allowed for direct comparison of the memory effects of corrective versus confirmatory feedback, while eliminating any inherent differences between correct and incorrect items, as well as any basis for expectancies by which the two types of feedback could differ in their surprisingness. Additionally, context memory was compared between versions of the task in which the target exemplars to be remembered were the correct items presented at feedback versus participants’ own typed responses. This provided a critical control condition, making it possible to test whether differences in context memory across generation and feedback conditions were specific to context associated with the feedback, compared to context associated with other parts of the encoding episode. This comparison also helped to rule out the possibility that incorrect generation trials differed from correct generation trials in some systematic way other than in the feedback portion of the trial. A further benefit of including experiment versions in which participants remembered their own responses was that it provided conditions that more directly overlapped with prior studies of generation effects in context memory (e.g. Mulligan, 2004).
Three experiments were conducted, in which the item retrieval task was varied (Experiment 1: cued recall; Experiment 2: free recall; Experiment 3: recognition). The use of different retrieval tasks provided for consideration of the relative contributions of encoding and retrieval processes to the effects of corrective and confirmatory feedback on context (and item) memory performance.
Experiment 1
As described above, Experiment 1 tested the effects of corrective and confirmatory feedback on item and context memory, using cue-target pairs that consisted of common category names and exemplars. Two versions of the experiment were conducted, which differed only in whether the target exemplars to be remembered (and whose font colors varied) were the “correct” items presented at feedback (Experiment 1a), or the participants’ own typed responses (Experiment 1b).
Method
Participants
A total of 69 undergraduates (mean age = 18.75 years) participated and were compensated with course credit. Thirty-one participants completed Experiment 1a, in which the to-be-remembered item and context information was presented during feedback, and 38 completed Experiment 1b, in which the to-be-remembered item and context information consisted of their initial responses to stimuli. The two versions of the experiment were conducted in different semesters, with Experiment 1a conducted first (this was also the case for Experiments 2 and 3). Thus the assignment of participants to the two versions was not completely random. Nonetheless, it was assumed that there were no meaningful differences in the population of available participants across the semesters in which the two versions were run. All participants provided informed consent and all procedures were approved by the Institutional Review Board of Elon University.
Sample size was based on prior studies that have investigated context memory across encoding conditions that included generating and reading items (e.g., Mulligan, Lozito, & Rosner, 2006; Overman, Richard, & Stephens, 2017). Those studies reported effect sizes for within-subjects comparisons of context memory measures across encoding conditions of d = .51, d = .53, and d = .67. A priori power analysis indicated that for an effect size of d = .53, a sample size of n = 30 would yield estimated power of .8. Thus, participants were recruited with the goal of having at least 30 participants with usable data in each version of the experiment.1 Extra participants were also recruited in Experiment 1b due to concerns that a subset of seven participants may have performed the cued recall task incorrectly (see below). This enabled some analyses to be carried out with those participants excluded, while still having data from more than 30 participants in each version of the experiment.
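For readers who wish to reproduce this type of calculation, a minimal sketch is shown below, assuming the pwr package in R; the article does not report which tool was used for the original power analysis.

```r
# Illustrative a priori power calculation for a within-subjects (paired)
# comparison with d = .53 and target power = .80.
# The article does not report which tool was used; this sketch assumes the pwr package.
library(pwr)

pwr.t.test(d = 0.53, sig.level = 0.05, power = 0.80, type = "paired")
# The returned n is approximately 30 pairs, matching the target sample size described above.
```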
Materials
Stimuli were based on 46 taxonomic categories selected from the category norms of Van Overschelde, Rawson, and Dunlosky (2004). For each category, two exemplars were selected to create a pool of available exemplars for use in the experiment. Exemplars were chosen such that they were among the most commonly named members of each category but did not create potential conflicts or confusion with exemplars from other categories (for example, orange was not used since it could be either a fruit or a color of the rainbow). A complete list of category names and selected exemplars is provided in the Appendix.
Procedure
Both versions of the experiment consisted of an encoding task followed by a three-minute backwards-counting task and a cued recall task. The encoding task included three conditions: Read, Correct Generation, and Incorrect Generation. There were 14 trials in each condition, with 42 of the 46 categories randomly assigned to encoding conditions for each participant. Four of the categories were used in buffer trials at the beginning of the encoding task and were not tested in the cued recall task.
Encoding task.
An illustration of the encoding task is provided in Figure 1. On each trial, the participant was presented with a category name, and typed the name of a member of that category. In the Read condition, the category name was presented with one of the pre-selected exemplars below it. The participant re-typed the exemplar, with typed letters viewable in an echo box at the bottom of the screen, and pressed Enter. After the participant pressed Enter, the category-exemplar pair remained on the screen for 1000 ms, after which the font color of the category name and exemplar changed, and the display remained on the screen for an additional 5000 ms.
Figure 1. Design of the Encoding Task.

Note. Participants were cued with categories and instructed to type names of category members. Participants either re-typed an item that was presented along with the cue (Read condition), or typed a self-generated item (Correct and Incorrect Generation conditions). The feedback display then presented the “correct” response. On half of generation trials, the participant’s own response was treated as correct (Correct Generation). The other half of generation trials treated the participant’s response as incorrect and selected a different category member to display as feedback. In Experiment 1a, blue or yellow font color was used in the feedback display; in Experiment 1b, blue or yellow color was used in the initial typed item display.
In the generation conditions, only the category name was presented at the beginning of the trial, with a blank space below it (a continuous underscore). The participant was instructed to “guess which category member is supposed to go in the space. Then, type the word into the text box and press Enter.” As in the read condition, typed letters were viewable in an echo box at the bottom of the screen. After pressing Enter, the category name disappeared and the screen displayed either “Correct!” (in the Correct Generation condition) or “Incorrect” (in the Incorrect Generation condition) in white font for 1000 ms. The category name then reappeared with the correct exemplar below it for 5000 ms. The correct exemplar shown in the feedback display was either the participant’s own typed response (in the Correct Generation condition) or a non-matching exemplar from the pre-selected list (in the Incorrect Generation condition). Thus, exactly half of the participant’s responses were labeled as incorrect, regardless of what the participant typed.2 As noted by Yan, Yu, Garcia, & Bjork (2014), this type of design makes it impossible for participants to form any valid strategy for guessing correct responses. It also ensured equal numbers of trials with confirmatory versus corrective feedback, and removed any basis for expectancy or surprise based on prediction of the correct response. After the initial four buffer trials, the encoding list was arranged in seven blocks, each of which contained the six possible combinations of encoding condition and font color. Trial order was randomized within each block.
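To make the list structure concrete, the following R sketch shows one way such a blocked encoding list could be constructed; the condition labels, color names, and category placeholders are our own illustrations, not the original experiment script.

```r
# Illustrative construction of the encoding list (not the original experiment
# script): after the buffer trials, 7 blocks each contain the 6 combinations of
# encoding condition x font color, with trial order randomized within block.
# Condition labels, color names, and category names are hypothetical placeholders.
conditions <- c("Read", "CorrectGen", "IncorrectGen")
colors     <- c("blue", "yellow")
categories <- paste0("category_", 1:42)

block_template <- expand.grid(condition = conditions, color = colors)

encoding_list <- do.call(rbind, lapply(1:7, function(b) {
  shuffled <- block_template[sample(nrow(block_template)), ]  # randomize within block
  cbind(block = b, shuffled)
}))

encoding_list$category <- sample(categories)  # random assignment of categories to trials
rownames(encoding_list) <- NULL
head(encoding_list)
```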
The key difference in the encoding phase between Experiments 1a and 1b was the timing of when colored font was used in the display. In Experiment 1a, the initial typing prompt display was presented in white font (including the border of the echo box), and the feedback display of the category name and correct exemplar was presented in either yellow or blue font. In Experiment 1b, the initial typing prompt display was presented in yellow or blue font (including the border of the echo box), and the feedback display was presented in white font. In both versions of the experiment, participants were instructed as to what they would be tested on later. That is, in Experiment 1a, they were informed that they should try to remember correct category members and their font colors. In Experiment 1b, participants were still given the task of attempting to name the “correct” category member and receiving feedback, but they were also informed that they should try to remember their own typed responses, along with the corresponding font colors.
It may be noted that in the Correct Generation and Read conditions, the same items were studied twice (at initial response and at feedback), but with different contextual information each time, since the blue or yellow font color occurred for only one of those study instances and white font was used for the other. For Incorrect Generation, on the other hand, the target item (i.e., feedback item in Experiment 1a and typed item in Experiment 1b) was only ever studied in blue or yellow font. This difference might have the potential to create an unintended advantage in context memory for the Incorrect Generation condition, since there was no competing contextual information from seeing the target item in white font. For this reason, the comparison between Experiments 1a and 1b is especially important, because any such advantage should apply equally to both versions. If the two versions differ in context memory performance in the Incorrect Generation condition, it cannot be attributed to any difference in competing contextual information associated with target items.
Cued recall task.
During each trial in the test phase, participants were presented with a category name, and attempted to type a target category member from the encoding task. All 42 non-buffer category names were presented in random order. In Experiment 1a, the target category member that participants were instructed to recall was the correct item, i.e., the item that had been presented in the feedback display. In Experiment 1b, the target category member that participants were instructed to recall was the one that they had typed, regardless of whether it had been marked as correct or incorrect. After the participant typed a cued recall response and pressed Enter, that response was immediately presented again, and the participant was prompted to indicate whether the item had been presented in blue or yellow font at encoding, using the 7 and 8 keys, respectively, on the keyboard.
Results
Analysis approach
For all experiments reported in this paper, two sets of analyses were conducted. The first set consisted of separate, linear-model based analyses of accuracy in item and context memory. For context memory, accuracy was measured in terms of identification-of-origin (or I-O) scores (e.g., Johnson et al., 1993). The second set of analyses constructed multinomial processing tree (MPT) models of participants’ response patterns based on assumed underlying cognitive processes that include item retrieval, context retrieval, and guessing (Batchelder & Riefer, 1999). In this type of model, item and context memory are represented by different parameters, and memory effects can be examined by comparing corresponding parameters across different conditions, and by comparing the ability of models to fit the data when parameters are allowed to vary versus not to vary across conditions.
For the linear-model based analyses, trial-by-trial response data were dichotomous in nature; i.e., participants either retrieved or did not retrieve a target item, and either correctly or incorrectly identified the font color of a retrieved item. Accuracy in item and context memory for a participant was thus reflected by the proportion of trials on which a target item was retrieved, and on which a target item’s font color was correctly identified, respectively. Because context accuracy was conditional upon memory for target items, its estimation relied on varying numbers of trials across participants. Similar data in the existing literature have often been analyzed via classical analysis of variance techniques; however, for the present study, a hierarchical Bayesian analysis of variance (BANOVA) approach was used, which avoids the pitfalls associated with categorical dependent variables and unbalanced designs in classical ANOVA (e.g., Gelman, 2005; Kruschke, 2015), while retaining ANOVA’s convenient framework for interpreting the effects of multiple independent variables and their interactions. The BANOVA package in R (Dong & Wedel, 2016) provides a straightforward implementation of this approach. Thus, for the three experiments reported here, both item and context memory accuracy were analyzed using the BANOVA.Bernoulli() function, which used a logit link function to model correct response proportions, with encoding condition as a within-subjects factor and retrieval instructions as a between-subjects factor. Modeling dichotomous responses in this way also provides a more valid estimate of response probabilities near one and zero, compared to more traditional methods that can be particularly distorted by floor and ceiling effects. Factors were effects-coded in the models with the Read condition and typed item retrieval as reference levels. Default prior distributions and model parameters were used. Except where otherwise noted, Markov Chain Monte Carlo simulations (carried out in JAGS; Plummer, 2003) consisted of 25,000 total draws, with 5,000 burn-in and thinning factor of 10, for a total of 2,000 samples. In all analyses, convergence diagnostics indicated that the MCMC chains converged. Sampled posterior probability distributions of model coefficients were used to generate predicted values and 95% credible intervals for correct response proportions in each condition, and Bayesian p-values for each factor and interaction (see Dong & Wedel, 2016, for full specification of BANOVA outputs). The Bayesian p-values reported below are meant to be analogous to classical p-values, although they reflect a somewhat different concept in that they are computed based on where the tail of a model parameter’s posterior distribution crosses zero (then doubled to create a “two-tailed” value). Although presented here as an inferential tool, overinterpretation of these should be avoided, as with classical p-values (Marsman & Wagenmakers, 2017; Wagenmakers, 2007). In addition to the BANOVA analyses reported for each experiment, further inferential support was provided by applying a Bayesian linear mixed-effects modeling approach to the combined context memory data from all three experiments (using the brms package in R; Bürkner, 2017), which enabled model comparison via Bayes Factor (details of that analysis are provided in the Results section for Experiment 3).
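As a concrete illustration of the analysis just described, the sketch below shows the general form such a call could take, assuming the BANOVA package's BANOVA.Bernoulli() interface (a level-1 formula for the within-subjects factor, a level-2 formula for the between-subjects factor, and a participant id vector); the data frame, column names, and argument layout are placeholders rather than the authors' code.

```r
# Sketch of the trial-level analysis described above, assuming the BANOVA package's
# BANOVA.Bernoulli() interface; the data frame and column names are placeholders.
library(BANOVA)

# trial_data (hypothetical): one row per valid trial, with columns
#   subject      - participant identifier (factor)
#   condition    - Read / CorrectGen / IncorrectGen (within-subjects factor)
#   instructions - RetrieveTyped / RetrieveFeedback (between-subjects factor)
#   correct      - 1 if the target item was recalled (or its color identified), else 0
fit <- BANOVA.Bernoulli(
  l1_formula = correct ~ condition,   # level 1: within-subjects effects
  l2_formula = ~ instructions,        # level 2: between-subjects effects
  data       = trial_data,
  id         = trial_data$subject,
  burnin = 5000, sample = 2000, thin = 10  # sampling scheme described in the text
)

summary(fit)  # coefficients, 95% credible intervals, and Bayesian p-values
```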
MPT model analyses reported here also employed a hierarchical Bayesian approach for the estimation of model parameters, using the TreeBUGS package in R (Heck, Arnold, & Arnold, 2018). Models were constructed by adapting prior MPT models of source memory (e.g., Bayen, Murnane, & Erdfelder, 1996) to the present experiments, and are described in further detail below. For each model, latent-trait MPT parameters were estimated by MCMC simulations in JAGS consisting of four chains with 20,000 adaptation steps followed by 220,000 total draws, with 20,000 burn-in and thinning factor of 20, for a total of 10,000 samples. Default prior distributions were used, which for group mean parameters were uniform across probability space. In all analyses, convergence diagnostics indicated that the MCMC chains converged. Posterior-predictive p-values pT1 and pT2 were used to measure adequacy of model fit to means and covariances in the data (e.g., Klauer, 2010), and relative model fits were compared using the widely-applicable or Watanabe-Akaike information criterion (WAIC; Watanabe, 2010) which quantifies predictive accuracy while correcting for the number of parameters in a model, and is preferred over other statistics within the Bayesian framework (e.g., AIC, DIC) for its use of full posterior distributions in its estimation (Gelman, Hwang, & Vehtari, 2013). Posterior distributions of parameter estimates were also used to identify 95% credible intervals for pairwise differences between parameters.
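The sketch below illustrates how such models could be fit, assuming TreeBUGS's traitMPT() and PPP() functions with an EQN-format model file and a list of equality restrictions; the file name, parameter labels, and frequency table are hypothetical stand-ins, not the authors' model files.

```r
# Sketch of the hierarchical latent-trait MPT fits, assuming TreeBUGS's traitMPT()
# interface. "mpt_model.eqn", the parameter labels, and the response-frequency
# table 'freqs' are hypothetical stand-ins for the models described in Figure 4.
library(TreeBUGS)

# freqs (hypothetical): one row per participant, one column per response category
# (target vs. nontarget item response crossed with blue vs. yellow color response,
# separately for each encoding condition and studied color).

# Full model: D and d free to vary across conditions; a and b held equal.
fit_full <- traitMPT(
  eqnfile      = "mpt_model.eqn",
  data         = freqs,
  restrictions = list("a_Read = a_Correct = a_Incorrect",
                      "b_Read = b_Correct = b_Incorrect"),
  n.chain = 4  # adaptation, burn-in, and thinning set as described in the text
)

# Limited model: additionally constrain context memory (d) to be equal across conditions.
fit_limited <- traitMPT(
  eqnfile      = "mpt_model.eqn",
  data         = freqs,
  restrictions = list("a_Read = a_Correct = a_Incorrect",
                      "b_Read = b_Correct = b_Incorrect",
                      "d_Read = d_Correct = d_Incorrect"),
  n.chain = 4
)

summary(fit_full)  # posterior means and 95% credible intervals for D, d, a, b
PPP(fit_full)      # posterior-predictive checks yielding the T1 and T2 statistics
```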
Trial exclusions
Participants’ responses in the encoding task were inspected to ensure basic compliance with the task instructions. In general, participants followed instructions, re-typing provided exemplars in the Read condition and producing exemplars of the cued categories in the generation conditions. Misspellings or typographical errors occurred on approximately 2.5% of encoding trials across both versions of the experiment; in these cases it was still possible to identify what the intended response had been, and score it accordingly. A total of nine encoding trials, across seven participants, had blank responses; for six of these trials (all in Experiment 1b) the blank encoding response meant that there was nothing to recall in the corresponding cued recall trial. Those nine cued recall trials were therefore considered invalid and were excluded from analysis. In the analysis of cued recall trials, items were considered to match a prior response from encoding (i.e., either the participant’s own encoding response or the correct response that was presented in the feedback display) if the recall response could be interpreted as representing the same lexical item as the item from the encoding task. Thus, some variations between encoded items and recalled items were allowed, including spelling and typographical errors as well as minor word variations (e.g., singular and plural forms, abbreviations, etc.). Across both versions of the experiment, approximately 3.2% of valid recall trials involved this type of approximate match between the recall response and an encoded item.
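The approximate-match scoring described above was based on judgments about whether a response represented the same lexical item; purely as an illustration, the following R sketch shows one automated way such candidate near-matches could be flagged for review, using base R's adist() (Levenshtein distance). The function name and distance threshold are our own assumptions.

```r
# Illustrative only: flag candidate near-matches (misspellings, minor variants)
# between recall responses and studied items for manual review, using base R's
# adist() (Levenshtein distance). The original scoring was judgment-based.
flag_near_matches <- function(responses, targets, max_dist = 2) {
  d    <- adist(tolower(trimws(responses)), tolower(trimws(targets)))
  best <- apply(d, 1, which.min)
  data.frame(
    response   = responses,
    best_match = targets[best],
    distance   = d[cbind(seq_along(best), best)],
    near_match = d[cbind(seq_along(best), best)] <= max_dist
  )
}

flag_near_matches(c("girafe", "appel", "robin"), c("giraffe", "apple", "carrot"))
# "girafe" and "appel" are flagged as near matches; "robin" is not.
```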
Item accuracy
The left panel of Figure 2 displays proportion correct cued recall responses for individual participants (circles), as well as mean and 95% credible intervals of predicted proportion correct based on posterior probability distributions of the coefficients in the hierarchical BANOVA model. Gray symbols represent results from Experiment 1a, in which cued recall targets were the feedback items presented to participants during encoding, and white symbols represent results from Experiment 1b, in which targets were the participants’ own typed responses during encoding (which differed from the feedback items in the Incorrect Generation condition). Circle size represents the number of valid trials from which each participant’s proportion was computed (most participants had 14 valid trials per condition; note that, thanks to the Bayesian modeling approach, participants with fewer valid trials inherently had less influence on the estimation of model parameters). A summary of the BANOVA model itself is provided in Table 1, with estimates of the linear model coefficients and 95% credible intervals. Bayesian p-values generated from the model suggested that there was a significant main effect of encoding condition, p < .001, and a significant interaction of encoding condition and retrieval instructions, p = .002. The main effect of retrieval instructions was not significant, p = .379. Consistent with previous studies of generation effects in memory (e.g., Slamecka & Graf, 1978), predicted accuracy across both versions of the experiment was higher in the Correct Generation condition (M = .989, 95% CI [.980, .994]) than in the Read condition (M = .824, 95% CI [.751, .880]). Predicted overall performance in the Incorrect Generation condition fell between that of the other conditions (M = .922, 95% CI [.882, .950]).
Figure 2. Accuracy in Item and Context Memory for Target Items in Experiment 1.

Note. Item memory accuracy (left panel) is represented as the proportion of valid cued recall trials for which the target item was recalled. For Experiment 1a (gray symbols), target items for recall were the category exemplars provided in the feedback display during encoding; for Experiment 1b, target items were the participant’s own typed responses during encoding. Context memory accuracy (right panel) is represented as identification-of-origin score, which is the proportion of correctly recalled targets for which the correct font color was identified. For both panels, circles represent individual participant data, with circle size indicating the number of trials used to compute the proportion. Boxes represent means and 95% credible intervals of the hierarchical BANOVA model’s predictions for response proportions, which were generated from the posterior probability distributions of the estimated linear model coefficients.
Table 1.
Experiment 1 Item Accuracy: Estimated BANOVA Coefficients
| Effect | Estimate (full data set, n = 69) | 95% CI (full) | Estimate (limited data set, n = 62) | 95% CI (limited) |
|---|---|---|---|---|
| Intercept | 2.83 | [2.48, 3.20]* | 2.91 | [2.58, 3.26]* |
| Retrieve Feedback Item | 0.15 | [−0.19, 0.52] | −0.10 | [−0.44, 0.23] |
| Correct Generation | 1.65 | [1.28, 2.07]* | 1.54 | [1.15, 1.99]* |
| Correct Generation : Retrieve Feedback | −0.49 | [−0.88, −0.13]* | −0.36 | [−0.79, 0.02]* |
| Incorrect Generation | −0.36 | [−0.70, −0.02]* | −0.09 | [−0.40, 0.22] |
| Incorrect Generation : Retrieve Feedback | 0.52 | [0.19, 0.86]* | 0.16 | [−0.13, 0.47] |

Note. Retrieve Typed Item and Read were the reference levels for Retrieval Instructions and Encoding Condition, respectively. Model for limited data set excluded participants with exceptionally low item accuracy in the Incorrect Generation condition.
An asterisk (*) denotes that zero lies outside the 95% credible interval.
Within the Incorrect Generation condition, an apparent difference based on retrieval instructions appeared to be driving the interaction effect in the model. Inspection of incorrect trial data indicated that most of the incorrect cued recall responses in the Incorrect Generation condition (75.7% in Experiment 1a, 94.0% in Experiment 1b) were studied nontarget items – i.e., recall of the participant’s own response rather than the feedback item in Experiment 1a, or vice versa in Experiment 1b. Thus, the Incorrect Generation condition produced some susceptibility to interference from studied nontargets. Additionally, as the data in Figure 2 show, there was a cluster of seven participants in Experiment 1b whose accuracy in the Incorrect Generation condition was less than 25%. Among these participants, 88.8% of all recalled responses were studied nontargets, suggesting these participants may have misunderstood the instruction to recall their own typed responses and intentionally recalled feedback items instead. Therefore, an additional BANOVA model was run with all responses from these seven participants excluded, resulting in a sample size of n = 31 for both versions. Bayesian p-values generated from this model still suggested a main effect of encoding condition, p < .001, but no longer suggested a significant interaction, p = .065. The main effect of retrieval instructions remained non-significant, p = .552. Predicted proportion correct for the Incorrect Generation condition overall (M = .944, 95% CI [.917, .963]) was still slightly lower than for Correct Generation overall (M = .989, 95% CI [.979, .994]), and both generation conditions had substantially better accuracy overall than the Read condition (M = .811, 95% CI [.745, .862]).
Context accuracy
The right panel of Figure 2 displays context memory accuracy in terms of I-O scores, with circles representing individual participants’ response proportions, circle size representing the number of trials underlying each proportion, and boxes representing means and 95% credible intervals of predicted response proportions from a hierarchical BANOVA model with encoding condition and retrieval instructions as within-subjects and between-subjects factors, respectively. Estimates of BANOVA model coefficients are presented in Table 2. Model predictions suggest very similar levels of accuracy in the Read and Correct Generation conditions, under both retrieval instruction conditions. In contrast, accuracy in the Incorrect Generation condition in Experiment 1a (recall feedback item; M = .687, 95% CI [.618, .750]) was substantially higher than in other conditions. In particular, a clear difference was observed in comparison to the Incorrect Generation condition in Experiment 1b (recall typed item; M = .539, 95% CI [.468, .614]). Accuracy in the Incorrect Generation condition in Experiment 1a was also somewhat higher than in the Correct Generation condition in Experiment 1a (M = .602, 95% CI [.527, .670]), although the 95% credible intervals overlapped (see pairwise comparisons below). Additionally, the Bayesian p-values generated from the model suggested there was no significant main effect of encoding condition, p = .388, or of retrieval instructions, p = .123, but did suggest a significant interaction, p = .010.

As noted above, a subset of seven participants may have misunderstood retrieval instructions in Experiment 1b, which could lead to a concern that these participants might also have negatively influenced context memory accuracy in the Incorrect Generation condition. Thus, the BANOVA analysis was re-run with all data from those seven participants excluded. Consistent with the model that used the full data set, there did not appear to be a main effect of encoding condition, p = .513, or of retrieval instructions, p = .238, but there was evidence of a significant interaction, p = .007. Predicted performance from these data still suggested a difference in context memory in the Incorrect Generation condition between Experiment 1a (M = .690, 95% CI [.616, .753]) and Experiment 1b (M = .540, 95% CI [.461, .616]) as well as a potential difference between Incorrect Generation in Experiment 1a and Correct Generation in Experiment 1a (M = .602, 95% CI [.526, .671]).

Pairwise analyses were conducted to further examine the model’s estimates of the differences between these conditions. The posterior samples of the linear coefficients were used to construct a distribution of the predicted accuracy difference between the Incorrect Generation and Correct Generation conditions in Experiment 1a, and a distribution of the predicted accuracy difference between the Incorrect Generation conditions in Experiments 1a and 1b. The distributions are displayed in Figure 3. For Incorrect Generation versus Correct Generation in Experiment 1a, the proportion of posterior samples for which the predicted difference was less than or equal to zero was .0185. For Incorrect Generation in Experiment 1a versus Experiment 1b, the proportion of posterior samples for which the predicted difference was less than or equal to zero was .0055. Thus, both comparisons indicated a high estimated probability of a true advantage in context memory for Incorrect Generation in Experiment 1a relative to the other conditions.
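To make the pairwise-comparison logic explicit, the sketch below shows how such difference distributions and tail proportions can be computed from posterior draws; the three vectors are placeholders standing in for the model's posterior samples of predicted context accuracy, not the actual posterior.

```r
# Sketch of the pairwise-comparison logic. The three vectors below are placeholders
# standing in for posterior draws of predicted context accuracy (one value per MCMC
# sample); in the reported analysis they were derived from the BANOVA posterior samples.
set.seed(1)
n_draws        <- 2000
p_incorrect_1a <- plogis(rnorm(n_draws, mean = 0.8, sd = 0.15))  # placeholder draws
p_correct_1a   <- plogis(rnorm(n_draws, mean = 0.4, sd = 0.15))
p_incorrect_1b <- plogis(rnorm(n_draws, mean = 0.2, sd = 0.15))

diff_within_1a <- p_incorrect_1a - p_correct_1a    # Incorrect vs. Correct Generation, Exp. 1a
diff_between   <- p_incorrect_1a - p_incorrect_1b  # Incorrect Generation, Exp. 1a vs. 1b

mean(diff_within_1a <= 0)  # proportion of draws at or below zero (.0185 with the real posterior)
mean(diff_between <= 0)    # (.0055 with the real posterior)

quantile(diff_within_1a, probs = c(.025, .975))  # 95% credible interval for the difference
```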
Table 2.
Experiment 1 Context Accuracy: Estimated BANOVA Coefficients
| Effect | Estimate (full data set, n = 69) | 95% CI (full) | Estimate (limited data set, n = 62) | 95% CI (limited) |
|---|---|---|---|---|
| Intercept | 0.40 | [0.27, 0.54]* | 0.42 | [0.27, 0.58]* |
| Retrieve Feedback Item | 0.11 | [−0.03, 0.24] | 0.09 | [−0.07, 0.23] |
| Correct Generation | 0.01 | [−0.14, 0.16] | 0.00 | [−0.16, 0.15] |
| Correct Generation : Retrieve Feedback | −0.10 | [−0.26, 0.04] | −0.09 | [−0.25, 0.07] |
| Incorrect Generation | 0.07 | [−0.09, 0.23] | 0.06 | [−0.11, 0.23] |
| Incorrect Generation : Retrieve Feedback | 0.21 | [0.04, 0.37]* | 0.23 | [0.07, 0.40]* |

Note. Model for limited data set excluded participants with exceptionally low item accuracy in the Incorrect Generation condition.
An asterisk (*) denotes that zero lies outside the 95% credible interval.
Figure 3. Posterior Distributions for Pairwise Comparisons of Interest in Experiment 1.

Note. The horizontal axis in each panel represents the difference between two specific conditions in the BANOVA model’s predictions of proportion correct context identification. Top panel: distribution of predicted differences between the Incorrect Generation and Correct Generation conditions in Experiment 1a. Bottom panel: distribution of predicted differences between Incorrect Generation in Experiment 1a and Incorrect Generation in Experiment 1b. Both distributions indicate high probability that the differences were positive, i.e., context accuracy was greater for the Incorrect Generation condition in Experiment 1a than for either of the other conditions of interest. The limited dataset model was used for these analyses.
MPT models
As described above, data from the experiment were also modeled using a multinomial processing tree whose parameters were estimated in a hierarchical Bayesian framework using the R package TreeBUGS. The trees used for Experiment 1 are displayed in Figure 4 and include parameters for item retrieval (D), context retrieval (d), item guessing (b) and context guessing (a). Because category cues were provided at retrieval, and the cues themselves had been presented in colored font at encoding, it was assumed that context retrieval could occur in the absence of item retrieval. Thus, the model’s estimation of context memory took into account participants’ color responses for all valid item responses, including incorrect items (but excluding invalid trials as described above). Data were excluded from the participants who potentially misunderstood task instructions in Experiment 1b, as noted above. Separate models were fit to each version of the experiment (i.e., 1a and 1b), and models fit to each version included both a full model and a limited model with additional parameter restrictions. In the full models, separate processing trees were assumed for the three encoding conditions, with both item memory (D) and context memory (d) allowed to vary across conditions (guessing parameters a and b were assumed to be equal across conditions). In the limited models, the context memory parameter d was restricted to be equal across conditions.
Figure 4. Multinomial Processing Tree Model for Experiment 1.

Note. Processing trees represent possible combinations of cognitive operations assumed to contribute to each response type at test. Cues in the experiment referred to study episodes in which either blue or yellow font had been used (left side). Participants’ responses (right side) could consist of target items or nontarget items followed by identification of the associated font color as either blue or yellow. D = probability of retrieving target item; d = probability of retrieving font color; b = probability of guessing target item; a = probability of guessing “blue.” Separate trees were used for the three encoding conditions. For full models, parameters were restricted such that D1 = D2 and d1 = d2 within each condition, and all a’s were equal and all b’s were equal across conditions. For limited models all d’s were restricted to be equal.
Model parameters for the full models are presented in Table 3. It can be seen that the MPT model estimates closely aligned with the results of the linear-model based analyses above. In particular, item retrieval was found to be substantially greater in the Correct Generation and Incorrect Generation conditions than in the Read condition, in both versions of the experiment. For the context memory parameters, dIncorrect was found to be reliably greater in Experiment 1a than in Experiment 1b (95% CI of the difference: [.021, .436]), consistent with the pairwise difference seen in I-O scores in the Incorrect Generation condition across experiment versions. Similarly, within Experiment 1a, dIncorrect was estimated to be greater than dCorrect, although this difference was not as statistically reliable as in the analysis of I-O scores (95% CI of the difference: [−.048, .373]). For comparison between the full and limited models, fit statistics are presented in Table 4. For Experiment 1a, the full model was preferred both in terms of absolute model fit (i.e., the limited model failed to describe the covariance in the data, pT2 < .05) and relative fit (i.e., the mean difference in WAIC was greater than two standard errors). Thus, there was strong evidence for including context memory differences across conditions in the MPT model. For Experiment 1b, predictive accuracy was virtually identical for the full model compared to the limited model, consistent with the interpretation that context memory did not differ across conditions when participants were instructed to recall the context associated with their own typed responses.

One further MPT model analysis was conducted to address a potential alternative explanation for the apparent advantage in memory for context associated with corrective feedback. Specifically, because the cued recall task closely resembled the generation trials in the initial encoding task, a participant could hypothetically provide some correct item responses at test by re-generating category members on the fly, without retrieving any item information from the study episode. This could be expected to occur less often in the Incorrect Generation condition, in which the correct exemplar for each category was guaranteed not to be the one most preferred by the participant. Consequently, I-O scores for the Incorrect Generation condition were more certain to have included only trials on which item retrieval actually occurred, whereas I-O scores in other conditions could have included some trials in which there was no retrieval of item information from the encoding phase. Although the aforementioned MPT models included all retrieval trials (not just correct ones), the item-guessing parameter (b) was assumed to be equal across conditions. To evaluate the hypothesis that different rates of item guessing accounted for the observed data, an MPT model was fit in which the b parameter was allowed to vary across conditions, while the context memory parameter (d) was not. The variable-guessing model provided an adequate fit to the data in terms of means (pT1 = .356) but not in terms of covariance (pT2 = .023). Predictive accuracy (WAIC = 1177.9) was also worse than for the model with context memory differences, although the difference in WAIC between models (M = −24.5, SE = 14.7) was not as large as the difference between the model that allowed context memory to vary and the model that held both context memory and item guessing equal across conditions.
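In terms of the earlier (hypothetical) TreeBUGS sketch, the variable-guessing model corresponds to freeing b while constraining d, for example:

```r
# The variable-guessing alternative described above, expressed in terms of the
# earlier hypothetical sketch: item guessing (b) is left free to vary across
# conditions while context memory (d) is constrained to be equal.
fit_varguess <- traitMPT(
  eqnfile      = "mpt_model.eqn",
  data         = freqs,
  restrictions = list("a_Read = a_Correct = a_Incorrect",
                      "d_Read = d_Correct = d_Incorrect"),  # b unrestricted
  n.chain = 4
)
```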
Table 3.
MPT Model Parameter Estimates (Full Models)
| Parameter | Cued recall, feedback item (Exp. 1a): M | 95% CI | Cued recall, typed item (Exp. 1b): M | 95% CI | Free recall, feedback item (Exp. 2a): M | 95% CI | Free recall, typed item (Exp. 2b): M | 95% CI | Recognition, feedback item (Exp. 3a): M | 95% CI | Recognition, typed item (Exp. 3b): M | 95% CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D Read | 0.18 | [0.00, 0.60] | 0.24 | [0.01, 0.67] | 0.17 | [0.12, 0.21] | 0.15 | [0.11, 0.19] | 0.67 | [0.49, 0.79] | 0.09 | [0.00, 0.30] |
| D Correct | 0.91 | [0.77, 0.98] | 0.96 | [0.90, 1.00] | 0.36 | [0.31, 0.42] | 0.41 | [0.36, 0.46] | 0.92 | [0.87, 0.96] | 0.92 | [0.86, 0.97] |
| D Incorrect | 0.76 | [0.45, 0.94] | 0.76 | [0.46, 0.93] | 0.34 | [0.27, 0.40] | 0.31 | [0.22, 0.40] | 0.95 | [0.87, 0.99] | 0.67 | [0.38, 0.85] |
| d Read | 0.11 | [0.01, 0.23] | 0.15 | [0.03, 0.26] | 0.17 | [0.01, 0.49] | 0.26 | [0.02, 0.53] | 0.13 | [0.01, 0.39] | 0.21 | [0.03, 0.51] |
| d Correct | 0.13 | [0.02, 0.26] | 0.17 | [0.04, 0.28] | 0.06^b | [0.00, 0.17] | 0.15 | [0.01, 0.31] | 0.11 | [0.00, 0.37] | 0.10 | [0.00, 0.43] |
| d Incorrect | 0.30^a | [0.11, 0.48] | 0.06^a | [0.00, 0.17] | 0.31^b,c | [0.10, 0.49] | 0.07^c | [0.00, 0.21] | 0.12 | [0.00, 0.41] | 0.14 | [0.01, 0.46] |
| a | 0.50 | [0.46, 0.54] | 0.54 | [0.51, 0.58] | 0.55 | [0.49, 0.61] | 0.47 | [0.42, 0.53] | 0.51 | [0.33, 0.58] | 0.50 | [0.23, 0.58] |
| b | 0.70 | [0.41, 0.84] | 0.64 | [0.20, 0.81] | 0.41 | [0.21, 0.61] | 0.78 | [0.68, 0.86] | | | | |
| D2 Read | 0.75 | [0.56, 0.88] | 0.91 | [0.85, 0.96] | | | | | | | | |
| d3 Incorrect | 0.91 | [0.73, 0.99] | 0.06 | [0.00, 0.36] | | | | | | | | |
Note. Values of 0.00 and 1.00 are the result of rounding and do not reflect exact estimates. Superscript letters a, b, c indicate pairwise comparisons of interest for context memory parameters (d), for which the 95% credible interval of the mean difference did not include zero.
Table 4.
Summary of MPT Model Fits
| Task | Retrieval instruction | Full (variable context) model: p T1 | p T2 | WAIC | Equal context model: p T1 | p T2 | WAIC | WAIC difference: M | SE |
|---|---|---|---|---|---|---|---|---|---|
| Cued Recall | Feedback item | 0.87 | 0.09 | 1153.4 | 0.34 | 0.02 | 1186.6 | −33.2 | 14.1 |
| Cued Recall | Typed item | 0.64 | 0.50 | 1136.6 | 0.62 | 0.34 | 1137.2 | −0.6 | 5.7 |
| Free Recall | Feedback item | 0.65 | 0.37 | 1047.1 | 0.33 | 0.17 | 1054.7 | −7.6 | 5.9 |
| Free Recall | Typed item | 0.21 | 0.48 | 1187.8 | 0.22 | 0.23 | 1185.1 | 2.6 | 4.4 |
| Recognition | Feedback item | 0.10 | 0.32 | 1066.4 | 0.08 | 0.24 | 1066.8 | −0.4 | 7.3 |
| Recognition | Typed item | 0.13 | 0.47 | 1181.9 | 0.07 | 0.48 | 1180.4 | 1.5 | 3.7 |

Note. Posterior-predictive p-values for T1 and T2 statistics reflect adequacy of model fits to the means and covariances of the data, respectively (Klauer, 2010), with values <0.05 indicative of inadequate fit. WAIC (Watanabe, 2010) reflects the model's predictive accuracy while correcting for the number of parameters in the model, with lower values preferred.
Discussion
The findings from Experiment 1 provide novel evidence that corrective feedback affects memory for contextual details. Specifically, incorrect responses with corrective feedback resulted in greater accuracy in memory for contextual details associated with the feedback display than for contextual details associated with the initial response. This difference was not observed for confirmatory feedback when participants generated correct responses or retyped items that were presented. As noted earlier, the experiment was structured so that cues were randomly assigned to Correct Generation, Incorrect Generation, and Read conditions, and participants could not reliably predict whether a response they generated would be correct. Thus, the observed effect of corrective feedback was independent of potentially covarying factors such as participants’ confidence, degree of surprise, or prior knowledge of the subject matter. Additionally, the finding that there was no advantage in context memory associated with typed items in Experiment 1b is important because it demonstrates that the effect of corrective feedback on context memory did not extend to non-feedback information in the encoding episode. It also rules out the possibility that the advantage in context memory for corrective feedback was caused by some other difference between encoding conditions, such as a lack of competing contextual information for targets. Taken together, the results are consistent with the hypothesis that there is enhanced encoding of corrective feedback (e.g., Potts et al., 2019), and that this enhanced encoding applies to both item and context information.
In addition to the novel findings regarding context memory, the results of Experiment 1 also provide valuable data regarding item memory and errorful learning. By comparing the effects of study alone, correct generation with confirmatory feedback, and incorrect generation with corrective feedback, the present results add to a growing literature that seeks to delineate the conditions under which error generation can be beneficial or harmful to learning (e.g, Bridger & Mecklinger, 2014; Zawadzka & Hanczakowski, 2019).
Experiment 2
Experiment 1 demonstrated a performance advantage in context memory for corrective feedback, consistent with the overarching hypothesis that episodic processes, and not just semantic mediation, are involved in errorful learning. As mentioned above, the observed effect could be caused by enhanced encoding of corrective feedback relative to other information during a learning trial. Analyses of multinomial processing tree models also did not support an alternative explanation that the specific context memory advantage for corrective feedback might have been an artifact of differences in item guessing across conditions (although it is important to note that such an explanation still implies that episodic processes are involved in retrieval of corrective feedback). Nonetheless, it is useful to test context memory for corrective feedback in a task that reduces the possibility of item guessing.
Experiment 2 provided such a test by employing the same encoding task as Experiment 1, but with free recall as the retrieval task instead of cued recall. The use of free recall helped to ensure that at least some retrieval was involved in each correct response at test, as it would be unlikely for participants to spontaneously generate items from the encoding phase in a memory-free manner (re-generation of exemplars could still occur but would require, at a minimum, retrieval of the category cue from the learning trial).
Method
Participants
Participants were 77 undergraduate students (mean age = 19.26 years). Thirty-six participants completed Experiment 2a and 41 participants completed Experiment 2b. As in Experiment 1, participants were recruited with the goal of at least 30 participants in each version. Several additional participants were added based on the observation that some individuals did not recall any target items from at least one of the encoding conditions (six in Experiment 2a and 12 in Experiment 2b). Although this did not prevent the use of their data, it potentially limited the informativeness of their context memory responses in comparing conditions; thus, additional participants were added to help compensate for this limitation. Additionally, one participant in Experiment 2b did not produce any usable recall responses due to typing error (see Results). All participants provided informed consent and all procedures were approved by the Institutional Review Board of the university.
Stimuli
Stimuli were the same as in Experiment 1.
Procedure
The encoding tasks for Experiments 2a and 2b were identical to Experiments 1a and 1b, respectively. The retrieval task was changed to free recall. For Experiment 2a, participants were instructed to type every correct answer they could think of from the encoding task, pressing Enter after each word, and to type STOP when they could not think of any more correct answers. They were then presented with each of the recalled answers they had just typed, one at a time and in sequential order. For each of these recalled responses, the participant indicated whether it had been presented in blue or yellow font in the feedback display at encoding, using the 7 and 8 keys, respectively, on the keyboard. For Experiment 2b, the retrieval procedure was identical except that participants were instructed to type as many as possible of their own (previously-typed) answers from the encoding task, including category members they had been told were incorrect. They were then presented with each of these recalled answers and indicated whether it had been presented in blue or yellow font when they had typed it during encoding, using the 7 and 8 keys, respectively, on the keyboard.
Results
Analysis approach and inspection of trial data
The same approach to statistical analysis was used as in Experiment 1, with hierarchical BANOVA models applied to dichotomous trial-by-trial accuracy data, with encoding condition and retrieval instructions as within-subjects and between-subjects factors, respectively. Each participant’s data were summarized with respect to encoding trials, such that item accuracy for a given trial reflected whether the target item for that trial was subsequently recalled, and context accuracy for a trial reflected whether the correct font color was selected, conditional on the target having been recalled (i.e., context accuracy data was missing on trials for which the target item was not recalled).
As in Experiment 1, participants’ responses in the encoding task were inspected to ensure basic compliance with the task instructions. Across all participants in both versions of the experiment, a total of 33 encoding trials had blank responses. Most of these instances (25) were caused by two participants who neglected to re-type the displayed category member on the majority of Read trials. There were a total of six encoding trials, across five participants, for which a blank response in a generation condition caused the trial not to include a valid target item; these trials were excluded from further analyses. Misspellings or typographical errors occurred on approximately 1.9% of encoding trials across both versions of the experiment; these trials were not excluded from analysis and were used when identifying approximate recall matches where possible. Apart from these minor issues, participants complied with instructions and generated valid exemplars of the cued categories, as in Experiment 1.
Free recall responses were inspected and evaluated relative to target items from the encoding phase. One participant (Experiment 2b) produced unusable free recall responses by failing to press the Enter key between items. This caused the recalled items not to be separated for individual context memory trials in the context memory block that followed free recall, so none of that participant’s data was used, reducing the total number of participants in Experiment 2b to n = 40. Another two participants (one in Experiment 2a, one in 2b) made the same error for part of the free recall list, but did produce some usable recall responses that were included in the item and context memory analyses. For all participants, analysis of free recall responses ignored any recalled buffer items (first 4 of study list), any blank or uninterpretable responses (e.g., single letters), and any repetitions of items within an individual participant’s recall list. For any item typed more than once in a participant’s free recall list, only the first font color response for that item in the subsequent context memory block was used to evaluate context memory accuracy. As in Experiment 1, approximate matches to items from the study list were identified (3.2% of all valid recall responses were approximate matches). Of the valid recalled items produced by participants (1,069 total), 952 (89.1%) were target items, and 97 (9.1%) were intrusions of nontarget items (e.g., recall of one’s own typed response in Experiment 2a or of a corrective feedback item in Experiment 2b). Some intrusions of nontarget items (37 of the 97 instances across all participants, 3.5% of all recall responses) were produced when participants recalled both the target and nontarget exemplars for a particular category. For these trials, the target item was counted as having been correctly recalled, and the subsequent color response to the target item was used in analyses of context memory. There were 20 intrusions (1.9%) of new items that did not match any item studied during the encoding phase.
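As an illustration only, near-miss responses of the kind described above could be flagged with a simple edit-distance check; the threshold below is an assumption for demonstration and is not the criterion used in the study.

```r
# Flag a recall response as an approximate (non-exact) match to any studied item,
# using Levenshtein distance from base R's adist(). The threshold is illustrative.
is_approx_match <- function(response, studied, max_dist = 1) {
  d <- adist(toupper(response), toupper(studied))
  any(d > 0 & d <= max_dist)
}

is_approx_match("HYDROGIN", c("HYDROGEN", "OXYGEN"))  # TRUE: one-letter typo
is_approx_match("WATER",    c("HYDROGEN", "OXYGEN"))  # FALSE
```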
Item accuracy
Item memory results for free recall in Experiment 2 are displayed in the left panel of Figure 5, using the same representations of participant data and BANOVA model predictions as in Figure 2. Estimates of model coefficients are presented in Table 5. Predicted response proportions from the model indicate that across both versions of the experiment, item memory performance was clearly higher in the Correct Generation condition (M = .377, 95% CI [.332, .426]) than in the Read condition (M = .141, 95% CI [.111, .174]). Accuracy in the Incorrect Generation condition (M = .328, 95% CI [.281, .375]) was statistically similar to accuracy in the Correct Generation condition. Bayesian p-values supported the conclusion of a highly significant effect of encoding condition, p < .001, and no significant effect of retrieval instructions, p = .910, or interaction, p = .194.
Figure 5. Accuracy in Item and Context Memory for Target Items in Experiment 2.

Note. Participant data and BANOVA model predictions are represented in the same way as in Figure 2. Free recall item accuracy (left panel) reflects the proportion of valid encoding trials for which the target item was produced during the free recall phase of the experiment, and context memory accuracy (right panel) reflects the proportion of those target items for which the correct font color was selected during the context memory phase of the experiment.
Table 5.
Experiment 2 Item Accuracy: Estimated BANOVA Coefficients
| Effect | Estimate | 95% CI |
|---|---|---|
| Intercept | −1.01 | [−1.15, −0.86] * |
| Retrieve Feedback Item | −0.01 | [−0.14, 0.13] |
| Correct Generation | 0.51 | [0.36, 0.67] * |
| Correct Generation : Retrieve Feedback | −0.10 | [−0.26, 0.05] |
| Incorrect Generation | 0.29 | [0.13, 0.46] * |
| Incorrect Generation : Retrieve Feedback | 0.02 | [−0.14, 0.19] |
Note. Data for Experiment 2 models excluded one participant with no usable recall responses, for total n=76.
* denotes that zero lies outside the 95% credible interval.
Context accuracy
Context accuracy data (I-O scores) and predicted values are displayed in the right panel of Figure 5; coefficients and effect sizes from the BANOVA model are presented in Table 6. Note that the number of context memory trials for each participant (represented by circle size in the figure) was generally lower and more variable relative to Experiment 1, given that context memory responses were only available for as many items as were freely recalled. Consequently, the model predictions reflect greater uncertainty, with wider ranges for the 95% credible intervals in each condition. Nonetheless, the BANOVA analysis closely resembled the findings from Experiment 1 in that the model suggested there was a significant interaction between encoding condition and retrieval instructions, p = .010, but no significant main effect of encoding condition, p = .087, or of retrieval instructions, p = .687. As seen in Figure 5, context memory accuracy was relatively high in the Incorrect Generation condition in Experiment 2a (recall feedback item; M = .681, 95% CI [.581, .769]) compared to other conditions, particularly the Incorrect Generation condition in Experiment 2b (recall typed responses; M = .553, 95% CI [.452, .649]), and the Correct Generation condition in Experiment 2a (M = .495, 95% CI [.389, .601]). As in Experiment 1, these potential differences were further examined in pairwise analyses. The posterior distributions for the estimated accuracy differences are shown in Figure 6. Both distributions indicate a high probability that the accuracy estimates for these conditions reflect an advantage for Incorrect Generation in Experiment 2a (the proportion of samples with a difference less than or equal to zero was .0005 for the comparison with Correct Generation in Experiment 2a and .036 for the comparison with Incorrect Generation in Experiment 2b).
Table 6.
Experiment 2 Context Accuracy: Estimated BANOVA Coefficients
| Effect | Estimate | 95% CI |
|---|---|---|
| Intercept | 0.44 | [0.22, 0.64] * |
| Retrieve Feedback Item | −0.05 | [−0.25, 0.17] |
| Correct Generation | −0.22 | [−0.47, 0.03] * |
| Correct Generation : Retrieve Feedback | −0.19 | [−0.43, 0.05] |
| Incorrect Generation | 0.05 | [−0.19, 0.28] |
| Incorrect Generation : Retrieve Feedback | 0.32 | [0.08, 0.56] * |
Note. Data for Experiment 2 models excluded one participant with no usable recall responses, for total n=76.
* denotes that zero lies outside the 95% credible interval.
Figure 6. Posterior Distributions for Pairwise Comparisons of Interest in Experiment 2.

Note. The BANOVA model’s predicted pairwise differences are presented in the same way as in Figure 3. As in Experiment 1, both distributions indicate a high probability that context accuracy was greater for the Incorrect Generation condition in Experiment 2a than for either of the other conditions of interest.
MPT models
As in Experiment 1, data from the experiment were also modeled using multinomial processing trees. The trees used for Experiment 2 are displayed in Figure 7; note that the trees model context judgments only for successfully recalled targets (since context judgments were not made for unrecalled items) and do not include intrusions, because intrusions were a small proportion of recalled items, would have required separate trees and parameters, and were not produced at all by many participants. Thus the model includes parameters for item retrieval (D), context retrieval (d), and context guessing (a). Invalid or unusable data were excluded as described above, and separate models were fit to each version of the experiment (i.e., 2a and 2b), including both full and limited models that allowed the context memory parameter d to vary or not to vary across conditions, respectively.
Figure 7. Multinomial Processing Tree Model for Experiment 2.

Note. The models for Experiment 2 represented participants’ responses to each possible target as either recalled or unrecalled, followed by identification (for recalled targets) of the associated font color as either blue or yellow. Parameters represented the same processes as in Experiment 1, with the difference that there was assumed to be no item guessing. Parameter values were restricted in the same manner as in Experiment 1.
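For readers unfamiliar with MPT notation, the category probabilities implied by the tree structure just described can be sketched as follows; this is a reconstruction from the text, not the authors' formal model specification, with D denoting item retrieval, d context retrieval, and a correct context guessing.

```latex
\begin{align*}
P(\text{target recalled, correct color})   &= D\,d + D\,(1-d)\,a \\
P(\text{target recalled, incorrect color}) &= D\,(1-d)\,(1-a) \\
P(\text{target not recalled})              &= 1 - D
\end{align*}
```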
Model parameters for the full models are presented in Table 3. As in Experiment 1, the MPT model estimates closely aligned with the results of the linear-model-based analyses above. Item retrieval was found to be greater in the Correct Generation and Incorrect Generation conditions than in the Read condition, in both versions of the experiment. For the context memory parameters, d_Incorrect was found to be reliably greater in Experiment 2a than in Experiment 2b (95% CI of the difference: [.002, .448]), and within Experiment 2a, d_Incorrect was estimated to be greater than d_Correct (95% CI of the difference: [.025, .452]). Fit statistics are presented in Table 4. For Experiment 2a, the full model had slightly better predictive accuracy than the limited model, although the mean difference in WAIC was only about 1.3 times the standard error of the difference. For Experiment 2b, the predictive accuracy of the full and limited models was again much more similar.
Discussion
Experiment 2, which used free recall as the retrieval task, provided additional evidence that memory for context associated with feedback items and generated responses differed across conditions of corrective and confirmatory feedback. This was evidenced by the observed interaction of encoding condition and retrieval instructions in the BANOVA analysis of I-O scores, as well as the differences in parameter estimates for context memory across conditions in the MPT models. Within the Incorrect Generation condition, mean I-O scores for feedback versus typed items were very similar to Experiment 1, although there was greater overlap in the credible intervals for the estimates due to the reduced number of successfully retrieved items for which each participant’s context memory could be tested. The reduced number of context memory trials may also account for the observation that the MPT model with varying context across conditions did not have as great an advantage in predictive accuracy relative to the model in which context memory was held constant.
Qualitatively, the primary difference in patterns of context memory performance between Experiments 1 and 2 was the poor performance in the Correct Generation condition in Experiment 2a (in which targets were feedback items), where performance was centered almost exactly at chance. This result could be interpreted as revealing weakness in the encoding of confirmatory feedback, such that it failed to update contextual information encoded with the initial response when that initial response had been strongly encoded (as evidenced by its ability to be freely recalled). In the cued recall task of Experiment 1a, this weakness might not have been as evident because cueing enabled retrieval of nearly all target items in the Correct Generation condition, including those for which initial encoding was not as strong and which may have been more likely to incorporate the contextual details of the weakly encoded confirmatory feedback. In any case, the relative advantage in memory for context associated with corrective feedback supports the interpretation that episodic processing is enhanced in some way by corrective feedback. This adds further support to the conclusion that the corresponding advantage in Experiment 1 was not an artifact of the cued recall task having possibly filtered memory-free responses to a greater extent in the Incorrect Generation condition compared to the other conditions.
As in Experiment 1, item memory results were broadly consistent with prior findings regarding the benefits of self-generation and learning from errors, relative to simply studying items. Particularly impressive was that participants recalled items learned from corrective feedback as well (on average) as items they had self-generated, although it should be noted that they were not always able to discriminate between the two, as evidenced by occasional intrusions of nontarget items.
Experiment 3
The purpose of Experiment 3 was to further examine effects of error generation and feedback on context memory when recognition was used as the retrieval task. As in Experiment 2, this provided the opportunity for further replication, and also addressed a potential weakness in the study design: specifically, did the different encoding instructions across the A and B versions of the experiment create a fundamentally different task for participants? In the B version, participants were instructed that they would be tested on their own typed responses, so perhaps this induced them to ignore the feedback altogether, creating conditions that are not truly parallel to the conditions in the A version. By using recognition as the retrieval task, Experiment 3 provided an opportunity to test participants’ memory for both their own typed responses and the feedback items, in both versions of the experiment. If participants ignored feedback in the B version of the encoding task, then they should exhibit little recognition of corrective feedback items. On the other hand, if corrective feedback does induce enhanced encoding, then participants might be expected to frequently recognize feedback items, even when instructed to remember their initial responses.
Method
Participants
Participants were 78 undergraduate students (mean age = 18.94 years). Thirty-five participants completed Experiment 3a and 43 completed Experiment 3b. As in Experiments 1 and 2, participants were recruited with the goal of at least 30 participants in each version, and as with the other experiments, several additional participants were added in order to compensate for potential lack of data due to poor performance by some individuals. Specifically, two participants in Experiment 3a and 10 participants in Experiment 3b had either a substantial number of excluded trials or exceptionally low recognition rates for targets (see Results for details). All participants provided informed consent and all procedures were approved by the Institutional Review Board of the university.
Stimuli
Stimuli were the same as in Experiments 1 and 2.
Procedure
The same encoding tasks were used as in Experiments 1 and 2. The retrieval task was changed to recognition. The recognition list consisted of 84 trials, which included the 14 correct exemplars from each encoding condition (42 total), the 14 incorrect responses that were provided by the participant at encoding (in the Incorrect Generation condition), and 28 previously unused exemplars from the categories assigned to the Read and Correct Generation conditions (i.e., 14 from each; in the Correct Generation condition, this was the first of the two available exemplars for each category, unless its initial letter matched that of the participant’s generated response).
On each trial in the recognition task, participants were presented with a category label-exemplar pair, and three response options: blue target, yellow target, and not a target (indicated with the 7, 8, and 9 keys on the keyboard, respectively). Participants were instructed that some of the words on the screen would be correct answers from the encoding task, some would be their own incorrect responses from the encoding task, and some would be new. For Experiment 3a, participants were instructed to respond not a target to incorrect category members or new items, and for correct category members, to select either blue target or yellow target to indicate whether it had been displayed in blue or yellow font in the feedback display at encoding. For Experiment 3b, participants were instructed to respond not a target to any category member they had not typed as an answer during the encoding task (i.e., new items or items that were provided as corrective feedback). For category members they had typed at encoding, participants were instructed to select either blue target or yellow target to indicate whether it had been displayed in blue or yellow font when they typed it at encoding.
Results
Analysis approach and inspection of trial data
The same approach to statistical analysis was used as in Experiments 1 and 2, with hierarchical BANOVA models applied to dichotomous trial-by-trial accuracy data, with encoding condition and retrieval instructions as within-subjects and between-subjects factors, respectively. As in Experiment 2, each participant’s data were summarized with respect to encoding trials, such that each encoding trial was coded for whether the target item for that trial was subsequently recognized (hit), whether the non-target item for that trial was subsequently recognized (false alarm), and whether the correct font color was selected given that the target item was recognized (identification of origin).
As in Experiments 1 and 2, encoding trial data were examined for any errors or failures to adhere to instructions. Overall compliance with the encoding task was high, as in Experiments 1 and 2. Across all participants in both versions of the experiment, a total of 22 encoding trials had blank or invalid responses (18 blank, four with response of “IDK” or “DONTKNOW”). Additionally, although the program was set to activate CapsLock, four participants (one in Experiment 3a and three in Experiment 3b) were found to have either used the Shift key or turned off CapsLock so that many of their typed responses included lowercase letters. Because the recognition memory task used typed responses in generation conditions as either targets or non-target lures (depending on condition), this created a potential problem because participant-generated responses could be identified simply due to the presence of lowercase letters. Thus, in addition to trial exclusions based on blank responses, item recognition responses were not counted for any target or non-target items that were invalidated due to lowercase letters (87 target and 28 non-target recognition items across those four participants). Other misspellings or typographical errors occurred on an additional 1.9% of encoding trials across all participants in both versions of the experiment; however, recognition responses based on those trials were not excluded from analyses, as the errors were minor and unsystematic.
Item accuracy
Item accuracy results for Experiment 3 are displayed in Figures 8 and 9. For ease of comparison with Experiments 1 and 2, Figure 8 displays proportions of target item recognition (i.e., hit rates) in the left panel, and context memory accuracy in the right panel. Context memory results are described in the next section. Figure 9 displays proportions of nontarget recognition (i.e., false alarm rates) in the left panel, and a combined measure of item accuracy in the right panel (described below). As in the other figures, circles represent individual participant response rates and boxes represent 95% credible intervals of the predicted response rate generated from the hierarchical BANOVA model. Estimated model coefficients are presented in Table 7.
Figure 8. Item and Context Memory Accuracy for Target Items in Experiment 3.

Note. The left panel displays proportion hits, i.e., the proportion of target items that were correctly recognized as targets. The right panel displays proportion correct font color identification for correctly recognized targets. As in Figures 2 and 4, circles represent response proportions for individual participants, with circle size indicating the number of valid trials used to compute the proportion. Boxes represent means and 95% credible intervals of predicted response proportions based on the posterior probability distributions of BANOVA model coefficients.
Figure 9. False Alarms and Overall Item Accuracy in Experiment 3.

Note. The left panel displays recognition of nontarget items in Experiment 3 (i.e., false alarms). Note that in the Incorrect Generation condition, nontarget items had been studied during the encoding phase (as the participant’s own typed responses in Experiment 3a, and as the feedback items in Experiment 3b), whereas nontarget items in the Correct Generation and Read conditions were unstudied exemplars of the categories that were presented during the encoding phase. The right panel represents overall item recognition accuracy in Experiment 3, by showing the proportion of encoding trials for which targets exclusively were recognized in the retrieval task (i.e., subtracting trials in which nontarget items were recognized). Results displayed in this figure do not include the subset of participants with below-chance hit rates for Incorrect Generation in Experiment 3 (see text).
Table 7.
Experiment 3 Item Accuracy: Estimated BANOVA Coefficients
| Effect | Hits: Full data set (n=78) | Hits: Limited data set (n=70) | False Alarms (n=70) | Exclusive Hits (n=70) |
|---|---|---|---|---|
| Intercept | 2.81 [2.51, 3.15] * | 2.91 [2.65, 3.22] * | −2.02 [−2.38, −1.68] * | 1.08 [0.83, 1.34] * |
| Retrieve Feedback Item | 0.07 [−0.24, 0.36] | −0.14 [−0.43, 0.14] | −0.77 [−1.12, −0.41] * | 0.63 [0.39, 0.89] * |
| Correct Generation | 1.16 [0.85, 1.50] * | 1.12 [0.77, 1.51] * | −1.23 [−1.52, −0.94] * | 1.43 [1.18, 1.71] * |
| Correct Generation : Retrieve Feedback | −0.62 [−0.96, −0.33] * | −0.63 [−1.04, −0.28] * | 0.36 [0.07, 0.66] * | −0.57 [−0.84, −0.31] * |
| Incorrect Generation | −0.16 [−0.50, 0.17] | 0.11 [−0.18, 0.40] | 2.31 [1.91, 2.76] * | −1.57 [−1.88, −1.27] * |
| Incorrect Generation : Retrieve Feedback | 0.60 [0.28, 0.94] * | 0.37 [0.08, 0.67] * | −1.21 [−1.63, −0.79] * | 1.13 [0.84, 1.42] * |

Note. Cell entries are estimates with 95% credible intervals in brackets. Models for the limited data set excluded participants with exceptionally low hit rates in the Incorrect Generation condition. * denotes that zero lies outside the 95% credible interval.
The overall pattern of hit rates closely resembled the patterns of retrieval accuracy in Experiments 1 and 2, with the highest performance across both versions of the experiment occurring in the Correct Generation condition (M = .982, 95% CI [.970, .989]), the worst performance occurring in the Read condition (M = .860, 95% CI [.804, .903]), and the Incorrect Generation condition falling between the two (M = .934, 95% CI [.899, .958]). Bayesian p-values suggested a significant main effect of encoding condition, p < .001, but no main effect of retrieval instructions, p = .635. The model also suggested a significant interaction, p < .001. Similar to Experiment 1, a subgroup of eight low-performing participants (one in Experiment 3a and seven in Experiment 3b) can be seen in the Incorrect Generation condition, with hit rates well below chance, suggesting they may have misunderstood the retrieval instructions. To account for this possibility, the BANOVA analysis was re-run with all data from those eight participants excluded. The model based on the limited data set suggested the same effects as with the full data set: i.e., a significant main effect of encoding condition, p < .001, no main effect of retrieval instructions, p = .321, and a significant interaction, p = .001. The persistence of the interaction in the limited data set may be attributable to the remaining participants in Experiment 3b having a nearly perfect predicted hit rate in the Correct Generation condition (M = .992, 95% CI [.981, .997]), which was significantly higher than the predicted hit rate in any other condition, in either version (the closest being Correct Generation in Experiment 3a, M = .962, 95% CI [.934, .979]).
As seen in Figure 9, the pattern of false alarm rates differed markedly between Experiments 3a and 3b. Although both versions of the experiment had higher false alarm rates in the Incorrect Generation condition than in the other two encoding conditions, the false alarm rates in Experiment 3b were comparable to hit rates for many participants. That is, many participants endorsed the majority of both target and non-target stimuli from the Incorrect Generation condition. In contrast, false alarm rates were similar across retrieval instructions in the Read and Correct Generation conditions. The results shown in Figure 9 do not include the eight participants with below-chance hit rates. Bayesian p-values for both the full data set and the limited data set suggested significant main effects of encoding condition, p < .001, and retrieval instructions, p < .001, and a significant interaction, p < .001. Model coefficients are also reported in Table 7.
Experiment 3 was structured such that each encoding trial had corresponding target and nontarget trials during the retrieval phase. As a result, it was also possible to assess overall recognition accuracy within the same analysis framework as the other dependent measures reported here, by counting an encoding trial as successful only if it resulted in subsequent recognition of the target and exclusion of the nontarget. Bernoulli probabilities for “exclusive hits” were then estimated using the same hierarchical BANOVA model structure as for hits and false alarms. Participant data and model predictions for exclusive hit rates are displayed in the right panel of Figure 9 (this analysis did not include the eight participants with below-chance hit rates in Incorrect Generation). Model coefficients are also presented in Table 7. Bayesian p-values indicated significant main effects of encoding condition, p < .001, and retrieval instructions, p < .001, and a significant interaction, p < .001. As seen in Figure 9, exclusive hit rates were highest in the Correct Generation condition (overall M = .925, 95% CI [.895, .948]), followed by the Read condition (overall M = .771, 95% CI [.696, .837]). For the Incorrect Generation condition, when participants were instructed to recognize feedback items (Experiment 3a), performance was similar to the Read condition (M = .780, 95% CI [.671, .860]). However, when participants were instructed to recognize typed items (Experiment 3b), the exclusive hit rate was very low (M = .095, 95% CI [.053, .157]), reflecting the observation that most participants endorsed both the feedback items and typed items in this condition.
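The "exclusive hit" coding described above can be illustrated with a brief sketch (hypothetical column names): an encoding trial is scored as successful only when its target was endorsed and its corresponding nontarget was rejected.

```r
library(dplyr)

# Hypothetical recognition outcomes, one row per encoding trial.
recog <- tibble(
  target_hit   = c(TRUE, TRUE, FALSE),   # target endorsed at test?
  nontarget_fa = c(FALSE, TRUE, FALSE)   # corresponding nontarget endorsed (false alarm)?
) %>%
  mutate(exclusive_hit = as.integer(target_hit & !nontarget_fa))
```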
Context accuracy
The right panel of Figure 8 displays context accuracy results from Experiment 3, represented by the proportion of correctly recognized targets (i.e., hits) for which the correct font color was also selected (i.e., I-O scores, as in Experiments 1 and 2). Model coefficients are reported in Table 8. Consistent with Experiments 1 and 2, context memory accuracy in Experiment 3a (recognize feedback item) was somewhat better for Incorrect Generation (M = .639, 95% CI [.573, .698]) than Correct Generation (M = .580, 95% CI [.514, .647]), and was somewhat better for Incorrect Generation in Experiment 3a than in Experiment 3b (recognize typed responses; M = .552, 95% CI [.490, .615]), although the differences were more modest than in Experiments 1 and 2. Consistent with these observations, Bayesian p-values from the model did not suggest any significant effects: encoding condition, p = .149, retrieval instructions, p = .269, interaction, p = .131. The BANOVA model was also run with the limited data set that excluded the eight participants with below-chance hit rates; results were essentially the same as for the full data set, with no significant main effect of encoding condition, p = .117, or of retrieval instructions, p = .375, and no interaction, p = .226.
Table 8.
Experiment 3 Context Accuracy: Estimated BANOVA Coefficients
| Effect | Full data set (n=78) | Limited data set (n=70) |
|---|---|---|
| Intercept | 0.36 [0.25, 0.48] * | 0.38 [0.27, 0.51] * |
| Retrieve Feedback Item | 0.07 [−0.05, 0.18] | 0.05 [−0.07, 0.18] |
| Correct Generation | −0.10 [−0.24, 0.04] | −0.11 [−0.26, 0.04] |
| Correct Generation : Retrieve Feedback | 0.00 [−0.15, 0.14] | 0.01 [−0.14, 0.14] |
| Incorrect Generation | 0.03 [−0.11, 0.17] | 0.04 [−0.11, 0.19] |
| Incorrect Generation : Retrieve Feedback | 0.11 [−0.03, 0.26] | 0.09 [−0.05, 0.24] |

Note. Cell entries are estimates with 95% credible intervals in brackets. Models for the limited data set excluded participants with exceptionally low hit rates in the Incorrect Generation condition. * denotes that zero lies outside the 95% credible interval.
As noted in the item memory analyses above, participants in both versions of Experiment 3 had relatively high false alarm rates in the Incorrect Generation condition (especially in Experiment 3b). Because participants indicated font color as part of the recognition response, this meant that a substantial number of context memory judgments were collected for both feedback and typed items, with both types of retrieval instructions. Further, it was possible to classify font color responses to nontargets as correct or incorrect relative to the font color that had been used for the target item on the corresponding trial at encoding. For example, in Experiment 3b, a false alarm was generated when a participant indicated a font color for an item that had been presented as corrective feedback, yet that font color might nonetheless match the font color of the corresponding typed item. If the colors chosen on false alarm trials in Experiment 3b were more accurate than chance, it would suggest that memory for context associated with corrective feedback may include temporally adjacent context. To test this, an additional BANOVA analysis was performed on recognition responses in the Incorrect Generation condition only, with context accuracy as the dependent variable, item type (feedback vs. typed) as a within-subjects factor, and retrieval instructions (recognize feedback item vs. recognize typed item) as a between-subjects factor. All participants were included, given that the analysis examined context memory for both target and nontarget items and thus item accuracy was irrelevant. Participant data and predicted response proportions are displayed in the left panel of Figure 10, and Table 9 presents the model coefficients. Of particular interest are the two sets of responses in which the recognized items were nontargets, i.e., there was a mismatch between retrieval instructions and the recognition response. Note that in these cases, the specific category member being recognized (i.e., the nontarget) was only ever presented in white font, whereas the category cue and target item were presented in colored font during a different portion of the encoding trial. When typed items were recognized in the version that instructed participants to recognize feedback items (Experiment 3a), accuracy in memory for font color was roughly at chance (M = .470, 95% CI [.356, .586]). In contrast, when feedback items were recognized in the version that instructed participants to recognize typed items (Experiment 3b), accuracy in memory for font color was significantly better than chance (M = .595, 95% CI [.531, .657]), with a mean predicted value between those found on target recognition trials in Experiment 3a (M = .643, 95% CI [.570, .709]) and Experiment 3b (M = .549, 95% CI [.479, .613]). Bayesian p-values from the model suggested a significant main effect of item type, p = .006, no significant main effect of retrieval instructions, p = .732, and no interaction, p = .135. This suggests that contextual information associated with corrective feedback includes context from the response that occurred prior to the feedback itself, and that such prior contextual information is remembered at least as well when cued by the feedback information as when cued by the response that carried the context.
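The scoring of color responses to endorsed nontargets can be sketched as follows (column names are hypothetical); the key point is that the chosen color is evaluated against the font color that accompanied the target from the same encoding trial, since the nontarget itself appeared only in white.

```r
library(dplyr)

# Hypothetical false-alarm trials from the Incorrect Generation condition.
nontarget_fas <- tibble(
  trial        = c(12, 27),
  target_color = c("blue", "yellow"),   # color shown with that trial's target at encoding
  color_chosen = c("blue", "blue")      # color selected when the nontarget was endorsed
) %>%
  mutate(context_acc = as.integer(color_chosen == target_color))
```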
Figure 10. Context Memory in the Incorrect Generation Condition in Experiment 3.

Note. Using data from hits (i.e., target item recognition) and false alarms (i.e., nontarget item recognition) in the Incorrect Generation condition, context memory accuracy is displayed according to whether the recognized item was a feedback item or the participant’s own typed response, and whether retrieval instructions were to recognize feedback items or typed responses. For nontarget items (where there is a mismatch between retrieval instructions and actual recognition), context accuracy reflects memory for the font color that had occurred with the corresponding target item (Incorrect Generation nontargets appeared only in white font at encoding).
Table 9.
Context Accuracy by Item Type in the Incorrect Generation Condition: Estimated BANOVA Coefficients
| Effect | Estimate | 95% CI |
|---|---|---|
| Intercept | 0.26 | [0.10, 0.43] * |
| Feedback Instructed | −0.03 | [−0.20, 0.14] |
| Feedback Recognized | 0.22 | [0.05, 0.39] * |
| Feedback Recognized: Feedback Instructed | 0.13 | [−0.04, 0.31] |
Note. Data included all Experiment 3 participants, for total n=78. Feedback Instructed corresponds to Experiment 3a in which the feedback item was the target. Feedback Recognized corresponds to the feedback item being endorsed (regardless of whether it was a target).
* denotes that zero lies outside the 95% credible interval.
MPT Models
Multinomial processing trees for Experiment 3 are displayed in Figure 11. The general structure of the models was based on the two-high threshold model of source memory (2HTSM; Bayen et al., 1996); however, due to the structure of Experiment 3, it was necessary to modify the 2HTSM in two important ways. First, as in Experiment 1, font color was associated with both cues and targets, so it was assumed that context retrieval was possible based on the cue alone, even when there was no item retrieval. Second, because nontarget items in the Incorrect Generation condition had actually been presented at encoding, it was not possible to assume that they were recognized as “new” in the same way as nontarget items in the other conditions. Thus, the processing tree for the Incorrect Generation condition assumed that nontarget recognition consisted of two steps: recognition of the item as old (modeled by parameter D2), followed by source discrimination of the item as a nontarget (modeled by an additional parameter d3). Because these modifications added significant complexity, the models were also simplified by collapsing the trees for each font color, such that there was a single tree for targets and a single tree for nontargets in each condition. Invalid trials were excluded as described above, as were the eight low-performing participants. An additional two participants (one each from Experiment 3a and 3b) were excluded because they lacked any valid responses in one or more processing trees.
Figure 11. Multinomial Processing Tree Model for Experiment 3.

Note. The models for Experiment 3 were simplified by collapsing across font colors and representing participants’ responses as endorsing a presented item with either the correct or incorrect color (relative to the corresponding encoding trial) or not endorsing the item. D1 = probability of recognizing target item as a target; d1 and d2 = probability of retrieving font color; b = probability of guessing that an item was a target; a = probability of correctly guessing font color. The trees for new items differed by condition. For the Read and Correct Generation conditions, D2 = probability of recognizing a nontarget item as being new; for the Incorrect Generation condition, D2 = probability of recognizing a nontarget item as being old, with the additional parameter d3 = probability of discriminating that the item was the nontarget from the relevant encoding trial. All a’s were restricted to be equal and all b’s were restricted to be equal across conditions. D1 and D2 were restricted to be equal within the Correct Generation and Incorrect Generation conditions, but were allowed to differ within the Read condition. Values of d1 and d2 were restricted to be equal within each condition for the full models and across all conditions for the limited models.
As in Experiments 1 and 2, separate models were fit to each version of the experiment (i.e., 3a and 3b), including both full and limited models. In fitting the full models it was found that an adequate fit required that the D parameter be allowed to vary between target and nontarget trees in the Read condition. This appeared to have been caused by correct rejection rates in the Read condition being higher than hit rates overall. Model parameters for the full models are presented in Table 3. As in Experiments 1 and 2, the MPT model estimates aligned with the results of the linear-model based analyses. Item retrieval parameters were generally lower in the Read condition than in the Correct Generation and Incorrect Generation conditions. For Experiment 3b the estimated value of the D1_Read parameter was implausibly low, which seems to have been compensated for by a much higher estimate of item guessing (b). This suggests the current model may not provide an ideal description of item retrieval in Experiment 3b, although this does not create a particular concern for the theoretical considerations of the current study. Interestingly, the d3_Incorrect parameter functioned as intended across the two versions of the experiment, representing a shift from high to low discrimination of targets versus studied nontargets that allowed the model to account for the very high rate of false alarms in the Incorrect Generation condition in Experiment 3b. Estimates of the context retrieval parameters were very similar across conditions in both versions of the experiment, consistent with the analyses of I-O scores. Likewise, the predictive accuracy of the full models did not differ meaningfully from that of the limited models with equal context retrieval parameters (Table 4).
Combined context accuracy for all experiments
Although Experiments 1, 2, and 3 used different retrieval tasks, context memory accuracy was measured in a comparable way across experiments, i.e., as the proportion of correctly remembered target items for which the correct font color was identified. Each experiment has been reported separately above, but it is likely that some underlying basis for context memory performance is shared across the three retrieval tasks. Thus, it is useful to evaluate whether the interaction between encoding condition and retrieval instructions holds when context memory data are combined across experiments, providing an overall estimate of the interaction effect based on all available data. Pooling of data in this way is enabled by the Bayesian analysis approach, in which the data from each of the experiments merely provide additional information for the estimation of posterior probabilities of linear model parameters. This is mathematically equivalent to using information from one experiment to adjust the prior probabilities used in the analysis of another experiment (e.g., Wagenmakers et al., 2018). A further aid to inference regarding the interaction is to compare a model that includes the interaction to a model that does not, using a Bayes factor (e.g., Rouder, Morey, Speckman, & Province, 2012).

To accomplish this, an additional analysis was conducted using a Bayesian linear mixed model with a logit link function, implemented in the brms package in R (Bürkner, 2017). Two models were run with context memory accuracy for retrieved targets as the dependent variable. The first model included retrieval task type, retrieval instructions, encoding condition, and the retrieval instructions X encoding condition interaction as fixed effects, and participant and category cue (i.e., item) as random effects. As in the BANOVA models, typed-item retrieval and the Read condition were used as reference levels; recognition was used as the reference level for task. The second model was identical except that retrieval instructions and encoding condition were included only as simple effects, with no interaction terms. Default priors were used, and each MCMC simulation was run with four chains of 10,000 steps each (1,000 warmup) in order to provide sufficient samples for stable computation of Bayes factors. Each of the models was run with the full data set from 223 participants, as well as with the limited data set that excluded the 15 participants previously identified as potentially misunderstanding retrieval task instructions in Experiments 1 and 3. Table 10 presents the estimated coefficients from the model that included the interaction, and Figure 12 presents model predictions and 95% credible intervals along with accuracy rates of individual participants. Consistent with the BANOVA models for the individual experiments (see Footnote 3), the interaction of the Incorrect Generation condition and retrieval instructions produced a model coefficient well above zero, suggesting a significant interaction. The estimated coefficient for the simple effect of Incorrect Generation had a 95% CI slightly below zero, suggesting a bifurcation (relative to the Read condition) whereby context accuracy in the Incorrect Generation condition was reduced when targets were typed responses and enhanced when targets were feedback items. Predicted accuracy rates by condition were consistent with this interpretation and with context memory performance observed in each experiment separately.
Performance was highest for feedback items in the Incorrect Generation condition (M = .651, 95% CI [.611, .688]), particularly in comparison to the other generation conditions, all of which had substantial overlap in 95% CIs with each other and no overlap with feedback in Incorrect Generation. Specifically, typed responses in the Incorrect Generation condition had the worst estimated context memory performance (M = .540, 95% CI [.500, .580]), followed by feedback items in the Correct Generation condition (M = .568, 95% CI [.527, .607]), and typed items in the Correct Generation condition (M = .571, 95% CI [.533, .608]).
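As a rough sketch of how the pooled model could be specified in brms (this is not the authors' script; the data frame `context_data` and its column names are hypothetical, and the sampler settings follow the description above):

```r
library(brms)

# Pooled mixed-effects logistic model for context accuracy (formula as in the
# Table 10 note). save_pars(all = TRUE) retains the draws needed later for
# bridge-sampling Bayes factors.
fit_interaction <- brm(
  correct_context ~ task + instructions * condition + (1 | subj) + (1 | cue),
  data = context_data, family = bernoulli(),
  chains = 4, iter = 10000, warmup = 1000,
  save_pars = save_pars(all = TRUE)
)

# Matched model without the interaction term, for the Bayes factor comparison.
fit_no_interaction <- update(
  fit_interaction,
  formula. = correct_context ~ task + instructions + condition + (1 | subj) + (1 | cue)
)
```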
Table 10.
Context Accuracy, Experiments 1–3 Combined: Estimated Bayesian Linear Model Coefficients
| Effect | Full data set (n=223) | Limited data set (n=208) |
|---|---|---|
| Intercept | 0.39 [0.22, 0.56] * | 0.44 [0.26, 0.62] * |
| Cued Recall | 0.04 [−0.12, 0.19] | 0.04 [−0.12, 0.20] |
| Free Recall | 0.02 [−0.17, 0.21] | 0.00 [−0.18, 0.19] |
| Retrieve Feedback | −0.07 [−0.28, 0.14] | −0.12 [−0.34, 0.11] |
| Correct Generation | −0.10 [−0.28, 0.07] | −0.14 [−0.33, 0.04] |
| Correct Generation : Retrieve Feedback | 0.05 [−0.20, 0.31] | 0.09 [−0.18, 0.36] |
| Incorrect Generation | −0.23 [−0.41, −0.05] * | −0.25 [−0.44, −0.06] * |
| Incorrect Generation : Retrieve Feedback | 0.53 [0.26, 0.79] * | 0.55 [0.28, 0.82] * |

Note. Cell entries are estimates with 95% credible intervals in brackets. Model: Correct Context ~ Task + Retrieval Instructions * Encoding Condition + (1|Subj) + (1|Cue). Recognition, Retrieve Typed Item, and Read were the reference levels for Task, Retrieval Instructions, and Encoding Condition, respectively. The model for the limited data set excluded participants with exceptionally poor target item performance in the Incorrect Generation condition in Experiments 1 and 3. * denotes that zero lies outside the 95% credible interval.
Figure 12. Pooled Analysis of Identification of Origin Scores in Experiments 1–3.

Note. As in preceding figures, circles represent individual participant response proportions, with circle size indicating the number of trials underlying the proportion. Boxes represent means and 95% credible intervals for the predicted response proportions based on a Bayesian mixed-effects linear model that included item retrieval task type (cued recall, free recall, recognition), retrieval instructions, encoding condition, and retrieval instructions X encoding condition interaction as fixed effects, and participant and category cue as random effects (see text).
Finally, Bayes factors (BF_interaction) were computed using the brms package’s bridge sampling algorithm to compare the models with interaction terms for encoding condition X retrieval instructions to models with no interaction terms. To ensure stable estimates, each Bayes factor was computed 15 times. For the models based on the full data set, mean BF_interaction = 1858 (SD = 26.9), and for the models based on the limited data set, mean BF_interaction = 1428 (SD = 18.7). Thus, with BF_interaction > 10³, there was strong evidence that the data were more consistent with an interaction model than with a non-interaction model.
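Continuing the sketch above (hypothetical model objects), the repeated bridge-sampling comparison could be carried out roughly as follows; the `$bf` element is assumed to hold the Bayes factor estimate, as in the bridgesampling package used by brms.

```r
# Recompute the bridge-sampling Bayes factor 15 times to check stability.
bfs <- replicate(15, brms::bayes_factor(fit_interaction, fit_no_interaction)$bf)
mean(bfs)  # point estimate of BF_interaction
sd(bfs)    # variability across repeated bridge-sampling runs
```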
Discussion
The results of Experiment 3, which used recognition as the retrieval task, provided additional insight into both item and context memory associated with corrective feedback. Memory for context associated with targets followed a similar qualitative pattern to that observed in Experiment 1, but the difference in context memory for feedback versus typed items in the Incorrect Generation condition was not as large, and thus did not yield convincing evidence, on its own, of a non-zero effect for the interaction between encoding condition and retrieval instructions. Nonetheless, the pattern of data from Experiment 3 did not contradict the findings from Experiments 1 and 2, and when context memory performance for targets was combined across all three experiments, the data convincingly supported the conclusion that contextual information associated with corrective feedback is remembered better than contextual information associated with other aspects of errorful learning trials, including confirmatory feedback, correct responses, and incorrect responses. The recognition task used in Experiment 3 may have weakened the context memory advantage for corrective feedback by incorporating the font color judgment into the item recognition response, in contrast to Experiments 1 and 2, in which participants recalled items prior to making context judgments. That is, by inducing recollection to take place first, the recall tasks could have caused well-encoded contextual information to be retrieved and made available for the subsequent context judgment, thus enhancing the context memory advantage for those items that already had initially stronger context encoding.
The most notable results from Experiment 3 had to do with item and context memory related to corrective feedback when participants were instructed to remember their own incorrect responses (i.e., in Experiment 3b). In terms of item memory, participants in Experiment 3b had much greater difficulty rejecting corrective feedback items when instructed to recognize their own responses than participants in Experiment 3a had in rejecting their own responses when instructed to recognize feedback items. This pattern of results strongly refutes the notion that participants in the B version of each experiment were ignoring feedback information. With respect to context memory, a surprising finding was that when corrective feedback items were recognized (as false alarms) in Experiment 3b, font color judgments were as good as or better than when typed items were recognized (as hits) – even though the font color being remembered had occurred earlier with the typed item, and not with the feedback item itself. This finding suggests that contextual information for the entire learning trial is associated with the corrective feedback item. This was not the case for participants’ own incorrect responses: in Experiment 3a, when participants recognized typed items (as false alarms), they were no better than chance at identifying the font color that had appeared with the subsequent corrective feedback. Thus, in terms of both item and context memory, Experiment 3 reinforced the interpretation that memory encoding is enhanced for corrective feedback relative to other aspects of a learning trial.
General Discussion
Context memory is important for learning. Not only can context support retrieval of the content of a memory episode (e.g., Lehman & Malmberg, 2013), context itself may contain useful information or act as a retrieval cue for other, related episodes (e.g., Rowland & DeLosh, 2014). Contextual information may also play a key role in retrieval-based learning (e.g., Karpicke et al., 2014; Lehman, Smith, & Karpicke, 2014), and context memory can provide a window into the cognitive processes involved in different memory encoding conditions (e.g., Mulligan et al., 2006). Yet, few studies have directly examined how feedback affects memory for context. The present study provided novel comparisons of memory for context associated with re-typed items, correctly generated responses, incorrectly generated responses, and corrective and confirmatory feedback, using an encoding task that controlled for prior knowledge and expectancies regarding correct and incorrect responses. Across three experiments, different retrieval tasks were used to test item and context memory within this paradigm, and the results consistently indicated a memory advantage for contextual details associated with corrective feedback.
The results support the hypothesis that the memory benefits of errorful learning are, in part, due to episodic memory processes. If learning from corrective feedback occurred only via semantic mediation between generated responses and correct answers, there would be no reason to predict greater accuracy for contextual details presented in the feedback episode. The current findings do not suggest semantic mediation is absent from errorful learning, but they do argue that semantic mediation is not the only mechanism that accounts for the benefits of corrective feedback.
How does corrective feedback enhance context memory? As mentioned in the Introduction, it may be the case that error commission – even when the error is not surprising – focuses attention on feedback in a manner that benefits encoding of all information presented with feedback, including incidental features. Thus, there may be enhanced encoding of corrective feedback relative to other aspects of learning trials (Potts et al., 2019), even when the feedback is not necessarily surprising (cf. Fazio & Marsh, 2009). In this manner, the effect on context memory may be related to the attentional boost effect (Spataro et al., 2013; Swallow & Jiang, 2010) whereby memory is enhanced for incidental pictorial or verbal information that accompanies targets in a monitoring task. In line with this idea, Van der Borght, Schouppe, and Notebaert (2016) reported an extension of the attentional boost effect in which memory was improved for task-irrelevant verbal information that was associated with error feedback relative to correct feedback in a flanker task. On the other hand, Mulligan, Smith, and Spataro (2015) found that the classic attentional boost manipulation did not improve memory for contextual details such as font color. Thus, further research is needed in order to determine whether enhanced encoding of corrective feedback is supported by some of the same mechanisms as the attentional boost effect.
In addition to their implications for errorful learning, effects of corrective feedback on context memory may also have important theoretical consequences for understanding testing effects more broadly. For example, the episodic context account of retrieval-based learning (Karpicke et al., 2014; Lehman et al., 2014) proposes that testing effects occur because retrieval of previously-stored information involves both the reinstatement of contextual information from the prior episode and incorporation of the current context into the memory representation. As a result, testing increases the number and variety of context features associated with the information, which in turn support retrieval of the information on subsequent tests. Kornell and Vaughn (2016) have argued that the episodic context account does not explain pretesting effects, because there is no prior episodic context that can be updated. However, pretesting effects could still fit in to the framework of the episodic context account if it is assumed that some aspect of the pretesting paradigm enhances the initial encoding of contextual features. The present results suggest that corrective feedback could be one means by which enhancement of context encoding occurs during pretesting.
The present findings also provide an extension of the literature on generation effects (i.e., effects of self-generating to-be-remembered information instead of merely having it presented). Generation is known to affect memory both for to-be-remembered content and for contextual details (Geghman & Multhaup, 2004; Jurica & Shimamura, 1999; Marsh, Edelman, & Bower, 2001; Mulligan, 2004). In many cases, generation reduces memory for context, in contrast to its positive effect on memory for content, although the presence and direction of the effect depends on the relationship between the type of contextual information being tested and the type of processing required by the generation task (Mulligan, 2004; 2011; Mulligan et al., 2006; Overman et al., 2017). This type of negative effect on context memory might reasonably have been expected in the current experiments, particularly in Experiments 1b, 2b, and 3b, in which target items were participants’ own generated responses, comparable to other studies of generation effects. No differences were observed in context memory between conditions in the B versions of the experiments, which may suggest that there was a similar degree of semantic versus perceptual processing across conditions in the present encoding task. It is also interesting that negative generation effects were not observed for context memory in the A versions of the experiments. For example, this might have been expected if participants had shifted processing of corrective feedback toward its semantic features, at the expense of incidental perceptual features. In the same way, a context memory advantage might have been expected for confirmatory feedback, as participants had no need to encode a new item and were free to focus on font color, which they had been told to remember for later. In this way, the current study raises the possibility that the processing trade-offs that have previously been observed during response generation may not apply to the processing of feedback that follows a generated response.
Finally, the present study has valuable practical implications, in that superior memory for context associated with corrective feedback may partially explain why feedback reliably enhances the benefits of practice testing in educational settings (Dunlosky, Rawson, Marsh, Nathan, & Willingham, 2013). Further research is needed to determine whether the effect observed here generalizes to other types of contextual information and to-be-learned content, including information encountered both in the classroom and in everyday life.
Acknowledgments
Portions of this work were presented at the 57th Annual Meeting of the Psychonomic Society. Amy Overman is supported by NIH Grant R15AG052903. Joseph Stephens is supported by Air Force Research Laboratory and OSD under agreement number FA8750-15-2-0116. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NIH, Air Force Research Laboratory and OSD, or the U.S. Government. The authors thank Hannah Greenwood, Ashley Howard, and Laura Bernstein for assistance with data collection, and Nate Kornell and Neil Mulligan for helpful comments on earlier versions of this work. Complete data and analysis scripts for this study are openly available at osf.io/547mc/.
Appendix
Categories and exemplars used in the experiments.
| Category cue | Exemplar 1 | Exemplar 2 |
|---|---|---|
| A PART OF A BUILDING (buffer item) | DOOR | WINDOW |
| A MILITARY TITLE (buffer item) | SERGEANT | GENERAL |
| A COLOR OF THE RAINBOW (buffer item) | RED | GREEN |
| A TYPE OF SHIP OR BOAT (buffer item) | SAILBOAT | CRUISE |
| A PART OF SPEECH | NOUN | VERB |
| AN ELECTED OFFICIAL | PRESIDENT | SENATOR |
| A KIND OF BIRD | CARDINAL | EAGLE |
| A PART OF THE BODY | ARM | LEG |
| A CARPENTER’S TOOL | HAMMER | SAW |
| A CHEMICAL ELEMENT | HYDROGEN | OXYGEN |
| A CLERGY MEMBER | PRIEST | BISHOP |
| AN ARTICLE OF CLOTHING | SHIRT | PANTS |
| A NATURAL EARTH FORMATION | MOUNTAIN | VOLCANO |
| A TYPE OF DANCE | SALSA | TANGO |
| A UNIT OF DISTANCE | METER | INCH |
| A TYPE OF HUMAN DWELLING | HOUSE | APARTMENT |
| A TYPE OF FABRIC | COTTON | SILK |
| A KIND OF FISH | TROUT | GOLDFISH |
| A FLAVORING FOR FOOD | SALT | PEPPER |
| A KIND OF FLOWER | ROSE | DAISY |
| AN ARTICLE OF FOOTWEAR | SHOE | BOOT |
| A FOUR-LEGGED ANIMAL | DOG | CAT |
| A FRUIT | APPLE | ORANGE |
| A TYPE OF FUEL | GASOLINE | DIESEL |
| A PIECE OF FURNITURE | CHAIR | TABLE |
| A GARDENER’S TOOL | HOE | SHOVEL |
| AN HERB | OREGANO | BASIL |
| A KIND OF INSECT | FLY | ANT |
| A KITCHEN UTENSIL | FORK | SPOON |
| A LIQUID | WATER | SODA |
| A TYPE OF METAL | IRON | STEEL |
| A FORM OF MONEY | DOLLAR | PENNY |
| A MUSICAL GENRE | ROCK | JAZZ |
| A MUSICAL INSTRUMENT | FLUTE | GUITAR |
| AN OCCUPATION OR PROFESSION | DOCTOR | TEACHER |
| A PRECIOUS STONE | DIAMOND | RUBY |
| A TYPE OF READING MATERIAL | BOOK | MAGAZINE |
| A FAMILY RELATIVE | MOM | AUNT |
| AN AREA OF SCIENCE | BIOLOGY | CHEMISTRY |
| A TYPE OF SNAKE | RATTLESNAKE | COBRA |
| A SPORT | FOOTBALL | SOCCER |
| A UNIT OF TIME | SECOND | MINUTE |
| A TRANSPORTATION VEHICLE | CAR | BUS |
| A KIND OF TREE | OAK | PINE |
| A VEGETABLE | CARROT | BROCCOLI |
| A WEATHER PHENOMENON | TORNADO | HURRICANE |
Note: Category names and exemplars were drawn from the norms of Van Overschelde et al. (2004). Exemplars 1 and 2 were equally likely to be used as the to-be-remembered targets in the read condition. For the incorrect generation condition, Exemplar 1 was used, unless its initial letter matched that of the participant’s typed response, in which case Exemplar 2 was used.
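To make the assignment rule concrete, the following is a minimal sketch in R of the exemplar-selection logic described in the note above. The function and variable names are hypothetical illustrations and are not drawn from the study’s actual materials or analysis scripts.

```r
# Minimal sketch of the feedback-exemplar selection rule (hypothetical names;
# not the study's actual code). Exemplar 1 is presented as the "correct" answer
# unless its initial letter matches the participant's typed response, in which
# case Exemplar 2 is presented instead.
select_feedback_exemplar <- function(typed_response, exemplar_1, exemplar_2) {
  same_initial <- toupper(substr(typed_response, 1, 1)) ==
    toupper(substr(exemplar_1, 1, 1))
  if (same_initial) exemplar_2 else exemplar_1
}

# Example: for "A KIND OF BIRD", a typed response of "CROW" shares its initial
# letter with Exemplar 1 ("CARDINAL"), so Exemplar 2 ("EAGLE") would be shown.
select_feedback_exemplar("CROW", "CARDINAL", "EAGLE")  # returns "EAGLE"
```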
Footnotes
It may be noted that t-test-based comparisons were not ultimately used in the analyses reported here; nonetheless, sample-size planning was informed by such comparisons as reported in prior studies.
By the same token, responses labeled as correct during the encoding task could, in theory, include nonsensical or otherwise invalid responses that participants happened to enter on Correct Generation trials. In practice, completely invalid responses were rare, and those that did occur were excluded during data analysis, as described in the Results.
For additional comparison purposes, a full factorial BANOVA model (with Task and Retrieval Instructions as between-subjects factors and Encoding Condition as a within-subjects factor) yielded very similar results. The Encoding Condition × Retrieval Instructions interaction was highly significant for both the full and limited data sets, p < .001. A main effect of Encoding Condition was also suggested, with p = .047 for the full data set and p = .028 for the limited data set. No other effects or interactions were suggested.
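For readers who wish to specify a comparable model themselves, the following is a minimal sketch of an analogous hierarchical Bayesian logistic model using the brms package (Bürkner, 2017) rather than the BANOVA package referenced above. The data frame and column names are hypothetical, and the sketch is intended only to illustrate the factorial structure described in this footnote, not to reproduce the exact model reported.

```r
library(brms)

# Hypothetical trial-level data frame: one row per trial, with a binary
# indicator of accurate context (font color) memory and factors coding the
# encoding condition, retrieval instructions, and task.
# (Column names are assumptions, not taken from the study's OSF scripts.)
fit <- brm(
  context_correct ~ encoding_condition * retrieval_instructions * task +
    (1 | subject) + (1 | category),  # random intercepts for subjects and items
  data   = trials,
  family = bernoulli(),
  chains = 4,
  iter   = 2000
)
summary(fit)
```

In a model of this general form, the posterior summaries of the interaction and main-effect terms would play the role of the effects described in the footnote above.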
Contributor Information
Amy A. Overman, Elon University
Joseph D. W. Stephens, North Carolina A&T State University
Mary F. Bernhardt, Elon University
References
- Akan M, Stanley SE, & Benjamin AS (2018). Testing enhances memory for context. Journal of Memory and Language, 103, 19–27.
- Batchelder WH, & Riefer DM (1999). Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin & Review, 6(1), 57–86.
- Bayen UJ, Murnane K, & Erdfelder E (1996). Source discrimination, item detection, and multinomial models of source monitoring. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22(1), 197–215.
- Bridger EK, & Mecklinger A (2014). Errorful and errorless learning: The impact of cue–target constraint in learning from errors. Memory & Cognition, 42(6), 898–911.
- Bürkner PC (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–27.
- Butler AC, Fazio LK, & Marsh EJ (2011). The hypercorrection effect persists over a week, but high-confidence errors return. Psychonomic Bulletin & Review, 18(6), 1238–1244.
- Butterfield B, & Metcalfe J (2001). Errors committed with high confidence are hypercorrected. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(6), 1491–1494.
- Carpenter SK (2011). Semantic information activated during retrieval contributes to later retention: Support for the mediator effectiveness hypothesis of the testing effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(6), 1547–1552.
- Carpenter SK, & Yeung KL (2017). The role of mediator strength in learning from retrieval. Journal of Memory and Language, 92, 128–141.
- Clark CM (2016). When and why does learning profit from the introduction of errors? [Doctoral dissertation, UCLA].
- Cyr AA, & Anderson ND (2012). Trial-and-error learning improves source memory among young and older adults. Psychology and Aging, 27, 429–439.
- Dong C, & Wedel M (2017). BANOVA: An R package for hierarchical Bayesian ANOVA. Journal of Statistical Software, 81(9), 1–46.
- Dunlosky J, Rawson KA, Marsh EJ, Nathan MJ, & Willingham DT (2013). Improving students’ learning with effective learning techniques: Promising directions from cognitive and educational psychology. Psychological Science in the Public Interest, 14, 4–58.
- Fazio LK, & Marsh EJ (2009). Surprising feedback improves later memory. Psychonomic Bulletin & Review, 16, 88–92.
- Geghman KD, & Multhaup KS (2004). How generation affects source memory. Memory & Cognition, 32(5), 819–823.
- Gelman A (2005). Analysis of variance – Why it is more important than ever. The Annals of Statistics, 33(1), 1–53.
- Gelman A, Hwang J, & Vehtari A (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6), 997–1016.
- Hays MJ, Kornell N, & Bjork RA (2010). The costs and benefits of providing feedback during learning. Psychonomic Bulletin & Review, 17, 797–801.
- Hays MJ, Kornell N, & Bjork RA (2013). When and why a failed test potentiates the effectiveness of subsequent study. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39, 290–296.
- Heck DW, Arnold NR, & Arnold D (2018). TreeBUGS: An R package for hierarchical multinomial-processing-tree modeling. Behavior Research Methods, 50(1), 264–284.
- Johnson MK, Hashtroudi S, & Lindsay DS (1993). Source monitoring. Psychological Bulletin, 114(1), 3–28.
- Jurica PJ, & Shimamura AP (1999). Monitoring item and source information: Evidence for a negative generation effect in source memory. Memory & Cognition, 27(4), 648–656.
- Kang SHK, McDermott KB, & Roediger HL III (2007). Test format and corrective feedback modify the effect of testing on long-term retention. European Journal of Cognitive Psychology, 19, 528–558.
- Karpicke JD, Lehman M, & Aue WR (2014). Retrieval-based learning: An episodic context account. Psychology of Learning and Motivation, 61, 237–284.
- Klauer KC (2010). Hierarchical multinomial processing tree models: A latent-trait approach. Psychometrika, 75(1), 70–98.
- Knight JB, Ball BH, Brewer GA, DeWitt MR, & Marsh RL (2012). Testing unsuccessfully: A specification of the underlying mechanisms supporting its influence on retention. Journal of Memory and Language, 66(4), 731–746.
- Kornell N (2014). Attempting to answer a meaningful question enhances subsequent learning even when feedback is delayed. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40, 106–114.
- Kornell N, Hays MJ, & Bjork RA (2009). Unsuccessful retrieval attempts enhance subsequent learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35, 989–998.
- Kornell N, & Vaughn KE (2016). How retrieval attempts affect learning: A review and synthesis. Psychology of Learning and Motivation, 183–215.
- Kruschke JK (2015). Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan (2nd ed.). Academic Press.
- Lehman M, & Malmberg KJ (2013). A buffer model of encoding and temporal correlations in retrieval. Psychological Review, 120, 155–189.
- Lehman M, Smith MA, & Karpicke JD (2014). Toward an episodic context account of retrieval-based learning: Dissociating retrieval practice and elaboration. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40, 1787–1794.
- Marsh EJ, Edelman G, & Bower GH (2001). Demonstrations of a generation effect in context memory. Memory & Cognition, 29(6), 798–805.
- Marsman M, & Wagenmakers EJ (2017). Three insights from a Bayesian interpretation of the one-sided P value. Educational and Psychological Measurement, 77(3), 529–539.
- McDaniel MA, & Fisher RP (1991). Tests and test feedback as learning sources. Contemporary Educational Psychology, 16, 192–201.
- Metcalfe J (2017). Learning from errors. Annual Review of Psychology, 68(6), 1–25.
- Metcalfe J, & Huelser BJ (2020). Learning from errors is attributable to episodic recollection rather than semantic mediation. Neuropsychologia, 138, 107296.
- Mulligan NW (2004). Generation and memory for contextual detail. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(4), 838–855.
- Mulligan NW (2011). Generation disrupts memory for intrinsic context but not extrinsic context. The Quarterly Journal of Experimental Psychology, 64(8), 1543–1562.
- Mulligan NW, Lozito JP, & Rosner ZA (2006). Generation and context memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32(4), 836–846.
- Mulligan NW, Smith SA, & Spataro P (2016). The attentional boost effect and context memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 42(4), 598–607.
- Overman AA, Richard AG, & Stephens JDW (2017). A positive generation effect on memory for auditory context. Psychonomic Bulletin & Review, 24, 944–949.
- Pashler H, Cepeda NJ, Wixted JT, & Rohrer D (2005). When does feedback facilitate learning of words? Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 3–8.
- Plummer M (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 124, 1–10.
- Potts R, Davies G, & Shanks DR (2019). The benefit of generating errors during learning: What is the locus of the effect? Journal of Experimental Psychology: Learning, Memory, and Cognition, 45(6), 1023–1041.
- Pyc MA, & Rawson KA (2010). Why testing improves memory: Mediator effectiveness hypothesis. Science, 330(6002), 335.
- Roediger HL III, & Karpicke JD (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17, 249–255.
- Rouder JN, Morey RD, Speckman PL, & Province JM (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56(5), 356–374.
- Rowland CA (2011). Testing Effects in Context Memory (Unpublished master’s thesis). Colorado State University, Fort Collins, Colorado.
- Rowland CA, & DeLosh EL (2014). Benefits of testing for nontested information: Retrieval-induced facilitation of episodically bound material. Psychonomic Bulletin & Review, 21, 1516–1523.
- Slamecka NJ, & Graf P (1978). The generation effect: Delineation of a phenomenon. Journal of Experimental Psychology: Human Learning and Memory, 4(6), 592–604.
- Spataro P, Mulligan NW, & Rossi-Arnaud C (2013). Divided attention can enhance memory encoding: The attentional boost effect in implicit memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39, 1223–1231.
- Swallow KM, & Jiang YV (2010). The attentional boost effect: Transient increases in attention to one task enhance performance in a second task. Cognition, 115, 118–132.
- Tulving E (1972). Episodic and semantic memory. In Tulving E & Donaldson W (Eds.), Organization of Memory (pp. 381–403). New York, NY: Academic Press.
- Van der Borght L, Schouppe N, & Notebaert W (2016). Improved memory for error feedback. Psychological Research, 80, 1049–1058.
- Van Overschelde JP, Rawson KA, & Dunlosky J (2004). Category norms: An updated and expanded version of the Battig and Montague (1969) norms. Journal of Memory and Language, 50, 289–335.
- Vaughn KE, & Rawson KA (2012). When is guessing incorrectly better than studying for enhancing memory? Psychonomic Bulletin & Review, 19(5), 899–905.
- Wagenmakers EJ (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.
- Wagenmakers EJ, Marsman M, Jamil T, Ly A, Verhagen J, Love J, ... & Morey RD (2018). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. Psychonomic Bulletin & Review, 25(1), 35–57.
- Wahlheim CN, & Jacoby LL (2013). Remembering change: The critical role of recursive remindings in proactive effects of memory. Memory & Cognition, 41(1), 1–15.
- Watanabe S (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(12), 3571–3594.
- Yan VX, Yu Y, Garcia MA, & Bjork RA (2014). Why does guessing incorrectly enhance, rather than impair, retention? Memory & Cognition, 42, 1373–1383.
- Zawadzka K, & Hanczakowski M (2019). Two routes to memory benefits of guessing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 45(10), 1748–1760.
