Published in final edited form as: Mem Cognit. 2010 Jun;38(4):407–418. doi: 10.3758/MC.38.4.407

Memorial Consequences of Multiple-choice Testing on Immediate and Delayed Tests

Lisa K Fazio 1, Pooja K Agarwal 2, Elizabeth J Marsh 3, Henry L Roediger III 4
PMCID: PMC4094137  NIHMSID: NIHMS581290  PMID: 20516221

Abstract

Multiple-choice testing has both positive and negative consequences for performance on later tests. Prior testing increases the number of questions answered correctly on a later test, but also increases the likelihood that questions will be answered with lures from the previous multiple-choice test (Roediger & Marsh, 2005). Prior research has shown that positive effects of testing persist over a delay, but no one has examined the durability of negative effects of testing. To address this, subjects took multiple-choice and cued recall tests (on subsets of questions) both immediately and a week after studying. Although delay reduced both the positive and negative testing effects, both still occurred after one week, especially if the multiple-choice test had also been delayed. These results are consistent with the argument that recollection underlies both the positive and negative testing effects.


Multiple-choice exams are commonly used in classrooms, as they are easy to grade and their scoring is perceived as objective. While much has been written about the assessment function of such tests, less research has focused on the consequences of this form of testing for long-term knowledge. This gap in the literature is troubling, because the available results suggest that tests can change knowledge in addition to assessing it. The most well-known example is the testing effect, the finding that taking an initial test often increases performance on a later test (see Roediger & Karpicke, 2006a for a review).

While earlier work on testing tended to rely on simple word list stimuli, more recently the emphasis has shifted to studying the effects of testing in educationally relevant situations (Butler, Marsh, Goode, & Roediger, 2006; Marsh, Agarwal, & Roediger, 2009; Marsh, Roediger, Bjork, & Bjork, 2007; Roediger, Agarwal, Kang, & Marsh, in press; Roediger & Marsh, 2005). In the typical experiment, subjects read non-fiction passages on a variety of subjects and then take an initial multiple-choice test. A few minutes later, they take a final cued recall test that includes questions that had been tested on the prior multiple-choice test as well as new questions. Participants are more likely to answer final cued recall questions correctly if those questions had appeared on the prior multiple-choice test, thus showing the testing effect.

A second effect in this sort of experiment is more problematic: Multiple-choice testing can also have negative effects on students’ knowledge. The reason is that multiple-choice tests expose students to incorrect answers (lures) in addition to correct responses. Just as Brown (1988) and Jacoby and Hollingshead (1990) showed that exposure to incorrect spellings of words increased later misspellings, one could predict that reading lures on a multiple-choice test would increase errors on later tests. Supporting this logic, Toppino and his colleagues showed that students rated previously-read multiple-choice lures as truer than novel false facts (Toppino & Brochin, 1989; Toppino & Luipersbeck, 1993). Similarly, Roediger and Marsh (2005) found that multiple-choice testing increased the intrusion of multiple-choice lures as answers on a final general knowledge test, even though subjects were warned not to guess on that test. Consistent with an interference account, multiple-choice questions that paired the correct answer with a greater number of lures increased this negative effect of testing.

Prior work has established that multiple-choice tests can have both positive and negative consequences. But how persistent are these effects? Prior research has established that positive testing effects persist over at least a week’s delay. For example, Spitzer (1939) had 3,605 sixth-graders in Iowa read a passage on bamboo. The children were tested on the passage according to different testing schedules. In one group, children were tested on the passage immediately after reading it and again one week later. Another group was tested on the passage for the first time one week after reading. When both groups were tested one week after reading the passages, performance was much higher in the group that had been tested previously on the material than in the group being tested for the first time. In other words, the benefits of initial testing persisted over a delay of one week. Roediger and Karpicke (2006b) observed similar effects in college students. Their students read nonfiction passages; some of these were restudied during the initial session and others were tested. After two days or one week, recall of the passages was higher if they had been tested initially than if they had been restudied. To be clear, performance was always lower on delayed tests than on immediate tests, but there was less forgetting over time following testing than after equivalent time spent restudying.

The question we address in the present research is whether negative testing effects persist over a delay, similar to what occurs with positive testing effects. Butler and Roediger (2008) found that negative testing effects can be nullified if feedback is provided after the multiple-choice test. However, this step is often not taken in the classroom in order to protect items from the test bank. If negative testing effects do not persist for long after a multiple-choice test, this fact would remove concerns about negative effects of testing. On the other hand, if the negative effects do persist over time, then the implication for educators would be to include feedback with all tests.

The effects of delay are also of theoretical interest. Typically, manipulations of delay have different effects on memory errors depending on the mechanism underlying the error. Consider the standard explanation for the effects of delay in the false fame paradigm. In a prototypical experiment, subjects first study a list of famous and non-famous names. Afterwards, subjects judge the fame of each of a series of names, including new famous names, new non-famous names, and the studied non-famous names. On an immediate test, subjects are less likely to call repeatedly studied non-famous names “famous,” because they are able to recollect the source of the names’ familiarity: the earlier study phase. In contrast, if the fame judgments are delayed for a day, subjects are more likely to call repeatedly studied non-famous names “famous.” After a day, the names are still familiar but subjects are less able to recollect the source of that familiarity (Jacoby, Kelley, Brown, & Jasechko, 1989).

An increased reliance on familiarity over time (as recollection drops) is used to explain effects of delay in numerous paradigms1. For example, consider the finding that prior shallow processing of campus scenes increases subjects’ beliefs that they had visited locations that they had never actually been to (Brown & Marsh, 2008). Exposure increases a scene’s familiarity, but after a delay of one or three weeks subjects misattribute that familiarity to prior personal experience with the place. The type of familiarity proposed to underlie these results is similar to the representations that support long-term priming over months and years (e.g., Cave, 1997; Mitchell, 2006). Thus the rate of false memories is likely to remain stable over time (or even increase) if they result from a misattribution of this type of familiarity. Returning to the issue of multiple-choice tests, a previously selected multiple-choice lure may easily come to mind at test, and this retrieval ease may be misinterpreted as evidence that the answer is correct (Kelley & Lindsay, 1993) rather than attributed to the lure’s presence on the earlier test. Thus, delaying the final test may have no effect on the negative testing effect, or may even increase it. To be clear, we are not suggesting that familiarity does not decrease over time. Rather, as subjects become more reliant on familiarity, they may produce lures that they would have rejected on an immediate test (because they remembered that the answer had been presented on the multiple-choice test).

In contrast, some memory errors actually decrease over time. For example, consider what happens when people learn falsehoods from fictional stories. In this paradigm, subjects read short stories that contain statements about the world, some of which are false. Subjects intrude these story errors on later general knowledge tests even when they are warned against guessing. Suggestibility is robust on an immediate test, but reduced on a delayed test (Barber, Rajaram, & Marsh, 2008; Marsh, Meade, & Roediger, 2003). In this case, subjects learn specific falsehoods that need to be recollected, and thus delay reduces the effect. Returning to the issue of multiple-choice tests, it is possible that the negative testing effect depends upon recollection of the multiple-choice lures. If so, then delay should reduce the negative testing effect.

The prior literature allows for both possibilities. On the one hand, the effects of testing have been linked to enhanced recollection (Chan & McDermott, 2007; Karpicke, McCabe, & Roediger, 2006). Prior testing increases the number of “remember” responses on a later recognition test and process dissociation measures show that the effects of testing are primarily recollection-driven rather than familiarity-driven. From these studies, we would predict that both positive and negative testing effects would depend upon recollection, and thus should be similarly affected by delay. On the other hand, Brainerd and Reyna (1996) have shown that delay increases the likelihood that children will select a lure from a prior recognition test on a second test, suggesting a role for familiarity in this memory error. From this study, we would predict that familiarity underlies negative testing effects, and thus the rate of lure intrusions on a final test should remain constant or even increase over time.

In the experiment presented here, we asked a number of questions about how delay affects the memorial consequences of testing. All subjects visited the laboratory twice, with one week separating the two sessions. Of interest was subjects’ ability to answer questions about facts from 36 nonfiction passages on initial and delayed tests. The different delays were all manipulated within-subjects. All of the subjects took all of the tests, and across subjects the assignment of passages to testing schedules was counterbalanced as shown in Figure 1.

Figure 1.

Within-subjects design of the experiment. One-half of passages were read and one-half were not; both types of passages were rotated through the 3 testing Schedules (A, B, C) shown. Assignment of passages to reading condition (read vs. not read), passages to testing Schedule (A, B, C), and facts to multiple-choice format (not-tested versus 2, 4, or 6 alternatives) was counterbalanced across subjects.

During the first session, subjects read one half of the nonfiction passages; reading status was manipulated to ensure a wide range of performance. The goal was for some questions to be difficult because the passages had not been read (and thus potentially more likely to yield negative testing effects) and for some to be easier following passage reading (and thus more likely to be remembered correctly after a delay of one week). After the reading phase, all subjects took an initial multiple-choice test on two-thirds of the passages (see Figure 1). Each multiple-choice question paired the correct answer with one, three, or five lures; in other words, subjects answered two-, four-, and six-alternative forced-choice questions. On immediate tests, testing with additional multiple-choice lures increases the negative testing effect (Roediger & Marsh, 2005); of interest here was whether that effect persisted over a delay.

After completion of the initial multiple-choice test, all subjects completed an initial cued recall test. Critically, this test included questions on half the facts tested on the initial multiple-choice test (see Figure 1). One week later, subjects returned and took a second multiple-choice test (on the remaining 1/3 of passages that had not yet been tested on a multiple-choice test) and a final cued recall test on all items. Subjects were instructed to answer all cued recall questions, similar to how students attempt to answer all exam questions even if unsure. Because forced responding increases guessing, subjects also rated their confidence in each answer so that we could ascertain whether guessing was responsible for any negative testing effects that might be observed.

The design yielded three testing schedules, all of which have real-world parallels in educational situations. Schedule A (immediate multiple-choice and cued recall tests) mimics students’ self-quizzing immediately before an exam. Schedule B (immediate multiple-choice, delayed cued recall) is similar to when a teacher gives a quiz one week before a larger, more comprehensive test. Finally, in Schedule C (delayed multiple-choice and cued recall) students have read the material earlier and are then quizzing themselves just before the exam. It should be noted that Schedules A and C model different situations; although both involve multiple-choice testing immediately before a cued recall test, in Schedule A this testing occurs immediately after reading the passages whereas in Schedule C the testing is delayed a week after passage reading. In some ways Schedule C is the most likely scenario in the real world; students learn information but then delay self-testing and other study behaviors until immediately before the exam.

This design allowed us to answer three important questions about the persistence of positive and negative consequences of multiple-choice testing. First, what is the effect of delaying the cued recall test until a week after the initial multiple-choice test? To answer this, we compared performance on the initial cued recall test from Schedule A to the performance on the final cued recall test from Schedule B. The second question involved any effects of delaying the multiple-choice test by one week. To answer this question, we compared performance on the final cued recall test in Schedule B (following immediate multiple-choice testing) to that observed in Schedule C (following delayed multiple-choice testing). Performance should be higher on the immediate multiple-choice test, perhaps magnifying the benefits and minimizing the costs of testing. In contrast, more errors might be selected on a delayed multiple-choice test, possibly increasing the costs of testing. The final question involved whether the effects of testing persist from the first cued recall test to the final cued recall test. Do the costs and benefits of testing observed on an initial exam persist a week later? Again, the focus was on performance on the final cued recall test, but the key comparison was between the initial and final cued recall tests in Schedule A.

Method

Participants

Seventy-two Washington University undergraduates participated in the experiment, either for partial fulfillment of a course requirement or for monetary compensation.

Design

The experiment had a 2 (passage status: read or not read) × 4 (number of alternatives on the multiple-choice test: 0 [not-tested], 2, 4 or 6) × 3 (testing schedule: A, B, or C, as shown in Figure 1) design. All factors were manipulated within-subjects and counterbalanced across subjects.

Materials

We used the same non-fiction passages as did Roediger and Marsh (2005); these were selected from reading comprehension sections of TOEFL, SAT, and GRE practice test books. The passages spanned a variety of topics, including famous people (e.g., Louis Armstrong), science (e.g., the sun), history (e.g., the founding of New York City), places (e.g., Mt. Rainier), and animals (e.g., sea otters). Roediger and Marsh (2005) created four questions for each passage, each of which could be tested in any of the formats necessary for the design (2-, 4-, and 6-alternative multiple-choice, plus cued recall). The multiple-choice questions were created by generating five plausible lures for each question, and the six options (the lures plus the correct answer) were randomly ordered. Two lures were randomly removed to create each 4-alternative question; two more were randomly removed from each of those to create the 2-alternative questions. Across subjects, the four questions corresponding to each passage were rotated through the four multiple-choice conditions (0 [not-tested], 2, 4, or 6 alternatives).
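For concreteness, the Python sketch below shows one way the nested 6-, 4-, and 2-alternative versions of a question could be built by random lure removal, as described above. The item content, function name, and random seed are hypothetical illustrations, not the script used to prepare the original materials.

```python
import random

def build_question_formats(correct, lures, seed=0):
    """Construct nested 6-, 4-, and 2-alternative versions of one question.

    Follows the procedure described in the text: the correct answer and five
    plausible lures are put in a random order, then two lures are randomly
    removed to create the 4-alternative version, and two more to create the
    2-alternative version.
    """
    assert len(lures) == 5
    rng = random.Random(seed)
    six = [correct] + list(lures)
    rng.shuffle(six)

    def drop_two_lures(options):
        removable = [o for o in options if o != correct]
        dropped = set(rng.sample(removable, 2))
        return [o for o in options if o not in dropped]

    four = drop_two_lures(six)
    two = drop_two_lures(four)
    return {6: six, 4: four, 2: two}

# Hypothetical item: correct answer plus five plausible lures.
formats = build_question_formats(
    correct="Louis Armstrong",
    lures=["Duke Ellington", "Miles Davis", "Charlie Parker",
           "Dizzy Gillespie", "John Coltrane"])
print(formats[2])  # two alternatives: the correct answer plus one retained lure
```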

The 36 passages were divided into two sets to allow counterbalancing of reading status; each reading set was further subdivided into three groupings to allow counterbalancing of testing schedules. Thus there were six groups of six passages; texts on similar subjects (e.g., the ozone layer and the sun) were placed in different groups. Half the subjects read the passages in Set 1; the other half read the passages in Set 2. Therefore, each subject read only half the passages but was tested on all 36. Across subjects, both read and nonread passages were rotated through the three testing Schedules (A, B and C, as depicted in Figure 1). All items were included on the final cued recall test; we manipulated which passages were tested (and in what format) prior to that final test. For one set of passages, Schedule A, subjects took the multiple-choice test and a cued recall test in Session 1 (as well as the final cued recall test on all items in Session 2). For a second set of passages, Schedule B, subjects took the multiple-choice test in Session 1, but did not take a cued recall test until Session 2. For the third set of passages, Schedule C, the items were not tested in Session 1. Rather, the multiple-choice test was administered in Session 2, prior to the final cued recall test.
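Likewise, the following Python sketch shows one rotation scheme consistent with the counterbalancing just described. The passage names, the mapping from counterbalancing version to rotation, and all identifiers are assumptions made only for illustration; the exact version assignments are not reported here.

```python
from typing import Dict, List, Tuple

SCHEDULES = ["A", "B", "C"]
MC_FORMATS = [0, 2, 4, 6]  # 0 = not tested on the initial multiple-choice test

def assignment(version: int, groups: List[List[str]]
               ) -> Tuple[List[str], Dict[str, str], Dict[int, int]]:
    """Return (read passages, passage -> schedule, question index -> MC format)
    for one hypothetical counterbalancing version.

    'groups' holds six lists of six passage names.  Half the subjects read the
    first three groups, half the last three; within each half the groups rotate
    through Schedules A, B, and C; and the four questions per passage rotate
    through the four multiple-choice formats.
    """
    read_groups = groups[:3] if version % 2 == 0 else groups[3:]
    read_passages = [p for g in read_groups for p in g]

    schedule_of = {}
    shift = version % 3
    for half in (groups[:3], groups[3:]):
        for i, group in enumerate(half):
            for passage in group:
                schedule_of[passage] = SCHEDULES[(i + shift) % 3]

    fmt_shift = version % 4
    format_of_question = {q: MC_FORMATS[(q + fmt_shift) % 4] for q in range(4)}
    return read_passages, schedule_of, format_of_question

# Placeholder passage names; the real sets kept texts on similar topics apart.
groups = [[f"passage_{g}_{i}" for i in range(6)] for g in range(6)]
read, schedules, formats = assignment(version=5, groups=groups)
```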

The first multiple-choice test contained 96 questions: 24 fillers and 72 critical questions (half corresponding to read passages). The fillers were questions from the Nelson and Narens (1980) norms and were used to provide separation between questions from the same passage (fillers were used for this purpose on the other multiple-choice and cued recall tests, too). There were 12 different versions of this test, so that across subjects all items appeared in all four multiple-choice formats (not-tested, 2, 4, or 6 alternatives) and all passages were sometimes tested in this immediate multiple-choice condition.

The first cued recall test contained 72 questions: 24 fillers and 48 critical items (half from read passages and half from nonread passages). Each question was followed by a space for writing the answer and a box for recording confidence. Confidence was rated on a 4-point scale ranging from not sure to very sure. There were three versions of this test, so that across subjects all passages were sometimes tested on this test.

The second multiple-choice test contained 48 questions: 12 fillers and 36 critical questions (18 from studied passages). As with the first multiple-choice test, 12 versions were needed for counterbalancing purposes.

All participants took the same final cued recall test. This test contained 144 critical questions and 72 fillers, for a total of 216 questions. Each question was followed by a space for writing the answer and a box for recording confidence. Confidence was rated on the same 4-point scale as was used on the first cued recall test. All tests were in paper-and-pencil format.

Procedure

The experiment consisted of two sessions, separated by one week. In the first session, subjects read 18 of the 36 passages. The amount of time allotted to each passage was determined in pre-testing and varied because the passages differed in length; on average, subjects were given up to 90 s to read each passage. The goal was for all subjects to finish reading each passage once. Subjects were given a sheet on which they indicated when they had completed reading the passage; the experimenter monitored the subjects for completion and moved subjects to the next passage when all had finished reading.

Immediately after reading the passages, the first multiple-choice test was administered. The experimenter read the instructions aloud to the subjects, telling them they were going to take a multiple-choice test, with no mention of the prior reading phase. They were told: “You must answer each and every question. You will not know the answers to all of the questions. That’s okay. If you have to, just guess. Sometimes a question will have two possible answers, sometimes four, and sometimes six. For each question, read the question carefully, read all the possible answers, and then circle the best answer. Again, you should answer all of the questions even if you have to guess. We would like you to answer the questions in the order in which they appear. Do not go back and change your answers. Rather, read each question and its answers once, and simply select the best possible answer and move on to the next question.” Subjects were told they would receive up to 14 minutes for completion of the test, and that they would be given verbal warnings about how time was passing. Pre-testing determined that this amount of time would be more than enough for subjects to finish the test. Those who finished early were instructed to turn over their tests and wait quietly for the next set of instructions. All subjects then worked on a spatial filler task for five minutes.

After the filler task, subjects had up to 12 minutes to complete the first cued recall test. The experimenter read the following instructions aloud to subjects: “You will now take a second general knowledge test. This time, the questions are open-ended. So, you will read each question and write down your answer. Again, we would like you to answer all of the questions even though some of them are very difficult. Please write an answer for each and every one even if you have to guess. Again, answer the questions in the order in which they appear, and do not go back and change your answers. For each answer, please rate how sure you are that you are correct, using the following scale: 1 = very sure, 2 = sure, 3 = somewhat sure, and 4 = not sure. Please write the appropriate number in the box labeled confidence rating, next to the blank on which you’ll write your answer.” Subjects were informed the test had 72 questions and that they would be given 12 minutes to complete the test; pre-testing had established that this was more than enough time for subjects to complete the test. Subjects followed the instructions, answering an average of 98% of the cued recall questions.

One week later, subjects returned to the lab for Session 2. The session began with the second multiple-choice test, which was prefaced with the same instructions as the first test. Subjects were given up to 7 minutes for completing this test (again, this time was determined through pre-testing). Following the multiple-choice test, all subjects worked on a spatial filler task for five minutes. After the filler task, all subjects took the final cued recall test, and rated their confidence in each answer using the same 4-point scale as on the earlier cued recall test. Subjects were given up to 35 minutes to complete the final test, with the same instructions as used on the first cued recall test. No reference was made to the reading phase, or to the earlier tests. No subjects had difficulty in completing any of the tests in the time allotted. As with the first cued recall test, subjects followed the instructions, answering an average of 98% of the cued recall questions.

Results

All results were significant at the .05 level unless otherwise noted.

Performance on the Multiple-Choice Tests

The data from the multiple-choice tests are shown in Table 1. Subjects correctly answered more multiple-choice questions when they had read the passages containing the tested facts (M = .71) than when they had not read the relevant passages (M = .52), F(1, 71) = 242.38, MSE = .03, ηp2 = .77. In addition, as the number of multiple-choice alternatives increased, subjects were less likely to answer the multiple-choice question correctly, F(2, 142) = 186.65, MSE = .02, ηp2 = .72. This effect was larger when subjects had not read the passages containing the facts. When subjects had not read the passages, performance decreased from .68 when choosing between two alternatives, to .48 with four alternatives, to only .40 with six alternatives. In other words, when the passages had not been read, performance dropped 28% when the alternatives were increased from two to six, as compared to the smaller drop of 19% when subjects had read the passages. This interaction between reading status and number of alternatives was significant, F(2, 142) = 6.02, MSE = .03, ηp2 = .08. To be clear, the number of alternatives affected performance for both read, F(2, 142) = 65.01, MSE = .01, ηp2 = .48, and nonread passages, F(2, 142) = 103.90, MSE = .01, ηp2 = .59, but this effect was larger when subjects had not read the passages.

Table 1.

Proportion correct on the multiple-choice tests, as a function of timing of the multiple-choice (MC) test, number of MC alternatives, and the reading status of the passages.

                   Number of MC Alternatives
MC Test Timing      Two     Four    Six      M
Immediate
  Read              .86     .78     .72     .79
  Not Read          .68     .48     .41     .52
Delayed
  Read              .76     .60     .52     .63
  Not Read          .68     .49     .38     .52
M                   .75     .59     .51

As expected, subjects’ ability to correctly answer multiple-choice questions depended on the timing of the test. Subjects answered more questions correctly on the immediate multiple-choice test (M = .66) than on the delayed test (M = .57), F(1, 71) = 49.08, MSE = .03, ηp2 = .41. Delay interacted with only one variable, reading status, F(1, 71) = 36.91, MSE = .04, ηp2 = .34, which confirms the obvious point that when subjects had not read the passages, the delay between reading and testing did not affect performance (M = .52 for both tests). The advantage gained from reading the passages was reduced after a week’s delay (Ms = .79 and .63 on the immediate and delayed tests, respectively), the usual finding of forgetting over time.
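To make the form of these analyses concrete, the Python sketch below runs a 2 (reading status) × 3 (number of alternatives) repeated-measures ANOVA of the kind reported above, using the statsmodels AnovaRM class on a synthetic long-format data frame. The column names and values are placeholders, not the actual data.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Synthetic long-format data standing in for the scored multiple-choice tests:
# one row per subject x reading status x number of alternatives, holding the
# proportion of questions answered correctly (random placeholders, not real data).
rng = np.random.default_rng(0)
rows = [{"subject": s, "read": read, "n_alternatives": n_alt,
         "prop_correct": rng.uniform(0.3, 0.9)}
        for s in range(1, 73)                 # 72 subjects, as in the experiment
        for read in ("read", "not_read")
        for n_alt in (2, 4, 6)]
df = pd.DataFrame(rows)

# 2 (reading status) x 3 (number of alternatives) repeated-measures ANOVA,
# the same form as the analysis reported for Table 1.
result = AnovaRM(df, depvar="prop_correct", subject="subject",
                 within=["read", "n_alternatives"]).fit()
print(result)
```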

Performance on the Cued Recall Tests

The design allowed us to answer a number of questions about how multiple-choice testing affects later cued recall performance. Rather than analyzing all conditions together, we made the comparisons necessary to answer questions of interest. We begin with an analysis of performance on the initial cued recall test in Schedule A (following immediate multiple-choice testing). This condition served as the control for most of the questions of interest, and also extends Roediger and Marsh (2005) from a cued recall test with a warning against guessing to a cued recall test with forced responding and confidence ratings.

Immediate Cued Recall: An extension of positive and negative consequences of testing

As in Roediger and Marsh (2005), the number of prior multiple-choice alternatives had two separate, opposite effects on an immediate cued recall test. These data are shown in the top panels of Tables 2 (proportion of cued recall questions answered correctly) and 3 (proportion of cued recall questions answered with multiple-choice lures).

Table 2.

Proportion of cued recall questions answered correctly, as a function of passage reading, number of alternatives on the prior multiple-choice test, and timing of tests.

                                             Number of Prior MC Alternatives
                                             Zero (not tested)   Two    Four   Six     M
Immediate Cued Recall, MC was in Session 1
  Read                                             .48           .76    .71    .65    .65
  Not Read                                         .21           .54    .39    .39    .38
  M                                                .35           .65    .55    .52
Delayed Cued Recall, MC was in Session 1
  Read                                             .30           .45    .47    .49    .43
  Not Read                                         .23           .33    .29    .28    .28
  M                                                .27           .39    .38    .38
Delayed Cued Recall, MC was in Session 2
  Read                                             .31           .63    .52    .45    .48
  Not Read                                         .23           .51    .39    .33    .37
  M                                                .27           .57    .46    .39
Delayed Cued Recall, CR and MC in Session 1
  Read                                             .37           .59    .58    .57    .53
  Not Read                                         .24           .38    .33    .35    .33
  M                                                .31           .49    .46    .46

First, testing benefited later memory: Subjects correctly answered a greater proportion of cued recall questions if they had been tested previously on the multiple-choice test (M = .57) than if not (M = .35), F(1, 71) = 161.24, MSE = .02, ηp2 = .69. As on the initial multiple-choice test, subjects answered more cued recall questions correctly if they had read the relevant passages, F(1, 71) = 190.13, MSE = .03, ηp2 = .73. Reading status did not interact with testing (F < 1); the benefits of testing were equally strong for questions corresponding to read and not read passages.

However, not all forms of prior testing were equal. That is, prior testing with two alternatives led to 65% correct on the cued recall test; this dropped to 55% following testing with four alternatives and 52% with six alternatives. This effect of number of prior multiple-choice alternatives was significant even when never-tested items were removed from the analysis, F(2, 142) = 16.41, MSE = .04, ηp2 = .19, and there was no interaction between passage reading and number of prior alternatives, F(2, 142) = 2.29, MSE = .04, p > .10.

A second negative consequence of testing was the intrusion of multiple-choice lures as answers on the immediate cued recall test; the relevant data are shown in the top panel of Table 3. That is, we scored whether each answer was one of the 5 possible multiple-choice lures for that item. Subjects were more likely to produce multiple-choice lures when they had not read the relevant passages (M = .30) than after reading the passages (M = .15), F(1, 71) = 135.10, MSE = .03, ηp2 = .66. Most importantly, the number of prior multiple-choice alternatives (0 [not tested], 2, 4, 6) affected the rate of lure intrusions on the cued recall test, F(3, 213) = 23.29, MSE = .03, ηp2 = .25. Multiple-choice lure intrusions increased linearly with number of prior alternatives for both read, F(1, 71) = 10.56, MSE = .03, ηp2 = .13, and nonread passages, F(1, 71) = 60.49, MSE = .04, ηp2 = .46, but the pattern was stronger for nonread passages. In other words, the interaction between number of prior alternatives and reading status was significant, F(3, 213) = 8.86, MSE = .03, ηp2 = .11. For nonread passages, multiple-choice lure intrusions increased from .19 without testing to .40 after testing with six alternatives, an increase of 21%, t(71) = 6.79, SEM = .03. In contrast, after reading the relevant passages, lure intrusions increased from .13 with zero alternatives to .21 with six prior alternatives: a smaller but still significant increase of 8%, t(71) = 2.83, SEM = .03.
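As an illustration of this scoring step, the minimal Python sketch below classifies a written answer as correct, a multiple-choice lure intrusion, or other, by matching it against the item's correct answer and its five possible lures. The matching rule and item content are simplified assumptions; scoring of handwritten answers would need to be more lenient than exact string matching.

```python
def score_answer(answer, correct, lures):
    """Classify one cued recall answer as 'correct', a 'lure_intrusion'
    (one of the item's five possible multiple-choice lures), or 'other'.
    Matching is simplified to case-insensitive string equality.
    """
    norm = answer.strip().lower()
    if norm == correct.strip().lower():
        return "correct"
    if norm in {lure.strip().lower() for lure in lures}:
        return "lure_intrusion"
    return "other"

# Hypothetical item: the written answer matches one of the five lures.
print(score_answer("Duke Ellington", "Louis Armstrong",
                   ["Duke Ellington", "Miles Davis", "Charlie Parker",
                    "Dizzy Gillespie", "John Coltrane"]))  # -> lure_intrusion
```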

Table 3.

Proportion of cued recall questions answered with multiple-choice lures, as a function of passage reading, number of alternatives on the prior multiple-choice test, and timing of tests.

                                             Number of Prior MC Alternatives
                                             Zero (not tested)   Two    Four   Six     M
Immediate Cued Recall, MC was in Session 1
  Read                                             .13           .09    .14    .21    .15
  Not Read                                         .19           .23    .40    .40    .30
  M                                                .16           .16    .27    .30
Delayed Cued Recall, MC was in Session 1
  Read                                             .17           .14    .15    .14    .15
  Not Read                                         .18           .21    .26    .30    .24
  M                                                .18           .18    .21    .22
Delayed Cued Recall, MC was in Session 2
  Read                                             .19           .18    .30    .38    .26
  Not Read                                         .19           .25    .36    .44    .31
  M                                                .19           .21    .33    .41
Delayed Cued Recall, CR and MC in Session 1
  Read                                             .17           .12    .15    .19    .16
  Not Read                                         .18           .21    .29    .32    .25
  M                                                .18           .16    .22    .26

As described earlier, subjects rated their confidence in their cued recall answers. These confidence ratings were used to assess the role of guessing in the negative testing effect. Critically, a similar pattern occurred when the lowest confidence (“not sure”) responses were removed from the analyses. Subjects produced more multiple-choice lure intrusions when the passages were nonread as compared to read, F(1, 71) = 19.33, MSE = .02, ηp2 = .21, and the number of prior multiple-choice alternatives affected production of lure answers, F(3, 213) = 11.65, MSE = .02, ηp2 = .14. As in the analysis with all answers, multiple-choice lure intrusions increased linearly with number of previously read alternatives [0 (not-tested), 2, 4, 6] for questions that referred to both read, F(1, 71) = 4.98, MSE = .02, ηp2 = .07, and nonread passages, F(1, 71) = 28.45, MSE = .02, ηp2 = .29. Again the increase in lure production was larger for nonread passages, leading to an interaction between the number of multiple-choice alternatives and prior reading, F(3, 213) = 6.75, MSE = .02, ηp2 = .09.

Finally, we examined the persistence of errors made on the multiple-choice test. That is, given that a lure was selected on the multiple-choice test, how likely was it that a lure was produced on the cued recall test? This analysis includes all lures produced on the final test (as opposed to requiring it to be the same lure as selected on the multiple-choice test), because prior work has shown that almost all lures produced on the final test match earlier selections (e.g., Roediger & Marsh, 2005; Marsh, Agarwal, & Roediger, 2009). Following the selection of a multiple-choice lure, 65% of the corresponding cued recall questions were answered with multiple-choice lures. In later sections we use this number as a base rate to examine the effects of delay on the persistence of errors.
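This conditional measure is a simple proportion; the Python sketch below computes it from per-item records. The field names and toy data are hypothetical, chosen only to show the calculation.

```python
def persistence_rate(items):
    """Proportion of cued recall questions answered with a multiple-choice
    lure, conditional on a lure having been selected for that item on the
    earlier multiple-choice test."""
    conditioned = [it for it in items if it["mc_lure_selected"]]
    if not conditioned:
        return float("nan")
    return sum(it["cr_answered_with_lure"] for it in conditioned) / len(conditioned)

# Toy records: a lure was chosen on the MC test for four items, and three of
# those items were later answered with a lure on cued recall -> rate of .75.
toy = [{"mc_lure_selected": True,  "cr_answered_with_lure": True},
       {"mc_lure_selected": True,  "cr_answered_with_lure": True},
       {"mc_lure_selected": True,  "cr_answered_with_lure": False},
       {"mc_lure_selected": True,  "cr_answered_with_lure": True},
       {"mc_lure_selected": False, "cr_answered_with_lure": False}]
print(persistence_rate(toy))  # 0.75
```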

In short, similar to Roediger and Marsh (2005), multiple-choice testing led to benefits on a cued recall test a few minutes later (a positive testing effect), and these benefits were reduced if the prior multiple-choice test had paired the correct answer with additional alternatives. Subjects were also more likely to answer cued recall questions with multiple-choice lures following testing with additional multiple-choice alternatives, especially for nonread passages. In addition, the negative testing effect was not due to guessing on the cued recall test: the effect persisted even after guesses were removed from the analyses.

Did delaying the cued recall test change the impact of the initial multiple-choice test?

To isolate the effects of delaying the cued recall test, we compared performance on passage facts tested on the initial cued recall test in Schedule A to performance on the final cued recall test in Schedule B. In this comparison, the multiple-choice test always immediately followed the reading period, and the cued recall test occurred either immediately or one week after the multiple-choice test. The immediate condition is the one reported in the last section (the top panel in Tables 2 and 3) and the delayed condition is reported in the 2nd panel of Tables 2 and 3.

We begin with an analysis of correct answers on the cued recall test, as shown in Table 2. Not surprisingly, delaying the cued recall test led to lower performance than was observed on the immediate cued recall test, F(1, 71) = 109.86, MSE = .02, ηp2 = .61. Delaying the test also reduced the effects of having read the passages, F(1, 71) = 31.71, MSE = .02, ηp2 = .31.

Of particular interest was whether delaying the cued recall test changed the effects of prior testing. Interestingly, delaying the final test led to a reduction in the positive testing effect, F(1, 71) = 23.68, MSE = .02, ηp2 = .25. As reported already, on the immediate test prior testing increased the proportion of cued recall questions answered correctly to .57, relative to .35 in the non-tested condition. When the final test was delayed, prior testing only increased the proportion of questions answered correctly from .27 (non-tested) to .38 (tested). This eleven percent difference was significant, t(71) = 7.06, SEM = .02, but it was smaller than the increase from testing observed on the initial test (M = .22), t(71) = 12.70, SEM = .02. There was also a marginally significant three-way interaction between passage reading, prior testing, and delay, F(1, 71) = 3.67, MSE = .02, ηp2 = .05, p = .06. On the immediate test, the testing effect was similar for read and nonread passages (a benefit of 23% for previously tested items). However, delay reduced the testing effect more for nonread passages than for read passages. After a delay, the difference between tested and non-tested items was 17% for read passages but only 7% for nonread passages. Having read the passages helped protect the benefits of testing over the delay.

Next, we examined if delaying the final cued recall test had consequences for the negative testing effect. Two analyses are relevant to this question. First is whether the positive testing effect was smaller following testing with additional lures. The second analysis involves the proportion of cued recall questions answered with multiple-choice lures.

Looking only at performance on the delayed cued recall test (the 2nd panel in Table 2), the effect of number of prior multiple-choice alternatives on correct recall disappeared, F < 1. The proportion of correct cued recall answers remained essentially constant (.39, .38, and .38) following testing with 2, 4, or 6 alternatives. There was a hint that the number of prior multiple-choice alternatives had different effects on correct answers for read passages (actually increasing performance following testing with more alternatives) than for nonread passages (where performance decreased following testing with more alternatives), but the interaction failed to reach significance, F(2, 142) = 2.52, MSE = .03, p = .08, ηp2 = .03. Overall, performance on the delayed test differed from that observed on the immediate test (where cued recall performance decreased when the number of prior alternatives increased from two to six, for both read and nonread passages). The different patterns on the immediate and delayed tests led to an interaction between the number of prior alternatives and delay, F(2, 142) = 7.72, MSE = .04, ηp2 = .10. The three-way interaction between delay, reading status, and number of prior multiple-choice alternatives was not significant, F(2, 142) = 1.45, MSE = .03, p = .24.

Second, did subjects still answer the cued recall questions with multiple-choice lures if the final test was delayed for one week? The answer is yes; an analysis of the delayed cued recall test revealed that multiple-choice lure intrusions increased linearly with number of previously read alternatives, F(1, 71) = 6.71, MSE = .03, ηp2 = .09. This increase, however, was smaller when the cued recall test was delayed by one week, as reflected by an interaction between delay and number of prior alternatives, F(3, 213) = 5.22, MSE = .03, ηp2 = .07. Lure production increased from .16 with zero alternatives (not-tested items) to .30 with six alternatives on the immediate test, a difference of 14%, t(71) = 6.64, SEM = .02. On the delayed test, lure production increased from .18 with zero alternatives to .22 with six alternatives, a difference of only 4%, but this difference was still significant, t(71) = 2.17, SEM = .02. Lure production remained stable over the delay for questions referring to read passages, but decreased over time for questions referring to nonread passages. This led to an interaction between reading status and delay, F(1, 71) = 14.08, MSE = .03, ηp2 = .17. The three-way interaction between delay, reading status, and number of multiple-choice alternatives was not significant, F(3, 213) = 1.85, MSE = .03, p = .14.

Similar negative testing effects were observed after the lowest confidence responses were removed from the analyses. Paralleling the main analyses, there was an interaction between delay and number of prior multiple-choice alternatives after guesses were removed, F(3, 213) = 3.89, MSE = .02, ηp2 = .05. Multiple-choice lure intrusions increased from .08 with zero prior alternatives (not-tested) to .15 following six alternatives on the immediate test, an increase of 7%, t(71) = 4.15, SEM = .02. The increase in multiple-choice lure intrusions was smaller but still significant on the delayed test. Lure intrusions increased from .06 for not-tested items to .08 for questions previously tested with six alternatives, t(71) = 2.06, SEM = .01. Again, delay reduced lure intrusions for nonread passages, whereas the overall rate of lure intrusions did not change over time for read passages, resulting in an interaction between delay and reading status, F(1, 71) = 5.36, MSE = .02, ηp2 = .07.

Finally, we examined whether delay affected the persistence of errors made on the multiple-choice test. Of interest was whether a cued recall question would be answered with one of the multiple-choice lures, given that an error was made on the parallel multiple-choice question. Critically, delay reduced the likelihood that a multiple-choice error led to a lure intrusion on the final test. Sixty-five percent of the initial multiple-choice errors led to lure intrusions on the immediate cued recall test, whereas only 36% of the multiple-choice errors led to lure intrusions on the cued recall test after one week, t(71) = 10.70, SEM = .03.

In summary, delaying the cued recall test reduced both the positive and negative effects of testing. Prior testing increased later production of correct answers on both the immediate and delayed tests, but the increase was smaller when the tests were separated by one week. Delay reduced both negative consequences of testing. First, after a delay, the number of prior multiple-choice alternatives no longer affected correct answers on the cued recall test. The positive testing effect was similar following testing with two, four, or six prior alternatives. Second, delaying the cued recall test also reduced the intrusion of multiple-choice lures, although this negative testing effect was not eliminated.

Did delaying the initial multiple-choice test change its impact on the final cued recall test?

To isolate the effects of the timing of the initial multiple-choice test, this analysis was limited to performance on the final cued recall test. We compared performance on the final test as a function of whether passages were assigned to the immediate multiple-choice testing condition (Schedule B in Figure 1) or the delayed multiple-choice testing condition (Schedule C in Figure 1). Thus, the delay between study and the final cued recall test was constant in the two groups; only the placement of the multiple-choice test varied.

A comparison of the 2nd and 3rd panels of Table 2 reveals that the positive testing effect was larger when the multiple-choice test occurred in the 2nd session, immediately before the cued recall test rather than a week earlier, F(1, 71) = 10.90, MSE = .02, ηp2 = .13. When both tests occurred in the 2nd session (panel 3), cued recall performance was much better for previously tested items (M = .47) as compared to previously untested items (M = .27), t(71) = 11.51, SEM = .02. When the multiple-choice test had occurred a week earlier (panel 2), subjects still correctly answered more cued recall questions from passages that had been tested previously (M = .38) than from non-tested passages (M = .27), t(71) = 7.06, SEM = .02. However, this testing effect was reduced relative to the testing effect observed when both the multiple-choice and the cued recall test were delayed.

The timing of the multiple-choice test also affected whether or not all forms of testing were equivalent. When both the multiple-choice and cued recall tests occurred during the 2nd session, performance decreased from .57 to .46 to .39 as the number of prior alternatives increased from 2 to 4 to 6, F(2, 142) = 30.17, MSE = .04, ηp2 = .30. As reported in the previous section, when the multiple-choice and cued recall tests occurred in different sessions, there was no effect of number of prior multiple-choice alternatives (2 vs. 4 vs. 6) on cued recall performance. These two different patterns led to an interaction between timing of the multiple-choice test and number of prior multiple-choice alternatives, F(2, 142) = 15.70, MSE = .04, ηp2 = .18. The three-way interaction between timing of the multiple-choice test, number of prior multiple-choice alternatives, and reading status was not significant, F(2, 142) = 1.51, MSE = .03, p = .22.

The timing of the multiple-choice test significantly affected the production of multiple-choice lure intrusions on the final cued recall test. These data appear in the 2nd and 3rd panels of Table 3. The effect of testing with additional multiple-choice alternatives was larger when the two tests occurred in the same session, as reflected in an interaction between delay and number of prior alternatives, F(3, 213) = 13.77, MSE = .03, ηp2 = .16. When facts were tested twice in the 2nd session, multiple-choice lure answers increased from .19 for not-tested items to .41 for items tested with six alternatives, t(71) = 10.16, SEM = .02. In contrast, when the multiple-choice test had occurred a week earlier, multiple-choice lure answers on the final test showed a smaller (but still significant) increase to .22 after testing with six alternatives (as compared to a baseline of .18), t(71) = 2.17, SEM = .02. Delaying the multiple-choice test to the second session also reduced the benefits of having read the passages. When the multiple-choice test occurred just before the cued recall test, lure production was high and reading provided less protection against the negative testing effect, F(1, 71) = 4.85, MSE = .03, ηp2 = .06.

The timing of the multiple-choice test still affected the negative testing effect after guesses were removed from the analysis. Multiple-choice lure answers increased 10% with increasing alternatives when both tests occurred in the second session, as compared to 2% when the multiple-choice test occurred a week earlier, F(3, 213) = 4.90, MSE = .02, ηp2 = .07. After removing guesses, the interaction between delay and reading status was no longer significant, F < 1; questions referring to read and nonread passages showed similar patterns of lure production at both delays.

Finally, we examined the proportion of multiple-choice errors that persisted onto the final cued recall test. That is, given the selection of a multiple-choice lure, how likely were subjects to produce a multiple-choice lure on the corresponding final cued recall question? More errors persisted when the two tests were held in the same session (M = .48) than when the tests occurred a week apart (M = .36), t(71) = 4.73, SEM = .02.

In summary, delaying the multiple-choice test increased both its positive and negative effects on the final cued recall test. Prior testing increased correct answers in both conditions, but especially when the multiple-choice test was close in time to the final test. The delayed multiple-choice test also led to greater intrusions of multiple-choice lures on the final test, as reflected in the higher persistence rate.

Did testing effects observed on the immediate cued recall test persist to the delayed cued recall test?

To examine whether testing effects observed on an immediate cued recall test persisted over a one-week delay, we compared performance on the initial cued recall test (following multiple-choice testing) to performance on those same items on the final cued recall test. Referring to Figure 1, we compared performance on the initial and final cued recall tests for Schedule A.

The positive testing effect observed on the initial cued recall test (as shown in the top panel of Table 2) was retained on the delayed cued recall test (as shown in the bottom panel of Table 2). On the final test, subjects correctly answered 47% of items that had been tested on both the multiple-choice and cued recall tests in Session 1. This was significantly above the baseline of 31% for items that had been tested on the initial cued recall test but not on the initial multiple-choice test, t(71) = 8.86, SEM = .02. However, this testing effect was significantly smaller than the one observed on the immediate cued recall test, where performance increased from .35 to .57, leading to an interaction between testing and timing of the cued recall test, F(1, 71) = 18.90, MSE = .01, ηp2 = .21. There was also a three-way interaction between passage reading, prior testing, and delay, F(1, 71) = 10.83, MSE = .01, ηp2 = .13. Delay only affected the testing effect for nonread passages. For read passages, testing boosted performance by 23% on the immediate test and 21% on the delayed test. In contrast, for nonread passages, testing boosted performance by 23% on the immediate test, but this dropped to 11% on the final test.

On the final test, all forms of prior multiple-choice testing led to similar levels of correct responding. Although the number of prior multiple-choice alternatives had an effect on correct answers on the immediate cued recall test, this effect did not appear when the same questions were asked again on the 2nd cued recall test (leading to an interaction between delay and number of prior alternatives, F(2, 142) = 14.04, MSE = .01, ηp2 = .17). On the 1st test, as described earlier, correct answers declined when subjects had been tested with more multiple-choice alternatives, from .65 to .52. On the 2nd test, however, performance dropped from .49 to .46, and a linear trend analysis on these delayed data was not significant, F(1, 71) = 1.39, MSE = .04, p = .24.

However, as shown in Table 3, the pattern of lure intrusions seen on the 1st cued recall test also appeared on the final test, albeit to a lesser extent. An examination of the final test revealed that multiple-choice lure intrusions increased linearly with number of prior multiple-choice alternatives, F(1, 71) = 19.55, MSE = .03, ηp2 = .22. However, this pattern was not as strong as that observed on the first cued recall test, leading to an interaction between delay and number of prior alternatives, F(3, 213) = 6.54, MSE = .01, ηp2 =.08. On the initial cued recall test lure intrusions increased from .16 with zero alternatives (not-tested) to .30 after prior multiple-choice testing with six alternatives, t(71) = 6.64, SEM = .02. This difference was significant but smaller on the second cued recall test. Lure intrusions increased from .18 following zero alternatives to .26 following six alternatives, t(71) = 3.85, SEM = .02. Lure production dropped more over the delay for nonread passages as compared to read passages. This led to an interaction between delay and reading status, F(1, 71) = 31.14, MSE = .01, ηp2 = .31.

Removing guesses from the analyses did not change the conclusions about the persistence over one week of the negative testing effects observed on the initial cued recall test. Excluding guesses, intrusions increased from .08 to .15 on the 1st cued recall test and from .07 to .11 on the 2nd cued recall test, meaning that the number of prior alternatives and delay interacted, F(3, 213) = 3.67, MSE = .01, ηp2 = .05. Again, questions referring to nonread passages showed larger decreases in lure production over the delay, but the interaction between delay and reading status was now only marginally significant, F(1, 71) = 3.69, MSE = .01, p = .06, ηp2 = .05.

Finally, we examined whether errors on the initial multiple-choice test were associated with errors on the cued recall tests. Errors on the multiple-choice test were more likely to lead to errors on the immediate cued recall test (M = .65) than on the delayed cued recall test (M = .48), t(71) = 9.05, SEM = .02. In other words, some of the errors that were repeated on the first cued recall test were forgotten by the final test.

In short, when the same questions were asked on immediate and delayed cued recall tests, similar effects of prior multiple-choice testing were observed on the two tests, although the effects were reduced on the delayed test.

Discussion

The first contribution of this experiment was to extend Roediger and Marsh’s (2005) finding of positive and negative testing effects to a test with forced responding. Whereas Roediger and Marsh (2005) instructed subjects not to guess on the final cued recall test and to answer only those questions to which they knew the answer, we instructed subjects to answer every question, even if they had to guess. This instruction is much more similar to what occurs in educational situations. Given that most instructors do not penalize students for guessing, there is a strong incentive for students to answer every question even if they have to guess. We thought the results might change with the new instructions, with the possibility that allowing guesses would increase the negative effects of testing.

On the whole our results were similar to those found by Roediger and Marsh (2005). On an immediate cued recall test there was a positive testing effect; subjects were more likely to answer cued recall questions correctly if those questions had occurred on the multiple-choice test. This positive effect of testing decreased following exposure to additional lures on the prior multiple-choice test. Having read additional lures also increased the likelihood that cued recall questions would be answered with multiple-choice lures. All of these results nicely parallel those of Roediger and Marsh (2005). One difference involves the overall rate of lure intrusions, which was much higher in the present experiment (M = .22) than in Roediger and Marsh (M = .09). However, because this increase was also observed in the baseline (not tested) condition, it does not change the conclusions. The only substantive difference between the two experiments involved the effects of having read the passages. In the current study, the negative testing effect was reduced following passage reading. This pattern is numerically similar to that of Roediger and Marsh, although the interaction between reading status and number of prior alternatives did not reach significance in their study. In general, passage reading protects against the negative effects of multiple-choice testing. When students are well prepared for the multiple-choice test, the negative effects of testing are reduced. Interestingly, the current experiment shows that if passage reading and the multiple-choice test are separated by one week (Schedule C), then reading no longer protects against lure intrusions.

The second, larger, contribution of this experiment was to examine the effects of delay on positive and negative testing effects. We asked three main questions. First, does taking a multiple-choice test still yield positive and negative testing effects if the final cued recall test is delayed one week? Second, does delaying the multiple-choice test (to a week after reading) change its impact on the final cued recall test? Third, do the testing effects observed on an initial cued recall test appear on a final cued recall test a week later? We discuss the answers to these questions below, before turning to a more general discussion of the experiment.

To guide this discussion, Figures 2 and 3 show a summary of the effects of delay on the positive and negative effects of prior testing. For the purposes of the figures, we collapsed across read and nonread passages. These simplified figures highlight the most important findings to be discussed below. To preview, all positive and negative testing effects were significant, but the size of the effects differed dramatically across conditions.

Figure 2.

The positive effects of prior multiple-choice testing on later cued recall performance, as a function of test timing.

Figure 3.

The negative effect of prior multiple-choice testing on later cued recall performance, as a function of test timing.

Delayed Effects of Immediate Multiple-choice Testing

To determine whether a multiple-choice test still affected later responding after a delay, we compared performance on the initial cued recall test in Schedule A to performance on the final cued recall test in Schedule B (see Figure 1). That is, we examined performance on the cued recall test as a function of whether it was taken immediately or one week after the multiple-choice test. To summarize across our different dependent measures (some of which excluded guesses), positive testing effects were reduced but still present when the cued recall test occurred one week after the initial session. On both immediate and delayed cued recall tests, subjects correctly answered more questions if they had been previously tested on the multiple-choice test. This positive testing effect was larger, however, when the cued recall test immediately followed the multiple-choice test. In addition, increasing numbers of multiple-choice alternatives decreased performance on the immediate cued recall test, but had no effect after one week. Likewise, the negative effects of testing were reduced over the delay, but still occurred. Taking a multiple-choice test increased production of multiple-choice lures on the final cued recall test, especially after reading more multiple-choice alternatives. Although these effects persisted over the delay, they were reduced.

Effects of Delaying the Multiple-choice Test

To determine the effects of delaying the multiple-choice test for one week, we compared performance on the final cued recall test in Schedules B and C (see Figure 1). That is, we compared performance on the final cued recall test as a function of whether subjects had taken an immediate multiple-choice test (after the reading phase, a week before the final cued recall test) or a delayed multiple-choice test (immediately before the final test). To summarize, both positive and negative testing effects were larger when the multiple-choice test was delayed and occurred immediately before the final test.

Persistence of Testing Effects

As shown in Figure 1, Schedule A provided an opportunity to see whether the testing effects observed on an initial cued recall test persisted to the final cued recall test a week later. Again, the short answer is “yes.” Although the effects were reduced over the delay, the positive and negative testing effects observed on the first cued recall test were, in general, also observed on the second cued recall test.

In summary, three general points emerged from the experiment. First, both the positive and negative effects of prior testing were strongest when the multiple-choice test and the cued recall test occurred in the same session; it mattered less whether both occurred in the first session or both in the second. Rather, separation in time between the tests reduced the effects of testing. Second, the negative testing effect decreased over the delay but was never eliminated. Third, the positive testing effect was very robust. Both immediately and after the delay, the positive testing effect was always greater than the negative testing effect; that is, the increase in correct answers following testing was always larger than the increase in multiple-choice lure answers. The net result of prior multiple-choice testing was always positive.
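To make this comparison explicit, the net consequence of prior multiple-choice testing can be written as a difference of differences. The notation below is ours rather than the article’s and is intended only as an illustrative formalization of the claim that the gain in correct answers exceeded the gain in lure intrusions:

\[
\underbrace{\bigl(P^{\text{tested}}_{\text{correct}} - P^{\text{not tested}}_{\text{correct}}\bigr)}_{\text{positive testing effect}}
\;>\;
\underbrace{\bigl(P^{\text{tested}}_{\text{lure}} - P^{\text{not tested}}_{\text{lure}}\bigr)}_{\text{negative testing effect}},
\]

where \(P\) denotes the proportion of final cued recall questions answered with the correct answer (correct) or with a multiple-choice lure (lure), conditional on whether the question had appeared on the prior multiple-choice test. Under this reading, the net effect (positive minus negative) remained above zero in every condition reported here.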

We first comment on the theoretical implications of our results and then turn to practical recommendations. Theoretically, our findings are consistent with prior work suggesting that recollective processing underlies the benefits of testing (Chan & McDermott, 2007; Karpicke et al., 2006). Chan and McDermott had subjects study two lists of words: in the tested condition, subjects were given a free recall test after each list, whereas in the not tested condition they instead solved math problems. At the end of the experiment, both groups completed a final recognition test on the words from both lists. Subjects in the tested condition were better able to remember on which list the words had appeared and gave more “remember” responses on the recognition test than subjects in the not tested condition. These results suggest that testing increases later recollection processes rather than familiarity.

Our work extends this recollection account beyond the positive effects of testing to the negative testing effect. We manipulated a variable thought to have a large impact on recollection (delay), and it had similar effects on the positive and negative testing effects. The fact that the negative testing effect decreased over the delay suggests that recollecting the multiple-choice lures is a prerequisite for the negative testing effect.2 This stands in contrast to other false memory paradigms, such as false fame, in which memory errors are caused by a reliance on familiarity in the absence of recollection (Jacoby, Woloshyn, & Kelley, 1989).

Based on our results, what advice can be offered to educators? Teachers presumably want to retain the positive (but not the negative) effects of testing, and because the temporal spacing of the tests affected the size of the positive testing effect, one recommendation is to give frequent quizzes to enhance students’ knowledge. At a delay of one week the positive testing effect still outweighed the negative one, but it remains an open question whether the positive effect would prevail at longer delays. Finally, given that the negative effects of testing persist over time, educators should be aware of the costs of multiple-choice tests and try to reduce these hazards. One easy intervention is providing feedback after a test, which increases later correct responding and decreases later production of multiple-choice lures (Butler & Roediger, 2008). In short, we believe that frequent tests given with feedback will increase students’ knowledge while avoiding the negative effects of testing.

Acknowledgments

We thank Holli Sink and Aaron Johnson for help with manuscript preparation. This research was supported by a collaborative activity award from the James S. McDonnell Foundation.

Footnotes

1. This argument for the long-term persistence of familiarity does not contradict recent findings that familiarity drops quickly in the very short term (e.g., Yonelinas & Levy, 2002). It may well be that familiarity initially drops off more quickly than recollection, but that familiarity is more stable than recollection over longer delays.

2. Because both familiarity and recollection likely decline over a delay, it is impossible to say definitively that subjects’ reliance on the prior multiple-choice lures is due to recollection. However, our interpretation (that familiarity is more stable over time than recollection, so that delay primarily affects recollection) is consistent with how recollection and familiarity are conceptualized in other paradigms such as false fame.

The final publication is available at http://dx.doi.org/10.3758/MC.38.4.407

The opinions expressed are those of the authors and do not represent the views of the James S. McDonnell Foundation.

Contributor Information

Lisa K. Fazio, Duke University

Pooja K. Agarwal, Washington University in St. Louis

Elizabeth J. Marsh, Duke University

Henry L. Roediger, III, Washington University in St. Louis.

References

  1. Barber SJ, Rajaram S, Marsh EJ. Fact learning: How information accuracy, delay and repeated testing change retention and retrieval experience. Memory. 2008;16:934–946. doi: 10.1080/09658210802360603.
  2. Brainerd CJ, Reyna VF. Mere memory testing creates false memories in children. Developmental Psychology. 1996;32:467–478.
  3. Brown AS. Encountering misspellings and spelling performance: Why wrong isn’t right. Journal of Educational Psychology. 1988;80:488–494.
  4. Brown AS, Marsh EJ. Evoking false beliefs about autobiographical experience. Psychonomic Bulletin & Review. 2008;15:186–190. doi: 10.3758/pbr.15.1.186.
  5. Butler AC, Marsh EJ, Goode MK, Roediger HL III. When additional multiple-choice lures aid versus hinder later memory. Applied Cognitive Psychology. 2006;20:941–956.
  6. Butler AC, Roediger HL III. Feedback enhances the positive effects and reduces the negative effects of multiple-choice testing. Memory & Cognition. 2008;36:604–616. doi: 10.3758/mc.36.3.604.
  7. Cave CB. Very long-lasting priming in picture naming. Psychological Science. 1997;8(4):322–325.
  8. Chan JCK, McDermott KB. The testing effect in recognition memory: A dual process account. Journal of Experimental Psychology: Learning, Memory, & Cognition. 2007;33(2):431–437. doi: 10.1037/0278-7393.33.2.431.
  9. Jacoby LL, Hollingshead A. Reading student essays may be hazardous to your spelling: Effects of reading incorrectly and correctly spelled words. Canadian Journal of Psychology. 1990;44:345–358.
  10. Jacoby LL, Kelley CM, Brown J, Jasechko J. Becoming famous overnight: Limits on the ability to avoid unconscious influences of the past. Journal of Personality and Social Psychology. 1989;56(3):326–338.
  11. Jacoby LL, Woloshyn V, Kelley CM. Becoming famous without being recognized: Unconscious influences of memory produced by dividing attention. Journal of Experimental Psychology: General. 1989;118:115–125.
  12. Karpicke JD, McCabe DP, Roediger HL III. Testing enhances recollection: Process dissociations and metamemory judgments. Paper presented at the Annual Meeting of the Psychonomic Society; Houston, TX; November 2006.
  13. Kelley CM, Lindsay DS. Remembering mistaken for knowing: Ease of retrieval as a basis for confidence in answers to general knowledge questions. Journal of Memory & Language. 1993;32:1–24.
  14. Marsh EJ, Agarwal PK, Roediger HL III. Memorial consequences of answering SAT II questions. Journal of Experimental Psychology: Applied. 2009;15:1–11. doi: 10.1037/a0014721.
  15. Marsh EJ, Meade ML, Roediger HL III. Learning facts from fiction. Journal of Memory & Language. 2003;49:519–536.
  16. Marsh EJ, Roediger HL III, Bjork RA, Bjork EL. The memorial consequences of multiple-choice testing. Psychonomic Bulletin & Review. 2007;14:194–199. doi: 10.3758/bf03194051.
  17. Mitchell DB. Nonconscious priming after 17 years: Invulnerable implicit memory? Psychological Science. 2006;17(11):925–929. doi: 10.1111/j.1467-9280.2006.01805.x.
  18. Nelson TO, Narens L. Norms of 300 general-information questions: Accuracy of recall, latency of recall, and feeling-of-knowledge ratings. Journal of Verbal Learning & Verbal Behavior. 1980;19:338–368.
  19. Roediger HL III, Agarwal PK, Kang SHK, Marsh EJ. Benefits of testing memory: Best practices and boundary conditions. In: Davies GM, Wright DB, editors. New Frontiers in Applied Memory. Brighton, UK: Psychology Press; (in press).
  20. Roediger HL III, Karpicke JD. The power of testing memory: Basic research and implications for educational practice. Perspectives on Psychological Science. 2006a;1:181–210. doi: 10.1111/j.1745-6916.2006.00012.x.
  21. Roediger HL III, Karpicke JD. Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science. 2006b;17:249–255. doi: 10.1111/j.1467-9280.2006.01693.x.
  22. Roediger HL III, Marsh EJ. The positive and negative consequences of multiple-choice testing. Journal of Experimental Psychology: Learning, Memory, & Cognition. 2005;31:1155–1159. doi: 10.1037/0278-7393.31.5.1155.
  23. Spitzer HF. Studies in retention. Journal of Educational Psychology. 1939;30:641–656.
  24. Toppino TC, Brochin HA. Learning from tests: The case of true-false examinations. Journal of Educational Research. 1989;83:119–124.
  25. Toppino TC, Luipersbeck SM. Generality of the negative suggestion effect in objective tests. Journal of Educational Psychology. 1993;86:357–362.
