Familiarity and Retrieval Processes in Delayed Judgments of Learning

Janet Metcalfe; Bridgid Finn

doi:10.1037/a0012580

Author manuscript; available in PMC: 2009 Sep 1.

Published in final edited form as: J Exp Psychol Learn Mem Cogn. 2008 Sep;34(5):1084–1097. doi: 10.1037/a0012580

PMCID: PMC2593741 NIHMSID: NIHMS59979 PMID: 18763893

Abstract

Two processes are postulated to underlie delayed judgments of learning (JOLs) -- cue familiarity and target retrievability. The two processes are distinguishable because the familiarity-based judgments are thought to be faster than the retrieval-based judgments, because only retrieval-based JOLs should enhance the relative accuracy of the correlations between the JOLs and criterion test performance, and because only retrieval-based judgments should enhance memory. To test these predictions, in three experiments, we either speeded people’s JOLs or allowed them to be unspeeded. The relative accuracy of the JOLs in predicting performance on the criterion test was higher for the unspeeded JOLs than for the speeded JOLs, as predicted. The unspeeded JOL conditions showed enhanced memory as compared to the speeded JOL conditions, as predicted. And finally, the unspeeded JOLs were sensitive to manipulations that modified recallability of the target, while the speeded JOLs were selectively sensitive to experimental variations in the familiarity of the cues. Thus, all three of the predictions about the consequences of the two processes potentially underlying delayed JOLs were borne out. A model of the processes underlying delayed JOLs, based on these and earlier results, is presented.

People’s judgments of learning (JOLs) have consequences for their subsequent study behavior (Finn, in press; Metcalfe & Finn, 2008). If JOLs are independently altered, say, by framing the JOL question to participants to ask about whether they will remember the answer (resulting in high JOLs) or whether they will forget it (resulting in low JOLs), their study choice behavior is altered. They choose fewer items to study in the former case than in the latter, even though their learning, at the time of making the judgment, is the same (Finn, in press). Other manipulations that have altered people’s JOLs in an illusory way also have been shown to have direct consequences for what they choose to study (Metcalfe & Finn, 2008). Given that people use these metacognitive judgments to control their subsequent behavior, it is important both that the judgments be accurate and that we understand the processes that underlie them. Delayed JOLs, in which the judgments are made using only the cue at some time after the study effort, appear to be among the most accurate ways of making a self-assessment of one’s own learning, both in terms of relative accuracy (Nelson & Dunlosky, 1991) and calibration (Finn & Metcalfe, 2007, 2008; Koriat & Bjork, 2005). For this reason, we were especially interested in understanding the mechanisms underlying delayed JOLs.

Research on delayed JOLs focuses on the postulate that the mechanism for making these judgments is an attempted retrieval of the target (Nelson, Narens & Dunlosky, 2004). Here, we test the idea that although some delayed JOLs may, indeed, be based on a retrieval attempt as most researchers have proposed, there is a second basis for these judgments--cue familiarity. We will investigate whether these two mechanisms that may underlie delayed JOLs are separable, and also whether they may have different consequences for the accuracy of the JOLs and for people’s subsequent memory.

The reason many researchers have thought that delayed JOLs may be based on retrieval is that the relative accuracy of people’s delayed judgments is substantially higher than that of judgments made immediately after the study presentation (Begg, Duft, Lalonde, Melnick, & Sanvito, 1989; Benjamin & Bjork, 1996; Benjamin, Bjork & Schwartz, 1998; Kimball & Metcalfe, 2003; Koriat, 1997; Nelson & Dunlosky, 1991, 1992; Nelson, Dunlosky, Graf & Narens, 1994; Spellman & Bjork, 1992). There have been three main theories of why the delayed JOL accuracy advantage occurs, and each of the three implicates a retrieval attempt in the case of delayed JOLs. Indeed, only two studies (Benjamin, 2005; Son & Metcalfe, 2005) have suggested that something else may underlie some delayed JOLs.

The case for the postulate that people use an attempt to retrieve the target as the basis of their delayed JOLs comes primarily from studies and theories that have attempted to explain the difference in immediate and delayed JOL relative predictive accuracy with respect to the criterion test, that is, the ‘delayed JOL effect.’ The first proposal to explain this finding was the monitoring dual memories hypothesis given by Nelson and Dunlosky (1991) which states that immediate judgments are based on retrieval from both short-term memory (STM) and long-term memory (LTM). While making an immediate judgment the target item is still in STM and so judgments made immediately will not entail a retrieval attempt from LTM and hence will be poor at discriminating between what will be remembered and what will be forgotten when the test is delayed. By contrast, delayed JOLs rely only on retrieval from LTM, which is more diagnostic of what will happen at the final test.

The second explanation of the delayed JOL effect is the transfer appropriate processing view (Begg et al., 1989; Dunlosky & Nelson, 1992; Glenberg, Sanocki, Epstein & Morris, 1987; Roediger, Weldon, & Challis, 1989), which states that retrieval enacted at a delay is more similar to the retrieval that the person will use at test than are the processes that people use to make immediate JOLs. Therefore, the delayed retrieval will be more diagnostic of how people will do on the test. Although there are data militating against this theory (Dunlosky & Nelson, 1997; Dunlosky, Rawson, & Middleton, 2005; Weaver & Keleman, 2003), our only point here is that it postulates that the reason for the delayed JOL-to-test accuracy is a retrieval attempt. By both of these views, if there were no target retrieval attempt the correlations between JOLs and later test performance would be low rather than high. We will make a similar assumption--that a target retrieval attempt should result in a high JOL-to-test correlation, but if no retrieval attempt is made that correlation will be lower. We will use this as a method to tease apart the hypothesized two processes in delayed JOLs.

The third view is the Self-Fulfilling Prophecy explanation. By this view, the improvement in the relative accuracy of the delayed JOLs comes about because those judgments themselves--which involve retrieval, and retrieval, if successful, enhances memory--have an effect on the later memory test performance (Kimball & Metcalfe, 2003; Spellman & Bjork, 1992). This theory, like the others, states that people make their delayed JOLs by attempting to retrieve the target. If they are successful, they give those items a high JOL; if unsuccessful, they assign a low JOL. The critical difference between this theory and the two others is that these authors note (and demonstrate, in the case of Kimball & Metcalfe, 2003) that the act of successful retrieval at a delay enhances memory for those items that are brought to mind (see Roediger & Karpicke, 2006). Those retrieved items are not only given high JOLs, but also get a memory boost. Thus, the JOL itself, insofar as it involves retrieval, should enhance memory. We will return to this point shortly, since we will not only look for higher relative accuracy if the learner is retrieving to make his or her JOLs, but we will also look for enhanced memory.

Despite the near consensus that delayed JOLs are based on an attempt at target retrieval, Son and Metcalfe (2005) have recently presented data that suggest that some delayed JOLs may not be based on target retrieval. Three experiments compared the reaction times of people when making JOLs without any instructions to when they were told to retrieve and then make the JOLs. According to a retrieval-only hypothesis, people should attempt to retrieve the target in both cases: telling them to do what they would do anyhow should not alter their behavior. If so, then the RT functions in these two cases should track one another. In both cases, the time needed to make the JOL should increase as the JOLs decrease and target retrieval becomes more difficult and time consuming.

However, Son and Metcalfe (2005) found that the reaction times for the lowest JOL items did not follow this pattern: some ‘don’t know’ judgments were made very quickly. The pattern of reaction time data followed the expectations of the retrieval hypothesis in the case where people were told to retrieve first and then make their JOLs: reaction times increased monotonically with the lowest JOLs showing the longest reaction times. But a different pattern was seen for the JOL alone condition. It showed a nonmonotonic reaction time function with the lowest JOLs being made very rapidly rather than very slowly. Indeed, a measure of the lowest JOLs in the JOL alone condition showed that they were made faster than the time needed to make a retrieval attempt. When making the lowest JOLs, people seemed to know that they did not know without having to take the time needed to attempt to retrieve the target.

To make these very fast, low JOLs, Son and Metcalfe (2005) suggested that people might be evaluating how familiar they were with the cue, assessing it as low, and making their judgment based on this evaluation. They suggested that both cue familiarity and target retrievability may play a role in making JOLs. Fast low JOLs arise because cue familiarity is assessed as low, and no attempt is made in these cases to retrieve the target. Thus, the judgment process can conclude rapidly. When cue familiarity is assessed as high and the target is retrieved very quickly, a high JOL is given--but it is a somewhat slower judgment.

If their explanation of the reaction time data is correct, there are three testable consequences. First, there should be a beneficial memory effect of retrieval, but only when the JOLs are based on target retrieval and not when they are based only on cue familiarity. A number of research reports have shown that testing and retrieval have beneficial effects on later memory (e.g., Butler & Roediger, 2007; Karpicke & Roediger, 2008; McDaniel & Fisher, 1991; McDaniel, Kowitz, & Dunay, 1989; McDaniel & Masson, 1985; Roediger & Karpicke, 2006; Pashler, Cepeda, Wixted, & Rohrer, 2005; Pashler, Zarow, & Triplett, 2003). Whitten and Bjork (1977) have found similar memory benefits for retrieval practice. This enhancement, presumably, occurs only on the items that are retrieved (and not on the ones that fail to be retrieved). Nevertheless, some items should get a memory boost from the JOL procedure itself, as long as that JOL process involves retrieval. The finding that successful retrieval enhances memory can be used as a dependent measure to see, retrospectively, whether one JOL condition was more likely to involve retrieval than another.

Second, all three dominant theories propose that the reason delayed JOLs accurately predict performance is because of the retrieval attempt. It follows that we would expect to see the very high JOL relative accuracy in the case where those JOLs are made, primarily, on the basis of target retrieval. JOL relative accuracy should be less good were the JOLs to be based, mainly, on cue familiarity without a retrieval attempt.

Third, we should be able to experimentally manipulate the two kinds of judgments rather than just relying on correlational evidence. If the cue-familiarity-based JOLs are made quickly, whereas the target-retrieval-based JOLs are made more slowly, we should expect to see that variables that selectively affect cue familiarity should impact more on the speeded JOLs, whereas variables that affect retrieval should impact primarily on the unspeeded JOLs. Benjamin (2005), in a study that manipulated cue and target familiarity, found promising preliminary evidence in support of the second and third propositions. We shall explore this third prediction further, as well.

Experiment 1

In the first experiment, we manipulated target retrievability by using multiple pictorial cue exemplars of a particular category (bear1, bear2, bear3, bear4) and either paired each category cue with a single target word--resulting in high retrievability--or paired each category cue with multiple targets--resulting in low target retrievability. An example of the pictorial cues used in this experiment is given in Figure 1. Using the pictorial variants of the category allowed us to be explicit about which target was specified in the multiple target condition, while still keeping the cue familiarity the same in the two conditions. Our two primary conditions were, therefore, A-B, A’-B, A”-B, A”’-B (which, for simplicity, we will hereafter call A-B A-B), and A-B, A’-C, A”-D, A”’-E (which we will hereafter call A-B A-C). A-B A-B is, of course, a positive transfer situation, and should result in good recall of the target, whereas A-B A-C is a negative transfer situation, and should result in poorer recall of the target.

We also varied whether the JOL that people made at a delay was speeded or unspeeded. In the speeded condition, participants had to respond in less than 3/4 of a second, or else they heard a voice (in the computer program we used) say: “Hurry” and a “Too slow! Data lost!” written message appeared onscreen. In the unspeeded condition they were told to take their time in making the judgments, and no voice ever intruded. In the judgment phase we also included pictorial cues that had never been presented. We call this the ‘New’ condition.

Our predictions were that in the speeded conditions the JOLs would be lowest in the New cues condition (because of a lack of cue familiarity). They would be higher, but about the same in the A-B A-C condition and in the A-B A-B condition (because of greater, but equal, cue familiarity, and little ‘contamination’ from target retrieval). In the unspeeded condition we expected low JOLs in the New condition as well (because of a lack of both cue familiarity and target retrievability). But here we predicted higher JOLs in the A-B A-C condition than in the new condition (because of higher target retrievability) and still higher JOLs in the A-B A-B condition (because the target would be easiest to retrieve in this condition).

We also predicted that the JOL gammas indexing the relative accuracy would be higher in the unspeeded than in the speeded JOL condition. The difference in gamma correlations was expected on the grounds that the JOLs would be based much more on attempted retrieval in the unspeeded JOL condition than in the speeded condition. And, finally, we predicted that recall would be better in the unspeeded JOL condition than in the speeded JOL condition. The purported retrieval attempt, in the unspeeded JOL condition, was expected to improve recall of those items that were retrieved. In the speeded JOL condition a target retrieval was predicted much less frequently and thus less recall enhancement was expected.

Method

Participants

The participants were 32 undergraduates at Columbia University and Barnard College. They participated for course credit or were paid at a rate of $12 an hour for participating. Participants were treated in accordance with the ethical principles of the APA and the Columbia University IRB approved all of the experiments in this article.

Design and materials

The experiment was a 2 (Speeded or Unspeeded JOL) X 2 (Encoding Condition, A-B A-B, or A-B A-C) X 12 (within-list repetitions of the basic design, over which the data were collapsed) within-participant design. Participants also made JOLs, in both the speeded and unspeeded condition, on 12 new cues.

The picture cues were four distinct exemplars of a particular category, which shared a common name, as shown in Figure 1. These cues, each being slightly different from one another, allowed us to uniquely query a particular target in the JOL and memory tests.

Procedure

Participants were shown, one at a time, and instructed to remember, 48 picture-word pairs. The 48 cues represented 6 distinct categories with 4 exemplars per category in each of the A-B A-B and the A-B A-C conditions, randomly mixed into a single list of items. Each picture-word pair was presented for 3 s of study on each presentation, and the entire list was shown twice. Participants were then asked for their JOLs for 12 cues from that list, and 6 cues that were new. The 12 cues from the list were selected such that 6 were cues from the 6 categories in the just-studied A-B A-B condition and 6 were from the A-B A-C condition. The cue used for the JOL was randomly selected from one of the 4 exemplar pictures that had been studied for each category. The JOL cue was the same as was then given in the test phase. The 6 New cues were randomly selected from other categories of pictures that each had four exemplars. After making their JOLs, participants were then tested for recall on the 18 items on which they had made JOLs.
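The item counts in this procedure can be checked with a short sketch. It is purely illustrative: the function name `build_trial`, the string labels standing in for pictures and words, and the use of Python's `random` module are assumptions made for exposition, not a description of the software used in the experiment.

```python
import random

CATEGORIES_PER_CONDITION = 6   # 6 categories in each encoding condition
EXEMPLARS_PER_CATEGORY = 4     # 4 picture exemplars per category
NEW_CUES_PER_TRIAL = 6         # unstudied cues added at the JOL phase


def build_trial(rng):
    """Return (study_list, jol_cues, new_cues) for one trial.

    Labels such as 'A-B A-B/cat0/pic2' are hypothetical placeholders.
    """
    study_list, jol_cues = [], []
    for condition in ("A-B A-B", "A-B A-C"):
        for c in range(CATEGORIES_PER_CONDITION):
            category = f"{condition}/cat{c}"
            pairs = []
            for e in range(EXEMPLARS_PER_CATEGORY):
                cue = f"{category}/pic{e}"
                # A-B A-B: all exemplars share one target word.
                # A-B A-C: each exemplar is paired with its own target word.
                word = 0 if condition == "A-B A-B" else e
                pairs.append((cue, f"{category}/word{word}", condition))
            study_list.extend(pairs)
            # One randomly chosen exemplar per category is judged and tested.
            jol_cues.append(rng.choice(pairs)[0])
    rng.shuffle(study_list)  # the two conditions are mixed into a single list
    new_cues = [f"new/cat{c}/pic{rng.randrange(EXEMPLARS_PER_CATEGORY)}"
                for c in range(NEW_CUES_PER_TRIAL)]
    return study_list, jol_cues, new_cues


study_list, jol_cues, new_cues = build_trial(random.Random(0))
print(len(study_list), len(jol_cues), len(new_cues))  # 48 12 6
```

Under these assumptions each trial yields 48 studied pairs, 12 judged old cues, and 6 new cues, for 18 judged and tested items in total.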

There were two trials. The second trial was the same as the first (with different materials, of course) except that if the judgments had been speeded on the first trial they were unspeeded on the second, and if they had been unspeeded on the first they were speeded on the second. The speed of the first trial judgments was counterbalanced over participants.

The procedure in making the JOLs was as follows. Participants were told, “After you are presented with the pairs you will have an opportunity to give a JOL. A JOL is a Judgment of Learning which indicates how confident you are that in about 10 minutes from now you will be able to recall the target when prompted with the picture.” They made their JOLs by pressing one of four keys that ranged in quarters from 0-100%. Keys were marked on the keyboard. In both conditions there was a practice trial in which the judgments were made at the speed at which they would be made during the experiment, and in which participants were told that for the upcoming trial they would be making either speeded or unspeeded judgments. This practice trial was especially important in the speeded conditions, because it gave participants the opportunity to practice with the JOL buttons as quickly as was necessary during the experiment, before we were collecting data. During the practice trial, as well as during the experiment, a prerecorded voice in the speeded conditions said ‘Hurry!’ and a ‘Too slow! Data lost!’ message appeared if the JOL response exceeded .75 s. This happened during the experiment on 15% of the speeded trials. We nevertheless included all of the items in the analyses below, even those that exceeded .75 s.

Results

Latencies

The mean time to make the Speeded JOLs was .61 s as compared to 1.48 s in the Unspeeded condition, t(31) = 7.37, p < .05. (We also conducted a separate analysis that excluded items that exceeded .75 s in the speeded condition. The pattern of results was the same as shown below.)

Recall

As predicted, recall was better in the Unspeeded JOL condition than in the Speeded condition. Unspeeded judgments showed a recall advantage (M = .69, SE = .04) over the Speeded condition (M = .63, SE = .04). This main effect was significant, F(1,31) = 5.53, MSe = .02, p < .05, η²_p = .15 (effect size is reported using partial eta squared, η²_p ).
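Because partial eta squared is defined as SS_effect / (SS_effect + SS_error), it can be recovered from an F ratio and its degrees of freedom as F × df_effect / (F × df_effect + df_error). The sketch below applies that identity to two of the F ratios reported for recall as a consistency check; the function name `partial_eta_squared` is a hypothetical helper introduced here for illustration.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F ratio and its degrees of freedom.

    Uses the identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of JOL speed on recall, F(1,31) = 5.53
print(round(partial_eta_squared(5.53, 1, 31), 2))   # 0.15

# Main effect of Encoding Condition on recall, F(1,31) = 71.34
print(round(partial_eta_squared(71.34, 1, 31), 2))  # 0.7
```

Both values agree with the effect sizes reported in the text (.15 and .70).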

As was expected, Encoding Condition A-B A-B showed better recall performance (M = .83) than Condition A-B A-C (M = .48), F(1,31) = 71.34, MSe = .06, p < .05, η²_p = .70. The interaction between condition and judgment speed was not significant (F < 1). The recall means are shown in Figure 2.

Figure 2. Mean recall performance for conditions A-B A-B and A-B A-C under speeded and unspeeded JOL conditions, in Experiment 1. Error bars indicate standard errors of the mean.

JOLs

The JOLs for the new items were included in this analysis, in both the unspeeded and speeded JOL conditions. All of the relevant effects and interactions were still significant, however, when the data were reanalyzed with the new items eliminated. As predicted, when people made Speeded JOLs their judgments followed the familiarity of the cue, whereas when they made Unspeeded JOLs the judgments followed the retrievability of the target. The interaction between JOL Speed and Encoding Condition, F(2,62) = 16.82, MSe = .21, p < .05, η²_p = .35, is shown in Figure 3. Both the speeded and the unspeeded JOLs showed low mean judgments on the new items. In the speeded condition, although both the A-B A-B and the A-B A-C condition showed higher JOLs than those given to the new cues (t(31) = 7.83, p < .05, and t(31) = 8.74, p < .05, respectively), there was no significant difference between them, t(31) = 1.52, p > .05. There was, however, a difference between the JOLs in the A-B A-B condition and the A-B A-C condition in the unspeeded JOL condition, reflecting a similar difference in retrieval in these two conditions, t(31) = 5.56, p < .05.

There was also, of course, a main effect of Encoding Condition, F(2,62) = 119.14, MSe = .51, p < .05, η²_p = .79. There was a main effect of JOL speed, F(1,31) = 11.76, MSe = .22, p <.05, η²_p = .28. However, these main effects were qualified by the interaction of interest.

Gamma correlations relating JOLs to recall

Gamma correlations between JOLs and recall index relative metacognitive accuracy. We computed gamma correlations collapsed over all conditions (including the new items) within the unspeeded and speeded JOL conditions. As predicted, the gammas were higher for the unspeeded condition (M = .84, SE = .05) than for the speeded JOL condition (M = .61, SE = .08), t(30) = 2.60, p < .05. We also eliminated the new items and recomputed the gammas only on items that had been presented for study. Once again, they were higher for the unspeeded JOL condition (M = .58, SE = .11) than for the speeded JOL condition (M = .28, SE = .11), t(24) = 2.14, p < .05. (The change in degrees of freedom occurred because some subjects had either all answers wrong or all right, and so a gamma could not be computed for them.)
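For readers unfamiliar with the measure, the Goodman-Kruskal gamma for one participant can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `goodman_kruskal_gamma` and the example data are hypothetical. It also shows why gamma is undefined when a participant recalls all items or none, since no pair of items then differs on recall.

```python
from itertools import combinations
from typing import Optional, Sequence


def goodman_kruskal_gamma(jols: Sequence[float],
                          recalled: Sequence[int]) -> Optional[float]:
    """Gamma = (C - D) / (C + D) over all item pairs.

    A pair is concordant (C) when the item with the higher JOL is also the
    one that was recalled, discordant (D) when the order is reversed, and
    ignored when the two items tie on either variable.
    Returns None when there are no untied pairs.
    """
    concordant = discordant = 0
    for (j1, r1), (j2, r2) in combinations(zip(jols, recalled), 2):
        product = (j1 - j2) * (r1 - r2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        return None  # e.g., all items recalled or all items forgotten
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical participant: JOL key pressed (1-4) and recall (1 = correct)
print(goodman_kruskal_gamma([1, 2, 4, 4, 3, 1], [0, 1, 1, 0, 1, 0]))  # 0.5

# All answers right: gamma cannot be computed
print(goodman_kruskal_gamma([1, 2, 4], [1, 1, 1]))  # None
```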

Additional analyses

Using data only from the Unspeeded JOL condition, we were able to investigate the reaction times (RTs) of participants making delayed JOLs when they were not constrained or subject to a time deadline. The data from this condition are comparable to the RT data of Son and Metcalfe (2005) when people were simply asked to make delayed JOLs without further constraints. In addition, because we had used a condition in which the cues were new, we were able to investigate whether under unspeeded conditions people would spontaneously give very fast low JOLs selectively in this condition, presumably, because of a lack of cue familiarity. The reaction time data for the three conditions, along with the proportion of responses in each condition at each of the four JOL levels, and the proportion correct at each of these four levels, are presented in Figure 4. As can be seen, most of the JOL responses in the New condition clustered into the lowest JOL category: people knew that they did not know. And they were very fast. In the A-B A-B condition, in contrast, most of the JOLs clustered into the highest JOL category. The proportion of responses in the highest JOL category was, appropriately, somewhat lower in the A-B A-C condition. They knew that they knew the answers more often in the A-B A-B condition than in the A-B A-C condition. The ‘know’, or highest JOL judgments, in both the A-B A-C and the A-B A-B conditions were made quickly but numerically less quickly than the ‘don’t know’ judgments in the New condition--consistent with the hypothesis. Medium valued JOLs in the A-B A-B and A-B A-C conditions were made more slowly, just as Son and Metcalfe (2005) had shown.

Figure 4. Reaction times at each of the 4 JOL levels are given for the New condition, the A-B A-C condition, and the A-B A-B condition, in the left, center, and right panels on the top. A JOL of 1 indicates that the participant thought they did not know the response while a JOL of 4 indicates that they thought they knew it. On the bottom are the proportion of responses given at each JOL level, shown by the bars, and proportion correct at each JOL level, shown by the diamonds, with the data from the New condition on the left, from the A-B A-C condition in the center, and from the A-B A-B condition on the right. All data are from Experiment 1.

We were unable to conduct an ANOVA combining both Levels of JOLs and Encoding Conditions (New, A-B A-B and A-B A-C) on RTs, because there were many cases in which there were no responses at all in the New condition for the highest JOLs, and in the A-B A-B condition for the lowest JOL category. Indeed, there was not a single participant in this experiment who had data in every cell of the full design. Thus, we had to collapse. Accordingly, we conducted two separate one-way ANOVAs, the first comparing RTs across the 3 Encoding Conditions (collapsing over JOL levels), and the second comparing RTs over JOL levels (collapsing over Encoding Conditions). There was a significant effect of Encoding Condition with RT as the dependent variable, F(2,62) = 9.84, MSe = .39, p < .05, η²_p = .24. Although numerically the New condition (at 1.16 s) was faster than the A-B A-B condition (at 1.39 s), the post hoc test comparing these two conditions was not significant, t(31) = 1.41, p > .05. The post hoc tests comparing both the New condition to the A-B A-C condition (at 1.84 s) and the A-B A-B condition to the A-B A-C condition were both significant, t(31) = 3.86, p < .05, and t(31) = 3.69, p < .05, respectively.

There was a main effect for JOL level when RT was the dependent measure, F(3,57)=6.56, MSe=.61, p<.05, η²_p=.26. All differences among means except between JOL level 1 and JOL level 4 and between JOL level 2 and JOL level 3 were significant — indicating an inverted U-shaped curve as a function of JOL level, with the collapsed RT data. Accordingly, we tested for linear, quadratic and cubic trends. Only the quadratic coefficient was significant, t(19)=2.90, p<.05. These distributional and RT results extend and provide further support for the dual process hypothesis.
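One common way to test linear, quadratic, and cubic trends over four equally spaced levels is to apply the standard orthogonal polynomial contrast weights to each participant's mean RTs and test the resulting scores against zero. The sketch below illustrates that approach; whether the original analysis was computed exactly this way is an assumption, and the function name `contrast_t` and the RT values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Standard orthogonal polynomial weights for four equally spaced levels
CONTRASTS = {
    "linear": (-3, -1, 1, 3),
    "quadratic": (1, -1, -1, 1),
    "cubic": (-1, 3, -3, 1),
}


def contrast_t(rts_by_participant, weights):
    """One-sample t test of per-participant contrast scores against zero.

    Each row holds one participant's mean RT (s) at JOL levels 1 to 4.
    Returns (t, degrees of freedom).
    """
    scores = [sum(w * rt for w, rt in zip(weights, row))
              for row in rts_by_participant]
    n = len(scores)
    t = mean(scores) / (stdev(scores) / sqrt(n))
    return t, n - 1


# Hypothetical data: fast at the extremes, slow in the middle (inverted U)
HYPOTHETICAL_RTS = [
    (1.1, 1.9, 2.0, 1.3),
    (1.0, 1.6, 1.8, 1.2),
    (1.3, 2.2, 1.9, 1.1),
    (0.9, 1.5, 1.7, 1.4),
]

for name, weights in CONTRASTS.items():
    t, df = contrast_t(HYPOTHETICAL_RTS, weights)
    print(f"{name}: t({df}) = {t:.2f}")
```

With the quadratic weights (1, -1, -1, 1), an inverted U-shaped RT function produces negative contrast scores, so the sign of the statistic depends on the weighting convention; only its magnitude bears on whether the trend is reliable.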

Discussion

The predictions of the dual process model of delayed JOLs held up very well in the first experiment. Relative accuracy, as indexed by the gamma correlations, was higher with unspeeded than with speeded JOLs. This pattern was consistent with the idea that the slow process that people use in making delayed JOLs involves a target retrieval attempt, but the fast process involves something else. Memory was better when the JOLs were slow rather than fast, suggesting a benefit from retrieval practice that was greater in the unspeeded condition. The manipulation that affected target retrieval had an impact only on the unspeeded JOLs and did not show up on the speeded JOLs. These three results suggest that the two processes are different and dissociable. They also suggest that the slow process may be an attempt at target retrieval. The low JOLs in evidence in the condition in which the cues were new suggest that the fast process was probably cue familiarity, but this suggestion is equivocal because both the cue and the target were completely unfamiliar in this case. Not only was the cue unfamiliar, but the target was also unretrievable, because no target had been presented.

Experiment 2

Although the results of the first experiment were supportive of our hypothesis, we had only included a measure of cue familiarity during the judgment process and retrieval but not during encoding. Thus, in the second experiment, we used the same basic design as had been used in the first experiment except that we added another condition in which the cue and target were presented only once. Thus, our three encoding conditions were A-B A-B, A-B A-C, and A-B--the latter being a condition in which the cue was presented only once, and hence cue familiarity was expected to be lower than in the other two conditions.