Abstract
Traditional accounts of reasoning have characterized human reasoning errors as the product of an unconscious process whereby cognitive misers blindly neglect the critical information that would lead to problem solution, thereby substituting an easier problem for the actual problem (e.g., Kahneman & Frederick, 2002). For the bat-and-ball problem, the present study challenges this unconscious substitution hypothesis on two fronts: (1) by testing for conscious representation of the error-inducing semantic content of the problem (i.e., the “more than” phrase, “The bat costs $1.00 more than the ball.”); and (2) by experimentally comparing response confidence between standard versions of the problem and isomorphic controls (without that phrase) to verify post-decision sensitivity to errors, following De Neys, Rossi, and Houdé (2013). Crucially, even when interference questions were included between testing and the memory response, incorrect reasoners largely showed accurate recall and recognition of the problem’s error-inducing phrase. Incorrect reasoners’ intra-individual error sensitivity was replicated and extended via the introduction of a social-metacognitive measure, which was correlated with intra-individual post-decision confidence and also yielded an error sensitivity effect. Finally, response latencies confirmed the relationship between time spent reasoning and post-decision confidence. Implications and future directions are discussed.
Keywords: reasoning, confidence, decision-making, error sensitivity, bat-and-ball problem, memory, attribute substitution, latency response
The Bat-and-Ball Problem: Evidence in support of a conscious error process
The bat-and-ball problem, one of three questions included in the Cognitive Reflection Test, has become a classic index of biased and correct responses in judgment and decision-making research (see Frederick, 2005). It is presented as follows:
A bat and ball together cost $1.10. The bat costs $1.00 more than the ball.
How much does the ball cost?
The most frequent answer is the biased “10 cents” response, wherein participants incorrectly subtract $1.00 from $1.10. The correct answer is the “5 cents” response, which can be reasoned through algebraically by solving two simultaneous equations (Hoover & Healy, 2017):
Bat + Ball = $1.10
Bat = $1.00 + Ball
Solve for Ball
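Substituting the second equation into the first yields the solution in two steps:

```latex
\begin{aligned}
(\$1.00 + \text{Ball}) + \text{Ball} &= \$1.10\\
2 \times \text{Ball} &= \$0.10\\
\text{Ball} &= \$0.05
\end{aligned}
```

The biased “10 cents” answer fails the corresponding check: a 10-cent ball implies a $1.10 bat and a $1.20 total.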
Classic accounts of response bias for this problem rely on dual process models of reasoning, which suggest that biased participants respond in this manner because they are making judgments on the basis of intuition (Type 1 processing) rather than higher-order rational consideration (Type 2 processing) (Evans, 2008; Evans & Stanovich, 2013; Kahneman, 2011; Kahneman & Frederick, 2002; Stanovich & West, 2002). When these sorts of errors in judgment are made, intuitive reasoners are thought to be susceptible to attribute substitution, the process by which a more complicated problem is replaced with a simpler problem (e.g., Kahneman, 2011; Kahneman & Frederick, 2002). In the bat-and-ball problem, the attribute substitution consists of neglecting the “more than” phrase, essentially substituting the difficult problem with “more than” for an easier problem without it (e.g., De Neys, Rossi, & Houdé, 2013; Hoover & Healy, 2017).
Classically, human reasoners have been regarded by investigators as cognitive misers who are oblivious to the erroneous thinking that gives rise to attribute substitution (Kahneman & Frederick, 2002; Tversky & Kahneman, 1974). However, De Neys et al. (2013) determined that reasoners are on some level sensitive to this erroneous process. They did so by comparing incorrect reasoners’ response confidence on a standard bat-and-ball problem variant (with the “more than” phrase) to their confidence on an isomorphic control variant without that phrase. For example:
A pencil and eraser together cost $1.10. The pencil costs $1.00.
How much does the eraser cost?
The answer to this problem is easily seen to be 10 cents.
The response confidence prompt asked participants “to indicate how confident they were that their response was correct by writing down a number between 0% (totally not sure) and 100% (totally sure)” (De Neys et al., 2013, p. 270). Incorrect reasoners gave lower response confidence scores to standard than to control versions, suggesting that they were at some level sensitive to their attribute substitution error process (i.e., they showed substitution sensitivity). Follow-up studies have provided additional insights into the nature of post-decision error sensitivity. For instance, response confidence was found to be predicted by a verification process that reasoners used to determine accuracy, as well as by the perceived difficulty of the problem (Szollosi, Bago, Szaszi, & Aczel, 2017). The key question De Neys et al. (2013) asked is whether fast, Type 1 errors are produced on an entirely unconscious basis, as previously assumed (e.g., Kahneman & Frederick, 2002).
The body of work on conflict detection in decision-making provides the theoretical framework for explaining the findings outlined above (e.g., Aczel, Szollosi, & Bago, 2016; De Neys, 2012; Pennycook, Fugelsang, & Koehler, 2012). In this line of theorizing, reasoning errors arise from an inability to inhibit prepotent intuitive responses and, importantly, cannot be explained solely by miserly cognition, because incorrect reasoners demonstrate that they are unsure their reasoning was accurate. Indeed, there is a rich body of research in support of error sensitivity in particular and logical intuitions more generally (e.g., Bago & De Neys, 2017; De Neys, 2012, 2014; De Neys & Bonnefon, 2013; De Neys & Glumicic, 2008; De Neys et al., 2013; Gangemi, Bourgeois-Gironde, & Mancini, 2015; Mata, Schubert, & Ferreira, 2014; but see Singmann, Klauer, & Kellen, 2014, for some caveats regarding this literature). Furthermore, sensitivity to error processing has been demonstrated through converging evidence from latency response investigations (De Neys & Glumicic, 2008; Frey, Johnson, & De Neys, 2017; Johnson, Tubau, & De Neys, 2016), neuroimaging (De Neys, Vartanian, & Goel, 2008), and alternative measurements of confidence (De Neys, Cromheeke, & Osman, 2011). However, sensitivity effects were not obtained by indexing reasoners’ mouse movements (Travers, Rolison, & Feeney, 2016) or their eye movements (Mata, Ferreira, Voss, & Kollei, 2017; see the subsequent debate concerning the studies by Mata et al. and Frey et al. discussed by Mata & Ferreira, 2018). On a more general level, unconscious processes have not been found to have much (if any) explanatory power (for a review, see Newell & Shanks, 2014).
In the present study, we extend this body of work by testing the unconscious substitution hypothesis (e.g., Kahneman, 2011; Kahneman & Frederick, 2002). Specifically, we added a more direct measure of respondents’ problem representation by asking them to recall and recognize the problem wording. This addition allows us to determine whether incorrect reasoners consciously processed the error-eliciting “more than” phrase. Evidence that incorrect participants did not err unconsciously would take the form of incorrect respondents to the standard question remembering the standard problem with the “more than” phrase intact. In other words, such a finding would indicate that intuitive reasoners did not simply overlook or neglect the error-inducing “more than” phrase and did in fact cognitively process its semantic content. On a related note, recent work on change detection with conflict and no-conflict problems has revealed that reasoners who had a better problem representation of the question phrasing were more likely to solve the problem correctly (Mata et al., 2014). Therefore, in line with this work, we also expected to see better performance among those with better problem representations (see Supplemental Materials for these results).
The other major aspect of the present work concerns confidence as an index of error sensitivity. Confidence generally reflects accuracy, but participants’ subjective confidence ratings are widely known to be subject to bias. One of the primary sources of miscalibration is overconfidence (e.g., Koriat, Lichtenstein, & Fischhoff, 1980; Lichtenstein, Fischhoff, & Phillips, 1982; Nelson & Narens, 1990; Shaw, 1996; Soll, 1996). Reasoners tend to judge quite favorably that their knowledge is accurate, whether they are correct or incorrect, across a wide range of contexts (e.g., general knowledge, eye-witness testimony, legal judgments), although less so for perceptual judgments (Keren, 1988), and reasoners are even willing to bet money on their judgments in real and hypothetical gambling tasks (e.g., Fischhoff, Slovic, & Lichtenstein, 1977). Another source of miscalibration is a mismatch between normative modeling of confidence via probability (i.e., 0% confident – 100% confident) and descriptive data regarding real-world judgments (e.g., González-Vallejo & Bonham, 2007; Keren, 1991, 1997; Klayman, Soll, González-Vallejo, & Barlas, 1999; Lichtenstein & Fischhoff, 1977). De Neys et al. (2013) did not address miscalibration issues in their report with respect to either (a) overconfidence or (b) normative modeling of confidence via probability. However, miscalibration might not be directly relevant to the theoretical question they asked, which involved a difference in response confidence between standard versions of the bat-and-ball problem and the simpler, isomorphic control versions. Instead, that difference might be more relevant to the notion of resolution (discrimination among different degrees of certainty), and resolution has been shown not to change as a function of task difficulty (e.g., Lichtenstein & Fischhoff, 1977). Furthermore, overconfidence is typically found to be higher with more difficult than with easier problems (i.e., the hard-easy effect; Keren, 1997; Lichtenstein et al., 1982), which would lead to the prediction that reasoners might be more confident in their responses to the more difficult standard problem than to the easier isomorphic control problem, contrary to the findings of De Neys et al.
Theoretical issues aside, ceiling effects in confidence might obscure differences between conditions. Although De Neys et al. (2013) found a substantial effect of condition (standard vs. control) on confidence, the response confidence scores were very high overall, approaching the ceiling (see also Hathorn & Healy, 2015, 2016, and Hoover & Healy, 2017, who found that the modal response confidence score for all reasoners was 100%). In an attempt to reduce confidence score levels, in the present study we added another measure of confidence that asks participants to give their opinion about whether other respondents would answer the question correctly, following Frederick (2005). Indirect measurement of confidence and modification of social contexts are two methods that have been demonstrated to reduce overconfidence (e.g., Arkes, Christensen, Lai, & Blumer, 1987). Frederick’s (2005) opinion judgment proxy clearly takes advantage of indirect measurement of confidence and also, arguably, of social context, given that it asks reasoners to gauge the judgment of other reasoners. Furthermore, these opinion judgments should yield estimates off the ceiling because they should be less susceptible to miscalibration caused by overconfidence in one’s own reasoning (e.g., Fischhoff et al., 1977; Lichtenstein & Fischhoff, 1980; Lichtenstein et al., 1982).
However, it is important to note the limitations of indirect measurements of confidence, particularly with respect to hypothetical error sensitivity processing. Although we expect there to be overlap between the intra-individual rules that intuitive reasoners use for their own judgment and their social metacognition regarding others’ reasoning, these scales might be indexing separate error signals. For example, Mata and Almeida (2014) found evidence for intra-individual error sensitivity but not social metacognitive error sensitivity among intuitive reasoners. They also noted, though, that low statistical power could have been responsible for their lack of an observed effect (p. 357). The present study’s large sample size should ensure statistical sensitivity. We predicted that the direct and indirect measurements of confidence would be correlated, that both would yield error sensitivity effects, and critically, that the indirect measurement of confidence would be less susceptible to ceiling effects.
Finally, latency responses to both the standard and control versions were collected. Correlations between response time and response confidence were expected, in line with previous work (e.g., Johnson et al., 2016; Kelley & Lindsay, 1993; Thompson, Prowse Turner, & Pennycook, 2011; Thompson et al., 2013).
Beyond including the opinion judgment question (which always followed the response confidence question), covertly recording response time (online via survey software), and adding the posttest memory questions, we employed the De Neys et al. (2013) procedure in the present investigation exactly as it had been reported.
Method
Participants
Three hundred and eighty-two participants (MAge = 26.34, SD = 9.44, range = 18–65, men = 61%) were tested online via a Qualtrics survey using their own computers in three separate samples. The first sample was gathered via Amazon Mechanical Turk (n = 126), the second sample via the University of Colorado Boulder (UCB) student subject pool of undergraduate general psychology students (n = 128), and the third sample, again, via Amazon Mechanical Turk (n = 128). The Amazon Mechanical Turk (MTurk) participants were restricted to current college students (indicated via self-report) at least 18 years of age and were paid 50 cents for their participation. The UCB students received course credit.
According to a power calculation based on the effect of substitution sensitivity (i.e., the difference between standard and control problems in response confidence scores) reported by De Neys et al. (2013; ηp2 = .23 on n = 195 incorrect reasoners), only 30 incorrect participants, defined as those giving the incorrect answer on the standard problem, were needed to achieve .8 power for that effect. This power calculation was computed using G*Power with the SPSS option (Faul, Erdfelder, Lang, & Buchner, 2007). The number of incorrect reasoners in the present study was 227 total, and those participants were used in the results reported here for confidence, opinion, memory, and response time. The results of the remaining 155 correct reasoners for those measures are reported in the Supplemental Materials.
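For readers who wish to verify this power calculation without G*Power, a minimal sketch in Python follows, using the standard conversion from partial eta squared to Cohen’s f and the noncentral F distribution; this approximates a single within-subject contrast and omits G*Power’s correction for the correlation among repeated measures, so the resulting n may differ slightly from the value reported above.

```python
import numpy as np
from scipy.stats import f as f_dist, ncf

# Effect size from De Neys et al. (2013): partial eta squared = .23.
eta_p2 = 0.23
cohens_f = np.sqrt(eta_p2 / (1 - eta_p2))  # ~0.55, a large effect

alpha, target_power = 0.05, 0.80
for n in range(5, 200):
    df1, df2 = 1, n - 1                 # one within-subject contrast
    lam = cohens_f**2 * n               # noncentrality parameter
    f_crit = f_dist.ppf(1 - alpha, df1, df2)
    power = 1 - ncf.cdf(f_crit, df1, df2, lam)
    if power >= target_power:
        print(f"n = {n} yields power = {power:.3f}")
        break
```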
Materials and Procedure
The items and item amounts for the primary questions of interest followed the De Neys et al. (2013) procedure: Instead of the familiar bat/ball/$1.10, the counterbalanced variants of the problems were the unfamiliar Pencil/Eraser/$1.10 and Magazine/Banana/$2.90 (with $2.90 replacing $1.10 and $2.00 replacing $1.00).
Following De Neys et al. (2013), questions were counterbalanced such that (a) half of each sample was presented with a control question first followed by a standard question, whereas the presentation was reversed for the remaining half; and (b) half of the Pencil/Eraser/$1.10 and Magazine/Banana/$2.90 questions were the standard version of the problem, and the remaining half were the isomorphic control version (i.e., without the “more than” phrase). In all cases, each participant was asked a total of two bat-and-ball variant questions, one standard and one control; one of these two questions was the Pencil/Eraser/$1.10 question, and the other the Magazine/Banana/$2.90 question.
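The counterbalancing scheme above yields four between-subjects cells (2 presentation orders × 2 assignments of the standard version). A small sketch enumerating those cells follows; the bookkeeping is ours, not the original survey’s implementation:

```python
from itertools import product

contents = ("pencil/eraser/$1.10", "magazine/banana/$2.90")

# Each cell fixes (a) which content is presented first and (b) which
# content carries the standard ("more than") wording; the other content
# is presented as the isomorphic control.
for i, (first, standard) in enumerate(product(contents, contents), 1):
    second = contents[1] if first == contents[0] else contents[0]
    cell = [(c, "standard" if c == standard else "control")
            for c in (first, second)]
    print(f"cell {i}: {cell}")
```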
Participants’ response confidence was obtained after each question variant (on a separate survey page) with the following prompt: “How confident are you in your response? Please write down a number between 0% (totally not sure) to 100% (totally sure).” This response was followed immediately by an opinion judgment proxy, in which participants were asked, for example, “In your opinion, what percentage of people answering the banana and magazine problem were able to solve it correctly?” (Frederick, 2005).
After completing both experimental questions (including the confidence ratings and opinion judgments), participants were asked memory questions. The order of the variants for the memory questions was the same as that for the experimental questions. The first memory question was a recall response (e.g., “Do you remember the pencil and eraser problem? Please recall the problem and write down what you remember.”). After the recall responses for the two problems, participants gave recognition responses for both problems, wherein the response choices included the question either with or without the error-eliciting “more than” phrase. The key component coded on both recall and recognition responses was the presence of the “more than” phrase.
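Coding a recall response reduces to checking for the “more than” phrase once uncodable responses are set aside. A minimal sketch of such a coder is given below; the exclusion rule and regular expression are our illustrative assumptions, not the authors’ actual coding scheme:

```python
import re

def code_recall(response: str):
    """Return 1 if the recalled problem contains 'more than',
    0 if it omits the phrase, and None if the response is uncodable."""
    text = response.strip().lower()
    if not text or "don't know" in text or "dont know" in text:
        return None  # excluded from the recall analysis
    return 1 if re.search(r"\bmore than\b", text) else 0

print(code_recall("The pencil costs $1.00 more than the eraser."))     # 1
print(code_recall("A pencil and eraser cost $1.10; pencil is $1.00"))  # 0
```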
For the second and third samples, nine simple math problems were included between the experimental questions and the memory questions to serve as interference and to limit recall and recognition based on working memory (see the Supplemental Materials). Neither math problems nor any other activity occurred between the De Neys et al. (2013) experimental questions and the memory questions for the first sample. Specifically, we started by testing 126 MTurk participants and then examined their data. We did not have a precise stopping rule for the sample size, but we decided from the outset to pause data collection after examining the data from an initial sample of MTurk participants. We noted poor memory performance by these initial participants. On the basis of these observations, we tested two additional samples of participants (one from MTurk and one from UCB), each approximately the same size as the initial sample (128 participants), and gave them the math problems to create interference.
Response time was measured directly through the Qualtrics software. Each response time included the time to read the question, the decision time, the time to type the answer, and the time to press the arrow button to go to the next survey page. Note that the format of the response was left open to participants (e.g., “.05,” “five cents,” “$0.05,” etc.; however, most responded in simple decimal format, i.e., “.05” or “.10”).
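Scoring such open-format answers requires normalizing them to a common numeric form. A hedged sketch of one way to do this follows (the word list and cents heuristic are illustrative assumptions, not the authors’ scoring code):

```python
import re

WORDS = {"five cents": 0.05, "ten cents": 0.10,
         "5 cents": 0.05, "10 cents": 0.10}

def to_dollars(raw: str):
    """Map open-format answers ('.05', 'five cents', '$0.05') to dollars."""
    text = raw.strip().lower()
    if text in WORDS:
        return WORDS[text]
    match = re.search(r"\$?(\d*\.?\d+)", text)
    if match is None:
        return None
    value = float(match.group(1))
    # Treat bare integers such as '5' or '10' as cents rather than dollars.
    return value / 100 if "." not in match.group(1) and value >= 1 else value

assert to_dollars("$0.05") == 0.05
assert to_dollars("five cents") == 0.05
assert to_dollars(".10") == 0.10
```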
Results
All measures and conditions from our data collection are reported. See the Supplemental Materials for additional results not included in the main article.
Accuracy
Each participant contributed one accuracy score (0 or 1) for each condition (standard and control). Of the 382 total participants, 155 gave correct responses to the standard question and 227 gave incorrect responses. Seven of the incorrect participants gave an irregular incorrect response (i.e., answers other than the biased “10 cents” for the pencil and eraser problem or the biased “90 cents” for the magazine and banana problem). Accuracy on the standard question was 40.6%, SD = 49.2% (MTurk 1st sample: M = 38.9%, SD = 48.9%; MTurk 2nd sample: M = 46.1%, SD = 50.0%; UCB: M = 36.7%, SD = 48.4%); this accuracy level was 20.6 percentage points higher than the average reported in previous research (e.g., Bourgeois-Gironde & Vanderhenst, 2009; De Neys et al., 2013; Frederick, 2005; Hoover & Healy, 2017). In contrast, the isomorphic control question, which did not contain the possible error-inducing “more than” phrase, elicited correct responses with 96.6% accuracy, SD = 18.2% (MTurk 1st sample: M = 98.4%, SD = 12.5%; MTurk 2nd sample: M = 91.4%, SD = 28.1%; UCB: M = 100%, SD = 0%). A 3 × 2 mixed factorial analysis of variance was conducted on participants’ proportion of correct responses, with the between-subjects variable of subject group (MTurk 1st sample, MTurk 2nd sample, UCB) and the within-subject (i.e., repeated measures) variable of condition (standard, control). The effect of condition was significant, F(1, 379) = 437.914, p < .001, ηp2 = .54, such that participants were much more likely to answer correctly when solving the isomorphic control than when solving the standard question. There was, however, a significant Condition × Subject Group interaction, F(2, 379) = 4.193, p = .016, ηp2 = .02, such that UCB participants showed a larger difference between the standard and control problems than did MTurk participants.
The analysis of variance was conducted on proportions to enable comparison with the findings of De Neys et al. (2013), who used an analysis of variance on percentages to compare standard and control problems. However, because the dependent variable is dichotomous (0 or 1), a mixed effects logistic regression (with subject as the random variable) is more appropriate, and it yielded comparable results. Specifically, participants were significantly more likely to respond accurately to isomorphic control questions than to standard variants: for standard relative to control questions, b = −3.73, χ2 = 153.54, p < .001, odds ratio (OR) = 0.02, 95% confidence interval (CI) [0.01, 0.04], which inverts to OR = 41.57 for control relative to standard questions. That is, participants’ odds of answering the isomorphic control questions correctly were about 42 times the odds of answering the standard variants correctly.
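The two framings of this effect are related by simple inversion of the odds ratio:

```latex
\mathrm{OR}_{\text{standard vs. control}} = e^{b} = e^{-3.73} \approx 0.024,
\qquad
\mathrm{OR}_{\text{control vs. standard}} = e^{3.73} \approx 41.6 .
```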
The remaining analyses are conducted on the subset of reasoners who answered the standard question variants incorrectly (see the Supplemental Materials for responses from those who answered the standard question correctly). This analytical approach is consistent with the De Neys et al. (2013) report. Finally, note that those who answered the standard question incorrectly are termed “incorrect reasoners,” and those who answered the standard question correctly, “correct reasoners.” Even cognitive misers are expected to be able to answer the isomorphic control correctly, so accuracy on the isomorphic controls has no bearing on whether a given participant is classified as either a “correct reasoner” or an “incorrect reasoner.”
Confidence
The response confidence scores are reported here as proportions rather than percentages. Again, a 3 × 2 mixed factorial analysis of variance was conducted and limited to the scores from incorrect reasoners. Importantly, consistent with De Neys et al. (2013), incorrect responders were less confident in their erroneous standard responses than in their control responses (see Table 1), F(1, 224) = 3.995, p = .047, ηp2 = .02.
Table 1.
Mean (and Standard Deviation) Confidence, Opinion, Recall, Recognition, and Response Time for the Incorrect Reasoners on the Standard and Control Problems
| Measure | Standard | Control |
|---|---|---|
| Confidence | .889 (.240) | .923 (.202) |
| Opinion | .804 (.201) | .837 (.204) |
| Recall | .697 (.462) | .254 (.358) |
| Recognition | .824 (.382) | .304 (.461) |
| Response time | 34.041 (45.724) | 21.191 (34.479) |
Note. The response time measures are in seconds. The confidence and opinion measures refer to proportions. The recall and recognition memory measures refer to the proportion of responses that include the “more than” phrase.
Note that the difference in confidence scores between the standard and control responses was at least as large for correct reasoners (ηp2 = .06; see Figure 1 and the Supplemental Materials). Likewise, De Neys et al. (2013) found a significant difference for correct reasoners, although in their case the difference was smaller for correct than for incorrect reasoners.
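For readers who wish to reproduce the 3 × 2 mixed factorial analyses reported in this and the following sections, a minimal sketch in long format is given below, assuming the pingouin library; the column names and toy values are illustrative and are not the authors’ analysis code:

```python
import pandas as pd
import pingouin as pg

# One row per incorrect reasoner per condition (long format).
long_df = pd.DataFrame({
    "subject":    [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "group":      ["MTurk1"] * 4 + ["MTurk2"] * 4 + ["UCB"] * 4,
    "condition":  ["standard", "control"] * 6,
    "confidence": [0.80, 1.00, 1.00, 1.00, 0.60, 0.90,
                   0.70, 0.95, 0.90, 1.00, 0.50, 0.85],
})

# Mixed ANOVA: within-subject condition, between-subjects group.
aov = pg.mixed_anova(data=long_df, dv="confidence", within="condition",
                     subject="subject", between="group")
print(aov[["Source", "F", "p-unc", "np2"]])
```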
Figure 1.

Response confidence scores (top panel) and opinion judgment scores (bottom panel) (in proportions) for standard and control variants of the bat-and-ball problem by reasoners incorrect and correct on the standard question. Note that the scale on the vertical axis differs in the two panels. The error bars represent standard errors of the mean.
The modal response was 100%, with 148 of the 227 incorrect reasoners completely confident in their erroneous response to the standard question. Together, this finding and the observed mean response confidence of 89% for the standard problem imply that many incorrect reasoners did not actually question their judgment.
Opinion
As with the response confidence scores, the opinion judgments are reported here as proportions rather than percentages. There was a positive relationship between standard question confidence and standard question opinion judgments for incorrect reasoners, r(225) = .480, p < .001, such that incorrect reasoners who were less confident in their response were also less likely to think other reasoners could answer the standard question correctly. This strong relationship lends support to the notion that opinion judgments and response confidence scores reflect similar cognitive processes.
Again, a 3 × 2 mixed factorial analysis of variance was conducted and limited to the scores of incorrect reasoners. For incorrect respondents’ opinions of others’ ability to answer correctly, there was a main effect of condition such that incorrect respondents reported significantly less confidence in other reasoners’ standard responses than in their control responses (see Table 1), F(1, 224) = 6.218, p = .013, ηp2 = .03. In addition, there were effects involving subject group (again see the Supplemental Materials). However, as with the response confidence index, the difference in opinion judgments between standard and control was at least as large for the correct reasoners (η2 = .30; see Figure 1 and Supplemental Materials). The very large difference for correct reasoners on opinion judgments suggests that ceiling effects masked the smaller difference for correct reasoners on response confidence both here and in the study by De Neys et al. (2013). Together, these findings suggest that the indirect measurement of confidence was, as predicted, more statistically sensitive and an appropriate alternative measurement of intuitive reasoners’ error sensitivity.
Memory
For the recall response, a mixed effects logistic regression (with subject as the random variable) was conducted due to the dichotomous dependent variable (with or without “more than”), considering only those participants who wrote down an answer that could be coded as with or without the “more than” phrase (e.g., “don’t know” responses were excluded). Incorrect reasoners usually recalled the standard problem, but not the control, as containing “more than” (see Table 1), with this effect of condition significant, b = 3.02, odds ratio (OR) = 20.44, χ2 = 21.69, p < .001, 95% confidence interval (CI) [5.74, 72.75].
For recognition, all incorrect reasoners’ responses were included in the analyses because their answers came in the form of a multiple-choice forced response. A mixed effects logistic regression (with subject as the random variable) was again conducted due to the dichotomous dependent variable (with or without “more than”). Once again, incorrect reasoners usually recognized the standard problem, but not the control, as containing “more than” (see Table 1), with this effect of condition significant, b = 3.00, odds ratio (OR) = 20.07, χ2 = 50.34, p < .001, 95% confidence interval (CI) [8.76, 45.94].
Together, these results suggest that intuitive respondents were generally cognizant of the error-eliciting “more than” phrase.
Response Time
Response times were log transformed for the analyses although the means and standard deviations reported here are untransformed values (in seconds). Again, a 3 × 2 mixed factorial analysis of variance was conducted. In this case, as mentioned earlier, the analysis was limited to the response times from incorrect reasoners, defined as those who were incorrect on the standard problem (ignoring the correctness of their response on the control problem). There was a main effect of condition such that incorrect reasoners responded faster to the control question than to the standard question (see Table 1), F(1, 224) = 38.833, p < .001, ηp2 = .15.
As predicted, response time and confidence on the standard question were negatively correlated for incorrect reasoners, both for response confidence, r(225) = −.271, p < .001, and for opinion judgments, r(225) = −.194, p = .003, such that the less time incorrect reasoners spent answering the standard problem, the higher their confidence, consistent with previous findings (e.g., Thompson et al., 2013). Also consistent with those findings, the difference in response time between standard and control was negatively correlated for incorrect reasoners with both the difference in response confidence between standard and control, r(225) = −.289, p < .001, and the difference in opinion judgments between standard and control, r(225) = −.154, p = .020, such that longer times for the standard relative to the control condition were associated with lower confidence for the standard relative to the control condition. There was also a positive relationship between the difference in opinion judgments and the difference in response confidence, r(225) = .293, p < .001. These difference scores were computed by subtracting the standard confidence from the control confidence (for both direct and indirect indices) to yield a single score representing the confidence decrease produced by the presence of the “more than” phrase. This approach allowed us to examine the simple relationship between the change in confidence and the change in response time between conditions.
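A brief sketch of the difference-score computation and correlation described above, using the control-minus-standard convention for both measures; the column names and toy values are illustrative, not the authors’ analysis code:

```python
import pandas as pd
from scipy.stats import pearsonr

# One row per incorrect reasoner; column names are illustrative.
df = pd.DataFrame({
    "conf_standard": [0.80, 1.00, 0.60, 0.90],
    "conf_control":  [1.00, 1.00, 0.90, 0.95],
    "rt_standard":   [42.0, 18.5, 61.2, 30.0],  # seconds
    "rt_control":    [20.1, 15.0, 25.4, 22.3],
})

# Control minus standard: a positive conf_diff is the confidence decrease
# attributable to the "more than" phrase; the same convention is applied
# to response time before correlating the two difference scores.
df["conf_diff"] = df["conf_control"] - df["conf_standard"]
df["rt_diff"] = df["rt_control"] - df["rt_standard"]

r, p = pearsonr(df["rt_diff"], df["conf_diff"])
print(f"r = {r:.3f}, p = {p:.3f}")
```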
Discussion
The present study directly tested the unconscious attribute substitution hypothesis (e.g., Kahneman, 2011; Kahneman & Frederick, 2002) using the De Neys et al. (2013) experimental paradigm. Toward this end, two novel features were added to the comparison made by De Neys et al.: (a) an examination of recall and recognition responses for problem wording to document the extent to which attribute substitution by incorrect reasoners was consciously or unconsciously represented, and (b) an examination of an indirect measure of response confidence based on opinion judgments, which should be off the ceiling because they avoid participants’ overconfidence in their own responses. We also examined response latencies to get a better sense of the differences in response time between conditions and to gain insight into how time spent responding is related to post-decision confidence.
In terms of the recall memory response, as predicted, incorrect reasoners sometimes showed a faulty memory of the standard problem. Including the latter two samples with interference questions between decision and post-decision memory assessment, we found that 30% of incorrect reasoners who gave classifiable responses neglected to include the “more than” phrase in their recall response (and 18% failed to recognize it) for the standard question (see Table 1), which provides evidence that the mental representation of those incorrect reasoners involves a simpler subtraction problem. Hence, this finding is consistent with the argument that participants’ errors on the standard bat-and-ball problem could have arisen from an unconscious attribute substitution error process. However, this percentage should be far higher if the majority of incorrect reasoners were engaging in unconscious attribute substitution. The fact that 70% of the incorrect reasoners with classifiable responses recalled the “more than” phrase (and 82% recognized it) in the standard question suggests that the phrase was not simply overlooked; the critical information was consciously processed by most incorrect participants. Instead, incorrect reasoners displayed evidence that they processed and recalled the words of the question but did not fully appreciate their meaning. In other words, information detection alone is not sufficient; proper integration of the semantic content (particularly the critical “more than” phrase) is a prerequisite for problem solution. This finding is consistent with Mata et al.’s (2014) investigation of problem representation via a change detection paradigm. We found that effect sizes (odds ratios) for correct reasoners’ memory representations were several orders of magnitude larger than those for incorrect reasoners, which is consistent with previous findings that better problem representation is related to an increased ability to solve the problem correctly (see Supplemental Materials). The present study indicates that, although better representation is indeed associated with accuracy, information detection alone is not enough; the ability to utilize the post-detection semantic content effectively is of more central concern. Indeed, this interpretation is consistent with previous findings revealing that reasoning errors can stem simply from a lack of knowledge or motivation to set up the logic of the problem (e.g., Agnoli & Krantz, 1989; Scherer, Yates, Baker, & Valentine, 2017).
In the present study, De Neys et al.’s (2013) confidence question yielded a difference between standard and control problems for incorrect reasoners like that found in the earlier study. Incorrect participants were likewise less confident that others would be able to respond accurately in the standard condition than in the control condition, and as predicted, these opinion judgments were less susceptible to ceiling effects. These results are consistent with the role of error sensitivity discussed by De Neys et al. Furthermore, because the direct and indirect measurements of confidence were strongly correlated, and because the indirect measurement was less susceptible to ceiling effects, we can recommend its use as a complement to the more direct, intra-individual measurement of post-decision error sensitivity. However, this recommendation comes with two caveats: (a) Although our results suggest significant shared variability, it is possible that indirect measurements of confidence index a separate error signal, particularly social metacognitive error sensitivity (e.g., Mata & Almeida, 2014); and (b) a large sample is necessary to ensure sufficient statistical power. Mata and Almeida did not obtain a sensitivity effect among intuitive reasoners; however, they speculated that low statistical power could be responsible for the absence of an observed effect.
Although the present findings do not provide evidence, affirmatively or negatively, that intuitive respondents err for reasons other than substituting a simpler problem for the actual, more challenging one, alternative explanations for error processing should be considered. Indeed, even according to Kahneman’s theorizing, correct respondents are thought to substitute initially and then go on to engage in further, Type 2 consideration to solve the problem (e.g., Kahneman, 2011; Kahneman & Frederick, 2002). Nevertheless, alternative explanations exist: varying degrees of numeracy among men and women (e.g., Sinayev & Peters, 2015) and Actively Open-minded Thinking among male respondents in particular (Baron, Scott, Fincher, & Metz, 2015; Campitelli & Gerrans, 2014) have been shown to predict accuracy, and these observations do not explicitly rely on Kahneman’s attribute substitution hypothesis. This literature illustrates well how multiple sources of cognitive variability, not just attribute substitution, might underlie errors on the bat-and-ball problem. In any event, an error in reasoning has occurred. Because we cannot determine for certain that a substitution error occurred, “error sensitivity,” rather than “substitution sensitivity,” seems to be the more appropriate term.
Our consideration of response time was also illuminating. Consistent with previous research (e.g., Johnson et al., 2016; Kelley & Lindsay, 1993; Thompson et al., 2013), response confidence was shown to decrease with increases in response time. However, the causal relationship between the time spent responding and confidence in the response remains an open question. Specifically, it is not clear whether the additional time spent drives reductions in confidence or whether instead participants who are less confident take more time to consider their responses. It is possible that sensitivity to a disfluency signal, a low Feeling of Rightness, could explain the observed relationship between processing speed and response confidence more parsimoniously than could sensitivity to the substitution process itself. A logical follow-up to the present work would be to determine whether the additional arithmetic operation required in the standard version relative to the isomorphic control version of the problem introduced by De Neys et al. (2013) produced a disfluency sensitivity effect rather than a substitution sensitivity effect per se. The present latency results do not allow us to decide between these two views.
Acknowledgments
This research was supported in part by NASA grant NNX14AB75A to the University of Colorado Boulder. This project is dedicated to the late Lesley G. Hathorn, Ph.D., who spearheaded the idea to explore the memory representation of the bat-and-ball problem. We are also grateful to James Foster and James Kole for methodological guidance, to the other members of the Center for Research on Training for their useful feedback about this study, and to Ellen Peters, Bill Raymond, Shaw Ketels, and Seth Gans for insightful comments on preliminary versions of this manuscript. Finally, we would like to thank Wim De Neys who provided helpful comments and suggestions on the present research and on an earlier unpublished manuscript by Hathorn and Healy, which was written before Hathorn’s untimely death, and two anonymous reviewers who provided helpful comments on earlier versions of the present manuscript.
Footnotes
Final preparation of this article occurred while Alice Healy was a visiting scholar in the laboratory of Professor Michael Kahana at the University of Pennsylvania.
The full data set is available from the authors upon request.
References
- Aczel B, Szollosi A, & Bago B (2016). Lax monitoring versus logical intuition: The determinants of confidence in conjunction fallacy. Thinking & Reasoning, 22, 99–117. doi: 10.1080/13546783.2015.1062801
- Agnoli F, & Krantz DH (1989). Suppressing natural heuristics by formal instruction: The case of the conjunction fallacy. Cognitive Psychology, 21, 515–550. doi: 10.1016/0010-0285(89)90017-0
- Arkes HR, Christensen C, Lai C, & Blumer C (1987). Two methods of reducing overconfidence. Organizational Behavior and Human Decision Processes, 39, 133–144. doi: 10.1016/0749-5978(87)90049-5
- Bago B, & De Neys W (2017). Fast logic? Examining the time course assumption of dual process theory. Cognition, 158, 90–109. doi: 10.1016/j.cognition.2016.10.014
- Baron J, Scott S, Fincher K, & Metz SE (2015). Why does the Cognitive Reflection Test (sometimes) predict utilitarian moral judgment (and other things)? Journal of Applied Research in Memory and Cognition, 4, 265–284. doi: 10.1016/j.jarmac.2014.09.003
- Bourgeois-Gironde S, & Vanderhenst J-B (2009). How to open the door to System 2: Debiasing the Bat and Ball problem. In Watanabe S, Blaisdell AP, Huber L, & Young A (Eds.), Rational animals, irrational humans (pp. 235–252). Tokyo: Keio University Press.
- Campitelli G, & Gerrans P (2014). Does the cognitive reflection test measure cognitive reflection? A mathematical modeling approach. Memory & Cognition, 42, 434–447. doi: 10.3758/s13421-013-0367-9
- De Neys W (2012). Bias and conflict: A case for logical intuitions. Perspectives on Psychological Science, 7, 28–38. doi: 10.1177/1745691611429354
- De Neys W (2014). Conflict detection, dual processes, and logical intuitions: Some clarifications. Thinking & Reasoning, 20, 169–187. doi: 10.1080/13546783.2013.854725
- De Neys W, & Bonnefon J-F (2013). The ‘whys’ and ‘whens’ of individual differences in thinking biases. Trends in Cognitive Sciences, 17, 172–178. doi: 10.1016/j.tics.2013.02.001
- De Neys W, Cromheeke S, & Osman M (2011). Biased but in doubt: Conflict and decision confidence. PLoS ONE, 6(1), e15954. doi: 10.1371/journal.pone.0015954
- De Neys W, & Glumicic T (2008). Conflict monitoring in dual process theories of thinking. Cognition, 106, 1248–1299. doi: 10.1016/j.cognition.2007.06.002
- De Neys W, Rossi S, & Houdé O (2013). Bats, balls, and substitution sensitivity: Cognitive misers are no happy fools. Psychonomic Bulletin & Review, 20, 269–273. doi: 10.3758/s13423-013-0384-5
- De Neys W, Vartanian O, & Goel V (2008). Smarter than we think: When our brains detect that we are biased. Psychological Science, 19, 483–489. doi: 10.1111/j.1467-9280.2008.02113.x
- Evans JSt BT (2008). Dual-processing accounts of reasoning, judgment, and social cognition. Annual Review of Psychology, 59, 255–278. doi: 10.1146/annurev.psych.59.103006.093629
- Evans JSt BT, & Stanovich KE (2013). Dual-process theories of higher cognition: Advancing the debate. Perspectives on Psychological Science, 8, 223–241. doi: 10.1177/1745691612460685
- Faul F, Erdfelder E, Lang A-G, & Buchner A (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175–191. doi: 10.3758/BF03193146
- Fischhoff B, Slovic P, & Lichtenstein S (1977). Knowing with certainty: The appropriateness of extreme confidence. Journal of Experimental Psychology: Human Perception and Performance, 3, 552–564. doi: 10.1037/0096-1523.3.4.552
- Frederick S (2005). Cognitive reflection and decision making. The Journal of Economic Perspectives, 19, 25–42. doi: 10.1257/089533005775196732
- Frey D, Johnson ED, & De Neys W (2017). Individual differences in conflict detection during reasoning. Quarterly Journal of Experimental Psychology, 71, 1188–1208. doi: 10.1080/17470218.2017.1313283
- Gangemi A, Bourgeois-Gironde S, & Mancini F (2015). Feelings of error in reasoning—in search of a phenomenon. Thinking & Reasoning, 21, 383–396. doi: 10.1080/13546783.2014.980755
- González-Vallejo C, & Bonham A (2007). Aligning confidence with accuracy: Revisiting the role of feedback. Acta Psychologica, 125, 221–239. doi: 10.1016/j.actpsy.2006.07.010
- Hathorn LG, & Healy AF (2015, May). Decision-making and the bat-and-ball problem. Poster presented at the 27th APS Annual Convention, New York, NY.
- Hathorn LG, & Healy AF (2016, May). Attribute substitution in the bat-and-ball problem. Poster presented at the 28th APS Annual Convention, Chicago, IL.
- Hoover JD, & Healy AF (2017). Algebraic reasoning and bat-and-ball problem variants: Solving isomorphic algebra first facilitates problem solving later. Psychonomic Bulletin & Review, 24, 1922–1928. doi: 10.3758/s13423-017-1241-8
- Johnson ED, Tubau E, & De Neys W (2016). The Doubting System 1: Evidence for automatic substitution sensitivity. Acta Psychologica, 164, 56–64. doi: 10.1016/j.actpsy.2015.12.008
- Kahneman D (2011). Thinking, fast and slow. New York, NY: Farrar, Straus and Giroux.
- Kahneman D, & Frederick S (2002). Representativeness revisited: Attribute substitution in intuitive judgment. In Gilovich T, Griffin D, & Kahneman D (Eds.), Heuristics and biases: The psychology of intuitive judgment (pp. 49–81). Cambridge, UK: Cambridge University Press.
- Kelley CM, & Lindsay DS (1993). Remembering mistaken for knowing: Ease of retrieval as a basis for confidence in answers to general knowledge questions. Journal of Memory and Language, 32, 1–24. doi: 10.1006/jmla.1993.1001
- Keren G (1988). On the ability of monitoring non-veridical perceptions and uncertain knowledge: Some calibration studies. Acta Psychologica, 67, 95–119. doi: 10.1016/0001-6918(88)90007-8
- Keren G (1991). Calibration and probability judgments: Conceptual and methodological issues. Acta Psychologica, 77, 217–273. doi: 10.1016/0001-6918(91)90036-Y
- Keren G (1997). On the calibration of probability judgments: Some critical comments and alternative perspectives. Journal of Behavioral Decision Making, 10, 269–278.
- Klayman J, Soll JB, González-Vallejo C, & Barlas S (1999). Overconfidence: It depends on how, what, and whom you ask. Organizational Behavior and Human Decision Processes, 79, 216–247. doi: 10.1006/obhd.1999.2847
- Koriat A, Lichtenstein S, & Fischhoff B (1980). Reasons for confidence. Journal of Experimental Psychology: Human Learning and Memory, 6, 107–118. doi: 10.1037/0278-7393.6.2.107
- Lichtenstein S, & Fischhoff B (1977). Do those who know more also know more about how much they know? Organizational Behavior and Human Performance, 20, 159–183. doi: 10.1016/0030-5073(77)90001-0
- Lichtenstein S, & Fischhoff B (1980). Training for calibration. Organizational Behavior and Human Performance, 26, 149–171. doi: 10.1016/0030-5073(80)90052-5
- Lichtenstein S, Fischhoff B, & Phillips LD (1982). Calibration of probabilities: The state of the art to 1980. In Kahneman D, Slovic P, & Tversky A (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 306–334). Cambridge, UK: Cambridge University Press.
- Mata A, & Almeida T (2014). Using metacognitive cues to infer others’ thinking. Judgment and Decision Making, 9, 349–359.
- Mata A, & Ferreira MB (2018). Response: Commentary: Seeing the conflict: An attentional account of reasoning errors. Frontiers in Psychology, 9, 24. doi: 10.3389/fpsyg.2018.00024
- Mata A, Ferreira MB, Voss A, & Kollei T (2017). Seeing the conflict: An attentional account of reasoning errors. Psychonomic Bulletin & Review, 24, 1980–1986. doi: 10.3758/s13423-017-1234-7
- Mata A, Schubert A-L, & Ferreira MB (2014). The role of language comprehension in reasoning: How “good-enough” representations induce biases. Cognition, 133, 457–463. doi: 10.1016/j.cognition.2014.07.011
- Nelson TO, & Narens L (1990). Metamemory: A theoretical framework and new findings. In The psychology of learning and motivation (Vol. 26, pp. 125–173). Academic Press. doi: 10.1016/S0079-7421(08)60053-5
- Newell BR, & Shanks DR (2014). Unconscious influences on decision making: A critical review. Behavioral and Brain Sciences, 37, 1–19. doi: 10.1017/S0140525X12003214
- Pennycook G, Fugelsang JA, & Koehler DJ (2012). Are we good at detecting conflict during reasoning? Cognition, 124, 101–106. doi: 10.1016/j.cognition.2012.04.004
- Scherer LD, Yates JF, Baker SG, & Valentine KD (2017). The influence of effortful thought and cognitive proficiencies on the conjunction fallacy: Implications for dual-process theories of reasoning and judgment. Personality and Social Psychology Bulletin, 43, 874–887. doi: 10.1177/0146167217700607
- Shaw JS III (1996). Increases in eyewitness confidence resulting from postevent questioning. Journal of Experimental Psychology: Applied, 2, 126–146. doi: 10.1037/1076-898X.2.2.126
- Sinayev A, & Peters E (2015). Cognitive reflection vs. calculation in decision making. Frontiers in Psychology, 6, 532. doi: 10.3389/fpsyg.2015.00532
- Singmann H, Klauer KC, & Kellen D (2014). Intuitive logic revisited: New data and a Bayesian mixed model meta-analysis. PLOS ONE, 9, e94223. doi: 10.1371/journal.pone.0094223
- Soll JB (1996). Determinants of overconfidence and miscalibration: The roles of random error and ecological structure. Organizational Behavior and Human Decision Processes, 65, 117–137. doi: 10.1006/obhd.1996.0011
- Stanovich KE, & West RF (2002). Individual differences in reasoning: Implications for the rationality debate? In Gilovich T, Griffin D, & Kahneman D (Eds.), Heuristics and biases: The psychology of intuitive judgment (pp. 421–440). Cambridge, UK: Cambridge University Press. doi: 10.1017/CBO9780511808098.026
- Szollosi A, Bago B, Szaszi B, & Aczel B (2017). Exploring the determinants of confidence in the bat-and-ball problem. Acta Psychologica, 180, 1–7. doi: 10.1016/j.actpsy.2017.08.003
- Thompson VA, Prowse Turner JA, & Pennycook G (2011). Intuition, reason, and metacognition. Cognitive Psychology, 63, 107–140. doi: 10.1016/j.cogpsych.2011.06.001
- Thompson VA, Prowse Turner JA, Pennycook G, Ball LJ, Brack H, Ophir Y, & Ackerman R (2013). The role of answer fluency and perceptual fluency as metacognitive cues for initiating analytic thinking. Cognition, 128, 237–251. doi: 10.1016/j.cognition.2012.09.012
- Travers E, Rolison JJ, & Feeney A (2016). The time course of conflict on the Cognitive Reflection Test. Cognition, 150, 109–118. doi: 10.1016/j.cognition.2016.01.015
- Tversky A, & Kahneman D (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131. doi: 10.1126/science.185.4157.1124