Abstract
Human randomness perception is commonly described as biased because, when generating random sequences, humans tend to systematically under- and overrepresent certain subsequences relative to the number expected from an unbiased random process. In a purely theoretical analysis we have previously suggested that common misperceptions of randomness may actually reflect genuine aspects of the statistical environment, once we take into account the cognitive constraints that shape how that environment is actually experienced (Hahn & Warren, Psychological Review, 2009). In the present study we undertake an empirical test of this account, comparing human-generated binary sequences against those of an unbiased random process in two experiments. We suggest that comparing human and theoretically unbiased sequences using metrics that reflect the constraints imposed on human experience provides a more meaningful picture of lay people’s ability to perceive randomness. Finally, we propose a simple generative model of human random sequence generation inspired by the Hahn and Warren account. Taken together, our results question the notion of bias in human randomness perception.
Keywords: cognitive bias, perception of randomness, gambler’s fallacy
Public Significance Statement
The dominant perspective in experimental psychology is that human judgment and decision making are flawed. This is particularly evident in research on human perception of randomness. Here we explore this idea, presenting several analyses of data from an experiment in which participants are asked to generate a sequence of outcomes from a binary random process (like a coin toss). Although behavior does depart from the output of a genuinely random source, the extent of this departure depends on how performance is characterized and on whether constraints on human memory and attention span are taken into account. We find that when such constraints are considered, and appropriate performance measures are used, humans actually match the random source rather well. We argue, more generally, that it may be problematic to emphasize errors in human judgment and decision making without taking account of appropriate constraints.
Randomness is the flip side of statistical structure. Consequently, researchers interested in human beings as “intuitive statisticians” have long been interested in people’s ability to identify patterns of data as random. A long tradition of research has reached rather negative conclusions about people’s intuitive understanding of randomness. Whereas early studies focused primarily on people’s ability to generate random sequences (see, e.g., Wagenaar, 1972), later work has also examined people’s ability to judge sequences as random (see, e.g., Bar-Hillel & Wagenaar, 1991; Kahneman & Tversky, 1972; and see Oskarsson, Van Boven, McClelland, & Hastie, 2009 for an extensive review).
Studies of both sequence generation and sequence judgment have found evidence of similar biases, in particular a bias toward overalternation between the different possible outcomes, such as “heads” (H) or “tails” (T), in binary sequences. This alternation bias has frequently been interpreted as evidence for belief in the “gambler’s fallacy” (GF), that is, the erroneous belief that an increasing run of one outcome (e.g., HHHHHH . . .) makes the other outcome ever more likely (but see, e.g., Edwards, 1961).1 Such a belief, which can indeed be found among gamblers around the world (Clotfelter & Cook, 1993; Croson & Sundali, 2005; Terrell, 1998; Toneatto, Blitz-Miller, Calderwood, Dragonetti, & Tsanos, 1997), may reflect a mistaken conception of random processes as “self-correcting” in such a way as to maintain an equal balance between the possible outcomes (for other explanations see, e.g., the review of research on the GF by Hahn, 2011).
However, the concept of randomness is a difficult, and often counterintuitive, one not just for gamblers or experimental participants, but also for experimenters (on the concept of randomness see, e.g., Beltrami, 1999), and extensive critiques have shown much of the empirical research on lay understanding of randomness to be conceptually flawed (see in particular, Ayton, Hunt, & Wright, 1989; Nickerson, 2002; but also Lopes, 1982). Aforementioned evidence from real-world gamblers aside, it is less clear than might be expected how good or bad lay people’s ability to both discern and mimic the output of random sources actually is.
Research with novel tasks that do not suffer from the conceptual flaws identified has tended to confirm some element of bias in people’s performance (e.g., Olivola & Oppenheimer, 2008; Rapoport & Budescu, 1992), while also finding that participants’ performance is considerably better than deemed by past research (see, e.g., Lopes & Oden, 1987; Nickerson & Butler, 2009).
In particular, it has been argued that people’s performance may actually be quite good given their actual experience of random sequences, whether inside or outside the lab. Williams and Griffiths (2013) show how seemingly poor performance on randomness judgment tasks may stem from the genuine paucity of the available statistical evidence. Hahn and Warren (2009) similarly argue that common biases and misperceptions of randomness may actually reflect genuine aspects of the statistical environment, once it is taken into account how that environment is actually experienced. Specifically, Hahn and Warren demonstrate that if human experience of a stream of binary random events is assumed to be (a) finite and (b) constrained by the limitations of short-term memory (STM) and/or attention, then based upon highly counterintuitive mathematical results, not all binary substrings are equally likely to occur.
We next describe this theoretical account in more detail, before going on to present the results of two behavioral experiments that provide evidence that human perception of randomness conforms to the theoretical treatment outlined. Finally, we present a simple generative model of human random sequence generation that reflects key features of the Hahn and Warren account.
Hahn and Warren (2009) Account of Randomness Perception
The theoretical account of randomness perception in Hahn and Warren (2009, 2010) relies upon a simple model of how a human might experience an unfolding sequence of random events. It is proposed that humans have a limited-capacity window of experience of length k that has access to the present event and the preceding k − 1 events. This window slides one event at a time through an unfolding finite sequence of length n > k. That humans could only ever experience a finite stream of events is incontrovertible. Further, given the well-characterized bounds on human STM capacity and/or attention span, this limited-capacity, sliding window of experience account seems plausible.
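To make the window model concrete, the following minimal sketch (our illustration in Python; the function name is ours) enumerates the subsequences that an observer with window length k would experience while a finite sequence unfolds:

```python
def sliding_windows(sequence, k=4):
    """Enumerate the length-k subsequences experienced as a window of
    capacity k slides one event at a time through a finite sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A sequence of length n yields n - k + 1 overlapping windows:
print(sliding_windows("01101100", k=4))
# ['0110', '1101', '1011', '0110', '1100']
```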
Crucially, when subsequences of length k are counted within a longer finite sequence of length n using the sliding window analysis suggested above, certain subsequences are more likely not to occur at all, even when the generating process is unbiased. In particular, perfect runs of one outcome have the highest nonoccurrence probability (equivalently, the lowest occurrence rate), followed by perfect alternations of the two outcomes. This highly counterintuitive mathematical result is illustrated in Figure 1B; the unbroken line represents the occurrence rates for the 16 possible subsequences of length 4. For example, the occurrence rate for the perfect run subsequence 0000 is around 0.47, meaning that this subsequence does not appear at all in around 53% of all sequences of length 20 generated by an unbiased random process. In contrast, the occurrence rate for subsequence 0001 is around 0.75, meaning that this subsequence fails to appear in only around 25% of unbiased sequences of length 20. Hahn and Warren (2009) argue that if human experience of unfolding random events mimics the sliding window, then this could explain three key tendencies of human randomness perception that are taken as evidence of bias:
(a) A tendency to think that sequences with some irregularity are more likely given an unbiased coin.

(b) An expectation of equal numbers of heads and tails within a sequence.

(c) A tendency to overalternate between outcomes when generating random sequences.
Based on theoretical data of the kind presented here (Figure 1B, unbroken line), Hahn and Warren argue that (a) is reasonable; that is, the figure demonstrates that there is statistical support for the intuition that regular subsequences (e.g., 1111, 0101) occur less often than irregular subsequences (e.g., 0100, 1101). Hahn and Warren also argue that (b) is consistent with the sliding window account: because it is difficult to distinguish between the vast majority of sequences using occurrence rate (Figure 1B, unbroken line), judgments should be based not on an explicit coding of each subsequence but on something simpler, such as the proportion of heads. Finally, Hahn and Warren argue that (c) follows directly from the sliding window account because short sequences tend to have more alternations between outcomes than expected in an infinite series (Kareev, 1992).
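These occurrence rates are straightforward to verify by simulation. The following sketch (our illustration, assuming the parameters discussed above: unbiased p = .5, sequence length n = 20, window length k = 4) estimates the proportion of sequences containing a given subsequence at least once:

```python
import random

def occurrence_rate(target, n=20, reps=100_000):
    """Estimate the proportion of unbiased n-flip sequences that contain
    the target subsequence at least once under a sliding window."""
    k = len(target)
    hits = 0
    for _ in range(reps):
        seq = "".join(random.choice("01") for _ in range(n))
        if any(seq[i:i + k] == target for i in range(n - k + 1)):
            hits += 1
    return hits / reps

print(occurrence_rate("0000"))  # approx. .47: absent from roughly 53% of sequences
print(occurrence_rate("0001"))  # approx. .75: absent from only around 25%
```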
Overview
In the present study we examine the characteristics of human randomness perception in light of the theoretical account of Hahn and Warren (2009) across two experiments. Although a range of tasks have been used previously to investigate randomness perception, sequence generation has been by far the most dominant and, accordingly, we use this task in both our experiments. In Experiment 1 we asked participants to first observe the output of a random source before generating a random binary sequence. In Experiment 2 we replicated Experiment 1 but also examined the effect of recent experience by comparing sequences generated both before and after exposure to the random source. To preview our results, in both experiments we find that when compared on the expected frequency of occurrence of binary subsequences, behavior departs markedly from that of an unbiased random generating process. This is a common finding in the literature, and such results give rise to the notion of bias in randomness perception, since for an unbiased random process the expected frequencies should all be equal for any specified subsequence length. However, we also show that human sequences are remarkably similar to those of an unbiased random generation process when other methods of comparison are used that are relevant to the sliding window account (e.g., subsequence occurrence rate or direct comparison of subsequence frequency distributions for a given window length), and that this is particularly evident at subsequence lengths around 4. This is a plausible length for the typical human window of experience as defined above, in line with research suggesting that the effective span of STM is 4 when strategies such as rehearsal are ruled out (Cowan, 2001, 2010). Finally, we present a simple model of binary sequence generation in humans that incorporates the key features of the Hahn and Warren (2009) account. This model generates binary outcomes with one free parameter reflecting the extent to which the probability of continuing a run of the same outcome (e.g., 111 . . . 1) is down-weighted in favor of sequences in which the run is ended (e.g., 111 . . . 0).
Experiment 1
Participants first observed blocks of binary outcome random sequences following an unbiased Bernoulli process (p = .5) and were then instructed to generate random outputs to match the properties of the observed process.
Method
Participants
Twelve undergraduate students from the University of Manchester participated on a voluntary basis and gave informed consent. Participants received course credit as payment. There were no exclusion criteria.
Materials
Participants were seated in front of a 19-inch LCD display. The experimental stimuli were presented using the Python programming language on a PC running Windows 7. Participants responded using a standard Windows keyboard.
Procedure
Participants were told they would first observe the output of a machine generating a random sequence of 1’s and 0’s, and that they should attend to it (Presentation Phase) before going on to generate a sequence (Generation Phase).
Presentation Phase: Each digit (a 1 or 0) appeared on the screen for 250 ms before being replaced by the next digit in the sequence. The display was full screen with a black background, and the digits were displayed in white 80-point Arial font in the center of the screen. To reinforce the signal provided by the random source, each digit was accompanied by a corresponding tone: 1’s by a high (1200 Hz) tone and 0’s by a low (800 Hz) tone.2 After every 20 digits the sequence paused and participants were required to complete a distractor task, which consisted of counting the number of vowels in a list of 10 words. In total, participants observed 600 digits over 30 blocks of length 20.
Generation Phase: Participants were asked to generate a new sequence representative of the one they had just observed in the Presentation Phase. They used the keyboard to press either 1 with their left hand, or 0 with their right hand. For each key press participants saw the appropriate digit on screen and heard the corresponding tone, exactly as in the presentation phase. As in the Presentation Phase, participants generated 600 digits in 30 blocks of 20 and the same distractor task was used in between each block.
Data analysis
We compared the statistical properties of sequences generated by a truly random Bernoulli process (p = .5) with those generated by our participants (N = 12). Based on evidence that the effective span of short-term memory is 4 items when strategies such as rehearsal are ruled out (Cowan, 2001, 2010), we describe our analysis and present results for k = 4 only. However, we have repeated our preliminary analyses for other values from k = 3 to k = 6 (see supplemental materials). For each participant, and each of the 30 blocks of data collected, we slid a window of length k = 4 through the 20-bit sequence of generated outcomes. From 12 participants generating 30 × 20-bit sequences we had 360 sequences over which to assess performance. Aggregating data across observers, we undertook the following four analyses to characterize performance in different ways.
Analysis 1: We calculated the average observed frequency of each of the 16 possible subsequences per 20-bit sequence. Note that for an unbiased random process the expected frequency of each subsequence is 1.0625 per 20-bit sequence (17 window positions × 1/16). When randomness perception is referred to as biased, this is typically based on the observation that participant-generated subsequences do not occur with equal frequency (e.g., alternating subsequences are overrepresented and runs are underrepresented).
Analysis 2: We calculated the occurrence rate—that is, the proportion of 20-bit sequences that contained at least one occurrence of each of the 16 possible subsequences. Note that this metric is the complement of the nonoccurrence probability described by Hahn and Warren (2009). Even for an unbiased random process this metric will not be the same for all subsequences (see Hahn & Warren, 2009, and Figure 1B).
Analysis 3: We generated histograms illustrating the proportion of 20-bit sequences containing 0, 1, 2, . . . occurrences of three subsequences, AAAA, ABAB, and AAAB (averaged over the A = 1, B = 0 case and vice versa), that are particularly interesting under the Hahn and Warren (2009) account. Subsequence 0000 (and its complement 1111) has special status since its nonoccurrence rate, for plausible values of n and k, is markedly higher than that of the other subsequences. Similarly, subsequence 0101 (and its complement 1010) is interesting because its nonoccurrence rate is higher than that of the other subsequences (second only to the perfect runs). Subsequence 0001 (and its complement 1110) is interesting when compared with a perfect run of the same length; this comparison is relevant to the gambler’s fallacy phenomenon. Note that Analysis 1 is equivalent to calculating the expected value of such distributions for each of the 16 subsequences.
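For illustration, the Analysis 3 distribution for the AAAA subsequence under an unbiased process can be approximated as follows (a sketch under our assumptions, not the study’s analysis code); the A = 0 and A = 1 cases are pooled as described above:

```python
import random
from collections import Counter

def count_occurrences(seq, target):
    """Number of sliding window positions at which target appears in seq."""
    k = len(target)
    return sum(seq[i:i + k] == target for i in range(len(seq) - k + 1))

dist = Counter()
for _ in range(50_000):
    seq = "".join(random.choice("01") for _ in range(20))
    for target in ("0000", "1111"):      # pool the A = 0 and A = 1 cases
        dist[count_occurrences(seq, target)] += 1

total = sum(dist.values())
for freq in sorted(dist):
    print(freq, round(dist[freq] / total, 3))   # e.g., about .53 at frequency 0
```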
Analysis 4: The histograms generated in Analysis 3 show significant positive skew. Consequently, we generated boxplots illustrating the median, interquartile range (IQR), and extreme data for the distributions obtained in Analysis 3.
We also generated the same amount of data (360 × 20-bit sequences) as that obtained from human participants using a genuinely unbiased Bernoulli process (p = .5). We refer to these simulated sequences as the theoretically unbiased (TU) data-set; their properties were analyzed in an identical manner to the human data. By repeatedly generating (N = 1,000) TU data-sets we were able to place confidence bounds on the metrics described in Analyses 1 and 2 for a TU participant.
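A sketch of this procedure (our reconstruction under the stated assumptions; function and variable names are ours):

```python
import random

def tu_dataset(n_seqs=360, n=20):
    """One theoretically unbiased data-set matching the human data in size."""
    return ["".join(random.choice("01") for _ in range(n)) for _ in range(n_seqs)]

def occ_rate(dataset, target):
    """Proportion of sequences in the data-set containing target at least once."""
    k = len(target)
    return sum(any(s[i:i + k] == target for i in range(len(s) - k + 1))
               for s in dataset) / len(dataset)

# Repeatedly generate TU data-sets to place 95% bounds on one metric:
rates = sorted(occ_rate(tu_dataset(), "0000") for _ in range(1000))
print(rates[25], rates[974])   # approximate 95% confidence interval
```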
Results
In Figure 1A the dots represent the observed expected values of human-generated subsequence frequencies (Analysis 1) at window length 4. The unbroken black lines represent the equivalent metric for the TU participant. The dashed lines represent the 95% confidence interval (CI) on the TU data. Note that the TU expected frequencies are the same across subsequences since in an unbiased random process all subsequences at all window lengths should be equally represented (e.g., see Beltrami, 1999). Although the majority of the human data lies within the CI for the TU data, there are some clear departures and there appears to be systematic over- and underrepresentation of certain subsequences relative to the TU data. This analysis illustrates the standard description of human random sequence generation as biased. Relative to the TU data, the perfect runs are clearly underrepresented and 10 of the other 14 subsequences are overrepresented.
Figure 1B shows the outcome of Analysis 2 for window length 4. The dots represent the occurrence rate—that is, the proportion of the 360 blocks on which a subsequence occurred at least once—for human participants. Respectively, the solid black and dashed lines illustrate the equivalent occurrence rate and 95% CI for the TU data. Using this analysis the human and TU data share several common features, including a marked decrease in occurrence rate for perfect runs. In addition the human data appear to follow the fluctuations in the TU data with high correlation between the sequence occurrence rates (r = .971).
We also undertook a follow-up analysis to further investigate the high correlation observed in Figure 1B. In particular, one might ask how remarkable it is to find such a high correlation, and what degree of correlation might arise by mathematical necessity for any process that even crudely matches the properties of a genuinely random source. In other words, how closely does a generating source need to match a random process to give rise to the degree of distributional match observed in our data?
A simple thought experiment illustrates the issue. A truly random source has an expected long-term alternation rate of .5. This alternation rate could be matched perfectly by generating a sequence of perfectly alternating 0s and 1s (i.e., 0101010101 . . .). Though this sequence would match several of the statistical properties of sequences produced by random generating sources, it would fail to match the subsequence distribution statistics shown in Figure 1A and 1B.
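The point is immediate in code: under a sliding window, the perfectly alternating sequence only ever produces 2 of the 16 possible length-4 subsequences.

```python
# One 20-bit block of perfect alternation matches the .5 alternation rate
# of a fair coin, yet it contains only the subsequences 0101 and 1010.
alternating = "01" * 10
windows = {alternating[i:i + 4] for i in range(len(alternating) - 3)}
print(windows)   # {'0101', '1010'}: the other 14 subsequences never occur
```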
In a further analysis we examined the extent to which a random generating source would need to be perturbed away from unbiased before a marked drop is observed in the correlation between its occurrence rates and those of a truly random process. We reasoned that if that correlation remains high over a large range of perturbations, then the high correlation observed in our data seems unremarkable. However, if the correlation is sensitive to small perturbations, then it seems reasonable to suggest that the high correlation reflects genuine similarity between human observers and a random process, and is worthy of note. We perturbed the unbiased process in two ways (both sketched in code after the list below):
1. By manipulating the base rate β, that is, the propensity of the source to generate 0’s and 1’s. Specifically, we changed the probability P(0) = β of generating a 0 on each step, and accordingly the probability P(1) = 1 − β of generating a 1 on each step. Clearly, for an unbiased random process β = 0.5. Increasing β above 0.5 biases the source toward producing 0’s, whereas decreasing it biases the source toward 1’s.

2. By manipulating the switching rate σ of a Markov process, that is, the propensity of the source to transition from one possible state (0 or 1) at step i to the other state at step i + 1. Specifically, we defined a 2 × 2 Markov transition matrix M with diagonal entries, reflecting the probability of remaining in the same state (0 or 1), set to 1 − σ, and off-diagonal entries, reflecting the probability of switching (from 0 to 1 or vice versa), set to σ. For an unbiased random process σ = 0.5. Increasing σ above 0.5 biases the source toward switching, whereas decreasing it produces a tendency to generate runs of the same outcome.
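Both perturbations are straightforward to implement. The following sketch is our reconstruction from the definitions just given (function names are ours):

```python
import random

def base_rate_source(beta, n=20):
    """Bernoulli source with P(0) = beta on every step; beta = 0.5 is unbiased."""
    return [0 if random.random() < beta else 1 for _ in range(n)]

def markov_source(sigma, n=20):
    """Markov source that switches state with probability sigma on each step;
    sigma = 0.5 recovers an unbiased process."""
    seq = [random.randint(0, 1)]
    for _ in range(n - 1):
        seq.append(1 - seq[-1] if random.random() < sigma else seq[-1])
    return seq
```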
The 95% CIs for the correlation between the biased and unbiased generators as a function of the β and σ parameters are shown in Figure 2. Clearly the correlation coefficient obtained between the occurrence rates at window length four is rather sensitive to small perturbations away from a truly random process for both perturbation types. Therefore, we conclude that the degree of subsequence match observed in our data genuinely speaks to the degree of appreciation participants show for the characteristic outputs of random generating sources.
As noted in Hahn and Warren (2010), although the nonoccurrence probability, or its complement the occurrence rate, is a convenient statistic with which to illustrate differences between subsequences, it is not the only statistic for which differences emerge for an unbiased random process. In Analyses 3 and 4 we illustrate significant differences between the distributions, medians, and modes of three key subsequence types: AAAA (i.e., 1111 and 0000), AAAB (i.e., 1110 and 0001), and ABAB (i.e., 1010 and 0101), and show that on these analyses the human and TU data are in close agreement.

In Figure 3 we present the outcome of Analysis 3 for the TU (Figure 3A) and human (Figure 3B) data. Note that the occurrence rates obtained in Analysis 2 for the three subsequences considered can also be seen in Figure 3 as the sum of all columns except that for frequency 0. Although there are some differences between the human and TU distributions, they are both qualitatively and quantitatively similar. Furthermore, the clear skew in these distributions suggests that it may be problematic to use the expected value (i.e., the average number of occurrences calculated in Analysis 1) as a summary statistic. To reinforce this point, note that the observed expected values of the three distributions in Figure 3B are given by the corresponding data points in Figure 1A. As noted in Analysis 1, for the human data these expected values differ across subsequences, whereas for the TU data the expected values of the three distributions in Figure 3A are identical at 1.0625. Considering the full distributions, however, the differences between human and TU data are actually rather subtle. For example, for the AAAA sequences, even though the expected value is considerably lower for human participants (around 0.7) than for the TU distribution (1.0625), this discrepancy appears to be largely driven by the fact that high-frequency sequences (e.g., beyond frequency 5) are underrepresented in the human data. Such extreme values contribute substantially to the expected value even though they are highly unlikely to be experienced. As a consequence, we suggest that placing emphasis on the difference in expected values between human data and data generated by a TU process is problematic when the two are similar on other (potentially more appropriate) statistics.
In Figure 4 we present another illustration of the data in Figure 3. These boxplots emphasize the similarity in median frequency between the human and TU data. In addition, the boxplots for the AAAB and ABAB subsequences are very similar for human and TU participants. As in Figure 3, for subsequence AAAA the increased tendency of the TU participant to generate high-frequency sequences is also evident. As noted above, this tendency is responsible for the higher expected value for TU relative to human data. We also see that, for an agent attending to the median statistic, it would be true to say that subsequence AAAA is less likely to occur than AAAB. It is possible that this plays a role in the gambler’s fallacy.
Note that although we have focused exclusively on the analyses at window length k = 4, we have data for lengths from k = 3 to k = 6. We find that up to length 5 there is good correspondence between human and simulated data on Analyses 2, 3, and 4, but beyond this value the discrepancies increase greatly3 (see supplemental materials for these analyses).
Discussion
In this experiment we have provided preliminary evidence in line with the Hahn and Warren (2009) account of randomness perception. We showed that sequences generated by human participants were remarkably similar to those from a truly random process when compared on a set of metrics that are more appropriate given the constraints on how humans might actually experience random events.
One potential issue with this study is that we used a relatively small sample of participants. Arguably this makes our result even more surprising—we did not need large amounts of data to find similarities between our account and human data. However, it would be useful to replicate our results in a larger sample.
Furthermore, it is possible that the data generated by our participants after seeing a random source say more about participants’ ability to mimic than about their concept of randomness. To a certain extent this contention can be ruled out by showing that participant-generated sequences are not well matched to the actual sequence observed. However, it would of course be more compelling to measure participants’ sequence generation behavior both before and after the random source experience. We would then be able to assess the extent to which participants’ perception of randomness was altered by that experience. If participants’ performance is altered by passively viewing a “machine generating a random sequence,” without any need to engage with the sequence (e.g., through outcome prediction as in Edwards, 1961), it would suggest both that experience of randomness is key, and that, consequently, the much bemoaned “biases” in randomness perception and generation are ultimately transient phenomena. This would be particularly compelling if participants’ generated sequences are not closely matched to the specific sequence observed, since this would suggest that participants have learned something general about random sequences rather than how to mimic a specific sequence. To investigate these issues we conducted a second experiment.
Experiment 2
Experiment 2 was very similar to Experiment 1, with the following changes. First, we used ‘H’ and ‘T’ with the cover story of a fair coin, rather than ‘1’ and ‘0’. Second, whether or not participants heard a sound accompanying the visual stimuli was manipulated as a between-subjects factor. Third, participants were asked to generate a random sequence before being exposed to one: in the first experiment participants observed and then generated, whereas in the second experiment participants generated, then observed, and then generated again. Experiment 2, therefore, allowed us to test for any learning that might occur from being exposed to a genuine random sequence. In all other respects Experiment 2 was identical to Experiment 1.
Method
Participants
Seventy-two participants from Birkbeck College, University of London were recruited and gave informed consent. Participants received £7.20 per hour as payment for their time. Participants had a mean age of 29 (SD = 11). There were 47 female participants and 25 male. There were no exclusion criteria.
Procedure
Participants first completed a generation task in which they were asked to produce a sequence representative of flipping a fair coin. Following the initial generation phase the experiment proceeded as in Experiment 1 with an observation and then generation phase. To investigate the possible moderating effects of the sounds used in Experiment 1, half of the participants in Experiment 2 did not hear an accompanying sound.
Analyses
From 72 participants generating 30 × 20-bit sequences we had 2,160 sequences per condition over which to assess performance. We conducted the same analyses as in Experiment 1, with the addition of a 2 × 2 mixed analysis of variance (ANOVA) to investigate the within-subjects effect of generation period (pre, post) and the between-subjects effect of an accompanying sound (silent, tones). The dependent variable was the Root Mean Square Error (RMSE) between the observed occurrence rate of each of the 16 possible length-four subsequences and the occurrence rate expected under the Hahn and Warren account (Analysis 2 in Experiment 1).
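The dependent variable can be sketched as follows (our illustration; the original implementation is not reported):

```python
import math

def rmse(observed_rates, tu_rates):
    """Root Mean Square Error between a participant's occurrence rates for
    the 16 length-four subsequences and those expected for a TU process."""
    assert len(observed_rates) == len(tu_rates) == 16
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(observed_rates, tu_rates))
                     / len(observed_rates))
```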
Results
Replication of analyses from Experiment 1
Average frequencies of each subsequence per 20-bit long generated sequence are shown in Figure 5.
Broadly speaking, the data in Figure 5 are consistent with those presented in Figure 1A in that there are clear departures in average frequency from those expected for the TU data. Note that the data are similar irrespective of the tones condition, but average frequencies appear closer to those of the TU data in the post condition.
Figure 6 shows the occurrence rate for each subsequence per 20-bit generated sequence in the four conditions of Experiment 2. As with the data in Figure 5, there is limited evidence of an effect of the tones factor on performance. Once again the data are in line with the results of Experiment 1: consistent with Figure 1B, when analyzed using the occurrence rate metric the human and TU data are remarkably similar. This is particularly the case in the post conditions, suggesting that experience of a random source has led to human sequence generation that is closer to the TU data.
Figure 7 shows histograms of the proportion of times a 20-bit generated sequence contained 0, 1, 2, . . . occurrences of the three subsequences AAAA (i.e., 1111 and 0000), AAAB (i.e., 1110 and 0001), and ABAB (i.e., 1010 and 0101). The data are again similar to those obtained in Analysis 3 of Experiment 1 (see Figure 3). Note that, as with Figures 5 and 6, there is evidence that exposure to the random source has affected performance and that the human-generated data are closer to the TU data (Figure 3A) in the post conditions (in particular, note that the AAAA and AAAB bars for 0 occurrences are nearer to the TU values in the post conditions).
Figure 8 shows the outcome of Analysis 4 for the conditions in Experiment 2. As in Experiment 1, these boxplots emphasize the similarity in median frequency between the human and TU data (Figure 4A). Based on Figure 4A, for an agent attending to the median statistic it would be true to say that subsequence AAAA is less likely to occur than AAAB, and this pattern emerges in the human-generated sequences also.
Tests for differences between conditions
A 2 × 2 mixed ANOVA tested the RMSE correspondence between the generated sequences and those expected under the Hahn and Warren (2009) account. Between subjects we manipulated sound (silent, tones) and within subjects we manipulated experience (pre, post). There was a significant main effect of experience, F(1, 70) = 4.25, p = .043, η² = 0.06, but not of sound, F(1, 70) = 0.07, p = .796. These results indicate that participants’ generated sequences were better described by the Hahn and Warren account after observing a genuine random sequence (mean RMSE = 0.23, SD = 0.08) than before (mean RMSE = 0.25, SD = 0.08; Figure 9).
Discussion
The results of Experiment 2 are broadly in line with those of Experiment 1 across Analyses 1–4. Replicating these findings with a much bigger data set (Experiment 1: N = 12 vs. Experiment 2: N = 72) rules out the possibility that the close correspondences observed in Experiment 1 between human and TU data on the metrics considered were due to the small sample size. In addition, we have ruled out the possibility that our data were affected by the way in which the exposure to a genuinely random source was presented (i.e., purely visual vs. combined visual and auditory information).
With respect to the issue of whether our participants were simply mimicking sequences observed, we feel we can now argue strongly against this point. By comparing the pre- and postexposure conditions we see that our participants produced behavior that was indeed closer to that of a genuinely unbiased process after having experience of outputs from such a source. However, given that the properties of the specific experience observed are not well matched to human performance (see Figure 10) we conclude that participants have learned something general about random sequences rather than how to mimic a specific sequence.
A Simple Generative Model of Binary Sequence Generation
What is it exactly that participants have learned? In this section we outline a simple generative model with one free parameter that closely approximates participant generated sequences. Inspired by the Hahn and Warren (2009) account, this model is generative in the sense that on each step a new binary digit is produced. The key characteristics of the Hahn and Warren account relevant for this model are: (a) that humans experience random events through a sliding window of experience of length k and (b) that behavior is largely driven by sensitivity to the difference between long runs and the other sequences, that is, the majority of subsequences are not distinguished by observers but perfect runs have a special status, because of the large difference in occurrence rate observed (see Figures 1B and 6) for TU sequences when n and k have plausible values.
The model starts by randomly generating k − 1 binary digits to produce the substring s₁⁻ = [d₁, d₂, . . ., dₖ₋₁], where the dᵢ are binary digits. To generate the next digit, dₖ, the model considers the possible length-k subsequences that would result from the possible digit selections. Given a binary alphabet there are only two such options, namely [s₁⁻, 0] or [s₁⁻, 1]. The model then selects one of these options, either dₖ = 0 or dₖ = 1, with probability p₀ or p₁ (= 1 − p₀) respectively, which results in the first length-k substring s₁. To implement the sliding window, that is, characteristic (a) above, this process then repeats so that on step i, sᵢ⁻ = [dᵢ, dᵢ₊₁, . . ., dᵢ₊ₖ₋₁] and sᵢ is either [sᵢ⁻, 0] or [sᵢ⁻, 1]. We propose a free parameter β that acts to “boost” or “de-boost” the relative probability of one outcome, dₖ = 0 or dₖ = 1, over the other on each step. For a genuinely random generation process β = 0.5. However, to implement characteristic (b) we suggest that the probability of an alternation after a run of the same outcome is boosted: specifically, p₁ = β > 0.5 for sᵢ⁻ = [0, 0, . . ., 0] and p₀ = β > 0.5 for sᵢ⁻ = [1, 1, . . ., 1].
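The model is simple enough to state in a few lines of code. The following sketch is our reading of the description above (in Python; the function name and default parameter values are illustrative, not from the original study):

```python
import random

def generate_sequence(n=20, k=4, beta=0.7):
    """Run-boost generative model: start from k - 1 random digits, then on
    each step boost the probability of ending a perfect run of the k - 1
    most recent outcomes. beta = 0.5 recovers an unbiased process."""
    seq = [random.randint(0, 1) for _ in range(k - 1)]
    while len(seq) < n:
        window = seq[-(k - 1):]            # the k - 1 most recent outcomes
        if all(d == 0 for d in window):    # run of 0s: boost P(next = 1)
            p1 = beta
        elif all(d == 1 for d in window):  # run of 1s: boost P(next = 0)
            p1 = 1 - beta
        else:                              # otherwise generate without bias
            p1 = 0.5
        seq.append(1 if random.random() < p1 else 0)
    return seq
```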
We used the model to generate 100,000 20-bit sequences with a plausible window of experience length of 4 (i.e., k = 4; Cowan, 2001, 2010), for values of the boost parameter β varying from 0.0 to 1.0 in steps of 0.05. Based on these model-generated sequences we estimated the occurrence rate across repetitions for each length-4 subsequence and for each value of β. We could then interpolate the resultant look-up table to estimate the occurrence rate for each subsequence as a function of β. Using this interpolation scheme we then fitted the human data (using the MATLAB fminsearch algorithm) by adjusting the boost parameter for the complementary pair of subsequences associated with stopping long runs (i.e., boosting p₁ after 000 and p₀ after 111).
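A sketch of the fitting procedure, building on the generate_sequence sketch above. We substitute a grid-plus-interpolation search for the MATLAB fminsearch call, so this is an analog rather than a reproduction, and grid resolutions are illustrative:

```python
import numpy as np

def occurrence_rates(beta, reps=5000, n=20, k=4):
    """Estimated occurrence rate of each of the 2**k subsequences for
    model-generated n-bit sequences at a given boost parameter."""
    counts = np.zeros(2 ** k)
    for _ in range(reps):
        seq = generate_sequence(n, k, beta)
        seen = {int("".join(map(str, seq[i:i + k])), 2)
                for i in range(n - k + 1)}
        for s in seen:
            counts[s] += 1
    return counts / reps

betas = np.arange(0.0, 1.01, 0.05)
table = np.array([occurrence_rates(b) for b in betas])   # the look-up table

def fit_beta(human_rates, grid=np.linspace(0.0, 1.0, 201)):
    """Interpolate the table and return the beta minimizing squared error."""
    pred = np.array([np.interp(grid, betas, table[:, j])
                     for j in range(table.shape[1])]).T
    return grid[int(np.argmin(((pred - human_rates) ** 2).sum(axis=1)))]
```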
The resultant fits are illustrated in Figure 11 and the associated residual errors across generated subsequences are illustrated in Figure 12. Note first that the fits are generally quite good but are considerably better, with smaller residuals, in the postexposure conditions, suggesting that some learning has taken place. Furthermore, note that the fitted value of β is higher in both pre-exposure generation conditions (β = 0.76, with tones; β = 0.77, without tones) than in the postexposure generation conditions (β = 0.63, with tones; β = 0.61, without tones). This result suggests that postexposure participant-generated sequences are closer to what would be expected from a genuinely random source.
In Tables 1 and 2 we summarize the results of fitting a range of other one-parameter models in which we instead boosted one of the other seven pairs of complementary subsequences. Note that boosting the pair 0001 and 1110 (β > 0.5), for example, is equivalent to de-boosting (β < 0.5) the pair 0000 and 1111, so the values of β for these two cases sum to 1 (see Table 1) and the residual errors are very similar (see Table 2).
Table 1. Boost Parameters Obtained by Fitting Procedure.
| Condition | 0000/1111 | 0001/1110 | 0010/1101 | 0011/1100 | 0100/1011 | 0101/1010 | 0110/1001 | 0111/1000 |
|---|---|---|---|---|---|---|---|---|
| (pre, silent) | .2277 | .7723 | .5833 | .4167 | .3529 | .6475 | .6632 | .3368 |
| (post, silent) | .3893 | .6101 | .5038 | .4957 | .5000 | .4984 | .5287 | .4706 |
| (pre, tone) | .2383 | .7617 | .5663 | .4339 | .3958 | .6033 | .6170 | .3830 |
| (post, tone) | .3652 | .6347 | .5267 | .4714 | .4577 | .5459 | .5362 | .4622 |
Table 2. Residual Errors Obtained by Fitting Procedure.
| Condition | 0000/1111 | 0001/1110 | 0010/1101 | 0011/1100 | 0100/1011 | 0101/1010 | 0110/1001 | 0111/1000 |
|---|---|---|---|---|---|---|---|---|
| (pre, silent) | .1113 | .1113 | .1748 | .1749 | .1405 | .1405 | .0760 | .0760 |
| (post, silent) | .0174 | .0174 | .0356 | .0347 | .0362 | .0364 | .0318 | .0318 |
| (pre, tone) | .0454 | .0454 | .1191 | .1190 | .1120 | .1120 | .0637 | .0638 |
| (post, tone) | .0130 | .0130 | .0318 | .0321 | .0322 | .0323 | .0289 | .0290 |
| Average | .0468 | .0468 | .0903 | .0901 | .0803 | .0803 | .0501 | .0501 |
Note in Table 2 that the best fits (lowest average residual error) to the human data are obtained by boosting 0001 and 1110 (although boosting 0110 and 1001, which also end a run of the same outcome, is almost as good). Consequently, we propose that, consistent with the key characteristic of the Hahn and Warren account raised above, the best fits to human data are obtained when runs are treated differently from other subsequences.
General Discussion
Summary
The purpose of the present study was to investigate the theoretical account of randomness perception put forward by Hahn and Warren (2009). In particular, we wanted to go beyond the standard account, which presents a picture of randomness perception as highly biased because the frequencies of human-generated subsequences depart from those expected from a truly random process (Figure 1A). While we acknowledge that human behavior does not correspond perfectly with the sequences generated by a genuinely random source, we suggest that the extent of this departure depends in large part on the metrics chosen to compare behavior. We present a set of alternative analyses of our data across two experiments for which human performance is remarkably similar to that of a random process. Furthermore, we suggest that these metrics provide a more appropriate means of comparison because they take into account the nature of human experience. Finally, we develop a simple model with one free parameter that implements key characteristics of the Hahn and Warren (2009) account and generates sequences matching the properties of human-generated sequences.
Mimicry or Genuine Sensitivity to More General Properties of Experienced Random Sequences?
One potential reply to this study might be that it probes (and evaluates) “mimicry”, rather than people’s conception of randomness (on the contrast between conceptions and perceptions of randomness see also, Zhao, Hahn, & Osherson, 2014). Here, it is worth bearing in mind that the majority of studies on random sequence generation have instructed participants to “imagine an unbiased coin” and “generate sequences like it” or “representative” of it (e.g., Kareev, 1992; Nickerson & Butler, 2009; see also Bar-Hillel & Wagenaar, 1991 for an overview). There is good reason for this in that research on intuitive statistics, to which randomness research has always belonged (see, e.g., Tversky & Kahneman, 1974), is not concerned with people’s metalevel explications of statistical concepts (that would amount, in effect, to probing their mathematical knowledge), but rather with intuitive statistical notions implicit in behavior. In the case of randomness, such an intuitive understanding must necessarily derive from experience, and it is the point of recent theoretical accounts such as that of Hahn and Warren (2009) and the empirical work described here to make clear just how much observed behavior may actually resemble people’s experience. Nevertheless, our study does intentionally depart from other sequence generation studies in the past by providing participants with experience of a model random process.
However, a simple analysis (see Figure 10), shows that the participant-generated sequences obtained in Experiment 2 were considerably less well correlated with the specific observed sequence than generic sequences generated by a truly random process. This result suggests that any experiential learning that did take place was unlikely to be simple mimicry. Furthermore, in Experiment 2 we probed participant behavior both before and after exposure to experience so we could assess the extent to which perception was affected. Indeed there was a clear effect of seeing output from a “machine generating a random sequence” that was viewed passively without any need to engage with the sequence (e.g., through outcome prediction as in Edwards, 1961): After exposure, participant-generated sequences were significantly closer to those generated by the random source. Taken together these results suggest that although recent experience does play a role in shaping current perception of randomness, as reflected in a generation task, these effects are not based on the ability to both acquire and reflect faithfully the distributional characteristics of the specific sample sequence seen in the lab. Instead we suggest that even from the relatively short, passive exposure, participants were genuinely sensitive to more general properties of random sequences that were then reflected in their outputs. Based on this result we suggest both that experience of randomness is key to subsequent perception, and that, consequently, the much-bemoaned “biases” in randomness perception and generation are ultimately transient phenomena.
Metrics to Assess Bias in Randomness Perception
A key result of this article is that the correspondence between human and unbiased theoretical data depends on the statistics used to parameterize performance (and this holds regardless of whether the human data have been substantially altered by the experiment itself). We have presented several analyses that emphasize the similarities. Moreover, these analyses are appropriate in that they reflect the manner in which we are likely to experience random events given the constraints imposed on human cognition—that is, as a sliding window moving one outcome at a time through a longer but finite sequence of unfolding events. The results presented confirm the argument made in Hahn and Warren (2010) that the mean (expected value) is not an appropriate statistic to characterize the distribution of subsequences generated by either a human or an unbiased process under a sliding window analysis. The level of skew in these data is high, and it is precisely for such distributions that the median and/or mode are preferable. As noted in Hahn and Warren (2010), it would seem problematic to conclude that average income was $100,000 per month in a population where most made $1,000 and very few made $1,000,000. By the same logic, based on the distributions presented in Figure 3, it is not sensible to suggest that one would expect to see (on average) about one instance of HHHH in 20 coin flips. In contrast, the median (Figures 4 and 8) and/or mode (Figures 3 and 7) statistics are more meaningful, and, based on these statistics, humans look rather well matched to the genuinely unbiased process.
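A quick simulation (ours, under the usual n = 20, k = 4 assumptions) makes the point for HHHH:

```python
import random

reps = 100_000
counts = []
for _ in range(reps):
    seq = [random.randint(0, 1) for _ in range(20)]
    counts.append(sum(seq[i:i + 4] == [1, 1, 1, 1] for i in range(17)))
counts.sort()
print(sum(counts) / reps)   # mean approx. 1.06 occurrences of HHHH per 20 flips
print(counts[reps // 2])    # median 0: most 20-flip sequences contain no HHHH
```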
Cognitive Constraints
The fact that human and unbiased sequence generation processes share common features for Analysis 2 (at a range of plausible window lengths; see supplementary materials) suggests that it is possible that on average our participants were behaving similarly to the process described in Hahn and Warren (2009) with sliding window length around 4. In practice, individuals are likely to have different and possibly nonstationary sliding window lengths. If enough data is generated, it may be possible to establish a link between individual sequence statistics and a proxy measure of window length such as digit-span or STM capacity. An investigation of this possibility will form the basis of future work.
A Generative Model of Human Random Sequence Generation
Above we presented a very simple generative model of how humans might produce random sequences. Despite its simplicity, this model provides a good description of the observed human generation data, and this is particularly the case for data generated postexposure to the genuinely random source (see Figures 11 and 12). Better fits to the data could, of course, be obtained by boosting multiple subsequences or boosting subsequences at multiple lengths. We have chosen not to do this, in part because it would be difficult to choose between such models without extensive data. In addition, the fact that a model that departs rather subtly from a genuinely random generation process captures human behavior so well emphasizes the extent to which characterizing human performance as flawed is potentially unjustified. This is especially true given that the way in which the model departs from an unbiased process (i.e., by boosting the probability that runs end) actually reflects a genuine statistical feature of such sequences under a compelling model of how humans might actually experience an unfolding sequence of random events.
Generation Tasks Versus Other Randomness Perception Tasks
At the beginning of the article we noted that other tasks (i.e., not involving sequence generation) have been used previously to investigate randomness perception. In the first instance, then, our findings are limited to the context of sequence generation. However, sequence generation has been by far the most common task used in this literature, and there is evidence that performance in another commonly used task (random sequence judgment; e.g., see Falk & Konold, 1997) is compatible with that in generation tasks (e.g., see Farmer, Warren, & Hahn, 2017). Also, other tasks, such as the ingenious (although more indirect) memory-based studies used by Olivola and Oppenheimer (2008), arguably rely on the fact that biases in perception have been observed previously in more direct tasks such as sequence generation and judgment. Consequently, the limitation to sequence generation is arguably less restrictive than it might first seem. More important, however, the specific task used is secondary to the major thrust of this article, which is aimed at the question of suitable metrics for assessing these bias phenomena in the first place, an issue that is orthogonal to the method used to observe such effects.
So, Is Randomness Perception Biased?
Nothing we present here denies that human behavior clearly departs from what might be expected of an idealized information-processing system. Under that definition, then, it is clearly the case that human randomness perception is biased. It is also the case (as noted above) that such departures can have important implications (e.g., see Toneatto et al., 1997). However, our contention, both in Hahn and Warren (2009) and in the present study, is that this bias is a natural consequence of the cognitive constraints identified and actually reflects an entirely appropriate tuning to the statistics of the environment as experienced under those constraints. In that sense, then, it seems problematic to characterize this behavior merely as a failing.
This point seems all the more important because given enough resolution, deviations between actual human and idealized, optimal performance seem inevitable (e.g., see Jarvstad, Hahn, Rushton, & Warren, 2013). This makes it more fruitful to investigate why specific deviations are observed. It is worth noting here a distinction between the Judgment and Decision Making (JDM) and Vision Science literatures. Visual illusions are not generally referred to as perceptual biases. Papers published in that literature do not generally start out with an emphasis on, and description of, how biased the system is. Rather, illusions are more likely to be discussed as unavoidable side effects of the constraints operating on the system and treated as an opportunity to identify those constraints to explain the behavior. This was once a widely held view in the cognitive literature also; indeed, much of Tversky and Kahneman’s original work on “heuristics and biases” explicitly drew out the methodological parallel to the study of perceptual illusions (Tversky & Kahneman, 1974). However, subsequent decades have arguably witnessed more negative framing of such deviations, and an increased emphasis on bias as an indicator of human cognitive frailty (for a historical overview of bias and its role in psychological research see Hahn & Harris, 2014).
We think the present results illustrate why a return to the perspective of Vision Science would be fruitful when it comes to considering randomness perception. Indeed recent results indicate the importance of not overemphasizing cognitive bias in the JDM literature more generally. A number of recent studies have suggested that when appropriate cognitive constraints are taken into account, and participants engage in well-defined tasks, their behavior is close to optimal (Howes, Warren, Farmer, El-Deredy, & Lewis, 2016; Jarvstad et al., 2013; Jarvstad, Rushton, Warren, & Hahn, 2012; Maloney, Trommershauser, & Landy, 2007; Warren, Graf, Champion, & Maloney, 2012). Furthermore, recent reappraisals of what, on first inspection, appears as irrefutable evidence of cognitive bias in JDM have shown that such behavior might actually be rational when information processing is corrupted by noise (Costello & Watts, 2014; Howes et al., 2016).
Conclusion
We provide experimental data that is consistent with the account put forward by Hahn and Warren (2009, 2010). Based on the experimental and theoretical work presented here, together with recent related work testing predictions of the Hahn and Warren (2009) account for both random sequence generation and judgment (Farmer et al., 2017), we suggest that apparent biases in human randomness perception should be reevaluated. In particular we suggest that it is problematic to suggest human behavior is flawed simply because it departs from that of an unbiased process on metrics that may not reflect cognitive and task constraints.
Footnotes
1. Human participants are prone to overalternation (assumed to be indicative of the gambler’s fallacy) in both the perception and generation of sequences that involve mechanical random devices, such as coins or roulette wheels. They also seem to attribute less alternation than may be empirically justified in the context of fluctuating human performance, displaying the so-called ‘hot hand fallacy’ (see, e.g., Ayton & Fischer, 2005).

2. In Experiment 2 we investigate the impact of the auditory stimulus over and above the visual stimulus.

3. Placing emphasis on data beyond k = 5 is problematic for other reasons. Note that the number of subsequences to consider is doubled for each increment in k; as a consequence, the amount of data we have available for each subsequence is reduced. Perhaps more importantly, the amount of data that would be required for a human to reliably discriminate and/or obtain useful summary statistics for k > 5 subsequences would be huge (see Hahn & Warren, 2009).
References
- Ayton, P., & Fischer, I. (2005). The hot hand fallacy and the gambler’s fallacy: Two faces of subjective randomness? Memory & Cognition, 32, 1369–1378. 10.3758/BF03206327
- Ayton, P., Hunt, A. J., & Wright, G. (1989). Psychological conceptions of randomness. Journal of Behavioral Decision Making, 2, 221–238. 10.1002/bdm.3960020403
- Bar-Hillel, M., & Wagenaar, W. A. (1991). The perception of randomness. Advances in Applied Mathematics, 12, 428–454. 10.1016/0196-8858(91)90029-I
- Beltrami, E. (1999). What is random? Chance and order in mathematics and life. New York, NY: Springer. 10.1007/978-1-4612-1472-4
- Clotfelter, C. T., & Cook, P. J. (1993). The “Gambler’s Fallacy” in lottery play. Management Science, 39, 1521–1525. 10.1287/mnsc.39.12.1521
- Costello, F., & Watts, P. (2014). Surprisingly rational: Probability theory plus noise explains biases in judgment. Psychological Review, 121, 463–480. 10.1037/a0037010
- Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24, 87–114. 10.1017/S0140525X01003922
- Cowan, N. (2010). The magical mystery four: How is working memory capacity limited, and why? Current Directions in Psychological Science, 19, 51–57. 10.1177/0963721409359277
- Croson, R., & Sundali, J. (2005). The Gambler’s Fallacy and the hot hand: Empirical data from casinos. Journal of Risk and Uncertainty, 30, 195–209. 10.1007/s11166-005-1153-2
- Edwards, W. (1961). Probability learning in 1000 trials. Journal of Experimental Psychology, 62, 385–394. 10.1037/h0041970
- Falk, R., & Konold, C. (1997). Making sense of randomness: Implicit encoding as a basis for judgment. Psychological Review, 104, 301–318. 10.1037/0033-295X.104.2.301
- Farmer, G. D., Warren, P. A., & Hahn, U. (2017). Who “believes” in the Gambler’s Fallacy and why? Journal of Experimental Psychology: General, 146, 63–76. 10.1037/xge0000245
- Hahn, U. (2011). The gambler’s fallacy. In D. S. Dunn (Ed.), Oxford bibliographies online: Psychology. New York, NY: Oxford University Press.
- Hahn, U., & Harris, A. J. (2014). What does it mean to be biased: Motivated reasoning and rationality. Psychology of Learning and Motivation, 61, 41–102. 10.1016/B978-0-12-800283-4.00002-2
- Hahn, U., & Warren, P. A. (2009). Perceptions of randomness: Why three heads are better than four. Psychological Review, 116, 454–461. 10.1037/a0015241
- Hahn, U., & Warren, P. A. (2010). Why three heads are a better bet than four: A reply to Sun, Tweney, and Wang (2010). Psychological Review, 117, 706–711. 10.1037/a0019037
- Howes, A., Warren, P. A., Farmer, G., El-Deredy, W., & Lewis, R. L. (2016). Why contextual preference reversals maximize expected value. Psychological Review, 123, 368–391. 10.1037/a0039996
- Jarvstad, A., Hahn, U., Rushton, S. K., & Warren, P. A. (2013). Perceptuo-motor, cognitive, and description-based decision-making seem equally good. Proceedings of the National Academy of Sciences of the United States of America, 110, 16271–16276. 10.1073/pnas.1300239110
- Jarvstad, A., Rushton, S. K., Warren, P. A., & Hahn, U. (2012). Knowing when to move on: Cognitive and perceptual decisions in time. Psychological Science, 23, 589–597. 10.1177/0956797611426579
- Kahneman, D., & Tversky, A. (1972). Subjective probability: A judgment of representativeness. Cognitive Psychology, 3, 430–454. 10.1016/0010-0285(72)90016-3
- Kareev, Y. (1992). Not that bad after all: Generation of random sequences. Journal of Experimental Psychology: Human Perception and Performance, 18, 1189–1194. 10.1037/0096-1523.18.4.1189
- Lopes, L. L. (1982). Doing the impossible: A note on induction and the experience of randomness. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8, 626–636. 10.1037/0278-7393.8.6.626
- Lopes, L. L., & Oden, G. C. (1987). Distinguishing between random and nonrandom events. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 392–400. 10.1037/0278-7393.13.3.392
- Maloney, L. T., Trommershäuser, J., & Landy, M. S. (2007). Questions without words: A comparison between decision making under risk and movement planning under risk. In W. D. Gray (Ed.), Integrated models of cognitive systems (pp. 297–313). New York, NY: Oxford University Press.
- Nickerson, R. S. (2002). The production and perception of randomness. Psychological Review, 109, 330–357. 10.1037/0033-295X.109.2.330
- Nickerson, R. S., & Butler, S. F. (2009). On producing random binary sequences. The American Journal of Psychology, 122, 141–151.
- Olivola, C. Y., & Oppenheimer, D. M. (2008). Randomness in retrospect: Exploring the interactions between memory and randomness cognition. Psychonomic Bulletin & Review, 15, 991–996. 10.3758/PBR.15.5.991
- Oskarsson, A. T., Van Boven, L., McClelland, G. H., & Hastie, R. (2009). What’s next? Judging sequences of binary events. Psychological Bulletin, 135, 262–285. 10.1037/a0014821
- Rapoport, A., & Budescu, D. (1992). Generation of random series in two-person strictly competitive games. Journal of Experimental Psychology: General, 121, 352–363. 10.1037/0096-3445.121.3.352
- Terrell, D. (1998). Biases in assessments of probabilities: New evidence from greyhound races. Journal of Risk and Uncertainty, 17, 151–167. 10.1023/A:1007771613236
- Toneatto, T., Blitz-Miller, T., Calderwood, K., Dragonetti, R., & Tsanos, A. (1997). Cognitive distortions in heavy gambling. Journal of Gambling Studies, 13, 253–266. 10.1023/A:1024983300428
- Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131. 10.1126/science.185.4157.1124
- Wagenaar, W. A. (1972). Generation of random sequences by human subjects: A critical survey of the literature. Psychological Bulletin, 77, 65–72. 10.1037/h0032060
- Warren, P. A., Graf, E. W., Champion, R. A., & Maloney, L. T. (2012). Visual extrapolation under risk: Human observers estimate and compensate for exogenous uncertainty. Proceedings of the Royal Society of London, Series B: Biological Sciences, 279, 2171–2179. 10.1098/rspb.2011.2527
- Williams, J. J., & Griffiths, T. L. (2013). Why are people bad at detecting randomness? A statistical argument. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39, 1473–1490. 10.1037/a0032397
- Zhao, J., Hahn, U., & Osherson, D. (2014). Perception and identification of random events. Journal of Experimental Psychology: Human Perception and Performance, 40, 1358–1371. 10.1037/a0036816