Proceedings of the National Academy of Sciences of the United States of America
2023 Aug 9;120(33):e2302491120. doi: 10.1073/pnas.2302491120

An illusion of predictability in scientific results: Even experts confuse inferential uncertainty and outcome variability

Sam Zhang a,1, Patrick R Heck b, Michelle N Meyer c, Christopher F Chabris c, Daniel G Goldstein d, Jake M Hofman d,1
PMCID: PMC10438372  PMID: 37556500

Significance

In many fields, there has been a long-standing emphasis on inference (precisely estimating an unknown quantity, such as a population average) over prediction (forecasting individual outcomes). Here, we show that this focus on inference over prediction can mislead readers into thinking that the results of scientific studies are more definitive than they actually are. Through a series of randomized experiments, we demonstrate that this confusion arises for one of the most basic ways of presenting statistical findings and affects even experts whose jobs involve producing and interpreting such results. In contrast, we show that communicating both inferential and predictive information side by side provides a simple and effective alternative, leading to calibrated interpretations of scientific results.

Keywords: statistics, uncertainty, science communication, visualization, experiments

Abstract

Traditionally, scientists have placed more emphasis on communicating inferential uncertainty (i.e., the precision of statistical estimates) compared to outcome variability (i.e., the predictability of individual outcomes). Here, we show that this can lead to sizable misperceptions about the implications of scientific results. Specifically, we present three preregistered, randomized experiments where participants saw the same scientific findings visualized as showing only inferential uncertainty, only outcome variability, or both and answered questions about the size and importance of findings they were shown. Our results, composed of responses from medical professionals, professional data scientists, and tenure-track faculty, show that the prevalent form of visualizing only inferential uncertainty can lead to significant overestimates of treatment effects, even among highly trained experts. In contrast, we find that depicting both inferential uncertainty and outcome variability leads to more accurate perceptions of results while appearing to leave other subjective impressions of the results unchanged, on average.


Much of science is concerned with making inferences about entire populations using only samples from them. For instance, a medical trial might compare the health of patients who were given an experimental treatment to those who received a placebo, or a social science study might contrast the economic mobility of individuals from different demographic groups. In each case, the goal is to draw conclusions about the broader populations of interest, but this is often complicated by two factors: first, access to relatively small samples from these populations, and second, highly variable outcomes within each group. For example, a medical study might involve only a few dozen patients, and some patients who received the experimental treatment might have responded strongly to it, while others did not.

Perhaps the most common solution to these problems is to focus on aggregate outcomes (e.g., averages within each group) instead of individual outcomes and to report some measure of inferential uncertainty about them (e.g., how precisely we have estimated the average for each group). Reporting inferential uncertainty (typically through standard errors (SEs), confidence intervals, Bayesian credible intervals, or similar) has long been a cornerstone of statistics and constitutes a major part of introductory courses on the topic (1). Quantifying inferential uncertainty is important for many reasons, from providing a plausible range of values for a quantity of interest to helping us avoid being misled by random variation in samples of data that may not accurately reflect trends in the underlying populations of interest.

At the same time, focusing on only aggregate outcomes and inferential uncertainty might lead us to overlook outcome variability (e.g., how much individual outcomes vary around averages for each group), often quantified by measures such as standard deviation (SD) or variance, and which is important for understanding effect sizes and the predictability of outcomes. Although there are systematic relationships between measures of inferential uncertainty and outcome variability, they capture two very different—but potentially easily confused—concepts. Here, we investigate the extent to which the pervasive focus on inferential uncertainty in scientific visualizations can produce illusory impressions about the size and importance of scientific findings, even among experts whose jobs involve creating and interpreting such results.

To highlight the difference between inferential uncertainty and outcome variability—and to see why focusing on the former might be misleading about the latter—consider the plot in the Upper-Left panel of Fig. 1, inspired by a highly cited study on whether violent video games cause aggressive behavior (2). In the study, participants were randomly assigned to play either a violent or a nonviolent video game, after which their behavior on an unrelated task was measured using a continuous aggressiveness score. The two black points in this plot show estimated average aggressiveness within each group, and the error bars encode inferential uncertainty about those estimates (one SE above and below the average). Compare this to the plot in the Upper-Right panel of Fig. 1, which depicts the same averages and error bars, but adds colored points to show individual outcomes as well.

Fig. 1.

Inferential uncertainty vs. outcome variability. (Left) Estimated means and an error bar representing one SE above and one SE below the mean, for two conditions in an experiment. The SE is a measure of the uncertainty in our inference of the mean. (Right) Individual outcomes shown in addition to the same SEs on the Left. With only 50 participants per condition (Top), we have less confident estimates for the mean than when we have 400 participants per condition (Bottom). However, more data do not systematically decrease the variability in the outcomes themselves.

In principle, there is no reason to prefer one of these plots to the other—in fact, given the sample sizes and a few distributional assumptions, one can calculate information about either inferential uncertainty or outcome variability from each. In practice, however, each of these representations has distinct visual features that emphasize different notions of uncertainty and lends itself to different interpretations. In particular, the format on the left is designed to facilitate “inference by eye” (3–5) so that readers can deduce a range of plausible values for the average in each group and apply visual heuristics for hypothesis testing. Under one such rough heuristic, the lack of overlap between the error bars is taken as evidence against the idea that there is no difference in average aggressiveness scores between conditions. That said, displaying the error bar as a single visual object (arms capped off on either end), with the mean at its center, focuses perception on that object as itself a representation of the relevant data, and since the object is bounded at the ends of the error bar, such a display might encourage the viewer to imagine that the underlying data cluster more tightly around the mean than they actually do. (In fact, the majority of the individual data points fall above and below the range of the y-axis scale.) As a result, one might look at the plot on the left and conclude that violent video games cause aggressive behavior, and indeed popular outlets that covered this work featured strongly worded headlines to this effect (see *, for example).

The figure on the right contains all of the information present in the plot on the left but simply adds points that show individual outcomes. This format was suggested by Gardner and Altman several decades ago (6) to place more emphasis on communicating sample size, outcome variability, and effect sizes. There have since been several efforts to popularize these types of plots (7–11), but they remain relatively uncommon and, to the best of our knowledge, have not been empirically tested. The dots draw some attention away from the object represented by the error bars, and the contrast (in color and intensity) between the dots and the bar makes it easy to focus on either the individual points or the mean and error bar, to shift attention between them, and to see that the bar does not represent the entirety of the data, merely one particular facet of it.
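
For readers who want to try this format on their own data, the sketch below shows one way to build both versions of Fig. 1 in R with ggplot2. It is a minimal illustration rather than the code used to generate the study stimuli, and the group means, SDs, and sample size are hypothetical.

```r
# Minimal sketch (not the authors' stimulus code) of the two formats in Fig. 1,
# using hypothetical aggressiveness scores for two conditions.
library(ggplot2)

set.seed(1)
df <- data.frame(
  condition  = rep(c("Nonviolent game", "Violent game"), each = 50),
  aggression = c(rnorm(50, mean = 4.0, sd = 1.5),
                 rnorm(50, mean = 4.8, sd = 1.5))
)

# "SE only" format: group means with error bars spanning one SE above and below.
p_se_only <- ggplot(df, aes(condition, aggression)) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.1) +
  stat_summary(fun = mean, geom = "point", size = 3)

# "SE + points" format: the same summaries with individual outcomes overlaid.
p_se_points <- p_se_only +
  geom_jitter(aes(color = condition), width = 0.1, alpha = 0.5, show.legend = FALSE)
```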

Specifically, adding individual outcomes highlights that while there is relatively low inferential uncertainty in this study (i.e., the average in each group is precisely estimated), there is still a great deal of outcome variability within each group (i.e., individual outcomes vary quite a bit around their respective averages). So much so, in fact, that one has to rescale the y-axis just to accommodate the range of outcomes, providing some perspective for the difference in means between conditions. “Inference by eye” is still possible in this alternative representation, but it also makes clear that while violent video games may change aggressive behavior on average, the relationship is far from deterministic: knowing only if someone played a violent video game or not says relatively little about how aggressively they might behave.

Moreover, as depicted in the bottom row of Fig. 1, this divergence between inferential uncertainty and outcome variability actually grows with sample size. For instance, if we were to conduct a larger study—as is more commonplace today compared to when the original study was done—and sample 800 participants instead of 100, we would get extremely precise estimates of averages in each condition (indicated by the small error bars in the Bottom-Left panel), but, as the Bottom-Right panel shows, collecting more data would not systematically decrease outcome variability.
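
A short numerical check makes this divergence concrete. The snippet below uses simulated outcomes with a hypothetical SD of 1.5 and is meant only to illustrate the scaling, not to reproduce the study data.

```r
# Sketch: the SE shrinks as the sample grows, but the SD of the outcomes does not.
set.seed(2)
for (n in c(50, 400)) {
  x <- rnorm(n, mean = 5, sd = 1.5)  # hypothetical outcomes for one condition
  cat(sprintf("n = %3d:  SD = %.2f,  SE = SD / sqrt(n) = %.2f\n",
              n, sd(x), sd(x) / sqrt(n)))
}
# The SD stays near 1.5 at both sample sizes, while the SE falls by roughly a
# factor of sqrt(400 / 50), i.e., about 2.8.
```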

As these examples demonstrate, differences in what visualizations emphasize might lead readers to different conclusions. So which one of these formats should we prefer when presenting statistical findings to readers, and how much does this choice matter? While there is a large body of literature in the fields of data visualization and human–computer interaction on different ways of depicting either inferential uncertainty or outcome variability (12–14), to the best of our knowledge there is little empirical work that compares the two. In many fields, there is an emphasis on inference and hypothesis testing, and so plots displaying inferential uncertainty are the default and considered a “best practice” (15–17). At the same time, there is an increasingly large body of research showing that people routinely make mistakes when making inferences based on such plots. For instance, when shown these plots, people often misestimate the range of plausible values for a parameter and draw incorrect conclusions related to hypothesis testing and the replicability of scientific findings (12, 13, 18–20). As a result, plots designed to convey inferential uncertainty may in fact not be very effective for statistical inference.

Here, we raise a different but potentially more important concern. Beyond being unreliable for traditional statistical inference tasks, the pervasive preference for communicating inferential uncertainty found in published work can lead to an “illusion of predictability” (21), whereby people underestimate the variability of outcomes and overestimate the size and importance of scientific findings. In particular, if a reader mistakes inferential uncertainty for outcome variability when viewing plots like those on the Left of Fig. 1, they might be left with the impression that most outcomes fall within the depicted error bars and conclude that violent video games have an alarmingly strong effect on aggressive behavior, with predictable outcomes in each condition. The plots on the right are intended to avoid this confusion, showing that such a strong conclusion may not be warranted. In this example, seeing the comparison side by side should help clarify the distinction between inferential uncertainty and outcome variability. In practice, however, it is common for figures to depict only one type of uncertainty, a choice which is often not even explicitly stated (22, 23). Moreover, there are many published examples where authors themselves mistake the two concepts, errantly labeling standard deviations as standard errors or vice versa (24). This can leave the reader guessing as to what is being communicated—a task that is not helped by the fact that the terms involved sound similar (e.g., “standard error” or “SE” vs. “standard deviation” or “SD”), or that they are often both depicted by the same visual marks in plots (e.g., error bars).

Recent work has shown evidence of this confusion among laypeople: in a series of large-scale, online experiments, participants overestimated the effectiveness of, and were willing to pay more for, the same hypothetical treatment when shown visualizations that depicted inferential uncertainty compared to outcome variability, even when controlling for other visual factors such as the scale of the y-axis (25, 26). However, these studies’ participants were laypeople (crowd workers), not experts or practitioners trained in statistics. In addition, the studies involved fictitious, low-stakes scenarios. There is good reason to imagine that these effects might disappear with appropriate training or in sufficiently consequential settings, in which case they would be of much less concern. Here, we investigate the extent to which visual displays of inferential uncertainty vs. outcome variability affect judgments by experts in more realistic scenarios. Specifically, we present a series of preregistered, randomized experiments where experts saw the same scientific findings depicting different types of uncertainty and answered a series of questions about the size and importance of findings they were shown. Our results, composed of responses from medical professionals, professional data scientists, and tenure-track academic faculty, show that the prevalent form of visualizing only inferential uncertainty can lead to significant overestimates of treatment effects, even among highly trained and knowledgeable experts. In contrast, we find that an alternative format that depicts both inferential uncertainty (by showing statistical estimates) and outcome variability (by also showing individual data points) leads to more accurate perceptions of results while appearing to leave other subjective impressions of the results unchanged, on average. We conclude with a discussion of how this relates to larger issues around practical vs. statistical significance, inference vs. prediction, and scientific communication.

Results

We conducted three preregistered experiments to investigate how the graphical communication of different types of uncertainty affects experts’ perceptions of the size and importance of scientific findings. All three experiments used similar experimental setups but with different types of experts. In each experiment, participants were shown the results of a study that compared a treatment group to a control group, where we randomly varied whether the figure in the study depicted inferential uncertainty (via standard errors), outcome variability (via standard deviations or individual data points), or both. After reviewing the study, participants were asked to estimate the effectiveness of the treatment shown in the study and make additional decisions based on the findings they saw. In all three experiments, we elicited perceived treatment effectiveness by asking for the probability that a randomly selected member of the treatment group had a higher (or lower) score than a randomly selected member of the control group, a number between 50% and 100% known as the common language effect size, probability of superiority, or AUC. We chose this measure because it was developed to aid in the communication of effect sizes and thus provides an easy and effective means of eliciting effect sizes from participants (27, 28). For each of our experiments, we report all sample sizes, conditions, data exclusions, and measures for the main analyses that were described in our preregistration documents, and we followed the preregistered plan exactly except for the participants we dropped in experiment 1 due to web browser incompatibility. The code to generate stimuli and run our experiment is available online along with the code and data to reproduce our analysis. Materials and Methods and SI Appendix, section S2 contain full details of the experimental design and analyses.
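
As an illustration of this measure (not taken from the experimental materials), the sketch below computes an empirical probability of superiority from two simulated groups and, for normally distributed outcomes, its closed-form counterpart; the group means, SDs, and sample sizes are hypothetical.

```r
# Sketch: the probability of superiority (common language effect size / AUC) is the
# chance that a random treatment-group member outscores a random control-group member.
set.seed(3)
control   <- rnorm(400, mean = 0.00, sd = 1)
treatment <- rnorm(400, mean = 0.32, sd = 1)  # hypothetical shift of 0.32 SD

# Empirical estimate: the fraction of treatment-control pairs where treatment is higher.
p_sup_empirical <- mean(outer(treatment, control, ">"))

# For two normal distributions, the same quantity has a closed form:
# pnorm((mu_t - mu_c) / sqrt(sd_t^2 + sd_c^2)); a 0.32-SD shift gives roughly 59%.
p_sup_normal <- pnorm((0.32 - 0.00) / sqrt(1^2 + 1^2))
```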

Experiment 1: Medical Providers.

For our first experiment, we recruited medical providers with prescribing privileges employed at a regional healthcare system. All participation was voluntary, and no direct payment was made; instead, we donated Thanksgiving meals to a local food bank for each completion of the study. We performed eleven 30-min structured interviews with physicians about the task to ensure that it was realistic, familiar, and easy to understand (SI Appendix, S3). Those who participated were randomly assigned to see the results of a hypothetical trial for either blood pressure or COVID-19 medications. All participants saw the same information about the corresponding medication type, but some were randomly assigned to see accompanying figures depicting inferential uncertainty first (means and standard errors, in the “Saw SEs first” condition), while others were shown figures depicting outcome variability first (means and standard deviations, in the “Saw SDs first” condition) (SI Appendix, Fig. S1). Participants were then asked to estimate the probability of superiority for the medication they were shown along with how much it would be worth to patients. They were also asked to recall what the error bars in the figures represented and to provide a histogram of outcomes for patients in the treatment and control groups using a tool called Distribution Builder (29–31). After completing these tasks, participants were shown “another scenario” for the same type of medication and repeated the entire process. Although not revealed to them, the second scenario was identical to the first except for the accompanying figure—those in the “Saw SEs first” condition were subsequently shown the same results but with SDs in the accompanying figure, whereas those in the “Saw SDs first” condition were shown SEs. The study concluded with a background survey to gauge participants’ medical experience and statistical training.

Of the 221 participants who fully completed the study, we removed 58 participants who indicated that they had completed the study more than once, which occurred due to a web browser incompatibility in our experiment code, leaving 163 participants.§ Most participants were medical doctors with at least some experience with randomized controlled trials (SI Appendix, Fig. S9).

Comparing participants’ probability of superiority estimates for the first scenario they saw, we find that average estimates were substantially higher for those who saw SEs (depicting inferential uncertainty) first compared to those who saw SDs (depicting outcome variability) first (Fig. 2A). With both medications, participants who saw SEs overestimated the size of the effects they were shown, whereas those who saw SDs underestimated those same effects. For the blood pressure medication, with a true value of 72%, average estimates were 89% for SEs vs. 66% for SDs (t(65.1) = 6.83, P < 0.001). For the COVID-19 medication, with a true value of 76%, average estimates were 86% for SEs vs. 67% for SDs (t(77.9) = 6.35, P < 0.001).

Fig. 2.

Results for medical providers. (A) The estimated probability of superiority of each treatment is depicted for each provider, with black dot and error bars signifying the mean and one SE above and below the mean, as well as dashed lines representing the true underlying effect size. Colored points show individual responses. (B) The perceived value of treatment is shown with a logarithmic y-axis, and the black dot and error bars depict the mean and one SE above and below the mean. Colored points show individual responses. (C) Perceived distributions (bars) vs. actual distributions (line) of the effectiveness of treatment.

As shown in Fig. 2A, this roughly 20 percentage point difference in average estimates between conditions is largely driven by a sizeable and statistically significant difference in extreme responses, defined as estimates that exceeded 90%, from those who saw SEs compared to those who saw SDs (t(65.9) = 6.27, P < 0.001). For both scenarios, the majority of participants who saw SEs made extreme responses, whereas a small minority of those who saw SDs did so. A within-subjects analysis comparing how much each participant’s estimate of the probability of superiority changed between the two scenarios they saw reveals a similar pattern (SI Appendix, Fig. S3). Responses to the recall question indicate that these differences are likely due to participants mistaking SE error bars for showing outcome variability instead of inferential uncertainty (SI Appendix, Fig. S6), with only 36% of participants who saw SEs first correctly recalling the type and meaning of the error bars they were shown, which is worse than chance. This is consistent with the responses we collected via Distribution Builder, depicted in Fig. 2C, which show that those who saw SEs first generated narrower outcome distributions (and higher implied probabilities of superiority) overall compared to those who saw SDs first (t(160) = 4.92, P < 0.001).

Interestingly, however, these large differences in perceived effectiveness were not reflected in estimates of how much participants thought patients would value these treatments, perhaps because there are strong conventions for how much different medications should cost regardless of effectiveness. We did not find evidence that participants who first saw SEs were willing to pay a median price that was significantly different from participants who first saw SDs for either the blood pressure medication (Z = 1.38, P = 0.084) or the COVID-19 medication (Z = 0.85, P = 0.198). Further analyses, including participant feedback, are available in Supporting Information, SI Appendix, section S1A.

Overall, the results of this experiment demonstrate that even medical professionals can be misled by common visualizations that depict inferential uncertainty. That said, this experiment has several limitations. First, many of the participants in our studies indicated only moderate training in and comfort with statistics, and so perhaps we would expect different results from experts with more rigorous statistical backgrounds. Second, although the figures that displayed outcome variability directly through SDs curbed extremely high estimates of effect sizes, estimates were on average below the true effect size when outcome variability was displayed using SDs. Third, we tested only one true underlying effect size in this study. We designed our second experiment to address these concerns by targeting experts with more statistical training, exploring alternative formats that depict both inferential uncertainty and outcome variability, and testing a wide range of true underlying effect sizes.

Experiment 2: Data Scientists.

For our second experiment, we recruited professional data scientists at a large software company. All participation was voluntary, and no direct payment was made; instead, we donated one set of personal protective equipment to the United Nations COVID-19 relief effort on behalf of each participant who completed the study. Those who participated saw a one-page extended abstract based on the violent video game study described above. All participants saw the same abstract, but some were randomly assigned to see an accompanying figure depicting only inferential uncertainty (means and standard errors in the “SE only” condition, as in the Lower-Left of Fig. 1), whereas others saw both inferential uncertainty and outcome variability (means, standard errors, and individual outcomes in the “SE + points” condition, as in the Lower-Right of Fig. 1). We designed the latter to test whether this format, originally proposed by Gardner and Altman (6), would lead to more accurate perceptions of effect sizes than SEs or SDs alone. Then, we asked participants for their editorial judgments on the abstract, including the overall appeal of the work, the sufficiency of the sample size used, and whether they would accept the extended abstract if they were a journal editor, all on 5-point Likert scales. Following this, we asked participants to estimate the size of the effect presented in the abstract, measured by the probability that someone who played a violent video game displayed more aggressive behavior than someone who played a nonviolent video game (the probability of superiority), and to recall what the error bars in the figure represented. Finally, we had participants repeat probability of superiority estimates for five randomly generated figures of the same type that they saw in the abstract to explore how estimates change with the true underlying effect size.

A total of 175 participants finished part 1 of the experiment, 161 participants finished part 2, and 138 participants completed the postexperiment background survey. The majority of participants had upward of three years of experience working in data science and reasonable prior experience with statistics (SI Appendix, Fig. S10, Middle). As per our preregistration, we removed 2 participants who indicated that they had none of the prior experience in statistics or scientific literature that we screened for.

In line with our first experiment, we find that on average, participants who saw only inferential uncertainty (in the SE only condition) made substantially higher probability of superiority estimates compared to those who saw both inferential uncertainty and outcome variability (in the SE + points condition) (t(159.4) = 6.34, P < 0.001). As shown in Fig. 3A, responses in the SE + points condition were well calibrated to the true effect size of 59% (mean = 61%, SD = 13%), whereas we once again find overestimation with the conventional SE only format (mean = 76%, SD = 20%). This more than 15 percentage point difference in average estimates is apparent in extreme responses as well: Only 6% of participants in the SE + points condition provided probability of superiority estimates that exceeded 90%, while 35% of participants in the SE only condition did so (t(142.5) = 5.11, P < 0.001).

Fig. 3.

Results for data scientists and faculty. (A and B) Perceived probability of superiority of the experiment in the editorial judgment task between the conditions for data scientists and faculty, respectively. The black dot displays the mean, and the error bars are one SE above and below the mean. The dotted line is the true probability of superiority of the underlying scenario, and colored points show individual responses. (C and D) Distributions of the editorial judgments between the two conditions for data scientists and faculty, respectively. The dot and error bars above the plots show the mean and one SE above and below the mean. (E and F) For each of a series of hypothetical experiments with results generated from a random true probability of superiority, data scientists and faculty (respectively) estimated the true probability of superiority. The dotted line displays the correct answers. The colored line is a loess fit to the data, the shaded region is a 95% CI, and colored points show individual responses.

Despite these rather large differences in perceived effect size, we do not see a corresponding difference in average editorial opinion between conditions (Fig. 3C). Specifically, we did not find evidence that the SE + points format changed the average appeal of the work (t(168.6) = 1.56, P = 0.120), the average perceived sufficiency of sample size (t(152.9) = 0.88, P = 0.380), or the average overall recommendation (t(171.3) = 0.52, P = 0.604) compared to the SE only format. However, in a post hoc analysis, we did find a systematic correlation between how large a participant perceived the effect presented in the study to be and their overall editorial recommendation (SI Appendix, section S1B).

As with our first experiment, participants showed a reasonable degree of confusion about both the type of error bars they saw and how to interpret them. Only 55% of people in the SE only condition and 51% of people in the SE + points condition correctly responded that the error bars represented uncertainty in the estimation of the average, rather than variability in outcomes (SI Appendix, section S1B).

To check whether any differences between conditions were specific to the study in the extended abstract that we showed participants, or to the true underlying effect size of 59% for that study, we also showed each participant five additional figures with different (randomly generated) true underlying effect sizes ranging from 50 to 75%. In line with our previous findings, those who saw only SEs systematically overestimated the size of the effects they were shown, whereas those in the SE + points condition were, on average, well calibrated (Fig. 3E). A mixed effects model fit to predict absolute error in responses based on experimental condition and the true underlying effect size (both as fixed effects) and participant identity (as a random effect; see Materials and Methods) confirms this: Participants in the SE + points condition made estimates that were on average 11 percentage points (95% CI: [8.23, 13.7]) closer to the true probabilities of superiority compared to participants in the SE only condition. As with the extended abstract, participants who saw SEs only responded with a bimodal pattern, where a large cluster of extreme responses over 90% raised the overall average. In the SE + points condition, only 3.7% of responses were extreme, while 37% of responses in the SE only condition were extreme (t(500.3) = 12.5, P < 0.001). Further detailed analyses and participant feedback are available in SI Appendix, section S1B.

Experiment 3: Faculty.

Our third and final experiment was identical to the previous experiment, but involved academic tenure-track faculty instead of professional data scientists. We recruited US tenure-track faculty from PhD-granting institutions in the fields of psychology, sociology, physics, biology, business, and computer science. Once again, all participation was voluntary, and no direct payment was made; instead, we donated personal protective equipment to the United Nations COVID-19 relief effort on behalf of each participant who completed the study.

A total of 368 participants completed part 1 of the experiment, 339 participants completed part 2, and 289 participants completed the optional background survey. Participants reported being highly experienced with the scientific process, with the modal participant indicating that they had performed over 100 peer reviews (SI Appendix, Fig. S11). As per our preregistration, we removed 63 participants who indicated that they were not currently tenure-track faculty, had no prior coursework in statistics, no experience conducting statistical analyses, or had never peer-reviewed a paper.

In line with our previous findings, participants in the SE only condition made substantially higher probability of superiority estimates (mean = 68%, SD = 19%) compared to those in the SE + points condition (mean = 60%, SD = 15%) on average (t(363.9) = 4.52, P < 0.001), and responses in the SE + points condition were well calibrated to the true value of 59% (Fig. 3B). Similarly, while 11% of participants in the SE + points condition provided probability of superiority estimates of 90% or greater, 21% of participants in the SE only condition did so, a statistically significant difference (t(363) = 2.6, P = 0.010).

Despite differences in perceived effect size by condition, we do not find a corresponding difference in average editorial opinion (Fig. 3D). Specifically, we did not find evidence that the SE + points format changed the average appeal of the work (t(356.3) = 0.324, P = 0.746), the average perceived sufficiency of sample size (t(349.2) = 1.58, P = 0.115), or the average overall editorial recommendation (t(362.6) = 1.54, P = 0.124). Mirroring the post hoc analysis from the previous experiment, we did find a systematic correlation between how large a participant perceived the effect presented in the study to be and their overall editorial recommendation (SI Appendix, section S1C).

As with our earlier experiments, participants showed a reasonable degree of confusion about the specific meaning of the error bars that they saw (SI Appendix, section S1C). In contrast to our previous experiment, however, we saw less confusion about the meaning of error bars for those in the SE + points condition compared to those in the SE only condition: 58% of participants in the SE only condition and 71% of participants in the SE + points condition recalled the correct meaning of the error bars (t(363.7) = 2.67, P = 0.008).

The second part of the experiment, which explored a wide range of true underlying effect sizes, showed a similar pattern to the previous experiment: Participants in the SE + points condition made estimates that were on average 5.9 percentage points (95% CI: [4.1, 8.6]) closer to the true probabilities of superiority compared to participants in the SE only condition, using the same mixed effects model as in the previous experiment (Fig. 3F). Extreme estimates drive this average difference: Whereas only 5.4% of responses in the SE + points condition were above 90%, 22% of responses in the SE only condition were (t(1362.4) = 9.68, P < 0.001). Further detailed analyses and participant feedback are available in SI Appendix, section S1C.

Discussion

Taken together, the results of these three preregistered experiments highlight a serious concern for the current state of scientific communication. Specifically, the pervasive focus on inferential uncertainty in scientific data visualizations can mislead even experts about the size and importance of scientific findings, leaving them with the impression that effects are larger than they actually are. This “illusion of predictability” is likely due to readers confusing the concepts of inferential uncertainty and outcome variability and consequently mistaking precise statistical estimates for certain outcomes. Fortunately, we have identified a straightforward solution to this problem: when possible, visually display both outcome variability and inferential uncertainty by plotting individual data points alongside statistical estimates.

There are, of course, several limitations to our work and to the accompanying recommendation of plotting individual outcomes. First, with regard to editorial judgments, we tested only one extended abstract scenario. It could be the case that for another scenario, editorial opinions actually change along with the visual representation chosen to accompany the text. For example, perhaps there is a more polarizing setting for which people have weaker priors about the effect size and would be swayed more by visualizations of one type over the other. That said, if this were the case, we would argue that the representation that results in the most veridical perceived effect size should be chosen, as this would lead reviewers to make the most well-informed decision possible about the merits of the work. Likewise, these effects could be different in “real stakes” settings (e.g., when actually reviewing for a high-stakes journal or making business decisions about the quality of a data analysis) compared to the hypothetical situation we presented our participants with. Another limitation of the settings we investigate concerns the ground truth effect sizes. While the values in our stimuli are similar in magnitude to those commonly found in medicine, neuroscience, psychology, and social sciences generally (26), we do not make claims of an illusion of predictability at considerably different effect sizes. However, investigating effect sizes that rarely occur in publications would have lower relevance for practice. In addition, there are cases where plotting individual outcomes is not as easy as it sounds. For instance, large datasets or extreme data skew can make it challenging to present all (or even a reasonable fraction of) the data in a way that allows one to see individual observations alongside statistical estimates. There are also complications when studying marginal effects while fixing or averaging over other factors, although techniques such as partial dependency plots could be adapted for these settings (33, 34). Finally, there is the opportunity to study other visual encodings of uncertainty, including gradient and violin plots (13), hypothetical outcome plots (35), and quantile dot plots (36). These limitations aside, we still endorse the idea that one should show outcome variability when possible, preferably by plotting individual outcomes alongside statistical estimates.

Our findings provide a clear and important opportunity to improve how statistical visualizations are presented to laypeople and experts alike. Such improvements should increase audience comprehension without sacrificing the details displayed in conventional plots.

Having identified this problem and a solution to it, we might ask why it has gone unnoticed for so long. Our conjecture is that this specific issue, while centered around data visualization, reflects a broader issue around how science is done and how scientific results are communicated. Specifically, in many fields, there has been a long-standing emphasis on inference (e.g., obtaining unbiased estimates of individual effects) over prediction (e.g., forecasting future outcomes), perhaps in part because prediction can be quite difficult, especially when compared to inference. It is surely easier to estimate an average effect across a large population, as is done in standard statistical inference, than it is to predict individual outcomes given all measurable factors that might be relevant to a given problem. But when the results of a study are communicated, they can often come across as having implied the latter when in fact they have only established the former. As a result, there can be a great deal of confusion as to what we have actually learned about the world from a particular study, and as we have demonstrated, even experts mistake inferential visualizations as communicating information about prediction. Borrowing from Jacob Cohen’s critique of hypothesis testing (37), we believe a similar logic applies to the display of inferential uncertainty: “Among many other things, it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!”

To this end, we believe that the solution of communicating both inferential uncertainty and outcome variability is merited. Rather than emphasizing inference over prediction (or vice versa), we should aim for integrative approaches that consider both aspects of scientific inquiry (38) and present them clearly alongside each other so that readers can themselves make accurate and appropriate inferences from them.

Materials and Methods

The Institutional Review Board of Microsoft Corporation reviewed the protocol of these experiments and approved them for human subjects research under approval Ethics Review Portal #10159. Informed consent was obtained from all participants prior to starting any of the studies mentioned below. Full descriptions of the experimental protocol, including screenshots, are available in SI Appendix, section S2.

All t tests are Welch’s test for unequal variances unless otherwise noted (39), using the default settings in t.test in R. For median tests, we use the two-sample asymptotic Brown–Mood median_test function from the coin package in R. Bootstraps are performed with 10,000 resamples using the boot.ci function from the boot package in R and the reverse percentile interval method for constructing confidence intervals. To analyze the calibration task for data scientists and faculty, we fit the following preregistered linear mixed effects model using the lme4 package in R:

|error| ~ (1 | participant) + psup + points    [1]

where |error| is the absolute value of the difference between the true and guessed probability of superiority, psup refers to the true probability of superiority, and points is a binary indicator variable that is 1 if the participant was in the SE + points condition and 0 otherwise. Probabilities of superiority are expressed as a percentage between 50% and 100%.
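
To make model [1] concrete, the sketch below fits it with lme4 on simulated calibration-task data; the data frame, column names, and the effect built into the simulation are illustrative assumptions, not the study data.

```r
# Sketch of fitting preregistered model [1] on simulated calibration-task responses.
library(lme4)

set.seed(5)
n_part <- 100
calib <- data.frame(
  participant = factor(rep(seq_len(n_part), each = 5)),
  psup        = runif(5 * n_part, 50, 75),             # true probability of superiority (%)
  points      = rep(rbinom(n_part, 1, 0.5), each = 5)  # 1 = "SE + points" condition
)
# Build in a hypothetical reduction in absolute error for the SE + points condition.
calib$abs_error <- abs(rnorm(5 * n_part, mean = 14 - 8 * calib$points, sd = 6))

fit <- lmer(abs_error ~ psup + points + (1 | participant), data = calib)
summary(fit)
confint(fit, method = "Wald")  # Wald intervals for the fixed effects
```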

To analyze the role of perceived probability of superiority on overall editorial opinion for the data scientists and the faculty, we used the following linear regression model:

overall ~ psup + condition + controls    [2]

where overall is the overall editorial judgment of the extended abstract on a 5-point Likert scale, psup refers to the estimated probability of superiority, and controls refers to background variables reported in each experiment.
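
A corresponding sketch of regression [2] on simulated editorial-judgment data is below; the column names, the stand-in background control, and the relationship built into the simulation are illustrative assumptions.

```r
# Sketch of regression [2]: overall editorial judgment on perceived effect size,
# condition, and a stand-in background control, using simulated responses.
set.seed(6)
n <- 170
ed <- data.frame(
  psup_est  = runif(n, 50, 100),  # perceived probability of superiority (%)
  condition = factor(sample(c("SE only", "SE + points"), n, replace = TRUE)),
  yrs_exp   = sample(1:20, n, replace = TRUE)
)
# Simulated 1-5 Likert judgment loosely tied to perceived effect size.
ed$overall <- pmin(5, pmax(1, round(1 + 0.03 * ed$psup_est + rnorm(n, sd = 0.8))))

fit2 <- lm(overall ~ psup_est + condition + yrs_exp, data = ed)
summary(fit2)
```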

Supplementary Material

Appendix 01 (PDF)

Acknowledgments

We would like to thank Alex Blanton and Chris Shaffer, who helped with the experiment involving data scientists, as well as all of the data scientists who participated in that experiment. We would also like to thank the tenure-track faculty who participated in this research. We also wish to thank the medical providers who participated in our first experiment and those who provided feedback on the hypothetical blood pressure RCT results and experimental task we developed. Finally, we would like to thank Aaron Clauset and Nick LaBerge for their help with the studies we conducted. This work was supported in part by an NSF Graduate Research Fellowship Award DGE 2040434 (S.Z.). The views expressed here are those of the authors and do not represent the views of the Consumer Financial Protection Bureau or the United States.

Author contributions

S.Z., P.R.H., M.N.M., C.F.C., D.G.G., and J.M.H. designed research; S.Z., P.R.H., and J.M.H. performed research; S.Z. and J.M.H. analyzed data; and S.Z., P.R.H., M.N.M., C.F.C., D.G.G., and J.M.H. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

§Results are qualitatively similar if we do not remove these participants.

Ideally, reviewers would judge work based on the quality of the questions it asks and the methods used to answer them, but in the absence of registered reports (32), the results themselves might impact editorial judgments.

Contributor Information

Sam Zhang, Email: sam.zhang@colorado.edu.

Jake M. Hofman, Email: jmh@microsoft.com.

Data, Materials, and Software Availability

Anonymized raw experiment data and all of the code necessary for reproducing the results in the paper, along with preregistered analysis plans, are available online and have been deposited at https://github.com/jhofman/illusion-of-predictability (https://osf.io/9gxva/) (40).

References

  • 1. Fidler F., Cumming G., “Teaching confidence intervals: Problems and potential solutions” in Proceedings of the 55th Session of the International Statistical Institute (Sydney, Australia, 2005), vol. 50.
  • 2. Anderson C. A., Dill K. E., Video games and aggressive thoughts, feelings, and behavior in the laboratory and in life. J. Person. Soc. Psychol. 78, 772 (2000).
  • 3. Cumming G., Finch S., Inference by eye: Confidence intervals and how to read pictures of data. Am. Psychol. 60, 170 (2005).
  • 4. Cumming G., Fidler F., Kalinowski P., Lai J., The statistical recommendations of the American Psychological Association publication manual: Effect sizes, confidence intervals, and meta-analysis. Aust. J. Psychol. 64, 138–146 (2012).
  • 5. Cumming G., Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis (Routledge, 2013).
  • 6. Gardner M. J., Altman D. G., Confidence intervals rather than P values: Estimation rather than hypothesis testing. BMJ 292, 746–750 (1986).
  • 7. Krzywinski M., Altman N., Visualizing samples with box plots. Nat. Methods 11, 119–120 (2014).
  • 8. Weissgerber T. L., Milic N. M., Winham S. J., Garovic V. D., Beyond bar and line graphs: Time for a new data presentation paradigm. PLoS Biol. 13, e1002128 (2015).
  • 9. Show the dots in plots. Nat. Biomed. Eng. 1, 0079 (2017).
  • 10. Ho J., Tumkaya T., Aryal S., Choi H., Claridge-Chang A., Moving beyond P values: Data analysis with estimation graphics. Nat. Methods 16, 565–566 (2019).
  • 11. Allen M., Poggiali D., Whitaker K., Marshall T. R., Kievit R. A., Raincloud plots: A multi-platform tool for robust data visualization. Wellcome Open Res. 4, 63 (2019).
  • 12. Newman G. E., Scholl B. J., Bar graphs depicting averages are perceptually misinterpreted: The within-the-bar bias. Psychon. Bull. Rev. 19, 601–607 (2012).
  • 13. Correll M., Gleicher M., Error bars considered harmful: Exploring alternate encodings for mean and error. IEEE Trans. Visual. Comput. Graphics 20, 2142–2151 (2014).
  • 14. Kale A., Kay M., Hullman J., Visual reasoning strategies for effect size judgments and decisions. IEEE Trans. Visual Comput. Graphics 27, 272–282 (2020).
  • 15. American Psychological Association, Publication Manual of the American Psychological Association (American Psychological Association, ed. 7, 2019).
  • 16. JAMA: Instructions for Authors (2022). https://web.archive.org/web/20220412040639/https://jamanetwork.com/journals/jama/pages/instructions-for-authors. Accessed 4 December 2022.
  • 17. New England Journal of Medicine: Statistical Reporting Guidelines (2022). https://web.archive.org/web/20220405233315/https://www.nejm.org/author-center/new-manuscripts. Accessed 4 December 2022.
  • 18. Cumming G., Williams J., Fidler F., Replication and researchers’ understanding of confidence intervals and standard error bars. Understanding Stat. 3, 299–311 (2004).
  • 19. Belia S., Fidler F., Williams J., Cumming G., Researchers misunderstand confidence intervals and standard error bars. Psychol. Methods 10, 389 (2005).
  • 20. Hoekstra R., Morey R. D., Rouder J. N., Wagenmakers E. J., Robust misinterpretation of confidence intervals. Psychon. Bull. Rev. 21, 1157–1164 (2014).
  • 21. Soyer E., Hogarth R. M., The illusion of predictability: How regression statistics mislead experts. Int. J. Forecasting 28, 695–711 (2012).
  • 22. Cumming G., Fidler F., Vaux D. L., Error bars in experimental biology. J. Cell Biol. 177, 7–11 (2007).
  • 23. Krzywinski M., Altman N., Points of significance: Error bars. Nat. Methods 10, 921–922 (2013).
  • 24. Nagele P., Misuse of standard error of the mean (SEM) when reporting variability of a sample. A critical evaluation of four anaesthesia journals. Br. J. Anaesth. 90, 514–516 (2003).
  • 25. Hofman J. M., Goldstein D. G., Hullman J., “How visualizing inferential uncertainty can mislead readers about treatment effects in scientific results” in Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, 2020), pp. 1–12.
  • 26. Kim Y., Hofman J. M., Goldstein D. G., “Putting scientific results in perspective: Improving the communication of standardized effect sizes” in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, 2022).
  • 27. McGraw K. O., Wong S., A common language effect size statistic. Psychol. Bull. 111, 361 (1992).
  • 28. Ruscio J., A probability-based measure of effect size: Robustness to base rates and other factors. Psychol. Methods 13, 19 (2008).
  • 29. Sharpe W. F., Goldstein D. G., Blythe P. W., The distribution builder: A tool for inferring investor preferences (2000). https://web.stanford.edu/~wfsharpe/art/qpaper/qpaper.html.
  • 30. Goldstein D. G., Johnson E. J., Sharpe W. F., Choosing outcomes versus choosing products: Consumer-focused retirement investment advice. J. Consumer Res. 35, 440–456 (2008).
  • 31. distBuilder: A Javascript library to facilitate the elicitation of subjective probability distributions (2016). 10.5281/zenodo.166736. Accessed 11 January 2020.
  • 32. Chambers C. D., Registered reports: A new publishing initiative at Cortex. Cortex 49, 609–610 (2013).
  • 33. Friedman J. H., Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001).
  • 34. Cleveland W. S., “Coplots, nonparametric regression, and conditionally parametric fits” in Multivariate Analysis and Its Applications (Institute of Mathematical Statistics, Hayward, CA, 1994), pp. 21–36.
  • 35. Hullman J., Resnick P., Adar E., Hypothetical outcome plots outperform error bars and violin plots for inferences about reliability of variable ordering. PLoS ONE 10, e0142444 (2015).
  • 36. Kay M., Kola T., Hullman J. R., Munson S. A., “When (ish) is my bus? User-centered visualizations of uncertainty in everyday, mobile predictive systems” in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (San Jose, CA, 2016), pp. 5092–5103.
  • 37. Cohen J., The earth is round (p < .05). Am. Psychol. 49, 997 (1994).
  • 38. Hofman J. M., et al., Integrating explanation and prediction in computational social science. Nature 595, 181–188 (2021).
  • 39. Delacre M., Lakens D., Leys C., Why psychologists should by default use Welch’s t-test instead of Student’s t-test. Int. Rev. Soc. Psychol. 30, 92–101 (2017).
  • 40. Zhang S., et al., An illusion of predictability in scientific results. GitHub. https://github.com/jhofman/illusion-of-predictability/tree/main/raw-data. Deposited 29 April 2022.
