Every working scientist knows that in the details are both devils and angels. Lots of small design decisions have to be made in collecting and analyzing data, and those decisions affect conclusions. But beginning scientists, from rookies at school science fairs to students in the early years of a rigorous Ph.D. program, are often surprised by how much small decisions matter. Despite this recognition that details matter, when science is communicated, many small decisions made privately by a science team are hidden from view. It is difficult to disclose every detail (and usually little disclosure is required). Such hidden decisions can be thought of as “dark methods,” like dark matter, which cannot be directly seen because it does not reflect light but which is evident from its other effects. The Herculean effort behind the new many-analyst study (1), which is the subject of my Commentary, should force a painful reckoning about the extent of these dark method choices and their influence on conclusions. Design decisions of each team that were coded (107 of them) explained at most 10 to 20% of the outcome variance. Assuming that the coding itself is not too noisy, it seems that hidden decisions account for the lion’s share of what different teams conclude.
In ref. 1, the authors recruited 73 teams to test the hypothesis that “immigration reduces public support for government provision of social policies.” Whether this hypothesis is true is obviously an important question, especially now and very likely in the world’s future as well. The hypothesis is also sufficiently clear that the social sciences should be able to generate some progress toward an answer.
The teams were given data about 31 countries (mostly rich and middle-income) from five waves of International Social Survey Programme (ISSP) data spanning 1985 to 2016, asking six questions about the role of government in policies about aging, work, and health. Yearly data on immigrant stock and flow came from the World Bank, UN, and OECD. These are the best available data covering many countries and years in a standardized way and are widely used.
What did they find? Average marginal effects of immigrants on policy support were significantly positive or negative in 17% and 25% of the tested models, respectively. The remaining 58% of model results had 95% confidence intervals that included zero. The range of subjective conclusions was similar.
Their next question was how well differences in estimates could be explained by various sources of variance. It might be that null results finding no effect come mostly from less experienced teams, or that different subjective prior beliefs influenced what teams found. In fact, differences in measured expertise and prior beliefs made little difference. The authors coded 107 separate design decisions taken by three or more teams—decisions such as the choice of estimator, measurement strategy, independent variables, subsets of data, etc. These coded decisions explained only a little more than 10% of the variance in results between teams.
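The logic of this variance accounting can be made concrete with a minimal sketch in Python. It uses entirely hypothetical data and far fewer coded decisions than the 107 in ref. 1, and it is not the study’s own analysis; it simply illustrates the idea of regressing team-level estimates on dummy-coded design choices and reading off the share of between-team variance explained (R²).

```python
import numpy as np

rng = np.random.default_rng(0)
n_teams, n_coded = 73, 20  # 20 illustrative binary decisions (the study coded 107)

# Hypothetical design-decision indicators: 1 if a team made a given choice, else 0.
X = rng.integers(0, 2, size=(n_teams, n_coded)).astype(float)
X = np.column_stack([np.ones(n_teams), X])  # add an intercept column

# Hypothetical team-level average marginal effects (AMEs): a small signal from
# two coded decisions plus a lot of "dark" variation the coding does not capture.
ame = 0.02 * X[:, 1] - 0.015 * X[:, 2] + rng.normal(0.0, 0.05, size=n_teams)

# Ordinary least squares fit and the share of between-team variance explained (R^2).
beta, *_ = np.linalg.lstsq(X, ame, rcond=None)
resid = ame - X @ beta
r_squared = 1.0 - resid.var() / ame.var()
print(f"Share of between-team variance explained by coded decisions: {r_squared:.2f}")
```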
The authors conclude that even when trying to carefully code these design decisions (specifically in order to shed light on typically dark methods), the coded variables do not explain much. Eighty percent of the variance in team-reported results is due to other variables that were not coded. Fig. 1A illustrates both the variability in team outcomes and the weak relation between high-level design features and those outcomes.
Fig. 1.
Illustrations of dark methods and of peer scientists mispredicting outcomes and variation. (A) Average marginal effect (AME) (y-axis) plotted from low to high (x-axis) for three subjective conclusion categories reported by teams. Subjective conclusions overlap, e.g., AMEs from .00 to −.015 are included in all three subjective categories. High-level design characteristics (blue intensity coding, Bottom) are not evidently correlated with either AME or conclusions. (B) Prices in an fMRI-linked scientific prediction market predicting the percentage of teams finding support for hypothesis 8. The true (“fundamental”) value is .057, but market prices overestimate that number. Market prices associated with team scientists who did analysis (green) overestimate less than those of nonteam market traders. (C) Box-whisker plots of 164 teams’ subjective beliefs about cross-team dispersion (y-axis, log scale). Actual dispersion is much higher for the outlier-ridden full sample (coral red) than for a winsorized sample (tangerine yellow). For four of the hypotheses, the actual dispersion is close to the 97.5% top whisker of research team beliefs. (Sources: ref. 1, SI Appendix, fig. S9; ref. 6, Extended Data fig. 5; and ref. 7, fig. 5.)
The challenges posed by the surprising influence of dark methods come after almost two decades of other questions about how well current practices cumulate scientific regularity (2). Social scientists—as well as those in other fields, especially medicine—are now well aware of the feared and actual impacts of p-hacking, selective inference, and both scientist-driven and editorial publication bias. A small wave of direct replications in psychology, economics, and general science journals, intended to reproduce previous experimental protocols as closely as possible, typically found that many or most results do not replicate strongly (3–5). (My rule of thumb is that the long-run effect size of a genuine discovery will be about 2/3 as large as the original effect.) But most social sciences have also turned toward self-correction, albeit at the slow pace of turning a large oil tanker rather than a sports car. Preregistration, journal requirements for data archiving, and Registered Reports preaccepted before data are collected are big steps forward.
Thanks to the efforts of hundreds of scientists, we can now draw some general conclusions from both this new study (1) and two other recent “many-analyst” studies (6, 7). In all three studies, a large number of analysis teams were given both common data of unusually high quality and sample size and clear hypotheses to test. As in ref. 1, any differences in results therefore arise only from differences in teams’ methods.
In ref. 6, 70 teams were given fMRI data from a large sample of N = 108 participants who chose whether or not to accept a series of gain–loss gambles. Teams tested nine specific hypotheses, derived from previous findings, about whether specific brain regions encoded decision variables (e.g., was ventromedial prefrontal cortex activity larger for larger potential gains?). For about half of the nine hypotheses, most teams came to the same conclusion, either that there was very little activation or (in one case) a lot of activation supporting the hypothesis. For the other four hypotheses, there was disagreement, with (thresholded) activation reported by 20 to 80% of teams. In looking for design decisions that explained variance in results, the authors examined five variables. The two most important were which software package was used and the smoothness of the neural spatial map. (Maps are routinely smoothed because localized measures of neural activity are noisy, but the extent of smoothing is a design choice that varied.) But each of these variables contributed only .04 to R². This result brings a small number of basic “dark” design choices into the light but leaves a lot unexplained.
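One simple way to read a per-variable contribution to R² is as an incremental R²: how much explained variance a single design choice adds on top of the others. The sketch below uses entirely hypothetical data and variable names (software package, smoothing kernel width) and is only an illustration of that idea, not the analysis in ref. 6.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (with an intercept)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(2)
n_teams = 70

# Hypothetical design variables: software package (coded 0/1/2) and the
# width of the spatial smoothing kernel in mm.
software = rng.integers(0, 3, size=n_teams).astype(float)
smoothing = rng.uniform(4.0, 10.0, size=n_teams)

# Hypothetical team-level outcome (e.g., a statistic in a region of interest),
# with a weak dependence on smoothing and a lot of unexplained variation.
outcome = 0.1 * smoothing + rng.normal(0.0, 1.0, size=n_teams)

base_model = software.reshape(-1, 1)
full_model = np.column_stack([software, smoothing])
delta_r2 = r_squared(full_model, outcome) - r_squared(base_model, outcome)
print(f"Incremental R^2 from the smoothing choice: {delta_r2:.3f}")
```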
In ref. 7, 164 teams were given data on 720 million trades from 2002 to 2018 in the most actively traded derivative, the EuroStoxx 50 index futures contract. They were asked to test six hypotheses about changes in trading activity over this span. Changes are scientifically and practically interesting because the sample includes a global financial crash, which led to changes in regulation, as well as a rise in rapid algorithmic trading and other trends.
The changes the teams estimated vary a lot across the six hypotheses because they are annualized changes in quantities like order flow or market efficiency; they are not effect sizes (although t-statistics are heavily analyzed and easier for readers of this Commentary to understand). However, the variation across teams—which the authors cleverly call “nonstandard errors”—is about 1.65 times as large as the mean SE of the estimates. This cross-team dispersion is only slightly lower for the highest-quality research teams.
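The comparison is simple to write down. In a minimal sketch with purely hypothetical numbers, the nonstandard error for one hypothesis is the standard deviation of point estimates across teams, which is then compared with the average of the ordinary standard errors the teams report for their own estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teams = 164

# Hypothetical team-level point estimates for one hypothesis, and the ordinary
# standard errors each team reports for its own estimate.
estimates = rng.normal(0.01, 0.03, size=n_teams)
reported_se = rng.uniform(0.015, 0.025, size=n_teams)

# The "nonstandard error" is the dispersion of point estimates across teams.
nonstandard_error = estimates.std(ddof=1)
ratio = nonstandard_error / reported_se.mean()  # ref. 7 reports a ratio near 1.65

print(f"nonstandard error = {nonstandard_error:.4f}")
print(f"ratio to mean reported SE = {ratio:.2f}")
```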
The fMRI and EuroStoxx studies also compared peer scientists’ numerical predictions of how large the cross-team dispersion would turn out to be with the actual dispersion.
In the fMRI study, “prediction markets” were created in which both research team scientists and others who were not on teams could trade artificial assets whose monetary value equaled the percentage of teams that accepted a hypothesis. The market predictions overestimated the probability of hypothesis acceptance by 64%, although prediction prices and outcomes were highly correlated across the nine hypotheses, and team members were more accurate than nonteam traders (team members r = .96, P < .001; nonteam members r = .55, P = .12) (Fig. 1B). This cross-hypothesis accuracy is consistent with some degree of accuracy of science-peer predictions across treatment effects (e.g., ref. 8).
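The two accuracy measures quoted above can be computed in a few lines. The sketch below uses made-up prices and outcomes (not the actual data from ref. 6) and one simple way of quantifying average overestimation, alongside the cross-hypothesis correlation between market prices and realized team support.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical market prices (team-member traders) and the realized fraction of
# teams supporting each of nine hypotheses; numbers are made up for illustration.
prices = np.array([0.80, 0.30, 0.50, 0.25, 0.70, 0.40, 0.20, 0.15, 0.90])
outcomes = np.array([0.85, 0.05, 0.35, 0.05, 0.40, 0.15, 0.05, 0.05, 0.95])

# One simple measure of average overestimation, and cross-hypothesis accuracy.
overestimation = (prices.mean() - outcomes.mean()) / outcomes.mean()
r, p = pearsonr(prices, outcomes)

print(f"mean overestimation = {overestimation:.0%}")
print(f"cross-hypothesis correlation: r = {r:.2f}, P = {p:.3f}")
```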
In the EuroStoxx study, dispersion of beliefs was 71% lower than the actual dispersion of estimates (Fig. 1C).
These results show that the research team scientists doing these analyses substantially mispredicted the likely results and underestimated the cross-team dispersion. In other words, part of the scientific study itself was to carefully document whether the participants were surprised by the results or not. They were surprised. Readers should be too (please resist hindsight bias).
The organizers of these three many-analyst studies have thought of every angle. Their evidence strongly suggests that obvious explanations for hidden variability are just not right. The teams were chosen and evaluated on simple measures of expertise: In ref. 1, 83% had experience teaching data analysis courses, and all of them were first required to reproduce the findings of a widely cited study (9) to join the research teams. It does not appear that weaker or stronger research teams (judged by experience and publications, or by outside peer review for EuroStoxx) have less outcome dispersion.
A trickier question is whether some slippery concept of vagueness of hypothesis tests creates the hidden multiverse, which would not arise for a sharper hypothesis test. It is true that testing the hypothesis “immigration reduces public support for government provision of social policies” seems to allow a lot of freedom to measure almost every scientific word in the hypothesis differently (immigrant, public support, social policies). But the data sets are the same, so there is very limited room for measurement differences. Furthermore, the fMRI hypotheses (6) are not vague at all: they were taken straight from previous papers and are very clear about what brain regions and statistical outcomes are hypothesized to be associated. Even in that case, for about half of the nine hypotheses, there was substantial cross-team variation in results.
While it is difficult to know, by definition, what the hidden dark method differences are, why are these experienced scientists so surprised by the amount of dispersion between their own work and their peers’ work? The fact that predictions about both outcome variability and outcome levels are so wildly off, in the finance and fMRI many-analyst studies that measured predictions, suggests that individual scientists do not appreciate how different their peers’ analytical choices are and how much results will be affected. How can evidence of dark analytical variability exist, as these studies show, and yet stay so hidden that it is this surprising?
A possible answer is that scientists are not immune to the “false consensus” bias, well established in psychology, in which people overweight their own views when judging what others think (10). But such a mistake is particularly surprising because science is so open about general analytical differences. Between large conferences, small seminars, and peer review, there are many opportunities to debate alternative design choices and their likely impact.
There is a large difference between the two studies (immigration and EuroStoxx) in how research teams responded to feedback. While immigration teams in ref. 1 could change their models and resubmit revised results after seeing what others did, “no team voluntarily opted to do this” (except after coding mistakes). The authors (1) suggest that more “epistemic humility” is needed. However, in the EuroStoxx many-analyst study, there was a lot of revision and a subsequent reduction in team variability across four steps of the study (notably, after peer reviewers commented on early-stage results and again after the five papers judged by nonteam peers to be the best were publicized to all teams). Analytical variance fell by 53% for the main sample (which excluded outliers). This stark difference in revision rates after feedback about other teams’ results seems to reflect norms in different fields about some combination of humility and conformity.
What’s next? Will more and better data save us? It is not at all clear that better data will bring conclusions closer together. The EuroStoxx data are as good as an 18-y span could be for testing simple hypotheses about how financial markets have changed, and there is still dispersion that surprised the experts working with that single oceanic set of data. One can imagine excellent new sources of data about immigration, political reactions, and popular support for policies. But new data sources are more likely to produce an even larger combinatorial explosion of different design decisions. There is little chance that more new data will lead to more convergence rather than to a new proliferation of different approaches.
One hoped-for next step is improvement in quantifying and promoting transparency*. Meanwhile, it is hard to see how the regular peer-review process can continue to operate credibly in the face of this new evidence about the hidden analytical multiverse. When selective peer-reviewed journals reject one paper on a topic and accept another, they are implicitly endorsing the combination of methods and results of the accepted paper over the methods and results of the rejected paper. An acceptance says “We think this study used a superior method and got us closer to the truth.” But how can such endorsements be made with confidence when so much of the method is hidden?
Put more vividly, imagine if all 73 teams’ immigration manuscripts from ref. 1 were submitted to journals over a period of time. In light of these results, there would be no evidentiary basis to claim that one paper’s methods were better and more truth-producing than another’s. (Remember that measures of scientific competence and investigator prior belief did not matter much either, so referees cannot fall back on those simple judgments to say yes or no.) But editors have to make decisions and usually have to say no far more often than they can say yes. If referees and editors are no better at sniffing out the dark methods creating different outcomes than these fastidious researchers (1) were, how can and do referees decide? The pressure to decide among many equally outstanding papers creates plenty of room for editorial bias, referee-author rivalry, faddish conformity, network favoritism, and other influences to sneak in. Even worse, editorial choices can have large multiplier effects by guiding other researchers, especially those with the largest career concerns, in the directions pointed out by published articles.
Based on these results, professional organizations—particularly societies and their journal editors—should be in crisis-management mode. An obvious step—which could start tomorrow—is to help organize, fund, and commit to publish more many-analyst multiverse studies. The power to move scientists in a better direction is held by journals, funding agencies, and (to some extent) rich universities. The great news from ref. 1 and the other two studies described here (as well as important precursors and ongoing efforts) is that a lot of talented scientists are willing to spend valuable time figuring out how to do science that is clearer and cumulates regularity better, in the face of the surprising many-analyst variance in results seen across all three studies.
Acknowledgments
Author contributions
C.F.C. wrote the paper.
Competing interest
The author declares no competing interest.
Footnotes
See companion article, “Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty,” 10.1073/pnas.2203150119.
*In analysis of structural economic models, it is usually hard for readers to tell how results would differ if an assumption was violated (a type of dark method). A computable model of transparency was derived by ref. 11. A similar idea could prove useful in other social sciences.
References
- 1. Breznau N., et al., Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proc. Natl. Acad. Sci. U.S.A. 119, e2203150119 (2022).
- 2. Nosek B. A., et al., Replicability, robustness, and reproducibility in psychological science. Annu. Rev. Psychol. 73, 719–748 (2022).
- 3. Open Science Collaboration, Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
- 4. Camerer C. F., et al., Evaluating replicability of laboratory experiments in economics. Science 351, 1433–1436 (2016).
- 5. Camerer C. F., et al., Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat. Hum. Behav. 2, 637–644 (2018).
- 6. Botvinik-Nezer R., et al., Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582, 84–88 (2020).
- 7. Menkveld A. J., et al., Non-standard errors. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3961574 (2022).
- 8. DellaVigna S., Pope D., Predicting experimental results: Who knows what? J. Polit. Econ. 126, 2410–2456 (2018).
- 9. Brady D., Finnigan R., Does immigration undermine public support for social policy? Am. Sociol. Rev. 79, 17–42 (2014).
- 10. Marks G., Miller N., Ten years of research on the false-consensus effect: An empirical and theoretical review. Psychol. Bull. 102, 72–90 (1987).
- 11. Andrews I., Gentzkow M., Shapiro J. M., Transparency in structural research. J. Bus. Econ. Stat. 38, 711–722 (2020).

