Abstract
Conducting research with human subjects can be difficult because of limited sample sizes and small empirical effects. We demonstrate that this problem can yield patterns of results that are practically indistinguishable from flipping a coin to determine the direction of treatment effects. We use this idea of random conclusions to establish a baseline for interpreting effect-size estimates, in turn producing more stringent thresholds for hypothesis testing and for statistical-power calculations. An examination of recent meta-analyses in psychology, neuroscience, and medicine confirms that, even if all considered effects are real, results involving small effects are indeed indistinguishable from random conclusions.
Keywords: random conclusions, estimation, hypothesis testing, t tests, benchmarks
Introduction
Human-subjects research often involves noisy measures and limited sample sizes. Accordingly, small effects and low statistical power are typical in many areas of behavioral and medical science (Marek et al., 2022; Szucs & Ioannidis, 2017). Some argue that this situation is tenable because the ongoing identification of small effects amounts to a steady accumulation of knowledge (Götz et al., 2022). We argue to the contrary. Specifically, we show that the study of small effects frequently produces results that are indistinguishable from flipping a coin to determine the direction of an experimental treatment’s effect. We use this idea to develop a benchmark based on minimum acceptable estimation accuracy. This benchmark yields an intuitive interpretation of effect-size estimates—one based in accurate estimation. We show that calibrating existing tests to our benchmark yields far stricter thresholds for hypothesis testing and for statistical-power calculations. Our work is intended to spark a larger discussion within the scientific community on acceptable estimation accuracy, the interpretation of effects, and statistical standards.
Although there are many exceptions, behavioral scientists almost universally test null hypotheses, which are often formulated as two or more means being exactly equal to one another. Much ink has been spilled noting the shortcomings of this approach (e.g., Krantz, 1999; Nickerson, 2000; van de Schoot et al., 2011). Cohen (1994) famously criticized the null hypothesis through his “nil” hypothesis critique, describing it as a conceptual tool that is ill-suited for answering substantive research questions. He noted that for continuous dependent variables, it is simply impossible for two population means to be truly equal to one another. This means that the null hypothesis acts as a straw man that will inevitably be knocked down, given a sufficiently large sample size. By his critique, all effects exist in a trivial sense; it just may be that some are so small that they do not warrant attention. A more meaningful line of investigation is determining whether effects are accurately estimated and characterized.
What constitutes acceptable estimation accuracy? This question is challenging to answer and fraught with subjectivity. A confidence interval (CI) deemed acceptably narrow by one scientist may be unacceptably wide to another. We seek to answer this question by the use of a reference—a foil—with undeniable negative qualities. To better understand the accuracy of standard methods, we will compare them against a foil estimation process that is, by construction, incapable of accurately estimating effects. Such a foil is useful for handling questions of subjectivity. If a community of scientists agree that this foil is unacceptably inaccurate, then any estimation process that cannot be distinguished from it is also unacceptably inaccurate.
Our foil must be tailored to the types of questions that behavioral scientists ask and to how they make decisions about data. Behavioral scientists often formulate directional hypotheses about treatment effects. Is the population mean of Group A larger than that of Group B? A strong foil would offer zero information about the correct direction of effects. A foil could randomize the direction of any observed effect; for example, which group mean is larger than another would be decided via a coin flip. Such a foil creates a worst-case scenario for evaluating any directional hypothesis. In addition, behavioral scientists typically use the outcome of a statistical test to conclude whether a treatment effect is detected. In keeping with our estimation focus, an ideal foil would remove effect detection from the comparison. One way to handle this is for the foil to correctly detect whether an effect exists at similar, or identical, rates as standard methods. A scientist using this foil would correctly reject a relevant null hypothesis just as often as someone using standard estimation methods. This would make the foil especially useful for evaluating published findings in the literature.
Scientists using such a foil would arrive at random conclusions regarding their data. All else being equal, they would detect effects as often as scientists using standard methods, but would be incapable of accurately estimating and characterizing them. The logic is straightforward: If one accepts that arriving at random conclusions is unscientific and inaccurate, then it becomes incumbent on the scientific community to use statistical procedures that would be distinguishable from such a foil.1 In the present work, we focus on the canonical case of using sample means to estimate population means for two independent groups. Our proposed foil consists of an estimation process that randomizes the direction of treatment effects while still correctly rejecting a null hypothesis as often as standard methods.
Our analyses reveal that distinguishing sample means from such a foil requires far larger sample sizes than typically employed in the behavioral sciences, especially when studying the kinds of small effects that are commonplace in the psychological literature. We also show that our foil comparison naturally relates to many existing tests and methods, including those based on traditional null hypotheses. We leverage these connections to provide new calibrations for existing techniques. For power analyses, we show that typical power thresholds of .80 are not sufficient to rule out unacceptable estimation accuracy. Linking our argument to hypothesis testing, we show that far stricter thresholds (e.g., α = .0005) are required if sufficient estimation accuracy is to be ensured. We also provide a simple methodology that allows researchers to convert a common measure of effect size, Cohen’s d, into an easily understood measure of estimation accuracy on the basis of our foil. This methodology can be applied to CIs over Cohen’s d, allowing researchers to determine whether their estimates are acceptably accurate. Finally, we examine a collection of meta-analyses from the behavioral sciences, finding that typical estimates in many fields of study are indistinguishable from our random conclusions foil.
Ultimately, all scientific decisions regarding data are made by human beings. A key aim of any statistical methodology is to provide characterizations of data that researchers can understand. What we provide in the current work is simply a perspective, one grounded in a common experimental design with linkages to many other familiar statistical quantities and methods. It is through this framing that we aim to push forward the conversation on estimation accuracy and replication efforts. To further understand our approach and provide precise definitions, consider the following scenario.
A Tale of Two Labs
Consider two hypothetical laboratories, Lab 1 and Lab 2, studying an effect—for instance, the efficacy of a drug. Both labs use a treatment condition (Group A) and a control condition (Group B) and compare the sample means from each group, X̄_A and X̄_B, on some outcome measure. These sample means underpin the statistical tests conducted by both labs and provide point estimates for the population means, μ_A and μ_B, that instantiate their scientific hypotheses regarding the drug’s effect. Assume that the drug has a true effect δ, where δ = (μ_A − μ_B)/σ, with σ being the standard deviation of responses from the populations.2
Unfortunately, Lab 2 has a glitch in their data-analysis software—it randomly assigns, with equal likelihood, the labels of “treatment” and “control” to those means. That is, if Lab 2 conducted a study for which the actual sample means for the two conditions were X̄_A and X̄_B, the software would instead report X̄_B and X̄_A with probability equal to .5, and the truth cannot be recovered. We refer to this procedure as a random-conclusions estimator (RCE) because the direction of the effect—whether the drug helps or harms—is determined at random. Although mathematically related, the RCE is distinct from a classic Fisher randomization test in which labels are randomized at the individual response level to generate a null, no-effect reference distribution.
If Lab 2’s error came to light, retraction of any study that relied on this software would be demanded, and a drug approved on the basis of such results would (rightfully) be recalled. But Lab 2 provides an interesting comparison with Lab 1, especially when considering issues of replication and reliability. Lab 2 will correctly reject the null hypothesis, H₀: μ_A = μ_B, exactly as often as Lab 1 using a two-tailed t test. Barring preregistration restrictions, both labs will publish results at similar rates. In this way, Lab 2 will pollute the scientific literature with random conclusions and, in the case of drug trials, potentially claim evidence for dangerous treatments.
Lab 1 and Lab 2 are identical with the exception that Lab 2 is using an RCE, which, by any measure, is not science because the direction of effects (including published effects) is determined via a coin flip. Intuitively, we would like to believe that results from the two labs would be readily distinguishable. Unfortunately, in many areas of behavioral science, even if all effects exist, Lab 2’s results will often be strikingly similar to Lab 1’s, and the gain from removing their results from the literature may be marginal at best. This situation is illustrated in Figure 1, which presents scenarios for effect sizes that are conventionally considered large, medium, and small (yet interpretable; Cohen, 1988; Sawilowsky, 2009). For simplicity, these scenarios assume that outcomes in both conditions are normally distributed with unit variance. The left and right columns of Figure 1 illustrate the sampling distributions of mean estimates in each of the labs. Each dot represents a pair of means from a single study. How well these means estimate the population means μ_A and μ_B is quantified in terms of a common metric for assessing estimation accuracy: mean-squared error (MSE; see the Appendix).
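To make the Lab 2 procedure concrete, the following sketch simulates a single two-group study as it would be reported by each lab. This is our own illustrative code (not taken from the article or its OSF repository); the sample size, effect size, and normal outcomes are assumptions made for the example.

```r
# Illustrative sketch: one study as reported by Lab 1 (sample means) and by
# Lab 2 (the RCE). Outcomes are assumed normal with unit variance.
set.seed(1)
n     <- 25    # assumed sample size per group
delta <- 0.2   # assumed true standardized effect (mu_A - mu_B, with sigma = 1)

group_A <- rnorm(n, mean = delta)  # treatment
group_B <- rnorm(n, mean = 0)      # control

lab1_est <- c(A = mean(group_A), B = mean(group_B))  # Lab 1 reports these

# Lab 2's glitch: with probability .5 the condition labels are swapped.
swapped  <- runif(1) < .5
lab2_est <- if (swapped) c(A = lab1_est[["B"]], B = lab1_est[["A"]]) else lab1_est

lab1_est
lab2_est  # half the time the reported direction of the effect is reversed
```

Averaging squared estimation errors over many such simulated studies reproduces the MSE comparison depicted in Figure 1.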
Fig. 1.
Distribution of sample mean estimates X̄_A and X̄_B for Lab 1 and Lab 2. Each row corresponds to a different combination of effect size δ and sample size per group n. The ratio of mean-squared error (MSE) values for the two labs, MSE_Lab2/MSE_Lab1, is represented by Λ. To facilitate visualization, we report all relevant values for each comparison from both Lab 1 and Lab 2 (δ, n, MSE) in the Lab 1 panel.
In the top row of Figure 1, the effect size is large. Lab 2’s bimodal distribution of estimates clearly evidences the software error, and the resulting MSE is 19 times larger than Lab 1’s. We use Λ to denote the ratio MSE_Lab2/MSE_Lab1; Λ has a lower bound of 1, given that there is no scenario in which Lab 2’s estimates will be, on average, more accurate than Lab 1’s. The middle and bottom rows of Figure 1 illustrate how the estimates from the two labs converge as the effect size becomes smaller, with Lab 2’s distribution of estimates eventually becoming unimodal. These changes are indexed by Λ: In the bottom row, Λ = 1.5, and estimates from the two labs are visually nearly indistinguishable, an impression confirmed by a small Wasserstein metric (Rubner et al., 2000) and the large number of replicates needed (at least 54 per lab) to reliably distinguish the distributions of results from the two labs via a Kolmogorov-Smirnov test (see the Appendix).
Effect size and sample size combinations like those in the bottom row of Figure 1 raise an important question: If Lab 2’s results are subject to retraction, how should we interpret Lab 1’s results? Put differently, if one’s results look unscientific, perhaps they are unscientific. A computer glitch on the scale of Lab 2’s results is, one hopes, an unlikely occurrence, but the comparison is useful in illustrating what a worst-case estimator could look like and why it would be problematic if it were indistinguishable from current practice. Within the behavioral sciences, many of the hypotheses being tested, if not the vast majority, are directional in nature. The RCE completely randomizes the direction of effects, removing any information about direction from the data. Yet the RCE is special in that it still detects effects at the same rate as sample means via a nondirectional test, which is, once again, ubiquitous practice in the behavioral sciences. In this way, our RCE comparison provides an interesting new perspective on published literature in the field, which often hinges upon the successful reporting of a significant test. We are not seriously suggesting that such a computer glitch exists, but we do think it highly problematic if a large corpus of work within the behavioral sciences is indistinguishable from such an error.3
General Formulation
If the goal is to be distinguishable from a veritable Lab 2, as instantiated by the RCE, we can use Λ as an index to set standards for hypothesis testing and sample-size planning. As shown in the Appendix, Λ simplifies to
\Lambda = \frac{\mathrm{MSE}_{\mathrm{RCE}}}{\mathrm{MSE}_{\mathrm{SM}}} = 1 + \frac{n\,\delta^{2}}{2}, \qquad (1)
where n is the sample size per group. Equation 1 is straightforward to interpret: For given values of δ and n, sample means are Λ times as accurate (in terms of MSE) as the RCE. Although Λ is distribution-free and interpretable outside of any testing framework, it functionally relates to a two-sample t test and the resulting p values. See the Appendix for connections between Λ and other metrics, including out-of-sample R². This relationship allows us to reexamine hypothesis-testing and statistical-power standards by calibrating to minimally acceptable estimation, as opposed to detection error rates against a null hypothesis. The mathematics are familiar, but the RCE comparison offers new interpretation to these techniques.
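For readers who want to compute Λ directly, a minimal sketch of Equation 1 follows; the function names are ours, not the article's.

```r
# Equation 1: Lambda = 1 + n * delta^2 / 2, with n the per-group sample size.
lambda_from_effect <- function(n, delta) 1 + n * delta^2 / 2

# Plug-in sample counterpart via the equal-n two-sample t statistic,
# using t^2 = n * d^2 / 2, so that the estimated Lambda equals 1 + t^2.
lambda_from_t <- function(t) 1 + t^2

lambda_from_effect(n = 25, delta = 0.2)   # 1.5
lambda_from_effect(n = 100, delta = 0.5)  # 13.5
lambda_from_t(t = 2.5)                    # 7.25
```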
Determining a minimum acceptable Λ for a given scientific discipline is perhaps best decided on a case-by-case basis, taking into consideration specific research goals (S. F. Anderson & Maxwell, 2016; Navarro, 2019). Here, we demonstrate the consequences of a Λ threshold of 3 for the interpretation of results and sample-size planning. Although somewhat arbitrary and perhaps modest, this threshold is motivated by the logic illustrated in Figure 1. When Λ falls below this threshold, the sampling distribution of the RCE becomes unimodal for normal random variables (Figs. A3–A7, Appendix), and the number of study replicates required to reliably distinguish it from sample means becomes impractical (Table A1). If we take our illustration with the two labs seriously, poor Λ values imply that members of Lab 1 and Lab 2 could spend their entire careers replicating scores of studies and be unable to reject the null hypothesis that they are using the same estimator (see the Appendix).
Table A1 in the Appendix characterizes Λ in terms of the information about the direction of effect that is gained by using sample means versus the RCE. For example, for Λ = 1.5, the usage of sample means reduces the uncertainty about the correct direction of effects by only 29% compared with the total uncertainty given by the RCE (see also Fig. A1, Appendix). In this way, our RCE comparison links directly to the concept of Type S errors regarding the sign of the effect (Gelman & Carlin, 2014; Gelman & Tuerlinckx, 2000). See also recent work by Domingue et al. (2021), who applied the concept of weighted coins to develop a measure of predictive accuracy for binary outcomes.
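A complementary way to see the link to Type S errors is to compute the probability that the sample mean difference points in the correct direction. Under normality this probability is Φ(δ√(n/2)) = Φ(√(Λ − 1)). The short sketch below is our own illustration of that relation and is distinct from the binned-entropy information gain reported in Table A1.

```r
# Probability that the observed mean difference has the correct sign,
# expressed as a function of Lambda (normal-theory illustration).
p_correct_sign <- function(lambda) pnorm(sqrt(lambda - 1))

round(p_correct_sign(c(1.5, 3, 5, 10)), 3)
# approximately 0.760 0.921 0.977 0.999; the RCE is correct 50% of the time
```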
Applications to CIs and Hypothesis Testing
Applying Equation 1 to the bounds of a 95% CI over Cohen’s d provides researchers a simple, transparent method to gauge how accurately a range of plausible effects is being estimated. For example, consider a study with a sample size of 50 per group that yields an effect-size point estimate of d = 0.5 and a 95% CI equal to [0.10, 0.89] (see, e.g., Cumming & Finch, 2001). This interval does not include 0, corresponds to a p value of .014, and by current standards would provide researchers assurance that an effect has been detected. But even if this interval contains the population value δ, researchers cannot be confident that their estimation is better than the bottom row of Figure 1. Applying Equation 1 to this CI yields a Λ interval of [1.25, 20.80], which includes conditions in which sample mean estimates are practically indistinguishable from random conclusions. Put another way, these researchers may claim that the population means are not equal, but, upon examining the bounds on Λ, may also conclude that there remains tremendous uncertainty regarding the size and direction of the effect. Indeed, sample means estimation yields a 16.075% reduction in uncertainty (relative to the RCE) at the lower bound (Λ = 1.25) and a 99.998% reduction in uncertainty at the upper bound (Λ = 20.80; Fig. A1). Although the effect-size estimate implies a difference between groups, the accuracy of this estimate could be anything from a blind guess to a statement of fact.
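A minimal sketch of the worked example above, assuming the reported interval comes from 50 participants per group:

```r
# Transform the bounds of a 95% CI over Cohen's d into a Lambda interval
# via Equation 1 (Lambda = 1 + n * d^2 / 2).
n    <- 50
d_ci <- c(lower = 0.10, upper = 0.89)

lambda_ci <- 1 + n * d_ci^2 / 2
round(lambda_ci, 2)  # lower 1.25, upper 20.80
```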
Figure 2 contextualizes Λ within familiar statistical quantities:
Fig. 2.
Relationship between Λ and different relevant quantities. The bands correspond to the 95% confidence intervals (CIs) of Λ. The power values reported in (b) are also reported in Table 1. For further details, see the main text and the Appendix.
Panel (a): Ensuring that Λ is greater than 3 often requires a large n, especially when dealing with smaller effect sizes. Sample size requirements are more stringent if one also wants to achieve 95% confidence that the true Λ is larger than 3. For example, the estimation accuracy of a small effect (with δ = 0.3) requires a sample of 2 × 255 = 510 to be confidently acceptable. See the Appendix for a discussion on how effect-size priors can be used to determine n.
Panel (b): The requirements for acceptability can also be framed in terms of statistical power. Regardless of δ, under the standard alpha level (α = .05), statistical power needs to be above .92 for CIs over Λ to exclude values less than 3. Minimally acceptable estimation of an effect requires its detection to be near certain: Common but arbitrary power standards, such as .80, do not yield estimates that rule out unacceptable estimation accuracy.
Panel (c): It is well known that a larger n results in smaller observed effects becoming statistically significant. However, the Λ associated with said effects can still be unacceptable. For example, critical effects with a p value of .05 yield an estimated Λ of approximately 5, with CIs that include values very close to 1. In comparison, critical effects with a p value of .0005, which are approximately 78% larger than their .05 counterparts, yield confidently acceptable Λ values (see the sketch following these panels). We note that using an α of .0005 as a threshold for null-hypothesis testing is a stricter standard than other recent proposals that focus on the detection of effects (Benjamin et al., 2018). Such a stringent criterion makes it more difficult for questionable research practices, such as p-hacking (Simmons et al., 2011), to affect the outcome. Finally, these results may also serve to dampen researcher urges to characterize nonsignificant effects (p > .05) as if they are acceptably accurate.
Panel (d): Some researchers consider an effect to be robust or reliable when the 95% CI of Cohen’s d does not cross zero (Cumming, 2013). But when we transform a strictly positive or negative interval onto a range of plausible Λ values, we see that it will include unacceptable values (for a Λ threshold of 3) unless the interval’s width is less than 1.16 times the absolute point estimate |d| (i.e., 58% of the maximum width of 2|d| that a zero-excluding interval centered on d can have). In short, estimation accuracy can be unacceptable even for robust or reliable effects.
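The calculation behind panel (c) can be sketched as follows, assuming the usual equal-n, equal-variance two-sample t test; the per-group n shown is an arbitrary example value of our choosing.

```r
# Lambda implied by a just-significant ("critical") effect at a given alpha.
lambda_at_alpha <- function(n, alpha) {
  t_crit <- qt(1 - alpha / 2, df = 2 * n - 2)  # two-tailed critical t
  c(d_crit = t_crit * sqrt(2 / n),             # smallest significant |d|
    lambda = 1 + t_crit^2)                     # plug-in Lambda at that effect
}

lambda_at_alpha(n = 100, alpha = .05)    # Lambda near 5
lambda_at_alpha(n = 100, alpha = .0005)  # Lambda near 13; d_crit roughly 78% larger
```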
Examining Prior Meta-Analyses
We examined several recent meta-analyses to get a snapshot of how common poor Λ values are in various subfields (Gaeta & Brydges, 2020; Nuijten et al., 2020; Siegel et al., 2021; Szucs & Ioannidis, 2017). Table 1 shows a remarkable consistency across subfields, with the estimated median power to detect a small effect (δ = 0.2) ranging between 0.11 and 0.16. These power estimates translate to Λ values ranging from 1.54 to 1.94, which strongly resemble the unacceptable scenario illustrated in the bottom row of Figure 1 (a sketch of the power-to-Λ conversion follows Table 1). Said simply, the majority of studies examining small effects in these fields may be producing results that are virtually indistinguishable from random conclusions. These meta-analytic Λ values are also plotted in Figure 2(b), where we show that even representative studies examining medium and large effects are not sufficiently powered to rule out unacceptable estimation accuracy.
Table 1.
Median Power to Detect Small (δ = 0.2), Medium (δ = 0.5), and Large (δ = 0.8) Effects as Reported in Meta-Analyses and Their Corresponding Λ Values (in Brackets).
Meta-analyses | Small effect (δ = 0.2): Median power [Λ] | Medium effect (δ = 0.5): Median power [Λ] | Large effect (δ = 0.8): Median power [Λ]
---|---|---|---
Szucs & Ioannidis (2017) | |||
Cognitive neuroscience | 0.11 [1.54] | 0.40 [4.06] | 0.70 [7.56] |
Psychology | 0.16 [1.94] | 0.60 [6.13] | 0.81 [9.48] |
Medicine | 0.15 [1.86] | 0.59 [6.00] | 0.80 [9.32] |
Nuijten et al. (2020) | |||
Intelligence | 0.11 [1.54] | 0.47 [4.75] | 0.99 [19.88] |
Gaeta & Brydges (2020) | |||
Speech and language | 0.13 [1.70] | 0.49 [4.86] | 0.91 [12.52] |
Siegel et al. (2021) | |||
Industrial and organizational psychology | 0.47 [4.58] | 0.79 [8.86] | 0.99 [19.88] |
Note: The Λ values are also illustrated in Figure 2b. We calculated power for Gaeta & Brydges (2020) and Siegel et al. (2021) on the basis of median sample sizes.
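The power-to-Λ conversions in Table 1 can be reproduced, at least approximately, with the sketch below. This is our own reconstruction (not the authors' code) under the standard two-sided, α = .05 two-sample t test.

```r
# Recover the per-group n implied by a reported median power for a given
# effect size, then convert that n into Lambda via Equation 1.
power_to_lambda <- function(power, delta, alpha = .05) {
  n <- power.t.test(delta = delta, sd = 1, sig.level = alpha,
                    power = power, type = "two.sample")$n
  c(n_per_group = n, lambda = 1 + n * delta^2 / 2)
}

power_to_lambda(power = 0.11, delta = 0.2)  # Lambda near 1.5
power_to_lambda(power = 0.40, delta = 0.5)  # Lambda near 4
```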
Extensions
Our Lab 1 and Lab 2 framing provides a concrete way for scientists to grapple with inherently difficult questions about acceptable estimation accuracy and replication within the behavioral sciences. This framing could be extended to other estimators, testing frameworks, and experimental designs. In the current application, we focused on sample means and the usage of the independent two-sample t test. We did so because of the ubiquity of this experimental design and testing framework within the behavioral sciences. Our RCE formulation could be used to calibrate power and hypothesis-testing thresholds for statistical tests other than the standard t test, such as Welch’s t test, which allows for differences in group variance (Welch, 1947). Future work could explore how different configurations of group variances impact the RCE sample-mean comparison and what testing and power thresholds provide acceptable estimation accuracy.
The RCE is defined by the randomization of group labels on the estimates of interest, but these are not required to be population means. In keeping with our two-group design, an RCE could be defined as the randomization of group labels to estimates of population medians, which may be an interesting application for heavily skewed distributions. One could then examine alternative power and hypothesis-testing calibrations for tests such as the Wilcoxon-Mann-Whitney test. It should be noted, however, that the Wilcoxon-Mann-Whitney test is appropriate only for evaluating whether two population medians are different under relatively strict assumptions—that is, that both populations are identically distributed and differ only by a shift in location (Divine et al., 2018).
The RCE and two-labs perspective could be extended to other experimental designs. In defining a general RCE comparison, we want to preserve two distinct features of our current formulation. First, a generalized RCE should randomize the conclusions of scientific interest. Applications could include a one-way analysis of variance, in which group mean labels are randomized, thus randomizing which means are larger than others while preserving Type I and Type II error rates for the omnibus test. Generalizations could also include multiple regression: Certain aspects of the estimation process could be randomized, such as whether one standardized regression coefficient is larger than, or has the same sign as, another.4 Second, a generalized RCE should also yield statistically significant results at rates similar to the standard estimation method being evaluated. This gives a generalized Lab 2 comparison additional bite, because the generalized RCE is not just randomizing the direction of results; it is also leading to random decisions regarding data. This second point is not intended to avoid important questions relating to preregistration practices (Nosek et al., 2019; Szollosi et al., 2020) but rather to place a finer point on an RCE comparison.
Given a suitable RCE and a standard method of estimation (e.g., ordinary least squares), we define a generalized Λ as the ratio of the respective mean-squared-error values. Although MSE has several nice properties, other accuracy metrics could also be substituted. Under this definition, Λ retains its simple interpretation: An estimator is Λ times as accurate as a generalized RCE. Future work could develop these comparisons and relate them to existing techniques, such as CIs, statistical power, and hypothesis testing.
Recommendations
Report Λ intervals
When reporting CIs over Cohen’s d values, we recommend also reporting the requisite Λ interval using that study’s sample size. A d CI communicates a range of plausible effect sizes, whereas the CI over Λ communicates how well the effect is being estimated relative to an easily understood benchmark. If the Λ CI includes values less than 3, it is worth reporting that the data do not rule out unacceptable levels of estimation accuracy. Although we have illustrated some consequences of using 3 as a threshold for Λ, other values could be used depending upon the context.5 The key takeaway is that Λ intervals translate effect-size estimates into a comprehensible measure of estimation accuracy. Reporting Λ intervals also provides researchers a degree of nuance when reporting results, allowing them to claim (or not) the detection of an effect, up to the usual Type I error rate under a specified α level, while also being transparent about estimation accuracy. To be clear, no additional inference is taking place: Transforming a CI over d values into one over Λ values is expressing the same information again from an estimation perspective. Making use of such a perspective can be done regardless of one’s statistical-inferential inclinations (e.g., Bayesian vs. frequentist). It is worth noting once again that Λ is distribution-free, in that its interpretation as the ratio of MSE values between sample means and the RCE does not depend upon any particular distributional form (see the Appendix for details).
Power statistical tests for estimation
When conducting a priori power analyses, we recommend that the sample size be selected according to effective estimation of the effect, rather than simple detection. We demonstrated that a power of .92, when using an α of .05, results in CIs over Λ that exclude values less than 3. This perspective offers a grounded rationale for power values, rather than the highly arbitrary, but quite common, value of .80. Selecting sample sizes in this way is similar in spirit to the work of Gelman and Carlin (2014) and connects to the work of Kelley and Maxwell (2003) and Kelley and Lai (2011), who argue for determining sample size on the basis of CI width. See also the work of S. F. Anderson et al. (2017), who present a power-analysis framework that incorporates publication bias.
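As a sketch of this recommendation, using base R's power.t.test and treating the target effect sizes as assumptions:

```r
# Plan the per-group sample size for power .92 at alpha = .05, then inspect
# the Lambda implied by that n and the assumed effect size.
plan_for_estimation <- function(delta, power = .92, alpha = .05) {
  n <- ceiling(power.t.test(delta = delta, sd = 1, sig.level = alpha,
                            power = power, type = "two.sample")$n)
  c(n_per_group = n, lambda = 1 + n * delta^2 / 2)
}

plan_for_estimation(delta = 0.5)  # roughly 90 per group; Lambda around 12
plan_for_estimation(delta = 0.2)  # several hundred per group
```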
Bayesian estimation
One takeaway from our arguments is that there simply is not much information contained in small samples and small effects. Bringing more information to the analysis can take many forms, with Bayesian methodology being an obvious approach. Informative priors can be used to improve estimation accuracy of mean estimates (Gelman et al., 1995), and such priors can be incorporated into the t test itself (see, notably, Rouder et al., 2009; Gronau et al., 2019; and Ly & Wagenmakers, 2021). Bayesian formulations are well suited for integrating informative hypotheses with cognitive models (Lee & Vanpaemel, 2018; Vanpaemel & Lee, 2012), which can help avoid some of the estimation issues we raise here. This approach is especially important for researchers who face limited sample sizes by the very nature of their investigations. Of course, the accuracy of Bayesian approaches under limited sample sizes will be prior dependent (e.g., McNeish, 2016). The Appendix also provides two examples of how prior beliefs can be incorporated into the determination of the required sample size.
Computational modeling and formal theory
Throughout, we have treated the accurate estimation of an effect as a primary goal. There is much to say about whether conceptualizing and testing theories in this way is optimal from a meta-science perspective. Indeed, Scheel (2022) argued that many psychological hypotheses are imprecisely specified, leading to questionable attempts at replication and measurement. Improved theory and quantitative modeling can lead to more compelling tests (e.g., model selection; for a recent review, see Myung & Pitt, 2018), avoiding simple effect-based characterizations (van Rooij & Baggio, 2021); see also Guest and Martin (2021) and Proulx and Morey (2021). Lee et al. (2019) and Devezer et al. (2019) provide thoughtful analysis and argumentation for how formalism can be used to improve scientific practices.
A more stringent threshold (α = .0005) for two-group between-subjects hypothesis testing
Using α = .0005 sets a more stringent threshold than recent high-profile recommendations for methods reform (Benjamin et al., 2018). It is hardly our goal to further contribute to file-drawer problems by arguing that some studies should not be published if Λ is less than 3. Indeed, we believe that all studies should be reported and that Λ values (likewise, p values) should not serve as gatekeepers to the literature. Yet for researchers who want to provide a characterization that goes beyond mere detection (e.g., “the two groups differ”) and ensure that their estimates are distinguishable from random conclusions, a more prohibitive α level is arguably required. Rather than a tool for censorship, Λ can be perceived as a useful way to adjust the strength of one’s claims to the expected accuracy of the estimation process.
The importance of experimental design
The fact that small effects are commonly observed does not mean that they are inevitable—one should always keep in mind the artificial and constructive nature of effects (e.g., Guala & Mittone, 2005; Woodward, 1989). In the behavioral sciences, effects are often small because of the use of minimal experimental manipulations that make the conditions being compared virtually identical, apart from a minor change (for a discussion, see Prentice & Miller, 1992). Researchers can rely on Λ to gauge the ability of a given experimental design to elicit a target phenomenon with sufficient accuracy, which in some cases can lead to the development of alternative experimental approaches. We do emphasize that notions of effect size are just one of many factors that impact experimental outcomes; see Buzbas et al. (2023) for a formal treatment of experimental design and its relation to replication rates.
Discussion
In reaction, one might argue that estimation accuracy should not be much of a concern if we care only about correctly detecting effects. We find this argument untenable for four reasons: First, knowledge about effect sizes plays a crucial role when using basic research findings to develop effective real-world interventions (Schober et al., 2018). Second, developing a theoretic account of the phenomena being studied typically requires more than just nominal or ordinal information (Meehl, 1978). Third, this reaction is at odds with the widespread use of statistical models that are predicated on quantitative comparisons of effects (Kellen et al., 2021), or the popularity of inferential frameworks that call for a quantitative reasoning of effects (Vanpaemel, 2010). Fourth, even in the context of coarse-grained theoretical accounts and ordinal predictions, knowledge about effect sizes is still relevant in the sense that it can inform us on matters of theoretical scope (i.e., how many people conform to a given theory’s predictions; Davis-Stober & Regenwetter, 2019; Heck, 2021). That being said, we are not claiming that a focus on detection is by itself problematic, or that there are no legitimate contexts in which it takes center stage; we are asserting only that a mature scientific characterization calls for more than that, namely accurate estimates.
Alternatively, one could try to downplay the importance of estimation accuracy by arguing that talk of effects is by itself problematic, in the sense that effects are of secondary importance relative to the explanation of psychological capacities (van Rooij & Baggio, 2021). We take issue with pursuing such a line of reasoning here, as it mistakenly implies that giving psychological theorizing the attention that it is owed somehow eliminates effects from researchers’ discourses. As a counterexample, consider the recent discussion on benchmark effects in short-term and working memory, a research domain that stands out for its highly sophisticated theoretical accounts (Oberauer et al., 2018). By contrast, the empirical exigencies of theory testing and development give estimation accuracy center stage (Meehl, 1978).
One could also argue that there is nothing new to see here, given that Λ is so closely related to already-established quantities. For instance, it is easy to see that Λ is a quadratic function of the t statistic (for details, see the Appendix). Rather than an all-new, all-different quantity to be reconciled with all the other ones in researchers’ toolboxes, what Λ offers is a reframing of an old problem. It is an attractive feature, not a shortcoming,6 that Λ is closely related to known quantities or tests, or that the pursuit of estimation accuracy ends up recovering similar methodological proposals with distinct motivations (e.g., Benjamin et al., 2018). It is also worth noting once again that although we assumed Gaussian distributions when deriving our Λ value recommendations, the definition of the RCE and the subsequent interpretation of Λ as a ratio of MSE values is distribution-free.
Regardless of one’s scientific view, random conclusions are indefensible. It follows that researchers’ empirical findings should, at a minimum, be distinguishable from a foil whose conclusions are determined by a coin flip. But as we have demonstrated, this is easier said than done: Many published research studies, despite honest efforts, have barely improved upon the estimation accuracy of the infamous Lab 2. As it turns out, one can easily fail to reliably outperform Lab 2, even if effects are real, studies are based in strong theory, and no questionable research practices are at play. The RCE approach and the Λ index that can be derived from it provide a new perspective on methodological reform (Devezer et al., 2019; Munafò et al., 2017; Shrout & Rodgers, 2018). Everything begins with a simple statement: The estimation accuracy of our methods should be distinguishable from a random-conclusions foil. In the pursuit of this modest goal, we find that the default p value threshold of .05 does not rule out unacceptable conditions (see the bottom row of Fig. 1), leading us to more stringent criteria that also address known concerns with measurement error, statistical power, and replicability (Gelman & Carlin, 2014; Loken & Gelman, 2017; Maxwell et al., 2015; but see also Bak-Coleman et al., 2022). Based on these results, we believe that Λ and the RCE approach more generally constitute an important tool in improving psychological science.
Funding
C. P. Davis-Stober acknowledges support from the National Institutes of Health under Grant No. K25AA024182 (NIAAA; C. P. Davis-Stober, primary investigator). W. Bonifay’s contributions to this work were supported by the Institute of Education Sciences, U.S. Department of Education, through Grant No. R305D210032. D. Kellen’s work was supported by a National Science Foundation (NSF) CAREER Award, ID No. 2145308. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Institutes of Health, Institute of Education Sciences, the National Science Foundation, the U.S. Department of Education, or the authors’ home institutions.
Appendix
Formal characterization of Λ
All code is available on the Open Science Framework (OSF) at https://osf.io/2hza8/?view_only=f679d2211a314f469118e2fa27111fea.
Let X_1, …, X_n be n-many independent, identically distributed samples from a random variable, X, with mean μ_A and variance σ², where σ² is finite. Likewise, let Y_1, …, Y_n be n-many independent, identically distributed samples from a random variable, Y, with mean μ_B and variance σ². We assume that X and Y are independent of one another. We quantify accuracy via mean-squared error (MSE):
\mathrm{MSE} = E\left[(\hat{\mu}_A - \mu_A)^2 + (\hat{\mu}_B - \mu_B)^2\right],

where E is the expectation operator and μ̂_A and μ̂_B are, respectively, estimates for μ_A and μ_B.
Result 1.
The ratio of MSE values between the random conclusions estimator (numerator) and sample means (denominator) is equal to

\Lambda = \frac{\mathrm{MSE}_{\mathrm{RCE}}}{\mathrm{MSE}_{\mathrm{SM}}} = 1 + \frac{n\,\delta^{2}}{2}, \qquad \delta = \frac{\mu_A - \mu_B}{\sigma}.
Proof.
We first calculate the MSE of the random conclusions estimator (RCE). With probability .5 the RCE reports the sample means under the correct labels, and with probability .5 it reports them with the labels swapped:

\mathrm{MSE}_{\mathrm{RCE}} = \tfrac{1}{2} E\left[(\bar{X} - \mu_A)^2 + (\bar{Y} - \mu_B)^2\right] + \tfrac{1}{2} E\left[(\bar{Y} - \mu_A)^2 + (\bar{X} - \mu_B)^2\right] = \frac{2\sigma^2}{n} + (\mu_A - \mu_B)^2.
Equation 1 is obtained by taking the ratio of MSE_RCE to the MSE of sample means, MSE_SM = 2σ²/n:

\Lambda = \frac{2\sigma^2/n + (\mu_A - \mu_B)^2}{2\sigma^2/n} = 1 + \frac{n(\mu_A - \mu_B)^2}{2\sigma^2} = 1 + \frac{n\,\delta^2}{2}.
□
The Λ value is easily expressed in other metrics. It is equivalent to 1 + t² when the sample effect size is substituted for δ, providing a direct relationship with the two-sample t test. Relevant to questions involving replication, we can also write out-of-sample R² (Campbell & Thompson, 2008), denoted R²_OS, as a simple function of Λ. Consistent with typical formulations, we compare sample means against a competitor that uses the grand mean, computed across both groups, as the estimate for the population means in each group. As before, we assume equal n and equal σ² in both groups. Direct calculation provides the following relationship:
R^{2}_{OS} = 1 - \frac{\mathrm{MSE}_{\mathrm{SM}}}{\mathrm{MSE}_{\mathrm{GM}}} = 1 - \frac{2}{\Lambda},

where MSE_SM and MSE_GM are the MSE values for sample means and the grand mean, respectively.
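The following Monte Carlo sketch (our own verification code, not the article's OSF code) checks Result 1 and the out-of-sample R² relation by brute force under the stated normal, equal-variance, equal-n assumptions.

```r
# Monte Carlo check of Lambda = MSE_RCE / MSE_SM = 1 + n * delta^2 / 2
# and of R2_OS = 1 - MSE_SM / MSE_GM = 1 - 2 / Lambda.
set.seed(123)
n <- 40; delta <- 0.5; reps <- 200000
mu_A <- delta; mu_B <- 0  # sigma = 1

se_sm <- se_rce <- se_gm <- numeric(reps)
for (r in seq_len(reps)) {
  xbar <- mean(rnorm(n, mu_A))
  ybar <- mean(rnorm(n, mu_B))
  se_sm[r] <- (xbar - mu_A)^2 + (ybar - mu_B)^2   # sample means
  flip <- runif(1) < .5                           # RCE: swap labels half the time
  a <- if (flip) ybar else xbar
  b <- if (flip) xbar else ybar
  se_rce[r] <- (a - mu_A)^2 + (b - mu_B)^2
  g <- (xbar + ybar) / 2                          # grand-mean competitor
  se_gm[r] <- (g - mu_A)^2 + (g - mu_B)^2
}

c(lambda_mc = mean(se_rce) / mean(se_sm), lambda_eq1 = 1 + n * delta^2 / 2)
c(r2os_mc = 1 - mean(se_sm) / mean(se_gm), r2os_formula = 1 - 2 / (1 + n * delta^2 / 2))
```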
Comparing sample means to the RCE via Kolmogorov-Smirnov tests
We carried out a power analysis to determine the sample size for achieving a power of .80 to reject the hypothesis that bivariate samples from the two distributions (sample means and RCE) are equal. We used the two-dimensional Kolmogorov-Smirnov test of Fasano and Franceschini (1987) with an α of .05. These power analyses were carried out in MATLAB using Lau’s (2021) implementation of the test. The first row of Table A1 shows the required number of samples to achieve a statistical power of .80 as a function of Λ. We also carried out a power analysis using a one-dimensional test that examines the distribution of differences between mean estimates—that is, we calculated similar power analyses using the differences X̄ − Ȳ. For this test, we used the two-sample Cramér-von Mises goodness-of-fit test (T. W. Anderson, 1962), as implemented in MATLAB by Cardelino (2021). The second row of Table A1 displays the required number of samples to achieve a power of .80 for each estimator as a function of Λ.
Rows 1 and 2 of Table A1 list the minimum number of studies (draws) per lab to reject the null hypothesis that estimates from the two labs follow the same generating distribution with a statistical power of .80. Row 3 presents the gain in information about the direction of an effect when estimated by sample means (relative to the RCE), where 0 represents no reduction in uncertainty and 1 represents total reduction in uncertainty.
Table A1.
Power analyses.
 | Λ = 1.5 | Λ = 2 | Λ = 3 | Λ = 5
---|---|---|---|---
2D Kolmogorov-Smirnov test | 54 | 31 | 19 | 15
Cramér-von Mises test | 45 | 27 | 20 | 18
Information gain | .29 | .49 | .72 | .91
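A simulation in the spirit of these power analyses can be sketched in R using the one-dimensional Kolmogorov-Smirnov test on mean differences (base R's ks.test). This is a stand-in for, not a reimplementation of, the two-dimensional Fasano-Franceschini and Cramér-von Mises tests used above, and all numerical choices below are assumptions for illustration.

```r
# Estimate the power to tell Lab 1 from Lab 2 when each lab contributes K
# replicate studies, comparing the two labs' distributions of mean differences.
set.seed(42)
power_to_tell_labs_apart <- function(K, n, delta, sims = 500, alpha = .05) {
  mean(replicate(sims, {
    d_lab1 <- replicate(K, mean(rnorm(n, delta)) - mean(rnorm(n, 0)))
    d_lab2 <- replicate(K, {
      diff <- mean(rnorm(n, delta)) - mean(rnorm(n, 0))
      if (runif(1) < .5) -diff else diff   # RCE flips the sign half the time
    })
    ks.test(d_lab1, d_lab2)$p.value < alpha
  }))
}

# Lambda = 3 corresponds to n * delta^2 = 4 (e.g., delta = 0.4 with n = 25).
power_to_tell_labs_apart(K = 20, n = 25, delta = 0.4)
```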
Comparing sample means to the RCE via entropy
We can evaluate the two estimators with respect to information gain regarding the direction of the effect. The RCE randomly assigns condition labels according to a fair coin toss (probability .5 for each labeling). The Shannon entropy of the RCE with respect to the direction of the effect is given by

H = -\left(\tfrac{1}{2}\log_{2}\tfrac{1}{2} + \tfrac{1}{2}\log_{2}\tfrac{1}{2}\right) = 1,

or total entropy about the direction. In other words, even as n grows without bound, the RCE’s estimates remain equally likely to point in either direction. Note that this entropy is not contingent on n or δ, and thus the RCE yields total entropy about the direction, regardless of sample size or effect size. The RCE thereby exemplifies the principle of maximum entropy (Jaynes, 1957), which holds that the probability distribution with the largest entropy best represents the most uniform state of knowledge. To that end, Λ contextualizes the sample-means estimator relative to an optimally deficient estimator, such that higher values of Λ indicate greater accuracy beyond mere random conclusions regarding direction.
The relationship between the two estimators can be further quantified in information-theoretic terms: As Λ increases from 1.0, the sample-means estimator will afford an increase in information relative to the RCE. This is illustrated in Figure A1, which depicts information gain (or reduction in entropy) as a function of Λ. The y-axis represents the bits of information that are gained when using sample means rather than the RCE, with values ranging from 0 (no information gain; i.e., sample means are just as uninformative as the RCE) to 1 (complete information gain; i.e., sample means eliminate 100% of the uncertainty that comes with using the RCE). For example, the dotted line in the figure shows that the Λ = 3 threshold is associated with a 72% increase in information. If researchers desire a 90% gain in information beyond the RCE, they must achieve a Λ greater than 4.76; a 95% gain in information requires a Λ greater than 5.99. To create this figure, we used the entropy package in R (Hausser & Strimmer, 2009) to calculate the Shannon entropy, in bits, of the sample means and RCE distributions generated by all combinations of δ and n. We then found the reduction in entropy (i.e., the information gain) at corresponding values of Λ. All code is available in the OSF repository linked above. Our use of information gain is equivalent to the (asymmetric) Kullback-Leibler divergence (Kullback & Leibler, 1951), which we deemed theoretically appropriate because it allows us to gauge improvements in the accuracy of sample means estimation relative to that of the maximally entropic RCE. Analysis of the (symmetric) Jensen-Shannon divergence reveals a nearly identical trajectory across values of Λ, but without the theoretical alignment or ease of interpretation.
Fig. A1.
Information gain afforded by sample means (relative to the RCE) regarding the direction of an effect as a function of Λ.
Quantifying the difference between distributions via the Wasserstein metric
To quantify the differences between the left and right sides of Figure 1 we relied on the Wasserstein metric, which is also known as the Earth Mover’s Distance (EMD) because it determines the most efficient strategy for transporting a certain mass of earth from one position to another (Urbanek & Rubner, 2015). Specifically, the transportation of mass from positions p_i, each a unit of the reference mass with weight w_{p_i}, to positions q_j, each a unit of the target mass with weight w_{q_j}, is given by

\mathrm{EMD} = \frac{\sum_{i}\sum_{j} f_{ij}\, d_{ij}}{\sum_{i}\sum_{j} f_{ij}},

where d_{ij} is the ground distance between p_i and q_j and f_{ij} is the optimal flow from p_i to q_j.
In the current context, the EMD reflects the minimum amount of work (where one unit of work corresponds to transporting one unit of mass by one unit of distance) that is required to convert each random conclusions distribution to its corresponding sample-means distribution. We used the R package emdist (Urbanek & Rubner, 2015) to derive the EMD under each scenario in Figure 1 (see the OSF repository for code). When the effect size is small, the sample-means estimate in the bottom left panel (Fig. 1) shows that a negligible amount of work has been done to improve upon the random-conclusions estimate in the bottom right panel (Fig. 1), EMD = .106; on average, each unit of mass in the RCE panel would need to be moved just .106 units to match the mass in the sample-means panel. Relative to this small-effect-size condition, it would take six times more work to improve upon the RCE when δ is large (EMD = .635) and four times more work when δ is moderate (EMD = .424). In other words, more work is necessary whenever researchers want to ensure that their estimates are notably better than the mathematically least-informative estimate.
Confidence intervals around the true Λ
For each effect size considered, we computed 95% confidence intervals (CIs). The approach used to compute these intervals (see Cumming & Finch, 2001) consisted of determining the noncentral t distributions whose tails yield the observed t statistic (which can be obtained from δ and n) with nominal probabilities (e.g., 0.025 and 0.975). Because the present analysis focuses on absolute effect sizes, we established a lower boundary of δ = 0 on these intervals (for which Λ = 1).
We investigated the coverage rates of the 95% intervals obtained for true effect sizes, such as the ones illustrated in Figure 2. Specifically, we performed the following steps:
1. Computed the 95% CI for known values of δ and n.
2. Generated samples from two normal distributions with variances of 1 and means of 0 and δ.
3. Computed an effect-size estimate from the samples taken in Step 2 and subsequently transformed this estimate into a Λ estimate.
4. Checked whether the Λ estimate was included in the CI computed in Step 1.
5. Repeated Steps 2 through 4 100,000 times.
6. Performed Steps 1 through 5 for different combinations of δ and n values.
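A sketch of Steps 1 through 4 follows; it is our own implementation of the noncentral-t approach (Cumming & Finch, 2001) with hypothetical input values, not the article's OSF code.

```r
# Step 1: 95% CI around the true Lambda for known delta and n, obtained by
# inverting the noncentral t distribution and flooring the lower bound at delta = 0.
lambda_ci_true <- function(delta, n, level = .95) {
  t_obs <- delta * sqrt(n / 2)
  df    <- 2 * n - 2
  tail  <- (1 - level) / 2
  ncp_at <- function(p) {   # noncentrality placing t_obs at tail probability p
    f <- function(ncp) pt(t_obs, df, ncp) - p
    if (f(0) < 0) 0 else uniroot(f, c(0, t_obs + 10))$root
  }
  d_bounds <- c(ncp_at(1 - tail), ncp_at(tail)) / sqrt(n / 2)
  1 + n * d_bounds^2 / 2
}

# Steps 2-4: simulate one study, convert its Cohen's d into a Lambda estimate,
# and check whether it falls inside the interval from Step 1.
set.seed(7)
delta <- 0.4; n <- 50
ci <- lambda_ci_true(delta, n)
x <- rnorm(n, delta); y <- rnorm(n, 0)
d_hat <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)  # pooled-SD Cohen's d
lambda_hat <- 1 + n * d_hat^2 / 2
c(ci_lower = ci[1], ci_upper = ci[2], lambda_hat = lambda_hat,
  covered = lambda_hat >= ci[1] && lambda_hat <= ci[2])
```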
The results reported in Table A2 show that the 95% CIs around the true effect sizes, when transformed into Λ intervals, included the Λ estimates obtained from random samples roughly 95% of the time. These results corroborate our interpretation of these CIs around true values of Λ as ranges of plausible estimates under a given effect size δ and sample size per group n.
Table A2.
Coverage Rates of the 95% Confidence Intervals Around Λ for a Given True Effect Size δ and Sample Size per Group n.
0.93 | 0.94 | 0.94 | 0.94 | 0.94 | |
0.95 | 0.96 | 0.96 | 0.96 | 0.96 | |
0.96 | 0.97 | 0.97 | 0.97 | 0.97 | |
0.97 | 0.97 | 0.97 | 0.97 | 0.95 | |
0.97 | 0.97 | 0.96 | 0.95 | 0.95 | |
0.97 | 0.97 | 0.95 | 0.95 | 0.95 | |
0.97 | 0.95 | 0.95 | 0.95 | 0.95 | |
0.96 | 0.95 | 0.95 | 0.95 | 0.95 | |
0.95 | 0.95 | 0.95 | 0.95 | 0.95 | |
0.95 | 0.95 | 0.95 | 0.95 | 0.95 | |
0.95 | 0.95 | 0.95 | 0.95 | 0.95 |
Using effect-size priors to determine n
Further, Λ can be used to determine the sample size that is expected to satisfy one’s accuracy standards. Although our previous examples focused on point effect-size values (see Fig. 2a), it is easy to incorporate prior beliefs in terms of an absolute effect-size distribution f(δ) with support over the positive reals. Let Λ* be the minimum accuracy threshold. For a given effect size δ, the minimum n ensuring a threshold-satisfying Λ value is given by

n_{\min}(\delta) = \left\lceil \frac{2(\Lambda^{*} - 1)}{\delta^{2}} \right\rceil.
The expected n can be obtained by calculating the following integral:

E[n] = \int_{0}^{\infty} n_{\min}(\delta)\, f(\delta)\, d\delta.
For example, if we assume a uniform prior over [0.1, 0.5] and a Λ threshold of 3, then the expected n is approximately 80. Note that, alternatively, one could consider the minimum n for a given δ that yields a lower bound of plausible Λ estimates that satisfies the threshold. If we consider the 95% CI as our range of plausible values, then an integration over f(δ) like the one above yields a considerably larger expected n.
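A minimal sketch of the first calculation (our own code; for simplicity it ignores the rounding of n up to an integer):

```r
# Expected per-group n under a uniform prior on delta over [0.1, 0.5]
# and a Lambda threshold of 3, using n_min(delta) = 2 * (Lambda* - 1) / delta^2.
lambda_star <- 3
n_min <- function(delta) 2 * (lambda_star - 1) / delta^2

expected_n <- integrate(function(d) n_min(d) * dunif(d, 0.1, 0.5),
                        lower = 0.1, upper = 0.5)$value
expected_n  # 80
```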
Finally, note that alternative prior distributions could be used instead (Gronau et al., 2019). For example, we could assume a truncated t-prior on δ with a location of 0.30, a scale of 0.05, degrees of freedom of 3, and support over the positive reals. This prior, which is illustrated in Figure A2, places most of its mass on effect sizes ranging between 0.2 and 0.4. Computing the above integrals using this informative prior instead results in expected n of approximately 56 and 323, respectively.
Histograms comparing the two distributions at different values of Λ
As an illustration, Figures A3–A7 display bivariate histograms for simulated data under sample means (left-hand columns) and random conclusions (right-hand columns) for Λ values of 1.5, 2, 3, 5, and 10. Each row corresponds to δ values of 0.10, 0.20, 0.30, and 0.40, with values of n chosen to give the appropriate value of Λ. By examining Figures A3 through A7, we can see that each estimator’s variance clearly depends upon n, but the relationship between sample means and the RCE remains stable for fixed values of Λ at different combinations of δ and n.
Fig. A2.
Example of a truncated t-prior.
Fig. A3.
Bivariate histograms comparing the sampling distribution of sample means to the random conclusions estimator under a Λ of 1.5. Each row of figures corresponds to a different combination of δ and n that yields the same value of Λ.
Fig. A4.
Bivariate histograms comparing the sampling distribution of sample means to the random conclusions estimator under a Λ of 2. Each row of figures corresponds to a different combination of δ and n that yields the same value of Λ.
Fig. A5.
Bivariate histograms comparing the sampling distribution of sample means to the random conclusions estimator under a Λ of 3. Each row of figures corresponds to a different combination of δ and n that yields the same value of Λ.
Fig. A6.
Bivariate histograms comparing the sampling distribution of sample means to the random conclusions estimator under a Λ of 5. Each row of figures corresponds to a different combination of δ and n that yields the same value of Λ.
Fig. A7.
Bivariate histograms comparing the sampling distribution of sample means to the random conclusions estimator under a Λ of 10. Each row of figures corresponds to a different combination of δ and n that yields the same value of Λ.
Footnotes
Transparency
Action Editor: Tim Pleskac
Editor: Interim Editorial Panel
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
1. This type of argument is foundational to myriad other existing procedures, such as determining whether fitted network models are distinguishable from networks with randomly determined connections (Steinley & Brusco, 2021). Looking further back, another example would be techniques such as Horn’s parallel analysis (Horn, 1965).
2. We use δ to denote the true effect size in the population and d to denote sample estimates of δ. When we refer to the population value of Cohen’s d, we are referring to δ.
3. To be clear, we are also not suggesting that comparisons with Lab 2 can serve as a way to identify errors or questionable research practices.
4. See Davis-Stober and Dana (2014) for a proto-RCE estimator along these lines.
5. Indeed, rejecting the null at the .05 level is equivalent to stating that the 95% CI over Λ does not include 1, a value that can be achieved only if there is precisely no effect.
6. For a similar scenario in which the same model-selection index is derived from very different theoretical foundations, see Grünwald and Navarro (2009) and Karabatsos and Walker (2006).
References
- Anderson SF, Kelley K, & Maxwell SE (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11), 1547–1562. [DOI] [PubMed] [Google Scholar]
- Anderson SF, & Maxwell SE (2016). There’s more than one way to conduct a replication study: Beyond statistical significance. Psychological Methods, 21(1), 1–12. [DOI] [PubMed] [Google Scholar]
- Anderson TW (1962). On the distribution of the two-sample Cramér-von Mises criterion. The Annals of Mathematical Statistics, 33, 1148–1159. [Google Scholar]
- Bak-Coleman J, Mann RP, West J, & Bergstrom CT (2022). Replication does not measure scientific productivity. SocArXiv rkyf 7, Center for Open Science. 10.31235/osf.io/rkyf7 [DOI] [Google Scholar]
- Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, Bollen KA, Brembs B, Brown L, Camerer C, Cesarini D, Chambers CD, Clyde M, Cook TD, De Boeck P, Dienes Z, Dreber A, Easwaran K, Efferson C, . . . Johnson VE (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6–10. 10.1038/s41562-017-0189-z [DOI] [PubMed] [Google Scholar]
- Buzbas EO, Devezer B, & Baumgaertner B. (2023). The logical structure of experiments lays the foundation for a theory of reproducibility. Royal Society Open Science, 10, Article 221042. 10.1098/rsos.221042 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Campbell JY, & Thompson SB (2008). Predicting excess stock returns out of sample: Can anything beat the historical average? The Review of Financial Studies, 21(4), 1509–1531. [Google Scholar]
- Cardelino J. (2021). Two sample Cramér-von Mises hypothesis test. https://www.mathworks.com/matlabcentral/fileexchange/13407-two-sample-cramer-von-mises-hypothesis-test
- Cohen J. (1988). Statistical power analysis for the behavioral sciences. Routledge. [Google Scholar]
- Cohen J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. [Google Scholar]
- Cumming G. (2013). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge. [Google Scholar]
- Cumming G, & Finch S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61(4), 532–574.
- Davis-Stober CP, & Dana J. (2014). Comparing the accuracy of experimental estimates to guessing: A new perspective on replication and the “crisis of confidence” in psychology. Behavior Research Methods, 46, 1–14.
- Davis-Stober CP, & Regenwetter M. (2019). The ‘paradox’ of converging evidence. Psychological Review, 126(6), 865–879.
- Devezer B, Nardin LG, Baumgaertner B, & Buzbas EO (2019). Scientific discovery in a model-centric framework: Reproducibility, innovation, and epistemic diversity. PLOS ONE, 14(5), Article e0216125. 10.1371/journal.pone.0216125
- Divine GW, Norton HJ, Barón AE, & Juarez-Colunga E. (2018). The Wilcoxon–Mann–Whitney procedure fails as a test of medians. The American Statistician, 72(3), 278–286.
- Domingue B, Rahal C, Faul J, Freese J, Kanopka K, Rigos A, Stenhaug B, & Tripathi A. (2021). Intermodel vigorish (IMV): A novel approach for quantifying predictive accuracy with binary outcomes. 10.31235/osf.io/gu3ap
- Fasano G, & Franceschini A. (1987). A multidimensional version of the Kolmogorov–Smirnov test. Monthly Notices of the Royal Astronomical Society, 225(1), 155–170.
- Gaeta L, & Brydges CR (2020). An examination of effect sizes and statistical power in speech, language, and hearing research. Journal of Speech, Language, and Hearing Research, 63(5), 1572–1580.
- Gelman A, & Carlin J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651.
- Gelman A, Carlin JB, Stern HS, & Rubin DB (1995). Bayesian data analysis. Chapman & Hall/CRC.
- Gelman A, & Tuerlinckx F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics, 15(3), 373–390.
- Götz FM, Gosling SD, & Rentfrow PJ (2022). Small effects: The indispensable foundation for a cumulative psychological science. Perspectives on Psychological Science, 17, 205–215. 10.1177/1745691620984483
- Gronau QF, Ly A, & Wagenmakers E-J (2019). Informed Bayesian t-tests. The American Statistician, 74(2), 137–143.
- Grünwald P, & Navarro DJ (2009). NML, Bayes and true distributions: A comment on Karabatsos and Walker (2006). Journal of Mathematical Psychology, 53(1), 43–51.
- Guala F, & Mittone L. (2005). Experiments in economics: External validity and the robustness of phenomena. Journal of Economic Methodology, 12(4), 495–515.
- Guest O, & Martin AE (2021). How computational modeling can force theory building in psychological science. Perspectives on Psychological Science, 16(4), 789–802.
- Hausser J, & Strimmer K. (2009). Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. Journal of Machine Learning Research, 10(7), 1469–1484.
- Heck DW (2021). Assessing the “paradox” of converging evidence by modeling the joint distribution of individual differences: Comment on Davis-Stober and Regenwetter (2019). Psychological Review, 128(6), 1187–1196.
- Horn JL (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185.
- Jaynes ET (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620.
- Karabatsos G, & Walker SG (2006). On the normalized maximum likelihood and Bayesian decision theory. Journal of Mathematical Psychology, 50(6), 517–520.
- Kellen D, Davis-Stober CP, Dunn JC, & Kalish ML (2021). The problem of coordination and the pursuit of structural constraints in psychology. Perspectives on Psychological Science, 16(4), 767–778.
- Kelley K, & Lai K. (2011). Accuracy in parameter estimation for the root mean square error of approximation: Sample size planning for narrow confidence intervals. Multivariate Behavioral Research, 46(1), 1–32.
- Kelley K, & Maxwell SE (2003). Sample size for multiple regression: Obtaining regression coefficients that are accurate, not simply significant. Psychological Methods, 8(3), 305–321.
- Krantz DH (1999). The null hypothesis testing controversy in psychology. Journal of the American Statistical Association, 94(448), 1372–1381.
- Kullback S, & Leibler RA (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
- Lau B. (2021). 2-d Kolmogorov–Smirnov test, n-d energy test, Hotelling T2 test. https://github.com/brian-lau/multdist
- Lee MD, Criss AH, Devezer B, Donkin C, Etz A, Leite FP, Matzke D, Rouder JN, Trueblood JS, White CN, & Vandekerckhove J. (2019). Robust modeling in cognitive science. Computational Brain & Behavior, 2(3), 141–153.
- Lee MD, & Vanpaemel W. (2018). Determining informative priors for cognitive models. Psychonomic Bulletin & Review, 25(1), 114–127.
- Loken E, & Gelman A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585.
- Ly A, & Wagenmakers E-J (2021). Bayes factors for peri-null hypotheses. 10.48550/ARXIV.2102.07162
- Marek S, Tervo-Clemmens B, Calabro FJ, Montez DF, Kay BP, Hatoum AS, Donohue MR, Foran W, Miller RL, Hendrickson TJ, Malone SM, Kandala S, Feczko E, Miranda-Dominguez O, Graham AM, Earl EA, Perrone AJ, Cordova M, Doyle O, & Dosenbach NUF (2022). Reproducible brain-wide association studies require thousands of individuals. Nature, 603, 654–660.
- Maxwell SE, Lau MY, & Howard GS (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70(6), 487–498.
- McNeish D. (2016). On using Bayesian methods to address small sample problems. Structural Equation Modeling: A Multidisciplinary Journal, 23(5), 750–773.
- Meehl PE (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834.
- Munafò MR, Nosek BA, Bishop DV, Button KS, Chambers CD, Du Sert NP, Simonsohn U, Wagenmakers E-J, Ware JJ, & Ioannidis JP (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1), 1–9.
- Myung JI, & Pitt MA (2018). Model comparison in psychology. In Wixted JT, Phelps EA, & Davachi L. (Eds.), Stevens’ handbook of experimental psychology and cognitive neuroscience (Vol. 5, pp. 85–118). John Wiley & Sons, Inc.
- Navarro DJ (2019). Between the devil and the deep blue sea: Tensions between scientific judgement and statistical model selection. Computational Brain & Behavior, 2(1), 28–34.
- Nickerson RS (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5(2), 241–301.
- Nosek BA, Beck ED, Campbell L, Flake JK, Hardwicke TE, Mellor DT, van’t Veer AE, & Vazire S. (2019). Preregistration is hard, and worthwhile. Trends in Cognitive Sciences, 23(10), 815–818.
- Nuijten MB, van Assen MA, Augusteijn HE, Crompvoets EA, & Wicherts JM (2020). Effect sizes, power, and biases in intelligence research: A meta-meta-analysis. Journal of Intelligence, 8(4), Article 36.
- Oberauer K, Lewandowsky S, Awh E, Brown GD, Conway A, Cowan N, Donkin C, Farrell S, Hitch GJ, Hurlstone MJ, Ma WJ, Morey CC, Nee DE, Schweppe J, Vergauwe E, & Ward G. (2018). Benchmarks for models of short-term and working memory. Psychological Bulletin, 144, 885–958.
- Prentice DA, & Miller DT (1992). When small effects are impressive. Psychological Bulletin, 112(1), 160–164.
- Proulx T, & Morey RD (2021). Beyond statistical ritual: Theory in psychological science. Perspectives on Psychological Science, 16(4), 671–681.
- Rouder JN, Speckman PL, Sun D, Morey RD, & Iverson G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237.
- Rubner Y, Tomasi C, & Guibas LJ (2000). The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2), 99–121.
- Sawilowsky SS (2009). New effect size rules of thumb. Journal of Modern Applied Statistical Methods, 8(2), Article 26.
- Scheel AM (2022). Why most psychological research findings are not even wrong. Infant and Child Development, 31(1), Article e2295.
- Schober P, Bossers SM, & Schwarte LA (2018). Statistical significance versus clinical importance of observed effect sizes: What do p values and confidence intervals really represent? Anesthesia and Analgesia, 126(3), 1068–1072.
- Shrout PE, & Rodgers JL (2018). Psychology, science, and knowledge construction: Broadening perspectives from the replication crisis. Annual Review of Psychology, 69, 487–510.
- Siegel M, Eder JSN, Wicherts JM, & Pietschnig J. (2021). Times are changing, bias isn’t: A meta-meta-analysis on publication bias detection practices, prevalence rates, and predictors in industrial/organizational psychology. Journal of Applied Psychology, 107(11), 2013–2039.
- Simmons JP, Nelson LD, & Simonsohn U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
- Steinley D, & Brusco MJ (2021). On fixed marginal distributions and psychometric network models. Multivariate Behavioral Research, 56(2), 329–335.
- Szollosi A, Kellen D, Navarro DJ, Shiffrin R, van Rooij I, Van Zandt T, & Donkin C. (2020). Is preregistration worthwhile? Trends in Cognitive Sciences, 24(2), 94–95.
- Szucs D, & Ioannidis JPA (2017). Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLOS Biology, 15(3), Article e2000797. 10.1371/journal.pbio.2000797
- Urbanek S, & Rubner Y. (2015). Emdist: Earth mover’s distance. R package version 0.3-2.
- van de Schoot R, Hoijtink H, & Romeijn J-W (2011). Moving beyond traditional null hypothesis testing: Evaluating expectations directly. Frontiers in Psychology, 2, Article 24. 10.3389/fpsyg.2011.00024
- Vanpaemel W. (2010). Prior sensitivity in theory testing: An apologia for the Bayes factor. Journal of Mathematical Psychology, 54, 491–498.
- Vanpaemel W, & Lee MD (2012). Using priors to formalize theory: Optimal attention and the generalized context model. Psychonomic Bulletin & Review, 19(6), 1047–1056.
- van Rooij I, & Baggio G. (2021). Theory before the test: How to build high-verisimilitude explanatory theories in psychological science. Perspectives on Psychological Science, 16(4), 682–697.
- Welch BL (1947). The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika, 34(1–2), 28–35.
- Woodward J. (1989). Data and phenomena. Synthese, 79, 393–472.