In contrast to Wilson et al. (1), we believe that signal detection continues to be a key component of empirical science because of its inherent uncertainty. Null hypothesis significance testing (NHST) serves to detect signals, and we highlight mischaracterizations of NHST made by Wilson et al. (1). A simpler explanation for observing smaller effects in (larger) replication studies than in (smaller) original studies, one that does not require the complicated model in ref. 1, combines NHST-based truncation of effects with the law of large numbers (LLN).
NHST makes no assumption about the distribution of effect sizes or of a random draw from a dichotomous distribution. NHST is a procedure that compares empirical data with a sampling distribution implied by a null hypothesis (e.g., H0: μX − μY = 0). A small enough probability (P value) of observing the data under H0 justifies the decision to continue research because the data suggest a signal (μX − μY ≠ 0) instead of noise (μX − μY = 0 with sampling variability), which involves signal detection. Even assuming no true null hypotheses, rejecting the null as an explanation for the data also allows one to reject any distribution assuming an effect in the opposite direction (2). Thus, sufficient signal can also indicate the correct effect direction. Uncertainty is always involved, however, because “tests of significance … are capable of rejecting or invalidating hypotheses in so far as these are contradicted by the data; but … they are never capable of establishing them as certainly true” (3).
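The comparison of data with a null-implied sampling distribution can be sketched numerically. The authors provide R code on OSF; the following is a minimal, hypothetical Python sketch (the sample size and the assumed true difference of 0.5 are illustrative choices, not the authors' values) that builds the sampling distribution of a two-sample t statistic under H0: μX − μY = 0 by Monte Carlo and locates the observed statistic within it.

```python
import random
import statistics

random.seed(1)

def t_stat(x, y):
    """Two-sample t statistic with pooled variance for H0: mu_x - mu_y = 0."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * statistics.variance(x)
           + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

# Observed data: an assumed true standardized difference of 0.5 (illustration only)
n = 30
x = [random.gauss(0.5, 1) for _ in range(n)]
y = [random.gauss(0.0, 1) for _ in range(n)]
t_obs = t_stat(x, y)

# Sampling distribution implied by H0, approximated by Monte Carlo
null_ts = []
for _ in range(5000):
    x0 = [random.gauss(0, 1) for _ in range(n)]
    y0 = [random.gauss(0, 1) for _ in range(n)]
    null_ts.append(t_stat(x0, y0))

# Two-sided P value: the proportion of null draws at least as extreme as t_obs
p = sum(abs(t) >= abs(t_obs) for t in null_ts) / len(null_ts)
print(f"t = {t_obs:.2f}, Monte Carlo P = {p:.3f}")
```

A small P here signals that the data are unlikely under the null sampling distribution, which is the screening decision described above.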
Assuming an exponential distribution of δ (the standardized mean difference in the population) with zero probability of H0: δ = 0 (a point value in a continuous distribution) presumes, but does not test, that HA: δ > 0 is true. The simulation in the study by Wilson et al. (1) is unnecessary. The more parsimonious LLN, coupled with NHST screening based on studies with power < 1.0, dictates that effect sizes in original research are exaggerated. Under the LLN, the estimate (d) approaches the effect (δ) as N → ∞. Points in Fig. 1A represent means, each from a single sample (of size N); dots are significant (P < 0.05) and circles are not (P > 0.05). More extreme values of d are required to reach significance at smaller N (cf. original studies), whereas less extreme values of d reach significance at larger N (cf. replication studies).
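The truncation-plus-LLN account can be illustrated with a small simulation. This is a hypothetical Python sketch (the authors' R code for Fig. 1 is on OSF; the sample sizes, replication count, and directional α ≈ .025 screen here are assumptions for illustration): studies are drawn with δ = 0.5, and only those passing an NHST screen are retained. Significant estimates overshoot δ at small N but converge toward it at large N.

```python
import random
import statistics

random.seed(2)

DELTA = 0.5  # true standardized mean difference, as in Fig. 1

def significant_estimates(n, reps=2000, crit=1.96):
    """Simulate reps one-sample studies of size n with true effect DELTA.
    Return the mean estimate over all studies and over significant ones only."""
    all_d, sig_d = [], []
    for _ in range(reps):
        sample = [random.gauss(DELTA, 1) for _ in range(n)]
        d_hat = statistics.mean(sample)          # estimate of delta (unit variance)
        t = d_hat * n ** 0.5 / statistics.stdev(sample)
        all_d.append(d_hat)
        if t > crit:                             # directional NHST screen
            sig_d.append(d_hat)
    return statistics.mean(all_d), statistics.mean(sig_d)

for n in (20, 100, 500):
    m_all, m_sig = significant_estimates(n)
    print(f"N={n:4d}  mean d (all) = {m_all:+.2f}  mean d (significant) = {m_sig:+.2f}")
```

At small N the screen truncates the distribution of d, so significant estimates are exaggerated; at large N nearly every study is significant and the LLN pulls d toward δ = 0.5.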
Fig. 1.
(A) Variability in sample means as a function of sample size N; each point comes from a single study, so the variability depicted is that of individual sample means, not a sampling distribution. The population effect size is δ = 0.5 (dashed horizontal reference line), and the data d follow a normal distribution with unit variance. (B) Increasing N is associated with increasing values of the t test statistic and lower critical values (monotonically decreasing reference line). Red dots (estimates and t test statistics) are significant at P < 0.05; open circles have P > 0.05. R code and data to accompany this figure are available on the Open Science Framework (https://osf.io/97amd/).
Signal detection does not just pertain to deciding whether there is signal (e.g., of correct direction) in original research. Assuming HA: δ = 0.5 and large N (cf. replication studies), estimated effects still exhibit variability that can raise doubts about the direction of an effect (e.g., d is close to 0 for N = 108 in Fig. 1A). The screening function of NHST does involve signal detection, although perhaps of direction (2) rather than (or in addition to) the presence versus absence of an effect. Human scientific inquiry involves making probabilistic decisions (i.e., detecting a signal) that are relevant to both original and replication studies. Beyond statistical models, conclusive (i.e., less uncertain but not definitive) research claims require research practices centered on optimizing validity (4–6).
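How doubt about direction shrinks with N can be shown with a brief Monte Carlo sketch (illustrative Python; the sample sizes and replication count are assumptions, not the authors' values): even with δ = 0.5, a nontrivial share of small-N estimates point the wrong way.

```python
import random
import statistics

random.seed(3)

DELTA = 0.5  # assumed true effect, as in Fig. 1

def prob_wrong_direction(n, reps=10000):
    """Monte Carlo estimate of P(d <= 0): the estimate points the wrong way
    (or exactly at zero) even though delta = 0.5 > 0."""
    wrong = 0
    for _ in range(reps):
        d_hat = statistics.mean(random.gauss(DELTA, 1) for _ in range(n))
        wrong += d_hat <= 0
    return wrong / reps

for n in (4, 16, 64, 256):
    print(f"N={n:3d}  P(sign error) ~ {prob_wrong_direction(n):.3f}")
```

The sign-error probability falls rapidly with N, which is why direction is far less in doubt in large replication studies than in small original ones.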
Footnotes
The authors declare no competing interest.
Data deposition: The data and materials for this paper have been made publicly available via the Open Science Framework (OSF) and can be accessed at https://osf.io/97amd/.
References
- 1. Wilson B. M., Harris C. R., Wixted J. T., Science is not a signal detection problem. Proc. Natl. Acad. Sci. U.S.A. 117, 5559–5567 (2020).
- 2. Jones L. V., Tukey J. W., A sensible formulation of the significance test. Psychol. Methods 5, 411–414 (2000).
- 3. Fisher R. A., Statistical tests. Nature 136, 474 (1935).
- 4. Finkel E. J., Eastwick P. W., Reis H. T., Replicability and other features of a high-quality science: Toward a balanced and empirical approach. J. Pers. Soc. Psychol. 113, 244–253 (2017).
- 5. Fabrigar L. R., Wegener D. T., Vaughan-Johnston T. I., Wallace L. E., Petty R. E., “Designing and interpreting replication studies in psychological research” in Handbook of Research Methods in Consumer Psychology, Kardes F. R., Herr P. M., Schwarz N., Eds. (Routledge, New York, 2019), pp. 483–507.
- 6. Flake J. K., Pek J., Hehman E., Construct validation in social and personality research: Current practice and recommendations. Soc. Psychol. Personal. Sci. 8, 370–378 (2017).

