2021 Dec 10;10:e71601. doi: 10.7554/eLife.71601

Table 2. Replication rates according to three criteria involving null hypothesis significance testing.

| | Papers | Experiments | Effects | All outcomes |
|---|---|---|---|---|
| Total number | 23 | 50 | 158 | 188 |
| ORIGINAL POSITIVE RESULTS | | | | |
| Succeeded on all three criteria | 2 (11%) | 2 (6%) | 13 (13%) | 20 (18%) |
| [1] Failed only on significance and direction | 2 (11%) | 1 (3%) | 4 (4%) | 6 (5%) |
| [2] Failed only on original in replication confidence interval | 1 (5%) | 5 (15%) | 14 (14%) | 10 (9%) |
| [3] Failed only on replication in original confidence interval | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Failed only on [1] and [2] | 0 (0%) | 3 (9%) | 11 (11%) | 14 (13%) |
| Failed only on [2] and [3] | 5 (26%) | 10 (30%) | 15 (15%) | 14 (13%) |
| Failed only on [1] and [3] | 1 (5%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Failed on all three criteria [1], [2], and [3] | 8 (42%) | 12 (36%) | 40 (41%) | 48 (43%) |
| Total | 19 | 33 | 97 | 112 |
| ORIGINAL NULL RESULTS | | | | |
| Succeeded on all three criteria | 6 (55%) | 7 (58%) | 8 (53%) | 7 (35%) |
| [1] Failed only on significance and direction | 2 (18%) | 2 (17%) | 3 (20%) | 5 (25%) |
| [2] Failed only on original in replication confidence interval | 1 (9%) | 1 (8%) | 1 (7%) | 1 (5%) |
| [3] Failed only on replication in original confidence interval | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Failed only on [1] and [2] | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Failed only on [2] and [3] | 2 (18%) | 2 (17%) | 2 (13%) | 2 (10%) |
| Failed only on [1] and [3] | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Failed on all three criteria [1], [2], and [3] | 0 (0%) | 0 (0%) | 1 (7%) | 5 (25%) |
| Total | 11 | 12 | 15 | 20 |

Number of replications that succeeded or failed to replicate results in original experiments according to three criteria within the null hypothesis significance testing framework: statistical significance (p < 0.05) and same direction; original effect size inside 95% confidence interval of replication effect size using standardized mean difference (SMD) effect sizes; replication effect size inside 95% confidence interval of original effect size using SMD effect sizes. Data for original positive results and original null results are shown separately, as are data for all outcomes and aggregated by effect, experiment, and paper. Very similar results are obtained when alternative strategies are used to aggregate the data (see Tables S4–S6 in Supplementary file 1).
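
To make the three criteria concrete, the following Python sketch shows one way they could be evaluated for a single original/replication pair and mapped onto the row labels of the table. It is an illustration only, not the authors' analysis code. The names `Effect`, `classify`, and `Z_95` are hypothetical, the 95% confidence interval is assumed to be a simple normal approximation (SMD ± 1.96 × standard error), and the numbers in the example are invented. The sketch covers only original positive results; how criterion [1] is applied to original null results is not spelled out in the caption, so that case is left out.

```python
from dataclasses import dataclass

# Two-sided 95% quantile of the standard normal distribution.
Z_95 = 1.959963984540054


@dataclass
class Effect:
    """Hypothetical container for one effect (not from the paper)."""
    smd: float      # standardized mean difference
    se: float       # standard error of the SMD
    p_value: float  # two-sided p-value

    def ci95(self):
        # ASSUMPTION: normal-approximation interval; the paper may use another method.
        half_width = Z_95 * self.se
        return (self.smd - half_width, self.smd + half_width)


def classify(original, replication, alpha=0.05):
    """Evaluate criteria [1]-[3] for an original positive result."""
    # [1] replication is significant and points in the same direction
    same_direction = original.smd * replication.smd > 0
    c1 = replication.p_value < alpha and same_direction

    # [2] original effect size lies inside the replication 95% CI
    rep_low, rep_high = replication.ci95()
    c2 = rep_low <= original.smd <= rep_high

    # [3] replication effect size lies inside the original 95% CI
    orig_low, orig_high = original.ci95()
    c3 = orig_low <= replication.smd <= orig_high

    failed = [tag for tag, ok in (("[1]", c1), ("[2]", c2), ("[3]", c3)) if not ok]
    if not failed:
        label = "Succeeded on all three criteria"
    elif len(failed) == 3:
        label = "Failed on all three criteria"
    else:
        label = "Failed only on " + " and ".join(failed)
    return c1, c2, c3, label


if __name__ == "__main__":
    # Invented numbers: a large original effect and a small, non-significant replication.
    original = Effect(smd=1.0, se=0.3, p_value=0.001)
    replication = Effect(smd=0.2, se=0.15, p_value=0.18)
    print(classify(original, replication)[-1])  # Failed on all three criteria
```

Because each pair either passes or fails each of the three criteria, there are eight possible patterns, which correspond to the eight rows in each half of the table.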