Table 2. Replication rates according to three criteria involving null hypothesis significance testing.
| | Papers (n) | Papers (%) | Experiments (n) | Experiments (%) | Effects (n) | Effects (%) | All outcomes (n) | All outcomes (%) |
|---|---|---|---|---|---|---|---|---|
| Total number | 23 | | 50 | | 158 | | 188 | |
| ORIGINAL POSITIVE RESULTS | | | | | | | | |
| Succeeded on all three criteria | 2 | 11% | 2 | 6% | 13 | 13% | 20 | 18% |
| [1] Failed only on significance and direction | 2 | 11% | 1 | 3% | 4 | 4% | 6 | 5% |
| [2] Failed only on original in replication confidence interval | 1 | 5% | 5 | 15% | 14 | 14% | 10 | 9% |
| [3] Failed only on replication in original confidence interval | 0 | 0% | 0 | 0% | 0 | 0% | 0 | 0% |
| Failed only on [1] and [2] | 0 | 0% | 3 | 9% | 11 | 11% | 14 | 13% |
| Failed only on [2] and [3] | 5 | 26% | 10 | 30% | 15 | 15% | 14 | 13% |
| Failed only on [1] and [3] | 1 | 5% | 0 | 0% | 0 | 0% | 0 | 0% |
| Failed on all three criteria [1], [2], and [3] | 8 | 42% | 12 | 36% | 40 | 41% | 48 | 43% |
| Total | 19 | | 33 | | 97 | | 112 | |
| ORIGINAL NULL RESULTS | | | | | | | | |
| Succeeded on all three criteria | 6 | 55% | 7 | 58% | 8 | 53% | 7 | 35% |
| [1] Failed only on significance and direction | 2 | 18% | 2 | 17% | 3 | 20% | 5 | 25% |
| [2] Failed only on original in replication confidence interval | 1 | 9% | 1 | 8% | 1 | 7% | 1 | 5% |
| [3] Failed only on replication in original confidence interval | 0 | 0% | 0 | 0% | 0 | 0% | 0 | 0% |
| Failed only on [1] and [2] | 0 | 0% | 0 | 0% | 0 | 0% | 0 | 0% |
| Failed only on [2] and [3] | 2 | 18% | 2 | 17% | 2 | 13% | 2 | 10% |
| Failed only on [1] and [3] | 0 | 0% | 0 | 0% | 0 | 0% | 0 | 0% |
| Failed on all three criteria [1], [2], and [3] | 0 | 0% | 0 | 0% | 1 | 7% | 5 | 25% |
| Total | 11 | | 12 | | 15 | | 20 | |
Number of replications that succeeded or failed to replicate results in original experiments according to three criteria within the null hypothesis significance testing framework: [1] statistical significance (p < 0.05) and same direction as the original effect; [2] original effect size inside the 95% confidence interval of the replication effect size, using standardized mean difference (SMD) effect sizes; [3] replication effect size inside the 95% confidence interval of the original effect size, using SMD effect sizes. Data for original positive results and original null results are shown separately. Within each group, results are reported for all outcomes and aggregated by effect, experiment, and paper. Very similar results are obtained when alternative strategies are used to aggregate the data (see Tables S4–S6 in Supplementary file 1).
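
As an illustration of how the three criteria combine into the row labels of Table 2, the following is a minimal sketch for an original positive result only. It is not the authors' analysis code: the `Estimate` container, the function names, and the numeric values in the usage example are hypothetical, and the handling of original null results is deliberately left out because the caption does not spell it out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Hypothetical container: an SMD effect size and its 95% confidence interval."""
    smd: float
    ci_low: float
    ci_high: float

    def contains(self, value: float) -> bool:
        # True when value lies inside the closed interval [ci_low, ci_high].
        return self.ci_low <= value <= self.ci_high


def failed_criteria(original, replication, replication_p, alpha=0.05):
    """Return the sorted list of criteria (1, 2, 3) that a replication fails.

    Assumes the original result was positive (statistically significant).
    """
    failed = []
    # [1] Replication is significant and in the same direction as the original.
    same_direction = original.smd * replication.smd > 0
    if not (replication_p < alpha and same_direction):
        failed.append(1)
    # [2] Original effect size lies inside the replication's 95% CI.
    if not replication.contains(original.smd):
        failed.append(2)
    # [3] Replication effect size lies inside the original's 95% CI.
    if not original.contains(replication.smd):
        failed.append(3)
    return failed


SINGLE_FAILURE_LABELS = {
    1: "significance and direction",
    2: "original in replication confidence interval",
    3: "replication in original confidence interval",
}


def table_row(failed):
    """Map a list of failed criteria to the matching row label in Table 2."""
    if not failed:
        return "Succeeded on all three criteria"
    if len(failed) == 3:
        return "Failed on all three criteria [1], [2], and [3]"
    if len(failed) == 1:
        return f"[{failed[0]}] Failed only on {SINGLE_FAILURE_LABELS[failed[0]]}"
    return f"Failed only on [{failed[0]}] and [{failed[1]}]"


# Usage with made-up numbers (not taken from the study):
original = Estimate(smd=1.0, ci_low=0.2, ci_high=1.8)
replication = Estimate(smd=0.5, ci_low=0.3, ci_high=0.7)
print(table_row(failed_criteria(original, replication, replication_p=0.001)))
```

In this made-up case the replication is significant and in the same direction, and its effect size falls inside the original interval, but the original effect size of 1.0 lies outside the narrower replication interval, so only criterion [2] fails.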