Royal Society Open Science. 2024 Aug 28;11(8):240149. doi: 10.1098/rsos.240149

The assessment of replicability using the sum of p-values

Leonhard Held 1, Samuel Pawel 1, Charlotte Micheloud 1
PMCID: PMC11349439  PMID: 39205991

Abstract

Statistical significance of both the original and the replication study is a commonly used criterion to assess replication attempts, also known as the two-trials rule in drug development. However, replication studies are sometimes conducted although the original study is non-significant, in which case Type-I error rate control across both studies is no longer guaranteed. We propose an alternative method to assess replicability using the sum of p-values from the two studies. The approach provides a combined p-value and can be calibrated to control the overall Type-I error rate at the same level as the two-trials rule but allows for replication success even if the original study is non-significant. The unweighted version requires a less restrictive level of significance at replication if the original study is already convincing, which facilitates sample size reductions of up to 10%. Downweighting the original study accounts for possible bias and requires a more stringent significance level and larger sample sizes at replication. Data from four large-scale replication projects are used to illustrate and compare the proposed method with the two-trials rule, meta-analysis and Fisher’s combination method.

Keywords: Edgington’s method, p-values, replication studies, sample size planning, two-trials rule, Type-I error rate

1. Introduction

Replication studies are increasingly conducted in various fields to assess the replicability of original findings. While ‘replicability’ is intuitively understood as the ability to obtain consistent results when a study is repeated with new subjects [1], there is no agreed-upon quantitative definition of replicability or ‘replication success’. In practice, a variety of different approaches are used that capture different intuitions about what it means for a replication to be successful. For example, metrics such as the Q-test or prediction intervals quantify the statistical compatibility of the effect estimates from original and replication studies [2,3] while a meta-analysis of original and replication results provides an assessment of the evidence pooled across both studies [4].

Perhaps the most commonly used success criterion is the statistical significance of the replication study with an effect estimate in the same direction as in the original study [5–9]. Positive original findings are usually also based on significance, and the procedure is then analogous to the ‘two-trials rule’ in drug development [10, §12.2.8]. The intuition behind this approach is that each study individually should provide sufficient evidence for a claim, which is not guaranteed with other approaches such as meta-analysis or the Q-test. In practice, the standard two-sided significance level 0.05 is often used and so replication success with the two-trials rule is achieved if both one-sided p-values, $p_o$ and $p_r$, from the original and replication study, respectively, are smaller than $\alpha = 0.05/2 = 0.025$. The one-sided formulation is convenient to ensure that the effect estimates from the two studies are in the same direction.

The success condition of the two-trials rule, $\max\{p_o, p_r\} \le \alpha$, can be rewritten as

$$p_{2TR} = \max\{p_o, p_r\}^2 \le \alpha^2, \tag{1.1}$$

where the combined p-value $p_{2TR}$ of the two-trials rule turns out to be valid [11], i.e. uniformly distributed under the intersection null hypothesis that both the true effect $\theta_o$ in the original study and the true effect $\theta_r$ in the replication study are null. The two-trials rule therefore controls the overall Type-I error rate at level $\alpha^2$. For $\alpha = 0.025$, the probability to incorrectly declare replication success under the intersection null hypothesis is $0.025^2 = 0.000625$.

Despite its simplicity, in the replication setting the two-trials rule has important limitations. First of all, the ‘double dichotomization’ at $\alpha = 0.025$ can lead to conclusions which seem counterintuitive. For example, the two-trials rule is not fulfilled if $p_o = 0.026$ and $p_r = 0.001$ (or vice versa), but it is fulfilled when both $p_o$ and $p_r$ are 0.024, although there is less evidence for an effect in the second case [12]. The first example also illustrates that replication success is impossible if the original study just missed statistical significance, no matter how convincing the replication study is. However, some replication projects do interpret non-significant original studies as positive findings and try to replicate them. In the Reproducibility Project: Psychology (RPP) [5], four ‘positive’, but non-significant original effects have been included, all with (one-sided) p-values between 0.025 and 0.03. In the Experimental Economics Replication Project (EERP) [6], two ‘positive’, but non-significant effects have been included, one with $p_o = 0.027$ [13] and one with an even larger p-value, $p_o = 0.035$ [14]. The strict application of the two-trials rule would inflate the overall Type-I error rate and reduce trust in the replication finding.

The past and ongoing debate on the use and misuse of p-values [15–17] has led various researchers to advocate for a quantitative interpretation of p-values as measures of the strength of evidence [18] or divergence [19]. This implies that an original study with a relatively large p-value carries only suggestive evidence against the null hypothesis of no effect [20] and requires a more convincing replication result for confirmation than an original study with a smaller p-value. There is in fact empirical evidence that studies with small p-values tend to replicate better than studies with only suggestive evidence. For example, several large-scale replication projects found strong negative Spearman correlations between original p-values and the corresponding (two-sided) replication p-values being smaller than 0.05 [5–7]. Held [21] has shown that more stringent significance levels have improved performance in predicting replication success in an application to the data from the Open Science Collaboration [5] project. The two-trials rule, however, requires the same level of evidence at replication no matter what level of evidence the claim of the original discovery had, as long as it was significant.

A promising alternative is to summarize the total evidence from the two p-values with a combined p-value, but different from the one based on the two-trials rule (1.1). A large number of valid p-value combination methods are available [22,23] and the two-trials rule can serve as a benchmark as it provides the success threshold $\alpha^2$ to ensure appropriate control of the overall Type-I error rate [24–26]. Perhaps most prominent is Fisher’s combination method [27], where replication success at level $\alpha^2$ is achieved if the product of the p-values satisfies $p_o p_r \le c_F = \exp\{-0.5\,\chi^2_4(1 - \alpha^2)\}$, where $\chi^2_\nu(\cdot)$ denotes the quantile function of the $\chi^2$-distribution with $\nu$ degrees of freedom. The corresponding combined p-value is

$$p_F = 1 - \Pr(X \le -2 \log\{p_o p_r\}), \tag{1.2}$$

where $X$ follows a $\chi^2$ distribution with 4 d.f., so replication success would be declared if $p_F \le \alpha^2$. However, the goal of replication studies is to confirm an original (positive) finding, but replication success with Fisher’s method can be achieved even if one of the p-values is very large. In fact, replication success at level $\alpha^2 = 0.025^2$ is guaranteed if $p_o < c_F \approx 0.00006$, no matter how large $p_r$ is, so Fisher’s method may not even require a replication study to confirm a claim of a new discovery, which is clearly undesirable.
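As a minimal illustrative sketch (not the authors’ code), the combined p-values (1.1) and (1.2) can be computed with the Python standard library alone. The function names are hypothetical, and Fisher’s p-value relies on the closed-form survival function of the $\chi^2$ distribution with 4 d.f., $\Pr(X > x) = \exp(-x/2)(1 + x/2)$.

```python
import math


def p_two_trials(p_o: float, p_r: float) -> float:
    """Combined p-value of the two-trials rule, equation (1.1)."""
    return max(p_o, p_r) ** 2


def p_fisher(p_o: float, p_r: float) -> float:
    """Combined p-value of Fisher's method, equation (1.2).

    Uses the closed-form survival function of the chi-squared
    distribution with 4 d.f.: Pr(X > x) = exp(-x/2) * (1 + x/2).
    Both p-values must be strictly positive.
    """
    x = -2.0 * math.log(p_o * p_r)
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


alpha = 0.025
level = alpha ** 2  # overall Type-I error rate 0.000625

# The two examples from the introduction
print(p_two_trials(0.026, 0.001) <= level)  # False: original study not significant
print(p_two_trials(0.024, 0.024) <= level)  # True: both studies just significant

# Fisher's method can flag success although the replication p-value is large
print(p_fisher(0.00005, 0.9) <= level)      # True
```

The last line illustrates the concern raised above: with a sufficiently small original p-value, Fisher’s criterion is met almost regardless of the replication result.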

An alternative approach used often in replication projects [5–7,9] is to calculate a combined p-value $p_{MA}$ with a fixed-effect meta-analysis, also known as the ‘pooled-trials rule’ in drug development [10, §12.2.8]. Let $\Phi(\cdot)$ denote the standard normal cumulative distribution function and $\Phi^{-1}(\cdot)$ the corresponding quantile function. The meta-analytic p-value $p_{MA} = 1 - \Phi(z_{MA})$ is identical to the combined p-value from weighted Stouffer’s method [23,28], where

$$z_{MA} = \frac{\sigma_r z_o + \sigma_o z_r}{\sqrt{\sigma_o^2 + \sigma_r^2}} \tag{1.3}$$

is the weighted average of the original and replication z-values $z_o = \Phi^{-1}(1 - p_o)$ and $z_r = \Phi^{-1}(1 - p_r)$ with weights proportional to the standard errors $\sigma_r$ and $\sigma_o$ of $\hat\theta_r$ and $\hat\theta_o$, respectively. Again, the criterion $p_{MA} \le \alpha^2$ has been used to assess replication success [24,25,29,30], although a threshold larger than 0.000625 (e.g. two-sided 0.005 or even 0.05) is often used in applications. But it is immediate from (1.3) that meta-analysis has similar problems to Fisher’s method because $z_{MA}$ can become large (and hence $p_{MA}$ small) even if one of the two underlying z-values $z_o$ and $z_r$ is small or even negative (and hence the corresponding p-value large). Recently, Muradchanian et al. [4] have conducted a study to investigate meta-analysis as a replication success metric. They also conclude that meta-analysis is an inappropriate tool if one wants to evaluate whether the replication result is in line with the original result.
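The following sketch, under stated assumptions, evaluates (1.3) numerically. Dividing numerator and denominator of (1.3) by $\sigma_r$ and writing $c = \sigma_o^2/\sigma_r^2$ for the variance ratio gives $z_{MA} = (z_o + \sqrt{c}\, z_r)/\sqrt{1 + c}$; the function name `p_meta` and its default $c = 1$ are illustrative choices, not part of the article.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: Phi and its quantile function


def p_meta(p_o: float, p_r: float, c: float = 1.0) -> float:
    """One-sided fixed-effect meta-analytic p-value based on equation (1.3).

    c = sigma_o^2 / sigma_r^2 is the variance ratio (assumed known).
    Both p-values must lie strictly between 0 and 1.
    """
    z_o = _STD_NORMAL.inv_cdf(1.0 - p_o)
    z_r = _STD_NORMAL.inv_cdf(1.0 - p_r)
    z_ma = (z_o + sqrt(c) * z_r) / sqrt(1.0 + c)
    return 1.0 - _STD_NORMAL.cdf(z_ma)


alpha = 0.025
# A very convincing original study paired with a replication estimate of zero
# (p_r = 0.5) still meets the criterion p_MA <= alpha^2 when c = 1.
print(p_meta(1e-7, 0.5) <= alpha ** 2)  # True
```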

In what follows we propose a particular p-value combination method based on the sum of p-values which requires both studies to be convincing, similar to the two-trials rule: Edgington’s method [31], recently rediscovered by Held [32] in the context of drug regulation. For overall Type-I error control at level $\alpha^2 = 0.025^2$, success is achieved if the sum of p-values is smaller than $\sqrt{2} \times 0.025 \approx 0.035$, so success is possible even if either $p_o$ or $p_r$ is not significant, as long as they are both smaller than 0.035. As such, the method can be used in the scenarios encountered in previous replication projects where an original study was borderline non-significant. We also derive a weighted version of Edgington’s method that treats original and replication differently. Giving twice the weight to the replication study results in the necessary (but not sufficient) success conditions $p_o \le 2\alpha$ and $p_r \le \alpha$. Both versions of Edgington’s method reach the opposite conclusions from the two-trials rule for the two examples introduced earlier: replication success at level $0.025^2$ is declared if $p_o = 0.026$ and $p_r = 0.001$, but not if $p_o$ and $p_r$ are 0.024. The approach is summarized in box 1.

Box 1. Assessing replicability using the sum of p-values.

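Overall Type-I error control at level $\alpha^2 = 0.025^2$.

Input: one-sided p-values $p_o$ and $p_r$ from original and replication study.

| | unweighted | weighted |
|---|---|---|
| weights | $w_o = 1$, $w_r = 1$ | $w_o = 1$, $w_r = 2$ |
| replication success criterion | $p_o + p_r \le \sqrt{2}\,\alpha \approx 0.035$ | $p_o + 2 p_r \le 2\alpha = 0.05$ |
| significance level for replication study | $\sqrt{2}\,\alpha - p_o \approx 0.035 - p_o$ | $\alpha - p_o/2 = 0.025 - p_o/2$ |
| combined p-value | $p_E = (p_o + p_r)^2/2$ (for $p_o + p_r \le 1$) | $p_{E_w} = (p_o + 2 p_r)^2/4$ (for $p_o + 2 p_r \le 1$) |

Other choices can be made for $\alpha$ and the weights $w_o$ and $w_r$.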

The rest of the article is organized as follows. Edgington’s method is described in §2.1 and extended to include weights in §2.2. In §3, we discuss and compare the conditional Type-I error rate and the project power of the two-trials rule, meta-analysis and Fisher’s and Edgington’s methods. The conditional Type-I error rate is the probability, given the original study result, to incorrectly declare replication success when the true effect is null at replication, while the project power is the probability to correctly declare replication success over both studies in combination. The different methods are applied to the data from four large-scale replication projects in §4 and sample size planning of the replication study is described in §5. Extensions to more than one replication study are described in §6. Finally, some discussion is provided in §7.

2. Additive combination of p-values

2.1. Edgington’s method

Edgington [31] developed a method to combine p-values by adding them. Here, we investigate the use of Edgington’s method in the replication setting, where results from one original and one replication study are available. Under the intersection null hypothesis, the sum of the two p-values

$$E = p_o + p_r \tag{2.1}$$

follows the Irwin–Hall distribution with parameter $n = 2$ [33,34], i.e. $E \sim \mathrm{IH}(2)$. The cumulative distribution function of the Irwin–Hall distribution can then be used to calculate a valid combined p-value:

$$p_E = \Pr(\mathrm{IH}(2) \le E) = \begin{cases} E^2/2 & \text{if } 0 < E \le 1, \\ -1 + 2E - E^2/2 & \text{if } 1 < E \le 2. \end{cases} \tag{2.2}$$

The success condition $p_E \le \alpha^2$ can be expressed in terms of the p-value sum as $E \le b$, where the critical value is $b = \sqrt{2}\,\alpha$. At the standard $\alpha = 0.025$, the critical value is hence $b \approx 0.035$. The threshold $b$ can be considered as the available budget for the two p-values, $p_o$ and $p_r$, and this implies that replication success is possible for a non-significant original (or replication) p-value as long as $p_o$ (or $p_r$) $< b \approx 0.035$. The sum of the p-values for the two examples presented in §1 is $E = 0.026 + 0.001 = 0.027$ and $E = 0.024 + 0.024 = 0.048$, respectively, so replication success is declared in the first example but not the second, in contrast to the two-trials rule but in accordance with intuition. The corresponding combination p-values (2.2) are $p_E = 0.0004$ and $p_E = 0.001$, respectively, the first smaller and the second larger than the threshold $\alpha^2 = 0.000625$.
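A short sketch of (2.2) and of the budget $b = \sqrt{2}\,\alpha$ follows; the function name `p_edgington` is hypothetical and the code is only meant to reproduce the two worked examples above.

```python
import math


def p_edgington(p_o: float, p_r: float) -> float:
    """Combined p-value of (unweighted) Edgington's method, equation (2.2)."""
    e = p_o + p_r  # sum of p-values, equation (2.1)
    if e <= 1.0:
        return e ** 2 / 2.0
    return -1.0 + 2.0 * e - e ** 2 / 2.0


alpha = 0.025
budget = math.sqrt(2.0) * alpha  # critical value b for the sum of p-values

for p_o, p_r in [(0.026, 0.001), (0.024, 0.024)]:
    p_e = p_edgington(p_o, p_r)
    success = p_e <= alpha ** 2  # equivalent to p_o + p_r <= budget
    print(f"p_o={p_o}, p_r={p_r}: p_E={p_e:.6f}, success={success}")
```

The first pair yields success and the second does not, matching the discussion of the two examples.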

2.2. Weighted version

An interesting extension of Edgington’s method is to include weights, for example, proportional to the precision of the studies, as in weighted Stouffer’s method. However, in the replication setting one might want to downweight the original and upweight the replication study, for example, with weights 1/3 and 2/3, respectively. This would address concerns that original studies may be subject to questionable research practices and hence prone to bias. Consider the weighted sum of p-values

$$E_w = w_o p_o + w_r p_r, \tag{2.3}$$

with positive weights $w_o \le w_r$; then we show in appendix A that the corresponding combination p-value is

$$p_{E_w} = \begin{cases} \dfrac{E_w^2}{2 w_o w_r} & \text{if } 0 < E_w \le w_o, \\[1ex] \dfrac{1}{w_r}\left(E_w - \dfrac{w_o}{2}\right) & \text{if } w_o < E_w \le w_r, \\[1ex] 1 + \dfrac{1}{w_o w_r}\left(E_w (w_o + w_r) - \dfrac{(w_o + w_r)^2}{2} - \dfrac{E_w^2}{2}\right) & \text{if } w_r < E_w \le w_o + w_r. \end{cases} \tag{2.4}$$

Note that (2.4) reduces to (2.2) for $w_o = w_r = 1$, and is invariant under multiplication of the weights $(w_o, w_r)$ with a positive constant. The p-value (2.4) therefore only depends on the weight ratio $\tilde{w} = w_r/w_o$ of the replication to the original weight. Setting the first line of (2.4) equal to $\alpha^2$ gives the available budget $b_w = \sqrt{2 w_o w_r}\,\alpha$ for $E_w \le w_o$.

In the following we will use the weights $w_o = 1$ and $w_r = 2$, although other choices can be made, of course. Then

$$p_{E_w} = \begin{cases} E_w^2/4 & \text{if } 0 < E_w \le 1, \\ E_w/2 - 1/4 & \text{if } 1 < E_w \le 2, \\ -\tfrac{5}{4} + \tfrac{3}{2} E_w - E_w^2/4 & \text{if } 2 < E_w \le 3. \end{cases} \tag{2.5}$$

For $E_w \le 1$ with small enough $p_o < p_r$, we have $p_{E_w} = (p_o + 2 p_r)^2/4 \approx p_r^2 = p_{2TR}$ from (1.1), so weighted Edgington will behave similarly to the two-trials rule, whereas the p-value from the unweighted version will then be roughly half as large as $p_{2TR}$: $p_E = (p_o + p_r)^2/2 \approx p_r^2/2 = p_{2TR}/2$.

The success condition $p_{E_w} = E_w^2/4 \le \alpha^2$ can be rewritten as $E_w = p_o + 2 p_r \le 2\alpha$, so doubling the replication weight increases the budget from $\sqrt{2}\,\alpha$ to $2\alpha$ and for $\alpha = 0.025$ it will be possible to successfully replicate original studies with $p_o \le 0.05$. However, the p-value of the replication study now counts twice in $E_w$, so replication success is impossible if $p_r > \alpha$, just as with the two-trials rule. For example, the original study by Kuziemko et al. [14] mentioned in §1 had a quite large p-value: $p_o = 0.035$. Conducting a replication would be pointless if analysis is based on the two-trials rule or even unweighted Edgington. However, replication success would still be possible with weighted Edgington, but the replication study has to be quite convincing to achieve success ($p_r < 0.025 - 0.035/2 = 0.0075$).
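The piecewise formula (2.4) translates directly into code. The sketch below uses a hypothetical function name and the default weights $w_o = 1$, $w_r = 2$ adopted in the text; it checks the Kuziemko et al. example, where success requires $p_r < 0.0075$.

```python
def p_edgington_weighted(p_o: float, p_r: float,
                         w_o: float = 1.0, w_r: float = 2.0) -> float:
    """Combined p-value of weighted Edgington's method, equation (2.4).

    Requires positive weights with w_o <= w_r, as assumed in the text.
    """
    if not 0.0 < w_o <= w_r:
        raise ValueError("weights must satisfy 0 < w_o <= w_r")
    e = w_o * p_o + w_r * p_r  # weighted sum of p-values, equation (2.3)
    s = w_o + w_r
    if e <= w_o:
        return e ** 2 / (2.0 * w_o * w_r)
    if e <= w_r:
        return (e - w_o / 2.0) / w_r
    return 1.0 + (e * s - s ** 2 / 2.0 - e ** 2 / 2.0) / (w_o * w_r)


alpha = 0.025
# Original study with p_o = 0.035: success needs p_r < 0.0075
print(p_edgington_weighted(0.035, 0.007) <= alpha ** 2)  # True
print(p_edgington_weighted(0.035, 0.008) <= alpha ** 2)  # False
```

Because (2.4) depends only on the weight ratio, calling the function with weights (2, 4) returns the same value as with (1, 2).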

3. Operating characteristics

As discussed before, all methods considered control the overall Type-I error rate at level $\alpha^2$. We will now look at two other operating characteristics: conditional Type-I error rate and project power.

3.1. Conditional Type-I error rate

The success condition on the replication p-value $p_r$ with the two-trials rule is always $p_r \le \alpha$, regardless of the original study result (as long as $p_o \le \alpha$ holds). In contrast, with Edgington’s and Fisher’s methods as well as the meta-analysis criterion the required value of $p_r$ depends on the original p-value $p_o$. This condition is

$$p_r \le b - p_o \tag{3.1}$$

for Edgington’s method,

$$p_r \le (b_w - w_o p_o)/w_r \tag{3.2}$$

for weighted Edgington’s method,

$$p_r \le \min\{c_F/p_o, 1\} \tag{3.3}$$

for Fisher’s method, and

$$p_r \le 1 - \Phi\left\{\frac{\Phi^{-1}(1 - \alpha^2)\sqrt{c + 1} - z_o}{\sqrt{c}}\right\} \tag{3.4}$$

for the meta-analysis criterion, which depends on the variance ratio $c = \sigma_o^2/\sigma_r^2$. For a very convincing original study (where $p_o \to 0$, equivalently $z_o \to \infty$), the right-hand side in (3.1) tends to $b = \sqrt{2}\,\alpha$, in (3.2) to $b_w/w_r = \sqrt{2/\tilde{w}}\,\alpha$, while the right-hand side in (3.3) and (3.4) converges to 1. Replication success can thus be achieved with Fisher’s method and the meta-analysis criterion even if the replication p-value is very large, while this cannot happen with the two-trials rule and Edgington’s method.

Of particular interest in the replication setting is the conditional Type-I error rate, the probability, given the original study result, that the replication study flags success although there is no true effect at replication ($\theta_r = 0$). For $\alpha = 0.025$, the conditional Type-I error rate is $\alpha = 0.025$, $b - p_o < b \approx 0.035$ and $(b_w - w_o p_o)/w_r < b_w/w_r = 0.025$ with the two-trials rule and the unweighted and weighted Edgington’s method, respectively, but can become very large with Fisher’s method and the meta-analysis criterion, if $p_o$ is small. For example, suppose the original study had a p-value of $p_o = 0.001$. The conditional Type-I error rate of Edgington’s method at the standard $\alpha = 0.025$ level then is 3.4% (unweighted) and 2.45% (weighted). With Fisher’s method and meta-analysis (for $c = 1$), the conditional Type-I error rate is 5.8% and 7.0%, respectively. Now suppose the original study had a p-value of $p_o = 0.0001$. The conditional Type-I error rate then is 3.53% (unweighted) and 2.495% (weighted) with Edgington’s method, so only slightly larger. However, with Fisher’s method and meta-analysis, the conditional Type-I error rate increases drastically to 58.1% and 19.9%, respectively.
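The bounds (3.1) to (3.4) can be evaluated numerically to reproduce the percentages quoted above. This is a sketch under stated assumptions: the function names are hypothetical, the weights are fixed at $w_o = 1$, $w_r = 2$, and the $\chi^2_4$ quantile behind $c_F$ is obtained by bisection on the closed-form survival function $\exp(-y)(1 + y)$ with $y = x/2$ rather than from a statistics library.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def fisher_threshold(level: float) -> float:
    """c_F = exp{-0.5 * chi2_4 quantile at 1 - level}, found by bisection."""
    lo, hi = 0.0, 100.0  # search interval for y = x / 2
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if math.exp(-mid) * (1.0 + mid) > level:
            lo = mid  # survival probability still too large: increase y
        else:
            hi = mid
    return math.exp(-lo)


def conditional_type1(p_o: float, alpha: float = 0.025, c: float = 1.0) -> dict:
    """Largest replication p-value that still yields success, per method."""
    b = math.sqrt(2.0) * alpha
    z_o = _STD_NORMAL.inv_cdf(1.0 - p_o)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha ** 2)
    return {
        "two_trials": alpha if p_o <= alpha else 0.0,
        "edgington": max(b - p_o, 0.0),                 # (3.1)
        "edgington_weighted": max(alpha - p_o / 2, 0.0),  # (3.2), w_o=1, w_r=2
        "fisher": min(fisher_threshold(alpha ** 2) / p_o, 1.0),  # (3.3)
        "meta": 1.0 - _STD_NORMAL.cdf(
            (z_crit * math.sqrt(c + 1.0) - z_o) / math.sqrt(c)),  # (3.4)
    }


for p_o in (0.001, 0.0001):
    rates = conditional_type1(p_o)
    print(p_o, {name: round(100 * r, 2) for name, r in rates.items()})
```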

3.2. Project power

The project power is the probability to correctly declare replication success over both studies in combination when both underlying effects are non-null. Most original studies are designed to have 80% power to detect the assumed true effect size at significance level $\alpha$, but the power can be considerably lower in reality [35,36]. In figure 1, we consider the project power with an original power of 80% (figure 1a,b) and 40% (figure 1c,d), under the assumption that the true effect sizes are the same ($\theta_r = \theta_o$; figure 1a,c) or that the true replication effect size is half the true original effect size ($\theta_r = \theta_o/2$; figure 1b,d). The latter case reflects the shrinkage of effect estimates often encountered in replication projects.

Figure 1. Project power of the two-trials rule, Edgington’s (both unweighted and weighted) and Fisher’s methods, and the meta-analysis criterion as a function of the relative sample size $c = n_r/n_o$, assuming that the true effect sizes $\theta_o$ and $\theta_r$ are the same (a,c), or that the true replication effect size is half the true original effect size, $\theta_r = \theta_o/2$ (b,d). The calculations assume that the original study has a power of 80% (a,b) or 40% (c,d) to detect the assumed true effect size at significance level $\alpha$.

In contrast to the conditional Type-I error rate, the project power of all methods depends on the original power to detect $\theta_o$ and the variance ratio $c$, which can often be interpreted as the relative sample size $c = n_r/n_o$ (replication to original); see appendix B for the derivations. The project power of the two-trials rule cannot become larger than the power in the original study to detect the true effect size (see figure 1). The project power of Edgington’s method is either essentially identical (for small $c$ and the weighted version) or otherwise larger than the project power of the two-trials rule, with limit 84% (unweighted) and 87.6% (weighted), respectively, for $c \to \infty$ and an original power of 80%. The corresponding values are 46% (unweighted) and 52.5% (weighted) for an original power of 40%. The project power of Fisher’s method and the meta-analysis criterion is larger than the project power of both the two-trials rule and Edgington’s method and reaches values close to 100% in the case $\theta_r = \theta_o$ even for a low original power of 40%. However, the price to pay is a considerable increase in conditional Type-I error rate, as discussed in §3.1.
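The limiting values quoted above can be checked with a rough sketch that does not follow the derivation in appendix B. It assumes that, for $c \to \infty$ and a non-null replication effect, $p_r$ tends to zero, so that success reduces to $p_o$ staying within the largest admissible value ($\alpha$ for the two-trials rule, $\sqrt{2}\,\alpha$ for unweighted and $2\alpha$ for weighted Edgington), and that $z_o$ is normal with unit variance and mean $z_{1-\alpha} + z_{\text{power}}$. The function name is hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def limiting_project_power(original_power: float, p_o_bound: float,
                           alpha: float = 0.025) -> float:
    """Pr(p_o <= p_o_bound) when the original study has the given power
    at one-sided level alpha (assumed limit of project power as c grows)."""
    mu = _STD_NORMAL.inv_cdf(1.0 - alpha) + _STD_NORMAL.inv_cdf(original_power)
    return _STD_NORMAL.cdf(mu - _STD_NORMAL.inv_cdf(1.0 - p_o_bound))


alpha = 0.025
bounds = {
    "two-trials": alpha,
    "Edgington (unweighted)": math.sqrt(2.0) * alpha,
    "Edgington (weighted)": 2.0 * alpha,
}
for power in (0.8, 0.4):
    for name, bound in bounds.items():
        print(power, name, round(100 * limiting_project_power(power, bound), 1))
```

Under these assumptions the two-trials rule returns the original power itself, consistent with the statement that its project power cannot exceed the original power.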

4. Application

The RPP [5], the EERP [6], the Social Sciences Replication Project (SSRP) [7] and the Experimental Philosophy Replicability Project (EPRP) [8] are large-scale replication projects which aimed to replicate important discoveries published in journals from their respective fields. Here, we consider 138 original studies considered to have a ‘positive’ effect, i.e. either significant or with a ‘trend to significance’ (i.e. non-significant with a p-value just slightly larger than the significance threshold). Several methods were used to assess replication success, such as the two-trials rule, meta-analysis of the original and replication effect estimates or compatibility of the replication effect estimates with a prediction interval based on the original results.

We grouped the 138 positive original studies into significant ($p_o \le$ threshold) and non-significant ($p_o >$ threshold) at varying thresholds. The proportion of significant replication studies at the one-sided 0.025 level ($p_r \le \alpha = 0.025$) is then calculated for each of the two groups and displayed in figure 2. The proportion of significant replication studies is higher for more convincing original studies (i.e. significant original studies at a smaller significance threshold). This shows that more convincing original studies tend to replicate more often than less convincing original studies, and it therefore makes sense to be less stringent with the former, as does Edgington’s method, but not the two-trials rule.

Figure 2. Proportion of significant replication studies ($p_r \le 0.025$) for original studies with p-value below and above the threshold, as a function of the threshold. The points at the bottom represent the original p-values $p_o$. There are 17 study pairs with original p-value $p_o < 0.000001$ that are not shown.

We then calculated the replication rates in the four projects with the two-trials rule ($p_{2TR} \le \alpha^2$) and Edgington’s method ($p_E \le \alpha^2$ or $p_{E_w} \le \alpha^2$) for varying levels $\alpha^2$ (see figure 3). The replication rates are similar with a tendency of larger success rates with Edgington’s method. This is in line with the larger project power of Edgington’s method discussed in §3.2. For example, at level $\alpha^2 = 0.025^2$, the replication rates of the two-trials rule and Edgington’s method are 30.4% versus 31.9% in the RPP, 55.6% versus 61.1% in the EERP, 61.9% for both in the SSRP and 76.7% for both in the EPRP. The conclusions only differ for two study pairs: the original study by Schmidt & Besner [37] and its replication in the RPP, and the original study by Ambrus & Greiner [13] and its replication in the EERP. As the original p-values $p_o = 0.028$ and $p_o = 0.027$ are slightly larger than $\alpha = 0.025$ in both cases, the two-trials rule is not fulfilled. However, as $p_r < 0.0001$ and $p_r = 0.006$, respectively, the sum $p_o + p_r$ does not exceed $b \approx 0.035$ and so replication success with Edgington’s method is achieved for both study pairs. Likewise, success is also achieved for the weighted method as $p_o + 2 p_r \le 2\alpha = 0.05$ in both cases.

Figure 3. Success rate of the RPP, EERP, SSRP and EPRP as a function of the overall Type-I error rate $\alpha^2$ with the two-trials rule and Edgington’s method.

We also computed the combined p-values $p_{2TR}$, $p_E$, $p_F$ and $p_{MA}$ with each of the four methods and plotted them against the replication p-value $p_r$ for non-significant replication studies (see figure 4). By construction, the combined p-value from weighted Edgington’s method is always larger than the success threshold $\alpha^2 = 0.000625$, and this is also the case for the unweighted version, where the smallest combined p-value $p_E = 0.000635$ is just slightly above the threshold. In contrast, with Fisher’s method and the meta-analysis criterion, replication success is often declared although the replication p-value is quite large. There are even three studies with $p_r > 0.5$ (so with an effect estimate in the wrong direction) which achieve success at level $\alpha^2 = 0.000625$ with Fisher’s method, and one study with the meta-analysis criterion. This illustrates that Fisher’s method and the meta-analysis criterion are not suited as a replacement for the two-trials rule.

Figure 4. Combined p-values $p_{2TR}$, $p_E$, $p_{E_w}$, $p_F$ and $p_{MA}$ have been calculated from the original and replication p-values $p_o$ and $p_r$, respectively, for all replication studies considered in the four replication projects. They are plotted against the replication p-value $p_r$ for non-significant replication studies ($p_r > 0.025$). Combined p-values in the grey area flag replication success at overall Type-I error rate $0.025^2 = 0.000625$. The dashed black line is the lower bound $p_r^2$ for $p_{2TR}$. The dashed orange line is the lower bound $p_r^2/2$ for $p_E \le 1/2$. Fisher’s ($p_F$) and meta-analytic ($p_{MA}$) p-values smaller than $10^{-6}$ are marked with a triangle.

5. Sample size calculation

The sample size of the replication study is usually calculated based on conditional power, i.e. such that the power $1 - \beta$ to reach a significant replication effect estimate reaches a certain value, usually 80% or 90%, assuming the original effect estimate is the true one. If, additionally, significance of the original study is required, this corresponds to the sample size calculation based on the two-trials rule. In practice, a standard sample size calculation method is used where the minimal clinically important difference is replaced with the original effect estimate $\hat\theta_o$. For example, for a balanced two-sample z-test, the sample size per group in the replication study is calculated as

$$n_r = \frac{2\tau^2 (z_{1-\alpha} + z_{1-\beta})^2}{\hat\theta_o^2}, \tag{5.1}$$

where $\tau$ denotes the common standard deviation of the measurements and $z_{1-u} = \Phi^{-1}(1 - u)$ denotes the $(1 - u)$ quantile of the standard normal distribution. We note that in some replication projects [7] the original effect estimate $\hat\theta_o$ in (5.1) is reduced by 25% or even 50% to take into account its possible inflation [38].

It is also possible to express equation (5.1) on the relative scale. In that case, the required relative sample size $c = \sigma_o^2/\sigma_r^2 = n_r/n_o$ is calculated as

$$c = \frac{(z_{1-\alpha} + z_{1-\beta})^2}{z_o^2}. \tag{5.2}$$

Sample size calculation based on (5.1) or (5.2) is appropriate if significance of the replication study at level $\alpha$ is the desired criterion for replication success. If instead Edgington’s method will be used, the sample size calculation needs to be appropriately adapted to ensure that the design of the replication study matches the analysis [39]. To do so, the significance level $\alpha$ needs to be replaced with $b - p_o$ in equations (5.1) and (5.2), and so now depends on the p-value from the original study. A smaller sample size than with the two-trials rule is therefore required if $b - p_o > \alpha$, i.e.

$$p_o < b - \alpha = \sqrt{2}\,\alpha - \alpha = \alpha(\sqrt{2} - 1) \approx 0.01. \tag{5.3}$$

The weighted version always requires a larger sample size than the two-trials rule, because the required significance level is $\alpha - p_o/2 < \alpha$.

Figure 5 shows the sample size ratio

$$\frac{(z_{1-b+p_o} + z_{1-\beta})^2}{(z_{1-\alpha} + z_{1-\beta})^2} \quad \text{resp.} \quad \frac{(z_{1-\alpha+p_o/2} + z_{1-\beta})^2}{(z_{1-\alpha} + z_{1-\beta})^2}$$

of Edgington’s method (unweighted and weighted) versus the two-trials rule for $\alpha = 0.025$, a power of 80% and 90% and $p_o \in [0.00001, 0.025]$. At 80% power, the sample size calculated with unweighted Edgington’s method can be up to 10.6% smaller than with the two-trials rule. At 90% power, the sample size reduction looks very similar with a maximum of 9.2%. However, if $p_o > 0.01$, the required sample size with Edgington’s method is larger than with the two-trials rule. The weighted version always requires a larger sample size, but smaller than the unweighted version if $p_o$ is close to $\alpha = 0.025$.

Figure 5. Replication sample size ratio of Edgington’s method compared with the two-trials rule to reach 80% (top) and 90% (bottom) power. For conditional power, the sample size ratio is monotonically increasing, while it reaches a minimum at $p_o = 0.00009$ for predictive power in the unweighted version. The corresponding sample size reduction is one minus the sample size ratio. The sample size ratio of the weighted version is always monotonically increasing and converges to 1 for $p_o \to 0$.
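The conditional-power calculations above can be sketched as follows. The function names are hypothetical; `relative_sample_size` implements (5.2) for an arbitrary replication significance level, and `sample_size_ratio` implements the ratio displayed above for the weights $w_o = 1$, $w_r = 2$. Predictive power is not covered.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def relative_sample_size(p_o: float, level: float, power: float = 0.8) -> float:
    """Relative sample size c = n_r / n_o from equation (5.2),
    with the replication significance level passed as `level`."""
    z_o = _STD_NORMAL.inv_cdf(1.0 - p_o)
    z_sum = _STD_NORMAL.inv_cdf(1.0 - level) + _STD_NORMAL.inv_cdf(power)
    return (z_sum / z_o) ** 2


def sample_size_ratio(p_o: float, alpha: float = 0.025, power: float = 0.8,
                      weighted: bool = False) -> float:
    """Replication sample size with Edgington's method divided by the
    sample size with the two-trials rule (conditional power)."""
    if weighted:
        level = alpha - p_o / 2.0              # weights w_o = 1, w_r = 2
    else:
        level = math.sqrt(2.0) * alpha - p_o   # b - p_o
    return relative_sample_size(p_o, level, power) / relative_sample_size(
        p_o, alpha, power)


# Reduction (one minus the ratio) for a very convincing original study
print(round(100 * (1 - sample_size_ratio(0.00001, power=0.8)), 1))
print(round(100 * (1 - sample_size_ratio(0.00001, power=0.9)), 1))
# Ratio for an original p-value above alpha * (sqrt(2) - 1)
print(sample_size_ratio(0.02) > 1.0)
```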

A drawback of conditional power is that it does not take the uncertainty of the original result into account and hence can lead to underpowered replication studies. One way to take into account the uncertainty of the original result is to use ‘predictive power’ instead [40]. The relative sample size based on predictive power is generally larger than based on conditional power. The sample size reduction of Edgington’s method compared with the two-trials rule can be even more pronounced and reaches a value of 11.2% (10.3%) at $p_o = 0.00009$ ($p_o = 0.0002$) for 80% (90%) predictive rather than conditional power (see figure 5).

6. Extensions to more than one replication study

It has been argued that a single replication study will often not be sufficient and that more than one replication study is needed to provide an unambiguous evaluation of replicability [41]. Edgington’s method can also be used if more than one replication study is conducted. A simple approach would be to combine the replication p-values into an overall replication p-value and then use Edgington’s method for one original and one replication p-value, as introduced in this article. However, Edgington’s method can also be applied directly to the individual p-values, as we now illustrate in the case of three studies (one original and two replications). Now the sum $E_3 = p_o + p_{r1} + p_{r2}$ of the three p-values needs to be smaller than the new budget $b_3 = 0.16$ to ensure overall Type-I error control at level $0.025^2$ [32]. An interesting aspect of this approach is that it can be used to save resources if the replication studies are conducted sequentially. Indeed, there will be no point in conducting the second replication study if the sum of p-values $E_2 = p_o + p_{r1}$ from the original and the first replication study is already larger than $b_3$. Otherwise, a second replication study at significance level $b_3 - E_2$ can be planned and we would flag replication success if $E_3 = E_2 + p_{r2} \le b_3$ holds. Such a sequential conduct of replication studies has been suggested by Hedges & Schauer [41, p. 567] because ‘a single initial replication may be one effort in a sequence of replications, and as researchers conduct additional subsequent replications, eventually a preponderance of evidence will support more sensitive analyses’.

A refined version of this approach has been proposed by Held [32, §4], allowing to stop for success already after the first replication study. The approach is based on so-called alpha-spending [42], distributing the overall Type-I error rate $\alpha^2$ to the analysis after the first and the second replication study. Alpha-spending is a method originally proposed for interim analyses in clinical trials. Specifically, Fisher’s method has been proposed for the evaluation of experiments with an adaptive interim analysis based on the p-values of two subsamples of the study [43]. Closer to our approach is the method by Chang [44] who derives stopping rules based on the sum of the p-values for each subsample of the trial.

The application of the alpha-spending approach to the replication setting is illustrated in figure 6, which shows the budget $b_2$ for $E_2$ and $b_3$ for $E_3$ depending on the proportion of $\alpha^2$ that is spent on the analysis after the first replication. For example, if we spend half of $\alpha^2$ on the first replication, we can stop for replication success if $E_2 = p_o + p_{r1} < b_2 = 0.025$ holds. If this is not the case but at least $E_2 < b_3 = 0.13$ holds, we would plan and conduct a second replication study at significance level $b_3 - E_2$. If eventually $E_3 = E_2 + p_{r2} \le b_3 = 0.13$ holds, we have achieved success after the second replication study. The combined procedure thus not only allows to stop for replication success or failure after the first replication study but offers a third possibility to conduct a second replication study if $b_2 = 0.025 < E_2 < b_3 = 0.13$. The approach could also be extended to more than two replication studies and weights could also be introduced.
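The sequential decision rule described above can be sketched as follows. The function names and return values are hypothetical; the budgets $b_2 = 0.025$ and $b_3 = 0.13$ are taken from the text for the case where half of $\alpha^2$ is spent after the first replication, and are not recomputed here. The two helper functions only cover the simple cases: the first-stage budget follows from the first line of (2.2), and the three-study budget without alpha-spending from the Irwin–Hall distribution function $x^3/6$ for $0 < x \le 1$.

```python
import math


def first_stage_budget(fraction: float, alpha: float = 0.025) -> float:
    """Budget b2 solving b2^2 / 2 = fraction * alpha^2 (first line of (2.2))."""
    return math.sqrt(2.0 * fraction) * alpha


def three_study_budget(alpha: float = 0.025) -> float:
    """Budget b3 without alpha-spending, solving b3^3 / 6 = alpha^2."""
    return (6.0 * alpha ** 2) ** (1.0 / 3.0)


def sequential_decision(p_o, p_r1, p_r2=None, b2=0.025, b3=0.13):
    """Return (decision, level); level is the significance level for a
    second replication when the decision is 'continue', otherwise None."""
    e2 = p_o + p_r1
    if e2 < b2:
        return "success after first replication", None
    if e2 >= b3:
        return "failure after first replication", None
    if p_r2 is None:
        return "continue", b3 - e2
    if e2 + p_r2 <= b3:
        return "success after second replication", None
    return "failure after second replication", None


print(first_stage_budget(0.5))            # half of alpha^2 spent: b2 = alpha
print(round(three_study_budget(), 3))     # rounded to 0.16 in the text
print(sequential_decision(0.03, 0.05))        # second replication needed
print(sequential_decision(0.03, 0.05, 0.04))  # E3 = 0.12 is within budget
```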

Figure 6. Budget $b_2$ for $E_2 = p_o + p_{r1}$ and $b_3$ for $E_3 = p_o + p_{r1} + p_{r2}$, depending on the proportion of $\alpha^2 = 0.025^2$ spent on the first replication study. The points denote the available budget if half of $\alpha^2$ is spent after the first replication.

7. Discussion

We propose to use the sum of the $p$-values, also known as Edgington’s method, instead of the two-trials rule in the assessment of replication success. An unweighted and a weighted version are considered. In cases where it can be safely assumed that the original study follows the same standards as the replication study [45], we recommend using the unweighted version. In cases where the original study may be subject to questionable research practices or publication bias, we recommend giving more weight to the replication study. The exact choice of the weight depends on how much we distrust the original study result. Both the unweighted and the weighted methods exactly control the overall Type-I error rate at level $\alpha^2$ and have an acceptable bound on the conditional Type-I error rate, namely $b = 0.035$ and $\alpha = 0.025$, respectively. These numbers are for the conventional (but arbitrary) $\alpha = 0.025$ and a weight ratio of 2, and in principle, other values could be used.

The success bound for the replication $p$-value $p_r$ with Edgington’s method is not fixed at $\alpha = 0.025$ but varies between 0 and $b = 0.035$ (unweighted) or between 0 and $\alpha = 0.025$ (weighted), depending on the original $p$-value $p_o$. Replication success is possible for original studies that missed traditional statistical significance, as long as $p_o \le 0.035$ (unweighted) or $p_o \le 2\alpha = 0.05$ (weighted). While these bounds are less stringent than with the two-trials rule, they also differ from those of the more elaborate sceptical $p$-value [46]. The sceptical $p$-value has been developed specifically for replication studies and depends on the two $p$-values, $p_o$ and $p_r$, but also on the variance ratio, so it does not treat original and replication studies as exchangeable. The controlled sceptical $p$-value [47] ensures exact overall Type-I error rate control, just as all methods discussed in this article. It also allows for replication success if the original study is non-significant and can be used for sample size calculations. However, the method is more complicated and therefore more difficult to communicate. Edgington’s method can be seen as a simple compromise between the two-trials rule and the controlled sceptical $p$-value, valuing the combined evidence from both studies while ensuring that both studies support the alternative hypothesis. Of course, researchers may still want to quantify other aspects of replicability, such as statistical consistency of original and replication effect estimates, for which other methods, such as the $Q$-test, could be used [41]. In future work, we plan to conduct a simulation study to compare the operating characteristics of Edgington’s method with the sceptical $p$-value and alternative methods in the presence of publication bias and other questionable research practices [30,48].
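The bounds quoted above can be traced back to the null distribution of the (weighted) sum. The following worked sketch assumes independent uniform $p$-values under the intersection null hypothesis and writes the weighted sum as $E_w = w_o p_o + w_r p_r$ with weight ratio $\tilde{w} = w_r/w_o$; it is valid for $b \le 1$ and $b_w \le w_o$ and uses the first branch of the distribution function given in appendix A.

```latex
\begin{align*}
\Pr(p_o + p_r \le b) = \frac{b^2}{2} = \alpha^2
  &\;\Longrightarrow\; b = \sqrt{2}\,\alpha \approx 0.035, \\
\Pr(w_o p_o + w_r p_r \le b_w) = \frac{b_w^2}{2 w_o w_r} = \alpha^2
  &\;\Longrightarrow\; b_w = \alpha \sqrt{2 w_o w_r}, \\
p_r \le \frac{b_w - w_o p_o}{w_r}
  = \alpha \sqrt{2/\tilde{w}} - \frac{p_o}{\tilde{w}}
  &\;\overset{\tilde{w} = 2}{=}\; \alpha - \frac{p_o}{2}, \\
p_o \le \frac{b_w}{w_o} = \alpha \sqrt{2 \tilde{w}}
  &\;\overset{\tilde{w} = 2}{=}\; 2\alpha = 0.05 .
\end{align*}
```

For the unweighted version the replication level is $b - p_o$, which is at most $b \approx 0.035$. For weight ratio 2 it is $\alpha - p_o/2$, which is at most $\alpha = 0.025$.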

One advantage of Edgington’s method is that it can be easily applied to non-normal or non-standard settings, for example, based on the $t$-test, a comparison of proportions or the log-rank test for survival data. For example, for a sample size calculation based on the $t$-test, the R function power.t.test() can be used to calculate the required replication sample size $n_r$. The argument delta needs to be set to the original effect estimate $\hat{\theta}_o$ (perhaps incorporating some additional shrinkage) and the argument sig.level to $\sqrt{2}\,\alpha - p_o$ (Edgington) or $\alpha - p_o/2$ (weighted Edgington) rather than $\alpha$ (two-trials rule).
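To illustrate how the replication significance level feeds into a sample size calculation, the following Python sketch uses a normal approximation for a one-sided two-sample comparison. It is not a substitute for power.t.test(), which works with the $t$-distribution and will typically return slightly larger sample sizes; since the $p$-values in this article are one-sided, a one-sided test would presumably also have to be requested there. The function names, the standardized effect of 0.5 and the target power of 80% are assumptions made for this example only.

```python
import math
from statistics import NormalDist


def replication_level(p_o, alpha=0.025, method="edgington"):
    """Significance level for the replication study given the original p-value."""
    if method == "two-trials":
        return alpha
    if method == "edgington":
        return math.sqrt(2) * alpha - p_o
    if method == "weighted":  # weighted Edgington with weight ratio 2
        return alpha - p_o / 2
    raise ValueError(f"unknown method: {method}")


def n_per_group(delta, sd, sig_level, power=0.8):
    """Per-group sample size, one-sided two-sample test, normal approximation."""
    if sig_level <= 0:
        raise ValueError("replication success is no longer possible")
    z = NormalDist()
    quantiles = z.inv_cdf(1 - sig_level) + z.inv_cdf(power)
    return math.ceil(2 * (quantiles * sd / delta) ** 2)


if __name__ == "__main__":
    p_o = 0.001  # hypothetical original p-value
    for method in ("two-trials", "edgington", "weighted"):
        level = replication_level(p_o, method=method)
        print(method, round(level, 4), n_per_group(delta=0.5, sd=1.0, sig_level=level))
```

In this sketch a convincing original study (small $p_o$) yields a level above $\alpha$ for the unweighted version and hence a smaller replication sample size than the two-trials rule, whereas the weighted version always requires a level below $\alpha$ and hence a larger sample size.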

Appendix A. Weighted sum of p-values

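The weighted sum of $p$-values (2.3) with weights $w_o \le w_r$ can be written as

$$E_w = q_o + q_r,$$

where $q_o \sim \mathrm{U}(0, w_o)$ and $q_r \sim \mathrm{U}(0, w_r)$ are independent uniform under the intersection null hypothesis. The density function of $E_w$ can be computed as the convolution of the densities of $q_o$ and $q_r$ and follows a trapezoidal distribution [51]:

$$f(e_w) = \begin{cases} \dfrac{e_w}{w_o w_r} & \text{if } 0 < e_w \le w_o, \\[1ex] \dfrac{1}{w_r} & \text{if } w_o < e_w \le w_r, \\[1ex] \dfrac{w_o + w_r - e_w}{w_o w_r} & \text{if } w_r < e_w \le w_o + w_r. \end{cases}$$

The cumulative distribution function of $E_w$ is therefore

$$F(e_w) = \begin{cases} \dfrac{e_w^2}{2 w_o w_r} & \text{if } 0 < e_w \le w_o, \\[1ex] \dfrac{1}{w_r}\left(e_w - \dfrac{w_o}{2}\right) & \text{if } w_o < e_w \le w_r, \\[1ex] 1 + \dfrac{1}{w_o w_r}\left(e_w (w_o + w_r) - \dfrac{(w_o + w_r)^2}{2} - \dfrac{e_w^2}{2}\right) & \text{if } w_r < e_w \le w_o + w_r. \end{cases}$$

A valid combined $p$-value is obtained by plugging $E_w$ into the cumulative distribution function: $p_{E_w} = F(E_w)$.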
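The distribution function above translates directly into code. The following Python sketch is a minimal illustration and does not reproduce the interface of the pEdgington function in the ReplicationSuccess package; the function name `p_edgington` is hypothetical, and it is assumed that $q_o = w_o p_o$ and $q_r = w_r p_r$ with $0 < w_o \le w_r$.

```python
def p_edgington(p_o, p_r, w_o=1.0, w_r=1.0):
    """Combined p-value from the weighted sum E_w = w_o*p_o + w_r*p_r.

    Evaluates the trapezoidal distribution function of E_w, assuming
    independent uniform p-values under the null and weights 0 < w_o <= w_r.
    """
    if not 0 < w_o <= w_r:
        raise ValueError("weights must satisfy 0 < w_o <= w_r")
    e_w = w_o * p_o + w_r * p_r
    if e_w <= 0:
        return 0.0
    if e_w <= w_o:  # rising part of the trapezoidal density
        return e_w ** 2 / (2 * w_o * w_r)
    if e_w <= w_r:  # flat part
        return (e_w - w_o / 2) / w_r
    if e_w <= w_o + w_r:  # falling part, written as 1 minus the upper tail
        return 1 - (w_o + w_r - e_w) ** 2 / (2 * w_o * w_r)
    return 1.0


if __name__ == "__main__":
    print(p_edgington(0.01, 0.02))            # unweighted
    print(p_edgington(0.01, 0.02, 1.0, 2.0))  # weight ratio 2
```

The third branch is written as one minus the upper tail, which is algebraically identical to the expression above. With weight ratio 2 and $\alpha = 0.025$, the pair $p_o = 0.01$, $p_r = 0.02$ lies exactly on the success boundary: $E_w = 0.05$ and the combined $p$-value equals $0.025^2$.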

Appendix B. Project power

First assume that both effects are the same, so $\theta_r = \theta_o$. As a result, $z_o \sim \mathrm{N}(\mu, 1)$ and $z_r \sim \mathrm{N}(\mu\sqrt{c}, 1)$. The project power of the two-trials rule can be found in Held et al. [52, §3.3]. The project power of all other methods is calculated as

$$\int_{-\infty}^{\infty} \Pr(\text{Replication success} \mid z_o)\, \phi(z_o - \mu)\, dz_o.$$

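Specifically, Edgington’s method has project power

$$\begin{aligned} &\int_{\Phi^{-1}(1-b)}^{\infty} \Pr(p_r \le b - p_o)\, \phi(z_o - \mu)\, dz_o \\ &\quad= \int_{\Phi^{-1}(1-b)}^{\infty} \Pr\left(z_r \ge \Phi^{-1}(1 - b + p_o)\right) \phi(z_o - \mu)\, dz_o \\ &\quad= \int_{\Phi^{-1}(1-b)}^{\infty} \Pr\left(z_r \ge \Phi^{-1}(2 - b - \Phi(z_o))\right) \phi(z_o - \mu)\, dz_o \\ &\quad= \int_{\Phi^{-1}(1-b)}^{\infty} \left[1 - \Phi\left\{\Phi^{-1}(2 - b - \Phi(z_o)) - \mu\sqrt{c}\right\}\right] \phi(z_o - \mu)\, dz_o. \end{aligned} \tag{B 1}$$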

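The project power of weighted Edgington’s method with weight ratio $\tilde{w} = w_r/w_o$ turns out to be

$$\int_{\Phi^{-1}(1 - b_w/w_o)}^{\infty} \left[1 - \Phi\left\{\Phi^{-1}\left(1 + 1/\tilde{w} - \sqrt{2/\tilde{w}}\,\alpha - \Phi(z_o)/\tilde{w}\right) - \mu\sqrt{c}\right\}\right] \phi(z_o - \mu)\, dz_o. \tag{B 2}$$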

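The project power of Fisher’s method is

$$\int_{0}^{\infty} f(z_o)\, \phi(z_o - \mu)\, dz_o \quad \text{with} \quad f(z_o) = \begin{cases} 1 - \Phi\left\{\Phi^{-1}\left(1 - \dfrac{c_F}{1 - \Phi(z_o)}\right) - \mu\sqrt{c}\right\}, & \text{if } 1 - \Phi(z_o) \ge c_F, \\[1ex] 1, & \text{otherwise,} \end{cases} \tag{B 3}$$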

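and the project power of the meta-analysis criterion is

$$\int_{0}^{\infty} \left[1 - \Phi\left\{\frac{\Phi^{-1}(1 - \alpha^2)\sqrt{c + 1} - z_o}{\sqrt{c}} - \mu\sqrt{c}\right\}\right] \phi(z_o - \mu)\, dz_o. \tag{B 4}$$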

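The project power of Edgington’s method converges to

$$1 - \Phi\left(\Phi^{-1}(1 - b) - \Phi^{-1}(1 - \alpha) - \Phi^{-1}(1 - \beta)\right)$$

for $c \to \infty$, while the corresponding limit is

$$1 - \Phi\left(\Phi^{-1}(1 - b_w/w_o) - \Phi^{-1}(1 - \alpha) - \Phi^{-1}(1 - \beta)\right)$$

for weighted Edgington’s method and 1 for the other two methods.

If $\theta_r = \theta_o/2$, the term $\mu\sqrt{c}$ in (B 1), (B 2), (B 3) and (B 4) needs to be divided by 2.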
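The integral (B 1) has no closed form but is straightforward to evaluate numerically. The following Python sketch is an illustration only and is not the powerEdgington function of the ReplicationSuccess package. It uses the trapezoidal rule, and it assumes $\mu = \Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)$, that is, an original study designed with power $1 - \beta$ at level $\alpha$, which is the value implied by the limits stated above. The function name `project_power_edgington`, the integration range and the number of steps are choices made for this example.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def project_power_edgington(c, alpha=0.025, beta=0.2, steps=4000):
    """Project power (B 1) of unweighted Edgington's method, trapezoidal rule.

    c is the relative sample size (replication to original).
    Assumes mu = z_{1-alpha} + z_{1-beta} for the original study.
    """
    b = math.sqrt(2) * alpha
    mu = Z.inv_cdf(1 - alpha) + Z.inv_cdf(1 - beta)
    lower = Z.inv_cdf(1 - b)       # below this, p_o > b and success is impossible
    upper = max(lower, mu) + 8.0   # the normal density is negligible beyond this point
    h = (upper - lower) / steps

    def integrand(z_o):
        level = b - (1 - Z.cdf(z_o))  # replication level b - p_o
        if level <= 0 or 1 - level >= 1:
            return 0.0
        cond_power = 1 - Z.cdf(Z.inv_cdf(1 - level) - mu * math.sqrt(c))
        return cond_power * Z.pdf(z_o - mu)

    total = 0.5 * (integrand(lower) + integrand(upper))
    total += sum(integrand(lower + i * h) for i in range(1, steps))
    return total * h


if __name__ == "__main__":
    for c in (1, 2, 4, 1e6):
        print(c, round(project_power_edgington(c), 3))
```

For very large $c$ the numerical value approaches the limit $1 - \Phi(\Phi^{-1}(1 - b) - \Phi^{-1}(1 - \alpha) - \Phi^{-1}(1 - \beta))$ given above, which provides a simple check of the implementation.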

Contributor Information

Leonhard Held, Email: leonhard.held@uzh.ch.

Samuel Pawel, Email: samuel.pawel@uzh.ch.

Charlotte Micheloud, Email: charlotte.micheloud@uzh.ch.

Ethics

This work did not require ethical approval from a human subject or animal welfare committee.

Data accessibility

The R package ReplicationSuccess, available on CRAN at https://CRAN.R-project.org/package=ReplicationSuccess, has been used for the sample size calculations. The data of the RPP, EERP, SSRP and EPRP are available through the command data("RProjects"). All p-values have been recalculated based on Fisher’s z-transformation as described in [49, electronic supplementary material]; see also help("RProjects"). The code to reproduce the calculations in this paper is available at [50]. Functions to compute Edgington’s combined p-value (pEdgington), the associated power (powerEdgington) and sample size calculations (sampleSizeEdgington) are available in the development version of the ReplicationSuccess package, which can be installed with remotes::install_github(repo = "crsuzh/ReplicationSuccess", ref = "edgington") (requires the remotes package available on CRAN).

Declaration of AI use

We have not used AI-assisted technologies in creating this article.

Authors’ contributions

L.H.: conceptualization, formal analysis, methodology, project administration, software, visualization, writing—original draft, writing—review and editing; S.P.: software, visualization, writing—review and editing; C.M.: formal analysis, methodology, software, visualization, writing—original draft, writing—review and editing.

All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Conflict of interest declaration

We declare we have no competing interests.

Funding

No funding has been received for this article.

References

  • 1. National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and replicability in science. Washington, DC: National Academies Press.
  • 2. Patil P, Peng RD, Leek JT. 2016. What should researchers expect when they replicate studies? A statistical view of replicability in psychological science. Perspect. Psychol. Sci. 11, 539–544. (10.1177/1745691616646366)
  • 3. Hedges LV, Schauer JM. 2019. Statistical analyses for studying replication: meta-analytic perspectives. Psychol. Methods 24, 557–570. (10.1037/met0000189)
  • 4. Muradchanian J, Hoekstra R, Kiers H, van Ravenzwaaij D. 2023. Evaluating meta-analysis as a replication success measure. Technical report. MetaArXiv. (10.31222/osf.io/ax825)
  • 5. Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349, aac4716. (10.1126/science.aac4716)
  • 6. Camerer CF, et al. 2016. Evaluating replicability of laboratory experiments in economics. Science 351, 1433–1436. (10.1126/science.aaf0918)
  • 7. Camerer CF, et al. 2018. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat. Hum. Behav. 2, 637–644. (10.1038/s41562-018-0399-z)
  • 8. Cova F, et al. 2021. Estimating the reproducibility of experimental philosophy. Rev. Philos. Psychol. 12, 9–44. (10.1007/s13164-018-0400-9)
  • 9. Errington TM, Mathur M, Soderberg CK, Denis A, Perfito N, Iorns E, Nosek BA. 2021. Investigating the replicability of preclinical cancer biology. eLife 10, e71601. (10.7554/eLife.71601)
  • 10. Senn S. 2021. Statistical issues in drug development. Chichester, UK: Wiley. (10.1002/9781119238614)
  • 11. Greenland S. 2019. Valid P-values behave exactly as they should: some misleading criticisms of P-values and their resolution with S-values. Am. Stat. 73, 106–114. (10.1080/00031305.2018.1529625)
  • 12. Benjamin DJ, Berger JO. 2019. Three recommendations for improving the use of p-values. Am. Stat. 73, 186–191. (10.1080/00031305.2018.1543135)
  • 13. Ambrus A, Greiner B. 2012. Imperfect public monitoring with costly punishment: an experimental study. Am. Econ. Rev. 102, 3317–3332. (10.1257/aer.102.7.3317)
  • 14. Kuziemko I, Buell RW, Reich T, Norton MI. 2014. ‘Last-place aversion’: evidence and redistributive implications. Q. J. Econ. 129, 105–149. (10.1093/qje/qjt035)
  • 15. Wasserstein RL, Lazar NA. 2016. The ASA’s statement on p-values: context, process, and purpose. Am. Stat. 70, 129–133. (10.1080/00031305.2016.1154108)
  • 16. Colquhoun D. 2017. The reproducibility of research and the misinterpretation of p-values. R. Soc. Open Sci. 4, 171085. (10.1098/rsos.171085)
  • 17. Wasserstein RL, Schirm AL, Lazar NA. 2019. Moving to a world beyond ‘p < 0.05’. Am. Stat. 73, 1–19. (10.1080/00031305.2019.1583913)
  • 18. Held L, Ott M. 2018. On p-values and Bayes factors. Annu. Rev. Stat. Appl. 5, 393–419. (10.1146/annurev-statistics-031017-100307)
  • 19. Greenland S. 2023. Divergence versus decision P-values: a distinction worth making in theory and keeping in practice: or, how divergence P-values measure evidence even when decision P-values do not. Scand. J. Stat. 50, 54–88. (10.1111/sjos.12625)
  • 20. Benjamin DJ, et al. 2018. Redefine statistical significance. Nat. Hum. Behav. 2, 6–10. (10.1038/s41562-017-0189-z)
  • 21. Held L. 2019. The assessment of intrinsic credibility and a new argument for p < 0.005. R. Soc. Open Sci. 6. (10.1098/rsos.181534)
  • 22. Hedges LV, Olkin I. 1985. Statistical methods for meta-analysis. Amsterdam, The Netherlands: Elsevier.
  • 23. Cousins RD. 2007. Annotated bibliography of some papers on combining significances or p-values. See https://arxiv.org/abs/0705.2209.
  • 24. Fisher LD. 1999. One large, well-designed, multicenter study as an alternative to the usual FDA paradigm. Drug Inf. J. 33, 265–271. (10.1177/009286159903300130)
  • 25. Shun Z, Chi E, Durrleman S, Fisher L. 2005. Statistical consideration of the strategy for demonstrating clinical evidence of effectiveness—one larger vs two smaller pivotal studies. Stat. Med. 24, 1619–1637. (10.1002/sim.2015)
  • 26. Rosenkranz GK. 2023. A generalization of the two trials paradigm. Ther. Innov. Regul. Sci. 57, 316–320. (10.1007/s43441-022-00471-4)
  • 27. Fisher RA. 1935. Statistical methods for research workers, 4th edn. Edinburgh, UK: Oliver & Boyd.
  • 28. Stouffer SA, Suchman EA, Devinney LC, Star SA, Williams RMJ. 1949. The American soldier: adjustment during army life (Studies in Social Psychology in World War II). Princeton, NJ: Princeton University Press.
  • 29. Maca J, Gallo P, Branson M, Maurer W. 2002. Reconsidering some aspects of the two-trials paradigm. J. Biopharm. Stat. 12, 107–119. (10.1081/bip-120006450)
  • 30. Freuli F, Held L, Heyard R. 2023. Replication success under questionable research practices—a simulation study. Stat. Sci. 38. (10.1214/23-STS904)
  • 31. Edgington ES. 1972. An additive method for combining probability values from independent experiments. J. Psychol. 80, 351–363. (10.1080/00223980.1972.9924813)
  • 32. Held L. 2024. Beyond the two‐trials rule. Stat. Med. (10.1002/sim.10055)
  • 33. Irwin JO. 1927. On the frequency distribution of the means of samples from a population having any law of frequency with finite moments, with special reference to Pearson’s type II. Biometrika 19, 225–239. (10.1093/biomet/19.3-4.225)
  • 34. Hall P. 1927. The distribution of means for samples of size N drawn from a population in which the variate takes values between 0 and 1, all such values being equally probable. Biometrika 19, 240–245. (10.2307/2331961)
  • 35. Turner RM, Bird SM, Higgins JPT. 2013. The impact of study size on meta-analyses: examination of underpowered studies in Cochrane reviews. PLoS One 8, e59202. (10.1371/journal.pone.0059202)
  • 36. Dumas-Mallet E, Button KS, Boraud T, Gonon F, Munafò MR. 2017. Low statistical power in biomedical science: a review of three human research domains. R. Soc. Open Sci. 4, 160254. (10.1098/rsos.160254)
  • 37. Schmidt JR, Besner D. 2008. The Stroop effect: why proportion congruent has nothing to do with congruency and everything to do with contingency. J. Exp. Psychol. Learn. Mem. Cogn. 34, 514–523. (10.1037/0278-7393.34.3.514)
  • 38. Ioannidis JPA. 2008. Why most discovered true associations are inflated. Epidemiology 19, 640–648. (10.1097/EDE.0b013e31818131e7)
  • 39. Anderson SF, Kelley K. 2022. Sample size planning for replication studies: the devil is in the design. Psychol. Methods (10.1037/met0000520)
  • 40. Micheloud C, Held L. 2022. Power calculations for replication studies. Stat. Sci. 37, 369–379. (10.1214/21-STS828)
  • 41. Hedges LV, Schauer JM. 2019. More than one replication study is needed for unambiguous tests of replication. J. Educ. Behav. Stat. 44, 543–570. (10.3102/1076998619852953)
  • 42. DeMets DL, Lan KK. 1994. Interim analysis: the alpha spending function approach. Stat. Med. 13, 1341–1352. (10.1002/sim.4780131308)
  • 43. Bauer P, Köhne K. 1994. Evaluation of experiments with adaptive interim analyses. Biometrics 50, 1029–1041. (10.2307/2533441)
  • 44. Chang M. 2007. Adaptive design methods based on sum of the p-values. Stat. Med. 26, 2772–2784. (10.1002/sim.2755)
  • 45. Protzko J, et al. 2023. High replicability of newly discovered social-behavioural findings is achievable. Nat. Hum. Behav. 8, 311–319. (10.1038/s41562-023-01749-9) [Retracted]
  • 46. Held L. 2020. A new standard for the analysis and design of replication studies (with discussion). J. R. Stat. Soc. Ser. A 183, 431–448. (10.1111/rssa.12493)
  • 47. Micheloud C, Balabdaoui F, Held L. 2023. Assessing replicability with the sceptical p-value: Type-I error control and sample size planning. Stat. Neerl. 77, 573–591. (10.1111/stan.12312)
  • 48. Muradchanian J, Hoekstra R, Kiers H, van Ravenzwaaij D. 2021. How best to quantify replication success? A simulation study on the comparison of replication success metrics. R. Soc. Open Sci. 8, 201697. (10.1098/rsos.201697)
  • 49. Pawel S, Held L. 2020. Probabilistic forecasting of replication studies. PLoS One 15, e0231416. (10.1371/journal.pone.0231416)
  • 50. Held L. Code for ‘The assessment of replicability using the sum of p-values’. See https://osf.io/uds2a/ (accessed 29 May 2024).
  • 51. van Dorp JR, Kotz S. 2003. Generalized trapezoidal distributions. Metrika 58, 85–97. (10.1007/s001840200230)
  • 52. Held L, Micheloud C, Pawel S. 2022. The assessment of replication success based on relative effect size. Ann. Appl. Stat. 16, 706–720. (10.1214/21-AOAS1502)

Articles from Royal Society Open Science are provided here courtesy of The Royal Society
