2023 Sep 21;33(1):127–154. doi: 10.1007/s11749-023-00888-5

Power priors for replication studies

Samuel Pawel 1, Frederik Aust 2, Leonhard Held 1, Eric-Jan Wagenmakers 2

Abstract

The ongoing replication crisis in science has increased interest in the methodology of replication studies. We propose a novel Bayesian analysis approach using power priors: the likelihood of the original study’s data is raised to the power of $\alpha$ and then used as the prior distribution in the analysis of the replication data. The posterior distribution of and Bayes factor hypothesis tests for the power parameter $\alpha$ quantify the degree of compatibility between the original and replication study. Inferences for other parameters, such as effect sizes, dynamically borrow information from the original study; the degree of borrowing depends on the conflict between the two studies. The practical value of the approach is illustrated on data from three replication studies, and the connection to hierarchical modeling approaches is explored. We generalize the known connection between normal power priors and normal hierarchical models for fixed parameters, and show that normal power prior inferences with a beta prior on the power parameter $\alpha$ align with normal hierarchical model inferences using a generalized beta prior on the relative heterogeneity $I^2$. The connection illustrates that power prior modeling is unnatural from the perspective of hierarchical modeling, since it corresponds to specifying priors on a relative rather than an absolute heterogeneity scale.

Keywords: Bayes factor, Bayesian hypothesis testing, Bayesian parameter estimation, Hierarchical models, Historical data

Introduction

Power priors form a class of informative prior distributions that allow data analysts to incorporate historical data into a Bayesian analysis (Ibrahim et al. 2015). The most basic version of the power prior is obtained by updating an initial prior distribution with the likelihood of the historical data raised to the power of $\alpha$, where $\alpha$ is usually restricted to the range from zero (i.e., complete discounting) to one (i.e., complete pooling). As such, the power parameter $\alpha$ specifies the degree to which the historical data are discounted, thereby providing a quantitative compromise between the extreme positions of completely ignoring and fully trusting the historical data.

One domain where historical data are by definition available is the analysis of replication studies. A pertinent question in this domain is the extent to which a replication study has successfully replicated the result of an original study (National Academies of Sciences, Engineering, and Medicine 2019). Many methods have been proposed to address this question (Bayarri and Mayoral 2002a; Verhagen and Wagenmakers 2014; Johnson et al. 2016; Etz and Vandekerckhove 2016; van Aert and van Assen 2017; Ly et al. 2018; Hedges and Schauer 2019; Mathur and VanderWeele 2020; Held 2020; Pawel and Held 2020, 2022; Held et al. 2022, among others). Here we propose a new and conceptually straightforward approach, namely to construct a power prior from the data of the original study and to use that prior to draw inferences from the data of the replication study. The power prior approach can accommodate two common notions of replication success: First, the notion that the replication study should provide evidence for a genuine effect. This can be quantified by estimating and testing an effect size $\theta$, typically by assessing whether there is evidence that $\theta$ differs from zero. Second, the notion that the data from the original and replication studies should be compatible. This can be quantified by estimating and testing the power parameter $\alpha$: values close to $\alpha = 1$ indicate compatibility, as the two data sets are completely pooled, whereas values close to $\alpha = 0$ indicate incompatibility, as the original data are completely discounted.

Below we first show how power priors can be constructed from the data of an original study under a meta-analytic framework (Sect. 2). We then show how the power prior can be used for parameter estimation (Sect. 2.1) and Bayes factor hypothesis testing (Sect. 2.2). Throughout, the methodology is illustrated by application to data from three replication studies which were part of a large-scale replication project (Protzko et al. 2020). In Sect. 3, we explore the connection to the alternative hierarchical modeling approach for incorporating the original data (Bayarri and Mayoral 2002a, b; Pawel and Held 2020), which has previously been used for evidence synthesis and compatibility assessment in replication settings. In doing so, we identify explicit conditions under which posterior distributions and tests can be reverse-engineered from one framework to the other. Essentially, power prior inferences using the commonly assigned beta prior on the power parameter $\alpha$ align with normal hierarchical model inferences if either a generalized F prior, which scales with the variance of the original data, is assigned to the between-study heterogeneity variance $\tau^2$, or a generalized beta prior is assigned to the relative heterogeneity $I^2$. This perspective also explains the observed difficulty of drawing conclusive inferences about the power parameter $\alpha$: it is difficult to make inferences about a variance from two observations alone, and the commonly assigned beta prior on $\alpha$ is moreover entangled with the variance of the original data.

Power prior modeling of replication studies

Let $\theta$ denote an unknown effect size and $\hat{\theta}_i$ an estimate thereof obtained from study $i \in \{o, r\}$, where the subscripts indicate “original” and “replication”, respectively. Assume that the likelihood of the effect estimates can be approximated by a normal distribution

$$\hat{\theta}_i \mid \theta \sim \mathrm{N}(\theta, \sigma_i^2)$$

with $\sigma_i$ the (assumed to be known) standard error of the effect estimate $\hat{\theta}_i$. The effect size may be adjusted for confounding variables, and, depending on the outcome variable, a transformation may be required for the normal approximation to be accurate (e.g., a log-transformation for an odds ratio effect size). This is the same framework that is typically used in meta-analysis, and it is applicable to many types of data and effect sizes (Spiegelhalter et al. 2004, chapter 2.4). There are, of course, situations where the approximation is inadequate and modified distributional assumptions are required (e.g., for data from studies with small sample sizes and/or extreme effect sizes).
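For instance, for an odds ratio one would work with the log odds ratio. A minimal base R sketch (the 2×2 counts are made up for illustration) computes such an approximately normal effect estimate and its standard error:

```r
## Log odds ratio and Woolf standard error from a hypothetical 2x2 table
n11 <- 40; n12 <- 60  # events / non-events in group 1 (made-up counts)
n21 <- 25; n22 <- 75  # events / non-events in group 2 (made-up counts)
theta_hat <- log((n11 / n12) / (n21 / n22))       # log odds ratio estimate
sigma <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)      # standard error (Woolf)
c(estimate = theta_hat, se = sigma)
```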

The goal is now to construct a power prior for $\theta$ based on the data from the original study. Updating an (improper) flat initial prior $f(\theta) \propto 1$ by the likelihood of the original data raised to a (fixed) power parameter $\alpha$ leads to the normalized power prior

$$\theta \mid \hat{\theta}_o, \alpha \sim \mathrm{N}(\hat{\theta}_o, \sigma_o^2/\alpha) \tag{1}$$

as first proposed by Duan et al. (2005); see also Neuenschwander et al. (2009). There are different ways to specify $\alpha$. The simplest approach fixes $\alpha$ at an a priori reasonable value, possibly informed by background knowledge about the similarity of the two studies. Another option is to use the empirical Bayes estimate (Gravestock and Held 2017), that is, the value of $\alpha$ that maximizes the likelihood of the replication data marginalized over the power prior. Finally, it is also possible to specify a prior distribution for $\alpha$, the most common choice being a beta distribution $\alpha \mid x, y \sim \text{Be}(x, y)$ for a normalized power prior conditional on $\alpha$ as in (1). This approach leads to a joint prior for the effect size $\theta$ and power parameter $\alpha$ with density

$$f(\theta, \alpha \mid \hat{\theta}_o, x, y) = \mathrm{N}(\theta \mid \hat{\theta}_o, \sigma_o^2/\alpha)\, \text{Be}(\alpha \mid x, y) \tag{2}$$

where $\mathrm{N}(\cdot \mid m, v)$ is the normal density function with mean $m$ and variance $v$, and $\text{Be}(\cdot \mid x, y)$ is the beta density with parameters $x$ and $y$. The uniform distribution ($x = 1$, $y = 1$) is often recommended as the default choice (Ibrahim et al. 2015). We note that $\alpha$ does not have to be restricted to the unit interval but could also be treated as a relative precision parameter (Held and Sauter 2017). We will, however, not consider such an approach since power parameters $\alpha > 1$ lead to priors with more information than was actually supplied by the original study.

Parameter estimation

Updating the prior (2) with the likelihood of the replication data leads to the posterior distribution

$$f(\alpha, \theta \mid \hat{\theta}_r, \hat{\theta}_o, x, y) = \frac{\mathrm{N}(\hat{\theta}_r \mid \theta, \sigma_r^2)\, \mathrm{N}(\theta \mid \hat{\theta}_o, \sigma_o^2/\alpha)\, \text{Be}(\alpha \mid x, y)}{f(\hat{\theta}_r \mid \hat{\theta}_o, x, y)}. \tag{3}$$

The normalizing constant

$$f(\hat{\theta}_r \mid \hat{\theta}_o, x, y) = \int_0^1 \mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_r^2 + \sigma_o^2/\alpha)\, \text{Be}(\alpha \mid x, y)\, d\alpha \tag{4}$$

is generally not available in closed form but requires numerical integration with respect to $\alpha$. If inference concerns only one parameter, a marginal posterior distribution for either $\alpha$ or $\theta$ can be obtained by integrating the corresponding nuisance parameter out of (3). For the power parameter $\alpha$, this leads to

$$f(\alpha \mid \hat{\theta}_r, \hat{\theta}_o, x, y) = \frac{\mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_r^2 + \sigma_o^2/\alpha)\, \text{Be}(\alpha \mid x, y)}{f(\hat{\theta}_r \mid \hat{\theta}_o, x, y)} \tag{5}$$

whereas for the effect size $\theta$, this gives

$$f(\theta \mid \hat{\theta}_r, \hat{\theta}_o, x, y) = \frac{\mathrm{N}(\hat{\theta}_r \mid \theta, \sigma_r^2)\, B(x + 1/2, y)}{f(\hat{\theta}_r \mid \hat{\theta}_o, x, y)\, \sqrt{2\pi\sigma_o^2}\, B(x, y)} \times M\!\left\{x + 1/2,\; x + y + 1/2,\; -\frac{(\hat{\theta}_o - \theta)^2}{2\sigma_o^2}\right\}$$

with $B(z, w) = \int_0^1 t^{z-1}(1-t)^{w-1}\, dt = \{\Gamma(z)\Gamma(w)\}/\Gamma(z+w)$ the beta function and $M(a, b, z) = \{\int_0^1 \exp(zt)\, t^{a-1}(1-t)^{b-a-1}\, dt\}/B(b-a, a)$ the confluent hypergeometric function (Abramowitz and Stegun 1965, chapters 6 and 13).
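In practice, these quantities are easily obtained by numerical integration. The following base R sketch (our own implementation, with illustrative inputs matching the “Labels” data used below; the ppRep package described in the software section provides analogous functionality) computes the normalizing constant (4) and the marginal posteriors of $\alpha$ and $\theta$:

```r
theta_o <- 0.21; sigma_o <- 0.05  # original study
theta_r <- 0.09; sigma_r <- 0.05  # replication study
x <- 1; y <- 1                    # uniform Be(1, 1) prior on alpha

## normalizing constant (4) by numerical integration over alpha
margLik <- integrate(function(a) {
  dnorm(theta_r, mean = theta_o, sd = sqrt(sigma_r^2 + sigma_o^2 / a)) *
    dbeta(a, x, y)
}, lower = 0, upper = 1)$value

## marginal posterior density (5) of the power parameter alpha
postAlpha <- function(a) {
  dnorm(theta_r, mean = theta_o, sd = sqrt(sigma_r^2 + sigma_o^2 / a)) *
    dbeta(a, x, y) / margLik
}

## marginal posterior density of theta, integrating alpha out of (3)
## (equivalent to the confluent hypergeometric expression above)
postTheta <- function(theta) {
  sapply(theta, function(t) {
    integrate(function(a) {
      dnorm(theta_r, mean = t, sd = sigma_r) *
        dnorm(t, mean = theta_o, sd = sigma_o / sqrt(a)) *
        dbeta(a, x, y)
    }, lower = 0, upper = 1)$value / margLik
  })
}
```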

Example “Labels”

We now illustrate the methodology on data from the large-scale replication project by Protzko et al. (2020). The project featured an experiment called “Labels” for which the original study reported the following conclusion: “When a researcher uses a label to describe people who hold a certain opinion, he or she is interpreted as disagreeing with those attributes when a negative label is used and agreeing with those attributes when a positive label is used” (Protzko et al. 2020, p. 17). This conclusion was based on a standardized mean difference effect estimate $\hat{\theta}_o = 0.21$ with standard error $\sigma_o = 0.05$ obtained from 1577 participants. Subsequently, four replication studies were conducted, three of them by laboratories different from the original one, and all employing large sample sizes. Since the same original study was replicated by three independent laboratories, this is an instance of a “multisite” replication design (Mathur and VanderWeele 2020). While it would in principle be possible to analyze all of these studies jointly, we show separate analyses for each pair of original and replication study, as this reflects the typical situation of only one replication study being conducted per original study. Section 4 discusses possible extensions of the power prior approach for joint analyses in multisite designs.

Figure 1 shows joint and marginal posterior distributions of the effect size $\theta$ and the power parameter $\alpha$ based on the results of the three external replication studies and a power prior for $\theta$ constructed from the original effect estimate $\hat{\theta}_o = 0.21$ (with standard error $\sigma_o = 0.05$) and an initial flat prior $f(\theta) \propto 1$. The power parameter $\alpha$ is assigned a uniform $\text{Be}(x = 1, y = 1)$ prior distribution. The first replication found an effect estimate smaller than the original one ($\hat{\theta}_{r1} = 0.09$ with $\sigma_{r1} = 0.05$), whereas the other two replications found effect estimates that were either identical ($\hat{\theta}_{r2} = 0.21$ with $\sigma_{r2} = 0.06$) or larger ($\hat{\theta}_{r3} = 0.44$ with $\sigma_{r3} = 0.04$) than that reported in the original study. This is reflected in the marginal posterior distributions of the power parameter $\alpha$, shown in the bottom right panel of Fig. 1. The marginal distribution of the first replication (yellow) is slightly peaked around $\alpha = 0.2$, suggesting some incompatibility with the original study. In contrast, the second replication shows a marginal distribution (green) which is monotonically increasing, so that the value $\alpha = 1$ receives the highest support, indicating compatibility of the two studies. Finally, the marginal distribution of the third replication (blue) is sharply peaked around $\alpha = 0.05$ with a 95% credible interval from 0 to 0.62, indicating strong conflict between this replication and the original study. The sharply peaked posterior is in stark contrast to the relatively diffuse posteriors of the first and second replications, which hardly changed from the uniform prior. This is consistent with the asymptotic behavior of normalized power priors identified in Pawel et al. (2023a): in case of data incompatibility, normalized power priors with a beta prior assigned to $\alpha$ permit arbitrarily peaked posteriors at small values of $\alpha$, whereas for perfectly agreeing original and replication studies ($\hat{\theta}_o = \hat{\theta}_r$) there is a limiting posterior for $\alpha$ that gives only slightly more probability to values near one. The limiting posterior is in this case a $\text{Be}(3/2, 1)$ distribution, whose density is indicated by the dotted line. One can see that the (green) posterior from the second replication is relatively close to the limiting posterior, despite its finite sample size. Similarly, the corresponding (green) 95% credible interval from 0.12 to 1 shows that a wide range of very low to very high $\alpha$ values remains credible despite the excellent agreement between original and replication study.

Fig. 1 Joint (top) and marginal (bottom) posterior distributions of effect size $\theta$ and power parameter $\alpha$ based on data from the “Labels” experiment (Protzko et al. 2020). The dashed lines depict the posterior density for the effect size $\theta$ when the replication data are analyzed in isolation, without incorporation of the original data. The horizontal error bars represent the corresponding 95% highest posterior density credible intervals. The dotted line represents the limiting posterior density of the power parameter $\alpha$ for perfectly agreeing original and replication studies

The bottom left panel of Fig. 1 shows the marginal posterior distribution of the effect size $\theta$, alongside the posterior distribution of $\theta$ when the replication data are analyzed in isolation (dashed line), to illustrate the information gain from incorporating the original data via a power prior. The degree of compatibility with the replication study determines how much information is borrowed from the original study. For instance, the (green) marginal posterior density based on the most compatible replication ($\hat{\theta}_{r2} = 0.21$) is the most concentrated among the three replications, despite its standard error being the largest ($\sigma_{r2} = 0.06$). Consequently, the 95% credible interval for $\theta$ is substantially narrower than the credible interval from the analysis of the replication data in isolation (dashed green). In contrast, the (blue) marginal posterior of the most conflicting estimate ($\hat{\theta}_{r3} = 0.44$) borrows less information and consequently yields the least peaked posterior, despite its standard error being the smallest ($\sigma_{r3} = 0.04$). In this case, the conflict with the original study even inflates the variance of the posterior compared to the isolated replication posterior given by the dashed blue line. This is, for example, apparent from its 95% credible interval (0.31 to 0.50) being wider than the credible interval (0.35 to 0.52) from the analysis of the replication data in isolation.

Hypothesis testing

In addition to estimating $\theta$ and $\alpha$, we may also be interested in testing hypotheses about these parameters. Let $H_0$ and $H_1$ denote two competing hypotheses, each with an associated prior $f(\theta, \alpha \mid H_i)$ and a resulting marginal likelihood obtained by integrating the likelihood of the replication data with respect to the prior

$$f(\hat{\theta}_r \mid H_i) = \int\!\!\int \mathrm{N}(\hat{\theta}_r \mid \theta, \sigma_r^2)\, f(\theta, \alpha \mid H_i)\, d\theta\, d\alpha \tag{6}$$

for $i \in \{0, 1\}$. A principled Bayesian hypothesis testing approach is to compute the Bayes factor

$$\text{BF}_{01}(\hat{\theta}_r) = \frac{\Pr(H_0 \mid \hat{\theta}_r)}{\Pr(H_1 \mid \hat{\theta}_r)} \Big/ \frac{\Pr(H_0)}{\Pr(H_1)} = \frac{f(\hat{\theta}_r \mid H_0)}{f(\hat{\theta}_r \mid H_1)}$$

since it corresponds to the factor by which the data $\hat{\theta}_r$ update the prior odds of the hypotheses to the posterior odds (first equality), or because it represents the relative accuracy with which the hypotheses predict the data $\hat{\theta}_r$ (second equality) (Jeffreys 1939; Good 1958; Kass and Raftery 1995). A Bayes factor $\text{BF}_{01}(\hat{\theta}_r) > 1$ provides evidence for $H_0$, whereas $\text{BF}_{01}(\hat{\theta}_r) < 1$ provides evidence for $H_1$; the more the Bayes factor deviates from one, the stronger the evidence. In the following we examine Bayes factors related to various hypotheses about $\theta$ and $\alpha$.

Hypotheses about the effect size θ

Researchers may be interested in testing the null hypothesis that there is no effect ($H_0: \theta = 0$) against the alternative that there is an effect ($H_1: \theta \neq 0$). We note that while the point null hypothesis $H_0$ is often unrealistic, it is usually a good approximation to more realistic interval null hypotheses that assign a distribution tightly concentrated around zero (Berger and Delampady 1987; Ly and Wagenmakers 2022). Under $H_0$ there are no free parameters, but under the alternative $H_1$ the specification of a prior distribution for $\theta$ and $\alpha$ is required. A natural choice is to use the normalized power prior based on the original data along with a beta prior for the power parameter, as in (2). The associated Bayes factor is then given by

$$\text{BF}_{01}\{\hat{\theta}_r \mid H_1: \alpha \sim \text{Be}(x, y)\} = \frac{f(\hat{\theta}_r \mid H_0: \theta = 0)}{f\{\hat{\theta}_r \mid H_1: \theta \mid \alpha \sim \mathrm{N}(\hat{\theta}_o, \sigma_o^2/\alpha),\ \alpha \sim \text{Be}(x, y)\}} = \frac{\mathrm{N}(\hat{\theta}_r \mid 0, \sigma_r^2)}{\int_0^1 \mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_r^2 + \sigma_o^2/\alpha)\, \text{Be}(\alpha \mid x, y)\, d\alpha}. \tag{7}$$

An intuitively reasonable choice for the prior of $\alpha$ under $H_1$ is the uniform $\alpha \sim \text{Be}(x = 1, y = 1)$ distribution. However, it is worth noting that assigning a point mass at $\alpha = 1$ leads to

$$\text{BF}_{01}(\hat{\theta}_r \mid H_1: \alpha = 1) = \frac{f(\hat{\theta}_r \mid H_0: \theta = 0)}{f\{\hat{\theta}_r \mid H_1: \theta \mid \alpha \sim \mathrm{N}(\hat{\theta}_o, \sigma_o^2/\alpha),\ \alpha = 1\}} = \frac{\mathrm{N}(\hat{\theta}_r \mid 0, \sigma_r^2)}{\mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_o^2 + \sigma_r^2)}, \tag{8}$$

which is the replication Bayes factor under normality (Verhagen and Wagenmakers 2014; Ly et al. 2018; Pawel and Held 2022), that is, the Bayes factor contrasting a point null hypothesis with the posterior distribution of the effect size based on the original data (and, in this case, a uniform initial prior). A fixed $\alpha = 1$ can also be seen as the limiting case of a beta prior with fixed $y > 0$ and $x \to \infty$. The power prior version of the replication Bayes factor is thus a generalization of the standard replication Bayes factor, one that allows the original data to be discounted to some degree.
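A minimal base R sketch of the Bayes factors (7) and (8) (our own helper functions, not the ppRep package API) could look as follows:

```r
## Bayes factor (7): H0 theta = 0 vs. H1 with power prior and Be(x, y) on alpha
bf01PP <- function(theta_r, sigma_r, theta_o, sigma_o, x = 1, y = 1) {
  f0 <- dnorm(theta_r, mean = 0, sd = sigma_r)
  f1 <- integrate(function(a) {
    dnorm(theta_r, mean = theta_o, sd = sqrt(sigma_r^2 + sigma_o^2 / a)) *
      dbeta(a, x, y)
  }, lower = 0, upper = 1)$value
  f0 / f1
}
## Bayes factor (8): replication Bayes factor (alpha fixed at 1)
bf01Rep <- function(theta_r, sigma_r, theta_o, sigma_o) {
  dnorm(theta_r, mean = 0, sd = sigma_r) /
    dnorm(theta_r, mean = theta_o, sd = sqrt(sigma_o^2 + sigma_r^2))
}
bf01PP(theta_r = 0.09, sigma_r = 0.05, theta_o = 0.21, sigma_o = 0.05)
```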

Hypotheses about the power parameter α

To quantify the compatibility between the original and replication study, researchers may also be interested in testing hypotheses about the power parameter $\alpha$. For example, we may want to test the hypothesis that the data sets are “compatible” and should be completely pooled ($H_c: \alpha = 1$) against the hypothesis that they are incompatible or “different” and the original data should be discounted to some extent ($H_d: \alpha < 1$).

One approach is to assign a point prior $H_d: \alpha = 0$, which represents the extreme position that the original data should be completely discounted. This raises the issue that, for a flat initial prior $f(\theta) \propto 1$, the power prior with $\alpha = 0$ is improper, so the resulting Bayes factor is only defined up to an arbitrary constant. Instead of the flat prior, we may thus assign an uninformative but proper initial prior to $\theta$, for instance, a unit-information prior $\theta \sim \mathrm{N}(0, \kappa^2)$ with $\kappa^2$ the variance from one (effective) observation (Kass and Wasserman 1995), as it encodes minimal prior information about the direction or magnitude of the effect size (Best et al. 2021). Updating the unit-information prior by the likelihood of the original data raised to the power of $\alpha$ then leads to a $\theta \mid \alpha \sim \mathrm{N}(\mu_\alpha, \sigma_\alpha^2)$ distribution with $\mu_\alpha = \alpha\hat{\theta}_o/(\alpha + \sigma_o^2/\kappa^2)$ and $\sigma_\alpha^2 = 1/(1/\kappa^2 + \alpha/\sigma_o^2)$, so the Bayes factor is

$$\text{BF}_{dc}(\hat{\theta}_r \mid H_d: \alpha = 0) = \frac{f\{\hat{\theta}_r \mid H_d: \theta \mid \alpha \sim \mathrm{N}(\mu_\alpha, \sigma_\alpha^2),\ \alpha = 0\}}{f\{\hat{\theta}_r \mid H_c: \theta \mid \alpha \sim \mathrm{N}(\mu_\alpha, \sigma_\alpha^2),\ \alpha = 1\}} = \frac{\mathrm{N}(\hat{\theta}_r \mid 0, \sigma_r^2 + \kappa^2)}{\mathrm{N}(\hat{\theta}_r \mid s\hat{\theta}_o, \sigma_r^2 + s\sigma_o^2)} \tag{9}$$

with $s = 1/(1 + \sigma_o^2/\kappa^2)$.

An alternative approach that avoids specifying a proper initial prior for $\theta$ is to assign a prior to $\alpha$ under $H_d$. A suitable class of priors is given by $H_d: \alpha \sim \text{Be}(1, y)$ with $y > 1$. The $\text{Be}(1, y)$ prior has its highest density at $\alpha = 0$ and is monotonically decreasing, thus representing the more nuanced position that the original data should only be partially discounted. The parameter $y$ determines the extent of partial discounting, and the simple hypothesis $H_d: \alpha = 0$ can be seen as the limiting case $y \to \infty$. The resulting Bayes factor is given by

$$\text{BF}_{dc}\{\hat{\theta}_r \mid H_d: \alpha \sim \text{Be}(1, y)\} = \frac{f\{\hat{\theta}_r \mid H_d: \theta \mid \alpha \sim \mathrm{N}(\hat{\theta}_o, \sigma_o^2/\alpha),\ \alpha \sim \text{Be}(1, y)\}}{f\{\hat{\theta}_r \mid H_c: \theta \mid \alpha \sim \mathrm{N}(\hat{\theta}_o, \sigma_o^2/\alpha),\ \alpha = 1\}} = \frac{\int_0^1 \mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_r^2 + \sigma_o^2/\alpha)\, \text{Be}(\alpha \mid 1, y)\, d\alpha}{\mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_r^2 + \sigma_o^2)}. \tag{10}$$
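Both compatibility Bayes factors are again straightforward to compute numerically; a small base R sketch (our own functions) is:

```r
## Bayes factor (9): complete discounting Hd alpha = 0 vs. pooling Hc alpha = 1
bfdcPoint <- function(theta_r, sigma_r, theta_o, sigma_o, kappa2 = 2) {
  s <- 1 / (1 + sigma_o^2 / kappa2)
  dnorm(theta_r, mean = 0, sd = sqrt(sigma_r^2 + kappa2)) /
    dnorm(theta_r, mean = s * theta_o, sd = sqrt(sigma_r^2 + s * sigma_o^2))
}
## Bayes factor (10): partial discounting Hd alpha ~ Be(1, y) vs. Hc alpha = 1
bfdcBeta <- function(theta_r, sigma_r, theta_o, sigma_o, y = 2) {
  fd <- integrate(function(a) {
    dnorm(theta_r, mean = theta_o, sd = sqrt(sigma_r^2 + sigma_o^2 / a)) *
      dbeta(a, shape1 = 1, shape2 = y)
  }, lower = 0, upper = 1)$value
  fd / dnorm(theta_r, mean = theta_o, sd = sqrt(sigma_r^2 + sigma_o^2))
}
```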

Example “Labels” (continued)

Table 1 displays the results of the proposed hypothesis tests applied to the three replications of the “Labels” experiment. The Bayes factors contrasting $H_0: \theta = 0$ with $H_1: \theta \neq 0$, based on the normalized power prior with a uniform prior for the power parameter $\alpha$ under the alternative (column $\text{BF}_{01}\{\hat{\theta}_r \mid H_1: \alpha \sim \text{Be}(1, 1)\}$), indicate neither evidence for the absence nor for the presence of an effect in the first replication, but decisive evidence for the presence of an effect in the second and third replications. In all three cases, the Bayes factors are close to the standard replication Bayes factors with $\alpha = 1$ under the alternative (column $\text{BF}_{01}(\hat{\theta}_r \mid H_1: \alpha = 1)$).

Table 1.

Hypothesis tests for the replication studies of the “Labels” experiment with original standardized mean difference effect estimate $\hat{\theta}_o = 0.21$ and standard error $\sigma_o = 0.05$

Replication   Estimate   SE     BF01{α ~ Be(1,1)}   BF01(α = 1)   BFdc(α = 0)   BFdc{α ~ Be(1,2)}
1             0.09       0.05   1/1.1               1.1           1/5.6         1.2
2             0.21       0.06   1/367               1/478         1/19          1/1.5
3             0.44       0.04   < 1/1000            < 1/1000      16            25

The columns show the replication effect estimates $\hat{\theta}_r$, their standard errors $\sigma_r$, Bayes factors contrasting the absence of an effect $H_0: \theta = 0$ with the presence of an effect $H_1: \theta \neq 0$ using either a uniform prior $\alpha \sim \text{Be}(x = 1, y = 1)$ or a point prior $\alpha = 1$ under $H_1$ (tests about the effect size $\theta$), and Bayes factors contrasting study incompatibility $H_d: \alpha < 1$ with study compatibility $H_c: \alpha = 1$ using either the complete discounting prior $\alpha = 0$ or the partial discounting prior $\alpha \sim \text{Be}(1, y = 2)$ under $H_d$ (tests about the power parameter $\alpha$)

To compute the Bayes factor testing $H_d: \alpha = 0$ versus $H_c: \alpha = 1$, we need to specify a unit variance for the unit-information prior. A crude approximation of the variance of a standardized mean difference effect estimate is $\text{Var}(\hat{\theta}_i) = 4/n_i$, with $n_i$ the total sample size of the study and assuming equal sample sizes in both groups (Hedges and Schauer 2021, p. 5). We may thus set the variance of the unit-information prior to $\kappa^2 = 2$, since a total sample size of $n_i = 2$ (at least one observation from each group) is required to estimate a standardized mean difference. Based on this choice, the Bayes factors $\text{BF}_{dc}(\hat{\theta}_r \mid H_d: \alpha = 0)$ in Table 1 indicate that the data provide substantial and strong evidence for the compatibility hypothesis $H_c$ in the first and second replication study, respectively, whereas they indicate strong evidence for complete incompatibility $H_d$ in the third replication study. The Bayes factor $\text{BF}_{dc}\{\hat{\theta}_r \mid H_d: \alpha \sim \text{Be}(1, y = 2)\}$ in the right-most column, with the partial discounting prior assigned under $H_d$, indicates absence of evidence for either hypothesis in the first and second replications, but strong evidence for incompatibility $H_d$ in the third replication. The differences from the Bayes factor with the complete discounting prior (column $\text{BF}_{dc}(\hat{\theta}_r \mid H_d: \alpha = 0)$) illustrate that in case of no conflict (study 2) or moderate conflict (study 1) the test with the partial discounting prior is less sensitive in diagnosing (in)compatibility, whereas in case of substantial conflict (study 3) it is more sensitive.

The previous analysis is based on a beta prior with $y = 2$, corresponding to a linearly decreasing density in $\alpha$; Fig. 2 shows the Bayes factor for other values of $y$. We see that over the realistic range from $y = 1$ (uniform prior) to $y = 100$ (almost all mass at $\alpha = 0$), the results for the first and third replications hardly change, while for the second replication the Bayes factor shifts from anecdotal to stronger evidence for compatibility.

Fig. 2 Sensitivity of the Bayes factor $\text{BF}_{dc}\{\hat{\theta}_r \mid H_d: \alpha \sim \text{Be}(1, y)\}$ with respect to the parameter $y$ of the partial discounting prior under $H_d$

To conclude, our analysis suggests that only the second replication was fully successful, in the sense that it provides evidence for the presence of an effect while also being compatible with the original study. For the other two replications the conclusions are more nuanced: In the first replication, there is neither evidence for the absence nor for the presence of an effect, but substantial evidence for compatibility when the complete discounting prior is used, and no evidence for (in)compatibility when the partial discounting prior is used. Finally, in the third replication there is decisive evidence for an effect, but also strong evidence of incompatibility with the original study.

Bayes factor asymptotics

Some of the Bayes factors in the previous example provided only modest evidence for the test-relevant hypotheses despite the large sample sizes of the original and replication studies. It is therefore of interest to understand the asymptotic behavior of the proposed Bayes factors. For instance, we may wish to understand what happens when the standard error of the replication study $\sigma_r$ becomes arbitrarily small (through an increase in sample size). Assume that $\hat{\theta}_r$ is a consistent estimator of its true underlying effect size $\theta_r$, so that as the standard error $\sigma_r$ goes to zero, the estimate converges in probability to the true effect size $\theta_r$. The true replication effect size $\theta_r$ may differ from the true original effect size $\theta_o$, for example, because the participant populations of the two studies differ systematically.

The limiting Bayes factors for testing the effect size $\theta$ from (7) and (8) are then given by

$$\lim_{\sigma_r \to 0} \text{BF}_{01}\{\hat{\theta}_r \mid H_1: \alpha \sim \text{Be}(x, y)\} = \delta(\theta_r)\, \sqrt{2\pi\sigma_o^2}\, \frac{B(x, y)}{B(x + 1/2, y)} \times M\!\left\{x + 1/2,\; x + y + 1/2,\; -\frac{(\theta_r - \hat{\theta}_o)^2}{2\sigma_o^2}\right\}^{-1}$$

and

$$\lim_{\sigma_r \to 0} \text{BF}_{01}(\hat{\theta}_r \mid H_1: \alpha = 1) = \frac{\delta(\theta_r)}{\mathrm{N}(\theta_r \mid \hat{\theta}_o, \sigma_o^2)},$$

with $\delta(\cdot)$ the Dirac delta function. Both Bayes factors are hence consistent (Bayarri et al. 2012) in the sense that they indicate overwhelming evidence for the correct hypothesis (i.e., the Bayes factor goes to infinity/zero if the true effect size $\theta_r$ is zero/non-zero). In contrast, the Bayes factors for testing the power parameter $\alpha$ from (9) and (10) converge to the positive constants

$$\lim_{\sigma_r \to 0} \text{BF}_{dc}(\theta_r \mid H_d: \alpha = 0) = \sqrt{1 - s}\, \exp\!\left[-\frac{1}{2}\left\{\frac{\theta_r^2}{\kappa^2} - \frac{(\theta_r - s\hat{\theta}_o)^2}{s\sigma_o^2}\right\}\right] \tag{11}$$

and

$$\lim_{\sigma_r \to 0} \text{BF}_{dc}\{\theta_r \mid H_d: \alpha \sim \text{Be}(1, y)\} = \frac{B(3/2, y)}{B(1, y)}\, M\!\left\{y,\; y + 3/2,\; \frac{(\theta_r - \hat{\theta}_o)^2}{2\sigma_o^2}\right\}. \tag{12}$$

The amount of evidence one can find for either hypothesis thus depends on the original effect estimate $\hat{\theta}_o$, the standard error $\sigma_o$, and the true effect size $\theta_r$. For instance, in the “Labels” experiment we have an original effect estimate $\hat{\theta}_o = 0.21$, a standard error $\sigma_o = 0.05$, and a unit variance $\kappa^2 = 2$. The bound (11) is minimized for a true effect size equal to the original effect estimate, $\theta_r = \hat{\theta}_o = 0.21$, so the most extreme Bayes factor we can obtain is $\lim_{\sigma_r \to 0} \text{BF}_{dc}(\theta_r \mid H_d: \alpha = 0) = 1/28$. Similarly, the bound (12) is minimized for $\theta_r = \hat{\theta}_o = 0.21$, since the confluent hypergeometric function term then becomes one, leading to $\lim_{\sigma_r \to 0} \text{BF}_{dc}\{\theta_r \mid H_d: \alpha \sim \text{Be}(1, y = 2)\} = B(3/2, y)/B(1, y) = 1/1.9$. Even with a perfectly precise replication study we cannot find more evidence, and hence the posterior probability of $H_c: \alpha = 1$ cannot converge to one.
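The two bounds are easily verified numerically (a minimal sketch using the “Labels” inputs):

```r
theta_o <- 0.21; sigma_o <- 0.05; kappa2 <- 2; y <- 2
s <- 1 / (1 + sigma_o^2 / kappa2)
## bound (11) evaluated at theta_r = theta_o (its approximate minimizer)
bound11 <- sqrt(1 - s) *
  exp(-0.5 * (theta_o^2 / kappa2 - (theta_o - s * theta_o)^2 / (s * sigma_o^2)))
## bound (12) at theta_r = theta_o, where the M(.) term equals one
bound12 <- beta(3/2, y) / beta(1, y)
c(bound11 = bound11, bound12 = bound12)  # roughly 1/28 and 1/1.9
```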

While the Bayes factors (9) and (10) are inconsistent when the replication data become arbitrarily informative, the situation is different when the original data also become arbitrarily informative (reflected by the standard error $\sigma_o$ also going to zero and the original effect estimate $\hat{\theta}_o$ converging to its true effect size $\theta_o$). The Bayes factor with $H_d: \alpha = 0$ from (9) is then consistent, as the limit (11) correctly goes to infinity/zero if the true effect size of the replication study $\theta_r$ is different from/equal to the true effect size of the original study $\theta_o$. In contrast, the Bayes factor with $H_d: \alpha \sim \text{Be}(1, y)$ from (10) remains inconsistent, since it only shows the correct asymptotic behavior when the true effect sizes are unequal (i.e., the Bayes factor goes to infinity) but not when they are equal, in which case it is still bounded from below by $B(3/2, y)/B(1, y)$.

Bayes factor design of replication studies

Now assume that the replication study has not yet been conducted and we wish to plan a suitable sample size. The design of a replication study should be aligned with the planned analysis (Anderson and Maxwell 2017), and if multiple analyses are performed, a sample size may be calculated that guarantees a sufficiently conclusive analysis in each case (Pawel et al. 2023b). In the power prior framework, sample size calculations may be based on either hypothesis testing or estimation of the effect size $\theta$ or the power parameter $\alpha$. Estimation-based approaches have been developed by Shen et al. (2023). Here, we focus on sample size calculations based on Bayes factor hypothesis testing, as this methodology is still lacking.

For testing the effect size $\theta$, Pawel and Held (2022) studied the Bayesian design of replication studies based on the Bayes factor (8) with $\alpha = 1$ under $H_1$, i.e., the replication Bayes factor under normality. They obtained closed-form expressions for the probability of replication success under $H_0$ and $H_1$, based on which standard Bayesian design can be performed (Weiss 1997; Gelfand and Wang 2002; De Santis 2004; Schönbrodt and Wagenmakers 2017). For the Bayes factor (7) with $\alpha \sim \text{Be}(x, y)$ under $H_1$, closed-form expressions are no longer available, and simulation or numerical integration has to be used for sample size calculations.

For tests related to the power parameter $\alpha$, there are also closed-form expressions for the probability of replication success based on the Bayes factor (9) with $\alpha = 0$ under $H_d$. We now show how these can be derived and used to determine the replication sample size. With some algebra, one can show that $\text{BF}_{dc}(\hat{\theta}_r \mid H_d: \alpha = 0) \leq \gamma$ is equivalent to

$$\left\{\hat{\theta}_r - \frac{\hat{\theta}_o(\sigma_r^2 + \kappa^2)}{\kappa^2}\right\}^2 \leq X \tag{13}$$

with

$$X = \frac{(\sigma_r^2 + \kappa^2)(\sigma_r^2 + s\sigma_o^2)}{\kappa^2 - s\sigma_o^2}\left\{\log \gamma^2 - \log\frac{\sigma_r^2 + s\sigma_o^2}{\sigma_r^2 + \kappa^2} - \frac{s^2\hat{\theta}_o^2}{s\sigma_o^2 - \kappa^2}\right\}$$

and $s = 1/(1 + \sigma_o^2/\kappa^2)$. Denote by $m_i$ and $v_i$ the mean and variance of $\hat{\theta}_r$ under hypothesis $i \in \{d, c\}$. The left-hand side of (13) then follows a scaled non-central chi-squared distribution under both hypotheses. Hence, the probability of replication success is given by

$$\Pr(\text{BF}_{dc} \leq \gamma \mid H_i) = \Pr\!\left(\chi^2_{1,\lambda_i} \leq X/v_i\right) \tag{14}$$

with non-centrality parameter

$$\lambda_i = \left\{m_i - \frac{\hat{\theta}_o(\sigma_r^2 + \kappa^2)}{\kappa^2}\right\}^2 \Big/ v_i.$$

To determine the replication sample size, we can now use (14) to compute the probability of replication success at a desired level $\gamma$ over a grid of replication standard errors $\sigma_r$, under either hypothesis $H_d$ or $H_c$. The appropriate standard error $\sigma_r$ is then chosen so that the probability of finding correct evidence is sufficiently high under the respective hypothesis, and sufficiently low under the wrong hypothesis. Subsequently, the standard error $\sigma_r$ needs to be translated into a sample size, e.g., for standardized mean differences via the aforementioned approximation $n_r \approx 4/\sigma_r^2$.

Example “Labels” (continued)

Figure 3 illustrates Bayesian design based on the Bayes factor $\text{BF}_{dc}(\hat{\theta}_r \mid H_d: \alpha = 0)$ from (9) testing the power parameter $\alpha$. The three replication studies from the “Labels” experiment are now regarded as original studies, and each column of the figure shows the corresponding design of future replications. In each plot, the probability of finding strong evidence for $H_c: \alpha = 1$ (top) or $H_d: \alpha = 0$ (bottom) is shown as a function of the relative sample size. In both cases, the probability is computed assuming that either $H_c$ (blue) or $H_d$ (yellow) is true.

Fig. 3 Probability of replication success as a function of the relative variance for the three replications of the “Labels” experiment regarded as original studies. The arrows point to the relative variance associated with an 80% probability under the respective hypotheses

The curves look broadly similar for all three studies. We see from the lower panels that the probability of finding strong evidence for $H_d$ is not much affected by the sample size of the replication study; it stays at almost zero under $H_c$, while under $H_d$ it increases from about 75% to about 90%. In contrast, the top panels show that the probability of finding strong evidence for $H_c$ rapidly increases under $H_c$ and seems to level off at an asymptote. Under $H_d$ this probability stays below 5% across the whole range.

The arrows in the plots display the relative sample size required to obtain strong evidence with a probability of 80% under the correct hypothesis. We see that original studies with smaller standard errors require smaller relative sample sizes in the replication to achieve the same probability of replication success. Under $H_c$ the required relative sample sizes are larger than under $H_d$. However, while the probability of misleading evidence under $H_c$ seems to be well controlled at the determined sample size, under $H_d$ it stays at roughly 5% for all three studies, even for very large replication sample sizes. Choosing the sample size based on finding strong evidence for $H_c$ assuming $H_c$ is true thus also guarantees appropriate error probabilities for finding strong evidence for $H_d$ in all three studies. At the same time, it seems that the probability of finding misleading evidence for $H_c$ cannot be reduced below around 5%, which might be undesirably high for certain applications.

Connection to hierarchical modeling of replication studies

Hierarchical modeling is another approach that allows for the incorporation of historical data in Bayesian analyses; moreover, hierarchical models have previously been used in the replication setting (Bayarri and Mayoral 2002a, b; Pawel and Held 2020). We will now investigate how the hierarchical modeling approach is related to the power prior approach in the analysis of replication studies, both in parameter estimation and hypothesis testing.

Connection to parameter estimation in hierarchical models

Assume a hierarchical model

$$\hat{\theta}_i \mid \theta_i \sim \mathrm{N}(\theta_i, \sigma_i^2) \tag{15a}$$
$$\theta_i \mid \theta \sim \mathrm{N}(\theta, \tau^2) \tag{15b}$$
$$f(\theta) = k \tag{15c}$$

where for study $i \in \{o, r\}$ the effect estimate $\hat{\theta}_i$ is normally distributed around a study-specific effect size $\theta_i$, which itself is normally distributed around an overall effect size $\theta$. The heterogeneity variance $\tau^2$ determines the similarity of the study-specific effect sizes $\theta_i$. The overall effect size $\theta$ is assigned an (improper) flat prior $f(\theta) = k$, for some $k > 0$, which is a common approach in hierarchical modeling of effect estimates (Röver et al. 2021).

We show in Appendix A that under the hierarchical model (15) the marginal posterior distribution of the replication-specific effect size $\theta_r$ is given by

$$\theta_r \mid \hat{\theta}_o, \hat{\theta}_r, \tau^2 \sim \mathrm{N}\!\left(\frac{\hat{\theta}_r/\sigma_r^2 + \hat{\theta}_o/(2\tau^2 + \sigma_o^2)}{1/\sigma_r^2 + 1/(2\tau^2 + \sigma_o^2)},\; \frac{1}{1/\sigma_r^2 + 1/(2\tau^2 + \sigma_o^2)}\right), \tag{16}$$

that is, a normal distribution whose mean is a weighted average of the replication effect estimate $\hat{\theta}_r$ and the original effect estimate $\hat{\theta}_o$. The amount of shrinkage of the replication towards the original effect estimate depends on how large the replication standard error $\sigma_r$ is relative to the heterogeneity variance $\tau^2$ and the original standard error $\sigma_o$. There is a correspondence between the posterior for the replication effect size $\theta_r$ from the hierarchical model (16) and the posterior for the effect size $\theta$ under the power prior approach. Specifically, under the power prior with a fixed power parameter $\alpha$, the posterior of the effect size $\theta$ is given by

$$\theta \mid \hat{\theta}_o, \hat{\theta}_r, \alpha \sim \mathrm{N}\!\left(\frac{\hat{\theta}_r/\sigma_r^2 + \alpha\hat{\theta}_o/\sigma_o^2}{1/\sigma_r^2 + \alpha/\sigma_o^2},\; \frac{1}{1/\sigma_r^2 + \alpha/\sigma_o^2}\right). \tag{17}$$

The hierarchical posterior (16) and the power prior posterior (17) thus match if and only if

$$\alpha = \frac{\sigma_o^2}{2\tau^2 + \sigma_o^2}, \tag{18}$$

or, equivalently,

$$\tau^2 = \left(\frac{1}{\alpha} - 1\right)\frac{\sigma_o^2}{2}, \tag{19}$$

which was first shown by Chen and Ibrahim (2006). For instance, a power prior model with $\alpha = 1$ corresponds to a hierarchical model with $\tau^2 = 0$, and a hierarchical model with $\tau^2 \to \infty$ corresponds to a power prior model with $\alpha \to 0$. In between these two extremes, however, $\alpha$ has to be interpreted as a relative measure of heterogeneity, since the transformation to $\tau^2$ involves a scaling by the variance $\sigma_o^2$ of the original effect estimate. For this reason, there is a direct correspondence between $\alpha$ and the popular relative heterogeneity measure $I^2 = \tau^2/(\tau^2 + \sigma_o^2)$ (Higgins and Thompson 2002) computed from $\tau^2$ and the variance of the original estimate $\sigma_o^2$, namely

$$\alpha = \frac{1 - I^2}{1 + I^2},$$

with inverse of the same functional form. Figure 4 shows $\alpha$ and the corresponding $\tau^2$ and $I^2$ values which lead to matching posteriors.
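The transformations (18) and (19) and the mapping between $\alpha$ and $I^2$ are one-liners in R (a small sketch with our own function names):

```r
alpha2tau2 <- function(alpha, sigma_o) (1 / alpha - 1) * sigma_o^2 / 2   # (19)
tau22alpha <- function(tau2, sigma_o) sigma_o^2 / (2 * tau2 + sigma_o^2) # (18)
alpha2I2   <- function(alpha) (1 - alpha) / (1 + alpha)  # self-inverse mapping
alpha2tau2(alpha = 0.5, sigma_o = 0.05)  # heterogeneity matching alpha = 0.5
```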

Fig. 4 The heterogeneity variance $\tau^2$ and relative heterogeneity $I^2 = \tau^2/(\tau^2 + \sigma_o^2)$ of a hierarchical model versus the power parameter $\alpha$ of a power prior model that lead to matching posteriors for the effect sizes $\theta$ and $\theta_r$. The variance of the original effect estimate $\sigma_o^2 = 0.05^2$ from the “Labels” experiment is used for the transformation to the heterogeneity scale $\tau^2$

It has remained unclear whether a similar correspondence exists when $\alpha$ and $\tau^2$ are random and assigned prior distributions. Here we confirm that there is indeed such a correspondence. Specifically, the marginal posterior of the replication effect size $\theta_r$ from the hierarchical model matches the marginal posterior of the effect size $\theta$ from the power prior model if the prior density functions $f_{\tau^2}(\cdot)$ and $f_\alpha(\cdot)$ of $\tau^2$ and $\alpha$ satisfy

$$f_{\tau^2}(\tau^2) = f_\alpha\!\left(\frac{\sigma_o^2}{2\tau^2 + \sigma_o^2}\right)\frac{2\sigma_o^2}{(2\tau^2 + \sigma_o^2)^2} \tag{20}$$

for every $\tau^2 \geq 0$; see Appendix B for details. Importantly, the correspondence condition (20) involves a scaling by the variance of the original effect estimate $\sigma_o^2$, meaning that also in this case $\alpha$ acts like a relative heterogeneity parameter. This can also be seen from the correspondence condition between $\alpha$ and $I^2 = \tau^2/(\sigma_o^2 + \tau^2)$, which can be derived in exactly the same way as the correspondence between $\alpha$ and $\tau^2$. That is, the marginal posteriors of $\theta$ and $\theta_r$ match if the prior density functions $f_{I^2}(\cdot)$ and $f_\alpha(\cdot)$ of $I^2$ and $\alpha$ satisfy

$$f_{I^2}(I^2) = f_\alpha\!\left(\frac{1 - I^2}{1 + I^2}\right)\frac{2}{(1 + I^2)^2} \tag{21}$$

for every $0 \leq I^2 \leq 1$.

Interestingly, conditions (20) and (21) imply that a beta prior on the power parameter $\alpha \sim \text{Be}(x, y)$ corresponds to a generalized F prior on the heterogeneity variance $\tau^2 \sim \text{GF}(y, x, 2/\sigma_o^2)$ and a generalized beta prior on the relative heterogeneity $I^2 \sim \text{GBe}(y, x, 2)$; see Appendix C for details on both distributions. This connection provides a convenient analytical link between hierarchical modeling and the power prior framework, as beta priors for $\alpha$ are almost universally used in applications of power priors. The result also illustrates that the power prior framework seems unnatural from the perspective of hierarchical modeling, since it corresponds to specifying priors on the $I^2$ scale rather than on the $\tau^2$ scale. The same prior on $I^2$ will imply different degrees of informativeness on the $\tau^2$ scale for original effect estimates $\hat{\theta}_o$ with different variances $\sigma_o^2$, since $I^2$ is entangled with the variance of the original effect estimate.
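The two distributions (see Appendix C) and the matching conditions (20) and (21) can be checked numerically in a few lines of base R (our own sketch):

```r
## densities of the generalized F (28) and generalized beta (27) distributions
dGF <- function(t, a, b, lambda) {
  lambda^a * t^(a - 1) / (beta(a, b) * (1 + lambda * t)^(a + b))
}
dGBe <- function(z, a, b, lambda) {
  lambda^a * z^(a - 1) * (1 - z)^(b - 1) /
    (beta(a, b) * (1 - (1 - lambda) * z)^(a + b))
}
x <- 1; y <- 2; sigma_o <- 0.05; tau2 <- 0.001  # illustrative values
alpha <- sigma_o^2 / (2 * tau2 + sigma_o^2)     # transformation (18)
I2 <- tau2 / (sigma_o^2 + tau2)
## condition (20): GF(y, x, 2/sigma_o^2) density equals transformed Be(x, y)
all.equal(dGF(tau2, y, x, 2 / sigma_o^2),
          dbeta(alpha, x, y) * 2 * sigma_o^2 / (2 * tau2 + sigma_o^2)^2)
## condition (21): GBe(y, x, 2) density equals transformed Be(x, y)
all.equal(dGBe(I2, y, x, 2),
          dbeta((1 - I2) / (1 + I2), x, y) * 2 / (1 + I2)^2)
```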

Figure 5 provides three examples of matching priors, using the variance of the original effect estimate from the “Labels” experiment for the transformation to the heterogeneity scale $\tau^2$. The top row of Fig. 5 shows that the uniform prior on $\alpha$ corresponds to an $f(\tau^2) \propto \sigma_o^2/(2\tau^2 + \sigma_o^2)^2$ prior, which is similar to the “uniform shrinkage” prior $f(\tau^2) \propto \sigma_o^2/(\tau^2 + \sigma_o^2)^2$ (Daniels 1999). This prior has its highest density at $\tau^2 = 0$ but still gives some mass to larger values of $\tau^2$. Similarly, on the scale of $I^2$ the prior slightly favors smaller values. The middle row of Fig. 5 shows that the $\alpha \sim \text{Be}(2, 1)$ prior—indicating more compatibility between original and replication than the uniform prior—gives even more mass to small values of $\tau^2$ and $I^2$, and also has its highest density at $\tau^2 = 0$ and $I^2 = 0$. In contrast, the bottom row of Fig. 5 shows that the $\alpha \sim \text{Be}(1, 2)$ prior—indicating less compatibility between original and replication than the uniform prior—gives less mass to small $\tau^2$ and $I^2$, and has zero density at $\tau^2 = 0$ and $I^2 = 0$.

Fig. 5 Priors on the heterogeneity variance $\tau^2 \sim \text{GF}(y, x, 2/\sigma_o^2)$ (left), the relative heterogeneity $I^2 = \tau^2/(\sigma_o^2 + \tau^2) \sim \text{GBe}(y, x, 2)$ (middle), and the power parameter $\alpha \sim \text{Be}(x, y)$ (right) that lead to matching marginal posteriors for the effect sizes $\theta$ and $\theta_r$. The variance of the original effect estimate $\sigma_o^2 = 0.05^2$ from the “Labels” experiment is used for the transformation to the heterogeneity scale $\tau^2$

Connection to hypothesis testing in hierarchical models

Two types of hypothesis tests can be distinguished in the hierarchical model: tests for the overall effect size $\theta$ and tests for the heterogeneity variance $\tau^2$. In all cases, computing marginal likelihoods of the form

$$f(\hat{\theta}_r \mid H_i) = \int\!\!\int \mathrm{N}(\hat{\theta}_r \mid \theta, \sigma_r^2 + \tau^2)\, f(\theta, \tau^2 \mid H_i)\, d\theta\, d\tau^2 \tag{22}$$

with $i \in \{j, k\}$ is required for obtaining Bayes factors $\text{BF}_{jk}(\hat{\theta}_r) = f(\hat{\theta}_r \mid H_j)/f(\hat{\theta}_r \mid H_k)$, which quantify the evidence that the replication data $\hat{\theta}_r$ provide for a hypothesis $H_j$ over a competing hypothesis $H_k$. Under each hypothesis, a joint prior for $\tau^2$ and $\theta$ needs to be assigned.

As with parameter estimation, it is of interest to investigate whether there is a correspondence with the hypothesis tests from the power prior framework in Sect. 2.2. For two tests to match, one needs to assign priors to $\tau^2$ and $\theta$, and to $\alpha$ and $\theta$, respectively, such that the marginal likelihood (22) equals the marginal likelihood from the power prior model (6) under both test-relevant hypotheses.

Concerning the generalized replication Bayes factor (7) testing $H_0: \theta = 0$ versus $H_1: \theta \neq 0$, one can show that it matches the Bayes factor contrasting

$$H_0: \theta = 0,\ \tau^2 = 0 \quad \text{versus} \quad H_1: \theta \mid \tau^2 \sim \mathrm{N}(\hat{\theta}_o, \sigma_o^2 + \tau^2),\ \tau^2 \sim \text{GF}(y, x, 2/\sigma_o^2)$$

for the replication data in the hierarchical framework. The Bayes factor thus compares the likelihood of the replication data under the hypothesis $H_0$ postulating that the overall effect size $\theta$ is zero and that there is no effect size heterogeneity, to the likelihood of the data under the hypothesis $H_1$ postulating that $\theta$ follows the posterior based on the original data and an initial flat prior, along with a generalized F prior on the heterogeneity variance $\tau^2$. Setting the heterogeneity to $\tau^2 = 0$ under $H_1$ instead produces the replication Bayes factor under normality from (8).

The Bayes factor (9) that tests complete discounting $H_d: \alpha = 0$ against complete pooling $H_c: \alpha = 1$ is obtained in the hierarchical framework by contrasting

$$H_d: \theta \sim \mathrm{N}(0, \kappa^2),\ \tau^2 = 0 \quad \text{versus} \quad H_c: \theta \sim \mathrm{N}(s\hat{\theta}_o, s\sigma_o^2),\ \tau^2 = 0$$

with $s = 1/(1 + \sigma_o^2/\kappa^2)$. Hence, the Bayes factor compares the likelihood of the replication data under the initial unit-information prior to the likelihood under the unit-information prior updated by the original data, assuming no heterogeneity under either hypothesis (so that the hierarchical model collapses to a fixed-effects model). Although this particular test relates to the power parameter $\alpha$ in the power prior model, it is, perhaps surprisingly, unrelated to testing the heterogeneity variance $\tau^2$ in the hierarchical model.

The Bayes factor (10) testing $H_d: \alpha < 1$ versus $H_c: \alpha = 1$ with the partial discounting prior $H_d: \alpha \sim \text{Be}(1, y)$ corresponds to testing $H_d: \tau^2 > 0$ versus $H_c: \tau^2 = 0$ with priors

$$H_d: \theta \mid \tau^2 \sim \mathrm{N}(\hat{\theta}_o, \sigma_o^2 + \tau^2),\ \tau^2 \sim \text{GF}(y, 1, 2/\sigma_o^2) \quad \text{versus} \quad H_c: \theta \mid \tau^2 \sim \mathrm{N}(\hat{\theta}_o, \sigma_o^2 + \tau^2),\ \tau^2 = 0.$$

The test for compatibility via the power parameter $\alpha$ is thus equivalent to a test for compatibility via the heterogeneity variance $\tau^2$ (to which a generalized F prior is assigned) after updating a flat prior for $\theta$ with the data from the original study.

Bayes factor asymptotics in the hierarchical model

Like the original test of $H_c: \alpha = 1$ versus $H_d: \alpha \sim \text{Be}(1, y)$, the corresponding test of $\tau^2$ is inconsistent in the sense that when the standard errors from both studies go to zero ($\sigma_o \to 0$ and $\sigma_r \to 0$) and the true effect sizes are equal ($\theta_o = \theta_r$), the Bayes factor $\text{BF}_{dc}$ does not go to zero (to indicate overwhelming evidence for $H_c: \tau^2 = 0$) but converges to a positive constant. It is, however, possible to construct a consistent test for $H_c: \tau^2 = 0$ by assigning a different prior to $\tau^2$ under $H_d: \tau^2 > 0$. For instance, assigning an inverse gamma prior $H_d: \tau^2 \sim \text{IG}(q, r)$ with shape $q$ and scale $r$ gives the Bayes factor

$$\text{BF}_{dc}\{\hat{\theta}_r \mid H_d: \tau^2 \sim \text{IG}(q, r)\} = \frac{\int_0^\infty \mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_r^2 + \sigma_o^2 + 2\tau^2)\, \text{IG}(\tau^2 \mid q, r)\, d\tau^2}{\mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_r^2 + \sigma_o^2)}$$

with $\text{IG}(\cdot \mid q, r)$ the density function of the inverse gamma distribution.

with IG(·|q,r) the density function of the inverse gamma distribution. The limiting Bayes factor is therefore

limσo,σr0BFdc{θ^r|Hd:τ2IG(q,r)}=Γ(q+1/2){r+(θr-θo)2/4}-(q+1/2)δ(θr-θo)4π,

so it correctly goes to zero/infinity when the effect sizes θr and θo are equivalent/different. To understand why the test with Hd:τ2IG(q,r) is consistent, but the original test with Hd:αBe(1,y) is not, one can transform the consistent test on τ2 to the corresponding test on α. The inverse gamma prior for τ2 implies a prior for α with density

f(α|q,r)=rqΓ(q)αq-1(1-α)q+12σo2qexp-2rασo2(1-α). 23

The Bayes factor contrasting $H_c: \alpha = 1$ with $H_d: \alpha < 1$ and the prior (23) assigned to $\alpha$ under $H_d$ thus produces a consistent test. The prior is shown in Fig. 6 for different parameters $q$ and $r$ and original standard errors $\sigma_o$. We see that the prior depends on the standard error of the original effect estimate $\sigma_o$: the smaller $\sigma_o$, the more the prior is shifted towards zero. For example, the standard error $\sigma_o = 0.05$ from the “Labels” experiment leads to priors that are almost indistinguishable from a point mass at $\alpha = 0$. The prior thus “unscales” $\alpha$ from the original standard error $\sigma_o$, thereby leading to a consistent test for study compatibility and resolving the inconsistency caused by the beta prior.
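The implied prior density (23) is straightforward to evaluate; the following base R sketch (our own function, with illustrative parameter values) also checks that it integrates to one, as it must under the change of variables:

```r
## prior density (23) on alpha implied by tau^2 ~ IG(q, r)
dAlphaImplied <- function(alpha, q, r, sigma_o) {
  r^q / gamma(q) * alpha^(q - 1) / (1 - alpha)^(q + 1) * (2 / sigma_o^2)^q *
    exp(-2 * r * alpha / (sigma_o^2 * (1 - alpha)))
}
integrate(dAlphaImplied, lower = 0, upper = 1,
          q = 1, r = 0.01, sigma_o = 0.5)$value  # should be (close to) 1
```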

Fig. 6 Prior for the power parameter $\alpha$ implied by an inverse gamma prior $H_d: \tau^2 \sim \text{IG}(q, r)$ in a hierarchical model with a consistent test of $H_c: \tau^2 = 0$ versus $H_d: \tau^2 > 0$

Discussion

We showed how the power prior framework can be used for the design and analysis of replication studies. The approach supplies analysts with a suite of methods for assessing effect sizes and study compatibility. Both aspects can be tackled from an estimation or a hypothesis testing perspective, and the choice between the two is primarily philosophical. We believe that both perspectives provide valuable inferences that complement each other. Visualizations of joint and marginal posterior distributions are highly informative about the remaining uncertainty. However, the power parameter $\alpha$ is an abstract quantity disconnected from actual scientific phenomena, so testing hypotheses of complete discounting versus complete pooling may be more intuitive for researchers. Both approaches also suffer from similar problems: If the original and replication data are in perfect agreement, the posterior distribution of $\alpha$ hardly changes from the prior. For example, for the commonly used uniform prior $\alpha \sim \text{Be}(x = 1, y = 1)$, we can at best obtain an $\alpha \mid \hat{\theta}_r \sim \text{Be}(x + 1/2 = 3/2, y = 1)$ posterior (Pawel et al. 2023a). This means that for a “compatibility threshold” of, say, 0.8, we can never obtain a posterior probability higher than $\Pr(\alpha > 0.8 \mid \hat{\theta}_r) = 0.28$, and for a threshold of 0.9 it is even lower, $\Pr(\alpha > 0.9 \mid \hat{\theta}_r) = 0.15$. The fact that the Bayes factor for testing $H_d: \alpha \sim \text{Be}(1, y)$ against $H_c: \alpha = 1$ is inconsistent, i.e., bounded from below by the positive constant $B(3/2, y)/B(1, y)$, simply presents the same problem from a different perspective.

We also showed how the power prior approach is connected to hierarchical modeling, and gave conditions under which posterior distributions and hypothesis tests correspond between normal power prior models and normal hierarchical models. This connection provides an intuition for why, even with highly precise and compatible original and replication studies, one can hardly draw conclusive inferences about the power parameter $\alpha$: the power parameter $\alpha$ has a direct correspondence to the relative heterogeneity $I^2$, and an indirect correspondence to the heterogeneity variance $\tau^2$, in a hierarchical model. Making inferences about a heterogeneity variance from two studies alone seems a virtually impossible task, since the “unit of information” is the number of studies and not the number of samples within a study. Moreover, Bayes factor hypothesis tests related to $\alpha$ have the undesirable asymptotic property of inconsistency if a beta prior is assigned to $\alpha$, because the prior scales with the variance of the original data, just as a beta prior on $I^2$ would in a hierarchical model. The identified link may also have computational advantages; for example, it may be possible to estimate power prior models using hierarchical model estimation procedures, or vice versa, but more research is needed on the connection in more complex situations that depart from normality assumptions.

Which of the two approaches should data analysts use in practice? We believe that the choice should primarily be guided by whether the hierarchical or the power prior model is scientifically more suitable for the studies at hand. If data analysts deem it scientifically plausible that the studies’ underlying effect sizes are connected via an overarching distribution, the hierarchical model may be more suitable, particularly because it naturally generalizes to more than two studies. On the other hand, if data analysts simply want to downweight the original study’s contribution depending on the observed conflict, the power prior approach might be more suitable. The identified limitations of inferences related to the power parameter $\alpha$ should, however, be kept in mind when beta priors are assigned to it.

There are also situations where the hierarchical and power prior frameworks can be combined, for example, when multiple replications of a single original study are conducted (multisite replications). In that case, one may model the replication effect estimates in a hierarchical fashion but link their overall effect size to the original study via a power prior. Multisite replications are thus the opposite of the usual situation in clinical trials, where several historical “original” studies but only one current “replication” study are available (Gravestock and Held 2019).

Another commonly used Bayesian approach for incorporating historical data is the use of robust mixture priors, i.e., priors which are mixtures of the posterior based on the historical data and an uninformative prior distribution (Schmidli et al. 2014). We conjecture that inferences based on robust mixture priors can be reverse-engineered within the power prior framework through Bayesian model averaging over two hypotheses about the power parameter; however, more research is needed to explore the relationship between the two approaches.

The proposed methods are based on the standard meta-analytic assumption of approximately normal effect estimates with known variances. This makes our methodology applicable to a wide range of effect sizes that may arise from different data models. However, in some situations this assumption may be inadequate, for example, when studies have small sample sizes. The methods could then be modified to use the exact likelihood of the data (e.g., binomial or t), as in Bayarri and Mayoral (2002a), who used a t likelihood. The methodology would, however, need to be adapted for each type of effect size, so future work may examine specific data models in more detail to obtain more precise inferences. Using the exact likelihood also typically requires numerical methods to evaluate integrals that are available analytically under normality.

We primarily focused on the evaluation of (objective) Bayesian properties of the proposed methods. Further work is needed to evaluate their frequentist properties, for example, with a carefully planned simulation study (Morris et al. 2019). As in other recent studies (Muradchanian et al. 2021; Freuli et al. 2022), it would be interesting to simulate the realistic scenario of questionable research practices and publication bias affecting the original study to see how the adaptive downweighting of power priors can account for the inflated original results.

Acknowledgements

We thank Protzko et al. (2020) for publicly sharing their data. We thank Małgorzata Roos for helpful comments on a draft of the manuscript. We thank the associate editor and the two anonymous reviewers for many excellent comments and suggestions. This work was supported in part by an NWO Vici grant (016.Vici.170.083) to EJW, an Advanced ERC Grant (743086 UNIFY) to EJW, and a Swiss National Science Foundation mobility Grant (189295) to LH and SP.

Appendix A: Posterior distribution under the hierarchical model

Under the hierarchical model (15), the joint posterior conditional on a heterogeneity variance $\tau^2$ is given by

$$f(\theta_r, \theta_o, \theta \mid \hat{\theta}_o, \hat{\theta}_r, \tau^2) = \frac{\left\{\prod_{i \in \{o, r\}} \mathrm{N}(\hat{\theta}_i \mid \theta_i, \sigma_i^2)\, \mathrm{N}(\theta_i \mid \theta, \tau^2)\right\} k}{f(\hat{\theta}_o, \hat{\theta}_r \mid \tau^2)} \tag{24}$$

with normalizing constant

$$f(\hat{\theta}_o, \hat{\theta}_r \mid \tau^2) = \int\!\!\int\!\!\int \left\{\prod_{i \in \{o, r\}} \mathrm{N}(\hat{\theta}_i \mid \theta_i, \sigma_i^2)\, \mathrm{N}(\theta_i \mid \theta, \tau^2)\right\} k\, d\theta_o\, d\theta_r\, d\theta = \int \left\{\prod_{i \in \{o, r\}} \mathrm{N}(\hat{\theta}_i \mid \theta, \sigma_i^2 + \tau^2)\right\} k\, d\theta = k\, \mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_o^2 + \sigma_r^2 + 2\tau^2). \tag{25}$$

To obtain the marginal posterior distribution of the replication effect size $\theta_r$, we need to integrate $\theta_o$ and $\theta$ out of (24). This leads to

$$f(\theta_r \mid \hat{\theta}_o, \hat{\theta}_r, \tau^2) = \frac{\int\!\!\int \left\{\prod_{i \in \{o, r\}} \mathrm{N}(\hat{\theta}_i \mid \theta_i, \sigma_i^2)\, \mathrm{N}(\theta_i \mid \theta, \tau^2)\right\} k\, d\theta_o\, d\theta}{f(\hat{\theta}_o, \hat{\theta}_r \mid \tau^2)} = \frac{\mathrm{N}(\hat{\theta}_r \mid \theta_r, \sigma_r^2) \int \mathrm{N}(\theta_r \mid \theta, \tau^2)\, \mathrm{N}(\hat{\theta}_o \mid \theta, \sigma_o^2 + \tau^2)\, d\theta}{\mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_o^2 + \sigma_r^2 + 2\tau^2)} = \frac{\mathrm{N}(\hat{\theta}_r \mid \theta_r, \sigma_r^2)\, \mathrm{N}(\theta_r \mid \hat{\theta}_o, \sigma_o^2 + 2\tau^2)}{\mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_o^2 + \sigma_r^2 + 2\tau^2)}$$

which can be further simplified to the posterior given in (16).

When the heterogeneity variance $\tau^2$ is also assigned a prior distribution, the posterior distribution can be factorized into the posterior conditional on $\tau^2$ from (24) and the marginal posterior of $\tau^2$,

$$f(\tau^2, \theta_r, \theta_o, \theta \mid \hat{\theta}_o, \hat{\theta}_r) = f(\theta_r, \theta_o, \theta \mid \hat{\theta}_o, \hat{\theta}_r, \tau^2)\, f(\tau^2 \mid \hat{\theta}_o, \hat{\theta}_r).$$

Integrating $\theta_r$, $\theta_o$, and $\theta$ out of the joint posterior and using the previous result (25), the marginal posterior of $\tau^2$ is

$$f(\tau^2 \mid \hat{\theta}_o, \hat{\theta}_r) = \frac{\int\!\!\int\!\!\int \left\{\prod_{i \in \{o, r\}} \mathrm{N}(\hat{\theta}_i \mid \theta_i, \sigma_i^2)\, \mathrm{N}(\theta_i \mid \theta, \tau^2)\right\} k\, f(\tau^2)\, d\theta_o\, d\theta_r\, d\theta}{f(\hat{\theta}_o, \hat{\theta}_r)} = \frac{f(\hat{\theta}_r, \hat{\theta}_o \mid \tau^2)\, f(\tau^2)}{\int f(\hat{\theta}_r, \hat{\theta}_o \mid \tau^2)\, f(\tau^2)\, d\tau^2} = \frac{\mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_o^2 + \sigma_r^2 + 2\tau^2)\, f(\tau^2)}{\int \mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_o^2 + \sigma_r^2 + 2\tau^2)\, f(\tau^2)\, d\tau^2}.$$

Appendix B: Conditions for matching posteriors

For the marginal posteriors of $\theta_r$ and $\theta$ to match, it must hold for every $\theta = \theta_r$ that

$$f(\theta_r \mid \hat{\theta}_o, \hat{\theta}_r) = f(\theta \mid \hat{\theta}_o, \hat{\theta}_r) \iff \int_0^\infty f(\theta_r \mid \hat{\theta}_o, \hat{\theta}_r, \tau^2)\, f(\tau^2 \mid \hat{\theta}_o, \hat{\theta}_r)\, d\tau^2 = \int_0^1 f(\theta \mid \hat{\theta}_o, \hat{\theta}_r, \alpha)\, f(\alpha \mid \hat{\theta}_o, \hat{\theta}_r)\, d\alpha. \tag{26}$$

By applying the change of variables (18) or (19) to the left- or right-hand side of (26), the marginal posteriors conditional on $\tau^2$ and $\alpha$ match. It remains to investigate whether there are priors for $\tau^2$ and $\alpha$ such that the marginal posteriors of $\tau^2$ and $\alpha$ also match. The marginal posterior distribution of $\alpha$ is proportional to

$$f(\alpha \mid \hat{\theta}_o, \hat{\theta}_r) \propto f_\alpha(\alpha)\, \mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_r^2 + \sigma_o^2/\alpha).$$

After the change of variables $\tau^2 = (1/\alpha - 1)(\sigma_o^2/2)$, the marginal posterior becomes

$$f(\tau^2 \mid \hat{\theta}_o, \hat{\theta}_r) \propto f_\alpha\!\left(\frac{\sigma_o^2}{2\tau^2 + \sigma_o^2}\right)\frac{2\sigma_o^2}{(2\tau^2 + \sigma_o^2)^2}\, \mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_r^2 + \sigma_o^2 + 2\tau^2).$$

Since, as shown in Appendix A, the marginal posterior of $\tau^2$ under the hierarchical model is proportional to

$$f(\tau^2 \mid \hat{\theta}_o, \hat{\theta}_r) \propto f_{\tau^2}(\tau^2)\, \mathrm{N}(\hat{\theta}_r \mid \hat{\theta}_o, \sigma_r^2 + \sigma_o^2 + 2\tau^2),$$

the marginal posteriors of the effect sizes $\theta$ and $\theta_r$ match if

$$f_{\tau^2}(\tau^2) = f_\alpha\!\left(\frac{\sigma_o^2}{2\tau^2 + \sigma_o^2}\right)\frac{2\sigma_o^2}{(2\tau^2 + \sigma_o^2)^2}$$

holds for every $\tau^2 \geq 0$.

Appendix C: The generalized beta and F distributions

A random variable $X \sim \text{GBe}(a, b, \lambda)$ with density function

$$f(x \mid a, b, \lambda) = \frac{\lambda^a x^{a-1}(1 - x)^{b-1}}{B(a, b)\{1 - (1 - \lambda)x\}^{a+b}}\, \mathbb{1}_{[0,1]}(x) \tag{27}$$

follows a generalized beta distribution (in the parametrization of Libby and Novick 1982), with $\mathbb{1}_S(x)$ denoting the indicator function of $x$ being in the set $S$. A random variable $X \sim \text{GF}(a, b, \lambda)$ with density function

$$f(x \mid a, b, \lambda) = \frac{\lambda^a x^{a-1}}{B(a, b)(1 + \lambda x)^{a+b}}\, \mathbb{1}_{[0,\infty)}(x) \tag{28}$$

follows a generalized F distribution (in the parametrization of Pham-Gia and Duong 1989).

Software and data

The CC-BY Attribution 4.0 International licensed data were downloaded from https://osf.io/42ef9/. All analyses were conducted in the R programming language version 4.3.0 (R Core Team 2020). The code and data to reproduce this manuscript are available at https://github.com/SamCH93/ppReplication. A snapshot of the GitHub repository at the time of writing this article is archived at https://doi.org/10.5281/zenodo.6940237. We also provide an R package for estimation and testing under the power prior framework at https://CRAN.R-project.org/package=ppRep. The package can be installed by running install.packages("ppRep") from an R console.

Funding

Open access funding provided by University of Zurich.

Declarations

Conflict of interest

The authors have no conflicts of interest to declare.


References

1. Abramowitz M, Stegun IA, editors. Handbook of mathematical functions with formulas, graphs and mathematical tables. New York: Dover Publications Inc; 1965.
2. Anderson SF, Maxwell SE. Addressing the replication crisis: using original studies to design replication studies with appropriate statistical power. Multivar Behav Res. 2017;52(3):305–324. doi: 10.1080/00273171.2017.1289361.
3. Bayarri M, Mayoral A. Bayesian analysis and design for comparison of effect-sizes. J Stat Plan Inference. 2002;103(1–2):225–243. doi: 10.1016/s0378-3758(01)00223-3.
4. Bayarri MJ, Berger JO, Forte A, García-Donato G. Criteria for Bayesian model choice with application to variable selection. Ann Stat. 2012;40(3):1550–1577. doi: 10.1214/12-aos1013.
5. Bayarri MJ, Mayoral AM. Bayesian design of successful replications. Am Stat. 2002;56:207–214. doi: 10.1198/000313002155.
6. Berger JO, Delampady M. Testing precise hypotheses. Stat Sci. 1987. doi: 10.1214/ss/1177013238.
7. Best N, Price RG, Pouliquen IJ, Keene ON. Assessing efficacy in important subgroups in confirmatory trials: an example using Bayesian dynamic borrowing. Pharm Stat. 2021;20(3):551–562. doi: 10.1002/pst.2093.
8. Chen M-H, Ibrahim JG. The relationship between the power prior and hierarchical models. Bayesian Anal. 2006. doi: 10.1214/06-ba118.
9. Daniels MJ. A prior for the variance in hierarchical models. Can J Stat. 1999;27(3):567–578. doi: 10.2307/3316112.
10. De Santis F. Statistical evidence and sample size determination for Bayesian hypothesis testing. J Stat Plan Inference. 2004;124(1):121–144. doi: 10.1016/s0378-3758(03)00198-8.
11. Duan Y, Ye K, Smith EP. Evaluating water quality using power priors to incorporate historical information. Environmetrics. 2005;17(1):95–106. doi: 10.1002/env.752.
12. Etz A, Vandekerckhove J. A Bayesian perspective on the reproducibility project: psychology. PLoS ONE. 2016;11(2):e0149794. doi: 10.1371/journal.pone.0149794.
13. Freuli F, Held L, Heyard R. Replication success under questionable research practices—a simulation study. Stat Sci (to appear). 2022. doi: 10.31222/osf.io/s4b65.
14. Gelfand AE, Wang F. A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Stat Sci. 2002;17(2):193–208. doi: 10.1214/ss/1030550861.
15. Good IJ. Significance tests in parallel and in series. J Am Stat Assoc. 1958;53(284):799–813. doi: 10.1080/01621459.1958.10501480.
16. Gravestock I, Held L. Adaptive power priors with empirical Bayes for clinical trials. Pharm Stat. 2017;16(5):349–360. doi: 10.1002/pst.1814.
17. Gravestock I, Held L. Power priors based on multiple historical studies for binary outcomes. Biom J. 2019;61(5):1201–1218. doi: 10.1002/bimj.201700246.
18. Hedges LV, Schauer JM. More than one replication study is needed for unambiguous tests of replication. J Educ Behav Stat. 2019;44(5):543–570. doi: 10.3102/1076998619852953.
19. Hedges LV, Schauer JM. The design of replication studies. J R Stat Soc A Stat Soc. 2021;184(3):868–886. doi: 10.1111/rssa.12688.
20. Held L. A new standard for the analysis and design of replication studies (with discussion). J R Stat Soc A Stat Soc. 2020;183(2):431–448. doi: 10.1111/rssa.12493.
21. Held L, Micheloud C, Pawel S. The assessment of replication success based on relative effect size. Ann Appl Stat. 2022. doi: 10.1214/21-AOAS1502.
22. Held L, Sauter R. Adaptive prior weighting in generalized regression. Biometrics. 2017;73(1):242–251. doi: 10.1111/biom.12541.
23. Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med. 2002;21(11):1539–1558. doi: 10.1002/sim.1186.
24. Ibrahim JG, Chen M-H, Gwon Y, Chen F. The power prior: theory and applications. Stat Med. 2015;34(28):3724–3749. doi: 10.1002/sim.6728.
25. Jeffreys H. Theory of probability. 1st ed. Oxford: Clarendon Press; 1939.
26. Johnson VE, Payne RD, Wang T, Asher A, Mandal S. On the reproducibility of psychological science. J Am Stat Assoc. 2016;112(517):1–10. doi: 10.1080/01621459.2016.1240079.
27. Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc. 1995;90(430):773–795. doi: 10.1080/01621459.1995.10476572.
28. Kass RE, Wasserman L. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J Am Stat Assoc. 1995;90(431):928–934. doi: 10.1080/01621459.1995.10476592.
29. Libby DL, Novick MR. Multivariate generalized beta distributions with applications to utility assessment. J Educ Stat. 1982;7(4):271–294. doi: 10.3102/10769986007004271.
30. Ly A, Etz A, Marsman M, Wagenmakers E-J. Replication Bayes factors from evidence updating. Behav Res Methods. 2018;51(6):2498–2508. doi: 10.3758/s13428-018-1092-x.
31. Ly A, Wagenmakers E-J. Bayes factors for peri-null hypotheses. TEST. 2022;31(4):1121–1142. doi: 10.1007/s11749-022-00819-w.
32. Mathur MB, VanderWeele TJ. New statistical metrics for multisite replication projects. J R Stat Soc A Stat Soc. 2020;183(3):1145–1166. doi: 10.1111/rssa.12572.
33. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–2102. doi: 10.1002/sim.8086.
34. Muradchanian J, Hoekstra R, Kiers H, van Ravenzwaaij D. How best to quantify replication success? A simulation study on the comparison of replication success metrics. R Soc Open Sci. 2021;8(5):201697. doi: 10.1098/rsos.201697.
35. National Academies of Sciences, Engineering, and Medicine. Reproducibility and replicability in science. Washington, DC: National Academies Press; 2019. doi: 10.17226/25303.
36. Neuenschwander B, Branson M, Spiegelhalter DJ. A note on the power prior. Stat Med. 2009;28(28):3562–3566. doi: 10.1002/sim.3722.
37. Pawel S, Aust F, Held L, Wagenmakers E-J. Normalized power priors always discount historical data. Stat. 2023;12(1):e591. doi: 10.1002/sta4.591.
38. Pawel S, Consonni G, Held L. Bayesian approaches to designing replication studies. Psychol Methods (to appear). 2023. doi: 10.1037/met0000604.
39. Pawel S, Held L. Probabilistic forecasting of replication studies. PLoS ONE. 2020;15(4):e0231416. doi: 10.1371/journal.pone.0231416.
40. Pawel S, Held L. The sceptical Bayes factor for the assessment of replication success. J R Stat Soc Ser B (Stat Methodol). 2022. doi: 10.1111/rssb.12491.
41. Pham-Gia T, Duong Q. The generalized beta- and F-distributions in statistical modelling. Math Comput Model. 1989;12(12):1613–1625. doi: 10.1016/0895-7177(89)90337-3.
42. Protzko J, Krosnick J, Nelson LD, Nosek BA, Axt J, Berent M, Buttrick N, DeBell M, Ebersole CR, Lundmark S, MacInnis B, O'Donnell M, Perfecto H, Pustejovsky JE, Roeder SS, Walleczek J, Schooler J. High replicability of newly-discovered social-behavioral findings is achievable (preprint). 2020. doi: 10.31234/osf.io/n2a9x.
43. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2020. https://www.R-project.org/.
44. Röver C, Bender R, Dias S, Schmid CH, Schmidli H, Sturtz S, Weber S, Friede T. On weakly informative prior distributions for the heterogeneity parameter in Bayesian random-effects meta-analysis. Res Synth Methods. 2021;12(4):448–474. doi: 10.1002/jrsm.1475.
45. Schmidli H, Gsteiger S, Roychoudhury S, O'Hagan A, Spiegelhalter D, Neuenschwander B. Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics. 2014;70(4):1023–1032. doi: 10.1111/biom.12242.
46. Schönbrodt FD, Wagenmakers E-J. Bayes factor design analysis: planning for compelling evidence. Psychon Bull Rev. 2017;25(1):128–142. doi: 10.3758/s13423-017-1230-y.
47. Shen Y, Psioda MA, Ibrahim JG. BayesPPD: an R package for Bayesian sample size determination using the power and normalized power prior for generalized linear models. R J. 2023;14:335–351. doi: 10.32614/RJ-2023-016.
48. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian approaches to clinical trials and health-care evaluation. New York: Wiley; 2004.
49. van Aert RCM, van Assen MALM. Bayesian evaluation of effect size after replicating an original study. PLoS ONE. 2017;12(4):e0175302. doi: 10.1371/journal.pone.0175302.
50. Verhagen J, Wagenmakers E-J. Bayesian tests to quantify the result of a replication attempt. J Exp Psychol Gen. 2014;143:1457–1475. doi: 10.1037/a0036731.
51. Weiss R. Bayesian sample size calculations for hypothesis testing. J R Stat Soc Ser D (The Statistician). 1997;46(2):185–191. doi: 10.1111/1467-9884.00075.
