Research Synthesis Methods. 2022 Aug 7;14(1):99–116. doi: 10.1002/jrsm.1594

Robust Bayesian meta‐analysis: Model‐averaging across complementary publication bias adjustment methods

František Bartoš 1,2, Maximilian Maier 1,3, Eric‐Jan Wagenmakers 1, Hristos Doucouliagos 4,5, T D Stanley 4,5
PMCID: PMC10087723  PMID: 35869696

Abstract

Publication bias is a ubiquitous threat to the validity of meta‐analysis and the accumulation of scientific evidence. In order to estimate and counteract the impact of publication bias, multiple methods have been developed; however, recent simulation studies have shown the methods' performance to depend on the true data generating process, and no method consistently outperforms the others across a wide range of conditions. Unfortunately, when different methods lead to contradicting conclusions, researchers can choose those methods that lead to a desired outcome. To avoid the condition‐dependent, all‐or‐none choice between competing methods and conflicting results, we extend robust Bayesian meta‐analysis and model‐average across two prominent approaches to adjusting for publication bias: (1) selection models of p‐values and (2) models adjusting for small‐study effects. The resulting model ensemble weights the estimates and the evidence for the absence/presence of the effect from the competing approaches by the support they receive from the data. Applications, simulations, and comparisons to preregistered, multi‐lab replications demonstrate the benefits of Bayesian model‐averaging of complementary publication bias adjustment methods.

Keywords: Bayesian model‐averaging, meta‐analysis, PET‐PEESE, publication bias, selection models

1. INTRODUCTION

Meta‐analysis is essential to cumulative science. 1 However, a common concern in meta‐analysis is the overestimation of effect size due to publication bias, the preferential publishing of statistically significant studies. 2 , 3 , 4 , 5 , 6 In addition, this effect size exaggeration can be further increased by questionable research practices (QRPs), that is, researchers' tendency to manipulate their data in a way that increases the effect size and the evidence for an effect. 7 , 8 Indeed, descriptive surveys find that both problems are remarkably common. For example, John and colleagues 9 estimate that about 78% of researchers failed to disclose all dependent measures and around 36% stopped data collection after achieving a significant result* (but see Fiedler & Schwarz, 10 who argued that the survey by John and colleagues 9 overestimated the prevalence of QRPs).

The combined result of publication bias and questionable research practices is often viewed as a missing data problem: some studies are missing from the published research record because they did not reach a statistical significance criterion, while other estimates are observed only after being “massaged” by researchers. Unfortunately, a perfect solution to this missing data problem is impossible, since we can know neither the unreported results nor the precise mechanism of omission. Multiple methods have been proposed to adjust for likely publication bias based on observable patterns in the reported research record. 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 Each of these methods has been shown to thrive under different assumptions and simulation designs. 13 , 25 , 26 , 27

Because different methods can lead to different conclusions, some meta‐analysts suggest that we should not search for the “best” bias‐adjusted effect size estimate. Instead, they suggest that a multitude of bias‐adjusted effect size estimates should be offered as a sensitivity analysis for the original unadjusted value. 28 , 29 , 30 The new method proposed here is another useful tool to accommodate publication bias, and researchers are free to supplement it with other methods.

Researchers interested in obtaining better bias‐adjusted effect size estimates or selecting the most suitable set of methods for sensitivity analysis increasingly emphasize the importance of selecting an appropriate estimator conditional on the situation at hand. For instance, Hong & Reed 27 argue:

What is missing is something akin to a flow‐chart that would map observable characteristics to experimental results which the meta‐analyst could then use to select the best estimator for their situation. (p. 22)

And Carter and colleagues 25 write:

Therefore, we recommend that meta‐analysts in psychology focus on sensitivity analyses—that is, report on a variety of methods, consider the conditions under which these methods fail (as indicated by simulation studies such as ours), and then report how conclusions might change depending on which conditions are most plausible. (p. 115)

In practice, researchers seldom have knowledge about the data‐generating process nor do they have sufficient information to choose with confidence among the wide variety of proposed methods that aim to adjust for publication bias. Furthermore, this wide range of proposed methods often leads to contradictory conclusions. 25 The combination of uncertainty about the data‐generating process and the presence of conflicting conclusions can create a “breeding ground” for confirmation bias 31 : researchers may unintentionally select those methods that support the desired outcome. This freedom to choose can greatly inflate the rate of false positives, which can be a serious problem for conventional meta‐analysis methods.

An alternative approach is to integrate the different approaches, explicitly, and let the data determine the contribution of each model based on its relative predictive accuracy for the observed data. To implement this approach we extend the robust Bayesian meta‐analysis (RoBMA) framework outlined in Maier and colleagues. 13 The original RoBMA framework included selection models (operating on p‐values) that have been shown to work well even under high heterogeneity 25 , 32 (see also Guan and colleagues for earlier work 33 ). The extended RoBMA framework also includes PET‐PEESE, a method that adjusts for small‐study effects by modeling the relationship between the effect sizes and standard errors. 16 PET‐PEESE generally has low bias and performs well in applications. 25 , 34 By including both p‐value selection models as well as PET‐PEESE, the extended version of RoBMA can apply both models simultaneously and optimally, relative to the observed research record.

Below we first provide a brief introduction to the RoBMA framework. We use an example on precognition 35 to illustrate both the general model‐averaging methodology, RoBMA‐PSMA (PSMA: publication selection model‐averaging), and the way RoBMA‐PSMA combines multiple weight functions with PET‐PEESE. Second, we evaluate RoBMA‐PSMA through comparisons with findings from preregistered multi‐lab replications 34 and across more than a thousand simulation conditions employed by four different simulation studies. 27

2. ROBUST BAYESIAN META‐ANALYSIS: GENERAL BACKGROUND

Because the true data generating process is unknown (effect present vs. effect absent; fixed‐effect vs. random‐effects; no publication bias vs. publication bias; and how publication bias expresses itself), many different models can be specified. RoBMA‐PSMA accepts this multitude of models and uses Bayesian model‐averaging to combine the estimates from individual models based on how well each model predicts the data. 36 , 37 , 38 Consequently, the posterior plausibility for each individual model determines its contribution to the model‐averaged posterior distributions. 13 , 39 , 40

In this section, we provide a brief overview of Bayesian model‐averaging, the workhorse of RoBMA (for an in‐depth treatment see 38 , 40 , 41 ). First, the researcher needs to specify (1) the models H under consideration, that is, the probability of the data under the different parameter values that H allows, p(data | θ, H) (i.e., the likelihood), and (2) prior distributions for the model parameters θ, that is, the relative plausibility of the parameter values before observing the data, p(θ | H). In the case of meta‐analyses, the data are usually represented by the observed effect sizes (y_k) and their standard errors (se_k) from k = 1, …, K individual studies. For example, a fixed‐effect meta‐analytic model H0 assuming absence of the mean effect (i.e., μ = 0) and no across‐study heterogeneity (i.e., τ = 0) can be defined as:

\[
\mathcal{H}_0: \; \mu = 0, \; \tau = 0; \qquad p(\text{data} \mid \theta_0, \mathcal{H}_0): \; y_k \sim \text{Normal}(0, \text{se}_k), \tag{1}
\]

where θ_0 denotes the vector of parameters (μ and τ) belonging to model H0.

In contrast, a fixed‐effect meta‐analytic model H1 assuming the presence of the mean effect (i.e., μ ≠ 0) needs to also specify a prior distribution for μ, f(.):

\[
\mathcal{H}_1: \; \mu \sim f(\cdot), \; \tau = 0; \qquad p(\text{data} \mid \theta_1, \mathcal{H}_1): \; y_k \sim \text{Normal}(\mu, \text{se}_k). \tag{2}
\]

Once the models have been specified, Bayes' rule dictates how the observed data update the prior distributions to posterior distributions, for each model separately:

\[
p(\theta_0 \mid \mathcal{H}_0, \text{data}) = \frac{p(\theta_0 \mid \mathcal{H}_0)\, p(\text{data} \mid \theta_0, \mathcal{H}_0)}{p(\text{data} \mid \mathcal{H}_0)}, \qquad
p(\theta_1 \mid \mathcal{H}_1, \text{data}) = \frac{p(\theta_1 \mid \mathcal{H}_1)\, p(\text{data} \mid \theta_1, \mathcal{H}_1)}{p(\text{data} \mid \mathcal{H}_1)}, \tag{3}
\]

where the denominators denote the marginal likelihood, that is, the average probability of the data under a particular model. Specifically, marginal likelihoods are obtained by integrating the likelihood over the prior distribution for the model parameters:

\[
p(\text{data} \mid \mathcal{H}_0) = \int p(\text{data} \mid \theta_0, \mathcal{H}_0)\, p(\theta_0 \mid \mathcal{H}_0)\, \mathrm{d}\theta_0, \qquad
p(\text{data} \mid \mathcal{H}_1) = \int p(\text{data} \mid \theta_1, \mathcal{H}_1)\, p(\theta_1 \mid \mathcal{H}_1)\, \mathrm{d}\theta_1. \tag{4}
\]

Together with the likelihood, the prior parameter distribution determines the model's predictions. The marginal likelihood therefore quantifies a model's predictive performance in light of the observed data. Consequently, the marginal likelihood plays a pivotal role in model comparison and hypothesis testing. 42 The ratio of two marginal likelihoods is known as the Bayes factor (BF), 43 , 44 , 45 , 46 and it indicates the extent to which one model outpredicts another; in other words, it grades the relative support that the models receive from the data. For example, the Bayes factor that assesses the relative predictive performance of the fixed‐effect meta‐analytic model H0: μ = 0 to that of the fixed‐effect model H1: μ ≠ 0 is

\[
\text{BF}_{10} = \frac{p(\text{data} \mid \mathcal{H}_1)}{p(\text{data} \mid \mathcal{H}_0)}. \tag{5}
\]

The resulting BF10 represents the outcome of a Bayesian hypothesis test for the presence versus absence of an effect for the fixed‐effect meta‐analytic models. Unlike the p‐value in Neyman‐Pearson hypothesis testing, the BF value can be interpreted as a continuous measure of evidence. A BF10 value larger than 1 indicates support for the alternative hypothesis (in the numerator) and a value lower than 1 indicates support for the null hypothesis (in the denominator). As a general rule of thumb, Bayes factors between 1 and 3 (between 1 and 1/3) are regarded as anecdotal evidence, Bayes factors between 3 and 10 (between 1/3 and 1/10) are regarded as moderate evidence, and Bayes factors larger than 10 (smaller than 1/10) are regarded as strong evidence in favor of (against) a hypothesis (e.g., appendix I of Jeffreys 47 and Lee & Wagenmakers 48 p. 105). While this rule of thumb can aid interpretation, Bayes factors are inherently continuous measures of the strength of evidence and any attempt at discretization inevitably involves a loss of information.

Next, we incorporate the prior model probabilities that later allow us to weight the posterior model estimates by the posterior probabilities of the models under consideration. It is common practice to divide the prior model probability equally across the different model types, that is, p(H0) = p(H1) = 1/2. 36 , 40 , 49 , 50 To obtain the posterior model probabilities, we apply Bayes' rule one more time, now on the level of models instead of parameters:

\[
p(\mathcal{H}_0 \mid \text{data}) = \frac{p(\mathcal{H}_0)\, p(\text{data} \mid \mathcal{H}_0)}{p(\text{data})}, \qquad
p(\mathcal{H}_1 \mid \text{data}) = \frac{p(\mathcal{H}_1)\, p(\text{data} \mid \mathcal{H}_1)}{p(\text{data})}. \tag{6}
\]

The common denominator,

\[
p(\text{data}) = p(\text{data} \mid \mathcal{H}_0)\, p(\mathcal{H}_0) + p(\text{data} \mid \mathcal{H}_1)\, p(\mathcal{H}_1), \tag{7}
\]

ensures that the posterior model probabilities sum to one.

The relative predictive performance of the rival models determines the update from prior to posterior model probabilities; in other words, models that predict the data well receive a boost in posterior probability, and models that predict the data poorly suffer a decline. 45 , 51 Thus, the Bayes factor quantifies the degree to which the data change the prior model odds to posterior model odds:

\[
\underbrace{\frac{p(\text{data} \mid \mathcal{H}_1)}{p(\text{data} \mid \mathcal{H}_0)}}_{\text{Bayes factor}} =
\underbrace{\frac{p(\mathcal{H}_1 \mid \text{data})}{p(\mathcal{H}_0 \mid \text{data})}}_{\text{Posterior odds}} \Big/
\underbrace{\frac{p(\mathcal{H}_1)}{p(\mathcal{H}_0)}}_{\text{Prior odds}}. \tag{8}
\]

We can combine the posterior parameter distributions from the two fixed‐effect meta‐analytic models by weighting the distributions according to the posterior model probabilities (e.g., Wrinch & Jeffreys, 46 p. 387 and Jeffreys, 52 p. 222). The resulting model‐averaged posterior distribution can be defined as a mixture distribution,

\[
p(\theta \mid \text{data}) = p(\theta_0 \mid \mathcal{H}_0, \text{data})\, p(\mathcal{H}_0 \mid \text{data}) + p(\theta_1 \mid \mathcal{H}_1, \text{data})\, p(\mathcal{H}_1 \mid \text{data}). \tag{9}
\]
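To make Equations (5)–(9) concrete, the short sketch below works through the two‐model case in R. The marginal likelihoods and the per‐model posterior mean are made‐up numbers used only for illustration; they do not correspond to any analysis reported in this paper.

```r
# Hypothetical two-model example of Bayesian model-averaging (illustrative numbers only)
marg_lik <- c(H0 = 0.010, H1 = 0.025)    # marginal likelihoods, Equation (4)
prior    <- c(H0 = 0.5,   H1 = 0.5)      # prior model probabilities

BF10 <- marg_lik[["H1"]] / marg_lik[["H0"]]          # Equation (5): BF10 = 2.5
post <- marg_lik * prior / sum(marg_lik * prior)     # Equations (6)-(7): 0.286 and 0.714

# Model-averaged effect size estimate, Equation (9):
# H0 fixes mu at 0; suppose the posterior mean of mu under H1 is 0.20 (made up).
mu_hat            <- c(H0 = 0, H1 = 0.20)
mu_model_averaged <- sum(post * mu_hat)              # 0.714 * 0.20 ≈ 0.143
```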

In RoBMA, the overall model ensemble is constructed from eight model types that represent the combination of the presence/absence of the effect, heterogeneity, and publication bias (modeled with two types of selection models in the original version of RoBMA 13 ). With more than two models in play, Equations (5) and (9) can be expanded to accommodate the additional models. Specifically, the inclusion Bayes factor can be defined as a comparison between sets of models. For example, BF10 quantifies the evidence for presence versus absence of the effect by the change from prior to posterior odds for the set of models that include the effect versus the set of models that exclude the effect:

\[
\underbrace{\text{BF}_{10}}_{\substack{\text{Inclusion Bayes factor} \\ \text{for effect}}} =
\underbrace{\frac{\sum_{i \in I} p(\mathcal{H}_i \mid \text{data})}{\sum_{j \in J} p(\mathcal{H}_j \mid \text{data})}}_{\substack{\text{Posterior inclusion odds} \\ \text{for models assuming effect}}} \Big/
\underbrace{\frac{\sum_{i \in I} p(\mathcal{H}_i)}{\sum_{j \in J} p(\mathcal{H}_j)}}_{\substack{\text{Prior inclusion odds} \\ \text{for models assuming effect}}}, \tag{10}
\]

where i ∈ I refers to models that include the effect and j ∈ J refers to models that exclude the effect. 38 , 40 In the same way, we can also assess the relative predictive performance of any model compared to the rest of the ensemble.

Finally, the model‐averaged posterior distribution of θ is defined as a mixture distribution of the posterior distributions of θ from each model Hn weighted by the posterior model probabilities,

\[
p(\theta \mid \text{data}) = \sum_{n=1}^{N} p(\theta_n \mid \mathcal{H}_n, \text{data})\, p(\mathcal{H}_n \mid \text{data}). \tag{11}
\]

To complete the model‐averaged ensemble with multiple models corresponding to each component (e.g., two weight functions as a way of adjusting for publication bias in the original RoBMA), we maintain prior indifference towards each hypothesis (e.g., presence/absence of the effect) by setting the prior model probabilities of all models that compose each side of the hypothesis to sum to 1/2. Often, the data contain enough information to assign posterior model probabilities to a class of similar models, largely washing out the effect of the prior model probabilities on the model‐averaged posterior distribution. If the data do not contain enough information, the model‐averaged posterior distribution will be more strongly affected by the choice of prior model probabilities. Researchers with diverging views on the plausibility of different models can modify these prior model probabilities (e.g., by decreasing the prior model probabilities of fixed‐effect models, but see 50 ).

In contrast to classical meta‐analytic statistics, the advantages of the Bayesian approach outlined above are that RoBMA can: (1) provide evidence for the absence of an effect (and therefore distinguish between “absence of evidence” and “evidence of absence”) 53 , 54 ; (2) update meta‐analytic knowledge sequentially, thus addressing recent concern about accumulation bias 55 ; (3) incorporate expert knowledge; (4) retain and incorporate all uncertainty about parameters and models, without the need to make all‐or‐none choices; (5) emphasize the model outcomes that are most supported by the data, allowing it to flexibly adapt to scenarios with high heterogeneity and small sample sizes.

3. PUBLICATION BIAS ADJUSTMENT METHOD 1: SELECTION MODELS

One class of publication bias correction methods are selection models. 11 , 12 , 32 , 56 , 57 In general, selection models estimate the relative probability that studies with p‐values within pre‐specified intervals were published, as well as the corrected meta‐analytic effect size. In other words, they directly account for the missing data, based on the modeled relation between statistical significance and the probability of publication. Selection models differ mostly in the specified weight function (such as 3PSM and 4PSM 11 and AK1 and AK2 20 ), or are fit only to the statistically significant results (e.g., p‐curve 18 and p‐uniform 19 ).
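In generic form, selection models of this kind replace the usual normal likelihood of a study with a weighted density. The sketch below shows this general, textbook structure of the weighted likelihood; the exact parameterization used in RoBMA may differ in detail:

\[
p(y_k \mid \mu, \tau, \omega) \;=\;
\frac{\text{Normal}\!\left(y_k \,\middle|\, \mu, \sqrt{\tau^2 + \text{se}_k^2}\right)\, \omega(p_k)}
{\int \text{Normal}\!\left(y \,\middle|\, \mu, \sqrt{\tau^2 + \text{se}_k^2}\right)\, \omega\!\left(p(y)\right)\, \mathrm{d}y},
\]

where ω(·) is a step weight function that assigns each p‐value interval its relative probability of publication and p(y) denotes the p‐value implied by a hypothetical effect size estimate y with standard error se_k.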

Selection models based on p‐values are attractive for several reasons. First, the models provide a plausible account of the data generating process—statistically non‐significant studies are less likely to be published than statistically significant studies. 2 , 4 , 6 Second, in recent simulation studies the unrestricted versions of selection models performed relatively well. 25 , 27

Selection models can be specified flexibly according to the assumed publication process. For example, we can distinguish between two‐sided selection (i.e., significant studies are published regardless of the direction of the effect) and one‐sided selection (i.e., only significant studies in the “correct” direction are preferentially reported). In the previous implementation of RoBMA, the selection models assumed two‐sided selection, either at a p‐value cutoff of 0.05 or also at a marginally significant cutoff of 0.10. 13 In this paper, we extend RoBMA by adding four weight functions that encompass more ways in which the selection process might operate. The added weight functions assume one‐sided selection for positive effect sizes, with cutoffs at significant, marginally significant, and/or p‐values corresponding to the expected direction of the effect size. Overall, the six included weight functions are (a minimal sketch of one such step weight function follows the list):

  1. Two‐sided (already included in RoBMA)
    1. p‐value cutoffs = 0.05;
    2. p‐value cutoffs = 0.05 & 0.10.
  2. One‐sided (new in RoBMA‐PSMA)
    1. p‐value cutoffs = 0.05;
    2. p‐value cutoffs = 0.025 & 0.05;
    3. p‐value cutoffs = 0.05 & 0.50;
    4. p‐value cutoffs = 0.025 & 0.05 & 0.50.
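To illustrate how such a step weight function operates, the sketch below evaluates a one‐sided weight function with cutoffs at p = 0.05 and p = 0.50 for a few one‐sided p‐values. The weights ω used here are hypothetical, expressed relative to the interval of significant results; in RoBMA the weights are assigned cumulative Dirichlet priors (cf. Table 1) and estimated from the data.

```r
# Sketch of a one-sided step weight function with cutoffs at p = .05 and p = .50
# (hypothetical weights; omega gives the relative publication probability per interval)
weight_one_sided <- function(p_one_sided, omega = c(1, 0.4, 0.1)) {
  interval <- findInterval(p_one_sided, vec = c(0.05, 0.50)) + 1   # interval 1, 2, or 3
  omega[interval]
}

weight_one_sided(c(0.01, 0.20, 0.80))   # returns 1.0 0.4 0.1
```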

3.1. Example—Feeling the future

We illustrate this extended version of RoBMA on studies from the infamous 2011 article “Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect.” 35 Across a series of nine experiments, Bem 35 attempted to show that participants are capable of predicting the future through the anomalous process of precognition. In response to a methodological critique, Bem and colleagues 60 later conducted a meta‐analysis on the nine reported experiments in order to demonstrate that the experiments jointly contained strong support for existence of the effect. Publication of such an implausible result in the flagship journal of social psychology ignited an intense debate about replicability, publication bias, and questionable research practices in psychology. 7

We analyze the data as reported by Bem and colleagues 60 in their Table 1, using the updated version of the RoBMA R package. 61 For illustration, we specify the publication bias adjustment part of the ensemble with the six weight functions outlined above. We use the default prior distributions for the effect size and heterogeneity (standard normal and inverse‐gamma, respectively, as in the original version of RoBMA; see Appendix B (Data S1) for details). Internally, the package transforms the priors, the supplied Cohen's d values, and their standard errors to the Fisher's z scale. § The estimates are transformed back to the Cohen's d scale for ease of interpretation. R code and data for reproducibility are available on OSF at https://osf.io/fgqpc/.
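The transformation from Cohen's d to Fisher's z mentioned above follows standard effect size conversion formulas; a minimal sketch (assuming equal group sizes, and not the package's internal code) is:

```r
# Sketch of the Cohen's d -> Fisher's z conversion (standard formulas, equal group sizes)
d_to_z <- function(d, se_d) {
  r    <- d / sqrt(d^2 + 4)       # Cohen's d -> correlation coefficient
  z    <- atanh(r)                # correlation -> Fisher's z
  se_z <- se_d / sqrt(d^2 + 4)    # delta-method standard error
  data.frame(z = z, se_z = se_z)
}

d_to_z(d = 0.25, se_d = 0.10)     # z ≈ 0.125, se_z ≈ 0.050
```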

Our results do not provide notable evidence either for or against the presence of the anomalous effect of precognition: the model‐averaged Bayes factor equals BF10 = 1.91 and the posterior model‐averaged mean estimate is μ = 0.097, 95% CI [0.000, 0.232]. ** Figure 1 shows the posterior model‐averaged estimated weights with a re‐scaled x‐axis for easier readability. Because the meta‐analysis is based on only nine estimates, the uncertainty in the estimated weights is relatively high.

FIGURE 1 The model‐averaged weight function with 95% CI for Bem. 35 Results are model‐averaged across the whole model ensemble, including models assuming no publication bias (ω = 1)

These results are an improvement over the original RoBMA implementation (with only two two‐sided weight functions), which showed strong support for the effect: BF10 = 97.89, μ = 0.149, 95% CI [0.053, 0.240]. The substantial difference in conclusions between the original RoBMA and the RoBMA with four additional weight functions is due to the inclusion of the one‐sided selection models, which seem to provide a better description of the Bem studies. More importantly, with the additional four weight functions RoBMA provides only moderate evidence for the effect even when adopting the N(0, 0.3042) prior distribution for effect size recommended by Bem and colleagues 60 : BF10 = 5.59, μ = 0.122, 95% CI [0.000, 0.234].

3.2. Limitations of selection models

While selection models have been shown to perform well in comparison to other methods in simulation studies, 25 , 32 they often insufficiently adjust for publication bias when applied to actual meta‐analytic data. 34 , 64 This discrepancy arises because the simulation studies assume that the selection model is an accurate reflection of the true data‐generating process; that is, the synthetic data obey a selection process in which the probability of publication is (a) based solely on p‐values rather than on effect sizes and (b) constant within the pre‐specified p‐value intervals. The simulation studies largely ignore the possibility of model misspecification and therefore provide an upper bound on model performance. 25 A key strength of the Bayesian model‐averaging approach is that it can incorporate any number of models, increasing robustness and decreasing the potentially distorting effects of model misspecification. Therefore, we extend RoBMA with another method that adjusts for publication bias in an entirely different way: PET‐PEESE. 16

4. PUBLICATION BIAS ADJUSTMENT METHOD 2: PET‐PEESE

A prominent class of alternatives to the selection models outlined above are methods that adjust for publication bias by correcting for small‐study effects, that is, by estimating the relationship between effect sizes and their standard errors. 14 The most well‐known approaches include trim and fill 17 and PET‐PEESE. 16 Here, we focus only on PET‐PEESE since its regression‐based framework, which fits the model to all observed studies, allows us to compare its model fit directly to the selection model‐based approaches.

The PET‐PEESE method is an attractive addition to the RoBMA methodology since it often performs better than selection models in meta‐analytic applications 34 (for applications in the fields of ego depletion and antidepressant effectiveness, see Carter and colleagues 65 and Moreno and colleagues, 66 respectively). PET‐PEESE is a conditional (two‐step) estimator composed of two models: the PET model (Precision Effect Test), which is correctly specified when the effect is absent, and the PEESE model (Precision Effect Estimate with Standard Error), which provides a better approximation when the effect is present. 16 The individual PET and PEESE models are the linear and the quadratic meta‐regression approximations, respectively, to the incidentally truncated selection model. 16 The choice between the PET and the PEESE model proceeds as follows: the test of the effect size coefficient from PET (with α = 0.10 used for model selection only) decides whether the PET (p > α) or the PEESE (p < α) effect size estimator is employed. 67
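For reference, the conditional logic of the classical (frequentist) PET‐PEESE estimator described above can be sketched as two weighted least squares regressions; this is only an illustration of the original estimator, not the Bayesian implementation used inside RoBMA (described next).

```r
# Sketch of the classical conditional PET-PEESE estimator (illustration only)
pet_peese <- function(y, se, alpha = 0.10) {
  pet   <- lm(y ~ se,      weights = 1 / se^2)   # PET: effect sizes regressed on standard errors
  peese <- lm(y ~ I(se^2), weights = 1 / se^2)   # PEESE: effect sizes regressed on variances
  p_pet <- summary(pet)$coefficients["(Intercept)", "Pr(>|t|)"]
  # If the PET test of the effect size coefficient (the intercept) is not significant
  # at alpha, report the PET estimate; otherwise switch to the PEESE estimate.
  if (p_pet > alpha) {
    list(model = "PET",   estimate = unname(coef(pet)["(Intercept)"]))
  } else {
    list(model = "PEESE", estimate = unname(coef(peese)["(Intercept)"]))
  }
}
```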

In order to add the PET and PEESE models as a way of adjusting for publication bias within RoBMA, we modify them in the following way. Instead of following the conditional selection of either PET or PEESE proposed by Stanley & Doucouliagos, 16 we include both the PET and the PEESE models, separately, in the RoBMA ensemble (alongside the weight function models) and model‐average over the entire ensemble. Furthermore, instead of using an unrestricted weighted least squares estimator, 16 we specify both fixed‐effect and random‐effects versions of these models, for consistency with the remaining RoBMA models. Consequently, the PET and PEESE models implemented in RoBMA correspond to meta‐regressions of the effect sizes on either the standard errors or the variances, in conventional fixed‐effect and random‐effects flavors (see Equation (2) in Appendix A (Data S1)).

In sum, we created a new RoBMA ensemble adjusting for publication bias using PET and PEESE models. Instead of using the model estimates conditionally, we model‐average across the fixed‐ and random‐effects PET and PEESE models assuming either absence or presence of the effect and the corresponding fixed‐ and random‐effects models without publication bias adjustment.

4.1. Example—Feeling the future

We revisit the Bem 35 example. For illustration, we now specify only the PET and PEESE models as the publication bias adjustment part of the RoBMA ensemble (the six weight functions specified above are added back in the next section). Again, we use the RoBMA package with the same default priors for effect size and heterogeneity; we assign Cauchy(0, 1) and Cauchy(0, 5) priors restricted to the positive range to the regression coefficients on the standard errors and the variances, respectively (see Appendix B (Data S1) for details).

The RoBMA version that model‐averages across the PET and PEESE models provides moderate evidence for the absence of an effect, BF10 = 0.226 (the reciprocal, quantifying the evidence for the null hypothesis, is BF01 = 4.42), with a posterior model‐averaged mean estimate of μ = 0.013, 95% CI [−0.078, 0.197]. Figure 2 shows the estimated relationship between the standard errors and the effect sizes, where the effect size at a standard error of 0 corresponds to the posterior model‐averaged bias‐corrected estimate.

FIGURE 2 The relationship between the standard errors and the model‐averaged effect size estimate with 95% CI for Bem. 35 Results are model‐averaged across the entire model ensemble. Models assuming no publication bias have both PET and PEESE coefficients set to 0. Black diamonds correspond to the individual study estimates and standard errors

In this application, these results seem to provide an even better adjustment than the RoBMA version with selection models discussed previously. Furthermore, the RoBMA ensemble with PET‐PEESE models resolves a seeming inconsistency in the original conditional PET‐PEESE estimator. The frequentist PET model resulted in a significant negative effect size, μ = −0.182, t(7) = −3.65, p = 0.008, indicating that the effect size estimate from PEESE should be used, μ = 0.024, t(7) = 0.86, p = 0.418, which, however, is not notably different from zero. In addition, the RoBMA ensemble with PET‐PEESE does not provide evidence for precognition even under the more informed N(0, 0.3042) prior distribution for effect size recommended by Bem and colleagues 60 : BF10 = 0.670, μ = 0.028, 95% CI [−0.160, 0.215]. However, under this more informed prior, the data no longer provide moderate evidence against precognition.

4.2. Limitations of PET‐PEESE

While PET‐PEESE shows less bias and overestimation than other bias correction methods, 34 its key limitation is that its estimates can be highly variable. In simulation studies, PET‐PEESE can have a high RMSE (root mean square error). 13 , 25 , 27 Therefore, when PET‐PEESE based models are applied to an area of research for which they are ill‐suited, the resulting estimates may be inaccurate and unreliable. Stanley 67 shows that the performance of PET‐PEESE can be especially problematic at very high levels of heterogeneity (τ ≥ 0.5), with a low number of studies (i.e., k ≤ 10), and under uniformly low power.

5. COMBINING SELECTION MODELS AND PET‐PEESE

In order to obtain the best of both PET‐PEESE and selection models, we combine them into an overarching ensemble: RoBMA‐PSMA. Specifically, RoBMA‐PSMA includes the six weight functions outlined in the section “Publication Bias Adjustment Method 1: Selection Models” (assuming either presence or absence of the effect and of heterogeneity, this yields 24 models) as well as the PET and PEESE regression models outlined in the section “Publication Bias Adjustment Method 2: PET‐PEESE” (assuming either presence or absence of the effect and of heterogeneity, this yields 8 models). We set the prior probability of the publication bias‐adjusted models to 0.5 13 and divide this 0.5 probability equally across the selection models and the PET‐PEESE models (p. 47). 68 Finally, adding the models assuming absence of publication bias (assuming either presence or absence of the effect and of heterogeneity, this yields 4 models) results in a total of 24 + 8 + 4 = 36 models that together comprise RoBMA‐PSMA. The entire model ensemble is summarized in Table 1.
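The prior model probabilities shown in Table 1 follow directly from this division; a short sketch of the arithmetic (values rounded as in the table):

```r
# Prior model probabilities of the RoBMA-PSMA ensemble (cf. the prior prob. column of Table 1)
p_no_bias   <- 0.50 / 4          # 4 models without bias adjustment -> 0.125 each
p_selection <- (0.50 / 2) / 24   # 24 selection models              -> ~0.0104 each (0.010 in Table 1)
p_pet_peese <- (0.50 / 2) / 8    # 8 PET and PEESE models           -> 0.03125 each (0.031 in Table 1)
```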

TABLE 1.

RoBMA‐PSMA model ensemble together with prior parameter distributions (columns 1–3), prior model probabilities (column 4), and posterior model probabilities (column 5) based on an application to the data from Bem 35

i Effect size Heterogeneity Publication bias Prior prob. Posterior prob.
1 μ = 0 τ = 0 None 0.125 0.000
2 μ = 0 τ = 0 ω Two‐sided(0.05) ∼ CumDirichlet(1, 1) 0.010 0.000
3 μ = 0 τ = 0 ω Two‐sided(0.1,0.05) ∼ CumDirichlet(1, 1, 1) 0.010 0.000
4 μ = 0 τ = 0 ω One‐sided(0.05) ∼ CumDirichlet(1, 1) 0.010 0.012
5 μ = 0 τ = 0 ω One‐sided(0.05,0.025) ∼ CumDirichlet(1, 1, 1) 0.010 0.034
6 μ = 0 τ = 0 ω One‐sided(0.5,0.05) ∼ CumDirichlet(1, 1, 1) 0.010 0.001
7 μ = 0 τ = 0 ω One‐sided(0.5,0.05,0.025) ∼ CumDirichlet(1, 1, 1, 1) 0.010 0.004
8 μ = 0 τ = 0 PET ∼ Cauchy(0, 1)⁺ 0.031 0.281
9 μ = 0 τ = 0 PEESE ∼ Cauchy(0, 5)⁺ 0.031 0.254
10 μ = 0 τ ∼ InvGamma(1, 0.15) None 0.125 0.000
11 μ = 0 τ ∼ InvGamma(1, 0.15) ω Two‐sided(0.05) ∼ CumDirichlet(1, 1) 0.010 0.000
12 μ = 0 τ ∼ InvGamma(1, 0.15) ω Two‐sided(0.1,0.05) ∼ CumDirichlet(1, 1, 1) 0.010 0.000
13 μ = 0 τ ∼ InvGamma(1, 0.15) ω One‐sided(0.05) ∼ CumDirichlet(1, 1) 0.010 0.014
14 μ = 0 τ ∼ InvGamma(1, 0.15) ω One‐sided(0.05,0.025) ∼ CumDirichlet(1, 1, 1) 0.010 0.020
15 μ = 0 τ ∼ InvGamma(1, 0.15) ω One‐sided(0.5,0.05) ∼ CumDirichlet(1, 1, 1) 0.010 0.006
16 μ = 0 τ ∼ InvGamma(1, 0.15) ω One‐sided(0.5,0.05,0.025) ∼ CumDirichlet(1, 1, 1, 1) 0.010 0.010
17 μ = 0 τ ∼ InvGamma(1, 0.15) PET ∼ Cauchy(0, 1)⁺ 0.031 0.021
18 μ = 0 τ ∼ InvGamma(1, 0.15) PEESE ∼ Cauchy(0, 5)⁺ 0.031 0.017
19 μ ∼ Normal(0, 1) τ = 0 None 0.125 0.051
20 μ ∼ Normal(0, 1) τ = 0 ω Two‐sided(0.05) ∼ CumDirichlet(1, 1) 0.010 0.007
21 μ ∼ Normal(0, 1) τ = 0 ω Two‐sided(0.1,0.05) ∼ CumDirichlet(1, 1, 1) 0.010 0.031
22 μ ∼ Normal(0, 1) τ = 0 ω One‐sided(0.05) ∼ CumDirichlet(1, 1) 0.010 0.030
23 μ ∼ Normal(0, 1) τ = 0 ω One‐sided(0.05,0.025) ∼ CumDirichlet(1, 1, 1) 0.010 0.035
24 μ ∼ Normal(0, 1) τ = 0 ω One‐sided(0.5,0.05) ∼ CumDirichlet(1, 1, 1) 0.010 0.018
25 μ ∼ Normal(0, 1) τ = 0 ω One‐sided(0.5,0.05,0.025) ∼ CumDirichlet(1, 1, 1, 1) 0.010 0.022
26 μ ∼ Normal(0, 1) τ = 0 PET ∼ Cauchy(0, 1)⁺ 0.031 0.047
27 μ ∼ Normal(0, 1) τ = 0 PEESE ∼ Cauchy(0, 5)⁺ 0.031 0.046
28 μ ∼ Normal(0, 1) τ ∼ InvGamma(1, 0.15) None 0.125 0.007
29 μ ∼ Normal(0, 1) τ ∼ InvGamma(1, 0.15) ω Two‐sided(0.05) ∼ CumDirichlet(1, 1) 0.010 0.001
30 μ ∼ Normal(0, 1) τ ∼ InvGamma(1, 0.15) ω Two‐sided(0.1,0.05) ∼ CumDirichlet(1, 1, 1) 0.010 0.003
31 μ ∼ Normal(0, 1) τ ∼ InvGamma(1, 0.15) ω One‐sided(0.05) ∼ CumDirichlet(1, 1) 0.010 0.005
32 μ ∼ Normal(0, 1) τ ∼ InvGamma(1, 0.15) ω One‐sided(0.05,0.025) ∼ CumDirichlet(1, 1, 1) 0.010 0.005
33 μ ∼ Normal(0, 1) τ ∼ InvGamma(1, 0.15) ω One‐sided(0.5,0.05) ∼ CumDirichlet(1, 1, 1) 0.010 0.003
34 μ ∼ Normal(0, 1) τ ∼ InvGamma(1, 0.15) ω One‐sided(0.5,0.05,0.025) ∼ CumDirichlet(1, 1, 1, 1) 0.010 0.004
35 μ ∼ Normal(0, 1) τ ∼ InvGamma(1, 0.15) PET ∼ Cauchy(0, 1)⁺ 0.031 0.004
36 μ ∼ Normal(0, 1) τ ∼ InvGamma(1, 0.15) PEESE ∼ Cauchy(0, 5)⁺ 0.031 0.004

Note: μ corresponds to the effect size parameter, τ to the heterogeneity parameter, ω to the weight parameters with the appropriate selection process (either one‐ or two‐sided with the given cutoffs), PET to the regression coefficient on the standard errors, and PEESE to the regression coefficient on the variances. All prior distributions are specified on the Cohen's d scale; the superscript ⁺ denotes truncation to positive values (cf. the Cauchy priors restricted to the positive range described in the text).

As mentioned above, RoBMA‐PSMA draws inference about the data by considering all models simultaneously. Specific inferences can be obtained by interrogating the model ensemble and focusing on different model classes. Concretely, the evidence for presence versus absence of the effect is quantified by the inclusion Bayes factor BF10 (Equation (10)), obtained by comparing the predictive performance of models assuming the effect is present (i = 19, …, 36 in Table 1) to that of models assuming the effect is absent (j = 1, …, 18 in Table 1). In the Bem example, substituting the prior and posterior model probabilities from Table 1 yields BF10 = 0.479. This Bayes factor indicates that the posterior inclusion odds for the models assuming the effect is present are slightly lower than the prior inclusion odds. In other words, models assuming that the effect is absent predicted the data about 1/0.479 ≈ 2.09 times better than models assuming the effect is present. This result aligns with the common scientific understanding of nature, which the presence of precognition would effectively overturn.
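This inclusion Bayes factor can be reproduced directly from the prior and posterior model probability columns of Table 1; the sketch below shows the computation (the small discrepancy from the reported BF10 = 0.479 is due to rounding of the tabulated probabilities).

```r
# Inclusion Bayes factor (Equation 10) from prior and posterior model probabilities
inclusion_BF <- function(prior, posterior, included) {
  post_odds  <- sum(posterior[included]) / sum(posterior[-included])
  prior_odds <- sum(prior[included])     / sum(prior[-included])
  post_odds / prior_odds
}
# With 'prior' and 'posterior' holding the 36 probabilities from Table 1 (rows 1-36),
# the models assuming the presence of the effect are rows 19-36:
#   inclusion_BF(prior, posterior, included = 19:36)
# Summing the rounded posterior column gives ~0.32 for the effect models, so with equal
# prior inclusion odds the result is ~0.32 / 0.68 ≈ 0.48 (0.479 before rounding).
```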

The remaining Bayes factors are calculated similarly. The Bayes factor for the presence versus absence of heterogeneity, BF rf , compares the predictive accuracy of models assuming heterogeneity (i = 10, …, 18 and 28, …, 36 in Table 1) with models assuming homogeneity (j = 1, …, 9 and 19, …, 27 in Table 1). With BF rf  = 0.144, the data disfavor the models assuming heterogeneity; that is, the data are BF fr  = 1/0.144 ≈ 6.94 times more likely to occur under homogeneity than under heterogeneity. Analogously, the Bayes factor for the presence versus absence of publication bias, BF pb , compares the predictive performance of models assuming publication bias is present (i = 2, …, 9, 11, …, 18, 20, …, 27, and 29, …, 36 in Table 1) to that of models assuming publication bias is absent (j = 1, 10, 19, and 28 in Table 1). Here, the results show strong support in favor of the models assuming publication bias is present, BF pb  = 16.31.

Model‐averaging can also be used to compare the different types of publication bias adjustment methods. Specifically, the predictive performance of the selection models (i = 2, …, 7, 11, …, 16, 20, …, 25, and 29, …, 34) may be contrasted with that of the PET‐PEESE models (j = 8, 9, 17, 18, 26, 27, 35, and 36), yielding BF = 0.397 (cf. Equation (10)); this result indicates that the posterior probability increases more for the PET‐PEESE models (0.25 → 0.675) than it does for the selection models (0.25 → 0.268), with the one‐sided selection models being better supported by the data (0.166 → 0.225) than the two‐sided selection models (0.083 → 0.043). However, this Bayes factor only modestly favors the PET‐PEESE models, and consequently the results from the selection models also contribute substantially towards the final posterior model‐averaged estimate.

Predictive performance of individual models may be contrasted to that of the rest of the ensemble (cf. Equation (10)). For Bem, 35 the data most strongly supported the PET and PEESE models assuming no effect and no heterogeneity, BF = 12.13 and BF = 10.57, respectively—the corresponding model probabilities increased from 0.031 to 0.280 and 0.255.

The posterior model‐averaged effect size estimate μ is obtained by combining the 36 estimates across all models in the ensemble, weighted according to their posterior model probabilities. Some of the models assume the effect is absent, and concentrate all prior probability mass on μ = 0; therefore, the model‐averaged posterior distribution is a combination of a “spike” at 0 and a mixture of continuous probability densities that correspond to the alternative models. When the alternative models are strongly supported by the data, the impact of the spike is minimal and the model‐averaged posterior distribution reduces to a mixture of continuous densities. In the Bem 35 example, RoBMA‐PSMA gives a posterior model‐averaged mean estimate μ = 0.038, 95% CI [−0.034, 0.214] (cf. Equation (9)). The posterior model‐averaged estimates for the remaining parameters, for example, the heterogeneity estimate τ or the publication weights ω, are obtained similarly.

The overall results would, again, remain similar even when using Bem and colleagues' 60 more informed prior distribution for effect size, N(0, 0.3042): BF10 = 1.41, μ = 0.067, 95% CI [−0.111, 0.226]. These results are in line with failed replication studies, 69 , 70 , 71 , 72 evidence of QRPs, 73 , 74 , 75 and common sense 76 ; see also 60 , 77 , 78 , 79 , 80 .
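For readers who want to run an analysis of this kind themselves, a minimal sketch of a RoBMA‐PSMA call with the RoBMA R package is shown below. The argument names and the "PSMA" model‐type shortcut reflect our reading of the package interface and should be checked against the current package documentation; the exact scripts used for this example are available at https://osf.io/fgqpc/.

```r
# Sketch of a RoBMA-PSMA analysis with the RoBMA R package (argument names are
# assumptions; see the package documentation and https://osf.io/fgqpc/ for the exact code)
library(RoBMA)

# Cohen's d effect sizes and standard errors of the individual studies
# (placeholder values; the actual data are available in the OSF repository)
d  <- c(0.25, 0.20, 0.26)
se <- c(0.10, 0.10, 0.11)

fit <- RoBMA(d = d, se = se, model_type = "PSMA", seed = 1)

summary(fit)                     # inclusion Bayes factors and model-averaged estimates
summary(fit, type = "models")    # per-model prior and posterior probabilities (cf. Table 1)
```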

6. EVALUATING ROBMA THROUGH REGISTERED REPLICATION REPORTS

Kvarven and colleagues 34 compared the effect size estimates from 15 meta‐analyses of psychological experiments to the corresponding effect size estimates from Registered Replication Reports (RRRs) of the same experiments. †† RRRs are accepted for publication independently of their results and should therefore be unaffected by publication bias. The original meta‐analyses reveal considerable heterogeneity; thus, any single RRR is unlikely to correspond directly to the true mean meta‐analytic effect size. As a result, the comparison of meta‐analysis results to RRRs will inflate RMSE and can be considered a highly conservative way of evaluating bias detection methods. However, when averaged over 15 RRRs, we would expect little systematic net heterogeneity and a notable reduction in aggregate bias. In this way, the average of the bias‐adjusted estimates should randomly cluster around the average of the RRR estimates. In other words, we would expect little overall bias relative to the RRRs. Hence, the comparison to RRRs can be used to gauge the performance of publication bias adjustment methods, while keeping in mind that the studies are heterogeneous and limited in number. Kvarven and colleagues 34 found that conventional meta‐analysis methods resulted in substantial overestimation of effect size. In addition, Kvarven and colleagues 34 examined three popular bias detection methods: trim and fill (TF), 17 PET‐PEESE, 16 and 3PSM. 11 , 56 The best performing method was PET‐PEESE; however, its estimates still had a notable RMSE.

Here we use the data analyzed by Kvarven and colleagues 34 as one way of comparing the performance of RoBMA‐PSMA to a series of alternative publication bias correction methods. These methods include those examined by Kvarven and colleagues 34 (PET‐PEESE, 3PSM, and TF) as well as a set of seven other methods 27 : 4PSM, 11 AK1 and AK2, 20 p‐curve, 18 p‐uniform, 19 WAAP‐WLS, 22 and endogenous kink (EK). 21 For completeness, we also show results for the original implementation of RoBMA (RoBMA‐old). 13 RoBMA‐old, 3PSM, 4PSM, AK1, AK2, p‐curve, and p‐uniform can be viewed as selection models operating on p‐values that differ mostly in the thresholds of the weight function and the estimation algorithm. PET‐PEESE, TF, and EK can be viewed as methods correcting for publication bias based on the relationship between effect sizes and standard errors. Finally, RoBMA‐PSMA is a method that combines both types of publication bias correction.

Following Kvarven and colleagues, 34 we report all meta‐analytic estimates on the Cohen's d scale, with one exception for a meta‐analysis that used Cohen's q scale. As in the Bem example, RoBMA internally transforms effect sizes from the Cohen's d scale to the Fisher z scale. ‡‡ Each method is evaluated on the following five metrics (cf. 34 ): (1) false positive rate (FPR), that is, the proportion of cases where the RRR fails to reject the null hypothesis (i.e., p > 0.05) whereas the meta‐analytic method concludes that the data offer support for the presence of the effect (i.e., p < 0.05 or BF10 > 10); (2) false negative rate (FNR), that is, the proportion of cases where the RRR rejects the null hypothesis (i.e., p < 0.05) whereas the meta‐analytic method fails to reject the null/finds evidence for the absence of the effect (i.e., p > 0.05 or BF10 < 1/10) §§ ; (3) overestimation factor (OF), that is, the meta‐analytic mean effect size divided by the RRR mean effect size; (4) bias, that is, the mean difference between the meta‐analytic and RRR effect size estimates; and (5) root mean square error (RMSE), that is, the square root of the mean of squared differences between the meta‐analytic and RRR effect size estimates. Note that when evaluating the methods' qualitative decisions (i.e., FPR and FNR), the RoBMA methods do not necessarily lead to a strong claim about the presence or absence of the effect; in the Bayesian framework, there is no need to make an all‐or‐none decision based on weak evidence, and here we have defined an in‐between category of evidence that does not allow a confident conclusion (i.e., Undecided, 1/10 < BF10 < 10; for a discussion on the importance of this in‐between category see Robinson 53 ). Furthermore, selecting a different significance level or Bayes factor thresholds would lead to different false positive and false negative rates.
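As an illustration of how these five metrics can be computed from the paired estimates, consider the sketch below; the data frame res, its column names, and its values are hypothetical, and the actual analysis scripts are available at https://osf.io/fgqpc/.

```r
# Sketch of the five evaluation metrics for paired meta-analysis / RRR results
# (hypothetical example data: one row per meta-analysis, illustrative values only)
res <- data.frame(
  meta     = c(0.40, 0.25, 0.10),   # bias-adjusted meta-analytic estimates (Cohen's d)
  rrr      = c(0.30, 0.05, 0.12),   # corresponding RRR estimates
  meta_sig = c(TRUE, TRUE, FALSE),  # method indicates presence of an effect
  rrr_sig  = c(TRUE, FALSE, TRUE)   # RRR is statistically significant
)

FPR  <- with(res, mean(meta_sig[!rrr_sig]))     # false positive rate            = 1
FNR  <- with(res, mean(!meta_sig[rrr_sig]))     # false negative rate            = 0.5
OF   <- with(res, mean(meta) / mean(rrr))       # overestimation factor          ≈ 1.60
bias <- with(res, mean(meta - rrr))             # mean difference                ≈ 0.093
RMSE <- with(res, sqrt(mean((meta - rrr)^2)))   # root mean square error         ≈ 0.130
```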

The main results are summarized in Table 2. Evaluated across all metrics simultaneously, RoBMA‐PSMA generally outperforms the other methods. RoBMA‐PSMA has the lowest bias, the second‐lowest RMSE, and the second‐lowest overestimation factor. The only methods that perform better in one of the categories (i.e., AK2 with the lowest overestimation factor; PET‐PEESE and EK with the second and third lowest bias, respectively) showed considerably larger RMSE, and AK2 converged in only 5 out of 15 cases. Furthermore, RoBMA‐PSMA resulted in conclusions that are qualitatively similar to those from the RRR studies. Specifically, for cases where the RRR was statistically significant, RoBMA‐PSMA never showed evidence for the absence of the effect (i.e., FNR = 0/8 = 0%) but often did not find compelling evidence for the presence of the effect either (i.e., Undecided = 6/8 = 75%). Furthermore, for cases where the RRR was not statistically significant, RoBMA‐PSMA showed evidence for the presence of the effect only once (i.e., FPR = 1/7 ≈ 14.3%) and did not find compelling evidence for the absence of the effect in the remaining meta‐analyses (i.e., Undecided = 6/7 ≈ 85.7%). After adjusting for publication selection bias with RoBMA, the original meta‐analyses often did not contain sufficient evidence for firm conclusions about the presence versus absence of the effect. *** This highlights the oft‐hidden reality that the data at hand do not necessarily warrant strong conclusions about the phenomena under study; consequently, a final judgment needs to be postponed until more data accumulate.

TABLE 2.

Performance of 13 publication bias correction methods on the Kvarven and colleagues 34 test set comprising 15 meta‐analyses and 15 corresponding “gold standard” registered replication reports (RRRs)

Method FPR/Undecided FNR/Undecided OF Bias RMSE
RoBMA‐PSMA 0.143/0.857 0.000/0.750 1.160 0.026 0.164
AK2 0.000/— 0.250/— 1.043 −0.070 0.268
PET‐PEESE 0.143/— 0.500/— 1.307 0.050 0.256
EK 0.143/— 0.500/— 1.399 0.065 0.283
RoBMA‐old 0.714/0.286 0.000/0.000 2.049 0.171 0.218
4PSM 0.714/— 0.500/— 1.778 0.127 0.268
3PSM 0.714/— 0.125/— 2.193 0.195 0.245
TF 0.833/— 0.000/— 2.315 0.206 0.259
AK1 0.857/— 0.000/— 2.352 0.221 0.264
p‐uniform 0.500/— 0.429/— 2.375 0.225 0.288
p‐curve 2.367 0.223 0.289
WAAP‐WLS 0.857/— 0.125/— 2.463 0.239 0.295
Random Effects (DL) 1.000/— 0.000/— 2.586 0.259 0.310

Note: The results in italics are conditional on convergence: trim and fill did not converge in one case and AK2 did not converge in 10 cases. The rows are ordered by the combined log‐score performance of abs(log(OF)), abs(Bias), and RMSE (not shown).

Abbreviations: FNR/Undecided, false negative rate/undecided evidence under an effect; FPR/Undecided, false positive rate/undecided evidence under no effect; OF, overestimation factor; RMSE, root mean square error.

Figure 3 shows the effect size estimates from the RRRs for each of the 15 cases, together with the estimates from a random‐effects meta‐analysis and the posterior model‐averaged estimates from RoBMA and RoBMA‐PSMA (figures comparing all methods for each RRR are available in the “Kvarven et al/estimates figures” folder in the online supplementary materials at https://osf.io/fgqpc/files/). Because RoBMA‐PSMA corrects for publication bias, its estimates are shrunken towards zero. In addition, the estimates from RoBMA‐PSMA come with wider credible intervals (reflecting the additional uncertainty about the publication bias process) and are generally closer to the RRR results. The most anomalous case concerns the Graham and colleagues 85 study, where all four methods yield similar intervals, but the RRR shows an effect size that is only half as large. This result may be due to cultural differences and the choice of the social or economic dimension, which all contributed to heterogeneity in the original meta‐analysis. 86

FIGURE 3 Effect size estimates with 95% CIs from a random‐effects meta‐analysis, three RoBMA models, and the RRR for the 15 experiments included in Kvarven et al. 34 Estimates are reported on the Cohen's d scale

Appendix C (Data S1) demonstrates the robustness of our findings by estimating RoBMA under different parameter prior distributions. Appendix D (Data S1) presents a non‐parametric bootstrap analysis of the RRR comparison, showing high uncertainty in the FPR and FNR but qualitatively robust conclusions about the overestimation factor, bias, and RMSE. Appendix E (Data S1) demonstrates that our findings are not the result of a systematic underestimation of effect sizes by estimating RoBMA on 28 sets of Registered Replication Reports from Many Labs 2. 87

7. EVALUATING ROBMA THROUGH SIMULATION STUDIES

We evaluate the performance of the newly developed RoBMA methods using simulation studies. 27 As in Hong & Reed, 27 we tested the methods in four simulation environments, namely those developed by Stanley and colleagues 22 (SD&I), Alinaghi & Reed 88 (A&R), Bom & Rachinger 21 (B&R), and Carter and colleagues 25 (CSG&H). These environments differ in their assumptions concerning effect sizes, heterogeneity, sample sizes, and publication bias; moreover, CSG&H include questionable research practices (QRPs). 9 Briefly, the SD&I environment relates to settings usually found in medicine, where a difference between two groups is assessed with either continuous or dichotomous outcomes. The A&R environment is similar to settings encountered in economics and business and consists of relationships between two continuous variables with multiple estimates originating from a single study. The B&R environment considers situations where regression coefficients are routinely affected by omitted‐variable bias. The CSG&H environment is most typical of psychology, with effect sizes quantifying differences in a continuous measure between groups. For each condition in each of the four simulation environments, Hong & Reed 27 generated 3000 synthetic data sets that were then analyzed by all of the competing methods.

Here we used the code, data, and results publicly shared by Hong & Reed. 27 Because our Bayesian methods require computationally intensive Markov chain Monte Carlo estimation, we used only 300 synthetic data sets per condition (10% of the original replications). ††† Nevertheless, our simulations still required ~25 CPU years to complete. A detailed description of the simulation environments (consisting of a total of 1620 conditions) and the remaining methods can be found in Hong & Reed 27 and the corresponding original simulation publications. We compared the performance of RoBMA‐PSMA to that of all methods used in the previous section. See the Supplementary Materials at https://osf.io/bd9xp/ for a comparison of the methods after removing the 5% most extreme estimates from each method, as done by Hong & Reed, 27 with the main difference being an improved performance of AK1 and AK2.

Tables 3 and 4 summarize the aggregated results for mean square error (MSE) and bias, respectively, separately for each simulation environment. Although no single estimator dominates across all simulation environments and criteria, RoBMA‐PSMA is at or near the top in most cases. The exception is that RoBMA‐PSMA produces below‐average performance in the CSG&H environment. Tables S7 and S8 in Appendix F (Data S1) show that RoBMA‐PSMA overcorrects the effect size estimates and performs relatively poorly only in conditions with p‐hacking strong enough to introduce substantial skew in the distributions of effect sizes. ‡‡‡

TABLE 3.

Ordered performance of the methods according to MSE for each simulation environment. Rank 1 has the lowest MSE. See text for details

Rank SD&I MSE A&R MSE B&R MSE CSG&H MSE
1 RoBMA‐PSMA 0.009 RoBMA‐PSMA 0.222 RoBMA‐PSMA 0.098 RoBMA‐old 0.012
2 AK2 a 0.013 TF 0.273 p‐uniform 0.185 WAAP‐WLS 0.018
3 RoBMA‐old 0.017 AK2 a 0.277 WAAP‐WLS 0.193 TF 0.022
4 TF 0.025 RoBMA‐old 0.327 RoBMA‐old 0.221 3PSM 0.023
5 WAAP‐WLS 0.025 4PSM 0.365 TF 0.321 PET‐PEESE 0.027
6 PET‐PEESE 0.028 AK1 a 0.389 EK 0.375 p‐uniform 0.028
7 EK 0.031 Random Effects (DL) 0.511 PET‐PEESE 0.378 4PSM 0.031
8 Random Effects (DL) 0.034 3PSM 0.511 4PSM 0.492 EK 0.033
9 p‐uniform 0.050 WAAP‐WLS 0.546 3PSM 0.493 RoBMA‐PSMA 0.036
10 3PSM 0.238 PET‐PEESE 0.605 Random Effects (DL) 0.526 Random Effects (DL) 0.046
11 p‐curve 1.228 EK 0.760 p‐curve 0.850 p‐curve 0.075
12 4PSM 3.375 p‐curve 3.514 AK1 a 2.806 AK1 a 0.280
13 AK1 a 6.231 p‐uniform 3.621 AK2 a 5.816 AK2 a 2.849

Note: Methods in italics converged in fewer than 90% of repetitions in the given simulation environment.

a

The performance difference in terms of MSE for AK1 and AK2 between our implementation and that of Hong & Reed 27 is due to the fact that we did not omit the 5% most extreme estimates.

TABLE 4.

Ordered performance of the methods according to bias for each simulation environment. Rank 1 has the lowest bias. See text for details

Rank SD&I Bias A&R Bias B&R Bias CSG&H Bias
1 AK2 0.029 RoBMA‐PSMA 0.159 EK 0.095 PET‐PEESE 0.059
2 RoBMA‐PSMA 0.034 AK2 0.207 AK2 0.105 WAAP‐WLS 0.062
3 3PSM 0.040 EK 0.221 4PSM 0.108 RoBMA‐old 0.064
4 PET‐PEESE 0.049 PET‐PEESE 0.259 RoBMA‐PSMA 0.121 AK1 0.067
5 EK 0.053 WAAP‐WLS 0.266 PET‐PEESE 0.129 EK 0.072
6 RoBMA‐old 0.062 TF 0.288 3PSM 0.156 3PSM 0.081
7 AK1 0.082 4PSM 0.302 WAAP‐WLS 0.189 TF 0.091
8 WAAP‐WLS 0.083 RoBMA‐old 0.354 RoBMA‐old 0.228 4PSM 0.096
9 TF 0.088 AK1 0.397 TF 0.240 p‐uniform 0.106
10 4PSM 0.088 3PSM 0.475 AK1 0.277 RoBMA‐PSMA 0.110
11 Random effects (DL) 0.108 Random effects (DL) 0.556 Random effects (DL) 0.363 AK2 0.117
12 p‐uniform 0.147 p‐curve 1.530 p‐uniform 0.374 p‐curve 0.118
13 p‐curve 0.422 p‐uniform 1.555 p‐curve 0.522 Random effects (DL) 0.150

Note: Methods in italics converged in fewer than 90% of repetitions in the given simulation environment.

Following Hong & Reed, 27 Table 5 averages performance across all four simulation environments. While the results confirm that RoBMA‐PSMA performs the best with regard to type I error rates and coverage, it is important to note that both the coverage and the error rate were still far from the nominal levels. The results also appear favorable to AK2, as it has the lowest bias in the SD&I environment and the second‐lowest biases in the A&R and B&R environments. However, AK2 failed to converge in over 10% of these simulated meta‐analyses. Even when AK2 converges, its MSE in the B&R and CSG&H environments is relatively large.

TABLE 5.

Aggregated results over all simulation conditions from Hong & Reed 27

Rank Rank (Bias) Bias Rank (MSE) MSE Rank (Coverage0.95) Coverage0.95 Rank (ERR) ERR
1 EK 0.079 RoBMA‐PSMA 0.054 AK2 a 0.167 RoBMA‐PSMA 0.093
2 PET‐PEESE 0.083 RoBMA‐old 0.085 RoBMA‐PSMA 0.172 AK2 a 0.129
3 AK2 a 0.099 WAAP‐WLS 0.085 3PSM 0.213 EK 0.257
4 RoBMA‐PSMA 0.099 TF 0.121 4PSM 0.265 3PSM 0.259
5 4PSM 0.103 PET‐PEESE 0.149 PET‐PEESE 0.306 PET‐PEESE 0.286
6 3PSM 0.105 EK 0.155 EK 0.307 4PSM 0.290
7 WAAP‐WLS 0.110 p‐uniform 0.161 RoBMA‐old 0.317 RoBMA‐old 0.485
8 RoBMA‐old 0.121 Random Effects (DL) 0.203 WAAP‐WLS 0.319 WAAP‐WLS 0.525
9 TF 0.141 3PSM 0.223 AK1 a 0.341 AK1 a 0.573
10 AK1 a 0.143 p‐curve 0.623 TF 0.407 p‐uniform 0.585
11 Random Effects (DL) 0.217 4PSM 0.851 Random Effects (DL) 0.510 TF 0.597
12 p‐uniform 0.230 AK1 a 2.258 p‐uniform 0.576 Random Effects (DL) 0.649
13 p‐curve 0.336 AK2 3.316 p‐curve p‐curve

Note: Ranking and values of aggregated bias, mean square error (MSE), absolute difference from 0.95 CI coverage (Coverage0.95), and type I error rate (ERR), averaged across all simulation environments in Hong & Reed 27 (the type I error rate for RoBMA methods is based on BF > 10). Methods in italics converged in fewer than 90% of repetitions in a given simulation environment.

a

The performance difference in terms of MSE for AK1 and AK2 between our implementation and that of Hong & Reed 27 is due to the fact that we did not omit the 5% most extreme estimates.

It should be noted that this averaging operation is valid only for coverage and type I error rates, as these are fully comparable across the different simulation environments. In contrast, bias and MSE cannot be directly averaged or aggregated, as they are based on very different effect‐size metrics 27 ; for instance, the best method in the A&R environment has five times the bias of the best method in the SD&I environment. In order to make the metrics commensurate, we employ a relative order‐preserving logarithmic transformation to obtain an average ranking across the four different simulation environments 89 (1 corresponds to the best relative performance, 0 to the worst relative performance). §§§ Table 6 displays the average relative ranks of bias and MSE for the alternative methods across all simulation environments. RoBMA‐PSMA is ranked highest according to MSE and type I error rates, and is the second best according to both bias and confidence interval coverage. Again, the closest competition appears to come from AK2, but AK2 often does not converge and may yield high MSEs (see Table 3). Table 5 also shows that RoBMA‐PSMA does well when bias and MSE are simply averaged across these simulations, but those comparisons need to be interpreted with caution.

TABLE 6.

Ordered performance of the methods across simulation environments according to log scoring rule

Rank Bias Log score (Bias) MSE Log score (MSE)
1 AK2 0.801 RoBMA‐PSMA 0.831
2 RoBMA‐PSMA 0.801 RoBMA‐old 0.682
3 EK 0.778 TF 0.519
4 PET‐PEESE 0.746 WAAP‐WLS 0.504
5 WAAP‐WLS 0.616 AK2 0.392
6 3PSM 0.615 PET‐PEESE 0.369
7 4PSM 0.602 EK 0.327
8 RoBMA‐old 0.579 p‐uniform 0.324
9 AK1 0.515 3PSM 0.316
10 TF 0.500 4PSM 0.315
11 Random effects (DL) 0.329 Random effects (DL) 0.310
12 p‐uniform 0.304 AK1 0.183
13 p‐curve 0.242 p‐curve 0.114

Note: Methods in italics converged in fewer than 90% of repetitions.

8. CONCLUDING COMMENTS

We have extended the robust Bayesian meta‐analytic framework with one‐sided weight functions and PET‐PEESE regression models. This extension allows researchers to draw inferences using a multitude of otherwise competing approaches (i.e., selection models based on p‐values and models estimating the relationship between effect sizes and standard errors). Consequently, researchers interested in obtaining the best possible adjusted meta‐analytic effect size estimate do not need to speculate about the type of publication bias in order to select the best method for their setting. Instead, RoBMA weights its inference in proportion to how well each method accounts for the data.

The extended version of RoBMA resolves the tension between the selection models and PET‐PEESE. Furthermore, we demonstrated that RoBMA‐PSMA outperforms previous methods when applied to actual meta‐analyses for which a gold standard is available. 34 Finally, the new RoBMA methods performed well in simulation studies. However, it is important to note that RoBMA‐PSMA did not perform well in the simulation settings of Carter and colleagues 25 with prominent p‐hacking, where it overcorrected the effect sizes.

The RoBMA framework can be further extended in multiple ways: to account for multilevel structures, to estimate within‐study clusters, to deal with multivariate outcomes, and to include explanatory variables. Many of these extensions will, however, increase computational complexity, making them practically infeasible for selection models. Therefore, further research is needed to develop efficient algorithms or approximations that would allow such extensions, which are currently unachievable under the RoBMA‐PSMA framework.

Of the remaining methods, p-curve, p-uniform, and random effects meta-analysis were dominated by the other estimators, and AK2 failed to converge in many cases. Overall, Bayesian model-averaging greatly improved both PET-PEESE and selection models: RoBMA-PSMA reduces PET-PEESE's MSE and bias as well as the selection models' MSE. Importantly, RoBMA-PSMA takes uncertainty about the type of publication bias into account and thereby combines the best of both worlds. Even though RoBMA outperforms other methods in many cases in both the simulation study and the comparison of meta-analyses and registered replication reports, it should be considered merely a new tool in the toolbox for detecting and adjusting for publication selection bias.

In cases where the data generating process is known, and depending on the metric that researchers want to optimize (e.g., bias vs. RMSE), an appropriate method can be selected via the results of our simulation study or the meta-showdown explorer (https://tellmi.psy.lmu.de/felix/metaExplorer/). If there is considerable uncertainty about the data generating process, we believe that RoBMA is a sensible default. Nevertheless, researchers may wish to check the conclusions of RoBMA against methods that are not part of the RoBMA ensemble, such as WAAP-WLS. As there is no principled way of averaging these methods with RoBMA, **** researchers should view these comparisons as sensitivity analyses. If alternative methods come to the same conclusions as RoBMA, this suggests that the results are robust; if alternative methods come to a qualitatively different conclusion, this suggests that the results are fragile. In that case, we recommend a more in-depth consideration of the data-model relationship and a transparent report that the conclusions vary based on the selected meta-analytic technique.

We believe that the extended version of RoBMA with the outlined default prior distributions presents a reasonable setup for anyone interested in performing a meta-analysis. However, the RoBMA framework is flexible and allows researchers to specify different prior distributions for any of the model parameters or to include/exclude additional models (see "Appendix B: Specifying Different Priors" in reference 90, or many of the R package vignettes). Consequently, researchers with substantial prior knowledge can test more specific hypotheses than those specified with the default model ensemble 39, 50, 91 or incorporate prior knowledge about the research environment. For instance, when prior research has established that the effect of interest shows considerable between-study heterogeneity, researchers may decide to trim the default RoBMA ensemble by assigning prior probability zero to the fixed effects models, thereby drawing conclusions only from the random effects models.
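
As an illustration, the following minimal sketch shows how such ensemble trimming might look with the RoBMA R package. 61 The data are hypothetical, and the argument names follow the version 2.x interface and may differ in other versions; consult the package documentation. Setting the null heterogeneity prior to NULL is intended to remove the tau = 0 (fixed effects) models from the default ensemble.

```r
# requires the RoBMA package (which relies on JAGS); fitting may take several minutes
library(RoBMA)

# hypothetical meta-analytic data: Cohen's d estimates and their standard errors
d  <- c(0.25, 0.41, 0.10, 0.33, 0.18)
se <- c(0.12, 0.15, 0.10, 0.20, 0.11)

# default ensemble, but with the fixed-effect (tau = 0) models removed by
# dropping the "null" heterogeneity component (i.e., assigning it zero prior probability)
fit <- RoBMA(d = d, se = se,
             priors_heterogeneity_null = NULL,
             seed = 1)
summary(fit)
```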

We have implemented RoBMA-PSMA in a new version of the RoBMA R package. 61 In addition, for researchers with little programming expertise, we will implement the methodology in the open-source statistical software package JASP. 92, 93 We hope that these publicly shared statistical packages will encourage researchers across different disciplines to adopt these new methods for accommodating potential publication bias and to draw conclusions that are rich, robust, and reliable.

AUTHOR CONTRIBUTIONS

All authors jointly generated the idea for the study. František Bartoš programmed the analysis, conducted the simulation study, and analyzed the data. František Bartoš and Maximilian Maier wrote the first draft of the manuscript and all authors critically edited it. All authors approved the final submitted version of the manuscript.

CONFLICT OF INTEREST

František Bartoš declares that he owns a negligible amount of shares in semiconductor manufacturing companies that might benefit from a wider application of computationally intensive methods such as RoBMA‐PSMA. The authors declare that there were no other conflicts of interest with respect to the authorship or the publication of this article.

Supporting information

Appendix S1. Supporting Information.

ACKNOWLEDGMENTS

This project was supported in part by a Vici grant (#016.Vici.170.083) to E.J.W. Computational resources were supplied by the project “e‐Infrastruktura CZ” (e‐INFRA LM2018140) provided within the program Projects of Large Research, Development and Innovations Infrastructures.

Bartoš F, Maier M, Wagenmakers E‐J, Doucouliagos H, Stanley TD. Robust Bayesian meta‐analysis: Model‐averaging across complementary publication bias adjustment methods. Res Syn Meth. 2023;14(1):99‐116. doi: 10.1002/jrsm.1594

Funding information NWO, Grant/Award Number: 016.Vici.170.083

Endnotes

*

As measured by the geometric mean of researchers' self-admission rate, prevalence estimate (the estimate of the percentage of other researchers who had engaged in a behavior), and admission estimate. For the admission estimates, the number of people reporting to have engaged in a given QRP was divided by researchers' estimate of the proportion of other researchers who would admit to having engaged in this QRP.

†

Both simple and inclusion Bayes factors are commonly denoted as BF since they are still Bayes factors and the same rules and interpretations apply to them.

‡

See 23, 58, 59 for selection models based on effect sizes and standard errors.

§

We use Fisher's z scale for model fitting because it makes standard errors and effect sizes independent under the model without publication bias. This is an important prerequisite for testing the presence/absence of publication bias with the PET-PEESE models introduced later. The effect sizes and standard errors are transformed using the standard effect size conversion formulas. See the appendix in Haaf and colleagues 62 for a proof that Fisher's z is a variance-stabilizing transformation for Cohen's d. Prior distributions are linearly re-scaled from Cohen's d to Fisher's z, in the same manner as in the metaBMA R package. 63
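
For concreteness, a minimal sketch of the standard conversion, assuming two groups of equal size so that the constant a equals 4; this illustrates the textbook formulas rather than the package's internal code:

```r
# convert Cohen's d to Fisher's z, assuming two equal-sized groups (a = 4);
# for unequal groups, a = (n1 + n2)^2 / (n1 * n2)
d_to_z <- function(d, a = 4) {
  r <- d / sqrt(d^2 + a)   # Cohen's d -> correlation r
  atanh(r)                 # Fisher's z = 0.5 * log((1 + r) / (1 - r))
}
se_z <- function(n) 1 / sqrt(n - 3)  # approximate standard error of Fisher's z (n = total sample size)

d_to_z(0.5)  # Fisher's z corresponding to d = 0.5
se_z(60)     # its approximate standard error for 60 participants
```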

**

The reported lower bound of the credible interval of 0.000 is not a coincidence and will be encountered more often than in conventional frequentist methods. The 0.000 lower bound is a consequence of averaging posterior estimates across all models, including models that specify μ = 0 (see Equation (9)). When a notable proportion of the posterior model probability is accumulated by models assuming the absence of the effect, the model-averaged posterior distribution for the effect size includes a sizeable point mass at μ = 0. Consequently, the lower credible interval bound is "shrunk" to 0.
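
The mechanism can be illustrated with a small sketch using made-up numbers; the 30% null-model probability and the continuous component below are hypothetical and serve only to show why the lower bound lands exactly at zero.

```r
set.seed(1)
# suppose the models assuming mu = 0 jointly receive 30% posterior model probability
p_null <- 0.30
n      <- 1e5

# model-averaged posterior draws: a point mass at 0 mixed with a continuous
# posterior (a stand-in for the posterior under the effect-size models)
draws <- c(rep(0, n * p_null),
           rnorm(n * (1 - p_null), mean = 0.20, sd = 0.05))

quantile(draws, c(0.025, 0.975))  # the 2.5% quantile equals 0
```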

††

This RRR category includes “replications published according to the ‘registered replication report’ format in the journals ‘Perspectives on Psychological Science’ and ‘Advances in Methods and Practices in Psychological Science’; and (2) The ‘Many Labs’ projects in psychology” (p. 424). 34

‡‡

We also tried to estimate the remaining methods on the Fisher z scale; however, doing so reduced the performance of some of the other methods.

§§

This corresponds to the definition of the FPR and FNR indices from Kvarven and colleagues. 34 However, it is important to note that a statistically non‐significant result is not generally a valid reason to “accept” the null hypothesis. 81 , 82

***

Note that this is not a general pattern and RoBMA often results in compelling evidence, either in favor of the absence or in favor of the presence of an effect. 83 , 84

†††

For the methods used in Hong & Reed, 27 we recalculated the results based on a sample matching the 300 replications per condition used for the RoBMA methods, using the per-replication estimates shared by the authors.

‡‡‡

The QRPs simulated by Carter and colleagues 25 result in a strongly positively skewed distribution of effect sizes. While RoBMA-PSMA contains selection models with weight functions that adjust well for the publication bias simulated by Carter and colleagues, 25 the additional skew generated by these QRPs leads to misspecification of the best-fitting models and a consequent overcorrection of the meta-analytic effect size. The simpler original RoBMA does not contain the appropriate one-sided weight function; for it, the skewed distribution of effect sizes does not introduce a strong systematic bias, and, like the other methods, it ironically performs better in the QRP environment of Carter and colleagues 25 because of the additional layers of model misspecification. The remaining simulation environments produce bias in a more diverse manner and do not lead to such strongly skewed distributions of effect sizes; as a result, RoBMA-PSMA does not generally suffer from systematic bias.

§§§

This transformation has been used to compare and rank top scientists across different criteria of scientific impact. 89

****

Many of these methods either remove data (WAAP‐WLS) or impute data (trim‐and‐fill), which makes a comparison via marginal likelihood impossible.

DATA AVAILABILITY STATEMENT

The data and R scripts for performing the analyses and simulation study are openly available on OSF at https://osf.io/fgqpc/.

REFERENCES

1. Borenstein M, Hedges L, Higgins J, Rothstein H. Publication Bias. Wiley; 2009:277-292.
2. Masicampo E, Lalande DR. A peculiar prevalence of p-values just below .05. Q J Exp Psychol. 2012;65(11):2271-2279.
3. Scheel AM, Schijen MRMJ, Lakens D. An excess of positive results: comparing the standard psychology literature with registered reports. Adv Methods Pract Psychol Sci. 2021;4(2):1-12. doi: 10.1177/25152459211007467
4. Wicherts JM. The weak spots in contemporary science (and how to fix them). Animals. 2017;7(12):90-119.
5. Rothstein HR, Sutton AJ, Borenstein M. Publication Bias in Meta-Analysis. John Wiley & Sons; 2005.
6. Rosenthal R, Gaito J. Further evidence for the cliff effect in interpretation of levels of significance. Psychol Rep. 1964;15(2):570.
7. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011;22(11):1359-1366.
8. Stefan A, Schönbrodt FD. Big little lies: a compendium and simulation of p-hacking strategies. PsyArXiv Preprints. 2022. doi: 10.31234/osf.io/xy2dk
9. John LK, Loewenstein G, Prelec D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol Sci. 2012;23(5):524-532.
10. Fiedler K, Schwarz N. Questionable research practices revisited. Soc Psychol Personal Sci. 2016;7(1):45-52.
11. Vevea JL, Hedges LV. A general linear model for estimating effect size in the presence of publication bias. Psychometrika. 1995;60(3):419-435.
12. Iyengar S, Greenhouse JB. Selection models and the file drawer problem. Stat Sci. 1988;3(1):109-117.
13. Maier M, Bartoš F, Wagenmakers EJ. Robust Bayesian meta-analysis: addressing publication bias with model-averaging. Psychol Methods. 2022.
14. Egger M, Smith GD, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629-634.
15. Stanley TD, Doucouliagos H. Neither fixed nor random: weighted least squares meta-regression. Res Synth Methods. 2017;8(1):19-42.
16. Stanley TD, Doucouliagos H. Meta-regression approximations to reduce publication selection bias. Res Synth Methods. 2014;5(1):60-78.
17. Duval S, Tweedie R. Trim and fill: a simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics. 2000;56(2):455-463.
18. Simonsohn U, Nelson LD, Simmons JP. P-curve: a key to the file-drawer. J Exp Psychol Gen. 2014;143(2):534-547.
19. Van Assen MA, Aert VR, Wicherts JM. Meta-analysis using effect size distributions of only statistically significant studies. Psychol Methods. 2015;20(3):293-309.
20. Andrews I, Kasy M. Identification of and correction for publication bias. Am Econ Rev. 2019;109(8):2766-2794.
21. Bom PR, Rachinger H. A kinked meta-regression model for publication bias correction. Res Synth Methods. 2019;10(4):497-514.
22. Stanley TD, Doucouliagos H, Ioannidis JP. Finding the power to reduce publication bias. Stat Med. 2017;36(10):1580-1598.
23. Copas J. What works?: selectivity models and meta-analysis. J R Stat Soc A Stat Soc. 1999;162(1):95-109.
24. Citkowicz M, Vevea JL. A parsimonious weight function for modeling publication bias. Psychol Methods. 2017;22(1):28-41.
25. Carter EC, Schönbrodt FD, Gervais WM, Hilgard J. Correcting for bias in psychology: a comparison of meta-analytic methods. Adv Methods Pract Psychol Sci. 2019;2(2):115-144.
26. Renkewitz F, Keiner M. How to detect publication bias in psychological research. Z Psychol. 2019;227(4):261-279.
27. Hong S, Reed WR. Using Monte Carlo experiments to select meta-analytic estimators. Res Synth Methods. 2020;12:192-215.
28. Vevea JL, Woods CM. Publication bias in research synthesis: sensitivity analysis using a priori weight functions. Psychol Methods. 2005;10(4):428-443.
29. Rothstein HR, Sutton AJ, Borenstein M. Publication bias in meta-analysis. Publication Bias in Meta-Analysis: Prevention, Assessment and Adjustments; John Wiley & Sons; 2005:1-7.
30. Mathur MB, Van der Weele TJ. Sensitivity analysis for publication bias in meta-analyses. J R Stat Soc Ser C Appl Stat. 2020;69(5):1091-1119.
31. Oswald ME, Grosjean S. Confirmation Bias. Psychology Press; 2004:79-96.
32. McShane BB, Böckenholt U, Hansen KT. Adjusting for publication bias in meta-analysis: an evaluation of selection methods and some cautionary notes. Perspect Psychol Sci. 2016;11(5):730-749.
33. Guan M, Vandekerckhove J. A Bayesian approach to mitigation of publication bias. Psychon Bull Rev. 2016;23(1):74-86.
34. Kvarven A, Strømland E, Johannesson M. Comparing meta-analyses and preregistered multiple-laboratory replication projects. Nat Hum Behav. 2020;4(4):423-434.
35. Bem DJ. Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. J Pers Soc Psychol. 2011;100(3):407-425.
36. Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Stat Sci. 1999;14(4):382-401.
37. Leamer EE. Specification Searches: Ad Hoc Inference with Nonexperimental Data. Vol 53. Wiley; 1978.
38. Hinne M, Gronau QF, van den Bergh D, Wagenmakers EJ. A conceptual introduction to Bayesian model averaging. Adv Methods Pract Psychol Sci. 2020;3(2):200-215.
39. Gronau QF, Erp VS, Heck DW, Cesario J, Jonas KJ, Wagenmakers EJ. A Bayesian model-averaged meta-analysis of the power pose effect with informed and default priors: the case of felt power. Compr Results Soc Psychol. 2017;2(1):123-138.
40. Gronau QF, Heck DW, Berkhout SW, Haaf JM, Wagenmakers EJ. A primer on Bayesian model-averaged meta-analysis. Adv Methods Pract Psychol Sci. 2021;4(3):1-19.
41. Fragoso TM, Bertoli W, Louzada F. Bayesian model averaging: a systematic review and conceptual classification. Int Stat Rev. 2018;86(1):1-28.
42. Jefferys WH, Berger JO. Ockham's razor and Bayesian analysis. Am Sci. 1992;80:64-72.
43. Etz A, Wagenmakers EJ. J. B. S. Haldane's contribution to the Bayes factor hypothesis test. Stat Sci. 2017;32:313-329.
44. Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc. 1995;90(430):773-795.
45. Rouder JN, Morey RD. Teaching Bayes' theorem: strength of evidence as predictive accuracy. Am Stat. 2019;73(2):186-190.
46. Wrinch D, Jeffreys H. On certain fundamental principles of scientific inquiry. Philos Mag. 1921;42:369-390.
47. Jeffreys H. Theory of Probability. 1st ed. Oxford University Press; 1939.
48. Lee MD, Wagenmakers EJ. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press; 2013.
49. Clyde MA, Ghosh J, Littman ML. Bayesian adaptive sampling for variable selection and model averaging. J Comput Graph Stat. 2011;20(1):80-101.
50. Bartoš F, Gronau QF, Timmers B, Otte WM, Ly A, Wagenmakers EJ. Bayesian model-averaged meta-analysis in medicine. Stat Med. 2021;40:6743-6761.
51. Wagenmakers EJ, Morey RD, Lee MD. Bayesian benefits for the pragmatic researcher. Curr Dir Psychol Sci. 2016;25:169-176.
52. Jeffreys H. Some tests of significance, treated by the theory of probability. Proc Camb Philos Soc. 1935;31:203-222.
53. Robinson GK. What properties might statistical inferences reasonably be expected to have? – crisis and resolution in statistical inference. Am Stat. 2019;73:243-252.
54. Keysers C, Gazzola V, Wagenmakers EJ. Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence. Nat Neurosci. 2020;23:788-799.
55. Schure TJ, Grünwald P. Accumulation bias in meta-analysis: the need to consider time in error control. F1000Res. 2019;8:962.
56. Hedges LV. Modeling publication selection effects in meta-analysis. Stat Sci. 1992;7(2):246-255.
57. Maier M, Van der Weele TJ, Mathur MB. Using selection models to assess sensitivity to publication bias: a tutorial and call for more routine use. Campbell Syst Rev. 2022;18(3):e1256.
58. Copas J, Li H. Inference for non-random samples. J R Stat Soc Series B Stat Methodology. 1997;59(1):55-95.
59. Copas J, Shi JQ. A sensitivity analysis for publication bias in systematic reviews. Stat Methods Med Res. 2001;10(4):251-265.
60. Bem DJ, Utts J, Johnson WO. Must psychologists change the way they analyze their data? J Pers Soc Psychol. 2011;101(4):716-719.
61. Bartoš F, Maier M. RoBMA: An R Package for Robust Bayesian Meta-Analyses. R package version 2.0.0; 2021. https://CRAN.R-project.org/package=RoBMA
62. Haaf JM, Rouder JN. Does every study? Implementing ordinal constraint in meta-analysis. Psychol Methods. 2021.
63. Heck DW, Gronau QF, Wagenmakers EJ. metaBMA: Bayesian Model Averaging for Random and Fixed Effects Meta-Analysis; 2019. https://CRAN.R-project.org/package=metaBMA
64. Mathur MB, Van der Weele TJ. Finding common ground in meta-analysis "wars" on violent video games. Perspect Psychol Sci. 2019;14(4):705-708.
65. Carter EC, McCullough ME. Publication bias and the limited strength model of self-control: has the evidence for ego depletion been overestimated? Front Psychol. 2014;5:823.
66. Moreno SG, Sutton AJ, Turner EH, et al. Novel methods to deal with publication biases: secondary analysis of antidepressant trials in the FDA trial registry database and related journal publications. BMJ. 2009;339:b2981.
67. Stanley TD. Limitations of PET-PEESE and other meta-analysis methods. Soc Psychol Personal Sci. 2017;8(5):581-591.
68. Jeffreys H. Theory of Probability. 3rd ed. Oxford University Press; 1961.
69. Ritchie SJ, Wiseman R, French CC. Failing the future: three unsuccessful attempts to replicate Bem's 'Retroactive Facilitation of Recall' effect. PLoS One. 2012;7(3):e33423.
70. Galak J, LeBoeuf RA, Nelson LD, Simmons JP. Correcting the past: failures to replicate psi. J Pers Soc Psychol. 2012;103(6):933-948.
71. Schlitz M, Bem DJ, Marcusson-Clavertz D, et al. Two replication studies of a time-reversed (psi) priming task and the role of expectancy in reaction times. J Sci Explor. 2021;35(1):65-90.
72. Wagenmakers EJ, Wetzels R, Borsboom D, Maas vHL, Kievit RA. An agenda for purely confirmatory research. Perspect Psychol Sci. 2012;7(6):632-638.
73. Francis G. Too good to be true: publication bias in two prominent studies from experimental psychology. Psychon Bull Rev. 2012;19(2):151-156.
74. Schimmack U. The ironic effect of significant results on the credibility of multiple-study articles. Psychol Methods. 2012;17(4):551-566.
75. Alcock J. Back from the future: parapsychology and the Bem affair. Skept Inq. 2011;35(2):31-39.
76. Hoogeveen S, Sarafoglou A, Wagenmakers EJ. Laypeople can predict which social-science studies will be replicated successfully. Adv Methods Pract Psychol Sci. 2020;3(3):267-285.
77. Wagenmakers EJ, Wetzels R, Borsboom D, Van Der Maas HL. Why psychologists must change the way they analyze their data: the case of psi: comment on Bem (2011). J Pers Soc Psychol. 2011;100(3):426-432.
78. Schimmack U. Why psychologists should not change the way they analyze their data: the devil is in the default prior. https://replicationindex.com/2015/05/09/why-psychologists-should-not-change-the-way-they-analyze-their-data-the-devil-is-in-the-default-piror/
79. Schimmack U. My email correspondence with Daryl J. Bem about the data for his 2011 article "Feeling the future". https://replicationindex.com/2018/01/20/my-email-correspondence-with-daryl-j-bem-about-the-data-for-his-2011-article-feeling-the-future/
80. Rouder JN, Morey RD. A Bayes factor meta-analysis of Bem's ESP claim. Psychon Bull Rev. 2011;18(4):682-689.
81. Aczel B, Palfi B, Szollosi A, et al. Quantifying support for the null hypothesis in psychology: an empirical investigation. Adv Methods Pract Psychol Sci. 2018;1(3):357-366.
82. Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008;45:135-140.
83. Maier M, Bartoš F, Oh M, Wagenmakers EJ, Shanks D, Harris A. Publication bias in research on construal level theory. PsyArXiv Preprints. 2022. doi: 10.31234/osf.io/r8nyu
84. Maier M, Bartoš F, Stanley TD, Shanks DR, Harris JLA, Wagenmakers EJ. No evidence for nudging after adjusting for publication bias. Proc Natl Acad Sci. 2022;119:e2200300119.
85. Graham J, Haidt J, Nosek BA. Liberals and conservatives rely on different sets of moral foundations. J Pers Soc Psychol. 2009;96(5):1029-1046.
86. Kivikangas JM, Fernández-Castilla B, Järvelä S, Ravaja N, Lönnqvist JE. Moral foundations and political orientation: systematic review and meta-analysis. Psychol Bull. 2021;147(1):55-94.
87. Klein RA, Vianello M, Hasselman F, et al. Many labs 2: investigating variation in replicability across samples and settings. Adv Methods Pract Psychol Sci. 2018;1(4):443-490.
88. Alinaghi N, Reed WR. Meta-analysis and publication bias: how well does the FAT-PET-PEESE procedure work? Res Synth Methods. 2018;9(2):285-311.
89. Ioannidis JP, Baas J, Klavans R, Boyack KW. A standardized citation metrics author database annotated for scientific field. PLoS Biol. 2019;17(8):e3000384.
90. Bartoš F, Maier M, Quintana D, Wagenmakers EJ. Adjusting for publication bias in JASP and R: selection models, PET-PEESE, and robust Bayesian meta-analysis. Adv Methods Pract Psychol Sci. in press.
91. Gronau QF, Ly A, Wagenmakers EJ. Informed Bayesian t-tests. Am Stat. 2020;74(2):137-143.
92. JASP Team. JASP (Version 0.15) [Computer software]; 2021. https://jasp-stats.org/
93. Ly A, van den Bergh D, Bartoš F, Wagenmakers EJ. Bayesian inference with JASP. ISBA Bull. 2021;28:7-15.
