PLOS One. 2022 Feb 3;17(2):e0262809. doi: 10.1371/journal.pone.0262809

Heterogeneity estimates in a biased world

Johannes Hönekopp 1,*, Audrey Helen Linden 1
Editor: Tim Mathes
PMCID: PMC8812955  PMID: 35113897

Abstract

Meta-analyses typically quantify heterogeneity of results, thus providing information about the consistency of the investigated effect across studies. Numerous heterogeneity estimators have been devised. Past evaluations of their performance typically presumed lack of bias in the set of studies being meta-analysed, which is often unrealistic. The present study used computer simulations to evaluate five heterogeneity estimators under a range of research conditions broadly representative of meta-analyses in psychology, with the aim of assessing the impact of biases in sets of primary studies on estimates of both mean effect size and heterogeneity in meta-analyses of continuous outcome measures. To this end, six orthogonal design factors were manipulated: strength of publication bias; 1-tailed vs. 2-tailed publication bias; prevalence of p-hacking; true heterogeneity of the effect studied; true average size of the studied effect; and number of studies per meta-analysis. Our results showed that biases in sets of primary studies caused much greater problems for the estimation of effect size than for the estimation of heterogeneity. For the latter, estimation bias remained small or moderate under most circumstances. Effect size estimates remained virtually unaffected by the choice of heterogeneity estimator. For heterogeneity estimates, however, relevant differences emerged. For unbiased primary studies, the REML estimator and (to a lesser extent) the Paule-Mandel estimator performed well in terms of bias and variance. In biased sets of primary studies, however, the Paule-Mandel estimator performed poorly, whereas the DerSimonian-Laird estimator and (to a slightly lesser extent) the REML estimator performed well. The complexity of results notwithstanding, we suggest that the REML estimator remains a good choice for meta-analyses of continuous outcome measures across varied circumstances.

Introduction

Meta-analyses pool the results from pertinent primary studies to estimate the magnitude and heterogeneity of the phenomenon under investigation. Typically, the results from individual studies vary more strongly than expected from sampling variance alone, which points to heterogeneity [1, 2]. As an example, consider the sex difference in students’ math performance, for which an international survey found substantial variation; e.g., boys did considerably better than girls in Italy, but the reverse pattern was observed in Saudi Arabia [3]. Being larger than expected from sampling error, this variability between countries is an example of heterogeneity. Meta-analyses can quantify heterogeneity and thereby provide important information about the stability of the studied effect across contexts, i.e. different populations, times, research methods, etc. [4]. They can also try to uncover where heterogeneity comes from. For example, trans-national variability in the sex difference in students’ math performance is (partly) explained by national differences in women’s career opportunities [5].

A number of heterogeneity estimators have been proposed [6, 7]. Computer simulations that evaluate their performance typically presume that the set of primary studies underpinning the meta-analysis provides unbiased estimates of the underlying population effect size. In many contexts, this might be unrealistic [8, 9]. Whereas the effect of bias in sets of primary studies on meta-analytic effect size estimates has received considerable attention, its effect on heterogeneity estimates is less well understood [10–12]. Here, we report computer simulations that compare the performance of different heterogeneity estimators when applied to unbiased and biased sets of primary studies. We also compare how bias in the set of primary studies affects estimates of mean effect size and heterogeneity.

Our paper is organized as follows. In the introduction, we first address why heterogeneity matters before we deal with biases in sets of primary studies and what is known about their effects on meta-analysis. This motivates a more detailed account of our aims. In the methods section, we present the random effects model and the heterogeneity estimators that underpin our simulation before we describe the simulation itself in detail.

Why heterogeneity matters

In meta-analysis, the heterogeneity estimate typically affects the weighting of the effect sizes in the primary studies and thereby the estimate of the overall effect size (see Methods for greater detail). Moreover, heterogeneity is of considerable interest in itself because of its practical and epistemic implications. On a practical level, large (unaccounted) heterogeneity means that the effectiveness of an intervention varies strongly and unpredictably across contexts, which is obviously undesirable. Large heterogeneity also reduces the statistical power of studies and should therefore be factored into sample size planning [13, 14]. Finally, heterogeneity also reflects on the state of knowledge in a particular research area. Explained heterogeneity represents progress in knowledge. Often, however, understanding of heterogeneity remains poor, and in this case large heterogeneity points to a fundamental gap in the understanding of the subject matter [15]. For these reasons, the degree of heterogeneity is of interest in itself, and consequently its correct estimation is important.

Bias in the set of primary studies

In the absence of pre-registration, effect sizes in published primary studies tend to inflate the underlying population effect sizes [16–19]. Publication bias as well as flexibility in data collection and analysis are driving forces behind this, and we address them in turn. Publication bias arises when studies with statistically non-significant results have a reduced chance of being published [20]. This leads to inflated effect sizes in published primary studies. In unbiased samples, over- and underestimation of the population effect size cancel each other out. But the overestimating samples (e.g., those that find a particularly large difference between the means of experimental and control group) tend to result in lower p-values than the underestimating samples. Consequently, under publication bias more overestimating than underestimating samples pass through the publication bottleneck.
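To make this mechanism concrete, the following minimal R sketch uses toy numbers of our own (not the simulation reported below) to show how significance-based censoring inflates the average published standardized mean difference; the 80% censoring rate, sample size, and true effect are illustrative assumptions, and censoring is applied in simplified 2-tailed form.

```r
# Toy illustration: significance-based censoring inflates published SMDs.
set.seed(42)
n_per_group <- 30; theta <- 0.2; n_studies <- 10000
d_obs <- p_val <- numeric(n_studies)
for (s in seq_len(n_studies)) {
  g1 <- rnorm(n_per_group, mean = theta)   # experimental group
  g2 <- rnorm(n_per_group, mean = 0)       # control group
  p_val[s] <- t.test(g1, g2, var.equal = TRUE)$p.value
  d_obs[s] <- (mean(g1) - mean(g2)) / sqrt((var(g1) + var(g2)) / 2)
}
# publish all significant results; censor 80% of the non-significant ones
published <- p_val < .05 | runif(n_studies) > 0.80
mean(d_obs)              # close to the true 0.20 across all simulated studies
mean(d_obs[published])   # noticeably larger: the published record overestimates
```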

Publication bias provides an incentive for researchers to produce statistically significant findings. Given that larger sample effects tend to produce lower p-values, researchers might collect and analyse data in ways that lead to systematic overestimation of the population effect in their sample and thereby push their p-value under the threshold for statistical significance [21]. Such practices have become known as p-hacking [22]. Their unifying characteristic is that multiple analyses are run by the researcher but only the one that results in the smallest p-value is reported. Following earlier work [10], we focus on four practices of p-hacking that appear to be widely used in psychology [23]. i) Optional dependent variables means that multiple related outcome variables are analysed in a study. ii) Optional stopping means that researchers regularly peek at their results and stop data collection when they reach a statistically significant finding (or run out of steam). iii) Optional moderators means that the data are sliced in various ways (e.g., all participants; females only; males only). iv) Optional outlier removal means that analyses are performed both on all data and on data cleared from outliers.

Effects on meta-analyses

Publication bias and p-hacking lead to inflated effect size estimates in the published literature and in meta-analyses that rely on it [10–12], and this problem is not readily solved by the inclusion of unpublished results [24]. Less is known about their effects on heterogeneity estimates. We are aware of only three studies into the effect of publication bias, and studies on p-hacking seem to be missing entirely. Using mathematical reasoning, two studies [25, 26] demonstrated that publication bias might lead to under- or overestimation of heterogeneity. However, this modelling assumed that the censoring of studies is contingent on their effect sizes instead of their statistical significance, which might be unrealistic [27]. A third study using both mathematical reasoning and computer simulations [28] considered the effect of publication bias (which was contingent on statistical significance), while also manipulating the level of true heterogeneity, the magnitude of true effects, and how much studies differed in their sample sizes. A complex picture emerged, but underestimation of heterogeneity was more prevalent than overestimation. The latter was mostly restricted to small effect sizes and tended to increase with the strength of publication bias.

Here, we expand on previous work in four ways. Our first aim is to investigate multiple heterogeneity estimators and compare their performance in a biased world. Our second aim is to investigate publication bias from a new angle. Previous analyses based publication bias on 1-tailed testing, whereby only positive results (i.e., those that point in the desired direction) can escape censoring [25, 28]. In applied research (e.g., medical trials), the valence of effect direction is often unequivocal (e.g., when the treatment reduces or increases mortality). In this case, the allure of positive findings is clear and 1-tailed publication bias appears indisputable. But in some areas of basic research, 2-tailed publication bias might be plausible because findings that go against the grain of received opinion can have particular appeal [29]. Our third aim is to consider the effects not only of publication bias but also of p-hacking. Our fourth and last aim is to investigate if biases in sets of primary studies affect estimates of effect size and heterogeneity to a similar degree or if one is prone to stronger distortions than the other.

Methods

Random effects model

In meta-analysis, random effects models, which take into account heterogeneity in the effect sizes underlying pertinent primary studies, are often most appropriate [4, 30, 31]. The random effects model describes θi, the true effect size in the ith study, as

\theta_i = \theta + \delta_i \quad (1)

whereby θ is the average true effect size and δi reflects its heterogeneity. The empirically observed effect size in the ith study serves as θ̂i, the estimate of θi, and is modelled as

\hat{\theta}_i = \theta_i + \varepsilon_i \quad (2)

whereby εi is the within-study error. δi and εi are typically presumed to be normally distributed with means of zero and variances τ² and σi², respectively. The average true effect size can then be estimated as

\hat{\theta} = \sum_{i=1}^{k} w_i \hat{\theta}_i \Big/ \sum_{i=1}^{k} w_i \quad (3)

with k being the number of studies in the meta-analysis and wi their weights. Ideally, weights wi = 1/(σi² + τ²) would be used. However, σi² and τ² are both unknown and need to be estimated from the data.
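As a minimal illustration of Eq 3 with made-up numbers (the vectors below are not taken from any real meta-analysis), the pooled effect size and its standard error can be computed as follows:

```r
theta_hat_i <- c(0.31, 0.12, 0.55, 0.02)      # observed effect sizes (illustrative)
sigma2_i    <- c(0.040, 0.025, 0.060, 0.030)  # estimated within-study variances
tau2_hat    <- 0.04                           # heterogeneity estimate (see estimators below)

w_i       <- 1 / (sigma2_i + tau2_hat)           # random-effects weights
theta_hat <- sum(w_i * theta_hat_i) / sum(w_i)   # Eq 3: weighted average effect size
se        <- sqrt(1 / sum(w_i))                  # standard error of the pooled estimate
```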

Heterogeneity estimators

Numerous methods have been proposed to derive the estimated heterogeneity variance (τ^2). We considered five heterogeneity estimators in our simulation, which have either been frequently used or were positively evaluated in relevant reviews [7, 32]: DerSimonian-Laird (DL) [33], Hunter-Schmidt (HS) [34], maximum likelihood (ML) [35], Paule-Mandel (PM) [36], and restricted maximum likelihood (REML) [37].

DL and PM are method-of-moments estimators and have the general form

\hat{\tau}^2 = \frac{\sum_{i=1}^{k} w_i(\hat{\theta}_i - \hat{\theta})^2 \, - \, \sum_{i=1}^{k} w_i \hat{\sigma}_i^2 \, + \, \frac{\sum_{i=1}^{k} w_i^2 \hat{\sigma}_i^2}{\sum_{i=1}^{k} w_i}}{\sum_{i=1}^{k} w_i \, - \, \frac{\sum_{i=1}^{k} w_i^2}{\sum_{i=1}^{k} w_i}} \quad (4)

whereby θ̂ = Σᵢ wi θ̂i / Σᵢ wi, computed with the respective weights. DL uses fixed-effects weights, wi = 1/σ̂i². In contrast, PM uses random-effects weights wi = 1/(σ̂i² + τ̂²), which are determined through an iterative process that always converges. Using fixed-effects weights wi = 1/σ̂i², HS estimates the heterogeneity variance as

\hat{\tau}^2 = \frac{\sum_{i=1}^{k} w_i(\hat{\theta}_i - \hat{\theta})^2 - k}{\sum_{i=1}^{k} w_i} \quad (5)

ML and REML both employ random-effects weights wi = 1/(σ̂i² + τ̂²). ML takes the form

\hat{\tau}^2 = \frac{\sum_{i=1}^{k} w_i^2 \left( (\hat{\theta}_i - \hat{\theta})^2 - \hat{\sigma}_i^2 \right)}{\sum_{i=1}^{k} w_i^2} \quad (6)

whereas REML uses

\hat{\tau}^2 = \frac{\sum_{i=1}^{k} w_i^2 \left( (\hat{\theta}_i - \hat{\theta})^2 - \hat{\sigma}_i^2 \right)}{\sum_{i=1}^{k} w_i^2} + \frac{1}{\sum_{i=1}^{k} w_i} \quad (7)

ML and REML both use iterative cycles to jointly estimate θ̂ and τ̂². Occasionally, these fail to converge on a solution. All estimators set any negative values for τ̂² to zero.
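In practice, all five estimators are available in the metafor package used for our simulations. The following minimal sketch uses hypothetical effect sizes (yi) and sampling variances (vi); the method argument selects the heterogeneity estimator, and test = "knha" requests the Knapp-Hartung adjustment for tests of the overall effect.

```r
library(metafor)

# Hypothetical input: observed SMDs (yi) and their sampling variances (vi)
dat <- data.frame(yi = c(0.31, 0.12, 0.55, 0.02, 0.40),
                  vi = c(0.040, 0.025, 0.060, 0.030, 0.045))

# Fit the same random-effects model with each heterogeneity estimator
methods <- c("DL", "HS", "ML", "PM", "REML")
fits <- lapply(methods, function(m)
  rma(yi, vi, data = dat, method = m, test = "knha"))
names(fits) <- methods

sapply(fits, function(f) sqrt(f$tau2))   # T, the heterogeneity estimate in SMD units
sapply(fits, function(f) coef(f))        # corresponding overall effect size estimates
```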

Simulation

Simulations were carried out in R (version 4.0.3). Metafor [38] version 2.4–0 was used to run meta-analyses on the simulated studies. The annotated R code is available in the supplement.

Simulation methods

In the simulations, multiple independent studies were run and submitted for publication (potentially biased by p-hacking), published (or not), and (if published) summarized in a meta-analysis. Between-subjects experiments with two groups were simulated. The outcome variable was continuous, and the standardized mean difference (SMD) served as effect size index. Meta-analyses on continuous outcomes are frequent in psychology [1]. We aggregated observed sample sizes from a representative set of 150 psychological meta-analyses [15] into a single distribution. Sample sizes Ni for simulated studies were randomly sampled from this distribution and equally split between groups 1 and 2. Median Ni (for both groups combined) was 100, with an interquartile range of 176. If average sample size differed considerably across the 150 meta-analyses in our set, our approach might result in unrealistic combinations of very large and very small samples in simulated meta-analyses, which in turn might distort our results [28]. However, an ANOVA (bias corrected accelerated bootstrap with 1,000 samples) revealed little variation of average sample size across these 150 meta-analyses (ηp2 = 0.020, F(149, 7077) = 0.97, p = .595).
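A simplified sketch of how one primary study was generated is given below (the full annotated code is in the supplement; the small pool of sample sizes shown here is illustrative and stands in for the aggregated empirical distribution).

```r
simulate_study <- function(theta = 0.2, tau = 0.33, n_pool) {
  n_total <- sample(n_pool, 1)                 # total N drawn from the (empirical) distribution
  n_group <- floor(n_total / 2)                # equal split between the two groups
  theta_i <- rnorm(1, mean = theta, sd = tau)  # study-specific true effect (Eq 1)
  g1 <- rnorm(n_group, mean = theta_i)         # experimental group
  g2 <- rnorm(n_group, mean = 0)               # control group
  sd_pool <- sqrt((var(g1) + var(g2)) / 2)
  list(yi = (mean(g1) - mean(g2)) / sd_pool,   # observed SMD
       p  = t.test(g1, g2, var.equal = TRUE)$p.value,
       N  = n_total)
}

# Illustrative pool of total sample sizes (not the actual empirical distribution)
study <- simulate_study(n_pool = c(40, 60, 80, 100, 100, 150, 220, 400))
```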

Describing heterogeneity

We use τ to describe heterogeneity. In the present context, it has a number of advantages. In contrast to τ2, τ has an intuitive interpretation in that it reflects the standard deviation of the true effect size. Moreover, τ is in the same SMD unit as the simulation’s effect size estimates. This facilitates our fourth aim, to compare the effects of biased sets of primary studies on estimates of effect size and estimates of heterogeneity. Imagine that a given level of publication bias and p-hacking led to a bias of 0.1 in the overall effect size estimate θ^ and a bias of 0.1 in the heterogeneity estimate τ^. In this case it would be sensible to conclude that effect size estimates and heterogeneity estimates were affected to the same extent (although the same degree of bias might be seen as more consequential for effect size estimates than for heterogeneity estimates).

To describe bias in heterogeneity estimates we found verbal labels helpful, although a degree of arbitrariness is inevitable. In a recent survey of heterogeneity in psychology meta-analyses, average T (the empirical estimate of heterogeneity in SMD units) was 0.33 [15]. In light of this, labels of small/medium/large for (unsigned) bias in T of 0.05/0.10/0.20 struck us as sensible and we will use them in this way throughout.

Factors manipulated

To address our first aim, we compared the performance of the five heterogeneity estimators described above. Six factors were manipulated in our simulations (see Table 1). Addressing our second aim, the first factor concerned the type and prevalence of p-hacking applied to the experiments. We considered the four types of p-hacking [10] described above. i) Optional dependent variables: Researchers in the simulated experiment used two dependent variables, which were correlated ρ = .8 at the population level. ii) Optional stopping: After reaching their starting Ni, researchers regularly peeked at their results. They kept adding 10% of the starting Ni until they either obtained a statistically significant finding or hit maximum Ni (arbitrarily set at five times the starting Ni or 200, whichever is lower). iii) Optional moderator: The sex of all participants in the experiment was decided at random (p = .5). Researchers analyzed results for females only, males only, and the whole sample. iv) Optional outlier removal: Researchers run separate analyses on all data, and on data with outliers (unsigned z ≥ 2) removed.

Table 1. Simulation parameters.

Experimental Factors Abbreviation Levels
P-hacking p-hack None, medium, high
Type of publication bias TAIL 1-tailed, 2-tailed
Strength of publication biasa PB 0%, 40%, 80%
True heterogeneity τ 0, 0.11, 0.22, 0.33, 0.44
True average effect size θ 0, 0.2, 0.5, 0.8
Number of studies per meta-analysis k 9, 18, 36, 72

aIndicated as the proportion of statistically non-significant studies that remain unpublished. For 1-tailed publication bias, all negative findings are censored, independent of the strength of publication bias.

We then used these four p-hacking strategies to simulate three research environments [10]: no, medium, and high p-hacking. In the no p-hacking environment, no p-hacking was used. Consequently, each experiment leads to only one result, which would be published (unless censored by publication bias). In the medium p-hacking environment, 30% of researchers did not engage in p-hacking, 50% of researchers used both optional dependent variables and optional stopping, and 20% of researchers used all four p-hacking strategies. For the high p-hacking environment, these percentages were 10%, 40%, and 50%, respectively. Multiple p-hacking strategies were fully crossed. Thus, a researcher who engaged in all four would first study starting Ni participants and perform analyses on both dependent variables, with and without outliers, on all participants and on females and males separately. If none of these 12 analyses returned a statistically significant result (p < .05), 10% more participants were studied, and the same analyses carried out again. This cycle ended when either statistical significance or maximum Ni was reached. If multiple analyses resulted in statistically significant results at this point, only the one with the smallest p-value was submitted for publication.
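As a hedged sketch of this procedure, the R code below illustrates the core loop for a researcher who combines optional dependent variables with optional stopping (function and variable names are ours; it is simplified to a per-group sample size and two of the four strategies, whereas the full implementation in the supplement also covers optional moderators and optional outlier removal).

```r
library(MASS)  # for mvrnorm

run_hacked_study <- function(theta_i = 0, n_start = 50, rho = 0.8) {
  n_max <- min(5 * n_start, 200)                 # cap on the (per-group) sample size
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)   # two DVs correlated at rho = .8
  draw  <- function(n, mu) mvrnorm(n, mu = c(mu, mu), Sigma = Sigma)
  g1 <- draw(n_start, theta_i)                   # experimental group
  g2 <- draw(n_start, 0)                         # control group
  repeat {
    p <- c(t.test(g1[, 1], g2[, 1], var.equal = TRUE)$p.value,  # optional DVs:
           t.test(g1[, 2], g2[, 2], var.equal = TRUE)$p.value)  # test both, keep the better one
    if (min(p) < .05 || nrow(g1) >= n_max)
      return(list(p = min(p), n_per_group = nrow(g1)))          # report the best analysis only
    add <- ceiling(0.10 * n_start)               # optional stopping: add ~10% and peek again
    g1  <- rbind(g1, draw(add, theta_i))
    g2  <- rbind(g2, draw(add, 0))
  }
}
```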

Addressing our third aim, the second factor implemented type of publication bias as either 1- or 2-tailed. Under 1-tailed publication bias, statistically significant results (2-tailed testing) in the expected direction were always published; all other results were censored to a degree that was defined by the strength of publication bias. If p-hacking required selection between multiple analyses, this was contingent on a modified p-value, which equaled p for results in the expected direction. For results in the opposite direction, the modified p-value was computed as 1 + (1-p). Obviously, the modified p-value cannot be interpreted as a probability, but it appropriately penalizes results in the wrong direction with, ceteris paribus, stronger effects carrying greater penalties. With 2-tailed publication bias, statistically significant results were published regardless of sign, and all other results were censored to a degree that was defined by the strength of publication bias. Strength of publication bias was the third factor, implemented with levels 0%/40%/80% of non-significant results being censored. Degree of true heterogeneity was implemented with levels τ = 0.00 to 0.44 in steps of 0.11. The three highest levels represent average heterogeneity ±1SD observed in a survey of psychological meta-analyses [15]. We decided against inclusion of higher levels in our simulation because the usefulness of the meta-analytic model becomes questionable in the face of very high heterogeneity [30]. To what extent multiple close replications (in which an original study is replicated as faithfully as possible across many labs and each lab’s results are treated as a separate study) show heterogeneity has become an important issue in psychology [39]. We therefore included the additional level of τ = 0.11, which comes close to average observed heterogeneity in multiple close replications [15]. To facilitate comparisons with meta-analyses that express heterogeneity in the I2 metric (i.e., the proportion of between-study variance estimated to be due to heterogeneity), we computed mean I2 levels across simulations without publication bias and questionable research practices. The simulation’s five heterogeneity levels translated into I2 values of 6.7%, 31.7%, 62.4%, 78.4%, and 86.2% (with respective medians of 0.0%, 30.4%, 65.8%, 81.3%, and 88.3%). In Monte Carlo simulations of bias-free meta-analyses, the strength of the true effect can typically be disregarded as inconsequential [6, 37, 40]. In a biased world, however, θ proves important [10, 28]. We therefore implemented θ and used levels 0.0/0.2/0.5/0.8 because the latter three are often considered as benchmarks for small/medium/large effects in psychology [41]. Finally, the number of studies feeding into each meta-analysis was implemented with k = 9/18/36/72; an average of k = 37 in psychology meta-analyses motivated these choices [15].
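A hedged sketch of these decision rules follows (function names are ours; the 1-tailed variant additionally penalizes or censors results in the unexpected direction, as described above).

```r
# Modified p-value used when p-hacking selects between analyses under
# 1-tailed publication bias: results in the unexpected (negative) direction are
# penalized so that, ceteris paribus, stronger wrong-sign effects fare worse.
modified_p <- function(p, d_obs) ifelse(d_obs >= 0, p, 1 + (1 - p))

# 2-tailed censoring: significant results are always published; a proportion
# 'pb' (the strength of publication bias) of non-significant results is censored.
published_2t <- function(p, pb = 0.80) (p < .05) | (runif(length(p)) > pb)
```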

A summary of simulation factors is provided in Table 1. All six manipulated factors were fully crossed, resulting in 1,440 unique factor combinations. Following [10], 1,000 meta-analyses were run for each combination (due to the intense computational demands in the conditions that involved questionable research practices, a higher number did not prove feasible). Occasional trials in which ML or REML failed to converge on a solution were replaced until 1,000 meta-analyses were completed. For each cell of the design, the simulation computed the standard deviation across the 1,000 heterogeneity estimates. Following [42], we divided this by √1000 to estimate the Monte Carlo error, i.e., the standard error for the heterogeneity estimate in each cell. The mean was 0.0023, its maximum 0.0068, which strikes us as sufficiently low. The annotated R code, which provides further technical details, is available at https://osf.io/qga8v/.
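As a minimal sketch of this Monte Carlo error computation for a single design cell (T_hat is a hypothetical vector holding that cell's 1,000 heterogeneity estimates):

```r
# T_hat: vector of 1,000 heterogeneity estimates (T) from one design cell
mc_error <- sd(T_hat) / sqrt(length(T_hat))   # Monte Carlo standard error [42]
# across all design cells, the mean of this quantity was 0.0023 and its maximum 0.0068
```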

Results

Throughout our results, we will refer to estimates of overall effect size as d and to estimates of heterogeneity as T. Data files are available here: https://osf.io/qga8v/.

P-hacking increased mean Ni in simulated meta-analyses from 123 (no p-hacking) to 141 (medium p-hacking) and 129 (high p-hacking).

Estimation of effect size and heterogeneity in the absence of bias

To evaluate estimation performance in the absence of bias, analyses in this section are restricted to simulation conditions without p-hacking and without publication bias. Across analyses and in line with previous findings, level of effect size proved of little consequence [6]. Consequently, figures do not differentiate results by effect size.

Estimates of θ proved virtually unbiased (see S1 Fig), which is in line with previous simulations [6, 40, 43, 44]. Coverage probability (i.e., the percentage of confidence intervals that included θ) was too low for lower (but not for absent) heterogeneity, particularly in conjunction with small k (see S2 Fig). As can also be seen, the HS and ML estimators suffered most from these problems. Consequently, type-1 errors were inflated under the same circumstances (see Fig 1), again particularly for the HS and ML estimators. This contrasts with previous findings from simulations with parameters similar to ours [40], the notable exception being markedly lower variability in Ni in that study. For a variety of heterogeneity estimators, including HS and ML, that study found excellent coverage for confidence intervals based on t-distributions.

Fig 1. Proportion of type-1 errors for overall effect size estimate d in the absence of publication bias and p-hacking for five heterogeneity estimators as a function of true heterogeneity (τ) and number of studies per meta-analysis (k), with α-level = 0.05.


Unlike estimates of θ, heterogeneity estimates proved somewhat biased. True levels of τ = 0 were overestimated (which is not surprising, given that heterogeneity estimates are ≥0); for all other heterogeneity levels, τ estimates proved too low, especially when k was low (see Fig 2). Again, these problems were particularly strong for the HS and ML estimator. Bias for the DL, PM, and REML estimator rarely exceeded 0.02 and therefore appears negligible, particularly in light of average heterogeneity (τ = 0.33) in a survey of meta-analyses on continuous outcomes [15].

Fig 2. Bias in heterogeneity estimates (Tbias) in the absence of publication bias and p-hacking for five heterogeneity estimators as a function of true heterogeneity (τ) and number of studies in the meta-analysis (k).

Low bias is only one desirable property for heterogeneity estimators. In addition, they should be little affected by sampling fluctuation, i.e., under the same circumstances the variance in their estimates should be low. The root mean square error for heterogeneity estimates (Trmse) combines both bias and variance. By this measure, the ML and REML estimators performed consistently well (see Fig 3). The DL estimator, although showing little bias (see Fig 2), lost ground through relatively large variance, particularly for larger k; conversely, the ML estimator, although showing considerable bias (see Fig 2), looked somewhat better on Trmse because of its low variance (see S3 Fig). Our findings on bias and RMSE are broadly in line with those of previous simulations [32]. (A notable exception is a previous study [37] that found ML and HS to be comparable on both criteria, whereas HS performed clearly worse in our simulation, particularly for larger heterogeneity, as shown in Figs 2 and 3. This discrepancy might be partly down to the fact that the previous study implemented somewhat weaker heterogeneity (τ ≤ 0.31) than our simulation (τ ≤ 0.44) and did not truncate negative heterogeneity estimates to zero.) Finally, coverage of confidence intervals around T proved excellent across all conditions (see S4 Fig).
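For clarity, a minimal sketch of how Tbias and Trmse are computed for one design cell, assuming T_hat holds that cell's 1,000 heterogeneity estimates and tau is the true value:

```r
T_bias <- mean(T_hat) - tau             # signed bias of the heterogeneity estimate
T_rmse <- sqrt(mean((T_hat - tau)^2))   # root mean square error: bias and variance combined
# note: T_rmse^2 equals T_bias^2 plus the (population) variance of the estimates
```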

Fig 3. Root mean square error for heterogeneity estimates (TRMSE) in the absence of publication bias and p-hacking for five heterogeneity estimators as a function of true heterogeneity (τ) and number of studies in the meta-analysis (k).


To summarize, in the absence of biases we found two estimation problems: First, considerable type-1 error inflation occurred in tests of the overall effect size when true heterogeneity was low. This occurred even though these tests implemented the Knapp-Hartung adjustment [45]. Second, the HS estimator (and, to a lesser extent, the ML estimator) led to considerable underestimation of heterogeneity at the highest level of true heterogeneity in connection with low k. Overall, the REML estimator performed particularly well due to a combination of low bias and low variance. PM demonstrated the same strengths, unless heterogeneity was absent, which will be an unrealistic assumption in most contexts. Our simulations thus support previous positive evaluations of the REML and PM estimators in the absence of bias [6, 32].

Estimation of heterogeneity in the presence of bias

In this section we look at heterogeneity estimates across all levels of our simulation and start with effects on Tbias (i.e., T−τ). Given the complexity of our simulation, understanding which factors or factor combinations matter poses a challenge. To address this problem we ran, for each heterogeneity estimator, a six-factorial between-subjects ANOVA on Tbias and used effect sum of squares to understand which factors and interactions proved most influential. In this and subsequent ANOVAs, main effects, 2-way-interactions, and 3-way interactions together accounted for upwards of 98% of variance for each estimator. Here and in subsequent analyses, we can therefore exclude an important role for 4-way and higher interactions, and consequently we do not comment on them. Table 2 identifies the most important effects. However, before we consider them in detail it is of interest to identify which heterogeneity estimators were least and most affected by our manipulations. (We will refer to this characteristic as an estimator’s “inertia” vs. “volatility”.) Ideally, any heterogeneity estimator should be rather inert. If its bias is low, inertia instils confidence that low bias will also prevail under the specific (but largely unknown) circumstances for the meta-analysis at hand. (Note that a meta-analyst only knows Ni and k for sure. The prevalence of p-hacking, the true heterogeneity between studies, etc. remain unknown.) If the estimator’s bias is large, inertia implies that it could be confidently corrected. The ANOVA’s corrected total sums of squares directly reflect estimators’ volatility. As can be seen from Table 2, the DL estimator proved most inert, whereas PM was (by a considerable margin) most volatile.

Table 2. The relative importance of design factors for Tbias.

Selected sum of squares from six-factorial ANOVA for five heterogeneity estimators.

DL HS ML PM REML M
P-hack 0.25 0.22 0.36 1.03 0.40 0.45
TAIL 0.48 0.37 1.03 1.65 1.11 0.93
θ 0.40 0.36 1.06 2.13 1.13 1.01
τ 1.88 3.39 1.13 1.89 1.04 1.87
k 0.01 0.25 0.27 0.01 0.02 0.11
PB 0.01 0.00 0.06 0.07 0.06 0.04
Sum main effects 3.03 4.59 3.91 6.78 3.76 4.41
θ ⨯ PB 0.20 0.14 0.42 0.51 0.46 0.35
P-hack ⨯ θ 0.02 0.02 0.10 0.22 0.10 0.09
TAIL ⨯ θ 0.27 0.21 0.72 0.97 0.77 0.59
P-hack ⨯ TAIL 0.10 0.08 0.30 0.53 0.33 0.27
P-hack ⨯ τ 0.15 0.11 0.10 0.24 0.12 0.14
Sum 2-way interactions 1.02 0.82 2.10 2.88 2.22 1.82
P-hack ⨯ TAIL ⨯ θ 0.07 0.05 0.23 0.32 0.25 0.18
Sum 3-way interactions 0.24 0.16 0.54 0.63 0.55 0.39
Error 0.03 0.02 0.07 0.05 0.07 0.05
Corrected total 4.29 5.62 6.60 10.34 6.61 6.69

The table shows all main effects and all 2-way and 3-way interactions for which sum of squares ≥ 0.20 for at least one estimator. The rightmost column shows the mean across the five estimators.
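A hedged sketch of the decomposition behind Table 2 (object and factor names are ours): a between-subjects ANOVA of Tbias on the six design factors with terms up to 3-way interactions, whose sums of squares indicate each factor's relative importance. Here, cells is assumed to hold one row per design cell with the factors coded as factors and that cell's Tbias for a given estimator.

```r
# Variance decomposition of Tbias across design cells (one estimator at a time)
fit <- aov(T_bias ~ (p_hack + TAIL + PB + tau + theta + k)^3, data = cells)
summary(fit)   # sums of squares for main effects, 2-way and 3-way interactions
```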

Regarding main effects on Tbias, strength of publication bias and k proved largely inconsequential (see Table 2). In decreasing order of importance, τ, effect size, type of publication bias, and p-hacking prevalence proved relevant. Their effects are summarized in the panels of Fig 4. As can be seen, all five heterogeneity estimators were affected in similar ways. Overall, Tbias was driven upwards by lower levels of true heterogeneity, the absence of a true effect, 2-tailed publication bias, and higher levels of p-hacking. In general, the PM estimator produced the highest heterogeneity estimates and HS the lowest, with DL, ML, and REML in between. This meant that the largest positive and negative levels of Tbias were observed for PM and HS, respectively. The PM estimator showed small to moderate positive Tbias for absent or low heterogeneity, for an absent or small true effect, for 2-tailed publication bias and for moderate and high p-hacking. The HS estimator showed small negative Tbias for higher levels of heterogeneity, for 1-tailed publication bias, and in the absence of p-hacking. DL, ML, and REML showed the largest (small, positive) Tbias in the absence of true heterogeneity and in the absence of a true effect.

Fig 4. Bias in heterogeneity estimates (Tbias) for five heterogeneity estimators as a function of true heterogeneity (τ), true average effect size (θ), type of publication bias, and p-hacking environment, respectively.


Regarding 2-way interactions, effect size ⨯ type of publication bias and effect size ⨯ strength of publication bias were most relevant (see Table 2) and are shown in Figs 5 and 6. In the absence of a true effect, 2-tailed publication bias increased heterogeneity estimates particularly strongly and positive Tbias emerged (small for HS and DL; moderate for ML and REML; and moderate-to-large for PM; see Fig 5). This arises because only under 2-tailed publication bias do p-hacking and publication bias have the potential to push published effect sizes either above or below zero, thus maximising their variance. Similarly, the absence of a true effect also boosted the positive Tbias created by strong publication bias (see Fig 6). At 80% publication bias and in the absence of a true effect, positive Tbias was moderate-to-large for the PM estimator, moderate for ML and REML, small for DL, and least pronounced for HS. This reflects that strong publication bias maximises the variance in published effect sizes at θ = 0 because (exaggerated) published effect sizes are equally likely to be above or below zero. Other effects on Tbias proved moderate in size and are immaterial to our discussion, but for illustration the largest 3-way interaction is shown in S5 Fig.

Fig 5. Bias in heterogeneity estimates (Tbias) for five heterogeneity estimators: Two-way interaction of true average effect size (θ) with type of publication bias.


Fig 6. Bias in heterogeneity estimates (Tbias) for five heterogeneity estimators: Two-way interaction of true average effect size (θ) with strength of publication bias.


Because our simulation was the first to consider p-hacking in addition to publication bias, we compared their effects in greater detail. In the ANOVA (Table 2), the 2-way interaction between p-hacking and strength of publication bias proved zero for all five heterogeneity estimators. In other words, the effects of p-hacking and publication bias were strictly additive, which is shown in S6 Fig. Nonetheless, their interactions with effect size proved somewhat different in nature (see S7 Fig). For an effect size of zero, both higher levels of p-hacking and stronger publication bias strongly increased Tbias. For larger effect sizes, a similar (although slightly weaker) effect emerged for p-hacking (see upper panel), but a reversal of this effect was observed for publication bias; i.e., for larger effect sizes, an increase in publication bias now led to a (small) decrease in Tbias (see lower panel).

Unlike previous work [28], we implemented p-hacking in addition to publication bias and described heterogeneity via τ instead of I2. These differences notwithstanding, our simulation confirmed their finding that (under 1-tailed publication bias), overestimation of heterogeneity occurs under fewer simulation conditions than underestimation, and the latter is particularly strong for small θ and strong publication bias (see S8 Fig, which is restricted to 1-tailed publication bias).

Next, we look at Trmse to capture estimators’ variance in addition to their bias. Again, we used the sums of squares from a six-factorial between-subjects ANOVA on Trmse for guidance (Table 3). As previously, we used the ANOVA’s corrected total sum of squares to judge estimators’ inertia. Paralleling the results for Tbias, the DL estimator again proved most inert, whereas PM was again considerably more volatile than any other estimator.

Table 3. The relative importance of design factors for Trmse.

Selected sum of squares from six-factorial ANOVA for five heterogeneity estimators.

DL HS ML PM REML M
P-hack 0.11 0.05 0.27 0.84 0.35 0.33
TAIL 0.02 0.00 0.11 0.61 0.21 0.19
θ 0.07 0.02 0.29 0.84 0.42 0.33
τ 0.71 1.25 0.43 0.28 0.33 0.60
k 0.63 0.53 0.91 1.13 1.00 0.84
PB 0.03 0.03 0.10 0.14 0.12 0.08
Sum main effects 1.58 1.88 2.11 3.85 2.42 2.37
P-hack ⨯ TAIL 0.03 0.02 0.15 0.26 0.16 0.13
P-hack ⨯ θ 0.04 0.02 0.24 0.55 0.32 0.23
TAIL ⨯ θ 0.12 0.15 0.13 0.21 0.12 0.15
τ ⨯ θ 0.03 0.02 0.15 0.26 0.16 0.13
Sum 2-way interactions 0.46 0.52 1.03 1.82 1.07 0.98
P-hack ⨯ TAIL ⨯ θ 0.02 0.01 0.10 0.20 0.12 0.09
Sum 3-way interactions 0.16 0.18 0.39 0.45 0.38 0.31
Error 0.03 0.03 0.06 0.05 0.06 0.05
Corrected total 2.22 2.61 3.60 6.17 3.93 3.71

The table shows all main effects and all 2-way and 3-way interactions for which sum of squares ≥ 0.20 for at least one estimator.

Regarding main effects on Trmse, both strength and type of publication bias proved largely inconsequential. In decreasing order of importance, k, τ, θ, and p-hacking prevalence proved particularly relevant. Again, the main effects affected all five heterogeneity estimators in similar ways, but PM fared generally poorly in comparison to the others. Not surprisingly, Trmse was decreased by increasing k, but also by low (but not absent) heterogeneity, by larger effect sizes, and by a lower prevalence of p-hacking (see Fig 7). Across levels of k, the performance of DL, HS, ML, and REML proved very similar to one another. Otherwise, DL and HS as well as ML and REML tended to show similar Trmse. DL/HS proved somewhat better for τ = 0.11 and in the absence of a true effect; ML/REML proved somewhat better for τ = 0.44 and for medium and large effects.

Fig 7. Root mean square error for heterogeneity estimates (TRMSE) for five heterogeneity estimators as a function of number of studies in the meta-analysis (k), true heterogeneity (τ) true average effect size (θ), and p-hacking environment, respectively.


The strongest 2-way interaction (of effect size with p-hacking) did not add much to the comparison of estimators over and above the main effects just discussed (see S9 Fig).

Estimation of effect size in the presence of bias

In this section we return to effect size estimates, but this time across all simulation conditions. We focus on dbias (i.e., dθ, whereby d is the unbiased estimate of Cohen’s d, see [46]), which was hardly affected by the type of heterogeneity estimator used. For reporting economy, we report results only for (arbitrarily chosen) DL.
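As a minimal illustration (with the usual small-sample correction factor discussed in [46]; variable names are ours), d for one primary study and dbias for one design cell might be computed as follows.

```r
# Small-sample-corrected SMD (Hedges' g), i.e. the 'unbiased estimate of Cohen's d'
hedges_g <- function(m1, m2, sd1, sd2, n1, n2) {
  sp <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))  # pooled SD
  J  <- 1 - 3 / (4 * (n1 + n2 - 2) - 1)                             # approximate correction factor
  J * (m1 - m2) / sp
}

# dbias for one design cell: mean of the 1,000 meta-analytic effect size estimates
# (d_estimates, a hypothetical vector) minus the true average effect size theta
d_bias <- mean(d_estimates) - theta
```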

As previously, we use the sums of squares from a six-factorial between-subjects ANOVA on dbias to understand which simulation factors and interactions were most influential (see Table 4). Effects on dbias proved somewhat more complex than effects on Tbias: Main effects explained only 63% of the variance in dbias (66% for Tbias, averaged across estimators) and 3-way interactions explained 10% (6% for Tbias, averaged across estimators). Fig 8 provides an overview of the most important effects. Obviously, p-hacking and publication bias both increased dbias; under the levels selected in our simulation, the former proved more powerful. The combination of both could induce large bias. For example, for a true effect size of zero, θ might be estimated to be over 0.3, a substantial effect. As one would expect, dbias was also stronger under 1-tailed than under 2-tailed publication bias, especially when a true effect was absent or small. Perhaps less intuitively, dbias also increased with τ (see S10 Fig).

Table 4. The relative importance of design factors for dbias (selected sum of squares from six-factorial between-subjects ANOVA).

Data are shown for DL but were very similar across all heterogeneity estimators.

P-hack 2.18
TAIL 0.89
θ 0.28
τ 0.89
k 0.00
PB 1.03
Sum main effects 5.27
θ ⨯ PB 0.21
TAIL ⨯ θ 1.04
TAIL ⨯ PB 0.22
τ ⨯ PB 0.22
TAIL ⨯ τ 0.26
Sum 2-way interactions 2.22
TAIL ⨯ θ ⨯ PB 0.30
Sum 3-way interactions 0.82
Error 0.08
Corrected total 8.38

The table shows all main effects and all 2-way and 3-way interactions for which sum of squares ≥ 0.20.

Fig 8. Bias in effect size estimates (dbias) as a function of p-hacking environment, strength of publication bias, true average effect size (θ), and type of publication bias.


Data shown are for the DL estimator but are very similar for other estimators.

For overall effect size estimates (d), meta-analyses typically report a p-value, which is tacitly assumed to provide an appropriate safeguard against type-1 errors. Fig 9 shows type-1 error rates for d under 1-tailed publication bias in our simulation. (Under 2-tailed publication bias, type-1 error rates proved very close to the nominal 5%.) As can be seen, type-1 error rates might reach catastrophic levels. Random effects p-values for d will therefore fail to offer protection against type-1 errors unless publication bias and p-hacking can be ruled out.

Fig 9. Type-1 error rates for d under 1-tailed publication bias as a function of strength of publication bias, level of p-hacking, and number of studies in the meta-analysis (k).


Data shown are for conditions with θ = 0 and are based on the DL estimator but are very similar for other estimators.

Comparison of biases in estimates of effect size and heterogeneity

As we expressed effect size and heterogeneity in the same SMD unit, it is possible to compare the effects of biased research on estimates of effect size and heterogeneity. For this purpose, Fig 10 contrasts unsigned estimation error for d and for T (i.e., absolute dbias and absolute Tbias) via boxplots. The upper panel is based on all simulation conditions. The middle panel excludes simulations with τ ≤ 0.11 because such low levels of heterogeneity are rarely observed [15]. It also excludes simulations with θ = 0; these might often translate into relatively small effect size estimates, which in turn might render them (and their level of heterogeneity) of little interest to researchers, at least in some areas of psychology. Finally, the lower panel in Fig 10 also excludes 2-tailed publication bias, because this might be unrealistic in many domains. As can be seen, errors in effect size estimation were consistently much larger than errors in heterogeneity estimation. From this perspective, publication bias and p-hacking cause far greater problems for the estimation of effect size than for the estimation of heterogeneity, especially when the latter relies on the DL, PM, or REML estimator.

Fig 10. Comparison of errors in estimation of effect size and heterogeneity.


Estimation errors for d are for the DL estimator, but virtually identical for the other estimators.

Discussion

One aim of our simulation was to compare the performance of heterogeneity estimators when publication bias and p-hacking distort the effect size estimates in primary studies. Confirming previous findings, we found that REML and PM did well when the sets of primary studies were unbiased [6, 32]. However, a different picture emerged once publication bias and p-hacking came into play: PM often performed poorly in terms of both bias and RMSE, whereas DL proved least biased while also showing low RMSE. Under conditions that might be particularly realistic and/or relevant in many research contexts (a real effect is present, heterogeneity is considerable, and any publication bias is 1-tailed), REML showed similarly low levels of bias and lower RMSE.

We also compared the effects of 1-tailed versus 2-tailed publication bias. In line with a previous simulation [28], we found that underestimation of heterogeneity dominated under 1-tailed publication bias. However, overestimation prevailed under 2-tailed publication bias and the PM estimator proved particularly susceptible. Two-tailed publication bias might be expected only in a limited number of fields in which findings that go against the grain have appeal [29]. Nonetheless we believe this differentiation to be important.

Finally, we sought to compare the effects of biased sets of primary studies on estimates of effect size and heterogeneity. In the bias-free world that underlies most investigations of this subject matter, estimation of heterogeneity proves much more challenging than estimation of effect size [32]. However, the presumed absence of biases in sets of primary studies seems unrealistic in many fields [8, 9, 19, 23]. In our simulation, biased sets of primary studies caused much more severe problems for estimates of effect size than for estimates of heterogeneity. Therefore, future investigations into meta-analytic parameter estimation should prioritise how to deal with biases in effect size estimates [e.g., 12, 22, 47–50] over the relative merits of different heterogeneity estimators.

In our simulations, levels of heterogeneity and effect size as well as Ni and k were based on empirical observations in psychology, which is a strength of our approach. We did not consider some heterogeneity estimators that previously showed promise, e.g., the two-step PM estimator [51], and we did not systematically manipulate Ni, which is a factor of interest in itself [6, 37, 52]. These are limitations of our approach. However, the six factors manipulated here, in conjunction with the five estimators we considered, already posed considerable challenges, both in terms of the simulations’ run time and the ensuing analyses, and an even more complex simulation design would not have been feasible. Although our modelling of p-hacking was based on empirical observations [23], its implementation cannot avoid arbitrary choices. (For example, in our simulation only the result with the lowest p-value was submitted for publication. Other choices, e.g., the analysis based on the largest number of participants whilst obtaining p < .05, or all analyses that obtained p < .05, would have been perfectly plausible.) Future studies will need to show how well our conclusions hold under modified assumptions. In this context it is encouraging that our simulation replicated key previous findings [28] even though our implementation of bias differed considerably from theirs. Finally, our simulations are restricted to continuous outcome measures, and it remains unclear if similar results hold for binary outcome measures.

Meta-analyses in psychology often find large heterogeneity [1, 2]. More importantly, its causes typically remain unclear and this combination reflects poorly on the scientific understanding of the subject matter [15]. Our simulation results show that high observed heterogeneity cannot be conveniently dismissed as resulting from bias. For example, based on the DL estimator, average T was found to be 0.33 in a large sample of meta-analyses in psychology [15]; our simulation found that DL rarely produces Tbias even as high as 0.1. This underscores that, across many domains in psychology, large unaccounted heterogeneity is a serious issue that deserves more attention.

Conclusion

For meta-analyses on continuous outcome measures we demonstrated here that the performance of heterogeneity estimators can differ considerably when effect sizes in the primary studies are distorted by publication bias and p-hacking, which is to be expected in many research domains [53–55]. Under various levels of distortions in the effect sizes of primary studies, heterogeneity estimates based on DL fared well in terms of bias and RMSE. However, as our own and previous work shows, REML outperforms DL in an unbiased research environment [6, 32]. Given that REML estimated heterogeneity almost as well as DL in a biased world (especially in simulation conditions that appear particularly plausible and/or important for actual research), REML remains in our view an excellent choice under the conditions simulated here, which should be broadly representative for meta-analyses of continuous outcomes in psychology.

For these conditions, our simulations suggest that the detrimental effects of biases in sets of primary studies are much larger for estimates of effect size than for estimates of heterogeneity. Therefore, our work underscores that the prevention and, in the case of past studies, detection and correction of biases in sets of primary studies is a pressing issue [10, 12, 22, 47–50, 56, 57].

Supporting information

S1 Fig. Mean bias in estimates of the true average effect size (dbias) in the absence of publication bias and p-hacking for five heterogeneity estimators as a function of true average effect size (θ), true heterogeneity (τ), and number of studies per meta-analysis (k).

(TIF)

S2 Fig. Coverage of 95% CIs around d in the absence of publication bias and p-hacking for five heterogeneity estimators as a function of true heterogeneity (τ) and number of studies per meta-analysis (k).

(TIF)

S3 Fig. Standard deviation for heterogeneity estimates under constant simulation conditions in the absence of publication bias and p-hacking for five heterogeneity estimators as a function of true heterogeneity (τ) and number of studies per meta-analysis (k).

(TIF)

S4 Fig. Coverage of 95% CIs around T in the absence of publication bias and p-hacking for the DL estimator as a function of true average effect size (θ), true heterogeneity (τ), and number of studies per meta-analysis (k).

Virtually identical results for other estimators not shown.

(TIF)

S5 Fig. Illustration of the strongest 3-way interaction on Tbias (see Table 2).

(TIF)

S6 Fig. Absence of interaction between effects of p-hacking and strength of publication bias on Tbias.

(TIF)

S7 Fig. P-hacking and strength of publication bias differ in their interaction with the true average effect size (θ) on Tbias.

(TIF)

S8 Fig. Under 1-tailed publication bias (shown here), underestimation of heterogeneity is more prevalent than overestimation.

(TIF)

S9 Fig. Illustration of the strongest 2-way interaction on Trmse (see Table 3).

(TIF)

S10 Fig. Overestimation of effect size increases as heterogeneity increases.

(TIF)

Acknowledgments

We would like to thank Thomas Pollet for helpful comments on an earlier draft.

Data Availability

All materials and data can be found at https://osf.io/qga8v/.

Funding Statement

The authors received no specific funding for this work.

References

1. van Erp S, Verhagen J, Grasman RP, Wagenmakers E-J. Estimates of between-study heterogeneity for 705 meta-analyses reported in Psychological Bulletin from 1990–2013. Journal of Open Psychology Data. 2017;5(1).
2. Stanley T, Carter EC, Doucouliagos H. What meta-analyses reveal about the replicability of psychological research. Psychological Bulletin. 2018;144(12):1325–46. doi: 10.1037/bul0000169
3. TIMSS & PIRLS International Study Center. TIMSS 2015 International Database. 2019. Available from: https://timssandpirls.bc.edu/timss2015/international-database/.
4. Schmidt FL, Oh IS, Hayes TL. Fixed- versus random-effects models in meta-analysis: Model properties and an empirical comparison of differences in results. British Journal of Mathematical and Statistical Psychology. 2009;62(1):97–128. doi: 10.1348/000711007X255327
5. Else-Quest NM, Hyde JS, Linn MC. Cross-national patterns of gender differences in mathematics: A meta-analysis. Psychological Bulletin. 2010;136(1):103–27. doi: 10.1037/a0018053
6. Langan D, Higgins JP, Jackson D, Bowden J, Veroniki AA, Kontopantelis E, et al. A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Research Synthesis Methods. 2019;10(1):83–98. doi: 10.1002/jrsm.1316
7. Veroniki AA, Jackson D, Viechtbauer W, Bender R, Bowden J, Knapp G, et al. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Research Synthesis Methods. 2016;7(1):55–79. doi: 10.1002/jrsm.1164
8. Ioannidis JP. Why most published research findings are false. PLoS Medicine. 2005;2(8). doi: 10.1371/journal.pmed.0020124
9. Fanelli D, Costas R, Ioannidis JP. Meta-assessment of bias in science. Proceedings of the National Academy of Sciences. 2017;114(14):3714–9.
10. Carter EC, Schönbrodt FD, Gervais WM, Hilgard J. Correcting for bias in psychology: A comparison of meta-analytic methods. Advances in Methods and Practices in Psychological Science. 2019;2(2):115–44.
11. Fanelli D, Ioannidis JP. US studies may overestimate effect sizes in softer research. Proceedings of the National Academy of Sciences. 2013;110(37):15031–6. doi: 10.1073/pnas.1302997110
12. Stanley T, Doucouliagos H, Ioannidis JP. Finding the power to reduce publication bias. Statistics in Medicine. 2017;36(10):1580–98. doi: 10.1002/sim.7228
13. McShane BB, Böckenholt U. You cannot step into the same river twice: When power analyses are optimistic. Perspectives on Psychological Science. 2014;9(6):612–25. doi: 10.1177/1745691614548513
14. Kenny DA, Judd CM. The unappreciated heterogeneity of effect sizes: Implications for power, precision, planning of research, and replication. Psychological Methods. 2019;24(5):578. doi: 10.1037/met0000209
15. Linden AH, Hönekopp J. Heterogeneity of research results: a new perspective from which to assess and promote progress in psychological science. Perspectives on Psychological Science. 2021;16(2):358–76. doi: 10.1177/1745691620964193
16. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):943–51. doi: 10.1126/science.aac4716
17. Scheel AM, Schijen MR, Lakens D. An excess of positive results: Comparing the standard Psychology literature with Registered Reports. Advances in Methods and Practices in Psychological Science. 2021;4(2):25152459211007467.
18. Schäfer T, Schwarz MA. The meaningfulness of effect sizes in psychological research: Differences between sub-disciplines and the impact of potential biases. Frontiers in Psychology. 2019;10:813. doi: 10.3389/fpsyg.2019.00813
19. Kvarven A, Strømland E, Johannesson M. Comparing meta-analyses and preregistered multiple-laboratory replication projects. Nature Human Behaviour. 2020;4(4):423–34. doi: 10.1038/s41562-019-0787-z
20. Sterling TD. Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association. 1959;54(285):30–4.
21. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science. 2011;22(11):1359–66. doi: 10.1177/0956797611417632
22. Simonsohn U, Nelson LD, Simmons JP. P-curve: a key to the file-drawer. Journal of Experimental Psychology: General. 2014;143(2):534–47.
23. John LK, Loewenstein G, Prelec D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science. 2012;23(5):524–32. doi: 10.1177/0956797611430953
24. Ferguson CJ, Brannick MT. Publication bias in psychological science: prevalence, methods for identifying and controlling, and implications for the use of meta-analyses. Psychological Methods. 2012;17(1):120–8. doi: 10.1037/a0024445
25. Jackson D. The implications of publication bias for meta-analysis’ other parameter. Statistics in Medicine. 2006;25(17):2911–21. doi: 10.1002/sim.2293
26. Jackson D. Assessing the implications of publication bias for two popular estimates of between-study variance in meta-analysis. Biometrics. 2007;63(1):187–93. doi: 10.1111/j.1541-0420.2006.00663.x
27. Kühberger A, Fritz A, Scherndl T. Publication bias in psychology: a diagnosis based on the correlation between effect size and sample size. PLoS ONE. 2014;9(9):e105825. doi: 10.1371/journal.pone.0105825
28. Augusteijn HE, van Aert R, van Assen MA. The effect of publication bias on the Q test and assessment of heterogeneity. Psychological Methods. 2019;24(1):116–34. doi: 10.1037/met0000197
29. Krueger JI, Funder DC. Towards a balanced social psychology: Causes, consequences, and cures for the problem-seeking approach to social behavior and cognition. Behavioral and Brain Sciences. 2004;27(3):313–27.
30. Serghiou S, Goodman SN. Random-effects meta-analysis: summarizing evidence with caveats. JAMA. 2019;321(3):301–2. doi: 10.1001/jama.2018.19684
31. Rice K, Higgins JP, Lumley T. A re-evaluation of fixed effect(s) meta-analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2018;181(1):205–27.
32. Langan D, Higgins JP, Simmonds M. Comparative performance of heterogeneity variance estimators in meta-analysis: a review of simulation studies. Research Synthesis Methods. 2017;8(2):181–98. doi: 10.1002/jrsm.1198
33. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials. 1986;7(3):177–88. doi: 10.1016/0197-2456(86)90046-2
34. Hunter JE, Schmidt FL. Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Sage; 2004.
35. Hardy RJ, Thompson SG. A likelihood approach to meta-analysis with random effects. Statistics in Medicine. 1996;15(6):619–29.
36. Paule RC, Mandel J. Consensus values and weighting factors. Journal of Research of the National Bureau of Standards. 1982;87(5):377–85. doi: 10.6028/jres.087.022
37. Viechtbauer W. Bias and efficiency of meta-analytic variance estimators in the random-effects model. Journal of Educational and Behavioral Statistics. 2005;30(3):261–93.
38. Viechtbauer W. Conducting meta-analyses in R with the metafor package. Journal of Statistical Software. 2010;36(3):1–48.
39. Klein RA, Ratliff KA, Vianello M, Adams Jr RB, Bahník Š, Bernstein MJ, et al. Investigating variation in replicability. Social Psychology. 2014;45(3):142–52.
40. Sánchez-Meca J, Marín-Martínez F. Confidence intervals for the overall effect size in random-effects meta-analysis. Psychological Methods. 2008;13(1):31. doi: 10.1037/1082-989X.13.1.31
41. Cohen J. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum; 1988.
42. Koehler E, Brown E, Haneuse SJ-P. On the assessment of Monte Carlo error in simulation-based statistical analyses. The American Statistician. 2009;63(2):155–62. doi: 10.1198/tast.2009.0030
43. Sidik K, Jonkman JN. Simple heterogeneity variance estimation for meta-analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2005;54(2):367–84.
44. Rukhin AL. Estimating heterogeneity variance in meta-analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2013;75(3):451–69.
45. Knapp G, Hartung J. Improved tests for a random effects meta-regression with a single covariate. Statistics in Medicine. 2003;22(17):2693–710. doi: 10.1002/sim.1482
46. Lakens D. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Frontiers in Psychology. 2013;4. doi: 10.3389/fpsyg.2013.00004
47. Henmi M, Hattori S, Friede T. A confidence interval robust to publication bias for random-effects meta-analysis of few studies. Research Synthesis Methods. 2021. doi: 10.1002/jrsm.1482
48. Stanley T, Doucouliagos H, Ioannidis JP, Carter EC. Detecting publication selection bias through excess statistical significance. Research Synthesis Methods. 2021;12:776–95. doi: 10.1002/jrsm.1512
49. Egger M, Smith GD, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629–34. doi: 10.1136/bmj.315.7109.629
50. Duval S, Tweedie R. Trim and fill: A simple funnel-plot–based method of testing and adjusting for publication bias in meta-analysis. Biometrics. 2000;56(2):455–63. doi: 10.1111/j.0006-341x.2000.00455.x
51. DerSimonian R, Kacker R. Random-effects model for meta-analysis of clinical trials: an update. Contemporary Clinical Trials. 2007;28(2):105–14. doi: 10.1016/j.cct.2006.04.004
52. Panityakul T, Bumrungsup C, Knapp G. On estimating residual heterogeneity in random-effects meta-regression: A comparative study. Journal of Statistical Theory and Applications. 2013;12(3):253–65.
  • 53.Ravn T, Sørensen MP. Exploring the Gray Area: Similarities and Differences in Questionable Research Practices (QRPs) Across Main Areas of Research. Science and engineering ethics. 2021;27(4):1–33. doi: 10.1007/s11948-021-00310-z [DOI] [PubMed] [Google Scholar]
  • 54.Banks GC, Rogelberg SG, Woznyj HM, Landis RS, Rupp DE. Evidence on questionable research practices: The good, the bad, and the ugly. Springer; 2016. [Google Scholar]
  • 55.Fraser H, Parker T, Nakagawa S, Barnett A, Fidler F. Questionable research practices in ecology and evolution. PloS one. 2018;13(7):e0200303. doi: 10.1371/journal.pone.0200303 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Munafò MR, Nosek BA, Bishop DV, Button KS, Chambers CD, du Sert NP, et al. A manifesto for reproducible science. Nature Human Behaviour. 2017;1. doi: 10.1038/s41562-016-0021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Renkewitz F, Keiner M. How to detect publication bias in psychological research. Zeitschrift für Psychologie. 2019. [Google Scholar]

Decision Letter 0

Tim Mathes

24 Sep 2021

PONE-D-21-24961Heterogeneity estimates in a biased worldPLOS ONE

Dear Dr. Hönekopp,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 08 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Tim Mathes

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have referenced (ie. Bewick et al. [5]) which has currently not yet been accepted for publication. Please remove this from your References and amend this to state in the body of your manuscript: (ie “Bewick et al. [Unpublished]”) as detailed online in our guide for authors

http://journals.plos.org/plosone/s/submission-guidelines#loc-reference-style 

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Dear Johannes and Audrey,

First of all, I’d like to compliment you on your nice and relevant study. It was a joy to see that you’ve attempted to answer many remaining and highly relevant questions about heterogeneity, QRPs and publication bias.

I will first provide you with my major comments. Minor comments and typos that caught my eye will be discussed later.

Kind regards,

Hilde Augusteijn

Major issues:

Method:

Page 9/page 24: It is unclear to me whether sample sizes were sampled from the total distribution of sample sizes from the 150 meta-analyses, or whether sample sizes were sampled per meta-analysis (e.g. randomly select 1 meta-analysis and generate data for all Ns in that meta-analysis). I suspect you did the former. I wonder, however, how representative these sample sizes are within one meta-analysis. That is, in reality very small and very large studies may not be included in a meta-analysis together, since they do not investigate the same topics, or do so in ways that are methodologically very different. Combining these small and large sample sizes might have a large impact on your heterogeneity estimates, as the variation of sample sizes matters for heterogeneity estimates (see Augusteijn et al., with a 1:1 ratio and a 1:10 ratio). Please discuss the impact of your sample size choices in the discussion section.

Page 11: 1- and 2-tailed publication bias: Your interpretation of 1-tailed publication bias is different from how it is commonly interpreted in selection models. Your model of bias still has three different parameters: negative significant results: probability of publication is 0; non-significant results: probability depends on publication bias; positive significant results: probability of publication is 1. Often, non-significant and negative significant results are both considered to be affected by publication bias. Your choice will certainly have an impact on your results, especially when the true effect size is 0. Please change your publication bias model for 1-tailed bias, or provide a discussion of its possible impact in your discussion, preferably with at least some additional simulations as a sensitivity analysis.

Page 11: Level of publication bias. This is my most important point of critique. I believe that your levels of publication bias are too limited. There are sufficient indications that publication bias in psychology might be higher than 80%. For example, over 90% of studies report support for their focal hypothesis in the study by Fanelli (2010). Furthermore, the effects of publication bias on heterogeneity estimates are non-linear and heterogeneity estimates are impacted most drastically when publication bias is 100% (biggest underestimation), or close to 100% (large overestimations when the true effect is small). Please also include higher levels of publication bias, e.g. 90%, 95%, and 100%. Even though 100% bias might (luckily) not be realistic, neither is 0% bias. Knowing how the different estimators behave in this scenario is still highly relevant.

Minor issues:

Page 5: Not all QRPs are related to running multiple analyses and reporting only the smallest p-value. For example, HARKing, rounding off p-values, or fraud. Furthermore, why did you choose these four? And do you expect they have exactly the same effect on the meta-analytical results, or not? What do we already know from previous studies about the effect of QRPs on, for example, effect size estimates?

Page 6: Do we know how often 2-tailed publication bias is plausible, compared to one tailed bias? Is there some data from empirical studies?

Page 9, line 189: please provide a reference for the claim that meta-analyses on continuous outcomes are frequent.

Page 9, line 201: Is this median N of 100 the total N or per group?

Page 12, line 271: An I2 value of 6.6% when true tau=0, in conditions without QRPs or bias. This deviates much more from 0 than I would expect?

Page 13, start of results: Please provide the reader with a sense of what the meta-analytical datasets looked like in the end: what was the effect of all QRPs (splitting datasets, adding participants) on the actual sample sizes of primary studies? Is this still close to Ni=100?

Typos:

Page 5, line 111: QRPs instead of QRPS

Page 10, line 220: sections are not labeled as 2.2.

Page 13, line 288: URL to osf page no longer works. Please update the URL.

Page 15, line 347: ‘low k low’.

Page 24, line 500: ‘were’ instead of ‘where’.

Reviewer #2: “Heterogeneity estimates in a biased world” is a Monte Carlo study of the effects of publication bias (PB) and QRP (questionable research practices). It seems to be rigorously conducted, and its simulations are based on realistic research conditions as seen in psychology. Its major finding is: “Our results showed that biases in primary studies caused much greater problems for the estimation of effect size than for the estimation of heterogeneity.” This is an important lesson that the meta-analysis community needs to hear. I suspect that this was already widely known, but I believe that this is the first paper that demonstrates this in a clear, rigorous and replicable way.

I wish to congratulate the authors for the way they conduct their Monte Carlo simulations. The design of the simulations can make an enormous difference to their results. Unless the important research parameters (sample sizes, the amount of heterogeneity, the degree of publication selection, etc.) accurately reflect what is seen in the actual relevant research literature, the findings will be largely irrelevant. However, the authors based their simulations on what they found in what seems to be a fairly representative sample of 150 meta-analyses in psychology. I recommend that PLOS publish this paper with a few revisions.

Suggestions for revisions.

1. Emphasize main findings: Please emphasize and expand the main finding that it is PB and QRP that cause random effects (RE) to be so very biased, and that this bias is very large under the typical conditions that the authors simulate. This substantial bias has also been confirmed in a systematic review of large pre-registered multi-lab replications (Kvarven et al., 2020), and it is so large that RE is entirely unreliable if applied to psychology naively without many qualifications and auxiliary statistical checks. These biases are also of a notable scientific size. The authors need to state a bit more strongly how these different methods of estimating tau (the heterogeneity SD) are largely irrelevant, especially relative to the size and consequences of RE’s bias. These consequences need to be explicitly stated and emphasized.

2. Biased studies: The way the authors characterize PB and QRP is rather misleading and may give the broader audience the wrong impression about the nature and extent of the problems involved. Classical PB is itself often interpreted as merely omitting some studies that are not statistically significant (SS). While this is indeed one avenue for the bias that we often find in published research results, there are many others. Reporting bias is recognized as a different avenue by medical researchers, as QRP is recognized by psychologists. But all of these vectors of the biases are the result of some process of selection of the results to be SS. This selection can be undertaken by the researchers on their own for their own reasons, or in anticipation of what reviewers and editors might demand. Or, this selection can be forced by the reviewers and editors. These details of the selection process are largely irrelevant because they have the same outcome and can be simulated in the same way. Thus, a little more discussion of what this bias is and more references to the classical and better regarded methods to correct for these biases (collectively called PB here, for short) is needed. The authors repeatedly characterize this problem as “biases in primary studies.” It is not, or at least, this is not necessarily a bias in the primary studies. PB can be very serious, just as we see it in practice, if the individual primary results are not biased, but were merely selectively reported to be SS from an entirely randomly produced distribution of estimates (with random QRPs, random outcome measures, random samples, etc). You might say that PB is an emergent property (selection for SS) of the entire research literature in a given area but is not associated with individually biased studies. Studies and researchers may also be biased, which will only amplify PB, but focusing on the unnecessary bias of individual studies can cause many to dismiss this severe problem. Many researchers do not believe that a notable portion of their colleagues are dishonest or deliberately distorting science. This is why PB is so pernicious and easily dismissed. It can emerge from the system as a whole, without individual researchers knowingly distorting science. Please do not characterize PB as ‘biased studies’ but rather as studies selected to be SS.

3. Type I errors: Please report the type I errors of all of these methods using the current simulation design. I suspect that the authors will find that RE has very high rates of type I errors for all of these methods, at least as long as there are more than a few estimates. If so, this will confirm the systematic review of large pre-registered multi-lab replications (Kvarven et al., 2020). Rates of false positives are very important as an indicator of scientific credibility. I suspect that RE (regardless of the method used to estimate tau) has such high rates of false positives, using the authors’ current simulation design, to disqualify RE from any serious scientific use. In any case, type I errors are important to show and to discuss. Not reporting type I errors could be considered to be a type of selection bias in the way these simulation results are displayed and published. Methods PB, if you will.

4. Alternatives to random effects: This entire study assumes that RE is the only adequate method to conduct basic meta-analysis in psychology and that this issue then comes down to the best way to calculate RE. This is not the case, and worse, the authors show that all the ways to calculate RE produce notably large bias (greatly exaggerating the size of the effect under examination). I suspect that this simulation design will show that RE has high rates of false positives. It has long been known that RE has unacceptable biases and that these biases are easily reduced (Henmi and Copas, 2010; Stanley and Doucouliagos, 2015). Henmi and Copas (2010) showed that FE (fixed effect) will notably reduce PB and that RE’s estimate of tau can accommodate the heterogeneity that FE ignores. However, Henmi and Copas (2010) use the DL estimate of tau in their calculation of the CI. So, the estimate of tau might still be important in their approach. Henmi and others (2021) have recently generalized this method and shown how it can work for very small meta-analyses. Alternatively, an entirely different approach, the unrestricted weighted least squares (UWLS), uses the bias reduction of FE but automatically accommodates heterogeneity using the mathematical invariance of WLS’s variance-covariance matrix to any multiplicative constant. UWLS accommodates heterogeneity without referring to or using RE or any of its estimates of tau (Stanley and Doucouliagos, 2015; 2017). That is, the central issue of this study of the effect of PB on estimates of tau could be entirely avoided and, at the same time, reduce the large biases reported in this paper. Simulations like these have shown that UWLS notably reduces RE’s bias with little if any compensating statistical loss (Stanley and Doucouliagos, 2015; 2017). These alternative methods to RE have been widely applied across the disciplines and used as a basis for a new statistical method to detect PB (Stanley et al., 2021). It would be nice if these other methods were simulated and reported using this same design. At a minimum, they need to be discussed as viable alternatives to this concern about how tau is calculated and as an alternative to RE’s large biases and high rates of false positives. The central scientific question is how to reduce or eliminate bias and false-positive meta-analyses because they are often the best scientific evidence we have.

References:

Henmi M, Copas JB. Confidence intervals for random effects meta-analysis and robustness to publication bias. Statistics in Medicine, 2010; 29:2969–2983.

Henmi M, Hattori S, Friede T. A confidence interval robust to publication bias for random-effects meta-analysis of few studies. Res Syn Meth. 2021;12:674–679. https://doi.org/10.1002/jrsm.1482

Stanley, T.D. and Doucouliagos, C. Neither fixed nor random: Weighted least squares meta-analysis. Statistics in Medicine, 2015; 34:2116–27.

Stanley, T.D. and Doucouliagos, C. Neither fixed nor random: Weighted least squares meta-regression analysis. Res Synth Methods. 2017;8:19-42.

Stanley TD, Doucouliagos H, Ioannidis JPA, Carter EC. Detecting publication selection bias through excess statistical significance. Research Synthesis Methods. 2021; 1-20. https://doi.org/10.1002/jrsm.1512
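
To make the UWLS idea in point 4 concrete, here is a minimal R sketch (not code from the paper or from the cited references; the simulated data and variable names are purely illustrative): the UWLS point estimate coincides with the fixed-effect estimate, while its standard error is rescaled by the weighted residual variance, which absorbs heterogeneity multiplicatively.

  library(metafor)
  set.seed(1)
  k  <- 20
  vi <- runif(k, 0.01, 0.09)                         # sampling variances
  yi <- rnorm(k, 0.3, sqrt(vi + 0.04))               # true effect 0.3, tau^2 = 0.04
  rma(yi, vi, method = "FE")                         # fixed-effect estimate
  rma(yi, vi, method = "REML")                       # random-effects estimate
  summary(lm(yi ~ 1, weights = 1/vi))$coefficients   # UWLS: WLS regression on a constant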

Reviewer #3: Review comments to the author can be found in the attached .docx document. They are organised in the sequence of the paper and include some general points and more specific questions to be responded to.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Hilde Augusteijn

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Plos Review 1.doc

PLoS One. 2022 Feb 3;17(2):e0262809. doi: 10.1371/journal.pone.0262809.r002

Author response to Decision Letter 0


1 Nov 2021

Authors: Thanks for your thorough reading of our paper. We appreciate your detailed feedback and found your comments helpful.

I enjoyed reading this simulation study examining the effect of bias on heterogeneity estimates in meta-analysis. I found the methodology and inferences drawn from the results to be broadly sound. I have some minor suggestions which I have documented below.

It is very pleasant to see implementation of open science requirements. Thank you for making all your code easily available. However, I was unable to run the R files to replicate the results due to the absence of the Excel file “obsN.xlsx”.

Authors: Sorry about that. The file is now uploaded.

Line 59 “is an example for heterogeneity.” - possibly is an example of heterogeneity?

Authors: Corrected.

Line 95 “Bias in primary studies” Perhaps this is an accepted term in the psychology literature, but I would argue that publication bias arises from aggregation of primary studies but does not imply bias in any individual study result. That would make calling it a bias in a primary study potentially confusing or misleading. This term recurs throughout the paper and I wonder if some alternative terminology would improve the clarity of the paper. Perhaps more clearly you might call them sources of bias in meta-analysis.

Authors: Our use of the expression QRPs in our original submission caused some confusion. This section covers the two motivationally related but conceptually distinct processes of (i) publication bias and (ii) flexibility in data collection and analysis (now “p-hacking”, previously QRPs). P-hacking might cause systematic bias in individual studies. Motivated by your comment, we changed the heading to “Bias in published primary studies”.

Line 97 I completely agree with your point here that both publication bias and QRPs are sources of bias but this binary categorisation is non-standard in my experience and I think potentially confusing. Questionable research practices is a deliberately broad term often used as a catch all for research misconduct short of outright fraud/falsification of data. In my experience it normally isn’t associated with a particular mechanism or pattern of bias as you write about multiple analyses (though I would be happy to be corrected with an appropriate reference). In addition publication bias can conceivably be caused by questionable research practices (e.g. not seeking publication for null results). I think it’s fine to keep using the term throughout the paper but perhaps if you altered the section introducing these biases to reduce the emphasis on these as the only two sources of bias, the complete separation of the two, and clarify that the QRPs you want to focus on are only a subset of a larger group of practices.

Authors: Our use of the expression QRPs was unfortunate, and we replaced it with “p-hacking” throughout, which should clarify the matter.

Line 113 This reference shows that the listed QRPs are commonly self reported, however it would be great to know if there was any empirical evidence for their effects introducing meta-analysis results if you are aware of any. For an example of this from the medical field see:

Savović J, Jones HE, Altman DG, Harris RJ, Jüni P, Pildal J, et al. Influence of Reported Study Design Characteristics on Intervention Effect Estimates From Randomized, Controlled Trials. Annals of Internal Medicine. 2012 Sep 18;157(6):429.

Authors: The frequently self-reported p-hacking strategies we focused on are not easily identifiable from study descriptions. We are not aware of any studies in the spirit of Savović et al. that empirically investigate the bias caused by our (or similar) p-hacking strategies.

Line 189 Consider citing metafor

Authors: Done

Line 208 “Imagine that a given level of publication bias and QRPs led to a bias of 0.1 in the overall effect size estimate 𝜃̂ and a bias of 0.1 in the heterogeneity estimate 𝜏̂. In this case it would be sensible to conclude that effect size estimates and heterogeneity estimates were affected to the same extent.” I think this needs either further justification, or more ambiguity. Whilst it might be narrowly true that they have the same numerical level of bias, what the implications of this are is far from clear. Bias in the estimate of effect is likely to affect heterogeneity, and the estimate of effect tends to be the focus of a systematic review, so bias there will typically have a larger effect on interpretation, though of course context is important.

Authors: We now qualify our statement, and the sentence ends with “(although the same degree of bias might be seen as more consequential for effect size estimates than for heterogeneity estimates).”, see Line 209.

Line 229 “Optional outlier removal: Researchers run separate analyses on all data, and on data with

outliers (unsigned z ≥ 2) removed.” This doesn’t seem a likely mechanism – surely only outliers which push away from significance would be excluded (though I appreciate this is different assuming a 2 tail publication bias).

Authors: Depending on the simulation parameters, simulated studies are analysed with and without outliers. Reporting and publication are made contingent on the resulting p-values. Thus, outliers that lower the p-value will be (perhaps implausibly) removed for the analysis without outliers; however, their removal leads to a larger p-value, which is why this result will not be reported and therefore will not enter the meta-analysis/the simulation results.
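
For illustration, the optional-outlier-removal step described in this response can be sketched in R roughly as follows (this is not the published simulation code; the per-group standardisation and the two-sample t-test are assumptions made only for the sketch):

  p_hack_outliers <- function(x1, x2) {
    p_all  <- t.test(x1, x2)$p.value                 # analysis on all data
    keep1  <- abs(scale(x1)) < 2                     # drop cases with unsigned z >= 2
    keep2  <- abs(scale(x2)) < 2
    p_trim <- t.test(x1[keep1], x2[keep2])$p.value   # analysis with outliers removed
    min(p_all, p_trim)                               # only the smaller p-value gets reported
  }
  p_hack_outliers(rnorm(50, 0.2), rnorm(50, 0))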

Line 235 or 252 “Under 1-tailed publication bias, results that went against the expected

direction were never published” This choice makes the 1 tailed bias much more aggressive than the 2 tailed, and perhaps a bit unrealistic in the modern publishing environment? There is good evidence for publication bias on p-value, but I am aware of less evidence for publication bias based on direction of effect (though obviously they are related). It would be useful to see more justification for the formulations of publication bias chosen here, particularly any empirical evidence for the levels chosen.

Authors: We modified our implementation of 1-tailed publication bias (PB) in light of your comments. We re-ran all simulations with 1-tailed PB. Under 1-tailed PB, statistically significant results (2-tailed testing) in the expected direction were always published; all other results were censored to a degree that was defined by the strength of PB. This is in line with Augusteijn et al., 2019. If p-hacking required selection between multiple analyses, this was contingent on a modified p-value, which equaled p for results in the expected direction. For results in the opposite direction, the modified p-value was computed as 1 + (1-p). Obviously, being >1 the modified p-value cannot be interpreted as a probability, but it appropriately penalizes results in the wrong direction with, ceteris paribus, stronger effects carrying greater penalties. (See Lines 254-261). All analyses and figures in this revision are based on this new version of 1-tailed PB. Note that results and conclusions did not change in substantive ways.
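
For illustration, the censoring rule and the modified p-value described in this response can be sketched in R as follows (not the published simulation code; the sketch assumes the expected direction is positive and that pb_strength denotes the probability that a censorable result is suppressed):

  modified_p <- function(p, d_obs) {
    # p: 2-tailed p-value; d_obs: observed effect; expected direction is positive
    ifelse(d_obs >= 0, p, 1 + (1 - p))               # wrong-direction results are penalised
  }
  published_1tailed <- function(p, d_obs, pb_strength) {
    # significant results in the expected direction are always published;
    # all other results are suppressed with probability pb_strength
    (p < .05 && d_obs > 0) || runif(1) >= pb_strength
  }
  published_1tailed(p = 0.20, d_obs = 0.15, pb_strength = 0.8)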

Line 269 “we computed mean I2 levels across” I would be more interested in median I2 values if these are easily computable from your results

Authors: We added the medians in brackets, see Line 278.

Line 279 “Following (10), 1,000 meta-analyses were run for each” I appreciate the computational considerations here, but did you re-run an evaluation of the monte-carlo error for this simulation, or did you use 1,000 repetitions because that was sufficient in the previous paper? It is possible to estimate the Monte Carlo Error without running a much larger simulation as per:

Koehler E, Brown E, Haneuse SJ-PA. On the Assessment of Monte Carlo Error in Simulation-Based Statistical Analyses. Am Stat. 2009 May 1;63(2):155–62.

Authors: For each cell in the design, the simulation computed and recorded the standard deviation across the 1,000 heterogeneity estimates. This allowed us to estimate the Monte Carlo Error (MCE) for each cell. Mean MCE was 0.0023 with a maximum of 0.0068, which strikes us as satisfactory (see Lines 291-295, where we also refer to the Koehler et al. reference). Unfortunately, the standard deviation for effect size estimates was not recorded. Therefore, we could not estimate its MCE.
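
For readers who wish to reproduce this check, the per-cell Monte Carlo error follows the standard formula in Koehler et al. (2009), i.e. the standard deviation of the estimates divided by the square root of the number of replications; a minimal R sketch with placeholder estimates:

  tau_hats <- rgamma(1000, shape = 2, rate = 10)     # placeholder for one cell's 1,000 estimates
  sd(tau_hats) / sqrt(length(tau_hats))              # Monte Carlo error of that cell's mean

Applied to the recorded per-cell standard deviations, this is how the mean of 0.0023 and maximum of 0.0068 reported above were obtained.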

In addition did you consider using parallel computing to improve the speed of computation? I haven’t been able to exactly replicate your analysis but I have recently used R packages such as foreach and doParallel to run simulations on multiple cores without much programming difficulty to substantially improve running times.

Authors: Thanks for the tip. We will look into this in future simulations.
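
For illustration, the reviewer's foreach/doParallel suggestion amounts to something like the following R sketch (run_one_meta is a hypothetical stand-in for a function that generates and analyses one simulated meta-analysis; it is not part of the authors' code):

  library(foreach)
  library(doParallel)
  run_one_meta <- function(i) {                      # hypothetical worker function
    vi  <- runif(20, 0.01, 0.09)
    yi  <- rnorm(20, 0.3, sqrt(vi + 0.04))
    fit <- metafor::rma(yi, vi, method = "REML")
    c(theta_hat = fit$b[1], tau2_hat = fit$tau2)
  }
  cl <- makeCluster(2)                               # two worker processes
  registerDoParallel(cl)
  results <- foreach(i = 1:100, .combine = rbind, .packages = "metafor") %dopar% run_one_meta(i)
  stopCluster(cl)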

Line 293 “ level of effect size proved of little consequence” It might be worth clarifying somewhere that this is only true for the mean difference, for other measures (such as odds ratio) it can make a difference.

Authors: We address this among the limitations of our study (Line 560-562).

Line 336 In S4 Fig, is the coverage for all estimates of heterogeneity combined or for a specific estimate (e.g. DL)?

Authors: Somewhat counterintuitively, metafor confidence intervals for heterogeneity are independent of the heterogeneity estimator. Therefore, CIs are the same for all five heterogeneity estimators.
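
This can be checked directly in R; a minimal sketch with simulated data, assuming metafor's default Q-profile confidence interval for tau^2 (see the metafor documentation):

  library(metafor)
  set.seed(7)
  vi <- runif(15, 0.02, 0.10)
  yi <- rnorm(15, 0.3, sqrt(vi + 0.02))
  confint(rma(yi, vi, method = "REML"))              # tau^2 CI (Q-profile method)
  confint(rma(yi, vi, method = "DL"))                # same tau^2 bounds; only the point estimate differs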

For all figures please consider using a scalable format (e.g. pdf/svg/eps) rather than png so that there is no loss of resolution when zooming in to differentiate between lines. In addition in some figures (e.g. figure 3 tau 0.44 segment) it is very difficult to establish which lines overlap. Perhaps you could consider colour, or alternative types of lines?

Authors: You will find that all figures are much bigger and clearer now, although we saved them in the TIFF format as recommended by the journal.

I would like to see a little more justification of the reliance on the ANOVA here and in other places in the paper. It seems appropriate to infer that stability in estimates in the presence of bias is desirable, but absolute levels of bias and RMSE may be more valuable in certain situations.

Authors: We agree that a sizable or even large level of bias or RMSE is relevant even if it remains constant across conditions. Our figures address these levels of bias/RMSE, and our (necessarily somewhat arbitrary) verbal labels (Line 215) should also help to focus on this issue.

However, it is also important to consider how bias/RMSE varies as a function of the factors manipulated in the simulation. Given the complexity of our design, ANOVA is just a convenient tool to draw attention to powerful factors and interactions. Similarly, ANOVA results demonstrated that higher-order interactions (i.e., more than 2-way) were typically of little importance, which protects authors and readers against being sidetracked.

Line 374 Table 2. Does “M” indicate the mean? Please clarify this

Authors: It does. Now clarified in the Table’s note, Line 388.

Figure 4: Effect size, type of PB, and QRP environment are all drawn as categorical variables (though effect size could be redrawn on a continuous scale) and so it may not be appropriate to draw lines between the point values.

Authors: Your comment is, of course, correct. However, in psychology lines are used even for categorical independent variables to facilitate the perception of interaction effects. Here, we keep with this tradition.

I think this section (and others) would also benefit from a deeper investigation of the difference in patterns between PB and QRPs. Implementing QRPs is a novelty of the paper, and I think a more in depth comparison of the effects of PB and QRPs would be interesting.

Authors: We now provide more detail (Lines 430-439).

Line 383 “Overall, T bias was driven upwards ...” This is true and the trends are obvious from the figure, but since bias is optimal close to zero it might be worth rewording this to make it clear when bias becomes worse (i.e. further from zero) rather than simply higher – which could be better if the starting point was negative bias.

Authors: The remainder of the paragraph describes the resulting biases in greater detail (Lines 396-404).

Line 386 “This meant that the highest levels of T bias ...” Highest on the absolute scale, but I believe highest is technically incorrect here. This is related to the point above.

Authors: We rephrased to “the largest positive levels of Tbias” (Line 398).

Line 397 “Regarding 2-way interactions, effect size ⨯ type of publication bias ...” You could consider mentioning potential mechanisms of why bias is high where they are obvious to you.

Authors: For Fig 5, we added the following explanation (Line 414): “This arises because only under 2-tailed publication bias do p-hacking and publication bias have the potential to push published effect sizes either above or below zero, thus maximising their variance.” For Fig 6, we added (Line 420), “This reflects that strong publication bias maximises the variance in published effect sizes at θ = 0 because (exaggerated) published effect sizes are equally likely to be above or below zero.”

Line 416 “is less prevalent than underestimation” I’m not sure you can comment on prevalence since you are not doing empirical work – it only occurs more commonly in these simulation conditions. This is also another situation where perhaps some mechanistic explanation might orientate the reader.

Authors: We rephrased to “overestimation of heterogeneity occurs under fewer simulation conditions than underestimation” (Line 442).

Line 450 This section feels a bit minimalist. Effect size is often the main focus of a paper, and is given relatively little attention here. One thing to consider is giving an example of the magnitude of bias in the text when comparing publication bias and QRPs. I think this is worth noting since

Authors: We now provide some perspective on the magnitude of bias in effect size estimates, “The combination of [p-hacking and publication bias] could induce large bias. E.g., for a true effect size of zero, θ might be estimated to be over 0.3, a substantial effect” (Line 487). Part of your comment got lost. We hope this addresses the issue.

Line 485 “and also those with θ=0 (because such effects tend to be of little interest to researchers” I’m not sure this is true, especially since researchers don’t know the true population effect in advance.

Authors: We now express this point more carefully, see Lines 507-513.

Line 492 “From this perspective, publication bias and QRPs cause much more problems for the estimation of effect size than for the estimation of heterogeneity” It might be worth exploring graphically representing the distribution of bias for t and d (e.g. histogram/boxplot/violin plot) rather than giving proportions above arbitrary cut-offs.

Authors: We ditched the proportions and now provide boxplots instead (Fig9, Line 506).

Line 522 “we did not consider some heterogeneity estimators that previously showed promise” It would be helpful to give an example here

Authors: We added a reference (Line 548).

Attachment

Submitted filename: Plos Review 1.doc

Decision Letter 1

Tim Mathes

11 Nov 2021

PONE-D-21-24961R1Heterogeneity estimates in a biased worldPLOS ONE

Dear Dr. Hönekopp,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. The comments of reviewers 1 and 2 have not been addressed. Please address all comments that were raised by the reviewers.

Please submit your revised manuscript by Dec 26 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Tim Mathes

Academic Editor

PLOS ONE

Journal Requirements:

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Feb 3;17(2):e0262809. doi: 10.1371/journal.pone.0262809.r004

Author response to Decision Letter 1


16 Dec 2021

Reviewer #3

Our response: Thanks for your thorough reading of our paper. We appreciate your detailed feedback and found your comments helpful. Please note that all line numbers in our reply refer to the clean copy, i.e., the copy without tracked changes.

I enjoyed reading this simulation study examining the effect of bias on heterogeneity estimates in meta-analysis. I found the methodology and inferences drawn from the results to be broadly sound. I have some minor suggestions which I have documented below.

It is very pleasant to see implementation of open science requirements. Thank you for making all your code easily available. However, I was unable to run the R files to replicate the results due to the absence of the Excel file “obsN.xlsx”.

Our response: Sorry about that. The file is now uploaded.

Line 59 “is an example for heterogeneity.” - possibly is an example of heterogeneity?

Our response: Corrected.

Line 95 “Bias in primary studies” Perhaps this is an accepted term in the psychology literature, but I would argue that publication bias arises from aggregation of primary studies but does not imply bias in any individual study result. That would make calling it a bias in a primary study potentially confusing or misleading. This term recurs throughout the paper and I wonder if some alternative terminology would improve the clarity of the paper. Perhaps more clearly you might call them sources of bias in meta-analysis.

Our response: Wherever appropriate, we changed “biased primary studies” or similar to “biased sets of primary studies” or similar.

Line 97 I completely agree with your point here that both publication bias and QRPs are sources of bias but this binary categorisation is non-standard in my experience and I think potentially confusing. Questionable research practices is a deliberately broad term often used as a catch all for research misconduct short of outright fraud/falsification of data. In my experience it normally isn’t associated with a particular mechanism or pattern of bias as you write about multiple analyses (though I would be happy to be corrected with an appropriate reference). In addition publication bias can conceivably be caused by questionable research practices (e.g. not seeking publication for null results). I think it’s fine to keep using the term throughout the paper but perhaps if you altered the section introducing these biases to reduce the emphasis on these as the only two sources of bias, the complete separation of the two, and clarify that the QRPs you want to focus on are only a subset of a larger group of practices.

Our response: Our use of the expression QRPs was unfortunate, and we replaced it with “p-hacking” throughout, which should clarify the matter.

Line 113 This reference shows that the listed QRPs are commonly self reported, however it would be great to know if there was any empirical evidence for their effects introducing meta-analysis results if you are aware of any. For an example of this from the medical field see:

Savović J, Jones HE, Altman DG, Harris RJ, Jüni P, Pildal J, et al. Influence of Reported Study Design Characteristics on Intervention Effect Estimates From Randomized, Controlled Trials. Annals of Internal Medicine. 2012 Sep 18;157(6):429.

Our response: The frequently self-reported p-hacking strategies we focused on are not easily identifiable from study descriptions. We are not aware of any studies in the spirit of Savović et al. that empirically investigate the bias caused by our (or similar) p-hacking strategies.

Line 189 Consider citing metafor

Our response: Done

Line 208 “Imagine that a given level of publication bias and QRPs led to a bias of 0.1 in the overall effect size estimate 𝜃̂ and a bias of 0.1 in the heterogeneity estimate 𝜏̂. In this case it would be sensible to conclude that effect size estimates and heterogeneity estimates were affected to the same extent.” I think this needs either further justification, or more ambiguity. Whilst it might be narrowly true that they have the same numerical level of bias, what the implications of this are is far from clear. Bias in the estimate of effect is likely to affect heterogeneity, and the estimate of effect tends to be the focus of a systematic review, so bias there will typically have a larger effect on interpretation, though of course context is important.

Our response: We now qualify our statement, and the sentence ends with “(although the same degree of bias might be seen as more consequential for effect size estimates than for heterogeneity estimates).”, see Line 217.

Line 229 “Optional outlier removal: Researchers run separate analyses on all data, and on data with

outliers (unsigned z ≥ 2) removed.” This doesn’t seem a likely mechanism – surely only outliers which push away from significance would be excluded (though I appreciate this is different assuming a 2 tail publication bias).

Our response: Depending on the simulation parameters, simulated studies are analysed with and without outliers. Reporting and publication are made contingent on the resulting p-values. Thus, outliers that lower the p-value will be (perhaps implausibly) removed for the analysis without outliers; however, their removal leads to a larger p-value, which is why this result will not be reported and therefore will not enter the meta-analysis/the simulation results.

Line 235 or 252 “Under 1-tailed publication bias, results that went against the expected

direction were never published” This choice makes the 1 tailed bias much more aggressive than the 2 tailed, and perhaps a bit unrealistic in the modern publishing environment? There is good evidence for publication bias on p-value, but I am aware of less evidence for publication bias based on direction of effect (though obviously they are related). It would be useful to see more justification for the formulations of publication bias chosen here, particularly any empirical evidence for the levels chosen.

Our response: We modified our implementation of 1-tailed publication bias (PB) in light of your comments. We re-ran all simulations with 1-tailed PB. Under 1-tailed PB, statistically significant results (2-tailed testing) in the expected direction were always published; all other results were censored to a degree that was defined by the strength of PB. This is in line with Augusteijn et al., 2019. If p-hacking required selection between multiple analyses, this was contingent on a modified p-value, which equaled p for results in the expected direction. For results in the opposite direction, the modified p-value was computed as 1 + (1-p). Obviously, being >1 the modified p-value cannot be interpreted as a probability, but it appropriately penalizes results in the wrong direction with, ceteris paribus, stronger effects carrying greater penalties. (See Lines 260-268). All analyses and figures in this revision are based on this new version of 1-tailed PB. Note that results and conclusions did not change in substantive ways.

Line 269 “we computed mean I2 levels across” I would be more interested in median I2 values if these are easily computable from your results

Our response: We added the medians in brackets, see Line 285.

Line 279 “Following (10), 1,000 meta-analyses were run for each” I appreciate the computational considerations here, but did you re-run an evaluation of the monte-carlo error for this simulation, or did you use 1,000 repetitions because that was sufficient in the previous paper? It is possible to estimate the Monte Carlo Error without running a much larger simulation as per:

Koehler E, Brown E, Haneuse SJ-PA. On the Assessment of Monte Carlo Error in Simulation-Based Statistical Analyses. Am Stat. 2009 May 1;63(2):155–62.

Our response: For each cell in the design, the simulation computed and recorded the standard deviation across the 1,000 heterogeneity estimates. This allowed us to estimate the Monte Carlo Error (MCE) for each cell. Mean MCE was 0.0023 with a maximum of 0.0068, which strikes us as satisfactory (see Lines 297-301, where we also refer to the Koehler et al. reference). Unfortunately, the standard deviation for effect size estimates was not recorded. Therefore, we could not estimate its MCE.

In addition did you consider using parallel computing to improve the speed of computation? I haven’t been able to exactly replicate your analysis but I have recently used R packages such as foreach and doParallel to run simulations on multiple cores without much programming difficulty to substantially improve running times.

Our response: Thanks for the tip. We will look into this in future simulations.

Line 293 “ level of effect size proved of little consequence” It might be worth clarifying somewhere that this is only true for the mean difference, for other measures (such as odds ratio) it can make a difference.

Our response: We address this among the limitations of our study (Line 584-586).

Line 336 In S4 Fig, is the coverage for all estimates of heterogeneity combined or for a specific estimate (e.g. DL)?

Our response: Somewhat counterintuitively, metafor confidence intervals for heterogeneity are independent of the heterogeneity estimator. Therefore, CIs are the same for all five heterogeneity estimators.

For all figures please consider using a scalable format (e.g. pdf/svg/eps) rather than png so that there is no loss of resolution when zooming in to differentiate between lines. In addition in some figures (e.g. figure 3 tau 0.44 segment) it is very difficult to establish which lines overlap. Perhaps you could consider colour, or alternative types of lines?

Our response: You will find that all figures are much bigger and clearer now, although we saved them in the TIFF format as recommended by the journal.

I would like to see a little more justification of the reliance on the ANOVA here and in other places in the paper. It seems appropriate to infer that stability in estimates in the presence of bias is desirable, but absolute levels of bias and RMSE may be more valuable in certain situations.

Our response: We agree that a sizable or even large level of bias or RMSE is relevant even if it remains constant across conditions. Our figures address these levels of bias/RMSE, and our (necessarily somewhat arbitrary) verbal labels (Line 222) should also help to focus on this issue.

However, it is also important to consider how bias/RMSE varies as a function of the factors manipulated in the simulation. Given the complexity of our design, ANOVA is just a convenient tool to draw attention to powerful factors and interactions. Similarly, ANOVA results demonstrated that higher-order interactions (i.e., more than 2-way) were typically of little importance, which protects authors and readers against being sidetracked.

Line 374 Table 2. Does “M” indicate the mean? Please clarify this

Our response: It does. Now clarified in the Table’s note, Line 397.

Figure 4: Effect size, type of PB, and QRP environment are all drawn as categorical variables (though effect size could be redrawn on a continuous scale) and so it may not be appropriate to draw lines between the point values.

Our response: Your comment is, of course, correct. However, in psychology lines are used even for categorical independent variables to facilitate the perception of interaction effects. Here, we keep with this tradition.

I think this section (and others) would also benefit from a deeper investigation of the difference in patterns between PB and QRPs. Implementing QRPs is a novelty of the paper, and I think a more in depth comparison of the effects of PB and QRPs would be interesting.

Our response: We now provide more detail (Lines 440-449).

Line 383 “Overall, T bias was driven upwards ...” This is true and the trends are obvious from the figure, but since bias is optimal close to zero it might be worth rewording this to make it clear when bias becomes worse (i.e. further from zero) rather than simply higher – which could be better if the starting point was negative bias.

Our response: The remainder of the paragraph describes the resulting biases in greater detail (Lines 405-413).

Line 386 “This meant that the highest levels of T bias ...” Highest on the absolute scale, but I believe highest is technically incorrect here. This is related to the point above.

Our response: We rephrased to “the largest positive levels of Tbias” (Line 407).

Line 397 “Regarding 2-way interactions, effect size ⨯ type of publication bias ...” You could consider mentioning potential mechanisms of why bias is high where they are obvious to you.

Our response: For Fig 5, we added the following explanation (Line 423): “This arises because only under 2-tailed publication bias do p-hacking and publication bias have the potential to push published effect sizes either above or below zero, thus maximising their variance.” For Fig 6, we added (Line 429), “This reflects that strong publication bias maximises the variance in published effect sizes at θ = 0 because (exaggerated) published effect sizes are equally likely to be above or below zero.”

Line 416 “is less prevalent than underestimation” I’m not sure you can comment on prevalence since you are not doing empirical work – it only occurs more commonly in these simulation conditions. This is also another situation where perhaps some mechanistic explanation might orientate the reader.

Our response: We rephrased to, “overestimation of heterogeneity occurs under fewer simulation conditions than underestimation” (Line 452).

Line 450 This section feels a bit minimalist. Effect size is often the main focus of a paper, and is given relatively little attention here. One thing to consider is giving an example of the magnitude of bias in the text when comparing publication bias and QRPs. I think this is worth noting since

Our response: We now provide some perspective on the magnitude of bias in effect size estimates, “The combination of [p-hacking and publication bias] could induce large bias. E.g., for a true effect size of zero, θ might be estimated to be over 0.3, a substantial effect” (Line 497). Part of your comment got lost. We hope this addresses the issue.

Line 485 “and also those with θ=0 (because such effects tend to be of little interest to researchers” I’m not sure this is true, especially since researchers don’t know the true population effect in advance.

Our response: We now express this point more carefully, see Lines 530-534.

Line 492 “From this perspective, publication bias and QRPs cause much more problems for the estimation of effect size than for the estimation of heterogeneity” It might be worth exploring graphically representing the distribution of bias for t and d (e.g. histogram/boxplot/violin plot) rather than giving proportions above arbitrary cut-offs.

Our response: We ditched the proportions and now provide boxplots instead (Fig10, Line 533).

Line 522 “we did not consider some heterogeneity estimators that previously showed promise” It would be helpful to give an example here

Our response: We added a reference (Line 572).

Reviewer #1: Dear Johannes and Audrey,

First of all, I’d like to compliment you on your nice and relevant study. It was a joy to see that you’ve attempted to answer many remaining and highly relevant questions about heterogeneity, QRPs and publication bias.

I will first provide you with my major comments. Minor comments and typos that caught my eye will be discussed later.

Kind regards,

Hilde Augusteijn

Our response: Dear Hilde, thanks for your comments and suggestions. We appreciate your thoughts. Sorry that we failed to notice them when we prepared our first revision. Please note that all line numbers in our reply refer to the clean copy, i.e., the copy without tracked changes.

Major issues:

Method:

Page 9/page 24: It is unclear to me whether sample sizes were sampled from the total distribution of sample sizes from the 150 meta-analyses, or whether sample sizes were sampled per meta-analysis (e.g. randomly select 1 meta-analysis and generate data for all Ns in that meta-analysis). I suspect you did the former. I wonder, however, how representative these sample sizes are within one meta-analysis. That is, in reality very small and very large studies may not be included in a meta-analysis together, since they do not investigate the same topics, or do so in ways that are methodologically very different. Combining these small and large sample sizes might have a large impact on your heterogeneity estimates, as the variation of sample sizes matters for heterogeneity estimates (see Augusteijn et al., with a 1:1 ratio and a 1:10 ratio). Please discuss the impact of your sample size choices in the discussion section.

Our response: To clarify the matter, we rephrased as follows: “We aggregated observed sample sizes from a representative set of 150 psychological meta-analyses (15) into a single distribution. Sample sizes Ni for simulated studies were randomly sampled from this distribution and equally split between groups 1 and 2.” (LL198). We addressed your concern about unrealistic combinations of very small and very large studies in the same simulated meta-analyses as follows: “If average sample size differed considerably across the 150 meta-analyses in our set, our approach might result in unrealistic combinations of very large and very small samples in simulated meta-analyses, which in turn might distort our results (Augusteijn, van Aert, & van Assen, 2019). However, an ANOVA (bias corrected accelerated bootstrap with 1,000 samples) revealed little variation of average sample size across these 150 meta-analyses (ηp2 = 0.020, F(149, 7077) = 0.97, p = .595).” See Lines 198-206.
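For illustration, here is a minimal Python sketch of the sampling scheme quoted above. The pooled distribution shown is a hypothetical stand-in for the sample sizes observed across the 150 meta-analyses, and how an odd total N is split between the two groups is an assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the pooled total sample sizes aggregated
# across the 150 psychological meta-analyses.
pooled_sample_sizes = np.array([24, 40, 58, 80, 100, 100, 120, 160, 240, 400])

def draw_group_sizes(k, pooled=pooled_sample_sizes, rng=rng):
    """Draw k total sample sizes with replacement and split each equally into two groups."""
    total_n = rng.choice(pooled, size=k, replace=True)
    n1 = total_n // 2
    n2 = total_n - n1  # assumption: group 2 absorbs the odd participant, if any
    return n1, n2

n1, n2 = draw_group_sizes(k=5)
print(n1, n2)
```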

Page 11: 1- and 2-tailed publication bias: Your interpretation of 1-tailed publication bias is different from how it is commonly interpreted in selection models. Your model of bias still has three different parameters: negative significant results, probability of publication is 0; non-significant results, probability depends on publication bias; positive significant results, probability of publication is 1. Often, non-significant and negative significant results are both considered to be affected by publication bias. Your choice will certainly have an impact on your results, especially when the true effect size is 0. Please change your publication bias model for 1-tailed bias, or provide a discussion of the possible impact in your discussion, preferably with at least some additional simulations as a sensitivity analysis.

Our response: We modified our implementation of 1-tailed publication bias (PB) in light of reviewers’ comments. We re-ran all simulations with 1-tailed PB. Under 1-tailed PB, statistically significant results (2-tailed testing) in the expected direction were always published; all other results were censored to a degree that was defined by the strength of PB. This is in line with Augusteijn et al., 2019. If p-hacking required selection between multiple analyses, this was contingent on a modified p-value, which equaled p for results in the expected direction. For results in the opposite direction, the modified p-value was computed as 1 + (1-p). Obviously, being >1 the modified p-value cannot be interpreted as a probability, but it appropriately penalizes results in the wrong direction with, ceteris paribus, stronger effects carrying greater penalties. (See Lines 261-268). All analyses and figures in this revision are based on this new version of 1-tailed PB. Note that results and conclusions did not change in substantive ways.
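A minimal Python sketch of the two rules described above, the 1-tailed censoring rule and the modified p-value used to choose among p-hacked analyses. The function names and the `censor_prob` parameterisation of the strength of publication bias are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(7)

def is_published(p_two_tailed, effect_direction, censor_prob, rng=rng):
    """1-tailed PB: significant (2-tailed) results in the expected direction always survive;
    all other results are censored with probability `censor_prob`."""
    if p_two_tailed < 0.05 and effect_direction > 0:
        return True
    return rng.random() > censor_prob

def modified_p(p_two_tailed, effect_direction):
    """p for results in the expected direction; 1 + (1 - p) otherwise.
    Values above 1 are not probabilities, but they penalise wrong-direction results,
    with stronger wrong-direction effects carrying greater penalties."""
    if effect_direction > 0:
        return p_two_tailed
    return 1 + (1 - p_two_tailed)

# Example: among several p-hacked analyses, keep the one with the smallest
# modified p-value, then apply the 1-tailed publication filter.
analyses = [(0.03, -1), (0.08, +1), (0.12, +1)]  # (two-tailed p, direction of effect)
p_best, dir_best = min(analyses, key=lambda a: modified_p(*a))
print(p_best, dir_best, is_published(p_best, dir_best, censor_prob=0.8))
```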

Page 11: Level of publication bias. This is my most important point of critique. I believe that your levels of publication bias are too limited. There are sufficient indications that publication bias in psychology might be higher than 80%. For example, over 90% of studies report support for their focal hypothesis in the study by Fanelli (2010). Furthermore, the effects of publication bias on heterogeneity estimates are non-linear, and heterogeneity estimates are impacted most drastically when publication bias is 100% (biggest underestimation) or close to 100% (large overestimations when the true effect is small). Please also include higher levels of publication bias, e.g. 90% or 95%, and 100%. Even though 100% bias might (luckily) not be realistic, neither is 0% bias. Knowing how the different estimators behave in this scenario is still highly relevant.

Our response: You raise an interesting point. We are less pessimistic about the prevalence of publication bias (PB) than you. The reason is that PB should predominantly affect studies’ focal hypothesis. Naturally, meta-analyses (MAs) also include results that were not the focal hypothesis of the paper in which they were published. These “non-headline” results should be less affected by PB, if at all. Two empirical observations support our viewpoint. 1) Frequently, a substantial proportion of primary effects summarised in a MA are not statistically significant, as MAs' forest plots reveal. We are not aware of a systematic investigation of this issue but point to two arbitrary examples: Macnamara, Hambrick, and Oswald (2014, see Fig. 2) and Sisk, Burgoyne, Sun, Butler, and Macnamara (2018, see Fig. 2). 2) Levine, Asada, and Carpenter (2009) looked at the correlation between effect size and sample size across 51 meta-analyses. They found a much weaker correlation (mean r = -.16) than Kühberger, Fritz, and Scherndl (2014), who investigated the same relationship in findings that constituted the focal hypothesis of the respective papers and found rS = -.45.

Naturally, choices for other PB levels than ours would be perfectly defensible, and the study of additional PB levels would add to our simulation. However, we believe that the levels we chose are illuminating and sensible. This includes 0% PB, which we would expect in pre-registered trials and perhaps for some questions that are based on data irrelevant to papers’ focal hypotheses. Regarding the inclusion of additional levels of PB, we would like to point out that our simulations are already enormously time consuming in their present form.

Minor issues:

Page 5: Not all QRPs are related to running multiple analyses and reporting only the smallest p-value; consider, for example, HARKing, rounding off p-values, or fraud. Furthermore, why did you choose these four? And do you expect them to have exactly the same effect on the meta-analytical results, or not? What do we already know from previous studies about the effects of QRPs on, for example, effect size estimates?

Our response: We replaced the expression QRPs with the more apt “p-hacking” throughout. We selected our four particular types of p-hacking based on their inferred high prevalence (John, Loewenstein, & Prelec, 2012), see Line 112. To differentiate the impact of different forms of p-hacking is beyond the scope of our paper, and we are not aware of previous investigations of this question.

Page 6: Do we know how often 2-tailed publication bias is plausible, compared to one tailed bias? Is there some data from empirical studies?

Our response: We are not aware of data that would shed light on this question.

Page 9, line 189: please provide a reference for the claim that meta-analyses on continuous outcomes are frequent.

Our response: We now provide a reference, van Erp, Verhagen, Grasman, and Wagenmakers (2017), see Line 198. The paper, which surveyed 705 MAs in Psychological Bulletin, does not directly mention types of outcome variables, but their open data show that >95% of MAs used Pearson’s r or a standardised mean difference as an effect size.

Page 9, line 201: Is this median N of 100 the total N or per group?

Our response: This is total N, which we now make clearer (see Line 201).

Page 12, line 271: An I2 value of 6.6% when true tau = 0, in conditions without QRPs or bias. This deviates much more from 0 than I would expect?

Our response: We now report the median (0.0) in addition to the mean (Line 285). Note that in the absence of true heterogeneity, tau can only be overestimated but not underestimated, which biases the mean.
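To make this point concrete, here is a minimal Python sketch (not the authors' code) that applies the DerSimonian-Laird estimator to homogeneous simulated studies: because the estimate is truncated at zero, estimation error can only push it upwards, so the mean of the estimates exceeds zero even though the median is at (or near) zero.

```python
import numpy as np

rng = np.random.default_rng(3)

def dersimonian_laird(d, v):
    """DL estimate of tau^2 from effect sizes d and sampling variances v."""
    w = 1 / v
    d_fixed = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - d_fixed) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (q - (len(d) - 1)) / c)

taus = []
k, n_per_group = 20, 50
for _ in range(2000):
    # Homogeneous true effect: theta = 0.2, tau = 0
    v = 2 / n_per_group + 0.2 ** 2 / (4 * n_per_group)  # approx. sampling variance of d
    d = rng.normal(0.2, np.sqrt(v), size=k)
    taus.append(np.sqrt(dersimonian_laird(d, np.full(k, v))))

print(f"mean tau-hat = {np.mean(taus):.3f}, median tau-hat = {np.median(taus):.3f}")
```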

Page 13, start of results: Please provide the reader with a sense of what the meta-analytical datasets looked like in the end: what was the effect of all qrps (splitting datasets, adding participants), on the actual sample sizes of primary studies? Is this still close to Ni=100?

Our response: This information is now added in L307. P-hacking increased mean sample size in primary studies only moderately.

Typos:

Page 5, line 111: QRPs instead of QRPS

Page 10, line 220: sections are not labeled as 2.2.

Page 13, line 288: URL to osf page no longer works. Please update the URL.

Page 15, line 347: ‘low k low’.

Page 24, line 500: ‘were’ instead of ‘where’.

Our response: Thanks, corrected.

Reviewer #2: “Heterogeneity estimates in a biased world” is a Monte Carlo study of the effects of publication bias (PB) and QRP (questionable research practices). It seems to be rigorously conducted, and its simulations are based on realistic research conditions as seen in psychology. Its major finding is: “Our results showed that biases in primary studies caused much greater problems for the estimation of effect size than for the estimation of heterogeneity.” This is an important lesson that the meta-analysis community needs to hear. I suspect that this was already widely known, but I believe that this is the first paper that demonstrates this in a clear, rigorous and replicable way.

I wish to congratulate the authors for the way they conduct their Monte Carlo simulations. The design of the simulations can make an enormous difference to their results. Unless the important research parameters (sample sizes, the amount of heterogeneity, the degree of publication selection, etc.) accurately reflect what is seen in the actual relevant research literature, the findings will be largely irrelevant. However, the authors based their simulations on what they found in what seems to be a fairly representative sample of 150 meta-analyses in psychology. I recommend that PLOS publish this paper with a few revisions.

Our response: Thanks for your comments, from which we have learned a lot. Sorry that we failed to notice your review when we prepared our first revision. Please note that all line numbers in our reply refer to the clean copy, i.e., the copy without tracked changes.

Suggestions for revisions.

1. Emphasize main findings: Please emphasize and expand the main finding that it is PB and QRP that cause random-effects (RE) estimates to be so severely biased, and that this bias is very large under the typical conditions that the authors simulate. This substantial bias has also been confirmed in a systematic review of large pre-registered multi-lab replications (Kvarven et al., 2020), and it is so large that RE is entirely unreliable if applied to psychology naively without many qualifications and auxiliary statistical checks. These biases are also of a notable scientific size. The authors need to state a bit more strongly how these different methods of estimating tau (the heterogeneity SD) are largely irrelevant, especially relative to the size and consequences of RE’s bias. These consequences need to be explicitly stated and emphasized.

Our response: In our discussion, we now place more emphasis on this point and reference some of the papers you indicated under (4): “In our simulation, biased primary studies caused much more severe problems for estimates of effect size than for estimates of heterogeneity. Therefore, future investigations into meta-analytic parameter estimations should prioritise how to deal with biases in effect size estimates (e.g., Duval & Tweedie, 2000; Egger, Smith, Schneider, & Minder, 1997; Henmi, Hattori, & Friede, 2021; Simonsohn, Nelson, & Simmons, 2014; Stanley, Doucouliagos, & Ioannidis, 2017; Stanley, Doucouliagos, Ioannidis, & Carter, 2021) over the relative merits of different heterogeneity estimators.” Lines 564-568

2. Biased studies: The way the authors characterize PB and QRP is rather misleading and may give the broader audience the wrong impression about the nature and extent of the problems involved. Classical PB is itself often interpreted as merely omitting some studies that are not statistically significant (SS). While this is indeed one avenue for the bias that we often find in published research results, there are many others. Reporting bias is recognized as a different avenue by medical researchers, as QRP is recognized by psychologists. But all of these vectors of bias are the result of some process of selecting the results to be SS. This selection can be undertaken by the researchers on their own for their own reasons, or in anticipation of what reviewers and editors might demand. Or, this selection can be forced by the reviewers and editors. These details of the selection process are largely irrelevant because they have the same outcome and can be simulated in the same way. Thus, a little more discussion of what this bias is and more references to the classical and better regarded methods to correct for these biases (collectively called PB here, for short) are needed. The authors repeatedly characterize this problem as “biases in primary studies.” It is not, or at least, it is not necessarily a bias in the primary studies. PB can be very serious, just as we see it in practice, even if the individual primary results are not biased but merely were selectively reported to be SS from an entirely randomly produced distribution of estimates (with random QRPs, random outcome measures, random samples, etc.). You might say that PB is an emergent property (selection for SS) of the entire research literature in a given area but is not associated with individually biased studies. Studies and researchers may also be biased, which will only amplify PB, but focusing on the unnecessary bias of individual studies can cause many to dismiss this severe problem. Many researchers do not believe that a notable portion of their colleagues are dishonest or deliberately distorting science. This is why PB is so pernicious and easily dismissed. It can emerge from the system as a whole, without individual researchers knowingly distorting science. Please do not characterize PB as ‘biased studies’ but rather as studies selected to be SS.

Our response: Wherever appropriate, we changed “biased primary studies” or similar to “biased sets of primary studies” or similar.

3. Type I errors: Please report the type I errors of all of these methods using the current simulation design. I suspect that the authors will find that RE has very high rates of type I errors for all of these methods, at least as long as there are more than a few estimates. If so, this will confirm the systematic review of large pre-registered multi-lab replications (Kvarven et al., 2020). Rates of false positives are very important as an indicator of scientific credibility. I suspect that RE (regardless of the method used to estimate tau) has such high rates of false positives, using the authors’ current simulation design, as to disqualify RE from any serious scientific use. In any case, type I errors are important to show and to discuss. Not reporting type I errors could be considered a type of selection bias in the way these simulation results are displayed and published. Methods PB, if you will.

Our response: We now address type-1 errors for mean effect size estimates in the new Fig9. We write “For overall effect size estimates (d), meta-analyses typically report a p-value, which is tacitly assumed to provide an appropriate safeguard against type-1 errors. Fig9 shows type-1 error rates for d under 1-tailed publication bias in our simulation. (Under 2-tailed publication bias, type-1 error rates proved very close to the nominal 5%.) As can be seen, type-1 error rates might reach catastrophic levels. Random effects p-values for d will therefore fail to offer protection against type-1 errors unless publication bias and p-hacking can be ruled out.” LL512-517.
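As an illustration of how such type-1 error rates can be obtained by simulation, here is a minimal Python sketch under assumed settings (not the authors' code): the true effect is zero, primary results pass through a 1-tailed selection filter, and the random-effects mean is tested with a DerSimonian-Laird z-test.

```python
import numpy as np

rng = np.random.default_rng(11)

def dl_tau2(d, v):
    """DerSimonian-Laird estimate of tau^2."""
    w = 1 / v
    mu = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - mu) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (q - (len(d) - 1)) / c)

def published_study(n, censor_prob):
    """Draw studies with theta = 0 until one survives 1-tailed selection."""
    while True:
        d = rng.normal(0.0, np.sqrt(2 / n))
        se = np.sqrt(2 / n + d ** 2 / (4 * n))
        significant_positive = d > 0 and abs(d) / se > 1.96
        if significant_positive or rng.random() > censor_prob:
            return d, se ** 2

rejections = 0
n_meta, k, n = 1000, 20, 50
for _ in range(n_meta):
    studies = [published_study(n, censor_prob=0.8) for _ in range(k)]
    d = np.array([s[0] for s in studies])
    v = np.array([s[1] for s in studies])
    w = 1 / (v + dl_tau2(d, v))        # random-effects weights
    mu = np.sum(w * d) / np.sum(w)
    se_mu = np.sqrt(1 / np.sum(w))
    rejections += abs(mu / se_mu) > 1.96

print(f"type-1 error rate = {rejections / n_meta:.2f} (nominal 0.05)")
```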

4. Alternatives to random effects: This entire study assumes that RE is the only adequate method to conduct basic meta-analysis in psychology and that this issue then comes down to the best way to calculate RE. This is not the case, and worse, the authors show that all the ways to calculate RE produce notably large bias (greatly exaggerating the size of the effect under examination). I suspect that this simulation design will show that RE has high rates of false positives. It has long been known that RE has unacceptable biases and that these biases are easily reduced (Henmi and Copas, 2010; Stanley and Doucouliagos, 2015). Henmi and Copas (2010) showed that FE (fixed effect) notably reduces PB and that RE’s estimate of tau can accommodate the heterogeneity that FE ignores. However, Henmi and Copas (2010) use the DL estimate of tau in their calculation of the CI, so the estimate of tau might still be important in their approach. Henmi and others (2021) have recently generalized this method and shown how it can work for very small meta-analyses. Alternatively, an entirely different approach, unrestricted weighted least squares (UWLS), uses the bias reduction of FE but automatically accommodates heterogeneity using the mathematical invariance of WLS’s variance-covariance matrix to any multiplicative constant. UWLS accommodates heterogeneity without referring to or using RE or any of its estimates of tau (Stanley and Doucouliagos, 2015; 2017). That is, the central issue of this study of the effect of PB on estimates of tau could be entirely avoided and, at the same time, the large biases reported in this paper reduced. Simulations like these have shown that UWLS notably reduces RE’s bias with little if any compensating statistical loss (Stanley and Doucouliagos, 2015; 2017). These alternative methods to RE have been widely applied across the disciplines and used as a basis for a new statistical method to detect PB (Stanley et al., 2021). It would be nice if these other methods were simulated and reported using this same design. At a minimum, they need to be discussed as viable alternatives to this concern about how tau is calculated and as an alternative to RE’s large biases and high rates of false positives. The central scientific question is how to reduce or eliminate bias and false-positive meta-analyses, because they are often the best scientific evidence we have.

Our response: We now reference some of these papers. (See our reply to your first comment for details.) In the context of UWLS, you write “the central issue of this study of the effect of PB on estimates of tau could be entirely avoided”. This is true if the aim of the meta-analysis is restricted to estimating the average effect size with an appropriate CI. However, as we point out in the section Why heterogeneity matters, the extent of effect size heterogeneity can be of substantial interest in itself. For example, it might be applied to understand which psychological effects change most and least across cultures. As heterogeneity is at the core of our paper, we certainly cannot avoid it. We agree that extensions of our simulations to models beyond RE are of interest. However, any research project must be limited, and we decided to focus on the (still very popular) RE model here.
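For readers unfamiliar with UWLS, here is a minimal Python sketch of one common formulation, based on the reviewer's description and Stanley and Doucouliagos (2015), not on anything in the manuscript: the point estimate equals the fixed-effect estimate, and its standard error is the fixed-effect standard error rescaled by the root mean squared error of the weighted regression, which absorbs heterogeneity without estimating tau.

```python
import numpy as np

def uwls(d, v):
    """Unrestricted weighted least squares estimate and SE (one common formulation)."""
    w = 1 / v
    mu_fe = np.sum(w * d) / np.sum(w)      # fixed-effect point estimate
    se_fe = np.sqrt(1 / np.sum(w))
    # OLS on the standardized model t_i = beta * x_i + e_i, with t = d/sqrt(v), x = 1/sqrt(v);
    # the residual mean square equals Q/(k-1) and rescales the fixed-effect SE.
    resid = (d - mu_fe) / np.sqrt(v)
    mse = np.sum(resid ** 2) / (len(d) - 1)
    return mu_fe, se_fe * np.sqrt(mse)

# Hypothetical effect sizes and sampling variances for illustration
d = np.array([0.10, 0.35, 0.22, 0.55, -0.05])
v = np.array([0.04, 0.02, 0.03, 0.05, 0.06])
print(uwls(d, v))
```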

References:

Henmi M, Copas JB. Confidence intervals for random effects meta-analysis and robustness to publication bias. Statistics in Medicine, 2010; 29:2969–2983.

Henmi M, Hattori S, Friede T. A confidence interval robust to publication bias for random-effects meta-analysis of few studies. Res Syn Meth. 2021;12:674–679. https://doi.org/10.1002/jrsm.1482

Stanley, T.D. and Doucouliagos, C. Neither fixed nor random: Weighted least squares meta-analysis. Statistics in Medicine, 2015; 34: 2116–2127.

Stanley, T.D. and Doucouliagos, C. Neither fixed nor random: Weighted least squares meta-regression analysis. Res Synth Methods. 2017;8:19-42.

Stanley TD, Doucouliagos H, Ioannidis JPA, Carter EC. Detecting publication selection bias through excess statistical significance. Research Synthesis Methods. 2021; 1-20. https://doi.org/10.1002/jrsm.1512

Reviewer #3: Review comments to the author can be found in the attached .docx document. They are organised in the sequence of the paper and include some general points and more specific questions to be responded to.

________________________________________

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Hilde Augusteijn

Reviewer #2: No

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Our response:

References cited in our responses above:

Augusteijn, H. E., van Aert, R., & van Assen, M. A. (2019). The effect of publication bias on the Q test and assessment of heterogeneity. Psychological Methods, 24(1), 116-134.

Duval, S., & Tweedie, R. (2000). Trim and fill: A simple funnel‐plot–based method of testing and adjusting for publication bias in meta‐analysis. Biometrics, 56(2), 455-463.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629-634.

Henmi, M., Hattori, S., & Friede, T. (2021). A confidence interval robust to publication bias for random‐effects meta‐analysis of few studies. Research synthesis methods.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological science, 23(5), 524-532.

Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: a diagnosis based on the correlation between effect size and sample size. PloS one, 9(9), e105825.

Levine, T. R., Asada, K. J., & Carpenter, C. (2009). Sample sizes and effect sizes are negatively correlated in meta-analyses: Evidence and implications of a publication bias against nonsignificant findings. Communication Monographs, 76(3), 286-302.

Macnamara, B. N., Hambrick, D. Z., & Oswald, F. L. (2014). Deliberate practice and performance in music, games, sports, education, and professions: A meta-analysis. Psychological science, 25(8), 1608-1618.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: a key to the file-drawer. Journal of experimental psychology: General, 143(2), 534-547.

Sisk, V. F., Burgoyne, A. P., Sun, J., Butler, J. L., & Macnamara, B. N. (2018). To what extent and under which circumstances are growth mind-sets important to academic achievement? Two meta-analyses. Psychological science, 29(4), 549-571.

Stanley, T., Doucouliagos, H., & Ioannidis, J. P. (2017). Finding the power to reduce publication bias. Statistics in Medicine, 36(10), 1580-1598.

Stanley, T., Doucouliagos, H., Ioannidis, J. P., & Carter, E. C. (2021). Detecting publication selection bias through excess statistical significance. Research synthesis methods, 12, 776–795.

van Erp, S., Verhagen, J., Grasman, R. P., & Wagenmakers, E.-J. (2017). Estimates of between-study heterogeneity for 705 meta-analyses reported in Psychological Bulletin from 1990–2013. Journal of Open Psychology Data, 5(1).

Attachment

Submitted filename: Plos Review 1.doc

Decision Letter 2

Tim Mathes

6 Jan 2022

Heterogeneity estimates in a biased world

PONE-D-21-24961R2

Dear Dr. Hönekopp,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Tim Mathes

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Acceptance letter

Tim Mathes

18 Jan 2022

PONE-D-21-24961R2

Heterogeneity estimates in a biased world

Dear Dr. Hönekopp:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Tim Mathes

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Mean bias in estimates of the true average effect size (dbias) in the absence of publication bias and p-hacking for five heterogeneity estimators as a function of true average effect size (θ), true heterogeneity (τ), and number of studies per meta-analysis (k).

    (TIF)

    S2 Fig. Coverage of 95% CIs around d in the absence of publication bias and p-hacking for five heterogeneity estimators as a function of true heterogeneity (τ) and number of studies per meta-analysis (k).

    (TIF)

    S3 Fig. Standard deviation for heterogeneity estimates under constant simulation conditions in the absence of publication bias and p-hacking for five heterogeneity estimators as a function of true heterogeneity (τ) and number of studies per meta-analysis (k).

    (TIF)

    S4 Fig. Coverage of 95% CIs around T in the absence of publication bias and p-hacking for the DL estimator as a function of true average effect size (θ), true heterogeneity (τ), and number of studies per meta-analysis (k).

    Virtually identical results for other estimators not shown.

    (TIF)

    S5 Fig. Illustration of the strongest 3-way interaction on Tbias (see Table 2).

    (TIF)

    S6 Fig. Absence of interaction between effects of p-hacking and strength of publication bias on Tbias.

    (TIF)

    S7 Fig. P-hacking and strength of publication bias differ in their interaction with the true average effect size (θ) on Tbias.

    (TIF)

    S8 Fig. Under 1-tailed publication bias (shown here), underestimation of heterogeneity is more prevalent than overestimation.

    (TIF)

    S9 Fig. Illustration of the strongest 2-way interaction on Trmse (see Table 3).

    (TIF)

    S10 Fig. Overestimation of effect size increases as heterogeneity increases.

    (TIF)

    Attachment

    Submitted filename: Plos Review 1.doc


    Data Availability Statement

    All materials and data can be found at https://osf.io/qga8v/.

