P-hacking in meta-analyses: A formalization and new meta-analytic methods

Maya B Mathur

doi:10.1002/jrsm.1701

Author manuscript; available in PMC: 2025 May 1.

Published in final edited form as: Res Synth Methods. 2024 Jan 25;15(3):483–499. doi: 10.1002/jrsm.1701

PMCID: PMC11042997 NIHMSID: NIHMS1955639 PMID: 38273211

Abstract

As traditionally conceived, publication bias arises from selection operating on a collection of individually unbiased estimates. A canonical form of such selection across studies (SAS) is the preferential publication of affirmative studies (i.e., those with significant, positive estimates) versus nonaffirmative studies (i.e., those with nonsignificant or negative estimates). However, meta-analyses can also be compromised by selection within studies (SWS), in which investigators “$p$-hack” results within their study to obtain an affirmative estimate. Published estimates can then be biased even conditional on affirmative status, which compromises the performance of existing methods that only consider SAS. We propose two new analysis methods that accommodate joint SAS and SWS; both analyze only the published nonaffirmative estimates. First, we propose estimating the underlying meta-analytic mean by fitting “right-truncated meta-analysis” (RTMA) to the published nonaffirmative estimates. This method essentially imputes the entire underlying distribution of population effects. Second, we propose conducting a standard meta-analysis of only the nonaffirmative studies (MAN); this estimate is conservative (negatively biased) under weakened assumptions. We provide an R package (phacking) and website (metabias.io). Our proposed methods supplement existing methods by assessing the robustness of meta-analyses to joint SAS and SWS.

Keywords: Selective reporting, file drawer, data dredging, Bayesian analysis, truncation

1. Introduction.

Publication bias, such as the preferential publication of significant results, is a well-known potential source of bias in meta-analyses.¹ Numerous statistical methods, mostly falling into two broad categories, have been developed to help assess or correct for this bias.^1,2 First, classical methods arising from the funnel plot assess whether small studies have systematically larger point estimates than large studies.^3,4 Second, selection models specify a parametric form for the underlying distribution of population effects (i.e., prior to selection) and for the dependence of a study’s publication probability on its $p$ -value. In two-step selection models, for example, the weight function may be specified such that affirmative results (defined as those with positive point estimates and a two-tailed $p < 0.05$ ) are more likely to be published than nonaffirmative results (defined as those with negative point estimates or $p \geq 0.05$ ).^5–8 By weighting each study’s contribution to the likelihood by its inverse-probability of publication per the weight function, the meta-analytic mean and the parameters of the weight function can be jointly estimated by maximum likelihood.^5–8 Related sensitivity analyses have been developed.⁹ Other methods are essentially hybrids^10,11 or Bayesian model averages¹² between funnel plot methods, such as PET-PEESE,¹⁰ and selection models.

These methods typically formalize publication bias as a selection process that operates on a collection of point estimates that, individually, are unbiased for their corresponding population effects. In this sense, traditionally conceived publication bias is a form of selection that operates across studies, which can include decisions by a study’s investigators to entirely withhold the study from submission to journals, as well as decisions by journal editors and reviewers.^13–16 We refer to the combined effects of these types of selection as “selection across studies” (SAS), and we refer to studies that are selected by SAS (i.e., available to the meta-analyst for potential inclusion in the meta-analysis) as “published” studies.

However, results can also be manipulated or selectively reported within studies, which is sometimes called “ $p$ -hacking” or “data dredging”.¹⁷ For example, investigators may fit multiple models to the same dataset in an attempt to obtain an affirmative estimate;^18–21 surveys indicate that such behaviors are commonplace, even based on self-admissions.²² Stefan & Schönbrodt²³ recently described and simulated 12 realistic mechanisms by which investigators may $p$ -hack. For example, investigators could selectively decide which variables to analyze (e.g., selecting the dependent variable, independent variable, or controlled covariates); how to transform or aggregate analyzed variables; and which participants to analyze (e.g., excluding outliers, iteratively increasing the sample size until significance is achieved, or conducting subgroup analyses).²³ Given this diversity of $p$ -hacking mechanisms, they called for future research to develop mathematical models of $p$ -hacking strategies.²³ A key goal of the present paper is to provide one possible formalization of $p$ -hacking, enabling new conceptual insights about when and how $p$ -hacking produces bias and how $p$ -hacking relates conceptually to classical publication bias. This formalization additionally enables us to develop new analysis methods.

Generalizing the concept of $p$ -hacking, we use “selection within studies” (SWS) to refer to situations in which the investigators of a given study obtain multiple point estimates and select only one of these estimates to submit for publication. These multiple estimates might differ not only because of statistical error, but also because they might have different population effects. Such within-study heterogeneity could arise, for example, if the estimates are obtained from different subsets of the data (e.g., the entire dataset versus only older participants) or from statistical models with different estimands (e.g., regression models that control for different covariates).²⁴ Heuristically, we will define a study as “hacked” (i.e., biased due to SWS) when its investigators select only one estimate from among multiple estimates, and the distribution of this “favored” estimate differs from that of a hypothetical “ideal” estimate (e.g., the estimate corresponding to the originally planned analysis). We refer to the collection of single favored estimates from each study, prior to the introduction of any SAS, as “underlying studies”.

Because SWS can distort studies’ point estimates themselves, many mechanisms of SWS do not conform to the selection processes assumed by traditional models of SAS. For example, selective publication of affirmative studies may result in an overrepresentation of such results among published studies, but does not affect the distribution of published point estimates conditional on affirmative status.^6,7 In contrast, SWS that favors affirmative estimates can distort even these conditional distributions.¹⁷ As a result, methods for SAS can perform poorly when there is also SWS.²⁵ Of the few existing methods to model SWS, the “ $p$ -curve” and its variants^17,26 seem to be the most widely used. These methods analyze only significant $p$ -values and assess whether these $p$ -values are left-skewed, which is taken as evidence of SWS.^17,26 These methods are special cases of selection models in which it is implicitly assumed that there is no heterogeneity across studies and that a study’s multiple estimates have autocorrelated $p$ -values.^17,27 In practice, heterogeneity is common in meta-analyses,²⁸ and $p$ -values may not be autocorrelated if, for example, investigators fit models to disjoint subsets of the data. In the Supplement (Section 1), we review the strengths and limitations of other existing methods for SWS.^20,29

In this paper, we develop new analysis methods that advance methodologically upon these existing approaches for SAS or SWS. We provide an R package, phacking (available on CRAN) and website, metabias.io. We first develop a theoretical framework that formalizes the above concepts of SWS and SAS (Section 2). To do so, we consider processes of SWS and SAS in which affirmative estimates are more likely to be favored by investigators and/or more likely to be published than are nonaffirmative estimates. SAS operates on studies’ favored estimates (i.e., after SWS has occurred) and affects which studies are ultimately available for inclusion in the meta-analysis. This framework is an extension of existing models of SAS and conforms well to empirical findings on how applied researchers and statisticians interpret and report $p$ -values.^28,30–32

We introduce right-truncated meta-analysis (RTMA), a method that is correctly specified if the favored estimates of hacked studies are always affirmative (e.g., because investigators obtain as many estimates as needed to obtain the first affirmative estimate) or if hacked studies with nonaffirmative favored estimates, if there are any, are never published (Section 3.1). These conditions are similar to those assumed by existing methods for SWS,^17,29 but unlike those methods, RTMA also allows for within-study heterogeneity and for estimates that are independent or autocorrelated within studies, and it provides inference. Heuristically, under many forms of SWS, the distribution of published affirmative estimates may be badly distorted, but the distribution of published nonaffirmative estimates accurately represents that of ideal nonaffirmative estimates. RTMA essentially imputes the entire underlying distribution of population effects among all ideal estimates in order to consistently estimate the meta-analytic mean. To circumvent the formidable statistical challenges posed by estimating the parameters of a truncated distribution,^33–35 we develop Bayesian methods for fitting RTMA under the Jeffreys prior^36,37 (Section 3.1). As a second, more conservative, sensitivity analysis, we propose conducting a standard meta-analysis of only the published nonaffirmative studies (Section 3.2). Under a broader class of mechanisms of SAS and SWS, favored nonaffirmative estimates even from hacked studies are typically smaller than the underlying mean, and we thus show that the meta-analysis of nonaffirmative estimates will be conservative (i.e., biased toward the null).

We present reporting recommendations and fit diagnostics for our proposed methods (Section 3.3), apply the methods to the controversial literature on money priming (Section 5), and validate the methods’ performance in simulations (Section 6). We additionally situate existing two-step selection models within our theoretical framework of SWS and SAS by establishing sufficient conditions under which the models are correctly specified when there is SWS as well as SAS (Section 4). These settings are highly restrictive, essentially ruling out all typical forms of SWS.²² This finding underscores the value of the present theoretical framework for clarifying how SWS and SAS operate, and of the proposed methods that directly accommodate both forms of selection.

2. Setting and notation.

Table 1 summarizes the notation developed in this section and subsequently. In a standard parametric meta-analysis of $k$ studies that does not accommodate SWS or SAS,³⁸ it is typically assumed that studies’ point estimates, ${\hat{θ}}_{i} = μ_{i} + ϵ_{i}$ , follow a simple random-effects model in which $μ_{i} \sim N (μ, τ^{2})$ , $ϵ_{i} \sim N (0, σ_{i}^{2})$ , and $μ_{i} ∐ ϵ_{i}$ . The conditional variances of studies’ point estimates, $σ_{i}^{2} = Var (ϵ_{i})$ , are generally treated as fixed and known. The standard estimand of interest for the meta-analysis is $μ$ , the overall mean population effect. Additionally, $τ$ is the heterogeneity (i.e., standard deviation of the population effects). We now extend this model to accommodate SWS and SAS.

Table 1:

Summary of notation for the $i^{th}$ underlying study. Similar notation without asterisks refers to published studies or estimates.

Notation	Interpretation
${\hat{θ}}_{i n}^{*}$	$n^{t h}$ estimate
${\hat{θ}}_{i 1}^{*}, μ_{i 1}^{*}$	Ideal estimate and its expectation
${\hat{θ}}_{i F}^{*}$	Favored estimate
$σ_{i}^{* 2}$	Variance of each estimate
$F_{i n}^{*}$	Indicator for whether $n^{t h}$ estimate is favored
$A_{i n}^{*}$	Indicator for whether $n^{t h}$ estimate is affirmative
“Successful” study	The study’s favored estimate is affirmative $(A_{i F}^{*} = 1)$
$H_{i}^{*}$	Indicator for whether study is hacked
$D_{i}^{*}$	Indicator for whether study is published


2.1. Selection within studies.

When studies obtain multiple estimates whose population effects may differ not only across studies but also within studies, one could define various estimands of interest for the meta-analysis. We consider the estimand of interest to be the mean population effect across underlying studies of a certain “ideal” estimate for each study. These estimates are “ideal” in that: (1) they are assumed to be exchangeable across hacked and unhacked studies (as described below); and (2) they follow a generalization of the simple random-effects meta-analysis model, per Assumption 1 below. Using asterisks to denote estimates from underlying studies (i.e., prior to any SAS), the investigators of each hacked study obtain multiple, potentially correlated estimates, $\{{\hat{θ}}_{i 1}^{*}, {\hat{θ}}_{i 2}^{*}, \dots\}$ , but they select a single “favored” estimate. (In the Discussion, we consider the implications of this assumption.) If there is no SAS, the investigators publish only the favored estimate; if there is SAS, the favored estimate is subjected to SAS. As such, we can define the index of the favored estimate as $F_{i}^{*} \in \{1, 2, \dots\}$ and define an indicator for whether any given estimate index, $n$ , is the favored one as $F_{i n}^{*} = 1 \{F_{i}^{*} = n\}$ . The favored estimate is then ${\hat{θ}}_{i F}^{*}$ . For example, if the $i^{t h}$ study is hacked, the investigators might obtain a series of estimates $\{{\hat{θ}}_{i 1}^{*} = 0.84, {\hat{θ}}_{i 2}^{*} = 1.03, {\hat{θ}}_{i 3}^{*} = 1.23, {\hat{θ}}_{i 4}^{*} = 1.70, {\hat{θ}}_{i 5}^{*} = 2.10\}$ and might favor the final estimate, such that ${\hat{θ}}_{i F}^{*} ≔ {\hat{θ}}_{i 5}^{*} = 2.10$ .

Without loss of generality, we index each study’s ideal estimate as the first estimate ( ${\hat{θ}}_{i 1}^{*}$ , with expectation $μ_{i 1}^{*}$ ) to suggest an intuitive situation in which each study’s chronologically first estimate corresponds to the investigators’ originally planned analysis, and in which the meta-analyst’s estimand of interest is the mean population effect of underlying studies’ originally planned analyses. However, more generally, the ideal estimate need not be the chronologically first estimate that investigators obtain, nor is it required that the study investigators actually observe the ideal estimate. For example, in a clinical trial, investigators might exclude treated participants with poor outcomes before conducting any analyses.

We consider meta-analyses in which some studies are unhacked (denoted $H_{i}^{*} = 0$ ) and others are hacked (denoted $H_{i}^{*} = 1$ ), although we will not require knowledge of which studies are hacked. As we formalize below, an unhacked study is one whose favored estimate has the same distribution as the study’s ideal estimate, both marginally and conditional on the affirmative status of these estimates. This includes the simple case in which investigators obtain only one estimate, namely the ideal estimate, as well as other settings in which investigators obtain multiple estimates but do not try to manipulate the affirmative status of the ideal estimate (Supplement, Corollary 1). A hacked study is one whose favored estimate is distributed differently from the study’s ideal estimate. A hacked study is “successful” if the favored estimate is affirmative and otherwise is “unsuccessful”.

More formally, let $σ_{i}^{* 2} = Var ({\hat{θ}}_{i n}^{*})$ be the variance of each underlying estimate, $n$ , produced by the $i^{th}$ study. To simplify notation, we treat $σ_{i}^{* 2}$ as constant for all estimates of a given study, but it is straightforward to accommodate the possibility that estimates within a study have different variances (Supplement, Section 3.3). As usual in meta-analysis, we treat studies’ standard errors, $σ^{*} = \{σ_{1}^{*}, \dots, σ_{k}^{*}\}$ , as fixed and conditioned throughout. (Because we will assume a standard form of independence between studies [Assumption 1], conditioning on the full vector $σ^{*}$ will not require that the meta-analyst know the standard errors of unobserved estimates. This conditioning is used only to facilitate expressing the joint density of studies when the population effects arise from a location-scale family; see for example Section 4.) Letting $c$ denote the critical value defining statistical significance (e.g., $c \approx 1.96$ ), a given estimate $n$ is defined as being affirmative, denoted $A_{i n}^{*} = 1$ , if ${\hat{θ}}_{i n}^{*} / σ_{i}^{*} > c$ . We treat the critical value $c$ as known a priori; it will not be estimated empirically.
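The affirmative-status rule above can be illustrated with a minimal Python sketch. The function names `critical_value` and `is_affirmative` are hypothetical and are not part of the phacking package; the sketch assumes a two-tailed significance level of 0.05 and a normal reference distribution.

```python
from statistics import NormalDist
from typing import Optional


def critical_value(alpha: float = 0.05) -> float:
    """Two-tailed normal critical value c (about 1.96 when alpha = 0.05)."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def is_affirmative(estimate: float, std_error: float,
                   c: Optional[float] = None) -> bool:
    """Return True if estimate / std_error > c, i.e. significant AND positive.

    Negative estimates are never affirmative, however small their p-value.
    """
    if c is None:
        c = critical_value()
    return estimate / std_error > c
```

With a standard error of 1, the estimate 2.10 from the running example is classified as affirmative and the estimate 0.84 is not.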

In the Supplement (Definition 1), we formalize a definition of “unhacked” versus “hacked” studies. This definition focuses on potential discrepancies between the distribution of the favored estimate relative to that of the ideal estimate because such discrepancies are the key property that occurs in selection within studies but not selection across studies. As noted above, this definition states that an unhacked study is one whose favored estimate has the same distribution as the study’s ideal estimate, both marginally (Supplement, Definition 1, condition D1a) and conditional on the affirmative status of these estimates (D1b and D1c). By implication, in unhacked studies, the probability that the favored estimate is affirmative is equal to the probability that the ideal estimate is affirmative (D1d). Again, this definition includes the simplest intuitive case of an “unhacked” study, namely a study in which only the ideal estimate is obtained (and thus favored). However, as described above, this definition is also more general. In contrast, a hacked study is one whose favored estimate has a different marginal distribution than that of the study’s ideal estimate.

We additionally assume that the ideal estimates are exchangeable across underlying hacked and unhacked studies in the following sense.

Assumption 1 (Exchangeability of underlying ideal estimates).

$$f_{ϵ_{i 1}^{*} ∣ H_{i}^{*} = 0} (t) = f_{ϵ_{i 1}^{*} ∣ H_{i}^{*} = 1} (t) \text{ such that } E [ϵ_{i 1}^{*}] = 0 \text{ and } Var (ϵ_{i 1}^{*}) = σ_{i}^{* 2}$$

$$f_{μ_{i 1}^{*} ∣ H_{i}^{*} = 0} (t) = f_{μ_{i 1}^{*} ∣ H_{i}^{*} = 1} (t) \text{ such that } E [μ_{i 1}^{*}] = μ, Var (μ_{i 1}^{*}) = τ^{2}, \text{ and } μ_{i 1}^{*} ∐ ϵ_{i 1}^{*}$$

$${\hat{θ}}_{i 1}^{*} = μ_{i 1}^{*} + ϵ_{i 1}^{*} \text{ for } H_{i}^{*} \in \{0, 1\}$$

To fix ideas, Figure 1 illustrates simple cases of how investigators might obtain the favored estimate ${\hat{θ}}_{i F}^{*}$ in a single unhacked study versus in a hacked study. In each case, by Assumption 1, the ideal estimate ${\hat{θ}}_{i 1}^{*}$ is drawn from $f_{{\hat{θ}}_{i 1}^{*} ∣ H_{i}^{*} = h} (t)$ , here depicted as a normal distribution with mean $μ = 0.8$ (dashed gray line), heterogeneity $τ = 0$ , and standard error $σ_{i}^{*} = 1$ . Thus, in this example, a given estimate ${\hat{θ}}_{i n}^{*}$ is affirmative if it exceeds $c \approx 1.96$ (dashed orange line), i.e., $A_{i n}^{*} = 1 \{{\hat{θ}}_{i n}^{*} > 1.96\}$ . In this example, investigators of both the unhacked study and the hacked study obtain an ideal estimate, ${\hat{θ}}_{i 1}^{*} = 0.84$ , that is nonaffirmative. In this particular unhacked study, the investigators do not obtain any additional estimates, so ${\hat{θ}}_{i 1}^{*}$ is the favored estimate; that is, ${\hat{θ}}_{i F}^{*} ≔ {\hat{θ}}_{i 1}^{*}$ . Thus, trivially, the distribution of the favored estimate is identical to that of the ideal estimate; that is, $f_{{\hat{θ}}_{i F}^{*} ∣ H_{i}^{*} = 0} (t) = f_{{\hat{θ}}_{i 1}^{*} ∣ H_{i}^{*} = 0} (t)$ (Supplement, Definition 1, condition D1a). Also, specifically considering nonaffirmative favored versus ideal estimates, these distributions are also the same (D1b), depicted in the figure as gray regions of the densities. An analogous statement holds for affirmative favored versus ideal estimates (D1c), depicted as orange regions.

On the other hand, in this particular hacked study, the investigators attempt to obtain an affirmative estimate by obtaining another 4 autocorrelated estimates, for example by adding covariates one at a time to the analysis model. They thus have 5 estimates: $\{{\hat{θ}}_{i 1}^{*} = 0.84, {\hat{θ}}_{i 2}^{*} = 1.03, {\hat{θ}}_{i 3}^{*} = 1.23, {\hat{θ}}_{i 4}^{*} = 1.70, {\hat{θ}}_{i 5}^{*} = 2.10\}$ . In this example, ${\hat{θ}}_{i 5}^{*}$ is affirmative, and the investigators favor this estimate, such that ${\hat{θ}}_{i F}^{*} ≔ {\hat{θ}}_{i 5}^{*} = 2.10$ . Because the series of estimates was autocorrelated, the distribution of the favored estimate is distorted relative to that of the ideal estimate; that is, $f_{{\hat{θ}}_{i F}^{*} ∣ H_{i}^{*} = 1} (t) \neq f_{{\hat{θ}}_{i 1}^{*} ∣ H_{i}^{*} = 1} (t)$ , which is visually apparent by comparing the densities (D1a). The non-equivalence also holds for nonaffirmative estimates (D1b), which in this simple example, are never favored in the hacked study. Likewise, for affirmative estimates, the distribution of the favored estimate is again distorted relative to the distribution of ideal affirmative estimates due, for example, to the estimates’ autocorrelation (D1c). The latter is visually apparent by comparing the densities’ orange regions. It is important to note that Figure 1 necessarily depicts only a very simple case that does not capture the full range of situations accommodated by Definition 1. For example, as noted above, Definition 1 accommodates certain situations in which investigators of unhacked studies do obtain multiple estimates. The figure is meant only to provide intuition, while Definition 1 describes the framework in mathematical generality.

Two simple forms of SWS are as follows. First, SWS could favor the first affirmative estimate. That is, investigators could obtain at most a fixed maximum number of estimates, $N_{max}$ (e.g., 10). If an affirmative estimate is obtained before surpassing $N_{max}$ estimates, investigators favor the first affirmative estimate they obtain. If no affirmative estimates are obtained, investigators favor the first (ideal) estimate, which is nonaffirmative. Alternatively, SWS could favor the “best” affirmative estimate. That is, investigators could make a fixed number of estimates, regardless of when they obtain the first affirmative estimate. If any affirmative estimates are obtained, the investigators favor the estimate with the smallest $p$ -value.²⁰
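The two forms of SWS just described can be sketched as a small simulation. This is an illustrative sketch under simplifying assumptions that are not taken from the text: estimates within a study are drawn independently from $N(μ, σ^{2})$ with no within-study heterogeneity, the critical value is fixed at 1.96, and when the "best affirmative" strategy yields no affirmative estimate the investigators are assumed to fall back on the first (ideal) estimate. The function names are hypothetical.

```python
import random

C = 1.96  # assumed critical value for a two-tailed test at 0.05


def favor_first_affirmative(rng, mu, sigma, n_max=10, c=C):
    """Draw up to n_max estimates; favor the first affirmative one.

    If none of the n_max estimates is affirmative, favor the first
    (ideal) estimate, which is then nonaffirmative.
    """
    estimates = []
    for _ in range(n_max):
        est = rng.gauss(mu, sigma)
        estimates.append(est)
        if est / sigma > c:
            return est
    return estimates[0]


def favor_best_affirmative(rng, mu, sigma, n_fixed=10, c=C):
    """Draw a fixed number of estimates; favor the 'best' affirmative one.

    All estimates share the same standard error, so among affirmative
    (positive, significant) estimates the smallest p-value belongs to
    the largest estimate. Falling back to the first estimate when no
    estimate is affirmative is an assumption of this sketch.
    """
    estimates = [rng.gauss(mu, sigma) for _ in range(n_fixed)]
    affirmative = [e for e in estimates if e / sigma > c]
    return max(affirmative) if affirmative else estimates[0]
```

For example, `favor_first_affirmative(random.Random(0), mu=0.8, sigma=1.0)` returns one favored estimate for a single hacked study; repeating the call over many studies shows how the share of affirmative favored estimates grows with `n_max`.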

2.2. Selection across studies.

We define SAS as selection that operates on studies’ favored estimates and affects whether a study is published (i.e., available to the meta-analyst for potential inclusion in the meta-analysis), with $D_{i}^{*} = 1$ indicating that the $i^{th}$ study is published. SAS may include decisions by studies’ investigators to withhold a study from publication, as well as decisions by journal editors and reviewers. We assume that if a study’s favored estimate is nonaffirmative, any SAS does not select further based on the favored estimate itself, as follows.

Assumption 2 (No SAS preference for larger nonaffirmative estimates). For each $h \in \{0, 1\}$ , we have $D_{i}^{*} ∐ {\hat{θ}}_{i F}^{*} ∣ H_{i}^{*} = h, A_{i F}^{*} = 0$ . By implication, $P (D_{i}^{*} = 1 ∣ {\hat{θ}}_{i F}^{*}, H_{i}^{*} = h, A_{i F}^{*} = 0) = P (D_{i}^{*} = 1 ∣ H_{i}^{*} = h, A_{i F}^{*} = 0)$ .

This assumption may be plausible if SAS does not select for “marginally significant” nonaffirmative results or those with larger point estimates. This appeared to be plausible in a systematic analysis of 63 meta-analyses across scientific disciplines, in which Z-scores overwhelmingly concentrated above 1.96 only.²⁸ In Section 3.3, we discuss diagnostics for possible violations of this and other assumptions.

We will use Assumption 2 for both proposed methods (RTMA and MAN), but a stronger version of this assumption will be required for existing selection models. In developing certain assumptions specific to RTMA, we will also refer to “stringent SAS”, which occurs when nonaffirmative estimates specifically for hacked studies are never published (e.g., because investigators of unsuccessful hacked studies do not submit them for publication). That is:

Definition 2. “Stringent SAS” is said to occur if $P (D_{i}^{*} = 1 ∣ H_{i}^{*} = 1, A_{i F}^{*} = 0) = 0$ .

Assumption 2 and stringent SAS are distinct concepts. In particular, Assumption 2 refers to both hacked studies and unhacked studies, whereas stringent SAS refers only to hacked studies. As such, when stringent SAS holds, it implies that the conditional independence in Assumption 2 holds for $H_{i}^{*} = 1$ , but not necessarily for $H_{i}^{*} = 0$ .

3. Proposed new meta-analysis methods.

Given the formalization of SWS and SAS established above, we now describe the two new meta-analysis methods.

3.1. Right-truncated meta-analysis.

We now introduce RTMA as a new meta-analytic method to accommodate both SAS and SWS. We first motivate the method heuristically before providing theoretical results. Under many forms of SWS, the distribution of published affirmative estimates may be badly distorted relative to the distribution of ideal affirmative estimates, and to correct this distortion would require making numerous assumptions about how, precisely, investigators obtain multiple estimates, how they choose a favored estimate, and the structure of autocorrelation and heterogeneity among estimates within a study. However, under many forms of SWS (characterized by Assumption 1, Assumption 2, and the assumptions below), the distribution of published nonaffirmative estimates accurately represents that of ideal nonaffirmative estimates. RTMA uses only the published nonaffirmative estimates to essentially impute the entire underlying distribution of population effects among all ideal estimates in order to consistently estimate $μ$ and $τ$ (Figure 2). That is, RTMA involves modeling the published nonaffirmative estimates as right-truncated normal.

Figure 2: Simple schematic of the RTMA estimation approach. Here, the distribution of ideal estimates $(f_{{\hat{θ}}_{i 1}^{*} ∣ H_{i}^{*} = h} (t))$ is a normal distribution with mean $μ_{i} = 0.8$ , heterogeneity $τ = 0$ , and standard error $σ_{i}^{*} = 1$ .

To formalize the justification for RTMA, consider two possible scenarios regarding, respectively, SWS and SAS:

Definition 3. “Stringent overall selection” occurs if either of the following conditions holds:

$$P (A_{i F}^{*} = 0 ∣ H_{i}^{*} = 1) = 0 \quad \text{(Stringent SWS)}$$

$$P (D_{i}^{*} = 1 ∣ H_{i}^{*} = 1, A_{i F}^{*} = 0) = 0 \quad \text{(Stringent SAS)}$$

The first of the two possible conditions, namely stringent SWS, states that the favored estimate in a hacked study is always affirmative. This will occur if, for example, investigators of each hacked study obtain as many estimates as required to obtain the first affirmative estimate. (As noted in the Introduction, this scenario is similar to the assumptions of existing methods for SWS.^17,29) The second of the two possible conditions, namely stringent SAS, states that the favored estimate in a hacked study may be nonaffirmative, but that hacked studies with nonaffirmative estimates are never published. This will occur if, for example, investigators of hacked studies never submit a nonaffirmative estimate for publication. Clearly, if stringent overall selection holds, any published nonaffirmative estimates must be from unhacked rather than hacked studies (Supplement, Lemma 1). It is then straightforward to show that the distribution of published nonaffirmative estimates is the same as that of all underlying, nonaffirmative ideal estimates:

Theorem 1 (Distribution of published estimates under stringent overall selection). Suppose that underlying ideal estimates are exchangeable (Assumption 1), SAS does not favor larger nonaffirmative estimates over smaller nonaffirmative estimates (Assumption 2), and stringent overall selection holds (Definition 3). Then $f_{{\hat{θ}}_{i F} ∣ A_{i F} = 0} (t) = f_{{\hat{θ}}_{i 1}^{*} ∣ A_{i 1}^{*} = 0} (t)$ .

Proof. The distribution of published, nonaffirmative estimates is:

$$f_{{\hat{θ}}_{i F} ∣ A_{i F} = 0} (t) = f_{{\hat{θ}}_{i F}^{*} ∣ A_{i F}^{*} = 0, D_{i}^{*} = 1} (t) = \sum_{h} f_{{\hat{θ}}_{i F}^{*} ∣ H_{i}^{*} = h, A_{i F}^{*} = 0, D_{i}^{*} = 1} (t) \cdot P (H_{i}^{*} = h ∣ A_{i F}^{*} = 0, D_{i}^{*} = 1)$$

By stringent overall selection and Lemma 2 (Supplement), we have $P (H_{i}^{*} = 1 ∣ A_{i F}^{*} = 0, D_{i}^{*} = 1) = 0$ . Thus:

$$f_{{\hat{θ}}_{i F} ∣ A_{i F} = 0} (t) = f_{{\hat{θ}}_{i F}^{*} ∣ H_{i}^{*} = 0, A_{i F}^{*} = 0, D_{i}^{*} = 1} (t) = \frac{P (D_{i}^{*} = 1 ∣ {\hat{θ}}_{i F}^{*} = t, H_{i}^{*} = 0, A_{i F}^{*} = 0) f_{{\hat{θ}}_{i F}^{*} ∣ H_{i}^{*} = 0, A_{i F}^{*} = 0} (t)}{P (D_{i}^{*} = 1 ∣ H_{i}^{*} = 0, A_{i F}^{*} = 0)}$$

In the first term of the numerator, applying Assumption 2 allows the conditioning on ${\hat{θ}}_{i F}^{*} = t$ to be dropped, such that the term is equal to the denominator. In the second term of the numerator, applying D1b (Definition 1; Supplement) followed by Assumption 1 to drop the conditioning on $H_{i}^{*} = 0$ yields the desired result. □
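Theorem 1 can be checked informally by simulation. The following sketch is an illustration rather than a reproduction of the paper's simulation study, and it rests on assumed settings: $μ = 0.8$, $τ = 0$, $σ_{i}^{*} = 1$ and $c = 1.96$ (the values used in the figures), independent redraws within hacked studies, stringent SWS (hacked studies redraw until affirmative), and no SAS. Under these settings the published nonaffirmative estimates should have the mean of a normal distribution right-truncated at $c σ_{i}^{*}$. The function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean

MU, SIGMA, C = 0.8, 1.0, 1.96  # illustrative values; tau = 0 is assumed


def simulate_published_nonaffirmative(n_studies, p_hacked, seed):
    """Return favored nonaffirmative estimates under stringent SWS, no SAS."""
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        hacked = rng.random() < p_hacked
        est = rng.gauss(MU, SIGMA)  # ideal (first) estimate
        if hacked:
            # Stringent SWS: keep drawing until an affirmative estimate appears.
            while est / SIGMA <= C:
                est = rng.gauss(MU, SIGMA)
        if est / SIGMA <= C:  # keep only nonaffirmative favored estimates
            published.append(est)
    return published


def right_truncated_normal_mean(mu, s, upper):
    """Mean of N(mu, s^2) conditional on being at most `upper`."""
    z = (upper - mu) / s
    std = NormalDist()
    return mu - s * std.pdf(z) / std.cdf(z)
```

Comparing `fmean(simulate_published_nonaffirmative(50000, 0.5, seed=1))` with `right_truncated_normal_mean(MU, SIGMA, C * SIGMA)` gives two values that should agree up to simulation error, even though half of the underlying studies are hacked.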

Following the conventions of parametric random-effects meta-analysis, suppose the distributions in Assumption 1 are normal, such that $μ_{i 1}^{*} \sim N (μ, τ^{2})$ and $ϵ_{i 1}^{*} \sim N (0, σ_{i}^{* 2})$ . The marginal heterogeneity, $τ^{2}$ , may include within-study heterogeneity across estimates as well as across-study heterogeneity (e.g., under the data-generating process described in the Supplement, Section 6). Then, by Theorem 1, the published nonaffirmative estimates follow a right-truncated normal distribution. That is, for $i$ such that $D_{i}^{*} = 1$ :

$$f_{{\hat{θ}}_{i F} ∣ A_{i F} = 0} (t) = S_{i}^{- 1} {(2 π)}^{- 1 / 2} \exp \left\{- \frac{{(t - μ)}^{2}}{2 S_{i}^{2}}\right\} \cdot \frac{1}{Φ ({\tilde{c}}_{i})} \tag{3.1}$$

where $S_{i} = \sqrt{τ^{2} + σ_{i}^{2}}$ is the marginal standard deviation, ${\tilde{c}}_{i} = (c σ_{i} - μ) / S_{i}$ is the marginally standardized truncation threshold for the $i^{t h}$ study, and $Φ$ is the cumulative distribution function of the standard normal distribution. In the next section, we discuss methods to consistently estimate $μ$ and $τ$ in Eq. (3.1), which we term “right-truncated meta-analysis”.
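The density in Eq. (3.1) translates directly into a log-likelihood over the published nonaffirmative estimates. The sketch below is only this likelihood ingredient, written with the Python standard library; it is not the estimator described in the next section (which works with the posterior under the Jeffreys prior) and it is not the phacking package's implementation. The function name `rtma_log_likelihood` is hypothetical, and the caller is assumed to pass only estimates satisfying ${\hat{θ}}_{i} / σ_{i} \leq c$.

```python
import math
from statistics import NormalDist


def rtma_log_likelihood(mu, tau, estimates, std_errors, c=1.96):
    """Sum over studies of the log of the right-truncated normal density (3.1).

    estimates, std_errors: published NONAFFIRMATIVE estimates and their SEs.
    """
    std = NormalDist()
    total = 0.0
    for est, se in zip(estimates, std_errors):
        s = math.sqrt(tau ** 2 + se ** 2)      # marginal SD, S_i
        c_tilde = (c * se - mu) / s            # standardized truncation point
        total += (
            -math.log(s)
            - 0.5 * math.log(2 * math.pi)
            - (est - mu) ** 2 / (2 * s ** 2)
            - math.log(std.cdf(c_tilde))       # truncation normalizer
        )
    return total
```

One caveat of this naive form: when $μ$ lies far above every truncation point, $Φ ({\tilde{c}}_{i})$ can underflow to zero and the logarithm fails, so a production implementation would need a numerically stable log-CDF.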

3.1.1. Estimation methods for right-truncated meta-analysis.

Estimating the parameters of a truncated distribution has been a formidable problem even when observations are identically distributed (i.e., outside the context of meta-analysis). When estimating the parameters of a truncated normal distribution with a known truncation point, maximum likelihood estimates (MLE) can be remarkably biased and inefficient because the likelihood can become nearly flat and is infinite with positive probability.^33–35 In this simple setting, Bayesian methods show potential to substantially improve upon point estimation.^34,35 In particular, desirable theoretical properties arise from estimating the posterior mode under the Jeffreys prior.³⁶ In the Supplement (Section 3.1.1), we derive the form of the Jeffreys prior and log-posterior and describe a simple computational approach for estimation.

As noted in the Introduction, the assumptions used in RTMA weaken certain assumptions used in existing methods for SAS or SWS.^7,17,26,29 That is, RTMA allows for within-study heterogeneity and for estimates that are independent or autocorrelated within studies, and also accommodates joint SAS and SWS. However, the weakened assumptions used for RTMA clearly will not hold for every meta-analysis; we discuss relevant fit diagnostics in Section 3.3. Additionally, in the next section, we propose a method that provides conservative estimates of $μ$ under assumptions weaker than those of RTMA.

3.2. Meta-analysis of nonaffirmative estimates.

We recently proposed conducting a standard meta-analysis of only published, nonaffirmative estimates (MAN) as a conservative sensitivity analysis for SAS, without considering additional SWS.⁹ In that context, we had shown that for a given, assumed selection ratio $η$ , sensitivity analysis can be conducted by upweighting published, nonaffirmative studies proportional to $η$ . The MAN estimate arose as a limiting case of this analysis for worst-case SAS (i.e., $η \to \infty$ ). As we discuss in this section, MAN also remains conservative under many (though not all) forms of SWS. Other sensitivity analyses for “worst-case” forms of SAS have been proposed, but these assume that SAS operates only based on a study’s sample size and not on its $p$ -value or affirmative status.³⁹ Additionally, these methods do not directly accommodate SWS.
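To make the limiting argument concrete, the following Python sketch upweights nonaffirmative studies by an assumed selection ratio $η$ in a simple common-effect (inverse-variance) pooled mean. This is a simplification chosen for illustration; the function name and the toy data are hypothetical, and the published sensitivity analysis⁹ may differ in its model details.

```python
def eta_weighted_mean(estimates, std_errors, eta, c=1.96):
    """Common-effect pooled mean with nonaffirmative studies upweighted by eta."""
    numerator = 0.0
    denominator = 0.0
    for est, se in zip(estimates, std_errors):
        # Affirmative: positive estimate with two-tailed p < 0.05.
        affirmative = est / se > c
        weight = (1.0 if affirmative else eta) / se**2
        numerator += weight * est
        denominator += weight
    return numerator / denominator


# Hypothetical toy data: two affirmative and two nonaffirmative studies.
toy_estimates = [0.50, 0.40, 0.10, -0.05]
toy_std_errors = [0.20, 0.15, 0.20, 0.25]
for eta in (1, 10, 1e9):
    print(eta, eta_weighted_mean(toy_estimates, toy_std_errors, eta))
```

As $η$ grows, the affirmative studies receive vanishing relative weight, so the pooled mean approaches the inverse-variance mean of the nonaffirmative studies alone, which is the MAN estimate in this simplified setting.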

Although MAN is a very simple and intuitive analysis, it has advantages that complement those of RTMA. First, MAN applies under a broader range of mechanisms of SWS and SAS than does RTMA. For example, stringent overall selection (as assumed by RTMA) would not hold if some hacked studies favor and publish a nonaffirmative estimate, and as such the performance of RTMA could be compromised. However, subject to the weakened conditions described below, MAN would still be conservative. Second, while RTMA assumes that the underlying ideal estimates are independent and are normally distributed, MAN can be fit using standard meta-analytic methods that accommodate clustered or non-normal point estimates (Section 3.2.1). Third, MAN is extremely easy for practitioners to implement using existing meta-analysis software, and is already being reported fairly often as a conservative analysis for SAS.^? Fourth, as we discuss further in the simulation study (Section 6), even though the MAN estimate is constructed to be biased downward, it can sometimes provide more precise confidence intervals than RTMA.

In the Supplement (Section 3.2.1), we establish two sufficient conditions under which, if there is SWS in addition to SAS, MAN remains conservative in the sense that its estimand, $E [{\hat{θ}}_{i F} ∣ A_{i F} = 0]$ , is a lower bound on $μ$ . Heuristically, the first sufficient condition, MAN-1, states that the probability that a hacked study is unsuccessful is nonincreasing in the expectation of the study’s ideal estimate. This will often hold because, intuitively, the larger a study’s ideal estimate, the easier it will usually be for investigators to “turn” the estimate into an affirmative one, if the ideal estimate is not already affirmative. The second sufficient condition, MAN-2, states that if a hacked study is unsuccessful, the expectation of its favored estimate is no larger than the expectation of its ideal estimate. This, too, will often be the case if investigators do not tend to favor a larger nonaffirmative estimate over a smaller nonaffirmative estimate. Critically, this condition can hold even when investigators favor larger affirmative estimates over smaller affirmative estimates, as in the second mechanism of SWS described in Section 2.1. However, MAN-2 could be violated if, for example, investigators prefer “marginally significant” nonaffirmative estimates (e.g., those with $0.05 < p < 0.10$ ) over nonaffirmative estimates with $p > 0.10$ . The two sufficient conditions and alternative sufficient conditions are formalized in the Supplement.

Under these assumptions, we can obtain the main result regarding MAN, which simply states that MAN is conservative in that the expectation of the published, favored nonaffirmative estimates is less than the underlying mean $μ$ :

Theorem 2 (Conservatism of MAN). If MAN-1 and MAN-2 hold, then $E [{\hat{θ}}_{i F} ∣ A_{i F} = 0] \leq μ$ . The proof of this result appears in the Supplement.

3.2.1. Estimation methods for meta-analysis of nonaffirmative results.

We now consider methods to consistently estimate the MAN estimand, $E [{\hat{θ}}_{i F} ∣ A_{i F} = 0]$ . Under the relaxed conditions regarding SWS given above, the distribution of published, nonaffirmative estimates may no longer follow a known form. Although this precludes using parametric meta-analysis to estimate $μ$ , one could instead use robust variance estimation methods that are similar to generalized estimating equations, thus obviating assumptions on the distribution of population effects.^40–43 Asymptotic and finite-sample theory establishes that this approach provides consistent coefficient estimates and valid inference under arbitrary distributions.^40,43 By Theorem 2, an estimate $\hat{μ}$ obtained by robust variance estimation will be consistent for a value that is no larger than $μ$ ; it is in this sense that MAN is “conservative”. (Conducting a meta-analysis that conditions on $A_{i F} = 0$ induces a degree of positive correlation between ${\hat{θ}}_{i F}$ and $σ_{i}$ , but the bias that this induces in $\hat{E} [{\hat{θ}}_{i F} ∣ A_{i F} = 0]$ relative to $E [{\hat{θ}}_{i F} ∣ A_{i F} = 0]$ is modest.) Although one could also estimate the heterogeneity, $τ$ , which would typically also be biased downward, this is not the main purpose of MAN in the context of SAS and SWS.
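A minimal Python sketch of the MAN workflow follows: retain only nonaffirmative estimates, pool them with inverse-variance weights, and attach a sandwich-type standard error that does not rely on a distributional form. This is a simplified stand-in, under the assumption of one independent estimate per study; the robust variance estimation methods cited above^40–43 use their own working weights and small-sample corrections, and the function name here is hypothetical.

```python
import math


def man_estimate(estimates, std_errors, c=1.96):
    """Pooled mean of nonaffirmative estimates with a sandwich-type SE."""
    # Nonaffirmative: negative estimate or two-tailed p >= 0.05, i.e. z <= c.
    kept = [(y, s) for y, s in zip(estimates, std_errors) if y / s <= c]
    k = len(kept)
    if k < 2:
        raise ValueError("need at least two nonaffirmative estimates")

    weights = [1.0 / s**2 for _, s in kept]
    total_weight = sum(weights)
    mu_hat = sum(w * y for w, (y, _) in zip(weights, kept)) / total_weight

    # Sandwich variance of a weighted mean, with a k / (k - 1) small-sample factor.
    meat = sum((w * (y - mu_hat)) ** 2 for w, (y, _) in zip(weights, kept))
    var_hat = (k / (k - 1)) * meat / total_weight**2
    return mu_hat, math.sqrt(var_hat)
```

Because the standard error is built from the observed residuals rather than from an assumed normal model, the sketch remains usable when the distribution of the nonaffirmative estimates is unknown, which is the situation described above.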

3.3. Fit diagnostics.

We suggest examining two diagnostic plots to help identify violations of the assumptions underlying our proposed methods. Both plots are implemented in the R package phacking (functions z_density and rtma_qqplot) and the website metabias.io. First, the density of the observed Z-scores, ${\hat{θ}}_{i F} / σ_{i}$ , should be examined (e.g., Figure 4(a)). When SWS favors affirmative estimates over nonaffirmative estimates, as our methods and others assume,^7,29 the Z-scores may disproportionately concentrate just above a critical value of 1.96, as assessed by $p$ -curve and related methods.^17,26 (However, as noted in the Introduction, the presence of SWS does not guarantee a concentration of Z-scores just above 1.96, in particular if estimates are not highly correlated within studies or if $τ$ is relatively large. As such, it is prudent to proceed with the proposed analysis methods even if no such concentration is apparent.) In contrast, if the Z-scores also concentrate just below −1.96, this may indicate that SWS is “two-tailed” in that significant estimates are favored regardless of their direction, which could violate the assumption in RTMA of stringent overall selection. Alternatively, if the Z-scores concentrate just below 1.96, this may indicate that investigators favor smaller $p$ -values even among nonaffirmative estimates or that some investigators favor nonaffirmative over affirmative estimates. This could violate stringent overall selection or MAN-2. As noted above, a systematic analysis of 63 meta-analyses across scientific disciplines suggested that Z-scores overwhelmingly concentrate above 1.96 only,²⁸ suggesting that our assumptions may often be plausible in practice.
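As a rough numerical companion to the density plot, one could tabulate the share of Z-scores falling in narrow windows around the critical values discussed above. The Python sketch below does this; the function name and the window width of 0.25 are arbitrary assumptions for illustration, and the sketch is not a substitute for inspecting the full density (for example, with z_density in phacking).

```python
def z_score_windows(estimates, std_errors, c=1.96, width=0.25):
    """Share of Z-scores in narrow windows around the critical values."""
    z = [y / s for y, s in zip(estimates, std_errors)]
    k = len(z)
    return {
        # Excess mass here is the pattern expected under SWS favoring
        # affirmative estimates.
        "just_above_c": sum(c < v <= c + width for v in z) / k,
        # Excess mass here may signal selection among nonaffirmative estimates.
        "just_below_c": sum(c - width < v <= c for v in z) / k,
        # Excess mass here may signal "two-tailed" selection.
        "just_below_minus_c": sum(-c - width <= v < -c for v in z) / k,
    }
```

The three shares map onto the three patterns described in the text: concentration just above 1.96, just below 1.96, and just below −1.96.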

Figure 4: Diagnostic plots for the Lodder et al.⁴⁸ meta-analysis. A version of Figure 4(a) that stratifies by preregistration status appears in the Supplement.

Second, to assess the fit of RTMA and possible violations of its distributional assumptions, we suggest using the RTMA estimates $\hat{μ}$ and $\hat{τ}$ to calculate the fitted cumulative distribution function of the observed ${\hat{θ}}_{i F}$ . One could then examine a quantile-quantile plot showing the fitted versus the empirical distribution functions (e.g., Figure 4(b)). If the estimates ${\hat{θ}}_{i F}$ do not adhere fairly closely to a 45-degree line, RTMA may not fit adequately. When fitting RTMA under the Jeffreys prior as we suggest, one should also examine the usual convergence metrics for Bayesian analyses, such as the potential scale reduction $(\hat{R})$ and the effective sample size.^44,45 If the number of published nonaffirmative estimates is small, these metrics may indicate that the target acceptance rate and/or maximum tree depth should be increased to obtain stable inference.^44–46
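One way to build the coordinates for such a plot is sketched below in Python. Because each study has its own truncation point, the sketch evaluates each study's fitted truncated-normal distribution function at its observed estimate and pairs the sorted values with uniform plotting positions. Whether rtma_qqplot constructs its plot in exactly this way is not stated here, so this construction, along with the function name, should be read as an assumption.

```python
import math
from statistics import NormalDist


def rtma_qq_points(estimates, std_errors, mu_hat, tau_hat, c=1.96):
    """Pairs of (empirical, fitted) CDF values for nonaffirmative estimates."""
    std_normal = NormalDist()
    fitted = []
    for t, sigma_i in zip(estimates, std_errors):
        s_i = math.sqrt(tau_hat**2 + sigma_i**2)
        c_tilde = (c * sigma_i - mu_hat) / s_i
        # Fitted right-truncated normal CDF evaluated at the observed estimate.
        fitted.append(
            std_normal.cdf((t - mu_hat) / s_i) / std_normal.cdf(c_tilde)
        )
    fitted.sort()
    k = len(fitted)
    empirical = [(i + 0.5) / k for i in range(k)]  # uniform plotting positions
    return list(zip(empirical, fitted))
```

If the fitted model is adequate, the returned points should lie close to the 45-degree line when plotted.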

4. Existing two-step selection models.

Given that traditional conceptions of publication bias explicitly consider only SAS, but not SWS, it is informative to consider how the resulting statistical methods perform when there is both SWS and SAS. Focusing on existing two-step selection models^6,7 as one important and conceptually illustrative case, we use our proposed theoretical framework to determine that although there exist conditions under which the models do accommodate SWS, these conditions are highly restrictive and likely unrealistic for SWS in practice. We use “SM” to refer to a two-step selection model, namely one in which SAS is assumed to operate^6,7 such that for some selection ratio, $η \geq 0$ , we have $P (D_{i}^{*} = 1 ∣ A_{i F}^{*} = 1) = η \cdot P (D_{i}^{*} = 1 ∣ A_{i F}^{*} = 0)$ . In SM, the published estimates are thus assumed to have the following likelihood (Supplement, Corollary 4):⁷

g_{{\hat{θ}}_{i F}} (t) = \frac{η^{- 1} 1 {A_{i F} = 0} + 1 {A_{i F} = 1}}{η^{- 1} Φ (\frac{c σ_{i} - μ}{S_{i}}) + [1 - Φ (\frac{c σ_{i} - μ}{S_{i}})]} \cdot f^{*} (t)

(4.1)

where, again, $S_{i} = \sqrt{τ^{2} + σ_{i}^{2}}$ . The underlying likelihood for a study whose standard error is equal to the observed $σ_{i}$ is termed $f^{*} (t) ≔ f_{{\hat{θ}}_{i F}^{*}} (t)$ and is usually taken to be normal, such that $f^{*} (t) = S_{i}^{- 1} ϕ ((t - μ) / S_{i})$ , where $ϕ$ is the density of the standard normal distribution. Typically, these models are fit by maximum likelihood estimation to jointly estimate $μ$ , $τ$ , and $η$ .^6,7
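For readers who find code easier to parse than notation, Eq. (4.1) with a normal $f^{*}$ can be written as the following Python sketch. The function name is hypothetical, and the sketch evaluates the density only; it does not perform the joint maximum likelihood fit.

```python
import math
from statistics import NormalDist


def sm_density(t, sigma_i, mu, tau, eta, c=1.96):
    """Density of a published estimate under the two-step selection model."""
    if eta <= 0:
        raise ValueError("eta must be positive")
    std_normal = NormalDist()
    s_i = math.sqrt(tau**2 + sigma_i**2)
    # Probability that an underlying estimate is nonaffirmative.
    p_nonaff = std_normal.cdf((c * sigma_i - mu) / s_i)
    f_star = std_normal.pdf((t - mu) / s_i) / s_i   # underlying normal density
    affirmative = t > c * sigma_i
    numerator = 1.0 if affirmative else 1.0 / eta
    normalizer = p_nonaff / eta + (1.0 - p_nonaff)
    return numerator / normalizer * f_star
```

The density has a jump at $t = c σ_{i}$ whose ratio equals $η$ , and with $η = 1$ it reduces to the underlying normal density $f^{*}$ .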

In the Supplement (Section 4.2), we establish four new conditions regarding the mechanisms of SWS and SAS that are sufficient for SM to be correctly specified. That is, if all of these conditions are fulfilled, then if one fits SM as usual to all published estimates, ${{\hat{θ}}_{1 F}, \dots, {\hat{θ}}_{k F}}$ , the selection model will be correctly specified and, as desired, would consistently estimate the mean of studies’ ideal estimates, $μ = E [μ_{i 1}^{*}]$ . In this case, SM would estimate an overall selection ratio, $η$ , that comprises both within- and across-study selection probabilities (Supplement, Theorem 3). The within-study selection probabilities depend both on investigators’ relative preference for affirmative over nonaffirmative estimates and on the statistical power of studies’ multiple estimates.

Heuristically, the conditions for SM to be correctly specified are as follows. Among underlying studies, the distribution of favored affirmative estimates must be the same as the distribution of ideal affirmative estimates (and likewise for nonaffirmative estimates), but favored estimates are more (or less) likely to be affirmative than are ideal estimates. Additionally, conditional on an estimate’s affirmative status, SAS must not select on the index of the favored estimate nor the favored estimate itself. This latter condition is a stronger version of Assumption 2, in which the exchangeability was required to hold conditional on $A_{i F}^{*} = 0$ but not necessarily conditional on $A_{i F}^{*} = 1$ . Again, these conditions on SWS are highly restrictive, essentially ruling out all typical forms of SWS.²² Thus, these results are not meant to suggest that SM will perform adequately under realistic SWS, but rather the opposite. That these conditions are so restrictive underscores the value of the present theoretical framework for clarifying how SWS and SAS operate, and the value of the proposed methods that directly accommodate both forms of selection. This theoretical exercise of obtaining sufficient conditions for SM to be correctly specified provides intuitive and mechanistic insight into why these methods can perform poorly when there is SWS, as illustrated in our simulation study (Section 6). Our theoretical framework could of course also be used to investigate the implications of SWS for other traditional methods that explicitly accommodate only SAS, such as alternative selection models in which the probability of publication is a continuous function of a study’s $p$ -value.⁴⁷

5. Applied example.

We now apply the proposed analysis methods to a previously published meta-analysis on money priming effects. All code and data required to reproduce the applied example are publicly available and documented (https://osf.io/gk5ez/). Money primes are stimuli, such as images of paper currency, that evoke the concept of money. According to the psychological theory of money priming, exposure to such primes may affect a diverse range of attitudes and behaviors, such as belief in a just world, propensity to help others, and propensity to cheat.⁴⁸ A large published literature on money priming has appeared to support the theory, but the extent to which such findings are artifacts of both SAS and SWS has been contested.⁴⁹

Lodder et al.⁴⁸ meta-analyzed 287 psychology experiments that manipulated whether participants were exposed to a money prime versus to a control prime that was not related to money. Studies could measure any psychological or behavioral outcome. A minority of these studies (51) were preregistered, meaning that some form of protocol was created and time-stamped prior to data collection, which may limit SWS.^50,51 Of the preregistered studies, 5 were affirmative and 46 were nonaffirmative. The remaining 236 studies were not preregistered, of which 108 were affirmative and 128 were nonaffirmative. We re-analyzed Lodder et al.’s⁴⁸ dataset to assess the robustness of this evidence to potential SAS and SWS. (In the original paper,⁴⁸ the meta-analytic dataset contained minor errors that were later corrected, although results of corrected analyses were not published. We re-analyzed the corrected dataset.) Notably, studies’ Z-scores appeared to disproportionately concentrate just above a critical value of 1.96, suggesting that SWS may be a substantial influence in addition to SAS (Figure 4(a)).
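The study counts above can be tallied with a short Python snippet, which also makes the contrast in affirmative rates between preregistered and non-preregistered studies explicit. The counts are those reported in the text; the variable names are illustrative.

```python
# Study counts as reported in the text.
counts = {
    ("preregistered", "affirmative"): 5,
    ("preregistered", "nonaffirmative"): 46,
    ("not preregistered", "affirmative"): 108,
    ("not preregistered", "nonaffirmative"): 128,
}

total_studies = sum(counts.values())
total_nonaffirmative = sum(
    n for (_, status), n in counts.items() if status == "nonaffirmative"
)

# Share of affirmative studies within each preregistration group.
affirmative_share = {}
for group in ("preregistered", "not preregistered"):
    group_total = sum(n for (g, _), n in counts.items() if g == group)
    affirmative_share[group] = counts[(group, "affirmative")] / group_total

print(total_studies, total_nonaffirmative, affirmative_share)
```

The totals are 287 studies and 174 nonaffirmative studies. About 10% of preregistered studies (5 of 51) were affirmative, compared with about 46% of non-preregistered studies (108 of 236).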

We used five methods to meta-analyze the point estimates on the Hedges’ $g$ scale:⁵² (1) Using all 287 studies, we obtained an uncorrected estimate from a standard, parametric random-effects meta-analysis fit by restricted maximum likelihood and with standard errors estimated with the Knapp-Hartung adjustment.^53–55 (2) Again using all 287 studies, we fit SM as a standard method for SAS.^7,56 Using only the 174 nonaffirmative studies, we obtained (3) RTMA and (4) MAN estimates as described in Sections 3.1 and 3.2. (5) Additionally, we conducted a standard meta-analysis of only the 51 preregistered studies, because these studies might be less subject to SWS than non-preregistered studies. This method serves as a comparator for RTMA and MAN; the latter two methods have the advantage of applying even when one does not know a priori which studies are unhacked, or when one is not willing to assume that preregistered studies are always unhacked.⁵⁷ Each of the five methods estimates $μ$ and $τ$ , and SM additionally estimates the selection ratio $η$ . (However, as noted in Section 3.2, MAN is intended primarily for estimation of $μ$ because this is the parameter for which its conservatism is conceptually useful.) Methods (1), (4), and (5) use standard meta-analysis models fit to different subsets of the studies.

Figure 3 shows the estimates from each method. The uncorrected estimate was $g = 0.26$ $(95\% \text{ CI}: [0.21, 0.30]; \hat{τ} = 0.35 [0.30, 0.38])$ . The SM estimate was smaller, but still suggested a non-negligible mean effect $(g = 0.18 [0.11, 0.26]; \hat{τ} = 0.32 [0.27, 0.36])$ . This model estimated a selection ratio of $\hat{η} = 1.64 [1.14, 2.88]$ . As described in Section 4, if SM were correctly specified, this estimated selection ratio would comprise both within- and across-study selection probabilities. However, this model was likely misspecified given the observed concentration of $Z$ -scores just above 1.96. Indeed, the estimate among only preregistered studies was close to the null $(g = 0.02 [- 0.03, 0.06]; \hat{τ} = 0.04 [0, 0.09])$ . Unlike the SM estimate, the RTMA estimate $(g = 0.03 [- 0.01, 0.09]; \hat{τ} = 0.10 [0.04, 0.18])$ agreed closely with the estimate among only the preregistered studies, although the heterogeneity estimate was somewhat larger. In this particular meta-analysis, widths of credible and confidence intervals indicated that RTMA had comparable efficiency to the uncorrected estimate; this likely reflects reduced heterogeneity when the analysis is restricted to nonaffirmative studies, as well as misspecification of the uncorrected model. A diagnostic quantile-quantile plot (Figure 4(b)) suggested that the RTMA model achieved a reasonable fit. For a more conservative estimate under weakened assumptions, the MAN estimate was null $(g = 0 [- 0.03, 0.02])$ . In this meta-analysis, the MAN estimate was more precise than the uncorrected estimate, a point we consider in the Discussion.

These results illustrate the value of applying RTMA and MAN in addition to conventional SAS methods. In this meta-analysis, the assumptions of SM appear clearly violated due to apparent SWS. Whereas the conventional SM estimate suggests that SAS is unlikely to entirely explain away effects of money priming, the nearly null RTMA estimate instead suggests that money priming effects may be largely artifacts of SWS. Under weakened assumptions, the conservative MAN estimate provides a lower bound that is also nearly null. Furthermore, it is interesting that the RTMA and MAN estimates agree closely with the estimate among only preregistered studies, as one would expect if the preregistered studies are indeed unhacked and if RTMA is correctly specified. However, it is possible that preregistered studies are still subject to some SWS, and such studies may differ systematically from non-preregistered studies on potential effect modifiers, such as the type of money prime.^57,58 These considerations suggest the importance of applying the proposed methods even when a large number of preregistered studies are available.

6. Simulation study.

We conducted a simulation study to evaluate the performance of point estimates and inference obtained using RTMA and MAN. Here we briefly summarize; full methodological details and results are available in the Supplement (Section 6). All code and data required to reproduce the simulation study are publicly available and documented (https://osf.io/gk5ez/). We compare the performance of RTMA and MAN to that of an uncorrected meta-analysis, SM,^7,56 PET-PEESE,¹⁰ and robust Bayesian model-averaging (RoBMA).¹² As a gold-standard benchmark, we also conducted a meta-analysis of only unhacked studies. We assess methods’ performance in terms of the bias of $\hat{μ}$ and in terms of the coverage and width of 95% confidence or credible intervals. Because Bayesian estimation under the Jeffreys prior can be alternatively viewed in a frequentist framework as a bias correction for the MLE,³⁷ we assess performance of credible intervals in terms of their frequentist properties, and we abbreviate both “credible interval” and “confidence interval” as “CI”. Based on peer reviewers’ recommendations, we used an existing simulation framework and R package, phackR,^23,59 to facilitate comparison with previous and future simulation studies. This simulation framework has limitations, discussed in the Supplement.

In the Supplement, we present and discuss the results in detail (Tables S1–S18), and results for each individual simulation scenario are publicly available as a dataset (https://osf.io/gk5ez/). Aggregating across scenarios, point estimates from RTMA were approximately unbiased, whereas the estimates from the uncorrected meta-analysis and the comparison correction methods (SM, RoBMA, and especially PET-PEESE) were substantially biased (Supplement, Tables S1–S3). RTMA exhibited a better bias-variance tradeoff (indicated by lower mean absolute error and root mean-square error), and closer-to-nominal CI coverage, than all three comparison correction methods. MAN showed minimal negative bias. With other distributions of underlying population effects, we expect that MAN would become more conservative (as seen in the applied example). RTMA did typically have wider CIs than the uncorrected meta-analysis and SM. This loss of precision is a limitation, though again, RTMA did exhibit a better bias-variance tradeoff than all comparison correction methods. Additionally, RTMA had narrower CIs than RoBMA and PET-PEESE. Of the comparison correction methods, SM typically outperformed RoBMA and PET-PEESE. In the Supplement (Section 6.4), we describe strengths and limitations of the simulation study.

7. Discussion.

This paper has provided a formalization of joint SWS and SAS, enabling new conceptual insights into the distinctions and commonalities between SWS and SAS, and into how SWS can lead to bias in methods that were designed to accommodate only SAS. We have proposed new meta-analytic methods for bias due to joint SWS and SAS in meta-analyses. These methods advance methodologically upon existing approaches by accommodating both SWS and SAS, accommodating additional mechanisms for SWS (including those in which there is within-study heterogeneity and those in which estimates are independent or autocorrelated), and providing correct inference. The first method we propose, RTMA, provides consistent estimates if hacked studies are always successful or if unsuccessful hacked studies, if there are any, are never published. This method, along with diagnostic plots, is implemented in the R package phacking and the website metabias.io. The second method we propose, MAN, is designed to be highly conservative under weakened assumptions on the mechanism of SWS. This method is straightforward to estimate nonparametrically using existing software.⁴¹

Both RTMA and MAN analyze only the observed nonaffirmative estimates because, as we have shown, these estimates retain a tractable distribution under certain plausible forms of joint SWS and SAS. RTMA essentially imputes the entire underlying distribution of population effects, whereas MAN simply estimates the mean underlying population effect of nonaffirmative favored estimates, providing a conservative estimate for $μ$ . As such, the RTMA estimate will usually exceed the MAN estimate, but the MAN confidence interval is often more precise (Section 6). Although the MAN estimate is highly conservative by design, it is often still informative in practice: for 66% of large meta-analyses sampled from journals representing multiple scientific disciplines, the MAN estimate remained greater than the null, and for 25% of meta-analyses, its CI also excluded the null.²⁸ As noted in Section 3.2, the MAN estimate also represents an estimate under worst-case SAS.⁹ Of course, whether a conservative estimate is scientifically informative will depend on the scientific question and on whether one is primarily interested in effects in a particular direction. Given the differing assumptions of RTMA and MAN, we would suggest reporting both estimates by default.

Our theoretical framework and methods have limitations. Although they assume certain mechanisms of SAS and SWS that align well with empirical evidence on how researchers interpret $p$ -values,^28,30–32 if actual selection processes depart considerably from these assumptions, our proposed methods may be compromised. Violations of the assumptions can be assessed to some extent using the suggested diagnostic plots (Section 3.3). We now consider five such violations. (1) Under one plausible violation of stringent overall selection, in which investigators of hacked studies favor and publish nonaffirmative estimates if they cannot obtain an affirmative estimate, additional simulations indicate that RTMA typically becomes conservative, but this will not be the case under all forms of misspecification. (2) In some cases, the meta-analyst may extract estimates that are only approximations of those reported in the article (i.e., the estimates on which SAS or SWS operates). This could occur, for example, if the article’s reported estimate is adjusted for covariates, but the meta-analyst extracts an unadjusted estimate using crude summary statistics. If the former estimate is affirmative but the latter is nonaffirmative, then the estimates analyzed in RTMA or MAN would still include some degree of SAS or SWS. This issue also affects existing methods for SAS and SWS. (3) Investigators might favor larger nonaffirmative estimates over smaller ones. For example, in some scientific fields, “marginally significant” nonaffirmative estimates (e.g., positive estimates with $0.05 < p < 0.10$ ) may be subject to some degree of hacking. In such contexts, one could more conservatively fit RTMA and MAN to only estimates such that ${\hat{θ}}_{i F} / σ_{i} < 1.64$ , rather than those with ${\hat{θ}}_{i F} / σ_{i} < 1.96$ . (4) Some investigators might favor nonaffirmative over affirmative results (Section 3.3). (5) If certain study-level covariates are associated with studies’ favored estimates, then analyzing only nonaffirmative estimates changes the distribution of covariates to which RTMA or MAN applies. When studies have differing magnitudes of internal bias, for example due to confounding, this property can become a strength, as restricting analysis to nonaffirmative estimates also indirectly restricts to studies with less internal bias.⁶⁰

Because RTMA estimates the underlying parameters of a truncated distribution, its CI may be imprecise when the number of published nonaffirmative estimates is small, which may often be the case in some scientific disciplines. However, the CI remains correctly calibrated in these settings, and RTMA nevertheless showed a better bias-variance tradeoff than comparison correction methods (SM, RoBMA, and PET-PEESE) in the present simulation study, whose limitations are discussed in the Supplement. Some meta-analyses may not contain any nonaffirmative estimates at all, in which case RTMA and MAN cannot be applied.

Our proposed analysis methods characterize evidence strength using the standard meta-analytic point estimate and its confidence interval, but these metrics alone do not fully characterize evidence strength in a potentially heterogeneous distribution of effects.⁶¹ It can be informative to conduct analyses that consider the percentage of population effects that are stronger than a threshold chosen to represent a meaningfully strong effect size.^61–63 When one fits RTMA under the assumption of normal underlying population effects, one could easily obtain parametric estimates of the percentage of meaningfully strong effects using the estimated $\hat{μ}$ and $\hat{τ}$ from RTMA.^61,62,64
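Under the normality assumption, the estimated percentage of population effects stronger than a threshold $q$ is $1 - Φ ((q - \hat{μ}) / \hat{τ})$ . The Python sketch below computes this quantity. The function name and the threshold $q = 0.2$ are hypothetical choices for illustration, and the sketch returns a point estimate only, with no accompanying interval.

```python
from statistics import NormalDist


def prop_stronger_than(q, mu_hat, tau_hat):
    """Estimated proportion of population effects above threshold q,
    assuming normally distributed population effects."""
    if tau_hat <= 0:
        raise ValueError("tau_hat must be positive")
    return 1.0 - NormalDist(mu_hat, tau_hat).cdf(q)


# RTMA estimates from the applied example; q = 0.2 is a hypothetical threshold.
print(prop_stronger_than(q=0.2, mu_hat=0.03, tau_hat=0.10))
```

With the RTMA estimates from the applied example ( $\hat{μ} = 0.03$ , $\hat{τ} = 0.10$ ) and this hypothetical threshold, the estimated proportion is roughly 0.045, i.e., about 4% of population effects would exceed $g = 0.2$ .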

An intriguing frontier for future research would be to formally model SWS in which studies may favor multiple estimates, and to develop resulting estimation methods. In contrast, we have assumed that each study favors a single estimate, which is potentially then subject to SAS. Even if each study’s investigators in fact favor and report multiple estimates, our assumed model may hold approximately if each study has a single “headline” estimate (e.g., the estimate that is showcased in the Abstract) which is subjected to SAS, and if this headline estimate is the only one that, if published, would be included in the meta-analysis. Indeed, some empirical evidence suggests that studies’ headline results may be subject to considerably more selective reporting (whether due to SAS or SWS) than are results reported less prominently.²⁸ On the other hand, a meta-analysis with sufficiently sensitive search terms should capture even these less prominent results, potentially compromising the performance of RTMA. Because MAN can be fit using methods that accommodate clustering among the published estimates, in principle this method could be used even when studies favor multiple estimates (Section 3.2.1). However, there remain intriguing conceptual challenges to formalizing such forms of SWS: Do investigators simply report all results they have obtained, but favor only a single affirmative estimate as the headline result? Or are non-headline results subject to some lesser degree of SWS? Can the distinction between headline and non-headline results be leveraged in estimation to reduce bias due to SWS? There has been substantial recent progress toward considering SAS in the context of multiple estimates per study;^9,65,66 can these conceptual and technical advances be parlayed into methods that additionally accommodate SWS? We encourage future research along these lines.

In summary, we have proposed a new conceptual framework, along with new meta-analytic methods, for joint SWS and SAS in meta-analyses. The methods are straightforward to apply, for example using the R package phacking or the website metabias.io. These methods are intended to complement rather than supplant existing best-practice reporting, including uncorrected meta-analytic estimates as well as estimates corrected for SAS using existing methods. Indeed, we have recently recommended that SM estimates be reported more routinely in meta-analyses.⁶⁷ Applying our proposed methods as well could additionally help assess the robustness of meta-analyses to joint SWS and SAS.

Highlights.

What is already known:

Meta-analyses can be compromised by selection across studies (traditional publication bias) and selection within studies ( $p$ -hacking).
Numerous existing methods consider selection across studies.

What is new:

Existing methods for selection across studies (SAS) can be severely biased when there is also selection within studies (SWS).
We provide a conceptual and theoretical framework to describe SWS and SAS.
We propose two new meta-analytic methods that accommodate joint SAS and SWS; both analyze only the published nonaffirmative estimates.
The first method is a right-truncated meta-analysis.
The second is a simple meta-analysis of nonaffirmative results.

Potential impact for RSM readers outside the authors’ field:

These methods are intended to complement rather than supplant existing best-practice reporting, including uncorrected meta-analytic estimates as well as estimates corrected for SAS using existing methods.

Acknowledgments.

Mika Braginsky led development of the R package and website. Sander Greenland, Mika Braginsky, and Michael C. Frank provided helpful comments on a draft.

Funding.

This research was supported by a Stanford University award (McCormick & Gabilan Faculty Award) and by National Institutes of Health grants R01 LM013866, UL1TR003142, P30CA124435, and P30DK116074. The funders had no role in the design, conduct, or reporting of this research.

Footnotes

Reproducibility. All code and data required to reproduce the simulation study and applied example are publicly available and documented (https://osf.io/gk5ez/). The codebase for the R package is publicly available (https://github.com/mathurlabstanford/phacking).

References

[1] Marks-Anglin Arielle and Chen Yong. A historical review of publication bias. Research Synthesis Methods, 11(6):725–742, September 2020.
[2] Jin Zhi-Chao, Zhou Xiao-Hua, and He Jia. Statistical methods for dealing with publication bias in meta-analysis. Statistics in Medicine, 34(2):343–360, 2015.
[3] Duval Sue and Tweedie Richard. Trim and fill: a simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56(2):455–463, 2000.
[4] Egger Matthias, Smith George Davey, Schneider Martin, and Minder Christoph. Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109):629–634, 1997.
[5] Dear Keith BG and Begg Colin B. An approach for assessing publication bias prior to performing a meta-analysis. Statistical Science, pages 237–245, 1992.
[6] Hedges Larry V. Modeling publication selection effects in meta-analysis. Statistical Science, pages 246–255, 1992.
[7] Vevea Jack L and Hedges Larry V. A general linear model for estimating effect size in the presence of publication bias. Psychometrika, 60(3):419–435, 1995.
[8] Andrews Isaiah and Kasy Maximilian. Identification of and correction for publication bias. American Economic Review, 109(8):2766–94, 2019.
[9] Mathur Maya B. and VanderWeele Tyler J. Sensitivity analysis for publication bias in meta-analyses. Journal of the Royal Statistical Society: Series C (Applied Statistics), 69(5):1091–1119, August 2020.
[10] Stanley Tom D and Doucouliagos Hristos. Meta-regression approximations to reduce publication selection bias. Research Synthesis Methods, 5(1):60–78, 2014.
[11] Bom Pedro RD and Rachinger Heiko. A kinked meta-regression model for publication bias correction. Research Synthesis Methods, 10(4):497–514, 2019.
[12] Bartoš František, Maier Maximilian, Wagenmakers Eric-Jan, Doucouliagos Hristos, and Stanley TD. Robust Bayesian meta-analysis: Model-averaging across complementary publication bias adjustment methods. Research Synthesis Methods, 14(1):99–116, 2023.
[13] Dwan Kerry, Altman Douglas G, Arnaiz Juan A, Bloom Jill, Chan An-Wen, Cronin Eugenia, Decullier Evelyne, Easterbrook Philippa J, Elm Erik Von, Gamble Carrol, et al. Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PloS One, 3(8):e3081, 2008.
[14] Franco Annie, Malhotra Neil, and Simonovits Gabor. Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203):1502–1505, 2014.
[15] Greenwald Anthony G. Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82(1):1, 1975.
[16] Hahn S, Williamson PR, and Hutton JL. Investigation of within-study selective reporting in clinical research: follow-up of applications submitted to a local research ethics committee. Journal of Evaluation in Clinical Practice, 8(3):353–359, August 2002.
[17] Simonsohn Uri, Nelson Leif D, and Simmons Joseph P. P-curve: a key to the file-drawer. Journal of Experimental Psychology: General, 143(2):534, 2014.
[18] Simmons Joseph P, Nelson Leif D, and Simonsohn Uri. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11):1359–1366, 2011.
[19] Brodeur Abel, Lé Mathias, Sangnier Marc, and Zylberberg Yanos. Star Wars: The empirics strike back. American Economic Journal: Applied Economics, 8(1):1–32, 2016.
[20] Jager Leah R and Leek Jeffrey T. An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15(1):1–12, 2014.
[21] Brodeur Abel, Cook Nikolai, and Heyes Anthony. Methods matter: P-hacking and publication bias in causal analysis in economics. American Economic Review, 110(11):3634–60, 2020.
[22] John Leslie K, Loewenstein George, and Prelec Drazen. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5):524–532, 2012.
[23] Stefan Angelika M and Schönbrodt Felix D. Big little lies: A compendium and simulation of p-hacking strategies. Royal Society Open Science, 10(2):220346, 2023.
[24] Phillips Carl V. Publication bias in situ. BMC Medical Research Methodology, 4(1), August 2004.
[25] Carter Evan C, Schönbrodt Felix D, Gervais Will M, and Hilgard Joseph. Correcting for bias in psychology: A comparison of meta-analytic methods. Advances in Methods and Practices in Psychological Science, 2(2):115–144, 2019.
[26] Van Assen Marcel ALM, van Aert Robbie, and Wicherts Jelte M. Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20(3):293, 2015.
[27] McShane Blakeley B, Böckenholt Ulf, and Hansen Karsten T. Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11(5):730–749, 2016.
[28] Mathur Maya B and VanderWeele Tyler J. Estimating publication bias in meta-analyses of peer-reviewed studies: A meta-meta-analysis across disciplines and journal tiers. Research Synthesis Methods, 12(2):176–191, 2021.
[29] Moss Jonas and De Bin Riccardo. Modelling publication bias and p-hacking. Biometrics, 2021.
[30] McShane Blakeley B and Gal David. Statistical significance and the dichotomization of evidence. Journal of the American Statistical Association, 112(519):885–895, 2017.
[31] Head Megan L., Holman Luke, Lanfear Rob, Kahn Andrew T., and Jennions Michael D. The extent and consequences of p-hacking in science. PLOS Biology, 13(3):e1002106, March 2015.
[32] Masicampo EJ and Lalande Daniel R. A peculiar prevalence of p values just below .05. The Quarterly Journal of Experimental Psychology, 65(11):2271–2279, 2012.
[33] Cope Eric W. Penalized likelihood estimators for truncated data. Journal of Statistical Planning and Inference, 141(1):345–358, 2011.
[34] Zhou Xiaoping, Giacometti Rosella, Fabozzi Frank J, and Tucker Ann H. Bayesian estimation of truncated data with applications to operational risk measurement. Quantitative Finance, 14(5):863–888, 2014.
[35] Mittal Mukul M and Dahiya Ram C. Estimating the parameters of a doubly truncated normal distribution. Communications in Statistics - Simulation and Computation, 16(1):141–159, 1987.
[36] Jeffreys Harold. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
[37] Firth David. Bias reduction of maximum likelihood estimates. Biometrika, 80(1):27–38, 1993.
[38] DerSimonian Rebecca and Laird Nan. Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3):177–188, September 1986.
[39] Copas John and Jackson Dan. A bound for publication bias based on the fraction of unpublished studies. Biometrics, 60(1):146–153, 2004.
[40] Hedges Larry V, Tipton Elizabeth, and Johnson Matthew C. Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1(1):39–65, 2010.
[41] Fisher Zachary and Tipton Elizabeth. Robumeta: An R-package for robust variance estimation in meta-analysis. arXiv preprint arXiv:1503.02220, 2015.
[42] Pustejovsky James E and Tipton Elizabeth. Meta-analysis with robust variance estimation: Expanding the range of working models. Prevention Science, 23(3):425–438, 2022.
[43] Tipton Elizabeth. Small sample adjustments for robust variance estimation with meta-regression. Psychological Methods, 20(3):375, 2015.
[44] Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, and Rubin DB. Bayesian data analysis. Boca Raton, 2014.
[45] Vehtari Aki, Gelman Andrew, Simpson Daniel, Carpenter Bob, and Bürkner Paul-Christian. Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC (with discussion). Bayesian Analysis, 16(2), June 2021.
[46] Carpenter Bob, Gelman Andrew, Hoffman Matthew D, Lee Daniel, Goodrich Ben, Betancourt Michael, Brubaker Marcus, Guo Jiqiang, Li Peter, and Riddell Allen. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.
[47] Iyengar Satish and Greenhouse Joel B. Selection models and the file drawer problem. Statistical Science, pages 109–117, 1988.
[48] Lodder Paul, Ong How Hwee, Grasman Raoul PPP, and Wicherts Jelte M. A comprehensive meta-analysis of money priming. Journal of Experimental Psychology: General, 148(4):688, 2019.
[49] Vadillo Miguel A., Hardwicke Tom E., and Shanks David R. Selection bias, vote counting, and money-priming effects: A comment on Rohrer, Pashler, and Harris (2015) and Vohs (2015). Journal of Experimental Psychology: General, 145(5):655–663, May 2016.
[50] Allen Christopher and Mehler David MA. Open science challenges, benefits and tips in early career and beyond. PLoS Biology, 17(5):e3000246, 2019.
[51] Kaplan Robert M and Irvin Veronica L. Likelihood of null effects of large NHLBI clinical trials has increased over time. PloS One, 10(8):e0132382, 2015.
[52] Hedges Larry V. Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6(2):107, 1981.
[53] Viechtbauer Wolfgang. Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 2010.
[54] Viechtbauer Wolfgang. Bias and efficiency of meta-analytic variance estimators in the random-effects model. Journal of Educational and Behavioral Statistics, 30(3):261–293, September 2005.
[55] Knapp Guido and Hartung Joachim. Improved tests for a random effects meta-regression with a single covariate. Statistics in Medicine, 22(17):2693–2710, 2003.
[56] Coburn Kathleen M and Vevea Jack L. weightr: Estimating weight-function models for publication bias. https://CRAN.R-project.org/package=weightr, 2019. R package version 2.0.2.
[57] Claesen Aline, Gomes Sara, Tuerlinckx Francis, and Vanpaemel Wolf. Comparing dream to reality: an assessment of adherence of the first generation of preregistered studies. Royal Society Open Science, 8(10):211037, 2021.
[58] Lodder Paul, Ong How Hwee, Grasman Raoul P. P. P., and Wicherts Jelte M. A comprehensive meta-analysis of money priming. Journal of Experimental Psychology: General, 148(4):688–712, April 2019.
[59] Stefan Angelika M. phackR: Simulate p-Hacking, 2023. R package version 0.0.0.9000.
[60] Mathur Maya B. Sensitivity analysis for the interactive effects of internal bias and publication bias in meta-analyses. Research Synthesis Methods, in press. Preprint retrieved from https://osf.io/u7vcb/.
[61] Mathur Maya B and VanderWeele Tyler J. New metrics for meta-analyses of heterogeneous effects. Statistics in Medicine, 38(8):1336–1342, 2019.
[62] Mathur Maya B and VanderWeele Tyler J. Robust metrics and sensitivity analyses for meta-analyses of heterogeneous effects. Epidemiology, 31(3):356–358, 2020.
[63] Mathur Maya B and VanderWeele Tyler J. Sensitivity analysis for unmeasured confounding in meta-analyses. J Am Stat Assoc, 115(529):163–172, 2020.
[64] Mathur Maya B., Wang Rui, and VanderWeele Tyler J. MetaUtility: Utility Functions for Conducting and Interpreting Meta-Analyses, 2019. R package version 2.1.0.
[65] Rodgers Melissa A and Pustejovsky James E. Evaluating meta-analytic methods to detect selective reporting in the presence of dependent effect sizes. Psychological Methods, 26(2):141, 2021.
[66] Alinaghi Nazila and Reed W Robert. Meta-analysis and publication bias: How well does the FAT-PET-PEESE procedure work? Research Synthesis Methods, 9(2):285–311, 2018.
[67] Maier Maximilian, VanderWeele Tyler, and Mathur Maya. Using selection models to assess sensitivity to publication bias: A tutorial and call for more routine use. Campbell Systematic Reviews, 2022. Preprint retrieved from https://osf.io/preprints/metaarxiv/tp45u.
