Abstract
Meta-analysis, the statistical procedure for combining results from multiple independent studies, has been widely used in medical research to evaluate intervention efficacy and drug safety. In many practical situations, treatment effects vary notably among the collected studies, and this variation, often modeled by the between-study variance parameter τ2, can greatly affect inference about the overall effect size. In the past, comparative studies have been conducted for both point and interval estimation of τ2. However, most are incomplete, covering only a limited subset of existing methods, and some are outdated. Further, none of these studies covers descriptive measures for assessing the level of heterogeneity, nor do they focus on rare binary events, which require special attention. We summarize what is, to our knowledge, the most comprehensive collection to date, including 11 descriptive measures, 23 estimators, and 16 confidence intervals. In addition to providing synthesized information, we further categorize these methods according to their key features. We then evaluate their performance in simulation studies that examine various realistic scenarios for rare binary events, with an illustration using data from a gestational diabetes meta-analysis. We conclude that there is no uniformly “best” method. However, methods with consistently better performance do exist in the context of rare binary events, and we provide practical guidelines based on numerical evidence.
Keywords: bias, confidence interval, coverage probability, DerSimonian and Laird, fixed effect, odds ratio, mean squared error, Q statistic, random effects
1. Introduction
Meta-analysis, the statistical procedure for synthesizing information from multiple studies, has been widely used in many research areas including social, psychological and especially medical sciences. Meta-analysis is a powerful tool in drug safety evaluation, where the number of cases (adverse events) can be very limited in a single study. The U.S. Food and Drug Administration (FDA) released a draft guidance for industry titled “Meta-Analyses of Randomized Controlled Clinical Trials to Evaluate the Safety of Human Drugs or Biological Products” in November 2018, which demonstrates the importance of meta-analysis in the development of new drugs. Such meta-analysis often involves binary outcomes of rare events, which are the focus of this study.
The primary goal of a meta-analysis is usually to estimate and infer the overall effect size, where the variability in the effect estimates from component studies should be properly accounted for. Besides the within-study sampling errors, this variability may come from diverse characteristics of individual studies, such as disparities in trial protocols, subjects’ conditions, and population features. When such study-wise differences exist, we call the studies (statistically) heterogeneous, and the heterogeneity is typically measured by a between-study variance parameter τ2. In addition, descriptive measures have been widely used by clinicians to provide a more intuitive interpretation of the heterogeneity for ease of understanding.
For point estimation of τ2, the DerSimonian and Laird (DL) estimator [10], most widely used in the field, has been frequently challenged for its default use in many software packages, largely due to its sizable negative bias when the heterogeneity level is high [39, 47, 2, 35, 34]. Many modifications over the DL estimator have been suggested based on the method of moments. Other approaches such as likelihood-based and other nonparametric methods can also be applied. For interval estimation of τ2, different types of confidence intervals (CIs) have been constructed to gauge the estimation uncertainty. However, nearly all these methods were constructed without a special consideration of dichotomous data and their performance remains unclear in the context of rare binary events, in which some may produce large bias or even fail to work.
Comparative studies and review papers exist for both point and interval estimation of τ2, but not for descriptive measures. For example, Veroniki et al. [46], Langan et al. [28], and Petropoulou and Mavridis [37] reviewed and compared most of the existing estimators of τ2, among which only Petropoulou and Mavridis [37] conducted simulation studies to evaluate their performance. Previous comparisons of CIs (e.g., [48, 25, 45]) were largely limited to several similar types of CIs. As detailed in Tables 2 and 5, none of these papers covers descriptive measures for quantifying the level of heterogeneity, nor do they focus on rare binary events. Moreover, most of them are far from complete, and some are outdated, which motivates us to conduct this study to provide useful guidance to clinicians and biostatisticians.
Table 2:
Overview of 23 estimators for the between-study variance τ2
| Estimators | Abbreviation | Reference | Iterative? | Sign | Effect Measure |
|---|---|---|---|---|---|
| Method of Moments | |||||
| Hedges and Olkin | HO | Hedges and Olkin [17] | No | ≥ 0 | |
| Two-step Hedges and Olkin | HO2 | DerSimonian and Kacker [9] | No | ≥ 0 | |
| DerSimonian and Laird | DL | DerSimonian and Laird [10] | No | ≥ 0 | |
| Positive DerSimonian and Laird | DLp | Kontopantelis et al. [27] | No | > 0 | |
| Two-step DerSimonian and Laird | DL2 | DerSimonian and Kacker [9] | No | ≥ 0 | |
| **Multistep DerSimonian and Laird** | DLM | van Aert and Jackson [44] | No | ≥ 0 | |
| Paule and Mandel | PM | Paule and Mandel [36] | Yes | ≥ 0 | |
| Improved Paule and Mandel | IPM | Bhaumik et al. [2] | Yes | ≥ 0 | OR |
| Hartung and Makambi | HM | Hartung and Makambi [16] | No | > 0 | |
| Hunter and Schmidt | HS | Hunter and Schmidt [20] | No | ≥ 0 | |
| **Lin, Chu and Hodges** | LCH | Lin et al. [31] | No | ≥ 0 | |
| Likelihood-based |||||
| Maximum likelihood | ML | Hardy and Thompson [14] | Yes | ≥ 0 | |
| Restricted maximum likelihood | REML | Viechtbauer [47] | Yes | ≥ 0 | |
| Approximate restricted maximum likelihood | AREML | Morris [33] | Yes | ≥ 0 | |
| Model error variance (Least squares) |||||
| Sidik and Jonkman | SJ | Sidik and Jonkman [39] | No | > 0 | |
| Sidik and Jonkman (HO prior) | SJHO | Sidik and Jonkman [40] | No | > 0 | |
| Bayesian |||||
| Rukhin Bayes | RB0 | Rukhin [38] | Yes | ≥ 0 | |
| Positive Rukhin Bayes | RBp | Rukhin [38] | Yes | > 0 | |
| Empirical Bayes (equivalent to PM) | EB | Morris [33] | Yes | ≥ 0 | |
| Fully Bayes | FB | Smith et al. [41] | Yes | > 0 | |
| Bayes modal | BM | Chung et al. [6, 5] | Yes | > 0 | |
| Other nonparametric |||||
| Malzahn, Böhning and Holling | MBH | Malzahn et al. [32] | No | ≥ 0 | SMD |
| Nonparametric bootstrap DerSimonian and Laird | DLb | Kontopantelis et al. [27] | No | ≥ 0 | |
Table 5:
Existing comparative studies on constructing CIs for τ2 in random-effects meta-analysis
| Review paper | CI methods reviewed/compared | Effect measure | Recommendations |
|---|---|---|---|
| Knapp et al. [25] | QP, MQP, BT, PLML, WML | MD/OR | QP and MQP |
| Viechtbauer [48] | QP, BT, PL, W, SJ, BS | OR | QP |
| Veroniki et al. [46] | PL, W, BT, BJ, J, QP, SJ, BS, BC | Generic | — |
| van Aert et al. [45] | QP, BJ, J | OR | None recommended when pki < 0.1 in combination with either K ≥ 80 or (K ≥ 40 and nki < 30) |
The paper is organized as follows. In Section 2, we introduce notation and frequently used terms in meta-analysis. Section 3 reviews existing descriptive measures quantifying the level of heterogeneity. In Section 4, we list estimators for τ2 and briefly summarize two recently developed ones that are not included in any of the existing review papers. In Section 5, different types of confidence intervals for τ2 are described and categorized. In Section 6, we compare the performance, in terms of bias and mean squared error (MSE) for point estimators and empirical coverage probability and width for CIs, in a large collection of scenarios that are designed to mimic practical situations. In Section 7, we re-analyze the data from a meta-analysis [1] of 20 trials of type 2 diabetes mellitus after gestational diabetes with focus on the heterogeneity among the component studies. The final section provides recommendations in terms of choosing appropriate estimators and CIs in meta-analysis of rare binary events as well as a brief discussion.
2. Notation & frequently used terms
Suppose a meta-analysis includes K independent studies and the kth study contains nk subjects (k = 1, …, K). In study k, let θk be the true but unknown treatment effect and yk be the observed treatment effect such that E[yk|θk] = θk and Var[yk|θk] = σk2, the within-study variance. Typically sk2, an estimate of σk2, is reported along with yk in published studies and it is often treated as a known quantity in practice (i.e., indistinguishable from σk2). When the study-specific effects θk’s are treated as random variables rather than constants, we assume E[θk] = θ and Var[θk] = τ2, where θ, a parameter of main interest in the meta-analysis, represents the overall treatment effect across different studies, and τ2 measures the between-study heterogeneity. There exist two main parametric models, namely the Re (random-effects) and Fe (fixed-effect) models, to combine results from component studies. The Re model assumes that yk = θk + ϵk, where θk ~ N(θ, τ2) and ϵk ~ N(0, σk2). When τ2 = 0, it reduces to the Fe model yk = θ + ϵk, where a common treatment effect θ is assumed for all component studies (i.e., θk ≡ θ). These models can be used with any effect measure, as long as the assumed normality is (approximately) valid.
For binary responses, we denote the number of events by xk0 (xk1) and the number of subjects by nk0 (nk1) in the control (treatment) group. The probability of having an event in the control (treatment) group is denoted by pk0 (pk1). Effect measures for binary outcomes include the risk difference (RD, pk1 − pk0), risk ratio (RR, pk1/pk0) and odds ratio (OR, [pk1/(1 − pk1)]/[pk0/(1 − pk0)]). For rare binary events, RR ≈ OR. A logarithm transformation of the odds ratio (LOR) is often used in meta-analysis for a much faster convergence to asymptotic normality, and the within-study variance is then estimated by sk2 = 1/xk1 + 1/(nk1 − xk1) + 1/xk0 + 1/(nk0 − xk0). Gart [13] added a continuity correction factor of 0.5 to all the cells so that

yk = log[(xk1 + 0.5)/(nk1 − xk1 + 0.5)] − log[(xk0 + 0.5)/(nk0 − xk0 + 0.5)],

and σk2 is estimated by

sk2 = 1/(xk1 + 0.5) + 1/(nk1 − xk1 + 0.5) + 1/(xk0 + 0.5) + 1/(nk0 − xk0 + 0.5),

which will be used in our numerical evaluation of rare binary events.
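To make the continuity-corrected formulas above concrete, here is a minimal base R sketch that computes yk and the variance estimate sk2 from the cell counts of a single study; the function name and the counts in the example call are ours and purely illustrative.

```r
# Continuity-corrected log-odds ratio (Gart) and its estimated variance,
# following the formulas above; x0/x1 are event counts, n0/n1 group sizes.
gart_lor <- function(x0, n0, x1, n1, a = 0.5) {
  yk  <- log((x1 + a) / (n1 - x1 + a)) - log((x0 + a) / (n0 - x0 + a))
  sk2 <- 1 / (x1 + a) + 1 / (n1 - x1 + a) + 1 / (x0 + a) + 1 / (n0 - x0 + a)
  c(yk = yk, sk2 = sk2)
}

# Hypothetical study with a zero event count in the control arm:
gart_lor(x0 = 0, n0 = 250, x1 = 3, n1 = 260)
```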
Next, we introduce the (generalized) Q statistic [9] and related terms, which will frequently appear in the paper. For any parameter of interest, we use the corresponding letter/symbol with a hat to denote its estimate. For example, we use θ̂ to denote the estimate of the overall treatment effect θ. The Q statistic is defined as the weighted sum of squared deviations between the estimated overall treatment effect and the observed treatment effect in each individual study, namely

Q = Σk=1K wk (yk − θ̂)2, (1)

where wk is a positive weight assigned to study k, and θ̂ = Σk wk yk / Σk wk, the weighted average of the estimated study-specific effects. A commonly used weighting scheme is to set wk equal to the inverse of the estimated variance of yk. Under this inverse-variance weighting scheme, the variance of θ̂ is given by (Σk wk)−1 if we treat the wk’s as known constants (i.e., indistinguishable from [Var(yk)]−1). Further, this scheme yields wk = 1/sk2 for the Fe model and wk = 1/(sk2 + τ̂2) for the Re model, where τ̂2 can be any estimator discussed in Section 4. Under the Fe (Re) model with the inverse-variance weights, we denote the corresponding Q statistic by QFe (QRe) and the corresponding θ̂ by θ̂Fe (θ̂Re) with variance vFe (vRe). In fact, Cochran’s Q statistic is QFe, also known as DerSimonian and Laird’s Q test statistic [10].
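The following base R sketch computes the generalized Q statistic in (1) together with the inverse-variance weighted estimate and its variance. The vectors y and s2 are hypothetical study-level log-odds ratios and within-study variances used purely for illustration, and they are reused in the later sketches throughout this paper.

```r
# Generalized Q statistic of equation (1) for given weights, together with
# the weighted-average effect and its variance 1 / sum(w).
q_stat <- function(y, w) {
  theta_hat <- sum(w * y) / sum(w)
  list(Q = sum(w * (y - theta_hat)^2),
       theta_hat = theta_hat,
       v = 1 / sum(w))
}

# Hypothetical yk and sk2; Fe weights 1/sk2 give Cochran's QFe, while
# Re weights 1/(sk2 + tau2_hat) give QRe for a chosen tau2 estimate.
y  <- c(0.42, -0.11, 0.35, 0.80, 0.05)
s2 <- c(0.15, 0.22, 0.08, 0.30, 0.12)
q_fe <- q_stat(y, 1 / s2)
q_re <- q_stat(y, 1 / (s2 + 0.2))   # tau2_hat = 0.2 as an illustration
```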
Throughout this paper, we use χ2(df) to denote a chi-squared distribution with df degrees of freedom, and use χ2α(df) to denote its 100α-th percentile.
3. Descriptive measures quantifying between-study heterogeneity
As mentioned in the introduction, (statistical) heterogeneity exists when the true effects being evaluated differ among the studies in a meta-analysis. Assessing the extent of heterogeneity is essential for model selection between the Fe and Re models and for decision making. An obvious choice is to estimate the variance parameter τ2, as is typically done in a random-effects meta-analysis. As pointed out by Higgins and Thompson [18], this measure does not facilitate comparison of heterogeneity across meta-analyses of different types of outcomes (e.g., the survival time can be either continuous or discrete). Also, its scale is specific to the chosen effect metric and its interpretation can be difficult. For example, the odds ratio is a commonly used effect measure for binary data, yet the variance of the log-odds ratio is not easy to understand for many non-statisticians. Alternatively, one may test for the existence of between-study heterogeneity (e.g., through Cochran’s Q-test [7]) and use the corresponding test statistic or p-value to indicate the extent of heterogeneity. However, such measures depend on the scale of the effect sizes or the number of component studies K. To overcome these limitations, effort has been devoted to the development of various descriptive measures that can provide more intuitive information about the heterogeneity.
Table 1 summarizes 11 descriptive heterogeneity measures in the literature. Note that all these measures are general-purpose and none is specifically designed for binary outcomes. Takkouche et al. [42] proposed two measures, RI and CVB, to quantify the level of heterogeneity in five published meta-analyses. The statistic RI was developed to estimate τ2/(τ2 + σ2), the proportion of total variation in the effect estimates that is due to between-study heterogeneity. This quantity is also known as the intra-class correlation in the context of cluster sampling. Here, the within-study variances are assumed to be constant, i.e., σk2 ≡ σ2, which is estimated by a pooled value s2, making R̂I = τ̂2/(τ̂2 + s2). The other statistic CVB estimates the between-study coefficient of variation τ/|θ| by ĈVB = τ̂/|θ̂|. Obviously, CVB is affected by the overall treatment effect θ and is undefined when θ = 0.
Table 1:
Descriptive measures quantifying the between-study heterogeneity
| Name | f(θ, τ2, σ2, K) (quantity estimated) | Formula | Ref. | Interpretation | Assumes common σk2? |
|---|---|---|---|---|---|
| RI | τ2/(τ2 + σ2) | τ̂2/(τ̂2 + s2) | [42] | Proportion of total variation in the estimates of treatment effect due to between-study heterogeneity | Yes |
| CVB | τ/\|θ\| | τ̂/\|θ̂\| | [42] | Between-study coefficient of variation | No |
| H2 | (τ2 + σ2)/σ2 | QFe/(K − 1) | [18] | Relative excess in QFe over its degrees of freedom | Yes, but can be used for different σk2 |
| R2 | (τ2 + σ2)/σ2 | vRe/vFe | [18] | Inflation in the confidence interval for a single summary estimate under the Re model compared with the Fe model | Yes, but can be used for different σk2 |
| I2 (Q-based) | τ2/(τ2 + σ2) | [QFe − (K − 1)]/QFe | [18] | Same as RI | Yes |
| I2 (variance-based) | τ2/(τ2 + σ2) | (R2 − 1)/R2 | [24] | Same as RI | Yes |
| Rb | τ2/(KvRe) | τ̂2/(Kv̂Re) | [8] | Proportion of the between-study heterogeneity τ2 relative to vRe, the variance of θ̂Re | No |
| H2 analogue based on Qr | (τ2 + σ2)/σ2 | Based on Qr, using θ̂Fe | [31] | Same as H2 | Yes |
| I2 analogue based on Qr | τ2/(τ2 + σ2) | Based on Qr, using θ̂Fe | [31] | Same as RI | Yes |
| H2 analogue based on Qm | (τ2 + σ2)/σ2 | Based on Qm, where θ̂m is the weighted median estimate | [31] | Same as H2 | Yes |
| I2 analogue based on Qm | τ2/(τ2 + σ2) | Based on Qm, where θ̂m is the weighted median estimate | [31] | Same as RI | Yes |
Under the assumption of a common within-study variance σ2, Higgins and Thompson [18] formulated a general heterogeneity measure as a function of the overall treatment effect θ, the between-study variance τ2, the within-study variance σ2, and the number of component studies, namely, f(θ, τ2, σ2, K). They proposed three criteria that such a measure should satisfy in general in order to facilitate its comparability and interpretability, including (i) dependence on the extent of heterogeneity, (ii) scale invariance, i.e. f(θ, τ2, σ2, K) = f(a + bθ, b2τ2, b2σ2, K) for any a and b, and (iii) size invariance, i.e. f(θ, τ2, σ2, K1) = f(θ, τ2, σ2, K2) for any positive integers K1 and K2. Criterion (i) implies that the function f should increase monotonically with τ2. Criterion (ii) implies that f should be a function of the ratio ρ = τ2/σ2 and that θ should not be involved. Criterion (iii) implies that f does not depend on K. It can be shown that any monotonically increasing function of ρ satisfies the three criteria. Based on this, three statistics, H2, R2 and I2, were proposed. The first, H2 = QFe/(K − 1), estimates the quantity ρ + 1 by equating the observed value of QFe to its expectation, so that H2 can be interpreted as the relative excess in QFe over its expected value, the degrees of freedom K − 1. The second, R2, attempts to estimate ρ + 1 as well; but here, ρ + 1 is approximated by vRe/vFe so that R2 = vRe/vFe, which can be interpreted as the inflation in the confidence interval for the overall effect under the Re model compared with under the Fe model. Both H2 and R2 should be at least 1, where 1 means perfect homogeneity; and the larger the value, the more heterogeneous the studies. In practice, the authors suggested using H and R because clinicians may be more familiar with standard deviations than variances. The third statistic, I2, estimates a different function of ρ, i.e. ρ/(ρ + 1) = τ2/(τ2 + σ2), which represents the proportion of total variance that is due to between-study variation. Higgins and Thompson [18] suggested computing I2 from QFe, which leads to the convenient relationship I2 = (H2 − 1)/H2 = [QFe − (K − 1)]/QFe. Jackson et al. [24] suggested a version based on the Re and Fe variances, which leads to another convenient relationship, I2 = (R2 − 1)/R2. Both versions of I2 are usually expressed as percentages between 0% and 100%, where a value of 0% corresponds to no observed heterogeneity, while larger values indicate increasing levels of heterogeneity. They estimate the same quantity as RI does, but with different within-study variance estimates. Among these measures (i.e. H2, R2, or I2), the Q-based I2 is the most popular, and in the literature I2 typically refers to this version, as the variance-based alternative is much less known. Higgins and Green [19] empirically provided a rough guide to the interpretation of I2 using overlapping intervals: a value in [0, 0.4] suggests that heterogeneity may not be that important; [0.3, 0.6] may represent moderate heterogeneity; [0.5, 0.9] may represent substantial heterogeneity; and [0.75, 1] implies considerable heterogeneity.
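As a quick illustration of the Q-based measures, the sketch below computes H2 and the Q-based I2 from Cochran’s QFe, reusing q_stat() and the hypothetical y and s2 from the Section 2 sketch; setting negative I2 values to zero matches the 0%–100% range described above.

```r
# H2 and the Q-based I2 of Higgins and Thompson, computed from Cochran's QFe;
# reuses q_stat() and the hypothetical y, s2 defined in the Section 2 sketch.
K   <- length(y)
QFe <- q_stat(y, 1 / s2)$Q
H2  <- QFe / (K - 1)
I2  <- max(0, (QFe - (K - 1)) / QFe)   # equals (H2 - 1) / H2 when QFe > K - 1
c(H2 = H2, I2 = I2)
```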
The assumption of a constant within-study variance is probably untrue for many real-life data. Thus, Crippa et al. [8] lifted this assumption and proposed a new measure Rb, defined as Rb = τ̂2/(Kv̂Re) = (1/K) Σk τ̂2/(sk2 + τ̂2), to assess the contribution of the between-study variance τ2 to vRe (i.e., the variance of the pooled random-effects estimate θ̂Re). It can be viewed as an average of the study-specific proportions of the study-specific variances due to between-study heterogeneity. They showed that the quantity τ2/vRe underlying Rb is a strictly increasing function of τ2 and is scale-invariant. However, this quantity depends on K and so is not size-invariant. They further showed that RI ≥ max(Rb, I2). When σk2 ≡ σ2 and σ2 is estimated by s2, Rb, RI, and I2 all yield the same quantity τ̂2/(τ̂2 + s2). The authors conducted a simulation study to examine the performance of RI, I2 and Rb. Both RI and I2 tend to be positively biased, and this overestimation increases as K increases. Confidence intervals based on RI and I2 give lower coverage probabilities compared to those based on Rb, and the difference becomes more obvious when the within-study variances vary more and when the heterogeneity level increases.
To reduce the impact of outlying studies, Lin et al. [31] proposed four new robust measures, analogues of H2 and I2 based on Qr and Qm, which have the same interpretations as H2 and I2, respectively. These measures were developed upon the weighted absolute deviation statistics Qr and Qm rather than the usual squared deviation statistic Q, as defined in Table 1, and will be described in more detail in Section 4.
All the measures except for CVB depend on the precision of the study-specific effects. As the sample sizes of the component studies increase, the within-study variances σk2 decrease to zero, so that RI, Rb and all the I2’s would increase to 1, and all the H2’s and R2 would become arbitrarily large, even when there is little between-study heterogeneity. The measure CVB avoids this drawback but has its own limitation: it approaches +∞ as θ goes to 0. Finally, we mention that some of the measures involve the estimated value τ̂2. In principle, τ̂2 can be any estimator of τ2, but most software uses the DL estimator as the default choice.
4. Estimators
We summarize 23 estimators for τ2 in Table 2, most of which can be applied to all kinds of effect measures; the exceptions are the improved Paule and Mandel estimator (IPM, [2]) and the Malzahn, Böhning, and Holling estimator (MBH, [32]). IPM is specifically designed to work with the OR for binary outcomes, and MBH can only be used for the standardized mean difference (SMD). All estimators can be divided into five groups: method of moments, likelihood-based, model error variance (least squares), Bayesian, and other nonparametric estimators. Some have closed-form expressions while the others require numerical solutions. Some produce only positive estimates while the others require truncation to zero when a negative value occurs. Some properties of the estimators are summarized in Table 2.
Table 3 shows previous studies that reviewed and compared (large) subsets of these estimators. Recommendations were made either based on their own simulations or conclusions from the literature. Among them, Veroniki et al. [46], Langan et al. [28] and Petropoulou and Mavridis [37] are the most comprehensive. Veroniki et al. [46] reviewed 17 estimators as listed in Table 3, including all the method of moments estimators except for the IPM, multistep DL and LCH estimators, all three likelihood-based estimators, the SJ estimator, all the Bayesian estimators, and DLb. Langan et al. [28] and Petropoulou and Mavridis [37] added IPM, MBH, and SJHO into the comparison. Note that IPM was briefly summarized but not compared with other estimators in Veroniki et al. [46]. Also, EB mentioned in [37] has been shown to be equivalent to PM. Langan et al. [28] also added RB estimators with different priors, RBu and RBa.
Table 3:
Existing comparative studies for various estimators of the between-study variance τ2.
| Review paper | Estimators compared | Effect measure | Recommendations |
|---|---|---|---|
| Viechtbauer [47] | HO, DL, HS, ML, REML | SMD and MD | REML |
| Sidik and Jonkman [40] | HO, DL, SJ, SJHO, ML, REML, EB | OR | SJHO when τ2 is expected to be small or moderate; SJ when τ2 is expected to be large. |
| Kontopantelis et al. [27] | HO, HO2, DL, DL2, DLb, DLp, SJ, SJHO, ML, RB, RBp | Generic | DLb |
| Veroniki et al. [46] | HO, HO2, DL, DL2, DLp, DLb, PM, HM, HS, ML, REML, AREML, SJ, RB, RBp, FB, BM | Generic | PM |
| Langan et al. [28] | Estimators in Veroniki et al. [46] except for FB plus IPM, SJHO, RBu, RBa, MBH | RR, OR, SMD, MD and Generic | PM |
| Petropoulou and Mavridis [37] | Estimators in Langan et al. [28] except for RBu, RBa | OR and MD | DLb and DLp |
| Langan et al. [29] | DL, HO, PM, PMHO, PMDL, HM, SJ, SJHO, REML | OR and Generic | REML, PM and PMDL for continuous outcomes and non-rare binary events |
Two newly proposed estimators, the LCH estimators [31] and the multistep DL estimator DLM [44], are included in our pool. We mark them in bold in Table 2 and provide a brief description of each below. The IPM estimator [2] is described as well because it is the only method specifically designed for rare binary events. More details about the other estimators can be found in [46] and references therein.
Lin, Chu and Hodges (LCH)
Lin et al. [31] proposed two alternative estimators of τ2 that are designed to be less affected by outliers than conventional estimators based on the Q statistic in (1). For the purpose of robustness, they are based on Qr and Qm, which replace the squared deviations in (1) by weighted absolute differences between the study-specific treatment effects and the overall treatment effect.
Here, Qr uses θ̂Fe, the fixed-effect estimate of θ defined in Section 2, while Qm uses θ̂m, the weighted median estimator defined as the solution in θ to a weighted estimating equation involving the indicator function I(·). The two LCH estimators, based on Qr and Qm respectively, can be derived in the same spirit as the moment estimators, by equating the observed Qr and Qm to their corresponding expected values.
Multistep DL
We first introduce the generalized method of moments (GMM) estimator of τ2 based on the Q statistic in (1). DerSimonian and Kacker [9] showed that if the weights wk are treated as known constants, the expected value of Q is

E(Q) = Σk wk σk2 − (Σk wk2 σk2)/(Σk wk) + [Σk wk − (Σk wk2)/(Σk wk)] τ2. (2)

Equating Q to its expected value, replacing σk2 by sk2 in (2), solving for τ2, and truncating any negative solution to zero gives

τ̂2 = max{0, [Q − Σk wk sk2 + (Σk wk2 sk2)/(Σk wk)] / [Σk wk − (Σk wk2)/(Σk wk)]}. (3)

The DL estimator [10] is a special case of (3), with wk = 1/sk2 and Q = QFe.

As discussed in Section 2, the inverse-variance weighting scheme yields wk = 1/(sk2 + τ̂2) when calculating the (generalized) Q statistic (1) under the Re model. Recall that the original DL estimator can be obtained by specifying wk = 1/sk2 in (3), which is equivalent to setting τ̂2 = 0 in the Re weights. The two-step DL method [9] first obtains the DL estimate and then plugs it into the Re weights, wk = 1/(sk2 + τ̂DL2), to obtain the two-step estimate DL2 from (3).

van Aert and Jackson [44] proposed the multistep DL estimator as a natural extension of the two-step DL estimator. The M-step DL estimator DLM is obtained recursively by computing the estimates DL1, DL2, …, DLM from (3), each step plugging the previous estimate into the Re weights. It has been shown that the limit of the multistep DL estimator, DL∞, when it exists, is equivalent to the PM estimator. As further suggested by the authors, divergence problems seldom happen in practice and convergence is usually achieved quickly.
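A compact base R sketch of the moment estimator in (3) and of the DL/two-step/multistep recursion follows. It reuses the hypothetical y and s2 from the Section 2 sketch, and the choice of 50 steps to approximate DL∞ is arbitrary.

```r
# Moment estimator of tau2 from equation (3) for a given weight vector,
# and the multistep DL recursion that plugs the previous estimate into
# the Re weights 1/(sk2 + tau2).
tau2_mm <- function(y, s2, w) {
  theta_hat <- sum(w * y) / sum(w)
  Q   <- sum(w * (y - theta_hat)^2)
  num <- Q - (sum(w * s2) - sum(w^2 * s2) / sum(w))
  den <- sum(w) - sum(w^2) / sum(w)
  max(0, num / den)
}

dl_multistep <- function(y, s2, steps = 1) {
  tau2 <- tau2_mm(y, s2, w = 1 / s2)   # step 1: the usual DL estimator
  if (steps > 1) for (m in 2:steps) tau2 <- tau2_mm(y, s2, w = 1 / (s2 + tau2))
  tau2
}

dl_multistep(y, s2, steps = 1)    # DL
dl_multistep(y, s2, steps = 2)    # two-step DL
dl_multistep(y, s2, steps = 50)   # approaches the PM estimator when it converges
```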
Improved Paule and Mandel (IPM)
For meta-analysis of rare binary events, Bhaumik et al. [2] adopted a standard binomial-normal random-effects model (labeled BNBA), which can be specified by

xki ~ Binomial(nki, pki), logit(pk0) = μk, logit(pk1) = μk + θk, θk ~ N(θ, τ2), i = 0, 1, k = 1, …, K.

They proposed a simple average estimator, ȳa, for the overall treatment effect θ and then developed the IPM estimator for τ2 based on ȳa and the iterative PM method. The treatment effect θk (measured by the log-odds ratio) in study k is estimated with a correction factor a added to each cell count, namely, yka = log[(xk1 + a)/(nk1 − xk1 + a)] − log[(xk0 + a)/(nk0 − xk0 + a)]. The simple average estimator for θ is then given by ȳa = (1/K) Σk yka. The authors further proved that a should be 0.5 in order for ȳa to be the least biased in large samples. They noticed that the PM estimator for τ2 depends on the estimated within-study variances and proposed to improve PM by borrowing strength from all component studies when estimating each within-study variance.
Denoting the corresponding weights by wk*, the IPM estimate of τ2 is obtained by solving Q − (K − 1) = 0 iteratively, with the weights wk* used in the calculation of Q.
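The Paule–Mandel step that IPM builds on can be sketched as a simple root-finding problem in base R. Note that the borrowed-strength within-study variance estimates of [2] are not reproduced here, so this is the ordinary PM scheme with the usual sk2, not IPM itself; the search interval passed to uniroot() is an arbitrary choice.

```r
# Paule-Mandel-type root finding: solve Q(tau2) = K - 1 in tau2, where Q uses
# the Re weights 1/(sk2 + tau2).  Reuses the hypothetical y, s2 from Section 2.
pm_tau2 <- function(y, s2, upper = 100) {
  f <- function(tau2) {
    w <- 1 / (s2 + tau2)
    theta_hat <- sum(w * y) / sum(w)
    sum(w * (y - theta_hat)^2) - (length(y) - 1)
  }
  if (f(0) <= 0) return(0)     # no positive root: truncate the estimate at 0
  # f() is decreasing in tau2, so 'upper' must be large enough to bracket the root
  uniroot(f, lower = 0, upper = upper)$root
}

pm_tau2(y, s2)
```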
5. Confidence intervals
Table 4 reports 16 existing methods for constructing CIs for τ2 in terms of key features, including whether the algorithm for computing a CI is iterative, whether truncation for non-negativity is needed, which distribution is used for the construction, and whether the CI is exact under the Re model. All the methods are general-purpose and so can be applied to meta-analysis of binary events, except for the generalized variable approach [43], which is specifically designed for the mean difference (MD) metric based on normally distributed outcomes. Some of the CIs are obtained via a test-inversion process based on different statistics for testing hypotheses of the form H0: τ2 = τ02.
Table 4:
CI methods for τ2 in random-effects meta-analysis.
| Method | Abbreviation | Iterative? (Y/N) | Truncation to 0? (Y/N) | Distribution Used | Exact Method for Re? (Y/N) | Reference |
|---|---|---|---|---|---|---|
| CIs based on (modified) Q statistics ||||||
| Q-Profile | QP | Y | Y | χ2(K − 1) | Y | [15, 25] |
| Modified Q-Profile | MQP | Y | Y | χ2(K − 1) | N | [15, 25] |
| Biggerstaff and Tweedie | BT | Y | Y | Ga(r, λ) | N | [3] |
| Biggerstaff and Jackson | BJ | Y | Y | A positive linear combination of χ2(1) variables | Y | [4] |
| Jackson | J | Y | Y | A positive linear combination of χ2(1) variables | Y | [21] |
| Approximate Jackson | AJ | N | Y | Normal | N | [23] |
| Unequal-tail Q-profile | UTQ | Y | Y | χ2(K − 1) | Y | [22] |
| Profile likelihood CIs ||||||
| PL based on ML estimation | PLML | Y | Y | χ2(1) | N | [14] |
| PL based on REML estimation | PLREML | Y | Y | χ2(1) | N | [48] |
| Wald CIs ||||||
| Wald based on ML estimation | WML | N | Y | N(0, 1) | N | [3, 49] |
| Wald based on REML estimation | WREML | N | Y | N(0, 1) | N | [49] |
| Others ||||||
| Sidik and Jonkman | SJ | N | N | χ2(K − 1) | N | [39] |
| Sidik and Jonkman with HO prior | SJHO | N | N | χ2(K − 1) | N | [40] |
| Bayesian credible intervals | — | Y | N | — | N | [46] |
| Bootstrap | BSP/BSNP | Y | Y | — | N | [11, 27] |
| Generalized variable approach | GV | Y | Y | — | N | [43] |
In Table 5, we list existing review papers on constructing confidence intervals for τ2. Clearly, none of these reviews is comprehensive.
5.1. Confidence intervals based on (modified) Q statistics
Q-profile and modified Q-profile CIs
Knapp et al. [25] and Viechtbauer [48] considered Q-profile CIs based on the generalized Q statistic in (1) with weights wk = 1/(sk2 + τ2), denoted by Q(τ2), which depends on τ2 and treats the sk2’s as known constants. It can be shown that Q(τ2) follows a χ2(K − 1) distribution under the Re model for any τ2. It follows that P{χ2α/2(K − 1) ≤ Q(τ2) ≤ χ21−α/2(K − 1)} = 1 − α. Based on the test-inversion principle, a 100(1 − α)% confidence interval for τ2 can be obtained as the interval (τ̂L2, τ̂U2) satisfying Q(τ̂L2) = χ21−α/2(K − 1) and Q(τ̂U2) = χ2α/2(K − 1). Since τ2 is non-negative, τ̂L2 is truncated to 0 if Q(0) < χ21−α/2(K − 1) (meaning that the solution for τ̂L2 is negative); and the CI is set to [0, 0] (or {0}, the set containing only zero) if Q(0) < χ2α/2(K − 1) (meaning that the solution for τ̂U2 is also negative). This type of CI is referred to as the Q-profile (QP) CI, as we profile Q(τ2) over different τ2 values when solving the above equations for τ̂L2 and τ̂U2 iteratively.
Knapp et al. [25] took into account the fact that the sk2’s are only estimates and so carry error variability, and constructed CIs using a modified statistic that replaces the weights in Q(τ2) with regularized variants so as to achieve a closer approximation to the χ2(K − 1) distribution, where the regularization factor rk is derived through a moment-matching approach based on approximating the distribution of the estimated within-study variances by a scaled χ2 distribution [15]. The lower bound is obtained by profiling this modified statistic, while the upper bound is still obtained by profiling Q(τ2) as before. We refer to this type of CIs as the modified Q-profile (MQP) CIs.
Like the Q-profile CIs, the MQP CIs need left truncation to zero if the lower bound turns out to be negative, and they are set to {0} if the upper bound is also negative. The same rule applies to all other types of CIs based on (modified) Q statistics in Section 5.1, as discussed below.
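A minimal base R implementation of the Q-profile interval, including the truncation rules above, can be written with uniroot(). The search range upper = 100, the helper name qp_ci, and the reuse of the hypothetical y and s2 from Section 2 are illustrative choices; the tail probabilities alpha1 and alpha2 are parameterized so that unequal-tail variants (Section 5.1) can use the same sketch.

```r
# Q-profile CI: invert Q(tau2) against chi-square quantiles, with truncation at 0.
qp_ci <- function(y, s2, alpha1 = 0.025, alpha2 = 0.025, upper = 100) {
  K <- length(y)
  Qfun <- function(tau2) {
    w <- 1 / (s2 + tau2)
    theta_hat <- sum(w * y) / sum(w)
    sum(w * (y - theta_hat)^2)
  }
  lo_q <- qchisq(1 - alpha1, df = K - 1)   # quantile defining the lower bound
  hi_q <- qchisq(alpha2,     df = K - 1)   # quantile defining the upper bound
  lower <- if (Qfun(0) <= lo_q) 0 else
    uniroot(function(t) Qfun(t) - lo_q, c(0, upper))$root
  upper_b <- if (Qfun(0) <= hi_q) 0 else
    uniroot(function(t) Qfun(t) - hi_q, c(0, upper))$root
  c(lower = lower, upper = upper_b)
}

qp_ci(y, s2)   # equal-tail 95% Q-profile CI
```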
BT and BJ CIs based on Cochran’s Q statistic
Biggerstaff and Tweedie [3] proposed to approximate the distribution of Cochran’s Q statistic QFe by a gamma distribution with shape parameter r(τ2) ≡ E2(QFe)/Var(QFe) and scale parameter λ(τ2) ≡ Var(QFe)/E(QFe). The mean and variance of QFe under the Re model are given by E(QFe) = (K − 1) + (S1 − S2/S1)τ2 and Var(QFe) = 2(K − 1) + 4(S1 − S2/S1)τ2 + 2[S2 − 2S3/S1 + (S2/S1)2]τ4, where Sr = Σk wkr with wk = 1/sk2. CIs for τ2 can be obtained similarly based on this gamma approximation instead of the above profiling approach, and we refer to them as the BT intervals.
Biggerstaff and Jackson [4] derived the exact CDF of QFe under the Re model, denoted by FQ(q; τ2), as a positive linear combination of χ2(1) random variables, whose cumulative distribution function can be evaluated using Farebrother’s algorithm [12] via the CompQuadForm package in R. They then obtained (τ̂L2, τ̂U2) by numerically solving the two equations FQ(cτ̃DL2 + K − 1; τ̂U2) = α/2 and FQ(cτ̃DL2 + K − 1; τ̂L2) = 1 − α/2, where c = S1 − S2/S1 and τ̃DL2 is the untruncated version of the DL estimator of τ2, so that cτ̃DL2 + K − 1 equals the observed value of QFe. This type of CI is referred to as the BJ interval.
Jackson and approximate Jackson CIs
Following the numerical approach in [4], Jackson [21] proposed CIs by test inversion based on the generalized Q statistic in (1), which is also distributed as a positive linear combination of χ2(1) random variables under the Re model. Jackson et al. [23] further proposed applying an arcsinh transformation to the untruncated moment estimator of τ2 for variance stabilization, and then constructed CIs for τ2 based on a normal approximation. These types of CIs are referred to as the Jackson (J) and approximate Jackson (AJ) CIs, respectively. Based on simulation, Jackson further commented that weighting component studies by the reciprocals of their within-study standard errors (i.e. sk), rather than by those of their variances (i.e. sk2) as the convention dictates, appears to provide a sensible and viable option when there is little a priori knowledge about the extent of heterogeneity.
Unequal-tail Q profile CIs
Jackson and Bowden [22] advocated using unequal tail probabilities to obtain shorter intervals whenever such methods are justifiable. For example, when constructing a 100(1 − α)% unequal-tail Q-profile (UTQ) confidence interval, the lower and upper bounds, τ̂L2 and τ̂U2, are obtained by solving Q(τ̂L2) = χ21−α1(K − 1) and Q(τ̂U2) = χ2α2(K − 1), respectively, where α2 > α1 and α1 + α2 = α. They further suggested using a pre-specified α-split with α1 = 0.01 and α2 = 0.04 for a 95% CI, which was shown to retain the nominal coverage and reduce the width under the Re model. Obviously, the idea of unequal tails can be applied to all kinds of confidence intervals. In our numerical evaluation, we examine the performance of the Q-profile CIs with α1 = 0.01 and α2 = 0.04 as a representative.
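Under the tail assignment reconstructed in the preceding paragraph, the unequal-tail interval follows from the same qp_ci() sketch of Section 5.1 by changing the two tail probabilities:

```r
qp_ci(y, s2, alpha1 = 0.01, alpha2 = 0.04)   # 95% UTQ interval (reconstructed tail split)
```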
5.2. Profile likelihood confidence intervals
Under the Re model, Hardy and Thompson [14] proposed profile likelihood CIs based on maximum likelihood (ML) estimation, referred to as PLML. The profile log-likelihood for τ2 takes into account the fact that θ is also unknown and must be estimated; it is given by pl(τ2) = l(θ̂(τ2), τ2), where the log-likelihood function of (θ, τ2) is

l(θ, τ2) = −(1/2) Σk {log[2π(sk2 + τ2)] + (yk − θ)2/(sk2 + τ2)},

and, given the value of τ2, the ML estimator of θ is

θ̂(τ2) = [Σk yk/(sk2 + τ2)] / [Σk 1/(sk2 + τ2)].

Then a 100(1 − α)% CI for τ2 is given by the set of τ2 values satisfying pl(τ2) ≥ pl(τ̂ML2) − χ21−α(1)/2, where τ̂ML2 is the ML estimate of τ2.
Viechtbauer [48] proposed constructing profile likelihood CIs based on restricted maximum likelihood (REML) estimation, referred to as PLREML. The 100(1 − α)% CI for τ2 is given by the set of τ2 values satisfying lR(τ2) ≥ lR(τ̂REML2) − χ21−α(1)/2, where the restricted log-likelihood function of τ2 is

lR(τ2) = −(1/2) Σk log(sk2 + τ2) − (1/2) log[Σk 1/(sk2 + τ2)] − (1/2) Σk [yk − θ̂(τ2)]2/(sk2 + τ2) + constant,

and τ̂REML2 is the REML estimate of τ2 (obtained by maximizing lR). Viechtbauer [48] found that the REML-based CIs were slightly more accurate than the ML-based CIs in terms of coverage probability, especially for small K.
Because the ML and REML estimates of τ2 are constrained to be non-negative, the lower bounds of the profile likelihood (PL) intervals are always non-negative and the upper bounds are strictly positive, after applying the same truncation rule as for the Q-profile CIs.
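The ML-based profile likelihood CI can be sketched in base R by evaluating the profile log-likelihood above on a grid. The grid range and resolution are arbitrary, and the hypothetical y and s2 from the Section 2 sketch are reused.

```r
# Profile-likelihood CI for tau2 based on ML estimation (PLML), as a sketch:
# profile out theta, then keep the tau2 values whose profile log-likelihood
# lies within qchisq(0.95, 1)/2 of its maximum.
pl_ml_ci <- function(y, s2, level = 0.95, grid = seq(0, 5, by = 0.001)) {
  pl <- function(tau2) {
    w <- 1 / (s2 + tau2)
    theta_hat <- sum(w * y) / sum(w)
    -0.5 * sum(log(2 * pi * (s2 + tau2)) + (y - theta_hat)^2 / (s2 + tau2))
  }
  pl_vals <- sapply(grid, pl)
  cutoff  <- max(pl_vals) - qchisq(level, df = 1) / 2
  keep    <- grid[pl_vals >= cutoff]
  c(lower = min(keep), upper = max(keep))
}

pl_ml_ci(y, s2)
```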
5.3. Wald confidence intervals
The Wald test statistics for testing H0: τ2 = τ02 under the Re model have the form (τ̂2 − τ02)/SE(τ̂2), where τ̂2 can be τ̂ML2 or τ̂REML2, and the standard error SE(τ̂2) is estimated from the inverse Fisher information. For ML estimation it is given by

SE(τ̂ML2) = [2/Σk wk2]1/2, with wk = 1/(sk2 + τ̂ML2),

and the analogous expression based on the restricted likelihood is used for REML estimation. We label the Wald statistics based on ML and REML estimation by WML and WREML, respectively. The corresponding 100(1 − α)% Wald (W) CI for τ2 can be easily obtained as τ̂ML2 ± z1−α/2 SE(τ̂ML2) or τ̂REML2 ± z1−α/2 SE(τ̂REML2) [3, 48], where zα is the 100α-th percentile of the standard normal distribution. Negative lower bounds of the Wald CIs should be truncated to 0 since both ML and REML estimates of τ2 are constrained to be non-negative.
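A sketch of the ML-based Wald interval follows, using the inverse expected Fisher information 2/Σk wk2 as the variance of τ̂ML2 (our reconstruction above); the optimization interval passed to optimize() and the reuse of the hypothetical y and s2 are arbitrary illustrative choices.

```r
# Wald CI based on ML estimation (WML): tau2_ML +/- z * sqrt(2 / sum(w^2)),
# with a negative lower bound truncated to zero.
wald_ml_ci <- function(y, s2, level = 0.95) {
  pl <- function(tau2) {                        # profile log-likelihood in tau2
    w <- 1 / (s2 + tau2)
    theta_hat <- sum(w * y) / sum(w)
    -0.5 * sum(log(2 * pi * (s2 + tau2)) + (y - theta_hat)^2 / (s2 + tau2))
  }
  tau2_ml <- optimize(pl, interval = c(0, 10), maximum = TRUE)$maximum
  w  <- 1 / (s2 + tau2_ml)
  se <- sqrt(2 / sum(w^2))                      # inverse expected Fisher information
  z  <- qnorm(1 - (1 - level) / 2)
  c(lower = max(0, tau2_ml - z * se), upper = tau2_ml + z * se)
}

wald_ml_ci(y, s2)
```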
5.4. Other confidence intervals
Sidik and Jonkman (SJ) CIs
Sidik and Jonkman [39] proposed confidence intervals based on the SJ estimator of τ2, which is derived from the weighted residual sum of squares in the framework of a linear regression model. Let the crude estimate τ̂02 = Σk (yk − ȳ)2/K, where ȳ is the unweighted mean of the yk’s, be an a priori value for τ2. Then the SJ estimator is given by τ̂SJ2 = Σk vk (yk − θ̂v)2/(K − 1), where vk = (sk2/τ̂02 + 1)−1 and θ̂v = Σk vk yk/Σk vk. It follows that (K − 1)τ̂SJ2/τ2 has an asymptotic χ2(K − 1) distribution. Thus an approximate 100(1 − α)% confidence interval can be calculated by

( (K − 1)τ̂SJ2/χ21−α/2(K − 1), (K − 1)τ̂SJ2/χ2α/2(K − 1) ).

Since τ̂SJ2 is always positive, the SJ confidence intervals have positive lower and upper bounds. Sidik and Jonkman [40] later proposed an improved estimator by using the HO estimate as the a priori value, and improved confidence intervals can be constructed correspondingly.
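The SJ point estimator and interval, as reconstructed above, reduce to a few lines of base R; the hypothetical y and s2 from the Section 2 sketch are reused.

```r
# Sidik-Jonkman estimator and CI: crude prior tau0^2 = mean((y - mean(y))^2),
# weights v_k = 1/(s_k^2/tau0^2 + 1), and a chi-square interval for
# (K - 1) * tau2_SJ / tau2.
sj_ci <- function(y, s2, level = 0.95) {
  K    <- length(y)
  tau0 <- mean((y - mean(y))^2)          # a priori (crude) value for tau2
  v    <- 1 / (s2 / tau0 + 1)
  theta_v <- sum(v * y) / sum(v)
  tau2_sj <- sum(v * (y - theta_v)^2) / (K - 1)
  c(estimate = tau2_sj,
    lower = (K - 1) * tau2_sj / qchisq(1 - (1 - level) / 2, K - 1),
    upper = (K - 1) * tau2_sj / qchisq((1 - level) / 2, K - 1))
}

sj_ci(y, s2)
```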
Bayesian credible intervals
Bayesian credible (BC) intervals can be obtained when a Bayesian approach is employed and posterior samples are drawn from the (joint) posterior distribution of all parameters involved using an MCMC algorithm. The lower and upper points of a 100(1 − α)% CI can be the 100(α/2)th and 100(1 − α/2)th percentiles of the posterior sample of τ2’s, or determined by the region that gives the highest posterior density. Such intervals may be heavily affected by the prior selection when the number of studies K is small.
Bootstrap CIs
Bootstrap techniques can be used to obtain confidence intervals for nearly all τ2 estimators. For the nonparametric bootstrap (denoted by BSNP), we sample K studies with replacement from the observed set of studies B times to get B bootstrap samples. For the parametric bootstrap (denoted by BSP), we first obtain the parameter estimates and then generate B samples from the assumed distributions with these estimates. For each (parametric or nonparametric) sample, we calculate the corresponding estimate of τ2. Then the 100(α/2)th and 100(1 − α/2)th percentiles of the B estimates of τ2 are, respectively, the lower and upper bounds of a 100(1 − α)% bootstrap confidence interval. In our numerical experiment, we only perform the nonparametric bootstrap procedure for the DL estimator for illustration.
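A nonparametric bootstrap interval for the DL estimator can be sketched by resampling studies with replacement and reusing dl_multistep() from the Section 4 sketch; B = 2000 resamples and the seed are arbitrary choices.

```r
# Nonparametric bootstrap CI for the DL estimator (BSNP): resample studies
# with replacement, recompute the DL estimate, and take percentile bounds.
bsnp_ci <- function(y, s2, B = 2000, level = 0.95) {
  boot <- replicate(B, {
    idx <- sample(seq_along(y), replace = TRUE)
    dl_multistep(y[idx], s2[idx], steps = 1)   # DL estimate on the resample
  })
  quantile(boot, c((1 - level) / 2, 1 - (1 - level) / 2))
}

set.seed(1)
bsnp_ci(y, s2)
```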
The generalized variable (GV) approach
For meta-analysis of normally distributed outcomes, Tian [43] proposed inference procedures based on a generalized pivotal quantity for τ2. A pivotal quantity is a function of observations and parameters such that its distribution does not depend on the parameters, including nuisance parameters. Let σki2 be the population variance of the control (treatment) group in study k, and let ski2 be the corresponding sample variance. For normally distributed outcomes, it is well known that (nki − 1)ski2/σki2 ~ χ2(nki − 1) for k = 1, …, K and i = 0, 1. Denote Q in (1) with weights wk = (τ2 + σk02/nk0 + σk12/nk1)−1 by Q(τ2), which follows χ2(K − 1) and is a monotonically decreasing function of τ2. Thus, given a real number η ≥ 0, there exists at most one τη2 ≥ 0 such that Q(τη2) = η. Based on this, Tian [43] defined the generalized pivotal quantity for τ2 as τη2 if η ≤ Q(0) and 0 otherwise. Given the observed treatment effects yk and sample variances ski2, the distribution of this quantity does not depend on any nuisance parameters. A series of its values can be obtained by first simulating η ~ χ2(K − 1) and Uki ~ χ2(nki − 1), setting σki2 = (nki − 1)ski2/Uki in Q(τ2) for k = 1, …, K and i = 0, 1, and then solving Q(τ2) = η for τ2. A 100(1 − α)% confidence interval is given by (τ̂L2, τ̂U2), where the lower and upper bounds are the 100(α/2)th and 100(1 − α/2)th percentiles of the generated values.
6. Simulation focusing on rare binary events
For meta-analysis of rare binary events, Li and Wang [30] conducted a comprehensive simulation study to compare the performance of various estimators of the overall treatment effect θ measured by the log-odds ratio, where a flexible binomial-normal model was used to accommodate treatment groups with unequal variability. This model, labeled BNLW, specifies the event probabilities by

xki ~ Binomial(nki, pki), logit(pk0) = μk − ωθk, logit(pk1) = μk + (1 − ω)θk, i = 0, 1, k = 1, …, K,

where μk ~ N(μ, σ2), θk ~ N(θ, τ2), μk ⊥ θk, and ω is a constant in [0, 1]. The random-effects model BNBA in [2] is a special case of BNLW with ω = 0. Further, when ω = 1/2, it reduces to the model in [41], which assumes the equality of the variances of logit(pk0) and logit(pk1).
In this section, we adopt the same model and simulation setup as in [30] to examine the performance of the various methods. Results are summarized in Sections 6.1 and 6.2 for estimating the between-study variance τ2 of the log-odds ratios θk. Here, bias and MSE are reported for point estimation, and the actual coverage probability and width of confidence intervals are reported for interval estimation. To be specific, we set the number of studies K to 10, 20 and 50 to reflect different sizes of meta-analysis. We generate the number of events xki from Binomial(nki, pki) for k = 1, …, K and i = 0, 1. The numbers of subjects in the control group, nk0, are generated from Uniform[2000, 3000] to examine large-sample performance and from Uniform[20, 1000] to examine small-sample performance, and then rounded to the nearest integers. To allow varying allocation ratios across studies, the within-study sample sizes are set to follow the relationship nk1 = Rknk0, where the study-specific ratios Rk are determined by a target value R ∈ {1, 2, 4}. For small sample sizes, as noted in [30], the range [20, 1000] is chosen so that the empirical means of the estimated within-study variances in all the settings are below one, while it still allows for cases where most component studies have small sample sizes but a few can have sample sizes close to 1000. To generate the pki’s, we fix σ2 at 0.5, and set τ2 ∈ {0, 0.25, 0.5, 0.75, 1} for evaluating different estimators and τ2 ∈ {0, 0.1, 0.2, ⋯, 0.9, 1} for evaluating different types of CIs. We further set θ ∈ {−1, 0, 1} to reflect different directions of the overall treatment effect, set μ ∈ {−2.5, −5} to represent low and very low incidence rates of the binary event (i.e., 0.076 and 0.0067 on the probability scale), and set ω ∈ {0, 0.5, 1} to represent smaller/equal/larger variability in the control group compared to the treatment group. For each setting, 1000 datasets are simulated, and empirical values of the performance measures are computed by averaging over these datasets.
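For concreteness, the sketch below generates one simulated meta-analysis dataset under the BNLW-type parameterization reconstructed above (logit(pk0) = μk − ωθk, logit(pk1) = μk + (1 − ω)θk). The fixed allocation ratio R and the other defaults are simplifications, and the exact design in [30] may differ.

```r
# Generate one simulated meta-analysis under the BNLW-type model sketched above.
# The logit parameterization is our reconstruction (see text); the allocation
# ratio is held fixed at R for simplicity, whereas [30] lets it vary by study.
expit <- function(x) 1 / (1 + exp(-x))
sim_meta <- function(K = 20, mu = -5, sigma2 = 0.5, theta = 0, tau2 = 0.5,
                     omega = 0, R = 1, n_range = c(20, 1000)) {
  n0 <- round(runif(K, n_range[1], n_range[2]))
  n1 <- round(R * n0)
  mu_k    <- rnorm(K, mu, sqrt(sigma2))
  theta_k <- rnorm(K, theta, sqrt(tau2))
  p0 <- expit(mu_k - omega * theta_k)
  p1 <- expit(mu_k + (1 - omega) * theta_k)
  data.frame(x0 = rbinom(K, n0, p0), n0 = n0,
             x1 = rbinom(K, n1, p1), n1 = n1)
}

set.seed(1)
head(sim_meta())
```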
6.1. Comparison of different heterogeneity estimators
We compare all the methods listed in Table 2 except for FB and MBH. Since the fully Bayesian method can be greatly affected by the prior choice and other factors (such as convergence), we exclude FB from our simulation. The MBH method is designed specifically for the standardized mean difference and is thus not suitable for binary events. In addition, the empirical Bayes method EB is equivalent to PM, and the multistep DL method has the property that DL∞ converges to PM. Therefore, we include PM in the comparison and leave EB and DLM out. We use heat maps to visualize the bias and MSE results, where the rows of each map represent different methods and the columns represent different τ2 values in [0, 1].
Large-sample results
Figure 1 presents the bias and MSE results of different estimators for μ = −2.5 and μ = −5 based on large-sample settings with R = 1, K = 50, θ = 0, and w = 0. As shown in Figure 1(a), as the event of interest becomes rarer, all methods seem to produce more bias when estimating τ2. Almost all methods underestimate the between-study heterogeneity when τ2 > 0. The RBp estimator, however, consistently overestimates τ2 when the event is very rare (μ = −5). As τ2 increases, most estimators produce more bias except for BM and RBp; the bias from BM first increases then decreases, and the bias from RBp decreases for very rare events (μ = −5). When the events are not that rare (μ = −2.5), most estimators have similarly low bias except for the one-step DL estimators (DL, DLp, DLb), HM, HS, and BM. However, IPM stands out with the lowest bias when the incidence rate becomes very low, especially when τ2 ≥ 0.5. The HS, HM, BM and one-step DL family methods remain the worst and should be avoided in terms of bias. All three likelihood-based methods, ML, REML and AREML, produce similar results with a moderate level of bias. In terms of MSE, most methods have similar performance except for HM and BM, which are the most inefficient according to Figure 1(b). Those with relatively large magnitude of bias tend to have relatively large MSE.
Figure 1:

Large-sample performance of different τ2 estimators based on settings with R = 1, K = 50, θ = 0, and w = 0.
We next discuss the potential impacts of R, K, θ, and w on the estimation performance for the large-sample case. Figures S1 and S2 in the Supplementary Material (SM) show the bias and MSE results for different R and K values, respectively, based on settings with μ = −2.5, θ = 0 and w = 0. We can see that when τ2 < 0.5, regardless of R and K, all the methods perform somewhat similarly and have both bias and MSE close to zero except for BM which has much larger bias. As K increases, MSE decreases significantly for every estimator when τ2 ≥ 0.5 but bias for a few estimators seems not to get closer to zero (e.g., DL for τ2 = 1, BM for τ2 = 0.5, and 0.75). However, the heat maps show very similar color patterns both vertically and horizontally, indicating that the impact of R and K on the relative performance of these methods is merely marginal. Figures S3 and S4 in the SM show the bias and MSE results for different θ and w values, respectively, based on settings with R = 1, K = 50 and μ = −5. When θ = −1, bias decreases as w increases while this trend reverses when θ = 1. This effect of w is minimal when there is no treatment effect (θ = 0). Similar trends are observed but less obvious for MSE. Also, we find that IPM maintains the best performance in terms of both bias and MSE while DL, DLp, DLb, HS, HM, and BM are among the worst in nearly all the settings considered.
Small-sample results
Figure 2 presents the bias and MSE results of different estimators for μ = −2.5 and μ = −5 based on small-sample settings with R = 1, K = 50, θ = 0, and w = 0. From Figure 2(a), we can see that when τ2 > 0, the underestimation observed in the large-sample results for all the estimators but RBp is much more severe for small samples, where the magnitude of bias increases substantially for very rare events (μ = −5). Note that RBp consistently overestimates τ2 for both μ = −2.5 and μ = −5, and unlike most other estimators, the bias decreases as τ2 increases. When events are not that rare (μ = −2.5), IPM is still the least biased. However, for very rare events (μ = −5), SJ becomes the least biased estimator for τ2 ≥ 0.5. The problem of SJ is that it significantly overestimates τ2 when there is no or little heterogeneity, due to its positive nature. From Figure 2(b) we can see that MSE does not change much when μ = −2.5 but dramatically increases when μ = −5 compared to results from large samples. For very rare events (μ = −5), SJ is the most efficient method except for τ2 = 0 and IPM seems to be the second best in terms of MSE. Note that when τ2 = 1, RBp has smaller MSE than IPM for very rare events, but it does not perform as well as IPM for smaller τ2 values.
Figure 2:

Small-sample performance of different τ2 estimators based on settings with R = 1, K = 50, θ = 0, and w = 0.
The impacts of R, K, θ, and w on the estimation bias and MSE for the small-sample case are shown in Figures S5–S8 of the SM. Since several methods (e.g., the likelihood-based methods) failed in some small-sample settings for very rare events (μ = −5), we show results for μ = −2.5 in these figures. Although the effect of K on MSE becomes more significant for small samples (i.e., MSE decreases more as K increases), it is still the case that both R and K have little impact on the relative performance of different methods. Also, similar trends for both bias and MSE occur when w and θ change as in the large-sample case. For these μ = −2.5 settings, IPM seems to be the best estimator due to its consistent top-level performance across various settings. This also agrees with the results in the left panels of Figure 2. On the other hand, DL, DLp, DLb, HM, HS, and BM should be used with caution due to their generally large bias.
6.2. Comparison of different types of CIs
Among those summarized in Table 4, we compared 14 different types of 95% CIs for the heterogeneity parameter τ2 in Figures 3 and 4, excluding Bayesian credible intervals and the GV method as before. As mentioned in Section 5, BSNP represents the nonparametric bootstrap procedure combined with the DL estimator and UTQ represents the unequal-tail Q-profile CI with α1 = 0.01 and α2 = 0.04. Again, from our (unreported) simulation results, we find that the influences of R, θ, and w on the empirical coverage probability are marginal.
Figure 3:

Actual coverage probabilities of different types of 95% CIs for different K values based on large-sample settings with R = 1, μ = −5, θ = 0, and w = 0.
Figure 4:

Actual coverage probabilities of different types of 95% CIs for both large- and small-sample cases and different μ values based on settings with R = 1, K = 20, θ = 0, and w = 0.
Figure 3 shows actual coverage probabilities of different types of CIs for different K values based on large-sample settings with R = 1, μ = −5, θ = 0, and w = 0. When there is no between-study heterogeneity (τ2 = 0), all the methods provide 100% coverage except for SJ and SJHO, which produce strictly positive intervals and so have zero coverage. When τ2 is small, as K increases, the methods based on (modified) Q statistics gain some improvement in coverage except for AJ, which achieves relatively high coverage for all K and τ2 values. As τ2 gets larger, most methods do not improve their coverage by increasing K.
Figure 4 presents actual coverage probabilities of different types of CIs for both large- and small-sample cases and different μ values based on settings with R = 1, K = 20, θ = 0, and w = 0. When μ = −2.5, most methods have actual coverage close to the nominal level 0.95. Among all, the nonparametric bootstrap CI has the lowest coverage, followed by the two Wald CIs when τ2 > 0. The influence of sample sizes is not obvious except for J, SJ and SJHO that improve their coverage for large sample sizes when τ2 is small. For very rare events (μ = −5), the impact of sample sizes is much more severe and some of the CIs (e.g., SJHO, J, UTQ) do not even achieve 50% coverage in most small-sample settings. In the large-sample settings, PLML, PLREML, and AJ maintain the nominal 95% coverage quite well at all positive levels of τ2. As the sample sizes become small, all methods fail to do so for very rare events when τ2 ≥ 0.3. Still, PLML and PLREML, and AJ are among those with the highest coverage. We also find that when τ2 ≥ 0.4, SJ joins the top-performing group with the following order SJ ≈ PLREML > PLML > AJ. This matches with the estimation results reported in Section 6.1 that for very rare events coupled with small samples, the SJ estimator is the least biased and has the smallest MSE when τ2 ≥ 0.5. In such situations, the Q statistic-based CIs have generally low coverage and thus should be avoided; meanwhile the Wald and nonparametric bootstrap CIs have moderate coverage instead of being the worst in the other three cases.
Figure 5 shows width curves of different types of CIs under the same settings as Figure 4, where for all CIs the width increases as τ2 increases. The influence of sample sizes on the CI width is only obvious when μ = −5, where all the CIs become narrower when sample sizes decrease. Though counterintuitive, a closer examination reveals that when events are very rare and sample sizes are small, many simulation iterations produce confidence intervals equal to the single point {0}, which makes the average width smaller. In the first three situations (either μ = −2.5 or large samples), BT and BJ produce the widest intervals, and the PL and AJ intervals, which offer higher coverage than most other methods, have moderate widths. Unsurprisingly, the nonparametric bootstrap procedure produces the narrowest CIs. In the last situation (very rare events coupled with small samples), the PL and AJ intervals are among the widest. Here, CIs with shorter widths are not necessarily desirable as they may reflect more {0} intervals due to sparsity. SJ produces intervals with moderate widths while also providing higher coverage when τ2 is large. Overall, we recommend the PL and AJ intervals in meta-analysis of rare binary events for their high coverage. For very rare events with small samples, we recommend SJ intervals if at least a moderate level of heterogeneity is known to exist. Besides, the AJ and SJ intervals are much easier to obtain than the PL intervals.
Figure 5:

Width curves of different types of 95% CIs for both large- and small-sample cases and different μ values based on settings with R = 1, K = 20, θ = 0, and w = 0.
7. Example: Type 2 diabetes mellitus after gestational diabetes
Women with gestational diabetes are believed to have a higher chance of developing type 2 diabetes. Bellamy et al. [1] performed a comprehensive systematic review and meta-analysis to assess the strength of this association. They selected 20 cohort studies that included 675,455 women with/without gestational diabetes and 10,859 type 2 diabetes events, drawn from 205 reports published between Jan 1, 1960, and Jan 31, 2009, in Embase and Medline (see Table S.1 of the SM). We reanalyzed the data focusing on inference about the heterogeneity parameter τ2. We note that the overall event rate is ~1.61% and many studies have very small sample sizes with zero event counts, so this data example fits the scenario of very rare events coupled with small sample sizes. Recall that in this scenario, SJ is the least biased and most efficient estimator when there exists a moderate or large level of heterogeneity, and IPM is the second best but tends to underestimate τ2.
Point estimates for the heterogeneity parameter τ2 and the corresponding inverse-variance weighted estimates for the overall treatment effect θ (measured by the log-odds ratio) are summarized in Table 6. Most methods give an estimate between 0.4 and 0.7 for τ2; the estimate from IPM is 0.563 and that from SJ is 0.679. This suggests a moderate to high level of heterogeneity, especially after accounting for the underestimation from IPM. The RBp method, which has been shown to severely overestimate τ2 for very rare events, not surprisingly gives the largest estimate, 1.162. On the other hand, the HS estimate is much smaller than the others. The resulting estimated odds ratios do not vary as much, except for the one from RBp. Table 7 shows the confidence intervals from all the compared methods. BT gives a very large upper bound, which seems odd. All CIs except those from the BT, BJ, and Wald methods exclude zero, among which SJ yields the shortest interval, with the largest lower bound and an upper bound in line with those from the PL and AJ methods. Recall that SJ tends to produce the best interval, with higher coverage and relatively shorter width, when there exists at least a moderate level of heterogeneity, as reported in Section 6.2. In this example, among the top-performing methods (PL, AJ and SJ), we therefore lean toward reporting the SJ interval. Based on the estimation and inference results above, we believe that these studies are heterogeneous.
Table 6:
Data example of gestational diabetes meta-analysis: estimates for τ2 and θ from different methods
| Estimator | HO | HO2 | DL | DL2 | DLp | DLb | PM | IPM | HM | HS |
|---|---|---|---|---|---|---|---|---|---|---|
| τ̂2 | 0.220 | 0.418 | 0.466 | 0.411 | 0.466 | 0.265 | 0.413 | 0.563 | 0.419 | 0.046 |
| θ̂ | 2.093 | 2.136 | 2.146 | 2.135 | 2.146 | 2.104 | 2.135 | 2.162 | 2.137 | 2.092 |
| OR | 8.112 | 8.469 | 8.547 | 8.457 | 8.547 | 8.197 | 8.461 | 8.691 | 8.470 | 8.099 |
| Estimator | LCHmean | LCHmedian | ML | REML | AREML | SJ | SJHO | RB0 | RBp | BM |
| τ̂2 | 0.519 | 0.298 | 0.396 | 0.449 | 0.433 | 0.679 | 0.290 | 0.198 | 1.162 | 0.195 |
| θ̂ | 2.155 | 2.111 | 2.132 | 2.142 | 2.139 | 2.180 | 2.110 | 2.088 | 2.235 | 2.088 |
| OR | 8.626 | 8.260 | 8.432 | 8.520 | 8.493 | 8.846 | 8.245 | 8.072 | 9.345 | 8.067 |
Table 7:
Data example of gestational diabetes meta-analysis: confidence intervals for τ2 from different methods.
| Method | CI |
|---|---|
| QP | (0.109, 1.603) |
| MQP | (0.106, 1.603) |
| UTQ | (0.083, 1.403) |
| BT | [0, 8.610) |
| BJ | [0, 2.660) |
| J | (0.048, 1.540) |
| AJ | (0.004, 1.396) |
| SJ | (0.393, 1.449) |
| SJHO | (0.168, 0.620) |
| BSNP | (0.012, 0.670) |
| PLML | (0.113, 1.285) |
| PLREML | (0.129, 1.458) |
| WML | [0, 0.841) |
| WREML | [0, 0.966) |
8. Discussion and recommendations
Based on our comprehensive simulation studies for large-sample meta-analysis of rare binary events, we recommend the IPM method for estimating the heterogeneity parameter τ2 if reducing estimation bias is of high priority, especially when the events are extremely rare. Most of the methods do not differ much in terms of MSE. We suggest avoiding HM, HS and BM since they have relatively large bias and MSE compared with other estimators. The most widely used DL estimator and its one-step variants DLp and DLb do not perform satisfactorily and hence should also be avoided. For small-sample meta-analysis of rare events, IPM is still recommended, and SJ performs much better than the other estimators in terms of both bias and MSE when τ2 ≥ 0.5 and the events are extremely rare. In terms of interval estimation, we recommend the profile likelihood methods (PLML and PLREML) and the approximate Jackson method AJ in general situations. Among the three, PLREML usually produces higher coverage but with wider intervals. The SJ method is a good candidate when events are extremely rare, sample sizes are small, and τ2 ≥ 0.4. We did not examine the performance of Bayesian methods because of the computational burden, the difficulty of monitoring convergence, and the potential sensitivity to prior choices. However, Bayesian hierarchical modeling can be a good alternative, especially when meaningful prior information is available.
We notice that most estimators for τ2 are negatively biased in our simulation, an interesting phenomenon also observed in other simulation studies with binary outcomes [26, 39, 40, 2]. In simulation studies with continuous outcomes [27], most of the estimators show positive bias when τ2 is small (< 0.1), and the magnitude of the bias of RBp is much larger than that of the other estimators; for larger τ2 values, the HS and ML estimators are negatively biased and the magnitude increases as τ2 increases [47]. Viechtbauer [47] provides some analytical results for the bias of the HO, DL, HS, ML, and REML estimators. Most of these results were derived under the homogeneous within-study variance assumption (σk2 ≡ σ2). Under this assumption, the bias due to truncation is always positive for DL, HO and REML at all levels of heterogeneity and is negative for HS and ML when τ2 ≥ 0.5. However, we believe that in the rare-events context, it is the sparsity (caused by zero counts) and the lack of resolution in estimating the within-study variances that cause the large magnitude of underestimation for many methods. This underestimation is much reduced by the IPM estimator, where the within-study variance estimates are improved by pooling information from all the studies.
Finally, we should mention that, when synthesizing information from multiple studies to obtain more reliable conclusions, one should not rely on a single point estimate or p-value (especially those produced by the default methods in software packages) without considering the rich selection of statistical tools offered in the literature. Each of the models and methods reviewed above has its own limitations. In practice, all sources of evidence should be combined and evaluated together with the specific characteristics of the component studies included in the meta-analysis.
References
- [1]. Bellamy L, Casas J-P, Hingorani AD, and Williams D (2009). Type 2 diabetes mellitus after gestational diabetes: a systematic review and meta-analysis. The Lancet, 373(9677):1773–1779.
- [2]. Bhaumik DK, Amatya A, Normand S-LT, Greenhouse J, Kaizar E, Neelon B, and Gibbons RD (2012). Meta-analysis of rare binary adverse event data. Journal of the American Statistical Association, 107(498):555–567.
- [3]. Biggerstaff B and Tweedie R (1997). Incorporating variability in estimates of heterogeneity in the random effects model in meta-analysis. Statistics in Medicine, 16(7):753–768.
- [4]. Biggerstaff BJ and Jackson D (2008). The exact distribution of Cochran's heterogeneity statistic in one-way random effects meta-analysis. Statistics in Medicine, 27(29):6093–6110.
- [5]. Chung Y, Rabe-Hesketh S, and Choi I-H (2013a). Avoiding zero between-study variance estimates in random-effects meta-analysis. Statistics in Medicine, 32(23):4071–4089.
- [6]. Chung Y, Rabe-Hesketh S, Dorie V, Gelman A, and Liu J (2013b). A non-degenerate penalized likelihood estimator for variance parameters in multilevel models. Psychometrika, 78(4):685–709.
- [7]. Cochran WG (1954). The combination of estimates from different experiments. Biometrics, 10(1):101–129.
- [8]. Crippa A, Khudyakov P, Wang M, Orsini N, and Spiegelman D (2016). A new measure of between-studies heterogeneity in meta-analysis. Statistics in Medicine, 35(21):3661–3675.
- [9]. DerSimonian R and Kacker R (2007). Random-effects model for meta-analysis of clinical trials: an update. Contemporary Clinical Trials, 28(2):105–114.
- [10]. DerSimonian R and Laird N (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3):177–188.
- [11]. Efron B and Tibshirani R (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, pages 54–75.
- [12]. Farebrother R (1984). Algorithm AS 204: the distribution of a positive linear combination of χ2 random variables. Journal of the Royal Statistical Society. Series C (Applied Statistics), 33(3):332–339.
- [13]. Gart JJ (1966). Alternative analyses of contingency tables. Journal of the Royal Statistical Society. Series B (Methodological), pages 164–179.
- [14]. Hardy RJ and Thompson SG (1996). A likelihood approach to meta-analysis with random effects. Statistics in Medicine, 15(6):619–629.
- [15]. Hartung J and Knapp G (2005). On confidence intervals for the among-group variance in the one-way random effects model with unequal error variances. Journal of Statistical Planning and Inference, 127(1–2):157–177.
- [16]. Hartung J and Makambi K (2002). Positive estimation of the between-study variance in meta-analysis: theory and methods. South African Statistical Journal, 36(1):55–76.
- [17]. Hedges LV and Olkin I (2014). Statistical Methods for Meta-Analysis. Academic Press.
- [18]. Higgins J and Thompson SG (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21(11):1539–1558.
- [19]. Higgins JP and Green S (2011). Cochrane Handbook for Systematic Reviews of Interventions, volume 4. John Wiley & Sons.
- [20]. Hunter JE and Schmidt FL (2004). Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Sage.
- [21]. Jackson D (2013). Confidence intervals for the between-study variance in random effects meta-analysis using generalised Cochran heterogeneity statistics. Research Synthesis Methods, 4(3):220–229.
- [22]. Jackson D and Bowden J (2016). Confidence intervals for the between-study variance in random-effects meta-analysis using generalised heterogeneity statistics: should we use unequal tails? BMC Medical Research Methodology, 16(1):118.
- [23]. Jackson D, Bowden J, and Baker R (2015). Approximate confidence intervals for moment-based estimators of the between-study variance in random effects meta-analysis. Research Synthesis Methods, 6(4):372–382.
- [24]. Jackson D, White IR, and Riley RD (2012). Quantifying the impact of between-study heterogeneity in multivariate meta-analyses. Statistics in Medicine, 31(29):3805–3820.
- [25]. Knapp G, Biggerstaff BJ, and Hartung J (2006). Assessing the amount of heterogeneity in random-effects meta-analysis. Biometrical Journal, 48(2):271–285.
- [26]. Knapp G and Hartung J (2003). Improved tests for a random effects meta-regression with a single covariate. Statistics in Medicine, 22(17):2693–2710.
- [27]. Kontopantelis E, Springate DA, and Reeves D (2013). A re-analysis of the Cochrane library data: the dangers of unobserved heterogeneity in meta-analyses. PLoS ONE, 8(7):e69930.
- [28]. Langan D, Higgins J, and Simmonds M (2017). Comparative performance of heterogeneity variance estimators in meta-analysis: a review of simulation studies. Research Synthesis Methods, 8(2):181–198.
- [29]. Langan D, Higgins JP, Jackson D, Bowden J, Veroniki AA, Kontopantelis E, Viechtbauer W, and Simmonds M (2019). A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Research Synthesis Methods, 10(1):83–98.
- [30]. Li L and Wang X (2019). Meta-analysis of rare binary events in treatment groups with unequal variability. Statistical Methods in Medical Research, 28(1):263–274.
- [31]. Lin L, Chu H, and Hodges JS (2017). Alternative measures of between-study heterogeneity in meta-analysis: Reducing the impact of outlying studies. Biometrics, 73(1):156–166.
- [32]. Malzahn U, Böhning D, and Holling H (2000). Nonparametric estimation of heterogeneity variance for the standardised difference used in meta-analysis. Biometrika, 87(3):619–632.
- [33]. Morris CN (1983). Parametric empirical Bayes inference: theory and applications. Journal of the American Statistical Association, 78(381):47–55.
- [34]. Novianti PW, Roes KC, and van der Tweel I (2014). Estimation of between-trial variance in sequential meta-analyses: a simulation study. Contemporary Clinical Trials, 37(1):129–138.
- [35]. Panityakul T, Bumrungsup C, and Knapp G (2013). On estimating residual heterogeneity in random-effects meta-regression: a comparative study. Journal of Statistical Theory and Applications, 12(3):253.
- [36]. Paule RC and Mandel J (1982). Consensus values and weighting factors. Journal of Research of the National Bureau of Standards, 87(5):377–385.
- [37]. Petropoulou M and Mavridis D (2017). A comparison of 20 heterogeneity variance estimators in statistical synthesis of results from studies: a simulation study. Statistics in Medicine, 36(27):4266–4280.
- [38]. Rukhin AL (2013). Estimating heterogeneity variance in meta-analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):451–469.
- [39]. Sidik K and Jonkman JN (2005). Simple heterogeneity variance estimation for meta-analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(2):367–384.
- [40]. Sidik K and Jonkman JN (2007). A comparison of heterogeneity variance estimators in combining results of studies. Statistics in Medicine, 26(9):1964–1981.
- [41]. Smith TC, Spiegelhalter DJ, and Thomas A (1995). Bayesian approaches to random-effects meta-analysis: A comparative study. Statistics in Medicine, 14(24):2685–2699.
- [42]. Takkouche B, Cadarso-Suarez C, and Spiegelman D (1999). Evaluation of old and new tests of heterogeneity in epidemiologic meta-analysis. American Journal of Epidemiology, 150(2):206–215.
- [43]. Tian L (2008). Inferences about the between-study variance in meta-analysis with normally distributed outcomes. Biometrical Journal, 50(2):248–256.
- [44]. van Aert RC and Jackson D (2018). Multistep estimators of the between-study variance: The relationship with the Paule–Mandel estimator. Statistics in Medicine, 37(17):2616–2629.
- [45]. van Aert RC, van Assen MA, and Viechtbauer W (2019). Statistical properties of methods based on the Q-statistic for constructing a confidence interval for the between-study variance in meta-analysis. Research Synthesis Methods, 10(2):225–239.
- [46]. Veroniki AA, Jackson D, Viechtbauer W, Bender R, Bowden J, Knapp G, Kuss O, Higgins J, Langan D, and Salanti G (2016). Methods to estimate the between-study variance and its uncertainty in meta-analysis. Research Synthesis Methods, 7(1):55–79.
- [47]. Viechtbauer W (2005). Bias and efficiency of meta-analytic variance estimators in the random-effects model. Journal of Educational and Behavioral Statistics, 30(3):261–293.
- [48]. Viechtbauer W (2007a). Confidence intervals for the amount of heterogeneity in meta-analysis. Statistics in Medicine, 26(1):37–52.
- [49]. Viechtbauer W (2007b). Hypothesis tests for population heterogeneity in meta-analysis. British Journal of Mathematical and Statistical Psychology, 60(1):29–60.
