Author manuscript; available in PMC: 2021 Feb 23.
Published in final edited form as: Stat Interface. 2020;13(4):449–464. doi: 10.4310/sii.2020.v13.n4.a3

Statistical Methods for Quantifying Between-study Heterogeneity in Meta-analysis with Focus on Rare Binary Events

Chiyu Zhang 1, Min Chen 2, Xinlei Wang 1,*
PMCID: PMC7901832  NIHMSID: NIHMS1671199  PMID: 33628357

Abstract

Meta-analysis, the statistical procedure for combining results from multiple independent studies, has been widely used in medical research to evaluate intervention efficacy and drug safety. In many practical situations, treatment effects vary notably among the collected studies, and the variation, often modeled by the between-study variance parameter τ², can greatly affect the inference of the overall effect size. In the past, comparative studies have been conducted for both point and interval estimation of τ². However, most are incomplete, covering only a limited subset of existing methods, and some are outdated. Further, none of these studies covers descriptive measures for assessing the level of heterogeneity, nor do they focus on rare binary events, which require special attention. We summarize what is by far the most comprehensive set of methods, including 11 descriptive measures, 23 estimators, and 16 confidence intervals. In addition to providing synthesized information, we further categorize these methods according to their key features. We then evaluate their performance in simulation studies that examine various realistic scenarios for rare binary events, with an illustration using data from a gestational diabetes meta-analysis. We conclude that there is no uniformly “best” method. However, methods with consistently better performance do exist in the context of rare binary events, and we provide practical guidelines based on the numerical evidence.

Keywords: bias, confidence interval, coverage probability, DerSimonian and Laird, fixed effect, odds ratio, mean squared error, Q statistic, random effects

1. Introduction

Meta-analysis, the statistical procedure for synthesizing information from multiple studies, has been widely used in many research areas, including the social, psychological, and especially medical sciences. Meta-analysis is a powerful tool in drug safety evaluation, where the number of cases (adverse events) can be very limited in a single study. The U.S. Food and Drug Administration (FDA) released a draft guidance for industry titled “Meta-Analyses of Randomized Controlled Clinical Trials to Evaluate the Safety of Human Drugs or Biological Products” in November 2018, which demonstrates the importance of meta-analysis in the development of new drugs. Such meta-analyses often involve binary outcomes of rare events, which are the focus of this study.

The primary goal of a meta-analysis is usually to estimate and infer the overall effect size, where the variability in the effect estimates from component studies should be properly accounted for. Besides the within-study sampling errors, this variability may come from diverse characteristics of the individual studies, such as disparities in trial protocols, subjects’ conditions, and population features. When such study-wise differences exist, we call the studies (statistically) heterogeneous, and the heterogeneity is typically measured by a between-study variance parameter τ². In addition, descriptive measures have been widely used by clinicians to provide a more intuitive interpretation of the heterogeneity.

For point estimation of τ2, the DerSimonian and Laird (DL) estimator [10], most widely used in the field, has been frequently challenged for its default use in many software packages, largely due to its sizable negative bias when the heterogeneity level is high [39, 47, 2, 35, 34]. Many modifications over the DL estimator have been suggested based on the method of moments. Other approaches such as likelihood-based and other nonparametric methods can also be applied. For interval estimation of τ2, different types of confidence intervals (CIs) have been constructed to gauge the estimation uncertainty. However, nearly all these methods were constructed without a special consideration of dichotomous data and their performance remains unclear in the context of rare binary events, in which some may produce large bias or even fail to work.

Comparative studies and review papers exist for both point and interval estimation of τ², but not for descriptive measures. For example, Veroniki et al. [46], Langan et al. [28], and Petropoulou and Mavridis [37] reviewed and compared most of the existing estimators of τ², among which only Petropoulou and Mavridis [37] conducted simulation studies to evaluate their performance. Previous comparisons of CIs (e.g., [48, 25, 45]) were largely limited to several similar types of CIs. As detailed in Tables 2 and 5, none of these papers covers descriptive measures for quantifying the level of heterogeneity, nor do they focus on rare binary events. Moreover, most of them are far from complete, and some are outdated, which motivates us to conduct this study to provide useful guidance to clinicians and biostatisticians.

Table 2:

Overview of 23 estimators for the between-study variance τ2

| Estimator | Abbreviation | Reference | Iterative? | Sign | Effect measure |
|---|---|---|---|---|---|
| *Method of moments* | | | | | |
| Hedges and Olkin | HO | [17] | No | ≥ 0 | |
| Two-step Hedges and Olkin | HO2 | DerSimonian and Kacker [9] | No | ≥ 0 | |
| DerSimonian and Laird | DL | DerSimonian and Laird [10] | No | ≥ 0 | |
| Positive DerSimonian and Laird | DLp | Kontopantelis et al. [27] | No | > 0 | |
| Two-step DerSimonian and Laird | DL2 | DerSimonian and Kacker [9] | No | ≥ 0 | |
| **Multistep DerSimonian and Laird** | **DLM** | van Aert and Jackson [44] | No | ≥ 0 | |
| Paule and Mandel | PM | Paule and Mandel [36] | Yes | ≥ 0 | |
| Improved Paule and Mandel | IPM | Bhaumik et al. [2] | Yes | ≥ 0 | OR |
| Hartung and Makambi | HM | Hartung and Makambi [16] | No | > 0 | |
| Hunter and Schmidt | HS | Hunter and Schmidt [20] | No | ≥ 0 | |
| **Lin, Chu and Hodges** | **LCH** | Lin et al. [31] | No | ≥ 0 | |
| *Likelihood-based* | | | | | |
| Maximum likelihood | ML | Hardy and Thompson [14] | Yes | ≥ 0 | |
| Restricted maximum likelihood | REML | Viechtbauer [47] | Yes | ≥ 0 | |
| Approximate restricted maximum likelihood | AREML | Morris [33] | Yes | ≥ 0 | |
| *Model error variance (least squares)* | | | | | |
| Sidik and Jonkman | SJ | Sidik and Jonkman [39] | No | > 0 | |
| Sidik and Jonkman (HO prior) | SJHO | Sidik and Jonkman [40] | No | > 0 | |
| *Bayesian* | | | | | |
| Rukhin Bayes | RB0 | Rukhin [38] | Yes | ≥ 0 | |
| Positive Rukhin Bayes | RBp | Rukhin [38] | Yes | > 0 | |
| Empirical Bayes (equivalent to PM) | EB | Morris [33] | Yes | ≥ 0 | |
| Fully Bayes | FB | Smith et al. [41] | Yes | > 0 | |
| Bayes modal | BM | Chung et al. [6, 5] | Yes | > 0 | |
| *Other nonparametric* | | | | | |
| Malzahn, Böhning, and Holling | MBH | Malzahn et al. [32] | No | ≥ 0 | SMD |
| Nonparametric bootstrap DerSimonian and Laird | DLb | Kontopantelis et al. [27] | No | ≥ 0 | |

Table 5:

Existing comparative studies on constructing CIs for τ2 in random-effects meta-analysis

| Review paper | CI methods reviewed/compared | Effect measure | Recommendations |
|---|---|---|---|
| Knapp et al. [25] | QP, MQP, BT, PLML, WML | MD/OR | QP and MQP |
| Viechtbauer [48] | QP, BT, PL, W, SJ, BS | OR | QP |
| Veroniki et al. [46] | PL, W, BT, BJ, J, QP, SJ, BS, BC | Generic | |
| van Aert et al. [45] | QP, BJ, J | OR | None recommended when p_ki < 0.1 in combination with either K ≥ 80 or (K ≥ 40 and n_ki < 30) |

The paper is organized as follows. In Section 2, we introduce notation and frequently used terms in meta-analysis. Section 3 reviews existing descriptive measures quantifying the level of heterogeneity. In Section 4, we list estimators for τ2 and briefly summarize two recently developed ones that are not included in any of the existing review papers. In Section 5, different types of confidence intervals for τ2 are described and categorized. In Section 6, we compare the performance, in terms of bias and mean squared error (MSE) for point estimators and empirical coverage probability and width for CIs, in a large collection of scenarios that are designed to mimic practical situations. In Section 7, we re-analyze the data from a meta-analysis [1] of 20 trials of type 2 diabetes mellitus after gestational diabetes with focus on the heterogeneity among the component studies. The final section provides recommendations in terms of choosing appropriate estimators and CIs in meta-analysis of rare binary events as well as a brief discussion.

2. Notation & frequently used terms

Suppose a meta-analysis includes K independent studies and the kth study contains n_k subjects (k = 1, …, K). In study k, let θ_k be the true but unknown treatment effect and y_k be the observed treatment effect, such that E[y_k | θ_k] = θ_k and Var[y_k | θ_k] = σ_k², the within-study variance. Typically s_k², an estimate of σ_k², is reported along with y_k in published studies, and it is often treated as a known quantity in practice (i.e., indistinguishable from σ_k²). When the study-specific effects θ_k are treated as random variables rather than constants, we assume E[θ_k] = θ and Var[θ_k] = τ², where θ, the parameter of main interest in the meta-analysis, represents the overall treatment effect across different studies, and τ² measures the between-study heterogeneity. There exist two main parametric models, namely Re (random effects) and Fe (fixed effect), for combining results from component studies. The Re model assumes that y_k = θ_k + ϵ_k, where θ_k ~ N(θ, τ²) and ϵ_k ~ N(0, σ_k²). When τ² = 0, it reduces to the Fe model y_k = θ + ϵ_k, where a common treatment effect θ is assumed for all component studies (i.e., θ_k ≡ θ). These models can be used with any effect measure, as long as the assumed normality is (approximately) valid.

For binary responses, we denote the number of events by x_k0 (x_k1) and the number of subjects by n_k0 (n_k1) in the control (treatment) group. The probability of having an event in the control (treatment) group is denoted by p_k0 (p_k1). Effect measures for binary outcomes include the risk difference (RD, p_k1 − p_k0), risk ratio (RR, p_k1/p_k0) and odds ratio (OR, [p_k1/(1 − p_k1)]/[p_k0/(1 − p_k0)]). For rare binary events, RR ≈ OR. A logarithm transformation of the odds ratio (LOR) is often used in meta-analysis for a much faster convergence to asymptotic normality, and the within-study variance σ_k² is then estimated by s_k² = 1/x_k0 + 1/(n_k0 − x_k0) + 1/x_k1 + 1/(n_k1 − x_k1). Gart [13] added a continuity correction factor of 0.5 to all the cells, so that

y_k = log[(x_k1 + 0.5)/(n_k1 − x_k1 + 0.5)] − log[(x_k0 + 0.5)/(n_k0 − x_k0 + 0.5)],

and σk2 is estimated by

s_k² = 1/(x_k0 + 0.5) + 1/(n_k0 − x_k0 + 0.5) + 1/(x_k1 + 0.5) + 1/(n_k1 − x_k1 + 0.5),

which will be used in our numerical evaluation of rare binary events.
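As a concrete illustration, Gart's correction can be applied per study as below (a minimal sketch; the function name and the toy counts are ours, not from the paper):

```python
import math

def lor_and_var(x0, n0, x1, n1, a=0.5):
    """Continuity-corrected log-odds ratio y_k and within-study variance s_k^2.

    Adds the correction `a` (0.5 by default, as in Gart) to every cell, so the
    estimate remains defined even when one arm of a study has zero events.
    """
    y = math.log((x1 + a) / (n1 - x1 + a)) - math.log((x0 + a) / (n0 - x0 + a))
    s2 = (1.0 / (x0 + a) + 1.0 / (n0 - x0 + a)
          + 1.0 / (x1 + a) + 1.0 / (n1 - x1 + a))
    return y, s2

# A rare-event study: 1/100 events in the control arm, 3/100 in the treatment arm.
y, s2 = lor_and_var(1, 100, 3, 100)
```

Note that for rare events the 1/(x_k + 0.5) terms dominate s_k², so studies with very few events carry little weight under inverse-variance weighting.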

Next, we introduce the (generalized) Q statistic [9] and related terms, which will frequently appear in the paper. For any parameter of interest, we use the corresponding letter/symbol with a hat to denote its estimate. For example, we use θ^ to denote the estimate of the overall treatment effect θ. The Q statistic is defined as the weighted sum of squared deviations between the estimated overall treatment effect and observed treatment effect in each individual study, namely

Q = Σ_{k=1}^K w_k (y_k − θ̂)²,  (1)

where w_k is a positive weight assigned to study k, and θ̂ = Σ_{k=1}^K w_k y_k / Σ_{k=1}^K w_k, the weighted average of the estimated study-specific effects. A commonly used weighting scheme is to set w_k = [V̂ar(y_k)]⁻¹, i.e., the inverse of the estimated variance of y_k. Under this inverse-variance weighting scheme, the variance of θ̂ is given by 1/Σ_{k=1}^K w_k if we treat the w_k as known constants (i.e., indistinguishable from [Var(y_k)]⁻¹). Further, this scheme yields w_k = 1/s_k² for the Fe model, and w_k = 1/(s_k² + τ̂²) for the Re model, where τ̂² can be any estimator discussed in Section 4. Under the Fe (Re) model with the inverse-variance weights, we denote the corresponding Q statistic by Q_Fe (Q_Re) and the corresponding θ̂ by θ̂_Fe (θ̂_Re) with variance v_Fe (v_Re). In fact, Cochran’s Q statistic is Q_Fe, also known as DerSimonian and Laird’s Q test statistic [10].
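For illustration, Q_Fe and θ̂_Fe under the Fe weighting scheme can be computed as follows (hypothetical helper and toy numbers, assuming y_k and s_k² are already on the log-odds-ratio scale):

```python
import numpy as np

def cochran_q(y, s2):
    """Cochran's Q (the Fe-model Q statistic) with inverse-variance weights."""
    y, s2 = np.asarray(y, float), np.asarray(s2, float)
    w = 1.0 / s2                                # w_k = 1/s_k^2 under the Fe model
    theta_hat = np.sum(w * y) / np.sum(w)       # weighted average of study effects
    q = np.sum(w * (y - theta_hat) ** 2)
    return q, theta_hat

q, theta = cochran_q([0.2, 0.5, -0.1], [0.04, 0.09, 0.06])
```

Under homogeneity (τ² = 0), Q_Fe is approximately χ² with K − 1 degrees of freedom, which is the basis of Cochran's Q-test mentioned in Section 3.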

Throughout this paper, we use χ²_df to denote a chi-squared distribution with df degrees of freedom, and χ²_{df,α} to denote its 100α-th percentile.

3. Descriptive measures quantifying between-study heterogeneity

As mentioned in the introduction, (statistical) heterogeneity exists when the true effects being evaluated differ among the studies in a meta-analysis. Assessing the extent of heterogeneity is essential for model selection between the Fe and Re models and for decision making. An obvious choice is to estimate the variance parameter τ², as is typically done in a random-effects meta-analysis. As pointed out by Higgins and Thompson [18], however, this measure does not facilitate comparison of heterogeneity across meta-analyses of different types of outcomes (e.g., the survival time can be either continuous or discrete). Also, its scale is specific to the chosen effect metric and its interpretation can be difficult. For example, the odds ratio is a commonly used effect measure for binary data, yet the variance of the log-odds ratio is not easy for many non-statisticians to understand. Alternatively, one may test for the existence of between-study heterogeneity (e.g., through Cochran’s Q-test [7]) and use the corresponding test statistic or p-value to indicate the extent of heterogeneity. However, such measures depend on the scale of the effect sizes or the number of component studies K. To overcome these limitations, effort has been devoted to the development of various descriptive measures that provide more intuitive information about the heterogeneity.

Table 1 summarizes 11 descriptive heterogeneity measures in the literature. Note that all these measures are general-purpose and none is specifically designed for binary outcomes. Takkouche et al. [42] proposed two measures, R_I and CV_B, to quantify the level of heterogeneity in five published meta-analyses. The statistic R_I was developed to estimate τ²/(τ² + σ²), the proportion of total variation in the effect estimates that is due to between-study heterogeneity. This quantity is also known as the intra-class correlation in the context of cluster sampling. Here, the within-study variances σ_k² are assumed to be constant, i.e., σ_k² ≡ σ², which is estimated by K/Σ_{k=1}^K (1/s_k²), so that R_I = τ̂²/[τ̂² + K/Σ_{k=1}^K (1/s_k²)]. The other statistic CV_B estimates the between-study coefficient of variation τ/|θ| by √τ̂²/|θ̂|. Obviously, CV_B is affected by the overall treatment effect θ and is undefined when θ = 0.
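The two measures can be sketched as follows (illustrative helper; here we plug in the Re-model inverse-variance weights for θ̂ as one reasonable choice, and the harmonic-mean-type estimate of σ² described above — the function name and toy values are ours):

```python
import numpy as np

def ri_and_cvb(y, s2, tau2_hat):
    """Takkouche et al.'s R_I and CV_B (between-study coefficient of variation).

    R_I assumes a common within-study variance sigma^2, estimated by
    K / sum(1/s_k^2); CV_B is undefined when the overall effect is zero."""
    y, s2 = np.asarray(y, float), np.asarray(s2, float)
    k = len(y)
    w = 1.0 / (s2 + tau2_hat)                 # Re-model inverse-variance weights
    theta_hat = np.sum(w * y) / np.sum(w)
    sigma2_hat = k / np.sum(1.0 / s2)         # common within-study variance estimate
    ri = tau2_hat / (tau2_hat + sigma2_hat)
    cvb = np.sqrt(tau2_hat) / abs(theta_hat)  # undefined when theta_hat == 0
    return ri, cvb

ri, cvb = ri_and_cvb([0.2, 0.5, -0.1], [0.04, 0.09, 0.06], 0.02)
```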

Table 1:

Descriptive measures quantifying the between-study heterogeneity

| Name | f(θ, τ², σ², K) | Formula | Ref. | Interpretation | Assumes σ_k² ≡ σ²? |
|---|---|---|---|---|---|
| R_I | τ²/(τ² + σ²) | τ̂²/[τ̂² + K/Σ_k (1/s_k²)] | [42] | Proportion of total variation in the estimates of treatment effect due to between-study heterogeneity | Yes |
| CV_B | τ/\|θ\| | √τ̂²/\|θ̂\| | [42] | Between-study coefficient of variation | No |
| H² | (τ² + σ²)/σ² | Q_Fe/(K − 1) | [18] | Relative excess in Q_Fe over its degrees of freedom | Yes, but can be used for different σ_k² |
| R² | (τ² + σ²)/σ² ≈ v_Re/v_Fe | Σ_k (1/s_k²) / Σ_k 1/(s_k² + τ̂²) | [18] | Inflation in the confidence interval for a single summary estimate under the Re model compared with the Fe model | Yes, but can be used for different σ_k² |
| I²_HT | τ²/(τ² + σ²) | 1 − (K − 1)/Q_Fe | [18] | Same as R_I | Yes |
| I²_R | τ²/(τ² + σ²) | 1 − Σ_k 1/(s_k² + τ̂²) / Σ_k (1/s_k²) | [24] | Same as R_I | Yes |
| R_b | τ²/(K·v_Re) = (1/K)Σ_k τ²/(σ_k² + τ²) | (1/K)Σ_k τ̂²/(s_k² + τ̂²) | [8] | Proportion of the between-study heterogeneity τ² relative to v_Re, the variance of θ̂_Re | No |
| H_r² | (τ² + σ²)/σ² | πQ_r²/[2K(K − 1)], Q_r = Σ_k (1/s_k)\|y_k − θ̂_Fe\| | [31] | Same as H² | Yes |
| I_r² | τ²/(τ² + σ²) | 1 − 2K(K − 1)/(πQ_r²) | [31] | Same as R_I | Yes |
| H_m² | (τ² + σ²)/σ² | πQ_m²/(2K²), Q_m = Σ_k (1/s_k)\|y_k − θ̂_m\|, θ̂_m the weighted median estimate | [31] | Same as H² | Yes |
| I_m² | τ²/(τ² + σ²) | 1 − 2K²/(πQ_m²) | [31] | Same as R_I | Yes |

Under the assumption of a common within-study variance σ², Higgins and Thompson [18] formulated a general heterogeneity measure as a function of the overall treatment effect θ, the between-study variance τ², the within-study variance σ², and the number of component studies, namely f(θ, τ², σ², K). They proposed three criteria that such a measure should satisfy in order to facilitate its comparability and interpretability: (i) dependence on the extent of heterogeneity; (ii) scale invariance, i.e., f(θ, τ², σ², K) = f(a + bθ, b²τ², b²σ², K) for any a and b; and (iii) size invariance, i.e., f(θ, τ², σ², K₁) = f(θ, τ², σ², K₂) for any positive integers K₁ and K₂. Criterion (i) implies that the function f should increase monotonically with τ². Criterion (ii) implies that f should be a function of the ratio ρ ≡ τ²/σ² and that θ should not be involved. Criterion (iii) implies that f does not depend on K. It can be shown that any monotonically increasing function of ρ satisfies the three criteria. Based on this, three statistics, H², R² and I², were proposed. The first, H², estimates the quantity ρ + 1 by equating the observed value of Q_Fe to its expectation, so that H² = Q_Fe/(K − 1) can be interpreted as the relative excess in Q_Fe over its expected value, the degrees of freedom K − 1. The second, R², also attempts to estimate ρ + 1; but here, ρ + 1 is approximated by v_Re/v_Fe, so that R² = v̂_Re/v̂_Fe = Σ_{k=1}^K (1/s_k²) / Σ_{k=1}^K 1/(s_k² + τ̂²), which can be interpreted as the inflation of the confidence interval for θ̂_Re under the Re model compared with that for θ̂_Fe under the Fe model. Both H² and R² should be at least 1, where 1 means perfect homogeneity; the larger the value, the more heterogeneous the studies. In practice, the authors suggested using H and R because clinicians may be more familiar with standard deviations than variances. The third statistic, I², estimates a different function of ρ, namely ρ/(1 + ρ) = τ²/(τ² + σ²), which represents the proportion of total variance that is due to between-study variation. Higgins and Thompson [18] suggested computing I² by I²_HT = 1 − (K − 1)/Q_Fe, which leads to the convenient relationship I²_HT = 1 − 1/H². Jackson et al. [24] suggested computing I² by I²_R = 1 − v̂_Fe/v̂_Re = 1 − Σ_{k=1}^K 1/(s_k² + τ̂²) / Σ_{k=1}^K (1/s_k²), which leads to another convenient relationship I²_R = 1 − 1/R². Both I²_HT and I²_R are usually expressed as percentages between 0% and 100%, where a value of 0% corresponds to no observed heterogeneity, while larger values indicate increasing levels of heterogeneity. They estimate the same quantity as R_I does, but with different within-study variance estimates. Among these measures (i.e., H², R², I²_HT or I²_R), I²_HT is the most popular; in the literature, I² typically refers to I²_HT, as I²_R is much less known. Higgins and Green [19] empirically provided a rough guide to the interpretation of I² using overlapping intervals: a value in [0, 0.4] suggests that heterogeneity may not be important; [0.3, 0.6] may represent moderate heterogeneity; [0.5, 0.9] may represent substantial heterogeneity; and [0.75, 1] implies considerable heterogeneity.
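A minimal sketch of how H², I²_HT and I²_R could be computed (the function name and example values are ours; any τ̂² estimator from Section 4 may be plugged in, and I²_HT is truncated at zero so that it stays in [0, 1]):

```python
import numpy as np

def heterogeneity_measures(y, s2, tau2_hat):
    """H^2, I^2_HT and I^2_R for given study effects y_k and variances s_k^2."""
    y, s2 = np.asarray(y, float), np.asarray(s2, float)
    k = len(y)
    w = 1.0 / s2
    theta_fe = np.sum(w * y) / np.sum(w)
    q_fe = np.sum(w * (y - theta_fe) ** 2)           # Cochran's Q
    h2 = q_fe / (k - 1)                              # relative excess over the df
    i2_ht = max(0.0, 1.0 - (k - 1) / q_fe)           # Higgins-Thompson I^2
    i2_r = 1.0 - np.sum(1.0 / (s2 + tau2_hat)) / np.sum(w)   # Jackson et al. I^2_R
    return h2, i2_ht, i2_r

h2, i2_ht, i2_r = heterogeneity_measures([0.2, 0.5, -0.1], [0.04, 0.09, 0.06], 0.02)
```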

The assumption of a constant within-study variance is unlikely to hold in many real-life datasets. Thus, Crippa et al. [8] lifted this assumption and proposed a new measure R_b, defined as R_b = (1/K)Σ_{k=1}^K τ̂²/(s_k² + τ̂²), to assess the contribution of the between-study variance τ² to v_Re (i.e., the variance of the pooled random-effects estimate θ̂_Re). It can be viewed as an average of the study-specific proportions of the study-specific variances due to between-study heterogeneity. They showed that the quantity τ²/v_Re underlying R_b is a strictly increasing function of τ² and is scale-invariant. However, this quantity depends on K and so is not size-invariant. They further showed that R_I ≥ max(R_b, I²_HT). When σ_k² ≡ σ² and σ² is estimated by s², R_b, R_I, I²_HT and I²_R all yield the same quantity τ̂²/(s² + τ̂²). The authors conducted a simulation study to examine the performance of R_I, I²_HT and R_b. Both R_I and I²_HT tend to be positively biased, and this overestimation increases as K increases. Confidence intervals based on R_I and I²_HT give lower coverage probabilities compared to those based on R_b, and the difference becomes more obvious when the within-study variances vary more and when the heterogeneity level increases.
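R_b itself is essentially a one-liner over the study-specific variance shares (sketch; name ours):

```python
import numpy as np

def rb_measure(s2, tau2_hat):
    """Crippa et al.'s R_b: average study-specific share of variance due to tau^2."""
    s2 = np.asarray(s2, float)
    return np.mean(tau2_hat / (s2 + tau2_hat))

rb = rb_measure([0.04, 0.09, 0.06], 0.02)
```

Because each summand lies in (0, 1), R_b is automatically bounded between 0 and 1 without truncation.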

To reduce the impact of outlying studies, Lin et al. [31] proposed new robust measures Hr2, Hm2, Ir2 and Im2, which are analogous to and have the same interpretations as H2 and I2, respectively. These methods were developed upon the absolute deviation measures Qr and Qm rather than the usual squared deviation measure Q, as defined in Table 1 and will be described in more detail in Section 4.

All the measures except CV_B depend on the precision of the study-specific effects. As the sample sizes of the component studies increase, the σ_k² decrease to zero, so that R_I, R_b and all the I² measures increase to 1, and all the H² measures and R² become arbitrarily large, even when there is little between-study heterogeneity. The measure CV_B avoids this drawback but has its own limitation: it approaches +∞ as θ goes to 0. Finally, we mention that some of the measures involve the estimated value τ̂². In principle, τ̂² can be any estimator of τ², but most software uses the DL estimator τ̂²_DL as the default choice.

4. Estimators

We summarize 23 estimators for τ² in Table 2, most of which can be applied to all kinds of effect measures; the exceptions are the improved Paule and Mandel estimator (IPM, [2]) and the Malzahn, Böhning, and Holling estimator (MBH, [32]). IPM is specifically designed to work with the OR for binary outcomes, and MBH can only be used for the standardized mean difference (SMD). The estimators can be divided into five groups: method of moments, likelihood-based, model error variance (least squares), Bayesian, and other nonparametric estimators. Some have closed-form expressions while others require numerical solutions. Some produce only positive estimates while others require truncation to zero when a negative value occurs. These and other properties of the estimators are summarized in Table 2.

Table 3 shows previous studies that reviewed and compared (large) subsets of these estimators. Recommendations were made either based on their own simulations or conclusions from the literature. Among them, Veroniki et al. [46], Langan et al. [28] and Petropoulou and Mavridis [37] are the most comprehensive. Veroniki et al. [46] reviewed 17 estimators as listed in Table 3, including all the method of moments estimators except for the IPM, multistep DL and LCH estimators, all three likelihood-based estimators, the SJ estimator, all the Bayesian estimators, and DLb. Langan et al. [28] and Petropoulou and Mavridis [37] added IPM, MBH, and SJHO into the comparison. Note that IPM was briefly summarized but not compared with other estimators in Veroniki et al. [46]. Also, EB mentioned in [37] has been shown to be equivalent to PM. Langan et al. [28] also added RB estimators with different priors, RBu and RBa.

Table 3:

Existing comparative studies for various estimators of the between-study variance τ2.

| Review paper | Estimators compared | Effect measure | Recommendations |
|---|---|---|---|
| Viechtbauer [47] | HO, DL, HS, ML, REML | SMD and MD | REML |
| Sidik and Jonkman [40] | HO, DL, SJ, SJHO, ML, REML, EB | OR | SJHO when τ² is expected to be small or moderate; SJ when τ² is expected to be large |
| Kontopantelis et al. [27] | HO, HO2, DL, DL2, DLb, DLp, SJ, SJHO, ML, RB, RBp | Generic | DLb |
| Veroniki et al. [46] | HO, HO2, DL, DL2, DLp, DLb, PM, HM, HS, ML, REML, AREML, SJ, RB, RBp, FB, BM | Generic | PM |
| Langan et al. [28] | Estimators in Veroniki et al. [46] except FB, plus IPM, SJHO, RBu, RBa, MBH | RR, OR, SMD, MD and Generic | PM |
| Petropoulou and Mavridis [37] | Estimators in Langan et al. [28] except RBu, RBa | OR and MD | DLb and DLp |
| Langan et al. [29] | DL, HO, PM, PMHO, PMDL, HM, SJ, SJHO, REML | OR and Generic | REML, PM and PMDL for continuous outcomes and non-rare binary events |

Two newly proposed estimators, the LCH estimators [31] and the multistep DL estimator DLM [44], are included in our pool. We mark them in bold in Table 2 and provide a brief description of each below. The IPM estimator [2] is described as well because it is the only method specifically designed for rare binary events. More details about the other estimators can be found in [46] and references therein.

Lin, Chu and Hodges (LCH)

Lin et al. [31] proposed two alternative estimators, τ̂_r² and τ̂_m², designed to be less affected by outliers than conventional estimators based on the Q statistic in (1). For robustness, they are based on Q_r and Q_m, defined as weighted sums of absolute differences between the study-specific treatment effects and the overall treatment effect, namely

Q_r = Σ_{k=1}^K (1/s_k)|y_k − θ̂_Fe|,    Q_m = Σ_{k=1}^K (1/s_k)|y_k − θ̂_m|.

Here, θ̂_Fe = Σ_{k=1}^K (y_k/s_k²) / Σ_{k=1}^K (1/s_k²), the fixed-effect estimate of θ as defined in Section 2, and θ̂_m is the weighted median estimator, i.e., the solution to the equation Σ_{k=1}^K w_k[I(θ ≥ y_k) − 0.5] = 0, where I(·) is the indicator function. The estimators τ̂_r² and τ̂_m², based on Q_r and Q_m respectively, can be derived similarly to τ̂²_DL by equating the observed Q_r and Q_m to their corresponding expected values.
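A sketch of Q_r and Q_m (the weighted median below is the standard cumulative-weight version, which solves the estimating equation above; function names and toy data are ours):

```python
import numpy as np

def weighted_median(y, w):
    """Weighted median: smallest y_k whose cumulative weight reaches half the total."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    order = np.argsort(y)
    y, w = y[order], w[order]
    cum = np.cumsum(w)
    return y[np.searchsorted(cum, 0.5 * cum[-1])]

def qr_qm(y, s2):
    """Lin et al.'s absolute-deviation statistics Q_r and Q_m."""
    y, s2 = np.asarray(y, float), np.asarray(s2, float)
    w = 1.0 / s2
    theta_fe = np.sum(w * y) / np.sum(w)   # fixed-effect estimate
    theta_m = weighted_median(y, w)        # weighted median estimate
    qr = np.sum(np.abs(y - theta_fe) / np.sqrt(s2))
    qm = np.sum(np.abs(y - theta_m) / np.sqrt(s2))
    return qr, qm

qr, qm = qr_qm([0.2, 0.5, -0.1], [0.04, 0.09, 0.06])
```

Using absolute rather than squared deviations is what limits the influence of a single outlying study on these statistics.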

Multistep DL

We first introduce the generalized method of moments (GMM) estimator of τ2 based on the Q statistic in (1). DerSimonian and Kacker [9] showed that if the weights wk’s are treated as known constants, the expected value of Q is

E(Q) = τ²(Σ_{k=1}^K w_k − Σ_{k=1}^K w_k² / Σ_{k=1}^K w_k) + (Σ_{k=1}^K w_k σ_k² − Σ_{k=1}^K w_k² σ_k² / Σ_{k=1}^K w_k).  (2)

By equating Q to its expected value, replacing σ_k² with s_k² in (2), solving for τ², and truncating any negative solution to zero, we obtain

τ̂²_GMM = max{ [Q − (Σ_{k=1}^K w_k s_k² − Σ_{k=1}^K w_k² s_k² / Σ_{k=1}^K w_k)] / (Σ_{k=1}^K w_k − Σ_{k=1}^K w_k² / Σ_{k=1}^K w_k), 0 }.  (3)

The DL estimator τ̂²_DL [10] is a special case of τ̂²_GMM, with w_k = 1/s_k² and Q = Q_Fe.
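Formula (3), with DL recovered as the special case w_k = 1/s_k², can be sketched as follows (illustrative function name and toy data; truncation at zero included):

```python
import numpy as np

def tau2_gmm(y, s2, w):
    """Generalized method-of-moments estimator (3): equate Q to E(Q), truncate at 0."""
    y, s2, w = (np.asarray(a, float) for a in (y, s2, w))
    theta_hat = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - theta_hat) ** 2)
    num = q - (np.sum(w * s2) - np.sum(w ** 2 * s2) / np.sum(w))
    den = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(num / den, 0.0)

# DL estimator: GMM with the Fe weights w_k = 1/s_k^2.
y, s2 = [0.2, 0.5, -0.1], [0.04, 0.09, 0.06]
tau2_dl = tau2_gmm(y, s2, 1.0 / np.asarray(s2))
```

With Fe weights, the numerator simplifies to Q_Fe − (K − 1), which is the familiar DL form.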

As discussed in Section 2, the inverse-variance weighting scheme yields w_k = 1/(s_k² + τ̂²) when calculating the (generalized) Q statistic (1) under the Re model. Recall that the original DL estimator τ̂²_DL can be obtained by specifying w_k = 1/s_k² in (3), which is equivalent to setting τ̂² = 0 in the Re weights. The two-step DL method [9] first obtains τ̂²_DL and then sets τ̂² = τ̂²_DL in the Re weights to obtain τ̂²_DL2 from (3).

van Aert and Jackson [44] proposed the multistep DL estimator as a natural extension of the two-step DL estimator. The M-step DL estimator τ̂²_DLM can be obtained recursively by computing τ̂²_DL, τ̂²_DL2, …, τ̂²_DLM using (3). It has been shown that the limit of the multistep DL estimator, τ̂²_DL∞, when it exists, is equivalent to the PM estimator. As further noted by the authors, divergence problems seldom happen in practice and convergence is usually achieved quickly.
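A self-contained sketch of the recursion (names and toy usage ours): each step re-plugs the current τ̂² into the Re weights and re-applies (3), so m = 1 reproduces DL and large m approaches PM when the sequence converges.

```python
import numpy as np

def tau2_dl_multistep(y, s2, m=10):
    """Multistep DL: iterate the GMM step (3), feeding tau^2 back into the weights."""
    y, s2 = np.asarray(y, float), np.asarray(s2, float)
    tau2 = 0.0                                   # step 0: Fe weights (tau^2 = 0)
    for _ in range(m):
        w = 1.0 / (s2 + tau2)                    # Re-model inverse-variance weights
        theta_hat = np.sum(w * y) / np.sum(w)
        q = np.sum(w * (y - theta_hat) ** 2)
        num = q - (np.sum(w * s2) - np.sum(w ** 2 * s2) / np.sum(w))
        den = np.sum(w) - np.sum(w ** 2) / np.sum(w)
        tau2 = max(num / den, 0.0)
    return tau2
```

At a fixed point the update reduces to Q(τ²) = K − 1, which is exactly the PM estimating equation.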

Improved Paule and Mandel (IPM)

For meta-analysis of rare binary events, Bhaumik et al. [2] adopted a standard binomial-normal random-effects model (labeled BNBA), which can be specified by

x_ki ~ Binomial(n_ki, p_ki) for i = 0, 1;
logit(p_k0) = μ_k, logit(p_k1) = μ_k + θ_k;
μ_k ~ N(μ, σ²), θ_k ~ N(θ, τ²), μ_k ⊥ θ_k, for k = 1, …, K.

They proposed a simple average estimator, θ̂_sa, for the overall treatment effect θ and then developed the IPM estimator for τ² based on θ̂_sa and the iterative PM method. The treatment effect θ_k (measured by the log-odds ratio) in study k is estimated with a correction factor a added to each cell count, namely, y_k^a = log[(x_k1 + a)/(n_k1 − x_k1 + a)] − log[(x_k0 + a)/(n_k0 − x_k0 + a)]. The simple average estimator for θ is then given by θ̂_sa = Σ_{k=1}^K y_k^a / K. The authors further proved that a should be 1/2 in order for θ̂_sa to be least biased for large samples. They noticed that the PM estimator for τ² depends on the s_k² and proposed to improve PM by borrowing strength from all component studies when estimating each within-study variance,

s_k^{2(*)} = [1/(n_k1 + 1)][exp(−μ̂ − θ̂_sa + τ²/2) + 2 + exp(μ̂ + θ̂_sa + τ²/2)] + [1/(n_k0 + 1)][exp(−μ̂) + 2 + exp(μ̂)].

Denoting the corresponding weights by w_k^(*) ≡ 1/[s_k^{2(*)} + τ²], τ̂²_IPM can be obtained by solving Q − (K − 1) = 0 iteratively, with the weights w_k^(*) used in the calculation of Q.
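The PM-type root-finding used here can be sketched as follows; for simplicity this illustration uses the ordinary weights 1/(s_k² + τ²) (i.e., plain PM), whereas IPM would substitute the borrowed-strength variances s_k^{2(*)}. Function name, search cap and toy usage are ours.

```python
import numpy as np

def tau2_pm(y, s2, hi=100.0, tol=1e-10):
    """Paule-Mandel: solve Q(tau^2) = K - 1 for tau^2 by bisection."""
    y, s2 = np.asarray(y, float), np.asarray(s2, float)
    k = len(y)

    def q(tau2):
        w = 1.0 / (s2 + tau2)
        theta = np.sum(w * y) / np.sum(w)
        return np.sum(w * (y - theta) ** 2)

    if q(0.0) <= k - 1:              # no positive root: truncate at zero
        return 0.0
    lo_b, hi_b = 0.0, hi             # hi is an arbitrary search cap
    while hi_b - lo_b > tol:
        mid = 0.5 * (lo_b + hi_b)
        if q(mid) > k - 1:           # Q is decreasing in tau^2
            lo_b = mid
        else:
            hi_b = mid
    return 0.5 * (lo_b + hi_b)
```

Monotonicity of Q in τ² guarantees that the bisection brackets the unique root whenever Q(0) > K − 1.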

5. Confidence intervals

Table 4 reports 16 existing methods for constructing CIs for τ², described in terms of key features: whether the algorithm for computing the CI is iterative, whether truncation for non-negativity is needed, which distribution is used for construction, and whether the CI is exact under the Re model. All the methods are general-purpose and so can be applied to meta-analyses of binary events, except for the generalized variable approach [43], which is specifically designed for the mean difference (MD) metric based on normally distributed outcomes. Some of the CIs are obtained via a test-inversion process based on different statistics for testing H₀: τ² = 0.

Table 4:

CI methods for τ2 in random-effects meta-analysis.

| Method | Abbreviation | Iterative? (Y/N) | Truncation to 0? (Y/N) | Distribution used | Exact under Re? (Y/N) | Reference |
|---|---|---|---|---|---|---|
| *CIs based on (modified) Q statistics* | | | | | | |
| Q-profile | QP | Y | Y | χ²_{K−1} | Y | [15, 25] |
| Modified Q-profile | MQP | Y | Y | χ²_{K−1} | N | [15, 25] |
| Biggerstaff and Tweedie | BT | Y | Y | Ga(r, λ) | N | [3] |
| Biggerstaff and Jackson | BJ | Y | Y | Positive linear combination of χ²_1 | Y | [4] |
| Jackson | J | Y | Y | Positive linear combination of χ²_1 | Y | [21] |
| Approximate Jackson | AJ | N | Y | Normal | N | [23] |
| Unequal-tail Q-profile | UTQ | Y | Y | χ²_{K−1} | Y | [22] |
| *Profile likelihood CIs* | | | | | | |
| PL based on ML estimation | PLML | Y | Y | χ²_1 | N | [14] |
| PL based on REML estimation | PLREML | Y | Y | χ²_1 | N | [48] |
| *Wald CIs* | | | | | | |
| Wald based on ML estimation | WML | N | Y | N(0, 1) | N | [3, 49] |
| Wald based on REML estimation | WREML | N | Y | N(0, 1) | N | [49] |
| *Others* | | | | | | |
| Sidik and Jonkman | SJ | N | N | χ²_{K−1} | N | [39] |
| Sidik and Jonkman with HO prior | SJHO | N | N | χ²_{K−1} | N | [40] |
| Bayesian credible intervals | | Y | N | | N | [46] |
| Bootstrap | BSP/BSNP | Y | Y | | N | [11, 27] |
| Generalized variable approach | GV | Y | Y | | N | [43] |

In Table 5, we list existing review papers on constructing confidence intervals for τ2. Clearly, none of these reviews is comprehensive.

5.1. Confidence intervals based on (modified) Q statistics

Q-profile and modified Q-profile CIs

Knapp et al. [25] and Viechtbauer [48] considered Q-profile CIs based on the generalized Q statistic in (1) with weights w_k = 1/(τ² + s_k²), denoted by Q(τ²), which depends on τ² and treats the s_k² as known constants. It can be shown that Q(τ²) follows a χ²_{K−1} distribution under the Re model for any τ². It follows that P(χ²_{K−1,α/2} < Q(τ²) < χ²_{K−1,1−α/2}) = 1 − α. Based on the test-inversion principle, a 100(1 − α)% confidence interval for τ² can be obtained as the interval (τ̃_l², τ̃_u²) satisfying Q(τ̃_l²) = χ²_{K−1,1−α/2} and Q(τ̃_u²) = χ²_{K−1,α/2}. Since τ² is non-negative, τ̃_l² is truncated to 0 if Q(0) < χ²_{K−1,1−α/2} (meaning that τ̃_l² would be negative); and the CI is set to [0, 0] (or {0}, the set containing only zero) if Q(0) < χ²_{K−1,α/2} (meaning that τ̃_u² would also be negative). This type of CI is referred to as the Q-profile (QP) CI, as we are profiling Q(τ²) over different τ² values when solving the above equations for τ̃_l² and τ̃_u² iteratively.
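A sketch of the QP construction using SciPy root-finding (the function name and the search cap `hi` are ours; since Q(τ²) is decreasing in τ², each bound exists exactly when Q(0) exceeds the corresponding χ² quantile):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def q_profile_ci(y, s2, alpha=0.05, hi=100.0):
    """Q-profile CI for tau^2: invert Q(tau^2) ~ chi^2_{K-1}, truncating at 0."""
    y, s2 = np.asarray(y, float), np.asarray(s2, float)
    k = len(y)

    def q(tau2):
        w = 1.0 / (s2 + tau2)
        theta = np.sum(w * y) / np.sum(w)
        return np.sum(w * (y - theta) ** 2)

    lo_crit = chi2.ppf(1 - alpha / 2, k - 1)   # Q(tau_l^2) = upper chi^2 quantile
    up_crit = chi2.ppf(alpha / 2, k - 1)       # Q(tau_u^2) = lower chi^2 quantile
    tau_l = brentq(lambda t: q(t) - lo_crit, 0.0, hi) if q(0.0) > lo_crit else 0.0
    tau_u = brentq(lambda t: q(t) - up_crit, 0.0, hi) if q(0.0) > up_crit else 0.0
    return tau_l, tau_u

low, up = q_profile_ci([0.2, 0.5, -0.1], [0.04, 0.09, 0.06])
```

With only K = 3 small studies, the lower bound is truncated to zero in this toy example, a common outcome for rare-event meta-analyses.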

Knapp et al. [25] considered the fact that the s_k² are only estimates and so carry error variability, and constructed CIs using a test statistic Q̃_r that replaces the weights in Q(τ²) with regularized variants w_rk = r_k/(τ² + s_k²) to achieve a closer approximation to χ²_{K−1}, where the regularization factor r_k is derived through a moment-matching approach based on approximating the distribution of τ² + s_k² by a scaled χ² distribution [15]. The lower bound τ̃_l² is obtained by profiling Q̃_r(τ²) while the upper bound τ̃_u² is still obtained by profiling Q(τ²), satisfying Q̃_r(τ̃_l²) = χ²_{K−1,1−α/2} and Q(τ̃_u²) = χ²_{K−1,α/2}. We refer to this type of CI as the modified Q-profile (MQP) CI.

Like the Q-profile CIs, the MQP CIs need left truncation to zero if the lower bound τ˜l2 turns out to be negative, and they are set to {0} if the upper bound τ˜u2 is also negative. The same rule applies to all other types of CIs based on (modified) Q statistics in Section 5.1, as discussed below.

BT and BJ CIs based on Cochran’s Q statistic

Biggerstaff and Tweedie [3] proposed to approximate the distribution of Cochran’s Q statistic Q_Fe by a gamma distribution with shape parameter r(τ²) ≡ E²(Q_Fe)/Var(Q_Fe) and scale parameter λ(τ²) ≡ Var(Q_Fe)/E(Q_Fe). The mean and variance of Q_Fe under the Re model are given by E(Q_Fe) = (K − 1) + (S₁ − S₂/S₁)τ² and Var(Q_Fe) = 2(K − 1) + 4(S₁ − S₂/S₁)τ² + 2(S₂ + S₂²/S₁² − 2S₃/S₁)τ⁴, where S_r ≡ Σ_{k=1}^K (1/s_k²)^r. CIs for τ² can be obtained similarly based on this gamma approximation instead of χ²_{K−1} using the above profiling approach, which we refer to as the BT intervals.

Biggerstaff and Jackson [4] derived the exact CDF of Q_Fe under the Re model, denoted by F_Q(q; τ²), as that of a positive linear combination of χ²_1 random variables, whose cumulative distribution function can be evaluated using Farebrother’s algorithm [12] via the CompQuadForm package in R. They then obtained (τ̃_l², τ̃_u²) by numerically solving the two equations F_Q(c·τ̂²_uDL + K − 1; τ̃_l²) = 1 − α/2 and F_Q(c·τ̂²_uDL + K − 1; τ̃_u²) = α/2, where c = S₁ − S₂/S₁ and τ̂²_uDL = [Q_Fe − (K − 1)]/c is the untruncated version of the DL estimator of τ². This type of CI is referred to as the BJ intervals.

Jackson and approximate Jackson CIs

Following the numerical approach in [4], Jackson [21] proposed CIs by test inversion based on the generalized Q in (1), which is also distributed as a positive linear combination of χ^2_1 random variables under the RE model. Jackson et al. [23] further proposed to apply the arcsinh transformation to the untruncated version of τ̂_GMM^2 for variance stabilization and then constructed CIs for τ^2 based on a normal approximation. These types of CIs are referred to as the Jackson (J) and approximate Jackson (AJ) CIs, respectively. Based on simulation, Jackson further commented that weighting component studies by the reciprocals of their within-study standard errors (i.e., 1/s_k), rather than by the reciprocals of their variances (i.e., 1/s_k^2) as the convention dictates, appears to be a sensible and viable option when there is little a priori knowledge about the extent of heterogeneity.

Unequal-tail Q profile CIs

Jackson and Bowden [22] advocated using unequal tail probabilities to obtain shorter intervals whenever such methods are justifiable. For example, when constructing a 100(1 − α)% unequal-tail Q-profile (UTQ) confidence interval, the lower and upper bounds τ̃_l^2 and τ̃_u^2 are obtained by solving Q(τ̃_l^2) = χ^2_{K−1,1−α_1} and Q(τ̃_u^2) = χ^2_{K−1,α_2}, respectively, where α_2 > α_1 and α_1 + α_2 = α. They further suggested a pre-specified α-split with α_1 = 0.01 and α_2 = 0.04 for a 95% CI, which was shown to retain the nominal coverage and reduce the width under the RE model. Obviously, the idea of unequal tails can be applied to all kinds of confidence intervals. In our numerical evaluation, we examine the performance of the Q-profile CIs with α_1 = 0.01 and α_2 = 0.04 as a representative.

5.2. Profile likelihood confidence intervals

Under the RE model, Hardy and Thompson [14] proposed profile likelihood CIs based on maximum likelihood (ML) estimation, referred to as PLML. The profile log-likelihood for τ^2, given by l(θ̂_ML(τ^2), τ^2), takes into account the fact that θ is also unknown and must be estimated, where the log-likelihood function of (θ, τ^2) is

\[
l(\theta, \tau^2) = -\frac{K}{2}\ln 2\pi - \frac{1}{2}\sum_{k=1}^{K}\ln(\tau^2+\sigma_k^2) - \frac{1}{2}\sum_{k=1}^{K}\frac{(y_k-\theta)^2}{\tau^2+\sigma_k^2},
\]

and, given the value of τ^2, the ML estimator of θ is

\[
\hat{\theta}_{ML}(\tau^2) = \sum_{k=1}^{K}\frac{y_k}{\tau^2+\sigma_k^2} \Big/ \sum_{k=1}^{K}\frac{1}{\tau^2+\sigma_k^2}.
\]

Then a 100(1 − α)% CI for τ^2 is given by the set of τ^2 values satisfying l(θ̂_ML(τ^2), τ^2) > l(θ̂_ML(τ̂_ML^2), τ̂_ML^2) − χ^2_{1,1−α}/2.
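A minimal sketch of the PLML interval, assuming hypothetical data and known within-study variances: maximize the profile log-likelihood numerically, then find the τ^2 values where it drops by χ^2_{1,1−α}/2. The optimization bound of 100 is an arbitrary assumption.

```python
# Sketch of the Hardy-Thompson profile likelihood (PL_ML) CI (illustrative data).
import numpy as np
from scipy.stats import chi2
from scipy.optimize import minimize_scalar, brentq

def profile_loglik(tau2, y, s2):
    v = tau2 + s2
    theta = np.sum(y / v) / np.sum(1.0 / v)      # theta_hat_ML(tau^2)
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (y - theta) ** 2 / v)

def pl_ml_ci(y, s2, alpha=0.05):
    res = minimize_scalar(lambda t: -profile_loglik(t, y, s2),
                          bounds=(0.0, 100.0), method="bounded")
    tau2_ml, l_max = res.x, -res.fun
    cut = l_max - chi2.ppf(1 - alpha, 1) / 2.0   # allowed drop: chi2_{1,1-alpha}/2
    f = lambda t: profile_loglik(t, y, s2) - cut
    lower = 0.0 if f(0.0) >= 0 else brentq(f, 0.0, tau2_ml)   # truncate at zero
    upper = brentq(f, tau2_ml, 100.0)
    return lower, upper

rng = np.random.default_rng(1)
s2 = rng.uniform(0.05, 0.2, 10)                  # hypothetical variances
y = rng.normal(0.5, np.sqrt(0.3 + s2))           # true tau^2 = 0.3
print(pl_ml_ci(y, s2))
```

The PLREML interval follows the same recipe with l_R(τ^2) in place of the profile log-likelihood.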

Viechtbauer [48] proposed to construct profile likelihood CIs based on restricted maximum likelihood (REML) estimation, referred to as PLREML. The 100(1 − α)% CI for τ^2 is given by the set of τ^2 values satisfying l_R(τ^2) > l_R(τ̂_REML^2) − χ^2_{1,1−α}/2, where the restricted log-likelihood function of τ^2 is

\[
l_R(\tau^2) = -\frac{K}{2}\ln 2\pi - \frac{1}{2}\sum_{k=1}^{K}\ln(\tau^2+\sigma_k^2) - \frac{1}{2}\ln\sum_{k=1}^{K}\frac{1}{\tau^2+\sigma_k^2} - \frac{1}{2}\sum_{k=1}^{K}\frac{(y_k-\hat{\theta}_{ML}(\tau^2))^2}{\tau^2+\sigma_k^2},
\]

and τ̂_REML^2 is the REML estimate of τ^2 (obtained by maximizing l_R). Viechtbauer [48] found that the REML-based CIs were slightly more accurate than the ML-based CIs in terms of coverage probability, especially for small K.

Because ML and REML estimates of τ^2 are constrained to be non-negative, the lower bounds of the profile likelihood (PL) intervals are always non-negative, and the upper bounds are strictly positive, after applying the same truncation rule as for the Q-profile CIs.

5.3. Wald confidence intervals

The Wald test statistics for testing H_0: τ^2 = 0 under the RE model have the form W = τ̂^2/SE(τ̂^2), where τ̂^2 can be τ̂_ML^2 or τ̂_REML^2, and the standard error is estimated by

\[
\widehat{SE}(\hat{\tau}^2_{ML}) = \sqrt{2\Big[\sum_{k=1}^{K} w_{ML.k}^2\Big]^{-1}}, \qquad
\widehat{SE}(\hat{\tau}^2_{REML}) = \sqrt{2\Big[\sum_{k=1}^{K} w_{REML.k}^2 - \frac{2\sum_{k=1}^{K} w_{REML.k}^3}{\sum_{k=1}^{K} w_{REML.k}} + \Big(\frac{\sum_{k=1}^{K} w_{REML.k}^2}{\sum_{k=1}^{K} w_{REML.k}}\Big)^2\Big]^{-1}}
\]

with w_ML.k = 1/(τ̂_ML^2 + s_k^2) and w_REML.k = 1/(τ̂_REML^2 + s_k^2). We label the Wald statistics based on ML and REML estimation by WML and WREML, respectively. The corresponding 100(1 − α)% Wald (W) CI for τ^2 can be easily obtained as τ̂_ML^2 ± z_{1−α/2}·SE(τ̂_ML^2) or τ̂_REML^2 ± z_{1−α/2}·SE(τ̂_REML^2) [3, 48], where z_α is the 100α-th percentile of the standard normal distribution. Negative lower bounds of the Wald CIs should be truncated to 0, since both ML and REML estimates of τ^2 are constrained to be non-negative.
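The WML interval can be sketched as follows for illustrative data; the ML estimate is obtained here by bounded numerical optimization (an implementation choice, not part of the method's definition).

```python
# Sketch of the Wald CI based on the ML estimate of tau^2 (hypothetical data).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def neg_loglik(tau2, y, s2):
    v = tau2 + s2
    theta = np.sum(y / v) / np.sum(1.0 / v)      # theta_hat_ML(tau^2)
    return 0.5 * np.sum(np.log(2 * np.pi * v) + (y - theta) ** 2 / v)

def wald_ml_ci(y, s2, alpha=0.05):
    tau2_ml = minimize_scalar(neg_loglik, bounds=(0.0, 100.0),
                              args=(y, s2), method="bounded").x
    w = 1.0 / (tau2_ml + s2)                     # w_{ML.k}
    se = np.sqrt(2.0 / np.sum(w ** 2))           # estimated SE(tau2_ML)
    z = norm.ppf(1 - alpha / 2)
    return max(0.0, tau2_ml - z * se), tau2_ml + z * se   # truncate lower bound at 0

rng = np.random.default_rng(1)
s2 = rng.uniform(0.05, 0.2, 10)                  # hypothetical variances
y = rng.normal(0.5, np.sqrt(0.3 + s2))           # true tau^2 = 0.3
print(wald_ml_ci(y, s2))
```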

5.4. Other confidence intervals

Sidik and Jonkman (SJ) CIs

Sidik and Jonkman [39] proposed confidence intervals based on the SJ estimator of τ^2, which is derived from the weighted residual sum of squares in the framework of a linear regression model. Let the crude estimate τ̂_0^2 = Σ_{k=1}^K (y_k − ȳ)^2/K be an a priori value for τ^2. Then the SJ estimator is given by τ̂_SJ^2 = [τ̂_0^2/(K − 1)] Σ_{k=1}^K ŵ_k (y_k − θ̂_0)^2, where ŵ_k = 1/(s_k^2 + τ̂_0^2) and θ̂_0 = Σ_{k=1}^K ŵ_k y_k / Σ_{k=1}^K ŵ_k. It follows that (K − 1)τ̂_SJ^2/τ^2 is asymptotically distributed as χ^2_{K−1}. Thus an approximate 100(1 − α)% confidence interval can be calculated as

\[
\frac{(K-1)\hat{\tau}^2_{SJ}}{\chi^2_{K-1,1-\alpha/2}} \le \tau^2 \le \frac{(K-1)\hat{\tau}^2_{SJ}}{\chi^2_{K-1,\alpha/2}}.
\]

Since τ̂_SJ^2 is always positive, the SJ confidence intervals have positive lower and upper bounds. Sidik and Jonkman [40] later proposed an improved estimator τ̂_SJHO^2 that uses τ̂_HO^2 as the a priori value; improved confidence intervals can then be constructed correspondingly.
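The SJ estimator and interval are simple enough to sketch in a few lines (hypothetical data; within-study variances treated as known):

```python
# Sketch of the Sidik-Jonkman estimator and CI (illustrative data).
import numpy as np
from scipy.stats import chi2

def sj_ci(y, s2, alpha=0.05):
    K = len(y)
    tau0_sq = np.mean((y - y.mean()) ** 2)       # crude a priori value tau_0^2
    w = 1.0 / (s2 + tau0_sq)                     # w_hat_k
    theta0 = np.sum(w * y) / np.sum(w)
    tau2_sj = tau0_sq / (K - 1) * np.sum(w * (y - theta0) ** 2)
    # (K-1) * tau2_sj / tau^2 ~ chi2_{K-1} (asymptotically)
    lower = (K - 1) * tau2_sj / chi2.ppf(1 - alpha / 2, K - 1)
    upper = (K - 1) * tau2_sj / chi2.ppf(alpha / 2, K - 1)
    return tau2_sj, (lower, upper)

rng = np.random.default_rng(1)
s2 = rng.uniform(0.05, 0.2, 10)                  # hypothetical variances
y = rng.normal(0.5, np.sqrt(0.3 + s2))           # true tau^2 = 0.3
print(sj_ci(y, s2))
```

Replacing tau0_sq with the HO estimate gives the SJHO variant.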

Bayesian credible intervals

Bayesian credible (BC) intervals can be obtained when a Bayesian approach is employed and posterior samples are drawn from the (joint) posterior distribution of all parameters involved using an MCMC algorithm. The lower and upper points of a 100(1 − α)% CI can be the 100(α/2)th and 100(1 − α/2)th percentiles of the posterior sample of τ2’s, or determined by the region that gives the highest posterior density. Such intervals may be heavily affected by the prior selection when the number of studies K is small.

Bootstrap CIs

Bootstrap techniques can be used to obtain confidence intervals for nearly all τ^2 estimators. For the nonparametric bootstrap (denoted by BSNP), we sample K studies with replacement from the observed set of studies B times to obtain B bootstrap samples. For the parametric bootstrap (denoted by BSP), we first obtain the parameter estimates and then generate B samples from the assumed distributions evaluated at these estimates. For each (parametric or nonparametric) sample, we calculate the corresponding estimate τ̂^2. The 100(α/2)th and 100(1 − α/2)th percentiles of the B estimates of τ^2 then give the lower and upper bounds of a 100(1 − α)% bootstrap confidence interval. In our numerical experiment, we only perform the nonparametric bootstrap procedure with the DL estimator, for illustration.
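A minimal sketch of the BSNP procedure combined with the DL estimator, assuming hypothetical data; B and the seed are arbitrary choices:

```python
# Sketch of the nonparametric bootstrap CI for tau^2 with the DL estimator.
import numpy as np

def dl_estimate(y, s2):
    """Truncated DerSimonian-Laird estimator of tau^2."""
    w = 1.0 / s2
    theta_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - theta_fe) ** 2)          # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (q - (len(y) - 1)) / c)

def bsnp_ci(y, s2, B=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    K = len(y)
    est = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, K, K)              # resample K studies with replacement
        est[b] = dl_estimate(y[idx], s2[idx])
    return np.quantile(est, alpha / 2), np.quantile(est, 1 - alpha / 2)

rng = np.random.default_rng(1)
s2 = rng.uniform(0.05, 0.2, 10)                  # hypothetical variances
y = rng.normal(0.5, np.sqrt(0.3 + s2))           # true tau^2 = 0.3
print(bsnp_ci(y, s2))
```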

The generalized variable (GV) approach

For meta-analysis of normally distributed outcomes, Tian [43] proposed inference procedures based on a generalized pivotal quantity for τ^2. A pivotal quantity is a function of observations and parameters whose distribution does not depend on the parameters, including nuisance parameters. Let σ_k0^2 (σ_k1^2) be the population variance of the control (treatment) group in study k, and let s_k0^2 (s_k1^2) be the corresponding sample variance. For normally distributed outcomes, it is well known that V_ki ≡ (n_ki − 1)s_ki^2/σ_ki^2 ~ χ^2_{n_ki−1} for k = 1, …, K and i = 0, 1. Denote by Q(τ^2) the statistic Q in (1) with weight w_k = 1/(σ_k0^2/n_k0 + σ_k1^2/n_k1 + τ^2), which follows χ^2_{K−1} and is a monotone decreasing function of τ^2. Thus, given a real number η ≥ 0, there exists a unique τ_η^2 ≥ 0 such that Q(τ_η^2) = η. Based on this, Tian [43] defined the generalized pivotal quantity R_{τ^2} for τ^2 as R_{τ^2} = τ_η^2 if η ≤ Q(0) and R_{τ^2} = 0 otherwise. Given the observed treatment effects y_k and sample variances s_ki^2, the distribution of R_{τ^2} does not depend on any nuisance parameters. A series of R_{τ^2} values can be obtained by first simulating V_ki ~ χ^2_{n_ki−1} and η ~ χ^2_{K−1}, setting σ_ki^2 = (n_ki − 1)s_ki^2/V_ki in Q(τ^2) for k = 1, …, K and i = 0, 1, and then solving for τ_η^2. A 100(1 − α)% confidence interval is given by (R_{τ^2,α/2}, R_{τ^2,1−α/2}), where the lower and upper bounds are the 100(α/2)th and 100(1 − α/2)th percentiles of the generated R_{τ^2} values.

6. Simulation focusing on rare binary events

For meta-analysis of rare binary events, Li and Wang [30] conducted a comprehensive simulation study to compare the performance of various estimators of the overall treatment effect θ measured by log-odds ratio, where a flexible binomial-normal model was used to accommodate treatment groups with unequal variability. This model, labeled BNLW, specifies the event probabilities by

\[
\mathrm{logit}(p_{k0}) = \mu_k - \omega\theta_k, \qquad \mathrm{logit}(p_{k1}) = \mu_k + (1-\omega)\theta_k,
\]

where μ_k ~ N(μ, σ^2), θ_k ~ N(θ, τ^2), μ_k is independent of θ_k, and ω is a constant in [0, 1]. The random-effects model BNBA in [2] is a special case of BNLW with ω = 0. Further, when ω = 1/2, it reduces to the model in [41], which assumes equal variances for logit(p_k0) and logit(p_k1).

In this section, we adopt the model and simulation setup from [30] to examine the performance of the various methods. Results are summarized in Sections 6.1 and 6.2 for estimating the between-study variance τ^2 of the log-odds ratios θ_k. Bias and MSE are reported for point estimation, and the actual coverage probability and width of confidence intervals are reported for interval estimation. Specifically, we set the number of studies K to 10, 20 and 50 to reflect different sizes of meta-analysis. We generate the number of events x_ki from Binomial(n_ki, p_ki) for k = 1, …, K and i = 0, 1. The numbers of subjects in the control groups, n_k0, are generated from Uniform[2000, 3000] to examine large-sample performance and from Uniform[20, 1000] to examine small-sample performance, and then rounded to the nearest integers. To allow varying allocation ratios across studies, the within-study sample sizes are set to follow the relationship n_k1 = R_k n_k0, where log_2 R_k ~ N(log_2 R, σ_R^2), R ∈ {1, 2, 4} and σ_R^2 = 0.5. For small sample sizes, as noted in [30], the range [20, 1000] is chosen so that the empirical means of {min(n_k0 p_k0, n_k1 p_k1)}_{k=1}^K in all the settings are below one, while it still allows for cases where most component studies have small sample sizes but a few can have sample sizes close to 1000. To generate the p_ki, we fix σ^2 at 0.5, and set τ^2 ∈ {0, 0.25, 0.5, 0.75, 1} for evaluating different estimators and τ^2 ∈ {0, 0.1, 0.2, ⋯, 0.9, 1} for evaluating different types of CIs. We further set θ ∈ {−1, 0, 1} to reflect different directions of the overall treatment effect, set μ ∈ {−2.5, −5} to represent low and very low incidence rates of the binary event (i.e., 0.076 and 0.0067 on the probability scale), and set ω ∈ {0, 0.5, 1} to represent smaller/equal/larger variability in the control group compared to the treatment group. For each setting, 1000 datasets are simulated, and empirical values of the performance measures are computed by averaging over them.
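One simulated dataset from the BNLW model under the settings above can be generated as follows. This is a sketch: the parameter defaults are single illustrative choices, not the full simulation grid.

```python
# Sketch: generate one meta-analysis dataset from the BN_LW model
# (illustrative defaults; small-sample control-group sizes).
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_bnlw(K=10, theta=0.0, tau2=0.5, mu=-2.5, sigma2=0.5,
                  omega=0.0, R=1.0, sigma2_R=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n0 = np.round(rng.uniform(20, 1000, K)).astype(int)         # control-group sizes
    Rk = 2.0 ** rng.normal(np.log2(R), np.sqrt(sigma2_R), K)    # allocation ratios
    n1 = np.round(Rk * n0).astype(int)                          # treatment-group sizes
    mu_k = rng.normal(mu, np.sqrt(sigma2), K)                   # baseline random effects
    theta_k = rng.normal(theta, np.sqrt(tau2), K)               # study log-odds ratios
    p0 = expit(mu_k - omega * theta_k)                          # logit(p_k0)
    p1 = expit(mu_k + (1 - omega) * theta_k)                    # logit(p_k1)
    x0 = rng.binomial(n0, p0)                                   # control event counts
    x1 = rng.binomial(n1, p1)                                   # treatment event counts
    return n0, x0, n1, x1

n0, x0, n1, x1 = simulate_bnlw()
print(n0, x0, n1, x1)
```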

6.1. Comparison of different heterogeneity estimators

We compare all the methods listed in Table 2 except for FB and MBH. Since the full Bayesian method can be greatly affected by the prior choice and other factors (such as convergence), we exclude FB from our simulation. The MBH method is designed specifically for standardized mean differences and thus is not suitable for binary events. In addition, the empirical Bayes method EB is equivalent to PM, and the multi-step DL estimator (DLM) has the property that it converges to PM. Therefore, we include PM in the comparison and leave EB and DLM out. We use heat maps to visualize the bias and MSE results, where the rows of each map represent different methods and the columns represent different τ^2 values in [0, 1].

Large-sample results

Figure 1 presents the bias and MSE results of different estimators for μ = −2.5 and μ = −5 based on large-sample settings with R = 1, K = 50, θ = 0, and ω = 0. As shown in Figure 1(a), as the event of interest becomes rarer, all methods seem to produce more bias when estimating τ^2. Almost all methods underestimate the between-study heterogeneity when τ^2 > 0. The RBp estimator, however, consistently overestimates τ^2 when the event is very rare (μ = −5). As τ^2 increases, most estimators produce more bias, except for BM and RBp; the bias from BM first increases and then decreases, and the bias from RBp decreases for very rare events (μ = −5). When the events are not that rare (μ = −2.5), most estimators have similarly low bias, except for the one-step DL estimators (DL, DLp, DLb), HM, HS, and BM. However, IPM stands out with the lowest bias when the incidence rate becomes very low, especially when τ^2 ≥ 0.5. The HS, HM, BM and one-step DL family methods remain the worst in terms of bias and should be avoided. All three likelihood-based methods, ML, REML and AREML, produce similar results with a moderate level of bias. In terms of MSE, most methods perform similarly except for HM and BM, which are the most inefficient according to Figure 1(b). Methods with relatively large bias in magnitude also tend to have relatively large MSE.

Figure 1: Large-sample performance of different τ^2 estimators based on settings with R = 1, K = 50, θ = 0, and ω = 0.

We next discuss the potential impacts of R, K, θ, and ω on the estimation performance in the large-sample case. Figures S1 and S2 in the Supplementary Material (SM) show the bias and MSE results for different R and K values, respectively, based on settings with μ = −2.5, θ = 0 and ω = 0. When τ^2 < 0.5, regardless of R and K, all the methods perform somewhat similarly, with both bias and MSE close to zero, except for BM, which has much larger bias. As K increases, MSE decreases significantly for every estimator when τ^2 ≥ 0.5, but the bias of a few estimators does not seem to approach zero (e.g., DL for τ^2 = 1, and BM for τ^2 = 0.5 and 0.75). However, the heat maps show very similar color patterns both vertically and horizontally, indicating that the impact of R and K on the relative performance of these methods is marginal. Figures S3 and S4 in the SM show the bias and MSE results for different θ and ω values, respectively, based on settings with R = 1, K = 50 and μ = −5. When θ = −1, bias decreases as ω increases, while this trend reverses when θ = 1. The effect of ω is minimal when there is no treatment effect (θ = 0). Similar but less obvious trends are observed for MSE. Also, we find that IPM maintains the best performance in terms of both bias and MSE, while DL, DLp, DLb, HS, HM, and BM are among the worst in nearly all the settings considered.

Small-sample results

Figure 2 presents the bias and MSE results of different estimators for μ = −2.5 and μ = −5 based on small-sample settings with R = 1, K = 50, θ = 0, and ω = 0. From Figure 2(a), we can see that when τ^2 > 0, the underestimation observed in the large-sample results for all the estimators but RBp is much more severe for small samples, and the magnitude of bias increases substantially for very rare events (μ = −5). Note that RBp consistently overestimates τ^2 for both μ = −2.5 and μ = −5, and unlike most other estimators, its bias decreases as τ^2 increases. When events are not that rare (μ = −2.5), IPM is still the least biased. However, for very rare events (μ = −5), SJ becomes the least biased estimator for τ^2 ≥ 0.5. The problem with SJ is that it significantly overestimates τ^2 when there is no or little heterogeneity, due to its strictly positive nature. From Figure 2(b), we can see that MSE does not change much when μ = −2.5 but increases dramatically when μ = −5, compared with the large-sample results. For very rare events (μ = −5), SJ is the most efficient method except at τ^2 = 0, and IPM seems to be the second best in terms of MSE. Note that when τ^2 = 1, RBp has smaller MSE than IPM for very rare events, but it does not perform as well as IPM for smaller τ^2 values.

Figure 2: Small-sample performance of different τ^2 estimators based on settings with R = 1, K = 50, θ = 0, and ω = 0.

The impacts of R, K, θ, and ω on the estimation bias and MSE in the small-sample case are shown in Figures S5–S8 of the SM. Since several methods (e.g., the likelihood-based methods) failed in some small-sample settings for very rare events (μ = −5), we show results for μ = −2.5 in these figures. Although the effect of K on MSE becomes more pronounced for small samples (i.e., MSE decreases more as K increases), it is still the case that both R and K have little impact on the relative performance of different methods. Also, similar trends in both bias and MSE occur when ω and θ change, as in the large-sample case. For these μ = −2.5 settings, IPM seems to be the best estimator due to its consistent top-level performance across various settings. This also agrees with the results in the left panels of Figure 2. On the other hand, DL, DLp, DLb, HM, HS, and BM should be used with caution due to their generally large bias.

6.2. Comparison of different types of CIs

Among those summarized in Table 4, we compare 14 different types of 95% CIs for the heterogeneity parameter τ^2 in Figures 3 and 4, excluding the Bayesian credible intervals and the GV method as before. As mentioned in Section 5, BSNP represents the nonparametric bootstrap procedure combined with the DL estimator, and UTQ represents the unequal-tail Q-profile CI with α_1 = 0.01 and α_2 = 0.04. Again, from our (unreported) simulation results, we find that the influences of R, θ, and ω on the empirical coverage probability are marginal.

Figure 3: Actual coverage probabilities of different types of 95% CIs for different K values based on large-sample settings with R = 1, μ = −5, θ = 0, and ω = 0.

Figure 4: Actual coverage probabilities of different types of 95% CIs for both large- and small-sample cases and different μ values based on settings with R = 1, K = 20, θ = 0, and ω = 0.

Figure 3 shows the actual coverage probabilities of different types of CIs for different K values based on large-sample settings with R = 1, μ = −5, θ = 0, and ω = 0. When there is no between-study heterogeneity (τ^2 = 0), all the methods provide 100% coverage except for SJ and SJHO, which produce strictly positive intervals and so have zero coverage. When τ^2 is small, the methods based on (modified) Q statistics gain some improvement in coverage as K increases, except for AJ, which achieves relatively high coverage for all K and τ^2 values. As τ^2 gets larger, most methods do not improve their coverage with increasing K.

Figure 4 presents the actual coverage probabilities of different types of CIs for both large- and small-sample cases and different μ values based on settings with R = 1, K = 20, θ = 0, and ω = 0. When μ = −2.5, most methods have actual coverage close to the nominal level 0.95. Among all, the nonparametric bootstrap CI has the lowest coverage, followed by the two Wald CIs, when τ^2 > 0. The influence of sample sizes is not obvious except for J, SJ and SJHO, whose coverage improves for large sample sizes when τ^2 is small. For very rare events (μ = −5), the impact of sample sizes is much more severe, and some of the CIs (e.g., SJHO, J, UTQ) do not even achieve 50% coverage in most small-sample settings. In the large-sample settings, PLML, PLREML, and AJ maintain the nominal 95% coverage quite well at all positive levels of τ^2. As the sample sizes become small, all methods fail to do so for very rare events when τ^2 ≥ 0.3. Still, PLML, PLREML, and AJ are among those with the highest coverage. We also find that when τ^2 ≥ 0.4, SJ joins the top-performing group, with the ordering SJ ≈ PLREML > PLML > AJ. This matches the estimation results reported in Section 6.1: for very rare events coupled with small samples, the SJ estimator is the least biased and has the smallest MSE when τ^2 ≥ 0.5. In such situations, the Q statistic-based CIs have generally low coverage and thus should be avoided; meanwhile, the Wald and nonparametric bootstrap CIs have moderate coverage instead of being the worst as in the other three cases.

Figure 5 shows the width curves of different types of CIs under the same settings as Figure 4, where for all CIs the width increases as τ^2 increases. The influence of sample sizes on the CI width is only obvious when μ = −5, where all the CIs become narrower as sample sizes decrease. Though counterintuitive, a closer examination reveals that when events are very rare and sample sizes are small, many simulation iterations produce the degenerate interval {0}, which makes the average width smaller. In the first three situations (either μ = −2.5 or large samples), BT and BJ produce the widest intervals, and the PL and AJ intervals, which offer higher coverage than most other methods, have moderate widths. Unsurprisingly, the nonparametric bootstrap procedure produces the narrowest CIs. In the last situation (very rare events coupled with small samples), the PL and AJ intervals are among the widest. Here, CIs with shorter widths are not necessarily desirable, as they may reflect more {0} intervals due to sparsity. SJ produces intervals with moderate widths while also providing higher coverage when τ^2 is large. Overall, we recommend the PL and AJ intervals in meta-analysis of rare binary events for their high coverage. For very rare events with small samples, we recommend the SJ intervals if at least a moderate level of heterogeneity is known to exist. Besides, the AJ and SJ intervals are much easier to obtain than the PL intervals.

Figure 5: Width curves of different types of 95% CIs for both large- and small-sample cases and different μ values based on settings with R = 1, K = 20, θ = 0, and ω = 0.

7. Example: Type 2 diabetes mellitus after gestational diabetes

Women with gestational diabetes are believed to have a higher chance of developing type 2 diabetes. Bellamy et al. [1] performed a comprehensive systematic review and meta-analysis to assess the strength of this association. From 205 reports identified in Embase and Medline between Jan 1, 1960, and Jan 31, 2009, they selected 20 cohort studies that included 675,455 women with or without gestational diabetes and 10,859 type 2 diabetic events (see Table S.1 of the SM). We reanalyzed the data focusing on inference about the heterogeneity parameter τ^2. We note that the overall event rate is about 1.61% and many studies have very small sample sizes with zero event counts, so this data example fits the scenario of very rare events coupled with small sample sizes. Recall that in this scenario, SJ is the least biased and most efficient estimator when there exists a moderate or large level of heterogeneity, and IPM is the second best but tends to underestimate τ^2.

Point estimates of the heterogeneity parameter τ^2 and the corresponding inverse-variance weighted estimates of the overall treatment effect θ (measured by log-odds ratio) are summarized in Table 6. Most methods give an estimate between 0.4 and 0.7 for τ^2; the estimate from IPM is 0.563 and that from SJ is 0.679. This suggests a moderate to high level of heterogeneity, especially after accounting for the underestimation from IPM. The RBp method, which has been shown to severely overestimate τ^2 for very rare events, not surprisingly gives the largest estimate of 1.162. On the other hand, the HS estimate is much smaller than the others. The resulting estimated odds ratios do not vary as much, except for the one from RBp. Table 7 shows the confidence intervals from all the compared methods. BT gives a very large upper bound, which seems anomalous. All CIs except those from the BT, BJ, and Wald methods exclude zero; among these, SJ yields the shortest interval, with the largest lower bound and an upper bound in line with those from the PL and AJ methods. Recall that SJ tends to produce the best interval, with higher coverage and relatively shorter width, when there exists at least moderate-level heterogeneity, as reported in Section 6.2. In this example, we lean toward reporting the SJ interval among the top performing methods PL, AJ and SJ. Based on the estimation and inference results above, we believe that these studies are heterogeneous.

Table 6:

Data example of gestational diabetes meta-analysis: estimates of τ^2 and θ from different methods.

Estimator  HO     HO2    DL     DL2    DLp    DLb    PM     IPM    HM     HS
τ̂^2        0.220  0.418  0.466  0.411  0.466  0.265  0.413  0.563  0.419  0.046
θ̂          2.093  2.136  2.146  2.135  2.146  2.104  2.135  2.162  2.137  2.092
OR         8.112  8.469  8.547  8.457  8.547  8.197  8.461  8.691  8.470  8.099
Estimator  LCHmean  LCHmedian  ML     REML   AREML  SJ     SJHO   RB0    RBp    BM
τ̂^2        0.519    0.298      0.396  0.449  0.433  0.679  0.290  0.198  1.162  0.195
θ̂          2.155    2.111      2.132  2.142  2.139  2.180  2.110  2.088  2.235  2.088
OR         8.626    8.260      8.432  8.520  8.493  8.846  8.245  8.072  9.345  8.067

Table 7:

Data example of gestational diabetes meta-analysis: confidence intervals for τ^2 from different methods.

Method   CI
QP       (0.109, 1.603)
MQP      (0.106, 1.603)
UTQ      (0.083, 1.403)
BT       [0, 8.610)
BJ       [0, 2.660)
J        (0.048, 1.540)
AJ       (0.004, 1.396)
SJ       (0.393, 1.449)
SJHO     (0.168, 0.620)
BSNP     (0.012, 0.670)
PLML     (0.113, 1.285)
PLREML   (0.129, 1.458)
WML      [0, 0.841)
WREML    [0, 0.966)

8. Discussion and recommendations

Based on our comprehensive simulation studies for large-sample meta-analysis of rare binary events, we recommend the IPM method for estimating the heterogeneity parameter τ^2 if reducing estimation bias is of high priority, especially when the events are extremely rare. Most of the methods do not differ much in terms of MSE. We suggest avoiding HM, HS and BM, since they have relatively large bias and MSE compared with the other estimators. The most widely used DL estimator and its one-step variants DLp and DLb do not perform satisfactorily either and hence should also be avoided. For small-sample meta-analysis of rare events, IPM is still recommended, and SJ performs much better than the other estimators in terms of both bias and MSE when τ^2 ≥ 0.5 and the events are extremely rare. For interval estimation, we recommend the profile likelihood methods (PLML and PLREML) and the approximate Jackson method AJ in general situations. Among the three, PLREML usually produces higher coverage but wider intervals. The SJ method is a good candidate when events are extremely rare, sample sizes are small, and τ^2 ≥ 0.4. We did not examine the performance of Bayesian methods because of the computational burden, convergence diagnosis issues, and potential sensitivity to prior choice. However, Bayesian hierarchical modeling can be a good alternative, especially when meaningful prior information is available.

We notice that most estimators of τ^2 are negatively biased in our simulation, an interesting phenomenon also observed in other simulation studies with binary outcomes [26, 39, 40, 2]. In simulation studies with continuous outcomes [27], most of the estimators show positive bias when τ^2 is small (< 0.1), and the magnitude of the bias of RBp is much larger than that of the other estimators; for larger τ^2 values, the HS and ML estimators are negatively biased and the magnitude increases as τ^2 increases [47]. Viechtbauer [47] provides some analytical results for the bias of the estimators HO, DL, HS, ML, and REML. Most of these results were derived under the homogeneous within-study variance assumption (σ_k^2 = σ^2). Under this assumption, the bias due to truncation is always positive for DL, HO and REML at all levels of heterogeneity, and is negative for HS and ML when τ^2 ≥ 0.5. However, we believe that in the rare-events context, it is the sparsity (caused by zero counts) and the lack of resolution in estimating the within-study variances that cause the large magnitude of underestimation for many methods. This underestimation is much reduced by the IPM estimator, in which the within-study variance estimates are improved by pooling information across all the studies.

Finally, we should mention that, when synthesizing information from multiple studies to obtain more reliable conclusions, one should not simply rely on one point estimate or one p-value (especially those from the default methods in software packages) without considering the rich selection of statistical tools offered in the literature. Each of the above reviewed models or methods has its own limitations. In practice, all kinds of evidence should be combined and evaluated together with the specific characteristics of component studies included in the meta-analysis.

Supplementary Material

Supplemental Material

References

  • [1] Bellamy L, Casas J-P, Hingorani AD, and Williams D (2009). Type 2 diabetes mellitus after gestational diabetes: a systematic review and meta-analysis. The Lancet, 373(9677):1773–1779.
  • [2] Bhaumik DK, Amatya A, Normand S-LT, Greenhouse J, Kaizar E, Neelon B, and Gibbons RD (2012). Meta-analysis of rare binary adverse event data. Journal of the American Statistical Association, 107(498):555–567.
  • [3] Biggerstaff B and Tweedie R (1997). Incorporating variability in estimates of heterogeneity in the random effects model in meta-analysis. Statistics in Medicine, 16(7):753–768.
  • [4] Biggerstaff BJ and Jackson D (2008). The exact distribution of Cochran’s heterogeneity statistic in one-way random effects meta-analysis. Statistics in Medicine, 27(29):6093–6110.
  • [5] Chung Y, Rabe-Hesketh S, and Choi I-H (2013a). Avoiding zero between-study variance estimates in random-effects meta-analysis. Statistics in Medicine, 32(23):4071–4089.
  • [6] Chung Y, Rabe-Hesketh S, Dorie V, Gelman A, and Liu J (2013b). A non-degenerate penalized likelihood estimator for variance parameters in multilevel models. Psychometrika, 78(4):685–709.
  • [7] Cochran WG (1954). The combination of estimates from different experiments. Biometrics, 10(1):101–129.
  • [8] Crippa A, Khudyakov P, Wang M, Orsini N, and Spiegelman D (2016). A new measure of between-studies heterogeneity in meta-analysis. Statistics in Medicine, 35(21):3661–3675.
  • [9] DerSimonian R and Kacker R (2007). Random-effects model for meta-analysis of clinical trials: an update. Contemporary Clinical Trials, 28(2):105–114.
  • [10] DerSimonian R and Laird N (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3):177–188.
  • [11] Efron B and Tibshirani R (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, pages 54–75.
  • [12] Farebrother R (1984). Algorithm AS 204: the distribution of a positive linear combination of χ^2 random variables. Journal of the Royal Statistical Society, Series C (Applied Statistics), 33(3):332–339.
  • [13] Gart JJ (1966). Alternative analyses of contingency tables. Journal of the Royal Statistical Society, Series B (Methodological), pages 164–179.
  • [14] Hardy RJ and Thompson SG (1996). A likelihood approach to meta-analysis with random effects. Statistics in Medicine, 15(6):619–629.
  • [15] Hartung J and Knapp G (2005). On confidence intervals for the among-group variance in the one-way random effects model with unequal error variances. Journal of Statistical Planning and Inference, 127(1–2):157–177.
  • [16] Hartung J and Makambi K (2002). Positive estimation of the between-study variance in meta-analysis: theory and methods. South African Statistical Journal, 36(1):55–76.
  • [17] Hedges LV and Olkin I (2014). Statistical Methods for Meta-analysis. Academic Press.
  • [18] Higgins J and Thompson SG (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21(11):1539–1558.
  • [19] Higgins JP and Green S (2011). Cochrane Handbook for Systematic Reviews of Interventions, volume 4. John Wiley & Sons.
  • [20] Hunter JE and Schmidt FL (2004). Methods of Meta-analysis: Correcting Error and Bias in Research Findings. Sage.
  • [21] Jackson D (2013). Confidence intervals for the between-study variance in random effects meta-analysis using generalised Cochran heterogeneity statistics. Research Synthesis Methods, 4(3):220–229.
  • [22] Jackson D and Bowden J (2016). Confidence intervals for the between-study variance in random-effects meta-analysis using generalised heterogeneity statistics: should we use unequal tails? BMC Medical Research Methodology, 16(1):118.
  • [23] Jackson D, Bowden J, and Baker R (2015). Approximate confidence intervals for moment-based estimators of the between-study variance in random effects meta-analysis. Research Synthesis Methods, 6(4):372–382.
  • [24] Jackson D, White IR, and Riley RD (2012). Quantifying the impact of between-study heterogeneity in multivariate meta-analyses. Statistics in Medicine, 31(29):3805–3820.
  • [25] Knapp G, Biggerstaff BJ, and Hartung J (2006). Assessing the amount of heterogeneity in random-effects meta-analysis. Biometrical Journal, 48(2):271–285.
  • [26] Knapp G and Hartung J (2003). Improved tests for a random effects meta-regression with a single covariate. Statistics in Medicine, 22(17):2693–2710.
  • [27] Kontopantelis E, Springate DA, and Reeves D (2013). A re-analysis of the Cochrane Library data: the dangers of unobserved heterogeneity in meta-analyses. PLoS ONE, 8(7):e69930.
  • [28] Langan D, Higgins J, and Simmonds M (2017). Comparative performance of heterogeneity variance estimators in meta-analysis: a review of simulation studies. Research Synthesis Methods, 8(2):181–198.
  • [29] Langan D, Higgins JP, Jackson D, Bowden J, Veroniki AA, Kontopantelis E, Viechtbauer W, and Simmonds M (2019). A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Research Synthesis Methods, 10(1):83–98.
  • [30] Li L and Wang X (2019). Meta-analysis of rare binary events in treatment groups with unequal variability. Statistical Methods in Medical Research, 28(1):263–274.
  • [31].Lin L, Chu H, and Hodges JS (2017). Alternative measures of between-study heterogeneity in meta-analysis: Reducing the impact of outlying studies. Biometrics, 73(1):156–166. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [32].Malzahn U, Böhning D, and Holling H (2000). Nonparametric estimation of heterogeneity variance for the standardised difference used in meta-analysis. Biometrika, 87(3):619–632. [Google Scholar]
  • [33].Morris CN (1983). Parametric empirical bayes inference: theory and applications. Journal of the American Statistical Association, 78(381):47–55. [Google Scholar]
  • [34].Novianti PW, Roes KC, and van der Tweel I (2014). Estimation of between-trial variance in sequential meta-analyses: a simulation study. Contemporary clinical trials, 37(1):129–138. [DOI] [PubMed] [Google Scholar]
  • [35].Panityakul T, Bumrungsup C, and Knapp G (2013). On estimating residual heterogeneity in random-effects meta-regression: a comparative study. J Stat Theory Appl, 12(3):253. [Google Scholar]
  • [36].Paule RC and Mandel J (1982). Consensus values and weighting factors. Journal of Research of the National Bureau of Standards, 87(5):377–385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [37].Petropoulou M and Mavridis D (2017). A comparison of 20 heterogeneity variance estimators in statistical synthesis of results from studies: a simulation study. Statistics in medicine, 36(27):4266–4280. [DOI] [PubMed] [Google Scholar]
  • [38].Rukhin AL (2013). Estimating heterogeneity variance in meta-analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):451–469. [Google Scholar]
  • [39].Sidik K and Jonkman JN (2005). Simple heterogeneity variance estimation for meta-analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(2):367–384. [Google Scholar]
  • [40].Sidik K and Jonkman JN (2007). A comparison of heterogeneity variance estimators in combining results of studies. Statistics in medicine, 26(9):1964–1981. [DOI] [PubMed] [Google Scholar]
  • [41].Smith TC, Spiegelhalter DJ, and Thomas A (1995). Bayesian approaches to random-effects meta-analysis: A comparative study. Statistics in medicine, 14(24):2685–2699. [DOI] [PubMed] [Google Scholar]
  • [42].Takkouche B, Cadarso-Suarez C, and Spiegelman D (1999). Evaluation of old and new tests of heterogeneity in epidemiologic meta-analysis. American journal of epidemiology, 150(2):206–215. [DOI] [PubMed] [Google Scholar]
  • [43].Tian L (2008). Inferences about the between-study variance in meta-analysis with normally distributed outcomes. Biometrical Journal, 50(2):248–256. [DOI] [PubMed] [Google Scholar]
  • [44].van Aert RC and Jackson D (2018). Multistep estimators of the between-study variance: The relationship with the Paule-Mandel estimator. Statistics in medicine, 37(17):2616–2629. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [45].van Aert RC, van Assen MA, and Viechtbauer W (2019). Statistical properties of methods based on the Q-statistic for constructing a confidence interval for the between-study variance in meta-analysis. Research synthesis methods, 10(2):225–239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [46].Veroniki AA, Jackson D, Viechtbauer W, Bender R, Bowden J, Knapp G, Kuss O, Higgins J, Langan D, and Salanti G (2016). Methods to estimate the between-study variance and its uncertainty in meta-analysis. Research synthesis methods, 7(1):55–79. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [47].Viechtbauer W (2005). Bias and efficiency of meta-analytic variance estimators in the random-effects model. Journal of Educational and Behavioral Statistics, 30(3):261–293. [Google Scholar]
  • [48].Viechtbauer W (2007a). Confidence intervals for the amount of heterogeneity in meta-analysis. Statistics in medicine, 26(1):37–52. [DOI] [PubMed] [Google Scholar]
  • [49].Viechtbauer W (2007b). Hypothesis tests for population heterogeneity in meta-analysis. British Journal of Mathematical and Statistical Psychology, 60(1):29–60. [DOI] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Supplemental Material
