Author manuscript; available in PMC: 2025 May 1.
Published in final edited form as: Stat Methods Med Res. 2024 Mar 19;33(5):745–764. doi: 10.1177/09622802241231496

Comparisons of various estimates of the I2 statistic for quantifying between-study heterogeneity in meta-analysis

Yipeng Wang 1, Natalie DelRocco 2, Lifeng Lin 3
PMCID: PMC11759644  NIHMSID: NIHMS2046507  PMID: 38502022

Abstract

Assessing heterogeneity between studies is a critical step in determining whether studies can be combined and whether the synthesized results are reliable. The I2 statistic has been a popular measure for quantifying heterogeneity, but its usage has been challenged from various perspectives in recent years. In particular, it should not be considered an absolute measure of heterogeneity, and it could be subject to large uncertainties. As such, when using I2 to interpret the extent of heterogeneity, it is essential to account for its interval estimate. Various point and interval estimators exist for I2. This article summarizes these estimators. In addition, we performed a simulation study under different scenarios to investigate preferable point and interval estimates of I2. We found that the Sidik–Jonkman method gave precise point estimates for I2 when the between-study variance was large, while in other cases, the DerSimonian–Laird method was suggested to estimate I2. When the effect measure was the mean difference or the standardized mean difference, the Q-profile method, the Biggerstaff–Jackson method, or the Jackson method was suggested to calculate the interval estimate for I2 due to reasonable interval length and more reliable coverage probabilities than various alternatives. For the same reason, the Kulinskaya–Dollinger method was recommended to calculate the interval estimate for I2 when the effect measure was the log odds ratio.

Keywords: Confidence interval, coverage probability, heterogeneity, I2 statistic, meta-analysis

1. Introduction

Meta-analysis is a statistical tool to synthesize evidence from different studies and is widely used in medical research. Assessing heterogeneity between the collected studies is a critical step to examine whether the studies may be properly combined and the synthesized results are reliable.1,2 In this article, heterogeneity refers to the variation in underlying treatment effects across studies.3

A classical method to detect heterogeneity is the chi-squared Q test; the distribution of the Q statistic is approximately $\chi^2_{k-1}$ ($k$ is the number of studies) under the null hypothesis that all studies in a meta-analysis are homogeneous. However, the Q test alone does not suffice to describe the amount of heterogeneity because it produces only a P-value, which indicates a binary decision of either the presence or absence of heterogeneity. The $I^2$ statistic has been a popular alternative to quantify heterogeneity because of its attractive interpretation as the proportion of total variation caused by heterogeneity rather than within-study sampling error.4,5 Specifically, the $I^2$ statistic can be conceptualized as $\tau^2/(\tau^2+\sigma^2)\times 100\%$, where $\tau^2$ is the between-study variance caused by heterogeneity and $\sigma^2$ is a summary of within-study variances. It ranges from 0% to 100%. The Cochrane Handbook provides a rough yet widely used rule to interpret this measure: $I^2\le 40\%$ may indicate unimportant heterogeneity, $30\%\le I^2\le 60\%$ may represent moderate heterogeneity, $50\%\le I^2\le 90\%$ may represent substantial heterogeneity, and $75\%\le I^2\le 100\%$ implies considerable heterogeneity.3 These ranges overlap with each other because they are vague, and the true heterogeneity should be evaluated with caution, using both statistical and clinical knowledge.6

Over the past few years, the usage of $I^2$ has been challenged from many perspectives, and it should not be used as an absolute measure.7–9 Several studies have demonstrated shortcomings of $I^2$. The $I^2$ statistic may be particularly unreliable in meta-analyses with a small number of studies (e.g. < 10).10,11 The sample sizes within individual studies can inflate or deflate $I^2$ under different circumstances.11,12 Moreover, $I^2$ inherits the following misunderstanding about the distribution of the Q statistic under the null hypothesis: the $\chi^2_{k-1}$ approximation holds for large within-study sample sizes, but it is not accurate for small and moderate sample sizes.13,14

In response to the above shortcomings, the associated 95% confidence interval (CI) should be reported to accompany the $I^2$ statistic.10,15,16 Also, under the null hypothesis, distributions of the Q statistic for different effect measures (e.g. standardized mean difference [SMD]) have been proposed to adjust for the inaccuracy of the standard chi-squared approximation17–20; they provide a solution to calculating the CI of $I^2$. CIs may be more desirable than point estimates of $I^2$ because they give an appreciation of the spectrum of possible extents of heterogeneity (e.g. mild to moderate). A spectrum of the $I^2$ statistic can be more robust to nuisance factors compared with the point estimate, enabling appropriate interpretation of the overall estimate of the intervention effect.16 Methods to calculate 95% CIs of $I^2$ have been discussed in previous research.4,21 Nevertheless, more extensive studies are needed to compare the performance of different methods (e.g. the coverage probability) in practical situations.

This article uses a simulation study under various scenarios to draw informative conclusions about preferable point and interval estimates of $I^2$. The rest of the article is organized as follows. We first review the setups of a meta-analysis, including various types of effect measures, in Section 2. Section 3 reviews various point and interval estimators of $I^2$. Section 4 presents the simulation study comparing the multiple estimators. We conclude this article with a brief discussion in Section 5.

2. Setups of meta-analysis

2.1. Common-effect and random-effects models

Consider a meta-analysis that collects $k$ independent studies. Let $\mu_i$ be the underlying true effect size in study $i$ ($i=1,\ldots,k$). Each study reports an estimate of the effect size and its sample variance, denoted by $y_i$ and $s_i^2$. These data are commonly modeled as $y_i\sim N(\mu_i,s_i^2)$. Although $s_i^2$ is subject to sampling error, it is usually treated as a fixed, known value. This assumption is generally valid if each study’s sample size is reasonably large.

If study-specific true effect sizes are assumed to follow a normal distribution, that is, $\mu_i\overset{\text{iid}}{\sim}N(\mu,\tau^2)$, then this is the random-effects (RE) model that accounts for heterogeneity. Here, $\mu$ is the overall mean effect size, and $\tau^2$ is the between-study variance. If $\tau^2=0$, then $\mu_i=\mu$ for all studies. This implies that the collected studies are homogeneous, and it leads to the common-effect (CE) model. The RE model encompasses within-study ($s_i^2$) and between-study ($\tau^2$) variation, in contrast to the CE model that includes within-study variation only. We denote $w_{i,CE}=1/s_i^2$ as the weight assigned to each study under the CE model. The Q statistic is defined as $Q=\sum_i w_{i,CE}(y_i-\hat\mu_{CE})^2$, where $\hat\mu_{CE}=\sum_i w_{i,CE}y_i/\sum_i w_{i,CE}$ is the pooled CE estimate of the overall effect size $\mu$. It approximately follows a $\chi^2_{k-1}$ distribution under the null hypothesis. Under the RE model, using the between-study variance estimate $\hat\tau^2$, the overall mean effect size is estimated as

$$\hat\mu_{RE}(\hat\tau^2)=\frac{\sum_i w_{i,RE}(\hat\tau^2)\,y_i}{\sum_i w_{i,RE}(\hat\tau^2)} \tag{1}$$

where $w_{i,RE}(\hat\tau^2)=1/(s_i^2+\hat\tau^2)$.
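The pooling formulas above can be sketched in a few lines. This is only an illustration under the notation of this section; the function name `pooled_estimates` and its argument names are hypothetical and do not come from the article or from any package.

```python
def pooled_estimates(y, s2, tau2=0.0):
    """Return (mu_hat, Q) for study effects y and within-study variances s2.

    tau2 = 0 gives the common-effect estimate; tau2 > 0 gives equation (1).
    Q is always computed with the common-effect weights 1 / s_i^2.
    """
    w_ce = [1.0 / v for v in s2]
    mu_ce = sum(w * yi for w, yi in zip(w_ce, y)) / sum(w_ce)
    q = sum(w * (yi - mu_ce) ** 2 for w, yi in zip(w_ce, y))

    # Random-effects weights 1 / (s_i^2 + tau2); identical to w_ce when tau2 = 0.
    w_re = [1.0 / (v + tau2) for v in s2]
    mu_re = sum(w * yi for w, yi in zip(w_re, y)) / sum(w_re)
    return mu_re, q
```

As $\tau^2$ grows, the RE weights become nearly equal, so the pooled estimate moves from the precision-weighted mean toward the simple average of the $y_i$.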

2.2. Meta-analysis with a continuous outcome

Suppose each study in a meta-analysis compares a treatment group with a control group. Denote $n_{i0}$ and $n_{i1}$ as the sample sizes in the control and treatment groups in study $i$. The continuous outcome measures of participants in each group are assumed to follow normal distributions. The subject-level data in each arm have means $\mu_{i0}$ and $\mu_{i1}$ and variances $\gamma_{i0}^2$ and $\gamma_{i1}^2$. The sample means are denoted as $\bar y_{i0}$ and $\bar y_{i1}$, and the sample variances are denoted as $s_{i0}^2$ and $s_{i1}^2$ for $i=1,\ldots,k$.

If the outcome measures have a meaningful scale and all studies in the meta-analysis are reported on the same scale, the mean difference (MD) between the two groups, $\mu_i=\mu_{i1}-\mu_{i0}$, is often used as the effect size. An estimate of the MD can be obtained from each study, denoted as $y_i=\bar y_{i1}-\bar y_{i0}$. The variances of samples in the two arms are frequently assumed to be equal, i.e., $\gamma_{i0}^2=\gamma_{i1}^2=\gamma_i^2$. The common variance $\gamma_i^2$ is estimated by the pooled sample variance $s_{iP}^2=[(n_{i0}-1)s_{i0}^2+(n_{i1}-1)s_{i1}^2]/(n_{i0}+n_{i1}-2)$. Therefore, the estimated within-study variance of $y_i$ is $s_i^2=(1/n_{i0}+1/n_{i1})\,s_{iP}^2$.
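A minimal sketch of the MD calculation from arm-level summary statistics follows. The function name `mean_difference` and its argument order are assumptions made for illustration only.

```python
def mean_difference(n0, mean0, var0, n1, mean1, var1):
    """Return (y_i, s_i^2) for the mean difference, assuming equal arm variances.

    Index 0 is the control arm and index 1 is the treatment arm.
    """
    # Pooled sample variance s_iP^2
    pooled_var = ((n0 - 1) * var0 + (n1 - 1) * var1) / (n0 + n1 - 2)
    y = mean1 - mean0
    s2 = (1.0 / n0 + 1.0 / n1) * pooled_var
    return y, s2
```

For example, two arms of 10 participants with sample variance 4 and means 1 and 3 give $y_i=2$ and $s_i^2=0.2\times 4=0.8$.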

Another commonly used effect measure for continuous outcomes is the SMD, because this unit-free measure permits different scales in the collected studies and is deemed more comparable across studies.22 The SMD effect measure is $\mu_i=(\mu_{i1}-\mu_{i0})/\gamma_i$. Known as Cohen’s d, it is frequently estimated as $y_i=(\bar y_{i1}-\bar y_{i0})/s_{iP}$. The exact within-study variance of Cohen’s d can be derived as a complicated form of gamma functions,23 but researchers often use simpler forms to approximate it.24–26 For example, $s_i^2=\frac{1}{n_{i0}}+\frac{1}{n_{i1}}+\frac{y_i^2}{2(n_{i0}+n_{i1}-2)}$. As $s_i^2$ depends on $y_i$, they are correlated. The correlation may increase as the sample sizes decrease, because the coefficient of $y_i^2$ in the formula increases. Cohen’s d is known to be biased in small samples.24 Therefore, we do not consider it further. Instead, we study the bias-corrected estimator Hedges’ g, which is usually adopted when sample sizes are small. Suggested by Hedges and Olkin,24(p86) it is computed as $y_i=\left[1-\frac{3}{4(n_{i0}+n_{i1})-9}\right]\frac{\bar y_{i1}-\bar y_{i0}}{s_{iP}}$ with an estimated variance $s_i^2=\frac{1}{n_{i0}}+\frac{1}{n_{i1}}+\frac{y_i^2}{2(n_{i0}+n_{i1})}$.
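The Hedges' g calculation described above can be sketched as follows. The function name `hedges_g` is hypothetical, and the variance uses the specific approximation quoted in this paragraph rather than any of the alternative formulas.

```python
import math


def hedges_g(n0, mean0, var0, n1, mean1, var1):
    """Return (y_i, s_i^2) for Hedges' g from arm-level summary statistics."""
    n = n0 + n1
    s_pooled = math.sqrt(((n0 - 1) * var0 + (n1 - 1) * var1) / (n - 2))
    d = (mean1 - mean0) / s_pooled            # Cohen's d
    g = (1.0 - 3.0 / (4.0 * n - 9.0)) * d     # small-sample bias correction
    s2 = 1.0 / n0 + 1.0 / n1 + g * g / (2.0 * n)
    return g, s2
```

Because `s2` is computed from `g`, the two outputs are correlated by construction, which is the dependence noted in the text.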

Besides this formula for the within-study variance, Lin and Aloe27 summarized many other formulas. Using different formulas can result in different estimates of the overall SMD, but this topic is beyond the scope of this paper. As with Cohen’s d, the observed data $y_i$ and $s_i^2$ are also correlated when using Hedges’ g as the effect measure, which may affect the estimation results of meta-analyses.28

2.3. Meta-analysis with a binary outcome

Suppose a $2\times 2$ table is available from each collected study in a meta-analysis with a binary outcome (i.e. individual-level outcomes are reported from $k$ studies). Denote $n_{i00}$ and $n_{i01}$ as the numbers of participants without and with an event in the control group, respectively; $n_{i10}$ and $n_{i11}$ are the corresponding data cells in the treatment group. The sample sizes in the control and treatment groups are $n_{i0}=n_{i00}+n_{i01}$ and $n_{i1}=n_{i10}+n_{i11}$. Also, denote $p_{i0}$ and $p_{i1}$ as the population event rates in the two groups.

The odds ratio (OR) is frequently used as the effect measure for a binary outcome; its true value in study $i$ is $OR_i=p_{i1}(1-p_{i0})/[p_{i0}(1-p_{i1})]$. Using individual-level data, the OR is estimated by $\widehat{OR}_i=n_{i00}n_{i11}/(n_{i01}n_{i10})$. The ORs are usually combined on a logarithmic scale in meta-analyses, because the distribution of the estimated log OR, $y_i=\log\widehat{OR}_i$, is better approximated by a normal distribution. The within-study variance of $y_i$ is estimated as $s_i^2=\frac{1}{n_{i00}}+\frac{1}{n_{i01}}+\frac{1}{n_{i10}}+\frac{1}{n_{i11}}$.

Moreover, the risk ratio (RR) and risk difference (RD) are also popular effect metrics, but they are not discussed in this article. Although RRs are more interpretable measures of association for clinicians,29,30 the debate continues over the merits of the OR versus RR and their interpretations.31,32 Doi et al.33 argued that RRs should no longer be used in meta-analyses, because the RR depends on prevalence more so than on the strength of exposure-outcome association that it is supposed to reflect. Specifically, the RR is a ratio of two conditional probabilities that vary with outcome prevalence, whereas the OR is a true effect magnitude measure representing the multiplicative increase in odds of outcome from an unexposed state to an exposed state. The RD can be easily computed from the OR with a fixed baseline risk. When generating simulated meta-analyses for RDs and RRs under the RE model, it is unrealistic to expect $p_{i0}$ and $p_{i1}$ to stay naturally within the range [0, 1] if the true overall effect size is given. This is because the normality assumption $\mu_i\overset{\text{iid}}{\sim}N(\mu,\tau^2)$ can generate extreme values when $\tau$ is non-zero. For example, if a true RD of study $i$ is simulated from $N(0.2, 0.2)$ as $\mu_i=0.8$, then $p_{i1}=p_{i0}+\mu_i$ will exceed 1 whenever $p_{i0}$ is fixed at a value larger than 0.2. To overcome this issue, an alternative is to truncate such improper probabilities so that they lie between 0 and 1, but this constraint can produce bias that cannot be distinguished from the bias caused by sampling error.28,34 Thus, the undesired effect of bounding the probabilities can be problematic, and it is inevitable when conducting simulation studies for RRs and RDs. Although some meta-analysts have explored other models to simulate data, there is still no general method that fixes the bias problem and is well accepted in the literature. Bakbergenuly et al.34 evaluated the performance of a number of data-generating models, such as the binomial generalized linear mixed model with logit link function and the beta-binomial model, when effects are RRs. No gold standard appears to have emerged, and they encouraged future research to explore this topic. Therefore, we focus on analyzing the results of ORs when studies return binary outcomes.

When sample sizes are small, some data cells may be 0, even if the event is not rare. In general, if a $2\times 2$ table contains zero cells, a fixed value of 0.5 is added to each data cell to reduce bias and avoid computational errors.35–37 Although this continuity correction may not be optimal in some cases and alternative corrections can be used,38–41 we add 0.5 to each cell in the following sections unless stated otherwise.
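The log OR, its variance, and the 0.5 continuity correction can be combined in a short sketch. The function name `log_odds_ratio` is hypothetical; the cell order follows the notation of this section (control without event, control with event, treatment without event, treatment with event).

```python
import math


def log_odds_ratio(n00, n01, n10, n11):
    """Return (y_i, s_i^2) for the log odds ratio of one 2x2 table.

    If any cell is zero, 0.5 is added to every cell before computing.
    """
    cells = [float(n00), float(n01), float(n10), float(n11)]
    if any(c == 0 for c in cells):
        cells = [c + 0.5 for c in cells]
    c00, c01, c10, c11 = cells
    y = math.log((c00 * c11) / (c01 * c10))
    s2 = 1.0 / c00 + 1.0 / c01 + 1.0 / c10 + 1.0 / c11
    return y, s2
```

Without the correction, a zero cell would make either the logarithm or one of the reciprocal terms undefined.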

3. Estimates of the I2 statistic

3.1. Point estimates

Because point estimates of the between-study variance $\tau^2$ are used to calculate $I^2$ intervals, we first introduce these point estimators. As $I^2$ depends on $\tau^2$, these estimators further lead to point estimates of $I^2$.

3.1.1. Method-of-moments approach

The estimator of $\tau^2$ can be derived from the method-of-moments approach, which is based on the generalized Q statistic,42 $Q_a=\sum_i a_i(y_i-\hat\mu_a)^2$, where $a_i$ represents the weight assigned to study $i$ and $\hat\mu_a=\sum_i a_iy_i/\sum_i a_i$. By equating $Q_a$ to its expected value, the general formula for the heterogeneity variance can be derived as

$$\hat\tau^2=\max\left\{0,\ \frac{Q_a-\left(\sum_i a_is_i^2-\sum_i a_i^2s_i^2\big/\sum_i a_i\right)}{\sum_i a_i-\sum_i a_i^2\big/\sum_i a_i}\right\}. \tag{2}$$

The DerSimonian–Laird (DL) estimator uses the CE model weights $a_i=w_{i,CE}$, leading to43:

$$\hat\tau^2_{DL}=\max\left\{0,\ \frac{\sum_i w_{i,CE}(y_i-\hat\mu_{CE})^2-(k-1)}{\sum_i w_{i,CE}-\sum_i w_{i,CE}^2\big/\sum_i w_{i,CE}}\right\}$$

Note that the untruncated DL estimate can be negative; it is truncated to zero in such cases.
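Equation (2) and its DL special case can be sketched in one function. The name `moment_tau2` and the optional `a` argument are assumptions for illustration; with the default weights $a_i=1/s_i^2$ the null expectation term reduces to $k-1$, which gives the DL estimator.

```python
def moment_tau2(y, s2, a=None):
    """Method-of-moments estimate of tau^2 from equation (2).

    y: study effects; s2: within-study variances; a: optional weights.
    a=None uses the common-effect weights 1/s_i^2 (DerSimonian-Laird).
    """
    if a is None:
        a = [1.0 / v for v in s2]
    sum_a = sum(a)
    mu_a = sum(ai * yi for ai, yi in zip(a, y)) / sum_a
    q_a = sum(ai * (yi - mu_a) ** 2 for ai, yi in zip(a, y))
    # Expected value of Q_a when tau^2 = 0
    expected_null = (sum(ai * v for ai, v in zip(a, s2))
                     - sum(ai * ai * v for ai, v in zip(a, s2)) / sum_a)
    # Coefficient of tau^2 in E(Q_a)
    scale = sum_a - sum(ai * ai for ai in a) / sum_a
    return max(0.0, (q_a - expected_null) / scale)
```

For four studies with effects 0, 1, 2, 3 and unit variances, $Q=5$, $k-1=3$, and the denominator is 3, so the DL estimate is $2/3$.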

3.1.2. Sidik–Jonkman (SJ) method

Sidik and Jonkman44 proposed a two-step estimator producing positive $\tau^2$ estimates

$$\hat\tau^2_{SJ}=\frac{1}{k-1}\sum_i\frac{1}{1+s_i^2/\hat\tau_0^2}\,(y_i-\hat\mu_{SJ})^2$$

where $\hat\tau_0^2=\sum_i(y_i-\bar y)^2/k$ is the initial heterogeneity variance estimate and $\hat\mu_{SJ}$ is calculated as in equation (1) with weights $1/(1+s_i^2/\hat\tau_0^2)$.
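The two steps of the SJ estimator translate directly into code. This is a sketch only; the function name `sidik_jonkman_tau2` is hypothetical, and raising an error when all effects are identical is a choice made here, not something stated in the article.

```python
def sidik_jonkman_tau2(y, s2):
    """Two-step Sidik-Jonkman estimate of tau^2."""
    k = len(y)
    y_bar = sum(y) / k
    # Step 1: crude initial estimate tau_0^2
    tau0 = sum((yi - y_bar) ** 2 for yi in y) / k
    if tau0 == 0.0:
        raise ValueError("all effects are identical; tau_0^2 is zero")
    # Step 2: weighted residual sum of squares with weights 1/(1 + s_i^2/tau_0^2)
    w = [1.0 / (1.0 + v / tau0) for v in s2]
    mu_sj = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(wi * (yi - mu_sj) ** 2 for wi, yi in zip(w, y)) / (k - 1)
```

Because every term in the final sum is non-negative, no truncation at zero is needed, in contrast to the DL estimator.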

3.1.3. Restricted maximum likelihood (REML) method

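Based on the marginal distribution of the RE model, $y_i\sim N(\mu,s_i^2+\tau^2)$, the maximum likelihood (ML) estimate $\hat\tau^2_{ML}$ is obtained by maximizing the log-likelihood function:

$$l(\mu,\tau^2)=-\frac{k}{2}\log(2\pi)-\frac{1}{2}\sum_i\log(s_i^2+\tau^2)-\frac{1}{2}\sum_i\frac{(y_i-\mu)^2}{s_i^2+\tau^2}$$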

To derive the REML estimator, the above log-likelihood function is transformed to exclude the parameter μ.45 By doing so, REML avoids assuming μ is known and is therefore thought to be an improvement on the ML estimator.46 The modified log-likelihood function is

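$$l_R(\tau^2)=-\frac{k}{2}\log(2\pi)-\frac{1}{2}\sum_i\log(s_i^2+\tau^2)-\frac{1}{2}\sum_i\frac{[y_i-\hat\mu_{RE}(\hat\tau^2_{ML})]^2}{s_i^2+\tau^2}-\frac{1}{2}\log\left(\sum_i\frac{1}{s_i^2+\tau^2}\right)$$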

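By maximizing this modified log-likelihood function with respect to $\tau^2$, the formula of the between-study variance estimate is

$$\hat\tau^2_{REML}=\max\left\{0,\ \frac{\sum_i a_i^2\left[(y_i-\hat\mu_{RE}(\hat\tau^2_{ML}))^2-s_i^2\right]}{\sum_i a_i^2}+\frac{1}{\sum_i a_i}\right\}$$

where $a_i=1/(s_i^2+\hat\tau^2_{REML})$. Because $a_i$ itself depends on $\hat\tau^2_{REML}$, the REML estimate is calculated with an iterative scheme. The Fisher scoring algorithm is used for the iteration of the REML estimates in this article, as implemented in the R package “metafor.”47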
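To illustrate why iteration is needed, the sketch below applies the displayed formula repeatedly as a plain fixed-point update. This is not the Fisher scoring algorithm used by "metafor"; it is a simpler substitute chosen for readability. The function name `reml_tau2`, the starting value, and the use of the current iterate of $\tau^2$ when computing the pooled mean are all assumptions made here.

```python
def reml_tau2(y, s2, tol=1e-10, max_iter=1000):
    """Fixed-point iteration for the REML estimate of tau^2 (illustrative only)."""
    k = len(y)
    y_bar = sum(y) / k
    # Crude starting value: sample variance of y minus the average s_i^2
    tau2 = max(0.0, sum((yi - y_bar) ** 2 for yi in y) / (k - 1) - sum(s2) / k)
    for _ in range(max_iter):
        a = [1.0 / (v + tau2) for v in s2]
        sum_a = sum(a)
        mu = sum(ai * yi for ai, yi in zip(a, y)) / sum_a
        num = sum(ai * ai * ((yi - mu) ** 2 - v) for ai, yi, v in zip(a, y, s2))
        new_tau2 = max(0.0, num / sum(ai * ai for ai in a) + 1.0 / sum_a)
        if abs(new_tau2 - tau2) < tol:
            return new_tau2
        tau2 = new_tau2
    return tau2  # may not have converged within max_iter
```

In practice a tested implementation, such as the one in the package cited above, should be preferred; the sketch has no safeguards against slow convergence.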

3.2. Interval estimates

3.2.1. Interval estimates for I2 based on the Q statistic

The $I^2$ statistic originates from the Q statistic by assuming that within-study variances are equal (i.e. $s_i^2=\sigma^2$) and by equating the observed $Q$ with its expectation, so we have

$$I^2=\frac{\tau^2}{\tau^2+\sigma^2}=\frac{Q-(k-1)}{Q} \tag{3}$$

which is a function of $Q$.48 A widely used truncation (i.e. $I^2$ is set to 0 if $Q\le k-1$) is applied because conceptually the $I^2$ statistic should be non-negative. Using equation (3), the $I^2$ interval estimate can be calculated by evaluating quantiles from the cumulative distribution function (CDF) of the Q statistic (i.e. $F_Q$). Biggerstaff and Jackson21 (BJ) developed three approaches to approximate the distribution of $Q$ under the RE model.

The two-moment gamma approximation of $F_Q$, with shape parameter $\alpha$ and scale parameter $\beta$, is obtained by matching the first two moments of the gamma and $Q$ distributions. Explicit expressions for the mean and variance of $Q$ are

$$E(Q)=(k-1)+\left(S_1-\frac{S_2}{S_1}\right)\tau^2$$
$$Var(Q)=2(k-1)+4\left(S_1-\frac{S_2}{S_1}\right)\tau^2+2\left(S_2+\frac{S_2^2}{S_1^2}-\frac{2S_3}{S_1}\right)\tau^4$$

where $S_r=\sum_i w_{i,CE}^r$. The proof for the two formulas is given by Biggerstaff and Tweedie.49 Using any non-negative estimate of $\tau^2$ in the above two formulas, the first two moments of $Q$ can be estimated, denoted as $\hat E(Q)$ and $\widehat{Var}(Q)$. Therefore, solving the equations $E(Q)=\alpha\beta$ and $Var(Q)=\alpha\beta^2$ with the estimated values plugged in gives $\hat\alpha=[\hat E(Q)]^2/\widehat{Var}(Q)$ and $\hat\beta=\widehat{Var}(Q)/\hat E(Q)$. The $F_Q$ is then approximated by computing the gamma CDF with $\hat\alpha$ and $\hat\beta$.
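The moment matching above is plain arithmetic and can be sketched as follows. The function name `q_gamma_parameters` is hypothetical; evaluating the gamma CDF itself is left to a statistics library.

```python
def q_gamma_parameters(s2, tau2):
    """Return (E(Q), Var(Q), shape, scale) for the two-moment gamma approximation.

    s2: within-study variances; tau2: a non-negative estimate of tau^2.
    """
    k = len(s2)
    w = [1.0 / v for v in s2]
    sum1 = sum(w)
    sum2 = sum(wi ** 2 for wi in w)
    sum3 = sum(wi ** 3 for wi in w)
    c = sum1 - sum2 / sum1
    mean_q = (k - 1) + c * tau2
    var_q = (2 * (k - 1) + 4 * c * tau2
             + 2 * (sum2 + sum2 ** 2 / sum1 ** 2 - 2 * sum3 / sum1) * tau2 ** 2)
    shape = mean_q ** 2 / var_q
    scale = var_q / mean_q
    return mean_q, var_q, shape, scale
```

As a check, five studies with $s_i^2=1$ and $\tau^2=1$ give $Q=2\chi^2_4$ exactly, whose mean is 8 and variance is 32; the function returns shape 2 and scale 4, which is that same distribution.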

The Pearson type III distribution provides an extension of the previous two-moment gamma approximation by adding the third central moment (TCM) of $Q$, which is derived similarly to the variance of $Q$:

$$TCM(Q)=E\left[(Q-E(Q))^3\right]=8(k-1)+24\left(S_1-\frac{S_2}{S_1}\right)\tau^2+24\left(S_2-\frac{2S_3}{S_1}+\frac{S_2^2}{S_1^2}\right)\tau^4+8\left(S_3-\frac{3S_4}{S_1}+\frac{3S_2S_3}{S_1^2}-\frac{S_2^3}{S_1^3}\right)\tau^6$$

Matching all three moments with parameters of the Pearson type III distribution, emphasizing the dependence on $\tau^2$, gives

$$r(\tau^2)=\frac{4\,Var(Q)^3}{TCM(Q)^2},\quad \theta(\tau^2)=\frac{2\,Var(Q)}{TCM(Q)},\quad\text{and}\quad \gamma(\tau^2)=E(Q)-\frac{2\,Var(Q)^2}{TCM(Q)}$$

Therefore, the approximation of $F_Q$ can easily be calculated from the Pearson type III CDF with location parameter $\gamma(\tau^2)$, shape parameter $r(\tau^2)$, and rate parameter $\theta(\tau^2)$. This approximation can be obtained by plugging $\hat\tau^2$ into the three parameters. Note that although the three-moment Pearson type III approximation is intended as an improvement on the two-moment gamma approximation as it matches a further moment, it has support $[\gamma(\hat\tau^2),\infty)$; hence it is not appropriate to approximate $F_Q$ when values of $Q$ are extremely small, especially if those values are less than $\gamma(\hat\tau^2)$.
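The three Pearson type III parameters follow from the moments by direct substitution, as in this sketch. The function name `q_pearson3_parameters` is hypothetical, and the CDF evaluation (a gamma CDF shifted by the location) is not included.

```python
def q_pearson3_parameters(s2, tau2):
    """Return (shape r, rate theta, location gamma) matched to three moments of Q."""
    k = len(s2)
    w = [1.0 / v for v in s2]
    sum1 = sum(w)
    sum2 = sum(wi ** 2 for wi in w)
    sum3 = sum(wi ** 3 for wi in w)
    sum4 = sum(wi ** 4 for wi in w)
    c1 = sum1 - sum2 / sum1
    c2 = sum2 - 2 * sum3 / sum1 + sum2 ** 2 / sum1 ** 2
    c3 = sum3 - 3 * sum4 / sum1 + 3 * sum2 * sum3 / sum1 ** 2 - sum2 ** 3 / sum1 ** 3
    mean_q = (k - 1) + c1 * tau2
    var_q = 2 * (k - 1) + 4 * c1 * tau2 + 2 * c2 * tau2 ** 2
    tcm_q = 8 * (k - 1) + 24 * c1 * tau2 + 24 * c2 * tau2 ** 2 + 8 * c3 * tau2 ** 3
    shape = 4 * var_q ** 3 / tcm_q ** 2
    rate = 2 * var_q / tcm_q
    location = mean_q - 2 * var_q ** 2 / tcm_q
    return shape, rate, location
```

With equal within-study variances the distribution of $Q$ is a scaled chi-squared, so the matched location is 0 and the Pearson type III approximation coincides with the two-moment gamma approximation.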

A further approximation expected to be more accurate in the tails of the distribution is the saddlepoint approximation, given in the present case by Kuonen50 using the Barndorff–Nielsen formulation. This requires the cumulant generating function of $Q$, denoted by $K(s)$, and its first two derivatives, given by

$$K(s)=-\frac{1}{2}\sum_{i=1}^{k-1}\log(1-2\lambda_is),\quad K^{(1)}(s)=\sum_{i=1}^{k-1}\frac{\lambda_i}{1-2\lambda_is},\quad\text{and}\quad K^{(2)}(s)=2\sum_{i=1}^{k-1}\left(\frac{\lambda_i}{1-2\lambda_is}\right)^2$$

where $s<1/(2\lambda_1)$ and $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_{k-1}\ge 0$ with $\lambda_k=0$ are the ordered eigenvalues of $S=\Sigma^{1/2}A\Sigma^{1/2}$. Here, $\Sigma$ is the diagonal matrix with entries $s_i^2+\tau^2$, and $A=W-\frac{1}{\sum_i w_{i,CE}}\mathbf{w}\mathbf{w}^t$, where $W$ is the diagonal matrix containing the $w_{i,CE}=1/s_i^2$, $\mathbf{w}$ is the vector containing the $w_{i,CE}$, and the superscript $t$ denotes matrix transpose. Plugging in the $\tau^2$ estimate, the saddlepoint approximating CDF $F_S(x)\approx P(Q\le x)$ can be calculated in two steps. First, we solve the equation $K^{(1)}(\hat s)=x$ for $\hat s$, the solution referred to as the saddlepoint. Next, we compute $a=\mathrm{sign}(\hat s)\sqrt{2[\hat sx-K(\hat s)]}$ and $b=\hat s\sqrt{K^{(2)}(\hat s)}$. The saddlepoint approximation is then given by $F_S(x)=\Phi\left(a+\frac{1}{a}\log\frac{b}{a}\right)$, where $\Phi$ is the standard normal CDF. Our objective is to obtain the $I^2$ interval through evaluating the quantiles of the Q statistic, so $F_S(x)$ is used to estimate $x$ for given probabilities (e.g. 0.025 and 0.975).
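The two-step saddlepoint calculation can be sketched as below, taking the eigenvalues as given (computing them from $S$ would need a linear-algebra library). The function name `saddlepoint_cdf`, the bisection solver, and the decision to raise an error near $\hat s=0$ are choices made for this illustration, not details from the article.

```python
import math


def saddlepoint_cdf(x, eigenvalues):
    """Saddlepoint approximation to P(Q <= x) for Q = sum_i lambda_i * chi2_1."""
    lam = [l for l in eigenvalues if l > 0]   # zero eigenvalues contribute nothing
    if x <= 0:
        return 0.0

    def K(s):
        return -0.5 * sum(math.log(1 - 2 * l * s) for l in lam)

    def K1(s):
        return sum(l / (1 - 2 * l * s) for l in lam)

    def K2(s):
        return 2 * sum((l / (1 - 2 * l * s)) ** 2 for l in lam)

    # Step 1: solve K1(s_hat) = x by bisection; K1 is increasing on s < 1/(2*lambda_1).
    hi = (1 - 1e-12) / (2 * max(lam))
    lo = -1.0
    while K1(lo) > x:
        lo *= 2
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if K1(mid) < x:
            lo = mid
        else:
            hi = mid
    s_hat = 0.5 * (lo + hi)
    if abs(s_hat) < 1e-8:
        # x is at the mean of Q, where the formula below is 0/0; not handled here.
        raise ValueError("x too close to E(Q) for this sketch")

    # Step 2: Barndorff-Nielsen form Phi(a + log(b/a)/a)
    a = math.copysign(math.sqrt(2 * (s_hat * x - K(s_hat))), s_hat)
    b = s_hat * math.sqrt(K2(s_hat))
    z = a + math.log(b / a) / a
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

When all non-zero eigenvalues equal 1, $Q$ is exactly $\chi^2_{k-1}$, which gives a convenient sanity check: for four unit eigenvalues the approximation at $x=2$ is close to the exact value $1-2e^{-1}\approx 0.264$.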

The test-based method51 provides another way to compute the CI of $Q$, and hence of the $I^2$ statistic.16 Appendix A2 of Higgins and Thompson4 describes in detail how to use the test-based method to calculate the 95% CI of the heterogeneity measure

$$H=\sqrt{\frac{Q}{k-1}}$$

where $H$ is defined to be 1 whenever $Q\le k-1$. Because $I^2=1-1/H^2$ is a monotone-increasing function of $H^2$, we briefly present results that are used to estimate the 95% CI of $H$ here, and the corresponding 95% CI for $I^2$ can be readily calculated via the relationship between $I^2$ and $H^2$. The logarithm of $Q$ is used in this method to remove some of the skew inherent in the distribution of $Q$. A test-based standard error of $\log(H)$ is

$$SE[\log(H)]=\begin{cases}\sqrt{\dfrac{1}{2(k-2)}\left[1-\dfrac{1}{3(k-2)^2}\right]}, & Q\le k\\[2ex] \dfrac{\log(Q)-\log(k-1)}{2\left(\sqrt{2Q}-\sqrt{2k-3}\right)}, & Q>k\end{cases}$$

Then a 95% CI for $H$ follows as $\exp\{\log(H)\pm 1.96\,SE[\log(H)]\}$. Therefore, a test-based interval estimate for $I^2$ is constructed using the lower and upper bounds of $H$.

The non-parametric bootstrap CI of $I^2$ can be obtained by sampling $k$ studies with replacement from the observed pairs $(y_i,s_i^2)$, and $I^2$ is estimated for each bootstrap sample using equation (3). Repeating the process $B$ (e.g. 1000) times, a 95% CI is given by the 2.5th and 97.5th percentiles of the $B$ $I^2$ values.
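A minimal sketch of this resampling scheme is shown below. The function name `bootstrap_i2_ci`, the fixed seed, and the simple order-statistic rule used for the percentiles are assumptions for illustration.

```python
import random


def bootstrap_i2_ci(y, s2, n_boot=1000, seed=0):
    """Non-parametric bootstrap percentile interval for I^2 (0-1 scale)."""
    rng = random.Random(seed)
    k = len(y)

    def i2(ys, vs):
        w = [1.0 / v for v in vs]
        mu = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
        q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, ys))
        return 0.0 if q <= k - 1 else (q - (k - 1)) / q   # equation (3), truncated

    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(k) for _ in range(k)]        # resample study pairs
        values.append(i2([y[i] for i in idx], [s2[i] for i in idx]))
    values.sort()
    lower = values[int(0.025 * n_boot)]
    upper = values[max(0, int(0.975 * n_boot) - 1)]
    return lower, upper
```

Each pair $(y_i,s_i^2)$ is resampled together, so no distributional assumption on the effects is needed, but with small $k$ many bootstrap samples contain repeated studies.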

In sum, five methods to estimate $I^2$ intervals are summarized in this subsection. It should be noted that the three methods using the approximated $F_Q$ need first to estimate $\tau^2$, whereas the other two methods do not depend on $\hat\tau^2$, as shown in Table 1.

Table 1.

Summary of the five methods to calculate confidence intervals for $I^2$ based on the Q statistic.

| Methods to estimate $I^2$ intervals | Use the estimated $\tau^2$ |
|---|---|
| Biggerstaff and Jackson’s approximated CDF of Q: two-moment gamma (TMG) | Yes |
| Biggerstaff and Jackson’s approximated CDF of Q: Pearson type III distribution (PIII) | Yes |
| Biggerstaff and Jackson’s approximated CDF of Q: saddlepoint approximation (S) | Yes |
| Test-based approach (T) | No |
| Non-parametric bootstrap (NPBS) | No |

Yes: the method requires the estimated between-study variance; No: the method does not require it.

3.2.2. Interval estimates for I2 based on the between-study variance

Consider the DL estimate of the between-study variance $\tau^2$. The Q statistic can be expressed in terms of $\hat\tau^2$ via equation (2) when $a_i=w_{i,CE}$. Note that $Q=k-1$ when $\hat\tau^2=0$; this matches the widely used truncation in which $Q$ is set to $k-1$ if $Q\le k-1$. Replacing the Q statistic in equation (3) with its expression in terms of $\hat\tau^2$, $I^2$ can be expressed as a function of the estimated between-study variance

$$I^2=\frac{\hat\tau^2}{\hat\tau^2+\dfrac{(k-1)\sum_i s_i^{-2}}{\left(\sum_i s_i^{-2}\right)^2-\sum_i s_i^{-4}}} \tag{4}$$

In this expression, the summary of the within-study variance (i.e. the moment-based sampling error) is

$$\hat\sigma^2=\frac{(k-1)\sum_i s_i^{-2}}{\left(\sum_i s_i^{-2}\right)^2-\sum_i s_i^{-4}}$$

However, treating $I^2$ as a function of $\hat\tau^2$ relies on the accuracy of the summary estimate $\hat\sigma^2$, because the calculation or interpretation of the $I^2$ statistic can be seriously distorted if $\hat\sigma^2$ provides a misleading estimate.52 We nevertheless use the moment-based sampling error throughout this article because it is consistent with the definition of $I^2$. Improving the summary estimate of the within-study variance will be explored in our future research. For a given meta-analysis, interval estimates of $I^2$ can be calculated from interval estimates of the between-study variance via the monotone-increasing function $I^2(\hat\tau^2)$ in equation (4). Therefore, calculating the CI for $I^2$ is one step beyond estimating the CI for $\tau^2$. Researchers have conducted comprehensive overviews to compare estimation methods for $\tau^2$ and its uncertainty.53,54 Although equation (4) is derived using the DL estimate of $\tau^2$, other methods can be used to calculate $\hat\tau^2$ because different estimators aim to estimate the same true between-study variance. The three point estimators of $\tau^2$ in Section 3.1 and the following six interval estimators of $\tau^2$ are considered to obtain the CI for $\tau^2$, and thus the CI for $I^2$, in this article.
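The mapping in equation (4) from a $\tau^2$ value, or from the two ends of a $\tau^2$ interval, to $I^2$ can be sketched as follows. The function names are hypothetical.

```python
def typical_within_variance(s2):
    """Moment-based summary of the within-study variances (sigma-hat squared)."""
    k = len(s2)
    w = [1.0 / v for v in s2]
    return (k - 1) * sum(w) / (sum(w) ** 2 - sum(wi * wi for wi in w))


def i2_from_tau2(tau2, s2):
    """Equation (4): I^2 on the 0-1 scale as a function of tau^2."""
    return tau2 / (tau2 + typical_within_variance(s2))


def i2_interval(tau2_lower, tau2_upper, s2):
    """Map a CI for tau^2 to a CI for I^2; valid because the map is increasing."""
    return i2_from_tau2(tau2_lower, s2), i2_from_tau2(tau2_upper, s2)
```

For four studies with unit variances, $\hat\sigma^2=1$; plugging in the DL estimate $\hat\tau^2=2/3$ obtained from $Q=5$ gives $I^2=0.4$, which agrees with $(Q-(k-1))/Q=2/5$ from equation (3).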

Specifically, we summarize interval estimation methods for the between-study variance below:

  • Profile likelihood CI of the REML estimator (PL-REML). The PL method55 is based on the log-likelihood function and is an iterative process that provides CIs for the between-study variance, considering the fact that $\mu$ needs to be estimated as well. The 95% PL CI for $\tau^2$ consists of the values that are not rejected by the likelihood ratio test with $\tau^2$ under the null hypothesis. For the REML estimator, the $\tau^2$ values in the CI are obtained by solving
    $$l_R(\tau^2)>l_R(\hat\tau^2_{REML})-\frac{1}{2}\chi^2_{1,0.95}$$
    where $\chi^2_{1,0.95}=3.841$ is the 95th percentile of the $\chi^2_1$ distribution. The method produces wide CIs with very high coverage probabilities when $\tau^2=0$, and the coverage probabilities decrease to the nominal level as $\tau^2$ increases.56
  • Q-profile CI (QP). The QP method is based on the generalized Q statistic ($Q_a$ in Section 3.1) with $a_i=1/(s_i^2+\tau^2)$, which follows $\chi^2_{k-1}$. Viechtbauer56 shows that the Q-profile CI is obtained by iteratively solving $Q_a(\tilde\tau_L^2)=\chi^2_{k-1,0.975}$ and $Q_a(\tilde\tau_U^2)=\chi^2_{k-1,0.025}$, where $\tilde\tau_L^2$ and $\tilde\tau_U^2$ are the lower and upper confidence limits, respectively. The corresponding CIs have been shown to achieve nominal coverage probabilities even in small samples.56 However, the estimated within-study variance $s_i^2$ is not the true within-study variance $\sigma_i^2$ for each study. Therefore, in practice, the generalized Q statistic no longer follows the assumed chi-squared distribution. This method is implemented in the R package “metafor” as the default approach to compute the CI for $\tau^2$.

  • Biggerstaff and Jackson CI (BJ). Using the CDF of $Q$, $F_Q(x;\tau^2)$, Biggerstaff and Jackson21 proposed a method to calculate a 95% CI for the between-study variance by obtaining the solutions of the equations:
    $$\left(1-F_Q(x;\tau^2)=0.025,\ F_Q(x;\tau^2)=0.025\right)$$
    where the first equation is solved for the lower limit and the second for the upper limit. When $F_Q(x;\tau^2=0)<0.025$, the interval is set as [0, 0]. If $1-F_Q(x;\tau^2=0)>0.025$, the lower bound of the CI is set equal to 0. The CDF $F_Q(x;\tau^2)$ may be calculated using the algorithm by Farebrother57 for the positive linear combination of chi-squared random variables.
  • Jackson CI (J). An extension of the BJ CI is suggested by Jackson58 using $Q_a$. The generalized statistic $Q_a$ has been shown to be distributed as a linear combination of $\chi^2$ random variables, so that methods like BJ can be used. The CDF of $Q_a$, $F_{Q_a}(x;\tau^2)$, is a continuous and strictly decreasing function of $\tau^2$. The 95% CI of $\tau^2$ is obtained as:
    $$\left(1-F_{Q_a}(x;\tau^2)=0.025,\ F_{Q_a}(x;\tau^2)=0.025\right)$$
    When $F_{Q_a}(x;\tau^2=0)<0.025$, the interval is set as [0, 0]. If $1-F_{Q_a}(x;\tau^2=0)>0.025$, the lower bound of the CI is set equal to 0. For moderate $\tau^2$, Jackson recommends using the J interval with weights $a_i=1/s_i$, which are used in this article. The BJ and J CIs for $\tau^2$ are calculated using the R code provided by Jackson.58
  • Sidik and Jonkman CI (SJ). Sidik and Jonkman44 propose a method based on the SJ estimator with the 2.5th and 97.5th percentiles of the $\chi^2_{k-1}$ distribution:
    $$\left(\frac{(k-1)\hat\tau^2_{SJ}}{\chi^2_{k-1,0.975}},\ \frac{(k-1)\hat\tau^2_{SJ}}{\chi^2_{k-1,0.025}}\right)$$
    As $\hat\tau^2_{SJ}$ takes non-negative values, the interval is also non-negative. Simulation studies indicate that the SJ intervals have very poor coverage probability when $\tau^2$ is small, but as $k$ and $\tau^2$ increase the coverage probability becomes close to the nominal value.44,56
  • Bootstrap CI. For any consistent and non-negative estimator of $\tau^2$, parametric bootstrap CIs can be obtained by generating $k$ values from the distribution $y_i\sim N(\hat\mu_{RE}(\hat\tau^2),\hat\tau^2+s_i^2)$, where $\hat\tau^2$ is the between-study variance estimate and $\hat\mu_{RE}(\hat\tau^2)$ is given by equation (1). Next, the between-study variance is estimated based on the bootstrap sample. After repeating this process $B$ (e.g. 1000) times, the CI is constructed by taking the 2.5th and 97.5th percentiles of the distribution of $\hat\tau^2$ values. Non-parametric bootstrap CIs are obtained via a similar process, where $k$ studies are sampled with replacement from the observed pairs $(y_i,s_i^2)$. For each bootstrap sample, $\tau^2$ can be estimated using the same specified method (e.g. REML). Repeating the process $B$ times, a 95% CI is given by the 2.5th and 97.5th percentiles of the $B$ $\hat\tau^2$ values. The normal distribution assumption of observed effects is not required in the non-parametric bootstrap method, but its coverage performance has been doubted because of the substantial deviation from the nominal level in simulation studies.56
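Of the interval methods listed above, the Q-profile method is the simplest to sketch, since it only needs a root-finder for the decreasing function $Q_a(\tau^2)$. In the sketch below the chi-squared quantiles are supplied by the caller (for example from a table or a statistics library); the function names, the bisection solver, the search bound `tau2_max`, and the convention of returning 0 when $Q_a(0)$ is already below the target are assumptions made here.

```python
def generalized_q(y, s2, tau2):
    """Generalized Q statistic with weights 1 / (s_i^2 + tau2)."""
    a = [1.0 / (v + tau2) for v in s2]
    mu = sum(ai * yi for ai, yi in zip(a, y)) / sum(a)
    return sum(ai * (yi - mu) ** 2 for ai, yi in zip(a, y))


def q_profile_ci(y, s2, chi2_lower, chi2_upper, tau2_max=1e4):
    """Q-profile CI for tau^2.

    chi2_lower, chi2_upper: the 0.025 and 0.975 quantiles of the chi-squared
    distribution with k - 1 degrees of freedom, supplied by the caller.
    """
    def solve(target):
        if generalized_q(y, s2, 0.0) <= target:
            return 0.0                      # no positive root; limit set to 0
        lo, hi = 0.0, tau2_max
        for _ in range(200):                # bisection on a decreasing function
            mid = 0.5 * (lo + hi)
            if generalized_q(y, s2, mid) > target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # The larger quantile gives the lower limit and vice versa.
    return solve(chi2_upper), solve(chi2_lower)
```

The resulting limits for $\tau^2$ can then be converted to limits for $I^2$ through equation (4). For $k=5$ the required quantiles of $\chi^2_4$ are approximately 0.4844 and 11.1433.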

So far, for generic effect measures, multiple approaches to calculating 95% CIs of $I^2$ have been presented in two directions: based on the Q statistic or based on the between-study variance. Nevertheless, these methods have been criticized by researchers because they can be unreliable in real-world meta-analyses.8,59 For the Q-profile method, the assumption that the null distribution of the generalized Q statistic is $\chi^2_{k-1}$ may not be an accurate approximation, especially when study-specific sample sizes are small or moderate. The three methods based on the three approximations of $F_Q$ proposed by Biggerstaff and Jackson21 also require sufficiently large studies (i.e. large sample sizes of studies) and the assumption that effect sizes are normally distributed. For the test-based method, Hoaglin8 points out: (1) the CI involving the test-based standard error is valid only under the null hypothesis; (2) the standard normal approximation, $Z=\sqrt{2Q}-\sqrt{2k-3}$, used in the method requires “large” degrees of freedom (e.g. over 100); and (3) subtracting $\log(k-1)$ is not exactly the same as subtracting the mean of $\log(Q)$. Therefore, the test-based CI for $I^2$ can be unreliable in reflecting heterogeneity. Other approaches to improve the estimation of CIs for $\tau^2$, and thus for $I^2$, have been discussed recently.21,54,60 For example, Knapp et al.60 suggested a modified Q-profile method using a different weighting scheme for the generalized Q statistic to determine the lower bound of the interval for $\tau^2$, while the upper bound remains the same as that of the original Q-profile method. However, the improvement of the modified Q-profile method is subtle, and the weighting scheme for the lower bound is lacking when effect measures are SMDs.

To handle the problem that $\chi^2_{k-1}$ can be an inaccurate null distribution, Kulinskaya et al.18,20 proposed a series of methods, which provide appropriate CIs for $\tau^2$, by combining the Q-profile method with corrected null approximations of the Q statistic. The distribution of $Q$ under the null hypothesis of homogeneity depends on the statistics used to estimate the effects and the weights. Two methods to estimate CIs for $\tau^2$, and thus for $I^2$, are introduced for two effect measures, the SMD and the OR, as follows:

  • Kulinskaya–Dollinger–Bjørkestøl CI (KDB). When using Hedges’ g as the estimator of the SMD, Kulinskaya et al.18 derived $O(1/n)$ corrections to moments of $Q$ and suggested using the chi-squared distribution with degrees of freedom equal to the estimate of the corrected first moment, denoted by $\chi^2_{E(Q)}$, to approximate the distribution of $Q$. The detailed expression of $E(Q)$ is provided along with the R code by Kulinskaya et al.,18 and it is not presented here because the concrete form is complicated. The lower and upper confidence limits for $\tau^2$ can be calculated iteratively from the upper and lower quantiles of $\chi^2_{E(Q)}$, respectively:
    $$Q(\tau_L^2)=\chi^2_{E(Q),0.975},\quad Q(\tau_U^2)=\chi^2_{E(Q),0.025}$$
    Then, the corresponding CI for $I^2$ is obtained via equation (4).
  • Kulinskaya–Dollinger CI (KD). When effect measures are log ORs, Kulinskaya and Dollinger20 obtain corrected approximations for the mean and variance of the Q statistic under the null hypothesis. They then match those corrected moments to construct a gamma distribution that closely fits the null distribution of Q, and their simulations confirm that the gamma approximation outperforms the chi-squared approximation.20 The improved approximation blends theoretical derivation with simulation results. Let EKD(Q) denote the corrected expectation of Q when τ2=0. This corrected first moment can be written as
    EKD(Q)=k10.687[k1Eth(Q)]
    where Eth(Q) is a theoretical moment obtained from their general expansion of the mean of Q for arbitrary binary effect measures. The detailed expression of Eth(Q) is presented in Appendix B.3 of Kulinskaya and Dollinger.20 For large sample sizes, Eth(Q) converges to k-1. The corrected variance of Q, denoted by VarKD(Q), is a quadratic function of the corrected mean and it is calculated by
    VarKD(Q)=4.74(k1)12.17EKD(Q)+9.42k1[EKD(Q)]2
    Then, the shape parameter α of the gamma distribution approximating FQ is estimated by αˆ=EKD(Q)2/VarKD(Q), and the scale parameter β is estimated by βˆ=VarKD(Q)/EKD(Q). Therefore, the KD interval estimate of τ2 is obtained by iteratively solving:
    Q(τL2)=FQ,0.975,Q(τU2)=FQ,0.025
    The corresponding CI for I2 is calculated via equation (4). Unlike all other methods, the KD interval estimate of τ2 is based on the 2×2 table in which 0.5 is added to each cell regardless of whether zero cells exist; this adjustment is handled in the code.
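
The iterative step shared by the KDB and KD intervals, and the KD moment matching, can be sketched as follows. This is a minimal Python illustration, not the authors' R code. It assumes the generalized Q statistic has the usual inverse-variance form Q(τ2) = Σ(yi − μ̂(τ2))²/(si2 + τ2), and it reads the corrected mean and variance as E_KD(Q) = (k−1) − 0.687[(k−1) − E_th(Q)] and Var_KD(Q) = 4.74(k−1) − 12.17E_KD(Q) + 9.42[E_KD(Q)]²/(k−1). The function names, the toy data, and the target quantiles (those of a chi-squared distribution with 4 degrees of freedom, used only for illustration) are hypothetical.

```python
def generalized_q(y, v, tau2):
    """Generalized Q statistic at a trial tau^2 (assumed inverse-variance form)."""
    w = [1.0 / (vi + tau2) for vi in v]
    mu_hat = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(wi * (yi - mu_hat) ** 2 for wi, yi in zip(w, y))


def solve_tau2(y, v, target, upper=1e4, tol=1e-10):
    """Solve Q(tau^2) = target for tau^2 >= 0 by bisection.

    Q(tau^2) decreases as tau^2 grows, so the limit is set to 0 when
    Q(0) is already at or below the target quantile.
    """
    if generalized_q(y, v, 0.0) <= target:
        return 0.0
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if generalized_q(y, v, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def kd_gamma_params(k, e_th):
    """Moment-matched gamma (shape, scale) for Q when tau^2 = 0 (assumed reading)."""
    e_kd = (k - 1) - 0.687 * ((k - 1) - e_th)
    var_kd = 4.74 * (k - 1) - 12.17 * e_kd + 9.42 * e_kd ** 2 / (k - 1)
    return e_kd ** 2 / var_kd, var_kd / e_kd


# Toy data (hypothetical): five effect sizes and within-study variances.
y = [0.1, 0.5, 0.9, 0.3, 0.7]
v = [0.04, 0.05, 0.03, 0.06, 0.05]

# Illustrative target quantiles: chi-squared with 4 df at 0.975 and 0.025.
tau2_lower = solve_tau2(y, v, target=11.143)
tau2_upper = solve_tau2(y, v, target=0.484)

# With E_th(Q) = k - 1 the gamma approximation has mean k - 1.
shape, scale = kd_gamma_params(k=5, e_th=4.0)
```

For the KD and KDB intervals, the two targets would instead be the quantiles of the fitted gamma or corrected chi-squared distribution; computing those quantiles requires a statistical library and is left out of this sketch.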

Among the methods presented in this subsection, four (PL-REML, SJ, PBS-τ2, and NPBS-τ2) need to use the estimated τ2, whereas others can directly calculate interval estimates of τ2 without using the point estimate.

4. Simulation study

4.1. Simulation settings

We conducted simulation studies to investigate the performance of different interval estimators of the I2 statistic. We followed the framework of Morris et al.61 to design the simulations:

  • Aims. The primary aim is to compare the performance of the 95% CIs for I2 produced by different methods. The secondary aim is to compare three point estimators of I2.

  • Data-generating mechanisms. The number of studies in a simulated meta-analysis was set to k=5, 20, and 50. Denote n=(n1,…,nk), where ni represents the sample size of study i (i=1,…,k). When k=5, the vector n of sample sizes for an artificial meta-analysis was first fixed as (10, 20, 30, 40, 50), then increased to (50, 75, 100, 125, 150), and finally to (150, 250, 350, 450, 550). These three settings represent small, medium, and large sample sizes, respectively. When k=20, the sample size vector was specified as four replicates of the vector n used for k=5. For example, for 10≤ni≤50 and i=1,…,20, the sample size vector was formed by concatenating four copies of (10, 20, 30, 40, 50). Similarly, 10 replicates of the k=5 vector were used to construct the sample size vector when k=50. The control/treatment allocation ratio was set to 1:1 in all studies, which is commonly used in real-world applications. Specifically, ni0=ni1=ni/2, where ni0 participants were assigned to the control group and ni1 participants were assigned to the treatment group.

When effect measures were MDs, each participant's outcome measure was sampled from $N(\mu_{i0}, \gamma_i^2)$ in the control group or $N(\mu_{i0}+\mu_i, \gamma_i^2)$ in the treatment group. Without loss of generality, the baseline effect μi0 of study i was generated from N(0, 1). The study-specific standard deviation γi was sampled from U(1,5), and it was generated anew for each simulated meta-analysis. The MD μi was sampled from $N(\mu, \tau^2)$. Table 2 shows the specified values for the overall MD μ and the between-study standard deviation τ.

Table 2.

Vectors of the between-study standard deviation (τ) and specified values of the true overall effect size (μ).

| Range of the sample size ni | Overall mean difference: μ = 0 or 1 | Overall standardized mean difference: μ = 0 | Overall standardized mean difference: μ = 0.8 | Overall log odds ratio: μ = 0 | Overall log odds ratio: μ = 1 |
|---|---|---|---|---|---|
| 10 ≤ ni ≤ 50 | τ = (0.50, 1.10, 2.20) | τ = (0.18, 0.37, 0.73) | τ = (0.19, 0.38, 0.76) | τ = (0.37, 0.73, 1.46) | τ = (0.39, 0.78, 1.56) |
| 50 ≤ ni ≤ 150 | τ = (0.30, 0.60, 1.20) | τ = (0.10, 0.20, 0.40) | τ = (0.10, 0.21, 0.42) | τ = (0.20, 0.40, 0.80) | τ = (0.21, 0.43, 0.85) |
| 150 ≤ ni ≤ 550 | τ = (0.20, 0.30, 0.60) | τ = (0.05, 0.11, 0.21) | τ = (0.06, 0.11, 0.22) | τ = (0.11, 0.21, 0.43) | τ = (0.11, 0.23, 0.46) |

Given the range of study-specific sample sizes and the true overall effect size, a between-study standard deviation is chosen from one of three values in the corresponding vector, and it is used to generate meta-analyses.
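
The data-generating mechanism for MDs described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the article's simulations were written in R, the function name is hypothetical, each arm receives ni // 2 participants (so an odd ni loses one participant), and the within-study variance is computed from unpooled sample variances, which may differ from the si2 defined in Section 2.

```python
import random
import statistics


def simulate_md_meta(n, mu, tau, rng):
    """Simulate one meta-analysis of mean differences.

    n   : list of total study sample sizes (1:1 allocation assumed)
    mu  : overall mean difference
    tau : between-study standard deviation
    Returns (y, v): estimated MDs and their estimated variances.
    """
    y, v = [], []
    for ni in n:
        half = ni // 2                      # assumption: integer arm size
        mu_i0 = rng.gauss(0.0, 1.0)         # baseline effect ~ N(0, 1)
        gamma_i = rng.uniform(1.0, 5.0)     # study-specific SD ~ U(1, 5)
        mu_i = rng.gauss(mu, tau)           # true study MD ~ N(mu, tau^2)
        ctrl = [rng.gauss(mu_i0, gamma_i) for _ in range(half)]
        trt = [rng.gauss(mu_i0 + mu_i, gamma_i) for _ in range(half)]
        y.append(statistics.fmean(trt) - statistics.fmean(ctrl))
        # Assumption: unpooled variance of the difference in means.
        v.append(statistics.variance(trt) / half
                 + statistics.variance(ctrl) / half)
    return y, v


rng = random.Random(2024)
n_small = [10, 20, 30, 40, 50] * 4          # k = 20: four copies of the k = 5 vector
y, v = simulate_md_meta(n_small, mu=0.0, tau=0.5, rng=rng)
```

Passing a seeded `random.Random` instance keeps each simulated meta-analysis reproducible without touching the global random state.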

When effect measures were SMDs, each participant's outcome measure was generated from $N(\mu_{i0}, \gamma_i^2)$ in the control group or $N(\mu_{i1}, \gamma_i^2)$ in the treatment group. The baseline effect μi0 of the ith study was generated from N(0, 1), and the study-specific standard deviation γi was generated anew for each meta-analysis by sampling from U(1,5). The SMD $\mu_i = (\mu_{i1} - \mu_{i0})/\gamma_i$ was sampled from the normal distribution $N(\mu, \tau^2)$, so $\mu_{i1} = \mu_i\gamma_i + \mu_{i0}$. The overall SMD μ and the between-study standard deviation τ were set as in Table 2.

When effect measures were log ORs, the event numbers ni01 and ni11 in the control and treatment groups were sampled from $\mathrm{Binomial}(n_{i0}, p_{i0})$ and $\mathrm{Binomial}(n_{i1}, p_{i1})$, respectively. The event rate in the control group pi0 was sampled from U(0.3,0.7), representing a common event,62 and it was generated anew for each meta-analysis. The event rate in the treatment group pi1 was calculated from pi0 and the study-specific log OR μi; specifically, $p_{i1} = [1 + e^{-\mu_i}(1 - p_{i0})/p_{i0}]^{-1}$. The study-specific log OR μi was sampled from $N(\mu, \tau^2)$. The settings of the overall log OR μ and the between-study standard deviation τ are given in Table 2.
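
As a check on the treatment-group event rate, the sketch below computes pi1 from pi0 and the log OR, draws the binomial counts, and estimates the log OR with 0.5 added to every cell (the adjustment the KD method uses). It is a hypothetical Python illustration with made-up function names. The formula is read as pi1 = [1 + exp(−μi)(1 − pi0)/pi0]^(−1), and the variance estimate is the usual sum of reciprocal cell counts, which is an assumption about Section 2.

```python
import math
import random


def treatment_rate(p0, log_or):
    """Event rate in the treatment arm implied by p0 and the log odds ratio."""
    return 1.0 / (1.0 + math.exp(-log_or) * (1.0 - p0) / p0)


def simulate_log_or_study(ni, mu, tau, rng):
    """Return 2x2 cells (events_trt, nonevents_trt, events_ctrl, nonevents_ctrl)."""
    half = ni // 2                          # assumption: integer arm size
    p0 = rng.uniform(0.3, 0.7)              # control event rate ~ U(0.3, 0.7)
    mu_i = rng.gauss(mu, tau)               # study log OR ~ N(mu, tau^2)
    p1 = treatment_rate(p0, mu_i)
    # Binomial draws as sums of Bernoulli trials (standard library only).
    e0 = sum(rng.random() < p0 for _ in range(half))
    e1 = sum(rng.random() < p1 for _ in range(half))
    return e1, half - e1, e0, half - e0


def log_or_with_correction(e1, f1, e0, f0, add=0.5):
    """Log OR and its variance with `add` added to each cell."""
    a, b, c, d = e1 + add, f1 + add, e0 + add, f0 + add
    return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


p1_check = treatment_rate(0.4, 1.0)
implied_log_or = math.log(p1_check / (1 - p1_check)) - math.log(0.4 / 0.6)

rng = random.Random(7)
cells = simulate_log_or_study(30, mu=1.0, tau=0.39, rng=rng)
y_i, v_i = log_or_with_correction(*cells)
```

The value `implied_log_or` recovers the log OR of 1.0 that was fed in, which confirms that the expression for pi1 is the inverse-logit of logit(pi0) + μi.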

For each simulation setting above, 10,000 meta-analyses were generated. For a simulated meta-analysis, the study-specific effect size and the within-study variance were estimated as yi and si2 in Section 2. The RE model was applied to each simulated meta-analysis, and the between-study variance was estimated by three methods (DL, SJ, and REML) introduced in Section 3.1. We discarded simulated meta-analyses whose REML estimates of τ2 could not be obtained (e.g. the algorithm did not converge) and continued generating data until 10,000 meta-analyses were available for each setting.
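
For orientation, the DL step can be sketched in a few lines of Python. The DL formula itself is standard; what is assumed here is the form of equation (4), taken to be I2 = τ2/(τ2 + s2) with s2 the "typical" within-study variance of Higgins and Thompson. Under that assumption the DL-based I2 reduces to (Q − (k−1))/Q truncated at zero. The function name and toy data are hypothetical.

```python
def dl_tau2_and_i2(y, v):
    """DerSimonian-Laird tau^2 and the implied I^2 (assumed form of eq. (4))."""
    k = len(y)
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)      # truncated at zero
    s2 = (k - 1) / c                        # typical within-study variance
    i2 = tau2 / (tau2 + s2)
    return tau2, i2, q


# Toy data (hypothetical): five effect sizes and within-study variances.
y = [0.1, 0.5, 0.9, 0.3, 0.7]
v = [0.04, 0.05, 0.03, 0.06, 0.05]
tau2_dl, i2_dl, q = dl_tau2_and_i2(y, v)

# Identical effect sizes give Q = 0, so both estimates are truncated to 0.
tau2_null, i2_null, q_null = dl_tau2_and_i2([0.5] * 5, v)
```

The same conversion from τ2 to I2 applies to the SJ and REML estimates; only the estimator of τ2 changes.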

  • Estimands of interest. We estimated I2 and the corresponding 95% CI for each simulated meta-analysis. The true value of I2 was calculated by equation (4) with the true between-study variance τ2.

  • Methods to be evaluated. For MDs, we compared 12 methods for calculating 95% CIs of I2: five methods (TMG, PIII, S, T, and NPBS-Q) introduced in Section 3.2.1 and seven methods (PL-REML, QP, BJ, J, SJ, NPBS-τ2, and PBS-τ2) introduced in Section 3.2.2. These 12 methods were also compared when effect measures were SMDs or log ORs, with the KDB CI added to the comparison for SMDs and the KD CI added for log ORs. Among the methods that require an estimate of the between-study variance, the SJ method used the SJ estimate of τ2, and the other methods used the REML estimate of τ2. Moreover, the I2 estimates based on three different estimators (DL, SJ, and REML) of τ2 were also compared.

  • Performance measures. Coverage probabilities of 95% CIs, lengths of interval estimates, standard deviations of lengths, biases, and root mean squared errors were examined.
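
The performance measures listed above can be computed from the simulation output as in the following sketch. It is a hypothetical Python helper rather than the authors' R code, and it assumes that bias and root mean squared error refer to the point estimates of I2 while coverage and length refer to the interval estimates.

```python
import math
import statistics


def summarize(intervals, estimates, truth):
    """Coverage, interval length (mean and SD), bias, and RMSE for one setting."""
    cover = [lo <= truth <= hi for lo, hi in intervals]
    lengths = [hi - lo for lo, hi in intervals]
    errors = [est - truth for est in estimates]
    return {
        "coverage": sum(cover) / len(cover),
        "mean_length": statistics.fmean(lengths),
        "sd_length": statistics.stdev(lengths),
        "bias": statistics.fmean(errors),
        "rmse": math.sqrt(statistics.fmean([e * e for e in errors])),
    }


# Four made-up replicates with a true I^2 of 0.4.
intervals = [(0.1, 0.5), (0.3, 0.9), (0.0, 0.2), (0.2, 0.6)]
estimates = [0.3, 0.6, 0.1, 0.4]
result = summarize(intervals, estimates, truth=0.4)
```

In the actual study each setting would contribute 10,000 replicates rather than four.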

We provide all R code for the simulations at the Open Science Framework (https://osf.io/qu26v/).

4.2. Simulation results

4.2.1. Properties of I2 estimates

For the point estimates of I2 using the DL, SJ, and REML estimators of τ2, the SJ method stood out when the between-study variance was large. Table 3 shows estimates of bias for the three estimation methods when the effect measure was the MD. Generally, the SJ method had the highest bias compared to DL and REML when τ was small or moderate, but the lowest bias when τ was large. Additionally, DL and REML estimates of I2 had very similar performance and were often biased downward, particularly when τ was large.

Table 3.

Biases of estimated I2 using the estimated between-study variance of three methods: DerSimonian–Laird (DL), Sidik–Jonkman (SJ), and restricted maximum likelihood (REML) methods. Results are from the simulation study using mean differences as effect measures.

The true overall mean difference μ = 0 or 1
ni = 10–50 (τ columns 1–3); ni = 50–150 (τ columns 4–6); ni = 150–550 (τ columns 7–9)
Method τ = 0.5 τ = 1.1 τ = 2.2 τ = 0.3 τ = 0.6 τ = 1.2 τ = 0.2 τ = 0.3 τ = 0.6
k=5
   DL 0.020 −0.125 −0.123 −0.021 −0.140 −0.128 −0.069 −0.137 −0.140
   SJ 0.204 −0.003 −0.075 0.155 −0.022 −0.082 0.099 −0.007 −0.084
   REML 0.015 −0.127 −0.122 −0.025 −0.144 −0.127 −0.074 −0.142 −0.140
k=20
   DL 0.031 −0.024 −0.016 −0.023 −0.048 −0.023 −0.046 −0.054 −0.028
   SJ 0.306 0.080 0.002 0.241 0.055 −0.007 0.180 0.070 −0.007
   REML 0.032 −0.019 −0.009 −0.025 −0.047 −0.018 −0.048 −0.054 −0.022
k=50
   DL 0.044 0.008 −0.001 −0.008 −0.014 −0.007 −0.024 −0.021 −0.010
   SJ 0.330 0.095 0.011 0.262 0.071 0.003 0.197 0.085 0.004
   REML 0.045 0.014 0.004 −0.007 −0.011 −0.004 −0.024 −0.020 −0.007

To illustrate this point, consider the case where k=20 and ni was between 50 and 150. When τ=0.3, estimates of the bias were −0.023, 0.241, and −0.025 for the DL, SJ, and REML methods, respectively. The magnitude of the SJ method's estimated bias was roughly 10 times that of DL or REML. As τ increased to 0.6, the magnitudes of bias were approximately equal (−0.048 for DL, 0.055 for SJ, and −0.047 for REML). However, when τ was 1.2, SJ had the lowest estimated bias at −0.007 compared to −0.023 and −0.018 for DL and REML. This pattern held across nearly all parameter combinations studied, as well as for the SMD and log OR. The estimated magnitude of the bias for the log OR was higher than that of the MD or SMD (Tables S1 to S4 in the Supplemental Material). This was particularly striking when μ=1 (Table S4 in the Supplemental Material). RMSE followed a similar, but far less extreme, pattern for all parameter combinations and estimands studied (Tables S5 to S9 in the Supplemental Material).
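
The comparison in this example can be reproduced directly from the Table 3 entries for k=20 and ni between 50 and 150. The short Python snippet below only re-reads those published values; the variable names are hypothetical.

```python
# Bias of estimated I^2 from Table 3 (k = 20, ni = 50-150), keyed by tau.
bias = {
    0.3: {"DL": -0.023, "SJ": 0.241, "REML": -0.025},
    0.6: {"DL": -0.048, "SJ": 0.055, "REML": -0.047},
    1.2: {"DL": -0.023, "SJ": -0.007, "REML": -0.018},
}

# Estimator with the smallest absolute bias at each value of tau.
least_biased = {
    tau: min(row, key=lambda method: abs(row[method]))
    for tau, row in bias.items()
}

# How much larger the SJ bias is than the DL and REML bias when tau = 0.3.
ratio_sj_to_dl = abs(bias[0.3]["SJ"]) / abs(bias[0.3]["DL"])
ratio_sj_to_reml = abs(bias[0.3]["SJ"]) / abs(bias[0.3]["REML"])
```

The ratios come out at about 10.5 (against DL) and about 9.6 (against REML), and SJ is the least biased estimator only in the τ=1.2 column.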

4.2.2. CIs for MDs and SMDs

Table 4 shows the simulation-based coverage probabilities in studies of the MD for each of the CI methods introduced for the I2 statistic. CIs based on the BJ estimate of FQ generally behaved similarly. When the number of studies was small (k=5), interval coverage for the TMG, PIII, and S methods decreased with increasing τ2, regardless of the size of the individual studies in the meta-analysis. For example, when the individual study sample size ni was between 10 and 50, we observed over-conservative coverage when τ=0.5 (100% TMG, 99.2% PIII, and 99.9% S), coverage very close to the nominal level when τ=1.1 (95.5% TMG, 95.1% PIII, and 95.5% S), and under-coverage when τ=2.2 (83.4% TMG, 83.4% PIII, and 83.1% S). As k increased to 20 or 50, a non-linear relationship between CI coverage and the between-study variance was often present. This trend depended on ni, highlighting the importance of considering k, ni, and τ2 together when conducting a meta-analysis.

Table 4.

Coverage probabilities (in percentage, %) of estimated I2 95% confidence intervals of 12 methods in simulated studies where effect measures are mean differences.

The true overall mean difference μ=0 or 1
ni = 10–50 (τ columns 1–3); ni = 50–150 (τ columns 4–6); ni = 150–550 (τ columns 7–9)
Method τ = 0.5 τ = 1.1 τ = 2.2 τ = 0.3 τ = 0.6 τ = 1.2 τ = 0.2 τ = 0.3 τ = 0.6
k=5
   TMG 100.0 95.5 83.4 100.0 95.0 82.8 100.0 96.8 82.3
   PIII 99.2 95.1 83.4 99.6 94.7 82.8 99.6 96.6 82.4
   S 99.9 95.5 83.1 100.0 95.0 82.3 100.0 96.9 81.8
   T 93.8 90.9 70.7 95.5 92.6 72.1 95.4 93.4 75.2
   NPBS-Q 68.0 65.8 63.2 66.7 64.1 61.9 64.9 63.8 61.5
   PL-REML 96.4 97.0 93.4 97.6 97.8 94.1 98.0 98.0 94.2
   QP 93.3 93.7 93.9 94.7 94.9 94.8 94.9 94.8 94.8
   BJ 93.2 93.7 93.7 94.6 94.8 94.9 95.2 94.7 94.7
   J 93.7 93.9 93.9 94.8 94.8 94.9 94.7 94.9 94.7
   SJ 46.6 76.0 86.7 55.7 79.0 88.2 64.3 76.9 87.7
   PBS-τ2 99.9 97.2 84.5 100.0 97.1 83.7 100.0 98.1 83.6
   NPBS-τ2 73.2 69.5 66.8 71.4 68.0 65.7 69.5 67.8 65.8
k=20
   TMG 99.3 92.1 95.1 99.8 90.6 93.4 97.2 90.5 93.4
   PIII 97.7 91.1 95.0 99.1 90.4 93.8 96.7 90.1 93.8
   S 98.4 91.4 95.2 99.4 90.2 93.8 96.9 90.0 93.7
   T 90.2 76.4 59.2 94.1 78.8 62.0 92.6 80.9 63.0
   NPBS-Q 88.5 85.6 83.1 85.7 83.5 81.0 85.0 83.4 80.3
   PL-REML 93.6 92.3 92.9 96.4 93.8 93.9 95.6 94.3 94.5
   QP 91.4 92.5 92.6 94.1 94.2 94.1 95.1 95.3 94.8
   BJ 91.5 93.3 94.0 94.0 94.5 94.2 94.9 94.6 94.5
   J 93.1 93.1 93.0 94.5 94.2 94.1 95.1 95.0 94.7
   SJ 8.1 58.8 83.8 18.0 68.2 87.6 32.6 63.8 87.7
   PBS-τ2 98.6 91.9 92.8 99.1 91.0 91.2 95.0 90.9 91.7
   NPBS-τ2 88.9 88.2 87.8 86.1 86.1 85.7 85.6 85.5 85.1
k=50
   TMG 96.8 95.1 97.8 96.7 94.5 97.1 93.7 93.5 96.4
   PIII 95.1 93.7 97.3 96.4 94.3 97.3 93.4 93.6 96.7
   S 95.6 94.3 96.4 96.4 94.3 96.6 93.4 93.6 95.8
   T 86.1 73.6 55.3 91.7 78.0 58.7 87.4 80.4 60.7
   NPBS-Q 92.7 92.2 90.7 91.3 90.0 88.6 90.4 89.3 87.7
   PL-REML 89.7 91.1 91.4 94.0 94.2 94.5 94.8 94.9 94.9
   QP 89.1 90.7 91.2 94.4 94.4 94.5 95.0 95.1 94.8
   BJ 89.3 92.5 93.5 94.3 94.5 94.6 95.1 94.9 95.0
   J 91.6 91.7 91.9 94.7 94.4 94.5 95.2 94.9 94.8
   SJ 0.3 35.4 79.9 1.8 50.4 87.4 8.1 43.9 86.8
   PBS-τ2 92.7 92.9 93.3 92.9 93.6 94.0 92.5 93.3 93.5
   NPBS-τ2 92.1 91.8 91.8 90.4 91.1 91.4 90.2 90.7 90.9

TMG: two-moment gamma; PIII: Pearson type III; S: Saddlepoint; T: test-based; NPBS: non-parametric bootstrap; Q: the Q statistic; PL: profile-likelihood; REML: restricted maximum likelihood; QP: Q-profile; BJ: Biggerstaff and Jackson; J: Jackson; SJ: Sidik and Jonkman; PBS: parametric bootstrap; τ2: the between-study variance.

The performance of the T approach mimicked that of TMG, PIII, and S, but with lower CI coverage across the simulation settings; the gap was most evident with a large number of studies and small within-study sample sizes. Table 4 shows that, when k=50 and ni was between 10 and 50, the coverage probabilities of the T method were 86.1%, 73.6%, and 55.3% for τ=0.5, τ=1.1, and τ=2.2, respectively. This was below the average coverage of other intervals based on the Q statistic. For the T approach, when k was fixed, the coverage probability decreased as τ increased. This finding indicates that the commonly used test-based method is inappropriate for calculating a 95% CI of I2 if the meta-analysis is highly heterogeneous. The similarities in coverage trajectory over parameters of meta-analyses among the T, TMG, PIII, and S methods likely reflect the fact that all four methods use the approximated distribution of the Q statistic, while the decrease in overall coverage probability for the T method may be because it does not use the estimate of τ2. Interestingly, although the coverage probability was worse than that of the other three methods, the average interval length of the T method was comparatively shorter (Table 5).

Table 5.

Average lengths (in percentage, %) of estimated I2 95% confidence intervals of 12 methods are shown with standard deviations in parentheses. Results are from the simulation study using mean differences as effect measures.

The true overall mean difference μ = 0 or 1
ni = 10–50 (τ columns 1–3); ni = 50–150 (τ columns 4–6); ni = 150–550 (τ columns 7–9)
Method τ = 0.5 τ = 1.1 τ = 2.2 τ = 0.3 τ = 0.6 τ = 1.2 τ = 0.2 τ = 0.3 τ = 0.6
k=5
   TMG 72.7 (10.0) 79.9 (11.7) 88.0 (11.5) 72.6 (9.8) 79.5 (11.5) 87.7 (11.5) 74.0 (10.4) 78.4 (11.4) 87.3 (11.2)
   PIII 72.3 (10.0) 77.1 (12.3) 69.7 (22.4) 72.4 (9.7) 77.3 (11.8) 70.5 (22.0) 73.6 (10.2) 76.7 (11.5) 72.5 (20.3)
   S 72.7 (10.0) 79.7 (11.7) 83.4 (15.5) 72.6 (9.8) 79.4 (11.5) 83.3 (15.5) 74.0 (10.4) 78.3 (11.4) 84.0 (14.3)
   T 77.0 (8.6) 69.0 (18.9) 44.3 (28.2) 77.3 (7.8) 69.9 (17.9) 44.6 (27.9) 76.3 (9.9) 71.4 (16.4) 48.5 (27.6)
   NPBS-Q 41.2 (30.0) 58.9 (29.2) 77.9 (21.7) 41.0 (29.6) 58.3 (28.7) 77.6 (21.2) 44.7 (29.9) 55.6 (29.3) 76.1 (22.4)
   PL-REML 83.3 (9.3) 79.5 (15.9) 58.1 (27.3) 83.4 (8.8) 80.2 (14.9) 58.6 (27.0) 83.3 (9.6) 81.3 (13.8) 62.3 (26.0)
   QP 83.7 (18.7) 80.5 (19.9) 57.6 (29.1) 84.0 (17.6) 81.2 (18.8) 57.8 (28.7) 84.2 (17.4) 82.2 (18.5) 61.7 (28.0)
   BJ 83.2 (18.1) 80.1 (19.8) 58.0 (28.4) 83.5 (17.4) 80.9 (18.8) 58.3 (28.1) 83.7 (17.1) 81.8 (18.4) 62.0 (27.4)
   J 83.7 (18.5) 82.7 (19.7) 60.1 (30.5) 83.9 (17.6) 82.8 (18.6) 59.3 (29.7) 84.5 (17.3) 83.7 (18.2) 63.5 (29.0)
   SJ 53.6 (13.7) 50.7 (13.7) 37.0 (16.6) 54.3 (13.3) 51.6 (13.1) 37.5 (16.4) 53.9 (13.1) 52.1 (13.2) 39.6 (16.2)
   PBS-τ2 74.7 (9.3) 81.1 (10.7) 85.0 (14.5) 74.5 (9.1) 80.7 (10.4) 84.8 (14.4) 75.8 (9.6) 79.7 (10.4) 85.4 (12.9)
   NPBS-τ2 46.5 (30.7) 62.4 (28.2) 80.6 (20.6) 45.8 (30.1) 61.5 (27.9) 80.4 (20.1) 49.4 (30.1) 59.1 (28.7) 78.7 (21.2)
k=20
   TMG 58.9 (12.0) 65.1 (11.7) 35.6 (13.9) 57.8 (11.5) 65.1 (11.0) 35.5 (13.7) 61.8 (11.5) 66.3 (10.3) 40.0 (14.5)
   PIII 58.0 (11.2) 57.1 (13.0) 26.6 (11.3) 57.3 (10.9) 58.8 (12.1) 27.6 (11.5) 60.5 (10.6) 60.8 (11.1) 31.2 (12.5)
   S 58.3 (11.6) 60.1 (12.7) 29.7 (12.2) 57.5 (11.2) 61.1 (11.9) 30.4 (12.3) 60.9 (11.1) 62.9 (11.1) 34.3 (13.2)
   T 52.9 (8.0) 38.2 (15.2) 12.2 (8.3) 53.1 (7.7) 40.5 (14.9) 13.0 (8.6) 52.5 (9.1) 43.5 (14.1) 15.2 (9.8)
   NPBS-Q 50.1 (18.2) 59.9 (14.7) 33.6 (15.9) 48.6 (17.7) 59.7 (14.4) 33.8 (15.6) 53.8 (16.0) 60.2 (13.7) 37.9 (16.7)
   PL-REML 60.8 (11.2) 50.6 (14.5) 21.0 (10.4) 60.3 (10.9) 52.5 (13.9) 21.9 (10.7) 61.2 (10.1) 55.3 (13.0) 25.1 (11.8)
   QP 65.7 (14.0) 53.1 (15.9) 21.0 (10.5) 65.2 (13.8) 54.9 (15.2) 21.7 (10.6) 65.9 (12.3) 58.2 (14.2) 25.0 (11.8)
   BJ 62.9 (12.4) 52.3 (14.3) 23.8 (10.5) 62.5 (12.5) 53.8 (14.1) 24.3 (10.6) 63.6 (10.9) 56.7 (13.2) 27.3 (11.5)
   J 66.8 (14.2) 59.7 (18.7) 21.8 (12.1) 65.7 (13.9) 59.5 (17.2) 22.1 (11.5) 67.8 (12.6) 63.2 (15.9) 25.6 (13.1)
   SJ 28.7 (3.3) 24.6 (4.8) 13.9 (4.7) 29.2 (2.8) 25.7 (4.4) 14.6 (4.8) 28.7 (3.1) 26.4 (4.2) 16.0 (4.9)
   PBS-τ2 58.4 (12.5) 60.8 (13.6) 27.1 (13.3) 57.5 (12.2) 62.0 (12.5) 28.1 (13.4) 61.3 (11.8) 63.8 (11.5) 32.3 (14.7)
   NPBS-τ2 53.5 (20.3) 62.2 (16.5) 30.2 (18.8) 51.3 (19.5) 62.1 (15.7) 31.1 (18.8) 56.5 (17.4) 62.7 (14.9) 35.7 (20.2)
k=50
   TMG 48.8 (7.9) 39.0 (7.9) 16.4 (4.3) 48.5 (7.7) 40.0 (7.6) 16.5 (4.2) 50.2 (6.3) 43.1 (7.4) 18.8 (4.7)
   PIII 48.0 (7.6) 35.9 (7.8) 14.3 (3.7) 48.0 (7.5) 37.5 (7.7) 14.8 (3.7) 49.1 (6.3) 40.7 (7.6) 16.9 (4.2)
   S 48.2 (7.7) 36.7 (7.8) 14.8 (4.1) 48.1 (7.6) 38.1 (7.7) 15.2 (4.0) 49.3 (6.4) 41.2 (7.6) 17.3 (4.5)
   T 40.9 (6.6) 22.0 (7.8) 5.9 (2.6) 41.5 (6.3) 23.7 (7.9) 6.3 (2.6) 39.4 (7.3) 26.8 (8.4) 7.5 (3.1)
   NPBS-Q 46.3 (10.9) 38.3 (10.6) 16.0 (5.1) 45.2 (10.4) 39.2 (10.6) 16.1 (5.0) 47.4 (9.2) 42.0 (10.6) 18.3 (5.7)
   PL-REML 47.2 (7.5) 31.8 (8.2) 11.1 (3.4) 46.9 (7.0) 33.4 (8.0) 11.6 (3.4) 46.2 (6.5) 36.4 (8.1) 13.4 (3.9)
   QP 52.7 (9.4) 33.7 (9.0) 11.2 (3.5) 52.4 (8.9) 35.4 (8.6) 11.7 (3.4) 51.3 (7.8) 39.0 (8.9) 13.6 (4.0)
   BJ 49.0 (7.3) 33.5 (7.5) 13.9 (4.2) 49.0 (7.1) 34.6 (7.4) 14.1 (4.0) 48.2 (6.5) 37.6 (7.6) 16.0 (4.5)
   J 56.2 (11.1) 38.4 (13.5) 11.4 (3.5) 55.1 (10.1) 38.5 (12.0) 11.8 (3.4) 56.1 (9.4) 43.6 (12.7) 13.6 (4.0)
   SJ 18.3 (1.5) 15.1 (2.2) 7.9 (1.8) 18.8 (1.1) 16.0 (2.0) 8.4 (1.8) 18.3 (1.3) 16.5 (1.9) 9.3 (1.9)
   PBS-τ2 47.2 (8.6) 35.3 (8.6) 12.1 (3.8) 47.3 (8.4) 37.2 (8.4) 12.7 (3.8) 48.8 (6.8) 40.5 (8.2) 14.7 (4.3)
   NPBS-τ2 48.0 (12.7) 37.0 (12.2) 12.4 (4.8) 45.9 (11.6) 38.2 (12.0) 12.8 (4.7) 47.8 (10.3) 41.4 (12.1) 14.9 (5.5)

TMG: two-moment gamma; PIII: Pearson type III; S: Saddlepoint; T: Test-based; NPBS: non-parametric bootstrap; Q: the Q statistic; PL: profile-likelihood; REML: restricted maximum likelihood; QP: Q-profile; BJ: Biggerstaff and Jackson; J: Jackson; SJ: Sidik and Jonkman; PBS: parametric bootstrap; τ2: the between-study variance.

The NPBS-Q interval had unacceptable coverage across most of the simulation scenarios. This method never reached the nominal coverage probability, with the highest coverage estimated at 92.7%. In general, the coverage of the NPBS-Q interval decreased modestly with larger τ and ni. However, the primary driver of the low coverage was the number of studies available for resampling in the bootstrap procedure: coverage was best when k=50. When the number of studies was small, the NPBS-Q interval was shorter than other FQ-based intervals.
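
The dependence on k is easier to see in code. The following Python sketch is a generic percentile bootstrap that resamples whole studies and recomputes I2 = max(0, (Q − (k−1))/Q) each time. It is an assumption that this matches the NPBS-Q procedure of Section 3.2.1 in its details, and the function names and toy data are hypothetical.

```python
import random


def i2_from_q(y, v):
    """I^2 computed from Cochran's Q with inverse-variance weights."""
    k = len(y)
    w = [1.0 / vi for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))
    return max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0


def bootstrap_i2_ci(y, v, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap CI for I^2, resampling studies with replacement."""
    rng = random.Random(seed)
    k = len(y)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(k) for _ in range(k)]
        stats.append(i2_from_q([y[i] for i in idx], [v[i] for i in idx]))
    stats.sort()
    alpha = 1.0 - level
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Toy data (hypothetical): five effect sizes and within-study variances.
y = [0.1, 0.5, 0.9, 0.3, 0.7]
v = [0.04, 0.05, 0.03, 0.06, 0.05]
lower, upper = bootstrap_i2_ci(y, v)
```

With k=5 there are only 126 distinct resamples (multisets of size 5 drawn from 5 studies), so the bootstrap distribution of I2 is coarse regardless of how many bootstrap replicates are drawn.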

Compared with I2 intervals based on FQ, most intervals based on τ2 were found to have coverage consistently closer to the nominal 95%. Specifically, the QP, BJ, and J intervals stood out as well-performing in areas where the FQ-based intervals under-performed. PL-REML also performed well, but it was slightly farther from the nominal coverage probability compared to QP, BJ, and J; this was evident when the within-study sample size was between 150 and 550. For example, when k=20 and τ=0.2, the coverage probabilities were 95.6%, 95.1%, 94.9%, and 95.1% for PL-REML, QP, BJ, and J, respectively. Additionally, when τ=0.6, the coverage probabilities of these methods remained close to the nominal level, at 94.5%, 94.8%, 94.5%, and 94.7%. Interestingly, when k=50 and ni was between 10 and 50, the coverage of the QP, BJ, J, and PL-REML intervals dropped, and the TMG, PIII, and S methods were preferred in this case.
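
When judging how close a simulated coverage is to 95%, it helps to keep the Monte Carlo error in mind. The sketch below applies the standard binomial formula for the Monte Carlo standard error of a coverage estimate with 10,000 replicates. The article does not report this quantity in this section, so it is offered only as a rough yardstick, and the function name is hypothetical.

```python
import math


def coverage_mcse(p, n_rep=10_000):
    """Monte Carlo standard error of an estimated coverage probability."""
    return math.sqrt(p * (1.0 - p) / n_rep)


mcse = coverage_mcse(0.95)
# Approximate range of estimated coverage compatible with a true 95% level.
band = (0.95 - 1.96 * mcse, 0.95 + 1.96 * mcse)
```

The standard error is about 0.22 percentage points, so estimated coverages between roughly 94.6% and 95.4% cannot be distinguished from the nominal level on the basis of simulation noise alone.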

Two τ2-based intervals would not be recommended for practice based on poor coverage probabilities given by the simulation. For different simulation scenarios, the lowest CI coverage was found for the SJ method, which was consistent with prior simulation studies for the interval of τ2.54 Table 5 shows that the interval length of the SJ method tended to be the shortest of all studied methods and decreased with increasing τ. Additionally, similar to the NPBS-Q, the NPBS-τ2 failed to reach the nominal coverage in any scenario.

The CIs for the SMD when μ=0 or μ=0.8 showed the same trend as described above (Tables S10 to S13 in the Supplemental Material) with two important caveats. First, when k=50 and ni was between 10 and 50, the QP, BJ, and J methods worked well, and they were close to the nominal coverage. This was in direct contrast to the CIs based on FQ, whose coverage probabilities were consistently under 90% when τ was moderate or large (Table S7 in the Supplemental Material). Second, the KDB interval also worked well under all studied scenarios, and it had comparable performance to the QP, BJ, and J intervals. These points made any of the QP, BJ, J, or KDB interval estimation methods a reliable choice when the estimand of interest was the SMD, regardless of the other meta-analysis parameters.

4.2.3. CIs for log ORs

Similar trends in CI coverage were observed for the log OR as for the MD, but with greater departures from the nominal coverage level. For both μ=0 and μ=1, the TMG, PIII, and S methods had decreasing coverage with increasing τ. In Table 6, when k=5 and ni was between 10 and 50, a drop from 100% coverage to about 75% was seen for all three intervals as τ increased. As k increased to 20 and 50, this decrease in coverage appeared for both moderate and large values of τ, and the departure from nominal coverage became more severe. The T interval also showed a drop in coverage as τ increased, the magnitude of which was exacerbated as k increased. NPBS-Q, SJ, and NPBS-τ2 showed unacceptable coverage in all scenarios studied. PBS-τ2 was generally over-conservative when τ was small, but it performed poorly as τ increased.

Table 6.

Coverage probabilities (in percentage, %) of estimated I2 95% confidence intervals of 13 methods when the true overall log odds ratio is 0. Results are from the simulation study using log odds ratios as effect measures.

The true overall log odds ratio μ = 0
ni = 10–50 (τ columns 1–3); ni = 50–150 (τ columns 4–6); ni = 150–550 (τ columns 7–9)
Method τ = 0.37 τ = 0.73 τ = 1.46 τ = 0.20 τ = 0.40 τ = 0.80 τ = 0.11 τ = 0.21 τ = 0.43
k=5
   TMG 100.0 100.0 74.8 100.0 100.0 82.3 100.0 100.0 83.3
   PIII 100.0 100.0 74.9 100.0 100.0 82.3 100.0 100.0 83.3
   S 100.0 100.0 74.8 100.0 100.0 82.2 100.0 100.0 83.1
   T 98.2 98.6 92.5 96.9 96.3 88.9 96.4 95.5 85.9
   NPBS-Q 64.6 61.7 49.7 64.8 63.7 61.8 65.1 65.1 63.5
   PL-REML 99.3 99.8 96.2 98.7 98.9 95.4 98.3 98.5 94.8
   QP 96.7 97.4 97.2 95.6 95.7 95.9 95.0 95.3 95.2
   BJ 96.8 97.3 97.1 95.6 96.1 96.7 95.1 95.5 95.7
   J 97.0 97.3 97.1 95.4 95.6 96.3 94.9 95.1 95.4
   SJ 51.0 80.0 91.3 54.0 78.5 88.1 54.2 78.6 87.8
   PBS-τ2 100.0 100.0 76.5 100.0 100.0 82.6 100.0 100.0 83.8
   NPBS-τ2 68.4 66.1 58.5 66.0 64.8 65.0 67.1 66.1 65.3
   KD 94.1 95.0 95.5 95.0 95.3 95.6 94.9 95.2 95.1
k=20
   TMG 100.0 83.2 67.5 99.9 89.9 87.8 99.9 91.5 91.1
   PIII 100.0 83.2 68.0 99.8 89.9 88.0 99.9 91.4 91.2
   S 99.9 83.2 67.9 99.7 89.6 88.0 99.6 91.1 91.2
   T 99.1 94.3 54.7 97.2 90.0 80.0 96.8 92.2 78.5
   NPBS-Q 82.5 75.3 35.4 85.5 84.3 76.2 85.6 85.1 82.8
   PL-REML 99.6 93.5 86.8 98.8 95.0 94.7 98.4 94.7 94.4
   QP 97.0 97.0 95.3 95.3 95.6 96.0 95.1 95.1 94.9
   BJ 97.0 96.2 89.0 95.5 96.0 95.9 95.2 95.2 95.4
   J 97.4 97.2 94.0 95.0 95.5 96.3 94.8 95.0 95.1
   SJ 10.8 68.6 91.3 14.7 67.7 87.9 16.1 66.9 87.6
   PBS-τ2 100.0 82.0 67.9 99.8 89.4 87.8 99.8 91.1 91.1
   NPBS-τ2 79.2 75.5 59.5 84.7 84.6 82.8 84.9 85.2 84.7
   KD 94.6 95.2 93.7 94.8 95.2 95.4 95.0 95.0 94.8
k=50
   TMG 100.0 76.9 44.2 99.6 91.2 89.1 99.4 93.3 93.5
   PIII 100.0 77.0 44.9 99.6 91.2 89.2 99.4 93.2 93.5
   S 100.0 76.8 44.9 99.6 91.2 89.0 99.4 93.2 93.4
   T 99.5 83.2 14.9 97.5 90.1 74.0 97.1 89.5 79.5
   NPBS-Q 85.5 72.5 8.4 90.5 88.4 75.6 90.8 90.3 88.2
   PL-REML 98.0 87.6 64.6 97.8 94.5 93.9 97.2 95.1 95.0
   QP 96.7 95.8 87.5 95.3 95.5 96.0 95.5 95.0 95.1
   BJ 96.4 93.2 54.7 95.4 95.6 93.2 95.4 95.4 95.8
   J 97.5 96.3 80.2 95.0 95.3 95.9 95.4 94.9 95.4
   SJ 0.4 48.6 88.2 1.1 49.8 87.5 1.3 48.1 87.8
   PBS-τ2 100.0 75.7 43.5 99.6 90.8 88.6 99.4 93.1 93.0
   NPBS-τ2 80.3 71.9 36.2 88.8 88.0 86.2 90.1 90.2 90.2
   KD 93.3 93.7 87.9 94.9 95.0 95.2 95.3 94.9 94.9

TMG: two-moment gamma; PIII: Pearson type III; S: Saddlepoint; T: test-based; NPBS: non-parametric bootstrap; Q: the Q statistic; PL: profile-likelihood; REML: restricted maximum likelihood; QP: Q-profile; BJ: Biggerstaff and Jackson; J: Jackson; SJ: Sidik and Jonkman; PBS: parametric bootstrap; τ2: the between-study variance. KD: Kulinskaya-Dollinger.

When μ=0, the QP, BJ, J, and KD intervals maintained coverage closest to the nominal level under most parameter combinations. The exception was when k=50, ni was between 10 and 50, and τ=1.46. A large drop in coverage was observed for all four intervals, particularly the BJ CI, with a coverage of 54.7% in this case (Table 6), corresponding to a large decrease in CI length (Table S14 in the Supplemental Material). This was magnified when μ=1, where the dip in coverage and the decrease in CI length were also observed when ni was between 50 and 150 (Table 7 and Table S15 in the Supplemental Material). The KD interval only showed a major drop in coverage for μ=1 when k=50, ni was between 10 and 50, and τ was moderate to large. In most parameter combinations, the KD interval outperformed other methods with respect to the coverage probability or the average interval length. Therefore, in the case of the log OR, we suggest using the KD interval of I2.

Table 7.

Coverage probabilities (in percentage, %) of estimated I2 95% confidence intervals of 13 methods when the true overall log odds ratio is 1. Results are from the simulation study using log odds ratios as effect measures.

The true overall log odds ratio μ = 1
ni = 10–50 (τ columns 1–3); ni = 50–150 (τ columns 4–6); ni = 150–550 (τ columns 7–9)
Method τ =0.39 τ =0.78 τ =1.56 τ =0.21 τ =0.43 τ =0.85 τ =0.11 τ =0.23 τ =0.46
k=5
   TMG 100.0 100.0 73.4 100.0 100.0 80.4 100.0 100.0 82.5
   PIII 100.0 100.0 73.5 100.0 100.0 80.5 100.0 100.0 82.5
   S 100.0 100.0 73.3 100.0 100.0 80.3 100.0 100.0 82.4
   T 98.8 98.9 91.4 97.2 96.5 87.4 96.1 95.0 83.1
   NPBS-Q 61.4 57.4 45.0 64.8 63.4 59.1 64.6 64.3 62.2
   PL-REML 99.5 99.9 96.1 98.8 99.1 95.5 98.2 98.5 94.9
   QP 97.3 97.4 96.9 95.8 95.9 96.3 94.7 95.0 95.4
   BJ 97.2 97.3 96.8 95.8 96.1 96.6 94.8 95.2 95.6
   J 97.4 97.4 96.9 95.7 95.9 96.5 94.6 95.0 95.4
   SJ 54.8 83.1 90.2 52.8 79.0 88.5 52.0 78.1 87.6
   PBS-τ2 100.0 100.0 75.7 100.0 100.0 81.0 100.0 100.0 83.1
   NPBS-τ2 65.5 62.1 53.0 66.6 65.3 64.2 66.8 66.0 65.2
   KD 94.8 95.5 95.3 94.9 95.3 95.8 94.5 94.8 95.3
k=20
   TMG 100.0 80.3 63.0 99.9 89.1 86.4 99.8 91.1 91.0
   PIII 100.0 80.3 63.6 99.8 89.0 86.5 99.7 91.1 91.0
   S 99.8 80.3 63.4 99.7 89.0 86.4 99.6 91.1 91.0
   T 99.5 90.8 47.6 97.9 89.2 78.0 96.8 87.6 78.8
   NPBS-Q 78.1 68.4 30.4 84.8 82.5 71.8 85.8 84.5 82.1
   PL-REML 99.7 91.4 82.9 99.1 95.0 94.2 98.0 94.5 94.9
   QP 96.0 95.7 91.6 95.9 96.2 96.7 95.3 95.1 95.2
   BJ 95.9 94.6 85.2 95.9 96.3 95.5 95.2 95.1 95.7
   J 96.6 95.9 89.9 95.7 96.0 96.8 95.2 95.0 95.5
   SJ 14.0 77.9 86.1 14.1 68.2 89.6 12.3 67.7 88.6
   PBS-τ2 100.0 79.2 63.4 99.9 88.7 86.2 99.8 90.6 90.7
   NPBS-τ2 75.1 70.2 51.6 83.3 82.6 80.1 84.9 84.8 84.8
   KD 94.8 94.5 90.3 94.9 95.5 95.9 94.9 94.9 95.0
k=50
   TMG 100.0 71.1 34.6 99.8 89.3 85.0 99.4 93.2 93.0
   PIII 100.0 71.2 35.3 99.8 89.3 85.2 99.4 93.2 93.0
   S 100.0 71.1 35.2 99.8 88.9 84.9 99.4 92.6 92.4
   T 99.9 74.0 9.2 97.8 89.3 67.7 97.2 89.6 79.0
   NPBS-Q 78.0 61.6 5.6 89.6 86.7 69.7 90.7 90.4 87.1
   PL-REML 97.0 82.3 53.3 98.0 93.7 91.6 97.7 95.1 94.9
   QP 94.1 91.4 73.9 95.7 95.8 95.7 95.3 95.6 95.5
   BJ 93.6 87.8 44.0 95.7 95.3 91.1 95.4 95.6 95.5
   J 95.5 92.2 65.5 95.4 95.9 95.2 95.1 95.5 95.6
   SJ 0.7 66.6 73.9 0.7 49.8 89.1 0.6 50.0 87.7
   PBS-τ2 100.0 69.9 33.8 99.7 89.0 84.5 99.4 93.0 92.4
   NPBS-τ2 75.1 70.2 51.6 83.3 82.6 80.1 84.9 84.8 84.8
   KD 93.3 90.8 75.9 94.8 94.8 94.3 95.2 95.4 95.2

TMG: two-moment gamma; PIII: Pearson type III; S: Saddlepoint; T: test-based; NPBS: non-parametric bootstrap; Q: the Q statistic; PL: profile-likelihood; REML: restricted maximum likelihood; QP: Q-profile; BJ: Biggerstaff and Jackson; J: Jackson; SJ: Sidik and Jonkman; PBS: parametric bootstrap; τ2: the between-study variance. KD: Kulinskaya-Dollinger.

5. Discussion

In this article, we have compared different methods for calculating the point estimate and the 95% CI of the I2 statistic in a meta-analysis. For point estimates of I2, we suggest the SJ method when τ2 is large; otherwise, the DL method gives a less biased estimate of I2 based on the simulation studies. The interval estimates of I2 fall into two categories by their derivation: methods based on approximating the CDF of the Q statistic, and methods that view I2 as a function of τ2 in equation (4) and calculate the interval of I2 from the interval of τ2. Based on the simulation studies, we suggest the following guidelines:

  • When the effect measure is the MD or SMD, use the QP, BJ, or J method to calculate the 95% CI for I2;

  • When the effect measure is the log OR, use the KD method to calculate the 95% CI for I2.

In the case of the log OR, the KD method is recommended because it generally outperforms the other methods with respect to the coverage probability of the 95% CI for I2. Except for the KDB method for the SMD and the KD method for the log OR, all other methods can be used to calculate the CI of the I2 statistic for any type of effect measure.

Although the I2 statistic is widely used to measure the heterogeneity of meta-analyses, it suffers from large uncertainties and should not be used as an absolute measure of heterogeneity. However, the CI of I2 provides an appreciation of the spectrum of possible extents of heterogeneity, which can be more robust to nuisance factors. In practice, meta-analysts should report the 95% CI for I2 using the recommended methods, which have reasonable interval lengths and provide much more reliable coverage probabilities than the currently used methods (e.g. the test-based method).

Based on the simulation framework in this article, other simulation settings can be considered in future research. For example, when the effect measure is the MD, the chi-squared distribution can be used to generate the study-specific standard deviation. Further studies can provide additional clarity on the guidelines for I2, with our work serving as the baseline.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Institute of Mental Health, U.S. National Library of Medicine (grant numbers R03 MH128727 and R01 LM012982).

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.
