Comparisons of various estimates of the I² statistic for quantifying between-study heterogeneity in meta-analysis

Yipeng Wang; Natalie DelRocco; Lifeng Lin

doi:10.1177/09622802241231496

Author manuscript; available in PMC: 2025 May 1.

Published in final edited form as: Stat Methods Med Res. 2024 Mar 19;33(5):745–764. doi: 10.1177/09622802241231496

PMCID: PMC11759644 NIHMSID: NIHMS2046507 PMID: 38502022

Abstract

Assessing heterogeneity between studies is a critical step in determining whether studies can be combined and whether the synthesized results are reliable. The $I^{2}$ statistic has been a popular measure for quantifying heterogeneity, but its usage has been challenged from various perspectives in recent years. In particular, it should not be considered an absolute measure of heterogeneity, and it could be subject to large uncertainties. As such, when using $I^{2}$ to interpret the extent of heterogeneity, it is essential to account for its interval estimate. Various point and interval estimators exist for $I^{2}$ . This article summarizes these estimators. In addition, we performed a simulation study under different scenarios to investigate preferable point and interval estimates of $I^{2}$ . We found that the Sidik–Jonkman method gave precise point estimates for $I^{2}$ when the between-study variance was large, while in other cases, the DerSimonian–Laird method was suggested to estimate $I^{2}$ . When the effect measure was the mean difference or the standardized mean difference, the $Q$ -profile method, the Biggerstaff–Jackson method, or the Jackson method was suggested to calculate the interval estimate for $I^{2}$ due to reasonable interval length and more reliable coverage probabilities than various alternatives. For the same reason, the Kulinskaya–Dollinger method was recommended to calculate the interval estimate for $I^{2}$ when the effect measure was the log odds ratio.

Keywords: Confidence interval, coverage probability, heterogeneity, I² statistic, meta-analysis

1. Introduction

Meta-analysis is a statistical tool to synthesize evidence from different studies and is widely used in medical research. Assessing heterogeneity between the collected studies is a critical step to examine whether the studies may be properly combined and the synthesized results are reliable.^1,2 In this article, heterogeneity refers to the variation in underlying treatment effects across studies.³

A classical method to detect heterogeneity is the chi-squared $Q$ test; the distribution of the $Q$ statistic is approximately $χ_{k - 1}^{2}$ (k is the number of studies) under the null hypothesis that all studies in a meta-analysis are homogeneous. However, the $Q$ test alone does not suffice to describe the amount of heterogeneity because only $P$ -values are produced to indicate a binary decision of either the presence or absence of heterogeneity. The $I^{2}$ statistic has been a popular alternative to quantify heterogeneity because of its attractive interpretation as the proportion of total variation caused by heterogeneity rather than within-study sampling error.^4,5 Specifically, the $I^{2}$ statistic can be conceptualized as $τ^{2} / (τ^{2} + σ^{2}) \times 100 %$ , where $τ^{2}$ is the between-study variance caused by heterogeneity and $σ^{2}$ is a summary of within-study variances. It ranges from $0 %$ to $100 %$ . The Cochrane Handbook provides a rough yet widely used rule to interpret this measure: $I^{2} \leq 40 %$ may indicate unimportant heterogeneity, $30 % \leq I^{2} \leq 60 %$ may represent moderate heterogeneity, $50 % \leq I^{2} \leq 90 %$ may represent substantial heterogeneity, and $75 % \leq I^{2} \leq 100 %$ implies considerable heterogeneity.³ These ranges overlap with each other because they are vague, and the true heterogeneity should be evaluated with caution, using both statistical and clinical knowledge.⁶

Over the past few years, the usage of $I^{2}$ has been challenged from many perspectives, and it should not be used as an absolute measure.^7–9 Several studies have demonstrated shortcomings of $I^{2}$ . The $I^{2}$ statistic may be particularly unreliable in meta-analyses with a small number of studies (e.g. < 10).^10,11 The sample sizes within individual studies can inflate or deflate $I^{2}$ under different circumstances.^11,12 Moreover, $I^{2}$ inherits the following misunderstanding about the distribution of the $Q$ statistic under the null hypothesis: the $χ_{k - 1}^{2}$ approximation holds for large within-study sample sizes, but it is not accurate for small and moderate sample sizes.^13,14

In response to the above shortcomings, the associated $95 %$ confidence interval (CI) should be reported to accompany the $I^{2}$ statistic.^10,15,16 Also, under the null hypothesis, distributions of the $Q$ statistic for different effect measures (e.g. standardized mean difference [SMD]) have been proposed to adjust for the inaccuracy of the standard chi-square approximation^17–20; they provide a solution to calculating the CI of $I^{2}$ . CIs may be more desirable than point estimates of $I^{2}$ because they give an appreciation of the spectrum of possible extents of heterogeneity (e.g. mild to moderate). A spectrum of the $I^{2}$ statistic can be more robust to nuisance factors compared with the point estimate, enabling appropriate interpretation of the overall estimate of the intervention effect.¹⁶ Methods to calculate $95 %$ CIs of $I^{2}$ have been discussed in former research.^4,21 Nevertheless, more intensive studies need to be implemented to compare different methods’ performance (e.g. the coverage probability) in practical situations.

This article uses a simulation study under various scenarios to obtain informative conclusions of preferable point and interval estimates of $I^{2}$ . The rest of the article is organized as follows. We first review the setups of a meta-analysis, including various types of effect measures, in Section 2. Section 3 reviews various point and interval estimators of $I^{2}$ . Section 4 presents the simulation study comparing the multiple estimators. We conclude this article with a brief discussion in Section 5.

2. Setups of meta-analysis

2.1. Common-effect and random-effects models

Consider a meta-analysis that collects $k$ independent studies. Let $μ_{i}$ be the underlying true effect size in study $i (i = 1, \dots, k)$ . Each study reports an estimate of the effect size and its sample variance, denoted by $y_{i}$ and $s_{i}^{2}$ . These data are commonly modeled as $y_{i} \sim N (μ_{i}, s_{i}^{2})$ . Although $s_{i}^{2}$ is subject to sampling error, it is usually treated as a fixed, known value. This assumption is generally valid if each study’s sample size is reasonably large.

If study-specific true effect sizes are assumed to follow a normal distribution, that is, $μ_{i} \overset{iid}{\sim} N (μ, τ^{2})$, then this is the random-effects (RE) model that accounts for heterogeneity. Here, $μ$ is the overall mean effect size, and $τ^{2}$ is the between-study variance. If $τ^{2} = 0$, then $μ_{i} = μ$ for all studies. This implies that the collected studies are homogeneous, and it leads to the common-effect (CE) model. The RE model encompasses within-study ($s_{i}^{2}$) and between-study ($τ^{2}$) variation, in contrast to the CE model that includes within-study variation only. We denote $w_{i, CE} = 1 / s_{i}^{2}$ as the weight assigned to each study under the CE model. The $Q$ statistic is defined as $Q = \sum w_{i, CE} {(y_{i} - {\hat{μ}}_{CE})}^{2}$, where ${\hat{μ}}_{CE} = \sum w_{i, CE} y_{i} / \sum w_{i, CE}$ is the pooled CE estimate of the overall effect size $μ$. Under the null hypothesis of homogeneity, $Q$ approximately follows a $χ_{k - 1}^{2}$ distribution. Under the RE model, using the between-study variance estimate ${\hat{τ}}^{2}$, the overall mean effect size is estimated as

{\hat{μ}}_{RE} ({\hat{τ}}^{2}) = \frac{\sum w_{i, RE} ({\hat{τ}}^{2}) y_{i}}{\sum w_{i, RE} ({\hat{τ}}^{2})}

(1)

where $w_{i, R E} ({\hat{τ}}^{2}) = 1 / (s_{i}^{2} + {\hat{τ}}^{2})$ .
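As a minimal sketch of the quantities defined above, the following Python functions compute the pooled estimate in equation (1) and the $Q$ statistic. The function names and the example data are illustrative assumptions and are not taken from the article.

```python
def pooled_estimate(y, s2, tau2=0.0):
    """Inverse-variance weighted mean with weights 1 / (s_i^2 + tau2).

    tau2 = 0 gives the CE estimate; tau2 > 0 gives equation (1).
    """
    weights = [1.0 / (v + tau2) for v in s2]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)


def q_statistic(y, s2):
    """Q statistic with CE weights w_i = 1 / s_i^2."""
    mu_ce = pooled_estimate(y, s2)
    return sum((yi - mu_ce) ** 2 / v for yi, v in zip(y, s2))


# Hypothetical effect estimates and within-study variances for k = 5 studies.
y = [0.10, 0.30, 0.35, 0.65, 0.45]
s2 = [0.03, 0.03, 0.05, 0.01, 0.05]
print(pooled_estimate(y, s2), pooled_estimate(y, s2, tau2=0.04), q_statistic(y, s2))
```

Using a single function for both models reflects that the CE estimate is the special case $τ^{2} = 0$ of equation (1).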

2.2. Meta-analysis with a continuous outcome

Suppose each study in a meta-analysis compares a treatment group with a control group. Denote $n_{i 0}$ and $n_{i 1}$ as the sample sizes in the control and treatment groups in study $i$ . The continuous outcome measures of participants in each group are assumed to follow normal distributions. The subject-level data in each arm have means $μ_{i 0}$ and $μ_{i 1}$ and variances $γ_{i 0}^{2}$ and $γ_{i 1}^{2}$ . The sample means are denoted as ${\overline{y}}_{i 0}$ and ${\overline{y}}_{i 1}$ , and the sample variances are denoted as $s_{i 0}^{2}$ and $s_{i 1}^{2}$ for $i = 1, \dots, k$ .

If the outcome measures have a meaningful scale and all studies in the meta-analysis are reported on the same scale, the mean difference (MD) between the two groups, $μ_{i} = μ_{i 1} - μ_{i 0}$ , is often used as the effect size. An estimate of the MD can be obtained from each study, denoted as $y_{i} = {\overline{y}}_{i 1} - {\overline{y}}_{i 0}$ . The variances of samples in two arms are frequently assumed to be equal, i.e., $γ_{i 0}^{2} = γ_{i 1}^{2} = γ_{i}^{2}$ . The $γ_{i}^{2}$ is estimated as the pooled sample variance $s_{i P}^{2} = [(n_{i 0} - 1) s_{i 0}^{2} + (n_{i 1} - 1) s_{i 1}^{2}] / (n_{i 0} + n_{i 1} - 2)$ . Therefore, the estimated within-study variance of $y_{i}$ is $s_{i}^{2} = (\frac{1}{n_{i 0}} + \frac{1}{n_{i 1}}) s_{i P}^{2}$ .

Another commonly used effect measure for continuous outcomes is the SMD, because this unit-free measure permits different scales in the collected studies and is deemed more comparable across studies.²² The SMD effect measure is $μ_{i} = (μ_{i 1} - μ_{i 0}) / γ_{i}$. It is frequently estimated by Cohen’s $d$: $y_{i} = ({\overline{y}}_{i 1} - {\overline{y}}_{i 0}) / s_{i P}$. The exact within-study variance of Cohen’s $d$ can be derived as a complicated form of gamma functions,²³ but researchers often use different simpler forms to approximate it.^24–26 For example, $s_{i}^{2} = \frac{1}{n_{i 0}} + \frac{1}{n_{i 1}} + \frac{y_{i}^{2}}{2 (n_{i 0} + n_{i 1} - 2)}$. As $s_{i}^{2}$ depends on $y_{i}$, they are correlated. The correlation may increase as the sample sizes decrease, because the coefficient of $y_{i}^{2}$ in the formula increases. Cohen’s $d$ has been shown to be biased when sample sizes are small.²⁴ Therefore, we do not consider it further. Instead, we study the bias-corrected estimator Hedges’ $g$, which is usually adopted when sample sizes are small. Suggested by Hedges and Olkin,^24(p86) it is computed as $y_{i} = [1 - \frac{3}{4 (n_{i 0} + n_{i 1}) - 9}] \frac{{\overline{y}}_{i 1} - {\overline{y}}_{i 0}}{s_{i P}}$ with an estimated variance $s_{i}^{2} = \frac{1}{n_{i 0}} + \frac{1}{n_{i 1}} + \frac{y_{i}^{2}}{2 (n_{i 0} + n_{i 1})}$.

Besides this formula for estimating the within-study variance, Lin and Aloe²⁷ summarized many other formulas. Using different formulas can result in different estimates of the overall SMD, but this topic is beyond the scope of this paper. As with Cohen’s $d$, the observed data $y_{i}$ and $s_{i}^{2}$ are also correlated when using Hedges’ $g$ as the effect measure, which may affect the estimation results of meta-analyses.²⁸
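The MD and bias-corrected SMD formulas of this section can be sketched in Python as follows. This is only an illustration under the equal-variance assumption stated above; the function names and argument order are hypothetical.

```python
import math


def pooled_variance(n0, n1, var0, var1):
    """Pooled sample variance s_iP^2 of the two arms."""
    return ((n0 - 1) * var0 + (n1 - 1) * var1) / (n0 + n1 - 2)


def mean_difference(n0, n1, mean0, mean1, var0, var1):
    """Return (y_i, s_i^2) for the mean difference."""
    y = mean1 - mean0
    s2 = (1.0 / n0 + 1.0 / n1) * pooled_variance(n0, n1, var0, var1)
    return y, s2


def hedges_g(n0, n1, mean0, mean1, var0, var1):
    """Return (y_i, s_i^2) for the bias-corrected SMD."""
    sp = math.sqrt(pooled_variance(n0, n1, var0, var1))
    correction = 1.0 - 3.0 / (4.0 * (n0 + n1) - 9.0)
    y = correction * (mean1 - mean0) / sp
    # The variance depends on y itself, so y_i and s_i^2 are correlated.
    s2 = 1.0 / n0 + 1.0 / n1 + y ** 2 / (2.0 * (n0 + n1))
    return y, s2


# Hypothetical study: 10 participants per arm, common variance 4.
print(mean_difference(10, 10, 0.0, 1.0, 4.0, 4.0))
print(hedges_g(10, 10, 0.0, 2.0, 4.0, 4.0))
```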

2.3. Meta-analysis with a binary outcome

Suppose a $2 \times 2$ table is available from each collected study in a meta-analysis with a binary outcome (i.e. individual-level outcomes are reported from $k$ studies). Denote $n_{i 00}$ and $n_{i 01}$ as the number of participants without and with an event in the control group, respectively; $n_{i 10}$ and $n_{i 11}$ are the data cells in the treatment group. The sample sizes in the control and treatment groups are $n_{i 0} = n_{i 00} + n_{i 01}$ and $n_{i 1} = n_{i 10} + n_{i 11}$ . Also, denote $p_{i 0}$ and $p_{i 1}$ as the population event rates in the two groups.

The odds ratio (OR) is frequently used as the effect measure for a binary outcome; its true value in study $i$ is ${OR}_{i} = [p_{i 1} (1 - p_{i 0})] / [p_{i 0} (1 - p_{i 1})]$. Using individual-level data, the OR is estimated by ${\hat{OR}}_{i} = (n_{i 00} n_{i 11}) / (n_{i 01} n_{i 10})$. The ORs are usually combined on a logarithmic scale in meta-analyses, because the distribution of the estimated log OR, $y_{i} = \log {\hat{OR}}_{i}$, is better approximated by a normal distribution. The within-study variance of $y_{i}$ is estimated as $s_{i}^{2} = \frac{1}{n_{i 00}} + \frac{1}{n_{i 01}} + \frac{1}{n_{i 10}} + \frac{1}{n_{i 11}}$.

Moreover, the risk ratio (RR) and risk difference (RD) are also popular effect metrics, but they are not discussed in this article. Although RRs are more interpretable measures of association for clinicians,^29,30 the debate continues over the merits of the OR versus RR and their interpretations.^31,32 Doi et al.³³ argued that RRs should no longer be used in meta-analyses, because the RR depends on prevalence more so than on the strength of exposure-outcome association that it is supposed to reflect. Specifically, the RR is a ratio of two conditional probabilities that vary with outcome prevalence, whereas the OR is a true effect magnitude measure representing the multiplicative increase in odds of outcome from an unexposed state to an exposed state. The RD can be easily computed from the OR given a fixed baseline risk. When generating simulated meta-analyses for RDs and RRs under the RE model, it is difficult to keep $p_{i 0}$ and $p_{i 1}$ within the range [0, 1] if the true overall effect size is given. This is because the normality assumption $μ_{i} \overset{iid}{\sim} N (μ, τ^{2})$ can generate extreme values of $μ_{i}$ when $τ$ is non-zero. For example, if a true RD of study $i$ is simulated from $N (0.2, 0.2)$ as $μ_{i} = 0.8$, then $p_{i 1} = p_{i 0} + μ_{i}$ will exceed 1 if $p_{i 0}$ is fixed at a value larger than 0.2. One way to overcome this issue is to truncate such improper probabilities so they are between 0 and 1, but this constraint can produce bias that cannot be distinguished from the bias caused by sampling error.^28,34 Thus, the undesired effect of bounding the probabilities can be problematic, and it is inevitable when conducting simulation studies for RRs and RDs. Although some meta-analysts have explored other models to simulate data, there is still no general method that fixes the bias problem and is well accepted in the literature. Bakbergenuly et al.³⁴ evaluated the performance of a number of data-generating models, such as the binomial generalized linear mixed model with logit link function and the beta-binomial model, when effects are RRs. No gold standard emerged, and they encouraged future research to explore this topic. Therefore, we focus on analyzing the results of ORs when studies have binary outcomes.
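The boundary problem described above can be illustrated with a small simulation. The sketch below assumes that the second argument of $N (0.2, 0.2)$ is the variance $τ^{2}$ and uses a hypothetical control-group risk of 0.3; it estimates how often a simulated treatment-group risk falls outside [0, 1] and would therefore need truncation.

```python
import math
import random


def fraction_out_of_range(p0, mu, tau2, n_sim=20000, seed=1):
    """Share of simulated p_i1 = p_i0 + mu_i lying outside [0, 1]."""
    rng = random.Random(seed)
    tau = math.sqrt(tau2)
    out = 0
    for _ in range(n_sim):
        p1 = p0 + rng.gauss(mu, tau)  # true RD drawn from N(mu, tau2)
        if p1 < 0.0 or p1 > 1.0:
            out += 1
    return out / n_sim


# Hypothetical setting: control risk 0.3, overall RD 0.2, tau2 = 0.2.
print(fraction_out_of_range(p0=0.3, mu=0.2, tau2=0.2))
```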

When sample sizes are small, some data cells may be 0, even if the event is not rare. In general, if a $2 \times 2$ table contains zero cells, a fixed value of 0.5 is added to each data cell to reduce bias and avoid computational errors.^35–37 Although this continuity correction may not be optimal in some cases and alternative corrections can be used,^38–41 we add 0.5 to each cell in the following sections unless otherwise stated.
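A minimal sketch of the log OR calculation with this continuity correction is given below. The function name and the argument order (control without event, control with event, treatment without event, treatment with event) are illustrative assumptions.

```python
import math


def log_odds_ratio(n00, n01, n10, n11, correction=0.5):
    """Return (y_i, s_i^2) for the log OR of one 2 x 2 table.

    The correction is added to every cell only if at least one cell is zero.
    """
    cells = [n00, n01, n10, n11]
    if any(c == 0 for c in cells):
        cells = [c + correction for c in cells]
    c00, c01, c10, c11 = cells
    y = math.log((c00 * c11) / (c01 * c10))
    s2 = sum(1.0 / c for c in cells)
    return y, s2


print(log_odds_ratio(40, 10, 30, 20))  # no zero cell, no correction
print(log_odds_ratio(10, 0, 8, 2))     # zero cell, 0.5 added to all cells
```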

3. Estimates of the $I^{2}$ statistic

3.1. Point estimates

Because point estimates of the between-study variance $τ^{2}$ are used to calculate $I^{2}$ intervals, we first introduce these point estimators. As $I^{2}$ depends on $τ^{2}$ , these estimators further lead to point estimates of $I^{2}$ .

3.1.1. Method-of-moments approach

The estimator of $τ^{2}$ can be derived from the method-of-moments approach, which is based on the generalized $Q$ statistic,⁴² $Q_{a} = \sum a_{i} {(y_{i} - {\hat{μ}}_{a})}^{2}$ , where $a_{i}$ represents the weight assigned to the study $i$ and ${\hat{μ}}_{a} = \sum a_{i} y_{i} / \sum a_{i}$ . By equating $Q_{a}$ to its expected value, the general formula for the heterogeneity variance can be derived as

{\hat{τ}}^{2} = \max \left\{ 0, \frac{Q_{a} - \left( \sum a_{i} s_{i}^{2} - \frac{\sum a_{i}^{2} s_{i}^{2}}{\sum a_{i}} \right)}{\sum a_{i} - \frac{\sum a_{i}^{2}}{\sum a_{i}}} \right\} .

(2)

The DerSimonian–Laird (DL) estimator uses the CE model weights $a_{i} = w_{i, C E}$ , leading to⁴³:

{\hat{τ}}_{DL}^{2} = \max \left\{ 0, \frac{\sum w_{i, CE} {(y_{i} - {\hat{μ}}_{CE})}^{2} - (k - 1)}{\sum w_{i, CE} - \frac{\sum w_{i, CE}^{2}}{\sum w_{i, CE}}} \right\}

Note that the untruncated DL estimate can be negative; it is truncated to zero in such cases.
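The general moment estimator in equation (2) and its DL special case can be sketched as follows. The function names and data are illustrative assumptions; with CE weights, the term subtracted from $Q_{a}$ reduces to $k - 1$, which recovers the DL formula.

```python
def tau2_moments(y, s2, a):
    """Method-of-moments estimate of tau^2 for arbitrary weights a_i (equation (2))."""
    sum_a = sum(a)
    mu_a = sum(ai * yi for ai, yi in zip(a, y)) / sum_a
    q_a = sum(ai * (yi - mu_a) ** 2 for ai, yi in zip(a, y))
    # Part of E(Q_a) that does not involve tau^2.
    null_part = sum(ai * v for ai, v in zip(a, s2)) - sum(ai ** 2 * v for ai, v in zip(a, s2)) / sum_a
    denom = sum_a - sum(ai ** 2 for ai in a) / sum_a
    return max(0.0, (q_a - null_part) / denom)


def tau2_dl(y, s2):
    """DerSimonian-Laird estimate: equation (2) with a_i = 1 / s_i^2."""
    return tau2_moments(y, s2, [1.0 / v for v in s2])


# Hypothetical data for k = 5 studies.
y = [0.10, 0.30, 0.35, 0.65, 0.45]
s2 = [0.03, 0.03, 0.05, 0.01, 0.05]
print(tau2_dl(y, s2))
```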

3.1.2. Sidik–Jonkman (SJ) method

Sidik and Jonkman⁴⁴ proposed a two-step estimator producing positive $τ^{2}$ estimates

{\hat{τ}}_{SJ}^{2} = \frac{1}{k - 1} \sum \frac{1}{1 + (s_{i}^{2} / {\hat{τ}}_{0}^{2})} {(y_{i} - {\hat{μ}}_{SJ})}^{2}

where ${\hat{τ}}_{0}^{2} = \sum {(y_{i} - \overline{y})}^{2} / k$ is the initial heterogeneity variance estimate, $\overline{y}$ is the unweighted mean of the $y_{i}$, and ${\hat{μ}}_{SJ}$ is the weighted mean of the $y_{i}$ with weights $1 / [1 + (s_{i}^{2} / {\hat{τ}}_{0}^{2})]$, which are proportional to the RE weights $w_{i, RE} ({\hat{τ}}_{0}^{2})$ in equation (1).
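The two steps of the SJ estimator can be sketched as follows. The function name is an illustrative assumption, and the sketch does not handle the degenerate case in which all $y_{i}$ are identical (the initial estimate is then zero and the ratio $s_{i}^{2} / {\hat{τ}}_{0}^{2}$ is undefined).

```python
def tau2_sj(y, s2):
    """Two-step Sidik-Jonkman estimate of tau^2."""
    k = len(y)
    # Step 1: crude initial estimate from the unweighted spread of the y_i.
    y_bar = sum(y) / k
    tau2_0 = sum((yi - y_bar) ** 2 for yi in y) / k
    # Step 2: reweight using the ratios s_i^2 / tau2_0.
    weights = [1.0 / (1.0 + v / tau2_0) for v in s2]
    mu_sj = sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
    return sum(w * (yi - mu_sj) ** 2 for w, yi in zip(weights, y)) / (k - 1)


# Hypothetical data for k = 5 studies.
print(tau2_sj([0.10, 0.30, 0.35, 0.65, 0.45], [0.03, 0.03, 0.05, 0.01, 0.05]))
```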

3.1.3. Restricted maximum likelihood (REML) method

Based on the marginal distribution of the RE model, $y_{i} \sim N (μ, s_{i}^{2} + τ^{2})$ , the maximum likelihood (ML) estimate ${\hat{τ}}_{M L}^{2}$ is obtained by maximizing the log-likelihood function:

l (μ, τ^{2}) = - \frac{k}{2} \log (2 π) - \frac{1}{2} \sum \log (s_{i}^{2} + τ^{2}) - \frac{1}{2} \sum \frac{{(y_{i} - μ)}^{2}}{s_{i}^{2} + τ^{2}}

To derive the REML estimator, the above log-likelihood function is transformed to exclude the parameter $μ$.⁴⁵ By doing so, REML avoids assuming $μ$ is known and is therefore thought to be an improvement on the ML estimator.⁴⁶ The modified log-likelihood function is

l_{R} (τ^{2}) = - \frac{k}{2} \log (2 π) - \frac{1}{2} \sum \log (s_{i}^{2} + τ^{2}) - \frac{1}{2} \sum \frac{{[y_{i} - {\hat{μ}}_{RE} ({\hat{τ}}_{ML}^{2})]}^{2}}{s_{i}^{2} + τ^{2}} - \frac{1}{2} \log \left( \sum \frac{1}{s_{i}^{2} + τ^{2}} \right)

Maximizing this modified log-likelihood function with respect to $τ^{2}$ gives the between-study variance estimate

{\hat{τ}}_{REML}^{2} = \max \left\{ 0, \frac{\sum a_{i}^{2} [{(y_{i} - {\hat{μ}}_{RE} ({\hat{τ}}_{ML}^{2}))}^{2} - s_{i}^{2}]}{\sum a_{i}^{2}} + \frac{1}{\sum a_{i}} \right\}

where $a_{i} = 1 / (s_{i}^{2} + {\hat{τ}}_{REML}^{2})$. Because $a_{i}$ itself depends on ${\hat{τ}}_{REML}^{2}$, the REML estimate is calculated iteratively. The Fisher scoring algorithm is used for the iteration of the REML estimates in this article, as implemented in the R package “metafor.”⁴⁷
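The iterative nature of the REML estimate can be illustrated with a simple fixed-point scheme based on the displayed formula. This is only a sketch and not the Fisher scoring algorithm used in the article; as an assumption, the pooled mean is re-evaluated at the current value of $τ^{2}$ in every iteration, and the function name, tolerance, and iteration cap are hypothetical.

```python
def tau2_reml(y, s2, tol=1e-10, max_iter=1000):
    """Fixed-point iteration for the REML estimate of tau^2 (illustrative only)."""
    tau2 = 0.0
    for _ in range(max_iter):
        a = [1.0 / (v + tau2) for v in s2]
        sum_a = sum(a)
        sum_a2 = sum(ai ** 2 for ai in a)
        # Pooled RE mean at the current tau2 (assumption of this sketch).
        mu = sum(ai * yi for ai, yi in zip(a, y)) / sum_a
        numerator = sum(ai ** 2 * ((yi - mu) ** 2 - v) for ai, yi, v in zip(a, y, s2))
        new_tau2 = max(0.0, numerator / sum_a2 + 1.0 / sum_a)
        if abs(new_tau2 - tau2) < tol:
            return new_tau2
        tau2 = new_tau2
    return tau2


# Hypothetical data with equal within-study variances.
print(tau2_reml([0.0, 0.5, 1.0, 1.5, 2.0], [0.1] * 5))
```

With equal within-study variances, the fixed point equals the sample variance of the $y_{i}$ (with divisor $k - 1$) minus the common within-study variance, which gives a convenient check of the iteration.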

3.2. Interval estimates

3.2.1. Interval estimates for $I^{2}$ based on the $Q$ statistic

The $I^{2}$ statistic originates from the $Q$ statistic by assuming within-study variances are equal (i.e. $s_{i}^{2} = σ^{2}$) and by equating the observed $Q$ with its expectation, so we have

I^{2} = \frac{τ^{2}}{τ^{2} + σ^{2}} = \frac{Q - (k - 1)}{Q}

(3)

which is a function of $Q$ .⁴⁸ A widely used truncation (i.e. $I^{2}$ is set to 0 if $Q \leq k - 1$ ) is applied because conceptually the $I^{2}$ statistic should be non-negative. Using equation (3), the $I^{2}$ interval estimate can be calculated by evaluating quantiles from the cumulative distribution function (CDF) of the $Q$ statistic (i.e. $F_{Q}$ ). Biggerstaff and Jackson²¹ (BJ) developed three approaches to approximate the distribution of $Q$ under the RE model.

The two-moment gamma approximation of $F_{Q}$ , with shape parameter $α$ and scale parameter $β$ , is obtained by matching the first two moments of the gamma and $Q$ distributions. Explicit expressions for the mean and variance of $Q$ are

E (Q) = k - 1 + (S_{1} - \frac{S_{2}}{S_{1}}) τ^{2}

Var (Q) = 2 (k - 1) + 4 (S_{1} - \frac{S_{2}}{S_{1}}) τ^{2} + 2 (S_{2} + \frac{S_{2}^{2}}{S_{1}^{2}} - 2 \frac{S_{3}}{S_{1}}) τ^{4}

where $S_{r} = \sum w_{i, CE}^{r}$. The proof of the two formulas is given by Biggerstaff and Tweedie.⁴⁹ Using any non-negative estimate of $τ^{2}$ in the above two formulas, the first two moments of $Q$ can be estimated, denoted as $\hat{E} (Q)$ and $\widehat{Var} (Q)$. Therefore, solving the equations $E (Q) = α β$ and $Var (Q) = α β^{2}$ with the estimated values plugged in gives $\hat{α} = [\hat{E} (Q)]^{2} / \widehat{Var} (Q)$ and $\hat{β} = \widehat{Var} (Q) / \hat{E} (Q)$. The $F_{Q}$ is then approximated by the gamma CDF with shape $\hat{α}$ and scale $\hat{β}$.

The Pearson type III distribution provides an extension of the previous two-moment gamma approximation by adding the third central moment (TCM) of $Q$ , which is derived similarly to the variance of $Q$ :

TCM (Q) = E [{(Q - E (Q))}^{3}] = 8 (k - 1) + 24 (S_{1} - \frac{S_{2}}{S_{1}}) τ^{2} + 24 (S_{2} - 2 \frac{S_{3}}{S_{1}} + \frac{S_{2}^{2}}{S_{1}^{2}}) τ^{4} + 8 (S_{3} - 3 \frac{S_{4}}{S_{1}} + 3 \frac{S_{2} S_{3}}{S_{1}^{2}} - \frac{S_{2}^{3}}{S_{1}^{3}}) τ^{6}

Matching all three moments with parameters of the Pearson type III distribution, emphasizing the dependence on $τ^{2}$ , gives

r (τ^{2}) = \frac{4 Var {(Q)}^{3}}{TCM {(Q)}^{2}}, \quad θ (τ^{2}) = \frac{2 Var (Q)}{TCM (Q)}, \quad \text{and} \quad γ (τ^{2}) = E (Q) - \frac{2 Var {(Q)}^{2}}{TCM (Q)}

Therefore, the approximation of $F_{Q}$ can easily be calculated from the Pearson type III CDF with location parameter $γ (τ^{2})$ , shape parameter $r (τ^{2})$ and rate parameter $θ (τ^{2})$ . This approximation can be obtained by plugging in ${\hat{τ}}^{2}$ to the three parameters. Note that although the three-moment Pearson type III approximation is intended as an improvement on the two-moment gamma approximation as it matches a further moment, it has support $[γ ({\hat{τ}}^{2}), \infty)$ , hence it is not appropriate to approximate $F_{Q}$ when values of $Q$ are extremely small, especially if those values are less than $γ ({\hat{τ}}^{2})$ .
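The moment formulas and the resulting parameters of the two-moment gamma and Pearson type III approximations can be sketched as follows. The function names are illustrative assumptions, and the CDFs themselves are not evaluated here. As a sanity check, at $τ^{2} = 0$ both parameter sets reduce to those of the $χ_{k - 1}^{2}$ distribution (shape $(k - 1) / 2$, scale 2 or rate 1/2, location 0).

```python
def q_moments(variances, tau2):
    """Mean, variance and third central moment of Q under the RE model."""
    k = len(variances)
    w = [1.0 / v for v in variances]
    S1, S2, S3, S4 = (sum(wi ** r for wi in w) for r in (1, 2, 3, 4))
    mean = (k - 1) + (S1 - S2 / S1) * tau2
    var = (2 * (k - 1) + 4 * (S1 - S2 / S1) * tau2
           + 2 * (S2 - 2 * S3 / S1 + S2 ** 2 / S1 ** 2) * tau2 ** 2)
    tcm = (8 * (k - 1) + 24 * (S1 - S2 / S1) * tau2
           + 24 * (S2 - 2 * S3 / S1 + S2 ** 2 / S1 ** 2) * tau2 ** 2
           + 8 * (S3 - 3 * S4 / S1 + 3 * S2 * S3 / S1 ** 2 - S2 ** 3 / S1 ** 3) * tau2 ** 3)
    return mean, var, tcm


def gamma_parameters(variances, tau2):
    """Shape alpha and scale beta of the two-moment gamma approximation."""
    mean, var, _ = q_moments(variances, tau2)
    return mean ** 2 / var, var / mean


def pearson3_parameters(variances, tau2):
    """Shape r, rate theta and location gamma of the Pearson type III approximation."""
    mean, var, tcm = q_moments(variances, tau2)
    return 4 * var ** 3 / tcm ** 2, 2 * var / tcm, mean - 2 * var ** 2 / tcm


# Hypothetical within-study variances for k = 5 studies.
s2 = [0.03, 0.03, 0.05, 0.01, 0.05]
print(gamma_parameters(s2, 0.04), pearson3_parameters(s2, 0.04))
```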

A further approximation expected to be more accurate in the tails of the distribution is the saddlepoint approximation, given in the present case by Kuonen⁵⁰ using the Barndorff–Nielsen formulation. This requires the cumulant generating function of $Q$, denoted by $K (s)$, and its first two derivatives, given by

K (s) = - \frac{1}{2} \sum_{i = 1}^{k - 1} \log (1 - 2 λ_{i} s), \quad K^{(1)} (s) = \sum_{i = 1}^{k - 1} \frac{λ_{i}}{1 - 2 λ_{i} s}, \quad \text{and} \quad K^{(2)} (s) = 2 \sum_{i = 1}^{k - 1} {\left( \frac{λ_{i}}{1 - 2 λ_{i} s} \right)}^{2}

where $s < 1 / (2 λ_{1})$ and $λ_{1} \geq λ_{2} \geq \dots \geq λ_{k - 1} \geq 0$ with $λ_{k} = 0$ are the ordered eigenvalues of $S = Σ^{1 / 2} A Σ^{1 / 2}$. Here, $Σ$ is the diagonal matrix with entries $s_{i}^{2} + τ^{2}$, and $A = W - (1 / \sum_{i} w_{i, CE}) w w^{t}$, where $W$ is the diagonal matrix containing the $w_{i, CE} = 1 / s_{i}^{2}$, $w$ is the vector containing the $w_{i, CE}$, and the superscript $t$ denotes matrix transpose. Plugging in the $τ^{2}$ estimate, the saddlepoint approximation to the CDF, $F_{S} (x) \approx P (Q \leq x)$, can be calculated in two steps. First, we solve the equation $K^{(1)} (\hat{s}) = x$ for $\hat{s}$; the solution is referred to as the saddlepoint. Next, we compute $a = \mathrm{sign} (\hat{s}) \sqrt{2 [\hat{s} x - K (\hat{s})]}$ and $b = \hat{s} \sqrt{K^{(2)} (\hat{s})}$. The saddlepoint approximation is then given by $F_{S} (x) = Φ (a + \frac{1}{a} \log (\frac{b}{a}))$, where $Φ$ is the standard normal CDF. Our objective is to obtain the $I^{2}$ interval through evaluating the quantiles of the $Q$ statistic, so $F_{S} (x)$ is solved for $x$ at given probabilities (e.g. 0.025 and 0.975).

The test-based method⁵¹ provides another way to compute the CI of $Q$, and hence of the $I^{2}$ statistic.¹⁶ Appendix A2 of Higgins and Thompson⁴ describes in detail how to apply the test-based method to calculate the $95 %$ CI of the heterogeneity measure

H = \sqrt{\frac{Q}{k - 1}}

where $H$ is defined to be 1 whenever $Q \leq k - 1$. Because $I^{2} = 1 - 1 / H^{2}$ is a monotonically increasing function of $H^{2}$, we briefly present results that are used to estimate the $95 %$ CI of $H$ here, and the corresponding $95 %$ CI for $I^{2}$ can be readily calculated via the relationship between $I^{2}$ and $H^{2}$. The logarithm of $Q$ is used in this method to remove some of the skew inherent in the distribution of $Q$. A test-based standard error of $\log (H)$ is

SE [\log (H)] = \begin{cases} \sqrt{\frac{1}{2 (k - 2)} \left[ 1 - \frac{1}{3 {(k - 2)}^{2}} \right]}, & Q \leq k \\ \frac{\log (Q) - \log (k - 1)}{2 (\sqrt{2 Q} - \sqrt{2 k - 3})}, & Q > k \end{cases}

Then a $95 %$ CI for $H$ follows as $\exp \{ \log (H) \pm 1.96 \, SE [\log (H)] \}$. Therefore, a test-based interval estimate for $I^{2}$ is constructed by transforming the lower and upper bounds of $H$ through $I^{2} = 1 - 1 / H^{2}$.
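The test-based interval can be sketched in a few lines of Python. The function names are illustrative assumptions; the sketch requires $k > 2$ and truncates negative $I^{2}$ bounds to zero, mirroring the truncation of the point estimate in equation (3).

```python
import math


def i2_from_q(q, k):
    """Point estimate of I^2 from equation (3), truncated at zero."""
    return max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0


def i2_test_based_ci(q, k, z=1.96):
    """Test-based CI for I^2 obtained from the CI of H (requires k > 2)."""
    h = max(1.0, math.sqrt(q / (k - 1)))
    if q > k:
        se = 0.5 * (math.log(q) - math.log(k - 1)) / (math.sqrt(2 * q) - math.sqrt(2 * k - 3))
    else:
        se = math.sqrt(1.0 / (2 * (k - 2)) * (1.0 - 1.0 / (3 * (k - 2) ** 2)))
    h_lower = math.exp(math.log(h) - z * se)
    h_upper = math.exp(math.log(h) + z * se)
    # Map the bounds of H to I^2 = 1 - 1 / H^2, truncating at zero.
    return tuple(max(0.0, 1.0 - 1.0 / hb ** 2) for hb in (h_lower, h_upper))


# Hypothetical meta-analysis with k = 10 studies and Q = 20.
print(i2_from_q(20.0, 10), i2_test_based_ci(20.0, 10))
```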

The non-parametric bootstrap CI of $I^{2}$ can be obtained by sampling $k$ studies with replacement from the observed pairs $(y_{i}, s_{i}^{2})$, and $I^{2}$ is estimated for each bootstrap sample using equation (3). Repeating the process $B$ (e.g. 1000) times, a $95 %$ CI is given by the 2.5th and 97.5th percentiles of the $B$ bootstrap values of $I^{2}$.
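The resampling procedure can be sketched as follows. The function names, the seed, and the use of Python's inclusive percentile rule are assumptions made for illustration; other percentile conventions would give slightly different limits.

```python
import random
import statistics


def i2_from_data(y, s2):
    """I^2 from equation (3), computed from effect estimates and variances."""
    k = len(y)
    w = [1.0 / v for v in s2]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))
    return max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0


def bootstrap_i2_ci(y, s2, n_boot=1000, seed=2024):
    """Percentile CI from resampling the pairs (y_i, s_i^2) with replacement."""
    rng = random.Random(seed)
    k = len(y)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(k) for _ in range(k)]
        values.append(i2_from_data([y[i] for i in idx], [s2[i] for i in idx]))
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return cuts[0], cuts[-1]  # 2.5th and 97.5th percentiles


# Hypothetical data for k = 5 studies.
y = [0.10, 0.30, 0.35, 0.65, 0.45]
s2 = [0.03, 0.03, 0.05, 0.01, 0.05]
print(i2_from_data(y, s2), bootstrap_i2_ci(y, s2))
```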

In sum, five methods to estimate $I^{2}$ intervals are summarized in this subsection. It should be noted that the three methods using the approximated $F_{Q}$ need first to estimate $τ^{2}$ , whereas the other two methods do not depend on ${\hat{τ}}^{2}$ , as shown in Table 1.

Table 1.

Summary of the five methods to calculate confidence intervals for I² based on the Q statistic.

| Methods to estimate I² intervals | Approximation | Uses the estimated τ² |
|---|---|---|
| Biggerstaff and Jackson’s approximated cumulative distribution functions of Q | Two-moment gamma (TMG) | Yes |
| | Pearson type III distribution (PIII) | Yes |
| | Saddlepoint approximation (S) | Yes |
| Test-based approach (T) | | No |
| Non-parametric bootstrap (NPBS) | | No |

Yes: the method requires the estimated between-study variance; No: the method does not require it.

3.2.2. Interval estimates for $I^{2}$ based on the between-study variance

Considering the DL estimate of the between-study variance $τ^{2}$, the $Q$ statistic can be expressed in terms of ${\hat{τ}}^{2}$ via equation (2) with $a_{i} = w_{i, CE}$. Note that $Q = k - 1$ when ${\hat{τ}}^{2} = 0$; this matches the widely used truncation in which $Q$ is set to $k - 1$ if $Q \leq k - 1$. Replacing the $Q$ statistic with its expression in terms of ${\hat{τ}}^{2}$ in equation (3), $I^{2}$ can be expressed as a function of the estimated between-study variance

I^{2} = \frac{{\hat{τ}}^{2}}{{\hat{τ}}^{2} + \frac{(k - 1) \sum s_{i}^{- 2}}{{(\sum s_{i}^{- 2})}^{2} - \sum s_{i}^{- 4}}}

(4)

In this expression, the summary of the within-study variance (i.e. the moment-based sampling error) is treated as follows:

{\hat{σ}}^{2} = \frac{(k - 1) \sum s_{i}^{- 2}}{{(\sum s_{i}^{- 2})}^{2} - \sum s_{i}^{- 4}}

However, treating $I^{2}$ as a function of ${\hat{τ}}^{2}$ relies on the accuracy of the summary estimate ${\hat{σ}}^{2}$, because the calculation or interpretation of the $I^{2}$ statistic can be seriously distorted if ${\hat{σ}}^{2}$ provides a misleading estimate.⁵² Nevertheless, we use the moment-based sampling error throughout this article because it is consistent with the definition of $I^{2}$. Improving the summary estimate of the within-study variance will be explored in our future research. For a given meta-analysis, interval estimates of $I^{2}$ can be calculated from interval estimates of the between-study variance via the monotonically increasing function $I^{2} ({\hat{τ}}^{2})$ in equation (4). Therefore, calculating the CI for $I^{2}$ is one step further than estimating the CI for $τ^{2}$. Researchers have conducted comprehensive overviews to compare estimation methods for $τ^{2}$ and its uncertainty.^53,54 Although equation (4) is derived using the DL estimate of $τ^{2}$, other methods can be used to calculate ${\hat{τ}}^{2}$ because different estimators aim to estimate the same true between-study variance. The three point estimators of $τ^{2}$ in Section 3.1 and the following six interval estimators of $τ^{2}$ are considered to obtain the CI for $τ^{2}$, and thus the CI for $I^{2}$, in this article.
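The mapping from a point or interval estimate of $τ^{2}$ to $I^{2}$ in equation (4) can be sketched as follows. The function names are illustrative assumptions. With equal within-study variances, the moment-based summary ${\hat{σ}}^{2}$ equals the common variance, which provides a simple check.

```python
def typical_within_variance(s2):
    """Moment-based summary of the within-study variances (sigma-hat squared)."""
    k = len(s2)
    sum_w = sum(1.0 / v for v in s2)
    sum_w2 = sum(1.0 / v ** 2 for v in s2)
    return (k - 1) * sum_w / (sum_w ** 2 - sum_w2)


def i2_from_tau2(tau2, s2):
    """Equation (4): I^2 as a monotonically increasing function of tau^2."""
    return tau2 / (tau2 + typical_within_variance(s2))


def i2_interval(tau2_lower, tau2_upper, s2):
    """Transform a CI for tau^2 into a CI for I^2 by applying equation (4) to both limits."""
    return i2_from_tau2(tau2_lower, s2), i2_from_tau2(tau2_upper, s2)


# Hypothetical within-study variances and a hypothetical CI for tau^2.
s2 = [0.03, 0.03, 0.05, 0.01, 0.05]
print(i2_from_tau2(0.02, s2), i2_interval(0.0, 0.15, s2))
```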

Specifically, we summarize interval estimation methods for the between-study variance below:

Profile likelihood CI of the REML estimator (PL-REML). The PL method⁵⁵ is based on the log-likelihood function and is an iterative process that provides CIs for the between-study variance, considering the fact that $μ$ needs to be estimated as well. The $95 %$ PL CI for $τ^{2}$ consists of the values that are not rejected by the likelihood ratio test with $τ^{2}$ under the null hypothesis. For the REML estimator, the $τ^{2}$ values in the CI are obtained by solving
$l_{R} (τ^{2}) > l_{R} ({\hat{τ}}_{REML}^{2}) - \frac{1}{2} χ_{1, 0.95}^{2}$
where $χ_{1,0.95}^{2} = 3.841$ is the 95th percentile of the $χ_{1}^{2}$ distribution. The method produces wide CIs with very high coverage probabilities when $τ^{2} = 0$, and the coverage probabilities decrease to the nominal level as $τ^{2}$ increases.⁵⁶
Q-profile CI (QP). The QP method is based on the generalized $Q$ statistic ($Q_{a}$ in Section 3.1) with $a_{i} = 1 / (s_{i}^{2} + τ^{2})$, which follows a $χ_{k - 1}^{2}$ distribution. Viechtbauer⁵⁶ shows that the $Q$-profile CI is obtained by iteratively solving $Q_{a} ({\tilde{τ}}_{L}^{2}) = χ_{k - 1,0.975}^{2}$ and $Q_{a} ({\tilde{τ}}_{U}^{2}) = χ_{k - 1,0.025}^{2}$, where ${\tilde{τ}}_{L}^{2}$ and ${\tilde{τ}}_{U}^{2}$ are the lower and upper confidence limits, respectively. The corresponding CIs have been shown to achieve nominal coverage probabilities even in small samples.⁵⁶ However, the estimated within-study variance $s_{i}^{2}$ is not the true within-study variance $σ_{i}^{2}$ for each study. Therefore, in practice, the generalized $Q$ statistic no longer follows the assumed chi-squared distribution. This method is implemented in the R package “metafor” as the default approach to compute the CI for $τ^{2}$.
Biggerstaff and Jackson CI (BJ). Using the CDF of $Q$, $F_{Q} (x; τ^{2})$, evaluated at the observed value $x$ of $Q$, Biggerstaff and Jackson²¹ proposed a method to calculate a $95 %$ CI for the between-study variance by solving the following equations for $τ^{2}$ to obtain the lower and upper limits, respectively:
$(1 - F_{Q} (x; τ^{2}) = 0.025, F_{Q} (x; τ^{2}) = 0.025)$
When $F_{Q} (x; τ^{2} = 0) < 0.025$, the interval is set as $[0, 0]$. If $1 - F_{Q} (x; τ^{2} = 0) > 0.025$, the lower bound of the CI is set equal to 0. The CDF $F_{Q} (x; τ^{2})$ may be calculated using the algorithm by Farebrother⁵⁷ for the positive linear combination of chi-squared random variables.
Jackson CI (J). An extension of the BJ CI is suggested by Jackson⁵⁸ using $Q_{a}$. The generalized statistic $Q_{a}$ has been shown to be distributed as a linear combination of $χ^{2}$ random variables, so methods like BJ can be used. The CDF of $Q_{a}$, $F_{Q_{a}} (x; τ^{2})$, is a continuous and strictly decreasing function of $τ^{2}$. The $95 %$ CI of $τ^{2}$ is obtained as:
$(1 - F_{Q_{a}} (x; τ^{2}) = 0.025, F_{Q_{a}} (x; τ^{2}) = 0.025)$
When $F_{Q_{a}} (x; τ^{2} = 0) < 0.025$ , the interval is set as $[0, 0]$ . If $1 - F_{Q_{a}} (x; τ^{2} = 0) > 0.025$ , the lower bound of CI is set equal to 0. For moderate $τ^{2}$ , Jackson recommends using the $J$ interval with weights $a_{i} = 1 / s_{i}$ , which are used in this article. The BJ and $J$ CIs for $τ^{2}$ are calculated using the R code provided by Jackson.⁵⁸
Sidik and Jonkman CI (SJ). Sidik and Jonkman⁴⁴ propose a method based on the SJ estimator with the 2.5th and 97.5th quantiles of the $χ_{k - 1}^{2}$ distribution:
$(\frac{(k - 1) {\hat{τ}}_{SJ}^{2}}{χ_{k - 1, 0.975}^{2}}, \frac{(k - 1) {\hat{τ}}_{SJ}^{2}}{χ_{k - 1, 0.025}^{2}})$
As ${\hat{τ}}_{SJ}^{2}$ takes non-negative values, the interval is also non-negative. Simulation studies indicate that the SJ intervals have very poor coverage probability when $τ^{2}$ is small, but as $k$ and $τ^{2}$ increase, the coverage probability becomes close to the nominal value.⁴⁴,⁵⁶
Bootstrap CI. For any consistent and non-negative estimator of $τ^{2}$ , parametric bootstrap CIs can be obtained by generating $k$ values from the distribution $y_{i} \sim N ({\hat{μ}}_{RE} ({\hat{τ}}^{2}), {\hat{τ}}^{2} + s_{i}^{2})$ , where ${\hat{τ}}^{2}$ is the between-study variance estimate and ${\hat{μ}}_{RE} ({\hat{τ}}^{2})$ is given by equation (1). Next, the between-study variance is estimated from the bootstrap sample. After repeating this process $B$ (e.g. 1000) times, the CI is constructed by taking the 2.5th and 97.5th percentiles of the distribution of the ${\hat{τ}}^{2}$ values. Non-parametric bootstrap CIs are obtained via a similar process, where $k$ studies are sampled with replacement from the observed pairs ( $y_{i}, s_{i}^{2}$ ). For each bootstrap sample, $τ^{2}$ can be estimated using the same specified method (e.g. REML). Repeating the process $B$ times, a 95% CI is given by the 2.5th and 97.5th percentiles of the $B$ values of ${\hat{τ}}^{2}$ . The normal distribution assumption of observed effects is not required in the non-parametric bootstrap method, but its coverage performance has been doubted because of the substantial deviation from the nominal level in simulation studies.⁵⁶
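
Two of the procedures described above, the iterative Q-profile solution and the bootstrap percentile intervals, can be sketched in a few lines. This is a minimal illustration rather than the metafor implementation or the authors' code. It assumes that $Q_{a}$ is centred at the weighted mean computed with the same weights, that the chi-squared quantiles are supplied by the caller, that the DL estimator stands in for whichever estimator of $τ^{2}$ is chosen, and that equation (1) is the usual inverse-variance weighted random-effects mean. All function names, the search bound `upper`, and the toy data are hypothetical.

```python
import random


def generalized_q(tau2, y, s2):
    """Generalized Q with weights a_i = 1 / (s_i^2 + tau2)."""
    a = [1.0 / (v + tau2) for v in s2]
    mu = sum(ai * yi for ai, yi in zip(a, y)) / sum(a)
    return sum(ai * (yi - mu) ** 2 for ai, yi in zip(a, y))


def solve_tau2(target, y, s2, upper=1e4, iterations=200):
    """Bisection for generalized_q(tau2) = target on [0, upper].

    generalized_q decreases in tau2, so the limit is truncated at 0 when
    the statistic is already at or below the target at tau2 = 0.
    Assumption: any root lies below `upper`.
    """
    if generalized_q(0.0, y, s2) <= target:
        return 0.0
    lo, hi = 0.0, upper
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if generalized_q(mid, y, s2) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def q_profile_ci(y, s2, chi2_q025, chi2_q975):
    """(tau2_L, tau2_U): the upper quantile gives the lower limit and vice versa."""
    return solve_tau2(chi2_q975, y, s2), solve_tau2(chi2_q025, y, s2)


def dl_tau2(y, s2):
    """DerSimonian-Laird estimate, used here only as a stand-in estimator."""
    w = [1.0 / v for v in s2]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))
    return max(0.0, (q - (len(y) - 1)) / (sw - sum(wi * wi for wi in w) / sw))


def percentile(sorted_values, p):
    """Percentile by linear interpolation between order statistics."""
    pos = p * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (pos - low) * (sorted_values[high] - sorted_values[low])


def bootstrap_ci(y, s2, parametric, B=1000, seed=1):
    """Percentile bootstrap CI for tau^2 (parametric or non-parametric)."""
    rng = random.Random(seed)
    k = len(y)
    tau2 = dl_tau2(y, s2)
    w = [1.0 / (v + tau2) for v in s2]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)  # assumed form of equation (1)
    boot = []
    for _ in range(B):
        if parametric:
            y_star = [rng.gauss(mu, (tau2 + v) ** 0.5) for v in s2]
            s2_star = s2
        else:
            idx = [rng.randrange(k) for _ in range(k)]  # resample (y_i, s_i^2) pairs
            y_star = [y[i] for i in idx]
            s2_star = [s2[i] for i in idx]
        boot.append(dl_tau2(y_star, s2_star))
    boot.sort()
    return percentile(boot, 0.025), percentile(boot, 0.975)


# Toy data (hypothetical): k = 5 effect estimates and within-study variances.
y = [0.1, 0.5, -0.2, 0.8, 0.3]
s2 = [0.04, 0.09, 0.05, 0.10, 0.06]
# Tabulated chi-squared quantiles for k - 1 = 4 degrees of freedom.
CHI2_4_Q025, CHI2_4_Q975 = 0.484419, 11.143287

qp = q_profile_ci(y, s2, CHI2_4_Q025, CHI2_4_Q975)
pbs = bootstrap_ci(y, s2, parametric=True)
npbs = bootstrap_ci(y, s2, parametric=False)
print(qp, pbs, npbs)
```

For these toy data the generalized statistic at $τ^{2} = 0$ is already below the upper quantile, so the Q-profile lower limit is truncated at 0. In practice the quantiles would come from a statistics library and the estimator would be replaced by the one under study.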

So far, for generic effect measures, multiple approaches to calculating 95% CIs of $I^{2}$ have been presented in two directions: based on the $Q$ statistic or based on the between-study variance. Nevertheless, these methods have been criticized by researchers because they can be unreliable in real-world meta-analyses.⁸,⁵⁹ For the $Q$ -profile method, the assumption that the generalized $Q$ statistic follows $χ_{k - 1}^{2}$ may not be an accurate approximation, especially when study-specific sample sizes are small or moderate. The three methods based on three approximations of $F_{Q}$ proposed by Biggerstaff and Jackson²¹ also require sufficiently large studies (i.e. large sample sizes of studies) and the assumption that effect sizes are normally distributed. For the test-based method, Hoaglin⁸ points out that: (1) the CI involving the test-based standard error is valid only under the null hypothesis; (2) the standard normal approximation, $Z = \sqrt{2 Q} - \sqrt{2 k - 3}$ , used in the method requires “large” degrees of freedom (e.g. over 100); and (3) subtracting $\log (k - 1)$ is not exactly the same as subtracting the mean of $\log (Q)$ . Therefore, the test-based CI for $I^{2}$ can be an unreliable reflection of heterogeneity. Other approaches to improve the estimation of CIs for $τ^{2}$ , and thus for $I^{2}$ , have been discussed recently.²¹,⁵⁴,⁶⁰ For example, Knapp et al.⁶⁰ suggested a modified $Q$ -profile method that uses a different weighting scheme for the generalized $Q$ statistic to determine the lower bound of the interval for $τ^{2}$ , while the upper bound remains the same as that of the original $Q$ -profile method. However, the improvement of the modified $Q$ -profile method is subtle, and the weighting scheme for the lower bound is lacking when effect measures are SMDs.

To handle the problem that $χ_{k - 1}^{2}$ can be an inaccurate null distribution, Kulinskaya et al.¹⁸,²⁰ proposed a series of methods that provide appropriate CIs for $τ^{2}$ by combining the $Q$ -profile method with corrected null approximations of the $Q$ statistic. The distribution of $Q$ under the null hypothesis of homogeneity depends on the statistics used to estimate the effects and the weights. Two methods to estimate CIs for $τ^{2}$ , and thus for $I^{2}$ , are introduced for two effect measures, the SMD and the OR, as follows:

Kulinskaya–Dollinger–Bjørkestøl CI (KDB). When using Hedges’ $g$ as the estimator of the SMD, Kulinskaya et al.¹⁸ derived $O (1 / n)$ corrections to the moments of $Q$ and suggested using the chi-squared distribution with degrees of freedom equal to the estimate of the corrected first moment, denoted by $χ_{E (Q)}^{2}$ , to approximate the distribution of $Q$ . The detailed expression of $E (Q)$ is provided along with the R code by Kulinskaya et al.;¹⁸ it is not presented here because its concrete form is complicated. The lower and upper confidence limits for $τ^{2}$ can be calculated iteratively from the upper and lower quantiles of $χ_{E (Q)}^{2}$ , respectively:
$Q (τ_{L}^{2}) = χ_{E (Q), 0.975}^{2}, Q (τ_{U}^{2}) = χ_{E (Q), 0.025}^{2}$
Then, the corresponding CI for $I^{2}$ is obtained via equation (4).
Kulinskaya–Dollinger CI (KD). When effect measures are log ORs, Kulinskaya and Dollinger²⁰ obtain corrected approximations for the mean and variance of the $Q$ statistic under the null hypothesis. They then match those corrected moments to construct a gamma distribution that closely fits the null distribution of $Q$ , and their simulations confirm that the gamma approximation outperforms the chi-squared approximation.²⁰ The improved approximation blends theoretical derivation with simulation results. Let $E_{KD} (Q)$ denote the corrected expectation of $Q$ when $τ^{2} = 0$ . This corrected first moment can be written as
$E_{KD} (Q) = k - 1 - 0.687 [k - 1 - E_{th} (Q)]$
where $E_{th} (Q)$ is a theoretical moment obtained from their general expansion of the mean of $Q$ for arbitrary binary effect measures. The detailed expression of $E_{th} (Q)$ is presented in Appendix B.3 of Kulinskaya and Dollinger.²⁰ For large sample sizes, $E_{th} (Q)$ converges to $k - 1$ . The corrected variance of $Q$ , denoted by ${Var}_{KD} (Q)$ , is a quadratic function of the corrected mean and is calculated by
${Var}_{KD} (Q) = 4.74 (k - 1) - 12.17 E_{KD} (Q) + \frac{9.42}{k - 1} {[E_{KD} (Q)]}^{2}$
Then, the shape parameter $α$ of the gamma distribution approximating $F_{Q}$ is estimated by $\hat{α} = {[E_{KD} (Q)]}^{2} / {Var}_{KD} (Q)$ , and the scale parameter $β$ is estimated by $\hat{β} = {Var}_{KD} (Q) / E_{KD} (Q)$ . Therefore, the KD interval estimate of $τ^{2}$ is obtained by iteratively solving:
$Q (τ_{L}^{2}) = F_{Q, 0.975}, Q (τ_{U}^{2}) = F_{Q, 0.025}$
Here, $F_{Q, 0.975}$ and $F_{Q, 0.025}$ denote the 97.5th and 2.5th percentiles of the fitted gamma distribution. The corresponding CI for $I^{2}$ is calculated via equation (4). Unlike all other methods, the KD interval estimate of $τ^{2}$ is based on the $2 \times 2$ table in which 0.5 is added to each cell regardless of whether zero cells exist; this adjustment is handled in the code.
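
The moment-matching step of the KD method can be illustrated with a short sketch. It only evaluates the corrected mean, the corrected variance, and the resulting gamma parameters from the formulas above. The theoretical mean $E_{th} (Q)$ is treated as a user-supplied number because its expression is not reproduced in this article, and the function name and example inputs are hypothetical. Solving for the confidence limits would additionally need gamma quantiles from a statistics library, which is not shown.

```python
def kd_gamma_parameters(k, e_th):
    """Moment-matched gamma parameters for the null distribution of Q.

    `e_th` is the theoretical mean of Q; its expression is given in the
    cited paper and is treated here as a user-supplied input.
    """
    e_kd = (k - 1) - 0.687 * ((k - 1) - e_th)
    var_kd = 4.74 * (k - 1) - 12.17 * e_kd + 9.42 / (k - 1) * e_kd ** 2
    shape = e_kd ** 2 / var_kd  # alpha-hat
    scale = var_kd / e_kd       # beta-hat
    return e_kd, var_kd, shape, scale


# Large-sample limit: e_th equals k - 1 (here k = 20, a hypothetical choice).
e_kd, var_kd, shape, scale = kd_gamma_parameters(20, 19.0)
print(e_kd, var_kd, shape, scale)
```

In this large-sample limit the three coefficients combine to 4.74 − 12.17 + 9.42 = 1.99, so the corrected variance is $1.99 (k - 1)$ , close to the variance $2 (k - 1)$ of $χ_{k - 1}^{2}$ .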

Among the methods presented in this subsection, four (PL-REML, SJ, PBS- $τ^{2}$ , and NPBS- $τ^{2}$ ) need to use the estimated $τ^{2}$ , whereas others can directly calculate interval estimates of $τ^{2}$ without using the point estimate.

4. Simulation study

4.1. Simulation settings

We conducted simulation studies to investigate the performance of different interval estimators of the $I^{2}$ statistic. We followed the framework by Morris et al.⁶¹ to design the simulations:

Aims. The primary aim is to compare the performance of the 95% CIs for $I^{2}$ produced by different methods. The secondary aim is to compare three point estimators of $I^{2}$ .
Data-generating mechanisms. The number of studies in a simulated meta-analysis was set to $k = 5$ , 20, and 50. Denote $n = (n_{1}, \dots, n_{k})$ , where $n_{i}$ represents the sample size of study $i$ ( $i = 1, \dots, k$ ). When $k = 5$ , the vector $n$ of sample sizes in an artificial meta-analysis was first fixed as (10, 20, 30, 40, 50), then increased to (50, 75, 100, 125, 150), and finally to (150, 250, 350, 450, 550). These three settings represented small, medium, and large sample sizes, respectively. When $k = 20$ , the sample size vector was specified as four replicates of $n$ for $k = 5$ . For example, for $10 \leq n_{i} \leq 50$ and $i = 1, \dots, 20$ , the sample size vector was formed by concatenating four copies of (10, 20, 30, 40, 50). Similarly, 10 replicates of $n$ for $k = 5$ were used to construct the sample size vector when $k = 50$ . The control/treatment allocation ratio was set to $1 : 1$ in all studies, which is commonly used in real-world applications. Specifically, $n_{i 0} = n_{i 1} = n_{i} / 2$ , where $n_{i 0}$ participants were assigned to the control group and $n_{i 1}$ participants were assigned to the treatment group.

When effect measures were MDs, each participant’s outcome measure was sampled from $N (μ_{i 0}, γ_{i}^{2})$ in the control group or $N (μ_{i 0} + μ_{i}, γ_{i}^{2})$ in the treatment group. Without loss of generality, the baseline effect $μ_{i 0}$ of study $i$ was generated from $N (0, 1)$ . The study-specific standard deviation $γ_{i}$ was sampled from $U (1,5)$ , and it was generated anew for each simulated meta-analysis. The MD $μ_{i}$ was sampled from $N (μ, τ^{2})$ . Table 2 shows the specified values for the overall MD $μ$ and the between-study standard deviation $τ$ .

Table 2.

Vectors of the between-study standard deviation (τ) and specified values of the true overall effect size (μ).

	Overall mean difference	Overall standardized mean difference		Overall log odds ratio

Range of the sample size n_i	μ = 0 or 1	μ = 0	μ = 0.8	μ = 0	μ = 1
10 ≤ n_i ≤ 50	τ = (0.50, 1.10, 2.20)	τ = (0.18, 0.37, 0.73)	τ = (0.19, 0.38, 0.76)	τ = (0.37, 0.73, 1.46)	τ = (0.39, 0.78, 1.56)
50 ≤ n_i ≤ 150	τ = (0.30, 0.60, 1.20)	τ = (0.10, 0.20, 0.40)	τ = (0.10, 0.21, 0.42)	τ = (0.20, 0.40, 0.80)	τ = (0.21, 0.43, 0.85)
150 ≤ n_i ≤ 550	τ = (0.20, 0.30, 0.60)	τ = (0.05, 0.11, 0.21)	τ = (0.06, 0.11, 0.22)	τ = (0.11, 0.21, 0.43)	τ = (0.11, 0.23, 0.46)


Given the range of study-specific sample sizes and the true overall effect size, a between-study standard deviation is chosen from one of three values in the corresponding vector, and it is used to generate meta-analyses.

When effect measures were SMDs, each participant’s outcome measure was generated from $N (μ_{i 0}, γ_{i}^{2})$ in the control group or $N (μ_{i 1}, γ_{i}^{2})$ in the treatment group. The baseline effect $μ_{i 0}$ of the $i$ th study was generated from $N (0, 1)$ , and the study-specific standard deviation $γ_{i}$ was generated anew for each meta-analysis by sampling from $U (1,5)$ . The SMD $μ_{i} = (μ_{i 1} - μ_{i 0}) / γ_{i}$ was sampled from the normal distribution $N (μ, τ^{2})$ , so $μ_{i 1} = μ_{i} γ_{i} + μ_{i 0}$ . The overall SMD $μ$ and the between-study standard deviation $τ$ were set as in Table 2.

When effect measures were $\log$ ORs, the event numbers $n_{i 01}$ and $n_{i 11}$ in the control and treatment groups were sampled from ${Binomial} (n_{i 0}, p_{i 0})$ and ${Binomial} (n_{i 1}, p_{i 1})$ , respectively. The event rate in the control group $p_{i 0}$ was sampled from $U (0.3,0.7)$ representing a common event,⁶² and it was generated anew for each meta-analysis. The event rate in the treatment group $p_{i 1}$ was calculated using $p_{i 0}$ and the study-specific log OR $μ_{i}$ ; specifically, $p_{i 1} = {[1 + e^{- μ_{i}} (1 - p_{i 0}) / p_{i 0}]}^{- 1}$ . The study-specific $\log$ OR $μ_{i}$ was sampled from $N (μ, τ^{2})$ . The settings of the overall $\log$ OR $μ$ and the between-study standard deviation $τ$ are presented in Table 2.

For each simulation setting above, 10,000 meta-analyses were generated. For a simulated meta-analysis, the study-specific effect size and the within-study variance were estimated as $y_{i}$ and $s_{i}^{2}$ in Section 2. The RE model was applied to each simulated meta-analysis, and the between-study variance was estimated by three methods (DL, SJ, and REML) introduced in Section 3.1. Simulated meta-analyses whose REML estimates of $τ^{2}$ could not be obtained (e.g. the solution did not converge) were discarded and replaced until 10,000 meta-analyses were available.

Estimands of interest. We estimated $I^{2}$ and the corresponding $95 %$ CI for each simulated meta-analysis. The true value of $I^{2}$ was calculated by equation (4) with the true between-study variance $τ^{2}$ .
Methods to be evaluated. For MDs, we compared 12 methods to calculate 95% CIs of $I^{2}$ : five methods (TMG, PIII, S, T, and NPBS-Q) introduced in Section 3.2.1 and seven methods (PL-REML, QP, BJ, J, SJ, NPBS- $τ^{2}$ , and PBS- $τ^{2}$ ) introduced in Section 3.2.2. These 12 methods were also compared when effect measures were SMDs or log ORs, with the KDB CI (for SMDs) or the KD CI (for log ORs) added to the comparison. Among the methods that require an estimate of the between-study variance, the SJ method used the SJ estimate of $τ^{2}$ , and the other methods used the REML estimate of $τ^{2}$ . Moreover, the $I^{2}$ estimates based on three different estimators (DL, SJ, and REML) of $τ^{2}$ were also compared.
Performance measures. Coverage probabilities of $95 %$ CIs, lengths of interval estimates, standard deviations of lengths, biases, and root mean squared errors were examined.
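
The design above can be sketched for the log OR case as follows. This is a minimal illustration under stated assumptions, not the authors' R code: all function names are hypothetical, odd study sizes such as 75 are split by integer division because the text does not say how they are handled, and the standard deviation of interval lengths uses the $n - 1$ denominator.

```python
import math
import random
import statistics

BASE_SIZES = {
    "small": [10, 20, 30, 40, 50],
    "medium": [50, 75, 100, 125, 150],
    "large": [150, 250, 350, 450, 550],
}


def sample_size_vector(k, setting):
    """Replicate the k = 5 vector to length k (k = 5, 20, or 50 in the text)."""
    replicates, remainder = divmod(k, 5)
    if remainder != 0:
        raise ValueError("k must be a multiple of 5")
    return BASE_SIZES[setting] * replicates


def treatment_event_rate(p0, log_or):
    """p_i1 = [1 + exp(-mu_i) (1 - p_i0) / p_i0]^(-1)."""
    return 1.0 / (1.0 + math.exp(-log_or) * (1.0 - p0) / p0)


def simulate_log_or_meta_analysis(k, setting, mu, tau, rng):
    """Return one (events0, n0, events1, n1) tuple per study."""
    tables = []
    for n in sample_size_vector(k, setting):
        n0 = n1 = n // 2  # assumption: integer division for odd n
        p0 = rng.uniform(0.3, 0.7)
        p1 = treatment_event_rate(p0, rng.gauss(mu, tau))
        e0 = sum(rng.random() < p0 for _ in range(n0))  # Binomial(n0, p0) draw
        e1 = sum(rng.random() < p1 for _ in range(n1))  # Binomial(n1, p1) draw
        tables.append((e0, n0, e1, n1))
    return tables


def ci_performance(intervals, truth):
    """Coverage, mean length, and SD of length for a list of (lower, upper)."""
    lengths = [hi - lo for lo, hi in intervals]
    return {
        "coverage": sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals),
        "mean_length": statistics.mean(lengths),
        "sd_length": statistics.stdev(lengths),
    }


def point_performance(estimates, truth):
    """Bias and root mean squared error of point estimates."""
    errors = [e - truth for e in estimates]
    return {
        "bias": statistics.mean(errors),
        "rmse": math.sqrt(statistics.mean(e * e for e in errors)),
    }


tables = simulate_log_or_meta_analysis(5, "small", 0.0, 0.37, random.Random(1))
print(tables)
```

The per-study tables would then be passed to the effect-size and interval estimators under comparison, and `ci_performance` and `point_performance` would be applied to the resulting intervals and estimates across replicates.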

We provide all R code for the simulations at the Open Science Framework (https://osf.io/qu26v/).

4.2. Simulation results

4.2.1. Properties of $I^{2}$ estimates

For the point estimates of $I^{2}$ using the DL, SJ, and REML estimators of $τ^{2}$ , the SJ method stood out when the between-study variance was large. Table 3 shows estimates of bias for the three estimation methods when the estimand was MD. Generally, the SJ method had the highest bias compared to DL and REML when $τ$ was small or moderate, but the lowest bias when $τ$ was large. Additionally, DL and REML estimates of $I^{2}$ had extremely similar performance and were often biased downward, particularly when $τ$ was large.

Table 3.

Biases of estimated I² using the estimated between-study variance of three methods: DerSimonian–Laird (DL), Sidik–Jonkman (SJ), and restricted maximum likelihood (REML) methods. Results are from the simulation study using mean differences as effect measures.

	The true overall mean difference μ = 0 or 1
	n_i = 10–50			n_i = 50–150			n_i = 150–550

Method	τ = 0.5	τ = 1.1	τ = 2.2	τ = 0.3	τ = 0.6	τ = 1.2	τ = 0.2	τ = 0.3	τ = 0.6
k=5
DL	0.020	−0.125	−0.123	−0.021	−0.140	−0.128	−0.069	−0.137	−0.140
SJ	0.204	−0.003	−0.075	0.155	−0.022	−0.082	0.099	−0.007	−0.084
REML	0.015	−0.127	−0.122	−0.025	−0.144	−0.127	−0.074	−0.142	−0.140
k=20
DL	0.031	−0.024	−0.016	−0.023	−0.048	−0.023	−0.046	−0.054	−0.028
SJ	0.306	0.080	0.002	0.241	0.055	−0.007	0.180	0.070	−0.007
REML	0.032	−0.019	−0.009	−0.025	−0.047	−0.018	−0.048	−0.054	−0.022
k=50
DL	0.044	0.008	−0.001	−0.008	−0.014	−0.007	−0.024	−0.021	−0.010
SJ	0.330	0.095	0.011	0.262	0.071	0.003	0.197	0.085	0.004
REML	0.045	0.014	0.004	−0.007	−0.011	−0.004	−0.024	−0.020	−0.007


To illustrate this point, consider the case where $k = 20$ and $n_{i}$ was between 50 and 150. When $τ = 0.3$ , estimates of the bias were −0.023, 0.241, and −0.025 for the DL, SJ, and REML methods, respectively. The SJ method’s magnitude of estimated bias was roughly 10 times that of DL or REML. As $τ$ increased to 0.6, the magnitudes of bias were approximately equal (−0.048 for DL, 0.055 for SJ, and −0.047 for REML). However, when $τ$ was 1.2, SJ had the smallest estimated bias at −0.007, compared to −0.023 and −0.018 for DL and REML. This held true across nearly all parameter combinations studied, as well as for SMD and $\log$ OR. The estimated magnitude of the bias for the $\log$ OR was higher than that of MD or SMD (Tables S1 to S4 in the Supplemental Material). This was particularly striking when $μ = 1$ (Table S4 in the Supplemental Material). RMSE followed a similar, but far less extreme, pattern for all parameter combinations and estimands studied (Tables S5 to S9 in the Supplemental Material).

4.2.2. CIs for MDs and SMDs

Table 4 shows the simulation-based coverage probabilities in studies of the MD for each of the CI methods introduced for the $I^{2}$ statistic. CIs based on the three BJ approximations of $F_{Q}$ (TMG, PIII, and S) generally behaved similarly to one another. When the number of studies was small ( $k = 5$ ), interval coverage for the TMG, PIII, and S methods decreased with increasing $τ^{2}$ , regardless of the size of the individual studies in the meta-analysis. For example, when the individual study sample size $n_{i}$ was between 10 and 50, we observed over-conservative coverage when $τ = 0.5$ (100% TMG, 99.2% PIII, and 99.9% S), coverage very close to the nominal level when $τ = 1.1$ (95.5% TMG, 95.1% PIII, and 95.5% S), and under-coverage when $τ = 2.2$ (83.4% TMG, 83.4% PIII, and 83.1% S). As $k$ increased to 20 or 50, a non-linear relationship between CI coverage and the between-study variance was often present. This trend depended on $n_{i}$ , highlighting the importance of considering $k$ , $n_{i}$ , and $τ^{2}$ together when conducting a meta-analysis.

Table 4.

Coverage probabilities (in percentage, %) of estimated I² 95% confidence intervals of 12 methods in simulated studies where effect measures are mean differences.

	The true overall mean difference μ=0 or 1
	n_i = 10–50			n_i = 50–150			n_i = 150–550

Method	τ = 0.5	τ = 1.1	τ = 2.2	τ = 0.3	τ = 0.6	τ = 1.2	τ = 0.2	τ = 0.3	τ = 0.6
k=5
TMG	100.0	95.5	83.4	100.0	95.0	82.8	100.0	96.8	82.3
PIII	99.2	95.1	83.4	99.6	94.7	82.8	99.6	96.6	82.4
S	99.9	95.5	83.1	100.0	95.0	82.3	100.0	96.9	81.8
T	93.8	90.9	70.7	95.5	92.6	72.1	95.4	93.4	75.2
NPBS-Q	68.0	65.8	63.2	66.7	64.1	61.9	64.9	63.8	61.5
PL-REML	96.4	97.0	93.4	97.6	97.8	94.1	98.0	98.0	94.2
QP	93.3	93.7	93.9	94.7	94.9	94.8	94.9	94.8	94.8
BJ	93.2	93.7	93.7	94.6	94.8	94.9	95.2	94.7	94.7
J	93.7	93.9	93.9	94.8	94.8	94.9	94.7	94.9	94.7
SJ	46.6	76.0	86.7	55.7	79.0	88.2	64.3	76.9	87.7
PBS-τ²	99.9	97.2	84.5	100.0	97.1	83.7	100.0	98.1	83.6
NPBS-τ²	73.2	69.5	66.8	71.4	68.0	65.7	69.5	67.8	65.8
k=20
TMG	99.3	92.1	95.1	99.8	90.6	93.4	97.2	90.5	93.4
PIII	97.7	91.1	95.0	99.1	90.4	93.8	96.7	90.1	93.8
S	98.4	91.4	95.2	99.4	90.2	93.8	96.9	90.0	93.7
T	90.2	76.4	59.2	94.1	78.8	62.0	92.6	80.9	63.0
NPBS-Q	88.5	85.6	83.1	85.7	83.5	81.0	85.0	83.4	80.3
PL-REML	93.6	92.3	92.9	96.4	93.8	93.9	95.6	94.3	94.5
QP	91.4	92.5	92.6	94.1	94.2	94.1	95.1	95.3	94.8
BJ	91.5	93.3	94.0	94.0	94.5	94.2	94.9	94.6	94.5
J	93.1	93.1	93.0	94.5	94.2	94.1	95.1	95.0	94.7
SJ	8.1	58.8	83.8	18.0	68.2	87.6	32.6	63.8	87.7
PBS-τ²	98.6	91.9	92.8	99.1	91.0	91.2	95.0	90.9	91.7
NPBS-τ²	88.9	88.2	87.8	86.1	86.1	85.7	85.6	85.5	85.1
k=50
TMG	96.8	95.1	97.8	96.7	94.5	97.1	93.7	93.5	96.4
PIII	95.1	93.7	97.3	96.4	94.3	97.3	93.4	93.6	96.7
S	95.6	94.3	96.4	96.4	94.3	96.6	93.4	93.6	95.8
T	86.1	73.6	55.3	91.7	78.0	58.7	87.4	80.4	60.7
NPBS-Q	92.7	92.2	90.7	91.3	90.0	88.6	90.4	89.3	87.7
PL-REML	89.7	91.1	91.4	94.0	94.2	94.5	94.8	94.9	94.9
QP	89.1	90.7	91.2	94.4	94.4	94.5	95.0	95.1	94.8
BJ	89.3	92.5	93.5	94.3	94.5	94.6	95.1	94.9	95.0
J	91.6	91.7	91.9	94.7	94.4	94.5	95.2	94.9	94.8
SJ	0.3	35.4	79.9	1.8	50.4	87.4	8.1	43.9	86.8
PBS-τ²	92.7	92.9	93.3	92.9	93.6	94.0	92.5	93.3	93.5
NPBS-τ²	92.1	91.8	91.8	90.4	91.1	91.4	90.2	90.7	90.9


TMG: two-moment gamma; PIII: Pearson type III; S: Saddlepoint; T: test-based; NPBS: non-parametric bootstrap; Q: the Q statistic; PL: profile-likelihood; REML: restricted maximum likelihood; QP: Q-profile; BJ: Biggerstaff and Jackson; J: Jackson; SJ: Sidik and Jonkman; PBS: parametric bootstrap; τ²: the between-study variance.

The performance of the T approach mimicked that of TMG, PIII, and S, but with lower CI coverage across different simulation settings; this was most clearly illustrated with a large number of studies but small within-study sample sizes. Table 4 shows that, when $k = 50$ and $n_{i}$ was between 10 and 50, the coverage probabilities were 86.1%, 73.6%, and 55.3% for $τ = 0.5$ , $τ = 1.1$ , and $τ = 2.2$ , respectively. This was below the average coverage of other intervals based on the $Q$ statistic. For the T approach, when $k$ was fixed, the coverage probability decreased as $τ$ increased. This finding indicates that the commonly used test-based method was inappropriate for calculating a 95% CI of $I^{2}$ if the meta-analysis is highly heterogeneous. The similarities in coverage trajectory over parameters of meta-analyses among the T, TMG, PIII, and S methods likely reflect the fact that all four methods use the approximated distribution of the $Q$ statistic, while the decrease in overall coverage probability for the T method may be because it does not use the estimate of $τ^{2}$ . Interestingly, although the coverage probability was worse than that of the other three methods, the average interval length of the T method was comparatively shorter (Table 5).

Table 5.

Average lengths (in percentage, %) of estimated I² 95% confidence intervals of 12 methods are shown with standard deviations in parentheses. Results are from the simulation study using mean differences as effect measures.

	The true overall mean difference μ = 0 or 1
	n_i = 10–50			n_i = 50–150			n_i = 150–550

Method	τ = 0.5	τ = 1.1	τ = 2.2	τ = 0.3	τ = 0.6	τ = 1.2	τ = 0.2	τ = 0.3	τ = 0.6
k=5
TMG	72.7 (10.0)	79.9 (11.7)	88.0 (11.5)	72.6 (9.8)	79.5 (11.5)	87.7 (11.5)	74.0 (10.4)	78.4 (11.4)	87.3 (11.2)
PIII	72.3 (10.0)	77.1 (12.3)	69.7 (22.4)	72.4 (9.7)	77.3 (11.8)	70.5 (22.0)	73.6 (10.2)	76.7 (11.5)	72.5 (20.3)
S	72.7 (10.0)	79.7 (11.7)	83.4 (15.5)	72.6 (9.8)	79.4 (11.5)	83.3 (15.5)	74.0 (10.4)	78.3 (11.4)	84.0 (14.3)
T	77.0 (8.6)	69.0 (18.9)	44.3 (28.2)	77.3 (7.8)	69.9 (17.9)	44.6 (27.9)	76.3 (9.9)	71.4 (16.4)	48.5 (27.6)
NPBS-Q	41.2 (30.0)	58.9 (29.2)	77.9 (21.7)	41.0 (29.6)	58.3 (28.7)	77.6 (21.2)	44.7 (29.9)	55.6 (29.3)	76.1 (22.4)
PL-REML	83.3 (9.3)	79.5 (15.9)	58.1 (27.3)	83.4 (8.8)	80.2 (14.9)	58.6 (27.0)	83.3 (9.6)	81.3 (13.8)	62.3 (26.0)
QP	83.7 (18.7)	80.5 (19.9)	57.6 (29.1)	84.0 (17.6)	81.2 (18.8)	57.8 (28.7)	84.2 (17.4)	82.2 (18.5)	61.7 (28.0)
BJ	83.2 (18.1)	80.1 (19.8)	58.0 (28.4)	83.5 (17.4)	80.9 (18.8)	58.3 (28.1)	83.7 (17.1)	81.8 (18.4)	62.0 (27.4)
J	83.7 (18.5)	82.7 (19.7)	60.1 (30.5)	83.9 (17.6)	82.8 (18.6)	59.3 (29.7)	84.5 (17.3)	83.7 (18.2)	63.5 (29.0)
SJ	53.6 (13.7)	50.7 (13.7)	37.0 (16.6)	54.3 (13.3)	51.6 (13.1)	37.5 (16.4)	53.9 (13.1)	52.1 (13.2)	39.6 (16.2)
PBS-τ²	74.7 (9.3)	81.1 (10.7)	85.0 (14.5)	74.5 (9.1)	80.7 (10.4)	84.8 (14.4)	75.8 (9.6)	79.7 (10.4)	85.4 (12.9)
NPBS-τ²	46.5 (30.7)	62.4 (28.2)	80.6 (20.6)	45.8 (30.1)	61.5 (27.9)	80.4 (20.1)	49.4 (30.1)	59.1 (28.7)	78.7 (21.2)
k=20
TMG	58.9 (12.0)	65.1 (11.7)	35.6 (13.9)	57.8 (11.5)	65.1 (11.0)	35.5 (13.7)	61.8 (11.5)	66.3 (10.3)	40.0 (14.5)
PIII	58.0 (11.2)	57.1 (13.0)	26.6 (11.3)	57.3 (10.9)	58.8 (12.1)	27.6 (11.5)	60.5 (10.6)	60.8 (11.1)	31.2 (12.5)
S	58.3 (11.6)	60.1 (12.7)	29.7 (12.2)	57.5 (11.2)	61.1 (11.9)	30.4 (12.3)	60.9 (11.1)	62.9 (11.1)	34.3 (13.2)
T	52.9 (8.0)	38.2 (15.2)	12.2 (8.3)	53.1 (7.7)	40.5 (14.9)	13.0 (8.6)	52.5 (9.1)	43.5 (14.1)	15.2 (9.8)
NPBS-Q	50.1 (18.2)	59.9 (14.7)	33.6 (15.9)	48.6 (17.7)	59.7 (14.4)	33.8 (15.6)	53.8 (16.0)	60.2 (13.7)	37.9 (16.7)
PL-REML	60.8 (11.2)	50.6 (14.5)	21.0 (10.4)	60.3 (10.9)	52.5 (13.9)	21.9 (10.7)	61.2 (10.1)	55.3 (13.0)	25.1 (11.8)
QP	65.7 (14.0)	53.1 (15.9)	21.0 (10.5)	65.2 (13.8)	54.9 (15.2)	21.7 (10.6)	65.9 (12.3)	58.2 (14.2)	25.0 (11.8)
BJ	62.9 (12.4)	52.3 (14.3)	23.8 (10.5)	62.5 (12.5)	53.8 (14.1)	24.3 (10.6)	63.6 (10.9)	56.7 (13.2)	27.3 (11.5)
J	66.8 (14.2)	59.7 (18.7)	21.8 (12.1)	65.7 (13.9)	59.5 (17.2)	22.1 (11.5)	67.8 (12.6)	63.2 (15.9)	25.6 (13.1)
SJ	28.7 (3.3)	24.6 (4.8)	13.9 (4.7)	29.2 (2.8)	25.7 (4.4)	14.6 (4.8)	28.7 (3.1)	26.4 (4.2)	16.0 (4.9)
PBS-τ²	58.4 (12.5)	60.8 (13.6)	27.1 (13.3)	57.5 (12.2)	62.0 (12.5)	28.1 (13.4)	61.3 (11.8)	63.8 (11.5)	32.3 (14.7)
NPBS-τ²	53.5 (20.3)	62.2 (16.5)	30.2 (18.8)	51.3 (19.5)	62.1 (15.7)	31.1 (18.8)	56.5 (17.4)	62.7 (14.9)	35.7 (20.2)
k=50
TMG	48.8 (7.9)	39.0 (7.9)	16.4 (4.3)	48.5 (7.7)	40.0 (7.6)	16.5 (4.2)	50.2 (6.3)	43.1 (7.4)	18.8 (4.7)
PIII	48.0 (7.6)	35.9 (7.8)	14.3 (3.7)	48.0 (7.5)	37.5 (7.7)	14.8 (3.7)	49.1 (6.3)	40.7 (7.6)	16.9 (4.2)
S	48.2 (7.7)	36.7 (7.8)	14.8 (4.1)	48.1 (7.6)	38.1 (7.7)	15.2 (4.0)	49.3 (6.4)	41.2 (7.6)	17.3 (4.5)
T	40.9 (6.6)	22.0 (7.8)	5.9 (2.6)	41.5 (6.3)	23.7 (7.9)	6.3 (2.6)	39.4 (7.3)	26.8 (8.4)	7.5 (3.1)
NPBS-Q	46.3 (10.9)	38.3 (10.6)	16.0 (5.1)	45.2 (10.4)	39.2 (10.6)	16.1 (5.0)	47.4 (9.2)	42.0 (10.6)	18.3 (5.7)
PL-REML	47.2 (7.5)	31.8 (8.2)	11.1 (3.4)	46.9 (7.0)	33.4 (8.0)	11.6 (3.4)	46.2 (6.5)	36.4 (8.1)	13.4 (3.9)
QP	52.7 (9.4)	33.7 (9.0)	11.2 (3.5)	52.4 (8.9)	35.4 (8.6)	11.7 (3.4)	51.3 (7.8)	39.0 (8.9)	13.6 (4.0)
BJ	49.0 (7.3)	33.5 (7.5)	13.9 (4.2)	49.0 (7.1)	34.6 (7.4)	14.1 (4.0)	48.2 (6.5)	37.6 (7.6)	16.0 (4.5)
J	56.2 (11.1)	38.4 (13.5)	11.4 (3.5)	55.1 (10.1)	38.5 (12.0)	11.8 (3.4)	56.1 (9.4)	43.6 (12.7)	13.6 (4.0)
SJ	18.3 (1.5)	15.1 (2.2)	7.9 (1.8)	18.8 (1.1)	16.0 (2.0)	8.4 (1.8)	18.3 (1.3)	16.5 (1.9)	9.3 (1.9)
PBS-τ²	47.2 (8.6)	35.3 (8.6)	12.1 (3.8)	47.3 (8.4)	37.2 (8.4)	12.7 (3.8)	48.8 (6.8)	40.5 (8.2)	14.7 (4.3)
NPBS-τ²	48.0 (12.7)	37.0 (12.2)	12.4 (4.8)	45.9 (11.6)	38.2 (12.0)	12.8 (4.7)	47.8 (10.3)	41.4 (12.1)	14.9 (5.5)


TMG: two-moment gamma; PIII: Pearson type III; S: Saddlepoint; T: Test-based; NPBS: non-parametric bootstrap; Q: the Q statistic; PL: profile-likelihood; REML: restricted maximum likelihood; QP: Q-profile; BJ: Biggerstaff and Jackson; J: Jackson; SJ: Sidik and Jonkman; PBS: parametric bootstrap; τ²: the between-study variance.

The NPBS-Q interval had unacceptable coverage across most of the simulation scenarios. This method never reached the nominal coverage probability, with the highest coverage estimated at 92.7%. In general, the coverage of the NPBS-Q interval decreased modestly with larger $τ$ and $n_{i}$ . However, the primary driver of the low coverage was the number of studies available for resampling in the bootstrap operation, where $k = 50$ provided the best coverage. In the case of a small number of studies, the NPBS-Q interval was shorter than other $F_{Q}$ -based intervals.

Compared with $I^{2}$ intervals based on $F_{Q}$ , most intervals based on $τ^{2}$ were found to have coverage consistently closer to the nominal 95%. Specifically, the QP, BJ, and J intervals stood out as well-performing in areas where the $F_{Q}$ -based intervals under-performed. PL-REML also performed well, but it was slightly farther from the nominal coverage probability compared to QP, BJ, and J; this was evident when the within-study sample size was between 150 and 550. For example, when $k = 20$ and $τ = 0.2$ , the coverage probabilities were 95.6%, 95.1%, 94.9%, and 95.1% for PL-REML, QP, BJ, and J, respectively (Table 4). Additionally, when $τ = 0.6$ , the coverage probabilities of these methods were 94.5%, 94.8%, 94.5%, and 94.7%, again close to the nominal coverage. Interestingly, when $k = 50$ and $n_{i}$ was between 10 and 50, the coverage of the QP, BJ, J, and PL-REML intervals dropped, and the TMG, PIII, and S methods were preferred in this case.

Two $τ^{2}$ -based intervals would not be recommended for practice because of the poor coverage probabilities seen in the simulation. Across the simulation scenarios, the lowest CI coverage was found for the SJ method, which was consistent with prior simulation studies for the interval of $τ^{2}$ .⁵⁴ Table 5 shows that the interval length of the SJ method tended to be the shortest of all studied methods and decreased with increasing $τ$ . Additionally, similar to the NPBS-Q, the NPBS- $τ^{2}$ failed to reach the nominal coverage in any scenario.

The CIs for the SMD when $μ = 0$ or $μ = 0.8$ showed the same trend as described above (Tables S10 to S13 in the Supplemental Material) with two important caveats. First, when $k = 50$ and $n_{i}$ was between 10 and 50, the QP, BJ, and J methods worked well, and they were close to the nominal coverage. This was in direct contrast to the CIs based on $F_{Q}$ , whose coverage probabilities were consistently under 90% when $τ$ was moderate or large (Table S7 in the Supplemental Material). Second, the KDB interval also worked well under all studied scenarios, and it had comparable performance to the QP, BJ, and J intervals. These points made any of the QP, BJ, J, or KDB interval estimation methods a reliable choice when the estimand of interest was SMD, regardless of the other meta-analysis parameters.

4.2.3. CIs for log OR

Similar trends of CI coverage were observed for $\log$ OR as for MD, but with a greater magnitude of departure from the nominal coverage level. For both $μ = 0$ and $μ = 1$ , the TMG, PIII, and S methods had decreasing coverage with increasing $τ$ . In Table 6, when $k = 5$ and $n_{i}$ was between 10 and 50, a drop from 100% coverage to about 75% was seen for all three intervals as $τ$ increased. As $k$ increased to 20 and 50, this decrease in coverage appeared for both moderate and large values of $τ$ , and the severity of departure from nominal coverage increased. The T interval also showed a drop in coverage as $τ$ increased, the magnitude of which was exacerbated as $k$ increased. NPBS-Q, SJ, and NPBS- $τ^{2}$ showed unacceptable coverage in all scenarios studied. PBS- $τ^{2}$ was generally over-conservative when $τ$ was small, but it performed poorly as $τ$ increased.

Table 6.

Coverage probabilities (in percentage, %) of estimated I² 95% confidence intervals of 13 methods when the true overall log odds ratio is 0. Results are from the simulation study using log odds ratios as effect measures.

	The true overall log odds ratio μ = 0
	n_i = 10–50			n_i = 50–150			n_i = 150–550

Method	τ = 0.37	τ = 0.73	τ = 1.46	τ = 0.20	τ = 0.40	τ = 0.80	τ = 0.11	τ = 0.21	τ = 0.43
k=5
TMG	100.0	100.0	74.8	100.0	100.0	82.3	100.0	100.0	83.3
PIII	100.0	100.0	74.9	100.0	100.0	82.3	100.0	100.0	83.3
S	100.0	100.0	74.8	100.0	100.0	82.2	100.0	100.0	83.1
T	98.2	98.6	92.5	96.9	96.3	88.9	96.4	95.5	85.9
NPBS-Q	64.6	61.7	49.7	64.8	63.7	61.8	65.1	65.1	63.5
PL-REML	99.3	99.8	96.2	98.7	98.9	95.4	98.3	98.5	94.8
QP	96.7	97.4	97.2	95.6	95.7	95.9	95.0	95.3	95.2
BJ	96.8	97.3	97.1	95.6	96.1	96.7	95.1	95.5	95.7
J	97.0	97.3	97.1	95.4	95.6	96.3	94.9	95.1	95.4
SJ	51.0	80.0	91.3	54.0	78.5	88.1	54.2	78.6	87.8
PBS-τ²	100.0	100.0	76.5	100.0	100.0	82.6	100.0	100.0	83.8
NPBS-τ²	68.4	66.1	58.5	66.0	64.8	65.0	67.1	66.1	65.3
KD	94.1	95.0	95.5	95.0	95.3	95.6	94.9	95.2	95.1
k=20
TMG	100.0	83.2	67.5	99.9	89.9	87.8	99.9	91.5	91.1
PIII	100.0	83.2	68.0	99.8	89.9	88.0	99.9	91.4	91.2
S	99.9	83.2	67.9	99.7	89.6	88.0	99.6	91.1	91.2
T	99.1	94.3	54.7	97.2	90.0	80.0	96.8	92.2	78.5
NPBS-Q	82.5	75.3	35.4	85.5	84.3	76.2	85.6	85.1	82.8
PL-REML	99.6	93.5	86.8	98.8	95.0	94.7	98.4	94.7	94.4
QP	97.0	97.0	95.3	95.3	95.6	96.0	95.1	95.1	94.9
BJ	97.0	96.2	89.0	95.5	96.0	95.9	95.2	95.2	95.4
J	97.4	97.2	94.0	95.0	95.5	96.3	94.8	95.0	95.1
SJ	10.8	68.6	91.3	14.7	67.7	87.9	16.1	66.9	87.6
PBS-τ²	100.0	82.0	67.9	99.8	89.4	87.8	99.8	91.1	91.1
NPBS-τ²	79.2	75.5	59.5	84.7	84.6	82.8	84.9	85.2	84.7
KD	94.6	95.2	93.7	94.8	95.2	95.4	95.0	95.0	94.8
k=50
TMG	100.0	76.9	44.2	99.6	91.2	89.1	99.4	93.3	93.5
PIII	100.0	77.0	44.9	99.6	91.2	89.2	99.4	93.2	93.5
S	100.0	76.8	44.9	99.6	91.2	89.0	99.4	93.2	93.4
T	99.5	83.2	14.9	97.5	90.1	74.0	97.1	89.5	79.5
NPBS-Q	85.5	72.5	8.4	90.5	88.4	75.6	90.8	90.3	88.2
PL-REML	98.0	87.6	64.6	97.8	94.5	93.9	97.2	95.1	95.0
QP	96.7	95.8	87.5	95.3	95.5	96.0	95.5	95.0	95.1
BJ	96.4	93.2	54.7	95.4	95.6	93.2	95.4	95.4	95.8
J	97.5	96.3	80.2	95.0	95.3	95.9	95.4	94.9	95.4
SJ	0.4	48.6	88.2	1.1	49.8	87.5	1.3	48.1	87.8
PBS-τ²	100.0	75.7	43.5	99.6	90.8	88.6	99.4	93.1	93.0
NPBS-τ²	80.3	71.9	36.2	88.8	88.0	86.2	90.1	90.2	90.2
KD	93.3	93.7	87.9	94.9	95.0	95.2	95.3	94.9	94.9


When $μ = 0$ , the QP, BJ, J, and KD intervals stayed closest to the nominal coverage probability under most parameter combinations. The exception was when $k = 50$ , $n_{i}$ was between 10 and 50, and $τ = 1.46$ . A large drop in coverage was observed for all four intervals, particularly the BJ CI, which had a coverage of 54.7% in this case (Table 6), corresponding to a large decrease in CI length (Table S14 in the Supplemental Material). This was magnified when $μ = 1$ , where the dip in coverage and the decrease in CI length were also observed when $n_{i}$ was between 50 and 150 (Table 7 and Table S15 in the Supplemental Material). The KD interval only showed a major drop in coverage for $μ = 1$ when $k = 50$ , $n_{i}$ was between 10 and 50, and $τ$ was moderate to large. In most parameter combinations, the KD interval outperformed other methods with respect to the coverage probability or the average interval length. Therefore, in the case of the log OR, we suggest using the KD interval of $I^{2}$ .

Table 7.

Coverage probabilities (%) of the 95% confidence intervals for I² obtained by 13 methods when the true overall log odds ratio is 1. Results are from the simulation study using log odds ratios as effect measures.

	The true overall log odds ratio μ = 1
	n_i = 10–50			n_i = 50–150			n_i = 150–550
Method	τ = 0.39	τ = 0.78	τ = 1.56	τ = 0.21	τ = 0.43	τ = 0.85	τ = 0.11	τ = 0.23	τ = 0.46
k=5
TMG	100.0	100.0	73.4	100.0	100.0	80.4	100.0	100.0	82.5
PIII	100.0	100.0	73.5	100.0	100.0	80.5	100.0	100.0	82.5
S	100.0	100.0	73.3	100.0	100.0	80.3	100.0	100.0	82.4
T	98.8	98.9	91.4	97.2	96.5	87.4	96.1	95.0	83.1
NPBS-Q	61.4	57.4	45.0	64.8	63.4	59.1	64.6	64.3	62.2
PL-REML	99.5	99.9	96.1	98.8	99.1	95.5	98.2	98.5	94.9
QP	97.3	97.4	96.9	95.8	95.9	96.3	94.7	95.0	95.4
BJ	97.2	97.3	96.8	95.8	96.1	96.6	94.8	95.2	95.6
J	97.4	97.4	96.9	95.7	95.9	96.5	94.6	95.0	95.4
SJ	54.8	83.1	90.2	52.8	79.0	88.5	52.0	78.1	87.6
PBS-τ²	100.0	100.0	75.7	100.0	100.0	81.0	100.0	100.0	83.1
NPBS-τ²	65.5	62.1	53.0	66.6	65.3	64.2	66.8	66.0	65.2
KD	94.8	95.5	95.3	94.9	95.3	95.8	94.5	94.8	95.3
k=20
TMG	100.0	80.3	63.0	99.9	89.1	86.4	99.8	91.1	91.0
PIII	100.0	80.3	63.6	99.8	89.0	86.5	99.7	91.1	91.0
S	99.8	80.3	63.4	99.7	89.0	86.4	99.6	91.1	91.0
T	99.5	90.8	47.6	97.9	89.2	78.0	96.8	87.6	78.8
NPBS-Q	78.1	68.4	30.4	84.8	82.5	71.8	85.8	84.5	82.1
PL-REML	99.7	91.4	82.9	99.1	95.0	94.2	98.0	94.5	94.9
QP	96.0	95.7	91.6	95.9	96.2	96.7	95.3	95.1	95.2
BJ	95.9	94.6	85.2	95.9	96.3	95.5	95.2	95.1	95.7
J	96.6	95.9	89.9	95.7	96.0	96.8	95.2	95.0	95.5
SJ	14.0	77.9	86.1	14.1	68.2	89.6	12.3	67.7	88.6
PBS-τ²	100.0	79.2	63.4	99.9	88.7	86.2	99.8	90.6	90.7
NPBS-τ²	75.1	70.2	51.6	83.3	82.6	80.1	84.9	84.8	84.8
KD	94.8	94.5	90.3	94.9	95.5	95.9	94.9	94.9	95.0
k=50
TMG	100.0	71.1	34.6	99.8	89.3	85.0	99.4	93.2	93.0
PIII	100.0	71.2	35.3	99.8	89.3	85.2	99.4	93.2	93.0
S	100.0	71.1	35.2	99.8	88.9	84.9	99.4	92.6	92.4
T	99.9	74.0	9.2	97.8	89.3	67.7	97.2	89.6	79.0
NPBS-Q	78.0	61.6	5.6	89.6	86.7	69.7	90.7	90.4	87.1
PL-REML	97.0	82.3	53.3	98.0	93.7	91.6	97.7	95.1	94.9
QP	94.1	91.4	73.9	95.7	95.8	95.7	95.3	95.6	95.5
BJ	93.6	87.8	44.0	95.7	95.3	91.1	95.4	95.6	95.5
J	95.5	92.2	65.5	95.4	95.9	95.2	95.1	95.5	95.6
SJ	0.7	66.6	73.9	0.7	49.8	89.1	0.6	50.0	87.7
PBS-τ²	100.0	69.9	33.8	99.7	89.0	84.5	99.4	93.0	92.4
NPBS-τ²	75.1	70.2	51.6	83.3	82.6	80.1	84.9	84.8	84.8
KD	93.3	90.8	75.9	94.8	94.8	94.3	95.2	95.4	95.2

5. Discussion

In this article, we have compared different methods for calculating the point estimate and the $95\%$ CI of the $I^{2}$ statistic in a meta-analysis. For point estimates of $I^{2}$, the SJ method is suggested when $τ^{2}$ is large; otherwise, the DL method gives a less biased estimate of $I^{2}$ in our simulation studies. The interval estimates of $I^{2}$ fall into two categories according to their derivation. The first group consists of methods based on approximations of the CDF of the $Q$ statistic. The second group treats $I^{2}$ as a function of $τ^{2}$, as in equation (4), and derives the interval for $I^{2}$ from an interval for $τ^{2}$. Based on the simulation studies, we suggest the following guidelines:

When the effect measure is the MD or SMD, use the QP, BJ, or J method to calculate the $95\%$ CI for $I^{2}$;
When the effect measure is the log OR, use the KD method to calculate the $95\%$ CI for $I^{2}$.

In the case of the log OR, the KD method is recommended because it generally outperforms the other methods with respect to the coverage probability of the $95\%$ CI for $I^{2}$. The KDB method applies only to the SMD and the KD method only to the log OR; all other methods can be used to calculate the CI of the $I^{2}$ statistic for any type of effect measure.
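To illustrate the second family of intervals discussed above, the Python sketch below maps an interval for $τ^{2}$ to an interval for $I^{2}$. It assumes that equation (4) takes the usual Higgins–Thompson form $I^{2} = τ^{2}/(τ^{2} + s^{2})$, where $s^{2}$ is the "typical" within-study variance. The function names and the six-study data set are hypothetical, and the $τ^{2}$ limits passed in the example are placeholders rather than the output of any method compared in this article.

```python
from typing import Sequence, Tuple


def typical_within_variance(variances: Sequence[float]) -> float:
    """'Typical' within-study variance s^2 based on inverse-variance weights."""
    w = [1.0 / v for v in variances]
    k = len(w)
    sw = sum(w)
    return (k - 1) * sw / (sw ** 2 - sum(wi ** 2 for wi in w))


def dl_tau2_and_q(effects: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """DerSimonian-Laird estimate of tau^2 (truncated at 0) and Cochran's Q."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(w) - 1)) / c)
    return tau2, q


def i2_from_tau2(tau2: float, s2: float) -> float:
    """Assumed form of equation (4): I^2 = tau^2 / (tau^2 + s^2)."""
    return tau2 / (tau2 + s2)


def i2_interval(tau2_lower: float, tau2_upper: float, s2: float) -> Tuple[float, float]:
    """The map is increasing in tau^2, so the limits can be transformed one by one."""
    return i2_from_tau2(tau2_lower, s2), i2_from_tau2(tau2_upper, s2)


# Hypothetical effect sizes and within-study variances for six studies.
effects = [0.10, 0.30, 0.35, 0.65, 0.45, 0.15]
variances = [0.03, 0.03, 0.05, 0.01, 0.05, 0.02]

s2 = typical_within_variance(variances)
tau2_dl, q = dl_tau2_and_q(effects, variances)
i2_dl = i2_from_tau2(tau2_dl, s2)

# Placeholder tau^2 limits; in practice they would come from, e.g., the QP method.
i2_lo, i2_hi = i2_interval(0.0, 0.25, s2)
print(i2_dl, (i2_lo, i2_hi))
```

With the DL estimate of $τ^{2}$, this transformation reproduces the familiar expression $\max\{0, (Q - (k - 1))/Q\}$, which offers a quick consistency check on the assumed form of equation (4).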

Although the $I^{2}$ statistic is widely used to measure heterogeneity in meta-analyses, it is subject to large uncertainties and should not be used as an absolute measure of heterogeneity. The CI of $I^{2}$, however, conveys the range of plausible extents of heterogeneity and can be more robust to nuisance factors. In practice, meta-analysts should report the $95\%$ CI for $I^{2}$ using the recommended methods, which have reasonable interval lengths and provide much more reliable coverage probabilities than the currently used methods (e.g. the test-based method).

Based on the simulation framework in this article, other simulation settings can be considered in future research. For example, when the effect measure is the MD, the chi-squared distribution can be used to generate the study-specific standard deviation. Further studies can provide additional clarity on the guidelines for $I^{2}$ , with our work serving as the baseline.
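One possible reading of the chi-squared suggestion above is sketched below in Python. It assumes normally distributed outcomes within each study arm, so that $(n - 1)S^{2}/σ^{2}$ follows a chi-squared distribution with $n - 1$ degrees of freedom. The function name `sample_sd`, the value $σ = 2$, and the arm sizes are hypothetical and are not settings from this article.

```python
import math
import random


def sample_sd(sigma: float, n: int, rng: random.Random) -> float:
    """Draw a sample standard deviation for an arm of size n with true SD sigma.

    A chi-squared variate with df degrees of freedom is generated as
    Gamma(shape = df / 2, scale = 2).
    """
    df = n - 1
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return sigma * math.sqrt(chi2 / df)


rng = random.Random(2024)  # fixed seed so the illustration is reproducible

# Hypothetical arm sizes for five studies, all with an assumed true SD of 2.
arm_sizes = [12, 25, 40, 80, 150]
study_sds = [sample_sd(2.0, n, rng) for n in arm_sizes]
print(study_sds)
```

Drawing the standard deviations in this way lets the within-study variances fluctuate more in small studies than in large ones, which is the behaviour such an extended simulation would aim to reflect.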

Supplementary Material

Supplemental material

NIHMS2046507-supplement-Supplemental_material.pdf (225.1 KB, pdf)

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Institute of Mental Health and the U.S. National Library of Medicine (grant numbers R03 MH128727 and R01 LM012982).

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

1. Page MJ, Moher D, Bossuyt PM, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. Br Med J 2021; 372: n160.
2. Wang Y, Lin L, Thompson CG, et al. A penalization approach to random-effects meta-analysis. Stat Med 2022; 41: 500–516.
3. Higgins JPT, Thomas J, Chandler J, et al. Cochrane handbook for systematic reviews of interventions. Chichester, UK: John Wiley & Sons, 2019.
4. Higgins JPT and Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med 2002; 21: 1539–1558.
5. Huedo-Medina TB, Sánchez-Meca J, Marín-Martínez F, et al. Assessing heterogeneity in meta-analysis: Q statistic or $I^{2}$ index? Psychol Methods 2006; 11: 193–206.
6. Guyatt GH, Oxman AD, Kunz R, et al. GRADE guidelines: 7. Rating the quality of evidence—inconsistency. J Clin Epidemiol 2011; 64: 1294–1302.
7. Borenstein M, Higgins JPT, Hedges LV, et al. Basics of meta-analysis: $I^{2}$ is not an absolute measure of heterogeneity. Res Synth Methods 2017; 8: 5–18.
8. Hoaglin DC. Misunderstandings about $Q$ and ‘Cochran’s $Q$ test’ in meta-analysis. Stat Med 2016; 35: 485–495.
9. Hoaglin DC. Practical challenges of $I^{2}$ as a measure of heterogeneity. Res Synth Methods 2017; 8: 54.
10. von Hippel PT. The heterogeneity statistic $I^{2}$ can be biased in small meta-analyses. BMC Med Res Methodol 2015; 15: 35.
11. Mittlböck M and Heinzl H. A simulation study comparing properties of heterogeneity measures in meta-analyses. Stat Med 2006; 25: 4321–4333.
12. Rücker G, Schwarzer G, Carpenter JR, et al. Undue reliance on $I^{2}$ in assessing heterogeneity may mislead. BMC Med Res Methodol 2008; 8: 79.
13. Viechtbauer W. Hypothesis tests for population heterogeneity in meta-analysis. Br J Math Stat Psychol 2007; 60: 29–60.
14. Cochran WG. The combination of estimates from different experiments. Biometrics 1954; 10: 101–129.
15. Ioannidis JPA, Patsopoulos NA and Evangelou E. Uncertainty in heterogeneity estimates in meta-analyses. Br Med J 2007; 335: 914–916.
16. Thorlund K, Imberger G, Johnston BC, et al. Evolution of heterogeneity ($I^{2}$) estimates and their 95% confidence intervals in large meta-analyses. PLoS One 2012; 7: e39471.
17. Kulinskaya E, Dollinger MB, Knight E, et al. A Welch-type test for homogeneity of contrasts under heteroscedasticity with application to meta-analysis. Stat Med 2004; 23: 3655–3670.
18. Kulinskaya E, Dollinger MB and Bjørkestøl K. Testing for homogeneity in meta-analysis I. The one-parameter case: standardized mean difference. Biometrics 2011; 67: 203–212.
19. Kulinskaya E, Dollinger MB and Bjørkestøl K. On the moments of Cochran’s $Q$ statistic under the null hypothesis, with application to the meta-analysis of risk difference. Res Synth Methods 2011; 2: 254–270.
20. Kulinskaya E and Dollinger MB. An accurate test for homogeneity of odds ratios based on Cochran’s $Q$-statistic. BMC Med Res Methodol 2015; 15: 49.
21. Biggerstaff BJ and Jackson D. The exact distribution of Cochran’s heterogeneity statistic in one-way random effects meta-analysis. Stat Med 2008; 27: 6093–6110.
22. Normand S-LT. Meta-analysis: formulating, evaluating, combining, and reporting. Stat Med 1999; 18: 321–359.
23. Malzahn U, Böhning D and Holling H. Nonparametric estimation of heterogeneity variance for the standardised difference used in meta-analysis. Biometrika 2000; 87: 619–632.
24. Hedges LV and Olkin I. Statistical methods for meta-analysis. Orlando, FL: Academic Press, 1985.
25. Cooper H, Hedges LV and Valentine JC. The handbook of research synthesis and meta-analysis. 2nd ed. New York, NY: Russell Sage Foundation, 2009.
26. Egger M, Smith D, Altman G, et al. Systematic reviews in health care: meta-analysis in context. 2nd ed. London, UK: BMJ Publishing Group, 2001.
27. Lin L and Aloe AM. Evaluation of various estimators for standardized mean difference in meta-analysis. Stat Med 2021; 40: 403–426.
28. Lin L. Bias caused by sampling error in meta-analysis with small sample sizes. PLoS One 2018; 13: e0204056.
29. Walter SD. Choice of effect measure for epidemiological data. J Clin Epidemiol 2000; 53: 931–939.
30. Tajeu GS, Sen B, Allison DB, et al. Misuse of odds ratios in obesity literature: an empirical analysis of published studies. Obesity 2012; 20: 1726–1731.
31. Furuya-Kanamori L and Doi SAR. The outcome with higher baseline risk should be selected for relative risk in clinical studies: a proposal for change to practice. J Clin Epidemiol 2014; 67: 364–367.
32. Feng C, Wang B and Wang H. The relations among three popular indices of risks. Stat Med 2019; 38: 4772–4787.
33. Doi SA, Furuya-Kanamori L, Xu C, et al. Controversy and Debate: questionable utility of the relative risk in clinical research: paper 1: a call for change to practice. J Clin Epidemiol 2022; 142: 271–279.
34. Bakbergenuly I, Hoaglin DC and Kulinskaya E. Pitfalls of using the risk ratio in meta-analysis. Res Synth Methods 2019; 10: 398–419.
35. Haldane JBS. The estimation and significance of the logarithm of a ratio of frequencies. Ann Hum Genet 1956; 20: 309–311.
36. Gart JJ, Pettigrew HM and Thomas DG. The effect of bias, variance estimation, skewness and kurtosis of the empirical logit on weighted least squares analyses. Biometrika 1985; 72: 179–190.
37. Pettigrew HM, Gart JJ and Thomas DG. The bias and higher cumulants of the logarithm of a binomial variate. Biometrika 1986; 73: 425–435.
38. Sweeting MJ, Sutton AJ and Paul LC. What to add to nothing? Use and avoidance of continuity corrections in meta-analysis of sparse data. Stat Med 2004; 23: 1351–1375.
39. Bradburn MJ, Deeks JJ, Berlin JA, et al. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Stat Med 2007; 26: 53–77.
40. Cai T, Parast L and Ryan L. Meta-analysis for rare events. Stat Med 2010; 29: 2078–2089.
41. Rücker G, Schwarzer G, Carpenter J, et al. Why add anything to nothing? The arcsine difference as a measure of treatment effect in meta-analysis with zero cells. Stat Med 2009; 28: 721–738.
42. DerSimonian R and Kacker R. Random-effects model for meta-analysis of clinical trials: an update. Contemp Clin Trials 2007; 28: 105–114.
43. DerSimonian R and Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986; 7: 177–188.
44. Sidik K and Jonkman JN. Simple heterogeneity variance estimation for meta-analysis. J R Stat Soc Ser C Appl Stat 2005; 54: 367–384.
45. Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. J Am Stat Assoc 1977; 72: 320–338.
46. Viechtbauer W. Bias and efficiency of meta-analytic variance estimators in the random-effects model. J Educ Behav Stat 2005; 30: 261–293.
47. Viechtbauer W. Conducting meta-analyses in R with the metafor package. J Stat Softw 2010; 36: 3.
48. Lin L. Comparison of four heterogeneity measures for meta-analysis. J Eval Clin Pract 2020; 26: 376–384.
49. Biggerstaff BJ and Tweedie RL. Incorporating variability in estimates of heterogeneity in the random effects model in meta-analysis. Stat Med 1997; 16: 753–768.
50. Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika 1999; 86: 929–935.
51. Miettinen O. Estimability and estimation in case-referent studies. Am J Epidemiol 1976; 103: 226–235.
52. Wetterslev J, Thorlund K, Brok J, et al. Estimating required information size by quantifying diversity in random-effects model meta-analyses. BMC Med Res Methodol 2009; 9: 86.
53. Langan D, Higgins JPT, Jackson D, et al. A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Res Synth Methods 2019; 10: 83–98.
54. Veroniki AA, Jackson D, Viechtbauer W, et al. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Res Synth Methods 2016; 7: 55–79.
55. Hardy RJ and Thompson SG. A likelihood approach to meta-analysis with random effects. Stat Med 1996; 15: 619–629.
56. Viechtbauer W. Confidence intervals for the amount of heterogeneity in meta-analysis. Stat Med 2007; 26: 37–52.
57. Farebrother RW. The distribution of a positive linear combination of $χ^{2}$ random variables. J R Stat Soc Ser C Appl Stat 1984; 33: 332–339.
58. Jackson D. Confidence intervals for the between-study variance in random effects meta-analysis using generalised Cochran heterogeneity statistics. Res Synth Methods 2013; 4: 220–229.
59. Jackson D and White IR. When should meta-analysis avoid making hidden normality assumptions? Biom J 2018; 60: 1040–1058.
60. Knapp G, Biggerstaff BJ and Hartung J. Assessing the amount of heterogeneity in random-effects meta-analysis. Biom J 2006; 48: 271–285.
61. Morris TP, White IR and Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38: 2074–2102.
62. Peters JL, Sutton AJ, Jones DR, et al. Comparison of two methods to detect publication bias in meta-analysis. JAMA 2006; 295: 676–680.
