Abstract
Assessing heterogeneity between studies is a critical step in determining whether studies can be combined and whether the synthesized results are reliable. The I² statistic has been a popular measure for quantifying heterogeneity, but its usage has been challenged from various perspectives in recent years. In particular, it should not be considered an absolute measure of heterogeneity, and it could be subject to large uncertainties. As such, when using I² to interpret the extent of heterogeneity, it is essential to account for its interval estimate. Various point and interval estimators exist for I². This article summarizes these estimators. In addition, we performed a simulation study under different scenarios to investigate preferable point and interval estimates of I². We found that the Sidik–Jonkman method gave precise point estimates for I² when the between-study variance was large, while in other cases, the DerSimonian–Laird method was suggested to estimate I². When the effect measure was the mean difference or the standardized mean difference, the Q-profile method, the Biggerstaff–Jackson method, or the Jackson method was suggested to calculate the interval estimate for I² due to reasonable interval lengths and more reliable coverage probabilities than various alternatives. For the same reason, the Kulinskaya–Dollinger method was recommended to calculate the interval estimate for I² when the effect measure was the log odds ratio.
Keywords: Confidence interval, coverage probability, heterogeneity, I2 statistic, meta-analysis
1. Introduction
Meta-analysis is a statistical tool to synthesize evidence from different studies and is widely used in medical research. Assessing heterogeneity between the collected studies is a critical step to examine whether the studies may be properly combined and the synthesized results are reliable.1,2 In this article, heterogeneity refers to the variation in underlying treatment effects across studies.3
A classical method to detect heterogeneity is the chi-squared test; the distribution of the Q statistic is approximately χ²_{k−1} (k is the number of studies) under the null hypothesis that all studies in a meta-analysis are homogeneous. However, the test alone does not suffice to describe the amount of heterogeneity because only p-values are produced to indicate a binary decision of either the presence or absence of heterogeneity. The I² statistic has been a popular alternative to quantify heterogeneity because of its attractive interpretation as the proportion of total variation caused by heterogeneity rather than within-study sampling error.4,5 Specifically, the I² statistic can be conceptualized as I² = τ²/(τ² + σ²), where τ² is the between-study variance caused by heterogeneity and σ² is a summary of the within-study variances. It ranges from 0% to 100%. The Cochrane Handbook provides a rough yet widely used rule to interpret this measure: 0% to 40% may indicate unimportant heterogeneity, 30% to 60% may represent moderate heterogeneity, 50% to 90% may represent substantial heterogeneity, and 75% to 100% implies considerable heterogeneity.3 These ranges overlap with each other because they are intentionally vague, and the true heterogeneity should be evaluated with caution, using both statistical and clinical knowledge.6
Over the past few years, the usage of I² has been challenged from many perspectives, and it should not be used as an absolute measure of heterogeneity.7–9 Several studies have demonstrated shortcomings of I². The statistic may be particularly unreliable in meta-analyses with a small number of studies (e.g. < 10).10,11 The sample sizes within individual studies can inflate or deflate I² under different circumstances.11,12 Moreover, I² inherits the following misunderstanding about the distribution of the Q statistic under the null hypothesis: the χ²_{k−1} approximation holds for large within-study sample sizes, but it is not accurate for small and moderate sample sizes.13,14
In response to the above shortcomings, the associated confidence interval (CI) should be reported to accompany the I² statistic.10,15,16 Also, under the null hypothesis, distributions of the Q statistic for different effect measures (e.g. the standardized mean difference [SMD]) have been proposed to adjust for the inaccuracy of the standard chi-squared approximation17–20; they provide a solution to calculating the CI of I². CIs may be more desirable than point estimates of I² because they give an appreciation of the spectrum of possible extents of heterogeneity (e.g. mild to moderate). A spectrum of the I² statistic can be more robust to nuisance factors than the point estimate, enabling appropriate interpretation of the overall estimate of the intervention effect.16 Methods to calculate CIs of I² have been discussed in previous research.4,21 Nevertheless, more intensive studies are needed to compare different methods' performance (e.g. coverage probability) in practical situations.
This article uses a simulation study under various scenarios to obtain informative conclusions about preferable point and interval estimates of I². The rest of the article is organized as follows. We first review the setup of a meta-analysis, including various types of effect measures, in Section 2. Section 3 reviews various point and interval estimators of I². Section 4 presents the simulation study comparing the multiple estimators. We conclude with a brief discussion in Section 5.
2. Setups of meta-analysis
2.1. Common-effect and random-effects models
Consider a meta-analysis that collects k independent studies. Let θ_i be the underlying true effect size in study i (i = 1, …, k). Each study reports an estimate of the effect size and its sample variance, denoted by y_i and s_i², respectively. These data are commonly modeled as y_i ~ N(θ_i, s_i²). Although s_i² is subject to sampling error, it is usually treated as a fixed, known value. This assumption is generally valid if each study's sample size is reasonably large.
If the study-specific true effect sizes are assumed to follow a normal distribution, that is, θ_i ~ N(μ, τ²), then this is the random-effects (RE) model that accounts for heterogeneity. Here, μ is the overall mean effect size, and τ² is the between-study variance. If τ² = 0, then θ_i = μ for all studies. This implies that the collected studies are homogeneous, and it leads to the common-effect (CE) model. The RE model encompasses both within-study (s_i²) and between-study (τ²) variation, in contrast to the CE model, which includes within-study variation only. We denote w_i = 1/s_i² as the weight assigned to study i under the CE model. The Q statistic is defined as Q = Σ_{i=1}^k w_i (y_i − μ̂_CE)², where μ̂_CE = Σ_{i=1}^k w_i y_i / Σ_{i=1}^k w_i is the pooled CE estimate of the overall effect size μ. It follows a χ²_{k−1} distribution under the null hypothesis. Under the RE model, using the between-study variance estimate τ̂², the overall mean effect size is estimated as
μ̂_RE = Σ_{i=1}^k w_i* y_i / Σ_{i=1}^k w_i*,   (1)
where w_i* = 1/(s_i² + τ̂²).
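To make these setups concrete, the following minimal Python sketch (with hypothetical data; the article itself uses R and the "metafor" package) computes the CE pooled estimate, the Q statistic, and the RE pooled estimate in equation (1) for a given value of τ̂²:

```python
# Sketch of the basic quantities from Section 2.1: the CE pooled estimate,
# Cochran's Q, and the RE pooled estimate for a given between-study
# variance estimate tau2. All numbers are hypothetical.
y  = [0.10, 0.30, 0.25, 0.60, 0.45]   # study effect estimates y_i
s2 = [0.04, 0.02, 0.05, 0.03, 0.06]   # within-study variances s_i^2

w = [1 / v for v in s2]                # CE weights w_i = 1/s_i^2
mu_ce = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
Q = sum(wi * (yi - mu_ce) ** 2 for wi, yi in zip(w, y))  # ~ chi^2_{k-1} under H0

tau2 = 0.02                            # some estimate of tau^2
w_star = [1 / (v + tau2) for v in s2]  # RE weights w_i* = 1/(s_i^2 + tau2)
mu_re = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
```

Note that the RE estimate pulls the weights toward equality relative to the CE weights, which is the expected effect of adding a common τ² to every study's variance.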
2.2. Meta-analysis with a continuous outcome
Suppose each study in a meta-analysis compares a treatment group with a control group. Denote n_i0 and n_i1 as the sample sizes in the control and treatment groups in study i. The continuous outcome measures of participants in each group are assumed to follow normal distributions. The subject-level data in the two arms have means μ_i0 and μ_i1 and variances σ_i0² and σ_i1². The sample means are denoted as x̄_i0 and x̄_i1, and the sample variances are denoted as s_i0² and s_i1², for i = 1, …, k.
If the outcome measures have a meaningful scale and all studies in the meta-analysis are reported on the same scale, the mean difference (MD) between the two groups, θ_i = μ_i1 − μ_i0, is often used as the effect size. An estimate of the MD can be obtained from each study, denoted as y_i = x̄_i1 − x̄_i0. The variances of samples in the two arms are frequently assumed to be equal, i.e., σ_i0² = σ_i1² = σ_i². The σ_i² is estimated as the pooled sample variance s_pi² = [(n_i0 − 1)s_i0² + (n_i1 − 1)s_i1²]/(n_i0 + n_i1 − 2). Therefore, the estimated within-study variance of y_i is s_i² = s_pi² (1/n_i0 + 1/n_i1).
Another commonly used effect measure for continuous outcomes is the SMD, because this unit-free measure permits different scales in the collected studies and is deemed more comparable across studies.22 The SMD effect measure is θ_i = (μ_i1 − μ_i0)/σ_i. Known as Cohen's d, it is frequently estimated as d_i = (x̄_i1 − x̄_i0)/s_pi. The exact within-study variance of Cohen's d can be derived as a complicated form of gamma functions,23 but researchers often use different simpler forms to approximate it.24–26 For example, Var(d_i) ≈ (n_i0 + n_i1)/(n_i0 n_i1) + d_i²/[2(n_i0 + n_i1)]. As this variance depends on d_i, the estimated effect size and its variance are correlated. The correlation may increase as the sample sizes decrease, because the coefficient of d_i² in the formula increases. Cohen's d is shown to be biased in small sample sizes.24 Therefore, we do not consider it further. Instead, we study the bias-corrected estimator Hedges' g, which is usually adopted when sample sizes are small. Suggested by Hedges and Olkin,24(p86) it is computed as g_i = [1 − 3/(4(n_i0 + n_i1 − 2) − 1)] d_i, with an estimated variance (n_i0 + n_i1)/(n_i0 n_i1) + g_i²/[2(n_i0 + n_i1)].
Beyond this formula for estimating the within-study variance, Lin and Aloe27 summarized many other formulas. Using different formulas can result in different estimates of the overall SMD, but this topic is beyond the scope of this paper. As with Cohen's d, the observed data y_i and s_i² are also correlated when using Hedges' g as the effect measure, which may affect the estimation results of meta-analyses.28
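As an illustration of the quantities above, this hypothetical sketch computes Cohen's d, the small-sample correction factor, Hedges' g, and the commonly used approximate variance (the exact gamma-function form is omitted):

```python
# Sketch of Cohen's d and the Hedges-Olkin bias correction (Hedges' g)
# with the commonly used approximate variance. Numbers are hypothetical.
from math import sqrt

n0, n1 = 12, 15          # control / treatment sample sizes
m0, m1 = 10.2, 12.9      # sample means
sd0, sd1 = 3.1, 3.4      # sample standard deviations

# pooled standard deviation
sp = sqrt(((n0 - 1) * sd0**2 + (n1 - 1) * sd1**2) / (n0 + n1 - 2))
d = (m1 - m0) / sp                                # Cohen's d
J = 1 - 3 / (4 * (n0 + n1 - 2) - 1)               # small-sample correction factor
g = J * d                                         # Hedges' g
var_g = (n0 + n1) / (n0 * n1) + g**2 / (2 * (n0 + n1))  # approximate variance
```

Because `var_g` contains `g**2`, the estimated effect and its variance are correlated, which is exactly the dependence discussed above.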
2.3. Meta-analysis with a binary outcome
Suppose a 2×2 table is available from each collected study in a meta-analysis with a binary outcome (i.e. individual-level outcomes are reported from studies). Denote a_i0 and b_i0 as the numbers of participants without and with an event in the control group, respectively; a_i1 and b_i1 are the corresponding data cells in the treatment group. The sample sizes in the control and treatment groups are n_i0 = a_i0 + b_i0 and n_i1 = a_i1 + b_i1. Also, denote p_i0 and p_i1 as the population event rates in the two groups.
The odds ratio (OR) is frequently used as the effect measure for a binary outcome; its true value in study i is OR_i = [p_i1/(1 − p_i1)]/[p_i0/(1 − p_i0)]. Using the individual-level data, the OR is estimated by (b_i1 a_i0)/(b_i0 a_i1). The ORs are usually combined on a logarithmic scale in meta-analyses, because the distribution of the estimated log OR, y_i = log[(b_i1 a_i0)/(b_i0 a_i1)], is better approximated by a normal distribution. The within-study variance of y_i is estimated as s_i² = 1/a_i0 + 1/b_i0 + 1/a_i1 + 1/b_i1.
Moreover, the risk ratio (RR) and risk difference (RD) are also popular effect measures, but they are not discussed in this article. Although RRs are more interpretable measures of association for clinicians,29,30 the debate continues over the merits of the OR versus the RR and their interpretations.31,32 Doi et al.33 argued that RRs should no longer be used in meta-analyses, because the RR depends on prevalence more than on the strength of the exposure–outcome association that it is supposed to reflect. Specifically, the RR is a ratio of two conditional probabilities that vary with outcome prevalence, whereas the OR is a true effect-magnitude measure representing the multiplicative increase in the odds of the outcome from an unexposed state to an exposed state. The RD can be easily computed from the OR given a fixed baseline risk. When generating simulated meta-analyses for RDs and RRs under the RE model, it is difficult to naturally keep the event rates p_i0 and p_i1 within the range [0, 1] if the true overall effect size is given. This is because the normality assumption can generate extreme study-specific effects when τ² is non-zero. For example, if the true RD of study i is simulated from N(μ, τ²), the implied event rate p_i1 = p_i0 + RD_i can exceed 1 when the baseline rate p_i0 is fixed larger than 0.2 and the simulated RD is large. To overcome this issue, an alternative is to truncate such improper probabilities so they lie between 0 and 1, but this constraint can produce bias that cannot be distinguished from the bias caused by sampling error.28,34 Thus, the undesired effect of bounding the probabilities is problematic, yet inevitable when conducting simulation studies for RRs and RDs. Although some meta-analysts have explored other models to simulate data, there is still no general method that fixes this bias problem and is well accepted in the literature. Bakbergenuly et al.34 evaluated the performance of a number of data-generating models, such as the binomial generalized linear mixed model with a logit link function and the beta-binomial model, when effects are RRs.
No gold standard emerged, and they encouraged future research to explore this topic. Therefore, we focus on analyzing the results of ORs when studies report binary outcomes.
When sample sizes are small, some data cells may be 0, even if the event is not rare. In general, if a 2×2 table contains zero cells, a fixed value of 0.5 is added to each data cell to reduce bias and avoid computational errors.35–37 Although this continuity correction may not be optimal in some cases and alternative corrections can be used,38–41 we use the 0.5 correction unless otherwise noted in the following sections.
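A small sketch of the log OR computations in this subsection, including the 0.5 continuity correction applied when any cell is zero (counts are hypothetical):

```python
# Sketch of the log odds ratio and its variance from a 2x2 table, with
# the 0.5 continuity correction applied when any cell is zero.
from math import log

def log_or(events0, no_events0, events1, no_events1):
    cells = [events0, no_events0, events1, no_events1]
    if 0 in cells:                        # add 0.5 to every cell if any is zero
        cells = [c + 0.5 for c in cells]
    b0, a0, b1, a1 = cells                # b = events, a = non-events
    y = log((b1 * a0) / (a1 * b0))        # estimated log OR
    var = 1 / a0 + 1 / b0 + 1 / a1 + 1 / b1
    return y, var

y, v = log_or(5, 45, 12, 38)   # 5/50 events vs 12/50 events
```

The variance is simply the sum of reciprocal cell counts, so a single small cell can dominate the within-study uncertainty.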
3. Estimates of the I² statistic
3.1. Point estimates
Because point estimates of the between-study variance τ² are used to calculate I² intervals, we first introduce these point estimators. As I² depends on τ², these estimators further lead to point estimates of I².
3.1.1. Method-of-moments approach
The estimator of τ² can be derived from the method-of-moments approach, which is based on the generalized Q statistic,42 Q_gen = Σ_{i=1}^k a_i (y_i − μ̂_a)², where a_i represents the weight assigned to study i and μ̂_a = Σ_{i=1}^k a_i y_i / Σ_{i=1}^k a_i. By equating Q_gen to its expected value, the general formula for the heterogeneity variance can be derived as

τ̂² = [Q_gen − (Σ a_i s_i² − Σ a_i² s_i² / Σ a_i)] / (Σ a_i − Σ a_i² / Σ a_i).   (2)

The DerSimonian–Laird (DL) estimator uses the CE model weights a_i = w_i = 1/s_i², leading to43:

τ̂²_DL = [Q − (k − 1)] / (Σ w_i − Σ w_i² / Σ w_i).

Note that the DL estimator can produce negative variance estimates, which are truncated to zero in such cases.
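A minimal sketch of the DL computation under the CE weights, with the truncation at zero (hypothetical data):

```python
# Sketch of the DerSimonian-Laird moment estimator of tau^2 using CE
# weights w_i = 1/s_i^2 and truncating negative values at zero.
y  = [0.10, 0.30, 0.25, 0.60, 0.45]
s2 = [0.04, 0.02, 0.05, 0.03, 0.06]
k = len(y)

w = [1 / v for v in s2]
mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
Q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))
denom = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2_dl = max(0.0, (Q - (k - 1)) / denom)   # truncated at zero
```

With these hypothetical data Q barely exceeds k − 1 = 4, so the DL estimate is positive but very small.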
3.1.2. Sidik–Jonkman (SJ) method
Sidik and Jonkman44 proposed a two-step estimator that produces positive estimates:

τ̂²_SJ = [1/(k − 1)] Σ_{i=1}^k v̂_i^{−1} (y_i − μ̂_v)²,   with v̂_i = s_i²/τ̂₀² + 1,

where τ̂₀² = (1/k) Σ_{i=1}^k (y_i − ȳ)² is the initial heterogeneity variance estimate (ȳ is the unweighted mean of the y_i), and μ̂_v is calculated from equation (1) with weights v̂_i^{−1}.
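A sketch of the SJ two-step computation, assuming the common choice of the initial estimate τ̂₀² as the mean squared deviation of the y_i about their unweighted mean (hypothetical data):

```python
# Sketch of the Sidik-Jonkman two-step estimator: a rough initial
# estimate tau0, then reweighting with v_i = s_i^2/tau0 + 1.
y  = [0.10, 0.30, 0.25, 0.60, 0.45]
s2 = [0.04, 0.02, 0.05, 0.03, 0.06]
k = len(y)

y_bar = sum(y) / k
tau0 = sum((yi - y_bar) ** 2 for yi in y) / k        # initial estimate of tau^2
v = [vi / tau0 + 1 for vi in s2]                     # step-2 variance ratios
w = [1 / vi for vi in v]
mu_v = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
tau2_sj = sum((yi - mu_v) ** 2 / vi for yi, vi in zip(y, v)) / (k - 1)
```

Because every summand is non-negative and the divisor is k − 1, the estimate is strictly positive whenever the y_i are not all equal, which is the property emphasized above.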
3.1.3. Restricted maximum likelihood (REML) method
Based on the marginal distribution of the RE model, y_i ~ N(μ, s_i² + τ²), the maximum likelihood (ML) estimate is obtained by maximizing the log-likelihood function:

log L(μ, τ²) = −(1/2) Σ_{i=1}^k log(s_i² + τ²) − (1/2) Σ_{i=1}^k (y_i − μ)²/(s_i² + τ²) + constant.

To derive the REML estimator, the above log-likelihood function is transformed to exclude the parameter μ.45 By doing so, REML avoids assuming μ is known and is therefore considered an improvement on the ML estimator.46 The modified log-likelihood function is

log L_R(τ²) = log L(μ̂, τ²) − (1/2) log Σ_{i=1}^k 1/(s_i² + τ²).

By maximizing this modified log-likelihood function with respect to τ², the between-study variance estimate satisfies

τ̂²_REML = Σ w_i*² [(y_i − μ̂)² − s_i²] / Σ w_i*² + 1/Σ w_i*,

where w_i* = 1/(s_i² + τ̂²_REML). The REML estimate is calculated using an iterative scheme. The Fisher scoring algorithm is used for the iteration of the REML estimates in this article, as implemented in the R package "metafor."47
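Since Fisher scoring is delegated to "metafor" in the article, the sketch below instead iterates the REML estimating equation directly as a fixed point, a common alternative that targets the same solution (hypothetical data; convergence tolerance is an assumption):

```python
# Sketch of REML estimation of tau^2 by fixed-point iteration on the
# REML estimating equation (the article uses Fisher scoring via the R
# package "metafor"; this simpler scheme targets the same solution).
y  = [0.10, 0.30, 0.25, 0.60, 0.45]
s2 = [0.04, 0.02, 0.05, 0.03, 0.06]

tau2 = 0.01                                   # starting value
for _ in range(200):
    w = [1 / (v + tau2) for v in s2]          # w_i* = 1/(s_i^2 + tau^2)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    num = sum(wi**2 * ((yi - mu) ** 2 - vi) for wi, yi, vi in zip(w, y, s2))
    new = max(0.0, num / sum(wi**2 for wi in w) + 1 / sum(w))
    if abs(new - tau2) < 1e-10:               # converged
        tau2 = new
        break
    tau2 = new
```

The `max(0.0, ...)` mirrors the non-negativity constraint on τ²; in practice the Fisher scoring updates in "metafor" converge faster but to the same stationary point.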
3.2. Interval estimates
3.2.1. Interval estimates for I² based on the Q statistic
The I² statistic originated from the Q statistic by assuming the within-study variances are equal (i.e. σ_i² = σ² for all i) and by equating the observed Q with its expectation, so we have

I² = [Q − (k − 1)]/Q,   (3)

which is a function of Q.48 A widely used truncation (i.e. I² is set to 0 if Q < k − 1) is applied because conceptually the I² statistic should be non-negative. Using equation (3), the interval estimate can be calculated by evaluating quantiles from the cumulative distribution function (CDF) of the Q statistic, F_Q(·). Biggerstaff and Jackson21 (BJ) developed three approaches to approximate the distribution of Q under the RE model.
The two-moment gamma approximation of F_Q, with shape parameter α and scale parameter β, is obtained by matching the first two moments of the gamma and Q distributions. Explicit expressions for the mean and variance of Q are

E(Q) = (k − 1) + (S₁ − S₂/S₁) τ²,
Var(Q) = 2(k − 1) + 4(S₁ − S₂/S₁) τ² + 2(S₂ − 2S₃/S₁ + S₂²/S₁²) τ⁴,

where S_r = Σ_{i=1}^k w_i^r. The proof of the two formulas is given by Biggerstaff and Tweedie.49 Plugging any non-negative estimate of τ² into the above two formulas yields estimates of the first two moments of Q, denoted as Ê(Q) and V̂(Q). Therefore, solving the equations Ê(Q) = αβ and V̂(Q) = αβ² gives α̂ = Ê(Q)²/V̂(Q) and β̂ = V̂(Q)/Ê(Q). The F_Q is then approximated by computing the gamma CDF with α̂ and β̂.
The Pearson type III distribution provides an extension of the previous two-moment gamma approximation by adding the third central moment (TCM) of Q, which is derived similarly to the variance of Q. Matching all three moments with the parameters of the Pearson type III distribution, emphasizing the dependence on τ², gives the location parameter λ(τ²), shape parameter α(τ²), and rate parameter β(τ²). Therefore, the approximation of F_Q can easily be calculated from the Pearson type III CDF by plugging the estimated τ² into the three parameters. Note that although the three-moment Pearson type III approximation is intended as an improvement on the two-moment gamma approximation, as it matches a further moment, it has support [λ, ∞); hence it is not appropriate for approximating F_Q when values of Q are extremely small, especially if those values are less than λ.
A further approximation, expected to be more accurate in the tails of the distribution, is the saddlepoint approximation, given in the present case by Kuonen50 using the Barndorff–Nielsen formulation. This requires the cumulant generating function of Q, denoted by K(ζ), and its first two derivatives, K′(ζ) and K″(ζ), which are expressed in terms of the ordered eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λ_k of the matrix defining Q as a quadratic form in the y_i (a function of the CE weights and the variances s_i² + τ²). Plugging in the τ² estimate, the saddlepoint approximation can be calculated in two steps. First, we solve the equation K′(ζ) = q for ζ; the solution ζ̂ is referred to as the saddlepoint. Next, we compute u = ζ̂ √K″(ζ̂) and r = sign(ζ̂) √(2[ζ̂ q − K(ζ̂)]). The saddlepoint approximation is then given by F_Q(q) ≈ Φ(r + log(u/r)/r), where Φ is the standard normal CDF. Our objective is to obtain the interval estimate through evaluating the quantiles of the Q statistic, so the inverse of this approximated CDF is used to estimate quantiles of Q for given probabilities (e.g. 0.025 and 0.975).
The test-based method51 provides another way to compute the CI of H, and hence for the I² statistic.16 Appendix A2 of Higgins and Thompson4 discusses in detail how to conduct the test-based method to calculate the CI of the heterogeneity measure

H = √(Q/(k − 1)),

where H is defined to be 1 whenever Q < k − 1. Because I² = (H² − 1)/H² is a monotone-increasing function of H, we briefly present results that are used to estimate the CI of H here, and the corresponding CI for I² can be readily calculated via the relationship between I² and H. The logarithm of H is used in this method to remove some of the skew inherent in its distribution. A test-based standard error of log H is

se[log H] = (1/2) [log Q − log(k − 1)] / [√(2Q) − √(2k − 3)]  if Q > k,
se[log H] = √{[1/(2(k − 2))] [1 − 1/(3(k − 2)²)]}  if Q ≤ k.

Then a 95% CI for H follows as exp[log H ± 1.96 se(log H)]. Therefore, a test-based interval estimate for I² is constructed using the lower and upper bounds of the CI for H.
The non-parametric bootstrap CI of I² can be obtained by sampling k studies with replacement from the observed pairs (y_i, s_i²); I² is then estimated for each bootstrap sample using equation (3). Repeating the process B (e.g. 1000) times, a 95% CI is given by the 2.5th and 97.5th percentiles of the B estimated I² values.
In sum, five methods to estimate I² intervals are summarized in this subsection. It should be noted that the three methods using the approximated F_Q need first to estimate τ², whereas the other two methods do not depend on τ̂², as shown in Table 1.
Table 1.
Summary of the five methods to calculate confidence intervals for I2 based on the Q statistic.
| Methods to estimate I2 intervals | Use the estimated τ2 | |
|---|---|---|
| Biggerstaff and Jackson's approximated cumulative distribution functions of Q | Two-moment gamma (TMG) | √ |
| Pearson type III distribution (PIII) | √ | |
| Saddlepoint approximation (S) | √ | |
| Test-based approach (T) | — | |
| Non-parametric bootstrap (NPBS) | — |
(a) Methods requiring the estimated between-study variance, √; and (b) methods not requiring the estimated between-study variance, —.
3.2.2. Interval estimates for I² based on the between-study variance
Consider the DL estimate of the between-study variance, τ̂²_DL; the I² statistic can be expressed in terms of τ̂²_DL via equation (2) when Q > k − 1. Note that when Q ≤ k − 1, we have τ̂²_DL = 0, which matches the widely used truncation that I² is set to 0 if Q < k − 1. Replacing the Q statistic with its expression in τ̂²_DL in equation (3), I² can be expressed as a function of the estimated between-study variance:

I² = τ̂² / (τ̂² + s²).   (4)

In this expression, the summary of the within-study variances (i.e. the moment-based sampling error) is treated as follows:

s² = (k − 1) Σ w_i / [(Σ w_i)² − Σ w_i²].

Nevertheless, considering I² as a function of τ̂² depends on the accuracy of the summary estimate s², because the calculation or interpretation of the I² statistic can be seriously distorted if s² provides a misleading estimate.52 We use the moment-based sampling error throughout this article because it is consistent with the definition of I². Improving the summary estimate of the within-study variance will be explored in our future research. For a given meta-analysis, interval estimates of I² can be calculated from interval estimates of the between-study variance via the monotone-increasing function in equation (4). Therefore, calculating the CI for I² is one step beyond estimating the CI for τ². Researchers have conducted comprehensive overviews comparing estimation methods for τ² and its uncertainty.53,54 Although equation (4) is derived using the DL estimate of τ², other methods can be used to calculate I² because the different estimators aim to estimate the same true between-study variance. The three point estimators of τ² in Section 3.1 and the following six interval estimators of τ² are considered to obtain the CI for τ², and thus the CI for I², in this article.
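Equation (4) and the moment-based summary s² can be sketched as follows; converting any confidence limit for τ² to the I² scale is then a one-line monotone map (hypothetical data):

```python
# Sketch of equation (4): mapping an estimate (or confidence limit) of
# tau^2 to I^2 using the moment-based "typical" within-study variance
# s^2 = (k-1) * S1 / (S1^2 - S2), where S_r = sum of w_i^r.
s2 = [0.04, 0.02, 0.05, 0.03, 0.06]
k = len(s2)

w = [1 / v for v in s2]
S1, S2 = sum(w), sum(wi**2 for wi in w)
s2_typ = (k - 1) * S1 / (S1**2 - S2)          # moment-based sampling error

def i2_from_tau2(tau2):
    return tau2 / (tau2 + s2_typ)             # equation (4)

# e.g. converting a tau^2 interval [0.01, 0.20] to the I^2 scale
ci = (i2_from_tau2(0.01), i2_from_tau2(0.20))
```

Because the map is monotone increasing in τ², the endpoints of a τ² interval transform directly into the endpoints of the I² interval.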
Specifically, we summarize interval estimation methods for the between-study variance below:
- Profile likelihood CI of the REML estimator (PL-REML). The PL method55 is based on the log-likelihood function and is an iterative process that provides CIs for the between-study variance, accounting for the fact that μ needs to be estimated as well. The PL CI for τ² consists of the values τ₀² that are not rejected by the likelihood ratio test of the null hypothesis τ² = τ₀². For the REML estimator, the values in the CI are obtained by solving

2[log L_R(τ̂²_REML) − log L_R(τ₀²)] ≤ χ²_{1,0.95},

where χ²_{1,0.95} is the 95th percentile of the χ²₁ distribution. The method produces wide CIs with very high coverage probabilities when k is small, and the coverage probabilities reduce to the nominal level as k increases.56
- Q-profile CI (QP). The QP method is based on the generalized Q statistic (Q_gen in Section 3.1) with weights a_i = 1/(s_i² + τ²), which follows a χ²_{k−1} distribution. Viechtbauer56 shows that the Q-profile CI is obtained by iteratively solving Q_gen(τ²_L) = χ²_{k−1,0.975} and Q_gen(τ²_U) = χ²_{k−1,0.025}, where τ²_L and τ²_U are the lower and upper confidence limits, respectively. The corresponding CIs have been shown to achieve nominal coverage probabilities even in small samples.56 However, the estimated within-study variance is not the true within-study variance for each study; therefore, in practice, the generalized Q statistic no longer exactly follows the assumed chi-squared distribution. This method is implemented in the R package "metafor" as the default approach to compute the CI for τ².
- Biggerstaff and Jackson CI (BJ). Using the CDF of Q, F_Q(q; τ²), Biggerstaff and Jackson21 proposed a method to calculate a CI for the between-study variance by obtaining the solutions τ²_L and τ²_U of the equations

F_Q(q_obs; τ²_L) = 0.975  and  F_Q(q_obs; τ²_U) = 0.025,

where q_obs is the observed value of Q. When both solutions are negative, the interval is set as [0, 0]. If only the lower solution is negative, the lower bound of the CI is set equal to 0. The CDF may be calculated using the algorithm by Farebrother57 for positive linear combinations of chi-squared random variables.
- Jackson CI (J). An extension of the BJ CI was suggested by Jackson58 using the generalized Q statistic with arbitrary positive weights a_i, which has been shown to be distributed as a linear combination of χ²₁ random variables, so that methods like BJ can be used. The CDF of the generalized Q statistic, F_{Q_gen}(q; τ²), is a continuous and strictly decreasing function of τ². The CI of τ² is obtained by solving the analogous pair of equations:
When both solutions are negative, the interval is set as [0, 0]. If only the lower solution is negative, the lower bound of the CI is set equal to 0. For moderate heterogeneity, Jackson recommends using the interval with weights a_i = 1/s_i, which are used in this article. The BJ and Jackson CIs for τ² are calculated using the R code provided by Jackson.58
- Sidik and Jonkman CI (SJ). Sidik and Jonkman44 propose a method based on the SJ estimator τ̂²_SJ, using the 2.5th and 97.5th quantiles of the χ²_{k−1} distribution:

[(k − 1) τ̂²_SJ / χ²_{k−1,0.975},  (k − 1) τ̂²_SJ / χ²_{k−1,0.025}].

As τ² takes non-negative values, the interval is also non-negative. Simulation studies indicate that the SJ intervals have very poor coverage probability when τ² is small, but as τ² and k increase, the coverage probability becomes close to the nominal value.44,56
- Parametric and non-parametric bootstrap CIs (PBS-τ² and NPBS-τ²). For any consistent and non-negative estimator of τ², parametric bootstrap CIs can be obtained by generating values from the distribution N(μ̂, s_i² + τ̂²), where τ̂² is the between-study variance estimate and μ̂ is given by equation (1). Next, the between-study variance is estimated from the bootstrap sample. After repeating this process B (e.g. 1000) times, the CI is constructed by taking the 2.5th and 97.5th percentiles of the distribution of the B estimates. Non-parametric bootstrap CIs are obtained via a similar process, where k studies are sampled with replacement from the observed pairs (y_i, s_i²). For each bootstrap sample, τ² can be estimated using the same specified method (e.g. REML). Repeating the process B times, a 95% CI is given by the 2.5th and 97.5th percentiles of the B estimated values. The normality assumption on observed effects is not required in the non-parametric bootstrap method, but its coverage performance has been questioned because of substantial deviations from the nominal level in simulation studies.56
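As an illustration of the Q-profile idea described above, the sketch below profiles the generalized Q statistic by bisection; the χ²₄ quantiles are hard-coded from standard tables because only a k = 5 hypothetical data set is used:

```python
# Sketch of the Q-profile CI for tau^2: find the values of tau^2 at which
# the generalized Q statistic equals the chi-squared quantiles.
def q_gen(tau2, y, s2):
    w = [1 / (v + tau2) for v in s2]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))

def solve(target, y, s2, hi=100.0):
    # q_gen is decreasing in tau2; bisect on [0, hi]
    if q_gen(0.0, y, s2) < target:
        return 0.0                           # truncate at zero
    lo = 0.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if q_gen(mid, y, s2) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

y  = [0.10, 0.30, 0.25, 0.60, 0.45]
s2 = [0.04, 0.02, 0.05, 0.03, 0.06]
chi2_lower, chi2_upper = 0.484, 11.143       # chi^2_4 quantiles at 0.025, 0.975
tau2_lower = solve(chi2_upper, y, s2)        # lower limit from upper quantile
tau2_upper = solve(chi2_lower, y, s2)        # upper limit from lower quantile
```

For these data the observed generalized Q at τ² = 0 is below the upper quantile, so the lower limit is truncated at 0, matching the truncation rule stated above.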
So far, for generic effect measures, multiple approaches to calculating CIs of I² have been presented in two directions: based on the Q statistic or based on the between-study variance. Nevertheless, these methods have been criticized because they can be unreliable in real-world meta-analyses.8,59 For the Q-profile method, the χ²_{k−1} null distribution assumed for the generalized Q statistic may not be an accurate approximation, especially when study-specific sample sizes are small or moderate. The three methods based on the approximations of F_Q proposed by Biggerstaff and Jackson21 also require sufficiently large studies (i.e. large sample sizes within studies) and the assumption that effect sizes are normally distributed. For the test-based method, Hoaglin8 points out that: (1) the CI involving the test-based standard error is valid only under the null hypothesis; (2) the standard normal approximation used in the method requires "large" degrees of freedom (e.g. over 100); and (3) subtracting (k − 1) is not exactly the same as subtracting the mean of Q. Therefore, the test-based CI for I² can be unreliable in reflecting heterogeneity. Other approaches to improve the estimation of CIs for τ², and thus for I², have been discussed recently.21,54,60 For example, Knapp et al.60 suggested a modified Q-profile method using a different weighting scheme for the generalized Q statistic to determine the lower bound of the interval for τ², while the upper bound remains the same as that of the original Q-profile method. However, the improvement of the modified Q-profile method is subtle, and the weighting scheme for the lower bound is unavailable when effect measures are SMDs.
To handle the problem that χ²_{k−1} can be an inaccurate null distribution, Kulinskaya et al.18,20 proposed a series of methods that provide appropriate CIs for τ² by combining the Q-profile method with corrected null approximations of the Q statistic. The distribution of Q under the null hypothesis of homogeneity depends on the statistics used to estimate the effects and the weights. Two methods to estimate CIs for τ², and thus for I², are introduced for two effect measures, the SMD and the OR, as follows:
- Kulinskaya–Dollinger–Bjørkestøl CI (KDB). When using Hedges' g as the estimator of the SMD, Kulinskaya et al.18 derived corrections to the moments of Q and suggested using the chi-squared distribution with degrees of freedom equal to the estimate of the corrected first moment, denoted by E_c(Q), to approximate the null distribution of Q. The detailed expression of E_c(Q) is provided along with R code by Kulinskaya et al.,18 and it is not reproduced here because the concrete form is complicated. The upper and lower confidence limits for τ² can be calculated iteratively from the lower and upper quantiles of this chi-squared distribution.
Then, the corresponding CI for I² is obtained via equation (4).
- Kulinskaya–Dollinger CI (KD). When effect measures are log ORs, Kulinskaya and Dollinger20 obtain corrected approximations for the mean and variance of the Q statistic under the null hypothesis. They then match those corrected moments to construct a gamma distribution that closely fits the null distribution of Q, and their simulations confirm that the gamma approximation outperforms the chi-squared approximation.20 The improved approximation blends theoretical derivation with simulation results. Let E_c(Q) denote the corrected expectation of Q when τ² = 0. This corrected first moment is built from a theoretical moment obtained from their general expansion of the mean of Q for arbitrary binary effect measures; its detailed expression is presented in Appendix B.3 of Kulinskaya and Dollinger.20 For large sample sizes, E_c(Q) converges to k − 1. The corrected variance of Q, denoted by V_c(Q), is a quadratic function of the corrected mean. Then, the shape parameter of the gamma distribution approximating the null distribution of Q is estimated by α̂ = E_c(Q)²/V_c(Q), and the scale parameter is estimated by β̂ = V_c(Q)/E_c(Q). Therefore, the KD interval estimate of τ² is obtained by iteratively solving for the values of τ² at which the generalized Q statistic equals the 2.5th and 97.5th percentiles of this gamma distribution. The corresponding CI for I² is calculated via equation (4). Unlike all other methods, the KD interval estimate of τ² is based on the 2×2 table where 0.5 is added to each cell regardless of the existence of zero cells; this adjustment is handled in the programming.
Among the methods presented in this subsection, four (PL-REML, SJ, PBS-τ², and NPBS-τ²) need to use the estimated τ², whereas the others can directly calculate interval estimates of τ² without using the point estimate.
4. Simulation study
4.1. Simulation settings
We conducted simulation studies to investigate the performance of different interval estimators of the I² statistic, following the framework by Morris et al.61 for designing simulations:
Aims. The primary goal is to compare the performance of different methods' CIs for I². The secondary aim is to compare the three point estimators of τ².
Data-generating mechanisms. The number of studies in a simulated meta-analysis was set to k = 5, 20, and 50. Denote n_i as the sample size of study i. When k = 5, a vector representing the sample sizes of an artificial meta-analysis was first fixed as (10, 20, 30, 40, 50); we then gradually increased it to (50, 75, 100, 125, 150), and to (150, 250, 350, 450, 550). The three settings indicate that the considered sample sizes were small, medium, and large. When k = 20, the sample size vector was specified as four replicates of the corresponding vector for k = 5. For example, considering 10 ≤ n_i ≤ 50 and k = 20, the sample size vector was set by combining four vectors (10, 20, 30, 40, 50). Similarly, 10 replicates of the k = 5 vectors were used to construct the sample size vector when k = 50. The control/treatment allocation ratio was set to 1:1 in all studies, which is common in real-world applications. Specifically, n_i0 = n_i1 = n_i/2 participants were assigned to the control and treatment groups, respectively.
When effect measures were MDs, each participant's outcome measure was sampled from N(μ_i0, σ_i²) in the control group or N(μ_i0 + θ_i, σ_i²) in the treatment group. Without loss of generality, the baseline effect μ_i0 of study i was generated from a pre-specified distribution. The study-specific standard deviation σ_i was also sampled from a pre-specified distribution, and it was generated anew for each simulated meta-analysis. The study-specific MD θ_i was sampled from N(μ, τ²). Table 2 shows the specified values for the overall MD μ and the between-study standard deviation τ.
Table 2.
Vectors of the between-study standard deviation (τ) and specified values of the true overall effect size (μ).
| | Overall mean difference | Overall standardized mean difference | | Overall log odds ratio | |
|---|---|---|---|---|---|
| Range of the sample size ni | μ = 0 or 1 | μ = 0 | μ = 0.8 | μ = 0 | μ = 1 |
| 10 ≤ ni ≤ 50 | τ = (0.50, 1.10, 2.20) | τ = (0.18, 0.37, 0.73) | τ = (0.19, 0.38, 0.76) | τ = (0.37, 0.73, 1.46) | τ = (0.39, 0.78, 1.56) |
| 50 ≤ ni ≤ 150 | τ = (0.30, 0.60, 1.20) | τ = (0.10, 0.20, 0.40) | τ = (0.10, 0.21, 0.42) | τ = (0.20, 0.40, 0.80) | τ = (0.21, 0.43, 0.85) |
| 150 ≤ ni ≤ 550 | τ = (0.20, 0.30, 0.60) | τ = (0.05, 0.11, 0.21) | τ = (0.06, 0.11, 0.22) | τ = (0.11, 0.21, 0.43) | τ = (0.11, 0.23, 0.46) |
Given the range of study-specific sample sizes and the true overall effect size, a between-study standard deviation is chosen from one of three values in the corresponding vector, and it is used to generate meta-analyses.
When effect measures were SMDs, each participant's outcome measure was generated from N(μ_i0, σ_i²) in the control group or N(μ_i0 + σ_i θ_i, σ_i²) in the treatment group. The baseline effect μ_i0 of study i and the study-specific standard deviation σ_i were generated anew for each meta-analysis from pre-specified distributions. The study-specific SMD θ_i was sampled from the normal distribution N(μ, τ²), so that the treatment-group mean was shifted by σ_i θ_i. The overall SMD μ and the between-study standard deviation τ were set as in Table 2.
When effect measures were ORs, the event numbers in the control and treatment groups were sampled from binomial distributions with the corresponding group sample sizes and event rates. The event rate in the control group was sampled from a distribution representing a common event,62 and it was generated anew for each meta-analysis. The event rate in the treatment group was calculated from the control-group event rate and the study-specific log OR; specifically, the treatment-group odds equal the control-group odds multiplied by the exponentiated study-specific log OR. The study-specific log OR was sampled from a normal distribution with mean μ (the overall log OR) and variance τ2. The settings of the overall log OR μ and the between-study standard deviation τ are presented in Table 2.
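The conversion from the control-group event rate and the study-specific log OR to the treatment-group event rate follows directly from the definition of the OR; a minimal sketch (function name is ours):

```python
import math

def treatment_rate(p0, log_or):
    """Treatment-group event rate implied by control-group rate p0 and a log OR:
    odds1 = exp(log_or) * odds0, so p1 = odds1 / (1 + odds1)."""
    odds1 = math.exp(log_or) * p0 / (1.0 - p0)
    return odds1 / (1.0 + odds1)
```

For example, a study-specific log OR of 0 leaves the event rate unchanged, while a positive log OR raises it on the odds scale.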
For each simulation setting above, 10,000 meta-analyses were generated. For a simulated meta-analysis, the study-specific effect size and the within-study variance were estimated as described in Section 2. The RE model was applied to each simulated meta-analysis, and the between-study variance τ2 was estimated by the three methods (DL, SJ, and REML) introduced in Section 3.1. We skipped simulated meta-analyses for which the REML estimate of τ2 could not be obtained (e.g. the solution did not converge) until enough simulated meta-analyses were generated.
Estimands of interest. We estimated I2 and the corresponding CI for each simulated meta-analysis. The true value of I2 was calculated by equation (4) with the true between-study variance τ2.
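Equation (4) expresses I2 as a function of τ2 and a summary of the within-study variances. Assuming the "typical" within-study variance of Higgins and Thompson is used for that summary, the true I2 for a scenario can be computed as in this sketch (our function names):

```python
def true_i_squared(tau2, within_vars):
    """True I2 implied by tau2 via equation (4), with s2 taken as the
    'typical' within-study variance of Higgins and Thompson (assumed form):
    s2 = (k - 1) * sum(w) / (sum(w)^2 - sum(w^2)), w_i = 1 / v_i."""
    w = [1.0 / v for v in within_vars]
    k = len(w)
    s2 = (k - 1) * sum(w) / (sum(w) ** 2 - sum(wi ** 2 for wi in w))
    return tau2 / (tau2 + s2)
```

When all within-study variances are equal, s2 reduces to that common variance, so, for instance, τ2 equal to the common within-study variance gives a true I2 of 0.5.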
Methods to be evaluated. For MDs, we compared 12 methods to calculate CIs of I2: five methods (TMG, PIII, S, T, and NPBS-Q) introduced in Section 3.2.1 and seven methods (PL-REML, QP, BJ, J, SJ, NPBS-τ2, and PBS-τ2) introduced in Section 3.2.2. These 12 methods were also compared when effect measures were SMDs or log ORs, with the KDB CI (for SMDs) or the KD CI (for log ORs) added to the comparison. Among the methods requiring an estimate of the between-study variance, the SJ method used the SJ estimate of τ2, and the other methods used the REML estimate of τ2. Moreover, point estimates of I2 using the three estimators (DL, SJ, and REML) of τ2 were also compared.
Performance measures. Coverage probabilities of CIs, lengths of interval estimates, standard deviations of lengths, biases, and root mean squared errors were examined.
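Each of these performance measures can be computed from the 10,000 replications per scenario; a sketch (function and variable names are ours, not the authors'):

```python
import numpy as np

def performance(true_val, estimates, lowers, uppers):
    """Summarize one simulation scenario: coverage probability,
    mean and SD of CI length, bias, and root mean squared error."""
    est = np.asarray(estimates, dtype=float)
    lo = np.asarray(lowers, dtype=float)
    hi = np.asarray(uppers, dtype=float)
    covered = (lo <= true_val) & (true_val <= hi)
    lengths = hi - lo
    return {
        "coverage": float(covered.mean()),
        "mean_length": float(lengths.mean()),
        "sd_length": float(lengths.std(ddof=1)),
        "bias": float((est - true_val).mean()),
        "rmse": float(np.sqrt(((est - true_val) ** 2).mean())),
    }
```

Coverage is the fraction of intervals containing the true I2, which is compared against the nominal 95% level in the tables below.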
We provide all R code for the simulations at the Open Science Framework (https://osf.io/qu26v/).
4.2. Simulation results
4.2.1. Properties of estimates
For the point estimates of I2 using the DL, SJ, and REML estimators of τ2, the SJ method stood out when the between-study variance was large. Table 3 shows estimates of bias for the three estimation methods when the effect measure was the MD. Generally, the SJ method had the highest bias compared with DL and REML when τ was small or moderate, but the lowest bias when τ was large. Additionally, the DL and REML estimates of I2 performed extremely similarly and were often biased downward, particularly when τ was large.
Table 3.
Biases of estimated I2 using the estimated between-study variance of three methods: DerSimonian–Laird (DL), Sidik–Jonkman (SJ), and restricted maximum likelihood (REML) methods. Results are from the simulation study using mean differences as effect measures.
| Method (μ = 0 or 1) | ni = 10–50: τ = 0.5 | τ = 1.1 | τ = 2.2 | ni = 50–150: τ = 0.3 | τ = 0.6 | τ = 1.2 | ni = 150–550: τ = 0.2 | τ = 0.3 | τ = 0.6 |
|---|---|---|---|---|---|---|---|---|---|
| k=5 | |||||||||
| DL | 0.020 | −0.125 | −0.123 | −0.021 | −0.140 | −0.128 | −0.069 | −0.137 | −0.140 |
| SJ | 0.204 | −0.003 | −0.075 | 0.155 | −0.022 | −0.082 | 0.099 | −0.007 | −0.084 |
| REML | 0.015 | −0.127 | −0.122 | −0.025 | −0.144 | −0.127 | −0.074 | −0.142 | −0.140 |
| k=20 | |||||||||
| DL | 0.031 | −0.024 | −0.016 | −0.023 | −0.048 | −0.023 | −0.046 | −0.054 | −0.028 |
| SJ | 0.306 | 0.080 | 0.002 | 0.241 | 0.055 | −0.007 | 0.180 | 0.070 | −0.007 |
| REML | 0.032 | −0.019 | −0.009 | −0.025 | −0.047 | −0.018 | −0.048 | −0.054 | −0.022 |
| k=50 | |||||||||
| DL | 0.044 | 0.008 | −0.001 | −0.008 | −0.014 | −0.007 | −0.024 | −0.021 | −0.010 |
| SJ | 0.330 | 0.095 | 0.011 | 0.262 | 0.071 | 0.003 | 0.197 | 0.085 | 0.004 |
| REML | 0.045 | 0.014 | 0.004 | −0.007 | −0.011 | −0.004 | −0.024 | −0.020 | −0.007 |
To illustrate this point, consider the case where k = 20 and ni was between 50 and 150. When τ = 0.3, estimates of the bias were −0.023, 0.241, and −0.025 for the DL, SJ, and REML methods, respectively; the magnitude of the SJ method's estimated bias was more than 10 times that of DL or REML. As τ increased to 0.6, the magnitudes of bias were approximately equal (−0.048 for DL, 0.055 for SJ, and −0.047 for REML). However, when τ was 1.2, SJ had the lowest estimated bias at −0.007, compared to −0.023 and −0.018 for DL and REML. This held true across nearly all parameter combinations studied, as well as for the SMD and OR. The estimated magnitude of the bias for the OR was higher than that for the MD or SMD (Tables S1 to S4 in the Supplemental Material), most strikingly in the settings of Table S4 in the Supplemental Material. RMSE followed a similar, but far less extreme, pattern for all parameter combinations and effect measures studied (Tables S5 to S9 in the Supplemental Material).
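For reference, the DL plug-in estimate of I2 evaluated in Table 3 can be sketched as follows, using the standard DL moment estimator of τ2 and the typical within-study variance in equation (4); function names are ours:

```python
import numpy as np

def i2_from_dl(y, v):
    """DerSimonian-Laird estimate of tau2 and the plug-in I2,
    tau2 / (tau2 + s2), with s2 the typical within-study variance."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(v, dtype=float)         # inverse-variance weights
    k = len(y)
    mu_fe = np.sum(w * y) / np.sum(w)            # fixed-effect pooled mean
    q = np.sum(w * (y - mu_fe) ** 2)             # Cochran's Q
    s1, s2 = np.sum(w), np.sum(w ** 2)
    tau2 = max(0.0, (q - (k - 1)) / (s1 - s2 / s1))  # DL moment estimator
    typ = (k - 1) * s1 / (s1 ** 2 - s2)          # typical within-study variance
    return tau2, tau2 / (tau2 + typ)
```

The SJ and REML estimators differ only in how τ2 is obtained; the mapping to I2 is the same.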
4.2.2. CIs for MDs and SMDs
Table 4 shows the simulation-based coverage probabilities in studies of the MD for each of the CI methods introduced for the I2 statistic. CIs based on the approximated distribution of the Q statistic generally behaved similarly. When the number of studies was small (k = 5), interval coverage for the TMG, PIII, and S methods decreased with increasing τ, regardless of the size of the individual studies in the meta-analysis. For example, when the individual study sample sizes were between 10 and 50, we observed over-conservative coverage when τ = 0.5 (100.0% TMG, 99.2% PIII, and 99.9% S), coverage very close to the nominal level when τ = 1.1 (95.5% TMG, 95.1% PIII, and 95.5% S), and under-coverage when τ = 2.2 (83.4% TMG, 83.4% PIII, and 83.1% S). As k increased to 20 or 50, a non-linear relationship between CI coverage and the between-study variance was often present. This trend depended on ni, highlighting the importance of considering k, ni, and τ together when conducting a meta-analysis.
Table 4.
Coverage probabilities (in percentage, %) of estimated I2 95% confidence intervals of 12 methods in simulated studies where effect measures are mean differences.
| Method (μ = 0 or 1) | ni = 10–50: τ = 0.5 | τ = 1.1 | τ = 2.2 | ni = 50–150: τ = 0.3 | τ = 0.6 | τ = 1.2 | ni = 150–550: τ = 0.2 | τ = 0.3 | τ = 0.6 |
|---|---|---|---|---|---|---|---|---|---|
| k=5 | |||||||||
| TMG | 100.0 | 95.5 | 83.4 | 100.0 | 95.0 | 82.8 | 100.0 | 96.8 | 82.3 |
| PIII | 99.2 | 95.1 | 83.4 | 99.6 | 94.7 | 82.8 | 99.6 | 96.6 | 82.4 |
| S | 99.9 | 95.5 | 83.1 | 100.0 | 95.0 | 82.3 | 100.0 | 96.9 | 81.8 |
| T | 93.8 | 90.9 | 70.7 | 95.5 | 92.6 | 72.1 | 95.4 | 93.4 | 75.2 |
| NPBS-Q | 68.0 | 65.8 | 63.2 | 66.7 | 64.1 | 61.9 | 64.9 | 63.8 | 61.5 |
| PL-REML | 96.4 | 97.0 | 93.4 | 97.6 | 97.8 | 94.1 | 98.0 | 98.0 | 94.2 |
| QP | 93.3 | 93.7 | 93.9 | 94.7 | 94.9 | 94.8 | 94.9 | 94.8 | 94.8 |
| BJ | 93.2 | 93.7 | 93.7 | 94.6 | 94.8 | 94.9 | 95.2 | 94.7 | 94.7 |
| J | 93.7 | 93.9 | 93.9 | 94.8 | 94.8 | 94.9 | 94.7 | 94.9 | 94.7 |
| SJ | 46.6 | 76.0 | 86.7 | 55.7 | 79.0 | 88.2 | 64.3 | 76.9 | 87.7 |
| PBS-τ2 | 99.9 | 97.2 | 84.5 | 100.0 | 97.1 | 83.7 | 100.0 | 98.1 | 83.6 |
| NPBS-τ2 | 73.2 | 69.5 | 66.8 | 71.4 | 68.0 | 65.7 | 69.5 | 67.8 | 65.8 |
| k=20 | |||||||||
| TMG | 99.3 | 92.1 | 95.1 | 99.8 | 90.6 | 93.4 | 97.2 | 90.5 | 93.4 |
| PIII | 97.7 | 91.1 | 95.0 | 99.1 | 90.4 | 93.8 | 96.7 | 90.1 | 93.8 |
| S | 98.4 | 91.4 | 95.2 | 99.4 | 90.2 | 93.8 | 96.9 | 90.0 | 93.7 |
| T | 90.2 | 76.4 | 59.2 | 94.1 | 78.8 | 62.0 | 92.6 | 80.9 | 63.0 |
| NPBS-Q | 88.5 | 85.6 | 83.1 | 85.7 | 83.5 | 81.0 | 85.0 | 83.4 | 80.3 |
| PL-REML | 93.6 | 92.3 | 92.9 | 96.4 | 93.8 | 93.9 | 95.6 | 94.3 | 94.5 |
| QP | 91.4 | 92.5 | 92.6 | 94.1 | 94.2 | 94.1 | 95.1 | 95.3 | 94.8 |
| BJ | 91.5 | 93.3 | 94.0 | 94.0 | 94.5 | 94.2 | 94.9 | 94.6 | 94.5 |
| J | 93.1 | 93.1 | 93.0 | 94.5 | 94.2 | 94.1 | 95.1 | 95.0 | 94.7 |
| SJ | 8.1 | 58.8 | 83.8 | 18.0 | 68.2 | 87.6 | 32.6 | 63.8 | 87.7 |
| PBS-τ2 | 98.6 | 91.9 | 92.8 | 99.1 | 91.0 | 91.2 | 95.0 | 90.9 | 91.7 |
| NPBS-τ2 | 88.9 | 88.2 | 87.8 | 86.1 | 86.1 | 85.7 | 85.6 | 85.5 | 85.1 |
| k=50 | |||||||||
| TMG | 96.8 | 95.1 | 97.8 | 96.7 | 94.5 | 97.1 | 93.7 | 93.5 | 96.4 |
| PIII | 95.1 | 93.7 | 97.3 | 96.4 | 94.3 | 97.3 | 93.4 | 93.6 | 96.7 |
| S | 95.6 | 94.3 | 96.4 | 96.4 | 94.3 | 96.6 | 93.4 | 93.6 | 95.8 |
| T | 86.1 | 73.6 | 55.3 | 91.7 | 78.0 | 58.7 | 87.4 | 80.4 | 60.7 |
| NPBS-Q | 92.7 | 92.2 | 90.7 | 91.3 | 90.0 | 88.6 | 90.4 | 89.3 | 87.7 |
| PL-REML | 89.7 | 91.1 | 91.4 | 94.0 | 94.2 | 94.5 | 94.8 | 94.9 | 94.9 |
| QP | 89.1 | 90.7 | 91.2 | 94.4 | 94.4 | 94.5 | 95.0 | 95.1 | 94.8 |
| BJ | 89.3 | 92.5 | 93.5 | 94.3 | 94.5 | 94.6 | 95.1 | 94.9 | 95.0 |
| J | 91.6 | 91.7 | 91.9 | 94.7 | 94.4 | 94.5 | 95.2 | 94.9 | 94.8 |
| SJ | 0.3 | 35.4 | 79.9 | 1.8 | 50.4 | 87.4 | 8.1 | 43.9 | 86.8 |
| PBS-τ2 | 92.7 | 92.9 | 93.3 | 92.9 | 93.6 | 94.0 | 92.5 | 93.3 | 93.5 |
| NPBS-τ2 | 92.1 | 91.8 | 91.8 | 90.4 | 91.1 | 91.4 | 90.2 | 90.7 | 90.9 |
TMG: two-moment gamma; PIII: Pearson type III; S: Saddlepoint; T: test-based; NPBS: non-parametric bootstrap; Q: the Q statistic; PL: profile-likelihood; REML: restricted maximum likelihood; QP: Q-profile; BJ: Biggerstaff and Jackson; J: Jackson; SJ: Sidik and Jonkman; PBS: parametric bootstrap; τ2: the between-study variance.
The performance of the T approach mimicked that of TMG, PIII, and S, but with lower CI coverage across different simulation settings; this was clearly illustrated with a large number of studies but small within-study sample sizes. Table 4 shows that, when k = 50 and ni was between 10 and 50, the coverage probabilities were 86.1%, 73.6%, and 55.3% for τ = 0.5, 1.1, and 2.2, respectively. This was below the average coverage of the other intervals based on the Q statistic. For the T approach, when k was fixed, the coverage probability decreased as τ increased. This finding indicates that the commonly used test-based method is inappropriate for calculating a CI of I2 if the meta-analysis is highly heterogeneous. The similarities in coverage trajectory across meta-analysis parameters among the T, TMG, PIII, and S methods likely reflect the fact that all four methods use the approximated distribution of the Q statistic, while the decrease in overall coverage probability for the T method may be because it does not use the estimate of τ2. Interestingly, although its coverage probability was worse than that of the other three methods, the average interval length of the T method was comparatively shorter (Table 5).
Table 5.
Average lengths (in percentage, %) of estimated I2 95% confidence intervals of 12 methods are shown with standard deviations in parentheses. Results are from the simulation study using mean differences as effect measures.
| Method (μ = 0 or 1) | ni = 10–50: τ = 0.5 | τ = 1.1 | τ = 2.2 | ni = 50–150: τ = 0.3 | τ = 0.6 | τ = 1.2 | ni = 150–550: τ = 0.2 | τ = 0.3 | τ = 0.6 |
|---|---|---|---|---|---|---|---|---|---|
| k=5 | |||||||||
| TMG | 72.7 (10.0) | 79.9 (11.7) | 88.0 (11.5) | 72.6 (9.8) | 79.5 (11.5) | 87.7 (11.5) | 74.0 (10.4) | 78.4 (11.4) | 87.3 (11.2) |
| PIII | 72.3 (10.0) | 77.1 (12.3) | 69.7 (22.4) | 72.4 (9.7) | 77.3 (11.8) | 70.5 (22.0) | 73.6 (10.2) | 76.7 (11.5) | 72.5 (20.3) |
| S | 72.7 (10.0) | 79.7 (11.7) | 83.4 (15.5) | 72.6 (9.8) | 79.4 (11.5) | 83.3 (15.5) | 74.0 (10.4) | 78.3 (11.4) | 84.0 (14.3) |
| T | 77.0 (8.6) | 69.0 (18.9) | 44.3 (28.2) | 77.3 (7.8) | 69.9 (17.9) | 44.6 (27.9) | 76.3 (9.9) | 71.4 (16.4) | 48.5 (27.6) |
| NPBS-Q | 41.2 (30.0) | 58.9 (29.2) | 77.9 (21.7) | 41.0 (29.6) | 58.3 (28.7) | 77.6 (21.2) | 44.7 (29.9) | 55.6 (29.3) | 76.1 (22.4) |
| PL-REML | 83.3 (9.3) | 79.5 (15.9) | 58.1 (27.3) | 83.4 (8.8) | 80.2 (14.9) | 58.6 (27.0) | 83.3 (9.6) | 81.3 (13.8) | 62.3 (26.0) |
| QP | 83.7 (18.7) | 80.5 (19.9) | 57.6 (29.1) | 84.0 (17.6) | 81.2 (18.8) | 57.8 (28.7) | 84.2 (17.4) | 82.2 (18.5) | 61.7 (28.0) |
| BJ | 83.2 (18.1) | 80.1 (19.8) | 58.0 (28.4) | 83.5 (17.4) | 80.9 (18.8) | 58.3 (28.1) | 83.7 (17.1) | 81.8 (18.4) | 62.0 (27.4) |
| J | 83.7 (18.5) | 82.7 (19.7) | 60.1 (30.5) | 83.9 (17.6) | 82.8 (18.6) | 59.3 (29.7) | 84.5 (17.3) | 83.7 (18.2) | 63.5 (29.0) |
| SJ | 53.6 (13.7) | 50.7 (13.7) | 37.0 (16.6) | 54.3 (13.3) | 51.6 (13.1) | 37.5 (16.4) | 53.9 (13.1) | 52.1 (13.2) | 39.6 (16.2) |
| PBS-τ2 | 74.7 (9.3) | 81.1 (10.7) | 85.0 (14.5) | 74.5 (9.1) | 80.7 (10.4) | 84.8 (14.4) | 75.8 (9.6) | 79.7 (10.4) | 85.4 (12.9) |
| NPBS-τ2 | 46.5 (30.7) | 62.4 (28.2) | 80.6 (20.6) | 45.8 (30.1) | 61.5 (27.9) | 80.4 (20.1) | 49.4 (30.1) | 59.1 (28.7) | 78.7 (21.2) |
| k=20 | |||||||||
| TMG | 58.9 (12.0) | 65.1 (11.7) | 35.6 (13.9) | 57.8 (11.5) | 65.1 (11.0) | 35.5 (13.7) | 61.8 (11.5) | 66.3 (10.3) | 40.0 (14.5) |
| PIII | 58.0 (11.2) | 57.1 (13.0) | 26.6 (11.3) | 57.3 (10.9) | 58.8 (12.1) | 27.6 (11.5) | 60.5 (10.6) | 60.8 (11.1) | 31.2 (12.5) |
| S | 58.3 (11.6) | 60.1 (12.7) | 29.7 (12.2) | 57.5 (11.2) | 61.1 (11.9) | 30.4 (12.3) | 60.9 (11.1) | 62.9 (11.1) | 34.3 (13.2) |
| T | 52.9 (8.0) | 38.2 (15.2) | 12.2 (8.3) | 53.1 (7.7) | 40.5 (14.9) | 13.0 (8.6) | 52.5 (9.1) | 43.5 (14.1) | 15.2 (9.8) |
| NPBS-Q | 50.1 (18.2) | 59.9 (14.7) | 33.6 (15.9) | 48.6 (17.7) | 59.7 (14.4) | 33.8 (15.6) | 53.8 (16.0) | 60.2 (13.7) | 37.9 (16.7) |
| PL-REML | 60.8 (11.2) | 50.6 (14.5) | 21.0 (10.4) | 60.3 (10.9) | 52.5 (13.9) | 21.9 (10.7) | 61.2 (10.1) | 55.3 (13.0) | 25.1 (11.8) |
| QP | 65.7 (14.0) | 53.1 (15.9) | 21.0 (10.5) | 65.2 (13.8) | 54.9 (15.2) | 21.7 (10.6) | 65.9 (12.3) | 58.2 (14.2) | 25.0 (11.8) |
| BJ | 62.9 (12.4) | 52.3 (14.3) | 23.8 (10.5) | 62.5 (12.5) | 53.8 (14.1) | 24.3 (10.6) | 63.6 (10.9) | 56.7 (13.2) | 27.3 (11.5) |
| J | 66.8 (14.2) | 59.7 (18.7) | 21.8 (12.1) | 65.7 (13.9) | 59.5 (17.2) | 22.1 (11.5) | 67.8 (12.6) | 63.2 (15.9) | 25.6 (13.1) |
| SJ | 28.7 (3.3) | 24.6 (4.8) | 13.9 (4.7) | 29.2 (2.8) | 25.7 (4.4) | 14.6 (4.8) | 28.7 (3.1) | 26.4 (4.2) | 16.0 (4.9) |
| PBS-τ2 | 58.4 (12.5) | 60.8 (13.6) | 27.1 (13.3) | 57.5 (12.2) | 62.0 (12.5) | 28.1 (13.4) | 61.3 (11.8) | 63.8 (11.5) | 32.3 (14.7) |
| NPBS-τ2 | 53.5 (20.3) | 62.2 (16.5) | 30.2 (18.8) | 51.3 (19.5) | 62.1 (15.7) | 31.1 (18.8) | 56.5 (17.4) | 62.7 (14.9) | 35.7 (20.2) |
| k=50 | |||||||||
| TMG | 48.8 (7.9) | 39.0 (7.9) | 16.4 (4.3) | 48.5 (7.7) | 40.0 (7.6) | 16.5 (4.2) | 50.2 (6.3) | 43.1 (7.4) | 18.8 (4.7) |
| PIII | 48.0 (7.6) | 35.9 (7.8) | 14.3 (3.7) | 48.0 (7.5) | 37.5 (7.7) | 14.8 (3.7) | 49.1 (6.3) | 40.7 (7.6) | 16.9 (4.2) |
| S | 48.2 (7.7) | 36.7 (7.8) | 14.8 (4.1) | 48.1 (7.6) | 38.1 (7.7) | 15.2 (4.0) | 49.3 (6.4) | 41.2 (7.6) | 17.3 (4.5) |
| T | 40.9 (6.6) | 22.0 (7.8) | 5.9 (2.6) | 41.5 (6.3) | 23.7 (7.9) | 6.3 (2.6) | 39.4 (7.3) | 26.8 (8.4) | 7.5 (3.1) |
| NPBS-Q | 46.3 (10.9) | 38.3 (10.6) | 16.0 (5.1) | 45.2 (10.4) | 39.2 (10.6) | 16.1 (5.0) | 47.4 (9.2) | 42.0 (10.6) | 18.3 (5.7) |
| PL-REML | 47.2 (7.5) | 31.8 (8.2) | 11.1 (3.4) | 46.9 (7.0) | 33.4 (8.0) | 11.6 (3.4) | 46.2 (6.5) | 36.4 (8.1) | 13.4 (3.9) |
| QP | 52.7 (9.4) | 33.7 (9.0) | 11.2 (3.5) | 52.4 (8.9) | 35.4 (8.6) | 11.7 (3.4) | 51.3 (7.8) | 39.0 (8.9) | 13.6 (4.0) |
| BJ | 49.0 (7.3) | 33.5 (7.5) | 13.9 (4.2) | 49.0 (7.1) | 34.6 (7.4) | 14.1 (4.0) | 48.2 (6.5) | 37.6 (7.6) | 16.0 (4.5) |
| J | 56.2 (11.1) | 38.4 (13.5) | 11.4 (3.5) | 55.1 (10.1) | 38.5 (12.0) | 11.8 (3.4) | 56.1 (9.4) | 43.6 (12.7) | 13.6 (4.0) |
| SJ | 18.3 (1.5) | 15.1 (2.2) | 7.9 (1.8) | 18.8 (1.1) | 16.0 (2.0) | 8.4 (1.8) | 18.3 (1.3) | 16.5 (1.9) | 9.3 (1.9) |
| PBS-τ2 | 47.2 (8.6) | 35.3 (8.6) | 12.1 (3.8) | 47.3 (8.4) | 37.2 (8.4) | 12.7 (3.8) | 48.8 (6.8) | 40.5 (8.2) | 14.7 (4.3) |
| NPBS-τ2 | 48.0 (12.7) | 37.0 (12.2) | 12.4 (4.8) | 45.9 (11.6) | 38.2 (12.0) | 12.8 (4.7) | 47.8 (10.3) | 41.4 (12.1) | 14.9 (5.5) |
TMG: two-moment gamma; PIII: Pearson type III; S: Saddlepoint; T: Test-based; NPBS: non-parametric bootstrap; Q: the Q statistic; PL: profile-likelihood; REML: restricted maximum likelihood; QP: Q-profile; BJ: Biggerstaff and Jackson; J: Jackson; SJ: Sidik and Jonkman; PBS: parametric bootstrap; τ2: the between-study variance.
The NPBS-Q interval had unacceptable coverage across most of the simulation scenarios. This method hardly achieved the nominal coverage probability, with the highest coverage estimated at 92.7%. In general, the coverage of the NPBS-Q interval decreased modestly with larger τ and ni. However, the primary driver of the low coverage was the number of studies available for resampling in the bootstrap operation, with k = 50 providing the best coverage. In the case of a small number of studies, the NPBS-Q interval was shorter than the other Q-based intervals.
Compared with the intervals based on Q, most intervals based on τ2 were found to have coverage consistently closer to the nominal 95%. Specifically, the QP, BJ, and J intervals stood out as well-performing in areas where the Q-based intervals under-performed. PL-REML also performed well, but it was slightly farther from the nominal coverage probability than QP, BJ, and J; this was evident when the within-study sample size was between 150 and 550. For example, when k = 5 and τ = 0.2, the coverage probabilities were 98.0%, 94.9%, 95.2%, and 94.7% for PL-REML, QP, BJ, and J, respectively. When τ increased to 0.6, the coverage probabilities of these methods were closer to the nominal level, at 94.2%, 94.8%, 94.7%, and 94.7%. Interestingly, when k = 50 and ni was between 10 and 50, the coverage of the QP, BJ, J, and PL-REML intervals dropped, and the TMG, PIII, and S methods were preferred in this case.
Two τ2-based intervals would not be recommended for practice based on the poor coverage probabilities observed in the simulation. Across the simulation scenarios, the lowest CI coverage was found for the SJ method, consistent with prior simulation studies of the interval for τ2.54 Table 5 shows that the interval length of the SJ method tended to be the shortest of all studied methods and decreased with increasing τ. Additionally, similar to NPBS-Q, the NPBS-τ2 interval failed to reach the nominal coverage in any scenario.
The CIs for the SMD when μ = 0 or 0.8 showed the same trends as described above (Tables S10 to S13 in the Supplemental Material), with two important caveats. First, when k = 5 and ni was between 10 and 50, the QP, BJ, and J methods worked well, with coverage close to the nominal level. This was in direct contrast to the CIs based on Q, whose coverage probabilities were consistently under 95% when τ was moderate or large (Table S7 in the Supplemental Material). Second, the KDB interval also worked well under all studied scenarios and had performance comparable to the QP, BJ, and J intervals. These points make any of the QP, BJ, J, or KDB interval estimation methods a reliable choice when the effect measure of interest is the SMD, regardless of the other meta-analysis parameters.
4.2.3. CIs for log ORs
Similar trends in CI coverage were observed for the OR as for the MD, but with a greater magnitude of departure from the nominal coverage level. For both μ = 0 and μ = 1, the TMG, PIII, and S methods had decreasing coverage with increasing τ. In Table 6, when k = 5 and ni was between 10 and 50, a drop from 100.0% coverage to about 74.8% was seen for all three intervals as τ increased. As k increased to 20 and 50, this decrease in coverage was present for both moderate and large values of τ, and the severity of the departure from nominal coverage increased. The T interval also showed a drop in coverage as τ increased, the magnitude of which was exacerbated as k increased. NPBS-Q, SJ, and NPBS-τ2 showed unacceptable coverage in all scenarios studied. PBS-τ2 was generally over-conservative when heterogeneity was mild, but it performed poorly as τ increased.
Table 6.
Coverage probabilities (in percentage, %) of estimated I2 95% confidence intervals of 13 methods when the true overall log odds ratio is 0. Results are from the simulation study using log odds ratios as effect measures.
| Method (μ = 0) | ni = 10–50: τ = 0.37 | τ = 0.73 | τ = 1.46 | ni = 50–150: τ = 0.20 | τ = 0.40 | τ = 0.80 | ni = 150–550: τ = 0.11 | τ = 0.21 | τ = 0.43 |
|---|---|---|---|---|---|---|---|---|---|
| k=5 | |||||||||
| TMG | 100.0 | 100.0 | 74.8 | 100.0 | 100.0 | 82.3 | 100.0 | 100.0 | 83.3 |
| PIII | 100.0 | 100.0 | 74.9 | 100.0 | 100.0 | 82.3 | 100.0 | 100.0 | 83.3 |
| S | 100.0 | 100.0 | 74.8 | 100.0 | 100.0 | 82.2 | 100.0 | 100.0 | 83.1 |
| T | 98.2 | 98.6 | 92.5 | 96.9 | 96.3 | 88.9 | 96.4 | 95.5 | 85.9 |
| NPBS-Q | 64.6 | 61.7 | 49.7 | 64.8 | 63.7 | 61.8 | 65.1 | 65.1 | 63.5 |
| PL-REML | 99.3 | 99.8 | 96.2 | 98.7 | 98.9 | 95.4 | 98.3 | 98.5 | 94.8 |
| QP | 96.7 | 97.4 | 97.2 | 95.6 | 95.7 | 95.9 | 95.0 | 95.3 | 95.2 |
| BJ | 96.8 | 97.3 | 97.1 | 95.6 | 96.1 | 96.7 | 95.1 | 95.5 | 95.7 |
| J | 97.0 | 97.3 | 97.1 | 95.4 | 95.6 | 96.3 | 94.9 | 95.1 | 95.4 |
| SJ | 51.0 | 80.0 | 91.3 | 54.0 | 78.5 | 88.1 | 54.2 | 78.6 | 87.8 |
| PBS-τ2 | 100.0 | 100.0 | 76.5 | 100.0 | 100.0 | 82.6 | 100.0 | 100.0 | 83.8 |
| NPBS-τ2 | 68.4 | 66.1 | 58.5 | 66.0 | 64.8 | 65.0 | 67.1 | 66.1 | 65.3 |
| KD | 94.1 | 95.0 | 95.5 | 95.0 | 95.3 | 95.6 | 94.9 | 95.2 | 95.1 |
| k=20 | |||||||||
| TMG | 100.0 | 83.2 | 67.5 | 99.9 | 89.9 | 87.8 | 99.9 | 91.5 | 91.1 |
| PIII | 100.0 | 83.2 | 68.0 | 99.8 | 89.9 | 88.0 | 99.9 | 91.4 | 91.2 |
| S | 99.9 | 83.2 | 67.9 | 99.7 | 89.6 | 88.0 | 99.6 | 91.1 | 91.2 |
| T | 99.1 | 94.3 | 54.7 | 97.2 | 90.0 | 80.0 | 96.8 | 92.2 | 78.5 |
| NPBS-Q | 82.5 | 75.3 | 35.4 | 85.5 | 84.3 | 76.2 | 85.6 | 85.1 | 82.8 |
| PL-REML | 99.6 | 93.5 | 86.8 | 98.8 | 95.0 | 94.7 | 98.4 | 94.7 | 94.4 |
| QP | 97.0 | 97.0 | 95.3 | 95.3 | 95.6 | 96.0 | 95.1 | 95.1 | 94.9 |
| BJ | 97.0 | 96.2 | 89.0 | 95.5 | 96.0 | 95.9 | 95.2 | 95.2 | 95.4 |
| J | 97.4 | 97.2 | 94.0 | 95.0 | 95.5 | 96.3 | 94.8 | 95.0 | 95.1 |
| SJ | 10.8 | 68.6 | 91.3 | 14.7 | 67.7 | 87.9 | 16.1 | 66.9 | 87.6 |
| PBS-τ2 | 100.0 | 82.0 | 67.9 | 99.8 | 89.4 | 87.8 | 99.8 | 91.1 | 91.1 |
| NPBS-τ2 | 79.2 | 75.5 | 59.5 | 84.7 | 84.6 | 82.8 | 84.9 | 85.2 | 84.7 |
| KD | 94.6 | 95.2 | 93.7 | 94.8 | 95.2 | 95.4 | 95.0 | 95.0 | 94.8 |
| k=50 | |||||||||
| TMG | 100.0 | 76.9 | 44.2 | 99.6 | 91.2 | 89.1 | 99.4 | 93.3 | 93.5 |
| PIII | 100.0 | 77.0 | 44.9 | 99.6 | 91.2 | 89.2 | 99.4 | 93.2 | 93.5 |
| S | 100.0 | 76.8 | 44.9 | 99.6 | 91.2 | 89.0 | 99.4 | 93.2 | 93.4 |
| T | 99.5 | 83.2 | 14.9 | 97.5 | 90.1 | 74.0 | 97.1 | 89.5 | 79.5 |
| NPBS-Q | 85.5 | 72.5 | 8.4 | 90.5 | 88.4 | 75.6 | 90.8 | 90.3 | 88.2 |
| PL-REML | 98.0 | 87.6 | 64.6 | 97.8 | 94.5 | 93.9 | 97.2 | 95.1 | 95.0 |
| QP | 96.7 | 95.8 | 87.5 | 95.3 | 95.5 | 96.0 | 95.5 | 95.0 | 95.1 |
| BJ | 96.4 | 93.2 | 54.7 | 95.4 | 95.6 | 93.2 | 95.4 | 95.4 | 95.8 |
| J | 97.5 | 96.3 | 80.2 | 95.0 | 95.3 | 95.9 | 95.4 | 94.9 | 95.4 |
| SJ | 0.4 | 48.6 | 88.2 | 1.1 | 49.8 | 87.5 | 1.3 | 48.1 | 87.8 |
| PBS-τ2 | 100.0 | 75.7 | 43.5 | 99.6 | 90.8 | 88.6 | 99.4 | 93.1 | 93.0 |
| NPBS-τ2 | 80.3 | 71.9 | 36.2 | 88.8 | 88.0 | 86.2 | 90.1 | 90.2 | 90.2 |
| KD | 93.3 | 93.7 | 87.9 | 94.9 | 95.0 | 95.2 | 95.3 | 94.9 | 94.9 |
TMG: two-moment gamma; PIII: Pearson type III; S: Saddlepoint; T: test-based; NPBS: non-parametric bootstrap; Q: the Q statistic; PL: profile-likelihood; REML: restricted maximum likelihood; QP: Q-profile; BJ: Biggerstaff and Jackson; J: Jackson; SJ: Sidik and Jonkman; PBS: parametric bootstrap; τ2: the between-study variance; KD: Kulinskaya–Dollinger.
When μ = 0, the QP, BJ, J, and KD intervals maintained coverage closest to the nominal level under most parameter combinations. The exception was when ni was between 10 and 50 and τ was large: a large drop in coverage was observed for all four intervals, particularly the BJ CI, whose coverage fell to 54.7% when k = 50 (Table 6), corresponding to a large decrease in CI length (Table S14 in the Supplemental Material). This was magnified when μ = 1, where the dip in coverage and the decrease in CI length were also observed when ni was between 50 and 150 (Table 7 and Table S15 in the Supplemental Material). The KD interval only showed a major drop in coverage for μ = 1 when k = 50, ni was between 10 and 50, and τ was moderate to large. In most parameter combinations, the KD interval outperformed the other methods with respect to the coverage probability or the average interval length. Therefore, in the case of the log OR, we suggest using the KD interval of I2.
Table 7.
Coverage probabilities (in percentage, %) of estimated I2 95% confidence intervals of 13 methods when the true overall log odds ratio is 1. Results are from the simulation study using log odds ratios as effect measures.
| Method (μ = 1) | ni = 10–50: τ = 0.39 | τ = 0.78 | τ = 1.56 | ni = 50–150: τ = 0.21 | τ = 0.43 | τ = 0.85 | ni = 150–550: τ = 0.11 | τ = 0.23 | τ = 0.46 |
|---|---|---|---|---|---|---|---|---|---|
| k=5 | |||||||||
| TMG | 100.0 | 100.0 | 73.4 | 100.0 | 100.0 | 80.4 | 100.0 | 100.0 | 82.5 |
| PIII | 100.0 | 100.0 | 73.5 | 100.0 | 100.0 | 80.5 | 100.0 | 100.0 | 82.5 |
| S | 100.0 | 100.0 | 73.3 | 100.0 | 100.0 | 80.3 | 100.0 | 100.0 | 82.4 |
| T | 98.8 | 98.9 | 91.4 | 97.2 | 96.5 | 87.4 | 96.1 | 95.0 | 83.1 |
| NPBS-Q | 61.4 | 57.4 | 45.0 | 64.8 | 63.4 | 59.1 | 64.6 | 64.3 | 62.2 |
| PL-REML | 99.5 | 99.9 | 96.1 | 98.8 | 99.1 | 95.5 | 98.2 | 98.5 | 94.9 |
| QP | 97.3 | 97.4 | 96.9 | 95.8 | 95.9 | 96.3 | 94.7 | 95.0 | 95.4 |
| BJ | 97.2 | 97.3 | 96.8 | 95.8 | 96.1 | 96.6 | 94.8 | 95.2 | 95.6 |
| J | 97.4 | 97.4 | 96.9 | 95.7 | 95.9 | 96.5 | 94.6 | 95.0 | 95.4 |
| SJ | 54.8 | 83.1 | 90.2 | 52.8 | 79.0 | 88.5 | 52.0 | 78.1 | 87.6 |
| PBS-τ2 | 100.0 | 100.0 | 75.7 | 100.0 | 100.0 | 81.0 | 100.0 | 100.0 | 83.1 |
| NPBS-τ2 | 65.5 | 62.1 | 53.0 | 66.6 | 65.3 | 64.2 | 66.8 | 66.0 | 65.2 |
| KD | 94.8 | 95.5 | 95.3 | 94.9 | 95.3 | 95.8 | 94.5 | 94.8 | 95.3 |
| k=20 | |||||||||
| TMG | 100.0 | 80.3 | 63.0 | 99.9 | 89.1 | 86.4 | 99.8 | 91.1 | 91.0 |
| PIII | 100.0 | 80.3 | 63.6 | 99.8 | 89.0 | 86.5 | 99.7 | 91.1 | 91.0 |
| S | 99.8 | 80.3 | 63.4 | 99.7 | 89.0 | 86.4 | 99.6 | 91.1 | 91.0 |
| T | 99.5 | 90.8 | 47.6 | 97.9 | 89.2 | 78.0 | 96.8 | 87.6 | 78.8 |
| NPBS-Q | 78.1 | 68.4 | 30.4 | 84.8 | 82.5 | 71.8 | 85.8 | 84.5 | 82.1 |
| PL-REML | 99.7 | 91.4 | 82.9 | 99.1 | 95.0 | 94.2 | 98.0 | 94.5 | 94.9 |
| QP | 96.0 | 95.7 | 91.6 | 95.9 | 96.2 | 96.7 | 95.3 | 95.1 | 95.2 |
| BJ | 95.9 | 94.6 | 85.2 | 95.9 | 96.3 | 95.5 | 95.2 | 95.1 | 95.7 |
| J | 96.6 | 95.9 | 89.9 | 95.7 | 96.0 | 96.8 | 95.2 | 95.0 | 95.5 |
| SJ | 14.0 | 77.9 | 86.1 | 14.1 | 68.2 | 89.6 | 12.3 | 67.7 | 88.6 |
| PBS-τ2 | 100.0 | 79.2 | 63.4 | 99.9 | 88.7 | 86.2 | 99.8 | 90.6 | 90.7 |
| NPBS-τ2 | 75.1 | 70.2 | 51.6 | 83.3 | 82.6 | 80.1 | 84.9 | 84.8 | 84.8 |
| KD | 94.8 | 94.5 | 90.3 | 94.9 | 95.5 | 95.9 | 94.9 | 94.9 | 95.0 |
| k=50 | |||||||||
| TMG | 100.0 | 71.1 | 34.6 | 99.8 | 89.3 | 85.0 | 99.4 | 93.2 | 93.0 |
| PIII | 100.0 | 71.2 | 35.3 | 99.8 | 89.3 | 85.2 | 99.4 | 93.2 | 93.0 |
| S | 100.0 | 71.1 | 35.2 | 99.8 | 88.9 | 84.9 | 99.4 | 92.6 | 92.4 |
| T | 99.9 | 74.0 | 9.2 | 97.8 | 89.3 | 67.7 | 97.2 | 89.6 | 79.0 |
| NPBS-Q | 78.0 | 61.6 | 5.6 | 89.6 | 86.7 | 69.7 | 90.7 | 90.4 | 87.1 |
| PL-REML | 97.0 | 82.3 | 53.3 | 98.0 | 93.7 | 91.6 | 97.7 | 95.1 | 94.9 |
| QP | 94.1 | 91.4 | 73.9 | 95.7 | 95.8 | 95.7 | 95.3 | 95.6 | 95.5 |
| BJ | 93.6 | 87.8 | 44.0 | 95.7 | 95.3 | 91.1 | 95.4 | 95.6 | 95.5 |
| J | 95.5 | 92.2 | 65.5 | 95.4 | 95.9 | 95.2 | 95.1 | 95.5 | 95.6 |
| SJ | 0.7 | 66.6 | 73.9 | 0.7 | 49.8 | 89.1 | 0.6 | 50.0 | 87.7 |
| PBS-τ2 | 100.0 | 69.9 | 33.8 | 99.7 | 89.0 | 84.5 | 99.4 | 93.0 | 92.4 |
| NPBS-τ2 | 75.1 | 70.2 | 51.6 | 83.3 | 82.6 | 80.1 | 84.9 | 84.8 | 84.8 |
| KD | 93.3 | 90.8 | 75.9 | 94.8 | 94.8 | 94.3 | 95.2 | 95.4 | 95.2 |
TMG: two-moment gamma; PIII: Pearson type III; S: Saddlepoint; T: test-based; NPBS: non-parametric bootstrap; Q: the Q statistic; PL: profile-likelihood; REML: restricted maximum likelihood; QP: Q-profile; BJ: Biggerstaff and Jackson; J: Jackson; SJ: Sidik and Jonkman; PBS: parametric bootstrap; τ2: the between-study variance; KD: Kulinskaya–Dollinger.
5. Discussion
In this article, we have compared different methods to calculate the point estimate and the CI of the I2 statistic for a meta-analysis. For point estimates of I2, the SJ method is suggested when the between-study variance is large; otherwise, the DL method gives a less biased estimate of I2 based on the simulation studies. The interval estimates of I2 fall into two groups by their derivation: one group comprises methods based on approximating the CDF of the Q statistic, and the other group comprises methods that view I2 as a function of τ2 as in equation (4) and calculate the interval for I2 from an interval for τ2. Based on the simulation studies, we suggest the following guidelines:
When the effect measure is the MD or SMD, use the QP, BJ, or J method to calculate the CI for I2;
When the effect measure is the OR, use the KD method to calculate the CI for I2.
In the case of the log OR, the KD method is recommended because it generally outperforms the other methods with respect to the coverage probability of the CI for I2. Except for the KDB method for the SMD and the KD method for the log OR, all other methods can be used to calculate the CI of the I2 statistic for any type of effect measure.
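To make the recommended τ2-based construction concrete, here is a sketch of the QP interval for I2: the generalized Q statistic, evaluated at the true τ2, follows a chi-squared distribution with k − 1 degrees of freedom, so the τ2 values at which it crosses the chi-squared quantiles bound a CI that is then mapped to the I2 scale through equation (4). To keep the sketch dependency-free, the chi-squared quantiles use the Wilson–Hilferty approximation; production code should use exact quantiles (e.g. R's qchisq), and the function names here are ours.

```python
import math
from statistics import NormalDist

def chi2_ppf(p, df):
    """Wilson-Hilferty approximation to the chi-squared quantile."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * math.sqrt(a)) ** 3

def qp_ci_i2(y, v, level=0.95):
    """Q-profile CI for I2: find tau2 values where the generalized Q
    statistic hits the chi2(k - 1) quantiles, then convert via eq. (4)."""
    k = len(y)

    def gen_q(tau2):  # generalized Q; monotonically decreasing in tau2
        w = [1.0 / (vi + tau2) for vi in v]
        mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        return sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))

    def solve(target):  # tau2 with gen_q(tau2) = target (0 if unattainable)
        if gen_q(0.0) <= target:
            return 0.0
        lo, hi = 0.0, 1.0
        while gen_q(hi) > target:
            hi *= 2.0
        for _ in range(100):  # bisection on the decreasing function
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if gen_q(mid) > target else (lo, mid)
        return 0.5 * (lo + hi)

    alpha = 1.0 - level
    t_lo = solve(chi2_ppf(1.0 - alpha / 2.0, k - 1))  # upper quantile -> lower tau2
    t_hi = solve(chi2_ppf(alpha / 2.0, k - 1))        # lower quantile -> upper tau2
    w = [1.0 / vi for vi in v]
    typ = (k - 1) * sum(w) / (sum(w) ** 2 - sum(wi ** 2 for wi in w))
    return t_lo / (t_lo + typ), t_hi / (t_hi + typ)
```

Because the mapping from τ2 to I2 is monotone, the endpoints of the τ2 interval translate directly into the endpoints of the I2 interval.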
Although the I2 statistic is widely used to measure the heterogeneity of meta-analyses, it suffers from large uncertainties and should not be used as an absolute measure of heterogeneity. However, the CI of I2 provides an appreciation of the spectrum of possible extents of heterogeneity, which can be more robust to nuisance factors. In practice, meta-analysts should report the CI for I2 using the recommended methods, which have reasonable interval lengths and provide much more reliable coverage probabilities than some currently used methods (e.g. the test-based method).
Based on the simulation framework in this article, other simulation settings can be considered in future research. For example, when the effect measure is the MD, the chi-squared distribution can be used to generate the study-specific standard deviation. Further studies can provide additional clarity on the guidelines for I2, with our work serving as the baseline.
Supplementary Material
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Institute of Mental Health, U.S. National Library of Medicine (grant numbers R03 MH128727 and R01 LM012982).
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
- 1. Page MJ, Moher D, Bossuyt PM, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. Br Med J 2021; 372: n160.
- 2. Wang Y, Lin L, Thompson CG, et al. A penalization approach to random-effects meta-analysis. Stat Med 2022; 41: 500–516.
- 3. Higgins JPT, Thomas J, Chandler J, et al. Cochrane handbook for systematic reviews of interventions. Chichester, UK: John Wiley & Sons, 2019.
- 4. Higgins JPT and Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med 2002; 21: 1539–1558.
- 5. Huedo-Medina TB, Sánchez-Meca J, Marín-Martínez F, et al. Assessing heterogeneity in meta-analysis: Q statistic or I2 index? Psychol Methods 2006; 11: 193–206.
- 6. Guyatt GH, Oxman AD, Kunz R, et al. GRADE guidelines: 7. Rating the quality of evidence—inconsistency. J Clin Epidemiol 2011; 64: 1294–1302.
- 7. Borenstein M, Higgins JPT, Hedges LV, et al. Basics of meta-analysis: I2 is not an absolute measure of heterogeneity. Res Synth Methods 2017; 8: 5–18.
- 8. Hoaglin DC. Misunderstandings about Q and ‘Cochran’s Q test’ in meta-analysis. Stat Med 2016; 35: 485–495.
- 9. Hoaglin DC. Practical challenges of I2 as a measure of heterogeneity. Res Synth Methods 2017; 8: 54.
- 10. von Hippel PT. The heterogeneity statistic I2 can be biased in small meta-analyses. BMC Med Res Methodol 2015; 15: 35.
- 11. Mittlböck M and Heinzl H. A simulation study comparing properties of heterogeneity measures in meta-analyses. Stat Med 2006; 25: 4321–4333.
- 12. Rücker G, Schwarzer G, Carpenter JR, et al. Undue reliance on I2 in assessing heterogeneity may mislead. BMC Med Res Methodol 2008; 8: 79.
- 13. Viechtbauer W. Hypothesis tests for population heterogeneity in meta-analysis. Br J Math Stat Psychol 2007; 60: 29–60.
- 14. Cochran WG. The combination of estimates from different experiments. Biometrics 1954; 10: 101–129.
- 15. Ioannidis JPA, Patsopoulos NA and Evangelou E. Uncertainty in heterogeneity estimates in meta-analyses. Br Med J 2007; 335: 914–916.
- 16. Thorlund K, Imberger G, Johnston BC, et al. Evolution of heterogeneity (I2) estimates and their 95% confidence intervals in large meta-analyses. PLoS One 2012; 7: e39471.
- 17. Kulinskaya E, Dollinger MB, Knight E, et al. A Welch-type test for homogeneity of contrasts under heteroscedasticity with application to meta-analysis. Stat Med 2004; 23: 3655–3670.
- 18. Kulinskaya E, Dollinger MB and Bjørkestøl K. Testing for homogeneity in meta-analysis I. The one-parameter case: standardized mean difference. Biometrics 2011; 67: 203–212.
- 19. Kulinskaya E, Dollinger MB and Bjørkestøl K. On the moments of Cochran’s Q statistic under the null hypothesis, with application to the meta-analysis of risk difference. Res Synth Methods 2011; 2: 254–270.
- 20. Kulinskaya E and Dollinger MB. An accurate test for homogeneity of odds ratios based on Cochran’s Q-statistic. BMC Med Res Methodol 2015; 15: 49.
- 21. Biggerstaff BJ and Jackson D. The exact distribution of Cochran’s heterogeneity statistic in one-way random effects meta-analysis. Stat Med 2008; 27: 6093–6110.
- 22. Normand S-LT. Meta-analysis: formulating, evaluating, combining, and reporting. Stat Med 1999; 18: 321–359.
- 23. Malzahn U, Böhning D and Holling H. Nonparametric estimation of heterogeneity variance for the standardised difference used in meta-analysis. Biometrika 2000; 87: 619–632.
- 24. Hedges LV and Olkin I. Statistical methods for meta-analysis. Orlando, FL: Academic Press, 1985.
- 25. Cooper H, Hedges LV and Valentine JC. The handbook of research synthesis and meta-analysis. 2nd ed. New York, NY: Russell Sage Foundation, 2009.
- 26. Egger M, Smith D, Altman G, et al. Systematic reviews in health care: meta-analysis in context. 2nd ed. London, UK: BMJ Publishing Group, 2001.
- 27. Lin L and Aloe AM. Evaluation of various estimators for standardized mean difference in meta-analysis. Stat Med 2021; 40: 403–426.
- 28. Lin L. Bias caused by sampling error in meta-analysis with small sample sizes. PLoS One 2018; 13: e0204056.
- 29. Walter SD. Choice of effect measure for epidemiological data. J Clin Epidemiol 2000; 53: 931–939.
- 30. Tajeu GS, Sen B, Allison DB, et al. Misuse of odds ratios in obesity literature: an empirical analysis of published studies. Obesity 2012; 20: 1726–1731.
- 31. Furuya-Kanamori L and Doi SAR. The outcome with higher baseline risk should be selected for relative risk in clinical studies: a proposal for change to practice. J Clin Epidemiol 2014; 67: 364–367.
- 32. Feng C, Wang B and Wang H. The relations among three popular indices of risks. Stat Med 2019; 38: 4772–4787.
- 33. Doi SA, Furuya-Kanamori L, Xu C, et al. Controversy and Debate: questionable utility of the relative risk in clinical research: paper 1: a call for change to practice. J Clin Epidemiol 2022; 142: 271–279.
- 34. Bakbergenuly I, Hoaglin DC and Kulinskaya E. Pitfalls of using the risk ratio in meta-analysis. Res Synth Methods 2019; 10: 398–419.
- 35. Haldane JBS. The estimation and significance of the logarithm of a ratio of frequencies. Ann Hum Genet 1956; 20: 309–311.
- 36. Gart JJ, Pettigrew HM and Thomas DG. The effect of bias, variance estimation, skewness and kurtosis of the empirical logit on weighted least squares analyses. Biometrika 1985; 72: 179–190.
- 37. Pettigrew HM, Gart JJ and Thomas DG. The bias and higher cumulants of the logarithm of a binomial variate. Biometrika 1986; 73: 425–435.
- 38. Sweeting MJ, Sutton AJ and Lambert PC. What to add to nothing? Use and avoidance of continuity corrections in meta-analysis of sparse data. Stat Med 2004; 23: 1351–1375.
- 39. Bradburn MJ, Deeks JJ, Berlin JA, et al. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Stat Med 2007; 26: 53–77.
- 40. Cai T, Parast L and Ryan L. Meta-analysis for rare events. Stat Med 2010; 29: 2078–2089.
- 41. Rücker G, Schwarzer G, Carpenter J, et al. Why add anything to nothing? The arcsine difference as a measure of treatment effect in meta-analysis with zero cells. Stat Med 2009; 28: 721–738.
- 42. DerSimonian R and Kacker R. Random-effects model for meta-analysis of clinical trials: an update. Contemp Clin Trials 2007; 28: 105–114.
- 43. DerSimonian R and Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986; 7: 177–188.
- 44. Sidik K and Jonkman JN. Simple heterogeneity variance estimation for meta-analysis. J R Stat Soc Ser C Appl Stat 2005; 54: 367–384.
- 45. Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. J Am Stat Assoc 1977; 72: 320–338.
- 46. Viechtbauer W. Bias and efficiency of meta-analytic variance estimators in the random-effects model. J Educ Behav Stat 2005; 30: 261–293.
- 47. Viechtbauer W. Conducting meta-analyses in R with the metafor package. J Stat Softw 2010; 36: 3.
- 48. Lin L. Comparison of four heterogeneity measures for meta-analysis. J Eval Clin Pract 2020; 26: 376–384.
- 49. Biggerstaff BJ and Tweedie RL. Incorporating variability in estimates of heterogeneity in the random effects model in meta-analysis. Stat Med 1997; 16: 753–768.
- 50. Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika 1999; 86: 929–935.
- 51. Miettinen O. Estimability and estimation in case-referent studies. Am J Epidemiol 1976; 103: 226–235.
- 52. Wetterslev J, Thorlund K, Brok J, et al. Estimating required information size by quantifying diversity in random-effects model meta-analyses. BMC Med Res Methodol 2009; 9: 86.
- 53. Langan D, Higgins JPT, Jackson D, et al. A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Res Synth Methods 2019; 10: 83–98.
- 54. Veroniki AA, Jackson D, Viechtbauer W, et al. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Res Synth Methods 2016; 7: 55–79.
- 55. Hardy RJ and Thompson SG. A likelihood approach to meta-analysis with random effects. Stat Med 1996; 15: 619–629.
- 56. Viechtbauer W. Confidence intervals for the amount of heterogeneity in meta-analysis. Stat Med 2007; 26: 37–52.
- 57. Farebrother RW. The distribution of a positive linear combination of chi-squared random variables. J R Stat Soc Ser C Appl Stat 1984; 33: 332–339.
- 58. Jackson D. Confidence intervals for the between-study variance in random effects meta-analysis using generalised Cochran heterogeneity statistics. Res Synth Methods 2013; 4: 220–229.
- 59. Jackson D and White IR. When should meta-analysis avoid making hidden normality assumptions? Biom J 2018; 60: 1040–1058.
- 60. Knapp G, Biggerstaff BJ and Hartung J. Assessing the amount of heterogeneity in random-effects meta-analysis. Biom J 2006; 48: 271–285.
- 61. Morris TP, White IR and Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38: 2074–2102.
- 62. Peters JL, Sutton AJ, Jones DR, et al. Comparison of two methods to detect publication bias in meta-analysis. JAMA 2006; 295: 676–680.