Alternative Measures of Between-Study Heterogeneity in Meta-Analysis: Reducing the Impact of Outlying Studies

Lifeng Lin; Haitao Chu; James S Hodges

doi:10.1111/biom.12543

Author manuscript; available in PMC: 2017 Mar 24.

Published in final edited form as: Biometrics. 2016 May 11;73(1):156–166. doi: 10.1111/biom.12543

PMCID: PMC5106349 NIHMSID: NIHMS780049 PMID: 27167143

Summary

Meta-analysis has become a widely used tool to combine results from independent studies. The collected studies are homogeneous if they share a common underlying true effect size; otherwise, they are heterogeneous. A fixed-effect model is customarily used when the studies are deemed homogeneous, while a random-effects model is used for heterogeneous studies. Assessing heterogeneity in meta-analysis is critical for model selection and decision making. Ideally, if heterogeneity is present, it should permeate the entire collection of studies, instead of being limited to a small number of outlying studies. Outliers can have great impact on conventional measures of heterogeneity and the conclusions of a meta-analysis. However, no widely accepted guidelines exist for handling outliers. This article proposes several new heterogeneity measures. In the presence of outliers, the proposed measures are less affected than the conventional ones. The performance of the proposed and conventional heterogeneity measures is compared theoretically, by studying their asymptotic properties, and empirically, using simulations and case studies.

Keywords: Absolute deviation, Heterogeneity, I² statistic, Meta-analysis, Outliers

1. Introduction

Meta-analysis is a statistical method for combining a collection of effect estimates from multiple separate studies (Higgins and Green, 2008), and it has been applied in a wide range of scientific areas (Hunter and Schmidt, 1996; Prospective Studies Collaboration, 2002). The collected studies are called homogeneous if they share a common underlying true effect size; otherwise, they are called heterogeneous. A fixed-effect model is customarily used for studies deemed to be homogeneous, while a random-effects model is used for heterogeneous studies (Borenstein et al., 2010; Riley et al., 2011). Assessing heterogeneity is thus a critical issue in meta-analysis because different models may lead to different estimates of overall effect size and different standard errors. Also, the perception of heterogeneity or homogeneity helps clinicians make important decisions, such as whether the collected studies are similar enough to integrate their results and whether a treatment is applicable to all patients (Ioannidis et al., 2007).

The classical statistic for testing between-study heterogeneity is Cochran’s χ² test (Cochran, 1954), also known as the Q test (Whitehead and Whitehead, 1991). However, this test suffers from poor power when the number of collected studies is small, and it may detect clinically unimportant heterogeneity when many studies are pooled (Hardy and Thompson, 1998; Jackson, 2006). More importantly, since the Q statistic and estimators of between-study variance depend on either the number of collected studies or the scale of effect sizes, they cannot be used to compare degrees of heterogeneity between different meta-analyses. Accordingly, Higgins and Thompson (2002) proposed several measures to better describe heterogeneity. Among these, I² measures the proportion of total variation between studies that is due to heterogeneity rather than within-study sampling error, and it has been popular in the meta-analysis literature. Higgins and Green (2008) empirically provided a rough guide to interpretation of I²: 0 ≤ I² ≤ 0.4 indicates that heterogeneity might not be important; 0.3 ≤ I² ≤ 0.6 may represent moderate heterogeneity; 0.5 ≤ I² ≤ 0.9 may represent substantial heterogeneity; and 0.75 ≤ I² ≤ 1 implies considerable heterogeneity. These ranges overlap because the importance of heterogeneity depends on several factors and strict thresholds can be misleading (Higgins and Green, 2008).

Ideally, if heterogeneity is present in a meta-analysis, it should permeate the entire collection of studies instead of being limited to a small number of outlying studies. With this in mind, we may classify meta-analyses into four groups: (i) all the collected studies are homogeneous; (ii) a few studies are outlying and the rest are homogeneous; (iii) heterogeneity permeates the entire collection of studies; and (iv) a few studies are outlying and heterogeneity permeates the remaining studies. Outlying studies can have great impact on conventional heterogeneity measures and on the conclusions of a meta-analysis. Several methods have been recently developed for outliers and influence diagnostics (Viechtbauer and Cheung, 2010; Gumedze and Jackson, 2011). However, no widely accepted guidelines exist for handling outliers in the statistical literature, including the area of meta-analysis. Hedges and Olkin (1985) specified two extreme positions about dealing with outlying studies: (i) data are “sacred”, and no study should ever be set aside for any reason; or (ii) data should be tested for outlying studies, and those failing to conform to the hypothesized model should be removed. Neither seems appropriate. Alternatively, if a small number of studies is influential, some researchers usually present sensitivity analyses with and without those studies. However, if the results of sensitivity analysis differ dramatically, clinicians may reach no consensus about which result to use to make decisions. Because of these problems caused by outliers, ideal heterogeneity measures are expected to be robust: they should be minimally affected by outliers and accurately describe heterogeneity.

This article introduces several new heterogeneity measures, which are designed to be less affected by outliers than conventional measures. The basic idea comes from least absolute deviations (LAD) regression, which is known to have significant robustness advantages over classical least squares (LS) regression (Portnoy and Koenker, 1997). Specifically, LS regression aims at minimizing the sum of squared errors $\sum {(y_{i} - x_{i}^{T} β)}^{2}$ , where x_i represents predictors, y_i is the response, and β contains the regression coefficients. LAD regression minimizes the sum of absolute errors $\sum | y_{i} - x_{i}^{T} β |$ . The impact of outliers is diminished by using absolute values in LAD regression, compared to using squared values in LS regression. In meta-analysis, the conventional Q statistic has the form $Q = \sum w_{i} {(y_{i} - \bar{μ})}^{2}$ , where the y_i’s are the observed effect sizes, the w_i’s are study-specific weights, and $\bar{μ}$ is the weighted average effect size. Analogously, we consider a new measure $Q_{r} = \sum \sqrt{w_{i}} | y_{i} - \bar{μ} |$ , which is expected to be more robust against outliers than the conventional Q. An estimate of the between-study variance can be obtained based on Q_r. Also, since Q_r depends on the number of collected studies, we further derive two statistics to quantify heterogeneity, which are counterparts of I² and another statistic H also proposed by Higgins and Thompson (2002).

This article is organized as follows. Section 2 gives a brief review of conventional measures and discusses the dilemma of handling outliers in meta-analysis. Section 3 proposes several new heterogeneity measures designed to be robust to outliers. Section 4 uses theoretical properties to compare the proposed and conventional measures. Section 5 presents simulations to compare the various approaches empirically, and Section 6 applies the approaches to two actual meta-analyses. Section 7 provides a brief discussion.

2. The conventional methods

2.1 Measures of between-study heterogeneity

Suppose that a meta-analysis contains n independent studies. Let μ_i be the underlying true effect size, such as log odds ratio, in study i (i = 1, …, n). Typically, published studies report estimates of the effect sizes and their within-study variances, which we will call y_i and $s_{i}^{2}$ . It is customary to assume that the y_i’s are approximately normally distributed with mean μ_i and variance $σ_{i}^{2}$ , respectively. Since the unknown $σ_{i}^{2}$ can be estimated by $s_{i}^{2}$ , these data are commonly modeled as $y_{i} \sim N (μ_{i}, s_{i}^{2})$ with $s_{i}^{2}$ treated as known. Also, we assume that the true μ_i’s are independently distributed as $μ_{i} \sim N (μ, τ^{2})$ , where μ is the true overall mean effect size across studies and τ² is the between-study variance. The collected n studies are defined to be homogeneous if their underlying true effect sizes are equal, that is, μ_i = μ for all i = 1, …, n, or equivalently τ² = 0. On the other hand, the studies are heterogeneous if their underlying true effect sizes vary, that is, τ² > 0.

To test the homogeneity of the y_i’s (i.e., H₀: τ² = 0 vs. H_A: τ² > 0), the well-known Q statistic (Whitehead and Whitehead, 1991) is defined as

Q = \sum_{i = 1}^{n} w_{i} {(y_{i} - \bar{μ})}^{2},

which follows a $χ_{n - 1}^{2}$ distribution under the null hypothesis. Here, $w_{i} = 1 / s_{i}^{2}$ is the reciprocal of the within-study variance of y_i, and $\bar{μ} = \sum_{i = 1}^{n} w_{i} y_{i} / \sum_{i = 1}^{n} w_{i}$ is the pooled fixed-effect estimate of μ. Based on the Q statistic, DerSimonian and Laird (1986) introduced a method of moments estimate of the between-study variance,

{\hat{τ}}_{DL}^{2} = \max \left\{ 0, \frac{Q - (n - 1)}{\sum_{i = 1}^{n} w_{i} - \sum_{i = 1}^{n} w_{i}^{2} / \sum_{i = 1}^{n} w_{i}} \right\} .

Note that the Q statistic depends on the number of collected studies n and the estimate of between-study variance depends on the scale of effect sizes. Hence, neither Q nor ${\hat{τ}}_{DL}^{2}$ can be used to compare degrees of heterogeneity between different meta-analyses. To allow such comparisons, Higgins and Thompson (2002) proposed the measures H and I²:

H = \sqrt{Q / (n - 1)}, I^{2} = [Q - (n - 1)] / Q .

The H statistic is interpreted as the ratio of the standard deviation of the estimated overall effect size from a random-effects meta-analysis compared to the standard deviation from a fixed-effect meta-analysis; I² describes the proportion of total variance between studies that is attributed to heterogeneity rather than sampling error. In practice, meta-analysts truncate H at 1 when H < 1 and truncate I² at 0 when I² < 0; therefore, H ⩾ 1 and I² lies between 0 and 1. Since I² is interpreted as a proportion, it is usually expressed as a percent. Both measures have been widely adopted in practice.
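The conventional measures defined in this subsection can be computed directly from the reported pairs (y_i, $s_{i}^{2}$ ). The following is a minimal sketch in plain Python, not the authors' code; the function name `conventional_measures` and the three-study input are hypothetical and chosen only for illustration.

```python
import math

def conventional_measures(y, s2):
    """Q, DerSimonian-Laird tau^2, H and I^2 from effect sizes y and variances s2."""
    n = len(y)
    w = [1.0 / v for v in s2]                           # w_i = 1 / s_i^2
    sw = sum(w)
    mu_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # pooled fixed-effect estimate
    q = sum(wi * (yi - mu_bar) ** 2 for wi, yi in zip(w, y))
    denom = sw - sum(wi ** 2 for wi in w) / sw
    tau2_dl = max(0.0, (q - (n - 1)) / denom)           # method-of-moments estimate
    h = max(1.0, math.sqrt(q / (n - 1)))                # H truncated at 1
    i2 = max(0.0, (q - (n - 1)) / q) if q > 0 else 0.0  # I^2 truncated at 0
    return {"Q": q, "tau2_DL": tau2_dl, "H": h, "I2": i2}

# Hypothetical data: three studies with unit within-study variances.
result = conventional_measures([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
print(result)
```

For this toy input the pooled estimate is 2, so Q = 8, ${\hat{τ}}_{DL}^{2} = 3$ , H = 2, and I² = 0.75.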

2.2 Outlier detection

As in many other statistical applications, outliers frequently appear in meta-analysis. Outliers may arise from at least three sources:

The quality of collected studies and systematic review. The published results (y_i, $s_{i}^{2}$ ) in a clinical study could be outlying due to errors in the process of recording, analyzing, or reporting data. Also, the populations in certain clinical studies may not meet the systematic review’s inclusion criteria; hence, such studies may be outlying compared to most other collected studies.
A heavy-tailed distribution of study-specific underlying effect sizes. Conventionally, at the between-study level, the study-specific underlying effect sizes μ_i are assumed to have a normal distribution. However, the true distribution of the μ_i’s may greatly depart from the normality assumption and have heavy tails, such as the t-distribution with small degrees of freedom.
Small sample sizes in certain studies. The true within-study variances $σ_{i}^{2}$ could be poorly estimated by the sample variances $s_{i}^{2}$ if the sample sizes are small. In some situations, effect sizes in small studies may be more informative than large studies due to “small study effects” (Nüesch et al., 2010); if their true within-study variances $σ_{i}^{2}$ are seriously underestimated, then small studies could be outlying.

Hedges and Olkin (1985) and Viechtbauer and Cheung (2010) introduced outlier detection methods for fixed-effect and random-effects meta-analyses, respectively. Both methods use a “leave-one-study-out” technique so that a potential outlier could have little influence on the residuals of interest. Specifically, the residual of study i is calculated as $e_{i} = y_{i} - {\bar{μ}}_{(- i)}$ . Here, ${\bar{μ}}_{(- i)}$ is the estimated overall effect size using the data without study i; that is, ${\bar{μ}}_{(- i)} = \frac{\sum_{j \neq i} y_{j} / s_{j}^{2}}{\sum_{j \neq i} 1 / s_{j}^{2}}$ under the fixed-effect setting, and ${\bar{μ}}_{(- i)} = \frac{\sum_{j \neq i} y_{j} / (s_{j}^{2} + {\hat{τ}}_{(- i)}^{2})}{\sum_{j \neq i} 1 / (s_{j}^{2} + {\hat{τ}}_{(- i)}^{2})}$ under the random-effects setting, where ${\hat{τ}}_{(- i)}^{2}$ can be the DerSimonian and Laird estimate using the data without study i. The variance of e_i is estimated as $v_{i} = s_{i}^{2} + {(\sum_{j \neq i} 1 / s_{j}^{2})}^{- 1}$ and $v_{i} = s_{i}^{2} + {\hat{τ}}_{(- i)}^{2} + {[\sum_{j \neq i} 1 / (s_{j}^{2} + {\hat{τ}}_{(- i)}^{2})]}^{- 1}$ under the fixed-effect and random-effects settings, respectively. The standardized residuals $ε_{i} = e_{i} / \sqrt{v_{i}}$ are expected to follow the standard normal distribution and studies with ε_i’s greater than 3 in absolute magnitude are customarily considered outliers.
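As an illustration of the leave-one-study-out residuals described above, the sketch below implements only the fixed-effect version; the random-effects version would additionally re-estimate ${\hat{τ}}_{(- i)}^{2}$ for each omitted study. The function name and the five-study data set are hypothetical.

```python
import math

def loo_standardized_residuals(y, s2):
    """Fixed-effect leave-one-out standardized residuals e_i / sqrt(v_i)."""
    w = [1.0 / v for v in s2]
    sw = sum(w)
    swy = sum(wi * yi for wi, yi in zip(w, y))
    residuals = []
    for i in range(len(y)):
        sw_minus = sw - w[i]                      # sum of weights without study i
        mu_minus = (swy - w[i] * y[i]) / sw_minus  # pooled estimate without study i
        e = y[i] - mu_minus
        v = s2[i] + 1.0 / sw_minus                # estimated variance of e_i
        residuals.append(e / math.sqrt(v))
    return residuals

# Hypothetical data: four concordant studies and one discrepant study.
eps = loo_standardized_residuals([0.0, 0.0, 0.0, 0.0, 10.0], [1.0] * 5)
flagged = [i for i, r in enumerate(eps) if abs(r) > 3]
print(eps, flagged)
```

Here only the last study exceeds the customary threshold of 3: its residual is 10/√1.25, whereas each of the other four studies has a residual of −2.5/√1.25.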

Outliers may be masked if the above approaches are used in an inappropriate setting. For example, Figures 3(b) and 3(d) in Section 6 show standardized residuals of two actual meta-analyses; different outlier detection methods identify different outliers. Hence, one must assess the heterogeneity of collected studies to correctly apply the foregoing approaches to detect outliers. However, outliers may cause heterogeneity to be overestimated and thus affect procedures to detect them. Additionally, even if outliers are identified, there is no consensus in the statistical literature on what to do about them unless these studies are evidently erroneous (Barnett and Lewis, 1994). To avoid the dilemmas of detecting and handling outliers, we propose robust measures to assess heterogeneity.

Figure 3. Forest plots and standardized residual plots of two actual meta-analyses. The upper panels show the meta-analysis conducted by Ismail et al. (2012); the lower panels show that conducted by Haentjens et al. (2010). In (a) and (c), the columns “Lower” and “Upper” are the lower and upper bounds of 95% CIs of the effect sizes within each study. In (b) and (d), the filled dots represent standardized residuals obtained under the fixed-effect setting; the unfilled dots represent those obtained under the random-effects setting.

3. The proposed alternative heterogeneity measures

3.1 Heterogeneity measures based on absolute deviations and weighted average

In linear regression, it is well-known that least absolute deviations regression is more robust to outliers than classical least squares regression (Portnoy and Koenker, 1997). The former method minimizes $\sum | y_{i} - x_{i}^{T} β |$ and the latter minimizes $\sum {(y_{i} - x_{i}^{T} β)}^{2}$ , where x_i and y_i are predictors and response respectively and β contains the regression coefficients. In the context of meta-analysis, the conventional Q statistic is analogous to least squares regression, because Q is a weighted sum of squared deviations. To reduce the impact of outlying studies, we propose a new measure Q_r which is analogous to least absolute deviations regression. This measure is the weighted sum of absolute deviations, and is defined as

Q_{r} = \sum_{i = 1}^{n} \sqrt{w_{i}} | y_{i} - \bar{μ} | .

For random-effects meta-analysis, $E [Q_{r}] = \sum_{i = 1}^{n} \sqrt{2 v_{i} / π}$ , where $v_{i} = 1 - w_{i} / \sum_{j = 1}^{n} w_{j} + τ^{2} [w_{i} - 2 w_{i}^{2} / \sum_{j = 1}^{n} w_{j} + w_{i} \sum_{j = 1}^{n} w_{j}^{2} / {(\sum_{j = 1}^{n} w_{j})}^{2}]$ .

DerSimonian and Laird (1986) derived an estimate of the between-study variance τ² based on the Q statistic by the method of moments, i.e., equating the observed Q with its expectation. We can similarly obtain a new estimate of τ², denoted as ${\hat{τ}}_{r}^{2}$ , from the proposed Q_r statistic. Specifically, ${\hat{τ}}_{r}^{2}$ is the solution to the following equation in τ²:

Q_{r} \sqrt{\frac{π}{2}} = \sum_{i = 1}^{n} \left\{ 1 - \frac{w_{i}}{\sum_{j = 1}^{n} w_{j}} + τ^{2} \left[ w_{i} - \frac{2 w_{i}^{2}}{\sum_{j = 1}^{n} w_{j}} + \frac{w_{i} \sum_{j = 1}^{n} w_{j}^{2}}{{(\sum_{j = 1}^{n} w_{j})}^{2}} \right] \right\}^{1 / 2} .

(1)

If this equation has no nonnegative solution, set ${\hat{τ}}_{r}^{2} = 0$ . Note that the right-hand side of Equation (1) is monotone increasing in τ², so the solution is unique.
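Because the right-hand side of Equation (1) is monotone in τ², the estimate can be found with any one-dimensional root finder. The sketch below uses simple bisection, which is our choice for illustration and not necessarily what the authors used; all function names are hypothetical.

```python
import math

def q_r(y, s2):
    """Weighted sum of absolute deviations from the weighted average."""
    w = [1.0 / v for v in s2]
    mu_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(math.sqrt(wi) * abs(yi - mu_bar) for wi, yi in zip(w, y))

def rhs_eq1(tau2, s2):
    """Right-hand side of Equation (1) as a function of tau^2."""
    w = [1.0 / v for v in s2]
    sw = sum(w)
    sw2 = sum(wi ** 2 for wi in w)
    return sum(
        math.sqrt(1 - wi / sw + tau2 * (wi - 2 * wi ** 2 / sw + wi * sw2 / sw ** 2))
        for wi in w
    )

def tau2_r(y, s2):
    """Solve Equation (1) for tau^2 by bisection; return 0 if no positive root."""
    target = q_r(y, s2) * math.sqrt(math.pi / 2)
    if rhs_eq1(0.0, s2) >= target:
        return 0.0
    lo, hi = 0.0, 1.0
    while rhs_eq1(hi, s2) < target:   # expand until the root is bracketed
        hi *= 2
    for _ in range(200):              # fixed number of halvings
        mid = (lo + hi) / 2
        if rhs_eq1(mid, s2) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical data: three studies with unit within-study variances.
estimate = tau2_r([0.0, 2.0, 4.0], [1.0, 1.0, 1.0])
print(estimate)
```

With equal weights the equation can be solved by hand, which gives a check on the sketch: here Q_r = 4 and the root is 4π/3 − 1.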

The Q_r statistic, like Q, is dependent on the number of studies; ${\hat{τ}}_{r}^{2}$ , like ${\hat{τ}}_{DL}^{2}$ , is dependent on the scale of effect sizes. Following the approach of Higgins and Thompson (2002), we tentatively assume that all studies share a common within-study variance σ² and explore heterogeneity measures that are independent of both the number of studies and the scale of effect sizes, so that they can be used to compare degrees of heterogeneity between meta-analyses. Suppose the target heterogeneity measure can be written as f(μ,τ²,σ²,n), which is a function of the true overall mean effect size μ, the between-study variance τ², the within-study variance σ², and the number of studies n. Higgins and Thompson (2002) suggested that this measure should satisfy the following three criteria:

(Dependence on the magnitude of heterogeneity) f(μ, τ′², σ²,n) > f(μ, τ², σ², n) for any τ′² > τ². This criterion is self-evident.
(Scale invariance) f(a + bμ,b²τ²,b²σ²,n) = f(μ,τ²,σ²,n) for any constants a and b. This criterion “standardizes” comparisons between meta-analyses using different scales of measurement and different types of outcome data.
(Size invariance) f(μ,τ²,σ²,n′) = f(μ,τ²,σ²,n) for any positive integers n and n′. This criterion indicates that the number of studies collected in meta-analysis does not systematically affect the magnitude of the heterogeneity measure.

Monotone increasing functions of ρ = τ²/σ² can be easily shown to satisfy these three criteria. Plugging w_i = 1/σ² into Equation (1), we have $ρ + 1 = π Q_{r}^{2} / [2 n (n - 1)]$ . This implies that

H_{r}^{2} = \frac{π Q_{r}^{2}}{2 n (n - 1)}

is a candidate measure. Further, considering ρ/(ρ + 1) = τ²/(τ² + σ²), commonly called the intraclass correlation, Equation (1) yields another candidate:

I_{r}^{2} = \frac{Q_{r}^{2} - 2 n (n - 1) / π}{Q_{r}^{2}} .

In practice, H_r would be truncated at 1 when H_r < 1 and $I_{r}^{2}$ would be truncated at 0 when $I_{r}^{2} < 0$ . These two measures, $H_{r}^{2}$ and $I_{r}^{2}$ , are analogous to and have the same interpretations as H² and I², respectively. Higgins and Thompson (2002) also introduced a so-called R² statistic; since it has interpretation and performance similar to H², we do not present a version of R² based on the new Q_r statistic.

Since standard deviations are used more frequently in clinical practice, Higgins and Thompson (2002) suggested reporting H, instead of H², for meta-analyses. For the proposed measures, we also recommend reporting H_r rather than $H_{r}^{2}$ . However, we suggest presenting I² and $I_{r}^{2}$ instead of their square roots because their interpretation of “proportion of variance explained” is widely familiar to clinicians. H_r = 1 or $I_{r}^{2} = 0$ implies perfect homogeneity. Also, since the expressions for H_r and $I_{r}^{2}$ only involve Q_r and n but not within-study variances, these two measures can be easily generalized to a situation where the within-study variances $s_{i}^{2}$ vary across studies.
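Given Q_r and n, the two measures follow immediately. A minimal sketch, with a hypothetical function name and the truncation rules stated above:

```python
import math

def robust_h_i2(q_r, n):
    """H_r (truncated at 1) and I_r^2 (truncated at 0) from Q_r and n."""
    h2 = math.pi * q_r ** 2 / (2 * n * (n - 1))
    h_r = max(1.0, math.sqrt(h2))
    if q_r > 0:
        i2_r = max(0.0, (q_r ** 2 - 2 * n * (n - 1) / math.pi) / q_r ** 2)
    else:
        i2_r = 0.0
    return h_r, i2_r

# Hypothetical input: Q_r = 4 from n = 3 studies.
h_r, i2_r = robust_h_i2(4.0, 3)
print(h_r, i2_r)
```

When no truncation applies, the two quantities are linked by $I_{r}^{2} = 1 - 1 / H_{r}^{2}$ , mirroring the relationship between I² and H².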

3.2 Heterogeneity measures based on absolute deviations and weighted median

The proposed Q_r statistic uses the weighted average $\bar{μ}$ to estimate overall effect size under the null hypothesis; it may be sensitive to potential outliers. To derive an even more robust heterogeneity measure, we may replace the weighted average with the weighted median ${\hat{μ}}_{m}$ , which is defined as the solution to the following equation in θ:

\sum_{i = 1}^{n} w_{i} [I (θ ⩾ y_{i}) - 0.5] = 0,

(2)

where $I (\cdot)$ is the indicator function. This weighted median leads to a new test statistic, $Q_{m} = \sum_{i = 1}^{n} \sqrt{w_{i}} | y_{i} - {\hat{μ}}_{m} |$ . Note that the solution to Equation (2) may not be unique; to avoid this problem, we will approximate the indicator function by a monotone increasing smooth function (Horowitz, 1998). Section 3.3 introduces the details.

The expectation of Q_m cannot be calculated explicitly because the distribution of the weighted median in finite samples is unclear. By the theory of M-estimation (Huber and Ronchetti, 2009), the weighted median is a $\sqrt{n}$ -consistent estimator of the true overall effect size μ. Suppose that the weights w_i have a finite first-order moment; then it can be shown that

| Q_{m} / n - \frac{1}{n} \sum_{i = 1}^{n} \sqrt{w_{i}} | y_{i} - μ | | \leq | {\hat{μ}}_{m} - μ | \cdot \frac{1}{n} \sum_{i = 1}^{n} \sqrt{w_{i}} = O_{p} (n^{- 1 / 2}) .

Therefore, when the number of collected studies n is large, $E [Q_{m} / n] \approx \frac{1}{n} E [\sum_{i = 1}^{n} \sqrt{w_{i}} | y_{i} - μ |] = \frac{1}{n} \sqrt{2 / π} \sum_{i = 1}^{n} \sqrt{(s_{i}^{2} + τ^{2}) / s_{i}^{2}}$ . By equating the Q_m statistic to its approximated expectation, a new estimator of between-study variance ${\hat{τ}}_{m}^{2}$ can be derived as the solution to $Q_{m} \sqrt{π / 2} = \sum_{i = 1}^{n} \sqrt{(s_{i}^{2} + τ^{2}) / s_{i}^{2}}$ in τ². If all the within-study variances are further assumed to be equal to a common value σ² as in Section 3.1, $E [Q_{m} / n] \approx \sqrt{2 / π} \sqrt{(σ^{2} + τ^{2}) / σ^{2}}$ . Based on Q_m, the counterparts of $H_{r}^{2}$ and $I_{r}^{2}$ —which assess (σ² + τ²)/σ² and τ²/(σ² + τ²) respectively—are defined as

H_{m}^{2} = \frac{π Q_{m}^{2}}{2 n^{2}}, I_{m}^{2} = \frac{Q_{m}^{2} - 2 n^{2} / π}{Q_{m}^{2}} .

Note that many meta-analyses only collect a small number of studies; however, the derivation of ${\hat{τ}}_{m}^{2}$ , $H_{m}^{2}$ , and $I_{m}^{2}$ assumes a large n. The finite-sample performance of these heterogeneity measures will be studied using simulations.
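Once Q_m is available, the estimator ${\hat{τ}}_{m}^{2}$ and the measures H_m and $I_{m}^{2}$ can be obtained in the same way as their Q_r counterparts. The sketch below is illustrative only: it uses bisection for the moment equation and truncates H_m at 1 and $I_{m}^{2}$ at 0 by analogy with Section 3.1, which is an assumption on our part; the function names are hypothetical.

```python
import math

def tau2_m(q_m, s2):
    """Solve Q_m * sqrt(pi/2) = sum_i sqrt((s_i^2 + tau^2) / s_i^2) for tau^2."""
    target = q_m * math.sqrt(math.pi / 2)

    def rhs(tau2):
        return sum(math.sqrt((v + tau2) / v) for v in s2)

    if rhs(0.0) >= target:
        return 0.0                    # no positive solution
    lo, hi = 0.0, 1.0
    while rhs(hi) < target:           # expand until the root is bracketed
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if rhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def h_m_i2_m(q_m, n):
    """H_m and I_m^2 from Q_m and n (truncation assumed by analogy with H_r, I_r^2)."""
    h2 = math.pi * q_m ** 2 / (2 * n ** 2)
    h_m = max(1.0, math.sqrt(h2))
    i2_m = max(0.0, 1 - 2 * n ** 2 / (math.pi * q_m ** 2)) if q_m > 0 else 0.0
    return h_m, i2_m

# Hypothetical input: Q_m = 8 from four studies with unit variances.
tau2_hat = tau2_m(8.0, [1.0] * 4)
h_m, i2_m = h_m_i2_m(8.0, 4)
print(tau2_hat, h_m, i2_m)
```

With a common within-study variance of 1, the moment equation reduces to 1 + τ² = $H_{m}^{2}$ , so the toy input gives ${\hat{τ}}_{m}^{2} = 2π - 1$ and $H_{m}^{2} = 2π$ .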

3.3 Calculation of p-values and confidence intervals

Because Q_r sums the absolute values of correlated random variables and the distribution of the weighted median in Q_m is intractable, it is not feasible to derive the probability density and cumulative distribution functions of the proposed statistics explicitly. Instead, resampling methods can be used to calculate p-values and 95% confidence intervals (CIs). Since the weighted median in Q_m is discontinuous and may not be unique due to the indicator function in Equation (2), we apply the approach in Horowitz (1998) to approximate the indicator function $I (t > 0)$ by a smooth function J(t) in the following simulations and case studies. For example, J(t) can be the scaled expit function $J_{ε} (t) = 1 / [1 + \exp (- t / ε)]$ , where ε is a pre-specified small constant. We use ε = 10⁻⁴; various choices of ε are shown to produce stable results in Web Appendix A.
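One way to compute the smoothed weighted median is to replace the indicator in Equation (2) by J_ε and solve the resulting monotone equation in θ numerically. The sketch below does this by bisection; the numerically stable form of the expit function and all names are our own illustrative choices.

```python
import math

def expit_scaled(t, eps):
    """J_eps(t) = 1 / (1 + exp(-t / eps)), written to avoid overflow."""
    z = t / eps
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def weighted_median_smoothed(y, s2, eps=1e-4):
    """Root in theta of sum_i w_i * (J_eps(theta - y_i) - 0.5) = 0."""
    w = [1.0 / v for v in s2]

    def g(theta):
        return sum(wi * (expit_scaled(theta - yi, eps) - 0.5) for wi, yi in zip(w, y))

    lo, hi = min(y) - 1.0, max(y) + 1.0   # g(lo) < 0 < g(hi) since J_eps is increasing
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def q_m(y, s2, eps=1e-4):
    """Weighted sum of absolute deviations from the smoothed weighted median."""
    mu_m = weighted_median_smoothed(y, s2, eps)
    return sum(abs(yi - mu_m) / math.sqrt(v) for yi, v in zip(y, s2))

# Hypothetical data: the third study is far from the other two.
mu_m = weighted_median_smoothed([0.0, 1.0, 10.0], [1.0, 1.0, 1.0])
stat = q_m([0.0, 1.0, 10.0], [1.0, 1.0, 1.0])
print(mu_m, stat)
```

In this toy example the smoothed weighted median is essentially 1, whereas the weighted average is 11/3, which illustrates why the median-based statistic is less sensitive to the discrepant study.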

Parametric resampling can be used to calculate a p-value for Q_r; similar procedures can also be used for Q and Q_m. First, estimate the overall effect size $\bar{μ}$ under H₀: τ² = 0 (i.e., the fixed-effect setting) and calculate the Q_r statistic based on the original data. Second, draw n samples under H₀, $y_{i}^{*} \sim N (\bar{μ}, s_{i}^{2})$ , and repeat this for B (say 10,000) iterations. Here, the weighted average $\bar{μ}$ is used to estimate μ because it is unbiased and may have smaller variance than the weighted median under the null hypothesis. Third, based on the B sets of bootstrap samples, calculate the Q_r statistic as $Q_{r}^{(b)}$ for b = 1, …, B. Finally, the p-value is calculated as $P = [\sum_{b = 1}^{B} I (Q_{r}^{(b)} > Q_{r}) + 1] / (B + 1)$ . Here, 1 is added to both numerator and denominator to avoid calculating P = 0. To derive 95% CIs for the various heterogeneity measures, the nonparametric bootstrap can be used by taking samples of size n with replacement from the original data ${(y_{i}, s_{i}^{2})}_{i = 1}^{n}$ and calculating 2.5% and 97.5% quantiles for each of the measures over the bootstrap samples.
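The parametric resampling procedure for the p-value of Q_r can be sketched as follows. This is an illustrative implementation under the steps described above, not the authors' code; the function names, the seed argument, and the two toy data sets are hypothetical, and B is reduced in the example only to keep it quick.

```python
import math
import random

def q_r_stat(y, s2):
    """Q_r computed around the weighted average of the supplied data."""
    w = [1.0 / v for v in s2]
    mu_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return sum(math.sqrt(wi) * abs(yi - mu_bar) for wi, yi in zip(w, y))

def resampling_p_value(y, s2, B=10000, seed=0):
    """Parametric resampling p-value for Q_r under H0: tau^2 = 0."""
    rng = random.Random(seed)
    w = [1.0 / v for v in s2]
    mu_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)   # step 1
    observed = q_r_stat(y, s2)
    sd = [math.sqrt(v) for v in s2]
    exceed = 0
    for _ in range(B):                                        # steps 2 and 3
        y_star = [rng.gauss(mu_bar, sdi) for sdi in sd]
        if q_r_stat(y_star, s2) > observed:
            exceed += 1
    return (exceed + 1) / (B + 1)                             # step 4

# Hypothetical data: identical effects versus strongly discordant effects.
B = 2000
p_hom = resampling_p_value([0.0] * 8, [1.0] * 8, B=B)
p_het = resampling_p_value([-10.0, 10.0] * 4, [1.0] * 8, B=B)
print(p_hom, p_het)
```

The first data set has Q_r = 0, so every resampled statistic exceeds it and the p-value is 1; the second has Q_r = 80, far beyond anything generated under the null hypothesis, so the p-value takes its smallest attainable value 1/(B + 1).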

4. The relationship between I², $I_{r}^{2}$ , and $I_{m}^{2}$

4.1 When the number of studies is fixed

Since $I_{r}^{2}$ and $I_{m}^{2}$ are designed to be robust compared to the conventional I², they are expected to be smaller than I² in the presence of outliers. Applying the Cauchy-Schwarz Inequality, $Q_{r}^{2} \leq n Q$ , and the equality holds if and only if each $w_{i} {(y_{i} - \bar{μ})}^{2}$ equals a common value for all studies, in which case outliers are unlikely to appear. The foregoing inequality further implies $H_{r} \leq H \sqrt{π / 2}$ and $I_{r}^{2} \leq I^{2} + (1 - 2 / π) (1 - I^{2})$ . Therefore, the proposed H_r and $I_{r}^{2}$ are not always smaller than H and I², respectively; $I_{r}^{2}$ may be greater than I² by up to (1−2/π)(1−I²). Web Appendix B provides artificial meta-analyses to illustrate how the proposed measures may have better interpretations even when no outliers are present; $I_{r}^{2}$ and $I_{m}^{2}$ are larger than I² in those examples. As $I_{m}^{2}$ is based on the intractable weighted median, determining its relationship with I² and $I_{r}^{2}$ is not feasible in finite samples except by simulations. Alternatively, the asymptotic values of the three measures can be derived as n → ∞; Section 4.2 considers this case.
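For readers who want the intermediate steps, the two bounds above can be derived as follows. This is a sketch that uses the untruncated definitions of H, I², H_r, and $I_{r}^{2}$ from Sections 2.1 and 3.1, together with the identities $I^{2} = 1 - 1 / H^{2}$ and $I_{r}^{2} = 1 - 1 / H_{r}^{2}$ .

```latex
\begin{align*}
Q_r &= \sum_{i=1}^{n} 1 \cdot \sqrt{w_i}\,\lvert y_i - \bar{\mu} \rvert
      \le \sqrt{n}\,\Bigl(\sum_{i=1}^{n} w_i (y_i - \bar{\mu})^2\Bigr)^{1/2}
      = \sqrt{nQ}, \\
H_r^2 &= \frac{\pi Q_r^2}{2n(n-1)}
      \le \frac{\pi}{2} \cdot \frac{Q}{n-1}
      = \frac{\pi}{2} H^2, \\
I_r^2 &= 1 - \frac{1}{H_r^2}
      \le 1 - \frac{2}{\pi H^2}
      = 1 - \frac{2}{\pi}\,(1 - I^2)
      = I^2 + \Bigl(1 - \frac{2}{\pi}\Bigr)(1 - I^2).
\end{align*}
```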

4.2 When the number of studies becomes large

This section focuses on the asymptotic properties of the three heterogeneity measures as the number of collected studies n → ∞. Denote $\overset{P}{\to}$ as convergence in probability, and let Φ(·) be the cumulative distribution function of the standard normal distribution. We have the following two propositions if no outliers are present.

Proposition 1

Under the fixed-effect setting, the observed effect sizes are $y_{i} \sim N (μ, s_{i}^{2})$ . Assume that the weights $w_{i} = 1 / s_{i}^{2}$ are independent and identically distributed with finite positive mean, and independent of the y_i’s. Then I², $I_{r}^{2}$ , and $I_{m}^{2}$ converge to 0 in probability as n → ∞.

Proposition 2

Assume that all studies share a common within-study variance σ². Under the random-effects setting, the observed effect sizes are y_i ~ N(μ_i, σ²) and μ_i ~ N(μ, τ²); hence, the true proportion of total variation between studies due to heterogeneity is $I_{0}^{2} = τ^{2} / (σ^{2} + τ^{2})$ . Then I², $I_{r}^{2}$ , and $I_{m}^{2}$ converge to the true $I_{0}^{2}$ in probability as n → ∞.

Propositions 1 and 2 show that, for either homogeneous or heterogeneous studies, all three heterogeneity measures converge to the true value and correctly indicate homogeneity or heterogeneity. Proposition 1 does not require that the n studies have a common within-study variance; Proposition 2 makes this assumption to facilitate definition of the true $I_{0}^{2}$ . The following proposition compares the three measures when the collection of studies is contaminated by a certain proportion of outlying studies.

Proposition 3

Assume that all studies share a common within-study variance σ². The observed effect sizes are y_i ~ N(μ_i, σ²). The meta-analysis is supposed to focus on a certain population of interest, and in this population, the study-specific underlying effect sizes are μ_i ~ N(μ, τ²); therefore, the true proportion of total variation between studies in this population that is due to heterogeneity is $I_{0}^{2} = τ^{2} / (σ^{2} + τ^{2})$ . However, 100η percent of the n studies are mistakenly included, having been conducted on inappropriate populations; their study-specific underlying effect sizes are μ_i ~ N(μ + C, τ²), where C is a constant, representing the discrepancy of outliers. Then, as n → ∞,

\begin{matrix} I^{2} \overset{P}{\to} 1 - {[{(1 - I_{0}^{2})}^{- 1} + r_{1} r_{2}]}^{- 1}; \\ I_{r}^{2} \overset{P}{\to} h (r_{1}, r_{2}; η, I_{0}^{2}); \\ I_{m}^{2} \overset{P}{\to} h (s_{1}, s_{2}; η, I_{0}^{2}) . \end{matrix}

Here, h(·, ·; η, $I_{0}^{2}$ ) is a function depending on η and $I_{0}^{2}$ defined as

h (t_{1}, t_{2}; η, I_{0}^{2}) = 1 - \left\{ η [{(1 - I_{0}^{2})}^{- 1 / 2} \exp (- \frac{1}{2} t_{1}^{2} (1 - I_{0}^{2})) + \sqrt{\frac{π}{2}} t_{1} (1 - 2 Φ (- t_{1} {(1 - I_{0}^{2})}^{1 / 2}))] + (1 - η) [{(1 - I_{0}^{2})}^{- 1 / 2} \exp (- \frac{1}{2} t_{2}^{2} (1 - I_{0}^{2})) - \sqrt{\frac{π}{2}} t_{2} (1 - 2 Φ (t_{2} {(1 - I_{0}^{2})}^{1 / 2}))] \right\}^{- 2};

also, r₁ = (1 − η)C/σ, r₂ = ηC/σ, s₂ = C/σ − s₁, and s₁ is the solution to

η Φ (- s_{1} {(1 - I_{0}^{2})}^{1 / 2}) + (1 - η) Φ ((C / σ - s_{1}) {(1 - I_{0}^{2})}^{1 / 2}) = 0.5 .

Web Appendix C gives proofs of the three propositions. Proposition 3 suggests that all three heterogeneity measures are affected by outlying studies, though to different degrees. Specifically, their asymptotic values are determined by three factors: the true proportion of total variation between studies that is due to heterogeneity $I_{0}^{2}$ , the proportion of outliers η, and the ratio of the discrepancy of the outliers C compared to the within-study standard deviation σ, that is, R = C/σ. Outliers are usually present in small quantities, so the proportion of outliers η is usually not large. Also, an observation is customarily considered an outlier if the distance to the overall mean is greater than three times the standard deviation σ; therefore, the ratio R is usually greater than 3.
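The limits in Proposition 3 are straightforward to evaluate numerically. The sketch below is illustrative: it assumes R = C/σ ≥ 0, 0 < η < 0.5, and $I_{0}^{2} < 1$ , computes Φ from the error function, and finds s₁ by bisection on [0, R], where the defining equation changes sign. The function names are hypothetical.

```python
import math

def phi_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def h_limit(t1, t2, eta, i0):
    """The function h(t1, t2; eta, I0^2) from Proposition 3."""
    a = math.sqrt(1.0 - i0)
    c = math.sqrt(math.pi / 2.0)
    outlying = math.exp(-0.5 * t1 ** 2 * (1 - i0)) / a + c * t1 * (1 - 2 * phi_cdf(-t1 * a))
    regular = math.exp(-0.5 * t2 ** 2 * (1 - i0)) / a - c * t2 * (1 - 2 * phi_cdf(t2 * a))
    return 1.0 - (eta * outlying + (1 - eta) * regular) ** (-2)

def asymptotic_values(R, eta, i0):
    """Limits of I^2, I_r^2 and I_m^2 as n grows, given R = C / sigma."""
    r1, r2 = (1 - eta) * R, eta * R
    i2 = 1.0 - 1.0 / (1.0 / (1.0 - i0) + r1 * r2)
    i2_r = h_limit(r1, r2, eta, i0)
    a = math.sqrt(1.0 - i0)
    lo, hi = 0.0, R
    for _ in range(200):              # the left-hand side is decreasing in s1
        mid = (lo + hi) / 2
        g = eta * phi_cdf(-mid * a) + (1 - eta) * phi_cdf((R - mid) * a) - 0.5
        if g > 0:
            lo = mid
        else:
            hi = mid
    s1 = (lo + hi) / 2
    i2_m = h_limit(s1, R - s1, eta, i0)
    return i2, i2_r, i2_m

no_outlier = asymptotic_values(0.0, 0.05, 0.5)   # C = 0: no discrepancy
contaminated = asymptotic_values(5.0, 0.01, 0.0)  # 1% outliers, R = 5, true homogeneity
print(no_outlier, contaminated)
```

As a sanity check, setting C = 0 returns $I_{0}^{2}$ for all three limits, in agreement with Proposition 2. With 1% outliers at R = 5 and true homogeneity, the limits of $I_{r}^{2}$ and $I_{m}^{2}$ are smaller than the limit of I².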

Figure 1 compares the asymptotic values of the three heterogeneity measures derived in Proposition 3. The upper panels show the setting of true homogeneity ( $I_{0}^{2} = 0$ ) and the lower panels show the setting of true heterogeneity ( $I_{0}^{2} = 0.5$ ). Under each setting, the proportion of outliers is 1%, 5%, or 10%. Clearly, all the panels present a common trend: the three heterogeneity measures increase as R increases. When η is 1%, $I_{r}^{2}$ and $I_{m}^{2}$ are much less affected by outliers than I², indicating the robustness of the proposed measures. Also, $I_{m}^{2}$ is a bit smaller than $I_{r}^{2}$ . As η increases, the difference between I² and $I_{r}^{2}$ becomes smaller, while the difference between $I_{r}^{2}$ and $I_{m}^{2}$ becomes larger though it is never substantial. This implies that $I_{m}^{2}$ is the most robust measure when a meta-analysis is contaminated by a large proportion of outliers.

Figure 1. The asymptotic values of I², $I_{r}^{2}$ , and $I_{m}^{2}$ as n → ∞. The horizontal axis represents the ratio (R) of discrepancy of outliers (C) compared to within-study standard deviation (σ), that is, R = C/σ. The true proportion of total variation between studies that is due to heterogeneity $I_{0}^{2}$ is 0 (homogeneity, top row) or 0.5 (heterogeneity, bottom row). The proportion of outlying studies η varies from 1% (left panels) to 10% (right panels).

5. Simulations

Simulations were conducted to investigate the finite-sample performance of the various approaches to assessing heterogeneity. Without loss of generality, the true overall mean effect size was fixed as μ = 0. The number of studies in these simulated meta-analyses was set to n = 10 or 30, and the between-study variance was τ² = 0 (homogeneity) or 1 (heterogeneity). Under the homogeneity setting, the within-study standard errors s_i were sampled from U(0.5, 1); under the heterogeneity setting, we sampled s_i’s from U(s_min, s_max), where (s_min, s_max) = (0.5, 1), (1, 2), or (2, 5) to represent different proportions of total variation between studies that is due to heterogeneity. The observed effect sizes were drawn from $y_{i} \sim N (μ_{i}, s_{i}^{2})$ , where μ_i’s are study-specific underlying effect sizes. Regarding the μ_i, we considered the following two different scenarios to produce outliers.

Scenario I (Contamination). The μ_i’s are normally distributed, μ_i ~ N(μ, τ²); however, m out of the n studies were contaminated by a certain outlying discrepancy, as in Proposition 3. We set m = 0, 1, 2, and 3, and five outlier patterns were considered: the m studies were created as outliers by artificially adding C, (C, C), (C, −C), (C, C, C), or (C, C, −C) to the original effect sizes for m = 1, 2, 2, 3, and 3, respectively. The discrepancy of outliers was set to $C = 3 \sqrt{s_{\max}^{2} + τ^{2}}$ .
Scenario II (Heavy tail). The μ_i’s are drawn from a heavy-tailed distribution. We considered a location-scale transformed t distribution with degrees of freedom df = 3, 5, and 10; that is, $μ_{i} = μ + z_{i} \sqrt{(df - 2) / df}$ , where z_i ~ t_df. Note that the between-study variance τ² = Var[μ_i] = 1 in this scenario, so the generated studies are heterogeneous. Also, as the degrees of freedom increase, the distribution of the μ_i’s converges to the normal distribution and outliers become less likely.
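The two data-generating scenarios can be sketched as follows. This is an illustrative sketch only, not the authors' simulation code: the function name `simulate_meta`, its arguments, and the choice to contaminate the first m studies are assumptions made here for illustration. The t_df draw is built from a standard normal and a chi-squared variate; since Var[t_df] = df/(df − 2), the rescaling by $\sqrt{(df - 2)/df}$ gives Var[μ_i] = 1.

```python
import math
import random


def simulate_meta(n, tau2, s_min, s_max, scenario="contamination",
                  pattern=(), df=None, mu=0.0, seed=None):
    """Draw one simulated meta-analysis; returns (y, s)."""
    rng = random.Random(seed)
    # Within-study standard errors s_i ~ U(s_min, s_max).
    s = [rng.uniform(s_min, s_max) for _ in range(n)]

    if scenario == "contamination":
        # Study-specific effects mu_i ~ N(mu, tau^2).
        mu_i = [rng.gauss(mu, math.sqrt(tau2)) for _ in range(n)]
    elif scenario == "heavy_tail":
        # z_i ~ t_df built as N(0, 1) / sqrt(chi2_df / df); rescaling by
        # sqrt((df - 2) / df) gives Var[mu_i] = 1 (tau2 is ignored here).
        scale = math.sqrt((df - 2) / df)
        mu_i = []
        for _ in range(n):
            chi2 = rng.gammavariate(df / 2.0, 2.0)
            z = rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)
            mu_i.append(mu + z * scale)
    else:
        raise ValueError("unknown scenario")

    # Observed effect sizes y_i ~ N(mu_i, s_i^2).
    y = [rng.gauss(m, si) for m, si in zip(mu_i, s)]

    if scenario == "contamination":
        # Outlying discrepancy C = 3 * sqrt(s_max^2 + tau^2), added with the
        # signs in `pattern` to the first len(pattern) studies (an assumption).
        c = 3.0 * math.sqrt(s_max ** 2 + tau2)
        for k, sign in enumerate(pattern):
            y[k] += sign * c
    return y, s


# Pattern (C, C, -C) under homogeneity with s_i ~ U(0.5, 1).
y, s = simulate_meta(30, 0.0, 0.5, 1.0, pattern=(1, 1, -1), seed=1)
```

Passing `pattern=()` gives the no-outlier setting, and `scenario="heavy_tail"` with `df=3`, `5`, or `10` gives Scenario II.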

Table 1 presents some results for n = 30, including statistical sizes (type I error rates) and powers of the statistics Q, Q_r, and Q_m for testing H₀: τ² = 0 vs. H_A: τ² > 0, and the root mean squared errors (RMSEs) and coverage probabilities of 95% CIs of ${\hat{τ}}_{DL}^{2}$ , ${\hat{τ}}_{r}^{2}$ , and ${\hat{τ}}_{m}^{2}$ . Web Appendix D contains complete simulation results. When the studies are homogeneous, each of the three test statistics controls type I error rate well if no outliers are present. Also, the RMSEs of the three estimators of τ² are close and their coverage probabilities are fairly high. However, when outliers appear, the type I error rate of Q inflates dramatically compared to Q_r and Q_m. The RMSE of ${\hat{τ}}_{DL}^{2}$ becomes larger than those of ${\hat{τ}}_{r}^{2}$ and ${\hat{τ}}_{m}^{2}$ ; also, the coverage probability of ${\hat{τ}}_{DL}^{2}$ is lower, especially when m = 3. As the number of outliers increases, the weighted-median-based ${\hat{τ}}_{m}^{2}$ has smaller RMSE and its 95% CI has higher coverage probability than the weighted-mean-based ${\hat{τ}}_{r}^{2}$ .
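For reference, the conventional quantities compared in Table 1 can be computed as in the sketch below. It uses the standard textbook forms of Cochran's Q, the DerSimonian–Laird estimator, and I², which are presumed to match the definitions in Section 2.1; the function name `heterogeneity_summary` is hypothetical. The proposed Q_r and Q_m are not reproduced here because their definitions lie outside this section.

```python
def heterogeneity_summary(y, s):
    """Conventional Q, DerSimonian-Laird tau^2, and I^2 for effects y with SEs s."""
    w = [1.0 / si ** 2 for si in s]            # inverse-variance weights
    sw = sum(w)
    mu_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # fixed-effect mean
    q = sum(wi * (yi - mu_bar) ** 2 for wi, yi in zip(w, y))
    k = len(y) - 1                             # degrees of freedom, n - 1
    denom = sw - sum(wi ** 2 for wi in w) / sw
    tau2_dl = max(0.0, (q - k) / denom)        # truncated at zero
    i2 = max(0.0, (q - k) / q) if q > 0 else 0.0
    return {"Q": q, "tau2_DL": tau2_dl, "I2": i2}


# Two studies with unit standard errors and effects 0 and 2.
print(heterogeneity_summary([0.0, 2.0], [1.0, 1.0]))
```

In this two-study example the fixed-effect mean is 1, so Q = 2, the DerSimonian–Laird estimate of τ² is 1, and I² = 0.5.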

Table 1.

Some simulation results for meta-analyses containing 30 studies.

| Outlier pattern | Size/power^†: Q^‡ | Size/power^†: Q_r | Size/power^†: Q_m | RMSE: ${\hat{τ}}_{DL}^{2}$ | RMSE: ${\hat{τ}}_{r}^{2}$ | RMSE: ${\hat{τ}}_{m}^{2}$ | CP (%): ${\hat{τ}}_{DL}^{2}$ | CP (%): ${\hat{τ}}_{r}^{2}$ | CP (%): ${\hat{τ}}_{m}^{2}$ |
|---|---|---|---|---|---|---|---|---|---|
| Scenario I (contamination) with τ² = 0 (homogeneity) and s_i ~ U(0.5, 1): | | | | | | | | | |
| No outliers | 0.05 (0.06) | 0.05 | | 0.10 | 0.12 | 0.10 | | | |
| C | 0.55 (0.55) | 0.27 | 0.25 | 0.37 | 0.24 | 0.20 | | | |
| (C, C) | 0.89 (0.89) | 0.66 | 0.60 | 0.63 | 0.42 | 0.35 | | | |
| (C, −C) | 0.92 (0.92) | 0.61 | | 0.68 | 0.40 | 0.36 | | | |
| (C, C, C) | 0.98 (0.98) | 0.90 | 0.87 | 0.88 | 0.64 | 0.53 | | | |
| (C, C, −C) | 0.99 (0.98) | 0.89 | 0.88 | 0.99 | 0.61 | 0.55 | | | |
| Scenario I (contamination) with τ² = 1 (heterogeneity) and s_i ~ U(0.5, 1): | | | | | | | | | |
| No outliers | 0.98 (0.99) | 0.98 | | 0.40 | 0.43 | 0.41 | | | |
| C | 1.00 (1.00) | 1.00 | | 0.84 | 0.63 | 0.55 | | | |
| (C, C) | 1.00 (1.00) | 1.00 | | 1.37 | 1.00 | 0.85 | | | |
| (C, −C) | 1.00 (1.00) | 1.00 | | 1.45 | 0.97 | 0.85 | | | |
| (C, C, C) | 1.00 (1.00) | 1.00 | | 1.86 | 1.44 | 1.22 | | | |
| (C, C, −C) | 1.00 (1.00) | 1.00 | | 2.05 | 1.40 | 1.25 | | | |
| Scenario I (contamination) with τ² = 1 (heterogeneity) and s_i ~ U(1, 2): | | | | | | | | | |
| No outliers | 0.48 (0.49) | 0.42 | 0.43 | 0.74 | 0.81 | 0.75 | | | |
| C | 0.89 (0.89) | 0.78 | 0.77 | 1.97 | 1.36 | 1.17 | | | |
| (C, C) | 0.99 (0.99) | 0.94 | | 3.33 | 2.29 | 1.93 | | | |
| (C, −C) | 0.99 (0.99) | 0.94 | | 3.50 | 2.17 | 1.93 | | | |
| (C, C, C) | 1.00 (1.00) | 0.99 | | 4.60 | 3.41 | 2.85 | | | |
| (C, C, −C) | 1.00 (1.00) | 0.99 | | 5.03 | 3.24 | 2.90 | | | |
| Scenario II (heavy tail) with τ² = 1 (heterogeneity) and s_i ~ U(0.5, 1): | | | | | | | | | |
| df = 3 | 0.92 (0.92) | 0.89 | 0.88 | 1.45 | 0.59 | 0.56 | | | |
| df = 5 | 0.98 (0.98) | 0.95 | | 0.55 | | 0.45 | | | |
| df = 10 | 0.98 (0.98) | 0.97 | | 0.43 | | 0.42 | | | |
| Scenario II (heavy tail) with τ² = 1 (heterogeneity) and s_i ~ U(1, 2): | | | | | | | | | |
| df = 3 | 0.41 (0.40) | 0.35 | | 1.53 | 0.88 | 0.82 | | | |
| df = 5 | 0.46 (0.46) | 0.40 | | 0.82 | | 0.77 | | | |
| df = 10 | 0.48 (0.49) | 0.42 | | 0.76 | 0.82 | 0.77 | | | |

RMSE: root mean squared error; CP: coverage probability of 95% confidence interval.

^† Size (type I error rate) for homogeneous studies (τ² = 0) and power for heterogeneous studies (τ² > 0) at the significance level α = 0.05.

^‡ The sizes/powers outside the parentheses are produced by the resampling method; those inside the parentheses are obtained using Q’s theoretical distribution under the null hypothesis.

For heterogeneous studies, the conventional Q statistic is more powerful than Q_r or Q_m, but the differences are not large; this is expected because Q sacrifices type I error in the presence of outliers. In spite of this disadvantage of Q_r and Q_m, the proposed estimators of τ² still perform better than the conventional ${\hat{τ}}_{DL}^{2}$ in both Scenarios I and II.

Figure 2 compares the impact of a single outlier in Scenario I with m = 1 on the heterogeneity measures I², $I_{r}^{2}$ , and $I_{m}^{2}$ . As expected, these heterogeneity measures generally increase due to the outlying study, so their changes are mostly greater than 0. However, for both homogeneous and heterogeneous studies, the changes of $I_{r}^{2}$ and $I_{m}^{2}$ are generally smaller than the changes of I², indicating that the proposed measures are indeed less affected by outliers than the conventional I².

Figure 2. Scatter plots of the changes of $I_{r}^{2}$ and $I_{m}^{2}$ due to an outlier against the changes of I². For the upper panels, τ² = 0 (homogeneous studies) and s_i ~ U(0.5, 1); for the lower panels, τ² = 1 (heterogeneous studies) and s_i ~ U(1, 2). The left panels compare $I_{r}^{2}$ with I²; the right panels compare $I_{m}^{2}$ with I².
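The kind of comparison summarized in Figure 2 can be illustrated, for the conventional I² only, with the small Monte Carlo sketch below. It is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the outlier is always placed on the first study, and $I_{r}^{2}$ and $I_{m}^{2}$ are omitted because their definitions lie outside this section.

```python
import math
import random


def i_squared(y, s):
    """Conventional I^2 = max(0, (Q - (n - 1)) / Q) with inverse-variance weights."""
    w = [1.0 / si ** 2 for si in s]
    mu_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_bar) ** 2 for wi, yi in zip(w, y))
    return max(0.0, (q - (len(y) - 1)) / q) if q > 0 else 0.0


def mean_change_in_i2(reps=200, n=30, tau2=0.0, s_min=0.5, s_max=1.0, seed=0):
    """Average change in I^2 when C is added to one study (Scenario I, m = 1)."""
    rng = random.Random(seed)
    c = 3.0 * math.sqrt(s_max ** 2 + tau2)     # outlying discrepancy C
    total = 0.0
    for _ in range(reps):
        s = [rng.uniform(s_min, s_max) for _ in range(n)]
        # mu_i ~ N(0, tau^2), then y_i ~ N(mu_i, s_i^2).
        y = [rng.gauss(rng.gauss(0.0, math.sqrt(tau2)), si) for si in s]
        y_out = [y[0] + c] + y[1:]             # contaminate the first study
        total += i_squared(y_out, s) - i_squared(y, s)
    return total / reps


print(mean_change_in_i2())                                 # homogeneous setting
print(mean_change_in_i2(tau2=1.0, s_min=1.0, s_max=2.0))   # heterogeneous setting
```

Individual replicates can show a decrease, but the average change is expected to be positive in both settings, in line with the description above.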

6. Two case studies

6.1 Homogeneous studies with outliers

Ismail et al. (2012) reported a meta-analysis consisting of 29 studies to evaluate the effect of aerobic exercise (AEx) on visceral adipose tissue (VAT) content/volume in overweight and obese adults, compared to control treatment. Figure 3(a) shows the forest plot with the observed effect sizes and their within-study 95% CIs; studies 1, 3, 19, and 29 seem to be outlying. If these four studies are removed, the remaining studies are much more homogeneous. Figure 3(b) presents the standardized residuals using both the fixed-effect and random-effects approaches described in Section 2.2. Studies 1, 19, and 29 have standardized residuals (under the fixed-effect setting) greater than 3 in absolute magnitude; hence, they may be considered outliers. We conducted sensitivity analysis by removing the following studies: (i) 1; (ii) 19; (iii) 29; (iv) 1 and 19; (v) 1 and 29; (vi) 19 and 29; and (vii) 1, 19, and 29.

Table 2 presents the results for the original meta-analysis and for alternate meta-analyses removing certain outlying studies. For the original meta-analysis, $I_{r}^{2} = 0.44$ and $I_{m}^{2} = 0.45$ , compared to I² = 0.59. Also, ${\hat{τ}}_{r}$ and ${\hat{τ}}_{m}$ are smaller than ${\hat{τ}}_{DL}$ . To test H₀: τ² = 0 vs. H_A: τ² > 0, the p-value of the Q statistic is smaller than 0.001, and those of the Q_r and Q_m statistics are 0.013 and 0.006, respectively. When study 29 is removed, the Q statistic is still significant (p-value = 0.008), while the p-values of the Q_r and Q_m statistics are larger than the commonly used significance level α = 0.05. After removing all three outlying studies, the p-values of the three test statistics are much larger than 0.05; also, $I_{r}^{2} = I_{m}^{2} = 0$ and I² = 0.11. Hence, the heterogeneity present in the original meta-analysis is mainly caused by the few outliers. Note that $I_{r}^{2}$ and $I_{m}^{2}$ are still noticeably smaller than I² after removing the three identified outliers. This may be because some studies other than studies 1, 19, and 29 are potentially outlying. Figure 3(b) shows that the absolute values of the standardized residuals of studies 3 and 28 are fairly close to 3. Although some outliers may not be clearly detected, $I_{r}^{2}$ and $I_{m}^{2}$ automatically reduce their impact without removing them.
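The leave-out sensitivity analysis described above, in which every subset of the flagged studies is removed in turn, can be sketched generically as follows. The data, the function names, and the use of the conventional I² as the recomputed measure are assumptions for illustration; they are not the Ismail et al. (2012) data or the authors' implementation.

```python
from itertools import combinations


def i_squared(y, s):
    """Conventional I^2 with inverse-variance weights."""
    w = [1.0 / si ** 2 for si in s]
    mu_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - mu_bar) ** 2 for wi, yi in zip(w, y))
    return max(0.0, (q - (len(y) - 1)) / q) if q > 0 else 0.0


def leave_out_sensitivity(y, s, flagged):
    """Recompute I^2 after removing each subset of the flagged studies.

    `flagged` holds 1-based study numbers, matching the numbering in the text.
    """
    results = {}
    for r in range(len(flagged) + 1):
        for removed in combinations(flagged, r):
            keep = [i for i in range(len(y)) if (i + 1) not in removed]
            results[removed] = i_squared([y[i] for i in keep],
                                         [s[i] for i in keep])
    return results


# Hypothetical data: four similar studies and one outlying study (number 5).
y = [0.0, 0.1, -0.1, 0.05, 2.0]
s = [0.2] * 5
for removed, i2 in leave_out_sensitivity(y, s, flagged=(5,)).items():
    print(removed, round(i2, 2))
```

With three flagged studies the loop visits the empty set plus the seven subsets (i) to (vii) listed above.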

Table 2.

Summary results for two actual meta-analyses.

| Removed studies | p-value of testing H₀: τ² = 0, Q^† | p-value, Q_r | p-value, Q_m | Estimated τ (95% CI): ${\hat{τ}}_{DL}$ | ${\hat{τ}}_{r}$ | ${\hat{τ}}_{m}$ | Heterogeneity measure (95% CI): I² | $I_{r}^{2}$ | $I_{m}^{2}$ |
|---|---|---|---|---|---|---|---|---|---|
| Meta-analysis in Ismail et al. (2012): | | | | | | | | | |
| None (Original) | < 0.001 (< 0.001) | 0.013 | 0.006 | 0.39 (0, 0.62) | 0.29 (0, 0.58) | 0.30 (0, 0.56) | 0.59 (0, 0.76) | 0.44 (0, 0.73) | 0.45 (0, 0.72) |
| 1 | < 0.001 (< 0.001) | 0.047 | 0.030 | 0.35 (0, 0.58) | 0.24 (0, 0.52) | 0.24 (0, 0.51) | 0.55 (0, 0.75) | 0.36 (0, 0.69) | |
| 19 | < 0.001 (< 0.001) | 0.048 | 0.031 | 0.34 (0, 0.58) | 0.24 (0, 0.52) | 0.24 (0, 0.51) | 0.54 (0, 0.75) | 0.36 (0, 0.69) | 0.36 (0, 0.68) |
| 29 | 0.008 (0.007) | 0.100 | 0.070 | 0.28 (0, 0.46) | 0.21 (0, 0.44) | 0.21 (0, 0.43) | 0.44 (0, 0.66) | 0.29 (0, 0.63) | 0.30 (0, 0.62) |
| 1 and 19 | 0.003 (0.004) | 0.154 | 0.121 | 0.29 (0, 0.54) | 0.18 (0, 0.45) | 0.18 (0, 0.44) | 0.47 (0, 0.73) | 0.25 (0, 0.64) | 0.24 (0, 0.63) |
| 1 and 29 | 0.052 (0.052) | 0.272 | 0.223 | 0.22 (0, 0.40) | 0.14 (0, 0.37) | 0.13 (0, 0.36) | 0.33 (0, 0.60) | 0.16 (0, 0.56) | 0.15 (0, 0.55) |
| 19 and 29 | 0.057 (0.057) | 0.278 | 0.232 | 0.21 (0, 0.40) | 0.13 (0, 0.38) | 0.13 (0, 0.37) | 0.32 (0, 0.60) | 0.15 (0, 0.56) | 0.14 (0, 0.55) |
| 1, 19 and 29 | 0.302 (0.298) | 0.547 | 0.504 | 0.11 (0, 0.30) | 0 (0, 0.29) | 0 (0, 0.27) | 0.11 (0, 0.47) | 0 (0, 0.46) | 0 (0, 0.42) |
| Meta-analysis in Haentjens et al. (2010): | | | | | | | | | |
| None (Original) | < 0.001 (< 0.001) | < 0.001 | < 0.001 | 0.16 (0.02, 0.34) | 0.15 (0, 0.37) | 0.08 (0, 0.36) | 0.74 (0.15, 0.86) | 0.66 (0, 0.85) | 0.63 (0, 0.85) |
| 9 | < 0.001 (< 0.001) | 0.006 | | 0.16 (0, 0.37) | 0.13 (0, 0.42) | 0.06 (0, 0.37) | 0.68 (0, 0.84) | 0.56 (0, 0.83) | 0.52 (0, 0.81) |
| 17 | 0.001 (0.001) | 0.013 | 0.015 | 0.11 (0, 0.23) | 0.11 (0, 0.27) | 0.05 (0, 0.27) | 0.60 (0, 0.76) | 0.52 (0, 0.77) | 0.47 (0, 0.76) |
| 9 and 17 | 0.062 (0.059) | 0.156 | 0.144 | 0.09 (0, 0.24) | 0.07 (0, 0.27) | 0.02 (0, 0.25) | 0.39 (0, 0.65) | 0.28 (0, 0.67) | 0.23 (0, 0.65) |

^† The p-values outside the parentheses are produced by the resampling method; the p-values inside the parentheses are calculated using Q’s theoretical distribution under the null hypothesis.

6.2 Heterogeneous studies with outliers

Haentjens et al. (2010) investigated the magnitude and duration of excess mortality after hip fracture among older men by performing a meta-analysis consisting of 17 studies. Figure 3(c) shows the forest plot with the observed effect sizes (log hazard ratios) and their 95% within-study CIs. The forest plot indicates that the collected studies tend to be heterogeneous. Despite this, we used both the fixed-effect and random-effects diagnostic procedures in Section 2.2 to detect potential outliers. Figure 3(d) shows the study-specific standardized residuals, indicating that study 17 is apparently outlying. Although study 9’s standardized residual is smaller than 2 in absolute magnitude when using the random-effects approach, its standardized residual under the fixed-effect setting is fairly large. To take all potential outliers into account, we conducted sensitivity analysis by removing the following studies: (i) 9; (ii) 17; and (iii) 9 and 17.

The results are in Table 2. For the original meta-analysis, the p-values of all three test statistics are smaller than 0.001, rejecting the null hypothesis of homogeneity. Also, I² = 0.74, $I_{r}^{2} = 0.66$ , and $I_{m}^{2} = 0.63$ , indicating substantial heterogeneity. If study 9 is removed, the results change little, implying that this study is not influential. If study 17 is removed, the p-values of the test statistics change noticeably; also, each of I², $I_{r}^{2}$ , and $I_{m}^{2}$ is reduced by more than 0.10. The three heterogeneity measures are still fairly high (larger than or close to 0.5); therefore, meta-analysts should continue to pay attention to the heterogeneity of the remaining studies.

7. Discussion

This paper proposed several alternative measures of heterogeneity in meta-analysis. Large-sample properties and finite-sample studies showed that the new measures are robust to outliers compared with conventional measures. Since outliers frequently appear in meta-analysis and may not simply be removed without sound evidence, the proposed robust measures can provide useful information describing heterogeneity. The robustness of the new approaches mainly arises from using the absolute deviations in the Q_r and Q_m statistics; Q_r summarizes the deviations using the weighted average, and Q_m summarizes the deviations using the weighted median. Note that the number of studies is assumed to be large in deriving ${\hat{τ}}_{m}^{2}$ , H_m, and $I_{m}^{2}$ . However, many meta-analyses may only collect a few studies (Davey et al., 2011); these three measures need to be used with caution for small meta-analyses.
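The difference between summarizing with a weighted average and with a weighted median can be seen in a toy example. The sketch below is generic and not the authors' implementation: the function names are hypothetical, the inverse-variance weights 1/s_i² are an assumption, and the paper's exact weighted-median definition is the one given in Section 3.2.

```python
def weighted_mean(y, w):
    """Weighted average of y with weights w."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def weighted_median(y, w):
    """Smallest y value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(y, w))
    half = sum(w) / 2.0
    cum = 0.0
    for value, weight in pairs:
        cum += weight
        if cum >= half:
            return value


# Hypothetical effect sizes with one outlying study (5.0) and equal SEs.
y = [0.1, -0.2, 0.0, 0.3, 5.0]
s = [0.5] * 5
w = [1.0 / si ** 2 for si in s]   # inverse-variance weights (assumption)
print(weighted_mean(y, w), weighted_median(y, w))
```

Here the outlier pulls the weighted mean to about 1.04, while the weighted median stays at 0.1, which is the intuition behind the weighted-median-based measures being less sensitive to outlying studies.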

When study-level covariates are collected in meta-analysis, meta-regression is widely applied to investigate whether study characteristics explain heterogeneity (Higgins and Thompson, 2004). To improve robustness to outliers, instead of performing least squares regression, researchers may consider least absolute deviations regression (Portnoy and Koenker, 1997), which is related to the heterogeneity measures proposed in this article.

Heterogeneity measures are customarily used to select a fixed-effect or random-effects model, but both models have limitations in certain situations. Some researchers believe that heterogeneity is to be expected in any meta-analysis because the collected studies were performed by different teams in different places using different methods (Higgins, 2008). Also, the fixed-effect model produces confidence intervals with poor coverage probability when the collected studies have different true effect sizes (Hedges and Vevea, 1998), so some researchers recommend routinely using the random-effects model to yield conservative results (Chalmers, 1991). However, the random-effects model is not always better than the fixed-effect model, especially in the presence of publication bias (Poole and Greenland, 1999; Henmi and Copas, 2010; Stanley and Doucouliagos, 2015). Besides robustly assessing heterogeneity, alternative approaches to robustly estimating the overall effect size in the presence of outliers remain to be studied.

The R code for the proposed methods is organized in the package altmeta and is available at http://cran.r-project.org/package=altmeta.

Acknowledgments

This research was supported in part by NIAID R21 AI103012 (HC, LL), NIDCR R03 DE024750 (HC), NLM R21 LM012197 (HC), and NIDDK U01 DK106786 (HC).

Footnotes

Supplementary Materials

Web Appendix A referenced in Section 3.3, Web Appendix B referenced in Section 4.1, Web Appendix C referenced in Section 4.2, and Web Appendix D referenced in Section 5 are available with this paper at the Biometrics website on Wiley Online Library.

References

Barnett V, Lewis T. Outliers in Statistical Data. 3rd ed. John Wiley & Sons; New York, NY: 1994.
Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. A basic introduction to fixed-effect and random-effects models for meta-analysis. Research Synthesis Methods. 2010;1:97–111. doi: 10.1002/jrsm.12.
Chalmers TC. Problems induced by meta-analyses. Statistics in Medicine. 1991;10:971–980. doi: 10.1002/sim.4780100618.
Cochran WG. The combination of estimates from different experiments. Biometrics. 1954;10:101–129.
Davey J, Turner RM, Clarke MJ, Higgins JPT. Characteristics of meta-analyses and their component studies in the Cochrane Database of Systematic Reviews: a cross-sectional, descriptive analysis. BMC Medical Research Methodology. 2011;11:160. doi: 10.1186/1471-2288-11-160.
DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials. 1986;7:177–188. doi: 10.1016/0197-2456(86)90046-2.
Gumedze FN, Jackson D. A random effects variance shift model for detecting and accommodating outliers in meta-analysis. BMC Medical Research Methodology. 2011;11:19. doi: 10.1186/1471-2288-11-19.
Haentjens P, Magaziner J, Colón-Emeric CS, Vanderschueren D, Milisen K, Velkeniers B, Boonen S. Meta-analysis: excess mortality after hip fracture among older women and men. Annals of Internal Medicine. 2010;152:380–390. doi: 10.1059/0003-4819-152-6-201003160-00008.
Hardy RJ, Thompson SG. Detecting and describing heterogeneity in meta-analysis. Statistics in Medicine. 1998;17:841–856. doi: 10.1002/(sici)1097-0258(19980430)17:8<841::aid-sim781>3.0.co;2-d.
Hedges LV, Olkin I. Statistical Methods for Meta-Analysis. Academic Press; Orlando, FL: 1985.
Hedges LV, Vevea JL. Fixed- and random-effects models in meta-analysis. Psychological Methods. 1998;3:486–504.
Henmi M, Copas JB. Confidence intervals for random effects meta-analysis and robustness to publication bias. Statistics in Medicine. 2010;29:2969–2983. doi: 10.1002/sim.4029.
Higgins JPT. Commentary: Heterogeneity in meta-analysis should be expected and appropriately quantified. International Journal of Epidemiology. 2008;37:1158–1160. doi: 10.1093/ije/dyn204.
Higgins JPT, Green S. Cochrane Handbook for Systematic Reviews of Interventions. John Wiley & Sons; Chichester, UK: 2008.
Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Statistics in Medicine. 2002;21:1539–1558. doi: 10.1002/sim.1186.
Higgins JPT, Thompson SG. Controlling the risk of spurious findings from meta-regression. Statistics in Medicine. 2004;23:1663–1682. doi: 10.1002/sim.1752.
Horowitz JL. Bootstrap methods for median regression models. Econometrica. 1998;66:1327–1351.
Huber PJ, Ronchetti EM. Robust Statistics. 2nd ed. John Wiley & Sons; Hoboken, NJ: 2009.
Hunter JE, Schmidt FL. Cumulative research knowledge and social policy formulation: the critical role of meta-analysis. Psychology, Public Policy, and Law. 1996;2:324–347.
Ioannidis JPA, Patsopoulos NA, Evangelou E. Uncertainty in heterogeneity estimates in meta-analyses. BMJ. 2007;335:914. doi: 10.1136/bmj.39343.408449.80.
Ismail I, Keating SE, Baker MK, Johnson NA. A systematic review and meta-analysis of the effect of aerobic vs. resistance exercise training on visceral fat. Obesity Reviews. 2012;13:68–91. doi: 10.1111/j.1467-789X.2011.00931.x.
Jackson D. The power of the standard test for the presence of heterogeneity in meta-analysis. Statistics in Medicine. 2006;25:2688–2699. doi: 10.1002/sim.2481.
Nüesch E, Trelle S, Reichenbach S, Rutjes AWS, Tschannen B, Altman DG, Egger M, Jüni P. Small study effects in meta-analyses of osteoarthritis trials: meta-epidemiological study. BMJ. 2010;341:c3515. doi: 10.1136/bmj.c3515.
Poole C, Greenland S. Random-effects meta-analyses are not always conservative. American Journal of Epidemiology. 1999;150:469–475. doi: 10.1093/oxfordjournals.aje.a010035.
Portnoy S, Koenker R. The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators (with discussion). Statistical Science. 1997;12:279–300.
Prospective Studies Collaboration. Age-specific relevance of usual blood pressure to vascular mortality: a meta-analysis of individual data for one million adults in 61 prospective studies. The Lancet. 2002;360:1903–1913. doi: 10.1016/s0140-6736(02)11911-8.
Riley RD, Higgins JPT, Deeks JJ. Interpretation of random effects meta-analyses. BMJ. 2011;342:d549. doi: 10.1136/bmj.d549.
Stanley TD, Doucouliagos H. Neither fixed nor random: weighted least squares meta-analysis. Statistics in Medicine. 2015;34:2116–2127. doi: 10.1002/sim.6481.
Viechtbauer W, Cheung MWL. Outlier and influence diagnostics for meta-analysis. Research Synthesis Methods. 2010;1:112–125. doi: 10.1002/jrsm.11.
Whitehead A, Whitehead J. A general parametric approach to the meta-analysis of randomized clinical trials. Statistics in Medicine. 1991;10:1665–1677. doi: 10.1002/sim.4780101105.
