BMJ. 2007 Nov 3;335(7626):914–916. doi: 10.1136/bmj.39343.408449.80

Uncertainty in heterogeneity estimates in meta-analyses

John P A Ioannidis 1, Nikolaos A Patsopoulos 1, Evangelos Evangelou 1
PMCID: PMC2048840  PMID: 17974687

Abstract

John Ioannidis, Nikolaos Patsopoulos, and Evangelos Evangelou argue that, although meta-analyses often measure heterogeneity between studies, these estimates can have large uncertainty, which must be taken into account when interpreting evidence


Summary points

  • The extent of between study heterogeneity should be measured when interpreting results of meta-analyses

  • Meta-analyses rarely document uncertainty in estimates of heterogeneity

  • Our evaluation of a large number of meta-analyses shows a wide range of uncertainty about the extent of heterogeneity in most

  • Confidence intervals of I2 should be calculated and considered when interpreting meta-analyses

An important aim of systematic reviews and meta-analyses is to assess the extent to which different studies give similar or dissimilar results.1 Clinical, methodological, and biological heterogeneity are often topic specific, but statistical heterogeneity can be examined with the same methods in all meta-analyses. Therefore, the perception of statistical heterogeneity or homogeneity often influences meta-analysts and clinicians in important decisions. These decisions include whether the data are similar enough to combine different studies; whether a treatment is applicable to all or should be “individualised” because of variable benefits or harms in different types of patients; and whether a risk factor affects all people exposed or only select populations. How uncertain is the extent of statistical heterogeneity in meta-analyses? Moreover, is this uncertainty properly factored in when interpreting the results?

Evaluating heterogeneity between studies

Many statistical tests are available for evaluating heterogeneity between studies.2 3 Until recently, the most popular was Cochran's Q, a statistic based on the χ2 test.4 Cochran's Q usually has only low power to detect heterogeneity, however. It also depends on the number of studies and cannot be compared across different meta-analyses.2 3 Higgins and colleagues, in two highly cited papers,5 6 proposed the routine use of the I2 statistic. I2 is calculated as [(Q−df)/Q]×100%, where df is degrees of freedom (number of studies minus 1), with negative values set to 0%. Values of I2 therefore range from 0% to 100%, and the statistic describes the proportion of the total variation across studies that is beyond chance. This statistic can be used to compare the amount of inconsistency across different meta-analyses even with different numbers of studies.7 I2 is routinely implemented in all Cochrane reviews (standard option in RevMan) and is increasingly used in meta-analyses published in medical journals.
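As a rough illustration of the formula above, the following Python sketch computes Cochran's Q and I2 from study effect estimates and their standard errors, using inverse variance weights. The function name and the input values are hypothetical and chosen only for this example; they are not taken from any of the meta-analyses discussed in this article.

```python
def cochran_q_and_i2(effects, std_errors):
    """Return (Q, df, I2 in percent) for one meta-analysis.

    effects    -- study effect estimates on an additive scale (e.g. log odds ratios)
    std_errors -- the corresponding standard errors
    """
    if len(effects) != len(std_errors) or len(effects) < 2:
        raise ValueError("need at least two studies with matching inputs")
    # Inverse variance weights, as in a fixed effect summary.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q is the weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I2 = (Q - df) / Q, with negative values set to 0%.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i2


# Hypothetical log odds ratios and standard errors for four studies.
log_or = [0.10, 0.30, -0.20, 0.50]
se = [0.20, 0.25, 0.30, 0.20]
q, df, i2 = cochran_q_and_i2(log_or, se)
print(f"Q = {q:.2f} on {df} df, I2 = {i2:.0f}%")
```

Because Q grows with the number of studies while I2 rescales it by df, two meta-analyses with very different Q values can have similar I2.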

Higgins and colleagues suggested that we could “tentatively assign adjectives of low, moderate, and high to I2 values of 25%, 50%, and 75%.”6 Like any metric, however, I2 has some uncertainty, and Higgins and Thompson provided methods to calculate this uncertainty.5 Recently, other investigators compared the performance of I2 and Q in Monte-Carlo simulations across diverse simulated meta-analytic conditions. They found that I2 also has low statistical power with small numbers of studies and its confidence intervals can be large.8

Interpreting heterogeneity in selected meta-analyses

Inferences about the extent of heterogeneity must be especially cautious when the 95% confidence intervals around I2 are wide, ranging from low to high heterogeneity. Such uncertainty is usually ignored in systematic reviews, however. This can result in misconceptions. For example, a systematic review of corticosteroids for Kawasaki disease found a point estimate I2=59%.9 The authors decided to exclude the two studies that were most different, saying that their removal eliminated all of the across study heterogeneity (Q=5.59, P=0.588, I2=0.00). In fact, the 95% confidence interval for this I2=0% estimate still extends from 0% to 56%. With two small randomised trials and six non-randomised comparisons remaining, the meta-analysis concluded that corticosteroids consistently halve the risk of coronary aneurysms. However, the two largest randomised trials on this topic were published after the meta-analysis. Heterogeneity resurfaced: the largest trial found no effect on coronary dimensions,10 while the other trial showed an 80% reduction in the risk of coronary artery abnormalities.11

Eight systematic reviews published in the BMJ between 1 July 2005 and 1 January 2006 performed meta-analyses of randomised trials and seven of them performed some statistical analysis of heterogeneity between studies (table on bmj.com).12 13 14 15 16 17 18 Each review stated that they had tried to interpret heterogeneity, and seven meta-analyses provided enough information for us to calculate the 95% confidence interval of I2. The lower 95% confidence interval was always as low as 0% (rounded to integer percentage), with one exception. The upper 95% confidence interval always exceeded the 50% threshold, and in four cases it also exceeded the 75% threshold. A conclusive statement was feasible in only one case, where I2 was 69%, the 95% confidence interval was 40% to 80%, the Q statistic had P<0.001, and the authors justifiably concluded that “there was significant heterogeneity among these trials.”13 This meta-analysis had 15 studies, so the power of both Q and I2 was good. In all other meta-analyses (two to 12 studies each), strong statements in interpreting heterogeneity would be difficult to make. Only one review presented 95% confidence intervals for an I2 estimate.12 The authors concluded that “we could not observe significant heterogeneity.” Indeed the Q statistic had P=0.19. However, with only five studies, the power to detect heterogeneity was negligible. The I2 statistic was 35% and the 95% confidence interval ranged from 0% (no heterogeneity) to 76% (high heterogeneity).
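The figures quoted in the two examples above can be cross-checked with a short calculation. Rearranging I2=(Q−df)/Q gives Q=df/(1−I2), and the P value for Q is the upper tail of a χ2 distribution with df degrees of freedom. The Python sketch below is only an illustration: the function names are hypothetical, the Q for the five study review is recovered from a rounded I2 and is therefore approximate, and the count of eight studies in the Kawasaki example is an assumption based on the two randomised trials and six non-randomised comparisons mentioned in the text.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X > x) for a chi-squared variable with df degrees of freedom.

    Uses a series for the regularised lower incomplete gamma function;
    adequate for the moderate Q values seen in small meta-analyses.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term, total, n = 1.0, 1.0, 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    return max(0.0, 1.0 - lower)


def q_from_i2(i2_percent, n_studies):
    """Recover Q from a reported I2 (must be above 0% and below 100%)."""
    if not 0.0 < i2_percent < 100.0:
        raise ValueError("Q cannot be recovered when I2 is 0% or 100%")
    df = n_studies - 1
    return df / (1.0 - i2_percent / 100.0)


# Five study review: reported I2 = 35%.
q_five = q_from_i2(35.0, 5)
p_five = chi2_sf(q_five, 4)

# Kawasaki example after exclusions: reported Q = 5.59, assumed 8 studies.
p_kawasaki = chi2_sf(5.59, 7)

print(f"five studies: Q about {q_five:.2f}, P about {p_five:.2f}")
print(f"Kawasaki: P about {p_kawasaki:.3f}")
```

Under these assumptions both P values come out close to those reported (0.19 and 0.588), which is a reminder that a non-significant Q is compatible with an I2 point estimate anywhere from 0% to 35% when few studies are available.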

Uncertainty in I2: large scale survey of meta-analyses

This limitation is not confined to the selected examples presented here—it is probably the rule rather than the exception. We used two large datasets of meta-analyses to evaluate empirically the extent of uncertainty in I2 estimates. Firstly, we looked at meta-analyses of the Cochrane Database of Systematic Reviews (Issue 4, 2005) that had four or more synthesised studies and binary outcomes. Because each Cochrane review may include several meta-analyses, we looked only at the one with the highest number of studies; in the case of ties, we used the one with the largest sample size. We did not look at meta-analyses of two or three studies. Such studies form a sizeable proportion of the Cochrane Library,19 but their 95% confidence intervals of I2 almost always span a wide range of heterogeneity, unless the studies are large and they give very different results. In total, we calculated the I2 statistic and its 95% confidence intervals for 1011 meta-analyses. The second dataset was a previously described database of 50 meta-analyses of gene-disease associations that had found a nominally statistically significant effect (P<0.05) for the proposed genetic risk factors.20
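The selection rule described above (one meta-analysis per Cochrane review, the one with most studies, ties broken by total sample size) can be sketched in Python as follows. The record structure, field names, and example data are hypothetical, and applying the eligibility filter before picking the largest meta-analysis is an assumption about the order of the steps.

```python
from dataclasses import dataclass


@dataclass
class MetaAnalysis:
    review_id: str          # identifier of the parent review (hypothetical field)
    n_studies: int          # number of synthesised studies
    total_sample_size: int
    binary_outcome: bool


def select_per_review(meta_analyses, min_studies=4):
    """Keep one eligible meta-analysis per review.

    Eligible: binary outcome and at least min_studies studies.
    Preference: most studies, then largest total sample size.
    """
    best = {}
    for ma in meta_analyses:
        if not ma.binary_outcome or ma.n_studies < min_studies:
            continue
        key = (ma.n_studies, ma.total_sample_size)
        current = best.get(ma.review_id)
        if current is None or key > (current.n_studies, current.total_sample_size):
            best[ma.review_id] = ma
    return best


# Hypothetical records for three reviews.
records = [
    MetaAnalysis("review-A", 6, 900, True),
    MetaAnalysis("review-A", 6, 1500, True),   # tie on studies, larger sample
    MetaAnalysis("review-A", 9, 400, False),   # not a binary outcome
    MetaAnalysis("review-B", 3, 5000, True),   # fewer than four studies
    MetaAnalysis("review-C", 4, 300, True),
]
selected = select_per_review(records)
for review_id, ma in sorted(selected.items()):
    print(review_id, ma.n_studies, ma.total_sample_size)
```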

Figure 1 shows the upper and lower 95% confidence intervals of I2 for the two sets of meta-analyses. The pattern is similar. Of the meta-analyses where I2 is ≤25% (low heterogeneity), 83% of the Cochrane meta-analyses and 73% of the genetic risk factor meta-analyses have upper 95% confidence intervals that cross into the range of large heterogeneity (I2 ≥50%). Of the meta-analyses where I2 is ≥50% (large heterogeneity), 67% of the Cochrane meta-analyses and 52% of the genetic risk factor meta-analyses have lower 95% confidence intervals that cross into the range of low heterogeneity (I2 ≤25%).

Fig 1 Confidence intervals for estimated I2 in 1011 Cochrane meta-analyses and 50 meta-analyses of genetic risk factors. The median number of studies was 7 (interquartile range 5-11) and 20 (13-26), respectively, and the median total sample size was 1112 (512-2691) and 4660 (2823-8761), respectively. The median I2 was 21% (0-50%) and 38% (5-60%), respectively
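The proportions reported for figure 1 amount to a simple tally: among meta-analyses with a low point estimate, count those whose upper limit reaches the large range, and among those with a large point estimate, count those whose lower limit reaches the low range. A minimal Python sketch of that tally, with hypothetical data and a hypothetical function name, might look like this.

```python
LOW_THRESHOLD = 25.0    # I2 at or below this is treated as low heterogeneity
LARGE_THRESHOLD = 50.0  # I2 at or above this is treated as large heterogeneity


def crossing_proportions(records):
    """records: iterable of (i2, lower_limit, upper_limit), all in percent.

    Returns a pair of proportions (None when a group is empty):
      - low I2 estimates whose upper limit reaches the large range
      - large I2 estimates whose lower limit reaches the low range
    """
    records = list(records)
    low = [r for r in records if r[0] <= LOW_THRESHOLD]
    large = [r for r in records if r[0] >= LARGE_THRESHOLD]
    low_cross = sum(1 for _, _, upper in low if upper >= LARGE_THRESHOLD)
    large_cross = sum(1 for _, lower, _ in large if lower <= LOW_THRESHOLD)
    return (
        low_cross / len(low) if low else None,
        large_cross / len(large) if large else None,
    )


# Hypothetical (I2, lower, upper) triples; not data from the survey.
example = [(0, 0, 60), (20, 0, 55), (10, 0, 40), (60, 20, 85), (70, 45, 88)]
low_share, large_share = crossing_proportions(example)
print(low_share, large_share)
```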

Meta-analyses where I2 is estimated at 0% are affected by an especially important misconception. Many reviews interpret this as absence of heterogeneity, but the upper 95% confidence interval may be substantial (as in the Kawasaki example discussed above9). Figure 2 shows the uncertainty for the upper 95% confidence interval of I2 for the two sets of meta-analyses, limited to those with I2=0% (n=373 for Cochrane reviews, n=12 for genetic studies). The upper 95% confidence interval exceeds 33% in all these meta-analyses. For 81% of the meta-analyses with I2=0%, the upper 95% confidence interval is 50% or higher. Because of the way that research is currently reported, considerable heterogeneity between studies cannot be excluded with confidence in most meta-analyses. Some heterogeneity between studies is probably present in most meta-analyses. Claims for homogeneity may sometimes be stronger than the evidence allows. Trusting a non-significant P value for the Q statistic and an I2 estimate of 0% may sometimes lead to spurious certainty about the comparability and similarity of study results.

Fig 2 Proportion of meta-analyses with estimated I2=0% whose upper 95% confidence interval of I2 is lower than a given value

Technical aspects

The confidence interval of I2 can be calculated by several methods.5 Two methods, a test based approach and a non-central χ2 based approach, have been implemented in Stata (heterogi module). The performance of these two methods is comparable, although the test based approach often gives lower values for both the lower and the upper confidence limits, so that the non-central χ2 based approach may be preferable.
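For readers who want to see the arithmetic, the sketch below follows the test based approach as it is commonly described from Higgins and Thompson (reference 5): a standard error is derived for ln H, where H2=Q/df, and the limits for H are converted to I2=(H2−1)/H2. This is an illustrative Python version, not the Stata heterogi code, and its results may differ slightly from that module. The function name is hypothetical, and the example Q is an approximation recovered from a rounded I2 of 35% in five studies.

```python
import math


def i2_test_based_ci(q, n_studies, z=1.96):
    """Return (I2, lower, upper) in percent using the test based approach.

    q         -- Cochran's Q
    n_studies -- number of studies k (at least 3 for this formula)
    z         -- normal quantile; 1.96 gives an approximate 95% interval
    """
    k = n_studies
    if k < 3:
        raise ValueError("the test based standard error needs at least 3 studies")
    if q <= 0:
        raise ValueError("Q must be positive")
    df = k - 1
    ln_h = 0.5 * math.log(q / df)
    if q > k:
        se = 0.5 * (math.log(q) - math.log(df)) / (
            math.sqrt(2.0 * q) - math.sqrt(2.0 * k - 3.0)
        )
    else:
        se = math.sqrt((1.0 / (2.0 * (k - 2))) * (1.0 - 1.0 / (3.0 * (k - 2) ** 2)))

    def to_i2(ln_h_value):
        # I2 = (H^2 - 1) / H^2, with negative values set to 0%.
        h2 = math.exp(2.0 * ln_h_value)
        return max(0.0, (h2 - 1.0) / h2) * 100.0

    return to_i2(ln_h), to_i2(ln_h - z * se), to_i2(ln_h + z * se)


# Approximate Q for a five study meta-analysis with a reported I2 of 35%.
q_example = 4 / (1 - 0.35)
point, lower, upper = i2_test_based_ci(q_example, 5)
print(f"I2 = {point:.0f}% (95% CI {lower:.0f}% to {upper:.0f}%)")
```

Under these assumptions the interval runs from 0% to roughly 76%, in line with the interval quoted for the five study review discussed earlier, and it shows how little a point estimate of 35% constrains the true extent of heterogeneity.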

Concluding comments

All statistical tests for heterogeneity are weak, including I2. The clinical implications of this are considerable and must be examined on a case by case basis. Putting too much trust in homogeneity of effects may give a false sense of reassurance that one size fits all. Lack of evidence of heterogeneity is not evidence of homogeneity. Conversely, putting too much trust in the presence of heterogeneity of effects may lead to spurious subgroup and exploratory analyses. Given that I2 is not precise, 95% confidence intervals should always be given.

Supplementary Material

[extra: Web extra]

Contributors and sources: JPAI has a long standing interest in meta-analyses and heterogeneity and had the original idea for this article. NAP and EE collected the data. NAP performed statistical analyses with help from JPAI and EE. JPAI wrote the manuscript and NAP and EE commented on it. JPAI is guarantor.

Competing interests: None declared.

Provenance and peer review: Not commissioned; externally peer reviewed.

References

  • 1. Lau J, Ioannidis JPA, Schmid CH. Summing up evidence: one answer is not always enough. Lancet 1998;351:123-7.
  • 2. Sutton A, Abrams K, Jones D, Sheldon T, Song F. Methods for meta-analysis in medical research. Chichester: Wiley, 2000.
  • 3. Petitti DB. Approaches to heterogeneity in meta-analysis. Stat Med 2001;20:3625-33.
  • 4. Cochran WG. The combination of estimates from different experiments. Biometrics 1954;10:101-29.
  • 5. Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med 2002;21:1539-58.
  • 6. Higgins JPT, Thompson SG, Deeks J, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003;327:557-60.
  • 7. Mittlbock M, Heinzl H. A simulation study comparing properties of heterogeneity measures in meta-analyses. Stat Med 2006;25:4321-33.
  • 8. Huedo-Medina TB, Sánchez-Meca J, Marín-Martínez F, Botella J. Assessing heterogeneity in meta-analysis: Q statistic or I2 index? Psychol Methods 2006;11:193-206.
  • 9. Wooditch AC, Aronoff SC. Effect of initial corticosteroid therapy on coronary artery aneurysm formation in Kawasaki disease: a meta-analysis of 862 children. Pediatrics 2005;116:989-95.
  • 10. Newburger JW, Sleeper LA, McCrindle BW, Minich LL, Gersony W, Vetter VL, et al; Pediatric Heart Network Investigators. Randomized trial of pulsed corticosteroid therapy for primary treatment of Kawasaki disease. N Engl J Med 2007;356:663-75.
  • 11. Inoue Y, Okada Y, Shinohara M, Kobayashi T, Kobayashi T, Tomomasa T, et al. A multicenter prospective randomized trial of corticosteroids in primary therapy for Kawasaki disease: clinical course and coronary artery outcome. J Pediatr 2006;149:336-41.
  • 12. Maier PC, Funk J, Schwarzer G, Antes G, Falck-Ytter YT. Treatment of ocular hypertension and open angle glaucoma: meta-analysis of randomised controlled trials. BMJ 2005;331:134.
  • 13. Dennis CL. Psychosocial and psychological interventions for prevention of postnatal depression: systematic review. BMJ 2005;331:15.
  • 14. Devereaux PJ, Beattie WS, Choi PT, Badner NH, Guyatt GH, Villar JC, et al. How strong is the evidence for the use of perioperative beta blockers in non-cardiac surgery? Systematic review and meta-analysis of randomised controlled trials. BMJ 2005;331:313-21.
  • 15. Taylor SJ, Candy B, Bryar RM, Ramsay J, Vrijhoef HJ, Esmond G, et al. Effectiveness of innovations in nurse led chronic disease management for patients with chronic obstructive pulmonary disease: systematic review of evidence. BMJ 2005;331:485.
  • 16. Webster AC, Woodroffe RC, Taylor RS, Chapman JR, Craig JC. Tacrolimus versus ciclosporin as primary immunosuppression for kidney transplant recipients: meta-analysis and meta-regression of randomised trial data. BMJ 2005;331:810.
  • 17. McDonald MA, Simpson SH, Ezekowitz JA, Gyenes G, Tsuyuki RT. Angiotensin receptor blockers and risk of myocardial infarction: systematic review. BMJ 2005;331:873.
  • 18. Glass J, Lanctot KL, Herrmann N, Sproule BA, Busto UE. Sedative hypnotics in older people with insomnia: meta-analysis of risks and benefits. BMJ 2005;331:1169.
  • 19. Ioannidis JP, Trikalinos TA, Zintzaras E. Extreme between-study homogeneity in meta-analyses could offer useful insights. J Clin Epidemiol 2006;59:1023-32.
  • 20. Ioannidis JP, Trikalinos TA, Khoury MJ. Implications of small effect sizes of individual genetic variants on the design and interpretation of genetic association studies of complex diseases. Am J Epidemiol 2006;164:609-14.

Articles from BMJ : British Medical Journal are provided here courtesy of BMJ Publishing Group