Abstract
Systematic reviews synthesize data across multiple studies to answer a research question, and an important component of the review process is evaluating the heterogeneity of the primary studies considered for inclusion. Little is known, however, about the ways that systematic reviewers evaluate heterogeneity, especially in clinical specialties like oncology. We examined a sample of systematic reviews from this body of literature to determine how meta-analysts assessed and reported heterogeneity. A PubMed search of 6 oncology journals was conducted to locate systematic reviews and meta-analyses. Two coders then independently evaluated the manuscripts for 10 different elements based on an abstraction manual. The initial PubMed search yielded 337 systematic reviews from 6 journals. Screening against the exclusion criteria (nonsystematic reviews, genetic studies, individual patient data, etc.) identified 155 articles that did not meet the definition of a systematic review, leaving a final sample of 182 systematic reviews across 4 journals. Of these reviews, 50% (91/182) used varying combinations of heterogeneity tests, and of those, 16% (15/91) of review authors noted excessive heterogeneity and opted not to perform a meta-analysis. Of the studies that measured heterogeneity, 51% (46/91) used a random-effects model, 9% (8/91) used a fixed-effects model, and 43% (39/91) used both. We conclude that quantitative and qualitative heterogeneity assessment tools are underused in the 4 oncology journals evaluated. Such assessments should be routinely applied in meta-analyses.
Systematic reviews bring together all related empirical evidence based on predetermined eligibility criteria to answer a research question (1). This methodology is designed to minimize bias using an explicit, reproducible approach involving a systematic and comprehensive literature search, an assessment of the validity of primary studies, and a systematic presentation and synthesis of findings. Oftentimes, systematic reviews also contain one or more meta-analyses that use statistical procedures to summarize the results of primary studies. When multiple studies are combined for data synthesis, differences among them are inevitable: testing location, drug doses, dosing schedules, follow-up duration, and participant ethnicity, to name a few. If statistically significant heterogeneity is present, then researchers must decide whether the primary studies are too diverse to synthesize or whether follow-up analyses should be used to explore the effects of these differences on study outcomes.
Heterogeneity between primary studies in systematic reviews can be explored with multiple statistical tests, such as I², Cochran's Q (chi-squared), and τ². Each of these tests has its own strengths and weaknesses, so it is best to use multiple tests to fully inform clinicians about the dependability of a systematic review's analysis (1–9). While much advice has been offered on evaluating heterogeneity, little is known about the ways that systematic reviewers actually address heterogeneity. Questions remain regarding the practices of systematic reviewers outside of Cochrane review groups, such as researchers in clinical specialties like oncology.
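For readers who want these statistics made concrete, the sketch below shows how all three are typically computed from per-study effect estimates and standard errors. It is an illustrative Python implementation of the standard inverse-variance and DerSimonian-Laird definitions, with hypothetical input data; it is not code from any of the reviews examined here.

```python
import numpy as np

def heterogeneity_stats(effects, ses):
    """Cochran's Q, I^2, and the DerSimonian-Laird tau^2 from
    per-study effect estimates and their standard errors."""
    effects = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(ses, dtype=float) ** 2       # inverse-variance weights
    pooled_fixed = np.sum(w * effects) / np.sum(w)    # fixed-effect pooled mean
    q = np.sum(w * (effects - pooled_fixed) ** 2)     # Cochran's Q
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # I^2 as a percentage
    # DerSimonian-Laird moment estimator of between-study variance
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)
    return q, i2, tau2

# Hypothetical log hazard ratios and standard errors from 5 primary studies
q, i2, tau2 = heterogeneity_stats(
    effects=[-0.40, -0.15, -0.55, 0.05, -0.30],
    ses=[0.12, 0.20, 0.15, 0.25, 0.18],
)
print(f"Q = {q:.2f}, I^2 = {i2:.1f}%, tau^2 = {tau2:.4f}")
```

Q is compared with its degrees of freedom (k − 1); I² expresses the excess over that expectation as a percentage of total variation, and τ² estimates the between-study variance on the scale of the effect size itself.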
METHODS
Using the h5-Index from Google Scholar Metrics, we selected the 6 oncology journals with the highest index scores from the oncology subcategory. We searched PubMed using the following search string: ((((((“Journal of clinical oncology: official journal of the American Society of Clinical Oncology” [Journal] OR “Nature reviews. Cancer” [Journal]) OR “Cancer research” [Journal]) OR “The Lancet. Oncology” [Journal]) OR “Clinical cancer research: an official journal of the American Association for Cancer Research” [Journal]) OR “Cancer cell” [Journal]) AND (“2007/01/01” [PDAT]: “2015/12/31” [PDAT]) AND “humans” [MeSH Terms]) AND (meta-analysis [Title/Abstract] OR systematic review [Title/Abstract]). This search strategy was adapted from a previously established method that is sensitive to identifying systematic reviews and meta-analyses (10). Searches were conducted on May 18 and May 26, 2015.
We used Covidence (covidence.org) to initially screen articles based on title and abstract. To qualify as a systematic review, studies had to summarize evidence across multiple studies and provide information on the search strategy, such as search terms, databases, or inclusion/exclusion criteria. Meta-analyses were classified as quantitative syntheses of results across multiple studies (11). Two screeners independently reviewed the titles and abstracts of each citation and made a decision regarding its suitability for inclusion based on the definitions previously described. The screeners then held a meeting to discuss differences in their inclusion/exclusion decisions and reconciled any discrepancies by reaching consensus. Following the screening process, full-text versions of included articles were obtained via EndNote. To standardize the coding process, an abstraction manual was developed and pilot tested. A training session was then conducted to familiarize coders with abstracting the data elements, and a subset of studies was jointly coded. After the training exercise, each coder was provided with 3 new articles to code independently. Inter-rater agreement on these data was calculated using Cohen's kappa. Since inter-rater agreement was high (κ = 0.86; agreement = 91%), each coder was assigned an equal subset of articles for data abstraction. We coded the following elements: a) statistical test used to evaluate heterogeneity; b) a priori threshold for statistical significance; c) type of model (random, fixed, mixed, or both); d) whether authors selected a random-effects model based on significance of the heterogeneity test; e) whether authors used a random-effects model without explanation; f) what type of plot was used to evaluate heterogeneity, if any; g) whether the plot was published as a figure in the manuscript; h) whether a follow-up analysis was conducted and, if so, the type of analysis (subgroup, meta-regression, and/or sensitivity analysis); i) whether heterogeneity was mentioned in writing only; and j) whether authors concluded there was too much heterogeneity to perform a meta-analysis. After the initial coding process, validation checks were conducted such that each coded element was verified by the other coder. Analysis of the final data was conducted using STATA 13.1. Data from this study are publicly available on Figshare (http://dx.doi.org/10.6084/m9.figshare.1496574).
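For readers unfamiliar with the agreement statistic, Cohen's kappa compares observed agreement with the agreement expected by chance from each rater's marginal frequencies. The following is a minimal Python sketch with hypothetical screening decisions, not our actual screening data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical codes on the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # p_o
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: sum over categories of the product of marginals
    categories = set(rater_a) | set(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical include/exclude decisions for 10 screened abstracts
a = ["include", "exclude", "include", "include", "exclude",
     "exclude", "include", "exclude", "include", "include"]
b = ["include", "exclude", "include", "exclude", "exclude",
     "exclude", "include", "exclude", "include", "include"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.80 for this example
```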
RESULTS
The PubMed search resulted in 337 articles from 6 journals. After screening titles and abstracts via Covidence, 79 articles were excluded that did not meet the definition of a systematic review and/or meta-analysis. Full-text screening against the exclusion criteria removed an additional 74 articles, and 2 studies could not be retrieved. Two of the 6 journals were only lightly represented in the original sample of 337 articles, and after these 155 exclusions no articles from those 2 journals remained. In total, 182 manuscripts representing 4 journals were analyzed for heterogeneity (Figure).
Figure. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of study selection.
Half (91/182) of all meta-analyses used at least 1 heterogeneity test. The most widely reported statistic was I² (41.2%; 75/182), followed by χ² (24.2%; 44/182). The most frequently used combination was χ² and I² (13.2%; 24/182), followed by Q and I² (12.6%; 23/182). Other combinations appeared only rarely (I² and τ² [0.55%; 1/182]; Q and χ² [0.55%; 1/182]; Q, χ², and I² [3.3%; 6/182]; and χ², I², and τ² [0.55%; 1/182]). As shown in the Table, authors selected a random-effects model most frequently (25%), followed by both fixed- and random-effects models (21%). Fixed-effects models were reported in 4% of studies, and a mixed-effects model was used in only 1 study. The remaining 48% did not report the type of model used for analysis. Twenty-four percent (43/182) used a random-effects model without considering the results of a heterogeneity test to confirm the need for such an analysis, and 15% (27/182) changed from a fixed- to a random-effects model based on the results of a heterogeneity test.
Table. Descriptive heterogeneity practices

| Category | Variable | Cancer Research (n = 1) | Clinical Cancer Research (n = 17) | Journal of Clinical Oncology (n = 106) | The Lancet Oncology (n = 58) | Total (n = 182) |
|---|---|---|---|---|---|---|
| Types of heterogeneity plots reported | Forest plot | 1 | 5 | 46 | 24 | 76 (42%) |
| | Forest plot, L'Abbé plot | 0 | 2 | 1 | 0 | 3 (2%) |
| | Forest, L'Abbé, Galbraith plot | 0 | 0 | 0 | 0 | 0 (0%) |
| | None | 0 | 10 | 59 | 34 | 103 (57%) |
| Types of analyses reported | Subgroup analysis | 0 | 6 | 26 | 7 | 39 (21%) |
| | Meta-regression | 0 | 2 | 11 | 4 | 17 (9%) |
| | Sensitivity analysis | 0 | 3 | 19 | 11 | 33 (18%) |
| Choice of model reported | Both | 0 | 4 | 24 | 10 | 39 (21%) |
| | Fixed | 0 | 2 | 4 | 2 | 8 (4%) |
| | Random | 1 | 5 | 23 | 17 | 46 (25%) |
| | Mixed | 0 | 0 | 1 | 0 | 1 (<1%) |
| | None | 0 | 6 | 54 | 28 | 88 (48%) |
The level of statistical significance for heterogeneity tests was reported in 45 systematic reviews. Among those reporting predefined thresholds for statistical significance, the most frequently reported P value was < 0.05 (64.4%; 29/45), followed by P < 0.10 (31.1%; 14/45); P < 0.01 and P < 0.001 were each reported in 1 study. Forty-three percent (78/182) of systematic reviews contained heterogeneity plots published as figures in the article (Table). A forest plot was the most common heterogeneity plot (42%), and only 2% used an L'Abbé plot to graphically represent heterogeneity.
Of the 3 follow-up analyses used to investigate heterogeneity (subgroup, meta-regression, and sensitivity analyses), subgroup analysis was used the most (21%), sensitivity analysis was second (18%), and meta-regression was used the least (9%) (Table). Twenty percent (36/182) of the manuscripts wrote about heterogeneity but never actually evaluated it. Fifty-eight percent (105/182) of manuscripts did not find significant heterogeneity, 3% (5/182) found enough evidence of heterogeneity to disregard “some” of the meta-analysis, 4% (8/182) found significant heterogeneity, and 35% (64/182) never attempted to assess heterogeneity.
DISCUSSION
Systematic reviews rest on methods designed to help researchers search the literature comprehensively and evaluate data in a manner that limits bias. One important aspect of this systematic process is analyzing heterogeneity at its source and its subsequent effect on the meta-analysis. Heterogeneity assessment was not common practice in the studies included in this review: only half of the available studies applied one or more of the common heterogeneity tests evaluated here, and 20% mentioned heterogeneity without any further assessment. Because interstudy variance is always present at some level, evaluating heterogeneity in systematic reviews is necessary in many cases. The limited use of meta-regression and subgroup analysis (9% and 21%, respectively) restricts what these studies can say about the effect of heterogeneity on the meta-analysis. The random-effects model was also underused, with 25% of studies using this model and 21% using both random- and fixed-effects models. The random-effects model assumes that the parameters underlying the studies follow a distribution, while the fixed-effects model assumes a single parameter value common to all studies (12). The random-effects assumption is usually the more plausible one and is appropriate when study outcomes are heterogeneous. Because all systematic reviews will have some level of heterogeneity, this study recommends random-effects modeling for meta-analysis of heterogeneous intervention effects. The Institute of Medicine's Standards for Systematic Reviews state that “although the committee does not believe that any single statistical technique should be a methodological standard, it is essential that the SR [systematic review] team clearly explain and justify the reasons why it chose the technique actually used” (13). In this review, only 15% of the available studies justified their use of a random-effects model with a heterogeneity test. This finding highlights the need for greater explanation of the model-selection process.
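To make the contrast between the two models concrete, the sketch below pools the same hypothetical studies both ways under the inverse-variance approach, with the random-effects weights incorporating an assumed τ² (for example, from the DerSimonian-Laird estimator sketched earlier). It is illustrative only and is not the analysis code of any reviewed study.

```python
import numpy as np

def pooled_estimate(effects, ses, tau2=0.0):
    """Inverse-variance pooled estimate with a 95% CI (normal approximation).
    tau2 = 0 gives the fixed-effect result; tau2 > 0 gives the random-effects
    result, with the between-study variance added to each study's variance."""
    effects = np.asarray(effects, dtype=float)
    v = np.asarray(ses, dtype=float) ** 2
    w = 1.0 / (v + tau2)                       # per-study weights
    est = np.sum(w * effects) / np.sum(w)      # weighted mean
    se = np.sqrt(1.0 / np.sum(w))              # SE of the pooled estimate
    return est, (est - 1.96 * se, est + 1.96 * se)

effects = [-0.40, -0.15, -0.55, 0.05, -0.30]   # hypothetical log hazard ratios
ses = [0.12, 0.20, 0.15, 0.25, 0.18]
print("fixed: ", pooled_estimate(effects, ses))             # tau2 = 0
print("random:", pooled_estimate(effects, ses, tau2=0.04))  # assumed tau2
```

With τ² > 0, the weights become more nearly equal across studies and the confidence interval widens, which is why the random-effects model is the more cautious choice when heterogeneity is present.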
There have been recent developments for exploring and interpreting heterogeneity both before and after the review process. Evidence mapping is a process developed for systematic reviewers to explore sources of heterogeneity among primary studies prior to pooling (14). This qualitative approach may be a useful way to identify sources of heterogeneity and could inform subgroup analyses. A second means for interpreting heterogeneity is to calculate and report prediction intervals, which present an expected range of true effects and can assist in the clinical interpretation of heterogeneity by estimating the expected true treatment effect in future settings (15). We recommend that these areas be explored in future research.
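As an illustration of the prediction-interval calculation (15), the sketch below applies the common approximation based on a t distribution with k − 2 degrees of freedom; the pooled estimate, its standard error, τ², and the study count are all hypothetical values.

```python
import math
from scipy import stats

def prediction_interval(pooled, se_pooled, tau2, k, level=0.95):
    """Approximate prediction interval for the true effect in a new setting:
    pooled +/- t_{k-2} * sqrt(tau2 + se_pooled^2). Requires k >= 3 studies."""
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=k - 2)
    half_width = t_crit * math.sqrt(tau2 + se_pooled ** 2)
    return pooled - half_width, pooled + half_width

# Hypothetical random-effects summary: 5 studies, pooled log hazard ratio -0.28
low, high = prediction_interval(pooled=-0.28, se_pooled=0.09, tau2=0.04, k=5)
print(f"95% PI: ({low:.2f}, {high:.2f})")
```

Unlike a confidence interval, which describes uncertainty about the average effect, the prediction interval conveys where the true effect of a new study or setting is expected to fall, so it reflects heterogeneity directly.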
This study has several strengths, including its adequate sample size and the careful application of coding procedures. Additionally, our findings shed light on current practices of heterogeneity assessment, an area that has received little attention in oncology reviews. Given that little research on heterogeneity practices has been conducted to date, especially in clinical specialties such as oncology, comparing our results with other studies is difficult; to our knowledge, this is the first study of its kind in oncology. This study also has limitations. We examined systematic reviews published in high-impact-factor oncology journals, and our results may not represent oncology systematic reviews as a whole: higher-impact-factor journals may have more rigorous reporting and methodological standards, and systematic reviews published in them may reflect those standards. Our search was also date limited, and our results should not be generalized outside of our search dates.
References
1. Higgins JPT, Green S, eds. Cochrane Handbook for Systematic Reviews of Interventions. Chichester: Cochrane Collaboration; 2008.
2. Thorlund K, Imberger G, Johnston BC, Walsh M, Awad T, Thabane L, Gluud C, Devereaux PJ, Wetterslev J. Evolution of heterogeneity (I²) estimates and their 95% confidence intervals in large meta-analyses. PLoS One. 2012;7(7):e39471. doi:10.1371/journal.pone.0039471.
3. Guyatt G, Wyer P, Ioannidis JP. When to believe a subgroup analysis. In: Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. New York: McGraw-Hill; 2008.
4. Jackson D. The implications of publication bias for meta-analysis' other parameter. Stat Med. 2006;25(17):2911–2921. doi:10.1002/sim.2293.
5. Jackson D. Assessing the implications of publication bias for 2 popular estimates of between-study variance in meta-analysis. Biometrics. 2007;63(1):187–193. doi:10.1111/j.1541-0420.2006.00663.x.
6. Huedo-Medina TB, Sánchez-Meca J, Marín-Martínez F, Botella J. Assessing heterogeneity in meta-analysis: Q statistic or I² index? Psychol Methods. 2006;11(2):193–206. doi:10.1037/1082-989X.11.2.193.
7. Mittlböck M, Heinzl H. A simulation study comparing properties of heterogeneity measures in meta-analyses. Stat Med. 2006;25(24):4321–4333. doi:10.1002/sim.2692.
8. Gavaghan DJ, Moore RA, McQuay HJ. An evaluation of homogeneity tests in meta-analyses in pain using simulations of individual patient data. Pain. 2000;85(3):415–424. doi:10.1016/S0304-3959(99)00302-4.
9. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557–560. doi:10.1136/bmj.327.7414.557.
10. Montori VM, Wilczynski NL, Morgan D, Haynes RB; Hedges Team. Optimal search strategies for retrieving systematic reviews from Medline: analytical survey. BMJ. 2005;330(7482):68. doi:10.1136/bmj.38336.804167.47.
11. Onishi A, Furukawa TA. Publication bias is underreported in systematic reviews published in high-impact-factor journals: metaepidemiologic study. J Clin Epidemiol. 2014;67(12):1320–1326. doi:10.1016/j.jclinepi.2014.07.002.
12. Higgins JP, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. J R Stat Soc Ser A Stat Soc. 2009;172(1):137–159. doi:10.1111/j.1467-985X.2008.00552.x.
13. Morton S, Levit L, Berg A, Eden J, eds. Finding What Works in Health Care: Standards for Systematic Reviews. Washington, DC: National Academies Press; 2011.
14. Althuis MD, Weed DL, Frankenfield CL. Evidence-based mapping of design heterogeneity prior to meta-analysis: a systematic review and evidence synthesis. Syst Rev. 2014;3:80. doi:10.1186/2046-4053-3-80.
15. IntHout J, Ioannidis JPA, Rovers MM, Goeman JJ. Plea for routinely presenting prediction intervals in meta-analysis. BMJ Open. 2016;6(7):e010247. doi:10.1136/bmjopen-2015-010247.
