Abstract
Rationale, Aims, and Objectives
It is generally believed that low quality (certainty) of evidence (CoE) generates inaccurate estimates of treatment effects more often than high CoE. As a result, we would expect that (a) estimates of the effects of health interventions initially based on high CoE change less frequently than those based on lower CoE and (b) the estimated magnitude of effect sizes differs between high and low CoE. Empirical assessment of these foundational principles of evidence‐based medicine has been lacking.
Methods
We reviewed the Cochrane Database of Systematic Reviews from January 2016 through May 2021 for pairs of original and updated reviews with a change in CoE assessment based on the Grading of Recommendations Assessment, Development and Evaluation (GRADE) method. We assessed the difference in effect sizes between the original and updated reviews as a function of change in CoE, which we report as a ratio of odds ratios (ROR). We compared the ROR generated in studies in which CoE changed from very low/low (VL/L) to moderate/high (M/H) versus from M/H to VL/L. Heterogeneity and inconsistency were assessed using the τ and I² statistics. We also assessed the change in precision of effect estimates (by calculating the ratio of standard errors; seR) and the absolute deviation in estimates of treatment effects (aROR).
Results
Four hundred and nineteen pairs of reviews were included, of which 414 (207 × 2) informed the CoE appraisal and 384 (192 × 2) the assessment of effect size. We found that CoE originally appraised as VL/L had 2.1 [95% confidence interval (CI): 1.19–4.12; p = 0.0091] times higher odds of being changed in future studies than M/H CoE. However, the effect size did not differ (p = 1) when CoE changed from VL/L → M/H [ROR = 1.02 (95% CI: 0.74–1.39)] compared with M/H → VL/L [ROR = 1.02 (95% CI: 0.44–2.37)]. A similar overlap in aROR between the VL/L → M/H and M/H → VL/L subgroups was observed [median (IQR): 1.12 (1.07–1.57) vs. 1.21 (1.12–2.43)]. We observed large inconsistency across ROR estimates (I² = 99%). There was larger imprecision in treatment effects when CoE changed from VL/L → M/H (seR = 1.46) than when it changed from M/H → VL/L (seR = 0.72).
Conclusions
We found that low‐quality evidence changes more often than high CoE. However, the effect size did not systematically differ between studies with low versus high CoE. This finding indicates an urgent need to refine current EBM critical appraisal methods.
Keywords: critical appraisal‐bias, evidence‐based medicine, meta‐epidemiology, observational studies, random error, randomized trials, systematic review
1. INTRODUCTION
A foundational epistemological principle underpinning evidence‐based medicine (EBM) is the assumption that estimates of the effects of health interventions are closer to the 'truth' if they are based on higher rather than lower quality (certainty) of evidence (CoE). 1 If the estimated treatment effects are close to the 'true' effects, they should also be less likely to change as evidence accumulates after new studies are completed. Conversely, because their relation to the 'truth' is less certain, estimated effects based on low‐quality evidence should be more likely to change in future research. Research to date indicates that guideline panels are willing to issue stronger recommendations when they deem evidence to be of high quality, thus indirectly affirming this central EBM assumption. 2 , 3 , 4 , 5
However, whether this indirect assessment of quality of evidence, based on guideline panels' decision‐making, is accurate is not known. It is possible that current methods of critical appraisal of CoE do not discriminate well between accurate and inaccurate estimates of treatment effects. That is, the effects of health interventions based on low quality of evidence may turn out to reflect 'true effects' when tested in subsequent studies. On the other hand, what was originally deemed high‐quality evidence may be undermined by future studies more often than initially expected. Thus, it is not known whether low‐quality evidence is revised more often than high‐quality evidence. Empirical evidence supporting this foundational principle of EBM is lacking.
The main purpose of this report is to assess if (a) low certainty evidence is more often revised than high certainty evidence in subsequent studies and if (b) the magnitude of effect size differs between high and low CoE.
2. METHODS
We assessed the change in CoE between original and updated Cochrane systematic reviews that reported ratings of CoE as per the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system for critical appraisal of medical evidence. 6 We used GRADE because it is widely recognized as the most advanced system for operationalizing the fundamental principles of EBM and critically evaluating medical evidence. 1 , 7 , 8 GRADE was developed in the first decade of the 21st century, after a critical appraisal of 106 systems for rating the quality of medical research evidence showed that none of them was capable of distinguishing low‐ from high‐quality evidence. 1 , 9 , 10
We focused on the assessment of systematic reviews, rather than individual trials, because a second important EBM principle is that the true effects of health interventions are best assessed by evaluating the totality of evidence on a topic rather than a study selected to favour a particular claim. 1 GRADE is also considered a suitable method to assess certainty of evidence at the level of a systematic review/meta‐analysis. 8 Thus, the unit of our analysis was the systematic review/meta‐analysis (SR/MA).
Cochrane Reviews are regularly updated, providing a unique opportunity to assess whether and when the assessment of CoE changes between the original and updated reviews as a result of new evidence generated between the two. Since 2013, Cochrane Reviews have mandated the use of GRADE Summary of Findings (SoF) tables 11 to summarize the CoE and the magnitude of effects of the interventions that the reviews assessed. We evaluated all Cochrane reviews published in the last 5 years in the Cochrane Database of Systematic Reviews [https://www.cochranelibrary.com/cdsr/about-cdsr].
We used SoF tables from the original and updated reviews to extract data for the primary outcome related to CoE and to assess the magnitude and direction of effect. (In the case of multiple primary outcomes, data were extracted from the first outcome listed in the SoF table that contained data in both the original and updated review.) Eligible SR/MAs were divided into five groups; data were extracted from each group by pairs of independent reviewers. Kappa interrater agreement regarding CoE was calculated for each pair. As explained, we recorded CoE according to GRADE criteria (very low, low, moderate and high). 1 , 12
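To illustrate the interrater-agreement step, the sketch below computes Cohen's kappa for one pair of reviewers' CoE ratings; the ratings and variable names are our own hypothetical examples, not the study data.

```python
# Minimal sketch of the interrater-agreement calculation (Cohen's kappa)
# for one pair of reviewers rating CoE. Ratings below are hypothetical;
# the paper reports kappa values between 0.79 and 0.97.
from sklearn.metrics import cohen_kappa_score

reviewer_1 = ["very low", "low", "moderate", "high", "low", "moderate"]
reviewer_2 = ["very low", "low", "moderate", "moderate", "low", "moderate"]

kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa: {kappa:.2f}")
```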
We also extracted summary meta‐analytic estimates for the primary outcome from each pair of reviews, that is, point estimates, dispersion (e.g., 95% confidence interval), metric used (e.g., relative risk, odds ratio, hazard ratio, standardized mean differences, etc.), number of trials per meta‐analysis, number of participants, type of comparator (active vs. placebo/no treatment), type of treatment (pharmaceutical vs. non‐pharmaceutical), whether the authorship of the original and updated reviews changed (to capture potential differences in judgment of CoE by the review team), and type of studies (randomized controlled trials vs. observational studies) that were meta‐analyzed.
We converted all effect estimates into odds ratios (OR). We also aligned all effect sizes in the same direction, with OR < 1 indicating a reduction in undesirable outcomes (i.e., a more beneficial treatment). Because GRADE grades recommendations as strong versus weak (conditional) based on the CoE, 13 typically endorsing strong recommendations for moderate/high CoE and weak recommendations for very low/low CoE, 4 , 14 our key analysis focused on the differences in effect sizes between these subgroups. We conducted McNemar's test for paired (before vs. after) data to test the null hypothesis of equal probability that CoE remained the same in the very low/low versus moderate/high CoE groups. To test for a linear trend in the change of CoE over all categories—from very low to high—we employed a symmetry test with marginal homogeneity tests (which reduces to McNemar's test for two non‐independent categories of observations).
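As a minimal sketch of this step, McNemar's test can be run on the paired (original vs. updated) classification of CoE as very low/low versus moderate/high; the 2 × 2 counts below are illustrative placeholders, not the study data.

```python
# Sketch of McNemar's test for paired before/after CoE classification
# (very low/low vs. moderate/high). Counts are illustrative only.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar, SquareTable

# Rows: original review (VL/L, M/H); columns: updated review (VL/L, M/H)
table = np.array([[120, 30],   # originally VL/L: stayed VL/L, upgraded to M/H
                  [5,   52]])  # originally M/H: downgraded to VL/L, stayed M/H

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")

# Over all four GRADE categories, SquareTable(table_4x4).symmetry() and
# .homogeneity() provide the symmetry/marginal-homogeneity generalization.
```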
To assess differences in the magnitude of effect size between original and updated evidence as a function of change in the assessment of CoE, we calculated the ratio of odds ratios (ROR) across meta‐analytic estimates. 15 The ROR compares intervention effects in meta‐analyses with very low/low versus moderate/high CoE (or vice versa). 15 Thus, for a comparison of ORs under very low/low versus moderate/high CoE, ROR < 1 would mean that treatment effects were more beneficial in meta‐analyses with very low/low CoE, while ROR > 1 would indicate the opposite. 15 , 16 A test of interaction was performed to assess the hypothesis of no difference between the subgroups (i.e., treatment effects under very low/low vs. moderate/high CoE). 17 Because of assumed correlations in the comparison of treatment effects, we calculated standard errors for the ROR by correlating the effect sizes observed in the original versus updated reviews. 17 We obtained the values for the correlation coefficients from the data. We performed sensitivity analyses by (a) assuming one correlation coefficient between effect sizes in the original versus updated reviews and (b) calculating correlation coefficients for each subgroup according to the direction of treatment effects (i.e., separate correlation coefficients for the subgroups showing positive, negative and no change in the direction of effects between the original and updated review—three correlation coefficients in total). We also repeated all analyses assuming no correlation between the effect sizes. Since the results did not differ regardless of the postulated assumptions, we report the default analysis based on the three different correlation coefficients.
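A sketch of how one review pair contributes to the ROR analysis under this correlated-errors setup is shown below; the ORs, standard errors and correlation coefficient are hypothetical placeholders, not values from the study.

```python
# Sketch: log ratio of odds ratios (ROR) for one original/updated review pair,
# with a standard error that accounts for the correlation between the two
# estimates. All input values are hypothetical placeholders.
import numpy as np

or_original, se_original = 0.70, 0.15   # OR and SE of lnOR, original review
or_updated,  se_updated  = 0.80, 0.10   # OR and SE of lnOR, updated review
r = 0.5                                 # assumed correlation between lnORs

ln_ror = np.log(or_updated) - np.log(or_original)
# Standard error of a difference of two correlated estimates
se_ln_ror = np.sqrt(se_original**2 + se_updated**2
                    - 2 * r * se_original * se_updated)

ror = np.exp(ln_ror)
ci = np.exp(ln_ror + np.array([-1.96, 1.96]) * se_ln_ror)
print(f"ROR = {ror:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
```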
Our hypothesis was that the ROR would differ between the subgroups; in addition, we expected that the effect size would be larger if CoE changed from moderate/high to very low/low than the other way around.
The analyses were based on the random‐effects Sidik–Jonkman model. We assessed heterogeneity, that is, the dispersion of effect sizes across the meta‐analytic estimates, by calculating the τ (tau) statistic. 16 We used the I² statistic to assess inconsistency; I² represents the estimated proportion of the observed variance that reflects true differences in effect sizes across individual meta‐analyses rather than sampling error 16 ; it depends on both heterogeneity and the total variation in the estimates between the analyses. 16 , 18 We complemented the assessment of heterogeneity with a calculation of the absolute deviation of treatment effects (aROR) as a function of change in CoE. 19 By definition, aROR is positive and reflects the x‐fold deviation of the treatment effect from OR = 1 on the OR scale; thus, whether ROR = 0.8 or ROR = 1.25, the absolute deviation equals aROR = 1.25. The aROR across all SR/MAs was expressed as an (unweighted) median and interquartile range (IQR). 19 We also evaluated how the precision of the estimates changed by calculating the ratio of standard errors (seR) for each subgroup, summarized as an (unweighted) median and IQR. 19 Values >1 indicate larger standard errors (less precision) associated with a given category (e.g., very low/low vs. moderate/high) of CoE. 19
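To make these quantities concrete, the self-contained sketch below pools hypothetical log-RORs with our own implementation of the Sidik–Jonkman τ² estimator (following Sidik and Jonkman's 2005 formulation) and computes I², aROR and seR as defined above; all input data are invented.

```python
# Sketch: random-effects pooling of log-RORs with a Sidik-Jonkman tau^2
# estimator, plus I^2, aROR and the ratio of standard errors (seR).
# Inputs are invented; the SJ formulas follow Sidik & Jonkman (2005).
import numpy as np

ln_ror = np.array([0.05, -0.10, 0.20, 0.02, -0.30])  # log RORs per meta-analysis
se     = np.array([0.10,  0.15, 0.20, 0.12,  0.25])  # their standard errors
v, k = se**2, len(ln_ror)

# Sidik-Jonkman tau^2: crude initial variance estimate, then one reweighting
tau0_sq = np.mean((ln_ror - ln_ror.mean())**2)
u = 1.0 / (v / tau0_sq + 1.0)
mu_hat = np.sum(u * ln_ror) / np.sum(u)
tau_sq = np.sum(u * (ln_ror - mu_hat)**2) / (k - 1)

# Random-effects pooled estimate and its standard error
w = 1.0 / (v + tau_sq)
pooled = np.sum(w * ln_ror) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))

# I^2 from Cochran's Q computed with fixed-effect weights
w_f = 1.0 / v
q = np.sum(w_f * (ln_ror - np.sum(w_f * ln_ror) / np.sum(w_f))**2)
i_sq = max(0.0, (q - (k - 1)) / q) * 100

# aROR: x-fold deviation of an ROR from 1 on the OR scale (0.8 -> 1.25)
aror = np.maximum(np.exp(ln_ror), np.exp(-ln_ror))

# seR: ratio of standard errors for one review pair; >1 means less precision
def se_ratio(se_new: float, se_old: float) -> float:
    return se_new / se_old

print(f"pooled ROR = {np.exp(pooled):.2f} (SE of log ROR = {pooled_se:.3f})")
print(f"tau^2 = {tau_sq:.3f}, I^2 = {i_sq:.0f}%, median aROR = {np.median(aror):.2f}")
print(f"example seR = {se_ratio(0.12, 0.10):.2f}")
```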
A number of subgroup analyses—all defined a priori and published in the protocol, which provides further methodological details 20 —were performed. These included assessment of differences between patient‐oriented outcomes (e.g., mortality, quality of life) and disease‐oriented outcomes (e.g., disease response, laboratory outcomes), the effect of a change in authorship between the original and updated reviews, the effect of the comparator intervention (active treatment vs. placebo/no‐treatment control) and the type of treatment (pharmaceutical vs. non‐pharmaceutical). Finally, some SRs included observational studies along with randomized controlled trials (RCTs), and some yielded implausibly large ORs in the conversion from standardized mean differences. We performed sensitivity analyses excluding SRs with observational studies and those with implausibly large ORs.
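The reviews do not state which conversion from standardized mean differences was used; a common choice, and a plausible source of implausibly large ORs when the SMD is large, is Chinn's logistic conversion, ln(OR) = (π/√3) × SMD, sketched here with made-up inputs.

```python
# Sketch of a standard SMD -> OR conversion (Chinn, 2000):
# ln(OR) = (pi / sqrt(3)) * SMD. Whether the included reviews used exactly
# this conversion is an assumption on our part; inputs are made up.
import math

def smd_to_or(smd: float) -> float:
    """Convert a standardized mean difference to an approximate odds ratio."""
    return math.exp(math.pi / math.sqrt(3) * smd)

print(round(smd_to_or(0.5), 2))  # moderate SMD -> OR around 2.5
print(round(smd_to_or(3.0), 1))  # large SMD -> OR around 230, implausibly large
```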
This paper is reported per the PRISMA guidelines. 21 All analyses were conducted with the Stata statistical package, version 17. 22
3. RESULTS
The original search, performed on 20 October 2020, identified 3323 potentially eligible reviews, of which 419 SRs were included in the final analysis (Figure 1). Of these, 414 (207 × 2) and 384 (192 × 2) reviews were eligible for the analyses of CoE and effect size, respectively. The total number of trials included in the 414 reviews was 4217 (1814 before and 2403 after); the mean number of trials per meta‐analysis was 10 (minimum: 1; maximum: 133). The total number of participants was 3,057,956; the mean number of participants per meta‐analysis was 10,506 (minimum: 16; maximum: 1,202,382). Interrater kappa agreement between the reviewers varied from 0.79 to 0.97.
Figure 1. PRISMA diagram (study flow diagram for evidence source and selection)
Figure 2 shows the comparison of CoE in the original and updated Cochrane reviews across all categories of CoE (Figure 2A) and grouped as very low/low versus moderate/high (Figure 2B) according to GRADE criteria. Consistent with EBM principles, evidence judged to be of very low/low CoE had 2.1 (1.19–4.12; p = 0.0065) times higher odds of being upgraded in future studies than moderate/high CoE (Figure 2B). Similarly, across all categories of CoE, the test for trend was highly significant, indicating an increased probability of change in CoE from very low to high CoE (p = 0.0021 for linear trend). We observed no instance in which high or moderate quality evidence was re‐assessed as very low‐quality evidence in the updated SR, while very low CoE was upgraded to moderate or high CoE in 9/39 of the updated SRs (Figure 2A).
Figure 2. Change in certainty of evidence (CoE) in original and updated Cochrane systematic reviews. (A) Across all categories of CoE as characterized by GRADE; (B) grouped as very low/low versus moderate/high‐quality evidence
However, we detected no effect of the change in CoE on the magnitude of treatment effects [ROR = 1.02 (95% CI: 0.74–1.39) for the change of CoE from very low/low to moderate/high versus 1.02 (95% CI: 0.44–2.37) for moderate/high to very low/low]. The test between the subgroups was not significant (p = 1) (Figure 3). Although, as explained earlier, GRADE typically groups CoE as moderate/high versus low/very low from the perspective of guideline recommendations, we also attempted to compare the effect sizes at the two extremes of CoE: very low versus high. Because we observed no study with high CoE that changed into very low CoE (Figure 2A), the ROR could not be calculated for this comparison.
Figure 3. Comparison of effects of health interventions in meta‐analyses in which certainty of evidence (CoE) changed from very low/low to moderate/high versus effects in meta‐analyses where CoE changed from moderate/high to very low/low (A); (B) summary of studies shown in (A), with the addition of a comparison of meta‐analyses where CoE did not change. ROR, ratio of odds ratios; τ² (tau²) and H², measures of heterogeneity; I², measure of inconsistency
Nevertheless, there was larger dispersion in ROR in meta‐analyses where CoE changed from moderate/high to very low/low than in the opposite direction. This was probably driven by low power rather than supporting the hypothesis that effect sizes would be larger if CoE changed from moderate/high to very low/low than the other way around. [We had half as many meta‐analyses available for the assessment of ROR based on a change of CoE from moderate/high to very low/low (n = 16) as those in which CoE changed from very low/low to moderate/high (n = 33).]
The aROR was similar between the subgroups [median (IQR): 1.12 (1.07–1.57) vs. 1.21 (1.12–2.43)] (Figure 4A, Table 1). As in the case of ROR, we observed larger dispersion in aROR in meta‐analyses where CoE changed from moderate/high to very low/low than in the opposite direction (Figure 4A,B).
Figure 4. (A) Absolute deviation (AD) of treatment effects (aROR) in meta‐analyses in which certainty of evidence (CoE) changed from very low/low to moderate/high versus effects in meta‐analyses where CoE changed from moderate/high to very low/low; (B) summary of aROR by change in CoE. (For a graph displaying aROR for all studies, including those with no change in CoE, see Supporting Information Appendix, App Figures S4 and S4a)
Table 1.
Summary of aROR (absolute deviation of treatment effects away from OR = 1)
| Subgroup | All data | After dropping outliersᵃ |
|---|---|---|
| All studies, median [IQR] | 1.14 [1.05–1.65] | 1.12 [1.03–1.40] |
| Very low/Low → Mod/High, median [IQR] | 1.12 [1.07–1.57] | 1.11 [1.06–1.47] |
| Mod/High → Very low/Low, median [IQR] | 1.21 [1.12–2.43] | 1.19 [1.11–1.52] |
| CoE did not change, median [IQR] | 1.13 [1.04–1.66] | 1.12 [1.03–1.39] |

ᵃ After dropping studies whose effects were converted to OR from the standardized mean difference (SMD; n = 20) and mean difference (MD; n = 19) metrics used to summarize treatment effects.
The meta‐analyses with no change in CoE had similar ROR [ROR = 1.01 (95% CI: 0.85–1.21)] (Figure 3B) and aROR [median (IQR): 1.13 (1.04–1.66)] (Table 1, App Figures S4 and S4a) to those MAs in which CoE changed (Figure 4 and App Figure S4a). Inconsistency was large across all meta‐analytic estimates (I² = 99%). There was larger imprecision in treatment effects when CoE changed from VL/L → M/H (seR = 1.46) than when it changed from M/H → VL/L (seR = 0.72).
Qualitative analysis indicated that the direction of effect changed in only 6 SR/MAs: 2 in reviews in which CoE changed from very low/low to moderate/high (of which one was statistically significant) and 4 in SR/MAs with no change in the assessment of CoE (of which one was statistically significant) (Figure 5, App Figures S12 and S13).
Figure 5. Change in effect size, qualitative analysis (see also App Figures S12 and S13)
Sensitivity analyses for all pre‐defined subgroups showed no change in the results. In particular, when non‐randomized studies or outliers were excluded, no statistically significant changes were seen in any of the analyses (Appendix).
4. DISCUSSION
Almost 30 years ago, EBM 23 was introduced to a wide medical audience and was subsequently judged to represent one of the most important medical milestones of the past 160 years, in the same category as innovations such as antibiotics and anesthesia. 24 At the heart of EBM is the notion that 'not all evidence is created equal'—some evidence is more credible than other evidence; the higher the quality of evidence, the more accurate and trustworthy are our estimates of the true effects of health interventions. 1 Surprisingly, however, the relationship between CoE and estimates of treatment effects has not been empirically evaluated.
Here, we provide the first empirical support for the foundational EBM principle that low‐quality evidence changes more often than high CoE (Figure 2). However, we found no difference in effect sizes between studies appraised as very low versus high [or very low/low versus moderate/high CoE (Figure 3)]. This implies that effects assessed as less trustworthy or potentially unreliable (as when CoE is low) cannot be distinguished from those presumed more trustworthy and accurate (as when CoE is high). If the magnitude of treatment effects cannot be meaningfully distinguished between evidence appraised as high versus low quality, then this core principle of EBM appears to be challenged.
Our 'negative' results should not be construed as a challenge to sound, normative EBM epistemological principles, which hold that the optimal practice of medicine requires explicit and conscientious attention to the nature of medical evidence. 1 , 25 , 26 Rather, in assessing the relationship between CoE and the 'true' effects of health interventions, the more salient question is whether current appraisal methods capture CoE as intended by EBM principles. Critical appraisal of CoE is an integral aspect of the conduct of systematic reviews and guideline development and is widely integrated into the curricula of most medical and allied professional schools across the world. Over the years, many critical appraisal methods have been developed, 1 culminating in the GRADE methodology, which has been endorsed by more than 110 professional organizations. 7 However, as we demonstrate here, despite GRADE's capacity to distinguish CoE across its categories, it could not—and, we suspect, neither could any of the appraisal methods that GRADE has replaced—reliably discern the influence of CoE on the estimates of treatment effects. The results agree with those of Gartlehner et al., who, based on a cumulative meta‐analysis of 37 Cochrane reviews, found limited value of GRADE in predicting the stability of the strength of evidence as new studies emerged. 27 Other authors have also questioned whether GRADE is a system sufficiently empirically justified to ensure that our judgments are proportional to the underlying quality of evidence. 28 , 29
The finding that the magnitude of effect size is not reflected in a change of CoE is surprising, as elucidating the effects of bias that resulted in misleading advice to patients has been one of the key reasons for the rise of EBM. For example, a large body of observational evidence indicated that hormone replacement therapy (HRT) could reduce heart attack by 40%–50%, which resulted in advice to millions of women to take HRT to prevent heart attack. 30 However, when high‐quality evidence was generated, the opposite was observed: more women who took HRT died from heart attack than those who took placebo. 30 Similarly, thousands of women with breast cancer were advised to undergo highly toxic stem cell transplantation based on unreliable observational evidence indicating an improvement in disease‐free survival of about 50% compared with historical controls 31 —findings that were overturned once high‐quality randomized trials were done. 32 , 33
In addition, previous meta‐epidemiological studies showed that various study limitations that affect CoE significantly influence estimates of treatment effects 34 (although not always consistently 16 ). For example, as measured by ROR, inadequate or unclear (vs. adequate) random‐sequence generation, inadequate or unclear (vs. adequate) allocation concealment, and lack of or unclear double‐blinding (vs. double‐blinding) led to statistically significant exaggerations of treatment effects of 11%, 7% and 13%, respectively. 34 These study limitations are taken into account in the rating of CoE using the GRADE method, 6 so one would expect effect sizes to differ between low and high CoE in the GRADE assessment. On further examination, however, we observe that GRADE combines study limitations such as the adequacy of allocation concealment, blinding, etc. (risk of bias) with assessments of inconsistency, imprecision, indirectness and publication bias to assign the final rating of CoE (from very low to high quality) in an additive fashion. 12 , 35 Combining negative and positive changes in treatment effect in this additive way could unhelpfully neutralize their influence and introduce imprecision into the overall estimate. Thus, one can have the same estimates of treatment effects but completely different GRADE ratings. This is problematic because a central assumption of GRADE is that estimates underpinned by high CoE are unlikely to change, whereas very low/low CoE estimates are more likely to change.
A potential limitation of our study is that we did not collect data on the individual factors that drove the assessment of CoE (i.e., study limitations/risk of bias vs. inconsistency, imprecision or indirectness, for example). However, the present empirical report targets, for the first time, the final, end‐stage assessment of CoE according to GRADE specifications, which is how CoE is used in practice to aid the interpretation of evidence and inform the development of clinical guidelines.
We also detected imprecision in the estimates of effect sizes and relatively wide ROR confidence intervals, particularly in the subgroup of meta‐analyses describing treatment effects when CoE changed from moderate/high to low/very low. It may be argued that the current methods of CoE appraisal are simply not sensitive enough and that, with a much larger sample of SR/MAs, we would be able to differentiate effect sizes across categories of CoE. This point was made by Howick and colleagues, 36 who showed no change in CoE between original and updated reviews in a set of 48 trials they examined, albeit making no attempt to identify changes in effect sizes. We also found that in 71 cases the updated reviews were based on the inclusion of only one extra trial, which might not be enough to overturn or appreciably revise the effect estimate. However, sensitivity analyses comparing the changes in effect size as a function of the number of trials added in the updated meta‐analyses showed no difference in the results, regardless of the choice of cut‐off for the inclusion of additional trials (e.g., 1 vs. ≥3, or any other threshold). Importantly, critical appraisal (and GRADE) applies to evidence from both single and multiple trials and is required in Cochrane Reviews regardless of the quantity of existing evidence. Obtaining larger sample sizes is also unrealistic, given that we reviewed almost all SRs in the Cochrane database since the GRADE assessment of CoE was mandated (up to May 2021). Finally, few of the Cochrane Reviews we analyzed included observational studies. It is possible that GRADE does not differentiate the quality of randomized evidence well but performs better when the comparison is made between randomized and observational studies; Cochrane Reviews, however, are typically based on randomized trials. Therefore, categorization of CoE based on the currently mandated critical appraisal system using GRADE in Cochrane Reviews does not meaningfully separate effect sizes across the existing gradation of CoE (although the capacity of GRADE to distinguish the magnitude of effect size between randomized and observational studies outside the purview of Cochrane Reviews remains a worthwhile subject for further empirical research).
Given that studies can be well done and can estimate treatment effects correctly, yet be poorly reported, 37 , 38 it is also possible that we could not detect the influence of CoE on the estimates of treatment effects because current critical appraisal methods depend on the quality of reporting of the trials selected for meta‐analysis. However, if we believe that the quality of reporting does not matter, then the entire critical appraisal effort can be considered misplaced to begin with.
5. CONCLUSIONS
To the extent that it is central to the epistemology of EBM that what is justifiable or reasonable to believe depends on CoE, 1 our findings indicate an urgent need to refine current EBM critical appraisal methods. If EBM is to flourish, it is crucial to develop methods capable of categorizing CoE so as to reliably differentiate effect estimates that are potentially biased from those that are accurate and trustworthy. The major opportunity lies in addressing the main limitation of this study: carefully and painstakingly discerning the various aspects of CoE (from the components related to study limitations/risk of bias to inconsistency, imprecision and indirectness) to better characterize CoE and its relationship to the magnitude of effects of health interventions.
CONFLICT OF INTERESTS
The authors declare that there are no conflicts of interest.
AUTHOR CONTRIBUTIONS
The authors are an interdisciplinary team of EBM practitioners and instructors who work as clinicians, mathematicians, epidemiologists, statisticians, methodologists, and researchers across academic institutions, hospitals and clinics in the UK, Canada, USA, Brazil and Switzerland. Their research experience ranges from recently acquired doctorates to over 40 years in research and clinical practice. All authors contributed to the methods, commented on the analysis and contributed to writing and revising the manuscript. Our sources and selection criteria are contained within the document, the data are publicly available from the Cochrane Database and our statistical methods are outlined in the methods, figures and tables. PRISMA was used to report our findings. BD serves as the guarantor of the article. Conceptual idea: Benjamin Djulbegovic; Design: Benjamin Djulbegovic and David Nunan; Protocol development: Benjamin Djulbegovic, Muhammad Muneeb Ahmed, David Nunan, Lars Hemkens, Despina Koletsi, Amy Price, Rachel Riera, Paulo Nadanovsky, Ana Paula Pires dos Santos, Daniela Melo, Rafael Leite Pacheco, Luis Eduardo Fontes; Data acquisition: Muhammad Muneeb Ahmed, Despina Koletsi, Amy Price, Rachel Riera, Paulo Nadanovsky, Ana Paula Pires dos Santos, Daniela Melo, Rafael Leite Pacheco, Luis Eduardo Fontes, Ranjan Pathak; Statistical analysis: Iztok Hozo, Benjamin Djulbegovic, Lars Hemkens; Drafting manuscript: Benjamin Djulbegovic; Critical revision of the manuscript for important intellectual content: Benjamin Djulbegovic, Lars Hemkens, David Nunan, Amy Price, Despina Koletsi, Rachel Riera, Paulo Nadanovsky, Ana Paula Pires dos Santos, Daniela Melo, Rafael Leite Pacheco, Luis Eduardo Fontes, Ranjan Pathak; Administrative, technical, or material support: Benjamin Djulbegovic, Muhammad Muneeb Ahmed; Supervision: Benjamin Djulbegovic.
Supporting information
Supporting information.
ACKNOWLEDGEMENTS
This project was supported in part by grant number R01HS024917 from the Agency for Healthcare Research and Quality (Dr. Djulbegovic). The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality. The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Djulbegovic B, Ahmed MM, Hozo I, et al. High quality (certainty) evidence changes less often than low‐quality evidence, but the magnitude of effect size does not systematically differ between studies with low versus high‐quality evidence. J Eval Clin Pract. 2022;28:353‐362. 10.1111/jep.13657
DATA AVAILABILITY STATEMENT
Data are available from the authors upon request.
REFERENCES
- 1. Djulbegovic B, Guyatt GH. Progress in evidence‐based medicine: a quarter century on. Lancet. 2017;390(10092):415‐423. [DOI] [PubMed] [Google Scholar]
- 2. Djulbegovic B, Trikalinos TA, Roback J, Chen R, Guyatt G. Impact of quality of evidence on the strength of recommendations: an empirical study. BMC Health Serv Res. 2009;9(1):120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Djulbegovic B, Kumar A, Kaufman RM, Tobian A, Guyatt GH. Quality of evidence is a key determinant for making a strong guidelines recommendation. J Clin Epidemiol. 2015;68(7):727‐732. [DOI] [PubMed] [Google Scholar]
- 4. Djulbegovic B, Reljic T, Elqayam S, et al. Structured decision‐making drives guidelines panels' recommendations “for” but not “against” health interventions. J Clin Epidemiol. 2019;110:23‐33. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Djulbegovic B, Hozo I, Li S‐A, Razavi M, Cuker A, Guyatt G. Certainty of evidence and intervention's benefits and harms are key determinants of guidelines' recommendations. J Clin Epidemiol. 2021;136:1‐9. [DOI] [PubMed] [Google Scholar]
- 6. Guyatt GH, Oxman AD, Vist G, et al. GRADE guidelines: 4. Rating the quality of evidence‐study limitations (risk of bias). J Clin Epidemiol. 2011;64(4):407‐415. [DOI] [PubMed] [Google Scholar]
- 7. GRADE Working Group . GRADE; 2021. Accessed June 26, 2021. https://www.gradeworkinggroup.org/
- 8. Gartlehner G, Sommer I, Evans TS, Thaler K, Lohr KN. Grades for quality of evidence were associated with distinct likelihoods that treatment effects will remain stable. J Clin Epidemiol. 2015;68(5):489‐497. [DOI] [PubMed] [Google Scholar]
- 9. West S, King V, Carey T, et al. Systems to Rate the Strength of Scientific Evidence. Evidence Report/Technology Assessment No. 47 (Prepared by the Research Triangle Institute‐University of North Carolina Evidence‐based Practice Center under Contract No. 290‐97‐0011). AHRQ Publication No 02‐E016. 2002:64‐88. [PMC free article] [PubMed]
- 10. Atkins D, Eccles M, Flottorp S, et al. Systems for grading the quality of evidence and the strength of recommendations I: Critical appraisal of existing approaches The GRADE Working Group. BMC Health Serv Res. 2004;4(1):38. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Carrasco‐Labra A, Brignardello‐Petersen R, Santesso N, et al. Comparison between the standard and a new alternative format of the Summary‐of‐Findings tables in Cochrane review users: study protocol for a randomized controlled trial. Trials. 2015;16:164. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Guyatt GH, Oxman AD, Montori V, et al. GRADE guidelines: 5. Rating the quality of evidence–publication bias. J Clin Epidemiol. 2011;64(12):1277‐1282. [DOI] [PubMed] [Google Scholar]
- 13. Guyatt GH, Oxman AD, Vist GE, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 2008;336(7650):924‐926. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Djulbegovic B, Hozo I, Li SA, Razavi M, Cuker A, Guyatt G. Certainty of evidence and intervention's benefits & harms are key determinants of guidelines' recommendations. J Clin Epidemiol. 2021;136:1‐9. [DOI] [PubMed] [Google Scholar]
- 15. Sterne JA, Jüni P, Schulz KF, Altman DG, Bartlett C, Egger M. Statistical methods for assessing the influence of study characteristics on treatment effects in ‘meta‐epidemiological’ research. Stat Med. 2002;21(11):1513‐1524. [DOI] [PubMed] [Google Scholar]
- 16. Moustgaard H, Jones HE, Savović J, et al. Ten questions to consider when interpreting results of a meta‐epidemiological study‐the MetaBLIND study as a case. Res Synth Methods. 2020;11(2):260‐274. [DOI] [PubMed] [Google Scholar]
- 17. Higgins JPT, Green S. Cochrane Collaboration. Cochrane handbook for systematic reviews of interventions. Wiley‐Blackwell; 2011. [Google Scholar]
- 18. Higgins J, Thompson S. Quantifying heterogeneity in a meta‐analysis. Stat Med. 2002;21:1539‐1558. [DOI] [PubMed] [Google Scholar]
- 19. Ewald H, Klerings I, Wagner G, et al. Abbreviated and comprehensive literature searches led to identical or very similar effect estimates: a meta‐epidemiological study. J Clin Epidemiol. 2020;128:1‐12. [DOI] [PubMed] [Google Scholar]
- 20. Will high quality (certainty) evidence change less often than low‐quality evidence after new data is collected?; 2020. Accessed July 14, 2021. https://osf.io/84qgc/
- 21. Liberati A, Altman DG, Tetzlaff J, et al. The PRISMA statement for reporting systematic reviews and meta‐analyses of studies that evaluate health care interventions: explanation and elaboration. J Clin Epidemiol. 2009;62:e1‐e34. [DOI] [PubMed] [Google Scholar]
- 22. Stata, ver. 17 [computer program]. College Station, TX: StataCorp; 2021.
- 23. Evidence‐based medicine working group . Evidence‐based medicine. A new approach to teaching the practice of medicine. JAMA. 1992;268:2420‐2425. [DOI] [PubMed] [Google Scholar]
- 24. Dickersin K, Straus SE, Bero LA. Evidence based medicine: increasing, not dictating, choice. BMJ. 2007;334(suppl 1):s10. [DOI] [PubMed] [Google Scholar]
- 25. Djulbegovic B, Guyatt GH, Ashcroft RE. Epistemologic inquiries in evidence‐based medicine. Cancer Control. 2009;16(2):158‐168. [DOI] [PubMed] [Google Scholar]
- 26. Sackett D, Rosenberg W, Muir Gray J, Haynes R, Richardson W. Evidence based medicine: what it is and what it isn't. BMJ. 1996;312:71‐72. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Gartlehner G, Dobrescu A, Evans TS, et al. AHRQ Methods for Effective Health Care. In: Assessing the Predictive Validity of Strength of Evidence Grades: A Meta‐Epidemiological Study. Rockville (MD): Agency for Healthcare Research and Quality (US); 2015. [PubMed] [Google Scholar]
- 28. Mercuri M, Baigrie BS. What confidence should we have in GRADE? J Eval Clin Pract. 2018;24(5):1240‐1246. [DOI] [PubMed] [Google Scholar]
- 29. Mercuri M, Baigrie B, Upshur REG. Going from evidence to recommendations: can GRADE get us there? J Eval Clin Pract. 2018;24(5):1232‐1239. [DOI] [PubMed] [Google Scholar]
- 30. Writing Group for the Women's Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women's Health Initiative randomized controlled trial. JAMA. 2002;288(3):321‐333. [DOI] [PubMed] [Google Scholar]
- 31. Peters WP, Ross M, Vredenburgh JJ, et al. High‐dose chemotherapy and autologous bone marrow support as consolidation after standard‐dose adjuvant therapy for high‐risk primary breast cancer. J Clin Oncol. 1993;11:1132‐1143. [DOI] [PubMed] [Google Scholar]
- 32. Tallman MS, Gray R, Robert NJ, et al. Conventional adjuvant chemotherapy with or without high‐dose chemotherapy and autologous stem‐cell transplantation in high‐risk breast cancer. N Engl J Med. 2003;349(1):17‐26. [DOI] [PubMed] [Google Scholar]
- 33. Rettig RA, Jacobson PD, Farquhar CM, Aubry WM. False hope: Bone marrow transplantation for breast cancer. Oxford University Press; 2007. [Google Scholar]
- 34. Savović J, Jones HE, Altman DG, et al. Influence of reported study design characteristics on intervention effect estimates from randomized, controlled trials. Ann Intern Med. 2012;157(6):429‐438. [DOI] [PubMed] [Google Scholar]
- 35. Guyatt GH, Oxman AD, Kunz R, et al. GRADE guidelines: 7. Rating the quality of evidence—inconsistency. J Clin Epidemiol. 2011;64(12):1294‐1302. [DOI] [PubMed] [Google Scholar]
- 36. Howick J, Koletsi D, Pandis N, et al. The quality of evidence for medical interventions does not improve or worsen: a metaepidemiological study of Cochrane reviews. J Clin Epidemiol. 2020;126:154‐159. [DOI] [PubMed] [Google Scholar]
- 37. Soares HP, Daniels S, Kumar A, et al. Bad reporting does not mean bad methods for randomised trials: observational study of randomised controlled trials performed by the Radiation Therapy Oncology Group. BMJ. 2003;328:22‐25. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Mhaskar R, Djulbegovic B, Magazin A, Soares HP, Kumar A. Published methodological quality of randomized controlled trials does not reflect the actual quality assessed in protocols. J Clin Epidemiol. 2012;65(6):602‐609. [DOI] [PMC free article] [PubMed] [Google Scholar]