PLoS Biol. 2021 Apr 19;19(4):e3001162. doi: 10.1371/journal.pbio.3001162

The methodological quality of 176,620 randomized controlled trials published between 1966 and 2018 reveals a positive trend but also an urgent need for improvement

Christiaan H Vinkers*, Herm J Lamberink, Joeri K Tijdink, Pauline Heus, Lex Bouter, Paul Glasziou, David Moher, Johanna A Damen, Lotty Hooft, Willem M Otte
Editor: Bob Siegerink
PMCID: PMC8084332  PMID: 33872298

Abstract

Many randomized controlled trials (RCTs) are biased and difficult to reproduce due to methodological flaws and poor reporting. There is increasing attention to responsible research practices and implementation of reporting guidelines, but whether these efforts have improved the methodological quality of RCTs (e.g., lower risk of bias) is unknown. We, therefore, mapped risk-of-bias trends over time in RCT publications in relation to journal and author characteristics. Meta-information of 176,620 RCTs published between 1966 and 2018 was extracted. The risk-of-bias probability (random sequence generation, allocation concealment, blinding of patients/personnel, and blinding of outcome assessment) was assessed using a risk-of-bias machine learning tool. This tool was simultaneously validated using 63,327 human risk-of-bias assessments obtained from 17,394 RCTs evaluated in the Cochrane Database of Systematic Reviews (CDSR). Moreover, RCT registration and CONSORT Statement reporting were assessed using automated searches. Publication characteristics included the number of authors, journal impact factor (JIF), and medical discipline. The annual number of published RCTs substantially increased over 4 decades, accompanied by increases in authors (5.2 to 7.8) and institutions (2.9 to 4.8). The risk of bias remained present in most RCTs but decreased over time for allocation concealment (63% to 51%), random sequence generation (57% to 36%), and blinding of outcome assessment (58% to 52%). Trial registration (37% to 47%) and the use of the CONSORT Statement (1% to 20%) also rapidly increased. In journals with a higher impact factor (>10), the risk of bias was consistently lower, and levels of RCT registration and use of the CONSORT Statement were higher. Automated risk-of-bias predictions had accuracies above 70% for allocation concealment (70.7%), random sequence generation (72.1%), and blinding of patients/personnel (79.8%), but not for blinding of outcome assessment (62.7%). In conclusion, the likelihood of bias in RCTs has generally decreased over the last decades. This optimistic trend may be driven by increased knowledge augmented by mandatory trial registration and more stringent reporting guidelines and journal requirements. Nevertheless, relatively high probabilities of bias remain, particularly in journals with lower impact factors. This emphasizes that further improvement of RCT registration, conduct, and reporting is still urgently needed.


Many randomized controlled trials (RCTs) are biased and difficult to reproduce due to methodological flaws and poor reporting. Analysis of 176,620 RCTs published between 1966 and 2018 reveals that the risk of bias in RCTs generally decreased. Nevertheless, relatively high probabilities of bias remain, showing that further improvement of RCT registration, conduct, and reporting is still urgently needed.

Introduction

Randomized controlled trials (RCTs) are the primary source of evidence on the efficacy and safety of clinical interventions, and systematic reviews and clinical guidelines synthesize their results. Unfortunately, many RCTs have methodological flaws, and results are often biased [1]. Across RCTs, there is a major risk of inflated estimates and problems with randomization, allocation concealment, and blinding [2,3]. Recently, it was shown that over 40% of RCTs were at high risk of bias that could easily have been avoided [4,5]. Moreover, poor reporting prevents the adequate assessment of the methodological quality of RCTs and limits their reproducibility [6]. Avoidable sources of waste and inefficiency in clinical research were estimated to be as high as 85% [7].

The CONSORT criteria were introduced as early as 1996 to improve RCT reporting [8]. Moreover, the International Committee of Medical Journal Editors (ICMJE) put forward mandatory RCT registration [9,10], with detailed registration before a trial commences enabling more transparent and complete reporting. More recently, the importance of increasing value and reducing waste in medical research was emphasized, and meaningful steps were proposed toward more high-quality research, including improved methodology and reporting and a reduction in unpublished negative findings [6,11]. Additional actions to improve the methodological quality and transparency of RCTs include trial tracker initiatives aimed at reducing non-publication of clinical trials [12] and fostering responsible research practices. At the most recent World Conference on Research Integrity, the Hong Kong Principles were proposed for responsible research practices, transparent reporting, open science, valuing research diversity, and recognizing contributions to research and scholarly activity [13].

Even though these actions and initiatives have undoubtedly contributed to the awareness that the methodological quality of RCTs needs to improve, the question remains whether real progress has been made in reducing the extent of avoidable waste in clinical research. In other words, have these initiatives and measures improved the quality, transparency, and reproducibility of RCTs? Several studies have assessed the methodological quality of reporting and risk of bias in RCTs [14], but most are relatively small and limited to specific medical disciplines or periods. Nevertheless, based on 20,920 RCTs from Cochrane reviews published between 2011 and 2014, there are indications that poor reporting and inadequate methods have decreased over time [15]. However, large-scale evidence on trends of RCT characteristics and methodological quality across medical disciplines over time is currently lacking. This is surprising given the importance of valid and reliable evidence from RCTs for patient care. Therefore, this study aimed to provide a comprehensive analysis of developments in the clinical trial landscape between 1966 and 2018 based on 176,620 full-text RCT publications. Specifically, we identified full-text RCTs via PubMed. We then used automated analyses to assess the risk of bias, CONSORT Statement reporting, trial registration, and characteristics related to publication (number of authors, journal impact factor [JIF], and medical discipline).

Methods

A protocol for a related prediction study was registered before study conduct [16]. That protocol does not cover the current manuscript, which describes the data set used for the protocol; nevertheless, it contains details on the methodology that we used in this manuscript. The database and scripts are available through GitHub (see Data sharing), and the results are disseminated through the medRxiv preprint server.

Selection of RCTs and extraction of characteristics

RCTs were identified (November 20, 2017) via PubMed, starting with all publications indexed as “randomized controlled trial” using the query “randomized controlled trial[pt] NOT (animals[mh] NOT humans[mh]).” The initial search did not include a time window. Non-English language, nonrandomized, animal, pilot, and feasibility studies were subsequently excluded (see S1 Text for details on the selection procedure). We collected the Portable Document Format (PDF) files for all available RCTs across publishers in journals covered by the library subscription of our institution and converted these PDFs to structured text in XML format using publicly available software (Grobid, available via GitHub). By linking information from PubMed, the full-text publication, and data from Scopus and Web of Science, we extracted metadata on the number of authors, author gender, number of countries and institutions of (co-)authors, and the Hirsch (H)-index of the first and last authors (see S1 Table for details). Moreover, we extracted the JIF at the time of publication. Time was stratified into 5-year periods, as behavioral changes are expected to occur at a relatively slow pace, with the relatively few trials published before 1990 merged into one stratum.
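
To illustrate the identification step, the sketch below sends the same publication-type query to the NCBI E-utilities esearch endpoint. It is a minimal sketch, not the authors' pipeline: the batch size is illustrative, and retrieving the complete result set would additionally require paging via the E-utilities history server.

```python
"""Minimal sketch of the PubMed identification step via the NCBI E-utilities
esearch endpoint; batch size and output handling are illustrative, not the
authors' actual pipeline settings."""
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
QUERY = "randomized controlled trial[pt] NOT (animals[mh] NOT humans[mh])"


def search_rcts(retmax=1000):
    """Return the total hit count and the first `retmax` PubMed IDs for the query.

    Retrieving the complete result set would require paging through the
    E-utilities history server (usehistory=y), which is omitted here.
    """
    params = {"db": "pubmed", "term": QUERY, "retmax": retmax, "retmode": "json"}
    response = requests.get(EUTILS, params=params, timeout=60)
    response.raise_for_status()
    result = response.json()["esearchresult"]
    return int(result["count"]), result["idlist"]


if __name__ == "__main__":
    count, pmids = search_rcts()
    print(f"{count} candidate RCT records; first PMIDs: {pmids[:5]}")
```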

Risk-of-bias assessment

For every included full-text RCT, the risk-of-bias assessment was automatically performed using a machine learning tool developed by RobotReviewer [17]. This tool is optimized for large-scale characterizations [18,19] and is algorithmically based on a large sample of human-rated risk-of-bias reports and extracted supporting texts from trial publications covering the full RCT spectrum. The levels of agreement between RobotReviewer and human raters were similar for most domains (human–human agreement: range 71% to 85%, average 79%; human–RobotReviewer agreement: range 39% to 91%, average 65%) [18,19]. Of the 7 risk-of-bias domains described by Cochrane [20], we assessed 4: random sequence generation and allocation concealment (i.e., selection bias), blinding of participants and personnel (i.e., performance bias), and blinding of outcome assessment (i.e., detection bias). Publication bias and outcome reporting bias were outside the scope of our analysis. The machine learning output yields a probability per domain, with values of 0.5 or below corresponding to “low” risk and values above 0.5 to “high or unclear” risk of bias.
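
As a minimal illustration of how such continuous probabilities can be binarized and summarized per time stratum, the sketch below uses a hypothetical table of per-trial probabilities; the column names and values are assumptions, not the study's data or the RobotReviewer interface.

```python
"""Minimal sketch of binarizing risk-of-bias probabilities and summarizing them
per time stratum; the data frame content is hypothetical."""
import pandas as pd


def binarize(probability: float) -> str:
    """Map a domain probability to the two categories used in the study."""
    return "low" if probability <= 0.5 else "high/unclear"


# Hypothetical records: one row per RCT and risk-of-bias domain.
rob = pd.DataFrame({
    "stratum": ["1990-1995", "1990-1995", "2010-2018", "2010-2018"],
    "domain": ["allocation_concealment"] * 4,
    "probability": [0.71, 0.44, 0.62, 0.38],
})
rob["judgment"] = rob["probability"].apply(binarize)

# Proportion of trials at high/unclear risk per stratum and domain.
share_high = (
    rob.assign(high=rob["judgment"].eq("high/unclear"))
       .groupby(["stratum", "domain"])["high"]
       .mean()
)
print(share_high)
```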

Analysis of trial registration and CONSORT Statement reporting

To check for trial registration, we extracted trial registration numbers from the abstract and full-text publication. We searched for the corresponding trial registration number in 2 online databases: the World Health Organization’s (WHO) International Clinical Trials Registry Platform, composed of worldwide primary and partner registries, and the ClinicalTrials.gov trial registry [21,22]. We checked all full-text publications for at least 1 mention of the words “Consolidated Standards of Reporting Trials” or CONSORT.
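
The sketch below illustrates this kind of full-text screening with two common registry identifier formats (ClinicalTrials.gov NCT numbers and ISRCTN numbers) and a CONSORT pattern. The regular expressions are illustrative assumptions, not the authors' actual patterns, which covered identifiers from the WHO platform registries more broadly.

```python
"""Minimal sketch of the full-text checks for registration numbers and CONSORT
mentions; only two common registry formats are shown, as an assumption."""
import re

REGISTRY_PATTERNS = [
    re.compile(r"\bNCT\d{8}\b"),     # ClinicalTrials.gov identifiers
    re.compile(r"\bISRCTN\d{8}\b"),  # ISRCTN registry identifiers
]
CONSORT_PATTERN = re.compile(r"Consolidated Standards of Reporting Trials|\bCONSORT\b")


def screen_fulltext(text: str) -> dict:
    """Return registration numbers found in the text and whether CONSORT is mentioned."""
    registrations = [m.group(0) for pattern in REGISTRY_PATTERNS for m in pattern.finditer(text)]
    return {
        "registration_numbers": sorted(set(registrations)),
        "mentions_consort": bool(CONSORT_PATTERN.search(text)),
    }


example = "The trial was registered (NCT01234567) and reported following the CONSORT statement."
print(screen_fulltext(example))
```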

Analysis related to journal impact factor

Even though the JIF (the average number of citations received in a given year by articles published in that journal during the preceding 2 years) is not a very suitable indicator of journal quality [23], no unbiased alternatives exist. Therefore, in our study, we used the JIF as a proxy to identify journals with high publication standards and high rejection rates. For each trial, we selected the JIF of the year before trial publication. We used a JIF threshold of 10 as the primary cutoff based on the JIF distributions (see S2 Table and S1 Fig) and previous evidence that this cutoff is sensitive for assessing RCT quality [15]. However, we also performed sensitivity analyses with JIF cutoff thresholds of 3 and 5.
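
A minimal sketch of this stratification step is shown below: each trial is matched to the JIF of its journal in the year before publication and then binned at the cutoffs used in the study. The journal names, impact factors, and column names are hypothetical.

```python
"""Minimal sketch of the JIF stratification step; journal names, impact factors,
and column names are hypothetical."""
import pandas as pd

# Hypothetical journal impact factors per year.
jif = pd.DataFrame({
    "journal": ["J Example Med", "J Example Med", "Illustr Trials"],
    "year": [2014, 2015, 2015],
    "jif": [11.2, 12.0, 2.4],
})

# Hypothetical trial metadata.
trials = pd.DataFrame({
    "pmid": [111, 222],
    "journal": ["J Example Med", "Illustr Trials"],
    "pub_year": [2016, 2016],
})

# Use the JIF of the year before publication, then bin at the study's cutoffs.
trials["jif_year"] = trials["pub_year"] - 1
merged = trials.merge(
    jif, left_on=["journal", "jif_year"], right_on=["journal", "year"], how="left"
)
for cutoff in (3, 5, 10):
    merged[f"jif_gt_{cutoff}"] = merged["jif"] > cutoff

print(merged[["pmid", "jif", "jif_gt_3", "jif_gt_5", "jif_gt_10"]])
```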

Analyses related to medical disciplines

We assigned RCTs to medical disciplines based on the journal category (Web of Science) [9]. As a secondary analysis, we examined medical disciplines separately. Medical disciplines with fewer than 4,000 RCTs in our sample were assigned to the category “Other.”

Power calculation and statistics

No formal a priori power calculation was performed, as the aim of this project was to include all RCTs available on PubMed. Temporal patterns in the individual risk-of-bias predictions were modeled with regression analysis. Reported P values correspond to the overall trend estimate or to the comparison of the average value per year in the 1990 to 1995 and 2010 to 2018 strata obtained from post hoc Tukey-corrected comparisons. This post-1990 period was chosen to cover the first years following significant awareness of the need to report transparently, in comparison with the latest years in our data set. Temporal patterns in trial registration and CONSORT Statement reporting were modeled with logistic regression. Because median values were very close to mean values, the data are presented as mean ± 95% confidence intervals.
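
As an illustration of these trend models, the sketch below fits a logistic regression of a binary registration indicator on publication year and a linear regression of a continuous risk-of-bias probability on year, using simulated data. The actual analyses were run in R; all variable names and values here are illustrative assumptions.

```python
"""Minimal sketch of the temporal trend models on simulated data; the actual
analyses were run in R, and all values below are illustrative."""
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
trials = pd.DataFrame({"year": rng.integers(1990, 2019, size=n)})

# Simulated binary outcome: registration number reported (probability rises with year).
trials["registered"] = (rng.random(n) < (trials["year"] - 1989) / 60).astype(int)

# Simulated continuous risk-of-bias probability that slowly declines over time.
trials["rob_prob"] = np.clip(
    0.9 - 0.01 * (trials["year"] - 1990) + rng.normal(0, 0.1, n), 0, 1
)

# Logistic regression of registration status on publication year.
logit_fit = smf.logit("registered ~ year", data=trials).fit(disp=False)
print(logit_fit.params)  # log-odds change per calendar year

# Linear regression for the temporal trend in the risk-of-bias probability.
ols_fit = smf.ols("rob_prob ~ year", data=trials).fit()
print(ols_fit.params)
```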

Risk-of-bias assessment validation

To determine the accuracy of the automated risk-of-bias assignment to RCTs in our database, we validated a large subset of RCTs against human assessments obtained from systematic reviews. To this end, we inspected all PMIDs of RCTs included in the 8,075 systematic reviews (Issue 6, 2020) published in the Cochrane Database of Systematic Reviews (CDSR). The CDSR is part of the international charitable Cochrane organization, with more than 50 review groups based at research institutions worldwide, aiming to facilitate evidence-based choices about health interventions. Contributions to the database come from these review groups as well as from ad hoc teams. The systematic reviews in the CDSR were identified through PubMed for the period 2000 to 2020. This was done with the NCBI’s EUtils API using the following query: “Cochrane Database Syst Rev”[journal] AND (“2000/01/01”[PDAT] : “2020/05/31”[PDAT]). We limited the assessment to the latest systematic review updates to prevent overlap between included RCTs. Review protocols were excluded.

All systematic review tables containing the words “bias” and “risk” were inspected for human expert risk-of-bias text associated with a PMID. If a PMID matched a PMID in our full-text database and the risk-of-bias domain concerned “allocation concealment,” “random sequence generation,” “blinding of participants and personnel,” or “blinding of the outcome,” the human risk-of-bias text was extracted.

All human risk-of-bias texts and assigned judgments were manually inspected and assigned to the proper risk-of-bias category. Judgments were binarized into “low” and “high/unclear” risk of bias and used to validate our automated binarized risk-of-bias probabilities (i.e., ≤0.5 “low” risk and >0.5 “high/unclear” risk) in terms of accuracy, sensitivity, specificity, and kappa. The accuracy is expressed as the proportion of RCTs correctly categorized by the model, namely (true positives + true negatives) divided by (true positives + false positives + false negatives + true negatives).
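
The sketch below shows how these validation metrics follow from a two-by-two table of human judgments versus automated predictions; the counts are made up for illustration and are not the study's data.

```python
"""Minimal sketch of the validation metrics from a two-by-two table of human
judgments versus automated predictions; the counts are hypothetical."""
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# 1 = high/unclear risk, 0 = low risk (binarized as in the study).
human = np.array([1] * 700 + [0] * 300 + [1] * 100 + [0] * 900)
automated = np.array([1] * 700 + [1] * 300 + [0] * 100 + [0] * 900)

tn, fp, fn, tp = confusion_matrix(human, automated).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)
sensitivity = tp / (tp + fn)  # high/unclear correctly flagged by the tool
specificity = tn / (tn + fp)  # low risk correctly recognized by the tool
kappa = cohen_kappa_score(human, automated)

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} kappa={kappa:.3f}")
```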

Results

RCT full-text acquisition process

From the 445,159 PubMed entries for RCTs, we identified 306,737 eligible RCTs (see flowchart in Fig 1). Full-text articles were obtained for 183,927 RCTs. RCT publications with an uncertain year of publication (7,307) were excluded, resulting in a final sample size of 176,620 RCTs. The distribution of the risk-of-bias domain probabilities of included and excluded RCTs was comparable, with very similar interquartile ranges (S3 Table). The prevalence of the “With CONSORT Statement” and “RCT Registration” outcomes was 3% to 4% lower in the excluded RCTs (S4 Table). This full-text sample size is in line with the overall number of potential RCTs in the PubMed database (S2 Fig).

Fig 1. Flowchart of how the final full-text randomized clinical trials were obtained. RCT, randomized controlled trial.

RCT characteristics over time

RCTs for which full texts were obtained were predominantly published in the 3 most recent strata, with 89,373 publications in 2010 to 2018 (11,172 per year) compared to 6,066 publications between 1990 and 1995 (1,213 per year; Fig 2A). Over time, the average number of authors steadily increased from 5.2 (CI: 5.12 to 5.26) in 1990 to 1995 to 7.8 in 2010 to 2018 (CI: 7.76 to 7.83; P < 0.0001 for post hoc category difference) (Fig 2B). This was accompanied by a steady increase in the number of countries and institutions affiliated with all authors (number of institutions in 1990 to 1995: 3.24 (CI: 3.17 to 3.31) versus 2010 to 2018: 4.84 (CI: 4.81 to 4.87; P < 0.0001 for post hoc category difference)) (Fig 2C and 2D). Additional analyses showed that the percentage of female authors in RCTs gradually increased over time, as did the H-index of the first and last authors (S3 Fig).

Fig 2. Number of RCTs included over time (A) and the corresponding average number of authors (B), the number of countries involved (C), and the number of institutions involved (D). The indicated stratum range is up to but not including the last year. Raw data are available via DOI 10.5281/zenodo.4362238. RCT, randomized controlled trial.

Risk of bias, registration, and reporting: Trends over time

We found an overall continuous reduction in the risk of bias due to inadequate allocation concealment, dropping from 62.8% (CI: 62.4% to 63.1%) in trials published in 1990 to 1995 to 50.9% (CI: 50.7% to 51.0%; P < 0.0001 for overall trend) in trials published between 2010 and 2018 (Fig 3A). There was a relatively stronger decrease in the risk of bias due to nonrandom sequence generation, from 54.0% (CI: 53.6% to 54.5%) for trials in 1990 to 1995 to 36.4% (36.3% to 36.6%; P < 0.001 for overall trend) for trials in 2010 to 2018 (Fig 3B). The risk of bias due to not blinding participants and personnel showed a distinctly different pattern, with a sequential increase since 2000 up to 56.9% (56.8% to 57.1%; P < 0.001 for overall trend) in 2010 to 2018, after an initial decrease (Fig 3C). The risk of bias due to not blinding outcome assessment decreased over time, from 56.6% (56.4% to 56.9%) to 51.8% (51.7% to 51.9%; P < 0.001 for overall trend; Fig 3D). In all RCTs, mention of a trial registration number rapidly increased from close to 0 (0.37% for 1990 to 1995, n = 6,272) to up to 46.7% (46.4% to 47.0%; n = 89,373) in 2010 to 2018 (Fig 3E). We found very low reporting (1.05% (CI: 0.82% to 1.35%) in 1990 to 1995, n = 6,272) of the CONSORT Statement in full-text RCT publications, contrasting with 19.5% (19.3% to 19.8%; P < 0.0001 for post hoc category difference; n = 89,373) of all trials between 2010 and 2018 (Fig 3F).

Fig 3. Risk of bias due to inadequate allocation concealment (A), random sequence generation bias (B), the bias in blinding of patients and personnel (people) (C), the bias in blinding of outcome assessment (D), RCT registration (E), and reporting of the CONSORT Statement (F) for all RCTs plotted over time. The indicated stratum range is up to but not including the last year. Raw data are available via DOI 10.5281/zenodo.4362238. RCT, randomized controlled trial.

Risk of bias and reporting: Relation with journal impact factor

The probability of bias due to inadequate allocation concealment was consistently lower for RCTs published in journals with a JIF larger than 10 (P < 0.001; P < 0.001 for overall trend, Fig 4A). This also applied to randomization, blinding of participants and personnel, and blinding of outcome assessment, although the differences were less pronounced than for allocation concealment bias (P < 0.0001, all domains, for the latest time point; Fig 4B–4D). For allocation bias, randomization bias, and blinding of outcome bias, the overall trends showed a decrease over time (P < 0.001). Large differences were found in trial registration and reporting of CONSORT between RCTs published in higher and lower impact journals (P < 0.0001 for the latest time point; P < 0.001 for overall trend, Fig 4E and 4F). Moreover, 73% (72% to 74%) of trials in journals with a JIF higher than 10 were registered, and 26% (25% to 27%) reported the CONSORT Statement between 2010 and 2018 (both measures: P < 0.0001 in comparison with JIF <10). Sensitivity analyses with JIF cutoff values of 3 and 5, respectively, yielded comparable but smaller differences, with reduced bias and increased registration and mention of the CONSORT Statement in journals with a higher JIF (S4 and S5 Figs).

Fig 4. Risk of bias in allocation concealment (A), the bias in randomization (B), the bias in blinding of patients and personnel (people) (C), the bias in blinding of outcome assessment (D), RCT registration (E), and CONSORT Statement reporting (F) plotted over time for RCTs published in journals with JIF >10 and journals with JIF <10. The indicated stratum range is up to but not including the last year. Raw data are available via DOI 10.5281/zenodo.4362238. JIF, journal impact factor; RCT, randomized controlled trial.

Risk of bias and reporting: Relation with medical discipline

The risk-of-bias patterns substantially differed across medical disciplines (S5 Table). The lowest probabilities of bias were found in RCTs within the field of anesthesiology (27% randomization bias, 43% allocation concealment bias, 45% risk of bias due to insufficient blinding of participants and personnel, and 45% bias in blinding of outcome assessment) (S6 Fig). The field of oncology had the highest levels of trial registration (43.4%) and mention of the CONSORT Statement (30.3%) (S7 Fig). Registration rates were lowest in the field of endocrinology and metabolism (8.0%) and urology and nephrology (10.2%).

Risk-of-bias assessment validation

In total, 63,327 matching human risk-of-bias judgments and automated risk-of-bias predictions were extracted from 17,394 unique RCTs included in a Cochrane systematic review (Table 1). Overall, automated accuracy in determining “randomization bias,” “allocation bias,” and “blinding of people bias” was above 70%. The “blinding of outcome bias” had a lower accuracy of 62.7%. The distribution of risk-of-bias predictions is shown in S8 Fig.

Table 1. Validation of the 4 risk-of-bias domains between automated and human assessments in RCTs obtained from the Cochrane systematic reviews risk-of-bias tables.

Domain | RCTs | Judgment (K) | Accuracy, % (95% CI) | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Kappa, % (95% CI)
Random sequence generation | 15,799 | High-Unclear: 5,840; Low: 9,959 | 72.1 (71.4–72.8) | 63.5 (62.3–64.8) | 76.5 (75.7–77.3) | 39.1 (37.5–40.6)
Allocation concealment | 19,058 | High-Unclear: 9,672; Low: 9,386 | 70.7 (70.1–71.4) | 69.3 (68.4–70.2) | 72.5 (71.5–73.4) | 41.3 (40.0–42.6)
Blinding of participants and personnel | 2,121 | High-Unclear: 1,400; Low: 721 | 74.8 (72.9–76.6) | 79.8 (77.8–81.9) | 63.8 (60.2–67.5) | 42.8 (38.7–47.0)
Blinding of outcome assessment | 26,349 | High-Unclear: 14,009; Low: 12,340 | 62.7 (62.1–63.3) | 63.3 (62.6–64.1) | 61.7 (60.8–62.6) | 24.5 (23.3–25.6)

RCT, randomized controlled trial.

We also classified the chance-corrected level of agreement in terms of kappa. Kappa ranges from “poor” (<0.00), “slight” (0.00 to 0.20), “fair” (0.21 to 0.40), and “moderate” (0.41 to 0.60) to “substantial” (0.61 to 0.80) and “almost perfect” (0.81 to 1.0) [24]. Within this scale, moderate agreement was found for allocation bias and blinding of people bias, and fair agreement was found for randomization bias and blinding of outcome bias (Table 1).

Discussion

We analyzed a total of 176,620 full-text publications of RCTs between 1966 and 2018 and show that the landscape of RCTs has considerably changed over time. The likelihood of bias in RCTs has generally decreased over the last decades. This optimistic trend may be driven by increased knowledge augmented by mandatory trial registration and more stringent reporting guidelines and journal requirements. Nevertheless, relatively high probabilities of bias remain, particularly in journals with lower impact factors.

Regarding the risk of bias, the trends that emerge from our analyses are certainly hopeful. The risk of bias in RCTs declined over the past decades, with decreasing trends for bias related to random sequence generation, allocation concealment, and blinding of outcome assessment. Accordingly, an increasing percentage of RCTs are registered in public trial registers and report use of the CONSORT guidelines. Despite 2 decades of documentation and calls for trial registration, registration only increased substantially around 2004, when trial registration was made a condition for publication by the ICMJE. This policy was implemented and supported by WHO in July 2005 [25]. Our results are in line with the assessment of 20,920 RCTs from Cochrane reviews in 2017 that found improvements in reporting and methods over time for sequence generation and allocation concealment [15].

Notwithstanding these improvements, it is also clear that there is still a pressing need to further improve RCTs’ methodological quality. The average risk in each of the bias domains remains generally high (around 50%), and bias related to blinding of participants and personnel increased over time, which may be due to more pragmatic or nondrug RCTs being performed. Moreover, despite the requirement of trial registration for publication since 2004, a substantial percentage of RCTs published in 2017 still did not report registration numbers. This suggests a lack of registration in a subset of RCTs, although the absence of registration numbers does not necessarily imply the absence of registration. Furthermore, many RCTs do not mention the CONSORT guidelines in their full text, and this is more common in journals with lower impact factors. Despite the accessibility of reporting guidelines, researchers are generally not required to adhere to them. More problematically, requirements are not strictly enforced, and noncompliance with the items of the reporting guideline is not sanctioned [4,26]. It is important to note that not mentioning CONSORT does not necessarily imply nontransparent reporting, although explicit mention is preferable. To further improve RCTs’ methodological quality and reliability, there is still a long way to go. The rather slow progress may be due to the complex nature of conducting RCTs. Better education, enforcement, and (dis)incentives may be inevitable. Additionally, making data sets available according to the FAIR principles will arguably improve the situation [27].

Depending on expectations and future goals, the interpretation of our findings can be either optimistic or pessimistic: optimistic because, over the past decades, there has been considerable improvement in RCT conduct and reporting, but pessimistic because the improvements are proceeding at a rather slow pace. From our analyses, it also appears that journals with a higher JIF generally publish RCTs with lower predicted probabilities across the risk-of-bias domains. Our results confirm previous findings that a higher JIF (above 10) is associated with a lower proportion of trials at unclear or high risk of bias in Cochrane reviews [15]. Even though JIFs are not a very suitable measure of journal quality, our results are in line with previous studies showing that a higher JIF is related to higher methodological quality of RCTs [28]. Finally, risk-of-bias predictions differ substantially across medical disciplines, which may be related to the type and size of RCTs (e.g., more pragmatic RCTs) conducted in those disciplines.

With regard to authors, not only is there a growing number of authors and institutions from a growing number of countries involved in publications, but also a steadily increasing average H-index of the first and last author. The absolute and relative numbers of female authors in RCTs also gradually increased over time, with a large rise in first and last female authorships. Even though trends increase over time, the average percentage of female last authorships remains relatively low at 29% (2010 to 2018), in line with recent literature [29].

There are several strengths and limitations inherent to our approach of automated extraction of full-text RCT publications. The automated and uniform approach yielded an unprecedentedly large and rich data source on RCTs from the last decades, covering a large proportion of all published RCTs included in PubMed, which is available for further study (see https://github.com/wmotte/RCTQuality or DOI 10.5281/zenodo.4362238 for the data). Moreover, we validated our automated risk-of-bias predictions against standardized human risk-of-bias judgments made by trained reviewers for Cochrane systematic reviews (Table 1). Nevertheless, there are several limitations. First, our selection of RCTs may have been biased, as full-text RCTs may not have been missing at random (e.g., missingness may depend on the year of publication or RCT methodology). A substantial proportion of RCTs could not be included because we did not have access to their full text, making it difficult to draw definitive conclusions about the missing RCTs concerning the risk of bias, CONSORT Statement reporting, and trial registration. However, an additional analysis of the more than 7,000 RCTs with an uncertain year of publication (7,307) that were excluded yielded a similar distribution of the risk-of-bias domain probabilities.

Moreover, the included RCTs in our analyses were generally in line with the overall number of potential RCTs in the PubMed database (S2 Fig), except for the period between 1993 and 1998, when libraries switched from scanned versions to online subscriptions, potentially lowering the yield of automated full-text downloads. However, this period predates the release and implementation of the discussed guidelines, and our sample still maps the important patterns in the overall RCT publications over time. Second, the risk of bias is inherently difficult to assess reliably. Experts’ assessments of trials show that labeling the same trials for different Cochrane reviews resulted in substantial differences [30,31]. Probabilities assigned with machine learning are based on a large set of human-assigned labels, and a direct comparison shows a computerized assessment performance of 71.0% agreement [32]. Our manual validation, based on 63,327 extracted and standardized human risk-of-bias assessments as published in the CDSR, showed accuracies of over 70% for 3 of the 4 domains and kappa values between 0.25 and 0.43. Even though this sounds far from impressive, it should be borne in mind that agreement between human reviewers is often not much higher. Classification of risk-of-bias domains is a difficult task, both for humans and for automated software. For example, based on 376 RCTs, the overall kappa values of interrater risk-of-bias judgments ranged from 0.40 to 0.42, although agreement for some domains was clearly higher, such as for random sequence bias between human raters (kappa 95% CI: 0.56 to 0.71) [30]. The low agreement between human raters is precisely why the risk-of-bias tool was recently revised into version 2 [33]. The relatively high variability and imperfect accuracy are particularly problematic for individual trial characterization, where the inclusion or exclusion of an individual trial due to an incorrect risk assessment can have a large impact. In our study, however, we applied the risk-of-bias characterization differently and did not focus on individual trials but rather studied patterns in risk-of-bias distributions. Third, we did not investigate all aspects of methodological rigor. In particular, we did not check for forms of attrition bias (e.g., incomplete outcome data) or reporting bias (e.g., selective outcome reporting). Fourth, even though the CONSORT Statement was introduced to improve RCT reporting [8], the rapid increase in RCTs that mention following the CONSORT guideline does not guarantee adherence, and the quality of reporting can remain suboptimal [14]. We were not able to automatically distinguish the acronym CONSORT from conventional, non-acronym uses of the word “consort.” This may have slightly inflated our CONSORT Statement percentages and explains the very low but nonzero values in the earliest stratum.

Our comprehensive picture of the methodological quality of RCTs provides quantitative insight into the current state and into trends over time. With many thousands of RCTs being published each year and thousands of clinical trials currently recruiting patients, this can help us to better understand the current situation and to find solutions for further improvement. These could include more stringent adoption of measures to enforce transparent and credible trial publication, as well as fine-tuning of stricter registration regulations. In conclusion, our comprehensive analyses of a large body of full-text RCTs show a slow and gradual improvement of methodological quality over the last decades. While RCTs certainly face challenges regarding their quality and reproducibility, and there is still ample room for improvement, our study is a first step in showing that the efforts that have been made to improve RCT practices may be paying off.

Supporting information

S1 Text. Detailed data collection procedures.

(DOCX)

S1 Table. Operationalization of variables for RCTs, authors, institutions, and journals.

RCT, randomized controlled trial.

(DOCX)

S2 Table. All journals with JIF higher than 10 in the year preceding any of the individual publications in our data set.

JIF, journal impact factor.

(DOCX)

S3 Table. Quantiles of estimated risk-of-bias domain probabilities for included and excluded RCTs.

RCT, randomized controlled trial.

(DOCX)

S4 Table. Percentages for the CONSORT Statement and Registration outcomes for included and excluded RCTs.

RCT, randomized controlled trial.

(DOCX)

S5 Table. Total number (N) of trials published in the period 2005–2018 in the different medical disciplines with the number (K) and corresponding proportion (percentage with 95% confidence interval) of trials with a risk-of-bias probability below 50% (i.e., “low risk”).

(DOCX)

S1 Fig. Distribution of JIFs of analyzed RCTs with JIF cutoffs of 3, 5, and 10 (dotted lines).

The JIF of a journal in the year preceding the publication date of the RCT was used. Density represents the probability that a trial belongs to a given impact factor. JIF, journal impact factor; RCT, randomized controlled trial.

(TIFF)

S2 Fig. The number of included RCTs against the total number of RCTs indexed in the PubMed database for the study period (1966–2018).

The years 1993 and 1998 are marked with vertical dashed lines. RCT, randomized controlled trial.

(TIFF)

S3 Fig. Percentage of female authors and H-indices for the first and the last author, per period, for the included RCTs.

RCT, randomized controlled trial.

(TIFF)

S4 Fig

Risk of bias due to inadequate allocation concealment (A), random sequence generation bias (B), the bias in blinding of patients and personnel (people) (C), the bias in blinding of outcome assessment (D), RCT registration (E), and mentioning of the CONSORT Statement (F) plotted over time for RCTs published in journals with JIF >3 and journals with JIF <3. The indicated stratum range is up to but not including the last year. JIF, journal impact factor; RCT, randomized controlled trial.

(TIFF)

S5 Fig

Risk of bias in allocation concealment (A), the bias in randomization (B), the bias in blinding of patients and personnel (people) (C), the bias in blinding of outcome assessment (D), RCT registration (E), and mentioning of the CONSORT Statement (F) plotted over time for RCTs published in journals with JIF >5 and journals with JIF <5. The indicated stratum range is up to but not including the last year. JIF, journal impact factor; RCT, randomized controlled trial.

(TIFF)

S6 Fig. The average risk of bias for trials published in the period 2005–2018 in different medical disciplines.

“random”: bias in randomization; “allocation”: bias in allocation concealment; “blinding of people”: bias in blinding of patients and personnel; “blinding outcome”: bias in the blinding of outcome assessment.

(TIFF)

S7 Fig. Presence of RCT registration and CONSORT Statement in trials published between 2005 and 2018.

RCT, randomized controlled trial.

(TIFF)

S8 Fig. The machine learning risk-of-bias probabilities are plotted as density profiles against the human rater risk categories: “High-Unclear” and “Low” for 63,327 matching RCTs.

RCT, randomized controlled trial.

(TIFF)

Abbreviations

CDSR, Cochrane Database of Systematic Reviews; H-index, Hirsch index; ICMJE, International Committee of Medical Journal Editors; JIF, journal impact factor; PDF, Portable Document Format; RCT, randomized controlled trial; WHO, World Health Organization

Data Availability

The risk-of-bias characterization was done with customized large-batch Python scripts (version 3; https://github.com/wmotte/robotreviewer_prob). The data management and analyses used R (version 3.6.1). All data, including code and risk-of-bias data, are available at https://github.com/wmotte/RCTQuality.

Funding Statement

This work was supported by The Netherlands Organisation for Health Research and Development (ZonMw) grant “Fostering Responsible Research Practices” (445001002) (CV, JT, WO). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Macleod MR, Michie S, Roberts I, Dirnagl U, Chalmers I, Ioannidis JP, et al. Biomedical research: increasing value, reducing waste. Lancet. 2014;383(9912):101–4. 10.1016/S0140-6736(13)62329-6 . [DOI] [PubMed] [Google Scholar]
  • 2.Prayle AP, Hurley MN, Smyth AR. Compliance with mandatory reporting of clinical trial results on ClinicalTrials.gov: cross sectional study. BMJ. 2012;344:d7373. 10.1136/bmj.d7373 . [DOI] [PubMed] [Google Scholar]
  • 3.Ioannidis JP. Why most discovered true associations are inflated. Epidemiology. 2008;19(5):640–8. 10.1097/EDE.0b013e31818131e7 . [DOI] [PubMed] [Google Scholar]
  • 4.Yordanov Y, Dechartres A, Porcher R, Boutron I, Altman DG, Ravaud P. Avoidable waste of research related to inadequate methods in clinical trials. BMJ. 2015;350:h809. 10.1136/bmj.h809 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Bilandzic A, Fitzpatrick T, Rosella L, Henry D. Risk of Bias in Systematic Reviews of Non-Randomized Studies of Adverse Cardiovascular Effects of Thiazolidinediones and Cyclooxygenase-2 Inhibitors: Application of a New Cochrane Risk of Bias Tool. PLoS Med. 2016;13(4):e1001987. 10.1371/journal.pmed.1001987 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Glasziou P, Altman DG, Bossuyt P, Boutron I, Clarke M, Julious S, et al. Reducing waste from incomplete or unusable reports of biomedical research. Lancet. 2014;383(9913):267–76. 10.1016/S0140-6736(13)62228-X . [DOI] [PubMed] [Google Scholar]
  • 7.Chalmers I, Glasziou P. Avoidable waste in the production and reporting of research evidence. Lancet. 2009;374(9683):86–9. Epub 2009 Jun 16. 10.1016/S0140-6736(09)60329-9 . [DOI] [PubMed] [Google Scholar]
  • 8.Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, et al. Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA. 1996;276(8):637–9. 10.1001/jama.276.8.637 . [DOI] [PubMed] [Google Scholar]
  • 9.Zwierzyna M, Davies M, Hingorani AD, Hunter J. Clinical trial design and dissemination: comprehensive analysis of clinicaltrials.gov and PubMed data since 2005. BMJ. 2018;361:k2130. 10.1136/bmj.k2130 http://www.icmje.org/coi_disclosure.pdf. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Schulz KF, Altman DG, Moher D, CONSORT Group. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ. 2010;340:c332. 10.1136/bmj.c332 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Chalmers I, Bracken MB, Djulbegovic B, Garattini S, Grant J, Gulmezoglu AM, et al. How to increase value and reduce waste when research priorities are set. Lancet. 2014;383(9912):156–65. 10.1016/S0140-6736(13)62229-1 . [DOI] [PubMed] [Google Scholar]
  • 12.Powell-Smith A, Goldacre B. The TrialsTracker: Automated ongoing monitoring of failure to share clinical trial results by all major companies and research institutions. F1000Res. 2016;5:2629. 10.12688/f1000research.10010.1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Moher D, Bouter L, Kleinert S, Glasziou P, Sham M, Barbour V, et al. The Hong Kong Principles for Assessing Researchers: Fostering Research Integrity. OSF Preprints. 2019. 10.31219/osf.io/m9abx [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Turner L, Shamseer L, Altman DG, Weeks L, Peters J, Kober T, et al. Consolidated standards of reporting trials (CONSORT) and the completeness of reporting of randomised controlled trials (RCTs) published in medical journals. Cochrane Database Syst Rev. 2012;11:MR000030. 10.1002/14651858.MR000030.pub2 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Dechartres A, Trinquart L, Atal I, Moher D, Dickersin K, Boutron I, et al. Evolution of poor reporting and inadequate methods over time in 20 920 randomised controlled trials included in Cochrane reviews: research on research study. BMJ. 2017;357:j2490. 10.1136/bmj.j2490 . [DOI] [PubMed] [Google Scholar]
  • 16.Damen JA, Lamberink HJ, Tijdink JK, Otte WM, Vinkers CH, Hooft L, et al. Predicting questionable research practices in randomized clinical trials. Open Science Framework. 2018. Available from: https://www.osf.io/27f53/. [Google Scholar]
  • 17.Marshall IJ, Kuiper J, Wallace BC. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. J Am Med Inform Assoc. 2016;23(1):193–201. 10.1093/jamia/ocv044 ; PMCID: PMC4713900. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Marshall IJ, Kuiper J, Wallace BC. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. J Am Med Inform Assoc. 2016;23(1):193–201. 10.1093/jamia/ocv044 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Gates A, Vandermeer B, Hartling L. Technology-assisted risk of bias assessment in systematic reviews: a prospective cross-sectional evaluation of the RobotReviewer machine learning tool. J Clin Epidemiol. 2018;96:54–62. 10.1016/j.jclinepi.2017.12.015 . [DOI] [PubMed] [Google Scholar]
  • 20.Higgins JP, Altman DG, Gotzsche PC, Juni P, Moher D, Oxman AD, et al. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ. 2011;343:d5928. 10.1136/bmj.d5928 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Viergever RF, Li K. Trends in global clinical trial registration: an analysis of numbers of registered clinical trials in different parts of the world from 2004 to 2013. BMJ Open. 2015;5(9):e008932. 10.1136/bmjopen-2015-008932 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Hopewell S, Clarke M, Moher D, Wager E, Middleton P, Altman DG, et al. CONSORT for reporting randomized controlled trials in journal and conference abstracts: explanation and elaboration. PLoS Med. 2008;5(1):e20. 10.1371/journal.pmed.0050020 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Garfield E. The history and meaning of the journal impact factor. JAMA. 2006;295(1):90–3. 10.1001/jama.295.1.90 . [DOI] [PubMed] [Google Scholar]
  • 24.Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74. . [PubMed] [Google Scholar]
  • 25.De Angelis CD, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, et al. Is this clinical trial fully registered? A statement from the International Committee of Medical Journal Editors. Lancet. 2005;365(9474):1827–9. 10.1016/S0140-6736(05)66588-9 . [DOI] [PubMed] [Google Scholar]
  • 26.Mathieu S, Boutron I, Moher D, Altman DG, Ravaud P. Comparison of registered and published primary outcomes in randomized controlled trials. JAMA. 2009;302(9):977–84. 10.1001/jama.2009.1242 . [DOI] [PubMed] [Google Scholar]
  • 27.Lo B. Sharing clinical trial data: maximizing benefits, minimizing risk. JAMA. 2015;313(8):793–4. 10.1001/jama.2015.292 . [DOI] [PubMed] [Google Scholar]
  • 28.Gluud LL, Sorensen TI, Gotzsche PC, Gluud C. The journal impact factor as a predictor of trial quality and outcomes: cohort study of hepatobiliary randomized clinical trials. Am J Gastroenterol. 2005;100(11):2431–5. 10.1111/j.1572-0241.2005.00327.x . [DOI] [PubMed] [Google Scholar]
  • 29.West JD, Jacquet J, King MM, Correll SJ, Bergstrom CT. The role of gender in scholarly authorship. PLoS ONE. 2013;8(7):e66212. 10.1371/journal.pone.0066212 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Konsgen N, Barcot O, Hess S, Puljak L, Goossen K, Rombey T, et al. Inter-review agreement of risk-of-bias judgments varied in Cochrane reviews. J Clin Epidemiol. 2019;120:25–32. 10.1016/j.jclinepi.2019.12.016 . [DOI] [PubMed] [Google Scholar]
  • 31.Armijo-Olivo S, Ospina M, da Costa BR, Egger M, Saltaji H, Fuentes J, et al. Poor reliability between Cochrane reviewers and blinded external reviewers when applying the Cochrane risk of bias tool in physical therapy trials. PLoS ONE. 2014;9(5):e96920. 10.1371/journal.pone.0096920 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Marshall IJ, Kuiper J, Banner E, Wallace BC. Automating Biomedical Evidence Synthesis: RobotReviewer. Proc Conference Assoc Comput Linguist Meet. 2017;2017:7–12. 10.18653/v1/P17-4002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Sterne JAC, Savovic J, Page MJ, Elbers RG, Blencowe NS, Boutron I, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. 2019;366:l4898. Epub 2019 Aug 30. 10.1136/bmj.l4898 . [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Aaron Nicholas Bruns

26 Aug 2020

Dear Dr Vinkers,

Thank you for submitting your manuscript entitled "Randomized controlled trial quality has improved over time but is still not good enough: an analysis of 176,620 randomized controlled trials published between 1966 and 2018" for consideration as a Meta-Research Article by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff as well as by an academic editor with relevant expertise and I am writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Aug 28 2020 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Aaron

Aaron Nicholas Bruns, Ph.D.,

Associate Editor

PLOS Biology

Decision Letter 1

Roland G Roberts

4 Nov 2020

Dear Dr Vinkers,

Thank you very much for submitting your manuscript "Randomized controlled trial quality has improved over time but is still not good enough: an analysis of 176,620 randomized controlled trials published between 1966 and 2018" for consideration as a Meta-Research Article at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by four independent reviewers.

IMPORTANT: You’ll see that the reviewers are broadly positive about your study, but raise a number of issues, most of which can be addressed by textual revisions, but some of which (e.g. for reviewers #1 and #3) may need further analyses. In addition, the Academic Editor has provided some guidance which I have pasted into the foot of the email; in those comments, s/he suggests some analyses that might help address some of the reviewers' concerns, and notes some departures from the OSF protocol. These comments should also be attended to.

In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' and Academic Editor's comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers.

We expect to receive your revised manuscript within 3 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Associate Editor,

abruns@plos.org,

PLOS Biology

*****************************************************

REVIEWERS' COMMENTS:

Reviewer's Responses to Questions

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Prof. Rustam Al-Shahi Salman

Reviewer #3: No

Reviewer #4: No

Reviewer #1:

This manuscript aims to report an analysis of the evolution over time of publication characteristics, risk of bias and quality of reporting of randomized controlled trials (RCTs).

It is based on a very large sample of RCTS analyzed with the help of machine learning tools. This is, to date, the largest mapping of RCTs. The manuscript is overall clear and well written. The topic is interesting, although not completely original regarding the evolution of risk of bias over time. I was slightly surprised by the choice of Plos Biology to submit this manuscript. Although Plos Biology is very interested in transparency and good methodological standards, I am not sure this journal publishes a lot of RCTs.

Please find below my comments. I hope you will find them helpful.

Major comments

- The objective should be well-formulated in particular I would be more precise at the end of the introduction section.

- A key question is the validity of the automated tools used to evaluate characteristics (for example author gender) or risk of bias.

o We need more elements to be sure that these tools work well. Regarding risk of bias, the authors indicate an agreement between humans and Robot Reviewer ranging from 39% to 91%. I did not find that very similar to the agreement between humans and 39% is a low rate. Did the authors conduct some manual verifications? The authors report that they link their data with Cochrane data. A validation for risk of bias could be the comparison between risk of bias assessment obtained from the automated tool with corresponding risk of bias reported in Cochrane reviews.

o Regarding risk of bias, it is not fully detailed how it works. It seems that the result is rendered as a probability but this is not clear enough. Is it a probability of being at high risk of bias for the domain from 0 to 100%? Usually, the terms score is not adapted to risk of bias.

o Did the authors exclude trials presenting some particularities for which risk of bias assessment needs to be adapted? For example cluster or cross-over trials?

- The authors evaluated random sequence generation, allocation concealment, blinding of patients and personnel and blinding of outcome assessment as key domains for risk of bias. I did not understand why they did not consider incomplete outcome data as it is another key domain. The authors seem to link this domain with publication bias (methods section, page 7, line 132) or the sentence is unclear?

- It seems from the statistical analysis that the authors evaluated the statistical significance of the comparisons between 1990-1995 and 2010-2018. Why that? A trend test considering all strata may be more appropriate.

- I think there are (too) many results in this study. The authors should be sure to be consistent between objectives and results presented. For example, in the abstract, the objective does not mention risk of bias. Some results are less important. It is interesting to evaluate the trends in numbers of authors, institutions, gender but why the use of positive and negative words. The list of positive, negative and neutral words is probably not exhaustive.

- The term 'quality' is very criticized. I would refer to risk of bias or at a minimum methodological quality

- Some elements are not clear in particular the start date for this study. In the abstract, it seems 1966. But it appears from the method that there is only one stratum before 1990 because there were relatively few trials before 1990. Does it mean that this stratum includes trials from 1966 to 1990? In the flow chart, it rather seems that the authors excluded RCTs published before 1988. Why? It is not mentioned in the methods. Please clarify.

Minor comments

- In the Abstract, Introduction, and Discussion, the authors should be careful regarding the wording used. For example, in the abstract: "responsible research practices and reporting guidelines are increasingly important but whether these efforts have improved RCT quality is unknown." The same in the Introduction section: "In other words, have these initiatives and measures improved the quality, transparency and reproducibility of RCTs?" The objective of this article is to evaluate the evolution over time. Whether improvements over time are related to the development and use of reporting guidelines is another question that cannot be answered with this study design.

- This is not the "use" of CONSORT; this is the "reporting" of CONSORT in the text. I think that many RCTs may be compliant with CONSORT without mentioning the name.

- Introduction:

o First sentence of the second paragraph: instead of "For a longer time period", I would give the date of the first publication of the CONSORT Statement. I would not present CONSORT and trial registration in the same sentence, as this suggests a link between them.

- In the Methods section, the authors report a search of MEDLINE; "via PubMed" should be added after it to be consistent with other parts of the manuscript.

- Results section

o When mentioning trial registration, it would be interesting to give the number of trial registrations from 2005 and from 2010. The same for CONSORT from 2000.

o Page 14, line 230: it is not clear what the p-value reported here refers to.

Reviewer #2:

[identifies himself as Rustam Al-Shahi Salman]

I congratulate the authors for completing an enormous and impactful body of work. I have no suggestions for improvement.

Reviewer #3:

This is a meta-research study investigating the evolution of the quality of reporting and key methods of randomized controlled trials (RCTs) using meta-data from a huge number of published RCTs. To handle such an impressive number of articles (> 175,000), the authors relied on the use of machine learning (ML) algorithms to extract relevant data. The sample size is a major asset of the present study compared to many other meta-research studies, which have been hampered by much smaller sample sizes. But there are some limitations, too, that could be better taken into account or better emphasized.

Major comments

1. A total of 122,810 RCTs were excluded for technical reasons (no institutional license, failure to download the full text …), which represents almost 70% of the number included. In an RCT or an epidemiological study, it would be considered poor methodology to exclude 41% of the sample, and this deserves discussion. I do not claim that the results of the study are not worthwhile, or more biased than those of other meta-research studies, which have often used additional eligibility criteria (e.g., by focusing on a more limited range of journals). But again, we expect extreme rigor for this type of study, and the choices leading to the analyzed sample size and their possible impact on the results should be thoroughly discussed.

2. Automatic extraction of information from meta-data is at the core of the study, and it would have been virtually impossible to check all this by hand. But the algorithms used are imperfect (not that I claim humans would be perfect in determining the risk of bias, for instance). Some years ago, I participated in a project where we used Genderize, for instance, and the results were not perfect. Was there an attempt to manually check the automated results on a subsample? Otherwise, some presentation of the performance of the algorithms and their possible impact on the results may be useful. For RoB, the agreement of the ML algorithm with humans is reported to be 71% (or 65%; different figures and different references are used in the Methods and Discussion, which is confusing). What impact could this have on the results? In particular, do the reported confidence intervals account for possible misclassification? (A sketch of such a correction is given after these major comments.)

3. The number of publications has been increasing over the years, and as a mechanical consequence, so have citations. The average impact factor of journals has also increased, as has the proportion of journals indexed in the Journal Citation Reports with an IF > 10, for instance. More precisely, 1.1% of all journals listed in the JCR (SCIE) in 1999 had an IF > 10, 1.7% in 2009, and 3.0% in 2019. Those proportions may well have been higher among the journals considered in the review (but this is only a guess). As a result, why not use a relative IF cut-off (e.g., top 10% or top 5%) instead of a fixed IF cut-off (≤10 vs >10)? (A sketch of such a year-specific cut-off is also given after these major comments.) Also, Table S2 lists journals classified as having an IF > 10 for all individual publications in the dataset, but, for instance, Leukemia's IF only reached 10 or more in 2012; similarly, for Annals of Oncology it was 2016. It is striking that no RCT from these journals would then have been included before those years. Perhaps an IF other than that of the JCR (Clarivate Analytics) was used, but this should be specified.

4. The average increase in citations may also explain why the h-index of authors increased. It is therefore difficult to disentangle trends specific to RCTs from global trends in the whole medical literature. Asking for a specific answer to this point would be unfair to some extent, because it may require a large amount of work, but from a scientific point of view, the question of whether the h-index of authors increased more for RCTs than for other types of studies seems more interesting than simply describing the evolution of the h-index. That said, I wonder whether any question about the evolution of the h-index of authors is really relevant and useful in the present study, which is not about the evolution of the citation network among authors of the biomedical literature.

5. Risk of bias is not a binary concept in Cochrane's evaluation. The RoB 2 tool classifies risk-of-bias judgments as low risk of bias, some concerns, and high risk of bias. What was considered "risk of bias" in the study: "high risk of bias"? More clarity would be welcome.
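
To illustrate my point 2 above about misclassification (with hypothetical sensitivity and specificity, not validated figures), a Rogan-Gladen-type correction shows how an observed proportion of trials flagged as high/unclear risk could shift once classifier error is taken into account:

    # Illustration with hypothetical numbers: impact of imperfect classification
    # on an observed proportion (Rogan-Gladen correction).
    def corrected_proportion(apparent, sensitivity, specificity):
        return (apparent + specificity - 1.0) / (sensitivity + specificity - 1.0)

    apparent = 0.52   # hypothetical observed proportion flagged as high/unclear risk
    print(corrected_proportion(apparent, sensitivity=0.75, specificity=0.70))   # about 0.49

Reporting such corrected figures, or confidence intervals that propagate the uncertainty in sensitivity and specificity, would address the question raised above.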
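
Likewise, to illustrate the relative cut-off suggested in my point 3 (the file and column names below are hypothetical), a year-specific top-decile JIF threshold could be derived as follows:

    # Illustrative sketch (hypothetical file and column names): a year-specific
    # relative JIF cut-off (top 10% of journals per year) instead of a fixed JIF > 10.
    import pandas as pd

    journals = pd.read_csv("jcr_journal_if.csv")    # columns: year, journal, jif
    p90 = (journals.groupby("year")["jif"]
                   .quantile(0.90)
                   .rename("jif_p90")
                   .reset_index())

    trials = pd.read_csv("rct_metadata.csv")        # columns: pmid, year, jif
    trials = trials.merge(p90, on="year", how="left")
    trials["top_decile_journal"] = trials["jif"] > trials["jif_p90"]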

Additional comments

1. It is unclear in the abstract what the publication and author characteristics are, and how they relate to the main issue, which is methodological quality and bias.

2. I agree that reports of RCTs have problems with randomization, allocation concealment, and blinding, but it is difficult to ascertain the true treatment effect, in particular because this would require formally defining the target population. I would therefore recommend softening the assertion "the majority of RCT findings have inflated estimates" because this is not really supported by strong evidence.

3. The (expected) benefit of registration could be better explained, instead of simply noting that it has been made mandatory by the ICMJE.

4. I acknowledge that wording can become extremely tedious when describing results, but writing "The risk of bias in allocation concealment was consistently lower in trials published in journals with JIF larger than 10" implies that the risk of bias in allocation concealment is a quantity, when in fact what is lower is the proportion of trials with low (?) risk of bias.

5. The link of reference 15 (protocol) does not work, but the following one does: "https://osf.io/27f53/#!". Please consider revising.

Reviewer #4:

Thanks for giving me the opportunity to review this manuscript, which explores the quality of RCTs published between 1966 and 2018. The authors used machine learning techniques to extract information on study characteristics, risk of bias, trial registration, CONSORT statements, and H-indices. They found that the overall quality of RCTs improved over time (lower bias, more CONSORT reporting, and more study registrations), but a considerable number of RCTs still lack basic quality characteristics.

Major issues:

1. Introduction, lines 89–91: The authors state that there is already a study on RCT quality that included more than 20,000 RCTs from Cochrane reviews. This seems to be a high number of RCTs, but the authors further state that "large-scale evidence (…) is lacking". It would be helpful to know why reference 14 does not provide enough evidence on time trends (i.e., bias because only RCTs from Cochrane reviews were included).

2. Methods, line 128: The authors report a human–RobotReviewer agreement of 65% on average, stating that this is similar to the 79% human–human agreement. It would be helpful if the authors provided reasons as to why this difference is irrelevant or does not affect the validity of the results.

3. Methods, lines 160–161: The authors state that regression analysis was used, but it would be helpful if the authors further explained the details of the regression model.

4. Methods, lines 118–120: Why did the authors summarize all articles from 2010 to 2018, when the Methods state that 5-year periods were considered (plus pre-1990 in one stratum)?

5. Results, line 171: Full text was only available for 60% of the identified RCTs. I suggest that the authors elaborate on the potential risks this poses to the results.

6. Discussion, lines 291–292: "Additionally, making data sets available according to the FAIR principles arguably will improve the situation". It would be helpful if the authors gave reasons as to why data availability was not included in the assessment criteria for RCTs.

Minor issues:

1. Introduction, line 81: The authors mention the Hong Kong Principles without further elaboration. It would be helpful to readers if the authors provided a short (i.e. one sentence or a relative clause) description of these principles.

2. Figure 1: The authors should report all numbers as "n=" (also the subcategories that were excluded).

3. Figure 4: Please check the caption (drop "plotted over time")

4. Discussion, lines 302–303: The authors could provide reasons why one medical field had higher RCT quality than others (e.g., more pragmatic RCTs in one area than in another).

COMMENTS FROM THE ACADEMIC EDITOR [lightly edited]:

After reading the comments made by our reviewers, we would like to invite you to resubmit your paper in a form that addresses them. In particular, I encourage you to expand the manuscript somewhat to address the issue of the selection of input. This selection comes with the methodology and is thus inevitable, but understanding its effect in more depth is also pivotal to understanding the results. Through your methods you have indeed "seen" more papers than is possible with more traditional hand work. Some extra digging into how problematic this selection is would be very welcome. Some ideas to provide this insight could be:

- get some insight into the missing data: from what years, locations, and types of journal were these missing papers?

- perform some sensitivity analyses to get a feeling for the impact of this selection on the results: worst-case/best-case scenarios, or even more sophisticated methods, are welcome or even preferred.

I truly hope that you will be able to provide some extra quantified insights and not just leave it to an academic discussion based on hypotheticals. The reason is that this issue is fundamental not only to this particular paper, but to the whole approach of using machines instead of hands/eyes/brains in meta-research on the reporting of scientific results.

I applaud your efforts to make your research as open as possible (GitHub/OSF). When looking at the OSF preregistered protocol, some major elements deviate or are missing. Some examples that caught my eye:

* "prediction" as the main aim in the protocol vs the emphasis on "over time" in the manuscript

* "main difference between included and excluded trials will be described" in the protocol (see comments earlier)

* consort: mentioning consort vs using statreviewer

* description of excluded data and imputation of missing data (in line with earlier comments).

We understand that plans and the final result do not always line up, but there has to be substantial overlap, and any non-overlap must be explained to make sense to the reader. We encourage you to add this information to the manuscript or its supplemental information. If the current manuscript is more a description of the dataset and the "prediction" element will be described in a separate paper, please explain this and remove the reference to the OSF protocol. On a similar note, I encourage you to highlight substantial additions or changes made to your methods during this peer review as such in the new version of the manuscript. This way, there is a continuous line between the preregistered protocol on OSF and the actual reporting in this manuscript. It also shows readers and prospective peer reviewers the added value of the peer review process.

Decision Letter 2

Roland G Roberts

27 Jan 2021

Dear Dr Vinkers,

Thank you for submitting your revised Meta-Research Article entitled "The methodological quality of randomized controlled trials has improved but is still not good enough: an analysis of 176,620 randomized controlled trials published between 1966 and 2018" for publication in PLOS Biology. I've now obtained advice from two of the original reviewers and have discussed their comments with the Academic Editor. 

Based on the reviews, we will probably accept this manuscript for publication, assuming that you will modify the manuscript to address the remaining points raised by the reviewers. Please also make sure to address the data and other policy-related requests noted at the end of this email.

IMPORTANT:

a) Please attend to the remaining requests from reviewer #1.

b) Please re-word your Abstract to fit the verbose style that we use at PLOS Biology (not the current structured format that is more typical of clinical journals).

c) The current title is rather cumbersome and repetitive. Please change it to something more appealing; we suggest: "Analysis of the methodological quality of 176,620 randomized controlled trials published between 1966 and 2018 reveals a positive trend but urgent need for improvement."

d) Please attend to my Data Policy requests below.

We expect to receive your revised manuscript within two weeks. Your revisions should address the specific points made by each reviewer.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

-  a cover letter that should detail your responses to any editorial requests, if applicable

-  a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

-  a track-changes file indicating any changes that you have made to the manuscript. 

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information  

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods 

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor,

rroberts@plos.org,

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797 

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication. 

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 2ABCD, 3ABCDEF, 4ABCDEF, S1, S2, S3, S4ABCDEF, S5ABCDEF, S6, S7, S8ABCD. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

IMPORTANT: Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

 ------------------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #1:

I would like to thank the authors for having answered my comments and modified the manuscript.

I have some additional comments:

- Objective: I would remove the word "unbiased": how can the authors be sure that their evaluation is unbiased?

o In the abstract, the sentence is not really clear: "we mapped RoB trends over time in RCT publications including journal and author characteristics". The end of the sentence is not clear; journal and author characteristics are not part of RoB.

- I really thank the authors for having conducted a validation with RoB assessment reported in Cochrane reviews.

o The advantage of using Cochrane reviews, which could be highlighted, is that RoB is evaluated in the same way in all Cochrane reviews, using the RoB tool, by trained reviewers.

o How do the authors define the parameter "accuracy"? (A purely illustrative sketch of one possible definition is given at the end of my comments.)

o Which RoB assessment did they take for trials included in two Cochrane reviews? Did this situation occur?

o The kappa values are very low, and lower than the kappa values between reviewers assessing risk of bias using the risk-of-bias tool (there is abundant literature on the topic).

o On which human reports, and on how many, are the RobotReviewer risk-of-bias algorithms based?

o I don't understand when the authors say that "RoB data reported in CDSR not imposed or standardized"; this is not true. The RoB judgment is almost always expressed in the same way, as high, low, or unclear (or yes, no, unclear), as this is the format used in RevMan. It is possible to extract data from a Cochrane review as an XML file and to directly extract the RoB table with the judgments. I don't know if this is what the authors did.

- In the statistical analysis,

o I don't understand why the authors reported either p-values for trends or p-values for comparisons across two categories. It is not consistent throughout the manuscript. They reported trend tests for RoB but not for publication characteristics. Why? Trend tests could be used.

- In the results

o It is not clear from the abstract whether or not the percentages refer to the first and last time categories (first and last decades), and for the first time category, is it <1990 or 1990–1995?

o I am a bit lost with the time categories. Looking at the text, the oldest category seems to be 1990–1995; once, <1990 is reported in the text (why?), but looking at the figures, <1990 is systematically reported. I think it would be better to be consistent throughout. Either the authors consider that trials published before 1990 are too old and exclude them, or, if not, they have to use the category <1990 as the reference.

- In the Discussion

o The summary of main findings puts major emphasis on author characteristics, including women authors, while only a few results are presented on this point. This should be consistent.

o Many authors do not report the registration number in the trial publication, but this does not mean that these trials were not registered (I agree that it is important to report the registration number in the publication). Likewise, not mentioning CONSORT in the text does not mean that the trial does not comply with it. Therefore, I suggest that the authors be a little more careful in their wording.

o The term "score" is still mentioned.

o To evaluate the item incomplete outcome data, it is not necessary to rely on trial registration. Access to trial registration is mainly important for selective outcome reporting.

- Table 1: I would present 95% CIs everywhere and would use the usual formulation for the RoB items.
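
To make my questions about accuracy and kappa concrete (the arrays below are purely illustrative, not the study data), the comparison between dichotomized RobotReviewer probabilities and human Cochrane judgments could be computed as:

    # Illustrative only: accuracy and Cohen's kappa between dichotomized machine
    # probabilities (>0.5 = high/unclear risk) and human Cochrane judgments.
    import numpy as np
    from sklearn.metrics import accuracy_score, cohen_kappa_score

    machine_prob = np.array([0.82, 0.35, 0.66, 0.12, 0.58, 0.41])   # hypothetical probabilities
    human        = np.array([1,    0,    1,    0,    0,    0])      # 1 = high/unclear, 0 = low
    machine      = (machine_prob > 0.5).astype(int)

    print(accuracy_score(human, machine))      # proportion of matching judgments
    print(cohen_kappa_score(human, machine))   # chance-corrected agreement

Stating explicitly which of these definitions (or another) was used would remove the ambiguity.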

Reviewer #4:

I would like to congratulate the authors for the improvement. All issues have been addressed adequately.

Decision Letter 3

Roland G Roberts

1 Mar 2021

Dear Christiaan,

On behalf of my colleagues and the Academic Editor, Bob Siegerink, I'm pleased to say that we can in principle offer to publish your Meta-Research Article "The methodological quality of 176,620 randomized controlled trials published between 1966 and 2018 reveals a positive trend but also urgent need for improvement" in PLOS Biology, provided you address any remaining formatting and reporting issues. These will be detailed in an email that will follow this letter and that you will usually receive within 2-3 business days, during which time no action is required from you. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have made the required changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for supporting Open Access publishing. We look forward to publishing your paper in PLOS Biology. 

Sincerely, 

Roli

Roland G Roberts, PhD 

Senior Editor 

PLOS Biology

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Detailed data collection procedures.

    (DOCX)

    S1 Table. Operationalization of variables for RCTs, authors, institutions, and journals.

    RCT, randomized controlled trial.

    (DOCX)

    S2 Table. All journals with JIF higher than 10 in the year preceding any of the individual publications in our data set.

    JIF, journal impact factor.

    (DOCX)

    S3 Table. Quantiles of estimated risk-of-bias domain probabilities for included and excluded RCTs.

    RCT, randomized controlled trial.

    (DOCX)

    S4 Table. Percentages for the CONSORT Statement and Registration outcomes for included and excluded RCTs.

    RCT, randomized controlled trial.

    (DOCX)

    S5 Table. Total number (N) of trials published in the period 2005–2018 in the different medical disciplines with the number (K) and corresponding proportion (percentage with 95% confidence interval) of trials with a risk-of-bias probability below 50% (i.e., “low risk”).

    (DOCX)

    S1 Fig. Distribution of JIFs of analyzed RCTs with JIF cutoffs of 3, 5, and 10 (dotted lines).

    The JIF of a journal in the year following the publication date of the RCT was used. Density represents the probability of a trial to belong to a given impact factor. JIF, journal impact factor; RCT, randomized controlled trial.

    (TIFF)

    S2 Fig. The number of included RCTs against the total number of RCTs indexed in the PubMed database for the study period (1966–2018).

    The year 1993 and 1998 are marked with vertical dashed lines. RCT, randomized controlled trial.

    (TIFF)

    S3 Fig. Percentage of female authors and H-indices for the first and the last author, per period, for the included RCTs.

    RCT, randomized controlled trial.

    (TIFF)

    S4 Fig

    Risk of bias due to inadequate allocation concealment (A), random sequence generation bias (B), the bias in blinding of patients and personnel (people) (C), the bias in blinding of outcome assessment (D), RCT registration (E), and mentioning of the CONSORT Statement (F) plotted over time for RCTs published in journals with JIF >3 and journals with JIF <3. The indicated stratum range is up to but not including the last year. JIF, journal impact factor; RCT, randomized controlled trial.

    (TIFF)

    S5 Fig

    Risk of bias in allocation concealment (A), the bias in randomization (B), the bias in blinding of patients and personnel (people) (C), the bias in blinding of outcome assessment (D), RCT registration (E), and mentioning of the CONSORT Statement (F) plotted over time for RCTs published in journals with JIF >5 and journals with JIF <5. The indicated stratum range is up to but not including the last year. JIF, journal impact factor; RCT, randomized controlled trial.

    (TIFF)

    S6 Fig. The average risk of biases for trials published in the period 2005–2018 in different medical disciplines.

    “random”: bias in randomization; “allocation”: bias in allocation concealment; “blinding of people”: bias in blinding of patients and personnel; “blinding outcome”: bias in the blinding of outcome assessment.

    (TIFF)

    S7 Fig. Presence of RCT registration and CONSORT Statement in trials published between 2005 and 2018.

    RCT, randomized controlled trial.

    (TIFF)

    S8 Fig. The machine learning risk-of-bias probabilities are plotted as density profiles against the human rater risk categories: “High-Unclear” and “Low” for 63,327 matching RCTs.

    RCT, randomized controlled trial.

    (TIFF)

    Attachment

    Submitted filename: PLOS_Biol_rebuttal_final.pdf

    Attachment

    Submitted filename: Rebuttal.docx

    Data Availability Statement

The risk-of-bias characterization was done with large-batch customized Python scripts (version 3; https://github.com/wmotte/robotreviewer_prob). The data management and analyses used R (version 3.6.1). All data, including code and risk-of-bias data, are available at https://github.com/wmotte/RCTQuality.

