PLOS Biology. 2022 Feb 18;20(2):e3001562. doi: 10.1371/journal.pbio.3001562

Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that do not reach statistical significance

Willem M Otte 1,2, Christiaan H Vinkers 3, Philippe C Habets 3, David G P van IJzendoorn 4, Joeri K Tijdink 5,6,*
Editor: Isabelle Boutron
PMCID: PMC8893613  PMID: 35180228

Abstract

The power of language to shape the reader’s interpretation of biomedical results should not be underestimated. Misreporting and misinterpretation are pressing problems in the output of randomized controlled trials (RCTs). This may be partially related to the statistical significance paradigm used in clinical trials, centered on a P value cutoff of 0.05. Strict use of this cutoff may lead clinical researchers to describe results with P values approaching but not reaching the threshold as “almost significant.” The question is how phrases expressing such nonsignificant results have been reported in RCTs over the past 30 years. To this end, we conducted a quantitative analysis of the English full texts of 567,758 RCTs recorded in PubMed between 1990 and 2020 (81.5% of all published RCTs in PubMed). We determined the exact presence of 505 predefined phrases denoting results that approach but do not cross the line of formal statistical significance (P < 0.05). We modeled temporal trends in phrase data with Bayesian linear regression. Evidence for temporal change was obtained through Bayes factor (BF) analysis. In a randomly sampled subset, the associated P values were manually extracted. We identified 61,741 phrases in 49,134 RCTs indicating almost significant results (8.65%; 95% confidence interval (CI): 8.58% to 8.73%). The overall prevalence of these phrases remained stable over time, with the most prevalent phrases being “marginally significant” (in 7,735 RCTs), “all but significant” (7,015), “a nonsignificant trend” (3,442), “failed to reach statistical significance” (2,578), and “a strong trend” (1,700). The strongest evidence for an increased temporal prevalence was found for “a numerical trend,” “a positive trend,” “an increasing trend,” and “nominally significant.” In contrast, the phrases “all but significant,” “approaches statistical significance,” “did not quite reach statistical significance,” “difference was apparent,” “failed to reach statistical significance,” and “not quite significant” decreased over time. In a randomly sampled subset of 29,000 phrases, we manually extracted 11,926 corresponding P values; 68.1% of these ranged between 0.05 and 0.15 (CI: 67. to 69.0; median 0.06). Our results show that RCT reports regularly contain specific phrases describing marginally nonsignificant results, that is, P values close to but above the dominant 0.05 cutoff. The stable prevalence of these phrases over time indicates that this practice of broadly interpreting P values close to a predefined threshold remains common. To enhance responsible and transparent interpretation of RCT results, researchers, clinicians, reviewers, and editors could reduce the focus on formal statistical significance thresholds, stimulate the reporting of exact P values with corresponding effect sizes and CIs, and focus on the clinical relevance of the statistical differences found in RCTs.


The power of language to shape the reader’s interpretation of biomedical results should not be underestimated. An analysis of more than half a million randomized controlled trials reveals that researchers use appealing phrases to describe nonsignificant findings as if they fell below the P = 0.05 significance threshold.

Introduction

Individual clinical researchers are subject to the mythical heritage, or paradigm, of the peculiar and well-recognized 0.05 significance threshold (P = 0.05, alpha), which holds that findings below this predefined value reflect a true finding, whereas P values not making the cut indicate no effect (null hypothesis not rejected). Consequently, individuals submitting randomized controlled trial (RCT) publications often dance the “significance dance” to describe outcomes around the 5% alpha level. One challenge of an overreliance on one fixed cutoff is that P values below 0.05 are often thought to be mandatory to “find” that a treatment works or not. However, the P < 0.05 threshold is a simple rule to reject the null hypothesis while controlling for type I and II errors. This has led to misinterpretations that dichotomize the P value (P < 0.05 = true effect, P > 0.05 = no effect).

Interestingly, the vast majority (96%) of biomedical articles report P values of 0.05 or less [1,2]. Unseen, but behind this peculiar distribution of published P values are those that did not make it below 0.05. In psychology, the occurrence of reporting P values between 0.05 and 0.1—about 40%—is relatively high [3]. Less is known about these numbers in clinical research. In a small sample of 722 articles in oncology research, 63 articles (8.7%) used trend statements to describe statistically nonsignificant results [4].

Authors could misrepresent nonsignificant trial results through biased emphasis or phrasing of the outcomes. A well-known example of this so-called “spin” practice is switching the emphasis from nonsignificant primary to significant secondary outcomes. Highlighting favorable results while suppressing unfavorable data in this way is considered misrepresentation [5]. Another example is the use of linguistic spin, which can distort the interpretation of trial results by reframing or modifying the reader’s perception toward a beneficial interpretation despite a statistically nonsignificant difference in the primary outcome [6].

Finally, recent reports indicate a high percentage (ranging from 47% to 66%) of detected spin across medical disciplines [7–10].

Strong preferences for P values below 0.05 may also lead to creative linguistic solutions. Reporting nonsignificant results as essential or noteworthy findings may effectively invite scholars to overstate their findings and present uncertain, insufficient evidence (e.g., with a high risk of bias or other methodological weaknesses) as “breakthrough” research with clear clinical impact. These linguistic trends possibly have a temporal element: some phrases will be more successful in convincing editors and reviewers over time than others. Given the relatively rules-oriented RCT research environment, we expected creative linguistics regarding significance phrases in published RCTs and trends over time for the most popular phrases.

Insight into this practice is essential, as the success of an RCT is partly determined by the way the results are presented in a manuscript [11]. Effective interventions and procedures with clear and significant outcomes that promise to improve patient care will most likely guide acceptance decisions. However, in papers without apparent clinical breakthroughs, the language used to highlight potential beneficial treatments may nonetheless convince reviewers and readers [12,13]. Also, for RCTs, the cornerstone of evidence-based medicine, 2 independent studies have shown that positive reporting and interpretation of primary outcomes were frequently based on nonsignificant results [14,15]. Persuasive phrasing like “marginally significant” and “a trend toward significance” may disguise nonsignificant results. Given that there is essentially no clinically relevant distinction between a type I error of 4%, 5%, or 6%, it is interesting to understand how the formulations regarding P values just above 0.05 change over time.

Therefore, this study aims to determine the prevalence of specific nonsignificant phrases in RCTs and which phrases correspond with reported statistically nonsignificant findings, to explore potential consequences of the significance threshold of P < 0.05. We do this by quantitatively analyzing RCT full texts registered in the last 3 decades in the PubMed database. We determined the use of the 505 most common phrases describing nonsignificant results and characterized their trends over time. In a subset, we manually assessed the associated P values. We expected to find similar percentages of phrases associated with nonsignificant results in RCTs as reported in other (mostly nonclinical) studies [14,15]. We also hypothesized that phrase prevalences would change over time, assuming continuous evolution of phrasing in the reporting of nonsignificant RCT results. Finally, we anticipated that the phrase-associated P values would predominantly lie in the range of 0.05 to 0.15.

Methods

Selection of RCTs

A flowchart shows the consecutive processing steps (Fig 1). We identified all RCTs in the PubMed database [September 20, 2020], excluding animal studies, non-English studies, and studies that were not actual RCT reports, with the following query: “All[SB] AND Humans[MESH] AND English[LANG] AND Clinical Trial[PTYP] NOT protocol[TITLE].” Our custom search query has not been previously validated. However, PubMed’s internal “Clinical trial” filter is characterized by an average sensitivity of 87.3%, specificity of 34.8%, and precision of 54.7% [16]. It is optimized for a sensitive and broad rather than a specific and narrow yield. In comparison, a clinical query optimized for sensitivity reached 92.7% sensitivity and 16.1% specificity, whereas a clinical query optimized for specificity reached a sensitivity of 78.2% and a specificity of 52.0%. Our query is thus a compromise between sensitivity and specificity. We expect that further restricting the query to “Humans,” “English,” and no protocols will have increased our specificity [16].
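For readers who want to reproduce the retrieval step, a minimal sketch in R using the rentrez package is shown below. The query string is taken verbatim from the Methods; the choice of rentrez (rather than the authors’ own tooling) and the object names are illustrative assumptions.

```r
# Sketch: retrieve candidate RCT records from PubMed (assumes rentrez is installed).
library(rentrez)

query <- "All[SB] AND Humans[MESH] AND English[LANG] AND Clinical Trial[PTYP] NOT protocol[TITLE]"

# use_history = TRUE stores the result set server-side, which helps with large queries.
res <- entrez_search(db = "pubmed", term = query, use_history = TRUE)
res$count  # total number of candidate RCT records
```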

Fig 1. Flowchart of the study processing pipeline. RCT, randomized controlled trial.

Subsequently, we collected the portable document format (PDF) files for all available RCTs across publishers in journals covered by our institution’s library subscription. All trial PDFs were converted to XML, and subsequently to plain text, using the publicly available Grobid software (v. 0.6.2). We converted the plain text to lower case and removed diacritical symbols, as quotes, for example, occur as various Unicode characters. Subsequently, we searched for exact matches (i.e., with the grep command line tool) of the predefined phrases in this cleaned text.
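A minimal sketch of this normalization and exact matching step in R is given below, assuming the Grobid-extracted plain text of one trial is held in `full_text` and the predefined phrases in `phrases` (both hypothetical variable names).

```r
# Sketch: lowercase, strip diacritics/odd Unicode quotes, then fixed-string matching.
library(stringi)

clean_text <- stri_trans_general(tolower(full_text), "Latin-ASCII")

# fixed = TRUE mimics the exact (non-regex) matching of the grep tool.
matched <- vapply(phrases, function(p) grepl(p, clean_text, fixed = TRUE), logical(1))
phrases[matched]  # phrases present in this trial's full text
```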

Phrases

We predefined 505 phrases potentially associated with reporting nonsignificant results (S1 Table). We used a list provided by statistician Dr. Matthew Hankins on his “Still not significant” blog, based on actual examples found in the biomedical and psychology literature [17,18].

Prevalences

We restricted the publication time frame to 3 decades: January 1990 to September 2020. The total phrase-positive RCT prevalence was determined for each publication year. To increase the temporal robustness of individual phrase prevalence estimations, we binned RCTs according to their publication date into periods of 3 years. For each phrase detected as an exact match in the full texts, time period prevalences were calculated by dividing the number of RCTs containing that phrase by the total number of RCTs within that period. The 95% confidence intervals (CIs) were determined with Yates continuity correction [19].
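As a worked example of such an estimate, base R’s prop.test() applies the Yates continuity correction by default; the counts below are the overall figures reported in the Results, used here purely for illustration.

```r
# Sketch: overall phrase-positive prevalence with a Yates-corrected 95% CI.
# 49,134 phrase-positive RCTs out of 567,758 analyzed full texts (see Results).
prop.test(x = 49134, n = 567758)
# Point estimate ~0.0865 (8.65%); 95% CI approximately 8.58% to 8.73%.
```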

Statistical analysis

To obtain evidence on phrase changes over time, we used Bayesian linear regression [20] and determined Bayes factors (BFs) for each fitted model. This ratio measures the relative evidence of a model with a linear slope in the temporal prevalence data over a null model with an intercept only. For example, a BF of 5.0 means that the prevalence data for a specific significance phrase are 5 times more probable under a model with linear change over time than under a model with no change over time. Multiple suggestions for interpreting BF ranges are available. A commonly used list divides the evidence into 4 strength ranges: BFs between 1 and 3.2 are “not worth more than a bare mention,” between 3.2 and 10 are “substantial,” between 10 and 100 are “strong,” and >100 are “decisive” evidence [21]. To our knowledge, there is no evidence that reporting BFs is also subject to suspicious phrasing. We used the R package “BayesFactor” for statistical analysis. Model priors were uninformative.
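A minimal sketch of this model comparison with the BayesFactor package follows, assuming a data frame `d` with one row per 3-year period holding the period midpoint (`period`) and the phrase prevalence (`prevalence`); these names are illustrative, not the authors’ exact script.

```r
# Sketch: Bayes factor of a linear-trend model against an intercept-only null.
library(BayesFactor)

bf <- lmBF(prevalence ~ period, data = d)
bf  # e.g., BF > 100 would count as "decisive" evidence for a temporal change
```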

Associated P values

Phrases may refer to P values in broadly 2 ways. The first is a direct referral, in which the corresponding P value directly follows the phrase, mostly in parentheses (e.g., “The drug effect was almost significantly lower in group B (P = 0.052)”). The other type, often found in Discussion sections, typically contains longer-range referrals to previously mentioned results displayed in figures and tables. We tried to quantify the first type of referral by manually extracting the P value within the first 100 characters directly following the extracted phrase, for 29,000 phrases randomly sampled from the full set of detected phrases. This sample size was achieved through distributed labeling, with all authors independently extracting P values from 5,000 to 7,000 sentences. We evaluated our interrater P value variability based on 50 sentences shown to 2 raters but mixed within the larger set of extractions, expressed as the mismatch percentage.
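For direct referrals, this extraction can be approximated with a short regular expression over the 100-character co-text window, as in the sketch below; the pattern and the helper name are illustrative assumptions, not the authors’ exact implementation (extraction was done manually in the study).

```r
# Sketch: pull a directly reported P value from the co-text following a phrase.
library(stringr)

extract_p <- function(co_text) {
  window <- substr(co_text, 1, 100)
  # Matches forms like "p = 0.052" or "p < .06" in the lowercased text.
  m <- str_match(window, regex("p\\s*[=<>]\\s*(0?\\.\\d+)", ignore_case = TRUE))
  as.numeric(m[, 2])  # NA when no directly reported P value is found
}

extract_p("lower in group b (p = 0.052)")  # returns 0.052
```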

Results

We obtained the full text of 567,758 of the total of 696,842 PubMed-registered RCTs (81.47%) (Fig 1). Of the 505 predefined significance phrases, 272 were present in the full-text corpus at least once. In total, 49,134 of the 567,758 full texts had a full-text match (61,741 phrases). The yearly prevalences are shown in Fig 2. The overall phrase-positive RCT prevalence was 8.65% (95% CI: 8.58% to 8.73%), and this percentage was stable over time.

Fig 2. The number of analyzed full texts (A), number of phrase-positive RCTs (B), and the corresponding prevalence (C) over time.

Error bars represent the 95% CI. The underlying data can be found here: https://github.com/wmotte/almost_significant/tree/main/Fig_and_Data. CI, confidence interval; RCT, randomized controlled trial.

The number of detected RCTs with phrases associated with the reporting of nonsignificant results was unequally distributed (Table 1). The most prevalent phrases were “marginally significant” (present in 7,735 RCTs), “all but significant” (7,015 RCTs), “a nonsignificant trend” (3,442 RCTs), “failed to reach statistical significance” (2,578 RCTs), and “a strong trend” (1,700 RCTs).

Table 1. The identified number of phrases (frequency n > 100).

Phrase Total RCTs
Marginally significant 7,735
All but significant 7,015
A nonsignificant trend 3,442
Failed to reach statistical significance 2,578
A strong trend 1,700
Nearly significant 1,391
A clear trend 1,372
An increasing trend 1,202
Only marginally significant 1,149
A significant trend 1,124
Potentially significant 1,104
Significant tendency 1,064
A positive trend 1,055
A decreasing trend 962
Marginal significance 887
A slight trend 885
Almost significant 813
A statistical trend 811
Approaching significance 796
Nominally significant 740
Quite significant 547
Near significant 546
An overall trend 445
Likely to be significant 425
Difference was apparent 409
Uncertain significance 383
Did not quite reach statistical significance 379
A weak trend 343
Marginally statistically significant 314
Tended to be significant 293
Possible significance 286
Not quite significant 266
A favorable trend 261
Just failed to reach statistical significance 252
A negative trend 225
Almost reached statistical significance 219
A possible trend 218
Fell short of significance 214
Not as significant 204
A small trend 185
A numerical trend 184
Slightly significant 182
Reached borderline significance 165
Near significance 156
Weakly significant 147
Moderately significant 146
An apparent trend 145
Barely significant 135
Practically significant 135
A definite trend 131
An interesting trend 129
Almost statistically significant 126
Marginally nonsignificant 101
Possibly significant 100
Significantly significant 100

RCT, randomized controlled trial.

We found evidence for a temporal change in multiple prevalences (S2 Table). Among the phrases with a BF above 100, the RCT prevalence increased from 0.005% to 0.05% (“a numerical trend”), 0.098% to 0.23% (“a positive trend”), 0.067% to 0.346% (“an increasing trend”), and 0.036% to 0.201% (“nominally significant”), whereas the phrases “all but significant,” “approaches statistical significance,” “did not quite reach statistical significance,” “difference was apparent,” “failed to reach statistical significance,” and “not quite significant” sharply decreased over time (Fig 3). An additional 17 phrases had “strong” BFs between 10 and 100 (S1 Fig). A total of 15 phrases had a BF between 3.2 and 10 (S2 Table), indicating “substantial” evidence for a temporal change. The remaining phrases were “not worth more than a bare mention.”

Fig 3. Temporal plots for phrases with “decisive” evidence (i.e., BFs > 100) for temporal change.

Prevalence estimates are shown as dots, together with the linear regression model fit and corresponding uncertainty. The data can be found here: https://github.com/wmotte/almost_significant/tree/main/Fig_and_Data. BF, Bayes factor.

Associated P values

Within the random sample of 29,000 phrase occurrences, we extracted 11,926 P values (41.1%) within the “100 characters” range. Interrater P value variability, based on a sample of 50 extractions (hidden within the larger random sample and seen by 2 authors), was less than 4%.

The P value distribution was characterized by a high prevalence within the 0.05 to 0.15 range, with a median of 0.06. In the distribution of all detected P values, the 25% to 75% interval spanned P values between 0.05 and 0.08, and the 5% to 95% interval spanned P values between 0.006 and 0.15 (see Fig 4). The proportions of P values categorized as <0.05, between 0.05 and 0.15, or above 0.15 are given in S3 Table.

Fig 4. Density plot of the 11,926 manually extracted P values.

The data can be found here: https://github.com/wmotte/almost_significant/tree/main/Fig_and_Data.

Some phrases were highly associated with a P value between 0.05 and 0.15 (Fig 5, S2 Fig). Among the frequent phrases, the highest percentages in this particular range of 0.05 to 0.15 were found for “almost reached statistical significance,” “almost significant,” “a strong trend,” “did not quite reach statistical significance,” “just failed to reach statistical significance,” “near significance,” and “not quite significant” (Fig 5).

Fig 5. Category percentages for the 20 most frequent phrases describing nonsignificant results, with at least 100 manually extracted P values.

Error bars represent the 95% CI. The associated median P value (with the 25% and 75% quantiles) is presented in the upper left corner of each phrase. The data can be found here: https://github.com/wmotte/almost_significant/tree/main/Fig_and_Data. CI, confidence interval.

Other phrases were much less linked to 0.05 to 0.15 P values, namely “a significant trend,” “all but significant,” “an increasing trend,” and “nominally significant” (Fig 5). Similar differences were found for less frequent phrases, with some strongly connected to P values just above 0.05 (S2 Fig).

Discussion

Principal findings

This study systematically assessed more than half a million full-text RCT publications published between 1990 and 2020 for the prevalence of specific phrases linked to almost significant but formally nonsignificant reporting (i.e., P values just above the 0.05 threshold), including temporal trends and manual validation of the associated P values. We present an estimate of 9% of RCTs using specific phrases to report P values above 0.05. This prevalence has remained relatively stable over the past 3 decades. We also determined fluctuations over time in the frequently used nonsignificant phrases: some phrases gained popularity over time, whereas others declined. Our manual analysis confirmed that most of the phrases describing nonsignificant results corresponded with P values in the range of 0.05 to 0.15.

Strengths and limitations

This is the first study to explore a vast body of PubMed-indexed RCTs on the occurrence of phrases reporting nonsignificant results. Given the relatively low frequency of several phrases, such a large sample is essential to effectively quantify prevalence and changes in phrasing over time. Moreover, we also quantified the actual P values of the most frequently used phrases reporting nonsignificant results.

Our study also has inherent limitations. First, we predefined more than 500 phrases denoting results that do not reach formal statistical significance. We may have missed phrases with similar meanings, which would lead to an underestimated overall prevalence. However, we did not implement an elastic search strategy in our PDFs, as this could potentially change the interpretation, for example, by removing a negation. With exact string matching, we are certain that a specific spin-like phrase is actually written in the trial report. This, too, may have led to underreporting, and the actual prevalence of these phrases may be an underestimation. Second, not all phrases are equally specific in their association with P values just above 0.05. Third, we studied English language RCTs only; generalizations to other languages can therefore not be made. Fourth, we only had access to published full texts. This prevents us from drawing causal conclusions, as nonpublished manuscripts with specific nonsignificant phrases, which did not undergo a peer-review process, are not available. Connected to that, despite our data collection in September 2020, we missed a relatively large proportion of RCTs published in 2020, rendering our results less stable for the last year. Fifth, we only characterized P values in the direct vicinity of the phrases. Long-range referrals in the text or tables were not included, so the association frequencies may be conservatively low. Sixth, it remains unknown whether some trials had nonsignificant results but used different sentences to describe them; this may have caused us to underestimate the prevalence of these types of sentences. Seventh, we do not know whether the P value and the corresponding significance phrase actually referred to the study’s primary outcomes or whether they described less important secondary or tertiary outcomes. Finally, not all predetermined sentences actually represent a P value above 0.05 (e.g., “marginally significant”). However, we hardly found P values lower than 0.05 corresponding with specific phrases in the manual analysis (see S3 Table). For example, the phrase “failed to reach statistically significant results” highlights a fact, although it is not as neutral as simply stating “nonsignificant results.” The amount of spin may therefore vary between phrases and potentially inflate some of our individual phrase prevalence estimations for phrases describing marginally significant results.

Interpretation

Our findings suggest that specific phrasing to report nonsignificant findings remains fairly common in RCTs. RCTs are time- and energy-consuming endeavors, and an “almost significant” result can therefore be a disappointing experience in terms of the interpretation and publication of the results: Did the RCT “find” an effect or not? Our description of the characteristics of the most prevalent phrases can help readers, peer reviewers, and editors to detect potential spin in manuscripts that overstate or incorrectly interpret nonsignificant results. Our results also support the notion that some phrases are becoming more popular.

The detected P value distributions are important in light of the recent discussions to lower the default P value threshold to 0.005 to improve the validity and reproducibility of novel scientific findings [1]. P values near 0.05 are highly dependent on sample size and generally provide weak evidence for the alternative hypothesis. This threshold can consequently lead to high probabilities of false-positive reporting or P-hacking in clinical trials [22]. However, replacing the common 0.05 threshold with an even lower arbitrary value is not a definitive solution. Clinical research is diverse, and redefining the term “statistical significance” to even less likely outputs will probably have negative consequences. Lakens and colleagues [23] therefore suggest that we should abandon a universal cutoff value and associated “statistical significance” phrasing and allow scholars to judge the clinical relevance of RCT results on a case-by-case basis. Based on our data, we think that such a personalized approach is beneficial for everyone—especially since it is currently unknown if P value cutoffs as low as 0.005 do indeed lead to lower false-positive reporting and will lead to more rigorous clinical evidence. A stricter threshold requires large sample sizes in replication studies—which are hardly conducted—and will probably increase the risk of presenting underpowered clinical results.

Moreover, since it is estimated that half of the results of clinical trials are never published [24], mainly due to negative findings, lowering the P value threshold may result in more “negative” studies that remain largely unpublished. Although the detrimental effects of lowering the threshold for statistical significance for medical intervention data are disputed [25–27], a recent retrospective RCT investigation showed that shifting the threshold of statistical significance from P < 0.05 to P < 0.005 would have limited effects on medical intervention recommendations, as 85% of recommended interventions showed P values below 0.005 for their primary outcome [28]. We are also aware that this will come with new problems and ways to game this new artificial statistical threshold. We think that if authors discuss and judge their threshold value transparently and show the clinical relevance, there is no need to tie oneself to a universal P value cutoff. Journal editors and (statistical) reviewers can play an important role in propagating ideas from the so-called “new statistics” strategy, which aims to switch from null hypothesis significance testing to using effect sizes and cumulation of evidence to explore and determine the potential clinical relevance of results [29–31]. Chavalarias and colleagues [32] describe results related to the reporting of P values, effect sizes, and CIs: in the vast majority (88%) of the included RCTs, they found a reported P value <0.05, whereas CIs were reported in only 2% to 3% of the analyzed abstracts and effect sizes in 22%. Despite this improvement, we remain skeptical: the problem may simply shift, stimulating researchers to overly report their effect sizes and CIs.

Some argue that BFs should replace the quest for statistical significance; in our analysis, some phrases were associated with BFs representing “decisive” evidence for temporal changes. BFs are indeed considered a good alternative to statistical significance testing. However, BFs may be subject to other biases and linguistic persuasion and should be interpreted in light of their research context [33], so they may not be a definitive solution either.

Our study questions the current emphasis on a fixed P value cutoff in interpreting and publishing RCT results. Besides abandoning a universally held and fixed statistical significance threshold, an additional solution may be the 2-step submission process that has gained popularity in recent years [34,35]. This entails that an author first submits a version including the introduction and methods. Based on the reviews of this submission, a journal provisionally accepts the manuscript. When the data are collected, the authors can finalize their paper with the results and interpretation, knowing that it is already accepted.

In conclusion, too much focus on formal statistical significance cutoffs hinders responsible interpretation of RCT results. It may increase the risk of misinterpretation and selective publication, particularly when P values approach but do not cross the 0.05 threshold. Fifteen years of advocacy to shift away from null hypothesis testing has not yet fully materialized in RCT publications. We hope this study will stimulate researchers to put their creativity to good use in scientific research and abandon a narrow focus on fixed statistical thresholds, instead judging statistical differences in RCTs on their effect sizes and clinical merits.

Supporting information

S1 Fig. Temporal plots for phrases with “strong” evidence (i.e., BFs between 10 and 100) for temporal change.

Prevalence estimates are shown as dots, together with the linear regression model fit and 95% CI. The data can be found here: https://github.com/wmotte/almost_significant/tree/main/Fig_and_Data. BF, Bayes factor; CI, confidence interval.

(DOCX)

S2 Fig. Category percentages for the phrases describing nonsignificant results with the number of manually extracted P values with occurrences between 30 and 100 times in our manual analysis.

Error bars represent the proportional 95% CI. The associated median P value is presented in the upper left corner of each phrase. CI, confidence interval.

(DOCX)

S1 Table. The 505 predefined phrases associated with reporting nonsignificant results.

(DOCX)

S2 Table. The evidence of temporal change in the phrases with at least 5 time points expressed as the BF relative to no temporal change (lower threshold set to 2.0).

The colors represent the strength of evidence as specified in the main text. BF, Bayes factor.

(DOCX)

S3 Table. All extracted P values within the 3 range categories, as the proportion of the total of 11,926 extractions.

(DOCX)

Abbreviations

BF, Bayes factor

CI, confidence interval

PDF, portable document format

RCT, randomized controlled trial

Data Availability

All used PubMed IDs, detected phrases, co-text extractions, manually identified P values, and processing scripts are openly shared at https://github.com/wmotte/almost_significant (v1.0; http://doi.org/10.5281/zenodo.4313162) and at https://github.com/wmotte/almost_significant/tree/main/Fig_and_Data.

Funding Statement

The authors received no specific funding for this work.

References

  • 1.Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ, Berk R, et al. Redefine statistical significance. Nat Hum Behav. 2018;2(1):6–10. doi: 10.1038/s41562-017-0189-z
  • 2.Ioannidis JPA. The Proposal to Lower P Value Thresholds to .005. JAMA. 2018;319(14):1429–30. doi: 10.1001/jama.2018.1536
  • 3.Olsson-Collentine A, Van Assen MALM, Hartgerink CHJ. The Prevalence of Marginally Significant Results in Psychology Over Time. Psychol Sci. 2019;30(4):576–86. doi: 10.1177/0956797619830326
  • 4.Nead KT, Wehner MR, Mitra N. The Use of “Trend” Statements to Describe Statistically Nonsignificant Results in the Oncology Literature. JAMA Oncol. 2018;4(12):1778–9. doi: 10.1001/jamaoncol.2018.4524
  • 5.Chan A-W. Bias, Spin, and Misreporting: Time for Full Access to Trial Protocols and Results. PLoS Med. 2008;5(11):e230. doi: 10.1371/journal.pmed.0050230
  • 6.Boutron I, Dutton S, Ravaud P, Altman DG. Reporting and Interpretation of Randomized Controlled Trials With Statistically Nonsignificant Results for Primary Outcomes. JAMA. 2010;303(20):2058–64. doi: 10.1001/jama.2010.651
  • 7.Guo F, Fang X, Li C, Qin D, Hua F, He H. The presence and characteristics of ‘spin’ among randomized controlled trial abstracts in orthodontics. Eur J Orthod. 2021;43(5):576–82. doi: 10.1093/ejo/cjab044
  • 8.Rassy N, Rives-Lange C, Carette C, Barsamian C, Moszkowicz D, Thereaux J, et al. Spin occurs in bariatric surgery randomized controlled trials with a statistically nonsignificant primary outcome: a systematic review. J Clin Epidemiol. 2021. doi: 10.1016/j.jclinepi.2021.05.004
  • 9.Shepard S, Checketts J, Eash C, Austin J, Arthur W, Wayant C, et al. Evaluation of spin in the abstracts of orthopedic trauma literature: A cross-sectional review. Injury. 2021;52(7):1709–14. doi: 10.1016/j.injury.2021.04.060
  • 10.Chow R, Huang E, Fu S, Kim E, Li S, Tulandi T, et al. Spin in randomized controlled trials in obstetrics and gynecology: a systematic review. J Obstet Gynaecol Can. 2021;43(5):667.
  • 11.Norman G. Data dredging, salami-slicing, and other successful strategies to ensure rejection: twelve tips on how to not get your paper published. Adv Health Sci Educ. 2014;19(1):1–5.
  • 12.Jellison S, Roberts W, Bowers A, Combs T, Beaman J, Wayant C, et al. Evaluation of spin in abstracts of papers in psychiatry and psychology journals. BMJ Evid Based Med. 2019.
  • 13.Bero L, Chiu K, Grundy Q. The SSSPIN study—spin in studies of spin: meta-research analysis. BMJ. 2019;367:l6202. doi: 10.1136/bmj.l6202
  • 14.Khan MS, Lateef N, Siddiqi TJ, Rehman KA, Alnaimat S, Khan SU, et al. Level and Prevalence of Spin in Published Cardiovascular Randomized Clinical Trial Reports With Statistically Nonsignificant Primary Outcomes: A Systematic Review. JAMA Netw Open. 2019;2(5):e192622. doi: 10.1001/jamanetworkopen.2019.2622
  • 15.Chiu K, Grundy Q, Bero L. ‘Spin’ in published biomedical literature: A methodological systematic review. PLoS Biol. 2017;15(9):e2002173. doi: 10.1371/journal.pbio.2002173
  • 16.Hoogendam A, De Vries Robbé PF, Stalenhoef AFH, Overbeke AJPM. Evaluation of PubMed filters used for evidence-based searching: validation using relative recall. J Med Libr Assoc. 2009;97(3):186–93. doi: 10.3163/1536-5050.97.3.007
  • 17.Wright G. Academia Obscura: The hidden silly side of higher education. London: Unbound Publishing; 2017. Available from: http://www.academiaobscura.com/still-not-significant
  • 18.Hankins M. Still not significant [Internet]. 2017. Available from: https://mchankins.wordpress.com/2013/04/21/still-not-significant-2
  • 19.Brown LD, Cai TT, DasGupta A. Interval Estimation for a Binomial Proportion. Stat Sci. 2001;16(2):101–33.
  • 20.Rouder JN, Morey RD. Default Bayes Factors for Model Selection in Regression. Multivar Behav Res. 2012;47(6):877–903. doi: 10.1080/00273171.2012.734737
  • 21.Kass RE, Raftery AE. Bayes Factors. J Am Stat Assoc. 1995;90(430):773–95.
  • 22.Adda J, Decker C, Ottaviani M. P-hacking in clinical trials and how incentives shape the distribution of results across phases. Proc Natl Acad Sci U S A. 2020;117(24):13386–92. doi: 10.1073/pnas.1919906117
  • 23.Lakens D, Adolfi FG, Albers CJ, Anvari F, Apps MAJ, Argamon SE, et al. Justify your alpha. Nat Hum Behav. 2018;2(3):168–71.
  • 24.Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA. Evolution of Reporting P Values in the Biomedical Literature, 1990–2015. JAMA. 2016;315(11):1141–8. doi: 10.1001/jama.2016.1952
  • 25.Adibi A, Sin D, Sadatsafavi M. Lowering the P Value Threshold. JAMA. 2019;321(15):1532–3. doi: 10.1001/jama.2019.0566
  • 26.Wayant C, Scott J, Vassar M. Evaluation of Lowering the P Value Threshold for Statistical Significance From .05 to .005 in Previously Published Randomized Clinical Trials in Major Medical Journals. JAMA. 2018;320(17):1813. doi: 10.1001/jama.2018.12288
  • 27.Wayant C, Scott J, Vassar M. Lowering the P Value Threshold—Reply. JAMA. 2019;321(15):1533. doi: 10.1001/jama.2019.0574
  • 28.Koletsi D, Solmi M, Pandis N, Fleming PS, Correll CU, Ioannidis JPA. Most recommended medical interventions reach P < 0.005 for their primary outcomes in meta-analyses. Int J Epidemiol. 2020;49(3):885–93. doi: 10.1093/ije/dyz241
  • 29.Cumming G. The New Statistics: Why and How. Psychol Sci. 2014;25(1):7–29. doi: 10.1177/0956797613504966
  • 30.Gao J. P-values–a chronic conundrum. BMC Med Res Methodol. 2020;20(1):167. doi: 10.1186/s12874-020-01051-6
  • 31.Matthews RAJ. Moving Towards the Post p < 0.05 Era via the Analysis of Credibility. Am Stat. 2019;73(sup1):202–12.
  • 32.Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA. Evolution of Reporting P Values in the Biomedical Literature, 1990–2015. JAMA. 2016;315(11):1141. doi: 10.1001/jama.2016.1952
  • 33.Etz A, Vandekerckhove J. A Bayesian Perspective on the Reproducibility Project: Psychology. PLoS ONE. 2016;11(2):e0149794. doi: 10.1371/journal.pone.0149794
  • 34.Smulders YM. A two-step manuscript submission process can reduce publication bias. J Clin Epidemiol. 2013;66(9):946–7. doi: 10.1016/j.jclinepi.2013.03.023
  • 35.Chambers C. What’s next for Registered Reports? Nature. 2019;573(7773):187–9. doi: 10.1038/d41586-019-02674-6

Decision Letter 0

Roland G Roberts

30 Apr 2021

Dear Dr Tijdink,

Thank you for submitting your manuscript entitled "Almost significant: trends and P values in the use of phrases describing marginally significant results in 567,758 randomized controlled trials published between 1990 and 2020" for consideration as a Meta-Research Article by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, and I'm writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by May 04 2021 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli Roberts

Roland Roberts

Senior Editor

PLOS Biology

rroberts@plos.org

Decision Letter 1

Roland G Roberts

18 Jun 2021

Dear Dr Tijdink,

Thank you very much for submitting your manuscript "Almost significant: trends and P values in the use of phrases describing marginally significant results in 567,758 randomized controlled trials published between 1990 and 2020" for consideration as a Meta-Research Article at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers.

You’ll see that the reviewers are broadly positive about your study. Reviewer #1’s requests are mostly for clarification. Reviewer #2's requests involve some re-framing, especially around what question you set out to address, and your remedial recommendations. Reviewer #3 asks that you compare their results with those of a 2016 JAMA paper that you cite, and has a number of requests for methodological detail.

In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers.

We expect to receive your revised manuscript within 3 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli Roberts

Roland Roberts

Senior Editor

PLOS Biology

rroberts@plos.org

*****************************************************

REVIEWERS' COMMENTS:

Reviewer #1:

[identifies herself as Agnes Dechartres]

Review Plos Biology PBIOLOGY-D-21-01082R1, « Almost significant : trends and p values in the use of phrases describing marginally significant results in 567,758 randomized controlled trials published between 1990 and 2020 »

This is a research-on-research report evaluating the prevalence of common phrases associated with non significant results in 567,758 randomized controlled trials (RCTs) published between 1990 and 2020, the trend over time and the associated p-values in a subset of RCTs.

The topic is interesting and fits well the meta-research section of Plos Biology. The manuscript is overall well-written.

Please find below my comments. I hope you will find them useful.

Major comments :

- The reference and link with spin should be made more apparent

- We need more information on the methods section on how RCTs were identified (were they all those indexed as randomized controlled trials in publication type ?), how did the authors exclude those that were not actual RCT reports ?

- How did the authors processed technically to search for the common phrases associated with non-significant results ?

- The authors evaluated as a second step, p-values following the phrases commonly associated with a non-significant result in a subsample of 29,000 RCTs. Why 29,000 ? as they did that quite automatically, was it not possible to do it for the whole sample ?

- There is a contradiction in the Discussion section of the manuscript between what the authors did and showed : the prevalence of common phrases associated with non-significant results, mainly with a p-value between 0.05 and 0.15 is relatively low but still present (and this should probably be more highlighted and discussed, the prevalence seems low but it concerns all RCTs and many of them may be positive so one limitation is that they did not give the prevalence of such common phrases among RCTs with non-significant results) and seem steady over time. And the implications they highlight and the conclusion they reach: the abandon of formal statistical significance cutoff. The conclusion should be more in line with the results.

- The authors mentioned in the discussion that it was previously discussed to lower the default p value threshold to 0.005. I think it was mainly proposed for pre-clinical research. Do the authors think it is applicable for clinical research including clinical trials ?

Minor comments :

- Title : I suggest rephrasing the title as the current version is not completely clear something like :

« Almost significant » : Prevalence and trend of common phrases associated with non-significant results in 567,758 randomized controlled trials published between 1990 and 2020

- Abstract : the authors state that « The question is how non-significant outcomes are reported in RCTs over the last thirty years. » They should be careful in wording as this is not what they evaluated here. They only evaluated the prevalence of common phrases associated with non-significant results in RCTs (and not specifically in non-significant ones). To answer the first question, they should have first identified a sample of RCTs with non-significant results and evaluated how they were reported

- Abstract : This should be made clear in the abstract that the authors extracted associated p-values in a subsample of RCTS as the sentence ending with « in 567,758 RCT full texts between 1990 and 2020 and manually extracted associated p values » may be misleading

- Abstract : rephrase the sentence « Phrase data was modelled with Bayesian linear regression » to something like : temporal trend in phrase data was modelled with Bayesian linear regression

- Abstract : the conclusion should be more directly linked to the results

- One of the phrases considered as over-interpreting non significant results is « failed to reach statistically significant results. » This phrase for me reflects the truth regarding achieving or not the significance threshold and I am not sure that it corresponds to an over-interpretation

- The introduction is a bit long and sometimes a bit pedantic. Some sentences are vague and not really clear. Among these, for example « Individual clinical researchers are subject to regulations, traditions and procedures… ».

- In the introduction, the authors mention two previous meta-analytic studies showing that positive reporting and interpretation of primary outcomes in RCTs were frequently based on non-significant results. This sentence should be corrected as these studies were not meta-analyses. In addition, the sentence is misleading as these studies evaluated the prevalence of spin (misleading language) in RCTs with a non-significant primary outcome.

- Methods section : The sentence page 6, starting with « To increase the robustness of individual phrase prevalence estimations we binned… »

- A flow chart would be helpful

- In the results section, the sentence page 10 starting with « interrater p value variability… » is not really clear. This was not announced in the methods

- Discussion section : sentence at the end of page 14 ending with « irrespective of the publication status ». I think this is rather the opposite : With that process, publication does not depend on study results

Reviewer #2:

The manuscript reports on a meta-research study on how authors of reports on randomized controlled trials (RCTs) use specific wordings suggesting "marginally significant" results. The number of articles analyzed is impressive (roughly 570,000, more than 80% of all published RCTs that can be found in PubMed). But automatic analysis of such a large number also makes it difficult to attain a fine granularity, and this raises many issues. This is only imperfectly mitigated by manual extraction of P-values corresponding to a random sample of 29,000 extracted phrases. On the positive side, I must acknowledge that this, also, represents an enormous amount of work.

Major comments

1. Overall, the manuscript is unclear on what problem is precisely studied. Mostly the issue would be related to the use of threshold P-values for statistical testing. But actually, the issue is broader, and this is imperfectly tackled in the manuscript. The suggestion to lower the P-value threshold from 0.05 to 0.005, as proposed by John Ioannidis, is discussed in terms of lowering false positive reporting, but this is actually a completely different issue. Undoubtedly, authors would develop strategies to describe a P-value of 0.0051 as 'marginally significant', a 'trend to significance', etc. In a similar way, thresholds for Bayes factors are presented in the manuscript. It is mentioned that there is no evidence that Bayes factors would be reported with suspicious phrasing, but one may argue that there is no evidence that Bayes factors are reported in RCTs at all (despite this being advocated, references should be given in the discussion when coming to this). Apart from an easy joke, if RCTs were routinely reported with Bayes factors and the thresholds given almost universally endorsed, then—again—authors may begin a 'significance dance' to advertise a Bayes factor of 3.18. The discussion also advocates the use of effect sizes and confidence intervals. But this is another issue, also. How many of the reports analyzed also reported measures of effect size and a confidence interval? Perhaps a vast majority of them, even those using suspicious phrasing. Moreover, a 95% confidence interval also somehow emphasizes a significance level at 0.05. Of course, it brings much more information. But how many readers will look whether the 95% CI of the relative risk includes 1 or not? All these issues are mixed up in the manuscript, and it would benefit from better delineating precisely the underlying problems. It seems to me that they are primarily caused by the use of thresholds for decision, whatever they are. But the authors may have a different view. It should simply be clarified.

2. As a follow-up comment, the topic may be narrowed to how authors phrase results close to, but above, the significance threshold, or enlarged to envision what could be done to solve the issue (if it is really problematic). Currently, the manuscript is quite in-between. But the larger vision would raise many other issues: how do we dimension a study if there is no decision rule? (the power calculation would not apply) When should we stop a study without decision rules? Do we need a clear-cut conclusion of a RCT? (I would tend to say "no", but other authors have done sensible comments on that, and this relates to the suggestion to judge clinical relevance of RCTs on a case-by-case basis in the discussion; however on that specific point, I would tend to consider that this is still the case, whatever wording the authors use to describe their results). Bayes factors are used here, and are interesting. But perhaps a good option may be to analyze RCTs in a Bayesian framework, and report posterior probabilities for decision making (again one may differentiate between decision-making for a particular trial, i.e. stopping recruitment or not, and decision-making about the drug or intervention evaluated). Other tools have also been proposed to improve the interpretation of RCTs (see e.g. Shakespeare Lancet 2001; 357: 1349-53).

3. Last on this broad issue, the topic of p-hacking could also have been covered, and there have been large-scale analyses of the distribution of P-values on that topic, too (e.g. Adda et al. PNAS 2020).

4. The end of the introduction suggests the need to study how results with P-values just above 0.05 are described in RCTs, but the last paragraph and the study does it the other way round (what are the P-values for predefined 'non-significance-related phrases'), and that way, the question looks less pertinent.

5. A huge work has been done to search a predetermined set of phrases reportedly associated with 'non-significant' results. But actually, there is no good evidence that those phrases indicated a 'significance dance' from the authors. Why not describe a result with P=0.048 (assuming a 0.05 predefined threshold) as 'marginally significant'? And actually, the figure 3 shows P-values as low as 0.01, and the figure 4 shows some sentences with a fair proportion of P<0.05. So all the phrases do not correspond to 'almost significant' results. I acknowledge that the peak at 0.05-0.10 is impressive, but the text should better reflect that all those statements did not correspond to P-values > 0.05.

6. I was surprised that only 272 of the 505 predefined phrases were encountered in half a million RCT reports, since those sentences were encountered in the literature. The blog apparently also accounts for the psychology literature and non-randomized studies, but this was striking. Also, I failed to access the URL given for that blog. It should therefore be checked for accuracy.

7. I wondered whether phrases such as 'did not quite reach statistical significance' actually correctly interpret the P-value as being above the significance threshold. I am not a native English speaker, so I may miss the exact meaning, but reading such a phrase, I would understand 'P > 0.05, but perhaps not very high'. This sounds different from 'marginally significant', 'borderline significant', or 'almost significant'. Some of the phrases also seem to me to indicate statistical significance: for instance, I would interpret 'nominally significant' as 'P < 0.05, but close to it'. If the study had extracted P-values between, say, 0.05 and 0.20 and then looked at how they were described, the story would be different. But it seems that not all the phrases searched equally suggested a spin towards significance when P > 0.05. This is discussed (second limitation), but too briefly.

8. The P-value threshold at 0.05 is most common. But some trials may have predefined other thresholds. At least for the RCTs for which P-values were manually extracted, did the authors look for such information?

9. On what basis was it decided to manually extract P-values for 29,000 phrases?

10. The very end of the abstract advocates the reporting of exact P-values. This is however not directly related to the study results. And actually, it seems that it was possible to extract the exact P-values corresponding to many extracted phrases. So, they were actually reported.

11. Another important issue is how to interpret secondary outcomes of RCTs. Many argue that, in the absence of a correction for multiplicity or a gatekeeping approach, one should not look at the P-values for secondary outcomes. Moreover, although this is unfortunately not always the case, the conclusion of an RCT on efficacy should target the primary outcome. It would therefore be interesting to distinguish primary and secondary outcomes here. I understand, however, that this would be virtually impossible to automate.

Additional comments

1. The term 'phrase model' is unclear.

2. The introduction refers to 'low evidence results'. Yet, as the authors later acknowledge, the information brought by a study is not so different if it achieved P=0.049 or P=0.051. Moreover, many other elements may affect the level of evidence, or confidence in the evidence brought by the study, such as methodological characteristics, risk of bias, etc. This should be reflected in the sentence, or rephrased.

3. How was the density transformed to a percentage in Figure 3?

4. I wondered what 95% 'proportional' confidence intervals are (legends of Figures 1 and 4). Are they confidence intervals of the proportion? Is that common wording? (I have never seen it.)

Reviewer #3:

The authors aimed to assess the prevalence of a specific set of phrases describing non-statistically significant results in articles reporting the findings of RCTs. They used exact matching to identify instances of this corpus across 567,758 articles. I have the following comments to improve the reporting and conclusions:

1) The authors cited the work of David Chavalarias and colleagues (JAMA 2016) but did not compare their findings to theirs. In particular, the sample of Chavalarias et al. included RCTs as indexed by Medline, and they already quantified the evolution over time of the prevalence of p values < 0.05 (and thus of p values >= 0.05). They found that the proportion of p values < 0.05 was 91.4% in RCTs (completely consistent with the present authors' finding of ~9%), and they also found that this proportion remained constant over time. Chavalarias et al. also assessed the statements associated with p values in a subsample.

2) The premise of the article is that describing RCT results as 'almost significant' (or 'marginally significant') can influence the perception of readers. The findings show that 'almost significant' is not frequent (only 0.14%). 'Marginally significant' was more prevalent (1.36%) but still much smaller than the prevalence of p values > 0.05 (about 9%). In addition, the use of these phrases seems to be stable over time (in particular, the phrases do not appear among the increasing or decreasing trends in Figure 4). The conclusions of the authors do not seem consistent with these findings.

3) The Methods section does not provide any detail regarding the identification of RCTs. Did the authors use a filter to query PubMed? Which one? (See doi:10.5195/jmla.2020.912.) What are the sensitivity and specificity of the filter? How did the authors address false positives? A sketch of what such a query might look like is given below.
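
For concreteness, a hypothetical sketch of such a query via the NCBI E-utilities; the authors' actual filter is not reported, so the search term below (publication type, language, and date range) is an assumption:

```python
# Hypothetical PubMed E-utilities query; the search term is an assumption,
# not the authors' documented filter.
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
term = ('randomized controlled trial[pt] AND english[la] '
        'AND ("1990/01/01"[dp] : "2020/09/30"[dp])')

resp = requests.get(ESEARCH, params={"db": "pubmed", "term": term,
                                     "retmax": 0, "retmode": "json"})
resp.raise_for_status()
print("records matched:", resp.json()["esearchresult"]["count"])
```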

4) The authors could not retrieve 18.53% of identified articles. What was this proportion across publication years? Did it change? Were the authors less likely to retrieve full texts of older articles? What is the implication for their findings?

5) The URL provided in reference 11 is not accessible. I find this very problematic. Does the list of phrases from Academia Obscura include all 505 phrases used by the authors? How was this list developed? Is this corpus derived specifically from RCTs? Is it based on articles spanning 1990 to 2020? In addition, how can the reader understand the extent to which these phrases are sensitive and specific to the reporting of non-significant results? The authors refer to phrases "potentially associated with reporting non-significant results", but the reader has no way to judge the meaning of "potentially".

6) The Methods section does not mention that the authors selected articles published in English. The proportion of excluded articles based on language is not reported.

7) 233 (46%) phrases were not identified among the 567,758 PDFs. What were the main differences between the 272 phrases found at least once and the 233 that were not?

8) The authors searched for exact matches. They should justify why they did not use approximate string matching methods. What is the implication for their findings? (A sketch of such an approach follows.)
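
A minimal sketch of what approximate matching could look like, using the standard-library difflib; the example sentence and the idea of a similarity cutoff are invented for illustration:

```python
# Exact vs approximate phrase matching; the example text is invented.
import difflib

phrase = "failed to reach statistical significance"
text = ("In the intervention arm the difference failed to attain "
        "statistical significance (P = 0.07).")

print("exact match:", phrase in text.lower())  # False: wording differs

# Slide a window of the phrase's word length over the text and keep the
# best similarity ratio; a cutoff (e.g. 0.85) would flag near-matches
# that exact matching misses.
words, n = text.lower().split(), len(phrase.split())
best = max(difflib.SequenceMatcher(None, phrase,
                                   " ".join(words[i:i + n])).ratio()
           for i in range(len(words) - n + 1))
print(f"best window similarity: {best:.2f}")
```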

9) The authors grouped selected articles into 3-year periods. The first period would be 1990-1992 and the last complete period would be 2017-2019. How did they handle the final partial period from Jan to Sep 2020? (See the sketch below.)
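
A minimal sketch of such 3-year binning, illustrating why the partial Jan-Sep 2020 period needs an explicit rule (the binning function is hypothetical, not the authors' code):

```python
# Hypothetical 3-year binning of publication years; note that 2020 falls
# outside the last complete 2017-2019 bin and needs an explicit rule.
def period(year: int, start: int = 1990, width: int = 3) -> str:
    lo = start + ((year - start) // width) * width
    return f"{lo}-{lo + width - 1}"

for y in (1990, 1992, 2017, 2019, 2020):
    print(y, "->", period(y))  # 2020 -> '2020-2022', an incomplete bin
```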

10) There is no detail regarding the method for sampling articles to manually extract p values. Was the random sampling stratified according to publication year or other factors? (A sketch follows below.)
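
For illustration, a hypothetical sketch of year-stratified random sampling; the DataFrame and its columns are invented:

```python
# Hypothetical year-stratified sampling of phrase hits with pandas;
# the 'pmid' and 'year' columns are invented.
import pandas as pd

hits = pd.DataFrame({
    "pmid": range(12),
    "year": [1991, 1995, 1995, 2001, 2001, 2001,
             2008, 2008, 2014, 2014, 2019, 2019],
})

# Draw up to two hits per publication year; a fixed seed makes it reproducible.
sample = (hits.groupby("year", group_keys=False)
              .apply(lambda g: g.sample(min(len(g), 2), random_state=42)))
print(sample)
```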

11) Did the authors search for exact matches in both the abstract and the main body of the selected articles? Were the results different when stratified by location (abstract vs main text)?

12) Figure 4 shows large percentages of p values < 0.05 for certain phrases. Across all extracted p values, what are the frequency and percentage of p values < 0.05? What is the explanation? Is it that the significance threshold was lower than 0.05 in these papers? Or that the selected phrases are not specific enough? It seems to be a considerable limitation in the face of the authors' objective of assessing the reporting of non-statistically significant findings. (A sketch of the requested summary follows.)
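
The requested summary is a one-line computation once the extracted values are available; the P values below are invented:

```python
# Frequency and percentage of extracted P values below 0.05 (toy data).
import numpy as np

p_values = np.array([0.06, 0.07, 0.051, 0.09, 0.048, 0.12, 0.03, 0.055])
below = p_values < 0.05
print(f"{below.sum()} of {p_values.size} extracted P values "
      f"({100 * below.mean():.1f}%) are below 0.05")
```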

13) In the Discussion, the authors conclude: "We present a robust estimate of nine percent of RCTs using specific language to report P values around 0.06." I suggest rephrasing. Because of the lack of information regarding the Methods and the limitations mentioned above, I find the claim that the estimate is robust undue. And I do not follow 'p values around 0.06'.

14) In the abstract, the authors state: "the phrase prevalence remained stable over time, despite all efforts to change the focus from P < 0.05 to reporting effect sizes and corresponding confidence intervals". Did they assess whether effect sizes and confidence intervals were reported alongside non-statistically significant results? If not, I suggest modifying this conclusion.

Decision Letter 2

Roland G Roberts

22 Dec 2021

Dear Dr Tijdink,

Thank you for submitting your revised Meta-Research Article entitled "Almost significant: Prevalence and trends of common phrases associated with marginally significant results in 567,758 randomized controlled trials published between 1990 and 2020" for publication in PLOS Biology. I have now obtained advice from the original reviewers and have discussed their comments with the Academic Editor. 

Based on the reviews, we will probably accept this manuscript for publication, provided you satisfactorily address the remaining points raised by the reviewers. Please also make sure to address the following data and other policy-related requests.

IMPORTANT:

a) Please can you change your Title to "Analysis of 567,758 randomized controlled trials published over 30 years reveals trends in phrases used to discuss results that are not statistically significant" or "...that do not reach statistical significance" or "...that are only marginally significant," whichever is most accurate.

b) Please address the remaining concerns raised by the reviewers. While reviewer #3 remains critical, after discussion with the Academic Editor, we think that their concerns can be addressed by clarifying methods, toning down claims and making the limitations more explicit.

c) Please could you supply a blurb, according to the instructions in the submission form?

d) Please check that you adhere to our guidelines pertaining to systematic reviews and meta-analyses (e.g. by including a completed PRISMA checklist and flow diagram) https://journals.plos.org/plosbiology/s/best-practices-in-research-reporting#loc-reporting-guidelines-for-specific-study-types

e) Your current financial declaration says “The author(s) received no specific funding for this work.” Please can you confirm that this is correct.

f) Please address my Data Policy requests below; specifically, while we recognise that your raw data and code are presented in your GitHub deposition, we need you to supply the numerical values underlying Figs 2ABC, 3, 4, 5, S1, and S2. Please cite the location of the data clearly in each relevant Fig legend.

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within three weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

-  a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

-  a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

-  a track-changes file indicating any changes that you have made to the manuscript. 

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information  

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor,

rroberts@plos.org,

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797 

Many thanks for providing the raw data and code in Github. However, we also need the individual numerical values that underlie the Figures of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication. 

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 2ABC, 3, 4, 5, S1, and S2. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

IMPORTANT: Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

------------------------------------------------------------------------

DATA NOT SHOWN?

- Please note that per journal policy, we do not allow the mention of "data not shown", "personal communication", "manuscript in preparation" or other references to data that is not publicly available or contained within this manuscript. Please either remove mention of these data or provide figures presenting the results and the data underlying the figure(s).

------------------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #1:

Review PBIOLOGY-D-21-01082R2 manuscript entitled "Almost significant: Prevalence and trends of common phrases associated with marginally significant results in 567,758 randomized controlled trials published between 1990 and 2020"

I would like to thank the authors for having answered my comments and modified their manuscript accordingly.

I have some last comments:

- Abstract section: it is not clear what the rate of 68.1% at the end of the results represents. Is it the percentage of p-values between 0.05 and 0.15? Please clarify.

- Abstract section: in the conclusion, "demonstrate" is too strong. Please consider using "show" instead

- The introduction is still very long, and I still find some sentences or wording too vague or unclear. For example: "The use of this fixed threshold has important disadvantages as the context determines the corresponding threshold". I am not sure that the catch-22 expression is clear to most researchers. Also not really clear: "insufficient evidence results (including factors such as high risk of bias, other methodological weaknesses", "given the relatively rules-oriented RCT research environment", and "the success of an RCT".

- I am happy with the reference to spin in the introduction but I think it is not necessary to provide a catalog of findings from recent studies on that topic

- I also find the structure of the introduction hard to follow. I really suggest condensing it and being more factual, following a clear-cut plan.

- I would be more careful in the formulation of the objectives, in particular the part "explore what types of phrases are used when reporting statistical non-significant findings". It is not exactly what the authors did. This formulation suggests that they first searched for non-statistically significant results and then looked at the surrounding sentences. In this study, they did the contrary: they looked at the p-values following a phrase that could be related to a marginally significant result.

- I think this is also a limitation to acknowledge more clearly: having focused on all RCTs and not only on those with non-statistically significant results

- In the methods section, why did the authors use "clinical trial" as publication type and not "randomized controlled trial"?

- Typo in the statistical analysis paragraph: "principled".

- Another vague and pedantic sentence in the methods: "with the tendency of humans to understand the world by applying thresholds to continuous spectra".

- In the results: the sentence "the overall phrase-positive RCT prevalence was stable over time (8.65%)". Does this correspond, as indicated, to the overall prevalence, i.e., 49,134/567,758? It looks so, but "stable over time" in the same sentence is confusing. I would make two sentences.

- Results, associated p-values: I would expect here the prevalence of p-values in the range 0.05-0.15, and not only the median. Please also make two sentences.

- Results, associated p-values: the sentence "the highest percentages for relative frequent phrases were found in this particular range…" is unclear.

- Discussion, in principal findings: "no specific reasons were found for these differences over time". To me this is not really a finding, as the authors did not set out to study that. I would rather discuss that point later in the discussion.

- The conclusion should be more in line with the results. In particular, the sentence "fifteen years of advocacy to shift away from null hypothesis testing has not yet fully materialized in RCT publications"

- The flow chart could be improved

- Figure 5: I would add Q1-Q3 with the median if possible

Reviewer #2:

I thank the authors for their answers and revision of the manuscript.

Some issues may still be considered:

1. When describing the "mythical paradigm" of statistical testing in the introduction, the authors should make clear that the statistical framework does not mandate that P-values < 0.05 indicate a true effect and those > 0.05 indicate no effect. This is simply a rule to reject the null hypothesis or not, based on long-run properties allowing control of the type I and type II error rates (the latter provided studies are correctly powered). So it is more the interpretation that has pervasively permeated the medical community that has led to the shortcut P < 0.05 = true effect and P > 0.05 = no effect. This may be clarified; a small simulation of the long-run property is sketched below.
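
A small simulation of this long-run property, under the assumption of a true null and a two-sample t-test (all settings invented):

```python
# Under a true null, the 0.05 rule controls the long-run type I error
# rate at ~5%; no single P value certifies or refutes an effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_per_arm = 10_000, 100

false_positives = sum(
    stats.ttest_ind(rng.normal(0, 1, n_per_arm),   # both arms drawn from
                    rng.normal(0, 1, n_per_arm)    # the same distribution
                    ).pvalue < 0.05
    for _ in range(n_trials)
)
print(f"long-run type I error rate: {false_positives / n_trials:.3f}")  # ~0.05
```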

2. The flow chart should have arrows going out with the "excluded" studies at each step, and the reason why they leave the flow. For instance, starting from the 696,842 records, 129,084 are excluded before full-text download: they should appear on the chart, with the reason why the full text was not retrieved (there could be several reasons for different sets of studies).

3. The sixth limitation is quite vague and should be made more precise. I believe two reviewers alluded to this. The way the data were analyzed did not look at whether the RCTs achieved their primary outcome, nor at whether the phrases and p-values extracted related to the primary outcome. So the limitation should be clarified: we do not know whether what is reported relates to the primary outcome, and there is no stratification according to whether the trials reached their primary endpoint (showed a "significant result" for their primary outcome) or not.

Some minor points

1. Abstract, "RCTs regularly use specific phrases": RCTs themselves do not use any sort of sentence. Those phrases can be found in their reports. I would therefore write it differently, such as "Our results show that RCTs reports regularly contain specific phrases describing …" (not sure either there would be a need for a real demonstration).

2. In the introduction, I would not allude to the exact number of full texts analyzed, since this is already part of the results, but rather to the strategy: "We do this by quantitatively analysing 567,758 RCT full-texts …" would be changed to "We do this by quantitatively analysing RCT full-texts in English registered over the last three decades …".

3. What is the need for "direct" in the sentence "To obtain direct evidence on phrase changes …"? Would another type of model produce indirect evidence? Moreover, the evidence itself would be more in the data than in the model.

4. Figure 1 should be referenced in the text, e.g., "We obtained 567,758 full texts of the total of 696,842 PubMed-registered RCTs (81.47%) (Figure 1)".

5. The seventh limitation (related to not using elastic search) could be placed earlier, just after the statement that a fixed set of phrases was used: such a fixed set was used, and exact matches were searched.

Reviewer #3:

The response of the authors and the revised manuscript reveal flaws in the methodology. The methodology chosen by the authors does not allow answering the objective of assessing "how non-significant outcomes are reported in RCTs over the last thirty years" or "how phrases expressing non-significant results are reported in RCTs over the past thirty years". It is impossible to assess the robustness of the approach, but Figure 4 clearly shows that the approach is biased. In addition, these shortcomings are not acknowledged or are downplayed by the authors.

1) The whole methodology relies on a list of 505 phrases compiled and shared through a blog post by Matthew Hankins. The manuscript does not explain how the list was developed. The original blog post only mentions: "The following list is culled from peer-reviewed journal articles in which (a) the authors set themselves the threshold of 0.05 for significance, (b) failed to achieve that threshold value for p and (c) described it in such a way as to make it seem more interesting." So the internal and external validity of the findings relies entirely on a list of phrases developed by a single researcher, based on an irreproducible, non-systematic review of an ad hoc selection of articles of unspecified nature (biomedical research? RCTs?). As highlighted by all reviewers of this manuscript, the correct method would be to first identify a sample of RCTs with non-significant results and then evaluate how they are reported.

2) The authors downplay this major flaw. They state: "First, we [Matthew Hankins?] pre-defined more than five hundred phrases denoting results that do not reach formal statistical significance. We may have missed phrases with similar meanings. This would lead to an underestimated overall prevalence." Figure 4 shows that a certain proportion of the p values associated with the phrases are below 0.05. This clearly invalidates the approach. It is also incorrect to state that the prevalence is underestimated. The authors had the opportunity to report the exact proportion of p values < 0.05, but they did not. It is difficult to guess the proportion from Figure 4; a histogram would be better. It is the area under the probability density to the left of 0.05; a sketch of this calculation follows. The proportion of p values < 0.05 seems to be at least 2.5%, a relatively large percentage compared to the estimated 9%. The authors minimize this finding by stating: "Finally, not all predetermined sentences actually represent a p value above 0.05 (e.g. 'marginally significant'). However, in the manual analysis, we hardly found p values lower than 0.05 that corresponded with specific phrases".
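
A sketch of the calculation in question, contrasting the raw empirical proportion with the area under a smoothed density (the P values are invented; neither reproduces the authors' data):

```python
# Proportion of P values below 0.05: empirical count vs area under a
# kernel density estimate to the left of 0.05. Toy data only.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
p_values = np.clip(rng.normal(0.07, 0.03, 500), 0.001, 0.30)

empirical = np.mean(p_values < 0.05)
kde_area = gaussian_kde(p_values).integrate_box_1d(0.0, 0.05)

print(f"empirical proportion below 0.05: {empirical:.3f}")
print(f"area under KDE below 0.05:       {kde_area:.3f}")  # smoothing shifts it
```

A histogram with a bin edge at 0.05 would let readers make the same reading directly from the figure.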

3) The conclusion of the article "too much focus on formal statistical significance cut-offs hinders full transparency and increases the risk for misinterpretation and selective publication, particularly when P values approach but do not cross the 0.05 threshold." and of the abstract "To enhance responsible and transparent interpretation of RCT results, researchers, clinicians, reviewers, and editors should abandon the focus on formal statistical significance thresholds and stimulate reporting of P values with corresponding effect sizes and confidence intervals and focus on the clinical relevance of the statistical difference found in RCTs." are not supported by the findings and go well beyond them.

Decision Letter 3

Roland G Roberts

31 Jan 2022

Dear Dr Tijdink,

On behalf of my colleagues and the Academic Editor, Isabelle Boutron, I'm pleased to say that we can in principle accept your Meta-Research Article "Analysis of 567,758 randomized controlled trials published over thirty years reveals trends in phrases used to discuss results that do not reach statistical significance" for publication in PLOS Biology, provided you address any remaining formatting and reporting issues. These will be detailed in an email that will follow this letter and that you will usually receive within 2-3 business days, during which time no action is required from you. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have made any requested changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. 

Sincerely,

Roli Roberts

Roland G Roberts, PhD 

Senior Editor 

PLOS Biology

rroberts@plos.org

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Temporal plots for phrases with “strong” evidence (i.e., BFs between 10 and 100) for temporal change.

    Prevalence estimates are shown as dots, together with the linear regression model fit and 95% CI. The data can be found here: https://github.com/wmotte/almost_significant/tree/main/Fig_and_Data. BF, Bayes factor; CI, confidence interval.

    (DOCX)

    S2 Fig. Category percentages for the phrases describing nonsignificant results with the number of manually extracted P values with occurrences between 30 and 100 times in our manual analysis.

    Error bars represent the proportional 95% CI. The associated median P value is presented in the upper left corner of each phrase. CI, confidence interval.

    (DOCX)

    S1 Table. The 505 predefined phrases associated with reporting nonsignificant results.

    (DOCX)

    S2 Table. The evidence of temporal change in the phrases with at least 5 time points expressed as the BF relative to no temporal change (lower threshold set to 2.0).

    The colors represent the strength of evidence as specified in the main text. BF, Bayes factor.

    (DOCX)

    S3 Table. All extracted P values within the 3 range categories, as the proportion of the total of 11,926 extractions.

    (DOCX)

    Attachment

    Submitted filename: Rebuttal_defOkt2021.docx

    Attachment

    Submitted filename: RebuttalJan2022_PLoSBiology_def2.docx

    Data Availability Statement

All used PubMed IDs, detected phrases, co-text extractions, manually identified P values, and processing scripts are openly shared at https://github.com/wmotte/almost_significant (v1.0; http://doi.org/10.5281/zenodo.4313162) and at https://github.com/wmotte/almost_significant/tree/main/Fig_and_Data.

