Abstract
Background
Underpowered study designs undermine the reliability of experimental research, and there are growing concerns about the randomised controlled trials (RCTs) that inform musculoskeletal injury management. We assessed the statistical power and sample size calculations of such RCTs.
Methods
Electronic searches (MEDLINE and PEDro searched up to March 2024) identified meta-analyses of RCTs comparing conservative interventions for musculoskeletal injury, without restrictions on demographics, injury type, or outcome. Statistical power was estimated using two approaches: (1) meta-analytic—the RCT’s power to detect the summary effect of the meta-analysis it contributed to, and (2) conventional—the RCT’s power to detect Cohen’s small (d = 0.2), medium (d = 0.5), and large (d = 0.8) effect sizes. The RCTs’ manuscripts and registry entries were screened for sample size planning details.
Results
The search identified 4737 articles, yielding 41 eligible meta-analyses comprising 266 RCTs. The median power was 42% (54% among RCTs within statistically significant meta-analyses). Fewer than 1 in 3 RCTs from statistically significant meta-analyses had ≥ 80% power to detect the corresponding summary effect. The proportion of RCTs with ≥ 80% power to detect small, medium, and large effects was 0%, 7.9%, and 37.6%, respectively. One in four RCTs reported sample size calculations; 80% of these expected larger effects than they observed. RCTs not reporting sample size calculations were smaller and reported larger effects.
Conclusion
Low statistical power permeates musculoskeletal injury research, limiting the clinical utility of many RCTs. The underlying causes of low power in this field are multifactorial and extend beyond sample size calculation alone. Enhancing study power requires methodological improvements, including robust planning, stronger theoretical frameworks, multi-center collaboration, data sharing, and the use of valid, reliable outcome measures.
Key Points
• We analyzed data from 41 meta-analyses, comprising 266 RCTs investigating the effectiveness of conservative musculoskeletal injury management.
• Fewer than 1 in 10 RCTs had sufficient power (≥ 80%) to detect small to moderate treatment effects (SMD 0.2–0.5), and just 1 in 3 RCTs from statistically significant meta-analyses were sufficiently powered to detect the summary effect of the meta-analysis to which they contributed.
• Around one-quarter of RCTs adequately reported a sample size calculation; these tended to overestimate treatment effects a priori (delta inflation).
• Underpowering in the musculoskeletal literature stems from inadequate consideration of sample size at the trial design stage and is potentially exacerbated by delta inflation and the winner's curse (small-N effect size inflation).
Background
Musculoskeletal injuries are characterized by pain, functional impairment, and activity limitations, and incur significant economic and societal burdens [1]. In sports and exercise, prevalent musculoskeletal injuries include tendinopathies [2], patellofemoral pain [3] and ankle instability [4]. Management of these injuries typically involves conservative methods such as exercise, taping, manual therapy, or electrotherapeutic treatments such as laser therapy or extracorporeal shock wave therapy. The effectiveness of such treatments is ideally informed by randomised controlled trials (RCTs), which most often rely on null hypothesis significance testing when establishing an apparent treatment effect. However, despite the growing number of RCTs informing the conservative management of musculoskeletal injuries, concerns persist regarding research robustness in this field [5–7]. Audits in other medical disciplines indicate the pervasive presence of underpowered studies and suggest that underpowering is a key factor exacerbating the replication crisis [8, 9].
Statistical power—the probability of correctly rejecting the null hypothesis when it is false—has major implications for the accuracy of research findings. In frequentist terms, statistical power is the long-run probability that similar experiments will detect a treatment effect—should a true effect exist in the population [10]. The default in many scientific fields is that study designs should have at least 80% power, i.e., they should find a significant effect in 8 out of 10 replications (again, on the condition that there is a true effect to be found). All else being equal, RCTs with high statistical power are more likely to detect genuine empirical effects, while those with lower power face an increased risk of false negative findings (a Type II error). Low power also undermines positive findings: the probability that a statistically significant result from an underpowered study reflects a true difference in the population is low [9]. Furthermore, when an actual treatment effect is correctly discovered, studies with lower power tend to overestimate the size of this effect, a phenomenon referred to as the 'winner's curse' [9].
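To make this long-run interpretation concrete, the short simulation below (our illustration, not part of the original analysis) runs many replications of a hypothetical two-arm trial with a true medium effect and counts how often a Welch t-test at the 5% level declares significance.

```python
# Minimal sketch (not from the paper): power as a long-run frequency.
# We simulate replications of a two-arm parallel trial with a true effect
# of d = 0.5 and count how often a Welch t-test at two-tailed alpha = 0.05
# rejects the null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def simulated_power(n_per_arm, true_d, alpha=0.05, n_sims=20_000):
    """Proportion of replications with p < alpha when a true effect exists."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_arm)
        treated = rng.normal(true_d, 1.0, n_per_arm)  # mean shift = Cohen's d (SD = 1)
        _, p = stats.ttest_ind(treated, control, equal_var=False)
        rejections += p < alpha
    return rejections / n_sims

# A hypothetical trial of 25 participants per arm and a true medium effect:
print(simulated_power(25, 0.5))  # ~0.41: the effect is found in only ~4 of 10 replications
```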
Best practice guidelines in experimental research include undertaking a sample size calculation at the trial planning stage [11]. These calculations are underpinned by the desired α (Type I) and β (Type II) error levels, the magnitude of the anticipated or target treatment effect, and the assumed variability among the study participants. Funding organizations and editors of academic journals routinely ask researchers to justify their study’s sample size and describe how it ensures sufficient statistical power. However, this precedent is not reflected across all fields of research and power calculations are rarely undertaken prospectively in the sports science [12] and orthopaedic [13] literature.
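For a two-arm comparison of means, these ingredients combine in a standard normal-approximation formula: n per group = 2(z_{1-α/2} + z_{1-β})² / d² for a standardised effect d. The sketch below is a textbook illustration of this formula, not code from any of the trials reviewed here.

```python
# Textbook normal-approximation sample size for a two-arm trial comparing
# means; illustrative only, not taken from any included RCT.
from scipy.stats import norm

def n_per_arm(d, alpha=0.05, power=0.80):
    """n per group = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2 (standardised effect d)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: {n_per_arm(d):.0f} per arm")  # ~392, ~63, ~25 per arm
```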
Statistical power can be examined a posteriori (after the fact). Like sample size calculations performed at the trial planning stage, the statistical power returned by such post hoc analyses is heavily contingent on the effect size [14]. Some authors base post hoc power on the RCT's observed effect size. However, this provides little additional information beyond the p-value associated with the study result; for example, a finding that is nominally significant at the 5% level will have at least 50% power, and 80% power is guaranteed if the p-value is below 0.005 [15] (see the sketch at the end of this section). An alternative is to use Cohen's thresholds for small (d = 0.2), medium (d = 0.5) and large (d = 0.8) effects [16]. These thresholds, which may not reflect discipline-specific magnitudes of effect perfectly [17], are well known and may thus be informative for readers within and outside a particular discipline when power is analysed across a body of research. Another option, which likely represents the best proxy for the true (but unknown) population effect of interest, is to examine the post hoc power of a given RCT using the pooled effect estimate from a meta-analysis to which it contributed. This meta-analytic approach has been used by Button et al. [9] and Stanley et al. [18], both finding that a large proportion of neuroscience and psychology research is underpowered, thus reducing the likelihood of successful replication in these fields.

The validity of frequentist research is underpinned by statistical power. As the volume of experimental research informing musculoskeletal injury recovery increases, it is important to determine its evidential value. Our primary objective was to determine the statistical power of RCTs investigating the effectiveness of conservative musculoskeletal injury management. We calculated post hoc power for such RCTs using (1) a meta-analytic approach, and (2) Cohen's thresholds for small, medium and large effects. As a secondary objective, we assessed the presence and rigour of the RCTs' a priori sample size calculations. We also investigated the extent of delta inflation, whereby trialists expected or targeted a larger effect size than the one they observed, and whether studies that fail to undertake a priori sample size calculations differ in their sample and effect sizes from those that do.
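As flagged above, the correspondence between an observed p-value and 'observed power' can be made explicit. The sketch below assumes a two-sided z-test and is an approximation for illustration only [15].

```python
# Sketch of the p-value/observed-power correspondence, assuming a
# two-sided z-test: power evaluated at the effect implied by the data
# is a deterministic function of the p-value.
from scipy.stats import norm

def observed_power(p, alpha=0.05):
    """Post hoc power evaluated at the effect size implied by the p-value."""
    z_obs = norm.ppf(1 - p / 2)        # |z| corresponding to the two-sided p-value
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

print(observed_power(0.05))   # ~0.50: a result just significant at 5% has ~50% observed power
print(observed_power(0.005))  # ~0.80: p = 0.005 corresponds to ~80% observed power
```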
Methods
Literature searching, data extraction and analysis were undertaken in accordance with the procedures outlined in our protocol, registered in February 2024 and available on the Open Science Framework (DOI: 10.17605/OSF.IO/8CHWN) [19].
A systematic literature search was undertaken in MEDLINE (via Ovid) and PEDro in March 2024. Each database was searched from inception by two reviewers (CB, JW). Search terms are shown in Supplement 1. Search results were exported to Rayyan (Rayyan Systems, Inc., MA, USA), where duplicates were removed. Study selection was initially based on the title and abstract, followed by the full-text article. Two reviewers (CB, JW) independently and blindly screened the search results to identify studies for inclusion, with disagreements resolved by consensus; if the two reviewers could not agree on whether to include a study, a third researcher (NK) mediated. We initially piloted the study selection and data extraction forms using 10% of the potentially eligible and eligible records.
We included meta-analyses of RCTs published between 2014 and 2024, informing the conservative management of musculoskeletal injury. The meta-analysis must have reported summary effect estimates (odds ratio, risk ratio, mean difference, standardized mean difference) and study-level data on the number of participants and the effect size for each constituent RCT. There were no restrictions placed on participant demographics, the type of injury, or the outcome measures employed. There were no restrictions placed on search terms or inclusion criteria based on language. The constituent RCTs must have employed a parallel 2-arm design. We excluded meta-analyses of studies that used randomized cross-over designs or surgical interventions. Meta-analyses reporting partial eta squared as the effect size were also excluded, as this statistic cannot be converted or compared to Cohen’s d. We also excluded meta-analyses with fewer than five constituent RCTs [18].
Data Extraction
At the meta-analysis level, we extracted information on the number of included RCTs, the intervention (e.g., exercise, electro-physical agent, manual therapy), comparator (e.g., control, sham, usual care), and outcome, whether the meta-analysts used fixed or random effects modelling, and the summary effect estimate and its 95% confidence interval (CI). If eligible study reports presented more than one meta-analysis, we extracted data from the primary meta-analysis. In cases where the primary meta-analysis was not clearly identified, we extracted data from the meta-analysis containing the largest number of RCTs [9]. We also extracted study-level data from the constituent RCTs (sample size, effect size, and binary significance based on p < 0.05). Furthermore, we surveyed the manuscripts and study protocols of all constituent RCTs for the reporting of an a priori sample size calculation as per the CONSORT (Consolidated Standards of Reporting Trials) statement [20]: the expected or target difference between groups and its standard deviation (i.e., the a priori effect size), any informing sources or rationale as to why the effect size was chosen (e.g., an empirical estimate from a published study or pilot), the Type I (α) and Type II (β) error probabilities, the RCT's planned sample size, and any allowances for attrition. All data were extracted independently by at least two authors (CB, JW, NK); any differences in data extraction detail were resolved by consensus. We presented all effect estimates as Cohen's d. Odds ratios (ORs) were converted to Cohen's d using the formula ln(OR)/1.81 [9, 21].
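The odds ratio conversion is a single expression; a minimal helper (the function name is ours) is shown below.

```python
# Chinn's [21] conversion from an odds ratio to Cohen's d.
import math

def odds_ratio_to_d(odds_ratio):
    """d = ln(OR) / 1.81."""
    return math.log(odds_ratio) / 1.81

print(odds_ratio_to_d(2.5))  # ~0.51, i.e. a medium effect on Cohen's scale
```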
Data Analysis
Primary Objective
We estimated statistical power for each RCT using an online calculator [22], assuming a two-tailed α of 5%. Power calculations were based on two different approaches. Firstly, we used conventional effect size thresholds representing small (d = 0.2), medium (d = 0.5) and large (d = 0.8) effects [23]. Secondly, we used a meta-analytic approach to evaluate the power of each RCT to detect the estimated summary effect of the meta-analysis to which it contributed [24]. The power of the constituent RCTs was presented descriptively using median values [interquartile range (IQR)], as these are less influenced by extreme values, and graphically using histograms.
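For readers who prefer scriptable tools, both approaches can be reproduced with standard power routines. The sketch below uses statsmodels rather than the online calculator the review relied on [22]; the trial size and pooled effect shown are illustrative values, not specific extracted data points.

```python
# Illustrative re-creation of both power approaches using statsmodels.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
n_per_arm = 22  # a hypothetical trial; roughly the median arm size among included RCTs

# (1) Conventional approach: power against Cohen's small/medium/large effects.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: {power_calc.power(effect_size=d, nobs1=n_per_arm, alpha=0.05):.2f}")

# (2) Meta-analytic approach: substitute the pooled effect of the
# meta-analysis the RCT contributed to (here, an example pooled d = 0.49).
print(power_calc.power(effect_size=0.49, nobs1=n_per_arm, alpha=0.05))  # ~0.35
```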
Secondary Objectives
We ascertained the presence and detail of a priori sample size calculations in the constituent RCTs. Using ORs with 95% CIs, we explored whether sample sizes and observed effect sizes differed between RCTs that rigorously reported a priori sample size calculations and those that did not. We also used ORs with 95% CIs to explore whether these two groups of RCTs differed in their likelihood of reporting large (d ≥ 0.8) and very large (d ≥ 1.2) effects [25].
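A minimal sketch of this comparison is shown below, using the standard Woolf (log-odds) confidence interval; the 2×2 counts are placeholders, not the study data.

```python
# Sketch of an odds ratio with a Woolf 95% CI from a 2x2 classification,
# e.g. rows: calculation not reported / reported; columns: large effect
# (d >= 0.8) yes / no. Counts below are placeholders only.
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """2x2 table [[a, b], [c, d]]: OR = (a*d)/(b*c); CI on the log-odds scale."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

print(odds_ratio_ci(75, 119, 16, 56))  # placeholder counts -> (OR, lower, upper)
```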
Sensitivity Analysis
We examined if the pooled effects reported in the included meta-analyses were robust to the removal of RCTs that rigorously reported a priori sample size calculations. We opted to perform this sensitivity analysis by excluding the studies with greater sample size detail (as opposed to excluding studies with inadequate sample size information) due to an imbalance in the number of studies in each category. As described in the results, very few studies had sufficient information regarding their planned sample size; including only those RCTs would result in too few studies to perform a meta-analysis.
Protocol Deviations
Our study protocol [19] details an additional approach for calculating power based on empirical thresholds for small, medium and large effects, derived from the 25th, 50th and 75th percentiles of effect sizes within our data sample [26]. This analysis was omitted from the manuscript as the distribution of standardised treatment effects (small = 0.2, medium = 0.5, large = 1.0) was very similar to Cohen's thresholds. We also did not undertake some of our planned exploratory analyses, as there were insufficient participants in each subgroup (intervention type; participant type) to permit meaningful comparisons. The sensitivity analysis was not included in our original protocol and should therefore be considered post hoc.
Equity, Diversity, and Inclusion Statement
The author group consists of four men and one woman of varying seniority from four countries. There were no limitations on patient demographics: any published meta-analysis pertaining to the conservative management of musculoskeletal injuries was eligible for inclusion. The discussion section reflects on the findings’ generalisability.
Results
Our search strategy identified 4737 articles. After removing duplicates, we screened the titles and abstracts of 4405 records, of which we excluded 4308. We then read the full-text versions of the remaining 97 articles, excluding 56 (references with reasons for exclusion are in Supplement 2). This left 41 meta-analyses eligible for inclusion, from which we extracted 328 study-level estimates contributed by 266 unique RCTs (Fig. 1).
Fig. 1.
Identification and selection of studies. MA, meta-analysis; RCT, randomised controlled trial
Study Characteristics
The characteristics of the included meta-analyses and their constituent RCTs are described in Supplement 3. Most of the meta-analyses examined the effects of an active intervention against a control or placebo/sham (26/41, 63.4%); 19.5% (8/41) compared two active interventions (e.g., bracing versus physiotherapy), and 17.1% (7/41) examined the effects of an adjuvant therapy (e.g., eccentric exercise versus eccentric exercise plus electrotherapy). In 78.0% of meta-analyses, the primary intervention was either exercise (18/41), an electrotherapeutic agent (e.g., low level laser therapy, extracorporeal shockwave therapy) (9/41), or taping/bracing (5/41). The most common musculoskeletal injuries examined in the meta-analyses were tendinopathy (17/41, 41.5%), ankle sprain/chronic ankle instability (8/41, 19.5%), general musculoskeletal pain (8/41, 19.5%), and Achilles tendon rupture (4/41, 9.8%). The most common outcome was pain (19/41, 46.3%) followed by reinjury (5/41, 12.2%), function (5/41, 12.2%) and balance (4/41, 9.8%).
The median number of RCTs included in the meta-analyses was 7 (range 5 to 20). The aggregate number of participants per meta-analysis ranged from 192 to 1568, with a median of 408 (IQR 268–582). Pooling was most often done using random effects modelling (31/41, 75.6%). The median pooled effect size was 0.49 (IQR 0.25 to 0.89). Sixteen of the 41 included meta-analyses (i.e., 39%) did not reach the 5% threshold for statistical significance.
Figure 2 shows the distribution of power across all study-level estimates (n = 328). The median power was 42% (IQR 14–73). When the analysis was restricted to estimates within meta-analyses that reported statistically significant results, the median power increased to 54% (IQR 31–84.5); however, only 29.6% (66/223) of these estimates had at least 80% power to detect the summary effect of the meta-analysis to which they contributed. There were 61 studies with extremely low (< 10%) power, the majority of which were derived from null meta-analyses (n = 51, 83.6%).
Fig. 2.
Distribution of study power (all study level estimates, n = 328). RCT, randomized controlled trial
Figure 3 shows the power of included RCTs to detect the conventional thresholds of effect size. None of the RCTs had ≥ 80% power to detect a small (d = 0.2) effect, and only 7.9% (21/266) and 37.6% (100/266) had ≥ 80% power to detect medium (d = 0.5) and large effects (d = 0.8), respectively.
Fig. 3.
Statistical power of included RCTs (n = 266) based on small, medium, or large effects. Dashed reference line represents the common default target of 80% power set by trials a priori. RCT, randomised controlled trial
Reporting of a Priori Sample Size Calculation
The sample size of the included RCTs ranged from 14 to 540, with a median of 43.5 (IQR 31–62). The median effect size was 0.53 (IQR 0.21–1.03); half of the RCTs reported effects between 0 and 0.5 (132/266, 49.6%), 16.2% (43/266) reported effects between 0.5 and 0.8, and 34.2% (91/266) reported effects > 0.8. Over half of the RCTs (53%, 141/266) did not provide any detail on sample size calculation within the manuscript or study protocol. Only 27.0% (72/266) included full details (quantifying both an effect size and thresholds for α and β error probabilities). Of the studies that provided α levels, almost all (114/116, 98.3%) used a 5% threshold. The most common β thresholds were 20% (79.7%, 94/118) and 10% (12.7%, 15/118), equating to statistical power of 80% and 90%, respectively. Just over one-quarter of studies (76/266, 28.6%) quantified their expected or target effect size as part of their sample size calculation. These effect sizes were most often based on previous research data (65%), with the remainder based on a pilot study (16%), a minimal detectable change or minimal clinically important difference (16%), or a pooled effect from a meta-analysis (3%).
Differences between Assumed Versus Actual Effect (Delta Inflation)
In the 76 RCTs quantifying their effect estimates a priori, the expected or target effect (median 0.73, IQR 0.5–1.0) was significantly higher (W = 452.5, p < 0.001) than the actual effect (median 0.36, IQR 0.17–0.7). Figure 4 shows the extent of delta inflation, with 82.8% (63/76) of RCTs overestimating the treatment effect a priori. All trials that expected or targeted an effect above 1.0 overestimated the actual effect.
Fig. 4.
Delta inflation: the distribution of assumed (a priori) versus actual effects (N = 76 RCTs). X and Y values log10-transformed for visual clarity. 82.8% (63/76) of randomised controlled trials overestimated the treatment effect a priori
RCTs that fully justified their sample size calculation a priori (n = 72) had a median sample size of 61.5 (IQR 40–121) and reported a median treatment effect of 0.35 (IQR 0.16–0.70). By comparison, RCTs that provided incomplete or no details on how sample size was determined (n = 194) had significantly smaller sample sizes (median = 40, IQR 30–55; Mann-Whitney U = 9928, p < 0.001) but larger treatment effect estimates (median = 0.61, IQR 0.28–1.06; U = 4982, p < 0.001). RCTs that did not report a sample size calculation were twice as likely to report a large effect (d ≥ 0.8) estimate (OR 2.1; 95% CI 1.3 to 3.5) and more than three times as likely to report an effect size ≥ 1.2 (OR 3.4; 95% CI 1.3 to 9.2). These distribution patterns are summarised in Fig. 5.
Fig. 5.
Scatter plot illustrating the distribution of effect size estimates (Cohen’s d) and sample size (N) in RCTs, distinguishing between those that rigorously justified their sample size (n = 72) and those that did not (n = 194). X and Y values transformed (log10) for visual clarity. RCT, randomized controlled trial; SMD, standardized mean difference
Sensitivity Analysis
Sensitivity analysis was not feasible for 8 of the included meta-analyses, as they did not include any RCTs with rigorously reported a priori sample size calculations. The 72 RCTs that did rigorously report these calculations contributed to the remaining 33 meta-analyses. The median number of RCTs removed from each meta-analysis in the sensitivity analysis (i.e., those with rigorous reporting of an a priori sample size calculation) was 2 (IQR 1–4). In 11 of the 33 meta-analyses, removing these RCTs resulted in lower perceived treatment efficacy (median −0.17 units, range −0.02 to −0.92). In two meta-analyses, the pooled effect was unchanged, and in the remaining 20 the pooled effect size increased (median 0.15 units, range 0.01–0.81). Overall, there was an aggregate increase in the pooled effect (median 0.08 units), corresponding to a 10% median relative increase over the original pooled effects (Fig. 6).
Fig. 6.
Change in pooled effects observed with sensitivity analysis (N = 33 meta-analyses). Values above the horizontal axis indicate greater perceived treatment efficacy when RCTs that rigorously reported a priori sample size calculations were removed from the meta-analysis
Discussion
Low statistical power and poor reporting of sample size calculations have been recorded across various fields of scientific research. To the best of our knowledge, this is the first time that statistical power has been systematically examined in the musculoskeletal injury literature. We used a meta-analytic approach to power calculation, since this likely represents the best proxy for the true (but unknown) population effect. Data were extracted from 41 meta-analyses, comprising 328 study-level estimates from 266 RCTs and an aggregate of more than 20,000 participants with musculoskeletal injury. These RCTs had a median sample size of 44 (IQR 31–62), and half reported effect sizes less than 0.5. A large proportion of RCTs were underpowered, with just 8% likely to detect small (d = 0.2) to moderate (d = 0.5) effects. Only 25% of studies adequately described undertaking a sample size calculation, and many of these showed delta inflation, wherein the expected or target treatment effect was larger than the observed effect. RCTs that failed to incorporate a sample size calculation were twice as likely to report large (d ≥ 0.8) treatment effects. Together, these issues contribute to inflated effect sizes in meta-analyses of conservative musculoskeletal injury management and overly optimistic conclusions about the effectiveness of such interventions.
Other research audits have estimated statistical power using a meta-analytic approach; their findings show low power across the psychology [18] and neuroscience [9] literature, with median values of 36% and 21%, respectively. We recorded a median power of 42% for RCTs in the musculoskeletal injury literature. Although higher than previous reports, this still calls into question the credibility and reproducibility of experimental research in this field. A common goal is that RCTs should have at least 80% power to find an effect, should one exist; however, we found that only one-fifth of RCTs had adequate power (i.e., ≥ 80%) to detect the summary effect of the meta-analysis to which they contributed. Consistent with Nord et al.'s [27] reanalysis of the neuroscience literature [9], median power increased when the analysis was restricted to RCTs within meta-analyses reporting statistically significant results, rising to 54%, with roughly 1 in 3 of these RCTs able to detect the summary effect of the meta-analysis to which they contributed. This increase in power is expected: when a population effect is zero (or very small), study power is often equivalent to (or approximates) the false positive error rate (α) [9].
It is common to quantify statistical power a posteriori using Cohen's standardised thresholds for small (d = 0.2), medium (d = 0.5), and large (d = 0.8) treatment effects [23]. In an audit of the orthopaedic literature, Reito et al. [28] found that all 233 included RCTs were insufficiently powered to detect a small effect, and fewer than one-third were powered to detect a medium effect. Others, also using Cohen's thresholds, reported considerable levels of underpowering within the rehabilitation literature [26, 29], with the average study showing 7% and 50% power to detect small and large treatment effects, respectively [26]. In the current audit, we found that none of the constituent RCTs were sufficiently powered to detect small effects, fewer than one-tenth were sufficiently powered for medium effects, and just over one-third for large effects. These figures are concerning given that roughly two-thirds of the constituent RCTs reported d < 0.8. Since the median number of participants per RCT was 44, 80% power would only be guaranteed for a typical musculoskeletal injury trial in situations where the treatment effect is large.
Underpowering increases the risk of Type II error, whereby the researcher concludes that no treatment effect exists when one is actually present. If this is endemic across a field of research, it creates resource and ethical problems: why expose participants to risks and burdens if there is limited empirical return or societal benefit? Such small-N studies are also more sensitive to post hoc analytic changes, dropouts, and selective reporting [30], raising further questions about the validity of evidence in this field.
Few of the included RCTs detailed a sample size calculation. Others report that only 6–10% of experimental trials in the orthopaedic [13] and sports science [12] literature include a formal sample size estimation. A more recent review of the orthopaedic literature by Charles et al. [31], covering a wider range of publication dates, shows improvement over time, with the proportion of trials reporting a sample size calculation increasing from 4% in 1980 to 95% in 2006. Others have examined whether interventional trials provide adequate detail in their sample size calculations (e.g., α and β errors, treatment effect size); although a high proportion (84%) stated that an a priori power analysis was undertaken, only half of these calculations were replicable [28]. We found a similar pattern: around half of the constituent RCTs made some reference to a sample size calculation in their methods, but only 50% of these were sufficiently detailed to allow replication.
In line with other reports [9], we found that RCTs providing incomplete or no details on their sample size calculation had considerably smaller sample sizes and reported larger effect sizes. For example, of the 17 RCTs that recorded huge treatment effects (d > 2.0) [25], 16 failed to provide sample size planning details, and all had fewer than 70 participants (35 in each group). Scientists' propensity for finding exaggerated effects, often referred to as 'the winner's curse', is particularly likely when conducting small, low-powered studies [9]. Based on simulations by Zollner and Pritchard [32] and later Button et al. [9], and assuming the median statistical power of musculoskeletal studies is between 40% and 55%, initial effect estimates from RCTs in this field may be inflated by 15–20%. Our findings suggest that the winner's curse in musculoskeletal injury research can be mitigated by performing sample size calculations at the trial design stage.
Positing a treatment effect at the trial design stage that is both important and realistic can be challenging. Consistent with other fields of health research [33], we found that when target effect sizes were reported, they were primarily based on existing data. Best practice guidance (DELTA2) is to consult a range of evidence sources (both existing data and expert opinion) and incorporate simulations that take into account the minimal effect deemed clinically important by at least one key stakeholder (e.g., patient group, health professional, funder) [14].
Delta inflation is the systematic overestimation of a treatment effect at the trial design stage [34]. This pattern emerged in the current review among the RCTs disclosing their sample size calculation, with a median anticipated effect size of 0.73 against a median observed effect of 0.36. A review of RCTs in the Health Technology Assessment journal, which publishes results irrespective of statistical significance, found a smaller delta gap of 0.2 (anticipated effect sizes of 0.3 versus observed effects of 0.1) [33]. Even mortality, perhaps the most ubiquitous outcome in RCTs, displays delta inflation, with delta gaps as high as 8.7% (mean predicted 10.1% versus mean observed 1.4%) [34]. As the sample size required to statistically ascertain a treatment effect is very sensitive to small changes in the anticipated effect size, overestimation at the trial design stage can be a key driver of underpowering in medical research. For example, doubling the target treatment effect in a typical 2-arm trial reduces the required number of participants by a factor of four (if the α and β thresholds are retained) [14], as the sketch below illustrates.
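This quadrupling rule follows from the required sample size scaling with 1/d², and can be checked with the same normal-approximation formula sketched earlier; the two effect sizes below loosely echo the assumed and observed medians reported above and are used purely for illustration.

```python
# The quadrupling rule: required n scales with 1/d^2, so halving an
# (over-)estimated effect quadruples the sample needed. Illustrative only.
from scipy.stats import norm

def n_per_arm(d, alpha=0.05, power=0.80):
    z_sum = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * z_sum ** 2 / d ** 2

print(round(n_per_arm(0.70)))  # ~32 per arm if an inflated target effect is assumed
print(round(n_per_arm(0.35)))  # ~128 per arm for an effect half that size
```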
We found that many of the original pooled effects in the musculoskeletal injury literature were not robust to the removal of RCTs with rigorous sample size calculations. The changes in summary effects we observed varied in both magnitude and direction. However, basing the meta-analyses solely on RCTs without adequately reported sample size calculations (which also tended to be smaller) inflated treatment efficacy by a median of 10%. Inflated treatment effects can create unrealistic clinical expectations and may be cherry-picked by researchers to inform sample size calculations in subsequent studies, leading to further underpowered studies and a decreased likelihood of successful replication. We hope that our analysis can help break this potential vicious circle by encouraging researchers to prioritise prospective sample size calculation per DELTA2 guidance [14] and provide more rigorous and transparent reporting of what constitutes a clinically meaningful difference. Our findings should also serve as a reminder to clinicians and policymakers that not all sections of the musculoskeletal literature are equally robust. Rigorous appraisal of evidence quality is essential when determining which RCTs and meta-analyses should be prioritised to inform clinical decisions and policy change.
Limitations
Our review covers a wide range of RCTs informing the conservative management of musculoskeletal injuries. These findings may not be generalisable to surgical interventions within this research field. A more exhaustive account is always possible; we could have searched the grey literature and additional electronic databases besides MEDLINE and PEDro. However, our goal was to produce a representative, rather than exhaustive, account of statistical power in RCTs informing the conservative management of musculoskeletal injuries. To this end, we have identified that a high proportion of such RCTs do not adequately consider and plan their sample size, which negatively impacts their power. The reasons for this neglect are complex and beyond the scope of this article, but are likely underpinned by incentives to publish a high volume of work at the cost of quality, a lack of statistical expertise among trial personnel, and funding constraints [35, 36].
We based our meta-analytic power calculations on the analytical decision made by the original meta-analysts regarding fixed versus random effects modelling. These two modelling approaches differ conceptually and statistically and should be interpreted accordingly. While a fixed effects model assumes that all constituent studies have sampled from the same patient population and aim to estimate a common effect, a random effects model assumes that there are genuine differences between the studies’ patient populations and their corresponding treatment effects, with the pooled estimate representing the mean of the distribution of these effects [37]. The random effects model often represents the more realistic clinical scenario, as some degree of clinical heterogeneity between studies may be expected. However, the random effects model assigns more weight to smaller studies, a property that has been criticized due to the tendency of smaller studies to report inflated effect sizes [38]. In our study, we felt it was appropriate to retain the original meta-analysts’ analytical intentions to answer our substantive question of whether RCTs in musculoskeletal injury research are adequately powered to detect the pooled effect of the meta-analysis to which they contributed.
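The weighting behaviour described above is easy to verify numerically with the standard inverse-variance formulas (fixed effect: w_i = 1/v_i; random effects: w_i = 1/(v_i + τ²)); the variances below are hypothetical.

```python
# Fixed-effect weights are 1/v_i; random-effects weights are 1/(v_i + tau^2).
# Adding the same tau^2 to every study flattens the weight gradient, so
# small (high-variance) studies gain relative influence. Values are hypothetical.
import numpy as np

v = np.array([0.01, 0.10])   # within-study variances: one large, one small study
tau2 = 0.05                  # assumed between-study variance

w_fixed = (1 / v) / (1 / v).sum()
w_random = (1 / (v + tau2)) / (1 / (v + tau2)).sum()

print(w_fixed)   # ~[0.91, 0.09]: the large study dominates under fixed effect
print(w_random)  # ~[0.71, 0.29]: the small study's share roughly triples under random effects
```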
The included RCTs might have reported inflated effect sizes due to other biases as well, such as attrition and selective reporting. Cochrane recommends performing a thorough risk of bias assessment when undertaking a meta-analysis and examining whether summary effects are affected by the inclusion of studies at high risk of bias. Although we assessed the impact of a priori power calculations on the pooled estimates, we did not examine all sources of bias recommended by Cochrane; since many of the included meta-analyses were published prior to the release of Cochrane's revised risk of bias tool [39], such an assessment would require a re-appraisal of a substantial volume of RCTs, which is beyond the scope of this project.
Although the average power in this field was low, there was notable variation across studies. A blanket increase in sample size would be crude, and the relationship between sample size and credibility is not absolute. Whilst larger-N studies have higher power, they are time-consuming, expensive, and create additional challenges for participant recruitment, retention, and standardization of procedures. Increasing research collaboration and combining individual-level data may be important approaches to augment study power [9]. Study power can also be enhanced through more robust methodological design, in particular by ensuring that there is a strong theoretical foundation for the research question and by incorporating high-quality, reliable outcome tools (thereby reducing measurement variability) [40].
We chose to focus on the conventional significance-level approach, applied to superiority designs, as this is the most common across healthcare research. However, we acknowledge that other approaches can underpin sample size calculation, such as precision-based (confidence interval) [41] and Bayesian methods.
Conclusion
This study highlights the prevalent issue of underpowered study designs and the insufficient implementation of sample size calculations in musculoskeletal injury research. Only 8% of RCTs informing the conservative management of such injuries had sufficient power (≥ 80%) to detect small (d = 0.2) to moderate (d = 0.5) treatment effects, and just 1 in 3 RCTs from statistically significant meta-analyses were sufficiently powered to detect the summary effect of the meta-analysis to which they contributed. While one-quarter of RCTs reported the core details of their sample size calculation, these studies tended to overestimate treatment effects a priori. RCTs that did not include sample size calculations were typically smaller and tended to report larger effect sizes. Inadequate consideration of sample size at the trial design stage, delta inflation, and small-N effect size inflation (the winner's curse) perpetuate underpowering in this field. These shortcomings complicate replication efforts, waste resources and are ethically questionable; collective action is needed to prioritise quality in musculoskeletal injury research and uphold the integrity of scientific inquiry.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary Material 1: Search strategy
Supplementary Material 2: Excluded studies
Supplementary Material 3: Study characteristics of meta-analyses and constituent RCTs
Supplementary Material 4: PRISMA-S Checklist
Abbreviations
- IQR
Interquartile range
- RCT
Randomised controlled trial
- SMD
Standardised Mean Difference
Author Contributions
CB is the guarantor. The research objectives were conceptualised by CB, FN and JS. The investigation (including literature searching and data extraction), data curation, and original draft writing were undertaken by CB, NK and JW. Data visualisation and formal analysis were undertaken by CB, FN and JS. All authors have contributed to, read and approved the final version of the manuscript, and agree to be accountable for its content.
Funding
No financial support was received for the conduct of this study, or for the preparation or publication of this manuscript.
Data Availability
All data generated or analysed during this study are included in this published article and its supplementary information files.
Declarations
Ethics Approval and Consent to Participate
N/A.
Consent for Publication
N/A.
Competing Interests
None of the authors have competing interests to declare.
References
- 1. Mesa-Castrillon CI, Beckenkamp PR, Ferreira M, Simic M, Davis PR, Michell A, et al. Global prevalence of musculoskeletal pain in rural and urban populations. A systematic review with meta-analysis. Aust J Rural Health. 2024.
- 2. Hopkins C, Fu S, Chua E, Hu X, Rolf C, Mattila VM, et al. Critical review on the socio-economic impact of tendinopathy. Asia Pac J Sports Med Arthrosc Rehabil Technol. 2016;4:9–20.
- 3. Glaviano NR, Boling MC, Fraser JJ. Anterior knee pain risk in male and female military tactical athletes. J Athl Train. 2021;56(11):1180–7.
- 4. Gribble PA, Bleakley CM, Caulfield BM, Docherty CL, Fourchet F, Fong DT, et al. Evidence review for the 2016 international ankle consortium consensus statement on the prevalence, impact and long-term consequences of lateral ankle sprains. Br J Sports Med. 2016;50(24):1496–505.
- 5. Bleakley CM, Matthews M, Smoliga JM. Most ankle sprain research is either false or clinically unimportant: a 30-year audit of randomized controlled trials. J Sport Health Sci. 2021;10(5):523–9.
- 6. Bleakley C, Smoliga JM. Validating new discoveries in sports medicine: we need FAIR play beyond p values. Br J Sports Med. 2020;54(21):1239–40.
- 7. Bleakley C, Reijgers J, Smoliga JM. Many high-quality randomized controlled trials in sports physical therapy are making false-positive claims of treatment effect: a systematic survey. J Orthop Sports Phys Ther. 2020;50(2):104–9.
- 8. Szucs D, Ioannidis JPA. Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature. PLoS Biol. 2017;15(3):e2000797.
- 9. Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14(5):365–76.
- 10. Miller J. What is the probability of replicating a statistically significant effect? Psychon Bull Rev. 2009;16(4):617–40.
- 11. Hopewell S, Boutron I, Chan A, Collins GS, de Beyer JA, Hrobjartsson A, et al. An update to SPIRIT and CONSORT reporting guidelines to enhance transparency in randomized trials. Nat Med. 2022;28(9):1740–3.
- 12. Abt G, Boreham C, Davison G, Jackson R, Nevill A, Wallace E, et al. Power, precision, and sample size estimation in sport and exercise science research. J Sports Sci. 2020;38(17):1933–5.
- 13. Bhandari M, Richards RR, Sprague S, Schemitsch EH. The quality of reporting of randomized trials in the Journal of Bone and Joint Surgery from 1988 through 2000. J Bone Joint Surg Am. 2002;84(3):388–96.
- 14. Cook JA, Julious SA, Sones W, Hampson LV, Hewitt C, Berlin JA, et al. DELTA2 guidance on choosing the target difference and undertaking and reporting the sample size calculation for a randomised controlled trial. BMJ. 2018;363:k3750.
- 15. Ioannidis JPA, Stanley TD, Doucouliagos HT. The power of bias in economics research. Econ J. 2017;127(605):236–65.
- 16. Cohen J. A power primer. Psychol Bull. 1992;112(1):155–9.
- 17. Mesquida C, Murphy J, Lakens D, Warne J. Replication concerns in sports and exercise science: a narrative review of selected methodological issues in the field. R Soc Open Sci. 2022;9(12):220946.
- 18. Stanley TD, Carter EC, Doucouliagos H. What meta-analyses reveal about the replicability of psychological research. Psychol Bull. 2018;144(12):1325–46.
- 19. Klempel ND, Bleakley C, Wagemans J. Is experimental research into musculoskeletal injuries adequately powered? A meta-review. OSF; 2024. Available from: osf.io/8chwn.
- 20. Schulz KF, Altman DG, Moher D. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. J Pharmacol Pharmacother. 2010;1(2):100–7.
- 21. Chinn S. A simple method for converting an odds ratio to effect size for use in meta-analysis. Stat Med. 2000;19(22):3127–31.
- 22. False positive risk web calculator, version 1.7 [homepage on the Internet]. [cited 2024-07-30]. Available from: http://fpr-calc.ucl.ac.uk/
- 23. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Routledge; 1988.
- 24. Ioannidis JPA, Khoury MJ. Assessing value in biomedical research: the PQRST of appraisal and reward. JAMA. 2014;312(5):483–4.
- 25. Sawilowsky SS. New effect size rules of thumb. J Mod Appl Stat Methods. 2009;8(2):597–9.
- 26. Kinney AR, Eakman AM, Graham JE. Novel effect size interpretation guidelines and an evaluation of statistical power in rehabilitation research. Arch Phys Med Rehabil. 2020;101(12):2219–26.
- 27. Nord CL, Valton V, Wood J, Roiser JP. Power-up: a reanalysis of 'power failure' in neuroscience using mixture modeling. J Neurosci. 2017;37(34):8051–61.
- 28. Reito A, Raittio L, Helminen O. Revisiting the sample size and statistical power of randomized controlled trials in orthopaedics after 2 decades. JBJS Rev. 2020;8(2):e0079.
- 29. Ottenbacher KJ, Barrett KA. Statistical conclusion validity of rehabilitation research. A quantitative analysis. Am J Phys Med Rehabil. 1990;69(2):102–7.
- 30. Dwan K, Gamble C, Williamson PR, Kirkham JJ, Reporting Bias Group. Systematic review of the empirical evidence of study publication bias and outcome reporting bias - an updated review. PLoS One. 2013;8(7):e66844.
- 31. Charles P, Giraudeau B, Dechartres A, Baron G, Ravaud P. Reporting of sample size calculation in randomised controlled trials: review. BMJ. 2009;338:b1732.
- 32. Zollner S, Pritchard JK. Overcoming the winner's curse: estimating penetrance parameters from case-control data. Am J Hum Genet. 2007;80(4):605–15.
- 33. Rothwell JC, Julious SA, Cooper CL. A study of target effect sizes in randomised controlled trials published in the health technology assessment journal. Trials. 2018;19(1):544.
- 34. Aberegg SK, Richards DR, O'Brien JM. Delta inflation: a bias in the design of randomized controlled trials in critical care medicine. Crit Care. 2010;14(2):R77.
- 35. Nayak BK. Understanding the relevance of sample size calculation. Indian J Ophthalmol. 2010;58(6):469–70.
- 36. Kammar-Garcia A, Fernandez-Urrutia LA, Guevara-Diaz JA, Mancilla-Galindo J. Statistical considerations for the design and analysis of pragmatic trials in aging research. Geriatrics (Basel). 2024;9(3):75. 10.3390/geriatrics9030075.
- 37. Dettori JR, Norvell DC, Chapman JR. Fixed-effect vs random-effects models for meta-analysis: 3 points to consider. Global Spine J. 2022;12(7):1624–6.
- 38. Pereira TV, Ioannidis JPA. Statistically significant meta-analyses of clinical trials have modest credibility and inflated effects. J Clin Epidemiol. 2011;64(10):1060–9.
- 39. Sterne JAC, Savovic J, Page MJ, Elbers RG, Blencowe NS, Boutron I, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. 2019;366:l4898.
- 40. McClelland G. Increasing statistical power without increasing sample size. Am Psychol. 2000;55:963–4.
- 41. Bland JM. The tyranny of power: is there a better way to calculate sample size? BMJ. 2009;339:b3985.