Abstract
The results of statistical tests in orthopedic studies are typically reported using p-values. If a p-value is smaller than the pre-determined level of significance (e.g., <0.05), the null hypothesis is rejected in favor of the alternative. This automatic interpretation of statistical results, without consideration of the power of the study, has long been denounced by statisticians because it can lead to misinterpretation of study conclusions. In this paper, we review fundamental misconceptions and misinterpretations of p-values and power, along with their connection with confidence intervals, and we provide guidelines for orthopedic researchers evaluating and reporting study results. We include real-world orthopedic examples to illustrate the main concepts.
Please visit https://youtu.be/bdPU4luYmF0 for videos that explain the highlights of the paper in practical terms.
Keywords: p-value, orthopedics, arthroplasty, significance test, sample size, power
Introduction
Orthopedic studies often involve the estimation of measures of disease occurrence (i.e., means, proportions) or of differences in outcomes between two or more groups of patients (i.e., risk ratio, odds ratio, hazard ratio). The results of the statistical analyses are typically reported using p-values, but to correctly interpret the findings the necessary assumptions must be met. For instance, there should be sufficient power to allow the test to detect a difference between groups if such a difference exists. The power of a statistical test to detect a difference between groups depends on three factors: a) the probability of rejecting the null hypothesis when it is true (level of significance), b) the magnitude of the difference between groups we want to be able to detect, and c) the sample size, which is the number of individuals or observations in the study. With a large sample size, even small differences between groups can produce small p-values, whereas with small samples, even large differences are likely to generate p-values above the threshold for declaring statistical significance. That said, there is much more that needs to be explored in statistical analyses before drawing study conclusions. Statisticians have denounced the many misconceptions surrounding the definition and interpretation of p-values and power.1,2 The purpose of this paper is to illustrate the myths and reality surrounding the interpretation of statistical significance, p-values, and power, and ultimately to provide some practical guidelines for orthopedic researchers and reviewers for the correct interpretation of study results.
Chance and random error
Before describing the myths and realities of p-values and power, it is important to recognize two broad types of error that can occur when orthopedic researchers perform their studies: systematic errors (also referred to as systematic bias) and random errors that arise due to chance. As addressed in other papers in the series,3,4 systematic errors are methodological errors in the way the study population is selected, the way variables are measured, and/or the result of some confounding factors not being accounted for. Systematic errors are minimized through good study design and data collection practices. Random errors, on the other hand, are inherent in all studies and are unavoidable. The term “random” refers to variability in the data that occurs due to chance. Deviations of study observations from their true values that are not attributable to systematic error are attributed to random error. Random errors are estimated and expressed quantitatively using p-values and confidence intervals for the estimates, which are ranges of plausible values for the population parameters. The two ways to minimize random errors are increasing the sample size and designing studies more efficiently.5 On the other hand, no level of statistical significance (or lack thereof) or size of sample can validate the findings of a study if the study design and data collection processes are prone to systematic errors. Hence, the potential role of both systematic and random errors must be considered when interpreting the results of orthopedic studies.
P-value interpretation: myths and reality
P-values are used to examine the role of chance arising from random errors. Most orthopedic studies focus on either estimation or hypothesis testing. Estimation uses statistical methodology to estimate the true population value of an outcome. Examples include estimation of the 10-year rate of revision following total knee arthroplasty6, and estimation of the prevalence of total hip arthroplasty and total knee arthroplasty in the United States.7 Hypothesis testing, on the other hand, formally tests whether there is evidence of an effect or if a difference is present between comparison groups. This is accomplished by using statistical tests to examine the null hypothesis that there is no difference. For example: is there a significant difference in the rate of dislocation between patients who underwent total hip arthroplasty (THA) using a dual-mobility design versus those using a standard construct?
When comparing exposed to non-exposed, a p-value indicates the probability, under the assumption of no difference between the two groups, that a statistical summary of the data, such as the observed difference in outcome means or percentages, would be as extreme as, or more extreme than, the value actually observed. Assuming that there is no difference between groups (null hypothesis), the p-value measures the incompatibility of the data with the null hypothesis. A small p-value would indicate that the observed data are very different from the pattern expected under the null hypothesis. In other words, the data are unusual if all the assumptions are correct. Conversely, a large p-value suggests that the data are not unusual, again if all assumptions used to compute the p-value are correct. Here are two examples.
EXAMPLE 1: KOOS scores.
In a cohort of 1000 total knee arthroplasty patients, investigators want to compare Knee injury and Osteoarthritis Outcome Scores (KOOS) between male and female patients. The null hypothesis is that there is no difference in the mean preop to 6-month change in KOOS scores between males and females. The observed mean change in KOOS scores is 12 points in men and 10 points in women with a p-value=0.02.
EXAMPLE 2: Revision rates.
A study seeks to compare the risk of revision (outcome) associated with mobile-bearing versus fixed-bearing prostheses (exposure) in total knee arthroplasty (TKA) patients. The null hypothesis is that there is no difference in the risk of revision between the two prosthesis types. The investigators examine the 5-year revision rates in a cohort of 5,000 TKA patients and find a higher risk of revision in patients with mobile-bearing as compared to fixed-bearing prostheses, with a hazard ratio of 1.2 and p-value=0.08.
In example 1, the p-value of 0.02 suggests that if chance alone were creating the inconsistency between the data and the null hypothesis, an inconsistency as large as the one observed (or larger) would occur 2% of the time. In other words, if the same analysis were repeated on 100 separate datasets generated under the null hypothesis, a difference of this size or larger would be observed in 2 of the 100 datasets. Conversely, in example 2, the inconsistency would be as large as observed (or larger) 8% of the time (more than the generally accepted threshold of 5%). A p-value will never indicate whether the null hypothesis is true or not. If an analysis gives a statistically significant result, it simply means that the observed data are sufficiently incompatible with the null hypothesis to reject it. If an analysis gives a statistically nonsignificant result, it means that there is not enough evidence against the null hypothesis, but it does not mean that the null hypothesis is true.
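The arithmetic behind Example 1 can be sketched with a standard two-sample t-test. The group sizes, standard deviation, and simulated scores below are illustrative assumptions (the paper reports only the 1000-patient cohort, the 12- vs 10-point mean changes, and p=0.02), so the resulting p-value will not reproduce the published one exactly:

```python
# Sketch of the two-sample test behind Example 1, run on simulated
# (hypothetical) KOOS change scores rather than the study's actual data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Assumed: 500 men and 500 women, mean changes of 12 and 10 points, SD 15.
men = rng.normal(loc=12, scale=15, size=500)
women = rng.normal(loc=10, scale=15, size=500)

# Null hypothesis: no difference in mean change between the groups.
t_stat, p_value = stats.ttest_ind(men, women)
print(f"observed difference = {men.mean() - women.mean():.1f} points, p = {p_value:.3f}")
```

A small p-value here would say only that data this far from the null pattern would be unusual if the null hypothesis were true, not that the null is false or that the difference is clinically important.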
Therefore, a p-value itself only tells us the probability, under the null hypothesis, of obtaining a result of the observed magnitude or larger (a difference in change in KOOS scores between men and women of at least 2 points in example 1, or a hazard ratio for revision risk between mobile-bearing and fixed-bearing prostheses of at least 1.2 in example 2). Thus, it is a measure of compatibility between the observed data and the null hypothesis. Using p-values to declare statistical significance is in fact a statistical fallacy, because statistical significance has no meaning in itself; it is just a dichotomous expression of whether the p-value falls below an arbitrary threshold of 0.05. Moreover, a p-value provides no quantitative clue about the size of the effect of a risk factor or treatment.
Table 1 presents some common misconceptions surrounding the use and interpretation of p-values. For instance, a large p-value does not necessarily indicate that there is no difference between comparison groups, only that the observed result is not incompatible with the null hypothesis of no difference. Consequently, we cannot conclude that there is no association between a potential risk factor and the outcome based on the p-value alone. A large p-value might also indicate that the data are incapable of discriminating among competing hypotheses. In fact, if the null hypothesis of no difference were perfectly compatible with the data, the p-value would be equal to 1. Any p-value less than 1 suggests that the null hypothesis of no association is not perfectly compatible, therefore we should not disregard a potential association just yet. It is good practice to accompany the p-value with confidence intervals to provide context on the size of the estimated effect that might be missed otherwise.
Table 1 –
Myth and reality surrounding the p-value
| Myth | Reality |
|---|---|
| P-value measures the probability that the null hypothesis is true | The null hypothesis is assumed to be true when calculating the p-value |
| P-value is the probability that the data were produced by random chance alone | If chance alone were creating the discrepancy between the data and the null hypothesis, a discrepancy as large as (or larger than) observed would occur (p-value × 100)% of the time |
| A p-value > 0.05 indicates that there is no difference or treatment effect; conversely, a smaller p-value indicates a significant difference | Any effect can produce a small p-value if the sample size is large enough or the measurements are very precise. Conversely, a large effect may yield a large p-value if the sample size is small or the measurements are not precise enough. |
| Small p-values indicate clinically important differences. | No matter how small it is, a p-value tells us nothing about the magnitude of the difference or its clinical importance. Trivial differences can be highly statistically significant in large studies; conversely, an unimpressive p-value from a small study can accompany a clinically important difference. |
| Reporting the p-value is enough to draw study conclusions | Data analysis should go beyond the calculation of the p-value, with the addition of confidence intervals or other approaches that directly address the effect size, the uncertainty associated with the effect, and the correctness of the hypothesis. |
| A p-value of 0.01 means there is a 1% probability that the association in the study is produced by chance, or a 1% probability that the null hypothesis is correct | The p-value is not the probability that the null hypothesis is correct; it is calculated assuming that the null hypothesis is correct. It only refers to the probability that the data could deviate from the null hypothesis as much as they did, or more. |
Confidence intervals
Recognizing the misconceptions surrounding the use and interpretation of p-values, confidence intervals are the preferred method for reporting the results of statistical tests. In fact, confidence intervals are calculated using similar statistical methods as p-values, but they have the added advantage of quantifying both the strength of the association (e.g., between an exposure and an outcome) and the uncertainty or "gray zone" around the estimates. They illustrate the range of plausible values and help to interpret whether a clinically meaningful effect size or difference is consistent with the data. They also provide information about statistical power.
Figure 1 shows six different scenarios of 95% confidence intervals for the difference between females and males with respect to preop to 6-month change in KOOS scores. The width of a confidence interval depends on the sample size and the variability in the data. Larger studies tend to have narrower confidence intervals than smaller studies. For continuous outcomes, the width of the interval also depends on the variability in the outcome measurements, while for dichotomous outcomes it depends on the risk of the outcome. Even though the p-values for scenarios A and B would suggest opposite conclusions, in neither scenario does gender seem to be a meaningful discriminant of change in KOOS score. The estimated difference between females and males in scenario A is quite different from scenarios D-F, where a significant association between gender and KOOS score is also undetected (p>.05) but the confidence limits are quite broad, indicating either a smaller sample size or much larger variability in the measurements. Scenario C has the same effect size as E and F, but its narrower confidence interval indicates a much more precise estimate of the group difference, as is typically achieved in larger studies.
Figure 1 – Confidence intervals for the estimated difference between females and males change in KOOS score.

The interpretation of confidence interval width can lead to different study conclusions.
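Intervals like those sketched in Figure 1 can be computed from summary statistics. Below is a minimal sketch of a 95% confidence interval for the difference in mean KOOS change, using the 2-point difference from Example 1; the group sizes and common standard deviation are hypothetical assumptions:

```python
# 95% confidence interval for a difference in means (normal approximation).
import math
from scipy import stats

n1, n2 = 500, 500      # assumed group sizes (hypothetical)
mean_diff = 2.0        # observed difference: 12 - 10 KOOS points
sd = 15.0              # assumed common standard deviation (hypothetical)

se = sd * math.sqrt(1 / n1 + 1 / n2)   # standard error of the difference
z = stats.norm.ppf(0.975)              # about 1.96 for a 95% interval
lower, upper = mean_diff - z * se, mean_diff + z * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Halving the standard error (roughly, quadrupling the sample size) halves the width of the interval, which is why larger studies produce the narrower intervals of scenarios like C.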
The relationship between the p-value and the confidence interval is even more evident in Figure 2. Continuing with the previous example, the curve represents the p-values for the test of compatibility of the study data (y-axis) with every possible value of the estimated difference in KOOS score between males and females (x-axis). Compatibility ranges from 0 (p=0) to 100% (p=1). In our example, the "best" estimate is a difference equal to 2 (p=1). As we depart from this "best" estimate, the p-value for the test of each specific value decreases, indicating less compatibility of the data with that hypothesized difference. Here it is evident how p-values are directly related to confidence intervals: hypothesized values inside the confidence interval yield larger p-values, while values outside it yield p-values below the significance threshold. A value of 0.05 on the y-axis (p-value), for instance, corresponds on the x-axis to the lower and upper limits of a 95% confidence interval for the estimated difference in KOOS score between females and males. A null hypothesis of a difference in change in KOOS scores between males and females equal to 1 will not be rejected (p=.09), as the 95% confidence interval for the population difference extends from 0.9 to 2.6 and therefore includes 1.
Figure 2 – Relationship between p-values and confidence intervals.

Hypothesized values inside the confidence interval correspond to larger p-values; values outside it correspond to p-values below the significance threshold.
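The compatibility curve of Figure 2 can be reproduced by computing a p-value for every hypothesized difference, not just zero. The observed difference and standard error below are illustrative assumptions chosen to echo the KOOS example:

```python
# P-value for testing each hypothesized difference d (two-sided z-test).
from scipy import stats

observed_diff = 2.0   # assumed "best" estimate of the difference
se = 0.6              # assumed standard error (hypothetical)

def p_for_hypothesis(d):
    # Compatibility of the data with a hypothesized true difference d.
    z = (observed_diff - d) / se
    return 2 * stats.norm.sf(abs(z))

print(p_for_hypothesis(2.0))   # hypothesis equal to the estimate: p = 1
print(p_for_hypothesis(0.0))   # conventional null of no difference
print(p_for_hypothesis(1.0))   # values nearer the estimate get larger p-values
```

The set of hypothesized values with p ≥ 0.05 is exactly the 95% confidence interval, which is the link between the two quantities that Figure 2 illustrates.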
Power: How many patients do I need?
Suppose we are planning a study comparing a new TKA prosthesis type with one that has been in use for several years. We are concerned about random variation due to chance that may affect our ability to reject the null hypothesis of no difference when a difference truly exists. The answer to this concern depends on three characteristics of the study: Type I error, Type II error, and the magnitude of the difference in outcomes between the comparison groups. These are explained below.
Type I error is defined as the error of rejecting the null hypothesis when it is true (Table 2). A Type I error leads to the conclusion that a difference between groups exists when in reality there is none. To control the likelihood of this error, the level of significance of the test is set prior to beginning the data analysis. This value is termed α and represents the probability of a Type I error. Thus, the probability of rejecting the null hypothesis when it is true is no greater than this predetermined number. The smaller the value of α, the less likely it is that the null hypothesis will be rejected; conversely, a large value of α would not provide a meaningful guard against false-positive conclusions. The probability of a Type I error, or α, is typically set to 0.05 (or 5%).
Table 2 –
Type I and Type II error
| Null hypothesis: difference between exposed and non-exposed = 0 | True difference present | True difference absent |
|---|---|---|
| Statistical test significant | Correctly rejecting the null | Type I error: rejecting the null when it should not be rejected |
| Statistical test not significant | Type II error: failing to reject the null when it should be rejected | Correctly failing to reject the null |
The second type of error in hypothesis testing is Type II error (Table 2), which is defined as failing to reject the null hypothesis when, in fact, it is false. The probability of a Type II error is referred to as β, and it determines the power of a test: the power to reject H0 when it is false is equal to 1-β. As the likelihood of this error increases, the statistical power decreases and vice versa. A way to reduce Type II error and ensure adequate statistical power for a test is through the calculation of the sample size for the study. If the size of the sample is too small, the power will be insufficient, leading to the conclusion of no association when in fact there is an association. Conversely, in a study with a large cohort the power could be so high that a very small difference between groups is detected, but the difference could be clinically meaningless.
There is a tradeoff between Type I and Type II error: the more the risk of one type of error is decreased, the more the risk of the other increases. As neither type of error is inherently worse than the other, study design decisions depend on the clinical context. For example, if there are several similar prostheses on the market that function well, and a new prosthesis is very expensive, you may want to minimize the risk of claiming the new prosthesis is better when it is not really different from the others (low Type I error), even at the expense of a large chance of missing a superior prosthesis (large Type II error). It is of course possible to reduce both types of error by increasing the sample size. The sample size should be calculated using the effect size, which is based on the expected difference between groups in the outcome being studied. This is the difference in group means if the outcome is continuous, a proportion or odds ratio if the outcome is categorical, and a hazard ratio if it is a time-to-event outcome.
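The sample-size calculation described above can be sketched with the standard normal-approximation formula for comparing two means, n per group = 2(z₁₋α/₂ + z₁₋β)²σ²/δ². The effect size and standard deviation below are illustrative assumptions, not values from any study:

```python
# Per-group sample size for detecting a difference delta between two means.
import math
from scipy import stats

alpha, target_power = 0.05, 0.80
delta = 2.0    # smallest difference worth detecting (e.g., KOOS points)
sigma = 15.0   # assumed standard deviation of the outcome (hypothetical)

z_alpha = stats.norm.ppf(1 - alpha / 2)   # about 1.96
z_beta = stats.norm.ppf(target_power)     # about 0.84
n_per_group = math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)
print(n_per_group)
```

Note how n grows with the square of σ/δ: detecting a difference half as large requires roughly four times as many patients per group.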
Consider the following example: A study is conducted to evaluate femoral stem subsidence in a cohort of elderly patients who underwent primary cementless total hip arthroplasty (THA). Due to concerns about a possible higher prevalence of osteoporosis among females in the cohort, the researchers compared the amount of femoral stem subsidence between females and males based on radiographic exams at one-year follow-up. The curves in Figure 3a represent the distributions of the difference in mean stem subsidence (in mm) under the null hypothesis (solid line, H0: no difference in mean subsidence between females and males) and the alternative hypothesis (dashed line, H1: difference in mean subsidence between females and males = 2 mm), respectively. If the mean subsidence in males is truly greater than in females by 2 mm, and there is adequate power to detect a difference of this size, the null hypothesis is rejected. In the figure, the vertical line defines two areas under the distribution for the alternative hypothesis, β and power. The probability of Type I error is also represented as the tail of the distribution under the null hypothesis (α/2 for a two-sided test). An increase in α (a shift of the vertical line to the left) causes a decrease in β, and a consequent increase in power. Needless to say, increasing α to gain power is not recommended, since α represents the probability of Type I error and is rarely set above 10%. Power also increases if the difference between groups to be detected is larger. Figure 3b shows how a shift in the distribution under the new alternative hypothesis (H1: the difference in mean subsidence between females and males is 4 mm) causes a reduction of Type II error, and therefore an increase in power.
Figure 3 – Trade-off between Alpha (probability of type I error) and Beta (or 1-Power, probability of type II error).

Figure 3b shows how a shift in the distribution under the new alternative hypothesis (H1: difference in mean subsidence = 4 mm) causes a reduction of Type II error, and therefore an increase in power.
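The power calculation illustrated in Figure 3 can be sketched with a two-sample z-test under the normal approximation. The standard deviation and group size below are illustrative assumptions; only the 2 mm and 4 mm alternatives come from the example:

```python
# Approximate power of a two-sided, two-sample z-test for a difference delta.
import math
from scipy import stats

def power(delta, sigma, n_per_group, alpha=0.05):
    se = sigma * math.sqrt(2 / n_per_group)   # SE of the mean difference
    z_crit = stats.norm.ppf(1 - alpha / 2)    # rejection threshold
    # Probability of crossing the threshold when the true difference is delta
    # (the small contribution from the opposite tail is ignored).
    return stats.norm.sf(z_crit - delta / se)

print(power(delta=2, sigma=5, n_per_group=50))   # H1: 2 mm difference
print(power(delta=4, sigma=5, n_per_group=50))   # H1: 4 mm difference (higher power)
```

Doubling the hypothesized difference shifts the alternative distribution away from the null, shrinking β exactly as Figure 3b depicts.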
To evaluate whether the available sample size is sufficient for the planned analysis, several rules of thumb can help the researcher draw appropriate conclusions. One of the most common approximations states that for regression models (including linear, logistic, and proportional hazards survival models) there should be at least 10 observations for each predictor considered in the model. For logistic regression, the relevant number of observations is the size of the smaller of the two outcome categories.
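This rule of thumb can be expressed as a quick check. The revision count below is a hypothetical illustration, and the 10-per-predictor threshold is only a rough convention, not a substitute for a formal sample-size calculation:

```python
# Rough "10 observations per predictor" check for regression models.
def max_predictors(n_events, n_nonevents=None, per_predictor=10):
    """For logistic regression, the limiting count is the smaller of the
    two outcome categories; otherwise use the total sample size."""
    limiting = n_events if n_nonevents is None else min(n_events, n_nonevents)
    return limiting // per_predictor

# Hypothetical cohort: 5,000 TKA patients, 150 of whom were revised.
print(max_predictors(n_events=150, n_nonevents=4850))  # -> at most 15 predictors
```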
Conclusions
In conclusion, a p-value does not measure the probability that the null hypothesis is true, or the probability that the data were produced by random chance alone. It does not measure the magnitude of the treatment effect, and conclusions should not be based solely on whether the p-value is smaller than the predetermined level of significance. Despite its extensive use in the literature, a p-value by itself does not provide a good measure of evidence regarding a hypothesis, and it should not be used alone to draw conclusions or for decision-making. The following section provides some guidelines for orthopedic researchers and reviewers on how to meaningfully interpret study results that report p-values along with other statistical measures.
Supplementary Material
Guidelines for researchers and reviewers.
- The existence of an effect or association should not be assumed based solely on the fact that:
  - the p-value is below the pre-defined threshold for significance
  - the p-value is not below the pre-defined threshold for significance
  - the observed effect is or is not statistically significant
- Use confidence intervals in addition to p-values: they convey the significance of the effect estimate along with useful quantitative information on its range.
- Do not interpret confidence interval limits literally; they delineate a "gray zone" of plausible values.
- Use a rule of thumb (e.g., 10 observations for each predictor in multiple regression) to quickly verify that the minimum sample size is ensured for the proposed analysis.
Funding:
This work was funded by a grant from the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) grant P30AR76312 and the American Joint Replacement Research Collaborative (AJRR-C). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
REFERENCES
- 1. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337–350.
- 2. Yaddanapudi LN. The American Statistical Association statement on P-values explained. J Anaesthesiol Clin Pharmacol. 2016;32(4):421–423.
- 3. Devick KL, Zaniletti I, Larson DL, Lewallen DG, Berry DJ, Maradit Kremers H. Avoiding Systematic Bias in Orthopedics Research through Informed Variable Selection: A Discussion of Confounders, Mediators, and Colliders. Journal of Arthroplasty. 2022; under review.
- 4. Zaniletti I, Devick KL, Larson DL, Lewallen DG, Berry DJ, Maradit Kremers H. Measurement Error and Misclassification in Orthopedics: When Study Subjects Are Categorized in the Wrong Exposure or Outcome Groups. Journal of Arthroplasty. 2022; under review.
- 5. Zaniletti I, Devick KL, Larson DL, Lewallen DG, Berry DJ, Maradit Kremers H. Study Types in Orthopedics Research. Journal of Arthroplasty. 2022; under review.
- 6. Khan M, Osman K, Green G, Haddad FS. The epidemiology of failure in total knee arthroplasty: avoiding your next revision. Bone Joint J. 2016;98-B(1):105–112.
- 7. Maradit Kremers H, Larson DR, Crowson CS, et al. Prevalence of Total Hip and Knee Replacement in the United States. J Bone Joint Surg Am. 2015;97(17):1386–1397.