The Journal of ExtraCorporeal Technology. 2009 Jul;41(4):P6–P10.

What’s New in Trial Design: Propensity Scores, Equivalence, and Non-Inferiority

Paul S Myles 1
PMCID: PMC4813540  PMID: 20092080

Abstract:

Recent modifications to traditional clinical research designs include propensity scores, equivalence, and non-inferiority trials, as well as greater use of pooled endpoints for primary outcome measures. Each of these innovations offers benefits, but they have been misused. Propensity score techniques can account for imbalance in treatment group allocation to provide more accurate estimates of benefit or risk. Unlike clinical trials, they typically represent real world, everyday practice and so their findings may in fact be less biased. Equivalence and non-inferiority designs can tailor clinical trials to address clinically meaningful questions: Is a proposed new technique at least as good as current treatment? Pooled endpoints can summarize a range of beneficial outcomes as well as reduce the required sample size. A clearer understanding of bias and confounding, and the interpretation of the 95% confidence interval of the estimated treatment effect are central to proper use of these techniques.

Keywords: bias, confounding, equivalence, non-inferiority, pooled endpoints, propensity, surgery


Clinical studies can be used to identify potentially useful or unsafe interventions. They may also identify ineffective (compared with placebo) or equivalent (compared with an established treatment) interventions. Each of these aims has special considerations for trial design and interpretation.

Most clinical trials aim to demonstrate efficacy of an active intervention when compared with placebo, or one active intervention over another. These are the familiar superiority trials (1,2). Safety is more difficult to prove, because adverse effects are generally rare, and large numbers of subjects and specific data collection are required to detect them. Furthermore, those at highest risk of adverse effects are typically excluded from randomized clinical trials (3). Clinical trials are thus likely to miss some important adverse effects (3). Large observational cohort studies and case-control studies, ideally with multivariate adjustment, are better options (3,4). One recent modification to this is use of propensity scores, aiming to balance baseline risk to detect a specific treatment effect (5–7). There has also been innovation in clinical trial design to include equivalence and non-inferiority trials (8–12) (Table 1). Each of these study designs can provide particular advantages for the researcher and clinician, in terms of timeliness and efficiency, and more specific information provided to assist in clinical decision-making and patient care. But the design and interpretation of such studies need careful consideration because they can be misleading (10–13).

Table 1. Increase in the number of surgical studies using novel designs.

Study design            2000   2004   2008
Propensity score           0      5     53
Equivalence trial          6     13     23
Noninferiority trial       0      2      9

Medline was searched in each year, using terms “surgery” and each of the designs.

Randomized trials use the play of chance to assign patients to the groups being compared (1,2,14,15). This process is designed to prevent systematic differences between groups. Overt, spurious, and hidden explanations for differences between groups can lead to an apparent effect that is not real (4,16,17), resulting in the adoption of unhelpful, expensive, or harmful therapies.

Small randomized trials account for many sources of bias, but may not avoid confounding. Larger numbers are required before random imbalance of potentially confounding factors (“covariates”) is minimized. With a large sample size, both known and unknown factors will tend to be balanced between groups. There are several methods available to account for confounding (15); the most widely known is regression modeling, but the propensity score approach is being increasingly used (5–7).

PROPENSITY SCORE TECHNIQUES

Multivariate regression analysis identifies significant covariates, which can be adjusted to more precisely estimate the likely effect of the variable (intervention) of interest. Propensity scores address this issue in a different way: A logistic regression model is used to predict, backwards, the baseline variables that are associated with the group (intervention) of interest (5).

The propensity score describes the probability that each subject would be in the intervention group or comparison group, based on an analysis of baseline covariates. If all subjects with a similar treatment probability are batched, then the selection of the actual intervention group approaches that of random allocation—that is, propensity scoring attempts to recreate a random decision process. It is therefore a quasi-randomization technique, and is an efficient and recommended method of analyzing large databases.

Databases, registries, and other observational cohort studies, despite being derived in a non-randomized way, have the advantage of representing everyday practice more truly than trial settings. One of the earliest examples of a propensity score technique was confirmation of the superiority of surgical intervention over medical therapy for coronary artery disease using Coronary Artery Surgery Study data in 1987 (18).

Calculation of the propensity score begins with selection of the baseline variables to be included in the model. As with other multivariate techniques, the selection and inclusion of the variables should be based on biologic plausibility and clinical judgment as to what is rational and practical in the decision-making and other processes used to determine whether or not a treatment is used. Logistic regression is then used to identify which of these variables are significantly associated with the treatment group. The propensity score is the resulting conditional probability of being in the treatment group, given all the measured covariates.

Groups of subjects with similar propensity scores can then be expected to have near-equivalent baseline risk. Subjects can be matched into quintiles of similar scores (“propensity matching”) or the final group effect can be adjusted using the propensity score. The former is generally preferred (5,7). Examples of both approaches used to analyze the potential adverse effects of aprotinin in cardiac surgery have been published recently (19,20). Propensity score techniques can be used to investigate efficacy and/or safety. But it is the latter for which they have the greatest value, because large datasets in routine clinical settings, including high-risk, elderly, and complex patients, can be included. These groups are often excluded in clinical trials (3).
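The steps described above can be sketched in a few lines of Python. This is a minimal illustration only, not the method of any cited study: the cohort is synthetic and invented for the example, the covariates (`risk`, `emergency`) and all function names are hypothetical, and the logistic model is fitted with a simple gradient descent rather than a statistical package. The simulated treatment has no true effect, so any crude difference in outcome reflects confounding by indication.

```python
import math
import random

def predict(w, row):
    """Logistic model: probability of treatment given covariates."""
    z = w[0] + sum(wj * x for wj, x in zip(w[1:], row))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, treated, lr=1.0, epochs=300):
    """Fit treatment ~ covariates by batch gradient descent."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)          # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for row, t in zip(X, treated):
            err = predict(w, row) - t
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def quintile_strata(scores):
    """Label each subject 0..4 by the rank of its propensity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * 5 // len(scores)
    return strata

def rate(values):
    return sum(values) / len(values)

def stratified_difference(scores, treated, outcome):
    """Size-weighted mean of within-quintile (treated - untreated) event rates."""
    strata = quintile_strata(scores)
    total, weight = 0.0, 0
    for s in range(5):
        t = [o for q, tr, o in zip(strata, treated, outcome) if q == s and tr]
        c = [o for q, tr, o in zip(strata, treated, outcome) if q == s and not tr]
        if t and c:               # skip strata lacking one of the groups
            total += (len(t) + len(c)) * (rate(t) - rate(c))
            weight += len(t) + len(c)
    return total / weight

# Synthetic cohort (invented): sicker patients are treated more often and have
# more complications, but the treatment itself has NO effect on outcome.
random.seed(1)
X, treated, outcome = [], [], []
for _ in range(3000):
    risk, emergency = random.random(), int(random.random() < 0.3)
    p_treat = 1 / (1 + math.exp(-(-2 + 3 * risk + 1.5 * emergency)))
    X.append([risk, emergency])
    treated.append(int(random.random() < p_treat))
    outcome.append(int(random.random() < 0.05 + 0.3 * risk + 0.15 * emergency))

weights = fit_logistic(X, treated)
scores = [predict(weights, row) for row in X]
crude = (rate([o for o, t in zip(outcome, treated) if t])
         - rate([o for o, t in zip(outcome, treated) if not t]))
adjusted = stratified_difference(scores, treated, outcome)
print(f"crude difference {crude:.3f}, quintile-adjusted difference {adjusted:.3f}")
```

In this simulation the crude comparison suggests harm, while the comparison within propensity quintiles moves the estimate toward the true null effect. Some residual imbalance within strata is expected, and unmeasured covariates would not be corrected at all.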

It must be appreciated, however, that like regression modeling, propensity scores cannot alleviate all bias and confounding (2,17). For example, in recent studies of aprotinin that used propensity scores (19,20), it was clearly apparent that high risk patients were more likely to receive aprotinin, as well as being more likely to have postoperative complications. This is called confounding by indication. It is difficult to determine whether or not a propensity score has sufficiently accounted for such imbalance.

A group from the Cleveland Clinic analyzed their unit’s database of 32,298 cardiac surgical patients over a 13-year period up to 2006 to ascertain the risks of platelet transfusion (21). Simple univariate comparisons demonstrated that patients who received platelet transfusions had worse outcomes. This suggests a direct adverse effect of platelets, but this finding can also be explained by the higher risk settings in which platelet transfusion is used: emergency surgery, more complex surgery, co-existent sepsis, renal failure, and long bypass runs. They therefore used multivariate regression to adjust for such imbalance, and subsequently demonstrated that platelet transfusion was not associated with increased in-hospital mortality: odds ratio (OR) .74 (95% confidence interval (CI): .58–.95), p = .017. The authors then additionally tested their findings with propensity scores. Among 2774 propensity-matched pairs, platelet transfusion was associated with comparable or reduced risk for most individual morbidity and mortality endpoints. They thus concluded there was no evidence that platelet transfusion led to worse outcomes after cardiac surgery.

SUPERIORITY, EQUIVALENCE, AND NON-INFERIORITY TRIALS

Most clinical trials aim to establish the superiority of one intervention over another; that is, they test whether the null hypothesis of no difference can be rejected. But how should a non-significant result be interpreted? It is a common error to conclude this indicates therapeutic equivalence (9–11). The correct conclusion and interpretation from such a study is that there is no evidence of difference between interventions, not that there is evidence of no difference.

Placebo-controlled trials can be unethical if an existing treatment is known to be effective. The ability to demonstrate a difference between two active groups, as opposed to an active group compared with placebo, is lessened because the effect size will be smaller. There is then a need to have a larger sample size to maintain study power. One way to reduce this requirement is to focus on equivalence or non-inferiority rather than superiority, particularly if this is in fact what is of interest to the clinician. The study can then be designed as a one-sided trial and the necessary sample size will be less (9). The testing of a new drug or intervention to specifically show comparable efficacy may justify its introduction into clinical practice if there are potential other benefits such as lower cost, safety, or convenience. It is for these reasons that equivalence and non-inferiority trials are becoming increasingly popular in clinical research.

The first step is to determine what difference between groups could be considered small, or not clinically important. This is best dictated by expert clinical judgment but the assumptions will be more strongly substantiated if this can be referenced to published outcome data. The chosen effect size is defined as the pre-specified delta, Δ. If the difference between groups is less than the delta boundary then the groups are deemed to be equivalent. Or more precisely, if the 95% CI of the treatment difference excludes the delta value, then it can be concluded that the groups are equivalent (9–11). In a non-inferiority trial, if the upper bound of the 95% CI of the estimated treatment difference lies below the non-inferiority margin, it can be concluded that the new treatment is non-inferior to the standard treatment.

The choice of boundaries for equivalence must be critically reviewed, because it is on this basis that the interpretation of the results and conclusions are made. Relative differences of, say, up to 20% (odds or risk ratio of 1.2) are commonly used.

A recent trial comparing two fibrinolytic drugs, tenecteplase and alteplase, is a good example of an equivalence trial (22). The study was designed to show the equivalence of single-bolus tenecteplase and standard alteplase treatment in acute myocardial infarction (MI), using random allocation of 16,949 patients presenting with MI within 6 hours. The primary outcome was equivalence in all-cause mortality at 30 days; equivalence was defined as a mortality difference, or delta, of less than 1%. The eventual mortality rates were almost identical for the two groups: 6.18% and 6.15%, with an absolute difference of .03 (95% CI: −.55–.61), test of equivalence p = .006. Because the 95% one-sided upper boundary of the absolute difference in 30-day mortality was .61%, less than the pre-specified delta of 1%, it fulfilled the pre-specified criteria of equivalence. The authors noted that adverse event rates were generally lower for tenecteplase, and ease of administration favored this drug. They concluded that both drugs were equivalent, but that other considerations might favor tenecteplase in this setting. This is a clinically useful and practical finding.

Two antiplatelet drugs, tirofiban and abciximab, were tested for equivalence in 692 patients undergoing percutaneous coronary intervention (23). The authors chose a difference in the rate of complete ST-segment resolution of 10% for their definition of equivalence. The procedure was successful in 96.7% of the abciximab group and in 96.6% of the tirofiban group (p = .94). Complete ST-segment resolution was obtained in 67.1% of the tirofiban and 70.5% of the abciximab group, with a 95% CI for the difference of −10.4 to +3.6. Because the 95% CI extends beyond the predefined delta of ±10%, the authors concluded that the study failed to demonstrate equivalence of these two drugs in this setting.
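The equivalence rule applied in these two trials can be written as a one-line check. The sketch below is illustrative only; the function name `equivalence_verdict` is hypothetical, and the inputs are simply the confidence limits and delta values quoted in the text, expressed in percentage points.

```python
def equivalence_verdict(ci_low, ci_high, delta):
    """True if the whole 95% CI of the difference lies inside (-delta, +delta)."""
    return -delta < ci_low and ci_high < delta

# Tenecteplase vs. alteplase (22): 95% CI -0.55 to 0.61, delta 1
assent2 = equivalence_verdict(-0.55, 0.61, 1.0)

# Tirofiban vs. abciximab (23): 95% CI -10.4 to +3.6, delta 10
fata = equivalence_verdict(-10.4, 3.6, 10.0)

print(assent2, fata)  # True False
```

The second trial fails the check only because its lower confidence limit crosses −10, even though the point estimate (a 3.4 percentage point difference) is well inside the margin.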

Non-inferiority trials aim to demonstrate comparable or better efficacy (8,9,11). As for equivalence trials, the key step is to nominate a clinically important excess adverse outcome rate, above which the new treatment is deemed inferior. In addition, consideration of superiority can be incorporated in the study design, ideally using pre-specified secondary endpoints and/or supplementary cost analyses (9,11).

Serruys et al. (24) did a randomized trial comparing percutaneous coronary intervention with coronary artery bypass graft (CABG) surgery in 1800 patients with triple-vessel or left main coronary artery disease. A non-inferiority design was used to compare the two groups, with the primary endpoint being major cardiac or cerebrovascular events (death, stroke, MI, or repeat revascularization) up to 12 months after treatment. Their non-inferiority delta value was 6.6%, derived from objective historical data. In the 12-month follow-up period they found significantly higher rates of major adverse events in the percutaneous coronary intervention group, 18% vs. 12% (p = .002), largely due to an increased rate of repeat revascularization. The 95% CI of the difference ranged up to 8.3%, exceeding their pre-specified boundary for non-inferiority. They therefore concluded that CABG should remain the standard of care for patients with extensive coronary artery disease.

Greilich et al. (25) used a non-inferiority design to compare epsilon aminocaproic acid with aprotinin and placebo in uncomplicated CABG surgery. Their outcomes of interest were fibrinolysis and blood loss. They set their non-inferiority margins at 30% for peak d-dimer formation (difference of 250 μg/L) and 350 mL for 24-hour chest tube drainage. Focusing on the two active drug groups, the 95% CIs of the differences in peak d-dimer formation (−203 to 195 μg/L) and 24-hour blood loss (−90 to 230 mL) were within their predetermined non-inferiority margins, satisfying the criteria for non-inferiority. In view of the added cost, risk of re-exposure anaphylaxis, and other safety concerns for aprotinin, it is clear that epsilon aminocaproic acid should be recommended in preference to aprotinin in this setting.

INTERPRETING NEGATIVE TRIALS AND BORDERLINE P-VALUES

Many “negative” trials are underpowered and may have missed a clinically important difference between groups (26,27). It should not be concluded that such treatment groups are equivalent. A key consideration is the width of the 95% CI, a measure of the precision, or certainty, of the estimated treatment effect. If the 95% CI includes a value that would be considered clinically important, then uncertainty remains and a larger, more definitive study will be needed before reliable conclusions can be made.
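The point about CI width can be illustrated with a short calculation. The sketch below uses a simple normal-approximation (Wald) interval for a difference in event rates; the function name and the event counts are hypothetical and chosen only to show the effect of sample size, and the 5 percentage point threshold for clinical importance is likewise an assumption.

```python
import math
from statistics import NormalDist

def risk_difference_ci(events_a, n_a, events_b, n_b, level=0.95):
    """Difference in event rates (A minus B) with a normal-approximation CI."""
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se

# Hypothetical small trial: 6/50 vs. 9/50 complications
small = risk_difference_ci(6, 50, 9, 50)

# Same event rates in a hypothetical trial 100 times larger
large = risk_difference_ci(600, 5000, 900, 5000)

important = 0.05  # assumed clinically important difference (5 percentage points)
for label, (diff, low, high) in (("small", small), ("large", large)):
    print(f"{label}: difference {diff:.3f}, 95% CI {low:.3f} to {high:.3f}")
```

The small trial is "negative" because its interval includes zero, yet the same interval also includes differences larger than 5 percentage points in either direction, so it cannot support a claim of equivalence. The larger trial estimates the same difference far more precisely.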

A recent Lancet editorial has made some suggestions on how to translate various statistical findings into simple English (see Figure 1) (27). The interpretation of p-values, and more particularly the 95% CI, derived from non-inferiority or equivalence trials can be particularly challenging (27). If the 95% CI for the treatment difference excludes the pre-specified non-inferiority delta margin then a conclusion of non-inferiority can be made. The authors recommend that the phrase “seems to have similar efficacy to the standard treatment” be used to describe this finding. However, if the 95% CI includes both zero (no difference) and delta (inferiority), then there is insufficient evidence either to claim non-inferiority or to conclude that the new treatment is inferior. Thus the trial is inconclusive: the groups should not be interpreted as “equivalent.”
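One way to make the non-inferiority reading explicit is a small decision function. This is a sketch under stated assumptions, not a rule taken from the editorial: the difference is defined as new minus standard on a scale where larger values are worse for the new treatment, the function name and labels are hypothetical, and an interval that lies wholly above zero but wholly below the margin is simply reported here as non-inferior.

```python
def non_inferiority_verdict(ci_low, ci_high, margin):
    """Classify a 95% CI for (new - standard), where positive means new is worse."""
    if ci_high < margin:
        return "non-inferior"     # CI excludes the margin
    if ci_low > 0:
        return "inferior"         # CI excludes zero and reaches the margin
    return "inconclusive"         # CI includes both zero and the margin

# Values quoted in the text for the two active drugs (25)
d_dimer = non_inferiority_verdict(-203, 195, 250)    # peak d-dimer, ug/L
blood_loss = non_inferiority_verdict(-90, 230, 350)  # 24-hour blood loss, mL

# Hypothetical intervals, for illustration only
unclear = non_inferiority_verdict(-1.0, 8.0, 6.0)
worse = non_inferiority_verdict(1.0, 8.0, 6.0)

print(d_dimer, blood_loss, unclear, worse)
```

Only the first two calls use reported data; the last two use invented confidence limits to show the inconclusive and inferior cases.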

Figure 1.


Six scenarios describing results of a clinical trial comparing two treatment groups. Each displays estimated treatment difference and its 95% CI, p-value, strength of evidence, and suggested phraseology to use in conclusions. For non-inferiority scenarios, non-inferiority margin delta (Δ) is shown, and pNI is the p-value for the test of non-inferiority. Reproduced with permission from The Lancet 2009;373:1926–8 (27).

DEFINING THE PRIMARY ENDPOINT

Many trials use a composite, or pooled, endpoint to define the primary outcome of the study (28–30). This is because effective interventions are likely to reduce several morbid endpoints, often via a single mechanism. For example, antithrombotic treatments may reduce non-fatal MI, non-fatal stroke, chronic heart failure, and thrombotic deaths. The pooled endpoint will have a higher incidence than each of its components and this reduces the sample size requirement for the trial.
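The effect of a higher event rate on sample size can be shown with the standard normal-approximation formula for comparing two proportions. The sketch below is illustrative: the function name is hypothetical, and the event rates (4% reduced to 3% for a single endpoint, 12% reduced to 9% for a pooled endpoint, the same 25% relative reduction) are invented for the example rather than taken from any trial.

```python
import math
from statistics import NormalDist

def n_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return math.ceil((z_alpha + z_beta) ** 2 * variance
                     / (p_control - p_treated) ** 2)

# Hypothetical single endpoint: 4% vs. 3%
n_single = n_per_group(0.04, 0.03)

# Hypothetical pooled endpoint: 12% vs. 9% (same relative reduction)
n_pooled = n_per_group(0.12, 0.09)

print(n_single, n_pooled)
```

With these assumed rates the pooled endpoint needs roughly a third of the subjects per group, which is the efficiency gain that makes composite endpoints attractive.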

Pooled endpoints have an inherent assumption that each component of the endpoint has a similar burden on health (28–32). This is not often the case. Endpoints of least importance to patients, such as subclinical MI as opposed to stroke or death, typically contribute most to trial events (30). A recent systematic review found that in about half the trials reviewed there were large or moderate gradients in both importance to patients and magnitude of effect across components. The most serious and highly-rated events typically occurred least often, so that they provided the least information (numerical value) to the pooled endpoint. This may result in misleading impressions of the true clinical value of the new treatment.

Cardiovascular trials commonly use major adverse cardiac events (MACE) as a pooled primary endpoint (29). But there is no standard definition for MACE. A review on the use of MACE as a pooled endpoint in stent trials found substantial variability in the study-specific individual outcomes used to define it (29). Most included MI, stroke and death; many included revascularizations; some included cardiac arrest, heart failure, or bleeding complications. Combining efficacy and safety components is particularly problematic. Each of these issues can lead to substantially different results and conclusions. Risks may outweigh benefits of a new treatment. Perhaps the best and most recent example is the POISE trial (33), which identified a significant reduction in perioperative MI but at a cost of excess death and stroke. There was also more hypotension and bradycardia.

Critical review of the frequency of each of the components of the pooled endpoint, and whether the treatment effect seems to be consistent for each of these, is strongly recommended (28–32). If large variations exist between components then the value of the pooled endpoint is diminished. Study designs using pooled endpoints should provide separate safety and effectiveness outcomes, and construct separate pooled endpoints to match these different clinical goals (29).

CONCLUSIONS

Innovations and developments in clinical study design can optimize the information available for clinicians. Propensity score techniques provide additional methods to account for imbalance in treatment group allocation, and hopefully lead to more accurate estimates of benefit or risk. Unlike clinical trials, the study data typically represent real world, everyday practice and so their findings may in fact be less biased. They are particularly useful when assessing safety.

Equivalence and non-inferiority designs can tailor clinical trials to address clinically meaningful questions: Is a proposed new technique at least as good as current treatment? They are best used when comparing two active interventions, particularly when one may be a cheaper or safer alternative. Training and experience in clinical trial design and interpretation, as with surgical or perfusion techniques, can improve how we gain new knowledge and apply interventions to improve patient care.

REFERENCES

1. Collins R, MacMahon S. Reliable assessment of the effects of treatment on mortality and major morbidity, I: Clinical trials. Lancet. 2001;357:373–80.
2. Myles PS. Large multicentre trials: What should be done in perfusion? J Extra Corpor Technol. 2007;39:274–7.
3. Silverman SL. From randomized controlled trials to observational studies. Am J Med. 2009;122:114–20.
4. MacMahon S, Collins R. Reliable assessment of the effects of treatment on mortality and major morbidity, II: Observational studies. Lancet. 2001;357:455–62.
5. Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med. 1997;127:757–63.
6. Adamina M, Guller U, Weber WP, Oertli D. Propensity scores and the surgeon. Br J Surg. 2006;93:389–94.
7. Austin PC. Propensity-score matching in the cardiovascular surgery literature from 2004 to 2006: A systematic review and suggestions for improvement. J Thorac Cardiovasc Surg. 2007;134:1128–35.
8. Kaul S, Diamond GA. Good enough: A primer on the analysis and interpretation of noninferiority trials. Ann Intern Med. 2006;145:62–9.
9. Siegel JP. Equivalence and noninferiority trials. Am Heart J. 2000;139:S166–70.
10. Greene WL, Concato J, Feinstein AR. Claims of equivalence in medical research: Are they supported by the evidence? Ann Intern Med. 2000;132:715–22.
11. Gøtzsche PC. Lessons from and cautions about noninferiority and equivalence randomized trials. JAMA. 2006;295:1172–4.
12. Kaul S, Diamond GA, Weintraub WS. Trials and tribulations of non-inferiority: The ximelagatran experience. J Am Coll Cardiol. 2005;46:1986–95.
13. Greene WL, Concato J, Feinstein AR. Claims of equivalence in medical research: Are they supported by the evidence? Ann Intern Med. 2000;132:715–22.
14. Kunz R, Vist G, Oxman AD. Randomisation to protect against selection bias in healthcare trials. Cochrane Database Syst Rev. 2007;18:MR000012.
15. Cleophas TJ, Zwinderman AH. Clinical trials: How to assess confounding and why so. Curr Clin Pharmacol. 2007;2:129–33.
16. Sackett DL. Bias in analytic research. J Chronic Dis. 1979;32:51–63.
17. Datta M. You cannot exclude the explanation you have not considered. Lancet. 1993;342:345–7.
18. Myers WO, Gersh BJ, Fisher LD, et al. Medical versus early surgical therapy in patients with triple-vessel disease and mild angina pectoris: A CASS registry study of survival. Ann Thorac Surg. 1987;44:471–86.
19. Mangano DT, Tudor IC, Dietzel C. The risk associated with aprotinin in cardiac surgery. N Engl J Med. 2006;354:353–65.
20. Karkouti K, Beattie WS, Dattilo KM, et al. A propensity score case-control comparison of aprotinin and tranexamic acid in high-transfusion-risk cardiac surgery. Transfusion. 2006;46:327–38.
21. McGrath T, Koch CG, Xu M, et al. Platelet transfusion in cardiac surgery does not confer increased risk for adverse morbid outcomes. Ann Thorac Surg. 2008;86:543–53.
22. Van De Werf F, Adgey J, Ardissino D, et al. Single-bolus tenecteplase compared with front-loaded alteplase in acute myocardial infarction: The ASSENT-2 double-blind randomised trial. Lancet. 1999;354:716–22.
23. Marzocchi A, Manari A, Piovaccari G, et al. Randomized comparison between tirofiban and abciximab to promote complete ST-resolution in primary angioplasty: Results of the facilitated angioplasty with tirofiban or abciximab (FATA) in ST-elevation myocardial infarction trial. Eur Heart J. 2008;29:2972–80.
24. Serruys PW, Morice MC, Kappetein AP, et al. Percutaneous coronary intervention versus coronary-artery bypass grafting for severe coronary artery disease. N Engl J Med. 2009;360:961–72.
25. Greilich PE, Jessen ME, Satyanarayana N, et al. The effect of epsilonaminocaproic acid and aprotinin on fibrinolysis and blood loss in patients undergoing primary, isolated coronary artery bypass surgery: A randomized, double-blind, placebo-controlled, noninferiority trial. Anesth Analg. 2009;109:15–24.
26. Frieman JA, Chalmers TC, Smith H, et al. The importance of beta, the type II error and sample size in the design and interpretation of the randomized controlled trial. N Engl J Med. 1978;299:690–4.
27. Pocock SJ, Ware JH. Translating statistical findings into plain English. Lancet. 2009;373:1926–8.
28. Freemantle N, Calvert M. Composite and surrogate outcomes in randomised controlled trials. BMJ. 2007;334:756–7.
29. Kip KE, Hollabaugh K, Marroquin OC, Williams DO. The problem with composite end points in cardiovascular studies: The story of major adverse cardiac events and percutaneous coronary intervention. J Am Coll Cardiol. 2008;51:701–7.
30. Ferreira-Gonzalez I, Permanyer-Miralda G, Domingo-Salvany A, et al. Problems with use of composite end points in cardiovascular trials: Systematic review of randomised controlled trials. BMJ. 2007;334:786–8.
31. Montori VM, Permanyer-Miralda G, Ferreira-Gonzalez I, et al. Validity of composite end points in clinical trials. BMJ. 2005;330:594–6.
32. Neaton JD, Gray G, Zuckerman BD, Konstam MA. Key issues in end point selection for heart failure trials: Composite end points. J Card Fail. 2005;11:567–75.
33. Devereaux PJ, Yang H, Yusuf S, et al. Effects of extended-release metoprolol succinate in patients undergoing non-cardiac surgery (POISE trial): A randomised controlled trial. Lancet. 2008;371:1839–47.

Articles from The Journal of Extra-corporeal Technology are provided here courtesy of EDP Sciences
