Author manuscript; available in PMC: 2023 Jun 2.
Published in final edited form as: Acta Mater Med. 2022 Aug 26;1(3):320–332. doi: 10.15212/amm-2022-0020

Statistical Evaluation of Absolute Change versus Responder Analysis in Clinical Trials

Peijin Wang 1, Sarah Peskoe 1, Rebecca Byrd 2, Patrick Smith 3, Rachel Breslin 2, Shein-Chung Chow 1
PMCID: PMC10237148  NIHMSID: NIHMS1833344  PMID: 37274016

Abstract

In clinical trials, the primary analysis is often either a test of absolute/relative change in a measured outcome or a corresponding responder analysis. Though each of these tests may be reasonable, determining which test is most suitable for a particular research study is still an open question. These tests may require different sample sizes, define different clinically meaningful differences, and, most importantly, lead to different study conclusions. This paper compares a typical non-inferiority test using absolute change as the study endpoint with the corresponding responder analysis in terms of sample size requirements, statistical power, and hypothesis testing results. In the numerical analysis, using absolute change as an endpoint generally requires a larger sample size; therefore, when the sample size is the same, the responder analysis has higher power. The cut-off value and non-inferiority margin are critical and can meaningfully affect whether the two types of endpoints yield conflicting conclusions. Specifically, an extreme cut-off value is more likely to cause different conclusions. However, this impact decreases as population variance increases. One important reason for conflicting conclusions is that the population distribution is not normal. To avoid conflicting results, researchers should pay attention to the population distribution and cut-off value selection.

Keywords: Primary Endpoints, Responder Analysis, Threshold Selection


1. Introduction

In clinical trials, an analysis of a primary study endpoint is often conducted to determine whether the intended study will achieve its objective with a desired statistical power. In practice, investigators can consider four different kinds of primary endpoints or outcomes based on this single study objective: (i) absolute change (i.e., endpoint absolute change from baseline), (ii) relative change (e.g., endpoint percent change from baseline), (iii) responder analysis based on absolute change (i.e., an individual subject is defined as a responder if his/her absolute change in the primary endpoint has exceeded a pre-specified threshold known as a clinically meaningful improvement), and (iv) responder analysis based on relative change. Although analyses based on these endpoints all sound reasonable, the following issues are often of great concern to principal investigators (Chow, 2011). First, clinically meaningful differences (improvements) of these derived endpoints may not directly translate to one another. Second, these derived endpoints generally have different sample size requirements. Third, and most importantly, these derived endpoints may not arrive at the same statistical conclusion (based on the same data set). As a result, it is of particular interest to determine which type of primary endpoint is most appropriate and can best inform the disease status and treatment effect.

Some scholars have criticized responder analysis for its loss of information, i.e., the statistical power of a trial is reduced when a continuous outcome is categorized into a binary variable (Snapinn and Jiang, 2007; Henschke et al., 2014). Though responder analysis may cost power, it still has its own uses. For example, we may need to use the original-scale (continuous) outcome to make a binary decision, such as whether the patient should be hospitalized. In a heterogeneous disease, a subset of patients may benefit more than others, in which case the distribution of the outcome variable would not be normal (Jones et al., 2016). If the trial investigates an additional second agent and the proportion of patients with more benefit is of greater interest, a responder analysis is more suitable (Jones et al., 2016). Weighing these benefits and drawbacks, Henschke et al. (2014) suggested using responder analysis as a secondary analysis to better interpret findings from the main analysis. However, it should be noted that an analysis using absolute change as the endpoint and the corresponding responder analysis have different statistical properties. Hence, it is of great importance to investigate their differences in terms of statistical power, sample size, and conclusions.

To study the relative performance of these derived endpoints, in addition to mathematical derivations, we conduct a numerical study and a real case study, using data from a recent rehabilitation program study in lung transplant candidates and recipients (Byrd et al., 2022). One of the commonly used clinical indicators for patients with pulmonary disease is the 6-minute walk distance (6MWD), which can be used not only as a prognostic factor but also as a health outcome variable (Tuppin et al., 2008). For example, 6MWD has been used to measure the functional status and exercise capacity of lung transplant recipients or patients (Martinu et al, 2008; Munro et al, 2009). Some studies have used change from baseline (absolute change) in 6MWD as the outcome variable to evaluate the performance of pulmonary disease treatment (Ryerson et al, 2014), while others have considered a responder analysis using 6MWD (Gilbert et al., 2009; Stoilkova-Hartmann et al, 2015). An individual may be defined as a responder if he/she meets a pre-specified threshold of improvement in 6MWD; otherwise he/she is defined as a non-responder. As an example, Stoilkova-Hartmann et al. (2015) classified patient performance after rehabilitation using the following criteria: a 6MWD increment ≥ 50 m is considered good, ≥ 25 to < 50 m is moderate, and < 25 m indicates a non-responder. However, in another study, Holland et al. (2014) reported that the minimal important difference in the change of 6MWD in chronic respiratory disease was 25 to 33 m. Holland et al. (2017) used 25 m as the threshold for equivalence in the change of 6MWD. Though a range of 25 to 30 m for the change in 6MWD is generally accepted, the exact threshold that should be considered clinically meaningful is debatable.

In this case study, for simplicity, we focus on the statistical evaluation of a rehabilitation program in lung transplant candidates and recipients in terms of the absolute change of 6MWD and the responder analysis based on a pre-specified threshold (improvement) of 6MWD using absolute change. A comparison between the absolute change and responder analysis with various pre-specified thresholds is made in terms of sample size requirement and statistical power. In the next section, we present statistical methods for an analysis using absolute change as the study endpoint as well as the corresponding responder analysis. Additionally, we compare the performance of these study endpoints in terms of statistical power, sample size, and study results/conclusions. In Section 3, we discuss a numerical analysis of the comparison between absolute change and responder analysis and the case study of the rehabilitation program in lung transplant candidates and recipients. Brief concluding remarks and recommendations are given in the last section of this article.

2. Methods

2.1. Hypothesis Testing for Efficacy

In a randomized clinical trial evaluating the performance of a new drug or a new treatment as compared to an active control (e.g., standard of care), non-inferiority testing is commonly considered. The success of a non-inferiority trial depends upon the selection of the study endpoint and the non-inferiority margin. As indicated earlier, for a given study endpoint, there are four types of primary endpoints, namely, absolute change (e.g., endpoint change from baseline), relative change (e.g., endpoint percent change from baseline), responder analysis based on a pre-specified improvement (threshold) of absolute change, and responder analysis based on a pre-specified improvement (threshold) of relative change. In addition, the inference from a responder analysis is very sensitive to the pre-specified threshold (cut-off) value (Chow and Song, 2015). For simplicity and illustration purposes, in this article, we examine the performance of two of these primary endpoints: absolute change and the corresponding responder analysis.

We assume a two-arm parallel randomized clinical trial comparing a test treatment (T) and an active control (C) with a 1:1 treatment allocation ratio. Let $W_{1ij}$ and $W_{2ij}$ be the original responses of the $i$th patient in the $j$th treatment group at baseline and post-treatment, respectively, where $i = 1, \ldots, n_j$ and $j = C, T$. Furthermore, $W_{1ij}$ is assumed to follow a lognormal distribution $LN(\mu_j, \sigma_j^2)$, and $W_{2ij} = W_{1ij}(1 + \Delta_{ij})$, where $\Delta_{ij} \sim LN(\mu_{\Delta_j}, \sigma_{\Delta_j}^2)$. Hence, the absolute change from baseline is

$$W_{2ij} - W_{1ij} = W_{1ij}\Delta_{ij} \sim LN(\mu_j + \mu_{\Delta_j}, \sigma_j^2 + \sigma_{\Delta_j}^2), \tag{1}$$

where $W_{1ij}$ and $\Delta_{ij}$ are assumed to be independent. Let $X_{ij} = \log(W_{2ij} - W_{1ij})$ represent the log absolute change; then $X_{ij} \sim N(\mu_j + \mu_{\Delta_j}, \sigma_j^2 + \sigma_{\Delta_j}^2)$. Let $x_{ij}$ denote the observations of the random variable $X_{ij}$. We use $W_{1ij}$ and $W_{2ij}$, rather than directly assuming that $X_{ij}$ follows a normal distribution, so that the same notation can also describe relative change. For example, $Y_{ij} = \log\left(\frac{W_{2ij} - W_{1ij}}{W_{1ij}}\right)$ can represent the log relative change, which follows $N(\mu_{\Delta_j}, \sigma_{\Delta_j}^2)$. Though the relative change endpoint is not the focus of this paper, this notation will benefit future studies.

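The outcome variable for responder analysis based on a pre-specified absolute change is then given by $r_{Aj} = \frac{\#\{x_{ij} > c_1\}}{n_j}$, where $c_1$ is the cut-off value. Then the endpoint for the responder analysis becomes $p_{Aj} = E[r_{Aj}]$. For a sufficiently large sample size, it can be verified that $r_{Aj}$ asymptotically follows $N\left(p_{Aj}, \frac{p_{Aj}(1 - p_{Aj})}{n_j}\right)$ (Chow, 2011). According to the definition of $p_{Aj}$,

$$p_{Aj} = E[r_{Aj}] = P(X_{ij} > c_1) = P\left(\frac{X_{ij} - (\mu_j + \mu_{\Delta_j})}{\sqrt{\sigma_j^2 + \sigma_{\Delta_j}^2}} > \frac{c_1 - (\mu_j + \mu_{\Delta_j})}{\sqrt{\sigma_j^2 + \sigma_{\Delta_j}^2}}\right) = 1 - \Phi\left(\frac{c_1 - (\mu_j + \mu_{\Delta_j})}{\sqrt{\sigma_j^2 + \sigma_{\Delta_j}^2}}\right), \tag{2}$$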

where $\Phi(\cdot)$ is the cumulative distribution function (CDF) of the standard normal distribution. The hypotheses for non-inferiority testing based on the derived endpoint of absolute change and the corresponding responder analysis can be set up as follows.

  1. Absolute change:
    $$H_0: (\mu_C + \mu_{\Delta_C}) - (\mu_T + \mu_{\Delta_T}) \ge \delta_1 \quad \text{vs.} \quad H_A: (\mu_C + \mu_{\Delta_C}) - (\mu_T + \mu_{\Delta_T}) < \delta_1, \tag{3}$$
    where $\delta_1$ is the non-inferiority margin in hypothesis testing using absolute change.
  2. Responder analysis based on a pre-specified threshold (improvement) of absolute change:
    $$H_0: p_{AC} - p_{AT} \ge \delta_2 \quad \text{vs.} \quad H_A: p_{AC} - p_{AT} < \delta_2, \tag{4}$$
    where $\delta_2$ is the non-inferiority margin in hypothesis testing using responder analysis.
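As a numerical illustration of Equation (2), the following minimal Python sketch evaluates the responder probability for a normally distributed log absolute change. The function name `responder_probability` and the parameter values (cut-off 0.1, means 0.2 and 0, unit variances) are illustrative assumptions, not taken from the authors' code; `mean` stands for the sum of the two mean parameters of a group and `variance` for the sum of its two variance parameters.

```python
from math import sqrt
from statistics import NormalDist


def responder_probability(cutoff, mean, variance):
    """Equation (2): P(X > cutoff) when X ~ N(mean, variance)."""
    return 1.0 - NormalDist().cdf((cutoff - mean) / sqrt(variance))


# Hypothetical parameter values chosen only for illustration.
p_treatment = responder_probability(cutoff=0.1, mean=0.2, variance=1.0)
p_control = responder_probability(cutoff=0.1, mean=0.0, variance=1.0)
print(f"p_AT = {p_treatment:.4f}, p_AC = {p_control:.4f}")
```

With these inputs the treatment group has the larger responder proportion, because its mean lies above the cut-off while the control mean lies below it.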

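For a non-inferiority test based on the derived endpoint of absolute difference, the Z test statistic under the null hypothesis in Equation (3) is given by

$$Z_1 = \frac{\bar{x}_T - \bar{x}_C + \delta_1}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2}{n_1} + \frac{\sigma_C^2 + \sigma_{\Delta_C}^2}{n_1}}} = \frac{\bar{x}_T - \bar{x}_C + \delta_1}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n_1}}} \sim N(0,1), \tag{5}$$

where $\bar{x}_T$ and $\bar{x}_C$ are the sample means of absolute change in the treatment and control groups, and $n_1$ is the sample size of the treatment or control group, assuming the allocation ratio is 1:1. Let $\delta_{1A}$ denote the true mean difference. The corresponding statistical power can be written as

$$
\begin{aligned}
\text{power}_1 &= P(\text{Reject } H_0 \mid H_A \text{ is true}) = P(Z_1 > z_{1-\alpha} \mid \bar{x}_T - \bar{x}_C = \delta_{1A}) \\
&= P\left(\bar{x}_T - \bar{x}_C > z_{1-\alpha}\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n_1}} - \delta_1 \,\middle|\, \bar{x}_T - \bar{x}_C = \delta_{1A}\right) \\
&= P\left(\frac{\bar{x}_T - \bar{x}_C - \delta_{1A}}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n_1}}} > \frac{z_{1-\alpha}\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n_1}} - \delta_1 - \delta_{1A}}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n_1}}}\right) \\
&= 1 - \Phi\left(z_{1-\alpha} - \frac{\delta_1 + \delta_{1A}}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n_1}}}\right).
\end{aligned} \tag{6}
$$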
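The closed form at the end of Equation (6) is straightforward to evaluate numerically. The sketch below is a minimal illustration using only the Python standard library; the function name `ac_power` and the example inputs (100 patients per group, true difference 0.2, margin 0.25, unit variances) are assumptions made for this example, and `var_t` and `var_c` denote the total variance of each group on the log absolute change scale.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ac_power(n, true_diff, var_t, var_c, margin, alpha=0.05):
    """Last line of Equation (6).

    n         : sample size per group (1:1 allocation)
    true_diff : true mean difference, treatment minus control
    margin    : non-inferiority margin for the absolute change endpoint
    """
    standard_error = sqrt((var_t + var_c) / n)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_crit - (margin + true_diff) / standard_error)


# Hypothetical inputs, for illustration only.
for n in (25, 50, 100, 200):
    print(n, round(ac_power(n, true_diff=0.2, var_t=1.0, var_c=1.0, margin=0.25), 3))
```

As expected from the formula, the computed power increases with the per-group sample size and with the sum of the margin and the true difference.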

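The sample size requirement for the non-inferiority test using absolute difference can then be obtained as follows

$$n_1 = \frac{2(z_{1-\alpha} + z_\beta)^2(\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2)}{[(\mu_C + \mu_{\Delta_C}) - (\mu_T + \mu_{\Delta_T}) - \delta_1]^2}. \tag{7}$$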
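Equation (7) can be evaluated directly. The following Python sketch implements the formula as written; the function name `ac_sample_size` is hypothetical, and the inputs correspond to one configuration used later in the numerical analysis (treatment mean 0.2, control mean 0, unit variances, margin 0.25, significance level 0.05, power 0.80). The final doubling step is an assumption of this sketch: the text does not state how the tabulated values are rounded, but doubling the rounded-up result gives 246, the first entry of Table 1.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ac_sample_size(mean_c, mean_t, var_c, var_t, margin, alpha=0.05, power=0.80):
    """Equation (7) as written, returned without rounding."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    numerator = 2 * (z_alpha + z_beta) ** 2 * (var_t + var_c)
    return numerator / (mean_c - mean_t - margin) ** 2


n1 = ac_sample_size(mean_c=0.0, mean_t=0.2, var_c=1.0, var_t=1.0, margin=0.25)
# Assumption of this sketch: round up, then double to cover both arms.
total = 2 * ceil(n1)
print(round(n1, 2), total)
```

A larger margin or a larger treatment effect increases the denominator and therefore lowers the required sample size, which matches the pattern described for Table 1.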

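For a non-inferiority test of responder analysis based on a pre-specified threshold (improvement) of absolute difference, the Z test statistic under the null hypothesis in Equation (4) can be derived as follows

$$Z_2 = \frac{r_{AT} - r_{AC} + \delta_2}{\sqrt{\frac{r_{AT}(1 - r_{AT})}{n_2} + \frac{r_{AC}(1 - r_{AC})}{n_2}}} = \frac{r_{AT} - r_{AC} + \delta_2}{\sqrt{\frac{r_{AT}(1 - r_{AT}) + r_{AC}(1 - r_{AC})}{n_2}}} \sim N(0,1), \tag{8}$$

where $r_{AT}$ and $r_{AC}$ are the sample proportions in the treatment and the control group, respectively, and $n_2$ is the sample size of the treatment or control group, assuming the allocation ratio is 1:1. Similarly, let $\delta_{2A}$ denote the true proportion difference; then the corresponding statistical power is

$$
\begin{aligned}
\text{power}_2 &= P(\text{Reject } H_0 \mid H_A \text{ is true}) = P(Z_2 > z_{1-\alpha} \mid r_{AT} - r_{AC} = \delta_{2A}) \\
&= P\left(r_{AT} - r_{AC} > z_{1-\alpha}\sqrt{\frac{r_{AT}(1 - r_{AT}) + r_{AC}(1 - r_{AC})}{n_2}} - \delta_2 \,\middle|\, r_{AT} - r_{AC} = \delta_{2A}\right) \\
&= 1 - \Phi\left(z_{1-\alpha} - \frac{\delta_2 + \delta_{2A}}{\sqrt{\frac{r_{AT}(1 - r_{AT}) + r_{AC}(1 - r_{AC})}{n_2}}}\right) \\
&\approx 1 - \Phi\left(z_{1-\alpha} - \frac{\delta_2 + \delta_{2A}}{\sqrt{\frac{p_{AT}(1 - p_{AT}) + p_{AC}(1 - p_{AC})}{n_2}}}\right),
\end{aligned} \tag{9}
$$

where the last approximation holds by Slutsky’s theorem (Chow, 2011). The sample size requirement for the non-inferiority test for the responder analysis based on a pre-specified threshold (improvement) of absolute difference is then given by

$$n_2 = \frac{2(z_{1-\alpha} + z_\beta)^2\left(p_{AC}(1 - p_{AC}) + p_{AT}(1 - p_{AT})\right)}{(p_{AC} - p_{AT} - \delta_2)^2}. \tag{10}$$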
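To make Equation (10) concrete, the sketch below computes the responder proportions from Equation (2) and plugs them into the sample size formula. The function names are hypothetical, and the inputs (cut-off 0.1, treatment mean 0.2, control mean 0, unit variances, margin 0.25) are one configuration from the numerical analysis. As before, the doubling of the rounded-up value is an assumption of this sketch; it gives 114, the first entry of Table 2.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def responder_probability(cutoff, mean, variance):
    """Equation (2): P(X > cutoff) when X ~ N(mean, variance)."""
    return 1 - STD_NORMAL.cdf((cutoff - mean) / sqrt(variance))


def pac_sample_size(p_c, p_t, margin, alpha=0.05, power=0.80):
    """Equation (10) as written, returned without rounding."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    variance_sum = p_c * (1 - p_c) + p_t * (1 - p_t)
    return 2 * (z_alpha + z_beta) ** 2 * variance_sum / (p_c - p_t - margin) ** 2


p_t = responder_probability(cutoff=0.1, mean=0.2, variance=1.0)
p_c = responder_probability(cutoff=0.1, mean=0.0, variance=1.0)
n2 = pac_sample_size(p_c, p_t, margin=0.25)
# Assumption of this sketch: round up, then double to cover both arms.
total = 2 * ceil(n2)
print(round(n2, 2), total)
```

Unlike the absolute change formula, the result depends on the cut-off through both responder proportions, so the same margin can yield different sample sizes for different thresholds.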

2.2. Statistical Power Comparison in Non-inferiority Tests

Many previous studies have suggested avoiding relative difference due to statistical inefficiency (Vickers, 2001). Following their ideas, we instead consider a comparison of non-inferiority tests using absolute change and a responder analysis using absolute change in terms of statistical power. The required sample sizes and conclusion comparison for non-inferiority tests are also shown in this section. Here, let AC denote absolute change, and PAC denote responder analysis using absolute change.

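From the formulas for the statistical power of the non-inferiority tests shown in Section 2.1, the power difference can be computed using the cumulative distribution function (CDF) of $N(0,1)$. Using a Taylor expansion, the CDF of $N(0,1)$, $\Phi(\cdot)$, can be written as

$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\sum_{n=0}^{\infty}\frac{(-1)^n}{n!\,2^n(2n+1)}x^{2n+1} + \frac{1}{2}. \tag{11}$$

Keeping only the first term of the Taylor expansion in Equation (11), $\Phi(x_1) - \Phi(x_2)$ can be simplified as

$$
\begin{aligned}
\Phi(x_1) - \Phi(x_2) &= \frac{1}{\sqrt{2\pi}}\left(\sum_{n=0}^{\infty}\frac{(-1)^n}{n!\,2^n(2n+1)}x_1^{2n+1} - \sum_{n=0}^{\infty}\frac{(-1)^n}{n!\,2^n(2n+1)}x_2^{2n+1}\right) \\
&= \frac{1}{\sqrt{2\pi}}\sum_{n=0}^{\infty}\frac{(-1)^n}{n!\,2^n(2n+1)}\left(x_1^{2n+1} - x_2^{2n+1}\right) \\
&\approx \frac{1}{\sqrt{2\pi}}(x_1 - x_2).
\end{aligned} \tag{12}
$$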
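The quality of the first-term approximation in Equation (12) depends on how far the arguments are from zero. The short check below, which is an illustration added here and not part of the authors' analysis, compares partial sums of the series in Equation (11) with the exact standard normal CDF from the Python standard library. The function name `phi_series` is hypothetical.

```python
from math import factorial, pi, sqrt
from statistics import NormalDist


def phi_series(x, terms):
    """Partial sum of the Taylor series in Equation (11) with `terms` terms."""
    total = sum(
        (-1) ** n * x ** (2 * n + 1) / (factorial(n) * 2 ** n * (2 * n + 1))
        for n in range(terms)
    )
    return 0.5 + total / sqrt(2 * pi)


for x in (0.1, 0.5, 1.0, 2.0):
    exact = NormalDist().cdf(x)
    one_term = phi_series(x, 1)    # the linear approximation used in Equation (12)
    many_terms = phi_series(x, 30)
    print(f"x={x}: exact={exact:.5f}, one term={one_term:.5f}, 30 terms={many_terms:.5f}")
```

In this check the one-term version is very close to the exact value for arguments near zero, but it exceeds 1 at x = 2, so expressions derived from the linearization are best read as approximations for arguments of small magnitude.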

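To compare the statistical power of a non-inferiority test using the absolute change endpoint with the statistical power of a responder analysis using the absolute change endpoint, we first simplify $p_{Aj}$. Using Equation (11), $p_{Aj}$ can be written as

$$p_{Aj} = 1 - \Phi\left(\frac{c_1 - (\mu_j + \mu_{\Delta_j})}{\sqrt{\sigma_j^2 + \sigma_{\Delta_j}^2}}\right) \approx 1 - \frac{1}{\sqrt{2\pi}}\,\frac{c_1 - (\mu_j + \mu_{\Delta_j})}{\sqrt{\sigma_j^2 + \sigma_{\Delta_j}^2}} - \frac{1}{2} = \frac{1}{2} - \frac{1}{\sqrt{2\pi}}\,\frac{c_1 - (\mu_j + \mu_{\Delta_j})}{\sqrt{\sigma_j^2 + \sigma_{\Delta_j}^2}}. \tag{13}$$

And

$$
\begin{aligned}
p_{Aj}(1 - p_{Aj}) &= \left(\frac{1}{2} - \frac{1}{\sqrt{2\pi}}\,\frac{c_1 - (\mu_j + \mu_{\Delta_j})}{\sqrt{\sigma_j^2 + \sigma_{\Delta_j}^2}}\right)\left(\frac{1}{2} + \frac{1}{\sqrt{2\pi}}\,\frac{c_1 - (\mu_j + \mu_{\Delta_j})}{\sqrt{\sigma_j^2 + \sigma_{\Delta_j}^2}}\right) \\
&= \frac{1}{4} - \frac{1}{2\pi}\left(\frac{c_1 - (\mu_j + \mu_{\Delta_j})}{\sqrt{\sigma_j^2 + \sigma_{\Delta_j}^2}}\right)^2 = \frac{1}{4} - \frac{[c_1 - (\mu_j + \mu_{\Delta_j})]^2}{2\pi(\sigma_j^2 + \sigma_{\Delta_j}^2)}.
\end{aligned} \tag{14}
$$

Hence,

$$
\begin{aligned}
p_{AC} - p_{AT} &= \frac{1}{4} - \frac{[c_1 - (\mu_C + \mu_{\Delta_C})]^2}{2\pi(\sigma_C^2 + \sigma_{\Delta_C}^2)} - \left(\frac{1}{4} - \frac{[c_1 - (\mu_T + \mu_{\Delta_T})]^2}{2\pi(\sigma_T^2 + \sigma_{\Delta_T}^2)}\right) \\
&= \frac{[c_1 - (\mu_T + \mu_{\Delta_T})]^2}{2\pi(\sigma_T^2 + \sigma_{\Delta_T}^2)} - \frac{[c_1 - (\mu_C + \mu_{\Delta_C})]^2}{2\pi(\sigma_C^2 + \sigma_{\Delta_C}^2)},
\end{aligned} \tag{15}
$$

and

$$
\begin{aligned}
p_{AC}(1 - p_{AC}) + p_{AT}(1 - p_{AT}) &= \frac{1}{4} - \frac{[c_1 - (\mu_C + \mu_{\Delta_C})]^2}{2\pi(\sigma_C^2 + \sigma_{\Delta_C}^2)} + \frac{1}{4} - \frac{[c_1 - (\mu_T + \mu_{\Delta_T})]^2}{2\pi(\sigma_T^2 + \sigma_{\Delta_T}^2)} \\
&= \frac{1}{2} - \frac{[c_1 - (\mu_T + \mu_{\Delta_T})]^2}{2\pi(\sigma_T^2 + \sigma_{\Delta_T}^2)} - \frac{[c_1 - (\mu_C + \mu_{\Delta_C})]^2}{2\pi(\sigma_C^2 + \sigma_{\Delta_C}^2)}.
\end{aligned} \tag{16}
$$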

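If we assume that the non-inferiority test using absolute change and the corresponding responder analysis have the same sample size, denoted by $n$, then using Equation (12), the difference between $\text{power}_1$ and $\text{power}_2$ can be written as

$$
\begin{aligned}
\text{power}_1 - \text{power}_2 &= 1 - \Phi\left(z_{1-\alpha} - \frac{\delta_1 + \delta_{1A}}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n}}}\right) - 1 + \Phi\left(z_{1-\alpha} - \frac{\delta_2 + \delta_{2A}}{\sqrt{\frac{p_{AT}(1 - p_{AT}) + p_{AC}(1 - p_{AC})}{n}}}\right) \\
&= \Phi\left(z_{1-\alpha} - \frac{\delta_2 + \delta_{2A}}{\sqrt{\frac{p_{AT}(1 - p_{AT}) + p_{AC}(1 - p_{AC})}{n}}}\right) - \Phi\left(z_{1-\alpha} - \frac{\delta_1 + \delta_{1A}}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n}}}\right) \\
&\approx \frac{1}{\sqrt{2\pi}}\left(z_{1-\alpha} - \frac{\delta_2 + \delta_{2A}}{\sqrt{\frac{p_{AT}(1 - p_{AT}) + p_{AC}(1 - p_{AC})}{n}}} - z_{1-\alpha} + \frac{\delta_1 + \delta_{1A}}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n}}}\right) \\
&= \frac{1}{\sqrt{2\pi}}\left(\frac{\delta_1 + \delta_{1A}}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n}}} - \frac{\delta_2 + \delta_{2A}}{\sqrt{\frac{p_{AT}(1 - p_{AT}) + p_{AC}(1 - p_{AC})}{n}}}\right) \\
&= \sqrt{\frac{n}{2\pi}}\left(\frac{\delta_1 + \delta_{1A}}{\sqrt{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}} - \frac{\delta_2 + \delta_{2A}}{\sqrt{\frac{1}{2} - \frac{[c_1 - (\mu_T + \mu_{\Delta_T})]^2}{2\pi(\sigma_T^2 + \sigma_{\Delta_T}^2)} - \frac{[c_1 - (\mu_C + \mu_{\Delta_C})]^2}{2\pi(\sigma_C^2 + \sigma_{\Delta_C}^2)}}}\right).
\end{aligned} \tag{17}
$$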

2.3. Sample Size Comparison in Non-inferiority Tests

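From Equations (15) and (16), the sample size for the responder analysis using the absolute change endpoint in Equation (10) can be written as

$$n_2 = \frac{2(z_{1-\alpha} + z_\beta)^2\left(\frac{1}{2} - \frac{[c_1 - (\mu_T + \mu_{\Delta_T})]^2}{2\pi(\sigma_T^2 + \sigma_{\Delta_T}^2)} - \frac{[c_1 - (\mu_C + \mu_{\Delta_C})]^2}{2\pi(\sigma_C^2 + \sigma_{\Delta_C}^2)}\right)}{\left(\frac{[c_1 - (\mu_T + \mu_{\Delta_T})]^2}{2\pi(\sigma_T^2 + \sigma_{\Delta_T}^2)} - \frac{[c_1 - (\mu_C + \mu_{\Delta_C})]^2}{2\pi(\sigma_C^2 + \sigma_{\Delta_C}^2)} - \delta_2\right)^2}. \tag{18}$$

When the significance level and desired statistical power are the same, we can compare the necessary sample size for a responder analysis with that for a test based on absolute change using the ratio

$$
\begin{aligned}
\frac{n_2}{n_1} &= \frac{\dfrac{2(z_{1-\alpha} + z_\beta)^2\left(p_{AC}(1 - p_{AC}) + p_{AT}(1 - p_{AT})\right)}{(p_{AC} - p_{AT} - \delta_2)^2}}{\dfrac{2(z_{1-\alpha} + z_\beta)^2(\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2)}{[(\mu_C + \mu_{\Delta_C}) - (\mu_T + \mu_{\Delta_T}) - \delta_1]^2}} = \frac{\dfrac{p_{AC}(1 - p_{AC}) + p_{AT}(1 - p_{AT})}{(p_{AC} - p_{AT} - \delta_2)^2}}{\dfrac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{[(\mu_C + \mu_{\Delta_C}) - (\mu_T + \mu_{\Delta_T}) - \delta_1]^2}} \\
&= \frac{p_{AC}(1 - p_{AC}) + p_{AT}(1 - p_{AT})}{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}\left[\frac{(\mu_C + \mu_{\Delta_C}) - (\mu_T + \mu_{\Delta_T}) - \delta_1}{p_{AC} - p_{AT} - \delta_2}\right]^2 \\
&= \frac{\frac{1}{2} - \frac{[c_1 - (\mu_C + \mu_{\Delta_C})]^2}{2\pi(\sigma_C^2 + \sigma_{\Delta_C}^2)} - \frac{[c_1 - (\mu_T + \mu_{\Delta_T})]^2}{2\pi(\sigma_T^2 + \sigma_{\Delta_T}^2)}}{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}\left[\frac{(\mu_C + \mu_{\Delta_C}) - (\mu_T + \mu_{\Delta_T}) - \delta_1}{\frac{c_1 - (\mu_T + \mu_{\Delta_T})}{\sqrt{\sigma_T^2 + \sigma_{\Delta_T}^2}} - \frac{c_1 - (\mu_C + \mu_{\Delta_C})}{\sqrt{\sigma_C^2 + \sigma_{\Delta_C}^2}} - \delta_2}\right]^2.
\end{aligned} \tag{19}
$$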
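The ratio in Equation (19) can also be evaluated without the Taylor approximation by using the exact responder proportions from Equation (2) in the second form of the ratio. The sketch below does this for one hypothetical configuration (cut-off 0.1, treatment mean 0.2, control mean 0, unit variances, both margins 0.25); the function names are illustrative and not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def responder_probability(cutoff, mean, variance):
    """Equation (2): P(X > cutoff) when X ~ N(mean, variance)."""
    return 1 - STD_NORMAL.cdf((cutoff - mean) / sqrt(variance))


def sample_size_ratio(cutoff, mean_c, mean_t, var_c, var_t, margin_ac, margin_pac):
    """n2 / n1 from the second form of Equation (19), with exact proportions."""
    p_c = responder_probability(cutoff, mean_c, var_c)
    p_t = responder_probability(cutoff, mean_t, var_t)
    variance_ratio = (p_c * (1 - p_c) + p_t * (1 - p_t)) / (var_t + var_c)
    effect_ratio = (mean_c - mean_t - margin_ac) / (p_c - p_t - margin_pac)
    return variance_ratio * effect_ratio ** 2


ratio = sample_size_ratio(0.1, 0.0, 0.2, 1.0, 1.0, margin_ac=0.25, margin_pac=0.25)
print(round(ratio, 3))
```

A ratio below 1 means the responder analysis needs fewer patients than the absolute change test under the chosen margins; the value is sensitive to how the two margins are set relative to each other, since they are on different scales.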

2.4. Conflict Probability in Non-inferiority Tests

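In this section, we investigate the probabilities that a non-inferiority test using absolute change as the endpoint and the corresponding responder analysis reach the same or different conclusions. We assume the samples used to conduct these two types of non-inferiority test are the same. Thus, there are four possible types of events:

  • Both AC and PAC reject $H_0$
    $$P(\text{AC rejects } H_0 \text{ and PAC rejects } H_0) = P(Z_1 > z_{1-\alpha}, Z_2 > z_{1-\alpha}) = P\left(\frac{\bar{x}_T - \bar{x}_C + \delta_1}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n}}} > z_{1-\alpha},\ \frac{r_{AT} - r_{AC} + \delta_2}{\sqrt{\frac{r_{AT}(1 - r_{AT}) + r_{AC}(1 - r_{AC})}{n}}} > z_{1-\alpha}\right). \tag{20}$$
  • AC fails to reject $H_0$, whereas PAC rejects $H_0$
    $$P(\text{AC fails to reject } H_0 \text{ and PAC rejects } H_0) = P(Z_1 \le z_{1-\alpha}, Z_2 > z_{1-\alpha}) = P\left(\frac{\bar{x}_T - \bar{x}_C + \delta_1}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n}}} \le z_{1-\alpha},\ \frac{r_{AT} - r_{AC} + \delta_2}{\sqrt{\frac{r_{AT}(1 - r_{AT}) + r_{AC}(1 - r_{AC})}{n}}} > z_{1-\alpha}\right). \tag{21}$$
  • AC rejects $H_0$, whereas PAC fails to reject $H_0$
    $$P(\text{AC rejects } H_0 \text{ and PAC fails to reject } H_0) = P(Z_1 > z_{1-\alpha}, Z_2 \le z_{1-\alpha}) = P\left(\frac{\bar{x}_T - \bar{x}_C + \delta_1}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n}}} > z_{1-\alpha},\ \frac{r_{AT} - r_{AC} + \delta_2}{\sqrt{\frac{r_{AT}(1 - r_{AT}) + r_{AC}(1 - r_{AC})}{n}}} \le z_{1-\alpha}\right). \tag{22}$$
  • Both AC and PAC fail to reject $H_0$
    $$P(\text{AC fails to reject } H_0 \text{ and PAC fails to reject } H_0) = P(Z_1 \le z_{1-\alpha}, Z_2 \le z_{1-\alpha}) = P\left(\frac{\bar{x}_T - \bar{x}_C + \delta_1}{\sqrt{\frac{\sigma_T^2 + \sigma_{\Delta_T}^2 + \sigma_C^2 + \sigma_{\Delta_C}^2}{n}}} \le z_{1-\alpha},\ \frac{r_{AT} - r_{AC} + \delta_2}{\sqrt{\frac{r_{AT}(1 - r_{AT}) + r_{AC}(1 - r_{AC})}{n}}} \le z_{1-\alpha}\right). \tag{23}$$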
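Because both test statistics are computed from the same samples, one simple way to approximate the four joint probabilities in Equations (20) to (23) is Monte Carlo simulation. The sketch below is a minimal illustration under stated assumptions, not the authors' simulation code: responses are drawn from normal distributions, the AC statistic uses the known population variances as in Equation (5), the PAC statistic uses sample proportions as in Equation (8), and a replicate in which the PAC standard error is zero is counted as a failure to reject. All names and parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def conflict_probabilities(mean_t, mean_c, var_t, var_c, n, cutoff,
                           margin_ac, margin_pac, alpha=0.05, reps=1000, seed=1):
    """Estimate P(AC decision, PAC decision) for the four events.

    Returns a dict keyed by (ac_rejects, pac_rejects) with relative frequencies.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    counts = {(a, p): 0 for a in (True, False) for p in (True, False)}
    for _ in range(reps):
        x_t = [rng.gauss(mean_t, sqrt(var_t)) for _ in range(n)]
        x_c = [rng.gauss(mean_c, sqrt(var_c)) for _ in range(n)]
        # Equation (5): Z statistic for the absolute change endpoint.
        z1 = (fmean(x_t) - fmean(x_c) + margin_ac) / sqrt((var_t + var_c) / n)
        # Equation (8): Z statistic for the responder analysis.
        r_t = sum(x > cutoff for x in x_t) / n
        r_c = sum(x > cutoff for x in x_c) / n
        se2 = sqrt((r_t * (1 - r_t) + r_c * (1 - r_c)) / n)
        z2 = (r_t - r_c + margin_pac) / se2 if se2 > 0 else float("-inf")
        counts[(z1 > z_crit, z2 > z_crit)] += 1
    return {event: count / reps for event, count in counts.items()}


probs = conflict_probabilities(0.2, 0.0, 1.0, 1.0, n=50, cutoff=0.1,
                               margin_ac=0.25, margin_pac=0.25)
for (ac_rejects, pac_rejects), prob in probs.items():
    print(f"AC rejects={ac_rejects}, PAC rejects={pac_rejects}: {prob:.3f}")
```

The two entries whose keys disagree, `(True, False)` and `(False, True)`, estimate the probabilities of conflicting conclusions in Equations (22) and (21), respectively.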

3. Results

In this section, a numerical analysis using simulated data is conducted to investigate the difference between using absolute change as an endpoint and the corresponding responder analysis in terms of sample size requirement, statistical power, and non-inferiority test conclusion. Responses are assumed to follow a normal distribution. The allocation ratio is 1:1. The simulation is repeated 1000 times. Additionally, a case study is conducted to investigate the difference between a typical non-inferiority test and responder analysis using real clinical data from Byrd et al. (2022). Again, AC denotes the typical non-inferiority test using absolute change as the endpoint, and PAC denotes the corresponding responder analysis. The significance level is 0.05, and the desired power is 0.80.

3.1. Numerical analysis

According to Equation (7), the sample size of AC is associated with the population mean, population variance, and the non-inferiority margin. The treatment group population mean is set to 0.2 or 0.3, the control group population mean is set to 0, and the population variance of each group is set to 1.0, 2.0, or 3.0. Table 1 presents the required sample size of AC to achieve 80% statistical power. The sample size is associated with the effect size and the non-inferiority margin. When the effect size is fixed, a larger non-inferiority margin leads to a smaller sample size in AC; when the non-inferiority margin is fixed, a larger effect size leads to a smaller sample size in AC. Similarly, from Equation (10), the sample size of PAC is additionally related to the cut-off value (threshold) used to determine responders. As shown in Table 2, when the cut-off value is fixed, the impact of effect size and non-inferiority margin on sample size is the same as in Table 1. However, the impact of the cut-off value on the sample size calculation is quite complex, since it depends not only on the absolute value of the cut-off but also on the population mean and variance.

Table 1.

Sample sizes for non-inferiority test using absolute change endpoint (AC).

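| $\mu_T + \mu_{\Delta_T}$ | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\sigma_T^2 + \sigma_{\Delta_T}^2$ | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| $\sigma_C^2 + \sigma_{\Delta_C}^2$ | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 |
| $\delta_1 = 0.25$ | 246 | 368 | 490 | 368 | 490 | 612 | 490 | 612 | 734 | 164 | 246 | 328 | 246 | 328 | 410 | 328 | 410 | 492 |
| $\delta_1 = 0.30$ | 198 | 298 | 396 | 298 | 396 | 496 | 396 | 496 | 594 | 138 | 208 | 276 | 208 | 276 | 344 | 276 | 344 | 414 |
| $\delta_1 = 0.35$ | 164 | 246 | 328 | 246 | 328 | 410 | 328 | 410 | 492 | 118 | 176 | 236 | 176 | 236 | 294 | 236 | 294 | 352 |
| $\delta_1 = 0.40$ | 138 | 208 | 276 | 208 | 276 | 344 | 276 | 344 | 414 | 102 | 152 | 202 | 152 | 202 | 254 | 202 | 254 | 304 |
| $\delta_1 = 0.45$ | 118 | 176 | 236 | 176 | 236 | 294 | 236 | 294 | 352 | 88 | 132 | 176 | 132 | 176 | 220 | 176 | 220 | 264 |
| $\delta_1 = 0.50$ | 102 | 152 | 202 | 152 | 202 | 254 | 202 | 254 | 304 | 78 | 116 | 156 | 116 | 156 | 194 | 156 | 194 | 232 |
| $\delta_1 = 0.55$ | 88 | 132 | 176 | 132 | 176 | 220 | 176 | 220 | 264 | 70 | 104 | 138 | 104 | 138 | 172 | 138 | 172 | 206 |
| $\delta_1 = 0.60$ | 78 | 116 | 156 | 116 | 156 | 194 | 156 | 194 | 232 | 62 | 92 | 124 | 92 | 124 | 154 | 124 | 154 | 184 |
| $\delta_1 = 0.65$ | 70 | 104 | 138 | 104 | 138 | 172 | 138 | 172 | 206 | 56 | 84 | 110 | 84 | 110 | 138 | 110 | 138 | 166 |
| $\delta_1 = 0.70$ | 62 | 92 | 124 | 92 | 124 | 154 | 124 | 154 | 184 | 50 | 76 | 100 | 76 | 100 | 124 | 100 | 124 | 150 |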

Table 2.

Sample sizes for responder analysis using absolute change endpoint (PAC).

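| $\mu_T + \mu_{\Delta_T}$ | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\sigma_T^2 + \sigma_{\Delta_T}^2$ | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| $\sigma_C^2 + \sigma_{\Delta_C}^2$ | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 |
| cut-off value = 0.1, $\delta_2 = 0.25$ | 114 | 122 | 126 | 122 | 132 | 136 | 126 | 136 | 142 | 90 | 38 | 100 | 104 | 110 | 114 | 110 | 118 | 122 |
| cut-off value = 0.1, $\delta_2 = 0.30$ | 86 | 92 | 94 | 92 | 98 | 100 | 94 | 100 | 104 | 70 | 96 | 76 | 80 | 84 | 86 | 84 | 88 | 92 |
| cut-off value = 0.1, $\delta_2 = 0.35$ | 68 | 72 | 74 | 72 | 76 | 78 | 74 | 78 | 80 | 56 | 74 | 60 | 62 | 66 | 68 | 66 | 70 | 72 |
| cut-off value = 0.1, $\delta_2 = 0.40$ | 54 | 58 | 58 | 58 | 60 | 62 | 58 | 62 | 64 | 46 | 60 | 50 | 50 | 54 | 54 | 54 | 56 | 56 |
| cut-off value = 0.1, $\delta_2 = 0.45$ | 44 | 46 | 48 | 46 | 50 | 50 | 48 | 50 | 52 | 90 | 48 | 40 | 42 | 44 | 44 | 44 | 46 | 46 |
| cut-off value = 0.2, $\delta_2 = 0.25$ | 114 | 132 | 142 | 114 | 132 | 142 | 114 | 132 | 142 | 90 | 104 | 110 | 96 | 110 | 118 | 100 | 114 | 122 |
| cut-off value = 0.2, $\delta_2 = 0.30$ | 86 | 98 | 104 | 86 | 98 | 104 | 86 | 98 | 104 | 70 | 80 | 84 | 74 | 84 | 88 | 76 | 86 | 92 |
| cut-off value = 0.2, $\delta_2 = 0.35$ | 68 | 76 | 80 | 68 | 76 | 80 | 68 | 76 | 80 | 56 | 62 | 66 | 60 | 66 | 70 | 60 | 68 | 72 |
| cut-off value = 0.2, $\delta_2 = 0.40$ | 54 | 60 | 62 | 54 | 60 | 62 | 54 | 60 | 62 | 46 | 50 | 54 | 48 | 54 | 56 | 50 | 54 | 56 |
| cut-off value = 0.2, $\delta_2 = 0.45$ | 44 | 48 | 52 | 44 | 48 | 52 | 44 | 48 | 52 | 38 | 42 | 44 | 40 | 44 | 46 | 40 | 44 | 46 |
| cut-off value = 0.3, $\delta_2 = 0.25$ | 112 | 142 | 158 | 104 | 132 | 146 | 102 | 126 | 140 | 90 | 110 | 122 | 90 | 110 | 122 | 90 | 110 | 122 |
| cut-off value = 0.3, $\delta_2 = 0.30$ | 84 | 104 | 114 | 80 | 98 | 106 | 78 | 94 | 104 | 70 | 84 | 92 | 70 | 84 | 92 | 70 | 84 | 92 |
| cut-off value = 0.3, $\delta_2 = 0.35$ | 66 | 80 | 86 | 64 | 74 | 82 | 62 | 74 | 80 | 56 | 66 | 70 | 56 | 66 | 70 | 56 | 66 | 70 |
| cut-off value = 0.3, $\delta_2 = 0.40$ | 54 | 62 | 68 | 52 | 60 | 64 | 50 | 58 | 62 | 46 | 54 | 56 | 46 | 54 | 56 | 46 | 54 | 56 |
| cut-off value = 0.3, $\delta_2 = 0.45$ | 44 | 50 | 54 | 42 | 48 | 52 | 42 | 48 | 50 | 38 | 44 | 46 | 38 | 44 | 46 | 38 | 44 | 46 |
| cut-off value = 0.4, $\delta_2 = 0.25$ | 110 | 150 | 176 | 96 | 130 | 150 | 92 | 122 | 140 | 88 | 118 | 134 | 84 | 110 | 124 | 82 | 106 | 120 |
| cut-off value = 0.4, $\delta_2 = 0.30$ | 84 | 108 | 124 | 74 | 96 | 108 | 70 | 90 | 102 | 68 | 88 | 100 | 66 | 82 | 94 | 64 | 80 | 90 |
| cut-off value = 0.4, $\delta_2 = 0.35$ | 64 | 82 | 92 | 58 | 74 | 82 | 56 | 70 | 78 | 56 | 68 | 76 | 52 | 66 | 72 | 52 | 64 | 70 |
| cut-off value = 0.4, $\delta_2 = 0.40$ | 52 | 64 | 72 | 48 | 58 | 64 | 46 | 56 | 62 | 46 | 56 | 60 | 44 | 52 | 58 | 42 | 52 | 56 |
| cut-off value = 0.4, $\delta_2 = 0.45$ | 42 | 52 | 58 | 40 | 48 | 52 | 38 | 46 | 50 | 38 | 46 | 50 | 36 | 44 | 48 | 36 | 42 | 46 |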

Comparing the sample sizes in Table 1 and Table 2, we find that when the non-inferiority margin is fixed, the required sample size of AC is much larger than that of PAC. One important assumption used here is that the non-inferiority margins of the two tests are the same. We make this assumption because many scholars have suggested using responder analysis as a secondary analysis (Henschke et al., 2014), i.e., the sample size is computed based on the primary analysis. In other words, the statistical analysis of a typical non-inferiority test and the responder analysis will be conducted using the same dataset. However, in practice, it is more likely that the non-inferiority margins in these two tests are different, since the two tests have different meanings. Therefore, when conducting responder analysis as the secondary analysis, it is possible that we do not have enough power for this secondary analysis.

Next, we compare the statistical power of AC and PAC using Equations (6) and (9) when the sample size is fixed. In the simulation process, the sample size used to generate random samples is the minimum of all possible sample sizes, given the population mean and standard deviation. Here, the population means of the treatment and control groups are 0.2 and 0, the population variance of the treatment group is 2, the population variance of the control group ranges from 1 to 3, and the cut-off value ranges from 0.1 to 0.8. To make the power comparable, we assume the non-inferiority margins of AC and PAC are the same. As shown in Figure 1, with a fixed sample size, the statistical power of AC is the lowest, suggesting that AC requires a larger sample size than PAC to achieve the desired power. This finding is consistent with the results in Table 1 and Table 2. Additionally, in PAC, the statistical power values obtained with different cut-off values become closer to each other as the population variance increases. The statistical power of PAC is either slightly lower than 80% or over 80%, regardless of the cut-off value. The statistical power of AC is always below 60%. Hence, if researchers conduct a sample size calculation based on responder analysis but end up using the typical non-inferiority test, they will not be able to achieve enough statistical power.

Figure 1.


Statistical power comparison of non-inferiority test using absolute change as endpoint (AC) and corresponding responder analysis (PAC).

To illustrate the relationship between the required sample sizes, we assume the non-inferiority margins for the absolute change endpoint and the responder analysis are the same. The setting is the same as the one in Figure 1. In Figure 2, the ratio of the PAC sample size to the AC sample size is used to represent the relationship between the two, where N1 denotes the sample size of AC and N2 denotes the sample size of PAC. Under the setting used here, N2/N1 is always smaller than 0.35, suggesting that the sample size of AC is much larger than that of PAC. When the non-inferiority margin increases, the ratios with different cut-off values not only become smaller but also become closer to each other. Comparing Figure 2 (A), (B) and (C), we find that the sample size ratio decreases, and the sample size ratios with different cut-off values become closer to each other, when the variance of the control group increases.

Figure 2.


Sample size comparison of non-inferiority test using absolute change as endpoint (AC) and corresponding responder analysis (PAC).

Another essential parameter of interest in responder analysis is the cut-off value (threshold) used to determine whether an observation is a responder. Let the population mean of the treatment group range from 0.10 to 0.30. To make the results comparable, the non-inferiority margins in AC and PAC are set to 0. The range of the cut-off value is wider than before, from −3 to 3. In the simulation, continuous samples are first randomly generated from a normal distribution, where the sample size is computed using AC’s sample size formula in Equation (7). Then, using the cut-off value, we label each subject as either a responder or a non-responder. As shown in Figure 3, the cut-off value can indeed drive the conclusion in a different direction. In Figure 3 (A), a negative cut-off value yields conflicting results; in Figure 3 (B), a more extreme cut-off value yields conflicting results; the same is found in Figure 3 (C). Additionally, the influence of the cut-off value on the hypothesis test result is related to the population mean and variance; however, the overall pattern is similar. Hence, a more extreme cut-off value, i.e., a cut-off that is further away from the population mean, is more likely to lead to conflicting conclusions.
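One way to see why extreme cut-off values can lead to different conclusions is to evaluate the power expressions in Equations (6) and (9) at a common sample size while varying only the cut-off. The sketch below is an illustration under assumed values (treatment mean 0.3, control mean 0, unit variances, both margins 0, a hypothetical 275 patients per group); it is not a reproduction of Figure 3, and the function names are not from the paper.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def ac_power(n, mean_t, mean_c, var_t, var_c, margin, alpha=0.05):
    """Equation (6) with the true difference set to mean_t - mean_c."""
    standard_error = sqrt((var_t + var_c) / n)
    shift = (margin + mean_t - mean_c) / standard_error
    return 1 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - alpha) - shift)


def pac_power(n, cutoff, mean_t, mean_c, var_t, var_c, margin, alpha=0.05):
    """Equation (9) with population responder proportions from Equation (2)."""
    p_t = 1 - STD_NORMAL.cdf((cutoff - mean_t) / sqrt(var_t))
    p_c = 1 - STD_NORMAL.cdf((cutoff - mean_c) / sqrt(var_c))
    standard_error = sqrt((p_t * (1 - p_t) + p_c * (1 - p_c)) / n)
    shift = (margin + p_t - p_c) / standard_error
    return 1 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - alpha) - shift)


n = 275  # hypothetical per-group sample size
print("AC power:", round(ac_power(n, 0.3, 0.0, 1.0, 1.0, margin=0.0), 3))
for cutoff in (-3.0, -1.5, 0.0, 0.15, 1.5, 3.0):
    power = pac_power(n, cutoff, 0.3, 0.0, 1.0, 1.0, margin=0.0)
    print(f"cut-off {cutoff:+.2f}: PAC power {power:.3f}")
```

In this sketch the AC power does not depend on the cut-off, while the PAC power is highest when the cut-off lies near the group means and falls sharply in the tails, where almost every patient is classified the same way in both groups. That gap is one plausible mechanism behind the conflicting conclusions at extreme cut-off values.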

Figure 3.


Non-inferiority test results comparison of typical test using absolute change as endpoint (AC) and corresponding responder analysis (PAC).

3.2. Case study

In Section 3.1, we studied the impact of essential parameters on sample size requirement, statistical power, and test conclusions using simulated data. To illustrate more clearly the impact of the cut-off value on non-inferiority test results, we conduct a case study using real clinical data from an observational study of rehabilitation in lung transplant patients (Byrd et al., 2022). The primary aim of Byrd et al. (2022) is to compare the performance of individual rehabilitation with that of group rehabilitation in both pre-operative and post-operative participants, measured by the primary outcome variable, change in 6-minute walk distance (6MWD). Detailed information on the change in 6MWD of pre-operative and post-operative patients is presented in Table 3.

Table 3.

Change of 6MWD of pre-operative and post-operative participants in Byrd et al. (2022).

| Rehabilitation | Pre-operative, Group | Pre-operative, Individual | Post-operative, Group | Post-operative, Individual |
| --- | --- | --- | --- | --- |
| Sample size | 93 | 81 | 110 | 105 |
| Mean (SD) | 51.6 (81.3) | 56.6 (62.9) | 174 (97.6) | 160 (89.4) |
| Median [Q1, Q3] | 44.5 [6.40, 102] | 59.7 [25.0, 93.9] | 168 [106, 232] | 159 [104, 208] |

In this section, the non-inferiority test is used to study under what circumstances AC and PAC may lead to different conclusions. According to previous studies (Holland et al. 2014; Holland et al, 2017), a clinically meaningful change in 6MWD is between 25 m and 33 m. The cut-off value used here ranges from 20 m to 35 m to give a more comprehensive understanding of the impact of cut-off value selection on study conclusions. The non-inferiority margin of AC ranges from −0.3 to 0.3, and the non-inferiority margin of PAC ranges from 0 to 0.03. As shown in Figure 4, for pre-operative patients, some cut-off values may lead to different conclusions. For instance, in Figure 4 (A), a cut-off value larger than 27 yields conflicting results. However, for post-operative patients, if the cut-off value is between 20 and 35, AC and PAC always give consistent results. Looking more closely at the data, we find that for most post-operative patients, the change in 6MWD is either very large (larger than 35) or very small (smaller than 20). In other words, in this scenario, a cut-off value within this range (20 to 35) cannot significantly change the proportion of responders among post-operative patients. This suggests that cut-off value selection may cause responder analysis and the typical non-inferiority test to provide conflicting findings only under certain circumstances.

Figure 4.


Non-inferiority test results comparison of typical test using absolute change as endpoint (AC) and corresponding responder analysis (PAC) in Rehabilitation Program in Lung Transplant Study (Byrd et al, 2022).

As mentioned in Section 1, a responder analysis answers a different question from the typical non-inferiority test. Specifically, if we pick an extreme cut-off value, responder analysis investigates whether the test treatment could bring substantial clinical benefit to patients. For example, in Figure 4 (A), AC gives a non-significant result, i.e., non-inferiority of individual rehabilitation to group rehabilitation cannot be concluded, whereas PAC gives a significant result when the cut-off value is large. This suggests that individual rehabilitation is non-inferior to group rehabilitation only for a small proportion of patients, who achieve a large improvement. In other words, the large cut-off value allows us to focus on a smaller proportion of patients who had a substantial improvement; this difference may not be detectable in typical non-inferiority tests, yielding conflicting findings.

4. Discussion

One of the most important steps of any clinical trial is to determine the primary study endpoint, which may influence the process of establishing hypotheses, selecting statistical models, calculating sample size, etc. Generally speaking, there are four types of study endpoints: (i) absolute change, (ii) relative change, (iii) responder analysis using absolute change, and (iv) responder analysis using relative change. This paper focuses on the comparison of endpoints (i) and (iii) in non-inferiority tests in terms of sample size requirement, statistical power, and whether different endpoints may lead to different conclusions, as an example to illustrate how to compare different study endpoints. The comparison process in this study can also be generalized to compare any two of the study endpoints mentioned above.

In the numerical study section, both simulation study and case study using data in Byrd et al (2022) are conducted. According to the simulation study, the required sample size of a non-inferiority test using absolute change endpoint (AC) is associated with the population mean and variance of treatment and control group and the non-inferiority margin. The sample size of the corresponding responder analysis (PAC) is additionally related to the cut-off value used to determine responders. Fixing all parameters, we find that PAC requires a smaller sample size compared to AC. In other words, when the sample size is the same, PAC will always have a larger statistical power than AC, as shown in Figure 1. When the desired statistical power is the same, the sample size ratio of PAC to AC is always smaller than 1, which is also related to the non-inferiority margin and cut-off value. However, the impact of these two parameters decreases as the population variance increases. As the cut-off value becomes more extreme, the likelihood of obtaining conflicting conclusions from a non-inferiority hypothesis test increases. This was seen both in the simulation study and the case study. We find that the cut-off value selection is of great importance, which may cause conflicting results when the mean and median in treatment and control groups are closer to the cut-off value.

Without loss of generality, similar conclusions can be drawn for superiority and equivalence tests. The fundamental reason that a typical non-inferiority, superiority, or equivalence test using absolute change as the endpoint and the corresponding responder analysis provide conflicting conclusions is the distribution of the target population. If the samples follow a normal distribution and the cut-off value is close to the population mean, the typical test and the responder analysis are very likely to give the same conclusion. Otherwise, the two types of analysis may provide conflicting results, especially when the cut-off value is far from the population mean.
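The role of the population distribution can be seen in a small constructed example. The sketch below is a minimal illustration, assuming large-sample z-tests for both endpoints; the data are made up (a strongly skewed treatment group and a tightly clustered control group with the same mean) and the function names, margins, and cut-off are hypothetical, not values from the simulation or case study.

```python
import math
from statistics import NormalDist, mean, variance


def ac_noninferior(x_t, x_r, margin, alpha=0.05):
    """Large-sample z-test of H0: mu_t - mu_r <= -margin (absolute change)."""
    se = math.sqrt(variance(x_t) / len(x_t) + variance(x_r) / len(x_r))
    z = (mean(x_t) - mean(x_r) + margin) / se
    return z > NormalDist().inv_cdf(1 - alpha)


def pac_noninferior(x_t, x_r, cutoff, margin_p, alpha=0.05):
    """Large-sample z-test of H0: p_t - p_r <= -margin_p (responder analysis)."""
    p_t = sum(v > cutoff for v in x_t) / len(x_t)
    p_r = sum(v > cutoff for v in x_r) / len(x_r)
    effect = p_t - p_r + margin_p
    se = math.sqrt(p_t * (1 - p_t) / len(x_t) + p_r * (1 - p_r) / len(x_r))
    if se == 0:  # both rates are exactly 0 or 1; fall back to the sign
        return effect > 0
    return effect / se > NormalDist().inv_cdf(1 - alpha)


# Made-up data: both groups have mean 4, but the treatment group is skewed
# (most subjects do not change, a few change a lot).
treatment = [0.0] * 80 + [20.0] * 20
control = [3.5] * 50 + [4.5] * 50

ac_result = ac_noninferior(treatment, control, margin=2.0)
pac_result = pac_noninferior(treatment, control, cutoff=3.0, margin_p=0.2)
print("AC concludes non-inferiority: ", ac_result)
print("PAC concludes non-inferiority:", pac_result)
```

With these constructed values the mean-based test concludes non-inferiority while the responder analysis does not, because only 20% of the treatment group exceeds the cut-off compared with all of the control group.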

Given the importance of cut-off value selection and the possibility of obtaining conflicting conclusions, we suggest that, when conducting a responder analysis, the cut-off value be determined using domain knowledge in combination with statistics of the collected sample. Though the minimal clinically important difference (MCID) is commonly used as the cut-off value (Jones et al., 2016), guidance and approaches for cut-off value selection have also been proposed in the literature (Farrar et al., 2006; Harrell, 2017). Additionally, since the sample size requirements of AC and PAC differ, it is necessary to check whether the sample size is large enough to achieve the desired statistical power. It should be noted that the same challenges arise when choosing between absolute and relative change as the study endpoint: these tests may also require different sample sizes, yield different power, and result in different study conclusions. Some studies have reported that absolute and relative change endpoints may lead to conflicting conclusions (Chow, 2011; Curran-Everett and Williams, 2015). In addition, these endpoints are viewed differently by drug regulatory agencies. According to the non-inferiority guidance from the US Food and Drug Administration (FDA), the constancy assumption of a study is expected to be based on constancy of relative effects, not absolute effects (FDA, 2016). However, the European Medicines Agency’s (EMA) guidance on non-inferiority testing uses absolute differences to illustrate its instructions (EMA, 2005). Hence, it is possible that a drug approved by the FDA may not be approved by the EMA, or vice versa, since the required sample size and statistical power differ between absolute change and relative change endpoints (Chow, 2011). It would therefore be useful to provide a confidence interval of cut-off values within which the typical non-inferiority test and the responder analysis lead to consistent conclusions, and to investigate under what circumstances absolute and relative change endpoints provide the same non-inferiority test results.
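The sample size check recommended above can be sketched as a power calculation for a planned per-group sample size. The Python code below is a minimal sketch using the usual normal approximation for a one-sided non-inferiority z-test; the function names and all numerical inputs are hypothetical and serve only to show the calculation for each endpoint.

```python
import math
from statistics import NormalDist


def power_noninferiority(effect, var_sum, n, alpha=0.05):
    """Approximate power of a one-sided large-sample non-inferiority z-test.

    effect  : true difference plus the non-inferiority margin
    var_sum : sum of the per-subject variances of the two groups
    n       : planned sample size per group
    """
    std = NormalDist()
    se = math.sqrt(var_sum / n)
    return std.cdf(effect / se - std.inv_cdf(1 - alpha))


def power_ac(mu_t, mu_r, sd_t, sd_r, margin, n, alpha=0.05):
    """Power for the absolute change endpoint (difference in means)."""
    return power_noninferiority(mu_t - mu_r + margin, sd_t**2 + sd_r**2, n, alpha)


def power_pac(p_t, p_r, margin_p, n, alpha=0.05):
    """Power for the responder endpoint (difference in responder rates)."""
    var_sum = p_t * (1 - p_t) + p_r * (1 - p_r)
    return power_noninferiority(p_t - p_r + margin_p, var_sum, n, alpha)


# Hypothetical planning values: is n = 50 per group enough for 80% power?
target = 0.80
n_planned = 50
ac = power_ac(mu_t=0.0, mu_r=0.0, sd_t=1.0, sd_r=1.0, margin=0.5, n=n_planned)
pac = power_pac(p_t=0.5, p_r=0.5, margin_p=0.2, n=n_planned)
print(f"AC power:  {ac:.3f}  adequate: {ac >= target}")
print(f"PAC power: {pac:.3f}  adequate: {pac >= target}")
```

Running the same check for both endpoints makes it visible when a sample size that is adequate for one analysis is inadequate for the other under the assumed margins.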

Acknowledgements

This research is in part supported by Grant Number UL1TR002553 from the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NCATS or NIH.

Footnotes

Conflict of interest

There is no conflict of interest.

References

  • [1]. Byrd R, Breslin R, Wang P, Peskoe S, Chow SC, Lowers S, Snyder LD, Pastva AM: Group versus Individual Rehabilitation in Lung Transplantation: A Retrospective Non-Inferiority Assessment. 2022. [Manuscript submitted for publication].
  • [2]. Chow SC and Song F: On Controversial Statistical Issues in Clinical Research. Open Access Journal of Clinical Trials 2015, 7, 43–51.
  • [3]. Chow SC: Controversial statistical issues in clinical trials. Boca Raton, FL, USA: CRC Press; 2011: 135–147.
  • [4]. Curran-Everett D and Williams CL: Explorations in Statistics: the Analysis of Change. Advances in physiology education 2015, 39(2), 49–54.
  • [5]. EMA. Guideline on the Choice of the Non-inferiority Margin; 2015.
  • [6]. Farrar JT, Dworkin RH, and Max MB: Use of the Cumulative Proportion of Responders Analysis Graph to Present Pain Data over a Range of Cut-Off Points: Making Clinical Trial Data More Understandable. Journal of pain and symptom management 2006, 31(4), 369–377.
  • [7]. FDA. Non-Inferiority Clinical Trials to Establish Effectiveness; 2016.
  • [8]. Gilbert C, Brown MC, Cappelleri JC, Carlsson M, and McKenna SP: Estimating a Minimally Important Difference in Pulmonary Arterial Hypertension Following Treatment with Sildenafil. Chest 2009, 135(1), 137–142.
  • [9]. Harrell FE: Regression Modeling Strategies. Springer International Publishing.
  • [10]. Henschke N, van Enst A, Froud R and WG Ostelo R: Responder Analyses in Randomised Controlled Trials for Chronic Low Back Pain: An Overview of Currently Used Methods. European Spine Journal, 2014, 23(4), 772–778.
  • [11]. Holland AE, Spruit MA, Troosters T, Puhan MA, Pepin V, Saey D, … and Singh SJ: An Official European Respiratory Society/American Thoracic Society Technical Standard: Field Walking Tests in Chronic Respiratory Disease. European Respiratory Journal 2014, 44(6), 1428–1446.
  • [12]. Holland AE, Mahal A, Hill CJ, Lee AL, Burge AT, Cox NS, … and McDonald CF: Home-Based Rehabilitation for COPD Using Minimal Resources: A Randomised, Controlled Equivalence Trial. Thorax 2017, 72(1), 57–65.
  • [13]. Jones PW, Rennard S, Tabberer M, Riley JH, Vahdati-Bolouri M and Barnes NC: Interpreting Patient-Reported Outcomes from Clinical Trials in COPD: A Discussion. International Journal of Chronic Obstructive Pulmonary Disease 2016, 11, 3069.
  • [14]. Munro PE, Holland AE, Bailey M, Button BM, and Snell GI: Pulmonary Rehabilitation Following Lung Transplantation. Transplantation proceedings 2019, 41(1), 292–295.
  • [15]. Martinu T, Babyak MA, O’Connell CF, Carney RM, Trulock EP, Davis RD, … and INSPIRE Investigators: Baseline 6-Min Walk Distance Predicts Survival in Lung Transplant Candidates. American Journal of Transplantation 2008, 8(7), 1498–1505.
  • [16]. Ryerson CJ, Cayou C, Topp F, Hilling L, Camp PG, Wilcox PG, … and Garvey C: Pulmonary Rehabilitation Improves Long-Term Outcomes In Interstitial Lung Disease: a Prospective Cohort Study. Respiratory medicine 2014, 108(1), 203–210.
  • [17]. Snapinn SM, and Qi J: Responder Analyses and the Assessment of a Clinically Relevant Treatment Effect. Trials 2007, 8(1), 1–6.
  • [18]. Stoilkova-Hartmann A, Janssen DJ, Franssen FM, and Wouters EF: Differences in Change in Coping Styles between Good Responders, Moderate Responders and Non-Responders to Pulmonary Rehabilitation. Respiratory medicine 2015, 109(12), 1540–1545.
  • [19]. Tuppin MP, Paratz JD, Chang AT, Seale HE, Walsh JR, Kermeeen FD, … and Hopkins PM: Predictive Utility of the 6-Minute Walk Distance on Survival in Patients Awaiting Lung Transplantation. The Journal of heart and lung transplantation 2008, 27(7), 729–734.
  • [20]. Vickers AJ: The Use of Percentage Change from Baseline as an Outcome in a Controlled Trial is Statistically Inefficient: a Simulation Study. BMC medical research methodology 2001, 1(1), 1–4.
