Abstract
A design is proposed for randomized comparative trials with ordinal outcomes and prognostic subgroups. The design accounts for patient heterogeneity by allowing possibly different comparative conclusions within subgroups. The comparative testing criterion is based on utilities for the levels of the ordinal outcome and a Bayesian probability model. Designs based on two alternative models that include treatment-subgroup interactions are considered: the proportional odds model, and a non-proportional odds model with a hierarchical prior that shrinks toward the proportional odds model. A third design that assumes homogeneity and ignores possible treatment-subgroup interactions is also considered. The three approaches are applied to construct group sequential designs for a trial of nutritional prehabilitation versus standard of care for esophageal cancer patients undergoing chemoradiation and surgery, including both untreated patients and salvage patients whose disease has recurred following previous therapy. A simulation study is presented that compares the three designs, including evaluation of within-subgroup type I and II error probabilities under a variety of scenarios with different combinations of treatment-subgroup interactions.
Keywords: Group sequential, Hierarchical model, Non-proportional odds, Ordinal response, Precision medicine
1 Background
This paper describes a design for a small single-center randomized controlled clinical trial to evaluate the effectiveness of nutritional prehabilitation (Nuprehab) for esophageal cancer patients who undergo esophageal resection preceded and followed by chemoradiation therapy. Common postoperative morbidities for patients who undergo esophageal resection include anastomotic leak and stricture, chylothorax, delayed emptying or dumping syndrome, pulmonary complications such as pneumonia, and cardiac complications such as atrial fibrillation (Parekh and Iannettoni, 2007; Chen, 2014). The Nuprehab is given prior to surgery as well as seven days after surgery, with the aim to achieve oral immunomodulation with an L-arginine based enteral formula. The motivation for the trial is the hypothesis that providing patients with Nuprehab may reduce the incidence of postoperative morbidity and mortality via nutritional supplementation (see, e.g., Braga et al., 2002; Waitzberg et al., 2006).
Patients will be randomized to receive either Nuprehab or control, which is the standard of care. All patients will be monitored for morbidity and mortality for thirty days following their surgery. The trial’s primary outcome is Clavien-Dindo postoperative morbidity (POM) score (Clavien et al., 1992; Dindo et al., 2004; Clavien et al., 2009), which is ordinal with six levels: 0=normal recovery, 1=minor complication, 2=complication requiring pharmaceutical intervention, 3=complication requiring surgical, endoscopic or radiological intervention, 4=life-threatening complication requiring intensive care, and 5=death. The worst POM score during 30 days post surgery will be recorded. The trial will enroll approximately 60% primary and 40% salvage patients. Primary patients are treatment naive, whereas salvage patients have been treated previously with chemoradiation therapy, but not surgery, and their disease has recurred. Salvage patients are expected to have fewer preoperative nutritional deficiencies, but more preoperative comorbidities and worse prognosis. Consequently, it is plausible that the efficacy of Nuprehab may differ substantially for primary and salvage patients.
In this paper, we describe a design that accounts for the possibility that Nuprehab may be clinically beneficial for one of the subgroups but not the other. This is in sharp contrast with a more traditional “one-size-fits-all” approach that ignores prognostic information and makes one recommendation for all patients about whether Nuprehab is clinically beneficial. We evaluate the design based on its probabilities of recommending Nuprehab to each subgroup in four key scenarios: (i) the “complete null” scenario where Nuprehab does not improve POM scores for patients in either subgroup; (ii) the “partial null” scenario where Nuprehab does not improve POM scores for primary patients, but achieves a targeted benefit for salvage patients; (iii) the “partial null” scenario where Nuprehab achieves a targeted benefit for primary patients, but does not improve POM scores for salvage patients; and (iv) the “complete alternative” scenario where Nuprehab achieves targeted benefits in POM score reduction for both primary and salvage patients. The proposed design addresses the concern that, in the partial null scenarios (ii) and (iii), a one-size-fits-all design will have an unacceptably high (low) probability for recommending Nuprehab to the non-benefiting (benefiting) subgroup.
The proposed design is frequentist in that it is specified to provide specific probabilities of recommending Nuprehab to a non-benefiting subgroup (i.e. type I error) and to a benefiting subgroup for a particular targeted benefit (i.e. power). However, the decision to recommend Nuprehab for a particular subgroup is based on a posterior probability from a Bayesian model. Designs have been proposed that are similar to our design in that they facilitate subgroup specific recommendations, and stopping subsequent enrollment for particular subgroups, see, e.g., Brannath et al. (2009); Wang et al. (2009); Rosenblum et al. (2016). These designs do not involve an ordinal outcome, however.
We consider two alternative Bayesian probability models that facilitate subgroup specific recommendations. The first is the proportional odds (PO) cumulative logistic regression model of McCullagh (1980) with a treatment-subgroup interaction parameter. The second is a non-proportional odds (NPO) model with a hierarchical prior that shrinks toward the PO model. Although NPO models have been proposed, see, e.g., Peterson and Harrell (1990); Bender and Grouven (1998); Ishwaran (2000); Agresti (2010), as far as we are aware our formulation of the NPO model is novel, and moreover this is the first proposal to use an NPO model as the basis for comparing treatments in a randomized clinical trial. Guo and Yuan (2017) use the dispersed cumulative probit model of McCullagh (1980), which is a type of NPO model, for personalized dose finding in a phase I/II study of molecularly targeted agents. As we describe below, compared with the PO model, treatment comparison based on the proposed NPO model is more robust but more complex. To obtain a practical design that deals with this complexity, we propose a comparative testing criterion based on elicited numerical utilities of the six POM scores. Our approach may be considered a generalization of the utility-based design proposed by Murray et al. (2016), which does not accommodate prognostic subgroups and uses a Dirichlet-multinomial model.
The remainder of the paper is organized as follows. In Section 2, we discuss treatment comparison with ordinal outcomes in general and our utility-based comparative criterion in particular. In Section 3, we describe the PO and NPO models. In Section 4, we discuss practical design considerations, including specifying targeted alternatives, analysis and monitoring plan, controlling the probability of committing a type I error, and sample size. In Section 5, we present the results of a simulation study comparing the proposed design based on either the PO or NPO model, and also a more traditional design based on a PO model without a treatment-subgroup interaction parameter. We conclude with a brief discussion in Section 6.
2 Treatment Comparison
Each design compares the efficacy of Nuprehab relative to standard of care using the six-level ordinal POM score. Comparing treatments based on an ordinal outcome is complicated by the fact that, even when the probability of each outcome level is known, it is not always clear whether one treatment is superior to the other. A simple example is a three-level outcome (Good, Intermediate, Poor) where treatment A gives probabilities (0.30, 0.50, 0.20) and treatment B gives probabilities (0.40, 0.30, 0.30). Since B has larger probabilities of both Good and Poor compared to A, it is not clear whether one treatment is superior to the other. Comparing the two treatments requires additional information, such as a quantification of the relative desirabilities of the three possible events.
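The ambiguity in this three-level example can be checked numerically by comparing cumulative probabilities. The following is a minimal sketch for illustration only; the function name `stochastic_order` is hypothetical and is not part of the trial software.

```python
from itertools import accumulate

def stochastic_order(p_a, p_b, tol=1e-12):
    """Compare two outcome distributions listed from best to worst level.

    Returns "A" if A is at least as likely as B to reach each level or better
    (and strictly more likely for at least one level), "B" for the reverse,
    "equal" if the distributions coincide, and None if they are not ordered.
    """
    cum_a = list(accumulate(p_a))[:-1]  # drop the last cumulative value (always 1)
    cum_b = list(accumulate(p_b))[:-1]
    diffs = [ca - cb for ca, cb in zip(cum_a, cum_b)]
    a_ahead = any(d > tol for d in diffs)
    b_ahead = any(d < -tol for d in diffs)
    if a_ahead and b_ahead:
        return None
    if a_ahead:
        return "A"
    if b_ahead:
        return "B"
    return "equal"

treatment_a = [0.30, 0.50, 0.20]  # (Good, Intermediate, Poor)
treatment_b = [0.40, 0.30, 0.30]
result = stochastic_order(treatment_a, treatment_b)
print(result)  # None: the cumulative probabilities cross
```

Here B is ahead at the first level (0.40 versus 0.30) but A is ahead at the second (0.80 versus 0.70), so neither distribution dominates the other.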
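Accounting for prognostic subgroups further complicates matters. To see this, denote Y = POM score, P = primary, S = salvage, N = Nuprehab, C = control, and

$$\pi_y(\mathrm{Sgp}, \mathrm{Trt}) = \mathrm{Prob}(Y = y \mid \mathrm{Sgp}, \mathrm{Trt}),$$

for y = 0, …, 5, Sgp ∈ {P, S} and Trt ∈ {N, C}. Indexing Sgp by x and Trt by a or a′, if

$$\mathrm{Prob}(Y \le y \mid x, a) \ge \mathrm{Prob}(Y \le y \mid x, a') \quad \text{for all } y = 0, \ldots, 4,$$

with strict inequality for at least one y, then clearly treatment a is superior to a′ for patients in subgroup x. By contrast, if

$$\mathrm{Prob}(Y \le y \mid x, a) > \mathrm{Prob}(Y \le y \mid x, a') \ \text{for some } y \quad \text{and} \quad \mathrm{Prob}(Y \le y' \mid x, a) < \mathrm{Prob}(Y \le y' \mid x, a') \ \text{for some } y' \ne y,$$

then it is not clear whether a is superior to a′ for patients in subgroup x. Stated formally, for a particular patient subgroup, if the POM score distributions corresponding to each treatment arm are stochastically ordered with strict inequality for at least one level y, then it is clear which treatment is superior for that subgroup. By contrast, if the POM score distributions corresponding to each treatment arm are not stochastically ordered, then it is not clear whether one treatment is superior to the other for that subgroup.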
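To provide a criterion for determining whether one treatment is superior, we elicit numerical utilities U(Y = y) for all levels of Y, and compare treatments using mean utilities,

$$\bar{U}(\pi(\mathrm{Sgp}, \mathrm{Trt})) = \sum_{y=0}^{5} U(Y = y)\, \pi_y(\mathrm{Sgp}, \mathrm{Trt}).$$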
These depend on the subgroup-treatment specific outcome probabilities π(Sgp, Trt) and a utility function U(Y = y) that quantifies the desirability of each outcome level. Following Houede et al. (2010), Thall and Nguyen (2012) and Murray et al. (2016), we elicited {U(Y = y), y = 0, …, 5} from the trial’s Principal Investigator, WH, so that the numerical utilities reflect his familiarity with postoperative complications following esophageal resection. To do this, we first set U(Y = 0) = 100 and U(Y = 5) = 0, and then asked WH to specify numerical values for the intermediate levels, y = 1, …, 4, that reflect their desirability relative to the best and worst levels. The numerical values that WH chose are:

$$U(Y=0) = 100, \quad U(Y=1) = 80, \quad U(Y=2) = 65, \quad U(Y=3) = 25, \quad U(Y=4) = 10, \quad U(Y=5) = 0.$$
These reflect that POM scores ≤ 2 are substantially more desirable than POM scores ≥ 3. Because a larger mean utility corresponds to better patient outcomes on average, if Ū(π(x, a)) > Ū(π(x, a′)), then treatment a is superior to a′ for patients in subgroup x. Therefore, regardless of whether π(x, a) and π(x, a′) are stochastically ordered, the mean utilities provide an unambiguous criterion for comparing treatments.
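As a worked illustration of the mean utility calculation, the sketch below evaluates it for the anticipated control-arm probabilities reported in Table 1. The utility values (100, 80, 65, 25, 10, 0) used here are an assumption made for illustration; they are consistent with the control-arm mean utilities of 75.5 and 60.0 reported in Table 2. The function name `mean_utility` is hypothetical.

```python
# Utilities for POM scores 0..5. These specific values are an assumption made
# for illustration; they agree with the mean utilities reported in Table 2.
UTILITIES = [100, 80, 65, 25, 10, 0]

def mean_utility(probs, utilities=UTILITIES):
    """Mean utility: sum over levels of U(Y = y) * Prob(Y = y)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(u * p for u, p in zip(utilities, probs))

# Anticipated control-arm POM score probabilities (Table 1).
control_primary = [0.50, 0.20, 0.10, 0.10, 0.05, 0.05]
control_salvage = [0.30, 0.25, 0.10, 0.10, 0.10, 0.15]

mu_primary = mean_utility(control_primary)
mu_salvage = mean_utility(control_salvage)
print(round(mu_primary, 1), round(mu_salvage, 1))  # 75.5 60.0
```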
One important property of the mean utilities is a consequence of the following theorem.
Theorem 1
If π(x, a) stochastically dominates π(x, a′), then Ū(π(x, a)) > Ū(π(x, a′)) for all admissible U(Y) such that U(Y = 0) > U(Y = 1) > ⋯ > U(Y = 4) > U(Y = 5).
Theorem 1 follows from first-order stochastic dominance (Quirk and Saposnik, 1962); nonetheless, we provide a proof in the supplementary materials. Consequently, when the POM score distributions are stochastically ordered—and thus, it is clear which treatment is superior without appealing to the mean utilities—the proposed utility-based comparison is not sensitive to the elicited numerical values in that a different set of admissible values will result in the same conclusion. By contrast, when the POM score distributions are not stochastically ordered, eliciting numerical values is necessary to determine whether one treatment is superior. The proposed utility-based comparison necessarily is sensitive to the elicited values in that a different set of admissible values may result in a different conclusion.
Since the POM score probabilities are unknown, we learn about these using a Bayesian model with unknown parameter θ and model-based mean utilities Ū(π(Sgp, Trt; θ)). Given interim or final data D, our comparative testing criterion is as follows. If

$$\mathrm{Prob}\{\bar{U}(\pi(x, a; \theta)) > \bar{U}(\pi(x, a'; \theta)) \mid D\} > p_{\mathrm{cut}},$$

then we declare treatment a is superior to a′ for patients in subgroup x. We specify pcut to control subgroup specific type I error probabilities. We describe how to do this in Section 4.
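In practice, a posterior probability of this kind is typically estimated from Markov chain Monte Carlo output as the proportion of paired posterior draws in which one mean utility exceeds the other. The sketch below shows that calculation under this assumption; the function names are hypothetical and the draws are toy numbers invented to exercise the code, not trial results.

```python
def prob_superior(utility_draws_a, utility_draws_b):
    """Monte Carlo estimate of Prob{U(a) > U(a') | D} from paired posterior draws."""
    if len(utility_draws_a) != len(utility_draws_b) or not utility_draws_a:
        raise ValueError("need equal, non-zero numbers of draws")
    wins = sum(ua > ub for ua, ub in zip(utility_draws_a, utility_draws_b))
    return wins / len(utility_draws_a)

def declare_superior(utility_draws_a, utility_draws_b, p_cut):
    """Apply the comparative criterion for one subgroup."""
    return prob_superior(utility_draws_a, utility_draws_b) > p_cut

# Toy draws, invented purely to exercise the functions (not trial results).
draws_n = [80.1, 82.4, 79.5, 85.0, 81.2]
draws_c = [75.3, 83.0, 74.9, 78.8, 77.6]
print(prob_superior(draws_n, draws_c))            # 0.8
print(declare_superior(draws_n, draws_c, 0.976))  # False
```

The draws must be paired, that is, both mean utilities computed from the same posterior sample of θ, so that posterior dependence between the two arms is preserved.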
3 Probability Models
During the process of designing the trial, we considered two Bayesian cumulative logistic regression models that both include treatment-subgroup interaction parameters. The first is a PO model, which is a popular regression model for ordinal response variables, see, e.g., McCullagh (1980); Walters et al. (2001); Abreu et al. (2008). The restrictive parametric assumption underlying the PO model often is unrealistic, however. The second is an NPO model that relaxes this assumption at the cost of greater model complexity.
3.1 Proportional Odds Model
Denote logit(q) = log{q/(1 − q)}. The PO model that we considered assumes
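$$\mathrm{logit}\{\mathrm{Prob}(Y \le y \mid X, A)\} = \alpha_y + \beta_1 X + \beta_2 A + \beta_3 X A, \quad y = 0, \ldots, 4, \tag{1}$$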
where α0 ≤ ⋯ ≤ α4, X = −0.5 for primary, X = 0.5 for salvage, A = −0.5 for control and A = 0.5 for Nuprehab. This model is parsimonious in that it accounts for all treatment and subgroup effects using three parameters, β = (β1, β2, β3).
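To make the parameterization concrete, the sketch below converts PO model parameters into the six POM score probabilities for a given subgroup and treatment arm. It assumes the linear predictor α_y + β1 X + β2 A + β3 X A with the ±0.5 coding described above; the function name and the parameter values are hypothetical and chosen only to exercise the code.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def po_probabilities(alpha, beta, x, a):
    """Category probabilities for Y = 0..5 under a PO model.

    Assumes logit Prob(Y <= y | X, A) = alpha[y] + b1*X + b2*A + b3*X*A,
    with X and A coded as -0.5 / 0.5 (an assumed form for illustration).
    """
    if any(lo > hi for lo, hi in zip(alpha, alpha[1:])):
        raise ValueError("alpha must be non-decreasing")
    b1, b2, b3 = beta
    shift = b1 * x + b2 * a + b3 * x * a
    cum = [expit(al + shift) for al in alpha] + [1.0]
    # Successive differences of the cumulative probabilities.
    return [cum[0]] + [hi - lo for lo, hi in zip(cum, cum[1:])]

# Hypothetical parameter values chosen only to exercise the function.
alpha = [0.0, 0.8, 1.4, 2.2, 2.9]
beta = (-0.8, 0.0, 0.0)  # subgroup effect only: no treatment effect or interaction
p_control = po_probabilities(alpha, beta, x=-0.5, a=-0.5)
p_nuprehab = po_probabilities(alpha, beta, x=-0.5, a=0.5)
print([round(p, 3) for p in p_control])
```

With β2 = β3 = 0 the two arms have identical distributions within a subgroup, whereas a positive β2 + β3 x shifts probability toward lower (better) POM scores in subgroup x.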
Let ny(X, A) denote the number of patients with a POM score equal to y in subgroup X and treatment arm A. We assume observations are mutually independent, so that the likelihood function for the unknown model parameters (α, β) is
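$$L(\alpha, \beta \mid D) = \prod_{X \in \{-0.5,\, 0.5\}} \ \prod_{A \in \{-0.5,\, 0.5\}} \ \prod_{y=0}^{5} \pi_y(X, A)^{\,n_y(X, A)}, \tag{2}$$

where

$$\pi_0(X, A) = \mathrm{Prob}(Y \le 0 \mid X, A), \qquad \pi_y(X, A) = \mathrm{Prob}(Y \le y \mid X, A) - \mathrm{Prob}(Y \le y - 1 \mid X, A), \ y = 1, \ldots, 4, \qquad \pi_5(X, A) = 1 - \mathrm{Prob}(Y \le 4 \mid X, A),$$

and Prob(Y ≤ y | X, A), y = 0, …, 4, is defined in (1).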
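We specify the prior distribution for (α, β) such that p0(α, β) = p0(α) × p0(β), where

$$p_0(\alpha) = p_0(\alpha_0) \prod_{y=1}^{4} p_0(\alpha_y \mid \alpha_{y-1}) \quad \text{and} \quad p_0(\beta) = \prod_{j=1}^{3} p_0(\beta_j).$$

The exact prior distributional forms that we assume are

$$\alpha_0 \sim t_5(\tilde{\alpha}_0, 2.5), \quad \alpha_y \mid \alpha_{y-1} \sim t_5(\tilde{\alpha}_y, 2.5)[\alpha_{y-1}, \infty), \ y = 1, \ldots, 4, \quad \beta_1 \sim t_5(\tilde{\beta}_1, 2.5), \quad \beta_2 \sim t_5(0, 2.5), \quad \beta_3 \sim t_5(0, 2.5), \tag{3}$$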
where we write p0(θ)[L, U] to denote that p0(θ) has support on the interval [L, U]. The above prior restricts α0 ≤ ⋯ ≤ α4 so that Prob(Y ≤ y | X, A) ≤ Prob(Y ≤ y + 1 | X, A) for all X ∈ {−0.5, 0.5} and A ∈ {−0.5, 0.5}, which is necessary to ensure that the probability model is admissible. Following the recommendations of Ghosh et al. (2017), we specify t-distributions with a scale of 2.5 and five degrees of freedom. This specification places about 90% of the prior probability mass on the range of values within 5 of the prior mean, while the heavy tails do not preclude more extreme values should the data demand this. Because an effect size of 5 corresponds to a shift from 0.01 to 0.99 on the probability domain between subgroups or treatment arms, the proposed prior specification allows the observed data to dominate posterior inference.
We specify non-zero prior means for α and β1 to reflect the prior information that WH provided about the POM score probabilities of each subgroup in the control arm, which we report in Table 1. We set these prior means based on the prior probabilities that WH provided for the subgroup corresponding to X = x.
Because we set the prior means for β2 and β3 equal to zero, a priori Ū(π(P, N)) = Ū(π(P, C)) and Ū(π(S, N)) = Ū(π(S, C)). Therefore, a posteriori a non-zero mean utility difference between treatment arms in either subgroup will reflect the observed data, and not the prior.
Table 1.
Subgroup | POM 0 | POM 1 | POM 2 | POM 3 | POM 4 | POM 5 |
---|---|---|---|---|---|---|
Primary | 0.50 | 0.20 | 0.10 | 0.10 | 0.05 | 0.05 |
Salvage | 0.30 | 0.25 | 0.10 | 0.10 | 0.10 | 0.15 |
The PO model assumes that the regression coefficients, and thus the log-odds ratios, do not differ with the level of the response variable. This is a strong parametric assumption that often is unrealistic in practice, including the present context. The PO model likely is popular since it facilitates treatment comparison in that a utility function need not be elicited from the clinician(s). To see this, note that for y = 0, …, 4 and X = x,
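$$\mathrm{Prob}(Y \le y \mid x, N) - \mathrm{Prob}(Y \le y \mid x, C)$$

is monotonically increasing in (β2 + β3 x) such that Prob(Y ≤ y | x, N) = Prob(Y ≤ y | x, C) when (β2 + β3 x) = 0. Therefore, for any U(Y = 0) > ⋯ > U(Y = 5),

$$\beta_2 + \beta_3 x > 0 \implies \bar{U}(\pi(x, N)) > \bar{U}(\pi(x, C)),$$

and conversely,

$$\beta_2 + \beta_3 x < 0 \implies \bar{U}(\pi(x, N)) < \bar{U}(\pi(x, C)).$$

Consequently, when posterior inference is based on the PO model defined in (1), the utility function is superfluous. However, when the actual response distributions do not satisfy the PO assumption, e.g., they are not stochastically ordered, the PO model may be misleading.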
3.2 Non-Proportional Odds Model
To relax the assumption required by the PO model, that the log-odds ratios do not differ with the response level, we propose a hierarchical cumulative logistic regression model that assumes
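$$\mathrm{logit}\{\mathrm{Prob}(Y \le y \mid X, A)\} = \alpha_y + \gamma_{1,y} X + \gamma_{2,y} A + \gamma_{3,y} X A, \quad y = 0, \ldots, 4, \tag{4}$$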
where X and A are defined similarly as for the PO model. In contrast with the PO model, the NPO model in (4) allows different log-odds ratios at each response level y. We assume that observations are mutually independent such that the likelihood function for the unknown parameters (α, γ) has the same general form (2) as for the PO model.
We specify a hierarchical prior for (α, γ) as follows,
(5) |
The prior constraints on γ ensure that Prob(Y ≤ y | X, A) ≤ Prob(Y ≤ y + 1 | X, A) for all X ∈ {−0.5, 0.5} and A ∈ {−0.5, 0.5}. These inequalities hold when

$$\alpha_y + \gamma_{1,y} X + \gamma_{2,y} A + \gamma_{3,y} X A \ \le \ \alpha_{y+1} + \gamma_{1,y+1} X + \gamma_{2,y+1} A + \gamma_{3,y+1} X A, \quad y = 0, \ldots, 3,$$

which is reflected in our specification of the prior. The hierarchical structure that we propose in (5) shrinks toward the PO model defined in (1). As σ1 → 0, σ2 → 0 and σ3 → 0, then γ1,y → β1, γ2,y → β2 and γ3,y → β3, for y = 0, …, 4, and thus the log-odds ratios become invariant to the outcome level. Essentially, the NPO model in (4) adds a layer of additional structure to the PO model in (1) that allows each effect to deviate from the PO assumption. By using half-normal distributions with a unit standard deviation for σ1, σ2 and σ3, a priori the proposed NPO model prefers small deviations from the PO model. This type of hierarchical NPO model was alluded to as an alternative for PO models in the discussion of McKinley et al. (2015), but they neither implemented nor fully specified such a model. Because the proposed NPO model allows the log-odds ratios to differ with the response level, the model-based estimates of the response distributions need not be stochastically ordered. Consequently, when posterior inference is based on the NPO model, our proposed utility-based comparative testing criterion facilitates treatment comparison.
3.3 Posterior Estimation
We carry out posterior estimation for the PO and NPO models using JAGS via the R package R2jags (Plummer, 2003). Posterior convergence tends to be immediate and autocorrelation tends to be low, likely due, in part, to our balanced specification of the design matrix. We use the posterior samples from JAGS to calculate the four posterior probabilities required for the utility-based comparative testing criterion given in Section 2. Because the mean utilities are tractable functions of the unknown model parameters for both the PO and NPO models, obtaining posterior samples of the mean utilities is straightforward. We provide freely-available, user-friendly R software for implementation, see Supplementary Materials.
4 Design Considerations
Although we use a Bayesian probability model for statistical inference, we design the trial to ensure certain desirable frequentist operating characteristics (OCs), e.g., 0.80 power under the targeted alternative with 0.05 probability of making a type I error. We are concerned with the subgroup specific power and type I error probability. That is, when Nuprehab reduces the number and severity of postoperative complications for a particular subgroup by the targeted amount, we want our design to have 0.80 probability of correctly declaring N superior to C for patients in that subgroup. By contrast, when Nuprehab does not reduce the number and severity of postoperative complications for a particular subgroup, we want our design only to have 0.05 probability of incorrectly declaring N superior or inferior to C for patients in that subgroup.
We met with WH to determine a practical targeted difference between the treatment arms for primary and salvage patients. During our discussion, WH expressed his desire for the trial to be powered to detect a 75% reduction in POM scores ≥ 3 in each subgroup. We then derived a targeted mean utility difference corresponding to this 75% reduction in POM scores ≥ 3 in each subgroup as follows. For the anticipated POM score probabilities in the control arm for each subgroup, a 75% reduction corresponds to shifts in the probability of POM scores ≥ 3 from 0.20 to 0.05 for primary patients, and from 0.35 to 0.09 for salvage patients. Under the proportional odds assumption, a 75% reduction in POM scores ≥ 3 corresponds to the cumulative POM score probabilities and mean utilities in Table 2. We designed the trial to target mean utility differences of 17.3 = 92.8 − 75.5 in primary patients and 27.6 = 87.6 − 60.0 in salvage patients.
Table 2.
Subgroup | Treatment Arm | POM ≤ 0 | POM ≤ 1 | POM ≤ 2 | POM ≤ 3 | POM ≤ 4 | Mean Utility |
---|---|---|---|---|---|---|---|
Primary | Nutritional Prehabilitation | 0.83 | 0.92 | 0.95 | 0.98 | 0.99 | 92.8 |
Primary | Control | 0.50 | 0.70 | 0.80 | 0.90 | 0.95 | 75.5 |
Salvage | Nutritional Prehabilitation | 0.71 | 0.87 | 0.91 | 0.94 | 0.97 | 87.6 |
Salvage | Control | 0.30 | 0.55 | 0.65 | 0.75 | 0.85 | 60.0 |
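The entries of Table 2 can be approximately reproduced with a short calculation: find the common log-odds ratio that yields a 75% reduction in the probability of POM scores ≥ 3, then apply that shift to every cumulative logit of the control arm. The sketch below does this under two assumptions that should be read as such: the utility values (100, 80, 65, 25, 10, 0), which are consistent with the reported mean utilities, and unrounded intermediate probabilities. Function names are hypothetical.

```python
import math

UTILITIES = [100, 80, 65, 25, 10, 0]  # assumed values, consistent with Table 2

def logit(p):
    return math.log(p / (1.0 - p))

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def po_alternative(control_cum, reduction=0.75, severe_from=3):
    """Shift control cumulative probabilities by a common log-odds ratio.

    The log-odds ratio is chosen so that Prob(Y >= severe_from) falls by
    `reduction`; the same shift is applied at every level (PO assumption).
    Returns (treated cumulative probabilities, log-odds ratio).
    """
    p_ok_control = control_cum[severe_from - 1]  # Prob(Y <= severe_from - 1)
    p_ok_treated = 1.0 - (1.0 - reduction) * (1.0 - p_ok_control)
    delta = logit(p_ok_treated) - logit(p_ok_control)
    return [expit(logit(c) + delta) for c in control_cum], delta

def mean_utility_from_cum(cum, utilities=UTILITIES):
    full = list(cum) + [1.0]
    probs = [full[0]] + [hi - lo for lo, hi in zip(full, full[1:])]
    return sum(u * p for u, p in zip(utilities, probs))

control = {"Primary": [0.50, 0.70, 0.80, 0.90, 0.95],
           "Salvage": [0.30, 0.55, 0.65, 0.75, 0.85]}
results = {}
for subgroup, cum_c in control.items():
    cum_n, delta = po_alternative(cum_c)
    results[subgroup] = (mean_utility_from_cum(cum_n), mean_utility_from_cum(cum_c))
    print(subgroup, [round(c, 3) for c in cum_n],
          round(results[subgroup][0], 1), round(results[subgroup][1], 1))
```

Under these assumptions the mean utilities come out at about 92.8 and 75.5 for primary patients and 87.6 and 60.0 for salvage patients, matching the table to one decimal place.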
Whitehead (1993) derived a sample size formula for a traditional fixed-sample design with ordinal outcomes based on a PO model, which is

$$N = \frac{12\,\{\Phi^{-1}(1 - \alpha/2) + \Phi^{-1}(1 - \beta)\}^2}{\delta^2 \left(1 - \sum_{y=0}^{5} \bar{\pi}_y^3\right)},$$

where N is the total sample size under equal allocation, Φ(x) is the standard normal cumulative distribution function evaluated at x, α and β are the desired type I and II error probabilities, δ is the targeted log-odds ratio, and π̄y is the average of the probabilities of outcome level y in the two treatment arms under the targeted alternative. Although we use our utility-based comparative testing criterion proposed in Section 2, we demonstrated earlier that for the PO model this criterion is equivalent to a particular contrast of the regression coefficients that is also the basis for Whitehead’s sample size formula. Viewing each subgroup as a separate trial, under their respective targeted alternatives, we need to enroll 56 primary and 38 salvage patients for α = 0.05 and β = 0.20. Assuming 60% of the enrollees will be primary patients, we need to enroll 94 patients to achieve these subgroup sample sizes. Because the PO model borrows strength across subgroups for estimating the intercept parameters, α0, …, α4, we expect the above sample size calculation to be conservative. When the PO assumption holds we expect our NPO model to have less power than our PO model, though only slightly less as we specify an informative half-standard normal distribution as the prior for σ1, σ2 and σ3 in (5).
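The subgroup sample sizes can be checked with a short calculation. The sketch below uses the standard form of Whitehead's formula for 1:1 allocation, N = 12 (z_{1−α/2} + z_{1−β})² / {δ² (1 − Σ π̄_y³)}, together with two assumptions of ours: the targeted alternative is built from unrounded PO-shifted probabilities, and the total is rounded up to a whole number of patients per arm. Function names are hypothetical.

```python
import math
from statistics import NormalDist

def logit(p):
    return math.log(p / (1.0 - p))

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def cum_to_probs(cum):
    full = list(cum) + [1.0]
    return [full[0]] + [hi - lo for lo, hi in zip(full, full[1:])]

def target_delta(control_cum, reduction=0.75):
    """Log-odds ratio giving the stated reduction in Prob(Y >= 3)."""
    p_ok = control_cum[2]  # Prob(Y <= 2) in the control arm
    return logit(1.0 - (1.0 - reduction) * (1.0 - p_ok)) - logit(p_ok)

def whitehead_total_n(control_cum, delta, alpha=0.05, beta=0.20):
    """Total sample size for 1:1 allocation under a PO alternative."""
    treated_cum = [expit(logit(c) + delta) for c in control_cum]
    p_c, p_t = cum_to_probs(control_cum), cum_to_probs(treated_cum)
    p_bar = [(c + t) / 2.0 for c, t in zip(p_c, p_t)]
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(1.0 - beta)
    n_exact = 12.0 * z ** 2 / (delta ** 2 * (1.0 - sum(p ** 3 for p in p_bar)))
    # Assumed rounding rule: round up to a whole number of patients per arm.
    return 2 * math.ceil(n_exact / 2.0)

primary = [0.50, 0.70, 0.80, 0.90, 0.95]
salvage = [0.30, 0.55, 0.65, 0.75, 0.85]
n_primary = whitehead_total_n(primary, target_delta(primary))
n_salvage = whitehead_total_n(salvage, target_delta(salvage))
print(n_primary, n_salvage)  # 56 38
```

The unrounded values are roughly 55.0 and 36.7, so the reported 56 and 38 are recovered only under the per-arm rounding assumed here.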
In the trial, patients are assigned to the two treatment arms using stratified block randomization with blocks of size four. Thus, for each block of four patients within each subgroup, two patients will receive Nuprehab and two will receive control. This will ensure that the treatment arms will have similar numbers of patients from each subgroup throughout the trial. Given the modest sample size requirements, one interim analysis will be done half-way through the trial. At this point, using our utility-based comparative testing criterion, the design will decide whether to continue enrolling patients from each subgroup. If one treatment is declared superior to the other for a certain subgroup at the interim analysis, then no additional patients from that subgroup will be enrolled. Otherwise, enrollment of patients from that subgroup will continue until the final analysis.
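Stratified block randomization of this kind can be sketched as follows: each subgroup keeps its own permuted block of two Nuprehab and two control assignments, and a new block is drawn when the current one is exhausted. This is an illustrative sketch, not the trial's randomization software; the function name and seeds are hypothetical.

```python
import random

def stratified_block_randomization(subgroups, block_size=4, seed=None):
    """Assign arms with permuted blocks kept separately for each subgroup."""
    rng = random.Random(seed)
    pending = {}      # subgroup -> assignments remaining in its current block
    assignments = []
    for sg in subgroups:
        if not pending.get(sg):
            block = ["N"] * (block_size // 2) + ["C"] * (block_size // 2)
            rng.shuffle(block)
            pending[sg] = block
        assignments.append(pending[sg].pop())
    return assignments

# Hypothetical enrollment order: each patient is salvage with probability 0.4.
order_rng = random.Random(1)
order = ["S" if order_rng.random() < 0.4 else "P" for _ in range(20)]
arms = stratified_block_randomization(order, seed=2)
for sg in ("P", "S"):
    in_sg = [arm for s, arm in zip(order, arms) if s == sg]
    print(sg, in_sg.count("N"), in_sg.count("C"))
```

Within any subgroup the two arm counts can differ by at most two at any point, which is what keeps the arms balanced at the interim analysis.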
To control the probability of committing a type I error, we use a maximum duration alpha-spending approach such that f(t) = α × (t/Tmax)³, where Tmax denotes the maximum trial duration, see Jennison and Turnbull (1999, Section 7.2.3). To do this, we set pcut = Φ(z) where z corresponds to the relevant threshold for the test statistic in a frequentist group sequential analysis, which we calculate using the R package gsDesign. With one interim analysis at the mid-point of the trial, this gives probability thresholds at the interim and final analyses of 0.997 and 0.976, respectively. Due to the asymptotic normality of the posterior distribution in general, see, e.g., Gelman et al. (2014, Section 4), these thresholds control the type I error asymptotically. However, we use computer simulation to verify that our design controls type I error for the planned sample size. To be conservative, we aim to enroll up to 100 patients. Given the anticipated accrual rate of 2 patients per month, the interim and final analyses are expected to be performed at 26 and 51 months, respectively.
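The two probability thresholds can be approximated without specialized software. The sketch below is a hand-rolled calculation, not the gsDesign computation itself, and rests on assumptions of ours: a two-sided symmetric test, an information fraction of 0.5 at the interim analysis, and a bivariate normal null distribution for the interim and final test statistics with correlation √0.5. Function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def continue_probability(c1, c2, info_frac, steps=2000):
    """Prob(|Z1| < c1 and |Z2| < c2) under the null hypothesis.

    Z1 and Z2 are standard normal with correlation sqrt(info_frac); the
    integral over Z1 is evaluated with Simpson's rule (steps must be even).
    """
    rho = math.sqrt(info_frac)
    sd = math.sqrt(1.0 - rho ** 2)
    h = 2.0 * c1 / steps
    total = 0.0
    for i in range(steps + 1):
        z1 = -c1 + i * h
        inner = (STD_NORMAL.cdf((c2 - rho * z1) / sd)
                 - STD_NORMAL.cdf((-c2 - rho * z1) / sd))
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * STD_NORMAL.pdf(z1) * inner
    return total * h / 3.0

def two_stage_thresholds(alpha=0.05, info_frac=0.5, power=3):
    """Posterior probability cutoffs (interim, final) from f(t) = alpha * t**power."""
    spent_interim = alpha * info_frac ** power
    c1 = STD_NORMAL.inv_cdf(1.0 - spent_interim / 2.0)
    lo, hi = 0.0, 6.0
    for _ in range(60):  # bisection for the final critical value
        mid = (lo + hi) / 2.0
        if 1.0 - continue_probability(c1, mid, info_frac) > alpha:
            lo = mid
        else:
            hi = mid
    c2 = (lo + hi) / 2.0
    return STD_NORMAL.cdf(c1), STD_NORMAL.cdf(c2)

p_interim, p_final = two_stage_thresholds()
print(round(p_interim, 3), round(p_final, 3))
```

The interim cutoff is available in closed form: the spending function gives 0.05 × 0.5³ = 0.00625, so the cutoff is 1 − 0.00625/2 ≈ 0.997. The final cutoff requires the numerical search and falls slightly above the unadjusted 0.975.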
5 Simulation Study
In this section, we describe a simulation study that we carried out to evaluate and compare the frequentist OCs of the proposed design, under each of the PO and NPO models defined in Section 3. For further comparison, we considered a one-size-fits-all design based on a PO model that assumes

$$\mathrm{logit}\{\mathrm{Prob}(Y \le y \mid X, A)\} = \alpha_y + \beta_1 X + \beta_2 A, \quad y = 0, \ldots, 4.$$

We specify the same prior distributions for α, β1 and β2 as for the PO model defined in (3). For this design, using the same two-stage group sequential structure that controls the probability of committing a type I error at 0.05, if Prob(β2 > 0 | D) > pcut, then we declare N superior to C for patients in both subgroups. Conversely, if Prob(β2 < 0 | D) > pcut, then we declare N inferior to C for patients in both subgroups. We compare the designs based on their probabilities of declaring N superior (or inferior) to C across a range of scenarios. To assess the decision criteria, we used 10,000 posterior samples following 500 warm-up samples. For all three models, posterior sampling took about 4 seconds for an interim analysis with 50 observations, and about 7 seconds for a final analysis with up to 100 observations.
Table 3 reports the true POM score distributions for each scenario. We used the same POM score distribution to generate observations for the control arm in each subgroup for every scenario, i.e. P0 for primary patients and S0 for salvage patients. Each patient had a 60% chance of belonging to the primary subgroup throughout. Scenario 1 is the complete null case where N provides no benefit to patients in either subgroup. Scenarios 2 and 3 are treatment-subgroup interaction cases where N provides the targeted benefit to patients in one subgroup, and no benefit to patients in the other subgroup. Scenario 4 is the complete alternative case where N provides the targeted benefit to patients in both subgroups. The PO assumption holds for Scenarios 1–4. Scenarios 5 and 6 are treatment-subgroup interaction cases, and Scenario 7 is a complete alternative case, but the PO assumption does not hold. In particular, the log-odds ratios comparing treatment arms corresponding to POM scores ≤ 0 and ≤ 1 are smaller than those corresponding to POM scores ≤ 2, ≤ 3 and ≤ 4, but the targeted 75% reduction in POM scores ≥ 3 is still achieved in each subgroup. This reflects a benefit that greatly reduces severe postoperative complications, but affects the rate of minor postoperative complications to a lesser degree.
Table 3.
True POM Score Distribution | POM ≤ 0 | POM ≤ 1 | POM ≤ 2 | POM ≤ 3 | POM ≤ 4 | Mean Utility |
---|---|---|---|---|---|---|
P0 | 0.50 | 0.70 | 0.80 | 0.90 | 0.95 | 75.5 |
P1 | 0.83 | 0.92 | 0.95 | 0.98 | 0.99 | 92.8 |
P2 | 0.67 | 0.85 | 0.95 | 0.98 | 0.99 | 88.8 |
S0 | 0.30 | 0.55 | 0.65 | 0.75 | 0.85 | 60.0 |
S1 | 0.71 | 0.87 | 0.91 | 0.94 | 0.97 | 87.6 |
S2 | 0.53 | 0.80 | 0.91 | 0.94 | 0.97 | 82.8 |

Scenario | Distribution (P,C) | Distribution (P,N) | Distribution (S,C) | Distribution (S,N) | Mean Utility Difference (P) | Mean Utility Difference (S) |
---|---|---|---|---|---|---|
1 | P0 | P0 | S0 | S0 | 0.0 | 0.0 |
2 | P0 | P1 | S0 | S0 | 17.3 | 0.0 |
3 | P0 | P0 | S0 | S1 | 0.0 | 27.6 |
4 | P0 | P1 | S0 | S1 | 17.3 | 27.6 |
5 | P0 | P2 | S0 | S0 | 13.3 | 0.0 |
6 | P0 | P0 | S0 | S2 | 0.0 | 22.8 |
7 | P0 | P2 | S0 | S2 | 13.3 | 22.8 |
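Generating one simulated trial from these scenarios amounts to drawing a subgroup, an arm and then a POM score from the corresponding cumulative distribution. The sketch below illustrates this for Scenario 2 using values copied from Table 3. It is not the authors' simulation code: arms are assigned by simple rather than stratified block randomization to keep the example short, and all names are hypothetical.

```python
import random

CUMULATIVE = {  # Prob(Y <= y) for y = 0..4, copied from Table 3
    "P0": [0.50, 0.70, 0.80, 0.90, 0.95],
    "P1": [0.83, 0.92, 0.95, 0.98, 0.99],
    "S0": [0.30, 0.55, 0.65, 0.75, 0.85],
    "S1": [0.71, 0.87, 0.91, 0.94, 0.97],
}
SCENARIO_2 = {("P", "C"): "P0", ("P", "N"): "P1",
              ("S", "C"): "S0", ("S", "N"): "S0"}

def draw_pom_score(cum, rng):
    """Inverse-CDF draw of a POM score in 0..5."""
    u = rng.random()
    for y, c in enumerate(cum):
        if u < c:
            return y
    return 5

def simulate_patients(scenario, n, prob_primary=0.6, seed=None):
    """Return a list of (subgroup, arm, POM score) tuples."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        sg = "P" if rng.random() < prob_primary else "S"
        arm = rng.choice(["N", "C"])  # simple randomization, for brevity only
        y = draw_pom_score(CUMULATIVE[scenario[(sg, arm)]], rng)
        patients.append((sg, arm, y))
    return patients

data = simulate_patients(SCENARIO_2, n=100, seed=42)
print(data[:5])
```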
Table 4 reports the proportion of trials in which N was declared superior (inferior) to C, and the average sample size. In Scenario 1, i.e. the complete null case, all three designs control the within subgroup probability of making a type I error near 0.05. In Scenario 4, i.e. the complete alternative case, the two stratified designs have greater than 0.80 power of declaring N superior to C in each subgroup. Excepting Scenario 4, the three competing designs are unlikely to stop early, which is reflected by the average sample sizes near 100. Compared to the stratified design based on the PO model, when the PO assumption holds, e.g., Scenarios 2, 3 and 4, the stratified design based on the more flexible NPO model has similar power for declaring N superior to C in the benefiting subgroup(s), and when the PO assumption does not hold, e.g., Scenarios 5, 6 and 7, the NPO model has larger power of declaring N superior to C in the benefiting subgroup(s). The power for each design is lower when the PO assumption does not hold, which is not surprising as the mean utility differences are smaller in these cases. Compared to the traditional design, when both subgroups benefit, e.g., Scenarios 4 and 7, the stratified designs have less power for declaring N superior to C in both subgroups, but when only one subgroup benefits, e.g., Scenarios 2, 3, 5 and 6, the stratified designs have greater power of declaring N superior to C in the benefiting subgroup, and are far less likely to incorrectly declare N superior to C in the non-benefiting subgroup.
Table 4.
Scenario | Design | Primary Sup. | Primary Inf. | Salvage Sup. | Salvage Inf. | Average Sample Size |
---|---|---|---|---|---|---|
1 (Complete Null) | Traditional | 0.022 | 0.025 | 0.022 | 0.025 | 99.7 |
1 (Complete Null) | Stratified PO | 0.022 | 0.025 | 0.020 | 0.025 | 99.8 |
1 (Complete Null) | Stratified NPO | 0.029 | 0.028 | 0.029 | 0.032 | 99.5 |
2 (Partial Null - PO) | Traditional | 0.509 | 0.000 | 0.509 | 0.000 | 95.5 |
2 (Partial Null - PO) | Stratified PO | 0.774 | 0.000 | 0.038 | 0.017 | 94.9 |
2 (Partial Null - PO) | Stratified NPO | 0.763 | 0.000 | 0.048 | 0.022 | 93.7 |
3 (Partial Null - PO) | Traditional | 0.417 | 0.000 | 0.417 | 0.000 | 96.6 |
3 (Partial Null - PO) | Stratified PO | 0.040 | 0.015 | 0.784 | 0.000 | 96.5 |
3 (Partial Null - PO) | Stratified NPO | 0.047 | 0.016 | 0.782 | 0.000 | 95.4 |
4 (Complete Alternative - PO) | Traditional | 0.978 | 0.000 | 0.978 | 0.000 | 74.7 |
4 (Complete Alternative - PO) | Stratified PO | 0.841 | 0.000 | 0.850 | 0.000 | 87.4 |
4 (Complete Alternative - PO) | Stratified NPO | 0.842 | 0.000 | 0.853 | 0.000 | 84.6 |
5 (Partial Null - NPO) | Traditional | 0.215 | 0.001 | 0.215 | 0.001 | 98.5 |
5 (Partial Null - NPO) | Stratified PO | 0.314 | 0.001 | 0.032 | 0.019 | 98.9 |
5 (Partial Null - NPO) | Stratified NPO | 0.347 | 0.001 | 0.043 | 0.025 | 98.0 |
6 (Partial Null - NPO) | Traditional | 0.239 | 0.001 | 0.239 | 0.001 | 98.3 |
6 (Partial Null - NPO) | Stratified PO | 0.036 | 0.019 | 0.462 | 0.000 | 98.7 |
6 (Partial Null - NPO) | Stratified NPO | 0.042 | 0.021 | 0.503 | 0.000 | 97.7 |
7 (Complete Alternative - NPO) | Traditional | 0.710 | 0.000 | 0.710 | 0.000 | 92.1 |
7 (Complete Alternative - NPO) | Stratified PO | 0.387 | 0.000 | 0.528 | 0.000 | 96.6 |
7 (Complete Alternative - NPO) | Stratified NPO | 0.434 | 0.000 | 0.584 | 0.000 | 94.8 |
6 Discussion
In this paper, we have proposed a design for comparing treatments in two prognostic subgroups based on ordinal outcomes. The design was motivated by a trial comparing the effectiveness of nutritional prehabilitation (Nuprehab) against the standard of care for improving postoperative outcomes in primary and salvage patients who undergo esophageal resection. We considered two Bayesian cumulative logistic regression models for statistical inference, a proportional odds (PO) model and a hierarchical non-proportional odds (NPO) model that shrinks toward the PO model. Based on the results of our simulation study, we determined that the design based on the NPO model has preferable frequentist operating characteristics. In particular, when the PO assumption is satisfied the NPO model has similar probability of recommending Nuprehab in the benefiting subgroup(s), whereas when the PO assumption is not satisfied the NPO model can have substantially higher probability of recommending Nuprehab in the benefiting subgroup(s).
We also compared our stratified medicine design with a more traditional design that does not facilitate subgroup specific recommendations. If both subgroups benefit from Nuprehab, then the traditional design is more likely to recommend its use in each subgroup. However, when only one of the two subgroups benefits from Nuprehab, the traditional design is less likely to recommend its use in the benefiting subgroup and more likely to recommend its use in the non-benefiting subgroup. Because substantial treatment effect heterogeneity between primary and salvage patients is plausible in the motivating trial, we find the stratified medicine design based on the NPO model appealing. In another context, if this design is not appealing, then selecting between a model with and a model without an interaction term may provide an avenue for achieving a design with more appealing operating characteristics. Another alternative is to consider a prior for the interaction parameter(s) that facilitates borrowing substantial strength for the treatment effect across subgroups, perhaps in a data-dependent manner. These are avenues for future research.
7 Supplementary Materials
Web Appendices containing a proof of Theorem 1, referenced in Section 2, and the R software referenced in Section 3 for implementing the probability models and reproducing the simulation study are available with this paper at the Biometrics website on Wiley Online Library.
References
- Abreu MNS, Siqueira AL, Cardoso CS, Caiaffa WT. Ordinal logistic regression models: Application in quality of life studies. Cadernos de Saúde Pública. 2008;24:s581–s591. doi: 10.1590/s0102-311x2008001600010.
- Agresti A. Analysis of Ordinal Categorical Data. Wiley; 2010.
- Bender R, Grouven U. Using binary logistic regression models for ordinal data with non-proportional odds. Journal of Clinical Epidemiology. 1998;51:809–816. doi: 10.1016/s0895-4356(98)00066-3.
- Braga M, Gianotti L, Nespoli L, Radaelli G, Di Carlo V. Nutritional approach in malnourished surgical patients: A prospective randomized study. Archives of Surgery. 2002;137:174–80. doi: 10.1001/archsurg.137.2.174.
- Brannath W, Zuber E, Branson M, Bretz F, Gallo P, Posch M, Racine-Poon A. Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology. Statistics in Medicine. 2009;28:1445–1463. doi: 10.1002/sim.3559.
- Chen KN. Managing complications I: Leaks, strictures, emptying, reflux, chylothorax. Journal of Thoracic Disease. 2014;6:S355–63. doi: 10.3978/j.issn.2072-1439.2014.03.36.
- Clavien PA, Barkun J, de Oliveira ML, Vauthey JN, Dindo D, Schulick RD, de Santibañes E, Pekolj J, Slankamenac K, Bassi C, Graf R, Vonlanthen R, Padbury R, Cameron JL, Makuuchi M. The Clavien-Dindo classification of surgical complications. Annals of Surgery. 2009;250:187–196. doi: 10.1097/SLA.0b013e3181b13ca2.
- Clavien PA, Sanabria JR, Strasberg SM. Proposed classification of complications of surgery with examples of utility in cholecystectomy. Surgery. 1992;111:518–26.
- Dindo D, Demartines N, Clavien PA. Classification of surgical complications: A new proposal with evaluation in a cohort of 6336 patients and results of a survey. Annals of Surgery. 2004;240:205–13. doi: 10.1097/01.sla.0000133083.54934.ae.
- Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd ed. Chapman & Hall/CRC Press; 2014.
- Ghosh J, Li Y, Mitra R. On the use of Cauchy prior distributions for Bayesian logistic regression. Bayesian Analysis. 2017.
- Guo B, Yuan Y. Bayesian phase I/II biomarker-based dose finding for precision medicine with molecularly targeted agents. Journal of the American Statistical Association. 2017;0:0–0. doi: 10.1080/01621459.2016.1228534.
- Houede N, Thall PF, Nguyen H, Paoletti X, Kramar A. Utility-based optimization of combination therapy using ordinal toxicity and efficacy in phase I/II trials. Biometrics. 2010;66:532–40. doi: 10.1111/j.1541-0420.2009.01302.x.
- Ishwaran H. Univariate and multirater ordinal cumulative link regression with covariate specific cutpoints. Canadian Journal of Statistics. 2000;28:715–730.
- Jennison C, Turnbull BW. Group Sequential Methods with Applications to Clinical Trials. Chapman and Hall/CRC; 1999.
- McCullagh P. Regression models for ordinal data. Journal of the Royal Statistical Society. Series B (Methodological). 1980;42:109–142.
- McKinley TJ, Morters M, Wood JLN. Bayesian model choice in cumulative link ordinal regression models. Bayesian Analysis. 2015;10:1–30.
- Murray TA, Thall PF, Yuan Y. Utility-based designs for randomized comparative trials with categorical outcomes. Statistics in Medicine. 2016;35:4285–4305. doi: 10.1002/sim.6989.
- Parekh K, Iannettoni MD. Complications of esophageal resection and reconstruction. Seminars in Thoracic and Cardiovascular Surgery. 2007;19:79–88. doi: 10.1053/j.semtcvs.2006.11.002.
- Peterson B, Harrell FE. Partial proportional odds models for ordinal response variables. Journal of the Royal Statistical Society. Series C. 1990;39:205–17.
- Plummer M. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. 2003.
- Quirk JP, Saposnik R. Admissibility and measurable utility functions. The Review of Economic Studies. 1962;29:140–146.
- Rosenblum M, Luber B, Thompson RE, Hanley D. Group sequential designs with prospectively planned rules for subpopulation enrichment. Statistics in Medicine. 2016;35:3776–3791. doi: 10.1002/sim.6957.
- Thall PF, Nguyen HQ. Adaptive randomization to improve utility-based dose-finding with bivariate ordinal outcomes. Journal of Biopharmaceutical Statistics. 2012;22:785–801. doi: 10.1080/10543406.2012.676586.
- Waitzberg DL, Saito H, Plank LD, Jamieson GG, Jagannath P, Hwang TL, Mijares JM, Bihari D. Postsurgical infections are reduced with specialized nutrition support. World Journal of Surgery. 2006;30:1592–1604. doi: 10.1007/s00268-005-0657-x.
- Walters S, Campbell M, Lall R. Design and analysis of trials with quality of life as an outcome: A practical guide. Journal of Biopharmaceutical Statistics. 2001;11:155–176. doi: 10.1081/BIP-100107655.
- Wang SJ, James Hung HM, O’Neill RT. Adaptive patient enrichment designs in therapeutic trials. Biometrical Journal. 2009;51:358–374. doi: 10.1002/bimj.200900003.
- Whitehead J. Sample size calculations for ordered categorical data. Statistics in Medicine. 1993;12:2257–2271. doi: 10.1002/sim.4780122404.