Multiplicity-Adjusted Semiparametric Benefiting Subgroup Identification in Clinical Trials

Patrick M Schnell; Peter Müller; Qi Tang; Bradley P Carlin

doi:10.1177/1740774517729167

. Author manuscript; available in PMC: 2022 Jan 28.

Published in final edited form as: Clin Trials. 2017 Oct 16;15(1):75–86. doi: 10.1177/1740774517729167

Multiplicity-Adjusted Semiparametric Benefiting Subgroup Identification in Clinical Trials

Patrick M Schnell ¹, Peter Müller ², Qi Tang ^3,⁴, Bradley P Carlin ⁵

PMCID: PMC8796253 NIHMSID: NIHMS1771907 PMID: 29035083

Abstract

Background:

A recent focus in the health sciences has been the development of personalized medicine, which includes determining the population for which a given treatment is effective. Due to limited data, identifying the true benefiting population is a challenging task. To tackle this difficulty, the credible subgroups approach provides a pair of bounding subgroups for the true benefiting subgroup, constructed so that one is contained by the benefiting subgroup while the other contains the benefiting subgroup with high probability. However, the method has so far only been developed for parametric linear models.

Methods:

In this paper we develop the details required to follow the credible subgroups approach in more realistic settings by considering nonlinear and semiparametric regression models, supported for regulatory science by conditional power simulations. We also present an improved multiple testing approach using a step-down procedure. We evaluate our approach via simulations and apply it to data from four trials of Alzheimer’s disease treatments carried out by AbbVie.

Results:

Semiparametric modeling yields credible subgroups that are more robust to violations of linear treatment effect assumptions, and careful choice of the population of interest as well as the step-down multiple testing procedure result in a higher rate of detection of benefiting types of patients. The approach allows us to identify types of patients that benefit from treatment in the Alzheimer’s disease trials.

Conclusion:

Attempts to identify benefiting subgroups of patients in clinical trials are often met with skepticism due to a lack of multiplicity control and unrealistically restrictive assumptions. Our proposed approach merges two techniques, credible subgroups and semiparametric regression, which avoids these problems and makes benefiting subgroup identification practical and reliable.

Keywords: Bayesian inference, clinical trials, multiple testing, personalized medicine, semiparametric regression, subgroup identification

Introduction

A recent focus in the health sciences has been the development of personalized medicine, which seeks to incorporate observable patient characteristics into decisions about prevention and treatment. The central statistical objectives relating to personalized medicine involve inferences about personalized treatment effects (PTEs), which are the counterpart to the traditional average treatment effect (ATE) conditioned on available patient covariates, and usually have one of two goals: to determine which treatment is most effective for a given patient, or to determine the population for which a given treatment is effective. In this paper we focus primarily on the latter, which is of interest to, e.g., regulatory agencies.

The approach we take toward identifying the population for which a given treatment is effective is to estimate the PTE at each predictive covariate point (informally, for each patient), then at each covariate point test the hypothesis of a null PTE. In this context a predictive covariate is a baseline covariate that interacts with the treatment indicator in a regression model. This problem necessitates attention to the multiplicity of hypothesis tests across the predictive covariate space, unlike the problem of identifying the best treatment for a given patient. Additionally, shifting the inference focus from an overall effect to PTE requires more flexible inference models that allow reliable inference of patient-specific effects. Several nonparametric and semiparametric regression approaches for PTEs have been proposed, including random forests¹ in the virtual twins approach², Bayesian additive regression trees (BART)³ in modeling for causal inference⁴, and a hybrid approach defining treatment and baseline models using tree-based methods⁵.

Credible subgroups⁶ provide a framework for deriving inferences about the population benefiting from treatment using the output of linear regression models of the PTE. These inferences take the form of two bounding subgroups for the benefiting subgroup, one which contains it and one which is contained by it, an approach that naturally accounts for the multiplicity of PTE tests. In the present paper, we attempt to unify this inferential framework and current nonparametric and semiparametric regression practice, with special attention to penalized splines and Bayesian additive regression trees. We apply our method to a data set collected from a sequence of four clinical trials for Alzheimer’s disease treatments, with four baseline covariates potentially influencing the treatment effect.

Benefiting Subgroup Identification

Benefiting subgroup identification is primarily concerned with a personalized treatment effect (PTE), i.e., the treatment effect for a patient given their observable baseline characteristics. In many cases, the PTE is a difference in conditional expectation:

Δ (x) = E [Y ∣ x, t = 1] - E [Y ∣ x, t = 0],

(1)

where Y is the response of interest, x is a vector of covariates, and t is a treatment indicator. Given a definition of PTE and a covariate space C, the goal of benefiting subgroup identification is to estimate the set of covariate points for which it is positive: B = {x ∈ C : Δ(x) > 0}.

Credible Subgroups

A common default estimator for B is ${\hat{B}}_{α} = {x \in C : P [Δ (x) > 0 ∣ y] > 1 - α}$ in the Bayesian interpretation, i.e., the points with a high posterior probability of having a positive PTE given observed data y. In order to control for the multiplicity inherent in testing the PTE at every covariate point, and to provide a two-sided bound, we prefer the credible subgroup pair⁶ (D, S) which is constructed so that P[D ⊆ B ⊆ S|y] > 1 − α. Thus it is likely (with 1 − α posterior probability) that the exclusive credible subgroup D contains only patients who benefit, and the inclusive credible subgroup S contains all patients who do. Such a credible subgroup pair partitions the covariate space into three regions, as shown in Figure 1. We can then conclude that for covariate points in D, there is evidence of benefit, and for patients in the complement S^C of S, there is evidence of no benefit. The remainder of the space, S \ D, requires more information.

Figure 1. — Interpretation of the credible subgroup pair (D, S) relative to the true benefiting subgroup B (enclosed by dashed line).

The general strategy of such an approach is to perform a regression of the PTE onto the given covariates, construct simultaneous credible bands around the regression surface, and take as D and S the covariate points at which the lower and upper bounds were greater than zero, respectively. Because the credible bands are simultaneous, there is a controlled probability that at least one covariate point has a corresponding lower bound mistakenly greater than zero (i.e., is erroneously in D) or upper bound mistakenly less than zero (i.e., is erroneously placed in S^C). While the theory was initially developed for normal linear models, credible subgroups can be computed for any model given only a sample from the posterior distribution of the Δ(x) at each x ∈ C.

In cases for which the joint posterior of Δ(x) approaches the frequentist Gaussian distribution of the maximum likelihood estimator⁷, the asymptotic band

\hat{Δ} (x) \pm W_{α, C}^{*} \sqrt{Var [Δ (x)]},

(2)

where $W_{α, C}^{*}$ is the 1 − α quantile of the distribution of

W_{C} = \sup_{x \in C} \frac{| Δ (x) - \hat{Δ} (x) |}{\sqrt{Var [Δ (x)]}},

(3)

and $\hat{Δ} (x)$ is the posterior mean of Δ(x), is a 1 − α simultaneous credible band for Δ(x) over C. These bands also correspond to frequentist simultaneous credible bands, and therefore may be used to derive frequentist tests. In cases where the joint posterior of Δ(x) is not approximately normal, a quantile-based simultaneous credible band (presented in the Appendix) may be used instead.

Semiparametric Estimation of Personalized Treatment Effects

The first step in constructing credible subgroups is to perform a regression of the personalized treatment effect onto relevant covariates. Here we define the PTE as in (1), the difference in conditional expected response between the treatment and control arms. Previously, a linear model of the form

E [Y ∣ x, t] = x^{'} β + t x^{'} γ,

(4)

has been used⁶ so that Δ(x) = x′γ. We generalize to a semiparametric model based on the additive penalized spline model with factor-by-curve interactions⁸:

E [Y ∣ x, t] = β_{0} + \sum_{j = 1}^{p} f_{j} (x_{j}) + t [γ_{0} + \sum_{j = 1}^{p} g_{j} (x_{j})],

(5)

where the f_j and g_j are penalized cubic splines with radial bases and no intercepts:

f_{j} (x_{j}) = β_{j 1} x_{1} + β_{j 2} x_{j}^{2} + β_{j 3} x_{j}^{3} + \sum_{k = 1}^{K_{f j}} u_{f j k} {| x_{j} - κ_{f j k} |}^{3},

(6)

and the κ_fjk are fixed knots. The penalty is implemented by placing a Normal(0, $σ_{f j}^{2}$ ) prior on the u_fjk and a vague InverseGamma(0.001, 0.001) prior on the $σ_{f j}^{2}$ . The g_j are specified similarly, using γ instead of β. We place flat priors on the β and γ. Models adding fixed effects or group-level random slopes and intercepts are straightforward to specify.

Model (5) can suffer from poorly identified parameters; however, the resulting E[Y |x, t], as well as quantities of the forms E[Y |x, t = 1] − E[Y |x, t = 0] (the PTE) or E[Y |x₁, t] − E[Y |x₀, t], are typically stable. This stability, along with the tendency of software packages for other regression techniques to supply quantities of the form E[Y |x], make (1) a convenient definition of the PTE even when there cannot be an explicit separation of the treatment variable t from x in the model.

An alternative to the above additive penalized spline regression model is the Bayesian additive regression trees method³, implemented by the R package BayesTree. The model uses a sum-of-trees approach in which a prior regularizes each individual tree to be relatively simple. This method is popular for its combination of flexibility and user-friendliness: it is capable of fitting arbitrary regression mean surfaces with little and often no tuning or configuration needed from the user. To apply the model to our credible subgroups method, we concatenate the treatment indicator onto the covariate vector for each patient and fit Y ~ Normal (μ(z), σ²) where z = (x₁, …, x_p, t) and μ is an arbitrary regression mean function to be estimated. However, fully nonparametric models for which the PTE surface must be stored at every point in C present challenges with respect to memory, as a sample from the posterior joint PTE distribution must be stored in an often very large (number of draws by number of covariate points) matrix.

Precision of Monte Carlo Methods

Computation of credible subgroups from a Monte Carlo sample requires estimation not of the posterior mean of the PTE, but of its tail posterior quantiles, which are less precisely estimated for a given number of posterior draws. Additionally, which quantile to estimate $(W_{α}^{*})$ must also be estimated from the Monte Carlo sample of values from the estimated distribution functions, and the value of $W_{α}^{*}$ is often very far into the right tail of the distribution of the $| Δ (x) - \hat{Δ} (x) | / \sqrt{Var [Δ (x)]}$ . In practice, this means that much larger Monte Carlo samples are required for reliable credible subgroup inference than standard posterior mean and variance calculations for the corresponding model. Resampling methods may be used to estimate the Monte Carlo standard errors of these estimators if such resampling is faster than producing additional posterior samples.

A Batch Step-Down Testing Procedure

The single-step testing procedure based on (2) can be improved upon by a sequential, step-down testing procedure similar to the Holm step-down procedure⁹, the latter being well-known in the multiple testing literature. For a set of M hypotheses, the Holm procedure first tests all hypotheses using an M-way Bonferroni correction, and if the hypothesis with the lowest p-value is rejected, proceeds to test the remaining M − 1 hypotheses using an (M − 1)-way Bonferroni correction, and so on.

Let C be a subset of interest of the covariate space, θ be the vector of all model parameters (or conditional means for nonparametric models), and H_i = {θ : Δ(x_i) = 0}. Then Algorithm 1 controls the overall type I error rate at level α.

graphic file with name nihms-1771907-f0001.jpg

The proof of validity for this procedure relies on showing that it is a closed testing procedure¹⁰, in part via noting that W_α,V ≤ W_α,U for V ⊂ U. The full proof is available in the Appendix as Theorem 1. If H_i is rejected, place x_i in D for posterior mean $\hat{Δ} (x_{i}) > 0$ or S^C for $\hat{Δ} (x_{i}) < 0$ , and if H_i is not rejected place x_i in S \ D.

If, in a Bayesian interpretation, one wishes to use an interval null hypothesis such as H_i = {θ : −ε ≤ Δ(x_i) ≤ ε}, then R_M should be the set of x_i for which the band does not overlap [−ε, ε]. Then D and S^C are constructed by comparing $\hat{Δ} (x_{i})$ to ε and −ε, respectively, for x_i for which H_i is rejected.

The above algorithm is conceptually simple and may easily be implemented around a given method for constructing simultaneous confidence bands. However, a somewhat more involved algorithm (Algorithm 2) implementing the same procedure can be used to compute the maximum credible level l_i at which the test of no effect at each covariate point x_i is rejected (similar to adjusted p-values¹¹), and may be used to quickly construct credible subgroups at various credible levels after performing a single expensive computation, rather than repeating Algorithm 1 for every credible level. The credible subgroups for any level 1 − α are then $D = {x_{i} : \hat{Δ} (x_{i}) > 0, l_{i} \geq 1 - α}$ and $S = {x_{i} : \hat{Δ} (x_{i}) < 0, l_{i} \geq 1 - α}$ .

graphic file with name nihms-1771907-f0002.jpg

Power and Choice of Covariate Space

When fitting a semiparametric regression model, the standard errors of mean surface estimates can quickly become very large in regions of the covariate space in which observations are sparse. Additionally, estimates of the PTE have very large standard errors and are generally unreliable in regions lacking common support (presence of observations in both the treatment and control group). As a result, the credible subgroups method may have very low power (ability to classify points as in D or S^c) in these observation-sparse regions. While we consider this to be a positive feature of semiparametric regression (given that we presumably do not want to assume a strict parametric form for the PTE), globally adjusting for multiplicity of covariate points means that unnecessarily testing the PTE in these observation-sparse regions dilutes power even in observation-rich regions. This problem is compounded by the fact that semiparametric models yield much looser dependence in the posterior distribution of the PTE surface across the covariate space.

While restricting the covariate space to the Cartesian product of the empirical ranges of each covariate is an obvious first step, the covariate space often may be further restricted following local power simulations. If asymptotic normality of the joint posterior distribution of the PTEs can be assumed, we may use Algorithm 3 to estimate conditional local power.

graphic file with name nihms-1771907-f0003.jpg

The power estimate is derived by assuming asymptotic joint normality of $\hat{Δ} (x_{j})$ and evaluating the frequentist probability $P [\hat{Δ} (x_{j}) - \sqrt{W_{α, C}^{*} Var [\hat{Δ} (x_{j})]} ∣ Δ (x_{j}) = Δ_{0}]$ with the intention of determining the power to detect benefit (rather than harm). The assumption of a constant treatment effect is material—the estimate of the PTE at one covariate point may be strongly affected by the true value of the PTE at nearby points, and the shape of the PTE surface affects the dependency of the posterior distribution of the PTE across the covariate space, and thus $W_{α, C}^{*}$ . Other PTE surfaces may be assumed and power estimated for them, but without prior information, the constant surface computations are perhaps most easily interpreted. Additionally, the above algorithm may depend on nuisance parameters such as error variances. It may be possible to estimate these nuisance parameters through empirical Bayes methods (e.g., using restricted maximum likelihood to estimate the hyperparameters from their marginal distributions) using the observed data without introducing much bias.

These power estimates may be used to further restrict the covariate space to, e.g., points for which the intended study and analysis have at least 5% power to detect the hypothesized benefit at the 95% credible level. While in many cases the gains in power across the remaining covariate space may be modest, the above procedure costs little more than computation time when the criterion for retaining a covariate point is relatively lax. Additionally, the tested covariate space may be adjusted before testing using considerations such as parsimony (e.g., choosing a circumscribing hyperrectangle) and the requirement of common support between study arms. Finally, if information about the covariate distribution before the start of the trial, via either knowledge of the population distribution or quotas for strata, pre-trial power calculations or sample size estimations may be carried out in a similar manner.

Regardless of whether or not the covariate space is restricted in the manner described above, it is important to recognize that classification of covariate points into D, S \ D, and S^C is risky in regions of the covariate space in which observations are sparse. Even greater skepticism should be applied to conclusions outside of the observed covariate range, as these are extrapolations of the underlying regression model.

Simulation Study

We perform a simulation study to evaluate certain frequentist properties of the credible subgroups generated by linear, spline, and Bayesian additive regression trees models. A necessary property is valid (including conservative) coverage, i.e., D ⊆ B ⊆ S at least 100(1 − α)% of the time. Given valid coverage, we compare regression models primarily by the sensitivity of D (how much of B is contained in D). We also evaluate the sensitivity of D under the step-down procedure relative to that under the single-step procedure.

We simulate 1000 data sets with n = 100 patients in each treatment arm. Results for simulations with n = 25, n = 50, and n = 75 are presented in the Supplementary Materials. Each subject i has a covariate vector x_i = (1, x_i2, x_i3) with x_i2 = 0, 1 with equal probability and x_i3 continuously uniformly distributed on [−3, 3], a deterministic treatment assignment t_i, and a conditionally normally distributed response y_i. The covariates are used as both prognostic and predictive variables.

The outcomes are generated as Normal(Δ(x_i), 1) with Δ(x_i) specified in the following six cases. In the null case, Δ(x_i) = 0. In the binary case, Δ(x_i) = x_i2. In the linear case, Δ(x_i) = x_i3. In the near-linear case, $Δ (x_{i}) = 2 (\sqrt{x_{i 3} + 3} - \sqrt{3})$ . In the threshold case, Δ(x_i) = sign(x_i3)(x_i3/3)^1/3 + 1/4. In the non-monotone case, Δ(x_i) = 1/2 − 3(x_i3/3)².

To each data set we fit a linear, spline, and Bayesian additive regression trees model. For the linear and spline models, we place a vague InverseGamma(0.001, 0.001) prior on the error variance and flat priors on fixed effect coefficients. For the spline model we place InverseGamma(2, 1) shrinkage priors on the random effect variance, corresponding to the spline roughness penalty for the continuous covariate (the binary covariate does not require a spline). The Bayesian additive regression trees model is fit using the default settings in the R package BayesTree. All Gibbs samplers are run for 100 burn-in and 1000 retained iterations, which appears to be acceptable for these simple models.

To determine credible subgroups at the 80% credible level, we use as the covariate space the grid in which x₁ = 1, x₂ = 0, 1, and x₃ ranges from −3 to 3 in increments of 0.1. We compute the result under both the single-step and step-down procedures using the simultaneous credible band (2).

Table 1 displays the average summary statistics for 80% credible subgroups for each model and data generating mechanism (DGM) at n = 100 patients per arm. Each model has sufficient coverage, except for the linear model in the non-linear cases. Given sufficient coverage, the sensitivity of D will usually be the driving factor in choosing a model. In this regard, the spline model performs better than Bayesian additive regression trees in all cases except when the binary covariate drives the treatment effect heterogeneity, but the spline model’s advantage is small in the “threshold” case in which the continuous covariate behaves similarly to a binary one. When the true treatment effect heterogeneity is linear, the linear model outperforms both with respect to sensitivity of D. Generally, we recommend the spline model when continuous predictive covariates are present, as even a small departure from linearity can render the coverage of the linear model insufficient (see the “near-linear” case). Finally, the step-down method consistently improved the sensitivity of D, sometimes to a large extent (nonparametric fits in the near-linear case).

Table 1.

Simulation study results. Operating characteristics of 80% credible subgroups with n = 100 patients in each study arm. Struck-through sensitivities indicate insufficient coverage and should be treated with caution. Step-down efficiency is the ratio of the sensitivity of D using the step-down procedure (shown) over the single-step procedure (not shown).

Data Generating Mechanism	Model	Coverage	Sensitivity of D	Step-Down Efficiency
Null Effect	Linear	0.89	–	–
	Spline	0.92	–	–
	BART	0.99	–	–
Binary	Linear	0.91	0.97	1.01
	Spline	0.95	0.56	1.05
	BART	0.98	0.82	1.05
Linear	Linear	0.88	0.90	1.02
	Spline	0.94	0.77	1.09
	BART	1.00	0.70	1.08
Near-Linear	Linear	0.75	0.71	1.04
	Spline	0.97	0.39	1.19
	BART	1.00	0.27	1.25
Threshold	Linear	0.61	0.85	1.03
	Spline	0.96	0.52	1.07
	BART	0.99	0.48	1.07
Non-Monotone	Linear	0.23	0.52	1.10
	Spline	0.96	0.63	1.05
	BART	0.97	0.46	1.08

Open in a new tab

In the absence of model mis-specification, all methods appear to have conservative coverage. In fact, the realized error rate does not rise far past α/2. The relevance of α/2 as an error rate here is that credible subgroups can only make an error in one direction at each covariate point: if the true PTE is positive, the only error is under-estimation, while if the true PTE is non-positive, the only error is over-estimation; however, we cannot know a priori the sign of the PTE. Thus the apparent conservatism is due to the fact that none of the displayed simulations represent the worst-case scenario that the procedure protects against: a small-magnitude treatment effect that crosses zero frequently.

Analysis of Four Trials of an Alzheimer’s Disease Treatment

We consider data from a sequence of four clinical trials for Alzheimer’s disease treatments carried out by AbbVie, all of which include arms for a placebo and the same “standard of care” treatment. We wish to compare the standard of care to the placebo with respect to change in disease severity over 12 weeks, using data from all four trials. Combined, the studies are comprised of 369 complete-case patients from 9 countries.

We consider six baseline patient characteristics: disease severity, change in disease severity during run-in, long-term cognitive decline rate, age, carrier status of the ApoE4 allele, and sex. Disease severity is measured by the 11-part Alzheimer’s Disease Assessment Scale—Cognitive Subscale (ADAS-Cog 11)¹². The change in severity score in the 3–4 week run-in period between screening and randomization, which we call prechange, is included as a main effect in an attempt to adjust for the “learning effect” in which patients become familiar with the assessment instrument; however it is not included as a predictive covariate because it is not thought to be useful for practitioners due to its high variability and delaying of treatment. On the other hand, we include as a predictive covariate the long-term cognitive decline rate, d-rate, which is defined as the total drop in score on the Mini Mental State Examination (MMSE) divided by the time, in years, since onset of first symptoms. Age, though intuitively a potential factor in response, did not appear to have much predictive utility for efficacy in a similar trial analyzed using a similar credible subgroups approach.¹³ Since its inclusion in the model for the present analysis was also detrimental to penalized model fit as evaluated by the deviance information criterion and log pseudo-marginal likelihood, we excluded it from our primary analysis. The ApoE4 allele and sex are considered risk factors for development of Alzheimer’s disease¹⁴. The primary endpoint, improvement, is the negative difference in severity between end-of-study and baseline assessment scale measurements, so a positive value is a good outcome.

Our outcome model may be broadly summarized (in R-like syntax) as

improvement ~ Intercept + r 1 (country) + sex + carrier + f 1 (severity) + f 2 (drate) + f 3 (prechange) + treatment + treatment : s 1 (country) + treatment : sex + treatment : carrier + treatment : g 1 (severity) + treatment : g 2 (drate)

(7)

where r(·) and s(·) represent traditional centered, normally-distributed random intercepts and slopes, respectively, f(·) and g(·) represent penalized cubic splines as described in the “Semiparametric Estimation of Personalized Treatment Effects” Section, colons denote interactions, and errors are normally distributed. We place knots for spline terms at increments of 2 across each observed range. Variances for error, random effects, and penalized spline coefficients are given vague InverseGamma(0.001, 0.001) priors. Fixed effects are given flat priors.

We used the method of the “Power and Choice of Covariate Space” Section to choose the covariate space over which to produce credible subgroups, beginning with a grid with increments of 1 unit on each scale and spanning the Cartesian product of the observed ranges. Figure 2 plots the distribution of covariate points and shades the region over which the 95% credible subgroups have at least 5% power to detect a uniform benefit of 1 standard deviation (5 points). Restricting severity to [9, 49] and d-rate to [0, 8] we include this entire region and exclude less than 9% of patients, while reducing the size of the covariate space by more than half. It can be seen that observations, and especially observations from different treatment arms, are hopelessly sparse in much of the excluded region.

Figure 2. — The shaded region is the region for which the likelihood of detecting a 1 standard deviation (5-point) benefit at the 95% credible level is at least 5% when the entire empirical covariate space is used. Patients receiving placebo are represented by ×, and those receiving the standard of care by +.

The model was fit by Gibbs sampling using 100,000 iterations after 1000 burn-in iterations. Convergence appears near-immediate and mixing good from trace plots, and the time series–based effective sample sizes for most coefficients are above 75,000. Effective sample size was lowest for some country random effects with few patients, and for certain random effect variances.

Figure 3 displays the nonparametrically fitted effect curves. Although the posterior mean effects of the continuous covariates are at least plausibly linear except for the prognostic effect of rate of decline, there is a possibility of nonlinearity and even nonmonotonicity in the rest of the effects, which should be retained in the model.

Figure 4 displays the credible subgroups at the 95% level, using the step-down testing procedure (Algorithm 1). The exclusive credible subgroup generally contains patients with higher-than-average severity and rate of decline. The exclusive credible subgroup in this case includes approximately 17% more cells than the corresponding exclusive credible subgroup when the full observed ranges of severity and d-rate are used, and approximately 3% more than when the single-step (instead of the step-down) procedure is used. The apparent pixelation is due to the integer grid used to represent continuous covariates.

We also fit two other models: a version of (7) in which the penalized spline terms were replaced with linear effects, and the default implementation of Bayesian additive regression trees from the R package BayesTree for 100,000 iterations, thinned to 10,000 iterations due to memory considerations, after 1000 burn-in iterations. Figure 5 displays the 95% credible subgroups for the linear and Bayesian additive regression trees models, and Figure 6 compares the posterior mean PTE surface between (7) and its linear counterpart. Such visualizations of the estimated PTE surface may be useful to trialists who wish to more fully understand possible nonlinear features of the surface that would be lost under a linear model. As may be expected, the linear model simplifies the estimated PTE surface, which, along with the variances of the PTEs, yield a smoother exclusive credible subgroup. By contrast, Bayesian additive regression trees, which divides the covariate space into rectangular cells having constant PTE in each cell, yields a more rectangular exclusive credible subgroup. Due to the superior performance of the spline-based model in the simulation study for the nonlinear continuous effect case, we promote those Figure 4 credible subgroups (which are intermediate to those in Figure 5 as the best choice. The resulting credible subgroups are consistent with the observation that the linear model fits poorly while the Bayesian additive regression trees fit is too conservative in similar situations.

Figure 5. — Credible subgroups at the 95% level for the linear (top) and Bayesian additive regression trees (bottom) models.

Figure 6. — Posterior mean PTE surfaces for the semiparametric (top) and linear (bottom) models.

Discussion

Because of the relative independence of the credible subgroups inferential tools and the selected regression method—all that is needed is a sample from the joint posterior of the personalized treatment effects—the choice of parametric, semiparametric, or nonparametric methods may be made with full focus on flexibility and applicability to the problem, rather than being muddled by technical considerations about the associated inferential process. Such freedom may make “black box” nonparametric methods such as Bayesian additive regression trees appealing for their flexibility, but also allows the use of more interpretable models such as additive spline models if desired.

Increased flexibility of semiparametric and nonparametric regression models come at a cost in terms of power, due to the looser dependencies of the PTEs across the covariate space. As such, we have presented a pair of techniques for increasing power: a step-down multiple testing procedure, and a process for culling the tested covariate space of points for which we expect low power or whose inferential value is otherwise questionable. Culling the covariate space may sometimes be risky, especially when the space is well-covered by the observed data, but the step-down procedure carries no risk and costs only additional computation time. Of course, these techniques are useful for fully parametric models as well.

The more flexible models presented in this paper may also yield credible subgroups that are not contiguous in covariate space. It may be undesirable to restrict ourselves to contiguous credible subgroups, as doing so would implicitly assume that the benefiting subgroup itself is contiguous. Regardless of contiguity, the shape of the credible subgroups may not be easily described in terms of many, especially continuous, covariates. In such a case it may be useful to provide a “calculator” which returns the maximum credible level (as in Algorithm 2) for a given patient’s covariate profile¹⁵.

Some additional research is still required. In particular, the credible subgroups methods rely on evaluating tail probabilities of posterior distributions and distributions of estimates computed from those results, which makes estimating and handling Monte Carlo error significantly more difficult, and necessitates larger posterior samples than usual. Additionally, the conditional power estimation presented may become prohibitively computationally expensive if expanded to a pre-study power simulation, so it may be worthwhile to seek analytical approximations. Finally, it may be desirable to replace or discard assumptions about the form of the conditional response distributions, either by using different parametric distributions (e.g., binomial) or one of the many Bayesian nonparametric options (e.g., Dirichlet process priors)¹⁶.

Supplementary Material

Supplement: additional simulation results

NIHMS1771907-supplement-Supplement__additional_simulation_results.pdf^{(109KB, pdf)}

Data and R code

NIHMS1771907-supplement-Data_and_R_code.zip^{(23.4KB, zip)}

Acknowledgments

This work was supported by AbbVie, Inc, the University of Minnesota Doctoral Dissertation Fellowship, and the National Cancer Institute [1-R01-CA157458-01A1 to PMS and BPC]. AbbVie contributed to the design, research, interpretation of data, reviewing, and approving of this publication.

Theorems and Proofs

Definition 1. Asymptotic Simultaneous Confidence Band. The two-sided 1 − α asymptotic simultaneous confidence band for Δ over C is bounded at each point x ∈ C by

\hat{Δ} (x) \pm \sqrt{W_{α, C}^{*} Var [\hat{Δ} (x)]},

where $W_{α, C}^{*}$ is the 1 − α quantile of the distribution of

W_{C} = sup_{x \in C} \frac{{\hat{Δ} (x) - Δ (x)}^{2}}{Var [\hat{Δ} (x)]} .

Theorem 1. Batch Step-Down Testing Procedure. Let C be the restricted covariate space, θ be the vector of all model parameters, and H_i = {θ : Δ(x_i) = 0}. The following testing procedure controls the family-wise type I error rate at level α.

graphic file with name nihms-1771907-f0004.jpg

The proof of Theorem 1 relies on showing that the batch step-down testing procedure is a closed testing procedure¹⁰. Let $H$ be a collection of hypotheses closed under intersection: H_k, $H_{l} \in H$ implies $H_{k} \cap H_{l} \in H$ . Furthermore, let φ_k be an α-level test of H_k so that φ_k = 1 if and only if it rejects H_k locally (independent of other hypotheses and tests). Then a closed testing procedure is a procedure which rejects H_k if and only if φ_l = 1 for all l such that $H_{l} \subseteq H_{k} \in H$ . Any closed testing procedure controls the family-wise type I error rate at level α because in order to reject at least one true hypothesis the procedure must reject the intersection of all true hypotheses, which is tested by an α-level test.

Proof of Theorem 1. Let $H_{U} = \cap_{i : x_{i} \in U} H_{i}$ H_i be the hypothesis that Δ(x) = 0 for all x ∈ U, and φ_U be the local test of that hypothesis which rejects if and only if exists an x_i ∈ U for which the band does not include zero. We first show that if φ_U = 1 and the band over U did not include zero at x_i ∈ V ⊂ U, then φ_V = 1. Suppose φ_U = 1 and the band over U did not contain zero at x_i ∈ V ⊂ U. Let $W_{U}^{*}$ be the 1 − α quantile of the distribution of

W_{U} = sup_{x \in U} \frac{{\hat{Δ} (x) - Δ (x)}^{2}}{Var [\hat{Δ} (x)]} .

Note that since V ⊂ U, we have W_V ≤ W_U, and thus $W_{V}^{*} \leq W_{U}^{*}$ . Then the band over V is nowhere wider than the band over U, and since the band over U did not contain zero at x_i, the band over V also does not contain zero at x_i, so φ_V = 1.

We now show that when a point x_i is marked for rejection, it is rejected by all intersections in $H$ involving H_i. Let x_i ∈ R₁, i.e., a point whose null-effect hypothesis was marked for rejection on the first iteration. Since H_i was marked on the first iteration, there is at least one x (x_i itself) in C for which the band does not include zero; thus φ_C = 1. Additionally, any other hypothesis in $H$ which is an intersection involving H_i is a hypothesis H_V with x_i ∈ V ⊂ C, and thus is also rejected by the corresponding local test. Thus H_i may be globally rejected.

Consider now x_i ∈ R_M, M > 1. Every x_j ∈ C \ T_M has previously had its hypothesis globally rejected, thus any local test of a hypothesis for a set containing that point has already been locally rejected. Therefore we need only consider hypotheses H_U such that U ⊆ T_M. The argument for points in R₁ may then be reused, replacing C with T_M. Thus the procedure is a closed testing procedure, and therefore controls the family-wise type I error rate at α. □

Remark 1. Theorem 1 applies to posterior credible bands insofar as they correspond to the confidence band via the asymptotic joint normality of the posterior of Δ(x).

Quantile-Based Simultaneous Credible Bands

Let F_Θ(θ) = P[Θ ≤ θ] be the cumulative distribution function of θ, G_Θ(θ) = P[Θ < θ] be its left-continuous counterpart, and $F_{θ}^{- 1} (p) = inf {θ : p \leq F_{Θ} (θ)}$ , $G_{Θ}^{- 1} (p) = sup {θ : p \geq G_{Θ} (θ)}$ be their inverses. We may then use the simultaneous credible band

Δ (x) \in [F_{Δ (x) ∣ y}^{- 1} (W_{α, C}^{*}), G_{Δ (x) ∣ y}^{- 1} (1 - W_{α, C}^{*})],

(8)

with $W_{α, C}^{*}$ chosen to achieve a desired credible level by the following construction. Let $W_{α, C}^{*}$ be the α quantile of the distribution of

W_{C} = inf_{x \in C} min {F_{Δ (x) ∣ y} [Δ (x)], 1 - G_{Δ (x) ∣ y} [Δ (x)]} .

(9)

Here, we take the minimum of the lower and upper tail probabilities (via F and G, respectively) of a draw of Δ(x), and then the infimum of those tail probabilities across the entire covariate space to obtain W_C. Then, $W_{α, C}^{*}$ is used as a multiplicity-adjusted tail probability to obtain bounding quantiles. Thus the right-hand side of (8) defines the pre-image of ${W_{C} \geq W_{α, C}^{*}}$ and therefore defines a 1 − α credible set. The distribution and quantile functions of W_C may be estimated from the posterior sample.¹³

Footnotes

Supplementary Materials

Data and R code to reproduce the results, plots, and simulations in this paper are available online.

References

[1].Breiman Leo. Random forests. Machine Learning, 45(1):5–32, 2001. [Google Scholar]
[2].Foster Jared C, Taylor Jeremy MG, and Ruberg Stephen J. Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
[3].Chipman Hugh A, George Edward I, and McCulloch Robert E. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1): 266–298, 2010. [Google Scholar]
[4].Hill Jennifer L. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011. [Google Scholar]
[5].James O Berger Xiaojing Wang, and Shen Lei. A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics, 24(1): 110–129, 2014. [DOI] [PubMed] [Google Scholar]
[6].Schnell Patrick M, Tang Qi, Offen Walter W, and Carlin Bradley P. A Bayesian credible subgroups approach to identifying patient subgroups with positive treatment effects. Biometrics, 72(4):1026–36, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
[7].Gelman Andrew, Carlin John B, Stern Hal S, and Rubin Donald B. Bayesian Data Analysis, volume 2. Chapman & Hall/CRC; Boca Raton, FL, USA, 2014. [Google Scholar]
[8].Coull Brent A, Ruppert David, and Wand MP. Simple incorporation of interactions into additive models. Biometrics, 57(2):539–545, 2001. [DOI] [PubMed] [Google Scholar]
[9].Holm Sture. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979. [Google Scholar]
[10].Marcus Ruth, Peritz Eric, and Ruben K Gabriel. On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655–660, 1976. [Google Scholar]
[11].Wright S Paul. Adjusted p-values for simultaneous inference. Biometrics, 48:1005–13, 1992. [Google Scholar]
[12].Rosen Wilma G, Richard C Mohs, and Davis Kenneth L. A new rating scale for alzheimer’s disease. The American Journal of Psychiatry, 141(11): 1356–64, 1984. [DOI] [PubMed] [Google Scholar]
[13].Schnell Patrick M, Tang Qi, Müller Peter, and Carlin Bradley P. Subgroup inference for multiple treatments and multiple endpoints in an Alzheimer’s disease treatment trial. Annals of Applied Statistics, 11(2):949–966, 2017. [Google Scholar]
[14].Burns Alistair and Iliffe Steve. Alzheimer’s disease. British Medical Journal, 338(7692):467–471, 2009. [Google Scholar]
[15].Schnell Patrick M. Credible Subgroups: Identifying the Population that Benefits from Treatment. PhD thesis, University of Minnesota, 2017. [Google Scholar]
[16].Müller Peter and Mitra Riten. Bayesian nonparametric inference–why and how (with discussion). Bayesian Analysis, 8(2):269–302, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplement: additional simulation results

NIHMS1771907-supplement-Supplement__additional_simulation_results.pdf^{(109KB, pdf)}

Data and R code

NIHMS1771907-supplement-Data_and_R_code.zip^{(23.4KB, zip)}

[R1] [1].Breiman Leo. Random forests. Machine Learning, 45(1):5–32, 2001. [Google Scholar]

[R2] [2].Foster Jared C, Taylor Jeremy MG, and Ruberg Stephen J. Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] [3].Chipman Hugh A, George Edward I, and McCulloch Robert E. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1): 266–298, 2010. [Google Scholar]

[R4] [4].Hill Jennifer L. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011. [Google Scholar]

[R5] [5].James O Berger Xiaojing Wang, and Shen Lei. A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics, 24(1): 110–129, 2014. [DOI] [PubMed] [Google Scholar]

[R6] [6].Schnell Patrick M, Tang Qi, Offen Walter W, and Carlin Bradley P. A Bayesian credible subgroups approach to identifying patient subgroups with positive treatment effects. Biometrics, 72(4):1026–36, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] [7].Gelman Andrew, Carlin John B, Stern Hal S, and Rubin Donald B. Bayesian Data Analysis, volume 2. Chapman & Hall/CRC; Boca Raton, FL, USA, 2014. [Google Scholar]

[R8] [8].Coull Brent A, Ruppert David, and Wand MP. Simple incorporation of interactions into additive models. Biometrics, 57(2):539–545, 2001. [DOI] [PubMed] [Google Scholar]

[R9] [9].Holm Sture. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979. [Google Scholar]

[R10] [10].Marcus Ruth, Peritz Eric, and Ruben K Gabriel. On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655–660, 1976. [Google Scholar]

[R11] [11].Wright S Paul. Adjusted p-values for simultaneous inference. Biometrics, 48:1005–13, 1992. [Google Scholar]

[R12] [12].Rosen Wilma G, Richard C Mohs, and Davis Kenneth L. A new rating scale for alzheimer’s disease. The American Journal of Psychiatry, 141(11): 1356–64, 1984. [DOI] [PubMed] [Google Scholar]

[R13] [13].Schnell Patrick M, Tang Qi, Müller Peter, and Carlin Bradley P. Subgroup inference for multiple treatments and multiple endpoints in an Alzheimer’s disease treatment trial. Annals of Applied Statistics, 11(2):949–966, 2017. [Google Scholar]

[R14] [14].Burns Alistair and Iliffe Steve. Alzheimer’s disease. British Medical Journal, 338(7692):467–471, 2009. [Google Scholar]

[R15] [15].Schnell Patrick M. Credible Subgroups: Identifying the Population that Benefits from Treatment. PhD thesis, University of Minnesota, 2017. [Google Scholar]

[R16] [16].Müller Peter and Mitra Riten. Bayesian nonparametric inference–why and how (with discussion). Bayesian Analysis, 8(2):269–302, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Multiplicity-Adjusted Semiparametric Benefiting Subgroup Identification in Clinical Trials

Patrick M Schnell

Peter Müller

Qi Tang

Bradley P Carlin

Abstract

Background:

Methods:

Results:

Conclusion:

Introduction

Benefiting Subgroup Identification

Credible Subgroups

Figure 1.

Semiparametric Estimation of Personalized Treatment Effects

Precision of Monte Carlo Methods

A Batch Step-Down Testing Procedure

Power and Choice of Covariate Space

Simulation Study

Table 1.

Analysis of Four Trials of an Alzheimer’s Disease Treatment

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.

Discussion

Supplementary Material

Acknowledgments

Theorems and Proofs

Quantile-Based Simultaneous Credible Bands

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases