Abstract
Background:
A recent focus in the health sciences has been the development of personalized medicine, which includes determining the population for which a given treatment is effective. Due to limited data, identifying the true benefiting population is a challenging task. To tackle this difficulty, the credible subgroups approach provides a pair of bounding subgroups for the true benefiting subgroup, constructed so that one is contained by the benefiting subgroup while the other contains the benefiting subgroup with high probability. However, the method has so far only been developed for parametric linear models.
Methods:
In this paper we develop the details required to follow the credible subgroups approach in more realistic settings by considering nonlinear and semiparametric regression models, supported for regulatory science by conditional power simulations. We also present an improved multiple testing approach using a step-down procedure. We evaluate our approach via simulations and apply it to data from four trials of Alzheimer’s disease treatments carried out by AbbVie.
Results:
Semiparametric modeling yields credible subgroups that are more robust to violations of linear treatment effect assumptions, and careful choice of the population of interest as well as the step-down multiple testing procedure result in a higher rate of detection of benefiting types of patients. The approach allows us to identify types of patients that benefit from treatment in the Alzheimer’s disease trials.
Conclusion:
Attempts to identify benefiting subgroups of patients in clinical trials are often met with skepticism due to a lack of multiplicity control and unrealistically restrictive assumptions. Our proposed approach merges two techniques, credible subgroups and semiparametric regression, which avoids these problems and makes benefiting subgroup identification practical and reliable.
Keywords: Bayesian inference, clinical trials, multiple testing, personalized medicine, semiparametric regression, subgroup identification
Introduction
A recent focus in the health sciences has been the development of personalized medicine, which seeks to incorporate observable patient characteristics into decisions about prevention and treatment. The central statistical objectives relating to personalized medicine involve inferences about personalized treatment effects (PTEs), which are the counterpart to the traditional average treatment effect (ATE) conditioned on available patient covariates, and usually have one of two goals: to determine which treatment is most effective for a given patient, or to determine the population for which a given treatment is effective. In this paper we focus primarily on the latter, which is of interest to, e.g., regulatory agencies.
The approach we take toward identifying the population for which a given treatment is effective is to estimate the PTE at each predictive covariate point (informally, for each patient), then at each covariate point test the hypothesis of a null PTE. In this context a predictive covariate is a baseline covariate that interacts with the treatment indicator in a regression model. This problem necessitates attention to the multiplicity of hypothesis tests across the predictive covariate space, unlike the problem of identifying the best treatment for a given patient. Additionally, shifting the inference focus from an overall effect to PTE requires more flexible inference models that allow reliable inference of patient-specific effects. Several nonparametric and semiparametric regression approaches for PTEs have been proposed, including random forests1 in the virtual twins approach2, Bayesian additive regression trees (BART)3 in modeling for causal inference4, and a hybrid approach defining treatment and baseline models using tree-based methods5.
Credible subgroups6 provide a framework for deriving inferences about the population benefiting from treatment using the output of linear regression models of the PTE. These inferences take the form of two bounding subgroups for the benefiting subgroup, one which contains it and one which is contained by it, an approach that naturally accounts for the multiplicity of PTE tests. In the present paper, we attempt to unify this inferential framework and current nonparametric and semiparametric regression practice, with special attention to penalized splines and Bayesian additive regression trees. We apply our method to a data set collected from a sequence of four clinical trials for Alzheimer’s disease treatments, with four baseline covariates potentially influencing the treatment effect.
Benefiting Subgroup Identification
Benefiting subgroup identification is primarily concerned with a personalized treatment effect (PTE), i.e., the treatment effect for a patient given their observable baseline characteristics. In many cases, the PTE is a difference in conditional expectation:
Δ(x) = E[Y | x, t = 1] − E[Y | x, t = 0]   (1)
where Y is the response of interest, x is a vector of covariates, and t is a treatment indicator. Given a definition of PTE and a covariate space C, the goal of benefiting subgroup identification is to estimate the set of covariate points for which it is positive: B = {x ∈ C : Δ(x) > 0}.
Credible Subgroups
A common default estimator for B in the Bayesian setting is the set of covariate points with high posterior probability of a positive PTE given the observed data y. In order to control for the multiplicity inherent in testing the PTE at every covariate point, and to provide a two-sided bound, we prefer the credible subgroup pair6 (D, S), which is constructed so that P[D ⊆ B ⊆ S|y] > 1 − α. Thus it is likely (with 1 − α posterior probability) that the exclusive credible subgroup D contains only patients who benefit, and the inclusive credible subgroup S contains all patients who do. Such a credible subgroup pair partitions the covariate space into three regions, as shown in Figure 1. We can then conclude that for covariate points in D, there is evidence of benefit, and for patients in the complement SC of S, there is evidence of no benefit. The remainder of the space, S \ D, requires more information.
Figure 1.

Interpretation of the credible subgroup pair (D, S) relative to the true benefiting subgroup B (enclosed by dashed line).
The general strategy of such an approach is to perform a regression of the PTE onto the given covariates, construct simultaneous credible bands around the regression surface, and take as D and S the covariate points at which the lower and upper bounds, respectively, are greater than zero. Because the credible bands are simultaneous, there is a controlled probability that at least one covariate point has a corresponding lower bound mistakenly greater than zero (i.e., is erroneously in D) or upper bound mistakenly less than zero (i.e., is erroneously placed in SC). While the theory was initially developed for normal linear models, credible subgroups can be computed for any model given only a sample from the posterior distribution of the Δ(x) at each x ∈ C.
In cases for which the joint posterior of Δ(x) approaches the frequentist Gaussian distribution of the maximum likelihood estimator7, the asymptotic band
Δ̄(x) ± Wα,C sd[Δ(x) | y]   (2)

where Wα,C is the 1 − α quantile of the distribution of

sup_{x ∈ C} |Δ(x) − Δ̄(x)| / sd[Δ(x) | y]   (3)

and Δ̄(x) is the posterior mean of Δ(x), is a 1 − α simultaneous credible band for Δ(x) over C. These bands also correspond asymptotically to frequentist simultaneous confidence bands, and therefore may be used to derive frequentist tests. In cases where the joint posterior of Δ(x) is not approximately normal, a quantile-based simultaneous credible band (presented in the Appendix) may be used instead.
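As a concrete illustration, the following R sketch computes the band (2)–(3) and the resulting credible subgroup pair from a matrix of posterior PTE draws; the object and function names are hypothetical, and any model that produces such draws could supply them.

```r
# Minimal sketch: simultaneous band (2)-(3) and credible subgroups from a
# matrix of posterior PTE draws (rows = posterior draws, columns = covariate
# points). Any model that produces such draws could supply delta_draws.
credible_subgroups <- function(delta_draws, alpha = 0.05) {
  delta_bar <- colMeans(delta_draws)              # posterior means
  delta_sd  <- apply(delta_draws, 2, sd)          # posterior standard deviations
  scaled_dev <- abs(sweep(delta_draws, 2, delta_bar)) /
    matrix(delta_sd, nrow(delta_draws), ncol(delta_draws), byrow = TRUE)
  W <- quantile(apply(scaled_dev, 1, max), 1 - alpha)  # critical value from (3)
  lower <- delta_bar - W * delta_sd               # simultaneous band (2)
  upper <- delta_bar + W * delta_sd
  list(D = which(lower > 0),   # exclusive credible subgroup: evidence of benefit
       S = which(upper > 0))   # inclusive credible subgroup
}
```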
Semiparametric Estimation of Personalized Treatment Effects
The first step in constructing credible subgroups is to perform a regression of the personalized treatment effect onto relevant covariates. Here we define the PTE as in (1), the difference in conditional expected response between the treatment and control arms. Previously, a linear model of the form
E[Y | x, t] = x′β + t x′γ   (4)
has been used6 so that Δ(x) = x′γ. We generalize to a semiparametric model based on the additive penalized spline model with factor-by-curve interactions8:
E[Y | x, t] = β0 + Σj fj(xj) + t [γ0 + Σj gj(xj)]   (5)
where the fj and gj are penalized cubic splines with radial bases and no intercepts:
fj(xj) = βj xj + Σk ufjk |xj − κfjk|³   (6)
and the κfjk are fixed knots. The penalty is implemented by placing a Normal(0, σ²fj) prior on the ufjk and a vague InverseGamma(0.001, 0.001) prior on σ²fj. The gj are specified similarly, using γ instead of β. We place flat priors on the β and γ. Models adding fixed effects or group-level random slopes and intercepts are straightforward to specify.
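To make the basis in (6) concrete, the following R sketch constructs the radial cubic design matrix for a single continuous covariate; the function and object names are hypothetical, and the prior specification is indicated only in comments.

```r
# Sketch: design matrix for one penalized radial cubic spline term as in (6).
# x holds observed values of a continuous covariate; knots are the fixed
# knot locations kappa_jk.
radial_cubic_basis <- function(x, knots) {
  Z <- abs(outer(x, knots, "-"))^3                # |x - kappa_k|^3 for each knot
  colnames(Z) <- paste0("knot", seq_along(knots))
  Z
}

# Example with knots at increments of 2 across the observed range. Columns of Z
# would receive Normal(0, sigma2_f) coefficients u_fjk, while the linear term
# beta_j * x receives a flat prior.
x <- runif(100, -3, 3)
Z <- radial_cubic_basis(x, knots = seq(min(x), max(x), by = 2))
```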
Model (5) can suffer from poorly identified parameters; however, the resulting E[Y |x, t], as well as quantities of the forms E[Y |x, t = 1] − E[Y |x, t = 0] (the PTE) or E[Y |x1, t] − E[Y |x0, t], are typically stable. This stability, along with the tendency of software packages for other regression techniques to supply quantities of the form E[Y |x], makes (1) a convenient definition of the PTE even when there cannot be an explicit separation of the treatment variable t from x in the model.
An alternative to the above additive penalized spline regression model is the Bayesian additive regression trees method3, implemented by the R package BayesTree. The model uses a sum-of-trees approach in which a prior regularizes each individual tree to be relatively simple. This method is popular for its combination of flexibility and user-friendliness: it is capable of fitting arbitrary regression mean surfaces with little and often no tuning or configuration needed from the user. To apply the model to our credible subgroups method, we concatenate the treatment indicator onto the covariate vector for each patient and fit Y ~ Normal (μ(z), σ2) where z = (x1, …, xp, t) and μ is an arbitrary regression mean function to be estimated. However, fully nonparametric models for which the PTE surface must be stored at every point in C present challenges with respect to memory, as a sample from the posterior joint PTE distribution must be stored in an often very large (number of draws by number of covariate points) matrix.
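As a sketch of this concatenation strategy (not the package authors' own workflow), the following R code fits BART with BayesTree to simulated toy data and extracts posterior PTE draws by differencing the fitted means under t = 1 and t = 0 at each grid point; a matrix of draws produced this way can be passed directly to the band construction sketched earlier.

```r
library(BayesTree)

set.seed(1)
# Toy data: two baseline covariates, a 0/1 treatment indicator, and a response
# with a treatment effect that depends on the second covariate.
n <- 200
x <- matrix(runif(n * 2, -1, 1), n, 2, dimnames = list(NULL, c("x1", "x2")))
trt <- rep(0:1, length.out = n)
y <- rnorm(n, mean = x[, 1] + trt * x[, 2])
grid <- as.matrix(expand.grid(x1 = seq(-1, 1, 0.2), x2 = seq(-1, 1, 0.2)))

# Concatenate the treatment indicator onto the covariates and fit BART;
# evaluate the fitted mean at every grid point under t = 1 and under t = 0.
fit <- bart(x.train = cbind(x, trt), y.train = y,
            x.test  = rbind(cbind(grid, trt = 1), cbind(grid, trt = 0)))

K <- nrow(grid)
# Rows of yhat.test are posterior draws of E[Y | x, t]; PTE draws are the
# treated-minus-control predictions at each grid point (draws x points).
delta_draws <- fit$yhat.test[, 1:K] - fit$yhat.test[, (K + 1):(2 * K)]
```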
Precision of Monte Carlo Methods
Computation of credible subgroups from a Monte Carlo sample requires estimation not of the posterior mean of the PTE, but of its tail posterior quantiles, which are less precisely estimated for a given number of posterior draws. Additionally, the critical value used to construct the band must itself be estimated from the Monte Carlo sample of supremum statistics (or, for quantile-based bands, from the estimated distribution functions), and this critical value typically lies far in the tail of the corresponding sampling distribution. In practice, this means that much larger Monte Carlo samples are required for reliable credible subgroup inference than for standard posterior mean and variance calculations for the corresponding model. Resampling methods may be used to estimate the Monte Carlo standard errors of these estimators if such resampling is faster than producing additional posterior samples.
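For instance, a simple (if rough) resampling sketch for the Monte Carlo standard error of the estimated critical value might look as follows; w_draws is a hypothetical vector of per-draw supremum statistics, and the independent resampling ignores autocorrelation in the chain.

```r
# Rough resampling sketch for the Monte Carlo standard error of the estimated
# critical value, given w_draws, a (hypothetical) vector of per-draw supremum
# statistics as in (3). Independent resampling ignores autocorrelation in the
# chain; block resampling would be more appropriate for strongly dependent draws.
mc_se_W <- function(w_draws, alpha = 0.05, n_boot = 200) {
  boots <- replicate(n_boot,
                     quantile(sample(w_draws, replace = TRUE), 1 - alpha))
  sd(boots)
}
```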
A Batch Step-Down Testing Procedure
The single-step testing procedure based on (2) can be improved upon by a sequential, step-down testing procedure similar to the Holm step-down procedure9, the latter being well-known in the multiple testing literature. For a set of M hypotheses, the Holm procedure first tests all hypotheses using an M-way Bonferroni correction, and if the hypothesis with the lowest p-value is rejected, proceeds to test the remaining M − 1 hypotheses using an (M − 1)-way Bonferroni correction, and so on.
Let C be the subset of the covariate space that is of interest, θ be the vector of all model parameters (or conditional means for nonparametric models), and Hi = {θ : Δ(xi) = 0}. Then Algorithm 1 controls the overall type I error rate at level α.
[Algorithm 1 not reproduced here: iteratively construct the simultaneous band (2) over the not-yet-rejected covariate points, reject the hypotheses at points where the band excludes zero, and repeat on the remaining points until no further rejections occur.]
The proof of validity for this procedure relies on showing that it is a closed testing procedure10, in part via noting that Wα,V ≤ Wα,U for V ⊂ U. The full proof is available in the Appendix as Theorem 1. If Hi is rejected, place xi in D if its posterior mean PTE is positive or in SC if its posterior mean PTE is negative, and if Hi is not rejected place xi in S \ D.
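As a concrete illustration of Algorithm 1 at a single credible level, the following R sketch (written directly from the description above, not the authors' reference implementation; names are hypothetical) repeatedly recomputes the simultaneous band over the not-yet-rejected covariate points:

```r
# Sketch of the batch step-down procedure at a single credible level:
# repeatedly compute the simultaneous band over the not-yet-rejected covariate
# points, reject points whose band excludes zero, and stop when nothing new
# is rejected. delta_draws is a draws-by-points matrix of posterior PTE draws.
step_down_subgroups <- function(delta_draws, alpha = 0.05) {
  delta_bar <- colMeans(delta_draws)
  delta_sd  <- apply(delta_draws, 2, sd)
  active   <- seq_len(ncol(delta_draws))   # points still being tested
  rejected <- integer(0)
  repeat {
    scaled_dev <- abs(sweep(delta_draws[, active, drop = FALSE], 2, delta_bar[active])) /
      matrix(delta_sd[active], nrow(delta_draws), length(active), byrow = TRUE)
    W <- quantile(apply(scaled_dev, 1, max), 1 - alpha)
    lower <- delta_bar[active] - W * delta_sd[active]
    upper <- delta_bar[active] + W * delta_sd[active]
    new_rej <- active[lower > 0 | upper < 0]
    if (length(new_rej) == 0) break
    rejected <- c(rejected, new_rej)
    active <- setdiff(active, new_rej)
    if (length(active) == 0) break
  }
  list(D            = rejected[delta_bar[rejected] > 0],  # evidence of benefit
       S_complement = rejected[delta_bar[rejected] < 0],  # evidence of no benefit
       undetermined = active)                             # S \ D
}
```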
If, in a Bayesian interpretation, one wishes to use an interval null hypothesis such as Hi = {θ : −ε ≤ Δ(xi) ≤ ε}, then RM should be the set of xi for which the band does not overlap [−ε, ε]. Then D and SC are constructed by comparing the posterior mean PTE to ε and −ε, respectively, for xi for which Hi is rejected.
The above algorithm is conceptually simple and may easily be implemented around a given method for constructing simultaneous confidence bands. However, a somewhat more involved algorithm (Algorithm 2) implementing the same procedure can be used to compute the maximum credible level li at which the test of no effect at each covariate point xi is rejected (similar to adjusted p-values11), and may be used to quickly construct credible subgroups at various credible levels after performing a single expensive computation, rather than repeating Algorithm 1 for every credible level. The credible subgroups for any level 1 − α are then D = {xi ∈ C : li ≥ 1 − α and Δ̄(xi) > 0} and S = C \ {xi ∈ C : li ≥ 1 − α and Δ̄(xi) < 0}.
[Algorithm 2 not reproduced here: it computes the maximum credible level li for each covariate point xi.]
Power and Choice of Covariate Space
When fitting a semiparametric regression model, the standard errors of mean surface estimates can quickly become very large in regions of the covariate space in which observations are sparse. Additionally, estimates of the PTE have very large standard errors and are generally unreliable in regions lacking common support (presence of observations in both the treatment and control group). As a result, the credible subgroups method may have very low power (ability to classify points as in D or SC) in these observation-sparse regions. While we consider this to be a positive feature of semiparametric regression (given that we presumably do not want to assume a strict parametric form for the PTE), globally adjusting for multiplicity of covariate points means that unnecessarily testing the PTE in these observation-sparse regions dilutes power even in observation-rich regions. This problem is compounded by the fact that semiparametric models yield much looser dependence in the posterior distribution of the PTE surface across the covariate space.
While restricting the covariate space to the Cartesian product of the empirical ranges of each covariate is an obvious first step, the covariate space often may be further restricted following local power simulations. If asymptotic normality of the joint posterior distribution of the PTEs can be assumed, we may use Algorithm 3 to estimate conditional local power.
[Algorithm 3 not reproduced here: it estimates, at each covariate point, the conditional power to detect a hypothesized PTE surface under asymptotic joint normality.]
The power estimate is derived by assuming asymptotic joint normality of the posterior distribution of the PTEs and evaluating the frequentist probability that each covariate point is placed in D, with the intention of determining the power to detect benefit (rather than harm). The assumption of a constant treatment effect is material: the estimate of the PTE at one covariate point may be strongly affected by the true value of the PTE at nearby points, and the shape of the PTE surface affects the dependency of the posterior distribution of the PTE across the covariate space, and thus Wα,C. Other PTE surfaces may be assumed and power estimated for them, but without prior information, the constant surface computations are perhaps most easily interpreted. Additionally, the above algorithm may depend on nuisance parameters such as error variances. It may be possible to estimate these nuisance parameters through empirical Bayes methods (e.g., using restricted maximum likelihood to estimate the hyperparameters from their marginal distributions) using the observed data without introducing much bias.
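A rough Monte Carlo sketch of this conditional power calculation, under the stated assumptions of a constant true PTE delta0, a plug-in covariance Sigma for the estimated PTEs, and a plug-in critical value W (all supplied by the analyst; the names are hypothetical), is:

```r
library(MASS)

# Rough Monte Carlo sketch of conditional local power under a constant true PTE.
# delta0, Sigma (covariance of the estimated PTEs over the grid), and W (a
# plug-in simultaneous critical value) are analyst-supplied assumptions, e.g.,
# from a pilot fit or empirical Bayes estimates of the nuisance parameters.
local_power <- function(delta0, Sigma, W, n_sim = 1000) {
  K <- nrow(Sigma)
  sds <- sqrt(diag(Sigma))
  draws <- mvrnorm(n_sim, mu = rep(delta0, K), Sigma = Sigma)  # simulated PTE estimates
  lower <- draws - W * matrix(sds, n_sim, K, byrow = TRUE)     # simultaneous lower bounds
  colMeans(lower > 0)  # per-point probability of detecting benefit (inclusion in D)
}
```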
These power estimates may be used to further restrict the covariate space to, e.g., points for which the intended study and analysis have at least 5% power to detect the hypothesized benefit at the 95% credible level. While in many cases the gains in power across the remaining covariate space may be modest, the above procedure costs little more than computation time when the criterion for retaining a covariate point is relatively lax. Additionally, the tested covariate space may be adjusted before testing using considerations such as parsimony (e.g., choosing a circumscribing hyperrectangle) and the requirement of common support between study arms. Finally, if information about the covariate distribution is available before the start of the trial, via either knowledge of the population distribution or quotas for strata, pre-trial power calculations or sample size estimations may be carried out in a similar manner.
Regardless of whether or not the covariate space is restricted in the manner described above, it is important to recognize that classification of covariate points into D, S \ D, and SC is risky in regions of the covariate space in which observations are sparse. Even greater skepticism should be applied to conclusions outside of the observed covariate range, as these are extrapolations of the underlying regression model.
Simulation Study
We perform a simulation study to evaluate certain frequentist properties of the credible subgroups generated by linear, spline, and Bayesian additive regression trees models. A necessary property is valid (including conservative) coverage, i.e., D ⊆ B ⊆ S at least 100(1 − α)% of the time. Given valid coverage, we compare regression models primarily by the sensitivity of D (how much of B is contained in D). We also evaluate the sensitivity of D under the step-down procedure relative to that under the single-step procedure.
We simulate 1000 data sets with n = 100 patients in each treatment arm. Results for simulations with n = 25, n = 50, and n = 75 are presented in the Supplementary Materials. Each subject i has a covariate vector xi = (1, xi2, xi3) with xi2 = 0, 1 with equal probability and xi3 continuously uniformly distributed on [−3, 3], a deterministic treatment assignment ti, and a conditionally normally distributed response yi. The covariates are used as both prognostic and predictive variables.
The outcomes are generated as Normal(Δ(xi), 1) with Δ(xi) specified in the following six cases. In the null case, Δ(xi) = 0. In the binary case, Δ(xi) = xi2. In the linear case, Δ(xi) = xi3. In the near-linear case, Δ(xi) is a smooth, nearly linear function of xi3 (formula not reproduced here). In the threshold case, Δ(xi) = sign(xi3)(xi3/3)^(1/3) + 1/4. In the non-monotone case, Δ(xi) = 1/2 − 3(xi3/3)^2.
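For concreteness, a data-generating sketch in R for these mechanisms might look as follows; the near-linear case is omitted because its formula is not reproduced above, and prognostic main effects are set to zero for simplicity, so this illustrates the setup rather than reproducing the exact simulation code.

```r
# Sketch: generate one simulated data set under the mechanisms above. The
# near-linear case is omitted and prognostic main effects are set to zero.
simulate_trial <- function(n = 100, dgm = c("null", "binary", "linear",
                                            "threshold", "non-monotone")) {
  dgm <- match.arg(dgm)
  x2  <- rbinom(2 * n, 1, 0.5)       # binary covariate, 0/1 with equal probability
  x3  <- runif(2 * n, -3, 3)         # continuous covariate on [-3, 3]
  trt <- rep(c(0, 1), each = n)      # deterministic assignment, n per arm
  delta <- switch(dgm,
    "null"         = rep(0, 2 * n),
    "binary"       = x2,
    "linear"       = x3,
    "threshold"    = sign(x3) * abs(x3 / 3)^(1 / 3) + 1 / 4,
    "non-monotone" = 1 / 2 - 3 * (x3 / 3)^2)
  y <- rnorm(2 * n, mean = trt * delta, sd = 1)
  data.frame(y, x2, x3, trt)
}
```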
To each data set we fit a linear, spline, and Bayesian additive regression trees model. For the linear and spline models, we place a vague InverseGamma(0.001, 0.001) prior on the error variance and flat priors on fixed effect coefficients. For the spline model we place InverseGamma(2, 1) shrinkage priors on the random effect variance, corresponding to the spline roughness penalty for the continuous covariate (the binary covariate does not require a spline). The Bayesian additive regression trees model is fit using the default settings in the R package BayesTree. All Gibbs samplers are run for 100 burn-in and 1000 retained iterations, which appears to be acceptable for these simple models.
To determine credible subgroups at the 80% credible level, we use as the covariate space the grid in which x1 = 1, x2 = 0, 1, and x3 ranges from −3 to 3 in increments of 0.1. We compute the result under both the single-step and step-down procedures using the simultaneous credible band (2).
Table 1 displays the average summary statistics for 80% credible subgroups for each model and data generating mechanism (DGM) at n = 100 patients per arm. Each model has sufficient coverage, except for the linear model in the non-linear cases. Given sufficient coverage, the sensitivity of D will usually be the driving factor in choosing a model. In this regard, the spline model performs better than Bayesian additive regression trees in all cases except when the binary covariate drives the treatment effect heterogeneity, but the spline model’s advantage is small in the “threshold” case in which the continuous covariate behaves similarly to a binary one. When the true treatment effect heterogeneity is linear, the linear model outperforms both with respect to sensitivity of D. Generally, we recommend the spline model when continuous predictive covariates are present, as even a small departure from linearity can render the coverage of the linear model insufficient (see the “near-linear” case). Finally, the step-down method consistently improved the sensitivity of D, sometimes to a large extent (nonparametric fits in the near-linear case).
Table 1.
Simulation study results. Operating characteristics of 80% credible subgroups with n = 100 patients in each study arm. Struck-through sensitivities indicate insufficient coverage and should be treated with caution. Step-down efficiency is the ratio of the sensitivity of D using the step-down procedure (shown) over the single-step procedure (not shown).
| Data Generating Mechanism | Model | Coverage | Sensitivity of D | Step-Down Efficiency |
|---|---|---|---|---|
| Null Effect | Linear | 0.89 | – | – |
| | Spline | 0.92 | – | – |
| | BART | 0.99 | – | – |
| Binary | Linear | 0.91 | 0.97 | 1.01 |
| | Spline | 0.95 | 0.56 | 1.05 |
| | BART | 0.98 | 0.82 | 1.05 |
| Linear | Linear | 0.88 | 0.90 | 1.02 |
| | Spline | 0.94 | 0.77 | 1.09 |
| | BART | 1.00 | 0.70 | 1.08 |
| Near-Linear | Linear | 0.75 | ~~0.71~~ | 1.04 |
| | Spline | 0.97 | 0.39 | 1.19 |
| | BART | 1.00 | 0.27 | 1.25 |
| Threshold | Linear | 0.61 | ~~0.85~~ | 1.03 |
| | Spline | 0.96 | 0.52 | 1.07 |
| | BART | 0.99 | 0.48 | 1.07 |
| Non-Monotone | Linear | 0.23 | ~~0.52~~ | 1.10 |
| | Spline | 0.96 | 0.63 | 1.05 |
| | BART | 0.97 | 0.46 | 1.08 |
In the absence of model mis-specification, all methods appear to have conservative coverage. In fact, the realized error rate does not rise far past α/2. The relevance of α/2 as an error rate here is that credible subgroups can only make an error in one direction at each covariate point: if the true PTE is positive, the only error is under-estimation, while if the true PTE is non-positive, the only error is over-estimation; however, we cannot know a priori the sign of the PTE. Thus the apparent conservatism is due to the fact that none of the displayed simulations represent the worst-case scenario that the procedure protects against: a small-magnitude treatment effect that crosses zero frequently.
Analysis of Four Trials of an Alzheimer’s Disease Treatment
We consider data from a sequence of four clinical trials for Alzheimer’s disease treatments carried out by AbbVie, all of which include arms for a placebo and the same “standard of care” treatment. We wish to compare the standard of care to the placebo with respect to change in disease severity over 12 weeks, using data from all four trials. Combined, the studies comprise 369 complete-case patients from 9 countries.
We consider six baseline patient characteristics: disease severity, change in disease severity during run-in, long-term cognitive decline rate, age, carrier status of the ApoE4 allele, and sex. Disease severity is measured by the 11-item Alzheimer’s Disease Assessment Scale—Cognitive Subscale (ADAS-Cog 11)12. The change in severity score in the 3–4 week run-in period between screening and randomization, which we call prechange, is included as a main effect in an attempt to adjust for the “learning effect” in which patients become familiar with the assessment instrument; however, it is not included as a predictive covariate because it is not thought to be useful to practitioners, due to its high variability and the treatment delay its measurement entails. On the other hand, we include as a predictive covariate the long-term cognitive decline rate, d-rate, which is defined as the total drop in score on the Mini Mental State Examination (MMSE) divided by the time, in years, since onset of first symptoms. Age, though intuitively a potential factor in response, did not appear to have much predictive utility for efficacy in a similar trial analyzed using a similar credible subgroups approach.13 Since its inclusion in the model for the present analysis was also detrimental to penalized model fit as evaluated by the deviance information criterion and log pseudo-marginal likelihood, we excluded it from our primary analysis. The ApoE4 allele and sex are considered risk factors for development of Alzheimer’s disease14. The primary endpoint, improvement, is the negative difference in severity between end-of-study and baseline assessment scale measurements, so a positive value is a good outcome.
Our outcome model may be broadly summarized (in R-like syntax) as
improvement ~ treatment + prechange + f(severity) + f(drate) + apoe4 + sex + treatment:(g(severity) + g(drate) + apoe4 + sex) + r(country) + s(country):treatment   (7)
where r(·) and s(·) represent traditional centered, normally-distributed random intercepts and slopes, respectively, f(·) and g(·) represent penalized cubic splines as described in the “Semiparametric Estimation of Personalized Treatment Effects” Section, colons denote interactions, and errors are normally distributed. We place knots for spline terms at increments of 2 across each observed range. Variances for error, random effects, and penalized spline coefficients are given vague InverseGamma(0.001, 0.001) priors. Fixed effects are given flat priors.
We used the method of the “Power and Choice of Covariate Space” Section to choose the covariate space over which to produce credible subgroups, beginning with a grid with increments of 1 unit on each scale and spanning the Cartesian product of the observed ranges. Figure 2 plots the distribution of covariate points and shades the region over which the 95% credible subgroups have at least 5% power to detect a uniform benefit of 1 standard deviation (5 points). Restricting severity to [9, 49] and d-rate to [0, 8], we include this entire region and exclude less than 9% of patients, while reducing the size of the covariate space by more than half. It can be seen that observations, and especially observations from different treatment arms, are hopelessly sparse in much of the excluded region.
Figure 2.

The shaded region is the region for which the likelihood of detecting a 1 standard deviation (5-point) benefit at the 95% credible level is at least 5% when the entire empirical covariate space is used. Patients receiving placebo are represented by ×, and those receiving the standard of care by +.
The model was fit by Gibbs sampling using 100,000 iterations after 1000 burn-in iterations. Trace plots suggest near-immediate convergence and good mixing, and the time series–based effective sample sizes for most coefficients are above 75,000. Effective sample size was lowest for some country random effects with few patients, and for certain random effect variances.
Figure 3 displays the nonparametrically fitted effect curves. Although the posterior mean effects of the continuous covariates are at least plausibly linear except for the prognostic effect of rate of decline, there is a possibility of nonlinearity and even nonmonotonicity in the remaining effects, so the spline terms should be retained in the model.
Figure 3.

Estimated nonlinear effects and 95% pointwise credible bands on standardized covariate scale, relative to sample mean. Rug plot represents observed covariate distribution.
Figure 4 displays the credible subgroups at the 95% level, using the step-down testing procedure (Algorithm 1). The exclusive credible subgroup generally contains patients with higher-than-average severity and rate of decline. The exclusive credible subgroup in this case includes approximately 17% more cells than the corresponding exclusive credible subgroup when the full observed ranges of severity and d-rate are used, and approximately 3% more than when the single-step (instead of the step-down) procedure is used. The apparent pixelation is due to the integer grid used to represent continuous covariates.
Figure 4.

Credible subgroups at the 95% level. Green (dark) points represent the exclusive credible subgroup, and yellow (light) the remainder of the inclusive credible subgroup.
We also fit two other models: a version of (7) in which the penalized spline terms were replaced with linear effects, and the default implementation of Bayesian additive regression trees from the R package BayesTree for 100,000 iterations, thinned to 10,000 iterations due to memory considerations, after 1000 burn-in iterations. Figure 5 displays the 95% credible subgroups for the linear and Bayesian additive regression trees models, and Figure 6 compares the posterior mean PTE surface between (7) and its linear counterpart. Such visualizations of the estimated PTE surface may be useful to trialists who wish to more fully understand possible nonlinear features of the surface that would be lost under a linear model. As may be expected, the linear model simplifies the estimated PTE surface, which, along with the variances of the PTEs, yields a smoother exclusive credible subgroup. By contrast, Bayesian additive regression trees, which divides the covariate space into rectangular cells having constant PTE in each cell, yields a more rectangular exclusive credible subgroup. Due to the superior performance of the spline-based model in the simulation study for the nonlinear continuous effect case, we promote the Figure 4 credible subgroups (which are intermediate to those in Figure 5) as the best choice. The resulting credible subgroups are consistent with the observation that the linear model fits poorly while the Bayesian additive regression trees fit is too conservative in similar situations.
Figure 5.

Credible subgroups at the 95% level for the linear (top) and Bayesian additive regression trees (bottom) models.
Figure 6.

Posterior mean PTE surfaces for the semiparametric (top) and linear (bottom) models.
Discussion
Because of the relative independence of the credible subgroups inferential tools and the selected regression method—all that is needed is a sample from the joint posterior of the personalized treatment effects—the choice of parametric, semiparametric, or nonparametric methods may be made with full focus on flexibility and applicability to the problem, rather than being muddled by technical considerations about the associated inferential process. Such freedom may make “black box” nonparametric methods such as Bayesian additive regression trees appealing for their flexibility, but also allows the use of more interpretable models such as additive spline models if desired.
Increased flexibility of semiparametric and nonparametric regression models comes at a cost in terms of power, due to the looser dependence of the PTEs across the covariate space. As such, we have presented a pair of techniques for increasing power: a step-down multiple testing procedure, and a process for culling the tested covariate space of points for which we expect low power or whose inferential value is otherwise questionable. Culling the covariate space may sometimes be risky, especially when the space is well-covered by the observed data, but the step-down procedure carries no risk and costs only additional computation time. Of course, these techniques are useful for fully parametric models as well.
The more flexible models presented in this paper may also yield credible subgroups that are not contiguous in covariate space. It may be undesirable to restrict ourselves to contiguous credible subgroups, as doing so would implicitly assume that the benefiting subgroup itself is contiguous. Regardless of contiguity, the shape of the credible subgroups may not be easily described in terms of many, especially continuous, covariates. In such a case it may be useful to provide a “calculator” which returns the maximum credible level (as in Algorithm 2) for a given patient’s covariate profile15.
Some additional research is still required. In particular, the credible subgroups methods rely on evaluating tail probabilities of posterior distributions and distributions of estimates computed from those results, which makes estimating and handling Monte Carlo error significantly more difficult, and necessitates larger posterior samples than usual. Additionally, the conditional power estimation presented may become prohibitively computationally expensive if expanded to a pre-study power simulation, so it may be worthwhile to seek analytical approximations. Finally, it may be desirable to replace or discard assumptions about the form of the conditional response distributions, either by using different parametric distributions (e.g., binomial) or one of the many Bayesian nonparametric options (e.g., Dirichlet process priors)16.
Acknowledgments
This work was supported by AbbVie, Inc, the University of Minnesota Doctoral Dissertation Fellowship, and the National Cancer Institute [1-R01-CA157458-01A1 to PMS and BPC]. AbbVie contributed to the design, research, interpretation of data, reviewing, and approving of this publication.
Theorems and Proofs
Definition 1. Asymptotic Simultaneous Confidence Band. The two-sided 1 − α asymptotic simultaneous confidence band for Δ over C is bounded at each point x ∈ C by Δ̂(x) ± Wα,C ŝe[Δ̂(x)],
where Wα,C is the 1 − α quantile of the distribution of sup_{x ∈ C} |Δ̂(x) − Δ(x)| / ŝe[Δ̂(x)].
Theorem 1. Batch Step-Down Testing Procedure. Let C be the restricted covariate space, θ be the vector of all model parameters, and Hi = {θ : Δ(xi) = 0}. The following testing procedure controls the family-wise type I error rate at level α.
[Algorithm not reproduced here; it is the batch step-down procedure given as Algorithm 1 in the main text.]
The proof of Theorem 1 relies on showing that the batch step-down testing procedure is a closed testing procedure10. Let ℋ be a collection of hypotheses closed under intersection: Hk, Hl ∈ ℋ implies Hk ∩ Hl ∈ ℋ. Furthermore, let φk be an α-level test of Hk so that φk = 1 if and only if it rejects Hk locally (independent of other hypotheses and tests). Then a closed testing procedure is a procedure which rejects Hk if and only if φl = 1 for all l such that Hl ⊆ Hk. Any closed testing procedure controls the family-wise type I error rate at level α because in order to reject at least one true hypothesis the procedure must reject the intersection of all true hypotheses, which is tested by an α-level test.
Proof of Theorem 1. Let HU be the hypothesis that Δ(x) = 0 for all x ∈ U, and φU be the local test of that hypothesis which rejects if and only if there exists an xi ∈ U for which the band does not include zero. We first show that if φU = 1 and the band over U did not include zero at xi ∈ V ⊂ U, then φV = 1. Suppose φU = 1 and the band over U did not contain zero at xi ∈ V ⊂ U. Let Wα,U be the 1 − α quantile of the distribution of
WU = sup_{x ∈ U} |Δ(x) − Δ̄(x)| / sd[Δ(x) | y].
Note that since V ⊂ U, we have WV ≤ WU, and thus Wα,V ≤ Wα,U. Then the band over V is nowhere wider than the band over U, and since the band over U did not contain zero at xi, the band over V also does not contain zero at xi, so φV = 1.
We now show that when a point xi is marked for rejection, its hypothesis is rejected by all intersections in ℋ involving Hi. Let xi ∈ R1, i.e., a point whose null-effect hypothesis was marked for rejection on the first iteration. Since Hi was marked on the first iteration, there is at least one x (xi itself) in C for which the band does not include zero; thus φC = 1. Additionally, any other hypothesis in ℋ which is an intersection involving Hi is a hypothesis HV with xi ∈ V ⊂ C, and thus is also rejected by the corresponding local test. Thus Hi may be globally rejected.
Consider now xi ∈ RM, M > 1. Every xj ∈ C \ TM has previously had its hypothesis globally rejected, thus any local test of a hypothesis for a set containing that point has already been locally rejected. Therefore we need only consider hypotheses HU such that U ⊆ TM. The argument for points in R1 may then be reused, replacing C with TM. Thus the procedure is a closed testing procedure, and therefore controls the family-wise type I error rate at α. □
Remark 1. Theorem 1 applies to posterior credible bands insofar as they correspond to the confidence band via the asymptotic joint normality of the posterior of Δ(x).
Quantile-Based Simultaneous Credible Bands
Let FΘ(θ) = P[Θ ≤ θ] be the cumulative distribution function of Θ, GΘ(θ) = P[Θ < θ] be its left-continuous counterpart, and FΘ⁻¹, GΘ⁻¹ be their generalized inverses. We may then use the simultaneous credible band
FΔ(x)|y⁻¹(Wα,C) ≤ Δ(x) ≤ GΔ(x)|y⁻¹(1 − Wα,C) for all x ∈ C   (8)
with Wα,C chosen to achieve a desired credible level by the following construction. Let Wα,C be the α quantile of the distribution of
WC = inf_{x ∈ C} min{FΔ(x)|y(Δ(x)), 1 − GΔ(x)|y(Δ(x))}.   (9)
Here, we take the minimum of the lower and upper tail probabilities (via F and G, respectively) of a draw of Δ(x), and then the infimum of those tail probabilities across the entire covariate space to obtain WC. Then, Wα,C is used as a multiplicity-adjusted tail probability to obtain bounding quantiles. Thus (8) defines the pre-image of the event {WC ≥ Wα,C} and therefore defines a 1 − α credible set. The distribution and quantile functions of WC may be estimated from the posterior sample.13
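A small R sketch of this construction, using empirical distribution functions computed from the posterior sample (function and argument names are hypothetical), is:

```r
# Sketch: quantile-based simultaneous credible band estimated from a posterior
# sample (delta_draws: draws in rows, covariate points in columns).
quantile_band <- function(delta_draws, alpha = 0.05) {
  M <- nrow(delta_draws)
  # Empirical lower and upper tail probabilities of each draw at each point
  lower_tail <- apply(delta_draws, 2, function(d) rank(d, ties.method = "max") / M)
  upper_tail <- apply(delta_draws, 2, function(d) (M - rank(d, ties.method = "min") + 1) / M)
  # Per-draw infimum over the covariate space of the smaller tail probability, as in (9)
  w_draws <- apply(pmin(lower_tail, upper_tail), 1, min)
  w_alpha <- quantile(w_draws, alpha)   # multiplicity-adjusted tail probability
  # Pointwise quantiles at the adjusted tail probability give the band (8)
  list(lower = apply(delta_draws, 2, quantile, probs = w_alpha),
       upper = apply(delta_draws, 2, quantile, probs = 1 - w_alpha))
}
```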
Footnotes
Supplementary Materials
Data and R code to reproduce the results, plots, and simulations in this paper are available online.
References
- [1]. Breiman Leo. Random forests. Machine Learning, 45(1):5–32, 2001.
- [2]. Foster Jared C, Taylor Jeremy MG, and Ruberg Stephen J. Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880, 2011.
- [3]. Chipman Hugh A, George Edward I, and McCulloch Robert E. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
- [4]. Hill Jennifer L. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
- [5]. Berger James O, Wang Xiaojing, and Shen Lei. A Bayesian approach to subgroup identification. Journal of Biopharmaceutical Statistics, 24(1):110–129, 2014.
- [6]. Schnell Patrick M, Tang Qi, Offen Walter W, and Carlin Bradley P. A Bayesian credible subgroups approach to identifying patient subgroups with positive treatment effects. Biometrics, 72(4):1026–1036, 2016.
- [7]. Gelman Andrew, Carlin John B, Stern Hal S, and Rubin Donald B. Bayesian Data Analysis, volume 2. Chapman & Hall/CRC, Boca Raton, FL, USA, 2014.
- [8]. Coull Brent A, Ruppert David, and Wand MP. Simple incorporation of interactions into additive models. Biometrics, 57(2):539–545, 2001.
- [9]. Holm Sture. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979.
- [10]. Marcus Ruth, Peritz Eric, and Gabriel K Ruben. On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655–660, 1976.
- [11]. Wright S Paul. Adjusted p-values for simultaneous inference. Biometrics, 48:1005–1013, 1992.
- [12]. Rosen Wilma G, Mohs Richard C, and Davis Kenneth L. A new rating scale for Alzheimer’s disease. The American Journal of Psychiatry, 141(11):1356–1364, 1984.
- [13]. Schnell Patrick M, Tang Qi, Müller Peter, and Carlin Bradley P. Subgroup inference for multiple treatments and multiple endpoints in an Alzheimer’s disease treatment trial. Annals of Applied Statistics, 11(2):949–966, 2017.
- [14]. Burns Alistair and Iliffe Steve. Alzheimer’s disease. British Medical Journal, 338(7692):467–471, 2009.
- [15]. Schnell Patrick M. Credible Subgroups: Identifying the Population that Benefits from Treatment. PhD thesis, University of Minnesota, 2017.
- [16]. Müller Peter and Mitra Riten. Bayesian nonparametric inference–why and how (with discussion). Bayesian Analysis, 8(2):269–302, 2013.