Abstract
This paper develops a Bayesian model with a flexible link function connecting a binary treatment response to a linear combination of covariates, a treatment indicator, and the interaction between the two. Generalized linear models allowing data-driven link functions are often called “single-index models” and are among the most popular semi-parametric modeling methods. In this paper, we focus on modeling heterogeneous treatment effects, with the goal of developing a treatment benefit index (TBI) incorporating prior information from historical data. The model makes inference on a composite moderator of treatment effects, summarizing the effect of the predictors within a single variable through a linear projection of the predictors. This treatment benefit index can be useful for stratifying patients according to their predicted treatment benefit levels and can be especially useful for precision health applications. The proposed method is applied to a COVID-19 treatment study.
Keywords: Bayesian single-index models, Heterogeneous treatment effects, Precision medicine
Introduction
In precision medicine, a critical concern is to characterize individuals’ heterogeneity in treatment responses in order to enable individual-specific treatment decisions [1–3]. Tailoring medical treatments according to individuals’ characteristics requires inferring individual-level treatment effects (ITE), as opposed to inferring the treatment effects on average across the entire population. Developing an individualized treatment rule (ITR) (see, e.g., [4–14]) then follows naturally from drawing inferences about the ITE. As an alternative route to developing an ITR, direct optimization of the population average over a class of ITRs (see, e.g., [9, 10, 15, 16]) may be considered.
Substantial developments have been made in the statistical methodology for ITE estimation (see, e.g., the review provided in [17]). Examples within the frequentist paradigm include the R-learner [18], the kernel-based multi-task learner [19], neural networks [20], tree-based ensembles [21], and semi-parametric regression [22], among many others. In the Bayesian paradigm, one prominent approach is causal Bayesian additive regression trees (BART) (see, e.g., [23–26]). Although many statistical learning methods perform well in capturing complex nonlinear relationships for individual treatment effects, in this paper we focus on the Bayesian estimation of single-index regression models for ITE, as there has been no specific work designed to model heterogeneous treatment effects using a single-index model in the Bayesian paradigm, despite the usefulness that arises from the simplicity of its model formulation.
A single-index model (see, e.g., [27, 28]) is one of the most popular semi-parametric models and provides an efficient way of dealing with multivariate nonparametric regression. Single-index models expand the scope of generalized linear models through a flexible data-driven link function. The model summarizes the effect of the predictors within a single variable through a linear projection of the predictors, called the (single) index. In the context of modeling heterogeneous treatment effects, such an index corresponds to a composite moderator of treatment effects. We will demonstrate that such an index is useful for optimizing treatment decisions, and that the approach provides a natural way to summarize the uncertainty associated with this composite moderator through its posterior distribution.
There are broadly two lines of research on Bayesian single-index models. One approach employs a spline-based representation of the link [29–33]. The other line of research employs a Gaussian process-type representation of the link [34–38], where the unknown link function is assumed to be a Gaussian process a priori. In this article, we take the former spline-based approach as it allows us to easily incorporate an identifiability constraint on the link function used to model the heterogeneous treatment effect term.
In this paper, we consider a treatment variable A taking a value in with the associated randomization probabilities in the context of randomized clinical trials (RCTs), with the corresponding potential outcomes denoted as . Depending on A, the observed outcome is , where the outcome Y is assumed to follow a distribution in the exponential family. Specifically, we focus on a binary outcome Y ∈ {0, 1}, where we assume, without loss of generality, that the value Y = 0 is desired so that Y = 1 indicates a bad outcome (e.g., death). On the population level, this means that a small value of h(E[Y]) is desired, where h denotes the canonical link of the assumed exponential family distribution. For the binary outcome considered here, h is the logit function.
The proposed single-index approach to modeling heterogeneous treatment effects was motivated by an application to a COVID-19 convalescent plasma (CCP) treatment study [39]. One of the primary goals of this RCT was to guide CCP treatment recommendations by providing an estimate of a differential treatment outcome when a patient is treated with CCP vs. without CCP, where a larger differential in favor of CCP would indicate a more compelling reason for recommending CCP. The estimated index, defined as a linear combination of covariates , which is part of the heterogeneous treatment effect term of the model, can be used to discover profiles of patients with COVID-19 associated with different benefits from CCP treatment. Specifically, the covariates are observed pretreatment measurements and predictors of . Our goal is to utilize the information in to develop an ITR that optimizes the value of h(E[Y]) for future patients.
Method
Optimal Individualized Treatment Rules
In this subsection, we define an optimal ITR. The Bayes decision minimizes, over treatment decision (action) , the patient-specific posterior expected loss for a patient with pretreatment characteristic . Let us define the loss function for making treatment decision a as follows:
| 1 |
where represents the model parameters characterizing the relationship between the potential treatment outcomes and predictors . In (1), is the expected outcome under a particular treatment assignment a. Let us denote the observed clinical trial data as consisting of triplets with , where is a set of observed pretreatment covariates and is an observed outcome for individual i.
Viewing the loss function in (1) as a function of treatment assignment a given a particular , the optimal Bayes decision will minimize the posterior expected loss given , i.e.,
where the expectation is taken with respect to the posterior distribution of given the observed data . In particular, if we define the loss contrast , then the above optimal Bayes decision is equivalent to
| 2 |
(where is the indicator function), which we define as the optimal ITR.
We will utilize the following standard causal inference assumptions (see, e.g., [40, 41]):
- no unmeasured confounding, given ;
- positivity, , which together imply ; and
- the stable unit treatment value assumption (SUTVA).
Under these assumptions, we can write the loss contrast function in (2) as follows: . Therefore, we can construct the optimal Bayes decision (2) based on posterior inference on the canonical parameter of the exponential family response Y. In the following subsection, we will specify our model for the distributions of and , to estimate the optimal ITR (2).
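The identification step behind this statement can be sketched as follows. The notation is an assumption made for illustration only: the treatment is coded as a = 1 (active) and a = 0 (control), Y^{(a)} denotes the potential outcome, X the pretreatment covariates, and h the canonical link; the sign convention of the contrast is likewise assumed.

```latex
\begin{align*}
\mathbb{E}\bigl[Y^{(a)} \mid X = x\bigr]
  &= \mathbb{E}\bigl[Y^{(a)} \mid X = x,\, A = a\bigr]
     && \text{(no unmeasured confounding, positivity)} \\
  &= \mathbb{E}\bigl[Y \mid X = x,\, A = a\bigr]
     && \text{(SUTVA / consistency)},
\end{align*}
so the loss contrast between the two treatment options reduces to a
contrast of observable regression functions on the link scale:
\[
h\bigl(\mathbb{E}[Y \mid X = x,\, A = 1]\bigr)
 - h\bigl(\mathbb{E}[Y \mid X = x,\, A = 0]\bigr).
\]
```

For a binary outcome with the logit link, this contrast is the log odds ratio of a bad outcome (active vs. control) at covariate value x.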
Model and Prior Specification
Model
Let be a vector of the treatment outcomes, with each independently following a specific exponential family distribution with density
| 3 |
where the unknown parameters (which we collectively denote as ) will be estimated in a Bayesian framework. In (3), and are known functions specific to the given member of the exponential family, whereas is an unknown flexible function and is an unknown dispersion parameter ( specializes to a one-parameter exponential family distribution). Throughout we will drop from (3) because it is fixed at unity in our motivating dataset that has Bernoulli responses.
The canonical parameter in (3) of the response distribution is related to the treatment decision loss function in (1), through , under the standard causal inference assumptions.
Within the specification of in model (3), the first term represents the pretreatment covariates’ “main” effect, whereas the second term is the -by-A interaction effect. This interaction effect is characterized by an unspecified treatment a-specific smooth function g(u, a) which is a function of a linear projection . The projection vector is subject to , i.e., restricted to , where is the -dimensional unit sphere, and the single-index provides a dimension reduction specifically for the -by-A interaction effect. In (3), for any and , we shall impose the following identifiability condition for the component g
| 4 |
which separates the component of interest (the “prescriptive” term representing the heterogeneous treatment effect), from the component (the “prognostic” term that does not represent the heterogeneous treatment effect). In general, the covariates, , represented in the terms and in (3), are not necessarily the same variates, but for notational simplicity, the same notation for these sets of covariates was employed. As another abuse of notation for simplicity, the treatment A’s main effect, which can be represented by for some unknown will be estimated, together with the model intercept, as part of the component in (3).
Remark 2.1
Model (3) with the identifiability condition (4) is more suitable for conducting posterior inference on heterogeneous treatment effects than the model with , because this particular parametrization (3) is invariant to the choice of coding of A. In the latter model, the choice of the treatment A coding can meaningfully impact posterior inferences because the two effects and can alias each other, particularly when the same set enters into both terms. Under condition (4), the effects captured by the prognostic term, , and the prescriptive term, , can be distinguished regardless of the treatment A coding.
For an individual with pretreatment characteristics , the loss contrast in (2) under model (3) is
| 5 |
The loss contrast (5) indicates that only the parameters g and (and not and ) in model (3) are used to specify the ITR (2); hence g and correspond to the “signal” parameters of interest. Given definition (2), we will define a “treatment benefit index” (TBI) in terms of a (posterior) probability,
| 6 |
that is, the probability of the (active) treatment providing a greater benefit than the control treatment (for a patient with pretreatment characteristic ), in which the probability is evaluated with respect to the posterior distribution of . The optimal Bayes decision in (2) is then represented by . Since a large value of (6) indicates a large expected “benefit” of taking the active treatment vs. control , the TBI in (6) constructs a “gradient” of the active treatment’s benefit that ranges from 0 to 1, as a function of patient characteristic . Furthermore, for each , we can obtain a posterior distribution of the treatment a-specific expected outcome based on the posterior distribution of the parameters .
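A minimal sketch of how a TBI of this kind could be computed from MCMC output is given below. It assumes that posterior draws of the loss contrast are already available, that the contrast is defined as the loss under the active treatment minus the loss under control (so that a negative contrast favors the active treatment), and that the function names and simulated draws are purely hypothetical.

```python
import numpy as np

def treatment_benefit_index(contrast_draws):
    """Posterior probability that the active treatment has the smaller loss.

    contrast_draws: array of shape (T, n) holding T posterior draws of the
    loss contrast (assumed: active minus control, link scale) for n patients.
    """
    draws = np.asarray(contrast_draws, dtype=float)
    # Fraction of posterior draws in which the contrast is negative
    return (draws < 0.0).mean(axis=0)

def treatment_rule(tbi, threshold=0.5):
    """Recommend the active treatment (1) when the TBI exceeds the threshold."""
    return (np.asarray(tbi) > threshold).astype(int)

# Hypothetical posterior draws for three patients (not from the paper)
rng = np.random.default_rng(0)
draws = rng.normal(loc=[-1.0, 0.0, 1.0], scale=0.5, size=(4000, 3))
tbi = treatment_benefit_index(draws)
rule = treatment_rule(tbi)
```

The first hypothetical patient has a contrast concentrated below zero and therefore a TBI near 1, while the third has a TBI near 0.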
Representation of the Link Function g
Following [29], we will use a cubic B-spline basis to represent the flexible function of (3).
Using B-splines is appealing because the basis functions are strictly local, as each basis function is only non-zero over the intervals between five adjacent knots [42].
Using splines allows us to easily incorporate the constraint (4) on the function g, as we describe shortly.
For each fixed , the function g is represented as follows:
| 7 |
for some fixed L-dimensional basis (e.g., B-spline basis on evenly spaced knots on a bounded range of ) and a set of unknown treatment a-specific spline coefficients . In our simulation illustration and in the application, we used a cubic B-spline basis with where denotes the integer part of , as recommended by [43].
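The locality of a cubic B-spline basis can be checked numerically. The sketch below uses SciPy's `BSpline` class with identity coefficients to evaluate all basis functions at once; the helper name, the number of basis functions, the index range, and the choice of repeated boundary knots are assumptions made for this illustration and are not taken from the paper.

```python
import numpy as np
from scipy.interpolate import BSpline

def cubic_bspline_basis(u, n_basis, lo, hi):
    """Evaluate n_basis cubic B-spline basis functions at the points u.

    Interior knots are evenly spaced on [lo, hi]; boundary knots are
    repeated (an assumed convention). Requires n_basis >= 4.
    """
    degree = 3
    n_interior = n_basis - degree - 1
    interior = np.linspace(lo, hi, n_interior + 2)[1:-1]
    knots = np.concatenate(
        [np.repeat(lo, degree + 1), interior, np.repeat(hi, degree + 1)]
    )
    # Identity coefficients: column l of the output is the l-th basis function
    spline = BSpline(knots, np.eye(n_basis), degree, extrapolate=False)
    return spline(np.asarray(u, dtype=float))

u = np.linspace(0.01, 0.99, 50)   # hypothetical index values in [0, 1]
B = cubic_bspline_basis(u, n_basis=8, lo=0.0, hi=1.0)
max_active = np.count_nonzero(B > 1e-12, axis=1).max()
```

At any index value, at most four of the basis functions are non-zero, and the rows of `B` sum to one.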
Given representation (7) for g, the identifiability constraint (4) is implied by the linear constraint:
| 8 |
where with (i.e., the randomization probabilities) is the matrix (in which denotes the identity matrix) and is an unknown basis coefficient vector.
To represent (7) in matrix notation, given , let the matrices denote the evaluation matrices of the basis on , specific to the treatment , whose ith row is the vector if and a row of zeros if . Then, the column-wise concatenation of the design matrices , i.e., the matrix defines the model matrix associated with . Then, we can represent the function g in (7) evaluated on the sample data, by the length-n vector: .
The linear constraint (8) on can be conveniently absorbed into the model matrix by reparametrization, as we describe next. We can find a basis matrix (that spans the null space of the linear constraint (8)) such that if we set for any arbitrary vector , then the resulting vector automatically satisfies the constraint (8). Such a basis matrix can be constructed by a QR decomposition of the matrix in (8). Then representation can be reparametrized, in terms of the unconstrained vector , by replacing with the reparametrized model matrix yielding the representation .
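The reparametrization can be sketched numerically. The block below assumes that the constraint matrix in (8) stacks the matrices π_a I_L side by side, which is one reading of the description above, and the helper name and the values of π and L are hypothetical. A complete QR decomposition of the transposed constraint matrix yields an orthonormal basis of its null space.

```python
import numpy as np

def constraint_null_basis(pi, L):
    """Return the constraint matrix C and a basis Z of its null space.

    pi: randomization probabilities (one per treatment arm)
    L:  number of spline basis functions per arm
    Assumed form of the constraint: C = [pi_1 * I_L, ..., pi_K * I_L].
    """
    pi = np.asarray(pi, dtype=float)
    C = np.hstack([p * np.eye(L) for p in pi])      # shape (L, K*L)
    Q, _ = np.linalg.qr(C.T, mode="complete")       # Q has shape (K*L, K*L)
    Z = Q[:, L:]   # trailing columns are orthogonal to the rows of C
    return C, Z

C, Z = constraint_null_basis([0.5, 0.5], L=6)

# Any unconstrained vector maps to a coefficient vector satisfying C @ gamma = 0
gamma_tilde = np.random.default_rng(1).normal(size=Z.shape[1])
gamma = Z @ gamma_tilde
```

Given a model matrix `D` for the stacked spline coefficients, the reparametrized model matrix would then be `D @ Z`.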
Once we have an inferential procedure on , we can also consider inference on the transformed parameter , from which we can make inference on the treatment a-specific functions .
We note that in (7) defines a system of functions specifically chosen to be used as building blocks to represent a (smooth, as implied by penalized splines) link function for each treatment condition a. If a different basis is used to represent the link function, the performance may differ. We may need to identify the most suitable basis for the data, which depends on the underlying heterogeneous treatment effect (i.e., the underlying log odds ratio function). For example, if there is a reason to expect that the log odds ratio function (5) is a jagged (or, less likely, a cyclic) function over the index , then a wavelet (e.g., [44]) (or Fourier, e.g., [45]) basis might be more suitable. Although the associated basis coefficient vector, , will still be subject to the identifiability constraint (8) for a different basis system, a specifically tailored prior and penalization would be more appropriate, for example, a ( type) Laplace-Zero prior with some hyperparameter determining the sparsity [44].
Prior Specification
This subsection specifies the priors for , , and associated with model (3).
- For the distribution of , we will use the von Mises–Fisher distribution with concentration parameter and modal parameter ,
| 9 |
which is a probability distribution for on the unit sphere in . We will use , for some vector and symmetric positive definite matrix .
- Since the domain of the function g in (3) depends on , the prior on g will depend on . Conditioning on , following [29], we will use a data-dependent prior for ,
| 10 |
where
and which corresponds to a special case of at . The prior (10) is a Zellner’s g-prior that has the same dispersion matrix as a weighted least squares estimator defined based on the vector of adjusted responses and the matrix of weights , which are specified in the next subsection. In the prior (10), is a hyperparameter which will be selected via an empirical Bayes procedure using generalized cross-validation (GCV), as in [29]. An advantage of using the prior (10) is that it allows us to analytically integrate out of the joint posterior , facilitating the Gibbs sampling of .
Posterior Computation
To conduct posterior inference on , we will simulate samples from the joint posterior (where we use to denote the observed treatment outcomes). Since it is difficult to draw samples directly from this joint posterior distribution, we will use a Metropolis-within-Gibbs algorithm. The Gibbs algorithm will iterate between the following two steps: Step 1) sample from and Step 2) sample from . Specifically, in Step 2, since the joint conditional posterior does not have a convenient form to sample from directly, we will employ a Metropolis–Hastings step.
Conditional Posteriors
- Derivation of . For fixed and , we will quadratically approximate the log likelihood function of at its mode, which we denote by . To find the mode , we will use Fisher scoring, iteratively updating the center of the quadratic approximation. For fixed and , at the convergence of the Fisher scoring, we define the adjusted response vector where , in which and , and the weight matrix , where , in which . Given each and , the negative log likelihood of is approximately represented in terms of a weighted least squares (WLS) objective function (up to a constant of proportionality),
Given the prior and the above approximated negative log likelihood, the conditional posterior for is given by
| 11 |
- Derivation of . Given the joint conditional posterior , we will first sample from and then from . Specifically, following [29], we will use a Metropolis–Hastings algorithm to sample from . However, the approach employed in [29] cannot be directly applied to our setting, due to the non-Gaussian likelihood. Thus, we will perform a quadratic approximation of the negative log likelihood function of at its mode, which we denote by . To find , as in Step 1, we will use Fisher scoring. For each fixed and , this quadratic approximation at the convergence of the Fisher scoring is summarized in the form of the WLS objective function (up to a constant of proportionality),
| 12 |
as a function of , in which is the adjusted response vector with obtained at the convergence, where and and is the weight matrix with . Given the quadratic approximation (12), we can write the joint conditional posterior :
| 13 |
Given the conditional prior in (10), we can write the terms involving in (13) as follows:
| 14 |
where
Specifically, given (14), we can analytically integrate out of the joint conditional in (13), which yields
| 15 |
where is the moment generating function (MGF) of the variate evaluated at . The familiar closed-form expression of the Gaussian MGF allows us to write the last line of (15) as follows:
| 16 |
where . The expression (16) provides a closed form for the approximated up to a constant of proportionality, which we will use to conduct a random walk Metropolis Markov chain Monte Carlo (MCMC) algorithm. The MCMC algorithm to sample based on (16) is described in the next subsection. Given each and , we can sample from ,
| 17 |
derived from expression (14).
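The Fisher scoring step and the resulting adjusted responses and weights can be sketched for a Bernoulli response with the logit link. The block below is a generic iteratively reweighted least squares routine with an offset standing in for the part of the linear predictor held fixed; the function names, the simulated data, and the treatment of the offset are assumptions for illustration and do not reproduce the paper's exact expressions.

```python
import numpy as np

def working_quantities(X, coef, y, offset):
    """Adjusted response z and weights w of the quadratic approximation."""
    eta = X @ coef + offset
    mu = 1.0 / (1.0 + np.exp(-eta))                # inverse logit
    w = np.clip(mu * (1.0 - mu), 1e-10, None)      # Bernoulli variance function
    z = X @ coef + (y - mu) / w                    # adjusted response, offset removed
    return z, w

def fisher_scoring_logistic(X, y, offset, n_iter=100, tol=1e-10):
    """Find the mode of the logistic log likelihood by Fisher scoring."""
    coef = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z, w = working_quantities(X, coef, y, offset)
        XtW = X.T * w                              # X^T W with W = diag(w)
        new_coef = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares update
        converged = np.max(np.abs(new_coef - coef)) < tol
        coef = new_coef
        if converged:
            break
    z, w = working_quantities(X, coef, y, offset)  # quantities at the mode
    return coef, z, w

# Hypothetical data (not from the paper)
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(400), rng.normal(size=(400, 2))])
offset = np.zeros(400)
prob = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0, -1.0]))))
y = rng.binomial(1, prob).astype(float)
coef, z, w = fisher_scoring_logistic(X, y, offset)
```

At convergence, the negative log likelihood is approximated (up to a constant) by the WLS criterion with response `z` and weights `w`, which is what makes the Gaussian-type conditional posteriors available.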
MCMC Algorithm for the Posterior Sampling
In this subsection, we provide a detailed sampling scheme based on the conditional posteriors derived in the previous subsection. First, we initialize the chain with the maximum likelihood estimates of the parameters of model (3) with representation (7) for the link g, where the tuning parameter is optimized through the generalized cross-validation (GCV) criterion. We will then cycle through the following steps.
- Sample from in (11) given .
- Sample from in (16) given , using the Metropolis algorithm. Specifically, given the current state for of the chain, a new value is accepted with the acceptance probability , where the Metropolis ratio r is given by
using the conditional posterior (16) given . Here, we provide some more details on this Metropolis procedure.
  - The proposal distribution for was taken to be von Mises–Fisher with concentration parameter and direction parameter given by the current value . In the simulation example in the next section, we used , which yielded an acceptance rate of around 0.3–0.7 for the proposal , and the sampler appeared to explore the state space for adequately (examining the traceplots of typical MCMC chains did not show any peculiarity). We used the R package movMF [46] to generate random samples of from von Mises–Fisher distributions.
  - in (10) is another unknown that controls the smoothness of the data-driven function g, which is crucial to avoid overfitting g. This will be selected via an empirical Bayes procedure using the GCV criterion, at each MCMC update.
- Sample from in (17) given .
To obtain the estimated expected response given a new and a treatment condition , we take the posterior mean of the expected response , based on the posterior sampler output. In particular, we construct a treatment decision rule using the posterior distribution of . Specifically, we will use the posterior probability as the , which we will utilize to obtain a decision rule , using the probability threshold of 0.5.
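The random-walk Metropolis update on the unit sphere can be sketched as follows. To keep the example dependency-free, a normalized Gaussian perturbation of the current direction stands in for the von Mises–Fisher proposal used in the paper; like the von Mises–Fisher proposal centered at the current value, its density depends only on the inner product of the current and proposed directions, so it is symmetric and the proposal terms cancel in the Metropolis ratio. The target density, step size, and function names are hypothetical.

```python
import numpy as np

def propose_on_sphere(beta, step, rng):
    """Symmetric proposal: perturb the current direction and renormalize."""
    cand = beta + step * rng.normal(size=beta.shape)
    return cand / np.linalg.norm(cand)

def metropolis_on_sphere(log_target, beta0, n_iter, step, rng):
    """Random-walk Metropolis sampler for a density on the unit sphere."""
    beta = beta0 / np.linalg.norm(beta0)
    logp = log_target(beta)
    draws = np.empty((n_iter, beta.size))
    n_accept = 0
    for t in range(n_iter):
        cand = propose_on_sphere(beta, step, rng)
        logp_cand = log_target(cand)
        # Accept with probability min(1, r), where r is the target density ratio
        if np.log(rng.uniform()) < logp_cand - logp:
            beta, logp = cand, logp_cand
            n_accept += 1
        draws[t] = beta
    return draws, n_accept / n_iter

# Hypothetical target: a von Mises-Fisher-type density with mode mu
mu = np.array([1.0, 0.0, 0.0])
kappa = 50.0

def log_target(b):
    return kappa * (mu @ b)

rng = np.random.default_rng(3)
draws, acc_rate = metropolis_on_sphere(
    log_target, np.array([0.0, 1.0, 0.0]), n_iter=5000, step=0.2, rng=rng
)
mean_dir = draws[1000:].mean(axis=0)
mean_dir /= np.linalg.norm(mean_dir)
```

In the actual sampler, `log_target` would be the logarithm of the marginalized conditional posterior in (16), and the step size (or the proposal concentration) would be tuned to reach a reasonable acceptance rate.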
Simulation Illustration
In this section, via a set of simulation experiments, we compare the performance of the proposed Bayesian single-index approach to modeling heterogeneous treatment effects with an approach that relies on a Bayesian linear model. We replicate the experiment 100 times with sample sizes n ∈ {500, 1000, 2000}. We use a cubic B-spline for representing g as in Sect. 2.2.2. The simulation code is available at https://github.com/syhyunpark/bayesSIMML.
Simulation Setting
We independently generated the treatment indicators from Bernoulli distribution with and the vector of covariates from the mean zero multivariate normal distribution with compound symmetry correlation and the unit variances. Given , we generated , where with the following specifications of the functions m (either a “nonlinear” or “linear” main effect) and g (either a “nonlinear” or “linear” A-by- interaction effect):
| 18 |
In (18), we set and , where each vector was normalized to have unit norm. We considered the cases with p ∈ {5, 10}. Throughout the paper, we took and an uninformative prior for , i.e., set in (9). As a comparison method, we used the following logistic linear model:
with the prior distributions, with location and scale, , and .
As an evaluation metric, we first consider the expected deviance measured on an independently generated test set (of size ) to assess the accuracy of the models in predicting the new data, as defined by
| 19 |
(note that it is scaled by the testing set size ), where is the length of the Markov chain (after a burn-in of 2000) and is the tth set of sampled parameter values within the Markov chain obtained from the training set. Accordingly, in (19) corresponds to the average posterior probability of , and the quantity (19) corresponds to a Bayesian version of the deviance, evaluated on the test set (of size ). Smaller values of (19) are better, indicating greater average accuracy of the predictive model.
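A deviance of this type can be computed directly from the sampler output. The sketch below assumes one common form, in which the Bernoulli likelihood of each test observation is first averaged over the retained MCMC draws and the deviance is minus twice the mean log of that average; the exact expression in (19) is not reproduced here, and the function name and inputs are hypothetical.

```python
import numpy as np

def expected_deviance(prob_draws, y_test):
    """Test-set deviance scaled by the test set size.

    prob_draws: array (T, n_test) of posterior draws of P(Y = 1) for each
                test observation (one row per retained MCMC iteration)
    y_test:     array (n_test,) of observed binary outcomes
    """
    p = np.clip(np.asarray(prob_draws, dtype=float), 1e-12, 1.0 - 1e-12)
    y = np.asarray(y_test)
    lik = np.where(y == 1, p, 1.0 - p)   # likelihood of each observation per draw
    avg_lik = lik.mean(axis=0)           # average over the Markov chain
    return -2.0 * np.mean(np.log(avg_lik))

# Hypothetical check: uninformative predictions of 0.5 for every observation
y_demo = np.array([0, 1, 0, 1])
dev_coin = expected_deviance(np.full((10, 4), 0.5), y_demo)
dev_good = expected_deviance(np.tile([0.1, 0.9, 0.1, 0.9], (10, 1)), y_demo)
```

Predictions that agree with the observed outcomes yield a smaller deviance than uninformative ones.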
Furthermore, we report the expected outcome under the treatment regime , i.e., (which is called the “value” of the regime ), which is Monte Carlo approximated by based on the test set of size , where the ITR (see (6) for the definition of the TBI) is trained on the training set. We also report the proportion of correct decisions (PCD), which is the proportion of the cases in which matches the correct optimal treatment assignment under the true data generation model.
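In a simulation, both criteria can be evaluated from the true outcome probabilities under each treatment. The sketch below assumes a 0/1 treatment coding, a binary outcome for which smaller expected values are better, and hypothetical input arrays.

```python
import numpy as np

def value_and_pcd(rule, p_control, p_active):
    """Monte Carlo value of a treatment rule and proportion of correct decisions.

    rule:      array of 0/1 recommendations on the test set
    p_control: true P(Y = 1) for each test patient under control (a = 0)
    p_active:  true P(Y = 1) for each test patient under active treatment (a = 1)
    """
    rule = np.asarray(rule)
    p_control = np.asarray(p_control, dtype=float)
    p_active = np.asarray(p_active, dtype=float)
    # Expected outcome when every patient follows the recommendation
    value = np.where(rule == 1, p_active, p_control).mean()
    # The true optimal assignment picks the arm with the smaller risk
    optimal = (p_active < p_control).astype(int)
    pcd = np.mean(rule == optimal)
    return value, pcd

# Hypothetical test set of three patients
value, pcd = value_and_pcd(
    rule=[0, 1, 0], p_control=[0.2, 0.6, 0.5], p_active=[0.4, 0.3, 0.1]
)
```

Here the rule is correct for two of the three patients, and its value is the average of the risks actually incurred under the recommended treatments.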
Simulation Results
Figure 1 below displays the results from the simulation experiments when we vary , , and the form of the interaction effect component , for the linear main effect (i.e., ) case. In Table 2 in the Appendix, as an MCMC convergence diagnostic, we report the Gelman–Rubin potential scale reduction factor (PSRF) [47] computed for each scenario using the method of [48], which provides some assurance that the sampler has performed reasonably. For the nonlinear A-by- interaction effect scenarios (i.e., the gray panels in Fig. 1), the proposed index model that utilizes the flexible link function g clearly outperforms the logistic linear model, which assumes a restricted linear model on the interaction term, with respect to all three criteria (the deviance, the PCD, and the expected outcome). When there is no nonlinearity in the A-by- interaction effect term in the underlying model (i.e., the white panels in Fig. 1), not surprisingly the logistic linear model outperforms the index model. However, the contrast in the performance between the two models is relatively small, compared to that under the nonlinear A-by- interaction. This suggests that, in the absence of prior knowledge about the form of the A-by- interaction effect, the more flexible index model that accommodates nonlinear treatment effect modifications (i.e., the nonlinear g term) can be a useful alternative to the linear model approach.
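For reference, the classical Gelman–Rubin PSRF for a scalar parameter can be computed from several parallel chains as sketched below. This is the textbook form of the statistic, without any degrees-of-freedom correction, and is not necessarily the variant of [48] used for Table 2; the function name and the simulated chains are hypothetical.

```python
import numpy as np

def gelman_rubin_psrf(chains):
    """Classical potential scale reduction factor for a scalar parameter.

    chains: array of shape (m, n), m parallel chains of length n each.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(4)
# Four hypothetical chains sampling the same distribution
psrf_mixed = gelman_rubin_psrf(rng.normal(size=(4, 2000)))
# Four hypothetical chains stuck around different locations
shifts = np.array([0.0, 0.0, 3.0, 3.0])[:, None]
psrf_stuck = gelman_rubin_psrf(rng.normal(size=(4, 2000)) + shifts)
```

Values close to 1 indicate that the between-chain variability is negligible relative to the within-chain variability, which is the sense in which thresholds such as 1.03, 1.05, and 1.07 are used.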
Fig. 1.
Results for the linear main effect case, with varying , , and , comparing the performance of the proposed index model (red) with that of the logistic linear regression model (blue), with respect to the deviance (the first row; a smaller deviance is desired), the proportion of correct decisions (PCD) (the second row; a larger PCD is desired), and the “value” (the expected outcomes under ITRs) (the third row; a smaller value is desired)
Table 2.
Summary of the Gelman–Rubin (GR) diagnostic over 100 simulation replications for each scenario. The GR potential scale reduction factor (PSRF) was computed for each element of the components , , and . For each of these components, we computed the proportion of the individual elements’ PSRF statistics being less than a certain threshold (1.03, 1.05, 1.07) by averaging the PSRF statistics across the corresponding component’s elements and across the 100 simulation replications (for example, for the component , the proportion was computed based on PSRF statistics). The results for indicate that almost all PSRF statistics were less than 1.07, while those of and were consistently smaller than 1.03 for almost all the scenarios, providing some assurance that the sampler has performed reasonably
| Shape of the component m | Shape of the component g | p | n | Pr(GR < 1.03) | | | Pr(GR < 1.05) | | | Pr(GR < 1.07) | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| m | m | m | ||||||||||
| Linear | Nonlinear | 5 | 500 | 0.856 | 1.000 | 1.000 | 0.970 | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 |
| 1000 | 0.908 | 0.999 | 1.000 | 0.986 | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 | |||
| 2000 | 0.916 | 1.000 | 1.000 | 0.984 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 10 | 500 | 0.602 | 0.997 | 1.000 | 0.950 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | ||
| 1000 | 0.744 | 1.000 | 1.000 | 0.981 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 2000 | 0.803 | 1.000 | 1.000 | 0.988 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| Linear | 5 | 500 | 0.844 | 1.000 | 1.000 | 0.984 | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 | |
| 1000 | 0.902 | 1.000 | 1.000 | 0.984 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 2000 | 0.930 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 10 | 500 | 0.485 | 0.999 | 1.000 | 0.925 | 1.000 | 1.000 | 0.999 | 1.000 | 1.000 | ||
| 1000 | 0.689 | 0.997 | 1.000 | 0.968 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 2000 | 0.758 | 1.000 | 1.000 | 0.977 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| Nonlinear | Nonlinear | 5 | 500 | 0.950 | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 1000 | 0.970 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 2000 | 0.956 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 10 | 500 | 0.750 | 0.996 | 1.000 | 0.991 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | ||
| 1000 | 0.888 | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 2000 | 0.933 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| Linear | 5 | 500 | 0.922 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |
| 1000 | 0.978 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 2000 | 0.968 | 1.000 | 1.000 | 0.996 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 10 | 500 | 0.734 | 0.999 | 1.000 | 0.986 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | ||
| 1000 | 0.884 | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
| 2000 | 0.936 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | |||
The results under the nonlinear main effect case, i.e., , are given in Fig. 2 and are quite similar to those from the linear main effect case (i.e., the results presented in Fig. 1), except that the expected deviance of the models (both the index model and the logistic linear model), which is a generalization of the mean-squared error that assesses the overall predictive performance, slightly increased due to the main effect model misspecification. However, optimal ITRs are derived based only on the model’s A-by- interaction effect terms. With respect to the PCD and the expected outcome under ITRs, the results under the linear main effect and those under the nonlinear main effect were close to each other, indicating that these models were quite robust to misspecification of the main effect with respect to the ITR estimation performance. We also note that in both Figs. 1 and 2 there was relatively large variability in the ITR estimation performance for the index model, especially when . On the other hand, the variability in the overall predictive performance metric (deviance) was comparable for the two approaches. This reflects the tendency that estimating optimal ITRs is generally more challenging than making outcome predictions, as the difference (5) between two predictions has to be computed and its expected error will invariably be larger than that for a single prediction.
Fig. 2.
Results for the nonlinear main effect case, with varying , , and , comparing the performance of the proposed index model (red) with that of the logistic linear regression model (blue), with respect to the deviance (the first row; a smaller deviance is desired), the proportion of correct decisions (PCD) (the second row; a larger PCD is desired), and the “value” (the expected outcomes under ITRs) (the third row; a smaller value is desired)
Application
In this section, we illustrate the proposed model on data from a COVID-19 convalescent plasma (CCP) study [39], a meta-analysis of pooled individual patient data from 8 randomized clinical trials. One of the goals of this study was to guide CCP treatment recommendations by providing an estimate of a differential treatment outcome when a patient is treated with CCP vs. without CCP [49]. A larger differential in favor of CCP would indicate a more compelling reason for recommending CCP. In this context, we aim to use profiles of patients with COVID-19 associated with different benefits from CCP treatment, to optimize treatment decisions.
The study included 2369 hospitalized adults, not receiving mechanical ventilation at randomization, enrolled from April 2020 through March 2021. We used only complete cases for this analysis. A total of 2287 patients were included, with a mean age of 60.3 (SD 15.2) years and women. One of the study’s primary outcomes was the binary variable indicating mechanical ventilation or death (hence Y = 1 indicates a bad outcome, whose probability we want to minimize) at day 14 post-treatment, where 336 out of 2287 patients experienced the Y = 1 case at day 14. The patients were randomized to be treated with either CCP (1190 patients) or control (i.e., standard of care; 1097 patients). Pretreatment patient characteristics were collected at baseline. As in [49], in our application, the baseline variables used to model the covariates’ “main” effect, i.e., the component associated with the coefficient in model (3), included age, sex, baseline symptom conditions, age-by-baseline symptom conditions interaction, blood type, the indicators for history of diabetes, pulmonary and cardiovascular disease, and days since symptom onset. We also included the RCT-specific intercepts and the patients’ enrollment quarters as part of the covariates’ “main” effect component.
Since our goal in this analysis is to investigate the differential treatment effect explained by the baseline variables , we will focus on reporting the estimation results of the heterogeneous treatment effect (HTE) term in model (3) and the corresponding treatment effect contrast in (5). The patient characteristics included in the HTE term, along with the sample proportions, are given in the first column of Table 1. The posterior means of the index coefficients , along with the corresponding posterior credible intervals (CrI), are provided in the second column of Table 1. Judging by the posterior CrI, the patient’s symptom severity (oxygen support status) at baseline, blood type, a history of cardiovascular disease, and a history of diabetes appear to be important predictors of HTE, as the CrI of the coefficients associated with these variables do not include 0.
Table 1.
Pretreatment patient characteristics and the corresponding estimated index coefficients (and CrI)
| Pretreatment characteristic (sample prevalence, %) | Index coefficient [CrI] |
|---|---|
| Oxygen support by mask or nasal prongs (1/0) (63%) | 0.68 [0.50, 0.80] |
| Oxygen support by high flow (1/0) (18%) | 0.47 [0.16, 0.61] |
| Age (dichotomized, ) (1/0) (35%) | −0.13 [−0.46, 0.04] |
| Blood type (A or AB vs. O or B) (1/0) (37%) | −0.31 [−0.49, −0.16] |
| Cardiovascular disease (1/0) (42%) | −0.24 [−0.65, −0.06] |
| Diabetes (1/0) (34%) | −0.26 [−0.52, −0.08] |
| Pulmonary disease (1/0) (12%) | 0.05 [−0.16, 0.22] |
The reference level: hospitalized but no oxygen therapy required
In the first panel of Fig. 3, we display the exponentiated individualized treatment effect as a function of the single index. Specifically, the horizontal axis is the posterior mean of the single index (a point estimate), where the “observed” posterior mean values, computed from the posterior sample, are represented by the small blue ticks along the axis. The uncertainty in the estimation of the model coefficients is also accounted for in the credible bands in Fig. 3. For the sake of interpretability, we exponentiated the HTE estimate, so that the vertical axis in the panel represents the odds ratio (CCP vs. control) for a bad outcome (mechanical ventilation or death). An odds ratio of less than 1 indicates superior efficacy of CCP over the control treatment. As the expected odds ratio is below 1 at most of the observed values of the single index, most of the patients are expected to benefit from CCP treatment, except those with index values greater than 0.45, for whom the corresponding expected individualized odds ratios are greater than 1 (about of the observed patients). The U-shaped nonlinear relationship between the expected odds ratio and the single index of the model suggests that the flexible link function g in (3) is preferable to a more restricted linear model for this HTE modeling.
Fig. 3.
The left panel displays the exponentiated version of the estimated individualized treatment effect and the posterior mean of in (5) (solid curve), along with the corresponding upper and lower credible interval (CrI) (dashed curves), as a function of the posterior mean of . The right panel also displays the expected odds ratio (CCP vs. control) (solid curve) and the corresponding CrI (dashed curves), but the horizontal axis is now the treatment benefit index (TBI) (6), , where represents the odds ratio, and the TBI probability is evaluated with respect to the posterior distribution of the parameters in . The TBI provides a gradient of benefit that ranges from 0 to 1, with a higher value of the TBI indicating a greater benefit from the CCP treatment, compared to control. The observed values for the quantities on the horizontal axes are represented by the small blue ticks
Although the first panel of Fig. 3 displays information about the relationship between the (exponentiated) individualized treatment effect (i.e., the individualized odds ratio) and the posterior mean of the single index, this relationship is non-monotonic, which makes it difficult to construct a “gradient” of the treatment benefit of CCP vs. control as a function of the patient characteristics. Thus, in the second panel of Fig. 3, we additionally display the individualized odds ratio as a function of the TBI defined in (6), a probability evaluated with respect to the posterior distribution of the model parameters. As a probability, the TBI ranges from 0 to 1, where larger values are associated with larger CCP benefit. For example, patients with TBI scores near 1 are expected to experience large, clinically meaningful benefits from CCP.
The second panel of Fig. 3 displays a monotonically decreasing trend of the expected odds ratio (an increasing CCP benefit), as the TBI score increases from 0 to 1. Some portions of the expected odds ratio as well as the corresponding CrI exceed 1 for very small TBI values (near 0), suggesting the possibility of harm from CCP as the TBI approaches 0, whereas the TBI values close to 1 indicate a substantial benefit from the CCP treatment over the control treatment. We can use the TBI scores to stratify patients according to their predicted treatment benefit levels, by setting the treatment decision rule .
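As a rough illustration of how a TBI-type score could be obtained from MCMC output, the sketch below assumes that the TBI of a patient is the posterior probability that the individualized odds ratio (CCP vs. control) is below 1. This is an assumption made for illustration, not a reproduction of (6). The function name `tbi_from_draws`, the array `log_or_draws`, and the 0.5 cutoff used for the decision rule are all hypothetical, and the posterior draws are simulated rather than taken from the CCP analysis.

```python
import numpy as np

def tbi_from_draws(log_or_draws):
    """Posterior probability that the odds ratio is below 1, per patient.

    log_or_draws: array of shape (n_draws, n_patients) holding posterior
    draws of each patient's log odds ratio (treatment vs. control).
    """
    log_or_draws = np.asarray(log_or_draws, dtype=float)
    # OR < 1 is equivalent to log(OR) < 0; average the indicator over draws.
    return (log_or_draws < 0.0).mean(axis=0)

rng = np.random.default_rng(0)
# Simulated draws for three hypothetical patients (not study data):
# one likely to benefit, one uncertain, one likely to be harmed.
posterior_means = np.array([-0.8, 0.0, 0.6])
draws = posterior_means + 0.3 * rng.standard_normal((4000, 3))

tbi = tbi_from_draws(draws)
# Hypothetical decision rule: recommend treatment when the TBI exceeds 0.5.
recommend = tbi > 0.5
```

Because the score is a posterior probability, it is monotone in the strength of evidence for benefit even when the odds ratio itself is a non-monotonic function of the single index.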
To evaluate the performance of the treatment decision rule, we randomly split the data with a ratio of 2:1 into a training set and a testing set, replicated 100 times, each time obtaining the rule based on the training set and estimating the corresponding “value” by an inverse probability weighted estimator [50] computed on the testing set.
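The “value” of a decision rule on a testing set can be sketched as follows. The code uses one common normalized form of the inverse probability weighted estimator; whether it coincides exactly with the estimator of [50] as used in this analysis is an assumption. The function name `ipw_value`, the simulated data, and the constant propensity of 0.5 are hypothetical choices for illustration only.

```python
import numpy as np

def ipw_value(y, a, d, propensity):
    """Normalized IPW estimate of the expected outcome under rule d.

    y: observed outcomes (here 1 = bad outcome, so smaller is better)
    a: treatments actually received (0/1)
    d: treatments recommended by the rule (0/1)
    propensity: P(A = 1) for each patient (known by design in an RCT)
    """
    y, a, d, p = (np.asarray(v, dtype=float) for v in (y, a, d, propensity))
    # Probability of receiving the treatment that was actually received.
    p_received = np.where(a == 1, p, 1.0 - p)
    # Only patients whose received treatment agrees with the rule contribute.
    w = (a == d) / p_received
    return np.sum(w * y) / np.sum(w)

# Simulated testing set (not study data): treatment helps only when x > 0.
rng = np.random.default_rng(1)
n = 20000
x = rng.standard_normal(n)
a = rng.integers(0, 2, size=n)
best = (x > 0).astype(int)
p_bad = np.where(a == best, 0.1, 0.3)
y = (rng.random(n) < p_bad).astype(float)
prop = np.full(n, 0.5)

value_rule = ipw_value(y, a, best, prop)
value_all_treated = ipw_value(y, a, np.ones(n), prop)
value_all_control = ipw_value(y, a, np.zeros(n), prop)
```

In this simulated setting the individualized rule attains a smaller “value” (fewer bad outcomes) than either rule that ignores patient characteristics, which is the kind of comparison summarized by boxplots over repeated random splits.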
For comparison, we also include two naive rules: treating all patients with control (“All Control”) and treating all patients with CCP (“All CCP”), each regardless of the individual patients’ characteristics, in addition to the decision rule based on the Bayesian linear logistic model, which was compared with the proposed index model in Sect. 3. The resulting boxplots obtained from the 100 random splits are shown in Fig. 4.
Fig. 4.
Boxplots of “value,” obtained from 100 randomly split testing sets. A smaller “value” is desirable
The results in Fig. 4 demonstrate that the index model and the linear logistic regression perform at a similar level for this dataset, while showing a clear advantage over the naive rules of giving everyone CCP or giving everyone the control treatment: the averaged proportion of patients with the undesirable outcome (i.e., the “value”) was considerably smaller for the two regression approaches than for the two naive rules. This suggests that accounting for patient characteristics can help optimize treatment decisions. Although some of the nonlinearities in the association between the treatment effect and patient characteristics are captured by the model-implied odds ratio displayed in the first panel of Fig. 3, for this dataset, the simpler linear model appears to perform nearly as well as the index model, owing to its parsimony. However, as demonstrated in Sect. 3, the more flexible index model may be preferable to the linear model, as it allows for discovering some key nonlinearities in modeling heterogeneous treatment effects.
Discussion
The idea in the Bayesian estimation approach of [29] was to treat the link function g as another unknown and approximate it by a linear combination of B-spline basis functions. In this article, to estimate heterogeneous treatment effects using a flexible link function, we used the adjusted responses and weights associated with the iteratively re-weighted least squares (IWLS) algorithm in the quadratic approximation of the log likelihood, for each MCMC sampler. The approximation under the IWLS framework and the specific prior choice (10) allow us to analytically integrate out of the approximated posterior (13), which simplifies the sampling procedure for . Although the sampling was done using approximated conditional posteriors, this approach appears to work reasonably well.
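For readers less familiar with IWLS, the adjusted responses and weights for a logistic likelihood can be sketched as below. At a current linear predictor η with fitted probability μ, the weight is w = μ(1 − μ) and the adjusted response is z = η + (y − μ)/w, so that the log likelihood is approximated, up to a constant, by a weighted least squares criterion in z. The code shows only the standard (non-Bayesian) IWLS iteration for an ordinary logistic regression; it is not the MCMC sampler of this paper, and the function names and simulated data are hypothetical.

```python
import numpy as np

def iwls_working_quantities(eta, y):
    """Adjusted response z and weights w for a logistic likelihood at eta."""
    mu = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
    w = mu * (1.0 - mu)                  # IWLS weights (variance function)
    z = eta + (y - mu) / w               # adjusted (working) response
    return z, w

def iwls_logistic(X, y, n_iter=25):
    """Plain IWLS for logistic regression; returns the coefficient vector."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z, w = iwls_working_quantities(X @ beta, y)
        # Weighted least squares step: solve (X' W X) beta = X' W z.
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

# Simulated data (not study data) with an intercept and one covariate.
rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([-0.5, 1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta_hat = iwls_logistic(X, y)
# At convergence the score X'(y - mu) should be numerically zero.
score = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta_hat)))
```

The same pair (z, w) is what makes the conditional posterior approximately Gaussian under a Gaussian prior, which is the feature exploited when part of the parameter vector is integrated out analytically.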
This paper focused on the context of a randomized clinical trial where the treatment is randomized independently of pretreatment characteristics. However, the method can potentially be extended to the case where the treatment assignment depends on these characteristics. To estimate individual treatment effects with observational or non-fully randomized data, we can take a “propensity method” [see, e.g., 17, 40] upon taking an appropriate reparametrization of the proposed model, which we describe below for a more general context of k treatment conditions, in which each treatment condition is assigned with a covariate-dependent probability (i.e., the propensity score). One of the treatment conditions is taken as the reference (control) treatment.
For each fixed , the condition (4) implies or equivalently, Given this representation for , we can reparametrize the canonical parameter of model (3), that is, , by
(20)
where . This parametrization is an unconstrained formulation of model (3) without the constraint (4), where the propensity score is incorporated through the subject i- and treatment a-specific weight in the formulation. In model (20), the interaction term still satisfies the condition of the form (4), since , indicating that .
In the estimation, as in Sect. 2.2.2 for the binary treatment condition , we can proceed as follows for the general k treatment conditions with and also in the context of an observational or a non-fully randomized study. We can define the design matrix , where each element (the matrix) that is specific to each treatment condition denotes the evaluation matrix of the basis on multiplied by the subject i and treatment condition a-specific weight (the weight is defined on the second line of (20)), so that its ith row corresponds to the vector, , with the weight incorporating the pre-estimated treatment propensity score . The spline coefficient vector associated with the design matrix can be introduced, which yields the representation as in Sect. 2.2.2, and the same estimation procedure of Sect. 2.2.3 and 2.3 can be employed to conduct posterior inference.
Future work will extend the model to accommodate multiple treatment outcomes to allow for borrowing of strength between the available outcomes in modeling heterogeneous treatment effects.
Acknowledgements
The first author thanks Ian McKeague of Columbia University for his insightful suggestions at the outset of this work. This work was supported by the National Institute of Mental Health (NIH Grant No. 5 R01 MH099003) and the National Center for Advancing Translational Sciences (Grant No. 3 UL1TR001445-06A1S2).
Appendix
As an MCMC convergence diagnostic for the simulation example in Sect. 3, we report the Gelman–Rubin potential scale reduction factor (PSRF) computed for each scenario (Table 2).
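For reference, a basic version of the PSRF for a single scalar parameter can be computed as in the sketch below. It follows the original between-chain and within-chain variance construction of [47] without the additional correction for sampling variability, so it may differ slightly from the variant used to produce Table 2. The function name `psrf` and the simulated chains are hypothetical.

```python
import numpy as np

def psrf(chains):
    """Gelman-Rubin potential scale reduction factor for one scalar parameter.

    chains: array of shape (m, n) with m parallel chains of n draws each.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(3)
# Four well-mixed chains targeting the same distribution.
mixed = rng.standard_normal((4, 2000))
# Four chains of which two are stuck in a different region.
stuck = mixed + np.array([0.0, 0.0, 3.0, 3.0])[:, None]

r_mixed = psrf(mixed)
r_stuck = psrf(stuck)
```

Values close to 1 are consistent with the chains having reached a common distribution, whereas values well above 1 indicate that the between-chain variability is still large relative to the within-chain variability.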
References
- 1. Murphy SA. Optimal dynamic treatment regimes. J R Stat Soc B. 2003;65(2):331–355. doi: 10.1111/1467-9868.00389.
- 2. Robins J. Optimal Structural Nested Models for Optimal Sequential Decisions. New York: Springer; 2004.
- 3. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015;372(9):793–795. doi: 10.1056/NEJMp1500523.
- 4. Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Ann Stat. 2011;39(2):1180–1210. doi: 10.1214/10-AOS864.
- 5. Lu W, Zhang H, Zeng D. Variable selection for optimal treatment decision. Stat Methods Med Res. 2011;22:493–504. doi: 10.1177/0962280211428383.
- 6. Tian L, Alizadeh A, Gentles A, Tibshirani R. A simple method for estimating interactions between a treatment and a large number of covariates. J Am Stat Assoc. 2014;109(508):1517–1532. doi: 10.1080/01621459.2014.951443.
- 7. Shi C, Song R, Lu W. Robust learning for optimal treatment decision with np-dimensionality. Electron J Stat. 2016;10:2894–2921. doi: 10.1214/16-EJS1178.
- 8. Jeng X, Lu W, Peng H. High-dimensional inference for personalized treatment decision. Electron J Stat. 2018;12:2074–2089. doi: 10.1214/18-EJS1439.
- 9. Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. J Am Stat Assoc. 2012;107:1106–1118. doi: 10.1080/01621459.2012.695674.
- 10. Zhao Y, Zeng D, Laber EB, Kosorok MR. New statistical learning methods for estimating optimal dynamic treatment regimes. J Am Stat Assoc. 2015;110:583–598. doi: 10.1080/01621459.2014.937488.
- 11. Song R, Kosorok M, Zeng D, Zhao Y, Laber EB, Yuan M. On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning. Stat. 2015;4:59–68. doi: 10.1002/sta4.78.
- 12. Laber EB, Zhao Y. Tree-based methods for individualized treatment regimes. Biometrika. 2015;102:501–514. doi: 10.1093/biomet/asv028.
- 13. Laber EB, Staicu A. Functional feature construction for individualized treatment regimes. J Am Stat Assoc. 2018;113:1219–1227. doi: 10.1080/01621459.2017.1321545.
- 14. Zhao Y, Laber E, Ning Y, Saha S, Sands B. Efficient augmentation and relaxation learning for individualized treatment rules using observational data. J Mach Learn Res. 2019;20:1–23.
- 15. Liu Y, Wang Y, Kosorok MR, Zhao Y, Zeng D. Augmented outcome-weighted learning for estimating optimal dynamic treatment regimens. Stat Med. 2018;37:3776–3788. doi: 10.1002/sim.7844.
- 16. Xie S, Tarpey T, Petkova E, Ogden RT. Multiple domain and multiple kernel outcome-weighted learning for estimating individualized treatment regimes. J Comput Graph Stat. 2022;31(4):1–18. doi: 10.1080/10618600.2022.2067552.
- 17. Caron A, Baio G, Manolopoulou I. Estimating individual treatment effects using non-parametric regression models: a review. J R Stat Soc A. 2022;185:1115–1149. doi: 10.1111/rssa.12824.
- 18. Nie X, Wager S. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika. 2021;108:299–319. doi: 10.1093/biomet/asaa076.
- 19. Alaa AM, van der Schaar M. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. 31st Conference on Neural Information Processing Systems (NIPS 2017). 2017. pp 3427–3435.
- 20. Shalit U, Johansson FD, Sontag D. Estimating individual treatment effect: generalization bounds and algorithms. Proc 34th Int Conf Mach Learn. 2017;70:3076–3085.
- 21. Wager S, Athey S. Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc. 2018;113:1228–1242. doi: 10.1080/01621459.2017.1319839.
- 22. Guo W, Zhou X, Ma S. Estimation of optimal individualized treatment rules using a covariate-specific treatment effect curve with high-dimensional covariates. J Am Stat Assoc. 2022;116(533):309–321. doi: 10.1080/01621459.2020.1865167.
- 23. Hahn PR, Murray JS, Carvalho CM. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Anal. 2020;15(3):965–1056.
- 24. Klausch T, van de Ven P, van de Brug T, van de Wiel MA, Berkhof J. Estimating Bayesian optimal treatment regimes for dichotomous outcomes using observational data. 2018.
- 25. Murray T, Yuan Y, Thall PF. A Bayesian machine learning approach for optimizing dynamic treatment regimes. J Am Stat Assoc. 2018;113(523):1255–1267. doi: 10.1080/01621459.2017.1340887.
- 26. Logan BR, Sparapani R, McCulloch RE, Laud PW. Decision making and uncertainty quantification for individualized treatments using Bayesian additive regression trees. Stat Methods Med Res. 2019;28(4):1079–1093. doi: 10.1177/0962280217746191.
- 27. Brillinger DR. A generalized linear model with “Gaussian” regressor variables. In: Bickel PJ, Doksum KA, Hodges JL, editors. A Festschrift for Erich L. Lehmann. Wadsworth, New York; 1982.
- 28. Stoker TM. Consistent estimation of scaled coefficients. Econometrica. 1986;54:1461–1481. doi: 10.2307/1914309.
- 29. Antoniadis A, Gregoire G, McKeague I. Bayesian estimation in single-index models. Stat Sin. 2004;14:1147–1164.
- 30. Poon W, Wang H. Bayesian analysis of generalized partially linear single-index models. Comput Stat Data Anal. 2013;68:251–261. doi: 10.1016/j.csda.2013.07.018.
- 31. Poon W, Wang H. Multivariate partially linear single-index models: Bayesian analysis. J Nonparametric Stat. 2014;26(4):755–768. doi: 10.1080/10485252.2014.965706.
- 32. Wang H. Bayesian estimation and variable selection for single index models. Comput Stat Data Anal. 2009;53:2617–2627. doi: 10.1016/j.csda.2008.12.010.
- 33. Wang H. A Bayesian multivariate partially linear single-index probit model for ordinal responses. J Stat Comput Simul. 2018;88:1616–1636. doi: 10.1080/00949655.2018.1442469.
- 34. Choi T, Shi J, Wang B. A Gaussian process regression approach to a single-index model. J Nonparametric Stat. 2011;23:21–36. doi: 10.1080/10485251003768019.
- 35. Gramacy RB, Lian H. Gaussian process single-index models as emulators for computer experiments. Technometrics. 2012;54(1):30–41. doi: 10.1080/00401706.2012.650527.
- 36. Yu Y, Zou Z, Wang S, Meyer R. Bayesian nonparametric modelling of the link function in the single-index model using a Bernstein-Dirichlet process prior. J Stat Comput Simul. 2019;89:3290–3312. doi: 10.1080/00949655.2019.1663191.
- 37. Dhara K, Lipsitz S, Pati D, Sinha D. A new Bayesian single index model with or without covariates missing at random. Bayesian Anal. 2020;15(3):759–780. doi: 10.1214/19-BA1170.
- 38. Liu C, Liang H. Bayesian analysis in single-index quantile regression with missing observation. Commun Stat Theory Methods. 2022.
- 39. Troxel Andrea B, Petkova Eva, Goldfeld Keith, Liu Mengling, Tarpey Thaddeus, Wu Yinxiang, Wu Danni, Agarwal Anup, Avendaño-Solá Cristina, Bainbridge Emma, Bar Katherine J, Devos Timothy, Duarte Rafael F, Gharbharan Arvind, Hsue Priscilla Y, Kumar Gunjan, Luetkemeyer Annie F, Meyfroidt Geert, Nicola André M, Mukherjee Aparna, Ortigoza Mila B, Pirofski Liise-anne, Rijnders Bart J A, Rokx Casper, Sancho-Lopez Arantxa, Shaw Pamela, Tebas Pablo, Yoon Hyun Ah, Grudzen Corita, Hochman Judith, Antman Elliott M. Association of convalescent plasma treatment with clinical status in patients hospitalized with COVID-19: a meta-analysis. JAMA Netw Open. 2022;5(1):e2147331. doi: 10.1001/jamanetworkopen.2021.47331.
- 40. Imbens GW, Rubin DB. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press; 2015.
- 41. Rubin DB. Causal inference using potential outcomes: design, modeling, decisions. J Am Stat Assoc. 2005;100(469):322–331. doi: 10.1198/016214504000001880.
- 42. Eilers P, Marx B. Flexible smoothing with B-splines and penalties. Stat Sci. 1996;11(2):89–121. doi: 10.1214/ss/1038425655.
- 43. Wang L, Yang L. Spline estimation of single-index models. Stat Sin. 2009;19:765–783.
- 44. Wand MP, Ormerod JT. Penalized wavelets: embedding wavelets into semiparametric regression. Electron J Stat. 2011;5:1654–1717. doi: 10.1214/11-EJS652.
- 45. Lenk PJ. Bayesian inference for semiparametric regression using a Fourier representation. J R Stat Soc B. 2002;61:863–879. doi: 10.1111/1467-9868.00207.
- 46. Hornik K, Grün B. movMF: mixtures of von Mises-Fisher distributions. R package version 0.2-7. 2022.
- 47. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Stat Sci. 1992;7(4):457–472. doi: 10.1214/ss/1177011136.
- 48. Vats D, Knudson C. Revisiting the Gelman-Rubin diagnostic. Stat Sci. 2021;36(4):518–529. doi: 10.1214/20-STS812.
- 49. Park Hyung, Tarpey Thaddeus, Liu Mengling, Goldfeld Keith, Wu Yinxiang, Wu Danni, Li Yi, Zhang Jinchun, Ganguly Dipyaman, Ray Yogiraj, Paul Shekhar Ranjan, Bhattacharya Prasun, Belov Artur, Huang Yin, Villa Carlos, Forshee Richard, Verdun Nicole C, Yoon Hyun ah, Agarwal Anup, Simonovich Ventura Alejandro, Scibona Paula, Burgos Pratx Leandro, Belloso Waldo, Avendaño-Solá Cristina, Bar Katharine J, Duarte Rafael F, Hsue Priscilla Y, Luetkemeyer Anne F, Meyfroidt Geert, Nicola André M, Mukherjee Aparna, Ortigoza Mila B, Pirofski Liise-anne, Rijnders Bart J A, Troxel Andrea, Antman Elliott M, Petkova Eva. Development and validation of a treatment benefit index to identify hospitalized patients with COVID-19 who may benefit from convalescent plasma. JAMA Netw Open. 2022;5(1):e2147375. doi: 10.1001/jamanetworkopen.2021.47375.
- 50. Murphy SA. A generalization error for Q-learning. J Mach Learn Res. 2005;6:1073–1097.




