Biostatistics (Oxford, England). 2020 Aug 18;23(2):412–429. doi: 10.1093/biostatistics/kxaa032

A sparse additive model for treatment effect-modifier selection

Hyung Park 1, Eva Petkova 1, Thaddeus Tarpey 1, R Todd Ogden 1
PMCID: PMC9308457  PMID: 32808656

Summary

Sparse additive modeling is a class of effective methods for performing high-dimensional nonparametric regression. This article develops a sparse additive model focused on estimation of treatment effect modification with simultaneous treatment effect-modifier selection. We propose a version of the sparse additive model uniquely constrained to estimate the interaction effects between treatment and pretreatment covariates, while leaving the main effects of the pretreatment covariates unspecified. The proposed regression model can effectively identify treatment effect-modifiers that exhibit possibly nonlinear interactions with the treatment variable that are relevant for making optimal treatment decisions. A set of simulation experiments and an application to a dataset from a randomized clinical trial are presented to demonstrate the method.

Keywords: Individualized treatment rules, Sparse additive models, Treatment effect-modifiers, Biomarkers

1. Introduction

Identification of patient characteristics influencing treatment responses, which are often termed treatment effect-modifiers or treatment effect-moderators, is a top research priority in precision medicine. In this article, we develop a flexible yet simple and intuitive regression approach to identifying treatment effect-modifiers from a potentially large number of pretreatment patient characteristics. In particular, we utilize a sparse additive model (Ravikumar and others, 2009) to conduct effective treatment effect-modifier selection.

The clinical motivation for developing a high-dimensional additive regression model that can handle a large number of pretreatment covariates, and that is specifically designed for modeling treatment effect modification and variable selection, is a randomized clinical trial (Trivedi and others, 2016) for treatment of major depressive disorder. A large number of baseline patient characteristics were collected from each participant prior to randomization and treatment allocation. The primary goal of this study, and thus the goal of our proposed method, is to discover biosignatures of heterogeneous treatment response and to develop an individualized treatment rule (ITR; e.g., Qian and Murphy 2011; Zhao and others 2012; Laber and Zhao 2015; Kosorok and Laber 2019; Park and others 2020) for future patients based on these biosignatures, a key aspect of precision medicine (e.g., Ashley 2015; Fernandes and others 2017, among others). In particular, discovering and identifying which pretreatment patient characteristics influence treatment effects has the potential to significantly enhance clinical reasoning in practice (see, e.g., Royston and Sauerbrei, 2008).

The major challenge in efficiently modeling treatment effect modification from clinical trial data is that the variability due to treatment effect modification (i.e., the treatment-by-covariates interaction effects on outcomes), which is essential for developing ITRs (Qian and Murphy, 2011), is typically dwarfed by a relatively large degree of non-treatment-related variability (i.e., the main effects of the pretreatment covariates on treatment outcomes). In particular, in regression, due to potential confounding between the main effect of the covariates and the treatment-by-covariates interaction effect, misspecification of the covariate main effect may significantly influence estimation of the treatment-by-covariates interaction effect.

A simple and elegant linear model-based approach to modeling the treatment-by-covariates interactions, termed the modified covariate (MC) method, was developed by Tian and others (2014); it is robust against misspecification of the covariates’ main effects. The method utilizes a simple parameterization and bypasses the need to model the main effects of the covariates. See also Murphy (2003), Lu and others (2011), Shi and others (2016), and Jeng and others (2018) for similar linear model-based approaches to estimating the treatment-by-covariates interactions. However, these approaches assume a stringent linear model for the treatment-by-covariates interaction effects and are limited to the case of a binary-valued treatment variable. In this work, we extend the optimization framework of Tian and others (2014) for modeling the treatment-by-covariates interactions using a more flexible regression setting based on additive models (Hastie and Tibshirani, 1999) that can accommodate more than two treatment conditions, while allowing an unspecified main effect of the covariates. Additionally, via an appropriate regularization, the proposed approach simultaneously achieves treatment effect-modifier selection in estimation.

In Section 2, we introduce an additive model that has both the unspecified main effect of the covariates and the treatment-by-covariates interaction effect additive components. Then, we develop an optimization framework specifically targeting the interaction effect additive components of the model, with a sparsity-inducing regularization parameter to encourage sparsity in the set of component functions. In Section 3, we develop a coordinate descent algorithm to estimate the interaction effect part of the model and consider an optimization strategy for ITRs. In Section 4, we illustrate the performance of the method in terms of treatment effect-modifier selection and estimation of ITRs through simulation examples. Section 5 provides an illustrative application from a depression clinical trial, and the article concludes with discussion in Section 6.

2. Models

Let Inline graphic denote a treatment variable assigned with associated probabilities Inline graphic, Inline graphic, and Inline graphic, and let Inline graphic denote pretreatment covariates, independent of Inline graphic (as in the case of a randomized trial). If the treatment assignment depends on the covariates, then the probabilities Inline graphic can be replaced by propensity scores Inline graphic which would typically need to be estimated from data. We let Inline graphic (a = 1,…,L) be the potential outcome if the patient received treatment Inline graphicInline graphic; we only observe Inline graphic, Inline graphic and Inline graphic. Throughout the article, without loss of generality, we assume that a larger value of the treatment outcome Inline graphic is desirable, and that Inline graphic, i.e., the main effect of Inline graphic on the outcome is centered at Inline graphic; this is only to suppress the treatment Inline graphic-specific intercepts in regression models in order to simplify the exposition, and can be achieved by removing the treatment level Inline graphic-specific means from Inline graphic. We model the treatment outcome Inline graphic by the following additive model:

graphic file with name Equation1.gif (2.1)

In model (2.1), the first term Inline graphic, which in fact will not need to be specified in our exposition, does not depend on the treatment variable Inline graphic and thus the Inline graphic-by-Inline graphic interaction effects are determined only by the second component Inline graphic. In terms of modeling treatment effect modification, the term Inline graphic in (2.1) corresponds to the “signal” component, whereas the term Inline graphic corresponds to a “nuisance” component. In particular, the optimal ITR, under model (2.1), can be shown to satisfy Inline graphic, which does not involve the term Inline graphic. Therefore, the Inline graphic-by-Inline graphic interaction effect term Inline graphic shall be the primary estimation target of model (2.1).

In (2.1), for each individual covariate Inline graphic, we utilize a treatment Inline graphic-specific smooth Inline graphic separately for each treatment condition Inline graphic. However, it is useful to treat the set of treatment-specific smooths for Inline graphic as a single unit, i.e., a single component function Inline graphic, for the purpose of treatment effect-modifier variable selection.

In model (2.1), to separate Inline graphic from the component Inline graphic and obtain an identifiable representation, without loss of generality, we assume that the set of treatment Inline graphic-specific smooths Inline graphic of the Inline graphicth component function Inline graphic satisfies a condition (almost surely):

graphic file with name Equation2.gif (2.2)

Condition (2.2) implies Inline graphic and separates the Inline graphic-by-Inline graphic interaction effect component, Inline graphic, from the Inline graphic “main” effect component, Inline graphic, in model (2.1). For model (2.1), we assume an additive noise structure Inline graphic, where Inline graphic is a zero-mean noise with a finite second moment.
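The displayed forms of (2.1) and (2.2) are not reproduced in this text. A sketch that is consistent with the surrounding description is given below. The symbols Y (outcome), A (treatment, with assignment probabilities pi_a), X = (X_1, ..., X_p) (covariates), mu (main effect), and g_j (interaction components) are assumed notation, and the published displays may differ in detail.

```latex
% Assumed notation: outcome Y, treatment A in {1,...,L} with P(A = a) = \pi_a,
% covariates X = (X_1,...,X_p), unspecified main effect \mu, interaction components g_j.
\begin{align}
  E[Y \mid X, A] &= \mu(X) + \sum_{j=1}^{p} g_j(X_j, A), \tag{2.1} \\
  E[g_j(X_j, A) \mid X_j] &= \sum_{a=1}^{L} \pi_a \, g_j(X_j, a) = 0,
  \qquad j = 1, \dots, p. \tag{2.2}
\end{align}
```

Under this form, each interaction component averages to zero over the treatment arms at every covariate value, which is what separates it from the main effect.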

Notation: For a single component Inline graphic and general component functions Inline graphic, we define the Inline graphic norm of Inline graphic as Inline graphic, where the expectation is taken with respect to the joint distribution of Inline graphic. For a set of random variables Inline graphic, let Inline graphic with inner product on the space defined as Inline graphic. Sometimes we also write Inline graphic for notational simplicity.

Under model (2.1) subject to (2.2), the component functions Inline graphic associated with the Inline graphic-by-Inline graphic interaction effect can be viewed as the solution to the constrained optimization:

graphic file with name Equation3.gif (2.3)

where Inline graphic is given from the assumed model (2.1) (and is considered as fixed in (2.3)). Since the minimization in (2.3) is in terms of Inline graphic, the objective function part of the right-hand side of (2.3) can be reduced to:

graphic file with name Equation4.gif

where the second line is from an application of the iterated expectation rule conditioning on Inline graphic and the third line follows from the constraint Inline graphicInline graphic on the right-hand side of (2.3). Therefore, representation (2.3) can be simplified to:

graphic file with name Equation5.gif (2.4)
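The reduction from (2.3) to (2.4) can be sketched as follows. This is a reconstruction under assumed notation (G denotes the sum of the interaction components, mu the fixed main effect), not the article's own display. It uses the independence of the treatment and the covariates, so that the identifiability constraint gives E[G | X] = 0.

```latex
% Write G = \sum_{j=1}^{p} g_j(X_j, A); \mu is fixed, so terms free of G are constants.
\begin{align*}
  E\big[(Y - \mu(X) - G)^2\big]
    &= E\big[(Y - G)^2\big] - 2\,E\big[(Y - G)\,\mu(X)\big] + E\big[\mu(X)^2\big] \\
    &= E\big[(Y - G)^2\big] + 2\,E\big[\mu(X)\,E[G \mid X]\big]
       - 2\,E\big[Y \mu(X)\big] + E\big[\mu(X)^2\big] \\
    &= E\big[(Y - G)^2\big] + \text{(terms not depending on } g_1, \dots, g_p\text{)}.
\end{align*}
```

The last line holds because E[G | X] = 0 under the constraint, so the minimizer over the constrained components does not require knowledge of the main effect.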

Representation (2.4) of the component functions Inline graphic of the underlying model (2.1) is particularly useful when the (high-dimensional) “nuisance” function Inline graphic in (2.1) is complicated and prone to specification error. Note,

graphic file with name Equation6.gif (2.5)

indicating that the component Inline graphic (subject to the identifiability constraint Inline graphic) is structured to be orthogonal to the Inline graphic main effect Inline graphic. This orthogonality property is useful for estimating the additive model Inline graphic-by-Inline graphic interaction effect Inline graphic, in the presence of the unspecified Inline graphic of model (2.1).

Under model (2.1), the potential treatment effect-modifiers among Inline graphic enter the model only through the interaction effect term Inline graphic that associates the treatment Inline graphic to the treatment outcome Inline graphic. Ravikumar and others (2009) proposed sparse additive modeling (SAM) for component selection in high-dimensional additive models with a large Inline graphic. As in SAM, to deal with a large Inline graphic and to achieve treatment effect-modifier selection, we impose sparsity on the set of component functions Inline graphic associated with the interaction effect term of model (2.1), under the often practical and reasonable assumption that most covariates are irrelevant as treatment effect-modifiers. This sparsity structure on the index set Inline graphic for the nonzero component functions Inline graphic can be usefully incorporated into the optimization-based criterion (2.4) in representing Inline graphic:

graphic file with name Equation7.gif (2.6)

for a sparsity-inducing parameter Inline graphic. The term Inline graphic in (2.6) behaves like an Inline graphic ball across different components Inline graphic to encourage sparsity in the set of component functions. For example, a large value of Inline graphic on the right-hand side of (2.6) will generate a sparse solution with many component functions Inline graphic on the left-hand side set exactly to zero.
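For orientation, a penalized criterion of the kind described here can be written as below. This is a sketch assuming the same group-type penalty as in SAM (Ravikumar and others, 2009); the symbols lambda, g_j, and the norm are assumed notation, and the scaling of the squared-error term in the published (2.6) may differ.

```latex
\begin{equation}
  \{g_j^{*}\}_{j=1}^{p} \;=\; \operatorname*{argmin}_{\{g_j\}}\;
  E\Big[\Big(Y - \sum_{j=1}^{p} g_j(X_j, A)\Big)^{2}\Big]
  + \lambda \sum_{j=1}^{p} \lVert g_j \rVert
  \quad \text{subject to } E[g_j(X_j, A) \mid X_j] = 0 \ \text{for all } j, \tag{2.6}
\end{equation}
% with \lVert g_j \rVert = \sqrt{E[g_j^2(X_j, A)]} and \lambda \ge 0.
```

Because the penalty sums unsquared norms of whole components, it acts on each covariate's set of treatment-specific smooths as a single group.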

3. Estimation

3.1. Model estimation

For each Inline graphic, the minimizer Inline graphic of the optimization problem (2.6) has a component-wise closed-form expression.

Theorem 1

Given Inline graphic, the minimizer Inline graphic of (2.6) satisfies (almost surely):

Theorem 1 (3.7)

where

Theorem 1 (3.8)

in which

Theorem 1 (3.9)

represents the Inline graphicth partial residual. In (3.7), Inline graphic represents the positive part of Inline graphic.
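The displays (3.7) to (3.9) are not reproduced in this text. A form consistent with the surrounding description (soft-thresholding of a projected partial residual, as in SAM) is sketched below. The symbols f_j, R_j, and lambda are assumed notation, and the exact threshold constant depends on the scaling convention used for lambda in (2.6).

```latex
\begin{align}
  g_j^{*}(X_j, A) &= \Big[\,1 - \frac{\lambda}{\lVert f_j \rVert}\Big]_{+} f_j(X_j, A), \tag{3.7} \\
  f_j(X_j, A) &= E[R_j \mid X_j, A] - E[R_j \mid X_j], \tag{3.8} \\
  R_j &= Y - \sum_{j' \neq j} g_{j'}^{*}(X_{j'}, A), \tag{3.9}
\end{align}
```

Subtracting the conditional mean given the covariate alone in (3.8) is what enforces the identifiability constraint on each component.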

Note that the Inline graphic correspond to the projections of Inline graphic onto Inline graphic subject to the constraint in (2.6). The proof of Theorem 1 is in the Supplementary Materials available at Biostatistics online.

The component-wise expression (3.7) for Inline graphic suggests that we can employ a coordinate descent algorithm (e.g., Tseng, 2001) to solve (2.6). Given a sparsity parameter Inline graphic, we can use a standard backfitting algorithm used in fitting additive models (Hastie and Tibshirani, 1999) that fixes the set of current approximates for Inline graphic at all Inline graphic, and obtains a new approximate of Inline graphic by equation (3.7), and iterates through all Inline graphic until convergence. A sample version of the algorithm can be obtained by inserting sample estimates into the population expressions (3.9), (3.8), and (3.7) for each coordinate Inline graphic, which we briefly describe next.

Given data Inline graphicInline graphic, for each Inline graphic, let Inline graphic, corresponding to the data-version of the Inline graphicth partial residual Inline graphic in (3.9), where Inline graphic represents a current estimate for Inline graphic. We estimate Inline graphic in (3.7) in two steps: (i) estimate the function Inline graphic in (3.8); (ii) plug the estimate of Inline graphic into Inline graphic in (3.7), to obtain the soft-thresholded estimate Inline graphic.

Although any linear smoother can be utilized to obtain the estimators Inline graphic, as described in Remark 1 at the end of this section, in this article we focus on regression spline-type estimators, which are particularly simple and computationally efficient to implement. Specifically, for each Inline graphic, the function Inline graphic on the right-hand side of (2.6) will be represented by:

graphic file with name Equation11.gif (3.10)

for some prespecified Inline graphic-dimensional basis Inline graphic (e.g., Inline graphic-spline basis on evenly spaced knots on a bounded range for Inline graphic) and a set of unknown treatment Inline graphic-specific basis coefficients Inline graphic. Given representation (3.10) for the component function Inline graphic, the constraint Inline graphic in (2.6) can be simplified to Inline graphic. This constraint can be written succinctly in a matrix form as

graphic file with name Equation12.gif (3.11)

where Inline graphic is the vectorized version of the basis coefficients Inline graphic in (3.10), and the Inline graphic matrix Inline graphic where Inline graphic denotes the Inline graphic identity matrix.

Let the Inline graphic matrices Inline graphicInline graphic denote the evaluation matrices of the basis function Inline graphic on Inline graphicInline graphic specific to the treatment Inline graphicInline graphic, whose Inline graphicth row is the Inline graphic vector Inline graphic if Inline graphic, and a row of zeros Inline graphic if Inline graphic. Then the column-wise concatenation of the design matrices Inline graphic, i.e., the Inline graphic matrix Inline graphic, defines the model matrix associated with the vectorized model coefficient Inline graphic, vectorized across Inline graphic in representation (3.10). Then, we can represent the function Inline graphic in (3.10), based on the sample data, by the length-Inline graphic vector:

graphic file with name Equation13.gif (3.12)

subject to the linear constraint (3.11).
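The construction of the treatment-specific model matrix can be illustrated with a short sketch. The function names `poly_basis` and `treatment_design` are hypothetical, and a simple power basis is used as a stand-in for the B-spline basis of the article, so this only illustrates the block layout (basis rows for the observed arm, zeros elsewhere).

```python
import numpy as np

def poly_basis(x, d):
    """Stand-in d-dimensional basis (x^1, ..., x^d); the article uses B-splines."""
    return np.column_stack([x ** k for k in range(1, d + 1)])

def treatment_design(x, a, L, d):
    """Return the n x (d*L) matrix [D_1, ..., D_L] for one covariate.

    Row i of block D_arm holds the basis evaluated at x[i] if a[i] == arm,
    and a row of zeros otherwise. Arms are labelled 1, ..., L.
    """
    B = poly_basis(x, d)
    blocks = [B * (a == arm)[:, None] for arm in range(1, L + 1)]
    return np.hstack(blocks)
```

With three observations, two arms, and d = 2, each row has exactly one nonzero block, located at the columns of that observation's treatment arm.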

The linear constraint (3.11) on Inline graphic can be conveniently absorbed into the model matrix Inline graphic in (3.12) by reparametrization, as we describe next. We can find a Inline graphic basis matrix Inline graphic, such that if we set Inline graphic for any arbitrary vector Inline graphic, then the vector Inline graphic automatically satisfies the constraint (3.11) Inline graphic. Such a basis matrix Inline graphic can be constructed by a QR decomposition of the matrix Inline graphic. Then representation (3.12) can be reparametrized, in terms of the unconstrained vector Inline graphic, by replacing Inline graphic in (3.12) with the reparametrized model matrix Inline graphic:

graphic file with name Equation14.gif (3.13)
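The reparametrization step can be sketched as follows, assuming the constraint takes the form of a weighted sum of the treatment-specific coefficient vectors being zero, with weights given by the assignment probabilities. The function name `constraint_null_basis` and the variable names `C` and `Z` are hypothetical. The basis is obtained from a full QR decomposition of the stacked constraint matrix.

```python
import numpy as np

def constraint_null_basis(pi, d):
    """Return (C, Z) where the columns of Z span {theta : C.T @ theta = 0}.

    C stacks pi[a] * I_d vertically (shape d*L x d), so C.T @ theta = 0 is the
    assumed constraint sum_a pi[a] * theta_a = 0 on the arm-specific
    coefficient vectors theta_a in R^d.
    """
    C = np.vstack([w * np.eye(d) for w in pi])
    # Full QR: the first d columns of Q span the column space of C,
    # the remaining d*(L-1) columns are orthogonal to it.
    Q, _ = np.linalg.qr(C, mode="complete")
    return C, Q[:, d:]
```

Any coefficient vector of the form `Z @ beta` then satisfies the constraint automatically, so the model matrix can be replaced by its product with `Z` and fitted without constraints.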

Theorem 1 and Section 2 of Supplementary Materials available at Biostatistics online indicate that the coordinate-wise minimizer Inline graphic of (2.6) can be estimated based on the sample by

graphic file with name Equation15.gif (3.14)

where

graphic file with name Equation16.gif (3.15)

in which Inline graphic is the estimated Inline graphicth partial residual vector. In (3.14), the norm Inline graphic of (3.7) is estimated by the vector norm Inline graphic, and the shrinkage factor Inline graphic of (3.7) is estimated by Inline graphic.
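The sample thresholding step can be sketched in a few lines. This assumes the soft-thresholding has the same form as in SAM, with the function norm estimated by the root mean square of the fitted values; the function name `soft_threshold_component` is hypothetical and the threshold constant depends on how lambda is scaled in (2.6).

```python
import numpy as np

def soft_threshold_component(f_hat, lam):
    """Shrink a fitted component: [1 - lam / ||f_hat||]_+ * f_hat.

    ||f_hat|| is taken to be sqrt(mean(f_hat**2)), an empirical version of
    the L2 function norm (an assumption of this sketch).
    """
    f_hat = np.asarray(f_hat, dtype=float)
    norm = np.sqrt(np.mean(f_hat ** 2))
    if norm == 0.0:
        return np.zeros_like(f_hat)
    shrink = max(0.0, 1.0 - lam / norm)
    return shrink * f_hat
```

When the estimated norm falls below lambda, the whole component is set to zero, which removes the corresponding covariate from the fitted interaction model.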

Based on the sample counterpart (3.14) of the coordinate-wise solution (3.7), a highly efficient coordinate descent algorithm can be conducted to simultaneously estimate all the component functions Inline graphic in (2.6). At convergence of the coordinate descent, we have a basis coefficient estimate associated with the representation (3.13),

graphic file with name Equation17.gif (3.16)

which in turn implies an estimate

graphic file with name Equation18.gif

for the basis coefficient associated with the representation (3.12). This gives an estimate of the treatment Inline graphic-specific function Inline graphicInline graphic in model (2.1):

graphic file with name Equation19.gif (3.17)

estimated within the class of functions (3.10) for a given tuning parameter Inline graphic, which controls the shrinkage factor Inline graphic in (3.16). We summarize the computational procedure for the coordinate descent in Algorithm 1.

Algorithm 1 Coordinate descent

  1. Input: Data Inline graphic, Inline graphic, Inline graphic, and tuning parameter Inline graphic.

  2. Output: Fitted functions Inline graphic.

  3. Initialize Inline graphicInline graphic; pre-compute the smoother matrices Inline graphic in (3.15) Inline graphic.

  4. Repeat until convergence of Inline graphic, iterating through Inline graphic:

  5. Compute the partial residual Inline graphic

  6. Compute Inline graphic in (3.15); then compute the thresholded estimate Inline graphic in (3.14).

In Algorithm 1, the projection matrices Inline graphicInline graphic only need to be computed once and therefore the coordinate descent can be performed efficiently. In (3.14), if the shrinkage factor Inline graphic, the associated Inline graphicth covariate is absent from the model. The tuning parameter Inline graphic for treatment effect-modifier selection can be chosen to minimize an estimate of the expected squared error of the fitted models, Inline graphic, over a dense grid of Inline graphic’s, estimated, for example, by cross-validation. Alternatively, one can utilize the network information criterion (Murata and Amari, 1994), which is a generalization of the Akaike information criterion in approximating the prediction error, for the case where the true underlying model, i.e., model (2.1), is not necessarily in the class of candidate models. Throughout the article, Inline graphic is chosen to minimize the Inline graphic-fold cross-validated prediction error of the fitted models.
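The steps of Algorithm 1 can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: a power basis stands in for the B-spline basis, the smoother is an ordinary least-squares projection, the outcome is assumed to be centered within each arm, and the names `reparam_design` and `fit_sparse_interactions` are hypothetical.

```python
import numpy as np

def reparam_design(x, a, pi, d):
    """n x d(L-1) design for one covariate with the constraint absorbed."""
    L = len(pi)
    # Stand-in basis (powers of x); the article uses a B-spline basis instead.
    B = np.column_stack([x ** k for k in range(1, d + 1)])
    # Treatment-specific blocks: rows are zeroed where a[i] != arm.
    D = np.hstack([B * (a == arm)[:, None] for arm in range(1, L + 1)])
    # Null-space basis of the constraint sum_a pi[a] * theta_a = 0, via QR.
    C = np.vstack([w * np.eye(d) for w in pi])
    Q, _ = np.linalg.qr(C, mode="complete")
    return D @ Q[:, d:]

def fit_sparse_interactions(X, a, y, pi, lam, d=3, max_iter=200, tol=1e-8):
    """Coordinate descent; returns G with G[i, j] = fitted g_j(X[i, j], a[i])."""
    n, p = X.shape
    # Pre-compute one projection ("smoother") matrix per covariate.
    S = []
    for j in range(p):
        Dt = reparam_design(X[:, j], a, pi, d)
        S.append(Dt @ np.linalg.pinv(Dt))
    G = np.zeros((n, p))
    for _ in range(max_iter):
        G_old = G.copy()
        for j in range(p):
            r = y - G.sum(axis=1) + G[:, j]      # j-th partial residual
            f = S[j] @ r                          # projected residual
            norm = np.sqrt(np.mean(f ** 2))       # empirical function norm
            shrink = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
            G[:, j] = shrink * f                  # soft-thresholded update
        if np.max(np.abs(G - G_old)) < tol:
            break
    return G
```

Columns of the returned matrix that are identically zero correspond to covariates dropped from the interaction model. In practice, `lam` would be chosen over a grid, for example by cross-validation as described above.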

Remark 1

For coordinate descent, any linear smoother can be utilized to obtain the sample counterpart (3.14) of the coordinate-wise solution (3.7), i.e., the method is not restricted to regression splines. To estimate the function Inline graphic in (3.8), we can estimate the first term Inline graphic in (3.8), using a 1D nonparametric smoother for each treatment level Inline graphic separately, based on the data Inline graphicInline graphic corresponding to the data from the Inline graphicth treatment condition; we can also estimate the second term Inline graphic in (3.8) based on the data Inline graphic which corresponds to the set of data from all treatment conditions, using a 1D nonparametric smoother. Adding these two estimates evaluated at the Inline graphic observed values of Inline graphicInline graphic gives an estimate Inline graphic in (3.14). Then, we can compute the associated estimate Inline graphic, which allows implementation of the coordinate descent in Algorithm 1.

3.2. Individualized treatment rule estimation

For a single time decision point, an ITR, which we denote by Inline graphic, maps an individual with pretreatment characteristics Inline graphic to one of the Inline graphic available treatment options. One natural measure for the effectiveness of an ITR Inline graphic in precision medicine is the so-called “value” (Inline graphic) function (Murphy, 2005):

graphic file with name Equation20.gif (3.18)

which is the expected treatment response under a given ITR Inline graphic. The optimal ITR Inline graphic, which we write as Inline graphic, can be naturally defined to be the rule that maximizes the value Inline graphic (3.18). Such an optimal rule Inline graphic satisfies:

graphic file with name Equation21.gif (3.19)

Much work has been carried out to develop methods for estimating the optimal ITR (3.19) using data from randomized clinical trials. Machine learning approaches to estimating (3.19) are often framed in the context of a (weighted) classification problem (Zhang and others, 2012; Zhao and others, 2019), where the function Inline graphic in (3.19) is regarded as the optimal classification rule given Inline graphic for the treatment assignment with respect to the objective function (3.18). These classification-based approaches to optimizing ITRs include the outcome-weighted learning (OWL) (e.g., Zhao and others, 2012; 2015; Song and others, 2015; Liu and others, 2018) based on support vector machines (SVMs), tree-based classification (e.g., Laber and Zhao, 2015), and adaptive boosting (Kang and others, 2014), among others.

Under model (2.1), Inline graphic in (3.19) is: Inline graphic, which can be estimated by: Inline graphic, where Inline graphic is given in (3.17) at convergence of Algorithm 1. This estimator can be viewed as a regression-based approach to estimating (3.19) that approximates the conditional expectations Inline graphicInline graphic based on the additive model (2.1), while maintaining robustness with respect to model misspecification of the “nuisance” function Inline graphic in (2.1) via representation (2.6) for the “signal” components Inline graphic. We illustrate the performance of this ITR estimator Inline graphic with respect to the value function (3.18), through a set of simulation studies in Section 4.2.
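A regression-based rule of this kind, and one way to assess it, can be sketched as follows. The rule assigns the arm with the largest fitted interaction score. The value estimate uses a standard inverse-probability-weighted average, which is an assumption of this sketch and is not taken from the article; the names `estimate_itr` and `ipw_value` are hypothetical.

```python
import numpy as np

def estimate_itr(scores):
    """Pick, for each individual, the arm (1..L) with the largest score.

    scores[i, a-1] is assumed to hold the fitted interaction effect
    sum_j g_j(x_ij, a) for individual i under arm a.
    """
    return np.argmax(np.asarray(scores), axis=1) + 1

def ipw_value(y, a, rule, pi):
    """Inverse-probability-weighted estimate of the value of a rule.

    Averages y over individuals whose observed arm agrees with the rule,
    weighting each by 1 / pi[a-1] (known randomization probabilities).
    """
    y, a, rule = map(np.asarray, (y, a, rule))
    weights = (a == rule) / np.asarray(pi)[a - 1]
    return float(np.mean(y * weights))
```

Since larger outcomes are assumed desirable, a larger estimated value indicates a better rule; on held-out data this gives a simple way to compare competing ITR estimators.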

3.3. Feature selection and transformation for individualized treatment rules

Although machine learning approaches that attempt to directly maximize (3.18) without assuming some specific structure on Inline graphicInline graphic (unlike most of the regression-based approaches) are highly appealing, common machine learning approaches used in optimizing ITRs, including SVMs utilized in the OWL, are often hard to scale to large datasets, due to their taxing computational time. In particular, SVMs are viewed as “shallow” approaches (as opposed to a “deep” learning method that utilizes a learning model with many representational layers) and successful applications of SVMs often require first extracting useful representations for their input data manually or through some data-driven feature transformation (a step called feature engineering) (see, e.g., Kuhn and Johnson, 2019) to have more discriminatory power. Generally, selection and transformation of relevant features can increase the performance, scale, and speed of a machine learning procedure.

As an added value, the proposed regression (2.6) based on model (2.1) provides a practical feature selection and transformation learning technique for optimizing ITRs. The set of component functions Inline graphic in (2.6) can be used to define data-driven feature transformation functions for the original features Inline graphic. The resulting transformed features can be used as inputs to a machine learning algorithm for optimizing ITRs and can lead to good results in practical situations.

In particular, we note that for each Inline graphic, the component function Inline graphic in (2.6) is defined separately from the Inline graphic main effect function Inline graphic in (2.1). Therefore, the corresponding transformed feature variable Inline graphic, which represents the Inline graphicth feature Inline graphic in the new space, highlights only the “signal” nonlinear effect of Inline graphic associated with the Inline graphic-by-Inline graphic interactions on their effects on the outcome that is relevant to estimating Inline graphic, while excluding the Inline graphic main effect that is irrelevant to the ITR development. This “de-noising” procedure for each variable Inline graphic can be very appealing, since irrelevant or partially relevant features can negatively impact the performance of a machine learning algorithm. Moreover, a relatively large value of the tuning parameter Inline graphic in (2.6) would imply a set of sparse component functions Inline graphic, providing a means of feature selection for ITRs. For the most common case of Inline graphic (binary treatment), we have Inline graphic implied by the constraint (2.2) that we impose, which is simply a scalar-scaling of the function Inline graphic; this implies that, for each Inline graphic, the mapping Inline graphic specifies the feature transformation of Inline graphic. We demonstrate the utility of this feature selection/transformation, which we use as an input to the OWL approach to optimizing ITRs, through a set of simulation studies in Section 4.2 and a real data application in Section 5.

4. Simulation study

4.1. Treatment effect-modifier selection performance

In this section, we will report simulation results illustrating the performance of the treatment effect-modifier selection. The complexity of the model for studying the Inline graphic-by-Inline graphic interactions can be summarized in terms of the size of the index set for the component functions Inline graphic that are not identically zero. We can ascertain the performance of a treatment effect-modifier selection method in terms of these component functions correctly or incorrectly estimated as nonzero. To generate the data, we use the following model:

graphic file with name Equation22.gif (4.20)

where Inline graphicInline graphic, Inline graphic are generated from independent Inline graphic, and the treatment variable Inline graphic is generated independently of Inline graphic and the error term Inline graphic, such that Inline graphic. We set Inline graphic, Inline graphic and Inline graphic. The graphs of these functions are displayed in Figure S.1 of Supplementary Materials available at Biostatistics online.

Under model (4.20), there are only three true treatment effect-modifiers, Inline graphic, and Inline graphic. The other Inline graphic covariates are “noise” covariates, that are not consequential for optimizing ITRs. Also, in (4.20), there are Inline graphic covariates Inline graphicInline graphic, among the Inline graphic covariates, associated with the Inline graphic main effects. Under the setting (4.20), the contribution to the variance of the outcome from the Inline graphic main effect component was about Inline graphic times greater than that from the Inline graphic-by-Inline graphic interaction effect component.

We consider two approaches to treatment effect-modifier selection: (i) the proposed additive regression approach (2.6) that specifies a sparse set of functions Inline graphic, estimated via Algorithm 1, with the dimension of the cubic Inline graphic-spline basis Inline graphic in (3.10) set to be Inline graphicInline graphic; and (ii) the linear regression (MC) approach of Tian and others (2014),

graphic file with name Equation23.gif (4.21)

which specifies a sparse vector Inline graphic. Given each simulated dataset, the tuning parameter Inline graphic is chosen to minimize a Inline graphic-fold cross-validated prediction error.

Figure 1 summarizes the results of the treatment effect-modifier selection performance with respect to the true/false positive rates (the left/right two panels, respectively), comparing the proposed additive regression to the linear regression approach, which are reported as the averages (and Inline graphic standard deviations) across the Inline graphic simulation replications. Figure 1 illustrates that, for both the Inline graphic and Inline graphic cases, the proportion of correctly selected treatment effect-modifiers (i.e., the “true positive”; the left two panels) of the additive regression method (the red solid curves) tends to Inline graphic, with Inline graphic increasing from Inline graphic to Inline graphic, while the proportion of incorrectly selected treatment effect-modifiers (i.e., the “false positive”; the right two panels) tends to be bounded above by a small number. On the other hand, the proportion of correctly selected treatment effect-modifiers for the linear regression method (the blue dotted curves) tends to be only around Inline graphic for both choices of Inline graphic. In Figure S.2 of Supplementary Materials available at Biostatistics online, we further examine the true positive rates reported in Figure 1, by separately displaying the true positive rates associated with selection of Inline graphic, Inline graphic and Inline graphic respectively. The more flexible additive regression significantly outperforms the linear regression in terms of selecting the covariates Inline graphic and Inline graphic, i.e., the ones that have the nonlinear interaction effects (see Figure S.1 of Supplementary Materials available at Biostatistics online for the functions associated with the interaction effects) with Inline graphic, while both methods perform at a similar level in selecting Inline graphic, which has a linear interaction effect with Inline graphic.

Figure 1.


The proportion of the three relevant covariates (i.e., Inline graphic and Inline graphic) correctly selected (the “true positives”; the two gray panels), and the Inline graphic irrelevant covariates (i.e., Inline graphic) incorrectly selected (the “false positives”; the two white panels), respectively (and Inline graphic standard deviation), as the sample size Inline graphic varies from Inline graphic to Inline graphic, for each Inline graphic.

4.2. Individualized treatment rule estimation performance

In this subsection, we assess the optimal ITR estimation performance of the proposed method based on simulations. We generate a vector of covariates Inline graphic, Inline graphic (a Inline graphic case is considered in Section S.5.1 of Supplementary Materials available at Biostatistics online) based on a multivariate normal distribution with each component having the marginal distribution Inline graphic with the correlation between the components Inline graphic. Responses were generated for (i) “highly nonlinear” Inline graphic-by-Inline graphic interactions:

[Equation (4.22)]

and for (ii) “moderately nonlinear” Inline graphic-by-Inline graphic interactions:

[Equation (4.23)]

where the treatment variable Inline graphic is generated independently from the covariates Inline graphic and the error term Inline graphic, such that Inline graphic. Models (4.22) and (4.23) are indexed by a pair Inline graphic. First, the parameter Inline graphic controls the proportion of the variance of the response Inline graphic attributable to the Inline graphic “main” effect: Inline graphic corresponds to a moderate Inline graphic main effect contribution; Inline graphic corresponds to a large Inline graphic main effect contribution. Estimation of the interaction effect becomes more difficult with a larger Inline graphic. Second, the parameter Inline graphic determines whether the Inline graphic-by-Inline graphic interaction effect term has an exact additive regression structure Inline graphic or whether it deviates from an additive structure Inline graphic. In the case of Inline graphic, the proposed model (2.1) is correctly specified, whereas, for the case of Inline graphic, it is misspecified. For each scenario, we consider the following four approaches to estimating the optimal ITR Inline graphic in (3.19).

  1. The proposed additive regression approach (2.6), estimated via Algorithm 1. The dimension of the basis function Inline graphic in (3.10) is taken to be Inline graphic Inline graphic. Given estimates Inline graphic, the estimate of Inline graphic in (3.19) is Inline graphic.

  2. The linear regression (MC) approach (4.21) of Tian and others (2014), implemented through the R-package glmnet, with the sparsity tuning parameter Inline graphic selected by minimizing a 10-fold cross-validated prediction error. Given an estimate Inline graphic, the corresponding estimate of Inline graphic in (3.19) is Inline graphic.

  3. The OWL method (Zhao and others, 2012) based on a Gaussian radial kernel, implemented in the R-package DTRlearn, with a set of feature transformed (FT) covariates Inline graphic used as an input to the OWL method, in which the functions Inline graphic Inline graphic are obtained from the approach in 1. To improve the efficiency of the OWL, we employ the augmented OWL approach of Liu and others (2018). The inverse bandwidth parameter Inline graphic and the tuning parameter Inline graphic in Zhao and others (2012) are chosen from the grid of Inline graphic and that of Inline graphic (the default setting of DTRlearn), respectively, based on a Inline graphic-fold cross-validation.

  4. The same (OWL) approach as in 3 but based on the original features Inline graphic.

For each simulation run, we estimate Inline graphic from each of the four methods based on a training set (of size Inline graphic), and for evaluation of these methods, we compute the value Inline graphic in (3.18) for each estimate Inline graphic, using a Monte Carlo approximation based on a random sample of size Inline graphic. Since we know the true data generating model in simulation studies, the optimal Inline graphic can be determined for each simulation run. Given each estimate Inline graphic of Inline graphic, we report Inline graphic, as the performance measure of Inline graphic. A larger value (i.e., a smaller difference from the optimal value) of the measure indicates better performance.
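The evaluation step described above can be sketched as follows. This is an illustration under stated assumptions rather than the simulation design of models (4.22) and (4.23): the mean function `mean_outcome`, the two-covariate setting, the competing rule `linear_rule`, and the Monte Carlo sample size are all hypothetical, and the reported measure is taken to be the value of a rule minus the optimal value, which is one reading of the elided formula.

```python
import math
import random


def mean_outcome(x, a):
    """Hypothetical E[Y | X = x, A = a] for treatments a in {1, 2}."""
    main = math.cos(x[0]) + 0.5 * x[1]                      # main effect of X
    contrast = math.sin(x[0]) - 0.5 * x[1] ** 2 + 0.25      # interaction part
    sign = 1.0 if a == 2 else -1.0
    return main + sign * contrast


def optimal_rule(x):
    """The optimal rule is known here because the mean function is known."""
    return max((1, 2), key=lambda a: mean_outcome(x, a))


def linear_rule(x):
    """A deliberately simple competing rule (hypothetical)."""
    return 2 if x[0] > 0 else 1


def mc_value(rule, sample):
    """Monte Carlo approximation of the value E[ E[Y | X, A = rule(X)] ]."""
    return sum(mean_outcome(x, rule(x)) for x in sample) / len(sample)


rng = random.Random(0)
sample = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20000)]

v_opt = mc_value(optimal_rule, sample)
v_lin = mc_value(linear_rule, sample)
gap = v_lin - v_opt  # at most 0; closer to 0 means better performance
print(f"optimal value {v_opt:.3f}, linear rule {v_lin:.3f}, difference {gap:.3f}")
```

Because the optimal rule maximizes the conditional mean at every covariate value, its Monte Carlo value on a fixed sample is never below that of any other rule evaluated on the same sample.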

In Figure 2, we present the boxplots, obtained from Inline graphic simulation runs, of the normalized values Inline graphic (normalized by the optimal values Inline graphic) of the decision rules Inline graphic estimated from the four approaches, for each combination of Inline graphic, Inline graphic (corresponding to correctly specified or misspecified additive interaction effect models, respectively) and Inline graphic (corresponding to moderate or large main effects, respectively), for the highly nonlinear and moderately nonlinear Inline graphic-by-Inline graphic interaction effect scenarios, in the top and bottom panels, respectively. The proposed additive regression clearly outperforms the OWL (without feature transformation) method in all scenarios (both the top and bottom panels) and also the linear regression approach in all of the highly nonlinear Inline graphic-by-Inline graphic interaction effect scenarios (the top panels). For the moderately nonlinear Inline graphic-by-Inline graphic interaction effect scenarios (the bottom panels), when Inline graphic, all the methods except the OWL perform at a near-optimal level. On the other hand, when Inline graphic (i.e., when the underlying Inline graphic-by-Inline graphic interaction effect model deviates from the additive structure), the more flexible additive model significantly outperforms the linear model. We have also considered a linear Inline graphic-by-Inline graphic interaction effect case in Section S.5.2 of Supplementary Materials available at Biostatistics online, in which the linear regression outperforms the additive regression, but only slightly, whereas if the underlying model deviates from the exact linear structure and Inline graphic, the more flexible additive regression tends to outperform the linear model. This suggests that, in the absence of prior knowledge about the form of the interaction effect, employing the proposed additive regression is more suitable for optimizing ITRs than the linear regression. Comparing the two OWL methods (OWL (FT) and OWL), Figure 2 illustrates that the feature transformation based on the estimated component functions Inline graphic provides a considerable benefit in their performance. This suggests the utility of the proposed model (2.1) as a potential feature transformation and selection tool for a machine learning algorithm for optimizing ITRs. In Figure 2, comparing the cases with Inline graphic to those with Inline graphic, the increased magnitude of the main effect generally dampens the performance of all approaches, as the “noise” variability in the data generation model increases.

Figure 2.

Boxplots based on Inline graphic simulation runs, comparing the four approaches to estimating Inline graphic, given each scenario indexed by Inline graphic, Inline graphic and Inline graphic, for the highly nonlinear Inline graphic-by-Inline graphic interaction effect case in the top panels, and the moderately nonlinear Inline graphic-by-Inline graphic interaction effect case in the bottom panels. The dotted horizontal line represents the optimal value corresponding to Inline graphic.

5. Application to data from a depression clinical trial

In this section, we illustrate the utility of the proposed additive regression for estimating treatment effect modification and optimizing individualized treatment rules, using data from a depression clinical trial study, comparing an antidepressant and placebo for treating major depressive disorder (Trivedi and others, 2016). The goal of the study is to identify baseline characteristics that are associated with differential response to the antidepressant versus placebo and to use those characteristics to guide treatment decisions when a patient presents for treatment.

Study participants (a total of Inline graphic participants) were randomized to either placebo (Inline graphic; Inline graphic) or an antidepressant (sertraline) (Inline graphic; Inline graphic). Subjects were monitored for 8 weeks after initiation of treatment, and the primary endpoint of interest was the Hamilton Rating Scale for Depression (HRSD) score at week 8. The outcome Inline graphic was taken to be the improvement in symptom severity from baseline to week 8, defined as the week 0 HRSD score minus the week 8 HRSD score. (Larger values of the outcome Inline graphic are considered desirable.) The study collected baseline patient clinical data prior to treatment assignment. These pretreatment clinical data Inline graphic include: Inline graphic Age at evaluation; Inline graphic Severity of depressive symptoms measured by the HRSD at baseline; Inline graphic Logarithm of duration (in months) of the current major depressive episode; and Inline graphic Age of onset of the first major depressive episode. In addition to these standard clinical assessments, patients underwent neuropsychiatric testing at baseline to assess psychomotor slowing, working memory, reaction time (RT), and cognitive control (e.g., post-error recovery), as these behavioral characteristics are believed to correspond to biological phenotypes related to response to antidepressants (Petkova and others, 2017) and are considered as potential modifiers of the treatment effect. These neuropsychiatric baseline test measures include: Inline graphic (A not B) RT-negative; Inline graphic (A not B) RT-non-negative; Inline graphic (A not B) RT-all; Inline graphic (A not B) RT-total correct; Inline graphic Median choice RT; Inline graphic Word fluency; Inline graphic Flanker accuracy; Inline graphic Flanker RT; Inline graphic Post-conflict adjustment.

The proposed approach (2.6) to estimating the Inline graphic-by-Inline graphic interaction effect part Inline graphic of model (2.1), estimated via Algorithm 1, simultaneously selected three pretreatment covariates as treatment effect-modifiers: Inline graphic (“Age at evaluation”), Inline graphic (“Word fluency test”), and Inline graphic (“Flanker accuracy test”). The top panels in Figure 3 illustrate the estimated non-zero component functions Inline graphic (i.e., the component functions corresponding to the selected covariates Inline graphic, Inline graphic, and Inline graphic) and the associated partial residuals. The linear regression approach (4.21) to estimating the Inline graphic-by-Inline graphic interactions selected a single covariate, Inline graphic, as a treatment effect-modifier.

Figure 3.

The top panels: Scatterplots of partial residuals vs. the covariates associated with estimated non-zero component functions Inline graphic for placebo (blue circles) and the active drug (red triangles) treated participants; for each panel, the blue dashed curve represents Inline graphic, corresponding to the placebo (Inline graphic), and the red solid curve represents Inline graphic, corresponding to the active drug (Inline graphic). The bottom panel: A scatterplot of the treatment response Inline graphic versus the index Inline graphic, for the drug (red triangles) and the placebo (blue circles) groups; the blue dashed line is Inline graphic and the red solid line is Inline graphic; a gray dashed vertical line indicates the threshold value Inline graphic associated with the ITR Inline graphic.

Note that, in the binary (i.e., Inline graphic) treatment case, any ITR Inline graphic partitions the domain of Inline graphic, Inline graphic, into two regions: Inline graphic and Inline graphic. Let Inline graphic Inline graphic represent the treatment Inline graphic-specific intercept estimates. Then the proposed estimator Inline graphic is equivalent to:

[Equation (5.24)]

In this dataset, Inline graphic (corresponding to the placebo) and Inline graphic (corresponding to the drug). The optimal treatment region in the Inline graphic space, implied by the ITR (5.24), is illustrated in Figure 4, where we have utilized Inline graphic equally spaced grid points (i.e., Inline graphic for each axis) for visualization of the regions Inline graphic and Inline graphic.
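The grid-based visualization of the two decision regions can be sketched as below. This is only an illustration: it assumes a rule that assigns the treatment maximizing a treatment-specific intercept plus a sum of component functions, which is one plausible form for a rule of this kind and is not taken from (5.24). The intercepts, the component shapes, the unit-cube covariate range, and the grid size are all hypothetical and are not the fitted quantities from the trial data.

```python
from itertools import product

# Hypothetical treatment-specific intercepts for treatments 1 (placebo) and 2 (drug).
INTERCEPT = {1: 0.0, 2: 0.1}

# Hypothetical component shapes for three selected covariates scaled to [0, 1].
SHAPES = (
    lambda x: x - 0.5,
    lambda x: (x - 0.5) ** 2 - 1.0 / 12.0,
    lambda x: 0.5 * (0.5 - x),
)


def component(j, x, a):
    """Hypothetical g_j(x, a): the same shape with opposite signs for the two arms."""
    sign = 1.0 if a == 2 else -1.0
    return sign * SHAPES[j](x)


def score(x, a):
    """Intercept plus additive interaction score for treatment a at covariates x."""
    return INTERCEPT[a] + sum(component(j, xj, a) for j, xj in enumerate(x))


def rule(x):
    """Assign the treatment with the larger score (assumed form of the rule)."""
    return max((1, 2), key=lambda a: score(x, a))


def decision_regions(k):
    """Split a k x k x k equally spaced grid on the unit cube by recommended treatment."""
    axis = [i / (k - 1) for i in range(k)]
    regions = {1: [], 2: []}
    for x in product(axis, repeat=3):
        regions[rule(x)].append(x)
    return regions


regions = decision_regions(5)
print({a: len(points) for a, points in regions.items()})
```

Plotting the points of `regions[1]` and `regions[2]` in two colors gives a picture of the kind described for the 3D cube of selected covariates.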

Figure 4.

The decision regions Inline graphic (corresponding to the placebo, dark blue) and Inline graphic (corresponding to the active drug, orange) displayed in the 3D cube of the three covariates Inline graphic (age at evaluation), Inline graphic (word fluency test), and Inline graphic (Flanker accuracy test), evaluated at Inline graphic equally spaced grid points (Inline graphic for each axis), shown from two different angles (left and right panels).

For an alternative way of visualizing the ITR (5.24), let us define a 1D index Inline graphic. By the constraint (2.2), we have the relationship Inline graphic. Therefore, the term Inline graphic in the decision rule (5.24) can be reparametrized, with respect to Inline graphic, as Inline graphic. In this dataset, Inline graphic and Inline graphic, and thus the ITR (5.24) can be rewritten as: Inline graphic. This ITR indicates that, for a patient with pretreatment characteristics Inline graphic, if Inline graphic, then the patient would be recommended the active drug (Inline graphic); otherwise, the patient would be recommended the placebo (Inline graphic). For example, for a patient with Inline graphic, Inline graphic and Inline graphic, the index Inline graphic (see the bottom panel in Figure 3 for a visualization) and thus the patient would be recommended the placebo.

To evaluate the performance of the ITRs (Inline graphic) obtained from the four different approaches described in Section 4.2, we randomly split the data into a training set and a testing set (of size Inline graphic) using a ratio of Inline graphic to Inline graphic, replicated Inline graphic times, each time computing an ITR Inline graphic based on the training set, then estimating its value Inline graphic in (3.18) by an inverse probability weighted estimator (Murphy, 2005): Inline graphic, computed based on the testing set of size Inline graphic. For comparison, we also include two naïve rules: treating all patients with placebo (“All PBO”) and treating all patients with the active drug (“All DRUG”), each regardless of the individual patient’s characteristics Inline graphic. The resulting boxplots obtained from the Inline graphic random splits are illustrated in Figure 5. A larger value of the measure indicates better performance.
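The split-and-evaluate procedure can be sketched as follows. The estimator below is the commonly used normalized inverse probability weighted form, which is an assumption here because the exact expression is not recoverable from the text. The helper names, the split fraction, the known randomization probabilities, and the four-record toy dataset are hypothetical.

```python
import random


def ipw_value(data, rule, propensity):
    """Normalized IPW estimate of the value of `rule`.

    data:       list of (x, a, y) records from the testing set
    rule:       function mapping covariates x to a recommended treatment
    propensity: dict mapping treatment a to its randomization probability
    """
    num = den = 0.0
    for x, a, y in data:
        if rule(x) == a:                  # only records consistent with the rule count
            weight = 1.0 / propensity[a]
            num += weight * y
            den += weight
    return num / den if den > 0 else float("nan")


def random_split(data, train_fraction, rng):
    """Shuffle a copy of the data and split it into training and testing sets."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy records (x, a, y) with treatments 1 (placebo) and 2 (drug); not trial data.
toy = [((0.0,), 1, 2.0), ((1.0,), 2, 5.0), ((2.0,), 2, 7.0), ((3.0,), 1, 4.0)]
prop = {1: 0.5, 2: 0.5}

all_pbo = ipw_value(toy, lambda x: 1, prop)    # naive rule: everyone on placebo
all_drug = ipw_value(toy, lambda x: 2, prop)   # naive rule: everyone on the drug
train, test = random_split(toy, 0.75, random.Random(1))
print(all_pbo, all_drug, len(train), len(test))
```

In a full evaluation, a rule would be fitted on `train`, scored with `ipw_value` on `test`, and the split repeated many times to produce the boxplots.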

Figure 5.

Boxplots of the estimated values of the treatment rules Inline graphic estimated from Inline graphic approaches, obtained from Inline graphic randomly split testing sets. Higher values are preferred.

The results in Figure 5 demonstrate that the proposed additive regression approach, which allows nonlinear flexibility in developing ITRs, tends to outperform the linear regression approach in terms of the estimated value. The additive regression approach also shows some superiority over the OWL method (without feature transformation). In comparison to the OWL methods, the proposed additive regression, in addition to its superior computational efficiency, provides a means of simultaneously selecting treatment effect-modifiers and allows a visualization of the heterogeneous effects attributable to each estimated treatment effect-modifier as in Figure 3, which is an appealing feature in practice. Moreover, the estimated component functions Inline graphic of the proposed regression provide an effective means of performing feature transformation for Inline graphic. As in Section 4.2, the FT OWL approach appears to improve considerably on the OWL that is based on the original untransformed covariates.

6. Discussion

In this article, we have developed a sparse additive model, via a structural constraint, specifically geared to identify and model treatment effect-modifiers. The approach utilizes an efficient back-fitting algorithm for model estimation and variable selection. The proposed sparse additive model for treatment effect modification extends existing linear model-based regression methods by providing nonlinear flexibility to modeling treatment-by-covariate interactions. Encouraged by our simulation results and the application, future work will investigate the asymptotic properties related to treatment effect-modifier selection and estimation consistency, in addition to developing hypothesis testing procedures for treatment-by-covariates interaction effects.

Modern advances in biotechnology, using measures of brain structure and function obtained from neuroimaging modalities (e.g., magnetic resonance imaging (MRI), functional MRI, and electroencephalography), show the promise of discovering potential biomarkers for heterogeneous treatment effects. These high-dimensional data modalities are often in the form of curves or images and can be viewed as functional data (e.g., Ramsay and Silverman, 1997). Future work will also extend the additive model approach to the context of functional additive regression (e.g., Fan and others, 2014, 2015). The goal of these extensions will be to handle a large number of functional-valued covariates while achieving simultaneous variable selection, which will extend current functional linear model-based methods for precision medicine (McKeague and Qian, 2014; Ciarleglio and others, 2015; 2018) to a more flexible functional regression setting, as well as to longitudinally observed functional data (e.g., Park and Lee 2019).

7. Software

The R-package samTEMsel (Sparse Additive Models for Treatment Effect-Modifier Selection) contains R code to perform the methods proposed in the article, and is publicly available on GitHub (github.com/syhyunpark/samTEMsel).

Acknowledgments

We are grateful to the editors, the associate editor, and two referees for their insightful comments and suggestions. Conflict of Interest: None declared.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org

Funding

National Institute of Mental Health (NIMH) (5 R01 MH099003).

References

  1. Ashley, E. (2015). The precision medicine initiative: a new national effort. The Journal of the American Medical Association 313, 2117.
  2. Ciarleglio, A., Petkova, E., Ogden, R. T. and Tarpey, T. (2015). Treatment decisions based on scalar and functional baseline covariates. Biometrics 71(4), 884–894.
  3. Ciarleglio, A., Petkova, E., Ogden, R. T. and Tarpey, T. (2018). Constructing treatment decision rules based on scalar and functional predictors when moderators of treatment effect are unknown. Journal of the Royal Statistical Society: Series C 67, 1331–1356.
  4. Fan, Y., Foutz, N., James, G. M. and Jank, W. (2014). Functional response additive model estimation with online virtual stock markets. The Annals of Applied Statistics 8, 2435–2460.
  5. Fan, Y., James, G. M. and Radchenko, P. (2015). Functional additive regression. The Annals of Statistics 43, 2296–2325.
  6. Fernandes, B., Williams, L., Steiner, J., Leboyer, M., Carvalho, A. and Berk, M. (2017). The new field of ‘precision psychiatry’. BMC Medicine 15, 80.
  7. Hastie, T. and Tibshirani, R. (1999). Generalized Additive Models. London: Chapman and Hall/CRC.
  8. Jeng, X., Lu, W. and Peng, H. (2018). High-dimensional inference for personalized treatment decision. Electronic Journal of Statistics 12, 2074–2089.
  9. Kang, C., Janes, H. and Huang, Y. (2014). Combining biomarkers to optimize patient treatment recommendations. Biometrics 70, 696–707.
  10. Kosorok, M. R. and Laber, E. B. (2019). Precision medicine. Annual Review of Statistics and Its Application 6(1), 263–286.
  11. Kuhn, M. and Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. London: Chapman and Hall/CRC.
  12. Laber, E. B. and Zhao, Y. (2015). Tree-based methods for individualized treatment regimes. Biometrika 102, 501–514.
  13. Liu, Y., Wang, Y., Kosorok, M. R., Zhao, Y. and Zeng, D. (2018). Augmented outcome-weighted learning for estimating optimal dynamic treatment regimens. Statistics in Medicine 37, 3776–3788.
  14. Lu, W., Zhang, H. and Zeng, D. (2011). Variable selection for optimal treatment decision. Statistical Methods in Medical Research 22, 493–504.
  15. McKeague, I. and Qian, M. (2014). Estimation of treatment policies based on functional predictors. Statistica Sinica 24, 1461–1485.
  16. Murata, N. and Amari, S. (1994). Network Information Criterion-determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks 5(6), 865–872.
  17. Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 331–355.
  18. Murphy, S. A. (2005). A generalization error for q-learning. Journal of Machine Learning Research 6, 1073–1097.
  19. Park, H. and Lee, S. (2019). Logistic regression error-in-covariate models for longitudinal high-dimensional covariates. Stat 8, e246.
  20. Park, H., Petkova, E., Tarpey, T. and Ogden, R. T. (2020). A constrained single-index regression for estimating interactions between a treatment and covariates. Biometrics, 1–13. 10.1111/biom.13320.
  21. Petkova, E., Ogden, R. T., Tarpey, T., Ciarleglio, A., Jiang, B., Su, Z., Carmody, T., Adams, P., Kraemer, H., Grannemann, B. and others. (2017). Statistical analysis plan for stage 1 embarc (establishing moderators and biosignatures of antidepressant response for clinical care) study. Contemporary Clinical Trials Communications 6, 22–30.
  22. Qian, M. and Murphy, S. A. (2011). Performance guarantees for individualized treatment rules. The Annals of Statistics 39(2), 1180–1210.
  23. Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis. New York: Springer.
  24. Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. Journal of the Royal Statistical Society: Series B 71, 1009–1030.
  25. Royston, P. and Sauerbrei, W. (2008). Interactions between treatment and continuous covariates: a step toward individualizing therapy. Journal of Clinical Oncology 26(9), 1397–1399.
  26. Shi, C., Song, R. and Lu, W. (2016). Robust learning for optimal treatment decision with np-dimensionality. Electronic Journal of Statistics 10, 2894–2921.
  27. Song, R., Kosorok, M., Zeng, D., Zhao, Y., Laber, E. B. and Yuan, M. (2015). On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning. Stat 4, 59–68.
  28. Tian, L., Alizadeh, A., Gentles, A. and Tibshirani, R. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association 109(508), 1517–1532.
  29. Trivedi, M., McGrath, P., Fava, M., Parsey, R., Kurian, B., Phillips, M., Oquendo, M., Bruder, G., Pizzagalli, D., Toups, M., and others. (2016). Establishing moderators and biosignatures of antidepressant response in clinical care (embarc): rationale and design. Journal of Psychiatric Research 78, 11–23.
  30. Tseng, P. (2001). Convergence of block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109, 474–494.
  31. Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M. and Laber, E. (2012). Estimating optimal treatment regimes from classification perspective. Stat 1, 103–114.
  32. Zhao, Y., Laber, E., Ning, Y., Saha, S. and Sands, B. (2019). Efficient augmentation and relaxation learning for individualized treatment rules using observational data. Journal of Machine Learning Research 20, 1–23.
  33. Zhao, Y., Zeng, D., Rush, A. J. and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association 107, 1106–1118.
  34. Zhao, Y., Zeng, D., Laber, E. B. and Kosorok, M. R. (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association 110, 583–598.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press