Summary.
Optimal biomarker combinations for treatment-selection can be derived by minimizing total burden to the population caused by the targeted disease and its treatment. However, when multiple biomarkers are present, including all in the model can be expensive and hurt model performance. To remedy this, we consider feature selection in optimization by minimizing an extended total burden that additionally incorporates biomarker costs. Formulating it as a 0-norm penalized weighted-classification, we develop various procedures for estimating linear and nonlinear combinations. Through simulations and a real data example, we demonstrate the importance of incorporating feature-selection and marker cost when deriving treatment-selection rules.
Keywords: Biomarker cost, Feature selection, L0 penalization, Treatment selection, Weighted support vector machines
1. Introduction.
A considerable amount of recent biometric research is conducted within the ‘personalized medicine’ framework, because it is now well accepted that individual subjects’ responses to treatment can be heterogeneous in many disease settings. The characteristics contributing to this heterogeneity may include patient demographics, genetic/genomic information or other biological markers, henceforth referred to as treatment-selection biomarkers. These biomarkers can be used to select therapies for individuals so as to optimize their clinical outcomes. A single biomarker may not sufficiently explain this heterogeneity, however, and multiple biomarkers may need to be combined within a suitable statistical framework to optimize treatment selection.
The direct approach to identifying these optimal marker combinations in treatment selection involves parametric modeling of the disease risk conditional on biomarkers, treatment assignment and other baseline patient characteristics, and recommending treatment assignments based on whether the predicted risk under treatment is lower than the predicted risk under no treatment. This framework was first introduced by Song and Pepe (2004), and studied extensively in Foster et al. (2011); Qian and Murphy (2011); Lu et al. (2013). Treatment selection rules based on parametric risk models rely heavily on the correct specification of this model, which is often challenging given the complexity of biological mechanisms. An alternate, much more robust approach is to build optimization algorithms to minimize (or maximize) a desired criterion. This criterion, often called the objective function, is formulated based on relevant goals for treatment selection, and the optimization algorithm tries to find the best marker combination that optimizes it. These methods (also called indirect approaches) are much more robust to model misspecifications, as unlike direct approaches these methods do not require modeling of the disease risk. Zhang et al. (2012a,b) proposed finding the optimal marker combination within a pre-specified class by optimizing an estimator of the overall population mean outcome. Zhao et al. (2012) approached the same as an outcome-weighted learning problem and derived optimal treatment-selection rules using a weighted support vector machine method.
In Huang and Fong (2014), the authors proposed a new method to identify linear and nonlinear marker combinations by directly optimizing a targeted criterion function in the manner of Zhang et al. (2012a) and Zhao et al. (2012). However, their targeted criterion differed from the others in that adverse side-effects and/or cost incurred by the treatment were considered along with the event rates of the targeted disease in establishing the objective function. This was a crucial extension, as reducing safety events or the cost of administering the treatment is often a valid policy goal (see Vickers et al., 2007). The authors formulated the optimization problem as minimization of a weighted sum of 0–1 loss, and used the ramp loss as an approximation of the 0–1 loss. In this article, we extend the idea of Huang and Fong (2014) by creating an augmented target criterion that additionally controls the number of markers in the risk model. This is crucial for two reasons: (a) Measuring biomarkers can be expensive with respect to both time and money, and possibly invasive too. It is therefore of significant interest to limit the number of biomarkers that need to be collected for an individual to select his/her optimal treatment. (b) As discussed earlier, our original problem requires us to find optimal biomarker combinations to explain the heterogeneity of disease response across individuals, but finding this optimal combination can be difficult in the presence of redundant markers that do not contribute to treatment selection. Such markers may lead to overfitting and result in overly complex and non-optimal treatment selection rules. One way to deal with this ‘curse of dimensionality’ is through marker selection. Recently a lot of attention has been directed to effect modifier selection in precision medicine.
For example, in Zhao (2017), the authors studied selective inference in effect modification models via LASSO; in Liang (2017), the authors constructed sparse decision rules in the context of concordance-assisted learning; and in Shi (2018), the authors proposed a penalized multi-stage A-learning algorithm for deriving the optimal dynamic treatment regime. Although these feature selection methods are all relevant to the broad genre of precision medicine, each of them targets a very specific problem that is quite different from our objective of penalized optimization of a single-stage targeted objective function, for which only limited research exists: marker selection in treatment recommendation problems was briefly studied by Huang (2016) and Zhou et al. (2015), both of which targeted minimization of the disease rate in treatment selection.
To address these issues along with the original goals of treatment selection, we reformulate the criterion for treatment selection by associating a cost with measuring each biomarker used in the optimal rule. The optimization problem can be written as minimization of a weighted sum of 0–1 loss, but with an L0 penalty added for the number of markers in the model. In this article, we adopt the usual hinge loss convex relaxation of the 0–1 loss and perform a comprehensive investigation and comparison of various algorithms for solving this optimization problem. In linear support vector machines, L0 feature extraction is a well-researched problem (see Bradley and Mangasarian, 1998; Weston et al., 2003; Mangasarian, 2006; Huang et al., 2010), although it is a much more challenging problem in nonlinear support vector machines (see Mangasarian and Wild, 2007). However, it is worth noting that our setting differs from a simple classification format in two vital aspects: (a) although the treatment selection objective can be rewritten as a (weighted) classification problem (as shown in Section 2), it is still in essence a fundamentally different problem from classification, and feature selection techniques in SVMs have not been studied in this context, and (b) weighted SVM is a more complicated optimization problem than the standard SVM, in that the constraint on each support vector varies according to the weight associated with it, and research into feature extraction under this setting has been fairly limited so far. Some work has been done to combine the SCAD penalty with weighted linear support vector machines for special forms of such weights (see Jung, 2013), but beyond that, to our knowledge, there has been no targeted investigation of feature selection in weighted SVMs. These two reasons make these explorations vitally important.
Thus, the biggest contribution of this article is to combine the methodological advances in two different areas of research, advances in indirect approaches to treatment selection through kernelized methods and those in feature selection techniques in non-parametric statistical learning methods like SVMs. We believe that combining these two areas of research is a novel methodological formulation in itself. Additionally each of these feature selection methods has been translated from the standard SVM framework to the weighted SVM formulation to match the proposed objective. Moreover, adding a penalty for the number of markers in the model redefines the objective function as well, which has been used in conjunction with these feature selection techniques - for example, in choosing tuning parameters for a given feature selection method, we propose the use of a generalized cross validation (GCV) technique that utilizes the whole objective criterion including the penalty for the number of markers.
The article continues in the following manner: In Section 2, we establish the problem of minimizing the total disease, treatment, and marker measurement cost in treatment-selection, and discuss marker selection in conjunction with treatment selection. We briefly discuss various methods built for support vector machines, adapted specifically for the weighted classification setup, for deriving the best linear and nonlinear marker combinations in order to optimize the desired objective function. In Section 3, we set up different simulation examples to test the strength of these linear and nonlinear feature selection methods, modified for our setting, and discuss the results. Then in Section 4, we illustrate the application of our methods using a real data example from an HIV vaccine trial. Finally, in Section 5, we discuss our findings regarding the use of these feature selection methods and the impact of incorporating marker measurement cost into treatment selection. Links to additional materials are presented in the Supplementary Section 6.
2. Methods.
In this article, we consider the problem of finding optimal treatment selection rules for a binary clinical outcome Y (0 for nondiseased and 1 for diseased) based on a set of p ≥ 1 candidate markers collected from an individual’s biological characteristics (the solution we propose is also applicable to clinical outcomes measured on a continuous scale). Denote this set of candidate markers by Xp = {X1, …, Xp}. For a given subset of markers X ⊆ Xp, we consider marker-based treatment selection rules of the form A(X) = I(f(X) > 0), where f ∈ ℱ, ℱ denotes a class of functions spanning over X, and I(·) is the indicator function. The rule is then: A = 0 for not treating, and A = 1 for treating. The treatment-selection benefit of a decision rule A(X) can be quantified by EA(X)(Y), the expected disease rate in the population as a result of treatment selection based on A(X) (Song and Pepe, 2004; Qian and Murphy, 2011). This measure has been widely accepted as a crucial metric in recent literature (Zhao et al., 2012; Zhang et al., 2012a). Although EA(X)(Y) characterizes the burden of disease upon the population, one may also be concerned with the additional burden associated with a treatment regimen, such as its side effects and/or the monetary cost of its implementation. It may thus be preferable to search for treatment-selection strategies that take these aspects into consideration. Huang and Fong (2014) proposed to incorporate this additional burden by pre-specifying a treatment/disease harm ratio such that each burden type can be put on the same scale. Following the decision-theoretic framework of Vickers et al. (2007), let δ1 be a pre-specified ratio of the burden per treatment relative to the burden per disease event, and let Y(1) and Y(0) indicate the potential disease outcome if a subject were to receive or not receive the treatment, respectively.
Then the total burden due to disease and treatment for A(X), represented in the unit of burden per disease event (see Huang and Fong, 2014), is given as E[Y(0){1 − A(X)} + Y(1)A(X)] + δ1E{A(X)}. The best treatment-selection rule is derived as the one that minimizes this total burden.
In this article we further take into consideration the cost of measuring biomarkers in deriving the optimal treatment-selection rule, under the expectation that inclusion of the burden of measuring biomarkers in the criterion function can lead to the derivation of more cost-effective treatment-selection rules in the sense of achieving desired public health impact under the guidance of a parsimonious biomarker panel. We make the following assumptions: (i) measurement of each biomarker induces roughly equal cost, and (ii) the cost of measuring one biomarker is δ2 times the burden per disease event. Then the total burden due to disease, treatment, and biomarker measurement for a treatment-selection rule A(X) can be represented in the unit of the burden per disease event as
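$$ \theta\{A(X)\} = E\left[Y(0)\{1 - A(X)\} + Y(1)A(X)\right] + \delta_1 E\{A(X)\} + \delta_2 \dim(X). \qquad (1) $$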
Note: Although we assume a roughly equal cost for each biomarker for simplicity in this paper, it is straightforward to extend it to a setting where each biomarker has a different cost. Biomarkers which are absolutely essential to collect can be considered to have no cost at all, or there may be several groups of biomarkers, such that each group has a different cost depending on its burden and importance for the population of interest. In such situations, the penalty δ2 × dim(X) can be replaced by ∑_{j: Xj ∈ X} δ2,j, where δ2,j ≥ 0 is the cost associated with the biomarker Xj, and our methods can still be applied as described with only minor modifications.
We propose to derive an optimal treatment-selection rule by minimizing θ. Suppose there exists an ‘optimal’ or ‘correct’ set of p0 biomarkers X0 within the list of biomarkers measured, such that X0 ⊆ Xp, which when combined through an oracle rule f0 minimizes θ. Thus, we can write the above problem as,
$$ (X_0, f_0) = \arg\min_{X \subseteq X_p,\; f \in \mathcal{F}} \theta\left[I\{f(X) > 0\}\right]. \qquad (2) $$
Note that under this setup, . In practice, it is important to have a sensible way to specify the values or the range of values for δ1 and δ2. Choice of these cost ratios can be facilitated based on information of the monetary cost for controlling the targeted disease, for applying the treatment, and for biomarker measurement, as in our data example of making recommendation for HIV vaccine, presented later in Section 4.
Now imagine data from a two-arm randomized trial with the treatment indicator T taking values 0 and 1, referring to being untreated and treated, respectively. Let n0 and n1 indicate the number of subjects in the untreated and treated arms, respectively. Thus we have i.i.d. samples of the form {Yi, Xi, Ti} for i = 1,…, n with n = n0 + n1. As in Huang and Fong (2014), we assume: (i) the stable unit treatment value assumption (SUTVA) (Rubin, 1980) and consistency: the potential outcomes Y(0), Y(1) of one subject are independent of the treatment assignments of other subjects, and given the treatment a subject actually received, the subject’s potential outcome under that treatment equals the observed outcome; (ii) ignorable treatment assignment: T ⫫ {Y(0), Y(1)} | X. Assumption (i) is plausible in trials where participants do not interact with one another and assumption (ii) is ensured by randomization. Under the above assumptions, it can be shown that,
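$$ \theta\{A(X)\} = E\left[\mathrm{Risk}_0(X)\{1 - A(X)\} + \{\mathrm{Risk}_1(X) + \delta_1\}A(X)\right] + \delta_2 \dim(X), \qquad (3) $$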
where Risk0(X) = P(Y = 1|X,T = 0) and Risk1(X) = P(Y = 1|X, T = 1) are the risk of Y conditional on X among the untreated and treated, respectively.
Huang and Fong (2014) showed that when δ2 = 0, an optimal rule A(X) can be obtained as A(X) = 1 if Risk0(X) − Risk1(X) > δ1, and A(X) = 0 otherwise. But when δ2 is positive, such a strategy alone is not sufficient, as we also need to control the number of markers in the model. Since δ2 imposes an L0 penalty on the set of markers X, one ad hoc method may be to use standard regression-based methods with L0/L1 penalization to estimate Risk0(X) and Risk1(X) and then derive the treatment-selection rule as A(X) = I{Risk0(X) − Risk1(X) > δ1}. Note, however, that this procedure may lead to suboptimal rules with respect to our goal of minimizing θ.
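The plug-in strategy described above can be sketched in a few lines. This is only an illustration under stated assumptions: the risk values are made-up inputs rather than estimates from any model, the burden is assumed to take the form E[Risk0(X){1 − A(X)} + {Risk1(X) + δ1}A(X)] + δ2 dim(X), and the function names `plug_in_rule` and `total_burden` are hypothetical.

```python
def plug_in_rule(risk0, risk1, delta1):
    """Treat (1) when the risk reduction Risk0 - Risk1 exceeds delta1, else do not treat (0)."""
    return [1 if r0 - r1 > delta1 else 0 for r0, r1 in zip(risk0, risk1)]


def total_burden(risk0, risk1, rule, delta1, delta2, n_markers):
    """Empirical analogue of theta: mean disease-plus-treatment burden, plus marker cost."""
    n = len(rule)
    disease_and_treatment = sum(
        r0 * (1 - a) + (r1 + delta1) * a
        for r0, r1, a in zip(risk0, risk1, rule)
    ) / n
    return disease_and_treatment + delta2 * n_markers


# Illustrative (made-up) risks for four subjects under no treatment and treatment.
risk0 = [0.30, 0.10, 0.50, 0.20]
risk1 = [0.10, 0.15, 0.20, 0.18]

rule = plug_in_rule(risk0, risk1, delta1=0.05)
theta = total_burden(risk0, risk1, rule, delta1=0.05, delta2=0.001, n_markers=2)
theta_treat_all = total_burden(risk0, risk1, [1, 1, 1, 1], 0.05, 0.001, 2)
theta_treat_none = total_burden(risk0, risk1, [0, 0, 0, 0], 0.05, 0.001, 2)
```

For a fixed marker set, the plug-in rule cannot do worse than treating everyone or no one on these inputs; the marker cost term is what makes rules built on different marker subsets comparable.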
Alternatively, following the strategy in Zhang et al. (2012a), Zhao et al. (2012) and Huang and Fong (2014), we can consider a class of rules for treatment recommendation based on functionals of the form f(X) (belonging to some functional class on X), and a given threshold (usually 0). In particular, we let A(X) = I{f(X) > 0} with I(·) the indicator function, f(X) = b + g(X) with g(X) a function of markers X. Then, assuming randomization does not depend on X for simplicity, θ can be expressed as . The optimal f(X) can be found by minimizing the empirical estimate of θ, that is,
Therefore, we can formulate this problem as the minimization of the sum of a weighted sum of 0–1 loss and a term proportional to the number of biomarkers. That is, f can be found as the minimizer of
$$ \frac{1}{n}\sum_{i=1}^{n} W_i\, I\{f(X_i) \le 0\} + \delta_2 \dim(X), \qquad (4) $$
with the case-specific weight . Other types of weights such as control-specific or case-control mixture weights or their robust substitutes can also be adopted and are discussed in Huang and Fong (2014). For example, the robust substitute for the case-only weights , where and are risk estimates obtained from a working model and π = P(T = 1). It is worthwhile to note that since the minimization of (3) is equivalent to the minimization of E[I{f(X) ≤ 0} × {Risk0(X) − Risk1(X) − δ1}], any consistent estimate of Risk0(X) − Risk1(X) − δ1 can also serve as weights in (4).
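The equivalence noted above can be checked in two lines. The derivation below is a sketch that assumes θ takes the form E[Risk0(X){1 − A(X)} + {Risk1(X) + δ1}A(X)] + δ2 dim(X) with A(X) = I{f(X) > 0}, so that 1 − A(X) = I{f(X) ≤ 0}:

```latex
\begin{align*}
\theta &= E\left[\mathrm{Risk}_0(X)\, I\{f(X) \le 0\}
          + \{\mathrm{Risk}_1(X) + \delta_1\}\left(1 - I\{f(X) \le 0\}\right)\right]
          + \delta_2 \dim(X) \\
       &= E\{\mathrm{Risk}_1(X) + \delta_1\}
          + E\left[I\{f(X) \le 0\}\{\mathrm{Risk}_0(X) - \mathrm{Risk}_1(X) - \delta_1\}\right]
          + \delta_2 \dim(X).
\end{align*}
```

The first term, E{Risk1(X)} + δ1 = P{Y(1) = 1} + δ1, depends neither on f nor on the marker subset, so only the second and third terms matter for the minimization.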
Note that if we are only interested in combining a set of candidate biomarkers without performing any variable selection, then it is not necessary to include the biomarker measurement cost in the criterion function as in Huang and Fong (2014). However, in the presence of a large number of biomarkers, incorporation of marker cost will impact the features selected in the estimated rule.
2.1. Treatment selection as a weighted Support Vector Machines problem.
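In this section, we consider the minimization of the regularized loss function (4), conditional on a pre-specified set of weights W. Since Wi I{f(Xi) ≤ 0} = |Wi| I[sgn{f(Xi)} ≠ sgn(Wi)] + min(Wi, 0), with sgn(u) = 1 if u > 0 and −1 otherwise, and the last term does not depend on f, (4) can be reformulated as
$$ \min_{f} \; \frac{1}{n}\sum_{i=1}^{n} |W_i|\, I\left[\mathrm{sgn}\{f(X_i)\} \ne \mathrm{sgn}(W_i)\right] + \delta_2 \dim(X). \qquad (5) $$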
Note that (5) is a weighted classification problem where sgn(Wi) is the true binary class, sgn{f(Xi)} is the predicted binary class based on X, and |Wi| is the subject-specific weight. This type of problem can be solved using the weighted support vector machine (Lin and Wang, 2002), based on Xi, sgn(Wi), and |Wi|. Since minimization of a weighted sum of 0–1 loss is non-convex and intractable, we replace the 0–1 loss with a convex surrogate; the hinge loss is often used in this context. The hinge loss, given as h(u) = max(0, 1 − u), has been proven to be a useful surrogate for the classification loss, such that the term |Wi| I[sgn{f(Xi)} ≠ sgn(Wi)] in (5) is replaced by |Wi| h{f(Xi) × sgn(Wi)}. It is worth noting that the hinge loss penalizes departure of a decision rule from its observed class label according to the extent of this departure, which the classification loss fails to do. The hinge-loss SVM formulation of the above problem is given as,
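$$ \min_{f} \; \frac{1}{n}\sum_{i=1}^{n} |W_i|\, h\{f(X_i)\,\mathrm{sgn}(W_i)\} + \delta_2 \dim(X). \qquad (6) $$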
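The step from a signed-weight loss to a weighted classification loss, and then to the hinge surrogate, can be verified numerically. The sketch below assumes the sign convention sgn(u) = 1 for u > 0 and −1 otherwise; the weights and scores are arbitrary illustrative numbers and the helper names are hypothetical.

```python
def sgn(u):
    """Sign convention assumed here: +1 if u > 0, otherwise -1."""
    return 1 if u > 0 else -1


def hinge(u):
    """Hinge loss h(u) = max(0, 1 - u)."""
    return max(0.0, 1.0 - u)


def weighted_01(w, f):
    """|W| * I{sgn(f) != sgn(W)}: weighted misclassification loss."""
    return abs(w) * (1 if sgn(f) != sgn(w) else 0)


def weighted_hinge(w, f):
    """|W| * h(f * sgn(W)): convex surrogate of the weighted 0-1 loss."""
    return abs(w) * hinge(f * sgn(w))


weights = [0.8, -0.5, 1.2, -0.3]   # illustrative signed weights W_i
scores = [0.4, 0.9, -2.0, -0.1]    # illustrative decision values f(X_i)

# Left side: W_i * I{f(X_i) <= 0}. Right side: weighted 0-1 loss plus a term free of f.
signed_loss = [w * (1 if f <= 0 else 0) for w, f in zip(weights, scores)]
reformulated = [weighted_01(w, f) + min(w, 0.0) for w, f in zip(weights, scores)]
surrogate = [weighted_hinge(w, f) for w, f in zip(weights, scores)]
zero_one = [weighted_01(w, f) for w, f in zip(weights, scores)]
```

The two lists `signed_loss` and `reformulated` agree entry by entry, and each entry of `surrogate` is at least the corresponding entry of `zero_one`, which is what makes the hinge loss an upper-bounding convex replacement.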
This is the L0 penalized weighted support vector machines framework. In SVMs, the functional space is generally restricted to be a reproducing kernel Hilbert space (RKHS) ℋkx, represented uniquely by its kernel kx. Before considering the solution to (6), we first review the weighted support vector machines formulation from Lin and Wang (2002), where the square of the Hilbert space norm is used instead of the L0 norm. For a given subset X ⊆ Xp, with f = b + g, we define it as:
$$ \min_{b,\; g \in \mathcal{H}_{k_x}} \; \frac{1}{n}\sum_{i=1}^{n} |W_i|\, h\{f(X_i)\,\mathrm{sgn}(W_i)\} + \lambda \|g\|^2_{\mathcal{H}_{k_x}}. \qquad (7) $$
In linear support vector machines, the space of linear functions of X is used for optimization, which is an RKHS with the Euclidean inner product as its kernel function, kx(x1, x2) = 〈x1, x2〉. The Hilbert norm in this case becomes the usual Euclidean L2 norm on the linear combination weights, ‖β‖2, and the estimated decision functions can be expressed as a linear combination of the input marker set X. However, as the optimal marker combination for treatment selection may not be a linear combination of the input markers, it is important to extend the function class to include more complex nonlinear functions. This can be achieved by considering transformations of the feature space through feature maps of the form ϕ(x), the appropriate RKHS for which is the one with the kernel kx satisfying kx(x1, x2) = 〈ϕ(x1), ϕ(x2)〉. Examples of popular nonlinear kernels include the polynomial kernel of dth degree, kx(xi, xj) = (1 + 〈xi, xj〉)^d, and the radial basis function (RBF) kernel, kx(xi, xj) = exp(−γ‖xi − xj‖^2), with γ as a tuning parameter. The resulting SVM solution at a given covariate vector x0 is given as f̂(x0) = b̂ + ∑j α̂j kx(x0, xj), where α̂j are the trained weights on the support vectors kx(·, xj) and b̂ is the estimated global constant.
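The three kernels and the kernel expansion of the decision function can be written down directly. The sketch below is a dependency-free illustration; the function names and the toy support points are hypothetical, and the decision function assumes the expansion intercept + sum of weight times kernel value, with any class labels absorbed into the weights.

```python
import math


def linear_kernel(x1, x2):
    """Euclidean inner product <x1, x2>."""
    return sum(a * b for a, b in zip(x1, x2))


def polynomial_kernel(x1, x2, d=2):
    """Polynomial kernel of degree d: (1 + <x1, x2>)^d."""
    return (1.0 + linear_kernel(x1, x2)) ** d


def rbf_kernel(x1, x2, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * ||x1 - x2||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)


def decision_function(x0, support_points, alphas, intercept, kernel):
    """Evaluate intercept + sum_j alpha_j * k(x0, x_j) at a new covariate vector x0."""
    return intercept + sum(a * kernel(x0, xj) for a, xj in zip(alphas, support_points))


# Toy usage with made-up support points and trained weights.
support_points = [(1.0, 2.0), (3.0, -1.0)]
alphas = [0.5, -0.25]
value = decision_function((1.0, 0.0), support_points, alphas, 0.1, linear_kernel)
treat = 1 if value > 0 else 0   # A(X) = I{f(X) > 0}
```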
2.2. Weighted versus unweighted support vector machines.
The weighted support vector machine is a well-known technique for classification. First proposed by Lin and Wang (2002), who called it the “fuzzy support vector machine” or FSVM, it has been further applied and studied in subsequent works (see, for example, Fan and Ramamohanarao, 2005; Yang et al., 2007; Zhao et al., 2012). In essence, the weighted support vector machine is an important tool in classification when some training points are more important than others for the given problem. The main difference between solving the weighted SVM and the standard unweighted SVM lies in the constraints placed on the support vectors in the dual formulation of the problem. To see this, let us write down the dual of the unweighted SVM below and note the changes that transform it into a weighted SVM problem. Let us denote yi = sgn(Wi) and qi = |Wi|. The unweighted squared Hilbert norm penalized SVM is then given as min (1/n) ∑i h{yi f(Xi)} + λ‖g‖^2 over b and g ∈ ℋkx, with f = b + g, the dual of which is given as,
$$ \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k_x(X_i, X_j) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1, \ldots, n, \qquad (8) $$
for C = 1/(2λn). The dual of (7) differs from (8) only through the constraints placed on the αi. In the unweighted SVM, the upper bound for each αi is the fixed constant C, while in the weighted SVM, the upper bound for each αi is multiplied by qi and thus becomes C × qi (that is, 0 ≤ αi ≤ Cqi). Intuitively, the coefficient αi associated with each sample is constrained differently according to that sample’s weight.
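In the primal, the per-sample bound C × qi corresponds to multiplying each subject's hinge loss by qi. The sketch below fits a linear weighted SVM on a toy one-marker data set by plain subgradient descent on the primal objective. This solver is a simple stand-in chosen for illustration, not the quadratic programming approach typically used for the dual; the step size, iteration count, data and function name are all assumptions.

```python
def fit_weighted_linear_svm(X, y, q, lam=0.01, step=0.1, n_iter=2000):
    """Minimize (1/n) * sum_i q_i * hinge(y_i * (b + beta . x_i)) + lam * ||beta||^2
    by fixed-step subgradient descent (an illustrative solver only)."""
    n, p = len(X), len(X[0])
    beta, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_beta = [2.0 * lam * bj for bj in beta]   # gradient of the ridge penalty
        grad_b = 0.0
        for xi, yi, qi in zip(X, y, q):
            margin = yi * (b + sum(bj * xj for bj, xj in zip(beta, xi)))
            if margin < 1.0:                          # hinge is active for this subject
                for j in range(p):
                    grad_beta[j] -= qi * yi * xi[j] / n
                grad_b -= qi * yi / n
        beta = [bj - step * gj for bj, gj in zip(beta, grad_beta)]
        b -= step * grad_b
    return beta, b


# Toy data: one marker, class labels y_i = sgn(W_i) and weights q_i = |W_i|.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]
q = [0.5, 2.0, 1.0, 1.5]

beta, b = fit_weighted_linear_svm(X, y, q)
predicted = [1 if b + beta[0] * xi[0] > 0 else -1 for xi in X]
```

Setting every qi to 1 recovers the unweighted problem, so the same routine covers both cases.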
2.3. Identifying optimal biomarker combinations in the linear space.
As noted before, variable selection is a key motivation in our pursuit to solve (6), which isn’t an inherent feature of the support vector machines framework of (7). The Hilbert norm provides some control on the overcomplexification of the estimated functions, but without performing any variable selection. Many authors have proposed to use the L1 and L0 penalty either as a replacement or in conjunction with the L2 norm in the linear SVMs. First of all, note that solving (6) is combinatorially a very hard problem (see Amaldi and Kann, 1998), but the zero norm is directly related to finding minimal subsets and can provide optimal feature extraction if properly implemented. There are however many feature extraction algorithms for the unweighted linear support vector machines that do not rely on the L0 norm for selection (see for example Mangasarian, 2006; Zhang et al., 2006). To this effect, in this article, we also look at alternate ways to perform feature selection in support vector machines, instead of focusing solely on solving (6). Here, we build our optimization approaches on some of the most commonly used algorithms for unweighted SVMs. That is, we modify each of them to cater to the weighted version of the algorithm, and we will see how it paves the way for directly or indirectly solving (6) in the linear space. We now briefly introduce each method below, along with their modified objective function under the weighted SVM setup, while a more detailed description of each is reserved for the Supplementary Section (see Web Appendix A).
- L1 weighted support vector machines (L1 WSVM), which uses the L1 norm for penalization instead of the squared Hilbert norm, built upon Mangasarian (2006), with objective function (9).
- SCAD, which applies the smoothly clipped absolute deviation penalty (Zhang et al., 2006) to the WSVM objective function, giving (10).
- elastic SCAD, which uses a mixture of the SCAD penalty and the L2 norm, giving (11).
- feature selection concave (FSV), built on a concave minimization algorithm originally proposed by Bradley and Mangasarian (1998), which achieves approximate L0 penalization in weighted SVMs through solving (12).
- approximation of the zero norm minimization (AROM), which achieves approximate L0 penalization (see Weston et al., 2003) by solving objective function (13) iteratively. It initializes a vector z = (1, …, 1) and, at each successive step, resets z as z * β, where x * w = (x1w1, …, xpwp).
- L0 weighted support vector machines (L0 WSVM), built on the work of Huang et al. (2010), which achieves L0 norm penalization in weighted support vector machines through an iterative scheme, solving objective function (14) at the tth step.
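Two of the penalties named in the list above have simple closed forms that are easy to sketch. The code below uses the standard SCAD penalty with the conventional default a = 3.7, and a concave exponential approximation of the L0 norm of the kind used in concave-minimization feature selection; the smoothing parameter `alpha`, the function names and the example vector are assumptions made for illustration and are not taken from the objective functions (10) and (12).

```python
import math


def scad_penalty(beta_j, lam, a=3.7):
    """Standard SCAD penalty for a single coefficient (assumes a > 2 and lam > 0)."""
    t = abs(beta_j)
    if t <= lam:
        return lam * t                                   # L1-like near zero
    if t <= a * lam:
        return -(t * t - 2.0 * a * lam * t + lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0                   # constant for large coefficients


def concave_l0_approx(beta, alpha=5.0):
    """Concave surrogate of the L0 norm: sum_j (1 - exp(-alpha * |beta_j|))."""
    return sum(1.0 - math.exp(-alpha * abs(bj)) for bj in beta)


beta = [0.0, 2.0, -3.0]
l0_exact = sum(1 for bj in beta if bj != 0.0)            # number of selected markers
l0_smooth = concave_l0_approx(beta)                      # close to l0_exact for large alpha
scad_values = [scad_penalty(bj, lam=1.0) for bj in beta]
```

Unlike the L1 norm, both penalties level off for large coefficients, so strong markers are not shrunk much while small coefficients are still pushed to zero.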
2.4. Identifying optimal biomarker combinations in the nonlinear space.
In this section, we consider the derivation of best biomarker combinations in the nonlinear space, as most optimal marker combinations for treatment-selection rules may not be among the linear combinations of the input markers. Feature selection is a fairly straightforward procedure for linear SVM classifiers, but it is a much more challenging problem in nonlinear support vector machines (see Mangasarian and Wild, 2007). For example, in the linear support vector machines, we can use the L1 or L0 norm instead of the Hilbert norm. However, when similar techniques are applied to nonlinear SVM classifiers, we have reduction in the number of support vectors but not in the number of input space features (Fung and Mangasarian, 2004). So although the dimensionality of the transformed space is reduced, it does not provide any direct reduction of input space features. A few procedures have recently been developed however to deal with feature selection in the original input space under nonlinear feature maps (see Dasgupta et al., 2019; Allen, 2013), but not a lot of them have been developed with the goal of L0 penalization. We will investigate some of these newly developed methods here, but under the more generalized setting of weighted support vector machines. We briefly introduce each method below, along with their modified objective function under the weighted SVM setup, while a more detailed description of each is reserved for the Supplementary Section (see Web Appendix B).
- risk recursive feature elimination (riskRFE), a newly developed wrapper technique based on recursive computation of the learning function (see Dasgupta et al., 2019), which works in both the linear and the nonlinear space (see Figure 1 for a flow chart of the algorithm).
- kernel iterative feature extraction (KNIFE), which achieves feature selection in nonlinear SVMs by iteratively optimizing a feature-regularized loss function, with features weighted within the kernel (see Allen, 2013). For kw(x1, x2) = k(w * x1, w * x2), the weighted version of the algorithm solves (15).
- kernel penalized weighted support vector machines (KP-WSVM), built on the work of Maldonado et al. (2011) for feature selection in nonlinear SVMs, which relies on penalizing each feature’s use in the dual formulation, as in (16). The approximated L0 penalty in this objective function can also be replaced by an L1 norm.
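The feature-weighted kernel kw(x1, x2) = k(w * x1, w * x2) used by KNIFE is what lets a penalty on w remove markers from the original input space. A minimal sketch for the RBF kernel follows; the function name and the example vectors are hypothetical.

```python
import math


def feature_weighted_rbf(x1, x2, w, gamma=1.0):
    """RBF kernel on elementwise-rescaled inputs: exp(-gamma * ||w * x1 - w * x2||^2)."""
    sq_dist = sum((wj * (a - b)) ** 2 for wj, a, b in zip(w, x1, x2))
    return math.exp(-gamma * sq_dist)


x1, x2 = (1.0, 5.0), (2.0, -7.0)

# With w = (1, 0) the second marker has no influence on the kernel value,
# so it is effectively removed from the fitted rule.
k_dropped = feature_weighted_rbf(x1, x2, w=(1.0, 0.0))
k_reference = feature_weighted_rbf((1.0, 0.0), (2.0, 0.0), w=(1.0, 1.0))
k_full = feature_weighted_rbf(x1, x2, w=(1.0, 1.0))
```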
Fig. 1. Schematics of riskRFE in nonparametric estimation.
Note that the feature selection methods considered here can be broadly categorized into two groups: (a) embedded methods, which select the best features during the learning process, that includes SCAD, eSCAD, L1 WSVM, L0 WSVM, AROM, FSV, KNIFE and KP-WSVMs; and (b) wrappers, where feature subsets are selected based on inductive algorithms, for example, riskRFE(L) and riskRFE(G).
3. A simulation study
In this section, we investigate the performance of the aforementioned feature selection methods in various simulation settings. In these analyses, the linear feature selection methods (riskRFE (L), L1 WSVM, SCAD/eSCAD, FSV, AROM, L0 WSVM) are used in conjunction with the linear kernel based weighted SVM, while the nonlinear methods (riskRFE (G), KNIFE, L0 KP-WSVM, L1 KP-WSVM) are used in conjunction with the Gaussian RBF kernel based weighted SVM. In the subsection below, we first discuss the cross validation steps that we employ to choose the optimal tuning parameters for each of these methods.
3.1. Selection of tuning parameters
For selection of the optimal tuning parameters, we follow a two-step cross validation procedure to save computation time, dividing the tuning parameters into two categories: (a) global SVM parameters, and (b) parameters directly controlling sparsity. The global SVM parameters include the parameter controlling the Hilbert space norm of the estimated rule, which allows shrinkage of the estimated effects without controlling sparsity directly, and the width of the Gaussian kernel in nonlinear SVMs. We follow a 5-fold cross-validation procedure to select the optimal values of these parameters using the full model in a weighted SVM analysis. The parameter for the Hilbert space norm (which for the linear kernel becomes the L2 norm) is tuned on a grid of values between 0.0001 and 100, while the width of the Gaussian kernel, γ, is tuned over a grid of values between 0.001 and 10. Most of the embedded methods have a separate tuning parameter for sparsity: for example, SCAD, eSCAD, L1 WSVM, KNIFE and L1 KP-WSVM employ a cost for L1 penalization, while L0 WSVM, FSV and L0 KP-WSVM employ a cost for L0 penalization. Although AROM also belongs to this class, it achieves L0 penalization through iterative fitting of the L2 norm SVM, and has no parameter controlling sparsity directly. Wrappers like riskRFE(L) and riskRFE(G) have no separate penalization for sparsity either. Thus, we employ a second 5-fold cross validation step to tune the sparsity parameter for each relevant method (all except AROM and riskRFE), based on the performance metric θ, for given values of δ1 and δ2, and using the global SVM parameters selected in the first step (when needed). Each of these parameters is tuned on a grid over a viable range of values, and the value attaining the best GCV performance is chosen for that method.
Thus, for these methods, δ2, the ratio of the cost of measuring one biomarker to the burden per disease event, is utilized directly for choosing the optimum value of the sparsity parameter, while AROM or riskRFE are not tuned on the value of δ2. Now we describe our simulation setup.
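The role of δ2 in the second tuning step can be illustrated with a small sketch. It assumes that, for each candidate sparsity parameter, every fold reports a held-out estimate of the disease-plus-treatment burden together with the number of markers selected; the numbers below are invented purely for illustration and the function names are hypothetical.

```python
def penalized_cv_score(fold_results, delta2):
    """Average over folds of (held-out burden + delta2 * number of selected markers)."""
    return sum(burden + delta2 * k for burden, k in fold_results) / len(fold_results)


def select_sparsity_parameter(results_by_param, delta2):
    """Return the candidate value with the smallest penalized cross-validated criterion."""
    return min(
        results_by_param,
        key=lambda param: penalized_cv_score(results_by_param[param], delta2),
    )


# Hypothetical fold results: sparsity parameter -> [(burden, markers selected), ...].
results_by_param = {
    0.01: [(0.200, 10), (0.204, 12)],   # weak penalty: many markers, lowest burden
    0.1: [(0.205, 3), (0.207, 3)],
    1.0: [(0.230, 1), (0.228, 1)],      # strong penalty: one marker, higher burden
}

choice_free_markers = select_sparsity_parameter(results_by_param, delta2=0.0)
choice_cheap_markers = select_sparsity_parameter(results_by_param, delta2=0.001)
choice_costly_markers = select_sparsity_parameter(results_by_param, delta2=0.02)
```

As the marker cost grows, the selected parameter moves toward sparser rules even though the burden from disease and treatment alone is slightly higher.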
3.2. The simulation setup
We consider data from a 1:1 randomized trial of size n = 500 for studying the performance of our proposed strategy. For each simulation setting, we restrict ourselves to the case where the treatment/disease burden ratio δ1 is equal to 0, but consider three different values for the marker cost δ2: 0, 0.0001 and 0.001. In this simulation study, we only consider the double-robustness weights of Huang and Fong (2014) for obtaining Wi for each individual. Apart from the methods discussed above, we also use a linear logistic regression model with LASSO and a weighted SVM without feature selection to find the optimal treatment selection rule for comparison. In logistic regression with LASSO, P(Y = 1|X, T) is modeled as a function of the main effects of the treatment and the biomarkers, and the interaction effects between the treatment and the biomarkers, with the amount of L1 penalization tuned globally through the usual cross validation. This is also the risk model used for constructing the double-robustness weights. The optimal treatment-selection rule , where is the optimal set of markers chosen by a given method, is estimated from the training sample, after retuning the global SVM parameters in the model with the selected markers. To evaluate the performance of the estimated rule, a test set of size 5000 is generated in each simulation run, based on which we estimate θ as . We evaluate the performance of each method over 100 Monte Carlo runs. In our results, we present the estimated version of the quantity θ that we evaluate from these Monte Carlo runs. We compare the performance of different methods in the following three settings.
In the first setting, we have a total of 27 markers, of which only 2 are significant in explaining treatment-marker interactions (|Xp| = 27, |X0| = 2). The significant markers (X1, X2) are generated from a multivariate normal distribution. Each of the remaining markers is generated independently from a N(0, 1) distribution. We consider two subcases under this setting: (i) a ‘linear’ underlying model, given by logitP(Y = 1|X1, X2, T) = −1.5 − 1.5X1 − 1.25X2 + 2X1T + 1.5X2T, with disease prevalence approximately 0.3 and 0.2 among the untreated and treated, respectively; and (ii) a polynomial ‘nonlinear’ underlying model, with disease prevalence approximately 0.18 and 0.17 among the untreated and treated, respectively.
In the second setting, we have a total of 53 markers, of which only 3 are significant in explaining treatment-marker interactions (|Xp| = 53, |X0| = 3). The significant markers (X1, X2, X3) are generated from a multivariate normal distribution. Each of the remaining markers is generated independently from a N(0, 1) distribution. Again we consider two subcases: (i) a ‘linear’ underlying model, given by logitP(Y = 1|X1, X2, X3, T) = −2 + X1 + 0.75X2 + X3 − 2X1T − 1.5X2T − 2X3T, with disease prevalence approximately 0.22 among both the untreated and treated groups; and (ii) a ‘nonlinear’ underlying model, with disease prevalence approximately 0.30 and 0.18 among the untreated and treated, respectively.
In the third setting, we consider a complex nonlinear setup, with a total of 27 markers, of which only 2 are significant in explaining treatment-marker interactions. All of the markers are generated from the uniform distribution, that is, Xi ~ U(−2, 2), i = 1,…, 27. The model for treatment-marker interaction considers a situation where the treatment has a beneficial effect for an individual only if the marker values X1 and X2 lie in a specific region of the covariate space; otherwise the treatment may be harmful. This is achieved by dividing the covariate space spanned by X1 and X2, the square {−2 ≤ X1 ≤ 2, −2 ≤ X2 ≤ 2}, into two regions: (i) a ‘harmful’ zone given by the smaller square {−1 ≤ X1 ≤ 1, −1 ≤ X2 ≤ 1}, denoted as H; and (ii) a ‘beneficial’ zone given by the region between the two concentric squares, denoted as B (Figure 2 plots X1 and X2 marker values for 10000 such hypothetical individuals). Thus, the probability that the treatment is beneficial for an individual in the population is 0.75. The probability of disease for an individual in region B is 0 if the individual receives treatment, but 0.6 otherwise, and these probabilities are reversed for an individual from H. The global probability of disease in the population is thus fixed at 0.3. The disease prevalences are approximately 0.53 and 0.17 among the untreated and treated, respectively.
Fig. 2. Simulation setting 3.
Distribution of X1 and X2 marker values for 10000 individuals, stratified by their membership to the potential harm (red) or the potential benefit (blue) zones.
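Two of the data-generating mechanisms described in this section can be sketched in a few lines. The sketch is an illustrative reading of the text, not the authors' simulation code: the function names are hypothetical, only the informative markers X1 and X2 are generated (the noise markers are omitted), treatment is assigned by an independent fair coin flip rather than exact 1:1 allocation, and for setting 1(i) only the risk model is coded. For setting 3, the sketch checks only the zone probability (0.75) and the overall disease probability (0.3) quoted above.

```python
import math
import random


def risk_setting_1i(x1, x2, t):
    """P(Y = 1 | X1, X2, T) under the 'linear' model of setting 1(i)."""
    eta = -1.5 - 1.5 * x1 - 1.25 * x2 + 2.0 * x1 * t + 1.5 * x2 * t
    return 1.0 / (1.0 + math.exp(-eta))


def oracle_treat_1i(x1, x2):
    """With delta1 = 0, treat iff the risk under treatment is lower.

    Under the model above this is equivalent to 2*X1 + 1.5*X2 < 0.
    """
    return risk_setting_1i(x1, x2, 1) < risk_setting_1i(x1, x2, 0)


def in_harm_zone(x1, x2):
    """Zone H of setting 3: the inner square [-1, 1] x [-1, 1]."""
    return abs(x1) <= 1.0 and abs(x2) <= 1.0


def risk_setting_3(x1, x2, t):
    """Disease probability is 0.6 if untreated in B or treated in H, else 0."""
    harmed = in_harm_zone(x1, x2)
    if t == 1:
        return 0.6 if harmed else 0.0
    return 0.0 if harmed else 0.6


def simulate_setting_3(n, seed=0):
    """Draw (x1, x2, t, y) for n individuals; noise markers are omitted."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-2, 2), rng.uniform(-2, 2)
        t = rng.randint(0, 1)  # fair coin flip as a stand-in for 1:1 allocation
        y = int(rng.random() < risk_setting_3(x1, x2, t))
        rows.append((x1, x2, t, y))
    return rows


rows = simulate_setting_3(10000)
frac_benefit = sum(not in_harm_zone(x1, x2) for x1, x2, _, _ in rows) / len(rows)
overall_rate = sum(y for *_, y in rows) / len(rows)
print(f"share in zone B: {frac_benefit:.3f}, overall disease rate: {overall_rate:.3f}")
```

The zone share follows from areas (the outer square has area 16 and the inner square area 4, so B covers 12/16 = 0.75), and the simulated values should land close to 0.75 and 0.3 up to Monte Carlo error.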
3.3. Results
The results for the simulation exercise are summarized in Tables 1–5. We give an overview of these results below, while a more detailed discussion of each of these settings is provided in the Supplementary Materials (see Web Appendix C).
Table 1. Setting 1(i) (Linear): Setting with disease prevalence approximately 0.30 and 0.20 among the untreated and treated, respectively (Number of markers 27, number of significant markers 2).
Performance scores: (a) Monte Carlo mean of proportion of correct markers chosen, (b) Monte Carlo mean of proportion of incorrect markers chosen, (c) Monte Carlo mean of θ, with (i) δ1 = 0 and δ2 = 0, (ii) δ1 = 0 and δ2 = 0.0001, (iii) δ1 = 0 and δ2 = 0.001, for different feature selection methods.
| Setting 1 (Underlying model: Linear) | δ2 = 0: (a) | δ2 = 0: (b) | δ2 = 0: (c) | δ2 = 0.0001: (a) | δ2 = 0.0001: (b) | δ2 = 0.0001: (c) | δ2 = 0.001: (a) | δ2 = 0.001: (b) | δ2 = 0.001: (c) |
|---|---|---|---|---|---|---|---|---|---|
| Linear Methods | | | | | | | | | |
| SCAD | 1 | 0.17 | 0.1048 | 1 | 0.16 | 0.1069 | 1 | 0.01 | 0.1045 |
| elastic SCAD | 1 | 0.20 | 0.1076 | 1 | 0.18 | 0.1128 | 1 | 0.11 | 0.1145 |
| L1 WSVM | 1 | 0.31 | 0.1107 | 1 | 0.29 | 0.1123 | 1 | 0.16 | 0.1109 |
| L0 WSVM | 1 | 0.26 | 0.1114 | 1 | 0.16 | 0.1105 | 1 | 0.08 | 0.1127 |
| FSV | 1 | 0.32 | 0.1074 | 1 | 0.22 | 0.1096 | 1 | 0.11 | 0.1075 |
| AROM | 0.97 | 0.10 | 0.1059 | 0.97 | 0.09 | 0.1079 | 0.97 | 0.05 | 0.1116 |
| riskRFE (L) | 1 | 0.03 | 0.1016 | 1 | 0.03 | 0.1019 | 1 | 0.03 | 0.1044 |
| LASSO | 1 | 0.29 | 0.1065 | 1 | 0.29 | 0.1074 | 1 | 0.29 | 0.1158 |
| W-SVM Linear no sel.† | 1 | 1 | 0.1178 | 1 | 1 | 0.1205 | 1 | 1 | 0.1448 |
| Nonlinear Methods | | | | | | | | | |
| KNIFE | 1 | 0.27 | 0.1075 | 1 | 0.26 | 0.1101 | 1 | 0.12 | 0.1120 |
| L1 KP-WSVM | 1 | 0.24 | 0.1080 | 1 | 0.24 | 0.1094 | 1 | 0.09 | 0.1082 |
| L0 KP-WSVM | 1 | 0.32 | 0.1113 | 1 | 0.27 | 0.1102 | 1 | 0.17 | 0.1127 |
| riskRFE (G) | 1 | 0.03 | 0.1021 | 1 | 0.03 | 0.1024 | 1 | 0.03 | 0.1049 |
| W-SVM Gauss no sel.† | 1 | 1 | 0.1202 | 1 | 1 | 0.1229 | 1 | 1 | 0.1472 |
† no sel.: no selection
Table 5. Setting 3 (Nonlinear): Setting with disease prevalence approximately 0.53 and 0.17 among the untreated and treated, respectively (Number of markers 27, number of significant markers 2).
Performance scores: (a) Monte Carlo mean of proportion of correct markers chosen, (b) Monte Carlo mean of proportion of incorrect markers chosen, (c) Monte Carlo mean of θ, with (i) δ1 = 0 and δ2 = 0, (ii) δ1 = 0 and δ2 = 0.0001, (iii) δ1 = 0 and δ2 = 0.001, for different feature selection methods.
| Setting 3 (Underlying model: Nonlinear) | δ2 = 0: (a) | δ2 = 0: (b) | δ2 = 0: (c) | δ2 = 0.0001: (a) | δ2 = 0.0001: (b) | δ2 = 0.0001: (c) | δ2 = 0.001: (a) | δ2 = 0.001: (b) | δ2 = 0.001: (c) |
|---|---|---|---|---|---|---|---|---|---|
| Linear Methods | | | | | | | | | |
| SCAD | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| elastic SCAD | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| L1 WSVM | 0.19 | 0.18 | 0.1744 | 0.17 | 0.20 | 0.1749 | 0.16 | 0.19 | 0.1795 |
| L0 WSVM | 1 | 0.99 | 0.1746 | 1 | 1 | 0.1773 | 1 | 1 | 0.2015 |
| FSV | 0.91 | 0.95 | 0.1745 | 0.50 | 0.52 | 0.1746 | 0.50 | 0.52 | 0.1873 |
| AROM | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| riskRFE (L) | 0.22 | 0.29 | 0.1749 | 0.22 | 0.29 | 0.1757 | 0.22 | 0.29 | 0.1826 |
| LASSO | 0.05 | 0.14 | 0.1776 | 0.05 | 0.14 | 0.1780 | 0.05 | 0.14 | 0.1812 |
| W-SVM Linear no sel.† | 1 | 1 | 0.1746 | 1 | 1 | 0.1773 | 1 | 1 | 0.2015 |
| Nonlinear Methods | | | | | | | | | |
| KNIFE | 0.42 | 0.69 | 0.1765 | 0.39 | 0.68 | 0.1799 | 0.40 | 0.71 | 0.1958 |
| L1 KP-WSVM | 0.88 | 0.32 | 0.1100 | 0.88 | 0.28 | 0.1060 | 0.88 | 0.26 | 0.1127 |
| L0 KP-WSVM | 1 | 0.88 | 0.1621 | 1 | 0.89 | 0.1667 | 1 | 0.88 | 0.1860 |
| riskRFE (G) | 0.37 | 0.08 | 0.1349 | 0.37 | 0.08 | 0.1352 | 0.37 | 0.08 | 0.1376 |
| W-SVM Gauss no sel.† | 1 | 1 | 0.1761 | 1 | 1 | 0.1788 | 1 | 1 | 0.2031 |
† no sel.: no selection
While most biomarker-based treatment-selection rules reduce the total cost compared with the better of the two strategies of treating all or treating none, methods that allow for feature selection lead to substantial improvement over the weighted SVM method without feature selection. Embedded feature selection methods that involve tuning by the marker measurement cost δ2 in general show a decreasing trend in sensitivity and an increasing trend in specificity with increasing values of δ2, which is not the case for wrapper methods (such as riskRFE) or LASSO, which do not rely on δ2 for tuning. As a result, for the latter type of methods, the total cost E(θ) necessarily increases with increasing δ2, the cost of measuring a marker; in contrast, for methods that involve δ2 in tuning, the total cost E(θ) can often decrease with increasing δ2, especially when the improvement in the specificity score for a given method is more substantial than the decline in its sensitivity score. In the presence of a linear trend, the best performing methods are typically among feature selection methods with a linear kernel, yet feature selection methods with a nonlinear kernel can have comparable performance; in the presence of nonlinear trends, feature selection methods with a nonlinear kernel can show substantial improvement over linear methods. Finally, the relative performance of the various methods varies with the setting. In general, riskRFE performs well when the cost of marker measurement is low. Apart from it, L1 WSVM, SCAD and AROM are the best performing linear methods, especially when the cost of marker measurement is high, while L1 KP-WSVM stands out among the nonlinear feature selection methods, with robust performance across settings.
We also evaluate these algorithms in a few additional settings. We first compare their performance against the Decision List method of Zhang et al. (2015). We also consider a setting where the true optimal decision rule is not sparse, but the effects of some biomarkers on the optimal treatment decision are so small that, given the cost consideration, a sparser decision rule is better in terms of θ. Owing to space constraints, these additional settings are discussed in the Supplementary Materials (Web Appendix D), and the results are provided in Supplementary Tables 1, 2 and 3.
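The statement that E(θ) necessarily increases with δ2 for methods not tuned on δ2 can be checked against Table 1 with simple arithmetic. The sketch below assumes that θ adds δ2 for every marker measured, and that the expected number of selected markers in setting 1(i) is 2 × (a) + 25 × (b), where (a) and (b) are the reported proportions of correct and incorrect markers chosen. Both are an illustrative reading of the definitions rather than formulas quoted from the paper; the numbers are copied from Table 1.

```python
N_RELEVANT, N_NOISE = 2, 25  # setting 1(i): 27 markers, 2 of them relevant

# method: (prop. correct, prop. incorrect, theta at delta2 = 0, 0.0001, 0.001)
table1 = {
    "riskRFE (L)": (1.0, 0.03, 0.1016, 0.1019, 0.1044),
    "LASSO": (1.0, 0.29, 0.1065, 0.1074, 0.1158),
    "W-SVM Linear no sel.": (1.0, 1.0, 0.1178, 0.1205, 0.1448),
}


def expected_markers(prop_correct, prop_incorrect):
    """Expected number of markers measured by the selected rule (assumed form)."""
    return N_RELEVANT * prop_correct + N_NOISE * prop_incorrect


def theta_with_cost(theta0, n_markers, delta2):
    """Add the marker cost to the burden observed at delta2 = 0 (assumed form)."""
    return theta0 + delta2 * n_markers


predicted = {}
for method, (a, b, theta0, reported_1, reported_2) in table1.items():
    m = expected_markers(a, b)
    predicted[method] = (
        theta_with_cost(theta0, m, 0.0001),
        theta_with_cost(theta0, m, 0.001),
    )
    gaps = (abs(predicted[method][0] - reported_1), abs(predicted[method][1] - reported_2))
    print(f"{method}: markers = {m:.2f}, max gap to Table 1 = {max(gaps):.5f}")
```

Under these assumptions the predicted values agree with the reported ones to within rounding, which is what one would expect when the selected marker set does not change with δ2.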
4. Real data analysis
We now use an example from the RV144 Thailand HIV vaccine trial to examine the performance of the methods discussed above for selecting and combining markers. RV144 is one of the first vaccine trials that showed a significant positive effect of vaccine in preventing HIV infection. It included 16,402 participants, aged between 18 and 30 years, randomized 1:1 between vaccine and placebo (Rerks-Ngarm et al., 2009). A follow-up host-genetic study was conducted to measure the effect of genotypes of Fc receptor genes on vaccine efficacy. For this purpose, 190 single nucleotide polymorphisms (SNPs) (including five Fc-γ and one Fc-α receptors) were genotyped on 125 cases (74 placebo recipients and 51 vaccine recipients) and 225 controls (20 placebo recipients and 205 vaccine recipients). Of these, 28 SNPs were selected based on Hardy-Weinberg equilibrium by Li et al. (2014), each categorized into a binary variable, to study the association of each with vaccine efficacy. Here we consider all 28 SNPs as the expensive candidate biomarkers to select from. In addition, age, gender, and baseline behavioral risk are combined with the SNPs for treatment recommendation as in Huang (2016); no penalty is put on those baseline variables as they are readily available from the trial. Lifetime HIV treatment cost is estimated to be around $370K according to the CDC (https://www.cdc.gov/hiv/programresources/guidance/costeffectiveness/index.html). Vaccines typically cost a few hundred dollars and have minimal side effects, so we set δ1 = 0.001. The average cost of a SNP evaluation varies between $2 and $4, and hence we consider a range of values δ2 ∈ {10−6, 5 × 10−6, 10−5}, based on the burden of measuring a SNP relative to the treatment cost (as 2/370000 ≈ 5 × 10−6).
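The cost ratios above are dollar amounts divided by the cost of one disease event. A small worked example follows; the $370 vaccine price is an assumed stand-in for "a few hundred dollars", chosen because it reproduces δ1 = 0.001, and the helper name is hypothetical.

```python
LIFETIME_HIV_COST = 370_000  # USD, the CDC figure quoted in the text


def relative_cost(dollars, event_cost=LIFETIME_HIV_COST):
    """Express a dollar cost as a fraction of the burden of one disease event."""
    return dollars / event_cost


delta1 = relative_cost(370)     # assumed vaccine cost of $370
delta2_low = relative_cost(2)   # SNP evaluation at $2
delta2_high = relative_cost(4)  # SNP evaluation at $4

print(f"delta1 = {delta1:.4f}")
print(f"delta2 range = {delta2_low:.2e} to {delta2_high:.2e}")
```

A SNP cost of $2 to $4 corresponds to roughly 5.4 × 10−6 to 1.1 × 10−5, so the grid {10−6, 5 × 10−6, 10−5} brackets that range and extends it downward.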
We compare the performance of the linear and nonlinear feature selection methods (as well as logistic regression with LASSO) in selecting the optimal subset of markers for the purpose of vaccine recommendation for an individual. The optimal tuning parameters for each method are chosen in the same way as described in Section 3.1. To compute the expected disease rate under marker selection for each of these methods, we perform a cross-validation procedure by splitting the data into five random folds, using four folds for training the WSVM (or LASSO) procedures with the selected markers, and using the remaining fold for testing. The procedure is repeated 100 times and the average disease rate is computed. Table 6 shows the estimated performance of the different selection strategies along with the strategies of treating none and treating all. Those two lead to an estimated HIV infection rate of 8.54 and 6.44 per 1000 persons, respectively, consistent with the positive vaccine efficacy observed in the RV144 trial. From Table 6, it is clear that linear weighted support vector machines (with or without marker selection) do a better job of treatment recommendation than nonlinear weighted support vector machines (with or without marker selection) with the Gaussian kernel in this particular example. Both the linear and nonlinear WSVM without any selection perform on par with the strategy of treating all when the biomarker cost is relatively low, but become worse with higher values of δ2. Most of the linear selection methods yield a total cost at least as good as the strategy of treating all, even when δ2 is high. Among the linear WSVM selection methods, only elastic SCAD achieves a total cost worse than the strategy of treating all when δ2 is high. On the other hand, AROM outperforms every other strategy, followed closely by riskRFE (L).
Overall, we conclude that marker selection is an important component of treatment recommendation within the weighted support vector machines framework.
Table 6. Cross-validated treatment-selection performance of various approaches for making treatment recommendation in the RV144 trial.
Disease cost + vaccine cost + SNP cost: expected disease rate per 1000 individuals with added cost of vaccine (δ1 = 0.001) and 3 different costs for marker evaluation (δ2 ∈ {10−6, 5 × 10−6, 10−5} for each of the 28 SNPs evaluated) for treatment selection with various methods of marker selection.
| Methods | Disease cost + vaccine cost + SNP cost (δ2 = 10−6) | Disease cost + vaccine cost + SNP cost (δ2 = 5 × 10−6) | Disease cost + vaccine cost + SNP cost (δ2 = 10−5) |
|---|---|---|---|
| Treat all | 7.44 | 7.44 | 7.44 |
| Treat none | 8.54 | 8.54 | 8.54 |
| SCAD | 7.21 | 7.28 | 7.37 |
| elastic SCAD | 7.34 | 7.44 | 7.56 |
| L1 WSVM | 7.24 | 7.32 | 7.42 |
| L0 WSVM | 7.31 | 7.37 | 7.44 |
| FSV | 7.29 | 7.35 | 7.43 |
| AROM | 6.97 | 7.03 | 7.11 |
| riskRFE (L) | 7.15 | 7.17 | 7.19 |
| LASSO | 7.53 | 7.56 | 7.60 |
| W-SVM Linear no sel.† | 7.42 | 7.55 | 7.71 |
| KNIFE | 7.33 | 7.38 | 7.44 |
| L1 KP-WSVM | 7.39 | 7.46 | 7.55 |
| L0 KP-WSVM | 7.32 | 7.42 | 7.55 |
| riskRFE (G) | 7.35 | 7.40 | 7.46 |
| W-SVM Gauss no sel.† | 7.44 | 7.57 | 7.73 |
† no sel.: no selection
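The two reference rows of Table 6 can be reproduced from the infection rates reported in the text (6.44 and 8.54 per 1000 for treating all and treating none). The sketch below assumes the total is additive on the per-1000 scale, with the vaccine cost applied to the proportion vaccinated and the SNP cost applied to every individual for each SNP the rule uses; the function name and this exact form are assumptions for illustration.

```python
def burden_per_1000(infection_rate_per_1000, prop_vaccinated, n_snps, delta1, delta2):
    """Disease cost + vaccine cost + SNP cost, in disease events per 1000 people."""
    vaccine_cost = 1000 * delta1 * prop_vaccinated
    snp_cost = 1000 * delta2 * n_snps
    return infection_rate_per_1000 + vaccine_cost + snp_cost


DELTA1 = 0.001

# Neither reference strategy measures any SNP, so delta2 does not matter here.
treat_all = burden_per_1000(6.44, 1.0, 0, DELTA1, 1e-5)
treat_none = burden_per_1000(8.54, 0.0, 0, DELTA1, 1e-5)

# Cost of measuring the full 28-SNP panel at each delta2 considered.
full_panel_cost = [1000 * d2 * 28 for d2 in (1e-6, 5e-6, 1e-5)]

print(f"treat all: {treat_all:.2f}, treat none: {treat_none:.2f}")
print("28-SNP panel cost per 1000:", [f"{c:.3f}" for c in full_panel_cost])
```

Vaccinating everyone adds 1000 × 0.001 = 1 event-equivalent per 1000, which takes the treat-all infection rate of 6.44 to the 7.44 shown in the table, while the full SNP panel adds between 0.028 and 0.28 depending on δ2.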
5. Concluding remarks
In this article, we developed a new framework to incorporate marker measurement cost into treatment regime identification, with a pre-determined cost for inclusion of each biomarker in the model. We extended several marker selection methods to the weighted support vector machines setting, encompassing both the linear and the nonlinear space, in order to derive the optimal treatment-selection rules that minimize the total cost due to disease, treatment, and marker measurements. We investigated their performance in a number of different setups through a detailed simulation study, and also in a real data scenario, the RV144 HIV vaccine trial. We showed that in the presence of a large number of candidate biomarkers, marker selection is essential for deriving cost-effective treatment-selection rules that reduce disease and treatment burdens to the population while avoiding the burden of collecting information on irrelevant biomarkers. We showed that marker selection also reduces the chance of obtaining overfitted treatment recommendation rules, which can result in lower disease rates on test data than when the full model is used for treatment recommendation. It is worth noting that the indirect approaches can sometimes be inefficient compared with direct approaches, especially when the underlying disease risk model is correctly specified in the direct approaches. Moreover, inference for the estimated optimal regime can become challenging under indirect approaches. However, these methods are still appealing and provide a complementary alternative to direct approaches because of their robustness to model misspecification.
In fact, comparison of our algorithms with logistic regression with LASSO showed that in some settings, even when the risk model assumptions hold (as was the case in most of our simulation settings), feature selection using the indirect approaches can lead to treatment-selection rules with performance comparable to or even better than that based on direct approaches. Among the various marker selection techniques employed in the simulations, riskRFE stood out as the best performing method when the marker measurement cost was low, while AROM, SCAD and L0 WSVM were the best performing linear feature selection methods for higher marker measurement costs, depending on the setting. On the other hand, L1 KP-WSVM was clearly the best performing nonlinear feature selection method in most of the settings. In general, we showed that nonlinear feature selection methods can perform substantially better than linear feature selection methods in the presence of nonlinear patterns, at a minimal cost of efficiency in the presence of linear patterns. The methodology put forward in this article assumes the randomized setting. However, if the (i) SUTVA and (ii) ignorable treatment assignment assumptions (see Section 2 for more details) hold in an observational setting, our proposed methods will continue to be applicable there, although assumption (ii) is typically not verifiable without randomization.
Finally, a caveat of our proposed methods is that the estimated optimal rule and the corresponding total burden depend on the choices of the input cost parameters δ1 and δ2. Ideally these parameters can be elicited from experts or can be estimated, e.g. when one can associate a monetary cost with each disease event and a monetary cost with treatment and with biomarker measurement. In some settings, however, it can be hard to reach a consensus regarding the optimal values of one or both parameters, and we recommend that a sensitivity analysis be conducted over a viable range for these parameters to assist the decision.
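A sensitivity analysis of this kind can be as simple as recomputing the total burden of each candidate rule over a grid of (δ1, δ2) values and recording which rule comes out best. The sketch below uses entirely hypothetical rules and numbers to show the mechanics; the additive form of the burden (disease rate + δ1 × proportion treated + δ2 × number of markers measured) is an assumed reading of the total burden, not a formula quoted from the paper.

```python
def total_burden(disease_rate, prop_treated, n_markers, delta1, delta2):
    """Total burden per person on the disease-event scale (assumed additive form)."""
    return disease_rate + delta1 * prop_treated + delta2 * n_markers


# Hypothetical candidate rules: (disease rate, proportion treated, markers measured).
rules = {
    "treat all": (0.200, 1.0, 0),
    "treat none": (0.300, 0.0, 0),
    "sparse rule": (0.110, 0.6, 3),
    "full-panel rule": (0.105, 0.6, 27),
}


def best_rule(delta1, delta2):
    """Name of the rule with the smallest total burden at the given costs."""
    return min(rules, key=lambda name: total_burden(*rules[name], delta1, delta2))


for delta1 in (0.0, 0.05):
    for delta2 in (0.0, 0.0001, 0.001, 0.01, 0.05):
        print(f"delta1 = {delta1}, delta2 = {delta2}: {best_rule(delta1, delta2)}")
```

In this toy example the full-panel rule is preferred only when markers are nearly free, the sparse rule takes over at moderate marker cost, and a marker-free strategy wins once measuring even three markers costs more than the disease burden it avoids. Reporting the ranges of δ1 and δ2 over which each rule is preferred gives decision makers a way to see how robust the recommendation is.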
Table 2. Setting 1(ii) (Nonlinear): Setting with disease prevalence approximately 0.18 and 0.17 among the untreated and treated, respectively (Number of markers 27, number of significant markers 2).
Performance scores: (a) Monte Carlo mean of proportion of correct markers chosen, (b) Monte Carlo mean of proportion of incorrect markers chosen, (c) Monte Carlo mean of θ, with (i) δ1 = 0 and δ2 = 0, (ii) δ1 = 0 and δ2 = 0.0001, (iii) δ1 = 0 and δ2 = 0.001, for different feature selection methods.
| Setting 1 (Underlying model: Nonlinear) | δ2 = 0: (a) | δ2 = 0: (b) | δ2 = 0: (c) | δ2 = 0.0001: (a) | δ2 = 0.0001: (b) | δ2 = 0.0001: (c) | δ2 = 0.001: (a) | δ2 = 0.001: (b) | δ2 = 0.001: (c) |
|---|---|---|---|---|---|---|---|---|---|
| Linear Methods | | | | | | | | | |
| SCAD | 0.91 | 0.42 | 0.1296 | 0.89 | 0.32 | 0.1292 | 0.85 | 0.15 | 0.1322 |
| elastic SCAD | 0.96 | 0.36 | 0.1289 | 0.96 | 0.33 | 0.1291 | 0.95 | 0.23 | 0.1340 |
| L1 WSVM | 0.94 | 0.43 | 0.1281 | 0.94 | 0.41 | 0.1296 | 0.90 | 0.29 | 0.1358 |
| L0 WSVM | 0.88 | 0.30 | 0.1247 | 0.86 | 0.29 | 0.1269 | 0.82 | 0.16 | 0.1314 |
| FSV | 0.89 | 0.32 | 0.1274 | 0.90 | 0.26 | 0.1266 | 0.88 | 0.19 | 0.1328 |
| AROM | 0.91 | 0.25 | 0.1269 | 0.92 | 0.23 | 0.1263 | 0.89 | 0.16 | 0.1317 |
| riskRFE (L) | 0.85 | 0.10 | 0.1223 | 0.85 | 0.10 | 0.1228 | 0.85 | 0.10 | 0.1266 |
| LASSO | 0.79 | 0.19 | 0.1320 | 0.79 | 0.19 | 0.1326 | 0.79 | 0.19 | 0.1383 |
| W-SVM Linear no sel.† | 1 | 1 | 0.1349 | 1 | 1 | 0.1376 | 1 | 1 | 0.1619 |
| Nonlinear Methods | | | | | | | | | |
| KNIFE | 0.99 | 0.35 | 0.1156 | 0.97 | 0.30 | 0.1147 | 0.95 | 0.21 | 0.1196 |
| L1 KP-WSVM | 0.96 | 0.33 | 0.1130 | 0.96 | 0.31 | 0.1126 | 0.93 | 0.20 | 0.1183 |
| L0 KP-WSVM | 0.97 | 0.45 | 0.1191 | 0.97 | 0.42 | 0.1196 | 0.95 | 0.35 | 0.1245 |
| riskRFE (G) | 0.88 | 0.11 | 0.1178 | 0.88 | 0.11 | 0.1183 | 0.88 | 0.11 | 0.1222 |
| W-SVM Gauss no sel.† | 1 | 1 | 0.1356 | 1 | 1 | 0.1383 | 1 | 1 | 0.1626 |
† no sel.: no selection
Table 3. Setting 2(i) (Linear): Setting with disease prevalence approximately 0.22 among both the untreated and treated (Number of markers 53, number of significant markers 3).
Performance scores: (a) Monte Carlo mean of proportion of correct markers chosen, (b) Monte Carlo mean of proportion of incorrect markers chosen, (c) Monte Carlo mean of θ, with (i) δ1 = 0 and δ2 = 0, (ii) δ1 = 0 and δ2 = 0.0001, (iii) δ1 = 0 and δ2 = 0.001, for different feature selection methods.
| Setting 2 (Underlying model: Linear) | δ2 = 0: (a) | δ2 = 0: (b) | δ2 = 0: (c) | δ2 = 0.0001: (a) | δ2 = 0.0001: (b) | δ2 = 0.0001: (c) | δ2 = 0.001: (a) | δ2 = 0.001: (b) | δ2 = 0.001: (c) |
|---|---|---|---|---|---|---|---|---|---|
| Linear Methods | | | | | | | | | |
| SCAD | 0.98 | 0.31 | 0.0557 | 0.97 | 0.30 | 0.0582 | 0.95 | 0.01 | 0.0545 |
| elastic SCAD | 1 | 0.30 | 0.0599 | 1 | 0.23 | 0.0588 | 1 | 0.13 | 0.0642 |
| L1 WSVM | 1 | 0.45 | 0.0605 | 1 | 0.33 | 0.0574 | 1 | 0.14 | 0.0634 |
| L0 WSVM | 0.97 | 0.23 | 0.0563 | 0.96 | 0.17 | 0.0565 | 0.95 | 0.01 | 0.0542 |
| FSV | 0.95 | 0.36 | 0.0594 | 0.94 | 0.24 | 0.0627 | 0.95 | 0.12 | 0.0630 |
| AROM | 0.94 | 0.06 | 0.0562 | 0.94 | 0.04 | 0.0549 | 0.94 | 0.02 | 0.0598 |
| riskRFE (L) | 0.97 | 0.02 | 0.0519 | 0.97 | 0.02 | 0.0523 | 0.97 | 0.02 | 0.0558 |
| LASSO | 1 | 0.17 | 0.0555 | 1 | 0.17 | 0.0567 | 1 | 0.17 | 0.0670 |
| W-SVM Linear no sel.† | 1 | 1 | 0.0686 | 1 | 1 | 0.0739 | 1 | 1 | 0.1216 |
| Nonlinear Methods | | | | | | | | | |
| KNIFE | 0.98 | 0.24 | 0.0557 | 0.96 | 0.16 | 0.0567 | 0.96 | 0.03 | 0.0561 |
| L1 KP-WSVM | 0.98 | 0.19 | 0.0550 | 0.98 | 0.08 | 0.0539 | 0.97 | 0.03 | 0.0560 |
| L0 KP-WSVM | 0.99 | 0.20 | 0.0606 | 0.99 | 0.17 | 0.0572 | 0.99 | 0.09 | 0.0630 |
| riskRFE (G) | 0.98 | 0.02 | 0.0505 | 0.98 | 0.02 | 0.0509 | 0.98 | 0.02 | 0.0545 |
| W-SVM Gauss no sel.† | 1 | 1 | 0.0681 | 1 | 1 | 0.0734 | 1 | 1 | 0.1211 |
† no sel.: no selection
Table 4. Setting 2(ii) (Nonlinear): Setting with disease prevalence approximately 0.30 and 0.18 among the untreated and treated, respectively (Number of markers 53, number of significant markers 3).
Performance scores: (a) Monte Carlo mean of proportion of correct markers chosen, (b) Monte Carlo mean of proportion of incorrect markers chosen, (c) Monte Carlo mean of θ, with (i) δ1 = 0 and δ2 = 0, (ii) δ1 = 0 and δ2 = 0.0001, (iii) δ1 = 0 and δ2 = 0.001, for different feature selection methods.
| Setting 2 (Underlying model: Nonlinear) | δ2 = 0: (a) | δ2 = 0: (b) | δ2 = 0: (c) | δ2 = 0.0001: (a) | δ2 = 0.0001: (b) | δ2 = 0.0001: (c) | δ2 = 0.001: (a) | δ2 = 0.001: (b) | δ2 = 0.001: (c) |
|---|---|---|---|---|---|---|---|---|---|
| Linear Methods | | | | | | | | | |
| SCAD | 0.84 | 0.65 | 0.1578 | 0.74 | 0.39 | 0.1588 | 0.68 | 0.24 | 0.1693 |
| elastic SCAD | 0.88 | 0.63 | 0.1607 | 0.81 | 0.47 | 0.1641 | 0.77 | 0.39 | 0.1821 |
| L1 WSVM | 0.76 | 0.47 | 0.1597 | 0.71 | 0.37 | 0.1610 | 0.58 | 0.18 | 0.1696 |
| L0 WSVM | 0.74 | 0.33 | 0.1576 | 0.69 | 0.27 | 0.1621 | 0.63 | 0.15 | 0.1685 |
| FSV | 0.83 | 0.64 | 0.1597 | 0.70 | 0.40 | 0.1630 | 0.69 | 0.33 | 0.1762 |
| AROM | 0.74 | 0.33 | 0.1564 | 0.68 | 0.24 | 0.1584 | 0.65 | 0.17 | 0.1653 |
| riskRFE (L) | 0.64 | 0.22 | 0.1564 | 0.64 | 0.22 | 0.1577 | 0.64 | 0.22 | 0.1693 |
| LASSO | 0.59 | 0.15 | 0.1636 | 0.59 | 0.15 | 0.1645 | 0.59 | 0.15 | 0.1729 |
| W-SVM Linear no sel.† | 1 | 1 | 0.1674 | 1 | 1 | 0.1727 | 1 | 1 | 0.2204 |
| Nonlinear Methods | | | | | | | | | |
| KNIFE | 0.79 | 0.31 | 0.1435 | 0.74 | 0.21 | 0.1388 | 0.71 | 0.07 | 0.1304 |
| L1 KP-WSVM | 0.81 | 0.53 | 0.1528 | 0.62 | 0.27 | 0.1513 | 0.50 | 0.10 | 0.1455 |
| L0 KP-WSVM | 0.82 | 0.53 | 0.1535 | 0.81 | 0.47 | 0.1567 | 0.77 | 0.38 | 0.1642 |
| riskRFE (G) | 0.65 | 0.09 | 0.1369 | 0.65 | 0.09 | 0.1375 | 0.65 | 0.09 | 0.1432 |
| W-SVM Gauss no sel.† | 1 | 1 | 0.1645 | 1 | 1 | 0.1698 | 1 | 1 | 0.2175 |
† no sel.: no selection
6. Supplementary Materials
Additional materials cited in Sections 2 and 3 are available with this paper at the Journal of Royal Statistical Society Series C web page at Wiley Online Library.
7. Acknowledgments
The work was supported by NIH grant R01 GM106177-01. The authors thank the participants, investigators, and sponsors of the RV144 trials. The authors thank Dr. Dan Geraghty from the Fred Hutchinson Cancer Center for generating the genetics data, and Drs. Sue Li and Peter Gilbert for pre-processing and preliminary analysis of the SNP data. The views expressed are those of the authors and should not be construed to represent the positions of the U.S. Army or the Department of Defense.
References
- Allen GI (2013). Automatic feature selection via weighted kernels and regularization. J. Comput. Graph. Stat, 22(2), 284–299.
- Amaldi E and Kann V (1998). On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theor. Comput. Sci, 209, 237–260.
- Bradley P and Mangasarian OL (1998). Feature selection via concave minimization and support vector machines. ICML, 98, 82–90.
- Dasgupta S, Goldberg Y, and Kosorok M (2019). Feature Elimination in Kernel Machines in moderately high dimensions. The Annals of Statistics, 47(1), 497–526.
- Fan H and Ramamohanarao K (2005). A weighting scheme based on emerging patterns for weighted support vector machines. In IEEE International Conference, 2, 435–440.
- Foster JC, Taylor JMG, and Ruberg SJ (2011). Subgroup identification from randomized clinical trial data. Statist. Med, 30, 2867–2880.
- Fung GM and Mangasarian OL (2004). A feature selection Newton method for support vector machine classification. Comput. Optim. Appl, 28(2), 185–202.
- Huang Y (2015). Identifying optimal biomarker combinations for treatment selection through randomized controlled trials. Clinical Trials, 12(4), 348–356.
- Huang K, Zheng D, Sun J, Hotta Y, Fujimoto K, and Naoi S (2010). Sparse learning for support vector classification. Pattern Recognit. Lett, 31(13), 1944–1951.
- Huang Y and Fong F (2014). Identifying Optimal Biomarker Combinations for Treatment Selection via a Robust Kernel Method. Biometrics, 70, 891–901.
- Jung K (2013). Weighted support vector machines with the SCAD penalty. Commun. Stat. Appl. Methods, 20(6), 481–490.
- Li SS, Gilbert PB, Tomaras GD, Kijak G, Ferrari G, Thomas R et al. (2014). FCGR2C polymorphisms associate with HIV-1 vaccine protection in RV144 trial. J Clin Invest, 124, 3879–3890.
- Liang S, Lu W, Song R, and Wang L (2017). Sparse concordance-assisted learning for optimal treatment decision. The Journal of Machine Learning Research, 18(1), 7375–7400.
- Lin C and Wang S (2002). Fuzzy support vector machines. IEEE Trans. Neural Netw, 13, 464–471.
- Lu W, Zhang HH, and Zeng D (2013). Variable selection for optimal treatment decision. Stat. Methods Med. Res, 22, 493–504.
- Maldonado S, Weber R, and Basak J (2011). Simultaneous feature selection and classification using kernel-penalized support vector machines. Information Sciences, 181(1), 115–128.
- Mangasarian OL and Wild E (2007). Nonlinear knowledge in kernel approximation. IEEE Trans. Neural Netw, 18(1), 300–306.
- Mangasarian OL (2006). Exact 1-norm support vector machines via unconstrained convex differentiable minimization. J. Mach. Learn. Res, 7, 1517–1530.
- Qian M and Murphy S (2011). Performance guarantees for individualized treatment rules. Ann. Statist, 39, 1180–1210.
- Rerks-Ngarm S, Pitisuttithum P, Nitayaphan S, Kaewkungwal J, Chiu J, Paris R et al. (2009). Vaccination with ALVAC and AIDSVAX to prevent HIV-1 infection in Thailand. N Engl J Med, 361, 2209–2220.
- Rubin DB (1980). Comment on randomization analysis of experimental data: The fisher randomization test by D. Basu. J. Am. Statist. Ass, 75, 591–593.
- Shi C, Fan A, Song R, and Lu W (2018). High-dimensional A-learning for optimal dynamic treatment regimes. The Annals of Statistics, 46(3), 925–957.
- Song X and Pepe MS (2004). Evaluating markers for selecting a patient’s treatment. Biometrics, 60, 874–883.
- Vickers A, Kattan M, and Sargent D (2007). Method for evaluating prediction models that apply the results of randomized trials to individual patients. Trials, 8, 14.
- Weston J, Elisseeff A, Scholkopf B, and Tipping M (2003). Use of the zero-norm with linear models and kernel methods. J. Mach. Learn. Res, 3, 1439–1461.
- Yang X, Song Q, and Wang Y (2007). A weighted support vector machine for data classification. Int. J. Pattern Recognit. Artif. Intell, 21(05), 961–976.
- Zhang B, Tsiatis AA, Davidian M, Zhang M, and Laber EB (2012b). Estimating optimal treatment regimes from a classification perspective. Stat, 1, 103–114.
- Zhang B, Tsiatis A, Laber E, and Davidian M (2012a). A robust method for estimating optimal treatment regimes. Biometrics, 68, 1010–1018.
- Zhang HH, Ahn J, Lin X, and Park C (2006). Gene selection using support vector machines with non-convex penalty. Bioinformatics, 22(1), 88–95.
- Zhang Y, Laber EB, Tsiatis A, and Davidian M (2015). Using Decision Lists to Construct Interpretable and Parsimonious Treatment Regimes. Biometrics, 71(4), 895–904.
- Zhao Q, Small DS, and Ertefaie A (2017). Selective inference for effect modification via the lasso. arXiv preprint, arXiv:1705.08020.
- Zhao Y, Zeng D, Rush AJ, and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. J. Am. Statist. Ass, 107, 1106–1118.
- Zhou X, Mayer-Hamblett N, Khan U, and Kosorok MR (2015). Residual weighted learning for estimating individualized treatment rules. J. Am. Statist. Ass, (just-accepted).