Estimating Individualized Treatment Rules Using Outcome Weighted Learning

Yingqi Zhao; Donglin Zeng; A John Rush; Michael R Kosorok

doi:10.1080/01621459.2012.695674

Author manuscript; available in PMC: 2013 Sep 1.

Published in final edited form as: J Am Stat Assoc. 2012 Oct 8;107(449):1106–1118. doi: 10.1080/01621459.2012.695674


PMCID: PMC3636816 NIHMSID: NIHMS379905 PMID: 23630406

Abstract

There is increasing interest in discovering individualized treatment rules for patients who have heterogeneous responses to treatment. In particular, one aims to find an optimal individualized treatment rule which is a deterministic function of patient specific characteristics maximizing expected clinical outcome. In this paper, we first show that estimating such an optimal treatment rule is equivalent to a classification problem where each subject is weighted proportional to his or her clinical outcome. We then propose an outcome weighted learning approach based on the support vector machine framework. We show that the resulting estimator of the treatment rule is consistent. We further obtain a finite sample bound for the difference between the expected outcome using the estimated individualized treatment rule and that of the optimal treatment rule. The performance of the proposed approach is demonstrated via simulation studies and an analysis of chronic depression data.

Keywords: Dynamic Treatment Regime, Individualized Treatment Rule, Weighted Support Vector Machine, RKHS, Risk Bound, Bayes Classifier, Cross Validation

1. INTRODUCTION

In many different diseases, patients can show significant heterogeneity in response to treatments. In some cases, a drug that works for a majority of individuals may not work for a subset of patients with certain characteristics. For example, molecularly targeted cancer drugs are only effective for patients with tumors expressing targets (Grünwald & Hidalgo 2003; Buzdar 2009), and significant heterogeneity exists in responses among patients with different levels of psychiatric symptoms (Piper et al. 1995; Crits-Christoph et al. 1999). Thus significant improvements in public health could potentially result from judiciously treating individuals based on their prognostic or genomic data rather than taking a “one size fits all” approach. Treatments and clinical trials tailored for patients have enjoyed recent popularity in clinical practice and medical research, and, in some cases, have provided high quality recommendations accounting for individual heterogeneity (Sargent et al. 2005; Flume et al. 2007; Insel 2009). These proposals have focused on smaller, specific and well-defined subgroups, sought to provide guidance in clinical decision making based on individual differences, and attempted to achieve better risk minimization and benefit maximization.

One statistical approach for developing individual-adaptive interventions is to classify subjects into different risk levels estimated by a parametric or semiparametric regression model using prognostic factors, and then to assign therapy according to risk level (Eagle et al. 2004; Marlowe et al. 2007; Cai et al. 2010). However, the parametric or semiparametric model assumptions may not be valid due to the complexity of the disease mechanism and individual heterogeneity. Moreover, these approaches require prior knowledge to allocate the optimal treatment to each risk category. There is also a significant literature examining discovery and development of personalized treatment relying on predicting patient responses to optional regimens (Rosenwald et al. 2002; van’t Veer & Bernards 2008), where the optimal decision leads to the best predicted outcome. One recent paper by Qian & Murphy (2011) applies a two-step procedure which first estimates a conditional mean for the response and then estimates the rule maximizing this conditional mean. A rich linear model is used to sufficiently approximate the conditional mean, with the estimated rule derived via l₁ penalized least squares (l₁-PLS). The method includes variable selection to facilitate parsimony and ease of interpretation. The conditional mean approximation requires estimating a prediction model of the relationship between pretreatment prognostic variables, treatments and clinical outcome. Reduction in the mean response is related to the excess prediction error, through which an upper bound can be constructed for the mean reduction of the associated treatment rule. However, by inverting the model to find the optimal treatment rule, this method emphasizes prediction accuracy of the clinical response model instead of directly optimizing the decision rule.

In this paper, we propose a new method for solving this problem which circumvents the need for conditional mean modeling followed by inversion by directly estimating the decision rule which maximizes clinical response. Specifically, we demonstrate that the optimal treatment rule can be estimated within a weighted classification framework, where the weights are determined from the clinical outcomes. We then alleviate the computational problem by substituting the 0–1 loss in the classification with a convex surrogate loss as is done with the support vector machine (SVM) via the hinge loss (Cortes & Vapnik 1995). The directness of this outcome weighted learning (OWL) approach enables us to better select targeted therapy while making full use of available information.

The remainder of the paper is organized as follows. In Section 2, we provide the mathematical concepts and framework for individualized treatment rules, and then formulate the problem as OWL. The proposed weighted SVM approach for constructing the optimal ITR is then developed in detail. In Section 3, consistency and risk bound results are established for the estimated rules. Faster convergence rates can be achieved with additional marginal assumptions on the data generating distribution. We present simulation studies to evaluate performance of the proposed method in Section 4. The method is then illustrated on the Nefazodone-CBASP data (Keller et al. 2000) in Section 5. In Section 6, we discuss future work. The proofs of theoretical results are given in the Appendix.

2. METHODOLOGY

2.1 Individualized Treatment Rule (ITR)

We assume the data are collected from a two-arm randomized trial. That is, treatment assignments, denoted by A ∈ 𝒜 = {−1, 1}, are independent of any patient’s prognostic variables, which are denoted as a d-dimensional vector X = (X₁, …, X_d)^T ∈ 𝒳. We let R be the observed clinical outcome, also called the “reward,” and assume that R is bounded, with larger values of R being more desirable. Thus an individualized treatment rule (ITR) is a map from the space of prognostic variables, 𝒳, to the space of treatments, 𝒜. An optimal ITR is a rule that maximizes the expected reward if implemented.

Mathematically, we can quantify the optimal ITR in terms of the relationship among (X, A, R). To see this, denote the distribution of (X, A, R) by P and the expectation with respect to P by E. For any given ITR 𝒟, we let P^𝒟 denote the distribution of (X, A, R) given that A = 𝒟(X), i.e., the treatments are chosen according to the rule 𝒟; correspondingly, the expectation with respect to P^𝒟 is denoted by E^𝒟. Then under the assumption that P(A = a) > 0 for a = 1 and −1, it is clear that P^𝒟 is absolutely continuous with respect to P and dP^𝒟/dP = I(a = 𝒟(x))/P(A = a), where I(·) is the indicator function. Thus, the expected reward under the ITR 𝒟 is given as

E^{\mathcal{D}} (R) = \int R \, d P^{\mathcal{D}} = \int R \frac{d P^{\mathcal{D}}}{d P} d P = E [\frac{I (A = \mathcal{D} (X))}{A π + (1 - A) / 2} R],

where π = P(A = 1). This expectation is called the value function associated with 𝒟 and is denoted 𝒱(𝒟). Consequently, an optimal ITR, 𝒟^*, is a rule that maximizes 𝒱(𝒟), i.e.,

\mathcal{D}^{*} \in \underset{\mathcal{D}}{argmax} E [\frac{I (A = \mathcal{D} (X))}{A π + (1 - A) / 2} R] .

Note that 𝒟^* does not change if R is replaced by R + c for any constant c. Thus, without loss of generality, we assume that R is nonnegative in the following.
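As a rough illustration of the value function, the sketch below computes the sample analogue of the expectation above for a given rule. The toy data-generating model, the function name `value_estimate`, and the two candidate rules are assumptions made for this example only; they are not part of the trial setting analyzed in this paper.

```python
import random

def value_estimate(X, A, R, rule, pi=0.5):
    """Sample analogue of E[ I(A = D(X)) * R / P(A) ] for treatments in {-1, 1}."""
    total = 0.0
    for x, a, r in zip(X, A, R):
        p_a = pi if a == 1 else 1.0 - pi   # equals A*pi + (1 - A)/2
        if rule(x) == a:
            total += r / p_a
    return total / len(X)

# Hypothetical toy trial: one covariate, P(A = 1) = 1/2, and a reward that
# favours treatment 1 when x > 0 and treatment -1 when x < 0.
random.seed(0)
n = 20000
X = [random.uniform(-1, 1) for _ in range(n)]
A = [random.choice([-1, 1]) for _ in range(n)]
R = [2.0 + a * x + random.gauss(0, 0.1) for x, a in zip(X, A)]

v_tailored = value_estimate(X, A, R, lambda x: 1 if x > 0 else -1)
v_treat_all = value_estimate(X, A, R, lambda x: 1)
print(v_tailored, v_treat_all)
```

Under this toy model the tailored rule has population value 2 + E|X| = 2.5 and the rule that gives everyone treatment 1 has value 2, so the first estimate should come out larger.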

2.2 Outcome Weighted Learning (OWL) for Estimating Optimal ITR

Assume that we observe i.i.d. data (X_i, A_i, R_i), i = 1, …, n from the two-arm randomized trial described above. Previous approaches to estimating the optimal ITR first estimate E(R|X, A), using the observed data via parametric or semiparametric models, and then estimate the optimal decision rule by comparing the predicted value E(R|X, A = 1) versus E(R|X, A = −1) (Robins 2004; Moodie et al. 2009; Qian & Murphy 2011). As discussed before, these approaches indirectly estimate the optimal ITR, and are likely to produce a suboptimal ITR if the model for R given (X, A) is overfitted. As an alternative, we propose a nonparametric approach which directly maximizes the value function based on an outcome weighted learning method.

To illustrate our approach, we first notice that searching for the optimal ITR, 𝒟^*, which maximizes 𝒱(𝒟), is equivalent to finding the 𝒟 that minimizes

E [R ∣ A = 1] + E [R ∣ A = - 1] - \mathcal{V} (\mathcal{D}) = E [\frac{I (A \neq \mathcal{D} (X))}{A π + (1 - A) / 2} R] .

The latter can be viewed as a weighted classification error, for which we want to classify A using X but we also weigh each misclassification event by R/(Aπ + (1 − A)/2). Hence, using the observed data, we approximate the weighted classification error by

n^{- 1} \sum_{i = 1}^{n} \frac{R_{i}}{A_{i} π + (1 - A_{i}) / 2} I (A_{i} \neq \mathcal{D} (X_{i}))

and seek to minimize this expression to estimate 𝒟^*. Since 𝒟(x) can always be represented as sign(f(x)), for some decision function f, minimizing the above expression for 𝒟^* is equivalent to minimizing

n^{- 1} \sum_{i = 1}^{n} \frac{R_{i}}{A_{i} π + (1 - A_{i}) / 2} I (A_{i} \neq sign (f (X_{i})))

(2.1)

to obtain the optimal f^*, and then setting 𝒟^*(x) = sign(f^*(x)).

The above minimization also has the following interpretation. We intend to find a decision rule which assigns treatments to each subject based only on their prognostic information. For subjects observed to have a large reward, this rule is apt to recommend the same treatment that the subject actually received; however, for subjects with small rewards, the rule is more likely to recommend the opposite of the treatment they received. In other words, if we stratify subjects into different strata based on the rewards, we expect the optimal ITR to misclassify fewer subjects in the high reward stratum than in the low reward stratum.
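The following sketch illustrates the weighted classification view on simulated data. It checks that the empirical weighted misclassification error and the empirical value add up to the average weight, and that a good rule agrees with the received treatment more often among high-reward subjects. The toy reward model and all function names are assumptions introduced for illustration.

```python
import random

def ipw_weight(a, r, pi=0.5):
    """Weight R / (A*pi + (1 - A)/2) attached to one subject."""
    return r / (pi if a == 1 else 1.0 - pi)

def value_estimate(X, A, R, rule, pi=0.5):
    return sum(ipw_weight(a, r, pi) for x, a, r in zip(X, A, R) if rule(x) == a) / len(X)

def weighted_error(X, A, R, rule, pi=0.5):
    return sum(ipw_weight(a, r, pi) for x, a, r in zip(X, A, R) if rule(x) != a) / len(X)

# Hypothetical toy trial in which the optimal rule is sign(x).
random.seed(0)
n = 5000
X = [random.uniform(-1, 1) for _ in range(n)]
A = [random.choice([-1, 1]) for _ in range(n)]
R = [2.0 + a * x + random.gauss(0, 0.1) for x, a in zip(X, A)]
rule = lambda x: 1 if x > 0 else -1

# Average weight = value + weighted error, because I(A = D) + I(A != D) = 1.
total_weight = sum(ipw_weight(a, r) for a, r in zip(A, R)) / n
gap = total_weight - value_estimate(X, A, R, rule) - weighted_error(X, A, R, rule)

# Agreement between the rule and the received treatment, by reward stratum.
ranked = sorted(zip(R, X, A))
low, high = ranked[: n // 2], ranked[n // 2:]
agree_low = sum(rule(x) == a for _, x, a in low) / len(low)
agree_high = sum(rule(x) == a for _, x, a in high) / len(high)
print(gap, agree_low, agree_high)
```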

In the machine learning literature, (2.1) can be viewed as a weighted summation of 0–1 loss. It is well known that minimizing (2.1) is difficult due to the discontinuity and non-convexity of 0–1 loss. To alleviate this difficulty, one common approach is to find a convex surrogate loss for the 0–1 loss in (2.1) and develop a tractable estimation procedure (Zhang 2004; Lugosi & Vayatis 2004; Steinwart 2005). Among many choices of surrogate loss, one of the most popular is the hinge loss used in the context of the support vector machine (Cortes & Vapnik 1995), which we will adopt in this paper. Furthermore, we penalize the complexity of the decision function in order to avoid overfitting. In other words, instead of minimizing (2.1), we aim to minimize

n^{- 1} \sum_{i = 1}^{n} \frac{R_{i}}{A_{i} π + (1 - A_{i}) / 2} {(1 - A_{i} (f (X_{i})))}^{+} + λ_{n} {| | f | |}^{2},

(2.2)

where x⁺ = max(x, 0) and ||f|| is some norm for f. In this way, we cast the problem of estimating the optimal ITR into a weighted classification problem using support vector machine techniques.
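A minimal sketch of the two empirical criteria follows, assuming a linear decision function and taking ||f|| to be the Euclidean norm of its coefficient vector (one possible choice of norm). Because the hinge loss dominates the 0–1 loss pointwise and the weights are nonnegative, the penalized objective (2.2) should never fall below the weighted error (2.1). The function names and simulated data are illustrative assumptions.

```python
import random

def sign(t):
    return 1 if t > 0 else -1

def linear_f(beta, beta0, x):
    return sum(b * xj for b, xj in zip(beta, x)) + beta0

def ipw_weights(A, R, pi=0.5):
    return [r / (pi if a == 1 else 1.0 - pi) for a, r in zip(A, R)]

def weighted_01_loss(beta, beta0, X, A, R, pi=0.5):
    """Empirical weighted misclassification error, expression (2.1)."""
    w = ipw_weights(A, R, pi)
    return sum(wi * (a != sign(linear_f(beta, beta0, x)))
               for wi, x, a in zip(w, X, A)) / len(X)

def weighted_hinge_objective(beta, beta0, X, A, R, lam, pi=0.5):
    """Expression (2.2) for a linear f, with ||f||^2 taken as sum(beta_j^2)."""
    w = ipw_weights(A, R, pi)
    hinge = sum(wi * max(1.0 - a * linear_f(beta, beta0, x), 0.0)
                for wi, x, a in zip(w, X, A)) / len(X)
    return hinge + lam * sum(b * b for b in beta)

random.seed(0)
n = 200
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(n)]
A = [random.choice([-1, 1]) for _ in range(n)]
R = [random.uniform(0, 3) for _ in range(n)]   # nonnegative rewards

# Compare the two criteria at 50 randomly drawn linear decision functions.
checks = []
for _ in range(50):
    beta = [random.gauss(0, 2), random.gauss(0, 2)]
    beta0 = random.gauss(0, 1)
    checks.append(weighted_hinge_objective(beta, beta0, X, A, R, lam=0.1)
                  >= weighted_01_loss(beta, beta0, X, A, R))
print(all(checks))
```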

2.3 Linear Decision Rule for Optimal ITR

Suppose that the decision function f(x) minimizing (2.2) is a linear function of x, that is, f(x) = 〈β, x〉 + β₀, where 〈·,·〉 denotes the inner product in Euclidean space. Then the corresponding ITR will assign a subject with prognostic value X to treatment 1 if 〈β, X〉 + β₀ > 0 and to treatment −1 otherwise.

In (2.2), we define ||f|| as the Euclidean norm of β. Following the usual SVM, we introduce a slack variable ξ_i for subject i to allow a small portion of wrong classification. Denote C > 0 as the classifier margin. Then minimizing (2.2) can be rewritten as

max_{β, β_{0}, | | β | | = 1} C subject to A_{i} (〈 β, X_{i} 〉 + β_{0}) \geq C (1 - ξ_{i}), ξ_{i} \geq 0, \sum \frac{R_{i}}{π_{i}} ξ_{i} < s,

where π_i = πI(A_i = 1)+(1 − π)I(A_i = −1) and s is a constant depending on λ_n. This is equivalent to

min \frac{1}{2} {| | β | |}^{2} subject to A_{i} (〈 β, X_{i} 〉 + β_{0}) \geq (1 - ξ_{i}), ξ_{i} \geq 0, \sum \frac{R_{i}}{π_{i}} ξ_{i} < s,

that is

min \frac{1}{2} {| | β | |}^{2} + κ \sum_{i = 1}^{n} \frac{R_{i}}{π_{i}} ξ_{i} subject to A_{i} (〈 β, X_{i} 〉 + β_{0}) \geq (1 - ξ_{i}), ξ_{i} \geq 0,

where κ > 0 is a tuning parameter and R_i/π_i is the weight for the i^th point. We observe that the main difference compared to standard SVM is that we weigh each slack variable ξ_i with R_i/π_i.

After introducing Lagrange multipliers, the Lagrange function becomes:

\frac{1}{2} {| | β | |}^{2} + κ \sum_{i = 1}^{n} \frac{R_{i}}{π_{i}} ξ_{i} - \sum_{i = 1}^{n} α_{i} \{A_{i} (X_{i}^{T} β + β_{0}) - (1 - ξ_{i})\} - \sum_{i = 1}^{n} μ_{i} ξ_{i},

with α_i ≥ 0, μ_i ≥ 0. Taking derivatives with respect to (β, β₀) and ξ_i, we have $β = \sum_{i = 1}^{n} α_{i} A_{i} X_{i}, 0 = \sum_{i = 1}^{n} α_{i} A_{i}$ and α_i = κR_i/π_i − μ_i. Plugging these equations into the Lagrange function, we obtain the dual problem

max_{α} \sum_{i = 1}^{n} α_{i} - \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} α_{i} α_{j} A_{i} A_{j} 〈 X_{i}, X_{j} 〉

subject to 0 ≤ α_i ≤ κR_i/π_i, i = 1, …, n, and $\sum_{i = 1}^{n} α_{i} A_{i} = 0$ . Quadratic programming algorithms from many widely available software packages can be used to solve this dual problem. Finally, we obtain that

\hat{β} = \sum_{{\hat{α}}_{i} > 0} {\hat{α}}_{i} A_{i} X_{i},

and β̂₀ can be solved using the margin points (0 < α̂_i, ξ̂_i = 0) subject to the Karush-Kuhn-Tucker conditions (Page 421, Hastie, Tibshirani & Friedman 2009). The decision rule is given by sign{〈β̂, X〉 + β̂₀}. Similar to the traditional SVM, the estimated decision rule is determined by the support vectors with α̂ > 0.
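For readers who want to experiment without a quadratic programming solver, the sketch below fits a linear rule by full-batch subgradient descent on the primal objective (2.2). This is a swapped-in technique, not the dual quadratic program described above; the function name `fit_linear_owl`, the step-size schedule, and the toy data are assumptions chosen for illustration. Any weighted SVM solver that accepts per-observation weights could be used instead.

```python
import random

def fit_linear_owl(X, A, R, lam=0.01, pi=0.5, n_iter=300):
    """Approximately minimise (2.2) for f(x) = <beta, x> + beta0."""
    n, d = len(X), len(X[0])
    w = [r / (pi if a == 1 else 1.0 - pi) for a, r in zip(A, R)]  # R_i / pi_i
    beta, beta0 = [0.0] * d, 0.0
    for t in range(n_iter):
        grad = [2.0 * lam * b for b in beta]   # gradient of lam * ||beta||^2
        grad0 = 0.0
        for x, a, wi in zip(X, A, w):
            f_x = sum(b * xj for b, xj in zip(beta, x)) + beta0
            if a * f_x < 1.0:                  # hinge term is active
                for j in range(d):
                    grad[j] -= wi * a * x[j] / n
                grad0 -= wi * a / n
        step = 0.5 / (t + 1) ** 0.5            # diminishing step size
        beta = [b - step * g for b, g in zip(beta, grad)]
        beta0 -= step * grad0
    return beta, beta0

# Hypothetical toy trial: the optimal rule is sign(x1); x2 is irrelevant.
random.seed(1)
n = 1000
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(n)]
A = [random.choice([-1, 1]) for _ in range(n)]
raw = [3.0 + 2.0 * a * x[0] + random.gauss(0, 0.3) for x, a in zip(X, A)]
shift = min(raw)
R = [r - shift for r in raw]                   # make rewards nonnegative

beta, beta0 = fit_linear_owl(X, A, R)
rule = lambda x: 1 if beta[0] * x[0] + beta[1] * x[1] + beta0 > 0 else -1
agreement = sum(rule(x) == (1 if x[0] > 0 else -1) for x in X) / n
print(beta, beta0, agreement)
```

Subgradient descent is used here only because it needs no external dependencies; it does not produce the support-vector representation of the rule that the dual solution provides.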

2.4 Nonlinear Decision Rule for Optimal ITR

The previous section targets a linear boundary of prognostic variables. This may not be practically useful since the dimension of the prognostic variables can be quite high and complicated relationships may be involved between the desired treatments and these variables. However, we can easily generalize the previous approach to obtain a nonlinear decision rule for obtaining the optimal ITR.

We let k : 𝒳 × 𝒳 → ℝ, called a kernel function, be continuous, symmetric and positive semidefinite. Given a real-valued kernel function k, we can associate with it a reproducing kernel Hilbert space (RKHS) ℋ_k, which is the completion of the linear span of all functions {k(·, x), x ∈ 𝒳}. The norm in ℋ_k, denoted by ||·||_k, is induced by the following inner product,

{〈 f, g 〉}_{k} = \sum_{i = 1}^{n} \sum_{j = 1}^{m} α_{i} β_{j} k (x_{i}, x_{j}),

for $f (\cdot) = \sum_{i = 1}^{n} α_{i} k (\cdot, x_{i})$ and $g (\cdot) = \sum_{j = 1}^{m} β_{j} k (\cdot, x_{j})$ .

We note that our decision function f(x) is from ℋ_k equipped with norm ||·||_k. Thus since any function in ℋ_k takes the form $\sum_{i = 1}^{m} α_{i} k (\cdot, x_{i})$ , it can be shown that the optimal decision function is given by

\sum_{i = 1}^{n} {\hat{α}}_{i} A_{i} k (X, X_{i}) + {\hat{β}}_{0},

where (α̂₁, …, α̂_n) solves

max_{α} \sum_{i = 1}^{n} α_{i} - \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} α_{i} α_{j} A_{i} A_{j} k (X_{i}, X_{j})

subject to 0 ≤ α_i ≤ κR_i/π_i, i = 1, …, n, and $\sum_{i = 1}^{n} α_{i} A_{i} = 0$ . We note that if we choose k(x, y) = 〈x, y〉, then the obtained rule reduces to the previous linear rule.
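The reduction to the linear rule can be checked numerically. In the sketch below the multipliers `alpha` are arbitrary placeholder values that satisfy the equality constraint; they are not the solution of the dual problem, and the function names are assumptions for illustration.

```python
def linear_kernel(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def kernel_decision(x, alpha, A, X, beta0, kernel):
    """f(x) = sum_i alpha_i * A_i * k(x, X_i) + beta0."""
    return sum(al * a * kernel(x, xi) for al, a, xi in zip(alpha, A, X)) + beta0

# Placeholder inputs (hypothetical, chosen so that sum_i alpha_i * A_i = 0).
X = [[0.2, -0.5], [-0.7, 0.1], [0.9, 0.4]]
A = [1, -1, 1]
alpha = [0.3, 0.5, 0.2]
beta0 = 0.1

# Linear-rule coefficients implied by the same multipliers:
# beta = sum_i alpha_i * A_i * X_i.
beta = [sum(al * a * xi[j] for al, a, xi in zip(alpha, A, X)) for j in range(2)]

x_new = [0.4, -0.3]
f_kernel = kernel_decision(x_new, alpha, A, X, beta0, linear_kernel)
f_linear = linear_kernel(beta, x_new) + beta0
print(f_kernel, f_linear)
```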

3. THEORETICAL RESULTS

In this section, we establish consistency of the optimal ITR estimated using OWL. We further obtain a risk bound for the estimated ITR and show how the bound can be improved for certain specific, realistic situations.

3.1 Notation

For any ITR 𝒟(x) = sign(f(x)) associated with decision function f(x), we define

\mathcal{R} (f) = E [\frac{R}{A π + (1 - A) / 2} I (A \neq sign (f (X)))]

and the minimal risk (called Bayes risk in the learning literature) as ℛ^* = inf_f {ℛ(f) | f: 𝒳 → ℝ}. Thus, for the optimal ITR 𝒟^*(x) = sign(f^*(x)) (called the Bayes classifier in the learning literature), ℛ^* = ℛ(f^*). In terms of the value function, we note that 𝒱(𝒟^*) − 𝒱(𝒟) = ℛ(f) − ℛ(f^*).

In the OWL approach, we substitute 0–1 loss I(A ≠ sign(f(X))) by a surrogate loss, φ(Af(X)), where φ(t) = (1 − t)⁺. Thus we define the φ-risk

\mathcal{R}_{φ} (f) = E [\frac{R}{A π + (1 - A) / 2} φ (A f (X))],

and, similarly, the minimal φ–risk as $\mathcal{R}_{φ}^{*} = {inf}_{f} \{\mathcal{R}_{φ} (f) ∣ f : \mathcal{X} \to \mathbb{R}\}$ .

Recall that the estimated optimal ITR is given by sign(f̂_n(X)), where

{\hat{f}}_{n} = \underset{f \in \mathcal{H}_{k}}{argmin} \{\frac{1}{n} \sum_{i = 1}^{n} \frac{R_{i}}{π_{i}} {(1 - A_{i} f (X_{i}))}^{+} + λ_{n} {| | f | |}_{k}^{2}\} .

(3.1)

3.2 Fisher Consistency

We establish Fisher consistency of the decision function based on surrogate loss φ(t). Specifically, the following result holds:

Proposition 3.1

For any measurable function f, if f̃ minimizes ℛ_φ(f), then 𝒟^*(x) = sign(f̃(x)).

Proof. First, we note

\mathcal{D}^{*} (x) = sign \{E [R ∣ X = x, A = 1] - E [R ∣ X = x, A = - 1]\} .

Next, for each x ∈ 𝒳 and any f with f(x) ∈ [−1, 1] (to which the minimizer may be restricted, since R is nonnegative and truncating f to [−1, 1] therefore never increases the weighted hinge loss),

\begin{array}{l} E (R \frac{φ (A f (X))}{A π + (1 - A) / 2} | X = x) = E (R ∣ A = 1, X = x) (1 - f (x)) + E (R ∣ A = - 1, X = x) (1 + f (x)) \\ = (E (R ∣ A = - 1, X = x) - E (R ∣ A = 1, X = x)) f (x) + E (R ∣ A = - 1, X = x) + E (R ∣ A = 1, X = x) . \end{array}

Therefore, f̃(x), which minimizes ℛ_φ(f), should be positive if E(R|A = 1, X = x) > E(R|A = −1, X = x) and negative if E(R|A = 1, X = x) < E(R|A = −1, X = x). That is, f̃(x) has the same sign as 𝒟^*(x). The result holds.

The proposition is analogous to results for SVM, for example, Lin (2002). It justifies the validity of using φ(t) as the surrogate loss in OWL.
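The pointwise argument in the proof can be checked with a small grid search. The sketch below minimizes the conditional φ-risk at a single x for a few hypothetical pairs of conditional mean rewards and reports the minimizer; the pairs and the grid are illustrative assumptions.

```python
def conditional_phi_risk(f, m_pos, m_neg):
    """E[R * phi(A f) / P(A) | X = x], where m_pos = E[R | A = 1, X = x]
    and m_neg = E[R | A = -1, X = x] are nonnegative."""
    return m_pos * max(1.0 - f, 0.0) + m_neg * max(1.0 + f, 0.0)

def grid_minimizer(m_pos, m_neg, lo=-3.0, hi=3.0, steps=601):
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda f: conditional_phi_risk(f, m_pos, m_neg))

# Hypothetical conditional mean rewards (m_pos, m_neg) at four covariate values.
cases = [(3.0, 1.0), (1.0, 3.0), (0.5, 2.5), (4.0, 0.2)]
minimizers = [grid_minimizer(m_pos, m_neg) for m_pos, m_neg in cases]
for (m_pos, m_neg), f_tilde in zip(cases, minimizers):
    print(m_pos, m_neg, f_tilde)
```

In each case the minimizer should have the sign of m_pos − m_neg, which is the treatment the optimal ITR assigns at that x.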

3.3 Excess Risk for ℛ(f) and ℛ_φ(f)

The following result shows that for any decision function f, the excess risk of f under 0–1 loss is no larger than the excess risk of f under the hinge loss. Thus, the loss of the value function due to the ITR associated with f can be bounded by the excess risk under the hinge loss. The proof of the theorem can be found in the Appendix.

Theorem 3.2

For any measurable f: 𝒳 → ℝ and any probability distribution for (X, A, R),

\mathcal{R} (f) - \mathcal{R}^{*} \leq \mathcal{R}_{φ} (f) - \mathcal{R}_{φ}^{*} .

(3.2)

The proof follows the general arguments of Bartlett, Jordan & McAuliffe (2006), in which they bound the risk associated with 0–1 loss in terms of the risk from surrogate loss, utilizing a convexified variational transform of the surrogate loss. In our proof, we extend this concept to our setting by establishing the validity of a weighted version of such a transformation.
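As an informal check of the inequality, the sketch below evaluates both excess risks conditionally at a single covariate value, which corresponds to a distribution that puts all its mass on one x. The conditional mean rewards are drawn at random and are purely illustrative; the function names are assumptions.

```python
import random

def risk_01(f, m_pos, m_neg):
    """Conditional weighted 0-1 risk when the rule is sign(f), with sign(0) = -1."""
    return m_neg if f > 0 else m_pos

def risk_phi(f, m_pos, m_neg):
    """Conditional weighted hinge risk."""
    return m_pos * max(1.0 - f, 0.0) + m_neg * max(1.0 + f, 0.0)

def excess_risks(f, m_pos, m_neg):
    best_01 = min(m_pos, m_neg)            # attained by the optimal treatment
    best_phi = 2.0 * min(m_pos, m_neg)     # attained at f = +1 or f = -1
    return risk_01(f, m_pos, m_neg) - best_01, risk_phi(f, m_pos, m_neg) - best_phi

random.seed(0)
violations = 0
for _ in range(200):
    m_pos, m_neg = random.uniform(0, 5), random.uniform(0, 5)
    for i in range(121):
        f = -3.0 + 0.05 * i
        ex_01, ex_phi = excess_risks(f, m_pos, m_neg)
        if ex_01 > ex_phi + 1e-12:
            violations += 1
print(violations)
```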

3.4 Consistency and Risk Bounds

The purpose of this section is to establish the consistency of f̂_n, and, moreover, to derive the convergence rate of ℛ(f̂_n) − ℛ^*.

First, the following theorem shows that the risk due to f̂_n does converge to ℛ^*, and, equivalently, the value of f̂_n converges to the optimal value function. Results on consistency of the SVM have been shown in the current literature, for example, Zhang (2004). Here we apply empirical process techniques to show that the proposed OWL estimator is consistent. The proof of the theorem is deferred to the Appendix.

Theorem 3.3

Assume that we choose a sequence λ_n > 0 such that λ_n → 0 and nλ_n → ∞. Then for all distributions P, we have that in probability,

lim_{n \to \infty} \{\mathcal{R}_{φ} ({\hat{f}}_{n}) - inf_{f \in {\bar{\mathcal{H}}}_{k}} \mathcal{R}_{φ} (f)\} = 0,

where ℋ̄_k denotes the closure of ℋ_k. Thus, if f^* belongs to the closure of lim sup_{n→∞} ℋ_k, where ℋ_k can potentially depend on n, we have ${lim}_{n \to \infty} \mathcal{R}_{φ} ({\hat{f}}_{n}) = \mathcal{R}_{φ}^{*}$ in probability. It then follows that lim_{n→∞} ℛ(f̂_n) = ℛ^* in probability.

One special situation where f^* belongs to the limit space of ℋ_k is when we choose ℋ_k to be an RKHS with Gaussian kernel and let the kernel bandwidth decrease to zero as n → ∞. This will be shown in Theorem 3.4 below.

We now wish to derive the convergence rate of ℛ(f̂_n) − ℛ^* under certain regularity conditions on the distribution P. Specifically, we need the following “geometric noise” assumption for P (Steinwart & Scovel 2007): Let

η (x) = \frac{E [R ∣ X = x, A = 1] - E [R ∣ X = x, A = - 1]}{E [R ∣ X = x, A = 1] + E [R ∣ X = x, A = - 1]} + 1 / 2,

(3.3)

then {x : 2η(x) − 1 = 0} is the decision boundary for the optimal ITR. We further define 𝒪⁺ = {x ∈ 𝒳 : 2η(x) − 1 > 0}, and 𝒪⁻ = {x ∈ 𝒳 : 2η(x) − 1 < 0}. A distance function to the boundary between 𝒪⁺ and 𝒪⁻ is Δ(x) = d̃(x, 𝒪⁺) if x ∈ 𝒪⁻, Δ(x) = d̃(x, 𝒪⁻) if x ∈ 𝒪⁺ and Δ(x) = 0 otherwise, where d̃(x, 𝒪) denotes the distance of x to a set 𝒪 with respect to the Euclidean norm. Then the distribution P is said to have geometric noise exponent 0 < q < ∞, if there exists a constant C > 0 such that

E [exp (- \frac{Δ {(X)}^{2}}{t}) ∣ 2 η (X) - 1 ∣] \leq C t^{q d / 2}, t > 0.

(3.4)

In some sense, this geometric noise exponent describes the behavior of the distribution in a neighborhood of the decision boundary. It is affected by how fast the density of the distance Δ(X) decays along the boundary. For example, assume the boundary is linear, in which case Δ(x) = |2η(x)−1|. If for the density of Δ(X), defined as f(u), we have f(u) ~ u^p when u is close to 0, then we can show q = (p + 2)/d. Larger p corresponds to a faster decaying rate of the density, resulting in a larger q accordingly. Another example is distinctly separable data, i.e., when |2η(x) − 1| > δ > 0, for some constant δ, and η is continuous, q can be arbitrarily large.

In addition to this specific assumption for P, we also restrict the choice of RKHS to the space associated with Gaussian Radial Basis Function (RBF) kernels, i.e.,

k (x, x^{'}) = exp (- σ_{n}^{2} {| | x - x^{'} | |}^{2}), x, x^{'} \in X,

where σ_n > 0 is a parameter varying with n. The tuning parameter σ_n is related to approximation properties of Gaussian RBF kernels. When σ_n becomes large, only observations in a small neighborhood contribute to the prediction, in which case we obtain a nonlinear decision boundary or even a nonparametric decision rule. If σ_n does not diverge, then points farther away can contribute to the prediction, resulting in a nearly linear boundary. One advantage of using the Gaussian kernel is that we can determine the complexity of ℋ_k in terms of capacity bounds with respect to the empirical L²-norm, defined as
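The role of σ_n as an inverse bandwidth can be seen from a few kernel evaluations. The points and the values of σ in this sketch are arbitrary choices for illustration.

```python
import math

def gaussian_kernel(x, y, sigma):
    """k(x, x') = exp(-sigma^2 * ||x - x'||^2); sigma acts as an inverse bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-(sigma ** 2) * sq_dist)

x = [0.0, 0.0]
near = [0.1, 0.0]   # distance 0.1 from x
far = [1.0, 0.0]    # distance 1.0 from x

kernel_values = {}
for sigma in (0.1, 1.0, 10.0):
    kernel_values[sigma] = (gaussian_kernel(x, near, sigma),
                            gaussian_kernel(x, far, sigma))
    print(sigma, kernel_values[sigma])
```

For small σ both points receive almost the same kernel weight, while for large σ the distant point contributes essentially nothing.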

{| | f - g | |}_{L_{2} (P_{n})} = {(\frac{1}{n} \sum_{i = 1}^{n} {∣ f (X_{i}) - g (X_{i}) ∣}^{2})}^{1 / 2} .

For any ε > 0, the covering number of a functional class ℱ with respect to L₂(P_n), N(ℱ, ε, L₂(P_n)), is the smallest number of L₂(P_n) ε-balls needed to cover ℱ, where an L₂(P_n) ε-ball around a function g ∈ ℱ is the set {f ∈ ℱ : ||f − g||_{L₂(P_n)} < ε}.

Specifically, according to Theorem 2.1 in Steinwart & Scovel (2007), we have that for any ε > 0,

sup_{P_{n}} log N (B_{\mathcal{H}_{k}}, ε, L_{2} (P_{n})) \leq c_{ν, δ, m} σ_{n}^{(1 - ν / 2) (1 + δ) d} ε^{- ν},

(3.5)

where B_{ℋ_k} is the closed unit ball of ℋ_k, and ν and δ are any numbers satisfying 0 < ν ≤ 2 and δ > 0.

Under the above conditions, we obtain the following theorem:

Theorem 3.4

Let P be a distribution of (X, A, R) satisfying condition (3.4) with noise exponent q > 0. Then for any δ > 0, 0 < ν < 2, there exists a constant C (depending on ν, δ, d and π) such that for all τ ≥ 1 and $σ_{n} = λ_{n}^{- 1 / ((q + 1) d)}$ ,

{Pr}^{*} (\mathcal{R} ({\hat{f}}_{n}) \leq \mathcal{R}^{*} + ε) \geq 1 - e^{- τ},

where Pr^* denotes the outer probability for possibly nonmeasurable sets, and

ε = C [{(\frac{1}{λ_{n}})}^{\frac{2}{2 + ν} + \frac{(2 - ν) (1 + δ)}{(2 + ν) (1 + q)}} {(\frac{1}{n})}^{\frac{2}{2 + ν}} + {(\frac{1}{λ_{n}})}^{\frac{q}{q + 1}} \frac{τ}{n} + λ_{n}^{\frac{q}{q + 1}}] .

The first two terms bound the stochastic error, which arises from the variability inherent in a finite sample size and which depends on the complexity of ℋ_k in terms of covering numbers, while the third term controls the approximation error due to using ℋ_k, which depends on both σ_n and the noise behavior in the underlying distribution. We expect better approximation properties when the RKHS is more complex, but, conversely, we also expect larger stochastic variability. Using the above expression, an optimal choice of λ_n that balances bias and variance is given by

λ_{n} = n^{- \frac{2 (1 + q)}{(4 + ν) q + 2 + (2 - ν) (1 + δ)}},

so the optimal rate for the risk is

\mathcal{R} ({\hat{f}}_{n}) - \mathcal{R}^{*} = O_{p} (n^{- \frac{2 q}{(4 + ν) q + 2 + (2 - ν) (1 + δ)}}) .

In particular, when data are well separated, q can be sufficiently large and we can let (δ, ν) be sufficiently small. Then the convergence rate almost achieves the rate n^{−1/2}. However, if the marginal distribution of X has continuous density along the boundary, it can be calculated that q = 2/d. In this case, substituting q = 2/d into the rate above with (δ, ν) close to zero gives a convergence rate of approximately n^{−1/(d+2)}. Clearly, the speed of convergence is slower with larger dimension of the prognostic variable space.
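The exponents in the choice of λ_n and in the resulting rate are easy to tabulate. The sketch below simply evaluates the two displayed formulas; the function names and the particular values of q, ν, δ and d are assumptions chosen for illustration.

```python
def _denominator(q, nu, delta):
    return (4.0 + nu) * q + 2.0 + (2.0 - nu) * (1.0 + delta)

def lambda_exponent(q, nu, delta):
    """Exponent a in the balancing choice lambda_n = n^(-a)."""
    return 2.0 * (1.0 + q) / _denominator(q, nu, delta)

def rate_exponent(q, nu, delta):
    """Exponent b in the rate O_p(n^(-b)) for the excess risk."""
    return 2.0 * q / _denominator(q, nu, delta)

# Well-separated data: q large, (delta, nu) close to zero.
well_separated = rate_exponent(q=1e6, nu=1e-6, delta=1e-6)

# Continuous density along the boundary: q = 2/d, for several dimensions d.
by_dimension = {d: rate_exponent(q=2.0 / d, nu=1e-6, delta=1e-6)
                for d in (1, 2, 5, 10, 50)}

print(well_separated)
for d, b in by_dimension.items():
    print(d, b)
```

The rate exponent equals the λ_n exponent multiplied by q/(q + 1), which reflects that the approximation-error term λ_n^{q/(q+1)} drives the rate once the terms are balanced.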

To prove Theorem 3.4, we note that according to Theorem 3.2, it suffices to prove the result for the excess φ risk. We also use the fact that

\mathcal{R}_{φ} ({\hat{f}}_{n}) - \mathcal{R}_{φ}^{*} = \mathcal{R}_{φ} ({\hat{f}}_{n}) - inf_{f \in \mathcal{H}_{k}} \mathcal{R}_{φ} (f) + inf_{f \in \mathcal{H}_{k}} \mathcal{R}_{φ} (f) - \mathcal{R}_{φ}^{*} .

We will then bound the first difference on the right-hand side using the empirical counterpart plus the stochastic variability due to the finite sample approximation. The latter can be controlled using large deviation results from empirical processes and some preliminary bound for ||f̂_n||_k. The second difference on the right-hand side will be bounded by using the approximation property of the RKHS and the geometric noise assumption of the underlying distribution P. The proof is adapted from Vert & Vert (2006) and Steinwart & Scovel (2007), with the weights in the loss function taken into consideration. The details are provided in the Appendix.

3.5 Improved Rate with Data Completely Separated

In this section, we show that a faster convergence rate can be obtained if the data are completely separated. We assume

(A1) ∀x ∈ 𝒳, |η(x) − 1/2| ≥ η₀, where η(x) is defined in (3.3), and η is continuous.
(A2) ∀x ∈ 𝒳, min(η(x), 1 − η(x)) ≥ η₁.

Assumption (A1) can be referred to as a “low noise” condition equivalent to |E(R|A = 1, X) − E(R|A = −1, X)| ≥ η₀. Thus, a jump of η(x) at the level of 1/2 requires a gap between the rewards gained from treatment 1 and −1 on the same patient. This assumption is an adaptation of the noise condition used in classical SVM to obtain fast learning rates and it is essentially equivalent to one of the conditions in Blanchard et al. (2008).

Theorem 3.5

Assume that (A1) and (A2) are satisfied. For any ν ∈ (0, 1) and q ∈ (0, ∞), let λ_n = O(n^{−1/(ν+1)}) and $σ_{n} = λ_{n}^{- 1 / ((q + 1) d)}$ . Then

\mathcal{R} ({\hat{f}}_{n}) - \mathcal{R}^{*} = O_{p} (n^{- \frac{1}{ν + 1} \frac{q}{q + 1}}) .

We can let q go to ∞ and ν go to zero, and this theorem shows that the convergence rate for ℛ(f̂_n) − ℛ(f^*) is almost n⁻¹, a much faster rate compared to what was given in Theorem 3.4. This result is similar to results for SVM described in Tsybakov (2004), Steinwart & Scovel (2007), and Blanchard et al. (2008).

To prove Theorem 3.5, we can rewrite the minimization problem in (3.1) as:

min_{S \in \mathbb{R}^{+}} \{min_{f : {| | f | |}_{k} \leq S} \frac{1}{n} \sum_{i = 1}^{n} \frac{R_{i}}{π_{i}} {(1 - A_{i} f (X_{i}))}^{+} + λ_{n} S^{2}\} .

Thus the problem can be viewed in the model selection framework: the collection of models consists of balls in ℋ_k, and for each model, we solve the penalized empirical φ-risk minimization to obtain an estimator f̂_n. We can utilize a result for model selection, presented in Theorem 4.3 of Blanchard et al. (2008), to choose the model which yields the minimal penalized empirical φ-risk among all the models. We need to verify the conditions required for the theorem based on the weighted hinge loss and the condition on the covering number of the functional class B_{ℋ_k} with respect to L₂(P_n), i.e., condition (3.5). Proof details are provided in the Appendix.

4. SIMULATION STUDY

We have conducted extensive simulations to assess the small-sample performance of the proposed method. In these simulations, we generate 50-dimensional vectors of prognostic variables X₁, …, X₅₀, consisting of independent U [−1, 1] variates. The treatment A is generated from {−1, 1} independently of X with P(A = 1) = 1/2. The response R is normally distributed with mean Q₀ = 1 + 2X₁ + X₂ + 0.5X₃ + T₀(X, A) and standard deviation 1, where T₀(X, A) reflects the interaction between treatment and prognostic variables and is chosen to vary according to the following four different scenarios:

1. T₀(X, A) = 0.442(1 − X₁ − X₂)A.
2. $T_{0} (X, A) = (X_{2} - 0.25 X_{1}^{2} - 1) A$ .
3. $T_{0} (X, A) = (0.5 - X_{1}^{2} - X_{2}^{2}) (X_{1}^{2} + X_{2}^{2} - 0.3) A$ .
4. $T_{0} (X, A) = (1 - X_{1}^{3} + exp (X_{3}^{2} + X_{5}) + 0.6 X_{6} - {(X_{7} + X_{8})}^{2}) A$ .
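The data-generating mechanism described above can be sketched as follows. The function names, the seeding scheme, and the helper that returns the true optimal treatment are assumptions made for illustration and do not reproduce the simulation code used in this study.

```python
import math
import random

def interaction(scenario, x):
    """Coefficient g(x) such that T0(x, a) = g(x) * a in the given scenario."""
    x1, x2, x3, x5, x6, x7, x8 = x[0], x[1], x[2], x[4], x[5], x[6], x[7]
    if scenario == 1:
        return 0.442 * (1 - x1 - x2)
    if scenario == 2:
        return x2 - 0.25 * x1 ** 2 - 1
    if scenario == 3:
        return (0.5 - x1 ** 2 - x2 ** 2) * (x1 ** 2 + x2 ** 2 - 0.3)
    if scenario == 4:
        return 1 - x1 ** 3 + math.exp(x3 ** 2 + x5) + 0.6 * x6 - (x7 + x8) ** 2
    raise ValueError("scenario must be 1, 2, 3 or 4")

def simulate(scenario, n, d=50, seed=0):
    """Draw n observations (X, A, R) with d independent U[-1, 1] covariates."""
    rng = random.Random(seed)
    X, A, R = [], [], []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(d)]
        a = rng.choice([-1, 1])                      # P(A = 1) = 1/2
        mean = 1 + 2 * x[0] + x[1] + 0.5 * x[2] + interaction(scenario, x) * a
        X.append(x)
        A.append(a)
        R.append(rng.gauss(mean, 1.0))               # standard deviation 1
    return X, A, R

def optimal_treatment(scenario, x):
    """Treatment with the larger conditional mean reward, i.e. sign of g(x)."""
    return 1 if interaction(scenario, x) > 0 else -1

X, A, R = simulate(scenario=3, n=100)
print(len(X), len(X[0]), optimal_treatment(3, X[0]))
```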

The decision boundaries in the first three scenarios are determined by X₁ and X₂. Scenario 1 corresponds to a truly linear decision boundary, whereas the boundary in Scenario 2 is a parabola. The third is a ring example, where patients on the ring are assigned to one treatment and patients inside or outside the ring are assigned to the other. The decision boundary in the fourth example is fairly nonlinear in the covariates and depends on covariates other than X₁ and X₂. For each scenario, we estimate the optimal ITR by applying OWL. We use the Gaussian kernel in the weighted SVM algorithm. There are two tuning parameters: λ_n, the penalty parameter, and σ_n, the inverse bandwidth of the kernel. Since λ_n controls the severity of the penalty on the functions and σ_n determines the complexity of the function class utilized, σ_n should be chosen adaptively from the data simultaneously with λ_n. To illustrate this, Figure 1 shows the contours of the value function for the first scenario with different combinations of (λ_n, σ_n) when n = 30. We can see that λ_n interacts with σ_n, with larger λ_n generally coupled with smaller σ_n for equivalent value function levels. In our simulations, we apply a 5-fold cross validation procedure, in which we search over a pre-specified finite set of (λ_n, σ_n) to select the pair maximizing the average of the estimated values from the validation data. In case of tied values for parameter pair choices, we first choose the set of pairs with smallest λ_n and then select the one with largest σ_n.
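The selection step, including the tie-breaking rule, can be sketched as below. The input is assumed to be a mapping from each candidate pair (λ_n, σ_n) to its average estimated value over the validation folds; the function name, the use of exact equality to detect ties, and the example numbers are illustrative assumptions.

```python
def select_tuning_pair(cv_values):
    """Pick the (lam, sigma) pair with the largest cross-validated value.

    Ties are broken by taking the smallest lam first, then the largest sigma.
    """
    best = max(cv_values.values())
    tied = [pair for pair, value in cv_values.items() if value == best]
    smallest_lam = min(lam for lam, _ in tied)
    candidates = [pair for pair in tied if pair[0] == smallest_lam]
    return max(candidates, key=lambda pair: pair[1])

# Hypothetical cross-validation results: three pairs tie at the best value.
cv_values = {
    (0.1, 1.0): 2.0,
    (0.1, 5.0): 2.0,
    (1.0, 10.0): 2.0,
    (1.0, 1.0): 1.5,
}
chosen = select_tuning_pair(cv_values)
print(chosen)
```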

Figure 1. Contour Plots of Value Function for Example 1 with λ_n ∈ (0, 10) and σ_n ∈ (0, 10)

Additionally, comparison is made among the following four methods:

1. the proposed OWL using Gaussian kernel (OWL-Gaussian)
2. the proposed OWL using linear kernel (OWL-Linear)
3. the l₁ penalized least squares method (l₁-PLS) developed by Qian & Murphy (2011), which approximates E(R|X, A) using the basis function set (1, X, A, XA) and applies the LASSO method for variable selection, and
4. ordinary least squares method (OLS), which estimates the conditional mean response using the same basis function set as in 3 but without variable selection.

We consider the OWL with linear kernel (method 2) mainly to assess the impact of different kernels in the weighted SVM algorithm. In this case, there is only one tuning parameter, λ_n, which can be chosen to maximize the value function in a cross-validation procedure. The selection of the tuning parameters in the l₁-PLS approach follows similarly. The last two approaches estimate the optimal ITR using the sign of the difference between the predicted E(R|X, A = 1) and the predicted E(R|X, A = −1). In the comparisons, the performances of the four methods are assessed by two criteria: the first is the value function of the estimated optimal ITR when applied to a large independent validation dataset; the second is the misclassification rate of the estimated optimal ITR relative to the true optimal ITR on the validation data. Specifically, a validation set with 10000 observations is simulated to assess the performance of the approaches. The estimated value function using any ITR 𝒟 is given by $P_{n}^{*} [I (A = \mathcal{D} (X)) R / P (A)] / P_{n}^{*} [I (A = \mathcal{D} (X)) / P (A)]$ (Murphy et al. 2001), where $P_{n}^{*}$ denotes the empirical average using the validation data and P(A) is the probability of being assigned treatment A.
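The two evaluation criteria can be sketched as follows. The validation data here come from a hypothetical one-covariate model, not from the four simulation scenarios, and the function names are assumptions for illustration.

```python
import random

def normalized_value(X, A, R, rule, p_a=lambda a: 0.5):
    """Ratio of empirical averages: P_n*[I(A = D(X)) R / P(A)] / P_n*[I(A = D(X)) / P(A)]."""
    num = den = 0.0
    for x, a, r in zip(X, A, R):
        if rule(x) == a:
            num += r / p_a(a)
            den += 1.0 / p_a(a)
    return num / den

def misclassification_rate(X, rule, optimal_rule):
    """Fraction of validation subjects on which the rule differs from the optimal rule."""
    return sum(rule(x) != optimal_rule(x) for x in X) / len(X)

# Hypothetical validation set of 10000 observations with optimal rule sign(x).
random.seed(0)
n = 10000
X = [random.uniform(-1, 1) for _ in range(n)]
A = [random.choice([-1, 1]) for _ in range(n)]
R = [2.0 + a * x + random.gauss(0, 0.1) for x, a in zip(X, A)]

optimal_rule = lambda x: 1 if x > 0 else -1
treat_all = lambda x: 1

value_optimal = normalized_value(X, A, R, optimal_rule)
value_treat_all = normalized_value(X, A, R, treat_all)
error_treat_all = misclassification_rate(X, treat_all, optimal_rule)
print(value_optimal, value_treat_all, error_treat_all)
```

Dividing by the empirical average of I(A = 𝒟(X))/P(A) normalizes the weights so that they sum to one over the subjects whose received treatment matches the rule.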

For each scenario, we vary the sample size of the training data sets over 30, 100, 200, 400 and 800, and repeat the simulation 1,000 times. The simulation results are presented in Figures 2 and 3, where we report the mean square errors (MSE) of both value functions and misclassification rates. Simulations show there are no large differences in the performance if we replace the Gaussian kernel with the linear kernel in the OWL. However, there are examples presenting advantages of the Gaussian kernel, which suggests that under certain circumstances, it is useful to have a flexible nonparametric estimation procedure to identify the optimal ITR for the underlying nonparametric structures. As demonstrated in Figure 2 and Figure 3, the OWL with either Gaussian kernel or linear kernel has better performance, especially for small samples, than the other two methods, from the points of view of producing larger value functions, smaller misclassification rates, and lower variability of the value function estimates. Specifically, when the approximation models used in the l₁-PLS and OLS are correct in the first scenario, the competing methods perform well with large sample size; however, the OWL still provides satisfactory results even if we use a Gaussian kernel. When the optimal ITR is nonlinear in X in the other scenarios, the OWL tends to give higher values and smaller misclassification rates. OLS generally fails unless the sample size is large enough since it encounters severe bias for small sample sizes. This is due to the fact that without variable selection for OLS, there is insufficient data to fit an accurate model with all 50 variables included. We also note that l₁-PLS has comparatively larger MSE, resulting from the high variance of the method, which may be explained by the conflicting goals of maximizing the value function and minimizing the prediction error (Qian & Murphy 2011). Note that a richer class of basis functions can be used for fitting the regression models. We have tried a polynomial basis and a wavelet basis to see if they could improve the performance. However, as a larger set of basis functions enters the model, we need to take into account higher dimensional interactions which do not necessarily yield better results. Also, we noted that higher variability is introduced with a richer basis for the approximation space (results not shown).

Figure 2. MSE for Value Functions of Individualized Treatment Rules

Figure 3. MSE for Misclassification Rates of Individualized Treatment Rules

Additional simulations are performed by generating binary outcomes from a logit model. It turns out the OWL procedures outperform a traditional logistic regression procedure (results not shown). Finally, using empirical results, we also verify that the cross validation procedure can indeed identify the optimal pairs (λ_n, σ_n) with the order desired by the theoretical results, i.e., $σ_{n} = λ_{n}^{- 1 / (q + 1) d}$ . The numerical results indicate that log₂ σ_n is linear in log₂ λ_n and the ratio between the slopes is close to the reciprocal ratio between the dimensions of the covariate spaces.
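The log-linear relationship can be made concrete with a short calculation. The sketch below assumes the theoretical order σ_n = λ_n^{−1/((q+1)d)} used in the proofs, reading the exponent as −1 divided by the product (q + 1)d; the function names and the values of q and d are hypothetical.

```python
import math


def sigma_from_lambda(lam, q, d):
    """Theoretical order sigma_n = lam ** (-1 / ((q + 1) * d))."""
    return lam ** (-1.0 / ((q + 1.0) * d))


def log2_slope(q, d, lam_a=2.0 ** -4, lam_b=2.0 ** -8):
    """Slope of log2(sigma_n) against log2(lambda_n) between two lambda values."""
    rise = (math.log2(sigma_from_lambda(lam_b, q, d))
            - math.log2(sigma_from_lambda(lam_a, q, d)))
    run = math.log2(lam_b) - math.log2(lam_a)
    return rise / run


# Hypothetical settings: the same q with covariate dimensions 5 and 10.
slope_d5 = log2_slope(q=1.0, d=5)
slope_d10 = log2_slope(q=1.0, d=10)
slope_ratio = slope_d5 / slope_d10
```

With q held fixed, the slope is −1/((q + 1)d), so the ratio of the slopes for dimensions 5 and 10 is 10/5 = 2, the reciprocal of the ratio of the dimensions.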

5. DATA ANALYSIS

We apply the proposed method to analyze real data from the Nefazodone-CBASP clinical trial (Keller et al. 2000). The study randomized 681 outpatients with non-psychotic chronic major depressive disorder (MDD), in a 1:1:1 ratio to either Nefazodone, Cognitive Behavioral-Analysis System of Psychotherapy (CBASP) or the combination of Nefazodone and CBASP. The score on the 24-item Hamilton Rating Scale for Depression (HRSD) was the primary outcome, where higher scores indicate more severe depression. After excluding some patients with missing observations, we use a subset with 647 patients for analysis. Among them, 216, 220 and 211 patients were assigned to Nefazodone, CBASP and the combined treatment group respectively. Overall comparisons using t-tests show that the combination treatment had significant advantages over the other treatments with respect to HRSD scores obtained at the end of the trial, while there are no significant differences between the nefazodone group and the psychotherapy group.

To estimate the optimal ITR, we perform pairwise comparisons between all combinations of two treatment arms, and, for each two-arm comparison, we apply the OWL approach. We only present the results from the Gaussian kernel, since the analysis shows a similarity with that of the linear kernel. Rewards used in the analyses are reversed HRSD scores and the prognostic variables X consist of 50 pretreatment variables. The results based on OWL are compared to results obtained using the l₁-PLS and OLS methods which use (1, X, A, XA) in their regression models. For comparison between methods, we calculate the value function from a cross-validation type analysis. Specifically, the data is partitioned into 5 roughly equal-sized parts. We perform the analysis on 4 parts of the data, and obtain the estimated optimal ITRs using different methods. We then compute the estimated value functions using the remaining fifth part. The value functions calculated this way should better represent expected value functions for future subjects, as compared to calculating value functions based on the training data. The averages of the cross-validation value functions from the three methods are presented in Table 1.
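The cross-validation type analysis can be sketched as below. This is an illustration under stated assumptions: `fit_rule` stands in for any of the three methods, the placeholder learner (recommend the arm with the larger training mean to everyone) is not one of the methods compared in the paper, the randomization probability is taken as 0.5 for a two-arm comparison, and the data are synthetic.

```python
import random


def ipw_value(rule, data, prop=0.5):
    """Ratio estimator of the value on held-out (x, a, r) triples."""
    num = den = 0.0
    for x, a, r in data:
        if rule(x) == a:
            num += r / prop
            den += 1.0 / prop
    return num / den if den > 0 else float("nan")


def cross_validated_value(data, fit_rule, n_folds=5, seed=None):
    """Fit on n_folds - 1 parts, evaluate on the remaining part, and average."""
    idx = list(range(len(data)))
    if seed is not None:
        random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    values = []
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        test = [data[i] for i in fold]
        values.append(ipw_value(fit_rule(train), test))
    return sum(values) / n_folds


def fit_best_single_arm(train):
    """Placeholder learner (hypothetical): give everyone the better arm."""
    means = {}
    for arm in (1, -1):
        rs = [r for _, a, r in train if a == arm]
        means[arm] = sum(rs) / len(rs)
    best = 1 if means[1] >= means[-1] else -1
    return lambda x: best


# Synthetic data: arm 1 always yields reward 2, arm -1 yields reward 1.
data = [(float(i), 1 if i % 2 == 0 else -1, 2.0 if i % 2 == 0 else 1.0)
        for i in range(50)]
cv_value = cross_validated_value(data, fit_best_single_arm)
```

Evaluating each fitted rule only on the part it did not see is what makes the averaged value a more honest estimate for future subjects than a value computed on the training data.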

Table 1.

Mean Depression Scores (the Smaller, the Better) from Cross Validation Procedure with Different Methods

	OLS	l₁-PLS	OWL

Nefazodone vs CBASP	15.87	15.95	15.74
Combination vs Nefazodone	11.75	11.28	10.71
Combination vs CBASP	12.22	10.97	10.86

From the table, we observe that OLS produces smaller value functions (corresponding to larger HRSD in the table) than the other two methods, possibly because of the high dimensional prognostic variable space. OWL performs similarly to l₁-PLS, but gives a 5% larger value function than l₁-PLS when comparing the Combination arm to the Nefazodone arm. In fact, when comparing combination treatment with nefazodone only, OWL recommends the combination treatment to all the patients in the validation data in each round of the cross validation procedure; OLS assigns the combination treatment to around 70% of the patients in each validation subset; while l₁-PLS recommends the combination to all the patients in three out of five validation sets, and to 7% and 28% of the patients in the other two, indicating a very large variability. If we need to select treatment between combination and psychotherapy alone, the OWL approach recommends the combination treatment for all patients in the validation process. In contrast, l₁-PLS chooses psychotherapy for 10 out of 86 patients in one round of validation, and recommends the combination for all patients in the other rounds. The percentages of patients who are recommended the combination treatment range from 66% to 85% across the five validation data sets when applying OLS. When the two single treatments are studied, there are only negligible differences in the estimated value functions from the three methods and the selection results also indicate an insignificant difference between them. Thus OWL not only yields ITRs with the best clinical outcomes, but the ITRs also have the lowest variability compared to the other methods.

6. DISCUSSION

The proposed OWL procedure appears to be more effective, across a broad range of possible forms of the interaction between prognostic variables and treatment, compared to previous methods. A two-stage procedure is likely to overfit the regression model, and thus cause trouble for value function approximation. The OWL provides a nonparametric approach which sidesteps the inversion of the predicted model required in other methods and benefits from directly maximizing the value function. The convergence rates for the OWL, aiming to identify the best ITR, nearly reach the optimal for the nonparametric SVM with the same type of assumptions on the separations. The rates, however, are not directly comparable to Qian & Murphy’s (2011), because we allow for complex multivariate interactions and formulate the problem in a nonparametric framework. The proposed estimator leads to consistency and fast rate results, but it is not necessarily the most efficient approach. In some cases when we have knowledge of the specific parametric form, a likelihood based method may be more efficient and aid in the improvement of the estimation. Other possible surrogate loss functions, for example, the negative log-likelihood for logistic regression, can also be useful for finding the desired optimal individualized treatment rules.

Several improvements and extensions are important to consider. An important extension we are currently pursuing is to right-censored clinical outcomes. Another extension involves alleviating potential challenges arising from high dimensional prognostic variables. Recall that the proposed OWL is based on a weighted SVM which minimizes the weighted hinge loss function subject to an l₂ penalty. If the dimension of the covariate space is sufficiently large, not all the variables would be essential for optimal ITR construction. By eliminating the unimportant variables from the rule, we could simplify interpretations and reduce health care costs by only requiring collection of a small number of significant prognostic variables. For standard SVM, the l₁ penalty has been shown to be effective in selecting relevant variables via shrinking small coefficients to zero (Bradley & Mangasarian 1998; Zhu et al. 2003). It outperforms the l₂ penalty when there are many noisy variables and sparse models are preferred. Other forms of penalty have been proposed such as the F_∞ norm (Zou & Yuan 2008) and the adaptive l_q penalty (Liu et al. 2007). In the future, we will examine use of these sparse penalties in the OWL method.
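To make the role of the penalty concrete, the sketch below evaluates the empirical outcome-weighted hinge objective for a linear decision function f(x) = β₀ + βᵀx, with either the l₂ penalty used by the proposed method or an l₁ penalty of the kind discussed above as a possible extension. It only evaluates the objective and does not minimize it; the function name, argument names, and toy data are illustrative assumptions.

```python
def owl_objective(beta0, beta, xs, treatments, rewards, props, lam, penalty="l2"):
    """Average of (R_i / pi_i) * (1 - A_i f(X_i))^+ plus lam times a penalty.

    props[i] is the probability of the treatment actually received by
    subject i. The "l1" option is a hypothetical sparse variant.
    """
    n = len(xs)
    loss = 0.0
    for x, a, r, p in zip(xs, treatments, rewards, props):
        f = beta0 + sum(b * xi for b, xi in zip(beta, x))
        loss += (r / p) * max(0.0, 1.0 - a * f)
    if penalty == "l2":
        pen = sum(b * b for b in beta)
    elif penalty == "l1":
        pen = sum(abs(b) for b in beta)
    else:
        raise ValueError("penalty must be 'l2' or 'l1'")
    return loss / n + lam * pen


# Toy data (hypothetical): two subjects, two covariates, 1:1 randomization.
xs = [[1.0, 0.0], [0.0, 1.0]]
treatments = [1, -1]
rewards = [2.0, 1.0]
props = [0.5, 0.5]

obj_l2 = owl_objective(0.0, [0.5, 0.5], xs, treatments, rewards, props, lam=0.1)
obj_l1 = owl_objective(0.0, [0.5, 0.5], xs, treatments, rewards, props, lam=0.1,
                       penalty="l1")
```

The data-fit term is identical under both penalties; only the way coefficient size is charged differs, which is what allows an l₁-type penalty to shrink small coefficients exactly to zero.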

In this paper, we only considered binary options for treatment. When there are more than two treatment classes, although we could do a series of pairwise comparisons as done in Section 5 above, this approach may not be optimal in terms of identifying the best rule considering all treatments simultaneously. It would thus be worthwhile to extend the OWL approach to settings involving three or more treatments. The case of multicategory SVM has been studied recently (Lee, Lin & Wahba 2004; Wang & Shen 2006), and a similar generalization may be possible for finding ITRs involving three or more treatments. Another setting to consider is optimal ITR discovery for continuous treatments such as, for example, a continuous range of dose levels. In this situation, we could potentially utilize ideas underlying support vector regression (Vapnik 1995), where the goal is to find a function that has at most ε deviation from the response. Using a similar rationale as the proposed OWL, we could develop corresponding procedures for continuous treatment spaces through weighting each subject by his/her clinical outcome.

Obtaining inference for individualized treatment regimens is also important and challenging. Due to high heterogeneities among individuals, there may be large variations in the estimated treatment rules across different training sets. Laber & Murphy (2011) construct an adaptive confidence interval for the test error under the non-regular framework. Confidence intervals for value functions help us determine whether essential differences exist among different decision rules. Thus an important future research topic is to derive the limiting distribution of $\mathcal{V} ({\hat{\mathcal{D}}}_{n}) - \mathcal{V} (\mathcal{D}^{*})$, the difference between the value of the estimated ITR and that of the optimal ITR, and to derive corresponding sample size formulas to aid in the design of personalized medicine clinical trials.

In some complex diseases, dynamic treatment regimes may be more useful than the single-decision treatment rules studied in this paper. Dynamic treatment regimes are customized sequential decision rules for individual patients which can adapt over time to an evolving illness. Recently, this research area has been of great interest in long term management of chronic disease. See, for example, Murphy et al. (2001), Thall, Sung & Estey (2002), Murphy (2003), Robins (2004), Moodie, Richardson & Stephens (2007), Zhao, Zeng, Socinski & Kosorok (2011). Extension of the proposed OWL approach to the dynamic setting would be of great interest.

Acknowledgments

The first, second and fourth authors were partially funded by NCI Grant P01 CA142538.

APPENDIX A. PROOFS

Proof of Theorem 3.2

We consider the case where rewards are discrete. Arguments for the continuous rewards setting follow similarly. Let η_r(x) = p(A = 1|R = r, X = x) and q_r(x) = rp(R = r|X = x). We can write

\begin{array}{l} R (f) = E [\sum_{r} r p (R = r ∣ X) E (\frac{I (A \neq sign (f (X)))}{A π + (1 - A) / 2} | R = r, X)] \\ = E [\sum_{r} q_{r} (X) (\frac{η_{r} (X)}{π} I (sign (f (X)) \neq 1) + \frac{1 - η_{r} (X)}{1 - π} I (sign (f (X)) \neq - 1))] \\ = E [c_{0} (X) (η (X) I (sign (f (X)) \neq 1) + (1 - η (X)) I (sign (f (X)) \neq - 1))], \end{array}

(A.1)

where c₀(x) = Σ_r q_r(x)[η_r(x)/π + (1 − η_r(x))/(1 − π)], and η(x), defined previously in (3.3), is equal to Σ_r q_r(x) η_r(x)/(π c₀(x)). Similarly,

R_{φ} (f) = E [c_{0} (X) (η (X) φ (f (X)) + (1 - η (X)) φ (- f (X)))] .

We define C(η, α) = ηφ(α) + (1 − η)φ (−α). Then the optimal φ-risk satisfies

R_{φ}^{*} = E [c_{0} (X) inf_{α \in R} C (η (X), α)]

and

R_{φ} (f) - R_{φ}^{*} = E [c_{0} (X) (C (η (X), f (X)) - inf_{α \in R} C (η (X), α))] .

By a result in Bartlett et al. (2006) for a convexified transform of hinge loss, we have

∣ 2 η - 1 ∣ = inf_{α : α (2 η - 1) \leq 0} C (η, α) - inf_{α \in R} C (η, α) .

(A.2)
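For the hinge loss φ(t) = (1 − t)⁺, the identity (A.2) states that the gap between the constrained and unconstrained infima of C(η, α) equals |2η − 1|. The sketch below checks this numerically on a grid of α values. It is a sanity check rather than a proof, and the grid range and function names are assumptions of the illustration.

```python
def hinge(t):
    """Hinge loss phi(t) = (1 - t)^+."""
    return max(0.0, 1.0 - t)


def conditional_risk(eta, alpha):
    """C(eta, alpha) = eta * phi(alpha) + (1 - eta) * phi(-alpha)."""
    return eta * hinge(alpha) + (1.0 - eta) * hinge(-alpha)


# Grid on [-3, 3]; it contains -1, 0 and 1, where the infima are attained.
ALPHAS = [i / 100.0 for i in range(-300, 301)]


def excess_gap(eta):
    """Infimum over wrong-signed alpha minus infimum over all alpha (on the grid)."""
    unconstrained = min(conditional_risk(eta, a) for a in ALPHAS)
    constrained = min(conditional_risk(eta, a)
                      for a in ALPHAS if a * (2.0 * eta - 1.0) <= 0)
    return constrained - unconstrained


gaps = {eta: excess_gap(eta) for eta in (0.1, 0.3, 0.5, 0.8, 0.95)}
```

For example, with η = 0.8 the unconstrained infimum is 2 min(η, 1 − η) = 0.4, attained at α = 1, while the constrained infimum is 1, attained at α = 0, so the gap is 0.6.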

Thus, according to (A.1) and (A.2), we have

\begin{array}{l} R (f) - R^{*} \leq E (I (sign (f (X)) \neq sign [c_{0} (X) (η (X) - 1 / 2)]) ∣ c_{0} (X) (2 η (X) - 1) ∣) \\ = E [c_{0} (X) I (sign (f (X)) \neq sign [c_{0} (X) (η (X) - 1 / 2)]) (inf_{α : α (2 η (X) - 1) \leq 0} C (η (X), α) - inf_{α \in R} C (η (X), α))] \\ \leq E [c_{0} (X) (C (η (X), f (X)) - inf_{α \in R} C (η (X), α))] \\ = R_{φ} (f) - R_{φ}^{*} . \end{array}

The last inequality holds because we always have C(η(x), f(x)) ≥ $inf_{α \in R} C (η (x), α)$ on the set where sign(f(x)) = sign[c₀(x)(η(x) − 1/2)] and C(η(x), f(x)) ≥ $inf_{α : α (2 η (x) - 1) \leq 0} C (η (x), α)$ when sign(f(x)) ≠ sign[c₀(x)(η(x) − 1/2)].

Proof of Theorem 3.3

Define L_φ(f) = R φ(Af)/(Aπ + (1 − A)/2). By the definition of f̂_n, we have for any f ∈ H_k,

P_{n} (L_{φ} ({\hat{f}}_{n})) \leq P_{n} (L_{φ} ({\hat{f}}_{n}) + λ_{n} {‖ {\hat{f}}_{n} ‖}^{2}) \leq P_{n} (L_{φ} (f) + λ_{n} {| | f | |}^{2}),

where ℙ_n denotes the empirical measure of the observed data. Thus lim sup_n ℙ_n(L_φ(f̂_n)) ≤ ℙ(L_φ(f)). It leads to lim sup_n ℙ_n(L_φ(f̂_n)) ≤ $inf_{f \in H_{k}}$ ℙ(L_φ(f)). Theorem 3.3 holds if we can show ℙ_n(L_φ(f̂_n)) − ℙ(L_φ(f̂_n)) → 0 in probability.

To this end, we first obtain a bound for ${| | {\hat{f}}_{n} | |}_{k}^{2}$ . Since $P_{n} (L_{φ} ({\hat{f}}_{n})) + λ_{n} {| | {\hat{f}}_{n} | |}_{k}^{2} \leq P_{n} (L_{φ} (f)) + λ_{n} {| | f | |}_{k}^{2}$ for any f ∈ H_k, we can select f = 0 to obtain

{| | {\hat{f}}_{n} | |}_{k}^{2} \leq \frac{1}{λ_{n}} \frac{1}{n} \sum \frac{R_{i}}{π_{i}} φ (0) \leq \frac{2}{λ_{n}} \frac{E (R)}{min {π, 1 - π}} .

Let M = 2E(R)/min{π, 1 − π} so that the H_k norm of $\sqrt{λ_{n}} {\hat{f}}_{n} (X)$ is bounded by $\sqrt{M}$ . Note that the class { $\sqrt{λ_{n}} f : {| | \sqrt{λ_{n}} f | |}_{k} \leq \sqrt{M}$ } is contained in a Donsker class. Thus, ${\sqrt{λ_{n}} L_{φ} (f), {| | \sqrt{λ_{n}} f | |}_{k} \leq \sqrt{M}}$ is also P-Donsker because (1 − Af(X))⁺ is Lipschitz continuous with respect to f. Therefore,

\sqrt{n} (P_{n} - P) L_{φ} ({\hat{f}}_{n}) = \sqrt{λ_{n}^{- 1}} \sqrt{n} (P_{n} - P) [\frac{R}{A π + (1 - A) / 2} {(\sqrt{λ_{n}} - A \sqrt{λ_{n}} {\hat{f}}_{n} (X))}^{+}] = O_{p} (\sqrt{λ_{n}^{- 1}}) .

Consequently, from nλ_n → ∞, ℙ_n(L_φ(f̂_n)) − ℙ(L_φ(f̂_n)) → 0 in probability.

Proof of Theorem 3.4

First, we have

\begin{array}{l} R_{φ} ({\hat{f}}_{n}) - R_{φ}^{*} \leq λ_{n} {| | {\hat{f}}_{n} | |}_{k}^{2} + R_{φ} ({\hat{f}}_{n}) - R_{φ}^{*} \\ \leq [λ_{n} {| | {\hat{f}}_{n} | |}_{k}^{2} + R_{φ} ({\hat{f}}_{n}) - inf_{f \in H_{k}} (λ_{n} {| | f | |}_{k}^{2} + R_{φ} (f))] + [inf_{f \in H_{k}} (λ_{n} {| | f | |}_{k}^{2} + R_{φ} (f) - R_{φ}^{*})] . \end{array}

(A.3)

We will bound each term on the right-hand-side separately in the following arguments.

For the second term on the right-hand-side of (A.3), we use Theorem 2.7 in Steinwart & Scovel (2007) to conclude that

inf_{f \in H_{k}} (λ_{n} {| | f | |}_{k}^{2} + R_{φ} (f) - R_{φ}^{*}) = O (λ_{n}^{q / (q + 1)}),

(A.4)

when we set $σ_{n} = λ_{n}^{- 1 / (q + 1) d}$ .

Now we proceed to obtain a bound for the first term on the right-hand-side of (A.3). To do this, we need the useful Theorem 5.6 of Steinwart & Scovel (2007) presented below: Theorem 5.6, Steinwart & Scovel (2007). Let F be a convex set of bounded measurable functions from Z to ℝ and let L : F × Z → [0, ∞) be a convex and line-continuous loss function. For a probability measure P on Z we define

G : = {L \circ f - L \circ f_{P, F} : f \in F} .

Suppose that there are constants c ≥ 0, 0 < α ≤ 1, δ ≥ 0 and B > 0 with E_P g² ≤ c(E_P g)^α + δ and ||g||_∞ ≤ B for all g ∈ G. Furthermore, assume that G is separable with respect to || · ||_∞ and that there are constants a ≥ 1 and 0 < p < 2 with

sup_{T \in Z^{n}} log N (B^{- 1} G, ε, L_{2} (T)) \leq a ε^{- p}

for all ε > 0. Then there exists a constant c_p > 0 depending only on p such that for all n ≥ 1 and all τ ≥ 1 we have

{P r}^{*} (T \in Z^{n} : R_{L, P} (f_{T, F}) > R_{L, P} (f_{P, F}) + c_{p} ε (n, a, B, c, δ, τ)) \leq e^{- τ},

where

ε (n, a, B, c, δ, τ) : = B^{2 p / (4 - 2 α + α p)} c^{(2 - p) / (4 - 2 α + α p)} {(\frac{a}{n})}^{2 / (4 - 2 α + α p)} + B^{p / 2} δ^{(2 - p) / 4} {(\frac{a}{n})}^{1 / 2} + B {(\frac{a}{n})}^{2 / (2 + p)} + \sqrt{\frac{δ τ}{n}} + {(\frac{c τ}{n})}^{1 / (2 - α)} + \frac{B τ}{n} .

In their paper, f_{P,F} ∈ F is a minimizer of R_{L,P}(f) = E(L(f, z)), and f_{T,F} is similarly defined when T is an empirical measure. To use this theorem, we define F, Z, T, f_{P,F}, and f_{T,F} according to our setting. It suffices to consider the subspace of H_k, denoted by $B_{H_{k}} (\sqrt{M / λ_{n}})$ , as the ball of H_k of radius $\sqrt{M / λ_{n}}$ . Specifically, we let F be $B_{H_{k}} (\sqrt{M / λ_{n}})$ and Z be the space of the observed data (X, A, R). The loss function we consider here is $L_{φ} (f) + λ_{n} {| | f | |}_{k}^{2}$ and G is the function class

G_{φ, λ_{n}} = {L_{φ} (f) + λ_{n} {| | f | |}_{k}^{2} - L_{φ} (f_{φ, λ_{n}}^{*}) - λ_{n} {| | f_{φ, λ_{n}}^{*} | |}_{k}^{2} : f \in B_{H_{k}} (\sqrt{M / λ_{n}})},

where $f_{φ, λ_{n}}^{*} = {argmin}_{f \in B_{H_{k}} (\sqrt{M / λ_{n}})} (λ_{n} {| | f | |}_{k}^{2} + R_{φ} (f))$ . f_{P,F} and f_{T,F} correspond to $f_{φ, λ_{n}}^{*}$ and f̂_n, respectively. Therefore, to apply this theorem, we will show that there are constants c ≥ 0 and B > 0, which can possibly depend on n, such that E(g²) ≤ cE(g) and ||g||_∞ ≤ B, ∀g ∈ $G_{φ, λ_{n}}$ . Moreover, there are constants c̃ and 0 < ν < 2 with

sup_{P_{n}} log N (B^{- 1} G_{φ, λ_{n}}, ε, L_{2} (P_{n})) \leq \tilde{c} ε^{- ν},

for all ε > 0.

Let C_L denote sup{R/min(π, 1 − π)}, which is finite provided that R is bounded. Since the weighted hinge loss is Lipschitz continuous with respect to f, with Lipschitz constant C_L, and since ||f||_∞ ≤ ||f||_k given that k(x, x) ≤ 1, for any g ∈ $G_{φ, λ_{n}}$ , we have

\begin{array}{l} ∣ g ∣ \leq ∣ L_{φ} (f) - L_{φ} (f_{φ, λ_{n}}^{*}) ∣ + λ_{n} ∣ {| | f | |}_{k}^{2} - {| | f_{φ, λ_{n}}^{*} | |}_{k}^{2} ∣ \\ \leq C_{L} ∣ f (x) - f_{φ, λ_{n}}^{*} (x) ∣ + M \\ \leq 2 C_{L} \sqrt{M} λ_{n}^{- 1 / 2} + M . \end{array}

(A.5)

Therefore, we can set $B = 2 C_{L} \sqrt{M} λ_{n}^{- 1 / 2} + M$ .

For any g ∈ $G_{φ, λ_{n}}$ , we have

\begin{array}{l} g (f) \leq ∣ L_{φ} (f) - L_{φ} (f_{φ, λ_{n}}^{*}) ∣ + λ_{n} ∣ {| | f | |}_{k}^{2} - {| | f_{φ, λ_{n}}^{*} | |}_{k}^{2} ∣ \\ \leq C_{L} ∣ f - f_{φ, λ_{n}}^{*} ∣ + λ_{n} {| | f - f_{φ, λ_{n}}^{*} | |}_{k} {| | f + f_{φ, λ_{n}}^{*} | |}_{k} \\ \leq (C_{L} + 2 \sqrt{M λ_{n}}) {| | f - f_{φ, λ_{n}}^{*} | |}_{k} . \end{array}

Squaring both sides and taking expectations yields

E (g^{2}) \leq {(C_{L} + 2 \sqrt{M λ_{n}})}^{2} {| | f - f_{φ, λ_{n}}^{*} | |}_{k}^{2} .

(A.6)

On the other hand, from the convexity of L_φ, we have

\begin{array}{l} \frac{1}{2} (L_{φ} (f) + λ_{n} {| | f | |}_{k}^{2} + L_{φ} (f_{φ, λ_{n}}^{*}) + λ_{n} {| | f_{φ, λ_{n}}^{*} | |}_{k}^{2}) \geq L_{φ} (\frac{f + f_{φ, λ_{n}}^{*}}{2}) + λ_{n} \frac{{| | f | |}_{k}^{2} + {| | f_{φ, λ_{n}}^{*} | |}_{k}^{2}}{2} \\ = L_{φ} (\frac{f + f_{φ, λ_{n}}^{*}}{2}) + λ_{n} {‖ \frac{f + f_{φ, λ_{n}}^{*}}{2} ‖}_{k}^{2} + λ_{n} {‖ \frac{f - f_{φ, λ_{n}}^{*}}{2} ‖}_{k}^{2} \\ \geq L_{φ} (f_{φ, λ_{n}}^{*}) + λ_{n} {| | f_{φ, λ_{n}}^{*} | |}_{k}^{2} + λ_{n} {‖ \frac{f - f_{φ, λ_{n}}^{*}}{2} ‖}_{k}^{2} . \end{array}

Taking expectations on both sides leads to $E (g) \geq λ_{n} {| | f - f_{φ, λ_{n}}^{*} | |}_{k}^{2} / 2$ . Combining this with (A.6), we conclude that E(g²) ≤ cE (g), where

c = \frac{2}{λ_{n}} {(C_{L} + 2 \sqrt{M λ_{n}})}^{2} .

(A.7)

To estimate the bound for $N (B^{- 1} G_{φ, λ_{n}}, ε, L_{2} (P_{n}))$ , we first have

N (B^{- 1} G_{φ, λ_{n}}, ε, L_{2} (P_{n})) = N (B^{- 1} {L_{φ} (f) + λ_{n} {| | f | |}_{k}^{2} : f \in B_{H_{k}} (\sqrt{M / λ_{n}})}, ε, L_{2} (P_{n})) .

From the sub-additivity of the entropy,

log N (B^{- 1} G_{φ, λ_{n}}, 2 ε, L_{2} (P_{n})) \leq log N (B^{- 1} {L_{φ} (f) : f \in B_{H_{k}} (\sqrt{M / λ_{n}})}, ε, L_{2} (P_{n})) + log N ({λ_{n} {| | f | |}_{k}^{2}, f \in B_{H_{k}} (\sqrt{M / λ_{n}})}, ε, L_{2} (P_{n})) .

(A.8)

Using the Lipschitz-continuity of the weighted hinge loss, we now have that if u, $u^{'} \in B^{- 1} {L_{φ} (f) : f \in B_{H_{k}} (\sqrt{M / λ_{n}})}$ with corresponding f, $f^{'} \in B_{H_{k}} (\sqrt{M / λ_{n}})$ , then ||u − u′||_{L₂(P_n)} ≤ B⁻¹C_L|| f − f′||_{L₂(P_n)}, and therefore the first term on the right-hand-side of (A.8) satisfies

\begin{array}{l} log N (B^{- 1} {L_{φ} (f) : f \in B_{H_{k}} (\sqrt{M / λ_{n}})}, ε, L_{2} (P_{n})) \leq log N (B_{H_{k}} (\sqrt{M / λ_{n}}), \frac{B ε}{C_{L}}, L_{2} (P_{n})) \\ \leq log N (B_{H_{k}}, \frac{B ε}{C_{L} \sqrt{M / λ_{n}}}, L_{2} (P_{n})) \\ \leq log N (B_{H_{k}}, 2 ε, L_{2} (P_{n})) . \end{array}

The last inequality follows because $B / C_{L} \sqrt{M / λ_{n}} \geq 2$ . It is trivial to see that for the second term on the right hand side of (A.8),

log N ({λ_{n} {| | f | |}_{k}^{2}, f \in B (\sqrt{M / λ_{n}})}, ε, L_{2} (P_{n})) \leq log (\frac{M}{B ε}) .

Thus,

log N (B^{- 1} G_{φ, λ_{n}}, 2 ε, L_{2} (P_{n})) \leq log N (B_{H_{k}}, 2 ε, L_{2} (P_{n})) + log (\frac{M}{B ε}) .

Using (3.5) and a given choice for B, we obtain for all σ_n > 0, 0 < ν < 2, δ > 0, ε > 0,

sup_{P_{n}} log N (B^{- 1} G_{φ, λ_{n}}, ε, L_{2} (P_{n})) \leq c_{2} σ_{n}^{(1 - ν / 2) (1 + δ) d} ε^{- ν},

where c₂ depends on ν, δ and d.

Consequently, from Theorem 5.6 in Steinwart & Scovel (2007), there exists a constant c_ν > 0 depending only on ν such that for all n ≥ 1 and all τ ≥ 1, we have the bound for the first term

P^{*} (λ_{n} {| | {\hat{f}}_{n} | |}_{k}^{2} + R_{φ} ({\hat{f}}_{n}) > inf_{f \in H_{k}} (λ_{n} {| | f | |}_{k}^{2} + R_{φ} (f)) + c_{ν} ε (n, \tilde{c}, B, c, τ)) \leq e^{- τ},

where

ε (n, \tilde{c}, B, c, τ) = (B + B^{\frac{2 ν}{2 + ν}} c^{\frac{2 - ν}{2 + ν}}) {(\frac{\tilde{c}}{n})}^{\frac{2}{2 + ν}} + (B + c) \frac{τ}{n} .

With B and c as defined in (A.5) and (A.7), $\tilde{c} = c_{2} σ_{n}^{(1 - ν / 2) (1 + δ) d}$ , and $σ_{n} = λ_{n}^{- 1 / (q + 1) d}$ , we obtain

ε (n, \tilde{c}, B, c, τ) = C_{1} {(\frac{1}{λ_{n}})}^{\frac{2}{2 + ν} + \frac{(2 - ν) (1 + δ)}{(2 + ν) (1 + q)}} {(\frac{1}{n})}^{\frac{2}{2 + ν}} + C_{2} {(\frac{1}{λ_{n}})}^{\frac{q}{q + 1}} \frac{τ}{n},

(A.9)

where C₁ and C₂ are constants depending on ν, δ, d, M and π. We complete the proof of Theorem 3.4 by plugging (A.4) and (A.9) into (A.3).
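As a numerical sanity check on the specialization used in this proof, the sketch below codes the general ε(n, a, B, c, δ, τ) of the quoted Theorem 5.6 together with the reduced form ε(n, c̃, B, c, τ), which corresponds to α = 1 and δ = 0 (the case E(g²) ≤ cE(g)) with p = ν and a = c̃. The function names and the numeric inputs are illustrative assumptions.

```python
def epsilon_general(n, a, B, c, delta, tau, alpha, p):
    """General epsilon(n, a, B, c, delta, tau) from the quoted Theorem 5.6."""
    k = 4.0 - 2.0 * alpha + alpha * p
    ratio = a / n
    return (B ** (2.0 * p / k) * c ** ((2.0 - p) / k) * ratio ** (2.0 / k)
            + B ** (p / 2.0) * delta ** ((2.0 - p) / 4.0) * ratio ** 0.5
            + B * ratio ** (2.0 / (2.0 + p))
            + (delta * tau / n) ** 0.5
            + (c * tau / n) ** (1.0 / (2.0 - alpha))
            + B * tau / n)


def epsilon_specialized(n, c_tilde, B, c, tau, nu):
    """Reduced form used in the proof (alpha = 1, delta = 0, p = nu)."""
    lead = B + B ** (2.0 * nu / (2.0 + nu)) * c ** ((2.0 - nu) / (2.0 + nu))
    return lead * (c_tilde / n) ** (2.0 / (2.0 + nu)) + (B + c) * tau / n


# Hypothetical inputs, for illustration only.
args = dict(n=1000, B=2.0, c=5.0, tau=1.5)
general = epsilon_general(a=3.0, delta=0.0, alpha=1.0, p=0.5, **args)
special = epsilon_specialized(c_tilde=3.0, nu=0.5, **args)
```

With α = 1 the exponent denominator 4 − 2α + αp becomes 2 + ν, and with δ = 0 the two terms involving δ vanish, which is how the six-term expression collapses to the two-term form.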

Proof of Theorem 3.5

We apply Theorem 4.3 in Blanchard et al. (2008) on the scaled loss function L̃_φ(f) = L_φ(f)/C_L to obtain the rates in Theorem 3.5. Without loss of generality, we can assume that the Bayes classifier f^* ∈ H_k, since we can always find g ∈ H_k such that $R_{φ} (g) = R_{φ} (f^{*}) = R_{φ}^{*}$ , provided that H_k is dense in C(𝒳). Let 𝒮 be a countable and dense subset of ℝ⁺, and let $B_{H_{k}} (S)$ denote the ball of H_k of radius S. Then $B_{H_{k}} (S)$ , S ∈ 𝒮, is a countable collection of classes of functions. We can then use Theorem 4.3 in Blanchard et al. (2008) after we verify the following conditions (H1)–(H4):

(H1)
∀S ∈ 𝒮, ∀f ∈ $B_{H_{k}} (S)$ , ||L̃_φ(f)||_∞ ≤ b_S, b_S = 1 + S;
(H2)
∀f, f′ ∈ H_k, Var(L̃_φ(f) − L̃_φ(f′)) ≤ d²(f, f′), d(f, f′) = ||f − f′||_{L₂(P)};
(H3)
∀S ∈ 𝒮, ∀f ∈ $B_{H_{k}} (S)$ , d²(f, f^*) ≤ C_S E(L̃_φ(f) − L̃_φ(f^*)), C_S = 2(S/η₁ + 1/η₀);
(H4)
Let
$ξ (x) = \int_{0}^{x} \sqrt{log N (B_{H_{k}}, ε, L_{2} (P_{n}))} d ε .$

We have
$E [sup_{\begin{matrix} f \in B_{H_{k}} (S) \\ d^{2} (f, f^{'}) \leq r \end{matrix}} (P - P_{n}) ({\tilde{L}}_{φ} (f) - {\tilde{L}}_{φ} (f^{'}))] \leq inf_{ϑ > 0} {4 ϑ - \frac{12}{\sqrt{n}} ξ (ϑ) + \frac{12}{\sqrt{n}} ξ (\frac{\sqrt{r}}{\sqrt{2} S})} = ψ_{S} (r) .$

ψ_S, S ∈ 𝒮, is a sequence of sub-root functions, that is, ψ_S is non-negative, nondecreasing, and $ψ_{S} (r) / \sqrt{r}$ is non-increasing for r > 0. Denote x_* as the solution of the equation $ξ (x) = \sqrt{n} x^{2}$ . If $r_{S}^{*}$ denotes the solution of ψ_S(r) = r/C_S, then
$r_{S}^{*} \leq inf_{ϑ > 0} C_{S} {4 ϑ - 12 ξ (ϑ) / \sqrt{n}} + c^{2} C_{S}^{2} x_{*}^{2} .$

Under these conditions, we define for n ∈ ℕ the following quantity:

γ_{n} = inf_{ϑ > 0} {4 ϑ - \frac{12}{\sqrt{n}} ξ (ϑ) + x_{*}^{2} (n)} .

Given that H_k is associated with the Gaussian kernel, we can show that $ξ (x) ⪯ x^{1 - ν}$ for any 0 < ν < 2. Thus, $γ_{n} ⪯ max (n^{- 1 / (2 ν)}, n^{- 1 / (ν + 1)})$ . By the choice of $λ_{n} = O (n^{- 1 / (ν + 1)})$ for any ν ∈ (0, 1), this satisfies

λ_{n} \geq c (γ_{n} + η_{1}^{- 1} \frac{log (τ^{- 1} log n) \lor 1}{n}) .

Therefore, according to Theorem 4.3 in Blanchard et al. (2008), the following bound holds with probability at least 1 − τ, where τ > 0 is a fixed real number:

E ({\tilde{L}}_{φ} ({\hat{f}}_{n})) - E ({\tilde{L}}_{φ} (f^{*})) \leq 2 inf_{f \in H_{k}} [E ({\tilde{L}}_{φ} (f)) - E ({\tilde{L}}_{φ} (f^{*})) + 2 λ_{n} {| | f | |}_{k}^{2}] + 4 λ_{n} (8 + c η_{1} η_{0}^{- 1}) .

The result does not change after we scale back to the original loss L_φ(f). We have shown that ${inf}_{f \in H_{k}} [R_{φ} (f) - R_{φ} (f^{*}) + 2 λ_{n} {| | f | |}_{k}^{2}] = O (λ_{n}^{q / (q + 1)})$ in the proof of Theorem 3.4. Thus

R ({\hat{f}}_{n}) - R^{*} = O_{p} (λ_{n}^{q / (q + 1)}) = O_{p} (n^{- \frac{1}{ν + 1} \frac{q}{q + 1}}) .
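The displayed rate combines the choice λ_n = n^{−1/(ν+1)} with the approximation order λ_n^{q/(q+1)}. The short sketch below simply evaluates the resulting exponent; the function name and the values of ν and q are hypothetical.

```python
def owl_rate_exponent(nu, q):
    """Exponent r such that the excess risk is O_p(n ** (-r)).

    Follows the display above: r = (1 / (nu + 1)) * (q / (q + 1)).
    """
    return (1.0 / (nu + 1.0)) * (q / (q + 1.0))


def excess_risk_order(n, nu, q):
    """Order lambda_n ** (q / (q + 1)) with lambda_n = n ** (-1 / (nu + 1))."""
    lam = n ** (-1.0 / (nu + 1.0))
    return lam ** (q / (q + 1.0))


# Hypothetical values, for illustration only.
r_example = owl_rate_exponent(nu=0.5, q=1.0)
order_example = excess_risk_order(n=10_000, nu=0.5, q=1.0)
```

The exponent increases toward 1 as ν decreases toward 0 and q grows, so the rate approaches n^{−1} in the most favorable case covered by the formula.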

The remainder of the proof is to verify conditions (H1)–(H4).

For condition (H1), ||L̃_φ(f)||_∞ ≤ sup{R/(Aπ + (1 − A)/2)}(1 + S)/C_L ≤ 1 + S for ||f||_k ≤ S.

For condition (H2), let d(f, f′) = ||f − f′||_{L₂(P)}. L_φ(f) is a Lipschitz function with respect to f with Lipschitz constant C_L. Then |L̃_φ(f) − L̃_φ(f′)| ≤ |f(x) − f′(x)|. Hence (H2) is easily satisfied.

For condition (H3), the proof is similar to that of Lemma 6.4 of Blanchard et al. (2008) with C_S = 2(S/η₁ + 1/η₀), where η₀ and η₁ are as defined in Assumptions (A1) and (A2) of Section 3.5.

For condition (H4), we introduce the notation for Rademacher averages: let ε₁,…, ε_n be n i.i.d. Rademacher random variables, independent of (X_i, A_i, R_i), i = 1, …, n. For any measurable real-valued function f, the Rademacher average is defined as $L_{n} f = n^{- 1} \sum_{i = 1}^{n} ε_{i} f (X_{i})$ . Also let $L_{n} (F)$ be the empirical Rademacher complexity of a function class F, $L_{n} (F) = sup_{f \in F} L_{n} f$ .
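The Rademacher quantities defined above can be computed directly for a finite function class. The sketch below evaluates L_n f for given signs, takes the supremum over a small class, and approximates its expectation over the signs by Monte Carlo. The finite class, sample points, number of draws, and seed are assumptions of this illustration; the proof works with balls of the RKHS instead.

```python
import random


def rademacher_average(f, xs, signs):
    """L_n f = n^{-1} * sum_i eps_i * f(X_i) for a given sign vector."""
    return sum(e * f(x) for e, x in zip(signs, xs)) / len(xs)


def sup_rademacher(functions, xs, signs):
    """Supremum of L_n f over a finite function class, for given signs."""
    return max(rademacher_average(f, xs, signs) for f in functions)


def expected_sup_rademacher(functions, xs, n_draws=2000, seed=0):
    """Monte Carlo approximation of the expectation over the random signs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        signs = [rng.choice((-1, 1)) for _ in xs]
        total += sup_rademacher(functions, xs, signs)
    return total / n_draws


# Hypothetical finite class and sample points, for illustration only.
xs = [0.0, 1.0, 2.0]
functions = [lambda x: x, lambda x: -x, lambda x: 0.0]

single = rademacher_average(functions[0], xs, [1, -1, 1])
estimate = expected_sup_rademacher(functions, xs)
```

Because the class contains the zero function, the supremum is non-negative for every sign vector, and so is the Monte Carlo average.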

First we have from Lemma 6.7 of Blanchard et al. (2008) that for f′ ∈ H_k,

E [sup_{f \in H_{k}} (P - P_{n}) ({\tilde{L}}_{φ} (f) - {\tilde{L}}_{φ} (f^{'}))] \leq 4 E [L_{n} {f - f^{'}, f \in H_{k}}] .

Thus for the set {f ∈ H_k : ||f||_k ≤ S, d²(f, f′) ≤ r} and f′ ∈ $B_{H_{k}} (S)$ ,

E [sup_{f \in H_{k}} (P - P_{n}) ({\tilde{L}}_{φ} (f) - {\tilde{L}}_{φ} (f^{'}))] \leq 4 E [L_{n} {f - f^{'}, f \in H_{k} : {| | f | |}_{k} \leq S, d^{2} (f, f^{'}) \leq r}],

the right-hand-side of which is equivalent to $4 E [L_{n} {f, f \in H_{k} : {| | f | |}_{k} \leq 2 S, {| | f | |}_{L_{2} (P_{n})}^{2} \leq 2 r}]$ . Now we proceed to show that

E L_{n} {f \in H_{k} : {| | f | |}_{k} \leq 2 S, {| | f | |}_{L_{2} (P_{n})}^{2} \leq 2 r} \leq inf_{ϑ > 0} {4 ϑ + \frac{12}{\sqrt{n}} \int_{ϑ}^{\frac{\sqrt{r}}{\sqrt{2} S}} \sqrt{log N (B_{H_{k}}, ε, L_{2} (P_{n}))} d ε} = ψ_{S} (r),

by slightly modifying the procedure in obtaining Dudley’s Entropy Integral for Rademacher complexity of sets of functions. For j ≥ 0, let $r_{j} = \sqrt{2 r} 2^{- j}$ and T_j be an r_j-cover of $B_{H_{k}} (2 S)$ with respect to the L₂(P_n)-norm. For each f ∈ $B_{H_{k}} (2 S)$ , we can find an f̃_j ∈ T_j, such that ||f − f̃_j||_{L₂(P_n)} ≤ r_j. For any N, we express f as $f = f - {\tilde{f}}_{N} + \sum_{j = 1}^{N} ({\tilde{f}}_{j} - {\tilde{f}}_{j - 1})$ , where f̃₀ = 0. Note f̃₀ = 0 is an r₀-approximation of f. Hence,

\begin{array}{l} L_{n} (B_{H_{k}} (2 S)) = E [sup_{f \in B_{H_{k}} (2 S)} \frac{1}{n} \sum_{i = 1}^{n} ε_{i} (f (X_{i}) - {\tilde{f}}_{N} (X_{i}) + \sum_{j = 1}^{N} ({\tilde{f}}_{j} (X_{i}) - {\tilde{f}}_{j - 1} (X_{i})))] \\ \leq E [sup_{f \in B_{H_{k}} (2 S)} {| | ε | |}_{L_{2} (P_{n})} {‖ f - {\tilde{f}}_{N} ‖}_{L_{2} (P_{n})}] + \sum_{j = 1}^{N} E [sup_{f \in B_{H_{k}} (2 S)} \frac{1}{n} \sum_{i = 1}^{n} ε_{i} ({\tilde{f}}_{j} (X_{i}) - {\tilde{f}}_{j - 1} (X_{i}))] \\ \leq r_{N} + \sum_{j = 1}^{N} E [sup_{f \in B_{H_{k}} (2 S)} \frac{1}{n} \sum_{i = 1}^{n} ε_{i} ({\tilde{f}}_{j} (X_{i}) - {\tilde{f}}_{j - 1} (X_{i}))] . \end{array}

Note that

{‖ {\tilde{f}}_{j} - {\tilde{f}}_{j - 1} ‖}_{L_{2} (P_{n})}^{2} \leq {({‖ {\tilde{f}}_{j} - f ‖}_{L_{2} (P_{n})} + {‖ f - {\tilde{f}}_{j - 1} ‖}_{L_{2} (P_{n})})}^{2} \leq {(r_{j} + r_{j - 1})}^{2} = 9 r_{j}^{2} .

We therefore have

\begin{array}{l} L_{n} (B_{H_{k}} (2 S)) \leq r_{N} + \sum_{j = 1}^{N} 3 r_{j} \sqrt{\frac{2 log (∣ T_{j} ∣ ∣ T_{j - 1} ∣)}{n}} \\ \leq r_{N} + 12 \sum_{j = 1}^{N} (r_{j} - r_{j + 1}) \sqrt{\frac{log N (B_{H_{k}} (2 S), r_{j}, L_{2} (P_{n}))}{n}} \\ \leq r_{N} + 12 \int_{r_{N + 1}}^{\sqrt{r} / \sqrt{2} S} \sqrt{\frac{log N (B_{H_{k}}, ε, L_{2} (P_{n}))}{n}} d ε . \end{array}

For any ϑ > 0, we can choose N = sup{j : r_j > 2ϑ}. Therefore, ϑ < r_{N+1} ≤ 2ϑ, and r_N ≤ 4ϑ. We therefore conclude that

\begin{array}{l} L_{n} (B_{H_{k}} (2 S)) \leq inf_{ϑ > 0} {4 ϑ + 12 \int_{ϑ}^{\sqrt{r} / \sqrt{2} S} \sqrt{\frac{log N (B_{H_{k}}, ε, L_{2} (P_{n}))}{n}} d ε} \\ = inf_{ϑ > 0} {4 ϑ - \frac{12}{\sqrt{n}} ξ (ϑ) + \frac{12}{\sqrt{n}} ξ (\frac{\sqrt{r}}{\sqrt{2} S})} = ψ_{S} (r) . \end{array}
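The choice of the truncation level N in the chaining argument can be checked with a few lines of arithmetic. The sketch below computes N = sup{j : r_j > 2ϑ} for r_j = √(2r) 2^{−j} and exposes the radii used in the bounds; the function names and the numeric values of r and ϑ are illustrative, and the sketch requires √(2r) > 2ϑ so that the set defining N is non-empty.

```python
import math


def radius(r, j):
    """Chaining radius r_j = sqrt(2 r) * 2 ** (-j)."""
    return math.sqrt(2.0 * r) * 2.0 ** (-j)


def chaining_level(r, theta):
    """Return N = sup{j >= 0 : r_j > 2 * theta}."""
    if radius(r, 0) <= 2.0 * theta:
        raise ValueError("need sqrt(2 r) > 2 * theta")
    j = 0
    while radius(r, j + 1) > 2.0 * theta:
        j += 1
    return j


# Hypothetical values: r = 2 gives r_0 = 2, r_1 = 1, r_2 = 0.5, ...
r, theta = 2.0, 0.1
N = chaining_level(r, theta)
r_N, r_next = radius(r, N), radius(r, N + 1)
```

With these values N = 3, r_N = 0.25 and r_{N+1} = 0.125, so ϑ < r_{N+1} ≤ 2ϑ and r_N ≤ 4ϑ, which is what allows r_N to be replaced by 4ϑ and the lower integration limit by ϑ.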

The function ψ_S is sub-root because $log N (B_{H_{k}}, ε, L_{2} (P_{n}))$ is a decreasing function of ε.

To show the upper bound of $r_{S}^{*}$ , let $t_{S}^{*} = c^{2} C_{S}^{2} x_{*}^{2}$ . Then $\sqrt{t_{S}^{*}} / \sqrt{2} S = c C_{S} x_{*} / \sqrt{2} S$ , where C_S/S ≥ 1. Assuming that c ≥ 2, we have $\sqrt{t_{S}^{*}} / \sqrt{2} S \geq x_{*}$ . Since x⁻¹ ξ(x) is a decreasing function, it follows that

ξ (\frac{\sqrt{t_{S}^{*}}}{\sqrt{2} S}) \leq c \frac{C_{S}}{\sqrt{2} S} ξ (x_{*}) = \frac{\sqrt{n}}{c S C_{S}} t_{S}^{*} .

Therefore, by selecting an appropriate constant c,

ψ_{S} (t_{S}^{*}) \leq inf_{ϑ > 0} {4 ϑ - \frac{12}{\sqrt{n}} ξ (ϑ)} + \frac{12}{c S C_{S}} t_{S}^{*} \leq \frac{C_{S} {inf}_{ϑ > 0} {4 ϑ - 12 / \sqrt{n} ξ (ϑ)} + t_{S}^{*}}{C_{S}} .

The desired result follows from the property of sub-root functions, which states that if ψ : [0, ∞) → [0, ∞) is a sub-root function, then the unique positive solution of ψ(r) = r, denoted by r^*, exists, and for all r > 0, r ≥ ψ(r) if and only if r* ≤ r (Bartlett et al. 2005).
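The fixed-point property of sub-root functions can be illustrated with a simple example. The sketch below uses ψ(r) = a√r, which is non-negative, nondecreasing, and has ψ(r)/√r constant, so it is sub-root with fixed point r* = a². The fixed-point iteration, the choice of ψ, and all names are assumptions of this illustration; they are not the ψ_S of the proof.

```python
import math


def fixed_point(psi, r_init=1.0, tol=1e-12, max_iter=10_000):
    """Iterate r <- psi(r) to approximate the positive solution of psi(r) = r."""
    r = r_init
    for _ in range(max_iter):
        r_next = psi(r)
        if abs(r_next - r) <= tol:
            return r_next
        r = r_next
    return r


def psi(r, a=2.0):
    """Example sub-root function psi(r) = a * sqrt(r); its fixed point is a ** 2."""
    return a * math.sqrt(r)


r_star = fixed_point(psi)
above = 5.0 >= psi(5.0)   # r = 5 lies above r* = 4
below = 3.0 >= psi(3.0)   # r = 3 lies below r* = 4
```

Here r* is approximately 4, and the comparison r ≥ ψ(r) holds at r = 5 but fails at r = 3, in line with the stated equivalence between r ≥ ψ(r) and r* ≤ r.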

Contributor Information

Yingqi Zhao, Email: yqzhao@live.unc.edu, Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599.

Donglin Zeng, Email: dzeng@email.unc.edu, Department of Biostatistics, University of North Carolina at Chapel Hill, NC 27599.

A. John Rush, Email: john.rush@duke-nus.edu.sg, Office of Clinical Sciences, Duke-National University of Singapore Graduate Medical School, Singapore 169857.

Michael R. Kosorok, Email: kosorok@unc.edu, Department of Biostatistics, and Professor, Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599.

References

Bartlett PL, Bousquet O, Mendelson S. Local Rademacher Complexities. The Annals of Statistics. 2005;33(4):1497–1537.
Bartlett PL, Jordan MI, McAuliffe JD. Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association. 2006;101(473):138–156.
Blanchard G, Bousquet O, Massart P. Statistical Performance of Support Vector Machines. The Annals of Statistics. 2008;36:489–531.
Bradley PS, Mangasarian OL. Feature Selection via Concave Minimization and Support Vector Machines. Proc. 15th International Conf. on Machine Learning; San Francisco, CA, USA: Morgan Kaufmann Publishers Inc; 1998.
Buzdar AU. Role of Biologic Therapy and Chemotherapy in Hormone Receptor and HER2-Positive Breast Cancer. Annals of Oncology. 2009;20:993–999. doi: 10.1093/annonc/mdn739.
Cai T, Tian L, Uno H, Solomon SD. Calibrating parametric subject-specific risk estimation. Biometrika. 2010;97(2):389–404. doi: 10.1093/biomet/asq012.
Cortes C, Vapnik V. Support-Vector Networks. Machine Learning. 1995:273–297.
Crits-Christoph P, Siqueland L, Blaine J, Frank A, Luborsky L, Onken LS, Muenz LR, Thase ME, Weiss RD, Gastfriend DR, Woody GE, Barber JP, Butler SF, Daley D, Salloum I, Bishop S, Najavits LM, Lis J, Mercer D, Griffin ML, Moras K, Beck AT. Psychosocial Treatments for Cocaine Dependence. Arch Gen Psychiatry. 1999;56:493–502. doi: 10.1001/archpsyc.56.6.493.
Eagle KA, Lim MJ, Dabbous OH, Pieper KS, Goldberg RJ, de Werf FV, Goodman SG, Granger CB, Steg PG, Joel M, Gore M, Budaj A, Avezum A, Flather MD, Fox KAA, GRACE Investigators. A Validated Prediction Model for All Forms of Acute Coronary Syndrome: Estimating the Risk of 6-Month Postdischarge Death in an International Registry. J Am Med Assoc. 2004;291:2727–33. doi: 10.1001/jama.291.22.2727.
Flume PA, O'Sullivan BP, Goss CH, Peter J, Mogayzel J, Willey-Courand DB, Bujan J, Finder J, Lester M, Quittell L, Rosenblatt R, Vender RL, Hazle L, Sabadosa K, Marshall B. Cystic Fibrosis Pulmonary Guidelines: Chronic Medications for Maintenance of Lung Health. Am J Respir Crit Care Med. 2007;176(1):957–969. doi: 10.1164/rccm.200705-664OC.
Grünwald V, Hidalgo M. Developing Inhibitors of the Epidermal Growth Factor Receptor for Cancer Treatment. J Natl Cancer Inst. 2003;95(12):851–867. doi: 10.1093/jnci/95.12.851.
Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning. 2nd ed. New York: Springer-Verlag New York, Inc; 2009.
Insel TR. Translating scientific opportunity into public health impact: a strategic plan for research on mental illness. Archives of General Psychiatry. 2009;66(2):128–133. doi: 10.1001/archgenpsychiatry.2008.540.
Keller MB, McCullough JP, Klein DN, Arnow B, Dunner DL, Gelenberg AJ, Markowitz JC, Nemeroff CB, Russell JM, Thase ME, Trivedi MH, Zajecka J. A Comparison of Nefazodone, The Cognitive Behavioral-Analysis System of Psychotherapy, and Their Combination for the Treatment of Chronic Depression. The New England Journal of Medicine. 2000;342(20):1462–70. doi: 10.1056/NEJM200005183422001.
Laber EB, Murphy SA. Adaptive Confidence Intervals for the Test Error in Classification. To appear in Journal of the American Statistical Association. 2011. doi: 10.1198/jasa.2010.tm10053.
Lee Y, Lin Y, Wahba G. Multicategory Support Vector Machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association. 2004;99:67–81.
Lin Y. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery. 2002;6:259–275.
Liu Y, Helen Zhang H, Park C, Ahn J. Support vector machines with adaptive Lq penalty. Comput Stat Data Anal. 2007;51(12):6380–94.
Lugosi G, Vayatis N. On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics. 2004;32:30–55.
Marlowe DB, Festinger DS, Dugosh KL, Lee PA, Benasutti KM. Adapting Judicial Supervision to the Risk Level of Drug Offenders: Discharge and 6-month Outcomes from a Prospective Matching Study. Drug and Alcohol Dependence. 2007;88(Suppl 2):S4–S13. doi: 10.1016/j.drugalcdep.2006.10.001.
Moodie EEM, Platt RW, Kramer MS. Estimating Response-Maximized Decision Rules With Applications to Breastfeeding. Journal of the American Statistical Association. 2009;104(485):155–165.
Moodie EEM, Richardson TS, Stephens DA. Demystifying Optimal Dynamic Treatment Regimes. Biometrics. 2007;63(2):447–455. doi: 10.1111/j.1541-0420.2006.00686.x.
Murphy SA. Optimal Dynamic Treatment Regimes. Journal of the Royal Statistical Society, Series B. 2003;65:331–366.
Murphy SA, van der Laan MJ, Robins JM, CPPRG. Marginal Mean Models for Dynamic Regimes. Journal of the American Statistical Association. 2001;96:1410–23. doi: 10.1198/016214501753382327.
Piper WE, Boroto DR, Joyce AS, McCallum M, Azim HFA. Pattern of alliance and outcome in short-term individual psychotherapy. Psychotherapy. 1995;32:639–647.
Qian M, Murphy SA. Performance Guarantees for Individualized Treatment Rules. To appear in the Annals of Statistics. 2011. doi: 10.1214/10-AOS864.
Robins JM. Optimal Structural Nested Models for Optimal Sequential Decisions. Proceedings of the Second Seattle Symposium on Biostatistics; Springer; 2004. pp. 189–326.
Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma. The New England Journal of Medicine. 2002:1937–47. doi: 10.1056/NEJMoa012914.
Sargent DJ, Conley BA, Allegra C, Collette L. Clinical Trial Designs for Predictive Marker Validation in Cancer Treatment Trials. Journal of Clinical Oncology. 2005;32:2020–27. doi: 10.1200/JCO.2005.01.112.
Steinwart I. Consistency of Support Vector Machines and Other Regularized Kernel Classifiers. IEEE Transactions on Information Theory. 2005;51:128–142.
Steinwart I, Scovel C. Fast Rates for Support Vector Machines using Gaussian Kernels. The Annals of Statistics. 2007;35:575–607.
Thall PF, Sung H-G, Estey EH. Selecting Therapeutic Strategies Based on Efficacy and Death in Multicourse Clinical Trials. Journal of the American Statistical Association. 2002;97:29–39.
Tsybakov AB. Optimal Aggregation of Classifiers in Statistical Learning. The Annals of Statistics. 2004;32:135–166.
van’t Veer LJ, Bernards R. Enabling Personalized Cancer Medicine through Analysis of Gene-Expression Patterns. Nature. 2008;452:564–570. doi: 10.1038/nature06915.
Vapnik VN. The nature of statistical learning theory. New York: Springer-Verlag New York, Inc; 1995.
Vert R, Vert J-P. Consistency and Convergence Rates of One-Class SVMs and Related Algorithms. Journal of Machine Learning Research. 2006;7:817–854.
Wang L, Shen X. Multi-category Support vector machines, feature selection, and solution path. Statistica Sinica. 2006;16:617–633.
Zhang T. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics. 2004;32(1):56–85.
Zhao Y, Zeng D, Socinski MA, Kosorok MR. Reinforcement Learning Strategies for Clinical Trials in Nonsmall Cell Lung Cancer. Biometrics. 2011;67:1422–1433. doi: 10.1111/j.1541-0420.2011.01572.x.
Zhu J, Rosset S, Hastie T, Tibshirani R. 1-norm Support Vector Machines. Neural Information Processing Systems. 2003:16.
Zou H, Yuan M. The F∞-norm Support Vector Machine. Statistica Sinica. 2008;18:379–398.
