Tree based weighted learning for estimating individualized treatment rules with censored data

Yifan Cui; Ruoqing Zhu; Michael Kosorok

doi:10.1214/17-EJS1305

Author manuscript; available in PMC: 2018 Oct 18.

Published in final edited form as: Electron J Stat. 2017 Oct 18;11(2):3927–3953. doi: 10.1214/17-EJS1305


PMCID: PMC5796682 NIHMSID: NIHMS886497 PMID: 29403568

Abstract

Estimating individualized treatment rules is a central task for personalized medicine. [23] and [22] proposed outcome weighted learning to estimate individualized treatment rules directly through maximizing the expected outcome without modeling the response directly. In this paper, we extend the outcome weighted learning to right censored survival data without requiring either inverse probability of censoring weighting or semiparametric modeling of the censoring and failure times as done in [26]. To accomplish this, we take advantage of the tree based approach proposed in [28] to nonparametrically impute the survival time in two different ways. The first approach replaces the reward of each individual by the expected survival time, while in the second approach only the censored observations are imputed by their conditional expected failure times. We establish consistency and convergence rates for both estimators. In simulation studies, our estimators demonstrate improved performance compared to existing methods. We also illustrate the proposed method on a phase III clinical trial of non-small cell lung cancer.

Keywords and phrases: Individualized treatment rule, Nonparametric estimation, Right censored data, Excess value bound, Recursively imputed survival trees, Outcome weighted learning

1. Introduction

An individualized treatment regime provides a personalized treatment strategy for each patient in the population based on their individual characteristics. A significant amount of work has been devoted to estimating optimal treatment rules [17, 18, 22, 24, 23]. While each of these approaches has strengths and weaknesses, we highlight the approach in [23] because of its robustness to model misspecification (this is similarly true of the approach in [22]) combined with its ability to incorporate support vector machines through the recognition that optimizing the treatment rule can be recast as a weighted classification problem. This approach is commonly referred to as outcome weighted learning. In clinical trials, right censored survival data are frequently observed as primary outcomes. Adapting outcome weighted learning to the censored setting, [26] proposed two new approaches, inverse censoring weighted outcome weighted learning and doubly robust outcome weighted learning, both of which require semiparametric estimation of the conditional censoring probability given the patient characteristics and treatment choice. The doubly robust estimator additionally involves semiparametric estimation of the conditional failure time expectation but only requires that one of the two models, for either the failure time or censoring time, be correct. Potential drawbacks of these methods are that either or both models may be misspecified and inverse censoring weighting estimation can be unstable numerically [18, 28].

In this paper, we propose a nonparametric tree based approach for right censored outcome weighted learning which avoids both the inverse probability of censoring weighting and restrictive modeling assumptions for imputation through recursively imputed survival trees [28]. Since the true failure times T are only partially known, they cannot be used directly as weights in the outcome weighted learning [23] framework. However, recursively imputed survival trees [28] provide an alternative approach to weighting by using the conditional expectations of censored observations without requiring inverse weighting. Tree-based methods [4, 3] are a broad class of nonparametric estimators which have become some of the most popular machine learning tools. Their adaptation to the survival setting has also drawn considerable interest in the literature [14, 9, 11], and they have also been used for interpretable prediction modeling in personalized medicine [12]. The recursively imputed survival tree approach [28] combines extremely randomized trees with a recursive imputation method, which has been shown to improve performance and reduce prediction error while avoiding estimation of inverse censoring weights and without making parametric or semiparametric assumptions on the conditional probability distribution of the failure time. Numerical studies demonstrate that the proposed method outperforms existing alternatives in a variety of settings.

The proposed method uses these recursively imputed survival trees to impute the survival times nonparametrically in a manner suitable for implementation within outcome weighted learning. We verify this novel approach both theoretically and in numerical examples. As part of this, we also present for the first time consistency and rate results for tree-based survival models in a more general setting than the categorical predictors considered in [10].

The remainder of the article is organized as follows. In Section 2, we present the mathematical framework for individualized treatment rules for right censored survival outcomes. In Section 3, we establish consistency and an excess value bound for the estimated treatment rules. Extensive simulation studies are presented in Section 4. We also illustrate our method using a phase III clinical trial on non-small cell lung cancer in Section 5. The article concludes with a discussion of future work in Section 6. Some needed technical results are provided in the Appendix.

2. Methodology

2.1. Individualized treatment regime framework

Before characterizing the individualized treatment regime, we first introduce some general notation and the value function, and then extend the notation and ideas to the censored data setting. Let $X \in \mathcal{X}$ be the observed patient-level covariate vector, where $\mathcal{X}$ is a d dimensional vector space, and let A ∈ {−1, +1} be the binary treatment indicator. Let $\tilde{T}$ denote the true survival time; we consider a version truncated at τ, i.e., $T = \min (\tilde{T}, τ)$ , where the maximum follow-up time τ < ∞ is a common practical restriction in clinical studies. The goal in this framework is to maximize a reward R, which could represent any clinical outcome. Specifically, we wish to identify a treatment rule $D$ , which is a map from the patient-level covariate space $\mathcal{X}$ to the treatment space {+1, −1} that maximizes the expected reward. In the survival outcome setting, we use R = T or log(T) as done in [26].

To achieve this maximization, we define the value function as

V (D) = E^{D} (R) = E [R I {A = D (X)} / π (A; X)],

where I{·} is an indicator function, π(a; X) = pr(A = a|X) > M′ a.s. for some M′ > 0 and each a ∈ {+1, −1}. The function π is the propensity score and is known in a randomized trial setting, which we assume is the case for this paper, but needs to be estimated in a non-randomized, observational study setting. The individualized treatment regime we are most interested in is the optimal treatment rule $D^{*}$ which maximizes the value function, i.e.,

D^{*} = \underset{D}{\arg\max} E [R I {A = D (X)} / π (A; X)] .

(1)
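As a minimal sketch of how the value function can be evaluated on data, the snippet below computes the sample analogue of V(D) for a candidate rule. The function name `ipw_value`, the toy data, and the constant propensity of 0.5 (a balanced randomized trial) are illustrative assumptions and are not taken from the original analysis.

```python
def ipw_value(rewards, treatments, covariates, rule, propensity):
    """Sample analogue of V(D) = E[R I{A = D(X)} / pi(A; X)]."""
    total = 0.0
    for r, a, x in zip(rewards, treatments, covariates):
        # Only subjects whose assigned treatment agrees with the rule contribute.
        if rule(x) == a:
            total += r / propensity(a, x)
    return total / len(rewards)


# Toy data (hypothetical): four subjects, one covariate each.
rewards = [1.0, 2.0, 3.0, 4.0]
treatments = [1, -1, 1, -1]
covariates = [[0.2], [0.8], [0.6], [0.4]]


def rule(x):
    # Hypothetical rule: treat with +1 when the covariate exceeds 0.5.
    return 1 if x[0] > 0.5 else -1


def propensity(a, x):
    # Assumed known, as in a 1:1 randomized trial.
    return 0.5


value = ipw_value(rewards, treatments, covariates, rule, propensity)
```

Here only the third and fourth subjects received the treatment recommended by the rule, so the estimate is (3/0.5 + 4/0.5)/4 = 3.5.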

After rewriting the value function as

V (D) = E [E (R | A = 1, X) I {D (X) = 1} + E (R | A = - 1, X) I {D (X) = - 1}],

it is easy to see that

D^{*} = sign {E (R | A = 1, X) - E (R | A = - 1, X)} .

Hence, the definition of $D^{*}$ is equivalent to $D^{*} (x) = \arg \max_{a} E (R | A = a, X = x)$ . Instead of maximizing the objective function in (1), the outcome weighted learning approach searches for the optimal decision rule $D^{*}$ by minimizing the weighted misclassification error, i.e.,

D^{*} = \underset{D}{\arg\min} E [R I {A \neq D (X)} / π (A; X)] .

(2)

In an ideal situation, we would replace R with T or log(T). However, this is not possible under right censoring.
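The step from maximizing the value function to minimizing the weighted misclassification error can be spelled out in one line. The display below is a short worked identity, written in standard LaTeX, that uses only the quantities already defined in this section.

```latex
\begin{align*}
E\left[\frac{R}{\pi(A; X)}\right]
  &= E\left[\frac{R \, I\{A = D(X)\}}{\pi(A; X)}\right]
   + E\left[\frac{R \, I\{A \neq D(X)\}}{\pi(A; X)}\right],
  \qquad \text{since } I\{A = D(X)\} + I\{A \neq D(X)\} = 1 .
\end{align*}
```

The left-hand side does not depend on D, so any rule that maximizes the first term on the right also minimizes the second term, and vice versa.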

2.2. Value function under right censoring

Consider a censoring time C that is independent of T given (X, A). We then have the observed time Y = min(T, C), and the censoring indicator δ = I(T ≤ C). Assume that n independent and identically distributed copies, ${Y_{i}, δ_{i}, X_{i}, A_{i}}_{i = 1}^{n}$ , are collected. Since T is not fully observed, we seek a sensible replacement that preserves the value function as closely as possible. We propose two approaches in the following, with rewards denoted by R₁ and R₂, respectively. The first approach is to obtain a nonparametric estimate of the conditional expectation, $\hat{E} (T | X, A)$ . Letting R₁ = E(T | X, A) and bringing the expectation of T inside, we have

E [T I {A = D (X)} / π (A; X)] = E [R_{1} I {A = D (X)} / π (A; X)] .

(3)

Another approach is to replace only the censored observations conditioning on the observed data. It is interesting to observe that the conditional expectation of T, given Y and δ, can be written as

R_{2} : = E (T | X, A, Y, δ) = I (δ = 1) Y + I (δ = 0) E (T | X, A, Y, δ = 0) = I (δ = 1) Y + I (δ = 0) E (T | X, A, C = Y, T > Y, Y)

(4)

An important property that we used in the last equality is the conditional independence between T and C. Given Y = y and δ = 0, the censoring time is c = y, and the conditional distribution of T is defined on (c, τ] with density function proportional to the original density of T. In other words, the conditional survival function of T is S(t|X, A)/S(c|X, A) for t > c, where S(·|X, A) is the conditional survival function of T. Hence, we can calculate the expectation of T accordingly. With the definition of R₂, it is easy to see that the corresponding value function is equivalent to the left side of equation (3) by further taking expectations with respect to Y and δ. Note that the above arguments remain unchanged if we replace T, C and Y with log(T), log(C), and log(Y), respectively; this equivalence will be tacitly utilized throughout the paper, except when the distinction is needed.
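To make the calculation of the conditional expectation concrete, the following sketch evaluates E(T | T > c) from a step survival function through the identity E(T | T > c) = c + ∫_c^τ S(t) dt / S(c), and then assembles the reward R₂. The function names, the representation of the survival curve as jump times and values, and the toy numbers are assumptions made for illustration; they are not the estimator of [28].

```python
def step_survival(t, times, surv):
    """Right-continuous step survival function: S(t) = 1 before the first jump."""
    value = 1.0
    for s, p in zip(times, surv):
        if s <= t:
            value = p
        else:
            break
    return value


def conditional_mean_beyond(c, times, surv, tau):
    """E(T | T > c) for T truncated at tau, given a step survival curve."""
    s_c = step_survival(c, times, surv)
    if s_c <= 0.0:
        raise ValueError("S(c) must be positive")
    # Integrate S over (c, tau] piece by piece between consecutive jump times.
    grid = [c] + [s for s in times if c < s < tau] + [tau]
    integral = 0.0
    for left, right in zip(grid[:-1], grid[1:]):
        integral += step_survival(left, times, surv) * (right - left)
    return c + integral / s_c


def reward_r2(y, delta, times, surv, tau):
    """R2 = delta * Y + (1 - delta) * E(T | T > Y)."""
    if delta == 1:
        return y
    return conditional_mean_beyond(y, times, surv, tau)


# Hypothetical survival curve with jumps at t = 1, 2, 3 and follow-up tau = 4.
times = [1.0, 2.0, 3.0]
surv = [0.8, 0.5, 0.2]
imputed = reward_r2(1.5, 0, times, surv, 4.0)
observed = reward_r2(1.2, 1, times, surv, 4.0)
```

For the censored subject, the conditional distribution puts mass 0.375 at t = 2, 0.375 at t = 3 and 0.25 at τ = 4, which gives a conditional mean of 2.875; the uncensored subject keeps its observed time.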

With our proposed two reward measures, the remaining challenge is to nonparametrically estimate the conditional expectations. To this end, we utilize the nonparametric tree based method proposed by [28]. It is worth noting that the conditional expectation of T defined in R₂ shares the same logical underpinnings as the imputation step in [28]. However, the goal of the imputation step is to replace the censored observations with a randomly generated conditional failure time which utilizes the same conditional survival distribution of T given T > C. We will provide details of the estimation procedure in the next section. To conclude this section, we provide the empirical versions of the value function using the two rewards R₁ and R₂, respectively, which we maximize over D to estimate the optimal decision $D^{*}$:

n^{- 1} \sum_{i = 1}^{n} \frac{\hat{E} (T_{i} | A_{i}, X_{i}) I {A_{i} = D (X_{i})}}{π (A_{i}; X_{i})},

(5)

and n^{- 1} \sum_{i = 1}^{n} \frac{{δ_{i} Y_{i} + (1 - δ_{i}) \hat{E} (T_{i} | X_{i}, A_{i}, T_{i} > Y_{i}, Y_{i})} I {A_{i} = D (X_{i})}}{π (A_{i}; X_{i})} .

(6)

2.3. Outcome weighted learning with survival trees

The recursively imputed survival trees method proposed by [28] is a powerful tool to estimate conditional survival functions for censored data. A brief outline of the algorithm is provided in the following. We refer interested readers to the original paper for details. To fit the model, we first generate extremely randomized survival trees for the training dataset. Secondly, we calculate conditional survival functions for each censored observation, which can be used for imputing the censored value to a random conditional failure time. Thirdly, we generate multiple copies of the imputed dataset, and one survival tree is fitted for each dataset. We repeat the last two steps recursively and the final nonparametric estimate of $\hat{E} (T | X, A)$ is obtained by averaging the trees from the last step.

Following [23], we next use support vector machines to solve for the optimal treatment rule. Maximizing (5) or (6) is equivalent to minimizing the corresponding weighted misclassification error, in which $I {A_{i} = D (X_{i})}$ is replaced by $I {A_{i} \neq D (X_{i})}$ . A decision function f(x) is then learned by replacing this misclassification indicator with the convex surrogate ϕ{A_i f(X_i)}, where ϕ(x) = (1 − x)⁺ is the hinge loss and x⁺ = max(x, 0). Furthermore, to avoid overfitting, a regularization term λ_n‖f‖² is added to penalize the complexity of the estimated decision function f. Here, ‖f‖ is some norm of f, and λ_n is a tuning parameter. A high-level description of the proposed method is given in Algorithm 1 below. We consider both linear and nonlinear decision functions f when solving (7). For a linear decision function, f(x) = θ₀ + θ^T x and we let ‖f‖ be the Euclidean norm of θ. For nonlinear decision functions, we employ a universal kernel function $k : \mathcal{X} \times \mathcal{X} \to ℝ$ , such as the Gaussian kernel, which is continuous, symmetric and positive semidefinite. The optimization problem is then equivalent to a dual problem that maximizes

\sum_{i = 1}^{n} α_{i} - \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} α_{i} α_{j} A_{i} A_{j} k (X_{i}, X_{j}),

subject to 0 ≤ α_i ≤ γW_i/π_i and $\sum_{i = 1}^{n} α_{i} A_{i} = 0$ , where W_i is the numerator in either (5) or (6) and π_i is the respective denominator. Both settings can be efficiently solved by quadratic programming. For further details regarding solving weighted classification problems using support vector machines, we refer to [23, 26, 5].

Algorithm 1: Pseudo algorithm for the proposed method

Step 1. Use ${{(X_{i}^{T}, A_{i}, A_{i} X_{i}^{T})}^{T}, Y_{i}, δ_{i}}_{i = 1}^{n}$ to fit recursively imputed survival trees. Obtain the estimate $\hat{E} (T_{i} | A_{i}, X_{i})$ for reward R₁ or the estimate $\hat{E} (T_{i} | X_{i}, A_{i}, T_{i} > Y_{i}, Y_{i})$ for reward R₂.

Step 2. Let the weights W_i be either $\hat{E} (T_{i} | A_{i}, X_{i})$ or $δ_{i} Y_{i} + (1 - δ_{i}) \hat{E} (T_{i} | A_{i}, X_{i}, T_{i} > Y_{i}, Y_{i})$ , depending on which of the two proposed approaches is used. Minimize the following weighted misclassification error:

\hat{f} (x) = \underset{f}{\arg min} \sum_{i = 1}^{n} W_{i} \frac{ϕ {A_{i} f (X_{i})}}{π (A_{i}; X_{i})} + λ_{n} {‖ f ‖}^{2} .

(7)

Step 3. Output the estimated optimal treatment rule $\hat{D} (x) = sign {\hat{f} (x)}$ .
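Steps 2 and 3 can be illustrated for a linear decision function with a few lines of code. The sketch below minimizes an n⁻¹-scaled version of the objective in (7) by plain subgradient descent, which is a substitute chosen for brevity and not the quadratic-programming dual used in the paper. The function names, step-size schedule, value of the penalty, and the balanced toy design are all assumptions; the weights W_i are taken as given, as if Step 1 had already been carried out.

```python
def fit_weighted_linear_svm(X, A, W, pi, lam=0.01, epochs=200, step=0.1):
    """Subgradient descent on
    n^{-1} sum_i (W_i / pi_i) * max(0, 1 - A_i f(X_i)) + lam * ||theta||^2,
    with f(x) = theta0 + theta^T x."""
    n, d = len(X), len(X[0])
    theta0, theta = 0.0, [0.0] * d
    for epoch in range(epochs):
        g0 = 0.0
        g = [2.0 * lam * t for t in theta]  # gradient of the penalty
        for x, a, w, p in zip(X, A, W, pi):
            margin = a * (theta0 + sum(t * xj for t, xj in zip(theta, x)))
            if margin < 1.0:  # the hinge loss is active for this subject
                coef = -(w / p) * a / n
                g0 += coef
                for j in range(d):
                    g[j] += coef * x[j]
        lr = step / (1.0 + epoch) ** 0.5  # diminishing step size
        theta0 -= lr * g0
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta0, theta


def treatment_rule(theta0, theta):
    """Step 3: D(x) = sign{f(x)}."""
    def rule(x):
        score = theta0 + sum(t * xj for t, xj in zip(theta, x))
        return 1 if score >= 0 else -1
    return rule


# Toy design (hypothetical): each covariate value appears once per arm, and the
# weight is larger when the assigned treatment equals sign(x).
xs = [(k + 0.5) / 20.0 - 1.0 for k in range(40)]
X, A, W = [], [], []
for x in xs:
    for a in (-1, 1):
        X.append([x])
        A.append(a)
        W.append(2.0 if a == (1 if x > 0 else -1) else 0.5)
pi = [0.5] * len(X)

theta0, theta = fit_weighted_linear_svm(X, A, W, pi)
rule = treatment_rule(theta0, theta)
```

In this toy design the larger weights sit on subjects treated according to sign(x), so the fitted rule should recommend +1 for clearly positive x and −1 for clearly negative x.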

3. Theoretical results

3.1. Preliminaries

The risk function is defined as

R (f) = E [\frac{R}{π (A; X)} I {A \neq sign (f (X))}],

where the reward R = R₁ = E(T|X, A) for the first approach, or R = R₂ = δY + (1 − δ)E(T|X, A, T > Y,Y) for the second one. We define ϕ-risk for both the true and the working model as, respectively, R_ϕ(f) = E[Rϕ{Af(X)}/π(A; X)] and $R_{ϕ}^{'} (f) = E [\hat{R} ϕ {A f (X)} / π (A; X)]$ , where $\hat{R}$ is the estimated value of R based on one of the two proposed methods. We also define the hinge loss function for the true and working models as L_ϕ(f) = Rϕ{Af(X)}/π(A; X) and $L_{ϕ}^{'} (f) = \hat{R} ϕ {A f (X)} / π (A; X)$ , respectively.

The proposed estimator is $\hat{D} = sign ({\hat{f}}_{n} (X))$ , where ${\hat{f}}_{n}$ solves one of the following optimization problems within some reproducing kernel Hilbert space $ℋ_{k}$ :

{\hat{f}}_{n} = \underset{f \in ℋ_{k}}{\arg min} n^{- 1} \sum_{i = 1}^{n} \frac{\hat{E} (T_{i} | X_{i}, A_{i})}{π (A_{i}; X_{i})} ϕ {f (X_{i}) A_{i}} + λ_{n} {‖ f ‖}_{n}^{2},

{\hat{f}}_{n} = \underset{f \in ℋ_{k}}{\arg min} n^{- 1} \sum_{i = 1}^{n} \frac{δ_{i} Y_{i} + (1 - δ_{i}) \hat{E} (T_{i} | X_{i}, A_{i}, T_{i} > Y_{i}, Y_{i})}{π (A_{i}; X_{i})} ϕ {f (X_{i}) A_{i}} + λ_{n} {‖ f ‖}_{n}^{2} .

3.2. Consistency of tree-based survival models

In this section, we provide the convergence bound of a simplified tree-based survival model, which is very close to the original algorithm in [28]. The purpose of this section and its main result, Theorem 1, is to demonstrate the existence of an accurate estimator of the underlying hazard function when tree-based methods are used. An earlier result developed in [10] considers only categorical feature variables. To the best of our knowledge, what we present below is the first consistency result for a tree-based survival model under general settings with restrictions only on the splitting rules, which is interesting in its own right.

For simplicity, we assume in this section that $Q_{n} = {(Y_{i}, δ_{i}, X_{i}, A_{i}), i = 1, \dots, n}$ is the training sample, where the X_i are independent and uniformly distributed on [0, 1]^d. The result can be easily generalized to distributions with bounded support and density function bounded above and below. For any fixed x, our goal is to estimate the cumulative hazard function of the failure time, r(·, X, A) = Λ_T (·|X, A); hereinafter, we write it as Λ(·|X, A).

A random forest is a collection of randomized regression trees ${{\hat{r}}_{n} (\cdot, X, A, Θ_{j}, Q_{n}), 1 \leq j \leq m}$ , where m is the number of trees. The randomizing variable Θ is used to indicate how the successive cuts are performed when an individual tree is built. Hence the forest version of the survival tree model can be expressed as

{\hat{r}}_{n} (\cdot, X, A, Q_{n}) = \frac{1}{m} \sum_{j = 1}^{m} {\hat{r}}_{n} (\cdot, X, A, Θ_{j}, Q_{n}) .

Here, we consider a simplified scenario in which the selection of the coordinate is completely random and independent from the training data [1]. We only consider the consistency of a single tree and denote our tree estimator as ${\hat{r}}_{n} (\cdot, X, A)$ . The result can be easily extended to the situation where m is finite.

A brief description of how each individual tree is constructed is provided in the appendix. Here we highlight some key assumptions and the main result. Our first assumption puts a lower bound on the probability of observing a failure at τ, and the second one assumes the smoothness of the hazard and cumulative hazard functions.

Assumption 1

For some M > 0, S_Y (τ|X, A) > M almost surely.

Assumption 2

The following theorem provides the bound of the proposed tree based survival model for each X. Details of the proof are collected in the Appendix.

Theorem 1

Assume that Assumptions 1–2 hold and that the tree-based survival model is constructed as described in the Appendix. Further assume that k_n → ∞ and n/k_n → ∞ as n → ∞, where k_n is a tuning parameter denoting the number of terminal nodes. For any b = n^ζ, where ζ > 0, we have for each X,

pr {\sup_{t < τ} | {\hat{r}}_{n} (t, X, A) - r (t, X, A) | \leq C [d^{1 / 2} 2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + b^{1 / 2} {(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}}^{- 1 / 2}]} \geq 1 - w_{n},

where r, u ∈ (0, 1), n ≥ 288b/M⁴, C is some universal constant, and

w_{n} = 16 [(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1} + 2] e^{- b} + e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}} + d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / 2 d} .

The ideal balance happens when $k_{n} = n^{{(1 + 2 / d)}^{- 1}}$ . In this case, the optimal rate of the bound is close to $n^{- {(d + 2)}^{- 1}}$ . The following theorem proves consistency of the proposed tree based survival model. Details of the proof are collected in the Appendix.
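The choice of k_n stated above can be checked by equating the two terms of the bound in Theorem 1. The short derivation below is a heuristic sketch in LaTeX that ignores the ceiling function, the constants, and the factors involving r, u and b, which is why the resulting rate is described as being only close to optimal.

```latex
\begin{align*}
2^{-\log_2 k_n / d} &= k_n^{-1/d},
 \qquad
 \left( n \, 2^{-\log_2 k_n} \right)^{-1/2} = (k_n / n)^{1/2}, \\
k_n^{-1/d} = (k_n / n)^{1/2}
 &\iff k_n^{(d+2)/(2d)} = n^{1/2}
 \iff k_n = n^{d/(d+2)} = n^{(1 + 2/d)^{-1}}, \\
k_n^{-1/d} &= n^{-1/(d+2)} = n^{-(d+2)^{-1}} .
\end{align*}
```

The first term shrinks as k_n grows while the second term grows, so the balance point gives the best order of the bound under this simplification.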

Theorem 2

Assume that Assumptions 1–2 hold and that the tree-based survival model is constructed as described in the Appendix. Further assume that k_n = n^η, where 0 < η < 1. Then the estimator of the survival tree model is consistent. Moreover, for any b = n^ζ, where ζ > 0,

\sup_{t < τ} E_{X} | {\hat{r}}_{n} (t, X, A) - r (t, X, A) | \leq C [d^{1 / 2} 2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d}] + b^{1 / 2} {(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}}^{- 1 / 2} + w_{n} O (\ln (n)),

where r,u ∈ (0,1), n ≥ 288b/M⁴, C is some universal constant, and

w_{n} = 16 [(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1} + 2] e^{- b} + e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}} + d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)} .

3.3. Consistency and Excess Value Bound

Fisher consistency follows directly from Proposition 3.1 in [23], hence the proof is omitted. Here we restate the result as the following lemma. For the proposed method, we simply replace the reward R in R_ϕ(f) with R₁ or R₂. Note that both versions are equivalent to the reward function R_ϕ(f) = E[Tϕ{Af(X)}/π(A; X)]:

Lemma 1 (Proposition 3.1 in [23])

For any measurable function $\tilde{f}$ , if $\tilde{f}$ minimizes R_ϕ(f), then $D^{*} (x) = sign (\tilde{f} (x))$ .

Provided the Assumptions in Section 3.2 hold, the following lemma ensures the convergence of the estimated conditional expectations. The proof is given in the Appendix.

Lemma 2

Based on Theorem 1, for each X the estimated conditional expectations converge in probability, i.e.,

pr {| \hat{E} (T | X, A) - E (T | X, A) | \leq C_{1} [2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + {(b / {(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}})}^{1 / 2}]} \geq 1 - w_{n},

pr {| \hat{E} (T | X, A, T > Y, Y) - E (T | X, A, T > Y, Y) | \leq C_{2} [2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + {(b / {(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}})}^{1 / 2}]} \geq 1 - 2 w_{n},

for some constants C₁, C₂ (depending on L, L′, τ, M, d).

We will use the above lemmas to prove our main theorem based on the Gaussian kernel. Before we derive the convergence rate and excess value bound, we define the value function corresponding to the true and working model as V (f) = E(RI[A = sign{f(X)}]/π(A; X)) and $V^{'} (f) = E (\hat{R} I [A = sign {f (X)}] / π (A; X))$ , respectively. We further define the empirical L₂–norm, ${‖ f - g ‖}_{L_{2} (P_{n})} = {(n^{- 1} \sum_{i = 1}^{n} | f (X_{i}) - g (X_{i}) |^{2})}^{1 / 2}$ , which also defines an ε-ball based on this norm. By Theorem 2.1 in [20], we restate the bound for covering numbers:

Lemma 3 (Theorem 2.1 in [20])

For any β > 0, 0 < v < 2, ε > 0 we have $\sup_{P_{n}} \log N (B_{ℋ_{k}}, ε, L_{2} (P_{n})) \leq c_{v, β, d} σ_{n}^{(1 - v / 2) (1 + β) d} ε^{- v}$ , where $B_{ℋ_{k}}$ is the closed unit ball of $ℋ_{k}$ , and d is the dimension of $X$ .

Lastly, for $\tilde{f} = \arg {min}_{f \in ℱ} E {L_{ϕ} (f)}$ , we define the approximation error function

a (λ) = \inf_{f \in ℋ_{k}} [E {L_{ϕ} (f)} + λ {‖ f ‖}_{k}^{2} - E {L_{ϕ} (\tilde{f})}] .

Then we have the following theorem, the proof of which is given in the Appendix.

Theorem 3

Based on Theorem 2 and assuming that the sequence λ_n > 0 satisfies λ_n → 0 and λ_n ln n → ∞, we have that

p r (V (f *) \leq V ({\hat{f}}_{n}) + ε) \geq 1 - 2 e^{- ρ},

where f* maximizes the true value function V, $ε = a (λ_{n}) + M_{v} {(n λ_{n} / c_{n})}^{- 2 / (v + 2)} + M_{v} λ_{n}^{- 1 / 2} {(c_{n} / n)}^{2 / (d + 2)} + K ρ {(n λ_{n})}^{- 1} + 2 K ρ n^{- 1} λ_{n}^{- 1 / 2} + C λ_{n}^{- 1 / 2} {2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + {(b / {(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}})}^{1 / 2} + 16 \ln n [(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1} + 2] e^{- b} + e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}} \ln n + d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)} \ln n}$ , $c_{n} = c_{v, β, d} σ_{n}^{(1 - v / 2) (1 + β) d}$ and ρ > 0 for both methods; also, M_v is a constant depending on v, K is a sufficiently large positive constant, and C is a large constant depending on d.

The rate consists of two parts. The first part is from the approximation error using $ℋ_{k}$ . The second part controls the approximation error due to using the proposed tree-based method to estimate the conditional expectation.

4. Simulation studies

We perform simulation studies to compare the proposed method with existing alternatives, including the Cox proportional hazards model with covariate-treatment interactions, inverse censoring weighted outcome weighted learning, and doubly robust learning, both proposed in [26]. We use survival time on the log scale log(T) as outcome. We also present for comparison an “oracle” approach which uses the true failure time on the log scale log(T) as the weight in outcome weighted learning, although this would not be implementable in practice. However, this approach is a representation of the best possible performance under the outcome weighted learning framework.

We generate X_i’s independently from a uniform distribution. Treatments are generated from {+1, −1} with equal probabilities. We present four scenarios in this simulation study. The failure time T and censoring time C are generated differently in each scenario, including both linear and nonlinear decision rules. For each case, we learn the optimal treatment rule from a training dataset with sample size n = 200. A testing dataset with size 10000 is used to calculate the value function under the estimated rule. Each simulation is repeated 500 times.

Tuning parameters in the tree based methods need to be selected. We mostly use the default values. The number of variables considered at each split is the integer part of the square root of d as suggested by [11] and [7]. We set the total number of trees to be 50 as suggested by [28] and use one fold imputation. For the alternative approaches such as inverse censoring weighted outcome weighted learning and doubly robust learning, a Cox proportional hazards model with covariates (X, A, XA) is used to model T and C respectively. Note that when at least one of the two working models is correctly specified, the doubly robust method enjoys consistency. We implemented outcome weighted learning using a Matlab library for support vector machine [5]. Both linear and Gaussian kernels are considered for all methods except for the Cox model approach which could be directly inverted to obtain the decision rules. The parameter λ_n is chosen by ten-fold cross-validation.

4.1. Simulation settings

For all scenarios, we generate $\tilde{T}$ and C independently. The failure time is $T = \min (τ, \tilde{T})$ . For all accelerated failure time models, ε is generated from a standard normal distribution. For all Cox proportional hazards models, the baseline hazard function is λ₀(t) = 2t. For all simulation results presented in this section, the censoring rate is set to approximately 45% in every scenario. We also perform a sensitivity analysis for different censoring rates (30% and 60%) for each scenario. These additional results are presented in the Appendix.

Scenario 1

Both $\tilde{T}$ and C are generated from the accelerated failure time model. τ = 2.5 and d = 10. The optimal decision function is linear. The value of the optimal treatment rule is approximately 0.031:

\log (\tilde{T}) = - 0.2 - 0.5 X_{1} + 0.5 X_{2} + 0.3 X_{3} + (0.5 - 0.1 X_{1} - 0.6 X_{2} + 0.1 X_{3}) A + ε,

\log (C) = 0.1 - 0.8 X_{1} + 0.4 X_{2} + 0.4 X_{3} + (0.5 - 0.1 X_{1} - 0.6 X_{2} + 0.3 X_{3}) A + ε .
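A data generator for Scenario 1 can be sketched directly from the two displayed models. The code below is an illustration under stated assumptions: the covariates are drawn from Uniform(0, 1) because the range of the uniform distribution is not specified here, the error terms of the two models are drawn independently, and the function name and seed are arbitrary. It is not the authors' simulation code, and the censoring rate it produces is not claimed to match the reported 45%.

```python
import math
import random


def simulate_scenario1(n, d=10, tau=2.5, seed=0):
    """Draw (X, A, Y, delta) from the Scenario 1 accelerated failure time models."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(d)]  # assumption: X_j ~ Uniform(0, 1)
        a = rng.choice([-1, 1])               # equal-probability treatment
        x1, x2, x3 = x[0], x[1], x[2]
        log_t = (-0.2 - 0.5 * x1 + 0.5 * x2 + 0.3 * x3
                 + (0.5 - 0.1 * x1 - 0.6 * x2 + 0.1 * x3) * a
                 + rng.gauss(0.0, 1.0))
        log_c = (0.1 - 0.8 * x1 + 0.4 * x2 + 0.4 * x3
                 + (0.5 - 0.1 * x1 - 0.6 * x2 + 0.3 * x3) * a
                 + rng.gauss(0.0, 1.0))       # independent error for C
        t = min(math.exp(log_t), tau)         # failure time truncated at tau
        c = math.exp(log_c)
        y = min(t, c)                         # observed time
        delta = 1 if t <= c else 0            # 1 = failure observed
        data.append((x, a, y, delta))
    return data


train = simulate_scenario1(200)
```

Only the first three of the ten covariates enter the models, so the remaining seven act as noise variables for the learning methods.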

Scenario 2

$\tilde{T}$ is generated from a Cox model and C is generated from the accelerated failure time model. The optimal decision function is nonlinear. τ = 8 and d = 10. The value of the optimal treatment rule is approximately 0.181:

λ_{\tilde{T}} (t | A, X) = λ_{0} (t) \exp {- 0.2 - 1.5 X_{1}^{1.5} + 0.5 X_{2} + (0.8 - 0.7 X_{1}^{0.5} - 1.2 X_{2}^{2}) A},

\log (C) = - 0.5 + 0.7 X_{1} + X_{2}^{2} + 0.6 X_{3} + 0.1 X_{4} + (0.2 + X_{1}^{2.5} - 2 X_{2} + 0.5 X_{3}) A + ε .
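Failure times from the Cox model can be generated by inverse transform sampling. With baseline hazard λ₀(t) = 2t, the cumulative baseline hazard is t², so the conditional survival function is S(t | A, X) = exp{−t² exp(η)}, where η is the linear predictor in the exponent, and solving S(t) = U for a uniform draw U gives t = {−log(U)/exp(η)}^{1/2}. The sketch below implements this; the function names are hypothetical, and it assumes nonnegative covariates so that the fractional powers are real.

```python
import math


def scenario2_linear_predictor(x1, x2, a):
    """Exponent of the Scenario 2 hazard model for the failure time."""
    return (-0.2 - 1.5 * x1 ** 1.5 + 0.5 * x2
            + (0.8 - 0.7 * x1 ** 0.5 - 1.2 * x2 ** 2) * a)


def cox_time_from_uniform(u, eta):
    """Solve exp(-t^2 * exp(eta)) = u for t, with baseline hazard 2t."""
    return math.sqrt(-math.log(u) / math.exp(eta))


def survival(t, eta):
    """S(t | A, X) = exp(-t^2 * exp(eta))."""
    return math.exp(-t ** 2 * math.exp(eta))


# Hypothetical subject and a fixed uniform value, to check the inversion.
eta = scenario2_linear_predictor(0.4, 0.7, 1)
t_draw = cox_time_from_uniform(0.3, eta)
```

Plugging the generated time back into the survival function returns the uniform value 0.3, which confirms the inversion; in a simulation, the fixed value would be replaced by a random draw from Uniform(0, 1).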

Scenario 3

$\tilde{T}$ is generated from an accelerated failure time model with tree structured effects. C is generated from a Cox model with nonlinear effects. τ = 8 and d = 5. The value of the optimal treatment rule is approximately 1.079:

\log (\tilde{T}) = X_{1} + I (X_{2} > 0.5) I (X_{3} > 0.5) + (0.3 - X_{1}) A + 2 {I (X_{4} < 0.3) I (X_{5} < 0.3) + I (X_{4} > 0.7) I (X_{5} > 0.7)} A + ε,

λ_{C} (t | A, X) = λ_{0} (t) \exp {- 1.5 + X_{1} + (1 + 0.6 X_{2}^{1.5}) A} .

Scenario 4

$\tilde{T}$ is generated from an accelerated failure time model. C is generated from a Cox model. τ = 2 and d = 10. The value of the optimal treatment rule is approximately −0.389:

\log (\tilde{T}) = - 0.5 - 0.8 X_{1} + 0.7 X_{2} + 0.2 X_{3} + (0.6 - 0.4 X_{1} - 0.2 X_{2} - 0.4 X_{3}) A + ε,

λ_{C} (t | A, X) = λ_{0} (t) \exp {- 0.5 X_{1} - 0.5 X_{2} + 0.2 X_{3} - (1 - 0.5 X_{1} + 0.3 X_{2} - 0.5 X_{3}) A} .

4.2. Simulation results

Figure 1 shows the boxplot of values based on the logarithm of T calculated from the test data. The mean and standard deviation of values are shown in Table 1. In scenario 1, since the model is not correctly specified for inverse probability of censoring outcome weighted learning, the doubly robust estimator, or Cox regression, our method performs better than all other competitors.

Fig 1 — Boxplots of mean log survival time for different treatment regimes. Censoring rate: 45%. T: using true survival time as weight; RIST-R₁ and RIST-R₂: using the estimated R₁ and R₂ respectively as weights, while the conditional expectations are estimated using recursively imputed survival trees; ICO: inverse probability of censoring weighted learning; DR: doubly robust outcome weighted learning. The black horizontal line is the theoretical optimal value.

Table 1.

Simulation results: Mean (×10³) and (sd) (×10³). Censoring rate: 45%. For each scenario, the theoretical optimal value (×10³) is 31, 181, 1079, and −389, respectively.

Scenario	Kernel	T	RIST-R₁	RIST-R₂	ICO	DR	Cox
1	Linear	0 (26)	0 (31)	1 (30)	−20 (54)	−39 (76)	−29 (33)
1	Gaussian	−17 (44)	−11 (35)	−8 (36)	−25 (50)	−88 (79)	−29 (33)
2	Linear	22 (113)	−1 (112)	−24 (125)	−137 (131)	−232 (132)	53 (69)
2	Gaussian	−39 (115)	−40 (103)	−72 (114)	−175 (120)	−311 (106)	53 (69)
3	Linear	785 (52)	766 (59)	763 (51)	683 (113)	598 (120)	745 (64)
3	Gaussian	896 (61)	803 (56)	834 (71)	785 (105)	606 (115)	745 (64)
4	Linear	−453 (37)	−469 (47)	−451 (27)	−469 (48)	−481 (59)	−464 (36)
4	Gaussian	−465 (35)	−482 (44)	−457 (28)	−487 (45)	−531 (43)	−464 (36)


T: using true survival time as weight; RIST-R₁ and RIST-R₂: using the estimated R₁ and R₂ respectively as weights, while the conditional expectations are estimated using recursively imputed survival trees; ICO: inverse probability of censoring weighted learning; DR: doubly robust outcome weighted learning; Cox: Cox proportional hazards model using covariate-treatment interactions.

In scenario 2, we added some nonlinear terms into both the Cox and accelerated failure time models. The model assumptions for inverse censoring outcome weighted learning and the doubly robust estimator are not satisfied. Our estimated treatment rule performs much better than these two. Compared with inverse censoring outcome weighted learning and doubly robust learning, both our approaches improve more than 0.1 for the mean. Since the true model for the failure time is the Cox model, Cox regression performs better here. In this case, the Gaussian kernel performs less well than the linear kernel for most methods since the true model structure is linear and the Gaussian kernel is too flexible.

For scenario 3, which has a more complicated tree structure, the Gaussian kernel performs better than the linear kernel for all outcome weighted learning approaches. The performance of the Gaussian kernel is enhanced since it can better address the true nonlinear model structure. We can see that with either a linear or Gaussian kernel, our estimators perform better than Cox regression. Compared with doubly robust learning, our two approaches improve 0.2 for the mean.

In scenario 4, we see that when the model is correctly specified for inverse probability of censoring outcome weighted learning and doubly robust learning, the performances of both approaches are satisfactory while our methods seem to be only a little better. The performances of our first approach, inverse probability of censoring outcome weighted learning and Cox regression are all similar. Our second approach has the best treatment effect among all estimators. Note that our second approach appears to perform as well as the first, oracle approach. Also, our two proposed methods have smaller standard errors in scenarios 1 and 3. The standard error is similar for all outcome weighted learning approaches in scenario 2 and 4. Overall, our proposed methods have generally lower variances.

Compared with the results for the other censoring rates (30% and 60%) in the Appendix, we observe a consistent pattern: a lower censoring rate leads to better performance in terms of both mean value and variance. The relative performance of the proposed and competing methods remains similar across censoring rates.

5. Data Analysis

We apply the proposed method to a non-small-cell lung cancer randomized trial dataset described in [19]. 228 subjects with complete information are used in this analysis. Each treatment arm contains 114 subjects. The censoring rate is 29%. Here we use five covariates: performance status (119 subjects ranging from 90% to 100% and 109 subjects ranging from 70% to 80%), cancer stage (31 subjects in stage 3 and 197 subjects in stage 4), race (167 white, 54 black and 7 others), gender (143 male and 85 female), age (ranging from 31 to 82 with median 63). The length of study is τ = 104 weeks. We adopt the same tuning parameters used in the simulation study for this analysis. The value function is again calculated by using the logarithm of survival time log(T) (in weeks) as the reward.

We randomly divide the 228 patients into four equal parts, use three parts as training data to estimate the optimal rule, and calculate the empirical value on the remaining part. We then permute the training and testing portions so that each part serves once as the testing set, and average the four results. This procedure is repeated 100 times to obtain the mean and standard deviation. To assess the testing data performance, we consider two different measurements, both calculated from the formula $\sum_{i = 1}^{n} R_{i} I\{A_{i} = D(X_{i})\} / \sum_{i = 1}^{n} I\{A_{i} = D(X_{i})\}$ on the testing samples, where two versions of the R_i's are used. We first consider the procedure proposed in [26], where R is defined as

\frac{Δ Y}{\hat{S}_{C}(Y | A, X)} - \int \hat{E}_{\tilde{T}}(T | T > t, A, X) \left\{ \frac{d N_{C}(t)}{\hat{S}_{C}(t | A, X)} + I(Y \geq t) \frac{d \hat{S}_{C}(t | A, X)}{\hat{S}_{C}(t | A, X)^{2}} \right\} .

Here, ${\hat{S}}_{C} (t | A, X)$ and ${\hat{E}}_{\tilde{T}} (T | T > t, A, X)$ are estimated from the Cox model for simplicity. We also consider a more direct clinical measurement without the double robustness correction, which can be interpreted in a similar way as the expected survival time or the restricted mean survival time [6, 16, 21]. To be specific, we consider a restricted mean (log) survival time truncated at τ defined as δT + (1 − δ)E(T), and use this as a plug-in quantity of R in the testing performance calculation. To estimate this quantity, we use a recursively imputed survival trees (RIST) method to produce the expected survival time E(T). The results are presented in Tables 2 and 3 and Figures 2 and 3.
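As a rough illustration of the testing-set computation described above, the following Python sketch evaluates the ratio $\sum_i R_i I\{A_i = D(X_i)\} / \sum_i I\{A_i = D(X_i)\}$ with the plug-in reward δT + (1 − δ)E(T). The function names, the toy data, and the one-covariate rule are hypothetical and are not taken from the trial; the imputed means stand in for values that would come from a fitted RIST model.

```python
from typing import Callable, Sequence


def plug_in_reward(delta: int, observed: float, imputed_mean: float) -> float:
    """delta*T + (1 - delta)*E(T): keep observed failures, impute censored ones."""
    return delta * observed + (1 - delta) * imputed_mean


def empirical_value(
    rewards: Sequence[float],
    treatments: Sequence[int],
    covariates: Sequence[Sequence[float]],
    rule: Callable[[Sequence[float]], int],
) -> float:
    """sum_i R_i I{A_i = D(X_i)} / sum_i I{A_i = D(X_i)} on a testing fold."""
    numerator, matches = 0.0, 0
    for r, a, x in zip(rewards, treatments, covariates):
        if a == rule(x):  # subject actually received the recommended arm
            numerator += r
            matches += 1
    if matches == 0:
        raise ValueError("no test subject received the recommended treatment")
    return numerator / matches


# Hypothetical testing fold (log survival weeks); values are made up.
log_weeks = [3.2, 4.1, 2.5, 3.9]        # observed log follow-up time
delta = [1, 0, 1, 0]                    # 1 = failure observed, 0 = censored
imputed = [3.4, 4.4, 2.9, 4.0]          # stand-in for model-based E(T)
A = [1, 1, -1, 1]                       # assigned treatment arms
X = [[0.2], [0.7], [0.9], [0.4]]        # a single hypothetical covariate


def rule(x: Sequence[float]) -> int:
    """Hypothetical estimated rule D(X)."""
    return 1 if x[0] < 0.5 else -1


rewards = [plug_in_reward(d, y, m) for d, y, m in zip(delta, log_weeks, imputed)]
value = empirical_value(rewards, A, X, rule)
print(round(value, 4))
```

Only subjects whose assigned arm agrees with the rule contribute, which is why the denominator is the number of matches and not the fold size.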

Table 2.

Analysis of non-small-cell lung cancer data: Mean (standard deviation) of value function

kernel	RIST-R₁	RIST-R₂	ICO	DR	Cox
Linear	3.641 (0.144)	3.641 (0.138)	3.633 (0.158)	3.590 (0.174)	3.582 (0.158)
Gaussian	3.611 (0.215)	3.615 (0.220)	3.302 (0.221)	3.470 (0.233)	3.582 (0.158)

RIST-R₁ and RIST-R₂: using the estimated R₁ and R₂ respectively as weights, while the conditional expectations are estimated using recursively imputed survival trees; ICO: inverse probability of censoring weighted learning; DR: doubly robust outcome weighted learning; Cox: Cox proportional hazards model using covariate-treatment interactions.

Table 3.

Analysis of non-small-cell lung cancer data: Mean (standard deviation) of a clinical measure

kernel	RIST-R₁	RIST-R₂	ICO	DR	Cox
Linear	3.603 (0.040)	3.606 (0.037)	3.598 (0.037)	3.601 (0.042)	3.646 (0.039)
Gaussian	3.511 (0.064)	3.514 (0.068)	3.451 (0.062)	3.456 (0.052)	3.646 (0.039)

Fig 2 — Boxplots of cross-validated value of survival weeks on the log scale. RIST-R₁ and RIST-R₂: using the estimated R₁ and R₂ respectively as weights, while the conditional expectations are estimated using recursively imputed survival trees; ICO: inverse probability of censoring weighted learning; DR: doubly robust outcome weighted learning.

Fig 3 — Boxplots of cross-validated value of survival weeks on the log scale. RIST-R₁ and RIST-R₂: using the estimated R₁ and R₂ respectively as weights, while the conditional expectations are estimated using recursively imputed survival trees; ICO: inverse probability of censoring weighted learning; DR: doubly robust outcome weighted learning.

The value function results are presented in Table 2 and Figure 2. Both proposed methods have higher values than the compared methods. Note that for the Gaussian kernel, our two new approaches are still better than Cox regression; however, inverse probability of censoring outcome weighted learning and doubly robust learning are not much different from Cox regression. The standard error is comparable among all four methods using the linear kernel. For the Gaussian kernel, the standard errors of the proposed methods and inverse probability of censoring weighted learning are similar. The standard error for the doubly robust method is slightly worse in this instance. Overall, the proposed methods seem to perform best.

The restricted log mean results are presented in Table 3 and Figure 3. Note that for the linear kernel, the medians of the proposed methods are higher than 3.6, while the medians of both inverse probability of censoring outcome weighted learning and doubly robust learning are lower. For the Gaussian kernel, the proposed methods are much better than inverse probability of censoring outcome weighted learning and doubly robust learning. Interestingly, under this measure, the performance of Cox regression is the best. A possible reason is that the true underlying model may not deviate much from the proportional hazards model, making the Cox model a better choice. This is also reflected by the fact that the results look similar to the simulation Scenario 2 plot, where the Cox model performs the best. Another possible reason is that the pseudo-outcome estimated from RIST may not be completely accurate and favors the Cox model in this particular dataset.

6. Discussion

We proposed a new method that redefines the reward function in a censored survival setting. The method works by replacing the censored observations (or all observations) by an estimated conditional expectation of the failure time. In practice, the failure time (or the logarithm of the failure time) is commonly used in defining the reward function R; however, this choice could be more flexible. For example, we may be interested in searching for a treatment rule that maximizes the median survival time or a certain quantile. Under our framework, this is achievable by replacing the censored observations with a suitable estimate of the quantile. This part of the work is currently under investigation.

The proposed methods may be improved or extended in multiple ways. The estimated treatment rule may be affected by the shift of the outcome. A potential extension is to combine our methods with residual weighted learning [27], which has been shown to reduce the total variation of the weights and improve stability. Trials with multiple treatment arms occur frequently, so another potential extension of our method is in the direction of multicategory classification [2, 15]. It is also interesting to extend our method to dynamic treatment regimes, where a sequence of decision rules [17, 24, 13, 25] needs to be learned in a censored survival outcome setting [8].

Acknowledgments

This research is supported in part by U.S. National Science Foundation grant DMS-1407732 and by U.S. National Institutes of Health grant P01 CA142538. We thank Yinqi Zhao for helpful conversations and suggestions. We thank the editor, associate editor, and reviewers for helpful comments which led to an improved manuscript.

Appendix

A simplified tree-based survival model used in Theorem 1

We consider a simplified version of a tree-based survival model. Starting from the root node [0, 1]^d, at each internal node we randomly choose a feature of X, say the j-th, to split the node, and the splitting point is always the midpoint of the range of the chosen feature. We repeat the splitting ⌈log₂ k_n⌉ times, where k_n is a deterministic parameter which we can control. Hence, each individual tree has exactly $2^{⌈ \log_{2} k_{n} ⌉}$ terminal nodes, which is approximately k_n. In practice, we always choose k_n to go to infinity as n goes to infinity.

After we build an individual tree, let $B_{i} (i = 1, 2, \dots, 2^{⌈ \log_{2} k_{n} ⌉})$ be the rectangular cells of the random partition. We treat the observations inside each leaf node as a group of homogeneous subjects and compute the Nelson-Aalen estimator $\hat{Λ} (\cdot | B_{i})$ for each leaf node B_i. Hence, our estimator is essentially

\hat{r}_{n}(\cdot, X, A) = \sum_{i = 1}^{2^{⌈ \log_{2} k_{n} ⌉}} I\{(X, A) \in B_{i}\} \hat{Λ}(\cdot | B_{i}) .
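The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it handles a single treatment arm (the proof treats the two arms separately), it assumes covariates in [0, 1)^d and no tied event times, and the helper names (`grow_partition`, `leaf_index`, `tree_cumhaz`) and the simulated data are hypothetical.

```python
import math
import random


def nelson_aalen(times, events, t):
    """Nelson-Aalen cumulative hazard at time t (assumes no tied times)."""
    at_risk = len(times)
    cumhaz = 0.0
    for y, d in sorted(zip(times, events)):
        if y > t:
            break
        if d:
            cumhaz += 1.0 / at_risk  # one failure among those still at risk
        at_risk -= 1
    return cumhaz


def grow_partition(d, k_n, rng):
    """Split [0, 1]^d at midpoints ceil(log2 k_n) times, one random
    coordinate per node; returns the leaf cells as lists of (low, high)."""
    depth = math.ceil(math.log2(k_n))
    cells = [[(0.0, 1.0)] * d]
    for _ in range(depth):
        next_cells = []
        for cell in cells:
            j = rng.randrange(d)            # randomly chosen feature
            low, high = cell[j]
            mid = (low + high) / 2.0        # always split at the midpoint
            left, right = list(cell), list(cell)
            left[j], right[j] = (low, mid), (mid, high)
            next_cells.extend([left, right])
        cells = next_cells
    return cells


def leaf_index(cells, x):
    """Index of the leaf cell containing x."""
    for i, cell in enumerate(cells):
        if all(low <= xi < high for xi, (low, high) in zip(x, cell)):
            return i
    raise ValueError("x must lie in [0, 1)^d")


def tree_cumhaz(cells, X, Y, delta, x_new, t):
    """Nelson-Aalen estimate using only the observations sharing x_new's leaf."""
    leaf = leaf_index(cells, x_new)
    members = [i for i, x in enumerate(X) if leaf_index(cells, x) == leaf]
    return nelson_aalen([Y[i] for i in members], [delta[i] for i in members], t)


# Hypothetical simulated data for one treatment arm.
rng = random.Random(0)
n, d, k_n = 400, 2, 8
X = [[rng.random() for _ in range(d)] for _ in range(n)]
T = [rng.expovariate(1.0 + x[0]) for x in X]   # failure times
C = [rng.expovariate(0.5) for _ in range(n)]   # censoring times
Y = [min(t, c) for t, c in zip(T, C)]
delta = [t <= c for t, c in zip(T, C)]

cells = grow_partition(d, k_n, rng)
r_hat = tree_cumhaz(cells, X, Y, delta, [0.3, 0.6], t=0.5)
print(len(cells), r_hat)
```

Because every split is at a midpoint, each of the $2^{⌈ \log_{2} k_{n} ⌉}$ leaves has the same volume, which is what makes the leaf counts binomial in Fact 2 below.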

Proof of Theorem 1

Proof

Since we always assume that the treatment variable A is important, and A has only two categories, we force a split on A at the root node. This is equivalent to fitting trees for A = 1 and A = −1 separately. In a balanced design, the problem reduces to estimating $\hat{r} (\cdot, X, 1)$ or $\hat{r} (\cdot, X, -1)$ with sample size n/2. Without risk of ambiguity, the following results are developed for $\hat{r} (\cdot, X)$ with sample size n, where the results can be applied to either A = 1 or −1. Our proof utilizes two facts from [1]:

Fact 1 Let K_nj{B_i} be the number of times the j-th coordinate (j = 1, …, d) is split on to reach the terminal node B_i, $(i = 1, 2, \dots, 2^{⌈ \log_{2} k_{n} ⌉})$ . Conditionally on X, K_nj{B_i} is Binomial(⌈log₂k_n⌉, 1/d). Moreover, $\sum_{j = 1}^{d} K_{n j}\{B_{i}\} = ⌈ \log_{2} k_{n} ⌉$ .
Fact 2 Let N_n(B_i) be the number of data points falling in the cell B_i, $(i = 1, 2, \dots, 2^{⌈ \log_{2} k_{n} ⌉})$ . Conditionally on Θ, N_n(B_i) follows $Binomial (n, 2^{- ⌈ \log_{2} k_{n} ⌉})$ .

The following lemma, for later reference, provides the deterministic limit of the Nelson-Aalen estimator in the independent non-identically distributed case. The proof can be found in an unpublished technical report by Mai Zhou at the University of Kentucky.

Lemma 4

Suppose we have two sets of non-negative random variables: T₁, T₂, …, T_n, which are survival times, independent but non-identically distributed with continuous distributions F₁(t), F₂(t), …, F_n(t); and C₁, C₂, …, C_n, which are censoring times, independent but non-identically distributed with continuous distributions G₁(t), G₂(t), …, G_n(t). We also assume the T_i's and C_i's are independent. Let $\hat{Λ} (t)$ be the Nelson-Aalen estimator based on the data Y_i = min(T_i, C_i), δ_i = I(T_i ≤ C_i). Under Assumption 1, for b = n^ζ, where ζ > 0,

pr \left( \sup_{t < τ} \left| \hat{Λ}(t) - \int_{0}^{t} \frac{\sum_{i} \{1 - G_{i}(s)\} d F_{i}(s)}{\sum_{i} \{1 - G_{i}(s)\} \{1 - F_{i}(s)\}} \right| > \frac{(1152 b)^{1 / 2}}{n^{1 / 2} M^{2}} \right) < 16 (n + 2) e^{- b} .

(8)
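To make the deterministic limit in Lemma 4 concrete, here is a small simulation sketch. It assumes exponential failure times with four different hypothetical rates and a common exponential censoring rate, so that the factor 1 − G_i(s) cancels in the ratio and the limit reduces to −log of the average survival function of the T_i's. All numerical settings are illustrative choices, not values from the paper.

```python
import math
import random


def nelson_aalen(times, events, t):
    """Nelson-Aalen cumulative hazard at time t (assumes no tied times)."""
    at_risk = len(times)
    cumhaz = 0.0
    for y, d in sorted(zip(times, events)):
        if y > t:
            break
        if d:
            cumhaz += 1.0 / at_risk
        at_risk -= 1
    return cumhaz


rng = random.Random(1)
n = 4000
rates = [0.5 + 0.25 * (i % 4) for i in range(n)]   # non-identical hazards (hypothetical)
censor_rate = 0.5                                   # common censoring rate (hypothetical)

T = [rng.expovariate(a) for a in rates]
C = [rng.expovariate(censor_rate) for _ in range(n)]
Y = [min(t, c) for t, c in zip(T, C)]
delta = [t <= c for t, c in zip(T, C)]


def limit_cumhaz(t):
    """Closed form of the limit for this design: with a common censoring
    distribution, sum_i (1-G) dF_i / sum_i (1-G)(1-F_i) integrates to
    -log(mean_i exp(-a_i t))."""
    mean_survival = sum(math.exp(-a * t) for a in rates) / n
    return -math.log(mean_survival)


grid = [0.1 * m for m in range(1, 11)]
max_gap = max(abs(nelson_aalen(Y, delta, t) - limit_cumhaz(t)) for t in grid)
print(max_gap)
```

Note that the limit is not the average of the individual cumulative hazards; it is the cumulative hazard of the mixture, which is why the proof later needs a separate step to compare it with the hazard at a single point X.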

Now we start the proof of Theorem 1. Let the limit of the Nelson-Aalen estimator inside the cell B_i, $(i = 1, 2, \dots, 2^{⌈ \log_{2} k_{n} ⌉})$ be

Λ^{*}(t | B_{i}) = \int_{0}^{t} \frac{\sum_{X_{j} \in B_{i}} \{1 - G_{j}(s)\} d F_{j}(s)}{\sum_{X_{j} \in B_{i}} \{1 - G_{j}(s)\} \{1 - F_{j}(s)\}} .

For any t < τ, in order to bound the $| {\hat{r}}_{n} (t, X) - r (t, X) |$ , we define

r_{n}^{*}(t, X) = \sum_{i = 1}^{2^{⌈ \log_{2} k_{n} ⌉}} I\{X \in B_{i}\} Λ^{*}(t | B_{i}) .

Then, by the triangle inequality, $| {\hat{r}}_{n} (t, X) - r (t, X) |$ can be bounded as

| \hat{r}_{n}(t, X) - r(t, X) | \leq | \hat{r}_{n}(t, X) - r_{n}^{*}(t, X) | + | r_{n}^{*}(t, X) - r(t, X) | .

(9)

We start with the first term in Equation (9). From Fact 2, we know the number of observations in each terminal node is $Binomial (n, 2^{- ⌈ \log_{2} k_{n} ⌉})$ . By the Chernoff bound, with probability larger than $1 - e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}}$ , a terminal node contains at least $(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}$ observations for some 0 < u < 1.

Combining this with Equation (8), with probability larger than $1 - 16 [(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1} + 2] e^{- b} - e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}}$ , the following inequality holds:

| \hat{r}_{n}(t, X) - r_{n}^{*}(t, X) | \leq \sum_{i = 1}^{2^{⌈ \log_{2} k_{n} ⌉}} I\{X \in B_{i}\} (1152 b)^{1 / 2} \{(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}\}^{- 1 / 2} M^{- 2}

(10)

= (1152 b)^{1 / 2} \{(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}\}^{- 1 / 2} M^{- 2} .

(11)
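For a sense of scale, the quantities appearing in (10) and (11) can be evaluated numerically. The sketch below simply plugs numbers into the expressions as written; the function name and the particular values of n, k_n, b, u and M are hypothetical, and M is taken to be the lower-bound constant from Assumption 1.

```python
import math


def first_term_bound(n, k_n, b, u, M):
    """Evaluate the pieces of the bound on |r_hat_n - r_n^*|:
    the guaranteed leaf size, the deviation in (11), and the
    probability that the bound fails."""
    depth = math.ceil(math.log2(k_n))
    leaf_n = (1 - u) * n * 2.0 ** (-depth - 1)          # minimum leaf size
    deviation = math.sqrt(1152 * b) / (math.sqrt(leaf_n) * M ** 2)
    failure_prob = (16 * (leaf_n + 2) * math.exp(-b)
                    + math.exp(-u ** 2 * n * 2.0 ** (-depth - 1)))
    return leaf_n, deviation, failure_prob


# Hypothetical settings chosen only for illustration.
leaf_n, deviation, failure_prob = first_term_bound(
    n=10**6, k_n=16, b=30.0, u=0.5, M=0.5
)
print(leaf_n, deviation, failure_prob)
```

The constants are conservative, so the deviation is informative only for large leaf sizes; the point of the bound is its rate, which shrinks like the inverse square root of the leaf size when b grows slowly with n.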

Before we bound the second term in Equation (9), we first bound the difference between the true cumulative hazard function and the aggregated estimator inside the cell B_i, $(i = 1, 2, \dots, 2^{⌈ \log_{2} k_{n} ⌉})$ , i.e., |I{X ∈ B_i}{Λ*(t | B_i) − Λ(t | X)}|.

From Fact 1, we know that K_nj{B_i}, the number of times the j-th coordinate (j = 1, ⋯, d) is split on to reach the terminal node B_i, is Binomial(⌈log₂k_n⌉, 1/d). By the Chernoff bound, $P (K_{n j}\{B_{i}\} \leq (1 - r) ⌈ \log_{2} k_{n} ⌉ / d) \leq e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)}$ for some 0 < r < 1. So with probability ${(1 - e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)})}^{d} \geq 1 - d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)}$ , every side of B_i has length less than $2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d}$ . So with probability larger than $1 - 2^{⌈ \log_{2} k_{n} ⌉} d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)}$ , for arbitrary i, $(i = 1, 2, \dots, 2^{⌈ \log_{2} k_{n} ⌉})$ , we have

\max_{X_{1}, X_{2} \in B_{i}} ‖ X_{1} - X_{2} ‖ \leq d^{1 / 2} 2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} .

So for all the observations X_j inside the same cell as X, by Assumption 2, we have

| F_{X} (\cdot) - F_{j} (\cdot) | \leq L d^{1 / 2} 2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d},

| f_{X} (\cdot) - f_{j} (\cdot) | \leq (L^{'} + L^{2}) d^{1 / 2} 2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d},

where f_X(·) and F_X(·) denote the true density function and distribution function at X, respectively. Then Λ*(t | B_i) has the upper bound and lower bound

\int_{0}^{t} [f_{X}(s) + b_{1}] / [1 - F_{X}(s) - b_{2}] d s and \int_{0}^{t} [f_{X}(s) - b_{1}] / [1 - F_{X}(s) + b_{2}] d s,

respectively, where

b_{1} = (L^{'} + L^{2}) d^{1 / 2} 2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} and b_{2} = L d^{1 / 2} 2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} .

Hence, |I{X ∈ B_i}{Λ*(t | B_i) − Λ(t | X)}| has the bound

\int_{0}^{t} \frac{b_{1} \{1 - F_{X}(s)\} + b_{2} f_{X}(s)}{\{1 - F_{X}(s) - b_{2}\} \{1 - F_{X}(s)\}} d s \leq C τ d^{1 / 2} 2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d},

where C is some constant depending on L and L′. We then bound the second term of Equation (9) as follows:

| r_{n}^{*}(t, X) - r(t, X) | \leq \sum_{i = 1}^{2^{⌈ \log_{2} k_{n} ⌉}} I\{X \in B_{i}\} | Λ^{*}(t | B_{i}) - Λ(t | X) | \leq C τ d^{1 / 2} 2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} .

(12)

Combining Equations (10) and (12), for each X, we have

pr [\sup_{t < τ} | \hat{r}_{n}(t, X) - r(t, X) | \leq C τ d^{1 / 2} 2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + (1152 b)^{1 / 2} \{(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}\}^{- 1 / 2} M^{- 2}] \geq 1 - w_{n},

where

w_{n} = 16 [(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1} + 2] e^{- b} + e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}} + d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)} .

This completes the proof. □

Proof of Theorem 2

Proof

Based on Theorem 1, we now only need to establish the bound of $| {\hat{r}}_{n} (t, X, A) - r (t, X, A) |$ under the event with small probability w_n. Noticing that ${\hat{r}}_{n} (t, X, A)$ is simply the Nelson-Aalen estimator of the cumulative hazard function with at most n terms, for any t < τ we have

{\hat{r}}_{n} (t, X, A) \leq \frac{1}{n} + \dots + \frac{1}{1} = O (\ln (n)),

which implies that

| {\hat{r}}_{n} (t, X, A) - r (t, X, A) | \leq O (\ln (n)) .

Combining this with Theorem 1 completes the proof. □
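The logarithmic bound used above rests on a standard comparison between the harmonic sum and an integral. As a brief sketch, assuming no tied event times (so that the at-risk counts are distinct integers between 1 and n and each Nelson-Aalen increment is the reciprocal of one of them):

```latex
\hat{r}_{n}(t, X, A) \;\le\; \sum_{m=1}^{n} \frac{1}{m}
  \;\le\; 1 + \int_{1}^{n} \frac{dx}{x}
  \;=\; 1 + \ln n .
```

If the true cumulative hazard r(t, X, A) is bounded for t < τ, which we take here as an assumption implied by the conditions of the theorem, the difference between the estimator and r is then also of order ln n on the exceptional event.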

Proof of Lemma 2

Proof

Our survival function estimator is $\hat{S} (t) = e^{- \hat{Λ} (t)}$ . From Theorem 1, we know that for any t < τ,

pr (| \hat{S}(t | X, A) - S(t | X, A) | \leq C [2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + (b / \{(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}\})^{1 / 2}]) \geq 1 - 16 [(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1} + 2] e^{- b} - e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}} - d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)} .

It is then easy to see that for R₁,

| \hat{E}(T | X, A) - E(T | X, A) | = | \int_{0}^{τ} \hat{S}(t | X, A) d t - \int_{0}^{τ} S(t | X, A) d t | \leq \int_{0}^{τ} | \hat{S}(t | X, A) - S(t | X, A) | d t \leq τ C [2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + (b / \{(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}\})^{1 / 2}],

with probability larger than 1 − w_n. And for reward R₂, we have

| \hat{E}(T | X, A, T > Y, Y) - E(T | X, A, T > Y, Y) | = | \int_{Y}^{τ} \{\hat{S}(t | X, A) / \hat{S}(Y | X, A)\} d t - \int_{Y}^{τ} \{S(t | X, A) / S(Y | X, A)\} d t | \leq | \int_{Y}^{τ} \{\hat{S}(t | X, A) / \hat{S}(Y | X, A)\} d t - \int_{Y}^{τ} \{\hat{S}(t | X, A) / S(Y | X, A)\} d t | + | \int_{Y}^{τ} \{\hat{S}(t | X, A) / S(Y | X, A)\} d t - \int_{Y}^{τ} \{S(t | X, A) / S(Y | X, A)\} d t | .

Note that the distance between $\hat{S} (Y | X, A)$ and S(Y|X, A) can also be bounded with probability no less than 1 − w_n. The right-hand side of the last display is therefore further bounded above by

(1 / M^{2} + 1 / M) \int_{Y}^{τ} | \hat{S}(Y | X, A) - S(Y | X, A) | d t \leq C_{2} [2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + (b / \{(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}\})^{1 / 2}],

for some constant C₂ with probability larger than 1 − 2w_n. □
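The two integrals that appear in this proof are straightforward to compute once a cumulative hazard estimate is available as a step function. The following Python sketch is an illustration under stated assumptions: the jump times and hazard values are toy numbers, the helper names are hypothetical, and the survival curve is treated as right-continuous with value 1 before the first jump.

```python
import math


def survival_steps(jump_times, cumhaz):
    """S(t) = exp(-cumulative hazard) as (jump time, value after jump) pairs."""
    return [(t, math.exp(-h)) for t, h in zip(jump_times, cumhaz)]


def survival_at(steps, y):
    """Value of the step function S at time y."""
    value = 1.0
    for t, s in steps:
        if t > y:
            break
        value = s
    return value


def integrate_survival(steps, lower, tau):
    """Integral of S(t) over [lower, tau] for a step function S."""
    total, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in list(steps) + [(tau, 0.0)]:   # sentinel closes the last piece
        right = min(t, tau)
        left = max(prev_t, lower)
        if right > left:
            total += prev_s * (right - left)
        prev_t, prev_s = right, s
        if right >= tau:
            break
    return total


def restricted_mean(steps, tau):
    """Integral of S over [0, tau], the quantity used for R1."""
    return integrate_survival(steps, 0.0, tau)


def tail_term(steps, y, tau):
    """Integral of S(t)/S(y) over [y, tau], the integral shown for R2.
    By the usual identity, adding y to this term gives the conditional
    mean of min(T, tau) given T > y."""
    return integrate_survival(steps, y, tau) / survival_at(steps, y)


# Toy cumulative hazard: S drops to 1/2 at t = 1 and to 1/4 at t = 2.
steps = survival_steps([1.0, 2.0], [math.log(2.0), math.log(4.0)])
tau = 3.0
r1 = restricted_mean(steps, tau)
tail = tail_term(steps, 1.5, tau)
print(r1, tail)
```

Since both quantities are integrals of the estimated survival curve over a bounded interval, a uniform error bound on the curve translates directly into an error bound on the imputed reward, which is the content of Lemma 2.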

Proof of Theorem 3

Proof

We restate the value function corresponding to the true and working model as

V(f) = E(R I[A = sign\{f(X)\}] / π(A; X))

and V^{'}(f) = E(\hat{R} I[A = sign\{f(X)\}] / π(A; X)),

respectively. Then we have

V(f^{*}) - V(\hat{f}_{n}) \leq V(f^{*}) - \sup_{f \in ℱ} V^{'}(f) + \sup_{f \in ℱ} V^{'}(f) - V^{'}(\hat{f}_{n}) + V^{'}(\hat{f}_{n}) - V(\hat{f}_{n}) \leq V(f^{*}) - V^{'}(f^{*}) + \sup_{f \in ℱ} V^{'}(f) - V^{'}(\hat{f}_{n}) + V^{'}(\hat{f}_{n}) - V(\hat{f}_{n}) \leq \sup_{f \in ℱ} V^{'}(f) - V^{'}(\hat{f}_{n}) + 2 \sup_{f \in ℱ} | V(f) - V^{'}(f) | .

(13)

We start with the first term in Equation (13). From Lemma 1, we know that $\sup_{f \in ℱ} V^{'} (f) - V^{'} ({\hat{f}}_{n}) = V^{'} (\tilde{f}) - V^{'} ({\hat{f}}_{n})$ , where $\tilde{f} = \arg \min_{f \in ℋ_{k}} E\{L_{ϕ}(f)\}$ .

Let ${\tilde{f}}_{λ_{n}} = \arg \min_{f \in ℋ_{k}} [E\{R ϕ\{A f(X)\} / π(A; X)\} + λ_{n} ‖ f ‖_{k}^{2}]$ , then

n^{- 1} \sum_{i = 1}^{n} \frac{\hat{R} ϕ\{A_{i} \hat{f}(X_{i})\}}{π(A_{i}; X_{i})} + λ_{n} ‖ \hat{f} ‖_{k}^{2} \leq n^{- 1} \sum_{i = 1}^{n} \frac{\hat{R} ϕ\{A_{i} \tilde{f}_{λ_{n}}(X_{i})\}}{π(A_{i}; X_{i})} + λ_{n} ‖ \tilde{f}_{λ_{n}} ‖_{k}^{2} .

(14)

By the definition of a(λ), we have

a(λ_{n}) = [E\{L_{ϕ}(\tilde{f}_{λ_{n}})\} + λ_{n} ‖ \tilde{f}_{λ_{n}} ‖_{k}^{2} - E\{L_{ϕ}(\tilde{f})\}],

and by Theorem 3.2 in [23], we further have

V(\tilde{f}) - V(\hat{f}) \leq E\{L_{ϕ}(\hat{f})\} - E\{L_{ϕ}(\tilde{f})\} \leq E\{L_{ϕ}(\hat{f})\} - E\{L_{ϕ}(\tilde{f}_{λ_{n}})\} - λ_{n} ‖ \tilde{f}_{λ_{n}} ‖_{k}^{2} + E\{L_{ϕ}(\tilde{f}_{λ_{n}})\} - E\{L_{ϕ}(\tilde{f})\} + λ_{n} ‖ \tilde{f}_{λ_{n}} ‖_{k}^{2} \leq E\{L_{ϕ}(\hat{f})\} - E\{L_{ϕ}(\tilde{f}_{λ_{n}})\} - λ_{n} ‖ \tilde{f}_{λ_{n}} ‖_{k}^{2} + λ_{n} ‖ \hat{f} ‖_{k}^{2} + a(λ_{n}) .

Combined with (14),

V(\tilde{f}) - V(\hat{f}) \leq a(λ_{n}) + E [\frac{R ϕ\{A \hat{f}(X)\}}{π(A; X)} - \frac{\hat{R} ϕ\{A \hat{f}(X)\}}{π(A; X)}] + E [\frac{\hat{R} ϕ\{A \tilde{f}_{λ_{n}}(X)\}}{π(A; X)} - \frac{R ϕ\{A \tilde{f}_{λ_{n}}(X)\}}{π(A; X)}] - (n^{- 1} \sum_{i = 1}^{n} [λ_{n} ‖ \tilde{f}_{λ_{n}} ‖_{k}^{2} - \frac{\hat{R} ϕ\{A_{i} \tilde{f}_{λ_{n}}(X_{i})\}}{π(A_{i}; X_{i})}] + E [λ_{n} ‖ \hat{f} ‖_{k}^{2} + \frac{\hat{R} ϕ\{A \hat{f}(X)\}}{π(A; X)} - λ_{n} ‖ \tilde{f}_{λ_{n}} ‖_{k}^{2} - \frac{\hat{R} ϕ\{A \tilde{f}_{λ_{n}}(X)\}}{π(A; X)}]) = a(λ_{n}) + (I) + (II) + (III) .

Since

n^{- 1} \sum_{i = 1}^{n} \frac{\hat{R} ϕ\{A_{i} \hat{f}(X_{i})\}}{π(A_{i}; X_{i})} + λ_{n} ‖ \hat{f} ‖_{k}^{2} \leq n^{- 1} \sum_{i = 1}^{n} \frac{\hat{R} ϕ(0)}{π(A_{i}; X_{i})} = n^{- 1} \sum_{i = 1}^{n} \frac{\hat{R}}{π(A_{i}; X_{i})},

and the estimated reward $\hat{R}$ is bounded by τ, we know that $‖ \hat{f} ‖_{k} \leq τ^{1 / 2} λ_{n}^{- 1 / 2}$ . Furthermore, since

λ_{n} ‖ \tilde{f}_{λ_{n}} ‖_{k}^{2} \leq \inf_{f \in ℋ_{k}} λ_{n} ‖ f ‖_{k}^{2} + E [\frac{R ϕ\{A f(X)\}}{π(A; X)}] \leq E [\frac{R ϕ(0)}{π(A; X)}],

we have $‖ {\tilde{f}}_{λ_{n}} ‖_{k} \leq τ^{1 / 2} λ_{n}^{- 1 / 2}$ . Combining with Lemma 2, (I) and (II) are bounded by $C_{1} λ_{n}^{- 1 / 2} \{2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + (b / \{(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}\})^{1 / 2} + 16 \ln n [(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1} + 2] e^{- b} + e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}} \ln n + d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)} \ln n\}$ for both R₁ and R₂, where C₁ is some constant. Following the results in [26], (III) is bounded by $M_{v} {(n λ_{n} / c_{n})}^{- 2 / (v + 2)} + M_{v} λ_{n}^{- 1 / 2} {(c_{n} / n)}^{2 / (d + 2)} + K ρ {(n λ_{n})}^{- 1} + 2 K ρ n^{- 1} λ_{n}^{- 1 / 2}$ with probability larger than $1 - 2 e^{- ρ}$ , where M_v is a constant depending on v and K is a sufficiently large positive constant. Finally, combining (I), (II) and (III), we have

pr (\sup_{f \in ℱ} V^{'} (f) \leq V^{'} ({\hat{f}}_{n}) + ε_{1}) \geq 1 - 2 e^{- ρ},

(15)

where $ε_{1} = a (λ_{n}) + M_{v} {(n λ_{n} / c_{n})}^{- 2 / (v + 2)} + M_{v} λ_{n}^{- 1 / 2} {(c_{n} / n)}^{2 / (d + 2)} + K ρ {(n λ_{n})}^{- 1} + 2 K ρ n^{- 1} λ_{n}^{- 1 / 2} + C_{1} λ_{n}^{- 1 / 2} \{2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + (b / \{(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}\})^{1 / 2} + 16 \ln n [(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1} + 2] e^{- b} + e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}} \ln n + d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)} \ln n\}$ .

For the second part in Equation (13),

V(f) - V^{'}(f) = E (\frac{R I[A = sign\{f(X)\}]}{π(A; X)}) - E (\frac{\hat{R} I[A = sign\{f(X)\}]}{π(A; X)}) = E (\{E(T | X, A) - \hat{E}(T | X, A)\} \frac{I[A = sign\{f(X)\}]}{π(A; X)})

if R = R₁. For R = R₂, we have

V(f) - V^{'}(f) = E ((1 - δ) \{E(T | X, A, T > Y, Y) - \hat{E}(T | X, A, T > Y, Y)\} \frac{I[A = sign\{f(X)\}]}{π(A; X)}) .

By Lemma 2,

\sup_{f \in ℱ} | V(f) - V^{'}(f) | \leq C_{2} λ_{n}^{- 1 / 2} \{2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + (b / \{(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}\})^{1 / 2} + 16 \ln n [(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1} + 2] e^{- b} + e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}} \ln n + d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)} \ln n\}

(16)

where C₂ is some constant. Now, combining (15) and (16) we have

pr (V(f^{*}) \leq V(\hat{f}_{n}) + ε) \geq 1 - 2 e^{- ρ},

where

ε = a(λ_{n}) + M_{v} {(n λ_{n} / c_{n})}^{- 2 / (v + 2)} + M_{v} λ_{n}^{- 1 / 2} {(c_{n} / n)}^{2 / (d + 2)} + K ρ {(n λ_{n})}^{- 1} + 2 K ρ n^{- 1} λ_{n}^{- 1 / 2} + C λ_{n}^{- 1 / 2} \{2^{- {(1 - r) ⌈ \log_{2} k_{n} ⌉} / d} + (b / \{(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}\})^{1 / 2} + 16 \ln n [(1 - u) n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1} + 2] e^{- b} + e^{- u^{2} n 2^{- ⌈ \log_{2} k_{n} ⌉ - 1}} \ln n + d e^{- ⌈ \log_{2} k_{n} ⌉ r^{2} / (2 d)} \ln n\} .

This completes the proof. □
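The regularized empirical objective that appears in (14) can be written out directly for a linear decision function. The sketch below is a minimal illustration, assuming ϕ is the hinge loss (consistent with ϕ(0) = 1 as used in the proof) and taking the squared norm of a linear f to be the squared Euclidean norm of its coefficient vector; the function names and the two-subject toy data are hypothetical.

```python
def hinge(z):
    """Hinge loss phi(z) = max(0, 1 - z); note phi(0) = 1."""
    return max(0.0, 1.0 - z)


def owl_objective(beta, intercept, X, A, R_hat, propensity, lam):
    """n^{-1} sum_i R_hat_i * phi(A_i f(X_i)) / pi_i + lam * ||beta||^2
    for the linear rule f(x) = intercept + beta . x."""
    n = len(X)
    loss = 0.0
    for x, a, r, p in zip(X, A, R_hat, propensity):
        f_x = intercept + sum(b * xi for b, xi in zip(beta, x))
        loss += r * hinge(a * f_x) / p      # reward-weighted hinge loss
    penalty = lam * sum(b * b for b in beta)
    return loss / n + penalty


# Hypothetical two-subject example from a balanced randomized trial.
X = [[1.0], [-1.0]]
A = [1, -1]
R_hat = [2.0, 1.0]            # imputed rewards, e.g. R1 or R2
propensity = [0.5, 0.5]       # pi(A; X) in a 1:1 randomized design
lam = 0.1

obj_fit = owl_objective([0.5], 0.0, X, A, R_hat, propensity, lam)
obj_zero = owl_objective([0.0], 0.0, X, A, R_hat, propensity, lam)
print(obj_fit, obj_zero)
```

Evaluating the objective at f = 0 returns the average of the rewards divided by the propensities, which is exactly the comparison used in the proof to bound the norm of the fitted decision function.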

Additional simulation results for different censoring rates

We summarize the additional simulation results in this section. For each simulation scenario considered in Section 4, we alter the first constant term in the censoring distribution to achieve 30% (Table 4 and Figure 4), and 60% (Table 5 and Figure 5) censoring rates.

Table 4.

Simulation results: Mean (×10³) and (sd) (×10³). Censoring rate: 30%. For each scenario, the theoretical optimal value (×10³) is 31, 181, 1079, and −389, respectively.

scenario	kernel	T	RIST-R₁	RIST-R₂	ICO	DR	Cox
1	Linear	0 (26)	1 (31)	2 (28)	−10 (40)	−20 (63)	−26 (33)
1	Gaussian	−17 (44)	−10 (34)	−7 (37)	−18 (45)	−48 (65)	−26 (33)
2	Linear	22 (113)	17 (105)	−14 (126)	−110 (136)	−193 (133)	65 (63)
2	Gaussian	−39 (115)	−25 (101)	−62 (113)	−164 (119)	−285 (112)	65 (63)
3	Linear	785 (52)	768 (53)	771 (52)	737 (95)	667 (124)	763 (61)
3	Gaussian	896 (61)	810 (54)	854 (69)	817 (124)	679 (123)	763 (61)
4	Linear	−453 (37)	−465 (46)	−448 (27)	−461 (42)	−471 (54)	−457 (32)
4	Gaussian	−465 (35)	−477 (42)	−456 (27)	−474 (41)	−505 (48)	−457 (32)

Fig 4 — Boxplots of mean log survival time for different treatment regimes. Censoring rate: 30%. T: using true survival time as weight; RIST-R₁ and RIST-R₂: using the estimated R₁ and R₂ respectively as weights, while the conditional expectations are estimated using recursively imputed survival trees; ICO: inverse probability of censoring weighted learning; DR: doubly robust outcome weighted learning. The black horizontal line is the theoretical optimal value.

Table 5.

Simulation results: Mean (×10³) and (sd) (×10³). Censoring rate: 60%. For each scenario, the theoretical optimal value (×10³) is 31, 181, 1079, and −389, respectively.

scenario	kernel	T	RIST-R₁	RIST-R₂	ICO	DR	Cox
1	Linear	0 (26)	−2 (39)	−5 (43)	−29 (57)	−64 (92)	−34 (36)
1	Gaussian	−17 (44)	−12 (40)	−12 (45)	−35 (55)	−144 (78)	−34 (36)
2	Linear	22 (113)	−36 (123)	−61 (135)	−138 (133)	−248 (129)	31 (79)
2	Gaussian	−39 (115)	−69 (108)	−102 (115)	−165 (117)	−313 (101)	31 (79)
3	Linear	785 (52)	753 (77)	748 (69)	646 (104)	556 (94)	721 (70)
3	Gaussian	896 (61)	796 (63)	819 (67)	775 (106)	573 (93)	721 (70)
4	Linear	−453 (37)	−478 (55)	−458 (33)	−486 (55)	−492 (59)	−480 (43)
4	Gaussian	−465 (35)	−492 (48)	−461 (29)	−513 (53)	−551 (38)	−480 (43)

Fig 5 — Boxplots of mean log survival time for different treatment regimes. Censoring rate: 60%. T: using true survival time as weight; RIST-R₁ and RIST-R₂: using the estimated R₁ and R₂ respectively as weights, while the conditional expectations are estimated using recursively imputed survival trees; ICO: inverse probability of censoring weighted learning; DR: doubly robust outcome weighted learning. The black horizontal line is the theoretical optimal value.

Contributor Information

Yifan Cui, Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.

Ruoqing Zhu, Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA.

Michael Kosorok, Department of Biostatistics and Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.

References

1. Biau G. Analysis of a random forests model. Journal of Machine Learning Research. 2012;13:1063–1095.
2. Bredensteiner EJ, Bennett KP. Computational Optimization. Springer; 1999. Multicategory classification by support vector machines; pp. 53–79.
3. Breiman L. Random forests. Machine learning. 2001;45:5–32.
4. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC press; 1984.
5. Chang CC, Lin CJ. Libsvm: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2011;2:27.
6. Geng Y, Zhang HH, Lu W. On optimal treatment regimes selection for mean survival time. Statistics in medicine. 2015;34:1169–1184. doi: 10.1002/sim.6397.
7. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Machine learning. 2006;63:3–42.
8. Goldberg Y, Kosorok MR. Q-learning with censored data. Annals of statistics. 2012;40:529. doi: 10.1214/12-AOS968.
9. Hothorn T, Lausen B, Benner A, Radespiel-Tröger M. Bagging survival trees. Statistics in medicine. 2004;23:77–91. doi: 10.1002/sim.1593.
10. Ishwaran H, Kogalur UB. Consistency of random survival forests. Statistics & probability letters. 2010;80:1056–1064. doi: 10.1016/j.spl.2010.02.020.
11. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. The annals of applied statistics. 2008:841–860.
12. Laber E, Zhao Y. Tree-based methods for individualized treatment regimes. Biometrika. 2015;102:501–514. doi: 10.1093/biomet/asv028.
13. Laber EB, Lizotte DJ, Qian M, Pelham WE, Murphy SA. Dynamic treatment regimes: Technical challenges and applications. Electronic journal of statistics. 2014;8:1225. doi: 10.1214/14-ejs920.
14. LeBlanc M, Crowley J. Relative risk trees for censored survival data. Biometrics. 1992:411–425.
15. Lee Y, Lin Y, Wahba G. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association. 2004;99:67–81.
16. Ma J, Hobbs BP, Stingo FC. Statistical methods for establishing personalized treatment rules in oncology. BioMed research international. 2015:2015. doi: 10.1155/2015/670691.
17. Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2003;65:331–355.
18. Qian M, Murphy SA. Performance guarantees for individualized treatment rules. Annals of statistics. 2011;39:1180. doi: 10.1214/10-AOS864.
19. Socinski MA, Schell MJ, Peterman A, Bakri K, Yates S, Gitten R, Unger P, Lee J, Lee JH, Tynan M, et al. Phase iii trial comparing a defined duration of therapy versus continuous therapy followed by second-line therapy in advanced-stage iiib/iv non–small-cell lung cancer. Journal of Clinical Oncology. 2002;20:1335–1343. doi: 10.1200/JCO.2002.20.5.1335.
20. Steinwart I, Scovel C. Fast rates for support vector machines using gaussian kernels. The Annals of Statistics. 2007:575–607.
21. Tian L, Zhao L, Wei L. Predicting the restricted mean event time with the subject’s baseline covariates in survival analysis. Biostatistics. 2014;15:222–233. doi: 10.1093/biostatistics/kxt050.
22. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012;68:1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
23. Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association. 2012;107:1106–1118. doi: 10.1080/01621459.2012.695674.
24. Zhao Y, Zeng D, Socinski MA, Kosorok MR. Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics. 2011;67:1422–1433. doi: 10.1111/j.1541-0420.2011.01572.x.
25. Zhao YQ, Zeng D, Laber EB, Kosorok MR. New statistical learning methods for estimating optimal dynamic treatment regimes. Journal of the American Statistical Association. 2015;110:583–598. doi: 10.1080/01621459.2014.937488.
26. Zhao YQ, Zeng D, Laber EB, Song R, Yuan M, Kosorok MR. Doubly robust learning for estimating individualized treatment with censored data. Biometrika. 2015;102:151–168. doi: 10.1093/biomet/asu050.
27. Zhou X, Mayer-Hamblett N, Khan U, Kosorok MR. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association. 2015:00–00. doi: 10.1080/01621459.2015.1093947.
28. Zhu R, Kosorok MR. Recursively imputed survival trees. Journal of the American Statistical Association. 2012;107:331–340. doi: 10.1080/01621459.2011.637468.

