Author manuscript; available in PMC: 2018 Sep 1.
Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2016 Sep 2;79(4):1165–1185. doi: 10.1111/rssb.12201

On Estimation of Optimal Treatment Regimes For Maximizing t-Year Survival Probability

Runchao Jiang 1, Wenbin Lu 1, Rui Song 1, Marie Davidian 1
PMCID: PMC5624740  NIHMSID: NIHMS810928  PMID: 28983189

Summary

A treatment regime is a deterministic function that dictates personalized treatment based on patients’ individual prognostic information. There is increasing interest in finding optimal treatment regimes, which determine treatment at one or more treatment decision points so as to maximize expected long-term clinical outcome, where larger outcomes are preferred. For chronic diseases such as cancer or HIV infection, survival time is often the outcome of interest, and the goal is to select treatment to maximize survival probability. We propose two nonparametric estimators for the survival function of patients following a given treatment regime involving one or more decisions, i.e., the so-called value. Based on data from a clinical or observational study, we estimate an optimal regime by maximizing these estimators for the value over a prespecified class of regimes. Because the value function is very jagged, we introduce kernel smoothing within the estimator to improve performance. Asymptotic properties of the proposed estimators of value functions are established under suitable regularity conditions, and simulation studies evaluate the finite-sample performance of the proposed regime estimators. The methods are illustrated by application to data from an AIDS clinical trial.

Keywords: Inverse probability weighted estimation, Kaplan-Meier estimator, optimal treatment regime, personalized medicine, survival probability, value function

1. Introduction

For many complex diseases, such as cancer, HIV infection, and mental disorders, there is generally not a uniformly best treatment for all patients. Rather, different patients may benefit from different treatments due to individual heterogeneity. For example, in AIDS Clinical Trials Group (ACTG) Study 175 (Hammer et al., 1996), the primary composite outcome of interest was time to having a larger than 50% decline in CD4 count, a measure of immunological status; progression to AIDS; or death. For the comparison of two treatments, zidovudine plus didanosine (coded as 1) and zidovudine plus zalcitabine (coded as 0), the data suggest that zidovudine plus zalcitabine leads to more favorable outcomes for younger patients than zidovudine plus didanosine. Figure 1 shows treatment-specific Kaplan-Meier estimates of the survival function for the two age strata defined by the observed median age, 34 years, in ACTG 175. It is clear that, among younger patients with age ≤ 34, those receiving zidovudine plus zalcitabine have almost uniformly larger survival probabilities than those receiving zidovudine plus didanosine, whereas the situation is reversed for older patients with age > 34.

Fig. 1. Treatment-specific Kaplan-Meier curves by age.

This type of situation suggests that individual patient characteristics should be used when selecting treatments to maximize an expected long-term outcome of interest for which larger outcomes are preferred, such as t-year survival probability, and has heightened interest in derivation of optimal dynamic treatment regimes. Because in many chronic diseases treatment decisions may be made sequentially over time, a dynamic treatment regime is a set of one or more decision rules that determine which treatment to give from among the available options based on accruing individual patient information, including baseline characteristics, intermediate outcomes between decisions, and previous treatments. An optimal regime is one that maximizes the expected outcome, or so-called value, if used by the entire patient population to select treatments.

There is a large literature on statistical methods to estimate an optimal treatment regime based on data from a clinical trial or observational study and non-survival outcomes. Q-learning (Watkins, 1989; Watkins and Dayan, 1992; Murphy, 2005; Zhao et al., 2009) and A-learning (Murphy, 2003; Robins, 2004) are two popular backward induction methods for estimating optimal dynamic treatment regimes based on regression-type modeling. The former involves positing parametric models for, roughly, the regression of outcome on accruing information and treatment, while the latter is based on semiparametric models in which only the part of the outcome regression representing contrasts among treatments is modeled parametrically, along with the propensity scores, the probabilities of observed treatment assignment given patient information at each decision point. Q-learning can be sensitive to misspecification of the required models, while A-learning enjoys the so-called double robustness property in that the corresponding estimating equations are asymptotically unbiased when either the propensity scores or main effects portion of the outcome models are correctly specified. An alternative class of approaches known as value or policy search methods is based on deriving and maximizing directly a consistent estimator for the value over a prespecified class of treatment regimes indexed by a finite-dimensional parameter. Zhang et al. (2012b) proposed inverse propensity score weighted (IPW) and augmented IPW (AIPW) estimators for the value in the case of a single decision point. Because the value estimator is nonsmooth, the optimization problem is challenging, and nonstandard optimization techniques are required. Zhao et al. (2012) and Zhang et al. (2012a) recast this approach as a weighted classification problem; the former refer to this method as outcome weighted learning. 
These approaches exploit approximations integrated into classification software to address the nonsmooth optimization problem, so that the class of regimes is dictated by a chosen classification method. Zhang et al. (2013) extended the value search methods of Zhang et al. (2012b) to more than one decision point, which share the computational challenges in the single decision case. Matsouaka et al. (2014) employed a kernel smoothing technique to nonparametrically estimate the conditional mean for the difference of the potential outcomes in a subgroup of patients and derived its associated treatment regime.

Although survival time is often the outcome of interest, to our knowledge there is relatively little development of methods for estimation of optimal treatment regimes where the goal is to maximize survival probability. Some work is focused on maximizing expected survival time. Goldberg and Kosorok (2012) developed a Q-learning method for censored survival data for estimating optimal dynamic treatment regimes and derived finite sample bounds on the generalization error of the estimated regime, while Zhao et al. (2015) proposed a doubly robust estimator for expected survival time based on censored data and used outcome weighted learning to estimate an optimal regime. Bai et al. (2014) developed a locally-efficient doubly robust estimator for survival probability rather than mean survival time and estimated an optimal regime by extending the methods from a classification perspective of Zhang et al. (2012a). The latter two methods involve transforming maximization of the value to a weighted classification problem, which allows classification software to be used to address the optimization challenge and thus dictates the class of regimes. All of these methods are relevant to a single decision point only.

In this article, we propose a value search method for estimating an optimal treatment regime within a prespecified class for which the goal is to maximize survival probability that addresses the optimization challenges in a novel way and is relevant to more than one decision point. In particular, we develop a framework employing kernel smoothing techniques to smooth the estimator of the value prior to optimization, which we show greatly improves finite sample performance over the corresponding estimator with no smoothing. This approach is different from the smoothing technique used by Matsouaka et al. (2014), and, to the best of our knowledge, this is the first time smoothing has been integrated into estimation of the value function and its associated optimal treatment regimes in this way. Development of optimal treatment regimes for multiple decision points with censored survival data is challenging, as timing of observations, censoring, and events must be properly taken into account. In addition, we extend our smoothing approach to this setting.

In Sections 2 and 3, we introduce the statistical framework and estimators for a single decision point and multiple decisions, respectively. Asymptotic properties of the proposed estimators are given in Section 4. Finite sample performance is studied via simulation in Section 5, and Section 6 presents application of the methods to data from ACTG 175. Proofs are relegated to the Appendix.

2. Estimation of Optimal Treatment Regime for a Single Decision Time Point

2.1. Notation and Assumptions

Consider a study with two treatment options 𝒜 = {0, 1} given at baseline. For the ith patient, i = 1, …, n, let Xi denote the p-dimensional vector of baseline covariates taking values x ∈ 𝒳 and Ai denote the actual treatment received by the patient. Let Ti be the associated continuous survival time of interest, with conditional survival function ST(t|a, x) ≡ P(Ti > t|Ai = a, Xi = x) and corresponding conditional cumulative hazard function ΛT(t|a, x), where a = 0, 1. Let Ci denote right censoring time for patient i. The observed data are {(Xi, Ai, T̃i, δi), i = 1, …, n}, independent and identically distributed (iid) across i, where T̃i = min{Ti, Ci} and δi = I{Ti ≤ Ci}. We thus observe the counting process Ni(t) = I(T̃i ≤ t, δi = 1) and the at risk process Yi(t) = I(T̃i ≥ t).

A treatment regime is a deterministic function that maps x ∈ 𝒳 to 𝒜. For simplicity, we assume the regimes of interest are from 𝒢 = {gη : gη(x) = I{ηT x̃ ≥ 0}, ||η|| = 1}, where x̃ = (1, xT)T. However, the proposed method also applies to any other 𝒢 indexed by finite-dimensional parameters. Denote the potential survival time of a patient if he/she were given treatment a, which may be contrary to fact, as T*(a). Accordingly, define the potential counting process N*(a; t) and at risk process Y*(a; t) under treatment a, where N*(a; t) = I{min(T*(a), C) ≤ t, T*(a) ≤ C} and Y*(a; t) = I{min(T*(a), C) ≥ t}. If a patient follows a given regime gη, we can write the corresponding potential survival time as T*(gη) = T*(1)gη + T*(0)(1 − gη), whose survival function is given by S*(t; η) = E(P[T*{gη(X)} > t|X]), as well as the potential counting process N*(gη; t) = N*(1; t)gη + N*(0; t)(1 − gη) and potential at risk process Y*(gη; t) = Y*(1; t)gη + Y*(0; t)(1 − gη). We wish to find an optimal treatment regime in 𝒢 that maximizes t-year survival probability; that is, $g_\eta^{opt}(x) \equiv g(x;\eta^{opt})$, where ηopt = arg max||η||=1 S*(t; η). Here, t is a pre-determined time point.

To find an optimal treatment regime, we first derive consistent estimators of S*(u; η) for any u. We make the uninformative censoring assumption: {T*(1), T*(0)} ⫫ C|A, X, where “⫫” means “independent of”. Let SC(t|a, x) denote the survival function of the censoring time given A = a and X = x. If we were able to observe the gη-specific potential counting process N*i(gη; s) and at risk process Y*i(gη; s), an intuitive estimator for S*(u; η) would be the inverse probability of censoring weighted Kaplan-Meier estimator

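$$\hat S^{*}(u;\eta)=\prod_{s\le u}\left(1-\frac{\sum_{i=1}^{n}\left[dN_i^{*}\{g_\eta(X_i);s\}/S_C\{s\mid g_\eta(X_i),X_i\}\right]}{\sum_{i=1}^{n}\left[Y_i^{*}\{g_\eta(X_i);s\}/S_C\{s\mid g_\eta(X_i),X_i\}\right]}\right). \tag{1}$$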

However, because N*i(gη; s) and Y*i(gη; s) are generally not observable, Ŝ*(u; η) is not computable based on the observed data. To obtain proper estimators that are computable from the observed data, we make the following two assumptions, which are widely used in the causal inference literature (Rubin, 1974): (i) the stable unit treatment value assumption (SUTVA), i.e. T = T*(1)A + T*(0)(1 − A); and (ii) the no unmeasured confounders assumption, i.e. {T*(1), T*(0)} ⫫ A|X.

2.2. Estimation Procedure

Following Zhang et al. (2012b), we cast estimation of S*(u; η) in a missing data framework. By SUTVA, for those patients whose actually received treatment matches the treatment dictated by gη, N*i(gη; s) = Ni(s) and Y*i(gη; s) = Yi(s), which are observed. For other patients, they are missing. This motivates us to modify the estimator given in (1) by incorporating inverse propensity score weighting. Formally, the weight for the ith patient is given by

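$$w_{\eta i}=\frac{I[A_i=I\{\eta^{T}\tilde X_i\ge 0\}]}{\pi(X_i)A_i+\{1-\pi(X_i)\}(1-A_i)}=\frac{A_iI(\eta^{T}\tilde X_i\ge 0)+(1-A_i)\{1-I(\eta^{T}\tilde X_i\ge 0)\}}{\pi(X_i)A_i+\{1-\pi(X_i)\}(1-A_i)}, \tag{2}$$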

where π(Xi) = P(Ai = 1|Xi) is the propensity score. In practice, π(Xi) is either known by design, as in a randomized clinical trial, or must be modeled and estimated from the data, as in observational studies. In the latter case, a parametric model such as logistic regression is usually used to estimate π(Xi); specifically,

$$\mathrm{logit}\{\pi(X_i;\theta)\}=\theta^{T}\tilde X_i, \tag{3}$$

where logit(z) = log{z/(1−z)}. Let θ̂ denote the maximum likelihood estimator of θ, and define π̂(Xi) = exp(θ̂T X̃i)/{1 + exp(θ̂T X̃i)}. If the logistic regression model is correctly specified, θ̂ is a consistent estimator of θ.

To derive the estimator for S*(u; η), we also need to estimate the censoring time survival function SC(s|Ai, Xi). In many clinical studies with satisfactory follow-up, it is reasonable to assume that censoring times are independent of treatment assignment and covariates, i.e. independent censoring. Here, the Kaplan-Meier estimator for censoring times consistently estimates SC(s|Ai, Xi). For some applications, the independent censoring assumption may be restrictive, but can be relaxed to a certain extent. For example, if censoring times are assumed to depend only on treatment assignment, the stratified Kaplan-Meier estimator can be used to estimate the treatment-specific censoring time survival function. For more general dependence, we can build a semiparametric model, say a proportional hazards model for censoring times, and obtain the model based estimator of SC(s|Ai, Xi). For simplicity, from now on we make the independent censoring assumption and let ŜC(·) denote the Kaplan-Meier estimator for censoring times.
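As a concrete illustration of the independent censoring case, the following is a minimal, dependency-free sketch of a Kaplan-Meier estimator for the censoring distribution, obtained by treating censoring (δ = 0) as the event. The function names `km_censoring_survival` and `step_value` are hypothetical and chosen only for this example, and ties between death and censoring times are not given any special handling, which is harmless for continuous times.

```python
from bisect import bisect_right

def km_censoring_survival(obs_time, delta):
    """Kaplan-Meier estimate of S_C(s) = P(C > s).

    Censoring (delta == 0) is treated as the event of interest.
    Returns the sorted distinct censoring times and the estimated
    survival probability just after each of them.
    """
    cens_times = sorted({t for t, d in zip(obs_time, delta) if d == 0})
    surv, s = [], 1.0
    for u in cens_times:
        at_risk = sum(1 for t in obs_time if t >= u)
        n_cens = sum(1 for t, d in zip(obs_time, delta) if t == u and d == 0)
        s *= 1.0 - n_cens / at_risk
        surv.append(s)
    return cens_times, surv

def step_value(times, surv, s):
    """Evaluate the right-continuous step function at s (1.0 before the first jump)."""
    k = bisect_right(times, s)
    return 1.0 if k == 0 else surv[k - 1]
```

A stratified version, as mentioned above for treatment-dependent censoring, would simply apply the same function within each treatment arm.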

Let ŵηi denote the estimator of wηi obtained by replacing π(Xi) with π̂(Xi) in wηi. We propose the inverse propensity score weighted Kaplan-Meier estimator (IPSWKME) for S*(u; η) given by

$$\hat S_I(u;\eta)=\prod_{s\le u}\left\{1-\frac{\sum_{i=1}^{n}\hat w_{\eta i}\,dN_i(s)}{\sum_{i=1}^{n}\hat w_{\eta i}\,Y_i(s)}\right\}. \tag{4}$$

Note that the IPSWKME does not depend on the Kaplan-Meier estimator ŜC(·) for censoring times, as it cancels from numerator and denominator under the independent censoring assumption. In Section 4, we show that ŜI(u; η) is a consistent estimator of S*(u; η) under certain conditions. Based on ŜI(u; η), the estimated optimal treatment regime to maximize t-year survival probability is given by $g(x;\hat\eta_I^{opt})$, where $\hat\eta_I^{opt}=\arg\max_{\|\eta\|=1}\hat S_I(t;\eta)$.
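To make the weighted product-limit form concrete, here is a minimal sketch of the weights in (2) and the estimator in (4), written in plain Python. It assumes the propensity scores are supplied per patient (known by design or already estimated); the function names `regime`, `ipw_weight` and `ipswkme` are hypothetical labels for this illustration, not part of any package.

```python
def regime(eta, x):
    """Linear regime g_eta(x) = I{eta^T (1, x^T)^T >= 0} for one covariate vector x."""
    score = eta[0] + sum(e * v for e, v in zip(eta[1:], x))
    return 1 if score >= 0 else 0

def ipw_weight(eta, x, a, pi):
    """Weight (2): I{a = g_eta(x)} / [pi * a + (1 - pi) * (1 - a)]."""
    g = regime(eta, x)
    return (1.0 if a == g else 0.0) / (pi * a + (1.0 - pi) * (1 - a))

def ipswkme(u, eta, X, A, obs_time, delta, pi):
    """Weighted Kaplan-Meier product (4) evaluated at time u."""
    w = [ipw_weight(eta, x, a, p) for x, a, p in zip(X, A, pi)]
    event_times = sorted({t for t, d in zip(obs_time, delta) if d == 1 and t <= u})
    surv = 1.0
    for s in event_times:
        # weighted number of events at s and weighted number at risk at s
        d_n = sum(wi for wi, t, d in zip(w, obs_time, delta) if t == s and d == 1)
        y = sum(wi for wi, t in zip(w, obs_time) if t >= s)
        if y > 0:
            surv *= 1.0 - d_n / y
    return surv
```

When every patient's received treatment agrees with the regime and the propensity is constant, the weights are equal and the function reduces to the ordinary Kaplan-Meier estimator.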

The IPSWKME (4) relies on correct specification of the propensity score model. If it is misspecified, the IPSWKME is inconsistent. To improve the robustness of the IPSWKME, we propose augmented IPSWKME (AIPSWKME) by incorporating assumed model information. For example, we may posit a proportional hazards (PH) model (Cox, 1972) for the conditional cumulative hazard function of T by

$$\Lambda_T(t\mid A,X)=\Lambda_0(t)\exp\{\beta^{T}(X^{T},A,AX^{T})^{T}\}, \tag{5}$$

where Λ0(t) is the baseline cumulative hazard function, and β is a (2p + 1)-dimensional parameter. The term $w_{\eta i}\,dN_i^{*}\{g_\eta(X_i);s\}$ is augmented by

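$$w_{\eta i}\,dN_i^{*}\{g_\eta(X_i);s\}+(1-w_{\eta i})E[dN_i^{*}\{g_\eta(X_i);s\}\mid X_i]=w_{\eta i}\,dN_i^{*}\{g_\eta(X_i);s\}+(1-w_{\eta i})S_T\{s\mid g_\eta(X_i),X_i\}S_C(s)\,d\Lambda_T\{s\mid g_\eta(X_i),X_i\},$$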

where ST(s|Ai, Xi) and SC(s) are the survival functions of T and C, respectively. Similarly, the term $w_{\eta i}Y_i^{*}\{g_\eta(X_i);s\}$ is augmented by $w_{\eta i}Y_i^{*}\{g_\eta(X_i);s\}+(1-w_{\eta i})S_T\{s\mid g_\eta(X_i),X_i\}S_C(s)$. It can be shown that the two augmented terms have the so-called double robustness property, i.e. they are unbiased for $E[dN_i^{*}\{g_\eta(X_i);s\}\mid X_i]$ and $E[Y_i^{*}\{g_\eta(X_i);s\}\mid X_i]$, respectively, when either the propensity score model or the posited PH model is correctly specified. Therefore, we propose the AIPSWKME for S*(u; η) as

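$$\hat S_A(u;\eta)=\prod_{s\le u}\left(1-\frac{\sum_{i=1}^{n}\left[\hat w_{\eta i}\,dN_i(s)+(1-\hat w_{\eta i})\hat S_T\{s\mid g_\eta(X_i),X_i\}\hat S_C(s)\,d\hat\Lambda_T\{s\mid g_\eta(X_i),X_i\}\right]}{\sum_{i=1}^{n}\left[\hat w_{\eta i}\,Y_i(s)+(1-\hat w_{\eta i})\hat S_T\{s\mid g_\eta(X_i),X_i\}\hat S_C(s)\right]}\right), \tag{6}$$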

where ŜT(s|Ai, Xi) is the estimated survival function of T based on the fitted PH model and ŜC(s) is the Kaplan-Meier estimator for censoring times. Based on ŜA(u; η), the estimated optimal treatment regime to maximize t-year survival probability is given by $g(x;\hat\eta_A^{opt})$, where $\hat\eta_A^{opt}=\arg\max_{\|\eta\|=1}\hat S_A(t;\eta)$. The asymptotic properties of ŜA(u; η) and $\hat S_A(t;\hat\eta_A^{opt})$ are studied in Section 4.
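The double robustness of the augmented numerator term can be seen from a short conditional-expectation argument, sketched here under the stated assumptions (SUTVA, no unmeasured confounders, independent censoring). The symbols below are introduced only for this illustration: $\pi_g(X)=P\{A=g_\eta(X)\mid X\}$ is the true probability of receiving the regime-consistent treatment, $\pi'_g(X)$ is its posited counterpart used in the weight $w'_\eta$, $\mu(X)=E[dN^{*}\{g_\eta(X);s\}\mid X]$ is the true conditional mean, and $m(X)$ is the value implied by the posited survival model.

$$
\begin{aligned}
E\left[w'_\eta\,dN^{*}\{g_\eta(X);s\}+(1-w'_\eta)\,m(X)\mid X\right]
&=\frac{\pi_g(X)}{\pi'_g(X)}\,\mu(X)+\left\{1-\frac{\pi_g(X)}{\pi'_g(X)}\right\}m(X)\\
&=\mu(X)+\left\{1-\frac{\pi_g(X)}{\pi'_g(X)}\right\}\{m(X)-\mu(X)\}.
\end{aligned}
$$

The second term vanishes if $\pi'_g=\pi_g$ (correct propensity model) or if $m=\mu$ (correct survival model), so the augmented term has conditional mean $\mu(X)$ in either case. The same argument applies to the augmented at risk term.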

2.3. Computational Aspects

Note that ŜI (t; η) and ŜA(t; η) are not smooth functions of η. As an illustration, we plot ŜI (t; η) and ŜA(t; η) as functions of η1 in Figure 2 for a simple example with one covariate and the intercept term in η being set as 1. The estimates are very jagged, and direct maximization of them with respect to η is challenging and may lead to local maximizers. From our simulation studies in Section 5, estimated survival probabilities following the obtained optimal treatment regimes may show substantial biases. As studied in Matsouaka et al. (2014), cross-validation may be used to correct the finite sample biases of the unsmoothed estimators, but it may increase the computational burden.

Fig. 2. Plots for the original and smoothed estimates, where the original estimates are in black and the smoothed estimates are in red. The IPW and AIPW estimates are given in the left and right panels, respectively.

To reduce the biases of the estimators, we propose to smooth the estimators ŜI(t; η) and ŜA(t; η) using kernel smoothers. Specifically, we replace gη(xi) = I{ηT x̃i ≥ 0} in ŜI(t; η) and ŜA(t; η) with g̃η(xi) = Φ(ηT x̃i/h) to obtain the smoothed IPSWKME (S-IPSWKME) S̃I(t; η) and smoothed AIPSWKME (S-AIPSWKME) S̃A(t; η), where Φ(s) is the cumulative distribution function for the standard normal distribution, and h is a bandwidth parameter that goes to zero as n goes to infinity. For bandwidth selection, we set h = c0 n^(−1/3) sd(ηT X̃), where c0 is a constant and sd(v) is the sample standard deviation of v. In our numerical studies, we found that c0 = 4^(1/3) generally gives good results for all scenarios. We plot in Figure 2 the smoothed estimates with the chosen bandwidth parameter for the same example. The smoothed estimates approximate the original estimates well and have unique maximizers around the true value η1 = 0.5. Because the treatment regime I(ηT X̃ ≥ 0) remains the same when η is multiplied by k for any k > 0, choosing the bandwidth h to be a function of η, in particular, h being proportional to sd(ηT X̃), ensures the scale-free property of the regime, as the constant k cancels in Φ(ηT X̃/h). As shown in Figure 2, although the resulting smoothed value function is not concave in η, it generally has a unique mode, and the maximizer of the smoothed value function is much easier to obtain compared to the unsmoothed counterpart. In all our numerical studies, the lack of concavity of the smoothed value function does not cause any difficulty in the maximization procedure. Such a bandwidth parameter has been widely used in the nonparametric smoothing literature and ensures that the original and smoothed estimators have the same asymptotic distribution (e.g. Heller, 2007). Let η̃Iopt and η̃Aopt denote the maximizers of S̃I(t; η) and S̃A(t; η), respectively. Then the associated estimated optimal treatment regimes are g(x; η̃Iopt) and g(x; η̃Aopt).
In our implementation, we first conduct the optimization without the norm-one constraint. Instead, we search for the maximizer in the domain −1 ≤ ηj ≤ 1 for all j and then rescale η to have norm one. This does not change the estimated value functions ŜI and ŜA or their smoothed counterparts.
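The smoothing step and its scale-free property can be sketched in a few lines. The code below is an illustration only: the function name `smoothed_regime` is hypothetical, the sample standard deviation is taken with the usual n − 1 divisor, and the default c0 = 4^(1/3) follows the choice reported above.

```python
import math
from statistics import stdev

def smoothed_regime(eta, X, c0=4.0 ** (1.0 / 3.0)):
    """Return Phi(eta^T x_tilde_i / h) for every row x_i of X.

    The bandwidth is h = c0 * n^(-1/3) * sd(eta^T x_tilde), so multiplying
    eta by any k > 0 leaves the returned values unchanged.
    """
    scores = [eta[0] + sum(e * v for e, v in zip(eta[1:], x)) for x in X]
    h = c0 * len(scores) ** (-1.0 / 3.0) * stdev(scores)
    # standard normal CDF written via the error function
    return [0.5 * (1.0 + math.erf(sc / (h * math.sqrt(2.0)))) for sc in scores]
```

One natural way to use these values, consistent with the second form of (2), is to substitute them for the indicator in the numerator of the weight, giving A g̃ + (1 − A)(1 − g̃); this reading is an assumption of the sketch.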

3. Estimation of Optimal Treatment Regime for Multiple Decision Time Points

We now extend the foregoing methods to estimation of optimal dynamic treatment regimes incorporating multiple decision points. For simplicity, we illustrate for the case of two decision points. Specifically, treatments can be given at baseline and at a fixed interim time point s, 0 < s < t. For the ith patient, let X0i denote his or her p0-dimensional vector of baseline covariates and A0i ∈ 𝒜0 = {0, 1} denote the initial treatment received at baseline. If the patient survives beyond s and is not censored before s, let X1i denote his or her p1-dimensional vector of intermediate covariates collected by s after assigning treatment A0i and A1i ∈ 𝒜1 = {0, 1} denote the follow-up treatment given at s. Thus, the observed data are {X0i, A0i, X1iI(T̃i > s), A1iI(T̃i > s), T̃i, δi, i = 1, …, n}, which are iid across i.

As for a single decision point, we consider a class of linear dynamic treatment regimes for simplicity, i.e. 𝒢= {gη = (g0, g1)}, where

$$g_0(x_0;\eta_0)=I\{\eta_0^{T}(1,x_0^{T})^{T}\ge 0\},\qquad g_1(x_0,x_1;\eta_1)=I[\eta_1^{T}\{1,x_0^{T},g_0(x_0;\eta_0),x_1^{T}\}^{T}\ge 0],$$

η0 is a (p0+1)-dimensional parameter with ||η0|| = 1, and η1 is a (p0+p1+2)-dimensional parameter with ||η1|| = 1. Here, a patient following a treatment regime gη is given treatment g0(X0; η0) at baseline, and, if he or she survives beyond s and is not censored before s, is given treatment g1(X0, X1; η1) at s. For patients whose initial treatments coincide with those assigned by g0(X0; η0) and who die before s, their treatment assignments are consistent with the regime gη. However, for patients whose initial treatments coincide with those assigned by g0(X0; η0) but who are censored before s, it is not known whether their treatment assignments at the second decision follow the regime gη. Let T*(gη) denote the potential survival time for a patient if he or she were given treatment according to gη(X0, X1). We are interested in finding the optimal dynamic treatment regime $g_\eta^{opt}=\{g_0(X_0;\eta_0^{opt}),g_1(X_0,X_1;\eta_1^{opt})\}$ in 𝒢 that maximizes the t-year survival probability S*(2)(t; η) = E(P[T*{gη(X0, X1)} > t|X0, X1]). As is standard in the causal inference literature for studying dynamic treatment regimes (e.g., Murphy, 2003), we assume: (i) SUTVA, i.e. a patient’s observed outcome agrees with the corresponding potential outcome if his or her actually received treatments are consistent with the assigned treatments and (ii) the sequential randomization assumption (SRA), i.e. the treatment assignment at the current stage depends only on the past received treatments and observed covariates, but not on the potential outcomes. Under these two assumptions, the above defined t-year survival probability can be estimated from the observed data.

We propose a similar inverse propensity score weighted Kaplan-Meier estimator for the survival function S*(2)(u; η) given any treatment regime gη. However, the derivation of proper weights becomes more difficult, as some patients may be censored before s and whether their received treatments follow the regime gη is unknown. To take this into account, we define the following new weight for patient i, i = 1, …, n:

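$$\hat w_{\eta i}^{(2)}=I(\tilde T_i\le s)\times\frac{\delta_i}{\hat S_C(\tilde T_i)}\times\frac{I\{A_{0i}=g_0(X_{0i};\eta_0)\}}{\hat\pi_{A_0}(X_{0i})}+\frac{I(\tilde T_i>s)}{\hat S_C(s)}\times\frac{I\{A_{0i}=g_0(X_{0i};\eta_0),\,A_{1i}=g_1(X_{0i},X_{1i};\eta_1)\}}{\hat\pi_{A_0}(X_{0i})\times\hat\pi_{A_1}(X_{0i},A_{0i},X_{1i})},$$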

where π̂A0(X0i) = π̂0(X0i)A0i + {1 − π̂0(X0i)}(1 − A0i), π̂A1(X0i, A0i, X1i) = π̂1(X0i, A0i, X1i)A1i + {1 − π̂1(X0i, A0i, X1i)}(1 − A1i), and π̂0(X0i) and π̂1(X0i, A0i, X1i) are the estimates of the propensity scores P(A0i = 1|X0i) and P(A1i = 1|X0i, A0i, X1i, T̃i > s), respectively. In randomized studies, π̂0 and π̂1 are known by design, while in observational studies, they must be estimated, e.g. using logistic regression. The IPSWKME for S*(2)(u; η) is given by

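$$\hat S_I^{(2)}(u;\eta)=\prod_{v\le u}\left\{1-\frac{\sum_{i=1}^{n}\hat w_{\eta i}^{(2)}\,dN_i(v)}{\sum_{i=1}^{n}\hat w_{\eta i}^{(2)}\,Y_i(v)}\right\}. \tag{7}$$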

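Let $\hat\eta_I^{opt,(2)}=(\hat\eta_{I,0}^{opt,(2)},\hat\eta_{I,1}^{opt,(2)})=\arg\max_{\|\eta_0\|=1,\|\eta_1\|=1}\hat S_I^{(2)}(t;\eta)$. Then the estimated optimal dynamic treatment regime is given by $\hat g_\eta^{opt,(2)}=\{g_0(X_0;\hat\eta_{I,0}^{opt,(2)}),g_1(X_0,X_1;\hat\eta_{I,1}^{opt,(2)})\}$.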
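The case distinctions behind the two-stage weight can be sketched for a single patient as follows. This is an illustrative reading of the weight definition, not a reference implementation: the names `indicator` and `two_stage_weight` are hypothetical, `surv_c` is an assumed callable returning the estimated censoring survival function, and the second-stage rule is evaluated on (1, x0, g0, x1) as in the definition of g1.

```python
def indicator(eta, z):
    """I{eta^T z >= 0} for a fully assembled design vector z."""
    return 1 if sum(e * v for e, v in zip(eta, z)) >= 0 else 0

def two_stage_weight(eta0, eta1, x0, a0, x1, a1, obs_time, delta, s,
                     pi0, pi1, surv_c):
    """Weight for one patient under a two-decision linear regime.

    pi0, pi1: propensities P(A0 = 1 | .) and P(A1 = 1 | .) for this patient.
    x1, a1 and pi1 are only used when obs_time > s.
    """
    g0 = indicator(eta0, [1.0] + list(x0))
    if a0 != g0:
        return 0.0                      # first decision inconsistent with regime
    p_a0 = pi0 * a0 + (1.0 - pi0) * (1 - a0)
    if obs_time <= s:
        # deaths before s are consistent with the regime; censored patients get 0
        return 1.0 / (surv_c(obs_time) * p_a0) if delta == 1 else 0.0
    g1 = indicator(eta1, [1.0] + list(x0) + [g0] + list(x1))
    if a1 != g1:
        return 0.0                      # second decision inconsistent with regime
    p_a1 = pi1 * a1 + (1.0 - pi1) * (1 - a1)
    return 1.0 / (surv_c(s) * p_a0 * p_a1)
```

Plugging these weights into the same weighted product-limit formula as in the single-decision case gives the estimator (7).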

To improve the finite sample performance of the IPSWKME, we again introduce kernel smoothing and replace the indicator functions g0(X0i; η0) and g1(X0i, X1i; η1) in $\hat S_I^{(2)}(u;\eta)$ by $\Phi\{\eta_0^{T}(1,X_{0i}^{T})^{T}/h_0\}$ and $\Phi[\eta_1^{T}\{1,X_{0i}^{T},g_0(X_{0i};\eta_0),X_{1i}^{T}\}^{T}/h_1]$, where the bandwidth parameters h0 and h1 are chosen as before. Let $\tilde S_I^{(2)}(u;\eta)$ denote the resulting smoothed IPSWKME and $\tilde\eta_I^{opt,(2)}$ denote the maximizer of $\tilde S_I^{(2)}(t;\eta)$. To improve the robustness of the IPSWKME, we can similarly derive the augmented IPSWKME based on a posited model for survival time; however, its formulation is very complicated and is not pursued here. Conceptually, the proposed IPSWKME can be generalized to accommodate more than two decision points. However, when there are more treatment decision points, the IPSWKME may become less reliable because fewer patients will have assigned treatments consistent with a given dynamic treatment regime.

4. Asymptotic Properties

In this Section, we present the asymptotic properties of the proposed estimators in Theorems 1 – 3. Theorems 1 and 2 are for the cases with a single decision point while Theorem 3 is for the case with two decision points.

Theorem 1

Under conditions (A1)–(A6) in the Appendix, if the propensity score model (3) is correctly specified, for any regime gη, we have, as n → ∞,

  1. ŜI (u; η) →p S*(u; η) for any 0 < u ≤ t;

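  2. $\sqrt{n}\{\hat S_I(u;\eta)-S^{*}(u;\eta)\}$ converges weakly to a mean zero Gaussian process;

  3. $\sqrt{n}\{\hat S_I(t;\hat\eta_I^{opt})-S^{*}(t;\eta^{opt})\}\to_d N(0,\Sigma_I(t;\eta^{opt}))$, where the expression of ΣI(t; ηopt) is given in the Appendix;

  4. $\sqrt{n}\{\hat S_I(t;\hat\eta_I^{opt})-\tilde S_I(t;\tilde\eta_I^{opt})\}=o_p(1)$.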

Theorem 2

Under conditions (A1)–(A6) in the Appendix, if either the propensity score model (3) or the proportional hazards model (5) is correctly specified, we have, as n → ∞,

  1. ŜA(u; η) →p S*(u; η) for any 0 < u ≤ t;

  2. $\sqrt{n}\{\hat S_A(u;\eta)-S^{*}(u;\eta)\}$ converges weakly to a mean zero Gaussian process;

  3. $\sqrt{n}\{\hat S_A(t;\hat\eta_A^{opt})-S^{*}(t;\eta^{opt})\}\to_d N(0,\Sigma_A(t;\eta^{opt}))$, where the expression of ΣA(t; ηopt) is given in the Appendix;

  4. $\sqrt{n}\{\hat S_A(t;\hat\eta_A^{opt})-\tilde S_A(t;\tilde\eta_A^{opt})\}=o_p(1)$.

Theorem 3

Under certain regularity conditions, if the two propensity score models π0(·) and π1(·) are correctly specified, for any regime gη, we have, as n → ∞,

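  1. $\hat S_I^{(2)}(u;\eta)\to_p S^{*(2)}(u;\eta)$ for any 0 < u ≤ t;

  2. $\sqrt{n}\{\hat S_I^{(2)}(u;\eta)-S^{*(2)}(u;\eta)\}$ converges weakly to a mean zero Gaussian process;

  3. $\sqrt{n}\{\hat S_I^{(2)}(t;\hat\eta_I^{opt,(2)})-S^{*(2)}(t;\eta^{opt,(2)})\}\to_d N(0,\Sigma_I^{(2)}(t;\eta^{opt,(2)}))$, where $\eta^{opt,(2)}=(\eta_0^{opt},\eta_1^{opt})$;

  4. $\sqrt{n}\{\hat S_I^{(2)}(t;\hat\eta_I^{opt,(2)})-\tilde S_I^{(2)}(t;\tilde\eta_I^{opt,(2)})\}=o_p(1)$.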

Here the asymptotic variances ΣI(t; ηopt), ΣA(t; ηopt) and $\Sigma_I^{(2)}(t;\eta^{opt,(2)})$ can be consistently estimated from the observed data using the usual plug-in method. The proofs of Theorems 1–3 are given in the Appendix.

5. Simulation Studies

We examine the finite sample performance of the proposed estimators by simulation. We first consider scenarios with a single treatment decision time point at baseline. For each patient, baseline covariates X1 and X2 are independently and uniformly distributed on (−2, 2). Given X1 and X2, the binary treatment indicator A is generated from the logistic model logit{π(X1, X2)} = X1 − 0.5X2. The survival time T is generated from a linear transformation model (Cheng et al., 1995), h(T) = −0.5X1 + A(X1 − X2) + ε, where h(s) = log(e^s − 1) − 2 is an increasing function, and the error term ε follows a known distribution, either the extreme value distribution or the logistic distribution, which correspond to a proportional hazards model and a proportional odds model, respectively. The covariate-independent censoring time C is uniformly distributed on (0, C0), where C0 is chosen to achieve a censoring rate of 15% or 40%. The optimal treatment regime maximizing t-year survival probability is $g_\eta^{opt}(X_1,X_2)=I\{X_1-X_2\ge 0\}$ for any t. We search for the optimal treatment regime in the class of regimes given by 𝒢 = {gη : gη(X1, X2) = I{η0 + η1X1 + η2X2 ≥ 0}, η = (η0, η1, η2)T}, which contains the true optimal treatment regime with ηopt = (0, 0.707, −0.707)T.
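The data-generating mechanism just described can be sketched as below. Several details are assumptions of this illustration rather than specifications from the text: the extreme value error is taken to satisfy P(ε > x) = exp(−e^x), so that exp(ε) is a unit exponential variable; the upper censoring limit `c0` is a placeholder value, since the constants giving 15% or 40% censoring are not stated; and the function name `simulate_patient` is hypothetical.

```python
import math
import random

def simulate_patient(rng, error="extreme", c0=8.0):
    """Draw one patient (covariates, treatment, observed time, event indicator)."""
    x1, x2 = rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)
    pi = 1.0 / (1.0 + math.exp(-(x1 - 0.5 * x2)))   # logit(pi) = X1 - 0.5 X2
    a = 1 if rng.random() < pi else 0
    if error == "extreme":
        exp_eps = rng.expovariate(1.0)   # exp(eps) ~ Exp(1): assumed PH convention
    else:
        u = rng.random()
        exp_eps = u / (1.0 - u)          # exp(eps) for a standard logistic eps
    mu = -0.5 * x1 + a * (x1 - x2)
    # invert h(s) = log(e^s - 1) - 2:  T = log(1 + exp(mu + eps + 2))
    t = math.log1p(math.exp(mu + 2.0) * exp_eps)
    c = rng.uniform(0.0, c0)             # covariate-independent censoring
    return (x1, x2), a, min(t, c), 1 if t <= c else 0
```

Working with exp(ε) directly avoids taking the logarithm of a uniform draw that could, in principle, be exactly zero.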

To implement the proposed estimators, it is necessary to posit a model for the propensity scores. We consider both a correctly specified model, logit{πA(X1, X2)} = θ0 + θ1X1 + θ2X2, and a misspecified model, logit{πA(X1, X2)} = θ0. For the augmented estimators, we must posit a model for the survival time T. Here, we always use the proportional hazards model λ(t|X1, X2) = λ0(t) exp{β11X1 + β12X2 + A(β20 + β21X1 + β22X2)}. Note that when ε follows the extreme value distribution, the posited survival model is correctly specified. On the other hand, when ε follows the logistic distribution, this model is misspecified. We compare the performance of the IPSWKME (ŜI) and AIPSWKME (ŜA), as well as their smoothed versions, S-IPSWKME (S̃I) and S-AIPSWKME (S̃A), under different combinations of the assumed propensity score (PS) model, error term distribution, censoring rate, sample size (n = 250 or 500) and time point of interest (t = 1 or 2). For each scenario, we ran 1000 replications and used a genetic algorithm to do the optimization, which is implemented by the R function genoud within the package rgenoud (Mebane, Jr. and Sekhon, 2011).
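The box-constrained search followed by rescaling, as described in Section 2.3, can be illustrated with a much cruder optimizer. The sketch below uses plain random search as a stand-in for the genetic algorithm; it is not the genoud procedure, and the name `random_search_regime` and its arguments are hypothetical. It assumes the objective is invariant to positive rescaling of η, as is the case for the value estimators.

```python
import math
import random

def random_search_regime(objective, dim, n_draws=2000, seed=0):
    """Maximize objective(eta) over the box [-1, 1]^dim by random search,
    then rescale the best draw to unit norm."""
    rng = random.Random(seed)
    best_eta, best_val = None, -math.inf
    for _ in range(n_draws):
        eta = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        val = objective(eta)
        if val > best_val:
            best_eta, best_val = eta, val
    norm = math.sqrt(sum(e * e for e in best_eta))
    return [e / norm for e in best_eta], best_val
```

In practice the objective would be a function such as η ↦ S̃I(t; η) evaluated on the observed data, and a more capable global optimizer would be preferable for jagged, unsmoothed objectives.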

We report results for the scenarios with n = 250 and t = 2, which are given in Tables 1 and 2 for the extreme value error and logistic error distributions, respectively. Results for other scenarios are similar. In the tables, we report the mean of estimated ηopt, the mean of estimated t-year survival probability following the estimated optimal treatment regime, namely the estimated optimal t-year survival probability (denoted by Ŝ(η̂opt)), the mean of estimated standard error of Ŝ(η̂opt) using the plug-in method based on the asymptotic variances established in Theorems 1–2 (denoted by SE), the empirical coverage probability of the 95% confidence interval for the t-year survival probability following the true optimal treatment regime S(ηopt) (denoted by CP), the mean of simulated true t-year survival probability following the estimated optimal treatment regime (denoted by S(η̂opt)), and the mean of the misclassification rate obtained by comparing the true and estimated optimal treatment regimes (denoted by MR). The numbers given in parentheses are the standard deviations of the corresponding estimates. Here, S(ηopt) and S(η̂opt) are computed using simulated survival times following the given treatment regime based on a large random sample of 5 × 10^6 patients. We have S(ηopt) = 0.605 for the extreme value error distribution and S(ηopt) = 0.672 for the logistic distribution. The misclassification rate for one simulation is calculated as the proportion of patients for which the true and estimated optimal treatment regimes do not select the same treatment.
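The misclassification rate defined above is straightforward to compute for two linear regimes. The following sketch assumes the covariates are supplied as a list of rows; the function name `misclassification_rate` is hypothetical.

```python
def misclassification_rate(eta_hat, eta_opt, X):
    """Proportion of rows of X on which the two linear regimes
    I{eta^T (1, x^T)^T >= 0} select different treatments."""
    def rule(eta, x):
        return eta[0] + sum(e * v for e, v in zip(eta[1:], x)) >= 0
    disagree = sum(1 for x in X if rule(eta_hat, x) != rule(eta_opt, x))
    return disagree / len(X)
```

For example, with rows (1, 0), (0, 1), (2, 1), (1, 2), the regime with η = (0, 1, −1) treats the first and third rows only, whereas η = (0, 1, 0) treats all four, so the two disagree on half of the rows.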

Table 1.

Simulation results for the extreme value error distribution with n = 250 and t = 2.

PS η̂0 η̂1 η̂2 Ŝ(η̂opt) SE CP S(η̂opt) MR
Censor Rate = 15%
ŜI T 0.010 (0.298) 0.633 (0.192) −0.665 (0.178) 0.645 (0.037) 0.040 0.839 0.590 (0.016) 0.118 (0.063)
I T −0.005 (0.263) 0.652 (0.179) −0.667 (0.171) 0.612 (0.036) 0.040 0.968 0.593 (0.014) 0.107 (0.057)
ŜA T −0.002 (0.287) 0.639 (0.171) −0.676 (0.155) 0.639 (0.037) 0.040 0.866 0.592 (0.014) 0.109 (0.058)
A T 0.002 (0.256) 0.654 (0.169) −0.675 (0.152) 0.609 (0.036) 0.040 0.969 0.594 (0.013) 0.102 (0.055)
ŜI F −0.031 (0.423) 0.408 (0.327) −0.697 (0.249) 0.666 (0.036) 0.039 0.659 0.565 (0.040) 0.193 (0.100)
I F −0.051 (0.403) 0.426 (0.285) −0.714 (0.252) 0.643 (0.035) 0.039 0.844 0.569 (0.034) 0.184 (0.090)
ŜA F −0.014 (0.278) 0.660 (0.151) −0.662 (0.161) 0.635 (0.038) 0.041 0.886 0.593 (0.012) 0.107 (0.055)
A F −0.002 (0.246) 0.675 (0.141) −0.665 (0.148) 0.607 (0.038) 0.041 0.968 0.596 (0.010) 0.096 (0.050)

Censor Rate = 40%
ŜI T 0.008 (0.311) 0.616 (0.214) −0.661 (0.202) 0.650 (0.041) 0.044 0.850 0.588 (0.019) 0.127 (0.068)
I T −0.002 (0.285) 0.637 (0.202) −0.660 (0.191) 0.613 (0.040) 0.045 0.958 0.590 (0.017) 0.118 (0.064)
ŜA T 0.006 (0.310) 0.623 (0.203) −0.663 (0.189) 0.645 (0.041) 0.045 0.879 0.589 (0.019) 0.123 (0.068)
A T 0.001 (0.282) 0.643 (0.192) −0.661 (0.183) 0.612 (0.040) 0.044 0.965 0.591 (0.017) 0.115 (0.062)
ŜI F 0.002 (0.448) 0.388 (0.349) −0.676 (0.267) 0.671 (0.039) 0.043 0.677 0.560 (0.045) 0.206 (0.109)
I F −0.024 (0.432) 0.403 (0.311) −0.694 (0.271) 0.645 (0.039) 0.043 0.867 0.564 (0.038) 0.200 (0.095)
ŜA F −0.005 (0.299) 0.655 (0.169) −0.650 (0.176) 0.641 (0.043) 0.046 0.896 0.591 (0.014) 0.115 (0.060)
A F −0.005 (0.270) 0.664 (0.162) −0.656 (0.173) 0.609 (0.041) 0.046 0.964 0.593 (0.012) 0.109 (0.054)

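Rows labelled ŜI and ŜA give the unsmoothed IPSWKME and AIPSWKME; rows labelled I and A give the smoothed S-IPSWKME and S-AIPSWKME. PS, the propensity score model; T denotes a correctly specified PS model and F a misspecified PS model. Recall that S(ηopt) = 0.605.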

Table 2.

Simulation results for the logistic error distribution with n = 250 and t = 2.

Estimator PS η̂0 η̂1 η̂2 Ŝ(η̂opt) SE CP S(η̂opt) MR
Censor Rate = 15%
ŜI T 0.010 (0.370) 0.566 (0.272) −0.641 (0.241) 0.716 (0.034) 0.038 0.791 0.652 (0.022) 0.155 (0.089)
I T −0.004 (0.341) 0.593 (0.262) −0.640 (0.235) 0.685 (0.034) 0.039 0.955 0.655 (0.020) 0.145 (0.082)
ŜA T 0.007 (0.363) 0.578 (0.260) −0.639 (0.240) 0.713 (0.034) 0.039 0.818 0.653 (0.020) 0.151 (0.084)
A T −0.006 (0.341) 0.595 (0.251) −0.642 (0.233) 0.684 (0.034) 0.039 0.962 0.655 (0.020) 0.143 (0.081)
ŜI F 0.041 (0.461) 0.340 (0.389) −0.662 (0.284) 0.729 (0.033) 0.037 0.649 0.632 (0.040) 0.224 (0.120)
I F −0.001 (0.461) 0.375 (0.350) −0.667 (0.283) 0.707 (0.033) 0.037 0.846 0.636 (0.035) 0.216 (0.107)
ŜA F −0.025 (0.337) 0.630 (0.198) −0.637 (0.210) 0.723 (0.036) 0.040 0.753 0.658 (0.013) 0.133 (0.068)
A F −0.029 (0.320) 0.633 (0.204) −0.642 (0.204) 0.695 (0.036) 0.040 0.926 0.659 (0.012) 0.130 (0.064)

Censor Rate = 40%
ŜI T 0.013 (0.395) 0.545 (0.293) −0.625 (0.266) 0.721 (0.036) 0.041 0.785 0.649 (0.027) 0.168 (0.097)
I T −0.008 (0.362) 0.581 (0.274) −0.626 (0.255) 0.687 (0.036) 0.041 0.948 0.652 (0.022) 0.155 (0.087)
ŜA T 0.004 (0.381) 0.558 (0.277) −0.635 (0.255) 0.718 (0.036) 0.042 0.807 0.651 (0.023) 0.160 (0.089)
A T −0.016 (0.361) 0.578 (0.270) −0.634 (0.246) 0.686 (0.036) 0.042 0.955 0.653 (0.022) 0.153 (0.086)
ŜI F 0.061 (0.471) 0.325 (0.413) −0.640 (0.299) 0.733 (0.035) 0.039 0.661 0.628 (0.042) 0.235 (0.124)
I F 0.021 (0.482) 0.355 (0.370) −0.639 (0.312) 0.709 (0.035) 0.040 0.842 0.631 (0.038) 0.229 (0.114)
ŜA F −0.012 (0.350) 0.625 (0.206) −0.631 (0.217) 0.722 (0.038) 0.042 0.785 0.657 (0.014) 0.138 (0.070)
A F −0.022 (0.331) 0.628 (0.214) −0.634 (0.221) 0.692 (0.038) 0.043 0.939 0.658 (0.013) 0.136 (0.067)

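Rows labelled ŜI and ŜA give the unsmoothed IPSWKME and AIPSWKME; rows labelled I and A give the smoothed S-IPSWKME and S-AIPSWKME. PS, the propensity score model; T denotes a correctly specified PS model and F a misspecified PS model. Recall that S(ηopt) = 0.672.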

From the results, when the PS model is correctly specified, all estimators of ηopt have relatively small biases; in particular, the mean of η̂0opt is close to zero, while the mean ratio of η̂1opt to η̂2opt is very close to negative one. The means of the simulated true t-year survival probability following the estimated optimal treatment regimes, i.e., S(η̂opt), are all close to the true values. In addition, the estimates of ηopt based on the AIPSWKME and S-AIPSWKME of the t-year survival probability generally have smaller standard deviations than those based on the IPSWKME and S-IPSWKME. The unsmoothed IPSWKME and AIPSWKME of the optimal t-year survival probability have relatively large biases, mainly because the estimates of the t-year survival probability are very jagged, as illustrated in Figure 2; as a consequence, the associated coverage probability of the 95% confidence interval is much lower than the nominal level. The smoothed S-IPSWKME and S-AIPSWKME of the optimal t-year survival probability greatly reduce the biases and thus give proper coverage probability. The unsmoothed and smoothed estimators of the optimal t-year survival probability have nearly the same standard deviation. When the PS model is misspecified, the IPSWKME and S-IPSWKME generally have relatively large biases, as expected, while the AIPSWKME and S-AIPSWKME greatly reduce the biases and give much smaller MR. In particular, when the posited survival model is correctly specified under the extreme value error distribution, the S-AIPSWKME yields proper coverage probability. On the other hand, when the posited survival model is misspecified under the logistic error distribution, the S-AIPSWKME is not consistent in general but still gives small biases with reasonable coverage probability. Performance of the estimators improves as the censoring rate decreases and the sample size increases.

We also compare the proposed method with the methods of Zhao et al. (2013) and Zhao et al. (2015). For the comparison with the method of Zhao et al. (2013), we consider randomized studies with known propensity scores, i.e. πA ≡ 0.5, sample size n = 250, decision time point of interest t0 = 2, and censoring rate of 15%. When implementing the method of Zhao et al. (2013), we set the threshold ξ = 0, 0.1, …, 0.6 and find the associated treatment regime for each ξ value.

Table 3 summarizes the simulation results for the extreme value and logistic error distributions based on 1000 replications. The performance of the method of Zhao et al. (2013) depends on the choice of the threshold value ξ. For the extreme value error distribution, the best choice is ξ = 0.4, while for the logistic error distribution, the best choice is ξ = 0.3. In practice, the best threshold value to use is unknown and must be estimated from data, which may not be straightforward. Moreover, even with the best choice of ξ value, the performance of the method by Zhao et al. (2013) is still worse than that of our proposed smoothed estimators, S-IPSWKME and S-AIPSWKME, under all the considered settings.

Table 3.

Results for comparison with the method of Zhao et al. (2013).

error method Surv. Prob. MR
extreme value Zhao et al. (2013) w. ξ = 0 0.445 (0.030) 0.467 (0.048)
Zhao et al. (2013) w. ξ = 0.1 0.499 (0.046) 0.373 (0.089)
Zhao et al. (2013) w. ξ = 0.2 0.555 (0.035) 0.245 (0.099)
Zhao et al. (2013) w. ξ = 0.3 0.585 (0.027) 0.143 (0.091)
Zhao et al. (2013) w. ξ = 0.4 0.590 (0.028) 0.112 (0.093)
Zhao et al. (2013) w. ξ = 0.5 0.543 (0.066) 0.241 (0.162)
Zhao et al. (2013) w. ξ = 0.6 0.542 (0.045) 0.275 (0.109)
S-IPSWKME 0.594 (0.011) 0.107 (0.052)
S-AIPSWKME 0.595 (0.009) 0.099 (0.047)

logistic Zhao et al. (2013) w. ξ = 0 0.552 (0.028) 0.456 (0.061)
Zhao et al. (2013) w. ξ = 0.1 0.606 (0.040) 0.323 (0.113)
Zhao et al. (2013) w. ξ = 0.2 0.643 (0.029) 0.200 (0.111)
Zhao et al. (2013) w. ξ = 0.3 0.650 (0.030) 0.164 (0.117)
Zhao et al. (2013) w. ξ = 0.4 0.630 (0.037) 0.246 (0.132)
Zhao et al. (2013) w. ξ = 0.5 0.590 (0.039) 0.373 (0.110)
Zhao et al. (2013) w. ξ = 0.6 0.590 (0.030) 0.382 (0.079)
S-IPSWKME 0.659 (0.012) 0.130 (0.063)
S-AIPSWKME 0.660 (0.011) 0.126 (0.061)

Surv. Prob., the simulated survival probability at t0 = 2; MR, the misclassification rate.

The true optimal survival probabilities are 0.605 and 0.672 for the extreme value and logistic error, respectively.

Values in parentheses are the standard deviations over 1000 simulations.

For the comparison with the method of Zhao et al. (2015), we consider the same simulation settings as in Tables 1 and 2 with sample size n = 250, decision time point of interest t0 = 2, and censoring rate of 15%. For both methods, we consider the augmented estimation. Table 4 summarizes the simulation results based on 1000 replications. The proposed methods and the method of Zhao et al. (2015) lead to comparable survival probabilities under the estimated treatment rules, while the proposed methods yield smaller misclassification rates under all the considered settings. In summary, the proposed methods demonstrate very competitive performance compared with existing approaches.

Table 4.

Results for comparison with the method of Zhao et al. (2015).

error method PS Surv. Prob. MR
extreme value Zhao et al. (2015) T 0.587 (0.022) 0.136 (0.065)
AIPSWKM T 0.592 (0.014) 0.109 (0.058)
S-AIPSWKM T 0.594 (0.013) 0.102 (0.055)
Zhao et al. (2015) F 0.590 (0.008) 0.134 (0.044)
AIPSWKM F 0.593 (0.012) 0.107 (0.055)
S-AIPSWKM F 0.596 (0.010) 0.096 (0.050)

logistic Zhao et al. (2015) T 0.652 (0.027) 0.159 (0.090)
AIPSWKM T 0.653 (0.020) 0.151 (0.084)
S-AIPSWKM T 0.655 (0.020) 0.143 (0.081)
Zhao et al. (2015) F 0.659 (0.007) 0.141 (0.047)
AIPSWKM F 0.658 (0.013) 0.133 (0.068)
S-AIPSWKM F 0.659 (0.012) 0.130 (0.064)

PS, the propensity score model. Here T means the correctly specified PS model while F means the misspecified PS model.

Surv. Prob., the simulated survival probability at t0 = 2; MR, the misclassification rate.

The true optimal survival probabilities are 0.605 and 0.672 for the extreme value and logistic error, respectively.

Values in parentheses are the standard deviations over 1000 simulations.

Next, we consider scenarios with two treatment decision time points, one at baseline and the other at s = 1. The initial treatment assignment A0 and the follow-up treatment assignment A1, if applicable, are generated independently from a Bernoulli distribution with success probability 0.5. A single baseline covariate X0 is generated from a uniform distribution on (0, 4). To generate the survival time T, we first generate a time T1 given A0 and X0 from an exponential distribution with rate λ1(A0, X0). The censoring time C is generated from a uniform distribution on (0, C0). If a patient is neither dead nor censored at time s = 1 (i.e., min(T1, C) > 1), we generate a single intermediate covariate X1 for this patient as X1 = 0.5X0 − 0.4(A0 − 0.5) + e, where e is uniformly distributed on (0, 2). Then we generate another time T2 given A0, A1, X0 and X1 from an exponential distribution with rate λ2(A0, A1, X0, X1). The survival time of interest is T = T1 if T1 ≤ 1 and T = 1 + T2 otherwise. The observed survival time is T̃ = min(T, C), with censoring indicator δ = I(T ≤ C). Here, C0 is chosen to achieve censoring rates of 15% and 40%. We consider three scenarios for the rate functions λ1 and λ2: (i) λ1(A0, X0) = 0.5 exp{1.75(A0 − 0.5)(X0 − 2)} and λ2(A0, A1, X0, X1) = 0.3 exp{2.5(A1 − 0.4)(X1 − 2) − A0(X1 − 2)}; (ii) λ1(A0, X0) = 0.1 exp{2(A0 − 0.5)(X0 − 2)} and λ2(A0, A1, X0, X1) = 0.2 exp{3(A1 − 0.4)(X1 − 2) − 3(A0 − 0.5)(X0 − 2)}; (iii) λ1(A0, X0) = 0.2 exp{1.5(A0 − 0.3)(X0 − 3)} and λ2(A0, A1, X0, X1) = 0.3 exp{2(A1 − 0.5)(X1 − 2) + 0.5(A0 − 0.7)(X0 − 1)}.
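The data-generating mechanism for scenario (i) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: the value of C0 is not given in the text, so `c0=20.0` is a hypothetical placeholder, and the function names are invented for the example.

```python
import math
import random

def rate1(a0, x0):
    """Scenario (i) hazard rate before s = 1."""
    return 0.5 * math.exp(1.75 * (a0 - 0.5) * (x0 - 2))

def rate2(a0, a1, x0, x1):
    """Scenario (i) hazard rate after s = 1 (x0 does not enter in this scenario)."""
    return 0.3 * math.exp(2.5 * (a1 - 0.4) * (x1 - 2) - a0 * (x1 - 2))

def simulate_patient(rng, c0):
    """Return (x0, a0, x1, a1, observed_time, delta) for one subject."""
    a0 = int(rng.random() < 0.5)
    x0 = rng.uniform(0, 4)
    t1 = rng.expovariate(rate1(a0, x0))  # expovariate takes the rate
    c = rng.uniform(0, c0)
    x1 = a1 = None
    if min(t1, c) > 1:
        # Alive and uncensored at s = 1: second decision and covariate.
        a1 = int(rng.random() < 0.5)
        x1 = 0.5 * x0 - 0.4 * (a0 - 0.5) + rng.uniform(0, 2)
        t = 1 + rng.expovariate(rate2(a0, a1, x0, x1))
    else:
        # Either T1 <= 1 (so T = T1) or censored before s = 1; in the
        # latter case min(t1, c) = c and delta = 0, so using t1 is harmless.
        t = t1
    observed = min(t, c)
    delta = int(t <= c)
    return x0, a0, x1, a1, observed, delta

def censoring_rate(sample):
    """Proportion of subjects with delta = 0."""
    return sum(1 - row[5] for row in sample) / len(sample)

rng = random.Random(2016)
sample = [simulate_patient(rng, c0=20.0) for _ in range(5000)]  # c0 is hypothetical
observed_censoring = censoring_rate(sample)
```

In practice `c0` would be tuned, for example by bisection on `censoring_rate`, until the target censoring rate of 15% or 40% is reached.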

For the above three scenarios, the true optimal rule at time s = 1 for maximizing the t-year survival probability (t > 1) is g1opt(x1) = I(2 − x1 > 0). However, the true optimal rule g0opt(x0) at time s = 0 is a complicated nonlinear function of x0, which can be derived by backward induction as in Q-learning. In our implementation, for computational simplicity, we search for the optimal dynamic treatment regime in a class of linear decision rules, specifically, 𝒢η = {g0(x0) = I{η1 + η2x0 > 0}, g1(x1) = I{η3 + η4x1 > 0}, ||(η1, η2)|| = 1, ||(η3, η4)|| = 1}. The true optimal rule g1opt(x1) at time s = 1 then corresponds to (η3opt, η4opt) = (0.894, −0.447) for all three scenarios.
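As a quick check on the stated coefficients, the rule I(2 − x1 > 0) can be rescaled to unit norm without changing the decision it makes:

\[
(\eta_3^{opt}, \eta_4^{opt}) = \frac{(2, -1)}{\sqrt{2^2 + (-1)^2}} = \left(\frac{2}{\sqrt{5}}, -\frac{1}{\sqrt{5}}\right) \approx (0.894, -0.447).
\]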

For scenarios (i) and (iii), we take t = 3, while for (ii) we take t = 6. We use simulation to find the true optimal rule at s = 0 in 𝒢η that maximizes the t-year survival probability. Specifically, we first generate X0 and, for a given (η1, η2), set A0 by the regime g0(X0). Then we generate X1 given A0 and X0 in the same way as in our design and set A1 by the optimal rule g1opt. Finally, we generate T1 and T2 and define T in the same way as before. Using the generated T's for a large random sample of 5 × 10⁶ patients, we compute the associated empirical t-year survival probability. We find (η1opt, η2opt) to maximize the empirical t-year survival probability, which gives the true optimal rule g0opt in 𝒢η. Here, we use a grid search to find (η1opt, η2opt). Since ||(η1opt, η2opt)|| = 1, we only need to search over a grid for η1. We have (η1opt, η2opt) = (0.890, −0.456) and S(3; ηopt) = 0.567 for scenario (i), (η1opt, η2opt) = (−0.891, 0.454) and S(6; ηopt) = 0.624 for scenario (ii), and (η1opt, η2opt) = (0.908, −0.419) and S(3; ηopt) = 0.702 for scenario (iii). Here ηopt = (η1opt, η2opt, η3opt, η4opt)ᵀ and S(t; ηopt) is the t-year survival probability following the optimal dynamic treatment regime defined by ηopt.
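The search for the baseline rule can be sketched for scenario (i) as below. This is a small-scale illustration under assumptions, not the authors' implementation: the grid is placed on the angle of (η1, η2) rather than on η1 itself so that both signs of η2 are covered, the sample is far smaller than the 5 × 10⁶ patients used in the text, and the same random draws are reused for every candidate rule. All names are hypothetical.

```python
import math
import random

T_HORIZON = 3.0  # t for scenario (i); must exceed s = 1

def draw_patients(n, seed=0):
    """Pre-draw (x0, u1, e, u2) per patient so every rule sees the same sample."""
    rng = random.Random(seed)
    return [(rng.uniform(0, 4), rng.random(), rng.uniform(0, 2), rng.random())
            for _ in range(n)]

def survives(eta1, eta2, patient, t=T_HORIZON):
    """True if T > t under g0 = I(eta1 + eta2*x0 > 0) followed by g1opt."""
    x0, u1, e, u2 = patient
    a0 = 1 if eta1 + eta2 * x0 > 0 else 0
    lam1 = 0.5 * math.exp(1.75 * (a0 - 0.5) * (x0 - 2))
    t1 = -math.log(1.0 - u1) / lam1  # inverse-transform exponential draw
    if t1 <= 1:
        return False  # T = T1 <= 1 < t
    x1 = 0.5 * x0 - 0.4 * (a0 - 0.5) + e
    a1 = 1 if 2 - x1 > 0 else 0  # optimal rule at s = 1
    lam2 = 0.3 * math.exp(2.5 * (a1 - 0.4) * (x1 - 2) - a0 * (x1 - 2))
    t2 = -math.log(1.0 - u2) / lam2
    return 1 + t2 > t

def grid_search(patients, n_angles=120):
    """Return (best empirical survival, best unit-norm (eta1, eta2))."""
    best_value, best_eta = -1.0, None
    for k in range(n_angles):
        angle = 2 * math.pi * k / n_angles
        eta = (math.cos(angle), math.sin(angle))
        value = sum(survives(eta[0], eta[1], p) for p in patients) / len(patients)
        if value > best_value:
            best_value, best_eta = value, eta
    return best_value, best_eta

value, eta = grid_search(draw_patients(4000))
```

With a sample this small the maximizer is noisy, so the result should only be expected to resemble the reported (0.890, −0.456) and 0.567 roughly; a finer grid and a much larger sample would be needed to reproduce them closely.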

We compare the unsmoothed and smoothed estimators. For both estimators, the propensity score models π0 and π1 are assumed known, as in randomized clinical trials. Simulation results for 1000 replications are summarized in Table 5. From the results, we observe that: (i) both the unsmoothed and smoothed estimation methods give nearly unbiased estimators of ηopt, and the t-year survival probability following the estimated optimal treatment regime (denoted by S(η̂opt) in the table) is very close to the t-year survival probability following the true optimal treatment regime ηopt; (ii) the mean of the estimated standard error (SE) of Ŝ(η̂opt) based on the established theory is close to the standard deviation of the estimates given in parentheses; (iii) the unsmoothed estimator of the t-year survival probability following the estimated optimal treatment regime (denoted by Ŝ(η̂opt)) has relatively large bias, and the associated coverage probability (CP) is below the nominal level; and (iv) the smoothed estimator of the t-year survival probability following the estimated optimal treatment regime has much smaller bias and thus leads to proper coverage probability.

6. Application to ACTG 175

We illustrate the proposed methods for a single decision with the data from ACTG Study 175 (Hammer et al., 1996). Subjects were randomized with equal probability to four treatment groups: zidovudine (ZDV) monotherapy, ZDV plus didanosine (ddI), ZDV plus zalcitabine (zal), and ddI monotherapy. A primary composite endpoint of interest is the time to a decline of more than 50% in the CD4 count, progression to AIDS, or death, whichever comes first. From the treatment-specific Kaplan-Meier curves, it can be clearly seen that ZDV+ddI, ZDV+zal and ddI alone are uniformly better than ZDV alone in terms of survival. In addition, ZDV+ddI and ZDV+zal are overall the two best treatments, giving the highest survival probabilities, especially after day 400. For simplicity, we consider only two treatment options in our analysis, A = 1 for ZDV+ddI and A = 0 for ZDV+zal, which involves 1046 subjects. For each subject, there are 12 baseline clinical covariates; preliminary analyses showed that Karnofsky score (Karnof), baseline CD4 count (CD40), and age (Age) are three important risk predictors and may interact with treatment. We include these three covariates in constructing treatment regimes. Our goal is to find the optimal treatment regime in the class of linear regimes 𝒢 = {gη = I(η0 + η1x1 + η2x2 + η3x3 ≥ 0) : η = (η0, η1, η2, η3)T, ||η|| = 1} that maximizes the t-year survival probability, where x1 is Karnof, x2 is CD40, and x3 is Age. Because the data come from a randomized study, we use a constant model for the propensity score and estimate this constant from the data. For augmented estimation, we posit the proportional hazards model given in (5). We estimate optimal treatment regimes at days t = 400, 600, 800 and 1000.

The estimated optimal treatment regimes and the associated t-year survival probabilities are presented in Table 6. We present results only for S-IPSWKME and S-AIPSWKME, as they showed better numerical performance than their unsmoothed counterparts in our simulation studies. The numbers in the columns Intercept, Karnof, CD40 and Age are the parameter estimates η̃opt defining the optimal treatment regimes, and S̃(t; η̃opt) is the estimated t-year survival probability following the estimated optimal treatment regime. From the table, the estimated optimal treatment regime for an earlier time may differ from that for a later time. For example, comparing the estimated optimal treatment regimes for t = 600 and t = 800, the S-IPSWKME assigns 353 patients to treatment 0 and 583 patients to treatment 1 at both time points. However, it assigns 52 patients to treatment 0 for t = 600 but to treatment 1 for t = 800, and 58 patients to treatment 1 for t = 600 but to treatment 0 for t = 800. For the S-AIPSWKME, the findings are similar. S-IPSWKME and S-AIPSWKME can yield noticeably different parameter estimates η̃opt (for example, at t = 400). However, the corresponding optimal treatment regimes are similar. Using the results for day 600 as an example, among the 1046 subjects, only 57 are assigned different treatments by the estimated optimal treatment regimes based on S-IPSWKME and S-AIPSWKME. In addition, the estimated t-year survival probabilities following the estimated optimal treatment regimes are nearly the same for S-IPSWKME and S-AIPSWKME.
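The comparison of two estimated regimes amounts to cross-tabulating their assignments; note that the four counts reported for t = 600 versus t = 800 add up to 353 + 583 + 52 + 58 = 1046, the full sample. A minimal sketch follows. The coefficient vectors are the S-IPSWKME rows of Table 6, but the covariate values are hypothetical, whether the covariates were rescaled before fitting is not stated here, and the function names are invented for the example.

```python
from collections import Counter

def assign(eta, x):
    """g_eta(x) = I(eta0 + eta1*x1 + eta2*x2 + eta3*x3 >= 0)."""
    score = eta[0] + sum(e * xi for e, xi in zip(eta[1:], x))
    return 1 if score >= 0 else 0

def cross_tabulate(eta_a, eta_b, covariates):
    """Counts of (treatment under eta_a, treatment under eta_b) over patients."""
    return Counter((assign(eta_a, x), assign(eta_b, x)) for x in covariates)

eta_600 = (0.975, -0.082, 0.001, 0.206)   # S-IPSWKME, t = 600 (Table 6)
eta_800 = (0.871, -0.133, -0.010, 0.473)  # S-IPSWKME, t = 800 (Table 6)

# Hypothetical (Karnof, CD40, Age) values, for illustration only.
patients = [(100, 350, 35), (90, 400, 25), (100, 200, 30), (90, 300, 40)]
table = cross_tabulate(eta_600, eta_800, patients)
```

The off-diagonal cells `table[(0, 1)]` and `table[(1, 0)]` play the role of the 52 and 58 patients whose assignment changes between the two time points.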

Table 6.

Estimation results for the ACTG175 data.

t Method Intercept Karnof CD40 Age S̃(t; η̃opt) CI1 CI0
400 I −0.303 −0.340 0.024 0.890 0.965 (0.008) (−0.002, 0.023) (−0.003, 0.044)
A −0.729 −0.240 0.018 0.640 0.965 (0.008) (−0.002, 0.022) (−0.003, 0.043)
600 I 0.975 −0.082 0.001 0.206 0.923 (0.012) (0.000, 0.045) (−0.006, 0.052)
A 0.909 −0.137 0.000 0.392 0.922 (0.012) (0.000, 0.043) (−0.009, 0.052)
800 I 0.871 −0.133 −0.010 0.473 0.887 (0.014) (0.008, 0.058) (−0.002, 0.069)
A 0.874 −0.131 −0.009 0.469 0.886 (0.014) (0.006, 0.057) (−0.003, 0.068)
1000 I −0.210 −0.185 −0.035 0.959 0.824 (0.017) (0.004, 0.060) (−0.006, 0.081)
A 0.001 −0.187 −0.037 0.982 0.823 (0.017) (0.002, 0.059) (−0.007, 0.080)

I denotes the S-IPSWKME and A denotes the S-AIPSWKME; the numbers in parentheses are the estimated standard errors; CI1 and CI0 denote the 95% confidence intervals for the difference between the value functions under the estimated optimal treatment regime and under the simple regime assigning all subjects to treatment 1 and to treatment 0, respectively.

Next, we compare the estimated optimal regimes with the simple regimes that assign all subjects to the same treatment. Specifically, we construct 95% Wald-type confidence intervals, based on the derived asymptotic normal distribution, for the difference between the estimated t-year survival probabilities under the estimated optimal treatment regimes and under the simple regimes. The results are also given in Table 6. The confidence intervals either lie above zero or, when they contain zero, have a left endpoint very close to zero. This suggests that the increase in value realized by following the estimated optimal treatment regimes, compared with the simple regimes, is significant or at least marginally significant. The Kaplan-Meier curves for patients following the estimated optimal treatment regimes (not shown here) are all uniformly better than those for each single treatment.
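A Wald-type interval has the form estimate ± 1.96 × SE. The sketch below shows the construction and, going the other way, recovers the point estimate and standard error implied by one interval from Table 6 (CI1 for the S-IPSWKME at t = 800). The function names are hypothetical, and because the published endpoints are rounded, the recovered standard error is only approximate.

```python
def wald_interval(difference, std_error, z=1.96):
    """95% Wald interval (by default) for an asymptotically normal estimate."""
    return (difference - z * std_error, difference + z * std_error)

def implied_estimate_and_se(lower, upper, z=1.96):
    """Midpoint and standard error implied by a symmetric Wald interval."""
    return ((lower + upper) / 2, (upper - lower) / (2 * z))

# CI1 for the S-IPSWKME at t = 800 in Table 6 is (0.008, 0.058).
estimate, se = implied_estimate_and_se(0.008, 0.058)
lower, upper = wald_interval(estimate, se)
```

Here the implied difference in 800-day survival probability is about 0.033 with a standard error of roughly 0.013.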

We have also estimated the optimal treatment regimes using the proposed methods based on all twelve covariates, with and without smoothing. We do not report these results here for brevity; however, we note that the results for the smoothed estimators with three versus twelve covariates are comparable, demonstrating the adaptivity of the smoothed estimators to relatively many covariates. The unsmoothed estimators can lead to slightly different optimal treatment rules but with similar estimated survival probabilities. In addition, their estimated survival probabilities show relatively larger differences between the cases with three and twelve covariates, which is likely due to the instability in maximizing the unsmoothed value functions.

7. Discussion

We have proposed Kaplan-Meier type estimators for the survival function of patients following a given (dynamic) treatment regime and introduced kernel smoothing to improve their performance. An optimal (dynamic) treatment regime within a class of prespecified treatment regimes may then be estimated by maximizing the estimator of the associated t-year survival probability. We consider the case where there are two treatment options at each decision time point. However, the proposed methods can be generalized to incorporate multiple treatment options at each decision by defining a treatment regime using multiple indexes instead of a single indicator function. In addition, the current methods find the optimal (dynamic) treatment regime that maximizes the t-year survival probability; they can also be generalized to maximize other clinical outcomes of interest. Specifically, using the IPSWKME, ŜI(·; η), as an illustration, we can find the optimal treatment regime to maximize f{ŜI(·; η)}, where f is a specified functional of interest; e.g., f{ŜI(·; η)} = ∫_0^L ŜI(u; η) du corresponds to the restricted mean survival time under a given treatment regime. Likewise, f{ŜI(·; η)} = sup{u : ŜI(u; η) ≥ 0.5} corresponds to the median survival time under a given treatment regime.
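For a step-function survival estimate, both functionals mentioned above reduce to simple computations. The sketch below assumes a right-continuous curve that equals 1 before the first jump and is described by its jump times and the values taken from each jump onward; the representation and the function names are assumptions made for the example.

```python
def restricted_mean(times, surv, limit):
    """Integral of a right-continuous step survival curve over [0, limit].

    times: increasing jump times; surv[i]: value of the curve on
    [times[i], times[i+1]). The curve equals 1 before times[0].
    """
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(times, surv):
        if t >= limit:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (limit - prev_t)

def median_survival(times, surv, level=0.5):
    """sup{u : S(u) >= level}; infinity if the curve never drops below level."""
    for t, s in zip(times, surv):
        if s < level:
            return t
    return float("inf")

# Hypothetical curve: 1 on [0,1), 0.8 on [1,2), 0.4 on [2,3), 0.1 from 3 on.
jump_times = [1.0, 2.0, 3.0]
values = [0.8, 0.4, 0.1]
rmst = restricted_mean(jump_times, values, limit=2.5)  # 1*1 + 0.8*1 + 0.4*0.5
median = median_survival(jump_times, values)
```

Maximizing either quantity over η would then replace the t-year survival probability as the objective.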

In this paper, we study the asymptotic distributions of the estimated value function under the derived optimal treatment regimes. The asymptotic properties of η̂ in the treatment regime function are very challenging to obtain. The convergence rate of η̂ is slower than the classical n^{1/2}-rate because of the indicator function I(ηᵀX ≥ 0), and the resulting limiting distribution is not standard. Matsouaka et al. (2014) studied a special case where the estimated value function depends on a single threshold value and showed that the estimator of the threshold that maximizes the estimated value function has the n^{1/3}-rate. We conjecture that our estimator η̂ also has the n^{1/3}-rate. This is an interesting problem that warrants future research.

Supplementary Material

Supplementary Appendix

Table 5.

Simulation results for estimating optimal dynamic treatment regimes.

C% S η̂1opt η̂2opt η̂3opt η̂4opt Ŝ(η̂opt) SE CP S(η̂opt) MR
Scenario 1: ηopt = (0.890,−0.456, 0.894,−0.447); S(3; ηopt) = 0.567
15 F 0.881 (0.035) −0.468 (0.062) 0.893 (0.017) −0.449 (0.033) 0.591 (0.028) 0.030 0.887 0.559 (0.008) 0.107 (0.054)
T 0.884 (0.029) −0.463 (0.052) 0.894 (0.013) −0.448 (0.026) 0.570 (0.028) 0.030 0.955 0.561 (0.006) 0.088 (0.048)
40 F 0.878 (0.042) −0.471 (0.072) 0.890 (0.022) −0.453 (0.042) 0.600 (0.036) 0.037 0.841 0.555 (0.011) 0.125 (0.061)
T 0.884 (0.034) −0.463 (0.061) 0.892 (0.018) −0.450 (0.035) 0.574 (0.035) 0.038 0.955 0.558 (0.009) 0.108 (0.056)

Scenario 2: ηopt = (−0.891, 0.454, 0.894,−0.447); S(6; ηopt) = 0.624
15 F −0.888 (0.025) 0.456 (0.045) 0.891 (0.018) −0.452 (0.035) 0.645 (0.025) 0.027 0.890 0.615 (0.008) 0.099 (0.052)
T −0.889 (0.018) 0.456 (0.034) 0.893 (0.014) −0.450 (0.028) 0.624 (0.024) 0.027 0.967 0.618 (0.005) 0.079 (0.042)
40 F −0.886 (0.030) 0.459 (0.053) 0.890 (0.020) −0.453 (0.038) 0.650 (0.027) 0.029 0.855 0.613 (0.010) 0.110 (0.055)
T −0.888 (0.022) 0.457 (0.040) 0.892 (0.016) −0.451 (0.032) 0.626 (0.027) 0.030 0.972 0.617 (0.007) 0.091 (0.048)

Scenario 3: ηopt = (0.908,−0.419, 0.894,−0.447); S(3; ηopt) = 0.702
15 F 0.897 (0.037) −0.434 (0.069) 0.892 (0.020) −0.449 (0.039) 0.728 (0.026) 0.027 0.829 0.692 (0.009) 0.134 (0.067)
T 0.900 (0.031) −0.430 (0.060) 0.893 (0.016) −0.448 (0.031) 0.707 (0.026) 0.027 0.952 0.695 (0.007) 0.116 (0.061)
40 F 0.895 (0.041) −0.437 (0.075) 0.891 (0.023) −0.451 (0.043) 0.732 (0.028) 0.029 0.809 0.691 (0.010) 0.142 (0.073)
T 0.899 (0.035) −0.431 (0.066) 0.893 (0.019) −0.449 (0.036) 0.709 (0.028) 0.030 0.951 0.693 (0.008) 0.126 (0.066)

C% denotes the censoring rate; S indicates whether the smoothing technique is applied (T) or not (F).

Acknowledgments

The authors are grateful to two referees and an Associate Editor for their thoughtful and suggestive comments, which have helped to greatly improve on an earlier manuscript. The work was partially supported by National Institutes of Health grants R01 CA140632 and P01 CA142538.

A. Proof of Theorems

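To establish the asymptotic results given in Theorems 1–2, we need some regularity conditions. Recall that a working logistic model (3) with parameters θ is assumed for the propensity scores for the IPSWKME, and a working proportional hazards model (5) with parameters β and Λ0 is further assumed for the survival time T for the AIPSWKME. Let $\nu_{Ai} = (X_i^T, A_i, A_i X_i^T)^T$ and $\nu_{\eta i} = (X_i^T, g_\eta(X_i), g_\eta(X_i) X_i^T)^T$. Define

\[
K_1^I(X, A, T, \delta; \eta) = \int_0^t \frac{(2A-1)\,dN(u)}{\pi^* E\{w_\eta^* Y(u)\}}, \qquad
K_2^I(X, A, T, \delta; \eta) = \int_0^t \frac{(2A-1)\,Y(u)\,E[\{(2A-1) g_\eta(X) + (1-A)\}\,dN(u)]}{[\pi^* E\{w_\eta^* Y(u)\}]^2},
\]

where $w_\eta^* = [A g_\eta(X) + (1-A)\{1 - g_\eta(X)\}]/\pi^*$ and $\pi^* = \pi(X; \theta^*)A + \{1 - \pi(X; \theta^*)\}(1-A)$. In addition, define

\[
K_1^A(X, A, T, \delta; \eta) = \int_0^t \frac{J_1^A(u) - J_0^A(u)}{E[\{L_1^A(u) - L_0^A(u)\} g_\eta(X) + L_0^A(u)]}, \qquad
K_2^A(X, A, T, \delta; \eta) = \int_0^t \frac{\{L_1^A(u) - L_0^A(u)\}\,E[\{J_1^A(u) - J_0^A(u)\} g_\eta(X) + J_0^A(u)]}{\left(E[\{L_1^A(u) - L_0^A(u)\} g_\eta(X) + L_0^A(u)]\right)^2},
\]

where, for k = 0, 1,

\[
J_k^A(u) = \frac{1 - k - (-1)^k A}{\pi^*}\,dN(u) + e_k \left(1 - \frac{1 - k - (-1)^k A}{\pi^*}\right) \exp\{-\Lambda_0(u) e_k\}\,S_C(u)\,d\Lambda_0(u),
\]

\[
L_k^A(u) = \frac{1 - k - (-1)^k A}{\pi^*}\,Y(u) + \left(1 - \frac{1 - k - (-1)^k A}{\pi^*}\right) \exp\{-\Lambda_0(u) e_k\}\,S_C(u),
\]

and $e_k = \exp\{\beta^{*T}(X^T, k, kX^T)^T\}$. We assume the following conditions.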

  • A1

    The covariates X are bounded.

  • A2

    The propensity score π(X) is bounded away from 0 and 1 for all possible values of X.

  • A3

    The equation $E\left[\left\{A - \frac{\exp(\theta^T X)}{1 + \exp(\theta^T X)}\right\} X\right] = 0$ has a unique solution θ*.

  • A4
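
    The equation

    \[
    E\left(\int_0^\tau \left[\nu_{Ai} - \frac{E\{Y_i(s)\exp(\beta^T \nu_{Ai})\,\nu_{Ai}\}}{E\{Y_i(s)\exp(\beta^T \nu_{Ai})\}}\right] dN_i(s)\right) = 0
    \]

    has a unique solution β*, where τ > t is a prespecified time point satisfying $P(\tilde{T}_i \ge \tau) > 0$. Let $\Lambda_0(u) = E\left[\int_0^u dN_i(s) / E\{Y_i(s)\exp(\beta^{*T}\nu_{Ai})\}\right]$, and assume that $\Lambda_0(\tau) < \infty$.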

  • A5

    $\sup_{\|\eta\| = 1} E[\{K_j^I(X, A, T, \delta; \eta)\}^2] < \infty$ and $\sup_{\|\eta\| = 1} E[\{K_j^A(X, A, T, \delta; \eta)\}^2] < \infty$, j = 1, 2.

  • A6

    $nh \to \infty$ and $nh^4 \to 0$ as $n \to \infty$.
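To illustrate the bandwidth condition A6, consider a bandwidth of the form $h = c\,n^{-a}$ with $c > 0$. This particular form is an example chosen here for illustration, not a choice stated in the text. Then

\[
nh = c\,n^{1-a} \to \infty \iff a < 1, \qquad nh^4 = c^4 n^{1-4a} \to 0 \iff a > \tfrac{1}{4},
\]

so any rate with $1/4 < a < 1$ satisfies A6. For instance, $h = n^{-1/3}$ gives $nh = n^{2/3} \to \infty$ and $nh^4 = n^{-1/3} \to 0$.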

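Under regularity conditions A1–A4, we have the following asymptotic representations:

\[
\sqrt{n}(\hat\theta - \theta^*) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \phi_{1i} + o_p(1), \qquad
\sqrt{n}(\hat\beta - \beta^*) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \phi_{2i} + o_p(1),
\]

\[
\sqrt{n}\{\hat\Lambda_0(u) - \Lambda_0(u)\} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \phi_{3i}(u) + o_p(1), \qquad
\sqrt{n}\{\hat S_C(u) - S_C(u)\} = \frac{1}{\sqrt{n}}\sum_{i=1}^n \phi_{4i}(u) + o_p(1),
\]

where the ϕ1i's and ϕ2i's are independent and identically distributed mean-zero vectors, and ϕ3i(u) and ϕ4i(u) are independent mean-zero processes. Moreover, consistent estimators ϕ̂1i, ϕ̂2i, ϕ̂3i(u) and ϕ̂4i(u) of ϕ1i, ϕ2i, ϕ3i(u) and ϕ4i(u) can be easily obtained.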

In the following, we give a sketch for the proof of Theorem 1. The detailed proofs for Theorems 1–2 are provided in the Supplementary Appendix.

A.1. Proof of Theorem 1

For any given regime gη, we first derive the asymptotic properties for the corresponding inverse propensity score weighted (IPSW) Nelson-Aalen estimator. Specifically,

\[
\hat\Lambda_I(u; \eta) \equiv \hat\Lambda_I(u; \eta, \hat\theta) = \int_0^u \frac{\sum_{i=1}^n \hat w_{\eta i}\,dN_i(s)}{\sum_{i=1}^n \hat w_{\eta i}\,Y_i(s)}. \tag{A.1}
\]

It is easy to show that ŜI(u; η) and exp{−Λ̂I(u; η)} are asymptotically equivalent for any given η. Therefore, the asymptotic properties of ŜI(u; η) follow readily from those of Λ̂I(u; η).
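The weighted Nelson-Aalen estimator in (A.1) and a weighted product-limit estimator can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the weight follows the form [A g + (1 − A)(1 − g)]/π given in the appendix, the Kaplan-Meier version is assumed to be the usual product over the same weighted increments, and the data and function names are hypothetical.

```python
def regime_weight(a, g, pi1):
    """w = [A*g + (1-A)*(1-g)] / pi_A, with pi_A = pi1 if A == 1 else 1 - pi1."""
    pi_a = pi1 if a == 1 else 1.0 - pi1
    return (a * g + (1 - a) * (1 - g)) / pi_a

def hazard_increments(times, deltas, weights, u):
    """Yield weighted dN(s)/Y(s) at each distinct observed event time s <= u."""
    for s in sorted({t for t, d in zip(times, deltas) if d == 1 and t <= u}):
        d_n = sum(w for t, d, w in zip(times, deltas, weights) if d == 1 and t == s)
        at_risk = sum(w for t, w in zip(times, weights) if t >= s)
        if at_risk > 0:
            yield d_n / at_risk

def ipsw_nelson_aalen(times, deltas, weights, u):
    """Weighted cumulative hazard estimate at time u, as in (A.1)."""
    return sum(hazard_increments(times, deltas, weights, u))

def ipsw_kaplan_meier(times, deltas, weights, u):
    """Weighted product-limit survival estimate at time u."""
    surv = 1.0
    for increment in hazard_increments(times, deltas, weights, u):
        surv *= 1.0 - increment
    return surv

# Hypothetical data: the third subject received treatment 0 while the regime
# recommends 1, so that subject gets weight zero and drops out of both sums.
times = [1.0, 2.0, 2.5, 3.0, 4.0]
deltas = [1, 0, 1, 1, 1]
treatments = [1, 1, 0, 1, 1]
recommended = [1, 1, 1, 1, 1]
weights = [regime_weight(a, g, 0.5) for a, g in zip(treatments, recommended)]
cum_hazard = ipsw_nelson_aalen(times, deltas, weights, u=3.5)
km = ipsw_kaplan_meier(times, deltas, weights, u=3.5)
```

In this tiny example the two estimates differ visibly (exp(−0.75) ≈ 0.47 versus 0.375); the equivalence stated in the text is asymptotic, as the individual hazard increments shrink with growing sample size.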

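When the propensity score model is correctly specified, θ* equals the true value of θ and $w_{\eta i}^* = w_{\eta i}$. Then $n^{-1}\sum_{i=1}^n \hat w_{\eta i} Y_i(s) \to_p E\{w_{\eta i} Y_i(s)\} = E[Y\{g_\eta(X); s\}]$ uniformly for s ∈ [0, τ] as n → ∞. Similarly, $n^{-1}\sum_{i=1}^n \hat w_{\eta i}\,dN_i(s) \to_p E\{w_{\eta i}\,dN_i(s)\} = E[dN\{g_\eta(X); s\}]$ uniformly for s ∈ [0, τ] as n → ∞. Therefore,

\[
\hat\Lambda_I(u; \eta) \to_p \int_0^u \frac{E[dN\{g_\eta(X); s\}]}{E[Y\{g_\eta(X); s\}]} = \int_0^u \frac{S_C(s)\,dP[T\{g_\eta(X)\} \le s]}{S_C(s)\,P[T\{g_\eta(X)\} \ge s]} = -\log\{S(u; \eta)\} \equiv \Lambda(u; \eta),
\]

which establishes the consistency given in (i) of Theorem 1.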

Next, we derive the asymptotic distribution of Λ̂I(u; η). By applying a first-order Taylor expansion of Λ̂I(u; η) with respect to the parameter θ and some empirical process approximation techniques, we have

\[
\sqrt{n}\{\hat\Lambda_I(u; \eta) - \Lambda(u; \eta)\} = n^{-1/2}\sum_{i=1}^n \left(\int_0^u \frac{w_{\eta i}\,dM_i\{g_\eta(X); s\}}{E[Y\{g_\eta(X); s\}]} + D_1(u)^T \phi_{1i}\right) + o_p(1) \equiv n^{-1/2}\sum_{i=1}^n \zeta_i(u; \eta) + o_p(1),
\]

where $M_i\{g_\eta(X); s\} = N_i\{g_\eta(X); s\} - \int_0^s Y_i\{g_\eta(X); v\}\,d\Lambda(v; \eta)$ is a mean-zero martingale process and $D_1(u) = \lim_{n \to \infty} \partial\hat\Lambda_I(u; \eta, \theta)/\partial\theta$. By the delta method, we have $\sqrt{n}\{\hat S_I(u; \eta) - S(u; \eta)\} = -S(u; \eta)\,n^{-1/2}\sum_{i=1}^n \zeta_i(u; \eta) + o_p(1)$, which converges weakly to a mean-zero Gaussian process by empirical process theory. This proves (ii) of Theorem 1.

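Since $\hat\eta_I^{opt}$ is the maximizer of ŜI(t; η) and ηopt is the maximizer of S*(t; η), following arguments similar to those in Zhang et al. (2012b), we have

\[
\sqrt{n}\{\hat S_I(t; \hat\eta_I^{opt}) - S(t; \eta^{opt})\} - \sqrt{n}\{\hat S_I(t; \eta^{opt}) - S(t; \eta^{opt})\} = o_p(1).
\]

It follows that $\sqrt{n}\{\hat S_I(t; \hat\eta_I^{opt}) - S(t; \eta^{opt})\}$ converges in distribution to a mean-zero normal random variable with variance $\{S(t; \eta^{opt})\}^2 E\{\zeta_i^2(t; \eta^{opt})\}$. This proves (iii) of Theorem 1.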

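Finally, we show that $\hat S_I(t; \hat\eta_I^{opt})$ and $\tilde S_I(t; \tilde\eta_I^{opt})$ are asymptotically equivalent. For any given η, we have

\[
\sqrt{n}\{\tilde\Lambda_I(t; \eta) - \hat\Lambda_I(t; \eta)\} = \sqrt{n} \times \frac{1}{n}\sum_{i=1}^n \left\{\Phi\left(\frac{\eta^T X_i}{h}\right) - I(\eta^T X_i \ge 0)\right\} \times K_1^I(X_i, A_i, T_i, \delta; \eta) \tag{A.2}
\]

\[
\qquad + \sqrt{n} \times \frac{1}{n}\sum_{i=1}^n \left\{\Phi\left(\frac{\eta^T X_i}{h}\right) - I(\eta^T X_i \ge 0)\right\} \times K_2^I(X_i, A_i, T_i, \delta; \eta) + o_p(1). \tag{A.3}
\]

Following arguments similar to those in Heller (2007), we can show that $\sup_{\|\eta\| = 1} |(A.2)| = o_p(1)$ and $\sup_{\|\eta\| = 1} |(A.3)| = o_p(1)$. Therefore, $\sqrt{n}\{\tilde\Lambda_I(t; \eta) - \hat\Lambda_I(t; \eta)\} = o_p(1)$ uniformly in η, which implies $\sqrt{n}\{\tilde S_I(t; \eta) - \hat S_I(t; \eta)\} = o_p(1)$ uniformly in η. It follows that $\sqrt{n}\{\tilde S_I(t; \tilde\eta_I^{opt}) - \hat S_I(t; \hat\eta_I^{opt})\} = o_p(1)$, which proves (iv) of Theorem 1.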
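The smoothing step replaces the indicator I(ηᵀX ≥ 0) by Φ(ηᵀX/h). A minimal sketch of this surrogate is given below, assuming Φ is the standard normal distribution function; the function names and the numerical values are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def smoothed_indicator(score, h):
    """Phi(score / h), a smooth surrogate for I(score >= 0) with bandwidth h > 0."""
    return normal_cdf(score / h)

def indicator(score):
    """The unsmoothed rule I(score >= 0)."""
    return 1.0 if score >= 0 else 0.0

# Gap between the surrogate and the indicator at a fixed score as h shrinks.
score = 0.2
gaps = [abs(smoothed_indicator(score, h) - indicator(score)) for h in (1.0, 0.1, 0.01)]
```

Away from the decision boundary the gap vanishes quickly as h decreases, while at score = 0 the surrogate stays at 0.5; the disagreement is therefore confined to a shrinking band around ηᵀX = 0.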

References

  1. Bai X, Tsiatis AA, Lu W, Song R. Optimal treatment regimes for survival endpoints using a locally-efficient doubly-robust estimator from a classification perspective. Technical Report. 2014. doi: 10.1007/s10985-016-9376-x.
  2. Cheng SC, Wei LJ, Ying Z. Analysis of transformation models with censored data. Biometrika. 1995;82(4):835–845.
  3. Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society Series B (Methodological). 1972;34(2):187–220.
  4. Goldberg Y, Kosorok MR. Q-learning with censored data. Annals of Statistics. 2012;40:529–560. doi: 10.1214/12-AOS968.
  5. Hammer SM, Katzenstein DA, Hughes MD, Gundacker H, Schooley RT, Haubrich RH, Henry WK, Lederman MM, Phair JP, Niu M, Hirsch MS, Merigan TC. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. New England Journal of Medicine. 1996;335(15):1081–1090. doi: 10.1056/NEJM199610103351501.
  6. Heller G. Smoothed rank regression with censored data. Journal of the American Statistical Association. 2007;102(478):552–559.
  7. Matsouaka RA, Li J, Cai T. Evaluating marker-guided treatment selection strategies. Biometrics. 2014;70:489–499. doi: 10.1111/biom.12179.
  8. Mebane WR Jr, Sekhon JS. Genetic optimization using derivatives: The rgenoud package for R. Journal of Statistical Software. 2011;42(11):1–26.
  9. Murphy SA. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2003;65(2):331–355.
  10. Murphy SA. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine. 2005;24(10):1455–1481. doi: 10.1002/sim.2022.
  11. Robins JM. Optimal structural nested models for optimal sequential decisions. Proceedings of the Second Seattle Symposium in Biostatistics; Springer; 2004. pp. 189–326.
  12. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology. 1974;66(5):688–701.
  13. Shorack GR, Wellner JA. Empirical processes with applications to statistics. Vol. 59. SIAM; 2009.
  14. Watkins C, Dayan P. Q-learning. Machine Learning. 1992;8(3–4):279–292.
  15. Watkins CJ. Learning from delayed rewards. PhD thesis. University of Cambridge; England: 1989.
  16. Zhang B, Tsiatis AA, Davidian M, Zhang M, Laber EB. Estimating optimal treatment regimes from a classification perspective. Stat. 2012a;1(1):103–114. doi: 10.1002/sta.411.
  17. Zhang B, Tsiatis AA, Laber EB, Davidian M. A robust method for estimating optimal treatment regimes. Biometrics. 2012b;68(4):1010–1018. doi: 10.1111/j.1541-0420.2012.01763.x.
  18. Zhang B, Tsiatis AA, Laber EB, Davidian M. Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika. 2013;100:681–694. doi: 10.1093/biomet/ast014.
  19. Zhao L, Tian L, Cai T, Claggett B, Wei LJ. Effectively selecting a target population for a future comparative study. Journal of the American Statistical Association. 2013;108:527–539. doi: 10.1080/01621459.2013.770705.
  20. Zhao Y, Kosorok MR, Zeng D. Reinforcement learning design for cancer clinical trials. Statistics in Medicine. 2009;28(26):3294–3315. doi: 10.1002/sim.3720.
  21. Zhao Y, Zeng D, Laber E, Song R, Yuan M, Kosorok M. Doubly robust learning for estimating individualized treatment with censored data. Biometrika. 2015;102:151–168. doi: 10.1093/biomet/asu050.
  22. Zhao Y, Zeng D, Rush AJ, Kosorok MR. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association. 2012;107(499):1106–1118. doi: 10.1080/01621459.2012.695674.
