Abstract
A dynamic treatment regimen incorporates both accrued information and the long-term effects of treatment, using data from specially designed clinical trials. As these trials become increasingly popular in conjunction with longitudinal data from clinical studies, the development of statistical inference for optimal dynamic treatment regimens is a high priority. In this paper, we propose a new machine learning framework called penalized Q-learning, under which valid statistical inference is established. We also propose a new statistical procedure, individual selection, and corresponding methods for incorporating individual selection within penalized Q-learning. Extensive numerical studies comparing the proposed methods with existing methods under a variety of scenarios demonstrate that the proposed approach is both inferentially and computationally superior. The methods are illustrated with data from a depression clinical trial.
Key words and phrases: Dynamic treatment regimen, Individual selection, Multi-stage, Penalized Q-learning, Q-learning, Shrinkage, Two-stage procedure
1. Introduction
Developing effective therapeutic regimens for diseases is one of the essential goals of medical research. Two major design and analysis challenges in this effort are taking accrued information into account in clinical trial designs and effectively incorporating the long-term benefits and risks of treatment due to delayed effects. One of the most promising approaches for addressing these two challenges is the use of dynamic treatment regimens, also called adaptive treatment strategies (Murphy, 2003), which have been applied in a number of settings, such as drug and alcohol dependency studies.
Reinforcement learning, one of the primary tools used in developing dynamic treatment regimens, is a sub-area of machine learning in which learning occurs through trial-and-error interactions with a dynamic environment (Kaelbling et al., 1996). Because reinforcement learning techniques have been shown to be effective in developing optimal dynamic treatment regimens, the area is attracting increased attention among statistical researchers. As a recent example, a new approach to cancer clinical trials based on Q-learning, a specific reinforcement learning technique, has been proposed by Zhao et al. (2009) and Zhao et al. (2011). A substantial body of estimation methods has also been proposed for optimal dynamic treatment regimens, including, for example, Chakraborty et al. (2010), who developed a Q-learning framework based on linear models. Other related literature includes likelihood-based methods (Thall et al., 2000, 2002, 2007) and semiparametric methods (Murphy, 2003; Robins, 2004; Lunceford et al., 2002; Wahed and Tsiatis, 2004, 2006; Moodie et al., 2009).
In contrast to the substantial body of estimating methods, the development of statistical inference for optimal dynamic treatment regimens is very limited. This sequential, multi-stage decision making problem is at the intersection of machine learning, optimization and statistical inference and is thus quite challenging. As discussed in Robins (2004), and recognized by many other researchers, the challenge arises when the optimal last stage treatment is non-unique for at least some subjects in the population, causing estimation bias and failure of traditional inferential approaches. There have been a number of proposals to correct this. For example, Moodie and Richardson (2010) proposed a method called Zeroing Instead of Plugging In. This is referred to as the hard-threshold estimator by Chakraborty et al. (2010), who also proposed a soft-threshold estimator and implemented several bootstrap methods. There is, however, a lack of theoretical support for these methods. Moreover, simulations indicate that neither hard-thresholding nor soft-thresholding, in conjunction with their bootstrap implementation, works uniformly well. We are therefore motivated to develop improved, asymptotically valid inference for optimal dynamic treatment regimens.
In this paper, we develop a new reinforcement learning framework for discovering optimal dynamic treatment regimens: penalized Q-learning. The major distinction of penalized Q-learning from traditional Q-learning is in the form of the objective Q-function at each stage. While the new method shares many of the properties of traditional Q-learning, it has some significant advantages. Based on penalized Q-learning, we propose effective inferential procedures for optimal dynamic treatment regimens. In contrast to existing bootstrap approaches, our variance calculations are based on explicit formulae and hence are much less time-consuming. Theoretical studies and extensive empirical evidence support the validity of the proposed methods. Since penalized Q-learning puts a penalty on each individual, it automatically initiates another procedure, individual selection, which selects those individuals without treatment effects from the population. Successful individual selection, i.e., correctly identifying individuals without treatment effects, is the key to improved statistical inference.
While the proposed individual selection procedure shares some similarities with certain commonly used variable selection methods, the approaches differ fundamentally in other ways. These issues will be addressed in greater detail below.
2. Statistical Inference with Q-learning
2.1. Personalized Dynamic Treatment Regimens
We now introduce the multi-stage decision problem, illustrated with two-stage clinical trial data. Consider data from a sequential multiple assignment randomized trial (SMART), where treatments are randomized at multiple stages (Lavori and Dawson, 2000; Murphy, 2005). The longitudinal data on each patient take the form H = (O1, A1, R1, O2, A2, R2), where (Ot, At, Rt), t = 1, 2, are the random variables collected at the two stages. As components of H, At is the randomly assigned treatment, Ot is the observed patient covariates prior to the treatment assignment, and Rt is the clinical outcome, each at stage t. The observed data consist of n independent and identically distributed copies of H. Our goal is to estimate the best treatment decision for different patients using the observed data at each stage. This is equivalent to identifying a sequence of ordered rules, which we call a personalized dynamic treatment regimen, d = (d1, d2)T, one rule for each stage, mapping from the domain 𝒮t of the patient history St to the domain 𝒜t of the treatment At, where S1 = O1 and S2 = (O1, A1, R1, O2).
Denote the distribution of H by P and the expectations with respect to this distribution by E. Let Pd denote the distribution of H, and Ed the expectations with respect to this distribution, when the dynamic treatment regimen d(·) is used to assign treatments. Define the value function to be V(d) = Ed(R1 + R2). Thus, an optimal dynamic treatment regimen, d0, is a rule that has the maximal value, i.e., d0 ∈ arg maxd V(d). We use upper case letters to denote random variables and lower case letters to denote values of the random variables. In this two-stage setting, if we define Q2(s2, a2) = E(R2|S2 = s2, A2 = a2) and Q1(s1, a1) = E(R1 + maxa2∈𝒜2 Q2(S2, a2)|S1 = s1, A1 = a1), then the optimal decision rule at time t is dt(st) = arg maxat∈𝒜t Qt(st, at), where Qt is the Q-function at time t and 𝒜t is the set of possible treatments at stage t.
2.2. Q-Learning for Personalized Dynamic Treatment Regimens
Q-learning is a backward recursive approach commonly used for estimating the optimal personalized dynamic treatment regimens. Following Chakraborty et al. (2010), let the Q-function for time t = 1, 2 be modeled as
Qt(St, At; βt, ψt) = βtTSt(1) + (ψtTSt(2))At,  t = 1, 2,   (2.1)
where St is the full state information at time t introduced in the previous section, and St(1) and St(2) are given features constructed as functions of St. For example, they can be subsets of St selected for the model; they can be identical or different, and the constant 1 is included in both St(1) and St(2). The action At takes value 1 or −1. The parameters of the Q-function are θt = (βtT, ψtT)T, where βt reflects the main effect of the current state on the outcome, while ψt reflects the interaction effect between the current state and the treatment choice. The true values of these parameters are denoted θt0, βt0 and ψt0, respectively. We note that the additive formulation of rewards is not restrictive: we can always define the intermediate rewards to be zero and the final-stage reward to be the final outcome of interest, which does not change the value function we aim to maximize. The linear models studied here are also quite general if the state variables in the regression are taken to be basis functions of the historical variables (for instance, using a kernel machine). Furthermore, one can always perform model diagnostics to check the linearity assumption.
Suppose that the observed data consist of (Sti, Ati, Rti) for patients i = 1,…, n and t = 1, 2, from a sample of n independent patients. The two-stage empirical version of the Q-learning procedure can be summarized as follows:
- Step 1. Start with a regular, non-shrinkage estimator, based on least squares, for the second stage:
θ̃2 = (β̃2T, ψ̃2T)T = arg minθ2 Σi=1n {R2i − Q2(S2i, A2i; θ2)}2 = (Z2TZ2)−1Z2TR2,
where θ̃2 is the least squares estimator, Z2 is the stage-2 design matrix with ith row (S2i(1)T, A2iS2i(2)T) and R2 = (R21,…, R2n)T. Here and in the sequel, Sti(k) denotes the value of the feature vector St(k) for subject i, where k = 1, 2, t = 1, 2 and i = 1,…, n.
- Step 2. Estimate the first-stage individual pseudo-outcome by Ŷ1iHM, where
Ŷ1iHM = R1i + maxa2 Q2(S2i, a2; θ̃2) = R1i + β̃2TS2i(1) + |ψ̃2TS2i(2)|,   (2.2)
and the superscript HM is the index for the hard-max pseudo-outcome.
- Step 3. Estimate the first-stage parameters by least squares estimation:
θ̂1 = (β̂1T, ψ̂1T)T = arg minθ1 Σi=1n {Ŷ1iHM − Q1(S1i, A1i; θ1)}2 = (Z1TZ1)−1Z1TŶ1HM,
where Z1 is the stage-1 design matrix whose ith row is (S1i(1)T, A1iS1i(2)T) and Ŷ1HM = (Ŷ11HM,…, Ŷ1nHM)T. The corresponding estimator of ψ1, denoted by ψ̂1HM, is referred to as the hard-max estimator in Chakraborty et al. (2010), because of the maximizing operation used in its definition. A code sketch of the three steps is given below.
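To make the three steps concrete, the following is a minimal Python sketch of the hard-max Q-learning recursion under the linear working models (2.1). The array names, the feature-matrix layout (each feature matrix containing a leading column of ones) and the ±1 treatment coding are illustrative assumptions of this sketch, not part of the original procedure.

```python
import numpy as np

def ols(Z, y):
    """Ordinary least squares coefficients for min ||y - Z @ theta||^2."""
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return theta

def decision_rule(psi, s_feat):
    """Estimated optimal treatment d(s) = sgn(psi' s^(2)), with sgn(0) taken as -1."""
    return np.where(s_feat @ psi > 0, 1, -1)

def hard_max_q_learning(S1f1, S1f2, A1, R1, S2f1, S2f2, A2, R2):
    """Two-stage hard-max Q-learning with the linear working models (2.1):
    Q_t = beta_t' S_t^(1) + (psi_t' S_t^(2)) A_t, treatments coded in {-1, +1}."""
    k1, k2 = S1f1.shape[1], S2f1.shape[1]

    # Step 1: stage-2 least squares on the design [S2^(1), A2 * S2^(2)]
    Z2 = np.hstack([S2f1, A2[:, None] * S2f2])
    theta2 = ols(Z2, R2)
    beta2, psi2 = theta2[:k2], theta2[k2:]

    # Step 2: hard-max pseudo-outcome (2.2): R1 + beta2'S2^(1) + |psi2'S2^(2)|
    Y1 = R1 + S2f1 @ beta2 + np.abs(S2f2 @ psi2)

    # Step 3: stage-1 least squares on the design [S1^(1), A1 * S1^(2)]
    Z1 = np.hstack([S1f1, A1[:, None] * S1f2])
    theta1 = ols(Z1, Y1)
    beta1, psi1 = theta1[:k1], theta1[k1:]
    return beta1, psi1, beta2, psi2
```

The estimated regimen then assigns treatment sgn(ψ̂tTst(2)) at each stage via `decision_rule`.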
2.3. Challenges in Statistical Inference
When the Q-function takes the linear model form (2.1), the optimal dynamic treatment regimen for patient i is
dt(sti) = sgn(ψtTsti(2)),  t = 1, 2,
where sgn(x) = 1 if x > 0 and −1 otherwise. We use sti to denote the observed value of St for patient i and sti(k) to denote the observed value of St(k) for stage t = 1, 2, feature set k = 1, 2 and patient i. The parameters ψ2 are of particular interest for inference on the optimal dynamic treatment regimen, as ψ2 represents the interaction effect between the treatment and the covariates.
During the Q-learning procedure, when there is a positive probability that ψ20TS2(2) = 0, the first-stage hard-max pseudo-outcome Ŷ1HM is a non-smooth function of ψ̃2. As a linear function of Ŷ1HM, the hard-max estimator ψ̂1HM is also a non-smooth function of ψ̃2. Consequently, the asymptotic distribution of n1/2(ψ̂1HM − ψ10) is neither normal nor any other well-tabulated distribution when P(ψ20TS2(2) = 0) > 0. In this non-standard case, standard inference methods such as Wald-type confidence intervals are no longer valid.
2.4. Review of Existing Approaches
To overcome the difficulty of inference for ψ1 in Q-learning, several methods have been proposed, which we briefly review in the two-stage set-up. Since all the methods are also nested in the Q-learning procedure, we update the two-stage version of Q-learning as follows.
- Step 2. Estimate the first-stage individual pseudo-outcome by shrinking the second-stage regular estimators via hard-thresholding or soft-thresholding. Specifically, the hard-threshold pseudo-outcome is denoted Ŷ1iHT, with
Ŷ1iHT = R1i + β̃2TS2i(1) + |ψ̃2TS2i(2)| I{|ψ̃2TS2i(2)| > zα/2 (S2i(2)TΣ̂2S2i(2)/n)1/2},   (2.3)
where Σ̂2/n is the estimated covariance matrix of ψ̃2, α is a pre-specified significance level and zα/2 is the (1 − α/2)-quantile of the standard normal distribution. The soft-threshold pseudo-outcome is denoted Ŷ1iST, with
Ŷ1iST = R1i + β̃2TS2i(1) + |ψ̃2TS2i(2)| (1 − λi/|ψ̃2TS2i(2)|)+,   (2.4)
where x+ = xI{x > 0} and λi is a tuning parameter.
- Step 3. Estimate the first-stage parameters by least squares estimation:
θ̂1 = (Z1TZ1)−1Z1TŶ1º,
where Ŷ1º is either the hard-threshold or the soft-threshold pseudo-outcome, as defined in (2.3) or (2.4), respectively. The corresponding estimator of ψ1, denoted ψ̂1º, can be the hard-threshold estimator ψ̂1HT or the soft-threshold estimator ψ̂1ST.
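To contrast the hard-max, hard-threshold and soft-threshold pseudo-outcomes, the following sketch implements Step 2 for all three, following the forms in (2.2)-(2.4). The argument names, and the way the estimated covariance and tuning values are passed in, are illustrative assumptions of this sketch.

```python
import numpy as np
from scipy.stats import norm

def stage1_pseudo_outcomes(R1, S2f1, S2f2, beta2, psi2,
                           Sigma2_over_n=None, alpha=0.05, lam=None):
    """Hard-max (2.2), hard-threshold (2.3) and soft-threshold (2.4) pseudo-outcomes.
    Sigma2_over_n: estimated covariance of the stage-2 psi estimator (Sigma_hat_2 / n);
    lam: per-subject (or scalar) soft-threshold tuning value.  Both are placeholders."""
    eff = S2f2 @ psi2                 # individual effect psi2' S2i^(2)
    main = R1 + S2f1 @ beta2          # common part R1 + beta2' S2i^(1)

    y_hm = main + np.abs(eff)         # hard-max: keep the full |effect|

    y_ht = y_hm
    if Sigma2_over_n is not None:     # hard-threshold: keep |effect| only if "significant"
        se = np.sqrt(np.einsum('ij,jk,ik->i', S2f2, Sigma2_over_n, S2f2))
        z = norm.ppf(1 - alpha / 2)
        y_ht = main + np.abs(eff) * (np.abs(eff) > z * se)

    y_st = y_hm
    if lam is not None:               # soft-threshold: |eff|(1 - lam/|eff|)_+ = (|eff| - lam)_+
        y_st = main + np.maximum(np.abs(eff) - lam, 0.0)
    return y_hm, y_ht, y_st
```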
The hard-thresholding and soft-thresholding methods can be viewed as upgraded versions of the hard-max method in the sense of reducing the degree of non-differentiability of the absolute value function at zero. The first-stage pseudo-outcomes of all three existing methods can be viewed as shrinkage functionals of certain standard estimators. Although these estimators achieve shrinkage under certain conditions, they are not in general the optimizers of reasonable objective functions. Consequently, even when the shrinkage is successful, two drawbacks remain that limit their use for statistical inference on optimal dynamic treatment regimens. First, the bias of these first-stage pseudo-outcomes can be large in finite samples, leading to further bias in the first-stage estimator of ψ1, even in regular settings; this has been demonstrated in the empirical studies of Chakraborty et al. (2010). Second, and more importantly, these shrinkage functional estimators may not possess the oracle property that, with probability tending to one, the set ℳ⋆ = {i : ψ20TS2i(2) = 0} can be correctly identified and the resulting estimator performs as well as the estimator that knows the true set ℳ⋆ in advance.
3. Inference Based on Penalized Q-Learning
3.1. Estimation Procedure
To describe our method, we still focus on the two-stage setting as given in Section 2.2 and use the same notation. As a backward recursive reinforcement learning procedure, our method follows the three steps of the usual Q-learning method, except that it replaces Step 1 of the standard Q-learning procedure with
- Step 1p. Instead of minimizing the summed squared differences between R2 and Q2(S2, A2; β2, ψ2), minimize the penalized objective function
W2(θ2) = Σi=1n {R2i − Q2(S2i, A2i; β2, ψ2)}2 + Σi=1n pλn(|ψ2TS2i(2)|),   (3.1)
where pλn(·) is a pre-specified penalty function and λn is a tuning parameter.
Because of this penalized estimation, we call our approach penalized Q-learning. Since the penalty is put on each individual, we also call Step 1p individual selection.
Using individual selection enjoys shrinkage advantages similar to those of the penalized methods described by Frank and Friedman (1993), Tibshirani (1996), Fan and Li (2001), Candes and Tao (2007), Zou (2006) and Zou and Li (2008). In variable selection problems, where the selection of interest consists of the important variables with nonzero coefficients, appropriate penalties shrink small estimated coefficients to zero to enable the desired selection. In the individual selection done in the first step of the proposed penalized Q-learning approach, penalized estimation allows us to simultaneously estimate the second-stage parameters θ2 and select the individuals whose value functions are not affected by treatment, i.e., those individuals whose true values of ψ20TS2i(2) are zero. A code sketch of the penalized objective is given below.
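As a concrete illustration of Step 1p, the following evaluates the penalized objective (3.1) with an adaptive-lasso-type individual penalty weighted by an initial least squares fit; the function names and the specific penalty form are assumptions of this sketch rather than the authors' code.

```python
import numpy as np

def adaptive_lasso_penalty(effects, init_effects, lam, phi=2.0):
    """Individual penalty p_lam(|psi2'S2i^(2)|) = lam * |effect| / |initial effect|^phi
    (adaptive-lasso form; a small guard avoids division by zero)."""
    return lam * np.abs(effects) / np.maximum(np.abs(init_effects), 1e-12) ** phi

def penalized_objective(beta2, psi2, S2f1, S2f2, A2, R2, init_psi2, lam, phi=2.0):
    """W2(theta2) in (3.1): residual sum of squares plus one penalty term per subject."""
    fitted = S2f1 @ beta2 + (S2f2 @ psi2) * A2      # Q2(S2i, A2i; beta2, psi2)
    rss = np.sum((R2 - fitted) ** 2)
    pen = np.sum(adaptive_lasso_penalty(S2f2 @ psi2, S2f2 @ init_psi2, lam, phi))
    return rss + pen
```

Note that the penalty is applied to each subject's linear predictor ψ2TS2i(2), not to the individual regression coefficients.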
This ability to set individual treatment effects exactly to zero is extremely useful for making correct inference in the subsequent steps of penalized Q-learning. To understand why, recall that statistical inference in the usual Q-learning is mainly challenged by the difficulty of obtaining the correct asymptotic distribution of |ψ̂2TS2i(2)|, where ψ̂2 is an estimator of ψ20. Via our penalized Q-learning method, we can identify the individuals for whom ψ20TS2i(2) is zero, and their estimated effects are set exactly to zero; moreover, for all remaining individuals, ψ̂2TS2i(2) has the same sign as ψ20TS2i(2) asymptotically. In this case, |ψ̂2TS2i(2)| is asymptotically equivalent to sgn(ψ20TS2i(2)) ψ̂2TS2i(2).
Hence, correct inference can be obtained following standard arguments. More rigorous details will be given in Section 3.3.
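In symbols, the argument of the preceding paragraphs amounts to the following compact restatement (using the set ℳ⋆ = {i : ψ20TS2i(2) = 0} and its complement):

```latex
\Pr\Big(\hat\psi_2^{T}S_{2i}^{(2)} = 0 \ \text{ for all } i \in \mathcal{M}^\star\Big) \;\longrightarrow\; 1,
\qquad
\big|\hat\psi_2^{T}S_{2i}^{(2)}\big|
  \;=\; \operatorname{sgn}\!\big(\psi_{20}^{T}S_{2i}^{(2)}\big)\,\hat\psi_2^{T}S_{2i}^{(2)}
  \ \text{ with probability tending to one, for } i \notin \mathcal{M}^\star .
```

The right-hand side of the second display is linear, hence smooth, in ψ̂2, which is what allows standard Taylor-expansion arguments to go through for the first-stage pseudo-outcome.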
The choice of the penalty function pλn(·) can be taken to be the same as used in many popular variable selection methods. Specifically, we require pλn(·) to possess the following properties:
A1. For non-zero fixed θ, limn→∞ n1/2pλn(|θ|) = 0, limn→∞ n1/2p′λn(|θ|) = 0, and limn→∞ p″λn(|θ|) = 0.
A2. For any M > 0, inf|θ|≤Mn−1/2 n1/2p′λn(|θ|) → ∞, as n → ∞.
Among penalty functions satisfying A1 and A2 are the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the adaptive lasso penalty (Zou, 2006), for which pλn(θ) = λnθ/|θ(0)|ϕ with ϕ > 0 and θ(0) a root-n consistent estimator of θ. To achieve both sparsity and the oracle property, the tuning parameter λn in these examples should be chosen accordingly. The adaptive lasso penalty is implemented in this paper, with λn a scalar satisfying n1/2λn → 0 and nλn → ∞ as n → ∞.
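As a concrete illustration (a sketch only; the specific rate λn ∝ n−3/4 is just one example compatible with the two displayed rate requirements, and ϕ = 2 is the choice used later in the paper):

```latex
p_{\lambda_n}(\theta) \;=\; \frac{\lambda_n\,\theta}{|\theta^{(0)}|^{\phi}}, \qquad \phi = 2, \qquad
\lambda_n \;\propto\; n^{-3/4}
\;\;\Longrightarrow\;\;
n^{1/2}\lambda_n \asymp n^{-1/4} \to 0
\quad\text{and}\quad
n\lambda_n \asymp n^{1/4} \to \infty .
```

Intuitively, the weight |θ(0)|−ϕ stays bounded when the underlying effect is truly nonzero, so the penalty vanishes fast enough not to bias the estimator, while for a truly zero effect the root-n consistent θ(0) is of order n−1/2 and the weight diverges, forcing the corresponding estimate exactly to zero.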
3.2. Implementation
The minimization in Step 1p of the penalized Q-learning procedure has some unique features which distinguish it from the optimization done in the variable selection literature. First, the component to be shrunk, ψ2TS2i(2), is subject-specific; second, this component is a hyperplane in the parameter space, i.e., a linear combination of the parameters.
To deal with these issues, in this section we propose an algorithm for the minimization problem (3.1) based on local quadratic approximation. Following Fan and Li (2001), we first calculate an initial estimator ψ̂2(0) via standard least squares estimation. We then obtain the following local quadratic approximation to the penalty terms in (3.1):
pλn(|ψ2TS2i(2)|) ≈ pλn(|ψ̂2(0)TS2i(2)|) + (1/2){p′λn(|ψ̂2(0)TS2i(2)|)/|ψ̂2(0)TS2i(2)|}{(ψ2TS2i(2))2 − (ψ̂2(0)TS2i(2))2},
for ψ2 close to ψ̂2(0). Thus, (3.1) can be locally approximated, up to a constant, by
Σi=1n {R2i − β2TS2i(1) − (ψ2TS2i(2))A2i}2 + (1/2) Σi=1n {p′λn(|ψ̂2(0)TS2i(2)|)/|ψ̂2(0)TS2i(2)|}(ψ2TS2i(2))2.   (3.2)
The updated estimators for ψ2 and β2 can be obtained by minimizing the above approximation. When Q2 takes the linear form (2.1), minimizing (3.2) has the closed-form solution
ψ̂2 = {X22T(I − X21(X21TX21)−1X21T)X22 + X22TDX22/2}−1X22T(I − X21(X21TX21)−1X21T)R2,
β̂2 = (X21TX21)−1X21T(R2 − X22ψ̂2),
where X22 is the matrix with ith row equal to A2iS2i(2)T, X21 is the matrix with ith row equal to S2i(1)T, I is the n × n identity matrix, and D is the n × n diagonal matrix with ith diagonal element p′λn(|ψ̂2(0)TS2i(2)|)/|ψ̂2(0)TS2i(2)|.
The above minimization can be iterated until convergence. However, as discussed in Fan and Li (2001), the one-step or k-step estimator is as efficient as the fully iterative estimator as long as the initial estimator is consistent. Since, in practice, the local quadratic approximation algorithm shrinks ψ̂2TS2i(2) to a very small value rather than exactly zero even when the true value is zero, we set ψ̂2TS2i(2) = 0 once |ψ̂2TS2i(2)| falls below a pre-specified tolerance threshold.
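Putting Step 1p and the local quadratic approximation together, here is a one-step sketch for the adaptive lasso penalty. The explicit factor of 1/2, the numerical guard on the initial effects, and the function name are bookkeeping choices of this illustration rather than the authors' implementation.

```python
import numpy as np

def penalized_q_stage2_onestep(S2f1, S2f2, A2, R2, lam, phi=2.0, tol=1e-6):
    """One-step local quadratic approximation (LQA) for Step 1p with an adaptive
    lasso penalty on each individual effect psi2'S2i^(2).  Returns (beta2, psi2,
    zero_set), where zero_set flags subjects whose estimated effect is set to zero."""
    n = len(R2)
    X21 = S2f1                        # rows S2i^(1)
    X22 = A2[:, None] * S2f2          # rows A2i * S2i^(2)

    # initial (unpenalized) least squares fit
    Z2 = np.hstack([X21, X22])
    theta0, *_ = np.linalg.lstsq(Z2, R2, rcond=None)
    psi0 = theta0[X21.shape[1]:]
    eff0 = S2f2 @ psi0                # initial individual effects psi2^(0)'S2i^(2)

    # LQA weights d_i = p'_lam(|eff0_i|) / |eff0_i|; adaptive lasso: p'_lam = lam / |eff0_i|^phi
    d = lam / np.maximum(np.abs(eff0), 1e-12) ** (phi + 1)
    D = np.diag(d)

    # closed-form ridge-type update consistent with the quadratic approximation (3.2)
    P21 = X21 @ np.linalg.solve(X21.T @ X21, X21.T)
    M = np.eye(n) - P21
    psi2 = np.linalg.solve(X22.T @ M @ X22 + 0.5 * X22.T @ D @ X22,
                           X22.T @ M @ R2)
    beta2 = np.linalg.solve(X21.T @ X21, X21.T @ (R2 - X22 @ psi2))

    # set individual effects below the tolerance to exactly zero
    eff = S2f2 @ psi2
    zero_set = np.abs(eff) < tol
    return beta2, psi2, zero_set
```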
The choice of local quadratic approximation is mainly for convenience in solving the penalized least squares estimation in (3.1). If least absolute deviation estimation or some other quantile regression approach is used in place of least squares, then the local linear approximation of the penalty function described in Zou and Li (2008) can be used instead of local quadratic approximation, and the resulting minimization problem can be solved by linear programming.
We use five-fold cross-validation to choose the tuning parameter: we partition the data into five folds, perform the estimation on four folds, and validate the least squares fit on the remaining fold. We set ϕ = 2 as the parameter used in the adaptive lasso penalty. We acknowledge that the theoretical support for this choice is limited; general guidelines for choosing the tuning parameters λn and ϕ are of great research interest but beyond the scope of the current paper.
3.3. Asymptotic Results
In this section, we establish the asymptotic properties for the parameter estimators in our penalized Q-learning method. We assume that the penalty function pλn (x) satisfies A1 and A2 and that the following conditions hold
B1. The support of S2(2) contains a finite number of vectors, say v1, …, vK. Moreover, ψ20Tvk = 0 for k ≤ K1 and ψ20Tvk ≠ 0 for K1 < k ≤ K. Let nk = #|{i : S2i(2) = vk, i = 1,…, n}|, where for a set A, #|A| denotes its cardinality.
B2. The true value θ20 of θ2 minimizes the limit of n−1E[Σi=1n {R2i − Q2(S2i, A2i; θ2)}2], while the true value θ10 of θ1 minimizes the limit of n−1E[Σi=1n {Y1i − Q1(S1i, A1i; θ1)}2], where Y1i = R1i + β20TS2i(1) + |ψ20TS2i(2)| is the first-stage pseudo-outcome evaluated at the true second-stage parameters.
In both expressions and in the following, we always assume that the limits exist.
B3. For t = 1, 2, with probability one, Qt(St, At; θt) is twice-continuously differentiable with respect to θt in a neighborhood of θt0 and moreover, the Hessian matrices of the two limiting functions in B2 are continuous and their values at θt = θt0, denoted It0, are nonsingular.
B4. With probability one, nk/n = pk + Op(n−1/2) for some constant pk in [0, 1].
Condition B2 says that θ10 and θ20 are the target values in the dynamic treatment regimens. Condition B3 can be verified via the design matrix in the two-stage setting: when Qt takes the linear form (2.1), this condition is equivalent to linear independence of the components of (St(1), St(2)At) with positive probability. We note that the numerical performance for data from a population with a very small probability of such linear independence is likely to be unstable in small samples.
Under these conditions, our first theorem shows that in Step 1p of the penalized Q-learning procedure, there exists a consistent estimator for θ2.
Theorem 1
Under conditions A1−A2 and B1−B4, there exists a local minimizer θ̂2 of W2(θ2) such that ‖θ̂2 − θ20‖ = OP(n−1/2 + an), where an = max{p′λn(|ψ20Tvk|) : K1 < k ≤ K}.
According to the properties of pλn(·), we immediately conclude that θ̂2 is n1/2-consistent. From Theorem 1, we further obtain the following result, which verifies the oracle property of the penalized method:
Theorem 2
Define the set ℳ̂ = {i : ψ̂2TS2i(2) = 0}, where θ̂2 = (β̂2T, ψ̂2T)T is the root-n consistent local minimizer from Theorem 1, and recall ℳ⋆ = {i : ψ20TS2i(2) = 0}. Then under conditions A1−A2 and B1−B4, P(ℳ̂ = ℳ⋆) → 1 as n → ∞.
The set ℳ⋆ consists of those individuals whose true value functions at the second stage are unaffected by treatment. Thus Theorem 2 states that, with probability tending to one, we can identify exactly these individuals through ℳ̂. We also need the asymptotic distribution of θ̂2 in order to carry out inference.
Theorem 3
Under conditions A1−A2 and B1−B4, n1/2(I20 + Σ){θ̂2 − θ20 + (I20 + Σ)−1b} converges in distribution to N(0, I20), where b = (0T, bψT)T with bψ = Σk>K1 pk p′λn(|ψ20Tvk|) sgn(ψ20Tvk) vk, and Σ is the block-diagonal matrix whose β2-block is zero and whose ψ2-block is Σk>K1 pk p″λn(|ψ20Tvk|) vkvkT.
Using the results from Theorems 1–3, we are able to establish the asymptotic normality of the first stage estimator θ̂1:
Theorem 4
Under conditions A1–A2 and B1–B4, let Y1(θ2) = R1 + β2TS2(1) + |ψ2TS2(2)|, Y1 = Y1(θ20), and G10 = E{∇θ1Q1(S1, A1; θ10) ∇θ2Y1(θ20)T}, where the gradient of the absolute value term is taken to be zero on the set {ψ20TS2(2) = 0}. Then n1/2(θ̂1 − θ10) converges in distribution to N(0, I10−1V1(I10−1)T), where V1 = cov{F1(θ10) + G10F2(θ20)}, with F1(θ10) = ∇θ1Q1(S1, A1; θ10){Y1 − Q1(S1, A1; θ10)}, F2(θ20) = (I20 + Σ)−1∇θ2Q2(S2, A2; θ20){R2 − Q2(S2, A2; θ20)}, and cov(·) denoting the variance-covariance matrix.
3.4. Variance Estimation
The standard errors can be obtained directly since we estimate the parameters and select the individuals simultaneously. A sandwich-type plug-in estimator can be used as the variance estimator for θ̂2:
ĉov(θ̂2) = n−1(Î20 + Σ̂)−1Î20(Î20 + Σ̂)−1,
where Î20 is the empirical Hessian matrix and Σ̂ is the plug-in estimator of Σ in Theorem 3. As Σ̂ converges to zero as n goes to infinity, and is hence often negligible, we use
ĉov(θ̂2) = n−1Î20−1   (3.3)
instead, and this performs well in practice. The estimated variance for θ̂1 is then
ĉov(θ̂1) = n−1Î10−1 ĉov{F̂1(θ̂1) + Ĝ10F̂2(θ̂2)}(Î10−1)T,   (3.4)
where Î10 is the empirical estimator of I10, Ĝ10 is the empirical counterpart of G10 in Theorem 4, F̂1(θ̂1) = ∇θ1Q1(S1, A1; θ̂1){Ŷ1 − Q1(S1, A1; θ̂1)}, F̂2(θ̂2) = (Î20 + Σ̂)−1∇θ2Q2(S2, A2; θ̂2){R2 − Q2(S2, A2; θ̂2)}, and ĉov denotes the empirical variance-covariance matrix. These variance estimators have good accuracy for moderate sample sizes; see Section 4. This success of direct inference for the estimated parameters makes inference for optimal dynamic treatment regimens possible in the multi-stage setting.
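The following is a generic sandwich-style plug-in sketch in the spirit of this section; the heteroscedasticity-robust middle term and the way the penalty curvature enters only the ψ2 block are assumptions of this illustration, not the exact formulas (3.3)-(3.4).

```python
import numpy as np

def stage2_sandwich_variance(S2f1, S2f2, A2, R2, beta2, psi2, d_weights):
    """Generic sandwich-type plug-in variance for the penalized stage-2 estimator:
    (Hessian + penalty curvature)^{-1} x empirical score variance x (same)^{-1} / n.
    d_weights are the LQA penalty weights from the fitting step (illustrative)."""
    Z2 = np.hstack([S2f1, A2[:, None] * S2f2])      # design [S2^(1), A2 * S2^(2)]
    res = R2 - Z2 @ np.concatenate([beta2, psi2])   # residuals
    n, k1 = len(R2), S2f1.shape[1]

    H = Z2.T @ Z2 / n                               # empirical Hessian (constant factors cancel)
    Sigma_pen = np.zeros_like(H)                    # penalty curvature hits only the psi2 block
    Sigma_pen[k1:, k1:] = (S2f2 * d_weights[:, None]).T @ S2f2 / n
    bread = np.linalg.inv(H + Sigma_pen)
    meat = (Z2 * (res ** 2)[:, None]).T @ Z2 / n    # empirical score variance
    return bread @ meat @ bread / n                 # estimated cov(theta2_hat)
```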
4. Numerical Studies
We first apply the proposed method to the same simulation study conditions as designed in Chakraborty et al. (2010). A total of 500 subjects are generated for each dataset. We set R1 = 0, and (O1, A1, O2, A2, R2) is collected on each subject, where (Ot, At) denotes the covariates and treatment status at stage t (t = 1, 2). The binary covariates Ot and the binary treatments At are generated as follows:
P(O1 = 1) = P(O1 = −1) = 0.5,  P(A1 = 1) = P(A1 = −1) = 0.5,  P(A2 = 1) = P(A2 = −1) = 0.5,
P(O2 = 1 | O1, A1) = 1 − P(O2 = −1 | O1, A1) = expit(δ1O1 + δ2A1),
where expit(x) = exp(x)/{1 + exp(x)}. The second-stage outcome is generated as
R2 = γ1 + γ2O1 + γ3A1 + γ4O1A1 + γ5A2 + γ6O2A2 + γ7A1A2 + ε,
where ε ∼ N(0, 1). Under this setting, the true Q-functions for time t = 1, 2 are
Q2(S2, A2) = γ1 + γ2O1 + γ3A1 + γ4O1A1 + (γ5 + γ6O2 + γ7A1)A2,   (4.1)
Q1(S1, A1) = γ1 + γ2O1 + γ3A1 + γ4O1A1 + E{|γ5 + γ6O2 + γ7A1| | O1, A1}.   (4.2)
The true values of the stage-1 parameters in the working model Q1(S1, A1; β1, ψ1) = β11 + β12O1 + (ψ11 + ψ12O1)A1 are linear combinations of |f1|, |f2|, |f3| and |f4|; in particular,
β11,0 = γ1 + q1|f1| + q2|f2| + (0.5 − q1)|f3| + (0.5 − q2)|f4|,  ψ11,0 = γ3 + q1|f1| − q2|f2| + (0.5 − q1)|f3| − (0.5 − q2)|f4|,   (4.3)
with analogous expressions for β12,0 and ψ12,0 in which the coefficients of the |fk| sum to zero, where q1 = 0.25{expit(δ1 + δ2) + expit(−δ1 + δ2)}, q2 = 0.25{expit(δ1 − δ2) + expit(−δ1 − δ2)}, f1 = γ5 + γ6 + γ7, f2 = γ5 + γ6 − γ7, f3 = γ5 − γ6 + γ7, and f4 = γ5 − γ6 − γ7. Let γ = (γ1, …, γ7)T. We consider six settings, with the values of ψ20TS2(2) = γ5 + γ6O2 + γ7A1 over the support of S2(2) for each setting listed in Table 4.1; a code sketch of the data-generating mechanism is given after the table.
Table 4.1. Values of ψ20TS2(2) = γ5 + γ6O2 + γ7A1 at the four support points of S2(2) = (1, O2, A1)T, for the six simulation settings.

Setting | (1,1,1) | (1,1,−1) | (1,−1,1) | (1,−1,−1) |
---|---|---|---|---|
1 | 0 | 0 | 0 | 0 |
2 | 0.01 | 0.01 | 0.01 | 0.01 |
3 | 1 | 0 | 1 | 0 |
4 | 0.99 | 0.01 | 0.99 | 0.01 |
5 | 2 | 1 | 1 | 0 |
6 | 1.25 | 0.25 | 0.25 | −0.75 |
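For concreteness, here is a sketch that simulates one dataset from the generative mechanism described above; the function signature is illustrative, and the setting-specific values of γ and δ must be supplied by the user.

```python
import numpy as np

def generate_smart_data(n, gamma, delta, rng=None):
    """Simulate one two-stage dataset under the generative model of Section 4:
    O1, A1, A2 uniform on {-1, 1}; P(O2 = 1 | O1, A1) = expit(delta1*O1 + delta2*A1);
    R1 = 0; R2 = gamma1 + gamma2*O1 + gamma3*A1 + gamma4*O1*A1
               + (gamma5 + gamma6*O2 + gamma7*A1)*A2 + N(0, 1) noise.
    gamma has length 7 and delta length 2 (user-supplied)."""
    rng = np.random.default_rng() if rng is None else rng
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))

    O1 = rng.choice([-1, 1], size=n)
    A1 = rng.choice([-1, 1], size=n)
    p2 = expit(delta[0] * O1 + delta[1] * A1)
    O2 = np.where(rng.random(n) < p2, 1, -1)
    A2 = rng.choice([-1, 1], size=n)

    R1 = np.zeros(n)
    R2 = (gamma[0] + gamma[1] * O1 + gamma[2] * A1 + gamma[3] * O1 * A1
          + (gamma[4] + gamma[5] * O2 + gamma[6] * A1) * A2
          + rng.standard_normal(n))
    return O1, A1, R1, O2, A2, R2
```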
We applied penalized Q-learning with the adaptive lasso penalty to these six settings. The one-step local quadratic approximation algorithm is used, with least squares estimates as initial values. The tuning parameter λ in the adaptive lasso penalty is chosen by five-fold cross-validation, and we set ϕ = 2 as the parameter used in the adaptive lasso penalty. When the estimated value |ψ̂2TS2i(2)| falls below the tolerance threshold, it is set to zero in the stage-1 estimation. The simulation results shown in Tables 4.2 and 4.3 are summarized over 2000 replications. For comparison we include the oracle estimator, which knows in advance the true set ℳ⋆, the hard-max estimator, and the soft-threshold estimator. Theoretical standard errors for the hard-max and soft-threshold estimators are not available. Results on the average length of the confidence intervals are presented in the Web Appendix.
Table 4.2. Simulation results for the stage-1 parameters β11 (first four columns) and β12 (last four columns), summarized over 2000 replications with n = 500: bias (×1000), mean squared error (MSE, ×1000), theoretical standard error (std, ×100) and coverage probability (CP, in %) of 95% confidence intervals, for the oracle, penalized Q-learning (PQ), hard-max (HM) and soft-threshold (ST) estimators. An asterisk marks coverage significantly different from the nominal level; theoretical standard errors are not available for HM and ST.

Estimator | bias(×1000) | MSE(×1000) | std(×100) | CP | bias(×1000) | MSE(×1000) | std(×100) | CP |
---|---|---|---|---|---|---|---|---|
Setting 1 | ||||||||
Oracle | 1 | 2.0 | 4.5 | 94.7 | 1 | 2.0 | 4.5 | 94.9 |
PQ | 1 | 2.0 | 4.5 | 94.7 | 1 | 2.0 | 4.5 | 94.9 |
HM | 61 | 6.6 | – | 88.7* | 1 | 2.1 | – | 95.2 |
ST | 7 | 2.3 | – | 96.2* | −1 | 2.0 | – | 94.9 |
Setting 2 | ||||||||
Oracle | 52 | 5.5 | 6.2 | 90.0* | 1 | 2.1 | 4.6 | 95.3 |
PQ | −9 | 2.1 | 4.5 | 94.7 | 1 | 2.0 | 4.5 | 94.8 |
HM | 52 | 5.5 | – | 90.8* | 1 | 2.1 | – | 95.1 |
ST | −3 | 2.2 | – | 94.8 | −1 | 2.0 | – | 95.1 |
Setting 3 | ||||||||
Oracle | 0 | 3.0 | 5.5 | 94.6 | 1 | 2.0 | 4.5 | 95.1 |
PQ | 0 | 3.1 | 5.5 | 94.2 | 2 | 2.0 | 4.5 | 95.2 |
HM | 30 | 4.3 | – | 93.0* | 2 | 2.1 | – | 95.2 |
ST | −5 | 3.3 | – | 93.5* | −1 | 2.1 | – | 94.9 |
Setting 4 | ||||||||
Oracle | 26 | 4.0 | 6.2 | 93.8* | 2 | 2.1 | 4.6 | 95.2 |
PQ | −5 | 3.1 | 5.5 | 94.3 | 2 | 2.0 | 4.5 | 95.1 |
HM | 26 | 4.0 | – | 93.4* | 2 | 2.1 | – | 95.1 |
ST | −10 | 3.4 | – | 93.5* | −1 | 2.1 | – | 95.0 |
Setting 5 | ||||||||
Oracle | 0 | 3.8 | 6.1 | 94.7 | 2 | 2.7 | 5.2 | 95.1 |
PQ | −2 | 3.8 | 6.1 | 94.4 | 0 | 2.7 | 5.2 | 95.0 |
HM | 15 | 4.1 | – | 94.1 | −5 | 2.7 | – | 94.3 |
ST | −8 | 4.0 | – | 94.9 | −3 | 2.8 | – | 94.9 |
Setting 6 | ||||||||
Oracle | 2 | 3.8 | 6.2 | 94.8 | −1 | 2.3 | 4.8 | 94.7 |
PQ | 1 | 3.8 | 6.2 | 94.7 | −1 | 2.3 | 4.8 | 94.7 |
HM | 2 | 3.8 | – | 94.3 | −1 | 2.3 | – | 95.2 |
ST | 45 | 6.3 | – | 87.2* | −1 | 2.3 | – | 95.2 |
Table 4.3. Simulation results for the stage-1 parameters ψ11 (first four columns) and ψ12 (last four columns), summarized over 2000 replications with n = 500: bias (×1000), mean squared error (MSE, ×1000), theoretical standard error (std, ×100) and coverage probability (CP, in %) of 95% confidence intervals, for the oracle, penalized Q-learning (PQ), hard-max (HM) and soft-threshold (ST) estimators. An asterisk marks coverage significantly different from the nominal level; theoretical standard errors are not available for HM and ST.

Estimator | bias(×1000) | MSE(×1000) | std(×100) | CP | bias(×1000) | MSE(×1000) | std(×100) | CP |
---|---|---|---|---|---|---|---|---|
Setting 1 | ||||||||
Oracle | −1 | 1.9 | 4.5 | 95.3 | 0 | 2.0 | 4.5 | 94.7 |
PQ | −1 | 1.9 | 4.5 | 95.3 | 0 | 2.0 | 4.5 | 94.7 |
HM | 0 | 2.4 | – | 93.7* | −1 | 2.1 | – | 94.5 |
ST | 1 | 2.3 | – | 94.8 | −1 | 2.1 | – | 94.6 |
Setting 2 | ||||||||
Oracle | 0 | 2.5 | 5.8 | 97.3* | −1 | 2.1 | 4.6 | 95.0 |
PQ | −1 | 1.9 | 4.5 | 95.3 | 0 | 2.0 | 4.5 | 94.8 |
HM | 0 | 2.5 | – | 94.8 | −1 | 2.1 | – | 94.4 |
ST | 1 | 2.3 | – | 94.8 | 0 | 2.1 | – | 94.4 |
Setting 3 | ||||||||
Oracle | −1 | 2.9 | 5.5 | 95.0 | 0 | 2.0 | 4.5 | 94.8 |
PQ | −10 | 3.1 | 5.5 | 94.0 | −1 | 2.0 | 4.5 | 94.6 |
HM | −31 | 4.2 | – | 93.8* | −1 | 2.1 | – | 94.0 |
ST | −11 | 4.0 | – | 95.0 | 0 | 2.1 | – | 94.0 |
Setting 4 | ||||||||
Oracle | −26 | 4.0 | 6.2 | 94.9 | −1 | 2.1 | 4.5 | 95.0 |
PQ | −6 | 3.1 | 5.5 | 94.6 | 0 | 2.0 | 4.5 | 94.6 |
HM | −26 | 4.0 | – | 94.5 | −1 | 2.1 | – | 94.0 |
ST | −7 | 3.4 | – | 95.1 | 0 | 2.1 | – | 93.9 |
Setting 5 | ||||||||
Oracle | −1 | 3.5 | 6.1 | 95.8 | −1 | 2.5 | 4.9 | 94.3 |
PQ | −5 | 3.6 | 6.1 | 95.2 | 0 | 2.5 | 4.9 | 94.3 |
HM | −16 | 4.0 | – | 94.6 | 6 | 2.6 | – | 94.0 |
ST | −3 | 4.0 | – | 95.2 | −3 | 2.6 | – | 94.2 |
Setting 6 | ||||||||
Oracle | 2 | 3.9 | 6.2 | 95.0 | 0 | 2.4 | 4.8 | 94.2 |
PQ | 2 | 4.0 | 6.2 | 94.6 | 0 | 2.4 | 4.8 | 94.2 |
HM | 2 | 3.9 | – | 93.7* | 0 | 2.4 | – | 94.1 |
ST | 1 | 4.6 | – | 91.4* | 2 | 2.5 | – | 94.1 |
The true values of the stage-1 parameters in (4.3) are linear combinations of the four absolute value functions |f1|, |f2|, |f3| and |f4|. It can be shown from the definition that, up to a bias of order o(n−1/2), the hard-max estimators of the stage-1 parameters are the corresponding linear combinations of the four absolute value functions, with the stage-2 estimates plugged into |f1|, |f2|, |f3| and |f4| in place of the true values. The performance of the hard-max estimators is therefore driven by the estimation of each of the four absolute value functions, and their bias comes mainly from the bias in estimating these absolute values. We now discuss the performance of our estimator, the hard-max estimator and the oracle estimator in the six settings.
Setting 1 is a setting with no second-stage treatment effect, as ψ20TS2(2) = 0 for all values of S2(2). The hard-max estimator incurs asymptotic bias in all four terms |f1|, |f2|, |f3| and |f4|, each of order O(n−1/2), since in this case f1 = f2 = f3 = f4 = 0. As shown in (4.3), the biases in estimating |f1|, |f2|, |f3| and |f4| are almost completely canceled out in the estimation of β12,0 and ψ12,0, because the corresponding coefficients sum to zero; they are also largely canceled out in the estimation of ψ11,0, whose coefficients sum to a value close to zero. The hard-max estimator of β11,0, in contrast, has a significant bias because the coefficients of the four absolute value terms, q1, q2, 0.5 − q1 and 0.5 − q2, are all positive and sum to 1.
The simulation results for Setting 1 are consistent with these theoretical observations about the hard-max estimator. The oracle estimator automatically sets the estimator of ψ2 to zero; it has no significant bias, its standard errors are accurately predicted by the theory, and its 95% confidence interval coverage is close to the nominal value. The performance of the penalized Q-learning based estimator is essentially identical to that of the oracle estimator. The hard-max estimator has a significant bias and an inferior mean squared error for β̂11, while remaining consistent for the other three stage-1 parameters.
Setting 2 is regular but very close to Setting 1, with ψ20TS2(2) equal to 0.01 for all values of S2(2). The hard-max estimator's performance is very similar to that in Setting 1, and its 95% confidence interval shows poor coverage for β̂11. As the value of ψ20TS2(2) is nonzero, the oracle estimator reduces to the hard-max estimator in this setting. Although the penalized Q-learning based estimator has a small bias (−0.009) in the estimation of β11, this bias is less than one fifth of that of the oracle estimator, and its mean squared error is less than half that of the oracle estimator. Its standard error estimate remains close to the empirical values.
Setting 3 is another setting in which there is no second-stage treatment effect for a positive proportion of subjects in the population: ψ20TS2(2) equals 0 when A1 = −1, which occurs with probability one half. The hard-max estimator incurs bias of order O(n−1/2) in the estimation of |f2| and |f4|, but not of |f1| and |f3|, since f2 = f4 = 0 while f1 = f3 = 1. As seen from (4.3), the hard-max estimation of β12,0 and ψ12,0 is still approximately unbiased, owing to the canceling of the coefficients of the four absolute value terms. The estimation of β11,0 is biased, with approximately half the bias of Setting 1, given the values of the |fi| and their coefficients. The estimation of ψ11,0 is also biased, with similar magnitude to the bias of β̂11 but with reversed sign. The simulation study confirms these theoretical observations for the hard-max estimator. The oracle estimator has no statistically significant bias and its standard error is precisely predicted by the theoretical calculation. The penalized Q-learning based estimator has a bias in ψ̂11, but the bias is about one third of that of the hard-max estimator; otherwise, the penalized Q-learning based estimator has almost exactly the same performance as the oracle estimator.
Setting 4 is a regular setting very close to Setting 3. The hard-max estimator's performance is similar to that in Setting 3, and the oracle estimator again reduces to the hard-max estimator. The penalized Q-learning based estimator outperforms the oracle estimator, with a bias roughly five times smaller and a correctly predicted standard error. This phenomenon is consistent with the findings in Setting 2.
In Setting 5, the term ψ20TS2(2) equals zero when (O2, A1) = (−1, −1), which occurs with probability one fourth. The hard-max estimator incurs bias in the estimation of |f4|, since f4 = 0; consequently, all four stage-1 parameter estimators are biased. The bias in β̂11 is approximately a quarter of that in Setting 1; the bias in β̂12 is about half of that of β̂11 with reversed sign; the bias in ψ̂11 is about the same magnitude as that of β̂11 with reversed sign; and the bias in ψ̂12 is about half of that of β̂11. In this setting, the oracle estimator has the best performance, with no significant bias and well-predicted standard errors. The penalized Q-learning based estimator has a bias in ψ̂11, but it is much smaller than that of the hard-max estimator; it has no noticeable bias in the other three parameters, and its standard error calculations are accurate when compared to the Monte Carlo errors.
Setting 6 is a completely regular setting, with values of ψ20TS2(2) well away from zero. The penalized Q-learning based estimator has almost identical performance to the oracle estimator, which in turn coincides with the hard-max estimator. Both estimators are unbiased, with accurately calculated standard errors.
In summary, the behavior of the penalized Q-learning estimator, including its bias, mean squared error, theoretically computed standard error, and the coverage probability of the theoretically computed 95% confidence intervals, is consistently good across all six settings.
Chakraborty et al. (2010) proposed several bootstrapped confidence intervals for the hard-max estimator, for hard-threshold estimators with α in Step 2 set to 0.08 or 0.20, and for the soft-threshold estimator. To compare the confidence intervals from the proposed penalized Q-learning based estimator with these bootstrapped methods, we re-ran the simulations with sample size n = 300 and 1000 replications. The coverage probabilities of the different inferential methods in the six settings are compared in Figure 1, where the results for the hard-max, hard-threshold and soft-threshold methods are based on the hybrid bootstrap for variance estimation. Overall, the competing methods cannot provide consistent coverage rates across all six settings, whereas our penalized Q-learning based method always gives coverage probabilities that are not significantly different from the nominal level.
We also applied percentile bootstrap and double bootstrap variance estimation for the other competing methods and found that the hard-max estimator with the double-bootstrap confidence interval and the soft-threshold estimator with the percentile-bootstrap confidence interval can give reasonable coverage probabilities. Nonetheless, the penalized Q-learning based estimator has a significant computational advantage over these two inference methods. In a comparison run analyzing one dataset with sample size 300, the hard-max estimator with a double-bootstrap confidence interval (500 first-stage and 100 second-stage bootstrap iterations) required 316.35 seconds, and the soft-threshold estimator with a percentile confidence interval (1000 bootstrap iterations) took 10.98 seconds; in contrast, the penalized Q-learning based estimator took only 0.14 seconds.
Finally, we analyze data from the mental health study described in Fava et al. (2003) using the proposed method. The details are given in the online Supplemental Material.
5. Discussion
The proposed penalized Q-learning provides valid inference based on an approximate normal distribution for the estimators of the regression coefficients. While this paper was under review, Chakraborty et al. (2013) proposed the m-out-of-n bootstrap as a remedy for the non-regular inference in Q-learning. This modified bootstrap is consistent and can be used in conjunction with the simple hard-max estimator.
In some special cases the proposed method coincides with variable selection, but in general it performs individual subject selection rather than selection of individual variables. In small samples, our penalization of the linear predictor may impose some constraints, as demonstrated in the following example provided by a referee. Suppose that ψ2 = (ε, −ε)T and that the following four feature vectors are in the support of S2(2): (1, 1)T, (1, 1 + a)T, (1 + a, 1)T and (1 + a, 1 + a)T for a > 0. For a small value of ε and a large value of a, say a = 100/ε, it is possible that the penalized estimator ψ̂2 shrinks the entire coefficient vector to zero because of the penalty placed on |ψ2TS2(2)|; every patient is then deemed to have no treatment effect. The true treatment effects for subjects with features (1, 1 + a)T and (1 + a, 1)T, however, are non-negligible.
This dilemma is due to the finite rank of the covariate space and the lack of power to distinguish groups with small effects in small samples. One possibility for alleviating the dilemma is to use a penalty that shrinks toward a consistent initial estimator ψ̃2 of ψ2 rather than solely toward zero: such a penalty can shrink the estimated ψ2TS2i(2) to zero if the truth is zero; otherwise, it forces ψ̂2TS2i(2) to stay not far from ψ̃2TS2i(2).
Although the linear model form of the Q-functions presented here is an important first step, as well as being useful for illustrating the ideas of this paper, this form may not be sufficiently flexible for certain practical settings. Semiparametric models are a potentially very useful alternative in many such settings because such models involve both a parametric component which is usually easy to interpret and a nonparametric component which allows greater flexibility. Generalizations of Q-functions to allow diverse data such as ordinal outcome, censored outcome and semiparametric modeling, are thus future research topics of practical importance.
The theoretical framework is based on discrete covariates. This condition is not as restrictive as it may appear: in a practical two-stage setting with continuous covariates, except in the rare case where the parameter ψ20 is zero, the set {ψ20TS2(2) = 0} will not have positive probability. That said, one can always discretize continuous covariates, albeit with some loss of information, and extending our work to continuous covariates would be very useful in practice. Likewise, the framework is developed for two-level treatments; the generalization to multilevel treatments is a natural and useful next step.
In many clinical studies, the state space is often of very high dimension. To develop optimal dynamic treatment regimes in this case, it will be important to develop simultaneous variable selection and individual selection. More modern machine learning techniques such as support vector regression and random forests can be nested into our penalized Q-learning framework as powerful tools to develop optimal dynamic treatment regimes.
Our method is proposed for the general setting of randomized clinical trials. It would be interesting and practically useful to develop optimal dynamic treatment regimes from observational studies. To generalize the proposed methods to observational studies, a propensity score weighting approach can be incorporated into the proposed penalized Q-learning under certain assumptions (for example, no unobserved confounders). This is beyond the scope of the current paper, and we are currently investigating the topic.
Contributor Information
R. Song, Email: rsong@ncsu.edu.
W. Wang, Email: Weiwei.Wang@uth.tmc.edu.
D. Zeng, Email: dzeng@bios.unc.edu.
M. R. Kosorok, Email: kosorok@unc.edu.
References
- Candes E, Tao T. The Dantzig selector: statistical estimation when p is much larger than n (with discussion). The Annals of Statistics. 2007;35:2313–2404.
- Chakraborty B, Murphy S, Strecher V. Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research. 2010;19:317–343. doi: 10.1177/0962280209105013.
- Chakraborty B, Laber E, Zhao Y. Inference for optimal dynamic treatment regimes using an adaptive m-out-of-n bootstrap scheme. Biometrics. 2013;69:714–723. doi: 10.1111/biom.12052.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fava M, Rush A, Trivedi M, Nierenberg A, Thase M, Sackeim H, Quitkin F, Wisniewski S, Lavori P, Rosenbaum J, Kupfer D, STAR*D Investigators Group. Background and rationale for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study. Psychiatric Clinics of North America. 2003;26:457–494. doi: 10.1016/s0193-953x(02)00107-7.
- Frank IE, Friedman JH. A statistical view of some chemometrics regression tools (with discussion). Technometrics. 1993;35:109–148.
- Hirano K, Porter JR. Impossibility results for nondifferentiable functionals. Econometrica. To appear.
- Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. Journal of Artificial Intelligence Research. 1996;4:237–285.
- Lavori PW, Dawson A. A design for testing clinical strategies: biased adaptive within-subject randomization. Journal of the Royal Statistical Society A. 2000;163:29–38.
- Lunceford J, Davidian M, Tsiatis A. Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials. Biometrics. 2002;58:48–57. doi: 10.1111/j.0006-341x.2002.00048.x.
- Moodie EEM, Platt RW, Kramer MS. Estimating response-maximized decision rules with applications to breastfeeding. Journal of the American Statistical Association. 2009;104:155–165.
- Murphy S. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society B. 2003;65:331–355.
- Murphy S. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine. 2005;24:1455–1481. doi: 10.1002/sim.2022.
- Robins JM. Optimal structural nested models for optimal sequential decisions. In: Proceedings of the Second Seattle Symposium on Biostatistics. Springer; 2004.
- Thall P, Millikan R, Sung H. Evaluating multiple treatment courses in clinical trials. Statistics in Medicine. 2000;19:1011–1028. doi: 10.1002/(sici)1097-0258(20000430)19:8<1011::aid-sim414>3.0.co;2-m.
- Thall P, Sung H, Estey E. Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. Journal of the American Statistical Association. 2002;97:29–39.
- Thall PF, Wooten LH, Logothetis CJ, Millikan RE, Tannir NM. Bayesian and frequentist two-stage treatment strategies based on sequential failure times subject to interval censoring. Statistics in Medicine. 2007;26:4687–4702. doi: 10.1002/sim.2894.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B. 1996;58:267–288.
- Wahed A, Tsiatis A. Semiparametric efficient estimation of survival distributions in two-stage randomisation designs in clinical trials with censored data. Biometrika. 2006;93:163–177.
- Wahed AS, Tsiatis AA. Optimal estimator for the survival distribution and related quantities for treatment policies in two-stage randomised designs in clinical trials. Biometrics. 2004;60:124–133. doi: 10.1111/j.0006-341X.2004.00160.x.
- Zhao Y, Kosorok MR, Zeng D. Reinforcement learning design for cancer clinical trials. Statistics in Medicine. 2009;28:3294–3315. doi: 10.1002/sim.3720.
- Zhao Y, Zeng D, Socinski M, Kosorok M. Reinforcement learning strategies for clinical trials in non-small cell lung cancer. Biometrics. 2011;67:1422–1433. doi: 10.1111/j.1541-0420.2011.01572.x.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics. 2008;36:1509–1533. doi: 10.1214/009053607000000802.