Comment on “Dynamic treatment regimes: technical challenges and applications”

Yair Goldberg; Rui Song; Donglin Zeng; Michael R Kosorok

doi:10.1214/14-ejs905

. Author manuscript; available in PMC: 2015 Jan 1.

Published in final edited form as: Electron J Stat. 2014;8(1):1290–1300. doi: 10.1214/14-ejs905

Comment on “Dynamic treatment regimes: technical challenges and applications”^*

Yair Goldberg ¹, Rui Song ², Donglin Zeng ³, Michael R Kosorok ⁴

PMCID: PMC4255986 NIHMSID: NIHMS608166 PMID: 25485028

Abstract

Inference for parameters associated with optimal dynamic treatment regimes is challenging as these estimators are nonregular when there are non-responders to treatments. In this discussion, we comment on three aspects of alleviating this nonregularity. We first discuss an alternative approach for smoothing the quality functions. We then discuss some further details on our existing work to identify non-responders through penalization. Third, we propose a clinically meaningful value assessment whose estimator does not suffer from nonregularity.

1. Introduction

The authors are to be congratulated for their excellent and thoughtful paper on statistical inference for dynamic treatment regimens. They have addressed several important and long-standing issues in this area. As discussed by the authors, nonsmoothness of the problem in some of the parameters of interest leads to estimators that are not smooth in the data. This in turn makes inference for these parameters challenging. In the following, we comment on a few additional strategies to alleviate the resulting nonregularity due to nonsmoothness. First, we discuss replacing the nonsmooth objective functions via a SoftMax Q-learning approach, which directly addresses the trade-off between bias and variance of the maximum operation in the local asymptotic framework. Proofs are given in the appendix.

Nonregularity of the estimators for the parameters associated with the optimal treatment regimes is mainly due to the existence of non-responders to treatments. Therefore, it would be useful and important if we could identify these non-responders. In the second part, we review our existing work on non-responder identification via penalization. We also discuss how this penalization can alleviate, although not solve, some regularity issues.

For the third and final aspect we wish to discuss, we note that in some public health settings, the parameters in the dynamic treatment regime are not as important as the value function which reflects the overall population impact of the estimated regime and is perhaps the most important quantity to focus on for public health policy. We propose a truncated value function which only focuses on those subjects who are expected to have large treatment effects. We claim that this alternative value function is clinically meaningful and does not suffer from nonregularity.

2. SoftMax Q-Learning

In this section we study the effect of replacing the max operator with a smoother version of it in the two-stage Q-learning algorithm discussed by Laber et al. We show that this smoothing can reduce the bias and can be controlled under local alternatives. The proposed SoftMax approach also sheds light on the bias/variance tradeoff which can be obtained by using over/under smoothing. In what follows, we briefly describe the SoftMax Q-learning algorithm, and then present some theoretical and simulation results.

2.1. Proposed Algorithm

Consider the Q-learning algorithm discussed by Laber et al. in Section 2. In step 2 of the algorithm, the stage outcome is predicted by

\tilde{Y} = max_{a_{2}} Q_{2} (H_{2}, a_{2}; {\hat{β}}_{2}) .

We propose replacing Ỹ with a SoftMax version of it. Define the SoftMax function by (see Fig 1)

SoftMax (x, y, α) = \frac{1}{α} log {e^{α x} + e^{α y}}, α > 0.

Let

\begin{array}{l} \overset{⌣}{Y} = SoftMax (Q_{2} (H_{2}, a_{2, 1}; {\hat{β}}_{2}), Q_{2} (H_{2}, a_{2, 2}; {\hat{β}}_{2})) \\ = \frac{1}{α} log {e^{α H_{2, 0}^{'} {\hat{β}}_{2, 0}} + e^{α (H_{2, 0}^{'} {\hat{β}}_{2, 0} + H_{2, 1}^{'} {\hat{β}}_{2, 1})}} \\ = H_{2, 0}^{'} {\hat{β}}_{2, 0} + \frac{1}{α} log {1 + e^{α H_{2, 1}^{'} {\overset{⌣}{β}}_{2, 1}}} . \end{array}

The estimator β̂₁ of β₁ is given by ${\sum^{^}}_{1}^{- 1} ℙ_{n} B_{1} \overset{⌣}{Y}$ . We note that the algorithm discussed by Laber et al. is obtained as the limit, as α goes to infinity, of the SoftMax Q-learning algorithm discussed here.

2.2. Theory

In the following we briefly discuss the asymptotic properties of β̂₁. We first discuss the limiting distribution of $\sqrt{n} ({\hat{β}}_{1} - β_{1}^{*})$ . We then discuss this limiting distribution under local alternatives. Finally, we discuss the asymptotic bias. The proofs appear in the Appendix.

Theorem 1

Assume (A1)–(A2) from Laber et al., and let α_n → ∞ such that $\sqrt{n} / α_{n} \to a_{\infty}$ for a_∞ ∈ [0, ∞). then

If a_∞ = 0,
$\sqrt{n} ({\hat{β}}_{1} - β_{1}^{*}) ⇝ S_{\infty} + \sum_{1, \infty}^{- 1} P (T_{\infty}) .$
If 0 < a_∞ < ∞, then
$\sqrt{n} ({\hat{β}}_{1} - β_{1}^{*}) \to S_{\infty} + \sum_{1, \infty}^{- 1} P (T_{\infty}) + a_{\infty} log (2) \sum_{1, \infty}^{- 1} P B_{1} 1 {H_{2, 1}^{'} β_{2, 1}^{*} = 0},$

where
$T_{\infty} = B_{1} (H_{2, 1}^{'} V_{\infty} 1 {H_{2, 1}^{'} β_{2, 1}^{*} > 0} + \frac{1}{2} H_{2, 1}^{'} V_{\infty} 1 {H_{2, 1}^{'} β_{2, 1}^{*} = 0}) .$

For local alternatives the limiting distribution is given below.

Theorem 2

Assume (A1)–(A3) from Laber et al., and let α_n → ∞ such that $\sqrt{n} / α_{n} \to a_{\infty}$ for a_∞ ∈ (0; ∞). Then

\sqrt{n} ({\hat{β}}_{1} - β_{1}^{*}) ⇝ S_{\infty} + \sum_{1, \infty}^{- 1} P (T_{\infty}) + \sum_{1, \infty}^{- 1} P (W_{\infty}) .

where

\begin{array}{r} T_{\infty} = B_{1} (H_{2, 1}^{'} V_{\infty} 1 {H_{2, 1}^{'} β_{2, 1}^{*} > 0} + H_{2, 1}^{'} V_{\infty} {[a_{\infty}^{- 1} H_{2, 1}^{'} s]}_{+} 1 {H_{2, 1}^{'} β_{2, 1}^{*} = 0}) \\ W_{\infty} = B_{1} (a_{\infty} log {1 + e^{a_{\infty}^{- 1} H_{2, 1}^{'} s}} - {[H_{2, 1}^{'} s]}_{+}) 1 {H_{2, 1}^{'} β_{2, 1}^{*} = 0} . \end{array}

The bound of the bias, scaled by root-n, under both standard and local alternatives asymptotics, is given below.

Corollary 1

Assume (A1)–(A2) from Laber et al., and let α_n → ∞ such that $\sqrt{n} / α_{n} \to a_{\infty}$ for a_∞ ∈ (0, ∞). Fix c ∈ ℝ^p₂₁. Then

Bias ({\hat{β}}_{1}, c) \leq a_{\infty} ‖ \sum_{1, \infty}^{- 1} ‖ P ‖ B ‖ 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0} + o_{p} (1) .

When (A3) from Laber et al. also holds, then

sup_{s \in ℝ^{p_{21}}} Bias ({\hat{β}}_{1}, c, s) \leq a_{\infty} ‖ \sum_{1, \infty}^{- 1} ‖ P ‖ B ‖ 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0} + o_{p} (1) .

The above results show that by choosing the scale of α, the bias can be controlled. Theorem 2 shows that this control of the bias directly influences the variance, at least under local alternatives.

For inference, we need to discuss two different settings. When holding α fixed, as n goes to infinity, standard inference for the parameters is valid, as the problem becomes regular. However, this comes with the price that the bias does not vanish even asymptotically (see also the discussion in Section 4 ). As proved in Theorem 2, when taking α to infinity, as n goes to infinity, the problem is nonregular. Thus, adaptive confidence intervals, such as the one suggested by Laber et al., are needed in order to perform valid inference.

2.3. Simulations for SoftMax

We compare the small-sample behaviour of SoftMax to that of soft-threshholding using the example setting discussed in Laber et al., Section 3. Let $θ^{*} = max (μ_{0}^{*}, μ_{1}^{*})$ . The max estimator is defined by

\hat{θ} \equiv max ({\hat{μ}}_{0}, {\hat{μ}}_{1}) = \frac{{\hat{μ}}_{0} + {\hat{μ}}_{1}}{2} + \frac{∣ {\hat{μ}}_{1} - {\hat{μ}}_{1} ∣}{2} .

A soft-thresholdling estimator is defined by

{\hat{θ}}^{σ} = \frac{{\hat{μ}}_{0} + {\hat{μ}}_{1}}{2} + \frac{∣ {\hat{μ}}_{0} - {\hat{μ}}_{1} ∣}{2} (1 - \frac{4 σ}{n ({\hat{μ}}_{0} - {\hat{μ}}_{1})}) .

Finally, the SoftMax estimator is defined by

\overset{⌣}{θ} = SoftMax ({\hat{μ}}_{0}, {\hat{μ}}_{1}, α) .

Let Y |A ~ N(μ_a, 1), a = 0, 1, and assume that the treatment assignment is perfectly balanced. We use 1000 Monte Carlo replicates to estimate the bias for each parameter setting. Figure 2 below shows the bias as a function of the treatment effect $μ_{1}^{*} - μ_{0}^{*}$ and with tuning parameters σ ∈ [0, 5] and α ∈ [1, 6] for the soft-thresholding and SoftMax, respectively. It appears that the SoftMax does not suffer from large bias on points away from $μ_{1}^{*} - μ_{0}^{*} = 0$ . Also, as expected from Theorem 1, the bias decreases as α increases.

Fig 2 — **Left:** Bias for soft-thresholding. **Right:** Bias for SoftMax. In both panels the bias is measured in units of $1 / \sqrt{n}$ for n = 10, as a function of effect size and of the tuning parameters, σ and α, for the soft-thresholding and SoftMax, respectively.

3. Penalized and Adaptive Q-Learning

In Penalized Q-learning (Song et al., 2011 ) and adaptive Q-learning (Goldberg et al., 2013 ), penalties were imposed on the term $H_{2, 1}^{'} β_{2, 1}$ for each individual. This use of penalized estimation allows us to simultaneously estimate the second stage parameters and select individuals whose value functions are not affected by treatments, i.e., those individuals whose true values of $H_{2, 1}^{'} β_{2, 1}^{*}$ are zero. Although the penalized method does not solve the non-regularity issue in estimating β’s, our numerical studies have demonstrated that penalized Q-learning is not only able to reduce bias, but also provides better coverage of confidence intervals in a number of scenarios, as compared to the hard thresholding method of Moodie and Richardson (2010) and some soft thresholding methods including resampling approaches. Furthermore, the inference approach for penalized methods described in Zhang and (Zhang 2014) appears to be able to handle diverging model perturbations. Finally, a nice feature of our penalized learning is that it enables us to identify non-responders, who may also have small treatment benefits even under a local alternative. Since it is clinically and practically most useful to target groups whose treatment benefit is large, identifying those subjects with small treatment benefits is useful for better allocation of resources and for reducing costs.

4. Truncated Value Function

The non-regularity issue arises primarily in settings where there are some subjects who do not respond to treatments at the second stage and where inference focuses on effect size. In the context of public health policy, we think that (1) the overall benefit (value) may be of greater interest compared to individual effect sizes and (2) those subjects who are not sensitive to treatments (approximate non-responders) should not have a large impact on the overall decision making process. Thus, we propose an appropriate alternative criterion, namely the ε-truncated value, for evaluating the optimal policy as follows:

V_{ε} (d_{1}, d_{2}) = E_{d} [(Y_{1} (d_{1}) + Y_{2} (d_{2})) I (δ (X_{1}) > ε, δ (X_{2}) > ε)],

where δ(X₁) and δ(X₂) denote the expected treatment effects at the first and second stages respectively. Here, ε is a small constant indicating a clinically meaningful effect size.

Under a SMART trial with randomization probabilities π_k at stage k (k = 1, 2), this truncated value is equal to

E [(Y_{1} + Y_{2}) I (A_{1} = d_{1} (X_{1}), A_{2} = d_{2} (X_{2}), δ (X_{1}) > ε, δ (X_{2}) > ε) / (π_{1} π_{2})] .

Compared to the usual value function, we can see that V_ε(d₁, d₂) differs by at most O(ε). Using the Q-learning model, the above value function for the estimated rule is

V_{ε} ({\hat{d}}_{1}, {\hat{d}}_{2}) = E [(Y_{1} + Y_{2}) I (A_{1} {\hat{β}}_{1} X_{1} > 0, A_{2} {\hat{β}}_{2} X_{2} > 0, ∣ {\hat{β}}_{1} X_{1} ∣ > ε, ∣ {\hat{β}}_{2} X_{2} ∣ > ε) / (π_{1} π_{2})] .

One advantage of considering this value function is that non-regularity will no longer be an issue since we have excluded the non-responders from the above statistic. One can easily show $\sqrt{n} ({\hat{V}}_{ε} ({\hat{d}}_{1}, {\hat{d}}_{2}) - V_{ε} (d_{1}, d_{2}))$ converges to the same normal distribution under local alternatives whether P(β₂X₂ = 0) > 0 or not.

5. Concluding Remarks

We again thank the authors for their very interesting work which likely stimulate additional future research on this crucial topic. It is clear that there are many fundamental and unresolved computational, methodological and theoretical challenges remaining which will benefit from many diverse problem solving approaches. We look forward to seeing this intriguing research area continue to develop.

Appendix A: Proofs

Sketch of proof of Theorem 1

Using the same arguments that lead to Eq. 2 in Laber et al., we have

\sqrt{n} ({\hat{β}}_{1} - β_{1}^{*}) = {\sum^{^}}_{1}^{- 1} ℙ_{n} B_{1} (\overset{⌣}{Y} - B_{1}^{'} β_{1}^{*}) = S_{n} + {\sum^{^}}_{1}^{- 1} ℙ_{n} B_{1} U_{n},

where Inline graphic is smooth and asymptotically normal and

U_{n} = \sqrt{n} (\frac{1}{α_{n}} log {1 + e^{α_{n} H_{2, 1}^{'} {\hat{β}}_{2, 1}}} - {[H_{2, 1}^{'} β_{2, 1}^{*}]}_{+}) .

Note that

\begin{array}{l} U_{n} = \sqrt{n} (\frac{1}{α_{n}} log {1 + e^{α_{n} H_{2, 1}^{'} {\hat{β}}_{2, 1}}} - \frac{1}{α_{n}} log {1 + e^{α_{n} H_{2, 1}^{'} β_{2, 1}^{*}}}) + \sqrt{n} (\frac{1}{α_{n}} log {1 + e^{α_{n} H_{2, 1}^{'} β_{2, 1}^{*}}} - {[H_{2, 1}^{'} β_{2, 1}^{*}]}_{+}) \\ = \frac{e^{α_{n} H_{2, 1}^{'} β_{2, 1}^{*}} H_{2, 1}^{'}}{1 + e^{α_{n} H_{2, 1}^{'} β_{2, 1}^{*}}} \sqrt{n} ({\hat{β}}_{2, 1} - β_{2, 1}^{*}) + o_{P} (\sqrt{n} ({\hat{β}}_{2, 1} - β_{2, 1}^{*})) + \sqrt{n} f (α_{n}, H_{2, 1}^{'} β_{2, 1}^{*}), \end{array}

where the last equality follows by taking derivatives, and where

f (α, x) = \frac{1}{α} log {1 + e^{α x}} - max (x, 0) .

(1)

For the remainder term, note that Lemma B.6 shows the consistency of β̂_2,1, and that the expectation of the Hessian of $\frac{1}{α_{n}} log {1 + e^{α_{n} H^{'} β}}$ is bounded by Assumption (A1). Hence, by applying Lemma B.5 to the matrix Σ̂₁ that appears in the remainder term, we conclude that

\sqrt{n} ({\hat{β}}_{1} - β_{1}^{*}) = S_{n} + T_{n} + W_{n} + o_{P} (1),

(2)

where

\begin{array}{l} T_{n} = \sum_{1}^{- 1} ℙ_{n} B_{1} [\frac{e^{α_{n} H_{2, 1}^{'} β_{2, 1}^{*}}}{1 + e^{α_{n} H_{2, 1}^{'} β_{2, 1}^{*}}} H_{2, 1}^{'} \sqrt{n} ({\hat{β}}_{2, 1} - β_{2, 1}^{*})], \\ W_{n} = {\sum^{^}}_{1}^{- 1} \sqrt{n} ℙ_{n} B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1}^{*}) . \end{array}

Recall that by assumption, α_n and thus fact that

\frac{e^{α_{n} x}}{1 + e^{α_{n} x}} \to {\begin{matrix} 1 & x > 0 \\ \frac{1}{2} & x = 0 \\ 0 & x < 0 \end{matrix},

we obtain that for a given h_2,1,

\frac{e^{α_{n} h_{2, 1}^{'} β_{2, 1}^{*}}}{1 + e^{α_{n} h_{2, 1}^{'} β_{2, 1}^{*}}} \to 1 {h_{2, 1}^{'} β_{2, 1}^{*} > 0} + \frac{1}{2} 1 {h_{2, 1}^{'} β_{2, 1}^{*} = 0} .

Define the function w: D_p₁ × l ^∞( Inline graphic ) × ℝ^p₂₁ × [0, 1] ↦ ℝ^p₁ by w(Σ, μ, ν, a) = Σ⁻¹μ(g(ν, B₁, H_2,1, a)), where

g (ν, b_{1}, h_{2, 1}, a) = {\begin{matrix} b_{1} [\frac{e^{h_{2, 1}^{'} β_{2, 1}^{*} / a}}{1 + e^{h_{2, 1}^{'} β_{2, 1}^{*} / a}} h_{2, 1}^{'} ν], & a > 0 \\ b_{1} [(1 {h_{2, 1}^{'} β_{2, 1}^{*} > 0} + \frac{1}{2} 1 {h_{2, 1}^{'} β_{2, 1}^{*} = 0}) h_{2, 1}^{'} ν], & a = 0 \end{matrix}

and where Inline graphic = {g(ν, b₁, h_2,1, a), ||ν|| ≤ K}. Using the same arguments as those used in Lemma B.11, one can show that w is continuous at (Σ_1,∞, P, ℝ^p₂₁, 0). Thus, using the continuous mapping theorem, it can be shown that

\begin{array}{l} S_{n} + T_{n} ⇝ \sum_{1, \infty}^{1} [G_{\infty} (B_{1} (H_{2, 0}^{'} β_{2, 0}^{*} + {[H_{2, 1}^{'} β_{2, 1}^{*}]}_{+} - B_{1}^{'} β_{1}^{*}))] \\ + \sum_{1, \infty}^{- 1} [P B_{1} (H_{2, 0}^{'} ℤ_{\infty, 0} + H_{2, 1}^{'} ℤ_{\infty, 1} (1 {H_{2, 1}^{'} β_{2, 1}^{*} > 0} + \frac{1}{2} 1 {H_{2, 1}^{'} β_{2, 1}^{*} = 0}))], \end{array}

where

{(ℤ_{\infty, 0}^{'}, ℤ_{\infty, 1}^{'})}^{'} = \sum_{2, \infty}^{- 1} [B_{2} (Y - B_{2}^{'} β_{2}^{*})] .

We now discuss Inline graphic , the third term in (3). By Lemma 1 (i) below, when $\sqrt{n} / α_{n} \to 0$ ,

‖ {\sum^{^}}_{1}^{- 1} W_{n} ‖ \leq \sqrt{n} ℙ_{n} ‖ {\sum^{^}}_{1}^{- 1} ‖ ‖ B_{1} ‖ f (α_{n}, H_{2, 1}^{'} β_{2, 1}^{*}) \leq \frac{\sqrt{n} log (2)}{α_{n}} ℙ_{n} ‖ {\sum^{^}}_{1}^{- 1} ‖ ‖ B_{1} ‖ \to 0,

which proves (i).

For (ii), let δ_n → 0 such that α_nδ_n → ∞. Write

\begin{array}{l} {\sum^{^}}_{1} W_{n} = \sqrt{n} ℙ_{n} (B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1}^{*}) 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ > δ_{n}}) \\ + \sqrt{n} ℙ_{n} (B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1}^{*}) 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ \leq δ_{n}}) \\ = \sqrt{n} ℙ_{n} (B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1}^{*}) 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ > δ_{n}}) \\ + \sqrt{n} ℙ_{n} (B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1}^{*}) (1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ \leq δ_{n}} - 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0})) \\ + \sqrt{n} ℙ_{n} (B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1}^{*}) 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0}) \\ \equiv A_{n} + B_{n} + C_{n} . \end{array}

Note that by Lemma 1 (vi),

‖ A_{n} ‖ \leq ℙ_{n} ‖ B_{1} ‖ \frac{\sqrt{n}}{α_{n}} e^{- α_{n} δ_{n}} \overset{P}{\to} 0.

Let $p (δ) = P (1 {∣ h_{2, 1}^{'} β_{2, 1}^{*} ∣ \leq δ} - 1 {∣ h_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0})$ , and note that p(0) = 0 and thus p(δ) → 0 as δ → 0. Hence,

‖ B_{n} ‖ \leq ℙ_{n} ‖ B_{n} ‖ \frac{\sqrt{n}}{α_{n}} log (2) ‖ (1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ \leq δ_{n}} - 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0}) ‖ \overset{P}{\to} 0.

Summarizing, we obtain that

W_{n} \overset{P}{\to} a_{\infty} log (2) \sum_{1, \infty}^{- 1} P B_{1} 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0},

which proves (ii).

Sketch of proof of Theorem 2

Using the same arguments that lead to Eq. 2 in Laber et al., we have

\sqrt{n} ({\hat{β}}_{1} - β_{1, n}^{*}) = {\sum^{^}}_{1}^{- 1} ℙ_{n} B_{1} (\overset{⌣}{Y} - B_{1}^{'} β_{1, n}^{*}) = S_{n} + {\sum^{^}}_{1}^{- 1} ℙ_{n} B_{1} U_{n},

where Inline graphic is smooth and asymptotically normal, and

U_{n} = \frac{e^{α_{n} H_{2, 1}^{'} β_{2, 1, n}^{*}} H_{2, 1}^{'}}{1 + e^{α_{n} H_{2, 1}^{'} β_{2, 1, n}^{*}}} \sqrt{n} ({\hat{β}}_{2, 1} - β_{2, 1, n}^{*}) + o_{P} (\sqrt{n} ({\hat{β}}_{2, 1} - β_{2, 1, n}^{*})) + \sqrt{n} (\frac{1}{α_{n}} log {1 + e^{α_{n} H_{2, 1}^{'} β_{2, 1, n}^{*}}} - {[H_{2, 1}^{'} β_{2, 1, n}^{*}]}_{+}) .

Similarly to the proof of Theorem 1, we have

\sqrt{n} ({\hat{β}}_{1} - β_{1, n}^{*}) = S_{n} + T_{n} + W_{n} + o_{P} (1),

(3)

where

\begin{array}{l} T_{n} = {\sum^{^}}_{1}^{- 1} ℙ_{n} B_{1} [\frac{e^{α_{n} H_{2, 1}^{'} β_{2, 1, n}^{*}}}{1 + e^{α_{n} H_{2, 1}^{'} β_{2, 1, n}^{*}}} H_{2, 1}^{'} \sqrt{n} ({\hat{β}}_{2, 1} - β_{2, 1}^{*})], \\ W_{n} = {\sum^{^}}_{1}^{- 1} \sqrt{n} ℙ_{n} B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1, n}^{*}) . \end{array}

We obtain that for a given h_2,1, and $β_{2, 1, n}^{*} = β_{2, 1}^{*} + \frac{s}{\sqrt{n}} + o (1 / \sqrt{n})$ , and $α_{n} / \sqrt{n} \to a_{\infty}^{- 1}$ ,

\frac{e^{α_{n} h_{2, 1}^{'} β_{2, 1, n}^{*}}}{1 + e^{α_{n} h_{2, 1}^{'} β_{2, 1, n}^{*}}} \to 1 {h_{2, 1}^{'} β_{2, 1}^{*} > 0} + {[a_{\infty}^{- 1} h_{2, 1}^{'} s]}_{+} 1 {h_{2, 1}^{'} β_{2, 1}^{*} = 0} .

Using the same arguments given in the proof of Theorem 4.1 (see also proof of Theorem 1 above) it can be shown that Inline graphic + converges in distribution to

\begin{array}{l} \sum_{1, \infty}^{- 1} {G_{\infty} (B_{1} (H_{2, 0}^{'} β_{2, 0}^{*} + {[H_{2, 1}^{'} β_{2, 1}^{*}]}_{+} - B_{1}^{'} β_{1}^{*})) \\ + P B_{1} (H_{2, 0}^{'} ℤ_{\infty, 0} + H_{2, 1}^{'} ℤ_{\infty, 1} (1 {H_{2, 1}^{'} β_{2, 1}^{*} > 0} + {[a_{\infty}^{- 1} H_{2, 1}^{'} s]}_{+} 1 {H_{2, 1}^{'} β_{2, 1}^{*} = 0}))}, \end{array}

where

{(ℤ_{\infty, 0}^{'}, ℤ_{\infty, 1}^{'})}^{'} = \sum_{2, \infty}^{- 1} [B_{2} (Y - B_{2}^{'} β_{2}^{*})] .

Note that

\begin{array}{l} {\sum^{^}}_{1} W_{n} = \sqrt{n} ℙ_{n} (B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1, n}^{*}) 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ > δ_{n}, ∣ α_{n}^{- 1} H_{2, 1}^{'} s ∣ > δ_{n} / 2}) \\ + \sqrt{n} ℙ_{n} (B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1, n}^{*}) 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ > δ_{n}, ∣ α_{n}^{- 1} H_{2, 1}^{'} s ∣ \leq δ_{n} / 2}) \\ + \sqrt{n} ℙ_{n} (B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1, n}^{*}) (1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ \leq δ_{n}} - 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0})) \\ + \sqrt{n} ℙ_{n} (B_{1} f (α_{n}, H_{2, 1}^{'} β_{2, 1, n}^{*}) 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0}) \\ \equiv A_{n} + B_{n} + C_{n} + D_{n} . \end{array}

The first three terms can be bounded as follows:

\begin{array}{l} ‖ A_{n} + B_{n} + C_{n} ‖ \leq ℙ_{n} ‖ B ‖ \frac{\sqrt{n}}{α_{n}} log (2) ‖ 1 {∣ H_{2, 1}^{'} s ∣ > α_{n} δ_{n} / 2} ‖ + ℙ_{n} ‖ B ‖ \frac{\sqrt{n}}{α_{n}} e^{- α_{n} δ_{n} / 2} \\ + ℙ_{n} ‖ B ‖ \frac{\sqrt{n}}{α_{n}} log (2) ‖ 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ \leq δ_{n}} - 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0} ‖ \overset{P}{\to} 0. \end{array}

Thus the limiting distribution of Inline graphic depends on that of D_n. For D_n we have

\begin{array}{l} D_{n} = ℙ_{n} B_{1} (\frac{\sqrt{n}}{α_{n}} log {1 + e^{\frac{α_{n}}{\sqrt{n}} H_{2, 1}^{'} s + o (1 / \sqrt{n})}} - {[H_{2, 1}^{'} s + o (1)]}_{+}) 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0} \\ \overset{P}{\to} P B_{1} (a_{\infty} log {1 + e^{a_{\infty}^{- 1} H_{2, 1}^{'} s}} - {[H_{2, 1}^{'} s]}_{+}) 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0}, \end{array}

which concludes the proof.

Proof of Corollary 1

We prove only the second assertion, the first can be proved similarly. Noting that both Inline graphic and have mean zero, it is enough to bound the bias of

\sum_{1, \infty}^{- 1} P (W_{\infty}) = \sum_{1, \infty}^{- 1} P B_{1} (a_{\infty} log {1 + e^{a_{\infty}^{- 1} H_{2, 1}^{'} s}} - {[H_{2, 1}^{'} s]}_{+}) 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0} .

Using Lemma 1 (i) with $α = a_{\infty}^{- 1}$ and $x = H_{2, 1}^{'} s$ , we obtain that the bias is bounded by

a_{\infty} log (2) ‖ \sum_{1, \infty}^{- 1} ‖ P ‖ B ‖ 1 {∣ H_{2, 1}^{'} β_{2, 1}^{*} ∣ = 0} .

The following lemma is needed for the proofs of Theorems 1–2:

Lemma 1

Let $f (α, x) = \frac{1}{α} log {1 + e^{α x}} - max (x, 0)$ . Then

0 < f(α, x) ≤ log(2)/α.
argmax_x f (α, x) = 0 and f (α, 0) = log(2)/α.
f(α, x) → 0 as α → ∞.
Fix δ > 0. Then ${max}_{x \in (- \infty, - δ] \cup [δ, \infty)} f (α, x) = \frac{e^{- α δ}}{α}$ .

The proof is technical and therefore omitted.

Footnotes

The first author was funded in part by ISF grant 1308/12. The other authors were funded in part by grant P01 CA142538 from the National Cancer Institute. The second author was also funded in part by NSF-DMS 1309465.

Contributor Information

Yair Goldberg, Email: ygoldberg@stat.haifa.ac.il, Department of Statistics, University of Haifa, Mount Carmel, Haifa 31905, Israel.

Rui Song, Email: rsong@ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695, U.S.A.

Donglin Zeng, Email: dzeng@email.unc.edu, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, U.S.A.

Michael R. Kosorok, Email: kosorok@unc.edu, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, U.S.A

References

Goldberg Y, Song R, Kosorok MR. From Probability to Statistics and Back: High-Dimensional Models and Processes – A Festschrift in Honor of Jon A. Wellner. Institute of Mathematical Statistics, vol. 9 of IMS Collections, chap. Adaptive Q-Learning. 2013:150–162. [Google Scholar]
Moodie EEM, Richardson TS. Estimating optimal dynamic regimes: correcting bias under the null. Scandinavian Journal of Statistics. 2010;37:126–146. doi: 10.1111/j.1467-9469.2009.00661.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
Song R, Wang W, Zeng D, Kosorok MR. Penalized Q-Learning for Dynamic Treatment Regimes. 2011 doi: 10.5705/ss.2012.364. Submitted. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang CH, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B. 2014;76:217–242. [Google Scholar]

[R1] Goldberg Y, Song R, Kosorok MR. From Probability to Statistics and Back: High-Dimensional Models and Processes – A Festschrift in Honor of Jon A. Wellner. Institute of Mathematical Statistics, vol. 9 of IMS Collections, chap. Adaptive Q-Learning. 2013:150–162. [Google Scholar]

[R2] Moodie EEM, Richardson TS. Estimating optimal dynamic regimes: correcting bias under the null. Scandinavian Journal of Statistics. 2010;37:126–146. doi: 10.1111/j.1467-9469.2009.00661.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Song R, Wang W, Zeng D, Kosorok MR. Penalized Q-Learning for Dynamic Treatment Regimes. 2011 doi: 10.5705/ss.2012.364. Submitted. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] Zhang CH, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B. 2014;76:217–242. [Google Scholar]

PERMALINK

Comment on “Dynamic treatment regimes: technical challenges and applications”^*

Yair Goldberg

Rui Song

Donglin Zeng

Michael R Kosorok

Abstract

1. Introduction