Inference for Nonregular Parameters in Optimal Dynamic Treatment Regimes

Bibhas Chakraborty; Victor Strecher; Susan Murphy

doi:10.1177/0962280209105013

. Author manuscript; available in PMC: 2011 Jun 1.

Published in final edited form as: Stat Methods Med Res. 2009 Jul 16;19(3):317–343. doi: 10.1177/0962280209105013

Inference for Nonregular Parameters in Optimal Dynamic Treatment Regimes

Bibhas Chakraborty ^1,^*, Victor Strecher ², Susan Murphy ¹

PMCID: PMC2891316 NIHMSID: NIHMS100083 PMID: 19608604

Abstract

A dynamic treatment regime is a set of decision rules, one per stage, each taking a patient’s treatment and covariate history as input, and outputting a recommended treatment. In the estimation of the optimal dynamic treatment regime from longitudinal data, the treatment effect parameters at any stage prior to the last can be nonregular under certain distributions of the data. This results in biased estimates and invalid confidence intervals for the treatment effect parameters. In this paper, we discuss both the problem of nonregularity, and available estimation methods. We provide an extensive simulation study to compare the estimators in terms of their ability to lead to valid confidence intervals under a variety of nonregular scenarios. Analysis of a data set from a smoking cessation trial is provided as an illustration.

Keywords: dynamic treatment regime, nonregularity, bias, hard-threshold, soft-threshold, empirical Bayes, bootstrap

1 Introduction

Many diseases such as mental illness, HIV infection, and substance abuse are clinically treated in multiple stages, adapting the treatment type and dosage to the ongoing measures of an individual patient’s response, adherence, burden, side effects, and preference. Dynamic treatment regimes represent one way to operationalize this sequential decision making. A dynamic treatment regime (DTR) is a sequence of decision rules, one per stage. Each decision rule takes a patient’s treatment and covariate history as input, and outputs a recommended treatment. The main motivations for considering sequences of treatments are high variability across patients in response to any one type of treatment, likely relapse, presence or emergence of co-morbidities, time-varying side effect severity, and reduction of costs and burden when intensive treatment is unnecessary¹.

A DTR is said to be optimal if it optimizes the mean outcome at the end of the final stage of treatment. Data for estimating the optimal DTR can come from either an observational longitudinal study or a sequential multiple assignment randomized trial (SMART)²^–⁵. In these designs, each patient is followed through stages of treatment and at each stage the patient is randomized to one of the possible treatment options. Experimental designs similar to SMART have been implemented in the treatments of schizophrenia⁶, depression⁷, and cancer⁸^,⁹.

Estimating the optimal DTR is a problem of sequential, multi-stage decision making. Murphy¹⁰ developed a semiparametric method for estimating the optimal DTR, an efficient version of which was provided by Robins¹¹. A nice discussion about the relationship between these two methods can be found in Moodie et al.¹². Other methods for estimating optimal DTRs in the literature include likelihood-based methods, both frequentist and Bayesian, developed by Thall and colleagues⁸^,¹³^,¹⁴, and the semiparametric methods of Lunceford et al.¹⁵, and Wahed and Tsiatis⁹^,¹⁶.

Robins¹¹ considered the problem of inference for the parameters of the optimal DTR. As discussed by Robins, the treatment effect parameters at any stage prior to the last can be nonregular under certain longitudinal distributions of the data which he called exceptional laws. By nonregularity, we mean that the asymptotic distribution of the estimator of the treatment effect parameter does not converge uniformly over the parameter space (see section 2.4 for further details). This technical phenomenon of nonregularity has considerable practical consequences; it often causes bias in estimation, and leads to poor frequentist properties of the confidence intervals. Recently Moodie and Richardson¹⁷ provided a method called Zeroing Instead of Plugging In (ZIPI) for correcting the bias in the estimation of the optimal DTRs resulting under exceptional laws.

The main goals of this paper are to illustrate the problem of nonregularity, and to compare available estimation methods that attempt to address this problem. In section 2, we discuss the problem of nonregularity in detail. Section 3 provides a description of different methods that address the problem. We provide an extensive simulation study in section 4 to compare the estimators in terms of their ability to lead to valid confidence intervals using bootstrap. This is followed by an analysis of a data set from a longitudinal smoking cessation trial in section 5; the purpose is to demonstrate the applicability of the estimation methods in a real-life nonregular scenario. Finally an overall discussion is provided in section 6. Throughout this article, we assume that the data come from SMART designs. The main reason for this is to separate the issue of nonregularity from causal inference issues. However the problem of nonregularity also arises when observational data¹¹^,¹⁷ are used; and the estimators proposed in section 3 should be applicable to observational data as well.

2 Estimation and Inference via Q-learning

2.1 Notation and Data Structure

For simplicity, we focus on studies with two stages. Longitudinal data on a single patient are given by the trajectory (O₁, A₁, O₂, A₂, O₃), where O_j (j = 1, 2) denotes the covariates measured prior to treatment at the beginning of the j-th stage, O₃ is the observation at the end of stage 2, and A_j (j = 1, 2) is the treatment assigned at the j-th stage subsequent to observing O_j. The data set consists of a random sample of n patients. Define the history at each stage as: H₁ = O₁, H₂ = (O₁, A₁, O₂). We consider a SMART design in which there are two possible treatments at each stage, A_j ∈ {−1, 1}; here we assume $P [A_{j} = - 1 ∣ H_{j}] = P [A_{j} = 1 ∣ H_{j}] = \frac{1}{2}$ . The study can have either a single primary outcome Y observed at the end of stage 2, or two outcomes Y₁, Y₂ observed at the two stages. Note that the case of a single outcome Y observed at the end can be viewed as a case with Y₁ ≡ 0 and Y₂ = Y. We assume Y₁ = f₁(O₁, A₁, O₂) and Y₂ = f₂(O₁, A₁, O₂, A₂, O₃), with known functions f₁, f₂. A two-stage DTR consists of two decision rules, say (d₁, d₂), with d_j(H_j) ∈ Inline graphic _j, where _j is the set of possible treatments at the j-th stage.

One simple method to construct (d₁, d₂) is Q-learning¹⁸^–²⁰. Q-learning, like Robins’ g-estimation of optimal structural nested mean models (hereafter simply referred to as Robins’ method), suffers from nonregularity – the common reason being an underlying non-smooth maximization operation. Here we will illustrate the problem due to nonregularity using Q-learning, since it can be viewed as a generalization of the least squares regression to multistage decision problems, and hence simpler to explain than Robins’ semiparametric efficient method. In Lemma 1 below, we provide conditions under which Q-learning is equivalent to an inefficient version of Robins’ method.

2.2 Q-learning with Linear Models

First let us define the Q-functions¹⁹^,²⁰ for the two stages as follows:

\begin{array}{l} Q_{2} (H_{2}, A_{2}) = E [Y_{2} ∣ H_{2}, A_{2}], \\ Q_{1} (H_{1}, A_{1}) = E [Y_{1} + max_{a_{2}} Q_{2} (H_{2}, a_{2}) ∣ H_{1}, A_{1}] . \end{array}

If the two Q-functions were known, the optimal DTR (d₁, d₂), using backwards induction (as in dynamic programming) argument, would be

d_{j} (h_{j}) = arg max_{a_{j}} Q_{j} (h_{j}, a_{j}), j = 1, 2.

(1)

In practice, the true Q-functions are not known and hence must be estimated from the data. Consider a linear model for the Q-functions. Let the stage-j (j = 1, 2) Q-function be modeled as

Q_{j} (H_{j}, A_{j}; β_{j}, ψ_{j}) = β_{j}^{T} H_{j 0} + (ψ_{j}^{T} H_{j 1}) A_{j},

(2)

where H_j₀ and H_j₁ are two (possibly different) summaries of the history H_j, with H_j₀ denoting the “main effect of history” and H_j₁ denoting the part of history that interacts with treatment (H_j₀ and H_j₁ include the intercept term). The Q-learning algorithm is:

Stage-2 regression: $({\hat{β}}_{2}, {\hat{ψ}}_{2}) = arg {min}_{β_{2}, ψ_{2}} \frac{1}{n} \sum_{i = 1}^{n} {(Y_{2 i} - Q_{2} (H_{2 i}, A_{2 i}; β_{2}, ψ_{2}))}^{2}$ .
Stage-2 optimal rule: d̂₂(h₂) = arg max_a_₂ Q₂(h₂, a₂; β̂₂,ψ̂₂).
Stage-1 pseudo-outcome: Ŷ₁_i = Y₁_i + max_a_₂ Q₂(H₂_i, a₂; β̂₂,ψ̂₂), i = 1,…,n.
Stage-1 regression: $({\hat{β}}_{1}, {\hat{ψ}}_{1}) = arg {min}_{β_{1}, ψ_{1}} \frac{1}{n} \sum_{i = 1}^{n} {({\hat{Y}}_{1 i} - Q_{1} (H_{1 i}, A_{1 i}; β_{1}, ψ_{1}))}^{2}$ .
Stage-1 optimal rule: d̂₁(h₁) = arg max_a_₁ Q₁(h₁, a₁; β̂₁,ψ̂₁).

The estimated optimal DTR using Q-learning is given by (d̂₁, d̂₂).

The following lemma gives a set of sufficient conditions under which Q-learning is equivalent to an inefficient version of Robins’ method.

Lemma 1

Consider linear models for the Q-functions as in (2). Assume that:

the parameters in Q₁ and Q₂ are distinct;
A_j has zero conditional mean given the history H_j, j = 1, 2; and
the covariates used in the model for Q₁ are nested within the covariates used in the model for Q₂, i.e., $(H_{10}^{T}, H_{11}^{T} A_{1}) \subset H_{20}^{T}$ .

Then Q-learning is algebraically equivalent to an inefficient version of Robins’ method.

The proof is given in Appendix A.

2.3 The Inference Problem

With (2) as the model for Q-functions, the optimal DTR is given by

d_{j} (H_{j}) = arg max_{a_{j}} (ψ_{j}^{T} H_{j 1}) a_{j} = sign (ψ_{j}^{T} H_{j 1}), j = 1, 2,

(3)

where sign(x) = 1 if x > 0, and −1 otherwise. Note that the term $β_{j}^{T} H_{j 0}$ on the right side of (2) does not feature in the optimal DTR. Thus for estimating optimal DTRs, the ψ_j’s are the parameters of interest, while β_j’s are nuisance parameters. We want to perform inference (e.g., construct confidence intervals) on ψ_j’s.

Conducting inference on ψ_j’s is important due to the following reasons. First, if the confidence intervals (or hypothesis tests) for ψ_j reveal that there is no evidence that some components of the vector ψ_j are different from zero, then the corresponding components of the history vector H_j₁ need not be collected to make decisions using the optimal DTR. This reduces the cost of data collection in a future implementation of the optimal DTR. Thus in the present context, confidence intervals (or hypothesis tests) can be viewed as a tool for doing variable selection. Second, it is important to know when there is insufficient support in the data to recommend one treatment over another, since in such cases treatment can be chosen according to other considerations like cost, familiarity, burden, preference etc. Third, as discussed by Robins¹¹, confidence intervals for ψ_j can lead to confidence intervals for d_j. In the following, we discuss the problem of nonregularity in inference.

2.4 Nonregularity in Inference

Note that the stage-1 pseudo-outcome (in the Q-learning algorithm) is

{\hat{Y}}_{1 i} = Y_{1 i} + max_{a_{2}} Q_{2} (H_{2 i}, a_{2}; {\hat{β}}_{2}, {\hat{ψ}}_{2}) = Y_{1 i} + {\hat{β}}_{2}^{T} H_{20, i} + ∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣, i = 1, \dots, n,

(4)

which is a non-smooth (e.g., non-differentiable at ${\hat{ψ}}_{2}^{T} H_{21, i} = 0$ ) function of ψ̂₂, because of the maximization operation. Since ψ̂₁ is a function of Ŷ₁_i, i = 1,…,n, it is in turn a non-smooth function of ψ̂₂. As a consequence, the asymptotic distribution of $\sqrt{n} ({\hat{ψ}}_{1} - ψ_{1})$ does not converge uniformly¹¹ over the parameter space of ψ = (ψ₁, ψ₂). More specifically, the asymptotic distribution of $\sqrt{n} ({\hat{ψ}}_{1} - ψ_{1})$ is normal if ψ₂ is such that $P [H_{2} : ψ_{2}^{T} H_{21} = 0] = 0$ , but is non-normal if $P [H_{2} : ψ_{2}^{T} H_{21} = 0] > 0$ . This change in the asymptotic distribution happens abruptly. The (vector) parameter ψ₁ is called a nonregular parameter and the estimator ψ̂₁ is called a nonregular estimator; see Bickel et al.²¹ for the precise definition of nonregularity. Because of this nonregularity, given the noise level present in small samples, the estimator ψ̂₁ oscillates between the two asymptotic distributions across samples. As a result, usual Wald type confidence intervals perform poorly¹¹^,¹⁷.

The issue of nonregularity can be better understood with a toy example discussed by Robins¹¹ (here is a slightly modified version). Consider the problem of estimating |μ| based on n i.i.d. observations X₁,…,X_n from N(μ, 1). Note that |X̄_n|is the maximum likelihood estimator of |μ|, where X̄_n is the sample average. It can be shown that the asymptotic distribution of $\sqrt{n} (∣ {\bar{X}}_{n} ∣ - ∣ μ ∣)$ for μ = 0 is different from that for μ ≠ 0. Thus |X̄_n|is a nonregular estimator of |μ|. Also, for μ = 0, ${lim}_{n \to \infty} E [\sqrt{n} (∣ {\bar{X}}_{n} ∣ - ∣ μ ∣)] = \sqrt{\frac{2}{π}}$ . Robins referred to this quantity as the asymptotic bias of the estimator |X̄_n|. This asymptotic bias is one symptom of the underlying nonregularity, as discussed by Moodie and Richardson¹⁷.

In many situations where the asymptotic distribution of an estimator is unavailable, bootstrap is used as an alternative approach to conduct inference. But the success of bootstrap also hinges on the underlying smoothness of the estimator. When an estimator is nonsmooth, the ordinary (n out of n) bootstrap procedure produces an inconsistent bootstrap estimator²². Inconsistency of bootstrap in the above simple normal theory example has been discussed by Andrews²³. As shown by Shao²², an alternative resampling procedure called “m out of n bootstrap” is consistent in such nonsmooth scenarios. One concern regarding the use of this procedure is the slower rate of convergence than $\sqrt{n}$ even in a regular setting (e.g., when $P [H_{2} : ψ_{2}^{T} H_{21} = 0] = 0$ ). Moreover, a data-adaptive choice of the tuning parameter m in the present context of DTRs is not obvious; see however Bickel and Sakov²⁴ and Hall et al.²⁵ for data-adaptive choice of m in other contexts.

The above concerns regarding nonregularity led us to investigate possible regularizations of the estimation procedure, and then use bootstrap for inference. In the simulation study to follow, we will investigate the behavior of different types of bootstrap confidence intervals for the parameters ψ_j of the optimal DTR in both regular and nonregular settings.

3 Different Regularized Estimators

In this section, we will present two competing estimators to address the non-regularity problem described above. Limited theoretical results are available at this point, and consequently it is not clear which estimator is better. In this paper, we will study their relative merits and demerits in simulations.

From the discussion on nonregularity above, it is clear that ψ̂₁ is a non-regular estimator because the stage-1 pseudo-outcome Ŷ₁ is a non-smooth function (e.g., absolute value) of ψ̂₂. The estimators presented in this section “regularize” the nonregular estimator (sometimes called the “hard-max” estimator because of the maximum operation used in the definition) by shrinking or thresholding the effect of the term involving the maximum, e.g., $∣ {\hat{ψ}}_{2}^{T} H_{21} ∣$ , towards zero.

3.1 Hard-threshold Estimator

Recall that the pseudo-outcome ${\hat{Y}}_{1} = Y_{1} + {\hat{β}}_{2}^{T} H_{20} + ∣ {\hat{ψ}}_{2}^{T} H_{21} ∣$ is non-differentiable in ψ̂₂ only when ${\hat{ψ}}_{2}^{T} H_{21} = 0$ , and so the corresponding estimator ψ̂₁ is problematic only when the true $ψ_{2}^{T} H_{21}$ is close to zero. The general form of the hard-threshold pseudo-outcome is

{\hat{Y}}_{1 i}^{H T} = Y_{1 i} + {\hat{β}}_{2}^{T} H_{20, i} + ∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣ \cdot 1 {∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣ > λ_{i}}, i = 1, \dots, n,

(5)

where λ_i (> 0) is the threshold for the i-th subject in the sample (possibly depending on the variability of the linear combination ${\hat{ψ}}_{2}^{T} H_{21, i}$ for that subject). One way to operationalize this is to perform a preliminary test (for each subject in the sample) of the hypothesis $H_{0 i} : ψ_{2}^{T} H_{21, i} = 0$ (H_21, _i is considered fixed in this test), set ${\hat{Y}}_{1 i}^{H T} = {\hat{Y}}_{1 i}$ if H₀_i is rejected, and replace $∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣$ with the “better guess” 0 in case H₀_i is accepted. Thus the hard-threshold pseudo-outcome can be written as

{\hat{Y}}_{1 i}^{H T} = Y_{1 i} + {\hat{β}}_{2}^{T} H_{20, i} + ∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣ \cdot 1 {\frac{\sqrt{n} ∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣}{\sqrt{H_{21, i}^{T} {\sum^{^}}_{2} H_{21, i}}} > z_{α / 2}}, i = 1, \dots, n,

(6)

where Σ̂₂ is the estimated covariance matrix of ψ̂₂. The corresponding estimator of ψ₁, denoted by ${\hat{ψ}}_{1}^{H T}$ , will be referred to as the hard-threshold estimator. The hard-threshold estimator is common in many areas like variable selection in linear regression and wavelet shrinkage²⁶. Moodie and Richardson¹⁷ proposed this estimator for bias correction in the context of Robins’ method, and called it Zeroing Instead of Plugging In (ZIPI) estimator.

Note that ${\hat{Y}}_{1}^{H T}$ is still a non-smooth function of ψ̂₂ and hence ${\hat{ψ}}_{1}^{H T}$ is a nonregular estimator of ψ₁. However, the problematic term $∣ {\hat{ψ}}_{2}^{T} H_{21} ∣$ is shrunk (thresholded) towards zero, and hence one might expect that the degree of nonregularity is somewhat reduced. Moodie and Richardson¹⁷ showed that this estimator reduces the bias occurring in Robins’ method (efficient version of Q-learning). In the simulation study to follow, we will explore if this estimator can be used to construct valid confidence intervals for ψ₁. An important issue regarding the use of this estimator is the choice of significance level α of the preliminary test, which is an unknown tuning parameter. As discussed by Moodie and Richardson¹⁷, this is a difficult problem even in better-understood settings where preliminary test based estimators are used; and no widely applicable data-driven method for choosing α in this setting is currently available.

3.2 Soft-threshold or Shrinkage Estimator

The general form of the soft-threshold pseudo-outcome considered here is

{\hat{Y}}_{1 i}^{S T} = Y_{1 i} + {\hat{β}}_{2}^{T} H_{20, i} + ∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣ \cdot {(1 - \frac{λ_{i}}{∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣^{2}})}^{+}, i = 1, \dots, n,

(7)

where x⁺ = x1{x > 0} stands for the positive part of a function, and λ_i (> 0) is a tuning parameter associated with the i-th subject in the sample (again possibly depending on the variability of the linear combination ${\hat{ψ}}_{2}^{T} H_{21, i}$ for that subject). In the contexts of regression shrinkage ²⁷ and wavelet shrinkage²⁸, the third term in (7) is generally known as the nonnegative garrote estimator. As discussed by Zou²⁹, the nonnegative garrote estimator is a special case of the adaptive lasso estimator. As in the case of hard-threshold estimator, a crucial issue here is to choose a data-driven tuning parameter λ_i. Below we provide a choice following a Bayesian approach.

Like the hard-threshold pseudo-outcome, ${\hat{Y}}_{1}^{S T}$ is also a non-smooth function of ψ̂₂ and hence ${\hat{ψ}}_{1}^{S T}$ remains a nonregular estimator of ψ₁. However, the problematic term $∣ {\hat{ψ}}_{2}^{T} H_{21} ∣$ is shrunk (or thresholded) towards zero, and hence one might expect that the degree of nonregularity is somewhat reduced. In the simulation study to follow, we will investigate how much improvement this estimator offers over the “hard-max” estimator, when it comes to constructing confidence intervals. Figure 1 presents the hard-max, the hard-threshold, and the soft-threshold pseudo-outcomes.

Hard-threshold and Soft-threshold pseudo-outcomes compared with the Hard-max pseudo-outcome.

3.2.1 Choice of Tuning Parameter

A hierarchical Bayesian formulation of the problem, inspired by the work of Figueiredo and Nowak³⁰ in the area of wavelet-based image processing, can be used in the context of the soft-threshold estimator to choose λ_i’s in a data-driven way. It turns out that the estimator (7) with $λ_{i} = 3 H_{21, i}^{T} {\sum^{^}}_{2} H_{21, i} / n, i = 1, \dots, n$ , where Σ̂₂/n is the estimated covariance matrix of ψ̂₂, is an approximate empirical Bayes estimator. The following lemma will be used to derive the choice of λ_i.

Lemma 2

Let X be a random variable such that X|μ ~ N(μ, σ²) with known variance σ². Let the prior distribution on μ be given by μ|ϕ² ~ N(0, ϕ²),with Jeffrey’s noninformative hyper-prior on ϕ², e.g., p(ϕ²) ∝ 1/ϕ². Then an empirical Bayes estimator of |μ| is given by

\begin{array}{l} {\hat{∣ μ ∣}}^{E B} = X {(1 - \frac{3 σ^{2}}{X^{2}})}^{+} (2 Φ (\frac{X}{σ} \sqrt{{(1 - \frac{3 σ^{2}}{X^{2}})}^{+}}) - 1) \\ + \sqrt{\frac{2}{π}} σ \sqrt{{(1 - \frac{3 σ^{2}}{X^{2}})}^{+}} exp {- \frac{X^{2}}{2 σ^{2}} {(1 - \frac{3 σ^{2}}{X^{2}})}^{+}}, \end{array}

(8)

where Φ(·) is the standard normal distribution function.

The proof is given in Appendix B.

Clearly, ${\hat{∣ μ ∣}}^{E B}$ is a thresholding rule, since ${\hat{∣ μ ∣}}^{E B} = 0$ for $∣ X ∣ < \sqrt{3} σ$ . Moreover, when $∣ \frac{X}{σ} ∣$ is large, the second term of (8) goes to zero exponentially fast, and

(2 Φ (\frac{X}{σ} \sqrt{{(1 - \frac{3 σ^{2}}{X^{2}})}^{+}}) - 1) \approx (2 I_{{X > 0}} - 1) = sign (X) .

Consequently, the empirical Bayes estimator is approximated by

{\hat{∣ μ ∣}}^{E B} \approx X {(1 - \frac{3 σ^{2}}{X^{2}})}^{+} sign (X) = ∣ X ∣ {(1 - \frac{3 σ^{2}}{X^{2}})}^{+} .

(9)

Now for i = 1,…,n separately, put $X = {\hat{ψ}}_{2}^{T} H_{21, i}$ , and $μ = {\hat{ψ}}_{2}^{T} H_{21, i}$ (for fixed H₂₁_;i); and plug in ${\hat{σ}}^{2} = H_{21, i}^{T} {\sum^{^}}_{2} H_{21, i} / n$ for σ². This leads to a choice of λ_i in the soft-threshold pseudo-outcome (7):

{\hat{Y}}_{1 i}^{S T} = Y_{1 i} + {\hat{β}}_{2}^{T} H_{20, i} + ∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣ \cdot {(1 - \frac{3 H_{21, i}^{T} {\sum^{^}}_{2} H_{21, i}}{n ∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣^{2}})}^{+},

(10)

\begin{matrix} = Y_{1 i} + {\hat{β}}_{2}^{T} H_{20, i} + ∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣ \cdot (1 - \frac{3 H_{21, i}^{T} {\sum^{^}}_{2} H_{21, i}}{n ∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣^{2}}) \cdot 1 {\frac{\sqrt{n} ∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣}{\sqrt{H_{21, i}^{T} {\sum^{^}}_{2} H_{21, i}}} > \sqrt{3}}, \\ i = 1, \dots, n . \end{matrix}

(11)

The presence of the indicator function in (11) indicates that ${\hat{Y}}_{1 i}^{S T}$ is a thresholding rule for small values of $∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣$ , while the term just preceding the indicator function makes ${\hat{Y}}_{1 i}^{S T}$ a shrinkage rule for moderate to large values of $∣ {\hat{ψ}}_{2}^{T} H_{21, i} ∣$ (for which the indicator function takes the value one). Thus the current Bayesian formulation gives us a data-driven choice of the tuning parameters.

4 Simulation Study

In this section, we consider a simulation study to compare the performances of the hard-max, the hard-threshold, and the soft-threshold estimators under different nonregular scenarios. In this study, we vary the parameters of the generative model, the degree of nonregularity, and the type of bootstrap confidence interval.

Generative Model

Recall that the data consist of n patient trajectories, each of the form (O₁, A₁, O₂, A₂, O₃). Without loss of generality, we assume Y₁ ≡ 0 and Y₂ ≡ Y = O₃. Let μ_Y = E[Y|O₁, A₁, O₂, A₂], and ε be the associated error term. Then Y = μ_Y + ε, where

μ_{Y} = γ_{1} + γ_{2} O_{1} + γ_{3} A_{1} + γ_{4} O_{1} A_{1} + γ_{5} A_{2} + γ_{6} O_{2} A_{2} + γ_{7} A_{1} A_{2},

and ε ~ N(0, 1). Next, we consider binary treatments randomized with probability 1/2, e.g., P[A_j = 1] = P [A_j = −1] = 1/2, j = 1, 2. Also, the binary covariates O_j’s are generated as

\begin{array}{l} P [O_{1} = 1] = P [O_{1} = - 1] = 1 / 2, \\ P [O_{2} = 1 ∣ O_{1}, A_{1}] = 1 - P [O_{2} = - 1 ∣ O_{1}, A_{1}] = expit (δ_{1} O_{1} + δ_{2} A_{1}), \end{array}

where expit(x) = exp(x)/(1 + exp(x)). Note that γ₁,…,γ₇ and δ₁, δ₂ are the parameters that specify the generative model. These parameters will be varied in the examples to follow.

Analysis Model

\begin{array}{l} Q_{2} (H_{2}, A_{2}) = β_{20} + β_{21} O_{1} + β_{22} A_{1} + β_{23} O_{1} A_{1} + (ψ_{20} + ψ_{21} O_{2} + ψ_{22} A_{1}) A_{2}, \\ Q_{1} (H_{1}, A_{1}) = β_{10} + β_{11} O_{1} + (ψ_{10} + ψ_{11} O_{1}) A_{1} . \end{array}

Two dimensions of nonregularity: p and ϕ

Nonregularity in stage 1 parameters arises when the optimal stage 2 treatment is non-unique for at least some subjects in the population. With reference to the present generative model, a setting is nonregular if the linear combination γ₅ + γ₆O₂ + γ₇A₁ = 0 with positive probability. Also one might expect some nonregular behavior as γ₅ + γ₆O₂ + γ₇A₁ falls in a small neighborhood of zero (even though not exactly zero). In the following, we consider specific examples varying the “degree of nonregularity”, e.g., p = P[γ₅ + γ₆O₂ + γ₇A₁ = 0] and the “standardized effect size” defined as $ϕ = ∣ E [γ_{5} + γ_{6} O_{2} + γ_{7} A_{1}] / \sqrt{Var [γ_{5} + γ_{6} O_{2} + γ_{7} A_{1}]} ∣$ . The quantities p and ϕ, which depend on the distribution of the above linear combination, represent two dimensions of the nonregularity phenomenon. Note that the linear combination (γ₅ + γ₆O₂ + γ₇A₁) can take only four possible values corresponding to the four possible (O₂, A₁) cells. The cell probabilities can be easily calculated; the formulae are provided in Table 1.

Table 1.

Distribution of the linear combination (γ₅ + γ₆O₂ + γ₇A₁)

(O₂, A₁) cell

cell probability (averaged over O₁)

value of the linear combination

(1, 1)

q_{1} \equiv \frac{1}{4} (expit (δ_{1} + δ_{2}) + expit (- δ_{1} + δ_{2}))

f₁ ≡γ₅ + γ₆ + γ₇

(1, −1)

q_{2} \equiv \frac{1}{4} (expit (δ_{1} - δ_{2}) + expit (- δ_{1} - δ_{2}))

f₂ ≡γ₅ + γ₆ − γ₇

(−1, 1)

q_{3} \equiv \frac{1}{4} (expit (δ_{1} - δ_{2}) + expit (- δ_{1} - δ_{2}))

f₃ ≡γ₅ − γ₆ + γ₇

(−1, −1)

q_{4} \equiv \frac{1}{4} (expit (δ_{1} + δ_{2}) + expit (- δ_{1} + δ_{2}))

f₄ ≡ γ₅ −γ₆ − γ₇

Open in a new tab

It follows that E[γ₅ + γ₆O₂ + γ₇A₁] = q₁f₁ + q₂f₂ + q₃f₃ + q₄f₄, and $E [{(γ_{5} + γ_{6} O_{2} + γ_{7} A_{1})}^{2}] = q_{1} f_{1}^{2} + q_{2} f_{2}^{2} + q_{3} f_{3}^{2} + q_{4} f_{4}^{2}$ , where q₁,…,q₄ are the cell probabilities given in Table 1. From these two, one can calculate V ar[γ₅ + γ₆O₂ + γ₇A₁], and subsequently the effect size ϕ.

We want to conduct inference on ψ₁₀ and ψ ₁₁, the analysis model parameters associated with stage 1 treatment A₁. They can be expressed in terms of γ’s and δ’s, the parameters of the generative model, as follows. It turns out that

\begin{array}{r} ψ_{10} = γ_{3} + q_{1} ∣ f_{1} ∣ - q_{2} ∣ f_{2} ∣ + q_{3} ∣ f_{3} ∣ - q_{4} ∣ f_{4} ∣, \\ and ψ_{11} = γ_{4} + q_{1}^{'} ∣ f_{1} ∣ - q_{2}^{'} ∣ f_{2} ∣ - q_{3}^{'} ∣ f_{3} ∣ + q_{4}^{'} ∣ f_{4} ∣, \end{array}

where $q_{1}^{'} = q_{3}^{'} = \frac{1}{4} (expit (δ_{1} + δ_{2}) - expit (- δ_{1} + δ_{2}))$ , and $q_{2}^{'} = q_{4}^{'} = \frac{1}{4} (expit (δ_{1} - δ_{2}) - expit (- δ_{1} - δ_{2}))$ . In the following, we consider specific examples for varying p and ϕ. In Examples 1 − 4 below, we use δ₁ = δ₂ = 0.5. For this choice, we get the following values of the cell probabilities: q₁ = q₄ = 0.3078 and q₂ = q₃ = 0.1922. This choice of the δ’s also makes $q_{1}^{'} = q_{2}^{'} = q_{3}^{'} = q_{4}^{'} = .0578$ .

Example 1 (p = 1, ϕ undefined)

Consider a setting where there is no treatment effect for any subject (any history) in either stage. This is achieved by setting γ₁ = … = γ₇ = 0, and δ₁ = δ₂ = 0.5. Then f₁ = f₂ = f₃ = f₄ = 0, and hence ψ ₁₀ = ψ ₁₁ = 0, p = 1, and ϕ is undefined (0/0). This is a fully nonregular scenario.

Example 2 (p = 0, ϕ infinite)

Consider a setting similar to Example 1, where there is a very weak stage 2 treatment effect for every subject (all possible history). This is achieved by setting γ₅ = 0.01 and γ_j = 0, ∀j ≠ 5, and δ₁ = δ₂ = 0.5. Then f₁ = f₂ = f₃ = f₄ = 0.01; ψ ₁₀ = ψ ₁₁ = 0, p = 0, and ϕ is infinite (0.01/0). This is a regular scenario, but close to nonregularity (it is hard to detect the very weak effect given the noise level in the data).

Example 3 ( $p = \frac{1}{2}$ , ϕ = 1)

Consider a setting where there is no stage 2 treatment effect for half the subjects in the population, but a reasonably large effect for the other half of subjects. This is achieved by setting γ₁ = γ₂ = γ₄ = γ₆ = 0, γ₃ = −0.5, γ₅ = γ₇ = 0.5, and δ₁ = δ₂ = 0.5. Then f₁ = f₃ = 1, f₂ = f₄ = 0, ψ ₁₀ = ψ ₁₁ = 0, $p = \frac{1}{2}$ and ϕ = 1. This is a nonregular setting.

Example 4 (p = 0, ϕ = 1.0204)

Consider a setting where there is a very weak stage 2 treatment effect for half the subjects in the population, but a reasonably large effect for the other half of subjects. This is achieved by setting γ₁ = γ₂ = γ₄ = γ₆ = 0, γ₃ = −0.5, γ₅ = 0.5, γ₇ = 0.49, and δ₁ = δ₂ = 0.5. It follows that f₁ = f₃ = 0.99, f₂ = f₄ = 0.01, ψ ₁₀ = −0.0100, ψ ₁₁ = 0, p = 0, and ϕ = 1.0204. This regular example is close to the nonregular Example 3.

Example 5 ( $p = \frac{1}{4}$ , ϕ = 1.4142)

Consider a setting where there is no stage 2 treatment effect for one-fourth of the subjects in the population, but others have a reasonably large effect. To achieve this, set γ₁ = γ₂ = γ₄ = 0, γ₃ = −0.5, γ₅ = 1, γ₆ = γ₇ = 0.5, δ₁ = 1, and δ₂ = 0. Then f₁ = 2, f₂ = f₃ = 1, f₄ = 0; the cell probabilities are equal, i.e., $q_{1} = q_{2} = q_{3} = q_{4} = \frac{1}{4}$ ; and $q_{1}^{'} = q_{2}^{'} = q_{3}^{'} = q_{4}^{'} = 0.1155$ . Consequently, ψ₁₀ = ψ₁₁ = 0, $p = \frac{1}{4}$ , and ϕ = 1.4142. This is a nonregular setting.

Example 6 (p = 0, ϕ = 0.3451)

Consider a completely regular setting where there is a reasonably large stage 2 treatment effect for every subject in the population. This can be achieved by setting γ₁ = γ₂ = γ₄ = 0, γ₃ = −0.5, γ₅ = 0.25, γ₆ = γ₇ = 0.5, and δ₁ = δ₂ = 0.1. Then f₁ = 1.25, f₂ = f₃ = 0.25, and f₄ = −0.75; the cell probabilities are q₁ = q₄ = 0.2625, q₂ = q₃ = 0.2375; and $q_{1}^{'} = q_{2}^{'} = q_{3}^{'} = q_{4}^{'} = 0.0125$ . It follows that ψ₁₀ = −0.3688, ψ₁₁ = 0.0187, p = 0 and ϕ = 0.3451.

Note that in Example 5, the effect size ϕ is greater than Cohen’s³¹ benchmark large effect size (=0.8). Such a high effect size can be criticized as being unrealistic, based on the principle of clinical equipoise ³², which provides the ethical basis for medical research involving randomization. This principle says that there must be a honest, professional disagreement (high variability) among expert clinicians about the preferred treatment (and thus the standardized effect size of treatment is likely small). Hence this example might be somewhat down-weighted for overall comparison of performance. Furthermore, Example 6 violates the Hierarchical Ordering Principle ³³ in that the coefficient of the interaction term A₁A₂ (γ₇) is larger than the co-efficient of the main effect A₂ (γ₅). So this example might be given lower weight as well.

Competing Estimators

In the simulation, we will consider four estimators: the hard-max estimator (original Q-learning), the soft-threshold estimator, and the hard-threshold estimator with two values of the tuning parameter α, e.g., 0.2, which was empirically found to be a good choice by Moodie and Richardson¹⁷, and 0.08 which corresponds to the threshold used by the soft-threshold estimator proposed in this paper (from (11), the threshold used by the soft-threshold estimator is $\sqrt{3} = 1.7321$ ; equating this point to z_α_/2 and solving for α, we get α = 0.0833).

Different Bootstrap CIs

We consider three types of bootstrap CIs, e.g., percentile, hybrid, and double (percentile) bootstrap CIs. Let θ̂ be an estimator of θ and θ̂* be its bootstrap version. Then the 100(1 − α)% percentile bootstrap (PB) CI is given by $({\hat{θ}}_{(\frac{α}{2})}^{*}, {\hat{θ}}_{(1 - \frac{α}{2})}^{*})$ , and the 100(1 − α)% hybrid bootstrap (HB) CI is given by $(2 \hat{θ} - {\hat{θ}}_{(1 - \frac{α}{2})}^{*}, 2 \hat{θ} - {\hat{θ}}_{(\frac{α}{2})}^{*})$ , where ${\hat{θ}}_{γ}^{*}$ is the 100γ-th percentile of the bootstrap distribution. The double bootstrap (DB) CI is calculated as follows:

Draw B₁ first-stage bootstrap samples from the original data. For each first-stage bootstrap sample, calculate the bootstrap version of the estimator θ̂*^b, b = 1,…,B₁.
Conditional on each first-stage bootstrap sample, draw B₂ second-stage (nested) bootstrap samples and calculate the double bootstrap versions of the estimator, e.g., θ̂**^bm, b = 1,…,B₁, m = 1,…,B₂.
For b = 1,…,B₁, calculate $u^{* b} = \frac{1}{B_{2}} \sum_{m = 1}^{B_{2}} 1 {{\hat{θ}}^{* * b m} \leq \hat{θ}}$ , where θ̂ is the estimator based on the original data.
The double bootstrap CI is given by $({\hat{θ}}_{\hat{q} (\frac{α}{2})}^{*}, {\hat{θ}}_{\hat{q} (1 - \frac{α}{2})}^{*})$ , where $\hat{q} (γ) = u_{(γ)}^{*}$ , the 100γ-th percentile of the distribution of u*^b, b = 1,…,B₁.

See Davison and Hinkley³⁴ and Nankervis³⁵ for details about double bootstrap CIs. One disadvantage of these CIs is that they are computationally very intensive.

We use B = 1000 bootstrap iterations to calculate the percentile and the hybrid bootstrap CIs. However, the double bootstrap CIs are based on B₁ = 500 first-stage and B₂ = 100 second-stage bootstrap iterations (due to the increased computational burden). The results in Tables 2 – 3 are based on N = 1000 Monte Carlo iterations.

Table 2.

Summary statistics and coverage rates of 95% and 90% nominal percentile (PB), hybrid (HB), and double (DB) bootstrap CIs for ψ₁₀ using the hard-max (HM), the hard-threshold with α = 0.08 (HT_0.08) and α = 0.2 (HT_0.20), and the soft-threshold (ST) estimators. A “*” indicates significantly different coverage rate than the nominal rate, using a test of proportion (Type I error rate = 0.05).

	Summary Statistics			Coverage of 95% CI			Coverage of 90% CI

Estimator	Bias	Var	MSE	PB	HB	DB	PB	HB	DB
Example 1: p = 1 and ϕ undefined (ψ₁₀= ψ₁₁ = 0)
HM	0.0003	0.0045	0.0045	96.8*	93.5*	93.6	92.9*	88.2	88.8
HT_0.08	0.0017	0.0044	0.0044	97.0*	95.0	–	93.7*	90.3	–
HT_0.20	0.0002	0.0050	0.0050	97.4*	92.8*	–	94.2*	86.9*	–
ST	0.0009	0.0036	0.0036	95.3	96.1	–	91.1	91.4	–

Example 2: p = 0 and ϕ infinite (ψ₁₀ = ψ₁₁ = 0)
HM	0.0003	0.0045	0.0045	96.7*	93.4*	93.6	92.4*	88.2	89.0
HT_0.08	0.0010	0.0044	0.0044	97.1*	95.3	–	94.0*	90.5	–
HT_0.20	0.0003	0.0050	0.0050	97.3*	93.5*	–	94.3*	87.1*	–
ST	0.0008	0.0036	0.0036	95.4	95.9	–	90.8	91.5	–

Example 3: $p = \frac{1}{2}$ and ϕ = 1 (ψ₁₀ = ψ₁₁ = 0)
HM	−0.0401	0.0059	0.0075	88.4*	92.7*	94.8	81.2*	86.1*	89.0
HT_0.08	−0.0083	0.0058	0.0059	94.3	94.3	–	88.5	89.0	–
HT_0.20	−0.0179	0.0062	0.0065	93.5*	93.5*	–	87.0*	88.1*	–
ST	−0.0185	0.0055	0.0058	93.4*	94.9	–	87.1*	89.4	–

Open in a new tab

Table 3.

Summary statistics and coverage rates of 95% and 90% nominal percentile (PB), hybrid (HB), and double (DB) bootstrap CIs for ψ₁₀ using hard-max (HM), hard-threshold with α = 0.08 (HT_0.08) and α = 0.2 (HT_0.20), and soft-threshold (ST) estimators. A “*” indicates significantly different coverage rate than the nominal rate, using a test of proportion (Type I error rate = 0.05).

	Summary Statistics			Coverage of 95% CI			Coverage of 90% CI

Estimator	Bias	Var	MSE	PB	HB	DB	PB	HB	DB
Example 4: p = 0 and ϕ = 1.0204 (ψ₁₀ = −0.01, ψ₁₁ = 0)
HM	−0.0353	0.0059	0.0072	89.6*	93.1*	94.4	82.9*	86.6*	90.2
HT_0.08	−0.0037	0.0058	0.0058	94.6	94.1	–	88.9	89.0	–
HT_0.20	−0.0130	0.0062	0.0064	93.9	92.8*	–	87.9*	87.9*	–
ST	−0.0138	0.0055	0.0057	94.1	95.0	–	87.4*	89.7	–

Example 5: $p = \frac{1}{4}$ and ϕ = 1.4142 (ψ₁₀= ψ₁₁ = 0)
HM	−0.0209	0.0069	0.0074	92.7*	93.1*	94.2	87.8*	89.0	88.4
HT_0.08	−0.0059	0.0070	0.0071	93.9	93.2*	–	89.5	88.2	–
HT_0.20	−0.0101	0.0072	0.0073	93.3*	93.0*	–	89.3	88.0*	–
ST	−0.0065	0.0069	0.0069	93.8	94.6	–	89.7	89.0	–

Example 6: p = 0 and ϕ = 0.3451 (ψ₁₀ = −0.3688, ψ₁₁ = 0.0187)
HM	0.0009	0.0067	0.0067	95.0	93.8	95.0	89.2	87.4*	88.2
HT_0.08	0.0003	0.0081	0.0081	95.1	88.5*	–	90.1	82.9*	–
HT_0.20	0.0011	0.0074	0.0074	94.8	91.2*	–	89.7	86.4*	–
ST	0.0052	0.0074	0.0074	94.8	91.7*	–	89.4	85.3*	–

Open in a new tab

4.1 Results

The simulation study compares the competing estimators on a variety of settings represented by Examples 1 – 6. We considered estimation and inference for both ψ₁₀ and ψ₁₁. However in the present examples, the effect of nonregularity turned out to be more pronounced for the parameter ψ₁₀ (main effect of A₁) than ψ₁₁ (interaction of A₁ with O₁). Hence we included results on ψ₁₀ only in Tables 2 and 3. Also in the following discussion, we will focus on ψ₁₀.

In Example 1 (top part of Table 2), where stage 2 effects for all possible histories are zero (i.e., the stage 2 optimal treatment is non-unique for every subject in the population), we see that there is no bias associated with the hard-max estimator; and the mean squared error (MSE) is essentially the same as the variance. However the percentile bootstrap CI (both 95% and 90%) has over-coverage (note that over-coverage translates to lower power of the corresponding hypothesis test), and the hybrid bootstrap CI (95%) has under-coverage compared to the nominal level. We have also studied the Wald type CIs for this setting (not included in this paper) and observed over-coverage (the problem with Wald type CIs in such nonregular settings is well-known¹¹^,¹⁷). This suggests that the asymptotic distribution of the hard-max estimator has a lighter tail than a comparable normal distribution. However, the double bootstrap CIs have correct coverage. Note that both versions of the hard-threshold estimator fail to rectify the coverage rate, even though neither suffer from bias. However, the soft-threshold estimator offers correct coverage for both types of bootstrap CIs. Moreover, it gives the lowest MSE among the four estimators. Note that the soft-threshold estimator is also non-smooth (nonregular), and consequently the bootstrap distribution is inconsistent for the true asymptotic distribution of this estimator. But in this setting, it reduces the degree of nonregularity just enough so that the bootstrap CIs do not show the problem with coverage.

Even though Example 2 (middle part of Table 2) is a regular setting (p = 0), it is very close to Example 1 and hence affected by nonregularity. Results are similar to those in Example 1. Thus the presence of very small effects causes problems with coverage even in regular settings.

Example 3 (bottom part of Table 2) is a setting where the stage 2 optimal treatment is non-unique for half the subjects in the population ( $p = \frac{1}{2}$ ) and is unique for the remaining half, but the overall standardized stage 2 effect size ϕ (= 1) is quite large. Here the hard-max estimator is biased, and hence both the percentile and the hybrid bootstrap CIs under-cover the true value. However the double bootstrap CI gives correct coverage rate. Both versions of the hard-threshold estimator reduce bias and one of them (corresponding to α = 0.08) gives correct coverage, while the other also offers substantial improvement of the coverage rate. This is consistent with the findings of Moodie and Richardson¹⁷. The soft-threshold estimator also reduces bias, gives the lowest MSE among the four estimators, and provides correct coverage with the hybrid bootstrap method but not with the percentile method (even though it offers substantial improvement). Thus in this example, the hard-threshold estimator with α = 0.08 emerges as the winner, with the soft-threshold estimator at the second place. However, note that the value 0.08 of the tuning parameter α is not arbitrary – it corresponds to the threshold used by the soft-threshold estimator. If constructing confidence intervals is the main goal (so biased estimation is less of an issue), double bootstrap CI along with the hard-max estimator can also be used in this setting, although it is computationally more expensive.

Example 4 (top part of Table 3) is a regular setting, very similar to the nonregular setting in Example 3. Results are quite similar to those in Example 3. This is consistent with our previous observation (Example 2) that the presence of very small effects causes problems with coverage even in regular settings.

In example 5 (middle part of Table 3), the stage 2 optimal treatment is non-unique for one-fourth of the subjects in the population ( $p = \frac{1}{4}$ ) and the standardized effect size ϕ is very large (=1.4142). Again, the hard-max estimator is biased, and has low coverage of the CIs (except for double bootstrap). The hard-threshold and the soft-threshold estimators offer improvement in terms of bias as well as coverage. The soft-threshold estimator emerges as the best (lowest MSE and correct coverage rate) in this example.

Example 6 (bottom part of Table 3) is a regular setting (p = 0, with no extremely tiny stage 2 effect as in Examples 2 and 4), with the standardized effect size 0.3451. The reason for investigating this setting is to check if the regularized estimators (hard and soft threshold) perform poorly in settings where there is no need to regularize. As expected, the hard-max estimator performs well here. The soft-threshold estimator introduces some bias when there is none in the hard-max estimator and increases MSE; but still manages to provide correct coverage for the percentile bootstrap method. The hard-threshold estimators also give correct coverage for percentile CIs.

To summarize, the hard-max estimator is problematic in nonregular scenarios, except when used with the computationally intensive double bootstrap method for constructing confidence intervals. The hard-threshold estimator, if properly tuned, addresses the problem of bias but not the problem of light tail. The soft-threshold estimator seems to address both problems to a large extent. In the simulation, the soft-threshold estimator consistently produced the lowest MSE among the competing methods across all the nonregular scenarios. Also in all the nonregular settings, either the soft-threshold estimator or the hard-threshold estimator with α = 0.08 (this α corresponds to the threshold used by the soft-threshold estimator) emerged as the winner in terms of providing correct coverage rate of the bootstrap CIs. Even though the soft-threshold estimator incurs some bias in regular settings, it manages to provide reasonable coverage rate for small to moderate standardized effect sizes (we have studied up to around 0.35). Across all the scenarios considered here (Examples 1–6), the soft-threshold estimator emerged as more robust than the hard-threshold estimator to the degree of regularity of the underlying data distribution, probably because of its “soft” nature (the soft-threshold estimator is continuous everywhere even though it has two points of non-differentiability, whereas the hard-threshold estimator has two points of discontinuity – see Figure 1). Furthermore, note that overall the hybrid bootstrap CIs performed slightly better than the percentile bootstrap CIs in this simulation study. Hence the hybrid bootstrap CIs will be used in the data analysis to follow.

5 Analysis of Smoking Cessation Data

To demonstrate the occurrence of nonregularity and the use of the soft-threshold method in a real application, here we present the analysis of a data set from a randomized, two-stage, longitudinal, internet-based smoking cessation study conducted by the Center for Health Communications Research at the University of Michigan. The stage 1 of this study (Project Quit) was conducted to find an optimal multi-factor behavioral intervention to help adult smokers quit smoking; and the stage 2 (Forever Free) was a follow-on study to help those (among the participants of Project Quit) who already quit stay quit, and help those who failed at the previous stage with a second chance. Details of the study design and primary analysis of the stage 1 data can be found in Strecher et al.³⁶

At stage 1, although there were five two-level treatment factors, only two, e.g., source (of online behavioral counseling message) and story (of a hypothetical character who succeeded in quitting smoking) were significant in the analysis reported in Strecher et al.³⁶ For simplicity, we considered only these two treatment factors at stage 1 of our present analysis, which gave a total of 4 treatment combinations at stage 1 corresponding to the 2×2 design. The treatment factor source was varied at two levels, e.g., high vs. low personalized, coded 1 and −1; also the factor story was varied at two levels, e.g., high vs. low tailoring depth (degree to which the character in the story was tailored to the individual subject’s baseline characteristics), coded 1 and −1. Baseline variables at this stage included subjects’ motivation to quit (on a 1–10 scale), selfefficacy (on a 1–10 scale) and education (binary, ≤ high school vs. > high school, coded −1/1). At stage 2, originally there were 4 different treatment groups and a control group; however the 4 treatment groups were combined together for the present analysis because of very little difference between them. This resulted in only two choices of treatment at stage 2; this treatment variable was called FFarm, coded −1/1 (1=treatment, −1 = control).

There were two outcomes at the two stages of this study. The stage 1 outcome was binary quit status called PQ6Quitstatus (1=quit, 0=not quit) at 6 month from the date of randomization. The stage 2 outcome was binary quit status FF6Quitstatus at 6 months from the date of stage 2 randomization (i.e., 12 months from the date of stage 1 randomization).

An example DTR can have the following form: “At stage 1, if a subject’s baseline selfefficacy is greater than a threshold value (say 7, on a 1–10 scale), then provide the highly-personalized level of the treatment component source, and if the subject is willing to continue treatment, then at stage 2 provide treatment if he/she continues to be a smoker at the end of stage 1”. Of course characteristics other than selfefficacy or a combination of more than one subject characteristics can be used to specify a DTR. To find the optimal DTR, we applied both the hard-max and the soft-threshold estimators within the Q-learning framework. This involved:

a stage 2 regression (n = 281) of FF6Quitstatus using the model:

$\begin{array}{l} F F 6 Quitstatus = β_{20} + β_{21} motivation + β_{22} source + β_{23} selfefficacy \\ + β_{24} story + β_{25} education + β_{26} P Q 6 Quitstatus \\ + β_{27} source * selfefficacy + β_{28} story * education \\ + (ψ_{20} + ψ_{21} P Q 6 Quitstatus) * FFarm + ε_{2}; \end{array}$

finding both the hard-max pseudo-outcome (Ŷ₁) and the soft-threshold pseudo-outcome ( ${\hat{Y}}_{1}^{S T}$ ) for the stage 1 regression:

\begin{array}{l} {\hat{Y}}_{1} = P Q 6 Quitstatus + {\hat{β}}_{20} + {\hat{β}}_{21} motivation + {\hat{β}}_{22} source + {\hat{β}}_{23} selfefficacy \\ + {\hat{β}}_{24} story + {\hat{β}}_{25} education + {\hat{β}}_{26} P Q 6 Quitstatus \\ + {\hat{β}}_{27} source * selfefficacy + {\hat{β}}_{28} story * education \\ + ∣ {\hat{ψ}}_{20} + {\hat{ψ}}_{21} P Q 6 Quitstatus ∣; \end{array}

\begin{array}{l} {\hat{Y}}_{1}^{S T} = P Q 6 Quitstatus + {\hat{β}}_{20} + {\hat{β}}_{21} motivation + {\hat{β}}_{22} source + {\hat{β}}_{23} selfefficacy \\ + {\hat{β}}_{24} story + {\hat{β}}_{25} education + {\hat{β}}_{26} P Q 6 Quitstatus \\ + {\hat{β}}_{27} source * selfefficacy + {\hat{β}}_{28} story * education \\ + ∣ {\hat{ψ}}_{20} + {\hat{ψ}}_{21} P Q 6 Quitstatus ∣ \cdot {(1 - \frac{3 Var ({\hat{ψ}}_{20} + {\hat{ψ}}_{21} P Q 6 Quitstatus)}{∣ {\hat{ψ}}_{20} + {\hat{ψ}}_{21} P Q 6 Quitstatus ∣^{2}})}^{+}; \end{array}

and (3) for each of the two pseudo-outcomes, a stage 1 regression (n = 1401) of the pseudo-outcome using a model of the form:

\begin{array}{l} {\hat{Y}}_{1} or {\hat{Y}}_{1}^{S T} = β_{10} + β_{11} motivation + β_{12} selfefficacy + β_{13} education \\ + (ψ_{10}^{(1)} + ψ_{11}^{(1)} selfefficacy) * source \\ + (ψ_{10}^{(2)} + ψ_{11}^{(2)} education) * story + ε_{1} . \end{array}

Note that the sample sizes at the two stages differ because only 281 subjects were willing to continue treatment into stage 2 (as allowed by the study protocol). Our stage 2 analysis was a usual regression analysis. No significant treatment effect was found at this stage, indicating the likely existence of nonregularity. At stage 1, for either estimator, 95% confidence intervals were constructed by hybrid bootstrap using 1000 bootstrap replications. The stage 1 analysis summary is presented in Table 4. In this case, the hard-max and the soft-threshold estimators produced similar results.

Table 4.

Regression coefficients and 95% hybrid bootstrap confidence intervals at stage 1, using both the hard-max and the soft-threshold estimators.

	Hard-max		Soft-threshold
Variable	Coefficient	95% CI	Coefficient	95% CI
`motivation`	0.04	(−0.00, 0.08)	0.04	(0.00, 0.08)
`selfefficacy`	0.03	(0.00, 0.06)	0.03	(0.00, 0.06)
`education`	−0.01	(−0.07, 0.06)	−0.01	(−0.07, 0.06)
`source`	−0.15	(−0.35, 0.06)	−0.15	(−0.35, 0.06)
`source*selfefficacy`	0.03	(0.00, 0.06)	0.03	(0.00, 0.06)
`story`	0.05	(−0.01, 0.11)	0.05	(−0.01, 0.11)
`story*education`	−0.07	(−0.13, −0.01)	−0.07	(−0.13, −0.01)

Open in a new tab

The conclusions from the present data analysis can be summarized as follows. We did not find any significant stage 2 treatment effect. So this analysis suggests that the stage 2 behavioral intervention need not be adapted to the smoker’s individual characteristics, interventions previously received, or stage 1 outcome. More interesting results are found at stage 1. It is found that subjects with higher level of motivation or selfefficacy are more likely to quit. The highly personalized level of source is more effective for subjects with a higher selfefficacy (≥ 7), and deeply tailored level of story is more effective for subjects with lower education (≤ high school); these two conclusions can be drawn from the interaction plots (with confidence intervals) presented in figure 2. Thus this secondary data analysis suggests that to maximize each individual’s chance of quitting over the two stages, the web-based smoking cessation intervention should be designed in future such that: (1) smokers with high self-efficacy (≥ 7) are assigned to highly personalized level of source, and (2) smokers with lower education are assigned to deeply tailored level of story.

Interaction plots: (a) source by self-efficacy (left panel), (b) story by education (right panel), along with confidence intervals for predicted stage 1 pseudo-outcome.

6 Discussion

In this paper, we have illustrated the problem of nonregularity that arises in the context of DTRs in the estimation of the optimal current treatment rule, when the optimal treatments at subsequent stages are non-unique for at least some proportion of subjects in the population. We have illustrated the phenomenon using Q-learning as the estimation procedure, which is a simpler yet inefficient version of Robins’ method; however the problem of nonregularity arises in Robins’ method as well¹¹^,¹⁷.

For some underlying data-generating models (e.g., Examples 3, 4, 5 in the simulation study), this nonregularity induces bias in the point estimates of the parameters of the optimal DTRs, which in turn causes under-coverage of the bootstrap confidence intervals. In contrast, in case of Examples 1 and 2, this nonregularity causes lightness of tail of the asymptotic distribution but no bias, as seen from the over-coverage of the percentile bootstrap CIs (equivalently conservative tests leading to lower power). The coexistence of these two not-so-well-related issues (they work in opposite directions, e.g., bias tends to make the CIs under-cover, whereas lightness of tail tends to make the CIs over-cover) makes this problem a unique and challenging one.

As mentioned in section 2.4, the phenomenon of nonregularity can be understood more clearly with a simpler problem, e.g., estimating |μ| (note that ψ₁₀ is a linear combination of terms like |μ|; see section 4) by |X̄_n| (similar to the hard-max estimator), where X̄_n is the sample average of n i.i.d. observations X₁,…,X_n from N (μ, 1). From section 2.4, we know that when μ = 0, |X̄_n| is a biased estimator of |μ| = 0, with $bias = E (∣ {\bar{X}}_{n} ∣) = \sqrt{\frac{2}{n π}}$ . Because of this bias (wrong centering), bootstrap CIs exhibited gross under-coverage in a toy example. But once we used a bias-corrected estimate (corrected by the analytically calculated bias), the percentile bootstrap CI exhibited over-coverage. This suggests that the distribution of |X̄_n| is peaked around its mean, or to put it in another way, has light tails.

In the simulation study to compare the competing estimators of the optimal DTR, we considered estimation of ψ₁₀, which involve linear combinations of |f₁|, |f₂|, |f₃|, and |f₄| (terms like |μ|). Under the non-regular scenarios, some or all (depending on the degree of nonregularity p) of the f_i’s are zero; and hence a phenomenon similar to the one described above in the toy example happens for each |f_i| for which f_i = 0. Each such term has its associated bias, and each has its own lightness of tail, with bias being the dominant property. In some nonregular scenarios (Example 1), the bias associated with the individual |f_i|’s (in the expression for ψ₁₀) cancel each other out (note the opposite signs in front of |f_i|’s), and hence the lightness of tail is revealed, resulting in a percentile bootstrap CI that over-covers. In other nonregular examples, however, bias is not canceled out, and hence dominates the property of the hard-max estimator. Hence under-coverage of the bootstrap CIs is observed.

Nonregularity is an issue in the estimation of the optimal DTRs because it arises when there is no treatment effect at subsequent stages (equivalently, there is no unique optimal treatment at subsequent stages). Unfortunately often there is no or very weak treatment effect in the settings we are interested in (e.g., randomized trials on mental illness or substance abuse). Thus we want our estimator to enjoy good statistical properties (e.g., less bias, lower risk or MSE, correct coverage rate of CIs, good power to detect “local” alternatives, etc.) when the optimal treatment at subsequent stages is non-unique. In case of the hard-max estimator, unfortunately the point of non-differentiability coincides with the parameter value such that $ψ_{2}^{T} H_{21} = 0$ (non-unique optimal treatment at the subsequent stage), which causes nonregularity (bias, higher MSE, low power). But the soft-threshold estimator (also, hard-threshold estimator), in some sense, redistributes the nonregularity from this “null point” to two different points symmetrically placed on either side of the “null point” (see Figure 1). This is one reason why the soft-threshold (also, hard-threshold) estimator works well in nonregular settings.

We have shown that using bootstrap confidence intervals along with the soft-threshold (also, hard-threshold in some cases) estimator reduces the degree of nonregularity, and gives correct coverage rate. Also, the double bootstrap method can be used along with the original hard-max estimator to address the nonregularity. But this method is highly computationally intensive and may be difficult to use in practice. An alternative method to construct CIs for ψ’s in nonregular settings is the score method due to Robins¹¹. We have not investigated this in our simulation study.

One can consider an alternative Bayesian approach to formulate an estimator similar to the soft-threshold estimator as follows. Let the data distribution ${\hat{ψ}}_{2}^{T} H_{21} ∣ ψ_{2}^{T} H_{21} \sim N (ψ_{2}^{T} H_{21}, σ^{2})$ with known σ², and the prior distribution of $ψ_{2}^{T} H_{21}$ be a mixture of a point mass at 0 and N (0, 1), with mixing parameter p (0 < p < 1). Then the posterior distribution of $ψ_{2}^{T} H_{21}$ is a mixture distribution given by

\begin{matrix} f_{post} (ψ_{2}^{T} H_{21}) = \hat{w} \cdot 1 {ψ_{2}^{T} H_{21} = 0} + (1 - \hat{w}) \cdot N (\frac{{\hat{ψ}}_{2}^{T} H_{21}}{1 + σ^{2}}, \frac{σ^{2}}{1 + σ^{2}}), \\ where \hat{w} = {1 + \frac{1 - p}{p} \sqrt{\frac{σ^{2}}{1 + σ^{2}}} exp {\frac{{({\hat{ψ}}_{2}^{T} H_{21})}^{2}}{2 σ^{2} (1 + σ^{2})}}}^{- 1} . \end{matrix}

One can use the median of this posterior distribution in place of ${\hat{ψ}}_{2}^{T} H_{21}$ in the expression for Ŷ₁. Thus the Bayes estimator becomes

{\hat{Y}}_{1}^{Bayes} = {\hat{β}}_{2}^{T} H_{20} + median of f_{post} (ψ_{2}^{T} H_{21}) .

For using this, one has to replace σ² by ${\hat{σ}}^{2} = H_{21}^{T} {\sum^{^}}_{2} H_{21} / n$ , and p by either some empirical estimate or a fixed value (e.g., $\frac{1}{2}$ ). In place of the above mixture prior, Johnstone and Silverman³⁷ suggest using the mixture of a point mass and a heavy-tailed distribution (e.g., double-exponential). This is a promising formulation that we want to investigate in future.

In this paper, we have focused on randomized trials only to separate the issue of nonregularity from causal inference issues. However the problem of nonregularity also arises when observational data¹¹^,¹⁷ are used; and the hard-threshold and the soft-threshold estimators should be applicable in those settings as well. Also, here we have focussed on only two stages for clarity. However, it should be understood that Q-learning can be used for studies with more than two stages as well. In case of many stages, one can think of a scenario where some parameters are shared across stages, in which case a simultaneous version of Q-learning (as opposed to the recursive version discussed in this paper) would be more appropriate. Unfortunately nonregularity does not go away if a simultaneous estimation procedure is used; see Moodie and Richardson¹⁷ for a discussion on this with reference to Robins’ method. However, unlike the case of recursive estimation, it is not well understood at this point whether the threshold estimators (hard or soft) can reduce the nonregularity in simultaneous estimation. Moodie and Richardson¹⁷ gave a simulated nonregular example showing that hard-threshold or ZIPI estimator is not always better than simultaneous estimator of Robins. We did not investigate this issue in the current paper, but we recognize this as an important avenue of future research.

To conclude, we think in the estimation of optimal DTRs, appropriately tuned hard-threshold estimator and the soft-threshold estimator should be seriously considered as improved versions of Q-learning (and Robins’ method of estimation).

Acknowledgments

We acknowledge support for this project from National Institutes of Health grants RO1 MH080015, P50 DA10075, and P50 CA101451.

Appendix A: Proof of Lemma 1

Proof

Define the advantage at stage j as

μ_{j} (H_{j}, A_{j}) = Q_{j} (H_{j}, A_{j}) - max_{a_{j}} Q_{j} (H_{j}, a_{j}), j = 1, 2.

Note that μ_j(H_j, A_j) represents the expected difference in outcome when using A_j instead of the optimal treatment at stage j, for subjects with treatment and covariate history H_j who receive the optimal DTR at stages subsequent to j. According to Robins¹¹ (p. 201), this is simply the blip function with arg max_{a_j} Q_j(H_j, a_j) as the reference treatment. Below we will establish the connection between Q-learning and Robins’ method using the advantage function; one can derive the connection using other blip functions (other choices of reference treatment) following similar steps. When Q-functions are modeled as in (2), the advantages become

μ_{j} (H_{j}, A_{j}; ψ_{j}) = ψ_{j}^{T} H_{j 1} A_{j} - ∣ ψ_{j}^{T} H_{j 1} ∣, j = 1, 2.

(12)

Since by condition (i), no parameters are shared across stages, we will proceed stage by stage, starting with stage 2, doing recursive (rather than simultaneous) estimation. The notation ℙ_n will be used below to denote the empirical average over a sample of size n. Also, define m₁(H₁) = E[Q₁(H₁, A₁)|H₁] and m₂(H₂) = E[Q₂(H₂, A₂)|H₂].

Stage 2

At stage 2, Q-learning is a usual least squares regression problem. Thus the estimating equations are given by

P_{n} [(\begin{matrix} H_{20} \\ H_{21} A_{2} \end{matrix}) (Y_{2} - H_{20}^{T} β_{2} - H_{21}^{T} A_{2} ψ_{2})] = 0.

(13)

From (13), it follows that

{\hat{β}}_{2} = {(P_{n} (H_{20} H_{20}^{T}))}^{- 1} [P_{n} (H_{20} Y_{2}) - P_{n} (H_{20} H_{21}^{T} A_{2}) {\hat{ψ}}_{2}]

(14)

where ψ̂₂ is the estimate of ψ₂ satisfying (13). Thus ψ̂₂ satisfies the estimating equation

P_{n} [(H_{21} A_{2}) (Y_{2} - H_{20}^{T} {\hat{β}}_{2} - H_{21}^{T} A_{2} {\hat{ψ}}_{2})] = 0.

On the other hand, the stage 2 estimating equation for Robins’ method (Robins¹¹, p. 211) is given by

P_{n} [(H_{21} A_{2} - E [H_{21} A_{2} ∣ H_{2}]) (Y_{2} - μ_{2} (H_{2}, A_{2}; ψ_{2}) - E [Y_{2} - μ_{2} (H_{2}, A_{2}; ψ_{2}) ∣ H_{2}])] = 0,

(15)

where V ar(Y₂−μ₂(H₂, A₂; ψ₂) − E[Y₂ − μ₂(H₂, A₂; ψ₂)|H₂]|H₂, A₂) is omitted (This is one of the reasons why Q-learning is an inefficient version). Note that E[H₂₁A₂|H₂] = 0, by condition (ii) of the lemma. From (12), $μ_{2} (H_{2}, A_{2}; ψ_{2}) = ψ_{2}^{T} H_{21} A_{2} - ∣ ψ_{2}^{T} H_{21} ∣$ . Then $E [μ_{2} (H_{2}, A_{2}; ψ_{2}) ∣ H_{2}] = - ∣ ψ_{2}^{T} H_{21} ∣$ , again by condition (ii). Also,

E [Y_{2} ∣ H_{2}] = E [E [Y_{2} ∣ H_{2}, A_{2}] ∣ H_{2}] = E [Q_{2} (H_{2}, A_{2}) ∣ H_{2}] = m_{2} (H_{2}) .

Therefore, $Y_{2} - μ_{2} (H_{2}, A_{2}; ψ_{2}) - E [Y_{2} - μ_{2} (H_{2}, A_{2}; ψ_{2}) ∣ H_{2}] = Y_{2} - m_{2} (H_{2}) - H_{21}^{T} A_{2} ψ_{2}$ . Thus, ψ̂₂ in Robins’ method solves the following reduced version of (15):

P_{n} [(H_{21} A_{2}) (Y_{2} - m_{2} (H_{2}) - H_{21}^{T} A_{2} {\hat{ψ}}_{2})] = 0,

for any choice of m₂(H₂) (with the conditional variance omitted). In particular, for $m_{2} (H_{2}) = H_{20}^{T} {\hat{β}}_{2}$ , where β̂₂ is given by (14), this estimating equation exactly matches with that of Q-learning.

Stage 1

For Q-learning, the stage 1 pseudo-outcome is

{\hat{Y}}_{1} = Y_{1} + max_{a_{2}} Q_{2} (H_{2}, A_{2}) = Y_{1} + H_{20}^{T} {\hat{β}}_{2} + ∣ {\hat{ψ}}_{2}^{T} H_{21} ∣,

and so the estimating equations are given by

P_{n} [(\begin{matrix} H_{10} \\ H_{11} A_{1} \end{matrix}) (Y_{1} + H_{20}^{T} {\hat{β}}_{2} + ∣ {\hat{ψ}}_{2}^{T} H_{21} ∣ - H_{10}^{T} β_{1} - H_{11}^{T} A_{1} ψ_{1})] = 0.

(16)

Now from (13)

P_{n} [H_{20} (Y_{2} - H_{20}^{T} {\hat{β}}_{2} - H_{21}^{T} A_{2} {\hat{ψ}}_{2})] = 0.

(17)

Since by condition (iii) of the lemma, $(H_{10}^{T}, H_{11}^{T} A_{1}) \subset H_{20}^{T}$ , it follows that

\begin{array}{r} P_{n} [(\begin{matrix} H_{10} \\ H_{11} A_{1} \end{matrix}) (Y_{2} - H_{20}^{T} {\hat{β}}_{2} - H_{21}^{T} A_{2} {\hat{ψ}}_{2})] = 0, \\ or, P_{n} [(\begin{matrix} H_{10} \\ H_{11} A_{1} \end{matrix}) (H_{20}^{T} {\hat{β}}_{2})] = P_{n} [(\begin{matrix} H_{10} \\ H_{11} A_{1} \end{matrix}) (Y_{2} - H_{21}^{T} A_{2} {\hat{ψ}}_{2})] . \end{array}

(18)

Using (18) in (16), we get

P_{n} [(\begin{matrix} H_{10} \\ H_{11} A_{1} \end{matrix}) (Y_{1} + Y_{2} - H_{21}^{T} A_{2} {\hat{ψ}}_{2} + ∣ {\hat{ψ}}_{2}^{T} H_{21} ∣ - H_{10}^{T} β_{1} - H_{11}^{T} A_{1} ψ_{1})] = 0.

(19)

Solving for β₁ gives,

{\hat{β}}_{1} = {(P_{n} (H_{10} H_{10}^{T}))}^{- 1} [P_{n} (H_{10} (Y_{1} + Y_{2} - H_{21}^{T} A_{2} {\hat{ψ}}_{2} + ∣ {\hat{ψ}}_{2}^{T} H_{21} ∣)) - P_{n} (H_{10} H_{11}^{T} A_{1}) {\hat{ψ}}_{1}] .

(20)

Thus ψ̂₁ satisfies

P_{n} [(H_{11} A_{1}) (Y_{1} + Y_{2} - H_{21}^{T} A_{2} {\hat{ψ}}_{2} + ∣ {\hat{ψ}}_{2}^{T} H_{21} ∣ - H_{10}^{T} {\hat{β}}_{1} - H_{11}^{T} A_{1} {\hat{ψ}}_{1})] = 0.

On the other hand for Robins’ method, the stage 1 pseudo-outcome (Robins¹¹, p. 208; see also Moodie and Richardson¹⁷) is Ỹ₁ = Y₁ + Y₂ − μ₂(H₂, A₂), and so the stage 1 estimating equation (Robins¹¹, p. 211) is given by

P_{n} [(H_{11} A_{1} - E [H_{11} A_{1} ∣ H_{1}]) ({\tilde{Y}}_{1} - μ_{1} (H_{1}, A_{1}; ψ_{1}) - E [{\tilde{Y}}_{1} - μ_{1} (H_{1}, A_{1}; ψ_{1}) ∣ H_{1}])] = 0,

(21)

where again the conditional variance V ar(Ỹ₁−μ₁(H_1, A₁; ψ₁) −E[Ỹ₁ − μ₁(H₁, A₁; ψ₁)|H₁]|H₁, A₁ is omitted. Note that E[H₁₁A₁|H₁] = 0, by condition (ii) of the lemma. From (12), $μ_{1} (H_{1}, A_{1}; ψ_{1}) = ψ_{1}^{T} H_{11} A_{1} - ∣ ψ_{1}^{T} H_{11} ∣$ . Then $E [μ_{1} (H_{1}, A_{1}; ψ_{1}) ∣ H_{1}] = - ∣ ψ_{1}^{T} H_{11} ∣$ , again by condition (ii). Also,

\begin{array}{l} E [{\tilde{Y}}_{1} ∣ H_{1}] = E [Y_{1} + Y_{2} - μ_{2} (H_{2}, A_{2}) ∣ H_{1}] \\ = E [Y_{2} - Q_{2} (H_{2}, A_{2}) + Y_{1} + max_{a_{2}} Q_{2} (H_{2}, a_{2}) ∣ H_{1}] \\ = E [E [Y - Q_{2} (H_{2}, A_{2}) ∣ H_{2}, A_{2}] ∣ H_{1}] + E [Y_{1} + max_{a_{2}} Q_{2} (H_{2}, a_{2}) ∣ H_{1}] \\ = 0 + E [E [Y_{1} + max_{a_{2}} Q_{2} (H_{2}, a_{2}) ∣ H_{1}, A_{1}] ∣ H_{1}] \\ = E [Q_{1} (H_{1}, A_{1}) ∣ H_{1}] \\ = m_{1} (H_{1}) . \end{array}

Finally, plug in Y₁ + Y₂ − μ₂(H₂, A₂; ψ̂₂) for Ỹ₁. Thus, ψ̂₁ in Robins’ method solves the following reduced version of (21):

P_{n} [(H_{11} A_{1}) (Y_{1} + Y_{2} - H_{21}^{T} A_{2} {\hat{ψ}}_{2} + ∣ {\hat{ψ}}_{2}^{T} H_{21} ∣ - m_{1} (H_{1}) - H_{11}^{T} A_{1} {\hat{ψ}}_{1})] = 0.

for any choice of m₁(H₁) (again omitting the conditional variance). In particular, for $m_{1} (H_{1}) = H_{10}^{T} {\hat{β}}_{1}$ , where β̂₁ is given by (20), this estimating equation exactly matches with that of Q-learning.

In summary, the Q-learning algorithm as presented here is inefficient because: (a) it sets the conditional variances to be constant over (H_j, A_j), and (b) uses H_j₁A_j instead of the “efficient choice” of the term S_eff_;j (that attains semiparametric variance bound) in Robins’ estimating equation (see Robins¹¹, p. 212; more details in Robins³⁸).

Appendix B: Proof of Lemma 2

Proof

To estimate the hyper-parameter ϕ², first integrate out μ to get the marginal likelihood X|ϕ² ~ N(0, ϕ²+σ²). The corresponding Jeffrey’s prior on the variance parameter is p(ϕ²) ∝1=(ϕ²+σ²). Based on this formulation, the posterior distribution of ϕ² is given by

p (ϕ^{2} ∣ X) \propto {(ϕ^{2} + σ^{2})}^{- 3 / 2} exp {- \frac{X^{2}}{2 (ϕ^{2} + σ^{2})}} .

Hence the posterior mode of ϕ² is

\hat{ϕ^{2}} = arg max_{ϕ^{2} \geq 0} p (ϕ^{2} ∣ X) = {(\frac{X^{2}}{3} - σ^{2})}^{+} .

(22)

Given $ϕ^{2} = \hat{ϕ^{2}}$ , now we will consider the data likelihood X|μ ~ N(μ, σ²) along with the prior μ| ϕ² ~ N(0, ϕ²) to derive an empirical Bayes estimator for |μ|. It is easy to show that the posterior distribution of μ is given by

μ ∣ X \sim N (\frac{X \hat{ϕ^{2}}}{\hat{ϕ^{2}} + σ^{2}}, \frac{σ^{2} \hat{ϕ^{2}}}{\hat{ϕ^{2}} + σ^{2}}) .

(23)

Now under the squared error loss, the Bayes estimator of |μ| is E_μ_|_X(|μ|) which can be calculated using (23). If Y ~ N(θ, τ²), then E|Y| is given by:

E ∣ Y ∣ = θ (2 Φ (θ / τ) - 1) + \sqrt{\frac{2}{π}} τ e^{- θ^{2} / 2 τ^{2}} .

(24)

In the present problem,

\begin{array}{r} Y = μ ∣ X, θ = \frac{X \hat{ϕ^{2}}}{\hat{ϕ^{2}} + σ^{2}}, τ^{2} = \frac{σ^{2} \hat{ϕ^{2}}}{\hat{ϕ^{2}} + σ^{2}} . \\ Hence, \frac{θ}{τ} = \frac{X}{σ} \sqrt{\frac{\hat{ϕ^{2}}}{\hat{ϕ^{2}} + σ^{2}}}, \frac{θ^{2}}{2 τ^{2}} = \frac{X^{2}}{2 σ^{2}} (\frac{\hat{ϕ^{2}}}{\hat{ϕ^{2}} + σ^{2}}) . \end{array}

From (22), we get

\begin{array}{l} \frac{\hat{ϕ^{2}}}{\hat{ϕ^{2}} + σ^{2}} = \frac{{(X^{2} - 3 σ^{2})}^{+}}{X^{2}} = {(1 - \frac{3 σ^{2}}{X^{2}})}^{+}, \\ θ = X {(1 - \frac{3 σ^{2}}{X^{2}})}^{+}, τ^{2} = σ^{2} {(1 - \frac{3 σ^{2}}{X^{2}})}^{+}, \\ \frac{θ}{τ} = \frac{X}{σ} \sqrt{{(1 - \frac{3 σ^{2}}{X^{2}})}^{+}}, \frac{θ^{2}}{2 τ^{2}} = \frac{X^{2}}{2 σ^{2}} {(1 - \frac{3 σ^{2}}{X^{2}})}^{+} . \end{array}

Thus an empirical Bayes estimator of |μ| is given by

\begin{array}{l} {\hat{∣ μ ∣}}^{E B} = X {(1 - \frac{3 σ^{2}}{X^{2}})}^{+} (2 Φ (\frac{X}{σ} \sqrt{{(1 - \frac{3 σ^{2}}{X^{2}})}^{+}}) - 1) \\ + \sqrt{\frac{2}{π}} σ \sqrt{{(1 - \frac{3 σ^{2}}{X^{2}})}^{+}} exp {- \frac{X^{2}}{2 σ^{2}} {(1 - \frac{3 σ^{2}}{X^{2}})}^{+}} . \end{array}

(25)

References

1.Collins LM, Murphy SA, Bierman K. A conceptual framework for adaptive preventive interventions. Prevention Science. 2004;5:185–196. doi: 10.1023/b:prev.0000037641.26017.00. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Lavori PW. A design for testing clinical strategies: biased adaptive within-subject randomization. Journal of the Royal Statistical Society, Series A. 2000;163:29–38. [Google Scholar]
3.Lavori PW, Dawson R. Dynamic treatment regimes: practical design considerations. Clinical Trials. 2004;1:9–20. doi: 10.1191/1740774s04cn002oa. [DOI] [PubMed] [Google Scholar]
4.Dawson R, Lavori PW. Placebo-free designs for evaluating new mental health treatments: the use of adaptive treatment strategies. Statistics in Medicine. 2004;23:3249–3262. doi: 10.1002/sim.1920. [DOI] [PubMed] [Google Scholar]
5.Murphy SA. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine. 2005;24:1455–1481. doi: 10.1002/sim.2022. [DOI] [PubMed] [Google Scholar]
6.Schneider LS, Tariot PN, Lyketsos CG, Dagerman KS, Davis KL, Davis S. National Institute of Mental Health clinical antipsychotic trials of intervention effectiveness (CATIE): alzheimer disease trial methodology. American Journal of Geriatric Psychology. 2001;9:346–360. [PubMed] [Google Scholar]
7.Rush AJ, Fava M, Wisniewski SR, Lavori PW, Trivedi MH, Sackeim HA. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled Clinical Trials. 2003;25:119–142. doi: 10.1016/s0197-2456(03)00112-0. [DOI] [PubMed] [Google Scholar]
8.Thall PF, Millikan RE, Sung HG. Evaluating multiple treatment courses in clinical trials. Statistics in Medicine. 2000;30:1011–1128. doi: 10.1002/(sici)1097-0258(20000430)19:8<1011::aid-sim414>3.0.co;2-m. [DOI] [PubMed] [Google Scholar]
9.Wahed AS, Tsiatis AA. Optimal estimator for the survival distribution and related quantities for treatment policies in two-stage randomized designs in clinical trials. Biometrics. 2004;60:124–133. doi: 10.1111/j.0006-341X.2004.00160.x. [DOI] [PubMed] [Google Scholar]
10.Murphy SA. Optimal dynamic treatment regimes (with discussions) Journal of the Royal Statistical Society, Series B. 2003;65:331–366. [Google Scholar]
11.Robins JM. Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty P, editors. Proceedings of the Second Seattle Symposium on Biostatistics. New York: Springer; 2004. pp. 189–326. [Google Scholar]
12.Moodie EEM, Richardson TS, Stephens DA. Demystifying optimal dynamic treatment regimes. Biometrics. 2007;63:447–455. doi: 10.1111/j.1541-0420.2006.00686.x. [DOI] [PubMed] [Google Scholar]
13.Thall PF, Sung HG, Estey EH. Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. Journal of the American Statistical Association. 2002;97(457):29–39. [Google Scholar]
14.Thall PF, Wooten LH, Logothetis CJ, Millikan RE, Tannir NM. Bayesian and frequentist two-stage treatment strategies based on sequential failure times subject to interval censoring. Statistics in Medicine. 2007;26:4687–4702. doi: 10.1002/sim.2894. [DOI] [PubMed] [Google Scholar]
15.Lunceford JK, Davidian M, Tsiatis AA. Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials. Biometrics. 2002;58:48–57. doi: 10.1111/j.0006-341x.2002.00048.x. [DOI] [PubMed] [Google Scholar]
16.Wahed AS, Tsiatis AA. Semiparametric efficient estimation of survival distributions in two-stage randomisation designs in clinical trials with censored data. Biometrika. 2006;93(1):163–177. [Google Scholar]
17.Moodie EEM, Richardson TS. Estimating optimal dynamic regimes: correcting bias under the null. Scandinavian Journal of Statistics. 2008 doi: 10.1111/j.1467-9469.2009.00661.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Watkins CJCH. PhD Thesis. Cambridge University; 1989. Learning from delayed rewards. [Google Scholar]
19.Sutton RS, Barto AG. Reinforcement learning: An introduction. Cambridge: MIT Press; 1998. [Google Scholar]
20.Murphy SA. A generalization error for Q-learning. Journal of Machine Learning Research. 2005;6:1073–1097. [PMC free article] [PubMed] [Google Scholar]
21.Bickel P, Klaassen C, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. Johns Hopkins University Press; 1993. [Google Scholar]
22.Shao J. Bootstrap sample size in nonregular cases. Proceedings of the American Mathematical Society. 1994;122(4):1251–1262. [Google Scholar]
23.Andrews DWK. Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space. Econometrica. 2000;68(2):399–405. [Google Scholar]
24.Bickel P, Sakov A. On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica. 2008;18(3):967–985. [Google Scholar]
25.Hall P, Horowitz J, Jing B. On blocking rules for the bootstrap with dependent data. Biometrika. 1995;82:561–574. [Google Scholar]
26.Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81(3):425–455. [Google Scholar]
27.Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37(4):373–384. [Google Scholar]
28.Gao H. Wavelet shrinkage denoising using the nonnegative garrote. Journal of Computational and Graphical Statistics. 1998;7:469–488. [Google Scholar]
29.Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101(476):1418–1429. [Google Scholar]
30.Figueiredo M, Nowak R. Wavelet-based image estimation: an empirical Bayes approach using Jeffreys’ noninformative prior. IEEE Transactions on Image Processing. 2001;10(9):1322–1331. doi: 10.1109/83.941856. [DOI] [PubMed] [Google Scholar]
31.Cohen J. Statistical power for the behavioral sciences. 2. Hillsdale, NJ: Erlbaum; 1988. [Google Scholar]
32.Freedman B. Equipoise and the ethics of clinical research. The New England Journal of Medicine. 1987;317(3):141–145. doi: 10.1056/NEJM198707163170304. [DOI] [PubMed] [Google Scholar]
33.Wu CFJ, Hamada M. Experiments: planning, analysis, and parameter design optimization. New York: Wiley; 2000. [Google Scholar]
34.Davison AC, Hinkley DV. Bootstrap methods and their application. Cambridge, UK: Cambridge University Press; 1997. [Google Scholar]
35.Nankervis JC. Computational algorithms for double bootstrap confidence intervals. Computational Statistics & Data Analysis. 2005;49:461–475. [Google Scholar]
36.Strecher V, McClure J, Alexander G, Chakraborty B, Nair V, et al. Web-based smoking cessation components and tailoring depth: results of a randomized trial. American Journal of Preventive Medicine. 2008;34(5):373–381. doi: 10.1016/j.amepre.2007.12.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Johnstone IM, Silverman BW. Needles and straw in haystackes: empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics. 2004;32(4):1594–1649. [Google Scholar]
38.Robins JM. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics. 1994;23:2379–2412. [Google Scholar]

[R1] 1.Collins LM, Murphy SA, Bierman K. A conceptual framework for adaptive preventive interventions. Prevention Science. 2004;5:185–196. doi: 10.1023/b:prev.0000037641.26017.00. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.Lavori PW. A design for testing clinical strategies: biased adaptive within-subject randomization. Journal of the Royal Statistical Society, Series A. 2000;163:29–38. [Google Scholar]

[R3] 3.Lavori PW, Dawson R. Dynamic treatment regimes: practical design considerations. Clinical Trials. 2004;1:9–20. doi: 10.1191/1740774s04cn002oa. [DOI] [PubMed] [Google Scholar]

[R4] 4.Dawson R, Lavori PW. Placebo-free designs for evaluating new mental health treatments: the use of adaptive treatment strategies. Statistics in Medicine. 2004;23:3249–3262. doi: 10.1002/sim.1920. [DOI] [PubMed] [Google Scholar]

[R5] 5.Murphy SA. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine. 2005;24:1455–1481. doi: 10.1002/sim.2022. [DOI] [PubMed] [Google Scholar]

[R6] 6.Schneider LS, Tariot PN, Lyketsos CG, Dagerman KS, Davis KL, Davis S. National Institute of Mental Health clinical antipsychotic trials of intervention effectiveness (CATIE): alzheimer disease trial methodology. American Journal of Geriatric Psychology. 2001;9:346–360. [PubMed] [Google Scholar]

[R7] 7.Rush AJ, Fava M, Wisniewski SR, Lavori PW, Trivedi MH, Sackeim HA. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled Clinical Trials. 2003;25:119–142. doi: 10.1016/s0197-2456(03)00112-0. [DOI] [PubMed] [Google Scholar]

[R8] 8.Thall PF, Millikan RE, Sung HG. Evaluating multiple treatment courses in clinical trials. Statistics in Medicine. 2000;30:1011–1128. doi: 10.1002/(sici)1097-0258(20000430)19:8<1011::aid-sim414>3.0.co;2-m. [DOI] [PubMed] [Google Scholar]

[R9] 9.Wahed AS, Tsiatis AA. Optimal estimator for the survival distribution and related quantities for treatment policies in two-stage randomized designs in clinical trials. Biometrics. 2004;60:124–133. doi: 10.1111/j.0006-341X.2004.00160.x. [DOI] [PubMed] [Google Scholar]

[R10] 10.Murphy SA. Optimal dynamic treatment regimes (with discussions) Journal of the Royal Statistical Society, Series B. 2003;65:331–366. [Google Scholar]

[R11] 11.Robins JM. Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty P, editors. Proceedings of the Second Seattle Symposium on Biostatistics. New York: Springer; 2004. pp. 189–326. [Google Scholar]

[R12] 12.Moodie EEM, Richardson TS, Stephens DA. Demystifying optimal dynamic treatment regimes. Biometrics. 2007;63:447–455. doi: 10.1111/j.1541-0420.2006.00686.x. [DOI] [PubMed] [Google Scholar]

[R13] 13.Thall PF, Sung HG, Estey EH. Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. Journal of the American Statistical Association. 2002;97(457):29–39. [Google Scholar]

[R14] 14.Thall PF, Wooten LH, Logothetis CJ, Millikan RE, Tannir NM. Bayesian and frequentist two-stage treatment strategies based on sequential failure times subject to interval censoring. Statistics in Medicine. 2007;26:4687–4702. doi: 10.1002/sim.2894. [DOI] [PubMed] [Google Scholar]

[R15] 15.Lunceford JK, Davidian M, Tsiatis AA. Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials. Biometrics. 2002;58:48–57. doi: 10.1111/j.0006-341x.2002.00048.x. [DOI] [PubMed] [Google Scholar]

[R16] 16.Wahed AS, Tsiatis AA. Semiparametric efficient estimation of survival distributions in two-stage randomisation designs in clinical trials with censored data. Biometrika. 2006;93(1):163–177. [Google Scholar]

[R17] 17.Moodie EEM, Richardson TS. Estimating optimal dynamic regimes: correcting bias under the null. Scandinavian Journal of Statistics. 2008 doi: 10.1111/j.1467-9469.2009.00661.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.Watkins CJCH. PhD Thesis. Cambridge University; 1989. Learning from delayed rewards. [Google Scholar]

[R19] 19.Sutton RS, Barto AG. Reinforcement learning: An introduction. Cambridge: MIT Press; 1998. [Google Scholar]

[R20] 20.Murphy SA. A generalization error for Q-learning. Journal of Machine Learning Research. 2005;6:1073–1097. [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Bickel P, Klaassen C, Ritov Y, Wellner J. Efficient and adaptive estimation for semiparametric models. Johns Hopkins University Press; 1993. [Google Scholar]

[R22] 22.Shao J. Bootstrap sample size in nonregular cases. Proceedings of the American Mathematical Society. 1994;122(4):1251–1262. [Google Scholar]

[R23] 23.Andrews DWK. Inconsistency of the bootstrap when a parameter is on the boundary of the parameter space. Econometrica. 2000;68(2):399–405. [Google Scholar]

[R24] 24.Bickel P, Sakov A. On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica. 2008;18(3):967–985. [Google Scholar]

[R25] 25.Hall P, Horowitz J, Jing B. On blocking rules for the bootstrap with dependent data. Biometrika. 1995;82:561–574. [Google Scholar]

[R26] 26.Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81(3):425–455. [Google Scholar]

[R27] 27.Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37(4):373–384. [Google Scholar]

[R28] 28.Gao H. Wavelet shrinkage denoising using the nonnegative garrote. Journal of Computational and Graphical Statistics. 1998;7:469–488. [Google Scholar]

[R29] 29.Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101(476):1418–1429. [Google Scholar]

[R30] 30.Figueiredo M, Nowak R. Wavelet-based image estimation: an empirical Bayes approach using Jeffreys’ noninformative prior. IEEE Transactions on Image Processing. 2001;10(9):1322–1331. doi: 10.1109/83.941856. [DOI] [PubMed] [Google Scholar]

[R31] 31.Cohen J. Statistical power for the behavioral sciences. 2. Hillsdale, NJ: Erlbaum; 1988. [Google Scholar]

[R32] 32.Freedman B. Equipoise and the ethics of clinical research. The New England Journal of Medicine. 1987;317(3):141–145. doi: 10.1056/NEJM198707163170304. [DOI] [PubMed] [Google Scholar]

[R33] 33.Wu CFJ, Hamada M. Experiments: planning, analysis, and parameter design optimization. New York: Wiley; 2000. [Google Scholar]

[R34] 34.Davison AC, Hinkley DV. Bootstrap methods and their application. Cambridge, UK: Cambridge University Press; 1997. [Google Scholar]

[R35] 35.Nankervis JC. Computational algorithms for double bootstrap confidence intervals. Computational Statistics & Data Analysis. 2005;49:461–475. [Google Scholar]

[R36] 36.Strecher V, McClure J, Alexander G, Chakraborty B, Nair V, et al. Web-based smoking cessation components and tailoring depth: results of a randomized trial. American Journal of Preventive Medicine. 2008;34(5):373–381. doi: 10.1016/j.amepre.2007.12.024. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] 37.Johnstone IM, Silverman BW. Needles and straw in haystackes: empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics. 2004;32(4):1594–1649. [Google Scholar]

[R38] 38.Robins JM. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics. 1994;23:2379–2412. [Google Scholar]

PERMALINK

Inference for Nonregular Parameters in Optimal Dynamic Treatment Regimes

Bibhas Chakraborty

Victor Strecher

Susan Murphy

Abstract

1 Introduction

2 Estimation and Inference via Q-learning

2.1 Notation and Data Structure

2.2 Q-learning with Linear Models

Lemma 1

2.3 The Inference Problem

2.4 Nonregularity in Inference

3 Different Regularized Estimators

3.1 Hard-threshold Estimator

3.2 Soft-threshold or Shrinkage Estimator

Figure 1.

3.2.1 Choice of Tuning Parameter

Lemma 2

4 Simulation Study

Generative Model

Analysis Model

Two dimensions of nonregularity: p and ϕ

Table 1.

Example 1 (p = 1, ϕ undefined)

Example 2 (p = 0, ϕ infinite)

Example 3 ( p=12, ϕ = 1)

Example 4 (p = 0, ϕ = 1.0204)

Example 5 ( p=14, ϕ = 1.4142)

Example 6 (p = 0, ϕ = 0.3451)

Competing Estimators

Different Bootstrap CIs

Table 2.

Table 3.

4.1 Results

5 Analysis of Smoking Cessation Data

Table 4.

Figure 2.

6 Discussion

Acknowledgments

Appendix A: Proof of Lemma 1

Proof

Stage 2

Stage 1

Appendix B: Proof of Lemma 2

Proof

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Example 3 ( $p = \frac{1}{2}$ , ϕ = 1)

Example 5 ( $p = \frac{1}{4}$ , ϕ = 1.4142)