Abstract
Precision medicine is an emerging medical paradigm that focuses on finding the most effective treatment strategy tailored to individual patients. Most existing work in the literature has focused on estimating the optimal treatment regime; far less attention has been devoted to hypothesis testing regarding the optimal treatment regime. In this paper, we first introduce the notion of conditional qualitative treatment effects (CQTE) of a set of variables given another set of variables and provide a class of equivalent representations for the null hypothesis of no CQTE. The proposed definition of CQTE does not assume any parametric form for the optimal treatment rule and plays an important role in assessing the incremental value of a set of new variables in optimal treatment decision making, conditional on an existing set of prescriptive variables. We then propose novel testing procedures for no CQTE based on kernel estimation of the conditional contrast functions. We show that our test statistics have asymptotically correct size and non-negligible power against some nonstandard local alternatives. The empirical performance of the proposed tests is evaluated by simulations and an application to an AIDS data set.
Keywords: Conditional qualitative treatment effects, Kernel estimation, Nonstandard local alternatives, Optimal treatment decision making
1. Introduction.
Precision medicine is an emerging medical paradigm for finding the best treatment for individual patients by taking their characteristics into consideration. The goal is to find the optimal treatment regime that yields the most favorable clinical outcome of interest on average. A number of methods have been developed for estimating the optimal treatment regime as a function of prognostic covariates, including Q-learning (Watkins and Dayan, 1992; Chakraborty, Murphy and Strecher, 2010), A-learning (Robins, Hernan and Brumback, 2000; Murphy, 2003), direct value optimization methods (Zhang et al., 2012, 2013) and outcome-weighted learning (Zhao et al., 2012, 2015). Qian and Murphy (2011) proposed to estimate the optimal treatment regime based on the estimated mean outcome model with the lasso penalty. Zhang et al. (2015) and Zhang et al. (2016) proposed to construct interpretable optimal treatment regimes via decision lists.
However, there has been less attention devoted to hypothesis testing regarding the optimal treatment regime. Chang, Lee and Whang (2015) and Hsu (2017) considered testing whether the conditional treatment effects given a set of covariates are always nonpositive. This amounts to testing the overall qualitative treatment effects of the covariates. When the null hypothesis holds, the optimal treatment regime will recommend the control treatment to all patients regardless of their prognostic covariates. Such type of null hypothesis is closely connected to testing conditional moment inequalities, see for example Andrews and Shi (2013, 2014); Chernozhukov, Lee and Rosen (2013); Armstrong and Chan (2016) and the references therein.
In this work, we develop a novel testing procedure for conditional qualitative treatment effects (CQTE) of a set of variables given another set of variables. The contributions of this paper are threefold. First, we mathematically formalize the notion of CQTE without assuming any parametric form of the treatment-covariates interactions and systematically characterize several equivalent representations of no CQTE. Informally speaking, a variable is said to have no qualitative treatment effects conditional on other variables if including it in treatment decision making cannot lead to a treatment regime that increases the value function. This naturally generalizes the definition of a qualitative interaction between a single covariate and treatment given in Gunter, Zhu and Murphy (2011) and the definition of overall qualitative treatment effects.
Our second contribution is to propose robust test statistics based on a kernel estimator for the conditional treatment effects for testing the existence of CQTE, which do not require the specification of the outcome model and the parametric form of treatment decision rules. To the best of our knowledge, this is the first time that such hypothesis testing problems are formally studied. Gunter, Zhu and Murphy (2011) proposed the S-score to quantify the magnitude of the qualitative interaction between a single covariate and treatment. However, no theoretical justifications were provided for the proposed S-score.
Compared with the global tests in Chang, Lee and Whang (2015) and Hsu (2017), our proposed tests for the CQTE can offer a new and important tool for assessing the incremental value of a set of new variables in optimal treatment decision making conditional on an existing set of qualitative covariates. Take the AIDS Clinical Trials Group Protocol 175 (ACTG175) study as an example. Many works in the literature have found that the age variable has significant qualitative interaction with the treatment (Lu, Zhang and Zeng, 2013; Fan et al., 2017). It is therefore of great importance to explore the CQTE of a new variable or a set of new variables given the age variable. The proposed tests can also help to construct the optimal treatment regime. When the null hypothesis of no CQTE is rejected, we conclude that including the new variables in treatment decision can increase the value function. Therefore, it is more desirable to construct the optimal treatment regime based on both the new and existing sets of variables.
Using the Poissonization technique (Giné, Mason and Zaitsev, 2003), we show that our test statistic has correct size under the null and non-negligible power against some nonstandard local alternatives. To deal with data from observational studies, we further introduce a doubly robust test statistic that is consistent when either the propensity score model or the conditional mean models for the response are correctly specified.
Thirdly, the proposed test can help to discover new variables with CQTE. Specifically, we develop a procedure for selecting qualitative variables in a sequential order based on the p-values of the proposed CQTE test. For simplicity, we only consider forward selection in this paper. Backward or stepwise selection procedures can be similarly developed.
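As a sketch, the forward selection step could look like the following, where `cqte_pvalue` is a hypothetical stand-in for a routine returning the p-value of the proposed CQTE test of a candidate set given the currently selected set (all names here are illustrative, not from the paper):

```python
def forward_select_qualitative(variables, cqte_pvalue, alpha=0.05):
    """Greedy forward selection of prescriptive variables.

    cqte_pvalue(C, B) is a hypothetical callable returning the p-value
    of the CQTE test of the candidate set C given the current set B.
    """
    selected, remaining = [], list(variables)
    while remaining:
        # p-value of each remaining variable conditional on the current set
        pvals = {j: cqte_pvalue([j], selected) for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:
            break  # no remaining variable shows a significant CQTE
        selected.append(best)
        remaining.remove(best)
    return selected
```

The loop stops as soon as no remaining variable rejects the null of no CQTE given the variables already selected, which mirrors the sequential use of the test described above.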
The rest of the paper is organized as follows. We present the definition of CQTE and a class of equivalent representations for the null hypothesis of no CQTE in Section 2. Our proposed test statistic and its asymptotic properties under the null, fixed alternatives and nonstandard local alternatives are given in Section 3. In Section 4, we extend our testing procedure to the case where the propensity score is unknown and needs to be estimated from data, and introduce a doubly robust version of the test statistic. Some implementation issues are discussed in Section 5. Simulation studies are conducted to evaluate the empirical performance of the proposed test in Section 6, followed by an application to an AIDS clinical trial data set in Section 7, where variables with qualitative treatment effects are selected in a forward selection procedure based on the proposed test. A discussion is given in Section 8 and all technical proofs are given in the Supplementary Appendix.
2. Conditional qualitative treatment effects.
2.1. Optimal treatment regime.
For simplicity we focus on a single-stage study with two treatment options. Assume data are summarized as Oi = (Xi, Ai, Yi), i = 1,…,n, where, for subject i, Xi denotes the baseline covariates, Ai ∈ {0, 1} denotes the treatment received, and Yi denotes the patient’s response of interest. Here, a larger value of Yi represents a better clinical outcome. We assume the Oi’s are i.i.d. copies of the triplet O = (X, A, Y).
Consider the following semiparametric model for Y:

(2.1) Y = h0(X) + A τ0(X) + e,

where E(e|X, A) = 0, h0(·) is the baseline effect function, and τ0(x) = E(Y|X = x, A = 1) − E(Y|X = x, A = 0) is referred to as the contrast function.
The optimal treatment regime is defined in the potential outcome framework. Let Y*(0) and Y*(1) be the potential outcomes that might be observed under treatments 0 and 1, respectively. A treatment regime d(x) is a map from the support of X to {0, 1}. For a given regime d, consider the potential outcome Y*(d) = Y*(1)d(X) + Y*(0){1 − d(X)}.
The optimal treatment regime is the map that maximizes the expected potential outcome, named the value function: dopt = arg max_d V(d), where V(d) = E{Y*(d)}.
As in Rubin (1974), we assume the following two assumptions hold: (i) the stable unit treatment value assumption (SUTVA), Y = Y*(1)A + Y*(0)(1 − A); and (ii) the no unmeasured confounders assumption, (Y*(0), Y*(1)) are independent of A given X. Under model (2.1), we have τ0(x) = E{Y*(1) − Y*(0)|X = x}, and dopt(x) = I{τ0(x) > 0},
where I(·) denotes the indicator function.
2.2. Conditional qualitative treatment effects.
In treatment decision making, Gunter, Zhu and Murphy (2011) made a distinction between predictive and prescriptive variables. In particular, the prescriptive variables have qualitative interactions with treatment, which are important for treatment prescription. They gave a formal definition of the qualitative interaction between a single covariate and treatment. We first extend this definition by introducing the notion of conditional qualitative treatment effects (CQTE). Let B and C be two disjoint subsets of I ≡ {1,2,…,p}. Denote by pB and pC the numbers of elements in B and C, respectively. For any D ⊆ I, we use XD to denote the sub-vector of X formed by the elements indexed in D. When D is a single-element set, i.e., D = {j0} for some j0 ∈ I, we write XD as X(j0). Moreover, we use |D| to denote the cardinality of D.
Definition 2.1 (CQTE).
Variables in C have qualitative treatment effects conditional on variables in B if there exist some nonempty sets B0, C1 and C2 such that (i) Pr(XB ∈ B0) > 0, Pr(XC ∈ C1) > 0 and Pr(XC ∈ C2) > 0; and (ii) for any xB ∈ B0, xC1 ∈ C1 and xC2 ∈ C2, we have

(2.2) arg max_a E{Y*(a)|XB = xB, XC = xC1} ≠ arg max_a E{Y*(a)|XB = xB, XC = xC2}.
Remark 2.1. For any j = 1, 2, when E{Y*(1)|XB = xB, XC = xCj} = E{Y*(0)|XB = xB, XC = xCj}, the arg max in (2.2) is not unique. For any two functions ψ1(a) and ψ2(a), we define arg max_a ψ1(a) ≠ arg max_a ψ2(a) if any maximizer of ψ1 is not a maximizer of ψ2, or vice versa.
Restricting B = ∅ and pC = 1, we obtain a definition of the qualitative interaction between a single covariate and treatment similar to that in Gunter, Zhu and Murphy (2011). For an arbitrary subset D ⊆ I, let τD(xD) = E{Y*(1) − Y*(0)|XD = xD}.
We now introduce the optimal treatment regime based on covariates in a subset D ⊆ I. Similar to the definition of dopt, we define dDopt to be the treatment regime that maximizes the value function among the class of treatment regimes based only on covariates XD. Specifically, dDopt = arg max_{dD} V(dD),
where the maximum is taken over all possible maps dD: XD → {0,1}.
Under model (2.1), along with the SUTVA and no unmeasured confounders assumptions, for a given treatment regime dD, we have

(2.3) V(dD) = E[E{τ0(X)|XD} dD(XD)] + E{Y*(0)}.

It follows from (2.3) that dDopt(xD) = I[E{τ0(X)|XD = xD} > 0].
The aim of this paper is to test the null hypothesis

H0: variables in C have no qualitative treatment effects conditional on variables in B,

against the alternative

H1: variables in C have qualitative treatment effects conditional on variables in B.
Let W = B∪C. To better understand the null, we introduce some examples below.
Example 1 (Testing unconditional qualitative treatment effects).
Let B = ∅. Then, for any set C ⊆ I, we are testing whether XC has qualitative treatment effects. When it does, we can find two nonempty sets Ω1 and Ω2 such that Pr(XC ∈ Ω1) > 0, Pr(XC ∈ Ω2) > 0, and τC(xC) > 0 on Ω1 while τC(xC) < 0 on Ω2, where τC(xC) = E{Y*(1) − Y*(0)|XC = xC}. Hence, it is equivalent to test the null hypothesis that τC(xC) ≥ 0 for all xC, or τC(xC) ≤ 0 for all xC.
Example 2 (Testing conditional qualitative treatment effects).
Assume we know covariates XB have qualitative treatment effects. Our focus is to test whether some additional variables XC are "important" in decision making given XB. Here, the "importance" is measured by the difference of the value functions under the regimes dWopt and dBopt. As we will see below, this definition is equivalent to the conditional qualitative treatment effects of XC given XB.
Define the error rate

ER_{W,B} = Pr{dWopt(XW) ≠ dBopt(XB)},

and the difference of the value functions

(2.4) VD_{W,B} = V(dWopt) − V(dBopt).

The error rate measures the proportion of subjects for whom the regime dWopt makes a different decision from dBopt. When τW(XW) ≠ 0 a.s., where τD(xD) = E{Y*(1) − Y*(0)|XD = xD} for D = W, B, ER_{W,B} is equal to

(2.5) Pr[I{τW(XW) > 0} ≠ I{τB(XB) > 0}].
For the value difference, it follows from (2.4) that VDW,B ≥ 0.
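For intuition, both quantities can be approximated by Monte Carlo when the contrast functions are known. The sketch below uses a toy multiplicative contrast τW(x1, x2) = x1·x2 with X uniform on [−1, 1]², so that τB ≡ 0 by independence; all choices here are illustrative and not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100_000, 2))

tau_W = X[:, 0] * X[:, 1]          # contrast given (X_B, X_C) = (x1, x2)
tau_B = np.zeros(len(X))           # E{tau_W | X_B = x1} = x1 * E[X_C] = 0

d_W = (tau_W > 0).astype(int)      # optimal regime using both variables
d_B = (tau_B > 0).astype(int)      # optimal regime using X_B only

ER = np.mean(d_W != d_B)           # error rate: disagreement proportion
VD = np.mean(tau_W * (d_W - d_B))  # value difference, always >= 0
```

In this toy example τW changes sign in x2 for every x1 ≠ 0, so XC has CQTE given XB and the value difference is strictly positive (analytically, VD = E|X1·X2|/2 = 1/8), illustrating how VD quantifies the gain from including XC.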
Denote by ΩW = ΩB × ΩC the support of XW. We assume ΩB and ΩC are open subsets of ℝ^pB and ℝ^pC, respectively. In addition, the distribution of XW is absolutely continuous with respect to the Lebesgue measure ν, with density fW. We use subscripts and write xW (or xB, xC) to refer to an arbitrary |W|-dimensional (or |B|-, |C|-dimensional) vector. For any xW ∈ ΩW, we write xW,B and xW,C to denote the corresponding sub-vectors of xW formed by the elements in B and C. If B (or C) is a single-element set, i.e., B = {j0}, we write xB and xW,B as x(j0) and xW,(j0). When W = I, we omit the subscript W and write xW,B as xB. For notational convenience, we write τW(xW) = E{Y*(1) − Y*(0)|XW = xW} for any xW ∈ ΩW.
Theorem 2.2 (Characterization of the null).
Assume that τW(·) and τB(·) are continuous, and fW(xW) > 0 for all xW ∈ ΩW. Then, the following statements are equivalent:
(i) H0 holds.
(ii) VD_{W,B} = 0.
(iii) ER_{W,B} = 0.
(iv) For any xW such that τW(xW) ≠ 0, we have I{τW(xW) > 0} = I{τB(xW,B) > 0}.
(v) For any xB ∈ ΩB, we have τW(xW) ≥ 0 for all xW ∈ ΩW such that xW,B = xB, or τW(xW) ≤ 0 for all xW ∈ ΩW such that xW,B = xB.
Remark 2.3. Theorem 2.2 provides necessary and sufficient conditions for CQTE. Results (iv) and (v) hold for any x, instead of almost surely; this is due to the continuity of τW(·), τB(·) and fW(·). Result (ii) implies VD_{W,B} > 0 if H1 holds. By definition, this means that the variables in XC have CQTE given XB if and only if the optimal regime based on XB and XC together yields a larger value function than that based on XB only.
Remark 2.4. Result (iii) implies that when H0 holds, we have ER_{W,B} = 0. However, this cannot guarantee that the probability defined in (2.5) is equal to 0. We provide a counterexample below. Let p = 2, B = {2}, C = {1}, and hence W = I = {1,2}. Let , where [y]+ = max(0, y) for any y ∈ ℝ. Apparently, H0 holds under this setting. When X(1) and X(2) are independent, we obtain . Suppose X(2) < 1 a.s. and E[X(1)]+ > 0. If Pr(X(1) ≤ 0) > 0, we have
Thus, if Pr(X(1) ≤ 0) > 0.
Remark 2.5. Assertion (iv) motivates us to consider the following functional for testing H0:

S_{W,B} = ∫_{ΩW} φ(τW(xW)) [I{τW(xW) > 0} − I{τB(xW,B) > 0}] ω0(xW) dxW,

where φ(·) is a monotonically increasing function with φ(0) = 0 and ω0(xW) is a nonnegative weight function. Obviously we have S_{W,B} ≥ 0, and when H0 holds, Theorem 2.2 yields S_{W,B} = 0. Taking φ to be the identity function and ω0(xW) = fW(xW), we obtain S_{W,B} = VD_{W,B}. When ω0(xW) = fW(xW) and φ(z) = sgn(z), where sgn(z) = I(z > 0) − I(z < 0), we have S_{W,B} = ER_{W,B}. More generally, we can let φ(z) = sgn(z)|z|^q for some q ≥ 0; the resulting S_{W,B} is then an L_{q+1}-type functional. Alternatively, we can consider the following supremum-type test statistic:

(2.6) T_{W,B} = sup_{xW ∈ ΩW} φ(τW(xW)) [I{τW(xW) > 0} − I{τB(xW,B) > 0}] ω0(xW).

In Section 13 of the supplementary article, we develop a consistent testing procedure based on (2.6). In these statistics, the function τW represents the magnitude of the treatment effects, while the difference of the two indicators characterizes the discrepancy between the regimes dWopt and dBopt. We formally introduce our test statistic in the next section.
3. Testing procedure.
We first introduce nonparametric estimators of τW and τB. Define the propensity score π(x) = Pr(A = 1|X = x). In a randomized study, πi ≡ π(Xi) is a known constant. In this section, we assume the propensity score is correctly specified; in the next section, we propose a doubly robust test that allows the propensity score to be misspecified. Consider the following nonparametric estimator of τW(xW):

τ̂W(xW) = [Σ_{i=1}^n K_{hW}(Xi,W − xW) ωi] / [Σ_{i=1}^n K_{hW}(Xi,W − xW)], where ωi = {Ai/πi − (1 − Ai)/(1 − πi)}Yi,
and K_{hW}(·) is a multivariate kernel function. In general, K_{hW} can be taken as a pW-variate density function, with pW = pB + pC and hW being a symmetric positive definite bandwidth matrix as discussed in Wand and Jones (1993). In practice, for simplicity, we may take K_{hW} as a product of component-wise kernel functions, where K(·) is a symmetric density function. For notational convenience, we set hW,1 = ··· = hW,pW = hW. Note that the propensity score πi is a function of Xi, not just Xi,W. Under the SUTVA and no unmeasured confounders assumptions, we can show that τ̂W(xW) is a consistent estimator of τW(xW).
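A minimal numerical sketch of this estimator. The inverse-probability-weighted pseudo-outcome ωi has conditional mean τW(xW) under SUTVA and no unmeasured confounders, so smoothing ωi against Xi,W estimates the contrast; a product Gaussian kernel is used here purely for illustration (the paper's conditions require bounded higher-order kernels):

```python
import numpy as np

def contrast_nw(x_w, X_w, A, Y, pi, h):
    """Nadaraya-Watson-type estimate of the conditional contrast at x_w.

    omega_i = {A_i/pi_i - (1 - A_i)/(1 - pi_i)} Y_i is the IPW
    pseudo-outcome; its conditional mean given X_W equals tau_W(x_W).
    """
    A, Y = np.asarray(A, float), np.asarray(Y, float)
    omega = (A / pi - (1 - A) / (1 - pi)) * Y
    u = (np.asarray(X_w, float) - np.asarray(x_w, float)) / h
    w = np.exp(-0.5 * np.sum(u ** 2, axis=1))  # product Gaussian weights
    return np.sum(w * omega) / np.sum(w)
```

For example, in a balanced randomized study with πi = 0.5 and Y = A (so the true contrast is 1 everywhere), the estimate recovers 1.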
Let fB(·) denote the density function of XB. Similarly, a nonparametric estimator of τB(xB) is given by

τ̂B(xB) = [Σ_{i=1}^n K_{hB}(Xi,B − xB) ωi] / [Σ_{i=1}^n K_{hB}(Xi,B − xB)].
Based on Remark 2.5, it is natural to consider a test statistic based on

(3.1) Ŝn = ∫_{ΩW} τ̂W(xW) [I{τ̂W(xW) > 0} − I{τ̂B(xW,B) > 0}] f̂W(xW) dxW,

where τ̂W and τ̂B are the corresponding estimators of τW and τB, respectively.
Remark 3.1. When some of the covariates are discrete, we need to modify the integral in (3.1) by some product measure of Lebesgue and counting measures. For notational convenience, in Sections 3 and 4, we assume XW is continuous. In numerical studies, we allow some covariates to be discrete when implementing our test. Details about the test statistic with discrete covariates can be found in Section 5.
Under certain regularity conditions, we will show that there exist some positive sequences {an} and {σn} such that σn^{-1}(Ŝn − an) converges in distribution to a standard normal random variable under the null. To construct the test, we replace an and σn by some appropriate estimators ân and σ̂n, and reject the null when σ̂n^{-1}(Ŝn − ân) > zα, where zα is the upper α-quantile of a standard normal distribution. Below we introduce our test statistic, which is a slightly modified version of Ŝn.
3.1. Test statistic.
Consider the following test statistic
(3.2) |
where
for some sequence ηn → 0. Here, and are the kernel density estimators of fW and fB, respectively. Specifically,
Estimators and are referred to as the Nadaraya-Watson estimators for and .
Similar to Ŝn, we can show that Sn, properly centered and scaled, converges in distribution to a standard normal random variable. The tests based on Ŝn and Sn have nontrivial power against certain local alternatives as defined later. However, the one based on Sn is more powerful. To see this, note that
(3.3) |
With proper choice of ηn, the right-hand side (RHS) of (3.3) is equivalent to
(3.4) |
where .
The asymptotic mean of (3.4) remains the same under the null and the local alternatives. However, it has non-degenerate variance and is asymptotically independent of the leading term. This implies that Ŝn and Sn have the same shifted mean under the local alternatives, but the variance of Sn is smaller than that of Ŝn when the set E0 has nonzero measure. From now on, we focus on the test statistic Sn.
3.2. Consistency of the test.
Define
where
For each fixed xW, μW (xW) is the asymptotic variance of .
Define . The asymptotic mean and variance of are given by
where and are independent standard normal random variables, , and
To estimate and , we first provide nonparametric estimators for μW (xW ) and F0. Define
where ηn is defined in (3.2). For any set F ⊆ Ω, define and as
We estimate and by and , respectively.
Let ν(·) be the Lebesgue measure. Define the test statistic
We reject the null when .
Remark 3.2. When ν(F0) = 0, the test statistic is not well defined. Therefore, in this case we consider the version with F0 replaced by Ω instead. When F0 is a strict subset of Ω, the test statistic based on Ω will be conservative.
We write an ≍ bn for two sequences {an}, {bn} if there exist some universal constants c, C > 0 such that c bn ≤ an ≤ C bn. To study the theoretical properties of the test, we first introduce some conditions.
(A1.) Assume that ΩW is a bounded subset of ℝ^pW. Assume fW is continuous and satisfies inf_{xW∈ΩW} fW(xW) > 0 and sup_{xW∈ΩW} fW(xW) < ∞. Assume τW and τB are continuous. Moreover, fW, τW, fB and τB are s-times differentiable almost everywhere with uniformly bounded derivatives, for some integer s > 0.
(A2.) Assume the kernels are products of univariate kernels, where each Kj is an s-order kernel function with support {u ∈ ℝ : |u| ≤ 1/2}, is bounded and of bounded variation, and integrates to 1.
(A3.) Assume E exp(t|Y|) < ∞ for some t > 0, and sup_{xW∈ΩW} E(Y^4|XW = xW, A = a) < ∞ for a = 0, 1.
(A4.) Assume there exist some constants c0 and c1 such that 0 < c0 ≤ π(x) ≤ c1 < 1 for all x.
(A5.) Assume that μW(xW) is uniformly continuous and bounded on ΩW, and inf_{xW∈ΩW} μW(xW) > 0.
(A6.) Assume hW^{pW} ≍ hB^{pB}, n hW^{2pW} → ∞ and n hW^{2s} → 0.
(A7.) Assume . Assume there exist some constants ξ0, > 0 such that for any sufficiently small t, ε > 0,
(A8.) Assume ηn satisfies and .
Remark 3.3. Condition (A1) requires ΩW to be bounded. In practice, if it is unbounded, we can perform monotone transformations on each component of X to make the support of the transformed variables bounded. Otherwise, we need to focus on a bounded subset of ΩW and restrict the integral in the test statistic to that subset.
In addition, we modify H0 as: "For any fixed xB, τ0(xB, xC) ≥ 0 for all xC, or τ0(xB, xC) ≤ 0 for all xC," restricted to the chosen bounded subset.
Remark 3.4. Condition (A2) requires each Kj to be of order s, where the order of a kernel is defined as its first nonzero moment. Condition (A6) requires n hW^{2pW} → ∞ and n hW^{2s} → 0, which implies s > pW. When pW > 2, this condition requires each kernel Kj to be of high order. Such kernels are typically referred to as bias-reducing kernels. Unlike standard kernel functions, these kernels allow Kj(z) to be negative for some z. Moreover, we assume hW^{pW} ≍ hB^{pB} in (A6). This guarantees that τ̂W and τ̂B converge at the same rate.
Remark 3.5. Condition (A7) is not restrictive. Obviously, this condition holds when . In that case, we can set the constants ξ0 and to be any positive constants. Moreover, these conditions are satisfied in many other cases. For example, let p = 2, B = {2}, C = {1}. Consider
Then, with some calculation, we can show , for some constants c1, c2 > 0.
Note that when x(2) > 0 and is a nonzero constant c3 < 0 for all x(2) ≤ 0. For sufficiently small t > 0, we obtain
Besides, for any small ε0 > 0, we have
for some constant c4 > 0. This verifies (A7).
Theorem 3.6. Assume Conditions (A1)-(A8) hold. Then, under H0, we have
for 0 < α ≤ 0.5, where the equality holds when ν(F0) > 0.
Remark 3.7. Theorem 3.6 shows has correct size under H0. When ν(F0) = 0, we can show with probability tending to 1, , and hence
When ν(F0) ≠ 0, we will show that the test statistic is asymptotically normal. The proof is based on the well-known Poissonization technique, which introduces a Poissonized version of the statistic and transforms the integral into a summation of mean-zero 1-dependent random fields (see, for example, Giné, Mason and Zaitsev, 2003; Mason and Polonik, 2009; Chang, Lee and Whang, 2015). The asymptotic normality then follows from a standard central limit theorem for m-dependent random fields (Shergin, 1990). The details are given in the Supplementary Appendix.
Theorem 3.8. Assume Conditions (A1)-(A8) hold. Then, under H1, we have
Remark 3.9. Theorem 3.8 shows that the test has power tending to 1 against fixed alternatives. Together with Theorem 3.6, it suggests that our testing procedure is consistent.
3.3. Local alternatives.
In this subsection, we investigate the power of the proposed test under local alternatives. We write τn,0(x) for the contrast function, and similarly index its conditional versions given XD for a subset D ⊆ I, to emphasize that these functions are allowed to vary with n. Consider the following sequence of local alternatives:
for some continuous functions and on ΩW, where for any fixed xB ∈ ΩB,
for any xC ∈ ΩC, and
for any xC ∈ ΩC.
In addition,
Recall that F0 = {xW ∈ ΩW : τW(xW) = 0}. Let F0° and ∂F0 denote its interior and boundary, respectively. Since the contrast function varies with n, we state a more precise definition of conditional qualitative treatment effects below.
Definition 3.1 (CQTE, continued).
Variables in C have qualitative treatment effects conditional on variables in B if there exist some nonempty sets B0, C1 and C2 such that (i) Pr(XB ∈ B0) > 0, Pr(XC ∈ C1) > 0 and Pr(XC ∈ C2) > 0; and (ii) for any xC1 ∈ C1, xC2 ∈ C2 and xB ∈ B0, there exists a sequence nk → ∞ as k → ∞ such that

(3.5) arg max_a E_{nk}{Y*(a)|XB = xB, XC = xC1} ≠ arg max_a E_{nk}{Y*(a)|XB = xB, XC = xC2} for all k.
Remark 3.10. It is immediate to see that (3.5) is a modified version of (2.2), where we allow the conditional expectation E{Y*(a)|XB, XC} to vary with n. The magnitude of the local perturbation affects the CQTE; we provide a theorem that formally characterizes such results below.
Theorem 3.11. Assume δ0 is continuous and bounded on ΩW. Assume ν(∂F0) = 0. Under the conditions of Theorem 2.2, the following statements are equivalent:
(i) XC does not have QTE conditional on XB.
(ii) For any ε > 0, there exist a set Nε and a positive integer nε such that ν(Nε) ≤ ε and, for all n ≥ nε, the following holds: for any fixed xB, we have τn,0(xW) ≥ 0 for all xW ∉ Nε such that xW,B = xB, or τn,0(xW) ≤ 0 for all xW ∉ Nε such that xW,B = xB.
(iii) For all .
(iv) .
Remark 3.12. Result (iv) implies that H0 holds when Pr(XW ∈ F0) = 0. This implies that the local alternatives are nonstandard and only exist in the nonregular cases, i.e., when there is a positive probability that the optimal treatment decision based on XW is not uniquely defined.
Remark 3.13. Theorem 3.11 suggests the quantity
plays a key role in determining the CQTE of XC conditional on XB. In the theorem below, we establish the power of our test statistic under the local alternatives; it can be seen that this quantity is closely related to the power of our test.

Theorem 3.14. Assume Conditions (A1)-(A8) hold and that δ0 is bounded on ΩW. Then, under Ha with , we have
where Φ(z) = Pr(Z ≤ z) for a standard normal random variable Z.
4. Doubly robust test statistic.
In an observational study, the propensity scores πi are usually unknown. In practice, we posit a parametric model π(x,α) for the propensity score, for example, a logistic regression model π(x,α) = exp(xTα)/{1 + exp(xTα)}. We can obtain an estimator of α based on the data {(Ai, Xi), i = 1,…,n}, by either maximizing the likelihood function or solving estimating equations. The estimator will converge to some population-level parameter α0. When the model π(x,α) is correctly specified, α0 is the true parameter in the model. When the model is wrong, α0 corresponds to some least false parameters that have been widely studied in the literature (cf. White, 1982; Li and Duan, 1989).
We also posit some parametric models Φ0(x,θ) and Φ1(x,ζ) for E(Y|X = x, A = 0) and E(Y|X = x, A = 1), respectively. Let θ̂ and ζ̂ denote the estimators of θ and ζ, respectively, which converge to some parameters θ0 and ζ0 under potential model misspecification. Let π̂i = π(Xi, α̂), Φ̂0i = Φ0(Xi, θ̂) and Φ̂1i = Φ1(Xi, ζ̂). Define the following doubly robust estimators of τW and τB:
Remark 4.1. We can show that these doubly robust estimators are consistent when either π(x,α) is correctly specified or both Φ0(x,θ) and Φ1(x,ζ) are.
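The doubly robust construction replaces the IPW pseudo-outcome ωi with its augmented (AIPW) version; the sketch below shows the standard AIPW pseudo-outcome built from the fitted values of the posited models (the paper's exact displayed formula is not reproduced above, so this is the conventional form, stated as an assumption):

```python
import numpy as np

def dr_pseudo_outcome(A, Y, pi_hat, mu0_hat, mu1_hat):
    """Augmented IPW pseudo-outcome.

    Its conditional mean given X_W equals the contrast tau_W when either
    the propensity model (pi_hat) or both outcome-mean models
    (mu0_hat, mu1_hat) are correctly specified -- the standard
    double-robustness property."""
    A, Y = np.asarray(A, float), np.asarray(Y, float)
    return (A * (Y - mu1_hat) / pi_hat
            - (1 - A) * (Y - mu0_hat) / (1 - pi_hat)
            + mu1_hat - mu0_hat)
```

Smoothing this pseudo-outcome with the same kernels as in Section 3 then yields the doubly robust contrast estimators.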
Let and . Consider
where
For any set F, define
where
We estimate the asymptotic mean and variance of by and , respectively, with
Define
We reject the null when .
To establish the asymptotic distributions of under the null and local alternative, we impose the following conditions.
(A4'.) Assume there exist some constants c0' and c1' such that 0 < c0' ≤ π(x, α0) ≤ c1' < 1 for all x ∈ Ω.
(A5’.) Assume that is uniformly continuous and bounded on ΩW, and , where
(A9.) Assume that π(x,α) is twice continuously differentiable with respect to α; ‖∂π(x,α0)/∂α‖2 is uniformly bounded for all x ∈ Ω; and the elements of ∂²π(x,α)/∂α∂αT are uniformly bounded for all x ∈ Ω and α in a small neighborhood of α0.
(A10.) Assume that Φ0(x,θ) and Φ1(x,ζ) are twice continuously differentiable with respect to θ and ζ, respectively; Φ0(x,θ0), Φ1(x,ζ0), ‖∂Φ0(x,θ0)/∂θ‖2 and ‖∂Φ1(x,ζ0)/∂ζ‖2 are uniformly bounded for all x ∈ Ω; and the elements of the matrices ∂²Φ0(x,θ)/∂θ∂θT and ∂²Φ1(x,ζ)/∂ζ∂ζT are uniformly bounded for all x ∈ Ω and θ, ζ in small neighborhoods of θ0 and ζ0, respectively.
(A11.) Assume that the estimators α̂, θ̂ and ζ̂ have the following linear representations
for some functions ξ1, ξ2 and ξ3 with E{ξj(Oi)} = 0 for j = 1, 2, 3.
Remark 4.2. Conditions (A4’) and (A5’) are similar to (A4) and (A5). Conditions (A9)-(A11) are required for establishing the asymptotic normality of the estimators for misspecified models (White, 1982).
Theorem 4.3 (Double robustness). Assume Conditions (A1)-(A3), (A4'), (A5') and (A6)-(A11) hold. In addition, assume either π(x,α) or both Φ0(x,θ) and Φ1(x,ζ) are correctly specified. Then, under H0, for any 0 < α ≤ 0.5, we have
where the equality holds when ν(F0) > 0. In addition, under H1, we have
Remark 4.4. Theorem 4.3 establishes the consistency of the proposed doubly robust test. Next, we establish its power under the local alternatives.
Theorem 4.5. Assume the conditions in Theorem 4.3 hold. Under Ha, assume that δ0 is continuous and bounded on ΩW, and
Then, we have
Remark 4.6. For a given function δ0, the power of the doubly robust test increases as its asymptotic variance decreases. When the propensity score model is correctly specified, it can be shown that, for each xW ∈ ΩW, the asymptotic variance achieves its minimum when

(4.1) Φ0(x, θ0) = E(Y|X = x, A = 0) and Φ1(x, ζ0) = E(Y|X = x, A = 1).

Therefore, the asymptotic variance achieves its minimum if (4.1) holds. This suggests that the test has the greatest power when the posited models for the propensity score and the conditional means of Y given X and A are correctly specified.
5. Implementation details.
In Sections 3 and 4, we only consider continuous covariates for notational convenience. In this section, we present a more general testing framework allowing both continuous and discrete covariates, and provide some implementation details. Specifically, we consider the following two cases: (i) all covariates are discrete; and (ii) at least one covariate is continuous. The test statistics are different in these two cases. We focus on randomized studies and assume the propensity score is known. A doubly-robust version of the test statistic can be similarly derived as in Section 4 to deal with data from observational studies. We omit the details to save space.
5.1. All covariates are discrete.
When all covariates are discrete, for each x, we calculate
Define
(5.1) |
(5.2) |
Compute
Unlike the results in Sections 3 and 4, the limiting distribution of the statistic is not normal. If , we reject the null when where is the upper α-quantile of the random variable conditional on , where are independent standard normal random variables. Otherwise, we reject the null when . A formal justification of the aforementioned testing procedure is given in Section 14 of the supplementary article.
5.2. Not all covariates are discrete.
Assume W = WC ∪ WD and B = BC ∪ BD, where WC, BC are the sets of continuous covariates and WD, BD are the sets of discrete covariates. Denote by |WC|, |WD|, |BC| and |BD| the numbers of elements in these sets. When |WC| > 0, define ωi = {Ai/πi − (1 − Ai)/(1 − πi)}Yi and
where denotes the sampling variance of the jth covariate. In our numerical studies, we use a fourth-order Epanechnikov kernel for K, i.e.
It can be shown that for j = 1,2,3. Then we calculate
where and are the sub-vectors of xW formed by elements in WC and WD.
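The paper's displayed formula for K is not reproduced above; one common fourth-order Epanechnikov kernel, used here as an illustrative assumption, is K(u) = (15/32)(1 − u²)(3 − 7u²) on |u| ≤ 1. It integrates to one and has vanishing second moment, which is exactly what "fourth order" requires and why such bias-reducing kernels must take negative values (cf. Remark 3.4):

```python
import numpy as np

def epanechnikov4(u):
    """A common fourth-order Epanechnikov kernel on [-1, 1]
    (illustrative; the exact kernel in the text may differ):
        K(u) = (15/32) (1 - u^2) (3 - 7 u^2) 1{|u| <= 1}.
    Integrates to 1 with zero second moment, so K is negative
    where 7 u^2 > 3."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0,
                    (15.0 / 32.0) * (1 - u ** 2) * (3 - 7 * u ** 2),
                    0.0)
```

A quick Riemann-sum check on a fine grid confirms ∫K ≈ 1 and ∫u²K(u)du ≈ 0.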
When the continuous part has low dimension, the integral in the test statistic is computed via a midpoint rule with a uniform grid. Specifically, for each j ∈ W, denote by mj and Mj the minimum and maximum values of x(j). We divide the interval [mj, Mj] into L = 200 subintervals of equal width. Let zk,(j), k = 1,…,L, denote the midpoints of these intervals, and let the corresponding sub-vectors formed by elements in WC and BC be denoted accordingly. We approximate the integral by
where , and and are shorthands for and , and .
If the continuous part has higher dimension, we approximate the integral using Monte Carlo methods. Specifically, we generate N = 5000 random vectors Z(k), uniformly distributed on ∏j[mj, Mj], and calculate
where , ZW(k) and ZB(k) are the sub-vectors of Z(k) formed by elements in WC and BC.
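A sketch of this Monte Carlo step, with a generic integrand g standing in for the weighted discrepancy inside the statistic (g, the rectangle bounds, and the function names are placeholders, not the paper's notation):

```python
import numpy as np

def mc_integral(g, bounds, N=5000, seed=None):
    """Approximate the integral of g over the rectangle prod_j [m_j, M_j]
    by averaging g at N uniform draws and multiplying by the volume."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)   # shape (d, 2): rows (m_j, M_j)
    m, M = bounds[:, 0], bounds[:, 1]
    Z = m + (M - m) * rng.random((N, len(m)))  # uniform draws on the rectangle
    vol = float(np.prod(M - m))                # volume of the rectangle
    return vol * np.mean([g(z) for z in Z])
```

The approximation error is Op(N^{-1/2}) regardless of the dimension, which is why Monte Carlo replaces the grid rule once the continuous part is no longer low-dimensional.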
When , we calculate and by
Definitions of and are given in (5.1) and (5.2). When , we replace by Ω in the integral. The above integrals are calculated in the same way as for . We reject the null when .
6. Simulations.
To evaluate the numerical performance of the proposed testing procedure, we consider simulation studies based on the following model:
where h0 denotes the baseline effect function, τ0 denotes the contrast function, and e ∼ N(0, 0.25) is independent of A and X = (X(1), X(2))T. The objective is to test the CQTE of variable X(2) conditional on X(1). Treatment A was generated from a Bernoulli distribution with success probability 0.5, independent of X. The baseline function h0 was set to be
(6.1) |
The contrast function takes the form
(6.2) |
for some continuous functions ϕ1 and ϕ2.
Variables X(1) and X(2) are independently generated. It follows from Theorem 2.2 that the null (no CQTE) holds if and only if ϕ2(x(2)) ≥ 0 for all x(2), or ϕ2(x(2)) ≤ 0 for all x(2). We consider five scenarios. In the first four scenarios, X(1) and X(2) are generated from Unif[−2,2], where Unif[a,b] stands for the uniform distribution on the interval [a,b]. We set ϕ1(z) = z in the first two scenarios and ϕ1(z) = max(z,0) in the last two scenarios. As for ϕ2, in Scenarios 1 and 3,
for some δ ≥ 0. In Scenarios 2 and 4,
for some δ ≥ 0. In Figure 1, we plot the function ϕ2 for different choices of δ.
Fig 1:
Plots of function ϕ2 for Scenario 1 and Scenario 2, from left to right, with different choices of δ.
In the last scenario, X(1) is generated from Unif[−2,2] while X(2) is from a uniform discrete distribution. Specifically, X(2) has the following probability mass function
The contrast function is set to be τ0(x(1),x(2)) =
In all scenarios, the parameter δ controls the degree of CQTE. When δ = 0, H0 holds; otherwise, H1 holds. Moreover, it can be calculated that the value differences
for Scenarios 1–5 are equal to δ3/2/3, δ2/8, δ3/2/6, δ2/16 and δ/3 for all δ ≤ 1, respectively. In each scenario, we consider four settings by setting VD = 0,0.04,0.08 and 0.12. Hence, the null holds in the first setting and the alternative holds in other settings. We also consider two different sample sizes, n = 300 and n = 600.
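A minimal sketch of the data-generating mechanism is given below. The exact forms of h0, τ0 and ϕ2 in (6.1)–(6.2) were lost in extraction, so both are passed in as callables (hypothetical placeholders); the outcome model Y = h0(X) + Aτ0(X) + e is inferred from the description above:

```python
import numpy as np

def simulate(n, h0, tau0, seed=0):
    """Generate (X, A, Y) with Y = h0(X) + A * tau0(X) + e, e ~ N(0, 0.25).

    h0, tau0: callables mapping an (n, 2) covariate array to length-n outputs
    (placeholders for the baseline and contrast functions of (6.1)-(6.2)).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2.0, 2.0, size=(n, 2))     # X(1), X(2) ~ Unif[-2, 2], independent
    a = rng.binomial(1, 0.5, size=n)            # A ~ Bernoulli(0.5), independent of X
    e = rng.normal(0.0, np.sqrt(0.25), size=n)  # error with variance 0.25
    y = h0(x) + a * tau0(x) + e
    return x, a, y
```

Each of the 600 simulation replications in a given setting would call `simulate` with a fresh seed and the scenario-specific h0 and τ0.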
When implementing our testing procedure, we first fit a logistic regression model for the propensity score and linear models for the conditional means of Y given A and X. The test statistics are constructed as discussed in Section 5. Based on (6.1) and (6.2), the model for E(Y |X,A = 1) is always misspecified; however, the propensity score model is correctly specified. Hence, our test statistics are consistent. In Scenarios 1–4, we set the smoothing parameters as hW = cW n−1/7 and hB = cBn−2/7 for some constants cW and cB; Condition (A6) holds for such a choice of bandwidth. In our implementation, we tried a few values of cW and cB, and found and cB = 6 to work well in all scenarios. In Scenario 5, we set hW = 6n−2/7. In (5.1) and (5.2), we set ηn = n−2/7, C1 = 3 and C2 = 1. Such a choice of ηn satisfies Conditions (A8)–(A10) in our simulation settings. We conduct 600 simulation replications for each setting and report the empirical rejection proportions of the proposed test statistics in Table 1.
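The bandwidth and threshold schedule used in Scenarios 1–4 can be written as a small helper. The value of cW was lost in extraction, so it is left as an argument (an assumption, not the paper's value):

```python
def tuning_parameters(n, c_w, c_b=6.0):
    """Smoothing parameters for Scenarios 1-4: h_W = c_W * n^(-1/7),
    h_B = c_B * n^(-2/7), and the threshold eta_n = n^(-2/7).
    c_w is a user-supplied constant (the paper's choice was lost in extraction)."""
    h_w = c_w * n ** (-1.0 / 7.0)
    h_b = c_b * n ** (-2.0 / 7.0)
    eta_n = n ** (-2.0 / 7.0)
    return h_w, h_b, eta_n
```

Note that hB and ηn share the same n−2/7 rate, so hB = cB ηn under this schedule.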
Table 1.
Simulation results.
| Scenario | n | VD = 0, α = 0.05 | VD = 0, α = 0.1 | VD = 4%, α = 0.05 | VD = 4%, α = 0.1 | VD = 8%, α = 0.05 | VD = 8%, α = 0.1 | VD = 12%, α = 0.05 | VD = 12%, α = 0.1 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 300 | 4.3% | 6.0% | 24.0% | 34.0% | 58.7% | 68.3% | 82.2% | 87.5% |
| 1 | 600 | 1.5% | 3.3% | 36.7% | 45.5% | 75.8% | 83.3% | 95.7% | 97.3% |
| 2 | 300 | 7.0% | 11.1% | 23.8% | 32.7% | 60.5% | 69.3% | 88.2% | 92.5% |
| 2 | 600 | 3.7% | 7.8% | 31.0% | 41.8% | 83.0% | 90.5% | 98.3% | 99.5% |
| 3 | 300 | 3.8% | 6.5% | 37.5% | 48.7% | 76.5% | 79.8% | 93.5% | 95.5% |
| 3 | 600 | 2.7% | 6.7% | 52.5% | 61.8% | 99.1% | 100% | 99.8% | 99.8% |
| 4 | 300 | 6.2% | 10.2% | 39.8% | 47.7% | 79.2% | 87.3% | 96.0% | 97.8% |
| 4 | 600 | 5.2% | 8.8% | 59.3% | 68.2% | 96.8% | 98.3% | 100.0% | 100.0% |
| 5 | 300 | 5.2% | 9.7% | 29.3% | 40.5% | 68.0% | 76.3% | 94.0% | 96.8% |
| 5 | 600 | 5.3% | 9.5% | 46.2% | 57.5% | 92.2% | 95.5% | 100.0% | 100.0% |
Under H0 (i.e., the settings with VD = 0), the empirical type-I error rates in Scenarios 2, 4 and 5 are close to the nominal level. In Scenarios 1 and 3, we have ν(F0) = 0, and the empirical type-I error rates are well below the nominal level. This is in line with our theory, which suggests that the type-I error rate should converge to 0 in these settings. Under H1, the power increases as the value difference or the sample size increases, demonstrating the consistency of our test statistics.
7. Application to the ACTG175 dataset.
We apply the proposed method to data from the AIDS Clinical Trials Group Protocol 175 (ACTG175) study, a randomized trial in which patients were randomly assigned to one of four treatments: zidovudine (ZDV) monotherapy, ZDV + didanosine (ddI), ZDV + zalcitabine (zal), and ddI monotherapy. We focus on patients receiving ZDV + ddI (denoted as 1) or ZDV + zal (denoted as 0); among them, 522 received treatment 1 and 524 received treatment 0. We choose the CD4 count (cells/mm3) at 20 ± 5 weeks after receiving the treatment as the response. The baseline covariates include the patient's age and weight at baseline, the CD4 and CD8 counts at baseline (coded as CD40 and CD80, respectively), hemophilia (hemo, 0 = no, 1 = yes), homosexual activity (homo, 0 = no, 1 = yes), history of intravenous drug use (drug, 0 = no, 1 = yes), race (0 = white, 1 = non-white), gender (0 = female, 1 = male), antiretroviral history (str2, 0 = naive, 1 = experienced), and symptomatic status (sympton, 0 = asymptomatic, 1 = symptomatic). The first four variables are continuous and the others are binary. Our objective is to select, in a sequential order, the variables that have qualitative treatment effects. Since the propensity score is known, we consider the statistic proposed in Section 3. Our procedure proceeds as follows:
- Set = ∅. In the first step, for each variable i, define the set Wi = {i} and calculate the p-value pi of the corresponding test statistic as described in Section 5. Stop if mini pi > α. Otherwise, include the variable with the smallest p-value in the set , i.e.,
- In the second step, for each variable , define and calculate the p-value pi of each test statistic . Stop if mini pi > α. Otherwise, include the variable with the smallest p-value,
Repeat the second step until the procedure stops. Output
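The two steps above can be sketched as a generic greedy loop. The CQTE test of Section 5 is abstracted as a callable `pvalue(i, selected)` (a hypothetical hook, since the test statistic itself is not reproduced here):

```python
def forward_select(variables, pvalue, alpha):
    """Greedy forward selection on CQTE p-values.

    pvalue(i, selected): p-value for testing the CQTE of variable i
    conditional on the already-selected tuple of variables (a stand-in
    for the kernel-based test of Section 5).
    """
    selected = []
    remaining = list(variables)
    while remaining:
        pvals = {i: pvalue(i, tuple(selected)) for i in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] > alpha:        # stop: no remaining variable shows a CQTE
            break
        selected.append(best)          # include the smallest-p-value variable
        remaining.remove(best)
    return selected
```

Each iteration runs one CQTE test per remaining variable, so a run that selects k out of p variables performs O(kp) tests.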
It is immediate that the above algorithm is a forward selection procedure; backward or stepwise selection can be considered similarly. The threshold α determines the significance level of each test. In our implementation, we set , where is a standard normal random variable. Such a choice of α meets the conditions in Theorem 9.2 and guarantees selection consistency of the forward selection algorithm. As in the simulations, we choose the bandwidth when there is only one continuous variable in the kernel estimation; otherwise, we set . The sets and are estimated by
where the constant C0 is set to 0.03 in the implementation.
For the ACTG175 dataset, our algorithm stops after the fourth iteration. At the first iteration, only the variable age is significant and is thus selected. At the second iteration, both hemo and homo have qualitative effects conditional on age, and hemo is chosen. At the third iteration, only homo is significant given the previously included variables. The algorithm stops at the fourth iteration, as no remaining variable is significant. We report the p-values from each iteration in Table 2.
Table 2.
P-values of each test statistic in all iterations.
| Iteration | age | weight | hemo | homo | drug | race | gender | str2 | sympton | CD40 | CD80 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.022 | 0.087 | 0.793 | 0.827 | 0.817 | 0.831 | 0.808 | 0.825 | 0.825 | 0.823 | 0.772 |
| 2 | NA | 0.986 | 1.2e-8 | 0.028 | 0.288 | 0.308 | 0.175 | 0.257 | 0.191 | 0.982 | 0.975 |
| 3 | NA | 0.996 | NA | 0.033 | 0.067 | 0.447 | 0.091 | 0.155 | 0.196 | 0.999 | 0.998 |
| 4 | NA | 0.999 | NA | NA | 0.118 | 0.116 | 0.405 | 0.533 | 0.066 | 0.999 | 0.999 |
Our results indicate that the variables age, hemo and homo have qualitative treatment effects and are important for optimal treatment prescription. Denote by DFS the set of these three variables. We compare our algorithm with sequential advantage selection (SAS; Fan, Lu and Song, 2016), which uses a forward selection procedure based on a sequential S-score and selects the best candidate subset of variables via a BIC-type criterion. For the ACTG175 dataset, SAS selects a total of 10 variables, including age, hemo and homo. Denote by DSAS the set of these 10 variables.
To further examine the variable selection results, we evaluate the value functions under the optimal treatment regimes based on the sets of variables selected by the proposed forward selection algorithm and by SAS. For a given set D ⊆ I = {1,2,…,11}, we estimate the optimal value function
via the online estimator proposed by Luedtke and van der Laan (2016). More specifically, for i = ln+1, ln+2, …, n, we first compute the estimated optimal treatment regime and the estimated conditional mean functions and based on the data from patients 1 to i − 1.
For any j = 0,1 and i = ln + 1, ln + 2, …, n, is calculated via kernel ridge regression based on the dataset . We use the Gaussian radial basis function kernel, implemented via the R package CVST, with the tuning parameters in the kernel functions selected by 5-fold cross-validation. The estimator is computed via penalized regression with the SCAD penalty (Fan and Li, 2001) based on the dataset , implemented via the R package ncvreg with tuning parameters selected by 10-fold cross-validation. Setting π0 = 0.5, we define, for i = ln + 1, ln + 2, …, n and j = 1, …, n,
where .
The final estimator is given by
with the estimated standard error
where
Under certain conditions, we have
Set ln = 200. The estimated value functions and are 401.88 and 402.35, respectively, with estimated standard errors and . Since DFS ⊆ DSAS, we have . However, the difference is not significant. This implies that the proposed forward selection algorithm selects fewer variables than SAS while achieving approximately the same value function for optimal treatment decision making.
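A simplified sketch of the online value estimation described above is given below. The exact weighting in Luedtke and van der Laan (2016) was lost in extraction, so this version uses an unweighted average of AIPW pseudo-outcomes under the known randomization probability π0 = 0.5; the model-fitting step is abstracted as a callable `fit` (a hypothetical stand-in for the kernel ridge and SCAD fits):

```python
import numpy as np

def online_value_estimate(x, a, y, fit, l_n=200):
    """Simplified online value estimate of an estimated regime under pi0 = 0.5.

    fit(x_train, a_train, y_train) -> (regime, mu), where regime(x_i) gives
    the estimated optimal treatment for covariates x_i and mu(x_i, j) the
    estimated E(Y | X = x_i, A = j); both are hypothetical hooks.
    """
    n = len(y)
    psi = []
    for i in range(l_n, n):
        regime, mu = fit(x[:i], a[:i], y[:i])   # fit on patients 1, ..., i-1 only
        d_i = regime(x[i])
        # AIPW pseudo-outcome for the value of the estimated regime
        aipw = (a[i] == d_i) / 0.5 * (y[i] - mu(x[i], d_i)) + mu(x[i], d_i)
        psi.append(aipw)
    psi = np.asarray(psi)
    se = psi.std(ddof=1) / np.sqrt(len(psi))
    return psi.mean(), se
```

Because each pseudo-outcome uses only past patients, the summands form a martingale difference sequence after centering, which is what permits valid standard errors even when the optimal regime is non-unique.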
8. Discussion.
In this paper, we introduce the notion of conditional qualitative treatment effects (CQTE) and present several equivalent definitions. We also propose a consistent testing procedure for the existence of CQTE. Our test has correct size under the null hypothesis and non-negligible power against some nonstandard local alternatives.
8.1. More on the forward selection algorithm.
The forward selection algorithm introduced in Section 7 is a byproduct of the proposed testing procedure for the existence of CQTE. While it is worthwhile to investigate its statistical properties, this is a very challenging task. In the literature, few works have studied the asymptotic properties of forward selection procedures. Wang (2009) established the "sure screening property" of classical forward linear regression in a high-dimensional setting. However, the proofs of the major theorems in that paper (Theorems 1 and 2) rely heavily on the specific structure of linear regression, and it remains unknown whether the "sure screening property" holds for general forward selection algorithms.
Our forward selection algorithm aims to identify a subset D0 ⊆ {1,…,p} with minimum cardinality such that the optimal value function based on the variables in is the same as that based on X. In the supplementary appendix, we establish the "sure screening property" (Theorem 9.1) and selection consistency (Theorem 9.2) of the considered forward selection algorithm based on the p-values of the CQTE tests. Moreover, we conduct simulation studies to examine the empirical performance of the proposed algorithm and compare it with SAS (Fan, Lu and Song, 2016). Our forward selection algorithm achieves better model selection results than SAS in all considered simulation scenarios. More details can be found in Section 9 of the supplementary appendix.
8.2. Fully nonparametric implementation.
The proposed test statistic in Section 3 requires the propensity score function to be correctly specified. In Section 4, we introduce a doubly robust test statistic and posit some parametric models for the propensity score and conditional mean functions. In the supplementary appendix, we consider a fully nonparametric procedure based on some nonparametric estimators of the propensity score and the conditional mean functions.
We further conduct simulation studies to examine the empirical performance of the nonparametric testing procedure and compare it with the doubly robust test described in Section 4. We briefly summarize the results here: (i) the nonparametric test statistic is more powerful than the doubly robust test statistic; (ii) when the sample size is small, the empirical type-I error rates of the nonparametric test statistic are slightly larger than the nominal level in some cases. More details can be found in Section 11 of the supplementary appendix.
Although it is interesting to investigate the theoretical properties of such a nonparametric test statistic, it is beyond the scope of the current paper and is omitted here.
8.3. Extensions to Lp-type and supremum-type functionals.
As commented in Remark 2.5, the test statistic for no CQTE can be constructed based on
In the current paper, we set φ(·) to be the identity function. More generally, we can take φ(·) to be any monotonically increasing function with φ(0) = 0. In Section 12 of the supplementary appendix, we consider the class of functions φ(z) = sgn(z)|z|q and derive the corresponding test statistic for any q ≥ 1.
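The function class φ(z) = sgn(z)|z|q is easy to implement; a one-line sketch is:

```python
import math

def phi(z, q):
    """phi(z) = sgn(z) * |z|**q: monotonically increasing with phi(0) = 0 for q >= 1."""
    # copysign transfers the sign of z onto |z|**q, and maps 0 to 0.0
    return math.copysign(abs(z) ** q, z)
```

For q = 1 this reduces to the identity function used in the main text; larger q down-weights small values of the contrast relative to large ones.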
We show in Theorem 12.1 that the resulting test statistic has asymptotically correct size under H0, and we provide its asymptotic power function under Ha in Theorem 12.2. For different q, the asymptotic power function increases as
increases, where
and .
Besides, when q > 1, the assumptions on ηn and the moments of Y conditional on X and A are slightly different compared to those in (A3) and (A8). More details can be found in Section 12 of the supplementary appendix.
In addition, in Section 13 of the supplementary article, we develop a supremum-type test based on studentized kernel estimators of the contrast function with many different bandwidth values. We show that the test is valid and has nontrivial power against -local alternatives, where hmax denotes the maximum kernel bandwidth.
Therefore, compared with the supremum-type test, the Lp-type test is more powerful in that it allows nontrivial testing against n−1/2-local alternatives. However, the Lp-type test uses only one bandwidth value for the kernel estimates and may therefore be sensitive to the choice of the bandwidth parameter.
8.4. Other issues.
For simplicity, we consider only a single decision stage and binary treatments. It would be useful in practice to extend CQTE and the associated testing procedures to multiple stages with multiple treatment options. Moreover, our test statistic relies on kernel-based estimators of the contrast function, which are well known to behave poorly when the dimension of the covariates is large. Adapting our test statistics to handle high-dimensional covariates remains challenging.
Our testing procedure requires the specification of the tuning parameters hW, hB and ηn (see Section 5). In general, one can set , and for some cW, cB, κW, κB, κ0 > 0. In practice, we recommend setting , cB = 6, κW = 1/7, κB = 2/7 if pW > 2, pB = 1, and , κW = κB = 1/7 if pW, pB ≥ 2, and κ0 = 2/7.
We have tried various values of tuning parameters in our simulation studies and find such a choice works well in all scenarios. In Section 10 of the supplementary article, we examine the performance of our test under other choices of tuning parameters. The simulation results are very similar to those in Section 6.
Supplementary Material
Acknowledgments
This work was partly supported by a NIH grant P01 CA142538.
References.
- Andrews DWK and Shi X (2013). Inference based on conditional moment inequalities. Econometrica 81 609–666.
- Andrews DWK and Shi X (2014). Nonparametric inference based on conditional moment inequalities. J. Econometrics 179 31–45.
- Armstrong TB and Chan HP (2016). Multiscale adaptive inference on conditional moment inequalities. J. Econometrics 194 24–43.
- Chakraborty B, Murphy S and Strecher V (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Stat. Methods Med. Res. 19 317–343.
- Chang M, Lee S and Whang Y-J (2015). Nonparametric tests of conditional treatment effects with an application to single-sex schooling on academic achievements. Econom. J. 18 307–346.
- Chernozhukov V, Lee S and Rosen AM (2013). Intersection bounds: estimation and inference. Econometrica 81 667–737.
- Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
- Fan A, Lu W and Song R (2016). Sequential advantage selection for optimal treatment regime. Ann. Appl. Stat. 10 32–53.
- Fan C, Lu W, Song R and Zhou Y (2017). Concordance-assisted learning for estimating optimal individualized treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 79 1565–1582.
- Giné E, Mason DM and Zaitsev AY (2003). The L1-norm density estimator process. Ann. Probab. 31 719–768.
- Gunter L, Zhu J and Murphy SA (2011). Variable selection for qualitative interactions. Stat. Methodol. 8 42–55.
- Hsu Y-C (2017). Consistent tests for conditional treatment effects. Econom. J. 20 1–22.
- Li K-C and Duan N (1989). Regression analysis under link violation. Ann. Statist. 17 1009–1052.
- Lu W, Zhang HH and Zeng D (2013). Variable selection for optimal treatment decision. Stat. Methods Med. Res. 22 493–504.
- Luedtke AR and van der Laan MJ (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Statist. 44 713–742.
- Mason DM and Polonik W (2009). Asymptotic normality of plug-in level set estimates. Ann. Appl. Probab. 19 1108–1142.
- Murphy SA (2003). Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 65 331–366.
- Qian M and Murphy SA (2011). Performance guarantees for individualized treatment rules. Ann. Statist. 39 1180–1210.
- Robins JM, Hernan MA and Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11 550–560.
- Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66 688–701.
- Shergin VV (1990). The central limit theorem for finitely dependent random variables. In Probability Theory and Mathematical Statistics, Vol. II (Vilnius, 1989) 424–431. Mokslas, Vilnius.
- Wand MP and Jones MC (1993). Comparison of smoothing parameterizations in bivariate kernel density estimation. J. Amer. Statist. Assoc. 88 520–528.
- Wang H (2009). Forward regression for ultra-high dimensional variable screening. J. Amer. Statist. Assoc. 104 1512–1524.
- Watkins CJCH and Dayan P (1992). Q-learning. Mach. Learn. 8 279–292.
- White H (1982). Maximum likelihood estimation of misspecified models. Econometrica 50 1–25.
- Zhang B, Tsiatis AA, Laber EB and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68 1010–1018.
- Zhang B, Tsiatis AA, Laber EB and Davidian M (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100 681–694.
- Zhang Y, Laber EB, Tsiatis A and Davidian M (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics 71 895–904.
- Zhang Y, Laber EB, Tsiatis A and Davidian M (2016). Interpretable dynamic treatment regimes. arXiv preprint arXiv:1606.01472.
- Zhao Y, Zeng D, Rush AJ and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. J. Amer. Statist. Assoc. 107 1106–1118.
- Zhao Y-Q, Zeng D, Laber EB and Kosorok MR (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. J. Amer. Statist. Assoc. 110 583–598.