ON TESTING CONDITIONAL QUALITATIVE TREATMENT EFFECTS

Chengchun Shi; Rui Song; Wenbin Lu

doi:10.1214/18-AOS1750

. Author manuscript; available in PMC: 2019 Aug 1.

Published in final edited form as: Ann Stat. 2019 May 21;47(4):2348–2377. doi: 10.1214/18-AOS1750

ON TESTING CONDITIONAL QUALITATIVE TREATMENT EFFECTS

Chengchun Shi ¹, Rui Song ¹, Wenbin Lu ¹

PMCID: PMC6561732 NIHMSID: NIHMS987209 PMID: 31190690

Abstract

Precision medicine is an emerging medical paradigm that focuses on finding the most effective treatment strategy tailored for individual patients. In the literature, most of the existing works focused on estimating the optimal treatment regime. However, there has been less attention devoted to hypothesis testing regarding the optimal treatment regime. In this paper, we first introduce the notion of conditional qualitative treatment effects (CQTE) of a set of variables given another set of variables and provide a class of equivalent representations for the null hypothesis of no CQTE. The proposed definition of CQTE does not assume any parametric form for the optimal treatment rule and plays an important role for assessing the incremental value of a set of new variables in optimal treatment decision making conditional on an existing set of prescriptive variables. We then propose novel testing procedures for no CQTE based on kernel estimation of the conditional contrast functions. We show that our test statistics have asymptotically correct size and non-negligible power against some nonstandard local alternatives. The empirical performance of the proposed tests are evaluated by simulations and an application to an AIDS data set.

Keywords: Conditional qualitative treatment effects, Kernel estimation, Nonstandard local alternatives, Optimal treatment decision making

1. Introduction.

Precision medicine is an emerging medical paradigm for finding the best treatment for individual patients by taking their characteristics into consideration. The goal is to find the optimal treatment regime that will give the most favorable clinical outcome of interest on average. A number of methods have been developed for estimating the optimal treatment regime as a function of prognostic covariates, including Q-learning (Watkins and Dayan, 1992; Chakraborty, Murphy and Strecher, 2010), Alearning (Robins, Hernan and Brumback, 2000; Murphy, 2003), direct value optimization methods (Zhang et al., 2012, 2013) and outcome-weighted learning (Zhao et al., 2012, 2015). Qian and Murphy (2011) proposed to estimate the optimal treatment regime based on the estimated mean outcome model with the lasso penalty. Zhang et al. (2015) and Zhang et al. (2016) proposed to construct interpretable optimal treatment regimes via decision lists.

However, there has been less attention devoted to hypothesis testing regarding the optimal treatment regime. Chang, Lee and Whang (2015) and Hsu (2017) considered testing whether the conditional treatment effects given a set of covariates are always nonpositive. This amounts to testing the overall qualitative treatment effects of the covariates. When the null hypothesis holds, the optimal treatment regime will recommend the control treatment to all patients regardless of their prognostic covariates. Such type of null hypothesis is closely connected to testing conditional moment inequalities, see for example Andrews and Shi (2013, 2014); Chernozhukov, Lee and Rosen (2013); Armstrong and Chan (2016) and the references therein.

In this work, we develop a novel testing procedure for conditional qualitative treatment effects (CQTE) of a set of variables given another set of variables. The contributions of this paper can be summarized as three folds. First, we mathematically formalize the notion of CQTE without assuming any parametric form of the treatment-covariates interactions and systematically characterize several equivalent representations of no CQTE. Informally speaking, a variable is said to have no qualitative treatment effects conditional on other variables if including it in treatment decision can not lead to a treatment regime that increases the value function. It naturally generalizes the definition for the qualitative interaction of a single covariate and treatment given in Gunter, Zhu and Murphy (2011) and the definition for the overall qualitative treatment effects.

Our second contribution is to propose robust test statistics based on a kernel estimator for the conditional treatment effects for testing the existence of CQTE, which do not require the specification of the outcome model and the parametric form of treatment decision rules. To the best of our knowledge, this is the first time that such hypothesis testing problems are formally studied. Gunter, Zhu and Murphy (2011) proposed the S-score to quantify the magnitude of the qualitative interaction between a single covariate and treatment. However, no theoretical justifications were provided for the proposed S-score.

Compared with the global tests in Chang, Lee and Whang (2015) and Hsu (2017), our proposed tests for the CQTE can offer a new and important tool for assessing the incremental value of a set of new variables in optimal treatment decision making conditional on an existing set of qualitative covariates. Take the AIDS Clinical Trials Group Protocol 175 (ACTG175) study as an example. Many works in the literature have found that the age variable has significant qualitative interaction with the treatment (Lu, Zhang and Zeng, 2013; Fan et al., 2017). It is therefore of great importance to explore the CQTE of a new variable or a set of new variables given the age variable. The proposed tests can also help to construct the optimal treatment regime. When the null hypothesis of no CQTE is rejected, we conclude that including the new variables in treatment decision can increase the value function. Therefore, it is more desirable to construct the optimal treatment regime based on both the new and existing sets of variables.

Using the Poissonization technique (Giné, Mason and Zaitsev, 2003), we show that our test statistic has correct size under the null and non-negligible powers against some nonstandard local alternatives. To deal with data from observational studies, we further introduce a doubly-robust test statistic that is consistent when either the propensity score model or the conditional mean models for the response are correctly specified.

Thirdly, the proposed test can help to discover new variables with the CQTE. Specifically, we develop a procedure for selecting qualitative variables in a sequential order based on the p-values of the proposed CQTE test. For simplicity, we only consider forward selection in this paper. Backward or stepwise selection procedure can be similarly developed.

The rest of the paper is organized as follows. We present the definition of CQTE and a class of equivalent representations for the null hypothesis of no CQTE in Section 2. Our proposed testing statistic and its asymptotic properties under the null, fixed alternative and nonstandard local alternatives are given in Section 3. In Section 4, we extend our testing procedure to the case where the propensity score is unknown and needs to be estimated from data, and introduce a doubly-robust version of the test statistic. Some implementation issues are discussed in Section 5. Simulations studies are conducted to evaluate the empirical performance of the proposed test in Section 6, followed by an application to an AIDS clinical trial data in Section 7. Here, variables with qualitative treatment effects are selected in a forward selection procedure based on the proposed test. A discussion is given in Section 8 and all technical proofs are given in the Supplementary Appendix.

2. Conditional qualitative treatment effects.

2.1. Optimal treatment regime.

For simplicity we focus on a single stage study with two treatment options. Assume data are summarized as O_i = (X_i,A_i,Y_i), i = 1,…,n, where, for subject i, $X_{i} \in ℝ^{p}$ denotes the baseline covariates, A_i = 0/1 denotes the treatment received, and Y_i denotes the patient’s response of interest. Here, a larger value of Y_i represents a better clinical outcome. We assume O_i’s are i.i.d. copies of the triplet O = (X,A,Y).

Consider the following semiparametric model for Y,

Y = h_{0} (X) + A τ_{0} (X) + e,

(2.1)

where E(e|X,A) = 0, h₀(·) is the baseline effect function, and τ₀(x) = E(Y |X = x,A = 1) − E(Y |X = x,A = 0) is referred to as the contrast function.

The optimal treatment regime is defined in the potential outcome framework. Let Y ^∗(0) and Y ^∗(1) be the potential outcomes that might be observed under treatment 0 and 1, respectively. A treatment regime d(x) is a map defined on $ℝ^{p} \to {0, 1}$ . For a given regime d, consider the potential outcome

Y^{*} (d) = Y^{*} (0) {1 - d (X)} + Y^{*} (1) d (X) .

The optimal treatment regime is the map that maximizes the expected potential outcome, named the value function,

d_{o p t} = \arg \max_{d} E {Y^{*} (d)} .

As in (Rubin, 1974), we assume the following two assumptions hold: (i) stable unit treatment value assumption (SUTVA), Y = Y ^∗(1)A+Y ^∗(0)(1−A); and (ii) no unmeasured confounders assumption, (Y ^∗(0),Y ^∗(1)) are independent of A given X. Under model (2.1), we have τ₀(x) = E{Y ^∗(1)−Y ^∗(0)|X = x}, and

d_{o p t} (x) = I {τ_{0} (x) \geq 0},

where I(·) denotes the indicator function.

2.2. Conditional qualitative treatment effects.

In treatment decision making, Gunter, Zhu and Murphy (2011) made a distinction between predictive and prescriptive variables. In particular, the prescriptive variables have qualitative interaction with treatment, which are important for treatment prescription. They gave a formal definition of the qualitative interaction between a single covariate and treatment. We first extend the definition by introducing the notion of conditional qualitative treatment effects (CQTE). Let B and C be two disjoint subsets of I ≡ {1,2,…,p}. Denoted by p_B and p_C the number of elements in B and C, respectively. For any D ⊆ I, we use X^D to denote the sub-vector of X, formed by the elements indexed in D. When D is a single-element set, i.e, D = j₀ for some j₀ ∈ I, we write X^D as X^(j0). Moreover, we use |D| to denote the cardinality of D.

Definition 2.1 (CQTE).

Variables in C have qualitative treatment effect conditional on variables in B if there exist some nonempty sets $C_{1}$ , $C_{2} \subseteq ℝ^{p_{C}}$ , $B \subseteq ℝ^{p_{B}}$ , such that (i)

P r {(X^{B}, X^{C}) \in B \times C_{1}} > 0, a n d P r {(X^{B}, X^{C}) \in B \times C_{2}} > 0;

and (ii) for any $x_{C_{1}} \in C_{1}$ , $x_{C_{2}} \in C_{2}$ and $x_{B} \in B$ , we have

\arg \max_{a} E {Y^{*} (a) | X^{B} = x_{B}, X^{C} = x_{C_{1}}} \neq \arg \max_{a} E {Y^{*} (a) | X^{B} = x_{B}, X^{C} = x_{C_{2}}} .

(2.2)

Remark 2.1. For any j = 1,2, when

E {Y^{*} (1) | X^{B} = x_{B}, X^{C} = x_{C_{j}}} = E {Y^{*} (0) | X^{B} = x_{B}, X^{C} = x_{C_{j}}},

the argmax in (2.2) is not unique. For any two functions ψ₁(a) and ψ₂(a), we define arg max_a ψ₁(a) 6= argmax_a ψ₂(a) if any maximizer of ψ₁ is not the maximizer of ψ₂ or vice versa.

Restricting B = ∅ and p_C = 1, we obtain a similar definition of the qualitative interaction between a single covariate and treatment as in Gunter, Zhu and Murphy (2011). For an arbitrary subset D ⊆ I, let

τ_{0}^{D} (x_{D}) = E {τ_{0} (X) | X^{D} = x_{D}} .

We now introduce the optimal treatment regime $d_{o p t}^{D}$ based on covariates in a subset D ⊆ I. Similar to the definition of d_opt, we define $d_{o p t}^{D}$ to be the treatment regime that maximizes the value function among the class of treatment regimes based only on covariates X^D. Specifically,

d_{o p t}^{D} = \arg \max_{d^{D}} E {Y^{*} (d^{D})},

where the maximum is taken over all possible maps d^D: X^D → {0,1}.

Under Model (2.1), along with the SUTVA and no unmeasured confounders assumptions, for a given treatment regime d^D, we have

E {Y^{*} (d^{D})} = E {Y^{*} (0) + τ_{0} (X) d^{D} (X^{D})} = E {Y^{*} (0)} + E [E {τ_{0} (X) | X^{D} = X^{D}} d^{D} (X^{D})] = E {Y^{*} (0)} + E {τ_{0}^{D} (X^{D}) d^{D} (X^{D})} .

(2.3)

It follows from (2.3) that

d_{o p t}^{D} (x_{D}) = I {τ_{0}^{D} (x_{D}) \geq 0} .

The aim of this paper is to test the following null hypothesis:

H_{0} : X^{C} does not have CQTE given X^{B},

against the alternative

H_{1} : X^{C} has CQTE given X^{B} .

Let W = B∪C. To better understand the null, we introduce some examples below.

Example 1 (Testing unconditional qualitative treatment effects).

Let B = ∅. Then for any set C ⊆ I and we are testing whether X^C has qualitative treatment effects. When it does, we can find two nonempty sets Ω₁ and Ω₂ such that Pr(X^C ∈ Ω₁) > 0,Pr(X^C ∈ Ω₂) > 0, and $τ_{0}^{C} (\cdot) > 0$ on Ω₁ while $τ_{0}^{C} (\cdot) < 0$ on Ω₂. Hence, it is equivalent to test the null hypothesis

τ_{0}^{C} (X^{C}) \geq 0, a . s, or τ_{0}^{C} (X^{C}) \leq 0, a . s .

Example 2 (Testing conditional qualitative treatment effects).

Assume we know covariates X^B have qualitative treatment effects. Our focus is to test whether some additional variables X^C are “important” in decision making given X^B. Here, the “importance” is measured by the difference of the value functions under regimes $d_{o p t}^{W}$ and $d_{o p t}^{B}$ . As we will see below, this definition is equivalent to the conditional qualitative treatment effects of X^C given X^B.

Define the error rate

{ER}^{W, B} = {\begin{array}{l} 0, & if τ_{0}^{W} (X^{W}) = 0, a . s . \\ \frac{E [| d_{o p t}^{W} (X^{W}) - d_{o p t}^{B} (X^{B}) | I {τ_{0}^{W} (X^{W}) \neq 0}]}{\Pr {τ_{0}^{W} (X^{W}) \neq 0}}, & otherwise, \end{array}

and the difference of the value function

{VD}^{W, B} = E {Y^{*} (d_{o p t}^{W})} - E {Y^{*} (d_{o p t}^{B})} = E [τ_{0} (X) {d_{o p t}^{W} (X^{W}) - d_{o p t}^{B} (X^{B})}] = E [τ_{0}^{W} (X^{W}) {d_{o p t}^{W} (X^{W}) - d_{o p t}^{B} (X^{B})}] .

(2.4)

The error rate measures the proportion that the treatment regime $d_{o p t}^{B}$ makes a different decision compared with $d_{o p t}^{W}$ . When $τ_{0}^{W}$ (X^W) ≠ 0, a.s, ER^W,B is equal to

{ER}_{*}^{W, B} = E | d_{o p t}^{W} (X^{W}) - d_{o p t}^{B} (X^{B}) | .

(2.5)

For the value difference, it follows from (2.4) that VD^W,B ≥ 0.

Denoted by Ω^W = Ω^B × Ω^C the support of X^W. We assume Ω^B and Ω^C are open subsets in $ℝ^{p_{B}}$ and $ℝ^{p_{C}}$ , respectively. In addition, the density f^W of X^W is absolutely continuous with respect to the Lebesgue measure ν. We use subscripts and write x_W (or x_B, x_C) to refer to an arbitrary |W|-dimensional (or |B|, |C|-dimensional) vector. For any x_W ∈ Ω^W, we write x_W,B and x_W,C to denote the corresponding sub-vectors of x_W formed by elements in B and C. If B (or C) is a single-element set, i.e, B = {j₀}, we write x_B, x_W,B as $x_{(j_{0})}$ and $x_{W, (j_{0})}$ . When W = I, we omit the subscript W and write x_W,B as x_B. For notational convenience, we write $τ_{0}^{W} (x_{W}) = τ_{0}^{W} (x_{W, B}, x_{W, C})$ for any x_W ∈ Ω^W.

Theorem 2.2 (Characterization of the null).

Assume that $τ_{0}^{W} (\cdot)$ and $τ_{0}^{B} (\cdot)$ are continuous, and $E {τ_{0}^{W} (X^{W})}^{2} < \infty$ . Then, the followings are equivalent:

H₀ holds.
VD^W,B = 0.
ER^W,B = 0.
For any x_W such that $τ_{0}^{W} (x_{W}) \neq 0$ , we have $d_{o p t}^{W} (x_{W}) = d_{o p t}^{B} (x_{W, B})$ .
For any x_B ∈ Ω^B, we have $τ_{0}^{W} (x_{W}) \geq 0$ for all x_W ∈ Ω^W such that x_W,B = x_B or $τ_{0}^{W} (x_{W}) \leq 0$ for all x_W ∈ Ω^W such that x_W,B = x_B.

Remark 2.3. Theorem 2.2 provides the sufficient and necessary conditions for CQTE. Results in (iv) and (v) hold for any x, instead of almost surely. This is due to the continuity of $τ_{0}^{W} (\cdot)$ and f^W (·). Result (ii) suggests VD^W,B > 0 if H₁ holds. By definition, this means that variables in X^C have CQTE given X^B if and only if the optimal regime obtained based on X^B and X^C together can yield a larger value function than that based on X^B only.

Remark 2.4. Result (iii) implies when H₀ holds, we have ER^W,B = 0. However, it can not guarantee that ${HR}_{*}^{W, B}$ defined in (2.5) is equal to 0. We provide a counter example below. Let p = 2, B = {2}, C = {1} and hence W = I = {1,2}. Let $τ_{0} (x) = τ_{0}^{W} (x_{W}) = {[x_{(1)}]}_{+} (x_{(2)} - 1)$ , where [y]₊ = max(0,y) for any $y \in ℝ$ . Apparently, H₀ holds under this setting. When X⁽¹⁾ and X⁽²⁾ are independent, we obtain $τ_{0}^{B} (x_{(2)}) = (x_{(2)} - 1) E {[X^{(1)}]}_{+}$ . Suppose X⁽²⁾ < 1 a.s. and E[X⁽¹⁾]₊ > 0. If Pr(X⁽¹⁾ ≤ 0) > 0, we have

\Pr (X^{(1)} \leq 0) = \Pr {τ_{0}^{W} (X^{W}) \geq 0} = E d_{o p t}^{W} (X^{W}) \neq E d_{o p t}^{B} (X^{B}) = \Pr {τ_{0}^{B} (X^{(2)}) \geq 0} = 0.

Thus, ${ER}_{*}^{W, B} \neq 0$ if Pr(X⁽¹⁾ ≤ 0) > 0.

Remark 2.5. Assertion (iv) motivates us to consider the following test statistic for H₀

S^{W, B} = \int_{x_{W} \in Ω^{W}} ϕ {τ_{0}^{W} (x_{W})} {d_{o p t}^{W} (x_{W}) - d_{o p t}^{B} (x_{W, B})} ω_{0} (x_{W}) d x_{W},

where φ(·) is a monotonically increasing function with φ(0) = 0 and ω₀(x_W ) is a nonnegative weight function. Obviously we have S^W,B ≥ 0. When H₀ holds, by Theorem 2.2, we obtain S^W,B = 0. Taking φ to be the identity function and ω₀(x_W) = f^W (x_W), we obtain S^W,B = VD^W,B. When $ω_{0} (x_{W}) = f^{W} (x_{W}) / \Pr {τ_{0}^{W} (X^{W}) \neq 0}$ and φ(z) = sgn(z) where

sgn (z) = {\begin{array}{l} 1, & z > 0, \\ 0, & z = 0, \\ - 1, & z < 0, \end{array}

we have S^W,B = ER^W,B. More generally, we can let φ(z) = sgn(z)|z|^q for some q ≥ 0. The defined S^W,B then becomes an L_q+1 type functional. Alternatively, we can consider the following supremum-type test statistic

\sup_{x_{W} \in Ω^{W}} ϕ {τ_{0}^{W} (x_{W})} {d_{o p t}^{W} (x_{W}) - d_{o p t}^{B} (x_{W, B})} ω_{0} (x_{W}) .

(2.6)

In Section 13 of the supplementary article, we develop a consistent testing procedure based on (2.6). In these statistics, function $ϕ {τ_{0}^{W} (x_{W})}$ represents the magnitude of treatment effects, while the difference of two indicators characterizes the discrepancy between the regimes $d_{o p t}^{B}$ and $d_{o p t}^{W}$ . We formally introduce our test statistic in the next section.

3. Testing procedure.

We first introduce nonparametric estimators of $τ_{0}^{W}$ and $τ_{0}^{B}$ . Define the propensity score π(x) = Pr(A = 1|X = x). In a randomized study, π_i ≡ π(X_i) is a constant and known. In this Section, we assume the propensity score is correctly specified. In the next Section, we propose a doubly robust test, which allows the misspecification of the propensity score. Consider the following nonparametric estimator of $τ_{0}^{W} (x_{W}) f^{W} (x_{W})$ :

τ_{n}^{W} (x_{W}) = \frac{1}{n} \sum_{i = 1}^{n} (\frac{A_{i}}{π_{i}} - \frac{1 - A_{i}}{1 - π_{i}}) Y_{i} K_{h_{W}}^{W} (x_{W} - X_{i}^{W}),

where $K_{h_{W}}^{W} (\cdot)$ is a multivariate kernel function. In general, $K_{h_{W}}^{W} (\cdot)$ can be taken as a p_W -variate density function with p_W = p_B + p_C and h_W being a symmetric positive definite matrix as discussed in Wand and Jones (1993). In practice, for simplicity, we may take $K_{h_{W}}^{W} (\cdot)$ as a product of component-wise kernel functions, i.e., $K_{h_{W}}^{W} (x_{W} - X_{i}^{W}) = \prod_{j \in W} {(h_{W, j})}^{- 1} K (\frac{x_{W, (j)} - X_{i}^{(j)}}{h_{j}})$ , where K(·) is a symmetric density function. For notational convenience, we set h_W,1 = ···h_W,p = h_W, and write $K_{h_{W}}^{W} (x_{W} - X_{i}^{W}) = {(h_{W})}^{- p_{W}} K^{W} (\frac{x_{W} - X_{i}^{W}}{h_{W}})$ . Note that the propensity score π_i is a function of X_i not just $X_{i}^{W}$ . Under the SUTVA and no unmeasured confounders assumptions, we can show that $τ_{n}^{W} (x_{W})$ is a consistent estimator of $τ_{0}^{W} (x_{W}) f^{W} (x_{W})$ .

Let f^B(·) denote the density function of X^B. Similarly, a nonparametric estimator of $τ_{0}^{B} (x_{B}) f^{B} (x_{B})$ is given by

τ_{n}^{B} (x_{B}) = \frac{1}{n} \sum_{i = 1}^{n} (\frac{A_{i}}{π_{i}} - \frac{1 - A_{i}}{1 - π_{i}}) Y_{i} K_{h_{B}}^{B} (x_{B} - X_{i}^{B}) .

Based on Remark 2.5, it’s natural to consider the test statistic based on

S_{n}^{W, B} = \int_{x_{W} \in Ω^{W}} τ_{n}^{W} (x_{W}) {d_{n}^{W} (x_{W}) - d_{n}^{B} (x_{W, B})} d x_{W},

(3.1)

where $d_{n}^{W} (x_{W}) = I {τ_{n}^{W} (x_{W}) \geq 0}$ and $d_{n}^{B} (x_{W, B}) = I {τ_{n}^{B} (x_{W, B}) \geq 0}$ , are corresponding estimators for $d_{o p t}^{W} (x_{W})$ and $d_{o p t}^{B} (x_{W, B})$ respectively.

Remark 3.1. When some of the covariates are discrete, we need to modify the integral in (3.1) by some product measure of Lebesgue and counting measures. For notational convenience, in Sections 3 and 4, we assume X^W is continuous. In numerical studies, we allow some covariates to be discrete when implementing our test. Details about the test statistic with discrete covariates can be found in Section 5.

Under certain regularity conditions, we will show that there exist some positive sequences {a_n} and {σ_n} such that

\frac{\sqrt{n} S_{n}^{W, B} - a_{n}}{σ_{n}} \overset{d}{\to} N (0, 1),

under the null. To construct the test statistic, we replace a_n and σ_n by some appropriate estimators ${\bar{a}}_{n}$ and ${\bar{σ}}_{n}$ , and reject the null when $T_{n}^{W, B} = (\sqrt{n} S_{n}^{W, B} - {\bar{a}}_{n}) / {\bar{σ}}_{n} > z_{α}$ where z_α is the upper α-quantile of a standard normal distribution. Below we introduce our test statistic which is a slightly modified version of $T_{n}^{W, B}$ .

3.1. Test statistic.

Consider the following test statistic

{\tilde{S}}_{n}^{W, B} = \int_{x_{W} \in Ω^{W}} τ_{n}^{W} (x_{W}) {d_{n}^{W} (x_{W}) - d_{n}^{B} (x_{W, B})} I (x_{W} \notin \hat{E}) d x_{W},

(3.2)

where

\hat{E} = {x_{W} \in Ω^{W} : | \frac{τ_{n}^{W} (x_{W})}{{\hat{f}}^{W} (x_{W})} | \leq η_{n}, | \frac{τ_{n}^{B} (x_{W, B})}{{\hat{f}}^{B} (x_{W, B})} | \leq η_{n}},

for some sequence η_n → 0. Here, ${\hat{f}}^{W}$ and ${\hat{f}}^{B}$ are the kernel density estimators of f^W and f^B, respectively. Specifically,

{\hat{f}}^{W} (x_{W}) = \frac{1}{n} \sum_{i = 1}^{n} K_{h_{W}}^{W} (x_{W} - X_{i}^{W}), {\hat{f}}^{B} (x_{B}) = \frac{1}{n} \sum_{i = 1}^{n} K_{h_{B}}^{B} (x_{B} - X_{i}^{B}) .

Estimators $τ_{n}^{W} (x_{W}) / {\hat{f}}^{W} (x_{W})$ and $τ_{n}^{B} (x^{B}) / {\hat{f}}^{B} (x^{B})$ are referred to as the Nadaraya-Watson estimators for $τ_{0}^{W} (x_{W})$ and $τ_{0}^{B} (x^{B})$ .

Similar to $S_{n}^{W, B}$ , we can show $(\sqrt{n} {\tilde{S}}_{n}^{W, B} - {\tilde{a}}_{n}) / {\tilde{σ}}_{n} \overset{d}{\to} N (0, 1)$ , for some ${\tilde{a}}_{n}$ and ${\tilde{σ}}_{n}$ . The tests based on $S_{n}^{W, B}$ and ${\tilde{S}}_{n}^{W, B}$ have nontrivial power against certain local alternative as defined later. However, the one based on ${\tilde{S}}_{n}^{W, B}$ is more powerful. To see this, note that

\sqrt{n} (S_{n}^{W, B} - {\tilde{S}}_{n}^{W, B}) = \sqrt{n} \int_{x_{W} \in Ω^{W}} τ_{n}^{W} (x_{W}) {d_{n}^{W} (x_{W}) - d_{n}^{B} (x_{W, B})} I (x \in \hat{E}) d x_{W} .

(3.3)

With proper choice of η_n, the right-hand side (RHS) of (3.3) is equivalent to

\sqrt{n} \int_{x_{W} \in Ω^{W}} τ_{n}^{W} (x_{W}) {d_{n}^{W} (x_{W}) - d_{n}^{B} (x_{W, B})} I (x_{W} \in E_{0}) d x_{W},

(3.4)

where $E_{0} = {x_{W} : τ_{0}^{W} (x_{W}) = 0, τ_{0}^{B} (x_{W, B}) = 0}$ .

The asymptotic mean of (3.4) remains the same under the null and local alternative. However, it has non-degenerate variance and is asymptotically independent of ${\tilde{S}}_{n}^{W, B}$ . This implies that $\sqrt{n} S_{n}^{W, B} - a_{n}$ and $\sqrt{n} {\tilde{S}}_{n}^{W, B} - {\tilde{a}}_{n}$ have the same shifted mean under the local alternative, but the variance of ${\tilde{S}}_{n}^{W, B}$ is smaller than $S_{n}^{W, B}$ when the set E₀ has nonzero measure. From now on, we focus on the test statistic ${\tilde{S}}_{n}^{W, B}$ .

3.2. Consistency of the test.

Define

μ^{W} (x_{W}) = E [{\frac{A}{π (X)} - \frac{1 - A}{1 - π (X)}}^{2} Y^{2} | X^{W} = x_{W}] f^{W} (x_{W}) K_{*}^{W} (0),

where

K_{*}^{W} (t) = \int_{x_{W}} K^{W} (x_{W}) K^{W} (x_{W} + t) d x_{W} .

For each fixed x_W, μ^W (x_W) is the asymptotic variance of $\sqrt{n {(h_{W})}^{p_{W}}} τ_{n}^{W} (x_{W})$ .

Define $F_{0} = {x_{W} \in Ω^{W} : τ_{0}^{W} (x_{W}) = 0, τ_{0}^{B} (x^{B}) \neq 0}$ . The asymptotic mean and variance of $\sqrt{n} {\tilde{S}}_{n}^{W, B}$ are given by

{\tilde{a}}_{n} = \frac{1}{\sqrt{2 π {(h_{W})}^{p_{W}}}} \int_{x_{W} \in F_{0}} \sqrt{μ_{n}^{W} (x_{W})} d x_{W},

{\tilde{σ}}^{2} = \int_{\begin{matrix} x_{W} \in F_{0} \\ t \in {[- 1, 1]}^{p_{W}} \end{matrix}} μ^{W} (x_{W}) cov (\max {\sqrt{1 - ρ^{2} (t)} ℤ_{1} + ρ (t) ℤ_{2}, 0}, \max {ℤ_{2}, 0}) d x_{W} d t,

where $ℤ_{1}$ and $ℤ_{2}$ are independent standard normal random variables, $ρ (t) = K_{*}^{W} (t) / K_{*}^{W} (0)$ , and

μ_{n}^{W} (x_{W}) = \frac{1}{{(h_{W})}^{p_{W}} K_{*}^{W} (0)} E [\frac{μ^{W} (X^{W})}{f^{W} (X^{W})} {K^{W} (\frac{x_{W} - X^{W}}{h_{W}})}^{2}] .

To estimate ${\tilde{a}}_{n}$ and ${\tilde{σ}}^{2}$ , we first provide nonparametric estimators for μ^W (x_W ) and F₀. Define

{\hat{μ}}_{n}^{W} (x_{W}) = \frac{1}{n {(h_{W})}^{p_{W}}} \sum_{i = 1}^{n} {(\frac{A_{i}}{π_{i}} - \frac{1 - A_{i}}{1 - π_{i}}) Y_{i}}^{2} {K^{W} (\frac{x_{W} - X_{i}^{W}}{h_{W}})}^{2},

\hat{F} = {x_{W} \in Ω^{W} : | τ_{n}^{W} (x_{W}) / {\hat{f}}^{W} (x_{W}) | \leq η_{n}, | τ_{n}^{B} (x_{W, B}) / {\hat{f}}^{B} (x_{W, B}) | > η_{n}},

where η_n is defined in (3.2). For any set F ⊆ Ω, define ${\hat{a}}_{n} (F)$ and ${\hat{σ}}_{n}^{2} (F)$ as

{\hat{a}}_{n} (F) = \frac{1}{\sqrt{2 π {(h_{W})}^{p w}}} \int_{x_{W} \in F} \sqrt{{\hat{μ}}_{n}^{W} (x_{W})} d x_{W},

{\hat{σ}}_{n}^{2} (F) = \int_{\begin{matrix} x_{W} \in F \\ t \in {[- 1, 1]}^{p} W \\ \times d x_{W} d t . \end{matrix}} {\hat{μ}}_{n}^{W} (x_{W}) cov (\max {\sqrt{1 - ρ^{2} (t)} ℤ_{1} + ρ (t) ℤ_{2}, 0}, \max {ℤ_{2}, 0})

We estimate ${\tilde{a}}_{n}$ and ${\tilde{σ}}_{n}^{2}$ by ${\hat{a}}_{n} (\hat{F})$ and ${\hat{σ}}_{n}^{2} (\hat{F})$ , respectively.

Let ν(·) be the Lebesgue measure. Define the test statistic

{\tilde{T}}_{n}^{W, B} = {\begin{array}{l} {\sqrt{n} {\tilde{S}}_{n}^{W, B} - {\hat{a}}_{n} (\hat{F})} / {\hat{σ}}_{n} (\hat{F}), & if ν (\hat{F}) \neq 0 \\ {\sqrt{n} {\tilde{S}}_{n}^{W, B} - {\hat{a}}_{n} (Ω^{W})} / {\hat{σ}}_{n} (Ω^{W}), & otherwise . \end{array}

We reject the null when ${\tilde{T}}_{n}^{W, B} > z_{α}$ .

Remark 3.2. When $ν (\hat{F}) = 0$ , ${\hat{σ}}_{n} (\hat{F}) = 0$ and hence the test statisticis not well ${\sqrt{n} {\tilde{S}}_{n}^{W, B} - {\hat{a}}_{n} (\hat{F})} / {\hat{σ}}_{n} (\hat{F})$ is not well defined. Therefore, in this case we consider ${\sqrt{n} {\tilde{S}}_{n}^{\hat{W}, B} - {\hat{a}}_{n} (Ω^{W})} / {\hat{σ}}_{n} (Ω^{W})$ instead. When F₀ is a strict sub-set of Ω, the test statistic based on ${\sqrt{n} {\tilde{S}}_{n}^{W, B} - {\hat{a}}_{n} (Ω^{W})} / {\hat{σ}}_{n} (Ω^{W})$ will be conservative.

We write $a_{n} ≍ b_{n}$ for two sequences {a_n},{b_n} if there exist some universal constants c, C > 0 such that cb_n ≤ a_n ≤ Cb_n. To study the theoretical properties of the test, we first introduce some conditions.

(A1.) Assume that Ω^W is a bounded subset in $ℝ^{p_{W}}$ . Assume f^W is continuous and satisfies $\inf_{x_{W} \in Ω^{W}} f^{W} (x_{W}) > 0$ and $\sup_{x_{W} \in Ω^{W}} f^{W} (x_{W}) < \infty$ . Assume $τ_{0}^{W}$ and $τ_{0}^{B}$ are continuous. Moreover, fW, $τ_{0}^{W}$ , f^B and $τ_{0}^{B}$ are stimes differentiable almost everywhere with uniformly bounded derivatives, for some integer s > 0.

(A2.) Assume $K^{W} (x_{W}) = \prod_{j = 1}^{p_{W}} K_{j} (x_{(j)})$ , and $K^{B} (x_{B}) = \prod_{j = 1}^{p_{B}} K_{j + p W} (x_{(j)})$ , where each K_j is an s-order kernel function with support {μ ∈ $ℝ$ : |μ| ≤ 1/2} and bounded, and is of bounded variation and integrates to 1.

(A3.) Assume Eexp(t|Y |) < ∞ for some t > 0, and sup_xW∈_ΩW E(Y ⁴|X^W = x_W,A = a) < ∞ for a = 0,1.

(A4.) Assume there exist some constants c₀ and c₁ that 0 < c₀ ≤ π(x) ≤ c₁ < 1,∀x.

(A5.) Assume that μ^W (x_W ) is uniformly continuous and bounded on Ω^W, and inf_xW∈_ΩW μ^W (x_W ) > 0.

(A6.) Assume $n {(h_{W})}^{2 p_{W}} / \log n \to \infty, n {(h_{W})}^{2 s} \to 0, h_{B}^{p_{B}} ≍ h_{W}^{p_{W}}$ .

(A7.) Assume $ν (\partial F_{0}) = 0, ν (Ω^{W} \cap F_{0}^{c}) > 0$ . Assume there exist some constants ξ₀, ${\bar{c}}_{0}$ > 0 such that for any sufficiently small t, ε > 0,

ν ({x_{W} : 0 < | τ_{0}^{W} (x_{W}) | \leq t}) = O (t^{ξ_{0}}), ν ({x_{B} : 0 < | τ_{0}^{B} (x_{B}) | \leq t}) = O (t^{ξ_{0}}), ν ({x_{W} : 0 < | τ_{0}^{W} (x_{W}) | \leq t, | τ_{0}^{B} (x^{B}) | > (1 + ε) t}) \geq {\bar{c}}_{0} t^{ξ_{0}} .

(A8.) Assume η_n satisfies $η_{n}^{2 ξ_{0}} ≫ \log^{ξ_{0} + 1} n / {n {(h | w)}^{p_{W}}}^{ξ_{0}}$ and $n η_{n}^{2 δ_{0} + 2} \to 0$ .

Remark 3.3. Condition (A1) requires Ω^W to be bounded. In practice, if it’s unbounded, we can perform monotone transformations on each component of X to make the support of the transformed variables bounded. Otherwise, we need to focus on a bounded subset $Ω_{0}^{W} = Ω_{0}^{B} \times Ω_{0}^{C} \subseteq Ω^{W}$ , and write ${\tilde{S}}_{n}^{W, B}$ as

\int_{x_{W} \in Ω_{0}^{W}} τ_{n}^{W} (x_{W}) {d_{n}^{W} (x_{W}) - d_{n}^{B} (x_{W, B})} I (x_{W} \notin \hat{E}) d x_{W} .

In addition, we modify H₀ as “For any fixed $x_{B} \in Ω_{0}^{B}$ , τ0(xB,xC)≥0, $\forall x_{C} \in Ω_{0}^{C}$ , or τ0(xB,xC)≥0, $\forall x_{C} \in Ω_{0}^{C}$ .”

Remark 3.4. Condition (A2) requires each K_j to be of order s. The order of the kernel is defined as the first nonzero moment. Condition (A6) requires nh² → ∞ and nh^2s/pW → 0. This implies s > p_W. When p_W > 2, this condition requires each kernel K_j to be of high orders. Such kernels are typically referred to as the bias-reducing kernels. Unlike standard kernel functions, these kernels allow K_j(z) to be negative for some $z \in ℝ$ . Moreover, we assume $h_{W}^{p_{W}} ≍ h_{B}^{p_{B}}$ in (A6). This guarantees $τ_{n}^{W} (x_{W})$ and $τ_{n}^{B} (x_{W, B})$ converge at the same rate.

Remark 3.5. Conditions (A7) is not restrictive. Obviously, this condition holds when $\inf_{x W \in Ω^{W}} | τ_{0}^{W} (x_{W}) | > 0$ . In that case, we can set the constants ξ₀ and ${\bar{c}}_{0}$ to be any positive constants. Moreover, these conditions are satisfied in many other cases. For example, let p = 2, B = {2}, C = {1}. Consider

τ_{0} (x_{(1)}, x_{(2)}) = {\begin{matrix} - {(x_{(1)} + x_{(2)})}^{1 / ξ_{0}}, & if x_{(1)}, x_{(2)} > 0, \\ - {(x_{(1)})}^{1 / ξ_{0}}, & if x_{(1)} > 0, x_{(2)} \leq 0, \\ - {(x_{(2)})}^{1 / ξ 0}, & if x_{(1)} \leq 0, x_{(2)} > 0, \\ 0, & otherwise . \end{matrix}

Then, with some calculation, we can show $ν ({x : 0 < | τ_{0} (x_{(1)}, x_{(2)}) | \leq t}) = ν ({x : x_{(1)}, x_{(2)} > 0, x_{(1)} + x_{(2)} \leq t}) + ν ({x : 0 < x_{(1)} \leq t, x_{(2)} \leq 0}) + ν ({x : x_{(1)} \leq 0, 0 < x_{(2)} \leq t}) = c_{1} t^{2 ξ_{0}} + c_{2} t^{ξ_{0}}$ , for some constants c₁, c₂ > 0.

Note that $| τ_{0}^{{2}} (x_{(2)}) | \geq \min {(x_{(2)})}^{1 / ξ_{0}}$ when x(2) > 0 and $τ_{0}^{{2}} (x_{(2)})$ is a nonzero constant c₃ < 0 for all x₍₂₎ ≤ 0. For sufficiently small t > 0, we obtain

ν ({x : 0 < | τ_{0}^{{2}} (x_{(2)}) | \leq t}) \leq ν ({x : {(x_{(2)})}^{1 / ξ_{0}} \leq t, x_{(2)} > | 0}) = O (t^{ξ_{0}}) .

Besides, for any small ε₀ > 0, we have

\begin{array}{l} ν ({x : 0 < | τ_{0} (x_{(1)}, x_{(2)}) | \leq t, | τ_{0}^{{2}} (x_{(2)}) | > (1 + ε_{0}) t}) \\ \geq ν ({x : 0 < | τ_{0} (x_{(1)}, x_{(2)}) | \leq t, τ_{0}^{{2}} (x_{(2)}) = - c_{3}}) \\ = ν ({x : 0 < {(x_{(1)})}^{1 / ξ_{0}} \leq t, x_{(2)} \leq 0}) = c_{4} t^{ξ_{0}}, \end{array}

for some constant c₄ > 0. This verifies (A7).

Theorem 3.6. Assume Conditions (A1)-(A8) hold. Then, under H₀, we have

\lim_{n} P r ({\tilde{T}}_{n}^{W, B} > z_{α}) \leq α,

for 0 < α ≤ 0.5, where the equality holds when ν(F₀) > 0.

Remark 3.7. Theorem 3.6 shows ${\tilde{T}}_{n}^{W, B}$ has correct size under H₀. When ν(F₀) = 0, we can show with probability tending to 1, $\sqrt{n} {\tilde{S}}_{n}^{W, B} \leq {\hat{a}}_{n}$ , and hence

\lim_{n} P r ({\tilde{T}}_{n}^{W, B} > z_{α}) = 0.

When ν(F₀) ≠ 0, we will show that ${\tilde{T}}_{n}^{W, B}$ is asymptotically normal. The proof is based on the well-known Poissoinization technique which introduces a Poissonized version of ${\tilde{S}}_{n}^{W, B}$ and transforms the integral into summation of mean zero 1-dependent random fields (see for example Giné, Mason and Zaitsev, 2003; Mason and Polonik, 2009; Chang, Lee and Whang, 2015). The asymptotic normality thus follows by standard central limit theorem for m-dependent random fields (Shergin, 1990). The details are given in the Supplementary Appendix.

Theorem 3.8. Assume Conditions (A1)-(A8) hold. Then, under H₁, we have

\lim_{n} P r ({\tilde{T}}_{n}^{W, B} > z_{α}) \to 1.

Remark 3.9. Theorem 3.8 shows ${\tilde{T}}_{n}^{W, B}$ having power going to 1 against fixed alternatives. Together with Theorem 3.6, Theorem 3.8 suggests that our testing procedure is consistent.

3.3. Local alternatives.

In this subsection, we investigate the power of the proposed test under local alternatives. We write τ_n,0(x) as the contrast function and $τ_{n, 0}^{D} (x_{D}) = E {τ_{n, 0} (X) | X^{D} = x_{D}}$ for a given subset D ⊆ I, with the intention that these functions are allowed to vary with n. Consider the following sequence of local alternatives:

H_{a} : τ_{n, 0}^{W} (x_{W}) = τ_{0}^{W} (x_{W}) + n^{- 1 / 2} δ_{0}^{W} (x_{W}),

for some continuous functions $τ_{0}^{W}$ and $δ_{0}^{W}$ on Ω^W, where for any fixed x_B ∈ Ω^B,

$τ_{0}^{W} (x_{B}, x_{C}) \leq 0$ for any $τ_{0}^{W} (x_{B}, x_{C}) \geq 0$ for any x_C ∈ Ω^C, and

$δ_{0}^{W} (x_{B}, x_{C}) \leq 0$ for any $δ_{0}^{W} (x_{B}, x_{C}) \geq 0$ for any x_C ∈ Ω^C.

In addition,

δ_{0}^{W} (x_{B}, x_{C}) τ_{0}^{W} (x_{B}, x_{C}) \leq 0, \forall x_{B} \in Ω^{B}, x_{C} \in Ω^{C} .

Recall that $F_{0} = {x_{W} \in Ω^{W} : τ_{0}^{W} (x_{W}) = 0, τ_{0}^{B} (x_{W, B}) \neq 0}$ . Let ${\dot{F}}_{0}$ and ∂F₀ denote its interior and boundary, respectively. Since the contrast function $τ_{n, 0}^{W}$ varies with n, we state a more precise definition of conditional qualitative treatment effects below.

Definition 3.1 (CQTE, continued).

Variables in C have qualitative treatment effects conditional on variables in B if there exists some nonempty sets $C_{1}, C_{2} \in ℝ^{p_{C}}$ , $B \in ℝ^{p_{B}}$ , such that (i) $P r {{(X^{B}, X^{C})}^{T} \in B \times C_{1}} > 0$ , and $P r {{(X^{B}, X^{C})}^{T} \in B \times C_{2}} > 0$ ; and (ii) for any x_C1 ∈ C₁, x_C2 ∈ C₂ and x_B ∈ B, there exists a sequence n_k → ∞ as k → ∞, such that

\underset{a = 0, 1}{\arg \max} {a τ_{n_{k}, 0}^{W} (x_{B}, x_{C_{1}})} \neq \underset{a = 0, 1}{\arg \max} {a τ_{n_{k}, 0}^{W} (x_{B}, x_{C_{2}})} .

(3.5)

Remark 3.10. It’s immediate to see that (3.5) is a modified version of (2.2) where we allow the conditional expectation E{Y ^∗(a)|X^B,X^C} to vary with n. By the definition of $τ_{0}^{W}$ and $δ_{0}^{W}$ , we can see that the magnitude of $δ_{0}^{W}$ affects CQTE. We provide a theorem which formally characterizes such results below.

Theorem 3.11. Assume $δ_{0}^{W}$ is continuous and bounded on Ω^W . Assume ν(∂F₀) = 0. Under conditions in Theorem 2.2, the following statements are equivalent:

X^C doesn’t have QTE conditional on X^B.
For any ε > 0, there exist a set N_ε and a positive integer n_ε such that ν(N_ε) ≤ ε, and for all n ≥ n_ε, the following holds: for any fixed x_B, we have τ_n,0(x_W ) ≥ 0 for any x_W ∉ N_ε such that x_W,B = x_B or τ_n,0(x_W ) ≤ 0 for any x_W ∉ N_ε such that x_W,B = x_B.
For all $x_{W} \in {\dot{F}}_{0}, δ_{0}^{W} (x_{W}) = 0$ .
$\int_{x_{W} \in F_{0}} | δ_{0}^{W} (x_{W}) | f^{W} (x_{W}) d x_{W} = 0$ .

Remark 3.12. Result (iv) implies H₀ holds when Pr(X^W ∈ F₀) = 0, or $P r {τ_{0}^{W} (X^{W}) = 0} = 0$ . This implies that the local alternatives are nonstandard and only exist in the nonregular cases, i.e., there is a positive probability such that the optimal treatment decision based on X^W is not defined.

Remark 3.13. Theorem 3.11 suggests the quantity

\int_{x_{W} \in F_{0}} | δ_{0}^{W} (x_{W}) | f^{W} (x_{W}) d x_{W}

plays a role in determining CQTE of X^C conditional on X^B. In the theorem below, we establish the power of our test statistic ${\tilde{T}}_{n}^{W, B}$ under the local alternatives. It can be seen that this quantity is closely related to the power of our test. Theorem 3.14. Assume Conditions (A1)-(A8) hold and that $δ_{0}^{W}$ is bounded on Ω^W . Then, under H_a with $\int_{x_{W} \in F_{0}} | δ_{0}^{W} (x_{W}) | f^{W} (x_{W}) d x_{W} > 0$ , we have

\lim_{n} P r ({\tilde{T}}_{n}^{W, B} > z_{α}) = 1 - Φ (z_{α} - \frac{1}{2 \tilde{σ}} \int_{x_{W} \in F_{0}} | δ_{0}^{W} (x_{W}) | f^{W} (x_{W}) d x_{W})

where Φ(z) = Pr( $ℤ_{0} \leq z$ ) for a standard normal random variable $ℤ_{0}$ .

4. Doubly robust test statistic.

In an observational study, the propensity score π_i’s are usually unknown. In practice, we posit a parametric model π(x,α) for the propensity score, for example, a logistic regression model π(x,α) = exp(x^Tα)/{1 + exp(x^Tα)}. We can obtain an estimator $\hat{α}$ of α based on data {(A_i,X_i),i = 1,…,n}, by either maximizing the likelihood function or solving estimating equations. The estimator $\hat{α}$ will converge to some population-level parameters α₀. When the model π(x,α) is correctly specified, α₀ is the true parameter in the model. When the model is wrong, α₀ corresponds to some least false parameters that have been widely studied in the literature (cf. White, 1982; Li and Duan, 1989).

We also posit some parametric models Φ₀(x,θ) and Φ₁(x,ζ) for E(Y |X = x,A = 0) and E(Y |X = x,A = 1), respectively. Let $\hat{θ}$ and $\hat{ζ}$ denote the estimator of θ and ζ, respectively, which converge to some parameters θ₀ and ζ₀, under potential model misspecification. Let ${\hat{π}}_{i} = π (X_{i}, \hat{α})$ , ${\hat{Φ}}_{0 i} = Φ_{0} (X_{i}, \hat{θ})$ and ${\hat{Φ}}_{1 i} = Φ_{1} (X_{i}, \hat{ζ})$ . Define the following doubly robust estimators for $τ_{n, 0}^{W} (x_{W}) f^{W} (x_{W})$ and $τ_{n, 0}^{B} (x_{B}) f^{B} (x_{B})$ :

τ_{n, D R}^{W} (x_{W}) = \frac{1}{n} \sum_{i = 1}^{n} [{\frac{A_{i}}{{\hat{π}}_{i}} Y_{i} - (\frac{A_{i}}{{\hat{π}}_{i}} - 1) {\hat{Φ}}_{1 i}} - {\frac{1 - A_{i}}{1 - {\hat{π}}_{i}} Y_{i} - (\frac{1 - A_{i}}{1 - {\hat{π}}_{i}} - 1) {\hat{Φ}}_{0 i}}] K_{h_{W}}^{W} (x_{W} - X_{i}^{W}),

Remark 4.1. We can show that estimators $τ_{n, D R}^{W} (x_{W})$ and $τ_{n, D R}^{B} (x_{B})$ are consistent when either π(x,α) or Φ₀(x,θ) and Φ₁(x,ζ) are correctly specified.

Let $d_{n, D R}^{W} (x_{W}) = I {τ_{n, D R}^{W} (x_{W}) \geq 0}$ and $d_{n, D R}^{B} (x_{B}) = I {τ_{n, D R}^{B} (x_{B}) \geq 0}$ . Consider

{\tilde{S}}_{n, D R}^{W, B} = \int_{x_{W} \in Ω^{W}} τ_{n, D R}^{W} (x_{W}) {d_{n, D R}^{W} (x_{W}) - d_{n, D R}^{B} (x_{W, B})} I (x_{W} \notin {\hat{E}}_{D R}) d x_{W},

where

{\hat{E}}_{D R} = {x_{W} \in Ω^{W} : | \frac{τ_{n, D R}^{W} (x_{W})}{f^{W} (x_{W})} | \leq η_{n}, | \frac{τ_{n, D R}^{B} (x_{W, B})}{f^{B} (x_{W, B})} | \leq η_{n}} .

For any set F, define

{\hat{a}}_{n, D R} (F) = \frac{1}{\sqrt{2 π {(h_{W})}^{p_{W}}}} \int_{x_{W} \in F_{0}} \sqrt{{\hat{μ}}_{n, D R}^{W} (x_{W})} d x_{W},

{\hat{σ}}_{n, D R}^{2} (F) = \int_{x_{W} \in F} \int_{t \in {[- 1, 1]}^{p_{W}}} {\hat{μ}}_{n, D R}^{W} (x_{W}) c o v (\max {\sqrt{1 - ρ^{2} (t)} ℤ_{1} + ρ (t) ℤ_{2}, 0}, \max {ℤ_{2}, 0}) d x_{W} d t,

where

{\hat{μ}}_{n, D R}^{W} (x_{W}) = \frac{1}{n {(h_{W})}^{p_{W}}} \sum_{i = 1}^{n} {[{\frac{A_{i}}{{\hat{π}}_{i}} Y_{i} - (\frac{A_{i}}{{\hat{π}}_{i}} - 1) {\hat{Φ}}_{1 i}} - {\frac{1 - A_{i}}{1 - {\hat{π}}_{i}} Y_{i} - (\frac{1 - A_{i}}{1 - {\hat{π}}_{i}} - 1) {\hat{Φ}}_{0 i}}]}^{2} {K^{W} (\frac{x_{W} - X_{i}^{W}}{h_{W}})}^{2} .

We estimate the asymptotic mean and variance of $\sqrt{n} {\tilde{S}}_{n, D R}^{W, B}$ by ${\hat{a}}_{n, D R} ({\hat{F}}_{D R})$ and ${\hat{σ}}_{n, D R}^{2} ({\hat{F}}_{D R})$ , respectively, with

{\hat{F}}_{D R} = {x_{W} \in Ω^{W} : | τ_{n, D R}^{W} (x_{W}) / {\hat{f}}^{W} (x_{W}) | \leq η_{n}, | τ_{n, D R}^{B} (x_{W, B}) / {\hat{f}}^{B} (x_{W, B}) | > η_{n}} .

Define

{\tilde{T}}_{n, D R}^{W, B} = {\begin{array}{l} {\sqrt{n} {\tilde{S}}_{n, B}^{W, B} - {\hat{a}}_{n, D R} ({\hat{F}}_{D R})} / {\hat{σ}}_{n, D R} ({\hat{F}}_{D R}), & if ν ({\hat{F}}_{D R}) = 0, \\ {\sqrt{n} {\tilde{S}}_{n, D R}^{W, B} - {\hat{a}}_{n, D R} (Ω^{W})} / {\hat{σ}}_{n, D R} (Ω^{W}), & otherwise . \end{array}

We reject the null when ${\tilde{T}}_{n, D R}^{W, B} > z_{α}$ .

To establish the asymptotic distributions of ${\tilde{T}}_{n, D R}^{W, B}$ under the null and local alternative, we impose the following conditions.

(A4’.) Assume there exist some constants $c_{0}^{'}$ and $c_{1}^{'}$ such that $0 < c_{0}^{'} \leq π (x, α_{0}) \leq c_{1}^{'} < 1$ for all x ∈ Ω.

(A5’.) Assume that $μ_{D R}^{W} (x_{W})$ is uniformly continuous and bounded on Ω^W, and $\inf_{x_{W} \in Ω^{W}} μ_{D R}^{W} (x_{W}) > 0$ , where

μ_{D R}^{W} (x_{W}) = E [{(\frac{A}{π (X, α_{0})} - \frac{1 - A}{1 - π (X, α_{0})}) Y - (\frac{A}{π (X, α_{0})} - 1) Φ_{1} (X, θ_{0}) + (\frac{1 - A}{1 - π (X, α_{0})} - 1) Φ_{0} (X, ζ_{0})}^{2} | X^{W} = x_{W}] f^{W} (x_{W}) K_{*}^{W} (0) .

(A9.) Assume that π(x,α) is twice continuously differentiable with respect to α; k∂π(x,α₀)/∂αk₂ is uniformly bounded for all x ∈ Ω; and the elements in ∂²π(x,α)/∂α∂α^T are uniformly bounded for all x ∈ Ω and α in a small neighborhood of α₀.

(A10.) Assume that Φ₀(x,θ) and Φ₁(x,ζ) are twice continuously differentiable with respect to θ and ζ, respectively; Φ₀(x,θ₀), Φ₁(x,ζ₀), k∂Φ₀(x,θ₀)/∂θk₂ and k∂Φ₁(x,ζ)/∂ζ₀k₂ are uniformly bounded for all x ∈ Ω; and the elements in the matrices ∂²Φ₀(x,θ)/∂θ∂θ^T and ∂²Φ₁(x,ζ)/∂ζ∂ζ^T are uniformly bounded for all x ∈ Ω and θ, ζ in small neighborhoods of θ₀ and ζ₀, respectively.

(A11.) Assume that the estimators $\hat{α}$ , $\hat{θ}$ and $\hat{ζ}$ have the following linear representations

\hat{α} - α_{0} = \frac{1}{n} \sum_{i} ξ_{1} (O_{i}) + o_{p} (\frac{1}{\sqrt{n}}),

\hat{θ} - θ_{0} = \frac{1}{n} \sum_{i} ξ_{2} (O_{i}) + o_{p} (\frac{1}{\sqrt{n}}),

\hat{ζ} - ζ_{0} = \frac{1}{n} \sum_{i} ξ_{3} (O_{i}) + o_{p} (\frac{1}{\sqrt{n}}),

for some functions ξ₁, ξ₂ and ξ₃ with E{ξ_j(O_i)} = 0 and $E {ξ_{j} (O_{i}) ξ_{j}^{T} (O_{i})} < \infty$ for j = 1,2,3.

Remark 4.2. Conditions (A4’) and (A5’) are similar to (A4) and (A5). Conditions (A9)-(A11) are required for establishing the asymptotic normality of the estimators for misspecified models (White, 1982).

Theorem 4.3 (Double robustness of ${\tilde{T}}_{n, D R}^{W, B}$ . Assume Conditions (A1)(A3), (A4’), (A5’) and (A6)-(A11) hold. In addition, assume either π(x,α) or Φ₀(x,θ) and Φ₁(x,ζ) are correctly specified. Then, under H₀, for any 0 < α ≤ 0.5, we have

\lim_{n} \Pr ({\tilde{T}}_{n, D R}^{W, B} > z_{α}) \leq α,

where the equality holds when ν(F₀) > 0. In addition, under H₁, we have

\lim_{n} P r ({\tilde{T}}_{n, D R}^{W, B} > z_{α}) \to 1.

Remark 4.4. Theorem 4.3 establishes the consistency of the proposed doubly robust test statistic ${\tilde{T}}_{n, D R}^{W, B}$ . Next, we establish the power of the test under the local alternative.

Theorem 4.5. Assume Conditions in Theorem 4.3 hold. Under H_a, assume that $δ_{0}^{W}$ is continuous and bounded on Ω^W, and

\int_{x_{W} \in F} | δ_{0}^{W} (x_{W}) | f^{W} (x_{W}) d x_{W} > 0.

Then, we have

\lim_{n} P r ({\tilde{T}}_{n, D R}^{W, B} > z_{α}) \geq 1 - Φ (z_{α} - \frac{1}{2 {\tilde{σ}}_{D R}} \int_{x_{W} \in F} | δ_{0}^{W} (x_{W}) | f^{W} (x_{W}) d x_{W}) .

Remark 4.6. For a given function δ₀, the power of ${\tilde{T}}_{n, D R}^{W, B}$ increases as ${\tilde{σ}}_{D R}$ decreases. When the propensity score model is correctly specified, it can be shown that for each x_W ∈ Ω^W, $μ_{D R}^{W} (x_{W})$ achieves its minimum when

Φ_{0} (x, θ_{0}) = E (Y | X = x, A = 0), Φ_{1} (x, ζ_{0}) = E (Y | X = x, A = 1) .

(4.1)

Therefore, ${\tilde{σ}}_{D R}$ achieves its minimum if (4.1) holds. This suggests ${\tilde{T}}_{n, D R}^{W, B}$ has the greatest power when the posited models for the propensity score and conditional means of Y given X and A are correctly specified.

5. Implementation details.

In Sections 3 and 4, we only consider continuous covariates for notational convenience. In this section, we present a more general testing framework allowing both continuous and discrete covariates, and provide some implementation details. Specifically, we consider the following two cases: (i) all covariates are discrete; and (ii) at least one covariate is continuous. The test statistics are different in these two cases. We focus on randomized studies and assume the propensity score is known. A doubly-robust version of the test statistic can be similarly derived as in Section 4 to deal with data from observational studies. We omit the details to save space.

5.1. All covariates are discrete.

When all covariates are discrete, for each x, we calculate

τ_{n}^{W} (x_{W}) = \frac{1}{n} \sum_{i = 1}^{n} (\frac{A_{i}}{π_{i}} - \frac{1 - A_{i}}{1 - π_{i}}) Y_{i} I (X_{i}^{W} = x_{W}),

τ_{n}^{B} (x_{B}) = \frac{1}{n} \sum_{i = 1}^{n} (\frac{A_{i}}{π_{i}} - \frac{1 - A_{i}}{1 - π_{i}}) Y_{i} I (X_{i}^{B} = x_{B}),

{\hat{f}}^{W} (x_{W}) = \frac{1}{n} \sum_{i = 1}^{n} I (X_{i}^{W} = x_{W}), {\hat{f}}^{B} (x^{B}) = \frac{1}{n} \sum_{i = 1}^{n} I (X_{i}^{B} = x_{B}),

{\hat{μ}}_{n}^{W} (x_{W}) = \frac{1}{n} \sum_{i = 1}^{n} {(\frac{A_{i}}{π_{i}} - \frac{1 - A_{i}}{1 - π_{i}}) Y_{i} I (X_{i}^{W} = x_{W}) - τ_{n}^{W} (x_{W})}^{2},

{\hat{μ}}_{n}^{B} (x_{B}) = \frac{1}{n} \sum_{i = 1}^{n} {(\frac{A_{i}}{π_{i}} - \frac{1 - A_{i}}{1 - π_{i}}) Y_{i} I (X_{i}^{B} = x_{B}) - τ_{n}^{B} (x_{B})}^{2} .

Define

\hat{E} = {x_{W} \in Ω^{W} : | \frac{τ_{n}^{W} (x_{W})}{{\hat{f}}^{W} (x_{W})} | \leq C_{1} η_{n}, | \frac{τ_{n}^{B} (x_{W, B})}{{\hat{f}}^{B} (x_{W, B})} | \leq C_{2} η_{n}},

(5.1)

\hat{F} = {x_{W} \in Ω^{W} : | \frac{τ_{n}^{W} (x_{W})}{\hat{f} W (x_{W})} | \leq C_{1} η_{n}, | \frac{τ_{n}^{B} (x_{W, B})}{{\hat{f}}^{B} (x_{W, B})} | > C_{2} η_{n}} .

(5.2)

Compute

{\tilde{S}}_{n}^{W, B} = \sum_{x_{W} \notin \hat{E}} τ_{n}^{W} (x_{W}) {d_{n}^{W} (x_{W}) - d_{n}^{B} (x_{W, B})} .

Unlike results in Section 3 and 4, the limiting distribution of ${\tilde{S}}_{n}^{W, B}$ is not normal. If $\hat{F} \neq \emptyset$ , we reject the null when $\sqrt{n} {\tilde{S}}_{n}^{W, B} > {\hat{c}}_{α} (\hat{F})$ where ${\hat{c}}_{α} (F)$ is the upper α-quantile of the random variable $\sum_{x_{W} \in F} \sqrt{{\hat{μ}}_{n}^{W} (x_{W})} \max (ℤ_{x_{W}}, 0)$ conditional on ${{\hat{μ}}_{n}^{W} (x_{W})}_{x_{W} \in Ω^{W}}$ , where ${ℤ_{x_{W}}}_{x_{W} \in Ω^{W}}$ are independent standard normal random variables. Otherwise, we reject the null when $\sqrt{n} {\tilde{S}}_{n}^{W, B} > {\hat{c}}_{α} (Ω^{W})$ . A formal justification of the aforementioned testing procedure is given in Section 14 of the supplementary article.

5.2. Not all covariates are discrete.

Assume W = W_C ∪ W_D and B = B_C ∪ B_D, where W_C,B_C are the sets of continuous variables and W_D,B_D are the sets of discrete covariates. Denoted by $p_{W_{C}}$ , $p_{W_{D}}$ , $p_{B_{C}}$ and $p_{B_{D}}$ the numbers of elements in these sets. When $p_{B_{C}} > 0$ , define ω_i = {A_i/π_i −(1− A_i)/(1 − π_i)}Y_i and

τ_{n}^{W} (x_{W}) = \frac{1}{n \prod_{j \in W_{C}} ({\hat{s}}_{j} h_{W})} \sum_{i = 1}^{n} ω_{i} \prod_{j \in W_{C}} K (\frac{x_{W, (j)} - X_{i}^{(j)}}{{\hat{s}}_{j} h_{W}}) I (x_{W_{D}} = X_{i}^{W_{D}}),

τ_{n}^{B} (x_{B}) = \frac{1}{n \prod_{j \in B_{C}} ({\hat{s}}_{j} h_{B})} \sum_{i = 1}^{n} ω_{i} \prod_{j \in B_{C}} K (\frac{x_{W, (j)} - X_{i}^{(j)}}{{\hat{s}}_{j} h_{B}}) I (x_{B_{D}} = X_{i}^{B_{D}}),

{\hat{μ}}_{n}^{W} (x_{W}) = \frac{1}{n \prod_{j \in W_{C}} ({\hat{s}}_{j} h_{W})} \sum_{i = 1}^{n} {ω_{i} \prod_{j \in W_{C}} K (\frac{x_{W, (j)} - X_{i}^{(j)}}{{\hat{s}}_{j} h_{W}}) I (x_{W_{D}} = X_{i}^{W_{D}})}^{2},

{\hat{μ}}_{n}^{B} (x_{B}) = \frac{1}{n \prod_{j \in B_{C}} ({\hat{s}}_{j} h_{B})} \sum_{i = 1}^{n} {ω_{i} \prod_{j \in B_{C}} K (\frac{x_{W, (j)} - X_{i}^{(j)}}{{\hat{s}}_{j} h_{B}}) I (x_{B_{D}} = X_{i}^{B_{D}})}^{2},

{\hat{f}}^{W} (x_{W}) = \frac{1}{n \prod_{j \in W_{C}} ({\hat{s}}_{j} h_{W})} \sum_{i = 1}^{n} \prod_{j \in W_{C}} K (\frac{x_{W, (j)} - X_{i}^{(j)}}{{\hat{s}}_{j} h_{W}}) I (x_{W_{D}} = X_{i}^{W_{D}}),

{\hat{f}}^{B} (x_{B}) = \frac{1}{n \prod_{j \in B_{C}} ({\hat{s}}_{j} h_{B})} \sum_{i = 1}^{n} \prod_{j \in B_{C}} K (\frac{x_{W, (j)} - X_{i}^{(j)}}{{\hat{s}}_{j} h_{B}}) I (x_{B_{D}} = X_{i}^{B_{D}}),

where ${\hat{s}}_{j}$ denotes the sampling variance of the jth covariate. In our numerical studies, we use a fourth-order Epanechnikov kernel for K, i.e.

K (u) = \frac{45}{16} (1 - \frac{28}{3} u^{2}) (1 - 4 u^{2}) .

It can be shown that $\int_{u} K {(u)}^{j} d u = 0$ for j = 1,2,3. Then we calculate

{\tilde{S}}_{n}^{W, B} = \sum_{x_{W, W_{D}}} \int_{x_{W, W_{C}}} τ_{n}^{W} (x_{W}) {d_{n}^{W} (x_{W}) - d_{n}^{B} (x_{W, B})} I (x_{W} \notin \hat{E}) d x_{W, W_{C}} .

where $x_{W, W_{C}}$ and $x_{W, W_{D}}$ are the sub-vectors of x_W formed by elements in W_C and W_D.

When $p_{W_{C}} \leq 2$ , the integral in ${\tilde{S}}_{n}^{W, B}$ is computed via a midpoint rule with a uniform grid. Specifically, for each j ∈ W, denoted by m_j and M_j the minimum and maximum value of x_(j). We divide the interval [m_j,M_j] into L = 200 subintervals of equal width. Let z_k,(j),k = 1,…,L, denote the midpoints for these intervals, $z_{(\bar{k})} = {(z_{k_{1}, (1)}, \dots, z_{k_{W_{C}}}, p_{W_{C}})}^{T}$ for $\bar{k} = {(k_{1}, \dots, k_{W_{C}})}^{T}$ , and $z_{W (\bar{k})}$ and $z_{B (\bar{k})}$ the sub-vectors formed by elements in W_C and B_C respectively. We approximate ${\tilde{S}}_{n}^{W, B}$ by

I^{*} \sum_{x_{W_{D}}, \bar{k}} τ_{n}^{W} (z_{W (\bar{k})}, x_{W_{D}}) {d_{n}^{W} (z_{W (\bar{k})}, x_{W_{D}}) - d_{n}^{B} (z_{B (\bar{k})}, x_{W_{D}, B_{D}})} I {(z_{W (\bar{k})}, x_{W_{D}}) \notin \hat{E}},

where $I^{*} = \prod_{j \in W^{C}} (M_{j} - m_{j}) / L^{p_{W_{C}}}$ , and $τ_{n}^{W} (x_{W}_{, W_{C}}, x_{W, W_{D}})$ and $τ_{n}^{B} (x_{B, B_{C}}, x_{B, B_{D}})$ are shorthands for $T_{n}^{W} (x_{W})$ and $T_{n}^{B} (x_{B})$ , $d_{n}^{W} (z_{W (k)}, x_{W_{D}}) = I (τ_{n}^{W} (z_{W (k)}, x_{W_{D}}) \geq 0)$ and $d_{n}^{B} (z_{B (k)}, x_{B_{D}}) = I (τ_{n}^{B} (z_{B (k)}, x_{B_{D}}) \geq 0)$ .

If $p_{W_{C}} > 2$ , we approximate the integral using Monte Carlo methods. Specifically, we generate N = 5000 random numbers Z^(k), uniformly distributed in ∏_j[m_j,M_j], and calculate

I^{'} \sum_{x_{W_{D}}} \sum_{k = 1}^{N} τ_{n}^{W} (Z^{W (k)}, x_{W_{D}}) {d_{n}^{W} (Z^{W (k)}, x_{W_{D}}) - d_{n}^{B} (Z^{B (k)}, x_{W_{D}, B_{D}})} \times I {(Z^{W (k)}, x_{W_{D}}) \notin \hat{E}},

where $I^{'} = \prod_{j \in W^{C}} (M_{j} - m_{j}) / N$ , Z^W(k) and Z^B(k) are the sub-vectors of Z^(k) formed by elements in W_C and B_C.

When $\hat{F} \neq \emptyset$ , we calculated ${\hat{a}}_{n}$ and ${\hat{σ}}_{n}^{2}$ by

{\hat{a}}_{n} = \frac{1}{\sqrt{2 π {(h_{W})}^{p_{W}}}} \sum_{x_{W, W_{D}}} \int_{x_{W, W_{C}}} \sqrt{{\hat{μ}}_{n}^{W} (x_{W})} I (x_{W} \in \hat{F}) d x_{W, W_{C}},

{\hat{σ}}_{n}^{2} = \sum_{x_{W, W_{D}}} \int_{\begin{matrix} x_{W, W_{C}} \\ t \in {[- 1, 1]}^{p_{W_{C}}} \end{matrix}} {\hat{μ}}_{n}^{W} (x_{W}) I (x_{W} \in \hat{F}) c o v (\max {\sqrt{1 - ρ^{2} (t)} ℤ_{1} + ρ (t) ℤ_{2}, 0}, \max {ℤ_{2}, 0}) d x_{W, W_{C}} d t .

Definitions of $\hat{E}$ and $\hat{F}$ are given in (5.1) and (5.2). When $\hat{F} = \emptyset$ , we replace $\hat{F}$ by Ω in the integral. The above integrals are calculated similarly as for ${\tilde{S}}_{n}^{W, B}$ . We reject the null when $\sqrt{n} {\tilde{S}}_{n}^{W, B} \geq {\hat{a}}_{n} + {\hat{σ}}_{n} z_{α}$ .

6. Simulations.

To evaluate the numerical performance of the proposed testing procedure, we consider simulation studies based on the following model:

Y = h_{0} (X^{(1)}, X^{(2)}) + A τ_{0} (X^{(1)}, X^{(2)}) + e,

where h₀ denotes the baseline, τ₀ denotes the contrast, and e ∼ N(0,0.25) is independent of A and X = (X⁽¹⁾,X⁽²⁾)^T. The objective is to test the CQTE of variable X⁽²⁾ conditional on X⁽¹⁾. Treatment A was generated from a Bernoulli distribution with probability 0.5, independent of X. The baseline function h₀ was set to be

h_{0} (x_{(1)}, x_{(2)}) = 1 - \frac{x_{(1)} - x_{(2)}}{2} .

(6.1)

The contrast function takes the form

τ_{0} (x_{(1)}, x_{(2)}) = φ_{1} (x_{(1)}) φ_{2} (x_{(2)}),

(6.2)

for some continuous functions ϕ₁ and ϕ₂.

Variables X⁽¹⁾ and X⁽²⁾ are independently generated. It follows from Theorem 3.6 that the null (no CQTE) holds if and only if ϕ₂(x₍₂₎) ≥ 0,∀x₍₂₎ or ϕ₂(x₍₂₎) ≤ 0,∀x₍₂₎. We consider five scenarios. In the first four scenarios, X⁽¹⁾ and X⁽²⁾ are generated from Unif[−2,2], where Unif[a,b] stands for the uniform distribution on the interval [a,b]. We set ϕ₁(z) = z in the first two scenarios and ϕ₁(z) = max(z,0) in the last two scenarios. As for ϕ₂, in Scenarios 1 and 3,

φ_{2} (z) = z^{2} - δ,

for some δ ≥ 0. In Scenarios 2 and 4,

φ_{2} (z) = {\begin{matrix} z, & 0 \leq z \leq 2, \\ 0, & δ - 2 \leq z < 0, \\ 2 + z - δ, & - 2 \leq z < δ - 2, \end{matrix}

for some δ ≥ 0. In figure 1, we plot functions ϕ₂ with different δ.

Fig 1: — Plots of function ϕ₂ for Scenario 1 and Scenario 2, from left to right, with different choices of δ.

In the last scenario, X⁽¹⁾ is generated from Unif[−2,2] while X⁽²⁾ is from a uniform discrete distribution. Specifically, X⁽²⁾ has the following probability mass function

P r (X^{(2)} = a) = \frac{1}{2}, a = 0, 2.

The contrast function is set to be τ0(x(1),x(2)) =

τ_{0} (x_{(1)}, x_{(2)}) = φ_{1} (x_{(1)}) φ_{2} (x_{(2)}) = x_{(1)} (x_{(2)} - δ) .

In all scenarios, the parameter δ controls the degree of CQTE. When δ = 0, H₀ holds; Otherwise, H₁ holds. Moreover, it can be calculated that the value differences

VD = E [τ_{0} (X^{(1)}, X^{(2)}) {d^{o p t} (X) - d_{{1}}^{o p t} (X^{(1)})}]

for Scenarios 1–5 are equal to δ^3/2/3, δ²/8, δ^3/2/6, δ²/16 and δ/3 for all δ ≤ 1, respectively. In each scenario, we consider four settings by setting VD = 0,0.04,0.08 and 0.12. Hence, the null holds in the first setting and the alternative holds in other settings. We also consider two different sample sizes, n = 300 and n = 600.

When implementing our testing procedure, we first fit a logistic regression model for the propensity score and linear models for the conditional means of Y given A and X. The test statistics are constructed as discussed in Section 5. Based on (6.1) and (6.2), the model for E(Y |X,A = 1) is always misspecified, however, the propensity score model is correctly specified. Hence, our test statistics are consistent. In Scenario 1–4, we set the smoothing parameters as h_W = c_W n^−1/7 and h_B = c_Bn^−2/7 for some constants c_W and c_B. Condition (A6) holds for such a choice of the bandwidth. In our implementation, we have tried a few values of c_W and c_B, and find $c W = 2 \sqrt{3}$ and c_B = 6 working well for all scenarios. In Scenario 5, we set h_W = 6n^−2/7. In (5.1) and (5.2), we set η_n = n^−2/7, C₁ = 3 and C₂ = 1. Such a choice of η_n satisfies Conditions (A8)-(A10) in our simulation settings. We conduct 600 simulations for each setting and report the proportions of rejecting the null hypothesis of the proposed test statistics in Table 1.

Table 1.

Simulation results.

		VD = 0		VD = 4%		VD = 8%		VD = 12%
		α level		α level		α level		α level
	n	0.05	0.1	0.05	0.1	0.05	0.1	0.05	0.1
Scenario 1	300 600	4.3% 1.5%	6.0% 3.3%	24.0% 36.7%	34.0% 45.5%	58.7% 75.8%	68.3% 83.3%	82.2% 95.7%	87.5% 97.3%
Scenario 2	300 600	7.0% 3.7%	11.1% 7.8%	23.8% 31.0%	32.7% 41.8%	60.5% 83.0%	69.3% 90.5%	88.2% 98.3%	92.5% 99.5%
Scenario 3	300 600	3.8% 2.7%	6.5% 6.7%	37.5% 52.5%	48.7% 61.8%	76.5% 99.1%	79.8% 100%	93.5% 99.8%	95.5% 99.8%
Scenario 4	300 600	6.2% 5.2%	10.2% 8.8%	39.8% 59.3%	47.7% 68.2%	79.2% 96.8%	87.3% 98.3%	96.0% 100.0%	97.8% 100.0%
Scenario 5	300 600	5.2% 5.3%	9.7% 9.5%	29.3% 46.2%	40.5% 57.5%	68.0% 92.2%	76.3% 95.5%	94.0% 100.0%	96.8% 100.0%

Open in a new tab

Under H₀ (i.e. the cases with VD = 0), the empirical type-I error rates in Scenarios 2, 4 and 5 are close to the nominal level. In Scenarios 1 and 3, we have ν(F₀) = 0. The empirical type I error rates in Scenarios 1 and 3 are well below the nominal level. This is in line with our theory which suggests the type-I error rate should go to 0 in these settings. Under H₁, the power increases as the value difference or sample size increases, showing the consistency of our test statistics.

7. Application with ACTG175 dataset.

We apply our proposed method to a data from AIDS Clinical Trials Group Protocol 175 (ACTG175) study. This is a randomized trial where patients were randomly assigned to the following four treatments, including zidovudine (ZDV) monotherapy, ZDV + didanosine (ddI), ZDV + zalcitabine (zal) and ddI monotherapy. We focus on patients receiving treatments: ZDV+ddI (denoted as 1) and ZDV+zal (denoted as 0). Among them, there are 522 receiving treatment 1 and 524 receiving treatment 0. We choose the CD4 count (cells/mm³) at 20± 5 weeks after receiving the treatment as the response. The baseline covariates include patient’s age and weight at baseline, the CD4 and CD8 counts (coded as CD40 and CD80 respectively) at baseline, hemophilia (hemo,0 = no,1 = yes), homosexual activity (homo,0 = no,1 = yes), history of intravenous drug use (drug,0 = no,1 = yes), race (0 = white,1 = non-white), gender (0 = female,1 = male), antiretroviral history (str2,0 = naive,1 = experienced), and symptomatic status (sympton,0 = asymptomatic,1 = symptomatic). The first four variables are continuous while others are binary. Our objective is to select those variables that have qualitative treatment effects in a sequential order. Since the propensity score is known, we consider the statistic ${\tilde{T}}^{W, B}$ proposed in Section 3. Our procedure proceeds as follows:

Set $\hat{D}$ = ∅. In the first step, for each variable i, define the set W_i = {i} and calculate the p-value p_i for each test statistic ${\tilde{T}}^{W_{i}, \hat{D}}$ as described in Section 5. Stop if min_i p_i > α. Include the variable that gives the smallest p-value in the set $\hat{D}$ , i.e,
$\hat{D} \leftarrow {\underset{i}{\arg \min} p_{i}} .$
In the second step, for each variable $i \notin \hat{D}$ , define $W_{i} = \hat{D} \cup {i}$ and calculate the p-value p_i for each test statistic ${\hat{T}}^{W_{i} \hat{D}}$ . Stop if min_ipi > α. Include the variable that gives the smallest p-value,
$\hat{D} \leftarrow \hat{D} \cup {\underset{i}{\arg \min} p_{i}} .$
Continue the second step until it stops. Output $\hat{D}$

It is immediate to see that the above algorithm uses a forward selection procedure. Backward or stepwise selection can be similarly considered. The threshold α determines the significance level for each test statistic. In our implementation, we set $α = 1 - P r (ℤ_{0} \geq n^{1 / 6} / 2) \approx 0.056$ where $ℤ_{0}$ is a standard normal random variable. Such a choice of α meets the conditions in Theorem 9.2 to achieve selection consistency of the forward selection algorithm. As in simulations, we choose the bandwidth $h = 6 n^{- 2 / 7}$ when there’s only one continuous variable in the kernel estimation. Otherwise, we set $h = 2 \sqrt{3} n^{- 1 / 7}$ . Sets $\hat{E}$ and $\hat{F}$ are estimated by

\hat{E} = {x_{W} \in Ω^{W} : | \frac{τ_{n}^{W} (x_{W})}{\sqrt{{\hat{μ}}_{n}^{W} (x_{W})}} | \leq C_{0} η_{n} | \frac{τ_{n}^{B} (x_{W, B})}{\sqrt{{\hat{μ}}_{n}^{B} (x_{W, B})}} | \leq C_{0} η_{n}},

\hat{F} = {x_{W} \in Ω^{W} : | \frac{τ_{n}^{W} (x_{W})}{\sqrt{{\hat{μ}}_{n}^{W} (x_{W})}} | \leq C_{0} η_{n}, | \frac{τ_{n}^{B} (x_{W, B})}{\sqrt{{\hat{μ}}_{n}^{B} (x_{W, B})}} | > C_{0} η_{n}},

where the constant C₀ is set to be 0.03 in the implementation.

For the ACTG175 dataset, our algorithm stops after fourth iteration. At the first iteration, only the variable age is significant and is thus selected. At the second iteration, we find out that both hemo and homo have qualitative effects conditional on age and variable hemo is chosen. At the third iteration, only homo is significant given previously included variables. The algorithm stops at the fourth iteration. We report all the p-values in each iteration in Table 2.

Table 2.

P-values of each test statistic in all iterations.

age	weight	hemo	homo	drug	race	gender	str2	sympton	CD40	CD80
0.022	0.087	0.793	0.827	0.817	0.831	0.808	0.825	0.825	0.823	0.772
NA	0.986	1.2e-8	0.028	0.288	0.308	0.175	0.257	0.191	0.982	0.975
NA	0.996	NA	0.033	0.067	0.447	0.091	0.155	0.196	0.999	0.998
NA	0.999	NA	NA	0.118	0.116	0.405	0.533	0.066	0.999	0.999

Open in a new tab

Our results indicate that variables age, hemo and homo have qualitative treatment effects and are important for optimal treatment prescription. Denoted by D_FS the set of these three variables. We compare our algorithm with the sequential advantage selection (SAS, Fan, Lu and Song, 2016). SAS uses a forward selection procedure based on a sequential S-score and selects the best candidate subset of variables via a BIC-type criterion. For the ACTG175 dataset, SAS selects a total of 10 variables including age, hemo and homo. Denoted by D_SAS the set of these 10 variables.

To further examine the variable selection results, we evaluate the value functions under the optimal treatment regimes based on the set of variables selected by the proposed forward selection algorithm and SAS. For a given set D ⊆ I = {1,2,…,11}, we estimate the optimal value function

V^{D} = E {Y^{*} (d_{o p t}^{D})}

via the online estimator proposed by Luedtke and van der Laan (2016). More specifically, for i = l_n+1,l_n+2,…,n, we first compute the estimated optimal treatment regime ${\hat{d}}_{(i - 1)}^{D} (x_{D}) = I {{\hat{h}}_{1, (i - 1)}^{D} (x_{D}) > {\hat{h}}_{0, (i - 1)}^{D} (x_{D})}$ and the esti-mated conditional mean functions ${\hat{Φ}}_{0, (i - 1)} (x) = x^{T} {\hat{θ}}_{0, (i - 1)}$ and ${\hat{Φ}}_{1, (i - 1)} (x) = x^{T} {\hat{θ}}_{1, (i - 1)}$ based on data from patients 1 to i − 1.

For any j = 0,1 and i = l_n + 1, l_n + 2, …, n, ${\hat{h}}_{j, (i - 1)}^{D}$ is calculated via kernel ridge regression, based on the dataset ${(X_{k}^{D}, Y_{k})}_{k \leq i - 1, A_{k} = j}$ . We use the Gaussian radial basis function kernel. The estimating procedure is implemented by the R package CVST. The tuning parameters in the kernel functions are selected via 5-folded cross-validation. Estimator ${\hat{θ}}_{j, (i - 1)}$ is computed via a penalized regression with the SCAD penalty function (Fan and Li, 2001), based on the dataset ${(X_{k}^{D}, Y_{k})}_{k \leq i - 1, A_{k} = j}$ . The penalized regression is implemented by the R package ncvreg, and the tuning parameters are selected via 10-folded cross-validation. Let π₀ = 0.5, we define for i = l_n + 1,l_n + 2,…,n, j = 1,…,n,

{\hat{V}}_{(i)}^{D} (j) = \frac{{\hat{d}}_{(i - 1), A_{j}, X_{j}}^{D}}{π_{0}} Y_{j} - (\frac{{\hat{d}}_{(i - 1), A_{j}, X_{j}}^{D}}{π_{0}} - 1) \times ({\hat{Φ}}_{1, (i - 1)} (X_{j}^{D}) {\hat{d}}_{(i - 1)}^{D} (X_{j}^{D}) + {\hat{Φ}}_{0, (i - 1)} (X_{j}^{D}) {1 - {\hat{d}}_{(i - 1)}^{D} (X_{j}^{D})}),

where ${\hat{d}}_{(i - 1), A_{j}, X_{j}}^{D} = A_{j} {\hat{d}}_{(i - 1)}^{D} (X_{j}^{D}) + (1 - A_{j}) {1 - {\hat{d}}_{(i - 1)}^{D} (X_{j}^{D})} .$ .

The final estimator is given by

{\hat{V}}^{D} = \frac{\sum_{i = l_{n} + 1}^{n} {{\hat{σ}}^{D} (i)}^{- 1} {\hat{V}}_{(i)}^{D} (i)}{\sum_{i = l_{n} + 1}^{n} {{\hat{σ}}^{D} (i)}^{- 1}},

with the estimated standard error

{\hat{σ}}^{D} = \frac{\sqrt{n - l_{n}}}{\sum_{i = l_{n} + 1}^{n} {{\hat{σ}}^{D} (i)}^{- 1}},

where

{{\hat{σ}}^{D} (i)}^{2} = \frac{1}{i - 2} \sum_{j = 1}^{i} {{\hat{V}}_{(i - 2)}^{D} (j)}^{2} - {(\frac{1}{i - 1} \sum_{j = 1}^{i - 1} {\hat{V}}_{(i)}^{D} (j))}^{2} .

Under certain conditions, we have

\frac{{\hat{V}}^{D} - V^{D}}{{\hat{σ}}^{D}} \overset{d}{\to} N (0, 1) .

Set l_n = 200. The estimated value functions ${\hat{V}}^{D_{F S}}$ and ${\hat{V}}^{D_{S A S}}$ are equal to 401.88 and 402.35 respectively, with estimated standard errors ${\hat{σ}}_{F S}^{D} = 7.50$ and ${\hat{σ}}_{S A S}^{D} = 7.19$ . Since D_FS ⊆ D_SAS, we have $V^{D_{S A S}} \geq V^{D_{F S}}$ . However, the difference $V^{D_{S A S}} - V^{D_{F S}}$ is not significant. This implies the proposed forward selection algorithm selects less variables than SAS, while achieves approximately the same value function in optimal treatment decision.

8. Discussion.

In this paper, we introduce the notion of conditional qualitative treatment effects (CQTE) and present several equivalent definitions. We also propose a consistent testing procedure for the existence of CQTE. Our test has correct size under the null hypothesis and non-negligible power against some nonstandard local alternatives.

8.1. More on the forward selection algorithm.

The forward selection algorithm introduced in Section 7 is a byproduct of the proposed testing procedure for the existence of CQTE. While it is worthwhile to investigate its statistical properties, this is a very challenging task. In the literature, few works have studied the asymptotic properties of a forward selection procedure. Wang (2009) established the “sure screening property” of the classical forward linear regression in a high dimensional setting. However, the proofs of the major theorems in that paper (Theorem 1 and 2) rely heavily on the specific structure of linear regression and it remains unknown whether the “sure screening property” holds for general forward selection algorithms.

Our forward selection algorithm aims to identify a subset D₀ ⊆ [1,…,p] with minimum cardinality such that the optimal value function based on variables in $X^{D_{0}}$ is the same as that based on X. In the supplementary appendix, we establish the “sure screening property” (Theorem 9.1) and selection consistency (Theorem 9.2) of the considered forward selection algorithm based on the p-values of the CQTE tests. Moreover, we conduct some simulation studies to examine the empirical performance of the proposed algorithm and compare it with SAS (Fan, Lu and Song, 2016). Our forward selection algorithm achieves better model selection results when compared to SAS in all considered simulation scenarios. More details can be found in Section 9 of the supplementary appendix.

8.2. Fully nonparametric implementation.

The proposed test statistic in Section 3 requires the propensity score function to be correctly specified. In Section 4, we introduce a doubly robust test statistic and posit some parametric models for the propensity score and conditional mean functions. In the supplementary appendix, we consider a fully nonparametric procedure based on some nonparametric estimators of the propensity score and the conditional mean functions.

We further conduct some simulation studies to examine the empirical performance of the nonparametric testing procedure and compare it with the doubly robust test describe in Section 4. We briefly summarize the results here: (i) The nonparametric test statistic is more powerful than the doubly robust test statistic. (ii) When the sample size is small, the empirical type I error rates of the nonparametric test statistic are slightly larger than the nominal level in some cases. More details can be found in Section 11 of the supplementary appendix.

Although it is interesting to investigate the theoretical properties of such a nonparametric test statistic, it is beyond the scope of the current paper and is omitted here.

8.3. Extensions to L_p-type and supremum-type functionals.

As commented in Remark 2.5, the test statistic for no CQTE can be constructed based on

S^{W, B} = \int_{x_{W} \in Ω^{W}} ϕ {τ_{0}^{W} (x_{W})} {d_{o p t}^{W} (x_{W}) - d_{o p t}^{B} (x_{W, B})} ω_{0} (x_{W}) d x_{W} .

In the current paper, we set φ(·) to be the identity function. More generally, we can take φ(·) to be any monotonically increasing function with φ(0) = 0. In Section 12 in the supplementary appendix, we consider the following class of functions φ(z) = sgn(z)|z|^q, and derive the corresponding test statistic ${\tilde{T}}_{n, q}^{W, B}$ for any q ≥ 1.

We show in Theorem 12.1 that ${\tilde{T}}_{n, q}^{W, B}$ have asymptotically correct size under H₀ and provide its asymptotic power function in Theorem 12.2 under H_a. For different q, the asymptotic power function increases as

\int_{x_{W} \in F_{0}} \frac{2^{(q - 3) / 2} q Γ (q / 2)}{\sqrt{π} {\tilde{σ}}_{q}} {μ^{W} (x_{W})}^{(q - 1) / 2} δ_{0}^{W} (x_{W}) f^{W} (x_{W}) d x_{W}

increases, where

{\tilde{σ}}_{q}^{2} = \int_{\begin{array}{l} x_{W} \in F_{0} \\ t \in {[- 1, 1]}^{p_{W}} \end{array}} μ^{W} (x_{W}) c o v (\max {\sqrt{1 - ρ^{2} (t)} ℤ_{1} + ρ (t) ℤ_{2}, 0}^{q}, \max {ℤ_{2}, 0}^{q}) d x_{W} d t,

and $Γ (z) = \int_{0}^{\infty} x^{z - 1} \exp (- x) d x$ .

Besides, when q > 1, the assumptions on η_n and the moments of Y conditional on X and A are slightly different compared to those in (A3) and (A8). More details can be found in Section 12 of the supplementary appendix.

In addition, in Section 13 of the supplementary article, we develop a supremum-type test based on studentized kernel estimators of the contrast function, with many different bandwidth values. We show that the test is valid and has nontrivial power against $\sqrt{\log n} / \sqrt{n h_{max}^{p}}$ -local alternatives, where h_max denotes the maximum of the kernel bandwidth parameter.

Therefore, when compared to the supremum-type test, the L_p-type test is more powerful since it allows for nontrivial testing against n^−1/2-local alternatives. However, the L_p-type test only uses one bandwidth value for the kernel estimates. As a result, it might be sensitive to the choice of the bandwidth parameter.

8.4. Other issues.

For simplicity, we only consider a single decision stage and focus on binary treatments. It will be useful in practice to extend CQTE and its associated testing procedure to multi-stages with multiple treatment options. Moreover, our test statistic relies on the kernel-based estimators of the contrast function. It is well known that the kernel-based estimations will behave poorly when the dimension of the covariates is large. How to adapt our test statistics to handle high-dimensional covariates remains challenging.

Our testing procedure requires the specification of the tuning parameters h_W, h_B and η_n (see Section 5). In general, one can set $h_{W} = c_{W} n^{- κ_{W}}$ , $h_{B} = c_{B} n^{- κ_{B}}$ and $η_{n} = n^{- κ_{0}}$ for some c_W,c_B,κ_W,κ_B,κ₀ > 0. In practice, we recommend to set $c_{W} = 2 \sqrt{3}$ , c_B = 6, κ_W = 1/7, κ_B = 2/7, if p_W > 2, p_B = 1 and $c_{W} = c_{B} = 2 \sqrt{3}$ , κ_W = κ_B = 1/7 if p_W,p_B ≥ 2, and κ₀ = 2/7.

We have tried various values of tuning parameters in our simulation studies and find such a choice works well in all scenarios. In Section 10 of the supplementary article, we examine the performance of our test under other choices of tuning parameters. The simulation results are very similar to those in Section 6.

Supplementary Material

supplementary appendix

NIHMS987209-supplement-supplementary_appendix.pdf^{(349.9KB, pdf)}

Acknowledgments

This work was partly supported by a NIH grant P01 CA142538.

References.

Andrews DWK and Shi X (2013). Inference based on conditional moment inequalities. Econometrica 81 609–666. [Google Scholar]
Andrews DWK and Shi X (2014). Nonparametric inference based on conditional moment inequalities. J. Econometrics 179 31–45. [Google Scholar]
Armstrong TB and Chan HP (2016). Multiscale adaptive inference on conditional moment inequalities. J. Econometrics 194 24–43. [Google Scholar]
Chakraborty B, Murphy S and Strecher V (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Stat. Methods Med. Res 19 317–343. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chang M, Lee S and Whang Y-J (2015). Nonparametric tests of conditional treatment effects with an application to single-sex schooling on academic achievements. Econom. J 18 307–346. [Google Scholar]
Chernozhukov V, Lee S and Rosen AM (2013). Intersection bounds: estimation and inference. Econometrica 81 667–737. [Google Scholar]
Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc 96 1348–1360. [Google Scholar]
Fan A, Lu W and Song R (2016). Sequential advantage selection for optimal treatment regime. Ann. Appl. Stat 10 32–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan C, Lu W, Song R and Zhou Y (2017). Concordance-assisted learning for estimating optimal individualized treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 1565–1582. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gine E´, Mason DM and Zaitsev AY (2003). The L1-norm density estimator process. Ann. Probab 31 719–768. [Google Scholar]
Gunter L, Zhu J and Murphy SA (2011). Variable selection for qualitative interactions. Stat. Methodol 8 42–55. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hsu Y-C (2017). Consistent tests for conditional treatment effects. The Econometrics Journal 20 1–22. [Google Scholar]
Li K-C and Duan N (1989). Regression analysis under link violation. Ann. Statist 17 1009–1052. [Google Scholar]
Lu W, Zhang HH and Zeng D (2013). Variable selection for optimal treatment decision. Stat. Methods Med. Res 22 493–504. [DOI] [PMC free article] [PubMed] [Google Scholar]
Luedtke AR and van der Laan MJ (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Statist 44 713–742. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mason DM and Polonik W (2009). Asymptotic normality of plug-in level set estimates. Ann. Appl. Probab 19 1108–1142. [Google Scholar]
Murphy SA (2003). Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol 65 331–366. [Google Scholar]
Qian M and Murphy SA (2011). Performance guarantees for individualized treatment rules. Ann. Statist 39 1180–1210. [DOI] [PMC free article] [PubMed] [Google Scholar]
Robins JM, Hernan MA and Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiol 11 550–560. [DOI] [PubMed] [Google Scholar]
Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Edu. Psychol 66 688–701. [Google Scholar]
Shergin VV (1990). The central limit theorem for finitely dependent random variables In Probability theory and mathematical statistics, Vol. II (Vilnius, 1989) 424–431. “Mokslas”, Vilnius. [Google Scholar]
Wand MP and Jones MC (1993). Comparison of smoothing parameterizations in bivariate kernel density estimation. J. Amer. Statist. Assoc 88 520–528. [Google Scholar]
Wang H (2009). Forward regression for ultra-high dimensional variable screening. J. Amer. Statist. Assoc 104 1512–1524. [Google Scholar]
Watkins CJCH and Dayan P (1992). Q-learning. Mach. Learn 8 279–292. [Google Scholar]
White H (1982). Maximum likelihood estimation of misspecified models. Econometrica 50 1–25. [Google Scholar]
Zhang B, Tsiatis AA, Laber EB and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68 1010–1018. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang B, Tsiatis AA, Laber EB and Davidian M (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100 681–694. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang Y, Laber EB, Tsiatis A and Davidian M (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics 71 895–904. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang Y, Laber EB, Tsiatis A and Davidian M (2016). Interpretable Dynamic Treatment Regimes. arXiv preprint arXiv: 160601472. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhao Y, Zeng D, Rush AJ and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. J. Amer. Statist. Assoc 107 1106–1118. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhao Y-Q, Zeng D, Laber EB and Kosorok MR (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. J. Amer. Statist. Assoc 110 583–598. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supplementary appendix

NIHMS987209-supplement-supplementary_appendix.pdf^{(349.9KB, pdf)}

[R1] Andrews DWK and Shi X (2013). Inference based on conditional moment inequalities. Econometrica 81 609–666. [Google Scholar]

[R2] Andrews DWK and Shi X (2014). Nonparametric inference based on conditional moment inequalities. J. Econometrics 179 31–45. [Google Scholar]

[R3] Armstrong TB and Chan HP (2016). Multiscale adaptive inference on conditional moment inequalities. J. Econometrics 194 24–43. [Google Scholar]

[R4] Chakraborty B, Murphy S and Strecher V (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Stat. Methods Med. Res 19 317–343. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] Chang M, Lee S and Whang Y-J (2015). Nonparametric tests of conditional treatment effects with an application to single-sex schooling on academic achievements. Econom. J 18 307–346. [Google Scholar]

[R6] Chernozhukov V, Lee S and Rosen AM (2013). Intersection bounds: estimation and inference. Econometrica 81 667–737. [Google Scholar]

[R7] Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc 96 1348–1360. [Google Scholar]

[R8] Fan A, Lu W and Song R (2016). Sequential advantage selection for optimal treatment regime. Ann. Appl. Stat 10 32–53. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Fan C, Lu W, Song R and Zhou Y (2017). Concordance-assisted learning for estimating optimal individualized treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 1565–1582. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] Gine E´, Mason DM and Zaitsev AY (2003). The L1-norm density estimator process. Ann. Probab 31 719–768. [Google Scholar]

[R11] Gunter L, Zhu J and Murphy SA (2011). Variable selection for qualitative interactions. Stat. Methodol 8 42–55. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] Hsu Y-C (2017). Consistent tests for conditional treatment effects. The Econometrics Journal 20 1–22. [Google Scholar]

[R13] Li K-C and Duan N (1989). Regression analysis under link violation. Ann. Statist 17 1009–1052. [Google Scholar]

[R14] Lu W, Zhang HH and Zeng D (2013). Variable selection for optimal treatment decision. Stat. Methods Med. Res 22 493–504. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Luedtke AR and van der Laan MJ (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Statist 44 713–742. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] Mason DM and Polonik W (2009). Asymptotic normality of plug-in level set estimates. Ann. Appl. Probab 19 1108–1142. [Google Scholar]

[R17] Murphy SA (2003). Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol 65 331–366. [Google Scholar]

[R18] Qian M and Murphy SA (2011). Performance guarantees for individualized treatment rules. Ann. Statist 39 1180–1210. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] Robins JM, Hernan MA and Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiol 11 550–560. [DOI] [PubMed] [Google Scholar]

[R20] Rubin DB (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Edu. Psychol 66 688–701. [Google Scholar]

[R21] Shergin VV (1990). The central limit theorem for finitely dependent random variables In Probability theory and mathematical statistics, Vol. II (Vilnius, 1989) 424–431. “Mokslas”, Vilnius. [Google Scholar]

[R22] Wand MP and Jones MC (1993). Comparison of smoothing parameterizations in bivariate kernel density estimation. J. Amer. Statist. Assoc 88 520–528. [Google Scholar]

[R23] Wang H (2009). Forward regression for ultra-high dimensional variable screening. J. Amer. Statist. Assoc 104 1512–1524. [Google Scholar]

[R24] Watkins CJCH and Dayan P (1992). Q-learning. Mach. Learn 8 279–292. [Google Scholar]

[R25] White H (1982). Maximum likelihood estimation of misspecified models. Econometrica 50 1–25. [Google Scholar]

[R26] Zhang B, Tsiatis AA, Laber EB and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68 1010–1018. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] Zhang B, Tsiatis AA, Laber EB and Davidian M (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100 681–694. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] Zhang Y, Laber EB, Tsiatis A and Davidian M (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics 71 895–904. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] Zhang Y, Laber EB, Tsiatis A and Davidian M (2016). Interpretable Dynamic Treatment Regimes. arXiv preprint arXiv: 160601472. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] Zhao Y, Zeng D, Rush AJ and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. J. Amer. Statist. Assoc 107 1106–1118. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] Zhao Y-Q, Zeng D, Laber EB and Kosorok MR (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. J. Amer. Statist. Assoc 110 583–598. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

ON TESTING CONDITIONAL QUALITATIVE TREATMENT EFFECTS

Chengchun Shi, Graduate student

Rui Song, Associate Professor

Wenbin Lu, Professor

Abstract

1. Introduction.

2. Conditional qualitative treatment effects.

2.1. Optimal treatment regime.

2.2. Conditional qualitative treatment effects.

3. Testing procedure.

3.1. Test statistic.

3.2. Consistency of the test.

3.3. Local alternatives.

4. Doubly robust test statistic.

5. Implementation details.

5.1. All covariates are discrete.

5.2. Not all covariates are discrete.

6. Simulations.

Fig 1:

Table 1.

7. Application with ACTG175 dataset.

Table 2.

8. Discussion.

8.1. More on the forward selection algorithm.

8.2. Fully nonparametric implementation.

8.3. Extensions to L_p-type and supremum-type functionals.

8.4. Other issues.

Supplementary Material

Acknowledgments

References.

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

ON TESTING CONDITIONAL QUALITATIVE TREATMENT EFFECTS

Chengchun Shi, Graduate student

Rui Song, Associate Professor

Wenbin Lu, Professor

Abstract

1. Introduction.

2. Conditional qualitative treatment effects.

2.1. Optimal treatment regime.

2.2. Conditional qualitative treatment effects.

3. Testing procedure.

3.1. Test statistic.

3.2. Consistency of the test.

3.3. Local alternatives.

4. Doubly robust test statistic.

5. Implementation details.

5.1. All covariates are discrete.

5.2. Not all covariates are discrete.

6. Simulations.

Fig 1:

Table 1.

7. Application with ACTG175 dataset.

Table 2.

8. Discussion.

8.1. More on the forward selection algorithm.

8.2. Fully nonparametric implementation.

8.3. Extensions to Lp-type and supremum-type functionals.

8.4. Other issues.

Supplementary Material

Acknowledgments

References.

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

8.3. Extensions to L_p-type and supremum-type functionals.