. Author manuscript; available in PMC: 2021 Jan 1.
Published in final edited form as: J Am Stat Assoc. 2019 Jun 19;115(531):1201–1213. doi: 10.1080/01621459.2019.1604368

A Sparse Random Projection-based Test for Overall Qualitative Treatment Effects

Chengchun Shi 1, Wenbin Lu 1, Rui Song 1,*
PMCID: PMC7730172  NIHMSID: NIHMS1047683  PMID: 33311818

Abstract

In contrast to the classical “one size fits all” approach, precision medicine proposes the customization of individualized treatment regimes to account for patients’ heterogeneity in response to treatments. Most existing works in the literature have focused on estimating optimal individualized treatment regimes. However, less attention has been devoted to hypothesis testing regarding the existence of overall qualitative treatment effects, especially when there is a large number of prognostic covariates. When covariates don’t have qualitative treatment effects, the optimal treatment regime will assign the same treatment to all patients regardless of their covariate values. In this paper, we consider testing the overall qualitative treatment effects of patients’ prognostic covariates in a high-dimensional setting. We propose a sample-splitting method to construct the test statistic, based on a nonparametric estimator of the contrast function. When the dimension of covariates is large, we construct the test based on sparse random projections of covariates into a low-dimensional space. We prove the consistency of our test statistic. In the regular cases, we show that the power function of our test statistic is asymptotically the same as that of the “oracle” test statistic, which is constructed based on the “optimal” projection matrix. Simulation studies and real data applications validate our theoretical findings.

Keywords: High-dimensional testing, Optimal treatment regime, Precision medicine, Qualitative treatment effects, Sparse random projection

1. Introduction

In many medical studies, patients may differ significantly in the way they respond to the treatment. In contrast to the classical “one size fits all” approach, precision medicine proposes the customization of individualized treatment regimes to account for patients’ heterogeneity in response to treatments. Formally speaking, a treatment regime is a function from patients’ prognostic covariates to available treatment options. The optimal individualized treatment regime (OITR) is the one that maximizes patients’ expected responses among all treatment regimes.

There has been increasing interest in estimating the OITR. Some common methods include Q-learning (Watkins and Dayan, 1992; Chakraborty et al., 2010), A-learning (Robins et al., 2000; Murphy, 2003) and outcome weighted learning (OWL, Zhao et al., 2012). Qian and Murphy (2011) considered a two-step procedure to construct the OITR. Their method first estimates the conditional mean of the response with an l1 penalty and then derives the OITR from the estimated conditional mean. Zhang et al. (2012) proposed a robust method for estimating the OITR by maximizing the estimated average response of patients (i.e., the value function). Zhang et al. (2015) proposed to use decision lists to construct interpretable and parsimonious treatment regimes. Despite the popularity of estimating the OITR, there is scarce work in the literature on hypothesis testing regarding the OITR. All these estimation methods implicitly assume that patients’ covariates have qualitative interactions with the treatment, which means that there exists a subset of patients whose “best” treatments assigned according to the OITR differ from those of the others.

We consider testing the existence of the OITR for the following reasons. First, the OITR may not always exist in practice; see the data from the Nefazodone-CBASP clinical trial study in Section 5 for an example. In this case, one treatment is better than the other for all patients and there is no need to estimate the OITR. Second, we note that implementing the OITR requires future patients’ covariates, which can be expensive to collect in some cases (Baker et al., 2009; Gail, 2009; Huang et al., 2015). In these cases, we recommend adopting the “one-size-fits-all” paradigm when the null hypothesis of no OITR is not rejected. Third, our test is constructed based on the estimated difference between the value functions of the OITR and a fixed regime (i.e., assigning all patients to the best treatment). An insignificant test implies that the value difference is not significant. In such a situation, although we can still estimate the OITR, its gain over the fixed regime in terms of the improvement in value is not significant, and the obtained OITR may not be of practical interest. Therefore, it is essential to test the overall qualitative treatment effects of the prognostic covariates to determine whether we need to implement the OITR for future patients. Gunter et al. (2011) developed an S-score to quantify the magnitude of the marginal qualitative treatment effects of a single covariate. However, the S-score doesn’t characterize the overall qualitative treatment effects of all covariates. Besides, no theoretical guarantees were provided for the S-score.

For binary treatments, testing qualitative treatment effects is equivalent to testing whether the interaction between treatment and covariates (i.e., the contrast function) is almost surely positive or negative. To test such a hypothesis, Chang et al. (2015) proposed a test based on an L1-type functional of kernel smoothing estimators of conditional treatment effects. Hsu (2017) proposed a Kolmogorov-Smirnov type test statistic based on nonparametric estimation of conditional treatment effects with a hypercube kernel. It is well known that kernel smoothing estimators are undesirable in practice due to the curse of dimensionality. As a result, these test statistics are not reliable when the dimension of the covariates is relatively large. However, in modern biomedical applications, it is common to obtain a large number of prognostic factors for each individual patient. To the best of our knowledge, there is a lack of methods for testing the overall qualitative treatment effects in high-dimensional settings.

In this paper, we aim to test the overall qualitative treatment effects in a high-dimensional setting. This is a very challenging task due to the curse of dimensionality. To better illustrate this point, consider a simple situation where patients’ covariates, x, consist of p independent Rademacher variables. Then, it is equivalent to test whether the contrast as a function of the covariates is always positive or negative for any x ∈ {−1, 1}^p. Therefore, we need to test 2^p moment inequalities even in this very simplified situation. However, for each x ∈ {−1, 1}^p, we have on average N/2^p observations with covariates equal to x, where N is the total number of observations. When N = O(2^p), this seems impossible without additional assumptions. We show in Lemma 3.1 that covariates have the overall qualitative treatment effects if and only if the value function under the OITR is strictly larger than those under fixed treatment regimes. This motivates us to construct test statistics based on the difference between the optimal value function and the value functions under fixed treatment regimes. However, inference for such a value difference is extremely difficult in the nonregular cases, that is, when there is a positive probability that the contrast function is equal to zero. We use a sample-splitting method to construct the test statistic, based on a nonparametric estimator of the contrast function. As long as the estimated contrast function satisfies certain convergence rates, we show that our test statistic is consistent.

When the dimension of covariates is large, we construct the test based on sparse random projections of covariates into a low-dimensional space. Random projections have become a powerful method for dimension reduction in the computer science literature. The key idea behind them is given in the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984), which states that a set of high-dimensional vectors can be projected into a suitable lower-dimensional space while approximately preserving their pairwise distances. In the statistics literature, Lopes et al. (2011) proposed a high-dimensional two-sample test which integrates a random projection with the Hotelling T² statistic. Recently, Cannings and Samworth (2015) proposed a random projection-based method for high-dimensional classification.

In this paper, we propose the use of random projections with sparse matrices. In contrast to the dense sketching matrices used in Lopes et al. (2011) and Cannings and Samworth (2015), only a small proportion of the elements in a sparse sketching matrix are nonzero. References on sparse random projections include Omidiran and Wainwright (2010); Li et al. (2006); Nelson and Nguyên (2013). In our simulation studies, we show that our sparse random projection-based test statistics are more powerful than those based on dense random projection matrices when the OITR is “sparse”. Besides, we advocate using data-dependent algorithms to generate the sparse sketching matrix, since most random projections will be weakly correlated with the contrast function. In theory, we prove the consistency of our sparse random projection-based test. Moreover, in the regular cases, we show that the power function of our test statistic is asymptotically the same as that of the “oracle” test statistic which is constructed based on the “optimal” projection matrix.

The rest of the paper is organized as follows. In Section 2, we present the definition of the overall qualitative treatment effects. In Section 3, we introduce our test statistic and study its asymptotic properties under the null and local alternative. Simulation studies and real data applications are conducted in Section 4 and Section 5 respectively, to examine the empirical performance of the proposed testing procedure. Section 6 concludes with a summary and discussions of possible extensions.

2. Overall qualitative treatment effects

We consider a single-stage study with two treatment options. Let Y be a patient’s outcome of interest and A ∈ {0, 1} be the treatment indicator, with 0 for the standard treatment and 1 for the new treatment. By convention, a larger value of Y indicates a better clinical outcome. Denote by X ∈ ℝ^p the patient’s baseline covariates. We consider a high-dimensional setting where p is allowed to diverge with the sample size N. Let Y*(0) and Y*(1) denote the potential outcomes of a patient that would be observed assuming s/he received treatment 0 and 1, respectively. A treatment regime d : ℝ^p → {0, 1} is a deterministic function from the patient’s covariate space to all possible treatment options. For any d, we define the expected potential outcome

V(d) = E[d(X)Y*(1) + {1 − d(X)}Y*(0)],

known as the value function associated with d. The optimal treatment regime dopt is defined as the maximizer of V(d). Let τ(x) be the contrast function, i.e,

τ(x) = E(Y | A = 1, X = x) − E(Y | A = 0, X = x).

Under the following three conditions:

(A1.) Stable Unit Treatment Value Assumption (SUTVA): Y = AY*(1) + (1 − A)Y*(0),

(A2.) No unmeasured confounders: Y*(0), Y*(1) ⫫ A | X,

(A3.) Positivity: there exist some constants 0 < c1 ≤ c2 < 1 such that c1 ≤ P(A = 1 | X = x) ≤ c2 for any x,

we can show that τ(x) = E{Y*(1) − Y*(0) | X = x}. Since

V(d) = E[d(X){Y*(1) − Y*(0)} + Y*(0)] = E{τ(X)d(X)} + E{Y*(0)},

it is immediate to see that dopt(x) = I{τ(x) > 0}, where I(∙) stands for the indicator function.
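The two identities above can be made concrete with a small Monte Carlo sketch (our own illustration, not from the paper; the toy contrast function and variable names are assumptions): it approximates V(d) = E{τ(X)d(X)} + E{Y*(0)} by a sample average and compares the optimal regime dopt(x) = I{τ(x) > 0} with the one-size-fits-all regime d ≡ 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_function(d, tau, y0_mean, X):
    # V(d) = E{tau(X) d(X)} + E{Y*(0)}, approximated by a sample average
    return np.mean(tau(X) * d(X)) + y0_mean

# toy contrast: treatment helps iff the first covariate is positive
tau = lambda X: X[:, 0]
d_opt = lambda X: (tau(X) > 0).astype(float)   # d_opt(x) = I{tau(x) > 0}
d_all1 = lambda X: np.ones(len(X))             # "one size fits all": treat everyone

X = rng.normal(size=(100_000, 3))
v_opt = value_function(d_opt, tau, 0.0, X)
v_one = value_function(d_all1, tau, 0.0, X)
# under OQTE the optimal regime strictly beats any fixed regime
print(v_opt, v_one, v_opt > v_one)
```

Here τ takes both signs with positive probability, so the OQTE holds and V(dopt) strictly exceeds the value of treating everyone.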

Condition (A2) is satisfied in a randomized study, where the propensity score function π(x) = Pr(A = 1 | X = x) is usually a known constant by design. We assume π(x) is known throughout this section. In Section 3.3, we allow the propensity score to be estimated from data, as in observational studies.

Covariates X are said to have the overall qualitative treatment effects (OQTE) if

Pr{τ(X) > 0} > 0  and  Pr{τ(X) < 0} > 0.

In this paper, we consider testing the following hypothesis:

H0: X doesn’t have OQTE  versus  H1: X has OQTE. (1)

Assume (A1)–(A3) hold. Under H0, the optimal treatment regime assigns the same treatment to all patients. Therefore, testing OQTE is equivalent to testing the existence of OITR.

3. Proposed tests

3.1. A simple value-based test statistic in fixed p case

Assume the observed data are summarized as {Oi = (Xi, Ai, Yi), i = 1, …, N}, where the Oi’s are i.i.d. copies of O = (X, A, Y). The distribution of O is allowed to vary with N. To illustrate the idea, we first assume p is small and fixed, and present here a value-based test statistic for the null hypothesis (1). Later in this section, we will consider the more challenging high-dimensional setting. Let V(0) = E{Y*(0)} and V(1) = E{Y*(1)}. The following lemma relates the OQTE to the difference between the optimal value function and the value functions under fixed treatment regimes.

Lemma 3.1. Assume E|τ(X)| < ∞, and conditions (A1)–(A3) hold. Then the following are equivalent: (i) X doesn’t have OQTE; (ii) V(dopt) = max{V(0), V(1)}.

By definition, we have V(dopt) ≥ max{V(0), V(1)}. Under H1, Lemma 3.1 implies V(dopt) > max(V(0), V(1)). Therefore, it suffices to test

H0: V(dopt) = max{V(0), V(1)}  versus  H1: V(dopt) > max{V(0), V(1)}.

For simplicity, we assume V(1) ≥ V(0). This implies that the new treatment is at least as good as the standard one on average. The hypothesis V(1) ≥ V(0) can be tested using historical data or data from a pilot study. When V(0) ≥ V(1), the test statistic can be similarly constructed.

Lemma 3.1 motivates us to consider test statistics based on some estimators for the value difference VD(dopt) = V(dopt) – V(1). For any treatment regime d, Zhang et al. (2012) proposed an inverse propensity score weighted estimator (IPSWE) for V(d):

V^(d) = (1/N) Σ_{i=1}^N [ Ai d(Xi) Yi/πi + (1 − Ai){1 − d(Xi)} Yi/(1 − πi) ], (2)

where πi is a shorthand for π(Xi). Plugging in d ≡ 1, we obtain V^(1) = N^{−1} Σi AiYi/πi. For any fixed d, √N VD^(d) = √N{V^(d) − V^(1)} corresponds to a normalized sum of i.i.d. random variables. Therefore, its asymptotic variance can be consistently estimated by the sample variance estimator,

σ^2(d) = (N − 1)^{−1} Σ_{i=1}^N [ {(1 − Ai)/(1 − πi) − Ai/πi} Yi {1 − d(Xi)} − VD^(d) ]². (3)

Suppose τ^(∙) is an estimate of τ(∙). Based on (2) and (3), it is natural to use T^ = √N VD^(d^)/σ^(d^) as the test statistic, where d^(x) = I{τ^(x) > 0}, and reject H0 when T^ > zα at a given significance level α, where zα stands for the upper α-th quantile of the standard normal distribution.
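As an illustration, the naive statistic T^ = √N VD^(d^)/σ^(d^) built from (2) and (3) can be sketched in a few lines of numpy (our own minimal implementation under the randomized-trial setting with known π; the function name, the toy data-generating process and the hard-coded critical value are assumptions):

```python
import numpy as np

Z_ALPHA = 1.645  # upper 5% quantile of N(0,1)

def naive_value_test(A, Y, pi, d_hat):
    """Naive test based on (2)-(3): VD_hat(d) = V_hat(d) - V_hat(1) is a sample
    average of i.i.d. terms, so T = sqrt(N) * VD_hat / sigma_hat."""
    terms = ((1 - A) / (1 - pi) - A / pi) * Y * (1 - d_hat)
    vd_hat = terms.mean()
    sigma_hat = terms.std(ddof=1)          # sample s.d. of the i.i.d. summands
    t_stat = np.sqrt(len(Y)) * vd_hat / sigma_hat
    return t_stat, t_stat > Z_ALPHA        # reject H0 when T > z_alpha

# toy alternative: tau(X) = X(1); the regime I{X(1) > 0} is treated as given
rng = np.random.default_rng(0)
N = 4000
X1 = rng.normal(size=N)
A = rng.binomial(1, 0.5, size=N)
Y = A * X1 + rng.normal(scale=0.5, size=N)
t_stat, reject = naive_value_test(A, Y, np.full(N, 0.5), (X1 > 0).astype(float))
```

Under this alternative the statistic is large and the test rejects; the nonregular cases discussed next are exactly where this naive construction breaks down.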

Consistency of such a naive test requires E{d^(X) − dopt(X)}² → 0. However, as commented by Luedtke and van der Laan (2016), this assumption is typically violated in the nonregular cases where Pr{τ(X) = 0} > 0, even when τ^ is consistent for τ. To solve this problem, we consider a modified version of T^ based on sample splitting and cross-validation. Let I1 and I2 be a random partition of {1, …, N} into 2 disjoint subsets of equal sizes n = N/2. For any I ⊆ {1, …, N} and treatment regime d, define

VD^I(d) = (1/|I|) Σ_{i∈I} {(1 − Ai)/(1 − πi) − Ai/πi} Yi {1 − d(Xi)},
σ^I²(d) = (|I| − 1)^{−1} Σ_{i∈I} [ {(1 − Ai)/(1 − πi) − Ai/πi} Yi {1 − d(Xi)} − VD^I(d) ]²,

where |I| stands for the number of elements in I. Let τ^I be the corresponding estimator of τ based on observations in I and d^I(x) = I{τ^I(x) > 0}. We define our test statistic by

T^CV = max[ √n VD^I1(d^I2)/max{σ^I1(d^I2), δn}, √n VD^I2(d^I1)/max{σ^I2(d^I1), δn} ], (4)

for some positive sequence δn → 0, and reject H0 when T^CV > zα/2. The sequence δn guarantees that the denominators in T^CV are strictly greater than 0.
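The sample-splitting statistic in (4) can be sketched as follows (our own simplified implementation, assuming a randomized trial with known propensities; `fit_regime` stands in for any contrast-function estimator and its signature is an assumption):

```python
import numpy as np

Z_ALPHA_HALF = 1.96  # upper 2.5% quantile of N(0,1), for alpha = 0.05

def vd_and_sigma(idx, A, Y, pi, d):
    """VD_hat_I(d) and sigma_hat_I(d) over an index set I."""
    t = ((1 - A[idx]) / (1 - pi[idx]) - A[idx] / pi[idx]) * Y[idx] * (1 - d[idx])
    return t.mean(), t.std(ddof=1)

def cv_test_stat(X, A, Y, pi, fit_regime, delta_n, seed=1):
    """T_CV of (4): fit the regime on one half, evaluate the value difference
    on the other half, truncate the denominator at delta_n, take the maximum."""
    N = len(Y)
    n = N // 2
    perm = np.random.default_rng(seed).permutation(N)
    I1, I2 = perm[:n], perm[n:2 * n]
    stats = []
    for fit_idx, eval_idx in ((I2, I1), (I1, I2)):
        d = fit_regime(X[fit_idx], A[fit_idx], Y[fit_idx], X)  # d evaluated at all X
        vd, sig = vd_and_sigma(eval_idx, A, Y, pi, d)
        stats.append(np.sqrt(n) * vd / max(sig, delta_n))
    return max(stats)

# demo under the alternative tau(X) = X(1)
rng = np.random.default_rng(0)
N = 4000
X = rng.normal(size=(N, 3))
A = rng.binomial(1, 0.5, size=N)
Y = A * X[:, 0] + rng.normal(scale=0.5, size=N)
t_cv = cv_test_stat(X, A, Y, np.full(N, 0.5),
                    lambda Xf, Af, Yf, Xall: (Xall[:, 0] > 0).astype(float), 0.1)
```

H0 is rejected when the returned statistic exceeds Z_ALPHA_HALF; the regime is always estimated and evaluated on disjoint halves, which is what protects the test in the nonregular cases.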

As an alternative to the sample-splitting method, one can consider a Wald-type test statistic based on the online one-step estimator proposed by Luedtke and van der Laan (2016). However, calculating such a test statistic is more computationally expensive than ours. Besides, the asymptotic normality of such a test statistic requires the class of functions {[(1 − A)/{1 − π(X)} − A/π(X)]Y{1 − d(X)} : d} to be Glivenko-Cantelli, where d varies over the range of the estimators d^ (see Section 7.3 in Luedtke and van der Laan, 2016). In contrast, our testing procedure is valid under H0 for any d^.

Theorem 3.1. Assume conditions (A1)–(A3) hold, E|Y|³ = O(1) and δn ≫ n^{−1/6}. Then under H0, for any 0 < α < 1, we have

lim sup_n Pr(T^CV > zα/2) ≤ α.

Moreover, assume that

Var[ {A/π(X) − (1 − A)/(1 − π(X))} Y {1 − d^Ij(X)} | {Oi}i∈Ij ] = op(δn), (5)

for j = 1, 2, where Var(V1 | V2) denotes the variance of V1 conditional on V2. Then, we have Pr(T^CV > zα/2) → 0.

The following theorem states the consistency of our proposed test statistic. It relies on Conditions (C1) and (C2). We provide these conditions in Section B of the Appendix to save space.

Theorem 3.2. Assume conditions (A1)–(A3), (C1), (C2) hold, E|Y|³ = O(1) and δn → 0. Under H1: V(dopt) = V(1) + hn, if hn ≫ n^{−1/2}, then we have Pr(T^CV > zα/2) → 1. Moreover, assume Pr{τ(X) = 0} = 0 and lim inf_n σ0² > 0, where

σ0² = Var[ {A/π(X) − (1 − A)/(1 − π(X))} Y {1 − dopt(X)} ].

If √n hn = O(1), then we have

Pr(T^CV > zα/2) = 2Φ̄(zα/2 − √n hn/σ0) − Φ̄²(zα/2 − √n hn/σ0) + o(1),

where Φ̄(z) = Pr(Z ≥ z) for a standard normal random variable Z.

Theorems 3.1 and 3.2 show the consistency of our testing procedure. Note that Conditions (C1) and (C2) are not required for Theorem 3.1. This suggests the type-I error is well controlled regardless of the estimation procedure. On the other hand, the conditions on δn in Theorem 3.1 are stronger than those in Theorem 3.2. In the regular cases where Pr{τ(X) = 0} = 0, Theorem 3.2 provides the asymptotic power function of our test. Notice that hn is equal to −E[τ(X)I{τ(X) < 0}], which relies on the dependence structure of the covariates. As a result, the power of our test depends crucially on the underlying data-generating process.

In this paper, d^ is obtained by a plug-in estimator based on some nonparametric estimation of the contrast function. Alternatively, one can directly estimate dopt using OWL. Theorem 3.2 holds as long as the estimated decision function d^ satisfies V(d^I) = V(dopt) + op(|I|^{−1/2}).

Since we assume V(1) ≥ V(0), under H0 we have Pr{τ(X) ≥ 0} = 1. In the regular cases where Pr{τ(X) = 0} = 0, we have Pr{dopt(X) = 1} = 1 and hence

Var[ {A/π(X) − (1 − A)/(1 − π(X))} Y {1 − dopt(X)} ] = 0.

Besides, in the regular cases, dopt can be consistently estimated by d^Ij (see Equation (S.18) in the supplementary article). Assume conditions (C1) and (C2) hold with γ ≥ 1. Then we can show (5) holds. Hence, the type-I error of our test will go to 0.

3.2. A sparse random projection-based test statistic

When p is large, it is far more challenging to estimate the contrast function τ(x) due to the curse of dimensionality. To handle high-dimensional covariates, we project the covariates into a low-dimensional vector space to construct our test statistic. Throughout this paper, we assume the dimension of the projected space, q, is fixed. For a given matrix S ∈ ℝ^{q×p} and any ω ∈ ℝ^q, define

τS(ω) = E{τ(X) | SX = ω}.

Under (A1)–(A3), the treatment regime dSopt(x) = I{τS(Sx) > 0} is optimal in the sense that it maximizes the value function among the class of treatment regimes based only on the projected covariates SX.

Since q is small, τS can be consistently estimated. We can construct a value-based test statistic as discussed in Section 3.1 based on the projected data {OiS}i∈{1,…,N}, where OiS = (SXi, Ai, Yi). The power of such a test statistic depends crucially on the sketching matrix S. To better understand this, consider the following example:

τ(X) = {(X(1) + X(2))/√2 − √2δ} {(X(3) + X(4) + X(5) + X(6) + X(7))/√5}², (6)

for some δ > 0, where X(j) denotes the j-th element of X.

Clearly, we have τ(X) > 0 if X(1) + X(2) > 2δ and τ(X) < 0 if X(1) + X(2) < 2δ. Assume X ~ N(0, Ip). Then X has the OQTE. Let q = 1. The “optimal” sketching matrix S* is equal to

S* = c0(1, 1, 0, 0, …, 0),

for any c0 ≠ 0. For any S ∈ ℝ^{1×p} such that S*Sᵀ = 0, SX is independent of X(1) + X(2). Then, we have

τS(ω) = E{τ(X) | SX = ω} = E[ {(X(1) + X(2))/√2 − √2δ} {(X(3) + X(4) + X(5) + X(6) + X(7))/√5}² | SX = ω ] = −√2δ E[ {(X(3) + X(4) + X(5) + X(6) + X(7))/√5}² | SX = ω ].

Hence, τS(ω) is always nonnegative or nonpositive as a function of ω. As a result, the test statistic based on {OiS}i doesn’t have any power to detect the OQTE. The challenge here lies in finding a projection matrix S that is highly correlated with S*.

Below, we propose a data-dependent algorithm to generate S and introduce our test statistic. Our theory shows that our test statistic works as if the optimal sketching matrix S* were known. Statistical properties of our testing procedure are formally studied in Section 3.2.2.

3.2.1. Test statistic

Assume for now that we have an estimator τ^IS for τS based on any subset of the projected data {OiS}i∈I, and an algorithm to sample sparse sketching matrices whose distribution G(S, {Oi}i∈I) is allowed to depend on {Oi}i∈I. We describe the whole testing procedure in Algorithm 1.

Algorithm 1. Calculate the random projection-based test statistic.
1. Input observations {Oi}i=1,…,N, δn, α and a sampling distribution G.
2. Randomly partition the data into two subsets {Oi}i∈I1 and {Oi}i∈I2.
3. For j = 1, 2:
   (i) Independently sample a sparse sketching matrix SIj ~ G(S, {Oi}i∈Ij);
   (ii) Obtain the estimators τ^IjSIj and d^IjSIj(x) = I{τ^IjSIj(SIjx) > 0};
   (iii) Calculate T^SIj = √n VD^Ijc(d^IjSIj)/max{σ^Ijc(d^IjSIj), δn}, where Ijc is the complement of Ij.
4. Reject H0 if T^SRP = max(T^SI1, T^SI2) > zα/2.

Now we present our algorithm for generating the sparse sketching matrix. We first introduce some notation. For any matrix Ψ with J rows, let Ψ(i) be the ith row of Ψ. For any vector ψ ∈ ℝ^J and any set M ⊆ {1, …, J}, denote by ψM the subvector of ψ formed by the elements in M. Let Mc be the complement of M. Let ‖ψ‖0 be the number of nonzero elements in ψ and ‖ψ‖2 be the Euclidean norm of ψ. Let S denote the space of sparse sketching matrices:

S = {S ∈ ℝ^{q×p} : ‖S(i)‖0 ≤ s, ‖S(i)‖2 = 1, i = 1, …, q},

for some fixed integer s that satisfies 2 ≤ s ≤ p. Denote by N(0, IJ) a J-dimensional Gaussian random vector with mean zero and identity covariance matrix.

It remains to generate SIj based on the sub-dataset {Oi}i∈Ij. We first sample many sparse sketching matrices from S. Each row of a sketching matrix is independently and uniformly distributed on the space {S ∈ ℝ^p : ‖S‖0 = s, ‖S‖2 = 1}. This corresponds to Step 2 in our proposed algorithm below. Then we output the sparse sketching matrix that maximizes the estimated value difference function. Specifically, we propose using a data-splitting strategy for the evaluation of the value difference function. That is, for each sketching matrix, we randomly divide {Oi}i∈Ij into K folds, use K − 1 of the subsamples to estimate the OITR based on the projected covariates, use the remaining subsample to evaluate the corresponding value difference function, and aggregate these value differences over the K held-out subsamples. This corresponds to Steps 3–5 in our proposed algorithm below. We summarize our procedure in Algorithm 2.

Algorithm 2. Generate a data-dependent sparse random sketching matrix.
1. Input observations {Oi}i∈I and integers B, s, q, and K ≥ 2.
2. Generate i.i.d. matrices S1, S2, …, SB distributed as S0, whose distribution is described as follows. For j = 1, …, q:
   (i) Independently select a simple random sample Mj of size s from {1, …, p};
   (ii) Independently generate a Gaussian random vector gj ~ N(0, Is);
   (iii) Set S0,Mjc(j) = 0 and S0,Mj(j) = gj/‖gj‖2.
3. Randomly divide I into K subsets {I(k)}k=1,…,K of equal sizes. Let I(−k) = I ∩ (I(k))c.
4. For b = 1, …, B:
   (i) For k = 1, …, K:
      (i.1) Obtain the estimator τ^I(−k)Sb and d^I(−k)Sb(x) = I{τ^I(−k)Sb(Sbx) > 0};
      (i.2) Evaluate the value difference VD^I(k)(d^I(−k)Sb).
   (ii) Obtain the cross-validated estimator VD^CVSb = Σk VD^I(k)(d^I(−k)Sb)/K.
5. Output Sb^, where b^ = argmax_{b=1,…,B} VD^CVSb.
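Step 2 of Algorithm 2, the sampling step over the class S, can be sketched as follows (our own code; the function name is an assumption, and the data-dependent selection of Steps 3–5 is omitted):

```python
import numpy as np

def sample_sparse_sketch(p, q, s, rng):
    """Draw one sketching matrix with q rows: each row has s nonzero entries
    on a uniformly chosen support M_j, filled with a normalized Gaussian
    vector g_j / ||g_j||_2, so that ||S(j)||_0 <= s and ||S(j)||_2 = 1."""
    S = np.zeros((q, p))
    for j in range(q):
        support = rng.choice(p, size=s, replace=False)  # simple random sample M_j
        g = rng.normal(size=s)                          # g_j ~ N(0, I_s)
        S[j, support] = g / np.linalg.norm(g)           # unit Euclidean norm
    return S

S = sample_sparse_sketch(p=50, q=2, s=5, rng=np.random.default_rng(0))
```

In the full procedure, B such matrices are drawn and the one with the largest cross-validated value difference is kept.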

3.2.2. Asymptotic properties under the null and local alternative

We first show the validity of the proposed test, which applies to any estimator τ^IS. For any positive sequences {an} and {bn}, we write an ≫ bn if and only if lim sup_n bn/an = 0.

Theorem 3.3. Assume (A1)–(A3) hold, E|Y|³ = O(1) and δn ≫ n^{−1/6}. Then under H0, we have

lim sup_n Pr(T^SRP > zα/2) ≤ α.

Moreover, assume that

Var[ {A/π(X) − (1 − A)/(1 − π(X))} Y {1 − d^IjSIj(X)} | {Oi}i∈Ij, SIj, Ij ] = op(δn),

for j = 1, 2. Then we have Pr(T^SRP > zα/2) → 0.

Let S* = argmax_{S∈S} V(dSopt) be the set of optimal sketching matrices. The optimal sketching matrix may not be unique. To see this, note that for any sketching matrix S* ∈ S that maximizes V(dSopt), −S* also maximizes V(dSopt), and hence −S* ∈ S*. Moreover, when q ≥ 2, there may exist infinitely many maximizers.

Our theoretical studies are mostly concerned with the “oracle” test statistic, for which the oracle knows the set S* ahead of time. In Algorithm 1, Step 3(i), instead of using Algorithm 2 to sample SI1 and SI2, we use the oracle choice SI1 = SI2 = S* for an arbitrary S* ∈ S*. Denote by T^oracle the resulting oracle test statistic. Let hn* = max_{S*∈S*} V(dS*opt) − V(1). Similar to Theorem 3.2, under H1, if hn* ≫ n^{−1/2}, then we can show

Pr(T^oracle > zα/2) → 1.

Assume

V(dS*opt) = V(dopt),  ∀S* ∈ S*. (7)

This condition means that the optimal decision rule depends on the projected covariates S*X only. It holds when τ(x) = ϕ(S*x)g(x) for some function ϕ(∙) and some nonnegative function g(∙). In the regular cases where Pr{τ(X) = 0} = 0, (7) implies that Pr{dopt(X) = dS*opt(X)} = 1, ∀S* ∈ S*. Thus, the class of optimal treatment regimes {dS*opt : S* ∈ S*} will almost surely recommend the same treatment to any given patient. Assume (7) holds and Pr{τ(X) = 0} = 0. Similar to Theorem 3.2, the asymptotic power of T^oracle can be derived as

Pr(T^oracle > zα/2) = 2Φ̄(zα/2 − √n hn/σ0) − Φ̄²(zα/2 − √n hn/σ0) + o(1), (8)

where hn and σ0 are defined in Theorem 3.2.

In the following, we prove the consistency of our proposed testing procedure when using Algorithm 2 to generate the sparse sketching matrix. Moreover, we show our test statistic possesses the oracle property. This means the power function of T^SRP is asymptotically the same as the oracle test statistic T^oracle.

Define the semimetric

dτ(S1, S2) = [E{τS1(S1X) − τS2(S2X)}²]^{1/2},  S1, S2 ∈ S.

We make the following assumptions.

(A4.) For any sketching matrices S1, S2, …, SB ∈ S and any I ⊆ {1, 2, …, N} with |I| ≥ n/2, assume the following event holds with probability tending to 1:

max_{b=1,…,B} EX|τ^ISb(SbX) − τSb(SbX)|² = O(n^{−r0} log n),

where the expectation EX is taken with respect to X, and the big-O term is uniform in I and S1, …, SB.

(A5.) Assume B ≫ (p√n)^{(s−1)q}. In addition, assume there exist some constant C̄ > 0 and some sketching matrix S* ∈ S* such that

dτ(S, S*) ≤ C̄ {Σ_{j=1}^q ‖S(j) − S*(j)‖2²}^{1/2},  ∀S ∈ S. (9)

(A6.) Assume there exist some constants γ, ε0, δ0 > 0 such that for any sketching matrix S satisfying V(dSopt) ≥ V(dS*opt) − ε0, we have Pr{0 < τS(SX) ≤ t} = O(t^γ), where the big-O term is uniform in 0 < t < δ0 and S.

Condition (A4) assumes a uniform convergence rate for τ^ISb, b = 1, …, B. Since the uniform convergence rate deteriorates as B increases, Condition (A4) gives an upper bound for B. On the contrary, Condition (A5) gives a lower bound for B. It requires B to diverge at a proper rate, to give us a good chance of finding a random projection with a large value function. More specifically, under (A5), we can show that

Pr{ max_{b=1,…,B} V(dSbopt) = V(dS*opt) + o(n^{−1/2}) } → 1.

In Section C.3 of the Appendix, we show that (A5) holds when τ(x) = ϕ(S*x) for some sketching matrix S* ∈ S* and some Lipschitz continuous function ϕ(∙).

Condition (A6) holds with γ = 1 when τS(SX) has a uniformly bounded density function near 0 for any sketching matrix S that nearly maximizes the value function (see Section C.4 in the Appendix for a detailed discussion). Assume τ(X) ≥ δ0 almost surely or τ(X) ≤ −δ0 almost surely. Then for any sketching matrix S, we have τS(SX) ≥ δ0 almost surely or τS(SX) ≤ −δ0 almost surely. As a result, (A6) automatically holds for any γ > 0.

In Section C.2 of the Appendix, we consider a simple model and show that (A4)–(A6) hold.

Theorem 3.4. Assume Conditions (A1)–(A5) hold, E|Y|³ = O(1), log B = o(n^{1/3}) and δn → 0. If hn* ≫ max{√(log B/n), n^{−r0/2} log n}, then we have

Pr(T^SRP > zα/2) → 1.

Moreover, assume (7) and (A6) hold, Pr{τ(X) = 0} = 0, √n hn = O(1), B = O(n^{κB}) for some κB > 0, r0 > (γ + 2)/(2γ + 2) and lim inf_n σ0 > 0. Then we have

Pr(T^SRP > zα/2) = 2Φ̄(zα/2 − √n hn/σ0) − Φ̄²(zα/2 − √n hn/σ0) + o(1).

Assume p = O(n) and set B = c*n^{{3q(s−1)+ϵ}/2} for any c*, ϵ > 0. Then the conditions B ≫ (p√n)^{(s−1)q} in (A5) and B = O(n^{κB}) in Theorem 3.4 automatically hold. It is worth mentioning that when hn and σ0 don’t depend on p, Theorem 3.4 implies that the asymptotic power of our test is independent of p.

3.3. Some implementation issues

3.3.1. Doubly-robust test statistics

So far we have assumed that the propensity scores are known for all patients. In the following, we introduce a doubly-robust test statistic to deal with data from an observational study. We begin by introducing a doubly-robust value difference estimator, which requires estimation of the propensity score and the conditional mean functions h0(x) = E(Y | A = 0, X = x) and h1(x) = E(Y | A = 1, X = x). Denote by π^(∙), h^0(∙) and h^1(∙) the corresponding estimators. Zhang et al. (2012) proposed a doubly-robust estimator for the value function under a given treatment regime d,

V^dr(d) = (1/N) Σ_{i=1}^N [ {Aidi/π^(Xi) + (1 − Ai)(1 − di)/{1 − π^(Xi)}} Yi − {Aidi/π^(Xi) + (1 − Ai)(1 − di)/{1 − π^(Xi)} − 1} {h^0(Xi)(1 − di) + h^1(Xi)di} ],

where di is a shorthand for d(Xi). When either the propensity score or the conditional mean models are correctly specified, V^dr(d) is consistent for V(d) (Zhang et al., 2012). Based on V^dr, for any I ⊆ {1, …, N} and a given treatment regime d, we define our doubly-robust value difference estimator as

VD^Idr(d) = (1/|I|) Σ_{i∈I} [ {(1 − Ai)/(1 − π^iI) − Ai/π^iI} Yi − {(1 − Ai)/(1 − π^iI) − 1} h^0,iI + {Ai/π^iI − 1} h^1,iI ] (1 − di),

where π^iI = π^I(Xi), h^0,iI = h^0I(Xi), h^1,iI = h^1I(Xi), and π^I, h^0I, h^1I are obtained based on {Oi}i∈I. When p is large, we recommend estimating π, h0 and h1 via penalized regression. The asymptotic variance of √|I| VD^Idr(d) can be consistently estimated by {σ^Idr(d)}², whose exact form is given in Section A of the Appendix.
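A minimal numpy sketch of the doubly-robust value difference estimator VD^Idr(d) (our own illustration; the function name is an assumption, and the fitted π^, h^0, h^1 are simply passed in as arrays):

```python
import numpy as np

def dr_value_difference(A, Y, d, pi_hat, h0_hat, h1_hat):
    """Doubly-robust estimate of VD(d) = V(d) - V(1): the IPW contrast is
    augmented with the outcome regressions, and only observations with
    d(X_i) = 0 contribute (the two regimes agree when d(X_i) = 1)."""
    ipw = ((1 - A) / (1 - pi_hat) - A / pi_hat) * Y
    aug = -((1 - A) / (1 - pi_hat) - 1) * h0_hat + (A / pi_hat - 1) * h1_hat
    terms = (ipw + aug) * (1 - d)
    return terms.mean(), terms.std(ddof=1)   # VD_hat and per-observation s.d.

# demo with correctly specified models: tau(X) = X(1), pi = 0.5, h0 = 0, h1 = X(1)
rng = np.random.default_rng(0)
N = 40000
X1 = rng.normal(size=N)
A = rng.binomial(1, 0.5, size=N)
Y = A * X1 + rng.normal(scale=0.5, size=N)
vd, sd = dr_value_difference(A, Y, (X1 > 0).astype(float),
                             np.full(N, 0.5), np.zeros(N), X1)
```

With either nuisance model correct, the estimator recovers the true value difference V(dopt) − V(1), here −E[X(1) I{X(1) < 0}].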

We briefly summarize our test procedure. Similar to Algorithm 1, we first randomly partition the data into two halves {Oi}i∈I1 and {Oi}i∈I2, and obtain the estimates π^Ij, h^0Ij, h^1Ij based on {Oi}i∈Ij for j = 1, 2. Then we independently sample the sparse sketching matrices SI1 and SI2. The sampling algorithm is similar to Algorithm 2. Specifically, for j = 1, 2, we randomly divide Ij into {Ij(k)}k=1,…,K and independently sample S1, …, SB as in Steps 2 and 3 of Algorithm 2. Then we calculate the doubly-robust value difference estimator,

VD^CVdr,Sb = K^{−1} Σk VD^Ij(k)dr(d^Ij(−k)Sb), (10)

for each Sb, where Ij(−k) = Ij ∩ (Ij(k))c, and set SIj = Sb^, where b^ = argmax_{b=1,…,B} VD^CVdr,Sb. Finally, we define our test statistic by

T^SRPdr = max[ √n VD^I2dr(d^I1SI1)/max{σ^I2dr(d^I1SI1), δn}, √n VD^I1dr(d^I2SI2)/max{σ^I1dr(d^I2SI2), δn} ], (11)

and reject H0 if T^SRPdr > zα/2 at a given significance level α > 0. Statistical properties of T^SRPdr can be similarly established.

3.3.2. Estimation of τS

The projected contrast function τS can be estimated by any machine learning or statistical nonparametric method. In our implementation, we estimate τS using cubic B-splines. Let I be an arbitrary subset of {1, …, N}. Based on the dataset {Oi}i∈I, we first estimate π using penalized logistic regression, and estimate h0, h1 using penalized linear regression, with SCAD penalty functions (Fan and Li, 2001). These penalized regressions are implemented with the R package ncvreg, and the tuning parameters are selected via 10-fold cross-validation. Let π^iI, h^0,iI and h^1,iI be the corresponding estimators of π(Xi), h0(Xi) and h1(Xi), respectively. Recall that S(j) ∈ ℝ^{1×p} is the jth row of the sketching matrix S. We define the pseudo outcome

τ^iI = {Ai/π^iI − (1 − Ai)/(1 − π^iI)} Yi − {Ai/π^iI − 1} h^1,iI + {(1 − Ai)/(1 − π^iI) − 1} h^0,iI, (12)

and minimize

(ξ^1I, …, ξ^qI) = argmin_{ξ1,…,ξq} (1/|I|) Σ_{i∈I} [ τ^iI − Σ_{j=1}^q Σ_{k=1}^{K+4} NkS(j)(S(j)Xi) ξj,k ]², (13)

where N1S(j)(∙), …, N(K+4)S(j)(∙) are cubic B-spline bases of S(j)Xi and K is the number of interior knots. Given K, we place the interior knots at equally-spaced sample quantiles of the projected covariates {SXi}i∈I. After solving (13), we set τ^IS(Sx) = Σ_{j=1}^q Σ_{k=1}^{K+4} NkS(j)(S(j)x) ξ^j,kI.
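The pseudo outcome (12) is straightforward to compute; below is a sketch (our own code) together with a sanity check that, under correctly specified nuisance models, τ^i has conditional mean τ(Xi). The spline regression step (13) is omitted; in place of the cubic B-spline basis one could use, e.g., `scipy.interpolate.BSpline` or any other series basis.

```python
import numpy as np

def pseudo_outcomes(A, Y, pi_hat, h0_hat, h1_hat):
    """AIPW pseudo outcome (12): E[tau_hat_i | X_i] = tau(X_i) when either the
    propensity model or the outcome models are correctly specified."""
    return ((A / pi_hat - (1 - A) / (1 - pi_hat)) * Y
            - (A / pi_hat - 1) * h1_hat
            + ((1 - A) / (1 - pi_hat) - 1) * h0_hat)

# demo: tau(X) = X(1), randomized trial with pi = 0.5 and correct h0, h1
rng = np.random.default_rng(0)
N = 20000
X1 = rng.normal(size=N)
A = rng.binomial(1, 0.5, size=N)
Y = A * X1 + rng.normal(scale=0.5, size=N)
po = pseudo_outcomes(A, Y, np.full(N, 0.5), np.zeros(N), X1)
```

Regressing these pseudo outcomes on a basis of the projected covariates S(j)Xi then yields the estimator τ^IS.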

Based on the B-spline method, we show in Section C.2.1 of the Appendix that (A4) holds with r0 = 4/5 when q = 1 and B = O(n^{κB}) for any κB > 0. Assume (A6) holds with γ > 2/3. The condition r0 > (γ + 2)/(2γ + 2) in Theorem 3.4 is then satisfied. More generally, we may use series estimators (Belloni et al., 2015) to estimate τS. In that case, the rate r0 in (A4) will decrease as the projected dimension q increases.

3.3.3. Choice of s

Our testing procedure requires specification of s, which determines the number of nonzero elements in each row of the sketching matrix. Ideally, one could treat s as a tuning parameter and choose s to maximize the estimated value difference defined in (10). However, this approach would be time-consuming. In our implementation, we instead treat s as a discrete random variable when sampling S1, …, SB. More specifically, for b = 1, …, B, we first independently sample s according to some random variable s0, and then sample Sb according to Step 2 of Algorithm 2.

We recommend setting $s_0 = 2 + \mathrm{Binom}(p - 2, p_0)$, where $\mathrm{Binom}(m, p_0)$ denotes a binomial random variable with $m$ trials and success probability $p_0$. In our simulation study, we set $p_0 = 2/(p - 2)$, so that $E(s_0) = 4$.
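This randomized choice of the sparsity level can be sketched in a few lines of Python (the paper's implementation is in R/C; the function name is illustrative):

```python
import random

def sample_s(p, p0=None, rng=random):
    """Draw the sparsity level s0 = 2 + Binom(p - 2, p0).

    p0 defaults to 2/(p - 2), so that E[s0] = 2 + (p - 2) * p0 = 4,
    matching the recommendation in the text."""
    if p0 is None:
        p0 = 2.0 / (p - 2)
    # Binomial draw as a sum of p - 2 independent Bernoulli(p0) trials
    return 2 + sum(rng.random() < p0 for _ in range(p - 2))

random.seed(1)
draws = [sample_s(50) for _ in range(10000)]
print(min(draws), max(draws), sum(draws) / len(draws))  # all in [2, 50]; mean near 4
```

Each sketching-matrix row thus has at least 2 and at most $p$ nonzero entries, with an average of 4.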

3.3.4. Choice of q

The choice of the projection dimension $q$ involves a trade-off. If $q$ is too large, the curse of dimensionality degrades the uniform convergence rates of $\hat\tau_I^{S_j}$ in (A8), resulting in decreased power of the corresponding test. If $q$ is too small, the OITR is not well approximated. In our numerical experiments, we set $q = 1$. In the supplementary article, we examine the performance of the proposed test with different choices of $q$. The results show that the optimal choice of $q$ depends on the number of covariates involved in the OITR and varies across simulation settings. We further propose a method that adaptively determines $q$; the detailed algorithm is given in Section E.2 of the supplementary article. In our simulations, we find that this adaptive method is no worse than any fixed choice of $q$ and has nearly optimal performance in some cases.

3.3.5. Choices of other hyperparameters

We recommend setting the number of folds $K$ in Algorithm 2 to 5 or 10. The number of sketching matrices $B$ should diverge as $N, p \to \infty$. In practice, we recommend setting $B = N^{\kappa_N} p^{\kappa_p}$ for some $\kappa_N, \kappa_p \ge 1$.

4. Simulations

4.1. Settings

We examine the finite sample performance of the proposed tests via Monte Carlo simulations. Simulated data with sample size N were generated from

$$Y = 1 + (X^{(1)} - X^{(2)})/2 + A\,\tau(X) + e,$$

where $X \sim N(0, I_p)$, $A \sim \mathrm{Binom}(1, 0.5)$ and $e \sim N(0, 0.5^2)$. Here, we set $p = 50$ or $100$.

We consider four scenarios. In the first three scenarios, we set

$$\tau(X) = \phi_\delta\{(X^{(1)} + X^{(2)})/2\}\, (X^{(3)} + X^{(4)} + X^{(5)} + X^{(6)} + X^{(7)})^2 / 5,$$

for some function $\phi_\delta$ parameterized by $\delta \ge 0$. More specifically, we set $\phi_\delta(x) = x^2 - \delta$ in Scenario 1, $\phi_\delta(x) = \delta \cos(\pi x)$ in Scenario 2, and $\phi_\delta(x) = 2\pi \delta x$ in Scenario 3.

In Scenario 4, we set

$$\tau(X) = \delta \left\{ \sum_{j=1}^{2} \frac{(X^{(j)})^2}{2} - \sum_{j=3}^{20} \frac{X^{(j)}}{18} - 2 \right\} (X^{(21)} + X^{(22)} + X^{(23)} + X^{(24)} + X^{(25)})^2 / 5.$$

It is immediate that the OITR is sparse and depends only on $X^{(1)}$ and $X^{(2)}$ in the first three scenarios. In Scenario 4, however, a total of 20 variables are involved in the OITR. In addition, the true OITR is linear in $X$ under Scenario 3, but nonlinear under Scenarios 1, 2 and 4. We set $N = 500$ in Scenarios 1, 2 and 3, and $N = 1000$ in Scenario 4.

For all scenarios, the parameter δ controls the degree of overall qualitative treatment effects. Specifically, H0 holds if δ = 0 and H1 holds if δ > 0. For each scenario, we further consider four cases by setting VD(dopt) = V(dopt) – V(1) = 0, 0.2, 0.35 and 0.5. Note that in Scenarios 2, 3 and 4, the settings for VD(dopt) = 0 are the same. Hence, in Scenarios 3 and 4, we only report the simulation results for VD(dopt) = 0.2, 0.35 and 0.5.
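Under our reading of the data-generating displays above (the exact forms may differ slightly in the published version), one replication of Scenario 1 can be generated with the following self-contained Python sketch; the function name is illustrative and the paper's own simulations are run in R/C:

```python
import random

def generate_scenario1(N, p=50, delta=0.0, rng=random):
    """Draw N observations (X, A, Y) under (our reading of) Scenario 1:
    Y = 1 + (X1 - X2)/2 + A * tau(X) + e,  e ~ N(0, 0.5^2),
    tau(X) = phi_delta((X1 + X2)/2) * (X3 + ... + X7)^2 / 5,
    with phi_delta(x) = x^2 - delta."""
    data = []
    for _ in range(N):
        X = [rng.gauss(0, 1) for _ in range(p)]
        A = 1 if rng.random() < 0.5 else 0
        u = (X[0] + X[1]) / 2
        tau = (u ** 2 - delta) * sum(X[2:7]) ** 2 / 5
        Y = 1 + (X[0] - X[1]) / 2 + A * tau + rng.gauss(0, 0.5)
        data.append((X, A, Y, tau))
    return data

random.seed(7)
sample = generate_scenario1(200, delta=0.0)
# With delta = 0 the contrast is nonnegative everywhere, so treating everyone
# with A = 1 is optimal and H0 (no overall qualitative effect) holds.
print(all(tau >= 0 for (_, _, _, tau) in sample))  # True
```

The final check illustrates the claim in the text: at $\delta = 0$, $H_0$ holds, while any $\delta > 0$ makes $\phi_\delta$ negative on a set of positive probability and hence induces a qualitative treatment effect.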

We set $q = 1$ and calculate $\hat T_{SRP}^{dr}$ as described in Section 3.3. The number of interior knots $K$ in the cubic B-spline bases is specified as follows. When generating $S_{I_1}$ or $S_{I_2}$, we fix $K = 3$ when estimating $\tau^{S_b}$ for $b = 1, \ldots, B$. After obtaining $S_{I_1}$ and $S_{I_2}$, $K$ is tuned by cross-validation when estimating $\tau^{S_{I_1}}$ and $\tau^{S_{I_2}}$. We set $B = 10^5$ for $p = 50$ and $B = 4 \times 10^5$ for $p = 100$.

The whole simulation program is implemented in R. Some subroutines, including sampling the data-dependent sketching matrices $S_{I_1}$ and $S_{I_2}$ and estimating $\tau^{S_{I_1}}$ and $\tau^{S_{I_2}}$, are written in C with the GNU Scientific Library (GSL; Galassi et al., 2015).

4.2. Competing methods

Comparison is made among the following five test statistics:

  1. The proposed sparse random projection-based test statistic T^SRPdr.

  2. The dense random projection-based test statistic, denoted by T^RPdr.

  3. The cross-validated test statistic with the OITR estimated by the penalized least squares method developed in Shi et al. (2016), denoted by T^PLS.

  4. The cross-validated test statistic based on step-wise variable selection, denoted by T^VS.

  5. The supremum-type test statistic T^DL based on the desparsified Lasso estimator (Zhang and Zhang, 2014; van de Geer et al., 2014).

$\hat T_{RP}^{dr}$ is computed in a similar fashion to $\hat T_{SRP}^{dr}$. We randomly partition $\{1, \ldots, N\}$ into $I_1 \cup I_2$ of equal size, generate data-dependent sketching matrices $S_{I_1}$ and $S_{I_2}$, and construct the test statistic as in (11). When generating $S_{I_1}$ or $S_{I_2}$, instead of sampling $B$ sparse sketching matrices as described in Step 3 of Algorithm 2, we generate $B$ dense sketching matrices $S_1, \ldots, S_B$ distributed as $Z_0/\|Z_0\|_2$, where $Z_0 \in \mathbb{R}^p$ is a Gaussian random vector with mean zero and identity covariance matrix, and set $S_{I_1}$ or $S_{I_2}$ to be the one that gives the largest cross-validated value difference as in (10). As for $\hat T_{SRP}^{dr}$, we set $B = 10^5$ for $p = 50$ and $B = 4 \times 10^5$ for $p = 100$, and use cubic B-splines to estimate $\tau^S$ for any sketching matrix $S$.
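The dense projection rows used by the competing test are uniform random directions on the unit sphere, which can be sketched in pure Python as follows (illustrative function name; the paper's implementation is in R/C):

```python
import math
import random

def dense_direction(p, rng=random):
    """Sample one dense projection row Z0 / ||Z0||_2 with Z0 ~ N(0, I_p).

    Normalizing a standard Gaussian vector yields a direction that is
    uniformly distributed on the unit sphere in R^p."""
    z = [rng.gauss(0, 1) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]

random.seed(3)
s_row = dense_direction(100)
print(round(sum(v * v for v in s_row), 10))  # 1.0 (unit Euclidean norm)
```

In contrast to the sparse rows of Algorithm 2, every coordinate of such a row is nonzero with probability one, which is one reason the dense variant struggles to isolate the few covariates carrying qualitative interactions.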

To calculate $\hat T_{PLS}$, we first partition the data into two halves $\{O_i\}_{i \in I_1}$ and $\{O_i\}_{i \in I_2}$. Then for $j = 1, 2$, we set $\hat d_{I_j}(x) = I(\bar x^T \hat\beta_{I_j} > 0)$, where $\bar x = (1, x^T)^T$ and $\hat\beta_{I_j}$ is computed by

$$\hat\beta_{I_j} = \arg\min_{\beta \in \mathbb{R}^{p+1}} \frac{1}{|I_j|} \sum_{i \in I_j} \left\{ Y_i - \bar X_i^T \hat\theta_{I_j} - (A_i - \hat\pi_i^{I_j}) \bar X_i^T \beta \right\}^2 + \sum_{k=2}^{p+1} p_{\lambda_{n,1}}(|\beta_k|), \qquad (14)$$

for some penalty function $p_\lambda$, where $\bar X_i = (1, X_i^T)^T$, $\hat\pi_i^{I_j}$ is the estimated propensity score for the $i$th patient based on a penalized logistic regression with the SCAD penalty, and $\hat\theta_{I_j}$ is calculated by

$$\hat\theta_{I_j} = \arg\min_{\theta \in \mathbb{R}^{p+1}} \frac{1}{|I_j|} \sum_{i \in I_j} (Y_i - \bar X_i^T \theta)^2 + \sum_{k=2}^{p+1} p_{\lambda_{n,2}}(|\theta_k|). \qquad (15)$$

We use the SCAD penalty in both (14) and (15). The tuning parameters $\lambda_{n,1}$ and $\lambda_{n,2}$ are selected via 10-fold cross-validation. Finally, define $\hat T_{PLS}$ by

$$\hat T_{PLS} = \max\left\{ \frac{\sqrt{n}\, \widehat{VD}_{I_2}^{dr}(\hat d_{I_1})}{\max\{\hat\sigma_{I_2}^{dr}(\hat d_{I_1}), \delta_n\}},\ \frac{\sqrt{n}\, \widehat{VD}_{I_1}^{dr}(\hat d_{I_2})}{\max\{\hat\sigma_{I_1}^{dr}(\hat d_{I_2}), \delta_n\}} \right\}. \qquad (16)$$

To compute $\hat T_{VS}$, we similarly split the observations into two sub-datasets $\{O_i\}_{i \in I_1}$ and $\{O_i\}_{i \in I_2}$. For each sub-dataset, we apply sequential advantage selection (SAS; Fan et al., 2016) to select variables with a qualitative interaction with the treatment. SAS is a greedy stepwise selection procedure and uses a BIC-type criterion to choose the best candidate subset of variables. Denote by $\hat{\mathcal M}_{I_1}, \hat{\mathcal M}_{I_2} \subseteq \{1, \ldots, p\}$ the corresponding sets of selected variables. Then for each $j = 1, 2$, we calculate the pseudo responses $\hat\tau_i^{I_j}$, $i \in I_j$ (see the definition in (12)) and compute

$$\hat\tau_{I_j} = \arg\min_{f \in \mathcal{H}_j} \frac{1}{n} \sum_{i \in I_j} \left\{ \hat\tau_i^{I_j} - f(X_{i, \hat{\mathcal M}_{I_j}}) \right\}^2 + \lambda_j \|f\|_{\mathcal{H}_j}^2,$$

where $\lambda_j > 0$ is a tuning parameter and $\mathcal{H}_j$ is the reproducing kernel Hilbert space with reproducing kernel $K_j(X_{i, \hat{\mathcal M}_{I_j}}, X_{k, \hat{\mathcal M}_{I_j}}) = \exp\{-\sum_{l \in \hat{\mathcal M}_{I_j}} \eta_{j,l} (X_i^{(l)} - X_k^{(l)})^2\}$, where $X_i^{(l)}$, $X_k^{(l)}$ denote the $l$th elements of $X_i$, $X_k$ and $\eta_{j,l} > 0$, $l \in \hat{\mathcal M}_{I_j}$, are tuning parameters. The estimation procedure is implemented in the R package listdtr and the tuning parameters are selected via leave-one-out cross-validation. We then define $\hat d_{I_j}(x) = I\{\hat\tau_{I_j}(x_{\hat{\mathcal M}_{I_j}}) > 0\}$ and set

$$\hat T_{VS} = \max\left\{ \frac{\sqrt{n}\, \widehat{VD}_{I_2}^{dr}(\hat d_{I_1})}{\max\{\hat\sigma_{I_2}^{dr}(\hat d_{I_1}), \delta_n\}},\ \frac{\sqrt{n}\, \widehat{VD}_{I_1}^{dr}(\hat d_{I_2})}{\max\{\hat\sigma_{I_1}^{dr}(\hat d_{I_2}), \delta_n\}} \right\}. \qquad (17)$$

We set $\delta_n = \log(\log_{10}(2n))/(2n)^{1/6}$ in (11), (16) and (17), where $\log_{10}$ denotes the logarithm with base 10.

T^DL tests the overall treatment effects by fitting the following linear regression model for the response:

$$E(Y \mid A, X) \approx \beta_0 + X^T \beta_x + A \beta_a + A X^T \beta_{ax}.$$

Based on this model, testing the overall treatment effects is equivalent to testing $H_0^*: \beta_{ax} = 0$. Denote $\beta = (\beta_0, \beta_x^T, \beta_a, \beta_{ax}^T)^T$. To deal with the high dimensionality, we estimate $\beta$ by the desparsified Lasso estimator $\hat\beta^{DL}$ and test $H_0^*$ using the supremum-type test statistic $\max_{j \in \mathcal{M}_{ax}} \sqrt{n} |\hat\beta_j^{DL}|$, where $\mathcal{M}_{ax} = \{p+3, \ldots, 2p+2\}$ indexes the interaction coefficients and $\hat\beta_j^{DL}$ is the $j$th element of $\hat\beta^{DL}$. The critical value of $\hat T_{DL}$ is approximated via the bootstrap. Detailed implementation of the test can be found in Zhang and Cheng (2017).
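Given any estimate of $\beta$, the supremum-type statistic is a simple maximum over the interaction block; a minimal Python sketch (the estimator itself is not implemented here, and the function name is illustrative):

```python
import math

def sup_statistic(beta_hat, p, n):
    """Compute max_{j in M_ax} sqrt(n) |beta_j| over the treatment-covariate
    interaction block.

    With beta = (beta_0, beta_x, beta_a, beta_ax) of length 2p + 2, the
    interaction coefficients occupy 1-based positions p + 3, ..., 2p + 2,
    i.e. the last p entries of the vector."""
    interactions = beta_hat[p + 2 : 2 * p + 2]  # 0-based slice of that index set
    return math.sqrt(n) * max(abs(b) for b in interactions)

# Toy estimate with p = 3 covariates (vector length 2p + 2 = 8);
# the last three entries play the role of beta_ax.
beta_hat = [0.9, 0.1, -0.2, 0.0, 0.05, 0.3, -0.6, 0.1]
print(sup_statistic(beta_hat, p=3, n=100))  # sqrt(100) * 0.6 = 6.0
```

The actual test compares this statistic to a bootstrap critical value, as in Zhang and Cheng (2017).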

4.3. Results

We conduct 500 simulations for each setting and report the proportions of rejecting the null hypothesis (%) in Tables 1 and 2, with standard errors in parentheses (%). Under $H_0$, the type-I error of our test statistic is well controlled. Specifically, in Scenario 1 with VD = 0, the rejection probability of $\hat T_{SRP}^{dr}$ is exactly zero. This is in line with our theory, which suggests that the type-I error of our test statistic converges to 0 in the regular cases where $\Pr\{\tau(X) = 0\} = 0$. In Scenario 2 with VD = 0, the rejection probability of $\hat T_{SRP}^{dr}$ is close to the nominal level.

Table 1:

Rejection probabilities (%) of the sparse random projection-based test, dense random projection-based test, penalized least square-based test, step-wise selection-based test and the supremum-type test based on the desparsified Lasso estimator, with standard errors in parenthesis (%), under Scenarios 1 and 2 where X ~ N(0, Ip).

Scenario 1         VD = 0           VD = 20%         VD = 35%         VD = 50%
        p      α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05
T^SRPdr 50 0(0) 0(0) 24(1.9) 39.6(2.2) 71(2.0) 81(1.8) 90.8(1.3) 95.2(1.0)
100 0(0) 0(0) 17.4(1.7) 29.6(2.0) 60.8(2.2) 73.8(2.0) 86.6(1.5) 92.4(1.2)
T^RPdr 50 0(0) 0(0) 0.2(0.2) 0.6(0.4) 0.8(0.4) 3.2(0.8) 7.2(1.2) 18.6(1.7)
100 0(0) 0(0) 0.4(0.3) 0.4(0.3) 0.4(0.3) 4(0.9) 6.8(1.1) 19(1.8)
T^PLS 50 0(0) 0(0) 0(0) 0(0) 0.4(0.3) 0.8(0.4) 6(1.1) 17.6(1.7)
100 0(0) 0(0) 0(0) 0(0) 0.8(0.4) 2.4(0.7) 8.6(1.3) 19.8(1.8)
T^VS 50 0(0) 0(0) 1.2(0.5) 3.8(0.9) 16 (1.6) 29.4 (2.0) 36.6(2.2) 50.8(2.2)
100 0(0) 0(0) 0(0) 0.6(0.3) 8.4 (1.2) 17.4 (1.7) 23.8(1.9) 36.4(2.2)
T^DL 50 10.2(1.4) 22.4(1.9) 11.2(1.4) 22.8(1.9) 10.8 (1.4) 21.8 (1.9) 9.8(1.3) 22.4(1.9)
100 7.6(1.2) 20.0(1.8) 7.8(1.2) 21.4(1.8) 7.6 (1.2) 22.0 (1.9) 6.8(1.1) 21.6(1.8)

Scenario 2         VD = 0           VD = 20%         VD = 35%         VD = 50%
        p      α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05

T^SRPdr 50 1.2(0.5) 5.4(1) 24(1.9) 35.8(2.1) 76.4(1.9) 84.6(1.6) 90.2(1.3) 94(1.1)
100 0.6(0.3) 5.2(1) 15.2(1.6) 28.2(2) 67(2.1) 78.8(1.8) 84.2(1.6) 90.4(1.3)
T^RPdr 50 1.8(0.6) 4.6(0.9) 2(0.6) 4.8(1) 1.6(0.6) 5.4(1) 1(0.4) 6(1.1)
100 1.2(0.5) 4.2(0.9) 1.2(0.5) 5.4(1) 0.6(0.3) 4.8(1) 0.8(0.4) 4.4(0.9)
T^PLS 50 1.8(0.6) 6(1.1) 1.2(0.5) 4.4(0.9) 1(0.4) 4.2(0.9) 0.8(0.4) 3.8(0.9)
100 1.2(0.5) 4.2(0.9) 0.8(0.4) 4.6(0.9) 0.6(0.3) 5.6(1) 0.6(0.3) 5(1)
T^VS 50 1.2(0.5) 6.4(1.1) 0.6(0.3) 4(0.9) 1(0.4) 6.6(1.1) 1(0.4) 5(1)
100 1.4(0.5) 5(1.0) 1.0(0.4) 5(1.0) 1.4(0.5) 6.4(1.1) 0.6(0.3) 4.6(0.9)
T^DL 50 1.6(0.6) 6.4(1.1) 2.8(0.7) 11.8(1.4) 4.4 (0.9) 15.4 (1.6) 5.4 (1.0) 17(1.7)
100 1.2(0.5) 3.6(0.8) 2.8(0.7) 11.8(1.4) 5.2(1.0) 17.6(1.7) 7.2(1.2) 19.8(1.8)

Table 2:

Rejection probabilities (%) of the sparse random projection-based test, dense random projection-based test, penalized least square-based test, step-wise selection-based test and the supremum-type test based on the desparsified Lasso estimator, with standard errors in parenthesis (%), under Scenarios 3 and 4 where X ~ N(0, Ip).

Scenario 3         VD = 20%         VD = 35%         VD = 50%
        p      α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05
T^SRPdr 50 47.2(2.2) 71.8(2) 92.4(1.2) 97.8(0.7) 99(0.4) 100(0)
100 42.4(2.2) 61.2(2.2) 89.8(1.4) 96.2(0.9) 97.2(0.7) 99.4(0.3)
T^RPdr 50 4.4(0.9) 16.2(1.6) 13.4(1.5) 35.8(2.1) 22(1.9) 49.4(2.2)
100 3(0.8) 8.4(1.2) 4(0.9) 14.2(1.6) 5.4(1) 19.6(1.8)
T^PLS 50 76.4(1.9) 92(1.2) 97.8(0.7) 99.4(0.3) 99.4(0.3) 100(0)
100 64.8(2.1) 87(1.5) 97(0.8) 99.4(0.3) 98.6(0.5) 99.8(0.2)
T^VS 50 55.6(2.2) 81.8(1.7) 93(1.1) 99(0.4) 97.8(0.7) 100(0)
100 49.8(2.2) 74.2(2.0) 90(1.3) 98.6(0.5) 99(0.4) 100(0)
T^DL 50 99.8(0.7) 100(0) 100(0) 100(0) 100(0) 100(0)
100 99.2(0.4) 100(0) 100(0) 100(0) 100(0) 100(0)

Scenario 4         VD = 20%         VD = 35%         VD = 50%
        p      α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05

T^SRPdr 50 22.4(1.9) 41.8(2.2) 60.4(2.2) 76.6(1.9) 72.4(2) 87.2(1.5)
100 15.2(1.6) 28(2) 49.6(2.2) 70.2(2) 70(2) 84(1.6)
T^RPdr 50 0.4(0.3) 6.2(1.1) 0.6(0.3) 5.4(1) 0.2(0.2) 5.4(1)
100 1.2(0.5) 6(1.1) 0.8(0.4) 3.8(0.9) 1.2(0.5) 5.2(1)
T^PLS 50 1.2(0.5) 5.4(1) 1.2(0.5) 6(1.1) 1.4(0.5) 4.8(1)
100 1.6(0.6) 5.8(1) 1.8(0.6) 6(1.1) 1.4(0.5) 5.2(1)
T^VS 50 10.4(1.4) 24.2(1.9) 13.6(1.5) 30.6(2.1) 13.2(1.5) 29.4(2)
100 5(1) 15.6(1.6) 4.6(0.9) 20(1.8) 8.2(1.2) 18.4(1.7)
T^DL 50 4.2(0.9) 11.4(1.4) 5.4(1) 14.2(1.6) 6.4(1.1) 15.8(1.6)
100 6.2(1.1) 16(1.6) 6.4(1.1) 18.2(1.7) 6.8(1.1) 19.6(1.8)

Under $H_1$, our test statistic is much more powerful than the competing test statistics in Scenarios 1, 2 and 4. For example, when VD = 0.35 and α = 0.05, the rejection probabilities of our test are around 75% in Scenario 1. On the other hand, $\hat T_{RP}^{dr}$, $\hat T_{PLS}$ and $\hat T_{VS}$ fail in Scenario 2: the rejection probabilities of these three tests are no more than 6% in all settings. The rejection probabilities of $\hat T_{DL}$ are around 10%-20% in Scenario 2 under $H_1$; however, $\hat T_{DL}$ does not have a valid type-I error rate under $H_0$. Here, the test statistics $\hat T_{PLS}$ and $\hat T_{VS}$ fail mainly because the true OITR is not linear, while $\hat T_{RP}^{dr}$ and $\hat T_{VS}$ fail partly because dense projections and greedy stepwise variable selection cannot correctly identify the variables with qualitative interactions.

In Scenario 3, $\hat T_{DL}$ and $\hat T_{PLS}$ achieve the greatest power in all settings, as expected, since the true OITR is linear in this scenario. Notice that $X^{(1)}, X^{(2)}, \ldots, X^{(7)}$ are independent. Although the contrast function is not linear, the estimated contrast functions obtained via penalized least squares (see (14) and (15)) converge to $E\{\tau(X) \mid X^{(1)}, X^{(2)}\}$. As a result, the estimated OITR is consistent. When VD = 0.35 and 0.5, the rejection probabilities of $\hat T_{SRP}^{dr}$ are slightly smaller than those of $\hat T_{PLS}$, $\hat T_{DL}$ and $\hat T_{VS}$, but much larger than those of $\hat T_{RP}^{dr}$.

In Section E of the supplementary article, we report the rejection probabilities of $\hat T_{SRP}^{dr}$, $\hat T_{RP}^{dr}$, $\hat T_{PLS}$, $\hat T_{VS}$ and $\hat T_{DL}$ under the scenario where $X \sim N(0, \Sigma)$ with $\Sigma = \{0.5^{|i-j|}\}_{i,j = 1, \ldots, p}$. The results are similar to those presented in Tables 1 and 2.

4.4. Computation time

Our tests are computed on a 32-core 2.2GHz machine with 512GB RAM. Fixing $B = 10^5$, it took approximately 3 minutes to run the test in Scenarios 1-3, where $N = 500$, and 5 minutes in Scenario 4, where $N = 1000$. The computation time can be substantially reduced by using a much smaller $B$: setting $B = 10^4$ in some simulation settings makes the computation 10 times faster while the test performance remains satisfactory. Moreover, since our testing procedure independently generates many sketching matrices and retains the one that maximizes the estimated value function, it can be naturally implemented in parallel, which further reduces the computational cost.

5. Real data

We apply the proposed test to data from the Nefazodone-CBASP clinical trial (Keller et al., 2000), which enrolled 681 patients with nonpsychotic chronic major depressive disorder (MDD). Patients were randomized to three treatments: Nefazodone (coded as 0), the Cognitive Behavioral-Analysis System of Psychotherapy (CBASP, coded as 1), and the combination of Nefazodone and CBASP (coded as 2). The outcome of interest was the patient's score on the 24-item Hamilton Rating Scale for Depression (HRSD). The maximum value of the HRSD is 43, and we set Y = 43 - HRSD as our response; a larger value of Y indicates a better clinical outcome. As in Zhao et al. (2012), we use the subset of 647 patients with complete records of 50 baseline covariates for our analysis. Among them, 216 were treated with Nefazodone, 220 with CBASP and 211 with the combination.

Our objective is to test whether the baseline covariates $X$ have overall qualitative treatment effects. This is equivalent to testing $H_0: V(d^{opt}) = \max\{V(0), V(1), V(2)\}$, where $V(d^{opt})$ is the optimal value function and $V(j)$ denotes the value function under the fixed treatment regime that assigns all patients to treatment $j$, for $j = 0, 1, 2$. Patients' average responses under treatments 0, 1 and 2 are 27.14, 27.27 and 32.13, respectively. Moreover, pairwise t-tests show that $V(2)$ is significantly larger than $V(0)$ and $V(1)$. Therefore, it suffices to test $H_0: V(d^{opt}) = V(2)$. This is equivalent to testing the intersection of the following two hypotheses:

$$H_0^{(j)}: V(d^{opt,(j)}) = \max_{k \in \{0,1,2\},\, k \neq j} V(k),$$

for $j = 0, 1$, where $d^{opt,(j)}$ is the optimal treatment regime comparing Treatment 2 with Treatment $j$. To test $H_0^{(j)}$, we compute the test statistic $\hat T_{SRP}^{dr,j}$ as described in Sections 3.3 and 4.1. We set $B = 10^5$ and $\delta_n = \log(\log_{10}(2n))/(2n)^{1/6}$. For a given $0 < \alpha < 1$, we reject $H_0$ if

$$\max_{j = 0, 1} \hat T_{SRP}^{dr,j} > z_{\alpha/4}.$$

By Bonferroni’s inequality, the type-I error is well-controlled.
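The final decision rule is a one-line comparison against a normal quantile; the following Python sketch reproduces it for the two statistics reported below (the function name is illustrative):

```python
from statistics import NormalDist

def reject_overall(test_stats, alpha):
    """Reject H0 when max_j T_j exceeds z_{alpha/4}.

    With two component tests, each compared to the upper alpha/4 normal
    quantile, Bonferroni's inequality bounds the overall type-I error by
    alpha (each one-sided comparison contributes at most alpha/4, and the
    construction uses two sample splits per hypothesis)."""
    z = NormalDist().inv_cdf(1 - alpha / 4)  # upper alpha/4 standard normal quantile
    return max(test_stats) > z

# The two statistics obtained for the Nefazodone-CBASP data:
print(reject_overall([-0.67, 0.31], alpha=0.1))  # False: fail to reject H0
```

At $\alpha = 0.1$ the critical value is $z_{0.025} \approx 1.96$, well above both observed statistics, matching the conclusion reported next.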

The two test statistics equal -0.67 and 0.31, respectively. We thus fail to reject $H_0$ at the significance level 0.1, and suspect that the prognostic covariates in this study might not have qualitative treatment effects. Zhao et al. (2012) performed pairwise comparisons between the combination treatment and each single treatment, and estimated the OITR via outcome weighted learning; their estimated optimal treatment regime recommended the combination treatment to all patients. Our tests formally support their findings.

6. Discussion

In this paper, we develop tests for overall qualitative treatment effects. The test statistics are constructed by a sample-splitting method. In the high-dimensional setting, we use sparse random projections of the covariate space to construct the test statistic and introduce a data-dependent way to sample sparse projection matrices. In theory, we show the consistency of the proposed test statistic and prove its “oracle” property in the regular cases.

6.1. Nonnegative average treatment effects

In this paper, we assume $V(1) \ge V(0)$ (the new treatment is on average better than the standard control) and consider test statistics based on estimators of the value difference $V(d^{opt}) - V(1)$. When such prior information is not available, let $\hat a_{I_j} = \arg\max_{a \in \{0,1\}} \hat V_{I_j}(a)$ for $j = 1, 2$, where $I_1$ and $I_2$ form a random partition of the dataset and $\hat V_{I_j}(a)$ denotes the estimated value function, based on the observations in $I_j$, under the fixed decision rule $d(x) = a$ for all $x$. We can then consider the following test statistic,

$$\hat T_{CV} = \max\left\{ \frac{\sqrt{|I_2|}\, \{\hat V_{I_2}(\hat d_{I_1}) - \hat V_{I_2}(\hat a_{I_1})\}}{\hat\sigma_{I_2}(\hat d_{I_1}, \hat a_{I_1})},\ \frac{\sqrt{|I_1|}\, \{\hat V_{I_1}(\hat d_{I_2}) - \hat V_{I_1}(\hat a_{I_2})\}}{\hat\sigma_{I_1}(\hat d_{I_2}, \hat a_{I_2})} \right\},$$

where $\hat\sigma_I(d, a)$ denotes some consistent estimator of the asymptotic variance of $\sqrt{|I|}\{\hat V_I(d) - \hat V_I(a)\}$ for a given regime $d$ and $a \in \{0, 1\}$. The null is rejected if $\hat T_{CV} > z_{\alpha/2}$ for a given significance level $\alpha$. Using arguments similar to those in Theorems 3.1 and 3.2, we can show that such a testing procedure is consistent.

6.2. Multi-stage studies

Currently, we only consider a single-stage study. For multi-stage studies, it suffices to test whether the value function under the optimal dynamic treatment regime is strictly larger than those under nondynamic treatment regimes. Zhang et al. (2013) proposed an inverse propensity-score weighted estimator of the value function under an arbitrary dynamic treatment regime. Denote by $\widehat{VD}_I(d_1, d_2)$ the corresponding estimator of the value difference between two dynamic treatment regimes $d_1$ and $d_2$, and by $\hat d_I$ the estimated optimal dynamic treatment regime, based on the sub-dataset $I$. Consider the following test statistic:

$$\hat T_{CV} = \max\left\{ \min_{d \in \mathcal{D}^{nd}} \frac{\sqrt{|I_2|}\, \widehat{VD}_{I_2}(\hat d_{I_1}, d)}{\hat\sigma_{I_2}(\hat d_{I_1}, d)},\ \min_{d \in \mathcal{D}^{nd}} \frac{\sqrt{|I_1|}\, \widehat{VD}_{I_1}(\hat d_{I_2}, d)}{\hat\sigma_{I_1}(\hat d_{I_2}, d)} \right\},$$

where $I_1$ and $I_2$ form a random partition of the dataset, $\hat\sigma_I(d_1, d_2)$ is some consistent estimator of the asymptotic variance of $\sqrt{|I|}\, \widehat{VD}_I(d_1, d_2)$, and $\mathcal{D}^{nd}$ denotes the set of nondynamic treatment regimes.

Note that for $j = 1, 2$, we have under the null that

$$\min_{d \in \mathcal{D}^{nd}} \frac{\sqrt{|I_j|}\, \widehat{VD}_{I_j}(\hat d_{I_j^c}, d)}{\hat\sigma_{I_j}(\hat d_{I_j^c}, d)} \le \min_{d \in \mathcal{D}^{nd}} \frac{\sqrt{|I_j|}\, \{\widehat{VD}_{I_j}(\hat d_{I_j^c}, d) - VD(\hat d_{I_j^c}, d)\}}{\hat\sigma_{I_j}(\hat d_{I_j^c}, d)} \stackrel{\mathcal{L}}{\longrightarrow} \min_{d \in \mathcal{D}^{nd}} Z_d, \qquad (18)$$

where $VD(d_1, d_2) = E\, \widehat{VD}_I(d_1, d_2)$ and $\{Z_d\}_{d \in \mathcal{D}^{nd}}$ is a set of mean-zero Gaussian random variables whose covariance matrix can be consistently estimated from the data. For a given significance level $\alpha$, we reject the null if $\hat T_{CV} > \hat c_{\alpha/2}$, where $\hat c_\alpha$ denotes a consistent estimator of the upper $\alpha$th quantile of $\min_{d \in \mathcal{D}^{nd}} Z_d$. It follows from Bonferroni's inequality and (18) that the type-I error of $\hat T_{CV}$ is well controlled. In the high-dimensional setting, we can calculate $\hat T_{CV}$ based on sparse random projections of the covariate space. Details are omitted for brevity.

Supplementary Material

Appendix

Acknowledgment


The authors thank the editor, the AE and two referees for their helpful suggestions that significantly improved the quality of the paper. The research of Chengchun Shi and Rui Song is partially supported by Grant NSF-DMS-1555244 and Grant NCI P01 CA142538. The research of Wenbin Lu is partially supported by Grant NCI P01 CA142538.

Appendix

A. Variance estimator in Section 3.3.1

Define $\hat\alpha_I$ to be the penalized logistic regression estimator based on $\{(X_i, A_i)\}_{i \in I}$, and $\hat\theta_{0,I}$ and $\hat\theta_{1,I}$ to be the penalized linear regression estimators based on $\{(X_i, Y_i)\}_{i \in I, A_i = 0}$ and $\{(X_i, Y_i)\}_{i \in I, A_i = 1}$, respectively. Denote by $\mathcal{M}_{\alpha, I}$ the support of $\hat\alpha_I$, i.e., $\mathcal{M}_{\alpha, I} = \{j = 1, \ldots, p : \hat\alpha_{I,j} \neq 0\}$. Similarly define $\mathcal{M}_{\theta_0, I}$ and $\mathcal{M}_{\theta_1, I}$ to be the supports of $\hat\theta_{0,I}$ and $\hat\theta_{1,I}$, respectively. Let

$$\hat\pi_i = \frac{\exp(X_i^T \hat\alpha_I)}{1 + \exp(X_i^T \hat\alpha_I)}.$$

For any treatment regime d, we define

$$\hat\sigma_{DR, I}^2(d) = \frac{1}{|I| - 1} \sum_{i \in I} \kappa_i^2 - \frac{1}{|I|(|I| - 1)} \left( \sum_{i \in I} \kappa_i \right)^2,$$

where

$$\begin{aligned} \kappa_i = {} & \left[ \left( \frac{1 - A_i}{1 - \hat\pi_i} - \frac{A_i}{\hat\pi_i} \right) Y_i - \left( \frac{1 - A_i}{1 - \hat\pi_i} - 1 \right) X_i^T \hat\theta_{0,I} + \left( \frac{A_i}{\hat\pi_i} - 1 \right) X_i^T \hat\theta_{1,I} \right] \{1 - d(X_i)\} \\ & + \bar I_1^T \left\{ \frac{1}{|I|} \sum_{i' \in I} X_{i', \mathcal{M}_{\alpha,I}}^T \hat\pi_{i'} (1 - \hat\pi_{i'}) X_{i', \mathcal{M}_{\alpha,I}} \right\}^{-1} X_{i, \mathcal{M}_{\alpha,I}}^T (A_i - \hat\pi_i) \\ & - \bar I_2^T \left\{ \frac{1}{|I|} \sum_{i' \in I} (1 - A_{i'}) X_{i', \mathcal{M}_{\theta_0,I}}^T X_{i', \mathcal{M}_{\theta_0,I}} \right\}^{-1} X_{i, \mathcal{M}_{\theta_0,I}}^T (1 - A_i)(Y_i - X_i^T \hat\theta_{0,I}) \\ & + \bar I_3^T \left\{ \frac{1}{|I|} \sum_{i' \in I} A_{i'} X_{i', \mathcal{M}_{\theta_1,I}}^T X_{i', \mathcal{M}_{\theta_1,I}} \right\}^{-1} X_{i, \mathcal{M}_{\theta_1,I}}^T A_i (Y_i - X_i^T \hat\theta_{1,I}), \end{aligned}$$

and $\bar I_j = \sum_{i \in I} I_{i,j} / |I|$, where

$$\begin{aligned} I_{i,1} &= \left[ \frac{\hat\pi_i (1 - A_i)}{1 - \hat\pi_i} \{Y_i - X_i^T \hat\theta_{0,I}\} + \frac{A_i (1 - \hat\pi_i)}{\hat\pi_i} \{Y_i - X_i^T \hat\theta_{1,I}\} \right] X_{i, \mathcal{M}_{\alpha,I}} \{1 - d(X_i)\}, \\ I_{i,2} &= \left( \frac{1 - A_i}{1 - \hat\pi_i} - 1 \right) X_{i, \mathcal{M}_{\theta_0,I}} \{1 - d(X_i)\}, \\ I_{i,3} &= \left( \frac{A_i}{\hat\pi_i} - 1 \right) X_{i, \mathcal{M}_{\theta_1,I}} \{1 - d(X_i)\}. \end{aligned}$$

B. Technical conditions

(C1.) Assume there exist some positive constants γ and δ0 such that

$$\Pr\{0 < \tau(X) \le t\} = O(t^\gamma),$$

where the big-O term is uniform in 0 < t < δ0.

(C2.) Assume τ^ satisfies

$$E |\hat\tau_I(X) - \tau(X)|^2 = o\left(|I|^{-(2+\gamma)/(2+2\gamma)}\right) \quad \text{as } |I| \to \infty,$$

where the little-o term is uniform in the training samples I.

Condition (C1) is closely related to the margin assumption (Tsybakov, 2004; Audibert and Tsybakov, 2007) in the classification literature. It is often used to obtain sharp upper bounds on the difference between the value function under $d^{opt}$ and that under an estimated OITR (Qian and Murphy, 2011; Luedtke and van der Laan, 2016). The larger the structural parameter $\gamma$ in (C1), the sharper the upper bounds. When $\tau(X)$ has a bounded density function near 0, (C1) holds with $\gamma = 1$. If there exists some $\delta_0 > 0$ such that $|\tau(X)| \ge \delta_0$ almost surely, then (C1) holds with $\gamma = +\infty$.

Condition (C2) depends on the structural parameter $\gamma$ in (C1) and the convergence rate of the estimated contrast function: the larger the $\gamma$, the more likely (C2) holds. When $\gamma = 1$, (C2) requires $E|\hat\tau_I(X) - \tau(X)|^2 = o(|I|^{-3/4})$. Rates of convergence of the estimated contrast function are available for most commonly used machine learning or statistical methods, such as spline methods (Zhou et al., 1998), kernel ridge regression (Steinwart and Christmann, 2008; Zhang et al., 2013) and random forests (Biau, 2012). In Section C.1 of the Appendix, we show that (C2) holds when $\hat\tau$ is computed by some of the aforementioned methods. Combining (C1) with (C2) gives $V(\hat d_I) = V(d^{opt}) + o_p(|I|^{-1/2})$.

References

  1. Audibert J-Y and Tsybakov AB (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35(2), 608–633.
  2. Baker SG, Cook NR, Vickers A, and Kramer BS (2009). Using relative utility curves to evaluate risk prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society) 172(4), 729–748.
  3. Belloni A, Chernozhukov V, Chetverikov D, and Kato K (2015). Some new asymptotic theory for least squares series: Pointwise and uniform results. Journal of Econometrics 186(2), 345–366.
  4. Biau G (2012). Analysis of a random forests model. Journal of Machine Learning Research 13, 1063–1095.
  5. Cannings TI and Samworth RJ (2015). Random projection ensemble classification. arXiv preprint arXiv:1504.04595.
  6. Chakraborty B, Murphy S, and Strecher V (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Stat. Methods Med. Res. 19(3), 317–343.
  7. Chang M, Lee S, and Whang Y-J (2015). Nonparametric tests of conditional treatment effects with an application to single-sex schooling on academic achievements. Econom. J. 18(3), 307–346.
  8. Fan A, Lu W, and Song R (2016). Sequential advantage selection for optimal treatment regime. Ann. Appl. Stat. 10(1), 32–53.
  9. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96(456), 1348–1360.
  10. Gail MH (2009). Value of adding single-nucleotide polymorphism genotypes to a breast cancer risk model. Journal of the National Cancer Institute 101(13), 959–963.
  11. Galassi M, Davies J, Theiler J, Gough B, Jungman G, Alken P, Booth M, Rossi F, and Ulerich R (2015). GNU Scientific Library Reference Manual (Version 2.1).
  12. Gunter L, Zhu J, and Murphy SA (2011). Variable selection for qualitative interactions. Stat. Methodol. 8(1), 42–55.
  13. Hsu Y-C (2017). Consistent tests for conditional treatment effects. The Econometrics Journal 20(1), 1–22.
  14. Huang Y, Laber EB, and Janes H (2015). Characterizing expected benefits of biomarkers in treatment selection. Biostatistics 16(2), 383–399.
  15. Johnson WB and Lindenstrauss J (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, 189–206.
  16. Keller MB, McCullough JP, Klein DN, Arnow B, Dunner DL, Gelenberg AJ, Markowitz JC, Nemeroff CB, Russell JM, Thase ME, et al. (2000). A comparison of nefazodone, the cognitive behavioral-analysis system of psychotherapy, and their combination for the treatment of chronic depression. New England Journal of Medicine 342(20), 1462–1470.
  17. Li P, Hastie TJ, and Church KW (2006). Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 287–296. ACM.
  18. Lopes M, Jacob L, and Wainwright MJ (2011). A more powerful two-sample test in high dimensions using random projection. In Advances in Neural Information Processing Systems, pp. 1206–1214.
  19. Luedtke AR and van der Laan MJ (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Statist. 44(2), 713–742.
  20. Murphy SA (2003). Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 65(2), 331–366.
  21. Nelson J and Nguyên HL (2013). OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 117–126. IEEE.
  22. Omidiran D and Wainwright MJ (2010). High-dimensional variable selection with sparse random projections: measurement sparsity and statistical efficiency. Journal of Machine Learning Research 11, 2361–2386.
  23. Qian M and Murphy SA (2011). Performance guarantees for individualized treatment rules. Ann. Statist. 39(2), 1180–1210.
  24. Robins J, Hernan M, and Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11, 550–560.
  25. Shi C, Fan A, Song R, and Lu W (2016). High-dimensional A-learning for optimal dynamic treatment regimes. Annals of Statistics, accepted.
  26. Steinwart I and Christmann A (2008). Support Vector Machines. Information Science and Statistics. Springer, New York.
  27. Tsybakov AB (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32(1), 135–166.
  28. van de Geer S, Bühlmann P, Ritov Y, and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42(3), 1166–1202.
  29. Watkins C and Dayan P (1992). Q-learning. Mach. Learn. 8, 279–292.
  30. Zhang B, Tsiatis AA, Laber EB, and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68(4), 1010–1018.
  31. Zhang C-H and Zhang SS (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 76(1), 217–242.
  32. Zhang X and Cheng G (2017). Simultaneous inference for high-dimensional linear models. J. Amer. Statist. Assoc. 112(518), 757–768.
  33. Zhang Y, Duchi J, and Wainwright M (2013). Divide and conquer kernel ridge regression. In Conference on Learning Theory, pp. 592–617.
  34. Zhang Y, Laber EB, Tsiatis A, and Davidian M (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics 71(4), 895–904.
  35. Zhao Y, Zeng D, Rush AJ, and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. J. Amer. Statist. Assoc. 107(499), 1106–1118.
  36. Zhou S, Shen X, and Wolfe DA (1998). Local asymptotics for regression splines and confidence regions. Ann. Statist. 26(5), 1760–1782.
