. Author manuscript; available in PMC: 2021 Jan 1.
Published in final edited form as: J Am Stat Assoc. 2019 Jun 19;115(531):1201–1213. doi: 10.1080/01621459.2019.1604368

A Sparse Random Projection-based Test for Overall Qualitative Treatment Effects

Chengchun Shi 1, Wenbin Lu 1, Rui Song 1,*
PMCID: PMC7730172  NIHMSID: NIHMS1047683  PMID: 33311818

Abstract

In contrast to the classical “one size fits all” approach, precision medicine proposes the customization of individualized treatment regimes to account for patients’ heterogeneity in response to treatments. Most existing works in the literature have focused on estimating optimal individualized treatment regimes. However, less attention has been devoted to hypothesis testing regarding the existence of overall qualitative treatment effects, especially when there is a large number of prognostic covariates. When covariates don’t have qualitative treatment effects, the optimal treatment regime will assign the same treatment to all patients regardless of their covariate values. In this paper, we consider testing the overall qualitative treatment effects of patients’ prognostic covariates in a high-dimensional setting. We propose a sample-splitting method to construct the test statistic, based on a nonparametric estimator of the contrast function. When the dimension of covariates is large, we construct the test based on sparse random projections of covariates into a low-dimensional space. We prove the consistency of our test statistic. In the regular cases, we show that the power function of our test statistic is asymptotically the same as that of the “oracle” test statistic, which is constructed based on the “optimal” projection matrix. Simulation studies and real data applications validate our theoretical findings.

Keywords: High-dimensional testing, Optimal treatment regime, Precision medicine, Qualitative treatment effects, Sparse random projection

1. Introduction

In many medical studies, patients may differ significantly in the way they respond to the treatment. In contrast to the classical “one size fits all” approach, precision medicine proposes the customization of individualized treatment regimes to account for patients’ heterogeneity in response to treatments. Formally speaking, a treatment regime is a function from patients’ prognostic covariates to available treatment options. The optimal individualized treatment regime (OITR) is the one that maximizes patients’ expected responses among all treatment regimes.

There has been increasing interest in estimating the OITR. Some common methods include Q-learning (Watkins and Dayan, 1992; Chakraborty et al., 2010), A-learning (Robins et al., 2000; Murphy, 2003) and outcome weighted learning (OWL, Zhao et al., 2012). Qian and Murphy (2011) considered a two-step procedure to construct the OITR. Their method first estimates the conditional mean of the response with an l1 penalty and then derives the OITR from the estimated conditional mean. Zhang et al. (2012) proposed a robust method for estimating the OITR by maximizing the estimated average response of patients (i.e., the value function). Zhang et al. (2015) proposed to use decision lists to construct interpretable and parsimonious treatment regimes. Despite the popularity of estimating the OITR, there is scarce work in the literature on hypothesis testing regarding the OITR. All these estimation methods implicitly assume that patients’ covariates have qualitative interactions with the treatment, which means that there exists a subset of patients whose “best” treatments assigned according to the OITR differ from those of the others.

We consider testing the existence of the OITR for the following reasons. First, the OITR may not always exist in practice; see the data from the Nefazodone-CBASP clinical trial study in Section 5 for an example. In this case, one treatment is better than the other for all patients and there is no need to estimate the OITR. Second, we note that implementing the OITR requires future patients’ covariates, which can be expensive to collect in some cases (Baker et al., 2009; Gail, 2009; Huang et al., 2015). In these cases, we recommend adopting the “one-size-fits-all” paradigm when the null hypothesis of no OITR is not rejected. Third, our test is constructed based on the estimated difference between the value functions of the OITR and a fixed regime (i.e., assigning all patients to the best treatment). An insignificant test implies that the value difference is not significant. In such a situation, although we can still estimate the OITR, its gain over the fixed regime in terms of the improvement in value is not significant, and the obtained OITR may not be of practical interest. Therefore, it is essential to test the overall qualitative treatment effects of the prognostic covariates to determine whether we need to implement the OITR for future patients. Gunter et al. (2011) developed an S-score to quantify the magnitude of the marginal qualitative treatment effects of a single covariate. However, the S-score doesn’t characterize the overall qualitative treatment effects of all covariates. Besides, no theoretical guarantees were provided for the S-score.

For binary treatments, testing qualitative treatment effects is equivalent to testing whether the interaction between treatment and covariates (i.e., the contrast function) is almost surely positive or negative. To test such a hypothesis, Chang et al. (2015) proposed a test based on an L1-type functional of kernel smoothing estimators of conditional treatment effects. Hsu (2017) proposed a Kolmogorov-Smirnov type test statistic based on nonparametric estimation of conditional treatment effects with a hypercube kernel. It is well known that kernel smoothing estimators are undesirable in practice due to the curse of dimensionality. As a result, these test statistics are not reliable when the dimension of the covariates is relatively large. However, in modern biomedical applications, it is common to obtain a large number of prognostic factors for each individual patient. To the best of our knowledge, there is a lack of methods for testing the overall qualitative treatment effects in high-dimensional settings.

In this paper, we aim to test the overall qualitative treatment effects in a high-dimensional setting. This is a very challenging task due to the curse of dimensionality. To better illustrate this point, consider a simple situation where patients’ covariates, x, consist of p independent Rademacher variables. Then, it is equivalent to test whether the contrast as a function of the covariates is always positive or negative for any x ∈ {−1, 1}^p. Therefore, we need to test 2^p moment inequalities even in this very simplified situation. However, for each x ∈ {−1, 1}^p, we have on average N/2^p observations with covariates equal to x, where N is the total number of observations. When N = O(2^p), this seems impossible without additional assumptions. We show in Lemma 3.1 that covariates have the overall qualitative treatment effects if and only if the value function under the OITR is strictly larger than those under fixed treatment regimes. This motivates us to construct test statistics based on the difference between the optimal value function and the value functions under fixed treatment regimes. However, inference for such a value difference is extremely difficult in the nonregular cases, that is, when there is a positive probability that the contrast function is equal to zero. We use a sample-splitting method to construct the test statistic, based on a nonparametric estimator of the contrast function. As long as the estimated contrast function satisfies certain convergence rates, we show that our test statistic is consistent.

When the dimension of covariates is large, we construct the test based on sparse random projections of covariates into a low-dimensional space. Random projections have become a powerful method for dimension reduction in the computer science literature. The key idea behind them is given in the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984), which states that a set of high-dimensional vectors can be projected into a suitable lower-dimensional space while approximately preserving their pairwise distances. In the statistics literature, Lopes et al. (2011) proposed a high-dimensional two-sample test which integrates a random projection with the Hotelling T² statistic. Recently, Cannings and Samworth (2015) proposed a random projection-based method for high-dimensional classification.

In this paper, we propose the use of random projections with sparse matrices. In contrast to the dense sketching matrices used in Lopes et al. (2011) and Cannings and Samworth (2015), only a small proportion of the elements in a sparse sketching matrix are nonzero. References on sparse random projections include Omidiran and Wainwright (2010); Li et al. (2006); Nelson and Nguyên (2013). In our simulation studies, we show that our sparse random projection-based test statistics are more powerful than those based on dense random projection matrices when the OITR is “sparse”. Besides, we advocate using data-dependent algorithms to generate the sparse sketching matrix, since most random projections will be weakly correlated with the contrast function. In theory, we prove the consistency of our sparse random projection-based test. Moreover, in the regular cases, we show that the power function of our test statistic is asymptotically the same as that of the “oracle” test statistic which is constructed based on the “optimal” projection matrix.

The rest of the paper is organized as follows. In Section 2, we present the definition of the overall qualitative treatment effects. In Section 3, we introduce our test statistic and study its asymptotic properties under the null and local alternative. Simulation studies and real data applications are conducted in Section 4 and Section 5 respectively, to examine the empirical performance of the proposed testing procedure. Section 6 concludes with a summary and discussions of possible extensions.

2. Overall qualitative treatment effects

We consider a single-stage study with two treatment options. Let Y be a patient’s outcome of interest and A ∈ {0, 1} be the treatment indicator, with 0 for the standard treatment and 1 for the new treatment. By convention, a larger value of Y indicates a better clinical outcome. Denote by X ∈ ℝ^p the patient’s baseline covariates. We consider a high-dimensional setting where p is allowed to diverge with the sample size N. Let Y*(0) and Y*(1) denote the potential outcomes of a patient that would be observed assuming s/he received treatment 0 and 1, respectively. A treatment regime d : ℝ^p → {0, 1} is a deterministic function from the patient’s covariate space to all possible treatment options. For any d, we define the expected potential outcome

V(d) = E[d(X)Y*(1) + {1 − d(X)}Y*(0)],

known as the value function associated with d. The optimal treatment regime dopt is defined as the maximizer of V(d). Let τ(x) be the contrast function, i.e,

τ(x) = E(Y | A = 1, X = x) − E(Y | A = 0, X = x).

Under the following three conditions:

(A1.) Stable Unit Treatment Value Assumption (SUTVA): Y = AY*(1) + (1 − A)Y*(0),

(A2.) No unmeasured confounders: Y*(0), Y*(1) ⫫ A | X,

(A3.) Positivity: there exist some constants 0 < c1 ≤ c2 < 1 such that c1 ≤ P(A = 1 | X = x) ≤ c2 for any x,

we can show that τ(x) = E{Y*(1) − Y*(0) | X = x}. Since

V(d) = E[d(X){Y*(1) − Y*(0)} + Y*(0)] = E{τ(X)d(X)} + E{Y*(0)},

it is immediate to see that dopt(x) = I{τ(x) > 0}, where I(∙) stands for the indicator function.
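The two identities above can be made concrete with a small Monte Carlo sketch (our own illustration, not from the paper; the toy contrast function and variable names are assumptions): it approximates V(d) = E{τ(X)d(X)} + E{Y*(0)} by a sample average and compares the optimal regime dopt(x) = I{τ(x) > 0} with the one-size-fits-all regime d ≡ 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def value_function(d, tau, y0_mean, X):
    # V(d) = E{tau(X) d(X)} + E{Y*(0)}, approximated by a sample average
    return np.mean(tau(X) * d(X)) + y0_mean

# toy contrast: treatment helps iff the first covariate is positive
tau = lambda X: X[:, 0]
d_opt = lambda X: (tau(X) > 0).astype(float)   # d_opt(x) = I{tau(x) > 0}
d_all1 = lambda X: np.ones(len(X))             # "one size fits all": treat everyone

X = rng.normal(size=(100_000, 3))
v_opt = value_function(d_opt, tau, 0.0, X)
v_one = value_function(d_all1, tau, 0.0, X)
# under OQTE the optimal regime strictly beats any fixed regime
print(v_opt, v_one, v_opt > v_one)
```

Here τ takes both signs with positive probability, so the OQTE holds and V(dopt) strictly exceeds the value of treating everyone.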

Condition (A2) is satisfied in a randomized study, where the propensity score function π(x) = Pr(A = 1 | X = x) is usually a known constant by design. We assume π(x) is known throughout this section. In Section 3.3, we allow the propensity score to be estimated from data, as in observational studies.

Covariates X are said to have the overall qualitative treatment effects (OQTE) if

Pr{τ(X) > 0} > 0  and  Pr{τ(X) < 0} > 0.

In this paper, we consider testing the following hypothesis:

H0: X doesn’t have OQTE  versus  H1: X has OQTE. (1)

Assume (A1)–(A3) hold. Under H0, the optimal treatment regime assigns the same treatment to all patients. Therefore, testing OQTE is equivalent to testing the existence of OITR.

3. Proposed tests

3.1. A simple value-based test statistic in fixed p case

Assume the observed data are summarized as {Oi = (Xi, Ai, Yi), i = 1, …, N}, where the Oi’s are i.i.d. copies of O = (X, A, Y). The distribution of O is allowed to vary with N. To illustrate the idea, we first assume p is small and fixed, and present here a value-based test statistic for the null hypothesis (1). Later in this section, we will consider the more challenging high-dimensional setting. Let V(0) = E{Y*(0)} and V(1) = E{Y*(1)}. The following lemma relates the OQTE to the difference between the optimal value function and the value functions under fixed treatment regimes.

Lemma 3.1. Assume E|τ(X)| < ∞, and conditions (A1)–(A3) hold. Then the following are equivalent: (i) X doesn’t have OQTE; (ii) V(dopt) = max{V(0), V(1)}.

By definition, we have V(dopt) ≥ max{V(0), V(1)}. Under H1, Lemma 3.1 implies V(dopt) > max(V(0), V(1)). Therefore, it suffices to test

H0: V(dopt) = max{V(0), V(1)}  versus  H1: V(dopt) > max{V(0), V(1)}.

For simplicity, we assume V(1) ≥ V(0). This implies that the new treatment is at least as good as the standard one on average. The hypothesis V(1) ≥ V(0) can be tested using historical data or data from a pilot study. When V(0) ≥ V(1), the test statistic can be similarly constructed.

Lemma 3.1 motivates us to consider test statistics based on some estimators for the value difference VD(dopt) = V(dopt) – V(1). For any treatment regime d, Zhang et al. (2012) proposed an inverse propensity score weighted estimator (IPSWE) for V(d):

V^(d) = (1/N) Σ_{i=1}^N [ Ai d(Xi) Yi/πi + (1 − Ai){1 − d(Xi)} Yi/(1 − πi) ], (2)

where πi is a shorthand for π(Xi). Plugging in d ≡ 1, we obtain V^(1) = N^{−1} Σi AiYi/πi. For any fixed d, √N VD^(d) = √N{V^(d) − V^(1)} corresponds to a normalized sum of i.i.d. random variables. Therefore, its asymptotic variance can be consistently estimated by the sample variance estimator,

σ^2(d) = (N − 1)^{−1} Σ_{i=1}^N [ {(1 − Ai)/(1 − πi) − Ai/πi} Yi {1 − d(Xi)} − VD^(d) ]². (3)

Suppose τ^(∙) is an estimate of τ(∙). Based on (2) and (3), it is natural to use T^ = √N VD^(d^)/σ^(d^) as the test statistic, where d^(x) = I{τ^(x) > 0}, and reject H0 when T^ > zα at a given significance level α, where zα stands for the upper α-th quantile of the standard normal distribution.
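As an illustration, the naive statistic T^ = √N VD^(d^)/σ^(d^) built from (2) and (3) can be sketched in a few lines of numpy (our own minimal implementation under the randomized-trial setting with known π; the function name, the toy data-generating process and the hard-coded critical value are assumptions):

```python
import numpy as np

Z_ALPHA = 1.645  # upper 5% quantile of N(0,1)

def naive_value_test(A, Y, pi, d_hat):
    """Naive test based on (2)-(3): VD_hat(d) = V_hat(d) - V_hat(1) is a sample
    average of i.i.d. terms, so T = sqrt(N) * VD_hat / sigma_hat."""
    terms = ((1 - A) / (1 - pi) - A / pi) * Y * (1 - d_hat)
    vd_hat = terms.mean()
    sigma_hat = terms.std(ddof=1)          # sample s.d. of the i.i.d. summands
    t_stat = np.sqrt(len(Y)) * vd_hat / sigma_hat
    return t_stat, t_stat > Z_ALPHA        # reject H0 when T > z_alpha

# toy alternative: tau(X) = X(1); the regime I{X(1) > 0} is treated as given
rng = np.random.default_rng(0)
N = 4000
X1 = rng.normal(size=N)
A = rng.binomial(1, 0.5, size=N)
Y = A * X1 + rng.normal(scale=0.5, size=N)
t_stat, reject = naive_value_test(A, Y, np.full(N, 0.5), (X1 > 0).astype(float))
```

Under this alternative the statistic is large and the test rejects; the nonregular cases discussed next are exactly where this naive construction breaks down.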

Consistency of such a naive test requires E{d^(X) − dopt(X)}² → 0. However, as commented by Luedtke and van der Laan (2016), this assumption is typically violated in the nonregular cases where Pr{τ(X) = 0} > 0, even when τ^ is consistent for τ. To solve this problem, we consider a modified version of T^ based on sample splitting and cross-validation. Let I1 and I2 be a random partition of {1, …, N} into 2 disjoint subsets of equal sizes n = N/2. For any I ⊆ {1, …, N} and treatment regime d, define

VD^I(d) = (1/|I|) Σ_{i∈I} {(1 − Ai)/(1 − πi) − Ai/πi} Yi {1 − d(Xi)},
σ^I²(d) = (|I| − 1)^{−1} Σ_{i∈I} [ {(1 − Ai)/(1 − πi) − Ai/πi} Yi {1 − d(Xi)} − VD^I(d) ]²,

where |I| stands for the number of elements in I. Let τ^I be the corresponding estimator of τ based on observations in I and d^I(x) = I{τ^I(x) > 0}. We define our test statistic by

T^CV = max[ √n VD^I1(d^I2)/max{σ^I1(d^I2), δn}, √n VD^I2(d^I1)/max{σ^I2(d^I1), δn} ], (4)

for some positive sequence δn → 0, and reject H0 when T^CV > zα/2. The sequence δn guarantees that the denominators in T^CV are strictly greater than 0.
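The sample-splitting statistic in (4) can be sketched as follows (our own simplified implementation, assuming a randomized trial with known propensities; `fit_regime` stands in for any contrast-function estimator and its signature is an assumption):

```python
import numpy as np

Z_ALPHA_HALF = 1.96  # upper 2.5% quantile of N(0,1), for alpha = 0.05

def vd_and_sigma(idx, A, Y, pi, d):
    """VD_hat_I(d) and sigma_hat_I(d) over an index set I."""
    t = ((1 - A[idx]) / (1 - pi[idx]) - A[idx] / pi[idx]) * Y[idx] * (1 - d[idx])
    return t.mean(), t.std(ddof=1)

def cv_test_stat(X, A, Y, pi, fit_regime, delta_n, seed=1):
    """T_CV of (4): fit the regime on one half, evaluate the value difference
    on the other half, truncate the denominator at delta_n, take the maximum."""
    N = len(Y)
    n = N // 2
    perm = np.random.default_rng(seed).permutation(N)
    I1, I2 = perm[:n], perm[n:2 * n]
    stats = []
    for fit_idx, eval_idx in ((I2, I1), (I1, I2)):
        d = fit_regime(X[fit_idx], A[fit_idx], Y[fit_idx], X)  # d evaluated at all X
        vd, sig = vd_and_sigma(eval_idx, A, Y, pi, d)
        stats.append(np.sqrt(n) * vd / max(sig, delta_n))
    return max(stats)

# demo under the alternative tau(X) = X(1)
rng = np.random.default_rng(0)
N = 4000
X = rng.normal(size=(N, 3))
A = rng.binomial(1, 0.5, size=N)
Y = A * X[:, 0] + rng.normal(scale=0.5, size=N)
t_cv = cv_test_stat(X, A, Y, np.full(N, 0.5),
                    lambda Xf, Af, Yf, Xall: (Xall[:, 0] > 0).astype(float), 0.1)
```

H0 is rejected when the returned statistic exceeds Z_ALPHA_HALF; the regime is always estimated and evaluated on disjoint halves, which is what protects the test in the nonregular cases.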

As an alternative to the sample-splitting method, one can consider a Wald-type test statistic based on the online one-step estimator proposed by Luedtke and van der Laan (2016). However, calculating such a test statistic is more computationally expensive than ours. Besides, the asymptotic normality of such a test statistic requires the class of functions {[(1 − A)/{1 − π(X)} − A/π(X)]Y{1 − d(X)} : d} to be Glivenko-Cantelli, where d varies over the range of the estimators d^ (see Section 7.3 in Luedtke and van der Laan, 2016). In contrast, our testing procedure is valid under H0 for any d^.

Theorem 3.1. Assume conditions (A1)–(A3) hold, E|Y|³ = O(1) and δn ≫ n^{−1/6}. Then under H0, for any 0 < α < 1, we have

lim sup_n Pr(T^CV > zα/2) ≤ α.

Moreover, assume that

Var[ {A/π(X) − (1 − A)/(1 − π(X))} Y {1 − d^Ij(X)} | {Oi}i∈Ij ] = op(δn), (5)

for j = 1, 2, where Var(V1 | V2) denotes the variance of V1 conditional on V2. Then, we have Pr(T^CV > zα/2) → 0.

The following theorem states the consistency of our proposed test statistic. It relies on Conditions (C1) and (C2). We provide these conditions in Section B of the Appendix to save space.

Theorem 3.2. Assume conditions (A1)–(A3), (C1), (C2) hold, E|Y|³ = O(1) and δn → 0. Under H1: V(dopt) = V(1) + hn, if hn ≫ n^{−1/2}, then we have Pr(T^CV > zα/2) → 1. Moreover, assume Pr{τ(X) = 0} = 0 and lim inf_n σ0² > 0, where

σ0² = Var[ {A/π(X) − (1 − A)/(1 − π(X))} Y {1 − dopt(X)} ].

If √n hn = O(1), then we have

Pr(T^CV > zα/2) = 2Φ̄(zα/2 − √n hn/σ0) − Φ̄²(zα/2 − √n hn/σ0) + o(1),

where Φ̄(z) = Pr(Z ≥ z) for a standard normal random variable Z.

Theorems 3.1 and 3.2 show the consistency of our testing procedure. Note that Conditions (C1) and (C2) are not required for Theorem 3.1. This suggests the type-I error is well controlled regardless of the estimation procedure. On the other hand, the conditions on δn in Theorem 3.1 are stronger than those in Theorem 3.2. In the regular cases where Pr{τ(X) = 0} = 0, Theorem 3.2 provides the asymptotic power function of our test. Notice that hn is equal to −E[τ(X)I{τ(X) < 0}], which relies on the dependence structure of the covariates. As a result, the power of our test depends crucially on the underlying data-generating process.

In this paper, d^ is obtained by a plug-in estimator based on some nonparametric estimation of the contrast function. Alternatively, one can directly estimate dopt using OWL. Theorem 3.2 holds as long as the estimated decision function d^ satisfies V(d^I) = V(dopt) + op(|I|^{−1/2}).

Since we assume V(1) ≥ V(0), under H0 we have Pr{τ(X) ≥ 0} = 1. In the regular cases where Pr{τ(X) = 0} = 0, we have Pr{dopt(X) = 1} = 1 and hence

Var[ {A/π(X) − (1 − A)/(1 − π(X))} Y {1 − dopt(X)} ] = 0.

Besides, in the regular cases, dopt can be consistently estimated by d^Ij (see Equation (S.18) in the supplementary article). Assume conditions (C1) and (C2) hold with γ ≥ 1. Then we can show (5) holds. Hence, the type-I error of our test will go to 0.

3.2. A sparse random projection-based test statistic

When p is large, it is far more challenging to estimate the contrast function τ(x) due to the curse of dimensionality. To handle high-dimensional covariates, we project the covariates into a low-dimensional vector space to construct our test statistic. Throughout this paper, we assume the dimension of the projected space, q, is fixed. For a given matrix S ∈ ℝ^{q×p} and any ω ∈ ℝ^q, define

τS(ω) = E{τ(X) | SX = ω}.

Under (A1)–(A3), the treatment regime dSopt(x) = I{τS(Sx) > 0} is optimal in the sense that it maximizes the value function among the class of treatment regimes based only on the projected covariates SX.

Since q is small, τS can be consistently estimated. We can construct a value-based test statistic as discussed in Section 3.1 based on the projected data {OiS}i∈{1,…,N}, where OiS = (SXi, Ai, Yi). The power of such a test statistic depends crucially on the sketching matrix S. To better understand this, consider the following example:

τ(X) = {(X(1) + X(2))/√2 − √2δ} {(X(3) + X(4) + X(5) + X(6) + X(7))/√5}², (6)

for some δ > 0, where X(j) denotes the j-th element of X.

Clearly, we have τ(X) > 0 if X(1) + X(2) > 2δ and τ(X) < 0 if X(1) + X(2) < 2δ. Assume X ~ N(0, Ip). Then X has the OQTE. Let q = 1. The “optimal” sketching matrix S* is equal to

S* = c0(1, 1, 0, 0, …, 0),

for any c0 ≠ 0. For any S ∈ ℝ^{1×p} such that S*Sᵀ = 0, SX is independent of X(1) + X(2). Then, we have

τS(ω) = E{τ(X) | SX = ω} = E[ {(X(1) + X(2))/√2 − √2δ} {(X(3) + X(4) + X(5) + X(6) + X(7))/√5}² | SX = ω ] = −√2δ E[ {(X(3) + X(4) + X(5) + X(6) + X(7))/√5}² | SX = ω ].

Hence, τS(ω) is always nonnegative or nonpositive as a function of ω. As a result, the test statistic based on {OiS}i doesn’t have any power to detect the OQTE. The challenge here lies in finding a projection matrix S that is highly correlated with S*.

Below, we propose a data-dependent algorithm to generate S and introduce our test statistic. Our theory shows that our test statistic works as if the optimal sketching matrix S* were known. Statistical properties of our testing procedure are formally studied in Section 3.2.2.

3.2.1. Test statistic

Assume for now that we have an estimator τ^IS for τS based on any subset of the projected data {OiS}i∈I, and an algorithm to sample sparse sketching matrices whose distribution G(S, {Oi}i∈I) is allowed to depend on {Oi}i∈I. We describe the whole testing procedure in Algorithm 1.

Algorithm 1. Calculate the random projection-based test statistic.
1. Input observations {Oi}i=1,…,N, δn, α and a sampling distribution G.
2. Randomly partition the data into two subsets {Oi}i∈I1 and {Oi}i∈I2.
3. For j = 1, 2:
   (i) Independently sample a sparse sketching matrix SIj ~ G(S, {Oi}i∈Ij);
   (ii) Obtain the estimators τ^IjSIj and d^IjSIj(x) = I{τ^IjSIj(SIjx) > 0};
   (iii) Calculate T^SIj = √n VD^Ijc(d^IjSIj)/max{σ^Ijc(d^IjSIj), δn}, where Ijc is the complement of Ij.
4. Reject H0 if T^SRP = max(T^SI1, T^SI2) > zα/2.

Now we present our algorithm for generating the sparse sketching matrix. We first introduce some notation. For any matrix Ψ with J rows, let Ψ(i) be the ith row of Ψ. For any vector ψ ∈ ℝ^J and any set M ⊆ {1, …, J}, denote by ψM the subvector of ψ formed by the elements in M. Let Mc be the complement of M. Let ‖ψ‖0 be the number of nonzero elements in ψ and ‖ψ‖2 be the Euclidean norm of ψ. Let S denote the space of sparse sketching matrices:

S = {S ∈ ℝ^{q×p} : ‖S(i)‖0 ≤ s, ‖S(i)‖2 = 1, i = 1, …, q},

for some fixed integer s that satisfies 2 ≤ s ≤ p. Denote by N(0, IJ) a J-dimensional Gaussian random vector with mean zero and identity covariance matrix.

It remains to generate SIj based on the sub-dataset {Oi}i∈Ij. We first sample many sparse sketching matrices from S. Each row of a sketching matrix is independently and uniformly distributed on the space {S ∈ ℝ^p : ‖S‖0 = s, ‖S‖2 = 1}. This corresponds to Step 2 in our proposed algorithm below. Then we output the sparse sketching matrix that maximizes the estimated value difference function. Specifically, we propose using a data-splitting strategy for the evaluation of the value difference function. That is, for each sketching matrix, we randomly divide {Oi}i∈Ij into K folds, use K − 1 of the subsamples to estimate the OITR based on the projected covariates, use the remaining subsample to evaluate the corresponding value difference function, and aggregate these value differences over the K held-out subsamples. This corresponds to Steps 3–5 in our proposed algorithm below. We summarize our procedure in Algorithm 2.

Algorithm 2. Generate a data-dependent sparse random sketching matrix.
1. Input observations {Oi}i∈I and integers B, s, q, and K ≥ 2.
2. Generate i.i.d. matrices S1, S2, …, SB distributed as S0, whose distribution is described as follows. For j = 1, …, q:
   (i) Independently select a simple random sample Mj of size s from {1, …, p};
   (ii) Independently generate a Gaussian random vector gj ~ N(0, Is);
   (iii) Set S0,Mjc(j) = 0 and S0,Mj(j) = gj/‖gj‖2.
3. Randomly divide I into K subsets {I(k)}k=1,…,K of equal sizes. Let I(−k) = I ∩ (I(k))c.
4. For b = 1, …, B:
   (i) For k = 1, …, K:
      (i.1) Obtain the estimator τ^I(−k)Sb and d^I(−k)Sb(x) = I{τ^I(−k)Sb(Sbx) > 0};
      (i.2) Evaluate the value difference VD^I(k)(d^I(−k)Sb).
   (ii) Obtain the cross-validated estimator VD^CVSb = Σk VD^I(k)(d^I(−k)Sb)/K.
5. Output Sb^, where b^ = argmax_{b=1,…,B} VD^CVSb.
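Step 2 of Algorithm 2, the sampling step over the class S, can be sketched as follows (our own code; the function name is an assumption, and the data-dependent selection of Steps 3–5 is omitted):

```python
import numpy as np

def sample_sparse_sketch(p, q, s, rng):
    """Draw one sketching matrix with q rows: each row has s nonzero entries
    on a uniformly chosen support M_j, filled with a normalized Gaussian
    vector g_j / ||g_j||_2, so that ||S(j)||_0 <= s and ||S(j)||_2 = 1."""
    S = np.zeros((q, p))
    for j in range(q):
        support = rng.choice(p, size=s, replace=False)  # simple random sample M_j
        g = rng.normal(size=s)                          # g_j ~ N(0, I_s)
        S[j, support] = g / np.linalg.norm(g)           # unit Euclidean norm
    return S

S = sample_sparse_sketch(p=50, q=2, s=5, rng=np.random.default_rng(0))
```

In the full procedure, B such matrices are drawn and the one with the largest cross-validated value difference is kept.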

3.2.2. Asymptotic properties under the null and local alternative

We first show the validity of the proposed test, which applies to any estimator τ^IS. For any positive sequences {an} and {bn}, we write an ≫ bn if and only if lim sup_n bn/an = 0.

Theorem 3.3. Assume (A1)–(A3) hold, E|Y|³ = O(1) and δn ≫ n^{−1/6}. Then under H0, we have

lim sup_n Pr(T^SRP > zα/2) ≤ α.

Moreover, assume that

Var[ {A/π(X) − (1 − A)/(1 − π(X))} Y {1 − d^IjSIj(X)} | {Oi}i∈Ij, SIj, Ij ] = op(δn),

for j = 1, 2. Then we have Pr(T^SRP > zα/2) → 0.

Let S* = argmax_{S∈S} V(dSopt) be the set of optimal sketching matrices. The optimal sketching matrix may not be unique. To see this, note that for any sketching matrix S* ∈ S that maximizes V(dSopt), −S* also maximizes V(dSopt), and hence −S* ∈ S*. Moreover, when q ≥ 2, there may exist infinitely many maximizers.

Our theoretical studies are mostly concerned with the “oracle” test statistic, for which the oracle knows the set S* ahead of time. In Algorithm 1, Step 3(i), instead of using Algorithm 2 to sample SI1 and SI2, we use the oracle choice SI1 = SI2 = S* for an arbitrary S* ∈ S*. Denote by T^oracle the resulting oracle test statistic. Let hn* = max_{S*∈S*} V(dS*opt) − V(1). Similar to Theorem 3.2, under H1, if hn* ≫ n^{−1/2}, then we can show

Pr(T^oracle > zα/2) → 1.

Assume

V(dS*opt) = V(dopt),  ∀S* ∈ S*. (7)

This condition means that the optimal decision rule depends on the projected covariates S*X only. It holds when τ(x) = ϕ(S*x)g(x) for some function ϕ(∙) and some nonnegative function g(∙). In the regular cases where Pr{τ(X) = 0} = 0, (7) implies that Pr{dopt(X) = dS*opt(X)} = 1, ∀S* ∈ S*. Thus, the class of optimal treatment regimes {dS*opt : S* ∈ S*} will almost surely recommend the same treatment to any given patient. Assume (7) holds and Pr{τ(X) = 0} = 0. Similar to Theorem 3.2, the asymptotic power of T^oracle can be derived as

Pr(T^oracle > zα/2) = 2Φ̄(zα/2 − √n hn/σ0) − Φ̄²(zα/2 − √n hn/σ0) + o(1), (8)

where hn and σ0 are defined in Theorem 3.2.

In the following, we prove the consistency of our proposed testing procedure when using Algorithm 2 to generate the sparse sketching matrix. Moreover, we show our test statistic possesses the oracle property. This means the power function of T^SRP is asymptotically the same as the oracle test statistic T^oracle.

Define the semimetric

dτ(S1, S2) = [E{τS1(S1X) − τS2(S2X)}²]^{1/2},  S1, S2 ∈ S.

We make the following assumptions.

(A4.) For any sketching matrices S1, S2, …, SB ∈ S and any I ⊆ {1, 2, …, N} with |I| ≥ n/2, assume the following event holds with probability tending to 1:

max_{b=1,…,B} EX|τ^ISb(SbX) − τSb(SbX)|² = O(n^{−r0} log n),

where the expectation EX is taken with respect to X, and the big-O term is uniform in I and S1, …, SB.

(A5.) Assume B ≫ (p√n)^{(s−1)q}. In addition, assume there exist some constant C̄ > 0 and some sketching matrix S* ∈ S* such that

dτ(S, S*) ≤ C̄ {Σ_{j=1}^q ‖S(j) − S*(j)‖2²}^{1/2},  ∀S ∈ S. (9)

(A6.) Assume there exist some constants γ, ε0, δ0 > 0 such that for any sketching matrix S satisfying V(dSopt) ≥ V(dS*opt) − ε0, we have Pr{0 < τS(SX) ≤ t} = O(t^γ), where the big-O term is uniform in 0 < t < δ0 and S.

Condition (A4) assumes a uniform convergence rate for τ^ISb, b = 1, …, B. Since the uniform convergence rate deteriorates as B increases, Condition (A4) gives an upper bound for B. On the contrary, Condition (A5) gives a lower bound for B. It requires B to diverge at a proper rate, to give us a good chance of finding a random projection with a large value function. More specifically, under (A5), we can show that

Pr{ max_{b=1,…,B} V(dSbopt) = V(dS*opt) + o(n^{−1/2}) } → 1.

In Section C.3 of the Appendix, we show that (A5) holds when τ(x) = ϕ(S*x) for some sketching matrix S* ∈ S* and some Lipschitz continuous function ϕ(∙).

Condition (A6) holds with γ = 1 when τS(SX) has a uniformly bounded density function near 0 for any sketching matrix S that nearly maximizes the value function (see Section C.4 in the Appendix for a detailed discussion). Assume τ(X) ≥ δ0 almost surely or τ(X) ≤ −δ0 almost surely. Then for any sketching matrix S, we have τS(SX) ≥ δ0 almost surely or τS(SX) ≤ −δ0 almost surely. As a result, (A6) automatically holds for any γ > 0.

In Section C.2 of the Appendix, we consider a simple model and show that (A4)–(A6) hold.

Theorem 3.4. Assume Conditions (A1)–(A5) hold, E|Y|³ = O(1), log B = o(n^{1/3}) and δn → 0. If hn* ≫ max{√(log B/n), n^{−r0/2} log n}, then we have

Pr(T^SRP > zα/2) → 1.

Moreover, assume (7) and (A6) hold, Pr{τ(X) = 0} = 0, √n hn = O(1), B = O(n^{κB}) for some κB > 0, r0 > (γ + 2)/(2γ + 2) and lim inf_n σ0 > 0. Then we have

Pr(T^SRP > zα/2) = 2Φ̄(zα/2 − √n hn/σ0) − Φ̄²(zα/2 − √n hn/σ0) + o(1).

Assume p = O(n) and set B = c*n^{{3q(s−1)+ϵ}/2} for any c*, ϵ > 0. Then the conditions B ≫ (p√n)^{(s−1)q} in (A5) and B = O(n^{κB}) in Theorem 3.4 automatically hold. It is worth mentioning that when hn and σ0 don’t depend on p, Theorem 3.4 implies that the asymptotic power of our test is independent of p.

3.3. Some implementation issues

3.3.1. Doubly-robust test statistics

So far we have assumed that the propensity scores are known for all patients. In the following, we introduce a doubly-robust test statistic to deal with data from an observational study. We begin by introducing a doubly-robust value difference estimator, which requires estimation of the propensity score and the conditional mean functions h0(x) = E(Y | A = 0, X = x) and h1(x) = E(Y | A = 1, X = x). Denote by π^(∙), h^0(∙) and h^1(∙) the corresponding estimators. Zhang et al. (2012) proposed a doubly-robust estimator for the value function under a given treatment regime d,

V^dr(d) = (1/N) Σ_{i=1}^N [ {Aidi/π^(Xi) + (1 − Ai)(1 − di)/{1 − π^(Xi)}} Yi − {Aidi/π^(Xi) + (1 − Ai)(1 − di)/{1 − π^(Xi)} − 1} {h^0(Xi)(1 − di) + h^1(Xi)di} ],

where di is a shorthand for d(Xi). When either the propensity score or the conditional mean models are correctly specified, V^dr(d) is consistent for V(d) (Zhang et al., 2012). Based on V^dr, for any I ⊆ {1, …, N} and a given treatment regime d, we define our doubly-robust value difference estimator as

VD^Idr(d) = (1/|I|) Σ_{i∈I} [ {(1 − Ai)/(1 − π^iI) − Ai/π^iI} Yi − {(1 − Ai)/(1 − π^iI) − 1} h^0,iI + {Ai/π^iI − 1} h^1,iI ] (1 − di),

where π^iI = π^I(Xi), h^0,iI = h^0I(Xi), h^1,iI = h^1I(Xi), and π^I, h^0I, h^1I are obtained based on {Oi}i∈I. When p is large, we recommend estimating π, h0 and h1 via penalized regression. The asymptotic variance of √|I| VD^Idr(d) can be consistently estimated by {σ^Idr(d)}², whose exact form is given in Section A of the Appendix.
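A minimal numpy sketch of the doubly-robust value difference estimator VD^Idr(d) (our own illustration; the function name is an assumption, and the fitted π^, h^0, h^1 are simply passed in as arrays):

```python
import numpy as np

def dr_value_difference(A, Y, d, pi_hat, h0_hat, h1_hat):
    """Doubly-robust estimate of VD(d) = V(d) - V(1): the IPW contrast is
    augmented with the outcome regressions, and only observations with
    d(X_i) = 0 contribute (the two regimes agree when d(X_i) = 1)."""
    ipw = ((1 - A) / (1 - pi_hat) - A / pi_hat) * Y
    aug = -((1 - A) / (1 - pi_hat) - 1) * h0_hat + (A / pi_hat - 1) * h1_hat
    terms = (ipw + aug) * (1 - d)
    return terms.mean(), terms.std(ddof=1)   # VD_hat and per-observation s.d.

# demo with correctly specified models: tau(X) = X(1), pi = 0.5, h0 = 0, h1 = X(1)
rng = np.random.default_rng(0)
N = 40000
X1 = rng.normal(size=N)
A = rng.binomial(1, 0.5, size=N)
Y = A * X1 + rng.normal(scale=0.5, size=N)
vd, sd = dr_value_difference(A, Y, (X1 > 0).astype(float),
                             np.full(N, 0.5), np.zeros(N), X1)
```

With either nuisance model correct, the estimator recovers the true value difference V(dopt) − V(1), here −E[X(1) I{X(1) < 0}].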

We briefly summarize our test procedure. Similar to Algorithm 1, we first randomly partition the data into two halves {Oi}i∈I1 and {Oi}i∈I2, and obtain the estimates π^Ij, h^0Ij, h^1Ij based on {Oi}i∈Ij for j = 1, 2. Then we independently sample the sparse sketching matrices SI1 and SI2. The sampling algorithm is similar to Algorithm 2. Specifically, for j = 1, 2, we randomly divide Ij into {Ij(k)}k=1,…,K and independently sample S1, …, SB as in Steps 2 and 3 of Algorithm 2. Then we calculate the doubly-robust value difference estimator,

VD^CVdr,Sb = K^{−1} Σk VD^Ij(k)dr(d^Ij(−k)Sb), (10)

for each Sb, where Ij(−k) = Ij ∩ (Ij(k))c, and set SIj = Sb^, where b^ = argmax_{b=1,…,B} VD^CVdr,Sb. Finally, we define our test statistic by

T^SRPdr = max[ √n VD^I2dr(d^I1SI1)/max{σ^I2dr(d^I1SI1), δn}, √n VD^I1dr(d^I2SI2)/max{σ^I1dr(d^I2SI2), δn} ], (11)

and reject H0 if T^SRPdr > zα/2 at a given significance level α > 0. Statistical properties of T^SRPdr can be similarly established.

3.3.2. Estimation of τS

The projected contrast function τS can be estimated by any machine learning or statistical nonparametric method. In our implementation, we estimate τS using cubic B-splines. Let I be an arbitrary subset of {1, …, N}. Based on the dataset {Oi}i∈I, we first estimate π using penalized logistic regression, and estimate h0, h1 using penalized linear regression, with SCAD penalty functions (Fan and Li, 2001). These penalized regressions are implemented with the R package ncvreg, and the tuning parameters are selected via 10-fold cross-validation. Let π^iI, h^0,iI and h^1,iI be the corresponding estimators of π(Xi), h0(Xi) and h1(Xi), respectively. Recall that S(j) ∈ ℝ^{1×p} is the jth row of the sketching matrix S. We define the pseudo outcome

τ^iI = {Ai/π^iI − (1 − Ai)/(1 − π^iI)} Yi − {Ai/π^iI − 1} h^1,iI + {(1 − Ai)/(1 − π^iI) − 1} h^0,iI, (12)

and minimize

(ξ^1I, …, ξ^qI) = argmin_{ξ1,…,ξq} (1/|I|) Σ_{i∈I} [ τ^iI − Σ_{j=1}^q Σ_{k=1}^{K+4} NkS(j)(S(j)Xi) ξj,k ]², (13)

where N1S(j)(∙), …, N(K+4)S(j)(∙) are cubic B-spline bases of S(j)Xi and K is the number of interior knots. Given K, we place the interior knots at equally-spaced sample quantiles of the projected covariates {SXi}i∈I. After solving (13), we set τ^IS(Sx) = Σ_{j=1}^q Σ_{k=1}^{K+4} NkS(j)(S(j)x) ξ^j,kI.
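The pseudo outcome (12) is straightforward to compute; below is a sketch (our own code) together with a sanity check that, under correctly specified nuisance models, τ^i has conditional mean τ(Xi). The spline regression step (13) is omitted; in place of the cubic B-spline basis one could use, e.g., `scipy.interpolate.BSpline` or any other series basis.

```python
import numpy as np

def pseudo_outcomes(A, Y, pi_hat, h0_hat, h1_hat):
    """AIPW pseudo outcome (12): E[tau_hat_i | X_i] = tau(X_i) when either the
    propensity model or the outcome models are correctly specified."""
    return ((A / pi_hat - (1 - A) / (1 - pi_hat)) * Y
            - (A / pi_hat - 1) * h1_hat
            + ((1 - A) / (1 - pi_hat) - 1) * h0_hat)

# demo: tau(X) = X(1), randomized trial with pi = 0.5 and correct h0, h1
rng = np.random.default_rng(0)
N = 20000
X1 = rng.normal(size=N)
A = rng.binomial(1, 0.5, size=N)
Y = A * X1 + rng.normal(scale=0.5, size=N)
po = pseudo_outcomes(A, Y, np.full(N, 0.5), np.zeros(N), X1)
```

Regressing these pseudo outcomes on a basis of the projected covariates S(j)Xi then yields the estimator τ^IS.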

Based on the B-spline method, we show in Section C.2.1 of the Appendix that (A4) holds with r0 = 4/5 when q = 1 and B = O(n^{κB}) for any κB > 0. Assume (A6) holds with γ > 2/3. The condition r0 > (γ + 2)/(2γ + 2) in Theorem 3.4 is then satisfied. More generally, we may use series estimators (Belloni et al., 2015) to estimate τS. In that case, the rate r0 in (A4) will decrease as the projected dimension q increases.

3.3.3. Choice of s

Our testing procedure requires specification of s, which determines the number of nonzero elements in each row of the sketching matrix. Ideally, one could treat s as a tuning parameter and choose s to maximize the estimated value difference defined in (10). However, this approach would be time-consuming. In our implementation, we instead treat s as a discrete random variable when sampling S1, …, SB. More specifically, for b = 1, …, B, we first independently sample s according to some random variable s0, and then sample Sb according to Step 2 of Algorithm 2.

We recommend setting $s_0 = 2 + \mathrm{Binom}(p - 2, p_0)$, where $\mathrm{Binom}(m, p_0)$ denotes a binomial random variable with $m$ trials and success probability $p_0$. In our simulation study, we set $p_0 = 2/(p - 2)$, so that $E(s_0) = 4$.
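This randomized choice of the sparsity level can be sketched in a few lines of Python (the paper's implementation is in R/C; the function name is illustrative):

```python
import random

def sample_s(p, p0=None, rng=random):
    """Draw the sparsity level s0 = 2 + Binom(p - 2, p0).

    p0 defaults to 2/(p - 2), so that E[s0] = 2 + (p - 2) * p0 = 4,
    matching the recommendation in the text."""
    if p0 is None:
        p0 = 2.0 / (p - 2)
    # Binomial draw as a sum of p - 2 independent Bernoulli(p0) trials
    return 2 + sum(rng.random() < p0 for _ in range(p - 2))

random.seed(1)
draws = [sample_s(50) for _ in range(10000)]
print(min(draws), max(draws), sum(draws) / len(draws))  # all in [2, 50]; mean near 4
```

Each sketching-matrix row thus has at least 2 and at most $p$ nonzero entries, with an average of 4.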

3.3.4. Choice of q

The choice of the projection dimension $q$ involves a trade-off. If $q$ is too large, the curse of dimensionality degrades the uniform convergence rates of $\hat\tau_I^{S_j}$ in (A8), resulting in decreased power of the corresponding test. If $q$ is too small, the OITR is not well approximated. In our numerical experiments, we set $q = 1$. In the supplementary article, we examine the performance of the proposed test with different choices of $q$. The results show that the optimal choice of $q$ depends on the number of covariates involved in the OITR and varies across simulation settings. We further propose a method that adaptively determines $q$; the detailed algorithm is given in Section E.2 of the supplementary article. In our simulations, we find that this adaptive method is no worse than any fixed choice of $q$ and has nearly optimal performance in some cases.

3.3.5. Choices of other hyperparameters

We recommend setting the number of folds $K$ in Algorithm 2 to 5 or 10. The number of sketching matrices $B$ should diverge as $N, p \to \infty$. In practice, we recommend setting $B = N^{\kappa_N} p^{\kappa_p}$ for some $\kappa_N, \kappa_p \ge 1$.

4. Simulations

4.1. Settings

We examine the finite sample performance of the proposed tests via Monte Carlo simulations. Simulated data with sample size N were generated from

$$Y = 1 + (X^{(1)} - X^{(2)})/2 + A\,\tau(X) + e,$$

where $X \sim N(0, I_p)$, $A \sim \mathrm{Binom}(1, 0.5)$ and $e \sim N(0, 0.5^2)$. Here, we set $p = 50$ or $100$.

We consider four scenarios. In the first three scenarios, we set

$$\tau(X) = \phi_\delta\{(X^{(1)} + X^{(2)})/2\}\, (X^{(3)} + X^{(4)} + X^{(5)} + X^{(6)} + X^{(7)})^2 / 5,$$

for some function $\phi_\delta$ parameterized by $\delta \ge 0$. More specifically, we set $\phi_\delta(x) = x^2 - \delta$ in Scenario 1, $\phi_\delta(x) = \delta \cos(\pi x)$ in Scenario 2, and $\phi_\delta(x) = 2\pi \delta x$ in Scenario 3.

In Scenario 4, we set

$$\tau(X) = \delta \left\{ \sum_{j=1}^{2} \frac{(X^{(j)})^2}{2} - \sum_{j=3}^{20} \frac{X^{(j)}}{18} - 2 \right\} (X^{(21)} + X^{(22)} + X^{(23)} + X^{(24)} + X^{(25)})^2 / 5.$$

It is immediate that the OITR is sparse and depends only on $X^{(1)}$ and $X^{(2)}$ in the first three scenarios. In Scenario 4, however, a total of 20 variables are involved in the OITR. In addition, the true OITR is linear in $X$ under Scenario 3, but nonlinear under Scenarios 1, 2 and 4. We set $N = 500$ in Scenarios 1, 2 and 3, and $N = 1000$ in Scenario 4.

For all scenarios, the parameter δ controls the degree of overall qualitative treatment effects. Specifically, H0 holds if δ = 0 and H1 holds if δ > 0. For each scenario, we further consider four cases by setting VD(dopt) = V(dopt) – V(1) = 0, 0.2, 0.35 and 0.5. Note that in Scenarios 2, 3 and 4, the settings for VD(dopt) = 0 are the same. Hence, in Scenarios 3 and 4, we only report the simulation results for VD(dopt) = 0.2, 0.35 and 0.5.
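Under our reading of the data-generating displays above (the exact forms may differ slightly in the published version), one replication of Scenario 1 can be generated with the following self-contained Python sketch; the function name is illustrative and the paper's own simulations are run in R/C:

```python
import random

def generate_scenario1(N, p=50, delta=0.0, rng=random):
    """Draw N observations (X, A, Y) under (our reading of) Scenario 1:
    Y = 1 + (X1 - X2)/2 + A * tau(X) + e,  e ~ N(0, 0.5^2),
    tau(X) = phi_delta((X1 + X2)/2) * (X3 + ... + X7)^2 / 5,
    with phi_delta(x) = x^2 - delta."""
    data = []
    for _ in range(N):
        X = [rng.gauss(0, 1) for _ in range(p)]
        A = 1 if rng.random() < 0.5 else 0
        u = (X[0] + X[1]) / 2
        tau = (u ** 2 - delta) * sum(X[2:7]) ** 2 / 5
        Y = 1 + (X[0] - X[1]) / 2 + A * tau + rng.gauss(0, 0.5)
        data.append((X, A, Y, tau))
    return data

random.seed(7)
sample = generate_scenario1(200, delta=0.0)
# With delta = 0 the contrast is nonnegative everywhere, so treating everyone
# with A = 1 is optimal and H0 (no overall qualitative effect) holds.
print(all(tau >= 0 for (_, _, _, tau) in sample))  # True
```

The final check illustrates the claim in the text: at $\delta = 0$, $H_0$ holds, while any $\delta > 0$ makes $\phi_\delta$ negative on a set of positive probability and hence induces a qualitative treatment effect.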

We set $q = 1$ and calculate $\hat T_{SRP}^{dr}$ as described in Section 3.3. The number of interior knots $K$ in the cubic B-spline bases is specified as follows. When generating $S_{I_1}$ or $S_{I_2}$, we fix $K = 3$ when estimating $\tau^{S_b}$ for $b = 1, \ldots, B$. After obtaining $S_{I_1}$ and $S_{I_2}$, $K$ is tuned by cross-validation when estimating $\tau^{S_{I_1}}$ and $\tau^{S_{I_2}}$. We set $B = 10^5$ for $p = 50$ and $B = 4 \times 10^5$ for $p = 100$.

The whole simulation program is implemented in R. Some subroutines, including sampling the data-dependent sketching matrices $S_{I_1}$ and $S_{I_2}$ and estimating $\tau^{S_{I_1}}$ and $\tau^{S_{I_2}}$, are written in C with the GNU Scientific Library (GSL; Galassi et al., 2015).

4.2. Competing methods

Comparison is made among the following five test statistics:

  1. The proposed sparse random projection-based test statistic T^SRPdr.

  2. The dense random projection-based test statistic, denoted by T^RPdr.

  3. The cross-validated test statistic with the OITR estimated by the penalized least squares method developed in Shi et al. (2016), denoted by T^PLS.

  4. The cross-validated test statistic based on step-wise variable selection, denoted by T^VS.

  5. The supremum-type test statistic T^DL based on the desparsified Lasso estimator (Zhang and Zhang, 2014; van de Geer et al., 2014).

$\hat T_{RP}^{dr}$ is computed in a similar fashion to $\hat T_{SRP}^{dr}$. We randomly partition $\{1, \ldots, N\}$ into $I_1 \cup I_2$ of equal size, generate data-dependent sketching matrices $S_{I_1}$ and $S_{I_2}$, and construct the test statistic as in (11). When generating $S_{I_1}$ or $S_{I_2}$, instead of sampling $B$ sparse sketching matrices as described in Step 3 of Algorithm 2, we generate $B$ dense sketching matrices $S_1, \ldots, S_B$ distributed as $Z_0/\|Z_0\|_2$, where $Z_0 \in \mathbb{R}^p$ is a Gaussian random vector with mean zero and identity covariance matrix, and set $S_{I_1}$ or $S_{I_2}$ to be the one that gives the largest cross-validated value difference as in (10). As for $\hat T_{SRP}^{dr}$, we set $B = 10^5$ for $p = 50$ and $B = 4 \times 10^5$ for $p = 100$, and use cubic B-splines to estimate $\tau^S$ for any sketching matrix $S$.
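The dense projection rows used by the competing test are uniform random directions on the unit sphere, which can be sketched in pure Python as follows (illustrative function name; the paper's implementation is in R/C):

```python
import math
import random

def dense_direction(p, rng=random):
    """Sample one dense projection row Z0 / ||Z0||_2 with Z0 ~ N(0, I_p).

    Normalizing a standard Gaussian vector yields a direction that is
    uniformly distributed on the unit sphere in R^p."""
    z = [rng.gauss(0, 1) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]

random.seed(3)
s_row = dense_direction(100)
print(round(sum(v * v for v in s_row), 10))  # 1.0 (unit Euclidean norm)
```

In contrast to the sparse rows of Algorithm 2, every coordinate of such a row is nonzero with probability one, which is one reason the dense variant struggles to isolate the few covariates carrying qualitative interactions.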

To calculate $\hat T_{PLS}$, we first partition the data into two halves $\{O_i\}_{i \in I_1}$ and $\{O_i\}_{i \in I_2}$. Then for $j = 1, 2$, we set $\hat d_{I_j}(x) = I(\bar x^T \hat\beta_{I_j} > 0)$, where $\bar x = (1, x^T)^T$ and $\hat\beta_{I_j}$ is computed by

$$\hat\beta_{I_j} = \arg\min_{\beta \in \mathbb{R}^{p+1}} \frac{1}{|I_j|} \sum_{i \in I_j} \left\{ Y_i - \bar X_i^T \hat\theta_{I_j} - (A_i - \hat\pi_i^{I_j}) \bar X_i^T \beta \right\}^2 + \sum_{k=2}^{p+1} p_{\lambda_{n,1}}(|\beta_k|), \qquad (14)$$

for some penalty function $p_\lambda$, where $\bar X_i = (1, X_i^T)^T$, $\hat\pi_i^{I_j}$ is the estimated propensity score for the $i$th patient based on a penalized logistic regression with the SCAD penalty, and $\hat\theta_{I_j}$ is calculated by

$$\hat\theta_{I_j} = \arg\min_{\theta \in \mathbb{R}^{p+1}} \frac{1}{|I_j|} \sum_{i \in I_j} (Y_i - \bar X_i^T \theta)^2 + \sum_{k=2}^{p+1} p_{\lambda_{n,2}}(|\theta_k|). \qquad (15)$$

We use the SCAD penalty in both (14) and (15). The tuning parameters $\lambda_{n,1}$ and $\lambda_{n,2}$ are selected via 10-fold cross-validation. Finally, define $\hat T_{PLS}$ by

$$\hat T_{PLS} = \max\left\{ \frac{\sqrt{n}\, \widehat{VD}_{I_2}^{dr}(\hat d_{I_1})}{\max\{\hat\sigma_{I_2}^{dr}(\hat d_{I_1}), \delta_n\}},\ \frac{\sqrt{n}\, \widehat{VD}_{I_1}^{dr}(\hat d_{I_2})}{\max\{\hat\sigma_{I_1}^{dr}(\hat d_{I_2}), \delta_n\}} \right\}. \qquad (16)$$

To compute $\hat T_{VS}$, we similarly split the observations into two sub-datasets $\{O_i\}_{i \in I_1}$ and $\{O_i\}_{i \in I_2}$. For each sub-dataset, we apply sequential advantage selection (SAS; Fan et al., 2016) to select variables with a qualitative interaction with the treatment. SAS is a greedy stepwise selection procedure and uses a BIC-type criterion to choose the best candidate subset of variables. Denote by $\hat{\mathcal M}_{I_1}, \hat{\mathcal M}_{I_2} \subseteq \{1, \ldots, p\}$ the corresponding sets of selected variables. Then for each $j = 1, 2$, we calculate the pseudo responses $\hat\tau_i^{I_j}$, $i \in I_j$ (see the definition in (12)) and compute

$$\hat\tau_{I_j} = \arg\min_{f \in \mathcal{H}_j} \frac{1}{n} \sum_{i \in I_j} \left\{ \hat\tau_i^{I_j} - f(X_{i, \hat{\mathcal M}_{I_j}}) \right\}^2 + \lambda_j \|f\|_{\mathcal{H}_j}^2,$$

where $\lambda_j > 0$ is a tuning parameter and $\mathcal{H}_j$ is the reproducing kernel Hilbert space with reproducing kernel $K_j(X_{i, \hat{\mathcal M}_{I_j}}, X_{k, \hat{\mathcal M}_{I_j}}) = \exp\{-\sum_{l \in \hat{\mathcal M}_{I_j}} \eta_{j,l} (X_i^{(l)} - X_k^{(l)})^2\}$, where $X_i^{(l)}$, $X_k^{(l)}$ denote the $l$th elements of $X_i$, $X_k$ and $\eta_{j,l} > 0$, $l \in \hat{\mathcal M}_{I_j}$, are tuning parameters. The estimation procedure is implemented in the R package listdtr and the tuning parameters are selected via leave-one-out cross-validation. We then define $\hat d_{I_j}(x) = I\{\hat\tau_{I_j}(x_{\hat{\mathcal M}_{I_j}}) > 0\}$ and set

$$\hat T_{VS} = \max\left\{ \frac{\sqrt{n}\, \widehat{VD}_{I_2}^{dr}(\hat d_{I_1})}{\max\{\hat\sigma_{I_2}^{dr}(\hat d_{I_1}), \delta_n\}},\ \frac{\sqrt{n}\, \widehat{VD}_{I_1}^{dr}(\hat d_{I_2})}{\max\{\hat\sigma_{I_1}^{dr}(\hat d_{I_2}), \delta_n\}} \right\}. \qquad (17)$$

We set $\delta_n = \log(\log_{10}(2n))/(2n)^{1/6}$ in (11), (16) and (17), where $\log_{10}$ denotes the logarithm with base 10.

T^DL tests the overall treatment effects by fitting the following linear regression model for the response:

$$E(Y \mid A, X) \approx \beta_0 + X^T \beta_x + A \beta_a + A X^T \beta_{ax}.$$

Based on this model, testing the overall treatment effects is equivalent to testing $H_0^*: \beta_{ax} = 0$. Denote $\beta = (\beta_0, \beta_x^T, \beta_a, \beta_{ax}^T)^T$. To deal with the high dimensionality, we estimate $\beta$ by the desparsified Lasso estimator $\hat\beta^{DL}$ and test $H_0^*$ using the supremum-type test statistic $\max_{j \in \mathcal{M}_{ax}} \sqrt{n} |\hat\beta_j^{DL}|$, where $\mathcal{M}_{ax} = \{p+3, \ldots, 2p+2\}$ indexes the interaction coefficients and $\hat\beta_j^{DL}$ is the $j$th element of $\hat\beta^{DL}$. The critical value of $\hat T_{DL}$ is approximated via the bootstrap. Detailed implementation of the test can be found in Zhang and Cheng (2017).
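Given any estimate of $\beta$, the supremum-type statistic is a simple maximum over the interaction block; a minimal Python sketch (the estimator itself is not implemented here, and the function name is illustrative):

```python
import math

def sup_statistic(beta_hat, p, n):
    """Compute max_{j in M_ax} sqrt(n) |beta_j| over the treatment-covariate
    interaction block.

    With beta = (beta_0, beta_x, beta_a, beta_ax) of length 2p + 2, the
    interaction coefficients occupy 1-based positions p + 3, ..., 2p + 2,
    i.e. the last p entries of the vector."""
    interactions = beta_hat[p + 2 : 2 * p + 2]  # 0-based slice of that index set
    return math.sqrt(n) * max(abs(b) for b in interactions)

# Toy estimate with p = 3 covariates (vector length 2p + 2 = 8);
# the last three entries play the role of beta_ax.
beta_hat = [0.9, 0.1, -0.2, 0.0, 0.05, 0.3, -0.6, 0.1]
print(sup_statistic(beta_hat, p=3, n=100))  # sqrt(100) * 0.6 = 6.0
```

The actual test compares this statistic to a bootstrap critical value, as in Zhang and Cheng (2017).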

4.3. Results

We conduct 500 simulations for each setting and report the proportions of rejecting the null hypothesis (%) in Tables 1 and 2, with standard errors in parentheses (%). Under $H_0$, the type-I error of our test statistic is well controlled. Specifically, in Scenario 1 with VD = 0, the rejection probability of $\hat T_{SRP}^{dr}$ is exactly zero. This is in line with our theory, which suggests that the type-I error of our test statistic converges to 0 in the regular cases where $\Pr\{\tau(X) = 0\} = 0$. In Scenario 2 with VD = 0, the rejection probability of $\hat T_{SRP}^{dr}$ is close to the nominal level.

Table 1:

Rejection probabilities (%) of the sparse random projection-based test, dense random projection-based test, penalized least square-based test, step-wise selection-based test and the supremum-type test based on the desparsified Lasso estimator, with standard errors in parenthesis (%), under Scenarios 1 and 2 where X ~ N(0, Ip).

Scenario 1         VD = 0           VD = 20%         VD = 35%         VD = 50%
        p      α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05
T^SRPdr 50 0(0) 0(0) 24(1.9) 39.6(2.2) 71(2.0) 81(1.8) 90.8(1.3) 95.2(1.0)
100 0(0) 0(0) 17.4(1.7) 29.6(2.0) 60.8(2.2) 73.8(2.0) 86.6(1.5) 92.4(1.2)
T^RPdr 50 0(0) 0(0) 0.2(0.2) 0.6(0.4) 0.8(0.4) 3.2(0.8) 7.2(1.2) 18.6(1.7)
100 0(0) 0(0) 0.4(0.3) 0.4(0.3) 0.4(0.3) 4(0.9) 6.8(1.1) 19(1.8)
T^PLS 50 0(0) 0(0) 0(0) 0(0) 0.4(0.3) 0.8(0.4) 6(1.1) 17.6(1.7)
100 0(0) 0(0) 0(0) 0(0) 0.8(0.4) 2.4(0.7) 8.6(1.3) 19.8(1.8)
T^VS 50 0(0) 0(0) 1.2(0.5) 3.8(0.9) 16 (1.6) 29.4 (2.0) 36.6(2.2) 50.8(2.2)
100 0(0) 0(0) 0(0) 0.6(0.3) 8.4 (1.2) 17.4 (1.7) 23.8(1.9) 36.4(2.2)
T^DL 50 10.2(1.4) 22.4(1.9) 11.2(1.4) 22.8(1.9) 10.8 (1.4) 21.8 (1.9) 9.8(1.3) 22.4(1.9)
100 7.6(1.2) 20.0(1.8) 7.8(1.2) 21.4(1.8) 7.6 (1.2) 22.0 (1.9) 6.8(1.1) 21.6(1.8)

Scenario 2         VD = 0           VD = 20%         VD = 35%         VD = 50%
        p      α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05

T^SRPdr 50 1.2(0.5) 5.4(1) 24(1.9) 35.8(2.1) 76.4(1.9) 84.6(1.6) 90.2(1.3) 94(1.1)
100 0.6(0.3) 5.2(1) 15.2(1.6) 28.2(2) 67(2.1) 78.8(1.8) 84.2(1.6) 90.4(1.3)
T^RPdr 50 1.8(0.6) 4.6(0.9) 2(0.6) 4.8(1) 1.6(0.6) 5.4(1) 1(0.4) 6(1.1)
100 1.2(0.5) 4.2(0.9) 1.2(0.5) 5.4(1) 0.6(0.3) 4.8(1) 0.8(0.4) 4.4(0.9)
T^PLS 50 1.8(0.6) 6(1.1) 1.2(0.5) 4.4(0.9) 1(0.4) 4.2(0.9) 0.8(0.4) 3.8(0.9)
100 1.2(0.5) 4.2(0.9) 0.8(0.4) 4.6(0.9) 0.6(0.3) 5.6(1) 0.6(0.3) 5(1)
T^VS 50 1.2(0.5) 6.4(1.1) 0.6(0.3) 4(0.9) 1(0.4) 6.6(1.1) 1(0.4) 5(1)
100 1.4(0.5) 5(1.0) 1.0(0.4) 5(1.0) 1.4(0.5) 6.4(1.1) 0.6(0.3) 4.6(0.9)
T^DL 50 1.6(0.6) 6.4(1.1) 2.8(0.7) 11.8(1.4) 4.4 (0.9) 15.4 (1.6) 5.4 (1.0) 17(1.7)
100 1.2(0.5) 3.6(0.8) 2.8(0.7) 11.8(1.4) 5.2(1.0) 17.6(1.7) 7.2(1.2) 19.8(1.8)

Table 2:

Rejection probabilities (%) of the sparse random projection-based test, dense random projection-based test, penalized least square-based test, step-wise selection-based test and the supremum-type test based on the desparsified Lasso estimator, with standard errors in parenthesis (%), under Scenarios 3 and 4 where X ~ N(0, Ip).

Scenario 3         VD = 20%         VD = 35%         VD = 50%
        p      α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05
T^SRPdr 50 47.2(2.2) 71.8(2) 92.4(1.2) 97.8(0.7) 99(0.4) 100(0)
100 42.4(2.2) 61.2(2.2) 89.8(1.4) 96.2(0.9) 97.2(0.7) 99.4(0.3)
T^RPdr 50 4.4(0.9) 16.2(1.6) 13.4(1.5) 35.8(2.1) 22(1.9) 49.4(2.2)
100 3(0.8) 8.4(1.2) 4(0.9) 14.2(1.6) 5.4(1) 19.6(1.8)
T^PLS 50 76.4(1.9) 92(1.2) 97.8(0.7) 99.4(0.3) 99.4(0.3) 100(0)
100 64.8(2.1) 87(1.5) 97(0.8) 99.4(0.3) 98.6(0.5) 99.8(0.2)
T^VS 50 55.6(2.2) 81.8(1.7) 93(1.1) 99(0.4) 97.8(0.7) 100(0)
100 49.8(2.2) 74.2(2.0) 90(1.3) 98.6(0.5) 99(0.4) 100(0)
T^DL 50 99.8(0.7) 100(0) 100(0) 100(0) 100(0) 100(0)
100 99.2(0.4) 100(0) 100(0) 100(0) 100(0) 100(0)

Scenario 4         VD = 20%         VD = 35%         VD = 50%
        p      α=0.01  α=0.05   α=0.01  α=0.05   α=0.01  α=0.05

T^SRPdr 50 22.4(1.9) 41.8(2.2) 60.4(2.2) 76.6(1.9) 72.4(2) 87.2(1.5)
100 15.2(1.6) 28(2) 49.6(2.2) 70.2(2) 70(2) 84(1.6)
T^RPdr 50 0.4(0.3) 6.2(1.1) 0.6(0.3) 5.4(1) 0.2(0.2) 5.4(1)
100 1.2(0.5) 6(1.1) 0.8(0.4) 3.8(0.9) 1.2(0.5) 5.2(1)
T^PLS 50 1.2(0.5) 5.4(1) 1.2(0.5) 6(1.1) 1.4(0.5) 4.8(1)
100 1.6(0.6) 5.8(1) 1.8(0.6) 6(1.1) 1.4(0.5) 5.2(1)
T^VS 50 10.4(1.4) 24.2(1.9) 13.6(1.5) 30.6(2.1) 13.2(1.5) 29.4(2)
100 5(1) 15.6(1.6) 4.6(0.9) 20(1.8) 8.2(1.2) 18.4(1.7)
T^DL 50 4.2(0.9) 11.4(1.4) 5.4(1) 14.2(1.6) 6.4(1.1) 15.8(1.6)
100 6.2(1.1) 16(1.6) 6.4(1.1) 18.2(1.7) 6.8(1.1) 19.6(1.8)

Under $H_1$, our test statistic is much more powerful than the competing test statistics in Scenarios 1, 2 and 4. For example, when VD = 0.35 and α = 0.05, the rejection probabilities of our test are around 75% in Scenario 1. On the other hand, $\hat T_{RP}^{dr}$, $\hat T_{PLS}$ and $\hat T_{VS}$ fail in Scenario 2: the rejection probabilities of these three tests are no more than 6% in all settings. The rejection probabilities of $\hat T_{DL}$ are around 10%-20% in Scenario 2 under $H_1$; however, $\hat T_{DL}$ does not have a valid type-I error rate under $H_0$. Here, the test statistics $\hat T_{PLS}$ and $\hat T_{VS}$ fail mainly because the true OITR is not linear, while $\hat T_{RP}^{dr}$ and $\hat T_{VS}$ fail partly because dense projections and greedy stepwise variable selection cannot correctly identify the variables with qualitative interactions.

In Scenario 3, $\hat T_{DL}$ and $\hat T_{PLS}$ achieve the greatest power in all settings, as expected, since the true OITR is linear in this scenario. Notice that $X^{(1)}, X^{(2)}, \ldots, X^{(7)}$ are independent. Although the contrast function is not linear, the estimated contrast functions obtained via penalized least squares (see (14) and (15)) converge to $E\{\tau(X) \mid X^{(1)}, X^{(2)}\}$. As a result, the estimated OITR is consistent. When VD = 0.35 and 0.5, the rejection probabilities of $\hat T_{SRP}^{dr}$ are slightly smaller than those of $\hat T_{PLS}$, $\hat T_{DL}$ and $\hat T_{VS}$, but much larger than those of $\hat T_{RP}^{dr}$.

In Section E of the supplementary article, we report the rejection probabilities of $\hat T_{SRP}^{dr}$, $\hat T_{RP}^{dr}$, $\hat T_{PLS}$, $\hat T_{VS}$ and $\hat T_{DL}$ under the scenario where $X \sim N(0, \Sigma)$ with $\Sigma = \{0.5^{|i-j|}\}_{i,j = 1, \ldots, p}$. The results are similar to those presented in Tables 1 and 2.

4.4. Computation time

Our tests are computed on a 32-core 2.2GHz machine with 512GB RAM. Fixing $B = 10^5$, it took approximately 3 minutes to run the test in Scenarios 1-3, where $N = 500$, and 5 minutes in Scenario 4, where $N = 1000$. The computation time can be substantially reduced by using a much smaller $B$: setting $B = 10^4$ in some simulation settings makes the computation 10 times faster while the test performance remains satisfactory. Moreover, since our testing procedure independently generates many sketching matrices and retains the one that maximizes the estimated value function, it can be naturally implemented in parallel, which further reduces the computational cost.

5. Real data

We apply the proposed test to data from the Nefazodone-CBASP clinical trial (Keller et al., 2000), which enrolled 681 patients with nonpsychotic chronic major depressive disorder (MDD). Patients were randomized to three treatments: Nefazodone (coded as 0), the Cognitive Behavioral-Analysis System of Psychotherapy (CBASP, coded as 1), and the combination of Nefazodone and CBASP (coded as 2). The outcome of interest was the patient's score on the 24-item Hamilton Rating Scale for Depression (HRSD). The maximum value of the HRSD is 43, and we set Y = 43 - HRSD as our response; a larger value of Y indicates a better clinical outcome. As in Zhao et al. (2012), we use the subset of 647 patients with complete records of 50 baseline covariates for our analysis. Among them, 216 were treated with Nefazodone, 220 with CBASP and 211 with the combination.

Our objective is to test whether the baseline covariates $X$ have overall qualitative treatment effects. This is equivalent to testing $H_0: V(d^{opt}) = \max\{V(0), V(1), V(2)\}$, where $V(d^{opt})$ is the optimal value function and $V(j)$ denotes the value function under the fixed treatment regime that assigns all patients to treatment $j$, for $j = 0, 1, 2$. Patients' average responses under treatments 0, 1 and 2 are 27.14, 27.27 and 32.13, respectively. Moreover, pairwise t-tests show that $V(2)$ is significantly larger than $V(0)$ and $V(1)$. Therefore, it suffices to test $H_0: V(d^{opt}) = V(2)$. This is equivalent to testing the intersection of the following two hypotheses:

$$H_0^{(j)}: V(d^{opt,(j)}) = \max_{k \in \{0,1,2\},\, k \neq j} V(k),$$

for $j = 0, 1$, where $d^{opt,(j)}$ is the optimal treatment regime comparing Treatment 2 with Treatment $j$. To test $H_0^{(j)}$, we compute the test statistic $\hat T_{SRP}^{dr,j}$ as described in Sections 3.3 and 4.1. We set $B = 10^5$ and $\delta_n = \log(\log_{10}(2n))/(2n)^{1/6}$. For a given $0 < \alpha < 1$, we reject $H_0$ if

$$\max_{j = 0, 1} \hat T_{SRP}^{dr,j} > z_{\alpha/4}.$$

By Bonferroni’s inequality, the type-I error is well-controlled.
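The final decision rule is a one-line comparison against a normal quantile; the following Python sketch reproduces it for the two statistics reported below (the function name is illustrative):

```python
from statistics import NormalDist

def reject_overall(test_stats, alpha):
    """Reject H0 when max_j T_j exceeds z_{alpha/4}.

    With two component tests, each compared to the upper alpha/4 normal
    quantile, Bonferroni's inequality bounds the overall type-I error by
    alpha (each one-sided comparison contributes at most alpha/4, and the
    construction uses two sample splits per hypothesis)."""
    z = NormalDist().inv_cdf(1 - alpha / 4)  # upper alpha/4 standard normal quantile
    return max(test_stats) > z

# The two statistics obtained for the Nefazodone-CBASP data:
print(reject_overall([-0.67, 0.31], alpha=0.1))  # False: fail to reject H0
```

At $\alpha = 0.1$ the critical value is $z_{0.025} \approx 1.96$, well above both observed statistics, matching the conclusion reported next.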

The two test statistics equal -0.67 and 0.31, respectively. We thus fail to reject $H_0$ at the significance level 0.1, and suspect that the prognostic covariates in this study might not have qualitative treatment effects. Zhao et al. (2012) performed pairwise comparisons between the combination treatment and each single treatment, and estimated the OITR via outcome weighted learning; their estimated optimal treatment regime recommended the combination treatment to all patients. Our tests formally support their findings.

6. Discussion

In this paper, we develop tests for overall qualitative treatment effects. The test statistics are constructed by a sample-splitting method. In the high-dimensional setting, we use sparse random projections of the covariate space to construct the test statistic and introduce a data-dependent way to sample sparse projection matrices. In theory, we show the consistency of the proposed test statistic and prove its “oracle” property in the regular cases.

6.1. Nonnegative average treatment effects

In this paper, we assume $V(1) \ge V(0)$ (the new treatment is on average better than the standard control) and consider test statistics based on estimators of the value difference $V(d^{opt}) - V(1)$. When such prior information is not available, let $\hat a_{I_j} = \arg\max_{a \in \{0,1\}} \hat V_{I_j}(a)$ for $j = 1, 2$, where $I_1$ and $I_2$ form a random partition of the dataset and $\hat V_{I_j}(a)$ denotes the estimated value function, based on the observations in $I_j$, under the fixed decision rule $d(x) = a$ for all $x$. We can then consider the following test statistic,

$$\hat T_{CV} = \max\left\{ \frac{\sqrt{|I_2|}\, \{\hat V_{I_2}(\hat d_{I_1}) - \hat V_{I_2}(\hat a_{I_1})\}}{\hat\sigma_{I_2}(\hat d_{I_1}, \hat a_{I_1})},\ \frac{\sqrt{|I_1|}\, \{\hat V_{I_1}(\hat d_{I_2}) - \hat V_{I_1}(\hat a_{I_2})\}}{\hat\sigma_{I_1}(\hat d_{I_2}, \hat a_{I_2})} \right\},$$

where $\hat\sigma_I(d, a)$ denotes some consistent estimator of the asymptotic variance of $\sqrt{|I|}\{\hat V_I(d) - \hat V_I(a)\}$ for a given regime $d$ and $a \in \{0, 1\}$. The null is rejected if $\hat T_{CV} > z_{\alpha/2}$ for a given significance level $\alpha$. Using arguments similar to those in Theorems 3.1 and 3.2, we can show that such a testing procedure is consistent.

6.2. Multi-stage studies

Currently, we only consider a single-stage study. For multi-stage studies, it suffices to test whether the value function under the optimal dynamic treatment regime is strictly larger than those under nondynamic treatment regimes. Zhang et al. (2013) proposed an inverse propensity-score weighted estimator of the value function under an arbitrary dynamic treatment regime. Denote by $\widehat{VD}_I(d_1, d_2)$ the corresponding estimator of the value difference between two dynamic treatment regimes $d_1$ and $d_2$, and by $\hat d_I$ the estimated optimal dynamic treatment regime, based on the sub-dataset $I$. Consider the following test statistic:

$$\hat T_{CV} = \max\left\{ \min_{d \in \mathcal{D}^{nd}} \frac{\sqrt{|I_2|}\, \widehat{VD}_{I_2}(\hat d_{I_1}, d)}{\hat\sigma_{I_2}(\hat d_{I_1}, d)},\ \min_{d \in \mathcal{D}^{nd}} \frac{\sqrt{|I_1|}\, \widehat{VD}_{I_1}(\hat d_{I_2}, d)}{\hat\sigma_{I_1}(\hat d_{I_2}, d)} \right\},$$

where $I_1$ and $I_2$ form a random partition of the dataset, $\hat\sigma_I(d_1, d_2)$ is some consistent estimator of the asymptotic variance of $\sqrt{|I|}\, \widehat{VD}_I(d_1, d_2)$, and $\mathcal{D}^{nd}$ denotes the set of nondynamic treatment regimes.

Note that for $j = 1, 2$, we have under the null that

$$\min_{d \in \mathcal{D}^{nd}} \frac{\sqrt{|I_j|}\, \widehat{VD}_{I_j}(\hat d_{I_j^c}, d)}{\hat\sigma_{I_j}(\hat d_{I_j^c}, d)} \le \min_{d \in \mathcal{D}^{nd}} \frac{\sqrt{|I_j|}\, \{\widehat{VD}_{I_j}(\hat d_{I_j^c}, d) - VD(\hat d_{I_j^c}, d)\}}{\hat\sigma_{I_j}(\hat d_{I_j^c}, d)} \stackrel{\mathcal{L}}{\longrightarrow} \min_{d \in \mathcal{D}^{nd}} Z_d, \qquad (18)$$

where $VD(d_1, d_2) = E\, \widehat{VD}_I(d_1, d_2)$ and $\{Z_d\}_{d \in \mathcal{D}^{nd}}$ is a set of mean-zero Gaussian random variables whose covariance matrix can be consistently estimated from the data. For a given significance level $\alpha$, we reject the null if $\hat T_{CV} > \hat c_{\alpha/2}$, where $\hat c_\alpha$ denotes a consistent estimator of the upper $\alpha$th quantile of $\min_{d \in \mathcal{D}^{nd}} Z_d$. It follows from Bonferroni's inequality and (18) that the type-I error of $\hat T_{CV}$ is well controlled. In the high-dimensional setting, we can calculate $\hat T_{CV}$ based on sparse random projections of the covariate space. Details are omitted for brevity.

Supplementary Material

Appendix

Acknowledgment


The authors thank the editor, the AE and two referees for their helpful suggestions that significantly improved the quality of the paper. The research of Chengchun Shi and Rui Song is partially supported by Grant NSF-DMS-1555244 and Grant NCI P01 CA142538. The research of Wenbin Lu is partially supported by Grant NCI P01 CA142538.

Appendix

A. Variance estimator in Section 3.3.1

Define $\hat\alpha_I$ to be the penalized logistic regression estimator based on $\{(X_i, A_i)\}_{i \in I}$, and $\hat\theta_{0,I}$ and $\hat\theta_{1,I}$ to be the penalized linear regression estimators based on $\{(X_i, Y_i)\}_{i \in I, A_i = 0}$ and $\{(X_i, Y_i)\}_{i \in I, A_i = 1}$, respectively. Denote by $\mathcal{M}_{\alpha, I}$ the support of $\hat\alpha_I$, i.e., $\mathcal{M}_{\alpha, I} = \{j = 1, \ldots, p : \hat\alpha_{I,j} \neq 0\}$. Similarly define $\mathcal{M}_{\theta_0, I}$ and $\mathcal{M}_{\theta_1, I}$ to be the supports of $\hat\theta_{0,I}$ and $\hat\theta_{1,I}$, respectively. Let

$$\hat\pi_i = \frac{\exp(X_i^T \hat\alpha_I)}{1 + \exp(X_i^T \hat\alpha_I)}.$$

For any treatment regime d, we define

$$\hat\sigma_{DR, I}^2(d) = \frac{1}{|I| - 1} \sum_{i \in I} \kappa_i^2 - \frac{1}{|I|(|I| - 1)} \left( \sum_{i \in I} \kappa_i \right)^2,$$

where

$$\begin{aligned} \kappa_i = {} & \left[ \left( \frac{1 - A_i}{1 - \hat\pi_i} - \frac{A_i}{\hat\pi_i} \right) Y_i - \left( \frac{1 - A_i}{1 - \hat\pi_i} - 1 \right) X_i^T \hat\theta_{0,I} + \left( \frac{A_i}{\hat\pi_i} - 1 \right) X_i^T \hat\theta_{1,I} \right] \{1 - d(X_i)\} \\ & + \bar I_1^T \left\{ \frac{1}{|I|} \sum_{i' \in I} X_{i', \mathcal{M}_{\alpha,I}}^T \hat\pi_{i'} (1 - \hat\pi_{i'}) X_{i', \mathcal{M}_{\alpha,I}} \right\}^{-1} X_{i, \mathcal{M}_{\alpha,I}}^T (A_i - \hat\pi_i) \\ & - \bar I_2^T \left\{ \frac{1}{|I|} \sum_{i' \in I} (1 - A_{i'}) X_{i', \mathcal{M}_{\theta_0,I}}^T X_{i', \mathcal{M}_{\theta_0,I}} \right\}^{-1} X_{i, \mathcal{M}_{\theta_0,I}}^T (1 - A_i)(Y_i - X_i^T \hat\theta_{0,I}) \\ & + \bar I_3^T \left\{ \frac{1}{|I|} \sum_{i' \in I} A_{i'} X_{i', \mathcal{M}_{\theta_1,I}}^T X_{i', \mathcal{M}_{\theta_1,I}} \right\}^{-1} X_{i, \mathcal{M}_{\theta_1,I}}^T A_i (Y_i - X_i^T \hat\theta_{1,I}), \end{aligned}$$

and $\bar I_j = \sum_{i \in I} I_{i,j} / |I|$, where

$$\begin{aligned} I_{i,1} &= \left[ \frac{\hat\pi_i (1 - A_i)}{1 - \hat\pi_i} \{Y_i - X_i^T \hat\theta_{0,I}\} + \frac{A_i (1 - \hat\pi_i)}{\hat\pi_i} \{Y_i - X_i^T \hat\theta_{1,I}\} \right] X_{i, \mathcal{M}_{\alpha,I}} \{1 - d(X_i)\}, \\ I_{i,2} &= \left( \frac{1 - A_i}{1 - \hat\pi_i} - 1 \right) X_{i, \mathcal{M}_{\theta_0,I}} \{1 - d(X_i)\}, \\ I_{i,3} &= \left( \frac{A_i}{\hat\pi_i} - 1 \right) X_{i, \mathcal{M}_{\theta_1,I}} \{1 - d(X_i)\}. \end{aligned}$$

B. Technical conditions

(C1.) Assume there exist some positive constants γ and δ0 such that

$$\Pr\{0 < \tau(X) \le t\} = O(t^\gamma),$$

where the big-O term is uniform in 0 < t < δ0.

(C2.) Assume τ^ satisfies

$$E |\hat\tau_I(X) - \tau(X)|^2 = o\left(|I|^{-(2+\gamma)/(2+2\gamma)}\right) \quad \text{as } |I| \to \infty,$$

where the little-o term is uniform in the training samples I.

Condition (C1) is closely related to the margin assumption (Tsybakov, 2004; Audibert and Tsybakov, 2007) in the classification literature. It is often used to obtain sharp upper bounds on the difference between the value function under $d^{opt}$ and that under an estimated OITR (Qian and Murphy, 2011; Luedtke and van der Laan, 2016). The larger the structural parameter $\gamma$ in (C1), the sharper the upper bounds. When $\tau(X)$ has a bounded density function near 0, (C1) holds with $\gamma = 1$. If there exists some $\delta_0 > 0$ such that $|\tau(X)| \ge \delta_0$ almost surely, then (C1) holds with $\gamma = +\infty$.

Condition (C2) depends on the structural parameter $\gamma$ in (C1) and the convergence rate of the estimated contrast function: the larger the $\gamma$, the more likely (C2) holds. When $\gamma = 1$, (C2) requires $E|\hat\tau_I(X) - \tau(X)|^2 = o(|I|^{-3/4})$. Rates of convergence of the estimated contrast function are available for most commonly used machine learning or statistical methods, such as spline methods (Zhou et al., 1998), kernel ridge regression (Steinwart and Christmann, 2008; Zhang et al., 2013) and random forests (Biau, 2012). In Section C.1 of the Appendix, we show that (C2) holds when $\hat\tau$ is computed by some of the aforementioned methods. Combining (C1) with (C2) gives $V(\hat d_I) = V(d^{opt}) + o_p(|I|^{-1/2})$.

References

  1. Audibert J-Y and Tsybakov AB (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35(2), 608–633.
  2. Baker SG, Cook NR, Vickers A, and Kramer BS (2009). Using relative utility curves to evaluate risk prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society) 172(4), 729–748.
  3. Belloni A, Chernozhukov V, Chetverikov D, and Kato K (2015). Some new asymptotic theory for least squares series: Pointwise and uniform results. Journal of Econometrics 186(2), 345–366.
  4. Biau G (2012). Analysis of a random forests model. Journal of Machine Learning Research 13, 1063–1095.
  5. Cannings TI and Samworth RJ (2015). Random projection ensemble classification. arXiv preprint arXiv:1504.04595.
  6. Chakraborty B, Murphy S, and Strecher V (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Stat. Methods Med. Res. 19(3), 317–343.
  7. Chang M, Lee S, and Whang Y-J (2015). Nonparametric tests of conditional treatment effects with an application to single-sex schooling on academic achievements. Econom. J. 18(3), 307–346.
  8. Fan A, Lu W, and Song R (2016). Sequential advantage selection for optimal treatment regime. Ann. Appl. Stat. 10(1), 32–53.
  9. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96(456), 1348–1360.
  10. Gail MH (2009). Value of adding single-nucleotide polymorphism genotypes to a breast cancer risk model. Journal of the National Cancer Institute 101(13), 959–963.
  11. Galassi M, Davies J, Theiler J, Gough B, Jungman G, Alken P, Booth M, Rossi F, and Ulerich R (2015). GNU Scientific Library Reference Manual (Version 2.1).
  12. Gunter L, Zhu J, and Murphy SA (2011). Variable selection for qualitative interactions. Stat. Methodol. 8(1), 42–55.
  13. Hsu Y-C (2017). Consistent tests for conditional treatment effects. The Econometrics Journal 20(1), 1–22.
  14. Huang Y, Laber EB, and Janes H (2015). Characterizing expected benefits of biomarkers in treatment selection. Biostatistics 16(2), 383–399.
  15. Johnson WB and Lindenstrauss J (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26, 189–206.
  16. Keller MB, McCullough JP, Klein DN, Arnow B, Dunner DL, Gelenberg AJ, Markowitz JC, Nemeroff CB, Russell JM, Thase ME, et al. (2000). A comparison of nefazodone, the cognitive behavioral-analysis system of psychotherapy, and their combination for the treatment of chronic depression. New England Journal of Medicine 342(20), 1462–1470.
  17. Li P, Hastie TJ, and Church KW (2006). Very sparse random projections. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 287–296. ACM.
  18. Lopes M, Jacob L, and Wainwright MJ (2011). A more powerful two-sample test in high dimensions using random projection. In Advances in Neural Information Processing Systems, pp. 1206–1214.
  19. Luedtke AR and van der Laan MJ (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Statist. 44(2), 713–742.
  20. Murphy SA (2003). Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 65(2), 331–366.
  21. Nelson J and Nguyên HL (2013). OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 117–126. IEEE.
  22. Omidiran D and Wainwright MJ (2010). High-dimensional variable selection with sparse random projections: measurement sparsity and statistical efficiency. Journal of Machine Learning Research 11, 2361–2386.
  23. Qian M and Murphy SA (2011). Performance guarantees for individualized treatment rules. Ann. Statist. 39(2), 1180–1210.
  24. Robins J, Hernan M, and Brumback B (2000). Marginal structural models and causal inference in epidemiology. Epidemiology 11, 550–560.
  25. Shi C, Fan A, Song R, and Lu W (2016). High-dimensional A-learning for optimal dynamic treatment regimes. Annals of Statistics, accepted.
  26. Steinwart I and Christmann A (2008). Support Vector Machines. Information Science and Statistics. Springer, New York.
  27. Tsybakov AB (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32(1), 135–166.
  28. van de Geer S, Bühlmann P, Ritov Y, and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42(3), 1166–1202.
  29. Watkins C and Dayan P (1992). Q-learning. Mach. Learn. 8, 279–292.
  30. Zhang B, Tsiatis AA, Laber EB, and Davidian M (2012). A robust method for estimating optimal treatment regimes. Biometrics 68(4), 1010–1018.
  31. Zhang C-H and Zhang SS (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 76(1), 217–242.
  32. Zhang X and Cheng G (2017). Simultaneous inference for high-dimensional linear models. J. Amer. Statist. Assoc. 112(518), 757–768.
  33. Zhang Y, Duchi J, and Wainwright M (2013). Divide and conquer kernel ridge regression. In Conference on Learning Theory, pp. 592–617.
  34. Zhang Y, Laber EB, Tsiatis A, and Davidian M (2015). Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics 71(4), 895–904.
  35. Zhao Y, Zeng D, Rush AJ, and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. J. Amer. Statist. Assoc. 107(499), 1106–1118.
  36. Zhou S, Shen X, and Wolfe DA (1998). Local asymptotics for regression splines and confidence regions. Ann. Statist. 26(5), 1760–1782.
