Published in final edited form as: J. Multivar. Anal. 168 (2018) 48–62. doi:10.1016/j.jmva.2018.07.005

Joint sufficient dimension reduction for estimating continuous treatment effect functions

Ming-Yueh Huang a,*, Kwun Chuen Gary Chan b
PMCID: PMC6411292  NIHMSID: NIHMS981686  PMID: 30872872

Abstract

The estimation of continuous treatment effect functions using observational data often requires parametric specification of the effect curves or of the conditional distributions of outcomes and treatment assignments given multi-dimensional covariates. While nonparametric extensions are possible, they typically suffer from the curse of dimensionality. Dimension reduction is often inevitable, and we propose a sufficient dimension reduction framework to balance parsimony and flexibility. The joint central subspace can be estimated at an $n^{1/2}$-rate without fixing its dimension in advance, and the treatment effect function is estimated by averaging local estimates of a reduced dimension. Asymptotic properties are studied. Unlike binary treatments, continuous treatments require multiple smoothing parameters of different asymptotic orders to borrow different facets of information, and their joint estimation is proposed by a non-standard version of the infinitesimal jackknife.

Keywords: Central subspace, Cross-validation, Dose-response, Infinitesimal jackknife, Optimal bandwidth

1. Introduction

It is common in medical and social studies to collect data on treatments or exposures, and a primary goal in these studies is to investigate the causal effects of the treatments on an outcome. In the case of continuous treatments, the effects are naturally described by functions, and model-free estimation typically involves smoothing to borrow information from adjacent treatment levels. While observational data are often the only data available in practice, direct smoothing of observed responses across different treatment levels usually results in biased estimates of causal treatment effects due to confounding.

In the literature, there are two major approaches to estimating treatment effects in the presence of confounding; they are based on outcome regression models and generalized propensity score models. Outcome regression models describe how the response relates to the treatment and covariates [9, 14], while generalized propensity score models extend the classical propensity score for binary treatments to the conditional density of treatment given covariates [4, 10, 13]. Doubly robust estimation was recently proposed in [17] to unify these two approaches by combining outcome regression and generalized propensity score models in a single estimating procedure. These model-based approaches require at least one correctly specified parametric or semiparametric model for the outcome regression functions or generalized propensity scores. In practice, however, it is often hard to check whether these models are adequate without prior knowledge. Alternatively, one can use fully nonparametric smoothing techniques in place of the modeling procedures, as in [4, 17]. When the dimension of covariates is large, such model-free approaches suffer from the curse of dimensionality and can lead to unstable estimates of the treatment effects. Some middle ground between current model-based and model-free approaches would thus be desirable.

In this work, we study a nonparametric method that is more parsimonious and suffers less from the curse of dimensionality than the fully nonparametric smoothing estimator, while retaining robustness to potential model misspecification. Sufficient dimension reduction for treatment effect estimation has recently been studied in [11, 19] for binary treatment variables. To extend these ideas to continuous treatments, we consider a joint sufficient dimension reduction model [11] to capture all the information of the outcome regression and generalized propensity score. In contrast, Luo et al. [19] considered separate dimension reduction for different subgroups, which is more difficult to generalize to continuous treatments.

An advantage of sufficient dimension reduction is that, unlike generalized propensity scores, which cannot be estimated nonparametrically at an $n^{1/2}$ convergence rate, the central subspace can be estimated at an $n^{1/2}$-rate, which turns out to be crucial in guaranteeing desirable theoretical and practical performance of the treatment effect estimator. The estimation is complicated by the need for multiple smoothing parameters: one for borrowing information across different treatment levels, and another for undersmoothing with respect to marginalizing the observed covariate distribution. The smoothing parameters are chosen in a data-adaptive manner by minimizing an estimator of the asymptotic mean squared error via a non-standard application of the infinitesimal jackknife.

The rest of this article is organized as follows. Section 2 introduces the dimension reduction model under consideration and proposes a procedure to estimate the joint central subspace and the treatment effects of interest. The results of a series of simulation studies are reported in Section 3, and an application to a food patterns dataset is described in Section 4. Some concluding remarks are given in Section 5.

2. The proposed methodology

2.1. Joint sufficient dimension reduction model

Let $Y(t)$ be the potential outcome associated with each treatment level $t$, where $t \in \mathcal{T}$ and $\mathcal{T}$ is a connected subset of $\mathbb{R}$, the set of all real numbers. Also, let $T$ be the continuous treatment variable. Then the observed outcome is $Y = Y(T)$. In addition, a vector $X = (X_1, \dots, X_p)^\top$ of covariates is observed for each subject. The main goal of this work is to estimate the continuous treatment effect $\mu(t) = E\{Y(t)\}$, $t \in \mathcal{T}$, based on a random sample $(Y_1, T_1, X_1), \dots, (Y_n, T_n, X_n)$. Since this parameter is defined in terms of potential outcomes, which are not observed directly, some assumptions are needed to identify the quantity of interest from the observed data.

Assumption 1 (Positivity). $f_T(t \mid x) > 0$ for all $t \in \mathcal{T}$ and $x \in \mathcal{X}$, where $f_T(t \mid x)$ is the conditional density of $T$ given $X = x$, and $\mathcal{X}$ is the support of $X$.

Assumption 2 (Ignorability). $Y(t) \perp\!\!\!\perp T \mid X$ for all $t \in \mathcal{T}$, where $\perp\!\!\!\perp$ denotes independence.

The positivity assumption states that every subject has a non-zero chance of receiving treatment level t. The ignorability assumption says that the potential outcomes and the treatment are unconfounded when the covariates are given. That is, all the confoundedness is captured by the vector of covariates. According to these assumptions, a simple bias removal strategy is to use the property

$$E\{Y(t)\} = E[E\{Y(t) \mid X\}] = E[E\{Y(t) \mid T = t, X\}] = E\{E(Y \mid T = t, X)\}. \tag{1}$$

Thus, an estimator of $\mu(t)$ can be obtained by averaging an estimator of the conditional effect $E(Y \mid T = t, X)$. In practice, the number of covariates is usually large and nonparametric estimators of the conditional effect typically suffer from the curse of dimensionality. In order to find a lower-dimensional score that attains dimension reduction while retaining a property similar to (1), we first note that the likelihood of the observed variables is

$$f_{Y,T,X}(y, t, x) = f_Y(y \mid t, x)\, f_T(t \mid x)\, f_X(x) = f_{Y(t)}(y \mid x)\, f_T(t \mid x)\, f_X(x)$$

based on Assumptions 1 and 2. One can see that all the information is conveyed through the conditional distributions of Y(t) and T given X. Thus, we propose a joint sufficient dimension reduction model to summarize both potential outcomes and treatment assignment, viz.

$$Y(t) \perp\!\!\!\perp X \mid B^\top X \ \text{ for all } t \in \mathcal{T}, \qquad T \perp\!\!\!\perp X \mid B^\top X, \tag{2}$$

for some $p \times d$ full-rank parameter matrix $B$ with $d \le p$. Since model (2) has a nested structure in $d$, we can focus on searching for the joint sufficient dimension reduction subspace $\mathrm{span}(B)$ with the smallest dimension, which is called the joint central subspace; the corresponding basis matrix is denoted by $B_0$ and its dimension by $d_0$. Related discussion on the model structure and identification is given in [11] for binary treatments. Model (2) leads to the following result on the ignorability of treatment assignment.

Proposition 1. Under Assumption 2 and model (2), the joint central subspace $\mathrm{span}(B_0)$ satisfies $Y(t) \perp\!\!\!\perp T \mid B_0^\top X$ for all $t \in \mathcal{T}$.

Thus, the bias-removal strategy in (1) can still be applied with $X$ replaced by $B_0^\top X$. That is,

$$\mu(t) = E\{Y(t)\} = E[E\{Y(t) \mid X\}] = E\{E(Y \mid T = t, X)\} = E\{E(Y \mid T = t, B_0^\top X)\},$$

which forms the basis for our estimator of $\mu(t)$. Let $m(t, x; B) = E(Y \mid T = t, B^\top X = B^\top x)$. If $B_0$ were known, we could estimate $\mu(t)$ by $n^{-1}\sum_{i=1}^n \hat m(t, X_i; B_0)$, with the locally smoothed estimator

$$\hat m(t, x; B) = \frac{\sum_{i=1}^n Y_i\, K_{h_1}(T_i - t)\, \mathbf{K}_{h_2}\{B^\top(X_i - x)\}}{\sum_{i=1}^n K_{h_1}(T_i - t)\, \mathbf{K}_{h_2}\{B^\top(X_i - x)\}},$$

where $K_{h_1}(t) = K(t/h_1)/h_1$, $\mathbf{K}_{h_2}(u) = \prod_{k=1}^d K(u_k/h_2)/h_2$ with $u = (u_1, \dots, u_d)^\top$, $h_1, h_2$ are positive bandwidths, and $K$ is an $r$th-order kernel function. The choice of $r$ is discussed in Section 2.3. When $B_0$ is unknown and we have an estimator $\hat B$, we can estimate $\mu(t)$ by $\hat\mu(t) = n^{-1}\sum_{i=1}^n \hat m(t, X_i; \hat B)$. The following result holds for any estimator $\hat B$ for which $n^{1/2}(\hat B - B_0)$ converges in distribution under mild assumptions.
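To fix ideas, the following minimal Python sketch implements $\hat m(t, x; B)$ and $\hat\mu(t)$. The Gaussian kernel, the bandwidths, and all function names are illustrative choices of ours, not the paper's prescription (in particular, an $r$th-order kernel with $r = 2\lceil d/4 \rceil$, as recommended in Section 2.3, would replace the second-order Gaussian kernel when $d > 4$).

```python
import numpy as np

def gaussian_kernel(u):
    """Second-order Gaussian kernel; an illustrative choice of K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def mu_hat(t, Y, T, X, B, h1, h2, kernel=gaussian_kernel):
    """Estimate mu(t) = E{Y(t)} by averaging the local estimator
    m_hat(t, X_i; B) over the empirical covariate distribution.
    Y: (n,) outcomes; T: (n,) treatments; X: (n, p) covariates;
    B: (p, d) basis of the working subspace."""
    n = len(Y)
    Z = X @ B                                   # reduced covariates B^T X_i
    wT = kernel((T - t) / h1) / h1              # K_{h1}(T_i - t)
    m_vals = np.empty(n)
    for i in range(n):
        # product kernel K_{h2}{B^T(X_j - X_i)} over the d reduced coordinates
        wX = np.prod(kernel((Z - Z[i]) / h2) / h2, axis=1)
        w = wT * wX
        m_vals[i] = np.sum(w * Y) / np.sum(w)   # Nadaraya-Watson value m_hat(t, X_i; B)
    return m_vals.mean()
```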

Theorem 1. Suppose that Conditions A1–A4 in the Appendix are satisfied. Then, as n → ∞,

$$(n h_1)^{1/2}\{\hat\mu(t) - \mu(t) - h_1^r B(t)\} \rightsquigarrow \mathcal{N}\{0, V(t)\},$$

where

$$B(t) = \frac{\mu_{r,K}}{r!}\, E\left[ \frac{\partial_t^r\{m(t, X; B_0)\, f_T(t \mid B_0^\top X)\} - m(t, X; B_0)\, \partial_t^r f_T(t \mid B_0^\top X)}{f_T(t \mid B_0^\top X)} \right],$$

$$V(t) = \sigma_K^2\, E\left[ \frac{\operatorname{var}(Y \mid T, X) + \{m(T, X; B_0) - m(t, X; B_0)\}^2}{f_T^2(t \mid B_0^\top X)} \right],$$

$\mu_{r,K} = \int u^r K(u)\, du$, and $\sigma_K^2 = \int K^2(u)\, du$.

Remark 1. Different bandwidths $h_1$ and $h_2$ are needed because $\hat\mu(t)$ averages over the covariate distribution but not the treatment distribution. Therefore, undersmoothing is required in $h_2$, but not in $h_1$, to minimize the mean squared error of the estimator. Details are given in Sections 2.3 and 2.4.

Remark 2. Galvao and Wang [4] proposed an estimator for the treatment effects based on a conditional density adjustment. The main idea comes from the property

$$E\{Y(t) - \mu(t)\} = E\left[ \{Y - \mu(t)\}\, \frac{f_T(t \mid Y, X)}{f_T(t \mid X)} \right]$$

under Assumption 2 and some regularity conditions for changing the order of limits and integral. Since Proposition 1 shows that the conditional independence in Assumption 2 also holds for B0X, we can deduce that

$$E\{Y(t) - \mu(t)\} = E\left[ \{Y - \mu(t)\}\, \frac{f_T(t \mid Y, B_0^\top X)}{f_T(t \mid B_0^\top X)} \right].$$

Thus, an estimator for μ(t) can be

$$\hat\mu_{\mathrm{SE}}(t) = \frac{\sum_{i=1}^n Y_i\, \hat f_T(t \mid Y_i, \hat B^\top X_i)\big/\hat f_T(t \mid \hat B^\top X_i)}{\sum_{i=1}^n \hat f_T(t \mid Y_i, \hat B^\top X_i)\big/\hat f_T(t \mid \hat B^\top X_i)}$$

with

$$\hat f_T(t \mid y, B^\top x) = \frac{\sum_{i=1}^n K_\varsigma(T_i - t)\, K_\varsigma(Y_i - y)\, \mathbf{K}_\varsigma\{B^\top(X_i - x)\}}{\sum_{i=1}^n K_\varsigma(Y_i - y)\, \mathbf{K}_\varsigma\{B^\top(X_i - x)\}}, \qquad \hat f_T(t \mid B^\top x) = \frac{\sum_{i=1}^n K_\varsigma(T_i - t)\, \mathbf{K}_\varsigma\{B^\top(X_i - x)\}}{\sum_{i=1}^n \mathbf{K}_\varsigma\{B^\top(X_i - x)\}}.$$

Note that this estimator requires a fully nonparametric estimate of the density $f_{T,Y,B^\top X}(t, y, B^\top x)$, which has a slower rate of convergence than that of $\hat m(t, x; B)$. In our simulations, we found that this estimator usually performs worse than the proposed estimator.

Remark 3. In the literature on causal inference with binary treatments, doubly robust estimation is often appealing: it combines the outcome regression and propensity score models and requires only one of them to be correctly specified. For continuous treatments, Kennedy et al. [17] proposed a kernel-based approach based on the influence function of a kernel-weighted projection parameter, which requires only mild smoothness conditions and still allows for double robustness. The key identity underlying their approach is

$$\mu(t) = E\left[ \left. \frac{Y - m(T, X)}{f_T(T \mid X)}\, \omega(T) + \mu(T) \,\right|\, T = t \right], \tag{3}$$

where $m(t, x) = E(Y \mid T = t, X = x)$ and $\omega(t) = E\{f_T(t \mid X)\}$. Under model (2), it is easily seen that the equality in (3) still holds when $m(t, x)$ and $f_T(t \mid X)$ are replaced with $m(t, B_0^\top x)$ and $f_T(t \mid B_0^\top X)$, respectively. Thus, an alternative estimator of $\mu(t)$ is

$$\hat\mu_{\mathrm{DR}}(t) = \frac{\sum_{i=1}^n \hat\xi_i\, K_\varsigma(T_i - t)}{\sum_{i=1}^n K_\varsigma(T_i - t)} \quad \text{with} \quad \hat\xi_i = \frac{Y_i - \hat m(T_i, \hat B^\top X_i)}{\hat f_T(T_i \mid \hat B^\top X_i)}\, \hat\omega(T_i) + \hat\mu(T_i),$$

where $\hat\omega(t) = n^{-1}\sum_{i=1}^n \hat f_T(t \mid \hat B^\top X_i)$. This estimator can be viewed as an adjusted version of $\hat\mu(t)$. However, $\hat f_T(t \mid \hat B^\top X_i)$ requires one more smoothing parameter than the proposed method. In our simulations, we found that $\hat\mu_{\mathrm{DR}}(t)$ performs similarly to $\hat\mu(t)$ in finite samples; a code sketch follows.
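For concreteness, here is a sketch of the adjustment in Remark 3, assuming the plug-in quantities have already been computed with the kernel estimators above; all argument names are ours.

```python
import numpy as np

def mu_dr(t_grid, Y, T, m_hat_T, f_hat_T, mu_hat_T, omega_hat_T, varsigma,
          kernel=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)):
    """Doubly robust estimator of Remark 3. The inputs, evaluated at the
    observed treatments, are assumed precomputed:
      m_hat_T[i]     ~ m_hat(T_i, B_hat^T X_i)
      f_hat_T[i]     ~ f_hat_T(T_i | B_hat^T X_i)
      mu_hat_T[i]    ~ mu_hat(T_i)
      omega_hat_T[i] ~ omega_hat(T_i) = n^{-1} sum_j f_hat_T(T_i | B_hat^T X_j)."""
    xi = (Y - m_hat_T) / f_hat_T * omega_hat_T + mu_hat_T       # pseudo-outcomes xi_i
    out = []
    for t in np.atleast_1d(t_grid):
        w = kernel((T - t) / varsigma) / varsigma               # K_varsigma(T_i - t)
        out.append(np.sum(w * xi) / np.sum(w))                  # local average of xi
    return np.asarray(out)
```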

Remark 4. One could also impose more general assumptions, as in [4, 17], to allow non-kernel-based estimators such as splines for estimating $m(t, x; B)$. However, in some cases the conditions become more stringent. For example, Galvao and Wang [4] require the estimate of $f_T(t \mid y, x)/f_T(t \mid x)$ to converge at the rate $n^{-1/4}$, which is difficult to satisfy for multiple covariates $X$ or even the lower-dimensional $B_0^\top X$. In addition, the conditions of [4, 17] do not directly lead to results for the mean squared error, which can only be derived case by case for different smoothers and will typically differ from our results. To provide a complete estimation procedure, including proper smoothing parameter selection based on the asymptotically optimal mean squared error, we focus on the kernel-based estimator and assume more specialized smoothness conditions for consistency and weak convergence.

2.2. Estimation of joint central subspace

Estimation of B0 does not follow immediately from model (2) because Y(t) is only partially observed. We need the following identification result using the observed data.

Proposition 2. Suppose that Assumptions 1–2 hold. Then, model (2) is equivalent to $(Y, T) \perp\!\!\!\perp X \mid B^\top X$.

Therefore, we can employ existing methodologies for sufficient dimension reduction of a bivariate outcome to estimate $B_0$. While different methods of sufficient dimension reduction have been proposed in the literature (see, e.g., [20, 28, 29]), we use a generalization of the estimator of Huang and Chiang [12] to the multivariate response $(Y, T)$ in all numerical examples, because it simultaneously estimates the structural dimension, the basis matrix, and a rate-optimal smoothing parameter under a single objective function, while avoiding certain linearity assumptions on the covariate distribution required by inverse regression [18].

Let

$$\hat F(y, t \mid B^\top x) = \frac{\sum_{j=1}^n \mathbf{1}(Y_j \le y, T_j \le t)\, \mathbf{K}_h\{B^\top(X_j - x)\}}{\sum_{j=1}^n \mathbf{K}_h\{B^\top(X_j - x)\}}$$

be a kernel smoothing estimator of $F(y, t \mid B^\top x) = \Pr(Y \le y, T \le t \mid B^\top X = B^\top x)$, where $\mathbf{1}$ denotes the indicator function, $h$ is another positive bandwidth, and $K$ is a $q$th-order kernel function. In practice, we suggest taking $q = 4 \vee 2\lfloor (d+6)/4 \rfloor$, where $\lfloor \cdot \rfloor$ is the floor function. Further, let $(Y_0, T_0, X_0)$ be a future observation independent of the current data $(Y_1, T_1, X_1), \dots, (Y_n, T_n, X_n)$ and define the prediction risk

$$E\left[ \int \{\mathbf{1}(Y_0 \le y, T_0 \le t) - \hat F(y, t \mid B^\top X_0)\}^2\, dF_{Y,T}(y, t) \right]. \tag{4}$$

A simple calculation shows that the prediction risk in (4) can be decomposed into

$$\sigma_0^2 + b_0^2(B) + \mathrm{MISE}_B(h) + C(B, h), \tag{5}$$

where

$$\begin{aligned} \sigma_0^2 &= E\left[ \int \{\mathbf{1}(Y_0 \le y, T_0 \le t) - F(y, t \mid B_0^\top X_0)\}^2\, dF_{Y,T}(y, t) \right],\\ b_0^2(B) &= E\left[ \int \{F(y, t \mid B_0^\top X_0) - F(y, t \mid B^\top X_0)\}^2\, dF_{Y,T}(y, t) \right],\\ \mathrm{MISE}_B(h) &= E\left[ \int \{F(y, t \mid B^\top X_0) - \hat F(y, t \mid B^\top X_0)\}^2\, dF_{Y,T}(y, t) \right], \quad \text{and}\\ C(B, h) &= 2 E\left[ \int \{F(y, t \mid B_0^\top X_0) - F(y, t \mid B^\top X_0)\}\{F(y, t \mid B^\top X_0) - \hat F(y, t \mid B^\top X_0)\}\, dF_{Y,T}(y, t) \right]. \end{aligned}$$

When $h \to 0$ and $nh^d \to \infty$, it is shown in the Appendix that the last two summands in (5) converge to zero and are dominated by the first two terms. Note that $b_0^2(B) \ge 0$, with equality when $\mathrm{span}(B)$ is a joint sufficient dimension reduction subspace. Thus, the minimum of the prediction risk must occur when $\mathrm{span}(B) \supseteq \mathrm{span}(B_0)$. Moreover, since our model has a nested structure, the prediction risk decreases as the working dimension increases, as long as the working dimension is less than the structural dimension. In contrast, once the working dimension $d$ is equal to, or larger than, the structural dimension $d_0$ and $\mathrm{span}(B) \supseteq \mathrm{span}(B_0)$, the prediction risk has an asymptotic order of $\sigma_0^2 + O_p\{n^{-2q/(2q+d)}\}$, which starts to increase in $d$. Therefore, for large enough $n$, the minimum of the prediction risk occurs at the joint central subspace. A formal result is given as follows.

Proposition 3. Under Assumptions 1–2 and model (2), the joint central subspace $B_0$ and the optimal bandwidth $h_0 = c_{d_0}\, n^{-1/(2q+d_0)}$ minimize the prediction risk in (4) as $h \to 0$, $nh^d \to \infty$, and $n \to \infty$, where the constant $c_{d_0}$ is given in the Appendix.

According to Proposition 3, we can estimate $B_0$ by minimizing a criterion that converges to the prediction risk (4) uniformly in probability. Such a criterion can be obtained by leave-one-out cross-validation. However, the minimization of such a criterion is complicated by the unknown dimension of $B$. Fortunately, since Proposition 3 shows that the prediction risk is asymptotically convex in $d$, we can implement the following forward selection algorithm, which asymptotically minimizes the prediction risk (4); a code sketch follows the three steps below.

Step 1. For d = 0, compute

$$\widetilde{\mathrm{CV}}(0) = \frac{1}{n}\sum_{i=1}^n \int \{\mathbf{1}(Y_i \le y, T_i \le t) - \hat F_{Y,T}(y, t)\}^2\, d\hat F_{Y,T}(y, t),$$

where $\hat F_{Y,T}(y, t)$ is the empirical distribution function of $(Y_1, T_1), \dots, (Y_n, T_n)$.

Step 2. For $d \ge 1$, let $(\hat B_d, \hat h_d)$ be the minimizer of

$$\mathrm{CV}(B, h) = \frac{1}{n}\sum_{i=1}^n \int \{\mathbf{1}(Y_i \le y, T_i \le t) - \hat F^{-i}(y, t \mid B^\top X_i)\}^2\, d\hat F_{Y,T}(y, t),$$

where $B$ is a $p \times d$ matrix and the superscript $-i$ indicates that the estimator is computed from the sample with the $i$th subject deleted. Then, compute $\widetilde{\mathrm{CV}}(d) = \mathrm{CV}(\hat B_d, \hat h_d)$.

Step 3. Repeat Step 2 with increasing $d$ until $d = \hat d$ satisfies $\widetilde{\mathrm{CV}}(\hat d + 1) > \widetilde{\mathrm{CV}}(\hat d)$; the proposed estimator is $(\hat B, \hat h) = (\hat B_{\hat d}, \hat h_{\hat d})$.
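The following Python sketch mirrors Steps 1–3 under simplifying assumptions: a second-order Gaussian product kernel instead of the $q$th-order kernel of Remark 5, a generic optimizer over the raw entries of $B$ (so the minimizer is only identified up to a change of basis of $\mathrm{span}(B)$) instead of a proper Grassmann parameterization, and a rough rate-based starting bandwidth. All function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def loo_cv(params, Y, T, X, d, kernel):
    """Leave-one-out CV(B, h) of Step 2, with the (y, t)-integral taken
    against the empirical distribution of (Y, T)."""
    n, p = X.shape
    B = params[:p * d].reshape(p, d)
    h = np.exp(params[-1])
    Z = X @ B
    # ind[j, k] = 1(Y_j <= Y_k, T_j <= T_k), the indicator at (y, t) = (Y_k, T_k)
    ind = ((Y[:, None] <= Y[None, :]) & (T[:, None] <= T[None, :])).astype(float)
    cv = 0.0
    for i in range(n):
        w = np.prod(kernel((Z - Z[i]) / h) / h, axis=1)
        w[i] = 0.0                              # delete the i-th subject
        F_i = (ind.T @ w) / w.sum()             # F_hat^{-i}(Y_k, T_k | B^T X_i)
        cv += np.mean((ind[i] - F_i) ** 2)
    return cv / n

def forward_select(Y, T, X, kernel, max_d=None):
    """Steps 1-3: increase the working dimension until CV stops decreasing."""
    n, p = X.shape
    ind = ((Y[:, None] <= Y[None, :]) & (T[:, None] <= T[None, :])).astype(float)
    best_cv = np.mean((ind - ind.mean(axis=0)[None, :]) ** 2)   # Step 1: CV~(0)
    best = (None, None, 0)
    rng = np.random.default_rng(0)
    for d in range(1, (max_d or p) + 1):
        x0 = np.append(0.1 * rng.normal(size=p * d), np.log(n ** (-1.0 / (8 + d))))
        res = minimize(loo_cv, x0, args=(Y, T, X, d, kernel), method="Nelder-Mead")
        if res.fun >= best_cv:
            break                               # Step 3: CV~(d) increased; stop
        best_cv = res.fun
        best = (res.x[:p * d].reshape(p, d), np.exp(res.x[-1]), d)
    return best                                 # (B_hat, h_hat, d_hat)
```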

Based on the notation and conditions in the Appendix, we can prove the following asymptotic properties of $(\hat B, \hat h)$.

Theorem 2. Suppose that Conditions B1–B5 in the Appendix are satisfied. Then $\Pr(\hat d = d_0) \to 1$, $\hat h = O_p\{n^{-1/(2q+d_0)}\}$, and

$$n^{1/2}\,\mathrm{vec}(\hat B - B_0)\,\mathbf{1}(\hat d = d_0) \rightsquigarrow \mathcal{N}_{pd_0}\big[0,\ V^{-1}(B_0)\, E\{S^{\otimes 2}(B_0)\}\, V^{-1}(B_0)\big]$$

as $n \to \infty$, where $S(B)$ and $V(B)$ are defined in the Appendix.

Without fixing $d_0$ in advance, we allow model (2) to be flexible and to contain the fully nonparametric model [11]. Another important consideration is that this estimator of the central subspace is $n^{1/2}$-consistent under mild assumptions [12], which is fast enough to preserve the large-sample properties of $\hat\mu(t)$ as if the true joint central subspace were known. This property would not hold if we used a nonparametric estimator of the generalized propensity score in place of $\hat B^\top x$, because the convergence rates of nonparametric density estimators are slower than $n^{1/2}$.

Remark 5. From the proof of Theorem 2, we can show that $\hat h_d = O_p\{n^{-1/(2q+d)}\}$ for fixed $d \in \{0, \dots, p\}$. Coupled with Condition B3, the order of the kernel function should satisfy $q > 2 \vee (d+2)/2$. Since we always use symmetric kernel functions, whose orders are even, and prefer the order to be as small as possible, a simple choice is $q = 4 \vee 2\lfloor (d+6)/4 \rfloor$ for each working dimension $d$.

Remark 6. For binary treatments, Huang and Chan [11] used a conditional distribution version of Proposition 2 for identification, and the estimation of B0 is performed through a weighted average of objective functions based on stratified subsamples by treatment statuses. For continuous treatments, such a strategy would require an additional smoothing parameter to borrow information locally for each value of T. Estimation based on Proposition 2 is more straightforward for continuous treatments and is therefore employed.

Remark 7. For regression with a univariate outcome, the single-index model is another popular dimension reduction approach. In fact, it can be viewed as a special case of sufficient dimension reduction with $d_0 = 1$. Using a single-index model for multivariate outcomes, however, is prone to misspecification if different combinations of covariate values are related to each outcome. Conversely, even if a single-index model is indeed true for a multivariate outcome, efficiency is lost and nearly collinear indices are obtained when marginal estimation is performed separately for each outcome. Since the joint sufficient dimension reduction model is defined through multiple outcomes, we do not fix the structural dimension a priori, and the method of [12] is employed to simultaneously estimate the structural dimension and the basis matrix.

2.3. Asymptotic mean squared error

Following [24], we can write $\hat\mu(t) - \mu(t) = A_n + B_n + C_n + O_p(\|\hat B - B_0\|)$, where

$$\begin{aligned} A_n &= \frac{1}{n}\sum_{i=1}^n \frac{Y_i - m(t, X_i; B_0)}{f_T(t \mid X_i)}\, K_{h_1}(T_i - t),\\ B_n &= \frac{1}{n}\sum_{i=1}^n E\left\{ \hat m(t, X_i; B_0) - \mu(t) - \frac{Y_i - m(t, X_i; B_0)}{f_T(t \mid X_i)}\, K_{h_1}(T_i - t) \,\middle|\, \mathcal{D}_i \right\},\\ C_n &= \frac{1}{n}\sum_{i=1}^n \left[ \hat m(t, X_i; B_0) - \mu(t) - \frac{Y_i - m(t, X_i; B_0)}{f_T(t \mid X_i)}\, K_{h_1}(T_i - t) - E\left\{ \hat m(t, X_i; B_0) - \mu(t) - \frac{Y_i - m(t, X_i; B_0)}{f_T(t \mid X_i)}\, K_{h_1}(T_i - t) \,\middle|\, \mathcal{D}_i \right\} \right], \end{aligned}$$

and $\mathcal{D}_i = \{(Y_j, T_j, X_j) : j \ne i\}$. From standard derivations for kernel smoothing estimators, we have $A_n = O_p\{h_1^r + (nh_1)^{-1/2}\}$. Similar to the argument for Theorem 2.1 of [2], we can further show that $B_n = O_p(h_2^r)$ and $C_n = O_p\{(n^2 h_1 h_2^{d_0})^{-1/2}\}$, which are both dominated by $A_n$. Thus, the mean squared error $E[\{\hat\mu(t) - \mu(t)\}^2]$ is asymptotically equivalent to that of $A_n$, which equals $h_1^{2r} B^2(t) + V(t)/(nh_1)$ asymptotically, and the optimal bandwidth is

$$h_{1,\mathrm{opt}} = \left[ \frac{V(t)}{2r B^2(t)} \right]^{1/(2r+1)} n^{-1/(2r+1)}.$$
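For completeness, this bandwidth follows from minimizing the asymptotic mean squared error $h_1^{2r} B^2(t) + V(t)/(nh_1)$ in $h_1$:

$$\frac{d}{dh_1}\left\{ h_1^{2r} B^2(t) + \frac{V(t)}{n h_1} \right\} = 2r\, h_1^{2r-1} B^2(t) - \frac{V(t)}{n h_1^2} = 0 \quad\Longleftrightarrow\quad h_1^{2r+1} = \frac{V(t)}{2r B^2(t)\, n}.$$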

Moreover, the remainder terms involving $h_2$ satisfy $E(B_n) = O(h_2^r)$ and $\operatorname{var}(C_n) = O\{(n^2 h_1 h_2^{d_0})^{-1}\}$. Thus, when $h_1 = O\{n^{-1/(2r+1)}\}$ and $h_2 = O[n^{-(4r+1)/\{(2r+1)(2r+d_0)\}}]$, the mean squared error attains the optimal rate $O\{n^{-2r/(2r+1)}\}$ asymptotically.

According to these theoretical results, when we take the bandwidths at the optimal rates $h_1 = O\{n^{-1/(2r+1)}\}$ and $h_2 = O[n^{-(4r+1)/\{(2r+1)(2r+d_0)\}}]$, the order of the kernel function should be taken as $r > \{3d_0 + (9d_0^2 + 8d_0)^{1/2}\}/4$ to ensure Condition A3 and the optimal rate of the mean squared error. However, if we only require the mean squared error to converge to zero at the optimal rate, $r > (d-1)/2$ for each working dimension $d$ suffices. Thus, we suggest taking $r = 2\lceil d/4 \rceil$ in practice, where $\lceil \cdot \rceil$ is the ceiling function; an illustrative higher-order kernel is given below.
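Since $r = 2\lceil d/4 \rceil$ equals $2$ for $d \le 4$ but $4$ for $5 \le d \le 8$, higher-order kernels may be required. As one concrete construction, standard but not specifically prescribed by the paper, a fourth-order kernel can be obtained from the Gaussian density:

```python
import numpy as np

def kernel_order4(u):
    """Fourth-order Gaussian-based kernel K(u) = (3 - u^2) phi(u) / 2:
    it integrates to 1, its first three moments vanish, and its fourth
    moment is nonzero, so r = 4 in the notation of Section 2.1."""
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return 0.5 * (3.0 - u**2) * phi
```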

2.4. Bandwidth selection

To select the bandwidths $(h_1, h_2)$, one could minimize the estimated mean squared error of $\hat\mu(t)$. Here we follow the idea of [1] and use an infinitesimal version of the generalized jackknife statistic to estimate the variance and bias of $\hat\mu(t)$. Set $(h_1, h_2) = (h_{1,0}\, n^{-1/(2r+1)},\ h_{2,0}\, n^{-(4r+1)/\{(2r+1)(2r+d_0)\}})$, as suggested by the optimal rates in Section 2.3; the goal then reduces to selecting the optimal $(h_{1,0}, h_{2,0})$. Since the bias of $\hat\mu(t)$ is now of order $O\{n^{-2r/(2r+1)}\}$, Schucany et al. [25] suggested using the generalized jackknife pseudovalues defined, for all $i \in \{1, \dots, n\}$, by

$$g^{(i)}(t) = \frac{n^{2r/(2r+1)}\, \hat\mu(t) - (n-1)^{2r/(2r+1)}\, \hat\mu^{-i}(t)}{n^{2r/(2r+1)} - (n-1)^{2r/(2r+1)}},$$

to form the generalized jackknife estimate $g^{(\cdot)}(t) = n^{-1}\sum_{i=1}^n g^{(i)}(t)$. The bias can be estimated by $\hat\mu(t) - g^{(\cdot)}(t)$ and the variance estimator is given by

$$\frac{n-1}{n}\sum_{i=1}^n \{\hat\mu^{-i}(t) - \bar\mu(t)\}^2 = \left\{ \frac{R_n}{(1 - R_n)(n-1)} \right\}^2 \times \frac{1}{n(n-1)}\sum_{i=1}^n \{g^{(i)}(t) - g^{(\cdot)}(t)\}^2,$$

where $R_n = \{(n-1)/n\}^{2r/(2r+1)}$ and $\bar\mu(t) = n^{-1}\sum_{i=1}^n \hat\mu^{-i}(t)$. Related discussion can be found in [5, 26]. Although these estimators have closed forms which can be used in practice, it is time-consuming to obtain $\hat\mu^{-i}(t)$ for each $i \in \{1, \dots, n\}$. To reduce the computational burden, we further apply the infinitesimal jackknife of Jaeckel [15] to the generalized jackknife estimates. Following his derivation, we obtain

$$\widehat{\operatorname{var}}\{\hat\mu(t)\} = \frac{1}{n^2}\sum_{i=1}^n \hat D_i^2(t), \qquad \widehat{\operatorname{bias}}\{\hat\mu(t)\} = \frac{R_n}{(1 - R_n)(n-1)} \times \frac{1}{2n^2}\sum_{i=1}^n \hat D_{ii}(t),$$

where

$$\begin{aligned} \hat D_i(t) &= \hat m(t, X_i; \hat B) - \hat\mu(t) + \sum_{k=1}^n \frac{\{Y_i - \hat m(t, X_k; \hat B)\}\, K_{h_1}(T_i - t)\, \mathbf{K}_{h_2}\{\hat B^\top(X_i - X_k)\}}{\sum_{j=1}^n K_{h_1}(T_j - t)\, \mathbf{K}_{h_2}\{\hat B^\top(X_j - X_k)\}},\\ \hat D_{ii}(t) &= \frac{2n\{Y_i - \hat m(t, X_i; \hat B)\}\, K_{h_1}(T_i - t)\, \mathbf{K}_{h_2}(0)}{\sum_{j=1}^n K_{h_1}(T_j - t)\, \mathbf{K}_{h_2}\{\hat B^\top(X_j - X_i)\}} - 2n\sum_{k=1}^n \frac{\{Y_i - \hat m(t, X_k; \hat B)\}^2\, K_{h_1}^2(T_i - t)\, \mathbf{K}_{h_2}^2\{\hat B^\top(X_i - X_k)\}}{\left[ \sum_{j=1}^n K_{h_1}(T_j - t)\, \mathbf{K}_{h_2}\{\hat B^\top(X_j - X_k)\} \right]^2} - 2\hat D_i(t). \end{aligned}$$

Thus, $E[\{\hat\mu(t) - \mu(t)\}^2]$ can be estimated by $\widehat{\operatorname{var}}\{\hat\mu(t)\} + \widehat{\operatorname{bias}}^2\{\hat\mu(t)\}$, and $(\hat h_{1,0}, \hat h_{2,0})$ is taken to be the minimizer of

$$\int \left[ \widehat{\operatorname{var}}\{\hat\mu(t)\} + \widehat{\operatorname{bias}}^2\{\hat\mu(t)\} \right] d\hat F_T(t),$$

where $\hat F_T(t)$ is the empirical distribution of $T$.
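A sketch of the variance piece of this selection rule, assuming $\hat B$, $(h_1, h_2)$, and a kernel are given; $\hat D_{ii}(t)$ and the bias estimate follow the same pattern, and $(\hat h_{1,0}, \hat h_{2,0})$ can then be chosen by a grid search over the integrated criterion. All names are ours.

```python
import numpy as np

def ij_var_mu(t, Y, T, X, B, h1, h2, kernel):
    """Infinitesimal-jackknife variance estimate n^{-2} sum_i D_i(t)^2
    for mu_hat(t), with D_i(t) as displayed above."""
    n = len(Y)
    Z = X @ B
    wT = kernel((T - t) / h1) / h1
    # W[i, k] = K_{h1}(T_i - t) * K_{h2}{B^T(X_i - X_k)}
    W = wT[:, None] * np.prod(kernel((Z[:, None, :] - Z[None, :, :]) / h2) / h2, axis=2)
    col = W.sum(axis=0)                            # denominator sum over j, for each k
    m_hat = (W * Y[:, None]).sum(axis=0) / col     # m_hat(t, X_k; B), k = 1..n
    mu = m_hat.mean()                              # mu_hat(t)
    # D_i(t) = m_hat(t, X_i) - mu_hat(t) + sum_k (Y_i - m_hat(t, X_k)) W[i,k] / col[k]
    D = m_hat - mu + ((Y[:, None] - m_hat[None, :]) * W / col[None, :]).sum(axis=1)
    return np.sum(D**2) / n**2
```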

3. Simulation studies

In this section, we use Monte Carlo simulations to demonstrate the finite-sample performance of our proposal. Specifically, we simulate data from the following models conditional on $X = (X_1, \dots, X_{10})^\top \sim \mathcal{N}_{10}(0, I_{10})$:

  • M1: $T = X_1 + \varepsilon_1$ with $\varepsilon_1 \sim \mathcal{N}(0, 1)$, $Y(t) = t + X_1 + X_2 + \varepsilon_2$ with $\varepsilon_2 \sim \mathcal{N}(0, 0.1)$,

  • M2: $T = \sum_{k=1}^{10} X_k + \varepsilon_1$ with $\varepsilon_1 \sim \mathcal{N}(0, 1)$, $Y(t) = \sum_{k=1}^{10} X_k + \varepsilon_2$ with $\varepsilon_2 \sim \mathcal{N}(0, 0.1)$,

  • M3: $T \mid X = x \sim 20\, \mathrm{Beta}\{\lambda(x), 20 - \lambda(x)\}$ with $\lambda(x) = 20\, \mathrm{expit}(0.5x_1 + 0.5x_3 - 0.5x_4 - 0.5x_5)$, $Y(t) = (X_2 + X_3 - X_4 - X_5)^2/4 - (t/4 - 5/2)^2 + \varepsilon_1$ with $\varepsilon_1 \sim \mathcal{N}(0, 1/8^2)$,

  • M4: $T = X_1 X_2 + \varepsilon_1$ with $\varepsilon_1 \sim \mathcal{N}(0, 1)$, $Y(t) = X_1 X_2 + \varepsilon_2$ with $\varepsilon_2 \sim \mathcal{N}(0, 0.1)$,

where $\varepsilon_1$, $\varepsilon_2$, and $X$ are mutually independent. For M1, one can directly see that the basis of the joint central subspace is $B_0 = (e_1, e_2)$, where $\{e_1, \dots, e_{10}\}$ is the standard basis of $\mathbb{R}^{10}$, and the true treatment effect is $\mu(t) = t$. The joint central subspace for M2 is $\mathrm{span}\{(1, \dots, 1)^\top\}$ and the true treatment effect is the zero function. The basis of the joint central subspace for M3 is $(e_1 + e_3 - e_4 - e_5,\ e_2 + e_3 - e_4 - e_5)$ with structural dimension 2, and the true treatment effect curve is $\mu(t) = 1 - (t/4 - 5/2)^2$. The basis of the joint central subspace for M4 is $(e_1, e_2)$ and the true treatment effect is also the zero function.
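As a quick check of M1 under the reconstruction above, with the true $B_0$ plugged in and ad hoc bandwidths (a sketch reusing mu_hat from the Section 2.1 sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 10
X = rng.standard_normal((n, p))                        # X ~ N_10(0, I_10)
T = X[:, 0] + rng.standard_normal(n)                   # M1 treatment
Y = T + X[:, 0] + X[:, 1] + np.sqrt(0.1) * rng.standard_normal(n)  # M1 outcome

B0 = np.zeros((p, 2)); B0[0, 0] = B0[1, 1] = 1.0       # true basis (e_1, e_2)
for t in (-1.0, 0.0, 1.0):
    print(t, mu_hat(t, Y, T, X, B0, h1=0.4, h2=0.4))   # should be close to mu(t) = t
```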

The performance of the estimated joint central subspace is evaluated through the error measure $\|\hat B(\hat B^\top \hat B)^{-1}\hat B^\top - B_0(B_0^\top B_0)^{-1}B_0^\top\|$. To evaluate the estimated treatment effects, we use the mean integrated squared error

$$E\left[ \int \{\hat\mu(t) - \mu(t)\}^2\, d\hat F_T(t) \right]$$

as an accuracy measure. All the results are based on 1000 simulations with sample size n ∈ {100, 200, 400}.
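The subspace error measure above is a distance between the projection matrices onto the two subspaces; a short helper, reading the unspecified norm as the Frobenius norm (an assumption on our part):

```python
import numpy as np

def proj_error(B_hat, B0):
    """|| P_{B_hat} - P_{B_0} ||_F for the projections onto span(B_hat) and span(B_0)."""
    P  = B_hat @ np.linalg.inv(B_hat.T @ B_hat) @ B_hat.T
    P0 = B0 @ np.linalg.inv(B0.T @ B0) @ B0.T
    return np.linalg.norm(P - P0)
```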

To investigate the practicality of the proposed joint central subspace for the estimation of continuous treatment effects, we consider four estimators: $\hat\mu(t)$, $\hat\mu_{\mathrm{DR}}(t)$ and $\hat\mu_{\mathrm{SE}}(t)$ as discussed in Remarks 2–3, and the marginal regression plug-in estimator based on the generalized propensity score

$$\hat\mu_{\mathrm{GPS}}(t) = \frac{1}{n}\sum_{i=1}^n \frac{\sum_{j=1}^n Y_j\, K_{h_1}(T_j - t)\, K_{h_3}\{\hat f_T(t \mid X_j) - \hat f_T(t \mid X_i)\}}{\sum_{j=1}^n K_{h_1}(T_j - t)\, K_{h_3}\{\hat f_T(t \mid X_j) - \hat f_T(t \mid X_i)\}}.$$

We compare them with their counterparts in which the proposed balancing score $\hat B^\top X$ is replaced by the original high-dimensional $X$. The simulation results are displayed in Tables 1 and 2. From Table 1, we can see that the proportion of correctly selected structural dimensions under our proposed cross-validation criterion approaches 1 as the sample size increases, while the error measure tends to zero. As for the estimators of the treatment effect function, we can see from Table 2 that the estimators based on joint sufficient dimension reduction always have smaller mean integrated squared errors, which confirms the advantage of our proposal.

Table 1:

The performance of the estimated joint central subspace.

Model   n      d̂=0     d̂=1     d̂=2     d̂=3     d̂=4     d̂≥5     Error mean   Error S.D.
M1 100 0.000 0.073 0.712 0.196 0.019 0.000 0.823 0.5701
200 0.000 0.017 0.874 0.102 0.007 0.000 0.448 0.4082
400 0.000 0.000 0.956 0.044 0.000 0.000 0.212 0.2376
M2 100 0.000 0.935 0.065 0.000 0.000 0.000 0.070 0.2471
200 0.000 0.966 0.034 0.000 0.000 0.000 0.036 0.1815
400 0.000 0.991 0.009 0.000 0.000 0.000 0.010 0.0946
M3 100 0.000 0.213 0.738 0.046 0.003 0.000 0.492 0.4543
200 0.000 0.030 0.938 0.032 0.000 0.000 0.206 0.2593
400 0.000 0.000 0.979 0.021 0.000 0.000 0.100 0.1587
M4 100 0.017 0.003 0.956 0.024 0.000 0.000 0.205 0.3056
200 0.002 0.000 0.996 0.002 0.000 0.000 0.064 0.1023
400 0.001 0.000 0.999 0.000 0.000 0.000 0.027 0.0634

Table 2:

The error measures of the estimated treatment effects.

Estimator:          μ̂(t)            μ̂GPS(t)         μ̂DR(t)          μ̂SE(t)
Model   n      B̂⊤X      X      B̂⊤X      X      B̂⊤X      X      B̂⊤X      X
M1 100 0.739 0.764 0.806 0.816 0.761 0.805 0.911 0.972
200 0.626 0.677 0.753 0.763 0.634 0.703 0.775 0.890
400 0.516 0.579 0.671 0.689 0.508 0.590 0.590 0.758

M2 100 0.643 5.601 8.445 8.641 0.608 5.411 6.653 7.837
200 0.564 3.837 7.765 8.021 0.539 3.725 6.546 7.790
400 0.528 1.929 6.943 7.311 0.510 1.884 6.475 7.960

M3 100 0.515 0.619 0.623 0.623 0.499 0.617 0.484 0.643
200 0.393 0.516 0.519 0.519 0.355 0.514 0.323 0.533
400 0.367 0.458 0.461 0.461 0.317 0.456 0.230 0.469

M4 100 0.080 0.236 0.467 0.475 0.065 0.205 0.252 0.301
200 0.051 0.204 0.419 0.445 0.039 0.177 0.210 0.270
400 0.035 0.170 0.333 0.373 0.027 0.152 0.212 0.272

The estimator $\hat\mu_{\mathrm{GPS}}$ performed poorly because it is an empirical version of $E[E\{Y \mid T = t, f_T(t \mid X)\}]$, but $f_T(t \mid X)$ cannot be estimated at the $n^{1/2}$-rate. The merit of the proposed estimator, as we have shown theoretically, is that $B_0$ can be estimated at the $n^{1/2}$-rate, and the effect of the estimated central subspace on the final estimator is asymptotically negligible because $\mu(t)$ is estimated at a rate slower than $n^{1/2}$. The estimator $\hat\mu_{\mathrm{SE}}$ did not perform well for the reason discussed in Remark 2, namely that an estimator of the joint density $f_{T,Y,B^\top X}(t, y, B^\top x)$ is needed, which typically has a slower convergence rate. The poor performance of these two estimators is most apparent for model M2, which corresponds to a situation of strong confounding. In this scenario, the true treatment effect is $\mu(t) = 0$ but the observed $(Y, T)$ are highly correlated. The estimators $\hat\mu_{\mathrm{GPS}}$ and $\hat\mu_{\mathrm{SE}}$ can perform very poorly for large $t$, while the proposed estimator still performs well because $E\{Y(t) \mid \sum_{k=1}^{10} X_k\}$ can be estimated accurately even for large $t$.

4. Application

We illustrate the proposed method by analyzing the food patterns dataset of [27] to investigate the effect of obesity on systolic blood pressure. In the literature, researchers have found a positive relationship between systolic blood pressure and body mass index [3, 16]. However, the effect can be confounded by diet. The dataset contains 1073 Nepalese adults aged 18 years or older who participated in the baseline survey of the Dhulikhel Heart Study, which collected responses to Food Frequency Questionnaires to identify the food patterns of the participants. The diet variables include whole grains, refined grains, lentils, vegetable oils, solid fats, fatty foods, vegetables, fruits, potatoes, nuts, poultry, red meat, fish, milk, milk products, fast food, processed cereals, noodles, salty snacks, soda drinks, tea/coffee, sweets, and alcohol. Other variables include age and sex. All the covariates are centered and standardized in this analysis.

The estimated joint central subspace is one-dimensional, and the corresponding index coefficients with standard errors are listed in Table 3. Figure 1 displays the estimated treatment effects with 95% pointwise confidence intervals using the proposed method, together with point estimates based on an unadjusted analysis and the estimators of [4, 17]. Compared to the unadjusted estimates, the estimated average systolic blood pressure is smaller for overweight individuals when age, sex and diet are adjusted for. Although the proposed estimator is similar to the estimators of [4, 17], they have different sensitivities to small changes in the bandwidth parameter $h_2$ that smooths over the covariates. We perturbed the bandwidth by multiplicative factors and computed the $L_p$-distance between the perturbed functional estimators and the original ones. The results in Table 4 indicate that the dimension reduction approach is more stable than the other nonparametric estimators.

Table 3:

Estimated coefficients of the linear index for the food patterns dataset.

Covariate Coefficient S.E. Covariate Coefficient S.E.
Age 1 Red meat 0.1059 0.02233
Sex 0.1958 0.02458 Fish 0.0835 0.01255
Whole grains 0.0170 0.02361 Milk −0.0587 0.01713
Refined grains 0.2736 0.02433 Milk products −0.0365 0.03479
Lentils 0.3886 0.02753 Fast food −0.0829 0.02253
Vegetable oils −0.0456 0.01799 Processed cereals 0.3975 0.02415
Solid fats −0.1212 0.01038 Noodles −0.1273 0.01906
Fatty foods −0.0646 0.01823 Salty snacks −0.0285 0.01709
Vegetables −0.1257 0.02912 Soda drinks 0.0408 0.02827
Fruits 0.1533 0.02786 Tea/coffee −0.1042 0.01775
Potatoes −0.1912 0.02059 Sweets −0.0971 0.02202
Nuts 0.2208 0.01808 Alcohol 0.2682 0.01339
Poultry 0.2460 0.01745

Figure 1:

Estimated treatment effect functions. The bold solid line represents the proposed estimator and the thin solid lines represent 95% pointwise confidence intervals; the dotted line represents the naive estimator without adjusting for covariates, the dashed line the Kennedy et al. estimator, and the dot-dashed line the Galvao–Wang estimator.

Table 4:

$L_p$-distance between the estimators with perturbed bandwidth parameters and the original estimators.

Bandwidth:               0.9×ĥ₂                     1.1×ĥ₂
Method           p=0     p=1     p=2        p=0     p=1     p=2
Proposed 0.392 0.099 0.016 0.280 0.070 0.011
Galvao–Wang 1.707 0.454 0.303 2.179 0.564 0.466
Kennedy et al. 1.444 0.856 0.786 0.845 0.533 0.297
Bandwidth:               0.5×ĥ₂                     1.5×ĥ₂
Method           p=0     p=1     p=2        p=0     p=1     p=2
Proposed 4.427 1.384 3.476 1.130 0.390 0.210
Galvao–Wang 1.294 0.364 0.262 3.040 0.743 0.906
Kennedy et al. 31.539 18.743 372.69 3.526 1.690 3.313

5. Discussion

Nonparametric estimation of treatment effect functions has received considerable attention recently because it does not require an a priori assumption on the shape of the function, which may be hard to justify. Two advanced approaches were recently proposed. Kennedy et al. [17] developed a doubly robust estimator which requires two sets of modeling assumptions and is valid when one of the two is correct. Galvao and Wang [4] proposed a fully nonparametric estimator that may suffer from the curse of dimensionality. The pros and cons of the two approaches are discussed extensively in the literature and they are favored by different groups of statisticians. Doubly robust methods are prevalent in the biostatistics literature, whereas the fully nonparametric approach is popular in econometrics. The proposed sufficient dimension reduction approach is a middle ground between the two and preserves the merits of both. On one hand, the sufficient dimension reduction model contains the fully nonparametric model and therefore cannot be misspecified, making it more robust than the doubly robust estimators. On the other hand, when the underlying data generating mechanism has a lower-dimensional structure, the proposed method can greatly reduce the dimensionality and hence attain a better convergence rate than fully nonparametric methods. The complexity is controlled by the effective dimension of the central subspace, which can be estimated consistently by the proposed method. Therefore, our approach is both flexible and parsimonious.

The asymptotic result in Theorem 1 holds for any $n^{1/2}$-consistent estimator $\hat B$ of the central subspace. Thus, one may use other existing sufficient dimension reduction approaches in lieu of the method given in Section 2.2. We use the proposed method because the effective dimension and the central subspace are estimated by a single criterion, which results in a more stable estimator in finite samples, and the asymptotic result can be proven for an estimated dimension. Other methods require separate criteria to estimate the dimension and the basis matrix, and their theoretical properties are often investigated separately and may not be easily combined. Comparisons between different sufficient dimension reduction techniques are reported elsewhere; see, e.g., [12]. Our primary interest in this paper is the treatment effect function, and sufficient dimension reduction is a means to an end rather than an end in itself.

For a binary treatment, Luo et al. [19] proposed separate dimension reduction models for $E(Y_0 \mid X)$ and $E(Y_1 \mid X)$, viz.

$$E(Y_0 \mid X) \perp\!\!\!\perp X \mid B_{M,0}^\top X, \qquad E(Y_1 \mid X) \perp\!\!\!\perp X \mid B_{M,1}^\top X.$$

Under a positivity condition on the original covariates and a measurability condition on the conditional variance of potential outcomes with respect to the reduced covariates, these authors showed that their estimator can be super-efficient. While this modeling approach may be extended to continuous treatments by assuming $E\{Y(t) \mid X\} \perp\!\!\!\perp X \mid B_{M,t}^\top X$ for each $t$, $B_{M,t}$ is typically not estimable at the $n^{1/2}$-rate from observational data, since for most values of $t$ there is only one or even no observed outcome. Also, since all the existing estimators of continuous treatment effect functions are irregular, we cannot compare asymptotic variances directly, and there is no guarantee that alternative methods lead to better efficiency. In fact, the limiting distributions also involve a bias part, whose rate relates to the smoothness of the underlying distribution functions rather than only to the dimension of the balancing score. In our proposal, we focus on using $B^\top X$ to preserve all the information in the observed data, i.e., the conditional distribution of $(Y, T)$ given $X$, where $B$ is estimable at the $n^{1/2}$-rate. Comparisons of the resulting asymptotic mean squared errors between our proposed model and other dimension reduction models require future investigation.

It should be noted that sufficient dimension reduction is not the only remedy for the curse of dimensionality. A natural limitation is that it reduces the dimension via linear indices, which might not be appropriate in some applications. When the underlying structural covariates live on a general manifold instead of a hyperplane, a dimension reduction method based on non-linear functionals may be needed. In contrast, when irrelevant covariates are involved, an additional variable selection step can lead to better performance by reducing the number of coefficients to be estimated [6]. Such complications are common in practice and need further investigation.

In this work we consider the case where the conditional independence assumption holds, which requires the collection of all confounders. However, in many applications researchers encounter non-ignorable selection of exposure, in which there are unobserved confounding variables. In such cases, instrumental variables are often used to identify certain partial treatment effects under different sets of assumptions. The structure of the sufficient dimension reduction model and the corresponding estimation would then be quite different from those in the present paper.

Acknowledgment

The authors thank the Editor-in-Chief, an Associate Editor and three reviewers for constructive comments. Both authors are partially supported by the United States National Institutes of Health grant R01HL122212, and the second author is partially supported by the United States National Science Foundation grant DMS1711952. We also thank the Kathmandu University Hospital and Drs. Rajendra Koju, Biraj Karmacharya and Archana Shrestha for providing the Dhulikhel Heart Study data.

Appendix

A.1. Proof of Proposition 1

Proof. For each fixed tT and any measurable functions f1, f2, we have

$$\begin{aligned} E[f_1\{Y(t)\} f_2(T) \mid B_0^\top X] &= E\big( E[f_1\{Y(t)\} f_2(T) \mid X] \mid B_0^\top X \big)\\ &= E\big( E[f_1\{Y(t)\} \mid X]\, E\{f_2(T) \mid X\} \mid B_0^\top X \big)\\ &= E\big( E[f_1\{Y(t)\} \mid B_0^\top X]\, E\{f_2(T) \mid B_0^\top X\} \mid B_0^\top X \big)\\ &= E[f_1\{Y(t)\} \mid B_0^\top X]\, E\{f_2(T) \mid B_0^\top X\}. \end{aligned}$$

Thus, we have $Y(t) \perp\!\!\!\perp T \mid B_0^\top X$ for all $t \in \mathcal{T}$. ◻

A.2. Proof of Proposition 2

Proof. Let $\mathcal{S}_{(Y,T) \mid X}$ denote the central subspace for the regression of $(Y, T)$ on $X$. First note that the density function can be decomposed into $f_{Y,T}(y, t \mid x) = f_{Y(t),T}(y, t \mid x) = f_{Y(t)}(y \mid x)\, f_T(t \mid x)$. Thus, we have $\mathcal{S}_{(Y,T) \mid X} \subseteq \mathrm{span}(B_0)$. Conversely, from the facts

$$f_T(t \mid x) = \int f_{Y,T}(y, t \mid x)\, dy \quad \text{and} \quad f_{Y(t)}(y \mid x) = \frac{f_{Y,T}(y, t \mid x)}{\int f_{Y,T}(y, t \mid x)\, dy},$$

one can easily see that $\mathrm{span}(B_0) \subseteq \mathcal{S}_{(Y,T) \mid X}$. ◻

A.3. Proof of Proposition 3

Proof. By paralleling the proof steps of [7, 8], one can show that $\mathrm{MISE}_B(h) = \mathrm{AMISE}_B(h)\{1 + o_p(1)\}$, where

$$\begin{aligned} \mathrm{AMISE}_B(h) &= \int\!\!\int \left[ h^{2q} B^2(y, t \mid B^\top x) + (nh^d)^{-1} V(y, t \mid B^\top x) \right] dF_X(x)\, dF_{Y,T}(y, t),\\ B(y, t \mid B^\top x) &= \frac{\mu_{q,K}}{q!} \times \frac{D^q_{B^\top x}\{F(y, t \mid B^\top x)\, f_{B^\top X}(B^\top x)\} - F(y, t \mid B^\top x)\, D^q_{B^\top x} f_{B^\top X}(B^\top x)}{f_{B^\top X}(B^\top x)}\, (1_d, \dots, 1_d),\\ V(y, t \mid B^\top x) &= \sigma_K^{2d}\, \frac{F(y, t \mid B^\top x)\{1 - F(y, t \mid B^\top x)\}}{f_{B^\top X}(B^\top x)}, \end{aligned}$$

$F_X(x)$ is the marginal distribution function of $X$, $\mu_{q,K} = \int v^q K(v)\, dv$, $\sigma_K^2 = \int K^2(v)\, dv$, $D^q$ denotes the $q$th-order differentiation in tensor form, and $1_d = (1, \dots, 1)^\top$ has dimension $d$. Thus, when $h \to 0$ and $nh^d \to \infty$, the last two terms of (5) converge to zero and are dominated by the first two terms. Since $b_0^2(B) \ge 0$, with equality when $\mathrm{span}(B)$ is a joint sufficient dimension reduction subspace, the minimum of the prediction risk occurs when $\mathrm{span}(B) \supseteq \mathrm{span}(B_0)$ for large enough $n$. Finally, when $h = c_d\, n^{-1/(2q+d)}$ with

$$c_d = \left[ \frac{d \int\!\!\int V(y, t \mid B^\top x)\, dF_X(x)\, dF_{Y,T}(y, t)}{2q \int\!\!\int B^2(y, t \mid B^\top x)\, dF_X(x)\, dF_{Y,T}(y, t)} \right]^{1/(2q+d)},$$

AMISEB(h) has minimum

$$\left[ c_d^{2q} \int\!\!\int B^2(y, t \mid B^\top x)\, dF_X(x)\, dF_{Y,T}(y, t) + c_d^{-d} \int\!\!\int V(y, t \mid B^\top x)\, dF_X(x)\, dF_{Y,T}(y, t) \right] n^{-2q/(d+2q)},$$

which is increasing in $d$ as $n \to \infty$. It is then clear that the prediction risk attains its minimum when $B = B_0$ and $h = c_{d_0}\, n^{-1/(2q+d_0)}$, the optimal bandwidth that minimizes $\mathrm{AMISE}_{B_0}(h)$. ◻

A.4. Proof of Theorem 1

To derive the large-sample properties of $\hat\mu(t)$, we need the following conditions:

  • A1. $\partial^r_{(t, B_0^\top x)} m(t, x; B_0)$, $\partial^r_{(t, B_0^\top x)} f_T(t \mid B_0^\top x)$, and $\partial^r_u f_{B_0^\top X}(u)$ are Lipschitz continuous in $t$, with the Lipschitz constant independent of $x$.

  • A2. $\inf_{(t,x)} f_T(t \mid B_0^\top x) > 0$.

  • A3. $h_1 \to 0$, $n h_1 h_2^{2r} \to 0$, and $n h_1 h_2^{2d_0} \to \infty$ as $n \to \infty$.

  • A4. $\Pr(\hat d = d_0) \to 1$ and $\hat B - B_0 = o_p\{h_1^{2r} + h_2^{2r} + (n h_1 h_2^{d_0})^{-1}\}$.

Conditions A1–A3 are similar to those in [17], and A4 is satisfied by the estimator of [12].

Proof. First we establish some asymptotic properties of the kernel smoothing estimators. For k ∈ {0, 1}, let

$$\hat m_k(t, x; B) = \frac{1}{n}\sum_{i=1}^n Y_i^k\, K_{h_1}(T_i - t)\, \mathbf{K}_{h_2}\{B^\top(X_i - x)\}, \qquad m_k(t, x; B) = E(Y^k \mid T = t, B^\top X = B^\top x)\, f_T(t \mid B^\top x)\, f_{B^\top X}(B^\top x).$$

By Condition A1, an application of Theorem II.37 of [23] shows that

$$\sup_{t, x, B}\left| \hat m_k(t, x; B) - m_k(t, x; B) \right| = O(h_1^r) + O(h_2^r) + o\{\ln n\, (n h_1 h_2^d)^{-1/2}\} \tag{A.1}$$

almost surely. Now we decompose μ^(t)μ(t) into

$$\frac{1}{n}\sum_{i=1}^n \{\hat m(t, X_i; \hat B) - \hat m(t, X_i; B_0)\} + \frac{1}{n}\sum_{i=1}^n \{\hat m(t, X_i; B_0) - m(t, X_i; B_0)\}. \tag{A.2}$$

By using a Taylor expansion and Theorem 4 in [12], we see that the first term in (A.2) is

$$\frac{1}{n}\sum_{i=1}^n \partial_{\mathrm{vec}(B)} \hat m(t, X_i; B_0)\, \mathrm{vec}(\hat B - B_0) + O_p(\|\hat B - B_0\|^2) = O_p(\|\hat B - B_0\|). \tag{A.3}$$

Also, by using the Taylor expansion and (A.1), we have

$$\hat m(t, x; B) - m(t, x; B) = \frac{\hat m_1(t, x; B) - m_1(t, x; B)}{m_0(t, x; B)} - \frac{m(t, x; B)\{\hat m_0(t, x; B) - m_0(t, x; B)\}}{m_0(t, x; B)} + O_p\left( h_1^{2r} + h_2^{2r} + \frac{1}{n h_1 h_2^d} \right).$$

Thus, the second term in (A.2) can be further written as

$$\begin{aligned} &\frac{1}{n^2}\sum_{i,j=1}^n \frac{Y_j K_{h_1}(T_j - t)\, \mathbf{K}_{h_2}\{B_0^\top(X_j - X_i)\} - m_1(t, X_i; B_0)}{m_0(t, X_i; B_0)} - \frac{1}{n^2}\sum_{i,j=1}^n \frac{m(t, X_i; B_0)}{m_0(t, X_i; B_0)}\left[ K_{h_1}(T_j - t)\, \mathbf{K}_{h_2}\{B_0^\top(X_j - X_i)\} - m_0(t, X_i; B_0) \right] + O_p\left( h_1^{2r} + h_2^{2r} + \frac{1}{n h_1 h_2^d} \right)\\ &\equiv U_1 + U_2 + O_p\left( h_1^{2r} + h_2^{2r} + \frac{1}{n h_1 h_2^d} \right). \tag{A.4} \end{aligned}$$

Note that U1 and U2 are second-order U-statistics. A direct calculation shows that the first-order Hájek projections are

$$\tilde U_1 = \frac{1}{n}\sum_{i=1}^n \frac{Y_i K_{h_1}(T_i - t)\, f_{B_0^\top X}(B_0^\top X_i)}{m_0(t, X_i; B_0)} + O_p(h_1^r n^{-1/2}) + O_p(h_2^r), \tag{A.5}$$

$$\tilde U_2 = -\frac{1}{n}\sum_{i=1}^n \frac{m(t, X_i; B_0)\, K_{h_1}(T_i - t)\, f_{B_0^\top X}(B_0^\top X_i)}{m_0(t, X_i; B_0)} + O_p(h_1^r n^{-1/2}) + O_p(h_2^r). \tag{A.6}$$

Further, from Theorem 6 of [21], the remainder terms between the U-statistics and their first-order Hájek projections are bounded by $O_p\{(n h_1 h_2^{d_0})^{-1}\}$. Thus, by substituting (A.3)–(A.6) into (A.2), we have

$$\hat\mu(t) - \mu(t) = \frac{1}{n}\sum_{i=1}^n \frac{Y_i - m(t, X_i; B_0)}{f_T(t \mid B_0^\top X_i)}\, K_{h_1}(T_i - t) + O_p(\|\hat B - B_0\|) + O_p\left( h_1^{2r} + h_2^{2r} + \frac{1}{n h_1 h_2^d} \right).$$

Together with the asymptotic mean $h_1^r B(t)$ and variance $V(t)/(nh_1)$ of the i.i.d. representation above and the Central Limit Theorem, Theorem 1 is established. ◻

A.5. Proof of Theorem 2

The corresponding large-sample properties rely on the smoothness of the following parameter functions, defined for $m \in \{0, 1, 2\}$ by

$$f^{[m]}(x; B) = \partial^m_{B^\top x}\left[ E\{(X - x)^{\otimes m} \mid B^\top X = B^\top x\}\, f_{B^\top X}(B^\top x) \right], \qquad G^{[m]}(y, t \mid x; B) = \partial^m_{B^\top x}\left[ \Pr(Y \le y, T \le t \mid B^\top X = B^\top x)\, E\{(X - x)^{\otimes m} \mid B^\top X = B^\top x\}\, f_{B^\top X}(B^\top x) \right].$$

The partial derivatives $\partial^m_{\mathrm{vec}(B)} \hat F(y, t \mid B^\top x)$ can be shown to converge uniformly to $F^{[m]}(y, t \mid x; B) = \sum_{\ell=0}^m G^{[\ell]}(y, t \mid x; B)\, f_{m-\ell}(x; B)$, where

$$f_1(x; B) = -\frac{f^{[1]}(x; B)}{f^2_{B^\top X}(B^\top x)}, \qquad f_2(x; B) = \frac{2\{f^{[1]}(x; B)\}^{\otimes 2}}{f^3_{B^\top X}(B^\top x)} - \frac{f^{[2]}(x; B)}{f^2_{B^\top X}(B^\top x)}.$$

Based on these properties, we can define the corresponding score vector and information matrix of CV(B, h), viz.

$$S(B) = \int \{\mathbf{1}(Y \le y, T \le t) - F(y, t \mid B^\top X)\}\, F^{[1]}(y, t \mid X; B)\, dF_{Y,T}(y, t), \qquad V(B) = E\left[ \int \left( \{F^{[1]}(y, t \mid X; B)\}^{\otimes 2} - \{\mathbf{1}(Y \le y, T \le t) - F(y, t \mid B^\top X)\}\, F^{[2]}(y, t \mid X; B) \right) dF_{Y,T}(y, t) \right],$$

where FY,T(y, t) is the joint distribution of (Y, T).

To prove Theorem 2, we need the following regularity conditions:

  • B1. $\partial_u^{q+2} \Pr(Y \le y, T \le t \mid B^\top X = u)$, $\partial_u^{q+m} E\{(X - x)^{\otimes m} \mid B^\top X = u\}$, and $\partial_u^{q+2} f_{B^\top X}(u)$ are Lipschitz continuous in $u$, with the Lipschitz constants independent of $(y, t, x, B)$.

  • B2. $\inf_{(x, B)} f_{B^\top X}(B^\top x) > 0$.

  • B3. For $d \ge 1$, there exist $\delta \in (1/(4q),\ 1/\max(2d+2,\, d+4))$ and positive constants $h_{\ell,d}$ and $h_{u,d}$ such that $h$ falls in the interval $\mathcal{H}_{\delta, n} = [h_{\ell,d}\, n^{-\delta},\ h_{u,d}\, n^{-\delta}]$.

  • B4. $\inf_{\{B : d < d_0\}} b_0^2(B) > 0$, and $b_0^2(B) = 0$ if and only if $B = B_0$ when $d = d_0$.

  • B5. $V(B_{d,0})$ is non-singular for $d \le d_0$.

The proof steps of Theorem 2 are similar to those of [12]. Here we give an outline and indicate some key properties. As mentioned in [20], the uniqueness of $B_0$ requires a certain local coordinate system of the Grassmann manifold. Without loss of generality, we assume $B$ is already parameterized in such a way; this is in fact a necessary condition for Condition B5 and not an additional assumption. First, we derive the convergence rates of the involved kernel smoothing estimators. Let

$$\hat f(B^\top x) = \frac{1}{n}\sum_{i=1}^n \mathbf{K}_h\{B^\top(X_i - x)\}, \qquad \hat f_c^{[m]}(x; B) = \partial^m_{\mathrm{vec}(B)} \hat f(B^\top x) - f^{[m]}(x; B), \qquad \hat G_c^{[m]}(y, t \mid x; B) = \partial^m_{\mathrm{vec}(B)}\{\hat F(y, t \mid B^\top x)\, \hat f(B^\top x)\} - G^{[m]}(y, t \mid x; B).$$

Moreover, we define the asymptotic linear representations for $\partial^m_{\mathrm{vec}(B)} \hat F(y, t \mid B^\top x)$ with $m \in \{0, 1\}$ as follows:

$$\begin{aligned} \frac{1}{n}\sum_{i=1}^n \xi_i(y, t \mid x; B) &= \frac{\hat G_c^{[0]}(y, t \mid x; B)}{f_{B^\top X}(B^\top x)} - \frac{F(y, t \mid B^\top x)\, \hat f_c^{[0]}(x; B)}{f_{B^\top X}(B^\top x)},\\ \frac{1}{n}\sum_{i=1}^n \xi_i^{[1]}(y, t \mid x; B) &= \frac{\hat G_c^{[1]}(y, t \mid x; B)}{f_{B^\top X}(B^\top x)} - \frac{F(y, t \mid B^\top x)\, \hat f_c^{[1]}(x; B)}{f_{B^\top X}(B^\top x)} - \frac{f^{[1]}(x; B)\, \hat G_c^{[0]}(y, t \mid x; B)}{f^2_{B^\top X}(B^\top x)} + \frac{\left[ 2 F(y, t \mid B^\top x)\, f^{[1]}(x; B) - G^{[1]}(y, t \mid x; B) \right] \hat f_c^{[0]}(x; B)}{f^2_{B^\top X}(B^\top x)}. \end{aligned}$$

By verifying the Euclidean class properties through Lemmas 2.12 and 2.14 of [22] and Theorem II.37 of [23], we get

$$\sup_{x, B}\left| \hat f_c^{[m]}(x; B) \right| = O(h^q) + o\{\ln n\, n^{-1/2} h^{-(d+m)/2}\} \ \text{a.s.}, \qquad \sup_{x, B}\left| \hat G_c^{[m]}(y, t \mid x; B) \right| = O(h^q) + o\{\ln n\, n^{-1/2} h^{-(d+m)/2}\} \ \text{a.s.},$$

for all $m \in \{0, 1, 2\}$. By applying Taylor's theorem and the results above, we can further ensure from Conditions B2–B3 that

$$\sup_{x, B}\left| \hat F(y, t \mid B^\top x) - F(y, t \mid B^\top x) - \frac{1}{n}\sum_{i=1}^n \xi_i(y, t \mid x; B) \right| = o_p(n^{-1/2}) \quad \text{and} \quad \sup_{x, B}\left| \partial_{\mathrm{vec}(B)} \hat F(y, t \mid B^\top x) - F^{[1]}(y, t \mid x; B) - \frac{1}{n}\sum_{i=1}^n \xi_i^{[1]}(y, t \mid x; B) \right| = o_p(n^{-1/2}).$$

The second step is to derive the uniform convergence of $\mathrm{CV}(B, h)$ to $\mathrm{ECV}(B, h) = \sigma_0^2 + b_0^2(B) + \mathrm{AMISE}_B(h)$. This can be done by paralleling the arguments of [12], and we have

$$\begin{cases} \displaystyle \sup_{B, h}\left| \frac{\mathrm{CV}(B, h) - \mathrm{ECV}(B, h)}{\mathrm{AMISE}_B(h)} \right| = o(1) \ \text{a.s.} & \text{for } \mathrm{span}(B) \supseteq \mathrm{span}(B_0),\\[2ex] \displaystyle \sup_{B, h}\left| \frac{\mathrm{CV}(B, h) - \mathrm{ECV}(B, h)}{b_0(B)\, \mathrm{AMISE}^{1/2}_B(h)} \right| = O(1) \ \text{a.s.} & \text{for } \mathrm{span}(B) \not\supseteq \mathrm{span}(B_0). \end{cases} \tag{A.7}$$

Now let $\mathrm{DCV}(B, h) = \mathrm{CV}(B, h) - \mathrm{ECV}(B, h)$. From the inequalities

$$\begin{aligned} 1 &= \Pr\{\mathrm{CV}(\hat B, \hat h) \le \mathrm{CV}(B_0, h_0)\}\\ &\le \Pr\{b_0^2(\hat B) < \varepsilon\} + \Pr\{b_0^2(\hat B) \ge \varepsilon,\ \mathrm{DCV}(\hat B, \hat h) - \mathrm{DCV}(B_0, h_0) \le \mathrm{ECV}(B_0, h_0) - \mathrm{ECV}(\hat B, \hat h)\}\\ &\le \Pr\{b_0^2(\hat B) < \varepsilon\} + \Pr\left\{ b_0^2(\hat B) \ge \varepsilon,\ \frac{\mathrm{DCV}(\hat B, \hat h)}{b_0(\hat B)} + \frac{\mathrm{DCV}(B_0, h_0)}{\varepsilon^{1/2}} \le -\varepsilon^{1/2} + \frac{\mathrm{AMISE}_{B_0}(h_0)}{\varepsilon^{1/2}} \right\} \tag{A.8} \end{aligned}$$

for any $\varepsilon > 0$, (A.7) further implies that, for all $\varepsilon > 0$,

$$\lim_{n \to \infty} \Pr\{b_0^2(\hat B) < \varepsilon\} = 1. \tag{A.9}$$

At the same time, by taking $\varepsilon = \inf_{\{B : d < d_0\}} b_0^2(B)/2$ and using Boole's inequality again, we have

$$\Pr\left\{ b_0^2(\hat B) < \tfrac12 \inf_{\{B : d < d_0\}} b_0^2(B) \right\} = \Pr\left\{ b_0^2(\hat B) < \tfrac12 \inf_{\{B : d < d_0\}} b_0^2(B),\ \hat d < d_0 \right\} + \Pr\left\{ b_0^2(\hat B) < \tfrac12 \inf_{\{B : d < d_0\}} b_0^2(B),\ \hat d \ge d_0 \right\} \le \Pr(\hat d \ge d_0).$$

Hence, (A.9) further implies that

$$\lim_{n \to \infty} \Pr(\hat d \ge d_0) = 1.$$

The third step is to derive the asymptotic properties of $\hat B_d$ for $d \le d_0$. Let $h_{d,0} = c_d\, n^{-1/(2q+d)}$. Since $(\hat B_d, \hat h_d)$ minimizes $\mathrm{CV}(B, h)$ for fixed $d$, Boole's inequality yields an inequality similar to (A.8):

$$1 \le \Pr\{b_0^2(\hat B_d) < \varepsilon\} + \Pr\left\{ b_0^2(\hat B_d) \ge \varepsilon,\ \frac{\mathrm{DCV}(\hat B_d, \hat h_d)}{b_0(\hat B_d)} + \frac{\mathrm{DCV}(B_{d,0}, h_{d,0})}{\varepsilon^{1/2}} \le -\varepsilon^{1/2} + \frac{\mathrm{AMISE}_{B_{d,0}}(h_{d,0})}{\varepsilon^{1/2}} \right\} \tag{A.10}$$

for any $\varepsilon > 0$. Also note that $b_0^2(\hat B_d) \ge \varepsilon$ implies that $\mathrm{span}(\hat B_d) \not\supseteq \mathrm{span}(B_0)$. Eq. (A.7) and Condition B3 further imply that $\mathrm{DCV}(\hat B_d, \hat h_d)/b_0(\hat B_d) = O_p\{\mathrm{AMISE}^{1/2}_{\hat B_d}(\hat h_d)\} \to 0$ and $\mathrm{DCV}(B_{d,0}, h_{d,0})/\varepsilon^{1/2} = o_p\{\mathrm{AMISE}_{B_{d,0}}(h_{d,0})\} \to 0$ as $n \to \infty$. Coupled with the fact that $\mathrm{AMISE}_{B_{d,0}}(h_{d,0}) \to 0$ as $n \to \infty$, the inequality in the second term of (A.10) becomes $0 \ge \varepsilon^{1/2}$ as $n \to \infty$, which is impossible. Hence, one has, for all $\varepsilon > 0$,

$$\lim_{n \to \infty} \Pr\{b_0^2(\hat B_d) < \varepsilon\} = 1.$$

Since $\mathrm{span}(\hat B_d) \supseteq \mathrm{span}(B_0)$ implies $b_0^2(\hat B_d) = 0$, we now consider the case where $\mathrm{span}(\hat B_d) \supseteq \mathrm{span}(B_0)$ and, hence, $\hat B_d \to B_{d,0}$ in probability. By a first-order Taylor expansion of $\partial_{\mathrm{vec}(B)} \mathrm{CV}(B, h)$ at $B = B_{d,0}$ and $\partial_{\mathrm{vec}(B)} \mathrm{CV}(\hat B_d, \hat h_d) = 0$, one has

$$\left[ I_{pd} + V^{-1}(B_{d,0})\left\{ \partial^2_{\mathrm{vec}(B)} \mathrm{CV}(\hat B_d^*, \hat h_d) - V(B_{d,0}) \right\} \right] n^{1/2}\, \mathrm{vec}(\hat B_d - B_{d,0}) = -n^{1/2}\, V^{-1}(B_{d,0})\, \partial_{\mathrm{vec}(B)} \mathrm{CV}(B_{d,0}, \hat h_d),$$

where $\mathrm{vec}(\hat B_d^*)$ lies on the line segment between $\mathrm{vec}(\hat B_d)$ and $\mathrm{vec}(B_{d,0})$. Coupled with the facts that

$$n^{1/2}\, \partial_{\mathrm{vec}(B)} \mathrm{CV}(B_{d,0}, \hat h_d) \rightsquigarrow \mathcal{N}\big[0, E\{S^{\otimes 2}(B_{d,0})\}\big] \quad \text{and} \quad \partial^2_{\mathrm{vec}(B)} \mathrm{CV}(\hat B_d^*, \hat h_d) - V(B_{d,0}) \to 0 \ \text{in probability},$$

we have, as n → ∞,

$$n^{1/2}\, \mathrm{vec}(\hat B_d - B_{d,0}) \rightsquigarrow \mathcal{N}\big[0,\ V^{-1}(B_{d,0})\, E\{S^{\otimes 2}(B_{d,0})\}\, V^{-1}(B_{d,0})\big]. \tag{A.11}$$

Using a Taylor expansion and $\partial_{\mathrm{vec}(B)} b_0^2(B_{d,0}) = 0$, we further deduce from Eq. (A.11) that

$$b_0^2(\hat B_d) = \frac{1}{2}\, \mathrm{vec}(\hat B_d - B_{d,0})^\top\, \partial^2_{\mathrm{vec}(B)} b_0^2(\hat B_d^*)\, \mathrm{vec}(\hat B_d - B_{d,0}) = O_p(n^{-1})$$

for $d \le d_0$, and hence $b_0^2(\hat B) = O_p(n^{-1})$. Coupled with Condition B3, this further implies that

$$\frac{b_0^2(\hat B)}{\mathrm{AMISE}_{\hat B}(\hat h)} = o_p(1).$$

The next step is to show the consistency of $(\hat B, \hat h)$. Let

$$\begin{aligned} E_0 &= \left\{ b_0^2(\hat B) < \frac{\ln n}{n},\ \hat h \in \mathcal{H}_{1/(2q+\hat d),\, n},\ \hat d = d_0 \right\},\\ E_1 &= \left\{ b_0^2(\hat B) \ge \frac{\ln n}{n} \right\}, \qquad E_2 = \{\hat d < d_0\},\\ E_3 &= \left\{ b_0^2(\hat B) < \frac{\ln n}{n},\ \hat d \ge d_0,\ \hat h \in \mathcal{H}_{\hat\delta, n} \text{ with } \hat\delta \ne \frac{1}{2q + \hat d} \right\},\\ E_4 &= \left\{ b_0^2(\hat B) < \frac{\ln n}{n},\ \hat h \in \mathcal{H}_{1/(2q+\hat d),\, n},\ \hat d > d_0 \right\}, \quad \text{and}\\ E_{\mathrm{con}} &= \left\{ \mathrm{DCV}(\hat B, \hat h) - \mathrm{DCV}(B_0, h_0) \le \mathrm{ECV}(B_0, h_0) - \mathrm{ECV}(\hat B, \hat h) \right\}. \end{aligned}$$

Since $(\hat B, \hat h)$ minimizes $\mathrm{CV}(B, h)$, Boole's inequality yields

$$1 = \Pr\{\mathrm{CV}(\hat B, \hat h) \le \mathrm{CV}(B_0, h_0)\} \le \Pr(E_0) + \sum_{m=1}^4 \Pr(E_{\mathrm{con}} \cap E_m).$$

The consistency follows by showing $\Pr(E_{\mathrm{con}} \cap E_m) \to 0$ as $n \to \infty$ for all $m \in \{1, \dots, 4\}$, which implies that

$$\lim_{n \to \infty} \Pr(E_0) = 1. \tag{A.12}$$

Finally, the asymptotic normality is ensured by (A.12) and (A.11). ◻


References

  • [1] Aerts M, Claeskens G, Hens N, Molenberghs G, Local multiple imputation, Biometrika 89 (2002) 375–388.
  • [2] Cheng PE, Nonparametric estimation of mean functionals with data missing at random, J. Amer. Statist. Assoc. 89 (1994) 81–87.
  • [3] Dua S, Bhuker M, Sharma P, Dhall M, Kapoor S, et al., Body mass index relates to blood pressure among adults, North Amer. J. Med. Sci. 6 (2014) 89.
  • [4] Galvao AF, Wang L, Uniformly semiparametric efficient estimation of treatment effects with a continuous treatment, J. Amer. Statist. Assoc. 110 (2015) 1528–1542.
  • [5] Gray HL, Schucany WR, The Generalized Jackknife Statistic, Marcel Dekker, New York, 1972.
  • [6] Hall P, Li Q, Racine JS, Nonparametric estimation of regression functions in the presence of irrelevant regressors, The Review of Economics and Statistics 89 (2007) 784–789.
  • [7] Härdle WK, Hall P, Marron JS, How far are automatically chosen regression smoothing parameters from their optimum? J. Amer. Statist. Assoc. 83 (1988) 86–101.
  • [8] Härdle WK, Marron JS, Optimal bandwidth selection in nonparametric regression function estimation, Ann. Statist. 13 (1985) 1465–1481.
  • [9] Hill JL, Bayesian nonparametric modeling for causal inference, J. Comput. Graph. Statist. 20 (2011) 217–240.
  • [10] Hirano K, Imbens GW, The propensity score with continuous treatments, in: Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, Wiley, Chichester, 2004, pp. 73–84.
  • [11] Huang M-Y, Chan KCG, Joint sufficient dimension reduction and estimation of conditional and average treatment effects, Biometrika 104 (2017) 583–596.
  • [12] Huang M-Y, Chiang C-T, An effective semiparametric estimation approach for the sufficient dimension reduction model, J. Amer. Statist. Assoc. 112 (2017) 1296–1310.
  • [13] Imai K, van Dyk DA, Causal inference with general treatment regimes: Generalizing the propensity score, J. Amer. Statist. Assoc. 99 (2004) 854–866.
  • [14] Imbens GW, Nonparametric estimation of average treatment effects under exogeneity: A review, The Review of Economics and Statistics 86 (2004) 4–29.
  • [15] Jaeckel LA, The Infinitesimal Jackknife, Bell Telephone Laboratories, 1972.
  • [16] Kaufman JS, Asuzu MC, Mufunda J, Forrester T, Wilks R, Luke A, Long AE, Cooper RS, Relationship between blood pressure and body mass index in lean populations, Hypertension 30 (1997) 1511–1516.
  • [17] Kennedy EH, Ma Z, McHugh MD, Small DS, Nonparametric methods for doubly robust estimation of continuous treatment effects, J. R. Stat. Soc. Ser. B Stat. Methodol. 79 (2017) 1229–1245.
  • [18] Li K-C, Sliced inverse regression for dimension reduction, J. Amer. Statist. Assoc. 86 (1991) 316–342.
  • [19] Luo W, Zhu Y, Ghosh D, On estimating regression-based causal effects using sufficient dimension reduction, Biometrika 104 (2017) 51–65.
  • [20] Ma Y, Zhu L, Efficient estimation in sufficient dimension reduction, Ann. Statist. 41 (2013) 250–268.
  • [21] Nolan D, Pollard D, U-processes: Rates of convergence, Ann. Statist. 15 (1987) 780–799.
  • [22] Pakes A, Pollard D, Simulation and the asymptotics of optimization estimators, Econometrica 57 (1989) 1027–1057.
  • [23] Pollard D, Convergence of Stochastic Processes, Springer, New York, 1984.
  • [24] Ruppert D, Wand MP, Multivariate locally weighted least squares regression, Ann. Statist. 22 (1994) 1346–1370.
  • [25] Schucany WR, Gray H, Owen D, On bias reduction in estimation, J. Amer. Statist. Assoc. 66 (1971) 524–533.
  • [26] Schucany WR, Sommers JP, Improvement of kernel type density estimators, J. Amer. Statist. Assoc. 72 (1977) 420–423.
  • [27] Shrestha A, Koju RP, Beresford SA, Chan KCG, Karmacharya BM, Fitzpatrick AL, Food patterns measured by principal component analysis and obesity in the Nepalese adult, Heart Asia 8 (2016) 46–53.
  • [28] Yin X, Li B, Sufficient dimension reduction based on an ensemble of minimum average variance estimators, Ann. Statist. 39 (2011) 3392–3416.
  • [29] Zhu L-P, Zhu L-X, Feng Z-H, Dimension reduction in regressions through cumulative slicing estimation, J. Amer. Statist. Assoc. 105 (2010) 1455–1466.
