Published in final edited form as: J. Multivar. Anal. 168 (2018) 48–62. doi:10.1016/j.jmva.2018.07.005

Joint sufficient dimension reduction for estimating continuous treatment effect functions

Ming-Yueh Huang a,*, Kwun Chuen Gary Chan b
PMCID: PMC6411292  NIHMSID: NIHMS981686  PMID: 30872872

Abstract

The estimation of continuous treatment effect functions using observational data often requires parametric specification of the effect curves or of the conditional distributions of outcomes and treatment assignments given multi-dimensional covariates. While nonparametric extensions are possible, they typically suffer from the curse of dimensionality. Dimension reduction is often inevitable, and we propose a sufficient dimension reduction framework to balance parsimony and flexibility. The joint central subspace can be estimated at an $n^{1/2}$-rate without fixing its dimension in advance, and the treatment effect function is estimated by averaging local estimates of a reduced dimension. Asymptotic properties are studied. Unlike binary treatments, continuous treatments require multiple smoothing parameters of different asymptotic orders to borrow different facets of information, and their joint estimation is proposed by a non-standard version of the infinitesimal jackknife.

Keywords: Central subspace, Cross-validation, Dose-response, Infinitesimal jackknife, Optimal bandwidth

1. Introduction

It is common in medical and social studies to collect data on treatments or exposures, and a primary goal in these studies is to investigate the causal effects of the treatments on an outcome. In the case of continuous treatments, the effects are naturally described by functions, and model-free estimation typically involves smoothing to borrow information from adjacent treatment levels. While observational data are often the only data available in practice, direct smoothing of observed responses across different treatment levels usually results in biased estimates of causal treatment effects due to confounding.

In the literature, there are two major approaches to estimating treatment effects in the presence of confounding; they are based on outcome regression models and generalized propensity score models. Outcome regression models describe how the response relates to the treatment and covariates [9, 14], while generalized propensity score models extend the classical propensity score for binary treatments to the conditional density of treatment given covariates [4, 10, 13]. Doubly robust estimation was recently proposed in [17] to unify these two approaches by combining outcome regression and generalized propensity score models in a single estimating procedure. These model-based approaches require at least one correctly specified parametric or semiparametric model for the outcome regression functions or generalized propensity scores. In practice, however, it is often hard to check whether these models are adequate without prior knowledge. Alternatively, one can use fully nonparametric smoothing techniques in place of the modeling procedures, as in [4, 17]. When the dimension of covariates is large, such model-free approaches suffer from the curse of dimensionality and can lead to unstable estimates of the treatment effects. Some middle ground between current model-based and model-free approaches would thus be desirable.

In this work, we study a nonparametric method that is more parsimonious and suffers less from the curse of dimensionality than the fully nonparametric smoothing estimator, while retaining robustness to potential model misspecification. Sufficient dimension reduction for treatment effect estimation has recently been studied in [11, 19] for binary treatment variables. To extend these ideas to continuous treatments, we consider a joint sufficient dimension reduction model [11] to capture all the information of the outcome regression and generalized propensity score. In contrast, Luo et al. [19] considered separate dimension reduction for different subgroups, which is more difficult to generalize to continuous treatments.

An advantage of sufficient dimension reduction is that, unlike generalized propensity scores, which cannot be estimated nonparametrically at an $n^{1/2}$ convergence rate, the central subspace can be estimated at an $n^{1/2}$-rate, which turns out to be crucial in guaranteeing desirable theoretical and practical performance of the treatment effect estimator. The estimation is complicated by the need for multiple smoothing parameters: one for borrowing information across different treatment levels, and another for undersmoothing with respect to marginalizing the observed covariate distribution. The smoothing parameters are chosen in a data-adaptive manner by minimizing an estimator of the asymptotic mean squared error via a non-standard application of the infinitesimal jackknife.

The rest of this article is organized as follows. Section 2 introduces the dimension reduction model under consideration and proposes a procedure to estimate the joint central subspace and the treatment effects of interest. The results of a series of simulation studies are reported in Section 3, and an application to a food patterns dataset is described in Section 4. Some concluding remarks are given in Section 5.

2. The proposed methodology

2.1. Joint sufficient dimension reduction model

Let $Y(t)$ be the potential outcome associated with each treatment level $t$, where $t \in \mathcal{T}$ and $\mathcal{T}$ is a connected subset of $\mathbb{R}$, the set of all real numbers. Also, let $T$ be the continuous treatment variable. Then the observed outcome is $Y = Y(T)$. In addition, a vector $X = (X_1, \dots, X_p)^\top$ of covariates is observed for each subject. The main goal of this work is to estimate the continuous treatment effect $\mu(t) = E\{Y(t)\}$, $t \in \mathcal{T}$, based on a random sample $(Y_1, T_1, X_1), \dots, (Y_n, T_n, X_n)$. Since this parameter is defined in terms of potential outcomes, which are not observed directly, some assumptions are needed to identify the quantity of interest from the observed data.

Assumption 1 (Positivity). $f_T(t \mid x) > 0$ for all $t \in \mathcal{T}$ and $x \in \mathcal{X}$, where $f_T(t \mid x)$ is the conditional density of $T$ given $X = x$, and $\mathcal{X}$ is the support of $X$.

Assumption 2 (Ignorability). $Y(t) \perp\!\!\!\perp T \mid X$ for all $t \in \mathcal{T}$, where $\perp\!\!\!\perp$ denotes independence.

The positivity assumption states that every subject has a non-zero chance of receiving treatment level t. The ignorability assumption says that the potential outcomes and the treatment are unconfounded when the covariates are given. That is, all the confoundedness is captured by the vector of covariates. According to these assumptions, a simple bias removal strategy is to use the property

$$E\{Y(t)\} = E[E\{Y(t) \mid X\}] = E[E\{Y(t) \mid T = t, X\}] = E\{E(Y \mid T = t, X)\}. \tag{1}$$

Thus, an estimator of $\mu(t)$ can be obtained by averaging an estimator of the conditional effect $E(Y \mid T = t, X)$. In practice, the number of covariates is usually large and nonparametric estimators of the conditional effect typically suffer from the curse of dimensionality. In order to find a lower-dimensional score that attains dimension reduction while retaining a property similar to (1), we first note that the likelihood of the observed variables is

$$f_{Y,T,X}(y, t, x) = f_Y(y \mid t, x)\, f_T(t \mid x)\, f_X(x) = f_{Y(t)}(y \mid x)\, f_T(t \mid x)\, f_X(x)$$

based on Assumptions 1 and 2. One can see that all the information is conveyed through the conditional distributions of Y(t) and T given X. Thus, we propose a joint sufficient dimension reduction model to summarize both potential outcomes and treatment assignment, viz.

$$Y(t) \perp\!\!\!\perp X \mid B^\top X \ \text{ for all } t \in \mathcal{T}, \qquad T \perp\!\!\!\perp X \mid B^\top X, \tag{2}$$

for some $p \times d$ full-rank parameter matrix $B$ with $d \le p$. Since model (2) has a nested structure in $d$, we can focus on searching for the joint sufficient dimension reduction subspace $\mathrm{span}(B)$ with the smallest dimension, which is called the joint central subspace; the corresponding basis matrix is denoted by $B_0$ and its dimension by $d_0$. Related discussion on the model structure and identification is given in [11] for binary treatments. Model (2) leads to the following result on the ignorability of treatment assignment.

Proposition 1. Under Assumption 2 and model (2), the joint central subspace $\mathrm{span}(B_0)$ satisfies $Y(t) \perp\!\!\!\perp T \mid B_0^\top X$ for all $t \in \mathcal{T}$.

Thus, the bias-removal strategy in (1) can still be applied with $X$ replaced by $B_0^\top X$. That is,

$$\mu(t) = E\{Y(t)\} = E[E\{Y(t) \mid X\}] = E\{E(Y \mid T = t, X)\} = E\{E(Y \mid T = t, B_0^\top X)\},$$

which forms the basis for our estimator of $\mu(t)$. Let $m(t, x; B) = E(Y \mid T = t, B^\top X = B^\top x)$. If $B_0$ were known, we could estimate $\mu(t)$ by $n^{-1}\sum_{i=1}^n \hat m(t, X_i; B_0)$, with the locally smoothed estimator

$$\hat m(t, x; B) = \frac{\sum_{i=1}^n Y_i\, K_{h_1}(T_i - t)\, \mathbf{K}_{h_2}\{B^\top(X_i - x)\}}{\sum_{i=1}^n K_{h_1}(T_i - t)\, \mathbf{K}_{h_2}\{B^\top(X_i - x)\}},$$

where $K_{h_1}(t) = K(t/h_1)/h_1$, $\mathbf{K}_{h_2}(u) = \prod_{k=1}^d K(u_k/h_2)/h_2$ with $u = (u_1, \dots, u_d)^\top$, $h_1, h_2$ are positive bandwidths, and $K$ is an $r$th-order kernel function. The choice of $r$ is discussed in Section 2.3. When $B_0$ is unknown and we have an estimator $\hat B$, we can estimate $\mu(t)$ by $\hat\mu(t) = n^{-1}\sum_{i=1}^n \hat m(t, X_i; \hat B)$. The following result holds for any estimator $\hat B$ for which $n^{1/2}(\hat B - B_0)$ converges in distribution under mild assumptions.
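To fix ideas, the following minimal Python sketch implements $\hat m(t, x; B)$ and $\hat\mu(t)$. The Gaussian kernel, the bandwidths, and all function names are illustrative choices of ours, not the paper's prescription (in particular, an $r$th-order kernel with $r = 2\lceil d/4 \rceil$, as recommended in Section 2.3, would replace the second-order Gaussian kernel when $d > 4$).

```python
import numpy as np

def gaussian_kernel(u):
    """Second-order Gaussian kernel; an illustrative choice of K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def mu_hat(t, Y, T, X, B, h1, h2, kernel=gaussian_kernel):
    """Estimate mu(t) = E{Y(t)} by averaging the local estimator
    m_hat(t, X_i; B) over the empirical covariate distribution.
    Y: (n,) outcomes; T: (n,) treatments; X: (n, p) covariates;
    B: (p, d) basis of the working subspace."""
    n = len(Y)
    Z = X @ B                                   # reduced covariates B^T X_i
    wT = kernel((T - t) / h1) / h1              # K_{h1}(T_i - t)
    m_vals = np.empty(n)
    for i in range(n):
        # product kernel K_{h2}{B^T(X_j - X_i)} over the d reduced coordinates
        wX = np.prod(kernel((Z - Z[i]) / h2) / h2, axis=1)
        w = wT * wX
        m_vals[i] = np.sum(w * Y) / np.sum(w)   # Nadaraya-Watson value m_hat(t, X_i; B)
    return m_vals.mean()
```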

Theorem 1. Suppose that Conditions A1–A4 in the Appendix are satisfied. Then, as n → ∞,

$$(n h_1)^{1/2}\{\hat\mu(t) - \mu(t) - h_1^r B(t)\} \rightsquigarrow \mathcal{N}\{0, V(t)\},$$

where

$$B(t) = \frac{\mu_{r,K}}{r!}\, E\left[ \frac{\partial_t^r\{m(t, X; B_0)\, f_T(t \mid B_0^\top X)\} - m(t, X; B_0)\, \partial_t^r f_T(t \mid B_0^\top X)}{f_T(t \mid B_0^\top X)} \right],$$

$$V(t) = \sigma_K^2\, E\left[ \frac{\operatorname{var}(Y \mid T, X) + \{m(T, X; B_0) - m(t, X; B_0)\}^2}{f_T^2(t \mid B_0^\top X)} \right],$$

$\mu_{r,K} = \int u^r K(u)\, du$, and $\sigma_K^2 = \int K^2(u)\, du$.

Remark 1. Different bandwidths $h_1$ and $h_2$ are needed because $\hat\mu(t)$ averages over the covariate distribution but not the treatment distribution. Therefore, undersmoothing is required in $h_2$, but not in $h_1$, to minimize the mean squared error of the estimator. Details are given in Sections 2.3 and 2.4.

Remark 2. Galvao and Wang [4] proposed an estimator for the treatment effects based on a conditional density adjustment. The main idea comes from the property

$$E\{Y(t) - \mu(t)\} = E\left[ \{Y - \mu(t)\}\, \frac{f_T(t \mid Y, X)}{f_T(t \mid X)} \right]$$

under Assumption 2 and some regularity conditions for changing the order of limits and integral. Since Proposition 1 shows that the conditional independence in Assumption 2 also holds for B0X, we can deduce that

$$E\{Y(t) - \mu(t)\} = E\left[ \{Y - \mu(t)\}\, \frac{f_T(t \mid Y, B_0^\top X)}{f_T(t \mid B_0^\top X)} \right].$$

Thus, an estimator for μ(t) can be

$$\hat\mu_{\mathrm{SE}}(t) = \frac{\sum_{i=1}^n Y_i\, \hat f_T(t \mid Y_i, \hat B^\top X_i)\big/\hat f_T(t \mid \hat B^\top X_i)}{\sum_{i=1}^n \hat f_T(t \mid Y_i, \hat B^\top X_i)\big/\hat f_T(t \mid \hat B^\top X_i)}$$

with

$$\hat f_T(t \mid y, B^\top x) = \frac{\sum_{i=1}^n K_\varsigma(T_i - t)\, K_\varsigma(Y_i - y)\, \mathbf{K}_\varsigma\{B^\top(X_i - x)\}}{\sum_{i=1}^n K_\varsigma(Y_i - y)\, \mathbf{K}_\varsigma\{B^\top(X_i - x)\}}, \qquad \hat f_T(t \mid B^\top x) = \frac{\sum_{i=1}^n K_\varsigma(T_i - t)\, \mathbf{K}_\varsigma\{B^\top(X_i - x)\}}{\sum_{i=1}^n \mathbf{K}_\varsigma\{B^\top(X_i - x)\}}.$$

Note that this estimator requires a fully nonparametric estimate of the density $f_{T,Y,B^\top X}(t, y, B^\top x)$, which has a slower rate of convergence than that of $\hat m(t, x; B)$. In our simulations, we found that this estimator usually performs worse than the proposed estimator.

Remark 3. In the literature on causal inference with binary treatments, doubly robust estimation is often appealing: it combines the outcome regression and propensity score models and requires only one of them to be correctly specified. For continuous treatments, Kennedy et al. [17] proposed a kernel-based approach based on the influence function of a kernel-weighted projection parameter, which requires only mild smoothness conditions and still allows for double robustness. The key identity underlying their approach is

$$\mu(t) = E\left[ \left. \frac{Y - m(T, X)}{f_T(T \mid X)}\, \omega(T) + \mu(T) \,\right|\, T = t \right], \tag{3}$$

where $m(t, x) = E(Y \mid T = t, X = x)$ and $\omega(t) = E\{f_T(t \mid X)\}$. Under model (2), it is easily seen that the equality in (3) still holds when $m(t, x)$ and $f_T(t \mid X)$ are replaced with $m(t, B_0^\top x)$ and $f_T(t \mid B_0^\top X)$, respectively. Thus, an alternative estimator of $\mu(t)$ is

$$\hat\mu_{\mathrm{DR}}(t) = \frac{\sum_{i=1}^n \hat\xi_i\, K_\varsigma(T_i - t)}{\sum_{i=1}^n K_\varsigma(T_i - t)} \quad \text{with} \quad \hat\xi_i = \frac{Y_i - \hat m(T_i, \hat B^\top X_i)}{\hat f_T(T_i \mid \hat B^\top X_i)}\, \hat\omega(T_i) + \hat\mu(T_i),$$

where $\hat\omega(t) = n^{-1}\sum_{i=1}^n \hat f_T(t \mid \hat B^\top X_i)$. This estimator can be viewed as an adjusted version of $\hat\mu(t)$. However, $\hat f_T(t \mid \hat B^\top X_i)$ requires one more smoothing parameter than the proposed method. In our simulations, we found that $\hat\mu_{\mathrm{DR}}(t)$ performs similarly to $\hat\mu(t)$ in finite samples; a code sketch follows.
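For concreteness, here is a sketch of the adjustment in Remark 3, assuming the plug-in quantities have already been computed with the kernel estimators above; all argument names are ours.

```python
import numpy as np

def mu_dr(t_grid, Y, T, m_hat_T, f_hat_T, mu_hat_T, omega_hat_T, varsigma,
          kernel=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)):
    """Doubly robust estimator of Remark 3. The inputs, evaluated at the
    observed treatments, are assumed precomputed:
      m_hat_T[i]     ~ m_hat(T_i, B_hat^T X_i)
      f_hat_T[i]     ~ f_hat_T(T_i | B_hat^T X_i)
      mu_hat_T[i]    ~ mu_hat(T_i)
      omega_hat_T[i] ~ omega_hat(T_i) = n^{-1} sum_j f_hat_T(T_i | B_hat^T X_j)."""
    xi = (Y - m_hat_T) / f_hat_T * omega_hat_T + mu_hat_T       # pseudo-outcomes xi_i
    out = []
    for t in np.atleast_1d(t_grid):
        w = kernel((T - t) / varsigma) / varsigma               # K_varsigma(T_i - t)
        out.append(np.sum(w * xi) / np.sum(w))                  # local average of xi
    return np.asarray(out)
```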

Remark 4. One could also impose more general assumptions, as in [4, 17], to allow non-kernel-based estimators such as splines for estimating $m(t, x; B)$. However, in some cases the conditions become more stringent. For example, Galvao and Wang [4] require the estimate of $f_T(t \mid y, x)/f_T(t \mid x)$ to converge at the rate $n^{-1/4}$, which is difficult to satisfy for multiple covariates $X$ or even the lower-dimensional $B_0^\top X$. In addition, the conditions of [4, 17] do not directly lead to results for the mean squared error, which can only be derived case by case for different smoothers and will typically differ from our results. To provide a complete estimation procedure, including proper smoothing parameter selection based on the asymptotically optimal mean squared error, we focus on the kernel-based estimator and assume more specialized smoothness conditions for consistency and weak convergence.

2.2. Estimation of joint central subspace

Estimation of B0 does not follow immediately from model (2) because Y(t) is only partially observed. We need the following identification result using the observed data.

Proposition 2. Suppose that Assumptions 1–2 hold. Then, model (2) is equivalent to $(Y, T) \perp\!\!\!\perp X \mid B^\top X$.

Therefore, we can employ existing methodologies for sufficient dimension reduction of a bivariate outcome to estimate $B_0$. While different methods of sufficient dimension reduction have been proposed in the literature (see, e.g., [20, 28, 29]), we use a generalization of the estimator of Huang and Chiang [12] to the multivariate response $(Y, T)$ in all numerical examples, because it simultaneously estimates the structural dimension, the basis matrix, and a rate-optimal smoothing parameter under a single objective function, while avoiding certain linearity assumptions on the covariate distribution required by inverse regression [18].

Let

$$\hat F(y, t \mid B^\top x) = \frac{\sum_{j=1}^n \mathbf{1}(Y_j \le y, T_j \le t)\, \mathbf{K}_h\{B^\top(X_j - x)\}}{\sum_{j=1}^n \mathbf{K}_h\{B^\top(X_j - x)\}}$$

be a kernel smoothing estimator of $F(y, t \mid B^\top x) = \Pr(Y \le y, T \le t \mid B^\top X = B^\top x)$, where $\mathbf{1}$ denotes the indicator function, $h$ is another positive bandwidth, and $K$ is a $q$th-order kernel function. In practice, we suggest taking $q = 4 \vee 2\lfloor (d+6)/4 \rfloor$, where $\lfloor \cdot \rfloor$ is the floor function. Further, let $(Y_0, T_0, X_0)$ be a future observation independent of the current data $(Y_1, T_1, X_1), \dots, (Y_n, T_n, X_n)$ and define the prediction risk

$$E\left[ \int \{\mathbf{1}(Y_0 \le y, T_0 \le t) - \hat F(y, t \mid B^\top X_0)\}^2\, dF_{Y,T}(y, t) \right]. \tag{4}$$

A simple calculation shows that the prediction risk in (4) can be decomposed into

$$\sigma_0^2 + b_0^2(B) + \mathrm{MISE}_B(h) + C(B, h), \tag{5}$$

where

$$\begin{aligned} \sigma_0^2 &= E\left[ \int \{\mathbf{1}(Y_0 \le y, T_0 \le t) - F(y, t \mid B_0^\top X_0)\}^2\, dF_{Y,T}(y, t) \right],\\ b_0^2(B) &= E\left[ \int \{F(y, t \mid B_0^\top X_0) - F(y, t \mid B^\top X_0)\}^2\, dF_{Y,T}(y, t) \right],\\ \mathrm{MISE}_B(h) &= E\left[ \int \{F(y, t \mid B^\top X_0) - \hat F(y, t \mid B^\top X_0)\}^2\, dF_{Y,T}(y, t) \right], \quad \text{and}\\ C(B, h) &= 2 E\left[ \int \{F(y, t \mid B_0^\top X_0) - F(y, t \mid B^\top X_0)\}\{F(y, t \mid B^\top X_0) - \hat F(y, t \mid B^\top X_0)\}\, dF_{Y,T}(y, t) \right]. \end{aligned}$$

When $h \to 0$ and $nh^d \to \infty$, it is shown in the Appendix that the last two summands in (5) converge to zero and are dominated by the first two terms. Note that $b_0^2(B) \ge 0$, with equality when $\mathrm{span}(B)$ is a joint sufficient dimension reduction subspace. Thus, the minimum of the prediction risk must occur when $\mathrm{span}(B) \supseteq \mathrm{span}(B_0)$. Moreover, since our model has a nested structure, the prediction risk decreases as the working dimension increases, as long as the working dimension is less than the structural dimension. In contrast, once the working dimension $d$ is equal to, or larger than, the structural dimension $d_0$ and $\mathrm{span}(B) \supseteq \mathrm{span}(B_0)$, the prediction risk has an asymptotic order of $\sigma_0^2 + O_p\{n^{-2q/(2q+d)}\}$, which starts to increase in $d$. Therefore, for large enough $n$, the minimum of the prediction risk occurs at the joint central subspace. A formal result is given as follows.

Proposition 3. Under Assumptions 1–2 and model (2), the joint central subspace $B_0$ and the optimal bandwidth $h_0 = c_{d_0}\, n^{-1/(2q+d_0)}$ minimize the prediction risk in (4) as $h \to 0$, $nh^d \to \infty$, and $n \to \infty$, where the constant $c_{d_0}$ is given in the Appendix.

According to Proposition 3, we can estimate $B_0$ by minimizing a criterion that converges to the prediction risk (4) uniformly in probability. Such a criterion can be obtained by leave-one-out cross-validation. However, the minimization of such a criterion is complicated by the unknown dimension of $B$. Fortunately, since Proposition 3 shows that the prediction risk is asymptotically convex in $d$, we can implement the following forward selection algorithm, which asymptotically minimizes the prediction risk (4); a code sketch follows the three steps below.

Step 1. For d = 0, compute

$$\widetilde{\mathrm{CV}}(0) = \frac{1}{n}\sum_{i=1}^n \int \{\mathbf{1}(Y_i \le y, T_i \le t) - \hat F_{Y,T}(y, t)\}^2\, d\hat F_{Y,T}(y, t),$$

where $\hat F_{Y,T}(y, t)$ is the empirical distribution function of $(Y_1, T_1), \dots, (Y_n, T_n)$.

Step 2. For $d \ge 1$, let $(\hat B_d, \hat h_d)$ be the minimizer of

$$\mathrm{CV}(B, h) = \frac{1}{n}\sum_{i=1}^n \int \{\mathbf{1}(Y_i \le y, T_i \le t) - \hat F^{-i}(y, t \mid B^\top X_i)\}^2\, d\hat F_{Y,T}(y, t),$$

where $B$ is a $p \times d$ matrix and the superscript $-i$ indicates that the estimator is computed from the sample with the $i$th subject deleted. Then, compute $\widetilde{\mathrm{CV}}(d) = \mathrm{CV}(\hat B_d, \hat h_d)$.

Step 3. Repeat Step 2 with increasing $d$ until $d = \hat d$ satisfies $\widetilde{\mathrm{CV}}(\hat d + 1) > \widetilde{\mathrm{CV}}(\hat d)$; the proposed estimator is $(\hat B, \hat h) = (\hat B_{\hat d}, \hat h_{\hat d})$.
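The following Python sketch mirrors Steps 1–3 under simplifying assumptions: a second-order Gaussian product kernel instead of the $q$th-order kernel of Remark 5, a generic optimizer over the raw entries of $B$ (so the minimizer is only identified up to a change of basis of $\mathrm{span}(B)$) instead of a proper Grassmann parameterization, and a rough rate-based starting bandwidth. All function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def loo_cv(params, Y, T, X, d, kernel):
    """Leave-one-out CV(B, h) of Step 2, with the (y, t)-integral taken
    against the empirical distribution of (Y, T)."""
    n, p = X.shape
    B = params[:p * d].reshape(p, d)
    h = np.exp(params[-1])
    Z = X @ B
    # ind[j, k] = 1(Y_j <= Y_k, T_j <= T_k), the indicator at (y, t) = (Y_k, T_k)
    ind = ((Y[:, None] <= Y[None, :]) & (T[:, None] <= T[None, :])).astype(float)
    cv = 0.0
    for i in range(n):
        w = np.prod(kernel((Z - Z[i]) / h) / h, axis=1)
        w[i] = 0.0                              # delete the i-th subject
        F_i = (ind.T @ w) / w.sum()             # F_hat^{-i}(Y_k, T_k | B^T X_i)
        cv += np.mean((ind[i] - F_i) ** 2)
    return cv / n

def forward_select(Y, T, X, kernel, max_d=None):
    """Steps 1-3: increase the working dimension until CV stops decreasing."""
    n, p = X.shape
    ind = ((Y[:, None] <= Y[None, :]) & (T[:, None] <= T[None, :])).astype(float)
    best_cv = np.mean((ind - ind.mean(axis=0)[None, :]) ** 2)   # Step 1: CV~(0)
    best = (None, None, 0)
    rng = np.random.default_rng(0)
    for d in range(1, (max_d or p) + 1):
        x0 = np.append(0.1 * rng.normal(size=p * d), np.log(n ** (-1.0 / (8 + d))))
        res = minimize(loo_cv, x0, args=(Y, T, X, d, kernel), method="Nelder-Mead")
        if res.fun >= best_cv:
            break                               # Step 3: CV~(d) increased; stop
        best_cv = res.fun
        best = (res.x[:p * d].reshape(p, d), np.exp(res.x[-1]), d)
    return best                                 # (B_hat, h_hat, d_hat)
```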

Based on the notation and conditions in the Appendix, we can prove the following asymptotic properties of $(\hat B, \hat h)$.

Theorem 2. Suppose that Conditions B1–B5 in the Appendix are satisfied. Then $\Pr(\hat d = d_0) \to 1$, $\hat h = O_p\{n^{-1/(2q+d_0)}\}$, and

$$n^{1/2}\,\mathrm{vec}(\hat B - B_0)\,\mathbf{1}(\hat d = d_0) \rightsquigarrow \mathcal{N}_{pd_0}\big[0,\ V^{-1}(B_0)\, E\{S^{\otimes 2}(B_0)\}\, V^{-1}(B_0)\big]$$

as $n \to \infty$, where $S(B)$ and $V(B)$ are defined in the Appendix.

Without fixing $d_0$ in advance, we allow model (2) to be flexible and to contain the fully nonparametric model [11]. Another important consideration is that this estimator of the central subspace is $n^{1/2}$-consistent under mild assumptions [12], which is fast enough to preserve the large-sample properties of $\hat\mu(t)$ as if the true joint central subspace were known. This property would not hold if we used a nonparametric estimator of the generalized propensity score in place of $\hat B^\top x$, because the convergence rates of nonparametric density estimators are slower than $n^{1/2}$.

Remark 5. From the proof of Theorem 2, we can show that $\hat h_d = O_p\{n^{-1/(2q+d)}\}$ for fixed $d \in \{0, \dots, p\}$. Coupled with Condition B3, the order of the kernel function should satisfy $q > 2 \vee (d+2)/2$. Since we always use symmetric kernel functions, whose orders are even, and prefer the order to be as small as possible, a simple choice is $q = 4 \vee 2\lfloor (d+6)/4 \rfloor$ for each working dimension $d$.

Remark 6. For binary treatments, Huang and Chan [11] used a conditional distribution version of Proposition 2 for identification, and the estimation of B0 is performed through a weighted average of objective functions based on stratified subsamples by treatment statuses. For continuous treatments, such a strategy would require an additional smoothing parameter to borrow information locally for each value of T. Estimation based on Proposition 2 is more straightforward for continuous treatments and is therefore employed.

Remark 7. For regression with a univariate outcome, the single-index model is another popular dimension reduction approach. In fact, it can be viewed as a special case of sufficient dimension reduction with $d_0 = 1$. Using a single-index model for multivariate outcomes, however, is prone to misspecification if different combinations of covariate values are related to each outcome. Conversely, even if a single-index model is indeed true for a multivariate outcome, efficiency is lost and nearly collinear indices are obtained when marginal estimation is performed separately for each outcome. Since the joint sufficient dimension reduction model is defined through multiple outcomes, we do not fix the structural dimension a priori, and the method of [12] is employed to simultaneously estimate the structural dimension and the basis matrix.

2.3. Asymptotic mean squared error

Following [24], we can write $\hat\mu(t) - \mu(t) = A_n + B_n + C_n + O_p(\|\hat B - B_0\|)$, where

$$\begin{aligned} A_n &= \frac{1}{n}\sum_{i=1}^n \frac{Y_i - m(t, X_i; B_0)}{f_T(t \mid X_i)}\, K_{h_1}(T_i - t),\\ B_n &= \frac{1}{n}\sum_{i=1}^n E\left\{ \hat m(t, X_i; B_0) - \mu(t) - \frac{Y_i - m(t, X_i; B_0)}{f_T(t \mid X_i)}\, K_{h_1}(T_i - t) \,\middle|\, \mathcal{D}_i \right\},\\ C_n &= \frac{1}{n}\sum_{i=1}^n \left[ \hat m(t, X_i; B_0) - \mu(t) - \frac{Y_i - m(t, X_i; B_0)}{f_T(t \mid X_i)}\, K_{h_1}(T_i - t) - E\left\{ \hat m(t, X_i; B_0) - \mu(t) - \frac{Y_i - m(t, X_i; B_0)}{f_T(t \mid X_i)}\, K_{h_1}(T_i - t) \,\middle|\, \mathcal{D}_i \right\} \right], \end{aligned}$$

and $\mathcal{D}_i = \{(Y_j, T_j, X_j) : j \ne i\}$. From standard derivations for kernel smoothing estimators, we have $A_n = O_p\{h_1^r + (nh_1)^{-1/2}\}$. Similar to the argument for Theorem 2.1 of [2], we can further show that $B_n = O_p(h_2^r)$ and $C_n = O_p\{(n^2 h_1 h_2^{d_0})^{-1/2}\}$, which are both dominated by $A_n$. Thus, the mean squared error $E[\{\hat\mu(t) - \mu(t)\}^2]$ is asymptotically equivalent to that of $A_n$, which equals $h_1^{2r} B^2(t) + V(t)/(nh_1)$ asymptotically, and the optimal bandwidth is

$$h_{1,\mathrm{opt}} = \left[ \frac{V(t)}{2r B^2(t)} \right]^{1/(2r+1)} n^{-1/(2r+1)}.$$
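For completeness, this bandwidth follows from minimizing the asymptotic mean squared error $h_1^{2r} B^2(t) + V(t)/(nh_1)$ in $h_1$:

$$\frac{d}{dh_1}\left\{ h_1^{2r} B^2(t) + \frac{V(t)}{n h_1} \right\} = 2r\, h_1^{2r-1} B^2(t) - \frac{V(t)}{n h_1^2} = 0 \quad\Longleftrightarrow\quad h_1^{2r+1} = \frac{V(t)}{2r B^2(t)\, n}.$$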

Moreover, the remainder terms involving $h_2$ satisfy $E(B_n) = O(h_2^r)$ and $\operatorname{var}(C_n) = O\{(n^2 h_1 h_2^{d_0})^{-1}\}$. Thus, when $h_1 = O\{n^{-1/(2r+1)}\}$ and $h_2 = O[n^{-(4r+1)/\{(2r+1)(2r+d_0)\}}]$, the mean squared error attains the optimal rate $O\{n^{-2r/(2r+1)}\}$ asymptotically.

According to these theoretical results, when we take the bandwidths at the optimal rates $h_1 = O\{n^{-1/(2r+1)}\}$ and $h_2 = O[n^{-(4r+1)/\{(2r+1)(2r+d_0)\}}]$, the order of the kernel function should be taken as $r > \{3d_0 + (9d_0^2 + 8d_0)^{1/2}\}/4$ to ensure Condition A3 and the optimal rate of the mean squared error. However, if we only require the mean squared error to converge to zero at the optimal rate, $r > (d-1)/2$ for each working dimension $d$ suffices. Thus, we suggest taking $r = 2\lceil d/4 \rceil$ in practice, where $\lceil \cdot \rceil$ is the ceiling function; an illustrative higher-order kernel is given below.
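Since $r = 2\lceil d/4 \rceil$ equals $2$ for $d \le 4$ but $4$ for $5 \le d \le 8$, higher-order kernels may be required. As one concrete construction, standard but not specifically prescribed by the paper, a fourth-order kernel can be obtained from the Gaussian density:

```python
import numpy as np

def kernel_order4(u):
    """Fourth-order Gaussian-based kernel K(u) = (3 - u^2) phi(u) / 2:
    it integrates to 1, its first three moments vanish, and its fourth
    moment is nonzero, so r = 4 in the notation of Section 2.1."""
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return 0.5 * (3.0 - u**2) * phi
```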

2.4. Bandwidth selection

To select the bandwidths $(h_1, h_2)$, one could minimize the estimated mean squared error of $\hat\mu(t)$. Here we follow the idea of [1] and use an infinitesimal version of the generalized jackknife statistic to estimate the variance and bias of $\hat\mu(t)$. Set $(h_1, h_2) = (h_{1,0}\, n^{-1/(2r+1)},\ h_{2,0}\, n^{-(4r+1)/\{(2r+1)(2r+d_0)\}})$, as suggested by the optimal rates in Section 2.3; the goal then reduces to selecting the optimal $(h_{1,0}, h_{2,0})$. Since the bias of $\hat\mu(t)$ is now of order $O\{n^{-2r/(2r+1)}\}$, Schucany et al. [25] suggested using the generalized jackknife pseudovalues defined, for all $i \in \{1, \dots, n\}$, by

$$g^{(i)}(t) = \frac{n^{2r/(2r+1)}\, \hat\mu(t) - (n-1)^{2r/(2r+1)}\, \hat\mu^{-i}(t)}{n^{2r/(2r+1)} - (n-1)^{2r/(2r+1)}},$$

to form the generalized jackknife estimate $g^{(\cdot)}(t) = n^{-1}\sum_{i=1}^n g^{(i)}(t)$. The bias can be estimated by $\hat\mu(t) - g^{(\cdot)}(t)$ and the variance estimator is given by

$$\frac{n-1}{n}\sum_{i=1}^n \{\hat\mu^{-i}(t) - \bar\mu(t)\}^2 = \left\{ \frac{R_n}{(1 - R_n)(n-1)} \right\}^2 \times \frac{1}{n(n-1)}\sum_{i=1}^n \{g^{(i)}(t) - g^{(\cdot)}(t)\}^2,$$

where $R_n = \{(n-1)/n\}^{2r/(2r+1)}$ and $\bar\mu(t) = n^{-1}\sum_{i=1}^n \hat\mu^{-i}(t)$. Related discussion can be found in [5, 26]. Although these estimators have closed forms which can be used in practice, it is time-consuming to obtain $\hat\mu^{-i}(t)$ for each $i \in \{1, \dots, n\}$. To reduce the computational burden, we further apply the infinitesimal jackknife of Jaeckel [15] to the generalized jackknife estimates. Following his derivation, we obtain

$$\widehat{\operatorname{var}}\{\hat\mu(t)\} = \frac{1}{n^2}\sum_{i=1}^n \hat D_i^2(t), \qquad \widehat{\operatorname{bias}}\{\hat\mu(t)\} = \frac{R_n}{(1 - R_n)(n-1)} \times \frac{1}{2n^2}\sum_{i=1}^n \hat D_{ii}(t),$$

where

$$\begin{aligned} \hat D_i(t) &= \hat m(t, X_i; \hat B) - \hat\mu(t) + \sum_{k=1}^n \frac{\{Y_i - \hat m(t, X_k; \hat B)\}\, K_{h_1}(T_i - t)\, \mathbf{K}_{h_2}\{\hat B^\top(X_i - X_k)\}}{\sum_{j=1}^n K_{h_1}(T_j - t)\, \mathbf{K}_{h_2}\{\hat B^\top(X_j - X_k)\}},\\ \hat D_{ii}(t) &= \frac{2n\{Y_i - \hat m(t, X_i; \hat B)\}\, K_{h_1}(T_i - t)\, \mathbf{K}_{h_2}(0)}{\sum_{j=1}^n K_{h_1}(T_j - t)\, \mathbf{K}_{h_2}\{\hat B^\top(X_j - X_i)\}} - 2n\sum_{k=1}^n \frac{\{Y_i - \hat m(t, X_k; \hat B)\}^2\, K_{h_1}^2(T_i - t)\, \mathbf{K}_{h_2}^2\{\hat B^\top(X_i - X_k)\}}{\left[ \sum_{j=1}^n K_{h_1}(T_j - t)\, \mathbf{K}_{h_2}\{\hat B^\top(X_j - X_k)\} \right]^2} - 2\hat D_i(t). \end{aligned}$$

Thus, $E[\{\hat\mu(t) - \mu(t)\}^2]$ can be estimated by $\widehat{\operatorname{var}}\{\hat\mu(t)\} + \widehat{\operatorname{bias}}^2\{\hat\mu(t)\}$, and $(\hat h_{1,0}, \hat h_{2,0})$ is taken to be the minimizer of

$$\int \left[ \widehat{\operatorname{var}}\{\hat\mu(t)\} + \widehat{\operatorname{bias}}^2\{\hat\mu(t)\} \right] d\hat F_T(t),$$

where $\hat F_T(t)$ is the empirical distribution of $T$.
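A sketch of the variance piece of this selection rule, assuming $\hat B$, $(h_1, h_2)$, and a kernel are given; $\hat D_{ii}(t)$ and the bias estimate follow the same pattern, and $(\hat h_{1,0}, \hat h_{2,0})$ can then be chosen by a grid search over the integrated criterion. All names are ours.

```python
import numpy as np

def ij_var_mu(t, Y, T, X, B, h1, h2, kernel):
    """Infinitesimal-jackknife variance estimate n^{-2} sum_i D_i(t)^2
    for mu_hat(t), with D_i(t) as displayed above."""
    n = len(Y)
    Z = X @ B
    wT = kernel((T - t) / h1) / h1
    # W[i, k] = K_{h1}(T_i - t) * K_{h2}{B^T(X_i - X_k)}
    W = wT[:, None] * np.prod(kernel((Z[:, None, :] - Z[None, :, :]) / h2) / h2, axis=2)
    col = W.sum(axis=0)                            # denominator sum over j, for each k
    m_hat = (W * Y[:, None]).sum(axis=0) / col     # m_hat(t, X_k; B), k = 1..n
    mu = m_hat.mean()                              # mu_hat(t)
    # D_i(t) = m_hat(t, X_i) - mu_hat(t) + sum_k (Y_i - m_hat(t, X_k)) W[i,k] / col[k]
    D = m_hat - mu + ((Y[:, None] - m_hat[None, :]) * W / col[None, :]).sum(axis=1)
    return np.sum(D**2) / n**2
```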

3. Simulation studies

In this section, we use Monte Carlo simulations to demonstrate the finite-sample performance of our proposal. Specifically, we simulate data from the following models conditional on $X = (X_1, \dots, X_{10})^\top \sim \mathcal{N}_{10}(0, I_{10})$:

  • M1: $T = X_1 + \varepsilon_1$ with $\varepsilon_1 \sim \mathcal{N}(0, 1)$, $Y(t) = t + X_1 + X_2 + \varepsilon_2$ with $\varepsilon_2 \sim \mathcal{N}(0, 0.1)$,

  • M2: $T = \sum_{k=1}^{10} X_k + \varepsilon_1$ with $\varepsilon_1 \sim \mathcal{N}(0, 1)$, $Y(t) = \sum_{k=1}^{10} X_k + \varepsilon_2$ with $\varepsilon_2 \sim \mathcal{N}(0, 0.1)$,

  • M3: $T \mid X = x \sim 20\, \mathrm{Beta}\{\lambda(x), 20 - \lambda(x)\}$ with $\lambda(x) = 20\, \mathrm{expit}(0.5x_1 + 0.5x_3 - 0.5x_4 - 0.5x_5)$, $Y(t) = (X_2 + X_3 - X_4 - X_5)^2/4 - (t/4 - 5/2)^2 + \varepsilon_1$ with $\varepsilon_1 \sim \mathcal{N}(0, 1/8^2)$,

  • M4: $T = X_1 X_2 + \varepsilon_1$ with $\varepsilon_1 \sim \mathcal{N}(0, 1)$, $Y(t) = X_1 X_2 + \varepsilon_2$ with $\varepsilon_2 \sim \mathcal{N}(0, 0.1)$,

where $\varepsilon_1$, $\varepsilon_2$, and $X$ are mutually independent. For M1, one can directly see that the basis of the joint central subspace is $B_0 = (e_1, e_2)$, where $\{e_1, \dots, e_{10}\}$ is the standard basis of $\mathbb{R}^{10}$, and the true treatment effect is $\mu(t) = t$. The joint central subspace for M2 is $\mathrm{span}\{(1, \dots, 1)^\top\}$ and the true treatment effect is the zero function. The basis of the joint central subspace for M3 is $(e_1 + e_3 - e_4 - e_5,\ e_2 + e_3 - e_4 - e_5)$ with structural dimension 2, and the true treatment effect curve is $\mu(t) = 1 - (t/4 - 5/2)^2$. The basis of the joint central subspace for M4 is $(e_1, e_2)$ and the true treatment effect is also the zero function.
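As a quick check of M1 under the reconstruction above, with the true $B_0$ plugged in and ad hoc bandwidths (a sketch reusing mu_hat from the Section 2.1 sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 10
X = rng.standard_normal((n, p))                        # X ~ N_10(0, I_10)
T = X[:, 0] + rng.standard_normal(n)                   # M1 treatment
Y = T + X[:, 0] + X[:, 1] + np.sqrt(0.1) * rng.standard_normal(n)  # M1 outcome

B0 = np.zeros((p, 2)); B0[0, 0] = B0[1, 1] = 1.0       # true basis (e_1, e_2)
for t in (-1.0, 0.0, 1.0):
    print(t, mu_hat(t, Y, T, X, B0, h1=0.4, h2=0.4))   # should be close to mu(t) = t
```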

The performance of the estimated joint central subspace is evaluated through the error measure $\|\hat B(\hat B^\top \hat B)^{-1}\hat B^\top - B_0(B_0^\top B_0)^{-1}B_0^\top\|$. To evaluate the estimated treatment effects, we use the mean integrated squared error

$$E\left[ \int \{\hat\mu(t) - \mu(t)\}^2\, d\hat F_T(t) \right]$$

as an accuracy measure. All the results are based on 1000 simulations with sample size n ∈ {100, 200, 400}.
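The subspace error measure above is a distance between the projection matrices onto the two subspaces; a short helper, reading the unspecified norm as the Frobenius norm (an assumption on our part):

```python
import numpy as np

def proj_error(B_hat, B0):
    """|| P_{B_hat} - P_{B_0} ||_F for the projections onto span(B_hat) and span(B_0)."""
    P  = B_hat @ np.linalg.inv(B_hat.T @ B_hat) @ B_hat.T
    P0 = B0 @ np.linalg.inv(B0.T @ B0) @ B0.T
    return np.linalg.norm(P - P0)
```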

To investigate the practicality of the proposed joint central subspace for the estimation of continuous treatment effects, we consider four estimators: $\hat\mu(t)$, $\hat\mu_{\mathrm{DR}}(t)$ and $\hat\mu_{\mathrm{SE}}(t)$ as discussed in Remarks 2–3, and the marginal regression plug-in estimator based on the generalized propensity score

$$\hat\mu_{\mathrm{GPS}}(t) = \frac{1}{n}\sum_{i=1}^n \frac{\sum_{j=1}^n Y_j\, K_{h_1}(T_j - t)\, K_{h_3}\{\hat f_T(t \mid X_j) - \hat f_T(t \mid X_i)\}}{\sum_{j=1}^n K_{h_1}(T_j - t)\, K_{h_3}\{\hat f_T(t \mid X_j) - \hat f_T(t \mid X_i)\}}.$$

We compare them with their counterparts in which the proposed balancing score $\hat B^\top X$ is replaced by the original high-dimensional $X$. The simulation results are displayed in Tables 1 and 2. From Table 1, we can see that the proportion of correctly selected structural dimensions under our proposed cross-validation criterion approaches 1 as the sample size increases, while the error measure tends to zero. As for the estimators of the treatment effect function, we can see from Table 2 that the estimators based on joint sufficient dimension reduction always have smaller mean integrated squared errors, which confirms the advantage of our proposal.

Table 1:

The performance of the estimated joint central subspace.

Model   n      d̂=0     d̂=1     d̂=2     d̂=3     d̂=4     d̂≥5     Error mean   Error S.D.
M1 100 0.000 0.073 0.712 0.196 0.019 0.000 0.823 0.5701
200 0.000 0.017 0.874 0.102 0.007 0.000 0.448 0.4082
400 0.000 0.000 0.956 0.044 0.000 0.000 0.212 0.2376
M2 100 0.000 0.935 0.065 0.000 0.000 0.000 0.070 0.2471
200 0.000 0.966 0.034 0.000 0.000 0.000 0.036 0.1815
400 0.000 0.991 0.009 0.000 0.000 0.000 0.010 0.0946
M3 100 0.000 0.213 0.738 0.046 0.003 0.000 0.492 0.4543
200 0.000 0.030 0.938 0.032 0.000 0.000 0.206 0.2593
400 0.000 0.000 0.979 0.021 0.000 0.000 0.100 0.1587
M4 100 0.017 0.003 0.956 0.024 0.000 0.000 0.205 0.3056
200 0.002 0.000 0.996 0.002 0.000 0.000 0.064 0.1023
400 0.001 0.000 0.999 0.000 0.000 0.000 0.027 0.0634

Table 2:

The error measures of the estimated treatment effects.

Estimator:          μ̂(t)            μ̂GPS(t)         μ̂DR(t)          μ̂SE(t)
Model   n      B̂⊤X      X      B̂⊤X      X      B̂⊤X      X      B̂⊤X      X
M1 100 0.739 0.764 0.806 0.816 0.761 0.805 0.911 0.972
200 0.626 0.677 0.753 0.763 0.634 0.703 0.775 0.890
400 0.516 0.579 0.671 0.689 0.508 0.590 0.590 0.758

M2 100 0.643 5.601 8.445 8.641 0.608 5.411 6.653 7.837
200 0.564 3.837 7.765 8.021 0.539 3.725 6.546 7.790
400 0.528 1.929 6.943 7.311 0.510 1.884 6.475 7.960

M3 100 0.515 0.619 0.623 0.623 0.499 0.617 0.484 0.643
200 0.393 0.516 0.519 0.519 0.355 0.514 0.323 0.533
400 0.367 0.458 0.461 0.461 0.317 0.456 0.230 0.469

M4 100 0.080 0.236 0.467 0.475 0.065 0.205 0.252 0.301
200 0.051 0.204 0.419 0.445 0.039 0.177 0.210 0.270
400 0.035 0.170 0.333 0.373 0.027 0.152 0.212 0.272

The estimator $\hat\mu_{\mathrm{GPS}}$ performed poorly because it is an empirical version of $E[E\{Y \mid T = t, f_T(t \mid X)\}]$, but $f_T(t \mid X)$ cannot be estimated at the $n^{1/2}$-rate. The merit of the proposed estimator, as we have shown theoretically, is that $B_0$ can be estimated at the $n^{1/2}$-rate, and the effect of the estimated central subspace on the final estimator is asymptotically negligible because $\mu(t)$ is estimated at a rate slower than $n^{1/2}$. The estimator $\hat\mu_{\mathrm{SE}}$ did not perform well for the reason discussed in Remark 2, namely that an estimator of the joint density $f_{T,Y,B^\top X}(t, y, B^\top x)$ is needed, which typically has a slower convergence rate. The poor performance of these two estimators is most apparent for model M2, which corresponds to a situation of strong confounding. In this scenario, the true treatment effect is $\mu(t) = 0$ but the observed $(Y, T)$ are highly correlated. The estimators $\hat\mu_{\mathrm{GPS}}$ and $\hat\mu_{\mathrm{SE}}$ can perform very poorly for large $t$, while the proposed estimator still performs well because $E\{Y(t) \mid \sum_{k=1}^{10} X_k\}$ can be estimated accurately even for large $t$.

4. Application

We illustrate the proposed method by analyzing the food patterns dataset of [27] to investigate the effect of obesity on systolic blood pressure. In the literature, researchers have found a positive relationship between systolic blood pressure and body mass index [3, 16]. However, the effect can be confounded by diet. The dataset contains 1073 Nepalese adults aged 18 years or older who participated in the baseline survey of the Dhulikhel Heart Study, which collected responses to Food Frequency Questionnaires to identify the food patterns of the participants. The diet variables include whole grains, refined grains, lentils, vegetable oils, solid fats, fatty foods, vegetables, fruits, potatoes, nuts, poultry, red meat, fish, milk, milk products, fast food, processed cereals, noodles, salty snacks, soda drinks, tea/coffee, sweets, and alcohol. Other variables include age and sex. All the covariates are centered and standardized in this analysis.

The estimated joint central subspace is one-dimensional, and the corresponding index coefficients with standard errors are listed in Table 3. Figure 1 displays the estimated treatment effects with 95% pointwise confidence intervals using the proposed method, together with point estimates based on an unadjusted analysis and the estimators of [4, 17]. Compared to the unadjusted estimates, the estimated average systolic blood pressure is smaller for overweight individuals when age, sex and diet are adjusted for. Although the proposed estimator is similar to the estimators of [4, 17], they have different sensitivities to small changes in the bandwidth parameter $h_2$ that smooths over the covariates. We perturbed the bandwidth by multiplicative factors and computed the $L_p$-distance between the perturbed functional estimators and the original ones. The results in Table 4 indicate that the dimension reduction approach is more stable than the other nonparametric estimators.

Table 3:

Estimated coefficients of the linear index for the food patterns dataset.

Covariate Coefficient S.E. Covariate Coefficient S.E.
Age 1 Red meat 0.1059 0.02233
Sex 0.1958 0.02458 Fish 0.0835 0.01255
Whole grains 0.0170 0.02361 Milk −0.0587 0.01713
Refined grains 0.2736 0.02433 Milk products −0.0365 0.03479
Lentils 0.3886 0.02753 Fast food −0.0829 0.02253
Vegetable oils −0.0456 0.01799 Processed cereals 0.3975 0.02415
Solid fats −0.1212 0.01038 Noodles −0.1273 0.01906
Fatty foods −0.0646 0.01823 Salty snacks −0.0285 0.01709
Vegetables −0.1257 0.02912 Soda drinks 0.0408 0.02827
Fruits 0.1533 0.02786 Tea/coffee −0.1042 0.01775
Potatoes −0.1912 0.02059 Sweets −0.0971 0.02202
Nuts 0.2208 0.01808 Alcohol 0.2682 0.01339
Poultry 0.2460 0.01745

Figure 1:

Estimated treatment effect functions. The bold solid line represents the proposed estimator and the thin solid lines represent 95% pointwise confidence intervals; the dotted line represents the naive estimator without adjusting for covariates, the dashed line the Kennedy et al. estimator, and the dot-dashed line the Galvao–Wang estimator.

Table 4:

$L_p$-distance between the estimators with perturbed bandwidth parameters and the original estimators.

Bandwidth:               0.9×ĥ₂                     1.1×ĥ₂
Method           p=0     p=1     p=2        p=0     p=1     p=2
Proposed 0.392 0.099 0.016 0.280 0.070 0.011
Galvao–Wang 1.707 0.454 0.303 2.179 0.564 0.466
Kennedy et al. 1.444 0.856 0.786 0.845 0.533 0.297
Bandwidth:               0.5×ĥ₂                     1.5×ĥ₂
Method           p=0     p=1     p=2        p=0     p=1     p=2
Proposed 4.427 1.384 3.476 1.130 0.390 0.210
Galvao–Wang 1.294 0.364 0.262 3.040 0.743 0.906
Kennedy et al. 31.539 18.743 372.69 3.526 1.690 3.313

5. Discussion

Nonparametric estimation of treatment effect functions has received considerable attention recently because it does not require an a priori assumption on the shape of the function, which may be hard to justify. Two advanced approaches were recently proposed. Kennedy et al. [17] developed a doubly robust estimator which requires two sets of modeling assumptions and is valid when one of the two is correct. Galvao and Wang [4] proposed a fully nonparametric estimator that may suffer from the curse of dimensionality. The pros and cons of the two approaches are discussed extensively in the literature and they are favored by different groups of statisticians. Doubly robust methods are prevalent in the biostatistics literature, whereas the fully nonparametric approach is popular in econometrics. The proposed sufficient dimension reduction approach is a middle ground between the two and preserves the merits of both. On one hand, the sufficient dimension reduction model contains the fully nonparametric model and therefore cannot be misspecified, making it more robust than the doubly robust estimators. On the other hand, when the underlying data generating mechanism has a lower-dimensional structure, the proposed method can greatly reduce the dimensionality and hence attain a better convergence rate than fully nonparametric methods. The complexity is controlled by the effective dimension of the central subspace, which can be estimated consistently by the proposed method. Therefore, our approach is both flexible and parsimonious.

The asymptotic result in Theorem 1 holds for any $n^{1/2}$-consistent estimator $\hat B$ of the central subspace. Thus, one may use other existing sufficient dimension reduction approaches in lieu of the method given in Section 2.2. We use the proposed method because the effective dimension and the central subspace are estimated by a single criterion, which results in a more stable estimator in finite samples, and the asymptotic result can be proven for an estimated dimension. Other methods require separate criteria to estimate the dimension and the basis matrix, and their theoretical properties are often investigated separately and may not be easily combined. Comparisons between different sufficient dimension reduction techniques are reported elsewhere; see, e.g., [12]. Our primary interest in this paper is the treatment effect function, and sufficient dimension reduction is a means to an end rather than an end in itself.

For a binary treatment, Luo et al. [19] proposed separate dimension reduction models for $E(Y_0 \mid X)$ and $E(Y_1 \mid X)$, viz.

$$E(Y_0 \mid X) \perp\!\!\!\perp X \mid B_{M,0}^\top X, \qquad E(Y_1 \mid X) \perp\!\!\!\perp X \mid B_{M,1}^\top X.$$

Under a positivity condition on the original covariates and a measurability condition on the conditional variance of potential outcomes with respect to the reduced covariates, these authors showed that their estimator can be super-efficient. While this modeling approach may be extended to continuous treatments by assuming $E\{Y(t) \mid X\} \perp\!\!\!\perp X \mid B_{M,t}^\top X$ for each $t$, $B_{M,t}$ is typically not estimable at the $n^{1/2}$-rate from observational data, since for most values of $t$ there is only one or even no observed outcome. Also, since all the existing estimators of continuous treatment effect functions are irregular, we cannot compare asymptotic variances directly, and there is no guarantee that alternative methods lead to better efficiency. In fact, the limiting distributions also involve a bias part, whose rate relates to the smoothness of the underlying distribution functions rather than only to the dimension of the balancing score. In our proposal, we focus on using $B^\top X$ to preserve all the information in the observed data, i.e., the conditional distribution of $(Y, T)$ given $X$, where $B$ is estimable at the $n^{1/2}$-rate. Comparisons of the resulting asymptotic mean squared errors between our proposed model and other dimension reduction models require future investigation.

It should be noted that sufficient dimension reduction is not the only remedy for the curse of dimensionality. A natural limitation is that it reduces the dimension via linear indices, which might not be appropriate in some applications. When the underlying structural covariates live on a general manifold instead of a hyperplane, a dimension reduction method based on non-linear functionals may be needed. In contrast, when irrelevant covariates are involved, an additional variable selection step can lead to better performance by reducing the number of coefficients to be estimated [6]. Such complications are common in practice and need further investigation.

In this work we consider the case where the conditional independence assumption holds, which requires the collection of all confounders. However, in many applications researchers encounter non-ignorable selection of exposure, in which there are unobserved confounding variables. In such cases, instrumental variables are often used to identify certain partial treatment effects under different sets of assumptions. The structure of the sufficient dimension reduction model and the corresponding estimation would then be quite different from those in the present paper.

Acknowledgment

The authors thank the Editor-in-Chief, an Associate Editor and three reviewers for constructive comments. Both authors are partially supported by the United States National Institutes of Health grant R01HL122212, and the second author is partially supported by the United States National Science Foundation grant DMS1711952. We also thank the Kathmandu University Hospital and Drs. Rajendra Koju, Biraj Karmacharya and Archana Shrestha for providing the Dhulikhel Heart Study data.

Appendix

A.1. Proof of Proposition 1

Proof. For each fixed tT and any measurable functions f1, f2, we have

$$\begin{aligned} E[f_1\{Y(t)\} f_2(T) \mid B_0^\top X] &= E\big( E[f_1\{Y(t)\} f_2(T) \mid X] \mid B_0^\top X \big)\\ &= E\big( E[f_1\{Y(t)\} \mid X]\, E\{f_2(T) \mid X\} \mid B_0^\top X \big)\\ &= E\big( E[f_1\{Y(t)\} \mid B_0^\top X]\, E\{f_2(T) \mid B_0^\top X\} \mid B_0^\top X \big)\\ &= E[f_1\{Y(t)\} \mid B_0^\top X]\, E\{f_2(T) \mid B_0^\top X\}. \end{aligned}$$

Thus, we have $Y(t) \perp\!\!\!\perp T \mid B_0^\top X$ for all $t \in \mathcal{T}$. ◻

A.2. Proof of Proposition 2

Proof. Let $\mathcal{S}_{(Y,T) \mid X}$ denote the central subspace for the regression of $(Y, T)$ on $X$. First note that the density function can be decomposed into $f_{Y,T}(y, t \mid x) = f_{Y(t),T}(y, t \mid x) = f_{Y(t)}(y \mid x)\, f_T(t \mid x)$. Thus, we have $\mathcal{S}_{(Y,T) \mid X} \subseteq \mathrm{span}(B_0)$. Conversely, from the facts

$$f_T(t \mid x) = \int f_{Y,T}(y, t \mid x)\, dy \quad \text{and} \quad f_{Y(t)}(y \mid x) = \frac{f_{Y,T}(y, t \mid x)}{\int f_{Y,T}(y, t \mid x)\, dy},$$

one can easily see that $\mathrm{span}(B_0) \subseteq \mathcal{S}_{(Y,T) \mid X}$. ◻

A.3. Proof of Proposition 3

Proof. By paralleling the proof steps of [7, 8], one can show that $\mathrm{MISE}_B(h) = \mathrm{AMISE}_B(h)\{1 + o_p(1)\}$, where

$$\begin{aligned} \mathrm{AMISE}_B(h) &= \int\!\!\int \left[ h^{2q} B^2(y, t \mid B^\top x) + (nh^d)^{-1} V(y, t \mid B^\top x) \right] dF_X(x)\, dF_{Y,T}(y, t),\\ B(y, t \mid B^\top x) &= \frac{\mu_{q,K}}{q!} \times \frac{D^q_{B^\top x}\{F(y, t \mid B^\top x)\, f_{B^\top X}(B^\top x)\} - F(y, t \mid B^\top x)\, D^q_{B^\top x} f_{B^\top X}(B^\top x)}{f_{B^\top X}(B^\top x)}\, (1_d, \dots, 1_d),\\ V(y, t \mid B^\top x) &= \sigma_K^{2d}\, \frac{F(y, t \mid B^\top x)\{1 - F(y, t \mid B^\top x)\}}{f_{B^\top X}(B^\top x)}, \end{aligned}$$

$F_X(x)$ is the marginal distribution function of $X$, $\mu_{q,K} = \int v^q K(v)\, dv$, $\sigma_K^2 = \int K^2(v)\, dv$, $D^q$ denotes the $q$th-order differentiation in tensor form, and $1_d = (1, \dots, 1)^\top$ has dimension $d$. Thus, when $h \to 0$ and $nh^d \to \infty$, the last two terms of (5) converge to zero and are dominated by the first two terms. Since $b_0^2(B) \ge 0$, with equality when $\mathrm{span}(B)$ is a joint sufficient dimension reduction subspace, the minimum of the prediction risk occurs when $\mathrm{span}(B) \supseteq \mathrm{span}(B_0)$ for large enough $n$. Finally, when $h = c_d\, n^{-1/(2q+d)}$ with

$$c_d = \left[ \frac{d \int\!\!\int V(y, t \mid B^\top x)\, dF_X(x)\, dF_{Y,T}(y, t)}{2q \int\!\!\int B^2(y, t \mid B^\top x)\, dF_X(x)\, dF_{Y,T}(y, t)} \right]^{1/(2q+d)},$$

AMISEB(h) has minimum

$$\left[ c_d^{2q} \int\!\!\int B^2(y, t \mid B^\top x)\, dF_X(x)\, dF_{Y,T}(y, t) + c_d^{-d} \int\!\!\int V(y, t \mid B^\top x)\, dF_X(x)\, dF_{Y,T}(y, t) \right] n^{-2q/(d+2q)},$$

which is increasing in $d$ as $n \to \infty$. It is then clear that the prediction risk attains its minimum when $B = B_0$ and $h = c_{d_0}\, n^{-1/(2q+d_0)}$, the optimal bandwidth that minimizes $\mathrm{AMISE}_{B_0}(h)$. ◻

A.4. Proof of Theorem 1

To derive the large-sample properties of $\hat\mu(t)$, we need the following conditions:

  • A1. $\partial^r_{(t, B_0^\top x)} m(t, x; B_0)$, $\partial^r_{(t, B_0^\top x)} f_T(t \mid B_0^\top x)$, and $\partial^r_u f_{B_0^\top X}(u)$ are Lipschitz continuous in $t$, with the Lipschitz constant independent of $x$.

  • A2. $\inf_{(t,x)} f_T(t \mid B_0^\top x) > 0$.

  • A3. $h_1 \to 0$, $n h_1 h_2^{2r} \to 0$, and $n h_1 h_2^{2d_0} \to \infty$ as $n \to \infty$.

  • A4. $\Pr(\hat d = d_0) \to 1$ and $\hat B - B_0 = o_p\{h_1^{2r} + h_2^{2r} + (n h_1 h_2^{d_0})^{-1}\}$.

Conditions A1–A3 are similar to those in [17], and A4 is satisfied by the estimator of [12].

Proof. First we establish some asymptotic properties of the kernel smoothing estimators. For k ∈ {0, 1}, let

$$\hat m_k(t, x; B) = \frac{1}{n}\sum_{i=1}^n Y_i^k\, K_{h_1}(T_i - t)\, \mathbf{K}_{h_2}\{B^\top(X_i - x)\}, \qquad m_k(t, x; B) = E(Y^k \mid T = t, B^\top X = B^\top x)\, f_T(t \mid B^\top x)\, f_{B^\top X}(B^\top x).$$

By Condition A1, an application of Theorem II.37 of [23] shows that

$$\sup_{t, x, B}\left| \hat m_k(t, x; B) - m_k(t, x; B) \right| = O(h_1^r) + O(h_2^r) + o\{\ln n\, (n h_1 h_2^d)^{-1/2}\} \tag{A.1}$$

almost surely. Now we decompose μ^(t)μ(t) into

$$\frac{1}{n}\sum_{i=1}^n \{\hat m(t, X_i; \hat B) - \hat m(t, X_i; B_0)\} + \frac{1}{n}\sum_{i=1}^n \{\hat m(t, X_i; B_0) - m(t, X_i; B_0)\}. \tag{A.2}$$

By using a Taylor expansion and Theorem 4 in [12], we see that the first term in (A.2) is

$$\frac{1}{n}\sum_{i=1}^n \partial_{\mathrm{vec}(B)} \hat m(t, X_i; B_0)\, \mathrm{vec}(\hat B - B_0) + O_p(\|\hat B - B_0\|^2) = O_p(\|\hat B - B_0\|). \tag{A.3}$$

Also, by using the Taylor expansion and (A.1), we have

$$\hat m(t, x; B) - m(t, x; B) = \frac{\hat m_1(t, x; B) - m_1(t, x; B)}{m_0(t, x; B)} - \frac{m(t, x; B)\{\hat m_0(t, x; B) - m_0(t, x; B)\}}{m_0(t, x; B)} + O_p\left( h_1^{2r} + h_2^{2r} + \frac{1}{n h_1 h_2^d} \right).$$

Thus, the second term in (A.2) can be further written as

$$\begin{aligned} &\frac{1}{n^2}\sum_{i,j=1}^n \frac{Y_j K_{h_1}(T_j - t)\, \mathbf{K}_{h_2}\{B_0^\top(X_j - X_i)\} - m_1(t, X_i; B_0)}{m_0(t, X_i; B_0)} - \frac{1}{n^2}\sum_{i,j=1}^n \frac{m(t, X_i; B_0)}{m_0(t, X_i; B_0)}\left[ K_{h_1}(T_j - t)\, \mathbf{K}_{h_2}\{B_0^\top(X_j - X_i)\} - m_0(t, X_i; B_0) \right] + O_p\left( h_1^{2r} + h_2^{2r} + \frac{1}{n h_1 h_2^d} \right)\\ &\equiv U_1 + U_2 + O_p\left( h_1^{2r} + h_2^{2r} + \frac{1}{n h_1 h_2^d} \right). \tag{A.4} \end{aligned}$$

Note that U1 and U2 are second-order U-statistics. A direct calculation shows that the first-order Hájek projections are

$$\tilde U_1 = \frac{1}{n}\sum_{i=1}^n \frac{Y_i K_{h_1}(T_i - t)\, f_{B_0^\top X}(B_0^\top X_i)}{m_0(t, X_i; B_0)} + O_p(h_1^r n^{-1/2}) + O_p(h_2^r), \tag{A.5}$$

$$\tilde U_2 = -\frac{1}{n}\sum_{i=1}^n \frac{m(t, X_i; B_0)\, K_{h_1}(T_i - t)\, f_{B_0^\top X}(B_0^\top X_i)}{m_0(t, X_i; B_0)} + O_p(h_1^r n^{-1/2}) + O_p(h_2^r). \tag{A.6}$$

Further, from Theorem 6 of [21], the remainder terms between the U-statistics and their first-order Hájek projections are bounded by $O_p\{(n h_1 h_2^{d_0})^{-1}\}$. Thus, by substituting (A.3)–(A.6) into (A.2), we have

$$\hat\mu(t) - \mu(t) = \frac{1}{n}\sum_{i=1}^n \frac{Y_i - m(t, X_i; B_0)}{f_T(t \mid B_0^\top X_i)}\, K_{h_1}(T_i - t) + O_p(\|\hat B - B_0\|) + O_p\left( h_1^{2r} + h_2^{2r} + \frac{1}{n h_1 h_2^d} \right).$$

Together with the asymptotic mean $h_1^r B(t)$ and variance $V(t)/(nh_1)$ of the i.i.d. representation above and the Central Limit Theorem, Theorem 1 is established. ◻

A.5. Proof of Theorem 2

The corresponding large-sample properties rely on the smoothness of the following parameter functions, defined for $m \in \{0, 1, 2\}$ by

$$f^{[m]}(x; B) = \partial^m_{B^\top x}\left[ E\{(X - x)^{\otimes m} \mid B^\top X = B^\top x\}\, f_{B^\top X}(B^\top x) \right], \qquad G^{[m]}(y, t \mid x; B) = \partial^m_{B^\top x}\left[ \Pr(Y \le y, T \le t \mid B^\top X = B^\top x)\, E\{(X - x)^{\otimes m} \mid B^\top X = B^\top x\}\, f_{B^\top X}(B^\top x) \right].$$

The partial derivatives $\partial^m_{\mathrm{vec}(B)} \hat F(y, t \mid B^\top x)$ can be shown to converge uniformly to $F^{[m]}(y, t \mid x; B) = \sum_{\ell=0}^m G^{[\ell]}(y, t \mid x; B)\, f_{m-\ell}(x; B)$, where

$$f_1(x; B) = -\frac{f^{[1]}(x; B)}{f^2_{B^\top X}(B^\top x)}, \qquad f_2(x; B) = \frac{2\{f^{[1]}(x; B)\}^{\otimes 2}}{f^3_{B^\top X}(B^\top x)} - \frac{f^{[2]}(x; B)}{f^2_{B^\top X}(B^\top x)}.$$

Based on these properties, we can define the corresponding score vector and information matrix of CV(B, h), viz.

$$S(B) = \int \{\mathbf{1}(Y \le y, T \le t) - F(y, t \mid B^\top X)\}\, F^{[1]}(y, t \mid X; B)\, dF_{Y,T}(y, t), \qquad V(B) = E\left[ \int \left( \{F^{[1]}(y, t \mid X; B)\}^{\otimes 2} - \{\mathbf{1}(Y \le y, T \le t) - F(y, t \mid B^\top X)\}\, F^{[2]}(y, t \mid X; B) \right) dF_{Y,T}(y, t) \right],$$

where FY,T(y, t) is the joint distribution of (Y, T).

To prove Theorem 2, we need the following regularity conditions:

  • B1. $\partial_u^{q+2} \Pr(Y \le y, T \le t \mid B^\top X = u)$, $\partial_u^{q+m} E\{(X - x)^{\otimes m} \mid B^\top X = u\}$, and $\partial_u^{q+2} f_{B^\top X}(u)$ are Lipschitz continuous in $u$, with the Lipschitz constants independent of $(y, t, x, B)$.

  • B2. $\inf_{(x, B)} f_{B^\top X}(B^\top x) > 0$.

  • B3. For $d \ge 1$, there exist $\delta \in (1/(4q),\ 1/\max(2d+2,\, d+4))$ and positive constants $h_{\ell,d}$ and $h_{u,d}$ such that $h$ falls in the interval $\mathcal{H}_{\delta, n} = [h_{\ell,d}\, n^{-\delta},\ h_{u,d}\, n^{-\delta}]$.

  • B4. $\inf_{\{B : d < d_0\}} b_0^2(B) > 0$, and $b_0^2(B) = 0$ if and only if $B = B_0$ when $d = d_0$.

  • B5. $V(B_{d,0})$ is non-singular for $d \le d_0$.

The proof steps of Theorem 2 are similar to those of [12]. Here we give an outline and indicate some key properties. As mentioned in [20], the uniqueness of $B_0$ requires a certain local coordinate system of the Grassmann manifold. Without loss of generality, we assume $B$ is already parameterized in such a way; this is in fact a necessary condition for Condition B5 and not an additional assumption. First, we derive the convergence rates of the involved kernel smoothing estimators. Let

$$\hat f(B^\top x) = \frac{1}{n}\sum_{i=1}^n \mathbf{K}_h\{B^\top(X_i - x)\}, \qquad \hat f_c^{[m]}(x; B) = \partial^m_{\mathrm{vec}(B)} \hat f(B^\top x) - f^{[m]}(x; B), \qquad \hat G_c^{[m]}(y, t \mid x; B) = \partial^m_{\mathrm{vec}(B)}\{\hat F(y, t \mid B^\top x)\, \hat f(B^\top x)\} - G^{[m]}(y, t \mid x; B).$$

Moreover, we define the asymptotic linear representations for $\partial^m_{\mathrm{vec}(B)} \hat F(y, t \mid B^\top x)$ with $m \in \{0, 1\}$ as follows:

$$\begin{aligned} \frac{1}{n}\sum_{i=1}^n \xi_i(y, t \mid x; B) &= \frac{\hat G_c^{[0]}(y, t \mid x; B)}{f_{B^\top X}(B^\top x)} - \frac{F(y, t \mid B^\top x)\, \hat f_c^{[0]}(x; B)}{f_{B^\top X}(B^\top x)},\\ \frac{1}{n}\sum_{i=1}^n \xi_i^{[1]}(y, t \mid x; B) &= \frac{\hat G_c^{[1]}(y, t \mid x; B)}{f_{B^\top X}(B^\top x)} - \frac{F(y, t \mid B^\top x)\, \hat f_c^{[1]}(x; B)}{f_{B^\top X}(B^\top x)} - \frac{f^{[1]}(x; B)\, \hat G_c^{[0]}(y, t \mid x; B)}{f^2_{B^\top X}(B^\top x)} + \frac{\left[ 2 F(y, t \mid B^\top x)\, f^{[1]}(x; B) - G^{[1]}(y, t \mid x; B) \right] \hat f_c^{[0]}(x; B)}{f^2_{B^\top X}(B^\top x)}. \end{aligned}$$

By verifying the Euclidean class properties through Lemmas 2.12 and 2.14 of [22] and Theorem II.37 of [23], we get

$$\sup_{x, B}\left| \hat f_c^{[m]}(x; B) \right| = O(h^q) + o\{\ln n\, n^{-1/2} h^{-(d+m)/2}\} \ \text{a.s.}, \qquad \sup_{x, B}\left| \hat G_c^{[m]}(y, t \mid x; B) \right| = O(h^q) + o\{\ln n\, n^{-1/2} h^{-(d+m)/2}\} \ \text{a.s.},$$

for all $m \in \{0, 1, 2\}$. By applying Taylor's theorem and the results above, we can further ensure from Conditions B2–B3 that

$$\sup_{x, B}\left| \hat F(y, t \mid B^\top x) - F(y, t \mid B^\top x) - \frac{1}{n}\sum_{i=1}^n \xi_i(y, t \mid x; B) \right| = o_p(n^{-1/2}) \quad \text{and} \quad \sup_{x, B}\left| \partial_{\mathrm{vec}(B)} \hat F(y, t \mid B^\top x) - F^{[1]}(y, t \mid x; B) - \frac{1}{n}\sum_{i=1}^n \xi_i^{[1]}(y, t \mid x; B) \right| = o_p(n^{-1/2}).$$

The second step is to derive the uniform convergence of $\mathrm{CV}(B, h)$ to $\mathrm{ECV}(B, h) = \sigma_0^2 + b_0^2(B) + \mathrm{AMISE}_B(h)$. This can be done by paralleling the arguments of [12], and we have

$$\begin{cases} \displaystyle \sup_{B, h}\left| \frac{\mathrm{CV}(B, h) - \mathrm{ECV}(B, h)}{\mathrm{AMISE}_B(h)} \right| = o(1) \ \text{a.s.} & \text{for } \mathrm{span}(B) \supseteq \mathrm{span}(B_0),\\[2ex] \displaystyle \sup_{B, h}\left| \frac{\mathrm{CV}(B, h) - \mathrm{ECV}(B, h)}{b_0(B)\, \mathrm{AMISE}^{1/2}_B(h)} \right| = O(1) \ \text{a.s.} & \text{for } \mathrm{span}(B) \not\supseteq \mathrm{span}(B_0). \end{cases} \tag{A.7}$$

Now let $\mathrm{DCV}(B, h) = \mathrm{CV}(B, h) - \mathrm{ECV}(B, h)$. From the inequalities

$$\begin{aligned} 1 &= \Pr\{\mathrm{CV}(\hat B, \hat h) \le \mathrm{CV}(B_0, h_0)\}\\ &\le \Pr\{b_0^2(\hat B) < \varepsilon\} + \Pr\{b_0^2(\hat B) \ge \varepsilon,\ \mathrm{DCV}(\hat B, \hat h) - \mathrm{DCV}(B_0, h_0) \le \mathrm{ECV}(B_0, h_0) - \mathrm{ECV}(\hat B, \hat h)\}\\ &\le \Pr\{b_0^2(\hat B) < \varepsilon\} + \Pr\left\{ b_0^2(\hat B) \ge \varepsilon,\ \frac{\mathrm{DCV}(\hat B, \hat h)}{b_0(\hat B)} + \frac{\mathrm{DCV}(B_0, h_0)}{\varepsilon^{1/2}} \le -\varepsilon^{1/2} + \frac{\mathrm{AMISE}_{B_0}(h_0)}{\varepsilon^{1/2}} \right\} \tag{A.8} \end{aligned}$$

for any $\varepsilon > 0$, (A.7) further implies that, for all $\varepsilon > 0$,

$$\lim_{n \to \infty} \Pr\{b_0^2(\hat B) < \varepsilon\} = 1. \tag{A.9}$$

At the same time, by taking $\varepsilon = \inf_{\{B : d < d_0\}} b_0^2(B)/2$ and using Boole's inequality again, we have

$$\Pr\left\{ b_0^2(\hat B) < \tfrac12 \inf_{\{B : d < d_0\}} b_0^2(B) \right\} = \Pr\left\{ b_0^2(\hat B) < \tfrac12 \inf_{\{B : d < d_0\}} b_0^2(B),\ \hat d < d_0 \right\} + \Pr\left\{ b_0^2(\hat B) < \tfrac12 \inf_{\{B : d < d_0\}} b_0^2(B),\ \hat d \ge d_0 \right\} \le \Pr(\hat d \ge d_0).$$

Hence, (A.9) further implies that

$$\lim_{n \to \infty} \Pr(\hat d \ge d_0) = 1.$$

The third step is to derive the asymptotic properties of $\hat B_d$ for $d \le d_0$. Let $h_{d,0} = c_d\, n^{-1/(2q+d)}$. Since $(\hat B_d, \hat h_d)$ minimizes $\mathrm{CV}(B, h)$ for fixed $d$, Boole's inequality yields an inequality similar to (A.8):

$$1 \le \Pr\{b_0^2(\hat B_d) < \varepsilon\} + \Pr\left\{ b_0^2(\hat B_d) \ge \varepsilon,\ \frac{\mathrm{DCV}(\hat B_d, \hat h_d)}{b_0(\hat B_d)} + \frac{\mathrm{DCV}(B_{d,0}, h_{d,0})}{\varepsilon^{1/2}} \le -\varepsilon^{1/2} + \frac{\mathrm{AMISE}_{B_{d,0}}(h_{d,0})}{\varepsilon^{1/2}} \right\} \tag{A.10}$$

for any $\varepsilon > 0$. Also note that $b_0^2(\hat B_d) \ge \varepsilon$ implies that $\mathrm{span}(\hat B_d) \not\supseteq \mathrm{span}(B_0)$. Eq. (A.7) and Condition B3 further imply that $\mathrm{DCV}(\hat B_d, \hat h_d)/b_0(\hat B_d) = O_p\{\mathrm{AMISE}^{1/2}_{\hat B_d}(\hat h_d)\} \to 0$ and $\mathrm{DCV}(B_{d,0}, h_{d,0})/\varepsilon^{1/2} = o_p\{\mathrm{AMISE}_{B_{d,0}}(h_{d,0})\} \to 0$ as $n \to \infty$. Coupled with the fact that $\mathrm{AMISE}_{B_{d,0}}(h_{d,0}) \to 0$ as $n \to \infty$, the inequality in the second term of (A.10) becomes $0 \ge \varepsilon^{1/2}$ as $n \to \infty$, which is impossible. Hence, one has, for all $\varepsilon > 0$,

$$\lim_{n \to \infty} \Pr\{b_0^2(\hat B_d) < \varepsilon\} = 1.$$

Since $\mathrm{span}(\hat B_d) \supseteq \mathrm{span}(B_0)$ implies $b_0^2(\hat B_d) = 0$, we now consider the case where $\mathrm{span}(\hat B_d) \supseteq \mathrm{span}(B_0)$ and, hence, $\hat B_d \to B_{d,0}$ in probability. By a first-order Taylor expansion of $\partial_{\mathrm{vec}(B)} \mathrm{CV}(B, h)$ at $B = B_{d,0}$ and $\partial_{\mathrm{vec}(B)} \mathrm{CV}(\hat B_d, \hat h_d) = 0$, one has

$$\left[ I_{pd} + V^{-1}(B_{d,0})\left\{ \partial^2_{\mathrm{vec}(B)} \mathrm{CV}(\hat B_d^*, \hat h_d) - V(B_{d,0}) \right\} \right] n^{1/2}\, \mathrm{vec}(\hat B_d - B_{d,0}) = -n^{1/2}\, V^{-1}(B_{d,0})\, \partial_{\mathrm{vec}(B)} \mathrm{CV}(B_{d,0}, \hat h_d),$$

where $\mathrm{vec}(\hat B_d^*)$ lies on the line segment between $\mathrm{vec}(\hat B_d)$ and $\mathrm{vec}(B_{d,0})$. Coupled with the facts that

$$n^{1/2}\, \partial_{\mathrm{vec}(B)} \mathrm{CV}(B_{d,0}, \hat h_d) \rightsquigarrow \mathcal{N}\big[0, E\{S^{\otimes 2}(B_{d,0})\}\big] \quad \text{and} \quad \partial^2_{\mathrm{vec}(B)} \mathrm{CV}(\hat B_d^*, \hat h_d) - V(B_{d,0}) \to 0 \ \text{in probability},$$

we have, as n → ∞,

$$n^{1/2}\, \mathrm{vec}(\hat B_d - B_{d,0}) \rightsquigarrow \mathcal{N}\big[0,\ V^{-1}(B_{d,0})\, E\{S^{\otimes 2}(B_{d,0})\}\, V^{-1}(B_{d,0})\big]. \tag{A.11}$$

Using a Taylor expansion and $\partial_{\mathrm{vec}(B)} b_0^2(B_{d,0}) = 0$, we further deduce from Eq. (A.11) that

$$b_0^2(\hat B_d) = \frac{1}{2}\, \mathrm{vec}(\hat B_d - B_{d,0})^\top\, \partial^2_{\mathrm{vec}(B)} b_0^2(\hat B_d^*)\, \mathrm{vec}(\hat B_d - B_{d,0}) = O_p(n^{-1})$$

for $d \le d_0$, and hence $b_0^2(\hat B) = O_p(n^{-1})$. Coupled with Condition B3, this further implies that

$$\frac{b_0^2(\hat B)}{\mathrm{AMISE}_{\hat B}(\hat h)} = o_p(1).$$

The next step is to show the consistency of $(\hat B, \hat h)$. Let

$$\begin{aligned} E_0 &= \left\{ b_0^2(\hat B) < \frac{\ln n}{n},\ \hat h \in \mathcal{H}_{1/(2q+\hat d),\, n},\ \hat d = d_0 \right\},\\ E_1 &= \left\{ b_0^2(\hat B) \ge \frac{\ln n}{n} \right\}, \qquad E_2 = \{\hat d < d_0\},\\ E_3 &= \left\{ b_0^2(\hat B) < \frac{\ln n}{n},\ \hat d \ge d_0,\ \hat h \in \mathcal{H}_{\hat\delta, n} \text{ with } \hat\delta \ne \frac{1}{2q + \hat d} \right\},\\ E_4 &= \left\{ b_0^2(\hat B) < \frac{\ln n}{n},\ \hat h \in \mathcal{H}_{1/(2q+\hat d),\, n},\ \hat d > d_0 \right\}, \quad \text{and}\\ E_{\mathrm{con}} &= \left\{ \mathrm{DCV}(\hat B, \hat h) - \mathrm{DCV}(B_0, h_0) \le \mathrm{ECV}(B_0, h_0) - \mathrm{ECV}(\hat B, \hat h) \right\}. \end{aligned}$$

Since $(\hat B, \hat h)$ minimizes $\mathrm{CV}(B, h)$, Boole's inequality yields

$$1 = \Pr\{\mathrm{CV}(\hat B, \hat h) \le \mathrm{CV}(B_0, h_0)\} \le \Pr(E_0) + \sum_{m=1}^4 \Pr(E_{\mathrm{con}} \cap E_m).$$

The consistency follows by showing $\Pr(E_{\mathrm{con}} \cap E_m) \to 0$ as $n \to \infty$ for all $m \in \{1, \dots, 4\}$, which implies that

$$\lim_{n \to \infty} \Pr(E_0) = 1. \tag{A.12}$$

Finally, the asymptotic normality is ensured by (A.12) and (A.11). ◻


References

  • [1] Aerts M, Claeskens G, Hens N, Molenberghs G, Local multiple imputation, Biometrika 89 (2002) 375–388.
  • [2] Cheng PE, Nonparametric estimation of mean functionals with data missing at random, J. Amer. Statist. Assoc. 89 (1994) 81–87.
  • [3] Dua S, Bhuker M, Sharma P, Dhall M, Kapoor S, et al., Body mass index relates to blood pressure among adults, North Amer. J. Med. Sci. 6 (2014) 89.
  • [4] Galvao AF, Wang L, Uniformly semiparametric efficient estimation of treatment effects with a continuous treatment, J. Amer. Statist. Assoc. 110 (2015) 1528–1542.
  • [5] Gray HL, Schucany WR, The Generalized Jackknife Statistic, Marcel Dekker, New York, 1972.
  • [6] Hall P, Li Q, Racine JS, Nonparametric estimation of regression functions in the presence of irrelevant regressors, The Review of Economics and Statistics 89 (2007) 784–789.
  • [7] Härdle WK, Hall P, Marron JS, How far are automatically chosen regression smoothing parameters from their optimum? J. Amer. Statist. Assoc. 83 (1988) 86–101.
  • [8] Härdle WK, Marron JS, Optimal bandwidth selection in nonparametric regression function estimation, Ann. Statist. 13 (1985) 1465–1481.
  • [9] Hill JL, Bayesian nonparametric modeling for causal inference, J. Comput. Graph. Statist. 20 (2011) 217–240.
  • [10] Hirano K, Imbens GW, The propensity score with continuous treatments, in: Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, Wiley, Chichester, 2004, pp. 73–84.
  • [11] Huang M-Y, Chan KCG, Joint sufficient dimension reduction and estimation of conditional and average treatment effects, Biometrika 104 (2017) 583–596.
  • [12] Huang M-Y, Chiang C-T, An effective semiparametric estimation approach for the sufficient dimension reduction model, J. Amer. Statist. Assoc. 112 (2017) 1296–1310.
  • [13] Imai K, van Dyk DA, Causal inference with general treatment regimes: Generalizing the propensity score, J. Amer. Statist. Assoc. 99 (2004) 854–866.
  • [14] Imbens GW, Nonparametric estimation of average treatment effects under exogeneity: A review, The Review of Economics and Statistics 86 (2004) 4–29.
  • [15] Jaeckel LA, The Infinitesimal Jackknife, Bell Telephone Laboratories, 1972.
  • [16] Kaufman JS, Asuzu MC, Mufunda J, Forrester T, Wilks R, Luke A, Long AE, Cooper RS, Relationship between blood pressure and body mass index in lean populations, Hypertension 30 (1997) 1511–1516.
  • [17] Kennedy EH, Ma Z, McHugh MD, Small DS, Nonparametric methods for doubly robust estimation of continuous treatment effects, J. R. Stat. Soc. Ser. B Stat. Methodol. 79 (2017) 1229–1245.
  • [18] Li K-C, Sliced inverse regression for dimension reduction, J. Amer. Statist. Assoc. 86 (1991) 316–342.
  • [19] Luo W, Zhu Y, Ghosh D, On estimating regression-based causal effects using sufficient dimension reduction, Biometrika 104 (2017) 51–65.
  • [20] Ma Y, Zhu L, Efficient estimation in sufficient dimension reduction, Ann. Statist. 41 (2013) 250–268.
  • [21] Nolan D, Pollard D, U-processes: Rates of convergence, Ann. Statist. 15 (1987) 780–799.
  • [22] Pakes A, Pollard D, Simulation and the asymptotics of optimization estimators, Econometrica 57 (1989) 1027–1057.
  • [23] Pollard D, Convergence of Stochastic Processes, Springer, New York, 1984.
  • [24] Ruppert D, Wand MP, Multivariate locally weighted least squares regression, Ann. Statist. 22 (1994) 1346–1370.
  • [25] Schucany WR, Gray H, Owen D, On bias reduction in estimation, J. Amer. Statist. Assoc. 66 (1971) 524–533.
  • [26] Schucany WR, Sommers JP, Improvement of kernel type density estimators, J. Amer. Statist. Assoc. 72 (1977) 420–423.
  • [27] Shrestha A, Koju RP, Beresford SA, Chan KCG, Karmacharya BM, Fitzpatrick AL, Food patterns measured by principal component analysis and obesity in the Nepalese adult, Heart Asia 8 (2016) 46–53.
  • [28] Yin X, Li B, Sufficient dimension reduction based on an ensemble of minimum average variance estimators, Ann. Statist. 39 (2011) 3392–3416.
  • [29] Zhu L-P, Zhu L-X, Feng Z-H, Dimension reduction in regressions through cumulative slicing estimation, J. Amer. Statist. Assoc. 105 (2010) 1455–1466.
