Biometrika. 2017 May 19;104(3):583–596. doi: 10.1093/biomet/asx028

Joint sufficient dimension reduction and estimation of conditional and average treatment effects

Ming-Yueh Huang, Kwun Chuen Gary Chan
PMCID: PMC5793490  PMID: 29430034

Summary

The estimation of treatment effects based on observational data usually involves multiple confounders, and dimension reduction is often desirable and sometimes inevitable. We first clarify the definition of a central subspace that is relevant for the efficient estimation of average treatment effects. A criterion is then proposed to simultaneously estimate the structural dimension, the basis matrix of the joint central subspace, and the optimal bandwidth for estimating the conditional treatment effects. The method can easily be implemented by forward selection. Semiparametric efficient estimation of average treatment effects can be achieved by averaging the conditional treatment effects with a different data-adaptive bandwidth to ensure optimal undersmoothing. Asymptotic properties of the estimated joint central subspace and the corresponding estimator of average treatment effects are studied. The proposed methods are applied to a nutritional study, where the covariate dimension is reduced from 11 to an effective dimension of one.

Keywords: Forward selection, High-order kernel, Joint central subspace, Optimal bandwidth, Semiparametric efficiency, Undersmoothing

1. Introduction

Investigating the causal effect of a treatment on an outcome is often the primary interest in medical and social studies. While randomization is the gold standard in identifying treatment effects, often only observational data are available. A major challenge in observational studies is confounding, where the treatment and the outcome of interest are associated with other pretreatment variables, potentially leading to seriously biased estimation of average treatment effects. A simple method of dealing with confounding is local matching. Under conditional independence, the distribution of outcomes in a specific group behaves just like that of a random sample from the group while conditioning on values of confounding variables. Thus, a consistent estimator can be obtained by averaging the differences between groups over the distribution of confounding variables. Hahn (1998) introduced a class of nonparametric imputation estimators that use local matching and showed that they are asymptotically efficient.

Although nonparametric imputation estimators are $n^{1/2}$-consistent under regularity conditions, the remainder terms depend on their biases and variances. In particular, the variance increases with the number of confounders, so a balancing score vector with a smaller dimension is always preferred. According to Rosenbaum & Rubin (1983), the propensity score is the coarsest balancing score, and also has the smallest dimension. However, Hahn (1998) showed that projection onto the true propensity score can be inefficient. Another well-known balancing score is the prognostic score of Hansen (2008). Leacy & Stuart (2014) further combined propensity and prognostic scores to improve the classical matching estimator of average treatment effects. Unfortunately, estimators based directly on propensity and prognostic scores often cannot attain the semiparametric efficiency bound. Hence, finding a suitable balancing score vector with high efficiency and minimal dimension is important in practice.

There are two approaches to reducing the dimension of the covariate vector while keeping as much information as possible about the relationship between response and covariates. The first is variable selection, in which the main goal is to drop redundant variables: de Luna et al. (2011) first defined some subsets of confounding variables which are minimal in the sense that the treatment ceases to be unconfounded for any proper subset of these sets; they also showed that these subsets can reduce the efficiency bound for average treatment effects. Related methods are discussed in Vansteelandt et al. (2012). Instead of using the total subset selection procedure, another method of variable selection is penalized regression. Since confoundedness resides in the conditional distribution of the potential outcomes and treatment variable, given confounders, Ghosh et al. (2015) developed a lasso-type criterion to select redundant variables. The second approach is sufficient dimension reduction, which seeks a few linear combinations of confounders that retain the full information on confoundedness. In a related missing data problem, Hu et al. (2014) introduced effective balancing scores, which are the central subspaces of missing indicators and observed responses, given covariates. While the estimation of average treatment effect can often be regarded as two missing data problems, applying this approach twice would yield two estimates of central subspaces that could be highly collinear.

In this paper, we introduce a joint sufficient dimension reduction model on the propensity score as well as the conditional distributions of potential outcomes, and use the joint central subspace to form a semiparametric efficient estimator of average treatment effects. No stringent parametric model formulation is assumed in the dimension reduction framework. Further, while classical dimension reduction methods consider models on the joint distribution of treatment and potential outcomes, our approach focuses on the marginal distributions and can yield a balancing score with smaller dimension. The exclusion restrictions used in de Luna et al. (2011) for efficiency gains are not required, but the semiparametric efficiency bound is retained by our estimator.

Several approaches to sufficient dimension reduction can be extended to this causal inference problem. In a complete-data setting, these approaches include inverse regression (Li, 1991; Li & Wang, 2007; Zhu et al., 2010), average derivative and minimum average variance estimation (Zhu & Zeng, 2006; Xia, 2007; Wang & Xia, 2008; Yin & Li, 2011), and the semiparametric framework (Ma & Zhu, 2012, 2013; Huang & Chiang, 2017). Here we extend the work of Huang & Chiang (2017) and construct a crossvalidation-type least squares criterion to estimate the structural dimension and the basis matrix simultaneously. The bandwidth used in the semiparametric estimator of the unspecified link function can be selected using the same criterion and attains the optimal rate for estimating the conditional treatment effects. The proposed model is flexible and the average treatment effects can be estimated efficiently.

Although local matching using the propensity score may not be semiparametrically efficient, inverse propensity score weighting with estimated weights has been shown to be efficient (Hirano et al., 2003), and many recent efforts have focused on improving the weighting estimators (Qin & Zhang, 2007; Imai & Ratkovic, 2014; Chan et al., 2016). In contrast to those estimators, our method provides a rate-optimal estimator of covariate-specific treatment effects, which is useful for personalized prediction and the study of heterogeneity.

2. Joint sufficient dimension reduction

2.1. Notation and model construction

Let $Y(0)$ and $Y(1)$ be the potential outcomes when an individual is assigned to the control and treatment groups, respectively, and let $T$ be a binary treatment indicator. Since each unit is either treated or not treated, the observed outcome is $Y=TY(1)+(1-T)Y(0)$. In addition, a vector of covariates or confounders $X\in\mathbb{R}^p$ is observed for each subject, and we make the following conditional independence assumption.

Assumption 1

(Unconfounded treatment assignment). We have that $T\perp\{Y(0),Y(1)\}\mid X$, where $\perp$ denotes statistical independence.

This assumption is often made to identify the average treatment effect $\tau=E\{Y(1)\}-E\{Y(0)\}$. Under Assumption 1, Hahn (1998) and Robins et al. (1994) derived the semiparametric efficiency bound for $\tau$ as

\[
\sigma^2_{\rm eff}=E\left[\{m_1(X)-m_0(X)-\tau\}^2+\frac{\sigma_1^2(X)}{\pi(X)}+\frac{\sigma_0^2(X)}{1-\pi(X)}\right],
\]

where $\pi(X)={\rm pr}(T=1\mid X)$ is the propensity score, $m_k(X)=E\{Y(k)\mid X\}$ is the conditional mean of the potential outcome, and $\sigma_k^2(X)={\rm var}\{Y(k)\mid X\}$ ($k=0,1$). Also, $\sigma^2_{\rm eff}$ can be shown to be the asymptotic variance of a nonparametric imputation estimator that directly uses $X$ as a balancing score. A balancing score with smaller dimension but the same efficiency is obtained if the conditional distributions of $T$, $Y(0)$ and $Y(1)$ given $X$ are captured by a lower-dimensional linear subspace of $X$. Therefore, we focus on finding $B$ such that

\[
T\perp X\mid B^{\rm T}X,\quad Y(0)\perp X\mid B^{\rm T}X,\quad Y(1)\perp X\mid B^{\rm T}X,\qquad (1)
\]

where $B$ is a full-rank $p\times d$ parameter matrix with $d\leq p$; we call the column space of $B$ a joint sufficient dimension reduction subspace. For simplicity, we will write $\mathcal{S}(B)$ for the column space of a matrix $B$. Obviously, (1) holds when $d=p$ and $B$ is the $p\times p$ identity matrix, $I_p$. Thus it always covers the true model. Moreover, if $\mathcal{S}(B)\subseteq\mathcal{S}(\tilde B)$ and $\mathcal{S}(B)$ is a joint sufficient dimension reduction subspace, then $\mathcal{S}(\tilde B)$ will also be a joint sufficient dimension reduction subspace. The most interesting parameter is therefore the joint sufficient dimension reduction subspace of smallest dimension, called the joint central subspace when it exists, which is unique up to an equivalence class as discussed in Remark 2. The corresponding basis matrix $B_0$ has dimension $p\times d_0$. The existence and uniqueness of the joint central subspace can be ensured under some mild conditions, similar to the discussion of Cook (1998) on sufficient dimension reduction for univariate responses.

Alternatively, based on the classical literature on sufficient dimension reduction, one can also consider the model

\[
\{T,Y(0),Y(1)\}\perp X\mid B^{\rm T}X,\qquad (2)
\]

which is different from model (1). In fact, $\mathcal{S}(B_0)$ will be contained in the central subspace of (2). Since the average treatment effect involves only the marginal distributions of $Y(0)$ and $Y(1)$, and it is appealing to seek a balancing score with lower dimension, we consider model (1) instead of model (2).

Based on the definition of a joint central subspace, $B_0^{\rm T}X$ is obviously a balancing score and

\[
T\perp\{Y(0),Y(1)\}\mid B_0^{\rm T}X,
\]

which ensures unbiased estimation of the average treatment effect. The main feature of this balancing score is that it creates both propensity and prognostic balance (Hansen, 2008; Leacy & Stuart, 2014), and we will show in § 3 that it attains the semiparametric efficiency bound in the estimation of $\tau$.
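To fix ideas, the structure in (1) can be illustrated with a toy data-generating process. The sketch below is ours, not the authors': it uses $p=5$ covariates, structural dimension $d_0=1$, a logistic propensity and linear outcome models, all of which are illustrative choices.

```python
import numpy as np

def simulate_joint_sdr(n, seed=0):
    # Toy data-generating process satisfying model (1) with p = 5 covariates
    # and structural dimension d0 = 1: the treatment and both potential
    # outcomes depend on X only through the single index B0'X.
    rng = np.random.default_rng(seed)
    B0 = np.array([1.0, 0.5, -0.5, 0.0, 0.0])   # basis of the joint central subspace
    X = rng.normal(size=(n, 5))
    u = X @ B0                                   # the reduced index B0'X
    pi = 1.0 / (1.0 + np.exp(-u))                # propensity score pi(B0'X)
    T = (rng.random(n) < pi).astype(float)
    y0 = u + rng.normal(size=n)                  # potential outcome Y(0)
    y1 = 1.0 + 2.0 * u + rng.normal(size=n)      # potential outcome Y(1)
    Y = T * y1 + (1.0 - T) * y0                  # observed outcome
    return X, T, Y
```

In this hypothetical design the true average treatment effect is $E\{Y(1)-Y(0)\}=1$ because the index has mean zero, and $B_0^{\rm T}X$ is a one-dimensional balancing score even though $X$ has five components.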

Remark 1.

Unlike sufficient dimension reduction tools such as sliced inverse regression (Li, 1991), we do not require an additional linearity assumption on the covariate distribution.

Remark 2.

Under model (1), $\pi(B^{\rm T}X)$ and $F_k(y\mid B^{\rm T}X)$ ($k=0,1$) remain the same for any basis matrix $B$ with the same column space. In fact, all such $B$ span the same space and are isomorphic up to a linear transformation. The parameter space of $\mathcal{S}(B)$ is called the Grassmann manifold, or Grassmannian, denoted by ${\rm Gr}(d,p)$. To avoid ambiguity, we follow Ma & Zhu (2013) in employing the local coordinate system of the Grassmannian and parameterize the basis by $B=(I_d,C^{\rm T})^{\rm T}$, where $C$ is a $(p-d)\times d$ free parameter matrix. This parameterization is particularly useful in theoretical developments for characterizing the information matrix, and is not an additional model assumption or a restriction on $B_0$. Computation of the proposed estimator does not require fixing the reference variables in advance; see Remark 6.

Remark 3.

Since the main parameters of interest are means of potential outcomes, one could also consider the joint central mean subspace, which is the smallest linear subspace with basis matrix $B_M$ such that

\[
T\perp X\mid B_M^{\rm T}X,\quad Y(0)\perp E\{Y(0)\mid X\}\mid B_M^{\rm T}X,\quad Y(1)\perp E\{Y(1)\mid X\}\mid B_M^{\rm T}X.\qquad (3)
\]

The corresponding method in a complete-data setting can be found in Cook & Li (2002) and Xia (2008). Note that the conditional distribution of $T$ given $X$ remains the same as in the sufficient dimension reduction model because $T$ is binary and its distribution is determined by its mean. Since (3) models the conditional means only and the mean is a functional of the distribution, one can verify that $\mathcal{S}(B_M)\subseteq\mathcal{S}(B_0)$. Moreover, by Theorems 2 and 3 of Rosenbaum & Rubin (1983), $B_M^{\rm T}X$ is also a balancing score, for which Proposition 1 below holds. Thus we obtain a balancing score with a possibly smaller dimension for the estimation of average treatment effects. However, a comparison of $\sigma^2_{\rm eff}$ and (8) in § 3.1 reveals that the efficiency bound will not generally be attained with use of $B_M^{\rm T}X$ as a balancing score; that is, in general there is a trade-off between a lower dimension and a smaller asymptotic variance. This trade-off is also discussed in Hu et al. (2014), who studied mean estimation in missing data. Furthermore, the current formulation is sufficient for the estimation of any conditional functionals, not just the conditional means, and does not require re-estimation of the central subspace for different functionals of interest; see Remark 7. Therefore, we consider model (1) so that all relevant information is kept.

2.2. Simultaneous estimation for the basis and dimension of the joint central subspace

Here we develop an estimation criterion for the joint central subspace with a random sample $\{(T_i,Y_i,X_i)\}_{i=1}^n$. First, we note that model (1) is equivalent to

\[
E(T\mid X)=E(T\mid B^{\rm T}X),\qquad E[1\{Y(k)\leq y\}\mid X]=E[1\{Y(k)\leq y\}\mid B^{\rm T}X]\quad(y\in\mathbb{R};\ k=0,1),
\]

where $1\{\cdot\}$ represents the indicator function. That is, we consider semiparametric regression models with unspecified link functions $\pi$ and $F_k$, regressing the responses $T$ and $1\{Y(k)\leq y\}$ on their corresponding mean functions $\pi(B^{\rm T}X)$ and $F_k(y\mid B^{\rm T}X)$ ($k=0,1$). However, since $Y(0)$ and $Y(1)$ are not always observed, we must be careful in using them as responses. Our key idea comes from the following proposition.

Proposition 1.

Under Assumption 1 and model (1),

\[
E[1\{Y(k)\leq y\}\mid X]=E\{1(Y\leq y)\mid T=k,X\}=E\{1(Y\leq y)\mid T=k,B_0^{\rm T}X\}\quad(y\in\mathbb{R};\ k=0,1).\qquad (4)
\]

The first equality in (4) indicates that the regression problem can be solved by considering treatment and control groups separately. The second equality leads to a sieve approach for the estimation of the unknown link function. Let

\[
\hat\pi(B^{\rm T}x)=\frac{\sum_{j=1}^nT_jK_h\{B^{\rm T}(X_j-x)\}}{\sum_{j=1}^nK_h\{B^{\rm T}(X_j-x)\}},\qquad
\hat F_k(y\mid B^{\rm T}x)=\frac{\sum_{j=1}^nT_j^k(1-T_j)^{1-k}1(Y_j\leq y)K_h\{B^{\rm T}(X_j-x)\}}{\sum_{j=1}^nT_j^k(1-T_j)^{1-k}K_h\{B^{\rm T}(X_j-x)\}}\quad(k=0,1)
\]

denote estimators for $\pi(B^{\rm T}x)$ and $F_k(y\mid B^{\rm T}x)$, where $K_h(u)=K(u/h)/h^d$ for $u\in\mathbb{R}^d$, with $h$ a positive bandwidth and $K$ a $q$th-order kernel function. The choice of $q$ will be discussed in Remark 5. Now we may use Proposition 1 to construct a crossvalidation-type criterion for the estimation of the joint central subspace. Write $\|f\|^2_W=\int f(y)^{\rm T}Wf(y)\,{\rm d}F_Y(y)$ and $\langle f,g\rangle_W=\int f(y)^{\rm T}Wg(y)\,{\rm d}F_Y(y)$ for generic vector-valued functions $f,g$ and matrix $W$, where $F_Y$ is the marginal distribution of $Y$. Let $(T^0,Y^0,X^0)$ be a future run, independent of the current data $\{(T_i,Y_i,X_i)\}_{i=1}^n$, and define the prediction risk

\[
E\{\|\mathcal{Y}_y^0-\hat F(y\mid B^{\rm T}X^0)\|^2_{W^0}\},\qquad (5)
\]

where

\[
\mathcal{Y}_y^0=\{T^0,1(Y^0\leq y),1(Y^0\leq y)\}^{\rm T},\quad
\hat F(y\mid B^{\rm T}X)=\{\hat\pi(B^{\rm T}X),\hat F_0(y\mid B^{\rm T}X),\hat F_1(y\mid B^{\rm T}X)\}^{\rm T},\quad
W^0=(1-\pi)e_1e_1^{\rm T}+(1-T^0)e_2e_2^{\rm T}+T^0e_3e_3^{\rm T},
\]

with $\pi={\rm pr}(T=1)$ being the marginal probability of treatment and $e_1,e_2,e_3$ the standard basis of $\mathbb{R}^3$. The weight $W^0$ is used to treat the squared error as an integration over the distribution of $Y$. Further, let $F(y\mid B^{\rm T}X)=\{\pi(B^{\rm T}X),F_0(y\mid B^{\rm T}X),F_1(y\mid B^{\rm T}X)\}^{\rm T}$. A simple calculation shows that the prediction risk in (5) can be decomposed into

\[
\sigma_0^2+b_0^2(B)+{\rm MISE}_B(h)+C(B,h),\qquad (6)
\]

where

\[
\begin{aligned}
\sigma_0^2&=E\{\|\mathcal{Y}_y^0-F(y\mid B_0^{\rm T}X^0)\|^2_{W^0}\},\\
b_0^2(B)&=E\{\|F(y\mid B_0^{\rm T}X^0)-F(y\mid B^{\rm T}X^0)\|^2_{W^0}\},\\
{\rm MISE}_B(h)&=E\{\|F(y\mid B^{\rm T}X^0)-\hat F(y\mid B^{\rm T}X^0)\|^2_{W^0}\},\\
C(B,h)&=2E\{\langle F(y\mid B_0^{\rm T}X^0)-F(y\mid B^{\rm T}X^0),\,F(y\mid B^{\rm T}X^0)-\hat F(y\mid B^{\rm T}X^0)\rangle_{W^0}\}.
\end{aligned}
\]

As $h\to0$ and $nh^d\to\infty$, it is shown in the Supplementary Material that the last two terms of (6) converge to zero and are dominated by the first two terms. Note that $b_0^2(B)\geq0$, and $b_0^2(B)=0$ when $\mathcal{S}(B)$ is a joint sufficient dimension reduction subspace. Hence, the minimum of the prediction risk must occur when $b_0^2(B)=0$. Moreover, since our model has a nested structure, the prediction risk decreases with the working dimension $d$ when $d\leq d_0$. On the other hand, when $d\geq d_0$ and $b_0^2(B)=0$, the excess prediction risk ${\rm MISE}_B(h)$ has an asymptotic order of $n^{-2q/(2q+d)}$, which increases with $d$. Therefore, for large enough $n$, the minimal prediction risk occurs at the joint central subspace $\mathcal{S}(B_0)$. A formal result is stated as follows.

Proposition 2.

Under Assumption 1 and model (1), the joint central subspace $\mathcal{S}(B_0)$ and the optimal bandwidth $h_{\rm opt}$ minimize the prediction risk in (5) as $n\to\infty$, with $h_{\rm opt}=c_0n^{-1/(2q+d_0)}\{1+o(1)\}$, where the constant $c_0$ is given in the Supplementary Material.

The proof of Proposition 2 is given in the Supplementary Material. According to Proposition 2 and the fact that the prediction risk is asymptotically convex in the working dimension $d$, we obtain our estimator through the following algorithm.

Step 1.

For $d=0$, calculate

\[
{\rm CV}_0=\frac{1}{n}\sum_{i=1}^n\left[(T_i-\bar T)^2(1-\bar T)+\sum_{k=0}^1T_i^k(1-T_i)^{1-k}\int\{1(Y_i\leq y)-\hat F_k(y)\}^2\,{\rm d}\hat F_Y(y)\right],
\]

where $\bar T=n^{-1}\sum_{i=1}^nT_i$, $\hat F_Y$ is the empirical distribution of $Y$, and $\hat F_k(y)=\sum_{i=1}^nT_i^k(1-T_i)^{1-k}1(Y_i\leq y)/\sum_{i=1}^nT_i^k(1-T_i)^{1-k}$ ($k=0,1$).

Step 2.

For $d=1,2,\ldots$, let $(\hat B_d,\hat h_d)$ be the minimizer of ${\rm CV}(B,h)$ among all $p\times d$ matrices $B$ and positive $h$, where

\[
{\rm CV}(B,h)=\frac{1}{n}\sum_{i=1}^n\left[\{T_i-\hat\pi^{-i}(B^{\rm T}X_i)\}^2(1-\bar T)+\sum_{k=0}^1T_i^k(1-T_i)^{1-k}\int\{1(Y_i\leq y)-\hat F_k^{-i}(y\mid B^{\rm T}X_i)\}^2\,{\rm d}\hat F_Y(y)\right];
\]

here the superscript $-i$ indicates the estimator based on the sample with the $i$th subject deleted. Then calculate ${\rm CV}_d={\rm CV}(\hat B_d,\hat h_d)$.

Step 3.

Repeat Step 2 until ${\rm CV}_d>{\rm CV}_{d-1}$ with $d\geq1$, and set $\hat d=d-1$. The proposed estimator is $(\hat B,\hat h)=(\hat B_{\hat d},\hat h_{\hat d})$.
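As a rough illustration of the criterion evaluated in Steps 1–3, the leave-one-out quantity ${\rm CV}(B,h)$ can be sketched in Python as follows. This is a simplified stand-in written by us, not the authors' code: it uses a second-order Gaussian kernel (the theory calls for higher-order kernels, see Remark 5), and it only evaluates the criterion at a given $(B,h)$ rather than performing the Grassmann optimization of Step 2.

```python
import numpy as np

def cv_criterion(B, h, X, T, Y):
    # Leave-one-out crossvalidation criterion CV(B, h): a squared-error term
    # for the propensity plus integrated squared errors for the two
    # conditional distribution functions, integrated over the empirical
    # distribution of Y.
    n = len(T)
    Z = X @ B                                   # reduced index B'X
    D = (Z[:, None, :] - Z[None, :, :]) / h
    W = np.exp(-0.5 * (D ** 2).sum(axis=-1))    # Gaussian kernel weights
    np.fill_diagonal(W, 0.0)                    # delete the i-th subject
    pi_loo = (W @ T) / W.sum(axis=1)            # leave-one-out propensity
    out = np.mean((T - pi_loo) ** 2) * (1 - T.mean())
    ygrid = np.sort(Y)                          # integrate over empirical F_Y
    for k in (0, 1):
        grp = np.where(T == k)[0]
        ind = (Y[grp, None] <= ygrid[None, :]).astype(float)
        Wk = W[:, grp]
        for i in grp:
            F_loo = (Wk[i] @ ind) / Wk[i].sum()
            out += np.mean(((Y[i] <= ygrid) - F_loo) ** 2) / n
    return out
```

In practice one would minimize this criterion over $B$ and $h$ for each working dimension $d$; in a toy model where treatment and outcome depend only on the first covariate, the criterion is smaller at the true direction than at an irrelevant one.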

We will show that ${\rm CV}(B,h)$ converges uniformly to the prediction risk as $n\to\infty$ in the proof of the following theorem, and the proposed algorithm can simultaneously estimate the joint central subspace and the structural dimension consistently. In summary, our proposed method, which is easily implemented in practice, can simultaneously estimate the basis matrix, the structural dimension of the joint central subspace, and the optimal bandwidth. The asymptotic properties of the proposed estimators are established in the following theorem.

Theorem 1.

Suppose that Assumption 1 and Conditions A1–A5 in the Supplementary Material are satisfied. Then ${\rm pr}(\hat d=d_0)\to1$, $\hat h/h_{\rm opt}\to1$ in probability, and

\[
n^{1/2}{\rm vec}(\hat B-B_0)1(\hat d=d_0)=n^{-1/2}\sum_{i=1}^n\xi_{B_0}(T_i,Y_i,X_i)+o_p(1)\to N(0,\Sigma_{B_0})
\]

in distribution as $n\to\infty$, where ${\rm vec}(\cdot)$ is the columnwise matrix vectorization operator, $E\{\xi_{B_0}(T,Y,X)\}=0$ and $\Sigma_{B_0}=E\{\xi_{B_0}(T,Y,X)\xi_{B_0}(T,Y,X)^{\rm T}\}$, with $\xi_{B_0}$ and $\Sigma_{B_0}$ as defined in the Supplementary Material.

Remark 4.

The proposed criterion selects the basis matrix and the structural dimension of the joint central subspace simultaneously, which is different from most existing sufficient dimension reduction methods such as inverse regression (Zhu et al., 2010), minimum average variance estimation (Yin & Li, 2011) and semiparametric approaches (Ma & Zhu, 2012, 2013). Moreover, the bandwidth chosen by this criterion is the rate-optimal bandwidth in terms of the mean integrated squared error, which will be further discussed in § 2.3.

Remark 5.

We use different bandwidths $h_d$ for the working dimension $d$ in the proposed algorithm, and we show in the proof of Theorem 1 that $\hat h_d\asymp n^{-1/(2q+d)}$ in probability. Coupled with the restriction in Condition A3, the order of the kernel function should satisfy $q>d/2$. Since we always use a symmetric kernel function, whose order will be even, and require the order to be as small as possible, a suitable choice is $q=2\lfloor d/4\rfloor+2$ for each working dimension $d$, where $\lfloor\cdot\rfloor$ denotes the floor function.
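For concreteness, a standard even-order kernel of order four is the Gaussian-based kernel below. This is a textbook construction, not one specified by the paper: it integrates to one, its second moment vanishes, and its fourth moment is nonzero, at the price of taking negative values for $|u|>\sqrt{3}$.

```python
import numpy as np

def kernel4(u):
    # Fourth-order Gaussian-based kernel:
    #   K4(u) = (3/2 - u^2/2) * phi(u),
    # where phi is the standard normal density.  Its zeroth moment is 1,
    # its second moment is 0, and its fourth moment is -3.
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return (1.5 - 0.5 * u ** 2) * phi
```

The moment conditions can be checked by numerical integration, which is a quick sanity test before plugging any candidate kernel into the criterion.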

Remark 6.

Grassmannian optimization algorithms in Edelman et al. (1999) and Adragni et al. (2012) can be used in Step 2 without fixing the reference variables a priori. Those algorithms are a modification of conventional gradient-based algorithms, which consider movements along geodesics based on a metric defined in the tangent space of the Grassmann manifold. Since the optimization is nonconvex, a reliable initial value is needed. We suggest using the kernel-based method of Fukumizu & Leng (2014), as it is the fastest method known to date. That method is consistent but does not attain an $n^{1/2}$ convergence rate. Given this initial value, Step 2 can be implemented in a slightly different manner. The initial value could first be transformed into a local coordinate representation by Gaussian elimination, where the reference variables are chosen to be those with the largest columnwise coefficients in absolute value for numerical stability. Then the optimization can be performed with respect to free parameters in the Euclidean local coordinate system. We have compared these two methods and different initial values through simulations reported in the Supplementary Material, and found them to yield similar performance.

2.3. Estimation of conditional effects

An important advantage of our proposed criterion is that it selects the bandwidth simultaneously with the estimation of the joint central subspace, and the selected bandwidth $\hat h$ minimizes the mean integrated squared error asymptotically. In particular, the bandwidth achieves the optimal rate for estimating conditional regression functions. Hence, we may use the bandwidth directly to obtain estimators of the conditional effects $E\{Y(k)\mid X=x\}$ ($k=0,1$):

\[
\hat E\{Y(k)\mid X=x\}=\frac{\sum_{i=1}^nY_iT_i^k(1-T_i)^{1-k}K_{\hat h}\{\hat B^{\rm T}(X_i-x)\}}{\sum_{i=1}^nT_i^k(1-T_i)^{1-k}K_{\hat h}\{\hat B^{\rm T}(X_i-x)\}}\quad(k=0,1).\qquad (7)
\]

At the dimension reduction stage, we suggest using at least a fourth-order kernel function to ensure the large-sample properties of the estimated joint central subspace, as discussed in Remark 5. However, in practice the negative weights of a higher-order kernel can be detrimental to the stability of the resulting estimators. One possible way to obtain a more stable estimate is to use a second-order kernel function in (7) and substitute an adjusted bandwidth, so that the resulting convergence rate of (7) is optimal with respect to a mean integrated squared error based on a second-order kernel function. In our numerical experiments we have found that the finite-sample performance of (7) is much better when a second-order kernel and an adjusted bandwidth are used.

To estimate the variance of the estimated conditional effects, an infinitesimal jackknife estimator can be applied. The idea is to perturb the empirical weight $n^{-1}$ in the original estimator by a small amount $\epsilon$ and then take the limit as $\epsilon\to0$. More precisely, if we write $\hat E\{Y(k)\mid X=x\}$ as a function of $n$ variables

\[
Q_k(w_1,\ldots,w_n)=\frac{\sum_{i=1}^nw_iY_iT_i^k(1-T_i)^{1-k}K_{\hat h}\{\hat B^{\rm T}(X_i-x)\}}{\sum_{i=1}^nw_iT_i^k(1-T_i)^{1-k}K_{\hat h}\{\hat B^{\rm T}(X_i-x)\}}
\]

evaluated at $w_1=\cdots=w_n=n^{-1}$, the infinitesimal jackknife estimator of variance is $\sum_{i=1}^n\hat D_{k,i}^2$, where $n\hat D_{k,i}$ is the derivative of $Q_k$ with respect to $w_i$ evaluated at $(n^{-1},\ldots,n^{-1})$; that is, for $k=0$ or $1$,

\[
\hat D_{k,i}=\frac{Y_iT_i^k(1-T_i)^{1-k}K_{\hat h}\{\hat B^{\rm T}(X_i-x)\}}{\sum_{j=1}^nT_j^k(1-T_j)^{1-k}K_{\hat h}\{\hat B^{\rm T}(X_j-x)\}}-\hat E\{Y(k)\mid X=x\}\,\frac{T_i^k(1-T_i)^{1-k}K_{\hat h}\{\hat B^{\rm T}(X_i-x)\}}{\sum_{j=1}^nT_j^k(1-T_j)^{1-k}K_{\hat h}\{\hat B^{\rm T}(X_j-x)\}}.
\]

To estimate the variance of $\hat E\{Y(1)\mid X=x\}-\hat E\{Y(0)\mid X=x\}$, we can directly apply the infinitesimal jackknife estimator $\sum_{i=1}^n(\hat D_{1,i}-\hat D_{0,i})^2$.
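A minimal sketch of this infinitesimal jackknife calculation is given below, under the simplifying assumptions of a second-order Gaussian kernel and our own function names; it returns the estimated conditional effect at a point together with its jackknife standard error.

```python
import numpy as np

def nw_effect_and_ij_se(X, T, Y, B, h, x):
    # Conditional effect estimate E{Y(1)|X=x} - E{Y(0)|X=x} on the reduced
    # index B'X, with an infinitesimal-jackknife standard error.
    K = np.exp(-0.5 * np.sum(((X - x) @ B / h) ** 2, axis=-1))
    n = len(T)
    D = np.zeros((2, n))
    q = np.zeros(2)
    for k in (0, 1):
        a = np.where(T == k, K, 0.0)            # kernel weights in group k
        q[k] = np.sum(a * Y) / np.sum(a)        # Nadaraya-Watson estimate
        # scaled derivative of the weighted estimator in the i-th weight
        D[k] = a * (Y - q[k]) / np.sum(a)
    effect = q[1] - q[0]
    se = np.sqrt(np.sum((D[1] - D[0]) ** 2))
    return effect, se
```

When the outcome is constant within each treatment group the jackknife derivatives vanish, so the standard error is exactly zero; that limiting case is a convenient check on the algebra.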

Remark 7.

The estimator (7) can be extended to estimate $E[g\{Y(k)\}\mid X=x]$ for real-valued functions $g$ in the following way:

\[
\hat E[g\{Y(k)\}\mid X=x]=\frac{\sum_{i=1}^ng(Y_i)T_i^k(1-T_i)^{1-k}K_{\hat h}\{\hat B^{\rm T}(X_i-x)\}}{\sum_{i=1}^nT_i^k(1-T_i)^{1-k}K_{\hat h}\{\hat B^{\rm T}(X_i-x)\}}\quad(k=0,1).
\]

Model (1) guarantees that $E[g\{Y(k)\}\mid X]=E[g\{Y(k)\}\mid B_0^{\rm T}X]$ for an arbitrary function $g$.

3. Efficient estimation of average treatment effects

3.1. Semiparametric efficiency bound and the efficient estimator

One should note that $\sigma^2_{\rm eff}$ is the efficiency bound for the nonparametric model without any specification of the forms of $\pi$, $F_0$ and $F_1$. Since model (1) imposes a multi-index structure on these distribution functions, some efficiency gain might be expected. However, we have found that the dimension reduction structure (1) is ancillary for the estimation of $\tau$, so knowledge of the joint central subspace does not reduce the asymptotic variance bound.

Theorem 2.

Under model (1), the semiparametric efficiency bound of $\tau$ is $\sigma^2_{\rm eff}$.

According to Rosenbaum & Rubin (1983) and Hansen (2008), the average treatment effect $\tau$ can be consistently estimated through balancing scores or prognostic scores. In fact, if $b(X)$ satisfies

\[
T\perp\{Y(0),Y(1)\}\mid b(X),
\]

then $\tau=E[m_1\{b(X)\}-m_0\{b(X)\}]$, and $\tau$ can be estimated by

\[
\hat\tau_b=\frac{1}{n}\sum_{i=1}^n[\hat m_1\{b(X_i)\}-\hat m_0\{b(X_i)\}],
\]

where

\[
\hat m_k\{b(x)\}=\frac{\sum_{j=1}^nY_jT_j^k(1-T_j)^{1-k}K_\varsigma\{b(X_j)-b(x)\}}{\sum_{j=1}^nT_j^k(1-T_j)^{1-k}K_\varsigma\{b(X_j)-b(x)\}}.
\]

Since $\hat m_k\{b(x)\}$ is an estimator of $E\{Y\mid T=k,b(X)=b(x)\}$, we can follow the proof of Hahn (1998) for nonparametric imputation estimators and obtain the asymptotic variance of $\hat\tau_b$ as

\[
E\left([m_1\{b(X)\}-m_0\{b(X)\}-\tau]^2+\frac{\sigma_1^2\{b(X)\}}{\pi\{b(X)\}}+\frac{\sigma_0^2\{b(X)\}}{1-\pi\{b(X)\}}\right)\qquad (8)
\]

where $m_k\{b(X)\}=E\{Y\mid T=k,b(X)\}$, $\sigma_k^2\{b(X)\}={\rm var}\{Y\mid T=k,b(X)\}$ and $\pi\{b(X)\}={\rm pr}\{T=1\mid b(X)\}$, under some regularity conditions. Under model (1) and with $b(x)=B_0^{\rm T}x$, the asymptotic variance attains the semiparametric efficiency bound $\sigma^2_{\rm eff}$ of Hahn (1998).

A simple estimator of $\tau$ is

\[
\hat\tau=\frac{1}{n}\sum_{i=1}^n\{\hat m_1(\hat B^{\rm T}X_i)-\hat m_0(\hat B^{\rm T}X_i)\},
\]

where $\hat B$ is an estimator of $B_0$. In this study, we further show that the asymptotic variance of $\hat B$ does not affect the asymptotic behaviour of $\hat\tau$ and that $\hat\tau$ is semiparametrically efficient.
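The imputation estimator $\hat\tau$ can be sketched as follows, again with an illustrative second-order Gaussian kernel and function names of our own; a practical implementation would use the estimated basis matrix and an undersmoothed bandwidth as discussed below.

```python
import numpy as np

def nw_mean(X, T, Y, B, h, x, k):
    # Nadaraya-Watson estimate of m_k(B'x): regression of Y on the reduced
    # index B'X within the group T = k.
    K = np.exp(-0.5 * np.sum(((X - x) @ B / h) ** 2, axis=-1))
    a = np.where(T == k, K, 0.0)
    return np.sum(a * Y) / np.sum(a)

def ate(X, T, Y, B, h):
    # Average treatment effect: average of imputed conditional effects
    # m1_hat(B'X_i) - m0_hat(B'X_i) over the sample.
    return np.mean([nw_mean(X, T, Y, B, h, x, 1) - nw_mean(X, T, Y, B, h, x, 0)
                    for x in X])
```

In the degenerate case where the outcome equals a constant shift of the treatment indicator, the imputed group means are exact and the estimator recovers the shift exactly, which provides a simple correctness check.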

Although the estimator of the average treatment effect is an average of conditional treatment effects, one requires a different bandwidth to attain optimal undersmoothing. We first provide the asymptotic distribution of the proposed estimator for a range of bandwidths satisfying Condition A6 in the Supplementary Material; then a data-adaptive method for choosing the bandwidth is discussed in § 3.2.

Theorem 3.

Suppose that Assumption 1 and Conditions A1–A6 in the Supplementary Material are satisfied. Then $n^{1/2}(\hat\tau-\tau)\to N(0,\sigma^2_{\rm eff})$ in distribution as $n\to\infty$.

In practice, the semiparametric efficiency bound can be estimated by a direct plug-in estimator

\[
\hat\sigma^2_{\rm eff}=\frac{1}{n}\sum_{i=1}^n\left[\{\hat m_1(\hat B^{\rm T}X_i)-\hat m_0(\hat B^{\rm T}X_i)-\hat\tau\}^2+\frac{\hat\sigma_1^2(\hat B^{\rm T}X_i)}{\hat\pi(\hat B^{\rm T}X_i)}+\frac{\hat\sigma_0^2(\hat B^{\rm T}X_i)}{1-\hat\pi(\hat B^{\rm T}X_i)}\right],
\]

where

\[
\hat\sigma_k^2(B^{\rm T}x)=\frac{\sum_{i=1}^nT_i^k(1-T_i)^{1-k}Y_i^2K_\varsigma\{B^{\rm T}(X_i-x)\}}{\sum_{i=1}^nT_i^k(1-T_i)^{1-k}K_\varsigma\{B^{\rm T}(X_i-x)\}}-\hat m_k^2(B^{\rm T}x)\quad(k=0,1).
\]

The bandwidth $\varsigma$ can be replaced by that used to estimate $\hat m_k$.
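The plug-in variance estimate above admits a compact sketch, again with an illustrative second-order Gaussian kernel and our own names; all conditional quantities are kernel estimates on the reduced index.

```python
import numpy as np

def sigma_eff_sq(X, T, Y, B, h):
    # Plug-in estimate of the efficiency bound: kernel estimates of m_k,
    # sigma_k^2 and pi on the reduced index B'X, averaged over the sample.
    n = len(T)
    Z = X @ B
    W = np.exp(-0.5 * (((Z[None, :, :] - Z[:, None, :]) / h) ** 2).sum(axis=-1))
    pi_hat = (W @ T) / W.sum(axis=1)            # propensity estimates
    m = np.zeros((2, n))
    s2 = np.zeros((2, n))
    for k in (0, 1):
        A = W * np.where(T == k, 1.0, 0.0)[None, :]   # group-k kernel weights
        denom = A.sum(axis=1)
        m[k] = (A @ Y) / denom                  # conditional mean m_k
        s2[k] = (A @ (Y ** 2)) / denom - m[k] ** 2    # conditional variance
    tau = np.mean(m[1] - m[0])
    return np.mean((m[1] - m[0] - tau) ** 2
                   + s2[1] / pi_hat + s2[0] / (1 - pi_hat))
```

With a Gaussian kernel every subject receives positive weight from both groups, so the estimated propensities stay strictly inside $(0,1)$ and the ratios are well defined.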

Remark 8.

A slight variation of $\hat\tau_b$ can be constructed in the spirit of Cheng (1994). Since the potential outcomes are only partially unobservable, Cheng (1994) suggested imputing an unobserved value with its conditional expectation. More precisely, the estimator is

\[
\frac{1}{n}\sum_{i=1}^n\big([T_iY_i+(1-T_i)\hat m_1\{b(X_i)\}]-[(1-T_i)Y_i+T_i\hat m_0\{b(X_i)\}]\big).
\]

By paralleling the proof of Cheng (1994), one can show that the asymptotic distribution of this estimator is the same as that of $\hat\tau_b$, and hence neither is better in general. In our simulation studies, we have found that this alternative estimator has slightly smaller bias than $\hat\tau_b$ and a very similar standard deviation.

3.2. Bandwidth selection

As indicated in Theorem 3, the bandwidth $\varsigma$ used in the nonparametric imputation should be smaller than the classical optimal bandwidth with rate $n^{-1/(2q+d_0)}$, so an important issue in practice is how to select a proper bandwidth. Häggström & de Luna (2014) suggested minimizing the conditional mean squared error of $\hat\tau$, which is of the form

\[
E\left(\left[\hat\tau-\frac{1}{n}\sum_{i=1}^n\{m_1(B_0^{\rm T}X_i)-m_0(B_0^{\rm T}X_i)\}\right]^2\,\Big|\,X_1,\ldots,X_n\right).
\]

In our simulation experiments we have found that the bandwidth $\varsigma$ which minimizes the sample analogue of the conditional mean squared error

\[
E\left[\left\{\frac{1}{n}\sum_{i=1}^n\hat m_k(B_0^{\rm T}X_i)-\frac{1}{n}\sum_{i=1}^nm_k(B_0^{\rm T}X_i)\right\}^2\,\Big|\,X_1,\ldots,X_n\right]\quad(k=0,1)\qquad (9)
\]

leads to a slightly better estimator for $\tau$. The main difference is that Häggström & de Luna (2014) used a local linear regression instead of a Nadaraya–Watson estimator to estimate the conditional effects. Since we directly adopt the local constant smoothing estimator and the estimated optimal bandwidth in the dimension reduction stage, a separate criterion might be helpful for alleviating the boundary effects of the Nadaraya–Watson estimator. Following the proof of Häggström & de Luna (2014), we can show that (9) is asymptotically equivalent to

\[
\begin{aligned}
&\frac{1}{n}\int\sigma_k^2(B_0^{\rm T}x)f_X(x)\pi^k(B_0^{\rm T}x)\{1-\pi(B_0^{\rm T}x)\}^{1-k}\,{\rm d}x
+\frac{\sigma_K^2}{n^2\varsigma^{d_0}}\int\frac{\sigma_k^2(B_0^{\rm T}x)}{\pi^k(B_0^{\rm T}x)\{1-\pi(B_0^{\rm T}x)\}^{1-k}}\,{\rm d}x\\
&\quad+\varsigma^{2q}\left(\frac{\mu_{q,K}}{q!}\right)^2\left(\int\left[\frac{D^q_{B^{\rm T}x}[m_k(B_0^{\rm T}x)\pi^k(B_0^{\rm T}x)\{1-\pi(B_0^{\rm T}x)\}^{1-k}f_{B_0^{\rm T}X}(B_0^{\rm T}x)]}{\pi^k(B_0^{\rm T}x)\{1-\pi(B_0^{\rm T}x)\}^{1-k}}\right.\right.\\
&\qquad\left.\left.-\,m_k(B_0^{\rm T}x)\frac{D^q_{B^{\rm T}x}[\pi^k(B_0^{\rm T}x)\{1-\pi(B_0^{\rm T}x)\}^{1-k}f_{B_0^{\rm T}X}(B_0^{\rm T}x)]}{\pi^k(B_0^{\rm T}x)\{1-\pi(B_0^{\rm T}x)\}^{1-k}}\right](1_{d_0},\ldots,1_{d_0})\,{\rm d}x\right)^2,
\end{aligned}
\]

where $\sigma_K^2=\int K^2(u)\,{\rm d}u$, $\mu_{q,K}=\int u^qK(u)\,{\rm d}u$, $D^q_{B^{\rm T}x}$ is the $q$th-order derivative with respect to $B^{\rm T}x$ in tensor form, $1_{d_0}$ is the all-ones vector with dimension $d_0$, and the optimal bandwidth is asymptotically equivalent to $cn^{-2/(2q+d_0)}$ for some constant $c$. According to Condition A6 in the Supplementary Material, we require $q>d_0/2$ to ensure the asymptotic normality of $\hat\tau$ if the estimated optimal bandwidth is used. However, we find that the second-order kernel still works very well in practice. In addition, consistency can be guaranteed under the weaker assumptions that $\varsigma\to0$ and $n\varsigma^{d_0}\to\infty$ as $n\to\infty$.

4. Application

In this section, we demonstrate our proposed method by applying it to the 2007–2008 National Health and Nutrition Examination Survey, the main goal of which was to investigate the health and nutrition statuses of children and adults in the United States. We focus on a subset of the data (Kohn et al., 2014) and study whether participation in the National School Lunch or School Breakfast programme would lead to an increase in body mass index for children and youths aged 4 to 17. The dataset contains 2330 children and youths, of whom 1284 (55%) participated in the school meal programme. The covariates are child age, $X_1$; child gender, $X_2$; black race, $X_3$; Hispanic race, $X_4$; family above 200% of the federal poverty level, $X_5$; participation in the Special Supplemental Nutrition Program for Women, Infants, and Children, $X_6$; participation in the Food Stamp Program, $X_7$; a childhood food security measurement, $X_8$; health insurance coverage, $X_9$; gender of survey respondent, $X_{10}$; and age of survey respondent, $X_{11}$.

The estimated structural dimension of the joint central subspace is 1 and the estimated linear index is shown in Table 1. Based on this balancing score, the estimated average difference in body mass index between participants and nonparticipants is Inline graphic with a standard error of Inline graphic. Furthermore, the 95% confidence interval (Inline graphic) indicates an insignificant difference in body mass index between participants and nonparticipants, which is consistent with the conclusion that Chan et al. (2016) reached through weighting estimators. Therefore, participation in the school meal programme does not seem to be correlated with excessive food consumption.

Table 1.

Health and Nutrition Examination Survey data: estimated structural dimension and coefficients of the linear index

Inline graphic  1

                 Estimate (SE)                     Estimate (SE)
Inline graphic   0·006 (0·0025)   Inline graphic   0·016 (0·0025)
Inline graphic   0·004 (0·0021)   Inline graphic   0·020 (0·0028)
Inline graphic   0·031 (0·0039)   Inline graphic   0·002 (0·0013)
Inline graphic  −0·030 (0·0039)   Inline graphic   0·029 (0·0015)
Inline graphic   0·000 (0·0027)   Inline graphic   0·965 (0·0002)

Figure 1 plots the estimated difference in body mass index between the two groups as a function of the estimated linear index. The standard errors are obtained using the infinitesimal jackknife. In general, the difference in average body mass index between the two groups is insignificant. However, the participants tend to have a slightly higher body mass index than nonparticipants when the linear index lies between 40 and 50, which accounts for the slightly positive estimated average treatment effect.

Fig. 1.

Estimated difference in body mass index between participants and nonparticipants of the school meal programme plotted against the estimated linear index for the 2007–2008 National Health and Nutrition Examination Survey data. The solid line represents the estimated conditional effects and the dashed lines represent pointwise 95% confidence limits.
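For a one-dimensional index such as the one estimated here, the conditional and average effects can be sketched as the difference of two kernel regressions on the index, averaged over the empirical distribution of the index. The following is a minimal illustration with a Gaussian kernel and a fixed bandwidth, not the paper's estimator, which uses higher-order kernels and separate data-adaptive bandwidths for the conditional and average effects; the function names and simulated data are hypothetical.

```python
import numpy as np

def conditional_effect(index, y, d, grid, h):
    """Difference of Nadaraya-Watson regressions of the outcome on the
    one-dimensional index within each treatment group, evaluated at the
    points in `grid`, using a Gaussian kernel with bandwidth h."""
    def nw(x0, x, yy):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        return np.sum(w * yy) / np.sum(w)
    m1 = np.array([nw(g, index[d == 1], y[d == 1]) for g in grid])
    m0 = np.array([nw(g, index[d == 0], y[d == 0]) for g in grid])
    return m1 - m0

def average_effect(index, y, d, h):
    """Average the conditional effects over the empirical distribution of
    the index to obtain an average treatment effect estimate."""
    return conditional_effect(index, y, d, index, h).mean()

# Illustration on simulated data with a known constant treatment effect of 2
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                       # plays the role of the estimated index
d = (rng.uniform(size=n) < 0.5).astype(int)  # treatment indicator
y = x + 2.0 * d + rng.normal(scale=0.1, size=n)
ate = average_effect(x, y, d, h=0.3)
```

Pointwise confidence limits such as those in Fig. 1 would additionally require variance estimates, for instance via the infinitesimal jackknife used in the paper.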

5. Discussion

Recently, Ma & Zhu (2013) introduced an efficient estimating equation, derived from a likelihood approach, that attains the semiparametric efficiency bound for the central subspace. However, this bound is derived for a fixed structural dimension, and in practice the true structural dimension is unknown. Our proposed estimator estimates the structural dimension and the basis matrix simultaneously.

In observational studies, continuous treatments or exposures are also common. In the literature, a generalized propensity score has been introduced to estimate continuous treatment effects; see, for example, Imbens (2000). Since our proposed model accommodates the propensity score and the outcome regressions jointly, it would be interesting to extend this modelling approach to continuous treatment regimes.


Acknowledgement

The authors thank the editor, an associate editor, two reviewers, and Dr Mary Lou Thompson for their helpful comments and suggestions. The authors were partially supported by the National Heart, Lung, and Blood Institute of the U.S. National Institutes of Health.

Supplementary material

Supplementary material available at Biometrika online includes a comparison of several alternative estimation criteria for the joint central subspace, additional simulation results, and the proofs of Proposition 2 and Theorems 1–3.

References

1. Adragni K. P., Cook R. D. & Wu S. (2012). Grassmannoptim: An R package for Grassmann manifold optimization. J. Statist. Software 50, 1–18.
2. Chan K. C. G., Yam S. C. P. & Zhang Z. (2016). Globally efficient nonparametric inference of average treatment effects by empirical balancing calibration weighting. J. R. Statist. Soc. B 78, 673–700.
3. Cheng P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. J. Am. Statist. Assoc. 89, 81–7.
4. Cook R. D. (1998). Regression Graphics. New York: Wiley.
5. Cook R. D. & Li B. (2002). Dimension reduction for conditional mean in regression. Ann. Statist. 30, 455–74.
6. de Luna X., Waernbaum I. & Richardson T. S. (2011). Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika 98, 861–75.
7. Edelman A., Arias T. A. & Smith S. T. (1999). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20, 303–53.
8. Fukumizu K. & Leng C. (2014). Gradient-based kernel dimension reduction for regression. J. Am. Statist. Assoc. 109, 359–70.
9. Ghosh D., Zhu Y. & Coffman D. L. (2015). Penalized regression procedures for variable selection in the potential outcomes framework. Statist. Med. 34, 1645–58.
10. Häggström J. & de Luna X. (2014). Targeted smoothing parameter selection for estimating average causal effects. Comp. Statist. 29, 1727–48.
11. Hahn J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–31.
12. Hansen B. B. (2008). The prognostic analogue of the propensity score. Biometrika 95, 481–8.
13. Hirano K., Imbens G. W. & Ridder G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 1161–89.
14. Hu Z., Follmann D. A. & Wang N. (2014). Estimation of mean response via the effective balancing score. Biometrika 101, 613–24.
15. Huang M.-Y. & Chiang C. T. (2017). An effective semiparametric estimation approach for the sufficient dimension reduction model. J. Am. Statist. Assoc., to appear, doi:10.1080/01621459.2016.1215987.
16. Imai K. & Ratkovic M. (2014). Covariate balancing propensity score. J. R. Statist. Soc. B 76, 243–63.
17. Imbens G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika 87, 706–10.
18. Kohn M. J., Bell J. F., Grow M. G. & Chan K. C. G. (2014). Food insecurity, food assistance and weight status in US youth: New evidence from NHANES 2007–08. Pediatric Obesity 9, 155–66.
19. Leacy F. P. & Stuart E. A. (2014). On the joint use of propensity and prognostic scores in estimation of the average treatment effect on the treated: A simulation study. Statist. Med. 33, 3488–508.
20. Li B. & Wang S. (2007). On directional regression for dimension reduction. J. Am. Statist. Assoc. 102, 997–1008.
21. Li K.-C. (1991). Sliced inverse regression for dimension reduction (with Discussion). J. Am. Statist. Assoc. 86, 316–42.
22. Ma Y. & Zhu L. (2012). A semiparametric approach to dimension reduction. J. Am. Statist. Assoc. 107, 168–79.
23. Ma Y. & Zhu L. (2013). Efficient estimation in sufficient dimension reduction. Ann. Statist. 41, 250–68.
24. Qin J. & Zhang B. (2007). Empirical-likelihood-based inference in missing response problems and its application in observational studies. J. R. Statist. Soc. B 69, 101–22.
25. Robins J. M., Rotnitzky A. & Zhao L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Am. Statist. Assoc. 89, 846–66.
26. Rosenbaum P. R. & Rubin D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.
27. Vansteelandt S., Bekaert M. & Claeskens G. (2012). On model selection and model misspecification in causal inference. Statist. Meth. Med. Res. 21, 7–30.
28. Wang H. & Xia Y. (2008). Sliced regression for dimension reduction. J. Am. Statist. Assoc. 103, 811–21.
29. Xia Y. (2007). A constructive approach to the estimation of dimension reduction directions. Ann. Statist. 35, 2654–90.
30. Xia Y. (2008). A multiple-index model and dimension reduction. J. Am. Statist. Assoc. 103, 1631–40.
31. Yin X. & Li B. (2011). Sufficient dimension reduction based on an ensemble of minimum average variance estimators. Ann. Statist. 39, 3392–416.
32. Zhu L. P., Zhu L. X. & Feng Z. H. (2010). Dimension reduction in regressions through cumulative slicing estimation. J. Am. Statist. Assoc. 105, 1455–66.
33. Zhu Y. & Zeng P. (2006). Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Am. Statist. Assoc. 101, 1638–51.
