Summary:
We consider estimating average treatment effects (ATE) of a binary treatment in observational data when data-driven variable selection is needed to select relevant covariates from a moderately large number of available covariates X. To leverage covariates among X predictive of the outcome for efficiency gain while using regularization to fit a parametric propensity score (PS) model, we consider a dimension reduction of X based on fitting both working PS and outcome models using adaptive LASSO. A novel PS estimator, the Double-index Propensity Score (DiPS), is proposed, in which the treatment status is smoothed over the linear predictors for X from both the initial working models. The ATE is estimated by using the DiPS in a normalized inverse probability weighting (IPW) estimator, which is found to maintain double-robustness and also local semiparametric efficiency with a fixed number of covariates p. Under misspecification of working models, the smoothing step leads to gains in efficiency and robustness over traditional doubly-robust estimators. These results are extended to the case where p diverges with sample size and working models are sparse. Simulations show the benefits of the approach in finite samples. We illustrate the method by estimating the ATE of statins on colorectal cancer risk in an electronic medical record (EMR) study and the effect of smoking on C-reactive protein (CRP) in the Framingham Offspring Study.
Keywords: Causal inference, double-robustness, electronic medical records, kernel smoothing, regularization, semiparametric efficiency
1. Introduction
There is growing interest in evaluating medical treatments and policies in large-scale observational data such as electronic medical records (EMR). As with any observational data, in the absence of randomization, adjustment for a sufficient set of pre-treatment covariates X that satisfy “no unmeasured confounding” is needed when estimating average treatment effects (ATE) to avoid confounding bias. This is routinely done using propensity score (PS), outcome regression, and doubly-robust (DR) methods (Lunceford and Davidian, 2004). These methods were initially developed in settings where p, the dimension of X, was small relative to the sample size n. But large-scale observational data are increasingly collecting rich measurements in large sets of covariates, and data-driven variable selection approaches are needed due to the lack of sufficient prior knowledge to guide manual variable selection.
Effective variable selection for causal effect estimation involves consideration of dependencies between X with the treatment status T ∈ {0, 1} and outcome Y. Let index the subset of X upon which the PS depends, and let be an analogous index set for X upon which either μ1(x) or μ0(x) depends, where . For any index set , let denote its complement in {1, 2, …, p}. When X is sufficient for no unmeasured confounding, the covariates indexed in is a reduced set of covariates that is also sufficient for no unmeasured confounding (De Luna et al., 2011). However, additionally adjusting for purely prognostic covariates in can improve the efficiency of PS, outcome regression, and DR estimators (Lunceford and Davidian, 2004; Hahn, 2004; Brookhart et al., 2006).
To exploit this phenomenon, we consider an inverse probability weighting (IPW) estimator where the PS is initially estimated by regularized regression. Since variable selection procedures for the PS model would select out covariates in , we also estimate a regularized regression model for μk(x), for k = 0, 1, to recover variation from covariates in to inform estimation of a calibrated PS. The calibration is implemented through smoothing T over the linear predictors for X from both the initial PS and outcome models, which can be viewed as smoothing over working propensity and prognostic scores (Hansen, 2008). The resulting IPW estimator maintains double-robustness and achieves the semiparametric efficiency bound when p is fixed, under correctly specified PS and outcome working models. To the best of our knowledge, this is the first proposal in the literature that demonstrates these properties can be achieved through weighting only, without explicit augmentation. We show that the estimator is asymptotically linear and use this to characterize large-sample robustness and efficiency properties. The smoothing results in a refinement of the influence function under misspecification of the outcome model that can potentially result in substantial gains in efficiency relative to traditional DR estimators, which is confirmed in simulations. These properties hold in settings where p is either fixed or allowed to diverge slowly with n assuming fixed sparsity indices.
Data-driven variable selection for causal effect estimation has been considered in screening methods based on marginal associations between X with T and Y (Schneeweiss et al., 2009), but the results can be misleading because marginal associations need not agree with conditional associations. De Luna et al. (2011) carefully characterized and proposed algorithms to identify minimal subsets of covariates that are sufficient for no unmeasured confounding. Recent works have considered using regularized regression to select variables and post-selection methods that estimate treatment effects through partially linear models (Belloni et al., 2013) and DR estimators (Farrell, 2015; Belloni et al., 2017). These methods focus on delivering uniformly valid inference under high-dimensional regimes assuming approximately sparse models. Others have proposed modifying the regularization penalty itself in a way to select the relevant covariates and estimate treatment effects through IPW (Shortreed and Ertefaie, 2017) and DR estimators (Koch et al., 2018). However, these papers generally do not fully work out the asymptotic distribution of the final estimator, making efficiency comparisons with established methods difficult. Some of the methods are also only singly-robust. Bayesian model averaging (Cefalu et al. (2017) and references therein) offers a principled alternative for variable selection but encounters burdensome computations that are possibly infeasible for large p.
Our proposed double-index PS (DiPS) can be viewed as a simple and intuitive approach to dimension reduction of X for estimating the PS. The approach for DiPS closely resembles a method proposed for estimating mean outcomes in the presence of data missing at random (Hu et al., 2012), except we use the double-score to estimate a PS instead of an outcome model. In contrast to their results, we show that a higher-order kernel is required due to the two-dimensional smoothing, find explicit efficiency gains under misspecification of the outcome model, and consider p diverging with n. There is also some similar intuition shared with collaborative DR methods (van der Laan and Gruber, 2010) in that associations with both treatment and outcome are taken into account when estimating a PS. However, DiPS takes a much different approach to estimating the PS. In the following, we introduce the proposed method and consider its asymptotic properties in Sections 2 and 3. A perturbation-resampling method is proposed for inference in Section 4. Simulations and applications to estimating treatment effects in an EMR study and cohort study are presented in Section 5. We conclude with some additional remarks in Section 6.
2. Method
2.1. Notations and Problem Setup
Let be the observed data for the ith subject, where Yi is an outcome that could be modeled by a generalized linear model (GLM), Ti ∈ {0, 1} a binary treatment, and Xi is a p-dimensional vector of covariates with support . Here p is allowed to diverge slowly with n such that log(p)/log(n) → ν, for ν ∈ [0, 1), which includes the case where p is fixed by taking ν = 0. For a given n, the observed data consists of independent and identically distributed (iid) observations drawn from a distribution , which potentially may vary with n. We suppress the dependence in the notations, implicitly assuming statements involving and associated statistical functionals hold for each n. Let and denote the counterfactual outcomes had a subject received treatment or control. Based on , we want to make inferences about the average treatment effect (ATE):
$$\Delta = E\{Y^{(1)}\} - E\{Y^{(0)}\} \tag{1}$$
For identifiability, we require the following standard causal inference assumptions:
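$$Y = T\,Y^{(1)} + (1 - T)\,Y^{(0)} \text{ with probability } 1 \tag{2}$$
$$\pi_1(x) \in [\epsilon_\pi, 1 - \epsilon_\pi] \text{ for some } \epsilon_\pi > 0 \text{ and all } x \text{ in the support of } X \tag{3}$$
$$Y^{(1)} \perp T \mid X \quad \text{and} \quad Y^{(0)} \perp T \mid X \tag{4}$$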
where πk(x) = P(T = k | X = x), for k = 0, 1. The third condition assumes that X is a sufficient set of covariates such that no unmeasured confounding holds given the entire X. Under these assumptions, Δ can be identified from the observed data distribution through:
where , for k = 0, 1. We will consider an estimator based on the IPW form that will nevertheless be doubly-robust so that it is consistent under models where either πk(x) or μk(x) is correctly specified.
2.2. Parametric Models for Nuisance Functions
We consider parametric modeling as a means to reduce the dimensions of X when estimating the PS. For reference, let be the nonparametric model for the distribution of Z, , that has no restrictions on except requiring the second moment of Z to be finite. Let and respectively denote parametric working models under which:
| (5) |
| (6) |
where gπ(·) and gμ(·) are known link functions, and and are unknown parameters. In (6) slopes are allowed to differ by treatment arms to allow for heterogeneous effects of T for subjects with different X even with a linear link. When it is reasonable to assume heterogeneity is weak or nonexistent, it may be beneficial for efficiency to restrict β0 = β1.
Regardless of the validity of either working model (i.e. whether ), we first obtain estimates of α and βk’s through adaptive LASSO (Zou, 2006):
| (7) |
| (8) |
where denotes the log-likelihood for under given Ti and Xi, is a log-likelihood for from a GLM suitable for the outcome type of Y under given Zi, , , and are initial root-n consistent estimates, λπ,n is a tuning parameter such that n^{1/2}λπ,n → 0 and n^{(1−ν)(1+γ)/2}λπ,n → ∞, with γ > 2ν/(1 − ν), and similarly for λμ,n (Zou and Zhang, 2009). We specify adaptive LASSO here to estimate the nuisance parameters for concreteness, but use of other penalized likelihood methods can also be justified, so long as they have an oracle property, as in Theorem 2 of Zou (2006) and described below.
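To make the two-stage structure of adaptive LASSO concrete, the following is a minimal sketch under simplifying assumptions: it uses a least-squares loss with no intercept rather than the GLM log-likelihoods of the working models, a ridge fit as the initial estimator, and arbitrary illustrative values for the tuning parameters `lam`, `gamma`, and `ridge`. The function names are hypothetical and are not taken from any package.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def coordinate_descent(X, y, l1, l2=0.0, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + sum_j l1[j]*|b_j| + (l2/2)*||b||^2."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residuals y - Xb for the current b (b starts at zero)
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the partial residual that excludes b_j
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * b[j]
            new_bj = soft_threshold(rho, l1[j]) / (col_ss[j] + l2)
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                b[j] = new_bj
    return b

def adaptive_lasso(X, y, lam, gamma=1.0, ridge=0.01):
    """Ridge initial fit, then an L1 penalty weighted by 1/|b_init|^gamma."""
    p = len(X[0])
    b_init = coordinate_descent(X, y, [0.0] * p, l2=ridge)
    # Small initial coefficients receive large penalties and are zeroed out
    weights = [lam / (abs(b) ** gamma + 1e-12) for b in b_init]
    return coordinate_descent(X, y, weights)

# Toy data (illustrative only): the first two of five covariates matter
random.seed(0)
n, p = 400, 5
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * x[0] - 1.0 * x[1] + random.gauss(0, 1) for x in X]
beta_hat = adaptive_lasso(X, y, lam=0.05)
```

In this toy setting the weighted penalty removes the three noise covariates while only lightly shrinking the two active coefficients, which is the behavior that the oracle property formalizes.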
Under model (5) and (6), we assume that α and βk, for k = 0, 1, are sparse. More generally, regardless of whether working models are correct or misspecified, we assume that there exist least false parameters and (Lu et al., 2012) such that:
| (9) |
Let and be respective supports for and and let and be the sparsity indices. We further assume and have fixed sparsity such that:
| (10) |
For any vector v of length p and any index set , let denote the subvector of v restricted to elements indexed in . Assumption (9) is a high-level assumption that would be required for and to maintain an oracle property with respect to the least false parameters and under possibly misspecified working models. Under this assumption, using arguments similar to those in Lu et al. (2012) and Zou and Zhang (2009), it can be shown that and admits an expansion of the form , which would yield the asymptotic normality results of the oracle property, and similarly for . We rely on these results along with (10) to show that the DiPS IPW is asymptotically linear in Theorem 1. In regimes where ν > 0, (10) models a setting in which a small number of covariates exhibit non-negligible associations with T and Y and a majority of covariates are noise. Assumption (10) may not be required for asymptotic linearity and can potentially be relaxed allowing and to diverge slowly, for example, if they are o(n^{1/3}). We invoke this assumption to avoid complications of a growing support, which may need triangular array asymptotics to accommodate dependence of the support on n.
2.3. Double-Index Propensity Score and IPW Estimator
To mitigate the effects of misspecification of (5), one could perform nonparametric smoothing of T over to calibrate the initial PS estimator . We consider smoothing over not only but also as well to allow variation in prognostic covariates indexed in to inform this calibration. Such covariates are reduced into to allow for nonparametric kernel smoothing in low (two) dimensions. The DiPS estimator for each treatment is:
| (11) |
where , K_h(u) = h^{−2}K(u/h), and K(u) is a bivariate q-th order kernel function with q > 2. A higher-order kernel is required here for the asymptotics to be well-behaved, which is the price for estimating the nuisance functions πk(x) using two-dimensional smoothing. This allows for the possibility of negative values for . Nevertheless, are nuisance estimates not of direct interest, and we find that such negative PS estimates typically occur infrequently, occurring on average in simulations in 0.01% to 2.10% of observations depending on the size of n and p across scenarios where working models are correctly or incorrectly specified (Web Appendix D). As they are infrequent and do not appear to compromise the performance of the final estimator, they can potentially be left as is when encountered in practice. Alternatively, methods that discard or trim PS estimates to handle near-violations of positivity, as in Assumption (3), can be considered (Crump et al., 2009). A monotone transformation of the input scores for each treatment can be applied prior to smoothing to improve finite sample performance (Wand et al., 1991). In numerical studies, for instance, we applied a probability integral transform based on the normal cumulative distribution function to the standardized scores to obtain approximately uniformly distributed inputs. The components of can also be scaled such that a common bandwidth h can be used for both components of the score.
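The smoothing step can be sketched as a Nadaraya-Watson regression of T on the two scores. The sketch below is illustrative only and rests on several assumptions: the two score vectors are taken as given (in practice they would be the fitted linear predictors from the working PS and outcome models), a product of fourth-order Gaussian kernels is used, the bandwidth `h` is arbitrary, and each observation is included in its own smooth. The names `gauss4`, `to_uniform`, and `dips` are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

def gauss4(u):
    """Fourth-order Gaussian kernel 0.5 * (3 - u^2) * phi(u); negative for |u| > sqrt(3)."""
    return 0.5 * (3.0 - u * u) * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def to_uniform(scores):
    """Probability integral transform of standardized scores via the normal CDF."""
    m, s = mean(scores), pstdev(scores)
    return [NormalDist().cdf((v - m) / s) for v in scores]

def dips(treat, ps_score, out_score, h):
    """Kernel-smooth T over the two transformed scores (estimates P(T = 1 | scores))."""
    u1, u2 = to_uniform(ps_score), to_uniform(out_score)
    n = len(treat)
    fitted = []
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            # Product kernel; the 1/h^2 factor cancels in the ratio
            w = gauss4((u1[j] - u1[i]) / h) * gauss4((u2[j] - u2[i]) / h)
            num += w * treat[j]
            den += w
        fitted.append(num / den)
    return fitted

# Toy scores (illustrative only): treatment depends on the first score
rng = random.Random(0)
n = 300
ps_score = [rng.gauss(0, 1) for _ in range(n)]
out_score = [0.5 * a + rng.gauss(0, 1) for a in ps_score]
treat = [1 if rng.random() < 1 / (1 + math.exp(-2 * a)) else 0 for a in ps_score]
pi_hat = dips(treat, ps_score, out_score, h=0.3)
```

Because the fourth-order kernel takes negative values away from the origin, the resulting weights are not all positive, which is why fitted values outside [0, 1] can occasionally arise.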
With πk(x) estimated by , the estimator for Δ is given by , where:
| (12) |
This is the usual normalized IPW estimator, where the PS is estimated by the DiPS. The intuition for double-robustness of the estimator is as follows. Regardless of the validity of either working model, provided the asymptotics are well-behaved, is consistent for:
where , and . Under , so that the estimand, under the causal assumptions (2)–(4), reduces to:
On the other hand, under so that:
In the following, we show that (and thus ) are asymptotically linear. We then subsequently examine robustness and efficiency properties using the expansion.
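For reference, a normalized IPW estimator of the kind described above can be sketched as follows. This is a generic illustration rather than a transcription of (12): it takes a single vector of estimated P(T = 1 | X) values, here called `ps1`, and uses one minus that value for the control arm, whereas the DiPS is estimated for each treatment. The function name is hypothetical.

```python
def normalized_ipw(y, treat, ps1):
    """Normalized (ratio-type) IPW estimate of the difference in mean potential outcomes."""
    num1 = den1 = num0 = den0 = 0.0
    for yi, ti, pi in zip(y, treat, ps1):
        if ti == 1:
            # Treated subjects are weighted by 1 / P(T = 1 | X)
            num1 += yi / pi
            den1 += 1.0 / pi
        else:
            # Controls are weighted by 1 / P(T = 0 | X)
            num0 += yi / (1.0 - pi)
            den0 += 1.0 / (1.0 - pi)
    # Dividing by the sum of weights in each arm is the normalization step
    return num1 / den1 - num0 / den0

# Small worked example with hand-checkable weights
delta_hat = normalized_ipw(
    y=[2.0, 4.0, 1.0, 3.0],
    treat=[1, 1, 0, 0],
    ps1=[0.5, 0.25, 0.5, 0.75],
)
```

In the worked example the treated weights are 2 and 4 and the control weights are 2 and 4, giving weighted means of 20/6 and 14/6 and a difference of 1.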
3. Asymptotic Robustness and Efficiency Properties
We directly show in Web Appendix B that is asymptotically linear for k = 0, 1 in general without assuming either of the working models are correct. Let and for k = 0, 1 so that .
Theorem 1:
Suppose that causal assumptions (2)–(4), the least false parameter and sparsity assumptions (9)–(10) and regularity conditions in Web Appendix A hold. If log(p)/log(n) → ν for ν ∈ [0, 1), then is asymptotically linear in that it admits the expansion:
| (13) |
| (14) |
for k = 0, 1, where and are deterministic vectors, and are influence functions from asymptotic expansions of and . Under model for k = 0, 1. Under , we additionally have that , for k = 0, 1.
Proof sketch: can be decomposed as:
The first term directly contributes to the expansion. The second term is the contribution from re-estimating the PS through kernel smoothing given . We apply a V-statistic projection lemma (Newey and McFadden, 1994) to obtain an asymptotically linear representation. The third term can be expanded by Taylor expansion into terms of the form and . Applying the selection consistency that , . Lastly, we use that and work out the forms of the loading vector and repeat for to complete the expansion.
Let denote the usual doubly-robust estimator, as in Equation (9) of Lunceford and Davidian (2004), with the PS πk(x) and mean outcome μk(x) estimated in the same way as through (7) and (8). The influence function expansion for in Theorem 1 is nearly identical to that of . The terms in (13) would be the same except and replaces asymptotic estimates under parametric models. Terms in (14) analogously represent the additional contributions from estimating the nuisance parameters. No contribution from smoothing is incurred provided the bandwidths are suitably chosen. This similarity in the influence functions yields similar robustness and efficiency properties, which are improved upon under model misspecification due to the smoothing.
3.1. Robustness
As a consequence of Theorem 1, is root-n consistent for so that provided that h = O(n^{−α}) for . As discussed in Section 2.3, under , . Hence is doubly-robust for Δ in that is root-n consistent for Δ under . Beyond this usual form of double-robustness, if the PS model specification is incorrect, we expect the calibration step to at least partially correct for the misspecification in large samples since is closer to the true πk(x) than the misspecified parametric model . Let denote a model under which for some unknown link function and unknown , and X are known to be elliptically distributed such that exists and is linear in , where α* denotes the true α (e.g. if X is multivariate normal). By the results of Li and Duan (1989), it can be shown that for some scalar c under . But since is consistent for , it recovers πk(x) under . Consequently, also has some mild benefits in robustness in that under the slightly larger model . The same phenomenon also occurs when estimating βk under misspecification of the link in (6), if we do not assume β0 = β1. In this case, if is an analogous model under which and for some unknown link functions and and X are elliptically distributed, then under the slightly larger model . This does not hold when β0 = β1, as T is binary so (T, X^T)^T is not exactly elliptically distributed. But the result may still be expected to hold approximately.
3.2. Efficiency
Let the terms contributed to the influence function for when α and βk are known be:
| (15) |
Under , is the full influence function for . This is the efficient influence function for Δ* under at distributions for belonging to when p is fixed (Robins et al., 1994; Tsiatis, 2007), since and . When ν > 0 so that p diverges with n, there are no well-established semiparametric efficiency bounds. However with fixed sparsity indices (10), the asymptotic variance still reaches the same bound had p been fixed.
Beyond this characterization of efficiency that parallels that of , there are additional benefits of under . In this case, akin to , estimating βk does not contribute to the asymptotic variance since , and a similar term is contributed from estimating α. The analogous term in the expansion for contributes the negative of a projection of the preceding terms onto the linear span of the score function for α, restricted to components in , to its influence function (Section 9.1 of Tsiatis (2007)). The same interpretation of the influence function can be adopted for .
Theorem 2:
Let Uα be the score for α under and let denote the linear span of its components indexed in . In the Hilbert space of random variables with mean 0 and finite variance with inner product given by the covariance, let denote the projection of some into a subspace . If the assumptions required for Theorem 1 hold, under , .
The proof is based on simplifying and is given in Web Appendix B. This result can be used to show that the asymptotic variance of is lower than that of under . Based on this result, under the influence function for is , and for the usual DR estimator is , where:
But since better approximates μk(x) than the asymptotic estimate under the misspecified parametric model , it can then be shown that for k = 0, 1. Since the influence functions involve projections onto the same space , it can be seen through geometric argument that , so that is more efficient than under . We show in the simulation studies that this improvement can lead to substantial efficiency gains under in finite samples. These unique robustness and efficiency properties distinguish from and its variants. We next consider a perturbation scheme to estimate standard errors (SE) and confidence intervals (CI) for .
4. Perturbation Resampling
Although the asymptotic variance of can be determined through its influence function specified in Theorem 1, a direct empirical estimate based on the influence function is infeasible because it involves functionals of that are difficult to estimate. Instead we propose a simple perturbation-resampling procedure. Let be a set of non-negative iid random variables with unit mean and variance independent of . The procedure perturbs each “layer” of the estimation of . Let the perturbed estimates of and be:
where and are perturbed initial estimates obtained from analogously perturbing its estimating equations. The perturbed DiPS estimates are calculated by:
Lastly the perturbed estimator is given by where:
It can be shown based on arguments in Jin et al. (2001) that the asymptotic distribution of coincides with that of . We can thus approximate the SE of based on the empirical standard deviation or, as a robust alternative, the mean absolute deviations (MAD) of resamples and construct CI’s using percentiles of resamples.
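The mechanics of perturbation resampling can be illustrated with a deliberately simplified sketch. As an assumption made for brevity, only the final weighting layer is perturbed and the PS values are held fixed, whereas the procedure described above also perturbs the initial estimates, the adaptive LASSO fits, and the smoothing step. Exp(1) draws are used because they are non-negative with unit mean and variance. The SE is taken as the empirical standard deviation of the resamples; the function names and toy data are hypothetical.

```python
import random
from statistics import stdev

def weighted_ipw(y, treat, ps1, g):
    """Normalized IPW difference with a perturbation weight g_i for each subject."""
    num1 = den1 = num0 = den0 = 0.0
    for yi, ti, pi, gi in zip(y, treat, ps1, g):
        if ti == 1:
            num1 += gi * yi / pi
            den1 += gi / pi
        else:
            num0 += gi * yi / (1.0 - pi)
            den0 += gi / (1.0 - pi)
    return num1 / den1 - num0 / den0

def perturb(y, treat, ps1, n_rep=500, seed=1):
    """Return the SD of perturbed estimates and a 95% percentile interval."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_rep):
        # Exp(1) weights: non-negative, mean 1, variance 1
        g = [rng.expovariate(1.0) for _ in y]
        draws.append(weighted_ipw(y, treat, ps1, g))
    draws.sort()
    ci = (draws[int(0.025 * n_rep)], draws[int(0.975 * n_rep) - 1])
    return stdev(draws), ci

# Toy randomized data (illustrative only) with a known PS of 0.5
rng = random.Random(0)
n = 400
treat = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
y = [1.0 * t + rng.gauss(0, 1) for t in treat]
ps1 = [0.5] * n
estimate = weighted_ipw(y, treat, ps1, [1.0] * n)
se, (lower, upper) = perturb(y, treat, ps1)
```

Setting all perturbation weights to one recovers the unperturbed estimator, so the resamples are centered near the point estimate.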
5. Numerical Studies
5.1. Simulation Study
We performed extensive simulations to assess the finite sample bias and relative efficiency (RE) of (DiPS) compared to alternative estimators. We also assessed the performance of the perturbation procedure. Throughout, in implementing the adaptive LASSO, we used ridge regression for the initial estimators and , with the ridge tuning parameter chosen by minimizing the Akaike information criterion (AIC). The adaptive LASSO tuning parameter was chosen by an extended regularized information criterion (Hui et al., 2015), which exhibited relatively good performance for variable selection. We refitted models with selected covariates to reduce bias, as suggested in Hui et al. (2015). The power parameter γ was set as , where ν = log(p)/log(n). A Gaussian product kernel of order q = 4 with a plug-in bandwidth at the optimal order (see Discussion) was used for smoothing. For comparison, we considered alternative standard estimators with nuisances estimated by regularization and recently developed methods for estimating ATE that incorporate variable selection: (1) IPW with π1(x) estimated by adaptive LASSO (IPW-ALAS), (2) with nuisances estimated by adaptive LASSO (DR-ALAS), (3) Modification of in which π1(x) and μk(x) are estimated by separate one-dimensional kernel smoothing of and among those assigned to T = k, for k = 0, 1 (DR-SIM), to allow for estimation of single index models (SIM) for π1(x) and μk(x), (4) Outcome-adaptive LASSO (OAL) (Shortreed and Ertefaie, 2017), (5) Group Lasso and Doubly Robust Estimation (GLiDeR) (Koch et al., 2018), (6) Model averaged doubly-robust estimator (MADR) (Cefalu et al., 2017). OAL and GLiDeR were implemented with default settings from code provided in the Supplementary Materials of the respective papers. MADR was implemented using the madr package with M = 500 Markov chain Monte Carlo (MCMC) iterations to reduce the computations.
Throughout the numerical studies, we specified gπ(u) = 1/(1 + e−u) for and gμ(u) = u with β0 = β1 for as the working models.
The covariates were generated to approximate the distribution of the covariates from the statins EMR data from Section 5.2. This was done to allow for non-elliptically distributed covariates that mimic the distribution of a real dataset. Initially we generated where and were the empirical mean and covariance matrix of the 15 covariates, which included 9 binary, 3 continuous, and 3 log-transformed count variables. For binary variables we thresholded the corresponding components of so that its mean matched those in , as in , where and are the empirical variance and mean of the j-th covariate and Φ(·) is the standard normal cumulative distribution function (CDF). Lastly, we centered and standardized to obtain the final covariates . The pairwise correlations of X were generally low, mostly ranging between −.2 and .2 (full correlation matrix reported in Web Appendix C). For settings with p > 15, we generated independent groups of the 15 covariates that maintained the correlation structure within each group.
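The thresholding idea for binary covariates can be sketched in one dimension. The cutoff used here, the mean plus the standard deviation times Φ^{−1}(1 − m) for a target prevalence m, is an assumption chosen so that the thresholded normal variable has mean m; it is one natural form and is not claimed to be the exact expression used in the simulations. The multivariate correlation structure is omitted, and the function name and target value are hypothetical.

```python
import random
from statistics import NormalDist

def binary_from_normal(z, mu, sigma, target_mean):
    """Return 1 when z exceeds the cutoff that leaves probability target_mean above it."""
    cutoff = mu + sigma * NormalDist().inv_cdf(1.0 - target_mean)
    return 1 if z > cutoff else 0

rng = random.Random(2024)
target = 0.3  # hypothetical prevalence of one binary covariate
# For a binary covariate with mean m, the variance is m * (1 - m)
mu, sigma = target, (target * (1.0 - target)) ** 0.5
draws = [binary_from_normal(rng.gauss(mu, sigma), mu, sigma, target) for _ in range(20000)]
prevalence = sum(draws) / len(draws)
```

With this cutoff the simulated prevalence matches the target up to Monte Carlo error.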
We subsequently focused on a continuous outcome, generating the data according to T | X ~ Ber{π1(X)} and Y | X, T ~ N{μT (X), 102}. The simulations varied over scenarios where working models were correct or misspecified in which the true π1(x) and μk(x) are:
where the coefficients are α = .01 · (1, 2, 3, 4, 5, 6, 0_3, 3, 7, 0, 7, −5, 0, 0_{p−15})^T, α[1] = α, α[2] = (.02, .06, .02, .02, −.1, .02, 0_3, −.14, .1, 0, −.1, .14, 0, 0_{p−15})^T, ζ = (0_6, 1, 0_3, 1, 0_2, 1, 0, 0_{p−15})^T, β = (0_3, 1, .5, .25, .125, .0625, .03125, 0, 1, .5, 0, .25, .125, 0_{p−15})^T, β[1] = (0_3, .5, 0, .5, 1_3, 0, 1, 2, 0, 1, 2, 0_{p−15})^T, β[2] = (0_3, −1.5, .75, −1.5, 0_3, 0, −1.5, −.75, 0, 1.5, .75, 0_{p−15})^T, and a_m denotes a 1 × m vector that has all its elements equal to a. For the misspecified scenarios, either μk(x) or π1(x) is a double-index model that includes both linear terms in x and quadratic and two-way interaction terms among x that are omitted by linear working models. In the misspecified μk(x) case, the second index has some correlation with the PS index α^T x, modeling a situation in which there exist common latent factors not fully captured by a linear outcome model. The outcome model also includes an interaction term between x and treatment to allow for treatment effect heterogeneity. The parameters are set such that there are 5 covariates belonging to each of (i.e. confounders), (instruments), and (pure prognostic) when p = 15. The simulations were run for R = 1,000 repetitions.
Table 1 presents the bias and root mean square error (RMSE) for n = 500 and 5,000 when p = 15. Among the three scenarios considered, the bias for DiPS is small relative to the RMSE and generally diminishes towards zero as n increases, verifying its double-robustness. There remains some minor bias that persists when n = 5,000 for DiPS that is likely a result of bias from the smoothing, as DR-SIM also incurs similar residual bias. IPW-ALAS and OAL are singly-robust and the bias does not necessarily diminish under the misspecified π1(x) scenario, although their bias is also minor in the setting considered. MADR exhibited substantial bias under the misspecified μk(x) scenario that persisted in large samples, possibly due to selecting out confounders with weak outcome associations in its emphasis on selection of prognostic covariates. The results for bias for p = 50, 100 exhibited similar patterns.
Table 1.
Bias and RMSE of estimators by n and model specification scenario for p = 15.
| Size | Estimator | Both Correct: Bias | Both Correct: RMSE | Misspecified μk(x): Bias | Misspecified μk(x): RMSE | Misspecified π1(x): Bias | Misspecified π1(x): RMSE |
|---|---|---|---|---|---|---|---|
| n=500 | IPW-ALAS | 0.029 | 0.350 | 0.074 | 1.754 | 0.023 | 0.294 |
| n=500 | DR-ALAS | 0.002 | 0.330 | 0.029 | 1.684 | −0.001 | 0.285 |
| n=500 | DR-SIM | −0.021 | 0.315 | 0.127 | 1.495 | 0.013 | 0.287 |
| n=500 | OAL | 0.008 | 0.321 | 0.074 | 1.484 | 0.001 | 0.284 |
| n=500 | GLiDeR | 0.001 | 0.299 | 0.087 | 1.238 | 0.006 | 0.282 |
| n=500 | MADR | 0.022 | 0.300 | 0.172 | 1.247 | 0.008 | 0.282 |
| n=500 | DiPS | −0.017 | 0.319 | 0.101 | 1.193 | 0.013 | 0.293 |
| n=5,000 | IPW-ALAS | 0.001 | 0.111 | −0.002 | 0.588 | 0.033 | 0.108 |
| n=5,000 | DR-ALAS | −0.003 | 0.106 | −0.014 | 0.564 | −0.008 | 0.089 |
| n=5,000 | DR-SIM | −0.012 | 0.103 | 0.029 | 0.516 | −0.004 | 0.089 |
| n=5,000 | OAL | −0.002 | 0.105 | 0.000 | 0.527 | −0.007 | 0.089 |
| n=5,000 | GLiDeR | −0.001 | 0.098 | 0.034 | 0.413 | −0.006 | 0.088 |
| n=5,000 | MADR | 0.000 | 0.099 | 0.124 | 0.418 | −0.008 | 0.089 |
| n=5,000 | DiPS | −0.016 | 0.106 | 0.041 | 0.349 | −0.003 | 0.091 |
Figure 1 presents the RE under the different scenarios for n = 500 and 5,000 and p = 15, 50, 100. RE was defined as the ratio of the mean square error (MSE) for DR-ALAS relative to that of each estimator, with RE > 1 indicating greater efficiency compared to DR-ALAS. Under the “both correct” scenario many of the estimators generally exhibit similar efficiency, which can be expected since many are variants of the usual DR estimator and reach the semiparametric efficiency bound. When n = 500 and p = 60, there are some slightly greater differences, with GLiDeR and MADR leading in efficiency gains, possibly due to differences in the variable selection performance. These differences in efficiency appear to temper when the sample size is increased to n = 5,000 with p = 60. The results are similar in the “misspecified π1(x)” scenario, where most estimators exhibited similar efficiency.
Figure 1.

RE relative to DR-ALAS by n, p, and specification scenario.
In the “misspecified μk(x)” scenario, DiPS achieves over 70% efficiency gain compared to GLiDeR and MADR and over 140% compared to DR-SIM in the large sample setting when n = 5,000 and p = 15. This suggests that expected efficiency gains under misspecified outcome models due to the results of Section 3.2 can be substantial. Even if π1(x) and μk(x) are estimated under a SIM, there are still gains from DiPS when the PS direction is informative of the mean outcome beyond . These gains diminish when p is larger relative to n, possibly due to imperfect variable selection. Again GLiDeR and MADR achieve the highest efficiency when n = 500 and p = 60, notwithstanding the substantial bias of MADR. Thus the performance of DiPS using adaptive LASSO can be somewhat compromised when p is very large relative to n and the variable selection performance is sub-optimal.
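The RE figures can be checked against the rounded RMSE values in Table 1 for the misspecified μk(x) scenario with n = 5,000 and p = 15. The short sketch below computes RE as the squared RMSE ratio relative to DR-ALAS. Reading the quoted gains as differences between RE values, each measured relative to DR-ALAS, is an assumption; under that reading the rounded table entries reproduce the "over 70%" and "over 140%" statements.

```python
# RMSE values copied from Table 1 (misspecified mu_k(x), n = 5,000, p = 15)
rmse = {
    "DR-ALAS": 0.564,
    "DR-SIM": 0.516,
    "GLiDeR": 0.413,
    "MADR": 0.418,
    "DiPS": 0.349,
}

def relative_efficiency(rmse_reference, rmse_estimator):
    """Ratio of MSEs, reference over estimator; values above 1 favor the estimator."""
    return (rmse_reference / rmse_estimator) ** 2

re = {name: relative_efficiency(rmse["DR-ALAS"], value) for name, value in rmse.items()}

# Gain of DiPS over each competitor, expressed as a difference in RE
gain_over = {name: re["DiPS"] - re[name] for name in ("GLiDeR", "MADR", "DR-SIM")}

for name, gain in gain_over.items():
    print(f"DiPS vs {name}: RE difference = {gain:.2f}")
```

By construction the RE of DR-ALAS is 1, and the RE of DiPS from these rounded entries is about 2.6.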
Table 2 presents the performance of perturbation for DiPS when p = 15, 30 under correct working models. SEs for DiPS were estimated using the MAD. The empirical SEs (Emp SE), calculated from the sample standard deviations of over the simulation repetitions, were generally similar to the average of the SE estimates over the repetitions (ASE), despite some overestimation of about 2–15% of the Emp SE. The coverage levels of the percentile CIs (Cover) were close to the nominal 95% level but tended to be somewhat conservative.
Table 2.
Perturbation performance under correctly specified models. Emp SE: empirical standard error over simulations, ASE: average of standard error estimates based on MAD over perturbations, Cover: Coverage of 95% percentile intervals.
| p | n | Emp SE | ASE | Cover |
|---|---|---|---|---|
| 15 | 500 | 0.350 | 0.362 | 0.966 |
| 15 | 2500 | 0.151 | 0.167 | 0.970 |
| 15 | 5000 | 0.108 | 0.119 | 0.965 |
| 30 | 500 | 0.348 | 0.356 | 0.961 |
| 30 | 2500 | 0.150 | 0.167 | 0.975 |
| 30 | 5000 | 0.103 | 0.119 | 0.973 |
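The degree of SE overestimation can be recomputed from the rounded entries of Table 2 as ASE divided by Emp SE minus one. This is only a check on the reported range using the published, rounded values; the variable names are hypothetical.

```python
# (p, n, Emp SE, ASE) rows copied from Table 2
rows = [
    (15, 500, 0.350, 0.362),
    (15, 2500, 0.151, 0.167),
    (15, 5000, 0.108, 0.119),
    (30, 500, 0.348, 0.356),
    (30, 2500, 0.150, 0.167),
    (30, 5000, 0.103, 0.119),
]

# Relative overestimation of the empirical SE by the average SE estimate
overestimation = {(p, n): ase / emp - 1.0 for p, n, emp, ase in rows}

for (p, n), value in overestimation.items():
    print(f"p={p}, n={n}: ASE exceeds Emp SE by {100 * value:.1f}%")
```

All six ratios exceed one, in line with the somewhat conservative coverage reported in the table.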
5.2. Data Example: Effect of Statins on Colorectal Cancer Risk in EMRs
We applied DiPS to assess the effect of statins, a medication for lowering cholesterol levels, on the risk of colorectal cancer (CRC) among patients with inflammatory bowel disease (IBD) identified using data from EMRs of Partners Healthcare. Previous studies have suggested that statins have a protective effect on CRC, but few studies have considered the effect specifically among IBD patients. The EMR cohort consisted of n = 10,817 IBD patients, including 1,375 statin users. CRC status and statin use were ascertained by the presence of ICD9 diagnosis and prescription codes. We adjusted for p = 15 covariates as potential confounders, including age, gender, race, smoking status, indication of elevated inflammatory markers, examination with colonoscopy, use of biologics and immunomodulators, subtypes of IBD, disease duration, and presence of primary sclerosing cholangitis (PSC).
For the working model , we specified gμ(u) = 1/(1 + e−u) to accommodate the binary outcome. SEs for other estimators were obtained from the MAD over bootstrap resamples. CIs were calculated from percentile intervals. We also calculated a two-sided p-value from a Wald test for the null that statins have no effect, using the point and SE estimates for each estimator. The unadjusted estimate (None) based on difference in means by statin use was also calculated as a reference. The left side of Table 3 shows that, without adjustment, the naive risk difference is estimated to be −0.8% with a SE of 0.4%. The other methods estimated that statins had a protective effect ranging from around −1% to −3% after adjustment for covariates. DiPS and DR-SIM were the most efficient estimators, with DiPS achieving estimated variance that ranged from 34% to 61% lower than that of other estimators.
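The two-sided Wald p-value described above follows directly from a point estimate and its SE. The sketch below applies the standard formula to the rounded unadjusted estimate; because the inputs are rounded to three decimals, the result need not match the tabulated p-value exactly. The function name is hypothetical.

```python
from statistics import NormalDist

def wald_p_value(estimate, se):
    """Two-sided p-value for the null of no effect, using a normal reference distribution."""
    z = estimate / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Rounded unadjusted risk difference and SE for the statins example
p_none = wald_p_value(-0.008, 0.004)
print(f"Wald p-value from rounded inputs: {p_none:.4f}")
```

With the rounded inputs the test statistic is −2, so the p-value is close to 0.0455.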
Table 3.
Data example on the effect of statins on CRC risk in EMR data and the effect of smoking on logCRP in FOS data. Est: Point estimate, SE: estimated SE, 95% CI: confidence interval, p-val: p-value from Wald test of no effect.
| | IBD EMR Study | | | | FOS | | | |
|---|---|---|---|---|---|---|---|---|
| Method | Est | SE | 95% CI | p-val | Est | SE | 95% CI | p-val |
| None | −0.008 | 0.004 | (−0.017, 0.000) | 0.047 | 0.180 | 0.058 | (0.065, 0.298) | 0.002 |
| IPW-ALAS | −0.022 | 0.004 | (−0.031, −0.015) | <0.001 | 0.182 | 0.063 | (0.053, 0.307) | 0.004 |
| DR-ALAS | −0.020 | 0.005 | (−0.029, −0.012) | <0.001 | 0.140 | 0.063 | (0.031, 0.277) | 0.026 |
| DR-SIM | −0.023 | 0.003 | (−0.029, −0.018) | <0.001 | 0.143 | 0.057 | (0.044, 0.257) | 0.013 |
| OAL | −0.008 | 0.004 | (−0.017, 0.000) | 0.048 | 0.175 | 0.061 | (0.062, 0.301) | 0.004 |
| GLiDeR | −0.031 | 0.005 | (−0.040, −0.022) | <0.001 | 0.147 | 0.058 | (0.045, 0.258) | 0.012 |
| MADR | −0.030 | 0.005 | (−0.040, −0.021) | <0.001 | 0.149 | 0.056 | (0.037, 0.258) | 0.008 |
| DiPS | −0.024 | 0.003 | (−0.029, −0.017) | <0.001 | 0.141 | 0.058 | (0.039, 0.276) | 0.015 |
5.3. Data Example: Framingham Offspring Study
The Framingham Offspring Study (FOS) is a cohort study initiated in 1971 that enrolled 5,124 adult children of the original Framingham Heart Study participants and the spouses of these children. The study collected data over time on participants’ medical history, physician examination, and laboratory tests to examine epidemiological and genetic risk factors of cardiovascular disease (CVD). A subset of the FOS participants also have their genotype from the Affymetrix 500K single-nucleotide polymorphism (SNP) array available through the Framingham SNP Health Association Resource (SHARe) on dbGaP. We assessed the effect of smoking on C-reactive protein (CRP) levels, an inflammation marker highly predictive of CVD risk, while adjusting for potential confounders including gender, age, diabetes status, use of hypertensive medication, systolic and diastolic blood pressure measurements, and HDL and total cholesterol measurements, as well as a large number of SNPs in gene regions previously reported to be associated with inflammation or obesity. While the inflammation-related SNPs are not likely to affect smoking, we include them as prognostic covariates for efficiency. The analysis includes n = 1,892 individuals with available information on CRP and the p = 121 covariates, of which 113 were SNPs.
Since CRP is heavily skewed, we applied a log transformation so that the linear regression working model better fits the data. SEs, CIs, and p-values were calculated in the same way as above. The right side of Table 3 shows that the different methods agree that smoking significantly increases logCRP. In general, point estimates tended to attenuate after adjusting for covariates, since smokers are likely to have other characteristics that increase inflammation. DiPS, DR-SIM, and MADR were among the most efficient, though efficiency gains are tempered in this setting with larger p relative to n.
6. Discussion
In this paper we developed a novel IPW estimator for the ATE that accommodates data-driven variable selection through regularized regression. The estimator retains double-robustness and is locally semiparametric efficient when ν = 0. By calibrating the initial PS through smoothing, additional gains in efficiency can potentially be achieved in large samples under misspecification of the working outcome model.
In numerical studies, we used the extended regularized information criterion (Hui et al., 2015) to tune adaptive LASSO, which maintains selection consistency when $\log(p)/\log(n) \to \nu$, for $\nu \in [0, 1)$. Other criteria such as cross-validation can also be used and may exhibit better performance in some cases. The bandwidth $h$ must be selected such that the dominating errors in the influence function, which are of order $O_p(n^{1/2}h^{q} + n^{-1/2}h^{-2})$, converge to 0. This is satisfied for $h = O(n^{-\alpha})$ with $\alpha \in (1/(2q), 1/4)$. The optimal bandwidth $h^*$ is one that balances these bias and variance terms and is of order $h^* = O(n^{-1/(q+2)})$. In practice we use a plug-in estimator of $h^*$ based on the sample standard deviation of either of the two estimated linear predictors, possibly after applying a monotonic transformation. Cross-validation can also be used to select the smoothing bandwidth.
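The bandwidth rates quoted above can be checked directly from the stated error order. The short derivation below ignores constants and treats the bandwidth as exactly $h = n^{-\alpha}$, which is a simplifying assumption made here for illustration.

```latex
\begin{align*}
n^{1/2} h^{q} = n^{1/2 - \alpha q} \to 0
  &\iff \alpha > \frac{1}{2q}, \\
n^{-1/2} h^{-2} = n^{-1/2 + 2\alpha} \to 0
  &\iff \alpha < \frac{1}{4}, \\
n^{1/2} h^{q} \asymp n^{-1/2} h^{-2}
  &\iff h^{q+2} \asymp n^{-1}
  \iff h^{*} \asymp n^{-1/(q+2)}.
\end{align*}
```

The admissible interval for $\alpha$ is nonempty only when $q > 2$, and in that case $\alpha = 1/(q+2)$ lies inside it, so the balancing bandwidth also drives both error terms to zero.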
The adaptive LASSO estimators of the working model parameters are not uniformly root-n consistent when the penalty is tuned to achieve consistent model selection (Pötscher and Schneider, 2009), and their oracle properties derived under fixed-parameter asymptotics may fail to capture essential features of finite-sample distributions. For example, they are not root-n consistent when the true parameters are of order $O(n^{-1/2})$, that is, when the true signals are relatively weak. The importance of uniform inference has also recently been highlighted for treatment effect estimation in high-dimensional settings (Belloni et al., 2013; Farrell, 2015). It would be of interest to consider alternative variable selection approaches beyond those grounded in oracle properties to achieve uniform inference. Another limitation of relying on adaptive LASSO is that when p is large so that ν is large, a large power parameter γ would be required to maintain the oracle properties, leading to an unstable penalty and poor finite-sample performance. It would be of interest to consider modifications of the proposed procedure to accommodate high-dimensional settings with p ≫ n and more general sparsity assumptions in future work.
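To see why a large power parameter can destabilize the penalty, recall that adaptive LASSO (Zou, 2006) penalizes the j-th coefficient with a weight of the form 1/|b_j|^γ, where b_j is an initial estimate. The sketch below uses hypothetical initial estimates, not values from the paper, and the function name is illustrative only.

```python
def adaptive_lasso_weights(initial_coefs, gamma):
    """Penalty weights 1 / |b|^gamma; assumes all initial estimates are nonzero."""
    return [abs(b) ** (-gamma) for b in initial_coefs]


# Hypothetical initial estimates: a strong, a moderate, and a weak signal.
initial = [1.0, 0.5, 0.05]

for gamma in (1, 2, 4):
    w = adaptive_lasso_weights(initial, gamma)
    spread = max(w) / min(w)
    print(f"gamma={gamma}: weights={w}, max/min ratio={spread:.0f}")
```

The ratio between the largest and smallest weights grows geometrically in γ, so a small perturbation of a weak initial estimate translates into a large change in its penalty.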
Acknowledgements
The authors would like to thank the editor, associate editor, and two referees for their insightful feedback and suggestions. Most of this work was done when the first author was a graduate student at Harvard University. This work was supported by National Institutes of Health grants T32CA009337 and R01HL089778. The views expressed in this article are those of the authors and do not necessarily reflect the views of the Department of Veterans Affairs.
Footnotes
Supporting Information
Web Appendices referenced in Sections 2, 3, and 5 as well as the R code implementing the procedure are available with this paper at the Biometrics website on Wiley Online Library.
References
- Belloni A, Chernozhukov V, Fernández-Val I, and Hansen C (2017). Program evaluation and causal inference with high-dimensional data. Econometrica 85, 233–298.
- Belloni A, Chernozhukov V, and Hansen C (2013). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies 81, 608–650.
- Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, and Stürmer T (2006). Variable selection for propensity score models. American Journal of Epidemiology 163, 1149–1156.
- Cefalu M, Dominici F, Arvold N, and Parmigiani G (2017). Model averaged double robust estimation. Biometrics 73, 410–421.
- Crump RK, Hotz VJ, Imbens GW, and Mitnik OA (2009). Dealing with limited overlap in estimation of average treatment effects. Biometrika 96, 187–199.
- De Luna X, Waernbaum I, and Richardson TS (2011). Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika 98, 861–875.
- Farrell MH (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics 189, 1–23.
- Hahn J (2004). Functional restriction and efficiency in causal inference. The Review of Economics and Statistics 86, 73–76.
- Hansen BB (2008). The prognostic analogue of the propensity score. Biometrika 95, 481–488.
- Hu Z, Follmann DA, and Qin J (2012). Semiparametric double balancing score estimation for incomplete data with ignorable missingness. Journal of the American Statistical Association 107, 247–257.
- Hui FK, Warton DI, and Foster SD (2015). Tuning parameter selection for the adaptive lasso using ERIC. Journal of the American Statistical Association 110, 262–269.
- Jin Z, Ying Z, and Wei L-J (2001). A simple resampling method by perturbing the minimand. Biometrika 88, 381–390.
- Koch B, Vock DM, and Wolfson J (2018). Covariate selection with group lasso and doubly robust estimation of causal effects. Biometrics 74, 8–17.
- Li K-C and Duan N (1989). Regression analysis under link violation. The Annals of Statistics 17, 1009–1052.
- Lu W, Goldberg Y, and Fine J (2012). On the robustness of the adaptive lasso to model misspecification. Biometrika 99, 717–731.
- Lunceford JK and Davidian M (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects. Statistics in Medicine 23, 2937–2960.
- Newey WK and McFadden D (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics 4, 2111–2245.
- Pötscher BM and Schneider U (2009). On the distribution of the adaptive lasso estimator. Journal of Statistical Planning and Inference 139, 2775–2790.
- Robins JM, Rotnitzky A, and Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
- Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, and Brookhart MA (2009). High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology 20, 512–522.
- Shortreed SM and Ertefaie A (2017). Outcome-adaptive lasso: Variable selection for causal inference. Biometrics 73, 1111–1122.
- Tsiatis A (2007). Semiparametric theory and missing data. Springer.
- van der Laan MJ and Gruber S (2010). Collaborative double robust targeted maximum likelihood estimation. The International Journal of Biostatistics 6, Article 17.
- Wand MP, Marron JS, and Ruppert D (1991). Transformations in density estimation. Journal of the American Statistical Association 86, 343–353.
- Zou H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
- Zou H and Zhang HH (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics 37, 1733–1751.
