Author manuscript; available in PMC: 2020 Sep 12.
Published in final edited form as: Biometrics. 2019 Dec 16;76(3):767–777. doi: 10.1111/biom.13195

Estimating Average Treatment Effects with a Double-Index Propensity Score

David Cheng 1, Abhishek Chakrabortty 2, Ashwin N Ananthakrishnan 3, Tianxi Cai 4,*
PMCID: PMC7370895  NIHMSID: NIHMS1592926  PMID: 31797368

Summary:

We consider estimating average treatment effects (ATE) of a binary treatment in observational data when data-driven variable selection is needed to select relevant covariates from a moderately large number of available covariates X. To leverage covariates among X predictive of the outcome for efficiency gain while using regularization to fit a parametric propensity score (PS) model, we consider a dimension reduction of X based on fitting both working PS and outcome models using adaptive LASSO. A novel PS estimator, the Double-index Propensity Score (DiPS), is proposed, in which the treatment status is smoothed over the linear predictors for X from both the initial working models. The ATE is estimated by using the DiPS in a normalized inverse probability weighting (IPW) estimator, which is found to maintain double-robustness and also local semiparametric efficiency with a fixed number of covariates p. Under misspecification of working models, the smoothing step leads to gains in efficiency and robustness over traditional doubly-robust estimators. These results are extended to the case where p diverges with sample size and working models are sparse. Simulations show the benefits of the approach in finite samples. We illustrate the method by estimating the ATE of statins on colorectal cancer risk in an electronic medical record (EMR) study and the effect of smoking on C-reactive protein (CRP) in the Framingham Offspring Study.

Keywords: Causal inference, double-robustness, electronic medical records, kernel smoothing, regularization, semiparametric efficiency

1. Introduction

There is growing interest in evaluating medical treatments and policies in large-scale observational data such as electronic medical records (EMR). As with any observational data, in the absence of randomization, adjustment for a sufficient set of pre-treatment covariates X that satisfy “no unmeasured confounding” is needed when estimating average treatment effects (ATE) to avoid confounding bias. This is routinely done using propensity score (PS), outcome regression, and doubly-robust (DR) methods (Lunceford and Davidian, 2004). These methods were initially developed in settings where p, the dimension of X, was small relative to the sample size n. But large-scale observational data are increasingly collecting rich measurements in large sets of covariates, and data-driven variable selection approaches are needed due to the lack of sufficient prior knowledge to guide manual variable selection.

Effective variable selection for causal effect estimation involves consideration of the dependencies of X with the treatment status T ∈ {0, 1} and outcome Y. Let $\mathcal{A}_\pi \subseteq \{1, 2, \ldots, p\}$ index the subset of X upon which the PS $\pi_1(x) = \mathbb{P}(T = 1 \mid X = x)$ depends, and let $\mathcal{A}_\mu$ be an analogous index set for X upon which either $\mu_1(x)$ or $\mu_0(x)$ depends, where $\mu_k(x) = E(Y \mid X = x, T = k)$. For any index set $\mathcal{S} \subseteq \{1, 2, \ldots, p\}$, let $\mathcal{S}^c$ denote its complement in {1, 2, …, p}. When X is sufficient for no unmeasured confounding, the covariates indexed in $\mathcal{A}_\pi$ are a reduced set of covariates that is also sufficient for no unmeasured confounding (De Luna et al., 2011). However, additionally adjusting for purely prognostic covariates in $\mathcal{A}_\pi^c \cap \mathcal{A}_\mu$ can improve the efficiency of PS, outcome regression, and DR estimators (Lunceford and Davidian, 2004; Hahn, 2004; Brookhart et al., 2006).

To exploit this phenomenon, we consider an inverse probability weighting (IPW) estimator where the PS is initially estimated by regularized regression. Since variable selection procedures for the PS model would select out covariates in $\mathcal{A}_\pi^c \cap \mathcal{A}_\mu$, we also estimate a regularized regression model for $\mu_k(x)$, for k = 0, 1, to recover variation from covariates in $\mathcal{A}_\pi^c \cap \mathcal{A}_\mu$ to inform estimation of a calibrated PS. The calibration is implemented through smoothing T over the linear predictors for X from both the initial PS and outcome models, which can be viewed as smoothing over working propensity and prognostic scores (Hansen, 2008). The resulting IPW estimator maintains double-robustness and achieves the semiparametric efficiency bound when p is fixed, under correctly specified PS and outcome working models. To the best of our knowledge, this is the first proposal in the literature that demonstrates these properties can be achieved through weighting only, without explicit augmentation. We show that the estimator is asymptotically linear and use this to characterize large-sample robustness and efficiency properties. The smoothing results in a refinement of the influence function under misspecification of the outcome model that can potentially result in substantial gains in efficiency relative to traditional DR estimators, which is confirmed in simulations. These properties hold in settings where p is either fixed or allowed to diverge slowly with n assuming fixed sparsity indices.

Data-driven variable selection for causal effect estimation has been considered in screening methods based on marginal associations between X with T and Y (Schneeweiss et al., 2009), but the results can be misleading because marginal associations need not agree with conditional associations. De Luna et al. (2011) carefully characterized and proposed algorithms to identify minimal subsets of covariates that are sufficient for no unmeasured confounding. Recent works have considered using regularized regression to select variables and post-selection methods that estimate treatment effects through partially linear models (Belloni et al., 2013) and DR estimators (Farrell, 2015; Belloni et al., 2017). These methods focus on delivering uniformly valid inference under high-dimensional regimes assuming approximately sparse models. Others have proposed modifying the regularization penalty itself so as to select the relevant covariates and estimate treatment effects through IPW (Shortreed and Ertefaie, 2017) and DR estimators (Koch et al., 2018). However, these papers generally do not work out the full asymptotic distribution of the final estimator, making efficiency comparisons with established methods difficult. Some of the methods are also only singly-robust. Bayesian model averaging (Cefalu et al. (2017) and references therein) offers a principled alternative for variable selection but encounters burdensome computations that are possibly infeasible for large p.

Our proposed double-index PS (DiPS) can be viewed as a simple and intuitive approach to dimension reduction of X for estimating the PS. The approach for DiPS closely resembles a method proposed for estimating mean outcomes in the presence of data missing at random (Hu et al., 2012), except we use the double-score to estimate a PS instead of an outcome model. In contrast to their results, we show that a higher-order kernel is required due to the two-dimensional smoothing, find explicit efficiency gains under misspecification of the outcome model, and consider p diverging with n. There is also some similar intuition shared with collaborative DR methods (van der Laan and Gruber, 2010) in that associations with both treatment and outcome are taken into account when estimating a PS. However, DiPS takes a much different approach to estimating the PS. In the following, we introduce the proposed method and consider its asymptotic properties in Sections 2 and 3. A perturbation-resampling method is proposed for inference in Section 4. Simulations and applications to estimating treatment effects in an EMR study and cohort study are presented in Section 5. We conclude with some additional remarks in Section 6.

2. Method

2.1. Notations and Problem Setup

Let $Z_i = (Y_i, T_i, X_i^T)^T$ be the observed data for the ith subject, where $Y_i$ is an outcome that could be modeled by a generalized linear model (GLM), $T_i \in \{0, 1\}$ is a binary treatment, and $X_i$ is a p-dimensional vector of covariates with support $\mathcal{X} \subseteq \mathbb{R}^p$. Here p is allowed to diverge slowly with n such that log(p)/log(n) → ν, for ν ∈ [0, 1), which includes the case where p is fixed by taking ν = 0. For a given n, the observed data consist of independent and identically distributed (iid) observations $\mathcal{D} = \{Z_i : i = 1, \ldots, n\}$ drawn from a distribution $\mathbb{P}_n$, which potentially may vary with n. We suppress the dependence on n in the notation, implicitly assuming statements involving $\mathbb{P}$ and associated statistical functionals hold for each n. Let $Y_i(1)$ and $Y_i(0)$ denote the counterfactual outcomes had a subject received treatment or control. Based on $\mathcal{D}$, we want to make inferences about the average treatment effect (ATE):

$$\Delta = E\{Y(1)\} - E\{Y(0)\} = \mu_1 - \mu_0. \tag{1}$$

For identifiability, we require the following standard causal inference assumptions:

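$$Y = T\,Y(1) + (1 - T)\,Y(0) \text{ with probability } 1 \tag{2}$$
$$\pi_1(x) \in [\epsilon_\pi, 1 - \epsilon_\pi] \text{ for some } \epsilon_\pi > 0, \text{ when } x \in \mathcal{X} \tag{3}$$
$$Y(1) \perp T \mid X \text{ and } Y(0) \perp T \mid X, \tag{4}$$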

where $\pi_k(x) = \mathbb{P}(T = k \mid X = x)$, for k = 0, 1. The third condition assumes that X is a sufficient set of covariates such that no unmeasured confounding holds given the entire X. Under these assumptions, Δ can be identified from the observed data distribution through:

$$\Delta^* = E\{\mu_1(X) - \mu_0(X)\} = E\left\{\frac{I(T = 1)Y}{\pi_1(X)} - \frac{I(T = 0)Y}{\pi_0(X)}\right\},$$

where $\mu_k(x) = E(Y \mid X = x, T = k)$, for k = 0, 1. We will consider an estimator based on the IPW form that will nevertheless be doubly-robust so that it is consistent under models where either $\pi_k(x)$ or $\mu_k(x)$ is correctly specified.
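For reference, the IPW form of the identification result can be checked by iterated expectations. The following is a brief sketch for the treated arm only (the control arm is symmetric), using consistency (2) in the first equality, no unmeasured confounding (4) in the second, and positivity (3) to ensure the ratio is well defined:

$$E\left\{\frac{I(T = 1)Y}{\pi_1(X)}\right\} = E\left\{\frac{I(T = 1)Y(1)}{\pi_1(X)}\right\} = E\left[\frac{E\{I(T = 1) \mid X\}\,E\{Y(1) \mid X\}}{\pi_1(X)}\right] = E\left[E\{Y(1) \mid X\}\right] = E\{Y(1)\}.$$

The same assumptions give $E\{Y(1) \mid X\} = E(Y \mid X, T = 1) = \mu_1(X)$, which links the IPW form to the outcome regression form.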

2.2. Parametric Models for Nuisance Functions

We consider parametric modeling as a means to reduce the dimensions of X when estimating the PS. For reference, let $\mathcal{M}_{np}$ be the nonparametric model for the distribution of Z, $\mathbb{P}$, that has no restrictions on $\mathbb{P}$ except requiring the second moment of Z to be finite. Let $\mathcal{M}_\pi \subseteq \mathcal{M}_{np}$ and $\mathcal{M}_\mu \subseteq \mathcal{M}_{np}$ respectively denote parametric working models under which:

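$$\pi_1(x) = g_\pi(\alpha_0 + \boldsymbol{\alpha}^T x), \tag{5}$$
$$\text{and } \mu_k(x) = g_\mu(\beta_0 + \beta_1 k + \boldsymbol{\beta}_k^T x), \text{ for } k = 0, 1, \tag{6}$$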

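where $g_\pi(\cdot)$ and $g_\mu(\cdot)$ are known link functions, and $\vec{\alpha} = (\alpha_0, \boldsymbol{\alpha}^T)^T \in \Theta_\alpha \subseteq \mathbb{R}^{p+1}$ and $\vec{\beta} = (\beta_0, \beta_1, \boldsymbol{\beta}_0^T, \boldsymbol{\beta}_1^T)^T \in \Theta_\beta \subseteq \mathbb{R}^{2p+2}$ are unknown parameters. In (6) slopes are allowed to differ by treatment arm to allow for heterogeneous effects of T for subjects with different X even with a linear link. When it is reasonable to assume heterogeneity is weak or nonexistent, it may be beneficial for efficiency to restrict $\boldsymbol{\beta}_0 = \boldsymbol{\beta}_1$.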

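Regardless of the validity of either working model (i.e. whether or not $\mathbb{P}$ belongs to $\mathcal{M}_\pi \cup \mathcal{M}_\mu$), we first obtain estimates of $\boldsymbol{\alpha}$ and the $\boldsymbol{\beta}_k$'s through adaptive LASSO (Zou, 2006):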

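$$(\hat{\alpha}_0, \hat{\boldsymbol{\alpha}}^T)^T = \arg\max_{\vec{\alpha}} \left\{ n^{-1} \sum_{i=1}^n l_\pi(\vec{\alpha}; T_i, X_i) - \lambda_{\pi,n} \sum_{j=1}^p |\alpha_j| / |\tilde{\alpha}_j|^\gamma \right\} \tag{7}$$
$$(\hat{\beta}_0, \hat{\beta}_1, \hat{\boldsymbol{\beta}}_0^T, \hat{\boldsymbol{\beta}}_1^T)^T = \arg\max_{\vec{\beta}} \left\{ n^{-1} \sum_{i=1}^n l_\mu(\vec{\beta}; Z_i) - \lambda_{\mu,n} \left( |\beta_1| / |\tilde{\beta}_1|^\gamma + \sum_{k=0}^1 \sum_{j=1}^p |\beta_{k,j}| / |\tilde{\beta}_{k,j}|^\gamma \right) \right\}, \tag{8}$$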

where $l_\pi(\vec{\alpha}; T_i, X_i)$ denotes the log-likelihood for $\vec{\alpha}$ under $\mathcal{M}_\pi$ given $T_i$ and $X_i$, $l_\mu(\vec{\beta}; Z_i)$ is a log-likelihood for $\vec{\beta}$ from a GLM suitable for the outcome type of Y under $\mathcal{M}_\mu$ given $Z_i$, $\tilde{\alpha}_j$, $\tilde{\beta}_1$, and $\tilde{\beta}_{k,j}$ are initial root-n consistent estimates, and $\lambda_{\pi,n}$ is a tuning parameter such that $n^{1/2}\lambda_{\pi,n} \to 0$ and $n^{(1-\nu)(1+\gamma)/2}\lambda_{\pi,n} \to \infty$, with γ > 2ν/(1 − ν), and similarly for $\lambda_{\mu,n}$ (Zou and Zhang, 2009). We specify adaptive LASSO here to estimate the nuisance parameters for concreteness, but use of other penalized likelihood methods can also be justified, so long as they have an oracle property, as in Theorem 2 of Zou (2006) and described below.

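Under models (5) and (6), we assume that $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}_k$, for k = 0, 1, are sparse. More generally, regardless of whether working models are correct or misspecified, we assume that there exist least false parameters $(\bar{\alpha}_0, \bar{\boldsymbol{\alpha}}^T)^T$ and $(\bar{\beta}_0, \bar{\beta}_1, \bar{\boldsymbol{\beta}}_0^T, \bar{\boldsymbol{\beta}}_1^T)^T$ (Lu et al., 2012) such that: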

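$$(\bar{\alpha}_0, \bar{\boldsymbol{\alpha}}^T)^T \text{ uniquely maximizes } E\{l_\pi(\vec{\alpha}; T_i, X_i)\}, \qquad (\bar{\beta}_0, \bar{\beta}_1, \bar{\boldsymbol{\beta}}_0^T, \bar{\boldsymbol{\beta}}_1^T)^T \text{ uniquely maximizes } E\{l_\mu(\vec{\beta}; Z_i)\}. \tag{9}$$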

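Let $\mathcal{A}_\alpha$ and $\mathcal{A}_{\beta_k}$ be the respective supports of $\bar{\boldsymbol{\alpha}}$ and $\bar{\boldsymbol{\beta}}_k$ and let $s_\alpha = |\mathcal{A}_\alpha|$ and $s_{\beta_k} = |\mathcal{A}_{\beta_k}|$ be the sparsity indices. We further assume $\bar{\boldsymbol{\alpha}}$ and $\bar{\boldsymbol{\beta}}_k$ have fixed sparsity such that: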

$$s_\alpha, s_{\beta_0} \text{ and } s_{\beta_1} \text{ are fixed as } n \to \infty. \tag{10}$$

For any vector v of length p and any index set $\mathcal{S} \subseteq \{1, 2, \ldots, p\}$, let $v_{\mathcal{S}}$ denote the subvector of v restricted to elements indexed in $\mathcal{S}$. Assumption (9) is a high-level assumption that would be required for $\hat{\boldsymbol{\alpha}}$ and $\hat{\boldsymbol{\beta}}_k$ to maintain an oracle property with respect to the least false parameters $\bar{\boldsymbol{\alpha}}$ and $\bar{\boldsymbol{\beta}}_k$ under possibly misspecified working models. Under this assumption, using arguments similar to those in Lu et al. (2012) and Zou and Zhang (2009), it can be shown that $\mathbb{P}(\hat{\boldsymbol{\alpha}}_{\mathcal{A}_\alpha^c} = 0) \to 1$ and that $\hat{\boldsymbol{\alpha}}$ admits an expansion of the form $n^{1/2}(\hat{\boldsymbol{\alpha}} - \bar{\boldsymbol{\alpha}})_{\mathcal{A}_\alpha} = n^{-1/2}\sum_{i=1}^n \Psi_{i,\mathcal{A}_\alpha} + o_p(1)$, which would yield the asymptotic normality results of the oracle property, and similarly for $\hat{\boldsymbol{\beta}}_k$. We rely on these results along with (10) to show that the DiPS IPW estimator is asymptotically linear in Theorem 1. In regimes where ν > 0, (10) models a setting in which a small number of covariates exhibit non-negligible associations with T and Y and a majority of covariates are noise. Assumption (10) may not be required for asymptotic linearity and can potentially be relaxed to allow $s_\alpha$ and $s_{\beta_k}$ to diverge slowly, for example, if they are $o(n^{1/3})$. We invoke this assumption to avoid complications of a growing support, which may need triangular array asymptotics to accommodate dependence of the support on n.

2.3. Double-Index Propensity Score and IPW Estimator

To mitigate the effects of misspecification of (5), one could perform nonparametric smoothing of T over $\hat{\boldsymbol{\alpha}}^T X$ to calibrate the initial PS estimator $g_\pi(\hat{\alpha}_0 + X^T\hat{\boldsymbol{\alpha}})$. We consider smoothing over not only $\hat{\boldsymbol{\alpha}}^T X$ but also $\hat{\boldsymbol{\beta}}_k^T X$ to allow variation in prognostic covariates indexed in $\mathcal{A}_{\beta_k}$ to inform this calibration. Such covariates are reduced into $\hat{\boldsymbol{\beta}}_k^T X$ to allow for nonparametric kernel smoothing in low (two) dimensions. The DiPS estimator for each treatment is:

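$$\hat{\pi}_k(x; \hat{\theta}_k) = \frac{n^{-1}\sum_{j=1}^n K_h\{(\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}}_k)^T (X_j - x)\}\, I(T_j = k)}{n^{-1}\sum_{j=1}^n K_h\{(\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}}_k)^T (X_j - x)\}}, \text{ for } k = 0, 1, \tag{11}$$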

where $\hat{\theta}_k = (\hat{\boldsymbol{\alpha}}^T, \hat{\boldsymbol{\beta}}_k^T)^T$, $K_h(u) = h^{-2}K(u/h)$, and K(u) is a bivariate q-th order kernel function with q > 2. A higher-order kernel is required here for the asymptotics to be well-behaved, which is the price for estimating the nuisance functions $\pi_k(x)$ using two-dimensional smoothing. Because higher-order kernels take negative values, $\hat{\pi}_k(x; \hat{\theta}_k)$ can itself be negative. Nevertheless, $\hat{\pi}_k(x; \hat{\theta}_k)$ are nuisance estimates not of direct interest, and we find that such negative PS estimates occur infrequently, on average in 0.01% to 2.10% of observations in simulations, depending on the size of n and p, across scenarios where working models are correctly or incorrectly specified (Web Appendix D). As they are infrequent and do not appear to compromise the performance of the final estimator, they can potentially be left as is when encountered in practice. Alternatively, methods that discard or trim PS estimates to handle near-violations of positivity, as in Assumption (3), can be considered (Crump et al., 2009). A monotone transformation of the input scores for each treatment, $\hat{S}_k = (\hat{\boldsymbol{\alpha}}, \hat{\boldsymbol{\beta}}_k)^T X$, can be applied prior to smoothing to improve finite sample performance (Wand et al., 1991). In numerical studies, for instance, we applied a probability integral transform based on the normal cumulative distribution function to the standardized scores to obtain approximately uniformly distributed inputs. The components of $\hat{S}_k$ can also be scaled such that a common bandwidth h can be used for both components of the score.
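The smoothing in (11) can be illustrated with a minimal pure-Python sketch. It assumes a product of univariate fourth-order Gaussian kernels, $K(u) = \tfrac{1}{2}(3 - u^2)\phi(u)$, as one possible choice satisfying q > 2; the function names (`gauss4`, `product_kernel`, `dips`) and the toy inputs are hypothetical and not taken from the authors' code, and the index values are assumed to have been computed beforehand.

```python
import math


def gauss4(u):
    """Fourth-order Gaussian kernel: (3 - u^2)/2 times the standard normal density.

    It integrates to one, has zero second moment, and is negative for |u| > sqrt(3).
    """
    return 0.5 * (3.0 - u * u) * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def product_kernel(u1, u2, h):
    """Bivariate product kernel K_h(u) = h^{-2} K(u1 / h) K(u2 / h)."""
    return gauss4(u1 / h) * gauss4(u2 / h) / (h * h)


def dips(ps_index, prog_index, treat, k, h):
    """Smooth I(T = k) over two working indices, in the spirit of equation (11).

    ps_index[j]   : fitted propensity index for subject j (assumed precomputed)
    prog_index[j] : fitted prognostic index for subject j (assumed precomputed)
    Returns the estimate evaluated at every observed point (an O(n^2) loop).
    """
    n = len(treat)
    estimates = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for j in range(n):
            w = product_kernel(ps_index[j] - ps_index[i],
                               prog_index[j] - prog_index[i], h)
            den += w
            if treat[j] == k:
                num += w
        estimates.append(num / den)
    return estimates


# Toy inputs (hypothetical): four subjects, common bandwidth h = 1.
toy_ps = [0.1, 0.2, 0.3, 0.4]
toy_prog = [0.0, 0.1, 0.2, 0.3]
toy_treat = [1, 0, 1, 0]
pi1_hat = dips(toy_ps, toy_prog, toy_treat, 1, 1.0)
pi0_hat = dips(toy_ps, toy_prog, toy_treat, 0, 1.0)
```

Because the two arms share the same denominator, the estimates for k = 1 and k = 0 sum to one at each point. The negative lobes of `gauss4` are what make negative PS estimates possible in sparse regions.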

With $\pi_k(x)$ estimated by $\hat{\pi}_k(x; \hat{\theta}_k)$, the estimator for Δ is given by $\hat{\Delta} = \hat{\mu}_1 - \hat{\mu}_0$, where:

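$$\hat{\mu}_k = \left\{\sum_{i=1}^n \frac{I(T_i = k)}{\hat{\pi}_k(X_i; \hat{\theta}_k)}\right\}^{-1} \left\{\sum_{i=1}^n \frac{I(T_i = k)\,Y_i}{\hat{\pi}_k(X_i; \hat{\theta}_k)}\right\}, \text{ for } k = 0, 1. \tag{12}$$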

This is the usual normalized IPW estimator, where the PS is estimated by the DiPS. The intuition for double-robustness of the estimator is as follows. Regardless of the validity of either working model, provided the asymptotics are well-behaved, $\hat{\mu}_k$ is consistent for:

$$\bar{\mu}_k = E\left\{\frac{I(T_i = k)\,Y_i}{\pi_k(X_i; \bar{\theta}_k)}\right\}, \text{ for } k = 0, 1,$$

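where $\bar{\theta}_k = (\bar{\boldsymbol{\alpha}}^T, \bar{\boldsymbol{\beta}}_k^T)^T$, and $\pi_k(x; \bar{\theta}_k) = \mathbb{P}(T_i = k \mid \bar{\boldsymbol{\alpha}}^T X_i = \bar{\boldsymbol{\alpha}}^T x, \bar{\boldsymbol{\beta}}_k^T X_i = \bar{\boldsymbol{\beta}}_k^T x)$. Under $\mathcal{M}_\pi$, $\pi_k(x; \bar{\theta}_k) = \pi_k(x)$ so that the estimand, under the causal assumptions (2)–(4), reduces to: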

$$\bar{\mu}_k = E\left\{\frac{I(T_i = k)\,Y_i}{\pi_k(X_i)}\right\} = E\{Y_i(k)\}, \text{ for } k = 0, 1.$$

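On the other hand, under $\mathcal{M}_\mu$, $E(Y_i \mid \bar{\boldsymbol{\alpha}}^T X_i = \bar{\boldsymbol{\alpha}}^T x, \bar{\boldsymbol{\beta}}_k^T X_i = \bar{\boldsymbol{\beta}}_k^T x, T_i = k) = \mu_k(x)$ so that: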

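$$\bar{\mu}_k = E\{E(Y_i \mid \bar{\boldsymbol{\alpha}}^T X_i, \bar{\boldsymbol{\beta}}_k^T X_i, T_i = k)\} = E\{\mu_k(X_i)\} = E\{Y_i(k)\}, \text{ for } k = 0, 1.$$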

In the following, we show that $\hat{\mu}_k$ (and thus $\hat{\Delta}$) is asymptotically linear. We then examine robustness and efficiency properties using the expansion.
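The normalized IPW estimator in (12) can be sketched in a few lines once PS estimates are available. The function names and toy numbers below are hypothetical; the PS values are passed in as plain lists and could come from any estimator, including a double-index smoother.

```python
def normalized_ipw_mean(y, treat, ps_k, k):
    """Normalized (Hajek-type) IPW mean for arm k.

    ps_k[i] is the estimated probability that subject i receives treatment k.
    Only subjects with treat[i] == k contribute, each weighted by 1 / ps_k[i].
    """
    num = 0.0
    den = 0.0
    for yi, ti, pi in zip(y, treat, ps_k):
        if ti == k:
            w = 1.0 / pi
            den += w
            num += w * yi
    return num / den


def normalized_ipw_ate(y, treat, ps1, ps0):
    """Difference of the normalized IPW means for arms 1 and 0."""
    return (normalized_ipw_mean(y, treat, ps1, 1)
            - normalized_ipw_mean(y, treat, ps0, 0))


# Toy data (hypothetical): two treated and two control subjects.
toy_y = [1.0, 3.0, 2.0, 4.0]
toy_t = [1, 1, 0, 0]
toy_ps1 = [0.5, 0.25, 0.5, 0.5]
toy_ps0 = [0.5, 0.75, 0.5, 0.5]
toy_ate = normalized_ipw_ate(toy_y, toy_t, toy_ps1, toy_ps0)
```

In the toy data the treated weights are 2 and 4, so the treated mean is (2·1 + 4·3)/6, and the control weights are equal, so the control mean is the plain average 3.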

3. Asymptotic Robustness and Efficiency Properties

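We directly show in Web Appendix B that $\hat{\mu}_k$ is asymptotically linear for k = 0, 1 in general, without assuming either of the working models is correct. Let $\bar{\Delta} = \bar{\mu}_1 - \bar{\mu}_0$ and $\hat{W}_k = n^{1/2}(\hat{\mu}_k - \bar{\mu}_k)$ for k = 0, 1 so that $n^{1/2}(\hat{\Delta} - \bar{\Delta}) = \hat{W}_1 - \hat{W}_0$.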

Theorem 1:

Suppose that causal assumptions (2)–(4), the least false parameter and sparsity assumptions (9)–(10), and regularity conditions in Web Appendix A hold. If log(p)/log(n) → ν for ν ∈ [0, 1), then $\hat{\mu}_k$ is asymptotically linear in that it admits the expansion:

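$$\hat{W}_k = n^{-1/2}\sum_{i=1}^n \left[ \frac{I(T_i = k)\,Y_i}{\pi_k(X_i; \bar{\theta}_k)} - \left\{\frac{I(T_i = k)}{\pi_k(X_i; \bar{\theta}_k)} - 1\right\} E(Y_i \mid \bar{\boldsymbol{\alpha}}^T X_i, \bar{\boldsymbol{\beta}}_k^T X_i, T_i = k) - \bar{\mu}_k \right] \tag{13}$$
$$\qquad + \, n^{-1/2}\sum_{i=1}^n \left( U_{k,\mathcal{A}_\alpha}^T \Psi_{i,\mathcal{A}_\alpha} + v_{k,\mathcal{A}_{\beta_k}}^T \Upsilon_{i,k,\mathcal{A}_{\beta_k}} \right) + O_p(n^{1/2}h^q + n^{-1/2}h^{-2}), \tag{14}$$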

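for k = 0, 1, where $U_{k,\mathcal{A}_\alpha}$ and $v_{k,\mathcal{A}_{\beta_k}}$ are deterministic vectors, and $\Psi_{i,\mathcal{A}_\alpha}$ and $\Upsilon_{i,k,\mathcal{A}_{\beta_k}}$ are influence functions from asymptotic expansions of $\hat{\boldsymbol{\alpha}}_{\mathcal{A}_\alpha}$ and $\hat{\boldsymbol{\beta}}_{k,\mathcal{A}_{\beta_k}}$. Under model $\mathcal{M}_\pi$, $v_{k,\mathcal{A}_{\beta_k}} = 0$ for k = 0, 1. Under $\mathcal{M}_\pi \cap \mathcal{M}_\mu$, we additionally have that $U_{k,\mathcal{A}_\alpha} = 0$, for k = 0, 1.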

Proof sketch: $\hat{W}_k$ can be decomposed as:

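$$\begin{aligned} \hat{W}_k = {} & n^{-1/2}\sum_{i=1}^n \frac{I(T_i = k)}{\pi_k(X_i; \bar{\theta}_k)}(Y_i - \bar{\mu}_k) + n^{-1/2}\sum_{i=1}^n \left\{\frac{I(T_i = k)}{\hat{\pi}_k(X_i; \bar{\theta}_k)} - \frac{I(T_i = k)}{\pi_k(X_i; \bar{\theta}_k)}\right\}(Y_i - \bar{\mu}_k) \\ & + n^{-1/2}\sum_{i=1}^n \left\{\frac{I(T_i = k)}{\hat{\pi}_k(X_i; \hat{\theta}_k)} - \frac{I(T_i = k)}{\hat{\pi}_k(X_i; \bar{\theta}_k)}\right\}(Y_i - \bar{\mu}_k) + o_p(1). \end{aligned}$$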

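The first term directly contributes to the expansion. The second term is the contribution from re-estimating the PS through kernel smoothing given $\bar{\theta}_k$. We apply a V-statistic projection lemma (Newey and McFadden, 1994) to obtain an asymptotically linear representation. The third term can be expanded by Taylor expansion into terms of the form $U_k^T n^{1/2}(\hat{\boldsymbol{\alpha}} - \bar{\boldsymbol{\alpha}})$ and $v_k^T n^{1/2}(\hat{\boldsymbol{\beta}}_k - \bar{\boldsymbol{\beta}}_k)$. Applying the selection consistency result that $\mathbb{P}(\hat{\boldsymbol{\alpha}}_{\mathcal{A}_\alpha^c} = 0) \to 1$, we have $U_k^T n^{1/2}(\hat{\boldsymbol{\alpha}} - \bar{\boldsymbol{\alpha}}) = U_{k,\mathcal{A}_\alpha}^T n^{1/2}(\hat{\boldsymbol{\alpha}} - \bar{\boldsymbol{\alpha}})_{\mathcal{A}_\alpha} + o_p(1)$. Lastly, we use that $n^{1/2}(\hat{\boldsymbol{\alpha}} - \bar{\boldsymbol{\alpha}})_{\mathcal{A}_\alpha} = n^{-1/2}\sum_{i=1}^n \Psi_{i,\mathcal{A}_\alpha} + o_p(1)$, work out the form of the loading vector $U_{k,\mathcal{A}_\alpha}$, and repeat for $\hat{\boldsymbol{\beta}}_k$ to complete the expansion.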

Let $\hat{\Delta}_{dr} = \hat{\mu}_{1,dr} - \hat{\mu}_{0,dr}$ denote the usual doubly-robust estimator, as in Equation (9) of Lunceford and Davidian (2004), with the PS $\pi_k(x)$ and mean outcome $\mu_k(x)$ estimated through (7) and (8). The influence function expansion for $\hat{\Delta}$ in Theorem 1 is nearly identical to that of $\hat{\Delta}_{dr}$. The terms in (13) would be the same except that $\pi_k(X_i; \bar{\theta}_k)$ and $E(Y_i \mid \bar{\boldsymbol{\alpha}}^T X_i, \bar{\boldsymbol{\beta}}_k^T X_i, T_i = k)$ replace the limits of the estimates under the parametric models. Terms in (14) analogously represent the additional contributions from estimating the nuisance parameters. No contribution from smoothing is incurred provided the bandwidths are suitably chosen. This similarity in the influence functions yields similar robustness and efficiency properties, which are improved upon under model misspecification due to the smoothing.

3.1. Robustness

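As a consequence of Theorem 1, $\hat{\Delta}$ is root-n consistent for $\bar{\Delta}$ so that $\hat{\Delta} - \bar{\Delta} = O_p(n^{-1/2})$ provided that $h = O(n^{-\alpha})$ for $\alpha \in (\tfrac{1}{2q}, \tfrac{1}{4})$. As discussed in Section 2.3, under $\mathcal{M}_\pi \cup \mathcal{M}_\mu$, $\bar{\Delta} = \Delta$. Hence $\hat{\Delta}$ is doubly-robust for Δ in that $\hat{\Delta}$ is root-n consistent for Δ under $\mathcal{M}_\pi \cup \mathcal{M}_\mu$. Beyond this usual form of double-robustness, if the PS model specification is incorrect, we expect the calibration step to at least partially correct for the misspecification in large samples since $\pi_k(x; \bar{\theta}_k)$ is closer to the true $\pi_k(x)$ than the misspecified parametric model $g_\pi(\bar{\alpha}_0 + \bar{\boldsymbol{\alpha}}^T x)$. Let $\tilde{\mathcal{M}}_\pi$ denote a model under which $\pi_1(x) = \tilde{g}_\pi(\boldsymbol{\alpha}^T x)$ for some unknown link function $\tilde{g}_\pi(\cdot)$ and unknown $\boldsymbol{\alpha} \in \mathbb{R}^p$, and X is known to be elliptically distributed such that $E(a^T X \mid \boldsymbol{\alpha}_*^T X)$ exists and is linear in $\boldsymbol{\alpha}_*^T X$, where $\boldsymbol{\alpha}_*$ denotes the true $\boldsymbol{\alpha}$ (e.g. if X is multivariate normal). By the results of Li and Duan (1989), it can be shown that $\bar{\boldsymbol{\alpha}} = c\boldsymbol{\alpha}_*$ for some scalar c under $\tilde{\mathcal{M}}_\pi$. But since $\hat{\pi}_k(x; \hat{\theta}_k)$ is consistent for $\pi_k(x; \bar{\theta}_k) = \mathbb{P}(T = k \mid \bar{\boldsymbol{\alpha}}^T X = \bar{\boldsymbol{\alpha}}^T x, \bar{\boldsymbol{\beta}}_k^T X = \bar{\boldsymbol{\beta}}_k^T x)$, it recovers $\pi_k(x)$ under $\tilde{\mathcal{M}}_\pi$. Consequently, $\hat{\Delta}$ also has some mild benefits in robustness in that $\hat{\Delta} - \Delta = O_p(n^{-1/2})$ under the slightly larger model $\mathcal{M}_\pi \cup \tilde{\mathcal{M}}_\pi \cup \mathcal{M}_\mu$. The same phenomenon also occurs when estimating $\boldsymbol{\beta}_k$ under misspecification of the link in (6), if we do not assume $\boldsymbol{\beta}_0 = \boldsymbol{\beta}_1$. In this case, if $\tilde{\mathcal{M}}_\mu$ is an analogous model under which $\mu_1(x) = \tilde{g}_{\mu,1}(\boldsymbol{\beta}_1^T x)$ and $\mu_0(x) = \tilde{g}_{\mu,0}(\boldsymbol{\beta}_0^T x)$ for some unknown link functions $\tilde{g}_{\mu,0}(\cdot)$ and $\tilde{g}_{\mu,1}(\cdot)$ and X is elliptically distributed, then $\hat{\Delta} - \Delta = O_p(n^{-1/2})$ under the slightly larger model $\mathcal{M}_\pi \cup \tilde{\mathcal{M}}_\pi \cup \mathcal{M}_\mu \cup \tilde{\mathcal{M}}_\mu$. This does not hold when $\boldsymbol{\beta}_0 = \boldsymbol{\beta}_1$, as T is binary so $(T, X^T)^T$ is not exactly elliptically distributed. But the result may still be expected to hold approximately.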

3.2. Efficiency

Let the terms contributed to the influence function for $\hat{\Delta}$ when $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}_k$ are known be:

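$$\varphi_{i,k} = \frac{I(T_i = k)\,Y_i}{\pi_k(X_i; \bar{\theta}_k)} - \left\{\frac{I(T_i = k)}{\pi_k(X_i; \bar{\theta}_k)} - 1\right\} E(Y_i \mid \bar{\boldsymbol{\alpha}}^T X_i, \bar{\boldsymbol{\beta}}_k^T X_i, T_i = k) - \bar{\mu}_k. \tag{15}$$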

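Under $\mathcal{M}_\pi \cap \mathcal{M}_\mu$, $\varphi_{i,k}$ is the full influence function for $\hat{\Delta}$. This is the efficient influence function for $\Delta^*$ under $\mathcal{M}_{np}$ at distributions for $\mathbb{P}$ belonging to $\mathcal{M}_\pi \cap \mathcal{M}_\mu$ when p is fixed (Robins et al., 1994; Tsiatis, 2007), since $E(Y_i \mid \bar{\boldsymbol{\alpha}}^T X_i = \bar{\boldsymbol{\alpha}}^T x, \bar{\boldsymbol{\beta}}_k^T X_i = \bar{\boldsymbol{\beta}}_k^T x, T_i = k) = \mu_k(x)$ and $\pi_k(x; \bar{\theta}_k) = \pi_k(x)$. When ν > 0 so that p diverges with n, there are no well-established semiparametric efficiency bounds. However, with fixed sparsity indices (10), the asymptotic variance still reaches the same bound as if p had been fixed.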

Beyond this characterization of efficiency that parallels that of $\hat{\Delta}_{dr}$, there are additional benefits of $\hat{\Delta}$ under $\mathcal{M}_\pi \cap \mathcal{M}_\mu^c$. In this case, akin to $\hat{\Delta}_{dr}$, estimating $\boldsymbol{\beta}_k$ does not contribute to the asymptotic variance since $v_{k,\mathcal{A}_{\beta_k}} = 0$, and a similar $n^{1/2} U_{k,\mathcal{A}_\alpha}^T(\hat{\boldsymbol{\alpha}} - \bar{\boldsymbol{\alpha}})_{\mathcal{A}_\alpha}$ term is contributed from estimating $\boldsymbol{\alpha}$. The analogous term in the expansion for $\hat{\Delta}_{dr}$ contributes the negative of a projection of the preceding terms onto the linear span of the score function for $\boldsymbol{\alpha}$, restricted to components in $\mathcal{A}_\alpha$, to its influence function (Section 9.1 of Tsiatis (2007)). The same interpretation of the influence function can be adopted for $\hat{\Delta}$.

Theorem 2:

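Let $U_\alpha$ be the score for $\boldsymbol{\alpha}$ under $\mathcal{M}_\pi$ and let $[U_{\alpha,\mathcal{A}_\alpha}]$ denote the linear span of its components indexed in $\mathcal{A}_\alpha$. In the Hilbert space $L_2^0$ of random variables with mean 0 and finite variance, with inner product given by the covariance, let $\Pi\{V \mid \mathcal{S}\}$ denote the projection of some $V \in L_2^0$ onto a subspace $\mathcal{S} \subseteq L_2^0$. If the assumptions required for Theorem 1 hold, then under $\mathcal{M}_\pi$, $U_{k,\mathcal{A}_\alpha}^T n^{1/2}(\hat{\boldsymbol{\alpha}} - \bar{\boldsymbol{\alpha}})_{\mathcal{A}_\alpha} = -n^{-1/2}\sum_{i=1}^n \Pi\{\varphi_{i,k} \mid [U_{\alpha,\mathcal{A}_\alpha}]\} + o_p(1)$.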

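The proof is based on simplifying $U_{k,\mathcal{A}_\alpha}$ and is given in Web Appendix B. This result can be used to show that the asymptotic variance of $\hat{\Delta}$ is lower than that of $\hat{\Delta}_{dr}$ under $\mathcal{M}_\pi \cap \mathcal{M}_\mu^c$. Based on this result, under $\mathcal{M}_\pi \cap \mathcal{M}_\mu^c$ the influence function for $\hat{\mu}_k$ is $\varphi_{i,k} - \Pi\{\varphi_{i,k} \mid [U_{\alpha,\mathcal{A}_\alpha}]\}$, and that for the usual DR estimator $\hat{\mu}_{k,dr}$ is $\phi_{i,k} - \Pi\{\phi_{i,k} \mid [U_{\alpha,\mathcal{A}_\alpha}]\}$, where: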

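$$\phi_{i,k} = \frac{I(T_i = k)\,Y_i}{\pi_k(X_i)} - \left\{\frac{I(T_i = k)}{\pi_k(X_i)} - 1\right\} g_\mu(\bar{\beta}_0 + \bar{\beta}_1 k + \bar{\boldsymbol{\beta}}_k^T X_i) - \bar{\mu}_k.$$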

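But since $E(Y_i \mid \bar{\boldsymbol{\alpha}}^T X_i = \bar{\boldsymbol{\alpha}}^T x, \bar{\boldsymbol{\beta}}_k^T X_i = \bar{\boldsymbol{\beta}}_k^T x, T_i = k)$ better approximates $\mu_k(x)$ than the limit of the estimate under the misspecified parametric model, $g_\mu(\bar{\beta}_0 + \bar{\beta}_1 k + \bar{\boldsymbol{\beta}}_k^T x)$, it can then be shown that $E(\phi_{i,k}^2) > E(\varphi_{i,k}^2)$ for k = 0, 1. Since the influence functions involve projections onto the same space $[U_{\alpha,\mathcal{A}_\alpha}]$, it can be seen through a geometric argument that $E[\varphi_{i,k} - \Pi\{\varphi_{i,k} \mid [U_{\alpha,\mathcal{A}_\alpha}]\}]^2 < E[\phi_{i,k} - \Pi\{\phi_{i,k} \mid [U_{\alpha,\mathcal{A}_\alpha}]\}]^2$, so that $\hat{\Delta}$ is more efficient than $\hat{\Delta}_{dr}$ under $\mathcal{M}_\pi \cap \mathcal{M}_\mu^c$. We show in the simulation studies that this improvement can lead to substantial efficiency gains under $\mathcal{M}_\pi \cap \mathcal{M}_\mu^c$ in finite samples. These unique robustness and efficiency properties distinguish $\hat{\Delta}$ from $\hat{\Delta}_{dr}$ and its variants. We next consider a perturbation scheme to estimate standard errors (SE) and confidence intervals (CI) for $\hat{\Delta}$.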

4. Perturbation Resampling

Although the asymptotic variance of $\hat{\Delta}$ can be determined through its influence function specified in Theorem 1, a direct empirical estimate based on the influence function is infeasible because it involves functionals of $\mathbb{P}$ that are difficult to estimate. Instead we propose a simple perturbation-resampling procedure. Let $\mathcal{G} = \{G_i : i = 1, \ldots, n\}$ be a set of non-negative iid random variables with unit mean and unit variance that are independent of $\mathcal{D}$. The procedure perturbs each “layer” of the estimation of $\hat{\Delta}$. Let the perturbed estimates of $\vec{\alpha}$ and $\vec{\beta}$ be:

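$$(\hat{\alpha}_0^*, \hat{\boldsymbol{\alpha}}^{*T})^T = \arg\max_{\vec{\alpha}} \left\{ n^{-1}\sum_{i=1}^n l_\pi(\vec{\alpha}; T_i, X_i)\,G_i - \lambda_{\pi,n}\sum_{j=1}^p |\alpha_j| / |\tilde{\alpha}_j^*|^\gamma \right\}$$
$$(\hat{\beta}_0^*, \hat{\beta}_1^*, \hat{\boldsymbol{\beta}}_0^{*T}, \hat{\boldsymbol{\beta}}_1^{*T})^T = \arg\max_{\vec{\beta}} \left\{ n^{-1}\sum_{i=1}^n l_\mu(\vec{\beta}; Z_i)\,G_i - \lambda_{\mu,n}\sum_{j=1}^p |\beta_j| / |\tilde{\beta}_j^*|^\gamma \right\},$$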

where $\tilde{\alpha}_j^*$ and $\tilde{\beta}_j^*$ are perturbed initial estimates obtained by analogously perturbing their estimating equations. The perturbed DiPS estimates are calculated by:

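$$\hat{\pi}_k^*(x; \hat{\theta}_k^*) = \frac{\sum_{j=1}^n K_h\{(\hat{\boldsymbol{\alpha}}^*, \hat{\boldsymbol{\beta}}_k^*)^T (X_j - x)\}\, I(T_j = k)\,G_j}{\sum_{j=1}^n K_h\{(\hat{\boldsymbol{\alpha}}^*, \hat{\boldsymbol{\beta}}_k^*)^T (X_j - x)\}\,G_j}, \text{ for } k = 0, 1.$$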

Lastly, the perturbed estimator is given by $\hat{\Delta}^* = \hat{\mu}_1^* - \hat{\mu}_0^*$ where:

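$$\hat{\mu}_k^* = \left\{\sum_{i=1}^n \frac{I(T_i = k)}{\hat{\pi}_k^*(X_i; \hat{\theta}_k^*)}\,G_i\right\}^{-1} \left\{\sum_{i=1}^n \frac{I(T_i = k)\,Y_i}{\hat{\pi}_k^*(X_i; \hat{\theta}_k^*)}\,G_i\right\}, \text{ for } k = 0, 1.$$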

It can be shown based on arguments in Jin et al. (2001) that the asymptotic distribution of $n^{1/2}(\hat{\Delta} - \bar{\Delta})$ coincides with that of $n^{1/2}(\hat{\Delta}^* - \hat{\Delta}) \mid \mathcal{D}$. We can thus approximate the SE of $\hat{\Delta}$ based on the empirical standard deviation or, as a robust alternative, the mean absolute deviations (MAD) of resamples $\hat{\Delta}^*$ and construct CIs using percentiles of resamples.
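The resampling loop can be sketched as follows. This is a simplified illustration, not the full procedure: the helper names are hypothetical, Exp(1) is used as one distribution with unit mean and unit variance, the SE is taken as the empirical standard deviation, and the toy estimator perturbs only the final weighting layer with a known PS of 0.5. In the procedure described above, the callable passed in would re-fit the working models and the kernel smoother with the same weights.

```python
import random
import statistics


def perturbation_resample(estimate, n, n_resamples=500, seed=0):
    """Approximate the SE and a 95% percentile interval by perturbation.

    estimate(g) must recompute every layer of the estimator using the
    subject-level weights g (a list of length n).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_resamples):
        g = [rng.expovariate(1.0) for _ in range(n)]  # mean 1, variance 1
        draws.append(estimate(g))
    draws.sort()
    se = statistics.stdev(draws)
    lo = draws[int(0.025 * (n_resamples - 1))]
    hi = draws[int(0.975 * (n_resamples - 1))]
    return se, (lo, hi)


def weighted_ipw_ate(y, treat, ps1, g):
    """Normalized IPW difference with perturbation weights g (PS held fixed)."""
    def arm_mean(k):
        num = 0.0
        den = 0.0
        for yi, ti, pi, gi in zip(y, treat, ps1, g):
            if ti == k:
                w = gi / (pi if k == 1 else 1.0 - pi)
                den += w
                num += w * yi
        return num / den
    return arm_mean(1) - arm_mean(0)


# Toy data (hypothetical): randomized treatment, unit treatment effect.
data_rng = random.Random(1)
n_toy = 200
t_toy = [1 if data_rng.random() < 0.5 else 0 for _ in range(n_toy)]
y_toy = [t + data_rng.gauss(0.0, 1.0) for t in t_toy]
ps_toy = [0.5] * n_toy

se_toy, (lo_toy, hi_toy) = perturbation_resample(
    lambda g: weighted_ipw_ate(y_toy, t_toy, ps_toy, g), n_toy)
```

Passing the estimator as a callable keeps the loop agnostic to how many layers are perturbed, which is the design choice that matters when the nuisance fits are themselves re-estimated.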

5. Numerical Studies

5.1. Simulation Study

We performed extensive simulations to assess the finite sample bias and relative efficiency (RE) of $\hat{\Delta}$ (DiPS) compared to alternative estimators. We also assessed the performance of the perturbation procedure. Throughout, in implementing the adaptive LASSO, we used ridge regression for the initial estimators $\tilde{\alpha}_j$ and $\tilde{\beta}_j$, where the ridge tuning parameter was chosen by minimizing the Akaike information criterion (AIC). The adaptive LASSO tuning parameter was chosen by an extended regularized information criterion (Hui et al., 2015), which exhibited relatively good performance for variable selection. We refitted models with selected covariates to reduce bias, as suggested in Hui et al. (2015). The power parameter γ was set as $\frac{2\nu}{1 - \nu} + 1$, where ν = log(p)/log(n). A Gaussian product kernel of order q = 4 with a plug-in bandwidth at the optimal order (see Discussion) was used for smoothing. For comparison, we considered alternative standard estimators with nuisances estimated by regularization and recently developed methods for estimating ATE that incorporate variable selection: (1) IPW with $\pi_1(x)$ estimated by adaptive LASSO (IPW-ALAS), (2) $\hat{\Delta}_{dr}$ with nuisances estimated by adaptive LASSO (DR-ALAS), (3) a modification of $\hat{\Delta}_{dr}$ in which $\pi_1(x)$ and $\mu_k(x)$ are estimated by separate one-dimensional kernel smoothing of T on $\hat{\boldsymbol{\alpha}}^T X$ and of Y on $\hat{\boldsymbol{\beta}}_k^T X$ among those assigned to T = k, for k = 0, 1 (DR-SIM), to allow for estimation of single index models (SIM) for $\pi_1(x)$ and $\mu_k(x)$, (4) Outcome-adaptive LASSO (OAL) (Shortreed and Ertefaie, 2017), (5) Group Lasso and Doubly Robust Estimation (GLiDeR) (Koch et al., 2018), and (6) the model averaged doubly-robust estimator (MADR) (Cefalu et al., 2017). OAL and GLiDeR were implemented with default settings from code provided in the Supplementary Materials of the respective papers. MADR was implemented using the madr package with M = 500 Markov chain Monte Carlo (MCMC) iterations to reduce the computations. Throughout the numerical studies, we specified $g_\pi(u) = 1/(1 + e^{-u})$ for $\mathcal{M}_\pi$ and $g_\mu(u) = u$ with $\boldsymbol{\beta}_0 = \boldsymbol{\beta}_1$ for $\mathcal{M}_\mu$ as the working models.
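As a small illustration of how the power parameter scales with the dimension, the sketch below reads the setting as γ = 2ν/(1 − ν) + 1 with ν = log(p)/log(n). That reading is an assumption about the intended expression, chosen so that the condition γ > 2ν/(1 − ν) from Section 2.2 holds; the function name is hypothetical.

```python
import math


def adaptive_lasso_power(p, n):
    """Return gamma = 2 * nu / (1 - nu) + 1 with nu = log(p) / log(n).

    Assumes 1 <= p < n so that nu lies in [0, 1); the '+ 1' keeps gamma
    strictly above the lower bound 2 * nu / (1 - nu).
    """
    nu = math.log(p) / math.log(n)
    if not 0.0 <= nu < 1.0:
        raise ValueError("need 1 <= p < n so that nu is in [0, 1)")
    return 2.0 * nu / (1.0 - nu) + 1.0


gamma_small = adaptive_lasso_power(15, 500)
gamma_large = adaptive_lasso_power(15, 5000)
```

For fixed p the value of ν shrinks as n grows, so γ moves toward 1, the value obtained when ν = 0.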

The covariates were generated to approximate the distribution of the covariates from the statins EMR data from Section 5.2. This was done to allow for non-elliptically distributed covariates that mimic the distribution of a real dataset. Initially we generated $\tilde{X} \sim N(\tilde{\mu}, \tilde{\Sigma})$ where $\tilde{\mu}$ and $\tilde{\Sigma}$ were the empirical mean and covariance matrix of the 15 covariates, which included 9 binary, 3 continuous, and 3 log-transformed count variables. For binary variables we thresholded the corresponding components of $\tilde{X}$ so that their means matched those in $\tilde{\mu}$, as in $I\{\tilde{\sigma}_j^{-1}(\tilde{X}_j - \tilde{\mu}_j) > \Phi^{-1}(1 - \tilde{\mu}_j)\}$, where $\tilde{\sigma}_j^2$ and $\tilde{\mu}_j$ are the empirical variance and mean of the j-th covariate and Φ(·) is the standard normal cumulative distribution function (CDF). Lastly, we centered and standardized to obtain the final covariates $X = \text{diag}(\tilde{\Sigma})^{-1/2}(\tilde{X} - \tilde{\mu})$. The pairwise correlations of X were generally low, mostly ranging between −.2 and .2 (full correlation matrix reported in Web Appendix C). For settings with p > 15, we generated independent groups of the 15 covariates that maintained the correlation structure within each group.
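The thresholding rule for binary covariates can be checked numerically. The sketch below works with an already standardized latent normal value, so the rule reduces to comparing z with $\Phi^{-1}(1 - \tilde{\mu}_j)$; the function name, the target prevalence of 0.3, and the sample size are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist


def threshold_binary(z, target_mean):
    """Return 1 if the standardized latent value z exceeds Phi^{-1}(1 - target_mean).

    Since P(Z > Phi^{-1}(1 - m)) = m for standard normal Z, the resulting
    indicator has mean target_mean.
    """
    cut = NormalDist().inv_cdf(1.0 - target_mean)
    return 1 if z > cut else 0


# Simulated check (hypothetical settings): prevalence should be close to 0.3.
sim_rng = random.Random(0)
binary_draws = [threshold_binary(sim_rng.gauss(0.0, 1.0), 0.3) for _ in range(20000)]
prevalence = sum(binary_draws) / len(binary_draws)
```

Thresholding a correlated latent normal vector in this way preserves the target marginal means while inducing dependence among the binary covariates.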

We subsequently focused on a continuous outcome, generating the data according to T | X ~ Ber{π1(X)} and Y | X, T ~ N{μT (X), 102}. The simulations varied over scenarios where working models were correct or misspecified in which the true π1(x) and μk(x) are:

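$$\begin{aligned} \text{Both correct:} \quad & \pi_1(x) = g_\pi(.2 + \boldsymbol{\alpha}^T x), & \mu_k(x) &= k + \boldsymbol{\beta}^T x \\ \text{Misspecified } \mu_k(x)\text{:} \quad & \pi_1(x) = g_\pi(.2 + \boldsymbol{\alpha}^T x), & \mu_k(x) &= k + \boldsymbol{\beta}_{[1]}^T x\,(1 + \boldsymbol{\beta}_{[2]}^T x) + k\,\boldsymbol{\zeta}^T x \\ \text{Misspecified } \pi_1(x)\text{:} \quad & \pi_1(x) = g_\pi\{.2 + \boldsymbol{\alpha}_{[1]}^T x\,(1 + \boldsymbol{\alpha}_{[2]}^T x)\}, & \mu_k(x) &= k + \boldsymbol{\beta}^T x, \end{aligned}$$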

where the coefficients are $\boldsymbol{\alpha} = .01 \cdot (1, 2, 3, 4, 5, 6, 0_3, 3, 7, 0, 7, -5, 0, 0_{p-15})^T$, $\boldsymbol{\alpha}_{[1]} = \boldsymbol{\alpha}$, $\boldsymbol{\alpha}_{[2]} = (.02, .06, .02, .02, -.1, .02, 0_3, -.14, .1, 0, -.1, .14, 0, 0_{p-15})^T$, $\boldsymbol{\zeta} = (0_6, 1, 0_3, 1, 0_2, 1, 0, 0_{p-15})^T$, $\boldsymbol{\beta} = (0_3, 1, .5, .25, .125, .0625, .03125, 0, 1, .5, 0, .25, .125, 0_{p-15})^T$, $\boldsymbol{\beta}_{[1]} = (0_3, .5, 0, .5, 1_3, 0, 1, 2, 0, 1, 2, 0_{p-15})^T$, $\boldsymbol{\beta}_{[2]} = (0_3, -1.5, .75, -1.5, 0_3, 0, -1.5, -.75, 0, 1.5, .75, 0_{p-15})^T$, and $a_m$ denotes a 1 × m vector that has all its elements equal to a. For the misspecified scenarios, either $\mu_k(x)$ or $\pi_1(x)$ is a double-index model that includes both linear terms in x and quadratic and two-way interaction terms among x that are omitted by linear working models. In the misspecified $\mu_k(x)$ case, the second index $\boldsymbol{\beta}_{[2]}^T x$ has some correlation with the PS index $\boldsymbol{\alpha}^T x$, modeling a situation in which there exist common latent factors not fully captured by a linear outcome model. The outcome model also includes an interaction term between x and treatment to allow for treatment effect heterogeneity. The parameters are set such that there are 5 covariates belonging to each of $\mathcal{A}_\pi \cap \mathcal{A}_\mu$ (i.e. confounders), $\mathcal{A}_\pi \cap \mathcal{A}_\mu^c$ (instruments), and $\mathcal{A}_\pi^c \cap \mathcal{A}_\mu$ (pure prognostic) when p = 15. The simulations were run for R = 1,000 repetitions.

Table 1 presents the bias and root mean square error (RMSE) for n = 500 and 5,000 when p = 15. Among the three scenarios considered, the bias for DiPS is small relative to the RMSE and generally diminishes towards zero as n increases, verifying its double-robustness. There remains some minor bias that persists when n = 5,000 for DiPS that is likely a result of bias from the smoothing, as DR-SIM also incurs similar residual bias. IPW-ALAS and OAL are singly-robust and the bias does not necessarily diminish under the misspecified $\pi_1(x)$ scenario, although their bias is also minor in the setting considered. MADR exhibited substantial bias under the misspecified $\mu_k(x)$ scenario that persisted in large samples, possibly due to selecting out confounders with weak outcome associations in its emphasis on selection of prognostic covariates. The results for bias for p = 50, 100 exhibited similar patterns.

Table 1.

Bias and RMSE of estimators by n and model specification scenario for p = 15.

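| n | Estimator | Both correct: Bias | Both correct: RMSE | Misspecified μk(x): Bias | Misspecified μk(x): RMSE | Misspecified π1(x): Bias | Misspecified π1(x): RMSE |
|---|---|---|---|---|---|---|---|
| 500 | IPW-ALAS | 0.029 | 0.350 | 0.074 | 1.754 | 0.023 | 0.294 |
| 500 | DR-ALAS | 0.002 | 0.330 | 0.029 | 1.684 | −0.001 | 0.285 |
| 500 | DR-SIM | −0.021 | 0.315 | 0.127 | 1.495 | 0.013 | 0.287 |
| 500 | OAL | 0.008 | 0.321 | 0.074 | 1.484 | 0.001 | 0.284 |
| 500 | GLiDeR | 0.001 | 0.299 | 0.087 | 1.238 | 0.006 | 0.282 |
| 500 | MADR | 0.022 | 0.300 | 0.172 | 1.247 | 0.008 | 0.282 |
| 500 | DiPS | −0.017 | 0.319 | 0.101 | 1.193 | 0.013 | 0.293 |
| 5,000 | IPW-ALAS | 0.001 | 0.111 | −0.002 | 0.588 | 0.033 | 0.108 |
| 5,000 | DR-ALAS | −0.003 | 0.106 | −0.014 | 0.564 | −0.008 | 0.089 |
| 5,000 | DR-SIM | −0.012 | 0.103 | 0.029 | 0.516 | −0.004 | 0.089 |
| 5,000 | OAL | −0.002 | 0.105 | 0.000 | 0.527 | −0.007 | 0.089 |
| 5,000 | GLiDeR | −0.001 | 0.098 | 0.034 | 0.413 | −0.006 | 0.088 |
| 5,000 | MADR | 0.000 | 0.099 | 0.124 | 0.418 | −0.008 | 0.089 |
| 5,000 | DiPS | −0.016 | 0.106 | 0.041 | 0.349 | −0.003 | 0.091 |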

Figure 1 presents the RE under the different scenarios for n = 500, 5,000 and p = 15, 50, 100. RE was defined as the ratio of the mean square error (MSE) for DR-ALAS relative to that of each estimator, with RE > 1 indicating greater efficiency compared to DR-ALAS. Under the “both correct” scenario many of the estimators exhibit similar efficiency, which can be expected since many are variants of the usual DR estimator and reach the semiparametric efficiency bound. When n = 500 and p = 60, there are some slightly greater differences, with GLiDeR and MADR leading in efficiency gains, possibly due to differences in the variable selection performance. These differences in efficiency appear to diminish when the sample size is increased to n = 5,000 with p = 60. The results are similar in the “misspecified $\pi_1(x)$” scenario, where most estimators exhibited similar efficiency.
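The RE definition can be made concrete with the RMSE values reported in Table 1. The short sketch below uses the misspecified $\mu_k(x)$ scenario with n = 5,000 and p = 15; the function name is hypothetical, and the inputs are the tabulated RMSEs rather than raw simulation output, so the results carry the table's rounding.

```python
def relative_efficiency(rmse_reference, rmse_estimator):
    """Ratio of MSEs, reference over estimator; values above 1 favor the estimator."""
    return (rmse_reference / rmse_estimator) ** 2


# RMSEs from Table 1, misspecified mu_k(x) scenario, n = 5,000, p = 15.
rmse_dr_alas = 0.564
rmse_dips = 0.349
rmse_glider = 0.413

re_dips = relative_efficiency(rmse_dr_alas, rmse_dips)      # DiPS vs DR-ALAS
re_glider = relative_efficiency(rmse_dr_alas, rmse_glider)  # GLiDeR vs DR-ALAS
dips_vs_glider = re_dips / re_glider                        # DiPS vs GLiDeR
```

Dividing two REs that share the DR-ALAS reference gives the MSE ratio between the two estimators directly, which is how pairwise comparisons can be read off the figure.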

Figure 1.

RE relative to DR-ALAS by n, p, and specification scenario.

In the “misspecified μk(x)” scenario, DiPS achieves roughly 40% efficiency gain compared to GLiDeR and MADR and over 100% compared to DR-SIM in the large sample setting when n = 5,000 and p = 15 (from the RMSEs in Table 1, the corresponding MSE ratios are (0.413/0.349)² ≈ 1.40, (0.418/0.349)² ≈ 1.43, and (0.516/0.349)² ≈ 2.19). This suggests that the efficiency gains expected under misspecified outcome models from the results of Section 3.2 can be substantial. Even if π1(x) and μk(x) are estimated under a SIM, there are still gains from DiPS when the PS direction ᾱᵀX is informative of the mean outcome beyond β̄kᵀX. These gains diminish when p is larger relative to n, possibly due to imperfect variable selection. Again, GLiDeR and MADR achieve the highest efficiency when n = 500 and p = 60, notwithstanding the substantial bias of MADR. Thus the performance of DiPS using adaptive LASSO can be somewhat compromised when p is very large relative to n and the variable selection performance is sub-optimal.

Table 2 presents the performance of the perturbation procedure for DiPS when p = 15, 30 under correct working models. SEs for DiPS were estimated using the MAD. The empirical SEs (Emp SE), calculated as the sample standard deviation of Δ̂ over the simulation repetitions, were generally similar to the average of the SE estimates over the repetitions (ASE), although the ASE overestimated the Emp SE by roughly 2–15%. The coverage of the percentile CIs (Cover) was close to the nominal 95% level but tended to be somewhat conservative.

Table 2.

Perturbation performance under correctly specified models. Emp SE: empirical standard error over simulations, ASE: average of standard error estimates based on MAD over perturbations, Cover: Coverage of 95% percentile intervals.

p n Emp SE ASE Cover
15 500 0.350 0.362 0.966
15 2500 0.151 0.167 0.970
15 5000 0.108 0.119 0.965
30 500 0.348 0.356 0.961
30 2500 0.150 0.167 0.975
30 5000 0.103 0.119 0.973
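The two summaries reported in Table 2, a MAD-based SE and a percentile interval, can be sketched as follows for a generic vector of perturbed estimates. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the input draws are synthetic, MAD is read as the median absolute deviation, and the scaling constant 1.4826 (which makes the MAD consistent for the standard deviation under normality) is an assumption.

```python
import random
import statistics


def mad_se(perturbed, scale=1.4826):
    """SE estimate: scaled median absolute deviation of perturbed estimates.

    The factor 1.4826 makes the MAD consistent for the standard deviation
    under normality; it is an assumption here, not taken from the paper.
    """
    center = statistics.median(perturbed)
    abs_dev = [abs(x - center) for x in perturbed]
    return scale * statistics.median(abs_dev)


def percentile_ci_95(perturbed):
    """95% percentile interval: the 2.5% and 97.5% empirical quantiles."""
    # n=40 yields cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(perturbed, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Synthetic stand-in for perturbed estimates (NOT data from the paper).
random.seed(0)
perturbed_estimates = [random.gauss(0.5, 0.1) for _ in range(2000)]

se_hat = mad_se(perturbed_estimates)
ci_low, ci_high = percentile_ci_95(perturbed_estimates)
print(f"MAD-based SE: {se_hat:.3f}")
print(f"95% percentile CI: ({ci_low:.3f}, {ci_high:.3f})")
```

Using the MAD rather than the sample standard deviation makes the SE estimate less sensitive to occasional extreme perturbed estimates.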

5.2. Data Example: Effect of Statins on Colorectal Cancer Risk in EMRs

We applied DiPS to assess the effect of statins, a medication for lowering cholesterol levels, on the risk of colorectal cancer (CRC) among patients with inflammatory bowel disease (IBD) identified using data from EMRs of Partners Healthcare. Previous studies have suggested that statins have a protective effect against CRC, but few studies have considered the effect specifically among IBD patients. The EMR cohort consisted of n = 10,817 IBD patients, including 1,375 statin users. CRC status and statin use were ascertained by the presence of ICD9 diagnosis and prescription codes. We adjusted for p = 15 covariates as potential confounders, including age, gender, race, smoking status, indication of elevated inflammatory markers, examination with colonoscopy, use of biologics and immunomodulators, subtypes of IBD, disease duration, and presence of primary sclerosing cholangitis (PSC).

For the working model Mμ, we specified gμ(u) = 1/(1 + e^{−u}) to accommodate the binary outcome. SEs for other estimators were obtained from the MAD over bootstrap resamples. CIs were calculated from percentile intervals. We also calculated a two-sided p-value from a Wald test for the null that statins have no effect, using the point and SE estimates for each estimator. The unadjusted estimate (None), based on the difference in means by statin use, was also calculated as a reference. The left side of Table 3 shows that, without adjustment, the naive risk difference is estimated to be −0.8% with a SE of 0.4%. The other methods estimated that statins had a protective effect ranging from around −1% to −3% after adjustment for covariates. DiPS and DR-SIM were the most efficient estimators, with DiPS achieving estimated variance that ranged from 34% to 61% lower than that of other estimators.

Table 3.

Data example on the effect of statins on CRC risk in EMR data and the effect of smoking on logCRP in FOS data. Est: Point estimate, SE: estimated SE, 95% CI: confidence interval, p-val: p-value from Wald test of no effect.

           IBD EMR Study                                FOS
Estimator  Est     SE     95% CI            p-val       Est    SE     95% CI          p-val
None       −0.008  0.004  (−0.017, 0)       0.047       0.180  0.058  (0.065, 0.298)  0.002
IPW-ALAS   −0.022  0.004  (−0.031, −0.015)  <0.001      0.182  0.063  (0.053, 0.307)  0.004
DR-ALAS    −0.020  0.005  (−0.029, −0.012)  <0.001      0.140  0.063  (0.031, 0.277)  0.026
DR-SIM     −0.023  0.003  (−0.029, −0.018)  <0.001      0.143  0.057  (0.044, 0.257)  0.013
OAL        −0.008  0.004  (−0.017, 0)       0.048       0.175  0.061  (0.062, 0.301)  0.004
GLiDeR     −0.031  0.005  (−0.04, −0.022)   <0.001      0.147  0.058  (0.045, 0.258)  0.012
MADR       −0.030  0.005  (−0.04, −0.021)   <0.001      0.149  0.056  (0.037, 0.258)  0.008
DiPS       −0.024  0.003  (−0.029, −0.017)  <0.001      0.141  0.058  (0.039, 0.276)  0.015
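The two-sided Wald p-values in Table 3 are functions of the point estimate and its SE only. The sketch below shows the calculation, assuming a standard normal reference distribution; the function name `wald_p_value` is hypothetical. Because the inputs are the rounded values printed in Table 3, the results reproduce the reported p-values only approximately.

```python
import math


def wald_p_value(estimate, se):
    """Two-sided p-value for H0: effect = 0, using z = estimate / se.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)), where Phi is
    the standard normal CDF.
    """
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2.0))


# Rounded (Est, SE) pairs taken from Table 3.
p_none_ibd = wald_p_value(-0.008, 0.004)  # unadjusted estimate, IBD EMR study
p_dips_fos = wald_p_value(0.141, 0.058)   # DiPS, FOS

print(f"None (IBD EMR): p = {p_none_ibd:.3f}")
print(f"DiPS (FOS):     p = {p_dips_fos:.3f}")
```

For the unadjusted IBD estimate the rounded inputs give z = −2 exactly, so the computed p-value differs slightly from the 0.047 in Table 3, which was presumably obtained from unrounded estimates.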

5.3. Data Example: Framingham Offspring Study

The Framingham Offspring Study (FOS) is a cohort study initiated in 1971 that enrolled 5,124 adult children of participants in the original Framingham Heart Study and their spouses. The study collected data over time on participants’ medical history, physician examination, and laboratory tests to examine epidemiological and genetic risk factors of cardiovascular disease (CVD). A subset of the FOS participants also have their genotype from the Affymetrix 500K single-nucleotide polymorphism (SNP) array available through the Framingham SNP Health Association Resource (SHARe) on dbGaP. We assessed the effect of smoking on C-reactive protein (CRP) levels, an inflammation marker highly predictive of CVD risk, while adjusting for potential confounders including gender, age, diabetes status, use of hypertensive medication, systolic and diastolic blood pressure measurements, and HDL and total cholesterol measurements, as well as a large number of SNPs in gene regions previously reported to be associated with inflammation or obesity. While the inflammation-related SNPs are not likely to impact smoking, we include them as prognostic covariates for efficiency. The analysis includes n = 1,892 individuals with available information on CRP and the p = 121 covariates, of which 113 were SNPs.

Since CRP is heavily skewed, we applied a log transformation so that the linear regression model in Mμ better fits the data. SEs, CIs, and p-values were calculated in the same way as above. The right side of Table 3 shows that different methods agree that smoking significantly increases logCRP. In general, point estimates tended to attenuate after adjusting for covariates since smokers are likely to have other characteristics that increase inflammation. DiPS, DR-SIM, and MADR were among the most efficient, though efficiency gains are tempered in this setting with larger p relative to n.

6. Discussion

In this paper we developed a novel IPW estimator for the ATE that accommodates data-driven variable selection through regularized regression. The estimator retains double-robustness and is locally semiparametric efficient when ν = 0. By calibrating the initial PS through smoothing, additional gains in efficiency can potentially be achieved in large samples under misspecification of the working outcome model.

In numerical studies, we used the extended regularized information criterion (Hui et al., 2015) to tune adaptive LASSO, which maintains selection consistency when log(p)/log(n) → ν, for ν ∈ [0, 1). Other criteria such as cross-validation can also be used and may exhibit better performance in some cases. The bandwidth h must be selected such that the dominating errors in the influence function, which are of order Op(n^{1/2}h^q + n^{−1/2}h^{−2}), converge to 0. This is satisfied for h = O(n^{−α}) with α ∈ (1/(2q), 1/4). The optimal bandwidth h* is one that balances these bias and variance terms and is of order h* = O(n^{−1/(q+2)}). In practice we use a plug-in estimator ĥ* = σ̂ n^{−1/(q+2)}, where σ̂ is the sample standard deviation of either α̂ᵀXi or β̂kᵀXi, possibly after applying a monotonic transformation. Cross-validation can also be used to select the smoothing bandwidth.
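The plug-in bandwidth can be sketched in a few lines. The example below assumes the rate condition reads h = O(n^{−α}) with α strictly between 1/(2q) and 1/4, and that the plug-in rule is σ̂ n^{−1/(q+2)} for a kernel of order q; the function names and the simulated index values are hypothetical and only illustrate the arithmetic.

```python
import random
import statistics


def plugin_bandwidth(index_values, q):
    """Plug-in bandwidth sigma_hat * n^(-1/(q+2)).

    index_values: fitted linear predictor for each subject (for example the
    PS index or the outcome index); q: assumed order of the kernel.
    """
    n = len(index_values)
    sigma_hat = statistics.stdev(index_values)  # sample standard deviation
    return sigma_hat * n ** (-1.0 / (q + 2))


def rate_is_admissible(q):
    """Check that alpha = 1/(q+2) lies strictly inside (1/(2q), 1/4)."""
    alpha = 1.0 / (q + 2)
    return 1.0 / (2 * q) < alpha < 0.25


# Simulated linear predictor values (NOT data from the paper).
random.seed(1)
index_values = [random.gauss(0.0, 2.0) for _ in range(5000)]

h_hat = plugin_bandwidth(index_values, q=4)
print(f"plug-in bandwidth for q = 4: {h_hat:.3f}")
print("rate admissible for q = 4:", rate_is_admissible(4))
print("rate admissible for q = 2:", rate_is_admissible(2))
```

Under this reading, the rate n^{−1/(q+2)} falls inside the admissible range only when q > 2, since both 1/(q+2) < 1/4 and 1/(q+2) > 1/(2q) reduce to that inequality.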

The adaptive LASSO estimators α̂ and β̂k are not uniformly root-n consistent when the penalty is tuned to achieve consistent model selection (Pötscher and Schneider, 2009), and their oracle properties, derived under fixed-parameter asymptotics, may fail to capture essential features of finite-sample distributions. For example, they are not root-n consistent when the true parameters are of order O(n^{−1/2}), that is, when the true signals are relatively weak. The importance of uniform inference has also recently been highlighted for treatment effect estimation in high-dimensional settings (Belloni et al., 2013; Farrell, 2015). It would be of interest to consider alternative variable selection approaches beyond those grounded in oracle properties to achieve uniform inference. Another limitation of relying on adaptive LASSO is that when p is large so that ν is large, a large power parameter γ would be required to maintain the oracle properties, leading to an unstable penalty and poor finite sample performance. It would be of interest to consider modifications of the proposed procedure to accommodate high-dimensional settings with p ≫ n and more general sparsity assumptions in future work.

Supplementary Material

Web Appendices
Code

Acknowledgements

The authors would like to thank the editor, associate editor, and two referees for their insightful feedback and suggestions. Most of this work was done when the first author was a graduate student at Harvard University. This work was supported by National Institutes of Health grants T32CA009337 and R01HL089778. The views expressed in this article are those of the authors and do not necessarily reflect the views of the Department of Veterans Affairs.

Footnotes

Supporting Information

Web Appendices referenced in Sections 2, 3, and 5 as well as the R code implementing the procedure are available with this paper at the Biometrics website on Wiley Online Library.

References

  1. Belloni A, Chernozhukov V, Fernández-Val I, and Hansen C (2017). Program evaluation and causal inference with high-dimensional data. Econometrica 85, 233–298.
  2. Belloni A, Chernozhukov V, and Hansen C (2013). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies 81, 608–650.
  3. Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, and Stürmer T (2006). Variable selection for propensity score models. American Journal of Epidemiology 163, 1149–1156.
  4. Cefalu M, Dominici F, Arvold N, and Parmigiani G (2017). Model averaged double robust estimation. Biometrics 73, 410–421.
  5. Crump RK, Hotz VJ, Imbens GW, and Mitnik OA (2009). Dealing with limited overlap in estimation of average treatment effects. Biometrika 96, 187–199.
  6. De Luna X, Waernbaum I, and Richardson TS (2011). Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika 98, 861–875.
  7. Farrell MH (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics 189, 1–23.
  8. Hahn J (2004). Functional restriction and efficiency in causal inference. The Review of Economics and Statistics 86, 73–76.
  9. Hansen BB (2008). The prognostic analogue of the propensity score. Biometrika 95, 481–488.
  10. Hu Z, Follmann DA, and Qin J (2012). Semiparametric double balancing score estimation for incomplete data with ignorable missingness. Journal of the American Statistical Association 107, 247–257.
  11. Hui FK, Warton DI, and Foster SD (2015). Tuning parameter selection for the adaptive lasso using ERIC. Journal of the American Statistical Association 110, 262–269.
  12. Jin Z, Ying Z, and Wei L-J (2001). A simple resampling method by perturbing the minimand. Biometrika 88, 381–390.
  13. Koch B, Vock DM, and Wolfson J (2018). Covariate selection with group lasso and doubly robust estimation of causal effects. Biometrics 74, 8–17.
  14. Li K-C and Duan N (1989). Regression analysis under link violation. The Annals of Statistics 17, 1009–1052.
  15. Lu W, Goldberg Y, and Fine J (2012). On the robustness of the adaptive lasso to model misspecification. Biometrika 99, 717–731.
  16. Lunceford JK and Davidian M (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects. Statistics in Medicine 23, 2937–2960.
  17. Newey WK and McFadden D (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics 4, 2111–2245.
  18. Pötscher BM and Schneider U (2009). On the distribution of the adaptive lasso estimator. Journal of Statistical Planning and Inference 139, 2775–2790.
  19. Robins JM, Rotnitzky A, and Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
  20. Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, and Brookhart MA (2009). High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology 20, 512–522.
  21. Shortreed SM and Ertefaie A (2017). Outcome-adaptive lasso: Variable selection for causal inference. Biometrics 73, 1111–1122.
  22. Tsiatis A (2007). Semiparametric theory and missing data. Springer.
  23. van der Laan MJ and Gruber S (2010). Collaborative double robust targeted maximum likelihood estimation. The International Journal of Biostatistics 6, Article 17.
  24. Wand MP, Marron JS, and Ruppert D (1991). Transformations in density estimation. Journal of the American Statistical Association 86, 343–353.
  25. Zou H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
  26. Zou H and Zhang HH (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics 37, 1733–1751.
