Estimating Average Treatment Effects with a Double-Index Propensity Score

David Cheng; Abhishek Chakrabortty; Ashwin N Ananthakrishnan; Tianxi Cai

doi:10.1111/biom.13195

Author manuscript; available in PMC: 2020 Sep 12.

Published in final edited form as: Biometrics. 2019 Dec 16;76(3):767–777. doi: 10.1111/biom.13195


PMCID: PMC7370895 NIHMSID: NIHMS1592926 PMID: 31797368

Summary:

We consider estimating average treatment effects (ATE) of a binary treatment in observational data when data-driven variable selection is needed to select relevant covariates from a moderately large number of available covariates X. To leverage covariates among X predictive of the outcome for efficiency gain while using regularization to fit a parametric propensity score (PS) model, we consider a dimension reduction of X based on fitting both working PS and outcome models using adaptive LASSO. A novel PS estimator, the Double-index Propensity Score (DiPS), is proposed, in which the treatment status is smoothed over the linear predictors for X from both the initial working models. The ATE is estimated by using the DiPS in a normalized inverse probability weighting (IPW) estimator, which is found to maintain double-robustness and also local semiparametric efficiency with a fixed number of covariates p. Under misspecification of working models, the smoothing step leads to gains in efficiency and robustness over traditional doubly-robust estimators. These results are extended to the case where p diverges with sample size and working models are sparse. Simulations show the benefits of the approach in finite samples. We illustrate the method by estimating the ATE of statins on colorectal cancer risk in an electronic medical record (EMR) study and the effect of smoking on C-reactive protein (CRP) in the Framingham Offspring Study.

Keywords: Causal inference, double-robustness, electronic medical records, kernel smoothing, regularization, semiparametric efficiency

1. Introduction

There is growing interest in evaluating medical treatments and policies in large-scale observational data such as electronic medical records (EMR). As with any observational data, in the absence of randomization, adjustment for a sufficient set of pre-treatment covariates X that satisfy “no unmeasured confounding” is needed when estimating average treatment effects (ATE) to avoid confounding bias. This is routinely done using propensity score (PS), outcome regression, and doubly-robust (DR) methods (Lunceford and Davidian, 2004). These methods were initially developed in settings where p, the dimension of X, was small relative to the sample size n. But large-scale observational data are increasingly collecting rich measurements in large sets of covariates, and data-driven variable selection approaches are needed due to the lack of sufficient prior knowledge to guide manual variable selection.

Effective variable selection for causal effect estimation involves consideration of the dependencies of X with both the treatment status T ∈ {0, 1} and the outcome Y. Let $A_{π} \subseteq \{1, 2, \dots, p\}$ index the subset of X upon which the PS $π_{1} (x) = ℙ (T = 1 | X = x)$ depends, and let $A_{μ}$ be an analogous index set for X upon which either μ₁(x) or μ₀(x) depends, where $μ_{k} (x) = E (Y | X = x, T = k)$. For any index set $S \subseteq \{1, 2, \dots, p\}$, let $S^{c}$ denote its complement in {1, 2, …, p}. When X is sufficient for no unmeasured confounding, the covariates indexed in $A_{π}$ form a reduced set of covariates that is also sufficient for no unmeasured confounding (De Luna et al., 2011). However, additionally adjusting for purely prognostic covariates in $A_{π}^{c} \cap A_{μ}$ can improve the efficiency of PS, outcome regression, and DR estimators (Lunceford and Davidian, 2004; Hahn, 2004; Brookhart et al., 2006).

To exploit this phenomenon, we consider an inverse probability weighting (IPW) estimator where the PS is initially estimated by regularized regression. Since variable selection procedures for the PS model would select out covariates in $A_{π}^{c} \cap A_{μ}$ , we also estimate a regularized regression model for μ_k(x), for k = 0, 1, to recover variation from covariates in $A_{π}^{c} \cap A_{μ}$ to inform estimation of a calibrated PS. The calibration is implemented through smoothing T over the linear predictors for X from both the initial PS and outcome models, which can be viewed as smoothing over working propensity and prognostic scores (Hansen, 2008). The resulting IPW estimator maintains double-robustness and achieves the semiparametric efficiency bound when p is fixed, under correctly specified PS and outcome working models. To the best of our knowledge, this is the first proposal in the literature that demonstrates these properties can be achieved through weighting only, without explicit augmentation. We show that the estimator is asymptotically linear and use this to characterize large-sample robustness and efficiency properties. The smoothing results in a refinement of the influence function under misspecification of the outcome model that can potentially result in substantial gains in efficiency relative to traditional DR estimators, which is confirmed in simulations. These properties hold in settings where p is either fixed or allowed to diverge slowly with n assuming fixed sparsity indices.

Data-driven variable selection for causal effect estimation has been considered in screening methods based on marginal associations between X with T and Y (Schneeweiss et al., 2009), but the results can be misleading because marginal associations need not agree with conditional associations. De Luna et al. (2011) carefully characterized and proposed algorithms to identify minimal subsets of covariates that are sufficient for no unmeasured confounding. Recent works have considered using regularized regression to select variables and post-selection methods that estimate treatment effects through partially linear models (Belloni et al., 2013) and DR estimators (Farrell, 2015; Belloni et al., 2017). These methods focus on delivering uniformly valid inference under high-dimensional regimes assuming approximately sparse models. Others have proposed modifying the regularization penalty itself in a way to select the relevant covariates and estimate treatment effects through IPW (Shortreed and Ertefaie, 2017) and DR estimators (Koch et al., 2018). However, these papers generally do not work out the full asymptotic distribution of the final estimator, making efficiency comparisons with established methods difficult. Some of the methods are also only singly-robust. Bayesian model averaging (Cefalu et al. (2017) and references therein) offers a principled alternative for variable selection but encounters burdensome computations that are possibly infeasible for large p.

Our proposed double-index PS (DiPS) can be viewed as a simple and intuitive approach to dimension reduction of X for estimating the PS. The approach for DiPS closely resembles a method proposed for estimating mean outcomes in the presence of data missing at random (Hu et al., 2012), except we use the double-score to estimate a PS instead of an outcome model. In contrast to their results, we show that a higher-order kernel is required due to the two-dimensional smoothing, find explicit efficiency gains under misspecification of the outcome model, and consider p diverging with n. There is also some similar intuition shared with collaborative DR methods (van der Laan and Gruber, 2010) in that associations with both treatment and outcome are taken into account when estimating a PS. However, DiPS takes a much different approach to estimating the PS. In the following, we introduce the proposed method and consider its asymptotic properties in Sections 2 and 3. A perturbation-resampling method is proposed for inference in Section 4. Simulations and applications to estimating treatment effects in an EMR study and cohort study are presented in Section 5. We conclude with some additional remarks in Section 6.

2. Method

2.1. Notations and Problem Setup

Let $Z_{i} = {(Y_{i}, T_{i}, X_{i}^{⊤})}^{⊤}$ be the observed data for the ith subject, where Y_i is an outcome that could be modeled by a generalized linear model (GLM), T_i ∈ {0, 1} a binary treatment, and X_i is a p-dimensional vector of covariates with support $X \subseteq ℝ^{p}$ . Here p is allowed to diverge slowly with n such that log(p)/log(n) → ν, for ν ∈ [0, 1), which includes the case where p is fixed by taking ν = 0. For a given n, the observed data consists of independent and identically distributed (iid) observations $D = {Z_{i} : i = 1, \dots, n}$ drawn from a distribution $ℙ_{n}$ , which potentially may vary with n. We suppress the dependence in the notations, implicitly assuming statements involving $ℙ$ and associated statistical functionals hold for each n. Let $Y_{i}^{(1)}$ and $Y_{i}^{(0)}$ denote the counterfactual outcomes had a subject received treatment or control. Based on $D$ , we want to make inferences about the average treatment effect (ATE):

Δ = E {Y^{(1)}} - E {Y^{(0)}} = μ_{1} - μ_{0} .

(1)

For identifiability, we require the following standard causal inference assumptions:

Y = T Y^{(1)} + (1 - T) Y^{(0)} with probability 1

(2)

π_{1} (x) \in [ϵ_{π}, 1 - ϵ_{π}] for some ϵ_{π} > 0, when x \in X

(3)

Y^{(1)} ⫫ T | X and Y^{(0)} ⫫ T | X,

(4)

where $π_{k} (x) = ℙ (T = k | X = x)$ , for k = 0, 1. The third condition assumes that X is a sufficient set of covariates such that no unmeasured confounding holds given the entire X. Under these assumptions, Δ can be identified from the observed data distribution $ℙ$ through:

Δ^{*} = E {μ_{1} (X) - μ_{0} (X)} = E {\frac{I (T = 1) Y}{π_{1} (X)} - \frac{I (T = 0) Y}{π_{0} (X)}},

where $μ_{k} (x) = E (Y | X = x, T = k)$ , for k = 0, 1. We will consider an estimator based on the IPW form that will nevertheless be doubly-robust so that it is consistent under models where either π_k(x) or μ_k(x) is correctly specified.
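The identification formula can be checked on a toy population. The sketch below assumes a single binary covariate and made-up values for the covariate distribution, the PS, and the conditional means (none of these numbers come from the paper). It evaluates the regression form and the IPW form of Δ* exactly, and contrasts them with the unadjusted difference in means.

```python
# Toy population with one binary covariate X; every number here is an illustrative assumption.
p_x = {0: 0.6, 1: 0.4}   # P(X = x)
pi1 = {0: 0.3, 1: 0.7}   # pi_1(x) = P(T = 1 | X = x)
mu1 = {0: 2.0, 1: 3.0}   # mu_1(x) = E(Y | X = x, T = 1)
mu0 = {0: 1.0, 1: 1.5}   # mu_0(x) = E(Y | X = x, T = 0)

# Regression form: E{mu_1(X) - mu_0(X)}.
ate_regression = sum(p_x[x] * (mu1[x] - mu0[x]) for x in p_x)

# IPW form: E{I(T=1)Y/pi_1(X) - I(T=0)Y/pi_0(X)}, with the expectation over (X, T) taken exactly.
ate_ipw = sum(
    p_x[x] * (pi1[x] * mu1[x] / pi1[x] - (1 - pi1[x]) * mu0[x] / (1 - pi1[x]))
    for x in p_x
)

# Unadjusted contrast E(Y | T = 1) - E(Y | T = 0), which ignores X.
p_t1 = sum(p_x[x] * pi1[x] for x in p_x)
naive = (
    sum(p_x[x] * pi1[x] * mu1[x] for x in p_x) / p_t1
    - sum(p_x[x] * (1 - pi1[x]) * mu0[x] for x in p_x) / (1 - p_t1)
)
```

In this toy setup both forms equal 1.2, while the unadjusted contrast is roughly 1.50 because treated subjects are more likely to have X = 1.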

2.2. Parametric Models for Nuisance Functions

We consider parametric modeling as a means to reduce the dimensions of X when estimating the PS. For reference, let $M_{n p}$ be the nonparametric model for the distribution of Z, $ℙ$ , that has no restrictions on $ℙ$ except requiring the second moment of Z to be finite. Let $M_{π} \subseteq M_{n p}$ and $M_{μ} \subseteq M_{n p}$ respectively denote parametric working models under which:

π_{1} (x) = g_{π} (α_{0} + α^{⊤} x),

(5)

and μ_{k} (x) = g_{μ} (β_{0} + β_{1} k + β_{k}^{⊤} x), for k = 0, 1,

(6)

where g_π(·) and g_μ(·) are known link functions, and $\vec{α} = {(α_{0}, α^{⊤})}^{⊤} \in Θ_{α} \subseteq ℝ^{p + 1}$ and $\vec{β} = {(β_{0}, β_{1}, β_{0}^{⊤}, β_{1}^{⊤})}^{⊤} \in Θ_{β} \subseteq ℝ^{2 p + 2}$ are unknown parameters. In (6) slopes are allowed to differ by treatment arms to allow for heterogeneous effects of T for subjects with different X even with a linear link. When it is reasonable to assume heterogeneity is weak or nonexistent, it may be beneficial for efficiency to restrict β₀ = β₁.
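As a concrete reading of (5) and (6), the sketch below evaluates the two working models with a logistic link for the PS and an identity link for the outcome. The function and argument names (`working_ps`, `working_outcome`, `intercept`, `arm_effect`, `slopes`) are hypothetical, and the coefficient values in the usage lines are arbitrary.

```python
import math

def expit(u):
    """Logistic link g_pi(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def working_ps(x, alpha0, alpha):
    """Working PS model (5): pi_1(x) = g_pi(alpha_0 + alpha' x)."""
    return expit(alpha0 + dot(alpha, x))

def working_outcome(x, k, intercept, arm_effect, slopes):
    """Working outcome model (6) with identity link.

    slopes[k] is the slope vector for arm k, so the effect of T may vary with x.
    """
    return intercept + arm_effect * k + dot(slopes[k], x)

# Arbitrary illustrative coefficients with arm-specific slopes.
slopes = {0: [1.0, 0.0], 1: [1.0, 0.5]}
ps = working_ps([1.0, -2.0], alpha0=0.2, alpha=[0.4, 0.3])
effect_a = working_outcome([1.0, -2.0], 1, 1.0, 2.0, slopes) - working_outcome([1.0, -2.0], 0, 1.0, 2.0, slopes)
effect_b = working_outcome([1.0, 2.0], 1, 1.0, 2.0, slopes) - working_outcome([1.0, 2.0], 0, 1.0, 2.0, slopes)
```

With these values the conditional effect is 1 at the first covariate value and 3 at the second, which is the kind of heterogeneity that arm-specific slopes permit even with a linear link.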

Regardless of the validity of either working model (i.e. whether $ℙ \in M_{π} \cup M_{μ}$ ), we first obtain estimates of α and β_k’s through adaptive LASSO (Zou, 2006):

{({\hat{α}}_{0}, {\hat{α}}^{⊤})}^{⊤} = \underset{\vec{α}}{arg max} {n^{- 1} \sum_{i = 1}^{n} l_{π} (\vec{α}; T_{i}, X_{i}) - λ_{π, n} \sum_{j = 1}^{p} | α_{j} | / {| {\tilde{α}}_{j} |}^{γ}}

(7)

{({\hat{β}}_{0}, {\hat{β}}_{1}, {\hat{β}}_{0}^{⊤}, {\hat{β}}_{1}^{⊤})}^{⊤} = \underset{\vec{β}}{arg max} {n^{- 1} \sum_{i = 1}^{n} l_{μ} (\vec{β}; Z_{i}) - λ_{μ, n} (| β_{1} | / {| {\tilde{β}}_{1} |}^{γ} + \sum_{k = 0}^{1} \sum_{j = 2}^{p} | β_{k, j} | / {| {\tilde{β}}_{k, j} |}^{γ})},

(8)

where $l_{π} (\vec{α}; T_{i}, X_{i})$ denotes the log-likelihood for $\vec{α}$ under $M_{π}$ given T_i and X_i, $l_{μ} (\vec{β}; Z_{i})$ is a log-likelihood for $\vec{β}$ from a GLM suitable for the outcome type of Y under $M_{μ}$ given Z_i, ${\tilde{α}}_{j}$, ${\tilde{β}}_{1}$, and ${\tilde{β}}_{k, j}$ are initial root-n consistent estimates, and $λ_{π, n}$ is a tuning parameter such that $n^{1 / 2} λ_{π, n} \to 0$ and $n^{(1 - ν) (1 + γ) / 2} λ_{π, n} \to \infty$, with γ > 2ν/(1 − ν), and similarly for $λ_{μ, n}$ (Zou and Zhang, 2009). We specify adaptive LASSO here to estimate the nuisance parameters for concreteness, but use of other penalized likelihood methods can also be justified, so long as they have an oracle property, as in Theorem 2 of Zou (2006) and described below.
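The effect of the adaptive weights in (7) and (8) is easiest to see in a special case. The sketch below assumes a Gaussian outcome with identity link and an orthonormal design (X'X/n = I), under which the penalized problem separates by coordinate and the solution is a soft-thresholding of the least-squares fit. This is a simplification for illustration only, not the general GLM fit; the function names and input values are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def adaptive_lasso_orthonormal(beta_init, lam, gamma):
    """Adaptive LASSO solution when X'X / n = I and the loss is (1/2n)||y - X b||^2.

    beta_init plays the role of both the least-squares fit and the initial
    root-n consistent estimate that defines the weights 1 / |beta_init_j|^gamma.
    """
    out = []
    for b in beta_init:
        if b == 0.0:
            out.append(0.0)  # infinite penalty weight: the coefficient stays at zero
            continue
        weight = 1.0 / abs(b) ** gamma
        out.append(soft_threshold(b, lam * weight))
    return out

# Hypothetical initial estimates: two strong signals and one weak one.
beta_hat = adaptive_lasso_orthonormal([2.0, -1.0, 0.1], lam=0.05, gamma=1)
```

Here the large coefficients are shrunk only slightly (to 1.975 and -0.95), while the small one receives a penalty ten times larger and is set exactly to zero. An unweighted LASSO with the same λ would have kept it at 0.05.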

Under models (5) and (6), we assume that α and β_k, for k = 0, 1, are sparse. More generally, regardless of whether working models are correct or misspecified, we assume that there exist least false parameters ${({\bar{α}}_{0}, {\bar{α}}^{⊤})}^{⊤}$ and ${({\bar{β}}_{0}, {\bar{β}}_{1}, {\bar{β}}_{0}^{⊤}, {\bar{β}}_{1}^{⊤})}^{⊤}$ (Lu et al., 2012) such that:

{({\bar{α}}_{0}, {\bar{α}}^{⊤})}^{⊤} uniquely maximize E \{l_{π} (\vec{α}; T_{i}, X_{i})\} and {({\bar{β}}_{0}, {\bar{β}}_{1}, {\bar{β}}_{0}^{⊤}, {\bar{β}}_{1}^{⊤})}^{⊤} uniquely maximize E \{l_{μ} (\vec{β}; Z_{i})\} .

(9)

Let $A_{α}$ and $A_{β_{k}}$ be respective supports for $\bar{α}$ and ${\bar{β}}_{k}$ and let $s_{α} = | A_{α} |$ and $s_{β_{k}} = | A_{β_{k}} |$ be the sparsity indices. We further assume $\bar{α}$ and ${\bar{β}}_{k}$ have fixed sparsity such that:

s_{α}, s_{β_{0}} and s_{β_{1}} are fixed as n \to \infty .

(10)

For any vector v of length p and any index set $S \subseteq \{1, 2, \dots, p\}$, let $v_{S}$ denote the subvector of v restricted to elements indexed in $S$. Assumption (9) is a high-level assumption that would be required for $\hat{α}$ and ${\hat{β}}_{k}$ to maintain an oracle property with respect to the least false parameters $\bar{α}$ and ${\bar{β}}_{k}$ under possibly misspecified working models. Under this assumption, using arguments similar to those in Lu et al. (2012) and Zou and Zhang (2009), it can be shown that $ℙ ({\hat{α}}_{A_{α}^{c}} = 0) \to 1$ and that $\hat{α}$ admits an expansion of the form $n^{1 / 2} {(\hat{α} - \bar{α})}_{A_{α}} = n^{- 1 / 2} \sum_{i = 1}^{n} Ψ_{i, A_{α}} + o_{p} (1)$, which would yield the asymptotic normality results of the oracle property, and similarly for ${\hat{β}}_{k}$. We rely on these results along with (10) to show that the DiPS IPW estimator is asymptotically linear in Theorem 1. In regimes where ν > 0, (10) models a setting in which a small number of covariates exhibit non-negligible associations with T and Y and a majority of covariates are noise. Assumption (10) may not be required for asymptotic linearity and can potentially be relaxed, allowing $s_{α}$ and $s_{β_{k}}$ to diverge slowly, for example, if they are $o (n^{1 / 3})$. We invoke this assumption to avoid complications of a growing support, which may need triangular array asymptotics to accommodate dependence of the support on n.

2.3. Double-Index Propensity Score and IPW Estimator

To mitigate the effects of misspecification of (5), one could perform nonparametric smoothing of T over ${\hat{α}}^{⊤} X$ to calibrate the initial PS estimator $g_{π} ({\hat{α}}_{0} + X^{⊤} \hat{α})$ . We consider smoothing over not only ${\hat{α}}^{⊤} X$ but also ${\hat{β}}_{k}^{⊤} X$ as well to allow variation in prognostic covariates indexed in $A_{β_{k}}$ to inform this calibration. Such covariates are reduced into ${\hat{β}}_{k}^{⊤} X$ to allow for nonparametric kernel smoothing in low (two) dimensions. The DiPS estimator for each treatment is:

{\hat{π}}_{k} (x; {\hat{θ}}_{k}) = \frac{n^{- 1} \sum_{j = 1}^{n} K_{h} {{(\hat{α}, {\hat{β}}_{k})}^{⊤} (X_{j} - x)} I (T_{j} = k)}{n^{- 1} \sum_{j = 1}^{n} K_{h} {{(\hat{α}, {\hat{β}}_{k})}^{⊤} (X_{j} - x)}}, for k = 0, 1,

(11)

where ${\hat{θ}}_{k} = {({\hat{α}}^{⊤}, {\hat{β}}_{k}^{⊤})}^{⊤}$, K_h(u) = h⁻²K(u/h), and K(u) is a bivariate q-th order kernel function with q > 2. A higher-order kernel is required here for the asymptotics to be well-behaved, which is the price for estimating the nuisance functions π_k(x) using two-dimensional smoothing. This allows for the possibility of negative values for ${\hat{π}}_{k} (x; {\hat{θ}}_{k})$. Nevertheless, ${\hat{π}}_{k} (x; {\hat{θ}}_{k})$ are nuisance estimates not of direct interest, and we find that such negative PS estimates typically occur infrequently, occurring on average in simulations in 0.01% to 2.10% of observations depending on the size of n and p across scenarios where working models are correct or incorrectly specified (Web Appendix D). As they are infrequent and do not appear to compromise the performance of the final estimator, they can potentially be left as is when encountered in practice. Alternatively, methods that discard or trim PS estimates to handle near-violations of positivity, as in Assumption (3), can be considered (Crump et al., 2009). A monotone transformation of the input scores for each treatment ${\hat{S}}_{k} = {(\hat{α}, {\hat{β}}_{k})}^{⊤} X$ can be applied prior to smoothing to improve finite sample performance (Wand et al., 1991). In numerical studies, for instance, we applied a probability integral transform based on the normal cumulative distribution function to the standardized scores to obtain approximately uniformly distributed inputs. The components of ${\hat{S}}_{k}$ can also be scaled such that a common bandwidth h can be used for both components of the score.
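The DiPS in (11) is a Nadaraya-Watson smoother of I(T = k) over the two estimated linear predictors. The sketch below is one possible implementation, assuming a product of univariate fourth-order Gaussian kernels, K(u) = ½(3 − u²)φ(u), and a common bandwidth. The function names are hypothetical, the score transformation described above is omitted, and the factor h⁻² is dropped because it cancels in the ratio.

```python
import math

def gauss4(u):
    """Fourth-order Gaussian kernel 0.5 * (3 - u^2) * phi(u); negative for |u| > sqrt(3)."""
    return 0.5 * (3.0 - u * u) * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def dips(x, k, X, T, alpha, beta_k, h):
    """Estimate P(T = k | alpha'X = alpha'x, beta_k'X = beta_k'x) by kernel smoothing.

    X is a list of covariate vectors, T the list of treatment indicators,
    and (alpha, beta_k) the slope estimates without intercepts.
    """
    s_a, s_b = dot(alpha, x), dot(beta_k, x)
    num = den = 0.0
    for x_j, t_j in zip(X, T):
        w = gauss4((dot(alpha, x_j) - s_a) / h) * gauss4((dot(beta_k, x_j) - s_b) / h)
        den += w
        if t_j == k:
            num += w
    return num / den

# Tiny hypothetical data set: four subjects, two covariates.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
T = [1, 0, 0, 1]
p1 = dips([0.0, 0.0], 1, X, T, alpha=[1.0, 0.0], beta_k=[0.0, 1.0], h=1.0)
p0 = dips([0.0, 0.0], 0, X, T, alpha=[1.0, 0.0], beta_k=[0.0, 1.0], h=1.0)
```

Because the fourth-order kernel takes negative values, the ratio is not guaranteed to lie in [0, 1], which is the source of the occasional negative PS estimates discussed above. In the toy call the same second index is used for both arms, so the two estimates sum to one; with arm-specific ${\hat{β}}_{k}$ they need not.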

With π_k(x) estimated by ${\hat{π}}_{k} (x; {\hat{θ}}_{k})$ , the estimator for Δ is given by $\hat{Δ} = {\hat{μ}}_{1} - {\hat{μ}}_{0}$ , where:

{\hat{μ}}_{k} = \left\{ \sum_{i = 1}^{n} \frac{I (T_{i} = k)}{{\hat{π}}_{k} (X_{i}; {\hat{θ}}_{k})} \right\}^{- 1} \left\{ \sum_{i = 1}^{n} \frac{I (T_{i} = k) Y_{i}}{{\hat{π}}_{k} (X_{i}; {\hat{θ}}_{k})} \right\}, for k = 0, 1.

(12)

This is the usual normalized IPW estimator, where the PS is estimated by the DiPS. The intuition for double-robustness of the estimator is as follows. Regardless of the validity of either working model, provided the asymptotics are well-behaved, ${\hat{μ}}_{k}$ is consistent for:

{\bar{μ}}_{k} = E {\frac{I (T_{i} = k) Y_{i}}{π_{k} (X_{i}; {\bar{θ}}_{k})}}, for k = 0, 1,

where ${\bar{θ}}_{k} = {({\bar{α}}^{⊤}, {\bar{β}}_{k}^{⊤})}^{⊤}$ , and $π_{k} (x; {\bar{θ}}_{k}) = ℙ (T_{i} = k | {\bar{α}}^{⊤} X_{i} = {\bar{α}}^{⊤} x, {\bar{β}}_{k}^{⊤} X_{i} = {\bar{β}}_{k}^{⊤} x)$ . Under $M_{π}$ , $π_{k} (x; {\bar{θ}}_{k}) = π_{k} (x)$ so that the estimand, under the causal assumptions (2)–(4), reduces to:

{\bar{μ}}_{k} = E {\frac{I (T_{i} = k) Y_{i}}{π_{k} (X_{i})}} = E {Y_{i}^{(k)}}, for k = 0, 1.

On the other hand, under $M_{μ}$, $E (Y_{i} | {\bar{α}}^{⊤} X_{i} = {\bar{α}}^{⊤} x, {\bar{β}}_{k}^{⊤} X_{i} = {\bar{β}}_{k}^{⊤} x, T_{i} = k) = μ_{k} (x)$ so that:

{\bar{μ}}_{k} = E \{E (Y_{i} | {\bar{α}}^{⊤} X_{i}, {\bar{β}}_{k}^{⊤} X_{i}, T_{i} = k)\} = E \{μ_{k} (X_{i})\} = E \{Y_{i}^{(k)}\}, for k = 0, 1.

In the following, we show that ${\hat{μ}}_{k}$ (and thus $\hat{Δ}$) is asymptotically linear. We then examine robustness and efficiency properties using the expansion.
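The normalized IPW estimator of Section 2.3 divides the weighted outcome sum by the sum of the weights within each arm. The sketch below takes the PS estimates as given inputs; the function names and the small data set are hypothetical. Separate PS vectors are passed for the two arms because the DiPS estimates for k = 1 and k = 0 use different second indices and need not sum to one.

```python
def normalized_ipw_mean(y, t, ps_k, k):
    """Normalized (Hajek-type) IPW mean for arm k.

    ps_k[i] is the estimated probability that subject i receives arm k.
    """
    weights = [1.0 / p if t_i == k else 0.0 for t_i, p in zip(t, ps_k)]
    return sum(w * y_i for w, y_i in zip(weights, y)) / sum(weights)

def ate_ipw(y, t, ps1, ps0):
    """Difference of the two normalized IPW means, using arm-specific PS estimates."""
    return normalized_ipw_mean(y, t, ps1, 1) - normalized_ipw_mean(y, t, ps0, 0)

# Hypothetical data: two treated and two control subjects.
y = [2.0, 4.0, 3.0, 5.0]
t = [1, 1, 0, 0]
ps1 = [0.25, 0.5, 0.5, 0.2]
ps0 = [0.75, 0.5, 0.5, 0.8]
delta_hat = ate_ipw(y, t, ps1, ps0)
```

One consequence of the normalization is that multiplying all PS estimates in an arm by a common constant leaves that arm's estimated mean unchanged.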

3. Asymptotic Robustness and Efficiency Properties

We directly show in Web Appendix B that ${\hat{μ}}_{k}$ is asymptotically linear for k = 0, 1 in general without assuming either of the working models are correct. Let $\bar{Δ} = {\bar{μ}}_{1} - {\bar{μ}}_{0}$ and ${\hat{W}}_{k} = n^{1 / 2} ({\hat{μ}}_{k} - {\bar{μ}}_{k})$ for k = 0, 1 so that $n^{1 / 2} (\hat{Δ} - \bar{Δ}) = {\hat{W}}_{1} - {\hat{W}}_{0}$ .

Theorem 1:

Suppose that causal assumptions (2)–(4), the least false parameter and sparsity assumptions (9)–(10) and regularity conditions in Web Appendix A hold. If log(p)/log(n) → ν for ν ∈ [0, 1), then ${\hat{μ}}_{k}$ is asymptotically linear in that it admits the expansion:

{\hat{W}}_{k} = n^{- 1 / 2} \sum_{i = 1}^{n} \left[ \frac{I (T_{i} = k) Y_{i}}{π_{k} (X_{i}; {\bar{θ}}_{k})} - \left\{ \frac{I (T_{i} = k)}{π_{k} (X_{i}; {\bar{θ}}_{k})} - 1 \right\} E (Y_{i} | {\bar{α}}^{⊤} X_{i}, {\bar{β}}_{k}^{⊤} X_{i}, T_{i} = k) - {\bar{μ}}_{k} \right]

(13)

+ n^{- 1 / 2} \sum_{i = 1}^{n} \left( U_{k, A_{α}}^{⊤} Ψ_{i, A_{α}} + v_{k, A_{β_{k}}}^{⊤} Υ_{i, k, A_{β_{k}}} \right) + O_{p} (n^{1 / 2} h^{q} + n^{- 1 / 2} h^{- 2}),

(14)

for k = 0, 1, where $U_{k, A_{α}}$ and $v_{k, A_{β_{k}}}$ are deterministic vectors, and $Ψ_{i, A_{α}}$ and $Υ_{i, k, A_{β_{k}}}$ are influence functions from asymptotic expansions of ${\hat{α}}_{A_{α}}$ and ${\hat{β}}_{k, A_{β_{k}}}$. Under model $M_{π}$, $v_{k, A_{β_{k}}} = 0$ for k = 0, 1. Under $M_{π} \cap M_{μ}$, we additionally have that $U_{k, A_{α}} = 0$, for k = 0, 1.

Proof sketch: ${\hat{W}}_{k}$ can be decomposed as:

{\hat{W}}_{k} = n^{- 1 / 2} \sum_{i = 1}^{n} \frac{I (T_{i} = k)}{π_{k} (X_{i}; {\bar{θ}}_{k})} (Y_{i} - {\bar{μ}}_{k}) + n^{- 1 / 2} \sum_{i = 1}^{n} \left\{ \frac{I (T_{i} = k)}{{\hat{π}}_{k} (X_{i}; {\bar{θ}}_{k})} - \frac{I (T_{i} = k)}{π_{k} (X_{i}; {\bar{θ}}_{k})} \right\} (Y_{i} - {\bar{μ}}_{k}) + n^{- 1 / 2} \sum_{i = 1}^{n} \left\{ \frac{I (T_{i} = k)}{{\hat{π}}_{k} (X_{i}; {\hat{θ}}_{k})} - \frac{I (T_{i} = k)}{{\hat{π}}_{k} (X_{i}; {\bar{θ}}_{k})} \right\} (Y_{i} - {\bar{μ}}_{k}) + o_{p} (1) .

The first term directly contributes to the expansion. The second term is the contribution from re-estimating the PS through kernel smoothing given ${\bar{θ}}_{k}$ . We apply a V-statistic projection lemma (Newey and McFadden, 1994) to obtain an asymptotically linear representation. The third term can be expanded by Taylor expansion into terms of the form $U_{k}^{⊤} n^{1 / 2} (\hat{α} - \bar{α})$ and $v_{k}^{⊤} n^{1 / 2} (\hat{β} - \bar{β})$ . Applying the selection consistency that $ℙ ({\hat{α}}_{A_{α}^{c}} = 0) \to 1$ , $U_{k}^{⊤} n^{1 / 2} (\hat{α} - \bar{α}) = U_{k, A_{α}}^{⊤} n^{1 / 2} {(\hat{α} - \bar{α})}_{A_{α}} + o_{p} (1)$ . Lastly, we use that $n^{1 / 2} {(\hat{α} - \bar{α})}_{A_{α}} = n^{- 1 / 2} \sum_{i = 1}^{n} Ψ_{i, A_{α}} + o_{p} (1)$ and work out the forms of the loading vector $U_{k, A_{α}}$ and repeat for ${\hat{β}}_{k}$ to complete the expansion.

Let ${\hat{Δ}}_{d r} = {\hat{μ}}_{1, d r} - {\hat{μ}}_{0, d r}$ denote the usual doubly-robust estimator, as in Equation (9) of Lunceford and Davidian (2004), with the PS π_k(x) and mean outcome μ_k(x) estimated in the same way as through (7) and (8). The influence function expansion for $\hat{Δ}$ in Theorem 1 is nearly identical to that of ${\hat{Δ}}_{d r}$. The terms in (13) would be the same except that $π_{k} (X_{i}; {\bar{θ}}_{k})$ and $E (Y_{i} | {\bar{α}}^{⊤} X_{i}, {\bar{β}}_{k}^{⊤} X_{i}, T_{i} = k)$ replace the asymptotic estimates under the parametric models. Terms in (14) analogously represent the additional contributions from estimating the nuisance parameters. No contribution from smoothing is incurred provided the bandwidths are suitably chosen. This similarity in the influence functions yields similar robustness and efficiency properties, which are improved upon under model misspecification due to the smoothing.

3.1. Robustness

As a consequence of Theorem 1, $\hat{Δ}$ is root-n consistent for $\bar{Δ}$ so that $\hat{Δ} - \bar{Δ} = O_{p} (n^{- 1 / 2})$ provided that $h = O (n^{- α})$ for $α \in (\frac{1}{2 q}, \frac{1}{4})$. As discussed in Section 2.3, under $M_{π} \cup M_{μ}$, $\bar{Δ} = Δ$. Hence $\hat{Δ}$ is doubly-robust for Δ in that $\hat{Δ}$ is root-n consistent for Δ under $M_{π} \cup M_{μ}$. Beyond this usual form of double-robustness, if the PS model specification is incorrect, we expect the calibration step to at least partially correct for the misspecification in large samples since $π_{k} (x; {\bar{θ}}_{k})$ is closer to the true π_k(x) than the misspecified parametric model $g_{π} ({\bar{α}}_{0} + {\bar{α}}^{⊤} x)$. Let ${\tilde{M}}_{π}$ denote a model under which $π_{1} (x) = {\tilde{g}}_{π} (α^{⊤} x)$ for some unknown link function ${\tilde{g}}_{π} (\cdot)$ and unknown $α \in ℝ^{p}$, and X are known to be elliptically distributed such that $E (a^{⊤} X | α_{*}^{⊤} X)$ exists and is linear in $α_{*}^{⊤} X$, where α_* denotes the true α (e.g. if X is multivariate normal). By the results of Li and Duan (1989), it can be shown that $\bar{α} = c α_{*}$ for some scalar c under ${\tilde{M}}_{π}$. But since ${\hat{π}}_{k} (x; {\hat{θ}}_{k})$ is consistent for $π_{k} (x; {\bar{θ}}_{k}) = ℙ (T = k | {\bar{α}}^{⊤} X = {\bar{α}}^{⊤} x, {\bar{β}}_{k}^{⊤} X = {\bar{β}}_{k}^{⊤} x)$, it recovers π_k(x) under ${\tilde{M}}_{π}$. Consequently, $\hat{Δ}$ also has some mild benefits in robustness in that $\hat{Δ} - Δ = O_{p} (n^{- 1 / 2})$ under the slightly larger model $M_{π} \cup {\tilde{M}}_{π} \cup M_{μ}$. The same phenomenon also occurs when estimating β_k under misspecification of the link in (6), if we do not assume β₀ = β₁. In this case, if ${\tilde{M}}_{μ}$ is an analogous model under which $μ_{1} (x) = {\tilde{g}}_{μ, 1} (β_{1}^{⊤} x)$ and $μ_{0} (x) = {\tilde{g}}_{μ, 0} (β_{0}^{⊤} x)$ for some unknown link functions ${\tilde{g}}_{μ, 0} (\cdot)$ and ${\tilde{g}}_{μ, 1} (\cdot)$ and X are elliptically distributed, then $\hat{Δ} - Δ = O_{p} (n^{- 1 / 2})$ under the slightly larger model $M_{π} \cup {\tilde{M}}_{π} \cup M_{μ} \cup {\tilde{M}}_{μ}$. This does not hold when β₀ = β₁, as T is binary so $(T, X^{⊤})^{⊤}$ is not exactly elliptically distributed. But the result may still be expected to hold approximately.
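The bandwidth condition can be read off from the remainder term in Theorem 1. As a worked check, take h proportional to $n^{- a}$, writing a for the bandwidth exponent so that it is not confused with the PS coefficients. The two remainder terms then behave as follows:

```latex
n^{1/2} h^{q} \asymp n^{1/2 - a q} \to 0 \iff a > \frac{1}{2q},
\qquad
n^{-1/2} h^{-2} \asymp n^{2a - 1/2} \to 0 \iff a < \frac{1}{4}.
```

Both requirements can hold at once only if $\frac{1}{2 q} < \frac{1}{4}$, that is, q > 2, which is why a higher-order kernel is needed for the two-dimensional smoothing. With q = 4, for example, the admissible exponents are those in (1/8, 1/4).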

3.2. Efficiency

Let the terms contributed to the influence function for $\hat{Δ}$ when α and β_k are known be:

φ_{i, k} = \frac{I (T_{i} = k) Y_{i}}{π_{k} (X_{i}; {\bar{θ}}_{k})} - {\frac{I (T_{i} = k)}{π_{k} (X_{i}; {\bar{θ}}_{k})} - 1} E (Y_{i} | {\bar{α}}^{⊤} X_{i}, {\bar{β}}_{k}^{⊤} X_{i}, T_{i} = k) - {\bar{μ}}_{k} .

(15)

Under $M_{π} \cap M_{μ}$ , $φ_{i, k}$ is the full influence function for $\hat{Δ}$ . This is the efficient influence function for Δ* under $M_{n p}$ at distributions for $ℙ$ belonging to $M_{π} \cap M_{μ}$ when p is fixed (Robins et al., 1994; Tsiatis, 2007), since $E (Y_{i} | {\bar{α}}^{⊤} X_{i} = {\bar{α}}^{⊤} x, {\bar{β}}_{k}^{⊤} X_{i} = {\bar{β}}_{k}^{⊤} x, T_{i} = k) = μ_{k} (x)$ and $π_{k} (x; {\bar{θ}}_{k}) = π_{k} (x)$ . When ν > 0 so that p diverges with n, there are no well-established semiparametric efficiency bounds. However with fixed sparsity indices (10), the asymptotic variance still reaches the same bound had p been fixed.

Beyond this characterization of efficiency that parallels that of ${\hat{Δ}}_{d r}$ , there are additional benefits of $\hat{Δ}$ under $M_{π} \cap M_{μ}^{c}$ . In this case, akin to ${\hat{Δ}}_{d r}$ , estimating β_k does not contribute to the asymptotic variance since $v_{k, A_{β_{k}}} = 0$ , and a similar $n^{1 / 2} U_{k, A_{α}}^{⊤} {(\hat{α} - \bar{α})}_{A_{α}}$ term is contributed from estimating α. The analogous term in the expansion for ${\hat{Δ}}_{d r}$ contributes the negative of a projection of the preceding terms onto the linear span of the score function for α, restricted to components in $A_{α}$ , to its influence function (Section 9.1 of Tsiatis (2007)). The same interpretation of the influence function can be adopted for $\hat{Δ}$ .

Theorem 2:

Let U_α be the score for α under $M_{π}$ and let $[U_{α, A_{α}}]$ denote the linear span of its components indexed in $A_{α}$ . In the Hilbert space of random variables with mean 0 and finite variance $L_{2}^{0}$ with inner product given by the covariance, let $Π {V | S}$ denote the projection of some $V \in L_{2}^{0}$ into a subspace $S \subseteq L_{2}^{0}$ . If the assumptions required for Theorem 1 hold, under $M_{π}$ , $U_{k, A_{α}}^{⊤} n^{1 / 2} {(\hat{α} - \bar{α})}_{A_{α}} = - n^{- 1 / 2} \sum_{i = 1}^{n} Π {φ_{i, k} | [U_{α, A_{α}}]} + o_{p} (1)$ .

The proof is based on simplifying $U_{k, A_{α}}$ and is given in Web Appendix B. This result can be used to show that the asymptotic variance of $\hat{Δ}$ is lower than that of ${\hat{Δ}}_{d r}$ under $M_{π} \cap M_{μ}^{c}$. Based on this result, under $M_{π} \cap M_{μ}^{c}$ the influence function for ${\hat{μ}}_{k}$ is $φ_{i, k} - Π \{φ_{i, k} | [U_{α, A_{α}}]\}$, and that for the usual DR estimator ${\hat{μ}}_{k, d r}$ is $ϕ_{i, k} - Π \{ϕ_{i, k} | [U_{α, A_{α}}]\}$, where:

ϕ_{i, k} = \frac{I (T_{i} = k) Y_{i}}{π_{k} (X_{i})} - {\frac{I (T_{i} = k)}{π_{k} (X_{i})} - 1} g_{μ} ({\bar{β}}_{0} + {\bar{β}}_{1} k + {\bar{β}}_{k}^{⊤} X_{i}) - {\bar{μ}}_{k} .

But since $E (Y_{i} | {\bar{α}}^{⊤} X_{i} = {\bar{α}}^{⊤} x, {\bar{β}}_{k}^{⊤} X_{i} = {\bar{β}}_{k}^{⊤} x, T_{i} = k)$ better approximates μ_k(x) than the asymptotic estimate under the misspecified parametric model $g_{μ} ({\bar{β}}_{0} + {\bar{β}}_{1} k + {\bar{β}}_{k}^{⊤} x)$ , it can then be shown that $E (ϕ_{i, k}^{2}) > E (φ_{i, k}^{2})$ for k = 0, 1. Since the influence functions involve projections onto the same space $[U_{α, A_{α}}]$ , it can be seen through geometric argument that $E {[φ_{i, k} - Π {φ_{i, k} | [U_{α, A_{α}}]}]}^{2} < E {[ϕ_{i, k} - Π {ϕ_{i, k} | [U_{α, A_{α}}]}]}^{2}$ , so that $\hat{Δ}$ is more efficient than ${\hat{Δ}}_{d r}$ under $M_{π} \cap M_{μ}^{c}$ . We show in the simulation studies that this improvement can lead to substantial efficiency gains under $M_{π} \cap M_{μ}^{c}$ in finite samples. These unique robustness and efficiency properties distinguish $\hat{Δ}$ from ${\hat{Δ}}_{d r}$ and its variants. We next consider a perturbation scheme to estimate standard errors (SE) and confidence intervals (CI) for $\hat{Δ}$ .

4. Perturbation Resampling

Although the asymptotic variance of $\hat{Δ}$ can be determined through its influence function specified in Theorem 1, a direct empirical estimate based on the influence function is infeasible because it involves functionals of $ℙ$ that are difficult to estimate. Instead we propose a simple perturbation-resampling procedure. Let $G = \{G_{i} : i = 1, \dots, n\}$ be a set of non-negative iid random variables with unit mean and unit variance that are independent of $D$. The procedure perturbs each “layer” of the estimation of $\hat{Δ}$. Let the perturbed estimates of $\vec{α}$ and $\vec{β}$ be:

{({\hat{α}}_{0}^{*}, {\hat{α}}^{* ⊤})}^{⊤} = \underset{\vec{α}}{arg max} {n^{- 1} \sum_{i = 1}^{n} l_{π} (\vec{α}; T_{i}, X_{i}) G_{i} - λ_{π, n} \sum_{j = 1}^{p} | α_{j} | / {| {\tilde{α}}_{j}^{*} |}^{γ}}

{({\hat{β}}_{0}^{*}, {\hat{β}}_{1}^{*}, {\hat{β}}_{0}^{* ⊤}, {\hat{β}}_{1}^{* ⊤})}^{⊤} = \underset{\vec{β}}{arg max} {n^{- 1} \sum_{i = 1}^{n} l_{μ} (\vec{β}; Z_{i}) G_{i} - λ_{μ, n} \sum_{j = 1}^{p} | β_{j} | / {| {\tilde{β}}_{j}^{*} |}^{γ}},

where ${\tilde{α}}_{j}^{*}$ and ${\tilde{β}}_{j}^{*}$ are perturbed initial estimates obtained from analogously perturbing their estimating equations. The perturbed DiPS estimates are calculated by:

{\hat{π}}_{k}^{*} (x; {\hat{θ}}_{k}^{*}) = \frac{\sum_{j = 1}^{n} K_{h} {{({\hat{α}}^{*}, {\hat{β}}_{k}^{*})}^{⊤} (X_{j} - x)} I (T_{j} = k) G_{j}}{\sum_{j = 1}^{n} K_{h} {{({\hat{α}}^{*}, {\hat{β}}_{k}^{*})}^{⊤} (X_{j} - x)} G_{j}}, for k = 0, 1.

Lastly the perturbed estimator is given by ${\hat{Δ}}^{*} = {\hat{μ}}_{1}^{*} - {\hat{μ}}_{0}^{*}$ where:

{\hat{μ}}_{k}^{*} = \left\{ \sum_{i = 1}^{n} \frac{I (T_{i} = k)}{{\hat{π}}_{k}^{*} (X_{i}; {\hat{θ}}_{k}^{*})} G_{i} \right\}^{- 1} \left\{ \sum_{i = 1}^{n} \frac{I (T_{i} = k) Y_{i}}{{\hat{π}}_{k}^{*} (X_{i}; {\hat{θ}}_{k}^{*})} G_{i} \right\}, for k = 0, 1.

It can be shown based on arguments in Jin et al. (2001) that the asymptotic distribution of $n^{1 / 2} (\hat{Δ} - \bar{Δ})$ coincides with that of $n^{1 / 2} ({\hat{Δ}}^{*} - \hat{Δ}) | D$ . We can thus approximate the SE of $\hat{Δ}$ based on the empirical standard deviation or, as a robust alternative, the mean absolute deviations (MAD) of resamples ${\hat{Δ}}^{*}$ and construct CI’s using percentiles of resamples.
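The resampling loop can be sketched as follows, with a deliberate simplification: only the final weighting layer is perturbed and the PS is held fixed, whereas the procedure described above also re-fits the adaptive LASSO and the kernel smoother with the same weights in every resample. Exp(1) weights are one choice that is non-negative with unit mean and unit variance; the function names and the toy data are hypothetical.

```python
import random
import statistics

def perturbed_ipw_mean(y, t, ps_k, k, g):
    """Normalized IPW mean for arm k with each subject's weight multiplied by g[i]."""
    w = [g_i / p if t_i == k else 0.0 for t_i, p, g_i in zip(t, ps_k, g)]
    return sum(w_i * y_i for w_i, y_i in zip(w, y)) / sum(w)

def perturbation_se(y, t, ps1, ps0, n_resamples=200, seed=0):
    """SE as the empirical SD of perturbed estimates (PS held fixed for brevity).

    Simplification: a full implementation would recompute the penalized fits
    and the DiPS with the same weights g inside the loop.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_resamples):
        g = [rng.expovariate(1.0) for _ in y]  # Exp(1): non-negative, mean 1, variance 1
        mu1 = perturbed_ipw_mean(y, t, ps1, 1, g)
        mu0 = perturbed_ipw_mean(y, t, ps0, 0, g)
        draws.append(mu1 - mu0)
    return statistics.stdev(draws), draws

# Hypothetical data with a constant PS of 0.5 in both arms.
y = [float(i % 7) for i in range(60)]
t = [i % 2 for i in range(60)]
se, draws = perturbation_se(y, t, [0.5] * 60, [0.5] * 60)
```

Setting every G_i to one recovers the unperturbed estimator, so the resamples are centred near the original estimate.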

5. Numerical Studies

5.1. Simulation Study

We performed extensive simulations to assess the finite sample bias and relative efficiency (RE) of $\hat{Δ}$ (DiPS) compared to alternative estimators. We also assessed the performance of the perturbation procedure. Throughout, in implementing the adaptive LASSO, we used ridge regression for the initial estimators ${\tilde{α}}_{j}$ and ${\tilde{β}}_{j}$ , where the ridge tuning parameter was chosen by minimizing the Akaike information criterion (AIC). The adaptive LASSO tuning parameter was chosen by an extended regularized information criterion (Hui et al., 2015), which exhibited relatively good performance for variable selection. We refitted models with selected covariates to reduce bias, as suggested in Hui et al. (2015). The power parameter γ was set as $⌈ \frac{2 ν}{1 - ν} ⌉ + 1$ , where ν = log(p)/log(n). A Gaussian product kernel of order q = 4 with a plug-in bandwidth at the optimal order (see Discussion) was used for smoothing. For comparison, we considered alternative standard estimators with nuisances estimated by regularization and recently developed methods for estimating ATE that incorporate variable selection: (1) IPW with π₁(x) estimated by adaptive LASSO (IPW-ALAS), (2) ${\hat{Δ}}_{d r}$ with nuisances estimated by adaptive LASSO (DR-ALAS), (3) a modification of ${\hat{Δ}}_{d r}$ in which π₁(x) and μ_k(x) are estimated by separate one-dimensional kernel smoothing of $T ~ {\hat{α}}^{⊤} X$ and $Y ~ {\hat{β}}_{k}^{⊤} X$ among those assigned to T = k, for k = 0, 1, to allow for estimation of single index models (SIM) for π₁(x) and μ_k(x) (DR-SIM), (4) outcome-adaptive LASSO (OAL) (Shortreed and Ertefaie, 2017), (5) Group Lasso and Doubly Robust Estimation (GLiDeR) (Koch et al., 2018), and (6) the model averaged doubly-robust estimator (MADR) (Cefalu et al., 2017). OAL and GLiDeR were implemented with default settings from code provided in the Supplementary Materials of the respective papers. MADR was implemented using the madr package with M = 500 Markov chain Monte Carlo (MCMC) iterations to reduce the computational burden. Throughout the numerical studies, we specified g_π(u) = 1/(1 + e^−u) for $M_{π}$ and g_μ(u) = u with β₀ = β₁ for $M_{μ}$ as the working models.
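
To make the choice of the power parameter concrete, the following sketch evaluates γ = ⌈2ν/(1 − ν)⌉ + 1 with ν = log(p)/log(n) for two of the simulation settings. The function name `adaptive_lasso_power` is a hypothetical helper introduced only for illustration.

```python
import math

def adaptive_lasso_power(p, n):
    """Power parameter gamma = ceil(2*nu / (1 - nu)) + 1, nu = log(p)/log(n)."""
    nu = math.log(p) / math.log(n)
    if not 0 <= nu < 1:
        raise ValueError("the formula requires 0 <= log(p)/log(n) < 1")
    return math.ceil(2 * nu / (1 - nu)) + 1

# Settings taken from the simulation design (p = 15 with n = 500 or 5,000).
gamma_small_n = adaptive_lasso_power(15, 500)
gamma_large_n = adaptive_lasso_power(15, 5000)
```

Because ν grows with p for fixed n, the resulting γ also grows, which is the source of the instability noted for large p in the Discussion.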

The covariates were generated to approximate the distribution of the covariates from the statins EMR data in Section 5.2. This was done to allow for non-elliptically distributed covariates that mimic the distribution of a real dataset. Initially we generated $\tilde{X} ~ N (\tilde{μ}, \tilde{Σ})$ where $\tilde{μ}$ and $\tilde{Σ}$ were the empirical mean and covariance matrix of the 15 covariates, which included 9 binary, 3 continuous, and 3 log-transformed count variables. For binary variables we thresholded the corresponding components of $\tilde{X}$ so that their means matched those in $\tilde{μ}$ , as in $I \{ {\tilde{σ}}_{j}^{- 1} ({\tilde{X}}_{j} - {\tilde{μ}}_{j}) > Φ^{- 1} (1 - {\tilde{μ}}_{j}) \}$ , where ${\tilde{σ}}_{j}^{2}$ and ${\tilde{μ}}_{j}$ are the empirical variance and mean of the j-th covariate and Φ(·) is the standard normal cumulative distribution function (CDF). Lastly, we centered and standardized to obtain the final covariates $X = diag ({\tilde{Σ}}^{- 1 / 2}) (\tilde{X} - \tilde{μ})$ . The pairwise correlations of X were generally low, mostly ranging between −.2 and .2 (full correlation matrix reported in Web Appendix C). For settings with p > 15, we generated independent groups of the 15 covariates that maintained the correlation structure within each group.
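
The thresholding step for binary covariates can be illustrated in isolation. The sketch below assumes a single binary covariate with a made-up prevalence of 0.3 (the actual EMR means are not reported here) and checks that the thresholded indicator reproduces that prevalence on average; `threshold_binary` is a hypothetical helper name.

```python
import random
from statistics import NormalDist

def threshold_binary(x_tilde, mu, sigma):
    """Indicator I{(x_tilde - mu) / sigma > Phi^{-1}(1 - mu)}."""
    cutoff = NormalDist().inv_cdf(1 - mu)
    return 1 if (x_tilde - mu) / sigma > cutoff else 0

rng = random.Random(0)
mu_j = 0.3                              # assumed prevalence, for illustration
sigma_j = (mu_j * (1 - mu_j)) ** 0.5    # SD of a Bernoulli(mu_j) variable
draws = [threshold_binary(rng.gauss(mu_j, sigma_j), mu_j, sigma_j)
         for _ in range(20000)]
share_of_ones = sum(draws) / len(draws)  # should be close to mu_j
```

Since the standardized normal exceeds $Φ^{- 1} (1 - {\tilde{μ}}_{j})$ with probability ${\tilde{μ}}_{j}$ , the indicator has the intended mean regardless of ${\tilde{σ}}_{j}$ .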

We subsequently focused on a continuous outcome, generating the data according to T | X ~ Ber{π₁(X)} and Y | X, T ~ N{μ_T (X), 10²}. The simulations varied over scenarios in which the working models were correct or misspecified, with the true π₁(x) and μ_k(x) given by:

Both correct: $π_{1} (x) = g_{π} (.2 + α^{⊤} x)$ , $μ_{k} (x) = k + β^{⊤} x$

Misspecified $μ_{k} (x)$ : $π_{1} (x) = g_{π} (.2 + α^{⊤} x)$ , $μ_{k} (x) = k + β_{[1]}^{⊤} x (1 + β_{[2]}^{⊤} x) + k ζ^{⊤} x$

Misspecified $π_{1} (x)$ : $π_{1} (x) = g_{π} \{ .2 + α_{[1]}^{⊤} x (1 + α_{[2]}^{⊤} x) \}$ , $μ_{k} (x) = k + β^{⊤} x$ ,

where the coefficients are α = .01 · (1, 2, 3, 4, 5, 6, 0₃, 3, 7, 0, 7, −5, 0, 0_{p−15})^⊤, α_[1] = α, α_[2] = (.02, .06, .02, .02, −.1, .02, 0₃, −.14, .1, 0, −.1, .14, 0, 0_{p−15})^⊤, ζ = (0₆, 1, 0₃, 1, 0₂, 1, 0, 0_{p−15})^⊤, β = (0₃, 1, .5, .25, .125, .0625, .03125, 0, 1, .5, 0, .25, .125, 0_{p−15})^⊤, β_[1] = (0₃, .5, 0, .5, 1₃, 0, 1, 2, 0, 1, 2, 0_{p−15})^⊤, β_[2] = (0₃, −1.5, .75, −1.5, 0₃, 0, −1.5, −.75, 0, 1.5, .75, 0_{p−15})^⊤, and a_m denotes a 1 × m vector that has all its elements as a. For the misspecified scenarios, either μ_k(x) or π₁(x) is a double-index model that includes both linear terms in x and quadratic and two-way interaction terms among x that are omitted by linear working models. In the misspecified μ_k(x) case, the second index $β_{[2]}^{⊤} x$ has some correlation with the PS index $α^{⊤} x$ , modeling a situation in which there exist common latent factors not fully captured by a linear outcome model. The outcome model also includes an interaction term between x and treatment to allow for treatment effect heterogeneity. The parameters are set such that there are 5 covariates belonging to each of $A_{π} \cap A_{μ}$ (i.e. confounders), $A_{π} \cap A_{μ}^{c}$ (instruments), and $A_{π}^{c} \cap A_{μ}$ (pure prognostic) when p = 15. The simulations were run for R = 1,000 repetitions.
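
The structure of the "both correct" design can be sketched in a few lines. The code below writes out α and β for p = 15, recovers the three covariate groups from their supports, and draws one (T, Y) pair. Two assumptions are made for illustration only: the covariate vector is replaced by independent standard normal draws rather than the EMR-based covariates, and 10² in N{μ_T (X), 10²} is read as the variance (standard deviation 10).

```python
import math
import random

# Coefficients for p = 15, expanded from the compact notation in the text.
alpha = [0.01 * a for a in (1, 2, 3, 4, 5, 6, 0, 0, 0, 3, 7, 0, 7, -5, 0)]
beta = [0, 0, 0, 1, .5, .25, .125, .0625, .03125, 0, 1, .5, 0, .25, .125]

# Supports (1-based covariate indices) of the PS and outcome models.
support_pi = {j for j, a in enumerate(alpha, start=1) if a != 0}
support_mu = {j for j, b in enumerate(beta, start=1) if b != 0}
confounders = support_pi & support_mu   # in both models
instruments = support_pi - support_mu   # PS model only
prognostic = support_mu - support_pi    # outcome model only

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def draw_treatment_and_outcome(x, rng):
    """One (T, Y) draw under the 'both correct' scenario."""
    pi1 = expit(0.2 + dot(alpha, x))
    t = 1 if rng.random() < pi1 else 0
    y = rng.gauss(t + dot(beta, x), 10.0)  # assumption: SD 10 (variance 10^2)
    return t, y

rng = random.Random(1)
x_placeholder = [rng.gauss(0.0, 1.0) for _ in range(15)]  # stand-in covariates
t_draw, y_draw = draw_treatment_and_outcome(x_placeholder, rng)
```

The three index sets each contain five covariates, matching the description of confounders, instruments, and pure prognostic covariates.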

Table 1 presents the bias and root mean square error (RMSE) for n = 500 and 5,000 when p = 15. Among the three scenarios considered, the bias for DiPS is small relative to the RMSE and generally diminishes towards zero as n increases, consistent with its double-robustness. There remains some minor bias that persists when n = 5,000 for DiPS that is likely a result of bias from the smoothing, as DR-SIM also incurs similar residual bias. IPW-ALAS and OAL are singly-robust and their bias does not necessarily diminish under the misspecified π₁(x) scenario, although their bias is also minor in the setting considered. MADR exhibited substantial bias under the misspecified μ_k(x) scenario that persisted in large samples, possibly due to selecting out confounders with weak outcome associations in its emphasis on selection of prognostic covariates. The results for bias for p = 30 and 60 exhibited similar patterns.

Table 1.

Bias and RMSE of estimators by n and model specification scenario for p = 15.

		Both Correct		Misspecified μ_k(x)		Misspecified π₁(x)	
Size	Estimator	Bias	RMSE	Bias	RMSE	Bias	RMSE
n=500	IPW-ALAS	0.029	0.350	0.074	1.754	0.023	0.294
n=500	DR-ALAS	0.002	0.330	0.029	1.684	−0.001	0.285
n=500	DR-SIM	−0.021	0.315	0.127	1.495	0.013	0.287
n=500	OAL	0.008	0.321	0.074	1.484	0.001	0.284
n=500	GLiDeR	0.001	0.299	0.087	1.238	0.006	0.282
n=500	MADR	0.022	0.300	0.172	1.247	0.008	0.282
n=500	DiPS	−0.017	0.319	0.101	1.193	0.013	0.293
n=5,000	IPW-ALAS	0.001	0.111	−0.002	0.588	0.033	0.108
n=5,000	DR-ALAS	−0.003	0.106	−0.014	0.564	−0.008	0.089
n=5,000	DR-SIM	−0.012	0.103	0.029	0.516	−0.004	0.089
n=5,000	OAL	−0.002	0.105	0.000	0.527	−0.007	0.089
n=5,000	GLiDeR	−0.001	0.098	0.034	0.413	−0.006	0.088
n=5,000	MADR	0.000	0.099	0.124	0.418	−0.008	0.089
n=5,000	DiPS	−0.016	0.106	0.041	0.349	−0.003	0.091

Figure 1 presents the RE under the different scenarios for n = 500 and 5,000 and p = 15, 30, and 60. RE was defined as the ratio of the mean square error (MSE) for DR-ALAS relative to that of each estimator, with RE > 1 indicating greater efficiency compared to DR-ALAS. Under the “both correct” scenario many of the estimators exhibit similar efficiency, which can be expected since many are variants of the usual DR estimator and reach the semiparametric efficiency bound. When n = 500 and p = 60, the differences are slightly greater, with GLiDeR and MADR leading in efficiency gains, possibly due to differences in the variable selection performance. These differences in efficiency appear to diminish when the sample size is increased to n = 5,000 with p = 60. The results are similar in the “misspecified π₁(x)” scenario, where most estimators exhibited similar efficiency.
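
As a rough illustration of the RE definition, the sketch below recomputes RE from the rounded RMSE values reported in Table 1 for the misspecified μ_k(x) scenario with n = 5,000 and p = 15. The figure is presumably based on unrounded MSEs, so these values are only approximate; the function name `relative_efficiency` is an illustrative assumption.

```python
# Rounded RMSEs from Table 1 (misspecified mu_k(x), n = 5,000, p = 15).
rmse_table1 = {
    "IPW-ALAS": 0.588, "DR-ALAS": 0.564, "DR-SIM": 0.516, "OAL": 0.527,
    "GLiDeR": 0.413, "MADR": 0.418, "DiPS": 0.349,
}

def relative_efficiency(rmse_by_estimator, reference="DR-ALAS"):
    """RE = MSE(reference) / MSE(estimator), with MSE = RMSE squared."""
    ref_mse = rmse_by_estimator[reference] ** 2
    return {name: ref_mse / rmse ** 2
            for name, rmse in rmse_by_estimator.items()}

re_vs_dr_alas = relative_efficiency(rmse_table1)

# The same function gives pairwise comparisons by changing the reference.
re_vs_glider = relative_efficiency(rmse_table1, reference="GLiDeR")
```

Values above 1 indicate an estimator with smaller MSE than the chosen reference.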

In the “misspecified μ_k(x)” scenario, DiPS achieves over 40% efficiency gain compared to GLiDeR and MADR and over 100% compared to DR-SIM in the large sample setting when n = 5,000 and p = 15, as implied by the RMSEs in Table 1. This suggests that the efficiency gains expected under misspecified outcome models from the results of Section 3.2 can be substantial. Even if π₁(x) and μ_k(x) are estimated under a SIM, there are still gains from DiPS when the PS direction ${\bar{α}}^{⊤} X$ is informative of the mean outcome beyond ${\bar{β}}_{k}^{⊤} X$ . These gains diminish when p is larger relative to n, possibly due to imperfect variable selection. Again GLiDeR and MADR achieve the highest efficiency when n = 500 and p = 60, notwithstanding the substantial bias of MADR. Thus the performance of DiPS using adaptive LASSO can be somewhat compromised when p is very large relative to n and the variable selection performance is sub-optimal.

Table 2 presents the performance of perturbation for DiPS when p = 15 and 30 under correct working models. SEs for DiPS were estimated using the MAD. The empirical SEs (Emp SE), calculated from the sample standard deviations of $\hat{Δ}$ over the simulation repetitions, were generally similar to the average of the SE estimates over the repetitions (ASE), despite overestimation of roughly 2–15% relative to the Emp SE. The coverage of the percentile CIs (Cover) was close to the nominal 95% level but tended to be somewhat conservative.

Table 2.

Perturbation performance under correctly specified models. Emp SE: empirical standard error over simulations, ASE: average of standard error estimates based on MAD over perturbations, Cover: Coverage of 95% percentile intervals.

p	n	Emp SE	ASE	Cover
15	500	0.350	0.362	0.966
15	2500	0.151	0.167	0.970
15	5000	0.108	0.119	0.965
30	500	0.348	0.356	0.961
30	2500	0.150	0.167	0.975
30	5000	0.103	0.119	0.973
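
The degree of SE overestimation can be checked directly from the rounded entries of Table 2. This is only a small arithmetic sketch; the variable names are illustrative.

```python
# (p, n, Emp SE, ASE) rows copied from Table 2.
table2_rows = [
    (15, 500, 0.350, 0.362),
    (15, 2500, 0.151, 0.167),
    (15, 5000, 0.108, 0.119),
    (30, 500, 0.348, 0.356),
    (30, 2500, 0.150, 0.167),
    (30, 5000, 0.103, 0.119),
]

# Relative overestimation of the SE: (ASE - Emp SE) / Emp SE.
overestimation = {
    (p, n): (ase - emp_se) / emp_se for p, n, emp_se, ase in table2_rows
}
smallest = min(overestimation.values())
largest = max(overestimation.values())
```

With the rounded table entries the relative overestimation falls between about 2% and 16%, in line with the range quoted in the text.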

5.2. Data Example: Effect of Statins on Colorectal Cancer Risk in EMRs

We applied DiPS to assess the effect of statins, a medication for lowering cholesterol levels, on the risk of colorectal cancer (CRC) among patients with inflammatory bowel disease (IBD) identified using data from EMRs of Partners Healthcare. Previous studies have suggested that statins have a protective effect on CRC, but few studies have considered the effect specifically among IBD patients. The EMR cohort consisted of n = 10,817 IBD patients, including 1,375 statin users. CRC status and statin use were ascertained by the presence of ICD9 diagnosis and prescription codes. We adjusted for p = 15 covariates as potential confounders, including age, gender, race, smoking status, indication of elevated inflammatory markers, examination with colonoscopy, use of biologics and immunomodulators, subtypes of IBD, disease duration, and presence of primary sclerosing cholangitis (PSC).

For the working model $M_{μ}$ , we specified g_μ(u) = 1/(1 + e^−u) to accommodate the binary outcome. SEs for other estimators were obtained from the MAD over bootstrap resamples. CIs were calculated from percentile intervals. We also calculated a two-sided p-value from a Wald test for the null that statins have no effect, using the point and SE estimates for each estimator. The unadjusted estimate (None) based on the difference in means by statin use was also calculated as a reference. The left side of Table 3 shows that, without adjustment, the naive risk difference is estimated to be −0.8% with a SE of 0.4%. The other methods estimated that statins had a protective effect ranging from around −1% to −3% after adjustment for covariates. DiPS and DR-SIM were the most efficient estimators, with DiPS achieving an estimated variance that was 34% to 61% lower than that of other estimators.
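
The two-sided Wald p-value can be sketched as follows, assuming the usual normal reference distribution for the ratio of the point estimate to its SE. The function name `wald_p_value` is illustrative, and the inputs are the rounded unadjusted estimate and SE quoted above, so the result only approximates the tabulated p-value.

```python
import math

def wald_p_value(estimate, se):
    """Two-sided p-value 2 * (1 - Phi(|estimate / se|)) for H0: no effect."""
    z = estimate / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Unadjusted risk difference and SE as quoted in the text (rounded values).
p_unadjusted = wald_p_value(-0.008, 0.004)
```

Small differences from the reported p-values are expected because the published estimates and SEs are rounded to three decimals.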

Table 3.

Data example on the effect of statins on CRC risk in EMR data and the effect of smoking on logCRP in FOS data. Est: Point estimate, SE: estimated SE, 95% CI: confidence interval, p-val: p-value from Wald test of no effect.

	IBD EMR Study				FOS			
Method	Est	SE	95% CI	p-val	Est	SE	95% CI	p-val
None	−0.008	0.004	(−0.017, 0.000)	0.047	0.180	0.058	(0.065, 0.298)	0.002
IPW-ALAS	−0.022	0.004	(−0.031, −0.015)	<0.001	0.182	0.063	(0.053, 0.307)	0.004
DR-ALAS	−0.020	0.005	(−0.029, −0.012)	<0.001	0.140	0.063	(0.031, 0.277)	0.026
DR-SIM	−0.023	0.003	(−0.029, −0.018)	<0.001	0.143	0.057	(0.044, 0.257)	0.013
OAL	−0.008	0.004	(−0.017, 0.000)	0.048	0.175	0.061	(0.062, 0.301)	0.004
GLiDeR	−0.031	0.005	(−0.040, −0.022)	<0.001	0.147	0.058	(0.045, 0.258)	0.012
MADR	−0.030	0.005	(−0.040, −0.021)	<0.001	0.149	0.056	(0.037, 0.258)	0.008
DiPS	−0.024	0.003	(−0.029, −0.017)	<0.001	0.141	0.058	(0.039, 0.276)	0.015

5.3. Data Example: Framingham Offspring Study

The Framingham Offspring Study (FOS) is a cohort study initiated in 1971 that enrolled 5,124 adult children and spouses of the original Framingham Heart Study. The study collected data over time on participants’ medical history, physician examination, and laboratory tests to examine epidemiological and genetic risk factors of cardiovascular disease (CVD). A subset of the FOS participants also have their genotype from the Affymetrix 500K single-nucleotide polymorphism (SNP) array available through the Framingham SNP Health Association Resource (SHARe) on dbGaP. We assessed the effect of smoking on C-reactive protein (CRP) levels, an inflammation marker highly predictive of CVD risk, while adjusting for potential confounders including gender, age, diabetes status, use of hypertensive medication, systolic and diastolic blood pressure measurements, and HDL and total cholesterol measurements, as well as a large number of SNPs in gene regions previously reported to be associated with inflammation or obesity. While the inflammation-related SNPs are not likely to impact smoking, we include them as prognostic covariates for efficiency. The analysis includes n = 1,892 individuals with available information on CRP and the p = 121 covariates, of which 113 were SNPs.

Since CRP is heavily skewed, we applied a log transformation so that the linear regression model in $M_{μ}$ better fits the data. SEs, CIs, and p-values were calculated in the same way as above. The right side of Table 3 shows that different methods agree that smoking significantly increases logCRP. In general, point estimates tended to attenuate after adjusting for covariates since smokers are likely to have other characteristics that increase inflammation. DiPS, DR-SIM, and MADR were among the most efficient, though efficiency gains are tempered in this setting with larger p relative to n.

6. Discussion

In this paper we developed a novel IPW estimator for the ATE that accommodates data-driven variable selection through regularized regression. The estimator retains double-robustness and is locally semiparametric efficient when ν = 0. By calibrating the initial PS through smoothing, additional gains in efficiency can potentially be achieved in large samples under misspecification of the working outcome model.

In numerical studies, we used the extended regularized information criterion (Hui et al., 2015) to tune adaptive LASSO, which maintains selection consistency when log(p)/log(n) → ν, for ν ∈ [0, 1). Other criteria such as cross-validation can also be used and may exhibit better performance in some cases. A suitable bandwidth h must be selected such that the dominating errors in the influence function, which are of order $O_{p} (n^{1 / 2} h^{q} + n^{- 1 / 2} h^{- 2})$ , converge to 0. This is satisfied for $h = O (n^{- α})$ for $α \in (\frac{1}{2 q}, \frac{1}{4})$ . The optimal bandwidth h* is one that balances these bias and variance terms and is of order $h^{*} = O (n^{- 1 / (q + 2)})$ . In practice we use a plug-in estimator ${\hat{h}}^{*} = \hat{σ} n^{- 1 / (q + 2)}$ , where $\hat{σ}$ is the sample standard deviation of either ${\hat{α}}^{⊤} X_{i}$ or ${\hat{β}}_{k}^{⊤} X_{i}$ , possibly after applying a monotonic transformation. Cross-validation can also be used to select the smoothing bandwidth.
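
A minimal sketch of the plug-in bandwidth and the admissible rate condition is given below. The function names and the toy index values are assumptions made for illustration; in an actual analysis the inputs would be the fitted linear predictors ${\hat{α}}^{⊤} X_{i}$ or ${\hat{β}}_{k}^{⊤} X_{i}$ .

```python
import math
import statistics

def plug_in_bandwidth(index_values, q=4):
    """Plug-in bandwidth sigma_hat * n^(-1/(q+2)) for one linear predictor."""
    n = len(index_values)
    sigma_hat = statistics.stdev(index_values)
    return sigma_hat * n ** (-1.0 / (q + 2))

def rate_is_admissible(rate, q=4):
    """Check that h = O(n^-rate) satisfies 1/(2q) < rate < 1/4."""
    return 1.0 / (2 * q) < rate < 0.25

# Toy linear predictor values (hypothetical), n = 64.
toy_index = [0.0, 2.0] * 32
h_toy = plug_in_bandwidth(toy_index)

# The optimal rate 1/(q+2) lies inside the admissible range when q = 4.
optimal_rate_ok = rate_is_admissible(1.0 / (4 + 2))
```

For q = 4 the admissible range is (1/8, 1/4) and the optimal rate is 1/6, so a kernel of order higher than 2 is what makes the range non-empty.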

The adaptive LASSO estimators $\hat{α}$ and ${\hat{β}}_{k}$ are not uniformly root-n consistent when the penalty is tuned to achieve consistent model selection (Pötscher and Schneider, 2009), and their oracle properties derived under fixed parameter asymptotics may fail to capture essential features of finite-sample distributions. For example, they are not root-n consistent when the true parameters are of order $O (n^{- 1 / 2})$ , that is, when the true signals are relatively weak. The importance of uniform inference has also been recently highlighted for treatment effect estimation in high-dimensional settings (Belloni et al., 2013; Farrell, 2015). It would be of interest to consider alternative variable selection approaches beyond those grounded in oracle properties to achieve uniform inference. Another limitation of relying on adaptive LASSO is that when p is large so that ν is large, a large power parameter γ would be required to maintain the oracle properties, leading to an unstable penalty and poor finite sample performance. It would be of interest to consider modifications of the proposed procedure to accommodate high-dimensional settings with p ≫ n and more general sparsity assumptions in future work.

Supplementary Material

Web Appendices

NIHMS1592926-supplement-Web_Appendices.pdf^{(319.2KB, pdf)}

Code

NIHMS1592926-supplement-Code.zip^{(3.1KB, zip)}

Acknowledgements

The authors would like to thank the editor, associate editor, and two referees for their insightful feedback and suggestions. Most of this work was done when the first author was a graduate student at Harvard University. This work was supported by National Institutes of Health grants T32CA009337 and R01HL089778. The views expressed in this article are those of the authors and do not necessarily reflect the views of the Department of Veterans Affairs.

Footnotes

Supporting Information

Web Appendices referenced in Sections 2, 3, and 5 as well as the R code implementing the procedure are available with this paper at the Biometrics website on Wiley Online Library.

References

Belloni A, Chernozhukov V, Fernández-Val I, and Hansen C (2017). Program evaluation and causal inference with high-dimensional data. Econometrica 85, 233–298.
Belloni A, Chernozhukov V, and Hansen C (2013). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies 81, 608–650.
Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, and Stürmer T (2006). Variable selection for propensity score models. American Journal of Epidemiology 163, 1149–1156.
Cefalu M, Dominici F, Arvold N, and Parmigiani G (2017). Model averaged double robust estimation. Biometrics 73, 410–421.
Crump RK, Hotz VJ, Imbens GW, and Mitnik OA (2009). Dealing with limited overlap in estimation of average treatment effects. Biometrika 96, 187–199.
De Luna X, Waernbaum I, and Richardson TS (2011). Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika 98, 861–875.
Farrell MH (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics 189, 1–23.
Hahn J (2004). Functional restriction and efficiency in causal inference. The Review of Economics and Statistics 86, 73–76.
Hansen BB (2008). The prognostic analogue of the propensity score. Biometrika 95, 481–488.
Hu Z, Follmann DA, and Qin J (2012). Semiparametric double balancing score estimation for incomplete data with ignorable missingness. Journal of the American Statistical Association 107, 247–257.
Hui FK, Warton DI, and Foster SD (2015). Tuning parameter selection for the adaptive lasso using ERIC. Journal of the American Statistical Association 110, 262–269.
Jin Z, Ying Z, and Wei L-J (2001). A simple resampling method by perturbing the minimand. Biometrika 88, 381–390.
Koch B, Vock DM, and Wolfson J (2018). Covariate selection with group lasso and doubly robust estimation of causal effects. Biometrics 74, 8–17.
Li K-C and Duan N (1989). Regression analysis under link violation. The Annals of Statistics 17, 1009–1052.
Lu W, Goldberg Y, and Fine J (2012). On the robustness of the adaptive lasso to model misspecification. Biometrika 99, 717–731.
Lunceford JK and Davidian M (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects. Statistics in Medicine 23, 2937–2960.
Newey WK and McFadden D (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics 4, 2111–2245.
Pötscher BM and Schneider U (2009). On the distribution of the adaptive lasso estimator. Journal of Statistical Planning and Inference 139, 2775–2790.
Robins JM, Rotnitzky A, and Zhao LP (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.
Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, and Brookhart MA (2009). High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology 20, 512–522.
Shortreed SM and Ertefaie A (2017). Outcome-adaptive lasso: Variable selection for causal inference. Biometrics 73, 1111–1122.
Tsiatis A (2007). Semiparametric theory and missing data. Springer.
van der Laan MJ and Gruber S (2010). Collaborative double robust targeted maximum likelihood estimation. The International Journal of Biostatistics 6, Article 17.
Wand MP, Marron JS, and Ruppert D (1991). Transformations in density estimation. Journal of the American Statistical Association 86, 343–353.
Zou H (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
Zou H and Zhang HH (2009). On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics 37, 1733–1751.
